National Sun Yat-sen University, thesis/dissertation, The study of data preprocessing difference to impact the botnet detection performance

Title	資料前處理差異對殭屍網路偵測效能影響之研究 (The study of data preprocessing difference to impact the botnet detection performance)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 104	Language	Chinese
Degree	Ph.D.	Number of pages	74
Author	王則堯 (Tse-yao Wang)
Advisor	陳嘉玫 (Chen-chia Mei)
Convenor	魏志平 (Chih-Ping Wei)
Advisory Committee	賴谷鑫 (Gu Hsin Lai); 林耕霈 (Keng-Pei Lin); 王平 (Ping Wang)
Date of Exam	2016-07-04	Date of Submission	2016-07-14
Keywords	data transformation, machine learning, botnet detection, Rough Set Theory, feature selection
Statistics	The thesis/dissertation has been browsed 5863 times and downloaded 0 times.

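Chinese Abstract (English translation)
Many studies have proposed strategies that use machine learning to detect botnet C&C communication traffic, with considerable success. When machine learning is used for data analysis, the input data must be thoroughly preprocessed to support the subsequent computation and analysis. Improper preprocessing affects the final detection performance. Research on botnet-traffic-based detection still lacks general guidelines for data transformation. This study proposes four encoding rules and designs experiments that use HTTP-based botnet C&C communication traffic as detection samples to explore the optimal encoding principles, with Rough Set, Support Vector Machine (SVM), and Naïve Bayes as the experimental classifiers. The initial experiment uses the Las Vegas Filter algorithm and the Rough Set algorithm for feature selection and examines how the encoding rules affect feature selection. Subsequent experiments compare the effect of applying feature selection on detection performance, and analysis of the experimental data yields conclusions on the optimal encoding rules and on design guidelines. The combined analysis of the experimental data leads to several findings. First, the Empty and NULL states, and what each means in the data, should be carefully distinguished to reduce confusion in data encoding that would otherwise affect the system's detection results. Second, when raw data is encoded, a design that does not emphasize minor differences between records, and that aggregates content with the same properties and assigns the same code to the same category, gives better classification results with machine learning classifiers. Finally, the study confirms that when Rough Set is applied to botnet traffic datasets, its feature selection effectively removes redundant data and reduces the dataset, which helps improve both time efficiency and detection accuracy.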
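As an illustration of the first two findings, the sketch below encodes a single HTTP header field so that an absent value (NULL) and an empty string (Empty) receive different codes, while values that differ only in minor details share one category code. The choice of field (User-Agent), the function name `encode_user_agent`, the category names, and the integer codes are all hypothetical. This record does not list the thesis's four encoding rules, so the sketch should not be read as one of them.

```python
from typing import Optional

# Hypothetical codes chosen for illustration only.
NULL_CODE = 0    # header absent from the request
EMPTY_CODE = 1   # header present but with an empty value

# Hypothetical coarse categories for a User-Agent header.
UA_CATEGORIES = {"browser": 2, "tool": 3, "other": 4}


def encode_user_agent(value: Optional[str]) -> int:
    """Map a raw User-Agent value to a coarse integer code."""
    if value is None:
        # NULL: the field was never sent, which is kept distinct from Empty.
        return NULL_CODE
    text = value.strip().lower()
    if text == "":
        # Empty: the field was sent but carries no content.
        return EMPTY_CODE
    # Aggregate values that differ only in version or platform details.
    if text.startswith(("mozilla/", "opera/")):
        return UA_CATEGORIES["browser"]
    if text.startswith(("curl/", "wget/", "python-requests/")):
        return UA_CATEGORIES["tool"]
    return UA_CATEGORIES["other"]
```

With this mapping, `Mozilla/5.0 (Windows NT 10.0)` and `Mozilla/4.0 (compatible; MSIE 8.0)` fall into the same category, whereas a missing header and an empty header stay separable for the classifier.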
Abstract
Many studies have used machine learning to detect botnet C&C communication traffic, with considerable success. Machine learning analysis requires thorough preprocessing of the input data to support the subsequent computation, and improper preprocessing affects the final detection performance. Research on botnet-traffic-based detection, however, lacks general guidelines for data transformation. This study proposes four encoding rules and uses Rough Set, Support Vector Machine (SVM), and Naïve Bayes as the experimental classifiers. The initial experiment uses Rough Set and the Las Vegas Filter as feature selection algorithms to examine which encoding rules work best under feature selection. Building on those results, subsequent experiments compare the effect of feature selection on detection performance, and analysis of the experimental data yields conclusions on encoding rules and design guidelines. The study has two important findings. First, carefully distinguishing Empty from NULL, and what each means in the data, avoids confusion in data encoding and its negative effect on the system's detection results. Second, minor differences in data content should be ignored so that machine learning detection models find stronger correlations among similar events. In addition, the experiments verify that Rough Set feature selection is effective: it eliminates redundant data, shortens analysis time, and improves detection accuracy.
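The role of Rough Set feature selection can be sketched with the standard dependency degree, that is, the fraction of records whose condition-attribute values determine the class unambiguously, combined with a greedy forward search for a small attribute subset. The greedy search is a common heuristic used here for illustration. This record does not say which reduct algorithm the thesis uses, and the function names and the attribute names in the usage note are made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

Row = Dict[str, int]


def dependency(rows: Sequence[Row], attrs: Sequence[str], decision: str) -> float:
    """Rough-set dependency degree of `decision` on `attrs`.

    Rows are grouped into equivalence classes by their values on `attrs`;
    a class counts toward the positive region only if all its rows share
    one decision value.
    """
    if not rows:
        return 0.0
    classes = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in attrs)
        classes[key].append(row[decision])
    positive = sum(len(v) for v in classes.values() if len(set(v)) == 1)
    return positive / len(rows)


def greedy_reduct(rows: Sequence[Row], conditions: Sequence[str], decision: str) -> List[str]:
    """Greedily add attributes until the subset matches the full set's dependency."""
    target = dependency(rows, conditions, decision)
    selected: List[str] = []
    current = dependency(rows, selected, decision)
    while current < target:
        remaining = [a for a in conditions if a not in selected]
        # Pick the attribute that raises the dependency degree the most.
        best = max(remaining, key=lambda a: dependency(rows, selected + [a], decision))
        selected.append(best)
        current = dependency(rows, selected, decision)
    return selected
```

For example, given encoded records with hypothetical attributes `ua`, `len`, and `dup` and a class label `bot`, `greedy_reduct(rows, ["ua", "len", "dup"], "bot")` returns the attributes that suffice to separate the classes as well as the full set does, and the remaining ones can be dropped as redundant.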

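Table of Contents
Chapter 1 Introduction
Chapter 2 Literature Review
  Section 1 Botnets and Their Types
  Section 2 Botnet Detection Strategies
  Section 3 Related Work on Botnet Detection
  Section 4 Data Preprocessing
  Section 5 Machine Learning and Feature Selection
  Section 6 Rough Set Theory
  Section 7 Support Vector Machine
  Section 8 Naïve Bayes Classifier
Chapter 3 Research Method and Experimental Design
  Section 1 Encoding Rules
  Section 2 Experimental Design
  Section 3 Experimental Procedure
Chapter 4 Experimental Results and Analysis
  Section 1 Data Sources and Data Preprocessing
  Section 2 Experimental Specifications
  Section 3 Results and Analysis
    I. Experiment 1: Description and Analysis of Results
    II. Experiment 2: Description and Analysis of Results
    III. Experiment 3: Description and Analysis of Results
Chapter 5 Conclusions and Future Work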

Fulltext
The electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined release date. Availability: Campus: not available; Off-campus: not available.
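Printed copies
Public-access information for printed copies is relatively complete from academic year 102 onward. For printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available.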

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS