National Sun Yat-sen University, thesis/dissertation: Applying Convolutional Neural Network for Malware Detection

Title	Applying Convolutional Neural Network for Malware Detection (應用卷積神經網路於惡意程式偵測)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 106	Language	Chinese
Degree	Ph.D.	Number of pages	51
Author	Shi-Hao Wang (王士豪)
Advisor	Chen Chia-Mei (陳嘉玫)
Convenor	D. J. Guan (官大智)
Advisory Committee	Po-Ching Lin (林柏青); Yu-Chen Hu (胡育誠); Gu-Hsin Lai (賴谷鑫); Liu Yihung (劉譯閎); Hui-Tang Lin (林輝堂)
Date of Exam	2018-07-24	Date of Submission	2018-09-06
Keywords	malware detection, deep learning, Convolutional Neural Networks (CNN), source code analysis, binary code analysis
Statistics	This thesis/dissertation has been browsed 5943 times and downloaded 1 time.

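Chinese Abstract
Malware is a major threat to the use of information. If it is not detected at the outset, it often leads to serious security incidents that damage economic assets and may even endanger the safety of individuals, the nation, and society. However, because malware is both voluminous and diverse, the traditional detection approach of extracting feature values and then comparing similarity requires expert knowledge and experience as well as long, in-depth study, so it is not something that ordinary enterprises or personnel facing an immediate security threat can use. Moreover, captured malware is structurally complex and contains many file types, including source code files, binary files, shell scripts, Perl scripts, documentation files, and configuration files, which makes detection harder and misjudgment more likely. In view of this, this study applies the Convolutional Neural Network (CNN), a deep learning method that has performed very well in image recognition in recent years, to the detection of multi-type malware. In experimental evaluation, the accuracy of predicting whether a file is malicious or benign exceeds 90%. The experiments also show that detection based on deep learning is effective not only for complex source code files and binary files but can also detect malware that has been transformed or embedded in benign files. The proposed method helps information personnel quickly screen suspected malware as soon as it is captured, lets them take protective measures quickly according to the characteristics of the detected malware, and supports the preparation and deployment of prevention and defense against subsequent network attacks.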
Abstract
Malware that is not detected at its very inception can pose a significant threat and cost to the cyber security of individuals, organizations, and even society and the nation. However, the rapid growth in the volume and diversity of malware renders conventional detection techniques based on feature extraction and comparison insufficient, making malware difficult to identify even for well-trained network administrators, let alone regular internet users. The challenge has been exacerbated in recent years as the types and structure of malware have grown dramatically more complex, now including source code, binary files, shell scripts, Perl scripts, instruction files, configuration files, and others. This increased complexity makes misjudgment more likely. To improve the efficiency and accuracy of malware detection given the large volume and multiple types of malware, this dissertation adopts Convolutional Neural Networks (CNN), one of the most successful deep learning techniques. The experiments show an accuracy of over 90% in distinguishing malicious from benign code. They also show that CNN is effective at detecting malware in both source code and binary form, and that it can further identify malware embedded in benign code. This dissertation proposes a feasible solution that lets network administrators efficiently identify malware at the very inception in today's severe network environment, so that information technology personnel can take protective action in a timely manner and prepare for potential follow-up cyber attacks.
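The abstract does not spell out how files of different types are fed to an image-oriented CNN, and the table of contents only names a step that converts file encoding into image form. As a minimal sketch of one common way to do this, assuming each byte is treated as one 8-bit grayscale pixel and rows have a fixed width (the function name `bytes_to_gray_rows`, the width, and the zero padding are illustrative assumptions, not the author's stated method):

```python
from typing import List


def bytes_to_gray_rows(data: bytes, width: int = 16) -> List[List[int]]:
    """Lay out raw bytes as rows of 8-bit grayscale pixel values (0-255).

    Hypothetical helper: the byte-per-pixel mapping and zero padding are
    assumptions made for illustration only.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    # Pad the final row with zeros so that every row has the same length.
    padding = (-len(data)) % width
    padded = data + bytes(padding)
    return [list(padded[i:i + width]) for i in range(0, len(padded), width)]


# Text files (source code, scripts) and binary files are handled the same
# way, because both are read as raw bytes.
sample = b"#!/bin/sh\necho hi\n"
rows = bytes_to_gray_rows(sample, width=8)
```

A representation like this lets source code, scripts, and binaries share a single input format, which is one plausible reason an image classifier can be applied across file types; resizing the rows to the input size a particular network expects would be a separate step.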

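Table of Contents
Thesis Approval Form
Acknowledgments
Chinese Abstract
Abstract
Table of Contents
List of Tables
List of Figures
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Motivation
  1.3 Research Objectives
Chapter 2 Literature Review
  2.1 Malware Detection
    2.1.1 Dynamic Analysis and Static Analysis
    2.1.2 Machine Learning and Malware Detection
  2.2 Convolutional Neural Networks in Deep Learning
    2.2.1 Basic Concepts of Information Transmission in Neural Networks
    2.2.2 Deep Learning and Backpropagation
    2.2.3 Convolutional Neural Networks
  2.3 Deep Learning and Malware Detection
Chapter 3 Research Method
  3.1 System Architecture and Workflow
  3.2 Converting File Encoding into Image Form
  3.3 Deep Learning Malware Detection Method
Chapter 4 Experimental Results and Performance Evaluation
  4.1 System Environment Settings
  4.2 Sample Dataset Settings
  4.3 Experiment 1: Source-Code Malware Detection Model
    4.3.1 Sub-experiment 1-1: Honeypot Malware Source Code Detection Model
    4.3.2 Sub-experiment 1-2: GitHub Malware Source Code Detection Model
    4.3.3 Sub-experiment 1-3: Detection Combining Honeypot and GitHub Malware Source Code Files
  4.4 Experiment 2: Binary Malware Detection Model
  4.5 Experiment 3: Mixed Source-Code and Binary Malware Detection Model
  4.6 Experiment 4: Malware Embedded in Benign Source Code
    4.6.1 Sub-experiment 4-1: Malware Source Code Embedded in Benign Source Code
    4.6.2 Sub-experiment 4-2: Malware Binary Embedded in Benign Source Code
  4.7 Experiment 5: Shuffled Source Code File Structure
  4.8 Experiment 6: Cross-Validation
Chapter 5 Conclusion and Future Work
References

List of Tables
Table 4-1 List of Experiments
Table 4-2 System Environment Specifications and Software Versions
Table 4-3 Attributes and Counts of the Sample Datasets
Table 4-4 Sampling of the Training and Test Datasets
Table 4-5 Example Detection Results for Malware Source Code Captured by the Honeypot
Table 4-6 Example Detection Results for Malware Source Code from GitHub
Table 4-7 Example Detection Results for Benign GNU Source Code
Table 4-8 Accuracy Summary for the Honeypot Malware Source Code Detection Model
Table 4-9 Accuracy Summary for GitHub Source-Code Malware Detection
Table 4-10 Accuracy Summary for the Combined Honeypot and GitHub Malware Source Code Detection Model
Table 4-11 Accuracy Summary for the Binary Malware Detection Model
Table 4-12 Accuracy Summary for the Mixed Source-Code and Binary Malware Detection Model
Table 4-13 Detection Results for Malware Binaries Embedded in Benign Source Code
Table 4-14 Detection Results for Shuffled Source Code File Structure
Table 4-15 Summary of 10-Fold Cross-Validation Results

List of Figures
Figure 2-1 Schematic of Signal Transmission in an Artificial Neuron
Figure 2-2 Artificial Neural Network with a Single Hidden Layer
Figure 3-1 Malware Detection Architecture and Workflow
Figure 3-2 Illustration of Converting Encoding to Images
Figure 3-3 Principle of the Inception V3 Image Recognition Module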

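Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid violating the law. Thesis access permission: user-defined availability. Available: Campus: available; Off-campus: available. etd-0806118-103519.pdf
Printed copies
Public information on printed copies is relatively complete from Academic Year 102 onward. To inquire about printed copies from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available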

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS