National Sun Yat-sen University, thesis/dissertation: Code Classification Based on Structure Similarity

Title	Code Classification Based on Structure Similarity (original Chinese title: 基於結構相似度之原始碼分類研究)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 100 (2011-2012)	Language	Chinese
Degree	Master	Number of pages	57
Author	Chia-hui Yang (楊佳蕙)
Advisor	Chia-mei Chen (陳嘉玫)
Convenor	官大智 (no English name listed)
Advisory Committee	鄭憲宗 (no English name listed)
Date of Exam	2012-07-26	Date of Submission	2012-09-14
Keywords	Malware Classification, Source Code, Static Analysis, Structure Similarity
Statistics	This thesis has been browsed 5828 times and downloaded 382 times.

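Chinese Abstract (English translation)
Faced with increasingly complex malware and its variants, automated malware classification is the most important part of digital forensics. Correct malware classification yields the most complete picture of a malware sample's system behavior and simplifies forensic analysis. Traditional malware classification relies on dynamic analysis after execution, or on static analysis combined with reverse engineering, to obtain information about the malware's system behavior; however, malware uses anti-virtual-machine monitoring and obfuscation techniques to lower classification accuracy. As honeypot systems mature, the amount of malware source code they collect keeps growing, and analyzing that source code gives the most accurate malware classification. This thesis therefore proposes an automated malware classification mechanism. Using malware source code captured by honeypots, the study combines the similarity of the malware's file structure with the similarity of its source code files and applies a hierarchical clustering algorithm, so that newly captured malware is assigned to the correct class and new types of malware are found quickly. The proposed approach greatly reduces the costly repeated analysis that digital forensic analysts would otherwise perform on malware of the same type, and it helps them understand an attacker's behavior and intent in the shortest possible time. The experiments show that the system classifies malware source code correctly and can also be applied to other domains that need source code classification.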
Abstract
Automatically classifying the source code of malware variants is the most important research issue in the field of digital forensics. Malware classification gives a complete picture of a sample's behavior, which simplifies the forensic task. Previous research has used malware binaries for dynamic analysis, or for static analysis after reverse engineering. Malware developers, in turn, use anti-VM and obfuscation techniques to deceive malware classifiers. As honeypots are increasingly deployed, researchers can obtain more and more malware source code, and analyzing this source code may be the best way to classify malware. This thesis proposes a novel classification approach based on the similarity of the logic and directory structure of malware. The collected source code is classified by a hierarchical clustering algorithm. The proposed system not only classifies known malware correctly but also finds new types of malware. It also spares forensic staff from spending time reanalyzing known malware, and it helps them understand an attacker's behavior and purpose. The experimental results demonstrate that the system classifies malware correctly and can be applied to other source code classification tasks.
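The abstract names two ingredients, directory structure similarity and hierarchical clustering, but this record does not give the thesis's actual similarity definitions. The following is only a minimal sketch of the general idea under stated assumptions: each sample is represented by its set of relative file paths, similarity is the Jaccard index over those sets, clusters are merged by average linkage, and merging stops at a cutoff of 0.5. The sample names, paths, function names, and threshold are all hypothetical and are not taken from the thesis.

```python
from itertools import combinations

# Hypothetical samples: name -> set of relative file paths in the source tree.
samples = {
    "bot_a": {"Makefile", "src/irc.c", "src/scan.c", "src/main.c"},
    "bot_a_variant": {"Makefile", "src/irc.c", "src/scan.c", "src/main.c", "src/ddos.c"},
    "shell_kit": {"index.php", "lib/shell.php", "lib/db.php"},
}


def jaccard(a, b):
    """Jaccard similarity of two path sets (assumed stand-in for structure similarity)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(samples, threshold=0.5):
    """Agglomerative clustering with average linkage.

    Repeatedly merges the two most similar clusters and stops once the best
    average similarity drops below `threshold` (an assumed cutoff).
    """
    clusters = [[name] for name in sorted(samples)]

    def linkage(c1, c2):
        # Average pairwise similarity between members of two clusters.
        pairs = [(x, y) for x in c1 for y in c2]
        return sum(jaccard(samples[x], samples[y]) for x, y in pairs) / len(pairs)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(clusters)
```

With the hypothetical data above, `cluster(samples)` groups `bot_a` with `bot_a_variant` (their path sets share 4 of 5 entries) and leaves `shell_kit` in a cluster of its own, which mirrors how a sample that matches no existing cluster could be flagged as a new type.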

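Table of Contents
Acknowledgments
Chinese Abstract
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  Section 1 Research Background
  Section 2 Research Motivation
  Section 3 Research Objectives
Chapter 2 Related Literature
  Section 1 Malware Classification
  Section 2 Source Code Comparison
  Section 3 Similarity Computation
Chapter 3 Problem Definition and Research Method
  Section 1 Problem Definition
  Section 2 System Architecture and Workflow
  Section 3 Similarity Definition
Chapter 4 System Evaluation
  Section 1 Sample Collection
  Section 2 Experiment 1: Self-modified standalone source code files, input in order of variation stage
  Section 3 Experiment 2: Self-modified standalone source code files, input in random order
  Section 4 Experiment 3: Self-modified compressed source code archives, input in random order
  Section 5 Experiment 4: Suspicious downloads collected by honeypots
Chapter 5 Conclusion and Future Work
Chapter 6 Related Literature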


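Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: user-defined release time
Availability: Campus: available; Off-campus: available
etd-0914112-155523.pdf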
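Printed copies
Public availability information for printed theses is relatively complete only from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Availability: available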



Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS