National Sun Yat-sen University, thesis/dissertation, Discovering Cyber Threat Intelligence from Hacker Forums

Title	Discovering Cyber Threat Intelligence from Hacker Forums (從駭客論壇發掘網路威脅情報)
Department	Department of Information Management
Year, semester	Fall semester of Academic Year 108 (ROC calendar, 2019-2020)	Language	Chinese
Degree	Master	Number of pages	88
Author	Wei-Chih Chao (趙偉志)
Advisor	Chia-Mei Chen (陳嘉玫)
Convenor	Bo-Chao Cheng (鄭伯炤)
Advisory Committee	Ming-Chao Chiang (江明朝); Gu-Hsin Lai (賴谷鑫); Yihuang Kang (康藝晃)
Date of Exam	2020-01-13	Date of Submission	2020-02-17
Keywords	Word2vec, Cluster Analysis, Natural Language Processing (NLP), Hacker Forum, Cyber Threat Intelligence (CTI)
Statistics	The thesis/dissertation has been browsed 6169 times and downloaded 4 times.

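Chinese Abstract (English translation)
Advances in network communication technology let enterprises offer better services to their customers, but they also bring entirely new threats. To handle these constantly evolving cyber threats properly, moving from reactive to proactive prevention has become very important. For this reason, Cyber Threat Intelligence (CTI) has recently become a focus of the information security field: by collecting CTI, defenders learn the methods that different attackers use to launch campaigns and can proactively adjust security measures to detect and block the related malicious activity. CTI must come from diverse data sources such as news, social network platforms, and forums. If first-hand CTI can be extracted from these many sources, possible attacks can be prevented as early as possible. Extracting CTI from hacker forums, which hackers commonly use to exchange information, can therefore yield first-hand information that may contain important security-related messages. However, forum data is complex and voluminous, and manual analysis is very time-consuming and resource-intensive, so an effective (semi-)automated threat intelligence extraction system is needed. This study reviews earlier research methods and finds that machine learning methods are reasonably effective at finding relevant threat intelligence in hacker forums, and on that basis develops a semi-automated threat intelligence extraction system. The system applies keyword tagging and Natural Language Processing (NLP) together with clustering analysis to group the data by type automatically in a data set with no explicit labels. The study finds that traditional clustering algorithms, paired with a suitable word embedding method, can distill threat intelligence from hacker forums.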
Abstract
Advances in network communication technology have brought a large number of new cyber threats. To handle these evolving threats properly, it is important to shift precautionary measures from reactive to proactive. Cyber Threat Intelligence (CTI) has therefore become a focus of attention in the information security field in recent years. Collecting CTI makes it possible to track the attack strategies used by different attackers and to adjust security measures proactively in order to detect and block malicious activity. CTI must come from a variety of sources, such as news and forums. If first-hand CTI can be captured from these large data sources, attacks can be thwarted sooner. Retrieving CTI from hacker forums, where hackers often exchange information, can therefore provide first-hand information that may contain important security-related content. Forum data is complex in content and very large in volume, so manual analysis is time-consuming and resource-intensive, and an effective (semi-)automatic threat intelligence retrieval system is necessary. This research reviews previous research methods and finds that machine learning methods are reasonably effective at finding relevant threat intelligence in hacker forums. On that basis, a semi-automatic threat intelligence retrieval system was developed. The system uses keyword tagging, Natural Language Processing (NLP), and clustering analysis to group posts by type automatically in unlabeled data sets. We found that traditional clustering algorithms can extract appropriate threat intelligence from hacker forum articles, as long as they are paired with an appropriate word embedding method.
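The pipeline the abstract describes (keyword tagging, text vectorization, then clustering of unlabeled posts) can be illustrated with a minimal sketch. This is not the thesis's implementation: the keyword list, the sample posts, and the fixed initial centroids are all hypothetical, and TF-IDF is used in place of Word2vec or Doc2vec only to keep the example free of third-party libraries.

```python
import math
import re
from collections import Counter

# Hypothetical keyword list; the thesis's actual tag list is not given in this record.
SECURITY_KEYWORDS = {"exploit", "malware", "botnet", "ransomware",
                     "phishing", "sql", "injection"}

# Hypothetical forum posts, invented for illustration only.
POSTS = [
    "selling new ransomware builder with botnet panel",
    "ransomware builder update botnet panel fixed",
    "sql injection exploit for forum login page",
    "need help with sql injection exploit on login form",
]


def tokenize(text):
    """Lowercase a post and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tag_keywords(tokens):
    """Keyword tagging: return the security keywords present in a post."""
    return sorted(set(tokens) & SECURITY_KEYWORDS)


def tfidf_vectors(docs_tokens):
    """Build L2-normalized TF-IDF vectors over a shared vocabulary."""
    vocab = sorted({t for doc in docs_tokens for t in doc})
    n = len(docs_tokens)
    df = Counter(t for doc in docs_tokens for t in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}  # smoothed IDF
    vectors = []
    for doc in docs_tokens:
        tf = Counter(doc)
        vec = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        vectors.append([x / norm for x in vec])
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, init_indices, iterations=20):
    """Lloyd's k-means; initial centroids are fixed so the run is reproducible."""
    centroids = [list(vectors[i]) for i in init_indices]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


tokens = [tokenize(p) for p in POSTS]
tags = [tag_keywords(t) for t in tokens]
labels = kmeans(tfidf_vectors(tokens), init_indices=(0, 2))
print(tags)
print(labels)
```

With these sample posts the two ransomware posts land in one cluster and the two SQL injection posts in the other. A closer match to the approach named in the abstract would replace `tfidf_vectors` with document vectors derived from a trained word embedding model, which requires a library and a much larger corpus than this sketch assumes.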

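Table of Contents
Thesis Approval Form
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Motivation
Chapter 2 Literature Review
  2.1 Cyber Threat Intelligence
  2.2 Applications Related to Hacker Forums
  2.3 Natural Language Processing
    2.3.1 TFIDF
    2.3.2 Topic Models
    2.3.3 Word Vectors
  2.4 Text Clustering Techniques
    2.4.1 Hierarchical Clustering Algorithms
    2.4.2 K-Means
Chapter 3 Methodology
  3.1 Data Collection
  3.2 Keyword Tagging and Cleaning Module
    3.2.1 Keyword Tagging Submodule
    3.2.2 Data Cleaning Submodule
  3.3 Word Matrix Transformation Module
  3.4 Analysis Module
    3.4.1 Topic Extraction Module
    3.4.2 Event Extraction Module
Chapter 4 System Evaluation
  4.1 Experiment 1: Data Cleaning and Keyword Tagging
    Data Cleaning
    Keyword Tagging
  4.2 Experiment 2: Parameter Evaluation and Clustering Results of the Clustering Algorithms
    Experiment 2-1: TFIDF + K-means
    Experiment 2-2: Word2Vec (CBOW) + K-means
    Experiment 2-3: Word2Vec (Skip-Gram) + K-means
    Experiment 2-4: Doc2vec (PV-DM) + K-means
    Experiment 2-5: Doc2vec (DBOW) + K-means
    Summary: Comparison of Word Embeddings with K-means
    Experiment 2-6: TFIDF + Hierarchical Cluster
    Experiment 2-7: Word2Vec (CBOW) + Hierarchical Cluster
    Experiment 2-8: Word2Vec (Skip-Gram) + Hierarchical Cluster
    Experiment 2-9: Doc2vec (PV-DM) + Hierarchical Cluster
    Experiment 2-10: Doc2vec (DBOW) + Hierarchical Cluster
    Summary: Comparison of Word Embeddings with Hierarchical Cluster
    Experiment 2-11: LDA
    Conclusion: Comparison of Word Embeddings and Clustering Methods
  4.3 Experiment 3: Clustering Parameter Tuning and Cluster Filtering for the Event Extraction Module
Chapter 5 Research Limitations and Future Work
References
Appendix 1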


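Fulltext
This electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as not to violate the law. Thesis access permission: user-defined availability date. Available: Campus: available; Off-campus: available. etd-0117120-121017.pdf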
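Printed copies
Public information on printed copies is relatively complete from Academic Year 102 onward. To look up public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Availability: available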



Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2453, Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS