國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,從新聞分析發現資安趨勢之研究,A Study of Discovering Security Trends from News Analysis

論文名稱 Title	從新聞分析發現資安趨勢之研究 A Study of Discovering Security Trends from News Analysis
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	108 學年度第 2 學期 The spring semester of Academic Year 108	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	71
研究生 Author	方俊傑 Jun-Jie Fang
指導教授 Advisor	陳嘉玫 Chia-Mei Chen
召集委員 Convenor	李忠憲 Jung-Shian Li
口試委員 Advisory Committee	劉譯閎, 賴谷鑫, 郭文中 Yi-Hung Liu; Gu-Hsin Lai; Wen-Chung Kuo
口試日期 Date of Exam	2019-07-24	繳交日期 Date of Submission	2020-04-30
關鍵字 Keywords	事件偵測、威脅情資、文字探勘、主題模型、群集分析 event detection, topic model, cluster analysis, CTI, text mining

中文摘要
隨著網路與科技的成長，各種線上服務發展快速，同時也出現多樣化的資安威脅，以及不斷演進的資安趨勢。為了積極應變各種新興資安事件，許多企業與組織透過蒐集並分析多來源的威脅情資，以掌握完整的網路攻擊資訊，針對駭客使用的攻擊手法，建立相對應的資安防護措施，以預防或阻止相關惡意活動是必要的應對方式。威脅情資有多樣化的資料來源，例如新聞、社群媒體以及論壇，其中新聞會在資安事件發生後發布即時的事件報導，以新聞做為威脅情資來源能夠得到第一手的資安事件資訊，進而預防可能發生的攻擊行為。然而，新聞報導來源眾多，以人工的方式瀏覽、蒐集及分析不但耗時且需要大量資源，因此，以自動化系統進行威脅情資分析有其必要性。有鑑於此，本研究提出一套新興資安情資偵測系統（Emerging Security Event Detection，簡稱ESED），自動化蒐集資安新聞，擷取資安事件關鍵字，透過主題模型與分群演算法分析新聞內容，以二階段分群與相似度比對方式偵測新興資安事件。經實驗結果顯示，本研究所提出之自動化新興資安情資偵測系統（ESED）能發現各個資安類別的新興資安事件，並有91.09%的偵測精確率，驗證ESED確實能幫助資安人員快速以及有效的應用威脅情資。
Abstract
With the growth of the Internet and technology, online services are developing rapidly, and diverse security threats and continually evolving security trends are emerging alongside them. To respond proactively to emerging security incidents, many companies and organizations collect and analyze threat intelligence from multiple sources in order to obtain complete information on cyber-attacks. Establishing security protection measures that correspond to the attack methods used by hackers is necessary to prevent or stop related malicious activities. Threat intelligence comes from diverse sources, such as news, social media, and forums. News outlets publish timely reports after a security incident occurs, so using news as a source of threat intelligence provides first-hand information on security incidents and helps prevent possible attacks. However, news sources are numerous, and manually browsing, collecting, and analyzing reports is time-consuming and resource-intensive. It is therefore necessary to automate threat intelligence analysis. This thesis proposes an Emerging Security Event Detection system (ESED), which automatically collects security news, extracts security event keywords, analyzes news content with topic models and clustering algorithms, and detects emerging security events through two-stage clustering and similarity comparison. Experimental results show that ESED can detect emerging security events across security categories with a detection precision of 91.09%, confirming that ESED can help security personnel apply threat intelligence quickly and effectively.
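The similarity comparison mentioned in the abstract can be illustrated with a small sketch. The table of contents lists JSD values between H (historical) and C (recent) topics, and the reference list cites Lin's paper on Jensen-Shannon divergence [35], so the sketch below compares topic-word distributions with JSD. This is only an illustration under stated assumptions, not the thesis's implementation: the function names, the toy vocabulary and distributions, and the 0.3 threshold are all hypothetical, and a real system would obtain the distributions from a trained LDA model.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence; lies in [0, 1] when log base 2 is used."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def find_emerging_topics(current_topics, historical_topics, threshold=0.3):
    """Return indices of recent topics whose closest historical topic
    is still farther away than `threshold` (an assumed cut-off)."""
    emerging = []
    for i, c in enumerate(current_topics):
        nearest = min(js_divergence(c, h) for h in historical_topics)
        if nearest > threshold:
            emerging.append(i)
    return emerging


# Toy topic-word distributions over a hypothetical four-word vocabulary:
# ["ransomware", "phishing", "patch", "botnet"]
historical = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
current = [[0.65, 0.15, 0.1, 0.1], [0.05, 0.05, 0.05, 0.85]]

print(find_emerging_topics(current, historical))  # prints [1]
```

In this toy run the first recent topic nearly matches a historical ransomware topic and is ignored, while the botnet-heavy second topic has no close historical counterpart and is flagged as a candidate emerging topic.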

目次 Table of Contents
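Thesis Approval Form
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Motivation
Chapter 2 Literature Review
  2.1 Threat Intelligence
  2.2 Text Mining
    2.2.1 Data Preprocessing
    2.2.2 Feature Extraction
  2.3 Topic Models
  2.4 Document Clustering
    2.4.1 K-means Clustering Algorithm
    2.4.2 Hierarchical Clustering Algorithm
  2.5 Event Detection
Chapter 3 Research Method
  3.1 Data Collection
  3.2 Text Preprocessing
  3.3 Feature Extraction
  3.4 Topic Clustering
  3.5 Event Clustering
Chapter 4 System Implementation and Evaluation
  4.1 Data Source Collection
  4.2 Experiment 1: Parameter Selection for the First-Stage Topic Clustering Module
    4.2.1 Experiment 1-1: Setting the Historical Topic Data Time Window (H)
    4.2.2 Experiment 1-2: Setting the Recent Topic Data Time Window (C)
    4.2.3 Experiment 1-3: Setting the Best Combination of H and C
    4.2.4 Experiment 1-4: Setting the Numbers of Topics NH and NC
    4.2.5 Experiment 1-5: Setting Parameters α, β, and the Number of Iterations (D)
  4.3 Experiment 2: Parameter Selection for the Second-Stage Event Clustering Module
    4.3.1 Experiment 2-1: Training the K-means++ Clustering Algorithm
    4.3.2 Experiment 2-2: Training the Hierarchical Clustering Algorithm
  4.4 Experiment 3: System Detection Performance
    4.4.1 Experiment 3-1: Emerging Event Detection Performance
    4.4.2 Experiment 3-2: System Evaluation by Security Personnel
  4.5 Experiment 4: Comparison with Existing Threat Intelligence Reports
Chapter 5 Conclusions and Future Research
References
Appendix 1: Detection Results for Each Event Cluster of the C Topics, 2019-09-01 to 2019-09-04
Appendix 2: Detection Results for Each Event Cluster of the C Topics, 2018-03-01 to 2018-03-04
Appendix 3: Detection Results for Each Event Cluster of the C Topics, 2018-03-18 to 2018-03-21
Appendix 4: Detection Results for Each Event Cluster of the C Topics, 2018-05-21 to 2018-05-24
Appendix 5: Event Detection for the September 2019 Newsletter

List of Figures
Figure 2-1: LDA word-document-topic relationship diagram [18]
Figure 2-2: LDA topic model diagram [17]
Figure 2-3: Hierarchical clustering diagram
Figure 3-1: System architecture
Figure 3-2: Detailed system flowchart
Figure 3-3: Document processing flowchart
Figure 4-1: Example of The Hacker News website content stored in the database
Figure 4-2: Topic Coherence validation of LDA topic model parameter H
Figure 4-3: Topic Coherence validation of LDA topic model parameter C
Figure 4-4: Coherence validation of the number of topics (NH) for LDA topic model H
Figure 4-5: Coherence validation of the number of topics (NC) for LDA topic model C
Figure 4-6: Coherence validation of parameters α and β for LDA topic model H
Figure 4-7: Coherence validation of parameters α and β for LDA topic model C
Figure 4-8: K-means++ cluster count selection
Figure 4-9: Hierarchical Clustering results
Figure 4-10: Flowchart of how the system modules are applied
Figure 4-11: Sliding Window diagram

List of Tables
Table 3-1: System parameters
Table 4-1: List of experiments
Table 4-2: News data sources
Table 4-3: JSD values between the H and C topics of the LDA topic model
Table 4-4: Topic clustering module parameter settings
Table 4-5: System parameter settings used in this study
Table 4-6: Partial emerging event detection results, 2019-09-01 to 2019-09-04
Table 4-7: Labeled data contents
Table 4-8: Emerging security event detection precision in Experiment 3-1
Table 4-9: Partial emerging event detection results, 2018-03-01 to 2018-03-04
Table 4-10: Partial emerging event detection results, 2018-03-18 to 2018-03-21
Table 4-11: Partial emerging event detection results, 2018-05-21 to 2018-05-24
Table 4-12: Results of Experiment 3-2, system evaluation by security personnel
Table 4-13: Round-up security incident reports
Table 4-14: Warning-style advance notices of possible attacks
Table 4-15: Attack advance notices and actual attack incidents
Table 4-16: Security-related social media policies
Table 4-17: System vulnerability updates
Table 4-18: Experiment 4 detection results
Table 4-19: Event cluster for social media and email phishing incidents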

參考文獻 References
[1] SANS. (2016). Threat Intelligence: What It Is, and How to Use It Effectively. Available: https://www.sans.org/reading-room/whitepapers/analyst/threat-intelligence-is-effectively-37282
[2] 李宗翰. (2016, June 20, 2018). 企業該如何掌握網路威脅情資，以有效阻擋惡意攻擊 [How enterprises can grasp cyber threat intelligence to effectively block malicious attacks]. Available: https://www.ithome.com.tw/tech/108544
[3] R. Brown and R. M. Lee, "The Evolution of Cyber Threat Intelligence (CTI): 2019 SANS CTI Survey," 2019.
[4] N. Al Moubayed, D. Wall, and A. S. McGough, "Identifying Changes in the Cybersecurity Threat Landscape Using the LDA-Web Topic Modelling Data Search Engine," in International Conference on Human Aspects of Information Security, Privacy, and Trust, 2017, pp. 287-295: Springer.
[5] I. Deliu, "Extracting Cyber Threat Intelligence From Hacker Forums," NTNU, 2017.
[6] I. Deliu, C. Leichter, and K. Franke, "Extracting cyber threat intelligence from hacker forums: Support vector machines versus convolutional neural networks," in Big Data (Big Data), 2017 IEEE International Conference on, 2017, pp. 3648-3656: IEEE.
[7] S.-Y. Huang and H. Chen, "Exploring the online underground marketplaces through topic-based social network and clustering," in Intelligence and Security Informatics (ISI), 2016 IEEE Conference on, 2016, pp. 145-150: IEEE.
[8] Gephi. Available: https://gephi.org/
[9] S. Samtani, K. Chinn, C. Larson, and H. Chen, "AZSecure Hacker Assets Portal: Cyber threat intelligence and malware analysis," in 2016 IEEE Conference on Intelligence and Security Informatics (ISI), 2016, pp. 19-24: IEEE.
[10] R. Feldman and I. Dagan, "Knowledge Discovery in Textual Databases (KDT)," in KDD, 1995, vol. 95, pp. 112-117.
[11] R. Wirth and J. Hipp, "CRISP-DM: Towards a standard process model for data mining," in Proceedings of the 4th international conference on the practical applications of knowledge discovery and data mining, 2000, pp. 29-39: Citeseer.
[12] M. Allahyari et al., "A brief survey of text mining: Classification, clustering and extraction techniques," 2017.
[13] C. Silva and B. Ribeiro, "The importance of stop word removal on recall values in text categorization," in Proceedings of the International Joint Conference on Neural Networks, 2003, vol. 3, pp. 1661-1666: IEEE.
[14] M. F. Porter, "An algorithm for suffix stripping," Program, vol. 14, no. 3, pp. 130-137, 1980.
[15] T. Liu, S. Liu, Z. Chen, and W.-Y. Ma, "An evaluation on feature selection for text clustering," in Proceedings of the 20th international conference on machine learning (ICML-03), 2003, pp. 488-495.
[16] K. Sparck Jones, "A statistical interpretation of term specificity and its application in retrieval," Journal of Documentation, vol. 28, no. 1, pp. 11-21, 1972.
[17] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," Journal of Machine Learning Research, vol. 3, no. Jan, pp. 993-1022, 2003.
[18] D. M. Blei, "Probabilistic topic models," Communications of the ACM, vol. 55, no. 4, pp. 77-84, 2012.
[19] T. Nagai et al., "Understanding Attack Trends from Security Blog Posts Using Guided-topic Model," vol. 27, pp. 802-809, 2019.
[20] S. Samtani, R. Chinn, and H. Chen, "Exploring hacker assets in underground forums," in 2015 IEEE International Conference on Intelligence and Security Informatics (ISI), 2015, pp. 31-36: IEEE.
[21] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, 1967, vol. 1, no. 14, pp. 281-297: Oakland, CA, USA.
[22] E. Marin, A. Diab, and P. Shakarian, "Product offerings in malicious hacker markets," in 2016 IEEE conference on intelligence and security informatics (ISI), 2016, pp. 187-189: IEEE.
[23] J. H. Ward Jr., "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, no. 301, pp. 236-244, 1963.
[24] A. Rege et al., "Using a real-time cybersecurity exercise case study to understand temporal characteristics of cyberattacks," in International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation, 2017, pp. 127-132: Springer.
[25] J. Allan, R. Papka, and V. Lavrenko, "On-line new event detection and tracking," in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, 1998, pp. 37-45.
[26] Y. Yang et al., "Learning approaches for detecting and tracking news events," vol. 14, no. 4, pp. 32-43, 1999.
[27] L. Hu, B. Zhang, L. Hou, and J. Li, "Adaptive online event detection in news streams," Knowledge-Based Systems, vol. 138, pp. 105-112, 2017.
[28] W. Ai, K. Li, and K. Li, "An effective hot topic detection method for microblog on spark," Applied Soft Computing, vol. 70, pp. 1010-1023, 2018.
[29] I. Mele and F. Crestani, "Event detection for heterogeneous news streams," in International Conference on Applications of Natural Language to Information Systems, 2017, pp. 110-123: Springer.
[30] Google 10000 most common words. Available: https://github.com/first20hours/google-10000-english
[31] I. Moutidis and H. T. Williams, "Utilizing Complex Networks for Event Detection in Heterogeneous High-Volume News Streams," in International Conference on Complex Networks and Their Applications, 2019, pp. 659-672: Springer.
[32] Beautiful Soup. Available: https://www.crummy.com/software/BeautifulSoup/
[33] Z. S. Harris, "Distributional structure," Word, vol. 10, no. 2-3, pp. 146-162, 1954.
[34] M. Röder, A. Both, and A. Hinneburg, "Exploring the space of topic coherence measures," in Proceedings of the eighth ACM international conference on Web search and data mining, 2015, pp. 399-408.
[35] J. Lin, "Divergence measures based on the Shannon entropy," IEEE Transactions on Information Theory, vol. 37, no. 1, pp. 145-151, 1991.
[36] G. Maskeri, S. Sarkar, and K. Heafield, "Mining business topics in source code using latent dirichlet allocation," in Proceedings of the 1st India software engineering conference, 2008, pp. 113-120: ACM.
[37] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National Academy of Sciences, vol. 101, no. suppl 1, pp. 5228-5235, 2004.
[38] The Week in Ransomware - November 22nd 2019 - Leaky Files. Available: https://www.bleepingcomputer.com/news/security/the-week-in-ransomware-november-22nd-2019-leaky-files/
[39] Security Affairs newsletter Round 238. Available: https://securityaffairs.co/wordpress/93350/breaking-news/security-affairs-newsletter-round-238.html
[40] FBI Warns of Cyber Attacks Targeting US Automotive Industry. Available: https://www.bleepingcomputer.com/news/security/fbi-warns-of-cyber-attacks-targeting-us-automotive-industry/
[41] U.S. Government Issues Warning About Possible Iranian Cyberattacks. Available: https://www.bleepingcomputer.com/news/security/us-government-issues-warning-about-possible-iranian-cyberattacks/
[42] Iranian hackers deface US government & African bank website. Available: https://www.hackread.com/iranian-hackers-deface-us-government-african-bank-website/
[43] YouTube to treat all kid-aimed videos like they’re COPPA-liable. Available: https://nakedsecurity.sophos.com/2020/01/08/youtube-to-treat-all-kid-aimed-videos-like-theyre-coppa-liable/
[44] Facebook bans deepfakes, but not cheapfakes or shallowfakes. Available: https://nakedsecurity.sophos.com/2020/01/08/facebook-bans-deepfakes-but-not-cheapfakes-or-shallowfakes/
[45] Microsoft Releases January 2020 Office Updates With Crash Fixes. Available: https://www.bleepingcomputer.com/news/security/microsoft-releases-january-2020-office-updates-with-crash-fixes/
[46] Tails 4.2 Fixes Numerous Security Flaws, Improves Direct Upgrades. Available: https://www.bleepingcomputer.com/news/linux/tails-42-fixes-numerous-security-flaws-improves-direct-upgrades/
[47] Adobe Releases First 2020 Patch Tuesday Software Updates. Available: https://thehackernews.com/2020/01/adobe-software-updates.html
[48] TWCERT 電子報 [TWCERT newsletter]. Available: https://www.twcert.org.tw/tw/lp-106-1.html

電子全文 Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast the text without authorization.
論文使用權限 Thesis access permission	自定論文開放時間 User-defined availability date
校內 Campus	Available for download from 2025-04-30
校外 Off-campus	Available for download from 2025-04-30
紙本論文 Printed copies
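Public availability information for printed theses is relatively complete from Academic Year 102 onward. For printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
開放時間 Available	2025-04-30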

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS