National Sun Yat-sen University, thesis/dissertation, 運用叢集式運算之巨量資料入侵偵測 (Big Data Intrusion Detection Using Cluster Computing)

Title	運用叢集式運算之巨量資料入侵偵測 (Big Data Intrusion Detection Using Cluster Computing)
Department	Department of Information Management
Year, semester	Academic Year 107, second (spring) semester	Language	Chinese
Degree	Master	Number of pages	61
Author	陳宣名 (Hsuan-Ming Chen)
Advisor	陳嘉玫 (Chia-Mei Chen)
Convenor	官大智 (D. J. Guan)
Advisory Committee	鄭伯炤 (Bo-Chao Cheng); 賴谷鑫 (Gu-Hsin Lai); 范俊逸 (Chun-I Fan)
Date of Exam	2017-07-25	Date of Submission	2019-05-27
Keywords	Cyber Security, Cluster Computing, Intrusion Detection System, Virtualization, Big Data
Statistics	The thesis/dissertation has been browsed 5858 times and downloaded 0 times.

Abstract
With the growth of the Internet, handheld devices, IoT devices, and personal computers have multiplied, and any device with a network connection may be attacked. To defend against both known and as-yet-undiscovered attacks, enterprises deploy firewalls, intrusion detection systems, and other security equipment. As data volumes grow and accumulate rapidly, these security systems produce logs from different sources and in different formats, which strains solutions built on traditional architectures. Handling heterogeneous data formats, storing big data, and performing correlation analysis of attack events therefore become major challenges for enterprises. With so many big data solutions available, information security personnel need the ability to select one that fits their enterprise's requirements. Following current trends, this study proposes a cloud-based virtualized intrusion detection system architecture that meets the needs of most enterprises. It aggregates, stores, and analyzes both historical and real-time (streaming) data, and the study evaluates existing distributed file systems and NoSQL databases to strengthen an enterprise's capacity to process big data. The proposed architecture can deploy multiple multi-node cluster systems on a single host and manages big data in a distributed manner. It uses Spark's fast in-memory computation to relieve the performance bottleneck caused by frequent disk access, which reduces the time an intrusion detection system needs to process big data and enables real-time analysis. Big data provided by a real enterprise was processed in both a traditional virtualization environment and the proposed environment. The experiments show that the proposed cloud-based virtualized cluster architecture achieves good performance and a good detection rate.
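The abstract's core idea, normalizing logs from different sources into a common record and then correlating events, can be sketched in plain Python. This is a minimal illustration only, not the thesis's Spark-based correlation module: the two log formats, the field names, the regular expression, and the `parse_web`, `parse_firewall`, and `correlate` helpers are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical web log pattern (Common Log Format style); not taken from the thesis.
WEB_LOG = re.compile(
    r'(?P<src>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def parse_web(line):
    """Normalize one web log line into a common record, or None if it does not match."""
    m = WEB_LOG.match(line)
    if m is None:
        return None
    event = f'{m["method"]} {m["path"]} {m["status"]}'
    return {"source": "web", "src_ip": m["src"], "event": event}


def parse_firewall(line):
    """Normalize one key=value firewall line (hypothetical format) into a common record."""
    fields = dict(pair.split("=", 1) for pair in line.split() if "=" in pair)
    if "src" not in fields:
        return None
    return {"source": "firewall", "src_ip": fields["src"],
            "event": fields.get("action", "unknown")}


def correlate(records, min_sources=2):
    """Group records by source IP; keep IPs reported by at least min_sources log sources."""
    by_ip = defaultdict(list)
    for rec in records:
        by_ip[rec["src_ip"]].append(rec)
    return {
        ip: recs for ip, recs in by_ip.items()
        if len({r["source"] for r in recs}) >= min_sources
    }


# Made-up sample lines for illustration only.
web_lines = [
    '10.0.0.5 - - [25/Jul/2017:10:00:00 +0800] "GET /admin.php HTTP/1.1" 404',
    '10.0.0.9 - - [25/Jul/2017:10:00:02 +0800] "GET /index.html HTTP/1.1" 200',
]
fw_lines = ["action=drop src=10.0.0.5 dst=192.168.1.10 dport=22"]

records = [r for r in map(parse_web, web_lines) if r]
records += [r for r in map(parse_firewall, fw_lines) if r]
suspicious = correlate(records)
print(sorted(suspicious))
```

Grouping on a shared key such as the source IP is the simplest form of correlation. In a distributed setting the same grouping would typically be expressed as a DataFrame aggregation rather than an in-memory dictionary.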

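Table of Contents
Contents
Thesis Approval Form i
Acknowledgments ii
Abstract (Chinese) iii
Abstract (English) iv
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Motivation 5
1.3 Research Objectives 7
Chapter 2 Literature Review 8
2.1 Big Data 8
2.2 Distributed Computing 9
2.2.1 Hadoop 9
2.2.2 Spark 11
2.3 SMACK 14
2.3.1 Kafka 15
2.3.2 Cassandra 15
2.3.3 Mesos 16
2.3.4 Akka 17
2.4 NoSQL 18
2.4.1 MongoDB 20
2.5 EFK Stack 21
2.6 Virtualization 22
2.7 Intrusion Detection Methods 24
Chapter 3 System Design 26
3.1 System Architecture 26
3.2 System Parameters 28
3.3 System Components 31
Chapter 4 System Evaluation 35
4.1 Experiment 1: Data Volume and Detection Rate 35
4.2 Experiment 2: Comparison of a Traditional Virtualization Environment and This Study's Cloud Virtualization Environment 38
4.3 Experiment 3: HDFS vs. MongoDB 40
4.3.1 Data Loading and Writing 40
4.3.2 Performance of the Spark Correlation Module 42
4.4 Experiment 4: Comparison with an Existing Big Data Analytics Platform 44
Chapter 5 Conclusions and Future Work 46
References 47

List of Figures
Figure 1-1 Data volume each data scientist must handle [3] 2
Figure 1-2 Overview of big data solutions [4] 3
Figure 1-3 Number of targeted attack incidents since 2008 [7] 5
Figure 1-4 Trend Micro statistics on targeted attacks by country [9] 6
Figure 2-1 The 5 Vs of big data [12] 8
Figure 2-2 Formats of different data sources 9
Figure 2-3 Hadoop 2.0 architecture [14][15] 9
Figure 2-4 MapReduce computation architecture [15] 10
Figure 2-5 Spark architecture [15][16] 11
Figure 2-6 Operation flow of Spark system components [17] 12
Figure 2-7 Spark computation architecture [15] 13
Figure 2-8 How Spark processes real-time data [19] 13
Figure 2-9 Time required per iteration for Hadoop and Spark [21] 13
Figure 2-10 SMACK architecture [22] 14
Figure 2-11 Kafka architecture [22] 15
Figure 2-12 Cassandra ring architecture [23] 16
Figure 2-13 Cassandra data storage model [22] 16
Figure 2-14 Mesos architecture [25] 17
Figure 2-15 Akka Actor model [26] 17
Figure 2-16 MongoDB data partitioning and storage method [30] 20
Figure 2-17 Operating mechanism of a MongoDB Sharding Cluster [30] 21
Figure 2-18 Elasticsearch operating architecture [32] 21
Figure 2-19 Container and VM architectures [35] 23
Figure 3-1 Server information for this study's Spark Standalone Cluster 29
Figure 3-2 Architecture of this study's intrusion detection system in a cloud Docker virtualization environment 31
Figure 3-3 Example Log Collecting configuration file 32
Figure 3-4 Example regular expression for Web Log 32
Figure 3-5 Key-Value format of System Log as stored in the database after conversion 32
Figure 3-6 Functions of the correlation module program 33
Figure 3-7 Spark DataFrame format of this study's Firewall Log after conversion 34
Figure 3-8 Illustration of graphical data presentation 34
Figure 4-1 Detection rate over different time periods using Random Forest Tree in this study 37
Figure 4-2 System execution time in different environments 39
Figure 4-3 Code for loading this study's data from MongoDB into Spark and writing it back to MongoDB 40
Figure 4-4 Code for loading this study's data from HDFS into Spark and writing it back to HDFS 40
Figure 4-5 HDFS vs. MongoDB: performance of loading into Spark 41
Figure 4-6 HDFS vs. MongoDB: disk write performance 41
Figure 4-7 HDFS vs. MongoDB: required disk space 42
Figure 4-8 Performance of this study's correlation module 43

List of Tables
Table 1-1 Number of open-source software modules in big data 3
Table 2-1 Comparison of RDD, DataFrames, and Datasets [18] 12
Table 2-2 Differences between RDBMS and NoSQL [28] 19
Table 2-3 Differences between Fluentd and Logstash [33] 22
Table 2-4 Differences between Container and VM [36] 23
Table 2-5 Differences among machine learning algorithms [37][38][39] 25
Table 3-1 Intrusion detection architecture of this study's cloud Docker virtualization environment 26
Table 3-2 Hardware information of the physical host 28
Table 3-3 System information of the virtual cluster hosts 28
Table 3-4 Spark Worker configuration in this study 29
Table 3-5 MongoDB server configuration in this study 30
Table 3-6 Kafka server configuration in this study 30
Table 3-7 Partial classification results for targeted attack detection stages 34
Table 4-1 Data volume in different time intervals 35
Table 4-2 Parameter settings of different algorithms 36
Table 4-3 Detection results of different algorithms 36
Table 4-4 Experimental configuration of this study's platform and a traditional commercial virtualization platform 38
Table 4-5 Performance difference between the traditional virtualization environment and this study 39
Table 4-6 HDFS vs. MongoDB system evaluation 43
Table 4-7 Experimental specifications of this study and Splunk 44
Table 4-8 Data processing complexity 44
Table 4-9 Experimental results of the comparison between this study and Splunk 44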

References
[1] CSC, “BIG DATA UNIVERSE BEGINNING TO EXPLODE,” CSC. [Online]. Available: http://www.csc.com/insights/flxwd/78931-big_data_universe_beginning_to_explode. [Accessed: 06-Nov-2016].
[2] iThome, “Google推出NoSQL雲端資料庫服務Bigtable,” iThome. [Online]. Available: http://www.ithome.com.tw/news/95709. [Accessed: 15-Jul-2017].
[3] EMC, “The Digital universe of opportunities,” EMC. [Online]. Available: https://taiwan.emc.com/collateral/analyst-reports/idc-digital-universe-2014.pdf. [Accessed: 15-Jul-2017].
[4] M. Turck, “Firing on All Cylinders: The 2017 Big Data Landscape,” Matt Turck, 24-May-2017. [Online]. Available: http://mattturck.com/bigdata2017/. [Accessed: 15-Jul-2017].
[5] Rose India, “The 5 Major Big Data Tools and Developing Trends in 2017,” The 5 Major Big Data Tools and Developing Trends in 2017. [Online]. Available: http://www.roseindia.net/bigdata/The-5-Major-Big-Data-Tools-and-Developing-Trends-in-2017.shtml. [Accessed: 15-Jul-2017].
[6] Docker, “eBay Simplifies Application Deployment,” Docker. [Online]. Available: https://www.docker.com/customers/ebay-simplifies-application-deployment. [Accessed: 15-Jul-2017].
[7] ThreatMiner.org, “ThreatMiner.org | Data Mining for Threat Intelligence,” ThreatMiner.org. [Online]. Available: https://www.threatminer.org/. [Accessed: 15-Jul-2017].
[8] 趨勢科技 (Trend Micro), “趨勢科技CLOUDSEC 2016雲端企業資安高峰論壇登場,” 趨勢科技. [Online]. Available: http://www.trendmicro.tw/tw/about-us/newsroom/releases/articles/20160826034309.html. [Accessed: 15-Jul-2017].
[9] Trend Micro USA, “The Targeted Attack Trends in APAC in 1H 2014,” The Targeted Attack Trends in APAC in 1H 2014 - Security News - Trend Micro USA. [Online]. Available: http://www.trendmicro.com/vinfo/us/security/news/cyber-attacks/targeted-attack-trends-apac-1h-2014. [Accessed: 15-Jul-2017].
[10] 賴季苹, 《以社交網路分析整合被式網路偵測網路入侵》, Master's thesis, Department of Information Management, National Sun Yat-sen University, 2016.
[11] Acer, “主動式資安威脅管理解決方案 SAFE 3.0,” Acer. [Online]. Available: http://www.aceredc.com/edc/download/%E7%B6%B2%E8%B7%AF%E6%97%A5%E8%AA%8C%E4%BA%8B%E4%BB%B6%E7%9B%A3%E6%8E%A7%E4%B9%8B500%E5%8F%B0%E6%97%A5%E8%AA%8C%E7%AE%A1%E6%8E%A7(%E6%94%AF%E6%8F%B4IPv6).pdf. [Accessed: 06-Nov-2016].
[12] M. D. Assunção, R. N. Calheiros, S. Bianchi, M. A. Netto, and R. Buyya, “Big Data computing and clouds: Trends and future directions,” Journal of Parallel and Distributed Computing, vol. 79, pp. 3-15, May 2015.
[13] K. S. Jeon, S. J. Park, S. H. Chun, and J. B. Kim, “A Study on the Big Data Log Analysis for Security,” International Journal of Security and Its Applications, vol. 10, no. 1, pp. 13-20, 2016.
[14] Big Data Analytics News, “Hadoop 2.0 and YARN Architecture,” Big Data Analytics News, 30-Sep-2014. [Online]. Available: http://bigdataanalyticsnews.com/hadoop-2-0-yarn-architecture/. [Accessed: 15-Jul-2017].
[15] 林大貴, 《Python+Spark 2.0+Hadoop機器學習與大數據分析實戰》, 博碩文化股份有限公司, ISBN 9789864341535, 2016.
[16] Apache Spark, “Documentation | Apache Spark,” Documentation | Apache Spark. [Online]. Available: https://spark.apache.org/documentation.html. [Accessed: 15-Jul-2017].
[17] Apache Spark, “Cluster Mode Overview,” Cluster Mode Overview - Spark 2.2.0 Documentation. [Online]. Available: http://spark.apache.org/docs/latest/cluster-overview.html. [Accessed: 15-Jul-2017].
[18] Databricks, “A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets,” Databricks, 04-Jan-2017. [Online]. Available: https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html. [Accessed: 15-Jul-2017].
[19] Taiwan Spark User Group, “Spark Streaming,” Spark 編程指南繁體中文版. [Online]. Available: https://taiwansparkusergroup.gitbooks.io/spark-programming-guide-zh-tw/content/spark-streaming/. [Accessed: 15-Jul-2017].
[20] C. Nyberg and M. Shah, “Sort Benchmark Home Page,” Sort Benchmark Home Page. [Online]. Available: http://sortbenchmark.org/. [Accessed: 15-Jul-2017].
[21] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, ... and I. Stoica, “Fast and interactive analytics over Hadoop data with Spark,” USENIX Login, vol. 37, no. 4, pp. 45-51, 2012.
[22] R. Estrada and I. Ruiz, “Big Data, Big Solutions,” Big Data SMACK, Apress, pp. 9-16, 2016.
[23] Apache Cassandra, “Manage massive amounts of data, fast, without losing sleep,” Apache Cassandra. [Online]. Available: http://cassandra.apache.org/. [Accessed: 15-Jul-2017].
[24] DB-Engine, “System Properties Comparison Cassandra vs. MongoDB,” Cassandra vs. MongoDB Comparison. [Online]. Available: http://db-engines.com/en/system/Cassandra%3BMongoDB. [Accessed: 15-Jul-2017].
[25] Apache Mesos, “Mesos Architecture,” Apache Mesos. [Online]. Available: http://mesos.apache.org/documentation/latest/architecture/. [Accessed: 15-Jul-2017].
[26] Apache Akka, “Akka Documentation,” Location Transparency • Akka Documentation. [Online]. Available: http://doc.akka.io/docs/akka/2.5.2/scala/general/remoting.html. [Accessed: 15-Jul-2017].
[27] A. Gautam, T., R. Dhingra, and P. Bedi, “Use of NoSQL Database for Handling Semi Structured Data: An Empirical Study of News RSS Feeds,” Emerging Research in Computing, Information, Communication and Applications, pp. 253-263, 2015.
[28] Microsoft Azure, “Microsoft Azure NoSQL 與 SQL,” Microsoft Azure. [Online]. Available: https://azure.microsoft.com/zh-tw/documentation/articles/documentdb-nosql-vs-sql/. [Accessed: 12-Nov-2016].
[29] DigitalOcean, “Contents,” A Comparison Of NoSQL Database Management Systems And Models | DigitalOcean. [Online]. Available: https://www.digitalocean.com/community/tutorials/a-comparison-of-nosql-database-management-systems-and-models. [Accessed: 15-Jul-2017].
[30] MongoDB, “Sharding,” Sharding - MongoDB Manual 3.4. [Online]. Available: https://docs.mongodb.com/manual/sharding/. [Accessed: 15-Jul-2017].
[31] Y. Li and S. Manoharan, “A performance comparison of SQL and NoSQL databases,” 2013 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PACRIM), pp. 15-19, 2013.
[32] A. Paro, Elasticsearch cookbook. Packt Publishing Limited, 2015.
[33] Logz.io, “Fluentd vs. Logstash: A Comparison of Log Collectors,” Logz.io, 26-Oct-2016. [Online]. Available: https://logz.io/blog/fluentd-logstash/. [Accessed: 15-Jul-2017].
[34] P. C. Lubomski, A. Kalinowski, and H. Krawczyk, “Multi-level Virtualization and Its Impact on System Performance in Cloud Computing,” Computer Networks, Communications in Computer and Information Science, pp. 247-259, 2016.
[35] A. M. Joy, “Performance comparison between Linux containers and virtual machines,” 2015 International Conference on Advances in Computer Engineering and Applications, pp. 342-346, 2015.
[36] K. Kumar and M. Kurhekar, “Economically Efficient Virtualization over Cloud Using Docker Containers,” 2016 IEEE International Conference on Cloud Computing in Emerging Markets (CCEM), pp. 95-100, 2016.
[37] L. Wang and R. Jones, “Big Data Analytics for Network Intrusion Detection: A Survey,” International Journal of Networks and Communications, vol. 7, no. 1, pp. 24-31, 2017.
[38] A. L. Buczak and E. Guven, “A survey of data mining and machine learning methods for cyber security intrusion detection,” IEEE Communications Surveys & Tutorials, vol. 18, no. 2, pp. 1153-1176, 2016.
[39] A. Sahasrabuddhe, S. Naikade, A. Ramaswamy, B. Sadliwala, and P. Futane, “Survey on Intrusion Detection System using Data Mining Techniques,” International Research Journal of Engineering and Technology (IRJET), vol. 4, no. 5, pp. 1780-1784, 2017.
[40] Nikos Virvilis, C. I. S. A., Oscar Serrano, C. I. S. A., and C. CISM, “Big Data analytics for sophisticated attack detection,” Information Systems Audit and Control Association (ISACA), vol. 3, 2014.
[41] K. Asya, “Performance Testing MongoDB 3.0 Part 1: Throughput Improvements Measured with YCSB,” MongoDB. [Online]. Available: https://www.mongodb.com/blog/post/performance-testing-mongodb-30-part-1-throughput-improvements-measured-ycsb. [Accessed: 15-Jul-2017].
[42] Check Point Software, “Industry-Leading Cyber Security Solutions for Networks, Data Centers, Mobile Devices & Endpoints | Check Point Software,” Check Point Software. [Online]. Available: https://www.checkpoint.com/. [Accessed: 30-Jul-2017].

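Fulltext
This electronic full text is licensed to users solely for academic research, for personal, non-profit searching, reading, and printing. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it without authorization. Thesis access permission: user-defined release time. Available: Campus: available; Off-campus: available. etd-0427119-164526.pdf
Printed copies
Public information on printed theses is relatively complete from Academic Year 102 onward. To look up public information on printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience. Available: available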

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2452 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS