國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,以機器學習技術提升雲端儲存之可用性,Improving the Availability of Cloud Storage using Machine Learning

論文名稱 Title	以機器學習技術提升雲端儲存之可用性 Improving the Availability of Cloud Storage using Machine Learning
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	107 學年度第 1 學期 The fall semester of Academic Year 107	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	81
研究生 Author	侯建安 Chien-An Hou
指導教授 Advisor	陳嘉玫 Chia-Mei Chen
召集委員 Convenor	郭文中 Wen-Chung Kuo
口試委員 Advisory Committee	林耕霈, 賴谷鑫 Keng-Pei Lin; Gu-Shin Lai
口試日期 Date of Exam	2019-01-07	繳交日期 Date of Submission	2019-01-26
關鍵字 Keywords	可用性、機器學習、硬碟故障、雲端儲存、決策樹、深度學習、隨機森林、XGBoost / Cloud Storage, Availability, Disk Failure, Machine Learning, XGBoost, Random Forest, Decision Tree, Deep Learning

中文摘要
在現今資料爆炸的數位時代，雲端儲存（Cloud Storage）成為近年來雲端服務中發展看好的趨勢之一，雲端儲存使用資料中心存放資料，以及擬定SLA服務水準來衡量可用性的高低，但面臨高可用性帶來的高成本問題，總有取捨問題，高儲存低成本可能面臨硬碟機故障帶來的影響，雖有HDFS等技術避免問題，但服務卻可能因此受到影響，進而降低可用性。本研究提出使用機器學習技術，建立預測硬碟故障的模型，在硬碟故障前發現進行更換，避免硬碟機故障導致服務中斷或重跑，提升其服務的可用性。本研究使用大數據分析工具Splunk以及機器學習工具RapidMiner，幫助企業分析整理資料以及降低技術門檻。本研究使用Splunk進行資料前處理，匯入硬碟的S.M.A.R.T.資料，產生統計圖與分析報表，進行初步地資料分析與過濾，並進行數據採樣，產生機器學習用的資料集，接著使用RapidMiner機器學習工具進行特徵選取與預測模型建立，使用不同機器學習演算法建立預測模型，從研究結果來看，XGBoost提供較好的預測能力，研究結果顯示XGBoost模型可以提供有用的硬碟故障預測，可以事先偵測到硬碟即將故障的作法，進而採取應變措施，進而提升雲端儲存的可用性。
Abstract
In today's digital era of explosive data growth, cloud storage has become one of the most promising trends in cloud services. Cloud storage providers keep data in data centers and define a service level agreement (SLA) to measure availability. High availability, however, comes at high cost, so there is always a trade-off: high-capacity, low-cost storage is exposed to the effects of hard drive failures. Techniques such as HDFS protect the data against such failures, but the service may still be affected, which lowers availability. This study proposes using machine learning to build a model that predicts hard drive failures, so that a drive can be replaced before it fails. This avoids service interruptions or job reruns caused by drive failures and thereby improves service availability. The study uses the big data analytics tool Splunk and the machine learning tool RapidMiner to help companies analyze and organize data and to lower the technical barrier. Splunk is used for data preprocessing: it imports the drives' S.M.A.R.T. data, generates statistical charts and analysis reports, supports preliminary data analysis and filtering, and samples the data to produce a dataset for machine learning. RapidMiner is then used for feature selection and for building prediction models with different machine learning algorithms, identifying the fields that matter most for predicting drive failures. Among the algorithms compared, XGBoost gave the best predictions, and the results show that the XGBoost model can provide useful hard drive failure predictions. Detecting an impending drive failure in advance makes it possible to take countermeasures and thus improve the availability of cloud storage.
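As a rough illustration of how an SLA availability target relates to service downtime, the sketch below converts an availability percentage into a downtime budget and computes measured availability as uptime / (uptime + downtime). The function names, the 30-day month, and the sample targets are assumptions chosen for illustration; they are not taken from the thesis.

```python
# Sketch: translate an SLA availability target into a downtime budget.
# The targets below are illustrative values, not figures from the thesis.

HOURS_PER_MONTH = 30 * 24  # assumption: a 30-day billing month


def allowed_downtime_minutes(availability_pct: float,
                             period_hours: float = HOURS_PER_MONTH) -> float:
    """Return the downtime (in minutes) permitted by an availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * 60.0 * (1.0 - availability_pct / 100.0)


def measured_availability(uptime_hours: float, downtime_hours: float) -> float:
    """Availability as a percentage: uptime / (uptime + downtime)."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return 100.0 * uptime_hours / total


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% availability -> {minutes:.2f} minutes of downtime per month")
```

Under these assumptions, a 99.9% target leaves about 43.2 minutes of downtime per 30-day month, so a single outage or long rebuild after a drive failure can use up most of the budget. This is the kind of outage that predicting failures in advance is meant to avoid.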

目次 Table of Contents
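Chapter 1 Introduction 1
  Section 1 Research Background 1
  Section 2 Research Motivation 3
  Section 3 Research Objectives 5
Chapter 2 Literature Review 7
  Section 1 Trends in Cloud Storage 7
  Section 2 Enterprise-Grade Hard Drives 7
  Section 3 Data Protection in Cloud Storage 9
  Section 4 Availability of Cloud Storage 12
  Section 5 Problems Facing Cloud Storage 18
  Section 6 Machine Learning and Model Performance Evaluation 19
Chapter 3 Research Method 30
  Section 1 Research Framework 30
  Section 2 Data Collection and Variable Operationalization 31
  Section 3 Machine Learning System Architecture Design 44
  Section 4 System Software and Hardware Environment 52
  Section 5 Experimental Method 53
Chapter 4 Experimental Results and Analysis 56
  Section 1 Structure of the Experimental Dataset 56
  Section 2 Description of Experimental Results 56
  Section 3 Overall Analysis and Evaluation 63
Chapter 5 Conclusion and Future Outlook 65
  Section 1 Research Conclusions 65
  Section 2 Future Research Directions 66
Chapter 6 References 67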


電子全文 Fulltext
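This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission：自定論文開放時間 user-defined release time
開放時間 Available：校內 Campus：已公開 available；校外 Off-campus：已公開 available
etd-0026119-204004.pdf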
紙本論文 Printed copies
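Availability information for printed theses is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk at the Office of Library and Information Services. We apologize for any inconvenience.
開放時間 Available：已公開 available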

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us: 2452 Thesis Format Review Team, Mail │ Development and operations: Knowledge Innovation Division, LIS