國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,以深度學習模型探討中文問答系統中語言轉換的知識擴充,Extending Knowledge in Different Languages with the Deep Learning Model for a Chinese Q-A System

論文名稱 Title	以深度學習模型探討中文問答系統中語言轉換的知識擴充 Extending Knowledge in Different Languages with the Deep Learning Model for a Chinese Q-A System
系所名稱 Department	資訊管理學系 Department of Information Management
畢業學年期 Year, semester	106 學年度第 2 學期 The spring semester of Academic Year 106	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	59
研究生 Author	蔡心娪 Hsin-Wu Tsai
指導教授 Advisor	李偉柏 Lee Wei-Po
召集委員 Convenor	林耕霈 Lin Keng-Pei
口試委員 Advisory Committee	蕭漢威 Xiao Han-Wei
口試日期 Date of Exam	2018-06-25	繳交日期 Date of Submission	2018-07-16
關鍵字 Keywords	問答資料集、深度學習、中文資料處理、語言轉換、Word2Vec Language conversion, Chinese data processing, Word2Vec, Deep learning, Question answering dataset
統計 Statistics	本論文已被瀏覽 6206 次，被下載 3 次 The thesis/dissertation has been browsed 6206 times, has been downloaded 3 times.

中文摘要
隨著網際網路的發展，人們透過搜尋引擎從網路上獲取資訊的速度、數量已經不是傳統書籍可以比擬的，但是搜尋引擎十分仰賴使用者輸入的「搜尋關鍵字」，再加上搜尋結果是以列表的形式呈現，無法讓使用者一下子就找尋出所需要的正確答案。問答系統則是為了解決這個問題，問答系統可以讓使用者輸入自然語言的語句，並回答最正確的答案給使用者，但是目前問答系統的模型訓練還是十分仰賴結構化的資料集。有許多研究在探討如何利用網路資源、文本等擴充資料集，但是所搜集的資料集或是問答網站通常以單一語言為主，因此本研究旨在使用語言轉換的方式擴充問答資料集。本研究探討了不同中文資料處理對於語言問答資料的影響，包含了標點符號處理、斷詞處理、UnKnown處理、Word2Vec訓練，最後使用深度學習問答模型來評斷資料集各種處理之差異，結果發現都不做任何人工處理的效果反而最佳。
Abstract
Following the advance of the Internet, nowadays when people access information in the Internet with search engines is far beyond the traditional way of reading books in terms of speed and quantity of their reading. However, search engines often rely heavily on the "searching keywords" users input, and the search results are being presented in a long list that does not allow the user to quickly spot the correct answer. However, something like a question answering system appears to solve this problem. A question answering system allows the user to input natural language sentences and then provides the correct or most relative answer. Currently the method in use to train a question answering model mainly relies on the structured dataset. There are many studies exploring how to use network resources and other texts to extend the dataset. Yet most of the datasets collected or the Q&A websites are in a single language. This study aims to explore how to make use of the Q&A dataset in different languages. This study explores the impact of different Chinese data processing on Q&A data (including punctuation processing, word segmentation processing, “unknown word” processing, Word2Vec training), by using deep learning to train the Q&A model to evaluate the performances of various language processing steps. The experiment results show that the strategy of not doing manual language processing can obtain the best performance.

目次 Table of Contents
誌謝 ii 摘要 iii Abstract iv 一、緒論 1 1.1 研究背景 1 1.2 研究動機與目 2 1.3 研究流程 3 二、文獻探討 4 2.1 問答系統 4 2.2 問答資料集擴充方式 4 2.3 中文斷詞 7 2.4 Word Embedding 8 2.5 Word2Vec 9 2.6 深度學習 11 三、研究方法 14 3.1 問答資料集與翻譯 15 3.2 語言轉換遭遇問題 16 3.3 語言轉換後的自然語言資料處理 16 3.4 Word2Vec預先訓練模型 22 3.5 ConvolutionLSTM問答模型 24 3.6 資料覆蓋度探討 27 四、研究結果 28 4.1 英文問答資料集與語言轉換後之自然語言資料處理 28 4.2 向量相似度計算和Word2Vec參數調整結果 31 4.3 自訂辭典、標點符號與Unknown處理結果 33 4.4 不同斷詞演算法結果 37 4.5 中文維基Pre-Trained Model結果 38 4.6 不同問答模型結果 40 4.7 資料覆蓋度探討結果 43 五、結論與未來研究 45 5.1 結論 45 5.2 未來研究 46 六、參考文獻 47

參考文獻 References
期刊論文 Mishra, A., & Jain, S. K. (2016). A survey on question answering systems with classification. Journal of King Saud University-Computer and Information Sciences, 28(3), 345-361. Green Jr, B. F., Wolf, A. K., Chomsky, C., & Laughery, K. (1961, May). Baseball: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference (pp. 219-224). ACM. Woods, W. A., & Kaplan, R. (1977). Lunar rocks in natural English: Explorations in natural language question answering. Linguistic structures processing, 5, 521-569. Bouziane, A., Bouchiha, D., Doumi, N., & Malki, M. (2015). Question answering systems: survey and trends. Procedia Computer Science, 73, 366-375. Chen, D., Fisch, A., Weston, J., & Bordes, A. (2017). Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543). Tang, D., Wei, F., Yang, N., Zhou, M., Liu, T., & Qin, B. (2014). Learning sentiment-specific word embedding for twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 1555-1565). Jeon, J., Croft, W. B., & Lee, J. H. (2005, October). Finding similar questions in large question and answer archives. In Proceedings of the 14th ACM international conference on Information and knowledge management (pp. 84-90). ACM. Cong, G., Wang, L., Lin, C. Y., Song, Y. I., & Sun, Y. (2008, July). Finding question-answer pairs from online forums. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval (pp. 467-474). ACM. Shrestha, L., & McKeown, K. (2004, August). Detection of question-answer pairs in email conversations. In Proceedings of the 20th international conference on Computational Linguistics(p. 889). Association for Computational Linguistics. Song, L., & Zhao, L. (2016). Question Generation from a Knowledge Base with Web Exploration. arXiv preprint arXiv:1610.03807. Chen, K. J., & Ma, W. Y. (2002, August). Unknown word extraction for Chinese documents. In Proceedings of the 19th international conference on Computational linguistics-Volume 1(pp. 1-7). Association for Computational Linguistics. Zhu, Y., Zhou, W., Xu, Y., Liu, J., & Tan, Y. (2017). Intelligent Learning for Knowledge Graph towards Geological Data. Scientific Programming, 2017. Hori, T., Wang, H., Hori, C., Watanabe, S., Harsham, B., Le Roux, J., ... & Aikawa, T. (2016, December). Dialog state tracking with attention-based sequence-to-sequence learning. In Spoken Language Technology Workshop (SLT), 2016 IEEE (pp. 552-558). IEEE. LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. nature, 521(7553), 436. Williams, R. J., & Zipser, D. (1995). Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation: Theory, architectures, and applications, 1, 433-486. Werbos, P. J. (1988). Generalization of backpropagation with application to a recurrent gas market model. Neural networks, 1(4), 339-356. Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780. Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2), 157-166. Gers, F. A., Schmidhuber, J., & Cummins, F. (1999). Learning to forget: Continual prediction with LSTM. Jozefowicz, R., Zaremba, W., & Sutskever, I. (2015, June). An empirical exploration of recurrent network architectures. In International Conference on Machine Learning (pp. 2342-2350). Feng, M., Xiang, B., Glass, M. R., Wang, L., & Zhou, B. (2015, December). Applying deep learning to answer selection: A study and an open task. In Automatic Speech Recognition and Understanding (ASRU), 2015 IEEE Workshop on (pp. 813-820). IEEE. Ajmera, J., Joshi, S., Verma, A., & Mittal, A. (2014, March). Automatic generation of question answer pairs from noisy case logs. In Data Engineering (ICDE), 2014 IEEE 30th International Conference on (pp. 436-447). IEEE. Zhou, Q., Yang, N., Wei, F., Tan, C., Bao, H., & Zhou, M. (2017, November). Neural Question Generation from Text: A Preliminary Study. In National CCF Conference on Natural Language Processing and Chinese Computing (pp. 662-671). Springer, Cham. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug), 2493-2537. Goodfellow, I., Bengio, Y., Courville, A., & Bengio, Y. (2016). Deep learning (Vol. 1). Cambridge: MIT press. Lu, A., Wang, W., Bansal, M., Gimpel, K., & Livescu, K. (2015). Deep multilingual correlation for improved word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 250-256). Faruqui, M., Tsvetkov, Y., Rastogi, P., & Dyer, C. (2016). Problems with evaluation of word embeddings using word similarity tasks. arXiv preprint arXiv:1605.02276. Tsvetkov, Y., Faruqui, M., Ling, W., Lample, G., & Dyer, C. (2015). Evaluation of word vector representations by subspace alignment. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (pp. 2049-2054).

電子全文 Fulltext
本電子全文僅授權使用者為學術研究之目的，進行個人非營利性質之檢索、閱讀、列印。請遵守中華民國著作權法之相關規定，切勿任意重製、散佈、改作、轉貼、播送，以免觸法。論文使用權限 Thesis access permission：自定論文開放時間 user define 開放時間 Available：校內 Campus：已公開 available 校外 Off-campus：已公開 available etd-0616118-001514.pdf
紙本論文 Printed copies
紙本論文的公開資訊在102學年度以後相對較為完整。如果需要查詢101學年度以前的紙本論文公開資訊，請聯繫圖資處紙本論文服務櫃台。如有不便之處敬請見諒。開放時間 available 已公開 available

QR Code

國立中山大學圖書與資訊處 │ 諮詢服務：2453 論文審查小組 │ 服務信箱 │ 系統開發維運：圖資處知識創新組

Office of Library and Information Services, National Sun Yat-sen University │ Contact Us : 2453 Thesis Format Review Team , Mail │ Development and operations : Knowledge Innovation Division, LIS