National Sun Yat-sen University, thesis/dissertation, The Research of Constructing Domain-Specific Chinese Sentiment Lexicon

Title	The Research of Constructing Domain-Specific Chinese Sentiment Lexicon (自動建置領域中文情緒詞典之研究)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 106	Language	English
Degree	Master	Number of pages	72
Author	Chia-Chen Chang (張嘉真)
Advisor	San-Yih Hwang (黃三益)
Convenor	Chih-Ping Wei (魏志平)
Advisory Committee	Wen-Chun Ni (倪文君)
Date of Exam	2018-07-23	Date of Submission	2018-09-03
Keywords	text mining, sentiment analysis, Chinese sentiment lexicon, word embedding, label propagation
Statistics	This thesis has been browsed 6217 times and downloaded 84 times.

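Abstract (translated from the Chinese)
With the rise of social media, users generate large volumes of text, such as tweets, blogs, and reviews. These texts carry latent sentiment, and sentiment analysis lets us recover people's feelings and opinions. In recent years, sentiment analysis has often relied on sentiment lexicons as its main tool. Because domains are diverse and each carries its own prior knowledge, domain-specific sentiment lexicons play an important role in sentiment analysis. Chinese sentiment lexicon resources are still scarce and mostly domain-independent, so we build a domain-specific sentiment lexicon to support sentiment analysis and make its results more accurate. In this study, we analyze 1,294,141 hotel reviews from Booking.com, use a vector space model to obtain the semantic relationships between words, and predict the sentiment scores of words. We combine the two and then apply label propagation to automatically construct a Chinese sentiment lexicon for the hotel domain. The proposed method achieves 83% precision.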
Abstract
With the boom in social media, users generate large amounts of text, such as tweets, blogs, and comments, that carry latent sentiment. Sentiment analysis aims to extract people's feelings and opinions from such textual data. The most popular approach to sentiment analysis is to consult a sentiment lexicon. However, because domains are diverse and each carries its own prior knowledge, a domain-specific sentiment lexicon plays an important role in sentiment analysis. Chinese sentiment lexicon resources, compared with their English counterparts, are still limited and mostly general-purpose. This research therefore proposes techniques for constructing a domain-specific sentiment lexicon in order to obtain more accurate sentiment analysis. In this thesis, we analyze 1,294,141 hotel reviews crawled from Booking.com, using a vector space model to obtain the semantic relationships between words and predicting the sentiment scores of the words. Finally, we combine the context and sentiment information with a label propagation method to automatically construct a domain-specific sentiment lexicon for the hotel domain. The proposed method achieves 83% precision.
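The label propagation step named in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the six words, their 2-dimensional vectors, the seed choice, and the function name `propagate_labels` are all hypothetical, and the graph here uses cosine similarity of word vectors only, whereas the thesis also combines predicted sentiment scores. The sketch assumes a generic clamped label propagation in which seed words keep their polarity and every other word repeatedly takes the similarity-weighted average of its neighbours' scores.

```python
import numpy as np

def propagate_labels(vectors, seed_labels, n_iter=50):
    """Spread seed polarities (+1.0 / -1.0) over a cosine-similarity graph.

    vectors: (n_words, dim) array of word vectors (toy values here).
    seed_labels: dict mapping word index -> seed polarity.
    Returns an array of scores in [-1, 1], one per word.
    """
    # Cosine similarity between every pair of words.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    weights = np.clip(unit @ unit.T, 0.0, None)  # drop negative similarities
    np.fill_diagonal(weights, 0.0)               # no self-loops

    # Row-normalise so each row is a weighted average over neighbours.
    row_sums = weights.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0.0] = 1.0              # guard isolated words
    transition = weights / row_sums

    seed_idx = np.array(list(seed_labels))
    seed_val = np.array([seed_labels[i] for i in seed_idx], dtype=float)

    scores = np.zeros(len(vectors))
    scores[seed_idx] = seed_val
    for _ in range(n_iter):
        scores = transition @ scores
        scores[seed_idx] = seed_val              # clamp seeds every round
    return scores

# Hypothetical toy data: six hotel-review words with made-up 2-D vectors.
words = ["clean", "comfortable", "spotless", "dirty", "noisy", "filthy"]
vectors = np.array([
    [1.00, 0.10],
    [0.90, 0.20],
    [0.95, 0.05],
    [0.10, 1.00],
    [0.20, 0.90],
    [0.05, 0.95],
])
seeds = {words.index("clean"): 1.0, words.index("dirty"): -1.0}

scores = propagate_labels(vectors, seeds)
for word, score in zip(words, scores):
    print(f"{word}\t{score:+.3f}")
```

Words whose final score is positive would be candidates for the positive side of the lexicon and words with a negative score for the negative side. Any threshold for accepting a word is a further design choice that this sketch leaves open.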

Table of Contents
Thesis Approval Form
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
  1.1 Research Background
  1.2 Research Problem
  1.3 Research Motivation
  1.4 Research Purpose
  1.5 Thesis Organization
Chapter 2 Literature Review
  2.1 Lexicon-based Sentiment Analysis
  2.2 Adding Sentiment Information to Word
  2.3 Expanding Sentiment Words Automatically
Chapter 3 Our Approach
  3.1 Overall Process
  3.2 Data Collection
  3.3 Data Preprocessing
    3.3.1 Data Cleaning
    3.3.2 Segmentation, tokenization and Part-of-Speech Tagging
  3.4 Generating Word Representations
  3.5 Building Sentiment Prediction Model
  3.6 Label Propagation
    3.6.1 Label Propagation Algorithm
    3.6.2 Label Propagation in batches
    3.6.3 Seed Selection
Chapter 4 Evaluation
  4.1 Dataset Construction
  4.2 Parameter selection in our approach
  4.3 Comparing with Other Methods
    4.3.1 comparing methods without label propagation
    4.3.2 comparing methods with label propagation
  4.4 Uniqueness of our domain-specific sentiment lexicon
  4.5 Short discussion in opposite polarity problem
Chapter 5 Conclusion
References
Appendix – Chinese Sentiment Lexicon Extracted from Booking.com
  Positive words
  Negative words


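Fulltext
This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China and do not reproduce, distribute, adapt, repost, or broadcast it.
Thesis access permission: user-defined release time
Available: Campus: available; Off-campus: available
File: etd-0721118-153855.pdf
Printed copies
Public information on printed copies is relatively complete from Academic Year 102 onward. For information on printed copies from Academic Year 101 or earlier, please contact the printed thesis service counter of the Office of Library and Information Services.
Available: available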
