Thesis/Dissertation etd-0624118-155432 — Detailed Record



Name: Tou-Hsiang Hsu (徐透祥)    E-mail: not disclosed
Department: Information Management (資訊管理學系研究所)
Degree: Master    Graduation: Academic Year 106, 2nd Semester (Spring 2018)
Title (Chinese): 跨語言主題模型分析之研究
Title (English): A Research On Cross-Lingual Topic Analysis
Files
  • etd-0624118-155432.pdf
  • This electronic full text is licensed only for personal, non-profit retrieval, reading,
    and printing for the purpose of academic research. Please observe the Copyright Act of
    the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it.
    Access rights

    Print copy: publicly available immediately

    Electronic copy: fully open access on and off campus

    Language/Pages: English / 63
    Statistics: this thesis has been viewed 5,367 times and downloaded 436 times
    Abstract (Chinese) Most previous research on cross-lingual topics has been built on multilingual corpora, with the polylingual topic model published by Mimno being the most representative. However, such cross-lingual topic models are all constrained by the corpus structure: their performance degrades as the proportion of corresponding articles across languages in the corpus decreases. Comparable multilingual resources such as the European Parliament proceedings or Hong Kong government announcements, where the same content has corresponding versions in several languages, are difficult to obtain, and their article types and quantity are scarce relative to ordinary text. As for extracting the topics of each language, using machine translation or human translators is both time-consuming and costly, and domain-specific terminology further affects translation accuracy.
    People in different regions do not all discuss the same topics, yet previous polylingual topic model research could extract only the topics shared across languages. The method proposed in this thesis constructs a cross-lingual topic model using three ways of mapping word vector spaces across languages. It requires no multilingual parallel corpus, breaking the limitation of previous polylingual topic models; its performance on multilingual topics is comparable to Mimno's polylingual topic model, and it can also effectively extract topics discussed in only a single language.
    Abstract (English) Most cross-lingual topic models in previous work rely on a parallel or comparable corpus. The polylingual topic model (PLTM) proposed by Mimno et al. (2009) is the most representative. However, parallel or comparable corpora such as Europarl and Wikipedia are not available in many cases. In this thesis, we propose a method that combines mapping word vector spaces between languages with topic modeling (LDA). The cross-lingual word vector mapping aligns the vector spaces of different languages, and LDA groups words into topics; combining the two techniques yields the cross-lingual topic model.
    In contrast to PLTM, our approach needs no comparable or parallel corpus to construct the cross-lingual topic model, and it can identify topics discussed in only a single language.
    We compare the performance of PLTM and our approach on the UM-Corpus (Tian et al., 2014), an English-Chinese bilingual corpus. The evaluation results show that our approach aligns topics across languages properly, with performance comparable to PLTM.
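The mapping step can be made concrete. Chapter 3 lists three alignment methods (least-squares projection, CCA, and an SVD-based orthogonal transformation); the first and last can be sketched in a few lines of NumPy. This is a minimal illustration fit on a seed dictionary of translation pairs, not the thesis's actual implementation:

```python
import numpy as np

def least_squares_map(X_src, X_tgt):
    """Linear projection W minimizing ||X_src @ W - X_tgt||_F over a seed
    dictionary: row i of X_src and X_tgt holds the vectors of one translation pair."""
    W, *_ = np.linalg.lstsq(X_src, X_tgt, rcond=None)
    return W

def orthogonal_map(X_src, X_tgt):
    """Same objective, but W constrained to be orthogonal (Procrustes):
    with M = X_src^T X_tgt = U S V^T, the minimizer is W = U V^T."""
    U, _, Vt = np.linalg.svd(X_src.T @ X_tgt)
    return U @ Vt
```

After fitting W on the seed pairs, any source-language word vector v maps into the target space as `v @ W`, so topics can be modeled over a shared space. The orthogonal variant additionally preserves cosine similarities within the source vocabulary, the motivation given by Xing et al. (2015) and Smith et al. (2017) in the reference list below.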
    Keywords (Chinese)
  • 跨語言主題模型 (cross-lingual topic model)
  • 文字向量空間 (word vector space)
  • 多語言對應文本 (parallel corpus)
  • 多語言主題模型 (polylingual topic model)
  • 主題模型 (topic model)
    Keywords (English)
  • Cross-lingual topic model
  • Topic modeling
  • Polylingual topic model
  • Parallel corpus
  • word vector space
  • LDA
    Table of Contents
    論文審定書 (Thesis Approval Form) i
    摘要 (Abstract in Chinese) ii
    Abstract iii
    CHAPTER 1 – Introduction 1
    CHAPTER 2 – Related Work 7
    2.1 Cross-lingual Topic Model 7
    2.2 Cross-lingual Word Representation 9
    2.3 Topic Model with Word Representation 12
    CHAPTER 3 – Our Approach 15
    3.1 Word representation 16
    3.2 Word vector mapping method 16
    3.2.1 Linear Projection by Least Squares 17
    3.2.2 Linear Projection with CCA 18
    3.2.3 Orthogonal Transformations by SVD 19
    3.3 Cross-Lingual Topic Model (CLTM) 20
    CHAPTER 4 – Experiments 23
    4.1 Data Collection 23
    4.2 Word representation for each language 24
    4.3 Mapping word vectors across languages 25
    4.4 Topic number setting 28
    4.5 Cross-lingual Topic Model (CLTM) 32
    4.6 Experiment Design 35
    4.7 Experimental Result 36
    4.7.1 Entropy of each topic model 36
    4.7.2 Jensen Shannon Divergence (JSD) of document topic distribution 38
    4.7.3 Word coherence of topic 40
    CHAPTER 5 – Conclusion 45
    5.1 Future work 45
    Reference 47
    Appendix 53
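Chapter 4 evaluates the models with entropy, Jensen-Shannon divergence (JSD) between document-topic distributions, and topic word coherence. As a reference point, JSD (Lin, 1991, in the reference list below) can be sketched as follows; this is an illustrative definition, not the thesis's evaluation code:

```python
import numpy as np

def jsd(p, q, base=2.0):
    """Jensen-Shannon divergence JSD(P, Q) = H(M) - (H(P) + H(Q)) / 2,
    where M = (P + Q) / 2 and H is Shannon entropy (Lin, 1991).
    Symmetric and, in base 2, bounded in [0, 1]."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)

    def entropy(x):
        x = x[x > 0]  # 0 * log 0 is taken as 0
        return -np.sum(x * np.log(x)) / np.log(base)

    return entropy(m) - 0.5 * (entropy(p) + entropy(q))
```

Applied to the topic distributions inferred for the two sides of a bilingual document pair, a low JSD indicates the model assigns aligned topics, while completely disjoint distributions reach the maximum value of 1 in base 2.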
    References
    1. Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C., & Smith, N. A. (2016). Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
    2. Banerjee, A., Dhillon, I. S., Ghosh, J., & Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(Sep), 1345-1382.
    3. Batmanghelich, K., Saeedi, A., Narasimhan, K., & Gershman, S. (2016). Nonparametric spherical topic modeling with word embeddings. arXiv preprint arXiv:1604.00126.
4. Bengio, Y., Schwenk, H., et al. (2006). Neural probabilistic language models. In Innovations in Machine Learning (pp. 137–186). Springer.
5. Blei, D. M., & Jordan, M. I. (2003). Modeling annotated data. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 127–134. https://doi.org/10.1145/860435.860460
    6. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022.
    7. Boyd-Graber, J., & Blei, D. M. (2009, June). Multilingual topic models for unaligned text. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (pp. 75-82). AUAI Press.
    8. Collobert, R., & Weston, J. (2008, July). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning (pp. 160-167). ACM.
    9. Das, R., Zaheer, M., & Dyer, C. (2015). Gaussian lda for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Vol. 1, pp. 795-804).
10. Dhillon, I. S., & Sra, S. (2003). Modeling data using directional distributions. Technical Report TR-03-06, Department of Computer Sciences, The University of Texas at Austin. URL ftp://ftp.cs.utexas.edu/pub/techreports/tr03-06.ps.gz
11. Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD (Vol. 96, No. 34, pp. 226–231).
    12. Faruqui, M., & Dyer, C. (2014). Improving vector space word representations using multilingual correlation. Association for Computational Linguistics.
    13. Haghighi, A., Liang, P., Berg-Kirkpatrick, T., & Klein, D. (2008). Learning bilingual lexicons from monolingual corpora. Proceedings of ACL-08: Hlt, 771-779.
14. Hassan, S., & Mihalcea, R. (2011, August). Semantic relatedness using salient semantic analysis. In AAAI.
    15. Jarmasz, M. (2012). Roget's thesaurus as a lexical resource for natural language processing. arXiv preprint arXiv:1204.0140.
16. Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
    17. Klementiev, A., Titov, I., & Bhattarai, B. (2012). Inducing crosslingual distributed representations of words. Proceedings of COLING 2012, 1459-1474.
    18. Koehn, P. (2005, September). Europarl: A parallel corpus for statistical machine translation. In MT summit (Vol. 5, pp. 79-86).
    19. Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information theory, 37(1), 145-151.
    20. Liu, X., Duh, K., & Matsumoto, Y. (2015). Multilingual Topic Models for Bilingual Dictionary Extraction. ACM Transactions on Asian and Low-Resource Language Information Processing, 14(3), 11.
    22. Lu, A., Wang, W., Bansal, M., Gimpel, K., & Livescu, K. (2015). Deep multilingual correlation for improved word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 250-256).
    23. Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), 2579-2605.
24. Mann, G. S., Mimno, D., & McCallum, A. (2006). Bibliometric impact measures leveraging topic analysis. ACM/IEEE-CS Joint Conference on Digital Libraries, 65–74. https://doi.org/10.1145/1141753.1141765
25. Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient estimation of word representations in vector space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 1–12. arXiv preprint arXiv:1301.3781.
    26. Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168
    27. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
    28. Mimno, D., Wallach, H. M., Naradowsky, J., Smith, D. A., & McCallum, A. (2009, August). Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2 (pp. 880-889). Association for Computational Linguistics.
    29. Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011, July). Optimizing semantic coherence in topic models. In Proceedings of the conference on empirical methods in natural language processing (pp. 262-272). Association for Computational Linguistics.
    30. Moody, C. E. (2016). Mixing dirichlet topic models and word embeddings to make lda2vec. arXiv preprint arXiv:1605.02019.
    31. Ni, X., Sun, J. T., Hu, J., & Chen, Z. (2009, April). Mining multilingual topics from wikipedia. In Proceedings of the 18th international conference on World wide web (pp. 1155-1156). ACM.
    32. Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).
33. Prettenhofer, P., & Stein, B. (2010). Cross-language text classification using structural correspondence learning. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), 1118–1127.
    34. Röder, M., Both, A., & Hinneburg, A. (2015, February). Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining (pp. 399-408). ACM.
    35. Smith, S. L., Turban, D. H., Hamblin, S., & Hammerla, N. Y. (2017). Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
    36. Strube, M., & Ponzetto, S. P. (2006, July). WikiRelate! Computing semantic relatedness using Wikipedia. In AAAI (Vol. 6, pp. 1419-1424).
37. Tam, Y.-C., & Schultz, T. (2007). Bilingual LSA-based translation lexicon adaptation for spoken language translation. Interspeech 2007, 2461–2464. Retrieved from http://csl.anthropomatik.kit.edu/downloads/Tam_IS07_LSABasedTranslationLexicon.pdf
    38. Tian, L., Wong, D. F., Chao, L. S., Quaresma, P., Oliveira, F., & Yi, L. (2014). UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation. In LREC (pp. 1837-1842).
    39. Titov, I., & McDonald, R. (2008, April). Modeling online reviews with multi-grain topic models. In Proceedings of the 17th international conference on World Wide Web (pp. 111-120). ACM.
40. Wei, X., & Croft, W. B. (2006). LDA-based document models for ad-hoc retrieval. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), 178. https://doi.org/10.1145/1148170.1148204
    41. Xiao, M., & Guo, Y. (2013). Semi-Supervised Representation Learning for Cross-Lingual Text Classification. In EMNLP (pp. 1465-1475).
    42. Xing, C., Wang, D., Liu, C., & Lin, Y. (2015). Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1006-1011).
43. Zhao, B., & Xing, E. P. (2006). BiTAM: Bilingual topic admixture models for word alignment. Proceedings of the COLING/ACL on Main Conference Poster Sessions, 969–976.
44. Zhou, G., He, T., & Zhao, J. (2014). Bridging the language gap: Learning distributed semantics for cross-lingual sentiment classification. NLPCC 2014, 138–149. Retrieved from http://link.springer.com/chapter/10.1007/978-3-662-45924-9_13
45. Zhou, H., Chen, L., Shi, F., & Huang, D. (2015). Learning bilingual sentiment word embeddings for cross-language sentiment classification. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 430–440. Retrieved from http://www.aclweb.org/anthology/P15-1042
46. Zhou, X., Wan, X., & Xiao, J. (2016). Cross-lingual sentiment classification with bilingual document representation learning. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), 1403–1412.
    47. Zou, W. Y., Socher, R., Cer, D., & Manning, C. D. (2013). Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1393-1398).
    Committee
  • 魏志平 - Convener
  • 倪文君 - Member
  • 黃三益 - Advisor
    Oral Defense Date: 2018-07-23    Submission Date: 2018-07-24
