Thesis/Dissertation etd-0624118-155432 — Detailed Record



Name: Tou-Hsiang Hsu (徐透祥)    E-mail: not disclosed
Department: Information Management (資訊管理學系研究所)
Degree: Master    Graduation: Academic Year 106, 2nd Semester (Spring 2018)
Title (Chinese): 跨語言主題模型分析之研究
Title (English): A Research On Cross-Lingual Topic Analysis
Files
  • etd-0624118-155432.pdf
  • This electronic full text is licensed only for personal, non-profit retrieval, reading,
    and printing for the purpose of academic research. Please observe the Copyright Act of
    the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it.
    Access rights

    Print copy: publicly available immediately

    Electronic copy: fully open access on and off campus

    Language/Pages: English / 63
    Statistics: this thesis has been viewed 5,367 times and downloaded 436 times
    Abstract (Chinese) Most previous research on cross-lingual topics has been built on multilingual corpora, with the polylingual topic model published by Mimno being the most representative. However, such cross-lingual topic models are all constrained by the corpus structure: their performance degrades as the proportion of corresponding articles across languages in the corpus decreases. Comparable multilingual resources such as the European Parliament proceedings or Hong Kong government announcements, where the same content has corresponding versions in several languages, are difficult to obtain, and their article types and quantity are scarce relative to ordinary text. As for extracting the topics of each language, using machine translation or human translators is both time-consuming and costly, and domain-specific terminology further affects translation accuracy.
    People in different regions do not all discuss the same topics, yet previous polylingual topic model research could extract only the topics shared across languages. The method proposed in this thesis constructs a cross-lingual topic model using three ways of mapping word vector spaces across languages. It requires no multilingual parallel corpus, breaking the limitation of previous polylingual topic models; its performance on multilingual topics is comparable to Mimno's polylingual topic model, and it can also effectively extract topics discussed in only a single language.
    Abstract (English) Most cross-lingual topic models in previous work rely on a parallel or comparable corpus. The polylingual topic model (PLTM) proposed by Mimno et al. (2009) is the most representative. However, parallel or comparable corpora such as Europarl and Wikipedia are not available in many cases. In this thesis, we propose a method that combines mapping word vector spaces between languages with topic modeling (LDA). The cross-lingual word vector mapping aligns the vector spaces of different languages, and LDA groups words into topics; combining the two techniques yields the cross-lingual topic model.
    In contrast to PLTM, our approach needs no comparable or parallel corpus to construct the cross-lingual topic model, and it can identify topics discussed in only a single language.
    We compare the performance of PLTM and our approach on the UM-Corpus (Tian et al., 2014), an English-Chinese bilingual corpus. The evaluation results show that our approach aligns topics across languages properly, with performance comparable to PLTM.
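The mapping step can be made concrete. Chapter 3 lists three alignment methods (least-squares projection, CCA, and an SVD-based orthogonal transformation); the first and last can be sketched in a few lines of NumPy. This is a minimal illustration fit on a seed dictionary of translation pairs, not the thesis's actual implementation:

```python
import numpy as np

def least_squares_map(X_src, X_tgt):
    """Linear projection W minimizing ||X_src @ W - X_tgt||_F over a seed
    dictionary: row i of X_src and X_tgt holds the vectors of one translation pair."""
    W, *_ = np.linalg.lstsq(X_src, X_tgt, rcond=None)
    return W

def orthogonal_map(X_src, X_tgt):
    """Same objective, but W constrained to be orthogonal (Procrustes):
    with M = X_src^T X_tgt = U S V^T, the minimizer is W = U V^T."""
    U, _, Vt = np.linalg.svd(X_src.T @ X_tgt)
    return U @ Vt
```

After fitting W on the seed pairs, any source-language word vector v maps into the target space as `v @ W`, so topics can be modeled over a shared space. The orthogonal variant additionally preserves cosine similarities within the source vocabulary, the motivation given by Xing et al. (2015) and Smith et al. (2017) in the reference list below.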
    Keywords (Chinese)
  • 跨語言主題模型 (cross-lingual topic model)
  • 文字向量空間 (word vector space)
  • 多語言對應文本 (parallel corpus)
  • 多語言主題模型 (polylingual topic model)
  • 主題模型 (topic model)
    Keywords (English)
  • Cross-lingual topic model
  • Topic modeling
  • Polylingual topic model
  • Parallel corpus
  • word vector space
  • LDA
    Table of Contents
    論文審定書 (Thesis Approval Form) i
    摘要 (Abstract in Chinese) ii
    Abstract iii
    CHAPTER 1 – Introduction 1
    CHAPTER 2 – Related Work 7
    2.1 Cross-lingual Topic Model 7
    2.2 Cross-lingual Word Representation 9
    2.3 Topic Model with Word Representation 12
    CHAPTER 3 – Our Approach 15
    3.1 Word representation 16
    3.2 Word vector mapping method 16
    3.2.1 Linear Projection by Least Squares 17
    3.2.2 Linear Projection with CCA 18
    3.2.3 Orthogonal Transformations by SVD 19
    3.3 Cross-Lingual Topic Model (CLTM) 20
    CHAPTER 4 – Experiments 23
    4.1 Data Collection 23
    4.2 Word representation for each language 24
    4.3 Mapping word vectors across languages 25
    4.4 Topic number setting 28
    4.5 Cross-lingual Topic Model (CLTM) 32
    4.6 Experiment Design 35
    4.7 Experimental Result 36
    4.7.1 Entropy of each topic model 36
    4.7.2 Jensen Shannon Divergence (JSD) of document topic distribution 38
    4.7.3 Word coherence of topic 40
    CHAPTER 5 – Conclusion 45
    5.1 Future work 45
    Reference 47
    Appendix 53
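Chapter 4 evaluates the models with entropy, Jensen-Shannon divergence (JSD) between document-topic distributions, and topic word coherence. As a reference point, JSD (Lin, 1991, in the reference list below) can be sketched as follows; this is an illustrative definition, not the thesis's evaluation code:

```python
import numpy as np

def jsd(p, q, base=2.0):
    """Jensen-Shannon divergence JSD(P, Q) = H(M) - (H(P) + H(Q)) / 2,
    where M = (P + Q) / 2 and H is Shannon entropy (Lin, 1991).
    Symmetric and, in base 2, bounded in [0, 1]."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)

    def entropy(x):
        x = x[x > 0]  # 0 * log 0 is taken as 0
        return -np.sum(x * np.log(x)) / np.log(base)

    return entropy(m) - 0.5 * (entropy(p) + entropy(q))
```

Applied to the topic distributions inferred for the two sides of a bilingual document pair, a low JSD indicates the model assigns aligned topics, while completely disjoint distributions reach the maximum value of 1 in base 2.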
    References
    1. Ammar, W., Mulcaire, G., Tsvetkov, Y., Lample, G., Dyer, C., & Smith, N. A. (2016). Massively multilingual word embeddings. arXiv preprint arXiv:1602.01925.
    2. Banerjee, A., Dhillon, I. S., Ghosh, J., & Sra, S. (2005). Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research, 6(Sep), 1345-1382.
    3. Batmanghelich, K., Saeedi, A., Narasimhan, K., & Gershman, S. (2016). Nonparametric spherical topic modeling with word embeddings. arXiv preprint arXiv:1604.00126.
4. Bengio, Y., Schwenk, H., et al. (2006). Neural probabilistic language models. In Innovations in Machine Learning (pp. 137–186). Springer.
5. Blei, D. M., & Jordan, M. I. (2003). Modeling annotated data. Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 127–134. https://doi.org/10.1145/860435.860460
    6. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of machine Learning research, 3(Jan), 993-1022.
    7. Boyd-Graber, J., & Blei, D. M. (2009, June). Multilingual topic models for unaligned text. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (pp. 75-82). AUAI Press.
    8. Collobert, R., & Weston, J. (2008, July). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on Machine learning (pp. 160-167). ACM.
    9. Das, R., Zaheer, M., & Dyer, C. (2015). Gaussian lda for topic models with word embeddings. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (Vol. 1, pp. 795-804).
10. Dhillon, I. S., & Sra, S. (2003). Modeling data using directional distributions. Technical Report TR-03-06, Department of Computer Sciences, The University of Texas at Austin. URL ftp://ftp.cs.utexas.edu/pub/techreports/tr03-06.ps.gz
11. Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD (Vol. 96, No. 34, pp. 226–231).
    12. Faruqui, M., & Dyer, C. (2014). Improving vector space word representations using multilingual correlation. Association for Computational Linguistics.
    13. Haghighi, A., Liang, P., Berg-Kirkpatrick, T., & Klein, D. (2008). Learning bilingual lexicons from monolingual corpora. Proceedings of ACL-08: Hlt, 771-779.
14. Hassan, S., & Mihalcea, R. (2011, August). Semantic relatedness using salient semantic analysis. In AAAI.
    15. Jarmasz, M. (2012). Roget's thesaurus as a lexical resource for natural language processing. arXiv preprint arXiv:1204.0140.
16. Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
    17. Klementiev, A., Titov, I., & Bhattarai, B. (2012). Inducing crosslingual distributed representations of words. Proceedings of COLING 2012, 1459-1474.
    18. Koehn, P. (2005, September). Europarl: A parallel corpus for statistical machine translation. In MT summit (Vol. 5, pp. 79-86).
    19. Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information theory, 37(1), 145-151.
    20. Liu, X., Duh, K., & Matsumoto, Y. (2015). Multilingual Topic Models for Bilingual Dictionary Extraction. ACM Transactions on Asian and Low-Resource Language Information Processing, 14(3), 11.
    22. Lu, A., Wang, W., Bansal, M., Gimpel, K., & Livescu, K. (2015). Deep multilingual correlation for improved word embeddings. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 250-256).
    23. Maaten, L. V. D., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(Nov), 2579-2605.
24. Mann, G. S., Mimno, D., & McCallum, A. (2006). Bibliometric impact measures leveraging topic analysis. ACM/IEEE-CS Joint Conference on Digital Libraries, 65–74. https://doi.org/10.1145/1141753.1141765
25. Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient estimation of word representations in vector space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 1–12. arXiv preprint arXiv:1301.3781.
    26. Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168
    27. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems (pp. 3111-3119).
    28. Mimno, D., Wallach, H. M., Naradowsky, J., Smith, D. A., & McCallum, A. (2009, August). Polylingual topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 2-Volume 2 (pp. 880-889). Association for Computational Linguistics.
    29. Mimno, D., Wallach, H. M., Talley, E., Leenders, M., & McCallum, A. (2011, July). Optimizing semantic coherence in topic models. In Proceedings of the conference on empirical methods in natural language processing (pp. 262-272). Association for Computational Linguistics.
    30. Moody, C. E. (2016). Mixing dirichlet topic models and word embeddings to make lda2vec. arXiv preprint arXiv:1605.02019.
    31. Ni, X., Sun, J. T., Hu, J., & Chen, Z. (2009, April). Mining multilingual topics from wikipedia. In Proceedings of the 18th international conference on World wide web (pp. 1155-1156). ACM.
    32. Pennington, J., Socher, R., & Manning, C. (2014). Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532-1543).
33. Prettenhofer, P., & Stein, B. (2010). Cross-language text classification using structural correspondence learning. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics (ACL '10), 1118–1127.
    34. Röder, M., Both, A., & Hinneburg, A. (2015, February). Exploring the space of topic coherence measures. In Proceedings of the eighth ACM international conference on Web search and data mining (pp. 399-408). ACM.
    35. Smith, S. L., Turban, D. H., Hamblin, S., & Hammerla, N. Y. (2017). Offline bilingual word vectors, orthogonal transformations and the inverted softmax. arXiv preprint arXiv:1702.03859.
    36. Strube, M., & Ponzetto, S. P. (2006, July). WikiRelate! Computing semantic relatedness using Wikipedia. In AAAI (Vol. 6, pp. 1419-1424).
37. Tam, Y.-C., & Schultz, T. (2007). Bilingual LSA-based translation lexicon adaptation for spoken language translation. Interspeech 2007, 2461–2464. Retrieved from http://csl.anthropomatik.kit.edu/downloads/Tam_IS07_LSABasedTranslationLexicon.pdf
    38. Tian, L., Wong, D. F., Chao, L. S., Quaresma, P., Oliveira, F., & Yi, L. (2014). UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation. In LREC (pp. 1837-1842).
    39. Titov, I., & McDonald, R. (2008, April). Modeling online reviews with multi-grain topic models. In Proceedings of the 17th international conference on World Wide Web (pp. 111-120). ACM.
40. Wei, X., & Croft, W. B. (2006). LDA-based document models for ad-hoc retrieval. Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '06), 178. https://doi.org/10.1145/1148170.1148204
    41. Xiao, M., & Guo, Y. (2013). Semi-Supervised Representation Learning for Cross-Lingual Text Classification. In EMNLP (pp. 1465-1475).
    42. Xing, C., Wang, D., Liu, C., & Lin, Y. (2015). Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 1006-1011).
43. Zhao, B., & Xing, E. P. (2006). BiTAM: Bilingual topic admixture models for word alignment. Proceedings of the COLING/ACL on Main Conference Poster Sessions, 969–976.
44. Zhou, G., He, T., & Zhao, J. (2014). Bridging the language gap: Learning distributed semantics for cross-lingual sentiment classification. NLPCC 2014, 138–149. Retrieved from http://link.springer.com/chapter/10.1007/978-3-662-45924-9_13
45. Zhou, H., Chen, L., Shi, F., & Huang, D. (2015). Learning bilingual sentiment word embeddings for cross-language sentiment classification. Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 430–440. Retrieved from http://www.aclweb.org/anthology/P15-1042
46. Zhou, X., Wan, X., & Xiao, J. (2016). Cross-lingual sentiment classification with bilingual document representation learning. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), 1403–1412.
    47. Zou, W. Y., Socher, R., Cer, D., & Manning, C. D. (2013). Bilingual word embeddings for phrase-based machine translation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1393-1398).
    Committee
  • 魏志平 - Convener
  • 倪文君 - Member
  • 黃三益 - Advisor
    Oral Defense Date: 2018-07-23    Submission Date: 2018-07-24
