A Research On Cross-Lingual Topic Analysis
Cross-lingual topic model, Topic modeling, Polylingual topic model, Parallel corpus, word vector space, LDA
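Previous research on cross-lingual topics has mostly been built on multilingual corpora, of which the Polylingual Topic Model published by Mimno is the most representative. However, cross-lingual topic models of this kind are all constrained by the structure of the corpus: their performance degrades as the proportion of documents with counterparts in the other language decreases. Aligned multilingual documents, such as the European Parliament proceedings or Hong Kong government announcements, in which the same content is available in several languages, are not easy to obtain, and their variety and quantity are small compared with ordinary documents. As for extracting the topics of each language, relying on machine translation or human translators is both time-consuming and costly, and domain-specific vocabulary also affects translation accuracy.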
Most cross-lingual topic models in previous work rely on parallel or comparable corpora. The polylingual topic model (PLTM) proposed by Mimno et al. (2009) is the most representative one. However, parallel or comparable corpora such as Europarl and Wikipedia are not available in many cases. In this thesis, we propose a method that combines cross-lingual word vector space mapping with topic modeling (LDA). The cross-lingual mapping aligns the word vector spaces of the two languages, and LDA groups words into topics. Combining the two techniques yields a cross-lingual topic model.
In contrast to PLTM, our proposed approach does not need a comparable or parallel corpus to construct the cross-lingual topic model, and it can identify topics that are discussed in only one of the languages.
We compare the performance of PLTM and our approach on UM-Corpus (Tian et al., 2014), an English-Chinese bilingual corpus. The evaluation results show that our proposed approach can align topics across languages properly and that its performance is comparable to that of PLTM.
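As an illustration of the word vector mapping step, the following is a minimal sketch of one mapping method named in the outline, the orthogonal transformation obtained by SVD. It is not the thesis implementation: the function name `fit_orthogonal_map`, the toy dimensions, and the synthetic vectors are assumptions made for this example, and real use would take the rows from a bilingual seed dictionary.

```python
import numpy as np


def fit_orthogonal_map(src, tgt):
    """Return the orthogonal matrix W minimizing ||src @ W - tgt||_F.

    src, tgt: arrays of shape (n_pairs, dim); row i of each holds the
    vectors of one translation pair (hypothetical seed dictionary).
    """
    # Orthogonal Procrustes solution: W = U V^T, where U S V^T = src^T tgt.
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt


# Synthetic data (assumption): the source space is an exact rotation
# of the target space, so a perfect orthogonal mapping exists.
rng = np.random.default_rng(0)
dim = 5
tgt_vecs = rng.normal(size=(50, dim))
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
src_vecs = tgt_vecs @ rotation.T

w = fit_orthogonal_map(src_vecs, tgt_vecs)
mapped = src_vecs @ w  # source vectors expressed in the target space
```

Because the learned matrix is orthogonal, distances and angles within the source space are preserved after mapping, which is one reason this family of mappings is commonly used for aligning embedding spaces.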
Table of Contents

Thesis Approval Form
Abstract (Chinese)
Abstract

CHAPTER 1 – Introduction
CHAPTER 2 – Related Work
2.1 Cross-lingual Topic Model
2.2 Cross-lingual Word Representation
2.3 Topic Model with Word Representation
CHAPTER 3 – Our Approach
3.1 Word representation
3.2 Word vector mapping method
3.2.1 Linear Projection by Least Squares
3.2.2 Linear Projection with CCA
3.2.3 Orthogonal Transformations by SVD
3.3 Cross-Lingual Topic Model (CLTM)
CHAPTER 4 – Experiments
4.1 Data Collection
4.2 Word representation for each language
4.3 Mapping word vectors across languages
4.4 Topic number setting
4.5 Cross-lingual Topic Model (CLTM)
4.6 Experiment Design
4.7 Experimental Result
4.7.1 Entropy of each topic model
4.7.2 Jensen-Shannon Divergence (JSD) of document topic distribution
4.7.3 Word coherence of topic
CHAPTER 5 – Conclusion
5.1 Future work
References
Appendix
