Thesis/Dissertation Detailed Record: etd-0719120-224032
Title: Topic Evolution and Diffusion Discovery based on Online Deep Non-negative Autoencoder (基於在線式深度非負自編碼的主題演進及分散度探索)
Department:
Year, semester:
Language:
Degree:
Number of pages: 45
Author:
Advisor:
Convenor:
Advisory Committee:
Date of Exam: 2020-07-30
Date of Submission: 2020-08-19
Keywords: Network Analysis, Autoencoder, Deep Learning, Topic Diffusion, Topic Modeling, Topic Evolution
Statistics: The thesis/dissertation has been browsed 6003 times and downloaded 88 times.
Chinese Abstract
As storing and accessing data has become more and more convenient, we can easily read all kinds of content on the Internet. With such a large amount of information, it is practically impossible to read and fully understand everything, so we usually rely on categories or keyword searches to find the information we want. Because of this need for fast retrieval, most websites provide keyword search and detailed categorization; however, as the data keeps growing, continuing to categorize it manually becomes increasingly difficult, and using machine learning techniques to cluster and classify content is the emerging trend. For text data, the best-known technique is the topic model, which turns large document collections into topics by approximating the distribution of documents or by matrix factorization. Although mature topic models help us categorize documents and produce topics, topics in the real world appear and disappear as time passes. How to explain this process of topic change well is the topic-modeling problem this thesis investigates.
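To make the matrix-factorization view of topic modeling mentioned above concrete, below is a minimal sketch that factors a small TF-IDF document-term matrix with scikit-learn's NMF and prints each topic's strongest terms. The library, the toy corpus, and the number of topics are illustrative assumptions; they are not the model developed in this thesis.

# Matrix-factorization topic model baseline: factor a TF-IDF document-term
# matrix X into non-negative W (document-topic) and H (topic-term) factors,
# then read off each topic's strongest terms.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [
    "neural networks learn deep representations from data",
    "topic models discover latent themes in document collections",
    "matrix factorization decomposes a data matrix into parts",
    "autoencoders reconstruct their input through a low-dimensional code",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)         # (n_docs, n_terms) sparse matrix

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)                   # document-topic weights
H = nmf.components_                        # topic-term weights

terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(H):
    top = topic.argsort()[::-1][:5]
    print(f"topic {k}:", [terms[i] for i in top])

Each row of H plays the role of a topic's term distribution, which is the kind of object the evolution and diffusion analyses described below operate on.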
This thesis proposes a novel topic-modeling technique, called the Deep Non-negative Autoencoder, and combines it with an online model to explore how topics change over time. The corpus consists of machine learning papers. The experimental results show that the proposed method can quickly find the topics at each time point. We also propose methods based on network diagrams, heat maps, and distance computation to explain and analyze topic evolution.
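As a rough illustration of the non-negative autoencoder idea behind DNAE, the sketch below trains a single-layer autoencoder in PyTorch whose weights are projected back onto non-negative values after every update, so the decoder rows can be read as topic-term factors. The dimensions, random data, optimizer, and training schedule are assumptions made for this example; the thesis's DNAE is a deeper model with weights shared across time slices and its own training procedure.

# Single-layer non-negative autoencoder sketch: documents are encoded into
# non-negative topic activations H and decoded back to the term space, and
# both weight matrices are projected onto >= 0 after every update so the
# learned topic-term factors stay interpretable, as in NMF.
# Toy dimensions and random data; NOT the thesis's multi-layer DNAE.
import torch

torch.manual_seed(0)
n_docs, n_terms, n_topics = 20, 50, 5
X = torch.rand(n_docs, n_terms)                    # stand-in document-term matrix

We = torch.nn.Parameter(0.1 * torch.rand(n_terms, n_topics))   # encoder weights
Wd = torch.nn.Parameter(0.1 * torch.rand(n_topics, n_terms))   # decoder weights
optimizer = torch.optim.Adam([We, Wd], lr=0.01)

for step in range(2000):
    H = torch.relu(X @ We)                         # non-negative topic encoding
    loss = torch.mean((H @ Wd - X) ** 2)           # reconstruction error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():                          # projection step: keep weights >= 0
        We.clamp_(min=0.0)
        Wd.clamp_(min=0.0)

# Rows of Wd act like topic-term weights; report the strongest terms per topic.
top_terms = torch.topk(Wd, k=5, dim=1).indices
print("reconstruction MSE:", loss.item())
print("top term indices per topic:", top_terms.tolist())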
Abstract
The storage medium of books, newspapers, and magazines has shifted from tangible paper to digital documents, which means that a huge number of documents are now stored on the Internet. It is therefore infeasible to review all of this information to find what we need; we have to rely on keywords or well-defined topics. Unfortunately, these topics change over time in the real world, so correctly classifying such documents has become an increasingly important issue. Our approach aims to improve topic models that take time into account. Because inferring the posterior probability of such models is complicated, we instead use an autoencoder variant to build a topic model whose weights are shared across time points, called the Deep Non-negative Autoencoder (DNAE). The model has a multi-layer structure, and the evolution of topics within each layer is also a focus of this thesis. In addition, we use the generalized Jensen-Shannon divergence to measure topic diffusion and network diagrams to observe the evolution of topics.
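For the diffusion measure named above, here is a small sketch of the generalized Jensen-Shannon divergence computed over several term distributions of one topic, with uniform mixture weights assumed; the toy distributions and weighting are illustrative, not the experimental settings of the thesis.

import numpy as np

def shannon_entropy(p):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def generalized_jsd(distributions, weights=None):
    """JSD_pi(P_1..P_n) = H(sum_i pi_i P_i) - sum_i pi_i H(P_i)."""
    P = np.asarray(distributions, dtype=float)
    P = P / P.sum(axis=1, keepdims=True)           # normalize each distribution
    if weights is None:
        weights = np.full(len(P), 1.0 / len(P))    # uniform mixture weights
    mixture = weights @ P
    return shannon_entropy(mixture) - sum(
        w * shannon_entropy(p) for w, p in zip(weights, P)
    )

# One topic's term distribution observed at three time points.
topic_over_time = [
    [0.70, 0.20, 0.05, 0.05],
    [0.60, 0.25, 0.10, 0.05],
    [0.30, 0.30, 0.20, 0.20],
]
print(f"generalized JSD: {generalized_jsd(topic_over_time):.4f}")

A value near zero means the distributions barely differ across time, while larger values indicate that the topic's vocabulary has spread out, i.e. the topic is more diffuse.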
Table of Contents
Thesis Approval Form i
Chinese Abstract ii
Abstract iii
1. Introduction 1
2. Background and related work 2
2.1 Topic model 3
2.2 Time series topic model 4
2.3 Multi-layer topic model 6
2.4 Deep Learning 7
2.5 Online Learning 8
3. Methodology 9
3.1 Topic model based on Autoencoder 11
3.2 Online Deep Non-negative Autoencoder 13
3.3 Evaluation of topic diffusion 15
3.4 Visualization of topic evolution 16
3.5 Topic Evolution and Diffusion Discovery based on online DNAE 18
4. Experiment 19
4.1 Online topic model with DNAE 21
4.2 Topic evolution and diffusion with DNAE 22
4.3 Term evolution with DNAE 24
5. Discussion 27
6. Conclusion 29
7. References 30
Appendix A 35
Appendix B 37
References
Baldi, P. (n.d.). Autoencoders, Unsupervised Learning, and Deep Architectures. 14.
Blei, D. M. (n.d.-a). Introduction to Probabilistic Topic Models. 16.
Blei, D. M. (n.d.-b). Latent Dirichlet Allocation. 30.
Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. Proceedings of the 23rd International Conference on Machine Learning - ICML ’06, 113–120. https://doi.org/10.1145/1143844.1143859
Bourlard, H., & Kamp, Y. (1988). Auto-association by multilayer perceptrons and singular value decomposition. Biological Cybernetics, 59(4), 291–294. https://doi.org/10.1007/BF00332918
Greene, D., & Cross, J. P. (2016). Exploring the Political Agenda of the European Parliament Using a Dynamic Topic Modeling Approach. ArXiv:1607.03055 [Cs]. http://arxiv.org/abs/1607.03055
Greene, D., O’Callaghan, D., & Cunningham, P. (2014). How Many Topics? Stability Analysis for Topic Models. ArXiv:1404.4606 [Cs]. http://arxiv.org/abs/1404.4606
Griffiths, T. L., & Steyvers, M. (2004). Finding scientific topics. Proceedings of the National Academy of Sciences, 101(Supplement 1), 5228–5235. https://doi.org/10.1073/pnas.0307752101
Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., & Blei, D. M. (2004). Hierarchical Topic Models and the Nested Chinese Restaurant Process. In S. Thrun, L. K. Saul, & B. Schölkopf (Eds.), Advances in Neural Information Processing Systems 16 (pp. 17–24). MIT Press. http://papers.nips.cc/paper/2466-hierarchical-topic-models-and-the-nested-chinese-restaurant-process.pdf
Grosse, I., Bernaola-Galván, P., Carpena, P., Román-Roldán, R., Oliver, J., & Stanley, H. E. (2002). Analysis of symbolic sequences using the Jensen-Shannon divergence. Physical Review. E, Statistical, Nonlinear, and Soft Matter Physics, 65(4 Pt 1), 041905. https://doi.org/10.1103/PhysRevE.65.041905
Handbook of Latent Semantic Analysis. (2007). Routledge Handbooks Online. https://doi.org/10.4324/9780203936399
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the Dimensionality of Data with Neural Networks. Science, 313(5786), 504–507. https://doi.org/10.1126/science.1127647
Hinton, G. E., & Zemel, R. S. (1994). Autoencoders, Minimum Description Length and Helmholtz Free Energy. In J. D. Cowan, G. Tesauro, & J. Alspector (Eds.), Advances in Neural Information Processing Systems 6 (pp. 3–10). Morgan-Kaufmann. http://papers.nips.cc/paper/798-autoencoders-minimum-description-length-and-helmholtz-free-energy.pdf
Kang, Y., Cheng, I.-L., Mao, W., Kuo, B., & Lee, P.-J. (2019). Towards Interpretable Deep Extreme Multi-label Learning. ArXiv:1907.01723 [Cs, Stat]. http://arxiv.org/abs/1907.01723
Kang, Y., Lin, K.-P., & Cheng, I.-L. (2018). Topic Diffusion Discovery Based on Sparseness-Constrained Non-Negative Matrix Factorization. 2018 IEEE International Conference on Information Reuse and Integration (IRI), 94–101. https://doi.org/10.1109/IRI.2018.00021
Kang, Y., & Zadorozhny, V. (2016). Process Monitoring Using Maximum Sequence Divergence. Knowledge and Information Systems, 48(1), 81–109. https://doi.org/10.1007/s10115-015-0858-z
Lake, J. A. (n.d.). Reconstructing evolutionary trees from DNA and protein sequences: Paralinear distances. 5.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788–791. https://doi.org/10.1038/44565
McCloskey, M., & Cohen, N. J. (1989). Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem. In G. H. Bower (Ed.), Psychology of Learning and Motivation (Vol. 24, pp. 109–165). Academic Press. https://doi.org/10.1016/S0079-7421(08)60536-8
Ognyanova, K. (n.d.). Network visualization with R. 66.
Paatero, P., & Tapper, U. (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2), 111–126. https://doi.org/10.1002/env.3170050203
Phylogenetic trees | Evolutionary tree (article) | Khan Academy. (n.d.). Retrieved July 2, 2020, from https://www.khanacademy.org/science/high-school-biology/hs-evolution/hs-phylogeny/a/phylogenetic-trees
Silge, J., & Robinson, D. (2017). Text Mining with R: A Tidy Approach. O’Reilly Media, Inc.
Song, H. A., & Lee, S.-Y. (2013). Hierarchical Representation Using NMF. In M. Lee, A. Hirose, Z.-G. Hou, & R. M. Kil (Eds.), Neural Information Processing (Vol. 8226, pp. 466–473). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-42054-2_58
Stevens, K., Kegelmeyer, P., Andrzejewski, D., & Buttler, D. (2012). Exploring Topic Coherence over Many Models and Many Topics. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 952–961. https://www.aclweb.org/anthology/D12-1087
Tu, D., Chen, L., Lv, M., Shi, H., & Chen, G. (2018). Hierarchical online NMF for detecting and tracking topic hierarchies in a text stream. Pattern Recognition, 76, 203–214. https://doi.org/10.1016/j.patcog.2017.11.002
Wang, X., & McCallum, A. (2006). Topics over time: A non-Markov continuous-time model of topical trends. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’06, 424. https://doi.org/10.1145/1150402.1150450
Ye, F., Chen, C., & Zheng, Z. (2018). Deep Autoencoder-like Nonnegative Matrix Factorization for Community Detection. Proceedings of the 27th ACM International Conference on Information and Knowledge Management - CIKM ’18, 1393–1402. https://doi.org/10.1145/3269206.3271697
Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., & Sun, M. (2019). Graph Neural Networks: A Review of Methods and Applications. ArXiv:1812.08434 [Cs, Stat]. http://arxiv.org/abs/1812.08434
Fulltext
This electronic fulltext is licensed only for individual, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
Thesis access permission: unrestricted (fully open both on and off campus)
Available:
Campus: available
Off-campus: available


Printed copies
Information on the public availability of printed theses is relatively complete for academic year 102 (2013) and later. To look up availability information for printed theses from academic year 101 (2012) or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
