Title page for etd-0118121-163425
Title
Topic Diffusion Discovery based on Online Deep Non-negative Variational Autoencoder (基於在線式深度非負變分自編碼的主題演進探索)
Department
Year, semester
Language
Degree
Number of pages
55
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2021-01-28
Date of Submission
2021-02-18
Keywords
Network Analysis, Topic Evolution, Topic Modeling, Topic Diffusion, Deep Learning, Variational Autoencoder
Statistics
This thesis/dissertation has been browsed 470 times and downloaded 171 times.
Chinese Abstract
Information technology has changed the way people live. The ubiquity of computers and handheld mobile devices lets us transmit and absorb large amounts of information over the network at any time. This change in behavior, however, also means that people must digest an unmanageable volume of online data every day and cannot possibly understand all of it. Classification and keyword search can filter out the data a user wants, but as data volumes keep growing and content is updated day after day, clustering and classifying data purely by hand becomes not only more difficult but infeasible, so machine learning methods are increasingly used to assist with this work. For text, topic modeling is a well-known approach: using approximate document distributions or matrix factorization, it converts large collections into topics and has matured into an effective tool for assigning topics to document content. In practice, however, data and topics appear, change, and disappear as time advances. How to fully explain the process by which topics change is the topic modeling problem this thesis investigates.
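To make the matrix-factorization view of topic modeling mentioned above concrete, here is a minimal sketch using scikit-learn's NMF on a toy corpus. The documents, vocabulary, and topic count are illustrative assumptions only, not the data or code used in this thesis.

```python
# Illustrative only: classic NMF topic extraction, not the thesis's DNVAE.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

docs = [  # toy corpus; the thesis actually uses machine-learning papers
    "deep learning neural network training",
    "topic model latent dirichlet allocation",
    "variational autoencoder latent representation",
    "online learning streaming data updates",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)            # documents x terms

nmf = NMF(n_components=2, init="nndsvda", random_state=0)
W = nmf.fit_transform(X)                 # documents x topics (non-negative)
H = nmf.components_                      # topics x terms (non-negative)

terms = tfidf.get_feature_names_out()
for k, row in enumerate(H):
    top = row.argsort()[::-1][:3]        # three highest-weight terms per topic
    print(f"topic {k}:", [terms[i] for i in top])
```

Reading the rows of H as topic-term weights is exactly the "large collections into topics" conversion the paragraph describes.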
This thesis proposes the Deep Non-negative Variational Autoencoder (DNVAE) algorithm, combined with an online model, to discover topics that change over time. The corpus consists of papers on machine learning. The experimental results show that our method quickly finds the topics at each time point, and that topic network diagrams, heat maps, and distance measures then make it possible to explain and explore how topics evolve.
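The abstract does not spell out the DNVAE architecture, so the following PyTorch sketch is only a rough illustration of the general idea: a VAE-style topic model whose decoder weights are constrained non-negative so they can be read as topic-term loadings. The layer sizes, the softplus non-negativity device, and the Poisson-style reconstruction loss are assumptions for illustration, not the thesis's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonNegVAE(nn.Module):
    """Illustrative VAE-style topic model with a non-negative decoder.

    NOT the thesis's DNVAE; it only sketches the idea of reading a
    decoder weight matrix as non-negative topic-term loadings.
    """
    def __init__(self, vocab_size, n_topics):
        super().__init__()
        self.enc = nn.Linear(vocab_size, 64)
        self.mu = nn.Linear(64, n_topics)
        self.logvar = nn.Linear(64, n_topics)
        # Unconstrained parameter; softplus makes the effective decoder non-negative.
        self.dec_raw = nn.Parameter(torch.randn(n_topics, vocab_size) * 0.01)

    def topic_term(self):
        return F.softplus(self.dec_raw)    # topics x terms, entrywise >= 0

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        theta = F.softmax(z, dim=-1)       # document-topic proportions
        recon = theta @ self.topic_term()  # expected term counts
        return recon, mu, logvar

def loss_fn(recon, x, mu, logvar):
    # Poisson-style reconstruction term plus the standard Gaussian KL term.
    rec = (recon - x * torch.log(recon + 1e-8)).sum()
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl
```

The weight sharing across time points and the online updates mentioned in the abstracts would sit on top of a core like this, for example by initializing the model at time t+1 from the weights learned at time t.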
Abstract
Today, books, newspapers, and magazines are stored as digital documents rather than on paper. With so many documents stored digitally, classifying them manually is time-consuming, so topic modeling techniques are commonly used to address this problem. Topics, however, change over time, and how to properly classify documents in the presence of topic diffusion has become an important issue in recent years.
In this thesis, we propose a topic diffusion discovery approach that can handle the evolution of topics. Because exact inference of the posterior probability is overly complicated, we use, for simplicity, a variational autoencoder variant with weights shared across time points to build the topic model, called the Deep Non-negative Variational Autoencoder (DNVAE). Its multi-layer structure allows the model to capture the evolution of topics. The generalized Jensen-Shannon divergence is used to measure the magnitude of topic diffusion, and we present topic network diagrams to help interpret the evolution of topics.
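The generalized Jensen-Shannon divergence referred to here is the standard definition (see Grosse et al. in the references): for distributions P_1, ..., P_n with weights π_i, JS_π = H(Σ_i π_i P_i) − Σ_i π_i H(P_i), where H is Shannon entropy. A minimal numpy version, assuming uniform weights and base-2 logarithms, might look like this:

```python
import numpy as np

def entropy(p):
    p = p[p > 0]                          # 0 * log 0 is taken as 0
    return -np.sum(p * np.log2(p))

def generalized_jsd(dists, weights=None):
    """Generalized JS divergence of several distributions (rows of `dists`)."""
    dists = np.asarray(dists, dtype=float)
    dists = dists / dists.sum(axis=1, keepdims=True)   # normalize rows
    n = len(dists)
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights)
    mixture = w @ dists                   # weighted average distribution
    return entropy(mixture) - sum(wi * entropy(p) for wi, p in zip(w, dists))

# Example: divergence between one topic's term distribution at two time points;
# 0 means identical, up to 1 bit for two disjoint distributions.
p_t1 = [0.5, 0.3, 0.2, 0.0]
p_t2 = [0.1, 0.2, 0.3, 0.4]
print(generalized_jsd([p_t1, p_t2]))
```

A larger value between consecutive time points signals a topic whose term distribution is diffusing more strongly.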
Table of Contents
Thesis Approval Certificate i
Acknowledgments ii
Chinese Abstract iii
Abstract iv
List of Figures vii
List of Tables viii
Chapter 1 Introduction 1
1.1 Research Background 1
1.2 Research Motivation 1
1.3 Research Objectives 2
Chapter 2 Literature Review 3
2.1 Topic Model 3
2.1.1 Time Series Topic Model 3
2.1.2 Non-negative Matrix Factorization (NMF) 4
2.1.3 Multi-layer Topic Model 5
2.2 Deep Learning 5
2.3 Online Learning 7
Chapter 3 Research Methods and Procedures 8
3.1 Research Methods 8
3.1.1 Topic Model Based on Variational Autoencoder 8
3.1.2 Online Deep Non-negative Variational Autoencoder (DNVAE) 11
3.2 Evaluation Criteria 12
3.2.1 Measuring the Degree of Term Diffusion 12
3.2.2 Visualization of Topic Relationships 13
3.3 Research Framework 14
Chapter 4 Experimental Results and Discussion 17
4.1 Data Preparation 17
4.2 Research Workflow 18
4.3 Research Procedure 18
4.3.1 Raw Data: Predicting Topics and Terms 19
4.3.2 Visualization of Topic Relationships and Evolution 21
4.3.3 Term Evolution with DNVAE 23
4.4 Analysis 25
Chapter 5 Conclusions and Suggestions 28
5.1 Conclusions 28
Chapter 6 References 29
References
Berthelot, D., Raffel, C., Roy, A., & Goodfellow, I. (2018). Understanding and Improving Interpolation in Autoencoders via an Adversarial Regularizer. ArXiv:1807.07543 [Cs, Stat]. http://arxiv.org/abs/1807.07543
Blei, D. M. (2011). Introduction to Probabilistic Topic Models.
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022.
Blei, D. M., & Lafferty, J. D. (2006). Dynamic topic models. Proceedings of the 23rd International Conference on Machine Learning - ICML ’06, 113–120. https://doi.org/10.1145/1143844.1143859
Falbel, D., et al. (2019). keras: R Interface to 'Keras'.
Doersch, C. (2016). Tutorial on Variational Autoencoders. ArXiv:1606.05908 [Cs, Stat]. http://arxiv.org/abs/1606.05908
Dubey, A., Hefny, A., Williamson, S., & Xing, E. P. (2012). A non-parametric mixture model for topic modeling over time. ArXiv:1208.4411 [Stat]. http://arxiv.org/abs/1208.4411
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Greene, D., O’Callaghan, D., & Cunningham, P. (2014). How Many Topics? Stability Analysis for Topic Models. ArXiv:1404.4606 [Cs]. http://arxiv.org/abs/1404.4606
Griffiths, T. L., Jordan, M. I., Tenenbaum, J. B., & Blei, D. M. (2003). Hierarchical Topic Models and the Nested Chinese Restaurant Process. Advances in Neural Information Processing Systems 16.
Grosse, I., Bernaola-Galván, P., Carpena, P., Román-Roldán, R., Oliver, J., & Stanley, H. E. (2002). Analysis of symbolic sequences using the Jensen-Shannon divergence. Physical Review E, 65(4), 041905.
Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. New York: Springer-Verlag.
Hinton, G. E., & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
Hoi, S. C. H., Sahoo, D., Lu, J., & Zhao, P. (2018). Online Learning: A Comprehensive Survey. ArXiv:1802.02871 [Cs]. http://arxiv.org/abs/1802.02871
Hung, S. (2020). Topic Evolution and Diffusion Discovery based on Online Deep Non-negative Autoencoder.
Ram, K., & Broman, K. (2019). aRxiv: Interface to the arXiv API.
Kang, Y., Cheng, I.-L., Mao, W., Kuo, B., & Lee, P.-J. (2019). Towards Interpretable Deep Extreme Multi-label Learning. ArXiv:1907.01723 [Cs, Stat]. http://arxiv.org/abs/1907.01723
Kang, Y., Lin, K.-P., & Cheng, I.-L. (2018). Topic Diffusion Discovery based on Sparseness-constrained Non-negative Matrix Factorization. ArXiv:1807.04386 [Cs, Stat]. http://arxiv.org/abs/1807.04386
Kang, Y., & Zadorozhny, V. (2016). Process Monitoring Using Maximum Sequence Divergence. Knowledge and Information Systems, 48(1), 81–109. https://doi.org/10.1007/s10115-015-0858-z
Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. ArXiv:1312.6114 [Cs, Stat]. http://arxiv.org/abs/1312.6114
Landauer, T. K. (Ed.). (2007). Handbook of latent semantic analysis. Erlbaum.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755), 788–791. https://doi.org/10.1038/44565
McCloskey, M., & Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24, 109–165.
Ognyanova, K. (n.d.). Network visualization with R.
Oring, A., Yakhini, Z., & Hel-Or, Y. (2020). Autoencoder Image Interpolation by Shaping the Latent Space. ArXiv:2008.01487 [Cs, Stat]. http://arxiv.org/abs/2008.01487
Paatero, P., & Tapper, U. (1994). Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2), 111–126. https://doi.org/10.1002/env.3170050203
Qin, Z., Yu, F., Liu, C., & Chen, X. (2018). How convolutional neural network see the world—A survey of convolutional neural network visualization methods. ArXiv:1804.11191 [Cs]. http://arxiv.org/abs/1804.11191
Roger, V., Farinas, J., & Pinquier, J. (2020). Deep Neural Networks for Automatic Speech Processing: A Survey from Large Corpora to Limited Data. ArXiv:2003.04241 [Cs, Eess, Stat]. http://arxiv.org/abs/2003.04241
Silge, J., & Robinson, D. (2017). Text mining with R: A tidy approach (First edition). O’Reilly.
Song, H. A., & Lee, S.-Y. (2013). Hierarchical Representation Using NMF. In M. Lee, A. Hirose, Z.-G. Hou, & R. M. Kil (Eds.), Neural Information Processing (Vol. 8226, pp. 466–473). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-42054-2_58
Srivastava, A., & Sutton, C. (2017). Autoencoding Variational Inference For Topic Models. ArXiv:1703.01488 [Stat]. http://arxiv.org/abs/1703.01488
Stevens, K., Kegelmeyer, P., Andrzejewski, D., & Buttler, D. (2012). Exploring topic coherence over many models and many topics. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 952–961.
R Core Team (2019). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
Theis, L., Oord, A. van den, & Bethge, M. (2016). A note on the evaluation of generative models. ArXiv:1511.01844 [Cs, Stat]. http://arxiv.org/abs/1511.01844
Torfi, A., Shirvani, R. A., Keneshloo, Y., Tavaf, N., & Fox, E. A. (2020). Natural Language Processing Advancements By Deep Learning: A Survey. ArXiv:2003.01200 [Cs]. http://arxiv.org/abs/2003.01200
Tu, D., Chen, L., Lv, M., Shi, H., & Chen, G. (2018). Hierarchical online NMF for detecting and tracking topic hierarchies in a text stream. Pattern Recognition, 76, 203–214. https://doi.org/10.1016/j.patcog.2017.11.002
Wang, C., Blei, D., & Heckerman, D. (2015). Continuous Time Dynamic Topic Models. ArXiv:1206.3298 [Cs, Stat]. http://arxiv.org/abs/1206.3298
Wang, W., Gan, Z., Xu, H., Zhang, R., Wang, G., Shen, D., Chen, C., & Carin, L. (2019). Topic-Guided Variational Autoencoders for Text Generation. ArXiv:1903.07137 [Cs]. http://arxiv.org/abs/1903.07137
Wang, X., & McCallum, A. (2006). Topics over time: A non-Markov continuous-time model of topical trends. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’06, 424. https://doi.org/10.1145/1150402.1150450
Fulltext
The electronic full text is licensed only for personal, non-profit searching, reading, and printing for academic research purposes. Please observe the relevant provisions of the Copyright Act of the Republic of China (Taiwan); do not reproduce, distribute, adapt, repost, or broadcast it, to avoid infringing the law.
Thesis access permission: fully open on and off campus (unrestricted)
Available:
Campus: available
Off-campus: available


Printed copies
Public-access information for printed theses is relatively complete from the 102 academic year (2013–2014) onward. To look up public-access information for printed theses from the 101 academic year or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available
