國立中山大學,National Sun Yat-sen University,學位論文,thesis/dissertation,以文風指標分析《紅樓夢》的作者爭議問題,Solving the Author Problem of “Dream of the Red Chamber” with the Writing Style Indicator

論文名稱 Title	以文風指標分析《紅樓夢》的作者爭議問題 Solving the Author Problem of “Dream of the Red Chamber” with the Writing Style Indicator
系所名稱 Department	資訊工程學系 Department of Computer Science and Engineering
畢業學年期 Year, semester	109 學年度第 2 學期 The spring semester of Academic Year 109	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	74
研究生 Author	陳佳琳 Chia-Lin Chen
指導教授 Advisor	楊昌彪 Yang,Chang-Biau
召集委員 Convenor	黃毅青 Wong,Ngai-Ching
口試委員 Advisory Committee	陳嘉平, 李宗錂, 曾國尊 Chen,Chia-Ping; Lee,Tsung-lin; Tseng, Kuo-Tsung
口試日期 Date of Exam	2021-07-30	繳交日期 Date of Submission	2021-09-28
關鍵字 Keywords	紅樓夢、作者歸屬(authorship attribution)、分類問題(classification problem)、句類(sentence pattern)、支援向量機、Tanimoto相似度 Dream of the Red Chamber, authorship attribution, classification problem, sentence pattern, support vector machine, Tanimoto coefficient

中文摘要
關於《紅樓夢》的作者是否僅出自一人的議題，自西元1750年以來即備受討論。在本篇論文中，我們參考前人的做法，針對不同特徵值(字數、詞頻、句類、變動點等)，使用機器學習進行作者分類，以釐清作者人數。在進行實驗的過程，我們發現特徵的選取，比分類器的選擇，更顯重要。我們觀察不同作者的寫作習慣，然後制定文風句類。我們利用中央研究院CkipTagger軟體，進行斷詞與詞性標註，整理對寫作風格敏感之詞性組合，再將61種詞性的組合，彙整成45個文風句類。我們的文風句類可以正確地分辨不同作者之相同類型書籍，然同作者之相同類型書籍則難以分辨，此結果代表我們的文風句類能正確分析不同作者之寫作風格及習慣，並加以分辨。此外，為檢驗文風句類之有效性，我們將文風句類與其他特徵值，利用向量支援機(SVM，support vector machine)分類器與Tanimoto相似度，對同作者之相同類型書籍進行實驗;結果得到，文風句類較其他特徵值來得穩定，表示文風句類適合作為特徵值。接著，我們針對42本對照組小說進行同作者同本書、同作者不同書、不同作者共三種實驗，得到SVM正確率與Tanimoto相似度之值域範圍作為文風指標。最後再將《紅樓夢》分為前80回與後40回，進行是否同作者的實驗。最終結果，我們認為《紅樓夢》全書是否有超過一位作者之問題仍無法定論，無法證明僅有一位作者，亦無法證明有多位作者。我們將本論文的方法，開發為文風相似度比對之網頁應用程式，以供有興趣者使用。網址如下: http://par48.cse.nsysu.edu.tw:3000。
Abstract
Whether “Dream of the Red Chamber” was written by a single author has been debated extensively since 1750. To clarify the number of authors, this thesis builds on previous studies and applies machine learning classifiers to a range of features (word count, word frequency, sentence pattern, change points, etc.). In the course of these experiments, we find that feature selection matters more than the choice of classifier. We observe the writing habits of different authors and formulate writing style categories accordingly. We use CkipTagger, developed by Academia Sinica in Taiwan, for word segmentation and part-of-speech tagging, identify the part-of-speech combinations that are sensitive to writing style, and consolidate 61 such combinations into 45 writing style categories. These categories correctly distinguish books of the same type by different authors, whereas books of the same type by the same author are difficult to tell apart. This result indicates that the writing style categories capture the writing styles and habits of different authors. In addition, to test the effectiveness of the writing style categories, we compare them with other features using a support vector machine (SVM) classifier and Tanimoto similarity on books of the same type by the same author. The results show that the writing style categories are more stable than the other features, which suggests that they are suitable as features. We then run three kinds of experiments on 42 control-group novels: same author within the same book, same author across different books, and different authors. The resulting ranges of SVM accuracy and Tanimoto similarity serve as the writing style indicator. Finally, we divide “Dream of the Red Chamber” into the first 80 chapters and the last 40 chapters and test whether the two parts share an author. Based on the experimental results, we conclude that the question of whether the book has more than one author remains open: we can prove neither that there is only one author nor that there are multiple authors. Based on this study, we develop a web application that compares the writing style similarity of two texts, available at http://par48.cse.nsysu.edu.tw:3000
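As an illustration of the Tanimoto similarity mentioned in the abstract, the following is a minimal sketch, not the thesis's implementation. It assumes each text is represented by the relative frequencies of its writing style categories and uses the common real-valued form of the Tanimoto coefficient, T(a, b) = a·b / (|a|² + |b|² − a·b). The category names C1 to C3, the helper names, and the sample labels are hypothetical; the record does not state the exact formula or vector construction used in the thesis.

```python
from collections import Counter
from typing import Iterable, List


def category_vector(labels: Iterable[str], categories: List[str]) -> List[float]:
    """Relative frequency of each style category among the labelled sentences.

    Assumption: every sentence has already been assigned one category label
    (for example, from its part-of-speech pattern); that step is not shown.
    """
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    if total == 0:
        return [0.0] * len(categories)
    return [counts[c] / total for c in categories]


def tanimoto(a: List[float], b: List[float]) -> float:
    """Tanimoto coefficient for real-valued vectors: a.b / (|a|^2 + |b|^2 - a.b)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    # The denominator is zero only when both vectors are all zeros;
    # treating that case as identical (1.0) is a convention chosen here.
    return 1.0 if denom == 0 else dot / denom


# Hypothetical category names and sentence labels, for illustration only.
CATEGORIES = ["C1", "C2", "C3"]
text_a = category_vector(["C1", "C1", "C2", "C3"], CATEGORIES)
text_b = category_vector(["C1", "C2", "C2", "C3"], CATEGORIES)

print(tanimoto(text_a, text_a))  # identical vectors give 1.0
print(tanimoto(text_a, text_b))  # a value between 0 and 1 for these non-negative vectors
```

Under this formulation, a value closer to 1 means the two category distributions are more alike. How such values map onto "same author" or "different authors" would depend on ranges calibrated on control texts, as the abstract describes for the 42 control-group novels.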

目次 Table of Contents
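論文中文審定書 Thesis Approval Form (Chinese)
論文英文審定書 Thesis Approval Form (English)
論文公開授權書 Authorization for Public Release
謝辭 Acknowledgments
摘要 Abstract (Chinese)
Abstract
圖目錄 List of Figures
表目錄 List of Tables
第一章 簡介 Chapter 1 Introduction
第二章 先備知識 Chapter 2 Preliminaries
2.1 前提 Premises
2.2 機器學習之分類器 Machine Learning Classifiers
2.3 前人研究 Previous Studies
2.3.1 相似詞顯著性與虛詞變動點分析 Significance of Similar Words and Change-Point Analysis of Function Words
2.3.2 虛詞SVM實驗 SVM Experiments on Function Words
2.3.3 詞長k-means實驗 k-means Experiments on Word Length
2.3.4 HNC句類kNN實驗 kNN Experiments on HNC Sentence Categories
第三章 我們的文風句類分析 Chapter 3 Our Writing Style Category Analysis
3.1 文風句類 Writing Style Categories
3.2 Tanimoto相似度 Tanimoto Similarity
3.3 演算法步驟和流程圖 Algorithm Steps and Flowchart
第四章 結論 Chapter 4 Conclusion
參考文獻 References
附錄 Appendix
A.1 中研院CKIPTagger之詞性列表 List of Part-of-Speech Tags in Academia Sinica CKIPTagger
A.2 文風句類列表 List of Writing Style Categories
A.3 《紅樓夢》之文風句類統計 Writing Style Category Statistics for “Dream of the Red Chamber”
A.4 線上文風比對系統 Online Writing Style Comparison System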


電子全文 Fulltext
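This electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China, and do not reproduce, distribute, adapt, repost, or broadcast it without authorization.
論文使用權限 Thesis access permission: 自定論文開放時間 user define
開放時間 Available: 校內 Campus: 已公開 available; 校外 Off-campus: 已公開 available
etd-0828121-012351.pdf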
紙本論文 Printed copies
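Public-availability information for printed theses is relatively complete from Academic Year 102 onward. For printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services.
開放時間 Available: 已公開 available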
