National Sun Yat-sen University (國立中山大學) thesis/dissertation record: Plagiarism detection based on word semantic clustering (基於文字語意分群之文章抄襲偵測)

論文名稱 Title	基於文字語意分群之文章抄襲偵測 Plagiarism detection based on word semantic clustering
系所名稱 Department	電機工程學系 Department of Electrical Engineering
畢業學年期 Year, semester	106 學年度第 2 學期 The spring semester of Academic Year 106	語文別 Language	中文 Chinese
學位類別 Degree	碩士 Master	頁數 Number of pages	44
研究生 Author	張家揚 Chia-Yang Chang
指導教授 Advisor	李錫智 Shie-jue Lee
召集委員 Convenor	吳志宏 Chih-hung Wu
口試委員 Advisory Committee	劉志峰, 歐陽振森, 侯俊良 Chih-feng Liu; Chen-sen Ouyang; Chun-liang Hou
口試日期 Date of Exam	2018-07-26	繳交日期 Date of Submission	2018-09-06
關鍵字 Keywords	近似複本、向量空間模型、詞袋模型、Word2vec、詞嵌入、主成分分析、抄襲偵測 Word2vec, word embedding, PCA, VSM, near-duplicate, bag-of-words, plagiarism detection
統計 Statistics	This thesis/dissertation has been viewed 5727 times and downloaded 1 time.

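Abstract (translated from the Chinese)
Plagiarism has become increasingly common in recent years. As the Internet and technology advance, other people's writings are readily available online. When your work uses the content of someone else's work without clearly citing it, it may well constitute plagiarism. Plagiarism infringes on others' intellectual property rights, and it is occurring more and more frequently, so plagiarism detection is now a very important issue. Most current plagiarism research resembles near-duplicate detection, for example the vector space model and the bag-of-words model; these methods can mostly detect only passages with very high similarity to the source. If the plagiarized passage is lightly modified, for example by replacing certain words or rewriting sentences, the effectiveness of these methods is greatly affected. We therefore analyze the semantics of words and use them to judge whether a document is suspected of plagiarism. Word2vec is a word embedding model proposed by a team at Google; it is trained on a large number of documents by machine learning and ultimately represents the meaning of each word as a vector. We obtain word vectors through Word2vec. Because the amount of word-semantic information is very large, we use principal component analysis (PCA) for dimensionality reduction, reducing the dimensionality by ignoring the dimensions that carry less information. We then use clustering to group the words into many different semantic concepts. By comparing the degree to which semantic concepts are repeated between documents, we can identify more complex plagiarism. Finally, we compare our method with other methods experimentally and test many different experimental parameters, showing that using word semantics can indeed distinguish more complex plagiarism more accurately.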
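The weakness of bag-of-words matching under word substitution, as described in the abstract above, can be illustrated with a minimal sketch. The example sentences and the function name `bow_cosine` are hypothetical and are not taken from the thesis; the similarity measure here is plain cosine similarity over raw term counts.

```python
import math
from collections import Counter


def bow_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between raw term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical example sentences (not from the thesis data set).
original = "the quick student copied the entire essay"
verbatim = "the quick student copied the entire essay"
reworded = "a fast pupil duplicated a whole composition"

print(bow_cosine(original, verbatim))  # identical wording
print(bow_cosine(original, reworded))  # every word substituted
```

Because the reworded sentence shares no surface words with the original, a purely lexical measure sees no similarity at all, even though the meaning is nearly the same.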
Abstract
Plagiarism has become a common problem in recent years. With the advance of the Internet, it is increasingly easy to obtain other people's writings. When someone uses such content without citation, it may constitute plagiarism. Plagiarism infringes intellectual property rights, so plagiarism detection is an important problem today. Current plagiarism detection methods resemble near-duplicate detection methods, such as the vector space model (VSM) or bag-of-words. These methods do not handle complex plagiarism techniques well, e.g. word substitution and sentence rewriting. We therefore focus on the semantics of words. In this thesis, we propose a new method for plagiarism detection that analyzes the semantics of words. Word2vec is a word embedding model proposed by a group at Google that represents each word as a vector. We use Word2vec to obtain word vectors and PCA for dimension reduction. We then use spherical K-means to cluster the words into concepts. By using Word2vec, we can take the semantics of words into account and cluster the words into concepts in order to deal with complex plagiarism techniques. Finally, we present our experimental results and compare them with other methods. The experimental results show that our method performs well.
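The pipeline outlined in the abstract (word vectors, PCA, spherical K-means, concept comparison) might look roughly like the following sketch. It is an illustration under stated assumptions, not the thesis implementation: the toy vectors stand in for trained Word2vec embeddings, the 0.95 energy threshold and the farthest-first initialization are arbitrary choices, and comparing documents by the cosine of their concept histograms is only one plausible way to measure concept overlap.

```python
import numpy as np

# Hypothetical toy embeddings standing in for trained Word2vec vectors.
TOY_VECTORS = {
    "copy":      [0.9, 0.1, 0.0, 0.1],
    "duplicate": [0.8, 0.2, 0.1, 0.0],
    "essay":     [0.1, 0.9, 0.1, 0.0],
    "paper":     [0.0, 0.8, 0.2, 0.1],
    "river":     [0.1, 0.0, 0.9, 0.2],
    "stream":    [0.0, 0.1, 0.8, 0.3],
}


def pca_reduce(X, energy=0.95):
    """Project centered rows of X onto the fewest principal components
    whose cumulative energy (variance share) reaches `energy`."""
    centered = X - X.mean(axis=0)
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(ratio, energy)) + 1, len(s))
    return centered @ vt[:k].T


def spherical_kmeans(X, k, iters=20):
    """K-means on unit vectors using cosine similarity.
    Assumed initialization: deterministic farthest-first seeding."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    chosen = [0]
    while len(chosen) < k:
        sims = U @ U[chosen].T
        chosen.append(int(np.argmin(sims.max(axis=1))))
    centroids = U[chosen].copy()
    for _ in range(iters):
        # Assign each word to the centroid with the highest cosine.
        labels = np.argmax(U @ centroids.T, axis=1)
        for j in range(k):
            members = U[labels == j]
            if len(members):
                total = members.sum(axis=0)
                centroids[j] = total / np.linalg.norm(total)
    return labels


def concept_histogram(doc_words, word_to_concept, k):
    """Count how often each concept (cluster id) occurs in a document."""
    hist = np.zeros(k)
    for w in doc_words:
        if w in word_to_concept:
            hist[word_to_concept[w]] += 1
    return hist


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


words = list(TOY_VECTORS)
X = np.array([TOY_VECTORS[w] for w in words])
N_CONCEPTS = 3  # assumed number of clusters for this toy vocabulary
labels = spherical_kmeans(pca_reduce(X), N_CONCEPTS)
word_to_concept = {w: int(c) for w, c in zip(words, labels)}

doc_a = ["copy", "essay"]
doc_b = ["duplicate", "paper"]  # doc_a with every word substituted
doc_c = ["river", "stream"]     # unrelated content

sim_ab = cosine(concept_histogram(doc_a, word_to_concept, N_CONCEPTS),
                concept_histogram(doc_b, word_to_concept, N_CONCEPTS))
sim_ac = cosine(concept_histogram(doc_a, word_to_concept, N_CONCEPTS),
                concept_histogram(doc_c, word_to_concept, N_CONCEPTS))
print(sim_ab, sim_ac)
```

In this toy setting the word-substituted document maps to the same concepts as the original, so its concept-level similarity stays high while the unrelated document scores zero. Real embeddings and cluster counts would have to be tuned, which is what the parameter experiments listed in the table of contents appear to address.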

目次 Table of Contents
Thesis Approval Form
Acknowledgments
Abstract (Chinese)
Abstract
List of Figures
List of Tables
Chapter 1 Introduction
1.1. Research Background and Objectives
1.2. Research Motivation
1.3. Thesis Organization
Chapter 2 Literature Review
2.1. Document Models
2.2. Plagiarism Detection
2.3. Word2vec
2.4. Spherical K-means
Chapter 3 Methodology
3.1. Method Overview
3.2. First-Stage Procedure
3.3. Second-Stage Procedure
Chapter 4 Experimental Results and Analysis
4.1. Data set
4.2. Evaluation Criteria
4.3. Comparison of K-means and spherical K-means
4.4. Comparison of Cluster Counts and PCA Cumulative Energy
4.5. Comparison with the MLM Method
4.6. Comparison on Plagiarism Detection with Word Substitution
Chapter 5 Conclusion and Future Work
5.1. Conclusion
5.2. Future Research Directions
References


電子全文 Fulltext
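This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for academic research purposes. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law.
Thesis access permission: user-defined release date
Available: Campus: available; Off-campus: available
etd-0805118-135637.pdf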
紙本論文 Printed copies
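Public-availability information for printed theses is relatively complete from Academic Year 102 onward. To inquire about printed theses from Academic Year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available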

