Plagiarism detection based on word semantic clustering
Word2vec, word embedding, PCA, VSM, near-duplicate, bag-of-words, plagiarism detection
Plagiarism has become increasingly common in recent years. As the Internet and related technology advance, other people's writings are readily available online. Using content from someone else's work without clearly citing it is likely to constitute plagiarism. Plagiarism infringes intellectual property rights and is occurring with growing frequency, so plagiarism detection has become an important problem.

Most current plagiarism detection methods resemble near-duplicate detection methods, such as the vector space model (VSM) or bag-of-words. These methods can mostly detect only passages that are highly similar to the source. When the plagiarized text is lightly modified, for example by word substitution or sentence rewriting, their effectiveness drops sharply. We therefore analyze the semantics of words and use them to judge whether a document is suspected of plagiarism.

Word2vec is a word embedding model proposed by a team at Google. It is trained on a large collection of text and represents the meaning of each word as a vector. We use Word2vec to obtain word vectors. Because the semantic information in these vectors is very large, we apply principal component analysis (PCA) for dimensionality reduction, discarding the dimensions that carry little information. We then use spherical K-means to cluster the words into semantic concepts. By comparing the degree to which concepts overlap between documents, we can identify more complex forms of plagiarism.

Finally, we compare our method experimentally with other methods and test a range of parameter settings. The results show that using word semantics distinguishes complex plagiarism more accurately.
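The clustering and concept-comparison steps described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation. The toy 2-D vectors in `EMBEDDINGS` are invented stand-ins for Word2vec vectors after PCA reduction. The fixed initial centroids and the use of Jaccard overlap as the document comparison are also assumptions made for illustration, since the abstract does not specify the measure used.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def spherical_kmeans(vectors, init_centroids, iters=20):
    """Cluster vectors by cosine similarity; return one cluster label per vector."""
    points = [normalize(v) for v in vectors]
    centroids = [normalize(c) for c in init_centroids]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment: pick the centroid with the highest cosine similarity
        # (the dot product, since all vectors have unit length).
        labels = [max(range(len(centroids)), key=lambda k: dot(p, centroids[k]))
                  for p in points]
        # Update: each centroid becomes the renormalized sum of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                centroids[k] = normalize([sum(col) for col in zip(*members)])
    return labels

def jaccard(a, b):
    """Jaccard overlap of two collections, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical toy embeddings (assumption: stand-ins for reduced Word2vec vectors).
EMBEDDINGS = {
    "car": [0.9, 0.1], "automobile": [0.85, 0.2], "vehicle": [0.8, 0.15],
    "fast": [0.1, 0.9], "quick": [0.2, 0.85], "rapid": [0.15, 0.8],
}

words = list(EMBEDDINGS)
labels = spherical_kmeans([EMBEDDINGS[w] for w in words],
                          init_centroids=[[1.0, 0.0], [0.0, 1.0]])
word_to_concept = dict(zip(words, labels))

def concepts(doc):
    """Map a tokenized document to the set of concept (cluster) ids it uses."""
    return {word_to_concept[w] for w in doc if w in word_to_concept}

doc_a = ["car", "fast"]
doc_b = ["automobile", "quick"]  # doc_a with every word replaced by a synonym

word_overlap = jaccard(doc_a, doc_b)                         # 0.0: no shared words
concept_overlap = jaccard(concepts(doc_a), concepts(doc_b))  # 1.0: same concepts
print(word_overlap, concept_overlap)
```

In this toy case the two documents share no words, so a bag-of-words comparison sees no similarity, while the concept-level comparison treats them as identical. This is the word-substitution scenario the abstract says near-duplicate methods handle poorly.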
Table of Contents
Thesis Approval Form
Acknowledgements
Abstract (Chinese)
Abstract (English)
List of Figures
List of Tables
Chapter 1 Introduction
1.1. Research Background and Objectives
1.2. Research Motivation
1.3. Thesis Organization
Chapter 2 Literature Review
2.1. Document Models
2.2. Plagiarism Detection
2.3. Word2vec
2.4. Spherical K-means
Chapter 3 Methodology
3.1. Method Overview
3.2. Phase 1 Procedure
3.3. Phase 2 Procedure
Chapter 4 Experimental Results and Analysis
4.1. Data Set
4.2. Evaluation Criteria
4.3. Comparison of K-means and Spherical K-means
4.4. Comparison of Cluster Counts and PCA Cumulative Energy
4.5. Comparison with the MLM Method
4.6. Comparison of Plagiarism Detection under Word Substitution
Chapter 5 Conclusion and Future Work
5.1. Conclusion
5.2. Future Research Directions
References
References
