National Sun Yat-sen University, thesis/dissertation, A Research on Cross-language Plagiarism Detection

Title	跨語言抄襲檢測技術之研究 (A Research on Cross-language Plagiarism Detection)
Department	Department of Information Management
Year, semester	Spring semester of Academic Year 108	Language	English
Degree	Master	Number of pages	43
Author	張家銘 (Chia-Ming Chang)
Advisor	黃三益 (San-Yih Hwang)
Convenor	魏志平 (Chih-Ping Wei)
Advisory Committee	康藝晃 (Yihuang Kang)
Date of Exam	2020-07-21	Date of Submission	2020-07-31
Keywords	Text Reuse, Plagiarism Detection, Cross-Lingual Plagiarism Detection, Cross-Lingual Mapping
Statistics	This thesis has been browsed 6274 times and downloaded 141 times.

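Chinese Abstract
Electronic theses in different languages are now easy to obtain over the Internet, and cross-language plagiarism has become common. Detecting it requires the ability to compare the semantic similarity of text spans written in different languages. Previous methods rely on large bilingual dictionaries, comparable corpora, parallel corpora, or translation tools as references; for domain-specific applications, collecting such resources is time-consuming, labor-intensive, and impractical. Moreover, these methods can hardly explain why a given passage is judged to be plagiarized. This thesis proposes a four-step framework for cross-language plagiarism detection and, for the fourth step, a new detection method, CL-WMD. Built on word embedding techniques, it needs only a small bilingual dictionary to compute the semantic distance between texts in different languages through word mover's distance, and it provides word-level correspondences between the texts, which make the detection result interpretable. Our experiments show that, whether the cross-lingual word vectors are built from a related or an unrelated corpus, CL-WMD achieves better accuracy than existing methods.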
Abstract
Because electronic theses in different languages can be accessed easily over the Internet, cross-language plagiarism has become common. Detecting plagiarism across languages requires the capability to compare the semantic similarity between text spans in different languages. Previous works rely on massive bilingual dictionaries, comparable corpora, parallel corpora, or translation tools as references. Moreover, they can hardly provide interpretable results that explain why a text span is suspected of plagiarism. In this thesis, we propose a four-step framework for cross-language plagiarism detection. For the final step of the framework we propose a new approach, called CL-WMD, which is built upon word embedding techniques. CL-WMD requires only a small bilingual dictionary, calculates semantic distances between texts with word mover's distance, and provides the correspondence between words in text spans of different languages. Our experiments show that, whether the cross-language word embeddings are built from an in-domain or an out-of-domain corpus, CL-WMD achieves higher accuracy than most existing methods and outperforms the translation-based method in paragraph-level and sentence-level plagiarism detection tasks.
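The idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not say how the cross-lingual mapping is learned, so the orthogonal Procrustes step, the uniform word weights, the toy Spanish/English vocabulary, the random 3-dimensional vectors, and the function names `learn_orthogonal_map` and `word_movers_distance` are all assumptions made for illustration. The sketch maps source-language vectors into the target space using a small dictionary, then solves the word mover's distance as a transport problem whose flow matrix yields the word correspondences.

```python
import numpy as np
from scipy.optimize import linprog  # method="highs" needs SciPy 1.6 or later


def learn_orthogonal_map(src_vecs, tgt_vecs):
    """Orthogonal W minimizing ||src_vecs @ W - tgt_vecs||_F (Procrustes).

    Assumption: the thesis may learn its mapping differently.
    """
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt


def word_movers_distance(x, y):
    """WMD between two bags of word vectors, assuming uniform word weights.

    Returns (distance, flow); flow[i, j] is the mass moved from word i
    of the first text to word j of the second text.
    """
    n, m = len(x), len(y)
    # Pairwise Euclidean distances between word vectors, shape (n, m).
    cost = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    # Each row of the flow must sum to 1/n and each column to 1/m.
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        a_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    res = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(n, m)


def embed(tokens, table):
    """Stack the vectors of the given tokens into one array."""
    return np.array([table[t] for t in tokens])


# Toy data (hypothetical): the "Spanish" space is a rotated English space.
rng = np.random.default_rng(0)
en_words = ["cat", "dog", "eats", "fish", "runs", "water"]
es_words = ["gato", "perro", "come", "pescado", "corre", "agua"]
en_emb = rng.normal(size=(6, 3))
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
es_emb = en_emb @ rotation.T

# Only the first four word pairs play the role of the small dictionary.
w = learn_orthogonal_map(es_emb[:4], en_emb[:4])
en_vec = dict(zip(en_words, en_emb))
es_vec = dict(zip(es_words, es_emb @ w))  # Spanish mapped into English space

suspicious = ["gato", "come", "pescado"]
source = ["cat", "eats", "fish"]
unrelated = ["dog", "runs", "water"]

dist_match, flow = word_movers_distance(embed(suspicious, es_vec),
                                        embed(source, en_vec))
dist_other, _ = word_movers_distance(embed(suspicious, es_vec),
                                     embed(unrelated, en_vec))
# The largest flow in each row pairs a suspicious word with a source word.
alignment = [(s, source[j]) for s, j in zip(suspicious, flow.argmax(axis=1))]
print(dist_match, dist_other, alignment)
```

In this toy setting the translated text gets a near-zero distance to its source and a larger distance to the unrelated text, and `alignment` lists the word pairs that carry the flow. Those pairs are the kind of evidence the abstract refers to as interpretability.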

Table of Contents
Thesis Approval Form	i
Acknowledgements	ii
Abstract (Chinese)	iii
Abstract	iv
Table of Contents	v
1 Introduction	1
2 Related Work	6
2.1 Cross-Language Plagiarism Detection	6
2.2 Word Embedding based Information Retrieval	8
2.2.1 Naïve average scheme	9
2.2.2 Tf-idf weighted scheme	9
2.2.3 Smooth Inverse Frequency (SIF)	9
3 Methodology	11
3.1 Preprocessing	12
3.2 Word Vector Space Building	13
3.3 Candidate Retrieval	17
3.4 Detailed Analysis	18
4 Experiment	22
4.1 Dataset description	22
4.1.1 Experimental Settings	24
4.1.2 Compared Methods	24
4.2 Experimental Results	25
4.2.1 Detailed analysis	25
4.2.2 Candidate retrieval	27
4.2.3 Combine candidate retrieval and detailed analysis	29
4.2.4 Performance in other languages	30
5 Conclusion and future work	32
Reference	33

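Fulltext
This electronic full text is licensed to users only for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as to avoid violating the law.
Thesis access permission: user-defined availability date
Available: Campus: available; Off-campus: available
etd-0631120-172000.pdf
Printed copies
Availability information for printed theses is relatively complete from Academic Year 102 onward. To check the availability of printed theses from Academic Year 101 or earlier, please contact the printed-thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available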
