Title page for thesis/dissertation etd-0631120-172000
Title
A Research on Cross-language Plagiarism Detection (跨語言抄襲檢測技術之研究)
Department
Year, semester
Language
Degree
Number of pages
43
Author
Advisor
Convenor
Advisory Committee
Date of Exam
2020-07-21
Date of Submission
2020-07-31
Keywords
Text Reuse, Plagiarism Detection, Cross-Lingual Plagiarism Detection, Cross-Lingual Mapping
Statistics
This thesis/dissertation has been browsed 6092 times and downloaded 138 times.
Abstract
Electronic theses in many languages are now easy to obtain over the Internet, and cross-language plagiarism has become common. Detecting plagiarism across languages requires the ability to compare the semantic similarity of text spans written in different languages. Previous work relies on large bilingual dictionaries, comparable corpora, parallel corpora, or translation tools as references; for domain-specific applications, collecting such resources is time-consuming, laborious, and impractical. Moreover, these methods can hardly explain why a text span is judged to be plagiarized. In this thesis, we propose a four-step framework for cross-language plagiarism detection. For the final step, we propose a new detection method, CL-WMD, which is built on word embedding techniques. CL-WMD requires only a small bilingual dictionary. It computes the semantic distance between texts in different languages with the word mover's distance and provides the word-level correspondences between the two texts, which make the detection result interpretable. Our experiments show that, whether the cross-language word embeddings are built from an in-domain or an out-of-domain corpus, CL-WMD achieves higher accuracy than most existing methods and outperforms the translation-based method in paragraph-level and sentence-level plagiarism detection tasks.
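To make the idea of a word-mover-style distance with word correspondences concrete, the following is a minimal sketch, not the thesis's CL-WMD implementation. It assumes the words of both languages already live in one shared embedding space (the toy 2-D vectors in `EMB` are invented for illustration), and it swaps the exact optimal-transport problem for the simpler relaxed word mover's distance described by Kusner et al. (2015), in which each word sends all of its weight to its nearest word in the other text. The names `nbow`, `one_sided`, and `relaxed_wmd` are hypothetical.

```python
import math
from collections import Counter

# Toy 2-D vectors standing in for a shared cross-lingual embedding space.
# All values are invented for illustration; real embeddings would be learned.
EMB = {
    "cat": (1.0, 0.0), "貓": (0.9, 0.1),
    "sleeps": (0.0, 1.0), "睡覺": (0.1, 0.9),
    "stock": (-1.0, 0.0), "股票": (-0.9, -0.1),
}

def nbow(tokens):
    """Normalized bag-of-words: known word -> relative frequency."""
    counts = Counter(t for t in tokens if t in EMB)
    if not counts:
        raise ValueError("text contains no word with a known embedding")
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def one_sided(src, dst):
    """Cost when each source word sends all its weight to its nearest target word."""
    cost, align = 0.0, {}
    for word, weight in src.items():
        nearest = min(dst, key=lambda t: math.dist(EMB[word], EMB[t]))
        align[word] = nearest
        cost += weight * math.dist(EMB[word], EMB[nearest])
    return cost, align

def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed WMD (a lower bound on the exact WMD) plus the a -> b word alignment."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    cost_ab, align_ab = one_sided(a, b)
    cost_ba, _ = one_sided(b, a)
    return max(cost_ab, cost_ba), align_ab

# Usage: a translated pair should be closer than an unrelated pair.
dist_close, alignment = relaxed_wmd(["cat", "sleeps"], ["貓", "睡覺"])
dist_far, _ = relaxed_wmd(["cat", "sleeps"], ["股票"])
print(dist_close, dist_far, alignment)
```

The returned alignment is what gives this kind of measure its interpretability: it shows which word in one text was matched to which word in the other. An exact word mover's distance would instead solve a transport problem that lets a word split its weight across several target words.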
Table of Contents
Thesis Approval Form
Acknowledgements
Abstract (Chinese)
Abstract (English)
Table of Contents
1 Introduction
2 Related Work
2.1 Cross-Language Plagiarism Detection
2.2 Word Embedding based Information Retrieval
2.2.1 Naïve average scheme
2.2.2 Tf-idf weighted scheme
2.2.3 Smooth Inverse Frequency (SIF)
3 Methodology
3.1 Preprocessing
3.2 Word Vector Space Building
3.3 Candidate Retrieval
3.4 Detailed Analysis
4 Experiment
4.1 Dataset description
4.1.1 Experimental Settings
4.1.2 Compared Methods
4.2 Experimental Results
4.2.1 Detailed analysis
4.2.2 Candidate retrieval
4.2.3 Combine candidate retrieval and detailed analysis
4.2.4 Performance in other languages
5 Conclusion and future work
Reference
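Fulltext
The electronic full text is licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, so as not to violate the law.
Thesis access permission: user-defined availability date
Available:
Campus: available
Off-campus: available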


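Printed copies
Public availability information for printed theses is relatively complete from academic year 102 (ROC calendar) onward. To look up availability information for printed theses from academic year 101 or earlier, please contact the printed thesis service desk of the Office of Library and Information Services. We apologize for any inconvenience.
Available: available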
