論文使用權限 Thesis access permission: 自定論文開放時間 user-defined
開放時間 Available:
校內 Campus: 已公開 available
校外 Off-campus: 已公開 available
論文名稱 Title |
跨語言抄襲檢測技術之研究 A Research on Cross-language Plagiarism Detection |
||
系所名稱 Department |
|||
畢業學年期 Year, semester |
語文別 Language |
||
學位類別 Degree |
頁數 Number of pages |
43 |
|
研究生 Author |
|||
指導教授 Advisor |
|||
召集委員 Convenor |
|||
口試委員 Advisory Committee |
|||
口試日期 Date of Exam |
2020-07-21 |
繳交日期 Date of Submission |
2020-07-31 |
關鍵字 Keywords |
跨語言空間投影、文字重用、抄襲檢測、跨語言抄襲檢測 Text Reuse, Plagiarism Detection, Cross-Lingual Plagiarism Detection, Cross-Lingual Mapping |
||
統計 Statistics |
本論文已被瀏覽 6164 次,被下載 139 次 This thesis has been viewed 6164 times and downloaded 139 times. |
中文摘要 |
現今透過網際網路取得不同語言的電子化論文相當容易,跨語言抄襲的問題變得相當常見,檢測跨語言抄襲需要比較不同語言文字片段的語義相似性的能力。先前相關研究的方法依賴大量的雙語字典、可比較語料庫、平行語料庫、翻譯工具當作參考,對於特定領域的應用,收集這類資源是相當費時費力且不切實際的。此外,這些方法難以提供為何這段文字被認定是抄襲的解釋性。本論文提出了四步驟的框架來實現跨語言抄襲檢測,在第四步驟提出了一個新的檢測方法CL-WMD,該方法是利用詞嵌入的技術,只需要少量的雙語字典,就能夠透過word mover’s distance來計算跨語言文字的語義距離,並提供不同語言文本的字詞對應,這種對應關係可以為跨語言抄襲檢測提供解釋性。我們的實驗證明無論跨語言的詞向量是建構在相關語料庫或非相關語料庫,CL-WMD都比現有方法在準確度上有更佳的表現。 |
Abstract |
Because electronic theses in different languages are easily accessible over the Internet, cross-language plagiarism has become common. Detecting plagiarism across languages requires the ability to compare the semantic similarity of text spans written in different languages. Previous work relies on large bilingual dictionaries, comparable corpora, parallel corpora, or translation tools as references. Moreover, these methods can hardly explain why a text span is flagged as plagiarism. In this thesis, we propose a four-step framework for cross-language plagiarism detection. For the final step of the framework we propose a new approach, CL-WMD, which is built on word embedding techniques. CL-WMD requires only a small bilingual dictionary: it computes the semantic distance between texts in different languages with the word mover’s distance and provides the correspondence between words in the two text spans, which makes the detection result interpretable. Our experiments show that, whether the cross-language word embeddings are built from an in-domain or an out-of-domain corpus, CL-WMD achieves higher accuracy than most existing methods and outperforms the translation-based method on paragraph-level and sentence-level plagiarism detection tasks. |
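The idea behind CL-WMD as the abstract describes it (map two embedding spaces together using a small bilingual dictionary, then compare texts with the word mover's distance and read off the word correspondences) can be illustrated with a minimal sketch. This is not the thesis's implementation. The orthogonal Procrustes mapping, the uniform word weights, the random 3-dimensional toy vectors, the English-Spanish word pairs, and the names `learn_mapping`, `word_movers_distance`, and `embed` are all assumptions made for illustration.

```python
import numpy as np
from scipy.optimize import linprog


def learn_mapping(src_vecs, tgt_vecs):
    """Orthogonal W minimising ||src_vecs @ W - tgt_vecs||_F (Procrustes; an assumption)."""
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt


def word_movers_distance(doc_a, doc_b):
    """Return (distance, flow) for two arrays of word vectors with uniform word weights."""
    n, m = len(doc_a), len(doc_b)
    # cost[i, j] is the Euclidean distance between word i of doc_a and word j of doc_b.
    cost = np.linalg.norm(doc_a[:, None, :] - doc_b[None, :, :], axis=2)
    # Transport constraints: row i of the flow sums to 1/n, column j sums to 1/m.
    a_eq = np.zeros((n + m, n * m))
    for i in range(n):
        a_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        a_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])
    result = linprog(cost.ravel(), A_eq=a_eq, b_eq=b_eq,
                     bounds=(0, None), method="highs")
    return result.fun, result.x.reshape(n, m)


# Toy data (hypothetical): the "Spanish" space is a rotated copy of the "English" space.
rng = np.random.default_rng(0)
english = {w: rng.normal(size=3) for w in ["cat", "dog", "fish", "eats", "car"]}
true_rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
translation = {"cat": "gato", "dog": "perro", "fish": "pescado",
               "eats": "come", "car": "coche"}
spanish = {translation[w]: v @ true_rotation for w, v in english.items()}

# A small seed dictionary: only three word pairs are used to learn the mapping.
seed_pairs = [("cat", "gato"), ("dog", "perro"), ("fish", "pescado")]
mapping = learn_mapping(np.array([english[e] for e, _ in seed_pairs]),
                        np.array([spanish[s] for _, s in seed_pairs]))


def embed(words, table, projection=None):
    """Stack the vectors of `words`; optionally project them into the other space."""
    vecs = np.array([table[w] for w in words])
    return vecs if projection is None else vecs @ projection


source_doc = embed(["gato", "come", "pescado"], spanish)
translated_copy = embed(["cat", "eats", "fish"], english, mapping)
unrelated_doc = embed(["dog", "car"], english, mapping)

dist_copy, flow = word_movers_distance(translated_copy, source_doc)
dist_other, _ = word_movers_distance(unrelated_doc, source_doc)
alignment = flow.argmax(axis=1)  # index of the source word each suspicious word maps to
```

In this toy setup `dist_copy` is close to zero while `dist_other` is not, and `alignment` pairs each word of the translated copy with its counterpart in the source text. That flow matrix is the kind of word-level correspondence the abstract refers to as a source of interpretability. With real embeddings the two spaces are only approximately related, so distances would be compared against a threshold or ranked rather than tested against zero.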
目次 Table of Contents |
論文審定書 Thesis Approval Form i
誌謝 Acknowledgements ii
摘要 Abstract (Chinese) iii
Abstract iv
Table of Contents v
1 Introduction 1
2 Related Work 6
  2.1 Cross-Language Plagiarism Detection 6
  2.2 Word Embedding based Information Retrieval 8
    2.2.1 Naïve average scheme 9
    2.2.2 Tf-idf weighted scheme 9
    2.2.3 Smooth Inverse Frequency (SIF) 9
3 Methodology 11
  3.1 Preprocessing 12
  3.2 Word Vector Space Building 13
  3.3 Candidate Retrieval 17
  3.4 Detailed Analysis 18
4 Experiment 22
  4.1 Dataset description 22
    4.1.1 Experimental Settings 24
    4.1.2 Compared Methods 24
  4.2 Experimental Results 25
    4.2.1 Detailed analysis 25
    4.2.2 Candidate retrieval 27
    4.2.3 Combine candidate retrieval and detailed analysis 29
    4.2.4 Performance in other languages 30
5 Conclusion and future work 32
Reference 33 |
電子全文 Fulltext |
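This electronic full text is licensed to users solely for academic research purposes, for personal, non-profit searching, reading, and printing. Please comply with the relevant provisions of the Copyright Act of the Republic of China; do not reproduce, distribute, adapt, repost, or broadcast it without authorization, to avoid breaking the law. |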
紙本論文 Printed copies |
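Availability information for printed theses is relatively complete from ROC academic year 102 onward. To look up the availability of printed theses from ROC academic year 101 or earlier, please contact the printed thesis service desk at the Office of Library and Information Services. We apologize for any inconvenience. 開放時間 Available: 已公開 available |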