Combined method for plagiarism detection in text documents
https://doi.org/10.15514/ISPRAS-2022-34(1)-11
Abstract
There are two general approaches to detecting plagiarism in text: external and intrinsic search. The first searches an external collection of documents that could have served as sources of text reuse. The second, by contrast, uses no external data and analyzes the text itself. It is proposed to combine these two approaches to speed up plagiarism detection. When a large flow of documents must be checked, an external-corpus search system processes every document and locates the plagiarized blocks in it, if there are any. Intrinsic search, however, can first be used to establish whether a document contains plagiarism at all. This reduces the number of documents that must go through the expensive search against the external corpus. Moreover, the isolated analysis of a single document does not need to locate specific plagiarized blocks; it serves instead as a single indicator of the document's overall originality. If the overall originality is low, the document is sent for a more detailed and accurate check. The proposed method thus filters out texts with a high degree of originality that need no additional verification.
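The two-stage idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the intrinsic originality indicator is computed, so the score below is an assumption chosen only for illustration: it compares the character-trigram profile of each fixed-size window with the profile of the whole document and takes the lowest similarity. The names `originality_score` and `route`, the window size, and the threshold are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams in the lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def originality_score(text: str, window: int = 200) -> float:
    """Toy intrinsic indicator (an assumption, not the authors' method):
    the lowest similarity between any window and the whole-document
    trigram profile. Stylistically uniform text scores close to 1."""
    doc_profile = char_ngrams(text)
    windows = [text[i:i + window] for i in range(0, len(text), window)]
    sims = [cosine(char_ngrams(w), doc_profile) for w in windows if len(w) >= 3]
    return min(sims) if sims else 1.0


def route(
    docs: Dict[str, str],
    score: Callable[[str], float],
    threshold: float,
) -> Tuple[List[str], List[str]]:
    """Split documents into those that skip the expensive external search
    and those that are forwarded to it because their score is low."""
    passed: List[str] = []
    forwarded: List[str] = []
    for name, text in docs.items():
        (passed if score(text) >= threshold else forwarded).append(name)
    return passed, forwarded


if __name__ == "__main__":
    uniform = "the cat sat on the mat. " * 40
    mixed = uniform + "zzzz qqqq xxxx " * 20  # stylistically foreign block
    docs = {"uniform": uniform, "mixed": mixed}
    passed, forwarded = route(docs, originality_score, threshold=0.8)
    print("skip external search:", passed)
    print("send to external search:", forwarded)
```

Only the documents in the second list would reach the external-corpus search. In practice the threshold would have to be tuned on labelled data, since setting it too low lets plagiarized documents skip the detailed check.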
About the Authors
Kamil Fanisovich SAFIN
Russian Federation
PhD student
Yury Victorovich CHEHOVICH
Russian Federation
Head of Department, Federal Research Center for Informatics and Control, Russian Academy of Sciences, CEO at Antiplagiat Company
For citations:
SAFIN K.F., CHEHOVICH Yu.V. Combined method for plagiarism detection in text documents. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2022;34(1):151-160. (In Russ.) https://doi.org/10.15514/ISPRAS-2022-34(1)-11