Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Plagiarism Detection in Armenian Texts Using Intrinsic Stylometric Analysis

https://doi.org/10.15514/ISPRAS-2021-33(1)-14

Abstract

In this work, we study the application of intrinsic stylometric methods to plagiarism detection in Armenian texts. We use two task setups from the PAN series of conferences on text forensics and stylometry: style change detection and style breach detection. Style change detection aims to determine whether a text was written by more than one author, while style breach detection aims to locate the boundaries between stylistically distinct text fragments. For these tasks, we generate synthetic test sets for three genres of text (academic, literary, and news) and use them to evaluate the effectiveness of hierarchical clustering and other relevant models from PAN conferences. We employ a standard set of character-level, lexical, and readability features, and additionally perform morphological and dependency parsing of text fragments to extract syntactic features that encode information about author style. The evaluation results show that the clustering-based approach fails to correctly detect style changes in longer texts and is only marginally better for shorter texts. For style breach detection, the hierarchical clustering-based approach performs better than a random baseline classifier, but the difference is not sufficient to warrant its practical use. In a complementary experiment, we show that reducing the number of features and the multicollinearity among them via PCA helps to increase the precision of style breach detection methods for certain text categories.
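As a rough illustration of the two task setups, the sketch below clusters text fragments by a handful of simple style features and then reads off either a yes/no style change decision or a list of breach positions. It is a minimal sketch under stated assumptions, not the authors' pipeline: the four features, the average-linkage rule, the distance threshold, and all function names are hypothetical choices made for this example, and it omits the morphological and dependency features and the PCA step mentioned in the abstract.

```python
import math
import re


def stylometric_features(fragment):
    """Illustrative (assumed) feature vector for one text fragment."""
    words = re.findall(r"\w+", fragment)
    # Split on Latin sentence enders and the Armenian full stop (U+0589).
    sentences = [s for s in re.split(r"[.!?։]+", fragment) if s.strip()]
    n_chars = max(len(fragment), 1)
    n_words = max(len(words), 1)
    n_punct = sum(not c.isalnum() and not c.isspace() for c in fragment)
    return [
        sum(len(w) for w in words) / n_words,       # mean word length
        n_words / max(len(sentences), 1),           # mean sentence length in words
        n_punct / n_chars,                          # punctuation ratio
        len({w.lower() for w in words}) / n_words,  # type-token ratio
    ]


def standardize(rows):
    """Z-score each feature column so that no single feature dominates."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def cluster_fragments(vectors, threshold):
    """Average-linkage agglomerative clustering, stopped at a distance threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(math.dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        if d > threshold:
            break  # the closest pair is already too dissimilar to merge
        clusters[i] += clusters.pop(j)
    return clusters


def breach_positions(clusters, n_fragments):
    """Indices where a fragment's cluster differs from its predecessor's."""
    label = {i: k for k, members in enumerate(clusters) for i in members}
    return [i for i in range(1, n_fragments) if label[i] != label[i - 1]]


def detect_style_change(fragments, threshold=2.0):
    """Style change detection: True if the fragments form more than one cluster."""
    vectors = standardize([stylometric_features(f) for f in fragments])
    return len(cluster_fragments(vectors, threshold)) > 1
```

In this sketch, style change detection reduces to checking whether more than one cluster survives, while style breach detection reports the fragment indices at which the cluster label switches. The threshold value of 2.0 is an arbitrary placeholder that would have to be tuned on held-out data.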

About the Authors

Yeva Maksimovna YESHILBASHIAN
Russian-Armenian University
Armenia

Student of Machine Learning master’s degree programme



Ariana Armenovna ASATRYAN
Russian-Armenian University
Armenia

Master's student at the Department of Mathematical Cybernetics



Tsolak Gukasovitch GHUKASYAN
Russian-Armenian University
Armenia

Postgraduate student of the Department of System Programming



References

1. Mike Kestemont, Michael Tschuggnall, Efstathios Stamatatos, Walter Daelemans, Günther Specht, Benno Stein, and Martin Potthast. Overview of the Author Identification Task at PAN-2018: Cross-domain Authorship Attribution and Style Change Detection. Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 2125, 2018.

2. Eva Zangerle, Michael Tschuggnall, Günther Specht, Martin Potthast, and Benno Stein. Overview of the Style Change Detection Task at PAN 2019. Working Notes of CLEF 2019 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 2380, 2019.

3. Michael Tschuggnall, Efstathios Stamatatos, Ben Verhoeven, Walter Daelemans, Günther Specht, Benno Stein, and Martin Potthast. Overview of the Author Identification Task at PAN 2017: Style Breach Detection and Author Clustering. Working Notes of CLEF 2017 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 1866, 2017.

4. Paolo Rosso, Francisco Rangel, Martin Potthast, Efstathios Stamatatos, Michael Tschuggnall, and Benno Stein. Overview of PAN 2016 – New Challenges for Authorship Analysis: Cross-genre Profiling, Clustering, Diarization, and Obfuscation. Lecture Notes in Computer Science, vol. 9822, 2016, pp. 332-350.

5. Sukanya Nath. Style Change Detection by Threshold Based and Window Merge Clustering Methods. Working Notes of CLEF 2019 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 2380, 2019.

6. Dimitrina Zlatkova, Daniel Kopev, Kristiyan Mitov, Atanas Atanasov, Momchil Hardalov, Ivan Koychev, and Preslav Nakov. An Ensemble-Rich Multi-Aspect Approach for Robust Style Change Detection – Notebook for PAN at CLEF 2018. Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 2125, 2018.

7. Marjan Hosseinia and Arjun Mukherjee. A Parallel Hierarchical Attention Network for Style Change Detection – Notebook for PAN at CLEF 2018. Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 2125, 2018.

8. Kamil Safin and Aleksandr Ogaltsov. Detecting a Change of Style Using Text Statistics – Notebook for PAN at CLEF 2018. Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 2125, 2018.

9. Daniel Karaś, Martyna Śpiewak, and Piotr Sobecki. OPI-JSA at CLEF 2017: Author Clustering and Style Breach Detection – Notebook for PAN at CLEF 2017. Working Notes of CLEF 2017 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 1866, 2017.

10. Jamal Ahmad Khan. Style Breach Detection: An Unsupervised Detection Model – Notebook for PAN at CLEF 2017. Working Notes of CLEF 2017 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 1866, 2017.

11. Kamil Safin and Rita Kuznetsova. Style Breach Detection with Neural Sentence Embeddings – Notebook for PAN at CLEF 2017. Working Notes of CLEF 2017 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 1866, 2017.

12. Helena Gómez-Adorno, Yuridiana Alemán, Darnes Vilariño Ayala, Miguel A. Sanchez-Perez, David Pinto, and Grigori Sidorov. Author Clustering using Hierarchical Clustering Analysis – Notebook for PAN at CLEF 2017. Working Notes of CLEF 2017 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 1866, 2017.

13. Yasmany García-Mondeja, Daniel Castro-Castro, Vania Lavielle-Castro, and Rafael Muñoz. Discovering Author Groups using a B-compact graph-based Clustering – Notebook for PAN at CLEF 2017. Working Notes of CLEF 2017 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 1866, 2017.

14. Mirco Kocher and Jacques Savoy. UniNE at CLEF 2017: Author Clustering – Notebook for PAN at CLEF 2017. Working Notes of CLEF 2017 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 1866, 2017.

15. Farkhund Iqbal, Hamad Binsalleeh, Benjamin C.M. Fung, and Mourad Debbabi. Mining writeprints from anonymous e-mails for forensic investigation. Digital Investigation, vol. 7, issue 1-2, 2010, pp. 56-64.

16. Chaoyuan Zuo, Yu Zhao, and Ritwik Banerjee. Style Change Detection with Feed-forward Neural Networks. Working Notes of CLEF 2019 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 2380, 2019.

17. Graeme Hirst and Ol’ga Feiguina. Bigrams of syntactic labels for authorship discrimination of short texts. Literary and Linguistic Computing, vol. 22, no. 4, 2007, pp. 405-417.

18. Rupesh Kumar Dewang and A. K. Singh. Identification of Fake Reviews Using New Set of Lexical and Syntactic Features. In Proc. of the Sixth International Conference on Computer and Communication Technology (ICCCT '15), 2015, pp. 115-119.

19. C. Zhao, W. Song, L. Liu, C. Du and X. Zhao. Research on Author Identification Based on Deep Syntactic Features. In Proc. of the 2017 10th International Symposium on Computational Intelligence and Design (ISCID), 2017, pp. 276-279.

20. K. Avetisyan and T. Ghukasyan. Word embeddings for the Armenian language: intrinsic and extrinsic evaluation. Bulletin of the Russian-Armenian University: Physico-Mathematical and Natural Sciences, no. 1, 2019, pp. 59-72.

21. Flurin Gishamer. Using Hashtags and POS-Tags for Author Profiling. Working Notes of CLEF 2019 – Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, vol. 2380, 2019.


For citations:


YESHILBASHIAN Ye.M., ASATRYAN A.A., GHUKASYAN Ts.G. Plagiarism Detection in Armenian Texts Using Intrinsic Stylometric Analysis. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2021;33(1):209-224. (In Russ.) https://doi.org/10.15514/ISPRAS-2021-33(1)-14



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)