The methodology of Constructing the Large-Scale Dataset for Detecting Presuicidal and Anti-Suicidal Signals in Social Media Texts in Russian
https://doi.org/10.15514/ISPRAS-2025-37(6)-29
Abstract
Suicide is a tragic act committed by a person misled by their own mental state. The problem affects many countries, and Russia likewise has a fairly high number of suicide deaths. Some people at risk describe their struggles on social media, which makes it possible to find them and offer help. However, these valuable texts are lost among many irrelevant ones, which considerably slows down the assessment of a person's suicidal risk. To tackle this problem, we present a detailed methodology for building a dataset for detecting texts that contain presuicidal and anti-suicidal signals. The methodology covers the creation of the annotation instructions and the class table, the annotation process, verification, and post-annotation correction. Following this methodology, we collected and annotated a large-scale Russian dataset of more than 50 thousand social media texts. We report count statistics for the dataset and describe common annotation problems. We also conduct baseline experiments with classification models to show the performance achievable at different levels of annotation. Finally, we make the dataset, code, and all materials publicly available.
About the Authors
Igor Olegovich BUYANOV
Russian Federation
Postgraduate student at FRC CSC RAS, senior developer at MTS AI. Research interests: natural language processing, embedding space analysis, computational psychology.
Darya Valentinovna YASKOVA
Russian Federation
Master of Psychology from N. I. Lobachevsky State University of Nizhny Novgorod (2018), senior developer at MTS AI since 2019. Research interests: natural language processing, named entity recognition in specific domains, text augmentation methods.
Danil Sergeevich SERENKO
Russian Federation
A student at the Department of Mathematical Modeling and Artificial Intelligence of the Patrice Lumumba RUDN University, a researcher at Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences. Research interests: artificial intelligence, information retrieval.
Danil Nikolaevich SHKEREDA
Russian Federation
A student at the Department of Mathematical Modeling and Artificial Intelligence of the Patrice Lumumba RUDN University, a researcher at Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences. Research interests: effective training of large language models, semantic analysis of texts.
Andrey Dmitrievich YASKOV
Russian Federation
Master of Informatics from Nizhny Novgorod State Technical University n.a. R.E. Alekseev (2017), developer at Yandex since 2022. Professional interests: high-load web application development, information systems architecture, development of interactive diagram editors, vector graphics, web application accessibility.
Ilia Vladimirovich SOCHENKOV
Russian Federation
Cand. Sci. (Phys.-Math.), lead researcher at FRC CSC RAS, leading researcher at ISP RAS, leading researcher at IITP RAS. Research interests: Natural Language Processing, Information Retrieval, Big Data & Text Mining.
For citations:
BUYANOV I.O., YASKOVA D.V., SERENKO D.S., SHKEREDA D.N., YASKOV A.D., SOCHENKOV I.V. The methodology of Constructing the Large-Scale Dataset for Detecting Presuicidal and Anti-Suicidal Signals in Social Media Texts in Russian. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(6):191-210. https://doi.org/10.15514/ISPRAS-2025-37(6)-29






