Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

BERTScore for Russian

https://doi.org/10.15514/ISPRAS-2025-37(3)-10

Abstract

This paper presents a study on selecting the most suitable vector representations of Russian texts for use in the BERTScore metric. BERTScore is used to assess the quality of generated texts produced by tasks such as automatic text summarization and machine translation.
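The core computation behind BERTScore (Zhang et al., reference [1]) can be sketched as follows. This is a minimal illustration using toy NumPy vectors in place of real contextual BERT embeddings; the actual metric obtains per-token embeddings from a transformer model and optionally applies IDF weighting, both of which are omitted here.

```python
import numpy as np

def bertscore(ref_emb: np.ndarray, cand_emb: np.ndarray):
    """Compute BERTScore precision, recall, and F1 from token embeddings.

    ref_emb:  (n_ref, d) matrix of reference-token embeddings.
    cand_emb: (n_cand, d) matrix of candidate-token embeddings.
    """
    # L2-normalize rows so that dot products become cosine similarities.
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    sim = ref @ cand.T  # (n_ref, n_cand) cosine-similarity matrix

    # Greedy matching: each token is paired with its most similar
    # counterpart on the other side.
    recall = sim.max(axis=1).mean()     # reference tokens vs. candidate
    precision = sim.max(axis=0).mean()  # candidate tokens vs. reference
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With identical embedding matrices the sketch yields P = R = F1 = 1; dropping candidate tokens lowers recall while precision can stay high, which is the behavior the metric is designed to capture.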

About the Authors

Elena Pavlovna BRUCHES
Siberian Neural Networks
Russian Federation

Cand. Sci. (Tech.) in Computer Science, leading research engineer at Siberian Neural Networks. Research interests: natural language processing, information extraction, language models, machine learning.



Dari Timurovna BATUROVA
Siberian Neural Networks
Russian Federation

Holds a Bachelor's degree in Mechatronics and Robotics from Novosibirsk State University; research developer at the Siberian Neural Networks company. Research interests: deep learning, neural networks, natural language processing, machine translation.



Ivan Yurevich BONDARENKO
Novosibirsk State University
Russian Federation

A researcher in the Laboratory of Applied Digital Technologies and a senior lecturer in the Department of Fundamental and Applied Linguistics at Novosibirsk State University. He is also the co-founder of the Siberian Neural Networks company. Research interests: machine learning, neural networks, computational linguistics, speech recognition.



References

1. Zhang T., Kishore V., Wu F., Weinberger K., Artzi Y. BERTScore: Evaluating Text Generation with BERT. International Conference on Learning Representations, 2020.

2. Papineni K., Roukos S., Ward T., Zhu W. Bleu: a Method for Automatic Evaluation of Machine Translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.

3. Lin Ch. ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out, 2004, pp. 74–81.

4. Banerjee S., Lavie A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 2005, pp. 65–72.

5. Snover M., Dorr B., Schwartz R., Micciulla L., Makhoul J. A Study of Translation Edit Rate with Targeted Human Annotation. Proceedings of the 7th Conference of the Association for Machine Translation in the Americas: Technical Papers, 2006, pp. 223–231.

6. Pillutla K., Swayamdipta S., Zellers R., Thickstun J., Welleck S., Choi Y., Harchaoui Z. MAUVE: measuring the gap between neural text and human text using divergence frontiers. Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021, pp. 4816–4828.

7. Gehrmann S., Strobelt H., Rush A. GLTR: Statistical Detection and Visualization of Generated Text. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2019, pp. 111–116.

8. Zhao W., Peyrard M., Liu F., Gao Y., Meyer Ch., Eger S. MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 563–578.

9. Yuan W., Neubig G., Liu P. BARTScore: Evaluating generated text as text generation. Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

10. Jiang D., Li Y., Zhang G., Huang W., Lin B., Chen W. (2023) TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks (online). Available at: https://arxiv.org/abs/2310.00752, accessed 28.10.2024.

11. Xu W., Wang D., Pan L., Song Zh., Freitag M., Wang W., Li L. INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 5967–5994.

12. Kim S., Shin J., Cho Y., Jang J., Longpre Sh., Lee H., Yun S., Shin S., Kim S., Thorne J., Seo M. (2023) Prometheus: Inducing Fine-grained Evaluation Capability in Language Models (online). Available at: https://arxiv.org/abs/2310.08491, accessed 28.10.2024.

13. Kim S., Suk J., Longpre Sh., Lin B., Shin J., Welleck S., Neubig G., Lee M., Lee K., Seo M. (2024) Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models (online). Available at: https://arxiv.org/abs/2405.01535, accessed 28.10.2024.

14. Vu T., Krishna K., Alzubi S., Tar Ch., Faruqui M., Sung Y. (2024) Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation (online). Available at: https://arxiv.org/abs/2407.10817, accessed 28.10.2024.

15. Koo R., Lee M., Raheja V., Park J., Kim Z., Kang D. Benchmarking Cognitive Biases in Large Language Models as Evaluators. Findings of the Association for Computational Linguistics ACL 2024, 2024, pp. 517–545.

16. DIALOGSum Corpus. Available at: https://huggingface.co/datasets/d0rj/dialogsum-ru, accessed 11.02.2025.

17. Chen Y., Liu Y., Chen L., Zhang Y. DialogSum: A Real-Life Scenario Dialogue Summarization Dataset. Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, 2021, pp. 5062–5074.

18. Reviews Russian. Available at: https://huggingface.co/datasets/trixdade/reviews_russian, accessed 11.02.2025.

19. Sakhovskiy A., Izhevskaia A., Pestova A., Tutubalina E., Malykh V., Smurov I., Artemova E. RuSimpleSentEval-2021 Shared Task: Evaluating Sentence Simplification for Russian. Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference “Dialogue” (2021), 2021, Ch. 161, pp. 607–617.

20. Tsanda A., Bruches E. (2024) Russian-Language Multimodal Dataset for Automatic Summarization of Scientific Papers (online). Available at: https://arxiv.org/abs/2405.07886, accessed 28.10.2024.

21. Telegram Financial Sentiment Summarization dataset. Available at: https://huggingface.co/datasets/mxlcw/telegram-financial-sentiment-summarization, accessed 11.02.2025.

22. Yandex Jobs dataset. Available at: https://huggingface.co/datasets/Kirili4ik/yandex_jobs, accessed 11.02.2025.

23. GigaChat. Available at: https://giga.chat/, accessed 11.02.2025.

24. YandexGPT. Available at: https://ya.ru/ai/gpt-3, accessed 11.02.2025.

25. Bai Y., Ying J., Cao Y., Lv X., He Y., Wang X., Yu J., Zeng K., Xiao Y., Lyu H., Zhang J., Li J., Hou L. Benchmarking Foundation Models with Language-Model-as-an-Examiner. Advances in Neural Information Processing Systems, 2024.

26. Liu Y., Moosavi N., Lin Ch. LLMs as Narcissistic Evaluators: When Ego Inflates Evaluation Scores. Findings of the Association for Computational Linguistics ACL 2024, 2024, pp. 12688–12701.

27. Shao W., Lei M., Hu Y., Gao P., Zhang K., Meng F., Xu P., Huang S., Li H., Qiao Y., Luo P. (2023) TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models (online). Available at: https://arxiv.org/abs/2308.03729, accessed 28.10.2024.


For citations:


BRUCHES E.P., BATUROVA D.T., BONDARENKO I.Yu. BERTScore for Russian. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(3):147-158. (In Russ.) https://doi.org/10.15514/ISPRAS-2025-37(3)-10



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)