Building neural network models for morphological and morpheme analysis of texts
https://doi.org/10.15514/ISPRAS-2021-33(4)-9
Abstract
Morphological analysis of text is one of the most important stages of natural language processing (NLP). Traditional and well-studied problems of morphological analysis include normalization (lemmatization) of a given word form, recognition of its morphological characteristics, and their disambiguation. Morphological analysis also involves morpheme segmentation of words (i.e., splitting words into constituent morphs and classifying them), which is relevant for some NLP applications. In recent years, several machine learning models have been developed that increase the accuracy of traditional morphological analysis and morpheme segmentation, but the performance of such models is insufficient for many applied problems. For morpheme segmentation, high-precision models have been built only for lemmas (normalized word forms). This paper describes two new high-accuracy neural network models that perform morpheme segmentation of Russian word forms with sufficiently high performance. The first model is based on convolutional neural networks and achieves state-of-the-art quality of morpheme segmentation for Russian word forms. The second model, in addition to segmenting a word form into morphemes, first refines its morphological characteristics, thereby disambiguating them. This joint morphological model has the best performance among the morpheme segmentation models considered, with comparable segmentation accuracy.
About the Author
Alexander Sergeevich SAPIN
Russian Federation
Post-graduate student of Algorithmic Languages Department, CMC Faculty
For citations:
SAPIN A.S. Building neural network models for morphological and morpheme analysis of texts. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2021;33(4):117-130. (In Russ.) https://doi.org/10.15514/ISPRAS-2021-33(4)-9