Character N-gram-Based Word Embeddings for Morphological Analysis of Texts
https://doi.org/10.15514/ISPRAS-2020-32(2)-1
Abstract
The paper presents modifications of the fastText word embedding model, based solely on character n-grams, for morphological analysis of texts. fastText is a library for text classification and for learning vector representations of words. It computes the representation of each word as the sum of a vector for the whole word and the vectors of its character n-grams. Because fastText stores and uses a separate vector for the whole word, no such vector is available for out-of-vocabulary words, which degrades the quality of the resulting word vectors. In addition, storing vectors for whole words means fastText models typically require a lot of memory for storage and processing. This becomes especially problematic for morphologically rich languages, given their large number of word forms. Unlike the original fastText model, the proposed modifications pretrain and use vectors only for the character n-grams of a word, eliminating the reliance on word-level vectors and at the same time significantly reducing the number of parameters in the model. Two approaches are used to extract subword information from a word: internal character n-grams and suffixes. The proposed models are evaluated on morphological analysis and lemmatization of Russian, using the SynTagRus corpus, and demonstrate results comparable to the original fastText.
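For illustration, here is a minimal Python sketch of the idea the abstract describes: a word vector composed entirely from hashed character n-gram (or suffix) vectors, with no whole-word vector. All names, the hashing scheme, and the hyperparameter values below are illustrative assumptions, not the paper's actual implementation; in practice the subword vectors would be pretrained rather than randomly initialized.

```python
import zlib
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Internal character n-grams of a word, with fastText-style
    boundary markers '<' and '>'."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]

def suffixes(word, max_len=5):
    """Word suffixes up to max_len characters: the second source of
    subword information mentioned in the abstract."""
    return [word[-n:] for n in range(1, min(max_len, len(word)) + 1)]

class NgramOnlyEmbedding:
    """Word vectors composed purely of hashed subword vectors; no
    whole-word vector is stored, so any out-of-vocabulary word form
    still receives a representation (hypothetical class, for illustration)."""

    def __init__(self, dim=100, buckets=100_000, seed=0):
        # Small table for the sketch; real fastText models use on the
        # order of 2 million hash buckets.
        rng = np.random.default_rng(seed)
        self.table = rng.normal(0.0, 1.0 / dim, size=(buckets, dim))
        self.buckets = buckets

    def vector(self, word, use_suffixes=False):
        grams = suffixes(word) if use_suffixes else char_ngrams(word)
        if not grams:
            return np.zeros(self.table.shape[1])
        # A deterministic hash (zlib.crc32) maps each n-gram to a bucket;
        # the word vector is the mean of its n-gram vectors.
        idx = [zlib.crc32(g.encode("utf-8")) % self.buckets for g in grams]
        return self.table[idx].mean(axis=0)

if __name__ == "__main__":
    emb = NgramOnlyEmbedding()
    # Any word form, seen or unseen, gets a vector of the same dimension.
    print(emb.vector("кошками").shape)                      # (100,)
    print(emb.vector("кошками", use_suffixes=True).shape)   # (100,)
```

Because the lookup depends only on a word's surface form, the same code path serves in-vocabulary and out-of-vocabulary words, which is the property the abstract highlights for morphologically rich languages.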
About the Author
Tsolak Gukasovitch GHUKASYAN (Armenia)
PhD student, Department of System Programming
For citations:
GHUKASYAN Ts.G. Character N-gram-Based Word Embeddings for Morphological Analysis of Texts. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2020;32(2):7-14. (In Russ.) https://doi.org/10.15514/ISPRAS-2020-32(2)-1