Methods and techniques for automatic entity linking in Russian
https://doi.org/10.15514/ISPRAS-2022-34(4)-13
Abstract
Nowadays, there is growing interest in solving NLP tasks with the help of external knowledge bases, for example, in information retrieval, question-answering systems, dialogue systems, etc. It is therefore important to establish relations between entities in the processed text and a knowledge base. This article is devoted to entity linking, where Wikidata is used as the external knowledge base. We consider scientific terms in Russian as entities. A traditional entity linking system has three stages: entity recognition, candidate generation (from the knowledge base), and candidate ranking. Our system takes as input raw text with terms already marked in it. To generate candidates, we use string matching between terms in the input text and entities from Wikidata. The candidate ranking stage is the most complicated one because it requires semantic information. We conducted several experiments for the candidate ranking stage with different models, including an approach based on cosine similarity, classical machine learning algorithms, and neural networks. We also extended the RUSERRC dataset, adding manually annotated data for model training. The results showed that the approach based on cosine similarity outperforms the others and does not require manually annotated data. The dataset and system are open-sourced and available to other researchers.
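The three-stage pipeline described in the abstract can be sketched as follows. This is a minimal illustration only: the toy knowledge base, the bag-of-words vectors (a stand-in for the learned embeddings the paper experiments with), and all identifiers are hypothetical, not real Wikidata items.

```python
import math
from collections import Counter

# Toy knowledge base: entity id -> (label, description).
# All entries are illustrative, not real Wikidata items.
KB = {
    "Q1": ("neural network", "computing model used in machine learning to train on data"),
    "Q2": ("neural network", "network of biological neurons in the brain"),
    "Q3": ("machine learning", "study of algorithms that improve with data"),
}

def bow(text):
    """Bag-of-words vector as a Counter (stand-in for real embeddings)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def generate_candidates(term):
    """Stage 2: string match between the term and KB entity labels."""
    return [qid for qid, (label, _) in KB.items() if label == term.lower()]

def rank_candidates(context, candidates):
    """Stage 3: order candidates by cosine similarity between the
    mention's context and each candidate's description."""
    ctx_vec = bow(context)
    scored = [(cosine(ctx_vec, bow(KB[q][1])), q) for q in candidates]
    return [q for _, q in sorted(scored, reverse=True)]

context = "we train a neural network on labeled data with algorithms"
cands = generate_candidates("neural network")   # both "neural network" senses match
ranking = rank_candidates(context, cands)       # the ML sense scores higher here
```

The disambiguation happens entirely in the ranking stage: both candidates share the same label, so only the similarity between the surrounding context and each entity's description separates them.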
About the Authors
Anastasia Alekseevna MEZENTSEVA
Russian Federation
Student at NSU, first-category programmer at IIS SB RAS
Elena Pavlovna BRUCHES
Russian Federation
PhD in Technical Sciences, Junior Researcher, IIS SB RAS, Senior Lecturer at NSU
Tatiana Viktorovna BATURA
Russian Federation
PhD in Physics and Mathematics, Associate Professor, Senior Researcher
For citation:
MEZENTSEVA A.A., BRUCHES E.P., BATURA T.V. Methods and techniques for automatic entity linking in Russian. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2022;34(4):187-200. (In Russ.) https://doi.org/10.15514/ISPRAS-2022-34(4)-13