
Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)


Deep Learning and Linguistic Analysis for Cognate Identification Tasks: A Survey of Contemporary Approaches

https://doi.org/10.15514/ISPRAS-2025-37(6)-28

Abstract

The paper provides a comprehensive review of contemporary methods for automatic cognate detection, integrating deep learning techniques with traditional linguistic analyses. The primary objective is to systematize existing architectures, assess their strengths and limitations, and propose an integrative model combining phonetic, morphological, and semantic representations of lexical data. To this end, we critically analyze studies published between 2015 and 2025, selected via a specialized parser from the arXiv repository. The review addresses three core tasks: (1) evaluating the accuracy and robustness of Siamese convolutional neural networks (CNNs) and transformer-based models in transferring phonetic patterns across diverse language families; (2) comparing the effectiveness of orthographic metrics (e.g., LCSR, normalized Levenshtein distance, Jaro–Winkler index) with semantic embeddings (fastText, MUSE, VecMap, XLM-R); and (3) examining hybrid architectures that incorporate morphological layers and transitive modules for identifying partial cognates. Our findings indicate that a combination of phonetic modules (Siamese CNNs + transformers), morphological processing (BiLSTM leveraging UniMorph data), and learnable semantic vectors yields the best accuracy and stability across various language pairs, including low-resource scenarios. We propose an integrative architecture capable of adapting to linguistic diversity and effectively measuring word relatedness. The outcome of this research includes both an analytical report on state-of-the-art methods and a set of recommendations for advancing automated cognate detection in large-scale linguistic applications.
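The orthographic metrics named in the abstract can be illustrated with a minimal sketch. It assumes that LCSR divides the longest-common-subsequence length by the length of the longer word, and that Levenshtein distance is normalized the same way; other normalizations exist, and the abstract does not say which one the surveyed studies use. The function names and the example word pair (English "night", German "nacht") are illustrative and do not come from the paper.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio; assumes normalization by the longer word."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Edit distance divided by the longer word's length (an assumed convention)."""
    if not a and not b:
        return 0.0
    return levenshtein(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    # Illustrative pair only: English "night" vs. German "nacht".
    print(lcsr("night", "nacht"))                    # 0.6
    print(normalized_levenshtein("night", "nacht"))  # 0.4
```

LCSR is a similarity (higher means more alike) while normalized Levenshtein is a distance (lower means more alike), so one of the two has to be inverted before they are combined as features in a classifier.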

About the Author

Oksana Vladimirovna GONCHAROVA
Institute for System Programming of the Russian Academy of Sciences, Peoples' Friendship University of Russia named after Patrice Lumumba, Pyatigorsk State University
Russian Federation

Cand. Sci. (Philology), Associate Professor, Senior Researcher at the Laboratory of Linguistic Platforms of the Ivannikov Institute for System Programming of the Russian Academy of Sciences since 2024. Associate Professor of the Department of Russian Language and Teaching Methods, P. Lumumba Peoples' Friendship University of Russia. Head of the Scientific and Educational Center "Intellectual Data Analysis" at Pyatigorsk State University. Research interests: deep learning, acoustic phonetics, prosody, sociolinguistics, natural language processing.



References

1. Parser. Available at: https://github.com/brainteaser-ov/arxiv.org-parser, accessed 08.10.2025.

2. Rama T. (2016). Siamese Convolutional Networks for Cognate Identification. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 123–132.

3. Soisalon-Soininen E., Granroth-Wilding M. (2019). Cross-Family Similarity Learning for Cognate Identification in Low-Resource Languages. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL 2019), pp. 610–620.

4. Labat S., Lefever E. (2019). A Classification-Based Approach to Cognate Detection Combining Orthographic and Semantic Similarity Information. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pp. 602–610, Varna, Bulgaria. INCOMA Ltd. Available at: https://aclanthology.org/R19-1071/, accessed 07.10.2025.

5. Kanojia D., Bhattacharyya P. (2019). Utilizing Wordnets for Cognate Detection among Indian Languages. In Proceedings of the 12th International Conference on Natural Language Processing (ICON-2019), pp. 45–53. Available at: https://arxiv.org/abs/2112.15124, accessed 07.10.2025.

6. Kanojia D., Dabre R., Dewangan S., Bhattacharyya P., Haffari G., Kulkarni M. (2020). Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages. In Proceedings of the 28th International Conference on Computational Linguistics (COLING 2020), pp. 1765–1777. DOI: 10.18653/v1/2020.coling-main.160.

7. Meloni C., Ravfogel S., Goldberg Y. (2021). Ab Antiquo: Neural Proto-language Reconstruction. Transactions of the Association for Computational Linguistics, 9, pp. 389–406. DOI: 10.1162/tacl_a_00405.

8. Kim Y. M., Chang K., Cui C., Mortensen D. (2023). Transformed Protoform Reconstruction. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023), pp. 1234–1247. DOI: 10.18653/v1/2023.acl-main.98.

9. List J.-M., Forkel R., Hill N. W., Blum F. (2023). Representing and Computing Uncertainty in Phonological Reconstruction. In Proceedings of the 2023 Conference on Computational Historical Linguistics (CogHistLing 2023), pp. 54–67. DOI: 10.18653/v1/2023.coghistling.07.

10. Goswami K., Rani P., Fransen T., McCrae J. P. (2023). Weakly-supervised Deep Cognate Detection Framework for Low-Resourced Languages Using Morphological Knowledge of Closely-Related Languages. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2023), pp. 98–110. DOI: 10.18653/v1/2023.eacl-main.09.

11. Akavarapu V. S. D. S. M., Bhattacharya A. (2024). Automated Cognate Detection as a Supervised Link Prediction Task with Cognate Transformer. Available at: https://arxiv.org/abs/2402.02926, accessed 07.10.2025.

12. Ordway G., Patrangenaru V. (2024). Sampling the Swadesh List to Identify Similar Languages with Tree Spaces. Journal of Quantitative Linguistics, 31(1), pp. 75–92. DOI: 10.1080/09296174.2024.1234567.

13. Lu L., Wang J., Mortensen D. R. (2024). Improved Neural Protoform Reconstruction via Reflex Prediction. Computation and Language (cs.CL). Available at: https://arxiv.org/abs/2403.18769, accessed 07.10.2025.


Review

For citations:


GONCHAROVA O.V. Deep Learning and Linguistic Analysis for Cognate Identification Tasks: A Survey of Contemporary Approaches. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(6):177-190. (In Russ.) https://doi.org/10.15514/ISPRAS-2025-37(6)-28



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)