
Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)


Using Contrastive Learning for Semantic Interpretation of Russian-Language Tables

https://doi.org/10.15514/ISPRAS-2025-37(6)-23

Abstract

Tables are widely used to represent and store data, but they are typically not accompanied by the explicit semantics necessary for machine interpretation of their contents. Semantic table interpretation is critical for integrating structured data with knowledge graphs, but existing methods struggle with Russian-language tables due to limited labeled data and linguistic specificity. This paper proposes a contrastive learning-based approach that reduces dependency on manual labeling and improves column annotation quality for rare semantic types. The proposed approach adapts contrastive learning to tabular data using augmentations (removing and shuffling cells) and a multilingual DistilBERT model trained on the unlabeled RWT corpus (7.4M columns). The learned table representations are integrated into the RuTaBERT pipeline, which reduces computational costs. Experiments show a micro-F1 of 0.974 and a macro-F1 of 0.924, outperforming several baseline models. This highlights the approach’s efficiency in handling data sparsity and Russian-language features. The results confirm that contrastive learning captures semantic column similarities without explicit supervision, which is crucial for rare data types.
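The augmentations mentioned in the abstract (removing and shuffling cells) can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, drop probability, and fallback behavior are assumptions, and the sketch only shows how two stochastic views of the same column would form a positive pair in SimCLR-style contrastive training.

```python
import random

def augment_column(cells, drop_prob=0.2, seed=None):
    """Produce one augmented 'view' of a table column by randomly
    removing a fraction of cells and shuffling the remainder.
    Illustrative only; the paper's exact parameters are not given
    in the abstract."""
    rng = random.Random(seed)
    kept = [c for c in cells if rng.random() > drop_prob]
    if not kept:            # guarantee at least one surviving cell
        kept = [rng.choice(cells)]
    rng.shuffle(kept)
    return kept

# Two stochastic views of the same column form a positive pair;
# views of different columns serve as negatives in the contrastive loss.
column = ["Москва", "Санкт-Петербург", "Иркутск", "Казань"]
view_a = augment_column(column, seed=1)
view_b = augment_column(column, seed=2)
```

In a full pipeline, each view would be serialized and encoded (here, presumably by the multilingual DistilBERT encoder) before computing the contrastive loss over the batch.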

About the Authors

Kirill Vladimirovich TOBOLA
Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of Russian Academy of Sciences (ISDCT SB RAS)
Russian Federation

A postgraduate student at the Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences (ISDCT SB RAS) since 2024. Research interests: table embeddings, large language models for relational data, and data extraction from tabular sources.



Nikita Olegovych DORODNYKH
Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of Russian Academy of Sciences (ISDCT SB RAS)
Russian Federation

PhD, senior researcher at the Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences (ISDCT SB RAS) since 2021. Research interests: computer-aided development of intelligent systems and knowledge bases, and knowledge acquisition based on the transformation of conceptual models and tables.





For citations:


TOBOLA K.V., DORODNYKH N.O. Using Contrastive Learning for Semantic Interpretation of Russian-Language Tables. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(6):107-122. (In Russ.) https://doi.org/10.15514/ISPRAS-2025-37(6)-23



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)