Testing the Performance of Fact Extraction from Russian-Language Tables
https://doi.org/10.15514/ISPRAS-2025-37(5)-16
Abstract
A large share of data is published in the form of tables, which are used to solve practical problems across many domains. Specialized methods and software are being developed for the semantic interpretation (annotation) of tables and for building knowledge graphs from them. Testing such software effectively for Russian-language sources requires Russian-language datasets. This paper proposes RF-200, a Russian-language tabular dataset containing 200 tables from 26 domains, labeled using the Talisman platform. We report the results of testing our approach to fact extraction from Russian-language tables on RF-200: it reached F1 = 0.464, surpassing traditional methods of fact extraction from text (F1 = 0.277). The results underline the importance of specialized solutions for structured data, especially for Russian-language sources. The practical significance of the work lies in the integration of the approach into the Talisman platform, which expands the semantic analytics available for tables. The study contributes to the automation of table processing, addresses semantic interpretation in the context of linguistic diversity, and opens up prospects for integrating deep learning methods and scaling the dataset.
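The F1 values reported above combine precision and recall into a single score. As a minimal sketch of how such a score can be computed for extracted facts, the Python snippet below compares predicted and gold facts as exact-match triples. The triple representation, the function name `precision_recall_f1`, and the toy data are assumptions made for illustration; the abstract does not specify the evaluation protocol actually used with RF-200.

```python
from typing import Iterable, Tuple

# Hypothetical fact representation: (subject, predicate, object).
Fact = Tuple[str, str, str]


def precision_recall_f1(
    predicted: Iterable[Fact], gold: Iterable[Fact]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact-match comparison of facts."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)

    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0

    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy data, not taken from RF-200.
gold = [
    ("entity_a", "located_in", "entity_b"),
    ("entity_a", "has_type", "city"),
    ("entity_c", "has_type", "lake"),
]
predicted = [
    ("entity_a", "located_in", "entity_b"),  # correct
    ("entity_c", "has_type", "river"),       # wrong object
]

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```

Exact matching is the strictest choice. An evaluation that gives credit for partially correct facts would produce different numbers on the same output.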
About the Authors
Nikita Olegovych DORODNYKH
Russian Federation
Cand. Sci. (Tech.), senior associate researcher at the Matrosov Institute for System Dynamics and Control Theory, SB RAS (ISDCT SB RAS), since 2021. Research interests: computer-aided development of intelligent systems and knowledge bases, knowledge acquisition based on the transformation of conceptual models and tables.
Alexander Yurievich YURIN
Russian Federation
Dr. Sci. (Tech.), Head of the laboratory “Information and telecommunication technologies for investigation of natural and technogenic safety” at ISDCT SB RAS and professor at the Institute of Information Technologies and Data Analysis of Irkutsk National Research Technical University (INRTU). His research interests include the development of decision support systems, expert systems and knowledge bases; the application of case-based reasoning and semantic technologies in the design of diagnostic intelligent systems; and maintaining the reliability and safety of complex technical systems.
For citations:
DORODNYKH N.O., YURIN A.Yu. Testing the Performance of Fact Extraction from Russian-Language Tables. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(5):205-224. (In Russ.) https://doi.org/10.15514/ISPRAS-2025-37(5)-16






