Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Automated Extraction of Facts from Tabular Data based on Semantic Table Annotation

https://doi.org/10.15514/ISPRAS-2024-36(3)-7

Abstract

The use of knowledge graphs in the construction of intelligent information and analytical systems makes it possible to structure and analyze knowledge effectively, process large volumes of data, improve system quality, and apply such systems in domains such as medicine, manufacturing, trade, and finance. However, domain-specific knowledge graph engineering remains a difficult task that requires specialized methods and software. One of the main trends in this area is the use of various information sources, in particular tables, which can significantly improve the efficiency of this process. This paper proposes an approach and a tool for the automated extraction of specific entities (facts) from tabular data and for populating a target knowledge graph with them, based on the semantic interpretation (annotation) of tables. The proposed approach is implemented as a special processor included in the Talisman framework. We also present an experimental evaluation of our approach and a demo of domain knowledge graph development for the Talisman framework.

About the Authors

Nikita Olegovych DORODNYKH
Matrosov Institute for System Dynamics and Control Theory of the Russian Academy of Sciences
Russian Federation

Cand. Sci. (Tech.), senior associate researcher at the Matrosov Institute for System Dynamics and Control Theory, SB RAS (ISDCT SB RAS) since 2021. Research interests: computer-aided development of intelligent systems and knowledge bases, and knowledge acquisition based on the transformation of conceptual models and tables.



Alexander Yurievich YURIN
Matrosov Institute for System Dynamics and Control Theory of the Russian Academy of Sciences
Russian Federation

Dr. Sci. (Tech.), head of the laboratory “Information and telecommunication technologies for investigation of natural and technogenic safety” at ISDCT SB RAS and associate professor at the Institute of Information Technologies and Data Analysis of Irkutsk National Research Technical University (INRTU). His research interests include the development of decision support systems, expert systems, and knowledge bases; the application of case-based reasoning and semantic technologies in the design of diagnostic intelligent systems; and maintaining the reliability and safety of complex technical systems.



For citations:


DORODNYKH N.O., YURIN A.Yu. Automated Extraction of Facts from Tabular Data based on Semantic Table Annotation. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2024;36(3):93-104. https://doi.org/10.15514/ISPRAS-2024-36(3)-7



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)