Segmentation of Documents Based on Graph Neural Networks: from Strings to Words
https://doi.org/10.15514/ISPRAS-2025-37(6)-14
Abstract
The paper presents a method for analyzing the layout of PDF documents based on graph neural networks (GNNs). The method uses words as graph nodes to overcome the limitations of current approaches based on text lines or local regions. The proposed WordGLAM model, built on modified graph convolutional layers, demonstrates that hierarchical structures can be constructed through word aggregation, which balances the accuracy of element detection against the semantic connectivity of the elements. Although the model lags behind state-of-the-art models (for example, Vision Grid Transformer) in accuracy metrics, the study reveals systemic problems in the field: data imbalance, ambiguity in word clustering ("chain links" and "bridges" between unrelated regions), and debatable criteria for selecting classes in the annotation. The key contribution of this work is the formulation of new research tasks, including optimization of vector representations of words, incorporation of edge embeddings, and development of evaluation methods for complex word hierarchies. The results confirm the promise of the approach for creating adaptable models capable of processing documents of various formats (scientific articles, legal texts). The paper highlights the need for further research on regularization and on extending the training data, which would improve the portability of layout analysis methods to new domains. The code and models are published on GitHub (https://github.com/YRL-AIDA/wordGLAM).
Keywords
About the Authors
Daniil Evgenievich KOPYLOV
Russian Federation
Master's student at Irkutsk State University, employee of the Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences. Research interests: applied mathematics, data analysis.
Andrey Anatolievitch MIKHAYLOV
Russian Federation
Head of the Youth Laboratory of AI, Data Processing and Analysis at the Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences. Research interests: document analysis, image recognition.
Roman Igorevich TRIFONOV
Russian Federation
Student at Irkutsk State University, employee of the Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences. Research interests: applied informatics, data analysis, neural networks.
References
1. Kise K. Page Segmentation Techniques in Document Analysis. In: Doermann, D., Tombre, K. (eds) Handbook of Document Image Processing and Recognition, 2014, Springer, London, pp. 135-175. DOI: 10.1007/978-0-85729-859-1_5.
2. BinMakhashen G. M., Mahmoud S. A. Document Layout Analysis: A Comprehensive Survey. ACM Computing Surveys (CSUR), vol. 52, issue 6, pp. 1-36. DOI: 10.1145/3355610.
3. Tsujimoto S., Asada H. Major components of a complete text reading system. In Proc. of the IEEE, 1992, 80(7), pp. 1133-1149. DOI: 10.1109/5.156475.
4. Koroteev M. V. BERT: a review of applications in natural language processing and understanding. CoRR, vol. abs/2103.11943, 2021 [Online]. Available at: https://arxiv.org/abs/2103.11943.
5. Huang Y., Lv T., Cui L., Lu Y., Wei F. LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking. Proc. of the 30th ACM International Conference on Multimedia, 2022, pp. 4083-4091. DOI: 10.1145/3503161.3548112.
6. Da C., Luo C., Zheng Q., Yao C. Vision Grid Transformer for Document Layout Analysis. IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2023, pp. 19405-19415, DOI: 10.1109/ICCV51070.2023.01783.
7. Sun T., Cui C., Du Y., Liu Y. PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction. CoRR, vol. abs/2503.17213, 2025 [Online]. Available at: https://arxiv.org/abs/2503.17213.
8. Zhong X., Tang J., Jimeno-Yepes A. PubLayNet: Largest Dataset Ever for Document Layout Analysis. 2019 International Conference on Document Analysis and Recognition (ICDAR), 2019, pp. 1015-1022. DOI: 10.1109/ICDAR.2019.00166.
9. Maia A. L. L. M., Julca-Aguilar F. D., Hirata N. S. T. A Machine Learning Approach for Graph-Based Page Segmentation. In Proc. 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Parana, Brazil, 2018, pp. 424-431. DOI: 10.1109/SIBGRAPI.2018.00061.
10. Wang R., Fujii Y., Popat A.C. General-Purpose OCR Paragraph Identification by Graph Convolution Networks. CoRR, vol. abs/2101.12741, 2021 [Online]. Available at: https://arxiv.org/abs/2101.12741.
11. Wei S., Xu N. PARAGRAPH2GRAPH: A GNN-based framework for layout paragraph analysis. CoRR, vol. abs/2304.11810, 2023 [Online]. Available at: https://arxiv.org/abs/2304.11810.
12. Wang J. et al. A Graphical Approach to Document Layout Analysis. Proc. of the 17th ICDAR, 2023, vol. 14191, pp. 53-69. DOI: 10.1007/978-3-031-41734-4_4.
13. Dai H.-S., Li X.-H., Yin F., Yan X., Mei S., Liu C.-L. GraphMLLM: A Graph-Based Multi-level Layout Language-Independent Model for Document Understanding. Proc. of the 18th ICDAR, 2024, vol. 14804, pp. 227-243. DOI: 10.1007/978-3-031-70533-5_14.
14. Chen Y. et al. Graph-based Document Structure Analysis. CoRR, vol. abs/2502.02501, 2025 [Online]. Available at: https://arxiv.org/abs/2502.02501.
15. O'Gorman L. The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15 (11), pp. 1162-1173, 1993. DOI: 10.1109/34.244677.
16. Kise K., Sato A., Iwata M. Segmentation of Page Images Using the Area Voronoi Diagram. Comput. Vis. Image Underst., 1998, vol. 70, pp. 370-382. DOI: 10.1006/cviu.1998.0684.
17. Xiao Y., Yan H. Text region extraction in a document image based on the Delaunay tessellation. Pattern Recognit., 2003, vol. 36, pp. 799-809. DOI: 10.1016/S0031-3203(02)00082-1.
18. Tesseract User Manual, Available at: https://tesseract-ocr.github.io/tessdoc, accessed 5.08.2025.
19. PrecisionPDF, Available at: https://github.com/YRL-AIDA/PrecisionPDF, accessed 5.08.2025.
20. Kopylov D., Mikhaylov A. How To Classify Document Segments Using Graph Based Representation and Neural Networks. Ivannikov Memorial Workshop (IVMEM), 2024, pp. 36-41. DOI: 10.1109/IVMEM63006.2024.10659393.
21. Du J., Zhang S., Wu G., Moura J. M. F., Kar S. Topology Adaptive Graph Convolutional Networks. CoRR, vol. abs/1710.10370, 2018 [Online]. Available at: https://arxiv.org/abs/1710.10370.
22. Kipf T. N., Welling M. Semi-Supervised Classification with Graph Convolutional Networks. CoRR, vol. abs/1609.02907, 2017 [Online]. Available at: https://arxiv.org/abs/1609.02907.
23. Hendrycks D., Gimpel K. Gaussian error linear units (GELUs). CoRR, vol. abs/1606.08415, 2016 [Online]. Available at: https://arxiv.org/abs/1606.08415.
24. Everingham M., Van Gool L., Williams C.K.I. et al. The Pascal Visual Object Classes (VOC) Challenge. International Journal of Computer Vision, 2010, vol. 88, pp. 303–338. DOI: 10.1007/s11263-009-0275-4.
25. Pfitzmann B., Auer C., Dolfi M., Nassar A. S., Staar P. DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis. In Proc. of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’22), 2022, pp. 3743-3751. DOI: 10.1145/3534678.3539043.
26. Cheng H. et al. M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 15138-15147. DOI: 10.1109/CVPR52729.2023.01453.
For citations:
KOPYLOV D.E., MIKHAYLOV A.A., TRIFONOV R.I. Segmentation of Documents Based on Graph Neural Networks: from Strings to Words. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(6):219-232. (In Russ.) https://doi.org/10.15514/ISPRAS-2025-37(6)-14






