
Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)


Automatic data labeling for document image segmentation using deep neural networks

https://doi.org/10.15514/ISPRAS-2022-34(6)-10

Abstract

The article proposes a new method for automatically annotating data for document image segmentation with deep object detection neural networks. Tagged PDF files serve as the source data for annotation. This format embeds hidden tags that describe the logical and physical structure of the document. To extract them, a tool was developed that simulates a stack-based printing machine in accordance with the PDF format specification. For each page of the document, an image and an annotation in PASCAL VOC format are generated. The classes and coordinates of the bounding boxes are computed from the tags while the tagged PDF file is interpreted. To test the method, a collection of tagged PDF files was assembled, from which page images and annotations for three segmentation classes (text, table, figure) were obtained automatically. An EfficientDet D2 neural network was trained on these data. The model was tested on manually labeled data from the same domain, which confirmed that automatically generated data are effective for solving applied problems.
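The annotation step described in the abstract can be illustrated with a minimal Python sketch. It is not the author's tool: the function names, the 72 points-per-inch conversion, and the sample box are assumptions made for illustration. The sketch converts a bounding box from PDF page space (origin at the bottom-left, units in points) to image pixels (origin at the top-left) and writes the result as a PASCAL VOC style XML annotation for the three classes named in the abstract.

```python
import xml.etree.ElementTree as ET

# The three segmentation classes named in the abstract.
CLASSES = {"text", "table", "figure"}


def pdf_box_to_pixels(box, page_height_pt, dpi=72):
    """Convert a PDF-space box to pixel coordinates (hypothetical helper).

    box is (x0, y0, x1, y1) in points with the origin at the bottom-left.
    The result is (xmin, ymin, xmax, ymax) in pixels with the origin at
    the top-left, assuming the page was rendered at the given dpi.
    """
    scale = dpi / 72.0  # PDF user space defaults to 72 units per inch
    x0, y0, x1, y1 = box
    xmin = round(min(x0, x1) * scale)
    xmax = round(max(x0, x1) * scale)
    # Flip the vertical axis: the top edge in PDF space has the larger y.
    ymin = round((page_height_pt - max(y0, y1)) * scale)
    ymax = round((page_height_pt - min(y0, y1)) * scale)
    return xmin, ymin, xmax, ymax


def build_voc_annotation(filename, width, height, objects):
    """Build a PASCAL VOC style annotation as an XML string.

    objects is a list of (class_name, (xmin, ymin, xmax, ymax)) tuples
    in pixel coordinates.
    """
    root = ET.Element("annotation")
    ET.SubElement(root, "filename").text = filename
    size = ET.SubElement(root, "size")
    ET.SubElement(size, "width").text = str(width)
    ET.SubElement(size, "height").text = str(height)
    ET.SubElement(size, "depth").text = "3"  # assumes an RGB page image
    for name, coords in objects:
        if name not in CLASSES:
            raise ValueError(f"unknown class: {name}")
        obj = ET.SubElement(root, "object")
        ET.SubElement(obj, "name").text = name
        ET.SubElement(obj, "difficult").text = "0"
        bndbox = ET.SubElement(obj, "bndbox")
        for tag, value in zip(("xmin", "ymin", "xmax", "ymax"), coords):
            ET.SubElement(bndbox, tag).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Usage: a US Letter page (612 x 792 pt) rendered at 144 dpi, with one
# table whose box in PDF space is an illustrative value.
pixel_box = pdf_box_to_pixels((72, 648, 540, 720), page_height_pt=792, dpi=144)
xml_text = build_voc_annotation("page_001.png", 1224, 1584, [("table", pixel_box)])
```

The vertical flip is the step most easily missed: PDF coordinates grow upward from the bottom of the page, while image and PASCAL VOC coordinates grow downward from the top.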

About the Author

Andrey Anatolievitch MIKHAYLOV
Matrosov Institute for System Dynamics and Control Theory of the SB RAS, Ivannikov Institute for System Programming of the RAS
Russian Federation

Candidate of Technical Sciences, Senior Researcher in the Laboratory of Complex Information Systems at IDSTU SB RAS, Researcher at ISP RAS



References

1. Lee E., Park J. et al. Deep-learning and graph-based approach to table structure recognition. Multimedia Tools and Applications, vol. 81, issue 4, 2022, pp. 5827-5848.

2. Le V.P., Nayef N. Text and non-text segmentation based on connected component features. In Proc. of the 13th International Conference on Document Analysis and Recognition (ICDAR), 2015, pp. 1096-1100.

3. Wong K.Y., Casey R.G., Wahl F.M. Document analysis system. IBM Journal of Research and Development, vol. 26, issue 6, 1982, pp. 647-656.

4. Okun O., Doermann D., Pietikainen M. Page segmentation and zone classification: The state of the art. Technical Report LAMP-TR-036, CAR-TR-927, CS-TR-4079. University of Maryland, 1999, 38 p.

5. Moll M.A., Baird H.S., An C. Truthing for pixel-accurate segmentation. In Proc. of the Eighth IAPR International Workshop on Document Analysis Systems, 2008, pp. 379-385.

6. Moll M.A., Baird H.S. Segmentation-based retrieval of document images from diverse collections. In Proc. of the IS&T/SPIE 20th Annual Symposium on Electronic Imaging, 2008, 8 p.

7. Fletcher L.A., Kasturi R. A robust algorithm for text string separation from mixed text/graphics images. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10, issue 6, 1988, pp. 910-918.

8. Tombre K., Tabbone S. et al. Text/graphics separation revisited. Lecture Notes in Computer Science, vol. 2423, 2002, pp. 200-211.

9. Bukhari S.S., Al Azawi M.I.A. et al. Document image segmentation using discriminative learning over connected components. In Proc. of the 9th IAPR International Workshop on Document Analysis Systems, 2010, pp. 183-190.

10. Kang L., Kumar J. et al. Convolutional neural networks for document image classification. In Proc. of the 22nd International Conference on Pattern Recognition, 2014, pp. 3168-3172.

11. Harley A.W., Ufkes A., Derpanis K.G. Evaluation of deep convolutional nets for document image classification and retrieval. In Proc. of the 2015 13th International Conference on Document Analysis and Recognition (ICDAR), 2015, pp. 991-995.

12. Oliveira D.A.B., Viana M.P. Fast CNN-based document layout analysis. In Proc. of the IEEE International Conference on Computer Vision, 2017, pp. 1173-1180.

13. Vincent N., Ogier J.M. Shall deep learning be the mandatory future of document analysis problems? Pattern Recognition, vol. 86, 2019, pp. 281-289.

14. Clausner C., Antonacopoulos A., Pletschacher S. ICDAR2017 Competition on Recognition of Documents with Complex Layouts – RDCL2017. In Proc. of the 14th International Conference on Document Analysis and Recognition (ICDAR), 2017, pp. 1404-1410.

15. Clausner C., Antonacopoulos A., Pletschacher S. ICDAR2019 Competition on Recognition of Documents with Complex Layouts - RDCL2019. In Proc. of the 15th International Conference on Document Analysis and Recognition (ICDAR), 2019, pp. 1521-1526.

16. Gao L., Huang Y. et al. ICDAR 2019 Competition on Table Detection and Recognition (cTDaR). In Proc. of the 15th International Conference on Document Analysis and Recognition (ICDAR), 2019, pp. 1510-1515.

17. Lopes C.A.M. Junior, das Neves R.B. Junior et al. ICDAR 2021 Competition on Components Segmentation Task of Document Photos. Lecture Notes in Computer Science, vol. 12824, 2021, pp. 678-692.

18. Anitei D., Sánchez J.A. et al. ICDAR 2021 Competition on Mathematical Formula Detection. International Conference on Document Analysis and Recognition. Lecture Notes in Computer Science, vol. 12824, 2021, pp. 783-795.

19. Yepes A.J., Zhong P., Burdick D. ICDAR 2021 Competition on Scientific Literature Parsing. Lecture Notes in Computer Science, vol. 12824, 2021, pp. 605-617.

20. Adams T., Namysl M. et al. Benchmarking table recognition performance on biomedical literature on neurological disorders. Bioinformatics, vol. 38, issue 6, 2022, pp. 1624-1630.

21. Belyaeva O.V., Perminov A.I., Kozlov I.S. Synthetic data usage for document segmentation models fine-tuning. Trudy ISP RAN/Proc. ISP RAS, vol. 32, issue 4, 2020, pp. 189-202 (in Russian). DOI: 10.15514/ISPRAS-2020-32(4)-14.

22. Tan M., Pang R., Le Q.V. EfficientDet: Scalable and efficient object detection. In Proc. of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 10778-10787.



For citations:


MIKHAYLOV A.A. Automatic data labeling for document image segmentation using deep neural networks. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2022;34(6):137-146. (In Russ.) https://doi.org/10.15514/ISPRAS-2022-34(6)-10



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)