Synthetic data usage for document segmentation models fine-tuning

Oksana Vladimirovna BELYAEVA; Andrey Igorevich PERMINOV; Ilya Sergeevich KOZLOV

doi:10.15514/ISPRAS-2020-32(4)-14

Abstract

In this paper, we propose an approach to document image segmentation for the case when only a limited set of real data is available for training. The main idea of our approach is to use artificially created data for training and to apply post-processing. The domain of the paper is PDF documents without a text layer, such as scanned contracts, commercial proposals and technical specifications. As part of the task of automatic document analysis, we solve the document segmentation problem known as Document Layout Analysis (DLA). We train the well-known Faster R-CNN model \cite{ren2015faster} to segment text blocks, tables, stamps and captions in images from this domain. The aim of the paper is to generate synthetic data similar to the real data of the domain. This is necessary because the model needs a large training dataset, and preparing such a dataset is labor-intensive. We also describe a post-processing stage that eliminates artifacts produced by the segmentation. We tested and compared the quality of models trained on different datasets (with and without synthetic data, with a small or a large set of real data, with and without the post-processing stage). The results show that generating synthetic data and applying post-processing improve the quality of the model when little real training data is available.
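
The abstract does not say how the post-processing stage works. As a rough illustration only, the sketch below shows one common way to remove a typical detector artifact, duplicate overlapping boxes of the same class, using intersection over union (IoU) and greedy non-maximum suppression. The detection format `(label, box, score)`, the function names and the 0.5 threshold are assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Detection = Tuple[str, Box, float]       # (label, box, score), hypothetical format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def suppress_duplicates(detections: List[Detection],
                        threshold: float = 0.5) -> List[Detection]:
    """Greedy NMS: visit detections by descending score and keep one only if
    it does not overlap an already kept box of the same class above threshold."""
    kept: List[Detection] = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        label, box, _ = det
        if all(k[0] != label or iou(k[1], box) <= threshold for k in kept):
            kept.append(det)
    return kept


if __name__ == "__main__":
    detections = [
        ("table", (0, 0, 100, 100), 0.9),
        ("table", (5, 5, 100, 100), 0.6),      # near-duplicate of the first box
        ("stamp", (200, 200, 250, 250), 0.8),
    ]
    for label, box, score in suppress_duplicates(detections):
        print(label, box, score)
```

Suppression is applied per class here, so a stamp that overlaps a text block would not be discarded; whether that is desirable depends on the document domain.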

Keywords

Document Layout Analysis, Document Segmentation, Physical Document Structure, Image Object Detection, Model fine-tuning, Active Learning

About the Authors

Oksana Vladimirovna BELYAEVA

Ivannikov Institute for System Programming of the RAS
Russian Federation
PhD Student

Andrey Igorevich PERMINOV

Lomonosov Moscow State University
Russian Federation
Master Student

Ilya Sergeevich KOZLOV

Ivannikov Institute for System Programming of the RAS
Russian Federation
Researcher

References

1. S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, issue 6, 2017, pp. 1137-1149.

2. G.M. Binmakhashen and S.A. Mahmoud. Document layout analysis: a comprehensive survey. ACM Computing Surveys (CSUR), vol. 52, issue 6, 2019, pp. 1-36.

3. X. Zhong, J. Tang, and A.J. Yepes. Publaynet: largest dataset ever for document layout analysis. arXiv:1908.07836, 2019.

4. K. Chen, M. Seuret, J. Hennebert, and R. Ingold. Convolutional neural networks for page segmentation of historical document images. In Proc. of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, 2017, pp. 965-970.

5. C. Wick and F. Puppe. Fully convolutional neural networks for page segmentation of historical document images. In Proc. of the 13th IAPR International Workshop on Document Analysis Systems (DAS), 2018, pp. 287-292.

6. Pubmed. national library of medicine. URL: https://pubmed.ncbi.nlm.nih.gov. Accessed: 2020-21-07.

7. G. Csurka. Domain adaptation for visual applications: a comprehensive survey. arXiv:1702.05374, 2017.

8. M.A. Ryndin, D.Y. Turdakov. Domain adaptation by proactive labeling. Trudy ISP RAN/Proc. ISP RAS, vol. 31, issue 5, 2019, pp. 145-152 (in Russian). DOI: 10.15514/ISPRAS-2019-31(5)-11

9. C. R. De Souza, A. Gaidon, Y. Cabon, and A. M. López. Procedural Generation of Videos to Train Deep Action Recognition Networks. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2594-2604.

10. L. Angeline, K. Teo, and F. Wong. Smearing algorithm for vehicle parking management system. In Proc. of the 2nd Seminar on Engineering and Information Technology, 2009, pp. 331-337.

11. J. Ha, R. M. Haralick, and I. T. Phillips. Recursive xy cut using bounding boxes of connected components. In Proc. of the 3rd International Conference on Document Analysis and Recognition, vol. 2, 1995, pp. 952-955.

12. L. O’Gorman. The document spectrum for page layout analysis. IEEE Transactions on pattern analysis and machine intelligence, vol. 15, issue 11, 1993, pp. 1162-1173.

13. T.M. Breuel. An algorithm for finding maximal whitespace rectangles at arbitrary orientations for document layout analysis. In Proc. of the Seventh International Conference on Document Analysis and Recognition, 2003, pp. 66-70.

14. I. Kavasidis, C. Pino, S. Palazzo, F. Rundo, D. Giordano, P. Messina, and C. Spampinato. A saliency-based convolutional neural network for table and chart detection in digitized documents. Lecture Notes in Computer Science, vol. 11752, 2019, pp. 292-302.

15. S. Schreiber, S. Agne, I. Wolf, A. Dengel, and S. Ahmed. Deepdesrt: deep learning for detection and structure recognition of tables in document images. In Proc. of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, 2017, pp. 1162-1167.

16. Object detection: speed and accuracy comparison (faster r-cnn, r-fcn, ssd, fpn, retinanet and yolov3). URL: https://medium.com/@jonathan_hui/object-detectionspeed-and-accuracy-comparison-faster-r-cnn-r-fcnssd-and-yolo-5425656ae359. Accessed: 2020-18-07.

17. Coco, common objects in context. URL: https://cocodataset.org/#home. Accessed: 2020-18-07.

18. Unified information system in the field of procurement, EIS. URL: https://zakupki.gov.ru/ (in Russian). Accessed: 27.05.2020.

19. Dla-dataset. EIS. URL: https://disk.yandex.ru/d/XVjQf20BVsElKA. Accessed: 2020-18-07.

20. Open source computer vision library. URL: https://opencv.org. Accessed: 2020-18-07.

21. Tensorflow object detection api. URL: https://github.com/tensorflow/models/tree/master/research/object_detection. Accessed: 2020-18-07.

22. Publaynet dataset. URL: https://github.com/ibm-aurnlp/PubLayNet/tree/master/pre-trained-models. Accessed: 2020-18-07.

23. N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-nms–improving object detection with one line of code. In Proc. of the IEEE International Conference on Computer Vision, 2017, pp. 5561-5569.

24. Pascal voc evaluation. URL: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/htmldoc/devkit_doc.html#SECTION00064000000000000000. Accessed: 08.09.2020.

For citations:

BELYAEVA O.V., PERMINOV A.I., KOZLOV I.S. Synthetic data usage for document segmentation models fine-tuning. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2020;32(4):189-202. (In Russ.) https://doi.org/10.15514/ISPRAS-2020-32(4)-14

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
