Synthetic data usage for document segmentation models fine-tuning
https://doi.org/10.15514/ISPRAS-2020-32(4)-14
Abstract
In this paper, we propose an approach to document image segmentation for the case when only a limited set of real data is available for training. The main idea of our approach is to use artificially created data for training and to apply post-processing. The domain of the paper is PDF documents without a text layer, such as scanned contracts, commercial proposals and technical specifications. As part of the task of automatic document analysis, we solve the document segmentation problem of Document Layout Analysis (DLA). We train the well-known FasterRCNN model [1] to segment text blocks, tables, stamps and captions in images from this domain. The aim of the paper is to generate synthetic data similar to the real data of the domain. This is necessary because the model needs a large dataset for training, and preparing such a dataset is highly labor-intensive. We also describe a post-processing stage that eliminates artifacts produced by the segmentation. We tested and compared the quality of models trained on different datasets (with / without synthetic data, small / large set of real data, with / without the post-processing stage). As a result, we show that generating synthetic data and applying post-processing increase the quality of the model when only a small amount of real training data is available.
About the Authors
Oksana Vladimirovna BELYAEVA
Russian Federation
PhD Student
Andrey Igorevich PERMINOV
Russian Federation
Master Student
Ilya Sergeevich KOZLOV
Russian Federation
Researcher
References
1. S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, issue 6, 2017, pp. 1137-1149.
2. G.M. Binmakhashen and S.A. Mahmoud. Document layout analysis: a comprehensive survey. ACM Computing Surveys (CSUR), vol. 52, issue 6, 2019, pp. 1-36.
3. X. Zhong, J. Tang, and A.J. Yepes. Publaynet: largest dataset ever for document layout analysis. arXiv:1908.07836, 2019.
4. K. Chen, M. Seuret, J. Hennebert, and R. Ingold. Convolutional neural networks for page segmentation of historical document images. In Proc. of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, 2017, pp. 965-970.
5. C. Wick and F. Puppe. Fully convolutional neural networks for page segmentation of historical document images. In Proc. of the 13th IAPR International Workshop on Document Analysis Systems (DAS), 2018, pp. 287-292.
6. PubMed. National Library of Medicine. URL: https://pubmed.ncbi.nlm.nih.gov. Accessed: 2020-21-07.
7. G. Csurka. Domain adaptation for visual applications: a comprehensive survey. arXiv:1702.05374, 2017.
8. M.A. Ryndin, D.Y. Turdakov. Domain adaptation by proactive labeling. Trudy ISP RAN/Proc. ISP RAS, vol. 31, issue 5, 2019, pp. 145-152 (in Russian). DOI: 10.15514/ISPRAS-2019-31(5)-11.
9. C. R. De Souza, A. Gaidon, Y. Cabon, and A. M. López. Procedural Generation of Videos to Train Deep Action Recognition Networks. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2594-2604.
10. L. Angeline, K. Teo, and F. Wong. Smearing algorithm for vehicle parking management system. In Proc. of the 2nd Seminar on Engineering and Information Technology, 2009, pp. 331-337.
11. J. Ha, R. M. Haralick, and I. T. Phillips. Recursive xy cut using bounding boxes of connected components. In Proc. of the 3rd International Conference on Document Analysis and Recognition, vol. 2, 1995, pp. 952-955.
12. L. O’Gorman. The document spectrum for page layout analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 15, issue 11, 1993, pp. 1162-1173.
13. T.M. Breuel. An algorithm for finding maximal whitespace rectangles at arbitrary orientations for document layout analysis. In Proc. of the Seventh International Conference on Document Analysis and Recognition, 2003, pp. 66-70.
14. I. Kavasidis, C. Pino, S. Palazzo, F. Rundo, D. Giordano, P. Messina, and C. Spampinato. A saliency-based convolutional neural network for table and chart detection in digitized documents. Lecture Notes in Computer Science, vol. 11752, 2019, pp. 292-302.
15. S. Schreiber, S. Agne, I. Wolf, A. Dengel, and S. Ahmed. Deepdesrt: deep learning for detection and structure recognition of tables in document images. In Proc. of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, 2017, pp. 1162-1167.
16. Object detection: speed and accuracy comparison (faster r-cnn, r-fcn, ssd, fpn, retinanet and yolov3). URL: https://medium.com/@jonathan_hui/object-detectionspeed-and-accuracy-comparison-faster-r-cnn-r-fcnssd-and-yolo-5425656ae359. Accessed: 2020-18-07.
17. Coco, common objects in context. URL: https://cocodataset.org/#home. Accessed: 2020-18-07.
18. Unified information system in the field of procurement, EIS. URL: https://zakupki.gov.ru/ (accessed: 27.05.2020) (in Russian).
19. Dla-dataset. EIS. URL: https://disk.yandex.ru/d/XVjQf20BVsElKA (accessed: 2020-18-07).
20. Open source computer vision library. URL: https://opencv.org (accessed: 2020-18-07).
21. Tensorflow object detection api. URL: https://github.com/tensorflow/models/tree/master/research/object_detection. Accessed: 2020-18-07.
22. Publaynet dataset. URL: https://github.com/ibm-aurnlp/PubLayNet/tree/master/pre-trained-models. Accessed: 2020-18-07.
23. N. Bodla, B. Singh, R. Chellappa, and L. S. Davis. Soft-nms–improving object detection with one line of code. In Proc. of the IEEE International Conference on Computer Vision, 2017, pp. 5561-5569.
24. Pascal voc evaluation. URL: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/htmldoc/devkit_doc.html#SECTION00064000000000000000 (accessed: 08.09.2020).
For citations:
BELYAEVA O.V., PERMINOV A.I., KOZLOV I.S. Synthetic data usage for document segmentation models fine-tuning. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2020;32(4):189-202. (In Russ.) https://doi.org/10.15514/ISPRAS-2020-32(4)-14