Logical structure extraction from scanned documents
https://doi.org/10.15514/ISPRAS-2020-32(4)-13
Abstract
Logical structure extraction from various documents has been a longstanding research topic because of its high influence on a wide range of practical applications. A huge variety of different types of documents and, as a consequence, the variety of possible document structures make this task particularly difficult. The purpose of this work is to show one of the ways to represent and extract the structure of documents of a special type. We consider scanned documents without a text layer. This means that the text in such documents cannot be selected or copied. Moreover, you cannot search for the content of such documents. However, a huge number of scanned documents exist that one needs to work with. Understanding the information in such documents may be useful for their analysis, e. g. for the effective search within documents, navigation and summarization. To cope with a large collection of documents the task should be performed automatically. The paper describes the pipeline for scanned documents processing. The method is based on the multiclass classification of document lines. The set of classes include textual lines, headers and lists. Firstly, text and bounding boxes for document lines are extracted using OCR methods, then different features are generated for each line, which are the input of the classifier. We also made available dataset of documents, which includes bounding boxes and labels for each document line; evaluated the effectiveness of our approach using this dataset and described the possible future work in the field of document processing.
About the Authors
Anastasiya Olegovna BOGATENKOVARussian Federation
bachelor student
Ilya Sergeevich KOZLOV
Russian Federation
Researcher
Oksana Vladimirovna BELYAEVA
Russian Federation
PhD Student
Andrey Igorevich PERMINOV
Russian Federation
Master Student
References
1. A. Bogatenkova. Dataset. URL: https: //github.com/NastyBoget/document_structure_extraction (accessed 27.05.2020).
2. E. Giguet and G. Lejeune. Daniel@ fintoc-2019 shared task: toc extraction and title detection. In Proc. of the Second Financial Narrative Processing Workshop (FNP 2019), 2019, pp. 63-68,
3. K. Tian and Z. Peng. Finance document extraction using data augmentation and attention. In Proc. of the Second Financial Narrative Processing Workshop (FNP 2019), 2019, pp. 1-4.
4. M.M. Rahman and T. Finin. Deep understanding of a document’s structure. In Proc. of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, 2017, pp. 63-73.
5. A. Doucet, G. Kazai, S. Colutto, and G. Mühlberger. Icdar 2013 competition on book structure extraction. In Proc. of the 12th International Conference on Document Analysis and Recognition, 2013, pp. 1438-1443.
6. L. Gao, X. Yi, Z. Jiang, L. Hao, and Z. Tang. Icdar2017 competition on page object detection. In Proc. of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, 2017, pp. 1417-1422.
7. C. Clausner, A. Antonacopoulos, and S. Pletschacher. Icdar2017 competition on recognition of documents with complex layouts-rdcl2017. In Proc. of the 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, 2017, pp. 1404-1410.
8. R. Juge, I. Bentabet, and S. Ferradans. The fintoc-2019 shared task: financial document structure extraction. In Proc. of the Second Financial Narrative Processing Workshop (FNP 2019), 2019, pp. 51–57.
9. Единая информационная система в сфере закупок, ЕИС. URL: https://zakupki.gov.ru/ (дата обращения 27.05.2020) / Unified information system in the field of procurement, EIS. URL: https://zakupki.gov.ru/.
10. J. Pustejovsky and A. Stubbs. Chapter 6. Annotation and Adjudication. In Natural Language Annotation for Machine Learning: A guide to corpus-building for applications. O’Reilly Media, 2012, pp. 105-139.
11. A. Permonov. Paragraph labeler application, URL: https://github.com/dronperminov/ParagraphLabelerApp (accessed 10.06.2020).
12. R. Artstein and M. Poesio. Inter-coder agreement for computational linguistics. Computational Linguistics, vol. 34, no. 4, 2008, pp. 555-596.
13. P.S. Bayerl and K.I. Paul. What determines inter-coder agreement in manual annotations? A meta-analytic investigation. Computational Linguistics, vol. 37, no. 4, 2011, pp. 699–725.
14. Р.А. Гилязев, Д.Ю. Турдаков Активное обучение и краудсорсинг: обзор методов оптимизации разметки данных. Труды ИСП РАН, том 30, вып. 2, 2018 г., стр. 215-250. DOI: 10.15514/ISPRAS-2018-30(2)-11 / R.A. Gilyazev, D.Y. Turdakov. Active learning and crowdsourcing: a survey of data markup optimization methods. Trudy ISP RAN/Proc. ISP RAS, vol. 30, issue 2, 2018, pp. 215-250 (in Russian).
15. J. Liang, J. Piper, and J.-Y. Tang. Erosion and dilation of binary images by arbitrary structuring elements using interval coding. Pattern Recognition Letters, vol. 9, no. 3, 1989, pp. 201–209.
16. Opencv, Intel Corporation. URL: https://opencv.org (accessed 27.05.2020).
17. R. Smith. An overview of the tesseract ocr engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 2, 2007, pp. 629-633.
18. B. Kostenko. XGBoost feature interactions reshaped. URL: https://github.com/limexp/xgbfir (accessed 27.05.2020).
19. A. Jain. Complete guide to parameter tuning in xgboost with codes in python. URL: https: //www.analyticsvidhya.com/blog/2016/03/completeguide-parameter-tuning-xgboost-with-codes-python (accessed 27.05.2020).
20. XGBoost documentation, The XGBoost Contributors. URL: https://xgboost.readthedocs.io/en/latest/index.html (accessed 27.05.2020).
Review
For citations:
BOGATENKOVA A.O., KOZLOV I.S., BELYAEVA O.V., PERMINOV A.I. Logical structure extraction from scanned documents. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2020;32(4):175-188. (In Russ.) https://doi.org/10.15514/ISPRAS-2020-32(4)-13