Logical structure extraction from scanned documents

Anastasiya Olegovna BOGATENKOVA; Ilya Sergeevich KOZLOV; Oksana Vladimirovna BELYAEVA; Andrey Igorevich PERMINOV

doi:10.15514/ISPRAS-2020-32(4)-13

Abstract

Logical structure extraction from documents has been a longstanding research topic because of its strong influence on a wide range of practical applications. The huge variety of document types and, as a consequence, of possible document structures makes this task particularly difficult. The purpose of this work is to show one way to represent and extract the structure of documents of a particular type: scanned documents without a text layer. The text in such documents cannot be selected, copied, or searched. Nevertheless, a huge number of scanned documents exist that people need to work with. Understanding the information in such documents is useful for their analysis, e.g. for effective search within documents, navigation, and summarization. To cope with a large collection of documents, the task should be performed automatically. The paper describes a pipeline for processing scanned documents. The method is based on multiclass classification of document lines; the set of classes includes textual lines, headers, and lists. First, text and bounding boxes for document lines are extracted using OCR methods; then features are generated for each line and passed as input to the classifier. We also release a dataset of documents that includes bounding boxes and labels for each document line, evaluate the effectiveness of our approach on this dataset, and describe possible future work in the field of document processing.
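The feature-generation step described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `OcrLine` container, the `LIST_MARKER` pattern, and the particular features are hypothetical and are not the feature set used in the paper. The sketch only shows how OCR output (text plus a bounding box per line) might be turned into per-line feature vectors, which any multiclass classifier could then map to the classes text, header, or list.

```python
import re
import statistics
from dataclasses import dataclass


@dataclass
class OcrLine:
    """Hypothetical container for one OCR line: text plus bounding box in pixels."""
    text: str
    left: int
    top: int
    width: int
    height: int


# Assumed heuristic for list markers such as "1.", "2.3)", "a)", "-" or "•".
LIST_MARKER = re.compile(r"^(\d+(\.\d+)*[.)]|[a-zA-Z][.)]|[-•*])\s+")


def line_features(line: OcrLine, page_width: int, median_height: float) -> dict:
    """Build an illustrative feature vector for a single line."""
    text = line.text.strip()
    letters = [ch for ch in text if ch.isalpha()]
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    return {
        # Geometric features derived from the bounding box.
        "indent": line.left / page_width,
        "relative_height": line.height / median_height,
        # Textual features derived from the recognized text.
        "starts_with_list_marker": float(bool(LIST_MARKER.match(text))),
        "upper_ratio": upper_ratio,
        "ends_with_period": float(text.endswith(".")),
        "num_words": float(len(text.split())),
    }


def extract_features(lines: list, page_width: int) -> list:
    """Compute features for every line on a page, in reading order."""
    median_height = statistics.median(line.height for line in lines)
    return [line_features(line, page_width, median_height) for line in lines]


if __name__ == "__main__":
    page = [
        OcrLine("INTRODUCTION", 300, 100, 400, 40),
        OcrLine("This paper describes a pipeline.", 100, 160, 800, 20),
        OcrLine("1. Extract lines with OCR", 150, 200, 500, 20),
    ]
    for line, feats in zip(page, extract_features(page, page_width=1000)):
        print(line.text, feats)
```

In a full pipeline, the resulting dictionaries would be converted to numeric rows and passed to a trained multiclass classifier; that training step is left out here because the abstract does not specify it.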

Keywords

machine learning, document structure, natural language processing, OCR

About the Authors

Anastasiya Olegovna BOGATENKOVA

Lomonosov Moscow State University
Russian Federation
Bachelor Student

Ilya Sergeevich KOZLOV

Ivannikov Institute for System Programming of the RAS
Russian Federation
Researcher

Oksana Vladimirovna BELYAEVA

Ivannikov Institute for System Programming of the RAS
Russian Federation
PhD Student

Andrey Igorevich PERMINOV

Lomonosov Moscow State University
Russian Federation
Master Student

For citations:

BOGATENKOVA A.O., KOZLOV I.S., BELYAEVA O.V., PERMINOV A.I. Logical structure extraction from scanned documents. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2020;32(4):175-188. (In Russ.) https://doi.org/10.15514/ISPRAS-2020-32(4)-13

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
