
Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Recovering Text Layer from PDF Documents with Complex Background

https://doi.org/10.15514/ISPRAS-2024-36(3)-13

Abstract

This article examines PDF as a format for storing and exchanging documents, with particular attention to the problem of converting data from PDF back into its original format. The study is motivated by the widespread use of PDF in the electronic document management of modern organizations. Despite the convenience of PDF, extracting information from such documents can be difficult because of the way the format stores information and the lack of effective tools for reverse conversion. The paper proposes a solution based on analyzing the text information in the output stream of a PDF document. This enables automatic recognition of text in PDF documents even when they contain non-standard fonts, complex backgrounds, or damaged encodings. The research is of interest to specialists in electronic document management and to software developers who build tools for working with PDF.

About the Authors

Mikhail Viktorovich ZAGORODNIKOV
Matrosov Institute for System Dynamics and Control Theory of Siberian Branch of Russian Academy of Sciences
Russian Federation

Holds a bachelor's degree in Applied Informatics from Irkutsk State University; trainee researcher at the Young Researchers Lab of AI, Data Processing & Analysis of the Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences; scholarship holder of the Ivannikov Institute for System Programming of the Russian Academy of Sciences. Research interests: neural networks, analysis of electronic documents.



Andrey Anatolevich MIKHAYLOV
Matrosov Institute for System Dynamics and Control Theory of Siberian Branch of Russian Academy of Sciences; Ivannikov Institute for System Programming of the Russian Academy of Sciences
Russian Federation

Head of the Young Researchers Lab of AI, Data Processing & Analysis of the Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences. His research interests include document analysis and image recognition.



For citations:


ZAGORODNIKOV M.V., MIKHAYLOV A.A. Recovering Text Layer from PDF Documents with Complex Background. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2024;36(3):189-202. (In Russ.) https://doi.org/10.15514/ISPRAS-2024-36(3)-13



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)