Classification of Printed Text on Raster Documents

Daniil Evgenievich KOPYLOV; Andrey Anatolievitch MIKHAILOV

doi:10.15514/ISPRAS-2023-35(6)-9

Abstract

Analysis of the logical structure of a document relies on a number of text properties, one of which is bold type. Headings, defined terms, and column names in tables are often set in bold. This paper proposes a method for classifying text by boldness that consists of a sequence of steps. The first step binarizes the entire image to separate its pixels into text and background. The second step evaluates each word and returns a value characterizing the thickness of the main stroke of the characters in that word. In the last step, these values are grouped into two clusters: bold text and regular text. The proposed method was implemented and tested on three data sets, and the source code was published in an open repository.
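The three steps can be illustrated with a minimal sketch. It is not the authors' published implementation, and the abstract does not name the specific algorithms, so every concrete choice here is an assumption: Otsu's threshold for binarization, the mean horizontal run length of text pixels as the stroke-thickness value, and one-dimensional 2-means for the clustering. The function names (`otsu_threshold`, `binarize`, `stroke_score`, `two_means`, `classify_words`) are hypothetical, and word bounding boxes are assumed to be supplied by an earlier layout-analysis stage.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]          # grayscale rows, values 0..255
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def otsu_threshold(gray: Image) -> int:
    """Assumed binarization rule: pick t maximizing between-class variance."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(gray: Image) -> Image:
    """Step 1: mark dark pixels (<= threshold) as text (1), the rest as 0."""
    t = otsu_threshold(gray)
    return [[1 if value <= t else 0 for value in row] for row in gray]


def stroke_score(binary: Image, box: Box) -> float:
    """Step 2 (assumed measure): mean horizontal run length of text pixels."""
    top, left, bottom, right = box
    runs = []
    for r in range(top, bottom):
        run = 0
        for c in range(left, right):
            if binary[r][c]:
                run += 1
            elif run:
                runs.append(run)
                run = 0
        if run:
            runs.append(run)
    return sum(runs) / len(runs) if runs else 0.0


def two_means(scores: Sequence[float], iterations: int = 50) -> List[bool]:
    """Step 3 (assumed clustering): 1-D 2-means; True marks the bold cluster."""
    lo, hi = min(scores), max(scores)
    if lo == hi:
        return [False] * len(scores)  # no spread, nothing to separate
    labels: List[bool] = []
    for _ in range(iterations):
        labels = [abs(s - hi) < abs(s - lo) for s in scores]
        bold = [s for s, is_bold in zip(scores, labels) if is_bold]
        regular = [s for s, is_bold in zip(scores, labels) if not is_bold]
        new_lo, new_hi = sum(regular) / len(regular), sum(bold) / len(bold)
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return labels


def classify_words(gray: Image, boxes: Sequence[Box]) -> List[bool]:
    """Run the three steps; returns one bold/regular flag per word box."""
    if not boxes:
        return []
    binary = binarize(gray)
    return two_means([stroke_score(binary, box) for box in boxes])
```

On a synthetic page holding one word drawn with 1-pixel strokes and another drawn with 3-pixel strokes, `classify_words` flags only the second word as bold. The sketch always produces two groups whenever the scores differ, so a page with no bold text at all would need an additional check that is not shown here.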

Keywords

document analysis, raster documents, text classification

About the Authors

Daniil Evgenievich KOPYLOV

Institute for System Dynamics and Control Theory of Siberian Branch of Russian Academy of Sciences, Institute for System Programming of the Russian Academy of Sciences
Russian Federation

Master’s student at Irkutsk State University; employee of the Ivannikov Institute for System Programming of the Russian Academy of Sciences and of the Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences. Research interests: applied mathematics, data analysis.

Andrey Anatolievitch MIKHAILOV

Institute for System Dynamics and Control Theory of Siberian Branch of Russian Academy of Sciences, Institute for System Programming of the Russian Academy of Sciences
Russian Federation

Senior researcher at the Laboratory of Information Systems of the Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences. Research interests: document analysis, image recognition.

For citations:

KOPYLOV D.E., MIKHAILOV A.A. Classification of Printed Text on Raster Documents. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2023;35(6):157-166. (In Russ.) https://doi.org/10.15514/ISPRAS-2023-35(6)-9

This work is licensed under a Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
