Classification of Printed Text on Raster Documents
https://doi.org/10.15514/ISPRAS-2023-35(6)-9
Abstract
Extracting the logical structure of a document relies on a number of features, one of which is whether a word is set in bold. Headings, defined terms, and table column names are often set in bold. This paper proposes a method for classifying text by boldness that consists of three steps. The first step binarizes the entire image, separating its pixels into text and background pixels. The second step evaluates each word and returns a value characterizing the thickness of the main stroke of the characters in that word. The last step groups these values into two clusters: bold text and regular text. The proposed method was implemented and tested on three data sets, and the source code was published in an open repository.
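The three steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' published implementation: the abstract does not specify the binarization rule, the stroke-thickness measure, or the clustering algorithm, so the sketch assumes Otsu thresholding, the mean horizontal run length of ink pixels as a hypothetical thickness score, and a one-dimensional two-means clustering. Word bounding boxes are assumed to be supplied by an earlier layout or OCR stage, and all function names are illustrative.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]          # grayscale rows, values 0..255
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def otsu_threshold(image: Image) -> int:
    """Gray level maximizing between-class variance (assumed binarizer)."""
    hist = [0] * 256
    for row in image:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0:
            continue
        if w1 == 0:
            break
        m0, m1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def binarize(image: Image) -> Image:
    """Step 1: mark dark pixels (at or below the threshold) as text (1)."""
    t = otsu_threshold(image)
    return [[1 if v <= t else 0 for v in row] for row in image]


def stroke_score(binary: Image, box: Box) -> float:
    """Step 2 (hypothetical measure): mean horizontal ink run length."""
    top, left, bottom, right = box
    runs = []
    for r in range(top, bottom):
        run = 0
        for c in range(left, right):
            if binary[r][c]:
                run += 1
            elif run:
                runs.append(run)
                run = 0
        if run:
            runs.append(run)
    return sum(runs) / len(runs) if runs else 0.0


def two_means(values: Sequence[float]) -> List[int]:
    """Step 3: 1-D two-cluster k-means; label 1 is the thicker cluster."""
    lo, hi = min(values), max(values)
    labels = [0] * len(values)
    for _ in range(100):
        labels = [1 if abs(v - hi) < abs(v - lo) else 0 for v in values]
        ones = [v for v, l in zip(values, labels) if l == 1]
        zeros = [v for v, l in zip(values, labels) if l == 0]
        new_lo = sum(zeros) / len(zeros) if zeros else lo
        new_hi = sum(ones) / len(ones) if ones else hi
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return labels


def classify_words(image: Image, boxes: Sequence[Box]) -> List[bool]:
    """Run all three steps; True means the word is classified as bold."""
    binary = binarize(image)
    scores = [stroke_score(binary, b) for b in boxes]
    return [label == 1 for label in two_means(scores)]


# Synthetic page: white background, two "words" made of vertical strokes.
image = [[255] * 20 for _ in range(10)]
for row in image:
    for c in (2, 5, 8):                 # thin strokes, 1 px wide
        row[c] = 0
    for c in (11, 12, 13, 16, 17, 18):  # thick strokes, 3 px wide
        row[c] = 0
boxes = [(0, 0, 10, 10), (0, 10, 10, 20)]
print(classify_words(image, boxes))  # [False, True]
```

Because the clustering always produces two groups, a page that contains only regular text or only bold text would still be split; a real pipeline would need an additional check for that case. The mean horizontal run length is also inflated by horizontal strokes, so it is only a rough stand-in for a main-stroke thickness estimate.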
About the Authors
Daniil Evgenievich KOPYLOV
Russian Federation
Master’s student at Irkutsk State University; employee of the Ivannikov Institute for System Programming of the Russian Academy of Sciences and of the Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences. Research interests: applied mathematics, data analysis.
Andrey Anatolievitch MIKHAILOV
Russian Federation
Senior researcher at the Laboratory of Information Systems of the Matrosov Institute for System Dynamics and Control Theory of the Siberian Branch of the Russian Academy of Sciences. Research interests: document analysis, image recognition.
For citations:
KOPYLOV D.E., MIKHAILOV A.A. Classification of Printed Text on Raster Documents. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2023;35(6):157-166. (In Russ.) https://doi.org/10.15514/ISPRAS-2023-35(6)-9