Logical structure extraction from scanned documents

Anastasia Olegovna BOGATENKOVA; Ilya Sergeevich KOZLOV; Oksana Vladimirovna BELYAEVA; Andrey Igorevich PERMINOV

doi:10.15514/ISPRAS-2020-32(4)-13

Abstract

The paper proposes a pipeline for processing scanned documents and a method for extracting their logical structure. The method is based on multiclass classification of document lines, including classification into headers and list items. The pipeline consists of three steps: extracting the text and line bounding boxes with OCR methods, building features from them, and training a classifier on those features. In addition, a corpus of documents has been annotated and made available for study, the implemented method has been evaluated experimentally on this corpus, and directions for further work and research are described.
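The feature-building step of such a pipeline can be illustrated with a minimal sketch. It assumes that OCR has already produced, for each line, its text and a bounding box in pixels. The `OcrLine` container, the `LIST_MARKER` pattern, and every feature name below are hypothetical choices made for illustration and are not taken from the paper; the resulting vectors would then be passed to a multiclass classifier of one's choice.

```python
import re
from dataclasses import dataclass
from statistics import median


@dataclass
class OcrLine:
    """Hypothetical container for one OCR line: text plus bounding box in pixels."""
    text: str
    left: int
    top: int
    width: int
    height: int


# Assumed list markers such as "1.", "1.2.", "a)", "-", "•" (illustrative, not from the paper).
LIST_MARKER = re.compile(r"^\s*(\d+(\.\d+)*[.)]|[a-zа-я]\)|[-•])\s+", re.IGNORECASE)


def line_features(line: OcrLine, page_width: int, median_height: float) -> dict:
    """Turn one OCR line into a flat dictionary of illustrative features."""
    text = line.text.strip()
    letters = [ch for ch in text if ch.isalpha()]
    return {
        # Horizontal offset of the line as a fraction of the page width.
        "indent": line.left / page_width,
        # Line height relative to the typical line on the page (a proxy for font size).
        "relative_height": line.height / median_height,
        "starts_with_list_marker": int(bool(LIST_MARKER.match(text))),
        "is_upper": int(bool(letters) and all(ch.isupper() for ch in letters)),
        "ends_with_period": int(text.endswith(".")),
        "num_chars": len(text),
    }


def page_features(lines: list, page_width: int) -> list:
    """Build one feature dictionary per line of a page."""
    median_height = median(line.height for line in lines)
    return [line_features(line, page_width, median_height) for line in lines]


# Toy page with a header-like line, a list item, and a plain text line.
example_lines = [
    OcrLine("GENERAL PROVISIONS", 400, 100, 600, 40),
    OcrLine("1. The customer publishes the notice.", 120, 160, 900, 30),
    OcrLine("The text continues on this line.", 100, 200, 950, 30),
]
example_features = page_features(example_lines, page_width=1200)
```

Each dictionary in `example_features` corresponds to one line and can be converted to a numeric row for training. Both geometric features (from the bounding box) and textual features (from the recognized string) are included here because the abstract names the text and the line boxes as the OCR outputs the pipeline relies on.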

Keywords

machine learning, document structure, natural language processing, OCR

About the authors

Anastasia Olegovna BOGATENKOVA

Lomonosov Moscow State University
Russia
Bachelor's student, Department of System Programming

Ilya Sergeevich KOZLOV

Ivannikov Institute for System Programming of the RAS
Russia
Research intern

Oksana Vladimirovna BELYAEVA

Ivannikov Institute for System Programming of the RAS
Russia
PhD student

Andrey Igorevich PERMINOV

Lomonosov Moscow State University
Russia
Master's student

For citation:

BOGATENKOVA A.O., KOZLOV I.S., BELYAEVA O.V., PERMINOV A.I. Logical structure extraction from scanned documents. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2020;32(4):175-188. (In Russ.) https://doi.org/10.15514/ISPRAS-2020-32(4)-13

Content is available under the Creative Commons Attribution 4.0 License.

ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)
