Architecture of an Information Collection and Extraction System for an Intelligent Search and Analytical Platform
https://doi.org/10.15514/ISPRAS-2025-37(2)-20
Abstract
Internet data underpins a wide range of tasks, from information retrieval to analytical processing. As data volumes grow rapidly, efficient metadata extraction from dynamic web resources has become critically important. Traditional information collection and extraction methods based on static templates are largely ineffective on interactive content. This paper presents the architecture of an adaptive information collection and extraction system that integrates standard data extraction techniques with machine learning. The system is modular and comprises five subsystems: task management, monitoring and logging, crawling, link management, and metadata extraction. The crawling subsystem processes both static and dynamic content through browser emulation. Metadata extraction uses a hybrid approach that combines structured rules with machine learning. In experiments, the system successfully extracted metadata from a variety of web resources, including pages with dynamic content and complex structures. It showed high accuracy and resilience to changes in data formats while strictly adhering to ethical data collection standards, such as complying with robots.txt directives and applying reasonable request intervals. The proposed solution is thus a significant step toward universal data collection and extraction systems for modern information environments. The developed software tools have been used to populate the index databases of the Neopoisk system.
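The abstract gives no implementation details for the ethical collection rules it mentions. The following is only a minimal sketch, assuming Python's standard urllib.robotparser module, of how a crawler might check robots.txt directives and keep a reasonable interval between requests. The class name PoliteFetchPolicy, the user agent string, the sample robots.txt content, and the example.org URLs are hypothetical and are not taken from the paper.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


class PoliteFetchPolicy:
    """Decides whether a URL may be fetched and how long to wait first."""

    def __init__(self, robots_txt, user_agent="example-crawler", default_delay=1.0):
        self.user_agent = user_agent
        self.parser = RobotFileParser()
        # parse() accepts the robots.txt body as an iterable of lines.
        self.parser.parse(robots_txt.splitlines())
        # Use the site's Crawl-delay if present, else fall back to a default.
        delay = self.parser.crawl_delay(user_agent)
        self.delay = float(delay) if delay is not None else default_delay
        self._last_request = None

    def allowed(self, url):
        """True if robots.txt permits this user agent to fetch the URL."""
        return self.parser.can_fetch(self.user_agent, url)

    def seconds_to_wait(self, now=None):
        """Remaining pause before the next request may be sent."""
        now = time.monotonic() if now is None else now
        if self._last_request is None:
            return 0.0
        return max(0.0, self.delay - (now - self._last_request))

    def mark_request(self, now=None):
        """Record the moment a request was sent."""
        self._last_request = time.monotonic() if now is None else now


policy = PoliteFetchPolicy(ROBOTS_TXT)
for url in ("https://example.org/articles/1", "https://example.org/private/data"):
    print(url, "allowed" if policy.allowed(url) else "blocked")
```

In a real crawler the robots.txt body would be downloaded per host and cached, and the policy would be consulted before each fetch. The per-host delay here is deliberately simple and does not model the system's actual scheduling.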
About the Authors
Danil Sergeevich SERENKO
Russian Federation
A student at the Department of Mathematical Modeling and Artificial Intelligence of the Patrice Lumumba RUDN University and a researcher at the Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences. His research interests include AI and information retrieval.
Egor Dmitrievich TERENTEV
Russian Federation
A student at the Department of Mathematical Modeling and Artificial Intelligence of the Patrice Lumumba RUDN University and a researcher at the Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences. His research interests include AI and information retrieval.
Denis Vladimirovich ZUBAREV
Russian Federation
A researcher at the Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences. His research interests include AI, information retrieval, and text plagiarism detection.
Ilia Vladimirovich SOCHENKOV
Russian Federation
Cand. Sci. (Phys.-Math.), lead researcher at FRC CSC RAS, lead researcher at ISP RAS, lead researcher at IITP RAS. Research interests: Natural Language Processing, Information Retrieval, Big Data & Text Mining.
For citations:
SERENKO D.S., TERENTEV E.D., ZUBAREV D.V., SOCHENKOV I.V. Architecture of an Information Collection and Extraction System for an Intelligent Search and Analytical Platform. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(2):263-280. (In Russ.) https://doi.org/10.15514/ISPRAS-2025-37(2)-20