Job management system for automated data collection from the Internet
https://doi.org/10.15514/ISPRAS-2022-34(2)-9
Abstract
This work presents the research and development of a task management system for automated data collection from the Internet. The article describes the implemented methodologies and the techniques developed for interacting with containers that host data collection applications. In the course of the work, existing services for automated data collection from the Internet were studied and are presented here: ready-made open source solutions, cloud services with extensive functionality, and our own solution running on Kubernetes. As a result, a task management system was implemented for the Talisman data analysis platform; it provides horizontal scalability, isolation of the crawler environment, and independence from the technology used to develop the crawlers.
About the Authors
Vladimir Alexandrovich LAZAREV
Russian Federation
Master’s student of the System Programming Department of MSU, an employee of ISP RAS
Maksim Igorevich VARLAMOV
Russian Federation
Researcher
Alexander Konstantinovich YATSKOV
Russian Federation
PhD Student
References
1. ISP RAS. Talisman: a data processing framework. Available at: https://www.ispras.ru/en/technologies/talisman/.
2. Anand V. Saurkar, Kedar G. Pathare, Shweta A. Gode. An Overview on Web Scraping Techniques and Tools. International Journal on Future Revolution in Computer Science & Communication Engineering, vol. 4, no. 4, 2018, pp. 363-367.
3. IST Research. Scrapy Cluster 1.3 Documentation. Available at: https://scrapy-cluster.readthedocs.io/en/dev/.
4. Scrapy group. Scrapyd. Available at: https://scrapyd.readthedocs.io/en/stable/.
5. ScrapyRT (Scrapy Realtime). Available at: https://github.com/scrapinghub/scrapyrt.
6. Ferrit. Available at: https://github.com/reggoodwin/ferrit.
7. Zyte. Web Scraping Cloud Hosting Data Extraction - Zyte. Available at: https://www.zyte.com/scrapy-cloud/.
8. Web Scraper Cloud. Web Scraper Cloud | Web Scraper documentation. Available at: https://webscraper.io/documentation/web-scraper-cloud.
9. Octopus Data Inc. Web Scraping Tool & Free Web Crawlers | Octoparse. Available at: https://www.octoparse.com/, 2022
10. data-ox.com. Web Data Scraping Company | DataOx. Available at: https://data-ox.com/.
11. The Kubernetes Authors. What is Kubernetes? Available at: https://kubernetes.io/docs/concepts/overview/what-is-kubernetes.
12. Docker Inc. Overview of Docker Compose. Available at: https://docs.docker.com/compose/.
13. Docker Inc. Swarm mode overview. Available at: https://docs.docker.com/engine/swarm/.
14. The Apache Mesos Software Foundation. Mesos Architecture. Available at: https://mesos.apache.org/documentation/latest/architecture/.
15. Isam Mashhour Al Jawarneh, Paolo Bellavista et al. Container Orchestration Engines: A Thorough Functional and Performance Comparison. In Proc. of the IEEE International Conference on Communications (ICC), 2019, pp. 1-6.
For citations:
LAZAREV V.A., VARLAMOV M.I., YATSKOV A.K. Job management system for automated data collection from the Internet. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2022;34(2):111-122. (In Russ.) https://doi.org/10.15514/ISPRAS-2022-34(2)-9