High Performance Distributed Web-Scraper
https://doi.org/10.15514/ISPRAS-2021-33(3)-7
Abstract
Over the past decade, the Internet has become a vast and rich source of data. This data is used to extract knowledge through machine learning analysis. To mine web information, the data must first be extracted from the source and placed in analytical storage; this is an ETL (extract, transform, load) process. Web sources expose their data in different ways: either through an API over HTTP or through parsing of HTML source code. This article is devoted to an approach to high-performance data extraction from sources that do not provide an API. The distinctive features of the proposed approach are load balancing, two levels of data storage, and separation of file downloading from scraping. The approach is implemented using the following technologies: Docker, Kubernetes, Scrapy, Python, MongoDB, Redis Cluster, and CephFS. The results of testing the solution are also described.
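The separation of downloading from scraping, together with the two storage levels, can be illustrated with a minimal sketch. It is not the authors' implementation: the names `download_stage`, `scrape_stage`, and `fake_fetch` are hypothetical, and in-memory Python objects stand in for the real components (a dict for the raw-file storage such as CephFS, a deque for the work queue such as Redis, and a list for the document store such as MongoDB). No network access is performed.

```python
from collections import deque
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def download_stage(urls, fetch, raw_store, queue):
    """Stage 1: save raw pages to first-level storage and enqueue their keys."""
    for url in urls:
        raw_store[url] = fetch(url)  # raw HTML is stored untouched
        queue.append(url)            # scrapers are told what is ready


def scrape_stage(raw_store, queue, parsed_store):
    """Stage 2: parse stored pages and write records to second-level storage."""
    while queue:
        url = queue.popleft()
        parser = TitleParser()
        parser.feed(raw_store[url])
        parsed_store.append({"url": url, "title": parser.title.strip()})


def fake_fetch(url):
    """Hypothetical stand-in for an HTTP download; returns canned HTML."""
    return f"<html><head><title>Page for {url}</title></head><body></body></html>"


raw_store = {}         # assumption: stands in for the raw-file storage
work_queue = deque()   # assumption: stands in for the shared work queue
parsed_store = []      # assumption: stands in for the document database

download_stage(
    ["https://example.org/a", "https://example.org/b"],
    fake_fetch, raw_store, work_queue,
)
scrape_stage(raw_store, work_queue, parsed_store)
```

Because the scraping stage reads only from the raw store, parsing rules can be changed and rerun without downloading the pages again, which is one plausible motivation for keeping the two stages apart.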
About the Authors
Denis EYZENAKH
Russian Federation
Student
Anton RAMEYKOV
Russian Federation
Student
Igor NIKIFOROV
Russian Federation
PhD (Computer Science), Assistant Professor
For citations:
EYZENAKH D., RAMEYKOV A., NIKIFOROV I. High Performance Distributed Web-Scraper. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2021;33(3):87-100. https://doi.org/10.15514/ISPRAS-2021-33(3)-7