Automatic Construction of Information Extraction Rules for News Websites
https://doi.org/10.15514/ISPRAS-2024-36(5)-11
Abstract
This paper presents a method for automatically generating information extraction rules (sitemaps) for news websites. Given a set of news pages from a single site, the method generates a sitemap that enables attribute extraction from arbitrary news pages on that site. It applies a fine-tuned neural network model, MarkupLM, to extract information from individual web pages and then generalizes the model's predictions at the site level, producing universal rules for attribute extraction. Experimental results show that sitemaps generated with the fine-tuned model outperform both existing open-source tools and the fine-tuned MarkupLM applied at the individual page level. The method can be extended to other domains if relevant data for model fine-tuning is available.
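The idea of generalizing page-level predictions into a site-level rule can be illustrated with a minimal sketch. The abstract does not specify the aggregation procedure, so the majority vote below is an assumption made purely for illustration, not the authors' algorithm; the function name `build_sitemap`, the attribute names, and the XPath strings are all hypothetical.

```python
from collections import Counter


def build_sitemap(page_predictions):
    """Aggregate per-page predictions into one rule per attribute.

    page_predictions: list of dicts mapping an attribute name to the XPath
    a page-level model predicted for it on one page (None if nothing found).
    Assumption: the most frequent XPath across pages is taken as the site rule.
    """
    votes = {}
    for prediction in page_predictions:
        for attribute, xpath in prediction.items():
            if xpath is not None:
                votes.setdefault(attribute, Counter())[xpath] += 1
    # Keep the XPath with the highest vote count for each attribute.
    return {
        attribute: counter.most_common(1)[0][0]
        for attribute, counter in votes.items()
    }


# Hypothetical page-level predictions for three pages of the same site.
pages = [
    {"title": "/html/body/main/h1", "date": "/html/body/main/time"},
    {"title": "/html/body/main/h1", "date": "/html/body/main/div/time"},
    {"title": "/html/body/div/h2", "date": "/html/body/main/time"},
]
sitemap = build_sitemap(pages)
print(sitemap)
```

In this sketch a single mispredicted page does not change the resulting rule, which gives one intuition for why site-level rules could be more robust than page-level predictions alone.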
About the Authors
Sergei Sergeevich DUBOVITSKII
Russian Federation
Programmer at the Institute of System Programming of the RAS. Research interests: data collection from web resources, automation of the data collection process, information extraction.
Pavel Alexandrovich BEDRIN
Russian Federation
Senior laboratory assistant at the Institute of System Programming of the RAS, master's student at the CMC faculty of Lomonosov Moscow State University. Research interests: data collection from web resources, automation of the data collection process, information extraction, machine learning.
Alexander Konstantinovich YATSKOV
Russian Federation
Junior researcher at the Institute of System Programming of the RAS, leading programmer at the CMC faculty of Lomonosov Moscow State University. Research interests: data collection from web resources, automation of the data collection process, information extraction, machine learning.
Maxim Igorevich VARLAMOV
Russian Federation
Researcher at the Institute of System Programming of the RAS. Research interests: data collection from web resources, automation of the data collection process, information extraction, machine learning.
For citations:
DUBOVITSKII S.S., BEDRIN P.A., YATSKOV A.K., VARLAMOV M.I. Automatic Construction of Information Extraction Rules for News Websites. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2024;36(5):153-162. (In Russ.) https://doi.org/10.15514/ISPRAS-2024-36(5)-11






