
Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)


Automatic Construction of Information Extraction Rules for News Websites

https://doi.org/10.15514/ISPRAS-2024-36(5)-11

Abstract

This paper presents a method for automatically generating information extraction rules (sitemaps) for news websites. Given a set of news pages from a single site, the method builds a sitemap that enables attribute extraction from arbitrary news pages on that site. It applies a fine-tuned neural network model, MarkupLM, to extract information from individual web pages and then generalizes the model’s predictions at the site level, producing universal attribute extraction rules. Experiments show that sitemaps generated with the fine-tuned model outperform both existing open-source tools and the fine-tuned MarkupLM applied at the individual page level. The method can be extended to other domains if relevant data for model fine-tuning is available.
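
The idea of generalizing page-level predictions into a site-level rule can be illustrated with a minimal sketch. The abstract does not describe the actual aggregation algorithm, so everything here is an assumption made for illustration: the function name build_sitemap, the min_support threshold, the representation of each page's prediction as a CSS selector per attribute, and the majority-vote strategy itself. It only shows how noisy per-page predictions might be reduced to one rule per attribute.

```python
from collections import Counter


def build_sitemap(page_predictions, min_support=0.5):
    """Aggregate per-page selector predictions into one rule per attribute.

    page_predictions: list of dicts mapping an attribute name to the
    selector predicted for that attribute on one page (hypothetical
    format, not taken from the paper).
    min_support: minimum fraction of pages that must agree on a selector
    for it to become a site-level rule (hypothetical parameter).
    """
    # Count how often each selector was predicted for each attribute.
    votes = {}
    for prediction in page_predictions:
        for attribute, selector in prediction.items():
            votes.setdefault(attribute, Counter())[selector] += 1

    # Keep the most frequent selector if enough pages agree on it.
    sitemap = {}
    n_pages = len(page_predictions)
    for attribute, counter in votes.items():
        selector, count = counter.most_common(1)[0]
        if count / n_pages >= min_support:
            sitemap[attribute] = selector
    return sitemap


# Three pages from one site; the model mislabels the title on the third.
pages = [
    {"title": "h1.article-title", "date": "time.published"},
    {"title": "h1.article-title", "date": "time.published"},
    {"title": "div.promo > h2", "date": "time.published"},
]
sitemap = build_sitemap(pages)
```

In this sketch a single wrong page-level prediction is outvoted, which is one plausible reason a site-level rule could be more robust than applying the model to each page independently.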

About the Authors

Sergei Sergeevich DUBOVITSKII
Institute for System Programming of the Russian Academy of Sciences
Russian Federation

Programmer at the Institute for System Programming of the RAS. Research interests: data collection from web resources, automation of the data collection process, information extraction.



Pavel Alexandrovich BEDRIN
Institute for System Programming of the Russian Academy of Sciences, Lomonosov Moscow State University
Russian Federation

Senior laboratory assistant at the Institute for System Programming of the RAS, master's student at the CMC faculty of Lomonosov Moscow State University. Research interests: data collection from web resources, automation of the data collection process, information extraction, machine learning.



Alexander Konstantinovich YATSKOV
Institute for System Programming of the Russian Academy of Sciences, Lomonosov Moscow State University
Russian Federation

Junior researcher at the Institute for System Programming of the RAS, leading programmer at the CMC faculty of Lomonosov Moscow State University. Research interests: data collection from web resources, automation of the data collection process, information extraction, machine learning.



Maxim Igorevich VARLAMOV
Institute for System Programming of the Russian Academy of Sciences, Lomonosov Moscow State University
Russian Federation

Researcher at the Institute for System Programming of the RAS. Research interests: data collection from web resources, automation of the data collection process, information extraction, machine learning.






For citations:


DUBOVITSKII S.S., BEDRIN P.A., YATSKOV A.K., VARLAMOV M.I. Automatic Construction of Information Extraction Rules for News Websites. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2024;36(5):153-162. (In Russ.) https://doi.org/10.15514/ISPRAS-2024-36(5)-11



Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)