Data farm: Information system for collecting, storing and processing unstructured data from heterogeneous sources
https://doi.org/10.15514/ISPRAS-2023-35(2)-5
Abstract
This paper presents an original information system, the «data farm». Today, the successful application of artificial intelligence algorithms, primarily deep learning based on artificial neural networks, depends almost entirely on the availability of data, and the larger the volume of data (big data), the better the algorithms perform. Well-known examples of such algorithms come from Facebook, Google, Microsoft, Yandex, and others. The data must include both a training sample and a test sample. Moreover, the data must be of good quality and have a definite structure; ideally, they should be labeled so that the learning algorithms work adequately. Obtaining such data is a serious problem that requires huge computational and human resources, and this paper is devoted to solving it. The data farm is now a fairly complex information system built on a modular basis, much like a Lego construction set. Its individual modules are various modern artificial intelligence algorithms, technologies, and entire libraries; together they are designed to automate the acquisition and structuring of high-quality big data in various subject domains. The system has been tested on COVID-19 data for regions of Russia and for countries around the world. In addition, a user-friendly interface was developed for visualizing the data collected and processed on the farm. This makes it possible to run visual numerical experiments in computer simulation and compare their results with real data, turning the farm into an intelligent decision support information system.
Keywords
About the Authors
Sergey Pavlovich LEVASHKIN
Russian Federation
Professor, a Candidate of Physical and Mathematical Sciences, PhD in Computer Science, a Full Member of the Academy of Sciences of Mexico, and the head of the Artificial Intelligence Research Laboratory
Konstantin Nikolaevich IVANOV
Russian Federation
Master's Student and an engineer at the Artificial Intelligence Research Laboratory
Sergey Vladimirovich KUSHUKOV
Russian Federation
Master's Student and an engineer at the Artificial Intelligence Research Laboratory
For citations:
LEVASHKIN S.P., IVANOV K.N., KUSHUKOV S.V. Data farm: Information system for collecting, storing and processing unstructured data from heterogeneous sources. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2023;35(2):57-72. (In Russ.) https://doi.org/10.15514/ISPRAS-2023-35(2)-5