Hardware and software data processing system for research and scientific purposes based on Raspberry Pi 3 microcomputer

In the past ten years, rapid progress has been observed in science and technology through the development of smart mobile devices, workstations, supercomputers, smart gadgets and network servers.


I. INTRODUCTION
Today, huge amounts of data are being generated by social networks, meteorological organizations, corporate firms, scientific and technical institutions, web services, smart IoT devices [1], etc. Therefore, the development of tools for storing, processing and retrieving information from huge volumes of data is one of the most important issues in information technology research [2]. To meet the growing need for storage, manipulation and recovery of information, new data centers are being created.
Traditional data centers do their job well for commercial purposes, but have a number of disadvantages:
• they consist of powerful hardware that is expensive;
• they require a large amount of electricity to work;
• they require powerful cooling;
• they occupy a large area.
In addition, powerful equipment must be kept continuously loaded, since operating such systems without performing useful tasks is expensive.
Usually, big data means data sets that are difficult to work with using traditional data management tools because of their huge size and complexity [3].
The inevitable problems of big data include the fact that the infrastructure necessary to process huge amounts of data must be created using limited resources and within strictly limited processing times. In addition, extracting features from such data requires the use of clusters and complex data processing applications [4]. It is often necessary to work with such data in real time.
In addition, these data centers must have capabilities such as extreme scalability, data distribution, load balancing, fault tolerance, etc. To solve these problems, Jeff Dean and Sanjay Ghemawat created the MapReduce model [5] for processing large amounts of data on large clusters.
Apache Hadoop [6,7], which includes Hadoop MapReduce (a software framework for easily writing applications that process vast amounts of data in parallel), is considered one of the main technologies for interacting with big data.
A single-board computer (SBC) is a general-purpose computer built on one printed circuit board together with the processor, memory, I/O ports and other components necessary for a functional computer [8]. The Raspberry Pi is an inexpensive single-board computer and the most common one. An important contribution of this study is the use of the Raspberry Pi single-board computer with Hadoop clusters, which provides parallel and distributed processing with increased performance and fault tolerance.
The system under development is intended for academic use in research and scientific work. The goals also include training employees or students to work with cluster infrastructure.
In addition, it is important to be able to verify newly developed data processing algorithms, including in enterprises, without occupying production systems.
The main goal of the project is to create a cheap solution for academic use. This is the main feature that distinguishes the project from the existing solutions under consideration.
Section II reviews related work. Section III describes the system design. Section IV presents the implementation process and the results of the pilot project. Section V presents the conclusion.
II. REVIEW OF RELATED WORKS
First of all, let us look at related works that use single-board computers (SBCs) as the main computational unit.

A. Single-board computers review and selection
A single-board computer (SBC) is a complete, self-contained computer. The difference between an SBC and a traditional personal computer is that an SBC is assembled on one printed circuit board, on which all the devices necessary for its functioning are installed. This approach to manufacturing makes a single-board computer inexpensive and compact. In addition, the device becomes even cheaper through the use of systems on a chip (SoC). On the other hand, expanding its capabilities by changing the processor, increasing the amount of memory or replacing other hardware components is impossible, since most of these components are soldered to the board. At the same time, these features make it possible to use SBCs as industrial computers or as computers for embedded systems.
A big advantage of SBCs is that each one can easily be used as one module in a system built from many such modules, because all the necessary components are on the same printed circuit board. This allows quick replacement of a broken node: it is enough to take a new SBC, insert an SD card with an operating system into it and connect the power and Ethernet cables.
Another advantage of using SBCs is the general-purpose input/output (GPIO) ports. GPIO is an interface for interaction between components, for example, between a microcontroller or microprocessor and various peripherals. Most often, GPIO pins can work as both inputs and outputs, with some exceptions. The presence of GPIO ports allows an SBC to be used in embedded systems to read data from various external sensors (temperature, humidity, infrared radiation, angular speeds, accelerations, voltage, current, etc.) and to control external devices (LCD displays, servos, DC motors, electric drives, LEDs and LED strips, etc.).
There are a large number of different models of single-board computers. Several SBCs with a similar price were selected for consideration and are compared in Table 1. Based on this comparative analysis, the Raspberry Pi is the optimal choice. The Raspberry Pi is also the most common SBC, which makes development easier due to community support, and its availability in many electronics stores is a significant advantage over other SBCs.
In addition to the single-board computers presented, there are also Odroid boards. They have better characteristics and greater performance, but also a higher cost, greater power consumption and greater heat dissipation, so they require better cooling. In addition, Odroid boards are less accessible, and their developer community is many times smaller than the Raspberry Pi developer community.
Raspberry Pi clusters have been implemented to solve business, scientific and academic problems. Raspberry Pi SBCs offer competitive advantages: they are inexpensive, consume little power, and at the same time offer features similar to those of a simple personal computer.

B. Raspberry Pi clusters
This subsection reviews several articles and projects that study the effectiveness of clusters built from Raspberry Pi computers.

1) Beowulf cluster
Dimitrios Papakyriakou, Dimitra Kottou and Ioannis Kostouros, in their paper "Benchmarking Raspberry Pi 2 Beowulf Cluster" [11], present a performance benchmark of a Raspberry Pi 2 cluster. The research project shows the design and construction of a high-performance cluster of 12 Raspberry Pi 2 Model B single-board computers. The Raspberry Pi 2 Model B is the second-generation Raspberry Pi. It has a 900 MHz ARM Cortex-A7 CPU. All of the nodes are connected over a 100 Mbps Ethernet network in a parallel mode. Tests were performed using the High Performance Linpack (HPL) benchmark.
As a result of their research, the authors collected cluster performance metrics in GFLOPS for different numbers of nodes and different problem sizes.

2) Iridis-Pi cluster
In another project, "Iridis-Pi: A low-cost, compact demonstration cluster" [12] by Simon J. Cox, James T. Cox, et al., the first-generation Raspberry Pi Model B microcomputer was used.
The characteristics of the cluster are given in [12]. Like the previous project, this one uses LINPACK to test single-node performance and High-Performance LINPACK (HPL) to test cluster performance. In addition, SD card performance was measured. This is also important, since the operating system and all files with which the Raspberry Pi works are stored on these cards.

3) Raspberry Pi Hadoop cluster and FAST algorithm
The authors of the third project are Kathiravan Srinivasan, Chuan-Yu Chang, Chao-Hsi Huang, Min-Hao Chang, Anant Sharma and Avinash Ankur. Their project is called "An Efficient Implementation of Mobile Raspberry Pi Hadoop Clusters for Robust and Augmented Computing Performance" [13].
In this article, the Raspberry Pi Hadoop cluster is used in more realistic conditions. To test cluster performance, the authors run the SURF algorithm from the OpenCV library on it. The SURF algorithm is used to search for fixed objects (retaining their external attributes) in images using the characteristic points of the object. Using the Raspberry Pi cluster for such a task is a very illustrative example, since image processing requires high performance.
In their research, the authors compared the work of an ordinary desktop computer and a Raspberry Pi cluster with different numbers of nodes and different amounts of data.
The results showed that the cluster becomes effective only with large amounts of data (in that project, 64,000 and above).

C. Traditional DPC and single-board computer DPC
An important part of the research is the comparison of traditional data processing centers (DPCs) and data processing centers based on single-board microcomputers. The cluster of the Higher School of Software Engineering of St. Petersburg Polytechnic University (A) and the cluster from the last article reviewed above (B) were taken for comparison. The comparison is presented in the table below. As the table shows, a Raspberry Pi cluster is advantageous for research and academic purposes, since it is more economical. Such a system fully provides the functionality of a hardware and software system for distributed storage and processing of data, and high performance is not as important for research and academic purposes as it is for commercial use.
In addition, a rather important task is to create a system that is cost-effective during downtime, as traditional data centers use hardware that consumes a large amount of energy and generates a large amount of heat.

III. SYSTEM DESCRIPTION
This section describes the architecture of the project, including the hardware and software used.

A. Hardware
1) Raspberry Pi 3 Model B+
The Raspberry Pi 3 Model B+ [14] belongs to the third generation of Raspberry Pi SBCs. The Raspberry Pi was originally developed as a budget platform for learning computer science, but later gained wider fame and scope.

B. Software
A cluster requires a certain set of software. First of all, the nodes need an operating system. A set of programs must then be installed on the operating system to create a distributed file system and perform distributed computing. The software used during the research is described below.

1) Raspbian OS
Raspbian OS [15,16] is the main operating system for the Raspberry Pi based on the Debian Linux operating system. Raspbian was originally created by Mike Thompson and Peter Green as an independent project. The system is optimized for operation on low-performance ARM processors. PIXEL (Pi Improved Xwindows Environment, Lightweight) is used as the desktop environment.

2) Hadoop
Apache Hadoop is an open source project by the Apache Software Foundation. Hadoop is used for reliable and scalable distributed computing, but can also serve as a distributed storage of large amounts of data. Many companies use Hadoop for production and scientific purposes.

Hadoop consists of four key components:
• HDFS. The Hadoop Distributed File System is a distributed file system that is responsible for storing data on a Hadoop cluster;
• MapReduce, a system designed for computing and processing large amounts of data on a cluster;
• Hadoop Common. The Hadoop Common module provides the tools (written in the Java language) needed on the user's operating system (Windows, Unix, or others) to read data stored in the Hadoop file system;
• YARN, a module that manages the resources of the systems that store data and perform analysis.
a) HDFS
Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop. HDFS replicates data blocks and distributes the copies across the computing nodes of the cluster, thereby ensuring high reliability and speed of calculations:
• data is distributed across several machines at load time;
• HDFS is optimized more for streaming file reads than for irregular, random reads;
• files in HDFS are written once, and arbitrary modifications to them are not allowed;
• applications can read and write HDFS files directly through the Java programming interface.
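The effect of block replication on reliability can be illustrated with a small sketch. It is not the HDFS placement policy (which is rack-aware); it assumes a simple round-robin placement, and the node names and the functions `place_replicas` and `readable_after_failure` are hypothetical names introduced only for this illustration.

```python
def place_replicas(num_blocks, nodes, replication):
    """Assumed round-robin policy: replica r of block b goes to node (b + r) mod N.

    Real HDFS uses a rack-aware policy; this only shows that each block
    ends up on `replication` distinct nodes.
    """
    if not 1 <= replication <= len(nodes):
        raise ValueError("replication must be between 1 and the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node not in failed_nodes for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical four-node cluster storing a file of three blocks, two copies each.
placement = place_replicas(3, ["pi1", "pi2", "pi3", "pi4"], replication=2)
print(placement)
print(readable_after_failure(placement, {"pi2"}))         # one node lost
print(readable_after_failure(placement, {"pi1", "pi2"}))  # both copies of block 0 lost
```

With two replicas per block, the file stays readable after any single node fails, but not necessarily after two failures; raising the replication factor trades storage space for tolerance to more failures.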

b) MapReduce
MapReduce is a programming model and framework for writing applications designed for high-speed processing of large amounts of data on large parallel clusters of computing nodes. It:
• provides automatic parallelization and distribution of tasks;
• has built-in mechanisms to maintain stability and performance in case of failure of individual elements;
• provides a clean level of abstraction for programmers.
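The programming model can be sketched in a few lines of plain Python using word counting as the task. This is a single-process illustration of the map, shuffle and reduce steps, not the Hadoop Java API; the function names are chosen for this example only, and in Hadoop the map calls would run in parallel on different nodes.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values of each key into one result."""
    return {word: sum(counts) for word, counts in groups.items()}


def word_count(chunks):
    """Run the three phases in sequence over a list of text chunks."""
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(pairs))


# Each string stands for one input split handled by a separate map task.
print(word_count(["the cat", "The dog the"]))
```

The programmer writes only the map and reduce functions; splitting the input, moving the intermediate pairs between nodes and rerunning failed tasks are left to the framework.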
c) Other tools
Hadoop also has a number of other tools. Here are some of them:
• HBase - a NoSQL database that supports random reads and writes;
• Pig - a data processing language and runtime;
• Spark - a set of tools for implementing distributed computing;
• Hive - a data warehouse with an SQL interface;
• ZooKeeper - storage for configuration information.
It is important that the Hadoop software supports horizontal scaling, which makes it possible to reduce the execution time of the same tasks by adding nodes.

IV. IMPLEMENTATION
A cluster of four Raspberry Pi nodes was created for research and a pilot project. Figure 4 shows the assembled cluster. It was decided to use different models of the Raspberry Pi single-board microcomputer, thereby creating a heterogeneous cluster. The word count algorithm was tested on the cluster using Hadoop MapReduce, with the Hadoop license file as the test file. Table 3 shows some metrics collected from the test bench. These metrics were obtained with the cluster operating on one node; the second node did not participate in data processing due to configuration settings.
Next, we collected timing metrics for the word count algorithm under different system configurations.
Tests were carried out in several stages. First, the algorithm was launched with the text file not divided into blocks. Then the file system was configured so that the file was divided into 3 blocks, and work on them was distributed across different nodes. Various configurations of node roles were also tested: 3 single-board computers performed different roles at each stage. At each stage there was 1 "master node", which distributes tasks and monitors the nodes. The number of "worker nodes", which are responsible for data processing, varied. At stages 1, 3 and 5, the "master node" was at the same time a "worker node". Table 4 shows the processing time for a 38.5-megabyte file with various system configurations. As can be seen from the table, with one block, the processing time is greatest when working on a single node. The best time was shown by the configuration of 1 "master node" and 1 "worker node". In the other configurations, since the file is not divided into blocks, time is only lost on the "master node" distributing tasks. When the file is split into 3 blocks, the execution time is approximately the same across configurations, since in each case all 3 blocks were immediately distributed to all nodes; however, a reduction in the running time of the algorithm compared to 1 block is visible.
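The relation between the block size and the number of blocks for the 38.5-megabyte test file can be checked with simple arithmetic. The block size actually configured in the experiment is not stated, so the sketch below only shows which values would produce one or three blocks. It assumes that a megabyte means 2^20 bytes; 128 MB is the default value of Hadoop's `dfs.blocksize` property, while 16 MB is a hypothetical setting chosen for illustration.

```python
import math

MB = 1024 * 1024  # assumption: "megabyte" is read as 2**20 bytes


def block_count(file_size_bytes, block_size_bytes):
    """Number of blocks a file of the given size is cut into."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    return math.ceil(file_size_bytes / block_size_bytes)


def block_size_range(file_size_bytes, blocks):
    """Inclusive lower and exclusive upper block size giving exactly `blocks` blocks."""
    if blocks < 2:
        raise ValueError("a range is only defined for two or more blocks")
    return file_size_bytes / blocks, file_size_bytes / (blocks - 1)


file_size = int(38.5 * MB)

print(block_count(file_size, 128 * MB))  # default block size: file stays in one block
print(block_count(file_size, 16 * MB))   # hypothetical smaller block size

low, high = block_size_range(file_size, 3)
print(f"3 blocks for block sizes from {low / MB:.2f} MB up to (not including) {high / MB:.2f} MB")
```

Under these assumptions the file occupies a single block at the default block size, which matches the first test stage, and any block size from about 12.83 MB up to 19.25 MB splits it into three blocks.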
The second basic algorithm for checking the operation of distributed computing systems is the distributed computation of the value of Pi. Table 5 shows the parameters of the Pi calculation tests. The first result is a graph of the dependence of the execution time on the number of nodes and on the settings listed in Table 5; it is shown in Figure 5. In addition, the time spent on one Map operation was compared; the results are presented in Figure 6. A distinctive feature of this work is the use of distributed computing systems built on single-board microcomputers for academic research and student educational tasks, with minimal cost and with ease of creating and using the system.
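The distributed Pi calculation mentioned above can be sketched as a map and reduce pair. The text does not say which estimator was run, so this example assumes a quasi-Monte Carlo estimate over a Halton point sequence, in which each map task counts the points of its own index range that fall inside a quarter circle. The parameters `num_maps` and `samples_per_map` are assumed names for the kind of settings such a test would vary.

```python
def halton(index, base):
    """Radical inverse of `index` in `base`: a low-discrepancy value in [0, 1)."""
    result, fraction = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def map_task(offset, samples):
    """Map: count the points of this task's index range inside the quarter circle."""
    inside = 0
    for i in range(offset, offset + samples):
        x, y = halton(i + 1, 2), halton(i + 1, 3)
        if x * x + y * y <= 1.0:
            inside += 1
    return inside, samples


def reduce_task(partials):
    """Reduce: sum the partial counts and scale the ratio to an estimate of Pi."""
    inside = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return 4.0 * inside / total


def estimate_pi(num_maps, samples_per_map):
    """Run the map tasks sequentially here; a cluster would run them in parallel."""
    partials = [
        map_task(m * samples_per_map, samples_per_map) for m in range(num_maps)
    ]
    return reduce_task(partials)


print(estimate_pi(num_maps=4, samples_per_map=10_000))
```

Because the map tasks work on disjoint index ranges and only exchange two integers each, the task splits cleanly across nodes, which is why it is a convenient check of a distributed system.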
In addition, as a result of the study, a number of features and disadvantages of using the Raspberry Pi were identified:
• such a cluster is effective for lightweight tasks that can be split into a large number of small tasks that do not require large computing power. For the same reason, the cluster is not effective for tasks where high performance is needed, for example, when working with graphics;
• SD card. Since the operating system is on an SD card, this can be a weak point, since SD cards are not known for high performance;
• during the tests, it was noticed that during prolonged operation of the cluster the number of errors occurring during task execution increases, which may be associated with the accumulation of garbage files. To restore stability, a reboot of the cluster was required. This calls for a more detailed approach to cluster configuration;
• when the number of nodes increases, errors may occur because a node does not have enough memory to allocate the resources that the main node requests from it. The solution to this problem is to add certain configuration parameters to the files of the main node, instructing it to check the availability of both the virtual and the hardware resources requested. In addition, the operating system itself should be configured correctly on each node, including the amount of memory allocated to Java, which may affect the operation of the cluster. Nodes that differ in hardware characteristics must be configured with particular care;
• for comfortable operation, the Raspberry Pi requires active cooling. However, a single fan is sufficient to cool two SBCs. The fan used was powered from 5 volts and drew 0.06 amperes, which corresponds to a power of 0.3 watts, so it is energy efficient.
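Returning to the memory item in the list above, one way to reason about a heterogeneous cluster is to budget memory per node and cap container requests at what the smallest node can offer. The sketch below is only an illustration: the parameters actually changed in the experiment are not named in the text, and the node names, RAM sizes, the 256 MB reserve for the operating system and the 128 MB rounding step are all assumed values. The two YARN properties mentioned in the comments do exist in `yarn-site.xml`.

```python
def plan_memory(node_ram_mb, reserved_mb=256, step_mb=128):
    """Return (memory offered by each node, largest safe container request) in MB.

    node_ram_mb: mapping of node name to physical RAM in MB (hypothetical values).
    reserved_mb: assumed amount kept back for the operating system and daemons.
    step_mb:     assumed rounding step for memory allocations.
    """
    per_node = {}
    for name, ram in node_ram_mb.items():
        usable = ram - reserved_mb
        usable -= usable % step_mb  # round down to a whole number of steps
        if usable < step_mb:
            raise ValueError(f"node {name} has too little memory to run a container")
        # Candidate value for yarn.nodemanager.resource.memory-mb on this node.
        per_node[name] = usable
    # Candidate value for yarn.scheduler.maximum-allocation-mb, so that any
    # accepted request fits on every node of the cluster.
    return per_node, min(per_node.values())


# Hypothetical mix of boards with 1 GB and 512 MB of RAM.
per_node, max_request = plan_memory({"pi3-a": 1024, "pi3-b": 1024, "pi1": 512})
print(per_node)
print(max_request)
```

Capping requests at the smallest node is a conservative choice: it avoids allocation errors on weak nodes at the cost of not using larger containers on the stronger ones.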

V. CONCLUSION
As a result of the research, the following tasks were completed:
• research and analysis of existing projects on the use of SBCs within a cluster;
• comparison of the cost-effectiveness of a traditional data center and a data center using SBCs for research and academic purposes. Based on this comparison, the relevance of developing a system on SBCs was established;
• comparison of SBCs. As a result of the comparison, the Raspberry Pi 3 Model B+ single-board computer was chosen for the project as the optimal one in terms of characteristics, price, availability and prevalence.
From the results of the test benchmarks it can be seen that the created system supports horizontal scalability, which meets the system requirements. In addition, based on the results, it can be concluded that the goal of creating a cheap, scalable distributed computing system for academic purposes has been achieved.

VI. FUTURE WORK
The project is under development. In the future, it is planned to perform the following tasks:
• increase the number of nodes for analysis to increase productivity;
• increase the number of metrics;
• analyze the performance of SD cards from different manufacturers, as their characteristics affect the operation of the Raspberry Pi;
• use external USB drives (flash, HDD or SSD) as storage media.