«Life» in Tensor: Implementing Cellular Automata on Graphics Adapters

. This paper presents an approach to the description of cellular automata using tensors. This approach allows to attract various frameworks for organizing scientific calculations on high-performance graphics adapter processors, that is, to automatically build parallel software implementations of cellular automata. In our work, we use the TensorFlow framework to organize computations on NVIDIA graphics adapters. As an example cellular automaton we used Conway's Game of Life. The effect of the described approach to the cellular automata implementation is estimated experimentally.


Introduction
The use of automata in description of a dynamic systems' behavior has been known for a long time.
The key point of this approach to the description of systems is a representation of the object under study in the form of a discrete automatic device -automaton (State Machine or Transition System). Shalyapina N.A., Gromov M.L. «Life» in Tensor: Implementing Cellular Automata on Graphics Adapters. Trudy ISP RAN/Proc. ISP RAS, vol. 31, issue 3, 2019. pp. 217-228 218 Under the influence of input sequences (or external factors) an automaton changes its state and produces reactions. There are many types of such automata: the Moore and Mealy machines [1], the cellular automaton [2], and others. The knowledge of the features of the object under study can provide enough information to select the appropriate type of automaton for the object's behavior description. In some cases, it is convenient to use an infinite model. But finite models are mostly common. In the latter case, the sets of states, input actions (or states of the environment), and output reactions are finite. Our work deals with cellular automata (CA). The theory of cellular automata began to take shape quite a long time ago. The work of John von Neumann [3] might be considered as the first work of the cellular automata theory. Today, a large number of studies devoted to cellular automata are known [4,5]. Note that a major part of these works is devoted to the simulating of spatially distributed systems in physics, chemistry, biology, etc. [6]. The goal of the simulation is to find the states of the cells of a CA after a predetermined number of CA cycles. The resulting set of states in some way characterizes the state of the process or object under study (fluid flow rate at individual points, concentration of substances, etc.). Thus, the task of simulating a certain process or object by a cellular automaton can be divided into two subtasks. First, the researcher must select the parameters of the automaton (the dimension of the grid of cells, the shape of the cells, the type of neighborhood, etc.). And secondly, programmatically implement the behavior of the selected cellular automaton. Our work is focused on the second task -the software implementation of the cellular automaton. In itself, the concept of a cellular automaton is quite simple and the idea of software implementation is obvious. However, the number of required calculations and the structure of these calculations suggest the use of modern supercomputers with a large number of cores and supporting large-block parallelism. In this case, the cell field of the automaton is divided into separate blocks. Processing of blocks is done in parallel and independently from each other. At the end of each processing cycle, the task of combining the processing results of each block arises. This problem was solved in [7] in the original way. The experimental study in [7] of the efficiency of parallelization was carried out on clusters with 32 and 768 processors. Despite the high effectiveness of this approach, it has some issues. First, this approach assumes that a researcher has an access to a cluster. Supercomputers are quite expensive and usually are the property of some collective access center [8]. Of course, after waiting a certain time in the queue, access to the cluster is possible. But another difficulty arises: a special skill is needed to write parallel programs in order to organize parallel sections of the program correctly. And this leads to the fact that it takes a certain number of experiments with the program to debug it before use. The latter means multiple times of waiting in a queue for a cluster, which, of course, delays the moment of launching actual (not debugging) experiments with cellular automata. We offer an alternative approach for software implementation of cellular automata, which is based on the use of modern graphics adapters. Modern graphics adapters are also well-organized supercomputers, consisting of several specialized computational cores and allowing execution of operations in parallel. Compared to clusters, graphics adapters are available for a wide range of users and we believe that their capabilities are enough to implement cellular automata. In addition, there are special source development kits or frameworks (for example, ThensorFlow [9]) that can exploit multi-core graphics adapters and help a researcher quickly and efficiently create a software product, without being distracted by thinking about parallelizing data flows and control flows. In this paper, we demonstrate an approach to implementation of cellular automata on graphics adapters based on TensorFlow. In order to use this tool, we propose to describe the set of states of an automaton cells' by the main data structure used in this framework, namely, the tensor. Then we describe the process of evolution of the automaton in terms of tensor operations. A well-known cellular automaton, the Conway's Game of Life, is used as a working example. The paper is structured as follows. Section 3 presents the basic concepts and definitions concerning the theory of cellular automata. Section 3 provides a description of the game Conway's Game of Life, its features and rules of operation. Section 4 is devoted to a detailed presentation of the proposed approach for software implementation of cellular automata on graphics adapters. The results of computer experiments with the implementation of the Conway's Game of Life and comparison with the results of a classical sequential implementation are presented in section 5.

Preliminaries
The Moore machine (finite, deterministic, fully defined) is a 6-tuple = 〈 ,, , , , 〉, where S is the finite nonempty set of states of the machine with a distinguished initial state ̂∈ , I is the finite set of input stimuli (input signals), O is a finite set of output reactions (output signals), : × → is a fully defined transition function, : → is a fully defined function of output reactions. If at some moment of time the Moore machine 〈 ,, , , , 〉 is at the certain state ∈ and the input signal ∈ arrives, then the machine changes its state to the state ′ = ( , ), and the signal = ( ′) appears at its output. The machine starts its operation from the initial state ̂ with the output signal (). It is important to note that originally Moore defined the machine so that the output signal of the machine is determined not by the final state of the transition, but by the initial one (i.e. in the definition above instead of = ( ′) should be = ( )). However, for our purposes it is more convenient to use the definition we have specified. Let ℤ be the set of integers. Consider the set of all possible integers pairs ( , ) ∈ ℤ × ℤ. With each pair ( , ) we associate some finite set of pairs of integers , ⊆ ℤ × ℤ, called the neighborhood of the pair (i, j). Pairs of Ni,j will be called neighbors of the pair (i, j). The sets Ni, j must be such that the following rule holds: if the pair (p, q) is the neighbor of the pair (i, j), then the pair (p + k, q + l) is the neighbor of the pair (i + k, j + l), where k and l are some integers. Note that the cardinalities of all neighborhoods coincide and the sets will have the same structure. For convenience, we assume that all neighbors from Ni, j are enumerated with integers from 1 to | Ni, j |, where | Ni, j | is the cardinality of the set Ni, j. Then we can talk about the first, second, etc. neighbor of some pair (i, j). If the pair (p, q) is the n-th neighbor of the pair (i, j), then the pair (p + k, q + l) is the n-th neighbor of the pair (i + k, j + l).
Consider the set of Moore machines of the form , = 〈 ,̂ , , , , , , 〉 such that ( ) = . Here i and j are some integers, is the n-th Cartesian power of the set B. The machines corresponding to the neighbors of the pair (i, j) are called neighbors of the machine Ai, j. Neighboring machines will be numbered as well as the corresponding neighboring pairs (that is, the first neighbor, the second, etc.). We specifically note that (i) for each machine Ai, j the set of states is the same, i.e. S; (ii) for each machine Ai, j, the set of output signals coincides with the set of states, that is, also S; (iii) as an output signal, the machine gives its current state; (iv) all machines have the same transition function and the same function of output reaction; (v) as an input signal, machines take tuples of states (of their neighbors), the number of elements in the tuple coincides with the number of neighbors, that is, equals to | Ni, j |; (vi) machines differ only in their initial states. Let at a given time moment the current state of the first neighbor of the machine Ai, j is equal to s1, the state of the second neighbor is s2, ..., the state of the n-th neighbor is sn, where n = | Ni, j |. Then the tuple (s1, s2, ..., sn) is the input signal of the machine Ai, j at this very moment. All machines accept input signals, change their states and provide output signals simultaneously and synchronously. That is, some global clock signal is assumed. The resulting set , | ( , ) ∈ ℤ × ℤ of the Moore machines is called a two-dimensional synchronous cellular automaton (or simply cellular automaton -CA). Each individual Moore machine of this set will be called a cell. The set of states of all cells the CA at a given time moment will be called the global state of the cellular automaton at this time moment. The transition rules of cells from one state to another (the function φ), the type of neighborhood of the cells (the sets Ni, j), the number of different possible cell states (the set S) define the whole variety of synchronous two-dimensional cellular automata. For clarity, one can draw cellular automata on the plane. For this, the plane is covered with figures. Coverage can be arbitrary, but of course, it is more convenient to do it in a regular way. Classic covers are equal squares, equal triangles and equal hexagons. The choice of one or another method of covering the plane is dictated by the original problem a CA is used for and the selected set of neighbors. Next, the cover figures are assigned to the cells of the cellular automaton in a regular manner. For example, let the plane be covered with equal squares, so that each vertex of each square is also the vertex of the other three squares of the coverage ( fig. 1a). Choose the square of this coverage randomly and associate it with the cell A0,0. Let the cell Ai,j be associated with a certain square. Then we associate the cell Ai + 1,j with the square on the right, the cell Ai -1,j with the square on the left, the cell Ai,j + 1 with the square above, and the cell Ai,j -1 with the square below ( fig. 1b). Cell states will be represented by the color of the corresponding square ( fig. 1c) a) b) c) The resulting square based representation of a CA on a plane is classical one. In our work we consider only this representation. For the square based representation of a CA, the neighborhoods shown in fig. 2 are the most common. If a given cellular automaton models a process (for example, heat transfer), then the various global initial states ̂ , | ( , ) ∈ ℤ × ℤ of the cellular automaton correspond to different initial conditions of the process. According to the definition of cellular automata introduced by us, the set of cells in it is infinite. However, from the point of view of practice, especially in the case of an implementation of a cellular automaton, a set of cells have to be made finite. In this case, some of the cells lack some neighbors. Therefore, for them the set of neighbors and the transition function are modified. Such modifications determine the boundary conditions of the process being modeled.

Conway's Game of Life
In the 70s of the 20th century, the English mathematician John Conway proposed a cellular automaton called the Conway's Game of Life [10]. The cells of this automaton are interpreted as biological cells. The state «0» corresponds to the «dead» cell, and the state «1» -«alive». The game uses the Moore's neighborhood (Fig. 2b), i.e. each cell has 8 neighbors. The rules for the transition of cells from one state to another are as follows:  if a cell is «dead» and has three «alive» neighbors then it becomes «alive»;  if a cell is «alive» and has two or three «alive» neighbors then it remains «alive»;  if a cell is «alive» and has less than two or more than three «alive» neighbors then it becomes «dead». For the convenience of perception, the behavior of each cell of the cellular automaton Conway's Game of Life can be illustrated using the transition graph ( fig. 3). Despite the simplicity of the functioning of the automaton, it is an object for numerous studies, since the variation of the initial configuration leads to the appearance of various images of its dynamics with interesting properties. One of the most interesting among them are moving groups of cellsgliders. Gliders not only oscillate with a certain periodicity, but also move through the space (plane). Thus, as a result of experiments, it was established that on the basis of gliders logical elements AND, OR, NOT can be built. Therefore any other Boolean function can be implemented. It was also proved that using the cellular automata Conway's Game of Life it is possible to emulate the operation of a Turing machine.

Features of Conway's Game of Life Parallel Implementation
According to our definition, a set of states of a cell is finite. It is obvious that, in this case, without loss of generality, we can assume that the set of states is the set of integers from 0 to |S| -1, where |S| -is the cardinality of the set of states. Therefore, the global state of the cellular automaton can be represented as a matrix A. The element Ai,j of this matrix is equal to the current state of the cell Ai,j. We call the matrix A the matrix of the global state of the cellular automaton. If there are no restrictions on the number of cells, then matrix A will be infinite. As have already been mentioned, the number of cells has to be limited from a practical point of view, that is, it is necessary to somehow choose the finite subset of cells. After that, only selected cells are considered. In this case, the ability to describe the global state of the cellular automaton by the matrix is determined by which cells are selected. We assume that the following set of cells is selected: where m and n -two fixed natural numbers. In this case, the global state matrix is obtained naturally. Since we use the TensorFlow framework for implementation of a CA, we should work with concepts defined in it. The main data structure in TensorFlow is a multidimensional matrix which in terms of this framework is called a tensor. However, in many cases, such a matrix may not correspond to any tensor. The tensor in the n-dimensional space must have n p + q components and is represented as (p + q)-dimensional matrix, where (p, q) is the rank of the tensor. And, for example, a 2 by 3 matrix does not follow these restricions. But the convenience of data manipulation provided by the framework justifies some deviations from strictly defining the tensor. Therefore, in the case when we are talking about the software implementation of a cellular automaton using TensorFlow, we will

 
Note the special role of the element S22 = 0,5. This element corresponds to the cell for which the number of living neighbors is calculated. Let the number of living neighbors of a certain dead cell be calculated. Then it will turn out to be integer because component S22 will be multiplied by the state of the dead cell (and it is equal to 0), and in the sum the number S22 will not participate. It will turn out to be half-integer in the case when the number of living neighbors of a living cell is calculated. This is important when the cell has two living neighbors. Then the dead cell must remain dead, and the living cell must live. That is, if after the convolution the counted number of living 223 neighbors turns out to be 2 (the cell is dead, it has 2 living neighbors), then in its place should be 0 in the tensor of the global state of the cellular automaton in the next cycle. If, after convolution, the counted number of living neighbors is 2.5 (the cell is alive, and it has 2 neighbors), then in its place should be 1 in the tensor of the global state of the cellular automaton in the next cycle. Constructing a convolution with the kernel S of the tensor T, we obtain the new tensor C, where at the intersection of the i-th row and j-th column there is an element corresponding to the number of living neighbors for the cell Ai,j. Note that we obtain a tensor (m -2)  (n -2) when constructing a convolution with a kernel of size 3  3 of an arbitrary tensor of the size m  n. In order to save the initial dimensions of the global state tensor of a cellular automaton, we will set the elements in the first and last row and in the first and last column of the global automaton tensor to 0. We will append these zero rows and columns to the result after the convolution is completed. Appended elements in the formula (3) are highlighted in gray. The mentioned fact suggests that some of the subsequent computations are superfluous (namely computations on the appended elements). The amount of extra computations for the global state tensor with dimensions m  n will be (2m -2) + (2n -2).
Then, the part of extra computations in the amount of useful computations is ( Taking into account the agreement on the half-integer value of the number of living neighbors, the integer part of the value of the tensor component C determines the number of living neighbors of the cell, and the presence of the fractional part means that the cell was alive in the previous step. According to the rules of the Conway's Game of Life it is necessary to transform the tensor C in order to determine the global state of the cellular automaton in the next step. Components with values in the range [2.5, 3.5] should take the value 1 (cells are alive). The remaining components should become 0 (cells are dead). Among the classical operations on tensors there is no operation that would allow to express the required transformation. However, the framework used in our work was created primarily for the problems of the theory of artificial intelligence, namely, for implementation of neural networks. The data flow there is the flow of tensors (a tensor as an input, a tensor as an output). Computational elements, that change data, are layers of the neural network. So, for example, in our case for the convolution we use a two-dimensional convolution layer with the kernel S (formula (2)). Any tool for neural network implementation ought to have the special type of layers -activation layers (layer of non-linear transformations). These layers calculate activation functions (some non-linear functions) of each element of the input tensor and put the result into the output tensor. TensorFlow offers a standard set of non-linear activation functions. In addition, it is possible to create custom activation functions. We built our own activation function based on a function from a standard set of functions, called a Rectified Linear Unit (ReLU). The function ReLU is defined as follows (formula (4)). Its graph is shown in fig. 5:  = (0, )  Taking into account the required transformation of the components of the tensor C described above, we suggested the function presented in (5) As a result of applying the function δ to each component of the tensor C, the tensor of the global state of the cellular automaton will take the following form (formula (6)).

 
Thus, the software implementation of the Conway's Game of Life using TensorFlow is a two-layer neural network. The first layer is convolutional, with the kernel from formula (2). The second layer is the activation layer with the activation function from formula (5).

Experimental results
We have implemented the described approach for the cellular automaton of the Conway's Game of Life in Python. Since there was no one in our group familiar with TensorFlow, but we have some experience in Keras [11], the implementation was built using Keras as a kind of wrapper over TensorFlow. Keras is a high level interface to various low-level artificial intelligence libraries, including TensorFlow. The resulting program was launched on a graphics adapter with CUDA support. For comparison with the classical implementation of the cellular automaton of the Conway's Game of Life on a uniprocessor system, we used the implementation of [12]. R-pentamino located in the middle of the field ( fig. 4)   states of the cellular automaton is much less than the time of transmission of information. As the field size grows, the computation time of the cellular automaton state becomes significant and the multiprocessor implementation on the graphics adapter begins to outrun the single-threaded speed. Obviously, the dependence of the execution time of programs on the "length" m of the square field side of the Conway's Game of Life must be parabolic. With the growth of m, the number of cells grows as m 2 , each cell needs to be processed once per cycle. Therefore, the number of operations must be of the order of m 2 . According to the obtained results we constructed regression polynomials of the second degree. Regression curves are in good agreement with experimental data (Fig. 7, 8). It may seem that for a multithreaded implementation the dependency should be different. However, we note that when the number of cells becomes much more than the number of cores in a multi-core system (in our case, the graphics adapter had 768 cores), then processing will be performed block by block: first comes one block of 768 cells, then another, etc. Thus, m 2 /K operations will be done, where K is the number of cores, that is, also of the order of m 2 .

Conclusions
In this paper, a tensor approach to the software implementation of cellular automata is described and programmatically implemented. The approach is focused on launching programs on multi-core graphics adapters. The program is implemented in Python using TensorFlow and Keras as an interface to TensorFlow. TensorFlow allows automatically generate and run multi-threaded programs on multi-core graphics adapters. The effectiveness of using the developed approach was shown during a series of computer experiments. For the experiments the Conway's Game of Life was chosen. If the number of cells in the automaton is less or equal to the number of cores, then the maximum acceleration can be observed. If the number of cells exceeds the number of cores, then the parallel sections of the program are executed sequentially. This means that with a very large size of the playing field the type of dependence will be parabolic when using a graphics adapter. The latter is confirmed by regression analysis.