In this paper, we apply convolutional neural network interpretation methods to a ResNet-18 model in order to identify and explain model errors. The model is used to classify the orientation of text document images. First, interpretation methods were used to form a hypothesis about why the neural network performs poorly on data that differs from the training images. The suspected cause was artifacts in the generated training images introduced by the image rotation function. Then, using the Vanilla Gradient, Guided Backpropagation, Integrated Gradients, and Grad-CAM methods together with a metric devised for this purpose, we confirmed the hypothesis. The obtained results helped to significantly improve the accuracy of the model.
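A minimal Grad-CAM sketch in PyTorch, assuming a four-class orientation classifier (0°, 90°, 180°, 270°); the paper's exact weights, preprocessing, and metric are not reproduced here:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Hypothetical 4-class orientation model; an illustration, not the authors' setup.
model = resnet18(num_classes=4).eval()

store = {}

def fwd_hook(module, inputs, output):
    store["act"] = output.detach()
    # Tensor hook: grabs the gradient of this activation during backward.
    output.register_hook(lambda g: store.update(grad=g.detach()))

# layer4 is the last convolutional stage of ResNet-18.
model.layer4.register_forward_hook(fwd_hook)

def grad_cam(image, target_class):
    """Heatmap of the image regions that drove the chosen class score."""
    logits = model(image)                       # image: (1, 3, H, W)
    model.zero_grad()
    logits[0, target_class].backward()
    # Channel weights = spatially averaged gradients (the Grad-CAM rule).
    weights = store["grad"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                        align_corners=False)
    return cam / (cam.max() + 1e-8)             # normalize to [0, 1]

heatmap = grad_cam(torch.randn(1, 3, 224, 224), target_class=0)
```

Overlaying such a heatmap on rotated training images is one way to make rotation artifacts visible as systematically attended regions.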
Automatic handwriting recognition is an important component of electronic document analysis, but its solution is still far from ideal. One of the main obstacles is the insufficient amount of data available for training recognition models; for the Russian language the problem is especially acute and is exacerbated by a large variety of complex handwriting styles. This paper explores the impact of various methods of generating additional training data on the quality of recognition models: a method based on handwritten fonts, the StackMix method of assembling words from symbols, and the use of a generative adversarial network. A font-based method for creating images of handwritten Russian text is developed and described in this work. In addition, an algorithm for constructing a new Cyrillic handwritten font from existing images of handwritten characters is proposed. The effectiveness of the developed method was tested in experiments carried out on two publicly available Cyrillic datasets using two different recognition models. The experiments showed that the developed image generation method increases handwriting recognition accuracy by 6% on average, which is comparable to the results of other, more complex methods. The source code of the experiments, the proposed method, and the datasets generated during the experiments are publicly available for download.
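The core of a font-based generator can be sketched with Pillow; the font file name below is a placeholder for any Cyrillic handwriting font, and the paper's further augmentations (slant, noise, background) are omitted:

```python
from PIL import Image, ImageDraw, ImageFont

def render_word(word, font_path, font_size=48, padding=10):
    """Render a word with a handwritten TTF font on a white background."""
    font = ImageFont.truetype(font_path, font_size)
    # Measure the word so the canvas fits it exactly.
    left, top, right, bottom = font.getbbox(word)
    w, h = right - left, bottom - top
    img = Image.new("L", (w + 2 * padding, h + 2 * padding), color=255)
    draw = ImageDraw.Draw(img)
    draw.text((padding - left, padding - top), word, font=font, fill=0)
    return img

# "handwritten.ttf" is a placeholder for a Cyrillic handwriting font file.
img = render_word("привет", "handwritten.ttf")
img.save("sample.png")
```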
This article discusses an algorithm for building a pattern recognition neural network. Several types of attacks on neural networks are considered, and the main features of such attacks are described. An adversarial attack is analyzed, and the results of its experimental testing are presented. The experiments confirm the hypothesis that the attack, when carried out by an attacker, reduces the recognition accuracy of the neural network.
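The abstract does not name the specific attack; the Fast Gradient Sign Method (FGSM, Goodfellow et al.) is shown below as a standard example of this class of adversarial perturbation, as a sketch rather than the authors' procedure:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """One-step perturbation that maximally increases the loss
    within an L-infinity ball of radius epsilon."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    # Step in the direction of the loss gradient's sign.
    adversarial = image + epsilon * image.grad.sign()
    return adversarial.clamp(0.0, 1.0).detach()
```

Evaluating accuracy on such perturbed inputs versus clean inputs is the usual way to quantify the drop the abstract describes.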
This paper was prepared in the course of developing a text classification system for legal documents, in particular those issued by the Legislative Assembly of Perm Krai. The problem at hand is the lack of solutions that meet regional requirements, the main one being the classification scheme used in the region. A study evaluating the applicability of Natural Language Processing models is presented. The primary result of the study is the demonstrated applicability of the Support Vector Machine (SVM) to the categorization of preprocessed legal documents. A server-side API was built to perform the task, along with several pre-trained server-side models, of which SVM performed best.
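A minimal sketch of such a pipeline with scikit-learn; the documents and category labels below are toy placeholders, not the regional classification from the paper:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy stand-ins for preprocessed legal documents and their categories.
docs = ["закон о бюджете края", "постановление о выборах депутатов"]
labels = ["budget", "elections"]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),   # bag-of-words with TF-IDF weighting
    ("svm", LinearSVC()),           # linear SVM classifier
])
clf.fit(docs, labels)
print(clf.predict(["проект бюджета на следующий год"]))
```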
The original information system «data farm» is presented. Today, the successful application of artificial intelligence algorithms, primarily deep learning based on artificial neural networks, depends almost entirely on the availability of data, and the larger the amount of data (big data), the better the algorithms perform. Well-known examples of such algorithms come from Facebook, Google, Microsoft, Yandex, etc. The data must contain both a training sample and a test sample. Moreover, the data must be of good quality and have a certain structure, ideally being labeled, in order for the learning algorithms to work adequately. This is a serious problem requiring huge computational and human resources, and this paper is dedicated to solving it. Today the data farm is a rather complex information system built on a modular basis, similar to the well-known Lego construction set. The individual modules of the system are various modern algorithms, technologies, and entire artificial intelligence libraries, and together they are designed to automate the process of obtaining and structuring high-quality big data in various subject domains. The system has been tested on COVID-19 data for the regions of Russia and countries around the world. In addition, a user-friendly interface for visualizing the data collected and processed on the farm was developed. This makes it possible to conduct visual computer simulation experiments and compare them with real data, turning the farm into an intelligent decision support information system.
The state of affairs in the area of missing information management in relational databases leaves much to be desired. The SQL standard uses the universal null value to represent missing data, and its handling is based on a three-valued logic in which the null value is identified with a third Boolean value. This solution is conceptually inconsistent and often results in DBMS behavior that is not intuitive. An alternative approach using typed special values leaves all handling of missing data to users. In this article, we analyze the long history of research and development that led to this situation. We come to the conclusion that no other solution could have appeared in the SQL standard given the choice of the universal null value mechanism more than 50 years ago, and that the alternative mechanism cannot provide system support for special values because it relies on two-valued logic. We propose a combined approach using typed special values based on three-valued logic. This approach allows the semantics of data types to be used when processing queries whose conditions involve unknown data. In addition, our approach allows us to define a full-fledged three-valued logic in which a special value of the Boolean type serves as the third Boolean value.
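For illustration, the three-valued (Kleene) logic underlying SQL's treatment of unknown data can be written down in a few lines; this is the textbook logic the article builds on, not the authors' combined mechanism:

```python
from enum import Enum

class TV(Enum):
    """Kleene's strong three-valued logic: FALSE < UNKNOWN < TRUE."""
    F = 0   # false
    U = 1   # unknown (SQL: any comparison with NULL)
    T = 2   # true

def and3(a: TV, b: TV) -> TV:
    return TV(min(a.value, b.value))

def or3(a: TV, b: TV) -> TV:
    return TV(max(a.value, b.value))

def not3(a: TV) -> TV:
    return TV(2 - a.value)

# SQL-style behavior: a condition that evaluates to UNKNOWN is not TRUE,
# so the corresponding row does not pass a WHERE filter.
assert and3(TV.T, TV.U) is TV.U
assert or3(TV.F, TV.U) is TV.U
assert not3(TV.U) is TV.U
```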
Numerical modeling of various physical processes involves the use of computing resources at different stages: preparation, calculation, and processing of results. At the same time, there is the problem of transferring data between the various application programs used on heterogeneous computing resources with different architectures. The main approaches to the development and use of software for working with the structured data of application programs are discussed in this article using HDF and SIO as examples.
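As a small illustration of the structured-data approach, the sketch below writes a record array with metadata to an HDF5 file via h5py; the group and field names are illustrative, not the layout used by the authors:

```python
import h5py
import numpy as np

# Structured simulation output: one record per cell, two named fields.
cells = np.zeros(1000, dtype=[("pressure", "f8"), ("temperature", "f8")])

with h5py.File("results.h5", "w") as f:
    grp = f.create_group("step_0000")
    grp.attrs["time"] = 0.0                    # metadata stored as attributes
    grp.create_dataset("cells", data=cells, compression="gzip")

with h5py.File("results.h5", "r") as f:
    print(f["step_0000/cells"]["pressure"][:10])
```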
This paper explores ways to achieve the highest possible performance when exchanging files containing structured data. The research was carried out on parallel-access file systems of supercomputer systems designed to solve problems of physical and mathematical modeling of various processes and objects. As an example, parallel access to raw data is considered for the Lustre file system. The article proposes a way to organize parallel access to structured data based on a specially developed PSIO storage format and the psio access library. A comparative analysis of the I/O performance of the developed storage format and the parallel version of the HDF5 format is performed.
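The PSIO API itself is not reproduced here; for the HDF5 baseline, parallel writing looks roughly as follows (a sketch assuming an MPI-enabled build of h5py), with each MPI rank writing its own slice of a single dataset:

```python
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size
n_local = 1_000_000                  # elements written by each rank

with h5py.File("field.h5", "w", driver="mpio", comm=comm) as f:
    dset = f.create_dataset("field", (size * n_local,), dtype="f8")
    # Each rank writes its own contiguous slice in parallel.
    dset[rank * n_local:(rank + 1) * n_local] = np.full(n_local, rank, "f8")
```

On Lustre, the achieved bandwidth of such writes also depends on file striping across object storage targets, which is one of the factors a comparison like the paper's has to control.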
This paper describes the tools developed by the authors for analyzing the topology of Ethernet networks and for collecting, accumulating, and displaying their operating statistics. Approaches to assessing the quality of network device operation based on these statistics are described. The developed software is used to analyze the operation of Ethernet networks in high-performance computing systems designed to solve numerical simulation problems. This work may be useful to specialists involved in the development and operation of Ethernet networks in computing systems based on the Linux operating system.
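The authors' tools are not public here, but on Linux the raw material for such statistics is the standard per-interface counters exposed by the kernel, which can be collected as in this sketch:

```python
from pathlib import Path

def read_iface_counters(iface):
    """Read the byte/packet/error counters the Linux kernel exposes
    for a network interface under /sys/class/net."""
    stats_dir = Path("/sys/class/net") / iface / "statistics"
    return {p.name: int(p.read_text()) for p in stats_dir.iterdir()}

counters = read_iface_counters("eth0")   # interface name is a placeholder
print(counters["rx_bytes"], counters["tx_errors"])
```

Sampling these counters periodically and storing the deltas is enough to derive throughput and error-rate statistics per device.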
The article is devoted to end-to-end testing of an application for managing the configuration of an enterprise virtual infrastructure. The main idea is to develop a software framework for creating and running end-to-end tests written in Python. The approach involves a comprehensive evaluation of the system, from the user interface down to the database, and the tests run in a continuous integration environment, which enables the team to test the system continually as new code is added. Automated tests allow for faster and more reliable testing and enable the team to exercise the system across multiple platforms and configurations. The approach also includes the use of virtual environments to simulate the production environment. This enables the team to identify potential issues that may arise in production and to test the system's performance under various conditions.
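A sketch of what one such end-to-end test might look like with pytest; the service URL, endpoints, and payloads are hypothetical stand-ins for the application under test:

```python
import pytest
import requests

# Hypothetical endpoint of the configuration-management service under test.
BASE_URL = "http://test-env.local/api"

@pytest.fixture
def vm_config():
    """Create a test configuration before the test and delete it after,
    so every end-to-end run starts from a clean state."""
    resp = requests.post(f"{BASE_URL}/configs", json={"name": "e2e-vm"})
    config = resp.json()
    yield config
    requests.delete(f"{BASE_URL}/configs/{config['id']}")

def test_config_roundtrip(vm_config):
    # Read the configuration back and check it reached persistent storage.
    resp = requests.get(f"{BASE_URL}/configs/{vm_config['id']}")
    assert resp.status_code == 200
    assert resp.json()["name"] == "e2e-vm"
```

Run inside CI against a virtualized copy of the production environment, such tests cover the full path from API to database on every merge.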
The object of the study is the processing of images in preparation for data transmission, as well as their subsequent recovery. The subject of the study is the application of parallel computing to image processing tasks. The purpose of the article is to study a method of image reconstruction that corrects distortions in the transmitted vector by inserting a value that follows the trend of its neighbors. The relevance of this topic is determined by the need to perform the operations preceding and following transmission over a communication channel efficiently. In the experimental part, execution times were measured for sequential execution and for parallel computation, and the latter showed the expected speed-up. Since the image consists of independent parts, it can be processed successfully using data parallelism.
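A sketch of the idea under an explicit assumption: the paper's exact trend rule is not given here, so distorted samples (flagged as negative values) are replaced by a linear extrapolation of the two preceding neighbors, and the independent rows are processed in parallel:

```python
import numpy as np
from multiprocessing import Pool

def correct_row(row):
    """Replace flagged (negative) samples with a value continuing the
    linear trend of the two preceding neighbors."""
    row = row.astype(np.float64).copy()
    for i in np.where(row < 0)[0]:
        if i >= 2:
            row[i] = 2 * row[i - 1] - row[i - 2]   # linear extrapolation
        elif i >= 1:
            row[i] = row[i - 1]                    # fall back to last value
    return row

if __name__ == "__main__":
    image = np.random.randint(0, 256, size=(2048, 2048)).astype(np.float64)
    image[image > 250] = -1                        # simulated distortions
    # Rows are independent, so they can be distributed across processes.
    with Pool() as pool:
        restored = np.vstack(pool.map(correct_row, list(image)))
```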
This article discusses the problems of supporting Python scripts in an actively developed interactive graphics system. Such support is a time-consuming task that is difficult to automate in the general case. As a solution, we propose an approach that allows developers to combine the creation of new system components with the simultaneous embedding of scripting support, without writing redundant additional code. The result is a user-friendly object-oriented API that describes all aspects of interaction between the system and scripts. Scripts using this API can be used to automate modeling as well as to extend the system with custom extension classes. The latter is especially important, as it allows ordinary users to extend closed-source systems on their own.
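A rough, entirely hypothetical mirror of such an object-oriented scripting API (all class and method names below are illustrative, not the system's actual API): the host exposes wrapper classes, and users add behavior by subclassing:

```python
class Node:
    """Wrapper that scripts use instead of touching internal system objects."""
    def __init__(self, name: str):
        self.name = name

class ExtensionClass:
    """Base class a user subclasses to add custom behavior to the host."""
    def run(self, node: Node) -> None:
        raise NotImplementedError

class RenameTool(ExtensionClass):
    """A trivial user-defined extension acting on a scene node."""
    def run(self, node: Node) -> None:
        node.name = node.name.upper()

# The host system would discover ExtensionClass subclasses and invoke them.
tool = RenameTool()
n = Node("part_01")
tool.run(n)
assert n.name == "PART_01"
```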
The most general structure of a computational algorithm implementing meshless Lagrangian methods of computational fluid dynamics is discussed. Not only the main procedures are touched upon, but also the “auxiliary” yet no less important ones, whose implementation is often all but ignored. Neglecting them can lead to a significant imbalance and a decrease in the efficiency of codes in which the “basic” computational operations are heavily optimized. The author's in-house codes VM2D and VM3D are discussed; at the first (“exploratory”) stage their development proceeded mainly along the path of choosing and implementing the necessary mathematical models, while acceptable efficiency was achieved “extensively”, by involving significant computing resources (in particular, graphics accelerators). An attempt is made to draw a conclusion about the expediency of using existing third-party libraries to perform computational geometry operations, solve problems on graphs, etc.
Vortex methods of computational fluid dynamics are an efficient tool in engineering practice for estimating the hydrodynamic loads acting on bodies placed in a flow. They make it possible to solve coupled hydroelastic problems at relatively small computational cost. Many applications deal with cross flow around structural elements of large elongation, which allows the flat cross-sections method to be used with acceptable accuracy; this in turn requires simulating plane flows around airfoils. Modern modifications of vortex particle methods make it possible to simulate flows of a viscous incompressible medium. Based on the method of viscous vortex domains, the VM2D code was developed in 2017-2022 at Bauman University and the Ivannikov Institute for System Programming. This code simulates the flow around airfoils with acceptable accuracy at low Reynolds numbers, while at higher Reynolds numbers correct results are obtained only for airfoils with sharp edges and corner points, and only in regimes where the most intensive flow separation takes place at these points. The error in the results for other regimes is attributed to incorrect modeling of flow separation on a smooth airfoil surface at high Reynolds numbers, which in turn is a consequence of incorrect modeling of the evolution of vorticity in the vicinity of separation points (zones). Some results of flow simulations around different airfoils at different Reynolds numbers are presented, and a hypothesis explaining the discrepancy between numerical results and experimental data is proposed. It is shown that the kinetic energy spectrum of the turbulence corresponds to “two-dimensional turbulence”.
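For reference (this is the classical Kraichnan picture of two-dimensional turbulence, not a result from the paper, and the abstract does not state which range the computed spectrum follows), 2D turbulence exhibits two inertial ranges around the forcing wavenumber $k_f$:

```latex
E(k) \propto k^{-5/3} \quad (k < k_f,\ \text{inverse energy cascade}), \qquad
E(k) \propto k^{-3} \quad (k > k_f,\ \text{direct enstrophy cascade}).
```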
The paper considers the mathematical modeling of the turbulent heat-conducting flow of a compressible viscous medium in the internal volume of the body of an air-thermal curtain equipped with a tangential fan. The solution is constructed on the basis of the Reynolds-averaged (Favre-averaged) Navier-Stokes equations. It is obtained using the MRF (Multiple Reference Frame) approach, which employs a rotating reference frame and a transformation of the governing Navier-Stokes equations in the rotation zone. To correctly describe the working processes occurring in the internal volume of the air-thermal curtain and in the environment, modular multiblock meshes are used, including ones that separate the rotating and stationary regions. The problem is solved with the tools of the OpenFOAM package. As a result, the peculiarities of the flow structure in the flow path of the air-thermal curtain are described in detail, and the gas velocities achieved at different fan speeds are estimated. The self-similarity of the velocity profiles at the air curtain nozzle outlet is demonstrated.
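What the MRF transformation amounts to can be sketched as follows (written here for the incompressible case for brevity; the paper uses the compressible Favre-averaged equations): in the rotating zone the momentum equation is formulated for the relative velocity $\mathbf{u}_r$ and acquires Coriolis and centrifugal source terms,

```latex
\frac{\partial \mathbf{u}_r}{\partial t}
+ (\mathbf{u}_r \cdot \nabla)\,\mathbf{u}_r
+ 2\,\boldsymbol{\Omega} \times \mathbf{u}_r
+ \boldsymbol{\Omega} \times (\boldsymbol{\Omega} \times \mathbf{r})
= -\frac{1}{\rho}\,\nabla p + \nu\,\nabla^2 \mathbf{u}_r ,
```

where $\boldsymbol{\Omega}$ is the angular velocity of the fan zone; no mesh motion is required, which is what makes multiblock meshes with separate rotating and stationary blocks convenient.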
The work is devoted to parametric investigations of a krypton flow in a conical micronozzle exhausting into a low-pressure region. The features of the flow are studied at various values of the stagnation pressure in the pre-nozzle volume, including the appearance of a condensed phase in the flow. Mathematical modeling was carried out on the basis of a numerical solution of the complete system of Navier-Stokes equations, supplemented by an equation for the mass fraction of the condensate. The mathematical model takes into account the dependence of the dynamic viscosity and thermal conductivity coefficients on the gas temperature. The problem was solved by the control volume method on a block-structured regular grid of quadrangular elements using second-order accurate schemes; the equations were integrated in time by the Runge-Kutta method. The calculations were carried out at stagnation pressures of 5, 10, and 15 atm for single-phase and two-phase flows. The distribution fields of temperature and Mach number in the nozzle and in the space behind it are presented, and the axial distributions of pressure, temperature, and Mach number are studied. It is shown that in the case of a single-phase flow the gas flow is self-similar: the pressure fields are similar and coincide with each other in dimensionless form, and the velocity and temperature fields are likewise identical at different values of the stagnation pressure. The self-similarity of the flow is violated in the zone where condensed particles form. The dimensions of the zones of local temperature increase, as well as the intensity of heat release, depend on the given stagnation pressure, which is reflected in the velocity characteristics of the flow.
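The self-similarity claim can be restated as follows (an illustration of the statement, not a formula from the paper): scaling the pressure by its stagnation value removes the dependence on $p_0$ in the single-phase regime,

```latex
\bar{p}(\mathbf{x}) = \frac{p(\mathbf{x};\,p_0)}{p_0}
\qquad\Longrightarrow\qquad
\bar{p}\big|_{p_0=5\,\mathrm{atm}}
\approx \bar{p}\big|_{p_0=10\,\mathrm{atm}}
\approx \bar{p}\big|_{p_0=15\,\mathrm{atm}},
```

while the velocity and temperature fields coincide directly, until condensation sets in.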
This study emphasizes the importance of short-read alignment in the analysis of human whole-genome sequencing data. Alignment determines the positions of short genetic sequences relative to a known human reference genome. Traditional alignment methods use a linear reference sequence, but this can lead to incorrect alignment, especially when short reads contain genetic variations. In this work, the index file of the reference sequence was modified using the minimap2 tool. Experimental results showed that adding information about frequently occurring genetic variations to the minimap2 index increases the number of correctly identified genetic variants, which improves the quality of subsequent data analysis.
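For context, basic short-read alignment against a minimap2 index looks as follows with the mappy Python binding (the index modification described above is performed on the minimap2 side and is not reproduced here; the file name and read are placeholders):

```python
import mappy as mp

# Load or build an index from a reference FASTA; "sr" = short-read preset.
aligner = mp.Aligner("ref.fa", preset="sr")
if not aligner:
    raise RuntimeError("failed to load/build the index")

read = "ACGTACGTACGTACGTACGTACGTACGTACGT"   # placeholder short read
for hit in aligner.map(read):
    # Reference contig, alignment interval, strand, and mapping quality.
    print(hit.ctg, hit.r_st, hit.r_en, hit.strand, hit.mapq)
```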