Static analysis of source code is widely used to detect program errors. It is mostly applied to finding critical issues such as security vulnerabilities and serious program defects that lead to runtime errors such as crashes and unexpected program behavior. Many static analysis tools are also used to check code conformance to various coding style guides. In this case study we present the results of applying source code static analysis (SCSA) techniques to checking Huawei performance coding rules and evaluate whether manually fixing the reported issues in accordance with the guidelines can affect performance, or whether the compiler already applies all the necessary optimizations during compilation.
Fuzzing as part of continuous integration is a necessary tool aimed primarily at providing confidence in the software being developed. At the same time, given significant amounts of source code, fuzzing becomes a resource-intensive task, which is why increasing its efficiency, that is, reaching the needed code sections more quickly without reducing quality, becomes an important line of research. The article deals with approaches to improving the efficiency of fuzzing both for the kernel and for user-space software. On such amounts of program code, static analysis in turn produces a huge number of warnings about possible errors, and the main resources for this type of analysis are spent not on obtaining the results but on their analytical processing. The article therefore pays considerable attention to an approach that correlates the results of static and dynamic code analysis using the developed tool, which also makes it possible to perform directed fuzzing in order to confirm static analyzer warnings; this significantly increases the efficiency of testing the components of the protected Astra Linux OS.
As a result of prior work on the analysis of embedded Linux systems, the authors created the ELF (Embedded Linux Fuzzing) tool, which provides the functionality needed to use conventional dynamic analysis tools with IoT devices. The article discusses the use of full-system symbolic execution for the analysis of IoT systems based on the Linux kernel, describes how the S2E full-system symbolic execution framework is integrated into the ELF tool environment, and considers the applicability of the resulting toolchain to distributed hybrid IoT fuzzing.
Symbolic execution is a widely used approach for automatic regression test generation and for finding bugs and vulnerabilities. The main goal of this paper is to present a practical symbolic-execution-based approach for LLVM programs with complex input data structures. The approach is based on the well-known idea of lazy initialization, which frees the user from providing constraints on input data structures manually and thus enables fully automatic symbolic execution of even complex programs. Two lazy initialization improvements are proposed for segmented memory models: one based on timestamps and one based on type information. The approach is implemented in the KLEE symbolic virtual machine for the LLVM platform and tested on real C data structures: lists, binomial heaps, AVL trees, red-black trees, binary trees, and tries.
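To make the idea concrete, the following minimal Python sketch shows the basic lazy initialization scheme for a symbolic pointer: on its first dereference the state is forked into a NULL path, a fresh-object path, and alias paths to already materialized objects of a compatible type (the type check hints at how type information can prune aliasing choices). The State and SymbolicObject classes are hypothetical and do not reflect KLEE's actual implementation or the paper's timestamp-based variant.

    import copy
    from dataclasses import dataclass, field

    @dataclass
    class SymbolicObject:
        type_name: str
        fields: dict = field(default_factory=dict)   # field name -> symbolic value

    @dataclass
    class State:
        heap: list = field(default_factory=list)         # materialized symbolic objects
        constraints: list = field(default_factory=list)  # readable path constraints

    def lazy_init(state, ptr_name, pointee_type):
        """Fork the state when a symbolic pointer is dereferenced for the first time."""
        successors = []

        # 1. The pointer is NULL on this path.
        s_null = copy.deepcopy(state)
        s_null.constraints.append(f"{ptr_name} == NULL")
        successors.append(s_null)

        # 2. The pointer refers to a fresh, previously unseen object.
        s_new = copy.deepcopy(state)
        s_new.heap.append(SymbolicObject(pointee_type))
        s_new.constraints.append(f"{ptr_name} == new {pointee_type}")
        successors.append(s_new)

        # 3. The pointer aliases an already materialized object of a compatible type
        #    (type information prunes aliasing choices that cannot occur).
        for i, obj in enumerate(state.heap):
            if obj.type_name == pointee_type:
                s_alias = copy.deepcopy(state)
                s_alias.constraints.append(f"{ptr_name} == heap[{i}]")
                successors.append(s_alias)

        return successors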
The paper describes static analysis algorithms aimed at finding three types of errors related to the concept of a synchronizing monitor: reassignment of the mutual exclusion lock variable inside a critical section; use of a variable of an incorrect type when entering the monitor; and locking on an object whose methods lock on a reference to the instance itself (this). The developed algorithms rely on symbolic execution and perform interprocedural analysis via function summaries, which ensures scalability together with field, context, and flow sensitivity. The proposed methods were implemented in the infrastructure of a static analyzer as three separate detectors. Testing on a set of open-source projects revealed 23 errors with a true positive ratio of 88.5%, while the detectors accounted for only 0.1 to 0.7% of the total analysis time. The errors these detectors are designed to find are difficult to detect by testing or dynamic analysis because of their multithreaded nature. At the same time, finding them is necessary: even one such defect can make a program incorrect and even leave it vulnerable to intruders.
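For illustration only, the following Python snippet reproduces the first defect class (the lock variable is rebound inside the critical section) in a language with explicit locks; the paper's detectors target monitor-based synchronization and are not tied to this code.

    import threading

    lock = threading.Lock()
    counter = 0

    def buggy_increment():
        global lock, counter
        with lock:
            counter += 1
            # Defect: rebinding the shared lock object inside the critical section.
            # Threads that arrive afterwards acquire the *new* lock, so two threads
            # can end up inside their critical sections at the same time.
            lock = threading.Lock()

    def correct_increment():
        global counter
        with lock:
            counter += 1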
The IT industry has been thriving over the past decades: numerous new programming languages, architectural patterns, and software development techniques have emerged, and the tools involved in the development process ought to evolve as well. One of the key principles of the new generation of software development tools is the ability of the tools to learn using neural networks; first of all, the tools need to learn how to write code. In this work we study the ability of Transformers to generate competition-level code. The main goal is to discover whether open-source Big Transformers are “naturally” good coders.
We present new techniques for compiling sequential programs with almost affine accesses in loop nests for distributed-memory parallel architectures. Our approach is implemented as a source-to-source automatic parallelizing compiler that expresses parallelism with the DVMH directive-based programming model. Unlike previous approaches, ours addresses all three main sub-problems of distributed-memory parallelization: data distribution, computation distribution, and communication optimization. Parallelization of sequential programs with structured grid computations is considered. We use the NAS Parallel Benchmarks to evaluate the performance of the generated programs and provide experimental results on up to 9 nodes of a computational cluster with two 8-core processors per node.
Dynamic compilation of certain operator compositions can have a drastic impact on overall query performance, yet the DBMS planner may not take it into account during optimal query plan selection due to a lack of knowledge about its cost. To tackle this problem, we propose extending the cost model with criteria that make the overhead of dynamic compilation relevant. Tuning the optimizer criteria is necessary because the properties of different execution models restrict how efficiently a query plan can be executed with certain operator nodes. For example, the push model used in the dynamic compiler is advantageous when a query is executed with sequential scans, so dynamic compilation makes a sequential scan more efficient than an index scan, and using index nodes in such a plan diminishes the value of dynamic compilation. To overcome these problems, it is proposed to configure the DBMS optimizer so that it evaluates and takes into account the efficiency of particular node types when building a query plan that will subsequently be dynamically compiled. This paper discusses a modification of the PostgreSQL planner that selects the most efficient query execution plan based on hardware characteristics and on the execution model, interpretation or compilation, of the operator nodes.
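As a rough illustration of the idea, the sketch below compares hypothetical plan costs with and without a fixed dynamic-compilation overhead and a per-tuple speedup for compiled execution; the constants and formulas are invented for the example and do not reproduce PostgreSQL's actual costing code or the paper's model.

    def seq_scan_cost(n_tuples, cpu_tuple_cost=0.01, jit=False,
                      jit_speedup=0.5, jit_compile_cost=250.0):
        run_cost = n_tuples * cpu_tuple_cost
        if jit:
            # Compiled push-model execution processes each tuple more cheaply,
            # but pays a fixed compilation overhead up front.
            return jit_compile_cost + run_cost * jit_speedup
        return run_cost

    def index_scan_cost(n_matching, cpu_index_tuple_cost=0.005,
                        random_page_cost=4.0, pages_touched=1):
        # Index scans gain little from compilation, so no JIT term is added here.
        return pages_touched * random_page_cost + n_matching * cpu_index_tuple_cost

    # The planner would pick the cheapest alternative per query:
    n = 1_000_000
    plans = {
        "seq scan (interpreted)": seq_scan_cost(n, jit=False),
        "seq scan (compiled)":    seq_scan_cost(n, jit=True),
        "index scan":             index_scan_cost(n_matching=50_000, pages_touched=5_000),
    }
    print(min(plans, key=plans.get), plans)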
Long-term data storage is an important task for many modern scientific laboratories and data centers. To reduce the cost of owning digital information, some solutions use magnetic tape technology and special software to manage the media and the data. Considering their on-site infrastructure specifics and well-established data processing workflows, these organizations build and support such systems mainly on their own, which becomes an important task in the pursuit of technological sovereignty. This paper describes long-term data storage issues in the computing center of the Zababakhin All-Russia Research Institute of Technical Physics, where mathematical modeling computations generate vast amounts of scientific data. The architecture and functional composition of the developed Archive Data Storage System are given, as well as its internal data model, the chunk grouping rules, and the low-level tape format used. The measures taken to ensure the consistency of archived data, methods of storage media management, and issues of archival fund maintenance are also considered. A calculation scheme for the hardware configuration of a typical archive system site, sufficient to process the archiving data flows existing in the data center, is given.
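Purely as a hypothetical illustration of chunk grouping, the sketch below packs archive files into chunks bounded by a size limit before they would be written to tape; the system's actual grouping rules, data model, and low-level tape format are described in the paper and are not reproduced here.

    def group_into_chunks(files, chunk_limit=8 * 2**30):
        """files: iterable of (name, size_in_bytes); returns a list of chunks."""
        chunks, current, current_size = [], [], 0
        # Largest files first, so big items do not end up alone in a late chunk.
        for name, size in sorted(files, key=lambda f: f[1], reverse=True):
            if current and current_size + size > chunk_limit:
                chunks.append(current)
                current, current_size = [], 0
            current.append(name)
            current_size += size
        if current:
            chunks.append(current)
        return chunks

    print(group_into_chunks([("run1.dat", 6 * 2**30), ("run2.dat", 3 * 2**30),
                             ("log.txt", 2**20)]))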
The development of new technologies for voice communication has led to the need to improve speech enhancement methods. Modern users of information systems place high demands on both the intelligibility of the voice signal and its perceptual quality. In this work we propose a new approach to the problem of speech enhancement. For this purpose, a modified pyramidal transformer neural network with an encoder-decoder structure was developed. The encoder compresses the spectrum of the voice signal into a pyramidal series of internal embeddings. The decoder, using self-attention transformations, reconstructs the complex ratio mask between the clean and noisy signals from the embeddings computed by the encoder. Two possible loss functions were considered for training the proposed neural network model. It was shown that mixing frequency encoding into the input data improves the performance of the proposed approach. The neural network was trained and tested on the DNS Challenge 2021 dataset and showed high performance compared to modern speech enhancement methods. We provide a qualitative analysis of the training process of the implemented neural network: the network gradually moves from simple noise masking in the early training epochs to restoring the missing formant components of the speaker's voice in later epochs. This leads to high performance metrics and high subjective quality of the enhanced speech.
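The sketch below only shows the generic final step of such a pipeline, applying a predicted complex ratio mask to the noisy spectrum and resynthesizing the waveform with SciPy's STFT/ISTFT; the pyramidal transformer that predicts the mask is the paper's contribution and is treated here as a given input, with illustrative parameters.

    import numpy as np
    from scipy.signal import stft, istft

    def enhance(noisy, mask_real, mask_imag, fs=16000, nperseg=512):
        """noisy: 1-D waveform; mask_*: arrays with the STFT's shape."""
        f, t, spec = stft(noisy, fs=fs, nperseg=nperseg)
        mask = mask_real + 1j * mask_imag          # complex ratio mask
        enhanced_spec = mask * spec                # element-wise masking
        _, enhanced = istft(enhanced_spec, fs=fs, nperseg=nperseg)
        return enhanced

    # With an all-ones real mask the signal passes through essentially unchanged:
    x = np.random.randn(16000).astype(np.float32)
    f, t, s = stft(x, fs=16000, nperseg=512)
    y = enhance(x, np.ones_like(s.real), np.zeros_like(s.real))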
The article presents the results of an experimental evaluation of the parameters of an electronic document marking algorithm based on shifting interword distances. The developed marking algorithm is designed to improve the protection of electronic documents containing textual information against leakage through channels created by printing, scanning, or photographing a document and then sending the resulting image. The analyzed parameters of the algorithm are embedding capacity, invisibility, undetectability, extractability, and robustness. For the embedding capacity of the developed algorithm, analytical expressions are given that make it possible to calculate the maximum achievable embedding capacity. The obtained quantitative estimates and the experiments carried out substantiate the choice of admissible values of the embedded marker. To determine whether the embedded information is invisible in the source document, the invisibility and undetectability of the embedded marker were assessed. Expert evaluation substantiated the invisibility of the developed algorithm to visual analysis, as well as the absence of significant statistical deviations in the distribution of the analyzed parameters when assessing the resistance of the marking algorithm to the potentially best steganographic analysis method. The extractability of the developed marking algorithm was quantified by measuring extraction accuracy. The analysis showed high accuracy of marker extraction from scanned images, which makes it possible to extract embedded data reliably, and identified directions for improving extraction accuracy for photographed images. When assessing the stability of the marking algorithm against transformations and distortions, the main robustness parameters of the algorithm with respect to printing, scanning, and photographing were determined. Conclusions are formulated on the possibility of using the developed marking algorithm, and directions for further research are identified.
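As a deliberately simplified illustration of marking by interword distance shifting, the sketch below encodes a bit string by widening or narrowing successive interword gaps on a line of text and recovers it by comparing positions with the original; the paper's actual embedding scheme, capacity expressions, and extraction from scanned or photographed images are far more involved.

    def embed(word_positions, bits, delta=0.3):
        """word_positions: ascending x-coordinates of words on a line."""
        out = [word_positions[0]]
        shift = 0.0
        for i, bit in enumerate(bits, start=1):
            if i >= len(word_positions):
                break                      # capacity = number of gaps on the line
            shift += delta if bit else -delta
            out.append(word_positions[i] + shift)
        out.extend(word_positions[len(out):])
        return out

    def extract(original, marked, delta=0.3):
        bits, prev_shift = [], 0.0
        for o, m in zip(original[1:], marked[1:]):
            shift = m - o
            bits.append(1 if shift - prev_shift > 0 else 0)
            prev_shift = shift
        return bits

    positions = [0.0, 10.0, 22.0, 31.0, 45.0]
    marked = embed(positions, [1, 0, 1, 1])
    assert extract(positions, marked) == [1, 0, 1, 1]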
In recent years, due to significant changes in the labor market, companies have increasingly faced various problems when searching for and selecting candidates. The main reason for these problems is that the existing Internet resources for finding candidates do not make it possible to find a specialist with the required set of competencies and to fully evaluate their experience, skills, achievements, and personal characteristics. As a result, it becomes necessary to create a service for finding exclusive specialists. Most of these specialists do not have a publicly available resume and are not looking for a job, but are ready to consider interesting offers. This work is therefore devoted to studying the possibility of finding specialists with unique competencies on the Internet based on the analysis of their digital footprint. The hypothesis is that a complete profile of a unique specialist can be obtained by collecting, combining, and analyzing data from various sources. In the course of this work, the possibilities provided by open data sources on the Internet were analyzed, and the scientometric indicators of a specialist and the parameters of their reliability were determined. An algorithm for searching for the required specialists based on these data was developed, and an automated system implementing this search was designed, developed, and tested.
Nowadays, there is growing interest in solving NLP tasks using external knowledge storage, for example in information retrieval, question-answering systems, and dialogue systems. It is therefore important to establish relations between entities in the processed text and a knowledge base. This article is devoted to entity linking, where Wikidata is used as the external knowledge base and scientific terms in Russian are considered as entities. A traditional entity linking system has three stages: entity recognition, candidate generation (from the knowledge base), and candidate ranking. Our system takes as input raw text with the terms already marked. To generate candidates we use string matching between terms in the input text and entities from Wikidata. The candidate ranking stage is the most complicated one because it requires semantic information. Several experiments with different models were conducted for the candidate ranking stage, including an approach based on cosine similarity, classical machine learning algorithms, and neural networks. We also extended the RUSERRC dataset with manually annotated data for model training. The results showed that the approach based on cosine similarity outperforms the others and does not require manually annotated data. The dataset and system are open-sourced and available to other researchers.
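A minimal sketch of the cosine-similarity ranking step is shown below; it assumes that the term and each Wikidata candidate already have vector embeddings (how those vectors are obtained is the paper's concern and is not reproduced here), and the identifiers are made up for the example.

    import numpy as np

    def rank_candidates(term_vec, candidates):
        """candidates: list of (wikidata_id, embedding); returns ids best-first."""
        def cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
        scored = [(qid, cosine(term_vec, vec)) for qid, vec in candidates]
        return sorted(scored, key=lambda x: x[1], reverse=True)

    term = np.array([0.2, 0.9, 0.1])
    cands = [("Q1", np.array([0.1, 0.8, 0.2])), ("Q2", np.array([0.9, 0.0, 0.1]))]
    print(rank_candidates(term, cands))   # Q1 ranks above Q2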
Active use of data collections by experts and decision makers tasked with preparing decision alternatives is an essential characteristic of the effectiveness of an anthropotechnic system. In many cases such data analysis may require standalone visual analysis, which implies projecting a multidimensional data array onto a lower-dimensional space. The article presents the results of developing the theoretical foundation of such an algorithm, oriented towards an interactive analysis procedure.
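For context, the snippet below shows a standard non-interactive baseline for this kind of projection, PCA via SVD onto two dimensions; the article's own interactive projection algorithm is not reproduced here.

    import numpy as np

    def project_2d(X):
        """X: (n_samples, n_features) array; returns (n_samples, 2) coordinates."""
        Xc = X - X.mean(axis=0)                  # center the data
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:2].T                     # coordinates along the top 2 components

    rng = np.random.default_rng(0)
    data = rng.normal(size=(100, 7))
    print(project_2d(data).shape)                # (100, 2)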
Cloud technologies provide the ability to scale resources simply and reliably, which is why they have become widespread. The task of managing distributed services in a cloud environment is especially relevant today. Special programs called "orchestrators" are used for this purpose; they implement application lifecycle management functions. However, the existing solutions have many limitations and are not applicable in the general case. There is also no single standard or protocol for interacting with such tools, which requires adapting programs for each particular case. The main objectives of this paper are to identify the requirements for a platform-level (PaaS) cloud orchestrator and to propose flexible architectural patterns for such tools.
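As a purely hypothetical sketch of what "application lifecycle management functions" might look like as an interface, the snippet below defines a minimal abstract orchestrator; the paper's actual requirements and architectural patterns are more detailed and are not reproduced here.

    from abc import ABC, abstractmethod

    class Orchestrator(ABC):
        """Lifecycle management functions an orchestrator is expected to provide."""

        @abstractmethod
        def deploy(self, app_spec: dict) -> str:
            """Create the service described by app_spec; return its identifier."""

        @abstractmethod
        def scale(self, app_id: str, replicas: int) -> None:
            """Change the number of running instances of the service."""

        @abstractmethod
        def update(self, app_id: str, new_spec: dict) -> None:
            """Roll out a new version of the service."""

        @abstractmethod
        def remove(self, app_id: str) -> None:
            """Stop and delete the service and its resources."""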
An electrocardiogram (ECG) is one of the most common medical examinations, and high-quality interpretation of a 12-lead electrocardiogram is important for subsequent diagnosis and treatment. One of the important steps in interpreting an ECG is determining the boundaries of the elements of the PQRST complex. The article discusses mathematical methods for determining the boundaries of the P and T waves and of the QRS complex, as well as the R, P, and T peaks, and describes the shortcomings of these mathematical methods. It also presents the metric values obtained by training a neural network model for PQRST complex segmentation. The experiments performed show the relevance of using neural network and combined approaches to the analysis of the PQRST complex.
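As an example of the classical (non-neural) side of this task, the sketch below detects R peaks with simple band-pass filtering and thresholded peak picking; the filter settings and thresholds are illustrative, and the article's own mathematical and neural network methods are not reproduced.

    import numpy as np
    from scipy.signal import butter, filtfilt, find_peaks

    def detect_r_peaks(ecg, fs=500):
        """ecg: 1-D single-lead signal; fs: sampling rate in Hz."""
        # Band-pass around the QRS energy band (roughly 5-15 Hz).
        b, a = butter(2, [5 / (fs / 2), 15 / (fs / 2)], btype="band")
        filtered = filtfilt(b, a, ecg)
        # Peaks must exceed an adaptive threshold and be at least 200 ms apart.
        threshold = 0.5 * np.max(np.abs(filtered))
        peaks, _ = find_peaks(np.abs(filtered), height=threshold, distance=int(0.2 * fs))
        return peaks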
Melanoma is one of the most aggressive forms of cancer and can be successfully treated only if the disease is detected early. The article discusses the existing algorithms and methods of visual diagnosis of melanoma, as well as systems for automatic diagnosis from dermatoscopic images and the methods they use. The article considers the limitations hindering the development of automatic diagnosis systems: the lack of relevant domestic datasets for training artificial intelligence models, insufficient collection of patient metadata, and low population coverage of routine melanoma screening. A variant of building a decision support system for general practitioners analyzing dermatoscopic skin images is proposed.