Implementing a new target architecture in QEMU involves creating a TCG front-end, the dynamic binary translator component for that architecture. Testing is necessary to verify the correctness of this translator. Existing TCG front-end testing systems are based on comparison with an oracle of the same processor architecture; the oracle may be a real processor, a high-fidelity emulator, or another binary translator. Unfortunately, such oracles are not always available. This paper is devoted to testing a target architecture implementation in QEMU when the necessary oracle is unavailable. The main idea rests on the observation that a program written in a high-level programming language is expected to behave identically regardless of the processor architecture. In other words, a real processor with a different architecture can be used for comparison; in this paper it is the AMD64 processor of the developer's machine. The objects of comparison are expressed in terms of a high-level language, i.e. the tests are written in C. C was chosen because, on the one hand, it is fairly close to the hardware and, on the other, it has compilers for many processor architectures. The approach is implemented in the CPU Testing Tool (c2t), which is part of QDT; the source code is available at https://github.com/ispras/qdt. The tool is implemented in Python and supports testing QEMU in both full-system and user-level emulation modes. c2t is suitable for testing TCG front-ends that are generated by the automatic TCG front-end generation system or implemented in the classical way (manually).
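A minimal sketch of the comparison idea (our own illustration, not the actual c2t API): the same C test is compiled for the host, which acts as the oracle, and for the guest; it is then run natively and under QEMU user-mode emulation, and the outputs are compared. The guest toolchain and QEMU binary names below are assumptions.

```python
import subprocess

def build_and_run(cc, runner, src, binary):
    # Compile the C test statically and run it, returning its stdout.
    subprocess.run([cc, "-static", "-o", binary, src], check=True)
    result = subprocess.run(runner + [binary], capture_output=True, text=True)
    return result.stdout

def check(src="test.c"):
    host = build_and_run("gcc", [], src, "./test_host")   # AMD64 host = oracle
    guest = build_and_run("riscv64-linux-gnu-gcc",        # hypothetical guest cc
                          ["qemu-riscv64"], src, "./test_guest")
    assert host == guest, "TCG front-end output diverges from the host oracle"
```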
QEMU is a widely used and fairly accurate emulator capable of emulating dozens of guest systems. Emulating a system involves configuring virtual devices, of which QEMU supports a great number, which results in a very long and complicated command line to start the emulator. When deterministic replay is used, the situation is complicated not only by additional and not quite obvious parameters, but also by the need to keep the recording and replay command lines synchronized. Machines can have different sets of devices depending on the platform and even on the version of the emulator. The article describes obtaining information about QEMU's devices through the QEMU Machine Protocol (QMP) for use in a graphical interface. The QemuGUI graphical interface supports the full cycle of work with the emulator: creating and configuring a virtual machine, starting it in normal mode and in deterministic replay mode, and interacting with the machine through the QEMU monitor.
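A minimal sketch (not QemuGUI's actual code) of querying device information over QMP. It assumes QEMU was started with `-qmp unix:/tmp/qmp.sock,server,nowait`; the commands `query-machines` and `qom-list-types` are standard QMP commands.

```python
import json, socket

sock = socket.socket(socket.AF_UNIX)
sock.connect("/tmp/qmp.sock")
f = sock.makefile("rw")

def command(name, **arguments):
    f.write(json.dumps({"execute": name, "arguments": arguments}) + "\n")
    f.flush()
    while True:                       # skip asynchronous event messages
        reply = json.loads(f.readline())
        if "return" in reply:
            return reply["return"]

json.loads(f.readline())              # consume the QMP greeting
command("qmp_capabilities")           # leave capabilities negotiation mode
machines = command("query-machines")  # machine types known to this binary
devices = command("qom-list-types", implements="device", abstract=False)
```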
Theoretical foundations of compositional reasoning about heaps in imperative programming languages are investigated. We introduce a novel concept of compositional symbolic memory and its relevant properties. We use these formal foundations to build a compositional algorithm that generates generalized heaps, i.e. terms of a symbolic heap calculus, which characterize arbitrary cyclic code segments. All states inferred in this calculus precisely correspond to reachable states of the original program. We establish a correspondence between inference in this calculus and execution of pure second-order functional programs. The contribution of this work is as follows: (1) a formal model of compositional symbolic memory is proposed; (2) its correctness properties are formulated; (3) a calculus of symbolic heaps is introduced whose derivations yield exactly the reachable states of the program; (4) the concept of generalized heaps is introduced, and an algorithm for automatic modular construction of generalized heaps from an imperative program is proposed; (5) an approach is proposed that reduces proof search in the symbolic heap calculus to proving the safety of functional programs.
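A toy illustration of the compositional idea (our own, far simpler than the paper's formalism): a segment's effect on the heap is summarized once as a term with symbolic reads of the segment's initial heap, and the summary can then be composed with any heap at any call site.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Read:            # symbolic read from the *initial* heap of the segment
    loc: str

def compose(outer, inner):
    """Apply segment summary `inner` on top of heap `outer`."""
    def resolve(v):
        return outer.get(v.loc, v) if isinstance(v, Read) else v
    return {**outer, **{loc: resolve(v) for loc, v in inner.items()}}

# Segment summary for x := y is {"x": Read("y")}; composing it with the
# concrete heap {"y": 5} yields {"y": 5, "x": 5}.
print(compose({"y": 5}, {"x": Read("y")}))
```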
The development of memory models aimed at solving various concurrency problems is an active research topic. One such model is the OCaml memory model (OCamlMM), which mitigates the undefined behavior caused by data races. To use this model in practice, one has to prove the correctness of its compilation to mainstream CPU architectures. At the moment, this has been done for x86 and ARM but not for Power. One way to prove the correctness of compiling OCamlMM to the Power model is to develop a compilation scheme from OCamlMM to the Intermediate Memory Model (IMM); this is sufficient because a compilation scheme from IMM to the Power model already exists. In this paper, a compilation scheme from OCamlMM to IMM is presented and proved correct. Memory fences and compare-and-swap instructions are used to permit only the behavior allowed by OCamlMM. The resulting scheme yields a correct compilation scheme from OCamlMM to the Power model. Moreover, once a compilation scheme from IMM to another CPU architecture is developed, this approach will immediately yield a correct scheme for compiling OCamlMM to that architecture.
This paper presents a novel approach to generating effective inputs for fuzz testing. Most applications check the input format before performing their main computation: they parse the service information of an input file to decide whether it is supported. Unsupported inputs are discarded, and the application terminates immediately. For example, the service information of an ELF (Executable and Linkable Format) file must start with the following data: "0x7f 'E' 'L' 'F'". If a file does not contain this information in its header, it is not treated as ELF. Effective fuzzing of an application with such an input validation stage is a relevant and important problem: random changes to input files usually corrupt the service data, and the target application exits immediately without executing its main code, which makes the fuzzing process inefficient. To solve this problem, we have designed and implemented three special plugins for ISP-Fuzzer. The first plugin collects execution traces. The second plugin relates fragments of the input data to the executed basic blocks of the target program; based on this information, we can determine the fragments of the input data (critical fragments) that should not be mutated when generating new test cases. The third plugin performs interval mutations: it mutates the input file while avoiding the critical fragments detected by the second plugin. Experimental results confirm the effectiveness of the proposed method.
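A minimal sketch of interval mutation (our own illustration, not ISP-Fuzzer code): random byte changes are applied only outside the critical fragments, represented here as (start, end) offset intervals, e.g. the ELF magic bytes at (0, 4).

```python
import random

def interval_mutate(data: bytes, critical: list[tuple[int, int]], n_flips=8) -> bytes:
    # Indices allowed for mutation: everything outside the critical intervals.
    allowed = [i for i in range(len(data))
               if not any(lo <= i < hi for lo, hi in critical)]
    out = bytearray(data)
    for i in random.sample(allowed, min(n_flips, len(allowed))):
        out[i] ^= random.randrange(1, 256)   # flip bits, keep service data intact
    return bytes(out)

seed = open("input.elf", "rb").read()
mutant = interval_mutate(seed, critical=[(0, 4)])   # never touch the ELF magic
```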
As part of improving the application development ecosystem for various Huawei devices, the company is working on a new programming language. The principal feature of the new language is support for component-oriented programming (COP), by which we understand the possibility of assembling (an essential part of) a program from ready-made components. One of the steps toward COP is, from our point of view, the right choice of OOP features. In the current work, we do not consider COP directly, focusing instead on the implementation of the OO paradigm. Currently, the situation with the OO paradigm is quite confusing; in fact, there is no consensus in the IT community on what OOP is. Suffice it to note that OOP in Go and Rust is fundamentally different from OOP in C++ and Java. Languages with object orientation based on classes and implementation inheritance (CLOP languages, where CLOP stands for Class-Oriented Programming) are criticized for their lack of flexibility and for the problems of developing reusable components. Since component reuse is important for us, we propose non-CLOP OOP features that allow one to implement objects that can be extended (by adding methods) without changing the object's source code and with minimal recompilation of clients.
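A rough analogy in Python (the paper's language is not shown here, so this is only a sketch of the general idea): new behavior is attached to an existing type from the outside, without editing the type's source, in the spirit of Go/Rust-style method extension.

```python
from functools import singledispatch

class Point:                      # "component" whose source we cannot modify
    def __init__(self, x, y):
        self.x, self.y = x, y

@singledispatch
def describe(obj) -> str:         # open, externally extensible operation
    return f"<{obj!r}>"

@describe.register                # a client adds a "method" for Point
def _(obj: Point) -> str:
    return f"Point({obj.x}, {obj.y})"

print(describe(Point(1, 2)))      # Point(1, 2)
```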
The paper first introduces the main methods for ensuring and assessing the reliability and safety of software and hardware systems throughout their life cycle, as well as for collecting information about the errors, defects, and failures that occur in systems for subsequent changes. Then we consider a standard reliability model and characterize its basic indicators, among which the reliability indicator, functionality, and safety form the basis for measuring reliability. After this, the paper gives a classification of reliability models and describes the types of evaluation used in verifying the reliability indicators of components of software and hardware systems. Next, we discuss the experimental results of applying the estimated reliability models to software components of different sizes, as well as the results of measuring the reliability indicator on these components, taking into account defect density, failure rates, and recovery. Finally, we note the importance of ensuring the dependability and safety of systems in the framework of the new standards for intelligent systems and the Internet of Things.
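The abstract does not state the underlying formulas; as a standard textbook illustration of the indicators it mentions (failure rate and recovery), the exponential reliability function and steady-state availability are commonly written as

$$R(t) = e^{-\lambda t}, \qquad A = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}},$$

where $\lambda$ is the failure rate, MTTF is the mean time to failure, and MTTR is the mean time to repair.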
To develop modern software systems, engineers and designers currently use an object-oriented approach, which leads to distributed object-oriented systems. A distinctive feature of distributed object-oriented systems is the distribution of object classes among various nodes. Typically, applications carry no information about how classes are distributed across servers, which motivates a restructuring procedure that improves performance. In the proposed approach, the restructuring of distributed object-oriented systems is carried out with an adaptive, neural-network-based method. At the initial stage, a class dependency graph is created, in which nodes represent classes and relations between nodes correspond to dependencies between classes. Then the properties of the classes in this graph are extracted and passed as input to the neural network for training. After that, clustering based on class dependencies is performed, partitioning the set of classes of the distributed object-oriented system into loosely coupled subsets. Next, a cluster graph is created, whose vertices correspond to clusters and whose edges correspond to communication channels that may exist between clusters. The k-medoids algorithm is applied to the resulting graph to gather the clusters into as many groups as there are available system nodes; the resulting cluster groups are loosely coupled. Finally, the cluster groups are assigned to the available nodes of the distributed environment. Simulation results show that the proposed approach gives better results than existing methods.
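A minimal sketch (not the paper's implementation) of the final step: k-medoids over the cluster graph, given a precomputed distance matrix D between clusters, with the number of groups k equal to the number of available nodes.

```python
import numpy as np

def k_medoids(D, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(D[:, medoids], axis=1)   # nearest-medoid assignment
        new = medoids.copy()
        for m in range(k):
            group = np.flatnonzero(labels == m)
            if len(group):                          # new medoid minimizes the
                new[m] = group[np.argmin(          # within-group distance sum
                    D[np.ix_(group, group)].sum(axis=0))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return labels, medoids
```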
In this paper, we compare different methods for cross-lingual similar document retrieval, focusing on the Russian-English language pair. We compare well-known methods like Cross-Lingual Explicit Semantic Analysis (CL-ESA) with methods based on cross-lingual embeddings. We use approximate nearest neighbor (ANN) search to retrieve documents based entirely on distances between learned document embeddings. We also employ a more traditional approach based on an inverted index, with an extra step of mapping the top keywords from one language to the other using cross-lingual word embeddings. We use aligned Russian-English Wikipedia articles to evaluate all approaches. The conducted experiments show that the approach with an inverted index achieves better performance in terms of recall and MAP than the other methods.
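A minimal sketch (not the paper's code) of retrieval by embedding distance: a query embedding in one language, mapped into the shared space, retrieves documents in the other language purely by nearest-neighbor search. The embeddings below are random placeholders, and scikit-learn's exact k-NN stands in for a real ANN index.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
en_docs = rng.normal(size=(1000, 300))          # English document embeddings
ru_query = rng.normal(size=(1, 300))            # Russian query embedding,
                                                # mapped into the shared space
index = NearestNeighbors(n_neighbors=10, metric="cosine").fit(en_docs)
dist, ids = index.kneighbors(ru_query)          # top-10 candidate documents
```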
Millions of news stories are distributed online every day. Tools for predicting the popularity of news stories are useful to ordinary people for discovering important information before it becomes generally known. Such methods can also be used to increase the effectiveness of advertising campaigns or to prevent the spread of fake news. One of the important features for predicting information spread is the structure of the influence graph. However, this feature is usually not available for news, because authors rarely post explicit links to their information sources. We propose a method for predicting the most popular news in an information flow that solves this problem by constructing a latent influence graph. Computational experiments with two different datasets have confirmed that our model improves the precision and recall of forecasting the popularity of news stories.
Obtaining labeled data is an expensive and time-consuming process. There are several approaches to reducing the number of examples needed for training. For example, active learning methods aim to select only the most difficult examples for labeling; active learning makes it possible to achieve results similar to supervised learning while using much less labeled data. However, such methods are often unstable and highly dependent on the choice of the initial approximation, and the optimal strategies for selecting examples for labeling either depend on the type of classifier or are computationally expensive. Another approach is domain adaptation. Most approaches in this area are unsupervised and are based on approximating the distribution of data across domains by solving an optimal transport problem or by extracting domain-independent features. Supervised learning approaches are not robust to changes in the distribution of the target variable. This is one of the reasons the semi-supervised domain adaptation task is posed: there is labeled data in the source domain, a lot of unlabeled data in the target domain, and the ability to obtain labels for some of the data from the target domain. In this work, we show how proactive labeling can help transfer knowledge from a source domain to a different but related target domain. We propose to use a machine learning model trained on the source domain as a free fallible oracle. This oracle determines the complexity of a training example in order to make two decisions: first, whether this example should be added to the training dataset; second, whether the knowledge learned from the source suffices to label this example ourselves or a trusted expert needs to be called. We present an algorithm that implements these ideas; one of its features is the ability to work with any classifier that has a probabilistic interpretation of its outputs. Experimental evaluation on the Amazon review dataset establishes the effectiveness of the proposed method.
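One plausible reading of the fallible-oracle idea, as a minimal sketch (the thresholds and triage rule are our assumptions, not the paper's exact algorithm): a source-trained classifier with probabilistic outputs self-labels the examples it is confident about and defers the difficult ones to a trusted expert.

```python
from sklearn.linear_model import LogisticRegression  # any probabilistic classifier

def triage(source_model, X_target, self_thr=0.9, add_thr=0.5):
    proba = source_model.predict_proba(X_target).max(axis=1)
    self_label = proba >= self_thr     # confident: the oracle labels these itself
    ask_expert = proba < add_thr       # difficult: worth paying for an expert
    return self_label, ask_expert      # the remainder are not added

# Usage: model = LogisticRegression().fit(X_source, y_source)
#        self_mask, expert_mask = triage(model, X_target)
```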
The article describes the results of applying i-vector-based (both LID and SID) speech identification methods to define a kind of distance between languages (in a wide sense of the word, including dialects and any other forms of spoken language). Spontaneous speech recordings of a sufficient number of speakers of the languages are used as input to the method. The experiments were carried out on recordings of Latvian and Latgalian dialects, but the method is applicable to any other idioms. Cosine similarity, the Euclidean metric, the standardized Euclidean metric, the Jordan (or Chebyshev) metric, and the city block (or L1) metric were tried out. Cosine similarity worked well for SID i-vectors but, for unknown reasons, was meaningless for LID i-vectors. The Jordan metric worked well for LID but was not good enough for SID i-vectors. Standardization of the Euclidean metric did not give any improvement. Thus, the conclusions are: 1) both SID and LID vectors of full-length recordings of spontaneous speech characterize and represent languages well enough to be used for detecting a distance between languages; 2) the best metrics for such tasks are the Euclidean and L1 metrics (applied to arithmetic mean vectors computed coordinate by coordinate from the i-vectors of all informants).
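A minimal sketch (not the paper's pipeline): each idiom is represented by the coordinate-wise arithmetic mean of its speakers' i-vectors, and the tried-out metrics are computed between the two mean vectors with SciPy. The i-vectors below are random placeholders.

```python
import numpy as np
from scipy.spatial import distance

latvian = np.random.default_rng(0).normal(size=(50, 400))    # speakers x dims
latgalian = np.random.default_rng(1).normal(size=(50, 400))
a, b = latvian.mean(axis=0), latgalian.mean(axis=0)          # mean i-vectors

print(distance.cosine(a, b))       # 1 - cosine similarity
print(distance.euclidean(a, b))
print(distance.chebyshev(a, b))    # the "Jordan" metric in the paper's terms
print(distance.cityblock(a, b))    # L1
```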
The paper presents methods and means for the development and verification of formal models of requirements and design solutions. They are intended for creating complex critical automated information systems in a single model-language and information-software environment shared by all participants. The development and verification processes are carried out in an automated way on the basis of subject-oriented ontologies. The ontologies describe the quality management processes of software and hardware complexes at the stages of requirements justification and system design. They are developed by means of the structures and mechanisms of the SysML, FUML, and OCL modeling and design languages, the mathematical apparatus of Petri nets, timed automata, and temporal logics. In order to validate and verify the complex of requirements and design solutions, algorithms for the construction and analysis of model execution routes in the FUML virtual machine environment are developed. Methods are proposed for integrating and using the specialized verification tools CPN Tools, Rodin, SPIN, and Modelica for automated testing of complex models of requirements and design solutions. This complex provides more effective interaction between the customer and the contractor both in the development of requirements and in the design of the system, and it enables the detection and elimination of defects through automated verification, validation, and correction procedures. Applying this approach will improve the quality of requirements and design solutions, as well as economic performance, by reducing the financial and time costs associated with additional work in case of defects and with changes of requirements or operating conditions.
Desktop Grid is a powerful tool for high-throughput computing. It is a form of distributed high-throughput computing system that uses the idle time of non-dedicated, geographically distributed computing nodes connected over a low-speed network. It differs significantly from computing clusters and computational grids and requires special handling. In this paper, we present a mechanism for dynamically forecasting the completion time of a computational experiment in a Desktop Grid. We propose a statistical approach based on a linear regression model with the calculation of a confidence interval, taking into account the accumulation of statistical error and, if needed, revising the forecast. The developed approach is used to implement a forecasting algorithm and a software module for a Desktop Grid. We present experimental results based on data from the RakeSearch volunteer computing project.
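A minimal sketch (not the paper's module) of the statistical idea: fit a linear trend to the fraction of completed tasks over time and extrapolate the completion moment with a rough confidence band; the data points below are placeholders, and the forecast is revised as new points arrive.

```python
import numpy as np
from scipy import stats

t = np.array([0.0, 1, 2, 3, 4, 5])                     # hours since start
done = np.array([0.0, 0.09, 0.21, 0.30, 0.41, 0.52])   # completed fraction

fit = stats.linregress(t, done)
eta = (1.0 - fit.intercept) / fit.slope    # point forecast of completion time
half = 1.96 * fit.stderr * (1.0 - fit.intercept) / fit.slope**2
                                           # rough 95% band via the delta method
print(f"ETA: {eta:.1f} h (+/- {half:.1f} h)")
```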
The use of machine learning in cybersecurity is a difficult problem: advances in the field offer so many opportunities that it is challenging to find exceptional and beneficial use cases for implementation and decision making. Moreover, such technologies can also be used by intruders to attack computer systems. The goal of this paper is to explore the use of machine learning in cybersecurity and in cyberattacks, and to provide a model of a machine-learning-powered attack.
Most software model checking tools do not scale well on complicated software. Our goal was to develop a tool that provides an adjustable balance between precise but slow software model checkers and fast but imprecise static analyzers. The key idea of the approach is to abstract away precise thread interaction and analyze each thread separately, together with a specific environment that models the effects of the other threads. The environment contains a description of potential actions over the shared data and synchronization primitives, together with the conditions for their application. By adjusting the precision of the environment, one can achieve the required balance between the speed and the precision of the complete analysis. A formal description of the suggested approach was given within the theory of Configurable Program Analysis, which allows formulating assumptions and proving the soundness of the approach under those assumptions. For efficient data race detection, we use a specific memory model that partitions memory into a disjoint set of regions corresponding to data types. An implementation of the suggested approach in the CPAchecker framework allows reusing existing approaches with minimal changes. Implementing additional techniques according to the extended theory increases the precision of the analysis. Evaluation results confirm the scalability and practical usability of the approach.
We previously published algorithms for searching for the so-called Laurent and regular solutions of linear ordinary differential equations whose coefficients are infinite formal power series. The question of representing infinite series is very important for computer algebra. In those algorithms the series are given in truncated form, which means that we do not have complete information about the equation under consideration. Based on this incomplete information, the algorithms compute the maximum possible number of terms of the series included in the solutions. We are interested in the information about these solutions that is invariant to all possible prolongations of the truncated series representing the coefficients of the equation. The mentioned publications reported preliminary (trial) versions of procedures implementing these algorithms, as well as experiments with them. To date, the procedures have been improved, and their interface and data presentation have been designed in a uniform manner. The advanced procedures are discussed in the current paper. Various examples are presented that illustrate the use of the procedures, including their optional parameters. These procedures are available from the web page http://www.ccas.ru/ca/TruncatedSeries.
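As a generic illustration of the notation (not an example from the paper), a truncated coefficient and a Laurent solution can be written as

$$a(x) = \sum_{k=0}^{t} a_k x^k + O(x^{t+1}), \qquad y(x) = \sum_{k \ge v} c_k x^k, \quad v \in \mathbb{Z},$$

where only those terms of $y(x)$ that are invariant under all prolongations of the truncated $a(x)$ are reported.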