
Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Vol 27, No 1 (2015)
https://doi.org/10.15514/ISPRAS-2015-27(1)

5-24
Abstract

Dynamic program analysis is a prominent approach to software quality control that enables automatic profiling, defect detection, and other activities during software development. In this paper we focus on static binary code instrumentation, a technique that automatically modifies program executable code in order to extract the data necessary for dynamic analysis. We discuss the key features of this technique in the context of dynamic analysis and propose a method for static binary code instrumentation of ELF executable and shared library files, specifically targeting the ARM architecture.

We describe the main steps of the proposed method: parsing the instrumentation specification and the target code, generating executable instrumentation code, and finally modifying the target executable file so that the instrumentation code is inserted and control transfers between the original code and the instrumentation code take place at runtime.

Executable file modification stays within the bounds of the ARM ELF specification and is designed to minimize the changes introduced into the actual executable code blocks. Instrumentation code is appended to the target files as a set of separate sections; control is transferred to it through unconditional jump instructions that replace small blocks of original instructions at the instrumentation points. To preserve the original functionality, we wrap the instrumentation code blocks with instructions that save and restore program state; additionally, the instructions replaced at the instrumentation points are moved into the instrumentation code blocks. We also describe a set of modifications performed in order to introduce the external dependencies of the instrumentation code into the target executable files.
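This patching scheme can be illustrated with a minimal Python sketch under simplifying assumptions (it is not the authors' implementation): an unconditional A32 branch overwrites an original instruction at the instrumentation point, and an appended block saves state, runs the payload, restores state, replays the displaced instruction, and branches back.

```python
# Minimal sketch of the branch-to-trampoline patching scheme; all layout
# decisions here are illustrative assumptions, not the paper's tool.
import struct

def encode_arm_branch(patch_addr: int, target_addr: int) -> bytes:
    """Encode an A32 'B target' instruction placed at patch_addr.

    In ARM state, PC reads as the instruction address + 8, so the signed
    24-bit immediate stores (target - patch_addr - 8) / 4.
    """
    offset = target_addr - patch_addr - 8
    assert offset % 4 == 0 and -(1 << 25) <= offset < (1 << 25)
    return struct.pack("<I", 0xEA000000 | ((offset >> 2) & 0xFFFFFF))

def build_trampoline(displaced_insn: bytes, tramp_addr: int,
                     resume_addr: int) -> bytes:
    """One instrumentation block: save, payload, restore, replay, jump back.

    CPSR flag handling is omitted for brevity, and a PC-relative displaced
    instruction would additionally need rewriting (relocation) here.
    """
    push_all = struct.pack("<I", 0xE92D5FFF)  # PUSH {r0-r12, lr}
    payload  = struct.pack("<I", 0xE320F000)  # NOP: stands in for analysis code
    pop_all  = struct.pack("<I", 0xE8BD5FFF)  # POP  {r0-r12, lr}
    body = push_all + payload + pop_all + displaced_insn
    return body + encode_arm_branch(tramp_addr + len(body), resume_addr)

# At an instrumentation point `pt`, the original instruction is overwritten
# with encode_arm_branch(pt, tramp_addr), and resume_addr is pt + 4.
```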

The proposed method was implemented in an instrumentation framework. We provide a brief overview of practical experiments that use basic block counting and function entry/exit tracing as baseline instrumentation applications. The results show better performance than the popular dynamic instrumentation framework Valgrind, and low overhead for system-wide tracing of native Android libraries.
25-38
Abstract

This paper focuses on dynamic analysis of Java programs. We consider the following limitations: the analysis tool may not have access to the target program's source code, and the program may be interpreted by a non-standard virtual machine whose bytecode format differs from the Java Virtual Machine specification.

The paper describes an approach to bytecode instrumentation used to perform iterative dynamic analysis for discovering new execution paths. Path discovery is performed through automatic input data generation: tainted data is traced, path conditions are collected, and their satisfiability is checked. The proposed approach is based on static bytecode instrumentation. Its main advantages are analysis speedup (instrumentation is performed only once) and explicit access to the statically instrumented bytecode, which makes it possible to run the instrumented program on different virtual machines with different bytecode formats.

The proposed approaches were implemented in the Coffee Machine tool, which uses the BCEL bytecode instrumentation library. The sections dedicated to this tool give a detailed description of the taint data tracing and automatic branch traversal techniques, as well as a set of instrumentation utilities built on Coffee Machine for printing executed instructions, dumping taint traces, and generating synchronization event traces. The paper concludes with an overview of the practical restrictions of the discussed methods and possible directions of future work. The main disadvantage of the proposed approach is its inability to access dynamic data at run time or to instrument a set of system class methods; this may be resolved by method simulation and execution environment modifications.
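The iterative path discovery loop can be sketched as follows (the z3 SMT solver API is real; the branch conditions are toy stand-ins, not Coffee Machine output): record the path condition of one concrete run, negate its last branch, and solve for an input that drives execution down the unexplored side.

```python
# Minimal sketch of path discovery by path-condition negation.
from z3 import And, Int, Not, Solver, sat

x = Int("x")  # symbolic stand-in for a tainted program input

# Suppose a concrete run with x = 0 followed the path (x < 10), (x != 7).
path_prefix = [x < 10]
last_taken = x != 7

# Keep the prefix, flip the last branch: target (x < 10) and (x == 7).
solver = Solver()
solver.add(And(*path_prefix, Not(last_taken)))
if solver.check() == sat:
    print("new input:", solver.model()[x])  # -> 7, taking the untried branch
```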

39-50
Abstract
This article describes methods of code clone detection. Based on an analysis of existing methods, a new approach to code clone detection is proposed for the C/C++ languages. The method relies on semantic analysis of the project, which allows detecting code clones with high accuracy, and is implemented as part of the LLVM compiler, which allows it to outperform existing methods. The tool consists of three basic parts. The first part is Program Dependence Graph (PDG) generation and serialization: PDGs are constructed at compile time from LLVM's intermediate representation, several simple optimizations are applied to them, and they are then serialized to a file. The second stage is the analysis of the stored PDGs. The PDGs are loaded from files and split into subgraphs, and every subgraph is considered a clone candidate; a new splitting method is proposed that increases the number of detected clones. Two types of algorithms are used for clone detection. Algorithms of the first type try to prove that a pair of PDGs cannot be clones; they have linear complexity, which allows processing a huge number of PDG pairs. If they fail, graph isomorphism algorithms are applied to detect similar subgraphs. The last part is an integrated system for automatic testing of the algorithm's accuracy: a set of clones is generated automatically for a project, and the clone detection algorithms are then applied to the original source and the generated one.
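The two-stage check can be illustrated with a minimal sketch (not the LLVM-based tool itself; networkx and its isomorphism matcher are real, while the PDG node labels are assumed): a linear-time filter refutes most candidate pairs cheaply, and full graph isomorphism is attempted only for the survivors.

```python
# Minimal sketch of cheap refutation followed by an isomorphism check.
from collections import Counter
import networkx as nx

def cheap_reject(g1: nx.DiGraph, g2: nx.DiGraph) -> bool:
    """Linear-time refutation: clone PDGs must agree on size and on the
    multiset of node kinds (here, an assumed 'op' label per node)."""
    if g1.number_of_nodes() != g2.number_of_nodes():
        return True
    if g1.number_of_edges() != g2.number_of_edges():
        return True
    kinds = lambda g: Counter(d.get("op") for _, d in g.nodes(data=True))
    return kinds(g1) != kinds(g2)

def are_clone_candidates(g1: nx.DiGraph, g2: nx.DiGraph) -> bool:
    if cheap_reject(g1, g2):  # handles the vast majority of pairs
        return False
    nm = nx.algorithms.isomorphism.categorical_node_match("op", None)
    return nx.is_isomorphic(g1, g2, node_match=nm)  # expensive fallback
```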
51-68
Abstract
Graph learning with automata is a basic task in many applications, among them formal model-based verification and testing of software and hardware systems, as well as network exploration, including the Internet and GRID. The model of a system or a network is, in the final analysis, reduced to a directed graph of transitions whose properties are to be examined. In recent years the size of real-life systems and networks, and consequently the size of their graph models, has grown continuously. Problems arise when graph learning by a single automaton (computer) either requires an unacceptably long time, or the graph does not fit into the memory of a single computer, or both. This leads to the task of parallel and distributed graph learning, formalized as graph learning by a set of automata.

The basis for such learning is graph traversal: traversing all arcs accessible from the initial node. Automata can be created in the initial node, move along the arcs of the graph in accordance with their orientation, and exchange messages through a communication network independent of the graph. The total memory of the automata stores the description of the already traversed part of the graph. To move from a node along one of its outgoing arcs, an automaton must somehow identify that arc by indicating its number. In our paper "Graph learning by a set of automata" we proposed such a traversal algorithm for deterministic graphs.

The task is more complicated if the graph is nondeterministic: a single arc number may, generally speaking, correspond to multiple arcs, one of which is chosen nondeterministically for traversal. To make traversal possible, the following must be guaranteed: over an unlimited number of experiments, each outgoing arc with a given number can be traversed. We call such nondeterminism fair. This paper presents a solution to the problem of traversing fairly nondeterministic graphs.
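A minimal illustrative sketch of the fairness notion (the graph encoding below is an assumption, not the paper's formalism): one arc number at a node may correspond to several arcs, and because every one of them is eventually taken over repeated attempts, a traversal that retries until all arcs are seen terminates with probability 1.

```python
# Minimal sketch of traversing a fairly nondeterministic graph.
import random

# node -> arc number -> possible destination nodes (strongly connected here)
graph = {0: {0: [1, 2]}, 1: {0: [0]}, 2: {0: [0]}}

def traverse_all_arcs(start: int = 0) -> int:
    """Walk until every (node, arc number, destination) triple is observed."""
    remaining = {(n, a, d) for n, arcs in graph.items()
                 for a, dests in arcs.items() for d in dests}
    node, steps = start, 0
    while remaining:
        arc_no = random.choice(list(graph[node]))
        dest = random.choice(graph[node][arc_no])  # fair nondeterministic choice
        remaining.discard((node, arc_no, dest))
        node, steps = dest, steps + 1
    return steps

print(traverse_all_arcs())  # step count varies per run, but always finishes
```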
69-96
Abstract
Monitoring of directed graphs is a key task in many applications. Such monitoring is very specific when the graph models a communication network, including the Internet and GRID. A node of the network has only local information about the network: it "knows" only about the arcs outgoing from it, but does not "know" where (to which nodes) these arcs go. The nodes exchange messages through the network links, which are represented as arcs of the graph and act as message transfer channels. Graph monitoring is based on traversal, in which a message passes through each arc of the graph: while there is an untraversed arc, we cannot be certain that it does not lead to a still unmonitored part of the graph. Usually the traversal is performed by a single message circulating in the network; it can be done faster in parallel, with multiple messages circulating simultaneously. In this paper we consider parallel monitoring of a graph aimed not just at graph traversal, but also at the collection of complete information about the graph in each of its nodes. Another feature of this work is the monitoring of a dynamically changing graph: its arcs can disappear, appear, or change their destination nodes. An algorithm is proposed that provides the collection of full information about the graph in each of its nodes.
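The dissemination goal can be sketched in a few lines (a simplified, synchronous, static-graph model; the paper's algorithm handles dynamic changes, which this sketch does not): each node starts knowing only its outgoing arcs, repeatedly sends everything it knows along those arcs, and stops once every node holds the full arc set.

```python
# Minimal round-based sketch of collecting full graph knowledge in every node.
graph = {0: [1], 1: [2, 0], 2: [0]}  # node -> destination nodes

all_arcs = {(u, v) for u, vs in graph.items() for v in vs}
knowledge = {n: {(n, v) for v in graph[n]} for n in graph}  # local view only

rounds = 0
while any(view != all_arcs for view in knowledge.values()):
    inbox = {n: set() for n in graph}
    for u in graph:                      # every node floods its knowledge
        for v in graph[u]:
            inbox[v] |= knowledge[u]
    for n in graph:
        knowledge[n] |= inbox[n]
    rounds += 1

print(rounds)  # 2 rounds suffice for this strongly connected example
```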
97-124
Abstract
This paper surveys existing methods of extracting models from hardware system descriptions written in hardware description languages (such as Verilog and VHDL). Models are applied to many tasks in hardware and software design; the most relevant tasks mentioned in this paper are code optimization, logic synthesis optimization, model abstraction, and functional verification. The model categories chiefly described here are flow (dependency) graph models and automata models. For flow and dependency graphs, methods of extracting program slices are described in detail; program slices are well suited to directed test generation. Almost all of the described automata models are finite state machine and extended finite state machine models, so methods for extracting such models are the most popular for logic synthesis optimization and functional test generation; the tests generated from automata models show high coverage of the target description. The key problems of existing model extraction methods are the complexity of applying them to industrial hardware descriptions (because of their complex structure), the lack of automation (sometimes the hardware designer's knowledge is needed), and the absence of open-source implementations. Another pressing task is to create extensible frameworks for integrating different model extraction and analysis methods; such a framework could help in developing effective hybrid methods for hardware synthesis and verification.
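The slicing technique highlighted above can be sketched as a transitive closure over a dependence relation (the statement IDs and dependences below are toy assumptions, not extracted from real Verilog/VHDL):

```python
# Minimal sketch of backward slicing over a dependency graph.
def backward_slice(deps: dict[int, set[int]], criterion: int) -> set[int]:
    """All statements the slicing criterion transitively depends on."""
    sl, work = {criterion}, [criterion]
    while work:
        s = work.pop()
        for d in deps.get(s, ()):  # data/control dependences of s
            if d not in sl:
                sl.add(d)
                work.append(d)
    return sl

# stmt -> statements it depends on (e.g. statement 4 reads results of 2 and 3)
deps = {4: {2, 3}, 3: {1}, 2: set(), 1: set()}
print(sorted(backward_slice(deps, 4)))  # -> [1, 2, 3, 4]
```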
125-150
Abstract
The paper describes different types of global information network services used in modern distributed software systems, and shows several approaches to service description and service interaction. Widespread, well-established approaches to building distributed object interaction are described, as practiced in systems with a broad spectrum of system and functional services. The international web-service standards stack (the SOAP protocol, the WSDL language, the UDDI interface and service, XML, the BPEL and BPMN notations, the CORBA object broker and services, the JEE application server) is briefly described. Models of web services, the service-oriented approach (SOA), service-component architectures (SCA), and application systems built on Windows Communication Foundation (WCF) are considered as ways of representing business systems as services ready to solve business tasks. Features that support interoperability and multi-language interaction are highlighted, and the principles of distributed software design (service composition, component engineering, interoperability, variance) that require special attention in SOA are listed. Some examples of presenting information using the standard notations are given, along with an example of a calculator service with data processing functions in the ITK service environment (a WCF-based system) and the underlying concept of web service implementation.
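For a flavor of the WSDL/SOAP interaction style discussed here, a minimal client sketch (zeep is a real Python SOAP library; the endpoint URL and the Add operation are hypothetical assumptions, not the article's ITK service):

```python
# Minimal sketch of consuming a WSDL-described calculator service.
from zeep import Client

client = Client("http://example.org/calc?wsdl")  # hypothetical endpoint
print(client.service.Add(2, 3))  # sent as a SOAP envelope; expected -> 5
```

The WSDL document supplies the message schema, so the client needs no hand-written serialization code.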
151-172
Abstract
The paper presents a novel approach to user identification based on behavioral analysis of user operations with textual information. User behavior is described by the content of the user's text documents. A structured representation of this behavioral information is obtained by representing the text content of documents in a user topic space created by non-negative matrix factorization. The topic weights in a document characterize the user's topic trend while working with that document, and the variation of topic weights over time forms a multidimensional time series describing the history of the user's behavior when working with text data. Forecasting such time series allows for user identification based on the estimated deviation of the observed topic trend from the predicted topic weight values.

The paper also presents a new time series forecasting method based on orthogonal non-negative matrix factorization (ONMF), which is used within the proposed user identification approach; it is worth noting that non-negative matrix factorization methods had not previously been used for time series forecasting. The proposed identification approach was experimentally verified on real corporate email correspondence derived from the Enron dataset. Experiments with other currently popular forecasting methods showed the superiority of the proposed forecasting method in the quality of user topic weight classification. We also investigated two approaches to estimating the deviation of a time series point from its predicted value, absolute deviation and p-value estimation; experiments showed that both are applicable within the proposed user identification approach.
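The pipeline can be sketched end to end under loud assumptions: sklearn's standard NMF stands in for the paper's orthogonal NMF, a naive last-value forecast stands in for the proposed ONMF forecaster, and the documents and threshold are synthetic.

```python
# Minimal sketch: documents -> topic weights -> time series -> deviation test.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
docs = rng.random((50, 200))  # 50 "documents" x 200 terms, in time order

W = NMF(n_components=5, init="nndsvd", random_state=0).fit_transform(docs)
# W[t] is the topic-weight vector of document t: a multidimensional series.

history, observed = W[:-1], W[-1]
forecast = history[-1]                   # naive stand-in for ONMF forecasting
deviation = np.abs(observed - forecast)  # absolute-deviation criterion

if (deviation > 3 * history.std(axis=0)).any():  # assumed 3-sigma tolerance
    print("observed topic trend deviates from forecast: possibly another user")
else:
    print("topic trend consistent with forecast")
```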
173-192
Abstract

In 2005, I wrote an article in which I discussed the most important features of the ODMG 3.0 standard (the ODMG object model) and SQL:2003 (the SQL data model) and convincingly (as it seemed to me) argued that the similarity between the object model and the object extensions of SQL is purely superficial, and that syntactic constructions close in spirit hide deep differences at the model level. Examples of such differences include the von Neumann-style dereferencing of ObjectIDs in the ODMG model vs. the join-style dereferencing of reference values in the SQL model; the separate and independent storage of objects of one and the same object type in the ODMG model vs. the storage of all rows of a typed table (the SQL analogue of an object type) within that table; the storage of ObjectIDs within extents in the ODMG model vs. the storage of the objects themselves within the extent analogue in the SQL model; and so on. In the many years since, I have come to understand much that I had then understood wrongly or insufficiently, and have gradually arrived at the following conclusions:

  1. the differences that seemed deep to me are not, and indeed are not differences at the model level;
  2. the object extensions of SQL provide capabilities no smaller (and in fact larger) than those of the ODMG object model;
  3. with reasonable (from the standpoint of the database community) use of DBMSes based on the ODMG data model, one will create databases, and tools to manipulate them, similar to those prescribed by the SQL data model.


This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)