Preview

Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Advanced search
Vol 28, No 6 (2016)
View or download the full issue PDF (Russian)
9-10
Abstract
In this final issue of the 28th volume, we publish articles on technologies for analysis, modeling and transformation of programs, distributed systems technology, analysis of texts in natural languages, and social network analysis.
11-26
Abstract
Tracking and verification of data flows includes set of techniques that can be applied to make applications more secure, to perform software analysis for debugging or reverse engineering, and so on. Taint analysis is one of the techniques used to control data flows. This paper presents an approach for system-wide lightweight platform-aware taint analysis. We implemented proof-of-concept tool based on our approach for i386 platform upon the multi-platform simulator QEMU. Our approach uses instrumentation of QEMU intermediate representation of binary code and processes up to 8 taints simultaneously. Most of the taint analysis code is target-independent and may be used with other target platforms. For some platforms (i386 with Windows XP and Windows 7) we present Virtual Machine Introspection plugins for automatic file tainting. We demonstrate how our taint analysis system may be used to detect exploitation of software vulnerabilities.
27-36
Abstract
Return-oriented programming (ROP) is a dangerous exploitation technique which can be used to bypass modern defense mechanisms. ROP reuses code chunks ending with control transfer instruction from a program binary to form a chain corresponding some payload. These code chunks are called gadgets. Though, a certain set of gadgets should be available to exploit a vulnerability. Determining gadgets that can be used to form a ROP chain can be done by gadgets search and classification. This paper introduces a method for ROP gadgets classification that allows one to evaluate whether or not ROP technique can be used to exploit a program vulnerability. Classification is based on side-effects analysis of gadget execution with concrete inputs. Gadget instructions are translated into IR which is interpreted to track registers and memory usage. Initial registers and memory values are randomly generated. According to initial and final values of registers and memory gadget semantics can be explored. Classification performs several executions to determine gadget semantics. Proposed method is applied to program binaries and its capabilities were demonstrated on 32-bit and 64-bit binaries from Ubuntu 14.04. Using classification results program exploitability was confirmed for several examples. Furthermore, a possible exploitation of stack buffer overflow vulnerability in presence of write-what-where condition was shown on a model example demonstrating a bypass of canary, DEP and ASLR.
37-48
Abstract
In recent years, as performance and capacity of main and external memory grow, performance of database management systems (DBMSes) on certain kinds of queries is more determined by raw CPU speed. Currently, PostgreSQL uses the interpreter to execute SQL queries. This yields an overhead caused by indirect calls to handler functions and runtime checks, which could be avoided if the query were compiled into native code "on-the-fly", i.e. just-in-time (JIT) compiled: at run time the specific table structure is known as well as data types and built-in functions used in the query as well as the query itself. This is especially important for complex queries, performance of which is CPU-bound. We have developed a PostgreSQL extension that implements SQL query JIT compilation using LLVM compiler infrastructure. In this paper we show how to implement LLVM-analogues of the main operators of the PostgreSQL, how to replace Volcano iterator model abstraction (open(), next(), close()) by the abstraction that is more suitable to generate code for a particular query. Currently, with LLVM JIT we achieve up to 4.3x speedup on TPC-H Q1 query as compared to original PostgreSQL interpreter.
49-64
Abstract
Complex software systems always exist for a long time, sometimes changing, and this leads to a variety of versions of such a system. In additional complex software systems usually have different (sometimes a lot) configurations due to different hardware and software environments, where they are intended to operate, or due to different user types with specific requirements. So, a complex software system can be regarded more correctly as a software system family or a software product line. Taking software families in consideration helps to increase reuse of their components and other software development artifacts. In difference with earlier works on software reuse, mostly focused on code or design reuse, software system family development tries to expand reuse on all kinds of development artifacts and activities, including documentation, verification, operation support, deployment, etc. One of the software system family development activities is modeling of family variability. This paper considers modern methods and approaches to such modeling, especially focusing on modeling of operating systems families variability. The research, which results are presented in this paper, is supported by RFBR.
65-86
Abstract
The paper presents a configurable method of static data race detection that is trying to keep a balance between resource consumption and a number of false alarms. The method is based on well known Lockset approach. It uses simplified memory model to be fast enough. At the same time it includes advanced techniques aimed to achieve acceptable false alarms rate. The key techniques are thread analysis and predicate abstraction based refinement. The method was implemented in CPALockator tool built on top of CPAchecker framework. The tool was evaluated on Linux kernel modules and it has detected several actual data races, which were approved by developers and were fixed in upstream Linux kernel.
87-102
Abstract
ARM is a family of microprocessor instruction set architectures developed in a company with the same name. The newest architecture of this family, ARMv8, contains a large number of instructions of various types and is notable for its complex organization of virtual memory, which includes hardware support for multilevel address translation and virtualization. All of this makes functional verification of microprocessors with this architecture an extremely difficult technical task. An integral part of microprocessor verification is generation of test programs, i.e. programs in the assembly language, which cause various situations (exceptions, pipeline stalls, branch mispredictions, data evictions in caches, etc.). The article describes the requirements for industrial test program generators and presents a generator for microprocessors with the ARMv8 architecture, which has been developed with the help of MicroTESK (Microprocessor TEsting and Specification Kit). The generator supports an instruction subset typical for mobile applications (about 400 instructions) and consists of two main parts: (1) an architecture-independent core and (2) formal specifications of ARMv8 or, more precisely, a model automatically constructed on the basis of the formal specifications. With such a structure, the process of test program generator development consists mainly in creation of formal specifications, which saves efforts by reusing architecture-independent components. An architecture is described using the nML and mmuSL languages. The first one allows describing the microprocessor registers and syntax and semantics of the instructions. The second one is used to specify the memory subsystem organization (address spaces, various buffers and tables, address translation algorithms, etc.) The article describes characteristics of the developed generator and gives a comparison with the existing analogs.
103-110
Abstract
The article proposes different methods of presenting network traffic analysis results, the need for which arises primarily in the area of network security. One of the most important tasks is to identify malicious traffic. For this purpose both the complete graph of network interactions and time-based packet diagram are presented. These components are used during investigation of information security violation incidents. The timing diagram is also used in analysis of tunneling protocols because it allows the analyst to determine which protocol headers are necessary to visualize. For tasks associated with reverse engineering and debugging of network protocols, it is proposed to use a journal which records protocol header parsing errors. Presented graphic components either have no analogues among the opensource tools or improve on existing opensource solutions.
111-120
Abstract
Apache Spark is a framework providing fast computations on Big Data using MapReduce model. With cloud environments Big Data processing becomes more flexible since they allow to create virtual clusters on-demand. One of the most powerful open-source cloud environments is Openstack. The main goal of this project is to provide an ability to create virtual clusters with Apache Spark and other Big Data tools in Openstack. There exist three approaches to do it. The first one is to use Openstack REST APIs to create instances and then deploy the environment. This approach is used by Apache Spark core team to create clusters in propriatary Amazon EC2 cloud. Almost the same method has been implemented for Openstack environments. Although since Openstack API changes frequently this solution is deprecated since Kilo release. The second approach is to integrate virtual clusters creation as a built-in service for Openstack. ISP RAS has provided several patches implementing universal Spark Job engine for Openstack Sahara and Openstack Swift integration with Apache Spark as a drop-in replacement for Apache Hadoop. This approach allows to use Spark clusters as a service in PaaS service model. Since Openstack releases are less frequent than Apache Spark this approach may be not convenient for developers using the latest releases. The third solution implemented uses Ansible for orchestration purposes. We implement the solution in loosely coupled way and provide an ability to add any auxiliary tool or even to use another cloud environment. Also, it provides an ability to choose any Apache Spark and Apache Hadoop versions to deploy in virtual clusters. All the listed approaches are available under Apache 2.0 license.
121-140
Abstract
In this paper, we present a Big Data analysis paradigm related to smart cities using cloud computing infrastructures. The proposed architecture follows the MapReduce parallel model implemented using the Hadoop framework. We analyse two case studies: a quality-of-service assessment of public transportation system using historical bus location data, and a passenger-mobility estimation using ticket sales data from smartcards. Both case studies use real data from the transportation system of Montevideo, Uruguay. The experimental evaluation demonstrates that the proposed model allows processing large volumes of data efficiently.
141-152
Abstract
The life of the modern world essentially depends on the work of the large artificial homogeneous networks, such as wired and wireless communication systems, networks of roads and pipelines. The support of their effective continuous functioning requires automatic screening and permanent optimization with processing of the huge amount of data by high-performance distributed systems. We propose new meta-algorithm of large homogeneous network analysis, its decomposition into alternative sets of loosely connected subnets, and parallel optimization of the most independent elements. This algorithm is based on a network-specific correlation function, Simulated Annealing technique, and is adapted to work in the computer cluster. On the example of large wireless network, we show that proposed algorithm essentially increases speed of parallel optimization. The elaborated general approach can be used for analysis and optimization of the wide range of networks, including such specific types as artificial neural networks or organized in networks physiological systems of living organisms.
153-170
Abstract
The paper presents new versions of modularity measure for directed weighted graphs with overlapping communities. We consider several approaches to computing modularity and try to extend them. Taking into account computational complexity, we suggest two parallelized extensions which are scalable to large graphs (more than 104 nodes).
171-184
Abstract
The work is devoted to methods of social network users’ age detection. Social networks allow users to fill their profiles that may contain an age. Profiles are not fully filled, so the task of unknown attributes detection arises. Explicit and predicted values are used in recommender and marketing systems. Moreover, the predicted values can be used for determining online communities’ demographic profiles and for inferring the target audience of marketing campaigns in the Internet. In this paper a method for predicting unfilled age values is proposed. The method uses the following information available from the social network: explicit users’ ages and social graph. The graph contains nodes representing users and communities. Community is the special page in the Internet that unites users on interests. Friendship relations between users and subscriptions of users on communities represented as edges of the social graph. The method is based on the label propagation in the friendship and subscription graphs. Ages of the users are representd by labels that are propagated in the graph. The scheme of the algorithm is following: initialize user labels according to explicit profiles; build vector model that contains distributions of the neighbours’ ages grouped by user age; compute weights of users and communities, propagate labels to communities; build vector model considering calculated weights; propagate labels to users that have not filled their age in the profile. The paper describes the algorithm and contains experimantal results showing that friendship relations are more useful for age prediction in the social network than communities.
185-196
Abstract
Many applications require information about the geolocation of users, which is not always available. Among the users of Twitter only about 26% indicate the name of the city in their profiles, about 30% of users of VKontakte leave this field blank. So there is the problem of determining the place of residence of social network users. We investigate approaches to geolocation of social network users using their mutual bidirectional ties - social graph. At first, we present a brief overview of the work in the field of geolocating users of social networks. Then we propose an approach that relies on graph nodes’ embeddings and supervised machine learning techniques. Series of experiments were conducted with proposed and baseline approaches. Experiments show that proposed approach is comparable with others. The results of experiments allow us to conclude that the proposed approach based on vector representation can be effectively used to determine the user's place of residence by itself, or in combination with classifiers based on user data It is worth noting that the proposed approach has no any specifics related to the geolocation. It can also be used to assess any other demographic attributes that influence the formation of relationships in society. Thus, a similar approach was used in Perozzi and Skiena to determine the age of the users.
197-206
Abstract
This paper presents an ontology induction approach that extracts the structured lexical knowledge from synonym dictionaries and establishing the semantic relations within these structures using word embeddings and their projections. The results of the preliminary experiments have also been reported showing certain strengths and weaknesses of the proposed approach.
207-222
Abstract
One of the major issues dealing with time-series classification problem is the choice of similarity measure. This article presents a comparative analysis of the similarity measure for time series based on moving approximations transform (MAP transforms) with other two most useful measures: Algorithm Dynamic Transformation and Euclidean distance for classification task. In addition, algorithm, that improves the precision of the measure for time series, that have similar values, but shifted relative to each other on the axis X, where coordinate on the X axis represents the time unit, is proposed.
223-240
Abstract
There are many sites in the Internet that allow users to share their opinions and write reviews about all kinds of goods and services. These views may be useful not only for other users, but also for companies which want to track their own reputation and to receive timely feedback on their products and services. The most detailed statement of the problem in this area is an aspect-based sentiment analysis, which determines the user attitude not only to the object as a whole, but also to its individual aspects. In this paper we consider the solution of subtask of aspect terms extraction in aspect-based sentiment analysis. A review of research in this area is given. The subtask of aspect terms extraction is considered as a problem of sequence labeling; to solve it we apply the model of conditional random fields (CRF). To create the sequence feature description, we use distributed representations of words derived from neural network models for the Russian language and parts of speech of the analyzed words. The stages of the aspect terms extraction software system are shown. The experiments with the developed software system were carried out on the corpus of labeled reviews of restaurants, created in the International Workshop on Semantic Evaluation (SemEval-2016). We describe the dependence of the quality of aspect terms extraction subtask on various neural network models and the variations of feature descriptions. The best results (F1-measure = 69%) are shown by a version of the system, which takes into account the context and the parts of speech. This paper contains a detailed analysis of errors made by the system, as well as suggestions on possible options for their correction. Finally, future research directions are presented.


Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)