We describe an approach to two-stage compilation of C/C++ programs using the LLVM compiler infrastructure that allows optimizing programs for the user's profile and target machine features, as well as deploying programs to a cloud store while transparently optimizing them and checking them for defects. The notable features of the approach are its applicability to programs written in general-purpose languages and its use of a common compiler infrastructure at all optimization and deployment stages.
In the first stage (on the host machine), the build process is transparently captured and LLVM bitcode files are generated, with proper support for archive/library files. LTO optimizations are also performed at this time, and the program's dependencies and installation structure (e.g. libraries, installed resource or documentation files) are captured as well. In the second stage (on the target machine), final code generation and installation are performed with transparent path translation. Either static code generation or dynamic JIT-based compilation is supported.
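For concreteness, the two stages can be reproduced by hand with standard LLVM tools. The sketch below is a simplified manual equivalent of the transparent capture described above; the file names, target triple, and the use of a Python driver are illustrative assumptions, not part of the approach itself.

```python
# Simplified manual equivalent of the two-stage flow (hypothetical file names).
# Stage 1 (host): compile sources to LLVM bitcode and link them into one module.
# Stage 2 (target): generate native code for the target triple and link it.
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.check_call(cmd)

# --- Stage 1: host machine ---
for src in ["main.c", "util.c"]:                           # captured from the build
    run(["clang", "-O2", "-c", "-emit-llvm", src, "-o", src.replace(".c", ".bc")])
run(["llvm-link", "main.bc", "util.bc", "-o", "app.bc"])   # whole-program bitcode
run(["opt", "-O2", "app.bc", "-o", "app.opt.bc"])          # LTO-style optimization

# --- Stage 2: target machine (e.g. an ARM device) ---
run(["llc", "-O2", "-mtriple=armv7-linux-gnueabihf", "app.opt.bc", "-o", "app.s"])
run(["clang", "--target=armv7-linux-gnueabihf", "app.s", "-o", "app"])  # needs a cross sysroot
```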
As mentioned, a cloud-based deployment strategy can also be used, allowing for additional features such as defect and vulnerability checking, since the cloud store has access to the deployed programs in LLVM bitcode form.
While optimizing the LLVM toolchain for ARM embedded devices, we achieved a memory consumption reduction of up to 10% and a 10-20% decrease in compile time.

Modern compilers work on many platforms and implement a large number of optimizations, which are not always tuned well for every target platform. In this paper we present a tool for automatic tuning of compilation parameters developed at ISP RAS. The tool uses a genetic algorithm to evolve the best compiler optimization parameters. It supports cross-compilation, parallel build and execution on remote targets, and a two-tier compilation mode for profile-guided optimizations. It also provides tools for automated analysis of the tuning results that help identify the most important compilation parameters.
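As a rough illustration of this style of search (not the ISP RAS tool itself), the sketch below evolves a set of boolean GCC flags with a simple genetic algorithm; the flag list, the benchmark source "bench.c", and the runtime-based fitness function are placeholder assumptions.

```python
# Toy genetic search over boolean GCC flags (illustration only, not the ISP RAS tool).
# The flag list, benchmark source "bench.c", and runtime-based fitness are placeholders.
import random
import subprocess
import time

FLAGS = ["-ftree-vectorize", "-fprefetch-loop-arrays", "-fif-conversion", "-funroll-loops"]

def fitness(genome):
    """Build the benchmark with the selected flags and return its runtime (lower is better)."""
    flags = [f for f, enabled in zip(FLAGS, genome) if enabled]
    subprocess.check_call(["gcc", "-O2", *flags, "bench.c", "-o", "bench"])
    start = time.perf_counter()
    subprocess.check_call(["./bench"])
    return time.perf_counter() - start

def evolve(pop_size=8, generations=10, mutation=0.1):
    population = [[random.random() < 0.5 for _ in FLAGS] for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(population, key=fitness)[: pop_size // 2]   # keep the fastest half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(FLAGS))
            child = a[:cut] + b[cut:]                                 # one-point crossover
            children.append([not g if random.random() < mutation else g for g in child])
        population = parents + children
    return min(population, key=fitness)

best = evolve()
print("best flag set:", [f for f, enabled in zip(FLAGS, best) if enabled])
```

A real tuner would additionally cache fitness values, distribute benchmark runs across remote targets, and search numeric parameters (GCC --param values) in addition to boolean flags.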
We used our tool to tune GCC parameters for the ARM platform and demonstrated how it can be used to improve the performance of several popular applications. Using the tool's automated analysis of the tuning results, we found several deficiencies in GCC compiler optimizations. The affected optimizations include if-conversion (conversion of conditional branches into predicated form), prefetching (insertion of data prefetching instructions in loops), and autovectorization. We identified the reasons for suboptimal code generation and improved the affected optimizations accordingly. We also describe a few other improvements that we made to GCC based on the results of manual assembly code analysis. The described improvements to GCC yield significant speedups for the chosen test applications on the ARM platform.

QEMU is a generic machine emulator based on a dynamic binary translation approach. This paper evaluates possible improvements to the local register allocation algorithm in QEMU. The local register allocation problem is NP-hard, but there are several approximate heuristic algorithms. Among them, the Furthest-First and Clean-First heuristics are known to produce a near-optimal result in linear time.
The local register allocation algorithm in QEMU scans instructions in order. For each instruction operand that is not yet held in a register, a free register is allocated. If there are no free registers, an arbitrary register is spilled. At basic block ends and helper function calls, all global variables are also spilled.
This paper proposes two improvements to the register allocation algorithm. The first is to improve spill choices by using the heuristic from the Furthest-First algorithm. The second is to aggressively spill global variables that will be spilled anyway because of helper function calls and basic block ends. Both improvements were implemented and tested on practical workloads. Profiling results show that the number of generated spills was reduced by approximately 1%.
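A minimal sketch of the first improvement is given below: when no free register is available, the value whose next use lies furthest ahead in the instruction stream is chosen as the spill victim (the Furthest-First heuristic) instead of an arbitrary one. The data structures are invented for illustration and do not correspond to QEMU's TCG internals.

```python
# Furthest-First spill choice (illustration; the data structures are invented
# and do not correspond to QEMU's TCG internals).
def next_use_distance(value, future_ops):
    """Index of the next instruction that reads `value`, or infinity if it is never read again."""
    for i, op in enumerate(future_ops):
        if value in op.inputs:
            return i
    return float("inf")

def choose_spill_victim(regs, future_ops):
    """regs maps physical registers to the values they currently hold."""
    return max(regs, key=lambda r: next_use_distance(regs[r], future_ops))

def allocate(value, regs, free_regs, future_ops, emit_spill):
    """Place `value` in a register, spilling the furthest-used value if none is free."""
    if free_regs:
        reg = free_regs.pop()
    else:
        reg = choose_spill_victim(regs, future_ops)
        emit_spill(reg, regs[reg])          # store the evicted value back to memory
        del regs[reg]
    regs[reg] = value
    return reg
```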
With these improvements, over 96% of spills are generated due to function calls and basic block ends. This implies that further improvements to this algorithm are not possible without relaxing its existing constraints.

In this paper we evaluate two approaches to full-system deterministic replay. Both of them allow replaying the guest OS and applications without modification. Deterministic replay is intended for debugging and dynamic analysis of the system core, multithreaded and non-deterministic applications, cross-platform applications, and device drivers.
The presented approaches differ in the boundary dividing the non-deterministic "outer world" from the deterministic part of the simulator. All inputs from the "outer world" are written into a log to allow later replay of the deterministic part. Common to both approaches is that the deterministic part includes the CPU, RAM, and video adapter. This allows debugging, tracing, and analysis of the replayed code.
In the first approach, the "outer world" is represented by inputs: keyboard, USB devices, mouse, network adapter, and microphone. All virtual peripheral devices must be deterministic in this approach. In the second approach, all emulator parts except the CPU, RAM, and video adapter are considered external. This means that all interactions between the CPU and virtual peripheral devices (including IO, MMIO, and DMA transactions) are written into the replay log.
The first approach has the following advantages: the whole virtual machine can be replayed for device driver debugging; it requires relatively few changes within the simulator; and the replay log uses little storage. Its drawback is the need to support all external interfaces of the virtual machine. The second approach is the opposite: it requires changes in only a few interfaces, but every virtual device with a DMA interface must be modified. It also generates larger replay logs.
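The record/replay principle shared by both approaches can be sketched as follows: during recording, every input that crosses the deterministic boundary is appended to a log together with the instruction count at which it arrived; during replay, the log is read instead of the real device. The event format below is an illustrative simplification.

```python
# Simplified record/replay of non-deterministic inputs (the event format is invented).
import json

class ReplayLog:
    def __init__(self, path, record):
        self.record = record
        self.f = open(path, "w" if record else "r")

    def input_event(self, icount, kind, read_real_device):
        """On record: read the real device and log the result together with the
        instruction count. On replay: return the previously logged result instead."""
        if self.record:
            data = read_real_device()
            self.f.write(json.dumps({"icount": icount, "kind": kind, "data": data}) + "\n")
            return data
        event = json.loads(self.f.readline())
        assert event["icount"] == icount and event["kind"] == kind, "replay diverged"
        return event["data"]
```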
We evaluated time and space overheads for the first approach. The slowdown is very low: under 1% in a Windows XP loading scenario and about 25% for network operations. The replay log grows at about 50 KB per second during basic guest OS execution.

This paper investigates the process of binary code analysis. To achieve typical goals (such as extracting algorithms and data formats, exploiting vulnerabilities, revealing backdoors and undocumented features), a security analyst needs to explore control and data flow, reconstruct functions and variables, and identify input and output data. Traditionally, disassemblers and other static data flow analysis tools have been used for these purposes. However, since developers take steps to protect their programs from analysis (for example, code being unpacked or decrypted at runtime), static analysis may not yield results.
In such cases we propose using dynamic analysis (analysis of program execution traces) to complement static analysis. The problems that arise in the analysis of binary programs are discussed, and corresponding ways to automate their solution are suggested. The core of the proposed method consists of whole-system tracing and successive representation lifting: OS-aware events, process/thread identification, and fully automated control and data flow reconstruction. The only manual step is searching for anchor instructions in the trace, e.g. I/O operations, which serve as input criteria for another automated step: precise algorithm extraction by trace slicing. The final step of the method constructs static test-case code suitable for further analysis in tools such as IDA Pro.
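The central automated step, extraction of an algorithm by slicing backward from anchor instructions, can be sketched as follows; the trace record format is hypothetical, and a real implementation would also track memory addresses and OS-level context.

```python
# Backward dynamic slicing over an instruction trace (the trace format is hypothetical).
# Each trace entry records the locations it defines and the locations it uses.
def backward_slice(trace, anchors):
    """Return indices of trace entries that the anchor instructions depend on."""
    anchors = set(anchors)
    in_slice = set()
    live = set()                                   # locations whose definitions we still need
    for i in range(len(trace) - 1, -1, -1):
        entry = trace[i]
        if i in anchors or live & set(entry["defs"]):
            in_slice.add(i)
            live -= set(entry["defs"])             # these definitions are now explained
            live |= set(entry["uses"])             # ...but their inputs must be explained too
    return sorted(in_slice)

# Example: slice a tiny trace backward from an anchor "write" at index 3.
trace = [
    {"defs": ["eax"], "uses": ["input"]},          # 0: eax = read(input)
    {"defs": ["ebx"], "uses": []},                 # 1: ebx = 5
    {"defs": ["eax"], "uses": ["eax", "ebx"]},     # 2: eax = eax * ebx
    {"defs": [], "uses": ["eax"]},                 # 3: write(eax)   <- anchor
]
print(backward_slice(trace, [3]))                  # -> [0, 1, 2, 3]
```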
We implemented the proposed approach in an environment for dynamic analysis of binary code and evaluated it against a model example and two real-world examples: a program license manager and a malware program. Our results show that the approach successfully explores algorithms and extracts them from whole-system traces. The required effort and time are significantly reduced compared with a traditional disassembler and interactive debugger.

Control flow obfuscation is one of the widespread methods used to protect application binary code from analysis. Obfuscating transformations dramatically increase the complexity of separating and recognizing the algorithm and data structures. This paper provides a short survey of obfuscating transformations and tools, focusing on control flow obfuscations such as opaque predicates, dispatchers, and code virtualization. Our research is concentrated on de-obfuscation based on dynamic binary code analysis methods. An approach to locating potentially opaque predicates and accompanying obfuscations (such as dead and redundant code insertion) in a program trace is given. The main idea underlying the described method of opaque predicate detection is extraction of the predicate code via dynamic program slicing. Next, we apply heuristics to automatically reduce the predicate list, and our tool then provides the analyst with information about potentially opaque predicates found in the application binary code.
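A much cruder heuristic than the slicing-based method described above, shown here only to illustrate the idea of trace-based predicate screening, is to flag conditional branches that execute many times but always take the same direction:

```python
# Crude opaque-predicate candidate filter over a branch trace (illustration only).
# branch_trace: iterable of (branch_address, taken) pairs collected during execution.
from collections import defaultdict

def opaque_candidates(branch_trace, min_hits=10):
    outcomes = defaultdict(set)
    hits = defaultdict(int)
    for addr, taken in branch_trace:
        outcomes[addr].add(taken)
        hits[addr] += 1
    # A branch executed many times with a single outcome may be an opaque predicate
    # (or simply an invariant check); real tools confirm candidates by slicing the
    # predicate computation and checking that it does not depend on program input.
    return [addr for addr in outcomes if hits[addr] >= min_hits and len(outcomes[addr]) == 1]

print(opaque_candidates([(0x401000, True)] * 50 + [(0x401080, True), (0x401080, False)]))
```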
Another control flow obfuscation handled by the described tool is code virtualization. The algorithm for control flow graph recovery of virtualized code is based on virtual program counter analysis. The corresponding tool was implemented and successfully tested on several applications protected by various virtualization obfuscators (VMProtect, CodeVirtualizer, Safengine, etc.). The experimental results described in this paper demonstrate the practical applicability of the implemented method to de-obfuscating virtualized applications.

We describe the usage of programming language constraints to achieve program security and portability, which are especially important for large projects. The existing collections of such constraints (like MISRA C++, JSF, or HICPP) contain only descriptive natural-language rule definitions, which can be ambiguous or incomplete.
We propose a formal model for constraint definition that allows specifying stylistic, syntactic, and contextual rules. We also give a constraint classification that splits all constraints into 4 disjoint groups determining constraint complexity and optimal checking order. To solve the problem of automatic rule checking, we have developed an analyzer based on the Clang C/C++ compiler maintained as part of the LLVM project. We also describe some specific details of the analyzer implementation: an overview of the basic components; the set of specially developed and well-known static analysis algorithms used for discovering constraint violations; the rule ordering approach; and the use of permanent external storage (an SQLite database) for intermodule analysis and for working with error messages (sorting, history). We also provide analyzer integration with popular build systems so that the source files used in the build process are analyzed automatically.
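As a small illustration of this kind of rule checking (not the analyzer described here, which is built on Clang's C++ libraries), the sketch below uses the libclang Python bindings to report a typical HICPP/MISRA-style rule, "do not use goto":

```python
# Minimal "no goto" rule checker built on the libclang Python bindings
# (an illustration only; the analyzer described above is built on Clang's C++ API).
import sys
import clang.cindex

def check_no_goto(filename, compile_args=()):
    index = clang.cindex.Index.create()
    tu = index.parse(filename, args=list(compile_args))
    violations = []
    for node in tu.cursor.walk_preorder():
        if node.kind == clang.cindex.CursorKind.GOTO_STMT:
            violations.append((node.location.file.name, node.location.line))
    return violations

if __name__ == "__main__":
    for path, line in check_no_goto(sys.argv[1], sys.argv[2:]):
        print(f"{path}:{line}: rule violation: 'goto' statement is not allowed")
```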
The implemented system is able to check approximately 50 different C and C++ constraints and requires only 20% more time than a regular optimized build.

In this paper we explore the prospects of applying virtualization technologies to x64-based high performance systems. The main reasons for performance overhead when running parallel programs inside multiple virtual machines are discussed. We consider the KVM/QEMU and Palacios systems and use the HPC Challenge and NASA Advanced Supercomputing Parallel Benchmarks packages as a test suite. All tests are performed on a high performance cluster with a high-speed InfiniBand interconnect.
The gathered results show the feasibility of applying virtualization to a large class of high performance applications. We use huge memory pages to decrease the number of TLB misses and a NUMA emulation mechanism to expose the memory topology of the real system to the virtual machine. These optimizations decrease virtualization overhead from 10-60% to 1-5% for most tests from the HPC Challenge and NASA Advanced Supercomputing Parallel Benchmarks suites. The main bottlenecks of virtualization systems are decreased memory system performance (critical only for a narrow class of applications), device virtualization overhead, and an increased noise level caused by the host operating system and hypervisor. Noise may affect the performance and scalability of fine-grained applications (those with frequent small communications). As the number of nodes in the system grows, the influence of noise increases substantially.

The paper describes research on formal methods of conformance testing of a target system against requirements given in specifications. Such testing is based on an interaction semantics defining test stimuli and observations of actions and refusals (absence of actions). Unobservable actions and refusals are also possible. Destruction is introduced as a forbidden action that should be avoided during interaction. A notion of safe testing is also introduced, under which no unobservable refusals or destruction occur and no test stimuli are applied in divergence. On this basis, the implementation safety hypothesis and the safe conformance relation are defined, as well as the generation of a complete test suite from the specification.
The most common specification model is the LTS (Labelled Transition System). However, for the described interaction semantics, only traces (sequences of observations) are important, not the LTS states. Therefore, the most natural model is the trace model, defined as the set of LTS traces.
The goal of this paper is to define a subset of specification traces sufficient for generation of the complete test suite. We call such a subset the final trace model of the specification. On the other hand, the LTS model is convenient as a finite representation of regular trace sets. To represent the final trace model, the paper proposes a variation of LTS called final RTS (Refusal Transition System), in which transitions on observable refusals are defined explicitly. Such a model is very convenient for test generation: 1) it is deterministic; 2) a trace of observations is safe iff it ends in a non-terminal state with no destruction; 3) a test stimulus is safe after a trace iff it is safe in the final state of that trace.
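The determinism of the final RTS is what makes these properties easy to exploit during test generation. As a rough analogue (ignoring refusals, divergence, and destruction, which the final RTS treats explicitly), the sketch below shows the usual subset construction that turns a non-deterministic LTS with internal τ-actions into a deterministic transition structure over observations:

```python
# Subset construction for an LTS with internal (tau) actions -- a simplified analogue of
# building a deterministic model; refusals, divergence, and destruction are not modelled.
TAU = "tau"

def tau_closure(states, transitions):
    """All states reachable from `states` via tau transitions only."""
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for (src, label, dst) in transitions:
            if src == s and label == TAU and dst not in closure:
                closure.add(dst)
                stack.append(dst)
    return frozenset(closure)

def determinize(initial, transitions):
    """Deterministic transition map: (macro-state, observable action) -> macro-state."""
    start = tau_closure({initial}, transitions)
    visited, worklist, delta = {start}, [start], {}
    while worklist:
        current = worklist.pop()
        labels = {l for (s, l, _) in transitions if s in current and l != TAU}
        for label in labels:
            targets = {d for (s, l, d) in transitions if s in current and l == label}
            nxt = tau_closure(targets, transitions)
            delta[(current, label)] = nxt
            if nxt not in visited:
                visited.add(nxt)
                worklist.append(nxt)
    return start, delta

# Example: an internal tau step followed by an observable "req" action.
trans = [(0, TAU, 1), (1, "req", 2), (0, "req", 3)]
start, delta = determinize(0, trans)
print(delta[(start, "req")])   # frozenset({2, 3})
```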
The paper proposes algorithms for transforming an LTS model into a final RTS model. Sufficient conditions for constructing the final RTS in finite time are also defined.

Recently, a number of dynamic analysis tools have been developed that perform tainted data flow tracing and use algorithms for solving the CNF SAT problem for input data generation and path alteration. The SAT problem is known to be NP-complete, and many optimizations are required to partially mitigate the exponential growth problem. This problem arose in the Avalanche dynamic analysis project for automatic defect detection. In this paper we propose a modification of the Davis-Putnam-Logemann-Loveland (DPLL) algorithm that uses intermediate result caching.
For path condition solving, Avalanche uses STP, which in turn uses the MiniSAT solver for Boolean formula satisfiability checking. MiniSAT's main mechanism is based on a variation of the DPLL algorithm. We describe special properties of the formula sets generated by dynamic analysis tools: formulas in a generated set frequently have common parts. We introduce a DPLL optimization based on this peculiarity. Formula solving starts from previously cached solver results and falls back to backtracking in case of failure. The described optimization is illustrated with simple examples.
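A toy version of the idea is sketched below: a plain DPLL procedure whose verdicts are memoized on the residual clause set, so that a later query sharing a sub-formula with an earlier one can reuse the cached result. This is only a schematic illustration; the actual modification works inside MiniSAT and caches partial solver state rather than whole clause sets.

```python
# Toy DPLL with memoization of residual clause sets (illustration of result caching;
# the actual optimization works inside MiniSAT on partial solver state).
# A formula is a frozenset of clauses; a clause is a frozenset of non-zero int literals.
cache = {}

def simplify(clauses, literal):
    """Assign `literal` true: drop satisfied clauses, remove the negated literal."""
    out = set()
    for c in clauses:
        if literal in c:
            continue
        reduced = c - {-literal}
        if not reduced:
            return None                      # empty clause -> conflict
        out.add(frozenset(reduced))
    return frozenset(out)

def dpll(clauses):
    if clauses in cache:                     # reuse a verdict computed for a shared sub-formula
        return cache[clauses]
    if not clauses:
        result = True
    else:
        unit = next((next(iter(c)) for c in clauses if len(c) == 1), None)
        if unit is not None:                 # unit propagation
            reduced = simplify(clauses, unit)
            result = reduced is not None and dpll(reduced)
        else:
            lit = next(iter(next(iter(clauses))))   # branch on some literal
            for choice in (lit, -lit):
                reduced = simplify(clauses, choice)
                if reduced is not None and dpll(reduced):
                    result = True
                    break
            else:
                result = False
    cache[clauses] = result
    return result

f1 = frozenset({frozenset({1, 2}), frozenset({-1, 2}), frozenset({-2, 3})})
print(dpll(f1))                              # True; residual sub-formulas are now cached
```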
The described mechanism was evaluated on randomly generated sets of Boolean formulas using an implemented SAT solver prototype. The modified algorithm shows a noteworthy increase in efficiency in comparison with the standard DPLL approach.

This survey considers methods and techniques used in modern static verification tools for C programs. It describes the two main approaches, Counterexample-Guided Abstraction Refinement (CEGAR) and Bounded Model Checking (BMC), and techniques used to implement them efficiently, such as Predicate Abstraction, Abstract Reachability Tree, Lazy Abstraction, Configurable Program Analysis, Explicit Analysis, Interpolation, and Shape Analysis. The paper also discusses current capabilities of the tools, such as supported C programming language constructs, scalability, properties being checked, and trustworthiness of analysis results.
We describe such static verification tools as BLAST, CPAchecker, HSF(C), SATABS, SLAM, WOLVERINE, YOGI, CBMC, ESBMC, LLBMC, FSHELL, and PREDATOR. This description shows the techniques implemented in these tools and their current capabilities. The paper presents results of the 1st International Competition on Software Verification in the DeviceDrivers64 category, which contains verification tasks based on device drivers from Linux kernel 3.0.
Specifics of device driver verification are discussed, and existing driver verification systems are described, including Microsoft SDV for the Windows operating system, and DDVerify, Avinux, and Linux Driver Verification for Linux.
The paper concludes that BMC-based tools work well for programs of limited size and control flow complexity. Regarding verification of device drivers, this means these tools are able to quickly find violations of the checked properties if the paths to these violations are quite short, but they mostly fail to prove correctness and to find complicated bugs. CEGAR-based tools demonstrate better scalability, but they have problems handling address arithmetic and complex memory structures. For future improvements in static verification of C programs and Linux device drivers, we propose a composition of various techniques and modularization of analysis.

The fast evolution of the Linux kernel and its drivers, developed by a large community of programmers distributed all over the world, has led to a situation where there is no common base of rules that completely describes correct interaction between drivers and the kernel. This is an obstacle both for programmers who do not have expert knowledge of all peculiarities of this interaction, and for the development and application of tools that could find the corresponding typical faults automatically. This paper presents a method to detect and classify typical faults and the corresponding rules. The method is based on analysis of changes made to Linux operating system drivers.
The paper gives results of applying the method to stable versions of the Linux kernel from 2.6.35 to 3.0, covering the period from October 26, 2010 to October 26, 2011. We analyzed 1503 unique commits to drivers in total, marked 396 of them (about 27%) as fixes of typical faults, and provided a classification and a distribution by classes for these typical faults. We distinguished 3 classes of typical faults: generic (faults all C programs are subject to), specific (faults related to misuse of the Linux kernel API), and synchronization (faults related to parallel execution). We found that specific faults constitute about 50% of all typical faults. Each typical fault was then ascribed to one of 21 first-level subclasses: 12, 7, and 2 for the specific, generic, and synchronization classes, respectively. We found that more than 50% of typical faults fall into 5 first-level subclasses, and more than 80% into 14 first-level subclasses. The most frequent faults were race conditions during parallel execution, which account for about 17% of all typical faults. Second and third places, with about 9% each, were occupied by leaks of specific resources and null pointer dereferences.

In this paper, we consider the scheduling problem for strictly periodic real-time systems. Strict periodicity means that the release points of all instances of any task must form an arithmetic progression. Classical scheduling algorithms, such as Earliest Deadline First (EDF) and Rate-Monotonic Scheduling (RMS), are not applicable under the strict periodicity constraint. The main problem is to find release points for all tasks; this search is an NP-hard problem.
First, we study some necessary schedulability conditions for both the classical and the strictly periodic schedulability problems. Next, we suggest a scheduling algorithm that consists of two main parts: a heuristic algorithm for release point search, and a scheduling algorithm for given release points. The release point search is based on ideas from number theory that allow iterating not over all possible release points but only over points that yield a correct schedule with high probability. The scheduling algorithm for given release points is based on the ideas of the EDF algorithm.
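The second part, scheduling for fixed release points, can be illustrated by the following sketch, which simulates EDF-style dispatching over one hyperperiod on a single processor and reports whether every job meets its deadline. The task parameters are hypothetical, deadlines are assumed implicit (equal to the next release), and preemption overhead is ignored.

```python
# EDF-style feasibility check for fixed release offsets on one processor (illustration only).
# Each task is (offset, period, wcet); each job's deadline is assumed to be its next release.
from functools import reduce
from math import gcd

def edf_feasible(tasks):
    """Return True iff no job misses its deadline within one hyperperiod
    under a unit-time-slot EDF simulation."""
    hyperperiod = reduce(lambda a, b: a * b // gcd(a, b), (p for _, p, _ in tasks))
    jobs = [[o + k * p, o + (k + 1) * p, c]            # [release, deadline, remaining work]
            for o, p, c in tasks for k in range(hyperperiod // p)]
    for t in range(hyperperiod):
        ready = [j for j in jobs if j[0] <= t and j[2] > 0]
        if ready:
            min(ready, key=lambda j: j[1])[2] -= 1     # run the earliest-deadline job
        if any(j[1] <= t + 1 and j[2] > 0 for j in jobs):
            return False                               # some job's deadline has passed
    return True

# Release offsets 0 and 1 make these two strictly periodic tasks schedulable.
print(edf_feasible([(0, 4, 2), (1, 4, 2)]))            # True
print(edf_feasible([(0, 4, 3), (0, 4, 2)]))            # False: demand exceeds capacity
```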
Finally, we present a developed tool set that provides automated scheduling of given tasks, visualization of the generated schedule, a GUI for editing the schedule, and validation of built schedules. This tool set is part of a workbench for the design of modern avionics systems developed at ISP RAS.

In 1993 Coffman and Shor proposed an on-line strip packing algorithm with an expected unpacked (wasted) area of order O(n^{2/3}) in the standard probabilistic model, where n is the number of rectangles. In the standard probabilistic model, the width and height of each rectangle are independent random variables uniformly distributed in [0,1]. It is well known that an optimal packing has an expected waste of order O(n^{1/2}), and there exists an off-line polynomial algorithm that achieves this bound. For more than 10 years the question of whether a similar upper bound can be obtained in the class of on-line strip packing algorithms remained open. It was also known that within the class of popular so-called shelf algorithms the bound O(n^{2/3}) cannot be improved. In this paper we propose a new strip packing algorithm that differs from all known on-line strip packing algorithms. We analyze our algorithm experimentally and show that the upper bound on the expected unpacked area is of order O(n^{1/2}), which is optimal up to a constant factor. Our experiments show that this constant factor is close to 1.5-1.6. The experiments were run with up to 2000000000 random rectangles. The only restriction is that we must know the number n of rectangles in advance; in the common terminology of on-line algorithms this means that our algorithm is a closed-end on-line algorithm.
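The experimental setting can be illustrated as follows: draw n rectangles from the standard probabilistic model, pack them with some on-line strategy, and measure the unpacked area below the resulting packing height. The next-fit shelf packer below is used only to make the waste measurement concrete; it is neither the Coffman-Shor algorithm nor the algorithm proposed in the paper, and its waste is far from the O(n^{1/2}) bound.

```python
# Waste measurement in the standard probabilistic model (illustration only;
# the simple next-fit level packer below is NOT the algorithm proposed in the paper).
import random

def next_fit_level(rects):
    """Next-fit level packing in a strip of width 1: returns the total height used."""
    total_height = 0.0
    shelf_width_used = 0.0
    shelf_height = 0.0
    for w, h in rects:
        if shelf_width_used + w > 1.0:               # rectangle does not fit: open a new shelf
            total_height += shelf_height
            shelf_width_used, shelf_height = 0.0, 0.0
        shelf_width_used += w
        shelf_height = max(shelf_height, h)
    return total_height + shelf_height

def expected_waste(n, trials=5):
    waste = 0.0
    for _ in range(trials):
        rects = [(random.random(), random.random()) for _ in range(n)]
        area = sum(w * h for w, h in rects)
        waste += next_fit_level(rects) * 1.0 - area  # strip width is 1
    return waste / trials

for n in (1000, 10000, 100000):
    print(n, round(expected_waste(n), 1))
```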
The paper thus proposes a new on-line strip packing algorithm that provides a quality of packing significantly higher than the well-known Coffman-Shor algorithm.

A new method for the investigation of ideals in polynomial rings was proposed by B. Buchberger in 1965. He proposed to use a special basis in such rings, named a "Gröbner basis" in honor of his teacher. The proposed approach made it possible to prove the algorithmic reducibility of finding the solutions of a system of algebraic equations in many variables to the problem of solving an algebraic equation in one variable. However, practical use of the method requires enormous computational cost. The first estimate of the computational complexity of this kind was obtained by G. Hermann in 1925, when the concept of a Gröbner basis was not yet known. The obtained upper bound, for an arbitrary basis, on the degrees of the polynomials in the decomposition of an ideal element is doubly exponential in the number of variables and the maximum degree of the input polynomials. As T. W. Dube showed later, this estimate cannot be improved.
In 1996 K. Kühnle and E. W. Mayr proved that for Boolean ideals the upper bound on the required memory is exponential in the input size. For zero-dimensional ideals, A. Hashemi and D. Lazard proved the exponential complexity of finding a Gröbner basis. Here we prove a lower bound on the computation of such bases by giving an example of an ideal whose Gröbner basis has size exponential in the input.
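For context (a schematic restatement of standard facts, not a result of the paper), the degree behaviour of Gröbner bases is doubly exponential in the number of variables in the general case, while the special cases mentioned above admit singly exponential bounds:

```latex
% Schematic shape of the known bounds (d = maximum degree of the input polynomials,
% n = number of variables); constant factors are omitted.
\[
  \text{general ideals:}\qquad \deg(G) \;\le\; d^{\,2^{O(n)}}
  \quad\text{(doubly exponential in } n\text{)},
\]
\[
  \text{Boolean and zero-dimensional ideals:}\qquad 2^{O(\text{input size})}
  \quad\text{(singly exponential)}.
\]
```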