A Method to Evaluate Program Similarity Using Machine Learning Methods
https://doi.org/10.15514/ISPRAS-2022-34(5)-4
Abstract
The problem of constructing an algorithm for comparing two executable files is considered. The algorithm is based on the construction of similarity features vector for a given pair of programs. This vector is then used to decide on the similarity or dissimilarity of programs using machine learning methods. Similarity features are built using algorithms of two types: universal and specialized. Universal algorithms do not take into account the format of the input data (values of fuzzy hash functions, values of compression ratios). Specialized algorithms work with executable files and analyze machine code (using disassemblers). A total of 15 features were built: 9 features of the first type and 6 of the second. Based on the constructed training set of similar and dissimilar program pairs, 7 different binary classifiers were trained and tested. To build the training set, coreutils programs were used. The results of the experiments showed high accuracy of models based on random forest and k nearest neighbors. It was also found that the combined use of features of both types can improve the accuracy of classification.
About the Authors
Petr Dmitrievich BORISOVRussian Federation
Head of the Laboratory
Yuri Vladimirovich KOSOLAPOV
Russian Federation
Candidate of Technical Sciences, Associate Professor, Department of Algebra and Discrete Mathematics, Institute of Mathematics, Mechanics and Computer Science named after I.I. Vorovich
References
1. Нурмухаметов А.Р. Применение диверсифицирующих и обфусцирующих преобразований для изменения сигнатуры программного кода. Труды ИСП РАН, том 28, вып. 5, 2016 г., стр. 93-104 / Nurmukhametov A. R. The application of compiler-based obfuscation and diversification for program signature modification. Trudy ISP RAN/Proc. ISP RAS, 2016, vol. 28, issue 5, pp. 93-104 (in Russian). DOI: 10.15514/ISPRAS-2016-28(5)-5.
2. Терешкин А.В. Разработка и исследование метода защиты от удаленных атак на основе диверсификации программного обеспечения. Диссертация на соискание степени кандидата технических наук. Южный федеральный университет, 2007 г., 157 стр. / Tereshkin A.V. Development and research of a method of protection against remote attacks based on software diversification. Dissertation for the degree of candidate of technical sciences. South Federal University, 2007, 157 p. (in Russian).
3. Collberg C., Thomborson C. Watermarking, Tamper-Proofng, and Obfuscation – Tools for Software Protection, IEEE Transactions on Software Engineering, vol. 28, issue 8, 2002, pp. 735-746.
4. Walenstein A., El-Ramly M. et al. Similarity in programs. In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, vol. 6301, 2007, 8 p.
5. Borisov P.D., Kosolapov Yu.V. On the automatic analysis of the practical resistance of obfuscating transformations. Automatic Control and Computer Sciences, vol. 54, issue 7, 2020, pp. 619-629.
6. Holder W., McDonald J.T., Andel T.R. Evaluating optimal phase ordering in obfuscation executives. In Proc. of the 7th Software Security, Protection, and Reverse Engineering/Software Security and Protection Workshop, 2017, pp. 1-12.
7. Borisov P.D., Kosolapov Yu.V. Similarity Features for The Evaluation Of Obfuscation Effectiveness. In Proc. of the International Conference on Decision Aid Sciences and Application (DASA), 2020, pp. 898-902.
8. Борисов П.Д., Косолапов Ю.В. О характеристиках символьного исполнения в задаче оценки качества обфусцирующих преобразований. Моделирование и анализ информационных систем, том. 28, вып. 1, 2021 г., стр. 38-51. / Borisov P.D., Kosolapov Yu.V. On characteristics of symbolic execution in the problem of assessing the quality of obfuscating transformations. Modeling and Analysis of Information Systems, vol. 28, issue 1, 2021, pp. 38-51 (in Russian).
9. Pagani F., Dell'Amico M., Balzarotti D. Beyond precision and recall: understanding uses (and misuses) of similarity hashes in binary analysis. In Proc. of the Eighth ACM Conference on Data and Application Security and Privacy, 2018, pp. 354-365.
10. Ding S.H., Fung, B.C., Charland P. Asm2vec: Boosting static representation robustness for binary clone search against code obfuscation and compiler optimization. In Proc. of the IEEE Symposium on Security and Privacy (SP), 2019, pp. 472-489.
11. Юмаганов А.С. Поиск похожих символьных последовательностей и графов на основе методов машинного обучения. Диссертация на соискание степени кандидата технических наук. Самарский национальный исследовательский университет имени академика С.П. Королева, 2020 г., 130 стр. / Yumaganov A.S. Search for similar character sequences and graphs based on machine learning methods. Dissertation for the degree of candidate of technical sciences. Samara National Research University named after Academician S.P. Korolev, 2020, 130 p. (in Russian).
12. Саргсян С., Курмангалеев Ш. и др. Масштабируемый инструмент поиска клонов кода на основе семантического анализа программ. Труды ИСП РАН, том 27, вып. 1, 2015 г., стр. 39-50. / Sargsyan S., Kurmangaleev S. et al. Scalable code clone detection tool based on semantic analysis, Proceedings of ISP RAS, Vol. 27, issue 1, 2015, pp. 39–50. DOI: 10.15514/ISPRAS-2015-27(1)-3.
13. Осадчая А.О., Исаев И.В. Метод поиска клонов в программном коде. Научно-технический вестник информационных технологий, механики и оптики, том 20, вып. 5, 2020 г., стр. 714-721. / Osadchaya A.O., Isaev I.V. Search of clones in program code. Scientific and Technical Journal of Information Technologies, Mechanics and Optics, vol. issue 5, 2020, pp. 714-721 (in Russian).
14. Bellon S., Koschke R. et al. Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering, vol. 33, issue 9, 2007, pp. 577-591.
15. Sarantinos N., Benzaid C. et al. Forensic malware analysis: The value of fuzzy hashing algorithms in identifying similarities. In Proc. of the IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), 2016, pp. 1782-1787.
16. Oliver J., Cheng С., Chen Y. TLSH – a loclity sensitive hash. In Proc. of the Fourth Cybercrime and Trustworthy Computing Workshop, 2013, pp. 7-13.
17. Mohsen R. Quantitative measures for code obfuscation security. PhD Thesis, Imperial College London, 2016, 216 p.
18. BinExport. Available at: https://github.com/google/binexport, accessed 10.10.2022.
19. BinDiff. Available at: https://www.zynamics.com/, accessed 10.10.2022.
20. Arutunian M., Hovhannisyan H. et al. A Method to Evaluate Binary Code Comparison Tools. In Proc. of the 2021 Ivannikov Memorial Workshop (IVMEM), 2021, pp. 3-5.
21. A powerful disassembler and a versatile debugger. Available at: https://hex-rays.com/ida-pro/, accessed 10.10.2022.
22. Ghidra Software Reverse Engineering Framework. Available at: https://github.com/NationalSecurityAgency/ghidra, accessed 10.10.2022.
23. Binary Ninja. Available at: https://binary.ninja/, accessed 10.10.2022.
24. Kornblum J. Identifying almost identical files using context triggered piecewise hashing. Digital investigation, vol. 3, 2006, pp. 91-97.
25. Roussev V. An evaluation of forensic similarity hashes. Digital investigation, vol. 8, 2011, pp. S34-S41.
26. Pavlov I. Lzma SDK (software development kit). Available at: https://7-zip.org/sdk.html, accessed 10.10.2022.
27. Seward, Julian. bzip2 and libbzip2. Avaliable at http://www. bzip. org, accessed 10.10.2022.
28. Oswal S., Singh A., Kumari K. Deflate compression algorithm. International Journal of Engineering Research and General Science, vol. 4, issue 1, 2016, pp. 430-436.
29. Radare2. Available at: https://rada.re/, accessed 10.10.2022.
30. Myers E. W. An O(ND) difference algorithm and its variations. Algorithmica, vol. 1, issue 1, 1986, pp. 251-266.
31. В.И. Левенштейн. Двоичные коды, способные исправлять удаления, вставки и обращения. Доклады Академии Наук СССР, том 163, no. 4, 1966 г., стр. 845-848. / V.I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet physics doklady, vol. 10, no. 8, 1966, pp. 707-710.
32. Coreutils – GNU core utilities. Available at: https://www.gnu.org/software/coreutils/, accessed 10.10.2022.
33. GCC, the GNU Compiler Collection. Available at: https://gcc.gnu.org/, accessed 10.10.2022.
34. Clang: a C language family frontend for LLVM. Available at: https://clang.llvm.org/, accessed 10.10.2022.
35. AMD Optimizing C/C++ and Fortran Compilers (AOCC). Available at: https://developer.amd.com/amd-aocc/, accessed 10.10.2022.
36. Scikit-learn. Machine learning in Python. Available at: https://scikit-learn.org/, accessed 10.10.2022.
37. Dcfldd. Available at: https://dcfldd.sourceforge.net/, accessed 10.10.2022.
38. Breitinger F., Astebol K.P. et al. mvhash-b – A new approach for similarity preserving hashing. In Proc. of the Seventh International Conference on IT Security Incident Management and IT Forensics, 2013, pp. 33-44.
39. Machoc hash. Available at: https://github.com/ANSSI-FR/polichombr/blob/dev/docs/ MACHOC_HASH.md, accessed 10.10.2022.
40. Machoke. Available at: https://github.com/conix-security/machoke, accessed 10.10.2022.
41. Aslanyan H., Avetisyan A. et al. Scalable Framework for Accurate Binary Code Comparison. In Proc. of the 2017 Ivannikov ISPRAS Open Conference (ISPRAS), 2017, pp. 34-38.
Review
For citations:
BORISOV P.D., KOSOLAPOV Yu.V. A Method to Evaluate Program Similarity Using Machine Learning Methods. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2022;34(5):63-76. (In Russ.) https://doi.org/10.15514/ISPRAS-2022-34(5)-4