Extraction of Functionality from Binary Code
https://doi.org/10.15514/ISPRAS-2025-37(4)-6
Abstract
Semantic code analysis is an important but time-consuming process used in many areas of programming. This work studies a method for automating the semantic analysis of binary code. The method divides software into semantic kernels, using either partial execution traces or subgraphs extracted from the call graph, and then identifies the functionality of each kernel.
About the Authors
Anna Aleksandrovna ILINA
Russian Federation
Graduate student at the CMC MSU and a laboratory assistant at the ISP RAS. Her research interests include static binary code analysis, symbolic execution, and the use of large language models.
Shamil Faimovich KURMANGALEEV
Russian Federation
Cand. Sci. (Phys.-Math.), head of the development of autonomous systems and technologies for creating secure software at the ISP RAS.
For citations:
ILINA A.A., KURMANGALEEV Sh.F. Extraction of Functionality from Binary Code. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(4):97-110. https://doi.org/10.15514/ISPRAS-2025-37(4)-6






