Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

LLM-based Interactive Code Generation: Empirical Evaluation

https://doi.org/10.15514/ISPRAS-2025-37(5)-9

Abstract

Recently, large language models (LLMs), particularly those pretrained on code, have demonstrated strong capabilities in generating programs from informal natural-language intent. However, LLM-generated code is prone to bugs. Developers interacting with LLMs seek trustworthy code and, ideally, clear indications of potential bugs and vulnerabilities; verified code mitigates the business risks of adopting generated code. We use CodePatchLLM, a model-agnostic framework that extends an LLM with feedback from the Svace static analyzer to improve the quality of generated code. We evaluate CodePatchLLM with four popular LLMs on three datasets. Our experiments show an average absolute reduction of 19.1% in static analyzer warnings for Java across all datasets and models, while preserving pass@1 code generation accuracy.
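
The abstract describes a feedback loop: the LLM generates code, the Svace static analyzer checks it, and the reported warnings are fed back to the model for repair. Below is a minimal Python sketch of such a loop. It is an illustration only: the names run_svace, build_repair_prompt, and the llm callable are hypothetical placeholders, since this page does not specify CodePatchLLM's actual interface.

from typing import Callable, List

def run_svace(code: str) -> List[str]:
    """Hypothetical wrapper: run the Svace static analyzer on `code`
    and return its warnings as strings."""
    raise NotImplementedError("replace with a real Svace invocation")

def build_repair_prompt(task: str, code: str, warnings: List[str]) -> str:
    """Fold analyzer warnings back into the prompt so the model can patch the code."""
    report = "\n".join("- " + w for w in warnings)
    return (
        "Task: " + task + "\n\n"
        "Current solution:\n" + code + "\n\n"
        "The static analyzer reported these issues:\n" + report + "\n\n"
        "Rewrite the solution so the warnings disappear while preserving behavior."
    )

def refine_with_feedback(task: str, llm: Callable[[str], str], max_rounds: int = 3) -> str:
    """Generate code, then iteratively repair it using analyzer feedback."""
    code = llm(task)                      # initial generation from the task description
    for _ in range(max_rounds):
        warnings = run_svace(code)
        if not warnings:                  # analyzer is silent: accept the code
            break
        code = llm(build_repair_prompt(task, code, warnings))
    return code

Capping the number of rounds matches the general pattern of analyzer-guided repair: iteration stops either when the analyzer reports no warnings or after a fixed budget, which keeps generation cost bounded.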

About the Authors

Danil Salavatovich SHAIKHELISLAMOV
Ivannikov Institute for System Programming of the Russian Academy of Sciences, Moscow Institute of Physics and Technology
Russian Federation

Researcher at the Institute for System Programming, senior lecturer at the Higher School of Economics, and postgraduate student at the Moscow Institute of Physics and Technology. His research interests include large language models and code generation.



Mikhail Dmitrievich DROBYSHEVSKIY
Ivannikov Institute for System Programming of the Russian Academy of Sciences, Moscow Institute of Physics and Technology
Russian Federation

Cand. Sci. (Phys.-Math.), researcher at ISP RAS. Research interests: trusted AI and explainable AI.



Andrey Andreevich BELEVANTSEV
Ivannikov Institute for System Programming of the Russian Academy of Sciences, Lomonosov Moscow State University
Russian Federation

Dr. Sci. (Phys.-Math.), Professor, Corresponding Member of the RAS, leading researcher at ISP RAS, and professor at Lomonosov Moscow State University. Research interests: static analysis, program optimization, and parallel programming.




For citations:

SHAIKHELISLAMOV D.S., DROBYSHEVSKIY M.D., BELEVANTSEV A.A. LLM-based Interactive Code Generation: Empirical Evaluation. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2025;37(5):123-130. (In Russ.) https://doi.org/10.15514/ISPRAS-2025-37(5)-9



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)