GraphTyper: Neural Types Inference from Code Represented as Graph
https://doi.org/10.15514/ISPRAS-2024-36(4)-6
Abstract
Although programming is a creative process, a substantial share of a developer's time goes into routine tasks. As in other industries, the IT field seeks to automate such tasks, and neural networks are frequently applied for this purpose. Programming is no exception: GitHub claims that roughly 30% of code is already written with the help of Copilot. This tool is based on Codex, a transformer model trained on program source code. However, representing code as a plain token sequence, as Copilot does, is not the most effective approach. In this work we show that combining transformers with a graph representation of code achieves strong results even with small models.
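To illustrate what a graph representation of code looks like, the sketch below builds a simple parent-child edge list from a Python abstract syntax tree using the standard `ast` module. This is a hypothetical, simplified stand-in for the richer program graphs (with data-flow and control-flow edges) used as model input in this line of work, not the paper's actual pipeline:

```python
import ast

def code_to_graph(source: str):
    """Turn Python source into a toy graph: node labels are AST node
    type names, edges connect each parent node to its children.
    Illustrative sketch only, not the GraphTyper pipeline."""
    tree = ast.parse(source)
    nodes, edges = [], []

    def visit(node, parent_id=None):
        node_id = len(nodes)
        nodes.append(type(node).__name__)
        if parent_id is not None:
            edges.append((parent_id, node_id))
        for child in ast.iter_child_nodes(node):
            visit(child, node_id)

    visit(tree)
    return nodes, edges

nodes, edges = code_to_graph("def add(a, b):\n    return a + b")
print(nodes)   # node labels, starting with "Module"
print(edges)   # parent->child edges forming a tree
```

A graph transformer would consume such nodes and edges (after embedding the labels) instead of a flat token sequence, which is what lets structural information reach the model directly.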
About the Authors
German Arsenovich ARUTYUNOV
Russia
Master's student at the Faculty of Computer Science, HSE University. Research interests: generation and analysis of programming languages using machine learning and deep neural networks.
Sergey Mikhailovich AVDOSHIN
Russia
Cand. Sci. (Tech.), Professor at the Department of Computer Engineering, Tikhonov Moscow Institute of Electronics and Mathematics, HSE University. Research interests: design and analysis of computer algorithms, simulation and modeling, parallel and distributed processes, machine learning.
For citation:
ARUTYUNOV G.A., AVDOSHIN S.M. GraphTyper: Neural Types Inference from Code Represented as Graph. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2024;36(4):69-80. (In Russ.) https://doi.org/10.15514/ISPRAS-2024-36(4)-6