Comparison of Graph Embeddings for Source Code with Text Models Based on CNN and CodeBERT Architectures
https://doi.org/10.15514/ISPRAS-2023-35(1)-15
Abstract
One possible way to reduce the number of bugs in source code is to create intelligent tools that make the development process easier. Such tools often build on vector representations of source code and on machine learning techniques borrowed from natural language processing. However, these approaches typically ignore the specifics of source code, in particular its structure. This work studies methods for pretraining graph vector representations (graph embeddings) for source code, where the graph represents the structure of the program. The results show that graph embeddings achieve an accuracy of classifying variable types in Python programs comparable to that of CodeBERT embeddings. Moreover, using text and graph embeddings together in a hybrid model improves type classification accuracy by more than 10%.
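The abstract does not specify how the hybrid model fuses the two modalities, so the following is a minimal sketch of one plausible reading: a pretrained text embedding (e.g., a CodeBERT vector for a variable occurrence) is concatenated with a pretrained graph embedding of the corresponding program-graph node and passed to a small classifier over type labels. All dimensions, the class HybridTypeClassifier, and the concatenation-based fusion are illustrative assumptions, not the authors' exact architecture.

import torch
import torch.nn as nn

class HybridTypeClassifier(nn.Module):
    """Hypothetical hybrid type classifier: fuses a text embedding of a
    variable occurrence with a graph embedding of its program-graph node."""

    def __init__(self, text_dim=768, graph_dim=100, num_types=50):
        super().__init__()
        # A small MLP over the concatenated embeddings (assumed fusion scheme).
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + graph_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_types),
        )

    def forward(self, text_emb, graph_emb):
        # Fuse the two modalities by simple concatenation along the feature axis.
        fused = torch.cat([text_emb, graph_emb], dim=-1)
        return self.classifier(fused)

# Toy usage with random stand-ins for the pretrained embeddings.
model = HybridTypeClassifier()
text_emb = torch.randn(4, 768)   # e.g., CodeBERT vectors for 4 variable occurrences
graph_emb = torch.randn(4, 100)  # pretrained graph-node vectors for the same variables
logits = model(text_emb, graph_emb)
print(logits.shape)  # torch.Size([4, 50]) -- one score per candidate type

Concatenation followed by an MLP is the simplest fusion baseline; the reported >10% gain over text-only embeddings would correspond to comparing such a fused model against a classifier fed text_emb alone.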
About the Authors
Vitaly Anatolyevich ROMANOV
Russian Federation
PhD student
Vladimir Vladimirovich IVANOV
Russian Federation
Candidate of Physical and Mathematical Sciences, Associate Professor
For citations:
ROMANOV V.A., IVANOV V.V. Comparison of Graph Embeddings for Source Code with Text Models Based on CNN and CodeBERT Architectures. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2023;35(1):237-264. (In Russ.) https://doi.org/10.15514/ISPRAS-2023-35(1)-15