Proceedings of the Institute for System Programming of the RAS

Comparison of Graph Embeddings for Source Code with Text Models Based on CNN and CodeBERT Architectures

https://doi.org/10.15514/ISPRAS-2023-35(1)-15

Abstract

One possible way to reduce the number of errors in source code is to build intelligent tools that assist the development process. Such tools often rely on vector representations (embeddings) of source code and machine learning methods borrowed from natural language processing. However, these approaches do not take into account the specifics of source code and its structure. This paper investigates methods for pre-training graph embeddings of source code, where the graph represents the structure of a program. The results show that graph embeddings achieve a variable type classification accuracy for Python programs comparable to that of CodeBERT embeddings. Moreover, using text and graph embeddings together in a hybrid model improves type classification accuracy by more than 10%.
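To make the hybrid setup concrete, here is a minimal sketch, assuming a PyTorch-style model: a classifier that concatenates a CodeBERT-style text embedding of a variable occurrence with a pre-trained embedding of the variable's node in the program graph, and feeds the result to a small MLP. The class name, the fusion-by-concatenation choice, and all dimensions (768 for a CodeBERT-base vector, 100 for the graph embedding, 20 type classes) are illustrative assumptions, not the authors' implementation.

# A minimal sketch (illustrative, not the paper's implementation) of a hybrid
# type classifier that fuses a text embedding with a graph-node embedding.
import torch
import torch.nn as nn

class HybridTypeClassifier(nn.Module):  # hypothetical name
    def __init__(self, text_dim=768, graph_dim=100, hidden_dim=256, num_types=20):
        super().__init__()
        # Fuse the two modalities by concatenation, then classify with an MLP.
        self.mlp = nn.Sequential(
            nn.Linear(text_dim + graph_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_types),
        )

    def forward(self, text_emb, graph_emb):
        # text_emb:  (batch, text_dim)  - e.g. a CodeBERT vector for the variable occurrence
        # graph_emb: (batch, graph_dim) - pre-trained embedding of the variable's graph node
        return self.mlp(torch.cat([text_emb, graph_emb], dim=-1))  # type logits

# Smoke test with random stand-ins for the two embeddings:
model = HybridTypeClassifier()
print(model(torch.randn(4, 768), torch.randn(4, 100)).shape)  # torch.Size([4, 20])

Concatenation followed by an MLP is the simplest possible fusion strategy; the reported gain of more than 10% suggests the text and graph embeddings carry complementary information about variable types.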

About the Authors

Vitaly Anatolyevich ROMANOV
Innopolis University
Russia

PhD student



Vladimir Vladimirovich IVANOV
Innopolis University
Russia

Candidate of Physical and Mathematical Sciences, Associate Professor



References

1. Vaswani A., Shazeer N. et al. Attention is all you need. In Proc. of the 31st Conference on Neural Information Processing Systems (NIPS), 2017, 11 p.

2. Kanade A., Maniatis P. et al. Learning and evaluating contextual embedding of source code. In Proc. of the 37th International Conference on Machine Learning, 2020, pp. 5110-5121.

3. Feng Z., Guo D. et al. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Proc. of the Conference on Empirical Methods in Natural Language Processing, 2020, pp. 1536-1547.

4. Guo D., Ren S. et al. GraphCodeBERT: Pre-training Code Representations with Data Flow. In Proc. of the Ninth International Conference on Learning Representations, 2021, 18 p.

5. Liu L., Nguyen H. et al. Universal Representation for Code. Lecture Notes in Computer Science, vol. 12714, 2021, pp. 16-28.

6. Nguyen A.T., Nguyen T.N. Graph-Based Statistical Language Model for Code. In Proc. of the 37th IEEE International Conference on Software Engineering, 2015, pp. 858-868.

7. Alon U., Sadaka R. et al. Structural language models of code. In Proc. of the 37th International Conference on Machine Learning, 2020, pp. 245-256.

8. Yang Y., Chen X., Sun J. Improve Language Modelling for Code Completion by Tree Language Model with Tree Encoding of Context. In Proc. of the 31st International Conference on Software Engineering and Knowledge Engineering, 2019, pp. 675–680.

9. Hellendoorn V.J., Sutton C. et al. Global Relational Models of Source Code. In Proc. of the Eighth International Conference on Learning Representations, 2020, 10 p.

10. Pandi V., Barr E.T. et al. OptTyper: Probabilistic Type Inference by Optimising Logical and Natural Constraints. arXiv preprint arXiv:2004.00348, 2020, 29 p.

11. Chirkova N., Troshin S. Empirical study of transformers for source code. In Proc. of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 703-715.

12. Buratti L., Pujar S. et al. Exploring Software Naturalness through Neural Language Models. arXiv preprint arXiv:2006.12641, 2020, 12 p.

13. Ahmad W.U., Chakraborty S. et al. Unified pre-training for program understanding and generation. In Proc. of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 2655–2668.

14. Wang Y., Wang W. et al. CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation. In Proc. of the Conference on Empirical Methods in Natural Language Processing, 2021, pp. 8696-8708.

15. Guo D., Lu S. et al. UniXcoder: Unified Cross-Modal Pre-training for Code Representation. In Proc. of the 60th Annual Meeting of the Association for Computational Linguistics, vol. 1: Long Papers, 2022, pp. 7212-7225.

16. Karmakar A., Robbes R. What do pre-trained code models know about code? In Proc. of the 36th IEEE/ACM International Conference on Automated Software Engineering, 2021, pp. 1332-1336.

17. Cui S., Zhao G. et al. PYInfer: Deep Learning Semantic Type Inference for Python Variables. arXiv preprint arXiv:2106.14316, 2021, 12 p.

18. Hellendoorn V.J., Bird C. et al. Deep learning type inference. In Proc. of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 152-162.

19. Malik R.S., Patra J., Pradel M. NL2Type: Inferring JavaScript Function Types from Natural Language Information. In Proc. of the IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019, pp. 304-315.

20. Boone C., de Bruin N. et al. DLTPy: Deep Learning Type Inference of Python Function Signatures using Natural Language Context. arXiv preprint arXiv:1912.00680, 2019, 10 p.

21. Pradel M., Gousios G. et al. Typewriter: Neural type prediction with search-based validation. In Proc. of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 209-220.

22. Raychev V., Vechev M., Krause A. Predicting program properties from "Big Code". In Proc. of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2015, pp. 111-124.

23. Allamanis M., Barr E.T. et al. Typilus: Neural type hints. In Proc. of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2020, pp. 91-105.

24. Peng Y., Gao C. et al. Static inference meets deep learning: a hybrid type inference approach for Python. In Proc. of the 44th International Conference on Software Engineering, 2022, pp. 2019-2030.

25. Wei J., Goyal M. et al. LambdaNet: Probabilistic type inference using graph neural networks. In Proc. of the Eighth International Conference on Learning Representations, 2020, 11 p.

26. Ye F., Zhao J., Sarkar V. Advanced Graph-Based Deep Learning for Probabilistic Type Inference. arXiv preprint arXiv:2009.05949, 2020, 25 p.

27. Fernandes P., Allamanis M., Brockschmidt M. Structured Neural Summarization. In Proc. of the Seventh International Conference on Learning Representations, 2019, 18 p.

28. Cvitkovic M., Singh B., Anandkumar A. Deep Learning On Code with an Unbounded Vocabulary. In Proc. of the Machine Learning for Programming (ML4P) Workshop at Federated Logic Conference (FLoC), 2018, 11 p.

29. Dinella E., Dai H. et al. Hoppity: Learning Graph Transformations To Detect and Fix Bugs in Programs. In Proc. of the Eighth International Conference on Learning Representations, 2020, 17 p.

30. Wang Y., Gao F. et al. Learning a static bug finder from data. arXiv preprint arXiv:1907.05579, 2019, 12 p.

31. Zhou Y., Liu S. et al. Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks. In Proc. of the 33rd International Conference on Neural Information Processing Systems (NIPS), 2019, pp. 10197-10207.

32. Brauckmann A., Goens A., Ertel S., Castrillon J. Compiler-based graph representations for deep learning models of code. In Proc. of the 29th International Conference on Compiler Construction, 2020, pp. 201-211.

33. Wan Y., Shu J. et al. Multi-modal attention network learning for semantic source code retrieval. In Proc. of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2019, pp. 13-25.

34. Wang W., Li G. et al. Detecting Code Clones with Graph Neural Network and Flow-Augmented Abstract Syntax Tree. In Proc. of the IEEE 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), 2020, pp. 261-271.

35. Li Y., Wang S. et al. Improving Bug Detection via Context-Based Code Representation Learning and Attention-Based Neural Networks. Proceedings of the ACM on Programming Languages, vol. 3, issue OOPSLA, 2019, article no. 162, 30 p.

36. Ben-Nun T., Jakobovits A.S., Hoefler T. Neural code comprehension: A learnable representation of code semantics. In Proc. of the 32nd International Conference on Neural Information Processing Systems (NIPS), 2018, pp. 3589-3601.

37. DeFreez D., Thakur A.V., Rubio-Gonzalez C. Path-based function embedding and its application to error-handling specification mining. In Proc. of the 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 423-433.

38. Brockschmidt M., Allamanis M. et al. Generative Code Modeling with Graphs. In Proc. of the Seventh International Conference on Learning Representations, 2019, 24 p.

39. Lu D., Tan N. et al. Program classification using gated graph attention neural network for online programming service. arXiv preprint arXiv:1903.03804, 2019, 12 p.

40. Zhang J., Wang X. et al. A Novel Neural Source Code Representation Based on Abstract Syntax Tree. In Proc. of the IEEE/ACM 41st International Conference on Software Engineering (ICSE), 2019, pp. 783-794.

41. Allamanis M., Brockschmidt M., Khademi M. Learning to Represent Programs with Graphs. In Proc. of the 6th International Conference on Learning Representations (ICLR), 2018, 17 p.

42. Hamilton W.L., Ying R., Leskovec J. Inductive representation learning on large graphs. In Proc. of the 31st International Conference on Neural Information Processing Systems (NIPS), 2017, pp. 1025-1035.

43. Wang Z., Ren Z. et al. Robust embedding with multi-level structures for link prediction. In Proc. of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019, pp. 5240-5246.

44. Schlichtkrull M., Kipf T.N. et al. Modeling Relational Data with Graph Convolutional Networks. Lecture Notes in Computer Science, vol. 10843, 2018, pp. 593-607.

45. Cai L., Yan B. et al. TransGCN: Coupling transformation assumptions with graph convolutional networks for link prediction. In Proc. of the 10th International Conference on Knowledge Capture (K-CAP), 2019, pp. 131-138.

46. Liu X., Tan H. et al. RAGAT: Relation Aware Graph Attention Network for Knowledge Graph Completion. IEEE Access, vol. 9, 2021, pp. 20840–20849.

47. Allamanis M., Chanthirasegaran P. et al. Learning continuous semantic representations of symbolic expressions. In Proc. of the 34th International Conference on Machine Learning, vol. 70, 2017, pp. 80-88.

48. Kudo T., Richardson J. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proc. of the Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2018, pp. 66-71.

49. Lin Y., Liu Z. et al. Learning entity and relation embeddings for knowledge graph completion. In Proc. of the Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015, pp. 2181-2187.

50. Yang B., Yih W. et al. Embedding entities and relations for learning and inference in knowledge bases. arXiv preprint arXiv:1412.6575, 2014, 12 p.

51. Nickel M., Tresp V., Kriegel H.-P. A three-way model for collective learning on multi-relational data. In Proc. of the 28th International Conference on International Conference on Machine Learning, 2011, pp. 809-816.

52. Trouillon T., Welbl J. et al. Complex embeddings for simple link prediction. In Proc. of the 33rd International Conference on International Conference on Machine Learning, 2016, pp. 2071-2080.

53. Sun Z., Deng Z.-H. et al. RotatE: Knowledge graph embedding by relational rotation in complex space. In Proc. of the Seventh International Conference on Learning Representations, 2019, 18 p.

54. Ling X., Wu L. et al. Deep graph matching and searching for semantic code retrieval. ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 15, issue 5, 2021, article no. 88, 21 p.

55. Collobert R., Weston J. et al. Natural language processing (almost) from scratch. Journal of Machine Learning Research, vol. 12, 2011, pp. 2493-2537.

56. Romanov V., Ivanov V., Succi G. Representing Programs with Dependency and Function Call Graphs for Learning Hierarchical Embeddings. In Proc. of the 22nd International Conference on Enterprise Information Systems (ICEIS), vol. 2, 2020, pp. 360-366.

57. Bojanowski P., Grave E. et al. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, vol. 5, 2017, pp. 135-146.


For citation:


ROMANOV V.A., IVANOV V.V. Comparison of Graph Embeddings for Source Code with Text Models Based on CNN and CodeBERT Architectures. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2023;35(1):237-264. (In Russ.) https://doi.org/10.15514/ISPRAS-2023-35(1)-15



Content is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)