Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Big Transformers for Code Generation

https://doi.org/10.15514/ISPRAS-2022-34(4)-6

Abstract

The IT industry has been thriving over the past decades: numerous new programming languages, architectural patterns, and software development techniques have emerged. The tools involved in the development process ought to evolve as well. One of the key principles of the new generation of software development tools will be their ability to learn using neural networks, and first of all to learn how to write code. In this work we study the ability of Transformers to generate competition-level code. The main goal is to discover whether open-source Big Transformers are “naturally” good coders.
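
To illustrate the kind of experiment the abstract describes, below is a minimal sketch of prompting an open-source causal language model to continue a competition-style task; the checkpoint name, prompt, and sampling parameters are illustrative assumptions, not the authors' actual setup.

# A minimal sketch (assumed setup, not the authors' pipeline): prompt an
# open-source causal language model with a competition-style task and let it
# continue the solution.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neo-1.3B"  # assumed open-source checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Problem statement plus the start of a solution; the model writes the rest.
prompt = (
    "# Read an integer n from stdin and print the sum of integers from 1 to n.\n"
    "def solve():\n"
    "    n = int(input())\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The generated candidates could then be checked against the task's test cases, as is common in competition-style evaluation of code generation.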

About the Authors

German Arsenovich ARUTYUNOV
HSE University
Russian Federation

Master’s student at the Faculty of Computer Science 



Sergey Mikhailovich AVDOSHIN
HSE University
Russian Federation

Candidate of Technical Sciences, Professor at the School of Computer Engineering, Tikhonov Moscow Institute of Electronics and Mathematics



For citations:


ARUTYUNOV G.A., AVDOSHIN S.M. Big Transformers for Code Generation. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2022;34(4):79-88. https://doi.org/10.15514/ISPRAS-2022-34(4)-6



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)