Development in Vulkan: a domain-specific approach
https://doi.org/10.15514/ISPRAS-2021-33(5)-11
Abstract
In this paper we propose a high-level approach to developing GPU applications based on the Vulkan API. The aim of the work is to reduce the complexity of developing and debugging applications that implement complex algorithms on the GPU using Vulkan. The proposed approach relies on code generation: a C++ program is translated into an optimized Vulkan implementation, which includes automatic shader generation, resource binding, and the use of synchronization mechanisms (Vulkan barriers). The proposed solution is not a general-purpose programming technology; it specializes in specific tasks. At the same time, it is extensible, which allows the solution to be adapted to new problems. From a single input C++ program, we can generate several implementations for different cases (via translator options) or for different hardware. For example, a call to virtual functions can be implemented through a switch construct in a kernel, through sorting threads and indirect dispatch of different kernels, or through the so-called callable shaders in Vulkan. Instead of creating a universal programming technology for building various software systems, we offer an extensible technology that can be customized for a specific class of applications. Unlike, for example, Halide, we do not use a domain-specific language; the necessary knowledge is extracted from ordinary C++ code. We therefore do not extend the language with any new constructs or directives, and the input is assumed to be normal C++ source code (albeit with some restrictions) that can be compiled by any C++ compiler. We use pattern matching to find specific patterns in C++ code and convert them to efficient GPU code using Vulkan. Patterns are expressed through classes, member functions, and the relationships between them.
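As an illustration of the first option mentioned above (a virtual call implemented through a switch construct in a kernel), the following is a minimal CPU-side sketch in plain C++. It is not the output of the authors' translator. All names (`MaterialTag`, `MaterialData`, `shadeKernel`, `runShade`) are hypothetical, and the loop in `runShade` only stands in for a GPU dispatch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type tags that stand in for the dynamic type of each object.
enum class MaterialTag : std::uint32_t { Lambert = 0, Mirror = 1 };

// Plain-data record a kernel could read from a buffer instead of a C++ object.
struct MaterialData {
  MaterialTag tag;
  float albedo;
};

// Former virtual overrides, flattened into free functions (assumed behavior).
inline float shadeLambert(const MaterialData& m, float cosTheta) { return m.albedo * cosTheta; }
inline float shadeMirror(const MaterialData& m, float /*cosTheta*/) { return m.albedo; }

// Per-thread kernel body: the virtual call becomes a switch on the tag.
inline float shadeKernel(const MaterialData& m, float cosTheta) {
  switch (m.tag) {
    case MaterialTag::Lambert: return shadeLambert(m, cosTheta);
    case MaterialTag::Mirror:  return shadeMirror(m, cosTheta);
  }
  return 0.0f;  // unknown tag: fall back to black
}

// CPU stand-in for a dispatch: one loop iteration per GPU thread.
inline std::vector<float> runShade(const std::vector<MaterialData>& mats, float cosTheta) {
  assert(cosTheta >= 0.0f && cosTheta <= 1.0f);
  std::vector<float> out(mats.size());
  for (std::size_t i = 0; i < mats.size(); ++i) out[i] = shadeKernel(mats[i], cosTheta);
  return out;
}
```

The switch keeps everything in one kernel, which is simple but can cause divergence when neighboring threads hold different tags. That is presumably why the alternatives of thread sorting with indirect dispatch and callable shaders are also listed.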
Thus, the proposed technology provides a cross-platform solution by generating different implementations of the same algorithm for different GPUs, and at the same time it gives access to the specific hardware functionality required in computer graphics applications. Patterns are divided into architectural and algorithmic ones. An architectural pattern defines the domain and the behavior of the translator as a whole (for example, image processing, ray tracing, neural networks, or computational fluid dynamics). Algorithmic patterns express knowledge of data flow and control flow and define a narrower class of algorithms that can be efficiently implemented in hardware; examples include parallel reduction, compaction (parallel append), sorting, prefix sum, histogram calculation, and map-reduce. Algorithmic patterns can occur within architectural patterns. The proposed generator works on the principle of code morphing: given a class in the program and a set of transformation rules, it automatically generates another class with the desired properties (for example, an implementation of the algorithm on the GPU). The generated class inherits from the input class and thus has access to all data and functions of the input class. Overriding virtual functions in the generated class lets the user carefully connect the generated code to other, hand-written Vulkan code. Shaders can be generated in two variants: OpenCL shaders for Google's clspv compiler and GLSL shaders for an arbitrary GLSL compiler. The clspv variant is better suited to code that uses pointers intensively, while the GLSL generator is better when specific hardware features (such as hardware ray tracing acceleration) are used. We demonstrate our technology on several examples from image processing and ray tracing, on which we obtain a 30-100 times speedup over a multithreaded CPU implementation.
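The code-morphing idea described above (a generated class that inherits from the input class and exposes virtual functions for hand-written code) can be sketched roughly as follows. This is only an assumed shape, not the generator's actual output. The class names, the `kernel_` prefix, the `BeforeDispatch` hook, and the dispatch counter are all hypothetical. No Vulkan calls are made, and the "dispatch" is emulated on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Input class: ordinary C++ that any compiler accepts (names are hypothetical).
class Scaler {
public:
  virtual ~Scaler() = default;
  void SetGain(float g) { m_gain = g; }

  // Control function: the CPU reference version simply loops over the data.
  virtual void Run(const std::vector<float>& in, std::vector<float>& out) {
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) out[i] = kernel_Scale(in[i]);
  }

protected:
  float kernel_Scale(float x) const { return x * m_gain; }
  float m_gain = 1.0f;
};

// Sketch of what a generator might emit: inherits state, overrides control.
class Scaler_Generated : public Scaler {
public:
  void Run(const std::vector<float>& in, std::vector<float>& out) override {
    BeforeDispatch();  // hook for hand-written code
    // A real generated class would bind buffers and record a dispatch here.
    // This stand-in emulates the kernel on the CPU using inherited data.
    out.resize(in.size());
    for (std::size_t tid = 0; tid < in.size(); ++tid) out[tid] = kernel_Scale(in[tid]);
    ++m_dispatchCount;
  }
  std::size_t DispatchCount() const { return m_dispatchCount; }

protected:
  // Users could override this to connect their own Vulkan code.
  virtual void BeforeDispatch() {}

private:
  std::size_t m_dispatchCount = 0;
};
```

Because the generated class is a subclass, existing code that holds a `Scaler&` can switch to the generated implementation without changes, and state set through the base interface (here `SetGain`) is visible to the overridden `Run`.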
About the Authors
Vladimir FROLOV
Russian Federation
PhD in computer graphics, senior researcher at Keldysh Institute of Applied Mathematics and researcher in computer graphics at Moscow State University
Vadim SANZHAROV
Russian Federation
Junior researcher in computer graphics at Moscow State University and researcher at Keldysh Institute of Applied Mathematics
Vladimir GALAKTIONOV
Russian Federation
Doctor of Science in physics and mathematics, Professor, Head of Computer graphics department
Alexander Shcherbakov
Russian Federation
Postgraduate student
References
1. J. Fang, C. Huang et al. Parallel programming models for heterogeneous many-cores: a comprehensive survey. CCF Transactions on High Performance Computing, vol. 2, issue 4, 2020, pp. 382-400.
2. OpenACC. URL: https://www.openacc.org/.
3. N. Jacobsen. LLVM supported source-to-source translation. Translation from annotated C/C++ to CUDA C/C++. Master’s Thesis. University of Oslo, Norway, 2016, 134 p.
4. G.D. Balogh, G.R. Mudalige et al. Op2-clang: A source-to-source translator using clang/llvm libtooling. In Proc. of the IEEE/ACM 5th Workshop on the LLVM Compiler Infrastructure in HPC (LLVM-HPC), 2018, pp. 59-70.
5. P. Yang, F. Dong et al. Improving utility of GPU in accelerating industrial applications with user-centered automatic code translation. IEEE Transactions on Industrial Informatics, vol. 14, issue 4, 2017, pp. 1347-1360.
6. J. A. Pienaar, S. Chakradhar, A. Raghunathan. Automatic generation of software pipelines for heterogeneous parallel systems. In Proc. of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012, pp. 1-12.
7. N.A. Konovalov, V.A. Kryukov. DVM-approach to the development of parallel programs for computing clusters and networks. 2002. URL: https://www.keldysh.ru/dvm/dvmhtm1107/publishr/dvm-appr-OpSys.htm (in Russian).
8. V.A. Bakhtin, A.V. Voronkov et al. Using the Fortran-DVM/OpenMP language for solving large problems. In Proc. of the All-Russian Supercomputer Conference on Scientific Service on the Internet: Solving Big Problems, 2008, pp. 185-191 (in Russian).
9. V.A. Bakhtin, V.A. Krukov. DVM-Approach to the Automation of the Development of Parallel Programs for Clusters. Programming and Computer Software, vol. 45, no. 3, 2019, pp. 121-132.
10. M.A. Mikalsen. OpenACC-based Snow Simulation. Master’s Thesis. Norwegian University of Science and Technology, 2013, 124 p.
11. T. Öhberg. Auto-tuning Hybrid CPU-GPU Execution of Algorithmic Skeletons in SkePU. Master’s Thesis. Linköping University, Sweden, 2018, 80 p.
12. A. Ernstsson, L. Lu, C. Kessler. SkePU 2: Flexible and type-safe skeleton programming for heterogeneous parallel systems. International Journal of Parallel Programming, vol. 46, issue 1, 2018, pp. 62-80.
13. M. Steuwer, P. Kegel, S. Gorlatch. SkelCL - a portable skeleton library for high-level GPU programming. In Proc. of the IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, 2011, pp. 1176-1182.
14. SYCL, cross-platform abstraction layer. URL: https://www.khronos.org/sycl/.
15. SYCL for CUDA developers, examples, 2020. URL: https://developer.codeplay.com/products/computecpp/ce/guides/sycl-for-cuda-developers/examples.
16. Vulkan specification, indirect dispatch command, 2021. URL: https://www.khronos.org/registry/vulkan/specs/1.2-extensions/man/html/vkCmdDispatchIndirect.html.
17. A. Sherin. Resident Evil 2 Frame Breakdown, 2019. URL: https://aschrein.github.io/2019/08/01/re2_breakdown.html.
18. T.D. Han, T.S. Abdelrahman. hiCUDA: High-level GPGPU programming. IEEE Transactions on Parallel and Distributed Systems, vol. 22, issue 1, 2010, pp. 78-90.
19. J. Wu, A. Belevich et al. gpucc: an open-source GPGPU compiler. In Proc. of the IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2016, pp. 105-116.
20. P. Sathre, M. Gardner, W. Feng. On the portability of GPU-accelerated applications via automated source-to-source translation. In Proc. of the International Conference on High Performance Computing in Asia-Pacific Region, 2019, pp. 1-8.
21. HIP, C++ Runtime API and Kernel Language, 2021. URL: https://github.com/ROCm-Developer-Tools/HIP.
22. A. Hamuraru. Atomic operations for floats in OpenCL – improved, 2016. URL: https://streamhpc.com/blog/2016-02-09/atomic-operations-for-floats-in-opencl-improved/.
23. A. Kapoulkine. Getting Faster and Leaner on Mobile: Optimizing Roblox with Vulkan. 2019. URL: https://www.youtube.com/watch?v=hPW5ckkqiqA.
24. Non-official Vulkan hardware database, 2021. URL: http://vulkan.gpuinfo.org/.
25. N. Mammeri, B. Juurlink. VComputeBench: A vulkan benchmark suite for GPGPU on mobile and embedded GPUs. In Proc. of the IEEE International Symposium on Workload Characterization (IISWC), IEEE, 2018, pp. 25-35.
26. V-EZ, an open source, cross-platform wrapper, 2018. URL: https://github.com/GPUOpen-LibrariesAndSDKs/V-EZ.
27. Vuh, A Vulkan-based GPGPU computing framework, 2020. URL: https://github.com/Glavnokoman/vuh.
28. Kompute, The general purpose GPU compute framework for cross vendor graphics cards, 2021. URL: https://github.com/KomputeProject/kompute.
29. A. Rasch, R. Schulze, S. Gorlatch. Developing High-Performance, Portable OpenCL Code via Multi-Dimensional Homomorphisms. In Proc. of the International Workshop on OpenCL, 2019, article no. 4.
30. T.-W. Huang, D.-L. Lin et al. Taskflow: A General-purpose Parallel and Heterogeneous Task Programming System. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2021.
31. M. Haidl, S. Gorlatch. PACXX: Towards a unified programming model for programming accelerators using C++14. In Proc. of the Workshop on the LLVM Compiler Infrastructure in HPC, 2014, pp. 1-11.
32. M. Haidl, S. Moll et al. PACXXv2 + RV: an LLVM-based portable high-performance programming model. In Proc. of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC, 2017, pp. 1-12.
33. A. Sidelnik, S. Maleki et al. Performance portability with the chapel language. In Proc. of the IEEE 26th international parallel and distributed processing symposium, 2012, pp. 582-594.
34. R. Baghdadi, J. Ray et al. Tiramisu: A polyhedral compiler for expressing fast and portable code. In Proceedings of IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2019, pp. 193-205.
35. Google clspv. A prototype compiler for a subset of OpenCL C to Vulkan compute shaders, 2021. URL: https://github.com/google/clspv.
36. S. Baxter. Circle C++ shaders, 2021. URL: https://github.com/seanbaxter/shaders.
37. K.A. Seitz Jr, T. Foley et al. Unified Shader Programming in C++. arXiv preprint arXiv:2109.14682, 2021, 13 p.
38. Clang documentation, 2021. URL: https://clang.llvm.org/docs/LibTooling.html.
39. Thrust: a powerful library of parallel algorithms and data structures, 2021. URL: https://developer.nvidia.com/thrust.
40. A. Kolesnichenko, C.M. Poskitt, S. Nanz. SafeGPU: Contract-and library-based GPGPU for object-oriented languages. Computer Languages, Systems & Structures, vol. 48, 2017, pp. 68-88.
41. D. Beckingsale, R. Hornung et al. Performance portable C++ programming with RAJA. In Proc. of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019, pp. 455-456.
42. T. Heller, P. Diehl et al. HPX – an open source C++ standard library for parallelism and concurrency. In Proc. of the Workshop on Open Source Supercomputing (OpenSuCo-2017), 2017, pp. 1-5.
43. A. Paszke, S. Gross et al. PyTorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703, 2019, 12 p.
44. M. Abadi, A. Agarwal et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016, 19 p.
45. T. Chen, M. Li et al. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015, 6 p.
46. Nvidia OptiX, 2021. URL: https://developer.nvidia.com/optix.
47. DirectML, 2021. URL: https://github.com/microsoft/DirectML.
48. Wisp renderer, 2021. URL: https://github.com/TeamWisp/WispRenderer.
49. Projects using RTX, 2021. URL: https://github.com.cnpmjs.org/vinjn/awesome-rtx.
50. J. Hegarty, J. Brunhaver et al. Darkroom: compiling high-level image processing code into hardware pipelines. ACM Transactions on Graphics, vol. 33, no. 4, 2014, pp. 1-11.
51. J. Ragan-Kelley, C. Barnes et al. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In Proc. of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013, pp. 519-530.
52. R. T. Mullapudi, A. Adams et al. Automatically scheduling halide image processing pipelines. ACM Transactions on Graphics, vol. 35, no. 4, 2016, pp. 1-11.
53. A. Adams, K. Ma et al. Learning to optimize halide with tree search and random programs. ACM Transactions on Graphics, vol. 38, no. 4, 2019, pp. 1-12.
54. Y. Hu, T.-M. Li et al. Taichi: a language for high-performance computation on spatially sparse data structures. ACM Transactions on Graphics, vol. 38, no. 6, 2019, pp. 1-16.
55. Y. Hu, L. Anderson et al. DiffTaichi: Differentiable programming for physical simulation. arXiv preprint arXiv:1910.00935, 2019, 20 p.
56. Y. Hu, J. Liu et al. QuanTaichi: A Compiler for Quantized Simulations. ACM Transactions on Graphics, vol. 40, no. 4, 2021, pp. 1-16.
57. S.S. Huang, D. Zook, Y. Smaragdakis. Morphing: Safely shaping a class in the image of others. Lecture Notes in Computer Science, vol. 4609, 2007, pp. 399-424.
58. Inja, template engine for modern C++, 2021. URL: https://github.com/pantor/inja.
59. W. Bruce, S.R. Marschner et al. In Proc. of the 18th Eurographics conference on Rendering Techniques, 2007, pp. 195-206
60. Voronoi Noise, 2018. URL: https://www.ronja-tutorials.com/post/028-voronoi-noise/.
61. Inigo Quilez. Fast 3D Noise, 2013. URL: https://www.shadertoy.com/view/4sfGzS.
For citations:
FROLOV V., SANZHAROV V., GALAKTIONOV V., Shcherbakov A. Development in Vulkan: a domain-specific approach. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2021;33(5):181-204. (In Russ.) https://doi.org/10.15514/ISPRAS-2021-33(5)-11