Task and resources management function in HPC operation system «SPO Super-EVM»
https://doi.org/10.15514/ISPRAS-2022-34(2)-13
Abstract
This paper describes the Slurm-VNIITF software developed by the Federal State Unitary Enterprise “Russian Federal Nuclear Center - Zababakhin All-Russian Research Institute of Technical Physics”. It covers the software's architecture and its resource and task management capabilities for numerical simulation HPC systems. Many years of operating HPC systems have shown that the basic features of the Slurm (Simple Linux Utility for Resource Management) software are clearly insufficient for the effective use of computing resources in HPC centers. The authors therefore propose an improved task and resource management policy and describe the Slurm extension modules (plugins) that implement it.
About the Authors
Alexey Olegovich IGNATYEV
Russian Federation
Head of Laboratory
Alexey Alexeevich KALININ
Russian Federation
Head of Research Group
Sergey Yurievich MOKSHIN
Russian Federation
Head of Department
For citations:
IGNATYEV A.O., KALININ A.A., MOKSHIN S.Yu. Task and resources management function in HPC operation system «SPO Super-EVM». Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2022;34(2):159-178. (In Russ.) https://doi.org/10.15514/ISPRAS-2022-34(2)-13