
Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)


Optimization problems running MPI-based HPC applications

https://doi.org/10.15514/ISPRAS-2017-29(6)-14

Abstract

MPI is a well-proven technology that is widely used in high-performance computing environments. However, configuring an MPI cluster can be a difficult task. Containers are a comparatively new approach to virtualization and lightweight application packaging that is becoming a popular tool for high-performance computing (HPC) workloads; this approach is considered in this article. Packaging an MPI application as a container solves the problem of conflicting dependencies and simplifies the configuration and management of running applications. Cluster resources can be managed either by a typical queue system (for example, SLURM) or by a container management system (Docker Swarm, Kubernetes, Mesos, etc.). Containers also provide more options for flexible management of running applications (stop, restart, pause and, in some cases, migration between nodes), which makes it possible to optimize the allocation of tasks to cluster nodes better than a classic scheduler can. The article discusses various ways to optimize the placement of containers when running HPC applications. A way of launching MPI applications in the Fanlight system is proposed, which simplifies the work of users; the optimization problem associated with this approach is also considered.
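
To illustrate the kind of placement decision the abstract refers to, the sketch below shows a minimal first-fit-decreasing heuristic that assigns containerized jobs (each requesting a number of cores) to cluster nodes. This is not the scheme proposed in the article; the Job and Node structures and the core-count resource model are simplifying assumptions introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Job:
    name: str
    cores: int          # cores requested by the containerized MPI job

@dataclass
class Node:
    name: str
    cores_free: int     # cores still available on this node
    jobs: List[str] = field(default_factory=list)

def place_first_fit_decreasing(jobs: List[Job], nodes: List[Node]) -> List[Job]:
    """Greedy placement: largest jobs first, each onto the first node that fits.
    Returns the jobs that could not be placed (they would wait in the queue)."""
    unplaced: List[Job] = []
    for job in sorted(jobs, key=lambda j: j.cores, reverse=True):
        target: Optional[Node] = next(
            (n for n in nodes if n.cores_free >= job.cores), None)
        if target is None:
            unplaced.append(job)        # no node has enough free cores
            continue
        target.cores_free -= job.cores
        target.jobs.append(job.name)
    return unplaced

if __name__ == "__main__":
    # Hypothetical two-node cluster and three containerized jobs.
    nodes = [Node("node1", 16), Node("node2", 8)]
    jobs = [Job("cfd", 12), Job("md", 8), Job("post", 4)]
    waiting = place_first_fit_decreasing(jobs, nodes)
    for n in nodes:
        print(n.name, n.jobs, "free cores:", n.cores_free)
    print("still queued:", [j.name for j in waiting])
```

In practice, the ability to stop, pause or migrate running containers gives a scheduler more moves than this one-shot heuristic has, which is where the potential advantage over a classic queue scheduler comes from.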

About the Authors

D. A. Grushin
Ivannikov Institute for System Programming of the Russian Academy of Sciences
Russian Federation


N. N. Kuzjurin
Ivannikov Institute for System Programming of the Russian Academy of Sciences; Moscow Institute of Physics and Technology
Russian Federation


For citations:


Grushin D.A., Kuzjurin N.N. Optimization problems running MPI-based HPC applications. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2017;29(6):229-244. (In Russ.) https://doi.org/10.15514/ISPRAS-2017-29(6)-14



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)