Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)

Deploying Apache Spark virtual clusters in cloud environments using orchestration technologies

https://doi.org/10.15514/ISPRAS-2016-28(6)-8

Abstract

Apache Spark is a framework that provides fast computation on Big Data using the MapReduce model. Cloud environments make Big Data processing more flexible because they allow virtual clusters to be created on demand. One of the most powerful open-source cloud environments is OpenStack. The main goal of this project is to make it possible to create virtual clusters with Apache Spark and other Big Data tools in OpenStack. There are three approaches to doing this. The first is to use the OpenStack REST APIs to create instances and then deploy the environment. The Apache Spark core team uses this approach to create clusters in the proprietary Amazon EC2 cloud, and almost the same method has been implemented for OpenStack environments. However, because the OpenStack API changes frequently, this solution has been deprecated since the Kilo release. The second approach is to integrate virtual cluster creation into OpenStack as a built-in service. ISP RAS has contributed several patches that implement a universal Spark job engine for OpenStack Sahara and OpenStack Swift integration with Apache Spark as a drop-in replacement for Apache Hadoop. This approach makes Spark clusters available as a service in the PaaS service model. Because OpenStack releases are less frequent than Apache Spark releases, this approach may be inconvenient for developers who need the latest Spark releases. The third solution uses Ansible for orchestration. It is implemented in a loosely coupled way, so that any auxiliary tool can be added or a different cloud environment can be used. It also allows any Apache Spark and Apache Hadoop versions to be chosen for deployment in virtual clusters. All of the listed approaches are available under the Apache 2.0 license.

About the Authors

O. Borisenko
Institute for System Programming of the Russian Academy of Sciences
Russian Federation


R. Pastukhov
Institute for System Programming of the Russian Academy of Sciences
Russian Federation


S. Kuznetsov
Institute for System Programming of the Russian Academy of Sciences; Lomonosov Moscow State University; Moscow Institute of Physics and Technology (State University)
Russian Federation




For citations:


Borisenko O., Pastukhov R., Kuznetsov S. Deploying Apache Spark virtual clusters in cloud environments using orchestration technologies. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2016;28(6):111-120. (In Russ.) https://doi.org/10.15514/ISPRAS-2016-28(6)-8



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)