Proceedings of the Institute for System Programming of the RAS

Comparative Analysis of Parallel Join Algorithms for the MapReduce Environment

https://doi.org/10.15514/ISPRAS-2012-23-17

Abstract

Large volumes of data are analyzed with methods such as parallel DBMSs, the MapReduce paradigm, columnar storage, and various combinations of these approaches. This paper considers join algorithms in the MapReduce environment. Unfortunately, MapReduce does not support join algorithms directly. The goal of this work is to summarize and compare existing equi-join algorithms together with some optimization techniques.
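Because MapReduce offers no join operator, an equi-join has to be expressed through map and reduce functions. The following is a minimal in-memory sketch of the widely described reduce-side (repartition) approach, given only as a general illustration and not as the implementation evaluated in the paper. All function names, the relation tags "R" and "S", and the sample data are hypothetical, and the shuffle step is simulated with a dictionary rather than a real MapReduce runtime.

```python
from collections import defaultdict
from itertools import product


def map_phase(records, tag, key_index):
    """Emit (join_key, (tag, record)) pairs; the tag marks the source relation."""
    for record in records:
        yield record[key_index], (tag, record)


def shuffle(pairs):
    """Simulate the shuffle step: group all emitted values by join key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(values):
    """For one join key, pair every record from R with every record from S."""
    left = [rec for tag, rec in values if tag == "R"]
    right = [rec for tag, rec in values if tag == "S"]
    for l_rec, r_rec in product(left, right):
        yield l_rec + r_rec


def reduce_side_join(r, s, r_key=0, s_key=0):
    """Equi-join relations r and s (lists of tuples) on the given key columns."""
    pairs = list(map_phase(r, "R", r_key)) + list(map_phase(s, "S", s_key))
    result = []
    for values in shuffle(pairs).values():
        result.extend(reduce_phase(values))
    return result


# Hypothetical sample relations: (user_id, name) and (user_id, item).
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]
joined = reduce_side_join(users, orders)
```

In this sketch every record of both relations passes through the shuffle step, which is the main cost that optimizations of such joins usually try to reduce.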

About the author

A. Yu. Pigul
Saint Petersburg State University
Russia


For citation:

Pigul A.Yu. Comparative Study Parallel Join Algorithms for MapReduce environment. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2012;23. (In Russ.) https://doi.org/10.15514/ISPRAS-2012-23-17



Content is available under the Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)