
Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS)


Big data: modern approaches to storage and analysis

https://doi.org/10.15514/ISPRAS-2012-23-9

Abstract

Big data has challenged traditional storage and analysis systems in several new ways. In this paper we examine how to overcome these challenges and why this cannot be done efficiently with traditional systems, and we describe three modern approaches to big data handling: NoSQL, MapReduce, and real-time stream processing. The first section of the paper is the introduction. The second section discusses the main issues of Big Data: volume, diversity, velocity, and value. The third section describes different approaches to solving the Big Data problem. Traditionally one might use a relational DBMS. The paper proposes some steps that allow one to keep using an RDBMS when its capacity becomes insufficient. Another way is to use a NoSQL approach. The basic ideas of the NoSQL approach are simplification, high throughput, and unlimited scaling out. The different kinds of NoSQL stores allow such systems to be used in different Big Data applications. MapReduce and its free implementation, Hadoop, may be used to provide scale-out Big Data analytics. Finally, several data management products support real-time stream processing of Big Data. The paper briefly overviews these products. The final section of the paper is the conclusion.

About the Authors

Pavel Klemenkov
MSU, Moscow
Russian Federation


Sergey Kuznetsov
ISP RAS, Moscow
Russian Federation




For citations:


Klemenkov P., Kuznetsov S. Big data: modern approaches to storage and analysis. Proceedings of the Institute for System Programming of the RAS (Proceedings of ISP RAS). 2012;23. (In Russ.) https://doi.org/10.15514/ISPRAS-2012-23-9



This work is licensed under a Creative Commons Attribution 4.0 License.


ISSN 2079-8156 (Print)
ISSN 2220-6426 (Online)