CSci 5707, Fall 2013 MapReduce vs. Parallel DBMS Hamid Safizadeh, Otelia Buffington University of Minnesota

CSci 5707, Fall 2013

MapReduce vs. Parallel DBMS

Hamid Safizadeh, Otelia Buffington

University of Minnesota

MapReduce Idea

Mapping

map (k1, v1) → list (k2, v2)

Reducing

reduce (k2, list(v2)) → list (v2)

Pseudo-code for counting the number of occurrences of each word in a large collection of documents
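
The pseudo-code figure from the slide did not survive in this transcript. As a stand-in, here is a minimal single-process sketch of the word-count idea in Python. The names map_fn, reduce_fn, and run_mapreduce are illustrative assumptions, not the paper's API, and the in-memory grouping step only imitates what a real framework does across machines.

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    """map(k1, v1) -> list(k2, v2): emit (word, 1) for every word."""
    return [(word, 1) for word in contents.split()]

def reduce_fn(word, counts):
    """reduce(k2, list(v2)) -> list(v2): sum the partial counts of one word."""
    return [sum(counts)]

def run_mapreduce(documents, map_fn, reduce_fn):
    # Grouping step: collect all intermediate values that share a key.
    groups = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)
    # Reduce step: one call per distinct intermediate key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Hypothetical input documents, used only for illustration.
docs = {"d1": "the quick brown fox", "d2": "the lazy dog the end"}
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
```

Here word_counts maps each word to a one-element list, matching the list(v2) return type of reduce shown above.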

Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004

MapReduce Example

Calculation of the number of occurrences of each word

http://aimotion.blogspot.com/2010/08/mapreduce-with-mongodb-and-python.html

MapReduce Architecture

Execution overview
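
The execution-overview figure is missing from this transcript. The sketch below imitates, in one process, the flow described in the cited paper: the input is divided into M splits, each map task writes its intermediate pairs into R regions chosen by a partitioning function, and each reduce task collects its region from every map task, sorts by key, and reduces each group. The values of M and R, the CRC-based partition function (a deterministic stand-in for hash(key) mod R), and all function names are assumptions made for illustration.

```python
import zlib
from itertools import groupby

M, R = 3, 2  # number of map tasks and reduce tasks (illustrative values)

def partition(key, num_reducers):
    # Deterministic stand-in for hash(key) mod R.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def split_input(records, num_splits):
    # Divide the input records into M splits, one per map task.
    return [records[i::num_splits] for i in range(num_splits)]

def run_map_task(split, map_fn, num_reducers):
    # Apply map_fn and place each intermediate pair into one of R regions.
    regions = [[] for _ in range(num_reducers)]
    for key, value in split:
        for k2, v2 in map_fn(key, value):
            regions[partition(k2, num_reducers)].append((k2, v2))
    return regions

def run_reduce_task(r, all_map_outputs, reduce_fn):
    # Collect region r from every map task, sort by key, reduce each group.
    pairs = sorted(p for regions in all_map_outputs for p in regions[r])
    return {k: reduce_fn(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=lambda p: p[0])}

def word_map(_name, text):
    return [(w, 1) for w in text.split()]

def word_reduce(_word, counts):
    return sum(counts)

# Hypothetical input records, used only for illustration.
records = [("d1", "a b a"), ("d2", "b c"), ("d3", "a c c"), ("d4", "d")]
map_outputs = [run_map_task(s, word_map, R) for s in split_input(records, M)]
output_files = [run_reduce_task(r, map_outputs, word_reduce) for r in range(R)]
merged = {k: v for part in output_files for k, v in part.items()}
```

Each key lands in exactly one of the R outputs, which is why the job produces R separate result partitions rather than a single file.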

Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004

MapReduce or Parallel DBMS

Pavlo, A., Paulson, E., Rasin, A., Abadi, D.J., DeWitt, D.J., Madden, S., and Stonebraker, M., “A comparison of approaches to large-scale data analysis”, ACM SIGMOD International Conference, 2009 (http://database.cs.brown.edu/projects/mapreduce-vs-dbms)

Dean, J., and Ghemawat, S., “MapReduce: A flexible data processing tool”, Communications of the ACM, Vol. 53, 2010 (DOI: 10.1145/1629175.1629198)

MapReduce Design Properties

Heterogeneous Systems: Processing and combining data from a wide variety of storage systems (such as relational databases, file systems, etc.)

Fault Tolerance: Providing fine-grained fault tolerance for large jobs (a failure in the middle of a multi-hour execution does not require restarting the job from scratch)

Complex Functions: Simple Map and Reduce functions have straightforward SQL equivalents, but MapReduce offers a better framework for some complicated tasks
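
To make the "straightforward SQL equivalents" point concrete, the sketch below computes the same per-key sum twice: once as a SQL GROUP BY query (using Python's built-in sqlite3 module) and once as a map function plus a reduce function. The visits table, its columns, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical (visitor, duration) records.
rows = [("alice", 30), ("bob", 12), ("alice", 5), ("carol", 7), ("bob", 8)]

# SQL formulation: GROUP BY does the grouping, SUM is the aggregate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT, duration INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
sql_result = dict(
    conn.execute(
        "SELECT visitor, SUM(duration) FROM visits GROUP BY visitor"
    ).fetchall()
)
conn.close()

# MapReduce formulation: map emits the grouping key, reduce aggregates.
def map_fn(row):
    visitor, duration = row
    return [(visitor, duration)]

def reduce_fn(visitor, durations):
    return sum(durations)

groups = defaultdict(list)
for row in rows:
    for key, value in map_fn(row):
        groups[key].append(value)
mr_result = {key: reduce_fn(key, values) for key, values in groups.items()}
```

For a simple aggregate like this the two formulations are interchangeable. The slide's claim is that the MapReduce form becomes the more convenient one when the map logic is too complicated to express comfortably in a query.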

Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, Communications of the ACM, Vol. 53, 2010
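
The fault tolerance property on this slide can be illustrated with a small simulation: a job is a set of independent tasks, and when one task fails only that task is run again while the results of finished tasks are kept. This is a sketch under simplifying assumptions (one process, a simulated crash, hypothetical names run_job and flaky_word_count), not how any particular framework schedules work.

```python
def run_job(tasks, run_task, max_attempts=3):
    """Run each task independently and retry only the task that failed."""
    results, attempts = {}, {}
    for task_id, payload in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[task_id] = attempt
            try:
                results[task_id] = run_task(task_id, payload)
                break
            except RuntimeError:
                continue  # re-run just this task; finished tasks are kept
        else:
            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
    return results, attempts

# Simulated worker failure: "split-1" crashes on its first attempt only.
failed_once = set()

def flaky_word_count(task_id, text):
    if task_id == "split-1" and task_id not in failed_once:
        failed_once.add(task_id)
        raise RuntimeError("simulated worker crash")
    return len(text.split())

# Hypothetical input splits, used only for illustration.
tasks = {"split-0": "a b c", "split-1": "d e", "split-2": "f"}
results, attempts = run_job(tasks, flaky_word_count)
```

After the run, attempts shows two attempts for split-1 and one for every other task, so the failure cost one task's worth of repeated work rather than a restart of the whole job.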

MapReduce Design Properties

Performance

Loading data: Startup overhead for MapReduce

Reading data: Full scan over large data files

Merging results: A MapReduce as the next consumer
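
One reading of the "merging results" point is that when the consumer of a job's output is another MapReduce job, the partitioned output can be read as it is and no separate merge into a single file is needed. The sketch below chains two tiny in-memory jobs under that assumption. The mapreduce helper, its character-sum partitioning rule, and the sample lines are all hypothetical.

```python
from collections import defaultdict
from itertools import chain

def mapreduce(inputs, map_fn, reduce_fn, num_reducers=2):
    """Tiny in-memory MapReduce returning one output partition per reducer."""
    parts = [defaultdict(list) for _ in range(num_reducers)]
    for record in inputs:
        for key, value in map_fn(record):
            # Simple deterministic partitioning rule (an assumption).
            index = sum(map(ord, str(key))) % num_reducers
            parts[index][key].append(value)
    return [[(k, reduce_fn(k, vs)) for k, vs in sorted(p.items())]
            for p in parts]

# Hypothetical input lines, used only for illustration.
lines = ["a b a", "b c", "a d"]

# Job 1: word count, producing partitioned (word, count) records.
job1_parts = mapreduce(
    lines,
    lambda line: [(word, 1) for word in line.split()],
    lambda key, values: sum(values),
)

# Job 2: reads the partitions of job 1 one after another, with no merge
# step, and counts how many words occur n times.
job2_parts = mapreduce(
    chain.from_iterable(job1_parts),
    lambda record: [(record[1], 1)],
    lambda key, values: sum(values),
)

# Flattened into a dict only so the final result is easy to inspect.
histogram = dict(chain.from_iterable(job2_parts))
```

The second job treats each partition of the first job as ordinary input, which is the sense in which a MapReduce is "the next consumer".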

Jeffrey Dean and Sanjay Ghemawat, MapReduce: A Flexible Data Processing Tool, Communications of the ACM, Vol. 53, 2010

Cost

Hardware: Network workstations

Software: Open source (Hadoop)

Communication: Network system

Companies Using Hadoop

Facebook, Yahoo!, Google, Amazon, Twitter