readings: “mapreduce: simplified data processing on large ... · readings: “mapreduce:...

52
Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4

Upload: others

Post on 03-Aug-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4

Page 2: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

--

--

We’ll put lists on our doors (after class) and meet with you one by one to discuss grades, goals, … .

Page 3: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

•••

•••

Page 4: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

GFS Architecture: Client/Master/Chunkservers

4

Page 5: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

GFS Consistency Model (Metadata)

• Changes to namespace (i.e., metadata) are atomic• Done by single master server!• Master uses WAL to define global total order of

namespace-changing operations

5

Page 6: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

GFS Consistency Model (Data)

• Changes to data are ordered as chosen by a primary• But multiple writes from the same client may be

interleaved or overwritten by concurrent operations from other clients

• Record append completes at least once, at offset of GFS’s choosing• Applications must cope with possible duplicates

• Failures can cause inconsistency• E.g., different data across chunk servers (failed append)• Behavior is worse for writes than appends

6

Page 7: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

GFS Summary• Success: used actively by Google

• Availability and recoverability on cheap hardware• High throughput by decoupling control and data• Supports massive data sets and concurrent appends

• Semantics not transparent to apps• Must verify file contents to avoid inconsistent regions, repeated

appends (at-least-once semantics)• Performance not good for all apps

• Assumes read-once, write-once workload (no client caching!)

• Successor: Colossus • Eliminates master node as single point of failure• Storage efficiency: Reed-Solomon (1.5x) instead of Replicas (3x)• Reduces block size to be between 1~8 MB• Few details public ☹

7

GFS

BigTable Spanner

Maps GMail

Books

YT

Ads

Page 8: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Apache Hadoop DFS

8

Hmm… looks familiar

Page 9: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

GFS vs. HDFS

9

GFS HDFS Master NameNode chunkserver DataNode operation log journal, edit log chunk block random file writes possible only append is possible multiple writer, multiple reader model

single writer, multiple reader model

chunk: 32bit checksum over 64KB data pieces (1024 per chunk)

per HDFS block, two files created on a DataNode: data file & metadata file (checksums, timestamp)

default block size: 64MB default block size: 128MB

Page 10: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

•••

•••

Page 11: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

•••

••

Page 12: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

•••

••

Page 13: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

•••

Page 14: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Page 15: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

•–––

•••

Page 16: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••••

Page 17: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••••

•••

••

P1 P2 P3 P4 P5

Message Passing

Page 18: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

•••

••

P1 P2 P3 P4 P5

Page 19: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

•••

•••

Page 20: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

••

Page 21: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

•••

Page 22: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

•••

••••

Page 23: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

How do you find the frequency of words, such as , “440”, “error”, “rmi”, “p4” ?

Page 24: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

How do you count the number of mutual friendships for all pairs of people, e.g., "you

and Joe have 147 friends in common"

Page 25: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

•• …••

Page 26: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

•••

Page 27: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

Page 28: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Page 29: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

• ⟨ ⟩•

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

∑ ∑ ∑ ∑ ∑

Page 30: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

⟨⟩

∑ ∑ ∑ ∑ ∑

• ⟨ ⟩•

1) Mapping Phase

2) Shuffling / Sorting Phase

3) Reduce Phase

Page 31: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

1)2)3)

••

Page 32: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

••••

••

•••

Page 33: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Count the number of mutual friendships, e.g., "you and Joe have 147

friends in common"

How to do this in the MapReduce framework?

Page 34: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Page 35: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Page 36: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

HDFSHDFS

#tasks >> #processors

dynamic task assignment

Page 37: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

•••

Task Manager

Mapper Mapper Mapper• • •

••• ••• •••

Page 38: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

••• ⟨ ⟩•

hK

Page 39: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••••

••• ••• •••

• • •

Page 40: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

•••

• • •

Page 41: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

•••

••

••

•••

Page 42: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

•••

••

••••

••

Page 43: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Map

Reduce

Map

Reduce

Map

Reduce

Map

Reduce

Map/Reduce ••

••

•••

Page 44: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

•••

•••

Page 45: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

••

‒‒

Page 46: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

•••

•••

•••

••

Page 47: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Page 48: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

popular?

depends on popularity of her followers

What’s the algorithm?

Page 49: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Page 50: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Page 51: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University

Page 52: Readings: “MapReduce: Simplified Data Processing on Large ... · Readings: “MapReduce: Simplified Data Processing on Large Clusters” Sections 3,4 Daniel S. Berger 15-440 Fall

Daniel S. Berger 15-440 Fall 2018 Carnegie Mellon University