Distributed Data-Parallel Computing
• Cloud
  – Transparent scaling
  – Resource virtualization
• Commodity clusters
  – Fault tolerance with good performance
• Workloads beyond standard SQL, HPC
  – Data mining, graph analysis, …
  – Semi-structured/unstructured data
Execution layer
• This talk: system-level middleware
  – Yuan Yu will describe the DryadLINQ programming model on Saturday
• Algorithm -> execution plan by magic
Problem domain
• Large inputs
  – Tens of GB is a “small test dataset”
  – Single job up to hundreds of TB
  – Semi-structured data is common
• Not latency sensitive
  – Overhead of seconds for a trivial job
  – Large job could take days
  – Batch computation, not online queries
    • Simplifies fault tolerance, caching, etc.
Talk overview
• Some typical computations
• DAG implementation choices
• The Dryad execution engine
• Comparison with MapReduce
• Discussion
Map
• Independent transformation of dataset
  – for each x in S, output x’ = f(x)
• E.g. simple grep for word w
  – output line x only if x contains w
[Diagram: partitioned map — each partition S1, S2, S3 is transformed independently by f into S1', S2', S3']
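A minimal sketch of this map pattern in plain Python (not Dryad's API): each partition is filtered independently by the same function, here the grep example.

```python
# Sketch of a partitioned map: every partition Si is processed
# independently by the same function f, producing Si'.
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def grep(partition, word):
    """The f for the grep example: keep only lines containing `word`."""
    return [line for line in partition if word in line]

def parallel_map(partitions, f):
    """Apply f to each partition independently (no communication needed)."""
    with ProcessPoolExecutor() as pool:
        return list(pool.map(f, partitions))

if __name__ == "__main__":
    S = [["a dryad job", "hello"], ["map and reduce", "dryad again"]]
    print(parallel_map(S, partial(grep, word="dryad")))
    # [['a dryad job'], ['dryad again']]
```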
Reduce
• Grouping plus aggregation
  – 1) Group x in S according to key selector k(x)
  – 2) For each group g, output r(g)
• E.g. simple word count
  – group by k(x) = x
  – for each group g, output key (word) and count of g
[Diagram: reduce — dataset S is grouped (G) and each group reduced (r) to produce S']
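A minimal sketch of reduce as group-by-key plus per-group aggregation, instantiated as the word-count example (plain Python, not Dryad's API):

```python
# Reduce = group by key selector k, then apply aggregator r to each group.
from collections import defaultdict

def reduce_stage(S, k, r):
    """Group records of S by k(x), then apply r to each group."""
    groups = defaultdict(list)
    for x in S:
        groups[k(x)].append(x)
    return {key: r(g) for key, g in groups.items()}

# Word count: k(x) = x, r(g) = len(g)
words = "the quick fox the lazy fox the".split()
print(reduce_stage(words, k=lambda x: x, r=len))
# {'the': 3, 'quick': 1, 'fox': 2, 'lazy': 1}
```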
K-means
• Set of points P, initial set of cluster centres C
• Iterate until convergence:
  – For each c in C
    • Initialize count_c, centre_c to 0
  – For each p in P
    • Find c in C that minimizes dist(p,c)
    • Update: count_c += 1, centre_c += p
  – For each c in C
    • Replace c <- centre_c/count_c
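A direct, single-machine transcription of this pseudocode in Python (the convergence tolerance and iteration cap are illustrative choices, not from the slides):

```python
# K-means as described above: assign points to nearest centre,
# then replace each centre with the mean of its assigned points.
import math

def kmeans(P, C, tol=1e-6, max_iters=100):
    for _ in range(max_iters):
        count = [0] * len(C)
        centre = [[0.0] * len(C[0]) for _ in C]
        for p in P:
            j = min(range(len(C)), key=lambda i: math.dist(p, C[i]))
            count[j] += 1
            centre[j] = [a + b for a, b in zip(centre[j], p)]
        new_C = [
            [a / count[j] for a in centre[j]] if count[j] else C[j]
            for j in range(len(C))
        ]
        if max(math.dist(a, b) for a, b in zip(C, new_C)) < tol:
            return new_C
        C = new_C
    return C

print(kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], [(0, 0), (10, 10)]))
# [[0.0, 0.5], [10.0, 10.5]]
```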
Graph algorithms
• Set N of nodes with data (n,x)
• Set E of directed edges (n,m)
• Iterate until convergence:
  – For each node (n,x) in N
    • For each outgoing edge n->m in E, n_m = f(x,n,m)
  – For each node (m,x) in N
    • Find set of incoming updates i_m = {n_m : n->m in E}
    • Replace (m,x) <- (m, r(i_m))
• E.g. power iteration (PageRank)
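A single-machine Python sketch of this vertex-update pattern, instantiated as PageRank power iteration (the damping factor and fixed iteration count are illustrative, not from the slides):

```python
# f(x,n,m): each node sends rank/out-degree along its outgoing edges.
# r(i_m):   each node combines incoming updates with the damping term.
def pagerank(nodes, edges, d=0.85, iters=20):
    """nodes: list of node ids; edges: list of directed (n, m) pairs."""
    out_deg = {n: 0 for n in nodes}
    for n, m in edges:
        out_deg[n] += 1
    x = {n: 1.0 / len(nodes) for n in nodes}          # initial rank
    for _ in range(iters):
        updates = {n: [] for n in nodes}
        for n, m in edges:                            # scatter along edges
            updates[m].append(x[n] / out_deg[n])
        x = {m: (1 - d) / len(nodes) + d * sum(im)    # gather and combine
             for m, im in updates.items()}
    return x

print(pagerank(["a", "b", "c"], [("a", "b"), ("b", "c"), ("c", "a")]))
# each rank converges to ~1/3 on this 3-cycle
```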
PageRank
[Figure: PageRank as a Dryad DAG — rank partitions N0(1..3) combined with edge partitions E1..E3 through per-iteration join, aggregation, and distribute (D) vertices, producing N1, N2, N3 over successive iterations]
DAG abstraction
• Absence of cycles
  – Allows re-execution for fault tolerance
  – Simplifies scheduling: no deadlock
• Cycles can often be replaced by unrolling
  – Unsuitable for fine-grain inner loops
• Very popular
  – Databases, functional languages, …
Rewrite graph at runtime
• Loop unrolling with convergence tests
• Adapt partitioning scheme at run time
  – Choose #partitions based on runtime data volume
  – Broadcast join vs. hash join, etc. (see the sketch after this list)
• Adaptive aggregation and distribution trees
  – Based on data skew and network topology
• Load balancing
  – Data/processing skew (cf. work-stealing)
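A toy illustration of one such runtime decision, with made-up size thresholds and plan names (not Dryad's actual policy): broadcast the smaller join input when it is small enough to ship everywhere, otherwise hash-partition both sides and scale the partition count with the observed data volume.

```python
# Toy runtime plan decision; thresholds and plan names are illustrative.
def choose_join_plan(left_bytes, right_bytes,
                     broadcast_limit=64 << 20,           # 64 MB, assumed
                     target_partition_bytes=256 << 20):  # 256 MB, assumed
    """Pick a join strategy and partition count from observed input sizes."""
    big, small = max(left_bytes, right_bytes), min(left_bytes, right_bytes)
    if small <= broadcast_limit:
        # Ship the small input to every partition of the big one.
        partitions = max(1, -(-big // target_partition_bytes))  # ceil division
        return "broadcast_join", partitions
    # Otherwise hash-partition both inputs on the join key.
    total = left_bytes + right_bytes
    return "hash_join", max(1, -(-total // target_partition_bytes))

print(choose_join_plan(10 << 20, 2 << 40))   # broadcast the 10 MB side
print(choose_join_plan(1 << 40, 2 << 40))    # hash join, many partitions
```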
Push vs Pull
• Databases typically ‘pull’ using iterator model
  – Avoids buffering
  – Can prevent unnecessary computation
• But DAG must be fully materialized
  – Complicates rewriting
  – Prevents resource virtualization in shared cluster
[Diagram: pull-based plan — inputs S1, S2 flow through distribute (D) and group (G) vertices into reduce (r) vertices producing S1', S2']
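A minimal sketch of the iterator ('pull') model using Python generators: each operator pulls records from its input on demand, so nothing is buffered and work whose output is never consumed is never done.

```python
# Pull-based pipeline: the downstream operator drives the computation by
# pulling one record at a time, so no intermediate buffering is needed.
def scan(lines):
    for line in lines:
        yield line

def select(source, predicate):
    for rec in source:
        if predicate(rec):
            yield rec

def take(source, n):
    for _, rec in zip(range(n), source):
        yield rec

lines = ["dryad", "hadoop", "dryad dag", "mapreduce"]
plan = take(select(scan(lines), lambda r: "dryad" in r), 1)
print(list(plan))   # ['dryad'] -- later input lines are never scanned
```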
Fault tolerance
• Buffer data in (some) edges
• Re-execute on failure using buffered data
• Speculatively re-execute for stragglers
• ‘Push’ model makes this very simple
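A toy sketch of these two mechanisms (the helper names, retry count, and straggler timeout are assumptions, not Dryad's implementation): deterministic re-execution from buffered inputs, plus a speculative duplicate when the first attempt looks slow.

```python
# Re-run a vertex from its buffered inputs on failure, and launch a
# speculative duplicate if the first attempt looks like a straggler.
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_with_retries(vertex_fn, buffered_inputs, retries=3):
    """Re-execute deterministically from buffered edge data on failure."""
    for attempt in range(retries):
        try:
            return vertex_fn(buffered_inputs)
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the last attempt

def run_speculatively(vertex_fn, buffered_inputs, straggler_timeout=30.0):
    """Start a duplicate if the first copy hasn't finished in time; take
    whichever finishes first (safe because inputs are immutable).
    Note: this toy pool still waits for the slower copy on shutdown."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(run_with_retries, vertex_fn, buffered_inputs)
        done, _ = wait([first], timeout=straggler_timeout)
        if done:
            return first.result()
        backup = pool.submit(run_with_retries, vertex_fn, buffered_inputs)
        done, _ = wait([first, backup], return_when=FIRST_COMPLETED)
        return done.pop().result()
```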
Dryad
• General-purpose execution engine
  – Batch processing on immutable datasets
  – Well-tested on large clusters
• Automatically handles
  – Fault tolerance
  – Distribution of code and intermediate data
  – Scheduling of work to resources
Dryad Job Model
• Directed acyclic graph (DAG)
• Clean abstraction
  – Hides cluster services
  – Clients manipulate graphs
• Flexible and expressive
  – General-purpose programs
  – Complicated execution plans
Dryad Inputs and Outputs
• Partitioned data set
  – Records do not cross partition boundaries
  – Data on compute machines: NTFS, SQLServer, …
• Optional semantics
  – Hash-partition, range-partition, sorted, etc.
• Loading external data
  – Partitioning “automatic”
  – File system chooses sensible partition sizes
  – Or known partitioning from user
Push vs Pull
• Channel types define connected components
  – Shared-memory or TCP edges must be gang-scheduled
• Pull within a gang, push between gangs
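A toy sketch of how channel types could partition a job graph into gangs (the channel labels and grouping function are assumptions, not Dryad's API): vertices joined by TCP or shared-memory edges form one connected component that must run concurrently, while file-channel edges separate independently schedulable gangs.

```python
# Group vertices into gangs: TCP/shared-memory edges tie vertices into one
# gang (must be scheduled together); file edges separate gangs.
def gangs(vertices, edges):
    """edges: list of (u, v, channel) with channel in {'tcp', 'shm', 'file'}."""
    parent = {v: v for v in vertices}

    def find(v):                      # union-find with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v, channel in edges:
        if channel in ("tcp", "shm"):  # gang-scheduled together
            parent[find(u)] = find(v)

    groups = {}
    for v in vertices:
        groups.setdefault(find(v), []).append(v)
    return list(groups.values())

print(gangs(["A", "B", "C", "D"],
            [("A", "B", "tcp"), ("B", "C", "file"), ("C", "D", "shm")]))
# [['A', 'B'], ['C', 'D']]
```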
MapReduce (Hadoop)
• MapReduce restricts
  – Topology of DAG
  – Semantics of function in compute vertex
• Sequence of instances for non-trivial tasks
[Diagram: a MapReduce job as a DAG — map vertices (f) over inputs S1, S2, S3, distribute (D), group (G), initial reduce (ir), and final group/reduce (G, r) vertices producing outputs S1', S2']
MapReduce complexity
• Simple to describe MapReduce system
• Can be hard to map algorithm to framework
  – cf. k-means: combine C+P, broadcast C, iterate, …
  – HIVE, Pig Latin, etc. mitigate programming issues
• Implementation not uniform
  – Different fault tolerance for mappers, reducers
  – Add more special cases for performance
    • Hadoop introducing TCP channels, pipelines, …
  – Dryad has the same state machine everywhere
Discussion
• DAG abstraction supports many computations
  – Can be targeted by high-level languages!
  – Run-time rewriting extends applicability
• DAG-structured jobs scale to large clusters
  – Over 10k computers in large Dryad clusters
  – Transient failures common, disk failures daily
• Trade off fault tolerance against performance
  – Buffer vs. TCP, still a manual choice in the Dryad system
  – Also external vs. in-memory working set
Conclusion
• Dryad well-tested, scalable
  – Daily use supporting Bing for over 3 years
• Applicable to a large number of computations
  – 250-computer cluster at MSR SVC, Mar-Nov 09
    • 47 distinct users (~50 lab members + interns)
    • 15k jobs (tens of millions of processes executed)
    • Hundreds of distinct programs
  – Network trace analysis, privacy-preserving inference, light-transport simulation, decision-tree training, deep belief network training, image feature extraction, …