mr share 11 sep 2010
DESCRIPTION
Large-scale data analysis lies at the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. MapReduce has been a popular framework in the context of cloud computing, designed to serve long running queries (jobs) which can be processed in batch mode. Taking into account that different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work, which can lead to reducing monetary charges incurred while utilizing the processing infrastructure. In this paper we propose a sharing framework tailored to MapReduce. Our framework, MRShare, transforms a batch of queries into a new batch that will be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on our cost model for MapReduce, we define an optimization problem and we provide a solution that derives the optimal grouping of queries. Experiments in our prototype, built on top of Hadoop, demonstrate the overall effectiveness of our approach and substantial savings.
TRANSCRIPT
![Page 1: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/1.jpg)
MRShare: Sharing Across Multiple Queries in MapReduce
Tomasz Nykiel (University of Toronto)
Michalis Potamias (Boston University)
Chaitanya Mishra (University of Toronto, currently Facebook)
George Kollios (Boston University)
Nick Koudas (University of Toronto)
1
![Page 2: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/2.jpg)
Data management landscape
[Figure: landscape plot with axes “efficiency” and “flexibility”; labels: σ, π]
• Time performance
• Arbitrary data
• Large scale setups
MRShare – sharing framework for MR
2
![Page 3: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/3.jpg)
MRShare – a sharing framework for Map Reduce
• MRShare framework:
– Inspired by sharing primitives from relational domain
– Introduces a cost model for Map Reduce jobs
– Searches for the optimal sharing strategies
– Does not change the Map Reduce computational model
3
![Page 4: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/4.jpg)
Outline
• Introduction
• Map Reduce recap.
• MRShare – Sharing primitives in Map-Reduce
• MRShare – Cost based approach to sharing
• MRShare Evaluation
• Summary
4
![Page 5: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/5.jpg)
Outline
• Map Reduce recap.
5
![Page 6: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/6.jpg)
Map Reduce recap.
[Figure: Map Reduce data flow; four inputs labeled I]
6
![Page 7: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/7.jpg)
Outline
• MRShare - Sharing primitives in Map-Reduce
7
![Page 8: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/8.jpg)
Sharing primitives – sharing scans
• SELECT COUNT(*) FROM user GROUP BY hometown
• SELECT AVG(age) FROM user GROUP BY hometown
Table user: User_id | Hometown | Occupation | Age
8
![Page 9: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/9.jpg)
MRShare – sharing scans (map).
9
![Page 10: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/10.jpg)
MRShare – sharing scans (reduce)
J1 J2 J3 J4 key value
Toronto 1
Toronto 1
Toronto 1
Toronto 17
Toronto 19
Toronto 2
Toronto 5
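The shared scan above can be sketched in plain Python, outside Hadoop. This is a minimal sketch, not the MRShare implementation: the sample rows, the function names (`merged_map`, `merged_reduce`, `run`), and the tag strings `J1` and `J2` are assumptions made for illustration. One pass over `user` feeds both queries; every map output value carries the tag of the job it belongs to, and the reducer dispatches on that tag.

```python
from collections import defaultdict

# Hypothetical rows of the user table: (user_id, hometown, occupation, age).
USERS = [
    (1, "Toronto", "student", 17),
    (2, "Toronto", "engineer", 19),
    (3, "Boston", "student", 24),
]

def merged_map(row):
    """One scan of a row produces the map output of both jobs, tagged per job."""
    _, hometown, _, age = row
    yield hometown, ("J1", 1)    # J1: COUNT(*) ... GROUP BY hometown
    yield hometown, ("J2", age)  # J2: AVG(age) ... GROUP BY hometown

def merged_reduce(key, tagged_values):
    """Route each tagged value to the reduce logic of the job it belongs to."""
    count, ages = 0, []
    for tag, value in tagged_values:
        if tag == "J1":
            count += value
        else:
            ages.append(value)
    return {"J1": count, "J2": sum(ages) / len(ages)}

def run(rows):
    groups = defaultdict(list)  # stands in for the shuffle/sort phase
    for row in rows:            # the input is read once for both jobs
        for key, tagged in merged_map(row):
            groups[key].append(tagged)
    return {key: merged_reduce(key, values) for key, values in groups.items()}

result = run(USERS)
```

The input is scanned once, but the intermediate data is still the sum of what the two jobs would emit on their own, which is why the sorting cost of a merged job can differ from that of its parts.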
10
![Page 11: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/11.jpg)
Outline
• MRShare - Sharing primitives in Map-Reduce
– Sharing scans
– Sharing intermediate data
11
![Page 12: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/12.jpg)
Sharing primitives - Sharing intermediate data.
• SELECT COUNT(*) FROM user WHERE occupation = 'student' GROUP BY hometown
• SELECT COUNT(*) FROM user WHERE age > 18 GROUP BY hometown
Table user: User_id | Hometown | Occupation | Age
Predicate checks: Occupation = 'student'? Age > 18?
12
![Page 13: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/13.jpg)
MRShare – sharing intermediate data (map).
13
![Page 14: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/14.jpg)
MRShare – sharing intermediate data (reduce).
J1 J2 J3 J4 key value
Toronto 1
Toronto 4
Toronto 1
Toronto 1
Toronto 2
Toronto 2
Toronto 5
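Sharing intermediate data can be illustrated the same way. The sketch below is an assumption-laden illustration, not the authors' code: the rows, the `PREDICATES` table, and the function names are hypothetical. Both queries emit the same key and value, so the map function evaluates both predicates and emits a single tuple tagged with every job that accepts the row, instead of one tuple per job.

```python
from collections import defaultdict

# Hypothetical rows of the user table: (user_id, hometown, occupation, age).
USERS = [
    (1, "Toronto", "student", 17),
    (2, "Toronto", "student", 19),
    (3, "Toronto", "engineer", 30),
    (4, "Boston", "retired", 70),
]

# Hypothetical job tags mapped to the WHERE predicate of each query.
PREDICATES = {
    "J1": lambda occupation, age: occupation == "student",
    "J2": lambda occupation, age: age > 18,
}

def shared_map(row):
    """Emit at most one tuple per row, tagged with all jobs that accept it."""
    _, hometown, occupation, age = row
    tags = frozenset(t for t, pred in PREDICATES.items() if pred(occupation, age))
    if tags:
        yield hometown, (tags, 1)

def shared_reduce(key, tagged_values):
    """Add each value to the count of every job named in its tag set."""
    counts = defaultdict(int)
    for tags, value in tagged_values:
        for tag in tags:
            counts[tag] += value
    return dict(counts)

def run(rows):
    groups, emitted = defaultdict(list), 0
    for row in rows:
        for key, tagged in shared_map(row):
            groups[key].append(tagged)
            emitted += 1
    return {k: shared_reduce(k, v) for k, v in groups.items()}, emitted

counts, emitted = run(USERS)
```

In this toy input the two jobs run separately would emit five tuples in total (two for the first query, three for the second), while the shared map emits four, because the row that satisfies both predicates is emitted once.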
14
![Page 15: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/15.jpg)
Outline
• MRShare – Cost based approach to sharing
– Cost model for finding the optimal sharing strategy
– SplitJobs – cost based algorithm for sharing scans
– MultiSplitJobs – an improvement of SplitJobs
– γ-MultiSplitJobs – the algorithm for sharing intermediate data
15
![Page 16: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/16.jpg)
Cost model for Map Reduce (single job)
• Reading – f(input size)
• Sorting – f(intermediate data size)
• Copying – f(intermediate data size)
• Writing – f(output size)
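Read as a formula, and assuming the four components are additive, the cost of a single job could be written as below. The symbols are introduced here for illustration only, since the slide names just the dependencies: $F$ is the job's input, $D$ its intermediate data, $O$ its output, and each $f$ is an unspecified cost function.

```latex
T(J) = f_{\mathrm{read}}(|F|) + f_{\mathrm{sort}}(|D|) + f_{\mathrm{copy}}(|D|) + f_{\mathrm{write}}(|O|)
```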
16
![Page 17: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/17.jpg)
Cost of executing a group of jobs
17
![Page 18: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/18.jpg)
Finding the optimal sharing strategy
• An optimization problem
[Figure labels: “NoShare”, “GreedyShare”]
18
![Page 19: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/19.jpg)
Outline
• MRShare – Cost based approach to sharing
– Cost model for finding the optimal sharing strategy
– SplitJobs – cost based algorithm for sharing scans
– MultiSplitJobs – an improvement of SplitJobs
– γ-MultiSplitJobs – the algorithm for sharing intermediate data
19
![Page 20: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/20.jpg)
Sharing scans - cost based optimization
• Savings come from a reduced number of scans
• The sorting cost might change
• The costs of copying and writing the output do not change
• We prove NP-hardness of the problem of finding the optimal sharing strategy
20
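The trade-off in these bullets can be made concrete with a toy cost function. Everything numeric here is an assumption: the unit constants, the sort buffer size, and the two-way-merge pass count are hypothetical stand-ins, not the cost model from the talk. The sketch only shows the shape of the argument: merging saves scans, copying and writing stay the same, and the sort may need more passes over the larger merged intermediate data.

```python
import math

# All constants are hypothetical; they are not measurements from the talk.
C_READ, C_SORT, C_COPY, C_WRITE = 1.0, 1.0, 1.0, 1.0
SORT_BUFFER = 8.0  # assumed amount of intermediate data sorted in one pass

def sort_passes(d):
    """Assumed number of passes of a two-way merge sort over d units of data."""
    if d <= SORT_BUFFER:
        return 1
    return 1 + math.ceil(math.log2(d / SORT_BUFFER))

def group_cost(input_size, jobs):
    """Cost of running jobs as one merged job; jobs = [(intermediate, output), ...]."""
    d = sum(inter for inter, _ in jobs)
    out = sum(o for _, o in jobs)
    return (C_READ * input_size            # the input is scanned once per group
            + C_SORT * d * sort_passes(d)  # the pass count may grow with d
            + C_COPY * d                   # unchanged by merging
            + C_WRITE * out)               # unchanged by merging

def no_share(input_size, jobs):
    """Every job runs on its own."""
    return sum(group_cost(input_size, [job]) for job in jobs)

def greedy_share(input_size, jobs):
    """All jobs are merged into a single group."""
    return group_cost(input_size, jobs)

small = [(2, 1)] * 4    # four jobs with little intermediate data
large = [(64, 1)] * 4   # four jobs with a lot of intermediate data
```

Under these assumed constants, `greedy_share(100, small)` is cheaper than `no_share(100, small)`, while for `large` the extra sort passes outweigh the saved scans, so neither extreme is always the better grouping.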
![Page 21: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/21.jpg)
SplitJobs – a DP solution for sharing scans.
• We reduce the problem of grouping to the problem of splitting a sorted list of jobs – by approximating the cost of sorting.
• Using our cost model and the approximation, we employ a DP algorithm to find the optimal split points.
22
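A dynamic program over split points of a sorted list can be sketched as follows. This is a generic illustration rather than the SplitJobs algorithm itself: the sort key (intermediate data size), the `group_cost` function, and its constants are assumptions, since the slide states neither the key nor the approximated cost. The recurrence is `best[i] = min over j < i of best[j] + cost(jobs[j:i])`.

```python
import math

def group_cost(group, input_size=100.0, buffer=8.0):
    """Hypothetical cost of one merged group, given its intermediate data sizes."""
    d = sum(group)
    passes = 1 if d <= buffer else 1 + math.ceil(math.log2(d / buffer))
    return input_size + d * passes  # one scan plus an approximated sort cost

def split_jobs(jobs, cost=group_cost):
    """Split the sorted job list into contiguous groups of minimal total cost."""
    jobs = sorted(jobs)              # assumed sort key: intermediate data size
    n = len(jobs)
    best = [0.0] + [math.inf] * n    # best[i]: minimal cost of the first i jobs
    cut = [0] * (n + 1)              # cut[i]: start index of the last group
    for i in range(1, n + 1):
        for j in range(i):
            candidate = best[j] + cost(jobs[j:i])
            if candidate < best[i]:
                best[i], cut[i] = candidate, j
    groups, i = [], n
    while i > 0:                     # walk the split points backwards
        groups.append(jobs[cut[i]:i])
        i = cut[i]
    return best[n], groups[::-1]

total, groups = split_jobs([2, 64, 2, 64, 2, 2])
```

The table has n + 1 entries and each entry inspects at most n split points, so the search evaluates a quadratic number of candidate groups instead of enumerating every partition of the jobs.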
![Page 22: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/22.jpg)
Outline
• MRShare – Cost based approach to sharing
– Cost model for finding the optimal sharing strategy
– SplitJobs – cost based algorithm for sharing scans
– MultiSplitJobs – an improvement of SplitJobs
– γ-MultiSplitJobs – the algorithm for sharing intermediate data
23
![Page 23: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/23.jpg)
MultiSplitJobs – an improvement of SplitJobs
24
![Page 24: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/24.jpg)
Outline
• MRShare – Cost based approach to sharing
– Cost model for finding the optimal sharing strategy
– SplitJobs – cost based algorithm for sharing scans
– MultiSplitJobs – an improvement of SplitJobs
– γ-MultiSplitJobs – the algorithm for sharing intermediate data
25
![Page 25: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/25.jpg)
Sharing intermediate data - cost based optimization
• The sorting and copying costs change – depending on the size of the intermediate data
We need to estimate the size of the intermediate data of all combinations of jobs.
26
![Page 26: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/26.jpg)
γ-MultiSplitJobs – the solution for sharing intermediate data
• Approximate the size of the intermediate data
• γ-MultiSplitJobs – applies MultiSplitJobs with a modified cost function
• γ set heuristically
27
![Page 27: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/27.jpg)
Outline
• MRShare Evaluation
28
![Page 28: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/28.jpg)
Evaluation setup
• 40 EC2 small instance virtual machines
• Modified Hadoop engine
• 30 GB text dataset consisting of blogs
• Multiple grep-wordcount queries
– Counts words matching a regular expression
– Allows for variable intermediate data sizes
– Generic aggregation Map Reduce job
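For readers unfamiliar with the workload, a grep-wordcount query can be sketched in a few lines of plain Python. The function names, the sample lines, and the choice of whole-word matching via `fullmatch` are assumptions for illustration; the slides do not specify how words are tokenized or matched.

```python
import re
from collections import Counter

def grep_wordcount_map(line, pattern):
    """Emit (word, 1) for every word in the line that matches the pattern."""
    for word in line.split():
        if pattern.fullmatch(word):
            yield word, 1

def grep_wordcount_reduce(pairs):
    """Sum the emitted ones per word."""
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

def grep_wordcount(lines, regex):
    pattern = re.compile(regex)
    pairs = (pair for line in lines for pair in grep_wordcount_map(line, pattern))
    return grep_wordcount_reduce(pairs)

# Hypothetical stand-in for the blog dataset.
BLOG_LINES = ["map reduce shares the map scan", "many jobs share a scan"]
counts = grep_wordcount(BLOG_LINES, r"s.*")
```

The regular expression controls how many words pass the map phase, which is how a single query template can produce small or large intermediate data relative to the input.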
29
![Page 29: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/29.jpg)
Evaluation goals
• Sharing is not always beneficial.
– ‘GreedyShare’ policy
• How much can we save on sharing scans?
– MRShare - MultiSplitJobs evaluation
• How much can we save on sharing intermediate data?
– MRShare - γ-MultiSplitJobs evaluation
30
![Page 30: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/30.jpg)
Is sharing always beneficial? – ‘GreedyShare’ policy
Group of jobs   Group size   d = |intermediate data| / |input data|
H1              16           0.3 < d < 0.7
H2              16           0.7 < d
H3              16           0.9 < d
31
![Page 31: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/31.jpg)
How much do we save on sharing scans? – MRShare MultiSplitJobs
Group of jobs   Group size   d = |intermediate data| / |input data|
G1              16           0.7 < d
G2              16           0.2 < d < 0.7
G3              16           0.0 < d < 0.2
G4              16           0.0 < d < max
G5              64           0.0 < d < max
32
![Page 32: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/32.jpg)
How much do we save on sharing intermediate data? – MRShare γ-MultiSplitJobs
Group of jobs   Group size   d = |intermediate data| / |input data|
G1              16           0.7 < d
G2              16           0.2 < d < 0.7
G3              16           0.0 < d < 0.2
33
![Page 33: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/33.jpg)
Summary
• We introduced MRShare – a framework for automatic work sharing in Map Reduce.
• We identified sharing primitives and demonstrated the implementation thereof in a Map Reduce engine.
• We established a cost model and solved several work sharing optimization problems.
• We demonstrated vast savings when using MRShare.
34
![Page 34: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/34.jpg)
Thank you!!!
Questions?
35
![Page 35: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/35.jpg)
Ongoing work – sharing expensive computation
• Sharing across multiple Map Reduce jobs with expensive predicates.
36
![Page 36: Mr Share 11 Sep 2010](https://reader036.vdocuments.mx/reader036/viewer/2022070323/559e02a31a28ab1e6a8b46c7/html5/thumbnails/36.jpg)
Ongoing work – dynamic sharing
• Dynamic sharing.
[Figure: plot with axes “time” and “progress”]
37