Efficient Processing of Rank-Aware Queries in Map/Reduce

Oikonomakis Spyridon, Software Engineer at PeoplePerHour

Upload: spiros-oikonomakis

Post on 02-Jul-2015


DESCRIPTION

Through an experimental evaluation of three different algorithms, this work shows the disadvantages of the default operation of the Map/Reduce programming model for Top-K queries and presents a recommended solution for processing such queries efficiently. It addresses two of the major shortcomings: the lack of Early Termination and the lack of Load Balancing. The solution has been implemented in code.

TRANSCRIPT

Page 1: Efficient processing of Rank-aware queries in Map/Reduce

EFFICIENT PROCESSING OF RANK-AWARE QUERIES IN MAP/REDUCE

OIKONOMAKIS SPYRIDON

SOFTWARE ENGINEER AT PEOPLEPERHOUR

Page 2: Efficient processing of Rank-aware queries in Map/Reduce

Need for a new model

Exponential data growth

Need for analysis, utilization and scalability of more and more data

Need for parallel processing

Need to reduce data read and retrieval time

Need for programmer convenience

Cost

Page 3: Efficient processing of Rank-aware queries in Map/Reduce

What is Map/Reduce?

A programming model and runtime environment for distributed data processing that runs in parallel on large clusters of machines
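As an illustration of the model described on this slide, here is a minimal single-process sketch of the map, shuffle, and reduce phases. It is not Hadoop code; the function names (`run_job`, `word_mapper`, `sum_reducer`) are hypothetical and chosen only for this example, which counts words.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Simulate one Map/Reduce job in memory (illustrative only)."""
    # Map phase: every input record yields zero or more (key, value) pairs.
    pairs = (pair for record in records for pair in mapper(record))
    # Shuffle phase: group all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: each key and its list of values is reduced independently,
    # which is what lets a real cluster run many reducers in parallel.
    return {key: reducer(key, values) for key, values in groups.items()}

def word_mapper(line):
    for word in line.split():
        yield word, 1

def sum_reducer(key, values):
    return sum(values)

counts = run_job(["a b a", "b c"], word_mapper, sum_reducer)
```

In a real deployment the map and reduce calls run on different machines and the shuffle moves data over the network; the sketch only shows the data flow.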

Page 4: Efficient processing of Rank-aware queries in Map/Reduce

Is the Map/Reduce model reliable?

Page 5: Efficient processing of Rank-aware queries in Map/Reduce

Map/Reduce

Page 6: Efficient processing of Rank-aware queries in Map/Reduce

Weaknesses in Top-K Join Queries

What is a Top-K Join?

Weaknesses

All the data must be read to retrieve only K results

Uneven distribution of workload across Reducers
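The slide does not spell out the definition. Assuming the usual one (join two tables on a common attribute, score every joined pair, and keep only the K best), a naive version can be sketched as follows. The scoring function (sum of the two record scores) and the record layout `(join_key, score)` are assumptions made for this example.

```python
import heapq
from collections import defaultdict

def naive_top_k_join(table_a, table_b, k):
    """Return the k highest-scoring join results as (score, join_key) tuples."""
    # Index table B by join key so matching records can be looked up.
    scores_b = defaultdict(list)
    for key, score in table_b:
        scores_b[key].append(score)
    # Every record of both tables is read and every joined pair is scored,
    # even though only k results are kept: the first weakness listed above.
    results = (
        (score_a + score_b, key)
        for key, score_a in table_a
        for score_b in scores_b.get(key, ())
    )
    return heapq.nlargest(k, results)

table_a = [("x", 95), ("y", 50), ("x", 20)]
table_b = [("x", 90), ("y", 85), ("z", 99)]
top_2 = naive_top_k_join(table_a, table_b, 2)
```

The second weakness shows up when one join key is much more frequent than the others: all of its pairs land on the same Reducer.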

Page 7: Efficient processing of Rank-aware queries in Map/Reduce

Goals of the experiment

Efficient implementation of Top-K Join queries in the Map/Reduce model

Addressing the weaknesses of Map/Reduce with:

Early Termination

Load Balancing

Page 8: Efficient processing of Rank-aware queries in Map/Reduce

Design

Comparison of three algorithms (1 default and 2 new):

Naive

EarlyTermination (using bounds)

EarlyTermination & LoadBalancing (using bounds and Longest Processing Time)

Pre-processing:

Generation of two data tables with join attributes

Statistics for the data in the form of histograms

Processing:

Calculating bounds from the histograms of each table

Run Map/Reduce
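The slides do not give the exact bound computation, so the following is only one plausible sketch of how histograms could yield a bound. It assumes records of the form `(join_key, score)`, a join score equal to the sum of the two record scores, scores in the range 0 to `max_score` in both tables, and equi-width bins that count records per join key. All function names are hypothetical.

```python
from collections import Counter
from itertools import product

def build_histogram(records, num_bins, max_score):
    """Equi-width score histogram; each bin counts records per join key."""
    width = max_score / num_bins
    bins = [Counter() for _ in range(num_bins)]
    for key, score in records:
        idx = min(int(score / width), num_bins - 1)  # clamp score == max_score
        bins[idx][key] += 1
    return bins

def score_threshold(hist_a, hist_b, k, max_score):
    """Lower bound on the k-th best join score, derived from the histograms."""
    num_bins = len(hist_a)
    width = max_score / num_bins
    # Visit bin pairs from the highest guaranteed combined score downwards.
    pairs = sorted(product(range(num_bins), repeat=2),
                   key=lambda p: p[0] + p[1], reverse=True)
    joined = 0
    for i, j in pairs:
        # Number of join results produced by bin i of A and bin j of B.
        joined += sum(n * hist_b[j][key] for key, n in hist_a[i].items())
        if joined >= k:
            # Every result counted so far scores at least (i + j) * width.
            return (i + j) * width
    return 0.0  # fewer than k join results exist, so nothing can be pruned

table_a = [("x", 95), ("y", 50), ("x", 20)]
table_b = [("x", 90), ("y", 85), ("z", 99)]
hist_a = build_histogram(table_a, num_bins=10, max_score=100)
hist_b = build_histogram(table_b, num_bins=10, max_score=100)
threshold = score_threshold(hist_a, hist_b, k=2, max_score=100)
```

Under these assumptions, any join result scoring below `threshold` cannot be among the top K, which is what makes skipping part of the input safe.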

Page 9: Efficient processing of Rank-aware queries in Map/Reduce

Design (2)

Page 10: Efficient processing of Rank-aware queries in Map/Reduce

Early Termination

[Diagram: generated sorted data and histograms are stored in HDFS; EarlyTermInputFormat with EarlyTermRecordReader (check bounds) sends data to the Mapper, which sends data to the Reducers (process)]
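The "check bounds" step of the record reader can be sketched as follows. This is a plain Python stand-in, not the Hadoop `RecordReader` API and not the author's EarlyTermRecordReader; it assumes the input is sorted by score from highest to lowest and that a per-table bound has already been computed. The function name and the example numbers are hypothetical.

```python
def early_term_reader(sorted_records, bound):
    """Yield (join_key, score) records until the score drops below `bound`."""
    for key, score in sorted_records:
        if score < bound:
            # The input is sorted by descending score, so no later record
            # can reach the bound either: stop reading here.
            break
        yield key, score

# Hypothetical numbers: with a join-score threshold of 130 and a maximum
# score of 99 in the other table, a record needs at least 130 - 99 = 31.
table_a_sorted = [("x", 95), ("y", 50), ("x", 20), ("y", 5)]
passed_to_mapper = list(early_term_reader(table_a_sorted, bound=130 - 99))
```

Because the reader stops at the first record below the bound, the rest of the file is never read or sent to the Mapper.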

Page 11: Efficient processing of Rank-aware queries in Map/Reduce

Early Termination & Load Balancing

[Diagram: generated sorted data and histograms are stored in HDFS; EarlyTermInputFormat with EarlyTermRecordReader (check bounds) sends data to the Mapper, which sends data through the CustomPartitioner to three Reducers]
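The Design slide names Longest Processing Time (LPT) as the load-balancing rule. A minimal sketch of LPT is shown below: work items are sorted by decreasing estimated cost and each one is given to the currently least-loaded reducer. How the cost per join key is estimated is not stated in the slides; the costs here are made-up numbers, and `lpt_assign` is a hypothetical name rather than the author's CustomPartitioner.

```python
import heapq

def lpt_assign(costs, num_reducers):
    """Map each join key to a reducer id using the LPT rule."""
    heap = [(0, reducer) for reducer in range(num_reducers)]  # (load, id)
    heapq.heapify(heap)
    assignment = {}
    # Largest estimated cost first.
    for key, cost in sorted(costs.items(), key=lambda kv: kv[1], reverse=True):
        load, reducer = heapq.heappop(heap)  # least-loaded reducer so far
        assignment[key] = reducer
        heapq.heappush(heap, (load + cost, reducer))
    return assignment

estimated_costs = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}
assignment = lpt_assign(estimated_costs, num_reducers=2)

# Resulting load per reducer, for inspection.
loads = [0, 0]
for key, reducer in assignment.items():
    loads[reducer] += estimated_costs[key]
```

A partitioner could then look up `assignment[key]` instead of hashing the key, so that heavy keys are spread across Reducers.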

Page 12: Efficient processing of Rank-aware queries in Map/Reduce

Experiment (1)

Parameter: Value

Data distribution: Zipfian

Number of records: 1,000,000 per table

Number of reducers: 10, 6

Number of results (K): 10

Data skew: 0, 0.5, 1

Number of joining attributes: 10

Max value for data: 10000

Sorting: by score

Histograms: 10 bins

Cluster: 8 machines
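A test table with these parameters could be generated roughly as follows. The slides do not say which quantity follows the Zipfian distribution; this sketch assumes it is the join attribute value, that there are 10 distinct join values, and that scores are uniform integers up to the maximum value. The function names are hypothetical.

```python
import random

def zipf_weights(num_values, skew):
    """Probability of each rank under a Zipfian law; skew 0 is uniform."""
    raw = [1.0 / (rank ** skew) for rank in range(1, num_values + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def generate_table(num_records, num_keys, skew, max_value, seed=0):
    """Generate (join_key, score) records, sorted by score, highest first."""
    rng = random.Random(seed)  # fixed seed keeps the data reproducible
    weights = zipf_weights(num_keys, skew)
    keys = rng.choices(range(num_keys), weights=weights, k=num_records)
    table = [(key, rng.randint(0, max_value)) for key in keys]
    table.sort(key=lambda record: record[1], reverse=True)
    return table

# A small table for illustration; the experiment used 1,000,000 records.
sample = generate_table(num_records=1000, num_keys=10, skew=1.0, max_value=10000)
```

With skew 0 every join value is equally likely; as the skew grows towards 1, the first few join values dominate, which is what stresses the Reducers unevenly.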

Page 13: Efficient processing of Rank-aware queries in Map/Reduce

Experiment Part – Comparison of algorithms (2)

[Chart: running time (h:mm:ss, axis from 0:00:00 to 0:50:24) versus skew (0, 0.5, 1) for Naive, Early Termination, and Early Termination & Load Balancing; Reducers = 10]

Page 14: Efficient processing of Rank-aware queries in Map/Reduce

Experiment Part – Comparison of algorithms (3)

[Chart: number of records (axis from 0 to 2,500,000) versus skew (0, 0.5, 1) for Naive, Early Termination, and Early Termination & Load Balancing; Reducers = 10]

Page 15: Efficient processing of Rank-aware queries in Map/Reduce

Experiment Part – Comparison of algorithms (4)

[Chart: running time (h:mm:ss, axis from 0:00:00 to 0:17:17) versus number of Reducers (6, 10) for Early Termination and Early Termination & Load Balancing; slide label: REDUCERS = 6]

Page 16: Efficient processing of Rank-aware queries in Map/Reduce

Conclusion

By using the proposed techniques:

Early Termination

Load Balancing

it is possible to implement rank-aware (Top-K) queries efficiently in Map/Reduce and to overcome the disadvantages of the Map/Reduce model

Page 17: Efficient processing of Rank-aware queries in Map/Reduce

Questions?

Thank you.