EFFICIENT PROCESSING OF RANK-AWARE QUERIES IN MAP/REDUCE

OIKONOMAKIS SPYRIDON

SOFTWARE ENGINEER AT PEOPLEPERHOUR

Need for a new model

Exponential data growth

Need to analyze and use ever-growing volumes of data at scale

Need for parallel processing

Need to reduce data read and retrieval time

Need for programmer convenience

Cost

What is the Map/Reduce?

A distributed data processing programming model and runtime environment that runs in parallel on large clusters of machines
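As a rough illustration of the programming model, the following Python sketch imitates the map, shuffle, and reduce phases in a single process. The names `map_reduce`, `word_mapper`, and `count_reducer` are hypothetical and chosen for this example only; a real Map/Reduce runtime distributes these phases across many machines.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Single-process imitation of the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    # Map phase: each record yields zero or more (key, value) pairs.
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: values are grouped by key.
            groups[key].append(value)
    # Reduce phase: each key and its list of values produce one result.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word.lower(), 1

def count_reducer(key, values):
    # Sum the partial counts of one word.
    return sum(values)

lines = ["map reduce map", "reduce map"]
counts = map_reduce(lines, word_mapper, count_reducer)
```

The programmer supplies only the mapper and the reducer; grouping by key, and in a real system distribution and fault tolerance, are left to the runtime.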

Is the Map/Reduce model reliable?

Map/Reduce

Weaknesses in Top-K Join Queries

What is the Top-K Join?

Weaknesses

All the data must be read to retrieve only K results

Uneven distribution of workload across Reducers
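To make the first weakness concrete, here is a minimal Python sketch of a naive Top-K join. It assumes that each tuple is a `(join_key, score)` pair and that the score of a join result is the sum of the two tuple scores; the slides do not state the scoring function, so that choice and the name `naive_top_k_join` are illustrative assumptions.

```python
import heapq
from collections import defaultdict

def naive_top_k_join(left, right, k):
    """Join every matching pair, then keep the k results with the highest total score."""
    # Index the right table by join key.
    by_key = defaultdict(list)
    for key, score in right:
        by_key[key].append(score)
    # Build the complete join result: every tuple of both tables is read.
    joined = []
    for key, left_score in left:
        for right_score in by_key.get(key, []):
            joined.append((left_score + right_score, key, left_score, right_score))
    # Only now are the k best results selected.
    return heapq.nlargest(k, joined)

left = [("a", 9), ("b", 7), ("a", 2), ("c", 1)]
right = [("a", 8), ("b", 6), ("c", 5), ("a", 1)]
top2 = naive_top_k_join(left, right, 2)
```

In this example six join results are materialized to return two of them, which is the cost that early termination tries to avoid.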

Goals of the experiment

Implement Top-K Join queries efficiently in the Map/Reduce model

Address the weaknesses of Map/Reduce with:

Early Termination

Load Balancing

Design

Comparison of three algorithms (1 default and 2 new):

Naive

Early Termination (using bounds)

Early Termination & Load Balancing (using bounds and Longest Processing Time)

Pre-processing:

Generation of two data tables with join attributes

Statistics for the data in the form of histograms

Processing:

Calculation of bounds from the histograms of each table

Run Map/Reduce
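One way the bounds step could work is sketched below in Python. It is an assumption-laden sketch, not the algorithm used in the experiment. It assumes that each histogram bin is a `(low_score, high_score, {join_key: count})` triple, that a join result scores the sum of its two tuple scores, and that each table is sorted by descending score. The names `score_threshold`, `table_bounds`, and `read_until_bound` are hypothetical.

```python
from itertools import takewhile

def score_threshold(hist_a, hist_b, k):
    """Largest T such that at least k join results are guaranteed to score >= T."""
    pairs = []
    for low_a, _, keys_a in hist_a:
        for low_b, _, keys_b in hist_b:
            # Number of join results produced by this pair of bins.
            matches = sum(n * keys_b.get(key, 0) for key, n in keys_a.items())
            if matches:
                # Every such result scores at least low_a + low_b.
                pairs.append((low_a + low_b, matches))
    pairs.sort(reverse=True)
    guaranteed = 0
    for bound, matches in pairs:
        guaranteed += matches
        if guaranteed >= k:
            return bound
    return None  # fewer than k join results exist

def table_bounds(hist_a, hist_b, k):
    """Per-table score bounds below which a tuple cannot reach the top k."""
    t = score_threshold(hist_a, hist_b, k)
    if t is None:
        return float("-inf"), float("-inf")
    max_a = max(high for _, high, _ in hist_a)
    max_b = max(high for _, high, _ in hist_b)
    # A tuple of A needs score >= T - max_b to reach T with any tuple of B.
    return t - max_b, t - max_a

def read_until_bound(sorted_rows, bound):
    """Stop reading (join_key, score) rows, sorted by descending score, at the bound."""
    return list(takewhile(lambda row: row[1] >= bound, sorted_rows))

hist_a = [(75, 100, {"x": 2, "y": 1}), (50, 75, {"x": 1}),
          (25, 50, {"y": 1}), (0, 25, {"x": 1})]
hist_b = [(75, 100, {"x": 1, "y": 2}), (50, 75, {"y": 2}),
          (25, 50, {"x": 2}), (0, 25, {"y": 1})]
bound_a, bound_b = table_bounds(hist_a, hist_b, 3)

rows_a = [("x", 98), ("y", 80), ("x", 77), ("x", 60), ("y", 40), ("x", 10)]
kept_a = read_until_bound(rows_a, bound_a)
```

Here the two top bins alone guarantee four join results with a score of at least 150, so for K = 3 no tuple scoring below 50 in either table needs to be read.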

Design (2)

Early Termination

[Diagram: HDFS (generated sorted data, histograms) -> EarlyTermInputFormat with EarlyTermRecordReader (check bounds) -> send data -> Mapper -> send data -> Reducers (process)]

Early Termination & Load Balancing

[Diagram: HDFS (generated sorted data, histograms) -> EarlyTermInputFormat with EarlyTermRecordReader (check bounds) -> send data -> Mapper -> send data -> CustomPartitioner -> Reducers]
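The Longest Processing Time (LPT) rule behind the load balancing step can be sketched in Python as follows. The sketch assumes that an estimated load per join key is available, for example derived from the histograms; the slides do not specify how the load is estimated, so that input and the name `lpt_assign` are hypothetical.

```python
import heapq

def lpt_assign(estimated_load, num_reducers):
    """Assign join keys to reducers greedily, heaviest key first (LPT rule)."""
    # Min-heap of (total load so far, reducer index).
    heap = [(0, reducer) for reducer in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {}
    # Process keys by descending load; ties are broken by key for determinism.
    for key, load in sorted(estimated_load.items(), key=lambda kv: (-kv[1], kv[0])):
        total, reducer = heapq.heappop(heap)  # currently least-loaded reducer
        assignment[key] = reducer
        heapq.heappush(heap, (total + load, reducer))
    return assignment

loads = {"a": 9, "b": 7, "c": 6, "d": 5, "e": 3}
assignment = lpt_assign(loads, 2)
```

In this example reducer 0 receives keys `a` and `d` (load 14) and reducer 1 receives `b`, `c`, and `e` (load 16). A custom partitioner could consult such a precomputed table instead of hashing the key.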

Experiment (1)

Parameters Values

Data Distribution: Zipfian

Number of records: 1,000,000 per table

Number of reducers: 10, 6

Number of K results: 10

Data skew: 0, 0.5, 1

Number of Joining Attributes: 10

Max value for data: 10000

Sorting: By score

Histograms: 10 bins

Cluster: 8 machines
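A data set with these parameters could be generated along the lines of the Python sketch below. It assumes that the skew parameter controls a Zipfian distribution over the join attribute values, that "max value" limits the score, and that scores are drawn uniformly; none of this is stated on the slides, and `zipf_weights` and `generate_table` are hypothetical names.

```python
import random

def zipf_weights(n, skew):
    """Unnormalized Zipfian weights for ranks 1..n; skew 0 gives a uniform distribution."""
    return [1.0 / (rank ** skew) for rank in range(1, n + 1)]

def generate_table(num_rows, num_join_values, max_score, skew, seed):
    """Return (join_key, score) rows sorted by descending score."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    keys = rng.choices(range(num_join_values),
                       weights=zipf_weights(num_join_values, skew),
                       k=num_rows)
    rows = [(key, rng.randint(0, max_score)) for key in keys]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows

# A small table; the experiment uses 1,000,000 rows per table.
table = generate_table(num_rows=1000, num_join_values=10,
                       max_score=10000, skew=1.0, seed=42)
```

With a skew of 0 every join value is equally likely; higher skew concentrates rows on a few join values, which is what makes the reducer workload uneven.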

Experiment (2): Comparison of algorithms

[Chart, Reducers = 10: running time (axis 0:00:00 to 0:50:24) vs. skew (0, 0.5, 1) for Naive, Early Termination, and Early Termination & Load Balancing]

[Chart, Reducers = 10: number of records (axis 0 to 2,500,000) vs. skew (0, 0.5, 1) for Naive, Early Termination, and Early Termination & Load Balancing]

[Chart, labeled Reducers = 6: running time (axis 0:00:00 to 0:17:17) vs. number of reducers (6, 10) for Early Termination and Early Termination & Load Balancing]

Conclusion

By using the proposed techniques:

Early Termination

Load Balancing

it is possible to implement rank-aware (Top-K) queries in Map/Reduce efficiently and to address the weaknesses of the Map/Reduce model

Questions

????

Thank you.
