Post on 21-Oct-2014

Page 1: Join optimization in hive

Join Optimization in Hive

Liyin Tang

Page 2: Join optimization in hive

Outline

• Map Join Optimization
  – Previous Common Join and Map Join
  – Optimized Map Join
  – JDBM
  – Performance Evaluation

• Convert Join to Map Join Automatically
  – How it works
  – Performance Evaluation

Page 3: Join optimization in hive

Common Join

[Diagram: Task A feeds a Common Join Task. Inside the Common Join Task, mappers read Table X and Table Y, their output goes through the shuffle, and a reducer performs the join.]
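The common join flow in the diagram above can be sketched in a few lines of Python. This is a minimal, single-process illustration of a reduce-side join, not Hive's implementation; the function name `common_join` and the row layout `(key, value)` are assumptions made for the example.

```python
from collections import defaultdict

def common_join(table_x, table_y):
    """Reduce-side join of two lists of (key, value) rows."""
    # Map phase: tag each row with the table it came from.
    mapped = [(k, ("x", v)) for k, v in table_x]
    mapped += [(k, ("y", v)) for k, v in table_y]

    # Shuffle phase: bring all tagged rows with the same join key together.
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)

    # Reduce phase: for each key, pair every x-row with every y-row.
    joined = []
    for key in sorted(groups):
        xs = [v for tag, v in groups[key] if tag == "x"]
        ys = [v for tag, v in groups[key] if tag == "y"]
        joined.extend((key, xv, yv) for xv in xs for yv in ys)
    return joined
```

Every row of both tables passes through the shuffle here, which is the cost a map join tries to avoid when one side is small.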

Page 4: Join optimization in hive

Previous Map Join

[Diagram: Task A → MapJoin Task → Task C. Each mapper in the MapJoin Task receives the big table data as a stream of records and reads the small table data.]

Page 5: Join optimization in hive

Optimized Map Join

[Diagram: Task A → MapReduce Local Task → MapJoin Task → Task C. The MapReduce Local Task reads the small table data, writes hashtable files, and uploads them to the Distributed Cache (DC). Each mapper in the MapJoin Task gets the hashtable files from the Distributed Cache and joins them against the big table data, record by record.]
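The two halves of the optimized map join can be sketched as follows. This is an illustrative Python sketch under stated assumptions: `build_hashtable_file` and `map_join` are hypothetical names, `pickle` stands in for whatever serialization the hashtable files use, and a local file path stands in for the Distributed Cache.

```python
import pickle
from collections import defaultdict

def build_hashtable_file(small_table, path):
    """Local task: hash the small table on the join key and dump it to a file."""
    table = defaultdict(list)
    for key, value in small_table:
        table[key].append(value)
    with open(path, "wb") as f:
        pickle.dump(dict(table), f)

def map_join(big_table_split, hashtable_path):
    """Mapper: load the hashtable file once, then probe it per big-table record."""
    with open(hashtable_path, "rb") as f:
        table = pickle.load(f)
    for key, value in big_table_split:
        # Emit one joined row per matching small-table value; no shuffle needed.
        for small_value in table.get(key, []):
            yield (key, value, small_value)
```

The small table is hashed once by the local task instead of once per mapper, and the join finishes in the map phase without a reducer.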

Page 6: Join optimization in hive

JDBM

• JDBM is too heavyweight for Map Join
  – Takes more than 70% of CPU time
  – Generates very large files

• No need to use a persistent hashtable for map join

Page 7: Join optimization in hive

Performance Evaluation I

Small Table                  Big Table                      Join Condition              Previous Map Join (avg. time)   Optimized Map Join (avg. time)   Improvement
75 K rows; 383K file size    130 M rows; 3.5G file size     1 join key, 2 join values   1032 sec                        79 sec                           +1206%
500 K rows; 2.6M file size   130 M rows; 3.5G file size     1 join key, 2 join values   3991 sec                        144 sec                          +2671%
75 K rows; 383K file size    16.7 B rows; 459 G file size   1 join key, 2 join values   4801 sec                        325 sec                          +1377%
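The improvement column appears to follow (previous time / optimized time − 1) × 100. The short check below recomputes it from the reported times; the formula is inferred from the numbers rather than stated on the slide, and the function name is an assumption.

```python
def improvement_percent(previous_sec, optimized_sec):
    """Speed-up expressed as a percentage gain: (previous / optimized - 1) * 100."""
    return (previous_sec / optimized_sec - 1) * 100

# (previous, optimized, reported improvement) as given in Performance Evaluation I.
rows = [(1032, 79, 1206), (3991, 144, 2671), (4801, 325, 1377)]
for previous, optimized, reported in rows:
    computed = improvement_percent(previous, optimized)
    print(f"{previous} -> {optimized} sec: {computed:.1f}% (reported +{reported}%)")
```

Each computed value lands within one percentage point of the reported figure, so the small differences look like rounding.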

Page 8: Join optimization in hive

Converting Common Join into Map Join

[Diagram, Previous Execution Flow: Task A → CommonJoinTask → Task C.]

[Diagram, Optimized Execution Flow: Task A → Conditional Task → Task C. The Conditional Task holds the CommonJoinTask plus several MapJoinLocalTask → MapJoinTask branches (a, b, c, ...).]

Page 9: Join optimization in hive

Compile Time

SELECT * FROM SRC1 x JOIN SRC2 y
ON x.key = y.key;

[Diagram: Task A → Conditional Task → Task C. The Conditional Task contains a CommonJoinTask and two MapJoinLocalTask → MapJoinTask branches: one assumes table x is the big table, the other assumes table y is the big table.]
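As a rough illustration of the compile-time step, the sketch below lists the branches a conditional task could hold for a join: one map join branch per table that might turn out to be the big one, plus the common join. The function `candidate_branches` and the dictionary layout are hypothetical and only mirror the task names shown on the slide.

```python
def candidate_branches(table_aliases):
    """List the branches a conditional task could hold for a join."""
    branches = []
    for big in table_aliases:
        # Every other table would be hashed by the local task.
        small = [t for t in table_aliases if t != big]
        branches.append({
            "big_table": big,
            "tasks": ["MapJoinLocalTask", "MapJoinTask"],
            "hashed_tables": small,
        })
    # The original common join is kept as one more branch.
    branches.append({"big_table": None, "tasks": ["CommonJoinTask"], "hashed_tables": []})
    return branches

for branch in candidate_branches(["x", "y"]):
    print(branch)
```

For the query on this slide, that yields three branches: x as the big table, y as the big table, and the common join.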

Page 10: Join optimization in hive

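Execution Time

SELECT * FROM SRC1 x JOIN SRC2 y
ON x.key = y.key;

[Diagram: Task A → Conditional Task → Task C. At execution time the Conditional Task runs a single branch: the MapJoinLocalTask → MapJoinTask branch when table x is the big table, or the CommonJoinTask when both tables are too big for map join.]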
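The execution-time choice can be sketched as a small resolver. This is an assumption-laden illustration: `resolve_branch` and `small_table_limit` are hypothetical names, and the rule "the largest table is streamed, everything else must fit under the limit" is one plausible reading of the slide rather than a description of Hive's actual resolver.

```python
def resolve_branch(table_sizes, small_table_limit):
    """Pick the big table for a map join, or None to run the common join.

    table_sizes maps table alias -> size in bytes (known only at execution time).
    small_table_limit is a hypothetical cap on the combined size of hashed tables.
    """
    # Treat the largest table as the streamed (big) table.
    big = max(table_sizes, key=table_sizes.get)
    small_total = sum(size for alias, size in table_sizes.items() if alias != big)
    if small_total <= small_table_limit:
        return big   # run MapJoinLocalTask + MapJoinTask with this big table
    return None      # both sides too big for map join: run CommonJoinTask

print(resolve_branch({"x": 3_500_000_000, "y": 383_000}, 25_000_000))
print(resolve_branch({"x": 3_500_000_000, "y": 459_000_000_000}, 25_000_000))
```

The first call selects x as the big table; the second falls through to the common join because neither side is small enough to hash.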

Page 11: Join optimization in hive

Backup Task

[Diagram: Task A → Conditional Task → Task C. Inside the Conditional Task, the MapJoin LocalTask is memory bound; the CommonJoinTask is run as a backup task.]
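The backup idea can be sketched as a try/fallback. Everything named here is hypothetical: `max_entries` is a stand-in for the memory bound (counted in distinct keys for simplicity), and `backup_common_join` is a simplified placeholder for the CommonJoinTask.

```python
class HashTableTooLarge(Exception):
    """Raised when the small table does not fit the (hypothetical) memory bound."""

def local_task(small_table, max_entries):
    """Build the in-memory hashtable, giving up once it exceeds the bound."""
    table = {}
    for key, value in small_table:
        table.setdefault(key, []).append(value)
        if len(table) > max_entries:
            raise HashTableTooLarge(len(table))
    return table

def backup_common_join(small_table, big_table):
    """Stand-in for the CommonJoinTask: group both inputs by key, then pair rows."""
    groups = {}
    for k, v in big_table:
        groups.setdefault(k, ([], []))[0].append(v)
    for k, v in small_table:
        groups.setdefault(k, ([], []))[1].append(v)
    return [(k, bv, sv) for k, (bs, ss) in groups.items() for bv in bs for sv in ss]

def run_join(small_table, big_table, max_entries):
    """Try the map join first; run the common join as a backup if memory runs out."""
    try:
        table = local_task(small_table, max_entries)
    except HashTableTooLarge:
        return "common_join", backup_common_join(small_table, big_table)
    rows = [(k, bv, sv) for k, bv in big_table for sv in table.get(k, [])]
    return "map_join", rows
```

Both paths produce the same joined rows (possibly in a different order), so a failed map join attempt costs time but not correctness.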

Page 12: Join optimization in hive

Performance Bottleneck

• Distributed Cache is the potential performance bottleneck
  – Large hashtable files slow down the propagation of the Distributed Cache
  – Mappers wait for the hashtable files from the Distributed Cache

• Compress and archive all the hashtable files into a tar file.
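A minimal sketch of the compress-and-archive step, using Python's standard `tarfile` module with gzip compression. The function names and the choice of gzip are assumptions for illustration; the slide only states that the hashtable files are compressed and archived into a tar file.

```python
import os
import tarfile

def archive_hashtables(hashtable_paths, archive_path):
    """Pack all hashtable files into one gzip-compressed tar file."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in hashtable_paths:
            # Store each file under its base name so it unpacks flat.
            tar.add(path, arcname=os.path.basename(path))
    return archive_path

def unpack_hashtables(archive_path, target_dir):
    """Mapper side: extract the archive and return the names of the files in it."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(target_dir)
    return names
```

Shipping one compressed archive instead of many raw files means less data for the Distributed Cache to propagate before the mappers can start.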

Page 13: Join optimization in hive

Compress and Archive

[Diagram: Task A → MapReduce Local Task → MapJoin Task → Task C. The MapReduce Local Task turns the small table data into hashtable files, which are compressed and archived before they go to the Distributed Cache. The mappers of the MapJoin Task read them from the Distributed Cache while processing the big table data, record by record.]

Page 14: Join optimization in hive

Performance Evaluation II

Small Table                  Big Table                      Join Condition               Avg. Join Time Without Compression   Avg. Join Time With Compression   Improvement
75 K rows; 383K file size    130 M rows; 3.5G file size     1 join key, 2 join values    106 sec                              73 sec                            +45%
500 K rows; 2.6M file size   130 M rows; 3.5G file size     1 join key, 2 join values    129 sec                              106 sec                           +21%
75 K rows; 383K file size    16.7 B rows; 459 G file size   1 join key, 2 join values    441 sec                              326 sec                           +35%
500 K rows; 2.6M file size   16.7 B rows; 459 G file size   1 join key, 2 join values    326 sec                              251 sec                           +30%
1M rows; 10M file size       16.7 B rows; 459 G file size   1 join key, 3 join values    495 sec                              266 sec                           +86%
1M rows; 10M file size       16.7 B rows; 459 G file size   2 join keys, 2 join values   425 sec                              255 sec                           +67%

Page 15: Join optimization in hive

Performance Evaluation III

Small Table                  Big Table                      Join Condition               Previous Common Join   Optimized Common Join   Improvement
75 K rows; 383K file size    130 M rows; 3.5G file size     1 join key, 2 join values    169 sec                79 sec                  +114%
500 K rows; 2.6M file size   130 M rows; 3.5G file size     1 join key, 2 join values    246 sec                144 sec                 +71%
75 K rows; 383K file size    16.7 B rows; 459 G file size   1 join key, 2 join values    511 sec                325 sec                 +57%
500 K rows; 2.6M file size   16.7 B rows; 459 G file size   1 join key, 2 join values    502 sec                305 sec                 +64%
1M rows; 10M file size       16.7 B rows; 459 G file size   1 join key, 3 join values    653 sec                248 sec                 +163%
1M rows; 10M file size       16.7 B rows; 459 G file size   2 join keys, 2 join values   1117 sec               536 sec                 +108%

Page 16: Join optimization in hive

Future Work

• Audit how many joins are converted into map joins in the cluster.

• Set the hashtable file replica number based on the number of mappers.

• Tune the limit on small table data size by sampling.

• Increase the in-memory hashtable capacity.

Page 17: Join optimization in hive

Thank you

Liyin Tang