a perfect hive query for a perfect meeting (hadoop summit 2014)
DESCRIPTION
During one of our epic parties, Martin Lorentzon (chairman of Spotify) agreed to help me to arrange a dinner for me and Timbuktu (my favourite Swedish rap and reggae artist), if I prove somehow that I am the biggest fan of Timbuktu in my home country. Because at Spotify we attack all problems using data-driven approaches, I decided to implement a Hive query that processes real datasets to figure out who streams Timbuktu the most frequently in my country. Although this problem seems to be well-defined, one can find many challenges in implementing this query efficiently and they relate to sampling, testing, debugging, troubleshooting, optimizing and executing it over terabytes of data on the Hadoop-YARN cluster that contains hundreds of nodes. During my talk, I will describe all of them, and share how to increase your (and the cluster’s) productivity by following tips and best practices for analyzing large datasets with Hive on YARN. I will also explain how the newly-added features to Hive (e.g. join optimizations, OCR File Format and Tez integration that is coming soon) can be used to make your query extremely fast.TRANSCRIPT
![Page 1: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/1.jpg)
Adam KawaData Engineer @ Spotify
A Perfect Hive Query For A Perfect Meeting
![Page 2: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/2.jpg)
A deal was made!~5:00 AM, June 16th, 2013
![Page 3: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/3.jpg)
■ End of our Summer Jam party■ Organized by Martin Lorentzon
~5:00 AM, June 16th, 2013
![Page 4: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/4.jpg)
■ Co-founder and Chairman of Spotify
Martin Lorentzon
![Page 5: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/5.jpg)
■ Data Engineer at Spotify
Adam Kawa
![Page 6: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/6.jpg)
* If Adam proves somehow that he is the biggest fan of Timbuktu in his home country
The DealMartin will invite Adam and Timbuktu, my favourite Swedish artist, for a beer or coke or whatever to drink *
by Martin
![Page 7: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/7.jpg)
■ Actually, Jason Michael Bosak Diakité■ A top Swedish rap and reggae artist
- My favourite song - http://bit.ly/1htAKhM■ Does not know anything about the deal!
Timbuktu
![Page 8: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/8.jpg)
QuestionWho is the biggest fan of Timbuktu in Poland?
The Rules
![Page 9: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/9.jpg)
QuestionWho is the biggest fan of Timbuktu in Poland?
AnswersA person who has streamed Timbuktu’s songs at Spotify the most frequently!
The Rules
![Page 10: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/10.jpg)
Data will tell the truth!
![Page 11: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/11.jpg)
■ Follow tips and best practices related to- Testing- Sampling- Troubleshooting- Optimizing- Executing
Hive Query
![Page 12: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/12.jpg)
Hive queryshould be fasterthan my potential meeting with Timbuktu
Why?Re-run the query during our meeting
Additional Requirement
by Adam
![Page 13: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/13.jpg)
■ YARN on 690 nodes■ Hive
- MRv2- Join optimizations- ORC- Tez- Vectorization- Table statistics
Friends
![Page 14: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/14.jpg)
IntroductionDatasets
![Page 15: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/15.jpg)
USER
user_id country … date
u1 PL … 20140415
u2 PL … 20140415
u3 SE … 20140415
TRACK
json_data … date track_id
{"album":{"artistname":"Coldplay"}} … 20140415 t1
{"album":{"artistname":"Timbuktu"}} … 20140415 t2
{"album":{"artistname":"Timbuktu"}} … 20140415 t3
track_id … user_id … date
t1 … u1 … 20130311
t3 … u1 … 20130622
t3 … u2 … 20131209
t2 … u2 … 20140319
t1 ... u3 ... 20140415
STREAM
![Page 16: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/16.jpg)
■ Information about songs streamed by users■ 54 columns■ 25.10 TB of compressed data since Feb 12th, 2013
STREAM
track_id … … user_id date
t1 … … u1 20130311
t3 … … u1 20130622
t3 … … u2 20131209
t2 … … u2 20140319
t1 ... ... u3 20140415
![Page 17: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/17.jpg)
■ Information about users■ 40 columns■ 10.5 GB of compressed data (2014-04-15)
USER
user_id country … date
u1 PL … 20140415
u2 PL … 20140415
u3 SE … 20140415
![Page 18: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/18.jpg)
■ Metadata information about tracks■ 5 columns
- But json_data contains dozen or so fields■ 15.4 GB of data (2014-04-15)
TRACK
track_id json_data … date
t1 {"album":{"artistname":"Coldplay"}} … 20140415
t2 {"album":{"artistname":"Timbuktu"}} … 20140415
t3 {"album":{"artistname":"Timbuktu"}} … 20140415
![Page 19: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/19.jpg)
✓ Avro and a custom serde✓ DEFLATE as a compression codec✓ Partitioned by day and/or hour
✗ Bucketed✗ Indexed
Tables’ Properties
![Page 20: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/20.jpg)
HiveQLQuery
![Page 21: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/21.jpg)
FROM stream s
Query
![Page 22: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/22.jpg)
FROM stream sJOIN track t ON t.track_id = s.track_id
Query
![Page 23: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/23.jpg)
FROM stream sJOIN track t ON t.track_id = s.track_idJOIN user u ON u.user_id = s.user_id
Query
![Page 24: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/24.jpg)
FROM stream sJOIN track t ON t.track_id = s.track_idJOIN user u ON u.user_id = s.user_id
WHERE lower(get_json_object(t.json_data, '$.album.artistname'))= 'timbuktu'
Query
![Page 25: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/25.jpg)
FROM stream sJOIN track t ON t.track_id = s.track_idJOIN user u ON u.user_id = s.user_id
WHERE lower(get_json_object(t.json_data, '$.album.artistname'))= 'timbuktu'
AND u.country = 'PL'
Query
![Page 26: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/26.jpg)
FROM stream sJOIN track t ON t.track_id = s.track_idJOIN user u ON u.user_id = s.user_id
WHERE lower(get_json_object(t.json_data, '$.album.artistname'))= 'timbuktu'
AND u.country = 'PL'AND s.date BETWEEN 20130212 AND 20140415
Query
![Page 27: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/27.jpg)
FROM stream sJOIN track t ON t.track_id = s.track_idJOIN user u ON u.user_id = s.user_id
WHERE lower(get_json_object(t.json_data, '$.album.artistname'))= 'timbuktu'
AND u.country = 'PL'AND s.date BETWEEN 20130212 AND 20140415AND u.date = 20140415 AND t.date = 20140415
Query
![Page 28: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/28.jpg)
SELECT s.user_id, COUNT(*) AS count
FROM stream sJOIN track t ON t.track_id = s.track_idJOIN user u ON u.user_id = s.user_id
WHERE lower(get_json_object(t.json_data, '$.album.artistname'))= 'timbuktu'
AND u.country = 'PL'AND s.date BETWEEN 20130212 AND 20140415AND u.date = 20140415 AND t.date = 20140415
GROUP BY s.user_id
Query
![Page 29: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/29.jpg)
SELECT s.user_id, COUNT(*) AS count
FROM stream sJOIN track t ON t.track_id = s.track_idJOIN user u ON u.user_id = s.user_id
WHERE lower(get_json_object(t.json_data, '$.album.artistname'))= 'timbuktu'
AND u.country = 'PL'AND s.date BETWEEN 20130212 AND 20140415AND u.date = 20140415 AND t.date = 20140415
GROUP BY s.user_idORDER BY count DESCLIMIT 100;
Query
![Page 30: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/30.jpg)
SELECT s.user_id, COUNT(*) AS count
FROM stream sJOIN track t ON t.track_id = s.track_idJOIN user u ON u.user_id = s.user_id
WHERE lower(get_json_object(t.json_data, '$.album.artistname'))= 'timbuktu'
AND u.country = 'PL'AND s.date BETWEEN 20130212 AND 20140415AND u.date = 20140415 AND t.date = 20140415
GROUP BY s.user_idORDER BY count DESCLIMIT 100;
QueryA line where
I may have a bug ? !
![Page 31: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/31.jpg)
HiveQLUnit Testing
![Page 32: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/32.jpg)
hive_testVerbose
and complex
Java code
![Page 33: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/33.jpg)
■ Hive analyst usually is not Java developer■ Hive queries are often ad-hoc
- No big incentives to unit test
Testing Hive Queries
![Page 34: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/34.jpg)
■ Easy unit testing of Hive queries locally- Implemented by me- Available at github.com/kawaa/Beetest
Beetest
![Page 35: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/35.jpg)
■ A unit test consists of- select.hql - a query to test
Beetest
![Page 36: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/36.jpg)
■ A unit test consists of- select.hql - a query to test- table.ddl - schemas of input tables
Beetest
![Page 37: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/37.jpg)
■ A unit test consists of- select.hql - a query to test- table.ddl - schemas of input tables- text files with input data
Beetest
![Page 38: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/38.jpg)
■ A unit test consists of- select.hql - a query to test- table.ddl - schemas of input tables- text files with input data- expected.txt - expected output
Beetest
![Page 39: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/39.jpg)
■ A unit test consists of- select.hql - a query to test- table.ddl - schemas of input tables- text files with input data- expected.txt - expected output- (optional) setup.hql - any initialization query
Beetest
![Page 40: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/40.jpg)
table.ddlstream(track_id String, user_id STRING, date BIGINT)
user(user_id STRING, country STRING, date BIGINT)
track(track_id STRING, json_data STRING, date BIGINT)
![Page 41: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/41.jpg)
For Each Line1. Creates a table with a given schema
- In local Hive database called beetest
table.ddlstream(track_id String, user_id STRING, date BIGINT)
user(user_id STRING, country STRING, date BIGINT)
track(track_id STRING, json_data STRING, date BIGINT)
![Page 42: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/42.jpg)
For Each Line1. Creates a table with a given schema
- In local Hive database called beetest2. Loads table.txt file into the table
- A list of files can be also explicitly specified
table.ddlstream(track_id String, user_id STRING, date BIGINT)
user(user_id STRING, country STRING, date BIGINT)
track(track_id STRING, json_data STRING, date BIGINT)
![Page 43: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/43.jpg)
Input Files
t1 {"album":{"artistname":"Coldplay"}} 20140415
t2 {"album":{"artistname":"Timbuktu"}} 20140415
t3 {"album":{"artistname":"Timbuktu"}} 20140415
track.txt
![Page 44: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/44.jpg)
Input Files
u1 PL 20140415
u2 PL 20140415
u3 SE 20140415
t1 {"album":{"artistname":"Coldplay"}} 20140415
t2 {"album":{"artistname":"Timbuktu"}} 20140415
t3 {"album":{"artistname":"Timbuktu"}} 20140415
user.txt
track.txt
![Page 45: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/45.jpg)
Input Files
u1 PL 20140415
u2 PL 20140415
u3 SE 20140415
t1 u1 20140415
t3 u1 20140415
t2 u1 20140415
t3 u2 20140415
t2 u3 20140415
t1 {"album":{"artistname":"Coldplay"}} 20140415
t2 {"album":{"artistname":"Timbuktu"}} 20140415
t3 {"album":{"artistname":"Timbuktu"}} 20140415
stream.txtuser.txt
track.txt
![Page 46: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/46.jpg)
Expected Output File
u1 2
u2 1
expected.txtt1 u1 20140415
t3 u1 20140415
t2 u1 20140415
t3 u2 20140415
t2 u3 20140415
stream.txt
![Page 47: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/47.jpg)
$ ./run-test.sh timbuktu
Running Unit Test
![Page 48: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/48.jpg)
$ ./run-test.sh timbuktu
INFO: Generated query filename: /tmp/beetest-test-1934426026-query.hqlINFO: Running: hive --config local-config -f /tmp/beetest-test-1934426026-query.hql
Running Unit Test
![Page 49: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/49.jpg)
$ ./run-test.sh timbuktu
INFO: Generated query filename: /tmp/beetest-test-1934426026-query.hqlINFO: Running: hive --config local-config -f /tmp/beetest-test-1934426026-query.hqlINFO: Loading data to table beetest.stream… INFO: Total MapReduce jobs = 2INFO: Job running in-process (local Hadoop)…
Running Unit Test
![Page 50: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/50.jpg)
$ ./run-test.sh timbuktu
INFO: Generated query filename: /tmp/beetest-test-1934426026-query.hqlINFO: Running: hive --config local-config -f /tmp/beetest-test-1934426026-query.hqlINFO: Loading data to table beetest.stream… INFO: Total MapReduce jobs = 2INFO: Job running in-process (local Hadoop)… INFO: Execution completed successfullyINFO: Table beetest.output_1934426026 stats: …INFO: Time taken: 34.605 seconds
Running Unit Test
![Page 51: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/51.jpg)
$ ./run-test.sh timbuktu
INFO: Generated query filename: /tmp/beetest-test-1934426026-query.hqlINFO: Running: hive --config local-config -f /tmp/beetest-test-1934426026-query.hqlINFO: Loading data to table beetest.stream… INFO: Total MapReduce jobs = 2INFO: Job running in-process (local Hadoop)… INFO: Execution completed successfullyINFO: Table beetest.output_1934426026 stats: …INFO: Time taken: 34.605 secondsINFO: Asserting: timbuktu/expected.txt and /tmp/beetest-test-1934426026-output_1934426026/000000_0INFO: Test passed!
Running Unit Test
![Page 52: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/52.jpg)
Bee testBe happy !
![Page 53: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/53.jpg)
HiveQLSampling
![Page 54: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/54.jpg)
■ TABLESAMPLE can generate samples of- buckets- HDFS blocks- first N records from each input split
Sampling In Hive
![Page 55: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/55.jpg)
✗ Bucket sampling- Table must be bucketized by a given column
✗ HDFS block sampling- Only CombineHiveInputFormat is supported
TABLESAMPLE Limitations
![Page 56: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/56.jpg)
SELECT * FROM user TABLESAMPLE(K ROWS)SELECT * FROM track TABLESAMPLE(L ROWS)SELECT * FROM stream TABLESAMPLE(M ROWS)
SELECT … stream JOIN track ON … JOIN user ON …
✗ No guarantee that JOIN returns anything
Rows TABLESAMPLE
![Page 57: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/57.jpg)
■ DataFu + Pig offer interesting sampling algorithms- Weighted Random Sampling- Reservoir Sampling- Consistent Sampling By Key
■ HCatalog can provide integration
DataFu And Pig
![Page 58: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/58.jpg)
DEFINE SBK datafu.pig.sampling.SampleByKey('0.01');
stream_s = FILTER stream BY SBK(user_id); user_s = FILTER user BY SBK(user_id);
Consistent Sampling By Key
Threshold
![Page 59: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/59.jpg)
DEFINE SBK datafu.pig.sampling.SampleByKey('0.01');
stream_s = FILTER stream BY SBK(user_id); user_s = FILTER user BY SBK(user_id);
stream_user_s = JOIN user_s BY user_id, stream_s BY user_id;
✓ Guarantees that JOIN returns rows after sampling- Assuming that the threshold is high enough
Consistent Sampling By Key
Threshold
![Page 60: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/60.jpg)
DEFINE SBK datafu.pig.sampling.SampleByKey('0.01');
stream_s = FILTER stream BY SBK(user_id); user_s = FILTER user BY SBK(user_id);
user_s_pl = FILTER user_s BY country == ‘PL’;stream_user_s_pl = JOIN user_s_pl BY user_id, stream_s BY user_id;
✗ Do not guarantee that Polish users will be included
Consistent Sampling By Key
Threshold
![Page 61: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/61.jpg)
✗ None of the sampling methods meet all of my requirements
Try and see■ Use SampleByKey and 1% of day’s worth of data
- Maybe it will be good enough?
What’s Next?
![Page 62: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/62.jpg)
1. Finished successfully!2. Returned 3 Polish users with 4 Timbuktu’s tracks
- My username was not included :(
Running Query
?
![Page 63: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/63.jpg)
HiveQLUnderstanding The Plan
![Page 64: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/64.jpg)
■ Shows how Hive translates queries into MR jobs- Useful, but tricky to understand
EXPLAIN
![Page 65: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/65.jpg)
■ 330 lines■ 11 stages
- Conditional operators and backup stages- MapReduce jobs, Map-Only jobs- MapReduce Local Work
■ 4 MapReduce jobs- Two of them needed for JOINs
EXPLAIN The Query
![Page 66: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/66.jpg)
Optimized Query
2 MapReduce job in total
![Page 67: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/67.jpg)
hive.auto.convert.join.noconditionaltask.size> filesize(user) + filesize(track)
No Conditional Map Join
Runs many Map joins in a Map-Only
job[HIVE-3784]
![Page 68: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/68.jpg)
1. Local Work- Fetch records from small table(s) from HDFS- Build hash table- If not enough memory, then abort
Map Join Procedure
![Page 69: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/69.jpg)
1. Local Work- Fetch records from small table(s) from HDFS- Build hash table- If not enough memory, then abort
2. MapReduce Driver- Add hash table to the distributed cache
Map Join Procedure
![Page 70: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/70.jpg)
1. Local Work- Fetch records from small table(s) from HDFS- Build hash table- If not enough memory, then abort
2. MapReduce Driver- Add hash table to the distributed cache
3. Map Task- Read hash table from the distributed cache- Stream through, supposedly, the largest table- Match records and write to output
Map Join Procedure
![Page 71: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/71.jpg)
1. Local Work- Fetch records from small table(s) from HDFS- Build hash table- If not enough memory, then abort
2. MapReduce Driver- Add hash table to the distributed cache
3. Map Task- Read hash table from the distributed cache- Stream through, supposedly, the largest table- Match records and write to output
4. No Reduce Task
Map Join Procedure
![Page 72: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/72.jpg)
■ Table has to be downloaded from HDFS- Problem if this step takes long
■ Hash table has to be uploaded to HDFS- Replicated via the distributed cache- Fetched by hundred of nodes
Map Join Cons
![Page 73: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/73.jpg)
Grouping After Joining
Runs as a single MR job
[HIVE-3952]
![Page 74: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/74.jpg)
Optimized Query
2 MapReduce job in total
![Page 75: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/75.jpg)
HiveQLRunning At Scale
![Page 76: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/76.jpg)
2014-05-05 21:21:13Starting to launch local task to process map join; maximum memory = 2GB
^Ckawaa@sol:~/timbuktu$ dateMon May 5 21:50:26 UTC 2014
Day’s Worth Of Data
![Page 77: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/77.jpg)
✗ Dimension tables too big to read fast from HDFS✓ Add a preprocessing step
- Apply projection and filtering in Map-only job- Make a script ugly with intermediate tables
Too Big Tables
![Page 78: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/78.jpg)
✓ 15s to build hash tables!✗ 8m 23s to run 2 Map-only jobs to prepare tables
- “Investment”
Day’s Worth Of Data
![Page 79: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/79.jpg)
■ Map tasks
FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.OutOfMemoryError: Java heap space
Day’s Worth Of Data
![Page 80: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/80.jpg)
■ Increase JVM heap size■ Decrease the size of in-memory map output buffer
- mapreduce.task.io.sort.mb
Possible Solution
![Page 81: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/81.jpg)
■ Increase JVM heap size■ Decrease the size of in-memory map output buffer
- mapreduce.task.io.sort.mb
(Temporary) Solution
My query generates small amount of intermediate data
- MAP_OUTPUT_* counters
![Page 82: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/82.jpg)
✓ Works!✗ 2-3x slower than two Regular joins
Second Challenge
![Page 83: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/83.jpg)
Metric Regular Join Map Join
The slowest
job
Shuffled Bytes
12.5 GB 0.56 MB
CPU 120 M 331 M
GC 13 M 258 M
GC / CPU 0.11 0.78
Wall-clock 11m 47s 31m 30s
Heavy Garbage Collection!
![Page 84: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/84.jpg)
■ Increase the container and heap sizes- 4096 and -Xmx3584m
Fixing Heavy GC
Metric Regular Join Map Join
The slowest job
GC / CPU 0.11 0.03
QueryCPU 1d 17m 23h 8m
Wall-clock 20m 15s 14m 42s
![Page 85: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/85.jpg)
■ 1st MapReduce Job- Heavy map phase
- No GC issues any more- Tweak input split size
- Lightweight shuffle- Lightweight reduce
- Use 1-2 reduce tasks
Where To Optimize?
![Page 86: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/86.jpg)
■ 1st MapReduce Job- Heavy map phase
- No GC issues any more- Tweak input split size
- Lightweight shuffle- Lightweight reduce
- Use 1-2 reduce tasks■ 2nd MapReduce Job
- Very small- Enable uberization [HIVE-5857]
Where To Optimize?
![Page 87: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/87.jpg)
Current Best Time2 months of data
50 min 2 sec10th place
?
![Page 88: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/88.jpg)
Changes are needed!
![Page 89: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/89.jpg)
File FormatORC
![Page 90: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/90.jpg)
■ The biggest table has 52 “real” columns- Only 2 of them are needed- 2 columns are “partition” columns
Observation
||||||||||||||||||||||||||||||||||||||||||||||||||||
![Page 91: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/91.jpg)
✓ Improve speed of query- Only read required columns
- Especially important when reading remotely
My Benefits Of ORC
![Page 92: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/92.jpg)
✗ Need to convert Avro to ORC✓ ORC consumes less space
- Efficient encoding and compression
ORC Storage Consumption
![Page 93: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/93.jpg)
Storage And Access
16x
![Page 94: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/94.jpg)
Wall Clock Time
3.5x
![Page 95: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/95.jpg)
CPU Time
32x
![Page 96: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/96.jpg)
■ Use Snappy compression- ZLIB was used to optimize more for storage than
processing time
Possible Improvements
![Page 97: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/97.jpg)
ComputationApache Tez
![Page 98: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/98.jpg)
■ Tez will probably not help much, because…- Only two MapReduce jobs- Heavy work done in map tasks- Little intermediate data written to HDFS
Thoughts?
![Page 99: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/99.jpg)
Wall Clock Time
1.4x
2.4x
![Page 100: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/100.jpg)
Wall Clock Time
8x
![Page 101: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/101.jpg)
✓ Natural DAG - No “empty” map task- No necessity to write intermediate data to HDFS- No job scheduling overhead
Benefits Of Tez
![Page 102: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/102.jpg)
✓ Dynamic graph reconfiguration- Decrease reduce task parallelism at runtime- Pre-launch reduce tasks at runtime
Benefits Of Tez
![Page 103: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/103.jpg)
✓ Smart number of map tasks based on e.g.- Min/max limits input split size- Capacity utilization
Benefits Of Tez
![Page 104: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/104.jpg)
Container ReuseTime
![Page 105: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/105.jpg)
Container Reuse
The more congested queue/cluster, the bigger
benefits of reusing
Time
![Page 106: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/106.jpg)
Container Reuse
No scheduling overhead to run new Reduce task
Time
![Page 107: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/107.jpg)
Container ReuseTime
Thinner tasks allows to avoid stragglers
![Page 108: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/108.jpg)
Container + Objects Reuse
Finished within 1,5 sec.Warm !
![Page 109: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/109.jpg)
■ Much easier to follow!
Map 1 <- Map 4 (BROADCAST_EDGE), Map 5 (BROADCAST_EDGE)Reducer 2 <- Map 1 (SIMPLE_EDGE)Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
■ A dot file for graphviz is also generated
EXPLAIN
![Page 110: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/110.jpg)
■ Improved version of Map Join■ Hash table is built in the cluster
- Parallelism- Data locality- No local and HDFS write barriers
■ Hash table is broadcasted to downstream tasks- Tez container can cache them and reuse
Broadcast Join
![Page 111: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/111.jpg)
■ Try run the query that was interrupted by ^C- On 1 day of data to introduce overhead
Broadcast Join
Metric Avro MR Tez ORC
Wall-clockUNKNOWN(^C after 30
minutes)11m 16s
![Page 112: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/112.jpg)
✓ Automatically rescales the size of in-memory buffer for intermediate key-value pairs✓ Reduced in-memory size of map-join hash tables
Memory-Related Benefits
![Page 113: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/113.jpg)
■ At scale, Tez Application Master can be busy- DEBUG log level slows everything down- Sometimes needs more RAM
■ A couple of tools are missing- e.g. counters, Web UI- CLI, log parser, jstat are your friends
Lessons Learned
![Page 114: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/114.jpg)
FeatureVectorization
![Page 115: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/115.jpg)
✓ Process data in a batch of 1024 rows together- Instead of processing one row at a time
✓ Reduces the CPU usage✓ Scales very well especially with large datasets
- Kicks in after a number of function calls✓ Supported by ORC
Vectorization
![Page 116: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/116.jpg)
Powered By Vectorization
1.4x
![Page 117: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/117.jpg)
✗ Not yet fully supported by vectorization✓ Still less GC✓ Still better use of memory to CPU bandwidth✗ Cost of switching from vectorized to row-
mode✓ It will be revised soon by the community
Vectorized Map Join
![Page 118: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/118.jpg)
■ Ensure that vectorization is fully or partially supported for your query
- Datatypes- Operators
■ Run own benchmarks and profilers
Lessons Learned
![Page 119: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/119.jpg)
FeatureTable Statistics
![Page 120: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/120.jpg)
✓ Table, partition and column statistics✓ Possibility to
- collect when loading data- generate for existing tables- use to produce optimal query plan
✓ Supported by ORC
Table Statistics
![Page 121: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/121.jpg)
Powered By StatisticsPeriod ORC/Tez
Vectorized ORC/Tez
Vectorized ORC/Tez + Stats
2 weeks173 sec 225 sec 129 sec
1.22K maps 1.22K maps 0.9K maps
6.5 months
618 sec 653 sec 468 sec
13.3K maps 13.3K maps 9.4K maps
14 months
931 sec 676 sec 611 sec
24.8K maps 24.8K maps 17.9K maps
![Page 122: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/122.jpg)
Current Best Time14 months of data
10 min 11 sec
?
![Page 123: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/123.jpg)
ResultsChallenge
![Page 124: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/124.jpg)
■ Who is the biggest fan of Timbuktu in Poland?
Results!
![Page 125: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/125.jpg)
■ Who is the biggest fan of Timbuktu in Poland?
Results!
![Page 126: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/126.jpg)
That’s all !
![Page 127: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/127.jpg)
■ Deals at 5 AM might be the best idea ever■ Hive can be a single solution for all SQL queries
- Both interactive and batch- Regardless of input dataset size
■ Tez is really MapReduce v2- Very flexible and smart- Improves what was inefficient with MapReduce- Easy to deploy
Summary
![Page 128: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/128.jpg)
■ Gopal Vijayaraghavan- Technical answers to Tez-related questions- Installation scripts
github.com/t3rmin4t0r/tez-autobuild
■ My colleagues for technical review and feedback- Piotr Krewski, Rafal Wojdyla, Uldis Barbans,
Magnus Runesson
Special Thanks
![Page 129: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/129.jpg)
Section name
Questions?
![Page 130: A perfect Hive query for a perfect meeting (Hadoop Summit 2014)](https://reader034.vdocuments.mx/reader034/viewer/2022042713/540dec048d7f72927e8b4a74/html5/thumbnails/130.jpg)
Thank you!