hadoop summit 2009 hive
![Page 1: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/1.jpg)
Hive - Data Warehousing & Analytics on Hadoop
Wednesday, June 10, 2009 Santa Clara Marriott
Namit Jain, Zheng Shao
Facebook
![Page 2: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/2.jpg)
Agenda
» Introduction
» Facebook Usage
» Hive Progress and Roadmap
» Open Source Community
![Page 3: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/3.jpg)
» Introduction
![Page 4: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/4.jpg)
Why Another Data Warehousing System?
Data, data and more data
~1TB per day in March 2008
~10TB per day today
![Page 5: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/5.jpg)
![Page 6: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/6.jpg)
Let's try Hadoop…
» Pros
› Superior in availability/scalability/manageability
› Efficiency not that great, but throw more hardware
› Partial Availability/resilience/scale more important than ACID
» Cons: Programmability and Metadata
› Map-reduce hard to program (users know sql/bash/python)
› Need to publish data in well known schemas
» Solution: HIVE
![Page 7: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/7.jpg)
Let's try Hadoop… (continued)
RDBMS> select key, count(1) from kv1 where key > 100 group by key;
vs.
$ cat > /tmp/reducer.sh
uniq -c | awk '{print $2"\t"$1}'
$ cat > /tmp/map.sh
awk -F '\001' '{if($1 > 100) print $1}'
$ bin/hadoop jar contrib/hadoop-0.19.2-dev-streaming.jar -input /user/hive/warehouse/kv1 -mapper map.sh -file /tmp/reducer.sh -file /tmp/map.sh -reducer reducer.sh -output /tmp/largekey -numReduceTasks 1
$ bin/hadoop dfs -cat /tmp/largekey/part*
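For comparison, the same filter-and-count logic can be sketched as two Python functions run locally. This is an illustrative stand-in for map.sh and reducer.sh, not something shown in the talk: the function names and sample rows are assumptions, and it assumes \001-delimited rows with an integer key in the first field.

```python
from itertools import groupby

def mapper(lines):
    """Emit the key of every row whose key is greater than 100 (like map.sh)."""
    for line in lines:
        key = line.rstrip("\n").split("\x01")[0]
        if int(key) > 100:
            yield key

def reducer(sorted_keys):
    """Count runs of identical keys (like `uniq -c` in reducer.sh)."""
    for key, group in groupby(sorted_keys):
        yield key, sum(1 for _ in group)

def run(lines):
    # Hadoop sorts mapper output by key before it reaches the reducer;
    # sorted() stands in for that shuffle/sort step here.
    return list(reducer(sorted(mapper(lines))))

# Hypothetical sample rows: key, \001 delimiter, value.
rows = ["238\x01val_238", "86\x01val_86", "311\x01val_311", "238\x01val_238"]
result = run(rows)
```

Even in this compact form the user has to write the filter, the grouping, and the plumbing by hand, which is the programmability gap the slide points at.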
![Page 8: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/8.jpg)
What is HIVE?
» A system for managing and querying structured data built on top of Hadoop
› Map-Reduce for execution
› HDFS for storage
› Metadata on raw files
» Key Building Principles:
› SQL as a familiar data warehousing tool
› Extensibility – Types, Functions, Formats, Scripts
› Scalability and Performance
![Page 9: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/9.jpg)
Simplifying Hadoop
RDBMS> select key, count(1) from kv1 where key > 100 group by key;
vs.
hive> select key, count(1) from kv1 where key > 100 group by key;
![Page 10: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/10.jpg)
» Facebook Usage
![Page 11: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/11.jpg)
Data Warehousing at Facebook Today
[Diagram: data flow among Web Servers, Scribe Servers, Filers, the Hive on Hadoop Cluster, Oracle RAC, and Federated MySQL]
![Page 12: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/12.jpg)
Hive/Hadoop Usage @ Facebook
» Types of Applications:
› Reporting
• Eg: Daily/Weekly aggregations of impression/click counts
• SELECT pageid, count(1) as imps FROM imp_table WHERE date = '2009-05-01' GROUP BY pageid;
• Complex measures of user engagement
› Ad hoc Analysis
• Eg: how many group admins broken down by state/country
› Data Mining (Assembling training data)
• Eg: User Engagement as a function of user attributes
› Spam Detection
• Anomalous patterns for Site Integrity
• Application API usage patterns
› Ad Optimization
![Page 13: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/13.jpg)
Hadoop Usage @ Facebook
» Cluster Capacity:
› 600 nodes
› ~2.4PB (80% used)
» Data statistics:
› Source logs/day: 6TB
› Dimension data/day: 4TB
› Compression Factor ~5x (gzip)
» Usage statistics:
› 3200 jobs/day with 800K map-reduce tasks/day
› 55TB of compressed data scanned daily
› 15TB of compressed output data written to HDFS
› 150 active users within Facebook
![Page 14: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/14.jpg)
» Hive Progress and Roadmap
![Page 15: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/15.jpg)
» CREATE TABLE clicks(key STRING, value STRING)
PARTITIONED BY (ds STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.TestSerDe'
WITH SERDEPROPERTIES ('testserde.default.serialization.format'='\003')
LOCATION '/hive/clicks';
![Page 16: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/16.jpg)
Data Model
HDFS
› Table clicks: /hive/clicks
› Logical Partitioning: /hive/clicks/ds=2008-03-25
› Hash Partitioning: /hive/clicks/ds=2008-03-25/0
› …
MetaStore (Metastore DB)
› Tables
› Data Location
› Bucketing Info
› Partitioning Cols
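The mapping from table, partition, and bucket to an HDFS path can be sketched in a few lines of Python. This only mirrors the layout shown on the slide; the helper names are hypothetical, and real Hive bucket file names may differ from the bare number used here.

```python
def partition_dir(table_location, partition_spec):
    """Append one 'column=value' directory level per partition column."""
    levels = [f"{col}={value}" for col, value in partition_spec]
    return "/".join([table_location] + levels)

def bucket_file(partition_path, bucket):
    """Append the bucket number as the final path component (as on the slide)."""
    return f"{partition_path}/{bucket}"

# Table location and partition value taken from the slide.
clicks_partition = partition_dir("/hive/clicks", [("ds", "2008-03-25")])
clicks_bucket = bucket_file(clicks_partition, 0)
```

Because the partition value is encoded in the directory name, a query restricted to one ds value only needs to read the files under that one directory.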
![Page 17: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/17.jpg)
HIVE: Components
[Architecture diagram. Labeled components: Hive CLI (DDL, Queries, Browsing); Web UI; Thrift API; MetaStore; DB; Parser; Planner; Optimizer; Execution; SerDe (Thrift, CSV, JSON..); Map Reduce; HDFS]
![Page 18: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/18.jpg)
Hive Query Language
» SQL
› Subqueries in from clause
› Equi-joins
› Multi-table Insert
› Multi-group-by
» Sampling
› SELECT s.key, count(1) FROM clicks TABLESAMPLE (BUCKET 1 OUT OF 32) s WHERE s.ds = '2009-04-22' GROUP BY s.key
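The idea behind TABLESAMPLE (BUCKET 1 OUT OF 32) can be sketched as a hash filter: every key falls into exactly one of 32 buckets, and the query reads only the first. The sketch below is an assumption-laden illustration; it uses CRC32 as the hash, whereas Hive uses its own hash function, and the key names are made up.

```python
import zlib

def in_sample(key, bucket, num_buckets):
    """True if `key` hashes into the 1-based `bucket` out of `num_buckets`."""
    # CRC32 is an assumption for illustration; Hive's real hash differs.
    return zlib.crc32(key.encode("utf-8")) % num_buckets == bucket - 1

# Hypothetical keys standing in for clicks.key values.
keys = [f"user{i}" for i in range(3200)]
sample = [k for k in keys if in_sample(k, 1, 32)]

# Every key belongs to exactly one bucket, so the buckets partition the data.
bucket_sizes = [sum(in_sample(k, b, 32) for k in keys) for b in range(1, 33)]
```

When the table is already stored in hash buckets on the sampled column, the engine can read a single bucket file instead of filtering every row.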
![Page 19: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/19.jpg)
FROM pv_users
INSERT INTO TABLE pv_gender_sum
SELECT gender, count(DISTINCT userid)
GROUP BY gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum.dir'
SELECT age, count(DISTINCT userid)
GROUP BY age
INSERT OVERWRITE LOCAL DIRECTORY '/home/me/pv_age_sum.dir'
SELECT age, count(DISTINCT userid)
GROUP BY age;
![Page 20: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/20.jpg)
Hive Query Language (continued)
» Extensibility
› Pluggable Map-reduce scripts
› Pluggable User Defined Functions
› Pluggable User Defined Types
• Complex object types: List of Maps
› Pluggable Data Formats
• Apache Log Format
![Page 21: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/21.jpg)
Pluggable Map-Reduce Scripts
FROM (
FROM pv_users
MAP pv_users.userid, pv_users.date
USING 'map_script'
AS dt, uid
CLUSTER BY dt) map
INSERT INTO TABLE pv_users_reduced
REDUCE map.dt, map.uid
USING 'reduce_script'
AS date, count;
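The talk does not show what map_script and reduce_script contain. As a hypothetical sketch, the pair below assumes rows arrive as tab-separated lines (the way Hive streams rows to external scripts) and that the reduce step counts rows per date. The function bodies and sample rows are illustrative assumptions.

```python
from itertools import groupby

def map_script(lines):
    """Turn 'userid<TAB>date' rows into 'dt<TAB>uid' rows."""
    for line in lines:
        userid, date = line.rstrip("\n").split("\t")
        yield f"{date}\t{userid}"

def reduce_script(lines):
    """Count rows per dt; relies on rows with equal dt being adjacent."""
    rows = (line.rstrip("\n").split("\t") for line in lines)
    for dt, group in groupby(rows, key=lambda row: row[0]):
        yield f"{dt}\t{sum(1 for _ in group)}"

# Hypothetical pv_users rows: userid, date.
page_views = ["111\t2009-06-01", "222\t2009-06-02", "333\t2009-06-01"]
# sorted() stands in for CLUSTER BY dt, which brings equal dt values together.
clustered = sorted(map_script(page_views))
output = list(reduce_script(clustered))
```

In a real deployment each function would be a standalone executable that loops over standard input and prints to standard output.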
![Page 22: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/22.jpg)
Map Reduce Example
Input
› Machine 1: <k1, v1> <k2, v2> <k3, v3>
› Machine 2: <k4, v4> <k5, v5> <k6, v6>
Local Map
› Machine 1: <nk1, nv1> <nk2, nv2> <nk3, nv3>
› Machine 2: <nk2, nv4> <nk2, nv5> <nk1, nv6>
Global Shuffle
› <nk1, nv1> <nk3, nv3> <nk1, nv6>
› <nk2, nv4> <nk2, nv5> <nk2, nv2>
Local Sort
› <nk1, nv1> <nk1, nv6> <nk3, nv3>
› <nk2, nv4> <nk2, nv5> <nk2, nv2>
Local Reduce
› <nk1, 2> <nk3, 1>
› <nk2, 3>
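The shuffle, sort, and reduce stages of this example can be replayed in Python. The sketch starts from the map outputs on the slide, because the map function itself is not specified; the key-to-reducer assignment in PARTITION is an assumption chosen to match the slide, and the reduce step is assumed to count values per key.

```python
from collections import Counter

# Map outputs as shown on the slide, one list per machine.
map_out_m1 = [("nk1", "nv1"), ("nk2", "nv2"), ("nk3", "nv3")]
map_out_m2 = [("nk2", "nv4"), ("nk2", "nv5"), ("nk1", "nv6")]

# Assumed partitioning: which reducer owns which key.
PARTITION = {"nk1": 0, "nk3": 0, "nk2": 1}

def global_shuffle(map_outputs, num_reducers):
    """Route every (key, value) pair to the reducer that owns its key."""
    partitions = [[] for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            partitions[PARTITION[key]].append((key, value))
    return partitions

def local_sort(pairs):
    """Sort one reducer's pairs by key so equal keys become adjacent."""
    return sorted(pairs, key=lambda kv: kv[0])

def local_reduce(sorted_pairs):
    """Count how many values arrived for each key."""
    return dict(Counter(key for key, _ in sorted_pairs))

shuffled = global_shuffle([map_out_m1, map_out_m2], num_reducers=2)
sorted_parts = [local_sort(p) for p in shuffled]
reduced = [local_reduce(p) for p in sorted_parts]
```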
![Page 23: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/23.jpg)
Hive QL – Join
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv
JOIN user u
ON (pv.userid = u.userid);
![Page 24: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/24.jpg)
Hive QL – Join in Map Reduce
page_view
pageid userid time
1 111 9:08:01
2 111 9:08:13
1 222 9:08:14
user
userid age gender
111 25 female
222 32 male
Map (page_view)
key value
111 <1,1>
111 <1,2>
222 <1,1>
Map (user)
key value
111 <2,25>
222 <2,32>
Shuffle Sort
key value
111 <1,1>
111 <1,2>
111 <2,25>
key value
222 <1,1>
222 <2,32>
Reduce
pageid age
1 25
2 25
pageid age
1 32
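The tagged-value join in this diagram can be sketched in Python using the same sample rows. Tag 1 marks page_view values and tag 2 marks user values, as on the slide; the function names are illustrative, and this is a simplified model rather than Hive's actual join operator.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def map_phase():
    """Emit (join key, (table tag, payload)) for both tables."""
    for pageid, userid, _time in page_view:
        yield userid, (1, pageid)
    for userid, age, _gender in user:
        yield userid, (2, age)

def shuffle_sort(pairs):
    """Group values by join key and sort them so tag 1 precedes tag 2."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sorted(values) for key, values in sorted(groups.items())}

def reduce_phase(groups):
    """For each key, pair every page_view payload with every user payload."""
    for _userid, values in groups.items():
        pageids = [v for tag, v in values if tag == 1]
        ages = [v for tag, v in values if tag == 2]
        for pageid in pageids:
            for age in ages:
                yield pageid, age

groups = shuffle_sort(map_phase())
pv_users = list(reduce_phase(groups))
```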
![Page 25: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/25.jpg)
Join Optimizations
» Map Joins
› User specified small tables stored in hash tables on the mapper backed by jdbm
› No reducer needed
INSERT INTO TABLE pv_users
SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age
FROM page_view pv JOIN user u
ON (pv.userid = u.userid);
» Future
› Exploit table/column statistics for deciding strategy
![Page 26: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/26.jpg)
Hive QL – Map Join
page_view
pageid userid time
1 111 9:08:01
2 111 9:08:13
1 222 9:08:14
Hash table
key value
111 <1,2>
222 <1>
user
userid age gender
111 25 female
222 32 male
pv_users
pageid age
1 25
2 25
1 32
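A map join can be sketched as a hash lookup performed while scanning the other table, with no shuffle or reduce step. The sketch below builds an in-memory table from page_view, the table named in the MAPJOIN hint, and streams user rows past it. It uses a plain Python dict as a stand-in for the jdbm-backed hash table, and the function names are hypothetical.

```python
from collections import defaultdict

page_view = [(1, 111, "9:08:01"), (2, 111, "9:08:13"), (1, 222, "9:08:14")]
user = [(111, 25, "female"), (222, 32, "male")]

def build_hash_table(rows):
    """Map userid -> list of pageids for the table held in memory."""
    table = defaultdict(list)
    for pageid, userid, _time in rows:
        table[userid].append(pageid)
    return table

def map_join(streamed_rows, hash_table):
    """Probe the hash table once per streamed row; runs entirely in the mapper."""
    for userid, age, _gender in streamed_rows:
        for pageid in hash_table.get(userid, []):
            yield pageid, age

hash_table = build_hash_table(page_view)
pv_users = list(map_join(user, hash_table))
```

The trade-off is memory: the hinted table must fit on every mapper, which is why the hint is left to the user.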
![Page 27: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/27.jpg)
Hive QL – Group By
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
![Page 28: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/28.jpg)
Hive QL – Group By in Map Reduce
pv_users
pageid age
1 25
1 25
pageid age
2 32
1 25
Map
key value
<1,25> 2
key value
<1,25> 1
<2,32> 1
Shuffle Sort
key value
<1,25> 2
<1,25> 1
key value
<2,32> 1
Reduce
pageid age count
1 25 3
pageid age count
2 32 1
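The diagram shows each mapper emitting partial counts that the reducers then sum. A minimal Python sketch of that flow, using the two input splits from the slide, might look like this. Counter stands in for the mapper's hash table, and the function names are illustrative.

```python
from collections import Counter

# The two pv_users splits from the slide, as (pageid, age) rows.
split_1 = [(1, 25), (1, 25)]
split_2 = [(2, 32), (1, 25)]

def map_partial_counts(rows):
    """Aggregate inside the mapper: one partial count per distinct key."""
    return Counter(rows)

def reduce_counts(partials):
    """Sum the partial counts that arrive for each key."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

partials = [map_partial_counts(split_1), map_partial_counts(split_2)]
result = reduce_counts(partials)
```

The first mapper sends one record instead of two, which is where map-side aggregation saves shuffle traffic.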
![Page 29: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/29.jpg)
Group by Optimizations
» Map side partial aggregations
› Hash-based aggregates
› Serialized key/values in hash tables
› 90% speed improvement on Query
• SELECT count(1) FROM t;
» Load balancing for data skew
» Optimizations being Worked On:
› Exploit pre-sorted data for distinct counts
› Exploit table/column statistics for deciding strategy
![Page 30: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/30.jpg)
Columnar Storage
» CREATE table columnTable
(key STRING, value STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.ColumnarSerDe'
STORED AS RCFILE;
» Saved 25% of space compared with SequenceFile
› Based on one of the largest tables (30 columns) inside Facebook
› Both are compressed with GzipCodec
» Speed improvements in progress
› Need to propagate column-selection information to FileFormat
» *Contribution from Yongqiang He (outside Facebook)
![Page 31: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/31.jpg)
Speed Improvements over Time
| Date | SVN Revision | Major Changes | Query A | Query B | Query C |
| --- | --- | --- | --- | --- | --- |
| 2/22/2009 | 746906 | Before Lazy Deserialization | 83 sec | 98 sec | 183 sec |
| 2/23/2009 | 747293 | Lazy Deserialization | 40 sec | 66 sec | 185 sec |
| 3/6/2009 | 751166 | Map-side Aggregation | 22 sec | 67 sec | 182 sec |
| 4/29/2009 | 770074 | Object Reuse | 21 sec | 49 sec | 130 sec |
| 6/3/2009 | 781633 | Map-side Join * | 21 sec | 48 sec | 132 sec |
» QueryA: SELECT count(1) FROM t;
» QueryB: SELECT concat(concat(concat(a,b),c),d) FROM t;
» QueryC: SELECT * FROM t;
» Time measured is map-side time only (to avoid unstable shuffling time at reducer side). It includes time for decompression and compression (both using GzipCodec).
» * No performance benchmarks for Map-side Join yet.
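The talk does not say how its percentage improvements were computed. One plausible reading is a throughput ratio, (time before / time after - 1) × 100, which can be checked against the table. The function name below is hypothetical.

```python
def speed_improvement_pct(before_sec, after_sec):
    """Percent more work done per second after a change."""
    return (before_sec / after_sec - 1.0) * 100.0

# Query A, before vs. after Lazy Deserialization (rows 2/22 and 2/23).
lazy_deser_query_a = speed_improvement_pct(83, 40)

# Query C, before vs. after Object Reuse (rows 3/6 and 4/29).
object_reuse_query_c = speed_improvement_pct(182, 130)
```

Under this reading the two values come out near 107.5% and 40%, in line with the roughly 108% and 40% figures quoted elsewhere in the talk.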
![Page 32: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/32.jpg)
Overcoming Java Overhead
» Reuse objects
› Use Writable instead of Java Primitives
› Reuse objects across all rows
› *40% speed improvement on Query C
» Lazy deserialization
› Only deserialize the column when asked
› Very helpful for complex types (map/list/struct)
› *108% speed improvement on Query A
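Both ideas can be sketched together in Python: a single row object that is reused for every record and that parses a column only when it is first requested. This is an illustration of the technique, not Hive's SerDe code; the class, its methods, and the \001-delimited sample rows are assumptions.

```python
class LazyRow:
    """Holds the raw serialized row and parses a column only on first access."""

    def __init__(self, raw, delimiter="\x01"):
        self._delimiter = delimiter
        self._parsed = {}       # column index -> parsed value
        self.parse_count = 0    # instrumentation for this example only
        self.reset(raw)

    def reset(self, raw):
        """Reuse this object for the next row instead of allocating a new one."""
        self._raw = raw
        self._fields = None     # the raw text is split lazily
        self._parsed.clear()

    def get_int(self, index):
        """Parse column `index` as an int the first time it is asked for."""
        if index not in self._parsed:
            if self._fields is None:
                self._fields = self._raw.split(self._delimiter)
            self._parsed[index] = int(self._fields[index])
            self.parse_count += 1
        return self._parsed[index]

raw_rows = ["1\x01111\x0125", "2\x01111\x0125", "1\x01222\x0132"]
row = LazyRow(raw_rows[0])

# Like SELECT count(1): rows are counted without touching any column.
count = 0
for raw in raw_rows:
    row.reset(raw)
    count += 1
parses_for_count = row.parse_count

# Reading one column parses just that column.
age = row.get_int(2)
```

A count-only query never pays for parsing, which is consistent with Query A benefiting most from lazy deserialization in the table.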
![Page 33: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/33.jpg)
Generic UDF and UDAF
» Let UDF and UDAF accept complex-type parameters
» Integrate UDF and UDAF with Writables
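// Reused for every row so no new object is allocated per call.
private final IntWritable intWritable = new IntWritable();

public IntWritable evaluate(IntWritable a, IntWritable b) {
  intWritable.set(a.get() + b.get());
  return intWritable;
}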
![Page 34: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/34.jpg)
HQL Optimizations
» Predicate Pushdown
» Merging n-way join
» Column Pruning
![Page 35: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/35.jpg)
» Open Source Community
![Page 36: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/36.jpg)
Open Source Community
» 21 contributors and growing
› 6 contributors within Facebook
» Contributors from:
› Academia
› Other web companies
› Etc.
» 7 committers
› 1 external to Facebook and looking to add more here
![Page 37: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/37.jpg)
» 50 JIRAs fixed in last month
» 218 JIRAs still open
» 125 mails in last month on hive-user@
» 600 mails in last month on hive-dev@
» Various companies/universities
› Adknowledge, Admob
› Berkeley, Chinese Academy of Sciences
» Demonstration at VLDB 2009
![Page 38: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/38.jpg)
Deployment Options
» EC2
› http://wiki.apache.org/hadoop/Hive/HiveAws/HivingS3nRemotely
» Cloudera Virtual Machine
› http://www.cloudera.com/hadoop-training-hive-tutorial
» Your own cluster
› http://wiki.apache.org/hadoop/Hive/GettingStarted
» Hive can directly consume data on Hadoop
› CREATE EXTERNAL TABLE mytable (key STRING, value STRING) LOCATION '/user/abc/mytable';
![Page 39: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/39.jpg)
Future Work
» Benchmark & Performance
» Integration with BI tools (through JDBC/ODBC)
» Indexing
» More on Hive Roadmap
› http://wiki.apache.org/hadoop/Hive/Roadmap
» Machine Learning Integration
» Real-time Streaming
![Page 40: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/40.jpg)
Information
» Available as a sub project in Hadoop
- http://wiki.apache.org/hadoop/Hive (wiki)
- http://hadoop.apache.org/hive (home page)
- http://svn.apache.org/repos/asf/hadoop/hive (SVN repo)
- ##hive (IRC)
- Works with hadoop-0.17, 0.18, 0.19
» Release 0.3 is out and more are coming
» Mailing Lists:
› hive-{user,dev,commits}@hadoop.apache.org
![Page 41: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/41.jpg)
Contributors
» Aaron Newton
» Ashish Thusoo
» David Phillips
» Dhruba Borthakur
» Edward Capriolo
» Eric Hwang
» Hao Liu
» He Yongqiang
» Jeff Hammerbacher
» Johan Oskarsson
» Josh Ferguson
» Joydeep Sen Sarma
» Kim P.
» Michi Mutsuzaki
» Min Zhou
» Namit Jain
» Neil Conway
» Pete Wyckoff
» Prasad Chakka
» Raghotham Murthy
» Richard Lee
» Shyam Sundar Sarkar
» Suresh Antony
» Venky Iyer
» Zheng Shao
![Page 42: Hadoop Summit 2009 Hive](https://reader035.vdocuments.mx/reader035/viewer/2022062220/5550f66fb4c90501448b47a8/html5/thumbnails/42.jpg)
» Questions