Hive New Features and API
Facebook Hive Team
March 2010
JDBC/ODBC and CTAS
Hive ODBC Driver
• Architecture:
• Client / Driver Manager
• → (local call) dynamic libraries: unixODBC (libodbchive.so) + hiveclient (libhiveclient.so) + thriftclient (libthrift.so)
• → (network socket) HiveServer (in Java)
• → (local call) Hive + Hadoop
• unixODBC is not part of the Hive open source release, so you need to build it yourself:
• 32-bit/64-bit architecture
• Thrift has to be r790732
• Boost libraries
• Linking with 3rd party Driver Manager.
Facebook Use Case
» Hive integration with MicroStrategy 8.1.2 (HIVE-187) and 9.0.1 (HIVE-1101)
• FreeForm SQL (reports generated from user-input queries)
• Reports generated daily.
» All servers (MSTR IS server, HiveServer) are running on Linux.
• ODBC driver needs to be 32-bit.
Hive JDBC
» Embedded mode:
› jdbc:hive://
» Client/server mode:
› jdbc:hive://host:port/dbname
› host:port is where the Hive server is listening.
› Architecture is similar to ODBC.
Create table as select (CTAS)
• New feature in branch 0.5.
• E.g.,
CREATE TABLE T STORED AS TEXTFILE AS
SELECT a+1 a1, concat(b,c,d) b2
FROM S
WHERE …
Resulting schema:
T (a1 double, b2 string)
• The CREATE clause can take all table properties except EXTERNAL or PARTITIONED (support for those is on the roadmap).
• Atomicity: T will not be created if the select statement has an error.
Join Strategies
Left semi join
• Implementing IN/EXISTS subquery semantics:
SELECT A.*
FROM A WHERE A.KEY IN
(SELECT B.KEY FROM B WHERE B.VALUE > 100);
is equivalent to:
SELECT A.*
FROM A LEFT SEMI JOIN B
ON (A.KEY = B.KEY and B.VALUE > 100);
• Optimizations:
• map-side group-by to reduce data flowing to reducers
• early exit on match in the join.
Map Join Implementation
SELECT /*+MAPJOIN(a,c)*/ a.*, b.*, c.*
FROM a JOIN b ON a.key = b.key JOIN c ON a.key = c.key;
[Diagram: mappers are spawned over splits of big table b; files a1 and a2 of small table a and file c1 of small table c are replicated to Mapper 1, Mapper 2, and Mapper 3.]
1. Spawn mappers based on the big table.
2. All files of all small tables are replicated onto each mapper.
Bucket Map Join
set hive.optimize.bucketmapjoin = true;
1. Works together with map join.
2. All join tables are bucketized, and the big table's bucket count is a multiple of each small table's bucket count.
3. Bucket columns == join columns. (See the DDL sketch below.)
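A minimal DDL sketch of tables satisfying these conditions (big_t, small_t, and the bucket counts are illustrative, not from the original deck):
-- Big table with 4 buckets, small table with 2: 4 is a multiple of 2,
-- and the bucket column (key) is also the join column.
CREATE TABLE big_t (key INT, value STRING)
CLUSTERED BY (key) INTO 4 BUCKETS;
CREATE TABLE small_t (key INT, value STRING)
CLUSTERED BY (key) INTO 2 BUCKETS;
set hive.optimize.bucketmapjoin = true;
SELECT /*+MAPJOIN(small_t)*/ big_t.key, small_t.value
FROM big_t JOIN small_t ON big_t.key = small_t.key;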
Bucket Map Join Implementation
SELECT /*+MAPJOIN(a,c)*/ a.*, b.*, c.*
FROM a JOIN b ON a.key = b.key JOIN c ON a.key = c.key;
[Diagram: tables a, b, and c are all bucketized by key; a has 2 buckets, b has 2, and c has 1. Mapper 1 and Mapper 2 read bucket b1 and receive buckets a1 and c1; Mapper 3 reads bucket b2 and receives buckets a2 and c1. Normally in production, there will be thousands of buckets!]
1. Spawn mappers based on the big table.
2. Only matching buckets of all small tables are replicated onto each mapper.
Sort Merge Bucket Map Join
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
1. Works together with bucket map join.
2. Bucket columns == join columns == sort columns. (See the DDL sketch below.)
3. If the tables are partitioned, only the big table may span multiple partitions; the query must restrict each small table to a single partition.
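A hedged DDL sketch of SMB-eligible tables (hypothetical names and bucket counts; both tables are bucketized and sorted by the join key):
CREATE TABLE big_t (key INT, value STRING)
CLUSTERED BY (key) SORTED BY (key) INTO 4 BUCKETS;
CREATE TABLE small_t (key INT, value STRING)
CLUSTERED BY (key) SORTED BY (key) INTO 4 BUCKETS;
set hive.optimize.bucketmapjoin = true;
set hive.optimize.bucketmapjoin.sortedmerge = true;
set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
SELECT /*+MAPJOIN(small_t)*/ big_t.key, small_t.value
FROM big_t JOIN small_t ON big_t.key = small_t.key;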
Sort Merge Bucket Map Join
[Diagram: tables A, B, and C, each bucketized and sorted by key, holding rows such as (1, val_1), (3, val_3), (5, val_5), (4, val_4), (20, val_20), (23, val_23), (25, val_25); matching buckets are merged in sort order.]
• Small tables are read on demand.
• Entire small tables are NOT held in memory.
• Can perform outer join.
Skew Join
The join is bottlenecked on the reducer that gets the skewed key.
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = skew_key_threshold
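For example (a hedged sketch; 100000 is an illustrative threshold, the row count per key beyond which a key is treated as skewed):
set hive.optimize.skewjoin = true;
set hive.skewjoin.key = 100000;
SELECT a.*, b.*
FROM a JOIN b ON a.key = b.key;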
Skew Join
[Diagram, Job 1 (A join B): rows a-K2/b-K2 and a-K3/b-K3 are joined normally in Reducer 1 and Reducer 2; rows with the skewed key K1 (a-K1, b-K1) are instead written to HDFS files a-K1 and b-K1. Job 2: a-K1 map join b-K1. The outputs of both jobs are combined into the final results.]
Future Work
Skew Join with a Replication Algorithm
Memory Footprint Optimization
Views, HBase Integration
CREATE VIEW Syntax
CREATE VIEW [IF NOT EXISTS] view_name
[ (column_name [COMMENT column_comment], … ) ]
[COMMENT view_comment]
AS SELECT …
[ ORDER BY … LIMIT … ]
-- example
CREATE VIEW pokebaz(baz COMMENT 'this column used to be bar')
COMMENT 'views are good for layering on renaming'
AS SELECT bar FROM pokes;
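Once created, the view is queried like any other table, e.g.:
SELECT baz FROM pokebaz;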
View Features
» Other commands (examples below)
› SHOW TABLES: views show up too
› DESCRIBE: see view column descriptions
› DESCRIBE EXTENDED: retrieve view definition
» Enhancements on the way soon
› Dependency management (e.g. CASCADE/RESTRICT)
› Partition awareness
» Enhancements (long term)
› Updatable views
› Materialized views
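Continuing the pokebaz example, hedged usage of the commands above (output shapes are illustrative):
SHOW TABLES;                -- pokebaz shows up alongside base tables
DESCRIBE pokebaz;           -- column baz with its comment
DESCRIBE EXTENDED pokebaz;  -- also includes the view's defining SELECT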
HBase Storage Handler
CREATE TABLE users(
userid int, name string, email string, notes string)
STORED BY
'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" =
"small:name,small:email,large:notes");
HBase Storage Handler Features
» Commands supported
› CREATE EXTERNAL TABLE: register existing HTable (sketch below)
› SELECT: join, group by, union, etc. over multiple HBase tables, or mixing with native Hive tables
› INSERT: from any Hive query
» Enhancements needed (feedback on priority welcome)
› More flexible column mapping, ALTER TABLE
› Timestamp read/write/restrict
› Filter pushdown
› Partition support
› Write atomicity
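A hedged sketch of registering an existing HTable (the Hive table name, HTable name, and column mapping are illustrative):
CREATE EXTERNAL TABLE hbase_users(
userid int, name string, email string, notes string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "small:name,small:email,large:notes")
TBLPROPERTIES ("hbase.table.name" = "users");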
UDF, UDAF and UDTF
User-Defined Functions (UDF)
» 1 input to 1 output
» Typically used in SELECT
› SELECT concat(first, ' ', last) AS full_name …
» See the Hive language wiki for the full list of built-in UDFs
› http://wiki.apache.org/hadoop/Hive/LanguageManual
» Noteworthy features
› Sometimes you want to cast
• SELECT CAST(5.0/2.0 AS INT) …
› Conditional functions
• SELECT IF(boolean, if_true, if_not_true) …
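A small hedged example combining these (table and column names are hypothetical):
SELECT concat(first, ' ', last) AS full_name,
CAST(age / 2.0 AS INT) AS half_age,
IF(age >= 18, 'adult', 'minor') AS age_group
FROM people;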
User Defined Aggregate Functions (UDAF)
» N inputs to 1 output
» Typically used with GROUP BY
› SELECT count(1) FROM … GROUP BY age
› SELECT count(DISTINCT first_name) FROM … GROUP BY last_name
› sum(), avg(), min(), max()
» For skew
› set hive.groupby.skewindata = true;
› set hive.map.aggr.hash.percentmemory = <some lower value>
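A hedged sketch of a skewed aggregation with these settings (0.3 is an illustrative fraction of memory for the map-side hash aggregation; table and column names are hypothetical):
set hive.groupby.skewindata = true;
set hive.map.aggr.hash.percentmemory = 0.3;
SELECT last_name, count(DISTINCT first_name)
FROM people
GROUP BY last_name;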
User Defined Table-Generating Functions (UDTF)
» 1 input to N outputs
» explode(Array<?> arg)
› Converts an array into multiple rows, with one element per row
» Transform-like syntax
› SELECT udtf(col0, col1, …) AS colAlias FROM srcTable
» Lateral view syntax
› … FROM baseTable LATERAL VIEW udtf(col0, col1, …) tableAlias AS colAlias
» Also see: http://bit.ly/hive-udtf
UDTF using Transform Syntax
» SELECT explode(group_ids) AS group_id FROM src
[Diagram: each group_ids array in table src is exploded into one group_id per output row.]
UDTF using Lateral View Syntax
» SELECT src.*, myTable.* FROM src LATERAL VIEW explode(group_ids) myTable AS group_id
UDTF using Lateral View Syntax
[Diagram: a row of src with group_ids = [1, 2, 3] is exploded by explode(group_ids) myTable AS group_id into group_id values 1, 2, and 3; the result joins each input row to its output rows.]
SerDe – Serialization/Deserialization
SerDe Examples
» CREATE TABLE mylog (
user_id BIGINT,
page_url STRING,
unix_time INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
» CREATE TABLE mylog_rc (
user_id BIGINT,
page_url STRING,
unix_time INT)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
STORED AS RCFILE;
SerDe
» SerDe is short for serialization/deserialization. It controls the format of a row.
» Serialized format:
› Delimited format (tab, comma, ctrl-a …)
› Thrift Protocols
› ProtocolBuffer*
» Deserialized (in-memory) format:
› Java Integer/String/ArrayList/HashMap
› Hadoop Writable classes
› User-defined Java Classes (Thrift, ProtocolBuffer*)
» * ProtocolBuffer support not available yet.
Where is SerDe?
[Diagram: SerDe sits between the FileFormat / Hadoop serialization layer and the Hive operators, in both the mapper and the reducer. On the read path, a file on HDFS is read as a stream, decoded by the FileFormat into Writables, deserialized by the SerDe into hierarchical objects, and handed to Hive operators (or to a user script); on the write path the same steps run in reverse, through the map output file and back to files on HDFS. An ObjectInspector describes each hierarchical object.]
Examples at each layer:
› File on HDFS: delimited text rows such as "imp 1.0 3 54", "clk 2.2 8 212", or thrift_record<…> entries
› Writable: BytesWritable(\x3F\x64\x72\x00), or Text('imp 1.0 3 54') // UTF-8 encoded
› Hierarchical object, one of:
• Java Object: object of a Java class
• Standard Object: ArrayList for struct and array, HashMap for map
• LazyObject: lazily-deserialized
Object Inspector
[Diagram: SerDe.deserialize turns a Writable into a hierarchical object, and SerDe.getObjectInspector (getOI) returns the ObjectInspector describing it; a TypeInfo captures the type tree. ObjectInspectors are navigated recursively, e.g. getType, getStructField / getFieldOI for struct fields, and getMapValue / getMapValueOI for map values, down to primitive inspectors such as int and string.]
Example: for the Java classes
class HO { HashMap<String, String> a; Integer b; List<ClassC> c; String d; }
class ClassC { Integer a; Integer b; }
the row Text('a=av:b=bv 23 1:2=4:5 abcd') deserializes into the standard object
List( HashMap("a"→"av", "b"→"bv"), 23, List( List(1, null), List(2, 4), List(5, null) ), "abcd" )
whose struct, map, list, int, and string parts are each described by a matching ObjectInspector, e.g. getStructField(0) returns HashMap("a"→"av", "b"→"bv") and getMapValue("a") returns "av".
When to add a new SerDe
» User has data with a special serialized format not yet supported by Hive, and does not want to convert the data before loading it into Hive.
» User has a more efficient way of serializing the data on disk.
How to add a new SerDe for text data
» Follow the example in contrib/src/java/org/apache/hadoop/hive/contrib/serde2/RegexSerDe.java
» RegexSerDe uses a user-provided regular expression to deserialize data.
» CREATE TABLE apache_log(host STRING,
identity STRING, user STRING, time STRING, request STRING,
status STRING, size STRING, referer STRING, agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?",
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s”)
STORED AS TEXTFILE;
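Once defined, apache_log is queried like any native table, with the regex applied at read time (a hedged usage example):
SELECT host, status, count(1)
FROM apache_log
GROUP BY host, status;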
How to add a new SerDe for binary data
» Follow the examples in:
› contrib/src/java/org/apache/hadoop/hive/contrib/serde2/thrift (HIVE-706)
› serde/src/java/org/apache/hadoop/hive/serde2/binarysortable
» CREATE TABLE mythrift_table
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.thrift.ThriftSerDe'
WITH SERDEPROPERTIES (
"serialization.class" = "com.facebook.serde.tprofiles.full",
"serialization.format" = "com.facebook.thrift.protocol.TBinaryProtocol“);
» NOTE: Column information is provided by the SerDe class.
Q & A