hadoop-day4-part1 (transcript)
Hadoop Training
Objectives – Day 4
§ Hive Sampling (bucketing) § Explain § Functions § Advanced Features Break
§ Pig Overview § Schemas § Operators § UDF § Pig Exercise 1 § Best Practices § HCatalog
©2013 Zaloni, Inc. All Rights Reserved.
Views
¡ A way of decomposing complex queries. ¡ Only query-able views; updatable views are not supported. ¡ Since views are read-only, they may not be used as the target of LOAD/INSERT/ALTER. ¡ Querying a view may start MapReduce jobs. ¡ Materialized views are not supported.
CREATE VIEW view_name AS SELECT select_statement;
DROP VIEW [IF EXISTS] view_name;
SELECT * FROM view_name [ …. ];
Views
Sampling: Buckets TABLESAMPLE clause
Buckets - Description
¡ Enables efficient query execution. ¡ Joins can take advantage of buckets, especially map joins. ¡ Makes sampling of large datasets efficient – TABLESAMPLE clause. ¡ Physical layout: bucket n is the nth file, when arranged in lexicographic order. Buckets correspond to MapReduce job output files and are stored as files under the table or partition directory, e.g.
/user/hive/warehouse/bucketed_users/attempt_201005221636_0016_r_000000_0
attempt_201005221636_0016_r_000001_0
attempt_201005221636_0016_r_000002_0
attempt_201005221636_0016_r_000003_0
Example :
CREATE TABLE bucketed_demo (id INT, name STRING) CLUSTERED BY (id) INTO 16 BUCKETS;
CREATE TABLE bucketed_demo (id INT, name STRING) PARTITIONED BY (year INT) CLUSTERED BY (id) SORTED BY (id) INTO 16 BUCKETS;
¡ CLUSTERED BY clause is used to specify the columns to bucket on and the number of buckets
¡ SORTED BY used to declare that a table has sorted buckets
Buckets - DDL
set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE bucketed_demo PARTITION (year=2010) SELECT id, firstname FROM demo;
§ Distribution of rows: hash_function(bucketing_column) mod num_buckets
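A minimal Python sketch of this distribution rule (an illustration, not Hive's actual code: real Hive hashes via per-type ObjectInspectors, but for INT columns the hash code is the value itself, which Python's hash() also satisfies for small ints):

```python
def bucket_for(value, num_buckets):
    # Hive assigns a row to bucket hash(bucketing_column) mod num_buckets.
    return hash(value) % num_buckets

# Eight hypothetical ids spread across 4 buckets.
buckets = {}
for row_id in range(8):
    buckets.setdefault(bucket_for(row_id, 4), []).append(row_id)
# buckets: {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

Rows with equal bucketing-column values always land in the same bucket, which is what makes bucketed map joins and sampling work.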
Buckets - DML
¡ The TABLESAMPLE clause allows the users to write queries for samples of the data instead of the whole table
¡ Syntax : TABLESAMPLE (BUCKET x OUT OF y [ON colname])
¡ The buckets are numbered from 1 to y ¡ colname indicates the column on which to sample (or "bucket") each row ¡ Rows which belong to bucket x are returned ¡ Typically the entire table is scanned to fetch the sample, unless the table was created with matching buckets
TABLESAMPLE Clause
Examples:
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
SELECT * FROM bucketed_demo TABLESAMPLE(BUCKET 3 OUT OF 32 ON id) s;
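The selection rule can be sketched in Python, assuming an integer bucketing column (the ids 0–99 are hypothetical; for INT columns Hive's hash is the value itself):

```python
def in_sample(value, x, y):
    # TABLESAMPLE(BUCKET x OUT OF y ON col) keeps rows whose
    # hash(col) mod y falls into bucket x (buckets numbered 1..y).
    return hash(value) % y == x - 1

# BUCKET 3 OUT OF 32 over ids 0..99 keeps roughly 1/32 of the rows.
sample = [i for i in range(100) if in_sample(i, 3, 32)]
# sample == [2, 34, 66, 98]
```

When the table was bucketed on the same column with a compatible bucket count, Hive can read just the matching bucket files instead of scanning everything.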
TABLESAMPLE Clause
Explain
Example : EXPLAIN EXTENDED SELECT * FROM temperature t1 JOIN temperature t2 ON (t1.stationno=t2.stationno) JOIN temperature t3 ON (t1.stationno=t3.stationno);
¡ EXPLAIN shows the execution plan of a query ¡ Tells how many MapReduce jobs will be used for the query ¡ EXTENDED (optional), if used, produces extra information
Explain – Description & Example
Functions: UDF, UDAF, UDTF
Functions – Description
¡ 3 types : ¡ UDF – User Defined Function ¡ UDAF – User Defined Aggregate Functions ¡ UDTF – User Defined Table Generating Functions
¡ Operates on a single row and produces a single row as its output
¡ Takes one or more columns as arguments, and calls can be nested inside other functions
¡ Built-In : Mathematical, String, Date, Collection functions etc.
¡ Example :
SELECT concat(firstname, lastname) FROM demo;
SELECT array(january, february, march) FROM temperature;
¡ Type conversion UDF : cast
SELECT cast(year AS STRING) FROM temperature LIMIT 10;
UDF
¡ Works on multiple input rows and creates a single output row
¡ Aggregate functions like COUNT, MAX, SUM etc. ¡ Example :
SELECT sum(salary) FROM employees; SELECT max(temperature),min(temperature)
FROM temperature_union GROUP BY month;
UDAF
¡ Operates on a single row and produces multiple rows—a table—as output
¡ Example – explode() converts the values in an array into separate rows of a table
SELECT explode(subordinates) as sub FROM employees;
Here, subordinates is an array field. "sub" is an alias column name which must be specified. ¡ Lateral views for UDTFs
SELECT name, sub FROM employees LATERAL VIEW explode(subordinates) subTable AS sub;
UDTF
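The explode/LATERAL VIEW semantics can be mimicked in Python (the employee names below are made up for illustration):

```python
# Each (name, subordinates) row is paired with every element of its
# array, mirroring: LATERAL VIEW explode(subordinates) subTable AS sub
employees = [("Ann", ["Bob", "Cal"]), ("Dee", ["Eli"])]
exploded = [(name, sub) for name, subs in employees for sub in subs]
# exploded == [("Ann", "Bob"), ("Ann", "Cal"), ("Dee", "Eli")]
```

A plain `SELECT explode(...)` returns only the generated column; the LATERAL VIEW form is what lets you keep the other columns (here `name`) alongside it.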
CASE statement
¡ CASE statements are like IF-THEN-ELSE ¡ Example : we want to categorize stations into either "Missing" or "Eastern Hemisphere" based on their longitude values
SELECT name,
  CASE WHEN longitude = -999 OR latitude = -999 THEN 'Missing'
       WHEN longitude < 0 THEN 'Eastern'
       ELSE 'None'
  END
FROM stations;
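The same branching, sketched as a plain Python function (illustration only; the -999 sentinel for missing coordinates follows the query above):

```python
def station_label(longitude, latitude):
    # Mirrors the Hive CASE expression: -999 marks a missing coordinate.
    if longitude == -999 or latitude == -999:
        return "Missing"
    if longitude < 0:
        return "Eastern"
    return "None"
```

Like Hive's CASE, the branches are evaluated in order and the first matching WHEN wins.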
¡ List all functions SHOW FUNCTIONS
¡ List a particular function SHOW FUNCTIONS 'concat'
¡ Describing a function DESCRIBE FUNCTION [EXTENDED] <function_name> DESCRIBE FUNCTION 'concat'
Viewing functions
Advanced Features : Custom UDFs Transform (Map Reduce scripts) SerDe
¡ Custom functions that can be plugged into Hive and used with HQL ¡ Developed/coded in Java. ¡ Example : a DateFormatter UDF

import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class DateFormatter extends UDF {
    public Text evaluate(Text timestamp, String inputFormat, String outputFormat) throws ParseException {
        SimpleDateFormat formatter = new SimpleDateFormat(inputFormat);
        SimpleDateFormat newFormatter = new SimpleDateFormat(outputFormat);
        return new Text(newFormatter.format(formatter.parse(timestamp.toString())));
    }
}
Custom User Defined Functions
¡ UDFs are packaged in jars and added to Hive using the following commands in the Hive shell
add jar /path/to/udf.jar;
create temporary function func_name as 'pkgname.classname';
¡ Example
add jar /training/custom.jar;
create temporary function convert_to_date as 'DateFormatter';
select convert_to_date('2011-01-01', 'yyyy-MM-dd', 'yyyy/dd/MM') from demo limit 1;
Example
¡ Technique used to invoke custom map or reduce operations from Hive
¡ Example : We want to filter out bad data – rows with negative id values – from the DEMO table. Python script (filter.py):

#!/usr/bin/env python
import sys

for line in sys.stdin:
    (id, fnm, lnm, place) = line.strip().split()
    if int(id) > 0:
        print("%s\t%s\t%s\t%s" % (id, fnm, lnm, place))
Transform (MR Scripts)
¡ Using it in Hive (the script must first be shipped to the cluster, e.g. ADD FILE /path/to/filter.py;)
FROM demo SELECT TRANSFORM(id, firstname, lastname, country) USING 'filter.py' AS id, firstname, lastname, country;
Example
¡ TEXTFILE : Default (see Storage Formats slide)
¡ SEQUENCEFILE : a binary, space-efficient format supported by Hadoop. CREATE TABLE tab(col1 … ) ………. STORED AS SEQUENCEFILE
¡ Compression properties SET hive.exec.compress.output=true; SET io.seqfile.compression.type=BLOCK;
¡ Mostly used in a CTAS (CREATE TABLE … AS SELECT) structure or INSERT … SELECT over pre-existing tables.
File Formats
¡ RC FILE : Record Columnar storage stores data by row groups, then by columns within each group.
CREATE TABLE tab(col1 … ) ………. STORED AS RCFILE
¡ Keeps a "split's worth" of rows in the same split, but stores the data by column within the split.
¡ A column store is more efficient when a query projects only a subset of columns, because it reads only the necessary columns from disk, whereas a row store reads the entire row.
File Formats
Feature: InputFormat/OutputFormat
- Description: how records are encoded in files and how query results are written.
- Clause in DDL: STORED AS INPUTFORMAT '…' OUTPUTFORMAT '…'
- Details: INPUTFORMATs are responsible for splitting an input stream into records; OUTPUTFORMATs are responsible for writing records to an output stream (i.e., query results).
- Number of classes: two separate classes are used – one for input and one for output.
- Defaults: INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

Feature: SerDe
- Description: how columns/fields are encoded in records.
- Clause in DDL: ROW FORMAT SERDE '…' [ WITH SERDEPROPERTIES ( …. ) ]
- Details: SerDes are responsible for tokenizing a record into columns/fields and also encoding columns/fields into records.
- Number of classes: one class handles both serialization and deserialization.
- Defaults: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe (delimited textual format, with lazy field access)
File Format vs SerDe
¡ Add the SerDe jar to Hive using the ADD JAR command ¡ Example
CREATE TABLE tab(col1 … ) ……….
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "\"([^\"]*)\"~\"([^\"]*)\"~\"([^\"]*)\"",
  "output.format.string" = "a:%1$s,b:%2$s,c:%3$s"
)
STORED AS TEXTFILE;
SerDe - Example
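A quick way to sanity-check such a pattern is to try it outside Hive; here is the same regex applied in Python to a made-up record:

```python
import re

# The pattern the RegexSerDe example supplies via "input.regex":
# three tilde-separated, double-quoted fields become three columns.
pattern = re.compile(r'"([^"]*)"~"([^"]*)"~"([^"]*)"')
columns = pattern.match('"alpha"~"beta"~"gamma"').groups()
# columns == ("alpha", "beta", "gamma")
```

Each capture group in the regex maps, in order, to one column of the table; rows the pattern cannot match deserialize to NULLs.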
¡ Example CREATE TABLE json_data (
country string, languages array<string>,
religions map<string,array<int>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'nesteddata.txt' OVERWRITE INTO TABLE json_data ;
SELECT * from json_data;
-- data : {"country":"Switzerland","languages":["German","French","Italian"],"religions":{"catholic":[10,20],"protestant":[40,50]}}
-- result: Switzerland ["German","French","Italian"] {"catholic":[10,20],"protestant":[40,50]}
SerDe - Example
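What the JsonSerDe does per line can be approximated with Python's json module, using the sample record from the slide:

```python
import json

# One line of nesteddata.txt; JsonSerDe maps the JSON object onto the
# table's string / array<string> / map<string,array<int>> columns.
line = ('{"country":"Switzerland",'
        '"languages":["German","French","Italian"],'
        '"religions":{"catholic":[10,20],"protestant":[40,50]}}')
record = json.loads(line)
```

Nested JSON arrays and objects line up naturally with Hive's array and map column types, which is why no flattening is needed at load time.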
¡ Must specify both INPUT & OUTPUT format
¡ Example
CREATE TABLE tab(col1 … ) ……….
STORED AS
INPUTFORMAT 'com.zaloni.training.XMLInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
File Format - Example
https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining
¡ N-gram frequency estimation : ngrams() and context_ngrams()
¡ Use cases : (ngrams) find trending topics in text; (context_ngrams) extract marketing intelligence around certain words (e.g., "Twitter is ___")
¡ Estimating frequency distributions : histogram_numeric()
¡ Use cases : estimating the frequency distribution of a column, possibly grouped by other attributes.
Statistics & Data Mining Functions
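A hand-rolled sketch of what n-gram frequency estimation computes (Hive's ngrams() does this at scale with approximate top-k; the sentence below is made up):

```python
from collections import Counter

# Count bigrams (2-grams) in a tokenized sentence, then take the top-1,
# analogous to ngrams(sentences(col), 2, 1) in Hive.
words = "the quick fox jumped over the quick dog".split()
bigrams = Counter(zip(words, words[1:]))
top = bigrams.most_common(1)
# top == [(("the", "quick"), 2)]
```

context_ngrams() is the same idea with a fixed context: only n-grams matching a template like ("the", None) are counted.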
Best Practices
¡ A typical Hive query can be decomposed into one or more stages which may be independent of each other. In the default case, where parallelism is not enabled, stages (jobs) would execute sequentially.
¡ Enabling parallelism makes better use of the cluster and reduces overall execution time.
¡ Properties : set hive.exec.parallel=true in conf/hive-site.xml in the Hive installation directory.
Parallel Execution
¡ Instead of forcing Map Joins in the query we can set hive properties (before running a JOIN query) to convert normal join queries into Map Joins if the inputs meet the small file criteria else continue with the common (reduce-side) joins.
¡ Properties : set hive.auto.convert.join = true; set hive.smalltable.filesize = 40000000; set hive.hashtable.max.memory.usage = 0.9;
Joins
¡ Put the biggest table at the end: the last table in the join is streamed, while the rest are buffered in memory
¡ Example (temperature is the larger table):
SELECT * FROM .... temperature t1 JOIN .. temperature t2 JOIN ... station s .... (station s is streamed)
SELECT * FROM .... temperature t1 JOIN .. station s ... temperature t2 ... (temperature t2 is streamed – the better approach)
Joins
¡ ORDER BY can prove to be a bottleneck since it uses a single reducer, so we should try to avoid ORDER BY for large datasets. Alternative : use SORT BY and then merge the files (if the requirement permits), or use MapReduce jobs to sort the entire file.
¡ UDAFs without GROUP BY use a single reducer.
¡ Avoid FULL OUTER JOINs. Alternative : analyze the dataset and break the query into UNION ALLs.
¡ SELECT * FROM table should preferably be accompanied by a LIMIT clause.
Others
¡ Avoid too many partitions
¡ Avoid partitions with small sets of data
¡ De-normalization is good for large sets of data
¡ Hive does not enforce constraints or checks for null values while loading data into the table (schema on read). So we should be sure of the integrity of data.
¡ It is preferable to use a local or remote metastore
Others
Hive Properties
¡ Some useful Hive properties to set before executing Hive queries:

set hive.exec.parallel=true : whether to execute jobs in parallel, provided the Hadoop scheduler is the Fair Scheduler or Capacity Scheduler. By default it is false.
set hive.exec.dynamic.partition=true : to use dynamic partition inserts.
set hive.enforce.bucketing=true : allows the correct number of reducers and the CLUSTER BY column to be automatically selected based on the table.
set hive.auto.convert.join=true : to allow automatic conversion into map joins based on file size.
set hive.exec.dynamic.partition.mode=nonstrict : to allow dynamic partitions without static partitions.

There are many such useful properties that can be set according to requirements before query execution or in hive-site.xml.
Hive Indexing
Indexing is a standard database technique, but with many possible variations. Hive supports indexing from version 0.7.1; however, Hive 0.7.1 does not automatically use indexes (query rewrites). Starting with Hive 0.8, that is possible. Key point: different ways to speed up queries in Hive are: • Columnar storage • Data partitioning • Indexing (a different view of the same data)
Example (create an index on the name field of the stations table, then populate it):
CREATE INDEX idxstations ON TABLE stations(name) AS 'compact' WITH DEFERRED REBUILD STORED AS RCFILE;
ALTER INDEX idxstations ON stations REBUILD;
DROP INDEX idxstations ON stations;
Hive Indexing
The index table is automatically named training__stations_idxstations__
hive> SHOW TABLES;
You will find this table and can describe it or query it too. If data in the base table changes, then the REBUILD command must be used to bring the index up to date.
Hive Indexing
Hive Exercise 4 Hive Book Use case Part 2
12) Write a query to create & populate a table (user, country[location[2]], age from the users table) that buckets the ages into 5 different country groups. Find out the number of users from each country for the third group. Browse through MaprFS to the Hive warehouse directory and check whether the bucketed file (0002..) contains the same records as displayed in the output of the Hive query. (Use of buckets and the TABLESAMPLE clause.)
13) Extract HOST and PATH FROM Image-URL-S field. (Use parse_url built-in UDF)
14) Use the explode UDTF on the locations field to divide it into rows. (Use of the EXPLODE UDTF.)
15) Find out books containing author names - John OR Jack? Write the output to a local directory.
16) Find out 10 most frequently used four-grams (i.e. 4 words that occur together most frequently) in book titles. (Use of ngrams UDAF along with SENTENCES UDF and EXPLODE UDTF).
17) Find out 10 most frequently used words that come after the word "the" in book titles. (Use of context_ngrams UDAF along with SENTENCES UDF & explode UDTF)
18) Find out the average population by age group who rate a particular book. (Use GROUP BY and JOINS)
19) Use the Age Group Divider UDAF to bucket the users into different age groups. You can specify the size of a bucket and whether you want the minimum (2nd argument: true) or maximum (2nd argument: false) age that lies in each bucket. The UDAF returns a map object mapping buckets to user age groups.
Book Exercise Tasks – Part 2
End Day 4