hadoop-day4-part1 (transcript)
Hadoop Training
Objectives – Day 4
§ Hive Sampling (bucketing) § Explain § Functions § Advanced Features Break
§ Pig Overview § Schemas § Operators § UDF § Pig Exercise 1 § Best Practices § HCatalog
©2013 Zaloni, Inc. All Rights Reserved.
Views
¡ A way of decomposing complex queries. ¡ Only query-able views; updatable views are not supported. ¡ Since views are read-only, they may not be used as the target of LOAD/INSERT/ALTER. ¡ Querying a view may start MapReduce jobs. ¡ Materialized views are not supported.
CREATE VIEW view_name AS SELECT select_statement;
DROP VIEW [IF EXISTS] view_name;
SELECT * FROM view_name [ …. ];
Views
Sampling: Buckets TABLESAMPLE clause
Buckets - Description
¡ Enables efficient query execution. ¡ Joins can take advantage of buckets, especially map joins. ¡ Makes sampling of large datasets efficient – TABLESAMPLE clause. ¡ Physical layout: bucket n is the nth file, when arranged in lexicographic order. Buckets correspond to MapReduce job output files and are stored as files under the table or partition directory, e.g.
/user/hive/warehouse/bucketed_users/attempt_201005221636_0016_r_000000_0
attempt_201005221636_0016_r_000001_0
attempt_201005221636_0016_r_000002_0
attempt_201005221636_0016_r_000003_0
Example :
CREATE TABLE bucketed_demo (id INT, name STRING) CLUSTERED BY (id) INTO 16 BUCKETS;
CREATE TABLE bucketed_demo (id INT, name STRING) PARTITIONED BY (year INT) CLUSTERED BY (id) SORTED BY (id) INTO 16 BUCKETS;
¡ CLUSTERED BY clause is used to specify the columns to bucket on and the number of buckets
¡ SORTED BY used to declare that a table has sorted buckets
Buckets - DDL
set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE bucketed_demo PARTITION (year=2010) SELECT id, firstname FROM demo;
§ Distribution of rows: hash_function(bucketing_column) mod num_buckets
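A minimal Python sketch of this distribution rule (an illustration, not Hive's actual code: real Hive hashes via per-type ObjectInspectors, but for INT columns the hash code is the value itself, which Python's hash() also satisfies for small ints):

```python
def bucket_for(value, num_buckets):
    # Hive assigns a row to bucket hash(bucketing_column) mod num_buckets.
    return hash(value) % num_buckets

# Eight hypothetical ids spread across 4 buckets.
buckets = {}
for row_id in range(8):
    buckets.setdefault(bucket_for(row_id, 4), []).append(row_id)
# buckets: {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```

Rows with equal bucketing-column values always land in the same bucket, which is what makes bucketed map joins and sampling work.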
Buckets - DML
¡ The TABLESAMPLE clause allows the users to write queries for samples of the data instead of the whole table
¡ Syntax : TABLESAMPLE (BUCKET x OUT OF y [ON colname])
¡ The buckets are numbered from 1 to y ¡ colname indicates the column on which to sample (or "bucket") each row ¡ Rows which belong to bucket x are returned ¡ Typically the entire table is scanned to fetch the sample, unless the table was created with matching buckets
TABLESAMPLE Clause
Examples:
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
SELECT * FROM bucketed_demo TABLESAMPLE(BUCKET 3 OUT OF 32 ON id) s;
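The selection rule can be sketched in Python, assuming an integer bucketing column (the ids 0–99 are hypothetical; for INT columns Hive's hash is the value itself):

```python
def in_sample(value, x, y):
    # TABLESAMPLE(BUCKET x OUT OF y ON col) keeps rows whose
    # hash(col) mod y falls into bucket x (buckets numbered 1..y).
    return hash(value) % y == x - 1

# BUCKET 3 OUT OF 32 over ids 0..99 keeps roughly 1/32 of the rows.
sample = [i for i in range(100) if in_sample(i, 3, 32)]
# sample == [2, 34, 66, 98]
```

When the table was bucketed on the same column with a compatible bucket count, Hive can read just the matching bucket files instead of scanning everything.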
TABLESAMPLE Clause
Explain
Example : EXPLAIN EXTENDED SELECT * FROM temperature t1 JOIN temperature t2 ON (t1.stationno=t2.stationno) JOIN temperature t3 ON (t1.stationno=t3.stationno);
¡ EXPLAIN shows the execution plan of a query ¡ Tells how many MapReduce jobs will be used for the query ¡ EXTENDED (optional), if used, produces extra information
Explain – Description & Example
Functions: UDF, UDAF, UDTF
Functions – Description
¡ 3 types : ¡ UDF – User Defined Function ¡ UDAF – User Defined Aggregate Functions ¡ UDTF – User Defined Table Generating Functions
¡ Operates on a single row and produces a single row as its output
¡ Takes one or more columns as arguments, and calls can be nested inside other functions
¡ Built-In : Mathematical, String, Date, Collection functions etc.
¡ Example :
SELECT concat(firstname, lastname) FROM demo;
SELECT array(january, february, march) FROM temperature;
¡ Type conversion UDF : cast
SELECT cast(year AS STRING) FROM temperature LIMIT 10;
UDF
¡ Works on multiple input rows and creates a single output row
¡ Aggregate functions like COUNT, MAX, SUM etc. ¡ Example :
SELECT sum(salary) FROM employees; SELECT max(temperature),min(temperature)
FROM temperature_union GROUP BY month;
UDAF
¡ Operates on a single row and produces multiple rows—a table—as output
¡ Example – explode() converts the values in an array into separate rows of a table
SELECT explode(subordinates) as sub FROM employees;
Here, subordinates is an array field. "sub" is an alias column name which must be specified. ¡ Lateral views for UDTFs
SELECT name, sub FROM employees LATERAL VIEW explode(subordinates) subTable AS sub;
UDTF
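The explode/LATERAL VIEW semantics can be mimicked in Python (the employee names below are made up for illustration):

```python
# Each (name, subordinates) row is paired with every element of its
# array, mirroring: LATERAL VIEW explode(subordinates) subTable AS sub
employees = [("Ann", ["Bob", "Cal"]), ("Dee", ["Eli"])]
exploded = [(name, sub) for name, subs in employees for sub in subs]
# exploded == [("Ann", "Bob"), ("Ann", "Cal"), ("Dee", "Eli")]
```

A plain `SELECT explode(...)` returns only the generated column; the LATERAL VIEW form is what lets you keep the other columns (here `name`) alongside it.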
CASE statement
¡ CASE statements are like IF-THEN-ELSE ¡ Example : we want to categorize stations into either "Missing" or "Eastern Hemisphere" based on their longitude values
SELECT name,
  CASE WHEN longitude = -999 OR latitude = -999 THEN 'Missing'
       WHEN longitude < 0 THEN 'Eastern'
       ELSE 'None'
  END
FROM stations;
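The same branching, sketched as a plain Python function (illustration only; the -999 sentinel for missing coordinates follows the query above):

```python
def station_label(longitude, latitude):
    # Mirrors the Hive CASE expression: -999 marks a missing coordinate.
    if longitude == -999 or latitude == -999:
        return "Missing"
    if longitude < 0:
        return "Eastern"
    return "None"
```

Like Hive's CASE, the branches are evaluated in order and the first matching WHEN wins.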
¡ List all functions SHOW FUNCTIONS
¡ List a particular function SHOW FUNCTIONS 'concat'
¡ Describing a function DESCRIBE FUNCTION [EXTENDED] <function_name> DESCRIBE FUNCTION 'concat'
Viewing functions
Advanced Features : Custom UDFs Transform (Map Reduce scripts) SerDe
¡ Custom functions that can be plugged into Hive and used with HQL ¡ Developed/coded in Java. ¡ Example : a DateFormatter UDF

import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class DateFormatter extends UDF {
    public Text evaluate(Text timestamp, String inputFormat, String outputFormat) throws ParseException {
        SimpleDateFormat formatter = new SimpleDateFormat(inputFormat);
        SimpleDateFormat newFormatter = new SimpleDateFormat(outputFormat);
        return new Text(newFormatter.format(formatter.parse(timestamp.toString())));
    }
}
Custom User Defined Functions
¡ UDFs are packaged in jars and added to Hive using the following commands in the Hive shell
add jar /path/to/udf.jar;
create temporary function func_name as 'pkgname.classname';
¡ Example
add jar /training/custom.jar;
create temporary function convert_to_date as 'DateFormatter';
select convert_to_date('2011-01-01', 'yyyy-MM-dd', 'yyyy/dd/MM') from demo limit 1;
Example
¡ Technique used to invoke custom map or reduce operations from Hive
¡ Example : We want to filter out bad data – rows with negative id values – from the DEMO table. Python script (filter.py):

#!/usr/bin/env python
import sys

for line in sys.stdin:
    (id, fnm, lnm, place) = line.strip().split()
    if int(id) > 0:
        print("%s\t%s\t%s\t%s" % (id, fnm, lnm, place))
Transform (MR Scripts)
¡ Using it in Hive (the script must first be shipped to the cluster, e.g. ADD FILE /path/to/filter.py;)
FROM demo SELECT TRANSFORM(id, firstname, lastname, country) USING 'filter.py' AS id, firstname, lastname, country;
Example
¡ TEXTFILE : Default (see Storage Formats slide)
¡ SEQUENCEFILE : a binary, space-efficient format supported by Hadoop. CREATE TABLE tab(col1 … ) ………. STORED AS SEQUENCEFILE
¡ Compression properties SET hive.exec.compress.output=true; SET io.seqfile.compression.type=BLOCK;
¡ Mostly used in a CTAS (CREATE TABLE … AS SELECT) structure or INSERT … SELECT over pre-existing tables.
File Formats
¡ RC FILE : Record Columnar storage stores data by row groups, then by columns within each group.
CREATE TABLE tab(col1 … ) ………. STORED AS RCFILE
¡ Keeps a "split's worth" of rows in the same split, but stores the data by column within the split.
¡ A column store is more efficient when a query projects only a subset of columns, because it reads only the necessary columns from disk, whereas a row store reads the entire row.
File Formats
Feature: InputFormat/OutputFormat
- Description: how records are encoded in files and how query results are written.
- Clause in DDL: STORED AS INPUTFORMAT '…' OUTPUTFORMAT '…'
- Details: INPUTFORMATs are responsible for splitting an input stream into records; OUTPUTFORMATs are responsible for writing records to an output stream (i.e., query results).
- Number of classes: two separate classes are used – one for input and one for output.
- Defaults: INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

Feature: SerDe
- Description: how columns/fields are encoded in records.
- Clause in DDL: ROW FORMAT SERDE '…' [ WITH SERDEPROPERTIES ( …. ) ]
- Details: SerDes are responsible for tokenizing a record into columns/fields and also encoding columns/fields into records.
- Number of classes: one class handles both serialization and deserialization.
- Defaults: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe (delimited textual format, with lazy field access)
File Format vs SerDe
¡ Add the SerDe jar to Hive using the ADD JAR command ¡ Example
CREATE TABLE tab(col1 … ) ……….
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "\"([^\"]*)\"~\"([^\"]*)\"~\"([^\"]*)\"",
  "output.format.string" = "a:%1$s,b:%2$s,c:%3$s"
)
STORED AS TEXTFILE;
SerDe - Example
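A quick way to sanity-check such a pattern is to try it outside Hive; here is the same regex applied in Python to a made-up record:

```python
import re

# The pattern the RegexSerDe example supplies via "input.regex":
# three tilde-separated, double-quoted fields become three columns.
pattern = re.compile(r'"([^"]*)"~"([^"]*)"~"([^"]*)"')
columns = pattern.match('"alpha"~"beta"~"gamma"').groups()
# columns == ("alpha", "beta", "gamma")
```

Each capture group in the regex maps, in order, to one column of the table; rows the pattern cannot match deserialize to NULLs.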
¡ Example CREATE TABLE json_data (
country string, languages array<string>,
religions map<string,array<int>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH 'nesteddata.txt' OVERWRITE INTO TABLE json_data ;
SELECT * from json_data;
-- data : {"country":"Switzerland","languages":["German","French","Italian"],"religions":{"catholic":[10,20],"protestant":[40,50]}}
-- result: Switzerland ["German","French","Italian"] {"catholic":[10,20],"protestant":[40,50]}
SerDe - Example
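What the JsonSerDe does per line can be approximated with Python's json module, using the sample record from the slide:

```python
import json

# One line of nesteddata.txt; JsonSerDe maps the JSON object onto the
# table's string / array<string> / map<string,array<int>> columns.
line = ('{"country":"Switzerland",'
        '"languages":["German","French","Italian"],'
        '"religions":{"catholic":[10,20],"protestant":[40,50]}}')
record = json.loads(line)
```

Nested JSON arrays and objects line up naturally with Hive's array and map column types, which is why no flattening is needed at load time.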
¡ Must specify both INPUT & OUTPUT format
¡ Example
CREATE TABLE tab(col1 … ) ……….
STORED AS
INPUTFORMAT 'com.zaloni.training.XMLInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
File Format - Example
https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining
¡ N-gram frequency estimation : ngrams() and context_ngrams()
¡ Use cases : (ngrams) find trending topics in text; (context_ngrams) extract marketing intelligence around certain words (e.g., "Twitter is ___")
¡ Estimating frequency distributions : histogram_numeric()
¡ Use cases : estimating the frequency distribution of a column, possibly grouped by other attributes.
Statistics & Data Mining Functions
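A hand-rolled sketch of what n-gram frequency estimation computes (Hive's ngrams() does this at scale with approximate top-k; the sentence below is made up):

```python
from collections import Counter

# Count bigrams (2-grams) in a tokenized sentence, then take the top-1,
# analogous to ngrams(sentences(col), 2, 1) in Hive.
words = "the quick fox jumped over the quick dog".split()
bigrams = Counter(zip(words, words[1:]))
top = bigrams.most_common(1)
# top == [(("the", "quick"), 2)]
```

context_ngrams() is the same idea with a fixed context: only n-grams matching a template like ("the", None) are counted.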
Best Practices
¡ A typical Hive query can be decomposed into one or more stages which may be independent of each other. In the default case, where parallelism is not enabled, stages (jobs) would execute sequentially.
¡ Enabling parallelism makes better use of the cluster and reduces overall execution time.
¡ Properties : set hive.exec.parallel=true in conf/hive-site.xml in the Hive installation directory.
Parallel Execution
¡ Instead of forcing Map Joins in the query we can set hive properties (before running a JOIN query) to convert normal join queries into Map Joins if the inputs meet the small file criteria else continue with the common (reduce-side) joins.
¡ Properties : set hive.auto.convert.join = true; set hive.smalltable.filesize = 40000000; set hive.hashtable.max.memory.usage = 0.9;
Joins
¡ Put the biggest table at the end: the last table in the join is streamed, while the rest are buffered in memory
¡ Example (temperature is the larger table):
SELECT * FROM .... temperature t1 JOIN .. temperature t2 JOIN ... station s .... (station s is streamed)
SELECT * FROM .... temperature t1 JOIN .. station s ... temperature t2 ... (temperature t2 is streamed – the better approach)
Joins
¡ ORDER BY can prove to be a bottleneck since it uses a single reducer, so we should try to avoid ORDER BY for large datasets. Alternative : use SORT BY and then merge the files (if the requirement permits), or use MapReduce jobs to sort the entire file.
¡ UDAFs without GROUP BY use a single reducer.
¡ Avoid FULL OUTER JOINs. Alternative : analyze the dataset and break the query into UNION ALLs.
¡ SELECT * FROM table should preferably be accompanied by a LIMIT clause.
Others
¡ Avoid too many partitions
¡ Avoid partitions with small sets of data
¡ De-normalization is good for large sets of data
¡ Hive does not enforce constraints or checks for null values while loading data into the table (schema on read). So we should be sure of the integrity of data.
¡ It is preferable to use a local or remote metastore
Others
Hive Properties
¡ Some useful Hive properties to set before executing Hive queries:

set hive.exec.parallel=true : whether to execute jobs in parallel, provided the Hadoop scheduler is the Fair Scheduler or Capacity Scheduler. By default it is false.
set hive.exec.dynamic.partition=true : to use dynamic partition inserts.
set hive.enforce.bucketing=true : allows the correct number of reducers and the CLUSTER BY column to be automatically selected based on the table.
set hive.auto.convert.join=true : to allow automatic conversion into map joins based on file size.
set hive.exec.dynamic.partition.mode=nonstrict : to allow dynamic partitions without static partitions.

There are many such useful properties that can be set according to requirements before query execution or in hive-site.xml.
Hive Indexing
Indexing is a standard database technique, but with many possible variations. Hive supports indexing from version 0.7.1; however, Hive 0.7.1 does not automatically use indexes (query rewrites). Starting with Hive 0.8, that is possible. Key point: different ways to speed up queries in Hive are: • Columnar storage • Data partitioning • Indexing (a different view of the same data)
Example (create an index on the name field of the stations table, then populate it):
CREATE INDEX idxstations ON TABLE stations(name) AS 'compact' WITH DEFERRED REBUILD STORED AS RCFILE;
ALTER INDEX idxstations ON stations REBUILD;
DROP INDEX idxstations ON stations;
Hive Indexing
The index table is automatically named training__stations_idxstations__
hive> SHOW TABLES;
You will find this table and can describe it or query it too. If data in the base table changes, then the REBUILD command must be used to bring the index up to date.
Hive Indexing
Hive Exercise 4 Hive Book Use case Part 2
12) Write a query to create & populate a table (user, country[location[2]], age from the users table) that buckets the ages into 5 different country groups. Find out the number of users from each country for the third group. Browse through MaprFS to the Hive warehouse directory and check whether the bucketed file (0002..) contains the same records as displayed in the output of the Hive query. (Use of buckets and the TABLESAMPLE clause.)
13) Extract HOST and PATH FROM Image-URL-S field. (Use parse_url built-in UDF)
14) Use the explode UDTF on the locations field to divide it into rows. (Use of the EXPLODE UDTF.)
15) Find out books containing author names - John OR Jack? Write the output to a local directory.
16) Find out 10 most frequently used four-grams (i.e. 4 words that occur together most frequently) in book titles. (Use of ngrams UDAF along with SENTENCES UDF and EXPLODE UDTF).
17) Find out 10 most frequently used words that come after the word "the" in book titles. (Use of context_ngrams UDAF along with SENTENCES UDF & explode UDTF)
18) Find out the average population by age group who rate a particular book. (Use GROUP BY and JOINS)
19) Use the Age Group Divider UDAF to bucket the users into different age groups. You can specify the size of a bucket and whether you want the minimum (2nd argument: true) or maximum (2nd argument: false) age that lies in each bucket. The UDAF returns a map object mapping buckets to user age groups.
Book Exercise Tasks – Part 2
End Day 4