
Page 1: Hadoop-Day4-Part1_1


Hadoop Training

Page 2: Hadoop-Day4-Part1_1


Objectives – Day 4

• Hive Sampling (bucketing)
• Explain
• Functions
• Advanced Features

Break

• Pig Overview
• Schemas
• Operators
• UDF
• Pig Exercise 1
• Best Practices
• HCatalog


Page 3: Hadoop-Day4-Part1_1


Views

Page 4: Hadoop-Day4-Part1_1


• A way of decomposing complex queries.
• Only queryable views; updatable views are not supported.
• Since views are read-only, they may not be used as the target of LOAD/INSERT/ALTER.
• Querying a view starts MapReduce jobs.
• Materialized views are not supported.

CREATE VIEW view_name AS SELECT select_statement;
DROP VIEW [IF EXISTS] view_name;
SELECT * FROM view_name [ …. ];
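A minimal sketch of a view in action, reusing the temperature table that appears in later slides (the view name and the predicate are illustrative):

CREATE VIEW recent_temps AS
SELECT stationno, year FROM temperature WHERE year >= 2000;

SELECT * FROM recent_temps LIMIT 10;   -- this SELECT launches a MapReduce job over the base table
DROP VIEW IF EXISTS recent_temps;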

Views

Page 5: Hadoop-Day4-Part1_1


Sampling: Buckets TABLESAMPLE clause

Page 6: Hadoop-Day4-Part1_1


Buckets - Description

• Enables efficient query execution.
• Joins can take advantage of buckets, especially map joins.
• Makes sampling of large datasets efficient via the TABLESAMPLE clause.
• Physical layout: bucket n is the nth file when the files are arranged in lexicographic order. The files correspond to MapReduce job output and are stored under the table or partition directory, e.g.:

/user/hive/warehouse/bucketed_users/attempt_201005221636_0016_r_000000_0
attempt_201005221636_0016_r_000001_0
attempt_201005221636_0016_r_000002_0
attempt_201005221636_0016_r_000003_0

Page 7: Hadoop-Day4-Part1_1


Example:

CREATE TABLE bucketed_demo (id INT, name STRING)
CLUSTERED BY (id) INTO 16 BUCKETS;

CREATE TABLE bucketed_demo (id INT, name STRING)
PARTITIONED BY (year INT)
CLUSTERED BY (id) SORTED BY (id) INTO 16 BUCKETS;

• The CLUSTERED BY clause is used to specify the columns to bucket on and the number of buckets.
• SORTED BY is used to declare that a table has sorted buckets.

Buckets - DDL

Page 8: Hadoop-Day4-Part1_1


SET hive.enforce.bucketing = true;

INSERT OVERWRITE TABLE bucketed_demo PARTITION (year=2010)
SELECT id, firstname FROM demo;

• Distribution of rows: hash_function(bucketing_column) mod num_buckets
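For example, since the hash of an INT bucketing column is the integer value itself: with 16 buckets, a row with id = 35 goes to bucket 35 mod 16 = 3, i.e. the fourth bucket file, because the output files are numbered from 0.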

Buckets - DML

Page 9: Hadoop-Day4-Part1_1


• The TABLESAMPLE clause allows users to write queries against a sample of the data instead of the whole table.
• Syntax: TABLESAMPLE (BUCKET x OUT OF y [ON colname])
• The buckets are numbered from 1 to y.
• colname indicates the column on which to sample (or "bucket") each row.
• Rows which belong to bucket x are returned.
• Typically the entire table is scanned to fetch the sample, unless the table was created with matching buckets.

TABLESAMPLE Clause

Page 10: Hadoop-Day4-Part1_1


Examples:

SELECT * FROM source TABLESAMPLE (BUCKET 3 OUT OF 32 ON rand()) s;

SELECT * FROM bucketed_demo TABLESAMPLE (BUCKET 3 OUT OF 32 ON id) s;

TABLESAMPLE Clause

Page 11: Hadoop-Day4-Part1_1


Explain

Page 12: Hadoop-Day4-Part1_1


Example:

EXPLAIN EXTENDED
SELECT * FROM temperature t1
JOIN temperature t2 ON (t1.stationno = t2.stationno)
JOIN temperature t3 ON (t1.stationno = t3.stationno);

• EXPLAIN shows the execution plan of a query.
• Tells how many MapReduce jobs will be used for the query.
• EXTENDED (optional), if used, produces extra information.

Explain – Description & Example

Page 13: Hadoop-Day4-Part1_1


Functions: UDF, UDAF, UDTF

Page 14: Hadoop-Day4-Part1_1


Functions – Description

• 3 types:
  • UDF – User Defined Function
  • UDAF – User Defined Aggregate Function
  • UDTF – User Defined Table Generating Function

Page 15: Hadoop-Day4-Part1_1


• Operates on a single row and produces a single row as its output.
• Takes one or more columns as arguments, and can be nested inside other functions.
• Built-in: mathematical, string, date, collection functions, etc.
• Examples:

SELECT concat(firstname, lastname) FROM demo;

SELECT array(january, february, march) FROM temperature;

• Type conversion UDF: cast

SELECT cast(year AS string) FROM temperature LIMIT 10;

UDF

Page 16: Hadoop-Day4-Part1_1


• Works on multiple input rows and creates a single output row.
• Aggregate functions like COUNT, MAX, SUM, etc.
• Example:

SELECT sum(salary) FROM employees;

SELECT max(temperature), min(temperature)
FROM temperature_union GROUP BY month;

UDAF

Page 17: Hadoop-Day4-Part1_1


• Operates on a single row and produces multiple rows – a table – as output.
• Example: explode() converts the values in an array into separate rows of a table.

SELECT explode(subordinates) AS sub FROM employees;

Here, subordinates is an array field. "sub" is an alias column name, which must be specified.

• Lateral views for UDTFs:

SELECT name, sub FROM employees
LATERAL VIEW explode(subordinates) subTable AS sub;

UDTF

Page 18: Hadoop-Day4-Part1_1


CASE statement

• CASE statements are like IF-THEN-ELSE.
• Example: we want to categorize stations into either "Missing" or "Eastern Hemisphere" based on their longitude values.

SELECT name,
  CASE
    WHEN longitude = -999 OR latitude = -999 THEN 'Missing'
    WHEN longitude < 0 THEN 'Eastern'
    ELSE 'None'
  END
FROM stations;

Page 19: Hadoop-Day4-Part1_1


• List all functions:
  SHOW FUNCTIONS;

• List a particular function:
  SHOW FUNCTIONS "concat";

• Describing a function:
  DESCRIBE FUNCTION [EXTENDED] <function_name>;
  DESCRIBE FUNCTION "concat";

Viewing functions

Page 20: Hadoop-Day4-Part1_1


Advanced Features: Custom UDFs, Transform (MapReduce scripts), SerDe

Page 21: Hadoop-Day4-Part1_1


• Custom functions that can be plugged into Hive and used with HQL.
• Developed/coded in Java.
• Example: DateFormatter

import java.text.ParseException;
import java.text.SimpleDateFormat;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class DateFormatter extends UDF {

  // Re-formats a timestamp string from inputFormat to outputFormat.
  public Text evaluate(Text timestamp, String inputFormat, String outputFormat)
      throws ParseException {
    SimpleDateFormat formatter = new SimpleDateFormat(inputFormat);
    SimpleDateFormat newFormatter = new SimpleDateFormat(outputFormat);
    return new Text(newFormatter.format(formatter.parse(timestamp.toString())));
  }
}

Custom User Defined Functions

Page 22: Hadoop-Day4-Part1_1


• UDFs are packaged in JARs and added to Hive using the following commands in the Hive shell:

ADD JAR /path/to/udf.jar;
CREATE TEMPORARY FUNCTION func_name AS 'pkgname.classname';

• Example (note the SimpleDateFormat patterns use MM for months; mm means minutes):

ADD JAR /training/custom.jar;
CREATE TEMPORARY FUNCTION convert_to_date AS 'DateFormatter';
SELECT convert_to_date('2011-01-01', 'yyyy-MM-dd', 'yyyy/dd/MM') FROM demo LIMIT 1;

Example

Page 23: Hadoop-Day4-Part1_1


• Technique used to invoke custom map or reduce operations from Hive.
• Example: we want to filter out bad data from the demo table, bad data being rows with negative id values.

Python script (filter.py):

#!/usr/bin/env python
# Reads tab-delimited rows from Hive on stdin; emits only rows with a positive id.
import sys

for line in sys.stdin:
    (id, fnm, lnm, place) = line.strip().split('\t')
    if int(id) > 0:
        print("%s\t%s\t%s\t%s" % (id, fnm, lnm, place))

Transform (MR Scripts)

Page 24: Hadoop-Day4-Part1_1


• Using it in Hive (the script must first be shipped to the cluster with ADD FILE):

ADD FILE filter.py;

FROM demo
SELECT TRANSFORM (id, firstname, lastname, country)
USING 'filter.py'
AS id, firstname, lastname, country;

Example

Page 25: Hadoop-Day4-Part1_1


• TEXTFILE: the default (see the Storage Formats slide).

• SEQUENCEFILE: a binary, space-efficient format supported by Hadoop.

CREATE TABLE tab (col1 … ) ………. STORED AS SEQUENCEFILE;

• Compression properties:

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

• Mostly used in a CTAS (CREATE TABLE … AS SELECT) structure or an INSERT … SELECT over pre-existing tables, as in the sketch below.
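A minimal CTAS sketch under these settings, reusing the demo table from earlier slides (the table name demo_seq is illustrative):

SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;

-- Copy demo into a new table stored as a block-compressed SequenceFile
CREATE TABLE demo_seq
STORED AS SEQUENCEFILE
AS SELECT * FROM demo;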

File Formats

Page 26: Hadoop-Day4-Part1_1


• RCFILE: Record Columnar storage stores data by row groups, then by columns within each group.

CREATE TABLE tab (col1 … ) ………. STORED AS RCFILE;

• Keeps a "split's worth" of rows in the same split, but stores the data column by column within the split.

• A column store is more efficient when a query projects only a subset of columns, because it reads only the necessary columns from disk, whereas a row store reads entire rows.

File Formats

Page 27: Hadoop-Day4-Part1_1


Feature: Description
  InputFormat/OutputFormat: How records are encoded in files and how query results are written.
  SerDe: How columns/fields are encoded in records.

Feature: Clause in DDL
  InputFormat/OutputFormat: STORED AS INPUTFORMAT '…' OUTPUTFORMAT '…'
  SerDe: ROW FORMAT SERDE '…' [WITH SERDEPROPERTIES ( …. )]

Feature: Details
  InputFormat/OutputFormat: InputFormats are responsible for splitting an input stream into records; OutputFormats are responsible for writing records to an output stream (i.e., query results).
  SerDe: SerDes are responsible for tokenizing a record into columns/fields (deserialization) and also for encoding columns/fields into records (serialization).

Feature: Number of classes
  InputFormat/OutputFormat: Two separate classes are used – one for input and one for output.
  SerDe: There is one class for both serialization and deserialization.

Feature: Defaults
  InputFormat/OutputFormat: INPUTFORMAT 'org.apache.hadoop.mapreduce.lib.input.TextInputFormat', OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  SerDe: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe – delimited textual format, with lazy field access.

File Format vs SerDe

Page 28: Hadoop-Day4-Part1_1


• Add the SerDe JAR to Hive using the ADD JAR command.
• Example (note that ROW FORMAT SERDE precedes STORED AS in the DDL):

CREATE TABLE tab (col1 … ) ……….
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "\"([^\"]*)\"~\"([^\"]*)\"~\"([^\"]*)\"",
  "output.format.string" = "a:%1$s,b:%2$s,c:%3$s"
)
STORED AS TEXTFILE;
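For illustration, a hypothetical input line this regex would match, and the row it yields (the values v1..v3 are made up):

-- input line : "v1"~"v2"~"v3"
-- parsed row : first column = v1, second = v2, third = v3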

SerDe - Example

Page 29: Hadoop-Day4-Part1_1


• Example:

CREATE TABLE json_data (
  country string,
  languages array<string>,
  religions map<string, array<int>>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH 'nesteddata.txt' OVERWRITE INTO TABLE json_data;

SELECT * FROM json_data;

-- data  : {"country":"Switzerland","languages":["German","French","Italian"],"religions":{"catholic":[10,20],"protestant":[40,50]}}
-- result: Switzerland  ["German","French","Italian"]  {"catholic":[10,20],"protestant":[40,50]}

SerDe - Example

Page 30: Hadoop-Day4-Part1_1


• Must specify both the INPUT and OUTPUT formats.

• Example:

CREATE TABLE tab (col1 … ) ……….
STORED AS
INPUTFORMAT 'com.zaloni.training.XMLInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

File Format - Example

Page 31: Hadoop-Day4-Part1_1


https://cwiki.apache.org/confluence/display/Hive/StatisticsAndDataMining

• N-gram frequency estimation: ngrams() and context_ngrams()
• Use cases:
  (ngrams) Find trending topics in text.
  (context_ngrams) Extract marketing intelligence around certain words (e.g., "Twitter is ___").
• Estimating frequency distributions: histogram_numeric()
• Use cases:
  Estimating the frequency distribution of a column, possibly grouped by other attributes (see the sketch below).
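A hedged sketch of these built-ins, reusing the books and users tables from the exercises (the table and column names are assumptions):

-- 10 most frequent bigrams in book titles
SELECT explode(ngrams(sentences(lower(title)), 2, 10)) AS bigram FROM books;

-- 10 most frequent words following "the" in book titles
SELECT explode(context_ngrams(sentences(lower(title)), array("the", null), 10)) AS next_word FROM books;

-- Approximate 10-bin frequency distribution of user ages
SELECT explode(histogram_numeric(age, 10)) AS bin FROM users;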

Statistics & Data Mining Functions

Page 32: Hadoop-Day4-Part1_1


Best Practices

Page 33: Hadoop-Day4-Part1_1


• A typical Hive query can be decomposed into one or more stages, which may be independent of each other. In the default case, where parallelism is not enabled, the stages (jobs) execute sequentially.

• Enabling parallelism makes for better cluster and time utilization.

• Property: set hive.exec.parallel=true in conf/hive-site.xml in the Hive installation directory, or per session as sketched below.
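A minimal per-session sketch (the thread count of 8 is only illustrative):

SET hive.exec.parallel=true;
-- Cap on the number of stages that may run concurrently
SET hive.exec.parallel.thread.number=8;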

Parallel Execution

Page 34: Hadoop-Day4-Part1_1


• Instead of forcing map joins in the query, we can set Hive properties (before running a JOIN query) to convert normal join queries into map joins when the inputs meet the small-file criteria, and otherwise continue with common (reduce-side) joins. The explicit-hint alternative is sketched below.

• Properties:

SET hive.auto.convert.join = true;
SET hive.smalltable.filesize = 40000000;
SET hive.hashtable.max.memory.usage = 0.9;
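For contrast, a hedged sketch of forcing a map join with an explicit hint rather than auto conversion (it assumes stations is the small table and has a stationno join key):

SELECT /*+ MAPJOIN(s) */ t.*, s.name
FROM temperature t JOIN stations s ON (t.stationno = s.stationno);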

Joins

Page 35: Hadoop-Day4-Part1_1


• Put the biggest table at the end: the last table in the join is streamed, while the rest are buffered in memory.

• Example (temperature is the larger table):

SELECT * FROM .... temperature t1 JOIN .. temperature t2 JOIN ... station s ....
  (station s, the small table, is streamed)

SELECT * FROM .... temperature t1 JOIN .. station s ... JOIN ... temperature t2 ...
  (temperature t2 is streamed – the better approach)
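Alternatively, the streamed table can be named explicitly with a hint, regardless of where it appears in the FROM clause (a sketch on the same assumed tables):

SELECT /*+ STREAMTABLE(t1) */ t1.*, s.name
FROM temperature t1 JOIN stations s ON (t1.stationno = s.stationno);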

Joins

Page 36: Hadoop-Day4-Part1_1


• ORDER BY can prove to be a bottleneck since it uses a single reducer, so we should try to avoid ORDER BY for large datasets. Alternative: use SORT BY and then merge the files (if the requirement permits), or use MapReduce jobs to sort the entire file (see the sketch below).

• UDAFs without GROUP BY use a single reducer.

• Avoid FULL OUTER JOINs. Alternative: analyze the dataset and break the query into UNION ALLs.

• SELECT * FROM table should preferably be accompanied by a LIMIT clause.
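A hedged sketch of the SORT BY alternative: DISTRIBUTE BY routes all rows for a station to the same reducer and SORT BY orders rows within each reducer, so several reducers work in parallel and their sorted outputs can be merged afterwards if a total order is needed (the reducer count is illustrative):

SET mapred.reduce.tasks=4;   -- illustrative reducer count

SELECT stationno, year
FROM temperature
DISTRIBUTE BY stationno
SORT BY stationno, year;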

Others

Page 37: Hadoop-Day4-Part1_1


• Avoid too many partitions.

• Avoid partitions with small sets of data.

• De-normalization is good for large sets of data.

• Hive does not enforce constraints or check for null values while loading data into a table (schema on read), so we should be sure of the integrity of the data.

• It is preferable to use a local or remote metastore.

Others

Page 38: Hadoop-Day4-Part1_1


Hive Properties

• Some useful Hive properties to set before executing Hive queries:

Property: set hive.exec.parallel=true
  Description: Whether to execute jobs in parallel, provided the Hadoop scheduler is the Fair Scheduler or the Capacity Scheduler. False by default.

Property: set hive.exec.dynamic.partition=true
  Description: Enables dynamic-partition inserts.

Property: set hive.enforce.bucketing=true
  Description: Allows the correct number of reducers and the cluster-by column to be selected automatically based on the table.

Property: set hive.auto.convert.join=true
  Description: Allows automatic conversion into map joins based on file size.

Property: set hive.exec.dynamic.partition.mode=nonstrict
  Description: Allows dynamic partitions without static partitions (see the sketch below).

There are many such useful properties that can be set according to requirements, before query execution or in hive-site.xml.
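A minimal dynamic-partition insert sketch, reusing bucketed_demo and demo from the earlier bucketing slides (the assumption that demo carries a year column is mine):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- The partition column (year) must come last in the SELECT list
INSERT OVERWRITE TABLE bucketed_demo PARTITION (year)
SELECT id, firstname, year FROM demo;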

Page 39: Hadoop-Day4-Part1_1


Hive Indexing

Indexing is a standard database technique, but with many possible variations. Hive supports indexing from release 0.7.1; however, Hive 0.7.1 does not automatically use indexes (query rewrites). Starting with Hive 0.8, that is possible.

Key point: different ways to speed up queries in Hive are:
• Columnar storage
• Data partitioning
• Indexing (a different view of the same data)

Page 40: Hadoop-Day4-Part1_1


Example (create an index on the station name field of the stations table, then populate the index):

CREATE INDEX idxstations ON TABLE stations (name)
AS 'compact' WITH DEFERRED REBUILD
STORED AS RCFILE;

ALTER INDEX idxstations ON stations REBUILD;

DROP INDEX idxstations ON stations;

Hive Indexing

Page 41: Hadoop-Day4-Part1_1


The index table is automatically named training__stations_idxstations__.

hive> SHOW TABLES;

You will find this table, and you can describe or query it too. If the data in the base table changes, the REBUILD command must be used to bring the index up to date.

Hive Indexing

Page 42: Hadoop-Day4-Part1_1


Hive Exercise 4: Hive Book Use Case, Part 2

Page 43: Hadoop-Day4-Part1_1


12) Write a query to create and populate a table (user, country [location[2]], age from the users table) that clusters the rows into 5 buckets by country. Find the number of users from each country in the third bucket. Browse through MapR-FS to the Hive warehouse directory and check whether the bucketed file (0002…) contains the same records as displayed in the output of the Hive query. (Use of buckets and the TABLESAMPLE clause.)

13) Extract HOST and PATH from the Image-URL-S field. (Use the parse_url built-in UDF.)

14) Use explode on the locations field to divide it into rows. (Use of the EXPLODE UDTF.)

15) Find books containing the author names John OR Jack. Write the output to a local directory.

16) Find the 10 most frequently used four-grams (i.e., 4 words that occur together most frequently) in book titles. (Use of the ngrams UDAF along with the SENTENCES UDF and the EXPLODE UDTF.)

17) Find the 10 most frequently used words that come after the word "the" in book titles. (Use of the context_ngrams UDAF along with the SENTENCES UDF and the EXPLODE UDTF.)

18) Find the average population by age group who rate a particular book. (Use GROUP BY and JOINs.)

19) Use the Age Group Divider UDAF to bucket the users into different age groups. You can specify the size of a bucket, and whether you want the minimum (2nd argument: true) or maximum (2nd argument: false) age that lies in that bucket. The UDAF returns a map object mapping buckets to user age groups.

Book Exercise Tasks – Part 2

Page 44: Hadoop-Day4-Part1_1


End Day 4