Introduction to Apache Pig

Pelle Jakovits

28 September 2016, Tartu


Outline

• MapReduce recollection

• Apache Pig

– How to run Pig

– Pig Latin

• Data structures

• Examples

– Execution flow

– Advantages & Disadvantages


You already know MapReduce

• MapReduce = Map, GroupBy, Sort, Reduce

• Designed for huge-scale data processing

• Provides

– Distributed file system

– High scalability

– Automatic parallelization

– Automatic fault recovery

• Data is replicated

• Failed tasks are re-executed on other nodes
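
The four phases named on this slide can be sketched in plain Python. This is a single-process illustration of the data flow only, not Hadoop code: the function names `map_phase`, `group_phase` and `reduce_phase` and the sample input are invented for the example.

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def group_phase(pairs):
    """GroupBy: collect all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sort the keys, then Reduce each group of values to a single sum."""
    return [(key, sum(groups[key])) for key in sorted(groups)]

def word_count(lines):
    return reduce_phase(group_phase(map_phase(lines)))

print(word_count(["pig latin", "pig on hadoop"]))
```

In a real cluster the map and reduce calls run on many nodes in parallel and the grouping happens during the shuffle; the sketch only shows what each phase contributes.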


But is MapReduce enough?

• Hadoop MapReduce is one of the most widely used frameworks for large-scale data processing

• However:

– Writing low-level MapReduce code is slow

– Optimizing MapReduce code takes a lot of expertise

– Prototyping is slow

– A lot of custom code is required

• Even for the simplest tasks

– Hard to manage more complex MapReduce job chains


Apache Pig

• A data flow framework on top of Hadoop MapReduce

– Retains all its advantages

– And some of its disadvantages

• Models a scripting language

– Fast prototyping

• Uses the Pig Latin language

– Similar to declarative SQL

– Easier to get started with

• Pig Latin statements are automatically translated into MapReduce jobs


Running Pig

• Local mode

– Everything installed locally on one machine

• Distributed mode

– Everything runs in a MapReduce cluster

• Interactive mode

– Grunt shell

• Batch mode

– Pig scripts


Pig Latin

• Write complex MapReduce transformations using a much simpler scripting language

• Not quite SQL, but similar

• Lazy evaluation: statements are only executed once output is requested (DUMP or STORE)

• Compilation into MapReduce jobs is hidden from the user
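
Lazy evaluation means that entering a statement only extends a plan, and the work happens when output is requested. The idea can be sketched in Python as follows. `LazyRelation` and its methods are hypothetical names made up for this illustration and say nothing about how Pig is implemented internally.

```python
class LazyRelation:
    """Hypothetical stand-in for a relation: records steps, runs them on dump()."""

    def __init__(self, source, steps=()):
        self.source = source
        self.steps = tuple(steps)

    def foreach(self, fn):
        # Only extends the plan; no data is touched yet.
        return LazyRelation(self.source, self.steps + (("foreach", fn),))

    def filter(self, predicate):
        # Same here: the predicate is stored, not applied.
        return LazyRelation(self.source, self.steps + (("filter", predicate),))

    def dump(self):
        # Output is requested, so the recorded plan is executed now.
        rows = list(self.source)
        for kind, fn in self.steps:
            if kind == "foreach":
                rows = [fn(row) for row in rows]
            else:
                rows = [row for row in rows if fn(row)]
        return rows

calls = []

def double(x):
    calls.append(x)  # record that the function really ran
    return x * 2

plan = LazyRelation([1, 2, 3]).foreach(double).filter(lambda x: x > 2)
print(calls)        # still empty: nothing has been executed
print(plan.dump())  # the whole plan runs here
```

Because the full plan is known before anything runs, a system built this way has the chance to optimize across statements before generating jobs.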


Pig Latin Data Structures

• Relation

– Can contain nested relations

– Similar to a table in a relational database

– Is a bag of tuples (the outermost bag)

• Bag

– An unordered collection of tuples

• Tuple

– An ordered set of fields

– Similar to a row in a relational database

– Can contain any number of fields; does not have to match other tuples

• Field

– A piece of data
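
As a rough mental model, the four structures can be mapped onto ordinary Python values. This is only an analogy under the assumption that a list stands in for a bag (its order carries no meaning); the sample values are invented.

```python
# Field: a single piece of data.
name = "John"

# Tuple: an ordered collection of fields.
john = ("John", 18, 4.0)
mary = ("Mary", 19)  # tuples in the same bag may differ in length

# Bag: an unordered collection of tuples. A list is used here only as a
# container; the order of its elements is not part of the model.
students = [john, mary]

# Nesting: a field may itself be a bag, which is what GROUP produces.
grouped = (18, [john])

def arity(t):
    """Number of fields in a tuple."""
    return len(t)

print([arity(t) for t in students])
print(grouped[1][0][0])
```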


Pig Example

• A = LOAD 'student' USING PigStorage() AS (name, age, gpa);

• DUMP A;

– (John, 18, 4.0F)

– (Mary, 19, 3.8F)

– (Bill, 20, 3.9F)

– (Joe, 18, 3.8F)

• B = GROUP A BY age;

• C = FOREACH B GENERATE AVG(A.gpa);
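
What the GROUP and AVG steps compute on the four sample tuples can be traced in plain Python. This is a sketch of the semantics, not of Pig's execution; unlike the Pig statement, it keeps the age next to each average so the result is easier to read.

```python
from collections import defaultdict

A = [("John", 18, 4.0), ("Mary", 19, 3.8), ("Bill", 20, 3.9), ("Joe", 18, 3.8)]

def group_by_age(rows):
    """Mimics GROUP A BY age: one (age, bag_of_rows) pair per distinct age."""
    bags = defaultdict(list)
    for row in rows:
        bags[row[1]].append(row)
    return sorted(bags.items())

def average_gpa(grouped):
    """Mimics AVG over the gpa field of each group's bag."""
    return [(age, sum(r[2] for r in bag) / len(bag)) for age, bag in grouped]

B = group_by_age(A)
C = average_gpa(B)
for age, avg in C:
    print(age, round(avg, 2))
```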


WordCount in Pig

A = load '/tmp/books/books';

B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;

C = group B by word;

D = foreach C generate COUNT(B), group;

store D into '/user/labuser/pelle_jakovits/out';

• Input and output are HDFS folders or files

– /tmp/books/books

– /user/labuser/pelle_jakovits/out

• A, B, C, D are relations

• The right-hand side of each statement contains Pig expressions


Fields

• Consists of either:

– Data atoms - int, long, float, double, chararray, boolean, datetime, etc.

– Complex data - Bag, Map, Tuple

• Assigning types to fields

– A = LOAD 'student' AS (name:chararray, age:int, gpa:float);

• Referencing fields

– By order - $0, $1, $2

– By name - assigned by user schemas

• A = LOAD 'in.txt' AS (age, name, occupation);
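
Typed fields and the two ways of referencing them have a close analogue in Python's `namedtuple`. The sketch below assumes tab-delimited input and the `(name, age, gpa)` schema from the slide; the `load` helper and the sample lines are invented for illustration.

```python
from collections import namedtuple

# Assumed schema, mirroring AS (name:chararray, age:int, gpa:float).
Student = namedtuple("Student", ["name", "age", "gpa"])

def load(lines, delimiter="\t"):
    """Parse delimited text and cast each field to its declared type."""
    for line in lines:
        name, age, gpa = line.split(delimiter)
        yield Student(name, int(age), float(gpa))

rows = list(load(["John\t18\t4.0", "Mary\t19\t3.8"]))
print(rows[0][1])    # by position, like $1
print(rows[0].age)   # by name, like age
```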


Complex data types

• Tuples - (a, b, c)

• Bags - {(a,b), (c,d)}

• Maps - [martin#18, daniel#27]

• Looking into complex, nested data

– client.$0

– author.age

• FLATTEN can "explode" a bag into a set of separate tuple records
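
The effect of flattening a bag-valued field can be sketched in Python: every tuple inside the nested bag becomes its own output row, with the remaining fields repeated next to it. The helper name `flatten_bag` and the sample data are invented for the example.

```python
def flatten_bag(rows, bag_index):
    """For each row, emit one output row per tuple in the nested bag.

    The fields before and after the bag are repeated beside each
    unnested tuple. A row whose bag is empty produces no output rows.
    """
    out = []
    for row in rows:
        before = row[:bag_index]
        bag = row[bag_index]
        after = row[bag_index + 1:]
        for inner in bag:
            out.append(before + tuple(inner) + after)
    return out

grouped = [(18, [("John", 4.0), ("Joe", 3.8)]), (19, [("Mary", 3.8)])]
print(flatten_bag(grouped, 1))
```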


Loading and storing data

• LOAD

– A = LOAD 'myfile.txt' USING PigStorage('\t') AS (f1:int, f2:int, f3:int);

– User defines data loader and delimiters

• STORE

– STORE A INTO 'output_1.txt' USING PigStorage(',');

– STORE B INTO 'output_2.txt' USING PigStorage('*');

• Other load and store functions

– BinStorage

– PigDump

– TextLoader

– Or create a custom one


FOREACH … GENERATE

• General data transformation statement

• Used to:

– Change the structure of data

– Apply functions to data

– Flatten complex data to remove nesting

• X = FOREACH C GENERATE FLATTEN (A.(a1, a2)), FLATTEN(B.$1);


GROUP … BY

• A = load 'student' AS (name:chararray, age:int, gpa:float);

• DUMP A;

– (John, 18, 4.0F)

– (Mary, 19, 3.8F)

– (Bill, 20, 3.9F)

– (Joe, 18, 3.8F)

• B = GROUP A BY age;

• DUMP B;

– (18, {(John, 18, 4.0F), (Joe, 18, 3.8F)})

– (19, {(Mary, 19, 3.8F)})

– (20, {(Bill, 20, 3.9F)})


JOIN

• A = LOAD 'data1' AS (a1:int,a2:int,a3:int);

• B = LOAD 'data2' AS (b1:int,b2:int);

• X = JOIN A BY a1, B BY b1;

• DUMP A;

– (1,2,3)

– (4,2,1)

• DUMP B;

– (1,3)

– (2,7)

– (4,6)

• DUMP X;

– (1,2,3,1,3)

– (4,2,1,4,6)
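
The join result shown on this slide can be reproduced with a small hash join in Python. This is an illustrative technique chosen for brevity, not a description of how Pig runs the join on a cluster; `inner_join` and its parameters are names invented for the sketch.

```python
from collections import defaultdict

def inner_join(left, right, left_key, right_key):
    """Hash join: index the right side by key, then probe it with each left row."""
    index = defaultdict(list)
    for r in right:
        index[r[right_key]].append(r)
    # Each match yields the left tuple followed by the right tuple.
    return [l + r for l in left for r in index.get(l[left_key], [])]

A = [(1, 2, 3), (4, 2, 1)]
B = [(1, 3), (2, 7), (4, 6)]
X = inner_join(A, B, left_key=0, right_key=0)
print(X)
```

The tuple `(2,7)` from B has no partner in A, so it does not appear in X, which is the behaviour of an inner join.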


Union

• A = LOAD 'data1' AS (a1:int, a2:int, a3:int);

• B = LOAD 'data2' AS (b1:int, b2:int);

• X = UNION A, B;

• DUMP A;

– (1,2,3)

– (4,2,1)

• DUMP B;

– (2,4)

– (8,9)

• DUMP X;

– (1,2,3)

– (4,2,1)

– (2,4)

– (8,9)


Functions

• SAMPLE

– A = LOAD 'data' AS (f1:int,f2:int,f3:int);

– X = SAMPLE A 0.01;

– X will contain roughly 1% of the tuples in A (sampling is random, so the exact count varies)

• FILTER

– A = LOAD 'data' AS (a1:int, a2:int, a3:int);

– X = FILTER A BY a3 == 3;
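
Both operators can be modelled in a few lines of Python. The sketch assumes that sampling keeps each tuple independently with the given probability, which is why the size of the result is only approximately the requested fraction. The helper names, the `seed` parameter and the sample data are invented for the example.

```python
import random

def sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate is true."""
    return [row for row in rows if predicate(row)]

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3)]
print(filter_rows(A, lambda t: t[2] == 3))

big = [(i,) for i in range(10000)]
print(len(sample(big, 0.01, seed=1)))  # near 100, not guaranteed to be exactly 100
```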


Functions

• DISTINCT – removes duplicate tuples

– X = DISTINCT A;

• LIMIT – limits the number of output tuples

– X = LIMIT B 3;

• SPLIT – partitions a relation into two or more relations

– SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF (f3<6 OR f3>6);
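
The SPLIT statement on this slide can be traced in Python. Each tuple is tested against every condition, so it may end up in several output relations or in none of them. The `split` helper and the three sample tuples are invented for the illustration.

```python
def split(rows, **conditions):
    """Route each row to every named output whose condition it satisfies.

    A row may land in several outputs, or in none.
    """
    outputs = {name: [] for name in conditions}
    for row in rows:
        for name, condition in conditions.items():
            if condition(row):
                outputs[name].append(row)
    return outputs

A = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
parts = split(
    A,
    X=lambda t: t[0] < 7,                 # f1 < 7
    Y=lambda t: t[1] == 5,                # f2 == 5
    Z=lambda t: t[2] < 6 or t[2] > 6,     # f3 < 6 OR f3 > 6
)
print(parts)
```

Here `(4, 5, 6)` lands in both X and Y, while `(7, 8, 9)` only reaches Z.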


Nested Pig Statements

A = LOAD 'Unclaimed_bank_accounts.csv' USING PigStorage(',') AS (last_name,first_name,balance,address,city,last_transaction,bank_name);

B = GROUP A BY city;

C = FOREACH B {
    banks = A.bank_name;
    unique_banks = DISTINCT banks;
    GENERATE group AS city, unique_banks;
}
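
The nested block computes, for each city, the set of distinct bank names among that city's accounts. A plain-Python sketch of the same result follows; it assumes a simplified two-field schema of `(city, bank_name)`, and the rows are made-up sample data rather than values from the real CSV file.

```python
from collections import defaultdict

def unique_banks_per_city(accounts):
    """accounts: iterable of (city, bank_name) pairs (a simplified schema)."""
    banks_by_city = defaultdict(set)  # grouping by city and de-duplicating in one step
    for city, bank in accounts:
        banks_by_city[city].add(bank)
    # Sorted only to make the printed result deterministic.
    return {city: sorted(banks) for city, banks in banks_by_city.items()}

rows = [
    ("Tartu", "BankA"),
    ("Tartu", "BankB"),
    ("Tartu", "BankA"),
    ("Tallinn", "BankA"),
]
print(unique_banks_per_city(rows))
```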


User Defined Functions (UDF)

• When the built-in Pig functions are not enough

• When we want to modify the behaviour of built-in functions

• Load a Pig UDF from a jar

REGISTER myudfs.jar;

A = load '/tmp/books/books';

B = foreach A generate flatten(myudfs.MYTOKENIZE((chararray)$0)) as word;


Pig UDF

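package myudfs;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.pig.EvalFunc;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

// Splits a chararray into words and returns them as a bag of one-field tuples.
public class MYTOKENIZE extends EvalFunc<DataBag> {
    TupleFactory mTupleFactory = TupleFactory.getInstance();
    BagFactory mBagFactory = BagFactory.getInstance();

    public DataBag exec(Tuple input) throws IOException {
        try {
            DataBag output = mBagFactory.newDefaultBag();
            Object o = input.get(0);
            if (!(o instanceof String)) {
                throw new IOException("Expected input to be chararray");
            }
            // Delimiters: space, double quote, comma, parentheses, asterisk.
            StringTokenizer tok = new StringTokenizer((String) o, " \",()*");
            while (tok.hasMoreTokens()) {
                output.add(mTupleFactory.newTuple(tok.nextToken()));
            }
            return output;
        } catch (ExecException ee) {
            // Rethrow instead of swallowing, so the method always returns or throws.
            throw new IOException("Failed to tokenize input", ee);
        }
    }
}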
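
To see what the tokenizing UDF produces without a Pig installation, its core logic can be mirrored in Python. This sketch assumes the same delimiter characters as the Java code and returns a list of one-element tuples in place of a bag; `my_tokenize` is a name invented for the example.

```python
import re

DELIMITERS = ' ",()*'  # same delimiter characters as the Java UDF

def my_tokenize(text):
    """Split text on any run of delimiter characters, dropping empty tokens."""
    if not isinstance(text, str):
        raise TypeError("Expected input to be a string (chararray)")
    pattern = "[" + re.escape(DELIMITERS) + "]+"
    return [(token,) for token in re.split(pattern, text) if token]

print(my_tokenize('Hello, "Pig" (Latin)*'))
```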


Pig workflow


Advantages of Pig

• Easy to program

– ~5% of the code, ~5% of the time required

• Self-optimizing

– Pig Latin statement optimizations

– Generated MapReduce code optimizations

• Can manage more complex data flows

– Easy to use and join multiple separate inputs, transformations and outputs

• Extensible

– Can be extended with User Defined Functions (UDF) to provide more functionality


Pig disadvantages

• Slow start-up and clean-up of MapReduce jobs

– It takes time for Hadoop to schedule MR jobs

• Not suitable for interactive OLAP analytics

– When results are expected in < 1 sec

• Complex applications may require many UDFs

– Pig loses its simplicity advantage over MapReduce


DEMO

TF-IDF in Pig


That's all

• This week's practice session

– Processing data with Pig

– Processing unclaimed bank accounts, but this time using Pig

• Next lecture: Spark
