
Page 1: Pig Experience

Building a High-Level Dataflow System on top of MapReduce: The Pig Experience

Tilani Gunawardena

Page 2: Pig Experience

Content
• Introduction
• Background
• System Overview
• Type System & Type Inference
• Compilation to Map-Reduce
• Plan Execution
• Streaming
• Performance
• Adoption
• Project Experience
• Future Work

Page 3: Pig Experience

Introduction

• Internet companies are swimming in data
  – TBs/day for Yahoo! or Google
  – PBs/day for Facebook

• Data
  – unstructured elements: web page text, images
  – structured elements: web page click records, extracted entity-relationship models
• Processing: filter, join, count

• Data Warehousing??
  – Scale: often not scalable enough
  – Price: prohibitively expensive at web scale
  – SQL: high-level declarative approach, but little control over the execution method

• The Map-Reduce Appeal??
  – Scale: scalable due to its simpler design; explicit programming model
  – Price: runs on cheap commodity hardware
  – SQL: no declarative layer; programmers work at a lower level

Page 4: Pig Experience

MapReduce Disadvantages
• Does not directly support complex N-step dataflows
• Lacks explicit support for combined processing of multiple data sets
  – joins and other data-matching operations
• Frequently needed data-manipulation primitives must be coded by hand
  – filtering, aggregation, join, projection, sorting

Page 5: Pig Experience

Pig

• Pig's language, Pig Latin, occupies a sweet spot between the MapReduce framework and SQL
• Defines a new language to allow better control in large-scale data processing
• Spares database programmers from writing map and reduce code, which is at too low a level of abstraction

Page 6: Pig Experience

Pig Latin: Data Types
• Rich and simple data model

Simple types: int, long, double, chararray (string), bytearray

Complex types:
• Map: an associative array; key: chararray; value: any type
• Tuple: a collection of fields, e.g., ('apple', 'mango')
• Bag: a collection of tuples, e.g.,
  { ('apple', 'mango'), ('apple', ('red', 'yellow')) }
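
A minimal sketch of declaring these types in a load schema (field names are illustrative, not from the slides):

a = LOAD 'data' AS (user:chararray,                       -- simple type
                    props:map[],                          -- map: chararray keys, any-type values
                    pair:tuple(f1:int, f2:int),           -- tuple of two fields
                    visits:bag{t:tuple(url:chararray)});  -- bag of tuples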

Page 7: Pig Experience

Pig Latin: Input/Output Data

Input:
queries = LOAD 'data.txt' USING BinStorage() AS (url, category, pagerank);
Output:
STORE result INTO 'myoutput';

BinStorage: binary storage function in Pig

Page 8: Pig Experience

Pig Latin: Expression Table

Page 9: Pig Experience

Pig Latin: General Syntax

• Discarding unwanted data: FILTER
• Comparison operators such as ==, eq, !=, neq
• Logical connectors AND, OR, NOT
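
A small sketch of FILTER in use, reusing the queries relation loaded earlier (the filter condition itself is illustrative):

good = FILTER queries BY pagerank > 0.5 AND category eq 'sports';
-- eq/neq compare strings; ==/!= compare numbers, per the operator list above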

Page 10: Pig Experience

Pig Latin: Type Declaration
• Pig supports three options for declaring the data types of fields:

  – No data types declared: the default is to treat all fields as bytearray.
    Ex: a = LOAD 'data' USING BinStorage() AS (user);

  – Provide the types explicitly as part of the AS clause during the LOAD.
    Ex: a = LOAD 'data' USING BinStorage() AS (user:chararray);

  – Let the load function itself provide the schema information, which accommodates self-describing data formats such as JSON.

Page 11: Pig Experience

Pig Latin: Lazy Conversion of Types
• When Pig needs to cast a bytearray to another type because the program applies a type-specific operator, it delays that cast to the point where it is actually necessary.
• status will need to be cast to a chararray
• earnedPoints and possiblePoints will need to be cast to double
• These casts are not done when the data is loaded
• They are done as part of the comparison and division operations
• This avoids casting values that are removed by the filter before the result of the cast is used.
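
The slide refers to a program along these lines (a reconstruction for illustration; only the three field names come from the slide, the file name and the 'active' literal are assumptions):

a = LOAD 'data' USING BinStorage() AS (status, earnedPoints, possiblePoints);
b = FILTER a BY status == 'active';                    -- status cast to chararray here, not at load time
c = FOREACH b GENERATE earnedPoints / possiblePoints;  -- both operands cast to double here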

Page 12: Pig Experience

Pig Latin-Operators

• LOAD: LOAD 'data' [USING function] [AS schema];
  where 'data' is the name of a file or directory;
  USING, AS are keywords;
  function is the load function;
  schema: the loader produces data of the type specified by the schema. If the data does not conform to the schema, an error is generated.
  Ex:
  LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
  LOAD 'query_log.txt' USING myLoad() AS (userId, queryString, timestamp);

• STORE: stores results to the file system
  STORE alias INTO 'directory' [USING function];
  where alias is the name of a relation;
  INTO, USING are keywords;
  'directory' is the storage directory's name (if the directory already exists, the operation fails);
  function is the store function.
  Ex:
  STORE result INTO 'myOutput';
  STORE query_revenues INTO 'myoutput' USING myStore();

Page 13: Pig Experience

FOREACH
• Generates data transformations based on columns of data.
• Eg: X = FOREACH A GENERATE a1, a2;

expanded_queries = FOREACH queries GENERATE userId, expandQuery(queryString);

With FLATTEN to unnest the bag produced by the UDF:

expanded_queries = FOREACH queries GENERATE userId, FLATTEN(expandQuery(queryString));

Page 14: Pig Experience

GROUP / COGROUP
• Groups the data in one or more relations
• GROUP is used for 1 relation
• COGROUP is used for 1 to 127 relations
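
A brief sketch of both operators (relation and field names are illustrative):

byuser = GROUP clicks BY userid;                                    -- one relation
grouped = COGROUP results BY queryString, revenue BY queryString;   -- two relations, co-grouped on a common key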

Page 15: Pig Experience

JOIN (inner)
• Performs an inner join of 2 or more relations based on common field values.
Eg: if A contains { (1,2,3), (4,2,1) } and B contains { (1,3), (4,6), (4,9) }:
X = JOIN A BY a1, B BY b1;
(1,2,3,1,3)
(4,2,1,4,6)
(4,2,1,4,9)

ORDER BY
• Sorts a relation based on 1 or more fields
Eg: X = ORDER A BY a3 DESC;
(1,2,3)
(4,2,1)

Page 16: Pig Experience

System Overview

Page 17: Pig Experience

• A step-by-step dataflow language where computation steps are chained together through the use of variables
• The use of high-level transformations, e.g., GROUP, FILTER
• The ability to specify schemas as part of issuing a program
• The use of user-defined functions (e.g., top10), as in the sketch below
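
A sketch of a program exhibiting these four features (a reconstruction; file names are illustrative and top10 stands for a hypothetical UDF returning the ten largest-count tuples of a bag):

visits = LOAD 'visits' AS (user, url, time);                -- schema specified with the program
grpd = GROUP visits BY url;                                 -- high-level transformation
counted = FOREACH grpd GENERATE group AS url, COUNT(visits) AS cnt;
allgrp = GROUP counted ALL;                                 -- steps chained through variables
answer = FOREACH allgrp GENERATE FLATTEN(top10(counted));   -- user-defined function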

Page 18: Pig Experience

Pig allows three modes of user interaction:
• Interactive mode: the user is presented with an interactive shell (called Grunt), which accepts Pig commands
• Batch mode: a user submits a prewritten script containing a series of Pig commands
• Embedded mode: Pig is also provided as a Java library, allowing Pig Latin commands to be submitted via method invocations from a Java program

Page 19: Pig Experience

Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
  – Logical to Physical compilation
  – Physical to Map-Reduce compilation
  – Branching plans
• Map-Reduce Optimizer
• Hadoop Job Manager

Page 20: Pig Experience

Parser
• Verifies that the program is syntactically correct and that all referenced variables are defined
• Type checking
• Schema inference
• Verifies the ability to instantiate classes corresponding to UDFs
• Confirms the existence of streaming executables
• Output of the parser: a logical plan
  – One-to-one correspondence between Pig Latin statements and logical operators
  – Arranged in a directed acyclic graph (DAG)

Logical Optimizer
• Logical optimizations such as projection pushdown are carried out

Page 21: Pig Experience

Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
  – Logical to Physical compilation
  – Physical to Map-Reduce compilation
  – Branching plans
• Map-Reduce Optimizer
• Hadoop Job Manager

Page 22: Pig Experience

Map-Reduce Compiler: Logical to Physical Compilation (1)

LOGICAL PLAN => PHYSICAL PLAN => MAP-REDUCE PLAN

Page 23: Pig Experience

Map-Reduce Compiler: Logical to Physical Compilation (2)
The Map-Reduce compiler compiles the logical plan into a series of Map-Reduce jobs.

The (CO)GROUP operator becomes a series of 3 physical operators:
• Local and global rearrange operators: group tuples with the same key on the same machine and adjacent in the data stream; "rearrange" means hashing or sorting by key
• Package operator: places adjacent same-key tuples into a single-tuple package

The JOIN operator is handled in 2 ways:
• Rewritten into COGROUP followed by a FOREACH operator that performs "flattening", yielding a parallel hash join or sort-merge join
• Fragment-replicate join, which executes entirely in the map stage or entirely in the reduce stage
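
For reference, a sketch of how a fragment-replicate join can be requested in Pig Latin (the 'replicated' hint is Pig syntax; relation names are illustrative):

big = LOAD 'big' AS (k, v);
tiny = LOAD 'tiny' AS (k, w);
j = JOIN big BY k, tiny BY k USING 'replicated';  -- tiny is replicated to every task; big is fragmented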

Page 24: Pig Experience

Map-Reduce Compiler: Logical to Physical Compilation (3)
Example of (CO)GROUP conversion:
• (1,R), (2,G) in stream A
• (1,B), (2,Y) in stream B

• Local rearrange operator:
  – Eg: converts tuple (1,R) to {1,(1,R)}
• Global rearrange operator: sort
  – Eg: Reducer 1: {1,{(1,R),(1,B)}}
        Reducer 2: {2,{(2,G),(2,Y)}}
• Package operator: places same-key tuples into a single-tuple package
  – Eg: Reducer 1: {1,{(1,R)},{(1,B)}}
        Reducer 2: {2,{(2,G)},{(2,Y)}}

Page 25: Pig Experience

Map-Reduce Compiler: Logical to Physical Compilation (4)
3 types of join operators:

Fragment-replicate join
• Joins a huge table with a very small table
• The huge table is fragmented and distributed to the mappers (or reducers)
• The small table is replicated to each machine
• Executes either in the map or the reduce stage

Parallel hash join
• Map stage: hashes the tables by join key
• Reduce stage: joins the fragments of the tables
  – Data with the same hash value is assigned to one reducer

Sort-merge join
• Both inputs are sorted on the join key
• Each node gets a fragment of the sorted table; equal keys go to the same node
• Each node performs the join; a map-only job is sufficient
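
A sketch of requesting a sort-merge join over pre-sorted inputs (the 'merge' hint appears in later Pig releases than the one described here, so treat it as an assumption; relation names are illustrative):

a = LOAD 'sorted_a' AS (k, v);          -- both inputs already sorted on k
b = LOAD 'sorted_b' AS (k, w);
j = JOIN a BY k, b BY k USING 'merge';  -- map-only sort-merge join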

Page 26: Pig Experience

Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
  – Logical to Physical compilation
  – Physical to Map-Reduce compilation
  – Branching plans
• Map-Reduce Optimizer
• Hadoop Job Manager

Page 27: Pig Experience

Map-Reduce Compiler: Physical to Map-Reduce Compilation

• Physical operators are assigned to Hadoop stages so as to minimize the number of reduce stages
• Local rearrange operator: simply annotates tuples with keys and stream identifiers, and lets Hadoop's local sort stage do the work
• Global rearrange operators are removed; they are implemented by Hadoop's shuffle and merge stages
• Load and store operators are removed; the Hadoop framework reads and writes the data

Page 28: Pig Experience

Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
  – Logical to Physical compilation
  – Physical to Map-Reduce compilation
  – Branching plans
• Map-Reduce Optimizer
• Hadoop Job Manager

Page 29: Pig Experience

Map-Reduce Compiler: Branching Plans (1)
Branching plans
• More than one STORE command: one for each branch of the split
• Data is read once and processed in multiple ways
• Risk of data spilling to disk
• SPLIT operator: feeds a copy of its input to each nested sub-plan

Example 1: a logical SPLIT command splits the table; map-only plan:

clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
SPLIT clicks INTO pages IF pageid IS NOT NULL,  -- corresponds to the FILTER of the 1st sub-plan
             links IF linkid IS NOT NULL;       -- corresponds to the FILTER of the 2nd sub-plan
-- 1st sub-plan:
cpages = FOREACH pages GENERATE userid, CanonicalizePage(pageid) AS page, viewedat;
STORE cpages INTO 'pages';
-- 2nd sub-plan:
clinks = FOREACH links GENERATE userid, CanonicalizeLink(linkid) AS clink, viewedat;
STORE clinks INTO 'links';

Page 30: Pig Experience

Map-Reduce Compiler: Branching Plans (2)
Example 2:
• The split propagates across the map/reduce boundary
• No logical SPLIT operator; the compiler inserts a physical SPLIT operator
• MULTIPLEX operator: routes tuples to the correct sub-plan; in the reduce stage only

clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
goodclicks = FILTER clicks BY viewedat IS NOT NULL;
-- 1st sub-plan: grouped by pageid
bypage = GROUP goodclicks BY pageid;
cntbypage = FOREACH bypage GENERATE group, COUNT(goodclicks);
STORE cntbypage INTO 'bypage';
-- 2nd sub-plan: grouped by linkid
bylink = GROUP goodclicks BY linkid;
cntbylink = FOREACH bylink GENERATE group, COUNT(goodclicks);
STORE cntbylink INTO 'bylink';

Page 31: Pig Experience

Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
  – Logical to Physical compilation
  – Physical to Map-Reduce compilation
  – Branching plans
• Map-Reduce Optimizer
• Hadoop Job Manager

Page 32: Pig Experience

Map-Reduce Optimizer

Performs early partial aggregation for distributive or algebraic aggregation functions.

Eg: for the function AVERAGE, the steps are:
a) Initial: e.g., generate (sum, count) pairs. Assigned to the map stage.
b) Intermediate: e.g., combine n (sum, count) pairs into a single pair. Assigned to the combine stage.
c) Final: e.g., combine n (sum, count) pairs and take the quotient. Assigned to the reduce stage.
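
A sketch of a query to which this optimization applies (relation and field names are illustrative; AVG is Pig's built-in algebraic average):

clicks = LOAD 'clicks' AS (userid, pageid, timespent:double);
byuser = GROUP clicks BY userid;
avgtime = FOREACH byuser GENERATE group, AVG(clicks.timespent);
-- map: emits partial (sum, count) pairs per user
-- combine: merges partial pairs
-- reduce: merges the remaining pairs and takes sum/count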

Page 33: Pig Experience

Pig System Process
• Parser
• Logical Optimizer
• Map-Reduce Compiler
  – Logical to Physical compilation
  – Physical to Map-Reduce compilation
  – Branching plans
• Map-Reduce Optimizer
• Hadoop Job Manager

Page 34: Pig Experience

Hadoop Job Manager

• Map-Reduce jobs are sorted and submitted to Hadoop for execution
• A Java jar file is generated containing the Map and Reduce implementation classes and the UDFs
• The Map and Reduce classes contain general-purpose dataflow execution engines
• Monitors execution and generates periodic reports
• Warnings and errors are logged and reported

Page 35: Pig Experience

Plan Execution
• Flow Control
  – Nested Programs
• Memory Management

Streaming
• Flow Control

Page 36: Pig Experience

PLAN EXECUTION - FLOW CONTROL
• Pig executes the map or reduce stage of the physical plan
• Assume that data flows downward in an execution plan
• To control the movement of tuples through the execution pipeline, 2 models are available: the push and the pull (iterator) model

1) Push model
Eg: operator A pushes data to B, which operates on it and pushes the result to C (A, B, and C are physical operators).
Difficult to implement for:
• UDFs with multiple inputs
• Binary operators like fragment-replicate join

2) Pull model
Eg: operator C asks B for its next data item. If B has nothing pending to return, it asks A. When A returns a data item, B operates on it and returns the result to C.
Advantages:
• Single-threaded implementation: avoids context-switching overhead
• Simple APIs for UDFs
Drawbacks:
• Operations over a bag nested inside a tuple may lead to memory overflow
• If the dataflow graph has multiple sinks, operators at branch points may be required to buffer an unbounded number of tuples

Page 37: Pig Experience

PLAN EXECUTION - FLOW CONTROL (2)

Solution: an operator asked to produce a tuple may respond in one of three ways:
a) Return a tuple;
b) Declare itself finished; or
c) Return a pause signal, indicating that it is not finished but is currently unable to produce an output tuple.

Page 38: Pig Experience

PLAN EXECUTION - FLOW CONTROL (3)

NESTED PROGRAMS:
• Pig operators can be invoked over bags nested within tuples
• For example (to compute the number of distinct pages and links visited by each user):

clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
(Alice, Page1, Link1, Site1)
(John, Page1, Link2, Site2)
(John, Page2, Link2, Site3)

byuser = GROUP clicks BY userid;
(Alice, {(Alice, Page1, Link1, Site1)})
(John, {(John, Page1, Link2, Site2), (John, Page2, Link2, Site3)})

result = FOREACH byuser {
  uniqPages = DISTINCT clicks.pageid;
  uniqLinks = DISTINCT clicks.linkid;
  GENERATE group, COUNT(uniqPages), COUNT(uniqLinks);
};
(Alice, 1, 1)
(John, 2, 1)

Page 39: Pig Experience

PLAN EXECUTION - FLOW CONTROL (4)

• The outer operator graph contains the FOREACH operator
• FOREACH contains a nested operator graph of 2 pipelines
• Each pipeline contains DISTINCT and COUNT operators
• FOREACH requests a tuple T from the PACKAGE operator
• It places a cursor on the bag of click tuples for the 1st DISTINCT-COUNT pipeline
• It requests a tuple from the bottom of the pipeline (the COUNT operator)
• The process is repeated for the second pipeline
• The FOREACH operator constructs and returns the output tuple

Page 40: Pig Experience

PLAN EXECUTION - FLOW CONTROL
• When the nested plan is a single branching pipeline:

clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
(Alice, Page1, Link1, Site1)
(John, Page1, Link2, Site2)
(John, Page2, Link2, NULL)

byuser = GROUP clicks BY userid;
(Alice, {(Alice, Page1, Link1, Site1)})
(John, {(John, Page1, Link2, Site2), (John, Page2, Link2, NULL)})

result = FOREACH byuser {
  fltrd = FILTER clicks BY viewedat IS NOT NULL;
  uniqPages = DISTINCT fltrd.pageid;
  uniqLinks = DISTINCT fltrd.linkid;
  GENERATE group, COUNT(uniqPages), COUNT(uniqLinks);
};
(Alice, 1, 1)
(John, 1, 1)

A more complex situation arises when the nested plan is not two independent pipelines but rather a single branching pipeline.
Solution: Pig currently handles this case by duplicating the FILTER operator, producing two independent pipelines that are executed as explained above.

Page 41: Pig Experience

Plan Execution
• Flow Control
  – Nested Programs
• Memory Management

Streaming
• Flow Control

Page 42: Pig Experience

PLAN EXECUTION - Memory Management

• Like Hadoop, Pig is implemented in Java
• Java creates memory-management problems during query processing: Java does not allow the developer to control memory allocation and deallocation directly
• Naive option: increase the JVM memory limit beyond the physical memory size and let the virtual memory manager take care of staging data between memory and disk
  – Problem: performance degradation
• Better to return an "out-of-memory" error
  – An administrator can adjust the memory-management parameters and re-submit the program

Page 43: Pig Experience

PLAN EXECUTION - Memory Management

• Memory overflow is mostly due to large bags of tuples
• Java's MemoryPoolMXBean class notifies Pig of low-memory situations; when notified, Pig spills excess bags to disk
• Pig estimates bag sizes by sampling a few tuples
• The memory manager maintains a list of the Pig bags created in the same JVM, using a linked list of Java WeakReferences
• WeakReferences ensure garbage collection of bags that are no longer in use

Page 44: Pig Experience

Plan Execution
• Flow Control
  – Nested Programs
• Memory Management

Streaming
• Flow Control

Page 45: Pig Experience

STREAMING – FLOW CONTROL
• Pig allows user-defined functions (UDFs)
  – UDFs must be written in Java and must conform to Pig's UDF interface
  – Synchronous behavior

Streaming:
• Allows data to be pushed through external executables (see the sketch after this list)
  – Users are able to intermix relational operations like grouping and filtering with custom or legacy executables
• A streaming executable behaves asynchronously

The challenge in implementing streaming in Pig is fitting it into the iterator model of Pig's execution pipeline:
• Because of the asynchronous behavior of the user's executable, the STREAM operator that wraps the executable cannot simply pull tuples synchronously as it does with other operators, because it does not know what state the executable is in.
• There may be no output because:
  – the executable is waiting to receive more input: the STREAM operator needs to push new data
  – the executable is still busy processing prior inputs: the STREAM operator should wait
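
A sketch of the streaming syntax (DEFINE, STREAM, and SHIP are Pig keywords; clean_urls.pl is a hypothetical legacy script):

DEFINE cleaner `clean_urls.pl` SHIP('clean_urls.pl');  -- ship the script to the cluster nodes
clicks = LOAD 'clicks' AS (userid, pageid, linkid, viewedat);
cleaned = STREAM clicks THROUGH cleaner AS (userid, cpage, linkid, viewedat);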

Page 46: Pig Experience

• Under the single-threaded operator execution model, a deadlock can occur
  – The Pig operator is waiting for the external executable to consume a new input tuple, while at the same time the executable is waiting for its output to be consumed

Solution: the STREAM operator
• Creates 2 additional threads: one to feed data to the executable and one to consume data from it
• Blocks until a tuple is available on the executable's output queue or until the executable terminates
• If space is available in the input queue, places a tuple from the parent operator into it

Page 47: Pig Experience

Performance
• In the initial implementation of Pig, functionality and proof of concept were considered more important than performance
• As Pig was adopted within Yahoo!, better performance quickly became a priority
• PigMix: a publicly available benchmark used to measure performance on a regular basis, so that the effects of individual code changes on performance can be understood

Page 48: Pig Experience

Benchmark Results
PigMix benchmark milestones:
• September 11, 2008: initial Apache open-source release
• November 11, 2008: enhanced type system; rewrote execution pipeline; enhanced combiner use
• January 20, 2009: buffering during data parsing; fragment-replicate join algorithm
• February 23, 2009: rework of the partitioning function used in ORDER BY to ensure a more balanced distribution of keys to reducers
• April 20, 2009: branching execution plans

• Vertical axis: the ratio of the total running time of 12 Pig programs to that of the corresponding Map-Reduce programs
• The current performance ratio is 1.5, a reasonable trade-off point between execution time and code development/maintenance effort

Page 49: Pig Experience

Pros & Cons

• The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single-block method of SQL. It is easier to keep track of what your variables are and where you are in the process of analyzing your data.

• With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially.

Page 50: Pig Experience

Pros & Cons

Pros:
• Explicit dataflow
• Retains properties of Map-Reduce
• Scalability
• Fault tolerance
• Multi-way processing
• Open source

Cons:
• Column-wise storage structures are missing
• Memory management
• No facilitation for non-Java users
• Limited optimization
• No GUI for flow graphs

Page 51: Pig Experience

Future Work

• Query optimization
  – Currently a rule-based optimizer for plan rearrangement and join selection
  – Cost-based in the future
• Non-Java UDFs
• SQL interface
• Grouping and joining of pre-partitioned/sorted data
  – Avoid data shuffling for grouping and joining
  – Build metadata facilities to keep track of the data layout
• Skew handling
  – For load balancing

Page 52: Pig Experience

Summary
• Big demand for parallel data processing
  – Programmers prefer dataflow pipes over static files
• Ease of programming
• UDFs: users can create their own functions to do special-purpose processing
• Optimization opportunities: the way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency
• Open source

Pig Latin: a sweet spot between map-reduce and SQL

Page 53: Pig Experience

Related Work
• Sawzall
  – Data-processing language on top of map-reduce
  – Rigid structure of filtering followed by aggregation
• Hive
  – SQL-like language on top of Map-Reduce
• DryadLINQ
  – SQL-like language on top of Dryad

Page 54: Pig Experience

Thank You!