TRANSCRIPT
Tutorial: High-Level Programming Languages - MapReduce Simplified
Pietro Michiardi
Eurecom
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 1 / 105
Introduction
Overview
Raising the level of abstraction for processing large datasets
- Scalable algorithm design is complex using MapReduce
- Code gets messy, redundant, difficult to re-use

Many alternatives exist, based on different principles
- Data-flow programming
- SQL-like declarative programming
- Additional operators (besides Map and Reduce)

Optimization is a hot research topic
- Based on traditional RDBMS optimizations
Topics covered
Review foundations of relational algebra in light of MapReduce
Hadoop PIG
- Data-flow language, originated from Yahoo!
- Internals
- Optimizations
Cascading + Scalding
SPARK [1]

[1] This is an abuse of terminology: SPARK is an execution engine that replaces Hadoop, based on Resilient Distributed Datasets (RDDs) that reside in memory. The programming model is MapReduce, using Scala.
Relational Algebra and MapReduce
Introduction
Disclaimer
- This is not a full course on Relational Algebra
- Nor is this a course on SQL
Introduction to Relational Algebra, RDBMS and SQL
- Follow the video lectures of the Stanford class on RDBMS: http://www.db-class.org/
→ Note that you have to sign up for an account
Overview of this part
- Brief introduction to simplified relational algebra
- Useful to understand Pig, Hive and HBase
Relational Algebra Operators
There are a number of operations on data that fit the relational algebra model well
- In traditional RDBMS, queries involve retrieval of small amounts of data
- In this course, and in particular in this class, we should keep in mind the particular workload underlying MapReduce
→ Full scans of large amounts of data
→ Queries are not selective, they process all data
A review of some terminology
- A relation is a table
- Attributes are the column headers of the table
- The set of attributes of a relation is called a schema

Example: R(A1, A2, ..., An) indicates a relation called R whose attributes are A1, A2, ..., An
Operators
Let’s start with an example
- Below, we have part of a relation called Links describing the structure of the Web
- There are two attributes: From and To
- A row, or tuple, of the relation is a pair of URLs, indicating the existence of a link between them
→ The number of tuples in a real dataset is in the order of billions (10^9)

From   To
url1   url2
url1   url3
url2   url3
url2   url4
...    ...
Relations (however big) can be stored in a distributed filesystem
- If they don’t fit in a single machine, they’re broken into pieces (think HDFS)

Next, we review and describe a set of relational algebra operators
- Intuitive explanation of what they do
- “Pseudo-code” of their implementation in/by MapReduce
Selection: σ_C(R)
- Apply condition C to each tuple of relation R
- Produce in output a relation containing only the tuples that satisfy C

Projection: π_S(R)
- Given a subset S of the attributes of relation R
- Produce in output a relation containing only the components for the attributes in S

Union, Intersection and Difference
- Well-known operators on sets
- Apply to the sets of tuples in two relations that have the same schema
- Variations on the theme: work on bags
Natural join: R ⋈ S
- Given two relations, compare each pair of tuples, one from each relation
- If the tuples agree on all the attributes common to both schemas → produce an output tuple that has a component for each attribute of either schema
- Otherwise produce nothing
- The join condition can be on a subset of attributes

Let’s work with an example
- Recall the Links relation from previous slides
- Query (or data processing job): find the paths of length two in the Web
Join Example
Informally, to satisfy the query we must:
- find the triples of URLs in the form (u, v, w) such that there is a link from u to v and a link from v to w

Using the join operator
- Imagine we have two relations (with different schemas), and let’s try to apply the natural join operator
- There are two copies of Links: L1(U1, U2) and L2(U2, U3)
- Let’s compute L1 ⋈ L2
  - For each tuple t1 of L1 and each tuple t2 of L2, see if their U2 components are the same
  - If yes, then produce a tuple in output, with the schema (U1, U2, U3)
What we have seen is called (to be precise) a self-join
- Question: How would you implement a self-join in your favorite programming language?
- Question: What is the time complexity of your algorithm?
- Question: What is the space complexity of your algorithm?

To continue the example
- Say you are not interested in the entire two-hop path but just the start and end nodes
- Then you apply a projection, and the notation would be: π_{U1,U3}(L1 ⋈ L2)
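As a hint for the questions above, one possible single-machine answer is a hash-based self-join: index the tuples on the join attribute, then probe the index. This is only an illustrative sketch (names like `two_hop_paths` are invented here, not from the slides):

```python
from collections import defaultdict

def two_hop_paths(links):
    """Self-join of the Links relation on the shared node v.

    Building the index costs O(n) time and space; probing costs
    O(n + |output|), versus O(n^2) for the naive nested loop.
    """
    by_source = defaultdict(list)   # index: v -> list of w with (v, w) in Links
    for v, w in links:
        by_source[v].append(w)
    # probe: pair each (u, v) with every (v, w) found in the index
    return [(u, v, w) for u, v in links for w in by_source[v]]

links = [("url1", "url2"), ("url1", "url3"),
         ("url2", "url3"), ("url2", "url4")]
paths = two_hop_paths(links)
# paths contains ("url1", "url2", "url3") and ("url1", "url2", "url4")
```

The space cost of the index is the classic trade-off: the naive nested loop needs no extra memory but quadratic time.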
Grouping and Aggregation: γ_X(R)
- Given a relation R, partition its tuples according to their values in one set of attributes G
  - The set G is called the grouping attributes
- Then, for each group, aggregate the values in certain other attributes
  - Aggregation functions: SUM, COUNT, AVG, MIN, MAX, ...

In the notation, X is a list of elements that can be:
- A grouping attribute
- An expression θ(A), where θ is one of the (five) aggregation functions and A is an attribute NOT among the grouping attributes
Grouping and Aggregation: γ_X(R)
- The result of this operation is a relation with one tuple for each group
- That tuple has a component for each of the grouping attributes, with the value common to tuples of that group
- That tuple has another component for each aggregation, with the aggregate value for that group

Let’s work with an example
- Imagine that a social-networking site has a relation Friends(User, Friend)
- The tuples are pairs (a, b) such that b is a friend of a
- Query: compute the number of friends each member has
Grouping and Aggregation Example
How to satisfy the query: γ_{User, COUNT(Friend)}(Friends)
- This operation groups all the tuples by the value in their first component
→ There is one group for each user
- Then, for each group, it counts the number of friends

Some details
- The COUNT operation applied to an attribute does not consider the values of that attribute
- In fact, it counts the number of tuples in the group
- In SQL, there is a “count distinct” operator that counts the number of different values
MapReduce implementation of (some) Relational Operators
Computing Selection
In practice, selection does not need a full-blown MapReduce implementation
- It can be implemented in the map portion alone
- Actually, it could also be implemented in the reduce portion

A MapReduce implementation of σ_C(R)
- Map: for each tuple t in R, check if t satisfies C; if so, emit a key/value pair (t, t)
- Reduce: identity reducer
  - Question: single or multiple reducers?

NOTE: the output is not exactly a relation
- WHY?
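The map-only plan can be simulated in plain Python (no real Hadoop API here; relation and function names are illustrative). Note that the output consists of key/value pairs rather than bare tuples, which is one way to read the WHY question above:

```python
def selection_map(R, C):
    """Map: for each tuple t in R satisfying condition C, emit (t, t).

    The reduce phase is the identity, so a map-only job suffices.
    """
    return [(t, t) for t in R if C(t)]

urls = [("url1", 0.9), ("url2", 0.1), ("url3", 0.5)]
selected = selection_map(urls, lambda t: t[1] > 0.2)
# selected == [(("url1", 0.9), ("url1", 0.9)), (("url3", 0.5), ("url3", 0.5))]
```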
Computing Projections
Similar process to selection
- But projection may cause the same tuple to appear several times

A MapReduce implementation of π_S(R)
- Map: for each tuple t in R, construct a tuple t′ by eliminating those components whose attributes are not in S; emit a key/value pair (t′, t′)
- Reduce: for each key t′ produced by any of the Map tasks, fetch (t′, [t′, ..., t′]); emit a key/value pair (t′, t′)

NOTE: the reduce operation is duplicate elimination
- This operation is associative and commutative, so it is possible to optimize MapReduce by using a Combiner in each mapper
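The map + duplicate-eliminating reduce can be sketched as follows, with the shuffle simulated by a dictionary (attribute positions stand in for attribute names; all identifiers are illustrative):

```python
from collections import defaultdict

def projection_map(R, S):
    """Map: keep only the components whose attribute indices are in S."""
    return [tuple(t[i] for i in S) for t in R]

def projection_reduce(mapped):
    """Reduce: duplicate elimination, one output tuple per distinct key."""
    groups = defaultdict(list)
    for t in mapped:             # shuffle: identical tuples land in one group
        groups[t].append(t)
    return sorted(groups)        # emit each key once

urls = [("url1", "news", 0.9), ("url2", "news", 0.1), ("url3", "blog", 0.5)]
categories = projection_reduce(projection_map(urls, S=(1,)))
# categories == [("blog",), ("news",)]
```

A Combiner would run the same duplicate elimination on each mapper's local output before the shuffle, which is valid precisely because the operation is associative and commutative.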
Computing Unions
Suppose relations R and S have the same schema
- Map tasks will be assigned chunks from either R or S
- Mappers don’t do much, they just pass tuples to the reducers
- Reducers do duplicate elimination

A MapReduce implementation of union
- Map: for each tuple t in R or S, emit a key/value pair (t, t)
- Reduce: for each key t there will be either one or two values; emit (t, t) in either case
Computing Intersections
Very similar to computing unions
- Suppose relations R and S have the same schema
- The map function is the same (an identity mapper) as for union
- The reduce function must produce a tuple only if both relations have that tuple

A MapReduce implementation of intersection
- Map: for each tuple t in R or S, emit a key/value pair (t, t)
- Reduce: if key t has value list [t, t], then emit the key/value pair (t, t); otherwise, emit the key/value pair (t, NULL)
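An in-memory simulation of the intersection plan, assuming R and S are sets (no duplicates within a relation). For readability this sketch simply drops the tuples that would get a (t, NULL) pair:

```python
from collections import defaultdict

def intersection(R, S):
    """R ∩ S via identity map + shuffle + counting reduce."""
    grouped = defaultdict(int)
    for t in R:                  # identity map over R: emit (t, t)
        grouped[t] += 1
    for t in S:                  # identity map over S: emit (t, t)
        grouped[t] += 1
    # reduce: key t has value list [t, t] iff both relations contain t
    return [t for t, count in grouped.items() if count == 2]

R = [("url1",), ("url2",)]
S = [("url2",), ("url3",)]
# intersection(R, S) == [("url2",)]
```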
Computing difference
Assume we have two relations R and S with the same schema
- The only way a tuple t can appear in the output is if it is in R but not in S
- The map function can pass tuples from R and S to the reducer
- NOTE: it must inform the reducer whether the tuple came from R or S

A MapReduce implementation of difference
- Map: for a tuple t in R emit a key/value pair (t, 'R'), and for a tuple t in S emit a key/value pair (t, 'S')
- Reduce: for each key t, do the following:
  - If it is associated with ['R'], then emit (t, t)
  - If it is associated with ['R', 'S'], ['S', 'R'], or ['S'], emit the key/value pair (t, NULL)
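The tagged-tuple plan for difference can be simulated like this (tags play the role of the 'R'/'S' values emitted by the mappers; names are illustrative):

```python
from collections import defaultdict

def difference(R, S):
    """R - S via tagged tuples, following the reduce-side plan above."""
    tags = defaultdict(set)
    for t in R:
        tags[t].add('R')         # map: emit (t, 'R')
    for t in S:
        tags[t].add('S')         # map: emit (t, 'S')
    # reduce: emit t only when its tag set is exactly {'R'}
    return [t for t, origin in tags.items() if origin == {'R'}]

R = [("url1",), ("url2",)]
S = [("url2",), ("url3",)]
# difference(R, S) == [("url1",)]
```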
Computing the natural Join
This topic is subject to continuous refinements
- There are many JOIN operators and many different implementations
- We will see some of them in more detail in the Lab

Let’s look at two relations R(A, B) and S(B, C)
- We must find tuples that agree on their B components
- We shall use the B-value of tuples from either relation as the key
- The value will be the other component and the name of the relation
- That way the reducer knows which relation each tuple is coming from
A MapReduce implementation of Natural Join
- Map: for each tuple (a, b) of R emit the key/value pair (b, ('R', a)); for each tuple (b, c) of S emit the key/value pair (b, ('S', c))
- Reduce: each key b will be associated with a list of pairs that are either ('R', a) or ('S', c); emit key/value pairs of the form (b, [(a1, b, c1), (a2, b, c2), ..., (an, b, cn)])

NOTES
- Question: what if the MapReduce framework didn’t implement the distributed (and sorted) group by?
- In general, for n tuples in relation R and m tuples in relation S, all with a common B-value, we end up with n·m tuples in the result
- If all tuples of both relations have the same B-value, then we’re computing the cartesian product
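A minimal in-memory sketch of the reduce-side join above, with the shuffle simulated by per-key buckets (the two lists per bucket stand in for the 'R'- and 'S'-tagged values):

```python
from collections import defaultdict

def natural_join(R, S):
    """Join R(A, B) with S(B, C) on the shared attribute B."""
    buckets = defaultdict(lambda: ([], []))
    for a, b in R:
        buckets[b][0].append(a)      # map: emit (b, ('R', a))
    for b, c in S:
        buckets[b][1].append(c)      # map: emit (b, ('S', c))
    out = []
    for b, (r_side, s_side) in buckets.items():
        # reduce: cross the two lists; n tuples from R and m from S
        # sharing a B-value yield n*m output tuples
        for a in r_side:
            for c in s_side:
                out.append((a, b, c))
    return out

R = [("a1", "b1"), ("a2", "b1")]
S = [("b1", "c1"), ("b2", "c2")]
# natural_join(R, S) == [("a1", "b1", "c1"), ("a2", "b1", "c1")]
```

The nested loop in the reducer makes the n·m blowup (and the cartesian-product worst case) explicit.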
Grouping and Aggregation in MapReduce
Let R(A, B, C) be a relation to which we apply γ_{A, θ(B)}(R)
- The map operation prepares the grouping
- The grouping is done by the framework
- The reducer computes the aggregation
- Simplifying assumptions: one grouping attribute and one aggregation function

MapReduce implementation of γ_{A, θ(B)}(R)
- Map: for each tuple (a, b, c) emit the key/value pair (a, b)
- Reduce: each key a represents a group; apply θ to the list [b1, b2, ..., bn]; emit the key/value pair (a, x), where x = θ([b1, b2, ..., bn])
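The plan can be sketched in a few lines; using θ = len (i.e., COUNT) this also answers the earlier Friends query, counting friends per user. The relation and variable names are illustrative:

```python
from collections import defaultdict

def group_aggregate(R, theta):
    """γ_{A, θ(B)}(R) for a relation of tuples (a, b, c)."""
    groups = defaultdict(list)
    for a, b, _c in R:
        groups[a].append(b)          # map: emit (a, b); framework groups by a
    # reduce: apply θ to each group's list of B-values
    return [(a, theta(bs)) for a, bs in groups.items()]

friends = [("alice", "bob", None), ("alice", "carol", None),
           ("bob", "alice", None)]
counts = group_aggregate(friends, theta=len)
# counts == [("alice", 2), ("bob", 1)]
```

Note that len counts the tuples in each group without looking at the Friend values, exactly as COUNT does on the slide about “some details”.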
Hadoop PIG
Introduction
Collection and analysis of enormous datasets is at the heart of innovation in many organizations
- E.g.: web crawls, search logs, click streams

Manual inspection before batch processing
- Very often engineers look for exploitable trends in their data to drive the design of more sophisticated techniques
- This is difficult to do in practice, given the sheer size of the datasets

The MapReduce model has its own limitations
- One input
- Two-stage, two operators
- Rigid data-flow
MapReduce limitations
Very often tricky workarounds are required [2]
- This is very often exemplified by the difficulty in performing JOIN operations

Custom code is required even for basic operations
- Projection and Filtering need to be “rewritten” for each job
→ Code is difficult to reuse and maintain
→ Semantics of the analysis task are obscured
→ Optimizations are difficult due to the opacity of Map and Reduce

[2] The term workaround should not be intended as only negative.
Use Cases
Rollup aggregates
Compute aggregates against user activity logs, web crawls, etc.
- Example: compute the frequency of search terms aggregated over days, weeks, months
- Example: compute the frequency of search terms aggregated over geographical location, based on IP addresses

Requirements
- Successive aggregations
- Joins followed by aggregations

Pig vs. OLAP systems
- Datasets are too big
- Data curation is too costly
Temporal Analysis
Study how search query distributions change over time
- Correlation of search queries from two distinct time periods (groups)
- Custom processing of the queries in each correlation group

Pig supports operators that minimize memory footprint
- Instead, in an RDBMS such operations typically involve JOINs over very large datasets that do not fit in memory and thus become slow
Session Analysis
Study sequences of page views and clicks
Examples of typical aggregates
- Average length of a user session
- Number of links clicked by a user before leaving a website
- Click pattern variations in time
Pig supports advanced data structures, and UDFs
Pig Latin
Pig Latin, a high-level programming language developed at Yahoo!
- Combines the best of both the declarative and the imperative worlds
  - High-level declarative querying in the spirit of SQL
  - Low-level, procedural programming à la MapReduce

Pig Latin features
- Multi-valued, nested data structures instead of flat tables
- Powerful data transformation primitives, including joins

A Pig Latin program
- Is made up of a series of operations (or transformations)
- Each operation is applied to input data and produces output data
→ A Pig Latin program describes a data flow
Example 1
Pig Latin premiere
Assume we have the following table:
urls: (url, category, pagerank)
Where:
- url: the URL of a web page
- category: a pre-defined category for the web page
- pagerank: the numerical value of the pagerank associated with the web page

→ Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category
SQL
SELECT category, AVG(pagerank)
FROM urls
WHERE pagerank > 0.2
GROUP BY category
HAVING COUNT(*) > 10^6
Pig Latin
good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE
         category, AVG(good_urls.pagerank);
Pig Execution environment
How do we go from Pig Latin to MapReduce?
- The Pig system is in charge of this
- A complex execution environment interacts with Hadoop MapReduce
→ The programmer focuses on the data and the analysis

Pig Compiler
- Pig Latin operators are translated into MapReduce code
- NOTE: in some cases, hand-written MapReduce code performs better

Pig Optimizer
- Pig Latin data flows undergo an (automatic) optimization phase
- These optimizations are borrowed from the RDBMS community
Pig and Pig Latin
Pig is not an RDBMS!
- This means it is not suitable for all data processing tasks

Designed for batch processing
- Of course, since it compiles to MapReduce
- Of course, since data is materialized as files on HDFS

NOT designed for random access
- Query selectivity does not match that of an RDBMS
- Full-scan oriented!
Comparison with RDBMS
It may seem that Pig Latin is similar to SQL
- We’ll see several examples, operators, etc. that resemble SQL statements

Data-flow vs. declarative programming language
- Data-flow:
  - Step-by-step set of operations
  - Each operation is a single transformation
- Declarative:
  - Set of constraints
  - Applied together to an input to generate output
→ With Pig Latin it’s like working at the level of the query planner
RDBMS store data in tables
- Schemas are predefined and strict
- Tables are flat

Pig and Pig Latin work on more complex data structures
- A schema can be defined at run-time for readability
- Pigs eat anything!
- UDFs and streaming, together with nested data structures, make Pig and Pig Latin more flexible
Features and Motivations
Design goals of Pig and Pig Latin
- Appealing to programmers for performing ad-hoc analysis of data
- A number of features that go beyond those of traditional RDBMS

Next: overview of salient features
- There will be a dedicated set of slides on optimizations later on
Dataflow Language
A Pig Latin program specifies a series of steps
- Each step is a single, high-level data transformation
- Stylistically different from SQL

With reference to Example 1
- The programmer supplies an order in which each operation will be done

Consider the following snippet

spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;
Data flow optimizations
- Explicit sequences of operations can be overridden
- Use of high-level, relational-algebra-style primitives (GROUP, FILTER, ...) allows using traditional RDBMS optimization techniques
→ NOTE: it is necessary to check by hand whether such optimizations are beneficial or not

Pig Latin allows Pig to perform optimizations that would otherwise be a tedious manual exercise if done at the MapReduce level
Quick Start and Interoperability
Data I/O is greatly simplified in Pig
- No need to curate, bulk import, parse, apply schemas, or create indexes, as traditional RDBMS require
- Standard and ad-hoc “readers” and “writers” facilitate the task of ingesting and producing data in arbitrary formats

Pig can work with a wide range of other tools

Why do RDBMS have stringent requirements?
- To enable transactional consistency guarantees
- To enable efficient point lookups (using physical indexes)
- To enable data curation on behalf of the user
- To enable other users to figure out what the data is, by studying the schema
Why is Pig so flexible?
- Supports read-only workloads
- Supports scan-only workloads (no lookups)
→ No need for transactions nor indexes

Why is data curation not required?
- Very often, Pig is used for ad-hoc data analysis
- Work on temporary datasets, then throw them away
→ Curation is overkill

Schemas are optional
- Can apply one on the fly, at runtime
- Can refer to fields using positional notation
- E.g.: good_urls = FILTER urls BY $2 > 0.2
Nested Data Model
It is easier for “programmers” to think in terms of nested data structures
- E.g.: capture information about positional occurrences of terms in a collection of documents
- Map<documentId, Set<positions>>

Instead, an RDBMS allows only flat tables
- Only atomic fields as columns
- Normalization is required
- From the example above, we need to create two tables:
  - term_info: (termId, termString, ...)
  - position_info: (termId, documentId, position)
→ Occurrence information is obtained by joining on termId, and grouping on (termId, documentId)
Fully nested data model (see also later in the presentation)
- Allows complex, non-atomic data types
- E.g.: set, map, tuple

Advantages of a nested data model
- More natural than normalization
- Data is often already stored in a nested fashion on disk
  - E.g.: a web crawler outputs, for each crawled url, the set of outlinks
  - Separating this into normalized form implies the use of joins, which is overkill for web-scale data
- Nested data allows us to have an algebraic language
  - E.g.: each tuple output by GROUP has one non-atomic field, a nested set of tuples from the same group
- Nested data makes life easy when writing UDFs
User Defined Functions
Custom processing is often predominant
- E.g.: users may be interested in performing natural language stemming of a search term, or tagging urls as spam

All commands of Pig Latin can be customized
- Grouping, filtering, joining, per-tuple processing

UDFs support the nested data model
- Input and output can be non-atomic
Example 2
Continues from Example 1
- Assume we want to find, for each category, the top 10 urls according to pagerank

groups = GROUP urls BY category;
output = FOREACH groups GENERATE
         category, top10(urls);

- top10() is a UDF that accepts a set of urls (one group at a time)
- It outputs a set containing the top 10 urls by pagerank for that group
- The final output contains non-atomic fields
UDFs can be used in all Pig Latin constructs

Instead, in SQL, there are restrictions
- Only scalar functions can be used in SELECT clauses
- Only set-valued functions can appear in the FROM clause
- Aggregation functions can only be applied to GROUP BY or PARTITION BY

UDFs can be written in Java, Python and JavaScript [3]
- With streaming, we can also use C/C++, Python, ...

[3] As of Pig 0.8.1 and later. We will use version 0.10.0 or later.
Handling parallel execution
Pig and Pig Latin are geared towards parallel processing
- Of course, the underlying execution engine is MapReduce

Pig Latin primitives are chosen such that they can be easily parallelized
- Non-equi joins, correlated sub-queries, ... are not directly supported

Users may specify parallelization parameters at run time
- Question: Can you specify the number of maps?
- Question: Can you specify the number of reducers?
Pig Latin
Introduction
This is not a complete reference to the Pig Latin language: refer to [1]
- Here we cover some interesting aspects

The focus here is on some language primitives
- Optimizations are treated separately
- How they can be implemented is covered later

Examples are taken from [2, 3]
Data Model
Supports four types
- Atom: contains a simple atomic value such as a string or a number, e.g., ‘alice’
- Tuple: a sequence of fields, each of which can be of any data type, e.g., (‘alice’, ‘lakers’)
- Bag: a collection of tuples with possible duplicates. Flexible schema: no need to have the same number and type of fields, e.g.,

  { (‘alice’, ‘lakers’)
    (‘alice’, (‘iPod’, ‘apple’)) }

  The example shows that tuples can be nested
Supports four types (continued)
- Map: a collection of data items, where each item has an associated key for lookup. The schema, as with bags, is flexible.
  - NOTE: keys are required to be data atoms, for efficient lookup

  ‘fan of’ → { (‘lakers’)
               (‘iPod’) }
  ‘age’ → 20

  - The key ‘fan of’ is mapped to a bag containing two tuples
  - The key ‘age’ is mapped to an atom
- Maps are useful to model datasets in which the schema may be dynamic (over time)
Structure
Pig Latin programs are a sequence of steps
- Can use an interactive shell (called grunt)
- Can feed them as a “script”

Comments
- In line: with double hyphens (--)
- C-style for longer comments (/* ... */)

Reserved keywords
- There is a list of keywords that can’t be used as identifiers
- Same old story as for any language
Statements
As a Pig Latin program is executed, each statement is parsed
- The interpreter builds a logical plan for every relational operation
- The logical plan of each statement is added to that of the program so far
- Then the interpreter moves on to the next statement

IMPORTANT: No data processing takes place during construction of the logical plan
- When the interpreter sees the first line of a program, it confirms that it is syntactically and semantically correct
- Then it adds it to the logical plan
- It does not even check the existence of files, for data load operations
→ It makes no sense to start any processing until the whole flow is defined
- Indeed, there are several optimizations that could make a program more efficient (e.g., by avoiding operating on some data that is later going to be filtered)

The triggers for Pig to start execution are the DUMP and STORE statements
- It is only at this point that the logical plan is compiled into a physical plan

How the physical plan is built
- Pig prepares a series of MapReduce jobs
  - In Local mode, these are run locally on the JVM
  - In MapReduce mode, the jobs are sent to the Hadoop cluster
- IMPORTANT: The EXPLAIN command can be used to show the MapReduce plan
Multi-query execution
There is a difference between DUMP and STORE
- Apart from diagnosis and interactive mode, in batch mode STORE allows for program/job optimizations

Main optimization objective: minimize I/O
- Consider the following example:

A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';
In the example, relations B and C are both derived from A
- Naively, this means that at the first STORE operator the input should be read
- Then, at the second STORE operator, the input should be read again

Pig will run this as a single MapReduce job
- Relation A is going to be read only once
- Then, each of the relations B and C will be written to its own output
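The effect of the multi-query optimization can be sketched as a single pass that materializes both outputs at once; this is only an illustrative simulation, not Pig's actual implementation, and all names here are invented:

```python
def multi_query(records, predicate):
    """One pass over the input produces both outputs, mirroring what
    Pig's multi-query optimization does for the two STORE statements."""
    b, c = [], []
    for r in records:               # relation A is read only once
        (b if predicate(r) else c).append(r)
    return b, c

A = [("1", "banana"), ("2", "apple"), ("3", "banana")]
B, C = multi_query(A, lambda r: r[1] == "banana")
# B == [("1", "banana"), ("3", "banana")]; C == [("2", "apple")]
```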
Expressions
An expression is something that is evaluated to yield a value
- Look up the documentation in [3]
language. We describe our data model in Section 3.1, andthe Pig Latin statements in the subsequent subsections. Theemphasis of this section is not on the syntactical details ofPig Latin, but on how it meets the design goals and featureslaid out in Section 2. Also, this section only focusses onthe language primitives, and not on how they can be imple-mented to execute in parallel over a cluster. Implementationis covered in Section 4.
3.1 Data ModelPig has a rich, yet simple data model consisting of the
following four types:
• Atom: An atom contains a simple atomic value such asa string or a number, e.g., ‘alice’.
• Tuple: A tuple is a sequence of fields, each of which canbe any of the data types, e.g., (‘alice’, ‘lakers’).
• Bag: A bag is a collection of tuples with possible dupli-cates. The schema of the constituent tuples is flexible,i.e., not all tuples in a bag need to have the same numberand type of fields, e.g.,
⇤(‘alice’, ‘lakers’)`
‘alice’, (‘iPod’, ‘apple’)´
⌅
t =
„‘alice’,
⇤(‘lakers’, 1)(‘iPod’, 2)
⌅,ˆ‘age’ ⇤ 20
˜«
Let fields of tuple t be called f1, f2, f3
Expression Type Example Value for t
Constant ‘bob’ Independent of tField by position $0 ‘alice’
Field by name f3ˆ‘age’ ⇤ 20
˜
Projection f2.$0
⇤(‘lakers’)(‘iPod’)
⌅
Map Lookup f3#‘age’ 20
Function Evaluation SUM(f2.$1) 1 + 2 = 3
ConditionalExpression
f3#‘age’>18?‘adult’:‘minor’
‘adult’
Flattening FLATTEN(f2)‘lakers’, 1
‘iPod’, 2
Table 1: Expressions in Pig Latin.
The above example also demonstrates that tuples can benested, e.g., the second tuple of the bag has a nestedtuple as its second field.
• Map: A map is a collection of data items, where eachitem has an associated key through which it can belooked up. As with bags, the schema of the constituentdata items is flexible, i.e., all the data items in the mapneed not be of the same type. However, the keys are re-quired to be data atoms, mainly for e⌅ciency of lookups.The following is an example of a map:
2
4 ‘fan of’ ⇤⇤
(‘lakers’)
(‘iPod’)
⌅
‘age’ ⇤ 20
3
5
In the above map, the key ‘fan of’ is mapped to a bagcontaining two tuples, and the key ‘age’ is mapped toan atom 20.
A map is especially useful to model data sets whereschemas might change over time. For example, if webservers decide to include a new field while dumping logs,that new field can simply be included as an additionalkey in a map, thus requiring no change to existing pro-grams, and also allowing access of the new field to newprograms.
Table 1 shows the expression types in Pig Latin, and howthey operate. (The flattening expression is explained in de-tail in Section 3.3.) It should be evident that our data modelis very flexible and permits arbitrary nesting. This flexibil-ity allows us to achieve the aims outlined in Section 2.3,where we motivated our use of a nested data model. Next,we describe the Pig Latin commands.
3.2 Specifying Input Data: LOADThe first step in a Pig Latin program is to specify what
the input data files are, and how the file contents are to bedeserialized, i.e., converted into Pig’s data model. An inputfile is assumed to contain a sequence of tuples, i.e., a bag.This step is carried out by the LOAD command. For example,
queries = LOAD ‘query_log.txt’
USING myLoad()
AS (userId, queryString, timestamp);
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 62 / 105
Hadoop PIG PIG LATIN
Schemas
A relation in Pig may have an associated schemaI This is optionalI A schema gives the fields in the relations names and typesI Use the command DESCRIBE to reveal the schema in use for a
relation
Schema declaration is flexible but reuse is awkwardI A set of queries over the same input data will often have the same
schemaI This is sometimes hard to maintain (unlike HIVE) as there is no
external components to maintain this associationHINT:: You can write a UDF function to perform a personalized load
operation which encapsulates the schema
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 63 / 105
Hadoop PIG PIG LATIN
Validation and nulls
Pig does not have the same power to enforce constraints onschema at load time as a RDBMS
I If a value cannot be cast to a type declared in the schema, then itwill be set to a null value
I This also happens for corrupt files
A useful technique to partition input data to discern good andbad records
I Use the SPLIT operatorSPLIT records INTO good_records IF temperature isnot null, bad _records IF temperature is NULL;
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 64 / 105
Hadoop PIG PIG LATIN
Other relevant information
Schema mergingI How schema are propagated to new relations?
FunctionsI Look up on the web for Piggy Bank
User-Defined FunctionsI Use [3] for an introduction to designing UDFs
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 65 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
Loading and storing data
The first step in a Pig Latin program is to load dataI What input files areI How the file contents are to be deserializedI An input file is assumed to contain a sequence of tuples
Data loading is done with the LOAD commandqueries = LOAD ‘query_log.txt’USING myLoad()AS (userId, queryString, timestamp);
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 66 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
Loading and storing data
The example above specifies the following:I The input file is query_log.txtI The input file should be converted into tuples using the custommyLoad deserializer
I The loaded tuples have three fields, specified by the schema
Optional partsI USING clause is optional: if not specified, the input file is assumed
to be plain text, tab-delimitedI AS clause is optional: if not specified, must refer to fileds by position
instead of by name
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 67 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
Loading and storing data
Return value of the LOAD commandI Handle to a bagI This can be used by subsequent commands→ bag handles are only logical→ no file is actually read!
The command to write output to disk is STOREI It has similar semantics to the LOAD command
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 68 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
Per-tuple processing: Filtering data
Once you have some data loaded into a relation, the next stepis to filter it
I This is done, e.g., to remove unwanted dataI HINT: By filtering early in the processing pipeline, you minimize the
amount of data flowing trough the system
A basic operation is to apply some processing over everytuple of a data set
I This is achieved with the FOREACH commandexpanded_queries = FOREACH queries GENERATEuserId, expandQuery(queryString);
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 69 / 105
Hadoop PIG PIG LATIN
Data Processing OperatorsPer-tuple processing: Filtering data
Comments on the example above:I Each tuple of the bag queries should be processed independentlyI The second field of the output is the result of a UDF
Semantics of the FOREACH commandI There can be no dependence between the processing of different
input tuples→ This allows for an efficient parallel implementation
Semantics of the GENERATE clauseI Followed by a list of expressionsI Also flattering is allowed
F This is done to eliminate nesting in data→ Allows to make output data independent for further parallel
processing→ Useful to store data on disk
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 70 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
Per-tuple processing: Discarding unwanted data
A common operation is to retain a portion of the input dataI This is done with the FILTER commandreal_queries = FILTER queries BY userId neq‘bot’;
Filtering conditions involve a combination of expressionsI Comparison operatorsI Logical connectorsI UDF
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 71 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
Per-tuple processing: Streaming data
The STREAM operator allows transforming data in a relationusing an external program or script
I This is possible because Hadoop MapReduce supports “streaming”I Example:C = STREAM A THROUGH ‘cut -f 2’;which use the Unix cut command to extract the second filed ofeach tuple in A
The STREAM operator uses PigStorage to serialize anddeserialize relations to and from stdin/stdout
I Can also provide a custom serializer/deserializerI Works well with python
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 72 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
Getting related data together
It is often necessary to group together tuples from one ormore data sets
I We will explore several nuances of “grouping”
The first grouping operation we study is given by theCOGROUP commandExample: Assume we have loaded two relationsresults: (queryString, url, position)
revenue: (queryString, adSlot, amount)I results contains, for different query strings, the urls shown as
search results, and the positions at which they where shownI revenue contains, for different query strings, and different
advertisement slots, the average amount of revenue
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 73 / 105
Hadoop PIG PIG LATIN
Data Processing OperatorsGetting related data together
Suppose we want to group together all search results dataand revenue data for the same query string
grouped_data = COGROUP results BY queryString,revenue BY queryString;
Figure 2: COGROUP versus JOIN.
advertisement slots, the average amount of revenue made bythe advertisements for that query string at that slot. Thento group together all search result data and revenue data forthe same query string, we can write:
grouped_data = COGROUP results BY queryString,
revenue BY queryString;
Figure 2 shows what a tuple in grouped_data looks like.In general, the output of a COGROUP contains one tuple foreach group. The first field of the tuple (named group) is thegroup identifier (in this case, the value of the queryString
field). Each of the next fields is a bag, one for each inputbeing cogrouped, and is named the same as the alias of thatinput. The ith bag contains all tuples from the ith inputbelonging to that group. As in the case of filtering, groupingcan also be performed according to arbitrary expressionswhich may include UDFs.
The reader may wonder why a COGROUP primitive is neededat all, since a very similar primitive is provided by the fa-miliar, well-understood, JOIN operation in databases. Forcomparison, Figure 2 also shows the result of joining ourdata sets on queryString. It is evident that JOIN is equiv-alent to COGROUP, followed by taking a cross product of thetuples in the nested bags. While joins are widely applicable,certain custom processing might require access to the tuplesof the groups before the cross-product is taken, as shown bythe following example.
Example 3. Suppose we were trying to attribute searchrevenue to search-result urls to figure out the monetary worthof each url. We might have a sophisticated model for doingso. To accomplish this task in Pig Latin, we can follow theCOGROUP with the following statement:
url_revenues = FOREACH grouped_data GENERATE
FLATTEN(distributeRevenue(results, revenue));
where distributeRevenue is a UDF that accepts search re-sults and revenue information for a query string at a time,and outputs a bag of urls and the revenue attributed to them.For example, distributeRevenue might attribute revenuefrom the top slot entirely to the first search result, while therevenue from the side slot may be attributed equally to allthe results. In this case, the output of the above statementfor our example data is shown in Figure 2.
To specify the same operation in SQL, one would haveto join by queryString, then group by queryString, andthen apply a custom aggregation function. But while doingthe join, the system would compute the cross product of thesearch and revenue information, which the custom aggre-gation function would then have to undo. Thus, the wholeprocess become quite ine⇤cient, and the query becomes hardto read and understand.
To reiterate, the COGROUP statement illustrates a key dif-ference between Pig Latin and SQL. The COGROUP state-ments conforms to our goal of having an algebraic language,where each step carries out only a single transformation(Section 2.1). COGROUP carries out only the operation ofgrouping together tuples into nested bags. The user cansubsequently choose to apply either an aggregation functionon those tuples, or cross-product them to get the join result,or process it in a custom way as in Example 3. In SQL,grouping is available only bundled with either aggregation(group-by-aggregate queries), or with cross-producting (theJOIN operation). Users find this additional flexibility of PigLatin over SQL quite attractive, e.g.,
“I frankly like pig much better than SQL in somerespects (group + optional flatten works betterfor me, I love nested data structures).”– Ted Dunning, Chief Scientist, Veoh Networks
Note that it is our nested data model that allows us tohave COGROUP as an independent operation—the input tu-ples are grouped together and put in nested bags. Sucha primitive is not possible in SQL since the data model isflat. Of course, such a nested model raises serious concernsabout e⌅ciency of implementation: since groups can be verylarge (bigger than main memory, perhaps), we might buildup gigantic tuples, which have these enormous nested bagswithin them. We address these e⌅ciency concerns in ourimplementation section (Section 4).
3.5.1 Special Case of COGROUP: GROUPA common special case of COGROUP is when there is only
one data set involved. In this case, we can use the alter-native, more intuitive keyword GROUP. Continuing with ourexample, if we wanted to find the total revenue for eachquery string, (a typical group-by-aggregate query), we canwrite it as follows:
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 74 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
The COGROUP command
Output of a COGROUP contains one tuple for each groupI First field (group) is the group identifier (the value of thequeryString)
I Each of the next fields is a bag, one for each group beingco-grouped
Grouping can be performed according to UDFs
Next: why COGROUP when you can use JOINS?
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 75 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
COGROUP vs JOIN
JOIN vs. COGROUPI Their are equivalent: JOIN = COGROUP followed by a cross product
of the tuples in the nested bags
Example 3: Suppose we try to attribute search revenue tosearch-results urls→ compute monetary worth of each url
grouped_data = COGROUP results BY queryString,revenue BY queryString;url_revenues = FOREACH grouped_data GENERATEFLATTEN(distrubteRevenue(results, revenue));
I Where distrubteRevenue is a UDF that accepts search resultsand revenue information for each query string, and outputs a bag ofurls and revenue attributed to them
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 76 / 105
Hadoop PIG PIG LATIN
Data Processing OperatorsCOGROUP vs JOIN
More details on the UDF distribute RevenueI Attributes revenue from the top slot entirely to the first search resultI The revenue from the side slot may be equally split among all
results
Let’s see how to do the same with a JOINI JOIN the tables results and revenues by queryStringI GROUP BY queryStringI Apply a custom aggregation function
What happens behind the scenesI During the join, the system computes the cross product of the
search and revenue informationI Then the custom aggregation needs to undo this cross product,
because the UDF specifically requires so
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 77 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
COGROUP in details
The COGROUP statement conforms to an algebraic languageI The operator carries out only the operation of grouping together
tuples into nested bagsI The user can the decide wether to apply a (custom) aggregation on
those tuples or to cross-product them and obtain a join
It is thanks to the nested data model that COGROUP is anindependent operation
I Implementation details are trickyI Groups can be very large (and are redundant)
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 78 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
A special case of COGROUP: the GROUP operator
Sometimes, we want to operate on a single datasetI This is when you use the GROUP operator
Let’s continue from Example 3:I Assume we want to find the total revenue for each query string.
This writes as:grouped_revenue = GROUP revenue BY queryString;query_revenue = FOREACH grouped_revenue GENERATEqueryString, SUM(revenue.amount) AS totalRevenue;
I Note that revenue.amount refers to a projection of the nestedbag in the tuples of grouped_revenue
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 79 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
JOIN in Pig Latin
In many cases, the typical operation on two or more datasetsamounts to an equi-join
I IMPORTANT NOTE: large datasets that are suitable to be analyzedwith Pig (and MapReduce) are generally not normalized
→ JOINs are used more infrequently in Pig Latin than they are in SQL
The syntax of a JOINjoin_result = JOIN results BY queryString,revenue BY queryString;
I This is a classic inner join (actually an equi join), where each matchbetween the two relations corresponds to a row in thejoin_result
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 80 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
JOIN in Pig Latin
JOINs lend themselves to optimization opportunitiesI We will work on this in the laboratory
Assume we join two datasets, one of which is considerablysmaller than the other
I For instance, suppose a dataset fits in memory
Fragment replicate joinI Syntax: append the clause USING “replicated” to a JOIN
statementI Uses a distributed cache available in HadoopI All mappers will have a copy of the small input→ This is a Map-side join
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 81 / 105
Hadoop PIG PIG LATIN
Data Processing Operators
MapReduce in Pig Latin
It is trivial to express MapReduce programs in Pig LatinI This is achieved using GROUP and FOREACH statementsI A map function operates on one input tuple at a time and outputs a
bag of key-value pairsI The reduce function operates on all values for a key at a time to
produce the final result
Examplemap_result = FOREACH input GENERATEFLATTEN(map(*));key_groups = GROUP map_results BY $0;output = FOREACH key_groups GENERATE reduce(*);
I where map() and reduce() are UDF
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 82 / 105
Hadoop PIG Implementation
Implementation
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 83 / 105
Hadoop PIG Implementation
Introduction
Pig Latin Programs are compiled into MapReduce jobs, andexecuted using Hadoop
How to build a logical plan for a Pig Latin program
How to compile the logical plan into a physical plan ofMapReduce jobs
How to avoid resource exhaustion
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 84 / 105
Hadoop PIG Implementation
Building a Logical Plan
As clients issue Pig Latin commands (interactive or batchmode)
I The Pig interpreter parses the commandsI Then it verifies validity of input files and bags (variables)
F E.g.: if the command is c = COGROUP a BY ..., b BY ...;, itverifies if a and b have already been defined
Pig builds a logical plan for every bagI When a new bag is defined by a command, the new logical plan is a
combination of the plans for the input and that of the currentcommand
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 85 / 105
Hadoop PIG Implementation
Building a Logical Plan
No processing is carried out when constructing the logicalplans
I Processing is triggered only by STORE or DUMPI At that point, the logical plan is compiled to a physical plan
Lazy execution modelI Allows in-memory pipeliningI File reorderingI Various optimizations from the traditional RDBMS world
Pig is (potentially) platform independentI Parsing and logical plan construction are platform obliviousI Only the compiler is specific to Hadoop
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 86 / 105
Hadoop PIG Implementation
Building the Physical Plan
Compilation of a logical plan into a physical plan is “simple”I MapReduce primitives allow a parallel GROUP BY
F Map assigns keys for groupingF Reduce process a group at a time (actually in parallel)
How the compiler worksI Converts each (CO)GROUP command in the logical plan into
distinct MapReduce jobsI Map function for (CO)GROUP command C initially assigns keys to
tuples based on the BY clause(s) of CI Reduce function is initially a no-op
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 87 / 105
Hadoop PIG Implementation
Building the Physical Plan
an open-source project in the Apache incubator, and henceavailable for general use.
We first describe how Pig builds a logical plan for a PigLatin program. We then describe our current compiler, thatcompiles a logical plan into map-reduce jobs executed usingHadoop. Last, we describe how our implementation avoidslarge nested bags, and how it handles them if they do arise.
4.1 Building a Logical PlanAs clients issue Pig Latin commands, the Pig interpreter
first parses it, and verifies that the input files and bags be-ing referred to by the command are valid. For example, ifthe user enters c = COGROUP a BY . . ., b BY . . ., Pig veri-fies that the bags a and b have already been defined. Pigbuilds a logical plan for every bag that the user defines.When a new bag is defined by a command, the logical planfor the new bag is constructed by combining the logical plansfor the input bags, and the current command. Thus, in theabove example, the logical plan for c consists of a cogroupcommand having the logical plans for a and b as inputs.
Note that no processing is carried out when the logicalplans are constructed. Processing is triggered only when theuser invokes a STORE command on a bag. At that point, thelogical plan for that bag is compiled into a physical plan,and is executed. This lazy style of execution is beneficialbecause it permits in-memory pipelining, and other opti-mizations such as filter reordering across multiple Pig Latincommands.
Pig is architected such that the parsing of Pig Latin andthe logical plan construction is independent of the execu-tion platform. Only the compilation of the logical plan intoa physical plan depends on the specific execution platformchosen. Next, we describe the compilation into Hadoopmap-reduce, the execution platform currently used by Pig.
4.2 Map-Reduce Plan CompilationCompilation of a Pig Latin logical plan into map-reduce
jobs is fairly simple. The map-reduce primitive essentiallyprovides the ability to do a large-scale group by, where themap tasks assign keys for grouping, and the reduce tasksprocess a group at a time. Our compiler begins by convertingeach (CO)GROUP command in the logical plan into a distinctmap-reduce job with its own map and reduce functions.
The map function for (CO)GROUP command C initially justassigns keys to tuples based on the BY clause(s) of C; thereduce function is initially a no-op. The map-reduce bound-ary is the cogroup command. The sequence of FILTER, andFOREACH commands from the LOAD to the first COGROUP op-eration C1, are pushed into the map function correspondingto C1 (see Figure 3). The commands that intervene betweensubsequent COGROUP commands Ci and Ci+1 can be pushedinto either (a) the reduce function corresponding to Ci, or(b) the map function corresponding to Ci+1. Pig currentlyalways follows option (a). Since grouping is often followedby aggregation, this approach reduces the amount of datathat has to be materialized between map-reduce jobs.
In the case of a COGROUP command with more than oneinput data set, the map function appends an extra field toeach tuple that identifies the data set from which the tupleoriginated. The accompanying reduce function decodes thisinformation and uses it to insert the tuple into the appro-priate nested bag when cogrouped tuples are formed (recallFigure 2).
Figure 3: Map-reduce compilation of Pig Latin.
Parallelism for LOAD is obtained since Pig operates overfiles residing in the Hadoop distributed file system. We alsoautomatically get parallelism for FILTER and FOREACH oper-ations since for a given map-reduce job, several map and re-duce instances are run in parallel. Parallelism for (CO)GROUPis achieved since the output from the multiple map instancesis repartitioned in parallel to the multiple reduce instances.
The ORDER command is implemented by compiling intotwo map-reduce jobs. The first job samples the input todetermine quantiles of the sort key. The second job range-partitions the input according to the quantiles (thereby en-suring roughly equal-sized partitions), followed by local sort-ing in the reduce phase, resulting in a globally sorted file.
The inflexibility of the map-reduce primitive results insome overheads while compiling Pig Latin into map-reducejobs. For example, data must be materialized and replicatedon the distributed file system between successive map-reducejobs. When dealing with multiple data sets, an additionalfield must be inserted in every tuple to indicate which dataset it came from. However, the Hadoop map-reduce im-plementation does provide many desired properties such asparallelism, load-balancing, and fault-tolerance. Given theproductivity gains to be had through Pig Latin, the asso-ciated overhead is often acceptable. Besides, there is thepossibility of plugging in a di�erent execution platform thatcan implement Pig Latin operations without such overheads.
4.3 Efficiency With Nested BagsRecall Section 3.5. Conceptually speaking, our (CO)GROUP
command places tuples belonging to the same group intoone or more nested bags. In many cases, the system canavoid actually materializing these bags, which is especiallyimportant when the bags are larger than one machine’s mainmemory.
One common case is where the user applies a distribu-tive or algebraic [8] aggregation function over the result ofa (CO)GROUP operation. (Distributive is a special case ofalgebraic, so we will only discuss algebraic functions.) Analgebraic function is one that can be structured as a treeof subfunctions, with each leaf subfunction operating over asubset of the input data. If nodes in this tree achieve datareduction, then the system can keep the amount of datamaterialized in any single location small. Examples of al-gebraic functions abound: COUNT, SUM, MIN, MAX, AVERAGE,VARIANCE, although some useful functions are not algebraic,e.g., MEDIAN.
When Pig compiles programs into Hadoop map-reducejobs, it uses Hadoop’s combiner feature to achieve a two-tiertree evaluation of algebraic functions. Pig provides a specialAPI for algebraic user-defined functions, so that custom userfunctions can take advantage of this important optimization.
MapReduce boundary is the COGROUP commandI The sequence of FILTER and FOREACH from the LOAD to the firstCOGROUP C1 are pushed in the Map function
I The commands in later COGROUP commands Ci and Ci+1 can bepushed into:
F the Reduce function of CiF the Map function of Ci+1
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 88 / 105
Hadoop PIG Implementation
Building the Physical Plan
Pig optimization for the physical planI Among the two options outlined above, the first is preferredI Indeed, grouping is often followed by aggregation→ reduces the amount of data to be materialized between jobs
COGROUP command with more than one input datasetI Map function appends an extra field to each tuple to identify the
datasetI Reduce function decodes this information and inserts tuple in the
appropriate nested bags for each group
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 89 / 105
Hadoop PIG Implementation
Building the Physical Plan
How parallelism is achievedI For LOAD this is inherited by operating over HDFSI For FILTER and FOREACH, this is automatic thanks to MapReduce
frameworkI For (CO)GROUP uses the SHUFFLE phase
A note on the ORDER commandI Translated in two MapReduce jobsI First job: Samples the input to determine quantiles of the sort keyI Second job: Range partitions the input according to quantiles,
followed by sorting in the reduce phase
Known overheads due to MapReduce inflexibilityI Data materialization between jobsI Multiple inputs are not supported well
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 90 / 105
Hadoop PIG Implementation
Efficiency measures
(CO)GROUP command place tuples of the same group innested bags
I Bag materialization (I/O) can be avoidedI This is important also due to memory constraintsI Distributive or algebraic aggregation facilitate this task
What is an algebraic function?I Function that can be structured as a tree of sub-functionsI Each leaf sub-function operates over a subset of the input data→ If nodes in the tree achieve data reduction, then the system can
reduce materializationI Examples: COUNT, SUM, MIN, MAX, AVERAGE, ...
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 91 / 105
Hadoop PIG Implementation
Efficiency measures
Pig compiler uses the combiner function of HadoopI A special API for algebraic UDF is available
There are cases in which (CO)GROUP is inefficientI This happens with non-algebraic functionsI Nested bags can be spilled to diskI Pig provides a disk-resident bag implementation
F Features external sort algorithmsF Features duplicates elimination
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 92 / 105
Hadoop PIG Debugging
Debugging
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 93 / 105
Hadoop PIG Debugging
IntroductionThe process of creating Pig Latin programs is generallyiterative
I The user makes an initial stabI The stab is executedI The user inspects the output check correctnessI If not, revise the program and repeat the process
This iterative process can be inefficientI The sheer size of data volumes hinders this kind of experimentation→ Need to create a side dataset that is a small sample of the original
one
Sampling can be problematicI Example: consider an equi-join on relations A(x,y) and B(x,z)
on attribute xI If there are many distinct values of x, it is highly probable that a
small sample of A and B will not contain matching x values→ Empty result
Pietro Michiardi (Eurecom) Tutorial: High-Level Programming Languages 94 / 105
Hadoop PIG Debugging
Welcome Pig Pen
Pig comes with a debugging environment, Pig PenI It creates a side dataset automaticallyI This is done in a manner that avoids sampling problems→ The side dataset must be tailored to the user program
Sandbox Dataset
- Takes as input a Pig Latin program P
  - This is a sequence of n commands
  - Each command consumes one or more input bags and produces one output bag
- The output is a set of example bags {B1, B2, ..., Bn}
  - Each example bag corresponds to the output of one command in P
- The set of example bags needs to be consistent
  - The output of each operator needs to be the one obtained with the input example bag
Properties of the Sandbox Dataset
There are three primary objectives in selecting a sandbox dataset
- Realism: the sandbox should be a subset of the actual dataset. If this is not possible, individual values should be ones found in the actual dataset
- Conciseness: the example bags should be as small as possible
- Completeness: the example bags should collectively illustrate the key semantics of each command
Overview of the procedure to generate the sandbox
- Take small random samples of the original data
- Synthesize additional data tuples to improve completeness
- When possible, use real data values in synthetic tuples
- Apply a pruning pass to eliminate redundant example tuples and improve conciseness
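The spirit of this procedure can be sketched in Python (a toy approximation with made-up relations, not Pig Pen's actual algorithm): sample both join inputs, then synthesize a matching tuple, reusing real attribute values, whenever the example join output would otherwise be empty:

```python
import random

random.seed(1)

A = [(x, f"y{x}") for x in range(10_000)]  # A(x, y)
B = [(x, f"z{x}") for x in range(10_000)]  # B(x, z)

def equi_join(r, s):
    z_by_x = {x: z for x, z in s}
    return [(x, y, z_by_x[x]) for x, y in r if x in z_by_x]

# Step 1: take small random samples (realism, conciseness)
sample_A = random.sample(A, 5)
sample_B = random.sample(B, 5)

# Step 2: synthesize a tuple so the join output is non-empty
# (completeness), reusing a real attribute value from B when one exists
if not equi_join(sample_A, sample_B):
    x, _y = sample_A[0]
    z = next(z for bx, z in B if bx == x)  # real value for this key
    sample_B.append((x, z))

# Step 3 (pruning redundant example tuples) is omitted in this sketch
assert len(equi_join(sample_A, sample_B)) >= 1
```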
Hadoop PIG Optimizations
Optimizations
Introduction

Pig implements several optimizations
- Most of them are derived from traditional work on RDBMS
- Logical vs. physical optimizations
[Figure: the Pig compilation pipeline — a Pig Latin program (here, a FILTER on A(x,y), a JOIN with B(x,y) and a UDF producing the output) goes through the parser to a logical plan, then through the query plan compiler to a physical plan, and finally through the cross-job optimizer and the MapReduce compiler to a MapReduce program executed on the cluster]
Single-program Optimizations
Logical optimizations: query plan
- Early projection
- Early filtering
- Operator rewrites

Physical optimizations: execution plan
- Mapping of logical operations to MapReduce
- Splitting logical operations into multiple physical ones
- Join execution strategies
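Early projection and early filtering can be illustrated with a small Python sketch (toy data; the point is that the rewritten plan produces the same result while passing a much smaller intermediate relation into the join):

```python
# Toy relations: A(x, y, payload) carries a wide payload column, B(x, z)
A = [(x, x % 10, "blob" * 50) for x in range(1_000)]
B = [(x, x * 2) for x in range(1_000)]

def naive(a, b):
    # Join first, filter and project afterwards: the wide payload
    # travels through the whole join
    z_by_x = {x: z for x, z in b}
    joined = [(x, y, p, z_by_x[x]) for x, y, p in a if x in z_by_x]
    return [(x, z) for x, y, p, z in joined if y == 0]

def optimized(a, b):
    # Early filtering and early projection: drop non-matching tuples
    # and the payload column before the join
    a_small = [(x,) for x, y, _p in a if y == 0]
    z_by_x = {x: z for x, z in b}
    return [(x, z_by_x[x]) for (x,) in a_small if x in z_by_x]

# Both plans compute the same answer
assert sorted(naive(A, B)) == sorted(optimized(A, B))
```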
Cross-program Optimizations
Popular tables
- Web crawls
- Search query logs

Popular transformations
- Eliminate spam
- Group pages by host
- Join the web crawl with the search log
GOAL: minimize redundant work
Concurrent work sharing
- Execute related Pig Latin programs together, to perform common work only once
- This is difficult to achieve: scheduling, “sharability”

Non-concurrent work sharing
- Re-use I/O or CPU work done by one program later in time
- This is difficult to achieve: caching, replication
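A toy Python sketch of the caching idea behind non-concurrent work sharing (the names and structure are illustrative, not Pig's implementation): an expensive shared step, such as scanning a popular table, is paid for by the first program and reused by later ones:

```python
scan_count = 0

def expensive_scan():
    """Stands in for costly I/O over a large shared table A(x, y)."""
    global scan_count
    scan_count += 1
    return [(x, x % 10) for x in range(1_000)]

cache = {}

def shared_scan(name):
    if name not in cache:     # the first program pays the cost ...
        cache[name] = expensive_scan()
    return cache[name]        # ... later programs reuse the cached result

# "Program 1": filter over A
job1 = [t for t in shared_scan("A") if t[1] == 0]

# "Program 2", run later: group A by y
job2 = {}
for x, y in shared_scan("A"):
    job2.setdefault(y, []).append(x)

assert scan_count == 1  # the scan of A was performed only once
```

In practice this is the hard part the slide alludes to: deciding what to cache or replicate, where, and for how long.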
Work-Sharing Techniques
[Figure: without sharing, Job 1 (running OPERATOR 1) and Job 2 (running OPERATOR 2) each scan the same input A(x,y) separately]
[Figure: OPERATOR 1 is applied to A(x,y) in Job 1, producing an intermediate result A' that is reused by later operators (OPERATOR 2, OPERATOR 3) in Job 2 instead of being recomputed]
[Figure: relation A is replicated from WORKER 1 to WORKER 2, so that the JOIN involving A and the locally stored relations B, C and D can run on each worker without re-reading A remotely]