Tutorial: High-Level Programming Languages - MapReduce Simplified


Page 1: Tutorial: High-Level Programming Languages - MapReduce Simplified

Tutorial: High-Level Programming Languages - MapReduce Simplified

Pietro Michiardi

Eurecom


Page 2: Tutorial: High-Level Programming Languages - MapReduce Simplified

Introduction


Page 3: Tutorial: High-Level Programming Languages - MapReduce Simplified

Overview

Raising the level of abstraction for processing large datasets
- Scalable algorithm design is complex using MapReduce
- Code gets messy, redundant, and difficult to re-use

Many alternatives exist, based on different principles
- Data-flow programming
- SQL-like declarative programming
- Additional operators (besides Map and Reduce)

Optimization is a hot research topic
- Based on traditional RDBMS optimizations

Page 4: Tutorial: High-Level Programming Languages - MapReduce Simplified

Topics covered

Review foundations of relational algebra in light of MapReduce

Hadoop PIG
- Data-flow language, originated from Yahoo!
- Internals
- Optimizations

Cascading + Scalding

SPARK (1)

(1) This is an abuse: SPARK is an execution engine that replaces Hadoop, based on Resilient Distributed Datasets that reside in memory. The programming model is MapReduce, using Scala.

Page 5: Tutorial: High-Level Programming Languages - MapReduce Simplified

Relational Algebra and MapReduce


Page 6: Tutorial: High-Level Programming Languages - MapReduce Simplified

Introduction

Disclaimer
- This is not a full course on Relational Algebra
- Neither is this a course on SQL

Introduction to Relational Algebra, RDBMS and SQL
- Follow the video lectures of the Stanford class on RDBMS: http://www.db-class.org/
→ Note that you have to sign up for an account

Overview of this part
- Brief introduction to simplified relational algebra
- Useful to understand Pig, Hive and HBase

Page 7: Tutorial: High-Level Programming Languages - MapReduce Simplified

Relational Algebra Operators

There are a number of operations on data that fit the relational algebra model well
- In traditional RDBMS, queries involve retrieval of small amounts of data
- In this course, and in particular in this class, we should keep in mind the particular workload underlying MapReduce
→ Full scans of large amounts of data
→ Queries are not selective, they process all data

A review of some terminology
- A relation is a table
- Attributes are the column headers of the table
- The set of attributes of a relation is called a schema

Example: R(A1, A2, ..., An) indicates a relation called R whose attributes are A1, A2, ..., An

Page 8: Tutorial: High-Level Programming Languages - MapReduce Simplified

Operators

Page 9: Tutorial: High-Level Programming Languages - MapReduce Simplified

Operators

Let's start with an example
- Below, we have part of a relation called Links describing the structure of the Web
- There are two attributes: From and To
- A row, or tuple, of the relation is a pair of URLs, indicating the existence of a link between them
→ The number of tuples in a real dataset is in the order of billions (10^9)

From   To
url1   url2
url1   url3
url2   url3
url2   url4
...    ...

Page 10: Tutorial: High-Level Programming Languages - MapReduce Simplified

Operators

Relations (however big) can be stored in a distributed filesystem
- If they don't fit in a single machine, they're broken into pieces (think HDFS)

Next, we review and describe a set of relational algebra operators
- Intuitive explanation of what they do
- "Pseudo-code" of their implementation in/by MapReduce

Page 11: Tutorial: High-Level Programming Languages - MapReduce Simplified

Operators

Selection: σ_C(R)
- Apply condition C to each tuple of relation R
- Produce in output a relation containing only the tuples that satisfy C

Projection: π_S(R)
- Given a subset S of the attributes of relation R
- Produce in output a relation containing only the attributes in S

Union, Intersection and Difference
- Well-known operators on sets
- Apply to the sets of tuples in two relations that have the same schema
- Variations on the theme: work on bags

Page 12: Tutorial: High-Level Programming Languages - MapReduce Simplified

Operators

Natural join: R ⋈ S
- Given two relations, compare each pair of tuples, one from each relation
- If the tuples agree on all the attributes common to both schemas → produce an output tuple that has a component for each attribute
- Otherwise produce nothing
- The join condition can be on a subset of attributes

Let's work with an example
- Recall the Links relation from the previous slides
- Query (or data processing job): find the paths of length two in the Web

Page 13: Tutorial: High-Level Programming Languages - MapReduce Simplified

Join Example

Informally, to satisfy the query we must:
- find the triples of URLs of the form (u, v, w) such that there is a link from u to v and a link from v to w

Using the join operator
- Imagine we have two relations (with different schemas), and let's try to apply the natural join operator
- There are two copies of Links: L1(U1, U2) and L2(U2, U3)
- Let's compute L1 ⋈ L2
  - For each tuple t1 of L1 and each tuple t2 of L2, see if their U2 components are the same
  - If yes, then produce a tuple in output, with the schema (U1, U2, U3)

Page 14: Tutorial: High-Level Programming Languages - MapReduce Simplified

Join Example

What we have seen is called (to be precise) a self-join
- Question: How would you implement a self-join in your favorite programming language?
- Question: What is the time complexity of your algorithm?
- Question: What is the space complexity of your algorithm?

To continue the example
- Say you are not interested in the entire two-hop path but just the start and end nodes
- Then you do a projection, and the notation would be: π_U1,U3(L1 ⋈ L2)
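
As one possible answer to the questions above, here is a minimal, single-machine sketch in Python (toy data assumed, not from the slides). Indexing one copy of the relation by the join attribute gives a hash join that runs in time roughly linear in input plus output size, versus O(nm) for a naive nested loop, at the cost of O(n) extra space for the index.

# Naive self-join of Links(From, To) with itself: find two-hop paths (u, v, w).
from collections import defaultdict

links = [("url1", "url2"), ("url1", "url3"), ("url2", "url3"), ("url2", "url4")]

# Build an index From -> [To, ...], so all successors of v are found in O(1).
by_from = defaultdict(list)
for v, w in links:
    by_from[v].append(w)

# Probe: for each link (u, v), combine it with every link (v, w).
paths = [(u, v, w) for u, v in links for w in by_from[v]]
print(paths)  # [('url1', 'url2', 'url3'), ('url1', 'url2', 'url4')]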

Page 15: Tutorial: High-Level Programming Languages - MapReduce Simplified

Operators

Grouping and Aggregation: γ_X(R)
- Given a relation R, partition its tuples according to their values in one set of attributes G
  - The set G is called the grouping attributes
- Then, for each group, aggregate the values in certain other attributes
  - Aggregation functions: SUM, COUNT, AVG, MIN, MAX, ...

In the notation, X is a list of elements that can be:
- A grouping attribute
- An expression θ(A), where θ is one of the (five) aggregation functions and A is an attribute NOT among the grouping attributes

Page 16: Tutorial: High-Level Programming Languages - MapReduce Simplified

Operators

Grouping and Aggregation: γ_X(R)
- The result of this operation is a relation with one tuple for each group
- That tuple has a component for each of the grouping attributes, with the value common to the tuples of that group
- That tuple has another component for each aggregation, with the aggregate value for that group

Let's work with an example
- Imagine that a social-networking site has a relation Friends(User, Friend)
- The tuples are pairs (a, b) such that b is a friend of a
- Query: compute the number of friends each member has

Page 17: Tutorial: High-Level Programming Languages - MapReduce Simplified

Grouping and Aggregation Example

How to satisfy the query: γ_User,COUNT(Friend)(Friends)
- This operation groups all the tuples by the value in their first component
→ There is one group for each user
- Then, for each group, it counts the number of friends

Some details
- The COUNT operation applied to an attribute does not consider the values of that attribute
- In fact, it counts the number of tuples in the group
- In SQL, there is a "count distinct" operator that counts the number of different values
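
For concreteness, the same group-and-count query in a few lines of Python over an in-memory list of (User, Friend) pairs; a minimal sketch with toy data (not from the slides):

# gamma_{User, COUNT(Friend)}(Friends): count tuples per group, ignoring the Friend values.
from collections import Counter

friends = [("a", "b"), ("a", "c"), ("b", "a")]
print(Counter(user for user, _ in friends))  # Counter({'a': 2, 'b': 1})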

Page 18: Tutorial: High-Level Programming Languages - MapReduce Simplified

MapReduce implementation of (some) Relational Operators

Page 19: Tutorial: High-Level Programming Languages - MapReduce Simplified

Computing Selection

In practice, selections do not need a full-blown MapReduce implementation
- They can be implemented in the map portion alone
- Actually, they could also be implemented in the reduce portion

A MapReduce implementation of σ_C(R)
- Map: for each tuple t in R, check if t satisfies C; if so, emit a key/value pair (t, t)
- Reduce: identity reducer
  - Question: single or multiple reducers?

NOTE: the output is not exactly a relation
- WHY?
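
A minimal sketch of this pattern as a Hadoop Streaming mapper in Python; the tab-separated layout and the condition pagerank > 0.2 are illustrative assumptions, not part of the slides. Since selection is map-only, the job can be run with zero reducers.

#!/usr/bin/env python
# Streaming mapper for selection sigma_C(R): emit only the tuples that satisfy C.
# Assumed input: tab-separated tuples of a relation (url, category, pagerank).
import sys

for line in sys.stdin:
    t = line.rstrip("\n").split("\t")
    if float(t[2]) > 0.2:       # the condition C (illustrative)
        print("\t".join(t))     # emit the tuple unchanged; no reducer is needed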

Page 20: Tutorial: High-Level Programming Languages - MapReduce Simplified

Computing Projections

Similar process to selection
- But projection may cause the same tuple to appear several times

A MapReduce implementation of π_S(R)
- Map: for each tuple t in R, construct a tuple t' by eliminating those components whose attributes are not in S; emit a key/value pair (t', t')
- Reduce: for each key t' produced by any of the Map tasks, the value list is [t', ..., t']; emit a single key/value pair (t', t')

NOTE: the reduce operation is duplicate elimination
- This operation is associative and commutative, so it is possible to optimize MapReduce by using a Combiner in each mapper
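
A compact sketch of the same idea, simulating the shuffle-and-sort phase with sort + groupby; the column positions kept in S are an illustrative assumption.

# Projection pi_S(R) with duplicate elimination in the reducer.
from itertools import groupby

def map_project(t, keep=(0, 2)):          # keep: positions of the attributes in S
    t_prime = tuple(t[i] for i in keep)
    yield (t_prime, t_prime)              # key and value are both the projected tuple

R = [("url1", "cat1", 0.5), ("url1", "cat2", 0.5), ("url2", "cat1", 0.3)]
pairs = sorted(kv for t in R for kv in map_project(t))   # simulate shuffle & sort
for key, _ in groupby(pairs, key=lambda kv: kv[0]):      # reduce: one tuple per key
    print(key)   # ('url1', 0.5) then ('url2', 0.3)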

Page 21: Tutorial: High-Level Programming Languages - MapReduce Simplified

Computing Unions

Suppose relations R and S have the same schema
- Map tasks will be assigned chunks from either R or S
- Mappers don't do much, they just pass tuples to the reducers
- Reducers do duplicate elimination

A MapReduce implementation of union
- Map: for each tuple t in R or S, emit a key/value pair (t, t)
- Reduce: for each key t there will be either one or two values; emit (t, t) in either case
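
The same simulation applied to union; only the input changes with respect to projection, the reducer is again plain duplicate elimination (toy data assumed).

# Union of R and S (same schema): identity map, duplicate-eliminating reduce.
from itertools import groupby

R = [("a", 1), ("b", 2)]
S = [("b", 2), ("c", 3)]

pairs = sorted((t, t) for t in R + S)    # map emits (t, t); shuffle groups by key
union = [key for key, _ in groupby(pairs, key=lambda kv: kv[0])]
print(union)   # [('a', 1), ('b', 2), ('c', 3)]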

Page 22: Tutorial: High-Level Programming Languages - MapReduce Simplified

Computing Intersections

Very similar to computing unions
- Suppose relations R and S have the same schema
- The map function is the same (an identity mapper) as for union
- The reduce function must produce a tuple only if both relations have that tuple

A MapReduce implementation of intersection
- Map: for each tuple t in R or S, emit a key/value pair (t, t)
- Reduce: if key t has value list [t, t] then emit the key/value pair (t, t); otherwise, emit the key/value pair (t, NULL)
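
Only the reduce side changes with respect to union: a tuple survives iff its value list has length two, i.e. both relations sent it. A sketch assuming each relation is duplicate-free (set semantics); instead of emitting (t, NULL), non-matching tuples are simply dropped here.

# Intersection: identity map as for union; reduce keeps t only if both R and S sent it.
from itertools import groupby

R = [("a", 1), ("b", 2)]
S = [("b", 2), ("c", 3)]

pairs = sorted((t, t) for t in R + S)
intersection = [key for key, grp in groupby(pairs, key=lambda kv: kv[0])
                if len(list(grp)) == 2]   # value list [t, t] => t is in both relations
print(intersection)   # [('b', 2)]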

Page 23: Tutorial: High-Level Programming Languages - MapReduce Simplified

Computing Difference

Assume we have two relations R and S with the same schema
- The only way a tuple t can appear in the output is if it is in R but not in S
- The map function can pass tuples from R and S to the reducer
- NOTE: it must inform the reducer whether the tuple came from R or S

A MapReduce implementation of difference
- Map: for a tuple t in R emit a key/value pair (t, 'R'); for a tuple t in S, emit a key/value pair (t, 'S')
- Reduce: for each key t, do the following:
  - If the value list is ['R'], then emit (t, t)
  - If the value list is ['R', 'S'], ['S', 'R'], or ['S'], emit the key/value pair (t, NULL)
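
A sketch of R - S along the same lines, with the map tagging each tuple by its relation of origin (toy data assumed); as above, tuples that would be paired with NULL are simply dropped.

# Difference R - S: the map tags each tuple with the name of its relation.
from itertools import groupby

R = [("a", 1), ("b", 2)]
S = [("b", 2), ("c", 3)]

pairs = sorted([(t, "R") for t in R] + [(t, "S") for t in S])
difference = [key for key, grp in groupby(pairs, key=lambda kv: kv[0])
              if [tag for _, tag in grp] == ["R"]]   # keep t only if it came from R alone
print(difference)   # [('a', 1)]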

Page 24: Tutorial: High-Level Programming Languages - MapReduce Simplified

Computing the Natural Join

This topic is subject to continuous refinements
- There are many JOIN operators and many different implementations
- We will see some of them in more detail in the Lab

Let's look at two relations R(A, B) and S(B, C)
- We must find tuples that agree on their B components
- We shall use the B-value of tuples from either relation as the key
- The value will be the other component and the name of the relation
- That way the reducer knows from which relation each tuple is coming

Page 25: Tutorial: High-Level Programming Languages - MapReduce Simplified

Computing the Natural Join

A MapReduce implementation of Natural Join
- Map: for each tuple (a, b) of R emit the key/value pair (b, ('R', a)); for each tuple (b, c) of S emit the key/value pair (b, ('S', c))
- Reduce: each key b will be associated with a list of pairs that are either ('R', a) or ('S', c); emit key/value pairs of the form (b, [(a1, b, c1), (a2, b, c2), ..., (an, b, cn)])

NOTES
- Question: what if the MapReduce framework didn't implement the distributed (and sorted) group by?
- In general, for n tuples in relation R and m tuples in relation S, all with a common B-value, we end up with nm tuples in the result
- If all tuples of both relations have the same B-value, then we're computing the cartesian product
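
A sketch of this reduce-side join, again simulating the shuffle with sort + groupby (toy data assumed). Note how the reducer separates the tagged values back into R-side and S-side lists before combining them; this is where the nm blow-up mentioned in the notes above shows up.

# Reduce-side natural join of R(A, B) and S(B, C) on the common attribute B.
from itertools import groupby

R = [("a1", "b1"), ("a2", "b1"), ("a3", "b2")]
S = [("b1", "c1"), ("b3", "c2")]

pairs = sorted([(b, ("R", a)) for a, b in R] + [(b, ("S", c)) for b, c in S])
for b, grp in groupby(pairs, key=lambda kv: kv[0]):
    vals = [v for _, v in grp]
    r_side = [x for tag, x in vals if tag == "R"]
    s_side = [x for tag, x in vals if tag == "S"]
    for a in r_side:            # n R-tuples and m S-tuples sharing b yield nm tuples
        for c in s_side:
            print((a, b, c))    # ('a1', 'b1', 'c1'), ('a2', 'b1', 'c1')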

Page 26: Tutorial: High-Level Programming Languages - MapReduce Simplified

Grouping and Aggregation in MapReduce

Let R(A, B, C) be a relation to which we apply γ_A,θ(B)(R)
- The map operation prepares the grouping
- The grouping is done by the framework
- The reducer computes the aggregation
- Simplifying assumptions: one grouping attribute and one aggregation function

MapReduce implementation of γ_A,θ(B)(R)
- Map: for each tuple (a, b, c) emit the key/value pair (a, b)
- Reduce: each key a represents a group; apply θ to the list [b1, b2, ..., bn]; emit the key/value pair (a, x), where x = θ([b1, b2, ..., bn])
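
A sketch with theta = SUM; any of the five aggregation functions would slot in the same way (toy data assumed).

# Grouping and aggregation gamma_{A, SUM(B)}(R) over R(A, B, C).
from itertools import groupby

R = [("u1", 3, "x"), ("u2", 1, "y"), ("u1", 4, "z")]

pairs = sorted((a, b) for a, b, c in R)      # map: (a, b, c) -> (a, b); shuffle sorts by a
for a, grp in groupby(pairs, key=lambda kv: kv[0]):
    print((a, sum(b for _, b in grp)))       # reduce with theta = SUM: ('u1', 7), ('u2', 1)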

Page 27: Tutorial: High-Level Programming Languages - MapReduce Simplified

Hadoop PIG


Page 28: Tutorial: High-Level Programming Languages - MapReduce Simplified

Introduction

Collection and analysis of enormous datasets is at the heart of innovation in many organizations
- E.g.: web crawls, search logs, click streams

Manual inspection before batch processing
- Very often engineers look for exploitable trends in their data to drive the design of more sophisticated techniques
- This is difficult to do in practice, given the sheer size of the datasets

The MapReduce model has its own limitations
- One input
- Two-stage, two operators
- Rigid data-flow

Page 29: Tutorial: High-Level Programming Languages - MapReduce Simplified

MapReduce limitations

Very often tricky workarounds are required (2)
- This is very often exemplified by the difficulty in performing JOIN operations

Custom code required even for basic operations
- Projection and Filtering need to be "rewritten" for each job
→ Code is difficult to reuse and maintain
→ Semantics of the analysis task are obscured
→ Optimizations are difficult due to the opacity of Map and Reduce

(2) The term workaround should not only be intended as negative.

Page 30: Tutorial: High-Level Programming Languages - MapReduce Simplified

Use Cases

Rollup aggregates

Compute aggregates against user activity logs, web crawls, etc.
- Example: compute the frequency of search terms aggregated over days, weeks, months
- Example: compute the frequency of search terms aggregated over geographical location, based on IP addresses

Requirements
- Successive aggregations
- Joins followed by aggregations

Pig vs. OLAP systems
- Datasets are too big
- Data curation is too costly

Page 31: Tutorial: High-Level Programming Languages - MapReduce Simplified

Use Cases

Temporal Analysis

Study how search query distributions change over time
- Correlation of search queries from two distinct time periods (groups)
- Custom processing of the queries in each correlation group

Pig supports operators that minimize memory footprint
- Instead, in an RDBMS such operations typically involve JOINs over very large datasets that do not fit in memory and thus become slow

Page 32: Tutorial: High-Level Programming Languages - MapReduce Simplified

Use Cases

Session Analysis

Study sequences of page views and clicks

Examples of typical aggregates
- Average length of user session
- Number of links clicked by a user before leaving a website
- Click pattern variations over time

Pig supports advanced data structures and UDFs

Page 33: Tutorial: High-Level Programming Languages - MapReduce Simplified

Pig Latin

Pig Latin, a high-level programming language developed at Yahoo!
- Combines the best of both declarative and imperative worlds
  - High-level declarative querying in the spirit of SQL
  - Low-level, procedural programming à la MapReduce

Pig Latin features
- Multi-valued, nested data structures instead of flat tables
- Powerful data transformation primitives, including joins

Pig Latin program
- Made up of a series of operations (or transformations)
- Each operation is applied to input data and produces output data
→ A Pig Latin program describes a data flow

Page 34: Tutorial: High-Level Programming Languages - MapReduce Simplified

Example 1

Pig Latin premiere

Assume we have the following table:

urls: (url, category, pagerank)

Where:
- url: the url of a web page
- category: a pre-defined category for the web page
- pagerank: the numerical value of the pagerank associated with the web page

→ Find, for each sufficiently large category, the average pagerank of high-pagerank urls in that category

Page 35: Tutorial: High-Level Programming Languages - MapReduce Simplified

Example 1

SQL

SELECT category, AVG(pagerank)
FROM urls WHERE pagerank > 0.2
GROUP BY category HAVING COUNT(*) > 10^6

Page 36: Tutorial: High-Level Programming Languages - MapReduce Simplified

Example 1

Pig Latin

good_urls = FILTER urls BY pagerank > 0.2;
groups = GROUP good_urls BY category;
big_groups = FILTER groups BY COUNT(good_urls) > 10^6;
output = FOREACH big_groups GENERATE
         category, AVG(good_urls.pagerank);

Page 37: Tutorial: High-Level Programming Languages - MapReduce Simplified

Pig Execution Environment

How do we go from Pig Latin to MapReduce?
- The Pig system is in charge of this
- Complex execution environment that interacts with Hadoop MapReduce
→ The programmer focuses on the data and analysis

Pig Compiler
- Pig Latin operators are translated into MapReduce code
- NOTE: in some cases, hand-written MapReduce code performs better

Pig Optimizer
- Pig Latin data flows undergo an (automatic) optimization phase
- These optimizations are borrowed from the RDBMS community

Page 38: Tutorial: High-Level Programming Languages - MapReduce Simplified

Pig and Pig Latin

Pig is not an RDBMS!
- This means it is not suitable for all data processing tasks

Designed for batch processing
- Of course, since it compiles to MapReduce
- Of course, since data is materialized as files on HDFS

NOT designed for random access
- Query selectivity does not match that of an RDBMS
- Full-scan oriented!

Page 39: Tutorial: High-Level Programming Languages - MapReduce Simplified

Comparison with RDBMS

It may seem that Pig Latin is similar to SQL
- We'll see several examples, operators, etc. that resemble SQL statements

Data-flow vs. declarative programming language
- Data-flow:
  - Step-by-step set of operations
  - Each operation is a single transformation
- Declarative:
  - Set of constraints
  - Applied together to an input to generate output
→ With Pig Latin it's like working at the level of the query planner

Page 40: Tutorial: High-Level Programming Languages - MapReduce Simplified

Comparison with RDBMS

RDBMS store data in tables
- Schemas are predefined and strict
- Tables are flat

Pig and Pig Latin work on more complex data structures
- Schema can be defined at run-time for readability
- Pigs eat anything!
- UDFs and streaming, together with nested data structures, make Pig and Pig Latin more flexible

Page 41: Tutorial: High-Level Programming Languages - MapReduce Simplified

Features and Motivations

Page 42: Tutorial: High-Level Programming Languages - MapReduce Simplified

Features and Motivations

Design goals of Pig and Pig Latin
- Appealing to programmers for performing ad-hoc analysis of data
- A number of features that go beyond those of traditional RDBMS

Next: overview of salient features
- There will be a dedicated set of slides on optimizations later on

Page 43: Tutorial: High-Level Programming Languages - MapReduce Simplified

Dataflow Language

A Pig Latin program specifies a series of steps
- Each step is a single, high-level data transformation
- Stylistically different from SQL

With reference to Example 1
- The programmer supplies the order in which each operation will be done

Consider the following snippet:

spam_urls = FILTER urls BY isSpam(url);
culprit_urls = FILTER spam_urls BY pagerank > 0.8;

Page 44: Tutorial: High-Level Programming Languages - MapReduce Simplified

Dataflow Language

Data flow optimizations
- Explicit sequences of operations can be overridden
- Use of high-level, relational-algebra-style primitives (GROUP, FILTER, ...) allows using traditional RDBMS optimization techniques
→ NOTE: it is necessary to check, by hand, whether such optimizations are beneficial or not

Pig Latin allows Pig to perform optimizations that would otherwise be a tedious manual exercise if done at the MapReduce level

Page 45: Tutorial: High-Level Programming Languages - MapReduce Simplified

Quick Start and Interoperability

Data I/O is greatly simplified in Pig
- No need to curate, bulk import, parse, apply schemas, or create indexes, as traditional RDBMS require
- Standard and ad-hoc "readers" and "writers" facilitate the task of ingesting and producing data in arbitrary formats

Pig can work with a wide range of other tools

Why do RDBMS have stringent requirements?
- To enable transactional consistency guarantees
- To enable efficient point lookups (using physical indexes)
- To enable data curation on behalf of the user
- To enable other users to figure out what the data is, by studying the schema

Page 46: Tutorial: High-Level Programming Languages - MapReduce Simplified

Quick Start and Interoperability

Why is Pig so flexible?
- Supports read-only workloads
- Supports scan-only workloads (no lookups)
→ No need for transactions nor indexes

Why is data curation not required?
- Very often, Pig is used for ad-hoc data analysis
- Work on temporary datasets, then throw them away
→ Curation is overkill

Schemas are optional
- Can apply one on the fly, at runtime
- Can refer to fields using positional notation
- E.g.: good_urls = FILTER urls BY $2 > 0.2

Page 47: Tutorial: High-Level Programming Languages - MapReduce Simplified

Nested Data Model

Easier for "programmers" to think of nested data structures
- E.g.: capture information about positional occurrences of terms in a collection of documents
- Map<documentId, Set<positions>>

Instead, RDBMS allow only flat tables
- Only atomic fields as columns
- Require normalization
- From the example above: need to create two tables
  - term_info: (termId, termString, ...)
  - position_info: (termId, documentId, position)
→ Occurrence information obtained by joining on termId, and grouping on termId, documentId

Page 48: Tutorial: High-Level Programming Languages - MapReduce Simplified

Nested Data Model

Fully nested data model (see also later in the presentation)
- Allows complex, non-atomic data types
- E.g.: set, map, tuple

Advantages of a nested data model
- More natural than normalization
- Data is often already stored in a nested fashion on disk
  - E.g.: a web crawler outputs, for each crawled url, the set of outlinks
  - Separating this in normalized form implies the use of joins, which is overkill for web-scale data
- Nested data allows us to have an algebraic language
  - E.g.: each tuple output by GROUP has one non-atomic field, a nested set of tuples from the same group
- Nested data makes life easy when writing UDFs

Page 49: Tutorial: High-Level Programming Languages - MapReduce Simplified

User Defined Functions

Custom processing is often predominant
- E.g.: users may be interested in performing natural language stemming of a search term, or tagging urls as spam

All commands of Pig Latin can be customized
- Grouping, filtering, joining, per-tuple processing

UDFs support the nested data model
- Input and output can be non-atomic

Page 50: Tutorial: High-Level Programming Languages - MapReduce Simplified

Example 2

Continues from Example 1
- Assume we want to find, for each category, the top 10 urls according to pagerank

groups = GROUP urls BY category;
output = FOREACH groups GENERATE category, top10(urls);

- top10() is a UDF that accepts a set of urls (for each group at a time)
- It outputs a set containing the top 10 urls by pagerank for that group
- The final output contains non-atomic fields

Page 51: Tutorial: High-Level Programming Languages - MapReduce Simplified

User Defined Functions

UDFs can be used in all Pig Latin constructs

Instead, in SQL, there are restrictions
- Only scalar functions can be used in SELECT clauses
- Only set-valued functions can appear in the FROM clause
- Aggregation functions can only be applied to GROUP BY or PARTITION BY

UDFs can be written in Java, Python and JavaScript (3)
- With streaming, we can also use C/C++, Python, ...

(3) As of Pig 0.8.1 and later. We will use version 0.10.0 or later.

Page 52: Tutorial: High-Level Programming Languages - MapReduce Simplified

Handling parallel execution

Pig and Pig Latin are geared towards parallel processing
- Of course, the underlying execution engine is MapReduce

Pig Latin primitives are chosen such that they can be easily parallelized
- Non-equi-joins, correlated sub-queries, ... are not directly supported

Users may specify parallelization parameters at run time
- Question: Can you specify the number of maps?
- Question: Can you specify the number of reducers?

Page 53: Tutorial: High-Level Programming Languages - MapReduce Simplified

Pig Latin

Page 54: Tutorial: High-Level Programming Languages - MapReduce Simplified

Introduction

Not a complete reference to the Pig Latin language: refer to [1]
- Here we cover some interesting aspects

The focus here is on some language primitives
- Optimizations are treated separately
- How they can be implemented is covered later

Examples are taken from [2, 3]

Page 55: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Model

Supports four types
- Atom: contains a simple atomic value such as a string or a number, e.g. 'alice'
- Tuple: sequence of fields, each of which can be of any data type, e.g. ('alice', 'lakers')
- Bag: collection of tuples with possible duplicates. Flexible schema: tuples need not all have the same number and type of fields, e.g.

  { ('alice', 'lakers')
    ('alice', ('iPod', 'apple')) }

  The example shows that tuples can be nested

Page 56: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Model

Supports four types
- Map: collection of data items, where each item has an associated key for lookup. The schema, as with bags, is flexible.
  - NOTE: keys are required to be data atoms, for efficient lookup

  [ 'fan of' → { ('lakers')
                 ('iPod') }
    'age' → 20 ]

  - The key 'fan of' is mapped to a bag containing two tuples
  - The key 'age' is mapped to an atom
- Maps are useful to model datasets in which the schema may be dynamic (over time)

Page 57: Tutorial: High-Level Programming Languages - MapReduce Simplified

Structure

Pig Latin programs are a sequence of steps
- Can use an interactive shell (called grunt)
- Can feed them as a "script"

Comments
- In line: with double hyphens (--)
- C-style for longer comments (/* ... */)

Reserved keywords
- List of keywords that can't be used as identifiers
- Same old story as for any language

Page 58: Tutorial: High-Level Programming Languages - MapReduce Simplified

Statements

As a Pig Latin program is executed, each statement is parsed
- The interpreter builds a logical plan for every relational operation
- The logical plan of each statement is added to that of the program so far
- Then the interpreter moves on to the next statement

IMPORTANT: No data processing takes place during construction of the logical plan
- When the interpreter sees the first line of a program, it confirms that it is syntactically and semantically correct
- Then it adds it to the logical plan
- It does not even check the existence of files, for data load operations

Page 59: Tutorial: High-Level Programming Languages - MapReduce Simplified

Statements

→ It makes no sense to start any processing until the whole flow is defined
- Indeed, there are several optimizations that could make a program more efficient (e.g., by avoiding operating on data that is later going to be filtered)

The triggers for Pig to start execution are the DUMP and STORE statements
- It is only at this point that the logical plan is compiled into a physical plan

How the physical plan is built
- Pig prepares a series of MapReduce jobs
  - In Local mode, these are run locally on the JVM
  - In MapReduce mode, the jobs are sent to the Hadoop cluster
- IMPORTANT: The EXPLAIN command can be used to show the MapReduce plan

Page 60: Tutorial: High-Level Programming Languages - MapReduce Simplified

Statements

Multi-query execution

There is a difference between DUMP and STORE
- Apart from diagnosis and interactive mode, in batch mode STORE allows for program/job optimizations

Main optimization objective: minimize I/O
- Consider the following example:

A = LOAD 'input/pig/multiquery/A';
B = FILTER A BY $1 == 'banana';
C = FILTER A BY $1 != 'banana';
STORE B INTO 'output/b';
STORE C INTO 'output/c';

Page 61: Tutorial: High-Level Programming Languages - MapReduce Simplified

Statements

Multi-query execution

In the example, relations B and C are both derived from A
- Naively, this means that at the first STORE operator the input should be read
- Then, at the second STORE operator, the input should be read again

Pig will run this as a single MapReduce job
- Relation A is going to be read only once
- Then, each of the relations B and C will be written to the output

Page 62: Tutorial: High-Level Programming Languages - MapReduce Simplified

Expressions

An expression is something that is evaluated to yield a value
- Look up [3] for full documentation

Consider the following example tuple (from [2]), whose fields are called f1, f2, f3:

t = ( 'alice', { ('lakers', 1)
                 ('iPod', 2) }, [ 'age' → 20 ] )

Table 1: Expressions in Pig Latin

Expression Type          Example                            Value for t
Constant                 'bob'                              Independent of t
Field by position        $0                                 'alice'
Field by name            f3                                 [ 'age' → 20 ]
Projection               f2.$0                              { ('lakers') ('iPod') }
Map Lookup               f3#'age'                           20
Function Evaluation      SUM(f2.$1)                         1 + 2 = 3
Conditional Expression   f3#'age' > 18 ? 'adult' : 'minor'  'adult'
Flattening               FLATTEN(f2)                        'lakers', 1, 'iPod', 2

Page 63: Tutorial: High-Level Programming Languages - MapReduce Simplified

Schemas

A relation in Pig may have an associated schema
- This is optional
- A schema gives the fields in the relation names and types
- Use the DESCRIBE command to reveal the schema in use for a relation

Schema declaration is flexible, but reuse is awkward
- A set of queries over the same input data will often have the same schema
- This is sometimes hard to maintain (unlike Hive), as there is no external component to maintain this association
- HINT: You can write a UDF to perform a personalized load operation which encapsulates the schema

Page 64: Tutorial: High-Level Programming Languages - MapReduce Simplified

Validation and nulls

Pig does not have the same power to enforce constraints on schemas at load time as an RDBMS
- If a value cannot be cast to the type declared in the schema, it will be set to a null value
- This also happens for corrupt files

A useful technique to partition input data to discern good and bad records
- Use the SPLIT operator:

SPLIT records INTO good_records IF temperature is not null,
                   bad_records IF temperature is null;

Page 65: Tutorial: High-Level Programming Languages - MapReduce Simplified

Other relevant information

Schema merging
- How are schemas propagated to new relations?

Functions
- Look up Piggy Bank on the web

User-Defined Functions
- Use [3] for an introduction to designing UDFs

Page 66: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

Loading and storing data

The first step in a Pig Latin program is to load data
- What the input files are
- How the file contents are to be deserialized
- An input file is assumed to contain a sequence of tuples

Data loading is done with the LOAD command:

queries = LOAD 'query_log.txt'
          USING myLoad()
          AS (userId, queryString, timestamp);

Page 67: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

Loading and storing data

The example above specifies the following:
- The input file is query_log.txt
- The input file should be converted into tuples using the custom myLoad deserializer
- The loaded tuples have three fields, specified by the schema

Optional parts
- The USING clause is optional: if not specified, the input file is assumed to be plain text, tab-delimited
- The AS clause is optional: if not specified, fields must be referred to by position instead of by name

Page 68: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

Loading and storing data

Return value of the LOAD command
- A handle to a bag
- This can be used by subsequent commands
→ bag handles are only logical
→ no file is actually read!

The command to write output to disk is STORE
- It has similar semantics to the LOAD command

Page 69: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

Per-tuple processing: Filtering data

Once you have some data loaded into a relation, the next step is to filter it
- This is done, e.g., to remove unwanted data
- HINT: By filtering early in the processing pipeline, you minimize the amount of data flowing through the system

A basic operation is to apply some processing over every tuple of a data set
- This is achieved with the FOREACH command:

expanded_queries = FOREACH queries GENERATE
                   userId, expandQuery(queryString);

Page 70: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

Per-tuple processing: Filtering data

Comments on the example above:
- Each tuple of the bag queries should be processed independently
- The second field of the output is the result of a UDF

Semantics of the FOREACH command
- There can be no dependence between the processing of different input tuples
→ This allows for an efficient parallel implementation

Semantics of the GENERATE clause
- Followed by a list of expressions
- Flattening is also allowed
  - This is done to eliminate nesting in data
→ Allows output data to be made independent, for further parallel processing
→ Useful to store data on disk

Page 71: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

Per-tuple processing: Discarding unwanted data

A common operation is to retain only a portion of the input data
- This is done with the FILTER command:

real_queries = FILTER queries BY userId neq 'bot';

Filtering conditions involve a combination of expressions
- Comparison operators
- Logical connectors
- UDFs

Page 72: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

Per-tuple processing: Streaming data

The STREAM operator allows transforming data in a relation using an external program or script
- This is possible because Hadoop MapReduce supports "streaming"
- Example:

C = STREAM A THROUGH 'cut -f 2';

which uses the Unix cut command to extract the second field of each tuple in A

The STREAM operator uses PigStorage to serialize and deserialize relations to and from stdin/stdout
- Can also provide a custom serializer/deserializer
- Works well with Python

Page 73: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

Getting related data together

It is often necessary to group together tuples from one or more data sets
- We will explore several nuances of "grouping"

The first grouping operation we study is given by the COGROUP command

Example: Assume we have loaded two relations:

results: (queryString, url, position)
revenue: (queryString, adSlot, amount)

- results contains, for different query strings, the urls shown as search results, and the positions at which they were shown
- revenue contains, for different query strings and different advertisement slots, the average amount of revenue

Page 74: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

Getting related data together

Suppose we want to group together all search result data and revenue data for the same query string:

grouped_data = COGROUP results BY queryString,
               revenue BY queryString;

[Figure 2: COGROUP versus JOIN - shows a tuple of grouped_data, the corresponding JOIN result on queryString, and the output of Example 3; the discussion is covered on the following slides.]

Page 75: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

The COGROUP command

The output of a COGROUP contains one tuple for each group
- The first field (named group) is the group identifier (here, the value of the queryString field)
- Each of the next fields is a bag, one for each input being co-grouped

Grouping can also be performed according to arbitrary expressions, which may include UDFs

Next: why COGROUP when you can use JOIN?

Page 76: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

COGROUP vs JOIN

JOIN vs. COGROUP
- They are equivalent: JOIN = COGROUP followed by a cross product of the tuples in the nested bags

Example 3: Suppose we try to attribute search revenue to search-result urls → compute the monetary worth of each url

grouped_data = COGROUP results BY queryString,
               revenue BY queryString;
url_revenues = FOREACH grouped_data GENERATE
               FLATTEN(distributeRevenue(results, revenue));

- Where distributeRevenue is a UDF that accepts search results and revenue information for one query string at a time, and outputs a bag of urls and the revenue attributed to them

Page 77: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

COGROUP vs JOIN

More details on the UDF distributeRevenue
- Attributes revenue from the top slot entirely to the first search result
- The revenue from the side slot may be split equally among all results

Let's see how to do the same with a JOIN
- JOIN the tables results and revenue by queryString
- GROUP BY queryString
- Apply a custom aggregation function

What happens behind the scenes
- During the join, the system computes the cross product of the search and revenue information
- Then the custom aggregation needs to undo this cross product, because the UDF specifically requires so

Page 78: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

COGROUP in detail

The COGROUP statement conforms to an algebraic language
- The operator carries out only the operation of grouping tuples together into nested bags
- The user can then decide whether to apply a (custom) aggregation on those tuples, or to cross-product them and obtain a join

It is thanks to the nested data model that COGROUP is an independent operation
- Implementation details are tricky
- Groups can be very large (and are redundant)

Page 79: Tutorial: High-Level Programming Languages - MapReduce Simplified

Data Processing Operators

A special case of COGROUP: the GROUP operator

Sometimes we want to operate on a single dataset
- This is when you use the GROUP operator

Let's continue from Example 3:
- Assume we want to find the total revenue for each query string. This writes as:

grouped_revenue = GROUP revenue BY queryString;
query_revenue = FOREACH grouped_revenue GENERATE
                queryString, SUM(revenue.amount) AS totalRevenue;

- Note that revenue.amount refers to a projection of the nested bag in the tuples of grouped_revenue

Page 80: Tutorial: High-Level Programming Languages - MapReduce Simplified

Hadoop PIG PIG LATIN

Data Processing Operators

JOIN in Pig Latin

In many cases, the typical operation on two or more datasets amounts to an equi-join
I IMPORTANT NOTE: large datasets that are suitable to be analyzed with Pig (and MapReduce) are generally not normalized
→ JOINs are used less frequently in Pig Latin than they are in SQL

The syntax of a JOIN:

join_result = JOIN results BY queryString, revenue BY queryString;

I This is a classic inner join (actually an equi-join), where each match between the two relations corresponds to a row in join_result


JOIN in Pig Latin

JOINs lend themselves to optimization opportunities
I We will work on this in the laboratory

Assume we join two datasets, one of which is considerably smaller than the other
I For instance, suppose one dataset fits in memory

Fragment replicate join
I Syntax: append the clause USING 'replicated' to a JOIN statement
I Uses the distributed cache available in Hadoop
I All mappers will have a copy of the small input
→ This is a Map-side join
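A minimal sketch on the running example, assuming revenue is the small relation (in Pig, the relation expected to fit in memory is listed last):

join_result = JOIN results BY queryString, revenue BY queryString USING 'replicated';
-- revenue is shipped to every mapper via the distributed cache;
-- the join happens Map-side, with no shuffle of the large results relation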


MapReduce in Pig Latin

It is trivial to express MapReduce programs in Pig Latin
I This is achieved using GROUP and FOREACH statements
I A map function operates on one input tuple at a time and outputs a bag of key-value pairs
I The reduce function operates on all values for a key at a time to produce the final result

Example

map_result = FOREACH input GENERATE FLATTEN(map(*));
key_groups = GROUP map_result BY $0;
output = FOREACH key_groups GENERATE reduce(*);

I where map() and reduce() are UDFs


Implementation


Introduction

Pig Latin programs are compiled into MapReduce jobs, and executed using Hadoop

How to build a logical plan for a Pig Latin program

How to compile the logical plan into a physical plan of MapReduce jobs

How to avoid resource exhaustion


Building a Logical Plan

As clients issue Pig Latin commands (interactive or batch mode)
I The Pig interpreter parses the commands
I Then it verifies the validity of input files and bags (variables)
F E.g.: if the command is c = COGROUP a BY ..., b BY ...;, it verifies that a and b have already been defined

Pig builds a logical plan for every bag (see the sketch below)
I When a new bag is defined by a command, its logical plan combines the plans of the input bags with that of the current command
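A minimal sketch of this behavior (file names and schemas are hypothetical):

-- each command extends a logical plan; nothing is executed yet
a = LOAD 'input_a' AS (x, y);
b = LOAD 'input_b' AS (x, z);
c = COGROUP a BY x, b BY x;   -- the plan for c combines the plans for a and b
STORE c INTO 'out';           -- only now is the plan compiled and executed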


Building a Logical Plan

No processing is carried out when constructing the logical plans
I Processing is triggered only by STORE or DUMP
I At that point, the logical plan is compiled into a physical plan

Lazy execution model
I Allows in-memory pipelining
I Filter reordering
I Various optimizations from the traditional RDBMS world

Pig is (potentially) platform independent
I Parsing and logical plan construction are platform oblivious
I Only the compiler is specific to Hadoop


Building the Physical Plan

Compilation of a logical plan into a physical plan is “simple”
I MapReduce primitives allow a parallel GROUP BY
F Map assigns keys for grouping
F Reduce processes one group at a time (different groups are processed in parallel)

How the compiler works
I Converts each (CO)GROUP command in the logical plan into a distinct MapReduce job
I The Map function for (CO)GROUP command C initially assigns keys to tuples based on the BY clause(s) of C
I The Reduce function is initially a no-op


Building the Physical Plan

[Figure 3 from [2]: Map-reduce compilation of Pig Latin]

The MapReduce boundary is the COGROUP command (see the sketch below)
I The sequence of FILTER and FOREACH commands from the LOAD to the first COGROUP C1 is pushed into the Map function of C1
I The commands between subsequent COGROUP commands Ci and Ci+1 can be pushed into:
F the Reduce function of Ci
F the Map function of Ci+1
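A sketch of where commands land for a single (CO)GROUP job (file and field names are hypothetical):

raw = LOAD 'data' AS (k, v);
good = FILTER raw BY v > 0;                    -- pushed into the Map function of C1
g = GROUP good BY k;                           -- C1: the Map/Reduce boundary (shuffle)
agg = FOREACH g GENERATE group, SUM(good.v);   -- runs in the Reduce function of C1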


Building the Physical Plan

Pig optimization for the physical plan
I Among the two options outlined above, the first is preferred
I Indeed, grouping is often followed by aggregation
→ This reduces the amount of data to be materialized between jobs

COGROUP command with more than one input dataset
I The Map function appends an extra field to each tuple to identify its dataset
I The Reduce function decodes this information and inserts the tuple into the appropriate nested bag of each group


Building the Physical Plan

How parallelism is achieved
I For LOAD, it is inherited by operating over files in HDFS
I For FILTER and FOREACH, it is automatic thanks to the MapReduce framework
I For (CO)GROUP, it comes from the shuffle phase: Map outputs are repartitioned in parallel to the Reduce instances

A note on the ORDER command (see the sketch below)
I Translated into two MapReduce jobs
I First job: samples the input to determine quantiles of the sort key
I Second job: range-partitions the input according to the quantiles (ensuring roughly equal-sized partitions), followed by local sorting in the reduce phase, yielding a globally sorted file
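For instance, a sketch on the running example:

sorted_revenue = ORDER revenue BY amount DESC;
-- job 1 samples amount to estimate quantiles of the sort key;
-- job 2 range-partitions on those quantiles and sorts each partition locally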

Known overheads due to MapReduce inflexibility
I Data materialization between jobs
I Multiple inputs are not supported well


Efficiency measures

(CO)GROUP commands place tuples of the same group in nested bags
I Bag materialization (I/O) can often be avoided
I This is important also due to memory constraints
I Distributive or algebraic aggregation functions facilitate this task

What is an algebraic function?
I A function that can be structured as a tree of sub-functions
I Each leaf sub-function operates over a subset of the input data
→ If the nodes in the tree achieve data reduction, then the system can reduce materialization
I Examples: COUNT, SUM, MIN, MAX, AVERAGE, ...
I Counter-example: MEDIAN is not algebraic


Efficiency measures

The Pig compiler uses the combiner feature of Hadoop (see the sketch below)
I A special API for algebraic UDFs is available, so that custom user functions can take advantage of this optimization
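A sketch of the effect on the running example: because SUM is algebraic, Pig can evaluate partial sums in the combiner:

grouped_revenue = GROUP revenue BY queryString;
totals = FOREACH grouped_revenue GENERATE group, SUM(revenue.amount);
-- SUM is computed partially Map-side by Hadoop's combiner, so a group's
-- nested bag never needs to be fully materialized on one machine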

There are cases in which (CO)GROUP is inefficient
I This happens with non-algebraic functions
I Nested bags may then need to be spilled to disk
I Pig provides a disk-resident bag implementation
F Features external sort algorithms
F Features duplicate elimination


Debugging


Introduction

The process of creating Pig Latin programs is generally iterative

I The user makes an initial stab at a program
I The program is executed
I The user inspects the output to check correctness
I If the output is incorrect, the user revises the program and repeats the process

This iterative process can be inefficient
I The sheer size of the data volumes hinders this kind of experimentation
→ Need to create a side dataset that is a small sample of the original one

Sampling can be problematic (see the sketch below)
I Example: consider an equi-join on relations A(x,y) and B(x,z) on attribute x
I If there are many distinct values of x, it is highly probable that a small sample of A and B will not contain matching x values
→ Empty result
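A sketch of this failure mode using Pig's SAMPLE operator (file names are hypothetical):

A = LOAD 'input_a' AS (x, y);
B = LOAD 'input_b' AS (x, z);
small_a = SAMPLE A 0.01;              -- keep roughly 1% of A
small_b = SAMPLE B 0.01;              -- keep roughly 1% of B
j = JOIN small_a BY x, small_b BY x;  -- with many distinct x values, j is very likely empty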


Welcome Pig Pen

Pig comes with a debugging environment, Pig Pen
I It creates a side dataset automatically
I This is done in a manner that avoids sampling problems
→ The side dataset must be tailored to the user's program

Sandbox Dataset
I Takes as input a Pig Latin program P
F This is a sequence of n commands
F Each command consumes one or more input bags and produces an output bag
I The output is a set of example bags {B1, B2, ..., Bn}
F Each example bag corresponds to the output of one command in P
I The set of example bags needs to be consistent
F The output of each operator must be what that operator produces on its input example bags


Properties of the Sandbox Dataset

There are three primary objectives in selecting a sandbox dataset
I Realism: the sandbox should be a subset of the actual dataset; if this is not possible, individual values should be ones that occur in the actual dataset
I Conciseness: the example bags should be as small as possible
I Completeness: the example bags should collectively illustrate the key semantics of each command

Overview of the procedure to generate the sandbox
I Take small random samples of the original data
I Synthesize additional data tuples to improve completeness
I When possible, use real data values in synthetic tuples
I Apply a pruning pass to eliminate redundant example tuples and improve conciseness


Optimizations


Introduction

Pig implements several optimizations
I Most of them derive from traditional RDBMS techniques
I Logical vs. physical optimizations

[Figure: Pig's compilation pipeline. A Pig Latin program is parsed into a logical plan; the query plan compiler and the cross-job optimizer produce a physical plan, which the MapReduce compiler turns into a MapReduce program executed on the cluster. The example dataflow applies a FILTER to A(x,y), JOINs the result with B(x,y), and feeds a UDF that produces the output.]


Single-program Optimizations

Logical optimizations: query plan (see the sketch below)
I Early projection
I Early filtering
I Operator rewrites

Physical optimizations: execution plan
I Mapping of logical operations to MapReduce
I Splitting logical operations into multiple physical ones
I Join execution strategies
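A sketch of early filtering and projection on the running example (the position field is assumed from the results schema): pushing both ahead of the JOIN shrinks the data the expensive operator must process:

results_top = FILTER results BY position <= 10;                -- early filtering
results_slim = FOREACH results_top GENERATE queryString, url;  -- early projection
joined = JOIN results_slim BY queryString, revenue BY queryString;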


Cross-program Optimizations

Popular tables
I Web crawls
I Search query logs

Popular transformations
I Eliminate spam
I Group pages by host
I Join web crawl with search log

GOAL: minimize redundant work


Cross-program Optimizations

Concurrent work sharing (see the sketch below)
I Execute related Pig Latin programs together to perform common work only once
I This is difficult to achieve: scheduling, “sharability”

Non-concurrent work sharing
I Re-use I/O or CPU work done by one program later in time
I This is difficult to achieve: caching, replication
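A sketch of what could be shared, with hypothetical inputs and a hypothetical isSpam UDF: run together, the two programs below perform the LOAD and FILTER only once:

pages = LOAD 'crawl' AS (url, host, content);
qlog = LOAD 'search_log' AS (url, queryString);
clean = FILTER pages BY NOT isSpam(content);  -- common work for both programs
by_host = GROUP clean BY host;                -- program 1: group pages by host
joined = JOIN clean BY url, qlog BY url;      -- program 2: join crawl with search log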


Work-Sharing Techniques

[Figure: without work sharing, Job 1 and Job 2 each scan the same input A(x,y) to run OPERATOR 1 and OPERATOR 2 independently.]


Work-Sharing Techniques

[Figure: with work sharing, Job 1 materializes an intermediate result A' of OPERATOR 1 over A(x,y), and Job 2 reuses A' for OPERATOR 2 and OPERATOR 3 instead of rescanning the input.]


Work-Sharing Techniques

[Figure: sharing via replication. A relation A already resident on WORKER 1 is replicated to WORKER 2, so a JOIN involving A, B, C and D can reuse the copy locally.]


References I

[1] Pig wiki. http://wiki.apache.org/pig/.

[2] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proc. of ACM SIGMOD, 2008.

[3] Tom White. Hadoop: The Definitive Guide. O'Reilly Media / Yahoo Press, 2010.
