1/18
Miscellanea The Book Chapter 11 Conclusion References
CS-495/595 Pig
Lecture #6
Dr. Chuck Cartledge
18 Feb. 2015
2/18
Table of contents I
1 Miscellanea
2 The Book
3 Chapter 11
4 Conclusion
5 References
3/18
Corrections and additions since last lecture.
Completed grading Assignment #1.
4/18
Hadoop, The Definitive Guide
Version 3 is specified in the syllabus [2]
Version 4 came out in November 2015
We’ll use Version 3 as much as possible
5/18
Pig provides information/process hiding
The essence of Pig.
Pig provides a level of abstraction for dealing with large data sets.
There are two major parts to the Pig ecosystem:
The language (Pig Latin),
The execution environment
In a previous lecture, we touched on how a JOIN operation could be performed using MapReduce technology. Pig hides all that complexity.
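To make concrete what Pig hides, here is a minimal sketch, in plain Python rather than on Hadoop, of a reduce-side join: the map step tags each record with its source, the shuffle groups records by key, and the reduce step pairs them up. The relations `users` and `orders`, their records, and all function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical input relations (not from the lecture): (key, value) records.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]


def map_phase(records, tag):
    """Emit (join_key, (tag, value)) pairs, as a mapper would."""
    for key, value in records:
        yield key, (tag, value)


def shuffle(pairs):
    """Group mapper output by key, standing in for Hadoop's shuffle/sort."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(groups):
    """For each key, pair every left value with every right value (inner join)."""
    joined = []
    for key in sorted(groups):
        left = [value for tag, value in groups[key] if tag == "left"]
        right = [value for tag, value in groups[key] if tag == "right"]
        joined.extend((key, l, r) for l, r in product(left, right))
    return joined


def reduce_side_join(left_records, right_records):
    """Run the three phases in sequence on in-memory data."""
    pairs = list(map_phase(left_records, "left"))
    pairs += list(map_phase(right_records, "right"))
    return reduce_phase(shuffle(pairs))


result = reduce_side_join(users, orders)
print(result)
```

Keys that appear in only one relation (2 and 3 here) produce no output, which is the inner-join behaviour. In Pig Latin the same data flow is expressed with a single JOIN statement.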
6/18
Pig is not installed on the Hadoop cluster
You will have to download it and install it.
The tar.gz file is about 120 MB
You’ll need to download it, untar it, and test your installation
There are some gotchas:
1 Pig is looking for the environment variable JAVA_HOME to be set
2 The Hadoop cluster runs tcsh instead of Bash by default
3 You have to set JAVA_HOME before Pig will run
Some things are left as an exercise for the student.
Section “Installing and Running Pig” on page 366 gives you information on where to download it from, how to install it, and how to test it.
7/18
Pig runs on top of Hadoop
Pig can run in three different modes:
Script: a file contains Pig Latin commands
Grunt: an interactive mode
Embedded: run from Java using the PigServer class
The Eclipse and NetBeans IDEs are supposed to have Pig plug-ins.
Initially you will “tickle” your Pig installation via Grunt; later we will use scripts.
8/18
Language differences between Pig and RDBMS.
Pig Latin is a data flow programming language
“. . . dataflow programming emphasizes the movement of data and models programs as a series of connections. Explicitly defined inputs and outputs connect operations, which function like black boxes.” [3]
SQL is a declarative programming language
“. . . declarative programming is a programming paradigm, a style of building the structure and elements of computer programs, that expresses the logic of a computation without describing its control flow.” [4]
9/18
Schema differences between Pig and RDBMS.
Pig allows optional schema definition at run time
RDBMSs store data in tables, and schemas are well known in advance
Pig defaults to tab-delimited fields; CSV files are processed via a UDF.
Pig reads data at program start (roughly), whereas RDBMS data is already in tables at start.
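The idea of an optional schema applied when the data is read can be sketched in Python. This is an analogy under stated assumptions, not Pig's implementation: the `load` function, the field names, and the sample rows are hypothetical, and untyped fields are kept as plain strings here.

```python
import io


def load(source, schema=None, delimiter="\t"):
    """Read delimited text; apply a (name, type) schema if one is given.

    Without a schema, each row is a tuple of raw strings. With a schema,
    each row is a dict whose values have been cast to the declared types.
    """
    rows = []
    for line in source:
        fields = line.rstrip("\n").split(delimiter)
        if schema is None:
            rows.append(tuple(fields))
        else:
            rows.append(
                {name: cast(field) for (name, cast), field in zip(schema, fields)}
            )
    return rows


# Hypothetical tab-delimited input: year, temperature, quality code.
data = "1950\t0\t1\n1950\t22\t1\n1949\t111\t1\n"

# No schema: fields are positional and untyped.
untyped = load(io.StringIO(data))

# Schema supplied at read time, not stored with the data.
typed = load(
    io.StringIO(data),
    schema=[("year", str), ("temperature", int), ("quality", int)],
)

print(untyped[0])
print(typed[0])
```

The same file is read both ways; the schema lives in the program, not in the data store, which is the contrast with an RDBMS drawn above.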
10/18
Data differences between Pig and RDBMS.
Pig allows complex, nested data structures
RDBMS tables are much “flatter”
Pig Latin is generally more customizable than most SQL dialects.
11/18
Access time differences between Pig and RDBMS.
Pig does not support random reads and writes to the data (WORM: write once, read many)
RDBMSs support random access (indices, views, etc.)
RDBMSs are good for interactive or low-latency activities.
Pig uses Hadoop and HDFS as its underpinnings and inherits all of their strengths and weaknesses.
12/18
A simple example
LOAD: establishes where the data will be coming from
AS: defines the schema
FILTER, GROUP: similar to SQL
FOREACH: processes each tuple
MAX: one of many functions
DUMP: outputs the data
Nothing happens until a data flow is defined and a trigger event occurs.
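A minimal sketch, in Python rather than Pig Latin, of how the operators listed above compose into one data flow. The maximum-temperature-per-year task and the sample records are assumptions chosen for illustration, and Python runs each step eagerly, unlike Pig.

```python
from collections import defaultdict

# LOAD ... AS: hypothetical (year, temperature, quality) records.
# 9999 is assumed to mark a missing temperature reading.
records = [
    ("1950", 0, 1),
    ("1950", 22, 1),
    ("1950", -11, 1),
    ("1949", 111, 1),
    ("1949", 78, 1),
    ("1949", 9999, 1),
]

# FILTER: drop missing readings and records with an unacceptable quality code.
filtered = [r for r in records if r[1] != 9999 and r[2] in (0, 1)]

# GROUP: collect the temperatures for each year.
grouped = defaultdict(list)
for year, temperature, _quality in filtered:
    grouped[year].append(temperature)

# FOREACH ... MAX: produce one (year, maximum temperature) tuple per group.
max_temp = [(year, max(temps)) for year, temps in sorted(grouped.items())]

# DUMP: write the result to the console.
for row in max_temp:
    print(row)
```

Each comment names the Pig Latin operator that the Python step stands in for.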
13/18
What are trigger events?
Pig Latin is a data flow language, so something has to start the data flow. Different commands act as triggers.
DUMP: a diagnostic statement
STORE: depends on when the statement is encountered
Image from [1].
Pig Latin → Logical Plan → Physical Plan → MapReduce Plan → Execution
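The trigger idea can be illustrated with a small Python sketch of deferred execution: operations are only recorded until a `dump` or `store` call forces them to run. The `Relation` class and its methods are hypothetical and only loosely analogous to Pig; the sketch does not model the logical, physical, or MapReduce plans.

```python
class Relation:
    """Records a chain of operations; nothing runs until dump() or store()."""

    def __init__(self, source, steps=()):
        self._source = source      # callable that returns the input rows
        self._steps = tuple(steps)  # recorded ("filter" | "foreach", function) pairs

    def filter(self, predicate):
        """Record a filter step and return a new, still unevaluated relation."""
        return Relation(self._source, self._steps + (("filter", predicate),))

    def foreach(self, func):
        """Record a per-row transformation without running it."""
        return Relation(self._source, self._steps + (("foreach", func),))

    def _execute(self):
        """Read the source and apply every recorded step in order."""
        rows = list(self._source())
        for kind, fn in self._steps:
            if kind == "filter":
                rows = [row for row in rows if fn(row)]
            else:
                rows = [fn(row) for row in rows]
        return rows

    def dump(self):
        """Trigger: run the data flow and print the rows."""
        rows = self._execute()
        for row in rows:
            print(row)
        return rows

    def store(self, sink):
        """Trigger: run the data flow and append the rows to a list."""
        sink.extend(self._execute())


reads = []  # records each time the source is actually read


def source():
    reads.append("read")
    return [1, 2, 3, 4]


# Defining the data flow reads nothing.
flow = Relation(source).filter(lambda x: x % 2 == 0).foreach(lambda x: x * 10)
reads_before_trigger = len(reads)

# The trigger starts the data flow.
result = flow.dump()
```

Before `dump` is called the source has not been read at all; the read happens once, when the trigger fires.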
14/18
Image from [1].
15/18
Image from [1].
16/18
Image from [1].
17/18
What have we covered?
Covered the “essence” of Pig
Pig runs on top of the Hadoop ecosystem and has all the strengths and limitations thereof
Compared Pig to traditional RDBMS
Pig is a dataflow programming language
Next lecture: Hadoop book, Chapter 12 and return exam
18/18
References I
[1] Prashanth Babu, Introduction to Pig, http://www.slideshare.net/prashanthvvbabu/the-fifthelephant-2012handsonintrotopig, 2013.
[2] Tom White, Hadoop: The Definitive Guide, 3rd edition, O’Reilly Media, Inc., 2012.
[3] Wikipedia, Dataflow programming - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Dataflow_programming, 2014.
[4] Wikipedia, Declarative programming - Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Declarative_programming, 2014.