pig installation and test run
DESCRIPTION
This presentation help to install Hadoop Pig and run your sample program. Also provided the details how to use Pig Editor in ClouderaTRANSCRIPT
Pig Setup and Test run
By Kannan Kalidasan
Pig IntroductionPig is a data flow language ( PigLatin ) to write Hadoop operations without using MapReduce Java code. Pig is a layer of abstraction on top of Hadoop to simplify its use by giving a SQL-like interface to process data on Hadoop.
Help to increase productivity by not writing many lines of Java code.It supports a variety of data types and also support user-defined functions (UDFs) to write custom operations in Java, Python and JavaScript.
I recommended To learn Programming Pig – Allan Gates book.
Author explain the concepts in clear and simple way.
Pig Prompt is GRUNT pig grunts …
$ piggrunt>
Pig session has two modesLocal Mode : Access to a single machine. All files are installed and run using your local host and file system.This mode helps to debug the pig script before we process them in clusters. -x flag is used to specify the mode.pig -x local
MapReduce Mode : Access to a Hadoop cluster and HDFS installation. MapReduce mode is the default mode;To add Hadoop Conf details to Pig Class pathexport PIG_CLASSPATH=$HADOOP_HOME/conf/
both below commands are same and Start the pig session in MapReduce mode.
pig or pig -x mapreduce
Note to Remember ...● Hadoop services should be running to start the pig MapReduce mode and connect to HDFS and
proceed with our work.
● Pig translates the PigLatin scripts into MapReduce Jobs internally and run in hadoop cluster.
● In MapReduce mode, takes file from HDFS only, and stores the results back to HDFS.
Pig Installation1. Download the stable version of tarbal.
http://mirror.nexcess.net/apache/pig/pig-0.12.0/
pig-0.12.0.tar.gz
Release notes link
http://pig.apache.org/releases.html#Download
Pig Installation ...2.Copy the downloaded package to /usr/local
/usr/localkannan@kannandreams:/usr/local$ ls -ltrtotal 119460-rwxr-xr-x 1 root root 63851630 Nov 11 02:11 hadoop-1.2.1.tar.gzdrwxr-xr-x 16 hduser hadoop 4096 Nov 11 23:47 hadoop-rwxrwxrwx 1 root root 58433159 Dec 3 00:55 pig-0.11.1.tar.gzkannan@kannandreams:/usr/local$
Pig Installation ...3. unzip and change the ownersudo tar xzf pig-0.11.1.tar.gzsudo mv pig-0.11.1 pigsudo chown -R hduser:hadoop pigchown command change the owner of the directory pig from root to hadoop user hduser.
4.Login to Hadoop user hduser and set the environment variables.kannan@kannandreams:/usr/local$ su – hduserAdd the below two lines in ~/.bashrc file.
export PIG_HOME=”/usr/local/pig”
export PATH=$PATH:$PIG_HOME/bin
Pig Installation ...5. Source the profile file to reflect the changes
hduser@kannandreams:~$ . .bashrchduser@kannandreams:~$
6.check the pig commandoutput of the command mentioned below is not complete one.
hduser@kannandreams:~$ pig -helpWarning: $HADOOP_HOME is deprecated.Apache Pig version 0.11.1 (r1459641)compiled Mar 22 2013, 02:13:53
Test Run ...7. Create a sample file for processing ( file name as pigcsv )
Extension for the file doesn’t matter . it will understand based on mime type of the file.
sample file – create a file in HDFS directory with the below contents
“2006″;“2007″;“2008″;“2008″;“2008″;“2008″;“2007″;
Test Run ...8. Pig Scripts
Method 1 to run the pig script : Save the pig scripts as <<filename>>.pig ( In my case, it is pig_test.
pig ) and run as $ pig -x mapreduce pig_test.pig OR $ pig pig_test.pig
SampleRecord = LOAD ‘/user/hduser/piginput/pigcsv’USING PigStorage(‘;’) AS (Year:chararray);GroupByYear = GROUP SampleRecord BY Year;CountByYear = FOREACH GroupByYearGENERATE CONCAT((chararray)$0,CONCAT(‘:’,(chararray)COUNT($1)));STORE CountByYearINTO ‘/user/hduser/pigoutput’ USING PigStorage(‘t’);
Test Run ...Method 2 to run the pig script : line ends with ; is considered as one statement
grunt>SampleRecord = LOAD ‘/user/hduser/piginput/pigcsv’>> USING PigStorage(‘;’) AS (Year:chararray);
grunt>GroupByYear = GROUP SampleRecord BY Year;
grunt>CountByYear = FOREACH GroupByYear>>GENERATE CONCAT((chararray)$0,CONCAT(‘:’,(chararray)COUNT($1)));
grunt>STORE CountByYear>>INTO ‘/user/hduser/pigoutput’ USING PigStorage(‘t’);
Test Run ...9. Output :
hduser@kannandreams:/usr/local/hadoop/bin$ hadoop fs -cat /user/hduser/pigoutput/part-r-00000Warning: $HADOOP_HOME is deprecated.
“2006″:1“2007″:2“2008″:4“Year”:1hduser@kannandreams:/usr/local/hadoop/bin$
Script ExplanationLoad the file into a variable by mentioning the delimiter (‘;’) and Header name and its type.Use comma to include more than one column data available in file.By Default , Pig loads files delimited by tab. Need to explicitly mention type of delimiter character.
SampleRecord = LOAD ‘/user/hduser/piginput/pigcsv’USING PigStorage(‘;’) AS (Year:chararray);
Group the variable stored data by year
GroupByYear = GROUP SampleRecord BY Year;
Script Explanation ...Count the records for each group set and generate the output as Key:Value.Its your wish how you want to generate the file output.$0 is the group by criteria and $1 is the output of the countCountByYear = FOREACH GroupByYearGENERATE CONCAT((chararray)$0,CONCAT(‘:’,(chararray)COUNT($1)));
Store the variable in a fileSTORE CountByYearINTO ‘/user/hduser/pigoutput’ USING PigStorage(‘t’);
For Complete Script commands , refer
http://pig.apache.org/docs/r0.10.0/start.html#data-results
Pig in ClouderaPig Editor in Cloudera are explained in my blog.
http://kannandreams.wordpress.com/2013/12/03/pig-editor-in-cloudera/#!
Thank You !!!mail : [email protected]@kannanpoem on twitter
Blog: http://kannandreams.wordpress.com/about/FB Community: www.facebook.com/groups/huge360/
HUGE - Hadoop User Group & Enthusiasts
Huge , Yes Its All about "BIG" DataThis has been created to build a group to get expertise and experts in Hadoop and Big Data .