Pig Installation and Test Run

By Kannan Kalidasan

Pig Introduction

Pig provides a data flow language (Pig Latin) for writing Hadoop operations without writing MapReduce code in Java. It is a layer of abstraction on top of Hadoop that simplifies its use by giving a high-level, SQL-like way to process data on Hadoop.

It helps increase productivity because you do not have to write many lines of Java code. It supports a variety of data types and also supports user-defined functions (UDFs) for writing custom operations in Java, Python and JavaScript.

To learn more, I recommend the book Programming Pig by Alan Gates. The author explains the concepts in a clear and simple way.

The Pig prompt is called Grunt (pigs grunt ...).

$ pig
grunt>

A Pig session has two modes

Local mode: needs access to a single machine only. All files are installed and run using your local host and file system. This mode helps to debug a Pig script before running it on a cluster. The -x flag specifies the mode.

pig -x local

MapReduce mode: needs access to a Hadoop cluster and an HDFS installation. MapReduce mode is the default mode. To add the Hadoop configuration details to the Pig classpath:

export PIG_CLASSPATH=$HADOOP_HOME/conf/

Both of the commands below are equivalent and start a Pig session in MapReduce mode.

pig

pig -x mapreduce

Notes to Remember

● The Hadoop services must be running before you start Pig in MapReduce mode, so that Pig can connect to HDFS and proceed with the work.

● Pig translates Pig Latin scripts into MapReduce jobs internally and runs them on the Hadoop cluster.

● In MapReduce mode, Pig reads input files from HDFS only and stores the results back to HDFS.
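As a small illustration of the two modes, the sketch below builds the pig command line for a chosen mode. The helper name pig_command and the list VALID_MODES are hypothetical, introduced only for this example; the only things taken from the slides are the -x flag, the mode names local and mapreduce, and the script name pig_test.pig used later in the test run.

```python
import shlex

# Hypothetical helper for illustration; only the -x flag and the two
# mode names come from the slides.
VALID_MODES = ("local", "mapreduce")


def pig_command(script=None, mode="mapreduce"):
    """Return the argument list for starting Pig in the given mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown Pig mode: {mode!r}")
    argv = ["pig", "-x", mode]
    if script is not None:
        # A script path makes Pig run in batch mode instead of opening Grunt.
        argv.append(script)
    return argv


if __name__ == "__main__":
    print(shlex.join(pig_command(mode="local")))
    print(shlex.join(pig_command("pig_test.pig")))
```

The returned list could be handed to subprocess.run on a machine where Pig is installed; this sketch only prints the command lines.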

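Pig Installation

1. Download the stable version of the tarball, pig-0.12.0.tar.gz, from:

http://mirror.nexcess.net/apache/pig/pig-0.12.0/

Release notes link:

http://pig.apache.org/releases.html#Download

Note: the listing and commands in the following steps show pig-0.11.1; use the file name of the version you downloaded.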





2. Copy the downloaded package to /usr/local

kannan@kannandreams:/usr/local$ ls -ltr
total 119460
-rwxr-xr-x 1 root root 63851630 Nov 11 02:11 hadoop-1.2.1.tar.gz
drwxr-xr-x 16 hduser hadoop 4096 Nov 11 23:47 hadoop
-rwxrwxrwx 1 root root 58433159 Dec 3 00:55 pig-0.11.1.tar.gz
kannan@kannandreams:/usr/local$

3. Extract the archive and change the owner.

sudo tar xzf pig-0.11.1.tar.gz
sudo mv pig-0.11.1 pig
sudo chown -R hduser:hadoop pig

The chown command changes the owner of the pig directory from root to the Hadoop user hduser.

4. Log in as the Hadoop user hduser and set the environment variables.

kannan@kannandreams:/usr/local$ su - hduser

Add the two lines below to the ~/.bashrc file.

export PIG_HOME="/usr/local/pig"

export PATH=$PATH:$PIG_HOME/bin

5. Source the profile file so that the changes take effect.

hduser@kannandreams:~$ . .bashrc
hduser@kannandreams:~$

6. Check the pig command. The output shown below is truncated.

hduser@kannandreams:~$ pig -help
Warning: $HADOOP_HOME is deprecated.
Apache Pig version 0.11.1 (r1459641)
compiled Mar 22 2013, 02:13:53

Test Run

7. Create a sample file for processing (file name: pigcsv).

The file extension does not matter. Pig parses the file according to the load function named in the LOAD statement (PigStorage in this example), not according to the file name.

Sample file: create a file in an HDFS directory with the contents below, one record per line.

"2006";
"2007";
"2008";
"2008";
"2008";
"2008";
"2007";
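One way to prepare this file is to write it locally first and then copy it into HDFS. The sketch below is only an illustration: the function name write_sample and the temporary directory are assumptions, while the values and the ';' delimiter are taken from the sample above.

```python
import os
import tempfile

# Values copied from the sample file shown above, one record per line.
YEARS = ["2006", "2007", "2008", "2008", "2008", "2008", "2007"]


def write_sample(path, years=YEARS):
    """Write each year as a quoted field followed by the ';' delimiter."""
    with open(path, "w", encoding="utf-8") as handle:
        for year in years:
            handle.write(f'"{year}";\n')
    return path


if __name__ == "__main__":
    # A temporary directory is used here only to keep the sketch harmless.
    target = os.path.join(tempfile.mkdtemp(), "pigcsv")
    write_sample(target)
    with open(target, encoding="utf-8") as handle:
        print(handle.read(), end="")
```

The local file can then be copied to the input directory used by the script with hadoop fs -put, assuming the target HDFS directory already exists.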

8. Pig scripts

Method 1 to run the Pig script: save the script as <<filename>>.pig (in my case, pig_test.pig) and run it as $ pig -x mapreduce pig_test.pig or $ pig pig_test.pig

SampleRecord = LOAD '/user/hduser/piginput/pigcsv'
    USING PigStorage(';') AS (Year:chararray);
GroupByYear = GROUP SampleRecord BY Year;
CountByYear = FOREACH GroupByYear
    GENERATE CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));
STORE CountByYear
    INTO '/user/hduser/pigoutput' USING PigStorage('\t');

Method 2 to run the Pig script: type the statements at the Grunt prompt. Everything up to a line ending with ; is treated as one statement.

grunt> SampleRecord = LOAD '/user/hduser/piginput/pigcsv'
>> USING PigStorage(';') AS (Year:chararray);

grunt> GroupByYear = GROUP SampleRecord BY Year;

grunt> CountByYear = FOREACH GroupByYear
>> GENERATE CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));

grunt> STORE CountByYear
>> INTO '/user/hduser/pigoutput' USING PigStorage('\t');

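9. Output:

hduser@kannandreams:/usr/local/hadoop/bin$ hadoop fs -cat /user/hduser/pigoutput/part-r-00000
Warning: $HADOOP_HOME is deprecated.
"2006":1
"2007":2
"2008":4
"Year":1
hduser@kannandreams:/usr/local/hadoop/bin$

The "Year":1 row suggests that the input file also contained a "Year" header line, which Pig counts like any other record.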

Script Explanation

Load the file into a relation (Pig's name for a variable that holds data), specifying the delimiter (';') and the column name with its type. Separate the fields with commas if the file has more than one column. By default, Pig loads files delimited by tabs, so any other delimiter character must be given explicitly.

SampleRecord = LOAD '/user/hduser/piginput/pigcsv'
    USING PigStorage(';') AS (Year:chararray);

Group the loaded records by year.

GroupByYear = GROUP SampleRecord BY Year;

Count the records in each group and generate the output as Key:Value. How you format the output is up to you. $0 is the group key (the year) and $1 is the bag of records in that group; COUNT($1) returns the number of records in the bag.

CountByYear = FOREACH GroupByYear
    GENERATE CONCAT((chararray)$0, CONCAT(':', (chararray)COUNT($1)));

Store the result in a file.

STORE CountByYear
    INTO '/user/hduser/pigoutput' USING PigStorage('\t');
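To make the load, group and count steps concrete, here is a minimal Python sketch that emulates them on the sample data. It is not how Pig executes the script (Pig compiles it to MapReduce jobs); the function name count_by_year is hypothetical, and the "Year" header line in the sample is an assumption inferred from the "Year":1 row in the output.

```python
from collections import Counter


def count_by_year(lines, delimiter=";"):
    """Emulate LOAD ... AS (Year), GROUP BY Year and COUNT in plain Python."""
    # LOAD: keep the first delimited field of each non-empty line, unchanged.
    # The double quotes stay part of the value.
    years = [line.split(delimiter)[0] for line in lines if line.strip()]
    # GROUP + COUNT: tally how many records share each Year value.
    counts = Counter(years)
    # GENERATE CONCAT(...): render each group as "key:count".
    return [f"{year}:{count}" for year, count in sorted(counts.items())]


if __name__ == "__main__":
    # Assumed input: the sample records plus a "Year" header line.
    sample = ['"Year";', '"2006";', '"2007";', '"2008";',
              '"2008";', '"2008";', '"2008";', '"2007";']
    for row in count_by_year(sample):
        print(row)
```

Because the quotes are kept as part of each value, they also appear in the generated keys, which matches the quoted keys in the Pig output.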

For the complete set of script commands, refer to:

http://pig.apache.org/docs/r0.10.0/start.html#data-results



Pig in Cloudera

The Pig Editor in Cloudera is explained in my blog.

http://kannandreams.wordpress.com/2013/12/03/pig-editor-in-cloudera/#!



Thank You!

Mail: [email protected]

Twitter: @kannanpoem

Blog: http://kannandreams.wordpress.com/about/

FB Community: www.facebook.com/groups/huge360/

HUGE - Hadoop User Group & Enthusiasts

"Huge", yes, it's all about "BIG" data. This group was created to bring together expertise and experts in Hadoop and Big Data.
