pig

Upload: ramakrishna-kapa

Post on 07-Jan-2017

TRANSCRIPT

Page 1: Pig

Apache Pig

Apache Pig is an abstraction over MapReduce. It is a tool/platform used to analyze large data sets by representing them as data flows. Pig is generally used with Hadoop; we can perform all the data manipulation operations in Hadoop using Pig.

Page 2: Pig

Pig was initially developed at Yahoo! to allow people using Hadoop® to focus more on analyzing large data sets and spend less time having to write mapper and reducer programs. Like actual pigs, who eat almost anything, the Pig programming language is designed to handle any kind of data—hence the name!

Page 3: Pig

WHAT IS PIG?

Pig is a high-level scripting language that is used with Apache Hadoop. Pig enables data workers to write complex data transformations without knowing Java. Pig’s simple SQL-like scripting language is called Pig Latin, and it appeals to developers already familiar with scripting languages and SQL.

Page 4: Pig

Pig is made up of two components: the first is the language itself, which is called Pig Latin, and the second is a runtime environment where Pig Latin programs are executed.

This course begins with an overview of Pig. It explains the data structures supported by Pig and how to access data using the LOAD operator. The next lesson covers the Pig relational operators. This is followed by the Pig evaluation functions, as well as math and string functions.
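As a rough illustration of the LOAD operator and a relational operator, the sketch below writes a tiny Pig Latin script and runs it in local mode when a `pig` launcher is available. The file names (student.txt, load_example.pig), the sample rows, and the field names are made up for this example and do not come from the course.

```shell
# Sample input: comma-separated name, age, gpa (illustrative data only).
printf 'alice,20,3.6\nbob,17,3.1\ncarol,22,3.9\n' > student.txt

# Write a small Pig Latin script; the quoted 'EOF' keeps the text literal.
cat > load_example.pig <<'EOF'
-- LOAD reads the file and attaches a schema to each field.
students = LOAD 'student.txt' USING PigStorage(',')
           AS (name:chararray, age:int, gpa:double);
-- DESCRIBE prints the schema of a relation.
DESCRIBE students;
-- FILTER is one of the relational operators.
adults = FILTER students BY age >= 18;
-- DUMP prints the relation to the console.
DUMP adults;
EOF

# Run in local mode only if the pig launcher is installed.
if command -v pig >/dev/null 2>&1; then
    pig -x local load_example.pig
else
    echo "pig launcher not found; script written to load_example.pig"
fi
```

The script is the "language" component, and the `pig` launcher stands in for the runtime that executes it.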

Page 5: Pig

1. Overview

The Pig tutorial shows you how to run two Pig scripts in local mode and mapreduce mode.

• Local Mode: To run the scripts in local mode, no Hadoop or HDFS installation is required. All files are installed and run from your local host and file system.
• Mapreduce Mode: To run the scripts in mapreduce mode, you need access to a Hadoop cluster and an HDFS installation.

The Pig tutorial file (the tutorial/pigtutorial.tar.gz file in the Pig distribution) includes the Pig JAR file (pig.jar) and the tutorial files (tutorial.jar, Pig scripts, log files). These files work with Hadoop 0.20 and provide everything you need to run the Pig scripts.

To get started, follow these basic steps:

1. Install Java.
2. Download the Pig tutorial file and install Pig.
3. Run the Pig scripts, either locally or on a Hadoop cluster.

Page 6: Pig

2. Java Installation

Make sure your run-time environment includes the following:

1. Java 1.6 or higher (preferably from Sun).
2. The JAVA_HOME environment variable, set to the root of your Java installation.

3. Pig Installation

To install Pig, do the following:

1. Download the Pig tutorial file to your local directory.

Page 7: Pig

2. Unzip the Pig tutorial file (the files are stored in a newly created directory, pigtmp).
   $ tar -xzf pigtutorial.tar.gz
3. Move to the pigtmp directory.
4. Review the contents of the Pig tutorial file.
5. Copy the pig.jar file to the appropriate directory on your system. For example: /home/me/pig.
6. Create an environment variable, PIGDIR, and point it to your directory. For example:
   export PIGDIR=/home/me/pig (bash, sh)
   setenv PIGDIR /home/me/pig (tcsh, csh)
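The Java check and the installation steps above could be scripted roughly as follows. This is a minimal sketch for sh/bash: the function names check_java_home and install_pig_tutorial are invented for this example, and it assumes the archive unpacks into ./pigtmp with pig.jar at its top level, as the steps describe.

```shell
# Succeeds only if JAVA_HOME is set and contains an executable bin/java.
check_java_home() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set" >&2
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no java executable under $JAVA_HOME/bin" >&2
        return 1
    fi
    return 0
}

# Unpack the tutorial archive ($1), copy pig.jar into the target
# directory ($2), and export PIGDIR so later commands can find it.
install_pig_tutorial() {
    archive=$1
    target=$2
    tar -xzf "$archive" || return 1          # creates ./pigtmp
    mkdir -p "$target" || return 1
    cp pigtmp/pig.jar "$target/" || return 1
    PIGDIR=$target
    export PIGDIR
}

# Example call (not executed here):
#   check_java_home && install_pig_tutorial pigtutorial.tar.gz /home/me/pig
```

Keeping the check and the install in separate functions makes it easy to rerun only the part that failed.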

Page 8: Pig

4. Running the Pig Scripts in Local Mode

To run the Pig scripts in local mode, do the following:

1. Move to the pigtmp directory.
2. Review Pig Script 1 and Pig Script 2.
3. Execute the following command (using either script1-local.pig or script2-local.pig).
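The command itself did not survive in this transcript. A plausible form, assuming pig.jar sits under $PIGDIR as set up during installation and that `java` is on the PATH, starts Pig's main class with the `-x local` option. The wrapper name run_pig_local is made up for this sketch.

```shell
# Run a tutorial script in local mode using the pig.jar under $PIGDIR.
run_pig_local() {
    script=$1
    if [ -z "${PIGDIR:-}" ]; then
        echo "PIGDIR is not set" >&2
        return 1
    fi
    # -x local selects local execution; no Hadoop cluster is contacted.
    java -cp "$PIGDIR/pig.jar" org.apache.pig.Main -x local "$script"
}

# Example call from the pigtmp directory (not executed here):
#   run_pig_local script1-local.pig
```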

Page 9: Pig

5. Running the Pig Scripts in Mapreduce Mode

To run the Pig scripts in mapreduce mode, do the following:

1. Move to the pigtmp directory.
2. Review Pig Script 1 and Pig Script 2.
3. Copy the excite.log.bz2 file from the pigtmp directory to the HDFS directory.
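The transcript stops at the copy step. The sketch below shows one way the copy and a subsequent run might look. It rests on several assumptions not stated in the slides: that the `hadoop` client is on the PATH, that HADOOP_CONF_DIR points at the cluster's configuration files, and that the cluster variant of the script is named script1-hadoop.pig. The function name run_pig_mapreduce is hypothetical.

```shell
# Stage the input log in HDFS, then run a Pig script on the cluster.
run_pig_mapreduce() {
    script=$1
    # "." is the current user's HDFS home directory.
    hadoop fs -copyFromLocal excite.log.bz2 . || return 1
    # With the Hadoop configuration on the classpath and no -x option,
    # Pig uses its default mapreduce execution mode.
    java -cp "$PIGDIR/pig.jar:$HADOOP_CONF_DIR" org.apache.pig.Main "$script"
}

# Example call from the pigtmp directory (not executed here):
#   run_pig_mapreduce script1-hadoop.pig
```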

Page 10: Pig

Pig is complete, so you can do all required data manipulations in Apache Hadoop with Pig. Through the User Defined Functions (UDF) facility, Pig can invoke code in many languages, such as JRuby, Jython, and Java. You can also embed Pig scripts in other languages. The result is that you can use Pig as a component to build larger and more complex applications that tackle real business problems.

Pig works with data from many sources, including structured and unstructured data, and stores the results in the Hadoop Distributed File System (HDFS).

Pig scripts are translated into a series of MapReduce jobs that are run on the Apache Hadoop cluster.

Page 11: Pig

A good example of a Pig application is the ETL transaction model, which describes how a process extracts data from a source, transforms it according to a rule set, and then loads it into a datastore. Pig can ingest data from files, streams, or other sources using User Defined Functions (UDF). Once it has the data, it can perform select, iteration, and other transforms over it. Again, the UDF feature allows passing the data to more complex algorithms for the transform. Finally, Pig can store the results in the Hadoop Distributed File System (HDFS).
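To make the extract, transform, and load stages concrete, here is a small sketch that writes an illustrative Pig Latin script and runs it in local mode if a `pig` launcher is present. The input file queries.tsv, its three fields, the output directory query_counts, and the script name etl_example.pig are assumptions chosen for this example, not part of the original slides.

```shell
# Sample input: tab-separated user, time, query (illustrative data only).
printf 'u1\t100\tPig Latin\nu2\t101\tpig latin\nu3\t102\thadoop\n' > queries.tsv

cat > etl_example.pig <<'EOF'
-- Extract: read tab-separated records and name the fields.
raw = LOAD 'queries.tsv' USING PigStorage('\t')
      AS (user:chararray, time:long, query:chararray);
-- Transform: drop empty queries, normalise case, count per query.
clean   = FILTER raw BY query IS NOT NULL AND query != '';
lowered = FOREACH clean GENERATE user, LOWER(query) AS query;
grouped = GROUP lowered BY query;
counts  = FOREACH grouped GENERATE group AS query, COUNT(lowered) AS hits;
-- Load: store the result (on HDFS when run in mapreduce mode).
STORE counts INTO 'query_counts' USING PigStorage('\t');
EOF

# Run in local mode only if the pig launcher is installed.
if command -v pig >/dev/null 2>&1; then
    pig -x local etl_example.pig
else
    echo "pig launcher not found; script written to etl_example.pig"
fi
```

Each relation (raw, clean, lowered, grouped, counts) is one step in the data flow, which is why Pig scripts are often read top to bottom like a pipeline.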

Page 12: Pig

Pig scripts are translated into a series of MapReduce jobs or a Tez DAG that run on the Apache Hadoop cluster. As part of the translation, the Pig interpreter performs optimizations to speed up execution on Apache Hadoop. We are going to write a Pig script that will do our data analysis task.
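The execution engine is chosen when the script is launched. The sketch below wraps the `pig` launcher's `-x` option; it assumes a Pig release with Tez support (0.14 or later) and a `pig` launcher on the PATH. The wrapper name run_pig is invented for this example.

```shell
# Run a Pig script with the chosen execution engine.
run_pig() {
    exectype=$1
    script=$2
    case "$exectype" in
        local|mapreduce|tez) ;;   # engines accepted by this sketch
        *)
            echo "unsupported exectype: $exectype" >&2
            return 2
            ;;
    esac
    pig -x "$exectype" "$script"
}

# Example calls (not executed here):
#   run_pig mapreduce analysis.pig   # translated into MapReduce jobs
#   run_pig tez analysis.pig         # translated into a Tez DAG
```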

Page 13: Pig

WHAT IS TEZ?

Tez (Hindi for “speed”) provides a general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop. It generalizes the MapReduce paradigm into a more powerful framework by providing the ability to execute a complex DAG (directed acyclic graph) of tasks for a single job. This lets projects in the Apache Hadoop ecosystem, such as Apache Hive, Apache Pig, and Cascading, meet requirements for human-interactive response times and extreme throughput at petabyte scale (clearly MapReduce has been a key driver in achieving this).