AWS Hadoop and PIG and overview

WORKING WITH PIG: A SQL-like scripting language for Hadoop
CIS 210 – February 2013, Highline Community College


DESCRIPTION

A quick overview of Hadoop, AWS and PIG using the AWS provided PIG script for parsing log files.

TRANSCRIPT

Page 1: AWS Hadoop and PIG and overview

WORKING WITH PIG
A SQL-like scripting language for Hadoop

CIS 210 – February 2013
Highline Community College

Page 2: AWS Hadoop and PIG and overview

What is Hadoop Pig?

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Page 3: AWS Hadoop and PIG and overview

Pig Infrastructure

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.

Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

Extensibility. Users can create their own functions to do special-purpose processing.
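To make these properties concrete, here is a minimal Pig Latin data flow (not from the slides; the file name and schema are hypothetical). Each statement describes one transformation, and Pig turns the sequence into parallel Map-Reduce jobs automatically:

-- hypothetical example: count requests per URL from a tab-delimited log
hits = LOAD 'hits.tsv' USING PigStorage('\t') AS (url:chararray, bytes:int);
grouped = GROUP hits BY url;
counts = FOREACH grouped GENERATE group AS url, COUNT(hits) AS num_hits;
STORE counts INTO 'url_counts';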

Page 4: AWS Hadoop and PIG and overview

Using Pig with AWS

Amazon Web Services provides Hadoop and supports PIG as part of the Hadoop infrastructure of "Elastic MapReduce".

Sample Pig Script: s3://elasticmapreduce/samples/pig-apache/do-reports2.pig

Sample Dataset: s3://elasticmapreduce/samples/pig-apache/input
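Inside that sample script, the dataset and output locations are not hard-coded; they are referenced through the $INPUT and $OUTPUT parameters that Elastic MapReduce substitutes when the job flow is started (a pattern visible in the script excerpts on the following slides). A minimal sketch of the idea, with a hypothetical output sub-path:

-- $INPUT would be bound to s3://elasticmapreduce/samples/pig-apache/input
raw_logs = LOAD '$INPUT' USING TextLoader AS (line:chararray);
-- ...analysis steps go here...
-- $OUTPUT would be bound to an S3 location you own
STORE raw_logs INTO '$OUTPUT/raw_lines';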

Page 5: AWS Hadoop and PIG and overview

Pig has two execution modes

Local Mode - To run Pig in local mode, you need access to a single machine; all files are installed and run using your local host and file system. Specify local mode using the -x flag (pig -x local).

Mapreduce Mode - To run Pig in mapreduce mode, you need access to a Hadoop cluster and HDFS installation. Mapreduce mode is the default mode; you can, but don't need to, specify it using the -x flag (pig OR pig -x mapreduce).

Page 6: AWS Hadoop and PIG and overview

Pig has two operational models

Interactive Mode
You can run Pig in interactive mode using the Grunt shell. Invoke the Grunt shell using the "pig" command and then enter your Pig Latin statements and Pig commands interactively at the command line.

Batch Mode
You can run Pig in batch mode using Pig scripts and the "pig" command (in local or hadoop mode).

Example
The Pig Latin statements in the Pig script (id.pig) extract all user IDs from the /etc/passwd file. First, copy the /etc/passwd file to your local working directory. Next, run the Pig script from the command line (using local or mapreduce mode). The STORE operator will write the results to a file (id.out).
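The id.pig script itself is not reproduced in the slides; a version consistent with the description above (adapted from the standard Apache Pig getting-started example) looks like this:

-- id.pig: /etc/passwd is colon-delimited, and the user ID is the first field
A = LOAD 'passwd' USING PigStorage(':');
B = FOREACH A GENERATE $0 AS id;
-- STORE writes the results to id.out
STORE B INTO 'id.out';

Run it with "pig -x local id.pig" for local mode, or "pig id.pig" for mapreduce mode.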

Page 7: AWS Hadoop and PIG and overview

Amazon MapReduce supports

There are two types of job flows supported with Pig: interactive and batch.

In interactive mode, a customer can start a job flow and run Pig scripts interactively, directly on the master node. Typically, this mode is used for ad hoc data analysis and for application development.

In batch mode, the Pig script is stored in Amazon S3 and is referenced at the start of the job flow. Typically, batch mode is used for repeatable runs such as report generation.

Page 8: AWS Hadoop and PIG and overview

A sample PIG script

--
-- setup piggybank functions
-- register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
DEFINE FORMAT org.apache.pig.piggybank.evaluation.string.FORMAT();
DEFINE REPLACE org.apache.pig.piggybank.evaluation.string.REPLACE();
DEFINE DATE_TIME org.apache.pig.piggybank.evaluation.datetime.DATE_TIME();
DEFINE FORMAT_DT org.apache.pig.piggybank.evaluation.datetime.FORMAT_DT();

Page 9: AWS Hadoop and PIG and overview

A Sample Pig Script

--
-- import logs and break into tuples
--
raw_logs =
  -- load the weblogs into a sequence of one element tuples
  LOAD '$INPUT' USING TextLoader AS (line:chararray);

logs_base =
  -- for each weblog string convert the weblog string into a
  -- structure with named fields
  FOREACH raw_logs
  GENERATE
    FLATTEN (
      EXTRACT(
        line,
        '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
      )
    )
    AS (
      remoteAddr: chararray, remoteLogname: chararray, user: chararray,
      time: chararray, request: chararray, status: int,
      bytes_string: chararray, referrer: chararray, browser: chararray
    );

Page 10: AWS Hadoop and PIG and overview

Language
What is a Tuple?

In mathematics and computer science, a tuple is an ordered list of elements. In set theory, an (ordered) n-tuple is a sequence (or ordered list) of n elements, where n is a non-negative integer. There is only one 0-tuple, an empty sequence. An n-tuple is defined inductively using the construction of an ordered pair. Tuples are usually written by listing the elements within parentheses "( )" and separated by commas; for example, (2, 7, 4, 1, 7) denotes a 5-tuple. Sometimes other delimiters are used, such as square brackets "[ ]" or angle brackets "⟨ ⟩". Braces "{ }" are almost never used for tuples, as they are the standard notation for sets.

Tuples are often used to describe other mathematical objects, such as vectors. In computer science, tuples are directly implemented as product types in most functional programming languages. More commonly, they are implemented as record types, where the components are labeled instead of being identified by position alone. This approach is also used in relational algebra.
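In Pig Latin, every record in a relation is a tuple, and tuples can be nested. A small sketch (the file, schema, and values here are hypothetical):

-- each loaded record becomes a tuple such as (alice,34,2048)
users = LOAD 'users.txt' USING PigStorage(',') AS (name:chararray, age:int, bytes:long);
-- the built-in TOTUPLE function nests selected fields into an inner tuple
pairs = FOREACH users GENERATE TOTUPLE(name, age) AS who;
DUMP pairs;  -- prints records like ((alice,34))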

Page 11: AWS Hadoop and PIG and overview

Language: Regular Expressions

This is a regular expression:

'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'

Regular expressions can be used to parse data out of a file, or to validate data in SQL or other programming languages. We will focus on SQL because PIG is very similar to SQL.
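Before looking at the full pattern, here is a smaller, hypothetical use of the same piggybank EXTRACT function, pulling just the method and path out of the request field defined earlier:

-- e.g. a request value of "GET /index.html HTTP/1.1" yields (GET, /index.html)
requests = FOREACH logs_base GENERATE
    FLATTEN(EXTRACT(request, '^(\\S+) (\\S+)')) AS (method:chararray, path:chararray);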

Page 12: AWS Hadoop and PIG and overview

Use in AWS

This is a little hard to read because of the wrapping. What you should see is that Pig is loading each line into a tuple with just a single element: the line itself. You now need to split the line into fields. To do this, use the EXTRACT Piggybank function, which applies a regular expression to the input and extracts the matched groups as elements of a tuple. The regular expression is a little tricky because the Apache log defines a couple of fields with quotes.

Unfortunately, you can't use this as-is because, in Pig strings, all backslashes must be escaped with a backslash. This makes the regular expression a little bulkier than it would be in other programming languages.

'^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
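As a hypothetical illustration (the values below are made up), one Apache combined-format log line and the fields this expression would extract from it:

-- input line:
--   192.0.2.10 - frank [10/Oct/2012:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://example.com/start" "Mozilla/5.0"
-- extracted fields:
--   remoteAddr    = 192.0.2.10
--   remoteLogname = -
--   user          = frank
--   time          = 10/Oct/2012:13:55:36 -0700
--   request       = GET /index.html HTTP/1.1
--   status        = 200
--   bytes_string  = 2326
--   referrer      = http://example.com/start
--   browser       = Mozilla/5.0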

Page 13: AWS Hadoop and PIG and overview

Use in AWS: final script segment

logs_base =
  -- for each weblog string convert the weblog string into a
  -- structure with named fields
  FOREACH raw_logs
  GENERATE
    FLATTEN (
      EXTRACT(
        line,
        '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"'
      )
    )
    AS (
      remoteAddr: chararray, remoteLogname: chararray, user: chararray,
      time: chararray, request: chararray, status: int,
      bytes_string: chararray, referrer: chararray, browser: chararray
    );

Page 14: AWS Hadoop and PIG and overview

Then format time so we can read it

logs =
  -- convert from string values to typed values such as date_time and integers
  FOREACH logs_base
  GENERATE
    *,
    DATE_TIME(time, 'dd/MMM/yyyy:HH:mm:ss Z', 'UTC') as datetime,
    (int)REPLACE(bytes_string, '-', '0') as bytes;

Page 15: AWS Hadoop and PIG and overview

Then determine number of requests

--
-- determine total number of requests and bytes served by UTC hour of day
-- aggregating as a typical day across the total time of the logs
--
by_hour_count =
  -- group logs by their hour of day, counting the number of logs in that hour
  -- and the sum of the bytes of rows for that hour
  FOREACH (GROUP logs BY FORMAT_DT('HH', datetime))
  GENERATE
    $0,
    COUNT($1) AS num_requests,
    SUM($1.bytes) AS num_bytes;

STORE by_hour_count INTO '$OUTPUT/total_requests_bytes_per_hour';

Page 16: AWS Hadoop and PIG and overview

Then we can do more sorting, like top 50s

--
-- top 50 X.X.X.* blocks
--
by_ip_count =
  -- group weblog entries by the ip address from the remote address field
  -- and count the number of entries for each address as well as
  -- the sum of the bytes
  FOREACH (GROUP logs BY FORMAT('%s.*', EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)')))
  GENERATE
    $0,
    COUNT($1) AS num_requests,
    SUM($1.bytes) AS num_bytes;

by_ip_count_sorted =
  -- order ip by the number of requests they make
  LIMIT (ORDER by_ip_count BY num_requests DESC) 50;

STORE by_ip_count_sorted into '$OUTPUT/top_50_ips';

Page 17: AWS Hadoop and PIG and overview

Top 50 referrers

--
-- top 50 external referrers
--
by_referrer_count =
  -- group by the referrer URL and count the number of requests
  FOREACH (GROUP logs BY EXTRACT(referrer, '(http:\\/\\/[a-z0-9\\.-]+)'))
  GENERATE
    FLATTEN($0),
    COUNT($1) AS num_requests;

by_referrer_count_filtered =
  -- exclude matches for example.org
  FILTER by_referrer_count BY NOT $0 matches '.*example\\.org';

by_referrer_count_sorted =
  -- take the top 50 results
  LIMIT (ORDER by_referrer_count_filtered BY num_requests DESC) 50;

STORE by_referrer_count_sorted INTO '$OUTPUT/top_50_external_referrers';

Page 18: AWS Hadoop and PIG and overview

Even top searches

--
-- top search terms coming from bing or google
--
google_and_bing_urls =
  -- find referrer fields that match either bing or google
  FILTER (FOREACH logs GENERATE referrer)
  BY referrer matches '.*bing.*' OR referrer matches '.*google.*';

search_terms =
  -- extract from each referrer url the search phrases
  FOREACH google_and_bing_urls
  GENERATE FLATTEN(EXTRACT(referrer, '.*[&\\?]q=([^&]+).*')) as (term:chararray);

search_terms_filtered =
  -- reject urls that contained no search terms
  FILTER search_terms BY NOT $0 IS NULL;

search_terms_count =
  -- for each search phrase count the number of weblog entries that contained it
  FOREACH (GROUP search_terms_filtered BY $0)
  GENERATE
    $0,
    COUNT($1) AS num;

search_terms_count_sorted =
  -- take the top 50 results
  LIMIT (ORDER search_terms_count BY num DESC) 50;

STORE search_terms_count_sorted INTO '$OUTPUT/top_50_search_terms_from_bing_google';

Page 19: AWS Hadoop and PIG and overview

Note the use of regular expressions throughout

(GROUP logs BY EXTRACT(referrer, '(http:\\/\\/[a-z0-9\\.-]+)'))

(GROUP logs BY FORMAT('%s.*', EXTRACT(remoteAddr, '(\\d+\\.\\d+\\.\\d+)')))

FLATTEN(EXTRACT(referrer, '.*[&\\?]q=([^&]+).*')) as (term:chararray)

Learning regular expressions will help you with scripting

Page 20: AWS Hadoop and PIG and overview

Quick and Dirty Regular Expressions Cheat Sheet:

https://www.owasp.org/index.php/Input_Validation_Cheat_Sheet

http://www.regular-expressions.info/

Page 21: AWS Hadoop and PIG and overview

Questions?