Hadoop Pig: MapReduce the easy way!

Nathan Bijnens
http://nathan.gs/
@nathan_gs (http://twitter.com/nathan_gs)

We live in a world of data.

● Data analysis becomes more and more important
● Increasing complexity of analysis
● Meanwhile the data we analyze grows big, fast!

Image source: http://www.flickr.com/photos/pallotron/2479541331/ by pallotron

Hadoop: Intro

Hadoop is an open source Java framework aimed at data-intensive distributed applications.

It enables applications to work with thousands of nodes and petabytes of data.

Hadoop was inspired by Google's MapReduce and Google File System.

http://labs.google.com/papers/mapreduce.html

Hadoop: HDFS

HDFS is a distributed, scalable filesystem designed to store large files.

In combination with the Hadoop JobTracker it provides data locality.

By default it replicates every block to 3 DataNodes: preferably two copies on different DataNodes within the same rack and one copy in another rack.
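The replica placement described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not HDFS's actual placement code: the function name `place_replicas`, the rack and DataNode names, and the random selection are all hypothetical, and a real cluster also weighs the writer's location, free space and node health.

```python
import random


def place_replicas(racks, rng=random):
    """Pick 3 DataNodes for one block: two in one rack, one in another.

    `racks` maps a rack name to the list of DataNode names in that rack
    (hypothetical names, for illustration only).
    """
    # Only a rack with at least two nodes can hold two copies of the block.
    candidates = [rack for rack, nodes in racks.items() if len(nodes) >= 2]
    if not candidates:
        raise ValueError("need a rack with at least two DataNodes")
    shared_rack = rng.choice(candidates)
    chosen = [(shared_rack, node) for node in rng.sample(racks[shared_rack], 2)]

    # The third copy goes to a DataNode in any other rack.
    others = [(rack, node)
              for rack, nodes in racks.items() if rack != shared_rack
              for node in nodes]
    if not others:
        raise ValueError("need a DataNode in a second rack")
    chosen.append(rng.choice(others))
    return chosen


cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
replicas = place_replicas(cluster)
print(replicas)
```

Losing a single DataNode or even a whole rack still leaves at least one copy of the block readable, which is the point of spreading the copies this way.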

● NameNode
  ● Keeps track of what is stored where
  ● In memory
  ● Single Point of Failure
● DataNodes

Diagram source: Practical problem solving with Hadoop and Pig by Milind Bhandarkar, http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig

MapReduce

MapReduce works by breaking processing into two phases: a map phase and a reduce phase, each driven by a user-supplied function.

● Input
● Map
● Shuffle
● Reduce
● Output
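The five stages can be imitated in a single Python process. The sketch below is a toy word count, not Hadoop's API: `map_fn`, `reduce_fn` and `run_job` are hypothetical names, and in a real cluster the map and reduce calls run in parallel on many nodes while the framework performs the shuffle.

```python
from collections import defaultdict


def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: combine all values that were collected for one key."""
    return word, sum(counts)


def run_job(lines):
    # Input + Map: apply map_fn to every input record.
    pairs = [pair for line in lines for pair in map_fn(line)]

    # Shuffle: bring all values with the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)

    # Reduce + Output: one reduce_fn call per key, in sorted key order.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


result = run_job(["pig latin", "pig on hadoop"])
print(result)  # {'hadoop': 1, 'latin': 1, 'on': 1, 'pig': 2}
```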

Diagram source: Practical problem solving with Hadoop and Pig by Milind Bhandarkar, http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig

Use Cases: Who & how it's used

MassiveMedia / Netlog
● Cases
  ● Traffic analysis
  ● User actions
  ● ...
● On a 7 node cluster.

Yahoo!
● Cases
  ● Ad Systems
  ● Web Search
  ● ...
● More than 36000 nodes!

Source: http://wiki.apache.org/hadoop/PoweredBy

Use Cases: When not to use

SETI@home
● Highly CPU oriented
● Data locality is unimportant!

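Hadoop Pig: Intro

Pig is a high level data flow language.

Hadoop Pig: 3 components

● Pig Latin: the data flow language itself
● Grunt: the interactive shell
● PigServer: the Java class for running Pig from your own code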

Hadoop Pig

-- PigStorage() splits on tabs by default; ',' assumes a comma-separated file.
data = LOAD 'employee.csv' USING PigStorage(',') AS (
    first_name:chararray, last_name:chararray,
    age:int, wage:float, department:chararray
);

grouped_by_department = GROUP data BY department;

-- After GROUP, fields inside the bag are referenced as data.<field>.
total_wage_by_department = FOREACH grouped_by_department GENERATE
    group AS department,
    COUNT(data) AS employee_count,
    SUM(data.wage) AS total_wage;

-- ORDER sorts ascending unless DESC is given.
total_ordered = ORDER total_wage_by_department BY total_wage;

total_limited = LIMIT total_ordered 10;

DUMP total_limited;
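To make the data flow of the script above concrete, here is a rough single-machine equivalent in Python. It only mirrors the logic of the GROUP, FOREACH, ORDER and LIMIT steps; the sample rows and the function name `total_wage_by_department` are made up for illustration, and Pig would compile the same steps into MapReduce jobs instead.

```python
from collections import defaultdict

# Hypothetical sample rows: (first_name, last_name, age, wage, department)
employees = [
    ("Ann", "Peeters", 34, 3000.0, "sales"),
    ("Bob", "Janssens", 45, 4000.0, "sales"),
    ("Cas", "Maes", 29, 3500.0, "it"),
]


def total_wage_by_department(rows, limit=10):
    # GROUP data BY department
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[4]].append(row)

    # FOREACH ... GENERATE group, COUNT(data), SUM(wage)
    totals = [(department, len(bag), sum(r[3] for r in bag))
              for department, bag in grouped.items()]

    # ORDER ... BY total_wage (ascending), then LIMIT
    totals.sort(key=lambda t: t[2])
    return totals[:limit]


print(total_wage_by_department(employees))
# [('it', 1, 3500.0), ('sales', 2, 7000.0)]
```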

-- PigStorage() splits on tabs by default; ',' assumes comma-separated fields.
books = LOAD 'books.csv.bz2' USING PigStorage(',') AS (
    book_id:int, book_name:chararray, author_name:chararray
);

book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage(',') AS (
    book_id:int, price:float, country:chararray
);

-- Optional filter (Pig uses MATCHES with a regular expression, not LIKE):
-- books = FILTER books BY author_name MATCHES '.*Pamuk.*';

-- JOIN takes BY clauses; PARALLEL sets the number of reducers.
data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;

grouped_by_book = GROUP data BY books::book_name;

total_sales_by_book = FOREACH grouped_by_book GENERATE
    group AS book,
    COUNT(data) AS sales_volume,
    SUM(data.book_sales::price) AS total_sales;

STORE total_sales_by_book INTO 'book_sale_results';
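The join-then-aggregate flow of the script above can be pictured with a small in-memory hash join in Python. This is a sketch of the logic only: the sample tuples and the name `total_sales_by_book` are invented, and Pig's actual join runs as a distributed MapReduce job.

```python
from collections import defaultdict

# Hypothetical sample data matching the two schemas on the slide.
books = [(1, "Snow", "Orhan Pamuk"), (2, "Dune", "Frank Herbert")]
book_sales = [(1, 10.0, "BE"), (1, 12.0, "NL"), (2, 8.0, "BE")]


def total_sales_by_book(books, book_sales):
    # Join on book_id: index the smaller relation, then probe it (inner join).
    names_by_id = defaultdict(list)
    for book_id, book_name, _author in books:
        names_by_id[book_id].append(book_name)

    # Group by book name, keeping a running count and sum per group.
    totals = defaultdict(lambda: [0, 0.0])
    for book_id, price, _country in book_sales:
        for book_name in names_by_id.get(book_id, []):
            totals[book_name][0] += 1
            totals[book_name][1] += price

    # One (book, sales_volume, total_sales) tuple per book.
    return sorted((name, count, total) for name, (count, total) in totals.items())


print(total_sales_by_book(books, book_sales))
# [('Dune', 1, 8.0), ('Snow', 2, 22.0)]
```

Sales rows whose `book_id` has no matching book are dropped, as in an inner join.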

UDF

● Custom Load and Store classes.
  ● HBase
  ● ProtocolBuffers
  ● CombinedLog
● Custom extraction, e.g. date, ...

Take a look at the PiggyBank.
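As an idea of what a custom extraction function does, the sketch below pulls the date out of a combined-log style timestamp. It is plain Python meant to show the logic only: Pig UDFs are typically written in Java, the function name `extract_date` is hypothetical, and how such a function would be registered with Pig is not shown here.

```python
from datetime import datetime


def extract_date(timestamp):
    """Return YYYY-MM-DD from a timestamp like '10/Oct/2000:13:55:36 -0700'.

    Returns None for empty or unparsable input, so a bad record does not
    abort the whole job.
    """
    if not timestamp:
        return None
    try:
        # Ignore the timezone offset and parse the date and time part.
        parsed = datetime.strptime(timestamp.split(" ")[0], "%d/%b/%Y:%H:%M:%S")
    except ValueError:
        return None
    return parsed.strftime("%Y-%m-%d")


print(extract_date("10/Oct/2000:13:55:36 -0700"))  # 2000-10-10
```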

Some alternatives

● Hive
● Streaming
● Native Java MapReduce

Questions?

Thank you for listening!
