Hadoop Pig: MapReduce the easy way!
DESCRIPTION
My presentation about Hadoop and Pig during the FOSDEM Datadevroom 2011.
TRANSCRIPT
![Page 1: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/1.jpg)
Hadoop Pig: MapReduce the easy way.
Nathan Bijnens
http://nathan.gs
@nathan_gs
![Page 2: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/2.jpg)
We live in a world of data.
![Page 3: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/3.jpg)
● Data analysis becomes more and more important
● Increasing complexity of analysis
● Meanwhile the data we analyze grows big, fast!
s: http://www.flickr.com/photos/pallotron/2479541331/ by pallotron
![Page 4: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/4.jpg)
![Page 5: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/5.jpg)
Hadoop is an open source Java framework aimed at data intensive distributed applications.
It enables applications to work with thousands of nodes and petabytes of data.
Hadoop: Intro
![Page 6: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/6.jpg)
Hadoop was inspired by Google's Map Reduce and Google File System.
http://labs.google.com/papers/mapreduce.html
Hadoop: Intro
![Page 7: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/7.jpg)
HDFS is a distributed, scalable filesystem designed to store large files.
In combination with the Hadoop JobTracker it provides data locality.
By default it automatically replicates every block to 3 data nodes; preferably two copies are stored on two data nodes within the same rack and one in another rack.
Hadoop: HDFS
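The placement rule on this slide can be sketched in a few lines of plain Python. This is only an illustration of the policy as described (two copies in one rack, one in another), not HDFS's actual code; the function `place_replicas` and the `nodes_by_rack` mapping are hypothetical names introduced here.

```python
import random


def place_replicas(nodes_by_rack, rng=None):
    """Pick three data nodes for one block: two in one rack, one in another.

    nodes_by_rack is a hypothetical mapping {rack_name: [node_name, ...]}.
    """
    rng = rng or random.Random()
    # The rack that holds two copies needs at least two data nodes.
    big_racks = [r for r, nodes in nodes_by_rack.items() if len(nodes) >= 2]
    if not big_racks:
        raise ValueError("no rack has two data nodes")
    same_rack = rng.choice(big_racks)
    # The third copy goes to a different, non-empty rack.
    other_racks = [r for r, nodes in nodes_by_rack.items()
                   if r != same_rack and nodes]
    if not other_racks:
        raise ValueError("need a second rack for the third copy")
    other_rack = rng.choice(other_racks)
    two_local = rng.sample(nodes_by_rack[same_rack], 2)
    one_remote = rng.choice(nodes_by_rack[other_rack])
    return two_local + [one_remote]


# Made-up cluster layout for illustration.
cluster = {"rack1": ["dn1", "dn2", "dn3"], "rack2": ["dn4", "dn5"]}
print(place_replicas(cluster, random.Random(0)))
```

Losing a single node or a whole rack still leaves at least one copy of the block readable, which is the point of spreading the copies this way.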
![Page 8: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/8.jpg)
● NameNode
  ● Keeps track of what is stored where
  ● In memory
  ● Single Point of Failure
● DataNodes
Hadoop: HDFS
![Page 9: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/9.jpg)
Hadoop: HDFS
s: Practical problem solving with Hadoop and Pig by Milind Bhandarkar http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
![Page 10: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/10.jpg)
MapReduce works by breaking processing into two phases, a map phase and a reduce phase, each defined by a function you supply.
MapReduce
![Page 11: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/11.jpg)
● Input
● Map
● Shuffle
● Reduce
● Output
MapReduce
s: Practical problem solving with Hadoop and Pig by Milind Bhandarkar http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig
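The five stages listed above can be simulated in a single process. The sketch below is a word count in plain Python and does not use the Hadoop API; `map_fn`, `reduce_fn` and `run_mapreduce` are names made up for this illustration, and the input lines are sample data.

```python
from collections import defaultdict


def map_fn(line):
    # Map: emit a (word, 1) pair for every word in one input record.
    for word in line.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    # Reduce: combine all values that share a key.
    return key, sum(values)


def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    mapped = [pair for record in records for pair in map_fn(record)]
    # Shuffle phase: bring all values with the same key together.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: one reduce_fn call per key, in sorted key order.
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]


lines = ["pig eats data", "hadoop stores data"]
for word, count in run_mapreduce(lines, map_fn, reduce_fn):
    print(word, count)
```

On a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data over the network; only the two functions are written by the user.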
![Page 12: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/12.jpg)
MassiveMedia / Netlog
● Cases
  ● Traffic analysis
  ● User actions
  ● ...
● On a 7 node cluster.
Use Cases: Who & how it's used
![Page 13: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/13.jpg)
Yahoo!
● Cases
  ● Ad Systems
  ● Web Search
  ● ...
● More than 36000 nodes!
Use Cases: Who & how it's used
s: http://wiki.apache.org/hadoop/PoweredBy
![Page 14: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/14.jpg)
SETI@home
● Highly CPU oriented
● Data locality is unimportant!
Use Cases: When not to use
![Page 15: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/15.jpg)
![Page 16: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/16.jpg)
Pig is a high level data flow language.
Hadoop Pig: Intro
![Page 17: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/17.jpg)
Pig Latin
Grunt
PigServer
Hadoop Pig: 3 components
![Page 18: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/18.jpg)
-- Assumes comma-separated input; PigStorage() without arguments splits on tabs.
data = LOAD 'employee.csv' USING PigStorage(',') AS (
    first_name:chararray,
    last_name:chararray,
    age:int,
    wage:float,
    department:chararray
);
grouped_by_department = GROUP data BY department;
-- After GROUP, each record holds the key (group) and a bag named data.
total_wage_by_department = FOREACH grouped_by_department GENERATE
    group AS department,
    COUNT(data) AS employee_count,
    SUM(data.wage) AS total_wage;
total_ordered = ORDER total_wage_by_department BY total_wage;
total_limited = LIMIT total_ordered 10;
DUMP total_limited;
Hadoop Pig
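To make the data flow of the script above concrete, here is a rough plain-Python equivalent of its group, aggregate, order and limit steps. It is a sketch for reading along, not how Pig executes the script; the function name `wage_by_department` and the sample rows are invented for this example.

```python
from collections import defaultdict


def wage_by_department(rows, limit=10):
    # GROUP: collect the wages of all employees per department.
    groups = defaultdict(list)
    for first_name, last_name, age, wage, department in rows:
        groups[department].append(wage)
    # FOREACH ... GENERATE: one output row per group with a count and a sum.
    totals = [(dept, len(wages), sum(wages)) for dept, wages in groups.items()]
    # ORDER BY total_wage (ascending, which is Pig's default order).
    totals.sort(key=lambda row: row[2])
    # LIMIT: keep at most `limit` rows.
    return totals[:limit]


# Made-up sample rows in the same column order as the LOAD schema.
employees = [
    ("Ann", "A", 30, 100.0, "sales"),
    ("Bob", "B", 40, 50.0, "it"),
    ("Cid", "C", 25, 25.0, "sales"),
]
print(wage_by_department(employees))
```

Because the order is ascending, the limit keeps the departments with the lowest total wage; sorting in descending order would give the top ten instead.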
![Page 19: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/19.jpg)
-- Assumes comma-separated input; PigStorage() without arguments splits on tabs.
books = LOAD 'books.csv.bz2' USING PigStorage(',') AS (
    book_id:int,
    book_name:chararray,
    author_name:chararray
);
book_sales = LOAD 'book_sales.csv.bz2' USING PigStorage(',') AS (
    book_id:int,
    price:float,
    country:chararray
);
-- books = FILTER books BY (author_name MATCHES '.*Pamuk.*');
data = JOIN books BY book_id, book_sales BY book_id PARALLEL 12;
grouped_by_book = GROUP data BY books::book_name;
total_sales_by_book = FOREACH grouped_by_book GENERATE
    group AS book,
    COUNT(data) AS sales_volume,
    SUM(data.price) AS total_sales;
STORE total_sales_by_book INTO 'book_sale_results';
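The join script above can be read as "match every sale to its book, then total the sales per book name". A minimal plain-Python sketch of that logic follows; it assumes `book_id` is unique in the books data, and the function name `total_sales_by_book` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def total_sales_by_book(books, book_sales):
    # JOIN: index books by book_id (assumed unique) so sales can be matched.
    name_by_id = {book_id: name for book_id, name, author in books}
    volume = defaultdict(int)
    revenue = defaultdict(float)
    for book_id, price, country in book_sales:
        if book_id not in name_by_id:
            continue  # inner join: sales without a matching book are dropped
        # GROUP BY book name, with COUNT and SUM accumulated per group.
        name = name_by_id[book_id]
        volume[name] += 1
        revenue[name] += price
    return {name: (volume[name], revenue[name]) for name in volume}


# Made-up sample rows in the same column order as the two LOAD schemas.
books = [(1, "Snow", "Pamuk"), (2, "Dune", "Herbert")]
sales = [(1, 10.0, "BE"), (1, 12.0, "NL"), (2, 8.0, "BE"), (3, 5.0, "BE")]
print(total_sales_by_book(books, sales))
```

In Pig the `PARALLEL 12` clause asks for 12 reduce tasks for the join; the single-process sketch has no counterpart for it.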
![Page 20: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/20.jpg)
● Custom Load and Store classes
  ● HBase
  ● ProtocolBuffers
  ● CombinedLog
● Custom extraction, e.g. date, ...
Take a look at the PiggyBank.
UDF
![Page 21: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/21.jpg)
● Hive
● Streaming
● Native Java MapReduce
Some alternatives
![Page 22: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/22.jpg)
Questions?
![Page 23: Hadoop Pig: MapReduce the easy way!](https://reader034.vdocuments.mx/reader034/viewer/2022051515/54c650534a7959ad7b8b45de/html5/thumbnails/23.jpg)
Thank you for listening!