MapReduce@DirectI

amkiray: [email protected]: [email protected]

Let's start with an example…

access.log: timestamp, url, response_code, response_time

products.dat: date, product_id, price

Requirement: Number of requests in the last 30 days.

$> ls -rt *.log | tail -30 | xargs wc -l

Requirement: Busiest 30 minutes in the last 30 days.

$> ls -rt *.log | tail -30 | xargs ./count_30min.sh
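The slide does not show count_30min.sh, so the following is only a rough sketch of what such a script might compute. It assumes the first column of access.log is an epoch timestamp in seconds and uses fixed 30-minute buckets rather than a sliding window; `busiest_window` and the sample rows are hypothetical.

```python
import csv
import io
from collections import Counter

WINDOW_SECONDS = 30 * 60  # fixed 30-minute buckets (assumption, not a sliding window)


def busiest_window(lines):
    """Return (window_start, request_count) for the busiest bucket, or None if empty."""
    counts = Counter()
    for row in csv.reader(lines):
        if not row:
            continue
        ts = int(row[0])  # assumed: epoch seconds in the first column
        counts[ts - ts % WINDOW_SECONDS] += 1
    return counts.most_common(1)[0] if counts else None


# Made-up sample rows in the timestamp,url,response_code,response_time layout.
sample = io.StringIO(
    "1000,/buy?pid=1,200,12\n"
    "1200,/view?pid=2,200,8\n"
    "1900,/buy?pid=3,500,30\n"
    "4000,/view?pid=1,200,5\n"
)
print(busiest_window(sample))  # (0, 2)
```

This works on one machine as long as the logs fit through a single process; the later slides are about what happens when they do not.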

Requirement: Number of failed buy requests for products worth more than $30 in the last 30 days.

Import data to an RDBMS;

SELECT COUNT(*)
FROM logs, products
WHERE GET_REQUEST_TYPE(logs.url) = 'BUY'
  AND GET_PRODUCT_ID(logs.url) = products.product_id
  AND products.price > 30
  AND DATE(logs.timestamp) = products.date;

It might take a while!

Now gimme the number of failed buy requests for products worth more than $30 in the last 1 year.

2 days later…

On its way! Inserting data into the database…

5 days later…

$> mysqladmin processlist
+-----------------------------------+
| Query | Copy to Temp Table        |
+-----------------------------------+

Maybe it's joining!! Or maybe it's dead…

Or maybe my replacement will see the result…

Go use Hadoop, bloody!

But why?!

Distributed data processing cluster

A distributed file system

Data-location-sensitive task scheduler

MapReduce paradigm

Handles parallelizable and distributable tasks

Failover capability

Web-based monitoring capabilities

More value for your time!

Where to start??

MapReduce

Q: Sum of all squares of a=[1,2,3,4,3,2,7]

Simple…

fold( map(a, square()), sum())

You can do that in any functional programming language….
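For instance, the same map-then-fold shape can be written in Python. This is just one way to spell it; `square` is a helper defined here for illustration.

```python
from functools import reduce
from operator import add

a = [1, 2, 3, 4, 3, 2, 7]


def square(x):
    return x * x


# map applies square to every element; reduce folds the results with +.
total = reduce(add, map(square, a), 0)
print(total)  # 92
```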

Now do it for an array of 100 million elements…
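Why the approach scales can be sketched in a few lines: because addition is associative, the array can be cut into chunks, each chunk squared and summed independently, and the partial sums combined at the end. The sketch below runs the chunks one after another in a single process purely for illustration; on a cluster each chunk would go to a different machine. The helper names and the chunk count are assumptions.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


def split(data, n_chunks):
    """Cut data into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def map_chunk(chunk):
    """Work done independently per chunk: square and sum locally."""
    return reduce(add, map(square, chunk), 0)


def sum_of_squares(data, n_chunks=3):
    partials = [map_chunk(chunk) for chunk in split(data, n_chunks)]
    return reduce(add, partials, 0)  # combine the partial sums


a = [1, 2, 3, 4, 3, 2, 7]
print(sum_of_squares(a))  # 92, the same answer as the single-pass version
```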

This is where Hadoop comes in…

Distributed file system: HDFS (Name Node + Data Node)

Distributed computation/task processing: Hadoop (Job Tracker + Task Tracker)

How MapReduce Works…

An example: Word Count

hadoop jar contrib/streaming/hadoop-0.*-streaming.jar \
  -jobconf mapred.data.field.separator="," \
  -input 'wc.eg.in' -output 'wc.eg.out' \
  -mapper 'wc -w' \
  -reducer "awk '{ sum+=\$1 } END{ print sum}'"
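What that streaming job computes can be mimicked locally. In this sketch `mapper` plays the role of `wc -w` (one word count per input split) and `reducer` plays the role of the awk script (sum of those counts). Note that it yields a single total word count, not per-word frequencies. The two sample splits are made up.

```python
def mapper(split_text):
    """Like `wc -w`: number of whitespace-separated words in one input split."""
    return len(split_text.split())


def reducer(partial_counts):
    """Like the awk script: add up the per-split counts."""
    total = 0
    for count in partial_counts:
        total += count
    return total


# Two made-up input splits; Hadoop would hand each one to a separate map task.
splits = ["the quick brown fox\njumps over\n", "the lazy dog\n"]
print(reducer(mapper(s) for s in splits))  # 9
```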

Job Tracker: http://cae5.internal.directi.com:50030/jobtracker.jsp

Name Node: http://cae2.internal.directi.com:50070/dfshealth.jsp

Pig and Pig Latin

A procedural language for MapReduce operations.

logs = LOAD 'access.logs' USING PigStorage(',') AS (ts:int, URL:chararray, resp:chararray, resp_time:int);

products = LOAD 'products.dat' USING PigStorage(',') AS (date:int, pid:int, price:int);

l1 = FOREACH logs GENERATE GetDate(ts) as req_date, GetProductID(URL) as prod_id, GetRequestType(URL) as rtype, resp, resp_time;

j1 = JOIN l1 BY prod_id, products by pid;

j2 = FILTER j1 BY req_date==date AND price>30.0F;

j3 = GROUP j2 ALL;

j4 = FOREACH j3 GENERATE COUNT(j2);

DUMP j4;
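To make the data flow of the script concrete, here is a rough single-machine equivalent in Python. The slides do not show the GetDate and GetProductID functions, so `get_date` and `get_product_id` below are hypothetical stand-ins: they assume `ts` is in epoch seconds, `date` is a day number, and URLs look like `/buy?pid=1`. The sketch also assumes one price per product per day. Like the script, it joins on product and date and keeps prices above 30; it does not filter on request type or response code either.

```python
from urllib.parse import parse_qs, urlparse

SECONDS_PER_DAY = 86400


def get_date(ts):
    """Hypothetical stand-in for GetDate: epoch seconds to day number."""
    return ts // SECONDS_PER_DAY


def get_product_id(url):
    """Hypothetical stand-in for GetProductID: read ?pid=<n> from the URL."""
    return int(parse_qs(urlparse(url).query)["pid"][0])


def count_expensive_requests(logs, products, min_price=30):
    # JOIN + date filter: index prices by (product id, date).
    price_by_key = {(pid, date): price for date, pid, price in products}
    count = 0
    for ts, url, resp, resp_time in logs:
        price = price_by_key.get((get_product_id(url), get_date(ts)))
        if price is not None and price > min_price:  # FILTER price > 30
            count += 1  # GROUP ALL + COUNT
    return count


# Made-up sample data in the same column order as the LOAD statements.
logs = [
    (SECONDS_PER_DAY * 10 + 5, "/buy?pid=1", "500", 12),
    (SECONDS_PER_DAY * 10 + 50, "/buy?pid=2", "200", 8),
    (SECONDS_PER_DAY * 11 + 5, "/buy?pid=1", "500", 7),
]
products = [(10, 1, 45), (10, 2, 20), (11, 1, 25)]
print(count_expensive_requests(logs, products))  # 1
```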
