MapReduce@DirectI


Uploaded by directi-group on 26-Jan-2015


DESCRIPTION

The initial, simple MapReduce cluster setup at DirectI. An introduction to MapReduce and Hadoop. A brief intro to Pig is also included.

TRANSCRIPT

Page 1: MapReduce@DirectI

MapReduce@DirectI

amkiray: [email protected]: [email protected]

Page 2: MapReduce@DirectI

Let's start with an example…

access.log: timestamp, url, response_code, response_time
products.dat: date, product_id, price

Page 3: MapReduce@DirectI

Requirement: Number of requests in the last 30 days.

$> ls -rt *.log | tail -30 | xargs wc -l

Page 4: MapReduce@DirectI

Requirement: Busiest 30 minutes in last 30 days.

$> ls -rt *.log | tail -30 | xargs ./count_30min.sh

Page 5: MapReduce@DirectI

Requirement: Number of failed buy requests for products worth more than $30 in the last 30 days.

Import the data into an RDBMS:

SELECT COUNT(*)
FROM logs, products
WHERE GET_REQUEST_TYPE(logs.url) = 'BUY'
  AND GET_PRODUCT_ID(logs.url) = products.product_id
  AND products.price > 30
  AND DATE(logs.timestamp) = products.date;

Page 6: MapReduce@DirectI

It might take a while!

Now gimme the number of failed buy requests for products worth more than $30 in the last 1 year.

Page 7: MapReduce@DirectI

2 days later….

On its way! Inserting data into the database…

Page 8: MapReduce@DirectI

5 days later…

$> mysqladmin processlist
+-----------------------------------+
| Query | Copy to Temp Table        |
+-----------------------------------+

Maybe it's joining!! Or maybe it's dead…
Or maybe my replacement will see the result…

Page 9: MapReduce@DirectI

Go use bloody Hadoop!

[email protected]

Page 10: MapReduce@DirectI

But why?!
Distributed data processing cluster
A distributed file system
Data-location-sensitive task scheduler
MapReduce paradigm
Handles parallelizable and distributable tasks
Failover capability
Web-based monitoring capabilities
More value for your time!

Page 11: MapReduce@DirectI

Where to start??

MapReduce:
Q: Sum of all squares of a = [1, 2, 3, 4, 3, 2, 7]? Simple…

fold( map(a, square()), sum())

You can do that in any functional programming language….
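As a concrete illustration (not from the slides), a minimal Python sketch of the same map-then-fold idea; the squaring lambda and the fold step below just mirror the pseudocode above:

from functools import reduce

a = [1, 2, 3, 4, 3, 2, 7]

# "map" step: square every element independently (this is the part that parallelizes)
squares = map(lambda x: x * x, a)

# "fold"/"reduce" step: combine the mapped values into a single sum
total = reduce(lambda acc, x: acc + x, squares, 0)

print(total)  # 92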

Page 12: MapReduce@DirectI

Now do it for an array of 100 million elements…

Page 13: MapReduce@DirectI

This is where Hadoop comes in…
Distributed File System: HDFS (Name Node + Data Node)
Distributed Computation / Task Processing: Hadoop (Job Tracker + Task Tracker)
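As an illustration only (not from the slides), a minimal Hadoop 0.20-style configuration sketch showing how a client is pointed at these two masters; the hostnames and ports are hypothetical:

core-site.xml (HDFS master, i.e. the Name Node):
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:9000</value>
  </property>
</configuration>

mapred-site.xml (MapReduce master, i.e. the Job Tracker):
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker.example.com:9001</value>
  </property>
</configuration>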

Page 14: MapReduce@DirectI
Page 15: MapReduce@DirectI

Task Tracker

Page 16: MapReduce@DirectI

How MapReduce Works…
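The original slide illustrated this with a diagram. As a rough stand-in (not from the slides), here is a tiny local Python simulation of the three phases, map, shuffle/group-by-key, and reduce, using word count as the running example:

from collections import defaultdict

records = ["the quick brown fox", "the lazy dog", "the end"]

# Map phase: each input record is turned into (key, value) pairs independently
mapped = [(word, 1) for record in records for word in record.split()]

# Shuffle phase: the framework groups all values that share the same key
groups = defaultdict(list)
for word, one in mapped:
    groups[word].append(one)

# Reduce phase: each key's list of values is combined into a final result
counts = {word: sum(ones) for word, ones in groups.items()}
print(counts)  # {'the': 3, 'quick': 1, 'brown': 1, ...}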

Page 17: MapReduce@DirectI

An example: Word Count

hadoop jar contrib/streaming/hadoop-0.*-streaming.jar \
    -jobconf mapred.data.field.separator="," \
    -input 'wc.eg.in' \
    -output 'wc.eg.out' \
    -mapper 'wc -w' \
    -reducer "awk '{ sum += \$1 } END { print sum }'"

Job Tracker: http://cae5.internal.directi.com:50030/jobtracker.jsp
Name Node: http://cae2.internal.directi.com:50070/dfshealth.jsp
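For comparison, the same job could use small Python scripts as the mapper and reducer instead of 'wc -w' and the awk one-liner; the file names mapper.py and reducer.py below are hypothetical, and they would be shipped with the job via the streaming jar's -file option (-mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py):

mapper.py:

#!/usr/bin/env python
import sys

# Emit one number per input line: the word count of that line
for line in sys.stdin:
    print(len(line.split()))

reducer.py:

#!/usr/bin/env python
import sys

# Sum all the per-line counts produced by the mappers
total = 0
for line in sys.stdin:
    line = line.strip()
    if line:
        total += int(line)
print(total)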

Page 18: MapReduce@DirectI

Pig and Pig Latin: a procedural language for MapReduce operations.

logs = LOAD 'access.logs' USING PigStorage(',') AS (ts:int, URL:chararray, resp:chararray, resp_time:int);

products = LOAD 'products.dat' USING PigStorage(',') AS (date:int, pid:int, price:int);

l1 = FOREACH logs GENERATE GetDate(ts) as req_date, GetProductID(URL) as prod_id, GetRequestType(URL) as rtype, resp, resp_time;

j1 = JOIN l1 BY prod_id, products by pid;

j2 = FILTER j1 BY req_date==date AND price>30.0F;

j3 = GROUP j2 ALL;

j4 = FOREACH j3 GENERATE COUNT(j2);

DUMP j4;

Page 19: MapReduce@DirectI

Q&A