mapreduce@directi
DESCRIPTION
The initial simple MapReduce cluster setup at DirectI. An introduction to MapReduce and Hadoop. A brief intro to Pig is also included.
TRANSCRIPT
MapReduce@DirectI
amkiray: [email protected]: [email protected]
Let's start with an example…
access.log: timestamp, url, response_code, response_time
products.dat: date, product_id, price
Requirement: Number of requests in the last 30 days.
$> ls -rt *.log | tail -30 | xargs wc -l
Requirement: Busiest 30 minutes in the last 30 days.
$> ls -rt *.log | tail -30 | xargs ./count_30min.sh
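The slides do not show count_30min.sh, so the following is only a minimal Python sketch of what such a script might compute. It assumes the first column of access.log is a timestamp in epoch seconds (the Pig schema later in the deck declares ts as an int) and it uses fixed half-hour buckets rather than a sliding window. The function name and the sample lines are hypothetical.

```python
from collections import Counter

WINDOW_SECONDS = 30 * 60  # fixed half-hour buckets (assumption: not a sliding window)


def busiest_window(lines):
    """Return (window_start, request_count) for the busiest 30-minute bucket.

    Each line is assumed to look like: timestamp,url,response_code,response_time
    with the timestamp in epoch seconds. Returns None for empty input.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        ts = int(line.split(",", 1)[0])
        counts[ts - ts % WINDOW_SECONDS] += 1
    if not counts:
        return None
    # Highest count wins; ties go to the earliest window.
    return max(counts.items(), key=lambda kv: (kv[1], -kv[0]))


# Hypothetical sample data, for illustration only.
sample = [
    "1000,/buy?pid=1,200,12",
    "1700,/view?pid=2,200,8",
    "1900,/buy?pid=2,500,30",
    "3500,/buy?pid=1,200,15",
    "3599,/view?pid=1,200,9",
]
print(busiest_window(sample))  # (1800, 3)
```

This works on one machine as long as the logs fit through a single process; the year-long version of the question is where that stops being comfortable.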
Requirement: Number of failed buy requests for products
worth more than $30 in the last 30 days.
Import the data into an RDBMS;
SELECT COUNT(*)
FROM logs, products
WHERE GET_REQUEST_TYPE(logs.url) = 'BUY'
  AND GET_PRODUCT_ID(logs.url) = products.product_id
  AND products.price > 30
  AND DATE(logs.timestamp) = products.date;
It might take a while!
Now gimme the number of failed buy requests for products worth more than
$30 in the last 1 year.
2 days later….
On its way! Inserting data
into the database…
5 days later…
$> mysqladmin processlist
+-----------------------------------+
| Query | Copy to Temp Table |
+-----------------------------------+
Maybe it's joining!! Or maybe it's dead…
Or maybe my replacement will see the result…
But why?!
Distributed data processing cluster
A distributed file system
Data-location-sensitive task scheduler
MapReduce paradigm
Handles parallelizable and distributable tasks
Failover capability
Web-based monitoring capabilities
More value for your time!
Where to start??
MapReduce
Q: Sum of all squares of a = [1, 2, 3, 4, 3, 2, 7]
Simple…
fold(map(a, square()), sum())
You can do that in any functional programming language….
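As one concrete rendering of the fold/map pseudocode, here is a small Python sketch. Python's functools.reduce plays the role of fold; the helper name square is taken from the slide, everything else is illustrative.

```python
from functools import reduce
from operator import add

a = [1, 2, 3, 4, 3, 2, 7]


def square(x):
    """The 'map' function: applied to every element independently."""
    return x * x


squares = map(square, a)         # map step: 1, 4, 9, 16, 9, 4, 49
total = reduce(add, squares, 0)  # fold step: combine everything into one value
print(total)  # 92
```

The map step touches each element on its own, and the fold step only needs an associative combine, which is what makes the pattern easy to spread over many machines.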
Now do it for an array of 100
million elements…….
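A rough sketch of how the same computation can be split up: cut the array into chunks, compute a partial sum of squares per chunk, then combine the partial results. Here the chunks are processed one after another in a single process; on a cluster each chunk could be handled by a different machine. The function names and the chunk size are illustrative assumptions, not part of Hadoop.

```python
from functools import reduce
from operator import add


def chunks(seq, size):
    """Yield consecutive slices of seq with at most `size` elements each."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def map_task(chunk):
    """Work done independently per chunk: a partial sum of squares."""
    return sum(x * x for x in chunk)


def reduce_task(partials):
    """Combine the partial results into the final answer."""
    return reduce(add, partials, 0)


def sum_of_squares(seq, chunk_size):
    partials = [map_task(c) for c in chunks(seq, chunk_size)]
    return reduce_task(partials)


print(sum_of_squares([1, 2, 3, 4, 3, 2, 7], chunk_size=3))  # 92
```

Because addition is associative, the result does not depend on how the input is chunked, only the amount of work per task changes.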
This is where Hadoop comes in…
Distributed file system: HDFS
Distributed computation/task processing: Hadoop
Name Node + Data Node
Task Tracker + Job Tracker
Task Tracker
How MapReduce Works…
An example: Word Count
hadoop jar contrib/streaming/hadoop-0.*-streaming.jar \
  -jobconf mapred.data.field.separator="," \
  -input 'wc.eg.in' -output 'wc.eg.out' \
  -mapper 'wc -w' \
  -reducer "awk '{ sum+=\$1 } END { print sum }'"
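To make the data flow of this streaming job easier to follow, here is a small Python simulation of it. It is a sketch, not how Hadoop runs the job: the mapper function stands in for `wc -w` running on one input split, the reducer function stands in for the awk script, and the three splits are made-up sample text. It also assumes a single reducer; with several reducers each one would print its own partial sum.

```python
def mapper(split_text):
    """Stand-in for `wc -w`: count whitespace-separated words in one split."""
    return len(split_text.split())


def reducer(partial_counts):
    """Stand-in for the awk script: add up every number the mappers emitted."""
    return sum(partial_counts)


# Hypothetical input splits; on a cluster each would go to a different map task.
splits = [
    "the quick brown fox",
    "jumps over",
    "the lazy dog",
]

total_words = reducer(mapper(s) for s in splits)
print(total_words)  # 9
```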
Job Tracker: http://cae5.internal.directi.com:50030/jobtracker.jsp
Name Node: http://cae2.internal.directi.com:50070/dfshealth.jsp
Pig and Pig Latin
A procedural language for MapReduce operations.
logs = LOAD 'access.logs' USING PigStorage(',') AS
(ts:int, URL:chararray, resp:chararray, resp_time:int);
products = LOAD 'products.dat' USING PigStorage(',') AS (date:int, pid:int, price:int);
l1 = FOREACH logs GENERATE
    GetDate(ts) AS req_date,
    GetProductID(URL) AS prod_id,
    GetRequestType(URL) AS rtype,
    resp, resp_time;
j1 = JOIN l1 BY prod_id, products BY pid;
j2 = FILTER j1 BY req_date==date AND price>30.0F;
j3 = GROUP j2 ALL;
j4 = FOREACH j3 GENERATE COUNT(j2);
DUMP j4;
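For readers new to Pig, the following Python sketch walks through the same steps (project, join, filter, count) on in-memory rows. It is only an illustration under several assumptions: the helpers get_date and get_product_id are hypothetical stand-ins for the GetDate and GetProductID UDFs, whose real behavior is not shown in the slides; URLs are assumed to carry the product as a pid query parameter; and dates are assumed to be whole days since the epoch. Like the script as shown, it does not filter on request type or response code.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlparse

SECONDS_PER_DAY = 86400


def get_date(ts):
    """Hypothetical GetDate: epoch seconds -> day number since the epoch."""
    return ts // SECONDS_PER_DAY


def get_product_id(url):
    """Hypothetical GetProductID: read the assumed `pid` query parameter."""
    return int(parse_qs(urlparse(url).query)["pid"][0])


def count_matching(logs, products, min_price=30):
    # Index products by pid so the join does not rescan the whole list.
    rows_by_pid = defaultdict(list)
    for date, pid, price in products:
        rows_by_pid[pid].append((date, price))

    count = 0
    for ts, url, resp, resp_time in logs:
        req_date = get_date(ts)
        # JOIN l1 BY prod_id, products BY pid
        for date, price in rows_by_pid.get(get_product_id(url), []):
            # FILTER j1 BY req_date == date AND price > 30
            if req_date == date and price > min_price:
                count += 1  # GROUP ALL + COUNT
    return count


# Hypothetical sample rows: products are (date, pid, price).
products = [(0, 1, 50), (0, 2, 20), (1, 1, 25)]
logs = [
    (100, "/buy?pid=1", "500", 12),
    (200, "/buy?pid=2", "200", 8),
    (86500, "/buy?pid=1", "500", 30),
    (300, "/view?pid=1", "200", 5),
]
print(count_matching(logs, products))  # 2
```

The point of Pig is that the script above it describes these steps once, and the runtime turns them into MapReduce jobs over data that does not fit on one machine.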
Q&A