Hadoop and Big Data - My Presentation to a Select Audience
DESCRIPTION
My presentation on Hadoop and big data.
TRANSCRIPT
HADOOP & BIG DATA
Presented by Chandra Sekhar
What is Hadoop?
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of commodity computers using a simple programming model.
PRESENTATION FLOW
1. How Hadoop STORES data
2. How Hadoop PROCESSES data
3. Architecture of Hadoop
4. ROI
5. Resources
CHALLENGES AS OPPORTUNITIES:
● Of all the people who sailed between 1997 and 2005, should I target those who purchased the alcohol package or the spa package?
● Based on the onboard spending of adult men from New York who have ever sailed with us, who can be targeted to sail on Azamara?
● Which first-time guest will be a high roller?
COST SAVINGS:
● On a sailing, who and how many will have genuine complaints vs. whining?
● Which propulsion unit will break next?
PRODUCTIVITY:
● Which employee will quit next?
We have answers to most of these questions somewhere in our warehouses.
What is so great about Hadoop?
● Why all this buzz?
● Is it hype?
● Is it another dot-com?
● How much can Hadoop handle?
The next slide is a good example.
At Yahoo in 2008
Hadoop is ideal for:
● Write-once, read-many-times operations
● No edits, no updates
● Movie files, music files, flight data recorders, logs, and XML files are all fine (DB records as well)
HOW HADOOP STORES DATA
● Hadoop uses blocks to store files.
● The default block size is 64 MB.
● Every block is replicated three times.
● A 100 MB file takes up 2 blocks (with a replication factor of 3, that is 6 blocks).
● A 1 GB file? Not a problem: 16 blocks, or 48 with replication.
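The block arithmetic above can be sketched as a quick calculation (a toy helper for this deck, not part of Hadoop itself):

```python
import math

def hdfs_blocks(file_size_mb, block_size_mb=64, replication=3):
    """Number of HDFS blocks a file occupies, including replicas."""
    logical = math.ceil(file_size_mb / block_size_mb)  # blocks before replication
    return logical * replication

print(hdfs_blocks(100))   # 2 blocks * 3 replicas = 6
print(hdfs_blocks(1024))  # 16 blocks * 3 replicas = 48
```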
OLD VS NEW
● You can set the replication factor of older files to 2, and newer files to 3 or even 4.
● You can compress the files.
More on blocks
Because the unit of storage is the block, it does not really matter how many files there are, or how big they are.
But...
Hadoop prefers large files over many small files. Why?
Why large files?
When a block is created, the block's location is stored in the NameNode's memory for faster retrieval.
This is not mandated, but it is efficient to keep the number of entries small. Usually multiple files are merged into a single file (e.g., all Assignment Manager logs for a day into one large file).
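A common rule of thumb (a folk approximation, not an official figure) is roughly 150 bytes of NameNode heap per file entry and per block entry. A toy estimate of why merging small files helps:

```python
import math

def namenode_bytes(num_files, file_size_mb, block_mb=64, bytes_per_object=150):
    """Very rough NameNode heap estimate: one ~150-byte entry per file
    plus one per block (150 bytes is a rule of thumb, not a spec)."""
    blocks = num_files * math.ceil(file_size_mb / block_mb)
    return (num_files + blocks) * bytes_per_object

# 10,000 one-MB log files vs. the same data merged into one file:
print(namenode_bytes(10_000, 1))   # 10,000 files + 10,000 blocks -> 3,000,000 bytes
print(namenode_bytes(1, 10_000))   # 1 file + 157 blocks -> 23,700 bytes
```

Same data, two orders of magnitude less NameNode memory: that is why the deck recommends merging a day's logs into one large file.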
Data loss is extremely rare. Here is why: every block is replicated three times across different nodes, so losing data requires multiple simultaneous failures.
HOW HADOOP PROCESSES DATA
MAP REDUCE
MAP REDUCE
Map function:
● Reads the data
● Usually does the preprocessing
● Hands over the records to the Reduce function for further processing (e.g., eliminate all records where the age is less than 18)
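The age filter in that example can be sketched as a streaming-style mapper. This is a sketch under assumed input layout (tab-separated records with age in the second field); the real record format is not given in the slides:

```python
def map_filter(lines):
    """Streaming-style mapper: keep records where age (assumed to be the
    second tab-separated field) is 18 or over."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            age = int(fields[1])
        except (IndexError, ValueError):
            continue  # malformed record: skip rather than crash the job
        if age >= 18:
            yield line.rstrip("\n")

sample = ["alice\t17\n", "bob\t30\n"]
print(list(map_filter(sample)))  # only bob survives the filter
```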
More about processing
● A single huge file (e.g., 1 GB) can be processed by several mappers (usually one block = one mapper, so about 16 map tasks).
● If the logic is simple, you can disable the reduce function and let the map tasks do all the processing.
● A MapReduce job can pick up a web log from our website, join it to a Siebel table, and write the output to a TIBCO queue bound for AS400 (or to MongoDB directly).
Hadoop Eco-System
MapReduce Flow
KEY-VALUE PAIRS: Hello World Example
File content: The mouse runs faster than the Cat
Map function output
Map job output, as (K1, V1) pairs:
(The, 1) (mouse, 1) (runs, 1) (faster, 1) (than, 1) (the, 1) (cat, 1)
Reducer Function
Reducer job output, as (K1, V1) pairs:
(The, 2) (mouse, 1) (runs, 1) (faster, 1) (than, 1) (cat, 1)
('The' and 'the' are counted as the same word.)
Hadoop Programming Languages
Java, any scripting language, Hive, Pig, etc.
Sample code in Java
(Java code shown on the slide image; not reproduced in this transcript.)
Same Code in Python
![Page 27: Hadoop And Big Data - My Presentation To Selective Audience](https://reader035.vdocuments.mx/reader035/viewer/2022062616/54929876ac7959182e8b4667/html5/thumbnails/27.jpg)
Same Code in PIG
A = load '/home/cloudera/wordcountproblem' using TextLoader as (data:chararray);
B = foreach A generate FLATTEN(TOKENIZE(data)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into '/home/cloudera/Chandra7' using PigStorage(',');
Same Code in HIVE
SELECT word, COUNT(*) FROM input LATERAL VIEW explode(split(text, ' ')) lTable AS word GROUP BY word;
More on data processing
● Map function output is always sorted by key.
● Map output is intermediate data, so it is not saved in HDFS; it lives only on the local node's disk and is deleted after the reducer finishes.
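That sorting guarantee is what keeps reducing cheap: once pairs are sorted by key, a reducer can stream through them group by group, never holding more than one key's values in memory. A toy illustration with hypothetical pairs:

```python
from itertools import groupby

# hypothetical sorted map output delivered to one reducer (the key
# ordering is guaranteed by the framework's sort/shuffle phase)
sorted_pairs = [("cat", 1), ("the", 1), ("the", 1)]

# because input is sorted, groupby sees each key exactly once, so the
# reducer streams through with constant memory per key
for word, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
    print(word, sum(v for _, v in group))
# cat 1
# the 2
```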
ARCHITECTURE
(Slides 31-34: architecture diagrams, not reproduced in this transcript.)
ROI
One study: storing and processing 1 TB
Traditional RDBMS: $37,000 / year
Data appliance: $5,000 / year
Hadoop cluster: $2,000 / year
Source: HBR, Big Data @ Work, page 60
Wikibon study: BREAK-EVEN TIMEFRAME
Big data approach: 4 months
Traditional DW appliance approach: 26 months
Resources
YouTube: Stanford University talk by Amr Awadallah
'Must read' to get certified: http://www.amazon.com/review/R3BSEBI4I4SNUL
THANK YOU