![Page 1: Hadoop - University of Wisconsin–Madisonpages.cs.wisc.edu/~akella/CS838/F12/notes/Hadoop-Yizheng.pdf · 2012-09-21 · Hadoop • What is Apache Hadoop – A framework (open‐source](https://reader034.vdocuments.mx/reader034/viewer/2022042104/5e81e3ed982e9d711538b606/html5/thumbnails/1.jpg)
Hadoop
Yizheng (Ethan) Chen
Advisor: Prof. Aditya Akella
Outline
Hadoop
Yarn (NextGen Hadoop)
Hadoop
• What is Apache Hadoop
– A framework (open-source software) for reliable, scalable, distributed computing
• Hadoop MapReduce
– A system for parallel processing of large data sets
• Hadoop Distributed File System (HDFS™)
– A distributed file system that provides high-throughput access to application data
– Similar to GFS
– http://hadoop.apache.org/
• Why Hadoop?
Hadoop Job Execution
Word Count
• MapTask 1
– input: Hello World Bye World
– map output: Hello 1, World 1, Bye 1, World 1
– sort: Bye 1, Hello 1, World 1, World 1
– combiner (local aggregation): Bye 1, Hello 1, World 2
• MapTask 2
– input: Hello Hadoop Goodbye Hadoop
– map output: Hello 1, Hadoop 1, Goodbye 1, Hadoop 1
– sort: Goodbye 1, Hadoop 1, Hadoop 1, Hello 1
– combiner (local aggregation): Goodbye 1, Hadoop 2, Hello 1
• shuffle
– to ReduceTask 1 (keys: A-G): Bye 1 from MapTask 1; Goodbye 1 from MapTask 2
– to ReduceTask 2 (keys: H-Z): Hello 1, World 2 from MapTask 1; Hadoop 2, Hello 1 from MapTask 2
• ReduceTask 1 (keys: A-G)
– merge/sort: Bye 1, Goodbye 1
– output: Bye 1, Goodbye 1, written to HDFS part0
• ReduceTask 2 (keys: H-Z)
– merge/sort: Hadoop 2, Hello 1, Hello 1, World 2
– output: Hadoop 2, Hello 2, World 2, written to HDFS part1
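The word count data flow on this slide can be traced in a few lines of plain Python. This is only an in-memory sketch of the map, sort, combine, partition, shuffle, and reduce stages, not Hadoop's own API: the function names (`map_words`, `combine`, `partition`, `run_word_count`) are hypothetical, and the A-G / H-Z range partitioner is taken from the slide rather than from Hadoop's default hash partitioner.

```python
from collections import defaultdict
from itertools import groupby


def map_words(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Sort pairs by key, then sum the counts of equal keys.

    For word count the combiner and the reducer do the same work,
    so this helper is reused for both stages.
    """
    ordered = sorted(pairs)
    return [
        (key, sum(count for _, count in group))
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    ]


def partition(key):
    """Range partitioner from the slide: A-G -> reducer 0, H-Z -> reducer 1."""
    return 0 if key[0].upper() <= "G" else 1


def run_word_count(splits, num_reducers=2):
    """Run one map task per input split, shuffle, then reduce."""
    reducer_inputs = defaultdict(list)
    for line in splits:
        # Map task: map, sort, and combine locally before the shuffle.
        for key, count in combine(map_words(line)):
            reducer_inputs[partition(key)].append((key, count))
    # Reduce task: merge/sort the shuffled pairs and sum per key.
    return {r: combine(reducer_inputs[r]) for r in range(num_reducers)}


parts = run_word_count(["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"])
for reducer, output in parts.items():
    print(f"part{reducer}: {output}")
```

Running the combiner inside each map task means `World 2` and `Hadoop 2` cross the shuffle as one pair each instead of two, which is the local aggregation the slide points out.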
Hadoop MapReduce
Hadoop Schedulers
• A pluggable framework for job scheduling algorithms, available since Hadoop 0.19
– FIFO
– Fair Scheduler (Facebook)
– Capacity Scheduler (Yahoo!)
FIFO scheduler
Originally optimized for large batch jobs (web index construction)
FIFO order + priority queues
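"FIFO order + priority queues" can be illustrated with a small heap-based queue: jobs are ordered by priority first and by submission order second. This is a toy sketch, not the Hadoop scheduler code; the class name `FifoScheduler`, the integer priorities, and the convention that a lower number means higher priority are all assumptions made for the example.

```python
import heapq
from itertools import count


class FifoScheduler:
    """Toy job queue: highest priority first, FIFO within one priority."""

    def __init__(self):
        self._heap = []
        self._submit_order = count()  # monotonically increasing ticket

    def submit(self, job_name, priority=2):
        # Assumption: lower number = higher priority.
        # The ticket breaks ties so equal-priority jobs stay in FIFO order.
        heapq.heappush(self._heap, (priority, next(self._submit_order), job_name))

    def next_job(self):
        """Return the next job to run, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


scheduler = FifoScheduler()
scheduler.submit("web-index-build")
scheduler.submit("nightly-report")
scheduler.submit("urgent-fix", priority=0)
print(scheduler.next_job())  # the high-priority job jumps the queue
print(scheduler.next_job())  # then jobs run in submission order
```

A long batch job at the head of such a queue holds up every job of equal or lower priority behind it, which is the motivation for the Fair and Capacity schedulers.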
Fair Scheduler
Capacity Scheduler
HDFS Architecture
Hadoop Ecosystem
HBase: BigTable-like
Hive: Data summarization and ad hoc querying
Pig: A high-level data-flow language and execution framework for parallel computation
HCatalog: Table and storage management service (table abstraction of data)
ZooKeeper: A high-performance coordination service for distributed applications
Hadoop and Hadoop-derived Distributions
https://blogs.apache.org/bigtop/entry/all_you_wanted_to_know
* Cloudera Distribution Including Apache Hadoop (CDH)
* Greenplum HD (EMC)
* Hortonworks Data Platform (Yarn)
* MapR
Outline
Hadoop
Yarn (NextGen Hadoop)
Yarn (NextGen Hadoop)
Split up the two major functionalities of the JobTracker:
* Resource management
* Job scheduling/monitoring
ResourceManager:
* Scheduler: allocates resources to the various running applications (pluggable policy plug-in)
* ApplicationsManager: accepts job submissions and launches the first container for the ApplicationMaster
Resource Allocation
• The resource request understood by the Scheduler is of the form:
<priority, (hostname/rackname/*), capability, #containers>
• Scheduler API
There is a single API between the Scheduler and the ApplicationMaster:
(List<Container> newContainers, List<ContainerStatus> containerStatuses) allocate(List<ResourceRequest> ask, List<Container> release)
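The request tuple and the single `allocate` call can be mimicked with a toy in-memory scheduler. This is a sketch under stated assumptions, not the YARN API: `ToyScheduler`, the field names, and the string statuses are hypothetical; capability is reduced to memory in MB; a location is either a hostname or `*` (rack names are ignored); and a lower priority number is assumed to be served first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRequest:
    """<priority, (hostname/rackname/*), capability, #containers>"""
    priority: int        # assumption: lower number is served first
    location: str        # a hostname, or "*" for any host (racks not modeled)
    capability_mb: int   # assumption: capability is memory only
    num_containers: int


@dataclass(frozen=True)
class Container:
    container_id: int
    host: str
    capability_mb: int


class ToyScheduler:
    """Hypothetical stand-in for the Scheduler side of allocate()."""

    def __init__(self, free_mb_per_host):
        self.free_mb = dict(free_mb_per_host)
        self._next_id = 0

    def allocate(self, ask, release):
        """Return (new_containers, container_statuses) as on the slide."""
        statuses = []
        for container in release:
            # Give the memory of released containers back to their host.
            self.free_mb[container.host] += container.capability_mb
            statuses.append((container.container_id, "released"))
        new_containers = []
        for request in sorted(ask, key=lambda r: r.priority):
            for _ in range(request.num_containers):
                host = self._pick_host(request)
                if host is None:
                    break  # no capacity left for this request
                self.free_mb[host] -= request.capability_mb
                new_containers.append(
                    Container(self._next_id, host, request.capability_mb)
                )
                self._next_id += 1
        return new_containers, statuses

    def _pick_host(self, request):
        hosts = list(self.free_mb) if request.location == "*" else [request.location]
        for host in hosts:
            if self.free_mb.get(host, 0) >= request.capability_mb:
                return host
        return None


scheduler = ToyScheduler({"node1": 2048, "node2": 1024})
granted, _ = scheduler.allocate(
    ask=[ResourceRequest(1, "*", 1024, 2), ResourceRequest(0, "node2", 1024, 1)],
    release=[],
)
print([c.host for c in granted])
```

Because an ApplicationMaster both asks for and releases containers in the same call, one round trip is enough to update the Scheduler's whole view of that application.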
Compact: O(cluster size)
HDFS Federation
Benefits:
* Namespace Scalability
* Performance
* Isolation