Hadoop 1.x vs Hadoop 2
DESCRIPTION
There is a big shift at both the architecture and API level from Hadoop 1 to Hadoop 2, particularly with YARN. We had our first meetup to talk about this (http://www.meetup.com/Atlanta-YARN-User-Group/) on 10/13/2013.
TRANSCRIPT
Hadoop 1.x vs Hadoop 2
Rommel Garcia, Solutions Engineer - Big Data
Hortonworks
Transition To Big Data
Relational, Dimensional (EDW)
Big Data
Data Explosion
3 Design Dimensions
Key Hadoop Data Types
Sentiment
Clickstream
Sensor/Machine
Geographic
Server Logs
Text
Hadoop is NOT
ESB
NoSQL
HPC
Relational
Real-time
The “Jack of all Trades”
Hadoop 1
Limited to about 4,000 nodes per cluster
Scalability is O(# of tasks in a cluster)
JobTracker is a bottleneck: it handles resource management, job scheduling, and monitoring
Only one namespace for managing HDFS
Map and reduce slots are static
MapReduce is the only kind of job that can run
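The cost of static slots can be illustrated with a toy model. This is only a sketch: the TaskTracker class, its schedule method, and the slot counts are hypothetical names and numbers chosen for illustration, not Hadoop code.

```python
from dataclasses import dataclass


@dataclass
class TaskTracker:
    """Toy Hadoop 1 worker with a fixed number of slots per task type (hypothetical model)."""
    map_slots: int
    reduce_slots: int

    def schedule(self, pending_maps: int, pending_reduces: int) -> dict:
        # A map task may only use a map slot and a reduce task only a reduce slot.
        running_maps = min(pending_maps, self.map_slots)
        running_reduces = min(pending_reduces, self.reduce_slots)
        idle = (self.map_slots - running_maps) + (self.reduce_slots - running_reduces)
        return {
            "running_maps": running_maps,
            "running_reduces": running_reduces,
            "waiting_maps": pending_maps - running_maps,
            "waiting_reduces": pending_reduces - running_reduces,
            "idle_slots": idle,
        }


tt = TaskTracker(map_slots=4, reduce_slots=2)
result = tt.schedule(pending_maps=10, pending_reduces=0)
print(result)
```

In this run six map tasks wait while two reduce slots sit idle, because a slot cannot change type.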
Hadoop 1 - Basics
[Diagram: data labeled A, B, and C distributed across nodes]
MapReduce (Computation Framework)
HDFS (Storage Framework)
Hadoop 1 - Reading Files
Racks: Rack1, Rack2, Rack3, ... RackN (every worker node runs DN | TT; DN = DataNode, TT = TaskTracker)
Hadoop Client to NameNode: read file
NameNode to Hadoop Client: return DNs, block ids, etc.
Hadoop Client to DNs: read blocks
DNs to NameNode: heartbeat/block report
NameNode to SNameNode: checkpoint (fsimage/edit)
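The read path on this slide can be sketched as a small in-memory model: the client asks the NameNode for block locations, then fetches each block from a DataNode that holds a replica. The classes, the read_file helper, the path, and the block ids are all hypothetical stand-ins, not the HDFS client API.

```python
class DataNode:
    """Toy DataNode: stores block contents keyed by block id."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}


class NameNode:
    """Toy NameNode: metadata only, mapping each file to its blocks and replica locations."""
    def __init__(self):
        self.files = {}  # path -> list of (block_id, [DataNode names])

    def get_block_locations(self, path):
        return self.files[path]


def read_file(namenode, datanodes, path):
    """Ask the NameNode where the blocks live, then read each block from a DataNode."""
    chunks = []
    for block_id, locations in namenode.get_block_locations(path):
        for dn_name in locations:
            dn = datanodes[dn_name]
            if block_id in dn.blocks:  # use the first replica that is actually present
                chunks.append(dn.blocks[block_id])
                break
        else:
            raise IOError(f"no replica available for {block_id}")
    return b"".join(chunks)


dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
dn2.blocks["blk_1"] = b"hello "  # dn1 is listed for blk_1 but has lost its replica
dn2.blocks["blk_2"] = b"world"
dn3.blocks["blk_2"] = b"world"

nn = NameNode()
nn.files["/logs/a.txt"] = [("blk_1", ["dn1", "dn2"]), ("blk_2", ["dn2", "dn3"])]
datanodes = {dn.name: dn for dn in (dn1, dn2, dn3)}

content = read_file(nn, datanodes, "/logs/a.txt")
print(content)
```

The NameNode never serves file data in this model; it only returns locations, which matches the "return DNs, block ids, etc." step.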
Hadoop 1 - Writing Files
Racks: Rack1, Rack2, Rack3, ... RackN (every worker node runs DN | TT; DN = DataNode, TT = TaskTracker)
Hadoop Client to NameNode: request write
NameNode to Hadoop Client: return DNs, etc.
Hadoop Client to DNs: write blocks
DN to DN: replication pipelining
DNs to NameNode: block report
NameNode to SNameNode: checkpoint (fsimage/edit)
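Replication pipelining can be sketched as a chain: the client writes to the first DataNode only, each DataNode stores the block and forwards it to the next, and acknowledgements travel back up the chain. The DataNode class and write_block function below are hypothetical and model only the ordering, not the real HDFS data transfer protocol.

```python
class DataNode:
    """Toy DataNode holding blocks in memory."""
    def __init__(self, name):
        self.name = name
        self.store = {}


def write_block(pipeline, block_id, data):
    """Store on the head of the pipeline, forward to the rest, and return acks.

    Acks are collected from the tail first, so the returned list is in
    reverse pipeline order.
    """
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    head.store[block_id] = data                 # store locally
    acks = write_block(rest, block_id, data)    # forward downstream
    return acks + [head.name]                   # ack only after downstream has acked


pipeline = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
acks = write_block(pipeline, "blk_7", b"payload")
print(acks)
```

The client talks to one DataNode regardless of the replication factor; the DataNodes carry the rest of the copies among themselves.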
Hadoop 1 - Running Jobs
Racks: Rack1, Rack2, Rack3, ... RackN (every worker node runs DN | TT; DN = DataNode, TT = TaskTracker)
Hadoop Client to JobTracker: submit job
JobTracker to TTs: deploy job
Task stages: map, shuffle, reduce
Data label: part 0
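The map, shuffle, and reduce stages named on this slide can be sketched in a single process with a word count. The function names and input strings are illustrative assumptions; a real job would run these stages as distributed tasks on the TaskTrackers.

```python
from collections import defaultdict


def map_phase(split):
    """Map: emit (word, 1) for every word in one input split."""
    for word in split.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle: group all values by key and sort the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: combine all values for one key."""
    return key, sum(values)


splits = ["Hadoop stores data", "Hadoop processes data"]
mapped = [pair for split in splits for pair in map_phase(split)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```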
Hadoop 1 - Security
Users
FIREWALL
LDAP/AD
Client Node/Spoke Server
KDC
Hadoop Cluster
authN/authZ
service request
block token
delegation token
* block token is for accessing data
* delegation token is for running jobs
Encryption Plugin
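The idea behind both token types can be sketched as a signed payload that a service verifies with a shared secret, so the holder does not have to re-authenticate against the KDC on every request. This is only a conceptual sketch: issue_token, verify_token, the payload layout, and the choice of HMAC-SHA256 are assumptions made for illustration and do not reproduce Hadoop's actual token format.

```python
import hashlib
import hmac


def issue_token(secret: bytes, user: str, scope: str):
    """Issue a (payload, signature) pair. Hypothetical format: 'user|scope'."""
    payload = f"{user}|{scope}".encode()
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return payload, signature


def verify_token(secret: bytes, payload: bytes, signature: str) -> bool:
    """Recompute the signature with the shared secret and compare in constant time."""
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


shared_secret = b"demo-secret-not-for-production"

# Token scoped to reading one block (stands in for a block token).
block_payload, block_sig = issue_token(shared_secret, "alice", "read:blk_1")
valid = verify_token(shared_secret, block_payload, block_sig)

# Changing the scope without re-signing must fail verification.
tampered = verify_token(shared_secret, b"alice|read:blk_2", block_sig)
print(valid, tampered)
```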
Hadoop 1 - APIs
org.apache.hadoop.mapreduce.Partitioner
org.apache.hadoop.mapreduce.Mapper
org.apache.hadoop.mapreduce.Reducer
org.apache.hadoop.mapreduce.Job
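To show what a Partitioner decides, here is a Python sketch of the rule used by Hadoop's default HashPartitioner, (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. Treating keys as Java Strings is an assumption made for simplicity (Hadoop's Text type hashes its bytes differently), the hash helper is only exact for characters in the Basic Multilingual Plane, and the function names are hypothetical.

```python
def java_string_hashcode(s: str) -> int:
    """Reproduce Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF      # wrap like a Java int
    return h - 0x100000000 if h >= 0x80000000 else h


def partition(key: str, num_reduce_tasks: int) -> int:
    """Pick the reducer for a key: clear the sign bit, then take the modulus."""
    return (java_string_hashcode(key) & 0x7FFFFFFF) % num_reduce_tasks


keys = ["hadoop", "yarn", "hdfs", "mapreduce"]
assignments = {key: partition(key, 4) for key in keys}
print(assignments)
```

Every occurrence of a key lands on the same reducer, which is what lets a Reducer see all values for that key.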
Hadoop 2
Potentially up to 10,000 nodes per cluster
Scalability is O(cluster size)
Supports multiple namespaces for managing HDFS
Efficient cluster utilization (YARN)
Backward and forward compatible with MRv1
Any application can integrate with Hadoop
Beyond Java
Hadoop 2 - Basics
Hadoop 2 - Reading Files
(w/ NN Federation)
Racks: Rack1, Rack2, Rack3, ... RackN (every worker node runs DN | NM; DN = DataNode, NM = NodeManager)
NameNodes/namespaces: NN1/ns1, NN2/ns2, NN3/ns3, NN4/ns4
Hadoop Client to NN1/ns1: read file
NN1/ns1 to Hadoop Client: return DNs, block ids, etc.
Hadoop Client to DNs: read blocks
DNs to NNs: register/heartbeat/block report
Per NN: SNameNode (checkpoint, fsimage/edit copy) or Backup NN (fs sync, checkpoint)
Block Pools for ns1, ns2, ns3, ns4, with DN sets: dn1, dn2; dn1, dn3; dn4, dn5; dn4, dn5
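Federation and block pools can be sketched as follows: each NameNode owns one namespace and one block pool, while a DataNode can store blocks for several pools side by side. The classes, the "BP-" pool naming, the paths, and the client-side mount table are hypothetical choices for illustration only.

```python
class DataNode:
    """Toy DataNode that keeps a separate block map per block pool."""
    def __init__(self, name):
        self.name = name
        self.pools = {}  # pool_id -> {block_id: data}

    def store(self, pool_id, block_id, data):
        self.pools.setdefault(pool_id, {})[block_id] = data


class NameNode:
    """Toy NameNode owning one namespace and its block pool."""
    def __init__(self, namespace):
        self.namespace = namespace
        self.pool_id = f"BP-{namespace}"  # hypothetical naming scheme
        self.files = {}

    def add_file(self, path, block_id, data, datanodes):
        for dn in datanodes:
            dn.store(self.pool_id, block_id, data)
        self.files[path] = (block_id, [dn.name for dn in datanodes])


def resolve(mount_table, path):
    """Pick the NameNode whose mount point is the longest matching path prefix."""
    matches = [p for p in mount_table if path == p or path.startswith(p.rstrip("/") + "/")]
    return mount_table[max(matches, key=len)]


dn1, dn2, dn3 = DataNode("dn1"), DataNode("dn2"), DataNode("dn3")
nn1, nn2 = NameNode("ns1"), NameNode("ns2")
mount_table = {"/user": nn1, "/data": nn2}

resolve(mount_table, "/user/alice/a.txt").add_file("/user/alice/a.txt", "blk_1", b"A", [dn1, dn2])
resolve(mount_table, "/data/logs/b.txt").add_file("/data/logs/b.txt", "blk_1", b"B", [dn1, dn3])
print(dn1.pools)
```

Both namespaces use the block id blk_1 on dn1 without conflict because each pool is tracked separately, and neither NameNode needs to know about the other.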
Hadoop 2 - Writing Files
Racks: Rack1, Rack2, Rack3, ... RackN (every worker node runs DN | NM; DN = DataNode, NM = NodeManager)
NameNodes/namespaces: NN1/ns1, NN2/ns2, NN3/ns3, NN4/ns4
Hadoop Client to NN1/ns1: request write
NN1/ns1 to Hadoop Client: return DNs, etc.
Hadoop Client to DNs: write blocks
DN to DN: replication pipelining
DNs to NNs: block report
Per NN: SNameNode (checkpoint, fsimage/edit copy) or Backup NN (fs sync, checkpoint)
Hadoop 2 - Running Jobs
Racks: Rack1, Rack2, ... RackN, each with several NodeManagers
Hadoop Client 1: create app1, submit app1
Hadoop Client 2: create app2, submit app2
ResourceManager: ASM, Scheduler, queues
App 1: AM1 with containers C1.1, C1.2, C1.3, C1.4
App 2: AM2 with containers C2.1, C2.2, C2.3
status report
Legend:
ASM negotiates Containers
NM reports to ASM
Scheduler partitions Resources
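The flow on this slide can be sketched with a toy scheduler: the ResourceManager reserves a container for each ApplicationMaster, and each ApplicationMaster then asks for task containers that are started on NodeManagers. Everything here is a hypothetical simplification: the class and method names, the memory sizes, and the first-fit placement policy are assumptions, not the YARN API or its real schedulers.

```python
class NodeManager:
    """Toy NodeManager tracking free memory and started containers."""
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = []

    def start_container(self, container_id):
        self.containers.append(container_id)


class ResourceManager:
    """Toy ResourceManager with a first-fit scheduler (an assumed policy)."""
    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.app_count = 0

    def allocate(self, memory_mb):
        # Reserve memory on the first NodeManager that has enough free.
        for nm in self.node_managers:
            if nm.free_mb >= memory_mb:
                nm.free_mb -= memory_mb
                return nm
        return None

    def submit_app(self, am_memory_mb):
        # Every application first gets a container for its ApplicationMaster.
        self.app_count += 1
        nm = self.allocate(am_memory_mb)
        if nm is None:
            raise RuntimeError("no capacity for the ApplicationMaster")
        nm.start_container(f"AM{self.app_count}")
        return ApplicationMaster(self.app_count, self)


class ApplicationMaster:
    """Toy per-application master that negotiates containers for its tasks."""
    def __init__(self, app_id, rm):
        self.app_id = app_id
        self.rm = rm

    def run_tasks(self, count, memory_mb):
        placements = {}
        for i in range(1, count + 1):
            nm = self.rm.allocate(memory_mb)
            if nm is None:
                break  # cluster is full; remaining tasks would have to wait
            container_id = f"C{self.app_id}.{i}"
            nm.start_container(container_id)
            placements[container_id] = nm.name
        return placements


nms = [NodeManager(f"nm{i}", 4096) for i in (1, 2, 3)]
rm = ResourceManager(nms)
app1 = rm.submit_app(1024).run_tasks(4, 1024)
app2 = rm.submit_app(1024).run_tasks(3, 1024)
print(app1)
print(app2)
```

Containers are sized by request rather than fixed as map or reduce slots, so two applications share the same nodes.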
Hadoop 2 - Security
FIREWALL
LDAP/AD
Knox Gateway Cluster
KDC
Hadoop Cluster
Enterprise/Cloud SSO Provider
JDBC Client
REST Client
FIREWALL
DMZ
Browser (HUE)
Native Hive/HBase
Encryption
Hadoop 2 - APIs
org.apache.hadoop.yarn.api.ApplicationClientProtocol
org.apache.hadoop.yarn.api.ApplicationMasterProtocol
org.apache.hadoop.yarn.api.ContainerManagementProtocol
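The three protocols above each connect a different pair of YARN roles. The lookup below summarizes that, with a few representative method names as found in the Hadoop 2.x interfaces. The dictionary layout and the protocol_between helper are hypothetical, and the method lists are not exhaustive.

```python
YARN_PROTOCOLS = {
    # (caller, callee): (interface, representative methods)
    ("Client", "ResourceManager"): (
        "org.apache.hadoop.yarn.api.ApplicationClientProtocol",
        ["getNewApplication", "submitApplication", "getApplicationReport"],
    ),
    ("ApplicationMaster", "ResourceManager"): (
        "org.apache.hadoop.yarn.api.ApplicationMasterProtocol",
        ["registerApplicationMaster", "allocate", "finishApplicationMaster"],
    ),
    ("ApplicationMaster", "NodeManager"): (
        "org.apache.hadoop.yarn.api.ContainerManagementProtocol",
        ["startContainers", "stopContainers", "getContainerStatuses"],
    ),
}


def protocol_between(caller: str, callee: str) -> str:
    """Return the interface name used when `caller` talks to `callee`."""
    interface, _methods = YARN_PROTOCOLS[(caller, callee)]
    return interface


for (caller, callee), (interface, methods) in YARN_PROTOCOLS.items():
    print(f"{caller} -> {callee}: {interface.rsplit('.', 1)[-1]} ({', '.join(methods)})")
```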
Resources
http://hortonworks.com/products/hortonworks-sandbox/
http://hortonworks.com/products/hdp-2/
http://hortonworks.com/resources/
http://hadoopsummit.org/san-jose/
Hadoop Summit 2014