Big Data Extensions: Advanced Features and
Customer Case Study
Jayanth Gummaraju, VMware
Sasha Kipervarg, Identified, Inc.
VAPP5484
#VAPP5484
2
Data Is Exploding & Hadoop Is Driving Growth
Unstructured data driving growth
[Chart: structured vs. unstructured data volume, 2011 to 2020]
Complex unstructured data forecasted to outpace structured relational data by 10x by 2020
Hadoop adoption is ramping
• Evaluating: 53%
• In production: 23%
• Piloting: 18%
• Testing: 2%
• Don't know: 2%
• Other: 2%
Source: Forrester Survey of 60 CIOs, September 2011
• Unstructured data explosion and Hadoop capabilities causing CIOs to
reconsider Enterprise data strategy
• Hadoop’s ability to process raw data at cost presents intriguing value
proposition
3
Agenda
Big Data Extensions Overview
Virtualized Hadoop at Identified Inc.
Advanced Features
4
Questions for Audience
Familiarity with Hadoop
1. New to Hadoop
2. Reasonably familiar
3. Expert
Hadoop cluster sizes
1. < 10 nodes
2. 10-50 nodes
3. > 50 nodes
Virtualizing Hadoop
1. Never virtualized
2. Actively exploring virtualization
3. Running virtualized Hadoop in test-dev/production
5
Big Data on vSphere: Value Proposition
Basic Features
• Fast provisioning
  • Minutes/hours instead of days
• Workload Consolidation
  • Multiple virtual clusters co-exist on same physical hardware
• High Availability
  • Not limited to NameNode, JobTracker
Advanced Features
• Auto-elasticity
  • High Resource Utilization
• True multi-tenancy
  • VM-grade security, performance, and configuration isolation
6
vSphere Big Data Extensions: Program Highlights
Serengeti
• Open source project
• Tool to simplify virtualized Hadoop deployment & operations
Hadoop Virtualization Extensions
• Virtualization changes for core Hadoop
• Contributed back to Apache Hadoop
vSphere Resource Management
• Advanced resource management on vSphere
• Big Data applications-specific extension to DRS
7
What is Hadoop?
Distributed processing of large data sets across clusters of computers
Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
[Diagram: MapReduce dataflow. Input Data → Split → map() emits [k1, v1] → Sort by k1 → Merge into [k1, [v1, v2, v3,…]] → reduce() → Output Data. Several map() and reduce() instances run in parallel.]
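The dataflow on this slide can be sketched in a few lines of plain Python. This is a single-process word count for illustration only, not Hadoop code; the function names `map_fn`, `shuffle`, `reduce_fn`, and `run_job` are made up for the sketch.

```python
from collections import defaultdict


def map_fn(line):
    """map(): emit one [k1, v1] pair per word in an input split."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Sort by k1, then merge values into [k1, [v1, v2, v3, ...]]."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return grouped


def reduce_fn(key, values):
    """reduce(): collapse the merged values for one key into one result."""
    return key, sum(values)


def run_job(splits):
    """Run map over every split, shuffle, then reduce each key."""
    pairs = [pair for split in splits for pair in map_fn(split)]
    return dict(reduce_fn(k, vs) for k, vs in shuffle(pairs).items())


# Two input splits standing in for blocks of a larger file.
result = run_job(["big data on vSphere", "big data extensions"])
```

In a real cluster each `map_fn` call would run on a different node and the shuffle would move data over the network; here everything runs in one process.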
8
Tasks Are Scheduled Where Data Resides
[Diagram: the JobTracker divides a Job over an Input File into Split 1, Split 2, and Split 3 (64MB each). The NameNode tracks Block 1, Block 2, and Block 3 (64MB each), stored on Slave Node 1, Slave Node 2, and Slave Node 3 respectively. Each slave node runs a DataNode and a TaskTracker, and Task 1, Task 2, and Task 3 each run on the node that holds the matching block.]
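A minimal sketch of the locality preference described on this slide, assuming a simplified model in which each block has a list of replica nodes and each node has a count of free task slots. The function `schedule` and its data layout are hypothetical and are not the JobTracker's actual scheduler.

```python
def schedule(block_locations, free_slots):
    """Assign one task per block, preferring a node that stores the block.

    block_locations: block id -> list of nodes holding a replica
    free_slots: node -> number of free task slots (not modified)
    Returns block id -> (node, is_data_local).
    """
    slots = dict(free_slots)  # work on a copy
    assignment = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if slots.get(n, 0) > 0]
        if local:
            node, is_local = local[0], True
        else:
            # No replica holder has a free slot: fall back to any free node,
            # which means the block is read over the network.
            candidates = [n for n, s in slots.items() if s > 0]
            if not candidates:
                raise RuntimeError(f"no free slot for {block}")
            node, is_local = candidates[0], False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


blocks = {"block1": ["node1"], "block2": ["node2"], "block3": ["node3"]}
plan = schedule(blocks, {"node1": 1, "node2": 1, "node3": 1})
```

With one free slot per node, every task in `plan` lands on the node holding its block, matching the picture on the slide.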
9
Myth: Virtual Performance Is Sub-optimal
Source: http://www.vmware.com/resources/techresources/10360, Jeff Buell, Apr 2013
[Chart: performance comparison (lower is better)]
Test bed: 32 hosts, 3.6GHz 8 cores, 15K RPM 146GB SAS disks, 10GbE, 72-96GB RAM
12
Compute-Data Separation
[Diagram: three layouts of a slave node, described below]
Combined Storage/Compute (Hadoop in VM)
• VM lifecycle determined by DataNode
• Limited elasticity
• Limited to Hadoop Multi-Tenancy
Separate Storage (storage VM plus compute VM)
• Separate compute from data
• Elastic compute
• Enable shared workloads
• Raise utilization
Separate Compute Tenants (storage VM plus compute VMs for tenants T1 and T2)
• Separate virtual clusters per tenant
• Stronger VM-grade security and resource isolation
• Enable deployment of multiple Hadoop runtime versions
13
Dataflow with Separated Compute/Data
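[Diagram: an ESX host containing one Virtual Hadoop Node that runs the DataNode on a VMDK and one Virtual Hadoop Node that runs the TaskTracker with two task slots. Each node has a virtual NIC attached to the virtual switch, which sits above the NIC drivers. Two more Virtual Hadoop Nodes appear in the diagram.]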
14
Elastic Scalability & Multi-Tenancy
Deploy separate compute clusters for different tenants sharing HDFS.
Power compute VMs on or off according to priority and available resources.
[Diagram: VMware vSphere + Big Data Extensions hosting a shared data layer and a compute layer. The compute layer holds two virtual clusters of Compute VMs in a dynamic resource pool, Experimentation and Production (recommendation engine), each with its own JobTracker.]
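One way to picture "power on/off according to priority and available resources" is a strict-priority split of a fixed pool of compute VMs. This is a simplified sketch under that assumption; `allocate_compute_vms` and the tenant tuples are hypothetical, and the slides do not say the product uses strict priority.

```python
def allocate_compute_vms(capacity, tenants):
    """Split a fixed number of compute VMs across tenants by priority.

    tenants: list of (name, priority, demand); higher priority is served first.
    Returns name -> number of compute VMs to keep powered on.
    """
    powered_on = {}
    remaining = capacity
    for name, _priority, demand in sorted(tenants, key=lambda t: -t[1]):
        granted = min(demand, remaining)
        powered_on[name] = granted
        remaining -= granted
    return powered_on


# Production outranks experimentation; together they want 14 VMs but only
# 10 fit, so experimentation absorbs the shortfall.
plan = allocate_compute_vms(
    capacity=10,
    tenants=[("experimentation", 1, 8), ("production", 2, 6)],
)
```

Because both tenants read from the same HDFS data layer, powering a compute VM off in this model loses no data.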
15
Auto-elastic Hadoop in Action
[Diagram: three ESX hosts managed by a VirtualCenter Management Server, with DRS and the VHM. Each host runs a Hadoop HDFS data VM on local disks alongside Hadoop compute VMs (TT). JT and NN VMs are also shown. Other labels: SAN/NAS, Non-Hadoop VMs.]
Legend
JT: JobTracker
TT: TaskTracker
NN: NameNode
VHM: Virtual Hadoop Manager
16
Advanced Resource Management using Virtual Hadoop Manager
[Diagram: the Virtual Hadoop Manager (VHM) runs the auto-scaling algorithms and connects Serengeti, vCenter Server, and the Hadoop JobTracker with its TaskTrackers.]
• Hadoop state and stats: state, stats (slots used, pending work) from the JobTracker
• Hadoop actions: commands (decommission, recommission)
• VC state and stats: stats and VM configuration from the vCenter DB
• VC actions: manual/auto power on/off through vCenter Server
• Serengeti configuration: cluster configuration
17
Auto-Scaling Algorithms: 5 Key Insights
① Expand or Shrink clusters based on ambient data
• Expand when there is work and no imminent contention
• Shrink when there is contention
• Predictable scaling for matching customer expectation, ease of testing, etc.
② Use contention detection as an input to scaling response
• Contention reflects user's resource control settings and workload demands
③ Act as an extension to DRS for distributed applications spanning multiple VMs
• A glue between DRS and Application-scheduler
• Penalize few VMs heavily rather than all VMs lightly/uniformly
④ React only if there is true contention and in a timely manner
• Actively used resources are deprived
• Do not react to transients
⑤ Use Hysteresis and Control Theory concepts to guide decisions
• E.g., transient windows and thresholds, feedback from previous actions, etc.
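Insights ④ and ⑤ can be illustrated with a small controller that acts only when a signal persists for a full window of samples. This is a sketch, not the VHM implementation: the class `ScalingController`, the window length, and the boolean inputs are all assumptions made for the example.

```python
from collections import deque


class ScalingController:
    """Hypothetical controller: act on sustained signals, ignore transients."""

    def __init__(self, window=3):
        self.window = window
        self.samples = deque(maxlen=window)

    def decide(self, contention, pending_work):
        """Record one sample and return 'shrink', 'expand', or 'hold'."""
        self.samples.append((contention, pending_work))
        if len(self.samples) < self.window:
            return "hold"  # not enough history yet
        if all(c for c, _ in self.samples):
            # Clear history so the action can take effect before re-evaluating.
            self.samples.clear()
            return "shrink"
        if all(w and not c for c, w in self.samples):
            self.samples.clear()
            return "expand"
        return "hold"


controller = ScalingController(window=3)
sustained = [controller.decide(contention=True, pending_work=True) for _ in range(3)]

transient = ScalingController(window=3)
spiky = [
    transient.decide(contention=True, pending_work=True),
    transient.decide(contention=False, pending_work=True),
    transient.decide(contention=True, pending_work=True),
]
```

Here `sustained` ends in a shrink decision, while the single contention-free sample in `spiky` keeps the controller holding.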
18
Shrinking-related Metrics
CPU is being deprived
• VC metric: CPU Ready
• Time that vCPU is ready to run, but cannot be scheduled on a pCPU
Memory is being deprived
• VC metrics:
  • Usage: Active Memory, Granted Memory
  • Reclamation: Memory Ballooning, Host Swap
• Typically starts with ballooning then leads to host swapping
TaskTracker is dead or faulty
• Hadoop metrics: Alive Nodes and Task Failures
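The shrink signals above could be combined per compute VM roughly as follows. The field names, units, and thresholds in this sketch are illustrative assumptions; the slides name the metrics but give no thresholds.

```python
from dataclasses import dataclass


@dataclass
class VmSample:
    """One observation of a compute VM (field names are hypothetical)."""
    cpu_ready_pct: float     # share of the interval the vCPU waited for a pCPU
    ballooned_mb: float      # memory reclaimed by ballooning
    host_swapped_mb: float   # memory swapped out by the host
    tasktracker_alive: bool  # from the Hadoop "Alive Nodes" view
    task_failures: int       # recent task failures on this TaskTracker


def shrink_reasons(sample, cpu_ready_limit=10.0, failure_limit=5):
    """Return which shrink signals fire for this sample (assumed thresholds)."""
    reasons = []
    if sample.cpu_ready_pct > cpu_ready_limit:
        reasons.append("cpu")
    if sample.ballooned_mb > 0 or sample.host_swapped_mb > 0:
        reasons.append("memory")
    if not sample.tasktracker_alive or sample.task_failures > failure_limit:
        reasons.append("tasktracker")
    return reasons


healthy = VmSample(1.0, 0.0, 0.0, True, 0)
starved = VmSample(25.0, 512.0, 0.0, True, 0)
```

A VM with no signals, such as `healthy`, is left alone; `starved` reports both CPU and memory deprivation and would be a candidate for decommissioning.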
19
Expansion-related Metrics
Jobs are present
• Hadoop metrics: jobs_preparing, jobs_running
High slot usage
• Hadoop metrics: map_slots_used, max_map_slots, reduce_slots_used,
max_reduce_slots
High task throughput
• Hadoop metrics: maps_completed, reduces_completed
No imminent contention
• VC metrics: CPU Ready, Memory Ballooning
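A sketch of an expansion check built from the metrics listed above, assuming all conditions must hold at once. The Hadoop metric names come from the slide; `cpu_ready_pct`, `ballooned_mb`, and both thresholds are assumptions, and the task-throughput signal is left out because it needs a delta between two samples.

```python
def should_expand(m, slot_usage_threshold=0.8, cpu_ready_limit=5.0):
    """Return True when there is work, slots are busy, and no contention.

    m: dict holding the Hadoop metrics named on the slide plus two
    hypothetical VC readings, cpu_ready_pct and ballooned_mb.
    """
    jobs_present = (m["jobs_preparing"] + m["jobs_running"]) > 0
    used = m["map_slots_used"] + m["reduce_slots_used"]
    total = m["max_map_slots"] + m["max_reduce_slots"]
    high_slot_usage = total > 0 and used / total >= slot_usage_threshold
    no_contention = m["cpu_ready_pct"] < cpu_ready_limit and m["ballooned_mb"] == 0
    return jobs_present and high_slot_usage and no_contention


busy = {
    "jobs_preparing": 0, "jobs_running": 2,
    "map_slots_used": 18, "max_map_slots": 20,
    "reduce_slots_used": 8, "max_reduce_slots": 10,
    "cpu_ready_pct": 1.0, "ballooned_mb": 0,
}
contended = dict(busy, ballooned_mb=256)
```

`busy` qualifies for expansion, while the same workload with ballooning present (`contended`) does not, since adding compute VMs would deepen the contention.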
20
Auto-elasticity Demo
21
What’s Next?
Resource management enhancements
• Algorithmic optimizations
• Contention metrics related to Disk/Network IO
Auto-elasticity support for YARN and HBase
• YARN – Hadoop 2.x
• HBase – Hadoop database
Serengeti enhancements
• Support for additional Hadoop distros
Hadoop extensions
• Dynamic resource configuration
22
Main Takeaways
Value proposition
• Fast provisioning
• Workload consolidation
• Elasticity → better resource utilization
• Multi-tenancy using VMs → differentiated service
Key technologies
• Serengeti
• Advanced Resource Management
• Hadoop Virtualization Extensions
[Diagram: three hosts on the vSphere platform]
Make vSphere the platform of choice for running Big Data
23
Questions?
Contact information
• Jayanth Gummaraju [email protected]
• Sasha Kipervarg [email protected]
Other related sessions
• Breakout session (VAPP5402, VAPP5762)
• Big Data Panel (VAPP5626)
• Hands-on lab (HOL-SDC-1309)
For more information (including download information)
• vSphere Big Data Extensions http://www.vmware.com/hadoop
• Project Serengeti http://www.projectserengeti.org
THANK YOU