performance of hadoop on openstack
-
Performance of Hadoop on OpenStack
Andrew Lazarev
Mirantis, 2014
-
Introduction
Environment description
Direct virtualization impact
Real-life workload
Data locality
Conclusion
Agenda
-
What Is Hadoop?
Ambari (Management)
ZooKeeper (Coordination)
Oozie (Scheduling)
HDFS (File System)
HBase (NoSQL Store)
MapReduce (Programming Framework)
Pig (Data Flow)
Hive (SQL)
Storm (Real-time computation)
Diagram legend: Core Apache Hadoop
-
Easy to operate cluster
One-click self-service provisioning
Sharing hardware between several Hadoop clusters
Tenant isolation on hypervisor and network layers
Comparable performance with much more flexibility
Why Virtualize Hadoop?
-
Sahara - OpenStack Data Processing project
OpenStack integrated
Supports Hadoop 1 and 2
Different vendors (Apache, Hortonworks, Intel*)
Cluster provisioning and on-demand job execution
How To Virtualize?
-
Direct impact
Disk write
Disk read
Network
CPU
Virtualization Impact
-
Indirect impact
Lack of low-level system control
Resources for hypervisor operation
Virtualization Impact
-
Mirantis OpenStack Express cluster
20 nodes
CPU: 24 x 2.10 GHz (2 x Intel Xeon CPU E5-2620)
Memory: 8 x 4.0 GB, 32.0 GB total
Disk: 1 drive, 0.9 TB (WDC WD1003FBYX-0)
Network: 2 x 1 GbE
Environment
-
Host OS: CentOS 6.5
VM OS: CentOS 6.5
Mirantis OpenStack
QEMU-KVM 1.2.0
Network: Neutron + GRE
Open vSwitch 1.10.2
Environment (continued)
-
Hadoop: Vanilla Apache 1.2.1
Bare metal setup: 19 Hadoop nodes
OpenStack setup: 1 controller + 19 compute nodes; 19 (or 57) VMs with Hadoop
Environment (continued)
-
Disk Write (using dd)
*greater is better
-
TestDFSIO - built-in Hadoop I/O test
Write test
Read test
1000 files of 1 GB (1 TB total)
Disk Write (hadoop test)
-
Disk Write (hadoop test)
*less is better
-
Disk Write (hadoop test)
*less is better
-
disk_cachemodes param in nova.conf
writethrough (default) - guest disk write cache is disabled
writeback - guest disk write cache is enabled
Disk Cache Mode
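
A minimal sketch of what the setting could look like, generated with Python's `configparser`. The section name (`libvirt`) and the `file=` disk-type prefix are assumptions here: the section that holds `disk_cachemodes` differs between OpenStack releases, and the slides do not show the exact line that was used.

```python
import configparser
import io

# The two cache modes discussed on the slide.
SLIDE_MODES = ("writethrough", "writeback")


def cache_mode_fragment(mode="writeback", section="libvirt"):
    """Render a nova.conf fragment that sets the cache mode for file-backed disks.

    The section name is an assumption; older releases may keep the option elsewhere.
    """
    if mode not in SLIDE_MODES:
        raise ValueError("expected one of %s" % (SLIDE_MODES,))
    cfg = configparser.ConfigParser()
    # disk_cachemodes takes "<disk type>=<cache mode>" pairs.
    cfg[section] = {"disk_cachemodes": "file=" + mode}
    buf = io.StringIO()
    cfg.write(buf)
    return buf.getvalue()


if __name__ == "__main__":
    print(cache_mode_fragment("writeback"))
```

The fragment would go into nova.conf on each compute node, and the compute service has to be restarted before newly launched VMs pick up the change.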
-
Writeback cache enabled
One large VM per host, with all of the host's memory
Disk Write (dd, writeback cache)
-
Disk Write (dd, writeback cache)
*greater is better
-
Disk Write (hadoop test, writeback cache)
*less is better
-
QEMU 1.4: high-performance virtio-blk data plane implementation
+108.0% on random write (based on a Red Hat presentation at KVM Forum)
Disk Write - Way To Improve
-
Disk Read (using hdparm)
*greater is better
-
Disk Read (using hdparm)
*greater is better
-
Disk Read (hadoop test)
*less is better
-
Network (OVS+GRE)
*greater is better
-
PI - built-in Hadoop test
Depends mostly on CPU
50 series of 10,000,000,000 probes
CPU (hadoop test)
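
To show why this test is CPU-bound, here is a small single-process Python sketch of the same idea: points from a Halton sequence are placed in the unit square and the share falling inside the inscribed circle gives an estimate of pi. It is an illustration only, not the Hadoop example's code, and the sample count is far smaller than the 50 x 10,000,000,000 probes on the slide.

```python
import math


def halton(index, base):
    """Return element `index` (1-based) of the Halton sequence for `base`."""
    result = 0.0
    fraction = 1.0 / base
    i = index
    while i > 0:
        result += fraction * (i % base)
        i //= base
        fraction /= base
    return result


def estimate_pi(samples):
    """Estimate pi from `samples` quasi-random points in the unit square."""
    inside = 0
    for i in range(1, samples + 1):
        # Shift so the circle of radius 0.5 is centred on the origin.
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        if x * x + y * y <= 0.25:
            inside += 1
    # Area of circle / area of square = pi / 4.
    return 4.0 * inside / samples


if __name__ == "__main__":
    approx = estimate_pi(20000)
    print("estimate:", approx, "error:", abs(approx - math.pi))
```

In the Hadoop job each map task handles one series of samples, so the work is almost entirely arithmetic with very little disk or network traffic.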
-
CPU (hadoop test)
*less is better
-
Built-in Hadoop test
Represents real Hadoop workload
Involves I/O, networking, and computation
Sorting 200,000,000 100-byte entries (20 GB)
Writeback cache enabled
Terasort
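
The data size on this slide follows directly from the row count, and the two steps of the benchmark (generate, then sort) can be sketched as below. The jar path and the HDFS directory names are hypothetical; the slides do not list the commands that were run.

```python
# TeraSort input consists of fixed-size 100-byte rows.
ROW_BYTES = 100

# Assumption: location of the examples jar in a vanilla Apache Hadoop 1.2.1 install.
HADOOP_EXAMPLES_JAR = "/usr/share/hadoop/hadoop-examples-1.2.1.jar"


def dataset_bytes(rows):
    """Size of the generated input in bytes."""
    return rows * ROW_BYTES


def terasort_commands(rows, input_dir, output_dir, jar=HADOOP_EXAMPLES_JAR):
    """Return the argument lists for generating the input and then sorting it."""
    return [
        ["hadoop", "jar", jar, "teragen", str(rows), input_dir],
        ["hadoop", "jar", jar, "terasort", input_dir, output_dir],
    ]


if __name__ == "__main__":
    rows = 200000000
    print(dataset_bytes(rows) / 10**9, "GB")
    # Hypothetical HDFS paths, for illustration only.
    for cmd in terasort_commands(rows, "/bench/tera-in", "/bench/tera-out"):
        print(" ".join(cmd))
```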
-
Terasort
*less is better
-
Hadoop can take the distance between nodes into account
Intelligent task scheduling
Reading data from nearby DataNodes
Data Locality
[Diagram: six nodes]
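
One way Hadoop learns about node distance is a topology script: Hadoop 1.x calls the executable named in the `topology.script.file.name` property with one or more addresses as arguments and reads one network location per address from its output. The sketch below places VMs that share a compute host in the same location. The address-to-host table is entirely hypothetical, and the slides do not say whether the tests used this exact mechanism.

```python
#!/usr/bin/env python
import sys

# Hypothetical mapping from VM address to the compute host it runs on.
VM_TO_HOST = {
    "10.0.0.11": "/compute-01",
    "10.0.0.12": "/compute-01",
    "10.0.0.21": "/compute-02",
}

# Location reported for addresses that are not in the table.
DEFAULT_LOCATION = "/default-rack"


def resolve(addresses):
    """Return one network location per address, in the same order."""
    return [VM_TO_HOST.get(addr, DEFAULT_LOCATION) for addr in addresses]


if __name__ == "__main__":
    # Hadoop passes the addresses as command-line arguments.
    print("\n".join(resolve(sys.argv[1:])))
```

With such a mapping, the scheduler can prefer a DataNode in a VM on the same physical host over one that is only reachable across the network.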
-
Data Locality
*greater is better
-
Network within host comparable to disk speed
Allows Hadoop process isolation (VM per process)
Test:
1 Master Node (JobTracker + NameNode)
18 DataNodes
18 TaskTrackers
TeraSort of 20 GB of data
Data Locality
-
Terasort (data locality)
*less is better
-
Only 6% performance impact for composite test
Performance continuously improving with external library upgrades (QEMU, Open vSwitch)
Much more topology flexibility
Isolation at low cost: between clusters, and between nodes within a cluster
Conclusion
-
Q&A
-
Thank you!
Andrew Lazarev
Launchpad/GitHub/IRC: alazarev
E-Mail: [email protected]