StoreApp: A Shared Storage Appliance for Efficient and Scalable Virtualized Hadoop Clusters
LIU Kai
Email: [email protected]
http://kiwenlau.com/
National Institute of Informatics, Japan
04/15/2023 LIU Kai, National Institute of Informatics
Contents
- Introduction (What?)
- Motivation (Why?)
- Implementation (How?)
- Personal Ideas
Introduction – What is StoreApp?
Background
- Hadoop (version 1): for big data storage and computation
  - Hadoop Distributed File System (HDFS): for storage
  - Hadoop MapReduce framework: for computation
- Master/slave architecture
  - Storage (DataNode) and computation (TaskTracker) are co-located on each node
[Figure: one master node (NameNode, JobTracker) and several slave nodes (each running a DataNode and a TaskTracker); every node is a physical machine or a virtual machine.]
Overview
What is StoreApp?
- A Hadoop plugin
- Speeds up Hadoop running in virtual machines
- Separates storage (DataNode) from computation (TaskTracker)
[Figure: two physical machines. On one, every virtual machine co-locates a DataNode and a TaskTracker; on the other, a single DataNode virtual machine sits alongside several TaskTracker virtual machines.]
Benefit
- Improves HDFS throughput by 78.3%
  - The storage VM has a higher scheduling priority than the computation VMs
  - Consolidating storage into one VM reduces I/O contention
- Reduces job completion time by 61%
  - Most Hadoop jobs are data intensive
  - Their performance is bottlenecked by slow disk access
Motivation – Why do we need StoreApp?
Challenge 1
- Nodes cannot be added or removed easily
  - Rebalancing data incurs significant data movement
  - The cluster cannot exploit the elasticity of virtual machines
[Figure: physical machines hosting virtual machines, each virtual machine co-locating a DataNode and a TaskTracker.]
Solution 1
- Separate storage from computation
  - Adding or removing a computation node requires no data movement
  - The optimal number of computation nodes can be found for each Hadoop job
[Figure: each physical machine hosts one DataNode virtual machine and several TaskTracker virtual machines.]
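
A back-of-the-envelope model can make the data-movement argument concrete. The sketch below is not taken from the StoreApp paper: the function name `blocks_moved_on_scale_out` and the assumption that HDFS blocks are spread perfectly evenly across the nodes that hold storage are hypothetical simplifications.

```python
def blocks_moved_on_scale_out(total_blocks, old_nodes, new_nodes, colocated):
    """Estimate how many blocks move when a cluster grows (toy model).

    colocated=True  : every node holds a DataNode, so new nodes must be
                      filled with their even share of the blocks.
    colocated=False : new nodes are compute-only (TaskTracker), so the
                      storage layout is untouched.
    """
    if new_nodes < old_nodes:
        raise ValueError("this sketch only models adding nodes")
    if not colocated:
        return 0  # compute-only nodes store no HDFS blocks
    # Assumed even balance: each node ends up with total_blocks // new_nodes.
    share_per_node = total_blocks // new_nodes
    return share_per_node * (new_nodes - old_nodes)
```

In this model, growing a co-located cluster of 3 nodes holding 1200 blocks to 4 nodes moves 300 blocks, while adding a compute-only node moves none.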
Challenge 2
- Co-located virtual machines often access the disk concurrently
  - Random I/O operations compete with each other
  - This significantly degrades Hadoop job performance
[Figure: physical machines hosting virtual machines, each virtual machine co-locating a DataNode and a TaskTracker.]
Solution 2
- Separate storage from computation
  - Each physical machine has only one storage virtual machine
  - Only the storage virtual machine is I/O intensive
  - No serious concurrent I/O operations occur
[Figure: each physical machine hosts one DataNode virtual machine and several TaskTracker virtual machines.]
Challenge 3
- Virtual machines cannot be scheduled efficiently
  - I/O-intensive VMs can be prioritized because they consume less CPU
  - However, every VM is I/O intensive!
[Figure: physical machines hosting virtual machines, each virtual machine co-locating a DataNode and a TaskTracker.]
Solution 3
- Separate storage from computation
  - Only the storage virtual machine is I/O intensive
  - The storage virtual machine receives a higher scheduling priority
[Figure: each physical machine hosts one DataNode virtual machine and several TaskTracker virtual machines.]
Implementation – How to design StoreApp?
Architecture
- StoreApp consists of a StoreApp manager and multiple storage nodes
- The StoreApp manager runs on the master node
- Each physical machine has one storage node
Components
- StoreApp manager: coordinates the operations of all data nodes
- Scheduler: schedules tasks according to data locations
- HDFS Proxy: receives all HDFS requests and forwards them to the DataNode
- Shuffler: receives map output and pushes it to the DataNode
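
Scheduling by data location could look roughly like the following. This is a minimal sketch under stated assumptions, not StoreApp's actual scheduler: the function `pick_tracker`, the host and tracker names, and the fallback to the first free tracker are all hypothetical.

```python
def pick_tracker(block_hosts, free_trackers):
    """Choose a TaskTracker for a task that reads one HDFS block.

    block_hosts   : set of physical hosts whose storage node holds the block
    free_trackers : list of (tracker_id, physical_host) with a free slot

    Prefers a tracker on a host that stores the block, so the read stays on
    the same physical machine; otherwise falls back to any free tracker.
    """
    for tracker_id, host in free_trackers:
        if host in block_hosts:
            return tracker_id  # data-local choice
    # No local tracker available: accept a remote read (assumed fallback).
    return free_trackers[0][0] if free_trackers else None


# Usage: the block lives on pm2, so the tracker on pm2 is chosen.
chosen = pick_tracker({"pm2"}, [("tt-a", "pm1"), ("tt-b", "pm2")])
```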
HDFS Prefetching
- Read the whole block b1 instead of only the partial records that are needed
- Keep the unused data of block b1 in memory
- Read the consecutive block into memory to form input split s1
[Figure: task0 and task1.]
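
The whole-block read with in-memory reuse can be sketched as a small caching reader. This is an illustration only, assuming a hypothetical `read_block` callable that returns a block's bytes; the class name `PrefetchingReader` and the unbounded cache are simplifications that do not come from the slides.

```python
class PrefetchingReader:
    """Reads whole blocks once and serves partial reads from memory."""

    def __init__(self, read_block):
        self._read_block = read_block  # hypothetical: block_id -> bytes
        self._cache = {}               # block_id -> full block contents
        self.disk_reads = 0            # counts whole-block fetches

    def read(self, block_id, offset, length):
        # First touch of a block fetches all of it, not just the slice asked for.
        if block_id not in self._cache:
            self._cache[block_id] = self._read_block(block_id)
            self.disk_reads += 1
        # Later requests for the same block are served from memory.
        return self._cache[block_id][offset:offset + length]


# Usage: two tasks read different parts of b1; the disk is touched once.
blocks = {"b1": b"0123456789"}
reader = PrefetchingReader(blocks.__getitem__)
first = reader.read("b1", 0, 4)
second = reader.read("b1", 4, 6)
```

A real implementation would need to bound the cache and evict blocks; that is left out here.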
Automated Cluster Resizing
- Dynamically change the cluster size during job execution
- An iterative algorithm searches for the optimal cluster size
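
The slides do not describe the iterative algorithm itself. As one possible shape for such a search, the sketch below uses plain hill climbing over the number of computation nodes. It is an assumption for illustration, not the paper's method: `search_cluster_size`, `measure_time`, and the step size are hypothetical.

```python
def search_cluster_size(measure_time, start, min_size, max_size, step=1):
    """Hill-climb toward the cluster size with the lowest measured time.

    measure_time : hypothetical callable, cluster size -> observed job time
    Returns the first size whose neighbours are no faster (a local optimum).
    """
    size = start
    best = measure_time(size)
    while True:
        # Try one step smaller and one step larger, staying within bounds.
        neighbours = [s for s in (size - step, size + step)
                      if min_size <= s <= max_size]
        if not neighbours:
            return size
        time, candidate = min((measure_time(s), s) for s in neighbours)
        if time >= best:
            return size  # no neighbour improves on the current size
        best, size = time, candidate


# Usage with a made-up cost curve: parallel speed-up plus per-node overhead.
optimal = search_cluster_size(lambda n: 100 / n + 2 * n,
                              start=2, min_size=1, max_size=20)
```

Because computation nodes hold no data, each probe of a new size only starts or stops TaskTracker virtual machines, which is what makes this kind of trial-and-error search affordable.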
Personal Ideas
Pros and cons
- Pros
  - A simple idea that shows good results
  - Clear logic in locating and solving problems
- Cons
  - Restricted to Hadoop 1
  - Not open source
Future direction
- From Hadoop 1 to Hadoop 2
  - Hadoop 2 is quite different from Hadoop 1
  - Hadoop 2 supports more application frameworks, such as Spark
- From virtual machines to containers
  - Containers are a more lightweight virtualization technology
  - Containers are more resource efficient than virtual machines
  - Containers are easier to scale than virtual machines
References
Yanfei Guo, et al. "StoreApp: A Shared Storage Appliance for Efficient and Scalable Virtualized Hadoop Clusters", INFOCOM, 4, 2015
Thank you!