Computer Science
AGILE: Elastic distributed resource scaling for Infrastructure-as-a-Service
Hiep Nguyen, Zhiming Shen, Xiaohui (Helen) Gu (North Carolina State University)
Sethuraman Subbiah (NetApp)
John Wilkes
Elastic resource scaling for Infrastructure-as-a-Service
§ Elasticity: grow/shrink resource as required
[Figure: CPU demand over time, with groups of VMs drawn along the demand curve.]
Design goals
§ Application agnostic
  – Easy to deploy
  – Support different applications
§ Effective overload handling
  – Predict overload accurately
  – Minimize SLO violations
  – Minimize resource cost
§ Low overhead
  – Light-weight
  – Little interference
State of the art
§ Reactive resource scaling [e.g., Amazon EC2]
  – Performance degradation due to long instantiation latency (≈ 2 minutes)
§ Trace-driven resource scaling [e.g., Chandra et al. IWQoS 2003, Gong et al. CNSM 2007, Shen et al. SOCC 2011]
  – Focus on short-term prediction or need to assume cyclic workload patterns
§ Model-driven resource scaling [e.g., Zhu et al. ICAC 2008, Kalyvianaki et al. ICAC 2009, Padala et al. EuroSys 2007]
  – Have parameters that need to be specified or tuned offline for different applications/workloads
AGILE system overview
[Figure: AGILE architecture. Component labels: resource usage monitoring, resource demand prediction, SLO violation feedback, resource pressure modeling, server pool prediction, server pool scaling manager (VM add/remove). Output labels: future resource demand, resource to maintain, when to scale, how many VMs. Timeline labels: advance alert, overload starts, overload stops.]
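The "how many VMs" decision in the overview can be illustrated with a small sizing calculation. This is only a sketch under stated assumptions: the function name `servers_needed`, the idea of dividing predicted demand by per-VM capacity scaled by the pressure to maintain, and all numbers are illustrative and are not taken from the slides.

```python
import math


def servers_needed(predicted_demand: float,
                   per_vm_capacity: float,
                   pressure_to_maintain: float) -> int:
    """Hypothetical sizing rule: enough VMs so that each one stays at or
    below the resource pressure we want to maintain (a fraction in (0, 1])."""
    if not 0 < pressure_to_maintain <= 1:
        raise ValueError("pressure_to_maintain must be in (0, 1]")
    if per_vm_capacity <= 0:
        raise ValueError("per_vm_capacity must be positive")
    usable_per_vm = per_vm_capacity * pressure_to_maintain
    # Always keep at least one server in the pool.
    return max(1, math.ceil(predicted_demand / usable_per_vm))


# Illustrative numbers: demand of 250 CPU units, 100 units per VM,
# and a target pressure of 80% per VM.
pool_size = servers_needed(250, 100, 0.8)
print(pool_size)
```

Rounding up errs on the side of avoiding SLO violations at the cost of some spare capacity; a real controller would trade this off against resource cost.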
Pre-copy live VM cloning
§ Design goals
  – Immediate performance scale-up
  – Avoid storing and maintaining VM snapshots
[Figure labels: memory, disk image, pre-copy memory.]
[Figure labels: disk image (read only), memory, memory copy, two incremental images.]
Pre-copy live VM cloning
§ Dynamic copy-rate configuration
  – Minimum copy-rate (e.g., little interference)
  – Finish cloning within the overload pending time

$$\mathit{CopyRate} = \frac{\mathit{MemorySize}}{t_{clone}} + r_{dirty}$$

[Figure annotations: predicted overload; (avg. ≈ 100 Mbps)]
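As a worked example, the copy-rate rule on this slide can be read as CopyRate = MemorySize / t_clone + r_dirty, where t_clone is the time available before the predicted overload and r_dirty is the rate at which the running VM dirties memory. The sketch below assumes that reading; the function name, the units, and the numbers are illustrative and do not come from the slides.

```python
def min_copy_rate_mbps(memory_mb: float,
                       seconds_to_overload: float,
                       dirty_rate_mbps: float) -> float:
    """Smallest copy rate (megabits/s) that finishes cloning before the
    predicted overload: MemorySize / t_clone + r_dirty."""
    if seconds_to_overload <= 0:
        raise ValueError("no time left before the predicted overload")
    memory_megabits = memory_mb * 8  # megabytes -> megabits
    return memory_megabits / seconds_to_overload + dirty_rate_mbps


# Illustrative values: 1000 MB of VM memory, overload predicted in 100 s,
# and memory being dirtied at 20 Mbps while the copy is in progress.
rate = min_copy_rate_mbps(memory_mb=1000,
                          seconds_to_overload=100,
                          dirty_rate_mbps=20)
print(rate)
```

A longer lead time lowers the required rate, which is how an earlier prediction translates into less interference with the running VM.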
Performance scale-up comparison
Immediate performance scale-up
[Figure: timeline from new VM request to application start; annotated durations: 2.5 minutes and 5 seconds.]
Wavelet-based medium-term prediction
[Figure: the original signal is decomposed into detail signals at scales 1 to 4 and an approximation signal at scale 4, which are then synthesized.]
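The decompose/synthesize steps can be sketched with a four-scale decomposition. The slide does not name the wavelet, so this example assumes an unnormalized Haar transform (pairwise averages and differences); the function names and the sample CPU trace are illustrative. A predictor could forecast each component separately before synthesis; that step is omitted here.

```python
def haar_decompose(signal, levels):
    """Return (approximation, [detail_scale_1, ..., detail_scale_n])."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be a multiple of 2**levels")
    approx = list(signal)
    details = []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / 2 for a, b in pairs])  # local differences
        approx = [(a + b) / 2 for a, b in pairs]         # local averages
    return approx, details


def haar_synthesize(approx, details):
    """Invert haar_decompose, starting from the coarsest scale."""
    signal = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for s, d in zip(signal, detail):
            rebuilt.extend([s + d, s - d])
        signal = rebuilt
    return signal


# Illustrative CPU-usage samples (16 points allow 4 scales).
cpu = [10, 12, 30, 34, 50, 46, 20, 16, 10, 14, 32, 36, 52, 44, 18, 14]
approx, details = haar_decompose(cpu, levels=4)
restored = haar_synthesize(approx, details)
print([len(d) for d in details], approx)
```

Each scale halves the resolution, so coarser scales capture slower trends in the signal.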
Online resource pressure modeling
§ Mapping function between:
  – Resource pressure
  – SLO violation rate
[Figure: observations of resource pressure against SLO violation rate.]
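One simple way to build such a mapping online is to bucket the observations by resource pressure and track the mean SLO violation rate per bucket. This is a sketch under assumptions: the slide does not specify the fitting method, and the class name `PressureModel`, the 10-point buckets, and the sample observations are all hypothetical.

```python
from collections import defaultdict


class PressureModel:
    """Hypothetical online model: mean SLO violation rate per pressure bucket.
    Pressure is given in percent (0-100)."""

    def __init__(self, bucket_width=10):
        self.bucket_width = bucket_width
        self._sums = defaultdict(float)
        self._counts = defaultdict(int)

    def observe(self, pressure, violation_rate):
        bucket = int(pressure // self.bucket_width)
        self._sums[bucket] += violation_rate
        self._counts[bucket] += 1

    def violation_rate(self, pressure):
        """Mean observed violation rate for this pressure, or None if unseen."""
        bucket = int(pressure // self.bucket_width)
        count = self._counts.get(bucket, 0)
        return self._sums[bucket] / count if count else None

    def max_safe_pressure(self, target_rate):
        """Upper edge of the highest bucket, scanning upward, whose mean
        violation rate stays within target_rate; None if no bucket does."""
        safe = None
        for bucket in sorted(self._counts):
            if self._sums[bucket] / self._counts[bucket] > target_rate:
                break
            safe = (bucket + 1) * self.bucket_width
        return safe


model = PressureModel()
# Made-up observations: (resource pressure %, SLO violation rate).
for pressure, rate in [(55, 0.00), (62, 0.01), (68, 0.02), (74, 0.03),
                       (78, 0.05), (83, 0.12), (88, 0.20), (95, 0.45)]:
    model.observe(pressure, rate)

safe_pressure = model.max_safe_pressure(target_rate=0.05)
print(safe_pressure)
```

Because the model only stores running sums and counts, it can be updated as new observations arrive without per-application offline tuning.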
Optimizations for AGILE cloning
§ Post-cloning auto-configuration
  – Event driven auto-configuration
  – Application VMs can subscribe to critical events
§ False alarm handling
  – Continuously check predicted overload state
  – Cancel cloning triggered by the false alarm
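The false alarm handling described above can be sketched as a per-interval check on an in-progress clone. This is a minimal illustration, not AGILE's implementation: the `CloneJob` type, its state names, and the `step_clone` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CloneJob:
    vm_name: str
    state: str = "copying"  # one of "copying", "done", "cancelled"


def step_clone(job: CloneJob,
               overload_still_predicted: bool,
               copy_finished: bool) -> str:
    """Advance a clone job by one monitoring interval."""
    if job.state != "copying":
        return job.state  # finished or cancelled jobs do not change
    if not overload_still_predicted:
        job.state = "cancelled"  # the earlier alert was a false alarm
    elif copy_finished:
        job.state = "done"
    return job.state


# The overload prediction holds for two intervals, then disappears.
job = CloneJob("web-clone-1")
history = [step_clone(job, predicted, copy_finished=False)
           for predicted in (True, True, False)]
print(history)
```

Checking the prediction before the completion flag means a clone is cancelled as soon as the alert is withdrawn, which avoids paying for a VM that is no longer needed.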
Experimental evaluation
§ Implemented on top of KVM
  – Modified KVM to support pre-copy live cloning
§ Test bed:
  – 10 cloud nodes running CentOS 6.2 with KVM 0.12.1.2
§ Benchmark systems
  – RUBiS driven by four real workload traces
    • WorldCup '98, EPA, NASA, ClarkNet (one-day traces)
  – Google cluster data: 100 CPU usage and 100 memory usage traces (29 days)
Wavelet-based prediction accuracy
§ RUBiS traces
Wavelet-based prediction accuracy
§ Google CPU traces
Overload handling
§ Web server and database server scaling (≈ 2 hours, scale from 1 to 2 servers)
Overload handling
§ Web server: during scaling
Dynamic copy-rate configuration
Accurately control the cloning time under different deadlines
Conclusion
§ Prediction-driven elastic distributed resource scaling:
  – Accurate medium-term prediction based on wavelet transforms
  – Adaptive copy-rate to minimize interference
  – Application-agnostic performance model
§ Immediate performance scale-up with little overhead
Thank you! http://dance.csc.ncsu.edu
Acknowledgement
§ This work was sponsored in part by NSF grants CNS0915567 and CNS0915861, the U.S. Army Research Office (ARO) under grant W911NF-10-1-0273, and Google Research Awards.