Compute Waste Management for Operators
How to (nicely) reclaim what is yours in a Private Cloud environment
Kalin Nikolov, Cloud Engineer, PayPal Inc
Agenda
• Introduction
• Identifying unused resources
• Dealing with unused resources
• Strategies for eliminating compute waste
• Tools used at eBay/PayPal
• CloudMinion
• Summary
Introduction
• Compute waste and capacity reclamation in a self-service cloud
• The challenges in capacity reclamation
  - Identifying the unused resources
  - How to deal with the unused resources
• How eBay and PayPal reduced compute waste
The scale of eBay/PayPal OpenStack deployments
• 100% of PayPal web/mid tier
• Most of Dev/QA
• Number of hypervisors (HVs): 8,500
• Number of virtual machines: 70,000
• Number of users: several thousand
• Availability zones: 10
What causes Compute Waste
What creates compute waste in a private cloud?
• Users who want to try the cloud service
  - They create VMs just for the experience, with no intention of using them
• VMs left by ex-employees
• Admins running tests after failures
• Temporary resources that were not cleaned up at the end
• Over-planning
  - Resources that were initially planned for use but end up unused
• VMs with errors
Identifying Unused Resources
• Tools/agents installed on the VMs
  - Third-party tools
• Tools installed on the HVs
  - Ceilometer
  - Custom-built tools for data collection
Subsystems/Metrics for Consideration
Subsystems/Metrics for identifying unused resources
There is no ideal metric
• CPU
  - CPU usage can be affected by different factors: CPU clock, disk IO, etc.
• Memory
  - Dynamic memory issues: KVM keeps the memory once it is allocated
• Disk IO
• Network
  - Susceptible to network noise: DNS, NTP, LDAP, ping, etc.
Network Traffic
To identify unused resources, we decided to rely primarily on network traffic
Why network traffic?
• We can eliminate some of the network noise by using statistical measurements
• We can look at egress and ingress traffic separately and draw additional conclusions
For example: if egress traffic is high but ingress is zero or close to zero, that indicates a problem with the VM. The VM might be trying to get an IP address.
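The egress/ingress example can be sketched in a few lines of Python. This is only an illustration: the function name `traffic_hint`, both thresholds, and the use of the median as the noise-damping statistic are assumptions, not values or methods stated in the talk.

```python
from statistics import median

# Hypothetical thresholds; the talk does not give concrete values.
NEAR_ZERO_BYTES = 1_024        # at or below this counts as "close to zero"
HIGH_EGRESS_BYTES = 1_000_000  # at or above this counts as "high"


def traffic_hint(ingress_per_day, egress_per_day):
    """Return a coarse label for a VM from per-day byte counts.

    The median is used (an assumption) so that a single noisy day
    does not dominate the decision.
    """
    ingress = median(ingress_per_day)
    egress = median(egress_per_day)
    if egress >= HIGH_EGRESS_BYTES and ingress <= NEAR_ZERO_BYTES:
        # High egress with no ingress, e.g. a VM that keeps asking for an IP.
        return "possible problem"
    if egress <= NEAR_ZERO_BYTES and ingress <= NEAR_ZERO_BYTES:
        return "quiet"
    return "active"


print(traffic_hint([0, 0, 512], [2_000_000, 3_000_000, 2_500_000]))
```

Looking at the two directions separately is what makes the "possible problem" case visible; a single combined traffic figure would make that VM look merely busy.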
How to deal with the unused resources
Chargeback / Showback
• Make departments responsible for their usage
• Show departments the cost of their usage
Smart reclamation / Ask nicely
• Identify unused resources
• Notify the user (and his/her manager)
• Delete if the user takes no action
At eBay/PayPal, we decided to use smart reclamation and developed in-house tools for it.
Tools used at eBay and PayPal for Reclamation
CloudMinion
• Used for identifying unused resources on the self-service Dev/QA cloud
• Started as a POC
• Resources reclaimed by CloudMinion: more than $3M in savings
Reclamation Statistics
Self-Service Dev/QA Cloud
• VMs identified as unused: 42%
  - Unused VMs deleted by the automation tool: 75%
  - Unused VMs kept by users: 25%
• VMs whose life was extended by users
  - 1 year: 62%
  - 3 months: 27%
  - 1 month: 11%
Reclamation Flow
Handling unused VMs in the eBay/PayPal Dev/QA cloud
• A VM is detected as unused.
• An expiration date is set: 14 days from the day of detection.
• An email is sent to the user with a link where the user can change the expiration date.
• If the user sets the expiration date to “Never Expires”, no further action is taken.
• If the user sets a new expiration date or takes no action:
  - The VM is shut down on the expiration day and deleted 7 days later.
  - Reminders are sent 2 days before the shutdown, on the day of the shutdown, 2 days before the deletion, and on the day of the deletion.
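The timeline in this flow can be worked out with simple date arithmetic. The sketch below is illustrative only: `reclamation_schedule` and the event labels are hypothetical names, it reads "deleted in 7 days" as 7 days after the shutdown, and it does not model the “Never Expires” case.

```python
from datetime import date, timedelta

GRACE_DAYS = 14         # detection -> expiration (from the slide)
DELETE_AFTER_DAYS = 7   # shutdown -> deletion (assumed reading of the slide)
REMINDER_LEAD_DAYS = 2  # reminder lead time (from the slide)


def reclamation_schedule(detected_on, expires_on=None):
    """Build the (date, event) pairs for one unused VM.

    expires_on is the date chosen by the user, if any. None means the
    user took no action and the default 14-day expiration applies.
    """
    shutdown = expires_on or detected_on + timedelta(days=GRACE_DAYS)
    delete = shutdown + timedelta(days=DELETE_AFTER_DAYS)
    lead = timedelta(days=REMINDER_LEAD_DAYS)
    return [
        (shutdown - lead, "reminder: shutdown in 2 days"),
        (shutdown, "shutdown and reminder"),
        (delete - lead, "reminder: deletion in 2 days"),
        (delete, "deletion and reminder"),
    ]


for day, event in reclamation_schedule(date(2015, 5, 1)):
    print(day.isoformat(), event)
```

For a VM detected on 2015-05-01 with no user action, this gives a shutdown on 2015-05-15 and a deletion on 2015-05-22, with reminders on 2015-05-13 and 2015-05-20.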
User Feedback
Surprisingly, we received fewer complaints than we anticipated
• Less than 1% of the users replied
The complaints mostly came from users who:
• Failed to read the email notifications
• Hate receiving notification emails
• Or just asked why their VMs were identified as unused
Why did the users not complain?
• We gave them the choice of whether to keep their VMs
CloudMinion
Description
CloudMinion is a set of tools for identifying unused OpenStack VMs, setting and managing expiration dates, shutting down and deleting expired VMs, sending reminders, and generating reports.
It also provides a web UI that allows users to manage the expiration dates of their VMs.
CloudMinion: Components
Client Tools
• cm_agent.pl - Determines whether a VM is unused based on pre-defined rules and sends the data to the CM API.
• cm_sa.pl - Collects data from various subsystems (e.g. net, cpu)
• cm_sar.pl - Processes the collected data and generates reports for all VMs
CM Server
• cm_api.cgi - CM API service
• cm_manager.pl - Syncs instance/user info from the OpenStack DB, sets expiration dates, sends notifications, shuts down/deletes VMs.
• vmem.cgi - VM expiration management tool. Users can change the expiration dates of their VMs and/or delete VMs.
CloudMinion: Block Diagram
[Block diagram. Labeled components: CM Agent, CM API, VMEM UI, CM Manager, CM DB, Emailer, OpenStack DB, Nova stop/delete. Grouping labels: Hypervisor, CM Server.]
CloudMinion Rules
Rules for identifying unused resources at eBay/PayPal
• Network traffic is used for all rules
• A VM is identified as unused if:
  1. Network traffic stays below X MB/day for 14 consecutive days AND
  2. The standard deviation is less than X KB OR
  3. Ingress network traffic = 0 B
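A minimal sketch of how these rules might be evaluated, assuming per-day byte counts are already collected. It is not CloudMinion's actual code (the real agent is the Perl script cm_agent.pl): the function name, parameters, and threshold values are hypothetical, the grouping is assumed to be (rule 1 AND rule 2) OR rule 3, rule 3 is assumed to cover the same 14-day window, and the population standard deviation is an arbitrary choice.

```python
from statistics import pstdev

WINDOW_DAYS = 14  # from the slide


def is_unused(daily_total_bytes, daily_ingress_bytes,
              max_bytes_per_day, max_stddev_bytes):
    """Apply the three rules to the most recent 14 days of samples.

    Assumed grouping: (rule 1 AND rule 2) OR rule 3.
    """
    if min(len(daily_total_bytes), len(daily_ingress_bytes)) < WINDOW_DAYS:
        return False  # not enough history to decide
    total = daily_total_bytes[-WINDOW_DAYS:]
    ingress = daily_ingress_bytes[-WINDOW_DAYS:]
    rule1 = all(b < max_bytes_per_day for b in total)  # low traffic every day
    rule2 = pstdev(total) < max_stddev_bytes           # and little variation
    rule3 = sum(ingress) == 0                          # no ingress at all
    return (rule1 and rule2) or rule3


# Hypothetical thresholds standing in for the slide's "X MB/day" and "X KB".
print(is_unused([1_000] * 14, [500] * 14, 1_000_000, 10_000))
```

Requiring a low standard deviation alongside the low daily total is what separates steady background noise (DNS, NTP, ping) from a VM that is lightly but genuinely used in bursts.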
CloudMinion Integration with other tools
CloudMinion’s System Activity Reporting tool can be integrated with other system activity collection tools:
- Ceilometer
- Sysstat sa/sar
- Any other custom-written tools for collecting data
Desired integration with OpenStack
• Integration with the native OpenStack Dashboard
• Integration with native OpenStack Telemetry
Call for Help
• CloudMinion is currently a POC
• Needs to be rewritten
• Needs integration with OpenStack
Summary
• Capacity Reclamation in Self-Service Cloud can be challenging but rewarding
• Smart reclamation has proven to be effective
• CloudMinion helped reduce the unused resources on the Self-Service Cloud
• Available at https://github.com/paypal/cloudminion