Compute Waste Management for Operators


Upload: knikolov

Post on 08-Apr-2017


Page 1: Compute Waste Management for Operators

Compute Waste Management for Operators

How to (nicely) reclaim what is yours in a Private Cloud environment

Kalin Nikolov
Cloud Engineer
PayPal Inc

Page 2: Compute Waste Management for Operators

Agenda

• Introduction
• Identifying unused resources
• Dealing with unused resources
• Strategies for eliminating compute waste
• Tools used at eBay/PayPal
• CloudMinion
• Summary

Page 3: Compute Waste Management for Operators

Introduction

• Compute waste and Capacity Reclamation in Self-Service Cloud

• The challenges in Capacity Reclamation
  - Identifying the unused resources
  - How to deal with the unused resources

• How eBay and PayPal reduced compute waste

Page 4: Compute Waste Management for Operators

The scale of eBay/PayPal OpenStack deployments

• 100% of PayPal web/mid tier
• Most of Dev/QA
• Number of hypervisors (HVs): 8,500
• Number of Virtual Machines: 70,000
• Number of users: several thousand
• Availability zones: 10

Page 5: Compute Waste Management for Operators

What causes Compute Waste

What is creating Compute Waste in Private Cloud?

• Users who want to try the cloud service
  - They create VMs just for the experience, with no intention of using them
• VMs left behind by ex-employees
• Admins running tests after failures
• Temporary resources that were not cleaned up at the end
• Over-planning
  - Resources initially planned to be used
• VMs with errors

Page 6: Compute Waste Management for Operators

Identifying Unused Resources

• Tools/Agents installed on the VMs
  - Third-party tools
• Tools installed on the HVs
  - Ceilometer
  - Custom-built tools for data collection

Page 7: Compute Waste Management for Operators

Subsystems/Metrics for Consideration

Subsystems/Metrics for identifying unused resources

There is no ideal metric

• CPU
  - CPU usage can be affected by different factors: CPU clock, disk IO, etc.
• Memory
  - Dynamic memory issues: KVM keeps the memory once it is allocated
• Disk IO
• Network
  - Susceptible to network noise: DNS, NTP, LDAP, ping, etc.

Page 8: Compute Waste Management for Operators

Network Traffic

To identify unused resources, we decided to rely primarily on Network Traffic

Why Network Traffic?
• We can eliminate some of the network noise by using statistical measurements
• We can look at egress and ingress traffic separately and draw additional conclusions

For example, if egress traffic is high but ingress traffic is zero or close to zero, that indicates a problem with the VM: the VM might be trying to get an IP address
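The egress/ingress comparison described above could be sketched roughly as follows. This is an illustration only, not CloudMinion code: the `DailyTraffic` type, the `looks_stuck` function, and both thresholds are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class DailyTraffic:
    """One day of per-VM traffic counters, in bytes (hypothetical structure)."""
    ingress_bytes: int
    egress_bytes: int


def looks_stuck(day: DailyTraffic,
                egress_floor: int = 1_000_000,
                ingress_ceiling: int = 1_000) -> bool:
    """Flag a VM that sends a lot but receives (almost) nothing.

    Both thresholds are assumed values for illustration; the slides do not
    give concrete numbers.
    """
    return (day.egress_bytes >= egress_floor
            and day.ingress_bytes <= ingress_ceiling)


if __name__ == "__main__":
    print(looks_stuck(DailyTraffic(ingress_bytes=0, egress_bytes=5_000_000)))
    print(looks_stuck(DailyTraffic(ingress_bytes=2_000_000, egress_bytes=5_000_000)))
```

Keeping the two directions as separate counters is what makes this check possible; a single combined traffic total would hide the asymmetry.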


Page 9: Compute Waste Management for Operators

How to deal with the unused resources

Chargeback / Showback
• Make departments responsible for their usage
• Show the departments the cost of their usage

Smart reclamation / Ask nicely
• Identify unused resources
• Notify the user (and his/her manager)
• Delete if the user takes no action

At eBay/PayPal, we decided to use Smart reclamation by developing in-house tools.

Page 10: Compute Waste Management for Operators

Tools used at eBay and PayPal for Reclamation

CloudMinion
• Used for identifying unused resources on the Self-Service Dev/QA cloud
• Started as a proof of concept (POC)
• Resources reclaimed by CloudMinion: more than $3M in savings

Page 11: Compute Waste Management for Operators

Reclamation Statistics

Self-Service Dev/QA Cloud

• VMs identified as unused: 42%
  - Unused VMs deleted by automation tool: 75%
  - Unused VMs kept by users: 25%
• VMs with life extended by users
  - 1 year: 62%
  - 3 months: 27%
  - 1 month: 11%
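Assuming the 75%/25% split is a share of the VMs identified as unused (the slide does not state the base explicitly), the figures can be combined into shares of all Self-Service Dev/QA VMs. A small arithmetic sketch, with variable names invented for the example:

```python
# Percentages taken from the slide; treating 75% and 25% as shares of the
# unused VMs (rather than of all VMs) is an assumption.
unused_pct = 42
deleted_of_unused_pct = 75
kept_of_unused_pct = 25

# Convert "share of unused" into "share of all VMs".
deleted_of_all_pct = unused_pct * deleted_of_unused_pct / 100
kept_of_all_pct = unused_pct * kept_of_unused_pct / 100

print(f"Deleted by automation: {deleted_of_all_pct}% of all VMs")
print(f"Flagged but kept by users: {kept_of_all_pct}% of all VMs")
```

Under that reading, roughly a third of the VMs on this cloud were reclaimed by the automation.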


Page 12: Compute Waste Management for Operators

Reclamation Flow

Handling unused VMs at eBay/PayPal Dev/QA Cloud
• A VM is detected as unused.
• An expiration date is set: 14 days from the day of detection.
• An email is sent to the user with a link where the user can change the expiration date.

• If the user sets the expiration date to “Never Expires”, no further action is taken.

• If the user sets a new expiration date or takes no action:
  - The VM is shut down on the expiration day and deleted 7 days later.
  - Reminders are sent: 2 days before shutdown, on the day the VM is shut down, 2 days before deletion, and on the day the VM is deleted.
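The default timeline in this flow (no action taken by the user) can be worked out with simple date arithmetic. The sketch below is illustrative rather than CloudMinion's actual logic; the function name and dictionary keys are hypothetical, and it assumes "deleted in 7 days" means 7 days after the shutdown date.

```python
from datetime import date, timedelta


def reclamation_schedule(detected_on: date,
                         grace_days: int = 14,
                         delete_after_days: int = 7,
                         reminder_lead_days: int = 2) -> dict:
    """Return the key dates for a VM detected as unused on `detected_on`.

    Defaults follow the numbers on the slide. Counting the 7 days from the
    shutdown date is an assumption.
    """
    shutdown_on = detected_on + timedelta(days=grace_days)
    delete_on = shutdown_on + timedelta(days=delete_after_days)
    return {
        "shutdown_reminder": shutdown_on - timedelta(days=reminder_lead_days),
        "shutdown": shutdown_on,
        "delete_reminder": delete_on - timedelta(days=reminder_lead_days),
        "delete": delete_on,
    }


if __name__ == "__main__":
    for event, day in reclamation_schedule(date(2017, 4, 1)).items():
        print(f"{event}: {day.isoformat()}")
```

If the user picks a new expiration date, the same calculation would apply with that date in place of the default shutdown date.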


Page 13: Compute Waste Management for Operators

User Feedback

Surprisingly, we received fewer complaints than we anticipated

• Less than 1% of the users replied

The complaints were mostly from users who:
• Failed to read the email notifications
• Hate receiving notification emails
• Just wanted to ask why their VMs were identified as unused

Why did the users not complain?
• We gave them the choice of whether to keep their VMs

Page 14: Compute Waste Management for Operators

CloudMinion

Description
CloudMinion is a set of tools for identifying unused OpenStack VMs, setting and managing expiration dates, shutting down and deleting expired VMs, sending reminders, and generating reports.

It also provides a web UI that allows users to manage the expiration dates of their VMs.

Page 15: Compute Waste Management for Operators

CloudMinion: Components

Client Tools
• cm_agent.pl - Determines whether a VM is unused based on pre-defined rules and sends the data to the CM API.
• cm_sa.pl - Collects data from various subsystems (e.g. net, cpu)
• cm_sar.pl - Processes the collected data and generates reports for all VMs

CM Server
• cm_api.cgi - CM API service
• cm_manager.pl - Syncs instance/user info from the OpenStack DB, sets expiration dates, sends notifications, shuts down/deletes VMs.
• vmem.cgi - VM expiration management tool. Users can change the expiration dates of their VMs and/or delete VMs.

Page 16: Compute Waste Management for Operators

CloudMinion: Block Diagram

[Figure: block diagram with the following labeled elements: CM Agent, CM API, VMEM UI, CM Manager, CM DB, OpenStack DB, Nova stop/delete, Emailer, grouped into Hypervisor and CM Server.]

Page 17: Compute Waste Management for Operators

CloudMinion Rules

Rules for identifying unused resources at eBay/PayPal

• Network Traffic is used for all rules

• A VM is identified as unused if:
  1. Network traffic stays below X MB/day for 14 consecutive days
     AND
  2. Standard deviation is less than X KB
     OR
  3. Ingress network traffic = 0 B
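One possible reading of these rules is sketched below. It is not taken from cm_agent.pl: the function and parameter names are hypothetical, the thresholds stand in for the unspecified X values, the rules are grouped as (1 AND 2) OR 3, and the standard deviation is assumed to be computed over the daily totals in the 14-day window.

```python
from statistics import pstdev


def is_unused(daily_total_bytes: list,
              daily_ingress_bytes: list,
              max_bytes_per_day: float,
              max_stddev_bytes: float,
              window_days: int = 14) -> bool:
    """Apply the three rules to the most recent `window_days` of samples.

    Grouping as (rule 1 AND rule 2) OR rule 3 is an assumption; the slide
    does not spell out precedence.
    """
    if len(daily_total_bytes) < window_days or len(daily_ingress_bytes) < window_days:
        return False  # not enough history to decide

    totals = daily_total_bytes[-window_days:]
    ingress = daily_ingress_bytes[-window_days:]

    low_traffic = all(t < max_bytes_per_day for t in totals)  # rule 1
    steady = pstdev(totals) < max_stddev_bytes                # rule 2
    no_ingress = sum(ingress) == 0                            # rule 3

    return (low_traffic and steady) or no_ingress


if __name__ == "__main__":
    idle_totals = [1_000] * 14
    idle_ingress = [500] * 14
    print(is_unused(idle_totals, idle_ingress,
                    max_bytes_per_day=1_000_000, max_stddev_bytes=100))
```

The standard deviation check is what filters steady background noise (DNS, NTP and similar) from low but genuinely variable usage.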


Page 18: Compute Waste Management for Operators

CloudMinion Integration with other tools

CloudMinion’s System Activity Reporting tool can be integrated with other system activity collection tools
- Ceilometer
- Sysstat sa/sar
- Any other custom-written tools for collecting data

Page 19: Compute Waste Management for Operators

Desired integration with OpenStack

• Integration with native OpenStack Dashboard
• Integration with native OpenStack Telemetry

Page 20: Compute Waste Management for Operators

Call for Help

• CloudMinion is currently a POC
• Needs to be rewritten
• Needs integration with OpenStack

Page 21: Compute Waste Management for Operators

Summary

• Capacity Reclamation in Self-Service Cloud can be challenging but rewarding
• Smart reclamation has proven to be effective
• CloudMinion helped reduce the unused resources on the Self-Service Cloud
• Available at https://github.com/paypal/cloudminion