adam grummitt - capacity management: guided practitioner satnav

CapMan GPS CMG Brazil 2011

# 1 of 30 CapMan GPS

Capacity Management: Guided Practitioner Satnav

A General PostScript to Capacity Management: A Practitioner Guide, ISBN 9789087535193, published by Van Haren (books.google.com)

adam@ grummitt.com


# 2 of 30 CapMan GPS - Summary

1. Where am I? Baseline, Gap analysis, perception and reality
2. Where do I want and need to get to? Defined business objectives, real infrastructure
3. How do I get there? Fastest, shortest, cheapest, safest
4. What has to get there? All? Most expensive? Lightest?
5. Who do I need to travel with? Evangelist, Champion, Architects, Planners
6. What else has to happen at the same time? SLAs, Availability, Continuity, Demand Management
7. When will I get there? Short, medium and long term
8. Why should I go there? - Conclusion

With acknowledgement to Paul Wilkinson for his ABC cartoons


# 3 of 30 1. Where am I? - GPS


# 4 of 30 CSI route map

Where are we now? Baseline of current service levels

What do we want? Business vision, mission, goals

What do we need? External and internal drivers

What can we afford? Business budgets, IT specs

What will we get? Business budgets, IT specs

What did we actually get? Delivery & perception of service

Deliver Service

Does it meet wants/needs? Delivery & perception of service


# 5 of 30 2. Where do I want to go?


# 6 of 30 Gap analysis - Kiviat

[Kiviat (radar) chart comparing Now and Next on a 0 to 4 scale. Axes: Monitors, Baselines, Bottlenecks, Patterns, Thresholds, Alarms, Demands, Workload Forecasts, Service drivers, Resource usage, Capacity plans, SLA targets, Application sizes, Testing results, CMDB changes, Costs]


# 7 of 30

Business

Activity Drivers Performance BPI

Service SLA targets

SLA constraints

Component/resource Groups of top metrics

eg CPU utilization eg I/O

eg RAM Special metrics

Measures per App Number of users Reports produced

Time for report per location Frequency of reports

Number of capacity related incidents Response time

Report generation time Given number of reports

Number of concurrent users per generic (N/W, SAN, DBMS) & platform (mf, UNIX)

Overview relevant to domain eg LPAR (mainframe and AIX) eg Read/write activity per sec

eg Paging/swapping #locations, #users, #reports, time per report

Prod Acc Test Dev

Normal Failover

DR

Metron Metrics Matrix eg: Reporting

Metrics Matrix: Reporting


# 8 of 30 3. How do I get there?

[Diagram labels: SS, SD, ST, SO, (C)SI, SF]


# 9 of 30 How do I get there - ABC

•  Paul Wilkinson – ABC of ICT
•  People, product, process, partners
•  Performance depends on Attitude, Behaviour, Culture


# 10 of 30 4. What has to get there?

•  All of my belongings?
•  A selection of what is most important?
•  Needed in the short term?
•  Needed in the medium term?
•  Needed in the long term?
•  Most expensive?
•  Lightest?
•  What I am allowed to take by my service provider?
•  What level of service I am prepared to pay for?
•  Private flight, 1st class, business, premium, coach, economy?
•  Contractual agreement on service level and violations
•  Demand management…


# 11 of 30 Service mapping to continuity & capman

| Service | Critical to | Capacity Headroom: allowable degradation from baseline peak | Capacity Workload: allowable change per quarter from baseline | Capacity Failover: allowable degradation % from baseline performance | Performance Allowable degradation | Continuity: mirror level, DR level, backup level |
| --- | --- | --- | --- | --- | --- | --- |
| Diamond | Mission | 25% | 400% | 25% | Highest | Highest |
| Platinum | Regulation | 50% | 200% | 50% | Higher | Higher |
| Gold | Business | 100% | 100% | 100% | High | High |
| Silver | Important | 200% | 75% | 200% | Medium | Medium |
| Bronze | Regular | 300% | 50% | 300% | Low | Low |
| Tin | Discretionary | 400% | 25% | 400% | Very low | Very low |


# 12 of 30 Possible resource extension to mapping

| Service | Critical to | MF service class priority | CPU UNIX: quota - limit | CPU Wintel VM: guarantee - cap | N/W bandwidth | RAM | Storage GB & I/O |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Diamond | Mission | highest | 16-32 | 16-32 | highest | XXL | T0 - SSD |
| Platinum | Regulation | higher | 8-16 | 8-16 | higher | XL | T0 - SSD |
| Gold | Business | high | 4-8 | 4-8 | high | L | T1 |
| Silver | Important | medium | 2-4 | 2-4 | medium | M | T2 |
| Bronze | Regular | low | 1-2 | 1-2 | low | S | T3 |
| Tin | Discretionary | lower | 1 | 1 | Very low | CC | T4 |


# 13 of 30 5. Who do I need to travel with?

•  Evangelist – technician who understands capman
•  Champion – manager who appreciates capman and has $
•  Architect/analyst (applications) – who know their systems
•  Planners (tools, domains) – who know their domains
•  Business users – who know their needs and constraints
•  Maybe a mentor for overall guidance
•  Maybe an expert to give initial appreciation workshops
•  Maybe a consultant to act as a catalyst with management
•  Maybe contractors to provide short term expertise
•  Not …


# 14 of 30 Who do I not need?

• Sysprogman: super-hero

• Boy racer who ‘installs ITIL’ in 3 months

•  ISO 20000 top level checklists

• a lean black belt

• a BPR process perfectionist

•  ITIL perfectionist: paralysis by analysis


# 15 of 30 6. What else has to happen ?

•  SLAs with respect to performance and capacity
•  Availability
•  Continuity
•  Demand Management
•  Things done for real not by rote?
•  Exception reporting leading to actions
•  Automated activities
•  Proper use of tools


# 16 of 30 SLA & Performance

•  NOT – in vacuo
   –  “Mandatory average response of 3 secs; desirable 1 sec”
   –  “Mandatory 8 secs; desirable 5 secs for 95th percentile”
•  MAYBE – predefined, objective, quantified, meaningful
   –  “for the XYZ service, between 8am and 8pm, for a normal traffic of <1000 transactions per hour, the average response time is desirably <1 sec and mandatory <2; 95% of response times should be <3 secs and must be <5 seconds”
•  NEEDS – measurable, achievable, appropriate
   –  Service catalogue/portfolio, business needs
   –  Instrumentation for traffic levels and app counters
   –  Agreements with teeth that can be monitored & policed
   –  Normal, peak and exceptional service levels.
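The MAYBE wording above is precise enough to be checked mechanically. The following is a minimal Python sketch, assuming response times (in seconds) and an hourly transaction count are already collected for the 8am to 8pm window. The name `check_sla`, the verdict strings and the nearest-rank percentile are illustrative choices, not something taken from the slides.

```python
import math

# Thresholds taken from the example SLA wording for the XYZ service.
NORMAL_MAX_TX_PER_HOUR = 1000
AVG_DESIRABLE, AVG_MANDATORY = 1.0, 2.0   # seconds
P95_DESIRABLE, P95_MANDATORY = 3.0, 5.0   # seconds


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_sla(response_times, tx_per_hour):
    """Classify one hour of measurements against the example SLA."""
    if tx_per_hour >= NORMAL_MAX_TX_PER_HOUR:
        # The agreement is only stated for normal traffic.
        return "outside normal traffic: agreement does not apply"
    avg = sum(response_times) / len(response_times)
    p95 = percentile(response_times, 95)
    if avg >= AVG_MANDATORY or p95 >= P95_MANDATORY:
        return "mandatory target missed"
    if avg >= AVG_DESIRABLE or p95 >= P95_DESIRABLE:
        return "mandatory met, desirable missed"
    return "desirable target met"


# Hypothetical sample: 90 fast responses and 10 slower ones at 800 tx/hour.
print(check_sla([0.4] * 90 + [2.5] * 10, tx_per_hour=800))  # desirable target met
```

Stating the traffic condition in the agreement is what makes the first branch possible: without it, a breach under excessive load cannot be told apart from a breach at normal load.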


# 17 of 30 SLA outcomes

[Chart: SLA outcomes. One axis is the performance metric, e.g. response time, with zones Worst, OK and Best separated by the Mandatory and Desirable targets. The other axis is the workload metric, e.g. transaction arrival rate, running from Light to Excessive with Normal maximum and Peak maximum marked. Variants: Std/DR/DM.]

Region annotations on the chart:
•  Agreement does not apply
•  Agreement broken at low traffic rate
•  System is probably over-configured
•  Should meet desirable target at lower traffic
•  System may be over-configured
•  Depends on precise wording of SLA
•  System is under excessive traffic pressure
•  System is under pressure anyway
•  System is performing as expected


# 18 of 30

Not 99.999% availability for all

| Availability % | Downtime pa |
| --- | --- |
| 99 | 87.6 hours |
| 99.9 | 8.8 hours |
| 99.99 | 53 mins |
| 99.999 | 5.3 mins |

Note: one 8 hour period of downtime is 93.3% for a week but 99.9% for a year.

What if ‘up’ but not for all (use potential minus actual):
•  Locations – weighted by size/staff/users
•  Users – weighted by classification
•  Transactions – weighted by significance

What if:
•  Too slow – check SLA for limit and percentile of traffic and performance
•  Lengthy recovery time for failover when failure: between cluster nodes, of a blade, of a RAID disk, of a network link…

Include period in statements:

| Outage | Max events | Period |
| --- | --- | --- |
| Up to 6 mins | 1 | week |
| 6-60 mins | 1 | month |
| 1-4 hours | 1 | quarter |
| 4-8 hours | 1 | year |

Max downtime in hours: (8*1) + (4*4) + (12*1) + (52*1*0.1) = 41.2

Availability = 0.995 or 99.5%

Availability = (agreed service time – unplanned downtime) / agreed service time (AST)
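The arithmetic on this slide can be reproduced in a few lines. This is a minimal Python sketch, assuming a 24x7 agreed service time of 8760 hours per year; the function names are illustrative. The 120-hour service week (5 x 24) used below is an inference that happens to reproduce the 93.3% figure, not something stated on the slide.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, assuming a 24x7 agreed service time


def availability(agreed_service_hours, unplanned_downtime_hours):
    """Availability = (AST - unplanned downtime) / AST."""
    return (agreed_service_hours - unplanned_downtime_hours) / agreed_service_hours


def downtime_budget_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Downtime allowed in a period at a given availability percentage."""
    return period_hours * (100 - availability_pct) / 100


# Downtime per annum for each availability target in the table.
for pct in (99, 99.9, 99.99, 99.999):
    hours = downtime_budget_hours(pct)
    print(f"{pct}% -> {hours:.2f} hours ({hours * 60:.1f} minutes)")

# One 8-hour outage seen over a week and over a year.
# Assumption: a 120-hour service week, which matches the quoted 93.3%.
print(f"week: {availability(120, 8):.1%}, year: {availability(HOURS_PER_YEAR, 8):.1%}")

# Worst case if every permitted outage runs to its maximum length:
# (maximum hours per event, events per year)
outage_allowances = [(8, 1), (4, 4), (1, 12), (0.1, 52)]
max_downtime = sum(hours * events for hours, events in outage_allowances)
print(f"max downtime {max_downtime:.1f} h -> "
      f"{availability(HOURS_PER_YEAR, max_downtime):.1%}")
```

The same 8-hour outage reads very differently depending on the measurement period, which is why the period belongs in the availability statement itself.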


# 19 of 30 Continuity – DR site sizing factors

•  Data security to reduce impact of DR:
   –  Backups made to tape/disk on site and sent off-site regularly
   –  Data replication to an off-site location so only system sync required
   –  High availability systems to keep both the data & system replicated
•  Precautionary measures:
   –  Local mirrors of systems and/or data and use of RAID
   –  Surge protectors, UPS and/or backup generator, fire prevention
   –  Antivirus, antibot software and other security measures
•  Stand-by site at:
   –  Own site with high availability
   –  Own remote facilities with SAN
   –  An outsourced disaster recovery provider
•  DR service
   –  Priority of service determines whether it is included in the DR service
   –  DR reduced performance and reduced traffic constraints as per SLA
   –  Models used to justify configuration and cost of DR site.


# 20 of 30 Demand Management

•  Control demand for resources to meet levels that the business is willing to support
•  Optimize and rationalize demand for the use of IT to achieve optimum provision
   –  One extreme of over-provisioning without regard to cost
   –  Other of under-provisioning so that there is no headroom
•  Understand and throttle/smooth peaks, if possible, in customer demand or priority
•  Control degradation of service due to peaks in demand or downtime/slowtime
•  Use budgets/priorities/chargeback/quotas for workloads and new services
•  Use ‘levels of critical’ categorization for workloads (gold/silver/bronze)
•  Plans for when business requirements cannot be fulfilled due to:
   –  HW or SW failure
   –  Unexpected budgetary constraints/demand increase
•  Decisions based on problems being short term or long term?
   –  Short-term: only mission critical services supported
   –  Long-term: management of resource constraints
•  Need to identify the critical services and the resources they use
   –  Business plans, Service catalogue, Change requests, SIPs
   –  Service priorities and their mapping to resources


# 21 of 30 7. When will I get there?

•  SatNav gives a typical answer in hours and minutes
•  Detailed time depends on route selected and options taken
•  Answer based on accumulated experience of many journeys
•  CapMan gives an answer typically in short/medium/long term
•  Detailed time depends dominantly on many local factors


# 22 of 30 Short term Improvements (Wintel VI)

Assets ESM

Metrics Reports

Enhance attributes in registers

Standardise contents for action

Add resource pool

Add profiles for levels of priority

Extended KPIs and trends

Monitoring to assess VM growth

Consolidate similar, retire moribund

Extra reports (day, week, month)

Improve liaison - ESM and ITSM

Event, infrastructure & app teams

Better exploit resource information

ESM data already present

Add extra VM metrics for tiers

Add extra KPI metrics

- CPU utilisation/server etc

Add selected extra reports


# 23 of 30

Physical rating

Determine S (SpecInt rating of the physical server)

Capture U (peak % utilisation of the physical server)

Calculate N (normalised power rating of the physical server): N = S * (U/100)

e.g. server HP ProLiant DL580 has SpecInt of 40, so S = 40. Captured peak utilisation of 15%, so U = 15.

Rating of N = 40 * (15/100) = 6
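The normalisation step is a single multiplication. A minimal Python sketch follows; the function name and the range check are illustrative additions, not part of the slide.

```python
def normalised_rating(specint_rating, peak_utilisation_pct):
    """N = S * (U / 100): the part of the server's SpecInt rating in use at peak."""
    if not 0 <= peak_utilisation_pct <= 100:
        raise ValueError("peak utilisation must be a percentage between 0 and 100")
    return specint_rating * peak_utilisation_pct / 100


# Slide example: S = 40, U = 15
print(normalised_rating(40, 15))  # 6.0
```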


# 24 of 30

Virtual rating

Determine H (SpecInt rating of the host server)

Estimate C (consolidation ratio, e.g. 20:1)

Calculate values for tiers such as:

Bronze = H / C

Silver = Bronze * 2

Gold = Bronze * 4

Platinum = uncapped

e.g. SpecInt of VI server (HP Integrity rx8640) H = 200. Estimated target consolidation ratio C = 20:1.

Bronze limit = 200/20 = 10, so box needs bronze service
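The tier limits follow directly from H and C. The Python sketch below computes them and then picks a tier for a workload with a known normalised rating N. The selection rule (lowest tier whose limit covers N) is an assumption generalised from the single example, where a box rated 6 fits under the Bronze limit of 10. Function names are illustrative.

```python
import math


def tier_limits(host_specint, consolidation_ratio):
    """Per-VM capacity limits for each tier, derived from H and C."""
    bronze = host_specint / consolidation_ratio
    return {
        "Bronze": bronze,
        "Silver": bronze * 2,
        "Gold": bronze * 4,
        "Platinum": math.inf,  # uncapped
    }


def smallest_sufficient_tier(normalised_rating, limits):
    """Pick the lowest tier whose limit covers the workload's rating N.

    Assumption: the slide only shows N = 6 fitting under a Bronze limit
    of 10; 'lowest tier with limit >= N' is a generalisation of that.
    """
    for tier, limit in sorted(limits.items(), key=lambda item: item[1]):
        if normalised_rating <= limit:
            return tier
    raise ValueError("no tier can hold this workload")


limits = tier_limits(host_specint=200, consolidation_ratio=20)
print(limits["Bronze"])                     # 10.0
print(smallest_sufficient_tier(6, limits))  # Bronze
```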


# 25 of 30 Medium term Improvements

SPM Proactive

Services Portal

Fill vacant positions

Select CapMan activities

Establish processes

Formalised reporting vehicle

Regular and exception reports

Available to all relevant parties

Reactive reporting to proactive

Analysis of trends & pathology

Identify rogues and flatlines

Add business liaison

SLAs and performance

More use of Availability data


# 26 of 30 HPOV Brocade

HP Performance

tool

Logica

Availability

CDB/CMIS

Trend Reports subset of key metrics trended 30/60/90 days

with thresholds set

Daily Performance focus on key metrics across entire estate

regular, on web

Capacity exceptions refined metrics

critical thresholds alarms as relevant

The Capacity Portal

Multiple data sources


# 27 of 30 Longer term Improvements

CDB/CMIS & CMDB/CMS Demand management

Utility chargeback Capacity plan

Capacity management db

Configuration management db

For infrastructure upgrades

For anticipated project demands

Characterise new workloads

Consolidate/retire more apps

Analysis of actual usage

Financial control of upgrades


# 28 of 30 Business Forecasts

Plans

Component HPOV nWorks & perf

Brocade Logica

Service Availability

SLM

CDB/CMIS

Component Current utilisation

Forecasts and changes Improvement options

Costs vs benefits Options modelled

Service Response times now

Track changes Slow time

Utilisation trends

Business Forecasts

Drivers Further VI req’s

KPI updates (CO2?) Data centre space?

Monitoring, Analysis, Tuning, Demand, Sizing, Modelling

The Capacity Plan

Recommendations


# 29 of 30 Procedures and work instructions

TOR Description

Ownership: A clear definition of who will own the process (and by definition sponsor the project) and ultimately manage the process and its day-to-day activities.

Objectives: Prior to implementing the process it is essential to define the overall objectives of what the process is going to achieve. It is common that these objectives are quite high level, but they could initially be:
•  Establish component level monitoring for all applications, with an initial focus on the “Top 5” metrics and all supported platforms.
•  Establish service based metrics for at least one application. This should include an end-to-end response time and the addition of relevant service metrics within the relevant SLA.
Some of these objectives could be used as process KPIs if clearly defined.

Definitions: The key elements that require definition are:
•  CPM sub-processes, although initially these should be component and service, with business Capacity Management being an aspiration at this point
•  A clear definition of the current responsibilities. These should be considered more operational than process specific, i.e. who will be doing which activities and providing what data.
•  A list of deliverables can be provided here or via a link to the “Information Flow Diagram”

Scope: A key requirement for a process definition is a clear definition of scope. It is recommended, given the variety of data sources, that the scope be limited at first. As the monitoring and structure become more mature, this should be gradually increased until it covers more applications and infrastructure.

Benefits: These will obviously vary between businesses, but could ultimately include:
•  The Capacity Management process will ensure that a proactive approach is taken. This change of approach will ensure higher availability of critical business services due to a reduction of capacity related outages.
•  Deferred expenditure, through a reduction in the amount of excess capacity.
•  Reduced risk for existing applications as system resources are managed more effectively.

Process Description

Process Flow Diagram: A high level graphical representation of how the various elements of the process relate and support one another.

Process Flow Descriptions: An overview of each Capacity Management procedure and the beginnings of the work instruction pack. These can be as detailed and broad as befits the environment but should initially be:
•  Daily threshold and trend review
•  Trending analysis and Capacity forecasting
•  Virtualization optimization
•  Workload characterisation
As the process matures these would normally be expanded to include the provision of new services, modelling, exception reporting etc.

Process Interfaces: The long term goal would be to populate this section with a complete listing of all process interfaces that includes likely inputs/outputs. As an initial stage it is recommended that the following interfaces are defined:
•  Interface to ISS for support and provision of data
•  Interface to service owners (SLM) for provision of business data
•  Interface to Configuration management for service relationship information
•  Interface to Change management

Procedure Description

Procedures: Procedural description relating to the suggested activities. Within each of the procedures the following elements should be clearly defined:
•  Step by step guide to the procedure
•  Definition of Inputs/Outputs
•  Related procedural tree
•  Responsibilities

KPI Description

Key Performance Indicators (KPIs):
•  Operational – Total no. of capacity incidents, no. of emergency changes due to capacity requirements etc
•  Process Quality – No. of anomalies in capacity outputs, % variance in any predictions
•  Process implementation – % of services covered, % physical estate in scope etc

VI Description

Top level details: Specify key configuration information regarding the highest level relating to the particular environment e.g. AIX frame, VMware cluster etc

Pool specification: If appropriate, capture any pool limits and how those relate to individual guests. This is more VMware specific, but most flavours of UNIX (including AIX) also offer the option to ring-fence resources and assign them to groups of guests.

Resource rules: Here we would capture the following information:
•  # of vCPUs
•  Amount of memory (GB)
•  Disk specification
•  # of Network interfaces (inc speed)

Management/Continuity: Specify what resource management and continuity policies are in place e.g. HACMP, DRS/HA, Fair share scheduling etc


# 30 of 30 8. Why should I go there? - Conclusion

•  CapMan when practiced well saves money
•  CapMan GPS has been applied at a number of sites
•  Most sites are not where management thinks they are
•  Most sites have people who know the real situation
•  It takes openness & technical awareness to reveal truth
•  Demand management is often minimal
•  Project management is often uber-all
•  Performance is often an after-thought
•  Next steps are often short, medium and long term
•  Usually related to liaison as much as process
•  Often related to making more use of extant tools
•  Hopefully not all reports are filed on the shelf
•  But it needs in-house believers to carry it forward…

adam@ grummitt.com
