adam grummitt - capacity management: guided practitioner satnav
CapMan GPS CMG Brazil 2011
# 1 of 30 CapMan GPS
Capacity Management: Guided Practitioner Satnav
A General PostScript to Capacity Management: A Practitioner Guide
ISBN 9789087535193, published by Van Haren
books.google.com
adam@ grummitt.com
# 2 of 30 CapMan GPS - Summary
1. Where am I? Baseline, gap analysis, perception and reality
2. Where do I want and need to get to? Defined business objectives, real infrastructure
3. How do I get there? Fastest, shortest, cheapest, safest
4. What has to get there? All? Most expensive? Lightest?
5. Who do I need to travel with? Evangelist, Champion, Architects, Planners
6. What else has to happen at the same time? SLAs, Availability, Continuity, Demand Management
7. When will I get there? Short, medium and long term
8. Why should I go there? - Conclusion
With acknowledgement to Paul Wilkinson for his ABC cartoons
# 3 of 30 1. Where am I? - GPS
# 4 of 30 CSI route map
Where are we now? Baseline of current service levels
What do we want? Business vision, mission, goals
What do we need? External and internal drivers
What can we afford? Business budgets, IT specs
What will we get? Business budgets, IT specs
What did we actually get? Delivery & perception of service
Deliver Service
Does it meet wants/needs? Delivery & perception of service
# 5 of 30 2. Where do I want to go?
# 6 of 30 Gap analysis - Kiviat
[Kiviat (radar) chart, scale 0 to 4, with two series, Now and Next. Axes: Monitors, Baselines, Bottlenecks, Patterns, Thresholds, Alarms, Demands, Workload forecasts, Service drivers, Resource usage, Capacity plans, SLA targets, Application sizes, Testing results, CMDB changes, Costs]
# 7 of 30
Business
Activity Drivers Performance BPI
Service SLA targets
SLA constraints
Component/resource Groups of top metrics
eg CPU utilization eg I/O
eg RAM Special metrics
Measures per App Number of users Reports produced
Time for report per location Frequency of reports
Number of capacity related incidents Response time
Report generation time Given number of reports
Number of concurrent users per generic (N/W, SAN, DBMS) & platform (mf, UNIX)
Overview relevant to domain eg LPAR (mainframe and AIX) eg Read/write activity per sec
eg Paging/swapping #locations, #users, #reports, time per report
Prod Acc Test Dev
Normal Failover
DR
Metron Metrics Matrix eg: Reporting
Metrics Matrix: Reporting
# 8 of 30 3. How do I get there?
[Diagram labels: SS, SD, ST, SO, (C)SI, SF]
# 9 of 30 How do I get there - ABC
• Paul Wilkinson – ABC of ICT
• People, product, process, partners
• Performance depends on Attitude, Behaviour, Culture
# 10 of 30 4. What has to get there?
• All of my belongings?
• A selection of what is most important?
• Needed in the short term?
• Needed in the medium term?
• Needed in the long term?
• Most expensive?
• Lightest?
• What I am allowed to take by my service provider?
• What level of service I am prepared to pay for?
• Private flight, 1st class, business, premium, coach, economy?
• Contractual agreement on service level and violations
• Demand management…
# 11 of 30 Service mapping to continuity & capman
Service Critical to Capacity Headroom: Allowable degradation from baseline peak
Capacity Workload: Allowable change per quarter from baseline
Capacity Failover: Allowable degradation % from baseline performance
Performance Allowable degradation
Continuity: Mirror level DR level Backup level
Diamond Mission 25% 400% 25% Highest Highest
Platinum Regulation 50% 200% 50% Higher Higher
Gold Business 100% 100% 100% High High
Silver Important 200% 75% 200% Medium Medium
Bronze Regular 300% 50% 300% Low Low
Tin Discretionary 400% 25% 400% Very low Very low
# 12 of 30 Possible resource extension to mapping
Service  | Critical to   | MF service class priority | CPU UNIX quota - limit | CPU Wintel VM guarantee - cap | N/W bandwidth | RAM | Storage GB & I/O
Diamond  | Mission       | highest                   | 16-32                  | 16-32                         | highest       | XXL | T0 - SSD
Platinum | Regulation    | higher                    | 8-16                   | 8-16                          | higher        | XL  | T0 - SSD
Gold     | Business      | high                      | 4-8                    | 4-8                           | high          | L   | T1
Silver   | Important     | medium                    | 2-4                    | 2-4                           | medium        | M   | T2
Bronze   | Regular       | low                       | 1-2                    | 1-2                           | low           | S   | T3
Tin      | Discretionary | lower                     | 1                      | 1                             | Very low      | CC  | T4
# 13 of 30 5. Who do I need to travel with?
• Evangelist – technician who understands capman
• Champion – manager who appreciates capman and has $
• Architect/analyst (applications) – who know their systems
• Planners (tools, domains) – who know their domains
• Business users – who know their needs and constraints
• Maybe a mentor for overall guidance
• Maybe an expert to give initial appreciation workshops
• Maybe a consultant to act as a catalyst with management
• Maybe contractors to provide short term expertise
• Not …
# 14 of 30 Who do I not need?
• Sysprogman: super-hero
• Boy racer who ‘installs ITIL’ in 3 months
• ISO 20000 top level checklists
• a lean black belt
• a BPR process perfectionist
• ITIL perfectionist: paralysis by analysis
# 15 of 30 6. What else has to happen?
• SLAs with respect to performance and capacity
• Availability
• Continuity
• Demand Management
• Things done for real not by rote?
• Exception reporting leading to actions
• Automated activities
• Proper use of tools
# 16 of 30 SLA & Performance
• NOT – in vacuo
  – “Mandatory ave response of 3 secs; desirable 1 sec”
  – “Mandatory 8 secs; desirable 5 secs for 95 %ile”
• MAYBE – predefined, objective, quantified, meaningful
  – “for the XYZ service, between 8am and 8pm, for a normal traffic of <1000 transactions per hour, the average response time is desirably <1 sec and mandatory <2; 95% of response times should be <3 secs and must be <5 seconds”
• NEEDS – measurable, achievable, appropriate
  – Service catalogue/portfolio, business needs
  – Instrumentation for traffic levels and app counters
  – Agreements with teeth that can be monitored & policed
  – Normal, peak and exceptional service levels.
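A quantified clause like the "MAYBE" example can be checked mechanically. The following is a minimal sketch, not the author's method: the names `SlaClause` and `assess` are hypothetical, the thresholds are copied from the XYZ example, and it assumes the caller has already restricted the samples to the 8am to 8pm window.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SlaClause:
    """Thresholds taken from the XYZ example clause (seconds, transactions/hour)."""
    normal_max_tph: float = 1000.0   # "normal traffic of <1000 transactions per hour"
    avg_desirable_s: float = 1.0     # average "desirably <1 sec"
    avg_mandatory_s: float = 2.0     # average "mandatory <2"
    p95_desirable_s: float = 3.0     # 95% "should be <3 secs"
    p95_mandatory_s: float = 5.0     # 95% "must be <5 seconds"


def fraction_below(times, limit):
    """Fraction of response times strictly below the limit."""
    return sum(1 for t in times if t < limit) / len(times)


def assess(clause, response_times_s, transactions_per_hour):
    """Classify one measurement period against the clause (hypothetical labels)."""
    if not response_times_s:
        raise ValueError("no response time samples")
    # Outside normal traffic the clause makes no promise.
    if transactions_per_hour >= clause.normal_max_tph:
        return "agreement does not apply"
    avg = mean(response_times_s)
    if (avg >= clause.avg_mandatory_s
            or fraction_below(response_times_s, clause.p95_mandatory_s) < 0.95):
        return "mandatory target missed"
    if (avg >= clause.avg_desirable_s
            or fraction_below(response_times_s, clause.p95_desirable_s) < 0.95):
        return "mandatory met, desirable missed"
    return "desirable target met"
```

The traffic check comes first because a response-time target only has meaning for the traffic level it was agreed for.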
# 17 of 30 SLA outcomes
[Diagram: grid of SLA outcomes, for Std/DR/DM.
Vertical axis: performance metric, e.g. response time, running Best, OK, Worst, with Desirable and Mandatory thresholds.
Horizontal axis: workload metric, e.g. transaction arrival rate, running Light to Excessive, with Normal maximum and Peak maximum thresholds.]
Outcome labels in the grid:
• Agreement does not apply
• Agreement broken at low traffic rate
• System is probably over-configured
• Should meet desirable target at lower traffic
• System may be over-configured
• Depends on precise wording of SLA
• System is under excessive traffic pressure
• System is under pressure anyway
• System is performing as expected
# 18 of 30
Not 99.999% availability for all

%        Downtime pa
99       87.6 hours
99.9     8.8 hours
99.99    53 mins
99.999   5.3 mins

Note: one 8 hour period of downtime is 93.3% for a week but 99.9% for a year

What if ‘up’ but not for all (use potential minus actual):
• Locations – weighted by size/staff/users
• Users – weighted by classification
• Transactions – weighted by significance

What if:
• Too slow – check SLA for limit and percentile of traffic and performance
• Lengthy recovery time for failover when failure
  – between cluster nodes
  – of a blade, of a RAID disk, of a network link…

Include period in statements

Outage         Max events in period
Up to 6 mins   1 per week
6-60 mins      1 per month
1-4 hours      1 per quarter
4-8 hours      1 per year

Max downtime in hours: (8*1) + (4*4) + (12*1) + (52*1*0.1) = 41.2
Availability = 0.995 or 99.5%

Availability = (agreed service time – unplanned downtime) / agreed service time
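The availability arithmetic on this slide can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the function names are hypothetical, a year is taken as 24 x 365 = 8760 hours, and each outage band is costed at its upper limit, as in the slide's own sum.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, assuming a 24x7 agreed service time


def availability(agreed_service_hours, unplanned_downtime_hours):
    """Availability = (agreed service time - unplanned downtime) / agreed service time."""
    if agreed_service_hours <= 0:
        raise ValueError("agreed service time must be positive")
    return (agreed_service_hours - unplanned_downtime_hours) / agreed_service_hours


def downtime_hours_per_year(availability_pct):
    """Hours of downtime per year implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def worst_case_downtime_hours(allowances):
    """Sum of (longest outage in hours) x (maximum events per year)."""
    return sum(hours * events for hours, events in allowances)


# Outage bands from the slide, each costed at its upper limit:
# 4-8 hours once a year, 1-4 hours once a quarter,
# 6-60 mins once a month, up to 6 mins (0.1 hour) once a week.
ALLOWANCES = [(8, 1), (4, 4), (1, 12), (0.1, 52)]

worst = worst_case_downtime_hours(ALLOWANCES)              # about 41.2 hours
yearly = availability(HOURS_PER_YEAR, worst)               # about 0.995
```

On the same formula, a single 8 hour outage gives 93.3% if the agreed service week is 120 hours (5 x 24), which is one reading consistent with the note on the slide; over 8760 hours it gives about 99.9%.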
# 19 of 30 Continuity – DR site sizing factors
• Data security to reduce impact of DR:
  – Backups made to tape/disk on site and sent off-site regularly
  – Data replication to an off-site location so only system sync required
  – High availability systems to keep both the data & system replicated
• Precautionary measures:
  – Local mirrors of systems and/or data and use of RAID
  – Surge protectors, UPS and/or backup generator, fire prevention
  – Antivirus, antibot software and other security measures
• Stand-by site at:
  – Own site with high availability
  – Own remote facilities with SAN
  – An outsourced disaster recovery provider
• DR service
  – Priority of service determines whether it is included in the DR service
  – DR reduced performance and reduced traffic constraints as per SLA
  – Models used to justify configuration and cost of DR site.
# 20 of 30 Demand Management
• Control demand for resources to meet levels that the business is willing to support
• Optimize and rationalize demand for the use of IT to achieve optimum provision
  – One extreme of over-provisioning without regard to cost
  – Other of under-provisioning so that there is no headroom
• Understand and throttle/smooth peaks, if possible, in customer demand or priority
• Control degradation of service due to peaks in demand or downtime/slowtime
• Use budgets/priorities/chargeback/quotas for workloads and new services
• Use ‘levels of critical’ categorization for workloads (gold/silver/bronze)
• Plans for when business requirements cannot be fulfilled due to:
  – HW or SW failure
  – Unexpected budgetary constraints/demand increase
• Decisions based on whether problems are short term or long term
  – Short-term: only mission critical services supported
  – Long-term: management of resource constraints
• Need to identify the critical services and the resources they use
  – Business plans, Service catalogue, Change requests, SIPs
  – Service priorities and their mapping to resources
# 21 of 30 7. When will I get there?
• SatNav gives a typical answer in hours and minutes
• Detailed time depends on route selected and options taken
• Answer based on accumulated experience of many journeys
• CapMan gives an answer typically in short/medium/long term
• Detailed time depends dominantly on many local factors
# 22 of 30 Short term Improvements (Wintel VI)
Assets ESM
Metrics Reports
Enhance attributes in registers
Standardise contents for action
Add resource pool
Add profiles for levels of priority
Extended KPIs and trends
Monitoring to assess VM growth
Consolidate similar, retire moribund
Extra reports (day, week, month)
Improve liaison - ESM and ITSM
Event, infrastructure & app teams
Better exploit resource information
ESM data already present
Add extra VM metrics for tiers
Add extra KPI metrics
- CPU utilisation/server etc
Add selected extra reports
# 23 of 30
Physical rating
Determine S (SpecInt rating of the physical server)
Capture U (peak % utilisation of the physical server)
Calculate N (normalised power rating of the physical server): N = S * (U/100)
e.g. server HP ProLiant DL580 has SpecInt of 40, S = 40
Captured peak utilisation of 15%, U = 15
Rating of N = 40 * (15/100) = 6
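The physical rating calculation is a one-line formula. As a minimal sketch (the function name `normalised_rating` is hypothetical; S and U are as defined on the slide):

```python
def normalised_rating(specint, peak_util_pct):
    """N = S * (U / 100): SpecInt rating scaled by peak % utilisation."""
    if not 0 <= peak_util_pct <= 100:
        raise ValueError("peak utilisation must be a percentage between 0 and 100")
    return specint * (peak_util_pct / 100)


# Slide example: S = 40, U = 15, so N is about 6.
example_n = normalised_rating(40, 15)
```

Using the peak rather than the average utilisation is the slide's choice: it sizes the virtual guest for the busiest observed period of the physical server.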
# 24 of 30
Virtual rating
Determine H (SpecInt rating of the host server)
Estimate C (consolidation ratio, e.g. 20:1)
Calculate values for tiers such as:
Bronze = H / C
Silver = Bronze * 2
Gold = Bronze * 4
Platinum = uncapped
e.g. SpecInt of VI server (HP Integrity rx8640) H = 200
Estimated target consolidation ratio C = 20:1
Bronze limit = 200/20 = 10 so box needs bronze service
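The tier limits and the choice of tier can be sketched as follows. The function names are hypothetical, "uncapped" is represented as infinity, and the rule that a workload gets the lowest tier whose limit covers its normalised rating is an assumption read from the slide's example rather than something it states outright.

```python
import math


def tier_limits(host_specint, consolidation_ratio):
    """Tier limits from the slide: Bronze = H / C, Silver = 2x, Gold = 4x, Platinum uncapped."""
    if consolidation_ratio <= 0:
        raise ValueError("consolidation ratio must be positive")
    bronze = host_specint / consolidation_ratio
    return {
        "Bronze": bronze,
        "Silver": bronze * 2,
        "Gold": bronze * 4,
        "Platinum": math.inf,  # uncapped
    }


def smallest_sufficient_tier(required_rating, limits):
    """Lowest tier whose limit is at least the workload's normalised rating (assumed rule)."""
    for name, limit in sorted(limits.items(), key=lambda item: item[1]):
        if required_rating <= limit:
            return name
    raise ValueError("no tier can hold this workload")


# Slide example: H = 200, C = 20 gives a Bronze limit of 10.
limits = tier_limits(200, 20)
```

With these limits a physical server rated N = 6 fits within Bronze, which matches the slide's conclusion that the box needs bronze service.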
# 25 of 30 Medium term Improvements
SPM Proactive
Services Portal
Fill vacant positions
Select CapMan activities
Establish processes
Formalised reporting vehicle
Regular and exception reports
Available to all relevant parties
Reactive reporting to proactive
Analysis of trends & pathology
Identify rogues and flatlines
Add business liaison
SLAs and performance
More use of Availability data
# 26 of 30 The Capacity Portal
[Diagram: The Capacity Portal, with multiple data sources. Labels: HPOV, Brocade, HP Performance tool, Logica, Availability, CDB/CMIS]
Trend Reports: subset of key metrics trended 30/60/90 days, with thresholds set
Daily Performance: focus on key metrics across entire estate; regular, on web
Capacity exceptions: refined metrics; critical thresholds; alarms as relevant
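One way a trend report of this kind could be produced is sketched below. The slide does not say how metrics are trended, so everything here is an assumption: a least-squares straight line is fitted to one value per day over trailing windows of 30, 60 and 90 days, and the days until the threshold is reached are estimated from that line. All function names are hypothetical.

```python
def linear_fit(values):
    """Least-squares slope and intercept for values sampled once per day (x = 0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_to_threshold(values, threshold):
    """Days until the fitted line reaches the threshold; None if the trend is flat or falling."""
    slope, intercept = linear_fit(values)
    current = intercept + slope * (len(values) - 1)
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - current) / slope


def trend_report(daily_values, threshold, windows=(30, 60, 90)):
    """Slope and time-to-threshold for each trailing window that has enough data."""
    report = {}
    for window in windows:
        if len(daily_values) >= window:
            recent = daily_values[-window:]
            slope, _ = linear_fit(recent)
            report[window] = {
                "slope_per_day": slope,
                "days_to_threshold": days_to_threshold(recent, threshold),
            }
    return report
```

Comparing the three windows is a cheap way to see whether growth is recent or long-standing; a straight line is only a first approximation for metrics with seasonality or step changes.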
# 27 of 30 Longer term Improvements
CDB/CMIS & CMDB/CMS Demand management
Utility chargeback Capacity plan
Capacity management db
Configuration management db
For infrastructure upgrades
For anticipated project demands
Characterise new workloads
Consolidate/retire more apps
Analysis of actual usage
Financial control of upgrades
# 28 of 30 Business Forecasts
Plans
Component HPOV nWorks & perf
Brocade Logica
Service Availability
SLM
CDB/CMIS
Component Current utilisation
Forecasts and changes Improvement options
Costs vs benefits Options modelled
Service Response times now
Track changes Slow time
Utilisation trends
Business Forecasts
Drivers Further VI req’s
KPI updates (CO2?) Data centre space?
Monitoring, Analysis, Tuning, Demand, Sizing, Modelling
The Capacity Plan
Recommendations
# 29 of 30 Procedures and work instructions

TOR Description
Ownership: A clear definition of who will both own the process (and by definition sponsor the project) and ultimately manage the process and day-to-day activities.
Objectives: Prior to implementing the process it is essential to define the overall objectives of what the process is going to achieve. It is common that these objectives are quite high level, but these could initially be:
• Establish component level monitoring for all applications with an initial focus on the “Top 5” metrics and all supported platforms.
• Establish service based metrics for at least one application. This should include an end to end response time and the addition of relevant service metrics within the relevant SLA
Some of these objectives could be used as process KPIs if clearly defined.
Definitions: The key elements that require definition are:
• CPM sub-processes, although initially this should be component and service, with business Capacity Management being an aspiration at this point
• A clear definition of the current responsibilities. These should be considered more operational than process specific i.e. who will be doing which activities and providing what data.
• A list of deliverables can be provided here or via link to the “Information Flow Diagram”
Scope: A key requirement for a process definition is a clear definition of scope. It is recommended, given the variety of data sources, that the scope could be limited. As the monitoring and structure become more mature this should be gradually increased until it covers more applications and infrastructure.
Benefits: These will obviously vary between businesses, but could ultimately include:
• The Capacity Management process will ensure that a proactive approach is taken. This change of approach will ensure higher availability of critical business services due to a reduction of capacity related outages.
• Deferred expenditure, through a reduction in the amount of excess capacity.
• Reduced risk for existing applications as system resources are managed more effectively.

Process Description
Process Flow Diagram: A high level graphical representation of how the various elements of the process relate and support one another
Process Flow Descriptions: An overview of each Capacity Management procedure and the beginnings of the work instruction pack. These can be as detailed and broad as befits the environment but should initially be:
• Daily threshold and trend review
• Trending analysis and Capacity forecasting
• Virtualization optimization
• Workload characterisation
As the process matures these would normally be expanded to include the provision of new services, modelling, exception reporting etc
Process Interfaces: The long term goal would be to populate this section with a complete listing of all process interfaces that includes likely inputs/outputs. As an initial stage it is recommended that the following interfaces are defined:
• Interface to ISS for support and provision of data
• Interface to service owners (SLM) for provision of business data
• Interface to Configuration management for service relationship information
• Interface to Change management

Procedure Description
Procedures: Procedural description relating to the suggested activities. Within each of the procedures the following elements should be clearly defined:
• Step by step guide to the procedure
• Definition of Inputs/Outputs
• Related procedural tree
• Responsibilities

KPI Description
Key Performance Indicators (KPIs):
• Operational – Total no. of capacity incidents, no. of emergency changes due to capacity requirements etc
• Process Quality – No. of anomalies in capacity outputs, % variance in any predictions
• Process implementation – % of services covered, % physical estate in scope etc

VI Description
Top level details: Specify key configuration information regarding the highest level relating to the particular environment e.g. AIX frame, VMware cluster etc
Pool specification: If appropriate capture any pool limits and how those relate to individual guests. More VMware specific, but most flavours of UNIX (including AIX) also offer the option to ring-fence resources and assign them to groups of guests.
Resource rules: Here we would capture the following information:
• # of vCPUs
• Amount of memory (GB)
• Disk specification
• # of Network interfaces (inc speed)
Management/Continuity: Specify what resource management and continuity policies are in place e.g. HACMP, DRS/HA, Fair share scheduling etc
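Two of the process KPIs named on this slide, % variance in predictions and % of services covered, reduce to simple ratios. The sketch below uses hypothetical function names and assumes that variance is measured relative to the actual value, which the slide does not specify.

```python
def pct_variance(predicted, actual):
    """Signed % by which a prediction differs from the actual value (assumed definition)."""
    if actual == 0:
        raise ValueError("actual value must be non-zero")
    return (predicted - actual) / actual * 100


def pct_covered(in_scope, total):
    """% of services (or of the physical estate) currently in scope of the process."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= in_scope <= total:
        raise ValueError("in-scope count must lie between 0 and total")
    return in_scope / total * 100
```

Keeping the sign on the variance shows whether forecasts tend to run high or low, which is more useful for tuning a forecasting method than an absolute error alone.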
# 30 of 30 8. Why should I go there? - Conclusion
• CapMan when practiced well saves money
• CapMan GPS has been applied at a number of sites
• Most sites are not where management thinks they are
• Most sites have people who know the real situation
• It takes openness & technical awareness to reveal truth
• Demand management is often minimal
• Project management is often uber-all
• Performance is often an after-thought
• Next steps are often short, medium and long term
• Usually related to liaison as much as process
• Often related to making more use of extant tools
• Hopefully not all reports are filed on the shelf
• But it needs in-house believers to carry it forward…
adam@ grummitt.com