1
Capacity Planning Case Studies
2
Case 1: The Organization: An Online Travel Agency
• Offers a variety of travel services, including fare finder, hotel and car rental information, reservations, and destination information.
• Over 5.7 million unique visitors monthly
• The site's visitors view an average of 3.3 unique pages per day.
• Visitors to the site spend roughly 51 seconds on each pageview and a total of five minutes on the site during each visit.
• Average load time was slow (2.29 seconds), and average response time was slow (20-50 s)
• Business people constantly complained to IT that load and response times were unacceptable (their requirements changed all the time because of competitors)
• There had never been any capacity planning approach
• Performance tuning was performed by the vendors, one trouble area at a time
• Processes were all reactive, and actions were all temporary
• Fixing one problem led to other problems
• IT Services have to be optimized to meet the needs of the business
3
Their Capacity Planning Approaches
• A process was introduced to the IT team.
• The main steps of the methodology were:
  – Characterizing the Business Case
  – Functional Analysis
  – Characterizing the User Behavior
  – Characterizing the IT Infrastructure
  – Characterizing the Workload
  – Performance Model Development
  – Performance Prediction
  – Cost Modeling
4
Process Issues
• All of their servers had been sized by the vendors
• Vendors had always done the performance tuning
• The IT team didn’t have the time, skills, and patience to do full-fledged capacity planning
• So the process was reduced to smaller steps
5
Simple Capacity Planning Model
Three Steps for Capacity Planning were taken for simplicity
1. Determine Service Level Requirements
   • The first step was to categorize the work done by systems and to quantify users’ expectations for how that work gets done.
2. Analyze Current Capacity
   • Next, the current capacity of the system was analyzed to determine how well it was meeting the needs of the users.
3. Plan for the Future
   • Finally, using forecasts of future business activity, future system requirements were determined. Implementing the required changes in system configuration would ensure that sufficient capacity would be available to maintain service levels, even as circumstances change in the future.
6
Other Issues
• Translating business plans to technology
  – Difficult: IT people didn’t have the skills to listen to business people, and business people didn’t understand IT terminology
• Benchmarking
  – Load testing
    » Was hard because of the business people’s perspective on working systems
• Trending
  – Linear trend analysis
  – Statistical approaches
    » Never done
• Modeling
  – Analytic modeling
  – Simulation modeling
    » The team didn’t know what it was
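Linear trend analysis, listed above as never used, can be sketched in a few lines. This minimal Python example (the function name `linear_trend_forecast` is hypothetical) fits an ordinary least-squares line to equally spaced observations and extrapolates it. For illustration it uses the Trivial transaction shares (50%, 43%, 36%) from the three-year history reported later in this case study.

```python
def linear_trend_forecast(values, steps_ahead=1):
    """Fit y = intercept + slope * t by ordinary least squares (t = 0, 1, 2, ...)
    and extrapolate the line steps_ahead periods past the last observation."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    s_ty = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    s_tt = sum((t - t_mean) ** 2 for t in range(n))
    slope = s_ty / s_tt
    intercept = y_mean - slope * t_mean
    return intercept + slope * (n - 1 + steps_ahead)

# Trivial transaction share (%): three years ago, two years ago, last year
print(linear_trend_forecast([50, 43, 36]))
```

For these three points the line fits exactly (a drop of 7 points per year), so the extrapolated share is 29%. Extrapolating a share this way ignores the constraint that all shares must sum to 100%, so it is only a rough trend indicator.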
7
Proactive Resource Management Approach
• Ensure service delivered matches expectations
• Ensure service quality not impacted by cost-saving consolidation efforts
• Determine impact of growth changes to existing infrastructure plans
8
Determine Service Level Requirements
• The overall process of establishing service level requirements first demanded an understanding of workloads.
• We began by looking at the workloads on a system running a back-end Oracle database.
• Before setting service levels, we needed to determine what unit we would use to measure the incoming work.
9
What a Service Level Agreement (SLA) Is
• An SLA sets the expectations between the consumer and provider. It helps define the relationship between the two parties. It is the cornerstone of how the service provider sets and maintains commitments to the service consumer.
• A good SLA addresses five key aspects:
  – What the provider is promising.
  – How the provider will deliver on those promises.
  – Who will measure delivery, and how.
  – What happens if the provider fails to deliver as promised.
  – How the SLA will change over time.
10
A Simple SLA Template
• 1.0 Statement of Intent: This section states the objectives of the document.
• 1.1 Approvals: All parties must agree on the SLA. This section contains a list of who approved the SLA.
• 1.2 Review Dates: This section contains the track record of the SLA reviews.
• 1.3 Time and Percent Conventions: This section describes the time conventions and metrics being used.
• 2.0 About the Service: This section introduces the service addressed by this SLA.
• 2.1 Description: This section describes the service in detail.
• 2.2 User Environment: This section describes the architecture and technologies used by the consumers of the service.
• 3.0 About Service Availability: This section introduces the availability concepts used in this SLA.
• 3.1 Normal Service Availability Schedule: This section describes what is considered normal service availability.
• 3.2 Scheduled Events That Impact Service Availability: This section describes which scheduled outages are to be expected.
• 3.3 Non-emergency Enhancements: This section describes the process that inserts enhancements into the infrastructure.
• 3.4 Change Process: This section describes the complete process of how changes are introduced in the service, including the associated availability impact.
• 3.5 Requests for New Users: This section describes the provisioning process for new users/customers.
• 4.0 About Service Measures: This section contains a detailed description of how service availability is measured and reported.
11
Service Level Management (SLM) Approaches
• For the company, service level management was the disciplined, proactive methodology and set of procedures used to ensure that adequate levels of service are delivered to all IT users in accordance with business priorities and at an acceptable cost.
• This is what they standardized through their ITIL processes.
12
Purpose of SLM
• Customer satisfaction
• Set expectations
• Define bounds of the services
• Resource management
• Cost optimization
13
Adding an SLA to an Existing Service
• Build a model of the production environment
• Predict projected growth
• Compare predictions to service level parameters
• Make necessary changes to planned provisioning
14
Service Level Metrics
• Performance
• Demand levels
• Security
• Recoverability
• Availability
15
Service Level Considerations (model created collaboratively by IT and the business)
• Average Load Time < 2 seconds
• Average Response Time < 10 s (response time for search transactions)
• Number of requests processed within the peak hour >= 300,000
• Availability >= 99.999%
• NEBS Level 3
• Site has to be recoverable within 5 seconds
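The targets above can be encoded as simple checks. This is a hedged illustration, not the team’s tooling: the `Measurements` class and `check_sla` function are hypothetical names, only the two baseline values stated in the case description are checked (2.29 s load time and the 20 s lower end of the 20-50 s response time range), and the downtime figure assumes a 365-day year.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Measurements:
    avg_load_time_s: Optional[float] = None
    avg_response_time_s: Optional[float] = None
    peak_hour_requests: Optional[int] = None
    availability_pct: Optional[float] = None

# Targets from the service level considerations of case 1
TARGETS = {
    "avg_load_time_s": lambda v: v < 2.0,
    "avg_response_time_s": lambda v: v < 10.0,
    "peak_hour_requests": lambda v: v >= 300_000,
    "availability_pct": lambda v: v >= 99.999,
}

def check_sla(m: Measurements) -> dict:
    """Return {metric: met?} for every metric that was measured;
    metrics left as None are skipped rather than guessed."""
    results = {}
    for name, meets_target in TARGETS.items():
        value = getattr(m, name)
        if value is not None:
            results[name] = meets_target(value)
    return results

def allowed_downtime_minutes_per_year(availability_pct: float) -> float:
    """Downtime budget implied by an availability target (365-day year)."""
    return (1 - availability_pct / 100) * 365 * 24 * 60

baseline = Measurements(avg_load_time_s=2.29, avg_response_time_s=20.0)
print(check_sla(baseline))
print(round(300_000 / 3600, 1), "requests/s on average during the peak hour")
print(round(allowed_downtime_minutes_per_year(99.999), 2), "minutes of downtime per year")
```

Both baseline metrics fail their targets. The helper figures translate the peak-hour target into roughly 83.3 requests/s and the 99.999% availability target into about 5.26 minutes of downtime per year.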
16
Workload Definition and Classifications
• A workload is a logical classification of work performed on a system.
• If you think of all the work performed on your systems as a pie, a workload can be thought of as one piece of that pie.
• Workloads can be classified by a wide variety of criteria
17
Workload
• The workload of a system can be defined as the set of all inputs that the system receives from its environment during any given period of time.
[Figure: HTTP requests arriving at a Web server]
18
Our Approach to Finding the Workloads
• Who is doing the work (a particular user or department)
• What type of work is being done (order entry, financial reporting)
• How the work is being done (online inquiries, batch database backups)
19
Workload characterization
• Critical transactions: Property Search, Book a Reservation, Change a Reservation, Cancel a Reservation, Payment Processing
• Searches, submission of property via forms, requests for changes
20
Processes Run on the System
• Examining the processes running on the system identified the individual processes that ran on it during the same 24-hour period.
21
Partitioning the Workload
• Resource usage
• Applications
• Objects
• Geographical orientation
• Functional
• Organizational units
• Mode
22
Workload Characterization
• Common steps were:
  – Specifying the point of view from which the workload will be analyzed
  – Choosing the set of relevant parameters
  – Monitoring the system
  – Analyzing and reducing the performance data
  – Constructing a workload model
23
Workloads in the System (extracted from the processes)
• AmenityInformation
• Availability
• BookReservation
• BrandInformation
• CancelReservation
• Error
• ModifyReservation
• MultiAvailability
• PaymentProcessing
• POISearch
• PropertyInformation
• PropertySearch
• RateRules
24
List of Workloads
• All the processes were attributed to one of four (4) workloads. These workloads are defined according to the type of work being done on the system.
• All we did was define workloads based on the type of work being performed on this server.
25
Clustering Analysis
[Figure: Service Demands, a scatter plot of I/O time against CPU time]
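The deck does not say which clustering algorithm was used. As an assumption, this sketch applies k-means (with deterministic farthest-first seeding) to (CPU time, I/O time) service-demand pairs in milliseconds. The twelve points are hypothetical values placed near the four class maxima in the workload partitioning table, not measured data; `kmeans` and `farthest_first_centroids` are illustrative names.

```python
import math

def farthest_first_centroids(points, k):
    """Deterministic seeding: start from the first point, then repeatedly
    add the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids

def kmeans(points, k, max_iter=100):
    """Plain k-means; returns [(centroid, members)] sorted by CPU demand."""
    centroids = farthest_first_centroids(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:  # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        new_centroids = [  # move each centroid to the mean of its members
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return sorted(zip(centroids, clusters), key=lambda pair: pair[0][0])

# Hypothetical (CPU ms, I/O ms) demands per process
points = [(5, 100), (8, 120), (6, 110), (15, 250), (20, 300), (18, 280),
          (80, 600), (100, 700), (90, 650), (800, 1100), (900, 1200), (850, 1150)]
for cls, (centroid, members) in enumerate(kmeans(points, k=4), start=1):
    print(cls, tuple(round(v, 1) for v in centroid), len(members))
```

With k = 4 the points fall into four groups of three, ordered from lightest to heaviest CPU demand. Because the I/O times are much larger than the CPU times, Euclidean distance is dominated by I/O here; normalizing each dimension before clustering is a common refinement.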
26
Workload Partitioning
Workload Class | Workload Type
1 | BrandInformation, Error
2 | CancelReservation, MultiAvailability, RateRules
3 | BookReservation, Availability, ModifyReservation, PaymentProcessing
4 | AmenityInformation, POISearch, PropertyInformation, PropertySearch
27
Workload Partitioning
Workload Class | Frequency | Max CPU Time (msec) | Max I/O Time (msec)
1 | 40% | 8 | 120
2 | 30% | 20 | 300
3 | 20% | 100 | 700
4 | 10% | 900 | 1200
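To see which class dominates resource usage, the table can be reduced to frequency-weighted demands. This sketch assumes each class’s maximum CPU and I/O time stands in for its per-transaction demand, a conservative simplification since the table gives only maxima; `mean_demand` and `cpu_share_by_class` are hypothetical helper names.

```python
# Workload classes from the table: (frequency %, max CPU ms, max I/O ms)
CLASSES = {
    1: (40, 8, 120),
    2: (30, 20, 300),
    3: (20, 100, 700),
    4: (10, 900, 1200),
}

def mean_demand(classes):
    """Frequency-weighted mean CPU and I/O demand per transaction (ms)."""
    cpu = sum(f * c for f, c, _ in classes.values()) / 100
    io = sum(f * i for f, _, i in classes.values()) / 100
    return cpu, io

def cpu_share_by_class(classes):
    """Fraction of total CPU demand contributed by each class."""
    total = sum(f * c for f, c, _ in classes.values())
    return {k: f * c / total for k, (f, c, _) in classes.items()}

cpu, io = mean_demand(CLASSES)
print(cpu, io)
print({k: round(v, 3) for k, v in cpu_share_by_class(CLASSES).items()})
```

The weighted means come to 119.2 ms of CPU and 398 ms of I/O per transaction, and class 4, only 10% of the transactions, accounts for roughly three quarters of the CPU demand.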
28
Validating Workload Models
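[Figure: Validating the workload model. The actual workload and a synthetic workload are each run on the system, and the measured response time, page load, availability, and throughput are compared. If the match is acceptable, the model is accepted; if not, the model is calibrated and the comparison is repeated.]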
29
Analyze Current Capacity
• Several steps had to be performed when analyzing the capacity measurement data:
• a. First, we compared the measurements of any items referenced in service level agreements with their objectives. This provided the basic indication of whether the system had adequate capacity.
• b. Next, we checked the usage of the various system resources (CPU, memory, and I/O devices). This analysis identified highly used resources that might prove problematic now or in the future.
• c. We looked at the resource utilization for each workload and ascertained which workloads were the major users of each resource. This narrowed our attention to the workloads making the greatest demands on system resources.
• d. We determined where each workload was spending its time by analyzing the components of response time, which showed which system resources were responsible for the greatest portion of each workload’s response time.
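Steps b through d can be illustrated with the standard open multiclass queueing formulas: the utilization law U = X × D at each resource, and residence time R = D / (1 - U) for each class at each resource. This is a sketch under stated assumptions, not the team’s analysis: a single CPU and a single I/O device, a hypothetical arrival rate of 2 transactions/s split by the class frequencies of case 1, and the table’s maximum times used as service demands. `open_multiclass` is a hypothetical name.

```python
def open_multiclass(arrival_rates, demands):
    """Open multiclass queueing network, one queue per resource.
    arrival_rates: {class: transactions/s}
    demands: {resource: {class: service demand in seconds}}
    Returns per-resource utilization, each class's share of it,
    and each class's residence time at each resource."""
    utilization, share, residence = {}, {}, {}
    for resource, d in demands.items():
        u_by_class = {c: arrival_rates[c] * d[c] for c in arrival_rates}  # U = X * D
        u = sum(u_by_class.values())
        if u >= 1:
            raise ValueError(f"{resource} is saturated (U = {u:.2f})")
        utilization[resource] = u
        share[resource] = {c: uc / u for c, uc in u_by_class.items()}
        for c in arrival_rates:
            residence.setdefault(c, {})[resource] = d[c] / (1 - u)  # R = D / (1 - U)
    return utilization, share, residence

total_rate = 2.0                                  # hypothetical transactions/s
mix = {1: 0.40, 2: 0.30, 3: 0.20, 4: 0.10}        # class frequencies from the table
rates = {c: total_rate * f for c, f in mix.items()}
demands = {                                       # table maxima, converted to seconds
    "cpu": {1: 0.008, 2: 0.020, 3: 0.100, 4: 0.900},
    "io": {1: 0.120, 2: 0.300, 3: 0.700, 4: 1.200},
}
util, share, resid = open_multiclass(rates, demands)
print({r: round(u, 3) for r, u in util.items()})
for c, parts in resid.items():
    print(c, {r: round(t, 2) for r, t in parts.items()}, "total", round(sum(parts.values()), 2))
```

At this load the I/O device runs near 80% utilization while the CPU sits near 24%, and the I/O term dominates class 4’s response time, which is the kind of finding step d is meant to surface.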
30
Plan for the Future
• How did we make sure that a year from now the systems won’t be overwhelmed?
• The best weapon was a capacity plan based on forecasted processing requirements.
• We needed to know the expected amount of incoming work, by workload. Then we could calculate the optimal system configuration for satisfying service levels.
31
The Approach
• We followed these steps:
  – First, we needed to forecast what the business would require of the IT systems in the future. The business didn’t know.
  – Once we knew what to expect in terms of incoming work, we used basic Excel tools to determine the optimal system configuration for meeting service levels into the future.
• This method didn’t work because business people were not involved enough.
• Workload forecasting with Excel was used.
32
Workload Forecasting with Excel (weighted exponential smoothing time-series forecasting method was recommended)
Transaction Type | Current | 1-Year Forecast
Trivial | 40% | 40%
Light | 30% | 28%
Medium | 20% | 22.1%
Heavy | 10% | 12.1%
33
Workload Forecasting (a weighted exponential smoothing time-series forecasting method, driven by the last three years’ data, was used for the forecast)
Transaction Type | 3 Years Ago | 2 Years Ago | Last Year
Trivial | 50% | 43% | 36%
Light | 20% | 23% | 24%
Medium | 16% | 20% | 26%
Heavy | 14% | 10% | 14%
The tool used was Exponential-Smoothing-genworth (given to you today).
34
Exponential Smoothing
• This is a very popular scheme for producing a smoothed time series. Whereas single moving averages weight past observations equally, exponential smoothing assigns exponentially decreasing weights as the observations get older.
• In other words, recent observations are given relatively more weight in forecasting than older observations.
• In the case of moving averages, the weights assigned to the observations are all equal to 1/N.
• In exponential smoothing, however, there are one or more smoothing parameters to be determined (or estimated), and these choices determine the weights assigned to the observations.
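A minimal sketch of single exponential smoothing, the simplest member of this family, with one smoothing parameter alpha. The deck refers to a "weighted exponential smoothing" tool whose exact formula is not given, so this is not claimed to reproduce its forecast table; alpha = 0.5 and initializing with the first observation are assumptions (one common choice).

```python
def exponential_smoothing(series, alpha):
    """Single exponential smoothing:
    S_1 = x_1,  S_t = alpha * x_t + (1 - alpha) * S_{t-1}.
    The last smoothed value serves as the one-step-ahead forecast."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        raise ValueError("series must not be empty")
    smoothed = [float(series[0])]  # initialise with the first observation
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed

# Transaction mix (%): three years ago, two years ago, last year
history = {
    "Trivial": [50, 43, 36],
    "Light": [20, 23, 24],
    "Medium": [16, 20, 26],
    "Heavy": [14, 10, 14],
}
for name, values in history.items():
    print(name, exponential_smoothing(values, alpha=0.5)[-1])
```

With alpha = 0.5 the Trivial forecast is 41.25%. Because each smoothed value is a weighted average of past observations, single smoothing cannot follow a trend (the Light forecast can never exceed 24%); trend-aware variants such as Holt’s double exponential smoothing exist for that case.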
35
About the Accuracy and the Errors
• MAD = Mean Absolute Deviation
• MAPE = Mean Absolute Percentage Error
• MSE = Mean Squared Error
• Smoothing factor for correlation factor
36
Quantity Accuracy Common Measurements
Mean Error
$$FE = \frac{1}{n}\sum_{i=1}^{n}\left(A_i - F_i\right)$$
Where:
FE : Forecast Error
Ai : The actual value in time period i
Fi : The forecast value in time period i
Mean Square Error
$$MSE = \frac{1}{n}\sum_{i=1}^{n}\left(A_i - F_i\right)^2$$
Where:
MSE : Mean Square Error
Ai : The actual value in time period i
Fi : The forecast value in time period i
37
Quantity Accuracy Common Measurements
Mean Absolute Deviation
$$MAD = \frac{1}{n}\sum_{i=1}^{n}\left|A_i - F_i\right|$$
Where:
MAD : Mean Absolute Deviation
Ai : The actual value in time period i
Fi : The forecast value in time period i
Mean Absolute Percentage Error
$$MAPE = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{A_i - F_i}{A_i}\right|$$
Where:
MAPE : Mean Absolute Percentage Error
Ai : The actual value in time period i
Fi : The forecast value in time period i
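The four measures above translate directly into code. A minimal sketch; `forecast_errors` is a hypothetical helper, the sample values are made up, and because MAPE is undefined when an actual value is zero, the function rejects that case.

```python
def forecast_errors(actual, forecast):
    """Mean error (FE), MSE, MAD and MAPE for paired actual/forecast values."""
    if len(actual) != len(forecast) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is 0")
    n = len(actual)
    errors = [a - f for a, f in zip(actual, forecast)]  # A_i - F_i
    return {
        "FE": sum(errors) / n,
        "MSE": sum(e * e for e in errors) / n,
        "MAD": sum(abs(e) for e in errors) / n,
        "MAPE": 100 / n * sum(abs(e / a) for e, a in zip(errors, actual)),
    }

print(forecast_errors([100, 200], [90, 210]))
```

Here the positive and negative errors cancel in FE (zero bias), while MSE, MAD, and MAPE still register the 10-unit misses. That is why FE is read as a bias measure and the others as accuracy measures.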
38
Value of “a”
• An exponentially weighted moving average with smoothing constant a corresponds roughly to a simple moving average of length (i.e., period) n, where a and n are related by:
• a = 2/(n+1), or equivalently n = (2 - a)/a.
• Thus, for example, an exponentially weighted moving average with a smoothing constant of 0.1 corresponds roughly to a 19-day moving average, and a 40-day simple moving average corresponds roughly to an exponentially weighted moving average with a smoothing constant of 0.04878.
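The conversion between a and n is easy to script. A minimal sketch that reproduces the two examples in the text; the function names are hypothetical.

```python
def alpha_from_period(n: float) -> float:
    """Smoothing constant roughly equivalent to an n-period simple moving average."""
    return 2 / (n + 1)

def period_from_alpha(a: float) -> float:
    """Inverse relation: n = (2 - a) / a."""
    return (2 - a) / a

print(round(period_from_alpha(0.1), 6))  # 19.0  -> about a 19-day moving average
print(round(alpha_from_period(40), 5))   # 0.04878 for a 40-day moving average
```

The rounding only hides floating-point noise; the two functions are exact inverses of each other.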
39
Determine Future Processing Requirements
• The 1-year forecast was acceptable, but the future processing requirements had to come from a variety of sources, based on commitments from the CTO, CFO, COO, and CEO.
• The inputs that management committed to for the next year were:
  – Expected growth in the business
  – Requirements for implementing new applications
  – Planned acquisitions or divestitures
  – IT budget limitations
  – Requests for consolidation of IT resources
40
Plan Future System Configuration
• After the system capacity requirements for year 1 were identified, a capacity plan was developed to prepare for them.
• The first step in doing this was to create a model of the current configuration.
• From this starting point, the model could be modified to reflect the future capacity requirements.
• If the results of the model indicated that the current configuration did not provide sufficient capacity for the future requirements, then the model could be used to evaluate configuration alternatives to find the optimal way to provide sufficient capacity.
41
Forecasted Workload – Configuration Changes
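[Figure: The forecasted workload (actual and synthetic) is run on the system, and the measured response time, page load, availability, and throughput, together with cost ($), are evaluated. If the result is not acceptable, the configuration for the new system (CPU, I/O, memory) is changed and the test is repeated; if it is acceptable, that configuration is adopted.]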
42
The Capacity Plan
• The capacity plan included:
  – Identifying Server Transactions
  – Defining Server Transaction Throughput Requirements
  – Choosing Hardware
  – Obtaining Measurements/Throughput Rates
  – Calculating the Required Number of Machines
  – Making Network Topology Choices
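For the step "Calculating the Required Number of Machines", a common back-of-the-envelope rule is to divide the required throughput by what one server can sustain at a target utilization and round up. The per-server rate (40 requests/s), the 70% utilization ceiling, and the optional spare count are hypothetical; the required rate comes from the 300,000 peak-hour requests in this case’s service level considerations. `machines_required` is an illustrative name.

```python
import math

def machines_required(required_rate, per_machine_rate, target_utilization=0.7, spares=0):
    """Servers needed so that each runs at or below target_utilization.
    required_rate and per_machine_rate use the same unit (e.g. requests/s);
    per_machine_rate is the measured maximum throughput of one server."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable = per_machine_rate * target_utilization
    return math.ceil(required_rate / usable) + spares

peak_rate = 300_000 / 3600  # peak-hour SLA from case 1, about 83.3 requests/s
print(machines_required(peak_rate, per_machine_rate=40))            # hypothetical server rate
print(machines_required(peak_rate, per_machine_rate=40, spares=1))  # N+1 redundancy
```

Keeping a utilization ceiling below 100% leaves headroom for bursts and queueing delay; the spare adds redundancy so that losing one server does not breach the availability target.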
43
Capacity Plan Template Used
• Production Environment Servers
  – Server Capacity Requirements
  – Processing Capacity Requirements - Server Class and CPUs
  – Memory Requirements
  – Disk Capacity Requirements
• User Learning Environment Servers
  – Processing Capacity Requirements - Server Class and CPUs
  – Memory Requirements
  – Disk Capacity Requirements
44
Capacity Plan Template Used
• Testing Environment Servers
  – Processing Capacity Requirements - Server Class and CPUs
  – Memory Requirements
  – Disk Capacity Requirements
• Desktop Client Machines
  – Specifications
  – Processing Requirements
• Network Capacity
  – Bandwidth Requirements
45
Business Impacts
• Business Benefits Expected
• Potential Impact of Recommendation
• Risks Involved
• Resources Required
• Setup and Ongoing Costs
46
Assumptions and Risks
• Assumptions
  – <Assumption 1>
  – <Assumption 2>
• Risks
  – <Risk 1>
  – <Risk 2>
• Open and Closed Issues for this Deliverable
  – Open Issues
  – Closed Issues
47
Template
1. SLA
2. Business process and workload
3. Forecasting
4. Cost
48
Case Study 2: Capacity Planning for Admin Server
49
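Methodology
[Figure: Capacity planning methodology. Understanding the environment leads to workload characterization, which produces a workload model that goes through validation and calibration and then workload forecasting. A valid performance model supports performance prediction, and developing a cost model supports cost prediction. Both feed a cost/performance analysis that produces a configuration plan, an investment plan, and a personnel plan.]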
50
Capacity Planning Approach
Four Steps for Capacity Planning
1. Define the Business Requirements
   • Business Process Description/Flow
   • Business Realignment
   • Decomposition of the Business Process (outline the business functions)
2. Determine Service Level Requirements
   • The work done by systems was categorized, and users’ expectations for how that work gets done were quantified.
3. Analyze Current Capacity
   • Next, the current capacity of the system was analyzed to determine how well it was meeting the needs of the users.
4. Plan for the Future
   • Finally, using forecasts of future business activity, future system requirements were determined. Implementing the required changes in system configuration would ensure that sufficient capacity would be available to maintain service levels, even as circumstances change in the future.
51
Business Requirements
• Business Case
• Business Process Description/Flow
• Business Realignment
• Decomposition of the Business Process (outline the business functions)
52
Business Requirements Example
• Business Case
  – ACH
• Business Process Description/Flow
  – Automated Clearing House (ACH): a way for businesses to transmit money through the banking system (EFT)
• Business Realignment
  – 1-year changes: 25% increase
• Decomposition of the Business Process (outline the business functions)
  – ACH creates a file that is transmitted to the finance team; the finance team then validates it against the actual transactions. SWD creates a record in the ACH file, and PAY creates a record in ACH.
53
SLA ACH
• This process impacts money flowing into and out of the company.
• The files have to leave the server (be transmitted) by 8 am.
54
Determine Service Level Requirements for all the Activities
For each Activity:
1. Map the activity to a business process
2. Define the SLA parameters for each activity based on the business process
55
Service Level Requirements for Activity “ACH”
Parameter | Target | Description
Transmission Time | 8 am Mon-Fri | It has to leave the system by 8 am on weekdays
Response Time | 1800 s | Time to process an ACH
56
Service Level Requirements for the System
Parameter | Target | Description
Availability | 99.999% |
Recoverability | 900 s |
57
Business SLA Mapping to Activity SLA
• The business SLA has to be mapped to activity SLAs.
• Activity SLAs can be measured and monitored through testing and the operational laws.
58
ACH Example
• Business Process
• Business Process SLA
• ACH SLA
59
Business SLA Requirements
• Automated Clearing House (ACH)
• SLA Parameters:
  – Transmission Time
    » 8 am Mon-Fri
    » It has to leave the system by 8 am on weekdays
  – Response Time
    » 1800 s
    » Time to process an ACH
What are the main business process SLA parameters? Rx, processing, Tx
• SLA Parameters:
  – Rx: funds must be visible by 8:05; processing time = 20 minutes; Tx before 9:30
• What is the mapping?
• What is the mapping?
60
SLA Template
• 1.0 Statement of Intent: This section states the objectives of the document.
• 1.1 Approvals: All parties must agree on the SLA. This section contains a list of who approved the SLA.
• 1.2 Review Dates: This section contains the track record of the SLA reviews.
• 1.3 Time and Percent Conventions: This section describes the time conventions and metrics being used.
• 2.0 About the Service: This section introduces the service addressed by this SLA.
• 2.1 Description: This section describes the service in detail.
• 2.2 User Environment: This section describes the architecture and technologies used by the consumers of the service.
• 3.0 About Service Availability: This section introduces the availability concepts used in this SLA.
• 3.1 Normal Service Availability Schedule: This section describes what is considered normal service availability.
• 3.2 Scheduled Events That Impact Service Availability: This section describes which scheduled outages are to be expected.
• 3.3 Non-emergency Enhancements: This section describes the process that inserts enhancements into the infrastructure.
• 3.4 Change Process: This section describes the complete process of how changes are introduced in the service, including the associated availability impact.
• 3.5 Requests for New Users: This section describes the provisioning process for new users/customers.
• 4.0 About Service Measures: This section contains a detailed description of how service availability is measured and reported.
61
Analyze Current Capacity
• Several steps had to be performed when analyzing the capacity measurement data:
• a. First, we compared the measurements of any items referenced in service level agreements with their objectives. This provided the basic indication of whether the system had adequate capacity.
• b. Next, we checked the usage of the various system resources (CPU, memory, and I/O devices). This analysis identified highly used resources that might prove problematic now or in the future.
• c. We looked at the resource utilization for each workload and ascertained which workloads were the major users of each resource. This narrowed our attention to the workloads making the greatest demands on system resources.
• d. We determined where each workload was spending its time by analyzing the components of response time, which showed which system resources were responsible for the greatest portion of each workload’s response time.
62
Workload Partitioning For Admin Server (ACH)
Workload Classification Based on Process Count
Workload Type | Frequency in Week (count) | %
Class 1 | 128 | 0.000724078
Class 2 | 2,685 | 0.015188658
Class 3 | 22,756 | 0.128727416
Class 4 | 125,500 | 0.70993543
Class 5 | 826,735 | 4.676720861
Class 6 | 3,992,124 | 22.58287068
Class 7 | 12,707,736 | 71.88583288
Total Count: 17,677,664
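The percentage column and the total can be recomputed from the weekly counts, which is a quick consistency check before forecasting. A minimal sketch using only the counts from the table above:

```python
# Weekly process counts per workload class (Admin Server, ACH)
counts = {
    "Class 1": 128, "Class 2": 2_685, "Class 3": 22_756, "Class 4": 125_500,
    "Class 5": 826_735, "Class 6": 3_992_124, "Class 7": 12_707_736,
}
total = sum(counts.values())
percentages = {c: 100 * n / total for c, n in counts.items()}

print(f"Total count: {total:,}")
for c, p in percentages.items():
    print(f"{c}: {p:.9f}%")
```

The recomputed total is 17,677,664, and the percentages match the table, with classes 6 and 7 together accounting for about 94% of the weekly process count.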
63
Workload Forecasting For Admin Server/Count
Class Type | Frequency in Week (count) | % | Forecast 1 Year
Class 1 | 128 | 0.000724078 |
Class 2 | 2,685 | 0.015188658 |
Class 3 | 22,756 | 0.128727416 |
Class 4 | 125,500 | 0.70993543 |
Class 5 | 826,735 | 4.676720861 |
Class 6 | 3,992,124 | 22.58287068 |
Class 7 | 12,707,736 | 71.88583288 |
Total Count: 17,677,664
64
Workload Partitioning For Admin Server – Service Demand
Workload Class | Frequency | Max I/O Time (msec) | Max CPU Time (msec) | Memory Usage (MB)
1 | | | |
2 | | | |
3 | | | |
4 | | | |
5 | | | |
6 | | | |
7 | | | |
8 | | | |
65
Forecasted Workload For Admin Server – Frequency and Service Demand
Workload Class | Frequency Forecast | Forecast I/O Time (msec) | Forecast CPU Time (msec) | Forecast Memory Usage (MB)
1 | | | |
2 | | | |
3 | | | |
4 | | | |
5 | | | |
6 | | | |
7 | | | |
8 | | | |
66
Performance Modeling and Prediction for ACH
[Figure: Input: system and workload description. Output: performance metrics (response time, transmission time).]
67
Estimating Performance Measures
• A queuing network model takes a system description as input and produces performance measures.
• System description:
  – System parameters
  – Resource parameters
  – Workload parameters: service demands and workload intensity
• Performance measures:
  – Response time
  – Throughput
  – Utilization
  – Queue length
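The deck does not name a solution technique for the queuing network model. One standard choice, assumed here, is exact Mean Value Analysis (MVA) for a closed, single-class network of load-independent queues. The demands in the example (0.1192 s CPU, 0.398 s I/O) are the frequency-weighted means of the maximum service times in case 1’s workload partitioning table, and the 5 s think time is hypothetical; `mva` is an illustrative name.

```python
def mva(demands, n_users, think_time=0.0):
    """Exact Mean Value Analysis for a closed, single-class queueing network.
    demands: service demand D_i per resource (s); think_time: Z (s).
    Returns (throughput, response time, per-resource queue lengths) at n_users."""
    queue = [0.0] * len(demands)
    throughput = response = 0.0
    for n in range(1, n_users + 1):
        residence = [d * (1 + q) for d, q in zip(demands, queue)]  # R_i(n) = D_i (1 + Q_i(n-1))
        response = sum(residence)
        throughput = n / (think_time + response)                    # X(n) = n / (Z + R(n))
        queue = [throughput * r for r in residence]                 # Q_i(n) = X(n) R_i(n)
    return throughput, response, queue

D = [0.1192, 0.398]  # CPU and I/O demand per transaction (s)
for users in (1, 10, 50):
    x, r, _ = mva(D, users, think_time=5.0)
    print(f"N={users:3d}  X={x:.3f}/s  R={r:.2f} s  U_io={x * D[1]:.2f}")
```

Throughput grows with the number of users until it approaches 1/D_max (about 2.51 transactions/s, set by the I/O device); beyond that point, additional users only add queueing delay.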
68
Validating Performance Models
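[Figure: Validating performance models. Measurements from the real system (response time, throughput, etc.) are compared with calculations from the performance model. If the match is acceptable, the model is accepted; if not, the model is calibrated and the comparison is repeated.]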
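The "Acceptable?" decision in the validation loop can be sketched as a per-metric relative-error check. The tolerances below (20% for response time, 10% for throughput and utilization) are an assumption, a commonly cited rule of thumb for model validation rather than values from this deck, and the measured and calculated numbers are hypothetical.

```python
def relative_error(measured: float, calculated: float) -> float:
    return abs(calculated - measured) / abs(measured)

# Assumed tolerances (rule of thumb, not taken from this deck)
TOLERANCE = {"response_time": 0.20, "throughput": 0.10, "utilization": 0.10}

def model_acceptable(measured, calculated, tolerance=TOLERANCE):
    """Compare model outputs with measurements metric by metric.
    Returns (overall verdict, per-metric relative errors)."""
    errors = {m: relative_error(measured[m], calculated[m]) for m in measured}
    acceptable = all(errors[m] <= tolerance[m] for m in errors)
    return acceptable, errors

measured = {"response_time": 2.0, "throughput": 50.0, "utilization": 0.60}
calculated = {"response_time": 2.3, "throughput": 52.0, "utilization": 0.70}
ok, errs = model_acceptable(measured, calculated)
print(ok, {m: round(e, 3) for m, e in errs.items()})
```

In this example response time (15% off) and throughput (4% off) pass, but utilization is about 17% off, so the model would go back for calibration.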
69
Planning for the Future
• Finally, using forecasts of future business activity, future system requirements were determined. Implementing the required changes in system configuration would ensure that sufficient capacity would be available to maintain service levels, even as circumstances change in the future.
70
Plan for the Future
• How did we make sure that a year from now the systems won’t be overwhelmed?
• The best weapon was a capacity plan based on forecasted processing requirements.
• We needed to know the expected amount of incoming work, by workload. Then we could calculate the optimal system configuration for satisfying service levels.
71
Future Performance Modeling and Prediction
• How are performance measures estimated?
[Figure: Input: future system and workload. Output: predicted performance metrics (response time, throughput, utilization, etc.).]
72
Forecasted Workload – Configuration Changes
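[Figure: The forecasted workload (actual and synthetic) is run on the system, and the measured response time, page load, availability, and throughput, together with cost ($), are evaluated. If the result is not acceptable, the configuration for the new system (CPU, I/O, memory) is changed and the test is repeated; if it is acceptable, that configuration is adopted.]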
73
Plan Future System Configuration
• After the system capacity requirements for year 1 were identified, a capacity plan was developed to prepare for them.
• The first step in doing this was to create a model of the current configuration.
• From this starting point, the model could be modified to reflect the future capacity requirements.
• If the results of the model indicated that the current configuration did not provide sufficient capacity for the future requirements, then the model could be used to evaluate configuration alternatives to find the optimal way to provide sufficient capacity.
74
The Capacity Plan
75
The Capacity Plan
• The capacity plan included:
  – Identifying Server Transactions
  – Defining Server Transaction Throughput Requirements
  – Choosing Hardware
  – Obtaining Measurements/Throughput Rates
  – Calculating the Required Number of Machines
  – Making Network Topology Choices
76
Production Environment Servers
• Server Capacity Requirements
• Processing Capacity Requirements - Server Class and CPUs
• Memory Requirements
• Disk Capacity Requirements
77
User Learning Environment Servers
• Processing Capacity Requirements - Server Class and CPUs
• Memory Requirements
• Disk Capacity Requirements
78
Testing Environment Servers
• Processing Capacity Requirements - Server Class and CPUs
• Memory Requirements
• Disk Capacity Requirements
79
Network Capacity
• Bandwidth Requirements
80
Assumptions and Risks
• Assumptions
  – <Assumption 1>
  – <Assumption 2>
• Risks
  – <Risk 1>
  – <Risk 2>
• Open and Closed Issues for this Deliverable
  – Open Issues
  – Closed Issues
81
Capacity Planning Case Study 3
82
The Organization
• IT department of a Fortune 100 communications service provider
• They used standard capacity planning procedures for growth, but no methodology was used
83
Capacity Planning for Servers in a WebSEAL Environment
• WebSEAL provided an authentication and authorization mechanism based on Tivoli Access Manager.
• It enabled an end-to-end Single Sign-On (SSO) solution for secure transactions with WebSphere application servers.
• In a WebSEAL environment, several servers handled the workload.
84
Service Level
• No service level requirements were defined
• Capacity planning couldn’t be completed with a structured methodology like the one in the previous case study, for the following reasons:
  – The IT team didn’t have the time and patience to deal with the business people
  – The IT team trusted the vendor models
  – Forecasting was done ad hoc: a 20% increase was assumed for the next 3 years
85
Servers
Server
WebSEAL (PDWeb)
Lightweight Directory Access Protocol (LDAP)
Backend Web servers
86
WebSEAL Network Topology Choices
• WebSEAL was placed in a DMZ connecting Internet users on one physical network to Intranet backend Web servers on another physical network.
• LDAP was used as the registry and resided on the Intranet for security purposes.
• All servers supported replication.
87
Workload Characterization
88
Server Transactions
• The high-level transaction consisted of authenticated Web page access.
• Page access required Secure Sockets Layer (SSL), used LDAP authentication, averaged about 5 KB in size, and flowed through a WebSEAL TCP junction.
• Each such request visited the WebSEAL server, the LDAP server, a backend Web server, and the physical network.
89
Server Transactions Used
Server | Transaction
WebSEAL | SSL-authenticated Web page access to a TCP-junctioned backend
LDAP | WebSEAL-driven authentication
Backend Web server | TCP Web page access
Physical network | Approximately 10 KB of traffic, including a small overhead for LDAP server communication
90
Transactions in a WebSEAL Environment
• The primary, high-level transaction in the WebSEAL environment was access to a Web page.
• There were several parameters that defined Web page access.
  – For example, one parameter was whether the page access included a user login (authentication).
• Another transaction in a WebSEAL environment was the administrative update, such as creating, deleting, listing, and modifying users or ACL entries.
91
Parameters associated with transaction in a WebSEAL environment
• WebSEAL server
  – Page access
    » TCP or SSL
    » Authenticated, post-authenticated, or not authenticated
    » If SSL, new browser for each request (new SSL session) or same browser (SSL session reuse)
    » If SSL, the SSL cache size and timeouts
    » If authenticated, user in the Tivoli SecureWay Policy Director cache or not
    » If authenticated, failed login or not
    » Logout (pkmslogout page request) or not
    » Junctioned or not junctioned to a backend Web server
    » If junctioned, TCP or SSL junction type
    » If junctioned, backend requires authentication or not
    » Web page size
  – Administrative
    » Create, delete, list, or modify a user
    » Create, delete, list, or modify an ACL
• LDAP server
  – Authentication
    » User from LDAP cache, DB2(R) cache, or disk (non-cached)
    » Failed authentication or not
    » If failed authentication, why? User not found or invalid password
    » Registry size (number of users)
  – Administrative
    » Create, delete, list, or modify
    » Update to an LDAP master or propagated to an LDAP replica
    » If update to an LDAP master, replication setting (on or off)
    » Frequency of propagation (immediate or delayed)
    » Registry size (number of users)
• Backend Web server
  – TCP or SSL
  – Authenticated, post-authenticated, or not authenticated
  – Web page size
• Physical network
  – Number of bytes in each transaction
  – Line speed
92
Basing Estimated on Registered Users
• Another method for estimating transaction throughputs, or requirements, was to base estimates on the number of users in the registry.
• The idea was that some percentage of these users would use the system in any given time period.
• Along with this, there was an idea of what an average user does in terms of putting load on the system.
• For example, the percentage of users who log in during a given time period gave the authentication rate.
• The authentication rate times the load applied by an average user gave the throughput for the average workload.
93
Calculations
• There were 4 million registered users, and approximately 20% of them logged in (authenticated) each day. We also assumed that each user accessed 10 Web pages per session.
• The required authentication rate was 9.26 authentications per second, as calculated by the following formula:
4,000,000 users × 0.20 / 24 hours per day / 60 minutes per hour / 60 seconds per minute ≈ 9.26 auths/sec
• The required page access rate was 92.6 pages per second, as calculated by the following formula:
9.26 auths/sec × 10 pages per user session = 92.6 pages/sec
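The registered-user calculation above can be captured in two small functions. A minimal sketch; the function names are hypothetical, and, like the formula in the text, it assumes logins are spread evenly over 24 hours and that each session involves exactly one authentication.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def auth_rate_per_second(registered_users, daily_login_fraction):
    """Average authentications per second, assuming logins are spread
    evenly over 24 hours."""
    return registered_users * daily_login_fraction / SECONDS_PER_DAY

def page_rate_per_second(auth_rate, pages_per_session):
    """One authentication per session, so pages/s = auths/s * pages/session."""
    return auth_rate * pages_per_session

auths = auth_rate_per_second(4_000_000, 0.20)
pages = page_rate_per_second(auths, 10)
print(f"{auths:.2f} auths/sec, {pages:.1f} pages/sec")
```

The even-spread assumption yields an average rate, not a peak; sizing for the busiest hour would need a peak-to-average factor, which the deck does not provide.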
94
Identifying Server Transactions
• We assumed that the common parameters defining the Web page access transaction were as follows:
• Tivoli SecureWay Policy Director connected to the backend Web server using a TCP junction with options that filter the original identity of the user and replace it with one provided by WebSEAL (-B, -U, and -W options).
• ACLs were defined in WebSEAL to protect backend Web server resources.
• Authentication was tuned as described in the Tivoli SecureWay Policy Director Base Performance Tuning Guide.
• The average Web page was 10 KB in size.
95
Identifying Server Transactions
• The following high-level transactions were identified:
• TCP page--A user makes a request for an HTTP (TCP) Web page going through WebSEAL to a backend Web server. No authentication occurs.
• SSL authentication page--A user makes an initial request for an HTTPS (SSL) Web page going through WebSEAL to a backend Web server. Authentication occurs.
• SSL post-authentication page--A user makes a subsequent request for an HTTPS (SSL) Web page going through WebSEAL to a backend Web server. No authentication occurs.
96
Server Transactions
Server | Transaction
WebSEAL | 10 KB TCP page access through a TCP junction
WebSEAL | 10 KB SSL page access with authentication through a TCP junction
WebSEAL | 10 KB SSL page access, already authenticated, through a TCP junction
LDAP | Authentication
Backend Web server | 10 KB TCP page access, already authenticated (Tivoli SecureWay Policy Director authenticates to backend Web servers infrequently, as defined by the backend server; infrequently enough to be insignificant)
Internet network | 10 KB Web page requests plus SSL, TCP, HTTP, and IP headers
Intranet network | 10 KB Web page requests and 200-byte LDAP lookups plus TCP, HTTP, and IP headers
97
Server Transaction Throughput Requirements
IP traces showed that a typical user session consisted of the following requests:
• 3 TCP pages • 1 SSL page requiring authentication • 9 SSL pages after authentication
The traces also showed that user authentication occurred at a rate of 5 per second.
• The system-wide throughput requirements were as follows:
• SSL page accesses requiring authentication: 5 per second
• SSL page accesses after authentication: 9 * 5 = 45 per second
• TCP page accesses: 3 * 5 = 15 per second
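The per-transaction requirements above are the traced session mix multiplied by the session rate. A minimal sketch, assuming (as the trace shows) exactly one authentication per session, so sessions per second equal authentications per second; the dictionary keys and function name are hypothetical. Summing all page types also gives the 65 pages/sec the backend Web server must handle.

```python
# Per-transaction throughput requirements from a traced session mix.
# Assumption: each session contains exactly one authentication (as in the
# trace), so sessions per second equals authentications per second.
SESSION_MIX = {
    "tcp_page": 3,            # TCP pages per session
    "ssl_auth_page": 1,       # SSL page that triggers authentication
    "ssl_post_auth_page": 9,  # SSL pages after authentication
}


def throughput_requirements(auths_per_sec, mix):
    """Requests/sec per transaction type = count per session * sessions/sec."""
    return {name: count * auths_per_sec for name, count in mix.items()}


if __name__ == "__main__":
    req = throughput_requirements(5, SESSION_MIX)
    print(req)
    # Every page goes through WebSEAL to the backend Web server.
    print("backend Web server pages/sec:", sum(req.values()))
```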
98
Throughput Requirements
Server | Transaction | Throughput requirement
WebSEAL | 10 KB TCP page access through a TCP junction | 15/sec
WebSEAL | 10 KB SSL page access with authentication through a TCP junction | 5/sec
WebSEAL | 10 KB SSL page access, already authenticated, through a TCP junction | 45/sec
LDAP | Authentication | 5/sec
Backend Web server | 10 KB TCP page access | 65/sec
Internet network | 10 KB Web page requests plus SSL, TCP, HTTP, and IP headers | 780 KB/sec
Intranet network | 10 KB Web page requests and 200-byte LDAP lookups plus TCP, HTTP, and IP headers | 781 KB/sec
99
Network Throughput
• The Internet network throughput requirement was calculated as follows:
(15 TCP pages/sec + 50 SSL pages/sec) * 10 KB/page = 650 KB/sec; 650 KB/sec + 20% for headers = 780 KB/sec
• The intranet network throughput requirement was calculated as follows:
650 KB/sec Internet requirement + (5 SSL auths/sec * 200 bytes) ≈ 651 KB/sec; 651 KB/sec + 20% for headers ≈ 781 KB/sec
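The network arithmetic can be expressed as two small functions. This is a sketch under the slide's own assumptions: a flat 20% header overhead and 1 KB = 1,000 bytes for the LDAP lookups (which is how 5 * 200 bytes/sec becomes 1 KB/sec). The function and constant names are hypothetical.

```python
# Network throughput requirements, following the case study's arithmetic:
# payload in KB/sec plus a flat 20% allowance for SSL, TCP, HTTP, and IP headers.
PAGE_KB = 10
HEADER_OVERHEAD = 0.20


def internet_kb_per_sec(tcp_pages, ssl_pages):
    payload_kb = (tcp_pages + ssl_pages) * PAGE_KB
    return payload_kb * (1 + HEADER_OVERHEAD)


def intranet_kb_per_sec(tcp_pages, ssl_pages, ldap_lookups, lookup_bytes=200):
    # LDAP lookup bytes are converted with 1 KB = 1,000 bytes, matching the
    # slide's rounding of 5 * 200 bytes/sec to 1 KB/sec.
    payload_kb = (tcp_pages + ssl_pages) * PAGE_KB + ldap_lookups * lookup_bytes / 1000
    return payload_kb * (1 + HEADER_OVERHEAD)


if __name__ == "__main__":
    print(round(internet_kb_per_sec(15, 50)), "KB/sec Internet")
    print(round(intranet_kb_per_sec(15, 50, 5)), "KB/sec intranet")
```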
100
Throughput Measurements
Server | Transaction | Maximum throughput measurement
WebSEAL | 100-byte TCP page access through a TCP junction | 750/sec
WebSEAL | 5 KB TCP page access through a TCP junction | 700/sec
WebSEAL | 10 KB TCP page access through a TCP junction | 655/sec (estimated; see the formula following this table)
WebSEAL | 10 KB SSL page access with authentication through a TCP junction | 40/sec
WebSEAL | 10 KB SSL page access, already authenticated, through a TCP junction | 250/sec
LDAP | Authentication | 35/sec
Backend Web server | 10 KB TCP page access | 750/sec
Internet network | 10 KB Web page requests plus SSL, TCP, HTTP, and IP headers | 7+ MB/sec
Intranet network | 10 KB Web page requests and 200-byte LDAP lookups plus TCP, HTTP, and IP headers | 7+ MB/sec
101
TCP page access
• The estimated throughput for a WebSEAL 10 kilobyte (KB) TCP page access, extrapolated from the throughput measurements of the 100-byte and 5 KB cases, was 655 pages/sec, as calculated in the following formula:
y = 1 / ((x - x1) * (1/y1 - 1/y2) / (x1 - x2) + 1/y1)
where x is the page size in bytes, y is the throughput in pages/sec, (x1, y1) = (100 bytes, 750 pages/sec), and (x2, y2) = (5 * 1024 bytes, 700 pages/sec):
y = 1 / ((10*1024 - 100) * (1/750 - 1/700) / (100 - 5*1024) + 1/750) ≈ 655
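The formula treats per-page time (1/throughput) as a linear function of page size. A minimal sketch of that calculation follows; the function name is hypothetical. Because 10 KB lies beyond the largest measured size (5 KB), this is strictly an extrapolation, and it is only as good as the linearity assumption.

```python
def interpolate_throughput(x, x1, y1, x2, y2):
    """Throughput (pages/sec) at page size x from two measurements.

    Per-page time 1/y is assumed to be linear in page size, so the line through
    (x1, 1/y1) and (x2, 1/y2) is evaluated at x and inverted. Page sizes in bytes.
    """
    time_per_page = (x - x1) * (1 / y1 - 1 / y2) / (x1 - x2) + 1 / y1
    return 1 / time_per_page


if __name__ == "__main__":
    estimate = interpolate_throughput(10 * 1024, 100, 750, 5 * 1024, 700)
    print(f"{estimate:.0f} pages/sec")
```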
102
Transactions, requirements, and measurements
Server | Transaction | Throughput requirement | Maximum throughput measurement
WebSEAL | 10 KB TCP page access through a TCP junction | 15/sec | 655/sec (estimated)
WebSEAL | 10 KB SSL page access with authentication through a TCP junction | 5/sec | 40/sec
WebSEAL | 10 KB SSL page access, already authenticated, through a TCP junction | 45/sec | 250/sec
LDAP | Authentication | 5/sec | 35/sec
Backend Web server | 10 KB TCP page access | 65/sec | 750/sec
Internet network | 10 KB Web page requests plus SSL, TCP, HTTP, and IP headers | 780 KB/sec | 7+ megabytes (MB)/sec
Intranet network | 10 KB Web page requests and 200-byte LDAP lookups plus TCP, HTTP, and IP headers | 781 KB/sec | 7+ MB/sec
103
Calculating the machine factor
Server | Calculation | Machine factor | Increased by 20%
WebSEAL | 15/655 + 5/40 + 45/250 | 0.33 | 0.39
LDAP | 5/35 | 0.14 | 0.17
Backend Web server | 65/750 | 0.09 | 0.10
Internet network | 780/7000 | 0.11 | 0.13
Intranet network | 781/7000 | 0.11 | 0.13
104
Calculating the machine factor
• Since all machine factors were less than one, each server used only a portion of a machine, given by its machine factor.
• In other words, the machine factor multiplied by 100 gave the percentage of the machine utilized; for example, WebSEAL's factor of 0.33 corresponds to about 33% of one machine.
105
Calculating the Required Number of Machines
• The number of machines needed was based on the throughput requirements divided by the achievable throughput, i.e., the measurements.
• The ratios were calculated for each transaction type for a given server and then added together to get the total for that server. The calculation was repeated for each server, including the physical network, which was treated as a special type of server for capacity planning purposes.
• If the number of machines needed was less than one, it represented the portion of a machine that was needed.
• The formula was as follows:
• Machine factor = R1/M1 + R2/M2 + R3/M3 + . . . + Rn/Mn
• Variable definitions in this formula were as follows:
• Machine factor specifies the portion of a machine, or the number of machines, needed for a given server.
• n specifies the number of transactions identified for the given server.
• R1 ... Rn specify the throughput requirements for transactions 1 through n.
• M1 ... Mn specify the maximum throughput measurements for transactions 1 through n.
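The machine-factor table can be reproduced directly from the formula. A minimal sketch, with hypothetical names; it treats "7+ MB/sec" as 7000 KB/sec, as the table does, and applies the same 20% increase.

```python
# Machine factor = R1/M1 + R2/M2 + ... + Rn/Mn per server, as on the slide,
# plus the 20% increase shown in the machine-factor table.
HEADROOM = 1.20

# server -> list of (throughput requirement, maximum measured throughput)
SERVERS = {
    "WebSEAL": [(15, 655), (5, 40), (45, 250)],
    "LDAP": [(5, 35)],
    "Backend Web server": [(65, 750)],
    "Internet network": [(780, 7000)],  # KB/sec; "7+ MB/sec" taken as 7000 KB/sec
    "Intranet network": [(781, 7000)],
}


def machine_factor(pairs):
    """Sum of requirement/measurement ratios for one server."""
    return sum(required / measured for required, measured in pairs)


if __name__ == "__main__":
    for server, pairs in SERVERS.items():
        mf = machine_factor(pairs)
        print(f"{server:<20} {mf:.2f}  +20%: {mf * HEADROOM:.2f}")
```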
106
Scaling for CPU Utilizations Less Than 100%
• It is possible to estimate maximum throughput from measurements where less than maximum throughput has been achieved, but it requires knowledge of the CPU utilization. It also results in a larger margin of error, since software systems do not always behave well when they reach or go beyond maximum throughput.
• Following is the formula for estimating maximum throughput from measured throughput at less than maximum:
• Maximum throughput = measured throughput / CPU utilization
• For example, if throughput is measured at 50% CPU utilization, the estimated maximum throughput is twice the measured rate, since only half (50%) the machine is utilized.
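The scaling rule is a single division. A minimal sketch with a hypothetical function name; as the text warns, it assumes throughput keeps growing linearly with CPU utilization, which may not hold near saturation.

```python
def max_throughput(measured_throughput, cpu_utilization):
    """Estimate throughput at 100% CPU as measured throughput / CPU utilization.

    cpu_utilization is a fraction in (0, 1]; the estimate assumes throughput
    grows linearly with CPU use, which real systems may not do near saturation.
    """
    if not 0 < cpu_utilization <= 1:
        raise ValueError("cpu_utilization must be in (0, 1]")
    return measured_throughput / cpu_utilization


if __name__ == "__main__":
    # Hypothetical measurement: 120 pages/sec observed at 50% CPU.
    print(max_throughput(120, 0.5), "pages/sec at 100% CPU (estimated)")
```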
107
Scaling for Hardware Differences
• One method for bridging hardware differences was to use published benchmarks, available from the following Web site:
• http://www.spec.org
• The Standard Performance Evaluation Corporation (SPEC) provides hardware manufacturers a way to publish the results of certain performance benchmarks.
• Since communication programs tend to behave like integer arithmetic programs, the benchmark of interest was SPECint. To find SPECint results, SPEC CPU95 or SPEC CPU2000 was selected on the Web site, followed by Submitted Results and then either SPECint95, SPECint_rate95, SPECint2000, or SPECint_rate2000.
108
Capacity Planning Case Study 4: Characterizing the Workload of a Corporate B2B Portal
109
Understanding the Problem
• Business case: business portal of a Fortune 100 semiconductor manufacturing company
• Roll out a new B2B portal
• Access for employees, partners, and suppliers
• First-year estimate: 10,000 people will use the portal
• The business goal is that by the end of 2014, 40,000 users will access the portal.
• Management wants to analyze the performance of the portal application to make sure the given SLA is not violated.
• Portal applications and services:
  • Registration
  • Login
  • Employee directory
  • HR
  • Health insurance payments
  • On-demand interactive training
  • Simple text to video and audio
  • Issuing PO, viewing PO, tracking payments
110
Other Portal Functions
• Accounts Payable• Ethics Reporting• Advanced Shipment Notification• Audiocasts• Commercial Invoices and Packing Lists (pdf)• Explore becoming a supplier• Info for potential suppliers• Vested Outsourcing Overview for Logistics Suppliers
111
Other Transactions
• New Users• Supplier Pages• Employee and Contingent Worker • Registered Users• Manage My Account • Need help? • Check out Frequently Asked Questions.
112
Tasks 1 and 2
• Task 1: Identify all possible Workloads
• Task 2: Characterize the workloads
113
SLA
• Response time of portal page views
• 90th or 95th percentile response time
• The 90th percentile response time of all portal transactions shall be within 3 seconds. This means that at most 10% of transactions may have a response time above 3 seconds, which makes the percentile a meaningful measure of the response time most users experience.
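Checking the SLA means computing a percentile of measured response times. The sketch below uses the nearest-rank definition, which is one of several in common use (some monitoring tools interpolate between samples instead); the function names are hypothetical.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Smallest sample such that at least pct% of all samples are <= it."""
    if not samples:
        raise ValueError("no response-time samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def meets_sla(response_times_s, pct=90, limit_s=3.0):
    """True if the pct-th percentile response time is within limit_s seconds."""
    return percentile_nearest_rank(response_times_s, pct) <= limit_s


if __name__ == "__main__":
    # Hypothetical samples (seconds): one slow outlier out of ten.
    samples = [0.5] * 9 + [10.0]
    print(percentile_nearest_rank(samples, 90), meets_sla(samples))
```

One slow request out of ten does not break a 90th-percentile SLA, while two slow requests (one above 3 seconds within the top 10%) would.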
114
Workload Characterization
• Common steps were:
  • specification of a point of view from which the workload will be analyzed;
  • choice of a set of relevant parameters;
  • monitoring the system;
  • analysis and reduction of performance data;
  • construction of a workload model.
115
Workloads in the System (extract them from the portal functions)
• Supplier_Registration
• etc
116
List of Workloads
• All portal functions shall be assigned to a list of workload classes. These workloads are defined according to the type of work being done on the system.
• All we did was define workloads based on the type of work being performed on the physical infrastructure.
117
Workload Partitioning
Workload Class | Workload Type
1 |
2 |
3 |
4 |
5 |
118
Workload Partitioning After the Portal Has Been Running for 3 Months
Workload Class | Intensity | Max CPU time (msec) | Max I/O time (msec)
1 | 12500 | 8 | 120
2 | 26% | 20 | 300
3 | 20% | 100 | 700
4 | 14% | 900 | 1200
5 | 10% | 3000 | 2800
119
Workload Partitioning Forecasting
Workload Class | 12 months | 18 months | 24 months
1 | 30% | 33% | 36%
2 | 26% | 20% | 22%
3 | 20% | 17% | 24%
4 | 14% | 18% | 7%
5 | 10% | 12% | 11%
120
Capacity Planning Process
121
Capacity Planning Process
• 1. Determine service level requirements
  a. Define workloads
  b. Determine the unit of work
  c. Identify service levels for each workload
• 2. Analyze current system capacity
  a. Measure service levels and compare to objectives
  b. Measure overall resource usage
  c. Measure resource usage by workload
  d. Identify components of response time
• 3. Plan for the future
  a. Determine future processing requirements
  b. Plan future system configuration
122
Methodology (figure): Understanding the Environment; Workload Characterization; Workload Model; Validation and Calibration; Valid Model; Workload Forecasting; Performance Model; Performance Prediction; Developing a Cost Model; Cost Model; Cost Prediction; Cost/Performance Analysis; Configuration Plan; Investment Plan; Personnel Plan.
123
Understanding the Environment
The goal is to learn what kind of:
• hardware (clients and servers)
• software (OS, middleware, and applications)
• network connectivity and protocols
are present in the environment.
124
Elements in Understanding the Environment
Element | Description
Client platform | Quantity and type
Server platform | Quantity, type, configuration and functions
Middleware | Type (e.g., TP monitors)
DBMS | Type
Application | Main types of applications, criticality, etc.
Network connectivity | Network diagrams with LANs, WANs, routers, servers, etc.
SLAs | Existing SLAs per application
Procurement procedures | Elements of the procurement process, expenditure limits, justification procedures for acquisitions
126
Workload Characterization
Workload characterization is the process of precisely describing the system’s global workload in terms of its main components.
The basic components are then characterized by intensity and service demand parameters at each resource of the system.
127
Workload Characterization Process
Global workload
• Workload component #1 (e.g., C/S transactions)
  • Basic component 1.1 (e.g., personnel transactions)
  • Basic component 1.2 (e.g., sales transactions)
• . . .
• Workload component #n (e.g., Web document requests)
  • Basic component n.1 (e.g., small HTML documents)
  • . . .
  • Basic component n.k (e.g., video requests)
128
Workload Description: example
Basic component | Parameter | Type
Sales transaction | Number of transactions submitted per client | WI
Sales transaction | Number of clients | WI
Sales transaction | Total number of I/Os to the Sales DB | SD
Sales transaction | CPU utilization at the DB server | SD
Sales transaction | Avg. messages sent/received by the DB server | SD
Web-based training | Avg. number of training sessions per day | WI
Web-based training | Avg. size of image files retrieved | SD
Web-based training | Avg. size of HTTP documents retrieved | SD
Web-based training | Avg. number of image files retrieved/session | SD
Web-based training | Avg. number of documents retrieved/session | SD
Web-based training | Avg. CPU utilization of the httpd server | SD
WI = workload intensity; SD = service demand
129
Workload Characterization: concepts and ideas
• The basic component of a workload refers to a generic unit of work that arrives at the system from external sources, such as a transaction, an interactive command, a process, or an HTTP request; which one applies depends on the nature of the service provided.
130
Workload Characterization: concepts and ideas
• Workload characterization:
  • A workload model is a representation that mimics the workload under study.
• Workload models can be used for:
  • selection of systems
  • performance tuning
  • capacity planning
131
Workload Description
(Figure: Levels of workload description. User: business description; Software: functional description; Hardware: resource-oriented description.)
132
Workload Description
• Business characterization: a user-oriented description that describes the load in terms such as number of employees, invoices per customer, etc.
• Functional characterization: describes programs, commands and requests that make up the workload
• Resource-oriented characterization: describes the consumption of system resources by the workload, such as processor time, disk operations, memory, etc.
133
Workload Models
• A model should be representative and compact.
• Natural models are constructed either from basic components of the real workload or from traces of the execution of the real workload.
• Artificial models do not use any basic component of the real workload:
  • Executable models (e.g., synthetic programs, artificial benchmarks, etc.)
  • Non-executable models, which are described by a set of parameter values that reproduce the same resource usage as the real workload.
134
Workload Models
• The basic inputs to analytical models are parameters that describe the service centers (i.e., hardware and software resources) and the customers (e.g. requests and transactions)
• component (e.g., transaction) interarrival times;
• service demands;
• execution mix (e.g., levels of multiprogramming).
135
Selection of characterizing parameters
• Each workload component is characterized by two groups of information:
• Workload intensity:
  • arrival rate
  • number of clients and think time
  • number of processes or threads in execution simultaneously
• Service demands (Di1, Di2, ..., DiK), where Dij is the service demand of component i at resource j.
136
Partitioning the workload
• Motivation: real workloads can be viewed as a collection of heterogeneous components.
• Partitioning techniques divide the workload into a series of classes such that their populations are composed of quite homogeneous components.
• What attributes can be used for partitioning a workload into classes of similar components?
137
Workload Partitioning: Internet Applications
Application Classes KB Transmitted
WWW 4,216
ftp 378
telnet 97
Mbone 595
Others 63
138
Workload Partitioning: Document Types
Document Class Percentage of Access (%)
HTML (html file types) 30
Images (e.g., gif or jpeg) 40
Sound (e.g., au or wav) 4.5
Video (e.g., mpeg, avi or mov) 7.3
Dynamic (e.g., cgi or perl) 12.0
Formatted (e.g., ps, dvi or doc) 5.4
Others 0.8
139
Workload Partitioning: Geographical Orientation
Classes Percentage of Total Requests
East Coast 32
West Coast 38
Midwest 20
Others 10
140
Calculating the class parameters
• How should one calculate the parameter values that represent a class of components?
• Averaging: when a class consists of homogeneous components concerning service demands, an average of the parameter values of all components may be used.
• Clustering of workloads is a process in which a large number of components are grouped into clusters of similar components.
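The slides do not name a clustering algorithm; k-means is one common choice, swapped in here for illustration. In this minimal sketch each workload component is a point such as (CPU time, I/O time), and each cluster's centroid is simply the average of its members, which ties clustering back to the averaging approach above. The function names and sample data are hypothetical; in practice parameters with very different magnitudes are usually normalized first.

```python
import math


def assign(points, centroids):
    """Group each point with its nearest centroid (Euclidean distance)."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        clusters[nearest].append(p)
    return clusters


def kmeans(points, initial_centroids, max_iter=50):
    """Basic k-means: alternate assignment and centroid (average) updates."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iter):
        clusters = assign(points, centroids)
        updated = [
            tuple(sum(axis) / len(members) for axis in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


if __name__ == "__main__":
    # Hypothetical components: (CPU msec, I/O msec) per request.
    components = [(8, 120), (10, 130), (9, 110), (900, 1200), (950, 1150), (3000, 2800)]
    centers, groups = kmeans(components, [(0, 0), (1000, 1000), (3000, 3000)])
    for c, g in zip(centers, groups):
        print(c, len(g), "components")
```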
141
Data Collection Issues
• How to determine the parameter values for each basic component?
Data collection facilities | Approach
None | Use benchmarks, industry practice, and ROTs (rules of thumb) only
Some | Use benchmarks, industry practice, ROTs, and measurements
Detailed | Use measurements only
142
Data Collection Issues: example
• The service demand at the server for a given application was 10 msec, obtained in a controlled environment on a server with a SPECint rating of 3.11.
• What would be the service demand if the server used in the actual system were faster and had a SPECint rating of 10.4?
ActualServiceDemand = MeasuredServiceDemand x ScalingFactor
ScalingFactor = ControlledResourceThroughput / ActualResourceThroughput
ActualServiceDemand = 10 * (3.11/10.4) = 3.0 msec.
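The scaling rule above can be written as a one-line function. A minimal sketch with a hypothetical name; it assumes the service demand is CPU-bound and scales inversely with the SPECint rating, which would not hold for demands dominated by disk or network time.

```python
def scale_service_demand(measured_ms, controlled_rating, actual_rating):
    """Scale a service demand measured on one server to another server.

    ScalingFactor = controlled rating / actual rating, so a faster server
    (higher rating) yields a proportionally smaller service demand.
    """
    if controlled_rating <= 0 or actual_rating <= 0:
        raise ValueError("ratings must be positive")
    return measured_ms * controlled_rating / actual_rating


if __name__ == "__main__":
    print(f"{scale_service_demand(10, 3.11, 10.4):.1f} msec")
```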
145
Workload Forecasting
• How will the number of e-mail messages handled per day by the server vary over the next 6 months?
• How will the number of hits to the corporate intranet’s Web server vary over time?
146
Workload Forecasting (cont’d)
• Answering these questions involves:
• evaluating the organization's workload trends;
• analyzing historical usage data;
• analyzing business or strategic plans;
• mapping plans into business processes (e.g., paperwork reduction will add 50% more e-mail).
• Workload forecasting techniques: moving averages, exponential smoothing, etc.
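The two techniques named above can each be sketched in a few lines. This is a minimal illustration, not a forecasting library; the function names and the sample history are hypothetical. Both methods produce a one-step-ahead forecast and tend to lag behind a steady upward trend, so they suit fairly stable workloads best.

```python
def moving_average_forecast(history, window):
    """Forecast the next period as the mean of the last `window` observations."""
    if not 0 < window <= len(history):
        raise ValueError("window must be between 1 and len(history)")
    return sum(history[-window:]) / window


def exponential_smoothing_forecast(history, alpha):
    """Simple exponential smoothing: level = alpha * x + (1 - alpha) * level,
    starting from the first observation; the final level is the forecast."""
    if not history:
        raise ValueError("history is empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = history[0]
    for x in history[1:]:
        level = alpha * x + (1 - alpha) * level
    return level


if __name__ == "__main__":
    # Hypothetical monthly e-mail volumes (thousands of messages per day).
    volumes = [100, 110, 120, 130]
    print(moving_average_forecast(volumes, 2))
    print(exponential_smoothing_forecast(volumes, 0.5))
```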
148
Performance Modeling and Prediction
• How are performance measures estimated?
(Figure: System and workload description -> performance metrics: response time, throughput, utilization, etc.)
149
Estimating Performance Measures
Queuing Network Model (figure):
• Inputs (system description):
  • system parameters
  • resource parameters
  • workload parameters: service demands, workload intensity
• Outputs (performance measures):
  • response time
  • throughput
  • utilization
  • queue length
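For the simplest case, a single-class open queuing network with load-independent queuing centers, the inputs and outputs above are linked by standard operational-analysis equations: the Utilization Law, the residence-time formula D/(1 - U), and Little's Law. The sketch below is one possible model under those assumptions (closed models with a fixed user population need a different solution technique, such as Mean Value Analysis); the function name and sample demands are hypothetical.

```python
def solve_open_qn(arrival_rate, demands):
    """Single-class open queuing network with load-independent queuing centers.

    arrival_rate is in requests/sec; demands maps resource -> service demand
    D_k in sec/request (total service time per request at that resource).
    """
    utilization, residence, queue_len = {}, {}, {}
    for k, d in demands.items():
        u = arrival_rate * d                  # Utilization Law: U_k = lambda * D_k
        if u >= 1:
            raise ValueError(f"resource {k} is saturated (U = {u:.2f})")
        utilization[k] = u
        residence[k] = d / (1 - u)            # residence time R'_k = D_k / (1 - U_k)
        queue_len[k] = arrival_rate * residence[k]  # Little's Law: N_k = lambda * R'_k
    return {
        "throughput": arrival_rate,           # in steady state X = lambda
        "utilization": utilization,
        "response_time": sum(residence.values()),
        "queue_length": queue_len,
    }


if __name__ == "__main__":
    # Hypothetical workload: 10 requests/sec, 50 msec CPU and 20 msec disk per request.
    print(solve_open_qn(10, {"cpu": 0.05, "disk": 0.02}))
```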
151
Validating Performance Models
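(Figure: Measurements from the real system (measured response time, throughput, etc.) are compared with calculations from the performance model (calculated response time, throughput, etc.). If the agreement is acceptable (*), the model is valid; if not, the model is calibrated.)
(*) Accuracy within 10 to 30% is acceptable in capacity planning (CP).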
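The acceptance test in the validation loop amounts to comparing relative errors against a tolerance. A minimal sketch, with hypothetical names and sample values; the tolerance is a parameter because the slide only gives a 10 to 30% range rather than a single threshold.

```python
def relative_error(calculated, measured):
    """|calculated - measured| / |measured|, e.g. 0.10 for a 10% deviation."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return abs(calculated - measured) / abs(measured)


def model_is_acceptable(results, tolerance):
    """results maps metric name -> (calculated, measured).

    Returns True when every metric's relative error is within tolerance;
    otherwise the model needs calibration.
    """
    return all(relative_error(c, m) <= tolerance for c, m in results.values())


if __name__ == "__main__":
    # Hypothetical comparison: response time in seconds, throughput in requests/sec.
    results = {"response_time": (1.2, 1.0), "throughput": (95.0, 100.0)}
    print(model_is_acceptable(results, 0.30), model_is_acceptable(results, 0.10))
```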
153
Basic Tools and Utilities Used
154
Sysstat Utilities
• The sysstat utilities are a collection of performance monitoring tools for Linux.
• These include sar, sadf, mpstat, iostat, pidstat and sa tools.
155
Sysstat Utilities
• iostat reports CPU statistics and input/output statistics for devices, partitions and network file systems.
• mpstat reports individual or combined processor-related statistics.
• pidstat reports statistics for Linux tasks (processes): I/O, CPU, memory, etc.
• sar collects, reports and saves system activity information (CPU, memory, disks, interrupts, network interfaces, TTY, kernel tables, etc.).
• sadc is the system activity data collector, used as a backend for sar.
• sa1 collects and stores binary data in the system activity daily data file. It is a front end to sadc designed to be run from cron.
• sa2 writes a summarized daily activity report. It is a front end to sar designed to be run from cron.
• sadf displays data collected by sar in multiple formats (CSV, XML, etc.). This is useful for loading performance data into a database or importing it into a spreadsheet to make graphs.
• nfsiostat reports input/output statistics for network file systems (NFS).
• cifsiostat reports CIFS statistics.
159
Find out who was monopolizing or eating the CPUs
• Finally, we needed to determine which process was monopolizing or eating the CPUs.
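On Linux, the per-process CPU counters that tools such as pidstat and top report are exposed in /proc/<pid>/stat. The sketch below takes two snapshots and ranks processes by CPU ticks consumed in between; it is a Linux-only illustration with hypothetical function names, not a replacement for pidstat. Field 14 (utime) and field 15 (stime) are used, with the command name parsed defensively because it may contain spaces.

```python
import os


def cpu_ticks(stat_line):
    """utime + stime (fields 14 and 15 of /proc/<pid>/stat), in clock ticks."""
    # Field 2 (the command name) is wrapped in parentheses and may contain
    # spaces, so split after the last ')'; the remaining list starts at field 3.
    rest = stat_line[stat_line.rindex(")") + 2:].split()
    return int(rest[11]) + int(rest[12])


def snapshot():
    """Map pid -> cumulative CPU ticks for every live process (Linux only)."""
    ticks = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                ticks[int(entry)] = cpu_ticks(f.read())
        except OSError:
            continue  # the process exited or is not readable
    return ticks


def top_consumers(before, after, n=5):
    """(pid, ticks) pairs for the n processes that used the most CPU in between."""
    deltas = [(pid, ticks - before.get(pid, 0)) for pid, ticks in after.items()]
    return sorted(deltas, key=lambda item: item[1], reverse=True)[:n]
```

Typical use: take `before = snapshot()`, wait a few seconds with `time.sleep(5)`, take `after = snapshot()`, then call `top_consumers(before, after)`. Dividing the tick counts by `os.sysconf("SC_CLK_TCK")` converts them to CPU seconds.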
161
iostat command
• “iostat” command reports CPU statistics and input/output statistics for devices and partitions.
• It can be used to find your system's average CPU utilization since the last reboot.
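The first iostat report covers the time since boot because it is derived from cumulative counters; on Linux those counters are in the aggregate "cpu" line of /proc/stat. The sketch below reproduces the idea from two samples; it is an approximation with hypothetical function names, and it counts iowait as not busy, whereas iostat reports %iowait as its own category.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into tick counters:
    user, nice, system, idle, iowait, irq, softirq, steal (guest columns skipped)."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:9]]


def busy_fraction(before, after):
    """Fraction of CPU ticks not spent idle or in iowait between two samples."""
    delta = [a - b for a, b in zip(after, before)]
    total = sum(delta)
    idle = delta[3] + delta[4]  # idle + iowait
    return (total - idle) / total if total else 0.0


def read_cpu_counters(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

Using all-zero counters as the "before" sample, `busy_fraction([0] * 8, read_cpu_counters())`, gives the average since boot; two readings taken a few seconds apart give the utilization over that interval.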