
TANDEM

Performance Management and SURVEYOR

Helaine Horwitz, Stephen Shugh, Scott Sitler

Technical Report 89.2
December 1988
Part Number: 15308


Authors: Helaine Horwitz, Scott Sitler, & Stephen Shugh; Customer Support Organization, Randy Baker, Vice President.

This document describes how SURVEYOR fits into performance management functions. While this document does not cover all aspects of performance management, it discusses some important ones.

This document is intended for Tandem systems analysts and Customer systems analysts familiar with general performance management concepts. Readers should be familiar with SURVEYOR, having used the program, read the User's Guide, and/or attended the SURVEYOR Seminar (Tandem education class number 14060).

For additional information on SURVEYOR, please refer to the SURVEYOR User's Guide (Tandem part number 84153) and the SURVEYOR Reference Manual (Tandem part number 85154).


Copyright Notice

Copyright © 1988 by Tandem Computers Incorporated.

All rights reserved. No part of this document may be reproduced in any form, including photocopying or translation to another language, without the prior written consent of Tandem Computers Incorporated.

The following are service marks or trademarks of Tandem Computers Incorporated:

6AX, BINDER, CROSSREF, DDL, ENABLE, ENCOMPASS, ENCORE, ENFORM, ENSCRIBE, ENVOY, ENVOYACP/XF, EXCHANGE, EXPAND, FASTSORT, FAXLINK, FOX, FOXII, GUARDIAN, GUARDIAN 90, GUARDIAN 90XF, INSPECT, Laser-LX, LXN, MEASURE, MULTILAN, NetBatch, NonStop, NonStop 1+, NonStop CLX, NonStop EXT, NonStop EXT10, NonStop EXT25, NonStop SQL, NonStop II, NonStop TXP, NonStop VLX, PATHMAKER, PCFORMAT, PC LINK, PERUSE, PC MAIL, PS TEXT EDIT, PS TEXT FORMAT, RDF, SAFE, SAFEGUARD, SAFE-T-NET, T-TEXT, TACL, TAL, Tandem, TGAL, THL, TIL, TMF, TRANSFER, VIEWPOINT, VLX, WPLINK, XL8

Document History

Part Number: 15308
Product Version: SURVEYOR C10
Operating System Version: C10
Date: September 1988

TANDEM COMPUTERS

Performance Management and SURVEYOR

TABLE OF CONTENTS

SECTION 1: Performance Management Overview

SECTION 2: The Role of SURVEYOR in Performance Management
    Defining Performance
    Workload Definition
    Information Collection
    Modeling
    Performance Monitoring

SECTION 3: Monitoring System Performance with SURVEYOR
    System Description
    Monitoring Processor Performance
    Monitoring Disk Subsystem Performance

SECTION 4: SURVEYOR as an Aid for Capacity Planning
    Consumption Modeling
    SURVEYOR Consumption Modeling
    Aggregation Schemes
    Extracting Information from SURVEYOR
    Model Building - An Example
    CPU Consumption Model
    Disk Consumption Model
    Forecasting Using the Consumption Model
    Summary

Conclusion


Section 1: Performance Management Overview

Performance Management is the process of assuring adequate computing service to individuals using the existing computer resources. A user expects a system to work (execute applications) in a timely and reliable fashion. The available resources must be managed to meet users' needs while maintaining a high level of system performance.

Performance management activities include:

• studying and understanding the computing environment
• defining system requirements and service objectives
• monitoring and measuring system performance and workloads
• maintaining historical performance data
• adjusting and tuning the systems
• predicting future performance requirements
• recommending hardware and software changes to improve performance

In most cases, performance management is part of a comprehensive computer installation/facilities management organization. Figure 1 shows the typical functions included.

As shown in Figure 1, the functions are interrelated. For example, capacity management is heavily dependent on performance management for performance information. If system performance begins to degrade, capacity management should be alerted so that additional resources can be ordered. The Performance Management organization must work with Network Management to ensure network delays do not cause a deterioration in service to users. Performance management depends on operations management to provide controlled operational access to the system(s).

Figure 1 provides a perspective on where performance management might fit into a computer installation/facilities management organization. In this perspective, capacity planning is not part of performance management. In other situations, capacity planning is addressed within the performance management organization and is not a separate organization. Other definitions place performance analysis and tuning within the capacity management organization. Even though there are different organizational perspectives, the need for performance management and capacity management is well defined.

The rest of this paper focuses on some of the important aspects of performance management and how SURVEYOR, a performance database manager, can assist in performance management.

In the following sections, "performance management" refers to the group of people performing the performance management functions. "Users" refers to any person or group of people using resources on the system.


Figure 1. Installation/Facilities Management.

CHANGE MANAGEMENT
• Evaluation
• Planning
• Testing
• Tracking

PROBLEM MANAGEMENT
• Detection & Collection
• Analysis & Resolution
• Tracking & Control

OPERATIONS MANAGEMENT
• Planning & Scheduling
• Operation & Control
• Analysis & Reporting

PERFORMANCE MANAGEMENT
• System Measurement
• Analysis
• Prediction
• Tuning

CAPACITY MANAGEMENT
• Workload Definition
• Forecasting
• Analysis & Reporting
• Planning

AVAILABILITY MANAGEMENT
• Evaluation
• Software Design
• Hardware Configuration
• Tracking & Control

DATABASE MANAGEMENT
• Design
• Administrative Control
• Operations & Performance
• Application Support

NETWORK MANAGEMENT
• Design
• Testing & Installation
• Operation
• Training

For more details on the installation/facilities management perspective shown in Figure 1, please see Capacity Planning Implementation, an IBM Technical Bulletin (GG22-9015-00, January 1979).



Section 2: The Role of SURVEYOR in Performance Management

SURVEYOR plays a key role in four important performance management functions:

System Measurement
• Collecting and storing workload and system resources information

Performance Monitoring
• Monitoring service level objectives and other performance indicators

Performance Analysis
• Reducing and summarizing performance information
• Providing input to performance models

Performance Prediction
• Trending and forecasting via a historical performance database
• Providing input to capacity planning models

Defining Performance: System Measurement

System performance refers to how the resources of a system respond as users run applications. The resources involved are processors, disks, communications lines, memory, etc. Before performance can be managed, the definition of performance must be established. The first step in defining performance is to understand current resource consumption and the performance of applications using system resources. By measuring and monitoring system resources, insights into resource consumption are acquired.

To help understand resource consumption and the performance characteristics of a system, some basic issues must be addressed:

• What are the system's average and peak processor utilizations?
• What are the average and peak disk utilizations?
• Is the system communication bandwidth fully utilized?
• Is the system swapping too much due to insufficient memory?
• Are long queues building up at any one resource?
• Are any resources under-utilized?
• What is the service time of transaction T?
• What impact (on system performance) is caused by workload X?
• Is the throughput of workload Z at an acceptable level?
• What is the average response time for users of application A?

Answers to these questions will provide some clues to the performance characteristics of the system under study. This information may indicate which system resources to measure and monitor.

Performance can be defined from two perspectives: the "service" or the "user" perspective. The questions above define performance from the service perspective. Users might have a different definition of performance and adequate service. So, defining exactly what "performance" and "adequate" service mean is the next step.


Service Level Objectives

The user community and the performance management organization must work together to define "adequate service." Users know intuitively what they need to get their job accomplished. For example, the user might view his computing needs in terms of orders per day. Performance management must understand this requirement in terms of system resources in order to meet this need.

The performance management organization must translate adequate service definitions into service level objectives (SLOs). SLOs represent the meaning of "adequate service" to performance management. They define values or ranges of values for performance indicators considered acceptable for the users. In most cases, SLOs address response time, availability, and/or throughput issues.

In the orders per day example above, an SLO may be a target RECEIVE-RATE for the "order" servers in the order processing application. This may be established by collecting and storing information on the "order" servers. For this example, the sum of the RECEIVE-RATE for all the order server processes provides the orders per day measurement to compare against the defined SLO.
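The comparison of summed receive rates against an SLO can be sketched in a few lines of Python. This is a minimal illustration, not SURVEYOR output: the server names, the sample rates, the assumption that RECEIVE-RATE is expressed in messages per second with one message per order, and the 8-hour business day are all hypothetical.

```python
# Assumed length of the order-entry day; not taken from the report.
SECONDS_PER_BUSINESS_DAY = 8 * 60 * 60

def orders_per_day(receive_rates, seconds=SECONDS_PER_BUSINESS_DAY):
    """Sum the per-server receive rates and scale to a daily order count."""
    return sum(receive_rates.values()) * seconds

def meets_slo(receive_rates, slo_orders_per_day):
    """Return True when the measured daily volume reaches the SLO target."""
    return orders_per_day(receive_rates) >= slo_orders_per_day

# Hypothetical "order" server processes and their average RECEIVE-RATE
# (messages per second, one message assumed to equal one order).
rates = {"$ORD1": 0.25, "$ORD2": 0.5, "$ORD3": 0.25}
print(orders_per_day(rates))     # daily order volume implied by the rates
print(meets_slo(rates, 25_000))  # compare against a hypothetical SLO target
```

The key point is that the SLO is stated in the users' terms (orders per day) while the measurement comes from a system-level data field, so the conversion between the two has to be agreed on and documented.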


Typically, SLOs are developed from either "rules of thumb" or from calculated expected values. Rules of thumb are general guidelines adopted to gauge performance such as: "run all processors at xx% utilization," or, "never let the queue length get above n." Though useful, these general guidelines should be used with caution. They are general to all environments, but not necessarily accurate for one particular environment.

Service level objectives should be based more on calculated expected values than on general rules of thumb. Calculated expected values may initially be derived from these general guidelines. But specific business volume and growth information should be factored in to calculate expected values. The calculated expected values usually develop over time as the users and performance management better understand the environment. Queueing network modeling theory and operational analysis can also help to define the expected values. Queueing network modeling is a methodology for the analysis of computer systems. (For more information on queueing network models, please read Quantitative System Performance - Computer System Analysis Using Queueing Network Models by Edward Lazowska et al., published by Prentice-Hall, 1984.)

An SLO must be tangible to both the user and performance management. In order for an SLO to be effective, it must be an objective, measurable goal. If the SLO isn't being measured or can't be measured, then it can't be managed. Both the users and performance management must agree on the definition and the method used to measure each SLO.

Service Level Agreements

The SLOs form the foundation of the service level agreement between the users and system management. A service level agreement (SLA) documents the service level objectives and provides performance management with a mechanism to measure their performance. With a service level agreement in place, systems management has better control over the computing environment. As new applications are installed or as the user work requirements change, the SLA is redefined and renegotiated.

Once the users and performance management have documented a service level agreement, SURVEYOR can track some of the SLOs. Tracking performance indicators with SURVEYOR is discussed later in this document.

Workload Definition

To meet and measure SLOs, performance management must understand the work imposed on system resources by the various user applications and requests. In short, they must understand the users' work to manage resource consumption and system performance.

Resource consumption is frequently stated in terms of workloads. A workload is a unit of computer work performed on behalf of users. It is usually defined from the users' perspective. A transaction or group of transactions executed via some application is also considered a workload. When an application is run, the work imposed on the system resources is categorized as an "application workload." The concept of a workload allows system resource usage to be apportioned among the applications.


Workload characterization is the process of identifying and quantifying the consumption of system resources into logical groups. Workloads are generally organized around functional groups. For example, the accounting organization of a business may have workloads defined along functional lines. All work associated with one function, such as payroll, would be characterized as one workload. Resource consumption for the payroll workload is logically grouped together.
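As a rough illustration of grouping resource consumption into logical workloads, the sketch below totals per-process processor time by functional group. The process names, the process-to-workload mapping, and the busy-time figures are hypothetical; SURVEYOR's own workload configuration is not shown here.

```python
from collections import defaultdict

# Hypothetical mapping of measured processes to functional workloads.
WORKLOAD_OF = {
    "$PAY1": "payroll",
    "$PAY2": "payroll",
    "$AR1": "accounts receivable",
}

def characterize(cpu_busy_by_process, workload_of=WORKLOAD_OF):
    """Group per-process CPU busy seconds into logical workloads."""
    totals = defaultdict(float)
    for process, busy_seconds in cpu_busy_by_process.items():
        # Anything not assigned to a workload is tracked as "other".
        totals[workload_of.get(process, "other")] += busy_seconds
    return dict(totals)

# Hypothetical CPU busy seconds observed for each process in one interval.
sample = {"$PAY1": 120.0, "$PAY2": 80.0, "$AR1": 45.0, "$MISC": 5.0}
print(characterize(sample))
```

Keeping an explicit "other" bucket makes it visible how much consumption has not yet been assigned to any workload.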

High-Level vs Low-Level Characterization

Workloads can be characterized from a "high level" global perspective or a "low level" detailed perspective. In some cases, an individual transaction is established as one workload. This is an example of a low level perspective. In other cases, it is more important to characterize high level workloads. Instead of characterizing a workload as a single transaction, the entire application is defined as a workload. The work done by all transactions executed through the application is grouped into one logical work unit.


Many pieces of information are used to characterize a workload. The choice of high-level or low-level characterization depends on the amount of detail needed about the particular workload or resource.

First, the business function and the expected end result of that function must be defined. Using the accounting example above, accounts receivable can be characterized as one business function. The end result of the accounts receivable function is to make sure all money expected is received and credited to the business. This defines the bounds of what could be part of the accounts receivable workload.

Once the bounds are established, the next level of detail involves understanding the type of work required by the system to achieve the expected end result. A user may submit many different types of small transactions; all of these transaction types could be grouped into one workload. This workload would provide performance information about the entire accounts receivable function as one logical unit. However, for more accurate accounting and increased control of the system resources, smaller individual units of work should be used instead. Each transaction type of the accounts receivable workload should be defined as one workload.


SURVEYOR Workloads

Workloads provide a simple way to account for resource consumption. Instead of tracking multiple performance data fields for many different entities, workloads provide logically grouped information on resource consumption.

In order to accurately measure a workload, performance management needs as much detail about the workload as possible. Some of the key pieces of information used to characterize workloads are shown below.

• How many users submit each type of transaction?
• How often? Are there peak periods?
• What individual transactions make up this business function?
• What system resources are used by each transaction (processor, disk, communication line, memory, etc.)?
• What is the path that each transaction takes to do its work? For example: from terminal, over communications line, to communications controller, to communications line handler, to terminal control program, to requester process, to server process, to disk process, to disk controller, to disk, and then back to the terminal.
• What work does each transaction do at each step of its path (computations, I/Os, etc.)?
• Knowing the transaction path, how much time is spent in the processor, on the communication line, on the disk, etc.?
• What percentage of shared resources (disk processes, terminal control programs, etc.) needs to be allocated to this transaction?
• What priorities are involved as the transaction moves through the system?

With this information documented, SURVEYOR can be configured to help monitor user workloads and system resource usage.

SURVEYOR provides a grouping function for monitoring and tracking workloads. In cases where the entire entity (disk volume, process, communication line, etc.) is associated with the workload, SURVEYOR will logically group the entire entity with the workload.

SURVEYOR also allows the apportioning of the entities within a workload. Once you define the workloads and understand how individual resources are shared, SURVEYOR can track those workloads with shared resources.

The Tandem TCP (a multi-threaded terminal control process) is one process that is usually shared among multiple transaction types. Therefore, if a workload exists for each transaction using the TCP, then each workload must be defined such that a portion of the TCP is allocated to each workload.

In the above TCP example, apportioning could be based on information obtained from the programmers. By understanding what type and how many SCOBOL (Screen COBOL) verbs are executed by each transaction, an apportioning scheme can be determined. Another form of apportioning is "message based apportioning." That is, the TCP resource is apportioned among the workloads based on the number of messages sent/received by key process(es) in the transaction path. The volume and frequency of each transaction flowing through the TCP play a role in apportioning the TCP among the workloads.
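Message based apportioning amounts to a proportional split. The sketch below divides a shared resource's utilization among workloads according to their message counts; the workload names, the counts, and the 40% TCP utilization are hypothetical values chosen for illustration, and the function is not part of SURVEYOR.

```python
def apportion_by_messages(shared_utilization, messages_by_workload):
    """Split a shared resource's utilization in proportion to message counts."""
    total = sum(messages_by_workload.values())
    if total == 0:
        # No traffic observed: nothing to attribute to any workload.
        return {name: 0.0 for name in messages_by_workload}
    return {
        name: shared_utilization * count / total
        for name, count in messages_by_workload.items()
    }

# Hypothetical figures: the TCP used 40% of a processor, and key servers for
# three transaction types exchanged these message counts in the interval.
shares = apportion_by_messages(40.0, {"order": 600, "inquiry": 300, "update": 100})
print(shares)
```

A verb-count scheme would have the same shape, with the weights taken from the number of SCOBOL verbs each transaction executes instead of from message counts.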


A variety of ways exist to do resource and entity apportioning. These are some examples of apportioning a shared resource. Use SURVEYOR to apportion individual entities within a workload that use shared resources.

Information Collection

With a service level agreement in place, SLOs clearly defined and quantified, and workloads characterized appropriately, SURVEYOR can be used to help manage the performance of the system and help performance management deliver "adequate service" to its user community. The information collected by SURVEYOR can tell performance management whether they are meeting the established SLOs, where a potential performance problem exists, which resources are most heavily used, and which workloads impose the greatest strain on system resources.

When deciding what performance information to collect and store, think about the commonly asked questions about system and workload performance. Use those questions to help decide what performance information to collect.

SURVEYOR can collect and store information about:

• resource utilizations,
• transaction rates,
• service times,
• message and byte rates,
• resource queueing, and
• within-the-system residence times.


Where to start

At a minimum, performance data should be collected for the processors, processes within the processors, disks and communication lines. A good starting point is to use the default configuration for MEAS-ATTR and MASK-ATTR provided by SURVEYOR. The CPU, DISC, PROCESS and LINE entities should be selected in the SURVEYOR MEAS-ATTR configuration section. The default entity selection in SURVEYOR provides the basic configuration needs for performance management.

If the initial performance information does not help answer the performance related question, then more information needs to be collected. The SURVEYOR MEAS-ATTR and MASK-ATTR configurations should be adjusted to collect additional performance information. Over time, a better understanding of the computing environment will allow performance management to be selective in what performance information is collected and stored. A more complex and detailed configuration using SURVEYOR can be achieved when a thorough understanding of the computing environment exists.


SURVEYOR Data Fields

SURVEYOR provides a comprehensive set of performance data fields associated with each entity. Only those data fields required for a specific environment need be saved. Unless configured otherwise, SURVEYOR uses its default set of data fields to specify what is collected and stored for each entity.

CPU Entity

For the SURVEYOR CPU entity, the AVERAGE-UTIL or the TOTAL-UTIL data field shows the utilization for the processor(s). The AVERAGE-QUEUE data field provides information on how many processes are waiting in the queue to use the processor. The SWAP-RATE tells how often memory needed to be swapped in or out. The DISPATCH-RATE is an indicator of how many processes are serviced by the processor. The DISC-IO-RATE shows the amount of disk activity attributed to the processor. The USER-PROC-UTIL and SYS-PROC-UTIL data fields show how much time the processor spent executing user processes and system processes. INTERRUPT-UTIL tells the amount of time the processor was executing interrupt handler processes. This data field is useful when trying to apportion system interrupt activity among multiple applications or workloads.

DISK Entity

Use the SURVEYOR DISC entity to measure consumption of disk resources on the system. DISC-UTIL and AVERAGE-QUEUE data fields provide information on how busy the disk is. By collecting performance information on logical and physical I/Os, performance management can understand what type of work the disk is doing. READ-IO-RATE and WRITE-IO-RATE provide information on the type of work being requested of a particular disk. The comparison of PHYSICAL-IO-RATE to LOGICAL-IO-RATE provides information on how many physical disk I/Os are needed for a logical or application I/O. Cache hit and miss rates are useful in determining necessary adjustments to cache memory size and activity on the disk. Cx-HIT-RATE and Cx-MISS-RATE could be monitored to further understand how the applications use the disk.
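Two of the disk comparisons above reduce to simple ratios. The sketch below is an illustration only: the function names and the sample rates are hypothetical, and the inputs are assumed to be rates per second taken from the physical/logical I/O and cache hit/miss data fields for one disk volume.

```python
def physical_per_logical(physical_io_rate, logical_io_rate):
    """Average number of physical disk I/Os needed per logical I/O."""
    if logical_io_rate == 0:
        return 0.0
    return physical_io_rate / logical_io_rate

def cache_hit_ratio(hit_rate, miss_rate):
    """Fraction of cache references that were hits."""
    total = hit_rate + miss_rate
    return hit_rate / total if total else 0.0

# Hypothetical one-hour averages for a single disk volume (per second).
print(physical_per_logical(physical_io_rate=30.0, logical_io_rate=20.0))
print(cache_hit_ratio(hit_rate=45.0, miss_rate=15.0))
```

Tracking these ratios over time, rather than the raw rates alone, makes it easier to see whether a change in disk load comes from more application requests or from less effective caching.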

PROCESS Entity

PROCESS-CPU-UTIL of the SURVEYOR PROCESS entity indicates how much processor time a particular process used. The READY-TIME divided by PROCESS-CPU-UTIL shows the ratio of the time waiting for processor time to the time actually doing some work. The SEND-RATE and RECEIVE-RATE are measured to help understand the frequency of data passed into and out of a process. The PAGE-FAULT-RATE is monitored to know who is causing memory paging/swapping. The AVERAGE-QUEUE data field can indicate that a process is not getting enough processor time to complete all the work requested of it.
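The ready-to-busy ratio described above can be used to pick out processes that wait for the processor longer than they run. In this sketch the process names, the sample values, and the cutoff of 1.0 are hypothetical, and both inputs are assumed to be expressed in the same unit (seconds per interval).

```python
def ready_to_busy_ratio(ready_time, cpu_busy_time):
    """Ratio of time spent waiting for the processor to time spent executing."""
    if cpu_busy_time == 0:
        return float("inf") if ready_time > 0 else 0.0
    return ready_time / cpu_busy_time

def flag_starved(processes, limit=1.0):
    """Return names of processes whose ratio exceeds the limit, sorted."""
    return sorted(
        name
        for name, (ready, busy) in processes.items()
        if ready_to_busy_ratio(ready, busy) > limit
    )

# Hypothetical (ready seconds, busy seconds) pairs for one interval.
sample = {"$SRV1": (2.0, 8.0), "$SRV2": (9.0, 3.0), "$TCP1": (4.0, 4.0)}
print(flag_starved(sample))
```

What counts as too high a ratio depends on the environment, so the limit is left as a parameter.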


LINE Entity

The LOGICAL-IO-RATE, READ-RATE, WRITE-RATE and TOTAL-BYTE-RATE of the SURVEYOR LINE entity provide performance information about communication line rates and utilizations.

Default Entities

Most of the data fields identified above are defaults set up within the MASK-ATTR configuration section of SURVEYOR. The default MEAS-ATTRs and MASK-ATTRs are recommended as a starting point for collecting and storing some basic performance information. In many cases, other data fields for specific entities are selected to provide additional information. For example, cache hit/miss information is useful performance information. The cache hit data fields are not defaults in SURVEYOR and must be manually configured. Please refer to the SURVEYOR reference material for more information on default entities and their values.

Modeling

A model is a representation of some existing or planned object. An equation, rules of thumb, a collection of atomic numbers, and a graph can each be considered a model. Modeling is often used to predict future values or the outcome of specific events and is based on empirical analysis or observed reactions. Combining historical values with the model can result in a prediction of the future. How reliable this prediction is depends on both the validity of the model and the integrity of the data or values used.


Performance management is often the focal point for information used in modeling. The modeling function itself is usually performed by the Capacity Planning organization. When planning capacity, historical information about system performance is critical. Growth of an on-line system can usually be expressed in terms of growth in either transactions for an application or additional applications. That growth information combined with baseline performance information forms a model to help a capacity planner project future resource needs.

SURVEYOR plays a key role in modeling. SURVEYOR not only collects, stores, and summarizes the historical information needed for modeling, it also provides certain modeling functions. Workloads provide the logical grouping needed to simplify and organize the resource consumption data into manageable units. Aggregation schemes are the SURVEYOR mechanism to provide data summarization. An aggregation scheme is configured to specify what, when, and how summarization is to occur.
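The idea behind summarization can be sketched without reference to SURVEYOR's configuration syntax, which is not reproduced here. The example below reduces detailed samples to one average and one peak per day; the choice of daily granularity, the two statistics, and the sample values are assumptions made for illustration.

```python
from statistics import mean

def summarize_by_day(samples):
    """Reduce (day, value) samples to one (average, peak) pair per day."""
    by_day = {}
    for day, value in samples:
        by_day.setdefault(day, []).append(value)
    return {day: (mean(values), max(values)) for day, values in by_day.items()}

# Hypothetical hourly processor utilization samples (%), tagged by date.
hourly = [
    ("1988-09-01", 30.0), ("1988-09-01", 50.0), ("1988-09-01", 70.0),
    ("1988-09-02", 20.0), ("1988-09-02", 60.0),
]
print(summarize_by_day(hourly))
```

In these terms, the "what" is the data field being summarized, the "when" is the grouping period, and the "how" is the statistic applied to each group.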

System Modeling

In system modeling, the existing or planned object is a computer system, the applications within the system, or individual resources used by the applications. The values are historical data about the entity being modeled. For example, a model can be a simple relationship between server message receive rates and average system utilization. Based on past performance, a linear relationship can exist between two performance indicators, throughput and utilization. As system throughput increases, the average system utilization increases proportionally.


Typically, models are used to predict performance, cost, or size of a system, application, or a resource. Three frequently used types of models are: linear regression, consumption, and costing. Each model provides specific information to logically represent an aspect of an object. A linear regression model could use performance indicators to represent the overall system performance in terms of utilization. Workload performance information could be used as input to a consumption model. A combination of user statistics and system hardware/software pricing structures could be combined to represent the cost of a system.

Linear Regression

The forecast command of SURVEYOR provides linear regression modeling. The SURVEYOR forecast command uses historical data to predict future values. Linear regression is a technique that takes a set of data points and fits a straight line to the data by minimizing the distance between the data points and the line.
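The technique can be illustrated with an ordinary least-squares fit, which minimizes the sum of squared vertical distances between the points and the line. This is a generic sketch, not the implementation behind the SURVEYOR forecast command; the function names and the sample history are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def forecast(xs, ys, future_x):
    """Extend the fitted line to a future x value."""
    slope, intercept = fit_line(xs, ys)
    return intercept + slope * future_x

# Hypothetical history: week number versus average processor utilization (%).
weeks = [1, 2, 3, 4, 5]
util = [40.0, 42.0, 44.0, 46.0, 48.0]
print(fit_line(weeks, util))      # slope and intercept of the trend line
print(forecast(weeks, util, 10))  # projected utilization for week 10
```

The sample history here is perfectly linear, so the fitted line passes through every point; real measurements scatter around the line, and the forecast is only as good as the linear assumption.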

For SURVEYOR's linear regression to be "accurate," the historical data should have a linear shape. Not all systems have activity that grows in a linear fashion. Many businesses have seasonal trends. If the historical data does not have a linear "look" to it, then linear regression may not be appropriate. Examples of a nonlinear, a linear, and a seasonal data set are shown in Figure 2.

[Figure 2: three scatter plots of historical data.]

Figure 2a. Nonlinear data set.

Figure 2b. Linear data set.

Figure 2c. Seasonal data set.


While linear regression is the forecasting technique used by SURVEYOR, that should not imply that the technique is applicable to all computer systems. Other techniques may be better suited to model a particular system that performs in a non-linear fashion. Queueing network models (mentioned earlier in this document) and consumption models (an example is provided later in this document) are options that could be used.

When linear regression modeling is appropriate for a particular system or application, performance management must select the basis of the model. That is, a linear regression model uses historical data to predict future values. A particular type of value (specifically a SURVEYOR performance data field) must be selected on which to collect historical data and then predict the future values.

The AVERAGE-UTIL or the TOTAL-UTIL (of the CPU entity) are used in some cases as the basis for the model. Another indicator used to forecast is the RECEIVE-RATE of a key process of the application. Which performance data field is used as the basis of the linear regression model depends on the environment. The model provided by SURVEYOR can be based on any performance data field stored in the performance database.

As historical data is collected and stored about a particular data field, SURVEYOR uses this information to predict future values. SURVEYOR provides a high and a low value for each expected value presented. The high and low values form the 90% confidence range when using the model as a predictive tool.
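This document does not state how SURVEYOR computes the high and low values. As a point of reference only, a standard textbook 90% prediction interval for a simple linear regression on n historical points (x_i, y_i), with fitted line y = a + bx and forecast point x_0, is:

```latex
\hat{y}_0 \pm t_{0.95,\,n-2}\; s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}},
\qquad
\hat{y}_0 = a + b x_0,
\qquad
s = \sqrt{\frac{\sum_{i=1}^{n}\left(y_i - a - b x_i\right)^2}{n-2}}
```

Here t is the 95th percentile of Student's t distribution with n - 2 degrees of freedom. Under this formula the range widens as the forecast point moves farther from the mean of the historical x values, so longer-range forecasts carry wider ranges.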



[Figure 3: plot of historical values followed by forecasted values, with the confidence range shown around the forecast.]

Figure 3. Example of forecast with confidence range.

Consumption Modeling

Consumption modeling is another type of modeling. SURVEYOR can provide the inputs to a consumption model. Consumption modeling is an analytical modeling mechanism. This type of model examines combinations of processes that use system resources to complete a transaction. A later section of this document discusses what information SURVEYOR can provide and how to use that information to produce a consumption model.

Performance Monitoring

Monitoring system and workload performance is a key part of performance management. Using SURVEYOR, performance data is regularly collected and stored. Depending on the amount of performance information collected, monitoring system performance can be extremely time consuming and labor intensive. Performance monitoring looks only at key performance indicators. When irregularities are identified, performance management must be alerted. Typically, alerts are sent to operations consoles and other on-line monitoring devices. Exception reports can also be generated for performance management to review. The concept of performance monitoring is to identify and resolve potential problems before the user community is negatively affected.


Ideally, automated mechanisms should be used to monitor system and workload performance. The automated mechanism should produce the alerts and exception reports for performance management.

Some tools exist to provide on-line monitoring of system performance. VIEWSYS, NSS and Enlighten are examples of such monitoring tools. SURVEYOR provides an additional monitoring tool. SURVEYOR's threshold and exception reporting features help performance management monitor system and workload performance.

High, low, and display threshold values are established by performance management. Thresholds define ranges of values that indicate a potential problem to performance management. These thresholds can be adjusted as needed. Thresholds can be established for any performance data field collected and stored by SURVEYOR. Utilizations, queue lengths, send/receive message rates and swapping rates are typical performance indicators with exception reporting thresholds defined. Usually, performance data is collected and stored on a daily basis. SURVEYOR examines the day's performance indicators looking for performance irregularities. SURVEYOR can initiate exception reports to alert performance management about indicators that fall outside the expected ranges.

The actual value or range of values defined as the threshold is dependent on a particular environment. If a particular system has no excess resource capacity, then threshold values are set to alert performance management as soon as a capacity increase occurs. If excess resource capacity does exist, then the threshold value can be set to alert performance management only when some of the excess capacity gets used up. SURVEYOR allows performance management to establish threshold values for the performance indicators peculiar to a particular environment.

All of the standard performance reports supplied by SURVEYOR can be turned into exception reports. With thresholds defined, SURVEYOR provides reports that show only the exception data. Performance management can better utilize its time by examining exception reports instead of looking for performance anomalies in detailed performance data.

A section with examples of using SURVEYOR to monitor system performance appears later in this document. Examples are given for monitoring processor and disk activity.

Performance Database Management

Performance management must collect performance data on a regular basis. That performance data must be stored over time for trending purposes. Reporting on current as well as historical data is a requirement. Because of the volumes of performance data involved, performance management must reduce the detailed performance data to a manageable amount. Performance management needs a performance database (PDB) and a performance database manager to help with these functions. SURVEYOR provides all the necessary PDB management functions.



SURVEYOR is an important tool used by performance management. SURVEYOR provides the following features to help manage performance information:

• automated or manual data collection, reduction, and storage
• long term storage of performance data
• database maintenance of files / tables / indexes
• archiving and retrieving PDB data
• standard and customized threshold and exception reporting
• statistical functions for summarization
• workload tracking
• data exporting to PCs

The SURVEYOR performance database is a benefit to performance management. All the performance information, historical as well as current, is centrally located for easy access by performance management and capacity planners. Beyond providing a PDB, SURVEYOR automates some key functions involved in performance management. The following sections illustrate two of those key functions: monitoring system performance and capacity planning.


Section 3: Monitoring System Performance with SURVEYOR - An Example

SURVEYOR can automate the everyday, time consuming process of monitoring system performance. When used to monitor system performance, SURVEYOR combines the best of MEASURE and ENFORM plus other new and useful functions. Using SURVEYOR, the user can automatically recognize system performance anomalies and then use MEASURE to conduct more detailed performance analysis.

Based on the system's particular characteristics, the user can configure reports that will be generated only if certain exceptions are met. For example, a processor report can be configured so a report will be generated only if processor swaps exceed one per second. Another could be configured for disk and generate a report only when disk busy exceeds 35%. Exception reports can be configured based on any data field (counter) stored in the PDB.

In this section, "performance monitoring" is defined as the process of recognizing system performance anomalies. "Performance analysis and tuning" is the process of identifying the cause of these system performance anomalies.

This section is not meant to be a guide for performance analysis and tuning. Every system has its own unique characteristics just as every analyst has their own "favorite" performance counters. The purpose of this section is to demonstrate how to use these counters and to encourage other ideas for customizing SURVEYOR.

Table 4-1 (Significant Counters and their Maximum Values) in the MEASURE User's Guide (part number 84157) is a good starting point for selecting performance counters for use with SURVEYOR. These guidelines may or may not be applicable to your specific environment.



System Description

The system used to generate and collect measurement data was a four processor TXP system with four mirrored volumes. Each of the four processors was a primary for one mirrored volume. The PATHWAY system was driven by terminal simulators and consisted of four different types of transactions.

The following examples discuss exception reports and how they can be used to monitor processor and disk performance. The processor example shows how common conditions can be detected with exception reports. The conditions are processor utilization imbalance, over-utilization, and memory pressure (swapping). These conditions were caused by transient activity during the steady state measurement period. The next set of examples deals with disk and shows how imbalances in the disk subsystem can be detected with exception reports. In this example, one volume had its cache set incorrectly.

Monitoring Processor Performance

The default CPU-STATS report is very useful for monitoring processor performance. As a first step, the report allows evaluation of processor resource "consumption," which includes, at a minimum, both utilization and memory usage. By specifying the workload ^ALL-CPUS, a CPU-STATS report can be produced reflecting a total system view of processor performance. The ^ALL-CPUS specification combines all processor data fields into one data unit.


Because all data from the last PDB update is to be included in the example, the FROM and TO parameters of the report object are not set. Data for all intervals of the last PDB update are included when FROM and TO are omitted. A CPU-STATS report of ^ALL-CPUS is produced as follows (the test system had four processors):

assume report
set detailed on
print cpu-stats ^all-cpus

SURV 1024 Warning. Since FROM, TO and FOR are all undefined only the data produced by the last operation will be used as the target data for the following report(s)

                  Processor         Average     Total   Average     Swap
Date        Time  No.        Type      Util      Util     Queue     Rate
1988-03-09  7:55  ^ALL-CPUS          17.530    70.123     0.874    2.876
1988-03-09  7:58  ^ALL-CPUS          22.208    88.832     1.130    1.920
1988-03-09  8:01  ^ALL-CPUS          14.152    56.611     0.747    1.765
1988-03-09  8:04  ^ALL-CPUS          33.448   133.792     1.947    0.621
1988-03-09  8:07  ^ALL-CPUS          36.203   144.815     2.149    0.388
1988-03-09  8:10  ^ALL-CPUS          44.956   179.826     3.670    1.150
1988-03-09  8:13  ^ALL-CPUS          61.133   244.532     5.117    0.000
1988-03-09  8:16  ^ALL-CPUS          62.314   249.256     5.478    0.400
1988-03-09  8:19  ^ALL-CPUS          61.415   245.660     5.233    0.000
1988-03-09  8:22  ^ALL-CPUS          63.257   253.030     5.633    2.533
1988-03-09  8:25  ^ALL-CPUS          67.597   270.388     6.547    5.138
1988-03-09  8:28  ^ALL-CPUS          61.173   244.695     5.172    0.038
1988-03-09  8:31  ^ALL-CPUS          61.695   246.782     5.170    0.005
1988-03-09  8:34  ^ALL-CPUS          61.022   244.088     5.178    0.000

The measurement period was from March 9th at 7:55 until 8:34 with a measurement interval of 3 minutes. It seems the system was in a steady state beginning with the 8:13 interval. The AVERAGE-UTIL column is TOTAL-UTIL divided by the number of processors in the system. AVERAGE-QUEUE and SWAP-RATE are system totals - not per processor. The AVERAGE-UTIL data field is useful when trying to determine how well balanced processors are. This usage is discussed later in the section.
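The relationship between the two utilization columns can be checked with a short Python sketch. The TOTAL-UTIL figures are copied from the ^ALL-CPUS report; the function and variable names are assumptions made for illustration and are not part of SURVEYOR.

```python
# Sketch: AVERAGE-UTIL is TOTAL-UTIL divided by the number of processors.
# Figures come from the ^ALL-CPUS report; names are illustrative only.
NUM_PROCESSORS = 4

total_util = {"8:13": 244.532, "8:16": 249.256, "8:34": 244.088}


def average_util(total: float, processors: int = NUM_PROCESSORS) -> float:
    """Return the per-processor average for a system-wide utilization total."""
    return total / processors


for interval, total in total_util.items():
    print(interval, round(average_util(total), 3))
```

For the three intervals shown, the computed values agree with the AVERAGE-UTIL column of the report (61.133, 62.314 and 61.022).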


Processor Reports - Analyzing Individual Processor Activity

By specifying processor numbers rather than ^ALL-CPUS, the CPU-STATS report lists data fields for each individual processor. This allows a more detailed look at processor performance. Analysis of the steady state environment is desired, so FROM and TO are specified. Note the processor specification of (0,1,2,3) rather than ^ALL-CPUS ("*" could be used also; it includes interval data for ^ALL-CPUS and individual processors).

CPU-STATS (0,1,2,3) lists each data field on a per processor basis. Therefore, AVERAGE-UTIL is equal to TOTAL-UTIL. Excluding the intervals of 8:22 and 8:25, processor utilization is fairly well balanced though processors 2 and 3 are slightly out of line with processors 0 and 1. Processors 0 and 1 are very near the CPU-STATS ^ALL-CPUS reported AVERAGE-UTIL of approximately 61%. Processors 2 and 3 are slightly lower and higher, respectively, than this average. From the CPU-STATS ^ALL-CPUS report presented earlier, a system wide increase in SWAP-RATE and TOTAL-UTIL occurred at 8:22 and 8:25. CPU-STATS (0,1,2,3) shows that this increased activity took place in processor 1. Processor 0 may have been affected also because of its slight increase in utilization compared with other intervals.

assume report
set detailed on
set from 1988-03-09 8:13
set to *
print cpu-stats (0,1,2,3)

                  Processor   Average     Total   Average     Swap
Date        Time  No.  Type      Util      Util     Queue     Rate
1988-03-09  8:13    0  TXP     60.349    60.349     1.248    0.000
1988-03-09  8:13    1  TXP     62.839    62.839     1.360    0.000
1988-03-09  8:13    2  TXP     56.855    56.855     1.143    0.000
1988-03-09  8:13    3  TXP     64.489    64.489     1.366    0.000
1988-03-09  8:16    0  TXP     61.971    61.971     1.348    0.000
1988-03-09  8:16    1  TXP     62.590    62.590     1.332    0.000
1988-03-09  8:16    2  TXP     56.661    56.661     1.128    0.000
1988-03-09  8:16    3  TXP     68.034    68.034     1.670    0.400
1988-03-09  8:19    0  TXP     61.135    61.135     1.305    0.000
1988-03-09  8:19    1  TXP     62.581    62.581     1.290    0.000
1988-03-09  8:19    2  TXP     57.268    57.268     1.176    0.000
1988-03-09  8:19    3  TXP     64.676    64.676     1.462    0.000
1988-03-09  8:22    0  TXP     63.314    63.314     1.360    0.000
1988-03-09  8:22    1  TXP     69.155    69.155     1.796    2.533
1988-03-09  8:22    2  TXP     56.419    56.419     1.088    0.000
1988-03-09  8:22    3  TXP     64.142    64.142     1.389    0.000
1988-03-09  8:25    0  TXP     68.880    68.880     1.631    0.000
1988-03-09  8:25    1  TXP     77.446    77.446     2.222    5.138
1988-03-09  8:25    2  TXP     58.369    58.369     1.215    0.000
1988-03-09  8:25    3  TXP     65.693    65.693     1.479    0.000
1988-03-09  8:28    0  TXP     60.491    60.491     1.243    0.000
1988-03-09  8:28    1  TXP     61.972    61.972     1.311    0.027
1988-03-09  8:28    2  TXP     56.268    56.268     1.120    0.000
1988-03-09  8:28    3  TXP     65.964    65.964     1.498    0.011
1988-03-09  8:31    0  TXP     60.149    60.149     1.198    0.000
1988-03-09  8:31    1  TXP     63.177    63.177     1.354    0.005
1988-03-09  8:31    2  TXP     58.081    58.081     1.181    0.000
1988-03-09  8:31    3  TXP     65.375    65.375     1.437    0.000
1988-03-09  8:34    0  TXP     58.976    58.976     1.202    0.000
1988-03-09  8:34    1  TXP     63.068    63.068     1.360    0.000
1988-03-09  8:34    2  TXP     56.216    56.216     1.113    0.000
1988-03-09  8:34    3  TXP     65.828    65.828     1.503    0.000



Exception Reports - Processor Utilization

Interpreting and applying exception reports well requires familiarity with the system under analysis. Familiarity means knowing typical levels of utilization of the system's components or, in the case of memory, knowing what memory pressure exists, if any. Once these areas and their typical values are identified, exception reports can be configured that produce data only when a certain data field exceeds a specified value. In the following examples, the exact threshold values selected are unimportant. What is important is the concept behind exception report configuration. To take full advantage of exception reports, the reader must be familiar with the characteristics of the system under analysis.

The CPU-STATS report is a good starting foundation for building a set of exception reports. A general guideline is that a TXP processor not exceed 60% to 70% busy. A CPU-STATS ^ALL-CPUS report can be configured to alert staff that the guideline has been exceeded. The exception would be based on TOTAL-UTIL. A four processor example would require the TOTAL-UTIL exception be set to 280% (4 x 70%). If based on AVERAGE-UTIL, a CPU-STATS report can help identify CPU load imbalances. This example is discussed below.
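The arithmetic behind the system-wide threshold can be sketched in Python. This is a hypothetical illustration: the function name is an assumption, the 70% figure is the upper end of the guideline quoted above, and the TOTAL-UTIL values are taken from the ^ALL-CPUS report.

```python
# Sketch: scale a per-processor guideline to a system-wide TOTAL-UTIL threshold.
# The function name is illustrative; 70% is the upper end of the TXP guideline.


def system_threshold(per_processor_limit: float, processors: int) -> float:
    """Return the ^ALL-CPUS TOTAL-UTIL value that corresponds to the guideline."""
    return per_processor_limit * processors


threshold = system_threshold(70.0, 4)

# System-wide TOTAL-UTIL for selected intervals of the ^ALL-CPUS report.
total_util = {"8:13": 244.532, "8:22": 253.030, "8:25": 270.388}
exceeded = [interval for interval, util in total_util.items() if util > threshold]

print(threshold, exceeded)
```

With these figures no interval exceeds the system-wide threshold, even though an individual processor is above 70% at 8:25. This is one reason a system total can hide an imbalance that per-processor reports reveal.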


The CPU-STATS ^ALL-CPUS report indicated that peak processor utilization occurred between 8:13 and 8:34. Average CPU utilization ranged from a low of 61.022 to a high of 67.597. Already mentioned was the curious rise in processor consumption between 8:22 and 8:25 (utilization and swaps). In this example, the AVERAGE-UTILs for these two intervals will be ignored because some unexpected event occurred as indicated by SWAP-RATE. In this test environment, swaps are not typical events so they will be investigated later. Excluding 8:22 and 8:25, the range of AVERAGE-UTIL during peak steady state load was 61.022 to 62.314. This is not to suggest ignoring certain intervals - they were ignored here so the following examples can best demonstrate how unusual intervals can be recognized.

Based on the "normal" range of AVERAGE-UTIL (61.022 to 62.314), a threshold of 65% TOTAL-UTIL is specified in the following example. CPU utilization imbalances can be recognized this way. Specifying a high TOTAL-UTIL threshold of 65% results in the exception report including only those processors more than 65% busy. True production systems might need a larger "margin" or another data field used as the exception.

When configuring customized reports, the order in which report fields are defined (via SET REPORT FIELD) determines the order, from left to right, in which they are displayed. Because CPU utilization is the data field of interest, it will be SET first.

A CPU-STATS exception report for the steady state period of March 9, 1988 from 8:13 through 8:34 is configured as follows:

Assume report
set Title "CPU Utilization Exception Report"
set Detailed On
set Field Cpu TOTAL-UTIL
set Field Cpu AVERAGE-QUEUE
set Field Cpu SWAP-RATE
set Field Cpu DISC-IO-RATE
set High Cpu TOTAL-UTIL 65
set From 1988-03-09 8:13
set To 1988-03-09 8:34
Print Cpu-stats (0,1,2,3)

CPU Utilization Exception Report

                  Processor     Total   Average     Swap   Disc IO
Date        Time  No.  Type      Util     Queue     Rate      Rate
1988-03-09  8:16    3  TXP     68.034     1.670    0.400    22.144
1988-03-09  8:22    1  TXP     69.155     1.796    2.533    21.527
1988-03-09  8:25    0  TXP     68.880     1.631    0.000    28.122
1988-03-09  8:25    1  TXP     77.446     2.222    5.138    21.272
1988-03-09  8:25    3  TXP     65.693     1.479    0.000    21.205
1988-03-09  8:28    3  TXP     65.964     1.498    0.011    21.444
1988-03-09  8:31    3  TXP     65.375     1.437    0.000    21.372
1988-03-09  8:34    3  TXP     65.828     1.503    0.000    21.738
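The selection that a high threshold performs can be sketched in Python by applying the same 65% limit to the per-processor TOTAL-UTIL values of the earlier CPU-STATS (0,1,2,3) report. This is a sketch of the filtering idea only, not of how SURVEYOR evaluates thresholds; the function and variable names are assumptions.

```python
# Sketch: keep only the (interval, processor) entries whose TOTAL-UTIL exceeds
# a high threshold. Data is from the CPU-STATS (0,1,2,3) report; names are
# illustrative and not SURVEYOR syntax.
HIGH_TOTAL_UTIL = 65.0

# TOTAL-UTIL for processors 0..3, keyed by interval.
total_util = {
    "8:13": (60.349, 62.839, 56.855, 64.489),
    "8:16": (61.971, 62.590, 56.661, 68.034),
    "8:19": (61.135, 62.581, 57.268, 64.676),
    "8:22": (63.314, 69.155, 56.419, 64.142),
    "8:25": (68.880, 77.446, 58.369, 65.693),
    "8:28": (60.491, 61.972, 56.268, 65.964),
    "8:31": (60.149, 63.177, 58.081, 65.375),
    "8:34": (58.976, 63.068, 56.216, 65.828),
}


def high_exceptions(data, high):
    """Return (interval, processor, util) for every value above the threshold."""
    return [(interval, cpu, util)
            for interval, utils in data.items()
            for cpu, util in enumerate(utils)
            if util > high]


rows = high_exceptions(total_util, HIGH_TOTAL_UTIL)
for interval, cpu, util in rows:
    print(interval, cpu, util)
```

The eight entries selected are the same interval and processor pairs that appear in the exception report above.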

As configured, this exception report reveals several anomalies. The first is processor utilization. It may be necessary to review the previous CPU-STATS reports to see the connection, but this system seems to have two distinct processor utilization "exceptions." Most significant is the rise in processor 1's utilization during the intervals for 8:22 and 8:25. There seems to be a relationship with swap activity during this time because SWAP-RATE increases significantly with the increase in utilization. Though processor 0 has but one entry in this exception report, review of the CPU-STATS (0,1,2,3) report shows an increase in its utilization that corresponds to, though is less marked than, the increase for processor 1. Perhaps a connection exists. Processor 3 is another matter. Relative to the other processors (excepting the intervals for 8:22 and 8:25), processor 3 is consistently busier than the other CPUs. Conversely, processor 2 is consistently less utilized. In the real world, the degree of imbalance between processors 2 and 3 is usually insignificant.

Exception Reports - Recognizing Memory Pressure

The previous example demonstrated an exception report based on utilization. The example that follows demonstrates how other data fields can be used as exception qualifiers for report generation. Note the SET order and the corresponding order of the report columns. There are many possibilities for exception reports. Exactly what is configured depends on what needs monitoring on the system.

Assume report
set Title "CPU Swaps Exception Report"
set Detailed On
set Field Cpu SWAP-RATE
set Field Cpu MEMORY-PAGES
set Field Cpu AVG-MEM-QUEUE
set Field Cpu TOTAL-UTIL
set High Cpu SWAP-RATE .5
set From 1988-03-09 8:13
set To 1988-03-09 8:34
Print cpu-stats (0,1,2,3)

CPU Swaps Exception Report

                  Processor      Swap     Memory        Avg     Total
Date        Time  No.  Type      Rate      Pages  Mem Queue      Util
1988-03-09  8:22    1  TXP      2.533   4096.000      0.040    69.155
1988-03-09  8:25    1  TXP      5.138   4096.000      0.075    77.446


Assume report
set Detailed On
set Field Process PAGE-FAULT-RATE
set Field Process AVERAGE-DURATION
set High Process PAGE-FAULT-RATE .001
set From 1988-03-09 8:13
set To *
Print process-stats *

To improve readability, SET ORDER-BY PROCESS PAGE-FAULT-RATE could have been specified, but since only a small set of entities satisfy the exception, ordering by time (the default) doesn't make the report difficult to read. As expected, the significant intervals are 8:22 and 8:25. SWAP-RATEs for all other intervals are insignificant.


Memory Pressure - Finding the Cause

The two previous exception reports indicated unusually high SWAP-RATEs during the intervals for 8:22 and 8:25. There are two principal reasons why swaps occur: the first is a true shortage of memory; the second is transient process activity. The data suggests transient activity because swapping was not evident during all interval periods. A true shortage of memory usually exhibits itself as continuous swap activity for all intervals and for all processes, including non-transients.
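The distinction drawn above can be expressed as a rough heuristic in Python. This sketch is an assumption layered on the text, not a SURVEYOR feature: the function name, the three labels, and the reuse of 0.5 swaps per second as the cutoff (borrowed from the swap exception report) are all illustrative choices. The swap rates are the system-wide SWAP-RATE values for 8:13 through 8:34.

```python
# Sketch: classify swap activity as sustained (every interval) or transient
# (only some intervals). The cutoff and labels are illustrative assumptions.


def swapping_pattern(swap_rates, cutoff=0.5):
    """Return 'none', 'transient' or 'sustained' for a series of swap rates."""
    busy = [rate > cutoff for rate in swap_rates]
    if not any(busy):
        return "none"
    if all(busy):
        return "sustained"
    return "transient"


# System-wide SWAP-RATE for the intervals 8:13 through 8:34.
steady_state = [0.000, 0.400, 0.000, 2.533, 5.138, 0.038, 0.005, 0.000]
print(swapping_pattern(steady_state))
```

A "sustained" result would point toward a true memory shortage, while "transient" matches the pattern seen in this measurement. Per-process data is still needed to confirm the cause.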

The following exception report is based on the PROCESS-STATS report. The PAGE-FAULT-RATE and AVERAGE-DURATION fields are selected because they will tell who is causing the swaps and whether the swaps are due to transient process creations. Specifying a HIGH exception of 0.001 will display only process entities that experienced swapping, however small the amount.


                  Process                             Page Fault    Average
Date        Time  Name     Program Filename                 Rate   Duration
1988-03-09  8:13  $TPB1    $SYSTEM.SYSTEM.PATHTCP2         0.038    179.973
1988-03-09  8:13  $TPC1    $SYSTEM.SYSTEM.PATHTCP2  (illegible)    179.940
1988-03-09  8:13  $TPD1    $SYSTEM.SYSTEM.PATHTCP2         0.022    180.050
1988-03-09  8:16           $SYSTEM.SYSTEM.EDIT             0.111      5.961
1988-03-09  8:16  $DELAY   $SYSTEM.SYSTEM.DELAY            0.011     70.724
1988-03-09  8:16  $REClO   $SYSTEM.SYSTEM.PATHTCP2         0.005    180.074
1988-03-09  8:16  $TPA1    $SYSTEM.SYSTEM.PATHTCP2         0.005    179.812
1988-03-09  8:16  $TPC1    $SYSTEM.SYSTEM.PATHTCP2         0.005    180.096
1988-03-09  8:16  $TPD1    $SYSTEM.SYSTEM.PATHTCP2         0.005    180.060
1988-03-09  8:16  ^MISC                                    0.333    151.736
1988-03-09  8:22  $TPB1    $SYSTEM.SYSTEM.PATHTCP2         0.005    179.825
1988-03-09  8:22  ^MISC                                    2.677     83.386
1988-03-09  8:25  $NCP     $SYSTEM.SYS00.OSIMAGE           0.011    179.933
1988-03-09  8:25  ^MISC                                    5.588     56.534
1988-03-09  8:28  $CONSOL  $SYSTEM.SYS00.OSIMAGE           0.005    179.366
1988-03-09  8:28  $DELAY   $SYSTEM.SYSTEM.DELAY            0.011    170.628
1988-03-09  8:28  ^MISC                                    0.011    160.287
1988-03-09  8:31  $REClO   $SYSTEM.SYSTEM.PATHTCP2         0.005    180.045

The entity causing the swaps to occur is shown as ^MISC. ^MISC is a default workload that includes many common system utilities such as FUP, PUP, BACKUP, RESTORE, etc. This example illustrates why it is a good idea not to delete the MEASURE data file after it is updated into the PDB. Using MEASURE and "windowing" into the correct interval, a set of reports could be produced for each program file included in the ^MISC workload and the actual cause of the swapping determined.

The usefulness of grouping processes into workloads cannot be over-emphasized. The value of this capability is demonstrated in the capacity planning section. Though in most cases not recommended, the following example shows what happens when the ^MISC workload is deleted. Deleting ^MISC is done as follows:

Assume configuration
Ungroup process ^misc
Alter configuration workload

After deleting the workload, the data for the period being analyzed will have to be deleted and re-updated (MEASURE data file needed). Once this is done, print the report again (notice the report configuration is identical to the previous example):


Assume report
set Detailed On
set Field Process PAGE-FAULT-RATE
set Field Process AVERAGE-DURATION
set High Process PAGE-FAULT-RATE .001
set From 1988-03-09 8:13
set To *
Print process-stats *

                  Process                             Page Fault    Average
Date        Time  Name     Program Filename                 Rate   Duration
1988-03-09  8:13  $TPB1    $SYSTEM.SYSTEM.PATHTCP2         0.038    179.973
1988-03-09  8:13  $TPC1    $SYSTEM.SYSTEM.PATHTCP2  (illegible)    179.940
1988-03-09  8:13  $TPD1    $SYSTEM.SYSTEM.PATHTCP2         0.022    180.050
1988-03-09  8:16           $SYSTEM.SYS00.COMINT            0.061      2.403
1988-03-09  8:16           $SYSTEM.SYS00.FUP               0.116      2.857
1988-03-09  8:16           $SYSTEM.SYS00.PUP               0.150      8.401
1988-03-09  8:16           $SYSTEM.SYSTEM.EDIT             0.111      5.961
1988-03-09  8:16  $DELAY   $SYSTEM.SYSTEM.DELAY            0.011     70.724
1988-03-09  8:16  $REClO   $SYSTEM.SYSTEM.PATHTCP2         0.005    180.074
1988-03-09  8:16  $TCLS    $SYSTEM.SYS00.TACL              0.005    179.500
1988-03-09  8:16  $TPA1    $SYSTEM.SYSTEM.PATHTCP2         0.005    179.812
1988-03-09  8:16  $TPC1    $SYSTEM.SYSTEM.PATHTCP2         0.005    180.096
1988-03-09  8:16  $TPD1    $SYSTEM.SYSTEM.PATHTCP2         0.005    180.060
1988-03-09  8:22           $SYSTEM.SYS00.COMINT            0.072     82.783
1988-03-09  8:22           $SYSTEM.SYS00.PUP               2.605      2.285
1988-03-09  8:22  $TPB1    $SYSTEM.SYSTEM.PATHTCP2         0.005    179.825
1988-03-09  8:25           $SYSTEM.SYS00.FUP               0.105      7.832
1988-03-09  8:25           $SYSTEM.SYS00.PUP               5.483      2.458
1988-03-09  8:25  $NCP     $SYSTEM.SYS00.OSIMAGE           0.011    179.933
1988-03-09  8:28           $SYSTEM.SYS00.COMINT            0.005      7.109
1988-03-09  8:28           $SYSTEM.SYS00.FUP               0.005      6.483
1988-03-09  8:28  $CONSOL  $SYSTEM.SYS00.OSIMAGE           0.005    179.386
1988-03-09  8:28  $DELAY   $SYSTEM.SYSTEM.DELAY            0.011    170.628
1988-03-09  8:31  $REClO   $SYSTEM.SYSTEM.PATHTCP2         0.005    180.045

The PAGE-FAULT-RATEs for $SYSTEM.SYS00.PUP speak for themselves. There is an obvious correlation between the processor SWAP-RATE and the process PAGE-FAULT-RATE for 8:22 and 8:25. In this example, the interval for 8:22 has a single entry for PUP where, in reality, 16 separate copies were run. During the 8:25 interval 40 copies were run. Uniquely naming each copy of PUP would have resulted in 56 separate entries (notice the entries for the TCPs). We can be fairly certain that a true shortage of memory doesn't exist because the AVERAGE-DURATION for the PUP processes is only a few seconds. That, plus the fact that no other processes were swapping, makes it a safe bet.

Deleting the ^MISC workload does not eliminate the need for future MEASURE analysis. As mentioned, a total of 56 separate copies of PUP were run. The exception report does not show what individual process fault activity was. Deleting ^MISC does narrow the program file set down to $SYSTEM.SYS00.PUP, however. MEASURE produces the reports necessary to show the unique memory consumption on a per process basis.

Other Processor Exception Reports

The following exception reports are based on the suggestions in Table 4-1 of the MEASURE User's Guide. The SET HIGH command identifies the data field the exception report is based on. Which other data fields are included in the report is entirely up to the user.

For queue lengths:

Assume report
set Detailed On
set Field Cpu AVERAGE-QUEUE
set Field Cpu TOTAL-UTIL
set Field Cpu SWAP-RATE
set Field Cpu DISC-IO-RATE
set Field Cpu CACHE-HIT-RATE
set High Cpu AVERAGE-QUEUE 1.0
Print cpu-stats (0,1,2,3)


For dispatches:

Assume report
set Detailed On
set Field Cpu DISPATCH-RATE
set Field Cpu TOTAL-UTIL
set Field Cpu USER-PROC-UTIL
set Field Cpu SYS-PROC-UTIL
set Field Cpu INTERRUPT-UTIL
set Field Cpu PROC-OVHD-UTIL
set Field Cpu SEND-UTIL
set High Cpu DISPATCH-RATE 300          - TXP VALUE
Print cpu-stats (0,1,2,3)

Monitoring Disk Subsystem Performance

Just as exception reports were useful for monitoring processor performance, exception reports are equally useful for monitoring disk subsystem performance. As did the preceding processor examples, the following examples will demonstrate how exception reports can aid performance management of the disk subsystem.

Due to the nature of mirrored volumes, some data fields are reported on a logical volume basis rather than a per physical spindle basis. See the SURVEYOR Reference Manual for further details.

A different MEASURE data file was used in the following disk analysis. The steady state period began at 7:21 and was determined in the same manner as it was in the processor examples. All the subsequent examples are for the steady state period.

Disk Reports - Analyzing Individual Disk Spindle Activity

The following example is a customized version of the DISC-UTIL-SUMMARY report. To best report physical spindle activity, DISC-UTIL was omitted and PRIMARY-DISC-IOS and MIRROR-DISC-IOS were added. This permits better correlation with the guidelines in the MEASURE User's Guide (Table 4-1). PRIMARY-DISC-IOS and MIRROR-DISC-IOS include seeks and therefore do not represent just read and write counts. Because seeks are included in these data fields, do not try to "match" processor DISC-IO-RATEs with those of the disk. The processor DISC-IO-RATE data field does not include seek activity. This is not unlike MEASURE.

The data fields PRIMARY-UTIL, MIRROR-UTIL, PRIMARY-DISC-IOS, and MIRROR-DISC-IOS provide the necessary data to configure reports based on Table 4-1 in the MEASURE User's Guide (TXP disk busy not to exceed 35%, TXP disk rate not to exceed 25).

                  Disc      Primary    Mirror    Primary     Mirror
Date        Time  Name         Util      Util   Disc IOs   Disc IOs
1988-03-09  7:21  $D1        49.376    55.287     35.354     42.416
1988-03-09  7:21  $D2        26.557    31.506     20.126     31.326
1988-03-09  7:21  $DATA      14.578    18.159     10.160     15.110
1988-03-09  7:21  $SYSTEM    26.206    31.815     18.393     29.338
1988-03-09  7:24  $D1        50.885    56.607     36.154     42.848
1988-03-09  7:24  $D2        24.502    29.896     18.810     30.032
1988-03-09  7:24  $DATA      15.719    18.903     10.627     16.315
1988-03-09  7:24  $SYSTEM    29.423    34.793     20.126     30.399
1988-03-09  7:27  $D1        49.265    54.631     35.288     42.820
1988-03-09  7:27  $D2        24.542    29.707     18.455     28.927
1988-03-09  7:27  $DATA      13.666    16.910      9.549     14.538
1988-03-09  7:27  $SYSTEM    29.573    34.244     20.960     30.866
1988-03-09  7:30  $D1        50.342    56.223     36.104     43.337
1988-03-09  7:30  $D2        24.799    29.932     18.976     30.254
1988-03-09  7:30  $DATA      14.393    16.872      9.716     14.160
1988-03-09  7:30  $SYSTEM    29.018    34.095     20.293     30.226
1988-03-09  7:33  $D1        46.947    53.543     33.620     41.605
1988-03-09  7:33  $D2        24.606    29.964     18.971     29.410
1988-03-09  7:33  $DATA      14.231    16.820      9.860     14.920
1988-03-09  7:33  $SYSTEM    28.051    33.150     19.522     30.094
1988-03-09  7:36  $D1        47.822    54.966     34.609     42.483
1988-03-09  7:36  $D2        26.389    31.790     20.121     32.055
1988-03-09  7:36  $DATA      15.338    18.388     10.338     15.715
1988-03-09  7:36  $SYSTEM    27.363    32.720     18.977     29.909
1988-03-09  7:39  $D1        49.633    56.412     35.688     43.144
1988-03-09  7:39  $D2        25.112    30.139     19.177     30.588
1988-03-09  7:39  $DATA      15.352    18.221     10.493     15.670
1988-03-09  7:39  $SYSTEM    26.143    31.820     18.305     29.605

Assume report
set Detailed On
set Field Disc PRIMARY-UTIL
set Field Disc MIRROR-UTIL
set Field Disc PRIMARY-DISC-IOS
set Field Disc MIRROR-DISC-IOS
set From 1988-03-09 7:21
set To *
Print disc-util-summary ($d1, $d2, $data, $system)

Clearly, $D1 is more utilized than the other volumes. Utilization for both primary and mirror is nearly double that of other busy volumes. I/O rates (seeks included) are much higher than those for other volumes. This degree of imbalance could result in a significant performance degradation.
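The degree of imbalance can be quantified with a short Python sketch that averages PRIMARY-UTIL per volume over the steady state intervals and compares the busiest volume with the next busiest. The data is taken from the report above; the variable names and the choice of a simple mean are assumptions made for illustration. The 35% figure is the TXP disk busy guideline cited earlier.

```python
# Sketch: compare average PRIMARY-UTIL across volumes for 7:21 through 7:39.
# Data is from the customized DISC-UTIL-SUMMARY report; names are illustrative.
from statistics import mean

DISK_BUSY_GUIDELINE = 35.0  # percent, TXP guideline cited from Table 4-1

primary_util = {
    "$D1":     [49.376, 50.885, 49.265, 50.342, 46.947, 47.822, 49.633],
    "$D2":     [26.557, 24.502, 24.542, 24.799, 24.606, 26.389, 25.112],
    "$DATA":   [14.578, 15.719, 13.666, 14.393, 14.231, 15.338, 15.352],
    "$SYSTEM": [26.206, 29.423, 29.573, 29.018, 28.051, 27.363, 26.143],
}

averages = {volume: mean(values) for volume, values in primary_util.items()}

# Rank volumes from busiest to least busy and compare the top two.
ranked = sorted(averages, key=averages.get, reverse=True)
busiest, runner_up = ranked[0], ranked[1]
ratio = averages[busiest] / averages[runner_up]

over_guideline = [v for v, avg in averages.items() if avg > DISK_BUSY_GUIDELINE]

print(busiest, runner_up, round(ratio, 2), over_guideline)
```

On this data $D1 is the only volume whose average exceeds the guideline, and it runs at well over one and a half times the utilization of the next busiest volume.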

Exception Reports - Disk Utilization

There are many options when configuring disk exception reports. One option is to configure the exception report based on either PRIMARY-UTIL or MIRROR-UTIL.



For all intervals, only $D1 exceeds the threshold of 35% PRIMARY-UTIL.

Assume report
Set Detailed On
Set Field Disc PRIMARY-UTIL
Set Field Disc MIRROR-UTIL
Set Field Disc PRIMARY-DISC-IOS
Set Field Disc MIRROR-DISC-IOS
Set High Disc PRIMARY-UTIL 35
Set From 1988-03-09 7:21
Set To *
Print disc-util-summary ($d1, $d2, $data, $system)

Usually, only in read intensive environments do PRIMARY-UTIL and MIRROR-UTIL differ significantly. Exception reports based on DISC-UTIL potentially can mask an overly busy physical spindle. In these environments other exception reports based on different data fields can close such "gaps" and complement one another. Complementary exception reports may be as simple as having one based on PRIMARY-UTIL and another based on MIRROR-UTIL. Use what best fits the characteristics of the system under measurement. The following exception report is based on PRIMARY-UTIL exceeding 35%. This report makes $D1's over-utilization much more obvious.

                  Disc      Primary    Mirror    Primary     Mirror
Date        Time  Name         Util      Util   Disc IOs   Disc IOs
1988-03-09  7:21  $D1        49.376    55.287     35.354     42.416
1988-03-09  7:24  $D1        50.885    56.607     36.154     42.848
1988-03-09  7:27  $D1        49.265    54.631     35.288     42.820
1988-03-09  7:30  $D1        50.342    56.223     36.104     43.337
1988-03-09  7:33  $D1        46.947    53.543     33.620     41.605
1988-03-09  7:36  $D1        47.822    54.966     34.609     42.483
1988-03-09  7:39  $D1        49.633    56.412     35.688     43.144

Disk Reports - Imbalances in Disk Subsystem Utilization

Like MEASURE, the SURVEYOR DISCOPEN report shows the "physical" view of activity on a disk. Unlike the FILE report, which doesn't report on secondary partitions, DISCOPEN reports all file activity on a disk. Secondary partitions, as well as primary partitions, are listed in the DISCOPEN report. Additionally, DISCOPEN reports I/O and cache hit rates. These data fields correspond to the CPU-STATS DISC-IO-RATE and CACHE-HIT-RATE, making a disk file's impact on disk easy to recognize. For these reasons, DISCOPEN is usually the most useful report for identifying disk consumers. DISCOPEN reports make it easy to identify which file should be moved to another volume.

Every file opened on the volume is listed but combined into one entry per file opened. For example, there were 52 openers of ACCOUNT, BRANCH, and TELLER. The interval entry for each of these files is the sum of all data records for each open.

To verify this, add up IO-RATE and CACHE-HIT-RATE for an interval from the DISCOPEN report. It will very closely correspond to the DISC-IO-RATE and CACHE-HIT-RATE of the controlling processor and to the READ-RATE, WRITE-RATE, and CACHE-HIT-RATEs of the physical disk. Any differences are due to rounding or omission of data records from the MEASURE data file due to storage thresholds.

The DISCOPEN-STATS report is generated as follows:

Assume report
Set Detailed On
Set From 1988-03-09 7:21
Set To *
Print discopen-stats $d1.*.*


                                                IO   Cache Hit
Date        Time  File Name                   Rate        Rate
1988-03-09  7:21  $D1.CUNION.DBASE          11.260       2.433
1988-03-09  7:21  $D1.CUNION.LOG01           0.227       0.000
1988-03-09  7:21  $D1.CUNION.LOG02           1.521       0.000
1988-03-09  7:21  $D1.TESTDB.ACCOUNT        12.516       3.105
1988-03-09  7:21  $D1.TESTDB.BRANCH          6.533       4.688
1988-03-09  7:21  $D1.TESTDB.TELLER          7.811       4.083
1988-03-09  7:24  $D1.CUNION.DBASE          10.555       1.966
1988-03-09  7:24  $D1.CUNION.LOG01           0.288       0.000
1988-03-09  7:24  $D1.CUNION.LOG02           1.532       0.000
1988-03-09  7:24  $D1.TESTDB.ACCOUNT        13.799       3.288
1988-03-09  7:24  $D1.TESTDB.BRANCH          6.766       4.855
1988-03-09  7:24  $D1.TESTDB.TELLER          8.605       4.600
1988-03-09  7:27  $D1.CUNION.DBASE          10.666       2.283
1988-03-09  7:27  $D1.CUNION.LOG01           0.282       0.000
1988-03-09  7:27  $D1.CUNION.LOG02           1.599       0.005
1988-03-09  7:27  $D1.TESTDB.ACCOUNT        13.094       3.461
1988-03-09  7:27  $D1.TESTDB.BRANCH          7.099       5.122
1988-03-09  7:27  $D1.TESTDB.TELLER          8.649       4.627
1988-03-09  7:30  $D1.CUNION.DBASE          11.399       2.444
1988-03-09  7:30  $D1.CUNION.LOG01           0.260       0.000
1988-03-09  7:30  $D1.CUNION.LOG02           1.593       0.000
1988-03-09  7:30  $D1.ST1.ST1                0.050       0.016
1988-03-09  7:30  $D1.TESTDB.ACCOUNT        12.999       3.183
1988-03-09  7:30  $D1.TESTDB.BRANCH          6.588       4.755
1988-03-09  7:30  $D1.TESTDB.TELLER          9.022       4.755
1988-03-09  7:33  $D1.CUNION.DBASE          10.949       2.344
1988-03-09  7:33  $D1.CUNION.LOG01           0.255       0.000
1988-03-09  7:33  $D1.CUNION.LOG02           1.604       0.005
1988-03-09  7:33  $D1.TESTDB.ACCOUNT        12.388       3.194
1988-03-09  7:33  $D1.TESTDB.BRANCH          6.811       4.888
1988-03-09  7:33  $D1.TESTDB.TELLER          8.244       4.377
1988-03-09  7:36  $D1.CUNION.DBASE          10.844       2.211
1988-03-09  7:36  $D1.CUNION.LOG01           0.299       0.000
1988-03-09  7:36  $D1.CUNION.LOG02           1.671       0.000
1988-03-09  7:36  $D1.TESTDB.ACCOUNT        12.705       3.272
1988-03-09  7:36  $D1.TESTDB.BRANCH          6.444       4.666
1988-03-09  7:36  $D1.TESTDB.TELLER          8.466       4.388
1988-03-09  7:39  $D1.CUNION.DBASE          11.749       2.644
1988-03-09  7:39  $D1.CUNION.LOG01           0.327       0.000
1988-03-09  7:39  $D1.CUNION.LOG02           1.449       0.005
1988-03-09  7:39  $D1.TESTDB.ACCOUNT        13.294       3.350
1988-03-09  7:39  $D1.TESTDB.BRANCH          6.110       4.400
1988-03-09  7:39  $D1.TESTDB.TELLER          8.743       4.583


The following CPU-STATS (0,1,2,3) report provides meaningful insight into the problem with $D1 if the system configuration and workload are clearly understood. The system under test had one disk primaried in each of the four processors. Volumes $D2, $DATA, and $SYSTEM all have CACHE-HIT-RATEs well over 50%. These three volumes were primaried in processors 3, 2, and 0, respectively. $D1 was primaried out of processor 1. $D1's cache hit rate is far less than this though the total system I/O rate (CACHE-HIT-RATE plus DISC-IO-RATE) is nearly the same as that for $D2 and $SYSTEM. In this particular test, $D1's cache was "under-configured" which resulted in the poor cache hit rate and corresponding increase in physical I/O activity. Admittedly, the following CPU-STATS report may seem awkward for this sort of analysis. However, if expected physical I/O rates and/or cache hit rates were known, exception reports would better highlight the problem. FUP, PUP, and MEASURE could also be used to conduct cache and file analysis.
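The comparison described above can be sketched in Python for the 7:21 interval of the CPU-STATS report. The sketch assumes that the cache hit percentage is CACHE-HIT-RATE divided by the sum of CACHE-HIT-RATE and DISC-IO-RATE, which is how the text's definition of total I/O rate is read here; the function and variable names are illustrative and not part of SURVEYOR.

```python
# Sketch: per-processor total I/O rate and cache hit ratio for the 7:21 interval.
# Assumes hit ratio = CACHE-HIT-RATE / (CACHE-HIT-RATE + DISC-IO-RATE).

# processor number -> (DISC-IO-RATE, CACHE-HIT-RATE) from the CPU-STATS report
rates_0721 = {
    0: (21.294, 26.744),  # $SYSTEM primary
    1: (36.744, 15.294),  # $D1 primary
    2: (13.361, 20.150),  # $DATA primary
    3: (22.361, 29.461),  # $D2 primary
}


def total_io_rate(disc_io: float, cache_hit: float) -> float:
    """Total I/O rate as defined in the text: cache hits plus physical I/Os."""
    return disc_io + cache_hit


def cache_hit_ratio(disc_io: float, cache_hit: float) -> float:
    """Fraction of total I/O requests satisfied from cache."""
    return cache_hit / total_io_rate(disc_io, cache_hit)


ratios = {cpu: cache_hit_ratio(*pair) for cpu, pair in rates_0721.items()}
worst_cpu = min(ratios, key=ratios.get)

for cpu, ratio in ratios.items():
    print(cpu, round(total_io_rate(*rates_0721[cpu]), 3), round(ratio, 3))
print("lowest cache hit ratio: processor", worst_cpu)
```

Under that assumption, processor 1 is the only processor whose ratio falls below one half, while its total I/O rate is close to that of processors 0 and 3.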


A CPU-STATS (0,1,2,3) report customized to emphasize the I/O system is configured as follows:

    Assume Report
    Set Detailed On
    Set Field Cpu DISC-IO-RATE
    Set Field Cpu CACHE-HIT-RATE
    Set Field Cpu SWAP-RATE
    Set Field Cpu SYS-PROC-UTIL
    Set From 1988-03-09 7:21
    Set To *
    Print Cpu-Stats (0,1,2,3)

The preceding examples demonstrate the strength of SURVEYOR when used for performance monitoring. With properly configured exception reports, SURVEYOR can provide valuable information about a system's "health."

However, performance monitoring is only one part of a complete system management strategy. Capacity planning is just as crucial as performance monitoring.

                  Processor     Disc IO  Cache Hit    Swap  Sys Proc
Date        Time  No.  Type        Rate       Rate    Rate      Util
1988-03-09  7:21   0   TXP       21.294     26.744   0.000    20.266
1988-03-09  7:21   1   TXP       36.744     15.294   0.083    26.807
1988-03-09  7:21   2   TXP       13.361     20.150   0.127    16.728
1988-03-09  7:21   3   TXP       22.361     29.461   0.050    21.291
1988-03-09  7:24   0   TXP       23.261     27.583   0.000    21.436
1988-03-09  7:24   1   TXP       37.850     16.022   0.000    27.826
1988-03-09  7:24   2   TXP       14.261     20.527   0.000    16.863
1988-03-09  7:24   3   TXP       21.255     27.611   0.383    20.604
1988-03-09  7:27   0   TXP       23.250     29.350   0.000    21.960
1988-03-09  7:27   1   TXP       36.816     16.361   0.000    27.353
1988-03-09  7:27   2   TXP       12.738     19.505   0.000    15.630
1988-03-09  7:27   3   TXP       20.683     26.883   0.011    19.691
1988-03-09  7:30   0   TXP       23.033     28.288   0.000    21.492
1988-03-09  7:30   1   TXP       37.761     16.000   0.250    28.198
1988-03-09  7:30   2   TXP       12.938     19.822   0.000    16.011
1988-03-09  7:30   3   TXP       21.411     28.016   0.011    20.411
1988-03-09  7:33   0   TXP       22.100     28.500   0.000    21.012
1988-03-09  7:33   1   TXP       35.711     15.922   0.000    26.547
1988-03-09  7:33   2   TXP       13.105     20.061   0.000    15.941
1988-03-09  7:33   3   TXP       20.866     27.3n    0.000    20.013
1988-03-09  7:36   0   TXP       21.694     27.588   0.000    20.749
1988-03-09  7:36   1   TXP       36.472     15.750   0.000    26.839
1988-03-09  7:36   2   TXP       13.517     20.472   0.000    16.712
1988-03-09  7:36   3   TXP       22.522     29.300   0.000    21.171
1988-03-09  7:39   0   TXP       21.350     27.094   0.000    20.344
1988-03-09  7:39   1   TXP       37.472     15.705   0.000    27.164
1988-03-09  7:39   2   TXP       13.794     19.794   0.000    16.511
1988-03-09  7:39   3   TXP       21.416     28.133   0.000    20.434


Section 4: SURVEYOR as an Aid for Capacity Planning - An Example

Consumption Modeling

There are a variety of methods of modeling computer systems that can use the information contained in the Performance Data Base. The SURVEYOR product contains linear regression techniques for trend analysis. The SURVEYOR User's Guide explains how to use these built-in features. Linear regression is a good technique for estimating future growth in a system when there is good reason to believe that growth will continue as it has in the past; i.e., no new major applications are being added and the transaction proportions (the percentage of one type of transaction to another) are not changing.

Another commonly used modeling technique is Consumption Modeling. This approach is applicable when there is some knowledge from outside the system that the transaction proportions will change. This could be due to changes in user population or business priorities. This technique uses the data within a SURVEYOR performance data base to build an analytical model.

Consumption Modeling views a Tandem system as a combination of processes that consume resources on behalf of a given transaction or application. Some of these processes are unique to a given transaction type and some are shared between multiple transaction types. For the processes that are shared, consumption modeling provides a method of apportioning the work they do among the transaction types that share them.

The model results in a demand per transaction at each physical service center. A physical service center is a set of similar devices that provide a particular service. For example, a node of CPUs is a physical service center providing CPU processing time. The disk drives attached to a system of CPUs are a physical service center providing physical I/O and storage. For a given transaction, consumption modeling defines the demand the transaction has for each physical service center.

CPU Planning

CPU planning for a Tandem system requires the understanding and the analysis of many different system entities and their constraints. For example, there must be enough ports for the terminal population. There also must be enough CPU cycles to get the necessary work done. Consumption modeling determines how many seconds of CPU time are needed for each transaction type. The information from the model can then be used to determine how many CPU seconds will be needed for a particular transaction rate.


The most difficult part of consumption modeling is determining the subsystems within the CPU that each particular transaction uses. The next hardest part is determining what portion of each subsystem is used by each transaction. Most of this work needs to be done outside of the SURVEYOR environment. SURVEYOR can help in the latter task, if the system being analyzed is well designed and/or well instrumented. The first task, that of determining the pieces of the system each transaction/application uses, can be accomplished by talking with the software developers and system management personnel.

The information about what processes a transaction uses is represented in a transaction flow diagram. The transaction flow diagram is a pictorial representation of a transaction's flow through a system. An example of a transaction flow diagram is shown in Figure 4.

Disk Planning

Disk planning must satisfy two constraints: sufficient space on all the disk drives to hold the necessary information for a system, and enough disk "power" to handle the number of requests for that information. Because SURVEYOR can help in the latter, it is the focus of this discussion.


An application issues a "logical" request to the disk process. This logical request results in a certain number of "system" requests. These system requests consist of index-level I/Os plus data-level I/Os that are a direct result of the one logical request. Some of these system requests are satisfied from the disk cache. Others result in a request to the physical disk. These requests to the physical disk are referred to as "physical" I/Os. Using general terminology, here are the definitions for these different I/Os:

    Logical I/Os  = Requests from the application to read, write,
                    update, or delete a record
                    (Receive Rate on the Disk Processes)

    System I/Os   = Cache Hits + Physical Reads + Physical Writes

    Physical I/Os = Physical Reads + Physical Writes

When planning how many disk spindles are required for a system, it is necessary to know how many physical I/Os are likely to occur in a given time interval. This, of course, depends on the cache hit rate in the system. The higher the cache hit rate, the lower the number of physical requests that occur. A consumption model can be built either by associating logical requests to physical requests or by associating logical requests to system requests and then making the association between system requests and physical requests. In the second case, increases in cache hit rate can be investigated.
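The relationship between system I/Os, cache hits, and physical I/Os can be sketched in a few lines of Python. This is only an illustration, not part of SURVEYOR: the function names are hypothetical, and the sketch assumes that DISC-IO-RATE counts physical reads plus writes, so that CACHE-HIT-RATE plus DISC-IO-RATE is the system I/O rate. The sample values are the 7:21 rows for processors 1 and 3 of the CPU-STATS report.

```python
def cache_hit_ratio(cache_hit_rate, disc_io_rate):
    """Fraction of system I/Os that were satisfied from cache."""
    system_io_rate = cache_hit_rate + disc_io_rate
    return cache_hit_rate / system_io_rate if system_io_rate else 0.0


def physical_io_rate(system_io_rate, hit_ratio):
    """Physical I/Os per second left over after cache hits are removed."""
    return system_io_rate * (1.0 - hit_ratio)


# 7:21 rows of the CPU-STATS report: cpu -> (DISC-IO-RATE, CACHE-HIT-RATE)
rows = {1: (36.744, 15.294), 3: (22.361, 29.461)}
ratios = {cpu: cache_hit_ratio(hits, disc_io)
          for cpu, (disc_io, hits) in rows.items()}

# What-if: processor 1's system I/O load served at processor 3's hit ratio.
system_io_cpu1 = sum(rows[1])
what_if_physical = physical_io_rate(system_io_cpu1, ratios[3])

for cpu, ratio in ratios.items():
    print(f"CPU {cpu}: cache hit ratio {ratio:.3f}")
print(f"CPU 1 physical I/O at CPU 3's hit ratio: {what_if_physical:.3f}/s")
```

Keeping the hit ratio as a separate input is what allows the second kind of model (logical to system to physical requests) to explore the effect of a larger cache.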

The consumption model building demonstrated in this example assumes all logical disk requests are equal. There is no differentiation between requests to different types of file structures (key-sequenced, entry-sequenced and unstructured). The example also considers READ, WRITE, UPDATE, and DELETE operations as equivalent. While these assumptions might not be accurate for all systems, they are valid for the following example and simplify the necessary analysis.

SURVEYOR Consumption Modeling

SURVEYOR's ability to group processes into workloads and to summarize information about these workloads over time is particularly helpful for capacity planning. (It is assumed that you have read the SURVEYOR User's Guide and are familiar with both the GROUP command for creating workloads and the AGGREGATE object for summarizing information. If not, please review the appropriate sections in the User's Guide before continuing.)

The Tandem system is a message-based system. That means processes "talk" to each other by sending and receiving messages. By tracing the messages of a given transaction through the system, it is possible to estimate what resources the transaction is utilizing. Messages received and messages replied to by individual processes are recorded by MEASURE and can be maintained in the SURVEYOR Performance Data Base (PDB). These performance metrics are essential to building a consumption model of a system. For a particular process, the messages received per second are in the field RECEIVE-RATE and the messages sent per second are in the field SEND-RATE. Even though these are in the default SURVEYOR configuration, make sure that any customized configuration includes the following two statements:

    XX$ Select Process SEND-RATE
    XX$ Select Process RECEIVE-RATE


Workloads

Setting up the workloads in the system is a very important task for accurate consumption modeling. The workloads should represent either the transactions or the applications that you wish to model. Care needs to be taken in deciding which processes can be grouped with others and which need to be grouped separately.

Before establishing the GROUPs in SURVEYOR, identify the transaction types or the applications that are going to be used in the model of the system. Once these particular groupings are defined, the Tandem processes or programs that are executed by each of these transactions need to be identified. Those processes/programs that are shared by transactions need to be distinguished from those that are unique to a particular application (an example of a shared process is the TCP in a PATHWAY environment). Some processes may be shared by only two "groups" while others may be used by every transaction or application in the system.


Figure 4. Sample Transaction Flow Diagram. [Diagram boxes: LINE HANDLER, TRAN 1 SERVER, TRAN 2 SERVER, DISK PROCESS]

The transaction flows for all the transactions being modeled need to be understood. This includes determining the point of entry into the system (usually a line handler) and each and every process executed on the transaction's behalf. The transaction flow also identifies those processes within the path that make logical I/O requests. In essence, the desired final outcome is a flowchart of the transactions being modeled. A simple PATHWAY application with two transactions is shown in Figure 4.

From this diagram, it is easy to see that the messages sent from the TRAN 1 SERVER and the TRAN 2 SERVER are being received by the disk process. The MESSAGES-SENT field in the PROCESS entity for these two server types would represent the number of logical I/Os performed per second. If the processes TRAN 1 SERVER and TRAN 2 SERVER were called once to process a single transaction of their type, then the RECEIVE-RATE field in the PROCESS entity for each of these two server types would represent the number of transactions per second for the two different transaction types.


In this example, there are two workloads (Transaction Type 1 and Transaction Type 2). The Line Handler, TCP and Disk Processes need to be apportioned between the two transactions. The Line Handler and the TCP are divided equally between both transaction types; that is, each transaction going through each of these processes is assumed to use the same amount of that resource. This assumption is made because there is no way to measure how much of each resource a particular transaction type uses.

The disk process is divided equally between every logical I/O in the system, each of which can be viewed as a "transaction" to the disk process. To model this system, six SURVEYOR groups are created. They are:

    One containing the Line Handler processes
    One containing the TCP processes
    One containing Tran 1 Servers
    One containing Tran 2 Servers
    One containing Disk Processes
    One containing All processes running

When building a model, all work in the system must be accounted for. The purpose of having a workload that contains all the processes running in the system is to determine the amount of "OTHER" activity that is being performed and is contributing to CPU consumption. It is important that the thresholds for storing and displaying information are set up to ensure that all information needed is present.

Figure 5. Transaction flow diagram for a complex system. [Diagram boxes: LINE HANDLER, ROUTING SERVER, TRAN 1 SERVER, TRAN 2 SERVER, SECURE SERVER, MEM TAB SERVER, RECORD SERVER, DISK PROCESS]

In a more complex environment, there may be many servers talking to each other as well as to the disk process. This complicates the flowchart as well as the analysis. An example of a flowchart for this type of system is shown in Figure 5.

In this example, every process except for TRAN 1 SERVER and TRAN 2 SERVER is shared by the two transaction types. Does that mean they should all be put in a GROUP together? No. Analyzing the transaction flows of this system requires that almost every "process" on the chart be put into its own GROUP.

The LINE HANDLER and the ROUTING SERVER can be put into the same GROUP. They are both apportioned by the total number of transactions being processed.

The TRAN 1 SERVER and the TRAN 2 SERVER need to be in their own GROUPs so that the transaction rates for each transaction type can be determined as well as the service times.

All disk process activity needs to be GROUPed together. This allows the model to view the disk as a single resource. Also, this type of GROUPing is necessary to help identify the miscellaneous process activity (by process of elimination: the ^all-processes GROUP minus all the other defined GROUPs equals the miscellaneous process activity).
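The process of elimination can be sketched as a simple subtraction. The Python below is an illustration only; the function name is hypothetical, and the figures are the average PROCESS-CPU-UTIL values reported for the groups of the worked PATHWAY example later in this section, not for the system in Figure 5.

```python
# Average PROCESS-CPU-UTIL per defined group (Model Process Report, Avg).
group_cpu_util = {
    "^DISK-PROCS": 87.091,
    "^LINE-HAND": 14.200,
    "^TCP": 83.219,
    "^TRAN1": 3.627,
    "^TRAN2": 1.153,
    "^TRAN3": 14.233,
    "^TRAN4": 18.387,
}
all_processes_util = 222.844  # ^ALL-PROCESSES group


def other_activity(total_util, defined_groups):
    """Miscellaneous ("OTHER") utilization: the total minus every defined group."""
    return total_util - sum(defined_groups.values())


other_util = other_activity(all_processes_util, group_cpu_util)
print(f"OTHER process CPU utilization: {other_util:.3f}")
```

The subtraction is only meaningful if no process is counted in two of the defined groups and the storage thresholds have not dropped low-utilization entries.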

The SECURE SERVER, MEM TAB SERVER and RECORD SERVER are put into individual GROUPs. There is no easy way to figure out how many messages out of the TRAN 2 SERVER went to the MEM TAB SERVER or how many went to the SECURE SERVER. By looking at the RECEIVE-RATE on the MEM TAB SERVER, you determine the number of messages from both the TRAN 1 and TRAN 2 SERVERs that went to the MEM TAB SERVER. Without the use of user-defined counters within this server, the service time within the MEM TAB SERVER for a TRAN 1 request versus a TRAN 2 request cannot be determined. An average service time for a request to the MEM TAB SERVER is the best that can be found. The same holds true for the SECURE and RECORD servers.



The TRAN 2 SERVER sends messages to the disk process as well as to the three shared servers (SECURE, MEM TAB and RECORD). It is necessary to determine how many messages out of the TRAN 2 SERVER were logical I/O requests. To do this, add up the number of MESSAGES-RECEIVED by the three servers (SECURE, MEM TAB, and RECORD) and subtract that number from the total number of MESSAGES-SENT by the TRAN 1 SERVER and the TRAN 2 SERVER. The result will be the number of messages sent to the disk process.
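A minimal sketch of this subtraction follows. The function name and all of the rates are hypothetical, chosen only to make the arithmetic visible. The sketch also assumes that the three shared servers receive messages from no process other than the two transaction servers.

```python
def logical_io_rate(server_send_rates, shared_receive_rates):
    """Messages per second that went to the disk process.

    Everything the transaction servers sent, minus what the shared
    servers received, is taken to be logical I/O.
    """
    to_disk = sum(server_send_rates.values()) - sum(shared_receive_rates.values())
    if to_disk < 0:
        raise ValueError("shared servers received more than the transaction servers sent")
    return to_disk


# Hypothetical rates (messages per second), for illustration only.
example_send = {"TRAN 1 SERVER": 10.0, "TRAN 2 SERVER": 14.0}
example_recv = {"SECURE SERVER": 3.0, "MEM TAB SERVER": 5.0, "RECORD SERVER": 4.0}

disk_messages = logical_io_rate(example_send, example_recv)
print(f"Logical I/O rate: {disk_messages:.1f} per second")
```

The result is a combined figure for both transaction servers; splitting it by transaction type would need the user-defined counters mentioned above.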

There are eight GROUPs required to build the model. They are:

    One for the LINE HANDLER and ROUTING SERVER
    One for the TRAN 1 SERVER
    One for the TRAN 2 SERVER
    One for the SECURE SERVER
    One for the MEM TAB SERVER
    One for the RECORD SERVER
    One for the DISK PROCESSES
    One for All processes running

The point of this discussion is that there is not a one-to-one relationship between SURVEYOR "groups" or workloads and the workloads in a consumption model.


Aggregation Schemes

The aggregation feature in SURVEYOR summarizes data as requested by the user. This summarization could result in information summarized on any user-defined timeframe (e.g., daily, weekly, monthly). When using SURVEYOR data to build the consumption model, it is recommended that a separate AGGREGATE be created for this purpose. By doing so, changes to the consumption model will not impact the ongoing data gathering used for performance monitoring.

The time period used for the SELECT statement of the AGGREGATE object is based upon knowledge of system usage and the actual measurement that is being taken. The most important criterion for the time selection is making sure that all the transactions that the model will be based on have been exercised in the system. If, for example, it is known that all transactions in the model are entered between 1:00 and 1:30 pm, then an AGGREGATE can be set up as follows:

    - Aggregate to be used in building
    - Consumption Model of System
    XX$ Set Aggregate Title "Consumption Model Input"
    XX$ Set Aggregate Records (cpu,disc,process)
    XX$ Set Aggregate Stats (avg,std,p90)
    XX$ Set Aggregate Span Daily
    XX$ Set Aggregate Summarization Automatic
    XX$ Set Aggregate Select * 1:00 To 1:30
    XX$ Add Aggregate Model-Input

This will create the average value, the standard deviation, and the 90th percentile value for data values between 1:00 and 1:30 p.m. The standard deviation is a measure of the dispersion of the data points from the average. For example, if the measurement contained 10 data points between 1:00 and 1:30 and the average data point value was 12 with a standard deviation of 8, then it is known that those 10 points were not all close to the value of 12. If the standard deviation was 2, then all 10 data points were close to the value of 12.

For modeling, the standard deviation should be relatively small in comparison to the average value being used. If it is not, then it is recommended that the statistic P90 be used in place of the average. The P90 statistic gives the value below which 90 percent of the data points between 1:00 and 1:30 fall. This is more conservative than using an average value. Whether the average or the P90 statistic is used, the steps for creating the model are the same.

Extracting Information From SURVEYOR

After determining the aggregation schemes, you must specify the information to extract from SURVEYOR. The pieces of information needed for building a model are listed below.

    TOTAL CPU BUSY
    TOTAL PROCESS BUSY
    TOTAL INTERRUPT BUSY
    TOTAL DISK BUSY

    For each SHARED Process:      TOTAL PROCESS BUSY
                                  REQUEST RATE MADE TO IT

    For each UNIQUE Process
    for a transaction type:       TOTAL PROCESS BUSY
                                  REQUEST RATE MADE TO IT
                                  REQUEST RATE TO SHARED WORKLOADS

    For the DISK ACTIVITY:        CACHE HIT RATE
                                  READ and WRITE RATE

The performance metrics from SURVEYOR that are used to obtain that information are:

    CPU object                    INTERRUPT-UTIL
                                  AVERAGE-UTIL
                                  DISC-IO-RATE

    DISC object                   PRIMARY-UTIL
                                  MIRROR-UTIL
                                  LOGICAL-IO-RATE

    PROCESS object                PROCESS-CPU-UTIL
                                  RECEIVE-RATE
                                  SEND-RATE

If the model building is being done for a "mixed" system, meaning there is a combination of TNSII's, TXP's and/or VLX's, then the model building would use the "normalized" fields of TNSII-UTIL in place of AVERAGE-UTIL and PROCESS-CPU-UTIL in the CPU and PROCESS objects respectively. (See the SURVEYOR Reference Manual, Section 4, for a description of these fields.) This is necessary because a process executing on a TXP will run faster on a VLX. Normalization is the process of using a common baseline for different representations of values. The normalization process takes the number of TXP (or VLX) seconds a process uses and computes the number of TNSII seconds that process would use.
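To make the three aggregate statistics concrete, the following Python sketch computes an average, a standard deviation, and a 90th percentile for two hypothetical sets of 10 data points that both average 12. How SURVEYOR itself computes these statistics is not described here, so the population standard deviation and the nearest-rank percentile used below are assumptions, as are the function names and the sample data.

```python
import statistics


def p90(values):
    """Nearest-rank 90th percentile: the smallest sample with at least
    90 percent of the data points at or below it."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # integer ceiling of 0.9 * n
    return ordered[rank - 1]


def summarize(values):
    """Return the three statistics requested by the aggregate (avg, std, p90)."""
    return {
        "avg": statistics.mean(values),
        "std": statistics.pstdev(values),  # population form, an assumption
        "p90": p90(values),
    }


# Two hypothetical measurement windows of 10 points, both averaging 12.
tight = [10, 11, 12, 13, 14, 12, 11, 13, 12, 12]
spread = [2, 4, 6, 8, 10, 14, 16, 18, 20, 22]

for name, points in (("tight", tight), ("spread", spread)):
    stats = summarize(points)
    print(f"{name}: avg={stats['avg']:.1f} std={stats['std']:.3f} p90={stats['p90']}")
```

Both windows report the same average, but the widely spread one has a much larger standard deviation and a P90 far above the average, which is the situation in which the more conservative P90 value is the safer modeling input.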



Model Building - An Example

The easiest way to describe consumption model building is to present a complete example. A description of the system to be modeled and a step-by-step description of each stage of the model building process follow.

System Description and Objective

The example is a PATHWAY system consisting of four major transactions which comprise the majority of work being done. They are TRAN1, TRAN2, TRAN3, and TRAN4. Each transaction passes through a line handler, a TCP, a server process, and the disk process. There is other work being done in the system during the day; this work will continue in the future. All the transaction types are entered between 12:20 and 12:40 each day.

The objective is to determine the effect of a 20% increase in the TRAN4 transaction rate on CPU and disk utilizations.

Step 1 - Create the Transaction Flow Diagram

The flow diagram in Figure 6 depicts the system described above.

Step 2 - Set Up SURVEYOR Configuration

The SURVEYOR configuration for this system requires eight GROUPs to be set up. They are:

    One for the Line Handlers
    One for the TCPs
    One for each of the four Transaction Servers
    One for the Disk Processes
    One for all processes running

The Line Handlers in this system are processes executing program files $DATA.BOTH.USIM and $DATA.BOTH.SSIM. (Do not be concerned that these are not true "line handlers.")

The TCPs are processes executing program file $SYSTEM.SYSTEM.PATHTCP2.

This system is unique in that three of the four transactions are executing the same program file. Therefore, to distinguish them from each other, the processes have been named to uniquely identify each transaction type. Those processes with process names starting with $B and followed by three numbers are transaction type TRAN1. Those processes with process names starting with $1 and followed by three numbers are transaction type TRAN2. Those processes with process names starting with $U and followed by three numbers are transaction type TRAN3.

Figure 6. PATHWAY system flow diagram. [Diagram boxes recoverable from the original: LINE HANDLERS, DISK PROCESS]

TRAN4 are processes executing program file $DATA.ST1.SERVERO.

The Disk Processes in this system are identified by their device names $SYSTEM, $DATA, $D1 and $D2.

All other processes in the system are identified using wildcards for volume, subvolume and program file name. This group gives a summary of all user and system processes run.

The required input to SURVEYOR to configure these workloads is:

    - redefine thresholds

    set threshold entity device, field average-util
    delete threshold store
    set threshold entity disc, field disc-util
    delete threshold store
    set threshold entity process, field process-cpu-util
    delete threshold store

    - now set up the appropriate configuration

    assume configuration
    Add Cpu *
    Add Process *
    Add Disc $*

    group process ^line-hand = $data.both.usim + &
        $data.both.ssim

    group process ^tcp = $system.system.pathtcp2

    group process ^tran1 = $b000 + $b001 + $b002 + &
        $b003 + $b004 + $b005 + $b006 + $b007 + $b008 + &
        $b009 + $b010 + $b011 + $b012 + $b013 + $b014 + &
        $b015

    group process ^tran1 = $b016 + $b017 + $b018 + &
        $b019 + $b020 + $b021 + $b022 + $b023 + $b024 + &
        $b025 + $b026 + $b027 + $b028 + $b029 + $b030 + &
        $b031 + $b032

    group process ^tran2 = $1000 + $1001 + $1002 + $1003 + &
        $1004 + $1005 + $1006 + $1007 + $1008 + $1009 + $1010 + &
        $1011 + $1012 + $1013 + $1014 + $1015

    group process ^tran2 = $1016 + $1017 + $1018 + $1019 + &
        $1020 + $1021 + $1022 + $1023 + $1024 + $1025 + $1026 + &
        $1027 + $1028 + $1029 + $1030 + $1031 + $1032

    group process ^tran3 = $u000 + $u001 + $u002 + &
        $u003 + $u004 + $u005 + $u006 + $u007 + $u008 + &
        $u009 + $u010 + $u011 + $u012 + $u013 + $u014 + $u015

    group process ^tran3 = $u016 + $u017 + $u018 + &
        $u019 + $u020 + $u021 + $u022 + $u023 + $u024 + &
        $u025 + $u026 + $u027 + $u028 + $u029 + $u030 + &
        $u031 + $u032

    group process ^tran4 = $data.st1.servero

    group process ^disk-procs = $system + $data + $d1 + &
        $d2

    group process ^all-processes = $*.*.*

    replace configuration

Note: Some of the GROUP definitions contain the same group name. This results in the second group of processes being concatenated to the first group of processes. This is necessary because they cannot all fit in one command.

Notice that this configuration deletes certain threshold values. When building a consumption model, the goal is to account for every last CPU second (or millisecond!) a transaction is responsible for. The defaults for these threshold values are set up to ignore entries with very low values (<0.05%). For consumption modeling, these entries are needed. Discussions of thresholds can be found in the SURVEYOR User's Guide.



The final element to add to the configuration is the aggregate for model building. It is identical to the aggregate defined earlier for model building, except that the SELECT statement is customized for the particular system being modeled.

    - Aggregate to be used in building
    - Consumption Model of System
    Set Aggregate Title "Consumption Model Input"
    Set Aggregate Records (cpu,disc,process)
    Set Aggregate Stats (avg,std,p90)
    Set Aggregate Span Daily
    Set Aggregate Summarization Automatic
    Set Aggregate Select * 12:20 to 12:40
    Add Aggregate Model-Input

The customization of the SURVEYOR configuration is now complete, so the information needed for consumption modeling can be extracted from the data base.

Step 3 - Loading the SURVEYOR Data Base

Loading the SURVEYOR data base is described in the SURVEYOR User's Guide. The aggregate Model-Input is summarized automatically every day for the prior day's information.

Step 4 - Creating the Customized Reports

Below is a sample set of reports that have been customized for this example. There are three reports: one for the CPU entity, one for the DISC entity, and one for the PROCESS entity. Each report appears in three sections, one for each statistic specified in the AGGREGATE object (avg, standard deviation and P90). Each report is shown with the corresponding SURVEYOR commands that were in the OBEY file used for printing the report.

CPU Report

The CPU Report shows the average CPU utilization and the number of physical reads and writes performed per second.

Here is the input to SURVEYOR to create the CPU report:

    assume report
    Reset Report *
    Set Report Title "Model CPU Report"
    Set Report Summary model-input
    Set Report Field cpu average-util
    Set Report Field cpu disc-io-rate
    Set Report Field cpu cache-hit-rate
    Set Report From 1988-01-04 00:00
    Set Report To 1988-01-04 23:59
    Set Report Order-by Time
    Set Report Export off
    Set Report Display brief
    Set Report Width 132
    print /out $S.#CpuInfo/ cpu-util *

The following formatted report is the result of this OBEY file. Note that three separate reports are created, one for each statistic in the aggregate definition.

    Model CPU Report
    Avg

                      Processor      Average   Disc IO  Cache Hit
    Date        Time  No.  Type         Util      Rate       Rate
    1988-01-04  0:00   0   TXP        65.728    25.452     32.100
    1988-01-04  0:00   1   TXP        66.357    24.482     31.580
    1988-01-04  0:00   2   TXP        67.852    12.m       22.108
    1988-01-04  0:00   3   TXP        65.661    23.949     32.314
    1988-01-04  0:00  ^ALL-CPUS       66.399    86.662    118.103

    Model CPU Report
    Std

                      Processor      Average   Disc IO  Cache Hit
    Date        Time  No.  Type         Util      Rate       Rate
    1988-01-04  0:00   0   TXP         0.571     0.739      1.245
    1988-01-04  0:00   1   TXP         1.125     0.825      1.222
    1988-01-04  0:00   2   TXP         0.794     0.371      0.564
    1988-01-04  0:00   3   TXP         0.755     0.651      0.803
    1988-01-04  0:00  ^ALL-CPUS        0.598     0.969      1.284

    Model CPU Report
    P90

                      Processor      Average   Disc IO  Cache Hit
    Date        Time  No.  Type         Util      Rate       Rate
    1988-01-04  0:00   0   TXP        66.133    26.261     33.266
    1988-01-04  0:00   1   TXP        67.464    25.294     32.883
    1988-01-04  0:00   2   TXP        68.606    13.138     22.472
    1988-01-04  0:00   3   TXP        66.482    24.722     33.316
    1988-01-04  0:00  ^ALL-CPUS       67.171    87.859    119.781

Disk Report

The disk report contains the information on the number of logical I/O requests that were performed per second. It also contains the utilizations for the primary and the mirror disks as well as the processor utilization for the disk processes handling the drive.

    Reset Report *
    Set Report Title "Model DISC I/O Report"
    Set Report Summary model-input
    Set Report Field Disc logical-io-rate
    Set Report Field Disc primary-util
    Set Report Field Disc mirror-util
    Set Report Field Disc processor-util
    Set Report From 1988-01-04 00:00
    Set Report To 1988-01-04 23:59
    Set Report Order-by Time
    Set Report Export off
    Set Report Display brief
    Set Report Width 132
    print /out drinfo/ disc-rate-detailed ($system,$data,$d1,$d2)

The formatted report from this OBEY file is shown below. As before, there are three separate sections, one for each statistic.

    Model DISC I/O Report
    Avg

                      Disc       Logical  Primary   Mirror  Processor
    Date        Time  Name       IO Rate     Util     Util       Util
    1988-01-04  0:00  $D1         20.978   30.915   37.525     23.829
    1988-01-04  0:00  $D2         21.833   28.798   33.955     23.737
    1988-01-04  0:00  $DATA       19.088   13.979   16.918     15.865
    1988-01-04  0:00  $SYSTEM     21.413   32.053   37.782     23.658

    Model DISC I/O Report
    Std

                      Disc       Logical  Primary   Mirror  Processor
    Date        Time  Name       IO Rate     Util     Util       Util
    1988-01-04  0:00  $D1          0.767    1.372    1.531      0.764
    1988-01-04  0:00  $D2          0.429    1.160    0.604      0.546
    1988-01-04  0:00  $DATA        0.191    0.562    0.681      0.240
    1988-01-04  0:00  $SYSTEM      0.674    0.739    1.168      0.717

    Model DISC I/O Report
    P90

                      Disc       Logical  Primary   Mirror  Processor
    Date        Time  Name       IO Rate     Util     Util       Util
    1988-01-04  0:00  $D1         21.800   32.359   38.990     24.560
    1988-01-04  0:00  $D2         22.355   30.296   34.624     24.435
    1988-01-04  0:00  $DATA       19.355   14.592   17.532     16.181
    1988-01-04  0:00  $SYSTEM     22.044   32.957   38.999     24.331

Process Report

The third report includes information about the GROUPs or workloads that were defined in the configuration. This report gives the utilizations for each group as well as the send and receive rates. The average duration is also reported so that transient activity can be identified.

Here is the SURVEYOR input for the customized report:

    Reset Report *
    Set Report Title "Model category Process Report"
    Set Report Summary model-input
    Set Report Field Process PROCESS-CPU-UTIL
    Set Report Field Process RECEIVE-RATE
    Set Report Field Process SEND-RATE
    Set Report Field Process AVERAGE-DURATION
    Set Report From 1988-01-04 00:00
    Set Report To 1988-01-04 23:59
    Set Report Order-by Time
    Set Report Export off
    Set Report Display brief
    Set Report Width 132
    print /out procinf/ process-stats (^all-processes, &
        ^tran1, ^tran2, ^tran3, ^tran4, ^disk-procs, &
        ^line-hand, ^tcp)



The output from this report request appears below:

    Model Process Report
    Avg

                      Process           Process  Receive     Send   Average
    Date        Time  Name             CPU Util     Rate     Rate  Duration
    1988-01-04  0:00  ^ALL-PROCESSES    222.844  230.604  229.337   179.949
    1988-01-04  0:00  ^DISK-PROCS        87.091  163.421   79.935   179.901
    1988-01-04  0:00  ^LINE-HAND         14.200   17.180    0.477   179.948
    1988-01-04  0:00  ^TCP               83.219   30.619   65.033   179.966
    1988-01-04  0:00  ^TRAN1              3.627    1.141    6.935   179.967
    1988-01-04  0:00  ^TRAN2              1.153    1.883    1.883   179.985
    1988-01-04  0:00  ^TRAN3             14.233    6.434   19.600   179.997
    1988-01-04  0:00  ^TRAN4             18.387    7.738   54.144   179.994

    Model Process Report
    Std

                      Process           Process  Receive     Send   Average
    Date        Time  Name             CPU Util     Rate     Rate  Duration
    1988-01-04  0:00  ^ALL-PROCESSES      1.623    1.824    1.687     0.229
    1988-01-04  0:00  ^DISK-PROCS         0.833    1.329    0.740     0.458
    1988-01-04  0:00  ^LINE-HAND          0.123    0.1~     om        0.219
    1988-01-04  0:00  ^TCP                0.535    0.299    0.484     0.068
    1988-01-04  0:00  ^TRAN1              O.2al    o.ollll  0.385     0.068
    1988-01-04  0:00  ^TRAN2              0.033    0.047    0.047     0.il62
    1988-01-04  0:00  ^TRAN3              0.262    0.123    0.397     0.003
    1988-01-04  0:00  ^TRAN4              0.273    0.112    0.726     0.017

    Model Process Report
    P90

                      Process           Process  Receive     Send   Average
    Date        Time  Name             CPU Util     Rate     Rate  Duration
    1988-01-04  0:00  ^ALL-PROCESSES    225.271  233.211  231.827   180.211
    1988-01-04  0:00  ^DISK-PROCS        88.323  165.38>   81.016   180.452
    1988-01-04  0:00  ^LINE-HAND         14.3511  17.322    0.833   180.1112
    1988-01-04  0:00  ^TCP               83.970   31.000   65.688   180.084
    1988-01-04  0:00  ^TRAN1              3.885    1.233    7.433   180.046
    1988-01-04  0:00  ^TRAN2              1.199    1.950    1.950   180.037
    1988-01-04  0:00  ^TRAN3             14.502    6.577   20.077   180.001
    1988-01-04  0:00  ^TRAN4             18.780    7.900   55.177   180.019


Step 5 - Building the Model

With the information in a concise form, the mathematical modeling can begin. Refer back to the section titled Extracting Information From SURVEYOR for the list of items needed for model building. The flow diagram shown in Figure 6 will be used in gathering the pieces for every transaction. The two additional workloads, INTERRUPT and OTHER, will be discussed after the individual transaction workloads have been modeled.

SURVEYOR reports "utilization"; this paper discusses "busy." Utilization is "busy-ness" per second. Since SURVEYOR reports everything as rates, it reports busy-ness as utilization. To calculate the fraction of time busy, divide the utilization by 100. For example, if a process has a utilization of 15.00, it was busy 0.15 (15.00/100) out of 1.00, or 15 percent of the time.

The first step in modeling is to determine whether to use the average values or the 90th percentile values. By comparing the standard deviations of the values for the PROCESS entities (found in the Std Report) to the average values (found in the Avg Report), it is evident that the system is in a steady state. For example, the average Process CPU Util for the group ^ALL-PROCESSES is 222.844 and the standard deviation is 1.623. The standard deviation is small compared to the average (1/137th). It is reasonable to use the average value for this system.
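The comparison can be repeated for every group whose values are legible in both reports. The Python below is a sketch only: the 5 percent cutoff is a hypothetical threshold chosen for illustration (the paper only asks that the standard deviation be "relatively small"), and the function names are not part of SURVEYOR.

```python
# PROCESS-CPU-UTIL from the Model Process Report (Avg and Std sections).
avg_util = {"^ALL-PROCESSES": 222.844, "^DISK-PROCS": 87.091, "^TCP": 83.219,
            "^TRAN3": 14.233, "^TRAN4": 18.387}
std_util = {"^ALL-PROCESSES": 1.623, "^DISK-PROCS": 0.833, "^TCP": 0.535,
            "^TRAN3": 0.262, "^TRAN4": 0.273}

MAX_RELATIVE_SPREAD = 0.05  # hypothetical cutoff, not a SURVEYOR setting


def relative_spread(average, std_dev):
    """Standard deviation expressed as a fraction of the average."""
    return std_dev / average


def choose_statistic(average, std_dev, limit=MAX_RELATIVE_SPREAD):
    """Pick 'avg' for a steady workload, otherwise the more conservative 'p90'."""
    return "avg" if relative_spread(average, std_dev) <= limit else "p90"


choices = {group: choose_statistic(avg_util[group], std_util[group])
           for group in avg_util}
for group, choice in choices.items():
    spread = relative_spread(avg_util[group], std_util[group])
    print(f"{group}: std/avg = {spread:.4f} -> use {choice}")
```

With these figures every group falls well under the cutoff, which agrees with the decision to model from the average values.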


CPU Consumption Model

This section uses assumptions and laws of Operational Analysis. Operational Analysis, developed by Jeff Buzen and Peter Denning, is "a pragmatic philosophy of computer system analysis using queueing network models." (1) The definitions for the laws we will be using are:

Utilization Law: Utilization = transaction rate x transaction demand

Flow Balance Assumption: The number of completions within a given interval is equal to the number of arrivals during the same interval.

For each process that a transaction executes, we can determine the demand using the utilization law. Then, by summing the individual demands at each process (for a given transaction type), the total process demand for a transaction type is found. Finally, we add to that a portion of the interrupt processing in the system to arrive at the CPU demand for the transaction.

The first step at each process is to determine the number of requests that are executed per second. This is where the flow balance assumption is used. We assume that the number of completions by a particular process is equal to the number of requests made to it. In most cases the number of requests is equal to the RECEIVE-RATE as reported by SURVEYOR; for some processes, such as the TCP, this does not hold true.

There are four transaction types in the system to be modeled: TRAN1, TRAN2, TRAN3, and TRAN4. The first step of the modeling process is to determine the rate at which these four transaction types enter the system. From the transaction flow diagram in Figure 6 (page 36) it can be seen that while the transaction follows the flow, its transaction type cannot be determined until the point where it goes to the TRAN1 server, TRAN2 server, TRAN3 server, or TRAN4 server. The RECEIVE RATE at each of these servers then represents the number of transactions of that type requesting service. The Process Report (Avg) shows the values for the four transaction types as:

TRAN1: 1.141 transactions per second
TRAN2: 1.883 transactions per second
TRAN3: 6.434 transactions per second
TRAN4: 7.738 transactions per second

The total number of transactions per second is 17.196.

1) Lazowska, et al. 1984. Quantitative System Performance. Prentice Hall, pg. xii. For a complete description of the laws, please refer to this work.



Table 1. Consumption Model information.

All data is stated in seconds

Using the transaction flow diagram shown in Figure 6, the analysis starts at the left (line handler) and proceeds through each process in the path: TCP, servers, and disk processes. The table above will be filled in at each step in the model building exercise.

Line Handler

The line handler processes requests from all four transaction types. The RECEIVE RATE on the Process Report (page 40) for the group ^LINE-HAND represents the transaction rate to the line handler processes. In this example, the RECEIVE-RATE on the line handlers should equal the transaction rate in the system, since the line handlers are used only by the application being modeled.

The CPU utilization for the line handler processes is found in the Process Report (page 40) under the heading PROCESS CPU UTIL for the group ^LINE-HAND. The value is 14.20%. Using the Utilization Law, the demand per transaction at the line handler is:

Demand per transaction = utilization / transaction rate
                       = 14.20% / 17.196
                       = 0.008 sec/transaction

(Note: the answer is rounded to milliseconds)

TCP Processes

TCP processes are responsible for function key operations and application transactions. The utilization of the TCP reflects both of these activities. When apportioning TCP activity over the application transactions, the function key activity must also be accounted for. Therefore, the RECEIVE RATE on the TCP servers is not used as the transaction rate; the total application transaction rate is used. This apportioning forces the function key and screen activity to be divided equally among all the transactions.

The CPU utilization for the TCP processes is found in the Process Report under the heading PROCESS CPU UTIL for the group ^TCP. The value is 83.219%. Using the Utilization Law, the demand per transaction at the TCP is:

Demand per transaction = utilization / transaction rate
                       = 83.219% / 17.196
                       = 0.048 sec/transaction
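The line handler and TCP calculations both apply the Utilization Law with the total transaction rate as the divisor. A minimal Python sketch follows; the function and variable names are illustrative assumptions, and the inputs are the Process Report (Avg) figures quoted in this section.

```python
def demand_per_transaction(utilization_pct: float, rate_per_sec: float) -> float:
    """Utilization Law rearranged: demand = utilization / transaction rate.

    utilization_pct is a percentage, as SURVEYOR reports it, so it is
    divided by 100 to obtain busy seconds per second.
    """
    return (utilization_pct / 100.0) / rate_per_sec


# RECEIVE RATE of each transaction server (transactions per second).
receive_rate = {"TRAN1": 1.141, "TRAN2": 1.883, "TRAN3": 6.434, "TRAN4": 7.738}
total_rate = sum(receive_rate.values())

# PROCESS CPU UTIL for the ^LINE-HAND and ^TCP groups.
line_handler_demand = demand_per_transaction(14.20, total_rate)
tcp_demand = demand_per_transaction(83.219, total_rate)

print(round(total_rate, 3))           # total transactions per second
print(round(line_handler_demand, 3))  # seconds per transaction
print(round(tcp_demand, 3))           # seconds per transaction
```

Rounding is applied only when printing, which mirrors the report's practice of stating demands in milliseconds.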



Table 2. TCP data.

All data is stated in seconds

TRAN1, TRAN2, TRAN3, and TRAN4 Servers

Each one of these servers is responsible for a single transaction type. To determine the server demand for each transaction type, divide the PROCESS CPU UTIL for the server GROUP in the PROCESS REPORT by the RECEIVE RATE for the server GROUP.

TRAN1:  3.627% / 1.141 = 0.032 seconds/transaction
TRAN2:  1.153% / 1.883 = 0.006 seconds/transaction
TRAN3: 14.233% / 6.434 = 0.022 seconds/transaction
TRAN4: 18.387% / 7.738 = 0.024 seconds/transaction
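The four server divisions can be reproduced with a short Python sketch. The dictionary layout and names are assumptions chosen for illustration; the utilization and rate figures are the ones used in this section.

```python
# (PROCESS CPU UTIL in percent, RECEIVE RATE per second) for each server group.
server_groups = {
    "TRAN1": (3.627, 1.141),
    "TRAN2": (1.153, 1.883),
    "TRAN3": (14.233, 6.434),
    "TRAN4": (18.387, 7.738),
}

# Utilization Law: demand = (utilization / 100) / rate, rounded to milliseconds.
server_demand = {
    name: round((util_pct / 100.0) / rate, 3)
    for name, (util_pct, rate) in server_groups.items()
}

for name, demand in server_demand.items():
    print(f"{name}: {demand:.3f} seconds/transaction")
```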




The third column of the table is now complete.

Tran Type   Line Handler   TCP     Server   Disk Process   TOTAL
TRAN1       0.008          0.048   0.032
TRAN2       0.008          0.048   0.006
TRAN3       0.008          0.048   0.022
TRAN4       0.008          0.048   0.024

All data is stated in seconds

Table 3. Server data.

Disk Processes

The disk processes service logical I/O requests. The transaction flow diagram in Step 1 indicates that the TRAN1, TRAN2, TRAN3, and TRAN4 servers make requests to the disk process. Other processes (MEASURE) are making requests also.

By determining the utilization per logical I/O and the number of logical I/Os per transaction, the disk process utilization per transaction can be determined.

The LOGICAL I/O RATE for the system is found by adding up the individual LOGICAL I/O RATEs for each disk in the DISK I/O Report (page 39).

20.978 + 21.833 + 19.088 + 21.413 = 83.312

The combined utilization of the disk processes is found on the PROCESS REPORT in the PROCESS CPU UTIL field for the group ^DISK-PROCS. It is 87.091%.



To determine the demand at the disk process for each logical I/O, divide the disk processes' utilization by the LOGICAL I/O RATE for the system.

87.091% / 83.312 = 0.010 seconds per logical I/O

Logical I/Os per Transaction

Looking at the transaction flow diagram, the only "group" receiving messages that are sent out of the transaction processes is the disk process group. This makes it simple to determine the number of logical I/Os per transaction. The receive rate (from the process report) is the number of transactions processed per second (transaction rate), and the send rate (also from the process report) is the number of messages sent from the transaction process to the disk process. Divide the rate of messages sent to the disk process by the transaction rate of the process to obtain the logical I/Os per transaction.

Using the RECEIVE RATE and SEND RATE values found in the PROCESS Report for the transaction groups, the logical I/Os per transaction are calculated as follows.

Logical I/Os per TRAN1 transaction = SEND RATE / RECEIVE RATE = 6.935 / 1.141 = 6.08

Logical I/Os per TRAN2 transaction = SEND RATE / RECEIVE RATE = 1.883 / 1.883 = 1.00

Logical I/Os per TRAN3 transaction = SEND RATE / RECEIVE RATE = 19.60 / 6.434 = 3.05

Logical I/Os per TRAN4 transaction = SEND RATE / RECEIVE RATE = 54.144 / 7.738 = 7.00

Disk Process Demand

The total disk process demand can now be calculated for each transaction type by multiplying the number of logical I/Os per transaction by the demand for a single logical I/O found previously.

Disk Process demand for TRAN1 = 6.08 x 0.010 sec = 0.061 sec/transaction

Disk Process demand for TRAN2 = 1.00 x 0.010 sec = 0.010 sec/transaction

Disk Process demand for TRAN3 = 3.05 x 0.010 sec = 0.031 sec/transaction

Disk Process demand for TRAN4 = 7.00 x 0.010 sec = 0.070 sec/transaction

Column four, Disk Process, is now filled in.

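Tran Type   Line Handler   TCP     Server   Disk Process   TOTAL
TRAN1       0.008          0.048   0.032    0.061
TRAN2       0.008          0.048   0.006    0.010
TRAN3       0.008          0.048   0.022    0.031
TRAN4       0.008          0.048   0.024    0.070

All data is stated in seconds

Table 4. Disk Process data.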
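The disk process column can be reproduced with the Python sketch below. It uses the decimal module with half-up rounding so that each intermediate value is rounded the same way as in the worked figures (logical I/Os to two places, demands to milliseconds); carrying full precision instead would give slightly different results. All names are illustrative assumptions.

```python
from decimal import Decimal, ROUND_HALF_UP


def rnd(value: Decimal, places: str) -> Decimal:
    """Round half-up to the number of places given by a pattern such as '0.001'."""
    return value.quantize(Decimal(places), rounding=ROUND_HALF_UP)


# ^DISK-PROCS utilization (percent converted to a fraction) and system logical I/O rate.
disk_procs_util = Decimal("87.091") / 100
logical_io_rate = Decimal("83.312")
demand_per_io = rnd(disk_procs_util / logical_io_rate, "0.001")

# (SEND RATE, RECEIVE RATE) for each transaction server group.
rates = {
    "TRAN1": ("6.935", "1.141"),
    "TRAN2": ("1.883", "1.883"),
    "TRAN3": ("19.60", "6.434"),
    "TRAN4": ("54.144", "7.738"),
}

ios_per_transaction = {}
disk_process_demand = {}
for name, (send, receive) in rates.items():
    ios = rnd(Decimal(send) / Decimal(receive), "0.01")
    ios_per_transaction[name] = ios
    disk_process_demand[name] = rnd(ios * demand_per_io, "0.001")

for name in rates:
    print(name, ios_per_transaction[name], disk_process_demand[name])
```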


Total Consumption per Transaction

By adding up the demands for each process a transaction executes, the total consumption (demand) by a particular transaction type can be determined. The table is now completed by filling in the TOTAL column. (See Table 5.)

Other Work

There are two other workloads in the system that need to be taken into consideration: interrupt handling and other work (i.e., MEASURE, PUP).

In this system, the only other work going on is MEASURE processing. The MEASURE subsystem uses both the CPU and the disk. In most environments, details of all the other work being done in the system will not be known. The amount of demand these other processes create on the system can still be identified. It is the difference between the PROCESS work that has been accounted for in the GROUPs set up for the workloads in the model and the category ^ALL-PROCESSES, which contains everything that was running in the system. Add up the PROCESS CPU UTIL of all the GROUPs and subtract the sum from the PROCESS CPU UTIL for ^ALL-PROCESSES to get the processing demand for all other work.

222.844 - (87.091 + 14.200 + 83.219 + 3.627 + 1.153 + 14.233 + 18.387)
= 0.934% or 0.009 seconds/second

The model must also include the disk process activity for all logical I/Os in the system other than those accounted for by transactions. The number can be determined by subtracting the sum of the logical I/Os for all the transactions (remember, it's the SEND RATE from the Process Report, Avg (page 40), for the TRAN1, TRAN2, TRAN3, and TRAN4 servers) from the system logical I/O rate.

Table 5. Completed data table with totals.

Tran Type   Line Handler   TCP     Server   Disk Process   TOTAL
TRAN1       0.008          0.048   0.032    0.061          0.149
TRAN2       0.008          0.048   0.006    0.010          0.072
TRAN3       0.008          0.048   0.022    0.031          0.109
TRAN4       0.008          0.048   0.024    0.070          0.150

All data is stated in seconds



Other logical I/Os = system logical I/O rate - (TRAN1 + TRAN2 + TRAN3 + TRAN4 send rates)
                   = 83.312 - (6.935 + 1.883 + 19.60 + 54.144)
                   = 0.75

The CPU demand for these logical I/Os is found, as before, by multiplying this rate by the demand per logical I/O.

0.75 x 0.010 = 0.008 seconds/second disk process demand

The total CPU demand for other work is:

0.008 + 0.009 = 0.017 seconds/second
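The OTHER workload arithmetic can be sketched in Python as follows. Half-up decimal rounding is used because 0.75 x 0.010 falls exactly on a rounding boundary, and binary floating point would round it down. The names are illustrative assumptions; the figures are those quoted in this section.

```python
from decimal import Decimal, ROUND_HALF_UP


def rnd3(value: Decimal) -> Decimal:
    """Round half-up to milliseconds."""
    return value.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)


# PROCESS CPU UTIL (percent) for ^ALL-PROCESSES and for each modeled group:
# disk processes, line handler, TCP, and the four transaction servers.
all_processes_util = Decimal("222.844")
group_utils = ["87.091", "14.200", "83.219", "3.627", "1.153", "14.233", "18.387"]
other_cpu_pct = all_processes_util - sum(Decimal(u) for u in group_utils)
other_cpu = rnd3(other_cpu_pct / 100)  # seconds/second

# Logical I/Os not issued by the transaction servers.
system_io_rate = Decimal("83.312")
server_send_rates = ["6.935", "1.883", "19.60", "54.144"]
other_ios = system_io_rate - sum(Decimal(r) for r in server_send_rates)
other_disk_proc = rnd3(other_ios * Decimal("0.010"))  # seconds/second

other_total = other_cpu + other_disk_proc
print(other_cpu_pct, other_ios, other_total)
```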

Interrupt processing is the last consumer of CPU seconds that needs to be accounted for. Interrupt processing is the difference between total CPU utilization and total process utilization.

Total CPU utilization is calculated by adding up the average utilization for each processor (field AVERAGE UTIL on the CPU report, page 38).

65.728 + 66.357 + 67.852 + 65.661 = 265.60%

Total process utilization is found in the PROCESS CPU UTIL field of the PROCESS report (page 40) for the group ^ALL-PROCESSES. This is the utilization for any process executing any object file on the system.

Total process utilization = 222.84%

Total interrupt utilization is the difference between total CPU utilization and total process utilization. This interrupt utilization value should be close to the INTERRUPT UTIL field available from SURVEYOR. (This value is not shown on the CPU Report on page 38, but the value is 43.019%.) Rounding error accounts for the difference between the two interrupt utilization values.

265.60 - 222.84 = 42.76% or 0.4276 seconds/second
(as compared to the 43.019% interrupt utilization value)

To check whether all the CPU utilization has been accounted for, multiply the transaction rates for each of the transaction types by their respective demands to get their total consumption. Add these figures to the consumption for other work and interrupt handling. Compare this result to the total CPU utilization reported by SURVEYOR (265.60%).

Tran1 Consumption = 1.141 x 0.149 = 0.1700
Tran2 Consumption = 1.883 x 0.072 = 0.1356
Tran3 Consumption = 6.434 x 0.109 = 0.7013
Tran4 Consumption = 7.738 x 0.150 = 1.1607
OTHER                             = 0.0170
INTERRUPT                         = 0.4276

Total                             = 2.6122 or 261.22%

The difference is due to rounding errors.
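The accounting check can be expressed as a short Python sketch. The names are illustrative assumptions; the rates, demands, OTHER, and INTERRUPT values are the ones derived above.

```python
# Transaction rates (per second) and total CPU demand per transaction (seconds).
rates = {"TRAN1": 1.141, "TRAN2": 1.883, "TRAN3": 6.434, "TRAN4": 7.738}
cpu_demand = {"TRAN1": 0.149, "TRAN2": 0.072, "TRAN3": 0.109, "TRAN4": 0.150}

OTHER = 0.0170      # seconds/second, other work
INTERRUPT = 0.4276  # seconds/second, interrupt handling
MEASURED_TOTAL_PCT = 265.60  # sum of AVERAGE UTIL over the four CPUs

# Utilization Law applied per transaction type: consumption = rate x demand.
consumption = {name: rates[name] * cpu_demand[name] for name in rates}
modeled_total = sum(consumption.values()) + OTHER + INTERRUPT
gap_pct = MEASURED_TOTAL_PCT - modeled_total * 100

print(f"modeled total: {modeled_total * 100:.2f}%")
print(f"unaccounted:   {gap_pct:.2f} percentage points")
```

The gap comes from rounding each per-process demand to milliseconds before multiplying by the transaction rates.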

Disk Consumption Modeling

Disk consumption modeling can be quite complex when all the different activities on different file structures are considered. In the simplest form of DISK modeling, all READS and WRITES to disk are considered to be processed equally. This allows for an association to be made between the logical I/Os occurring and the physical activity occurring on behalf of those logical I/Os.


The modeling approach is similar to the approach used to determine the disk process demand per transaction. First, a physical disk demand per logical I/O is found by taking the total physical disk utilization reported in SURVEYOR and dividing it by the number of logical I/Os processed. For each transaction type, the physical disk demand is calculated by multiplying the number of logical I/Os per transaction by the physical disk demand for a logical I/O.

Just as in the case of the CPUs, this looks upon all the disk drives (primaries and mirrors) as being one server.

The total disk utilization is found by adding up the PRIMARY UTIL and MIRROR UTIL fields from the DISC I/O Report. Although SURVEYOR has a field DISC UTIL, the value in this field is not the sum of the PRIMARY UTIL and the MIRROR UTIL and should not be used. (See the SURVEYOR Reference Manual for the definition of DISC UTIL.)

Disk Util = 30.915 + 28.798 + 13.979 + 32.053 + 37.525 + 33.955 + 16.918 + 37.782 = 231.925%

The Logical I/O Rate was 83.312; therefore, the physical disk demand per logical I/O is:

Total Physical Disk Utilization / Logical I/O Rate = 231.925% / 83.312 = 0.0278 seconds per logical I/O

To find the physical disk demand per transaction, multiply the number of logical I/Os per transaction by the demand per logical I/O.

Physical Disk Demand for TRAN1      = 6.08 x 0.0278 secs. = 0.1690 secs/transaction
Physical Disk Demand for TRAN2      = 1.00 x 0.0278 secs. = 0.0278 secs/transaction
Physical Disk Demand for TRAN3      = 3.05 x 0.0278 secs. = 0.0848 secs/transaction
Physical Disk Demand for TRAN4      = 7.00 x 0.0278 secs. = 0.1946 secs/transaction
Physical Disk Demand for Other work = 0.75 x 0.0278 secs. = 0.0209 secs/second

A simple accuracy check, similar to the CPU check, can be made here. Multiply the transaction rates by their respective physical disk demand per transaction and then sum the products. Compare the result to the total physical disk activity (231.925%) to see if all the disk activity has been accounted for.

Disk Consumption for:
Tran1 = 1.141 x 0.1690 = 0.1928
Tran2 = 1.883 x 0.0278 = 0.0523
Tran3 = 6.434 x 0.0848 = 0.5456
Tran4 = 7.738 x 0.1946 = 1.5058
Other                  = 0.0209
Total                  = 2.3174 or 231.74%

It appears that almost all the disk utilization (except rounding errors) has been accounted for.
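The physical disk calculation, from the demand per logical I/O through the accuracy check, can be sketched in Python. Half-up decimal rounding to four places is applied at each step to follow the worked figures; the names are illustrative assumptions, and the logical I/Os per transaction are the rounded values found earlier.

```python
from decimal import Decimal, ROUND_HALF_UP


def rnd4(value: Decimal) -> Decimal:
    """Round half-up to four decimal places."""
    return value.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


# PRIMARY UTIL and MIRROR UTIL values (percent) from the DISC I/O Report.
disk_utils = ["30.915", "28.798", "13.979", "32.053",
              "37.525", "33.955", "16.918", "37.782"]
total_disk_util = sum(Decimal(u) for u in disk_utils)
logical_io_rate = Decimal("83.312")
demand_per_io = rnd4(total_disk_util / 100 / logical_io_rate)

# Logical I/Os per transaction and transaction rates per second.
ios = {"TRAN1": "6.08", "TRAN2": "1.00", "TRAN3": "3.05", "TRAN4": "7.00"}
rates = {"TRAN1": "1.141", "TRAN2": "1.883", "TRAN3": "6.434", "TRAN4": "7.738"}

disk_demand = {t: rnd4(Decimal(n) * demand_per_io) for t, n in ios.items()}
disk_consumption = {t: rnd4(Decimal(rates[t]) * disk_demand[t]) for t in rates}
other_consumption = rnd4(Decimal("0.75") * demand_per_io)

total_consumption = sum(disk_consumption.values()) + other_consumption
print(total_disk_util, demand_per_io, total_consumption)
```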



Forecasting Using the Consumption Model

Remember the question to be answered? It is: What effect will a 20% increase of TRAN4 transactions have on the CPU and disk utilizations?

With the information from the consumption model, this question is easy to answer. The first step is to determine the new transaction rate for TRAN4. Then a "new" CPU and DISK model of the system can be created. The only tricky part is determining the new INTERRUPT workload.

The new transaction rates will be:

Tran1: 1.141 transactions per second
Tran2: 1.883 transactions per second
Tran3: 6.434 transactions per second
Tran4: 7.738 x 1.2 = 9.286 transactions per second

The new CPU consumption for each transaction type will be:

Tran1: 1.141 x 0.149 = 0.1700 seconds/second
Tran2: 1.883 x 0.072 = 0.1356 seconds/second
Tran3: 6.434 x 0.109 = 0.7013 seconds/second
Tran4: 9.286 x 0.150 = 1.3929 seconds/second

The activity of OTHER work remains constant at 0.017 seconds/second.

The INTERRUPT handling grows proportionally as the workload increases; i.e., if the INTERRUPT handling added 15% more processing to the system than the processes being run by the users, then when the load increases the interrupt handling will add 15% to the new user processing.

To estimate the new interrupt handling, first calculate the percentage of current user processing activity that INTERRUPT handling is.

INTERRUPT BUSY / CPU-UTIL for ^ALL-PROCESSES = 42.76 / 222.84 = 0.1919 or 19.19%

If all the GROUPs' consumptions (transactions and OTHER) are added together and an additional 19.19% is added for interrupt processing, the result is an estimate for the expected total CPU utilization with the increased transaction rate for TRAN4.

(0.1700 + 0.1356 + 0.7013 + 1.3929 + 0.017) x 1.1919 = 2.8806 or 288.06%

The current system has four CPUs, so the new average utilization will be (288.06/4) = 72.02% versus the average of 66.40% previously.

To determine the new physical disk demand required for the new transaction rates, multiply the new transaction rates by their respective physical disk demand and add them together.

Tran1: 1.141 x 0.1690 = 0.1928 seconds/second
Tran2: 1.883 x 0.0278 = 0.0523 seconds/second
Tran3: 6.434 x 0.0848 = 0.5456 seconds/second
Tran4: 9.286 x 0.1946 = 1.8070 seconds/second
OTHER work remains constant at: 0.0209 seconds/second
Total:                          2.6186 seconds/second

The total physical disk demand is 2.6186 seconds/second or 261.86%. Since there are eight spindles (four groups of a primary and mirror disk), the new expected average disk utilization would be:

261.86 / 8 = 32.73% new expected utilization

vs.

231.925 / 8 = 28.99% for the current system
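The forecast can be packaged as a small Python function so that other growth scenarios are easy to try. This is a sketch under the model's own assumptions (constant demands, OTHER work held constant, interrupt handling proportional to process work); the function and variable names are illustrative. It carries full precision rather than rounding at each step, so its results can differ from the worked figures in the last digit.

```python
# Current transaction rates and per-transaction demands from the consumption model.
rates = {"TRAN1": 1.141, "TRAN2": 1.883, "TRAN3": 6.434, "TRAN4": 7.738}
cpu_demand = {"TRAN1": 0.149, "TRAN2": 0.072, "TRAN3": 0.109, "TRAN4": 0.150}
disk_demand = {"TRAN1": 0.1690, "TRAN2": 0.0278, "TRAN3": 0.0848, "TRAN4": 0.1946}

OTHER_CPU = 0.017    # seconds/second, held constant
OTHER_DISK = 0.0209  # seconds/second, held constant
INTERRUPT_RATIO = 42.76 / 222.84  # interrupt busy relative to process busy
N_CPUS = 4
N_SPINDLES = 8


def forecast(growth: dict) -> tuple:
    """Return (average CPU %, average disk %) for per-type rate multipliers."""
    new_rates = {t: r * growth.get(t, 1.0) for t, r in rates.items()}
    cpu = sum(new_rates[t] * cpu_demand[t] for t in new_rates) + OTHER_CPU
    cpu *= 1.0 + INTERRUPT_RATIO  # add proportional interrupt handling
    disk = sum(new_rates[t] * disk_demand[t] for t in new_rates) + OTHER_DISK
    return 100.0 * cpu / N_CPUS, 100.0 * disk / N_SPINDLES


cpu_avg, disk_avg = forecast({"TRAN4": 1.2})  # 20% more TRAN4 transactions
print(f"average CPU utilization:  {cpu_avg:.2f}%")
print(f"average disk utilization: {disk_avg:.2f}%")
```

Passing a different dictionary, for example a multiplier for TRAN3 as well, reuses the same model for other what-if questions.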

Summary

Without taking into account the effect on response time that the increased workload would create, each CPU in the current system will be about 5.6 percentage points busier (72.02% versus 66.40%) and each disk spindle will be about 3.7 percentage points busier (32.73% versus 28.99%).

This model assumes that everything else in the system, such as the cache hit ratio, remains the same. There is also the assumption that no new bottlenecks will appear given the increased workload. Are these assumptions realistic? For low growth estimates the assumptions are more valid than for large growth estimates. That is why it is important to constantly monitor the performance of the system.

During the time period when the system is growing, the model should be validated and, if necessary, corrected. The most important concept to remember in computer model building is that the environment being modeled is always changing and the model needs to change with it. This is true whether the model being used is the linear regression technique used in the SURVEYOR FORECAST command or a consumption model built from the SURVEYOR output.


The first step in any modeling exercise is to get a good insight into the system and the data that is to be modeled. This requires months of historical data that can be analyzed. Only after analyzing the information can one pick a "good" modeling approach. SURVEYOR has great functions for manipulating the data in order to view it in many different ways. It also provides an easy interface to many software packages on a PC that can graph and analyze the information to a greater extent. These facilities will help in determining the "shape" of the data which, in turn, dictates the modeling approach to take.

It cannot be stressed enough how important it is to look at the historical information and get a feel for the trends in that data. Then, the determination of what to forecast can be made. Whether to forecast for the peaks or the averages is a business decision. If forecasting peak loads is desired, then an aggregate must be created which will capture the peak values of interest (be it daily peak or weekly peak). The peak can be found by using the MAX statistic in an aggregate definition.



Conclusion

SURVEYOR is an important tool for use in performance management of Tandem systems. This document described the major areas of performance management where SURVEYOR plays a key role.

SURVEYOR provides functions for performance data:

• collection,
• reduction,
• summarization,
• reporting,
• and management.

SURVEYOR can be the foundation tool for performance management modeling needs. A linear regression model is provided in SURVEYOR. A SURVEYOR PDB can provide the input data for other performance and capacity planning models.

The combination of SURVEYOR with the proper staff and methodology will help performance management organizations ensure that adequate computing services are provided to users.


Distributed by TANDEM

Corporate Information Center
10400 N. Tantau Ave., LOC 248-07
Cupertino, CA 95014-0708
