
Who Just Killed My DB2?!

Ivan L. Gelb

Gelb Information Systems Corp.

Email: [email protected]

Phone: 732-303-1333

@CMGA 2003 - Sydney

DB2 for z/OS and S/390

Ensuring optimum DB2 service levels in the OS/390 environment is challenging because of the not-so-obvious
dependencies between the many subsystems. Performance biases introduced by systems tuners in z/OS - OS/390,
CICS, and DB2 can dramatically affect the complex's service levels and total effective capacity. This
presentation will describe how to focus DB2 environment tuning projects while ensuring that the
interdependent areas of z/OS, CICS, TSO and batch are optimized. Attendees will learn how to avoid being
caught in unproductive finger-pointing sessions by (a) ensuring that subsystems are tuned with proper bias,
(b) monitoring performance metrics that indicate the true illness or wellness of the complex, and
(c) knowing what measures are available once the source of a service level problem is identified.



2003 GIS Corp. - www.gelbis.com

Agenda

Basics of Performance Tuning

DB2 Point of View

z/OS - OS/390 Point of View

CICS Point of View

Pointing in the Right Directions

We will focus on the latest systems versions: z/OS 1.4, CICS 2.2, DB2 6 - 8. The first four bullet
points will show where, how, and what to look for, and the last point will show six examples that
always cause contention among staff in different system and application areas. Following is our
session outline.

Basics of Performance Tuning: (1) Is anyone complaining? (2) Where to begin?

DB2 Point of View: (1) Eliminate bottlenecks; (2) Protect loved ones

OS/390 Point of View: (1) Allocate resources: CPU, I/O, storage; (2) Definitions of relative
priorities & their effects; (3) Protect loved ones; (4) Who used how much, of what, for how long

CICS Point of View: (1) Eliminate bottlenecks; (2) Protect loved ones; (3) Create throttles

Pointing in the Right Directions (3 CPU and 2 I/O cases): (1) Starved for CPU; (2) CICS region
saturated; (3) DPMODE = What? (4) Buffer Pools; (5) High I/O Service Time




Basics of Performance Tuning

Is anyone complaining?

Do YOU want to prevent complaints?

Where to begin?

Ask these questions

TRADEMARKS

The following are trade or service marks of the IBM Corporation: CICS, CICS TS, CICSPlex,

DB2, IBM, MVS, OS/390, z/OS, Sysplex, Parallel Sysplex. Any omissions are purely unintended.

MOAD - MOTHER OF ALL DISCLAIMERS

All of the information in this document is tried and true. However, this fact alone cannot guarantee

that you can get the same results at your place and with your skills. In fact, some of this advice can

be hurtful if it is misused and misunderstood. As with all kinds of analysis, anything you may hear

or read can be understood and misunderstood in many ways that may seem contradictory to you. In

this regard, a further and associated contradictory element requires considerable systems analysis

and trade-off studies to arrive at the structural design, based on rigorous system engineering

concepts. By combining advice and certain experiences, any fully integrated performance test
program is weakly equivalent to any subsystem compatibility testing designed to eschew
obfuscation. Gelb Information Systems Corporation, Ivan Gelb and anyone found anywhere
assume no responsibility for this information's accuracy, completeness or suitability for any
purpose. Anyone attempting to adapt these techniques to their own environment does so
completely at their own risk. ;-)




Is Anyone Complaining?

If YES, why?

Some service is slow and/or failing to meet service level

objectives (be sure that objectives are rational!)

Response time is OK, but not enough is completed

For either case:

Total Delays = Response Time - Measured Service Time

What is the Total Delays / Total Response Time ratio? A ratio < .50 is OK, but note that at .50
the delays have already doubled the response time relative to the service time.

The ratio acceptable to your site will depend on service level goals.
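The delay arithmetic on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the sample timings are assumptions, not values from any report in this session.

```python
def delay_ratio(response_time: float, service_time: float) -> float:
    """Total Delays / Total Response Time, with delays = response - service."""
    if response_time <= 0:
        raise ValueError("response_time must be positive")
    if not 0 <= service_time <= response_time:
        raise ValueError("service_time must be between 0 and response_time")
    total_delays = response_time - service_time
    return total_delays / response_time


def stretch_factor(ratio: float) -> float:
    """How many times longer than the service time the response time is."""
    if not 0 <= ratio < 1:
        raise ValueError("ratio must be in [0, 1)")
    return 1.0 / (1.0 - ratio)


if __name__ == "__main__":
    # Hypothetical transaction: 0.8 s response time, 0.5 s measured service time.
    ratio = delay_ratio(0.8, 0.5)
    print(f"delay ratio    = {ratio:.3f}")
    print(f"stretch factor = {stretch_factor(ratio):.2f}x")
```

A ratio of .50 gives a stretch factor of 2, which is the "up to doubles response time" remark above.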

First, we should not be waiting for complaints even though it helps justify the existence of our job.

Be effectively proactive, and they may wonder why tuning is needed.

Being proactive includes at least the following minimum activities:

-Profiles of all workloads so you can tell if behavior changes,

-Tracking of past and future changes in system software, hardware, and applications, and

-Profiles of utilization by business unit of work, which is the most effective way to do all this.




Do YOU Want to Prevent Complaints?

We consider WAITING for

complaints a CLS (Career

Limiting Strategy).

Your action plan should

prevent complaints.

If you do your job real well,

some may wonder if you are

needed.

Whether you wait for or prevent complaints, the plan of action is the same.

[Figure: the performance management cycle, with the steps Measure, Performance OK?, Any Ideas?, Adjust 1 Thing!]

This is a bit of philosophy here.

The practical issue is that for well managed shops the best practice is to work on prevention of

complaints.

Performance management is a cyclical activity.




Where to begin - 1

Evaluate performance:

Inside DB2

Outside DB2

Inside DB2 (not our session's focus)

Obtain / establish Service Level Agreements (SLAs) for business critical work.

Without SLAs, identify what is a reasonable response time given the service time profile of the
workload.

Collect accounting trace classes 2 and 3 for the best information on externally caused delays.

Class 1 and 3 accounting traces reveal the task's complete activity (we will review a summary
report from DB2 PM).

A DB2 performance evaluation should always start from inside out. Simply, you do not want to be

found with your homework not done.

In our quest for who killed your DB2 performance, we are going to look at reports that show
both types of measurements: indicators of problems inside DB2, and indicators that factors outside
DB2 are the cause of degradation.

This session will drive towards identifying situations where the outside factors are causing the

misery.




Where to begin - 2

Outside DB2 analysis focuses on eliminating factors that degrade DB2's performance

Search for hints inside DB2 that outside factors may be the cause of problems:

Wait for:

CPU due to higher priority work

CPU due to LPAR management of weights

Central storage

Long basic I/O service times:

> 3 msec for cached writes

> 5 msec for cached reads

Long non-service time components of I/O service: PEND, Disconnect, Device Busy, Control Unit Busy

As you can see from the list on the slide, delays come in many flavors.

We will show reports and recommend where to look to isolate causes for the delays.
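As a rough illustration of the I/O thresholds on the slide, the following Python sketch flags long cached I/O service times and estimates how much of a response time is made up of non-connect components. The 3 msec and 5 msec limits come from the slide; the function names, the choice of components, and the sample numbers are assumptions made for this example only.

```python
# Limits quoted on the slide for basic service time of cached I/O (msec).
CACHED_WRITE_LIMIT_MS = 3.0
CACHED_READ_LIMIT_MS = 5.0


def is_long_service_time(kind: str, service_ms: float) -> bool:
    """True when a cached 'read' or 'write' exceeds the slide's limit."""
    limits = {"write": CACHED_WRITE_LIMIT_MS, "read": CACHED_READ_LIMIT_MS}
    return service_ms > limits[kind]


def non_service_share(connect_ms: float, pend_ms: float,
                      disconnect_ms: float, iosq_ms: float = 0.0) -> float:
    """Fraction of total response time that is NOT connect time.

    Assumption for this sketch: response = IOSQ + PEND + DISC + CONN.
    """
    total = connect_ms + pend_ms + disconnect_ms + iosq_ms
    if total <= 0:
        raise ValueError("total response time must be positive")
    return (pend_ms + disconnect_ms + iosq_ms) / total


if __name__ == "__main__":
    print(is_long_service_time("write", 4.2))   # above the 3 msec limit
    print(is_long_service_time("read", 4.2))    # below the 5 msec limit
    print(non_service_share(connect_ms=2.0, pend_ms=1.0, disconnect_ms=1.0))
```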




Where to begin - 3

Prepare profiles of resource utilization by workload:

CPU, I/O, processing parallelism, processor storage, network time

Any latent demand?

If YES, what makes it up? Work you HAVE to care about or NOT?

Easy to tell if there is some latent demand: see the RMF CPU Activity report

Harder to tell who it is waiting for CPU

Where to look?

RMF Workload, CPU, device, storage activity reports

SMF file activity reports

CICS & DB2 subsystems reports

CICS & DB2 subsystem traces are the last resort: the most time
consuming to analyze, but the most trustworthy

This is an outline of where to begin analysis. Samples of how this is done follow later in

presentation.




Ask These Questions - 1

What is your class 2 elapsed time (C2ET)?

How much of C2ET is:

Wait for CPU dispatch?

MVS Overhead?

I/O-related wait?

System page-ins?

Which DASD devices are not providing good performance?




Ask These Questions - 2

Other classic questions you should not forget to ask, but that will not be discussed in much detail in our session today:

What is your system I/O rate/Second for:

Each pool?

Critical objects?

What is your average Synch I/O elapsed time?

What are your buffer pool hit ratios?

Is your SQL coding effective?




DB2 Point of View

Eliminate bottlenecks

Protect loved ones

Identify Suspect Areas




Eliminate Bottlenecks

Start with trace or detailed performance report of a task with

performance problems and examine time spent using versus

waiting for each resource:

CPU

Storage

I/O activity

DB2 locking activity

Thread waits

Application enqueues

Network time for DB2 initiated external units of work

DB2 DISPLAY command

Shows thread waits or cases when a request is moved to a pool thread

This is a checklist of what we are looking for in this analysis.




Protect Loved Ones - 1

Isolation is the best protection but not always possible

Such protection can be implemented at many levels:

WLM (z/OS Workload Manager) and IRD (Intelligent Resource

Director) in the SYSPLEX

WLM service policy coded relative priorities

WLM CPU priority protection (new as of OS/390 Release 10)

WLM Storage protection (new as of OS/390 Release 10)

PR/SM LPAR weights

WLM I/O priority protection

Isolation of DB2 objects in buffer pools

Isolation of DB2 objects on specific devices

Protection of resources via WLM goal mode service definitions can be very effective.

Some WLM service policy creativity can cause problems. Avoid:

(A) more than 15 - 20 service classes,

(B) complicated work classification rules - the fewest rules cause the least CPU overhead,

(C) classification rules in the wrong order (most to least likely order of conditions is a MUST),

(D) multiple periods per service class unless carefully considered - the best use is for low
importance work that can truly be decreased in importance as it demonstrates a higher and higher
propensity to use CPU resources, and

(E) specifying a non-achievable service class goal - example: 90% < 1 second when you can see
your average never reaching this goal.

We will examine next what WLM can do, what its protection looks like, and what we should
expect as results.




Identify Suspect Areas

Source: DB2 PM Accounting Report (long version)

Focus on the following times to identify sources of where your work is processing or waiting:

A. Lock/Latch time of DB2 + IRLM

B. Synchronous I/O

C. Class 2 CPU

D. Other read I/O made up from:

a. Sequential prefetch

b. List prefetch

c. Sequential detection

d. Synchronous I/O by another thread (different from this one)

E. Other write suspensions

a. Asynchronous write I/O

b. Synchronous write I/O by another thread (different from this one)

F. DB2 service task suspensions

a. Wait for data set extend/delete/define task

b. Wait for other service tasks

G. Suspended for processing ARCHIVE LOG MODE(QUIESCE)

H. Suspended for read from archive log tape

I. Suspended waiting for a drain lock

J. Suspended for release of an object from all claim holders

K. Suspended for page latch - do you do RUNSTATS and COPY with SHRLEVEL(CHANGE)? This will cause this page latch contention.




z/OS - OS/390 Point of View

WLM's Resource Management

Definitions of relative priorities

Protect loved ones

Who used what and how much

Note:

All sample RMF reports are from: SC33-7991 z/OS RMF

Report Analysis (Version 1.4)

All other samples are from various GIS projects




WLM's Resource Management

WLM managed resources:

Processor (CPU)

Storage controls

Multi-programming levels (MPL)

I/O priority

Parallel I/O access volumes

JES initiators

DB2 stored procedure address spaces

WebSphere scalable address spaces




Definitions of Relative Priorities

WLM Priorities:

Pre-defined service classes

SYSTEM (fixed CPU DP=255)

SYSSTC (fixed CPU DP=254)

Importance 1 - 5

Discretionary - when you DON'T NEED TO CARE!

WLM goal types:

Percentile response time

Average response time

Velocity - guarantees CPU access only, not priority

IRLM should be in SYSSTC. DB2 inherits the other importance levels from the caller, so they can
vary greatly.

Please note that a velocity goal only guarantees access to the processor and not the CPU
dispatching priority. If it is set too high, the work can never reach it. After a few attempts, WLM
will give up on trying to help this workload.

Highly recommended settings:

1. IOQ=PRIORITY

2. MSO coefficient of 0 or 0.0001 (the minimum possible)

3. Equal CPU and SRB service definition coefficients




Protect Loved Ones

If WLM level < Release 10

If service goal being met, NO guarantee that CPU access of

your loved workloads will be protected from lower priority

work

DITTO for central storage - can cause pain due to paging

New options since OS/390 Release 10:

Identify service class as CPU Critical

Identify service class as Storage Critical - you WILL HURT
less important work and increase the system paging rate. You
should be concerned about this in 64-bit mode systems that
are storage poor! They page to disks!

Watch out for I/O priority shifts caused by multi-period service classes

What is a loved one? A workload that your business is willing to spend resources on to
maintain and/or improve its quality of service. In other words, work that is most important for
your business.

WLM may do service policy adjustments every 10 seconds and resource adjustments
every 2 seconds. This is not fast enough for critical online work. Once a lower priority task
is moved above your loved one's CPU dispatch priority, the few seconds required to
regain the higher CPU dispatching priority will cause missed service goals.

The CPU CRITICAL attribute solves this problem. Lower priority work's priority will not be
raised above higher importance work with this attribute set.

STORAGE CRITICAL solves the problem of paging by protecting the working set of your
favorite work.

I/O priority shifts caused by multi-period service classes may increase/decrease the
performance of a loved workload in unexpected ways. Just be aware this potential exists
by identifying such service classes.




Who Used What and How Much

CPU Activity Reports

LPAR Activity Reports

I/O Device Activity Reports

Workload Activity Reports




CPU Activity


Provides the only 100% accurate CPU utilization figures for all LPARs and each LPAR individually.
Use it in conjunction with workload activity measurements to establish CPU utilization capture
ratios.

Observe and consider:

1. ONLINE TIME - less than 100% indicates a CPU being varied on- or offline. IRD or a manual
process may cause this.

2. LPAR BUSY % - what % of each allocated CPU this LPAR utilized. Less than 100% indicates
possible capacity issues.

3. MVS BUSY % - LPAR's % CPU utilization. 100% should cause performance and capacity
concerns if (a) anyone complains, and (b) critical workloads + SYSTEM make up 90-95%+ of
the utilization

4. QUEUE LENGTHS (%) - indicates how many others you may have to wait behind for CPU
access

5. IN READY - address spaces ready to run but CPU not available

6. OUT READY - even worse than IN READY if the OUTs are workloads you care about. See
workload activity reports to determine the victims




CPU Activity - Processor Delays


Processor delays report identifies who is delayed and by ABOUT how much.

1. DLY % = (# of Delay Samples / # of Samples) * 100 is % of time task is delayed from getting

CPU time

2. USG % = (# Using Samples / # Samples ) * 100 is % of time the task is receiving CPU service

3. Holding Job(s) - up to three tasks that most contributed to delay
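The two sampling formulas above translate directly into code. The following Python sketch is illustrative only; the function name and the sample counts are made up for the example.

```python
def sample_percentages(delay_samples: int, using_samples: int,
                       total_samples: int) -> tuple[float, float]:
    """Return (DLY %, USG %) from sample counts.

    DLY % = (# of delay samples / # of samples) * 100
    USG % = (# of using samples / # of samples) * 100
    """
    if total_samples <= 0:
        raise ValueError("total_samples must be positive")
    dly_pct = delay_samples / total_samples * 100
    usg_pct = using_samples / total_samples * 100
    return dly_pct, usg_pct


if __name__ == "__main__":
    # Hypothetical interval: 150 samples, 30 found delayed for CPU, 45 using CPU.
    dly, usg = sample_percentages(30, 45, 150)
    print(f"DLY % = {dly:.1f}, USG % = {usg:.1f}")
```

Because these are sampled states, treat the results as the "ABOUT how much" the notes describe rather than as exact measurements.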




LPAR Partition Data Report


Partition Data Report is from the RMF post processor. This is the most useful single place where

we can see defined and actual LPAR capacity reporting.

1. WGT - LPAR's weight / total defined weight is the % SHARE this LPAR will be dispatched
by PR/SM if it needs CPU service

2. MSU DEF and ACT - defined and actual LPAR MSUs

3. CAPPING DEF - partition's capping option

4. CAPPING WLM% - % of time WLM capped this LPAR

5. LPAR MGT - LPAR management overhead

To minimize LPAR overhead, try to define a ratio no greater than 2 logical CPUs defined per
physical CPU. This ratio is calculated by adding the logical CPUs defined in all LPARs and
dividing this total by the number of available physical CPUs.
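Both calculations described above, the % SHARE implied by LPAR weights and the logical-to-physical CPU ratio, are simple enough to check with a few lines of Python. The LPAR names, weights, and CPU counts below are hypothetical and only serve as a worked example.

```python
def weight_shares(weights: dict[str, int]) -> dict[str, float]:
    """% SHARE per LPAR = LPAR weight / total defined weight * 100."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("total defined weight must be positive")
    return {name: weight / total * 100 for name, weight in weights.items()}


def logical_to_physical_ratio(logical_cpus: dict[str, int],
                              physical_cpus: int) -> float:
    """Sum of logical CPUs defined in all LPARs / available physical CPUs."""
    if physical_cpus <= 0:
        raise ValueError("physical_cpus must be positive")
    return sum(logical_cpus.values()) / physical_cpus


if __name__ == "__main__":
    # Hypothetical three-LPAR configuration on an 8-way machine.
    shares = weight_shares({"PROD": 600, "TEST": 300, "DEV": 100})
    ratio = logical_to_physical_ratio({"PROD": 6, "TEST": 4, "DEV": 2}, 8)
    for name, pct in shares.items():
        print(f"{name}: {pct:.1f}% share")
    print(f"logical:physical = {ratio:.2f} (guideline above: no greater than 2)")
```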




LPAR Cluster Activity


Summary of attributes and activity of LPARs. Note the 2 PLEXs on this report because this one is
a SYSPLEX-wide example.

1. TOTAL% LBUSY - logical CPU busy

2. TOTAL% PBUSY - physical CPU busy




LPAR Activity from RMF III

The CPC Capacity report allows online examination of the same information that RMF also
records and that can be obtained with the post processor reports.

1. MSU Def - defined capacity

2. MSU Act - actual capacity

3. Cap Def - defined capacity for variable Workload License Charging (vWLC)

4. Proc Num - number of logical processors

5. Logical Util %

6. Physical Util %




Processor Speed Issues

A problem if expected speed is not delivered for your work

Why?

Cycle time - often the least reliable indicator

MSU rate / hour - also not uniformly reliable

MIPS rates are based on cycle times

MIPS rates are based on various specific workloads

IBMs Large Systems Performance Reference (LSPR)

contains various rates for different types of workloads. For

latest results visit:

http://www-1.ibm.com/servers/eserver/zseries/lspr/zSeries.html

A side, but very important area for what we are covering here.

For example, if you just upgraded from a 125 MIPS CPU to a 250 MIPS CPU, you might expect
CPU time to drop by 50%. There are many more reasons than the three major ones identified here
why such expectations are not met.

Differences in the profile of workloads can produce an over 25% swing in the wrong direction for

your workloads.

Not much you can do about this other than doing your homework to understand that the decreased

throughput is not due to some performance tuning factors that needed adjusting when you changed

hardware.




I/O Device Activity -1


DASD Activity report tells us all we need to know about a single volume.

Possibly unproductive activity to watch out for:

1. IOSQ TIME - IOS queue

2. DPB DLY - director port delay

3. DB DLY - delay due to device busy

4. PEND TIME - pending

5. DISC TIME - disconnect

6. AVG NUMBER ALLOC reveals how many files were open on the volume. Did you expect

to be alone?




I/O Device Activity - 2


This report reveals where the I/O activity for any single DASD volume originates, and what
the level of the activity is. This report often surprises. Just when you think you are alone, you
are NOT.

1. SMF SYS ID - produces one line for each system touching this volume

2. % DEV RESV - device reserved by another system

3. AVG NUMBER ALLOC - avg. number of files allocated in this interval




I/O Device Activity - 3


Device I/O activity delays report shows which devices delay a particular workload, and what are

the chief contributors to these delays.

1. DLY % - delay this job experienced

2. USG % - using %

3. CON % - connect %

4. MAIN DELAY VOLUME(S) - % delay contributed by top 4 volumes




File Level I/O Details - 1

From SMF type 42-6 records sorted by I/O rate

Source: Joel Goldstein, Responsive Systems


SMF type 42-6 records are the finest tool for analysis of I/O activity for any file component. The
only downside to this record is that if activity is produced on multiple CPUs, you are best off
merging these records into a single report. This will reveal the different points of view the
same object produces on different SYSPLEX members.

1. IO INTENSITY - calculated as the product of

2. IO RESP - device response time, and

3. IO COUNT - duhhhhhhh
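The I/O intensity calculation is easy to reproduce once the per-file figures are extracted. The Python sketch below assumes a hypothetical record layout (data set name, response time in msec, I/O count); it does not parse real SMF 42-6 records, and the data set names and numbers are invented.

```python
from typing import NamedTuple


class FileIO(NamedTuple):
    """Hypothetical per-file summary, e.g. after merging records from all systems."""
    dataset: str
    io_resp_ms: float   # IO RESP: device response time per I/O
    io_count: int       # IO COUNT: number of I/Os in the interval


def io_intensity(rec: FileIO) -> float:
    """IO INTENSITY = IO RESP * IO COUNT (total msec spent in I/O)."""
    return rec.io_resp_ms * rec.io_count


def rank_by_intensity(records: list[FileIO]) -> list[FileIO]:
    """Most I/O-intensive files first: the best tuning candidates."""
    return sorted(records, key=io_intensity, reverse=True)


if __name__ == "__main__":
    sample = [
        FileIO("DB2P.DSNDBD.ORDERS.TS01", 2.0, 1000),
        FileIO("DB2P.DSNDBD.ORDERS.IX01", 12.0, 400),
        FileIO("DB2P.DSNDBD.CUST.TS01", 1.5, 2000),
    ]
    for rec in rank_by_intensity(sample):
        print(f"{rec.dataset:<28} intensity={io_intensity(rec):8.0f}")
```

Ranking by intensity rather than by I/O rate alone brings up files with few but slow I/Os, such as the index in this made-up sample.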




File Level I/O Details - 2



From the SMF 42-6 we can produce all the details about the physical I/O activity of any object on

any volume. This helps us identify where to concentrate tuning activities.

Such data, fed into some type of modeling tool, can be used to explore what-if scenarios
before the effort to make the change is expended.

This sample of 42-6 reporting shows analysis details possible for an object. It can be viewed as:

1. Part of all databases, or

2. Within a single volume




Source: Chris Baker, IBM

RMF Workload Measurements

Basically, the BTE number is the TOR's point of view of the response time, versus the EXE,
which is the other stuff.

We could have actually drawn another box for an FOR, which would be a subset of EXE.

Transactions that run in multiple regions will have multiple EXE lines.

DB2 activity is issued from AORs to DB2.




RMF Workload Activity - 1

                              W O R K L O A D   A C T I V I T Y

MVS/ESA     SYSPLEX WSC1         DATE 01/14/1997     INTERVAL 15.00.002     MODE = GOAL
SP5.2.2     RPT VERSION 1.2.0    TIME 09.29.00

POLICY ACTIVATION DATE/TIME 01/14/1997 06.50.04

REPORT BY: POLICY=CICSHARE WORKLOAD=CICSWKLD SERVICE CLASS=CTRAN1 RESOURCE GROUP=*NONE PERIOD=1 IMPORTANCE=HIGHEST

-TRANSACTIONS--   TRANSACTION TIME     HHH.MM.SS.TTT
AVG        0.00   ACTUAL               000.00.00.187
MPL        0.00   QUEUED               000.00.00.187
ENDED      4363   EXECUTION            000.00.00.000
END/SEC    4.85   STANDARD DEVIATION   000.00.01.423
#SWAPS        0
EXECUTD       0

-------------------------------RESPONSE TIME BREAKDOWN IN PERCENTAGE-------------------- ------STATE------
SUB  P    TOTAL ACTIVE  READY   IDLE -----------------------------WAITING FOR-----------------------------   SWITCHED TIME (%)
TYPE                                   LOCK    I/O   CONV   DIST  LOCAL  SYSPL  REMOT  TIMER   PROD   MISC  LOCAL  SYSPL  REMOT
CICS BTE    760   27.6   12.0    233    0.2    0.1    0.0    0.0    0.0    0.0    0.0    221   45.8    221    0.0    0.0    0.0

           ---RESPONSE TIME---   EX    PERF
           HH.MM.SS.TTT          VEL   INDX
GOALS      00.00.00.500 AVG
ACTUALS    00.00.00.187          N/A   0.4

----------RESPONSE TIME DISTRIBUTION----------
----TIME----     ---NUMBER TRANSACTIONS---     ----PERCENT----
HH.MM.SS.TTT     BUCKETS      TOTAL            BUCKETS   TOTAL
<  00.00.00.250     4109       4109               94.2    94.2
...
>  00.00.01.000       42       4363                1.0     100

Sample RMF post processor (ERBRMFPP) report with option SYSRPTS(WLMGL(SCPER))


This is the RMF workload activity report.

You can make statements like 94% ran in less than a quarter of a second: this is the

response time distribution from RMF post processor report.

Also, ABOUT 45.8% of the time was spent in DB2.
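Two of the figures on this report can be cross-checked by hand. The Python sketch below recomputes the performance index for the average response time goal (actual divided by goal, which is how WLM defines the index for response time goals) and the share of transactions in the first distribution bucket. The numbers are the ones printed on the sample report; the function names are assumptions for this illustration.

```python
def performance_index(actual_seconds: float, goal_seconds: float) -> float:
    """PI for a response time goal: actual / goal (below 1.0 means goal is met)."""
    if goal_seconds <= 0:
        raise ValueError("goal_seconds must be positive")
    return actual_seconds / goal_seconds


def bucket_percent(bucket_count: int, ended: int) -> float:
    """Percentage of ended transactions that fell into one distribution bucket."""
    if ended <= 0:
        raise ValueError("ended must be positive")
    return bucket_count / ended * 100


if __name__ == "__main__":
    # From the sample report: goal 0.500 s AVG, actual 0.187 s,
    # 4109 of 4363 ended transactions completed in under 0.250 s.
    print(f"PERF INDX  = {performance_index(0.187, 0.500):.1f}")
    print(f"< 0.250 s  = {bucket_percent(4109, 4363):.1f}%")
```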





RMF Workload Activity - 2

REPORT BY: POLICY=HPTSPOL1 WORKLOAD=PRODWKLD SERVICE CLASS=CICSHR RESOURCE GROUP=*NONE PERIOD=1 IMPORTANCE=HIGH

-TRANSACTIONS--   TRANSACTION TIME     HHH.MM.SS.TTT
AVG        0.00   ACTUAL               000.00.00.114
MPL        0.00   QUEUED               000.00.00.036
ENDED       216   EXECUTION            000.00.00.078
END/SEC    0.24   STANDARD DEVIATION   000.00.00.270
#SWAPS        0
EXECUTD     216

-------------------------------RESPONSE TIME BREAKDOWN IN PERCENTAGE-------------------- ------STATE------
SUB  P    TOTAL ACTIVE  READY   IDLE -----------------------------WAITING FOR-----------------------------   SWITCHED TIME (%)
TYPE                                   LOCK    I/O   CONV   DIST  LOCAL  SYSPL  REMOT  TIMER   PROD   MISC  LOCAL  SYSPL  REMOT
CICS BTE   93.4   10.2    0.0    0.0    0.0    0.0   83.3    0.0    0.0    0.0    0.0    0.0    0.0    0.0   83.3    0.0    0.0
CICS EXE   67.0   13.2    7.1    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   46.7    0.0    0.0    0.0    0.0

This is a sample RMF post processor (ERBRMFPP) output with option SYSRPTS(WLMGL(SCPER))

Now here is a slightly more detailed view where you have BTE and EXE lines.

How safe are these numbers?

Well, it's a response time breakdown into percentages.

How is RMF finding this out? These numbers are vulnerable to all kinds of
things because they are from CICS Performance Block (PB) sampling.

From CMF you get the absolute number if this is not enough.

The PROD column on this report is the % of time CICS thinks this transaction is
waiting for DB2 activity to complete.




RMF Workload Activity - 3

Storage delays due to paging and swapping activity.

DLY% - delay workload experienced

% Delayed for OTHR includes delays due to VIO, cross-memory address space paging, and

hiperspace paging in one total.




CICS Point of View

Eliminate bottlenecks

Protect loved ones

Create throttles

CICS/DB2 Performance Improvements




Eliminate Bottlenecks

Do not hit limit conditions in CICS region

MAXTASK

TRANCLASS

DB2 threads

Applications enqueues

MAXTASK is a user-specified maximum for the tasks the region will handle at any one time. You
should set this just high enough so the region does not hit it. By setting this to a value that causes
delays, you produce a throttle on the processing capacity, and therefore on the ability to consume
resources, of this region.

TRANCLASS is another place you can create a throttle. This is the recommended way to limit

certain resource hogs from dominating the entire region.

DB2 threads should not cause any waits. Waiting for a thread increases the cost per unit of work.

Application produced enqueues are an area worthy of a dedicated presentation. Just look for signs

that this is going on, and try to eliminate/minimize this type of waiting.




Protect Loved Ones

Protection can be implemented at many levels:

WLM and IRD (Intelligent Resource Director) in the SYSPLEX

WLM Workload relative priorities

PR/SM LPAR weights

WLM combined with IRD will shift resources within LPARs in a SYSPLEX to meet service

objectives.

WLM relative priorities will ensure that the work defined as most important gets all the available
resources until the service objectives are met. Only then will lower priority work also get a chance
to run.

PR/SM LPAR weights can be static when defined by Operations, and they can be dynamically
managed by IRD if in WLM goal mode.




Create Throttles

Most effective throttles are highest level MPL controls at

various points:

Maximum batch initiators

Maximum tasks within a CICS region

Transactions restricted by class

Lower level throttles, or factors that in any way slow down the

work once it begins, cause wasted resources.

Just follow these recommendations for a well performing outcome.




CICS Dispatcher Statistics

DISPATCHER STATISTICS

Dispatcher Start Date and Time. . . . . . . : 11/24/2002 09:22:44.7563

Address Space CPU Time. . . . . . . . . . . : 02:11:34.1901

Address Space SRB Time. . . . . . . . . . . : 00:02:24.3700

Peak number of dispatcher tasks . . . . . . : 149

Peak ICV time (msec). . . . . . . . . . . . : 1000

Peak ICVR time (msec) . . . . . . . . . . . : 150000

Peak ICVTSD time (msec) . . . . . . . . . . : 250

Peak PRTYAGE time (msec). . . . . . . . . . : 0

Peak MRO (QR) Batching (MROBTCH) value. . . : 1

Number of Excess TCB Scans. . . . . . . . . : 1030792M

Excess TCB Scans - No TCB Detached. . . . . : 901943M

Number of Excess TCBs Detached. . . . . . . : 222681M

Average Excess TCBs Detached per Scan . . . : 0

Number of CICS TCB MODEs. . . . . . . . . . : 13

Number of CICS TCB POOLs. . . . . . . . . . : 3

Notes:

- Excess TCB scans and detaches increase unproductive overhead.

- Tune number of TCB-s allocated to minimize overhead.

CICS dispatcher statistics reveal the effects of excess TCBs allocated in the region.

Reduce MAXOPENTCBS to reduce the excesses.




CICS DB2 Entry Statistics

DB2ENTRY STATISTICS - REQUESTS

DB2Entry     Call  Signon Partial  Commit   Abort  Single  Thread  Thread       Thread
Name        Count   Count  Signon   Count   Count   Phase   Reuse   Terms Waits/Overfl
AMD2      2730679   24238    8147       0      26   24222   23644     594            0
MDI             0       0       0       0       0       0       0       0            0
MDI1            0       0       0       0       0       0       0       0            0
MDI2            0       0       0       0       0       0       0       0            0
MNIF         1213      31       4       0       0      31       0       0           31
MT1010MQ    43872     871     868      30       3     841       0     871            0
MT4I         2814      22      15      68       0       4       0       0           22

Note: Many repetitive lines deleted from here
______________________________________________________________________________________________________
*TOTALS*  2778578   25162    9034      98      29   25098   23644    1465           53

CICS DB2 Entry statistics show requests by DB2 entry.

Minimize ABORT COUNT and THREAD WAITS/OVERFL.




CICS/DB2 Performance Improvements - 1

Save TCB switch costs with DB2 Version 6
- Application needs to be marked as Threadsafe
- CICS APIs used between DB2 calls must be Threadsafe

Percentage savings depend on application
- Savings based on the saved task switches and total path length

A switch from CICS's QR TCB to another TCB and back is about 4K instructions
- TCB switches for 25 SQLs cost 1 millisecond (ms) CPU on 100 MIPS CPUs

Source: Geoff Sharman - with thanks to John Burgess, IBM, Hursley

CTS 2.2 can have significant performance improvements for CICS applications

making many calls to DB2.

CICS applications normally run on the CICS QR TCB. When they make a call to
DB2, the request is processed on another TCB associated with a DB2 thread. This
requires a TCB mode switch. When the request is complete, the application resumes
executing back on the QR TCB, and this requires another TCB mode switch.

With CTS 2.2, if the application is marked as 'Threadsafe', the application code
between the DB2 calls can continue running on the same TCB as the DB2 thread
instead of switching back to the QR TCB. When this is exploited, 2 TCB
mode switches can be saved for each DB2 call. This can yield potentially significant
CPU savings for these applications.

Source: Dave Raiman, IBM. - As an example, the same application making 100

DB2 calls was run both on CICS 2.1 and CICS 2.2 on our 9672 X37 processor.

When run on CICS 2.1, the application used 11.27ms of CPU, when run on CICS

2.2 the application took 8.7ms of CPU.

The results show that this particular application made a 22% CPU saving when

migrated to CICS 2.2.

Minimum V6 of DB2 required for these savings.
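The arithmetic behind the two claims on this slide can be checked with a short Python sketch. It uses only the figures quoted above (about 4K instructions per QR-to-other-TCB round trip, 25 SQL calls, a 100 MIPS CPU, and the 11.27 ms versus 8.7 ms measurement); the function names are assumptions made for this illustration.

```python
def switch_cpu_ms(sql_calls: int, mips: float,
                  instr_per_round_trip: int = 4000) -> float:
    """CPU milliseconds spent on TCB switching for a number of SQL calls.

    One round trip (QR TCB -> other TCB -> QR TCB) per SQL call is assumed,
    at roughly 4K instructions per round trip as quoted on the slide.
    """
    if mips <= 0:
        raise ValueError("mips must be positive")
    instructions = sql_calls * instr_per_round_trip
    seconds = instructions / (mips * 1_000_000)
    return seconds * 1000


def cpu_saving_pct(before_ms: float, after_ms: float) -> float:
    """Relative CPU saving in percent."""
    if before_ms <= 0:
        raise ValueError("before_ms must be positive")
    return (before_ms - after_ms) / before_ms * 100


if __name__ == "__main__":
    print(f"25 SQLs on 100 MIPS: {switch_cpu_ms(25, 100):.2f} ms of switching")
    print(f"CICS 2.1 -> 2.2 sample: {cpu_saving_pct(11.27, 8.7):.1f}% CPU saved")
```

The second figure comes out just under 23%, consistent with the roughly 22% saving quoted for that application.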




CICS/DB2 Performance Improvements - 2

How? With CSD program definition CONCURRENCY(THREADSAFE)

CICS control of all TCBs in region with MAXOPENTCBS in SIT
- If excessive TCBs, CPU time wasted on scanning them

DB2CONN TCBLIMIT defines number of L8 TCBs that can be connected to DB2

See CICS Application Programmer Reference manual Appendix L.

CTS 2.2 and minimum V6 of DB2 required to get this benefit.

Other than saving some CPU time used by TCB switches, the CICS QR TCB constraint
is relieved by more processing running on the L8 TCBs. This can enable the
re-combining of regions that were split into multiple AORs (Application Owning
Regions). Such merges can save valuable system resources: CPU, storage, disk
space, I/O activity.

How do you get this? (1) Automatically: SQL calls will switch to L8 TCBs and the
application will stay there until a non-threadsafe command is encountered. Thus, the
SQL processing and some of the application is automatically shifted to L8 TCBs.

(2) Specify the CONCURRENCY(THREADSAFE) attribute in the CSD program definition
for such programs. If you lie, the programs will just keep switching and
you are causing one extra TCB switch with your lying.

MAXOPENTCBS in the CICS region controls the total of all TCBs within one region.
The CICS region should be tuned to have enough of them so wait for TCB is
eliminated/minimized. Specifying many more than needed causes wasted CPU and
storage (storage below 16 MB may be critical to you).

DB2CONN TCBLIMIT specifies a subset of the MAXOPENTCBS number to be used
for L8 TCBs. Again there are statistics to show any wait that might be caused if not
enough are specified.

See Appendix L in the CICS Application Programmer Reference manual for the
complete list of threadsafe commands. IBM is working on making more commands
threadsafe ASAP.




Pointing in the Right Directions

3 CPU and 2 I/O examples

Starved for CPU

CICS region saturated

DPMODE = What?

Buffer Pools

High I/O Service Time




Starved for CPU

Symptoms: CPU time in OK range, but wait for CPU

is significant part of service time

Where to look?

RMF CPU Activity

RMF LPAR Activity

RMF Workload Activity

DB2 trace and class 2 data

What to look for?

High LPAR utilization but low % of physical CPUs

High LPAR management overhead

High wait times in DB2 trace and accounting data

Higher priority work causing starvation

The measurements can be deceptive and cause misdirection of performance enhancing activities.

CPU starvation is often caused by LPAR weight settings. These values provide a percentage-based
distribution to each LPAR. No priority is involved, and the distribution is not enforced until the
complex is 100% utilized.

So, things can be fine as you approach 100% utilization, and then go to unacceptable levels very

quickly.
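The percentage-based distribution can be sketched as follows. This is a minimal illustration, assuming each LPAR's share is its weight divided by the sum of all weights; the LPAR names, weights, and CPU count are hypothetical and not taken from any measured system.

```python
def lpar_cpu_shares(weights, shared_cpus):
    """Return the number of shared physical CPUs each LPAR's weight entitles it to."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return {name: weight / total * shared_cpus for name, weight in weights.items()}


# Hypothetical configuration: three LPARs sharing 8 physical CPUs.
weights = {"PROD": 600, "TEST": 300, "DEV": 100}
shares = lpar_cpu_shares(weights, shared_cpus=8)

for name, cpus in shares.items():
    pct = weights[name] / sum(weights.values())
    print(f"{name}: {pct:.0%} of the complex, about {cpus:.1f} CPUs")
```

Below 100% utilization an LPAR can use more than this share, which is why the squeeze shows up only when the complex fills.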




CICS Region Saturated - 1

Symptoms: CPU time in OK range, but wait for

CPU dispatch is significant part of service time

Where to look?

CICS interval dispatcher statistics

RMF Workload Activity

What to look for?

CICS region with CPU utilization of 85% or

higher of a single CPU within a CEC

CICS QR TCB utilization > 85%

This problem is often missed because the symptoms are not tracked.

Remember to apply the capture ratio to the CPU time reported in the Workload Activity Report.

See next page for a sample of CICS Dispatcher statistics.
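A small sketch of the capture ratio adjustment, assuming the usual convention that reported CPU time divided by the capture ratio approximates the total CPU time consumed. The interval length, CPU seconds, and capture ratio below are hypothetical values chosen for illustration.

```python
def region_cpu_utilization(reported_cpu_seconds, interval_seconds, capture_ratio):
    """Estimate a region's utilization of one CPU, adjusted for uncaptured time."""
    if not 0 < capture_ratio <= 1:
        raise ValueError("capture_ratio must be greater than 0 and at most 1")
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return reported_cpu_seconds / capture_ratio / interval_seconds


# Hypothetical 15-minute interval: 690 CPU seconds reported, capture ratio 0.90.
raw = 690.0 / 900.0
utilization = region_cpu_utilization(690.0, 900.0, 0.90)
saturated = utilization >= 0.85  # the 85% of a single CPU guideline from the slide

print(f"raw {raw:.1%}, adjusted {utilization:.1%}, saturated: {saturated}")
```

In this made-up case the raw figure sits under the 85% guideline while the adjusted figure crosses it, which is how the problem gets missed.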




CICS Region Saturated - 2

DISPATCHER STATISTICS (Note: Columns 2-5 deleted to improve legibility)

TCB   . . .  MVS       Total Time       Total Time       Total CPU
Mode  . . .  Waits     in MVS wait      Dispatched       Time / TCB
QR    . . .  13051397  000-18:18:33.24  000-01:49:46.74  000-01:12:02.27
RO    . . .     48658  000-20:05:12.28  000-00:02:46.27  000-00:01:00.80
CO    . . .         0  000-00:00:00.00  000-00:00:00.00  000-00:00:00.00
SZ    . . .         0  000-00:00:00.00  000-00:00:00.00  000-00:00:00.00
RP    . . .         0  000-00:00:00.00  000-00:00:00.00  000-00:00:00.00
FO    . . .       800  000-19:00:52.61  000-00:00:44.05  000-00:00:06.50
SL    . . .         1  000-00:00:00.00  000-00:00:00.00  000-00:00:00.00
SO    . . .         2  000-00:00:00.00  000-00:00:00.00  000-00:00:00.00
S8    . . .         0  000-00:00:00.00  000-00:00:00.00  000-00:00:00.00
D2    . . .      2419  000-20:18:01.28  000-00:00:03.26  000-00:00:00.43
L8    . . .  16952578  007-03:07:31.31  000-05:36:18.48  000-01:13:35.37
H8    . . .         0  000-00:00:00.00  000-00:00:00.00  000-00:00:00.00
J8    . . .         0  000-00:00:00.00  000-00:00:00.00  000-00:00:00.00

A Total Time Dispatched much greater than Total CPU Time / TCB can be a sign of higher priority
work causing CICS delays. If this is the case, then DB2 might either be the cause of these delays,
or it may be degraded to the same degree or worse.
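The comparison can be automated with a few lines of code. The sketch below parses the ddd-hh:mm:ss.ss time format shown in the report and computes the ratio of dispatched time to CPU time for the QR and L8 rows; the function names are illustrative, and what counts as "much greater" remains a judgment call for your own workloads.

```python
def to_seconds(stamp):
    """Convert a statistics time of the form ddd-hh:mm:ss.ss to seconds."""
    days, clock = stamp.split("-")
    hours, minutes, seconds = clock.split(":")
    return int(days) * 86400 + int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def dispatch_to_cpu_ratio(dispatched, cpu):
    """Ratio of time dispatched to CPU time; None when no CPU time was used."""
    cpu_seconds = to_seconds(cpu)
    if cpu_seconds == 0:
        return None
    return to_seconds(dispatched) / cpu_seconds


# (Total Time Dispatched, Total CPU Time / TCB) copied from the sample report.
sample = {
    "QR": ("000-01:49:46.74", "000-01:12:02.27"),
    "L8": ("000-05:36:18.48", "000-01:13:35.37"),
}
ratios = {mode: dispatch_to_cpu_ratio(d, c) for mode, (d, c) in sample.items()}

for mode, ratio in ratios.items():
    print(f"{mode}: dispatched/CPU = {ratio:.2f}")
```

For the sample values this gives a ratio of roughly 1.5 for QR and roughly 4.6 for L8.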

Following is the list of TCBs used within CICS regions:

QR = Quasi-reentrant (CICS system & applications); CO = Concurrent (VSAM); FO = File Owning (VSAM); RO = Resource Owning; RP = ONC/RPC; SL = Sockets Listener; SO = Sockets; SZ = FEPI; J8 = JavaVM; L8 = Open (used by DB2 Version 6, or later, as of CICS TS V2.2); S8 = Secure Sockets Layer (SSL)




DPMODE = What?

Option gone with initially shipped CICS 2.2, now back
via maintenance. So what does this prove?

DPMODE=HIGH Best for high volumes with little DB2 use as long as ample CPU

capacity is available

DPMODE=EQUAL (is/was CICS 2.2 default!)

May(!) provide better performance for non-SQL transactions

DPMODE=LOW

Can provide more consistent service in some CPU constrained
situations; otherwise AVOID!

Important Note: All 3 will work OK if non-CPU constrained in

a multi-CP complex!

This is the most difficult one! You may need to figure out via experimentation
what works best for your particular workloads.

The CICS and DB2 bigots will come and say one thing versus the other.
None of these is the 100% answer for any one situation.

So clearly the one you should not be using is LOW. On the other hand, it is
interesting that I found that in some CPU constrained situations it's a roller coaster
ride with LOW, but that is not a situation that you should be in for your loved
workloads.




Buffer Pools - 1

[Chart: C2 Elapsed, Wait, and C2 CPU by day, Feb. 5 through Feb. 12; vertical axis from 0 to 1.2]

Increased BP3 from 2000 to 7000

ITR Increased from 1 TPS to more than 8 TPS

C2 elapsed decreased by more than 75%


Buffer Pools Always Matter! Tuning buffer pools is the single most productive area you can work
on because:

-You can make all the changes without dependencies on other staff; less politics is always a good thing.

-The facts you need can be reported from many sources: RMF, SMF, DB2 stats, traces &
accounting.

-The benefits are easy to track and demonstrate.




Buffer Pools - 2

Buffer pool tuning is not a set-and-forget set of options! Ongoing
tuning activities produce the best results!

Top seven steps for buffer pool tuning:

1. Isolate catalog in BP0
2. Isolate sort work
3. Isolate indexes from table spaces
4. Isolate good buffer pool candidates from bad ones
5. Isolate important work's data from the other work
6. Group objects with similar attributes in the same pool
7. Adjust buffer pool sizes to get the most benefit for the least storage used.

Much was said about this in IDUG's history, so we won't do it again here. It is important to know
that this is the most productive way to improve DB2 performance.

Also, buffer pool tuning is not a set-and-forget set of options. Due to changes in the system
hardware and software, various customization options in all DB2 related subsystems, and the
applications, it must be an ongoing activity.
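Steps 1 through 5 amount to a set of placement rules, which the following sketch expresses as code. Everything here is an assumption for illustration: the pool numbers other than BP0 are arbitrary, the flags `critical` and `poor_candidate` stand in for whatever your own analysis decides, and DSNDB07 is only the conventional work file database name in a non-data-sharing system.

```python
CATALOG_DATABASES = {"DSNDB01", "DSNDB06"}  # DB2 directory and catalog
SORT_WORK_DATABASES = {"DSNDB07"}           # conventional work file database name


def choose_pool(database, is_index, critical=False, poor_candidate=False):
    """Pick a buffer pool for one object, applying steps 1 to 5 in order."""
    if database in CATALOG_DATABASES:
        return "BP0"  # step 1: catalog and directory alone in BP0
    if database in SORT_WORK_DATABASES:
        return "BP7"  # step 2: sort work in its own pool (hypothetical number)
    if poor_candidate:
        return "BP6"  # step 4: objects that gain little from buffering
    if critical:
        # step 5: important work's data apart, still honoring step 3
        return "BP5" if is_index else "BP4"
    # step 3: indexes apart from table spaces
    return "BP2" if is_index else "BP1"


# Hypothetical objects: (database, is_index, critical, poor_candidate)
objects = [
    ("DSNDB06", False, False, False),
    ("DSNDB07", False, False, False),
    ("PAYROLL", True, False, False),
    ("ORDERS", False, True, False),
    ("HISTORY", False, False, True),
]
placement = [choose_pool(*obj) for obj in objects]
print(placement)
```

Steps 6 and 7 (grouping similar objects and sizing the pools) depend on measured access patterns and are not modeled here.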




High I/O Service Time

Symptoms: due to disagreement between two tools, much
staff time was wasted

ISV tool reported Sync IO time ~ 24 milliseconds (ms)

RMF Average IO time ~ 14 milliseconds

Where to look?

RMF DASD Activity

SMF type 42-6

DB2 accounting records

What to look for?

Basic I/O service time for any volume

I/O service time for DB2 objects being studied

Source of all the delays within I/O service time

Knowing where to look for basic information about disk I/O operations will ensure your success
with such tuning activities.

The SMF type 42 subtype 6 records are an excellent place to start such an investigation. See the sample on the next page.
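Finding the source of the delays within I/O service time comes down to splitting the response time into its components. The sketch below assumes the conventional RMF breakdown (response time is the sum of IOSQ, pending, disconnect, and connect time) and uses the millisecond values reported for volume H in the sample report; the function name is illustrative.

```python
COMPONENTS = ("iosq", "pend", "disc", "conn")


def io_profile(iosq, pend, disc, conn):
    """Sum the four response-time components (ms) and name the largest one."""
    parts = dict(zip(COMPONENTS, (iosq, pend, disc, conn)))
    total = sum(parts.values())
    dominant = max(parts, key=parts.get)
    share = parts[dominant] / total if total else 0.0
    return total, dominant, share


# Component times in milliseconds for volume H in the sample report.
total, dominant, share = io_profile(iosq=4.1, pend=0.5, disc=19.5, conn=2.8)
print(f"response {total:.1f} ms, mostly {dominant} ({share:.0%})")
```

Here disconnect time accounts for roughly 72% of the 26.9 ms response, so that is where the investigation of this volume would start.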




I/O Service Profile from SMF 42 & RMF 74

Sum of All Activity / CPU / File within the Sysplex

Total Serv./  % of Current  Cumul. % of
IO Intensity  Total         Current      VOL  DATA  VOL   FILE  VOL   ALL   RESP  CONN  DISC  PEND  IOSQ
(mins)        Service       Service      SER  NAME  RATE  UTL   UTL   UTL   MS    MS    MS    MS    MS
13.71         6.68%         6.68%        A    1     77.6  45.7  48.8  48.8  16.1  3.6   2.3   0.3   10
6.81          3.32%         10.00%       B    2     40.3  22.7  23.1  23.1  6.5   3.6   2     0.3   0.6
5.88          2.86%         12.86%       C    3     12.2  19.6  19.8  24.4  23.4  3.5   12.7  1     6.3
4.89          2.38%         15.24%       D    4     36.4  16.3  23    23    6.7   1.2   3.3   0.3   1.9
4.65          2.27%         17.51%       E    5     22.4  15.5  15.8  15.8  11.6  4     2.9   0.6   4.1
4.53          2.21%         19.72%       F    6     29.5  15.1  15.3  15.3  8.6   2.2   2.9   0.3   3.2
4.29          2.09%         21.81%       G    7     28    14.3  14.5  14.5  7.6   2.3   2.8   0.4   2
4.29          2.09%         23.90%       H    8     6.4   14.3  14.7  14.7  26.9  2.8   19.5  0.5   4.1
4.23          2.06%         25.96%       I    9     8.9   14.1  14.2  30.7  24.4  3.8   11.9  3.3   5.4
4.20          2.05%         28.00%       J    10    8.9   14    14.2  31.2  26.5  3.8   11.9  3.7   7
4.11          2.00%         30.01%       K    11    36.8  13.7  15.9  15.9  5.2   1.3   2.4   0.3   1.3

Physical IO Activity Details by Sysplex Member

VOL   VOL   PCT   RSP   IOQ   CON   DISC  PND   VOL   PCT   RSP   IOQ   CON   DISC  PND   VOL   PCT   RSP   IOQ   CON   DISC  PND
SER   RT2   BY2   TM2   TM2   TM2   TM2   TM2   RT4   BY4   TM4   TM4   TM4   TM4   TM4   RT5   BY5   TM5   TM5   TM5   TM5   TM5
1     78.7  48.8  16.5  9.9   3.6   2.6   0.4   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
2     0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   40.3  23.1  6.3   0.2   3.7   2.1   0.4
3     0.0   0.0   67.4  0.0   32.7  7.9   26.9  1.0   2.0   26.7  0.5   3.6   15.3  7.3   12.2  19.8  22.0  4.7   3.6   12.7  1.0
4     0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   72.5  23.0  5.6   2.1   1.3   1.9   0.4
5     22.4  15.8  10.9  3.1   4.0   3.1   0.7   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
6     29.5  15.3  8.3   2.7   2.2   3.0   0.3   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
7     28.0  14.5  7.4   1.9   2.3   2.8   0.4   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
8     6.6   14.7  26.6  3.7   3.1   19.2  0.6   0.0   0.0   0.7   0.0   0.4   0.0   0.2   0.0   0.0   0.6   0.0   0.4   0.0   0.3
9     0.0   0.0   0.0   0.0   0.0   0.0   0.0   5.8   8.4   22.7  2.8   3.8   10.6  5.5   8.9   14.2  23.6  4.4   3.9   12.0  3.4
10    0.0   0.0   0.0   0.0   0.0   0.0   0.0   5.7   8.6   27.2  5.6   3.9   11.2  6.5   8.9   14.2  24.3  4.6   3.9   12.0  3.7
11    61.3  15.9  4.0   1.1   1.1   1.5   0.4   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

Let's finish with the busiest sample report of them all. This one is produced by SAS code running
against the merged results from the SMF 42-6 and RMF 74 records from a five-member
SYSPLEX. All the object names and volume serials were changed to protect the source of this report.

The top section reports activity and performance for each object and by volume.

The bottom section merges the individual volume's performance view from each member of the
SYSPLEX.

Why bother with such analysis? The % OF CURRENT TOTAL SERVICE column will quickly guide
you to the most productive object for performance tuning. Saving a little on a very busy object can
yield great results for the entire system. The other details in this report help us pinpoint the source
of the I/O activity.
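The ranking itself is simple to reproduce outside SAS. The sketch below sorts objects by service minutes and adds percent and cumulative percent columns. It uses the service minutes of the first four objects in the sample, so the percentages are relative to those four only and will not match the report, which is relative to all objects. In the sample the service minutes appear to equal file utilization times a 30-minute interval (45.7% of 30 is 13.71), though the source does not state this.

```python
def rank_by_service(service_minutes):
    """Sort objects by service minutes; return (name, minutes, pct, cumulative pct)."""
    total = sum(service_minutes.values())
    if total <= 0:
        raise ValueError("total service time must be positive")
    ranked = sorted(service_minutes.items(), key=lambda item: item[1], reverse=True)
    rows, running = [], 0.0
    for name, minutes in ranked:
        running += minutes
        rows.append((name, minutes, 100 * minutes / total, 100 * running / total))
    return rows


# Service minutes for the first four objects in the sample, deliberately unsorted.
rows = rank_by_service({"3": 5.88, "1": 13.71, "4": 4.89, "2": 6.81})

for name, minutes, pct, cumulative in rows:
    print(f"object {name}: {minutes:5.2f} min  {pct:5.1f}%  cumulative {cumulative:5.1f}%")
```

The object at the top of this list is the one where a small saving yields the largest return.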



Speaker: Ivan Gelb 2003, Gelb Information

Systems Corp. (GIS)

E-mail: [email protected]

Phone: 732-303-1333

www.gelbis.com

Your questions and comments are

always welcome.

Session Title: Who Just Killed My DB2?!

Session: E8

TRADEMARKS

The following are trade or service marks of the IBM Corporation: CICS, CICS TS, CICSPlex,

DB2, IBM, MVS, OS/390, z/OS, Parallel Sysplex. Any omissions are purely unintended.

MOAD: MOTHER OF ALL DISCLAIMERS

All of the information in this document is tried and true. However, this fact alone cannot guarantee

that you can get the same results at your place and with your skills. In fact, some of this advice can

be hurtful if it is misused and misunderstood. As with all kinds of analysis, anything you may hear

or read can be understood and misunderstood in many ways that may seem contradictory to you. In

this regard, a further and associated contradictory element requires considerable systems analysis

and trade-off studies to arrive at the structural design, based on rigorous system engineering

concepts. By combining advice and certain experiences, any fully integrated performance test
program is weakly equivalent to any subsystem compatibility testing designed to eschew

obfuscation. Gelb Information Systems Corporation, Ivan Gelb and any one found anywhere

assume no responsibility for this information's accuracy, completeness or suitability for any
