why did my job run so long? - computer measurement group · – identify cpu use by area of source...

Why did my job run so long?
Speeding Performance by Understanding the Cause

John Baker, MVS Solutions

jbaker@mvssol.com

Agenda

• Where is my application spending its time?
  – CPU time, I/O time, wait (queue) times

• What am I waiting for?
  – Various flavors of queue time
  – What can/should I do about delays?

• Real world comparison
  – Stay tuned!

• Q/A

• Conclusions and wrap up

Distribution of Elapsed Time

Elapsed time = CPU time + I/O time + wait times

• CPU time = TCB + SRB

• I/O = IOSQ + PEND + CON + DISC

• Wait (queue) times
  – Initiator
  – Allocation (ENQ contention)
  – System services (HSM recall)
  – CPU delay
  – LPAR dispatch
  – …

Sample Job A

• Elapsed time over 4 hours

• CPU time almost 1 hour

• I/O time under 10 minutes

= Focus on CPU time

JOB RUNTM CPUTM IOTIME

JOBA 4:23:53 0:48:21 0:09:12
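The elapsed-time formula can be applied directly to the Job A figures. The sketch below is a minimal illustration, assuming CPU time and I/O time do not overlap, so that whatever remains of elapsed time is wait (queue) time; the helper names `to_seconds` and `breakdown` are hypothetical.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def breakdown(elapsed: str, cpu: str, io: str) -> dict:
    """Split elapsed time into CPU, I/O and residual wait.

    Assumption: CPU and I/O time do not overlap, so the residual
    (elapsed - CPU - I/O) is treated as wait/queue time.
    """
    total = to_seconds(elapsed)
    cpu_s = to_seconds(cpu)
    io_s = to_seconds(io)
    wait_s = total - cpu_s - io_s
    return {
        "cpu": cpu_s / total,
        "io": io_s / total,
        "wait": wait_s / total,
        "wait_seconds": wait_s,
    }


# Figures taken from the Sample Job A row (RUNTM, CPUTM, IOTIME).
job_a = breakdown("4:23:53", "0:48:21", "0:09:12")
for name in ("cpu", "io", "wait"):
    print(f"{name:>4}: {job_a[name]:6.1%}")
```

The residual is the part of the run that the queue-time sections of this deck try to explain.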

Reducing CPU time

• Recompile
  – Many improvements in OS updates

• Tune application
  – Application performance tools (e.g. Strobe, FreezeFrame)
  – Identify CPU use by area of source code
  – Make friends with your developer

Sample Job B

• Elapsed time under 3 hours

• CPU time 20 minutes

• I/O time over 1.5 hours

= Focus on I/O time

JOB RUNTM CPUTM IOTIME

JOBB 2:41:30 0:21:20 1:37:44

Reducing I/O time

• Identify patterns
  – Sequential vs random; read vs write

• Buffers
  – For VSAM consider NSR vs LSR
  – Give SORT memory, but not too much!

• Block size
  – System-determined generally works well, but check!
  – Half track for sequential; no smaller than 2K for random

• Compression
  – zEDC sounds promising!

• Include Storage Subsystem in your capacity planning

Wait/Queue time

Time vs Utilization

[Figure: response/elapsed time plotted against percent utilization (0% to 100%), with the execution portion marked]

High Utilization = High wait time

• Elapsed time grows nonlinearly with utilization, rising steeply as the system approaches 100% busy

• Increasing priority doesn’t make the CPU any faster

• At high utilization levels, wait time is much greater than service time
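The shape of the curve can be illustrated with a simple queueing model. The sketch below assumes a single-server M/M/1 queue, which is a textbook simplification and not something the deck states; under that assumption mean response time is service time divided by (1 − utilization). The service time of 1.0 is a hypothetical unit value.

```python
def response_time(service_time: float, utilization: float) -> float:
    """Mean response time of an M/M/1 queue: S / (1 - U).

    Assumption: single server, random arrivals and service times.
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


service = 1.0  # hypothetical service time, arbitrary units
for pct in (50, 80, 90, 95):
    total = response_time(service, pct / 100)
    wait = total - service  # time spent queued rather than being served
    print(f"{pct:3d}% busy: response {total:5.1f}, wait {wait:5.1f}")
```

Under this model, wait time equals service time at 50% busy and is several times larger beyond 80% busy, regardless of how priorities are assigned.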

Flavors of Queue (wait) Time

• Wait for “server”: initiator / CICS AOR / IMS MPR

• CPU delay (wait for logical CPU)

• I/O delay (IOSQ, pend, disconnect)

• Capping delay (LPAR capped vs actual delay)

• Resource Group maximum enforced

• Wait for LPAR (logical CPU) to be dispatched
  – PR/SM weight
  – Demand from other LPARs
  – CPC/CEC capacity

Initiator Queue

• SMF: R723CQDT

• TOTAL queue time (divide for average per job)

• Just start more inits?

• Not necessarily a good idea

• “Tuning to reduce the number of simultaneously active address spaces to the proper number needed to support a workload can reduce RNI and improve performance”

https://www-304.ibm.com/servers/resourcelink/lib03060.nsf/pages/lsprwork?OpenDocument&pathID=

Automated Initiators: Less is More

[Figure: Benchmark: TM vs WLM concurrent jobs, plotted over time (hours, 0 to 9); annotation: TM jobs ahead]

• Concurrency based on performance and utilization

CPU Delay

• Wait for logical CPU

• SMF: R723CCDE

• Work is ready to run but is delayed waiting for access to a CPU

• Related to Service Class / goal / importance
  – Dispatching priority

• There is almost always some CPU delay
  – Tolerance is subjective
  – Are goals/SLAs being met?

• Priorities are relative: overloading leads to thrashing

• Consider discretionary for MTTW

Utilization vs CPU Delay

[Figure: utilization vs CPU delays]

• 50% delay may be acceptable

• At 100% busy, throughput degrades significantly

I/O Delay

• IOSQ:
  – HyperPAV

• Pend:
  – CMR = overloaded controller
  – DB = volume contention (reserve?)
  – Any remaining = likely channels

• Disconnect:
  – Random read misses
  – Synchronous remote copy

Revisit: sample Job B

• Disconnect time of 0:40:31 is about 41% of the total I/O time of 1:37:44

• 40:31 (2,431 seconds) divided by 9,239,646 I/Os…

• = 0.263 ms average disconnect time

• Likely not unreasonable for random reads (consider SSD)

• Could also be replicated writes

• Become familiar with your typical application response times

JOB RUNTM CPUTM IOTIME SMF30AID SMF30AIW EXCPS

JOBB 2:41:30 0:21:20 1:37:44 0:40:31 0:04:40 9239646
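The arithmetic behind the Job B disconnect figures can be checked with a few lines. This is a minimal sketch using only the values in the table above; the helper `to_seconds` is hypothetical, and the SMF30AID column is taken to be the disconnect time, as the bullets above describe it.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


disconnect_s = to_seconds("0:40:31")  # SMF30AID column for JOBB
io_total_s = to_seconds("1:37:44")    # IOTIME column for JOBB
excps = 9_239_646                     # EXCPS column for JOBB

share_of_io = disconnect_s / io_total_s           # fraction of I/O time
avg_disconnect_ms = disconnect_s / excps * 1000   # milliseconds per I/O

print(f"disconnect share of I/O time: {share_of_io:.0%}")
print(f"average disconnect per I/O:   {avg_disconnect_ms:.3f} ms")
```

Tracking the same per-I/O average over time is one way to become familiar with what is typical for an application.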

Capping Delay

• Possible when caps present

• SMF70NSW
  – WLM caps the logical CPUs
  – Delays LPAR dispatch

• SMF70NCA
  – Work is actually delayed for CPU due to capping

• Consider TM automation

MSU Demand vs R4HA

[Figure: MSU demand for LPAR_A, LPAR_B and LPAR_C plotted against CPCR4HA]
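The R4HA is a rolling four-hour average of MSU consumption, so it lags behind instantaneous demand. The sketch below is a minimal illustration assuming 5-minute samples (48 per four hours); the sample values and the function name `rolling_average` are hypothetical.

```python
from collections import deque


def rolling_average(samples, window):
    """Yield the mean of the most recent `window` samples at each step.

    Until `window` samples have been seen, the mean covers only the
    samples available so far.
    """
    recent = deque(maxlen=window)
    running_total = 0.0
    for value in samples:
        if len(recent) == window:
            running_total -= recent[0]  # oldest value is about to drop out
        recent.append(value)
        running_total += value
        yield running_total / len(recent)


# Hypothetical 5-minute MSU samples: four quiet hours, then one busy hour.
msu = [100] * 48 + [400] * 12
r4ha = list(rolling_average(msu, window=48))
print(f"instantaneous demand: {msu[-1]} MSU, R4HA: {r4ha[-1]:.0f} MSU")
```

After an hour of high demand the average has moved only part of the way toward the new level, which is why demand can exceed a cap for a while before capping begins.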

Capping vs Delay

LPAR is capped (SMF70NSW)

Work is delayed (SMF70NCA)

Capping can impact all Workloads

[Figure: velocity of STC_H (Importance 1) over time; annotations: Machine CPU reaches 100%; capping begins]

Batch is not the only workload suffering. Even the most critical workloads are unable to meet their goal
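WLM execution velocity is the share of sampled states in which work was using a resource rather than being delayed: using / (using + delay) × 100. The sketch below shows how added delay pulls velocity down even when the work itself is unchanged; the sample counts and the function name are hypothetical.

```python
def execution_velocity(using_samples: int, delay_samples: int) -> float:
    """Execution velocity: using / (using + delay) * 100."""
    total = using_samples + delay_samples
    if total == 0:
        return 0.0  # no samples, so nothing to report
    return 100.0 * using_samples / total


# Hypothetical sample counts for one service class.
before_capping = execution_velocity(using_samples=60, delay_samples=40)
during_capping = execution_velocity(using_samples=60, delay_samples=240)
print(f"before capping: {before_capping:.0f}")
print(f"during capping: {during_capping:.0f}")
```

A service class with a velocity goal of, say, 50 would meet it in the first case and miss it in the second.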

Resource Group (max)

• Also a form of capping (same WLM algorithms)

• Pro: Useful to control “problem” applications

• Con: Static. Not flexible

• R723CCCA
  – Resource Group maximum enforced
  – Will override Service Class goals

LPAR Dispatch Delay

• Ratio of logical processor busy to physical processor busy

• Not always obvious, but very common!

• Term “Short CPs” introduced by Kathy Walsh (IBM WSC)
  – SHARE, Aug. 2004
  – https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS1077

• MXG: PLCPRDYQ

• Improved with HiperDispatch and IRD

LPAR Dispatch Delay

• More initiators and/or higher dispatching priority will not resolve this problem

LPAR Dispatch Delay

[Figure: CPUBSY vs MVSBSY, showing a 40% delay]
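One hedged way to put a number on the CPUBSY/MVSBSY gap is to treat it as the share of MVS busy time that was not backed by physical dispatch. The deck does not state the exact formula, so both the calculation and the input values below are assumptions for illustration; the function name is hypothetical.

```python
def dispatch_delay_pct(mvs_busy_pct: float, cpu_busy_pct: float) -> float:
    """Share of MVS busy time not covered by physical CPU busy time.

    Assumption: MVS busy is the logical-CP view (z/OS had work to run)
    and CPU busy is the physical view (the LPAR was actually dispatched).
    """
    if mvs_busy_pct <= 0:
        return 0.0
    gap = max(mvs_busy_pct - cpu_busy_pct, 0.0)
    return 100.0 * gap / mvs_busy_pct


# Hypothetical interval: z/OS wanted the CPs 100% of the time,
# but the LPAR was physically dispatched only 60% of the time.
print(f"{dispatch_delay_pct(mvs_busy_pct=100.0, cpu_busy_pct=60.0):.0f}% delay")
```

A gap like this comes from PR/SM and the demands of other LPARs, which is why adding initiators or raising dispatching priority inside the LPAR does not remove it.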

What does this look like in the real world?

Let’s take a trip to the deli

How many in the store at one time?

INITIATORS

Who’s next in line?

Dispatching Priority (Service Class)

How long til I can give my order?

Logical Processor (CP) Busy

How long til I’m done?

Physical Processor (CP) Busy

Forcing one step only stresses the next

Balance

About MVS Solutions

• MVS Solutions Inc.
  – Installed in over 200 datacenters worldwide
  – IBM Partner in Development

• ThruPut Manager
  – Automated Workload Balancing
  – Automated Batch Prioritization
  – Automated Capacity Management

Contact me: jbaker@mvssol.com

systemz@cmg.org

Join our Blog at www.thruputmanager.com
