![Page 1: Why did my job run so long? - Computer Measurement Group · – Identify CPU use by area of source code ... • Likely not unreasonable for random reads (consider SSD) • Could also](https://reader033.vdocuments.mx/reader033/viewer/2022041514/5e2a02991231e76be86e28bd/html5/thumbnails/1.jpg)
Why did my job run so long?
Speeding Performance by Understanding the Cause
John Baker
MVS Solutions
Agenda
• Where is my application spending its time?
  – CPU time, I/O time, wait (queue) times
• What am I waiting for?
  – Various flavors of queue time
  – What can/should I do about delays?
• Real world comparison
  – Stay tuned!
• Q/A
• Conclusions and wrap up
Distribution of Elapsed Time
Elapsed time = CPU time + I/O time + wait times
• CPU time = TCB + SRB
• I/O = IOSQ + PEND + CON + DISC
• Wait (queue) times
  – Initiator
  – Allocation (ENQ contention)
  – System services (HSM recall)
  – CPU Delay
  – LPAR dispatch
  – …
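The decomposition above can be turned around to estimate wait time as the part of elapsed time that CPU and I/O do not explain. A minimal sketch, assuming the three terms do not overlap and that times arrive as h:mm:ss strings; the function names and the sample job are hypothetical:

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string into whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total_seconds: int) -> str:
    """Convert whole seconds back into an h:mm:ss string."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def residual_wait(elapsed: str, cpu: str, io: str) -> str:
    """Wait (queue) time = elapsed time - CPU time - I/O time."""
    return to_hms(to_seconds(elapsed) - to_seconds(cpu) - to_seconds(io))


# Hypothetical job: 2 hours elapsed, 30 minutes CPU, 45 minutes I/O.
print(residual_wait("2:00:00", "0:30:00", "0:45:00"))  # prints 0:45:00
```

The residual says how much time was spent waiting, not which queue caused it; the individual delay measurements are still needed for that.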
Sample Job A
• Elapsed time over 4 hours
• CPU time almost 1 hour
• I/O time under 10 minutes
= Focus on CPU time
JOB   RUNTM    CPUTM    IOTIME
JOBA  4:23:53  0:48:21  0:09:12
Reducing CPU time
• Recompile
  – Many improvements in OS updates
• Tune application
  – Application performance tools (e.g. Strobe, FreezeFrame)
  – Identify CPU use by area of source code
  – Make friends with your developer
Sample Job B
• Elapsed time under 3 hours
• CPU time 20 minutes
• I/O time over 1.5 hours
= Focus on I/O time
JOB   RUNTM    CPUTM    IOTIME
JOBB  2:41:30  0:21:20  1:37:44
Reducing I/O time
• Identify patterns
  – Sequential vs random; read vs write
• Buffers
  – For VSAM consider NSR vs LSR
  – Give SORT memory – but not too much!
• Block size
  – System-determined generally works well – but check!
  – Half track for sequential; no smaller than 2K for random
• Compression
  – zEDC sounds promising!
• Include Storage Subsystem in your capacity planning
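As a check on the half-track guideline, the block size for a fixed-block sequential data set can be worked out by hand. A minimal sketch, assuming 3390 track geometry (where the largest block that fits twice on a track is 27,998 bytes) and RECFM=FB; the function name is hypothetical:

```python
HALF_TRACK_3390 = 27998  # assumption: largest block that fits twice on a 3390 track


def half_track_blksize(lrecl: int, limit: int = HALF_TRACK_3390) -> int:
    """Largest multiple of LRECL that does not exceed the half-track limit."""
    if lrecl <= 0 or lrecl > limit:
        raise ValueError("LRECL must be between 1 and the half-track limit")
    return (limit // lrecl) * lrecl


print(half_track_blksize(80))   # prints 27920
print(half_track_blksize(133))  # prints 27930
```

Comparing a data set's actual BLKSIZE with this value is one quick way to do the "but check!" step.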
Wait/Queue time
[Figure: Time vs Utilization. Response/elapsed time (y-axis, 0 to 30) against percent utilization (x-axis, 0% to 100%), with series Wait and Execution.]
High Utilization = High wait time
• Elapsed time rises steeply and nonlinearly as utilization approaches 100%
• Increasing priority doesn’t make the CPU any faster
• At high utilization levels, wait time is much greater than service time
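The shape of this curve can be reproduced with a textbook single-server queue. The slide does not name a model, so the M/M/1 formula R = S / (1 - U) used here is an assumption chosen for illustration, with service time normalized to 1:

```python
def response_time(service_time: float, utilization: float) -> float:
    """Single-server (M/M/1) approximation: R = S / (1 - U)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1 - utilization)


for util in (0.50, 0.80, 0.90, 0.95):
    total = response_time(1.0, util)
    wait = total - 1.0  # everything beyond the service time is queueing
    print(f"{util:.0%} busy: response {total:.1f}x service time, wait {wait:.1f}x")
```

Under this model the wait equals the service time at 50% busy and is about nine times the service time at 90% busy. Priority changes who waits, not how much waiting there is in total.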
Flavors of Queue (wait) Time
• Wait for “server”: initiator / CICS AOR / IMS MPR
• CPU delay (wait for logical CPU)
• I/O delay (iosq, pend, disconnect)
• Capping delay (LPAR capped vs actual delay)
• Resource Group maximum enforced
• Wait for LPAR (logical CPU) to be dispatched
  – PR/SM weight
  – Demand from other LPARs
  – CPC/CEC capacity
Initiator Queue
• SMF: R723CQDT
• TOTAL queue time (divide for average per job)
• Just start more inits?
• Not necessarily a good idea
• “Tuning to reduce the number of simultaneously active address spaces to the proper number needed to support a workload can reduce RNI and improve performance”
https://www-304.ibm.com/servers/resourcelink/lib03060.nsf/pages/lsprwork?OpenDocument&pathID=
Automated Initiators: Less is More
[Figure: Benchmark: TM vs WLM concurrent jobs. Active jobs (y-axis, 0 to 350) against time in hours (x-axis, 0 to 9), one series each for TM and WLM; annotation: "TM jobs ahead".]
• Concurrency based on performance and utilization
CPU Delay
• Wait for logical CPU
• SMF: R723CCDE
• Work is ready to run but is delayed access to CPU
• Related to Service Class / goal / importance
  – Dispatching priority
• There is almost always some CPU delay
  – Tolerance is subjective
  – Are goals/SLAs being met?
• Priorities are relative – overloading leads to thrashing
• Consider discretionary for MTTW
[Figure: Utilization vs CPU Delay. Y-axis 0 to 100; series HIDLY, MDDLY, LODLY, and UTIL.]
• 50% delay may be acceptable
• At 100% busy, throughput degrades significantly
I/O Delay
• IOSQ
  – HyperPAV
• Pend
  – CMR = overloaded controller
  – DB = volume contention (reserve?)
  – Any remaining = likely channels
• Disconnect
  – Random read misses
  – Synchronous remote copy
Revisit: sample Job B
• Disconnect time of 0:40:31 = about 41% of total I/O time of 1:37:44
• 0:40:31 (2431 seconds) divided by 9239646 I/Os…
• = 0.263 ms average disconnect time
• Likely not unreasonable for random reads (consider SSD)
• Could also be replicated writes
• Become familiar with your typical application response times
JOB   RUNTM    CPUTM    IOTIME   SMF30AID  SMF30AIW  EXCPS
JOBB  2:41:30  0:21:20  1:37:44  0:40:31   0:04:40   9239646
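The arithmetic on this slide can be reproduced directly from the table values. A short sketch, with a hypothetical helper for the h:mm:ss conversion:

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string into whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


disconnect = to_seconds("0:40:31")  # SMF30AID column
io_total = to_seconds("1:37:44")    # IOTIME column
excps = 9_239_646                   # EXCPS column

avg_disconnect_ms = disconnect / excps * 1000
share_of_io = disconnect / io_total

print(f"{avg_disconnect_ms:.3f} ms per I/O")  # prints 0.263 ms per I/O
print(f"{share_of_io:.1%} of I/O time")       # prints 41.5% of I/O time
```

This is an average over every I/O the job issued, so a small number of slow random reads or replicated writes can hide behind a large number of fast cache hits.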
Capping Delay
• Possible when caps present
• SMF70NSW
  – WLM caps the logical CPUs
  – Delays LPAR dispatch
• SMF70NCA
  – Work is actually delayed for CPU due to capping
• Consider TM automation
Capping vs Delay
[Figure: MSU Demand vs R4HA. MSU/HR (y-axis, 0 to 1200) against time (x-axis, 0:00:00 to 8:00:00 in 15-minute steps); series LPAR_C, LPAR_B, LPAR_A, CPCR4HA, and CAP.]
• LPAR is capped (SMF70NSW)
• Work is delayed (SMF70NCA)
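The relationship between demand, the rolling four-hour average (R4HA), and the cap can be sketched as follows. This is a simplified illustration, not WLM's actual algorithm: the function names and sample data are hypothetical, and the average is taken over however many samples exist until a full four-hour window has accumulated.

```python
from collections import deque


def rolling_4h_average(msu_samples, interval_minutes=5):
    """Yield the rolling four-hour average (R4HA) after each MSU sample."""
    window = deque(maxlen=(4 * 60) // interval_minutes)
    for msu in msu_samples:
        window.append(msu)
        yield sum(window) / len(window)


def capped_intervals(msu_samples, cap, interval_minutes=5):
    """Return the sample indexes at which the R4HA exceeds the defined cap."""
    averages = rolling_4h_average(msu_samples, interval_minutes)
    return [i for i, avg in enumerate(averages) if avg > cap]


# Hypothetical hourly samples: demand spikes for three hours, then falls back.
demand = [400, 400, 900, 900, 900, 400, 400, 400]
print(capped_intervals(demand, cap=700, interval_minutes=60))  # prints [4, 5]
```

In this example the average is still above the cap at index 5, after demand has already dropped to 400, which shows how capping lags behind the workload that triggered it.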
Capping can impact all Workloads
[Figure: Velocity vs goal for STC_H (Importance 1). Y-axis 0 to 100; series Goal and Velocity; annotations: "Machine CPU reaches 100%" and "Capping begins".]
Batch is not the only workload suffering. Even the most critical workloads are unable to meet their goal.
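For reference, the velocity plotted against the goal is the share of sampled states in which work was using a resource rather than being delayed for one. A minimal sketch of that calculation; the function name and sample counts are hypothetical:

```python
def execution_velocity(using_samples: int, delay_samples: int) -> float:
    """Velocity = using / (using + delay) * 100."""
    total = using_samples + delay_samples
    if total == 0:
        raise ValueError("no using or delay samples")
    return 100.0 * using_samples / total


print(execution_velocity(50, 50))  # prints 50.0
print(execution_velocity(20, 80))  # prints 20.0
```

Once capping adds delay samples, velocity falls even if the work itself has not changed, which is why an Importance 1 service class can miss its goal.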
Resource Group (max)
• Also a form of capping (same WLM algorithms)
• Pro: Useful to control “problem” applications
• Con: Static. Not flexible
• R723CCCA
  – Resource Group maximum enforced
  – Will override Service Class goals
LPAR Dispatch Delay
• Ratio of logical processor busy to physical processor busy
• Not always as obvious but very common!
• Term “Short CPs” introduced by Kathy Walsh (IBM WSC)
  – SHARE, Aug. 2004
  – https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS1077
• MXG: PLCPRDYQ
• Improved with HiperDispatch and IRD
LPAR Dispatch Delay
• More initiators and/or higher dispatching priority will not resolve this problem
[Figure: LPAR Dispatch Delay. Y-axis 0 to 100; series INITS, CPUBSY, and MVSBSY; annotation: "40% delay".]
What does this look like in the real world?
Let’s take a trip to the deli
• How many in the store at one time?
  – Initiators
• Who’s next in line?
  – Dispatching Priority (Service Class)
• How long til I can give my order?
  – Logical Processor (CP) Busy
• How long til I’m done?
  – Physical Processor (CP) Busy
• Forcing one step only stresses the next
• Balance
About MVS Solutions
• MVS Solutions Inc.
  – Installed in over 200 datacenters worldwide
  – IBM Partner in Development
• ThruPut Manager
  – Automated Workload Balancing
  – Automated Batch Prioritization
  – Automated Capacity Management
Contact me: [email protected]
Join our Blog at www.thruputmanager.com