Why did my job run so long? - Computer Measurement Group


Why did my job run so long?
Speeding Performance by Understanding the Cause

John Baker
MVS Solutions

[email protected]


Agenda

• Where is my application spending its time?
  – CPU time, I/O time, wait (queue) times

• What am I waiting for?
  – Various flavors of queue time
  – What can/should I do about delays?

• Real world comparison
  – Stay tuned!

• Q/A

• Conclusions and wrap up


Distribution of Elapsed Time

Elapsed time = CPU time + I/O time + wait times


• CPU time = TCB + SRB

• I/O = IOSQ + PEND + CON + DISC

• Wait (queue) times
  – Initiator
  – Allocation (ENQ contention)
  – System services (HSM recall)
  – CPU Delay
  – LPAR dispatch
  – …


Sample Job A

• Elapsed time over 4 hours

• CPU time almost 1 hour

• I/O time under 10 minutes

= Focus on CPU time

JOB   RUNTM    CPUTM    IOTIME
JOBA  4:23:53  0:48:21  0:09:12
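The split for Job A can be checked with a few lines of Python. This is only a sketch: it applies the deck's formula (elapsed = CPU + I/O + wait) literally and treats everything that is neither CPU nor I/O time as wait, which is a simplifying assumption. The helper names `to_seconds` and `breakdown` are made up for illustration.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string, as shown in the job table, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def breakdown(elapsed: str, cpu: str, io: str) -> dict:
    """Split elapsed time into CPU, I/O, and a remainder treated as wait."""
    total = to_seconds(elapsed)
    cpu_s = to_seconds(cpu)
    io_s = to_seconds(io)
    # Assumption: whatever is not CPU or I/O time is wait (queue) time.
    wait_s = total - cpu_s - io_s
    return {
        "cpu_pct": round(100 * cpu_s / total, 1),
        "io_pct": round(100 * io_s / total, 1),
        "wait_pct": round(100 * wait_s / total, 1),
        "wait_seconds": wait_s,
    }


# Figures taken from the JOBA row above.
job_a = breakdown("4:23:53", "0:48:21", "0:09:12")
print(job_a)
```

Of the two measured components, CPU time is roughly five times the I/O time for this job, which is why CPU is the place to start.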


Reducing CPU time

• Recompile
  – Many improvements in OS updates

• Tune application
  – Application Performance Tools (e.g. Strobe, FreezeFrame)
  – Identify CPU use by area of source code
  – Make friends with your developer


Sample Job B

• Elapsed time under 3 hours

• CPU time 20 minutes

• I/O time over 1.5 hours

= Focus on I/O time

JOB   RUNTM    CPUTM    IOTIME
JOBB  2:41:30  0:21:20  1:37:44


Reducing I/O time

• Identify patterns
  – Sequential vs random; read vs write

• Buffers
  – For VSAM consider NSR vs LSR
  – Give SORT memory – but not too much!

• Block size
  – System-determined generally works well – but check!
  – Half track for sequential; no smaller than 2K for random

• Compression
  – zEDC sounds promising!

• Include Storage Subsystem in your capacity planning
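The half-track guideline for sequential data can be turned into a quick check. The sketch below assumes 3390 geometry, where 27,998 bytes is the largest block that fits twice on a track, and fixed-length blocked records. The function name `half_track_blksize` is hypothetical, not a system service.

```python
# Assumption: 3390 device geometry; 27,998 bytes is the half-track limit.
HALF_TRACK_3390 = 27998


def half_track_blksize(lrecl: int, limit: int = HALF_TRACK_3390) -> int:
    """Largest multiple of lrecl that does not exceed the half-track limit."""
    if lrecl <= 0 or lrecl > limit:
        raise ValueError("lrecl must be between 1 and the half-track limit")
    return (limit // lrecl) * lrecl


for lrecl in (80, 133):
    print(lrecl, half_track_blksize(lrecl))
```

Comparing the result with the block size actually in use is one way to do the "but check!" step for system-determined block sizes.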


Wait/Queue time


Figure: Time vs Utilization. X-axis: percent utilization (0% to 100%); y-axis: response/elapsed time (0 to 30); series: Wait, Execution. Annotation: High utilization = high wait time.

Elapsed time grows steeply and nonlinearly with utilization. Increasing priority doesn’t make the CPU any faster.

• At high utilization levels, wait time is much greater than service time
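The shape of that curve can be reproduced with the textbook single-server queueing formula, response time = service time / (1 - utilization). Using this M/M/1 model is an assumption made here for illustration; the slides do not say which model or measurements are behind the chart.

```python
def response_time(service_time: float, utilization: float) -> float:
    """M/M/1 response time: service time stretched by 1 / (1 - utilization)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


# Service time is normalized to 1, so results read as multiples of it.
for busy in (0.50, 0.80, 0.90, 0.95):
    total = response_time(1.0, busy)
    wait = total - 1.0
    print(f"{busy:.0%} busy: response {total:.1f}x service, wait {wait:.1f}x service")
```

Under this model, wait equals service time at 50% busy and is about nineteen times service time at 95% busy.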


Flavors of Queue (wait) Time

• Wait for “server”: initiator / CICS AOR / IMS MPR

• CPU delay (wait for logical CPU)

• I/O delay (iosq, pend, disconnect)

• Capping delay (LPAR capped vs actual delay)

• Resource Group maximum enforced

• Wait for LPAR (logical CPU) to be dispatched
  – PR/SM weight
  – Demand from other LPARs
  – CPC/CEC capacity


Initiator Queue

• SMF: R723CQDT

• TOTAL queue time (divide for average per job)

• Just start more inits?

• Not necessarily a good idea

• “Tuning to reduce the number of simultaneously active address spaces to the proper number needed to support a workload can reduce RNI and improve performance”

Source: https://www-304.ibm.com/servers/resourcelink/lib03060.nsf/pages/lsprwork?OpenDocument&pathID=
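Because R723CQDT is a total, it has to be divided by the number of jobs before it says anything about an individual job. A minimal sketch follows; the input values are hypothetical, and the queue time is assumed to have been converted to seconds already.

```python
def average_queue_seconds(total_queue_seconds: float, jobs_ended: int) -> float:
    """Average initiator queue time per job for one reporting interval."""
    if jobs_ended <= 0:
        return 0.0  # no jobs ended, so there is nothing to average
    return total_queue_seconds / jobs_ended


# Hypothetical interval: 5,400 seconds of total queue time across 120 jobs.
print(average_queue_seconds(5400.0, 120))
```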


Automated Initiators: Less is More

Figure: Benchmark, TM vs WLM concurrent jobs. X-axis: time (hours), 0 to 9; y-axis: active jobs, 0 to 350; series: TM, WLM. Annotation: TM jobs ahead.

• Concurrency based on performance and utilization


CPU Delay

• Wait for logical CPU

• SMF: R723CCDE

• Work is ready to run but is delayed waiting for access to a CPU

• Related to Service Class / goal / importance
  – Dispatching priority

• There is almost always some CPU delay
  – Tolerance is subjective
  – Are goals/SLAs being met?

• Priorities are relative – overloading leads to thrashing

• Consider discretionary for MTTW
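For service classes with a velocity goal, delay samples such as the CPU delay count feed into execution velocity, which WLM defines as using samples divided by using plus delay samples. The sketch below uses hypothetical sample counts; the parameters stand for the total using and total delay samples of a service class period.

```python
def execution_velocity(using_samples: int, delay_samples: int) -> float:
    """Execution velocity in percent: using / (using + delay) * 100."""
    total = using_samples + delay_samples
    if total == 0:
        return 0.0  # no samples collected in the interval
    return 100.0 * using_samples / total


# Hypothetical counts: equal using and delay samples give a velocity of 50.
print(execution_velocity(300, 300))
# Three times as many delay samples as using samples give a velocity of 25.
print(execution_velocity(300, 900))
```

Comparing the result with the goal is one way to answer whether goals are being met.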


Utilization vs CPU delays

Figure: Utilization vs CPU Delay. Scale: 0 to 100; series: HIDLY, MDDLY, LODLY, UTIL. Annotations: 50% delay may be acceptable. At 100% busy, throughput degrades significantly.


I/O Delay

• IOSQ:
  – HyperPAV

• Pend:
  – CMR = overloaded controller
  – DB = volume contention (reserve?)
  – Any remaining = likely channels

• Disconnect
  – Random read misses
  – Synchronous remote copy


Revisit: sample Job B

• Disconnect time of 0:40:31 is about 41% of the total I/O time of 1:37:44

• 40:31 (2,431 seconds) divided by 9,239,646 I/Os…

• = 0.263 ms average disconnect time

• Likely not unreasonable for random reads (consider SSD)

• Could also be replicated writes

• Become familiar with your typical application response times

JOB   RUNTM    CPUTM    IOTIME   SMF30AID  SMF30AIW  EXCPS
JOBB  2:41:30  0:21:20  1:37:44  0:40:31   0:04:40   9239646
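The disconnect arithmetic for Job B can be reproduced directly from the table row. As on the slide, this sketch treats the EXCP count as the number of I/Os, which is a simplification. The helper name `to_seconds` is made up for illustration.

```python
def to_seconds(hms: str) -> int:
    """Convert an H:MM:SS string, as shown in the job table, to seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


io_total = to_seconds("1:37:44")    # IOTIME column
disconnect = to_seconds("0:40:31")  # SMF30AID column
excps = 9_239_646                   # EXCPS column, used as the I/O count

disconnect_share_pct = 100 * disconnect / io_total
avg_disconnect_ms = 1000 * disconnect / excps

print(f"disconnect share of I/O time: {disconnect_share_pct:.1f}%")
print(f"average disconnect per I/O:   {avg_disconnect_ms:.3f} ms")
```

Tracking the per-I/O average over time, instead of the raw total, makes it easier to notice when an application's typical response time shifts.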


Capping Delay

• Possible when caps present

• SMF70NSW
  – WLM caps the logical CPUs
  – Delays LPAR dispatch

• SMF70NCA
  – Work is actually delayed for CPU due to capping

• Consider TM automation


Capping vs Delay

Figure: MSU Demand vs R4HA. X-axis: time of day, 0:00 to 8:00 in 15-minute steps; y-axis: MSU/HR, 0 to 1200; series: LPAR_C, LPAR_B, LPAR_A, CPCR4HA, CAP. Annotations: LPAR is capped (SMF70NSW); work is delayed (SMF70NCA).


Capping can impact all Workloads

Figure: Velocity for STC_H (Importance 1). Scale: 0 to 100; series: Goal, Velocity. Annotations: Machine CPU reaches 100%, capping begins. Batch is not the only workload suffering. Even the most critical workloads are unable to meet their goal.


Resource Group (max)

• Also a form of capping (same WLM algorithms)

• Pro: Useful to control “problem” applications

• Con: Static. Not flexible

• R723CCCA
  – Resource Group maximum enforced
  – Will override Service Class goals


LPAR Dispatch Delay

• Ratio of logical processor busy to physical processor busy

• Not always as obvious but very common!

• Term “Short CPs” introduced by Kathy Walsh (IBM WSC)
  – SHARE, Aug. 2004
  – https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS1077

• MXG: PLCPRDYQ

• Improved with HiperDispatch and IRD


LPAR Dispatch Delay

• More initiators and/or higher dispatching priority will not resolve this problem

Figure: LPAR Dispatch Delay. Scale: 0 to 100; series: INITS, CPUBSY, MVSBSY. Annotation: 40% delay.
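One way to put a number on dispatch delay is to compare how busy z/OS believes its logical CPs are (MVS busy) with how much of that time PR/SM actually dispatched them (LPAR busy). The sketch below is an assumption about how such a figure could be derived. The slides do not state how the 40% on the chart was calculated, and the input values here are hypothetical.

```python
def dispatch_delay_pct(mvs_busy: float, lpar_busy: float) -> float:
    """Share of the time z/OS wanted a CPU when the logical CP was not dispatched."""
    if mvs_busy <= 0:
        return 0.0  # no demand, so no dispatch delay
    gap = max(mvs_busy - lpar_busy, 0.0)
    return 100.0 * gap / mvs_busy


# Hypothetical interval: z/OS sees 100% busy, PR/SM dispatched 60%.
print(dispatch_delay_pct(100.0, 60.0))
```

Neither input changes when initiators are added or dispatching priority is raised, which is consistent with the point that those actions do not resolve this delay.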


What does this look like in the real world?


Let’s take a trip to the deli


How many in the store at one time?


INITIATORS


Who’s next in line?


Dispatching Priority (Service Class)


How long til I can give my order?


Logical Processor (CP) Busy


How long til I’m done!


Physical Processor (CP) Busy


Forcing one step only stresses the next


Balance


About MVS Solutions

• MVS Solutions Inc.
  – Installed in over 200 datacenters worldwide
  – IBM Partner in Development

• ThruPut Manager
  – Automated Workload Balancing
  – Automated Batch Prioritization
  – Automated Capacity Management


Contact me: [email protected]

[email protected]

Join our Blog at www.thruputmanager.com
