top ten ase mda tables (20097) - mark gearhart

Top Ten ASE MDA Tables (20097): How to Use MDA Tables to Find Problems

Jeff Tallman, SW Engineer II/Architect, Sybase Inc.
Chris Brown, Principal SC, Sybase Inc.

Top Ten ASE MDA Tables

Review of 3-Tier MDA Monitoring Strategy
• DBA Dashboard, Application Dashboard & Fault Isolation

Tables 10-6: Configuration & Dashboard
• monEngine
• monDataCache & monCachePool
• monCachedObject
• monProcedureCacheMemoryUsage & monProcedureCacheModuleUsage
• monCachedProcedures

Tables 5-1: Fault Isolation
• monDeviceIO & monIOQueue
• monProcessActivity
• monSysWaits
• monOpenObjectActivity
• monSysStatement

Caveat Emptor - Disclaimer

The Top Ten…
• Is not necessarily based on frequency of use…
• …but is based on value to the DBA/Developer
– For example, monErrorLog is likely polled frequently
– …but it isn't as much help as monCachedObject
• …or on how often a table is used to identify & resolve problems
– monDeviceIO/monIOQueue are frequently used for diagnosing device waits… used more frequently than…
– monCachedObject… which admittedly is fairly handy
• …based on our real-world experience with customer panics

MDA Monitoring Strategy

“DBA Dashboard”: System Health (constant)
monEngine, monIOQueue, monState, monDeadlocks, monLicense, monErrorLog, monDeviceIO, monSysWaits

“App Dashboard”: Config & Tuning (periodic & frequent)
monOpenObjectActivity, monCachePool, monCachedObject, monDataCache, monCachedProcedures, monOpenDatabases, monProcedureCacheMemoryUsage

“Fault Isolation”: Problem Solving (as necessary)
monSysStatement, monProcessStatement, monProcessActivity, monProcess, monLocks, monProcessWaits, monSysSQLText

Tables 10-6: DBA Dashboard & Application Config/Tuning

10-6: Config & Monitoring

10 monEngine

9 monCachePool

8 monCachedObject

7 monCachedProcedures

6 monProcedureCacheModuleUsage

10: monEngine

Notes:
• The most common info will be the CPU time metrics
– UserCPUTime will reflect bad queries with table scans in memory, cursors, etc.
– SystemCPUTime will reflect physical and network IO
• Other metrics will likely be used for fine-grained tuning

monEngine
EngineNumber smallint, CurrentKPID int, PreviousKPID int, CPUTime int, SystemCPUTime int, UserCPUTime int, IdleCPUTime int, Yields int, Connections int, DiskIOChecks int, DiskIOPolled int, DiskIOCompleted int, ProcessesAffinitied int, ContextSwitches int, HkgcMaxQSize int, HkgcPendingItems int, HkgcHWMItems int, HkgcOverflows int, Status varchar(20), StartTime datetime, StopTime datetime, AffinitiedToCPU int, OSPID int

More Engines Needed??? Table Scans in Memory???

IO Polling Process Count (Added in 12.5.3 ESD #2)

Runnable Process Search Count

See WaitEvents!!!

HouseKeeper GC
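Because UserCPUTime, SystemCPUTime and IdleCPUTime are counters, per-engine busy percentages come from the difference between two samples. A minimal Python sketch, assuming the three columns are cumulative and share the same unit; the `EngineSample` wrapper and the sample values are hypothetical, and fetching the rows (for example with `select * from master..monEngine`) is left out.

```python
from dataclasses import dataclass


@dataclass
class EngineSample:
    """One monEngine row; field names mirror the MDA columns (hypothetical wrapper)."""
    engine_number: int
    user_cpu: int    # UserCPUTime
    system_cpu: int  # SystemCPUTime
    idle_cpu: int    # IdleCPUTime


def engine_busy(before, after):
    """Return {EngineNumber: (busy %, user %, system %)} for the interval between two samples."""
    baseline = {s.engine_number: s for s in before}
    result = {}
    for cur in after:
        old = baseline.get(cur.engine_number)
        if old is None:
            continue  # engine not present in the first sample: no baseline
        user = cur.user_cpu - old.user_cpu
        system = cur.system_cpu - old.system_cpu
        idle = cur.idle_cpu - old.idle_cpu
        total = user + system + idle
        if total <= 0:
            continue  # counters reset or no time elapsed
        result[cur.engine_number] = (
            round(100.0 * (user + system) / total, 1),
            round(100.0 * user / total, 1),
            round(100.0 * system / total, 1),
        )
    return result


before = [EngineSample(0, 1000, 200, 800), EngineSample(1, 500, 100, 1400)]
after = [EngineSample(0, 1600, 300, 900), EngineSample(1, 600, 150, 2000)]
print(engine_busy(before, after))
```

Per the notes above, a high user share suggests table scans in memory or cursors, while a high system share points toward physical and network IO.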

9 & 8: Data Cache

Joins: CacheID = CacheID, CacheName = CacheName

monCachedObject
CacheID int, DBID int, IndexID int, PartitionID int, CachedKB int, CacheName varchar(30), ObjectID int, DBName varchar(30), OwnerUserID int, OwnerName varchar(30), ObjectName varchar(30), PartitionName varchar(30), ObjectType varchar(30), TotalSizeKB int, ProcessesAccessing int


monCachePool
CacheID int, IOBufferSize int, AllocatedKB int, PhysicalReads int, Stalls int, PagesTouched int, PagesRead int, BuffersToMRU int, BuffersToLRU int, CacheName varchar(30)


monDataCache
CacheID int, RelaxedReplacement int, BufferPools int, CacheSearches int, PhysicalReads int, LogicalReads int, PhysicalWrites int, Stalls int, CachePartitions smallint, CacheName varchar(30)

Pool Size

Pool Allocation

Cache Used

Too Small or Wash Size Too Small

Text/Image IndexID=255

Cache Hogs

Tempdb DBID=2

Hit Rate = CacheSearches / Logical Reads

Misses = CacheSearches / Physical Reads

Volatility = PhysicalWrites / (PhysicalReads + LogicalReads)

CacheUsage% = 100 × (PagesTouched × @@pagesize / 1024) / AllocatedKB

CacheEfficiency = PagesRead / PagesTouched (average number of times each touched page is read)

Cache Tuning Myths & Reality

Myths
• Most people focus on cache hit ratio(s)
– The reality is that, on today's larger-memory systems, most table scans make this an unreliable indicator of overall system health

Reality
• Ignore cache hit ratios and concentrate on cache sizing and on when objects are flushed from cache due to cache organization
– Particularly look at reference tables, text/image columns, tempdb and report tables
• A lot of wasted memory
– Named caches that are oversized
– Large buffer pools that are oversized (or undersized)
– Too many named caches
• Not as flexible to changing workloads; the real overhead is the loss of memory to other tables/indexes
• Cache partitioning is a bigger player in SMP with respect to performance
– Requires careful tuning

Cache Monitoring Comments

Buffer Pools Containing Transaction Logs
• Will show artificially high PagesTouched
– The reason is that each page is new… the log keeps appending until the buffer pool limit is reached
• Watch PagesRead to see if it can be reduced
– Log scanning activities: triggers, checkpoints & ASE RepAgent
– Triggers aren't as much of an issue except for bulk statements, such as huge deletes or updates on tables with triggers
– Checkpoint will scan the log… so don't make the log pool too small… but don't make it too big either
– ASE RepAgent tuning may dictate the pool size for sites with RS

If >2 Buffer Pools, Probably a Configuration Error
• ASE will ONLY use 2 buffer pools: the page-size pool and the largest pool
• Exception is log buffer pool bindings

Let the Cache Stabilize After a Reboot
• Remember, ASE reconfigures cache during recovery

TempDB & MDA Monitoring

Most MDA tables work from an “open object”
• monOpenObjectActivity: table actively in use and still in cache
• monCachedObject: table still in cache (whether or not active)

Tables that are dropped are removed as open objects
• DES is cleared: removed from cache/open objects

Temp tables are fairly dynamic
• Most are created and dropped within seconds

Monitoring with MDA requires a bit of ingenuity
• Reducing the monitoring interval to <1 minute
• Using the MDA parameter DBID=2 to reduce data volume
• Extrapolating based on objects “caught”
• If a separate cache is defined, watch PagesTouched
– If it is constantly at the allocation, the cache may be too small

Cache Sizing & Activity

Cache Name Pool Allocated KB Max Used KB Max Pg Read KB Comments

default data cache 2048 3,788,800 3,785,030 91,933,020 Likely too small as max used is near allocation

default data cache 4096 921,600 921,578 305,336 log cache is 600MB too big…give to 2K pool

default data cache 16384 768,000 726,294 53,233,792 Looks right-sized (95%)

oltp_cache 2048 5,324,800 5,322,432 171,414,882 Good reuse - on avg, each page is reread 30x

oltp_cache 4096 1,024,000 1,023,998 958,424 log cache…just keeps being appended to

oltp_cache 16384 768,000 512,784 356,244,832 Pool is 256MB too big

msg_queue_cache 2048 2,048,000 437,900 0 Pool is 1.5GB too big, and nothing is ever read. Shift to the 16K pool or give to default data cache

Max Used KB = Max(PagesTouched); Max Pg Read KB = Max(PagesRead)
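The Comments column can be reproduced from two ratios: Max Used KB / Allocated KB (how full the pool got) and Max Pg Read KB / Max Used KB (how many times, on average, each used page was reread). A minimal Python sketch over rows copied from the table above; the 4K log pools are left out because, as noted earlier, appended log pages make PagesTouched look artificially high, and the 98% "near allocation" cut-off is an assumption chosen for illustration, not a Sybase guideline.

```python
# (cache, pool, allocated KB, max used KB, max page-read KB) copied from the table above
POOLS = [
    ("default data cache", 2048, 3_788_800, 3_785_030, 91_933_020),
    ("default data cache", 16384, 768_000, 726_294, 53_233_792),
    ("oltp_cache", 2048, 5_324_800, 5_322_432, 171_414_882),
    ("oltp_cache", 16384, 768_000, 512_784, 356_244_832),
    ("msg_queue_cache", 2048, 2_048_000, 437_900, 0),
]

NEAR_FULL_PCT = 98.0  # assumed threshold for "max used is near allocation"


def pool_ratios(allocated_kb, max_used_kb, max_read_kb):
    """Return (usage %, average rereads per used page, unused KB)."""
    usage_pct = 100.0 * max_used_kb / allocated_kb
    reuse = max_read_kb / max_used_kb if max_used_kb else 0.0
    return usage_pct, reuse, allocated_kb - max_used_kb


for cache, pool, alloc, used, read in POOLS:
    usage, reuse, unused = pool_ratios(alloc, used, read)
    note = "near allocation" if usage >= NEAR_FULL_PCT else f"{unused / 1024:,.0f} MB unused"
    print(f"{cache:<20} {pool:>6} usage={usage:5.1f}% reuse={reuse:6.1f}x {note}")
```

For the oltp_cache 2K pool the reuse ratio works out to about 32, which the table rounds to "each page is reread 30x".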

BLOBs & Cache Consumption

Cache Name DBID Table with Text/Image Max Cached BLOB (KB) Max Cached BLOB (MB)

oltp_cache 20 Table_1 1,276,972 1,247

oltp_cache 20 Table_2 1,009,902 986

oltp_cache 21 Table_3 1,261,592 1,232

oltp_cache 21 NULL 96,780 95

oltp_cache 21 Table_4 1,340,726 1,309

msg_queue_cache 16 NULL 6,586 6

msg_queue_cache 16 NULL 5,202 5

msg_queue_cache 23 Table_5 337,880 330

msg_queue_cache 23 Table_6 252,976 247

~4.7GB of oltp_cache's ~7GB (67%) is consumed by BLOBs

[Chart: cache consumption in KB for one table over the sampling period, with series Data (KB), Index (KB) and BLOB (KB)]

BLOB vs. non-BLOB Cache

Comparison of BLOB vs. Data & Index cache consumption for 1 table involved in the message processing

Spikes are caused by table being truncated and swapped

7 & 6: Procedure Cache

Join: ModuleID = ModuleID

monCachedProcedures
ObjectID int, OwnerUID int, DBID int, PlanID int, MemUsageKB int, CompileDate datetime, ObjectName varchar(30), ObjectType varchar(32), OwnerName varchar(30), DBName varchar(30)

monProcedureCache
Requests int, Loads int, Writes int, Stalls int

monProcedureCacheMemoryUsage
AllocatorID int, ModuleID int, Active int, HWM int, ChunkHWM int, NumReuseCaused int, AllocatorName varchar(30)

monProcedureCacheModuleUsage
ModuleID int, Active int, HWM int, NumPagesReused int, ModuleName varchar(30)

Reads From Disk

Proc Concurrency

Too Small Flag

ProcCache in Use for Procs = sum(MemUsageKB)

These are new in ASE 15.0.1

Current Memory vs. Max Used

Who’s causing the cache flipping

MDA & Proc Cache Modules

Ah, yes, just how much proc cache did that create index or merge sort use??? (Answer in pgs)

Statement Cache Sizing

Fully Prepared Statements

data_change()

Partition Impacts

Memory Usage by Module
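Two of the callouts above reduce to simple aggregations over monCachedProcedures: proc cache in use for procs is sum(MemUsageKB), and several PlanIDs for the same procedure usually mean that separate copies of its plan were needed for concurrent executions (in ASE a procedure plan is generally not shared by concurrent executions), although recompilation can also leave extra plans behind. A minimal Python sketch; the rows are hypothetical and would come from something like `select * from master..monCachedProcedures`.

```python
from collections import Counter

# Hypothetical monCachedProcedures rows (only the columns used here)
ROWS = [
    {"DBName": "sales", "ObjectName": "p_get_order", "PlanID": 101, "MemUsageKB": 48},
    {"DBName": "sales", "ObjectName": "p_get_order", "PlanID": 102, "MemUsageKB": 48},
    {"DBName": "sales", "ObjectName": "p_get_order", "PlanID": 117, "MemUsageKB": 48},
    {"DBName": "sales", "ObjectName": "p_add_item", "PlanID": 130, "MemUsageKB": 22},
]


def proc_cache_summary(rows):
    """Return (total KB used by cached plans, [(db, proc, plan count), ...] most plans first)."""
    total_kb = sum(r["MemUsageKB"] for r in rows)
    plans = Counter((r["DBName"], r["ObjectName"]) for r in rows)
    ranked = [(db, name, n) for (db, name), n in plans.most_common()]
    return total_kb, ranked


total_kb, ranked = proc_cache_summary(ROWS)
print(f"proc cache used by procs: {total_kb} KB")
for db, name, n in ranked:
    print(f"{db}..{name}: {n} plan(s)")
```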

Tables 5-1: Ones to Use When the Panic Button Is Pushed

The Top 5 (Problem Solvers)

5 monDeviceIO & monIOQueue

4 monProcessActivity

3 monSysWaits

2 monOpenObjectActivity

1 monSysStatement

5: monDeviceIO

Comments:
• These are ALL physical IOs
• monIOQueue.IOs = monDeviceIO.Reads + monDeviceIO.Writes
• IOTime
– monDeviceIO is shared with sp_sysmon: if you forget “no_clear”, this stat is reset… otherwise it should be the same as monIOQueue.IOTime
– Measured in ticks (100ms)
• Some will have 0 as a result of completing in the same tick
• Others will be in multiples of 100ms
• Average it out for a good idea

Join: LogicalName = LogicalName

monDeviceIO
Reads int, APFReads int, Writes int, DevSemaphoreRequests int, DevSemaphoreWaits int, IOTime int, LogicalName varchar(30), PhysicalName varchar(128)

monIOQueue
IOs int, IOTime int, LogicalName varchar(30), IOType varchar(12)


IOType values: User Data, User Log, Tempdb Data, Tempdb Log

sp_mda_deviceIO

The result set is based on a join between monDeviceIO and monIOQueue, so if a device holds both data and log, its reads and writes are duplicated in each IO type entry. To see the split between data and log, compare the split of IOs.

Slow Device (>10ms)
Avg Device (4-6ms)
Fast Device (<2ms)

Logical Name IOType Delta hms Reads Reads Per Sec APFReads APF % Writes Writes Per Sec Total IOs IOs Per Sec IOTime sec Ms Per IO

data_device2 User Data 17:47:47 1,765,644 27 1,010,775 57.2 169,491 2 1,935,122 30 20,747 10

tempdb2 Tempdb Log 17:47:47 365,150 5 1,909 0.5 1,198,428 18 141,326 2 1,338 9

tempdb2 Tempdb Data 17:47:47 365,150 5 1,909 0.5 1,198,428 18 1,422,252 22 54,143 38

tempdb Tempdb Log 17:47:47 142,500 2 8,437 5.9 339,189 5 0 0 0 0

tempdb Tempdb Data 17:47:47 142,500 2 8,437 5.9 339,189 5 481,689 7 14,265 29

data_device3 User Data 17:47:47 17,824 0 10,863 60.9 2,905 0 20,729 0 175 8

data_device4 User Data 17:47:47 4,109 0 197 4.7 276 0 4,385 0 11 2

log_device2 User Log 17:47:47 1,795 0 0 0 543,259 8 539,718 8 8,575 15

log_device2 User Data 17:47:47 1,795 0 0 0 543,259 8 5,336 0 114 21

master User Log 17:47:47 376 0 0 0 12,190 0 1,125 0 4 3
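The derived columns in the table above follow from the raw counters and the sample interval. A minimal Python sketch that recomputes the data_device2 row; it takes Total IOs and IOTime from monIOQueue, assumes IOTime has already been converted to seconds as in the table, and uses the slide's device-speed bands, reporting anything between them as "in between".

```python
def hms_to_sec(hms):
    """Convert an 'H:MM:SS' delta such as '17:47:47' to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def device_metrics(reads, apf_reads, writes, total_ios, io_time_sec, interval_sec):
    """Derived per-device figures; total_ios and io_time_sec come from monIOQueue."""
    return {
        "reads_per_sec": reads / interval_sec,
        "apf_pct": 100.0 * apf_reads / reads if reads else 0.0,
        "writes_per_sec": writes / interval_sec,
        "ios_per_sec": total_ios / interval_sec,
        "ms_per_io": 1000.0 * io_time_sec / total_ios if total_ios else 0.0,
    }


def classify(ms_per_io):
    """Bands from the slide: fast < 2 ms, slow > 10 ms (an average device is 4-6 ms)."""
    if ms_per_io < 2:
        return "fast"
    if ms_per_io > 10:
        return "slow"
    return "in between"


interval = hms_to_sec("17:47:47")
m = device_metrics(1_765_644, 1_010_775, 169_491, 1_935_122, 20_747, interval)
print(interval, {k: round(v, 1) for k, v in m.items()}, classify(m["ms_per_io"]))
```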

APF = Table Scans

monDeviceIO Notes

Tracks Physical IOs Only
• monIOQueue
– The IO queue is the IOs submitted to the OS
– …so if you are waiting for a disk block, you aren't here yet
• monDeviceIO
– Number of completed IOs
– Docs say Reads exclude APFReads…
• …but raw math with IOs shows that Reads includes the APFReads

4: monProcessActivity

monProcessActivity
SPID smallint, KPID int, ServerUserID int, CPUTime int, WaitTime int, PhysicalReads int, LogicalReads int, PagesRead int, PhysicalWrites int, PagesWritten int, MemUsageKB int, LocksHeld int, TableAccesses int, IndexAccesses int, TempDbObjects int, WorkTables int, ULCBytesWritten int, ULCFlushes int, ULCFlushFull int, ULCMaxUsage int, ULCCurrentUsage int, Transactions int, Commits int, Rollbacks int


If ULCFlushFull is high relative to ULCFlushes, the ULC is too small

Server TPS = sum(ΔTransactions) / ΔSampleTime (secs)

Number of Locks = sum(LocksHeld)

IO Hogs/Bad Queries

Total CPU = ΔSampleTime - ΔWaitTime
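The Server TPS, lock count and ULC checks above can be computed from two monProcessActivity samples. A minimal Python sketch: rows are matched on (SPID, KPID) because a SPID can be reused by a later login, processes missing from either sample are ignored, the sample rows are hypothetical, and the 0.5 ULCFlushFull/ULCFlushes threshold is a hypothetical cut-off for "high", not a documented value.

```python
def server_tps(prev_rows, cur_rows, interval_sec):
    """Server TPS = sum(delta Transactions) / delta sample time (seconds)."""
    baseline = {(r["SPID"], r["KPID"]): r["Transactions"] for r in prev_rows}
    delta = 0
    for r in cur_rows:
        key = (r["SPID"], r["KPID"])
        if key in baseline:  # skip processes that logged in after the first sample
            delta += r["Transactions"] - baseline[key]
    return delta / interval_sec


def locks_held(rows):
    """Number of locks = sum(LocksHeld) over all processes in one sample."""
    return sum(r["LocksHeld"] for r in rows)


def ulc_too_small(row, threshold=0.5):
    """Flag a process whose ULC flushes were mostly forced by a full ULC (hypothetical threshold)."""
    flushes = row["ULCFlushes"]
    return flushes > 0 and row["ULCFlushFull"] / flushes > threshold


# Hypothetical samples taken 20 seconds apart
prev = [
    {"SPID": 10, "KPID": 100, "Transactions": 50, "LocksHeld": 3},
    {"SPID": 11, "KPID": 101, "Transactions": 5, "LocksHeld": 0},
]
cur = [
    {"SPID": 10, "KPID": 100, "Transactions": 110, "LocksHeld": 7},
    {"SPID": 11, "KPID": 101, "Transactions": 25, "LocksHeld": 1},
    {"SPID": 12, "KPID": 102, "Transactions": 999, "LocksHeld": 2},
]
print(server_tps(prev, cur, 20), locks_held(cur))
print(ulc_too_small({"ULCFlushes": 100, "ULCFlushFull": 80}))
```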

Notes on monProcessActivity

Useful for finding the resource hogs
• CPUTime is not cumulative (12.5.3 at least)
– Due to a bug that has been fixed in later ESDs
– So… how do we find out how much CPU someone used??
• Answer: ΔSampleTime - ΔWaitTime
• …because if you ain't waiting, you are running!!!
• Bad Queries: look at LogicalReads, PhysicalReads & TotalCPU
– TotalCPU as defined by ΔSampleTime - ΔWaitTime

Other Tips
• TableAccesses vs. IndexAccesses can give us an early clue that the user is doing a tablescan
• TempDbObjects and WorkTables can give us clues as to whether we might be causing issues via tempdb, or whether TableAccesses are due to tempdb

Who’s Hogging the CPU??

Server UserID SPID CPU Time Derived CPU sec Derived CPU% Wait Time Physical Reads Logical Reads Physical Writes Pages Written Table Accesses Index Accesses TempDb Objects Work Tables

257 164 8,343 65,439 26.2 183,862 90,419 821,095,296 5,243 9,515 132,785 688,291,033 0 112,097

66 77 4,012 11,517 4.6 237,784 767,953 602,510,612 1,440,406 1,580,812 64,732,650 460,090,480 7,831 195,920

65 76 4,244 11,328 4.5 237,973 790,717 575,826,507 1,399,038 1,542,922 66,695,362 478,952,060 7,778 197,053

67 78 4,083 11,285 4.5 238,016 805,315 549,908,062 1,412,320 1,553,901 61,765,225 489,758,829 7,824 195,199

64 75 4,209 11,257 4.5 238,044 751,751 548,253,740 1,372,337 1,514,749 63,557,279 494,295,337 7,727 196,494

125 47 4,160 10,582 4.2 238,719 699,461 332,946,384 1,336,164 1,480,234 82,882,603 650,580,633 6,112 190,868

124 119 3,278 10,459 4.2 238,842 596,685 312,422,869 1,830,087 1,971,764 72,776,796 658,774,595 6,065 183,968

126 48 3,914 10,299 4.1 239,002 575,491 280,556,865 1,315,408 1,456,991 68,333,817 674,577,317 6,151 186,041

127 49 4,137 10,198 4.1 239,103 515,485 259,691,308 1,352,176 1,493,099 74,062,593 692,243,722 5,954 187,278

5 43 3,613 10,181 4.1 239,120 716,909 232,643,586 1,474,822 1,618,203 72,836,259 704,827,525 5,912 187,232

7 117 3,543 10,150 4.1 239,151 636,461 221,468,548 1,698,326 1,838,178 75,048,927 713,237,194 5,873 185,821

4 73 3,310 10,111 4.1 239,190 696,430 209,371,873 1,657,008 1,798,890 80,341,215 723,170,564 5,909 184,575

6 74 3,552 10,037 4.0 239,264 641,899 183,610,422 1,528,052 1,668,489 67,242,735 730,870,520 5,836 185,427

135 18 5,576 6,257 2.5 243,044 723,346 282,784,331 157,811 182,323 34,853,826 177,129,838 460 176,953

Sample time = 69h:15m:01s (~3 days)… 249,301 secs
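The Derived CPU columns can be checked by hand: over the 249,301-second sample, the first row's Wait Time of 183,862 leaves 65,439 seconds of derived CPU, or 26.2% of the sample. A minimal Python sketch of that arithmetic; it assumes the process was connected for the whole sample (a process that logged in mid-sample would need its own elapsed time instead of the sample length).

```python
SAMPLE_SEC = 249_301  # sample length from the table above


def derived_cpu(wait_sec, sample_sec=SAMPLE_SEC):
    """Total CPU = delta SampleTime - delta WaitTime; returns (seconds, percent of sample)."""
    cpu_sec = sample_sec - wait_sec
    return cpu_sec, round(100.0 * cpu_sec / sample_sec, 1)


for wait in (183_862, 237_784, 243_044):  # Wait Time values from the table above
    print(derived_cpu(wait))
```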

3: monSysWaits

Joins: WaitEventID = WaitEventID; WaitClassID = WaitClassID

monProcessWaits (Process Level Waits)
SPID smallint, KPID int, WaitEventID smallint, Waits int, WaitTime int

monSysWaits (Server Level Waits, Aggregated)
WaitEventID smallint, WaitTime int, Waits int

monWaitClassInfo
WaitClassID smallint, Description varchar(50)

monWaitEventInfo
WaitEventID smallint, WaitClassID smallint, Description varchar(50)

monWaitClassInfo and monWaitEventInfo: Static Values for Each ESD

Where’s the Holdup???

Joins: WaitClassID = WaitClassID; WaitEventID = WaitEventID; SPID = SPID and KPID = KPID

monOpenDatabases (“db log contention”)
DBID int, BackupInProgress int, LastBackupFailed int, TransactionLogFull int, AppendLogRequests int, AppendLogWaits int, DBName varchar(30), BackupStartTime datetime, SuspendedProcesses int, QuiesceTag varchar(30)

monWaitClassInfo
WaitClassID smallint, Description varchar(50)

monWaitEventInfo
WaitEventID smallint, WaitClassID smallint, Description varchar(50)

monSysWaits (“Server Cumulative Waits”, aka Context Switches)
WaitEventID smallint, WaitTime int, Waits int

monProcess (“Currently Waiting On”)
SPID smallint, KPID int, BatchID int, ContextID int, LineNumber int, SecondsConnected int, DBID int, EngineNumber smallint, Priority int, FamilyID smallint, Login varchar(30), Application varchar(30), Command varchar(30), NumChildren int, SecondsWaiting int, WaitEventID smallint, BlockingSPID smallint, BlockingXLOID int, DBName varchar(30), EngineGroupName varchar(30), ExecutionClass varchar(30), MasterTransactionID varchar(255)

monProcessWaits (“Where I am spending all my time waiting”)
SPID smallint, KPID int, WaitEventID smallint, Waits int, WaitTime int

Pop Quiz: Wall Street Customer

WaitEventID Waits WaitTime Description
----------- ----------- ----------- --------------------------------------------------

29 1399734780 4929606 wait for buffer read to complete

215 1127548743 27844254 waiting on run queue after sleep

35 670307183 4153181 wait for buffer validation to complete

179 335988317 10035324 waiting while no network read or write is required

250 324655266 167672101 waiting for incoming network data

124 256994610 6105113 wait for someone else to finish reading in mass

209 75463669 340471 waiting for a pipe buffer to read

251 62546271 344880 waiting for network send to complete

41 58470129 3473384 wait to acquire latch

31 32361806 36401 wait for buffer write to complete

214 19911597 1403956 waiting on run queue after yield

150 18083160 23776944 waiting for semaphore

52 11842516 48193 waiting for disk write to complete

51 9703708 27945 waiting for disk write to complete

55 8071811 5609 waiting for disk write to complete

36 4774219 20192 wait for mass to stop changing

272 3886481 134998 waiting for lock on PLC

54 2438135 30805 waiting for disk write to complete

What are you going to do next???
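One reasonable first step with output like this is to rank events by total WaitTime, and look at the average wait per occurrence, rather than sorting by the number of Waits. A minimal Python sketch using the rows above; WaitTime is left in whatever unit the server reported.

```python
from typing import NamedTuple


class Wait(NamedTuple):
    event_id: int
    waits: int
    wait_time: int  # as reported by monSysWaits
    description: str


QUIZ = [
    Wait(29, 1399734780, 4929606, "wait for buffer read to complete"),
    Wait(215, 1127548743, 27844254, "waiting on run queue after sleep"),
    Wait(35, 670307183, 4153181, "wait for buffer validation to complete"),
    Wait(179, 335988317, 10035324, "waiting while no network read or write is required"),
    Wait(250, 324655266, 167672101, "waiting for incoming network data"),
    Wait(124, 256994610, 6105113, "wait for someone else to finish reading in mass"),
    Wait(209, 75463669, 340471, "waiting for a pipe buffer to read"),
    Wait(251, 62546271, 344880, "waiting for network send to complete"),
    Wait(41, 58470129, 3473384, "wait to acquire latch"),
    Wait(31, 32361806, 36401, "wait for buffer write to complete"),
    Wait(214, 19911597, 1403956, "waiting on run queue after yield"),
    Wait(150, 18083160, 23776944, "waiting for semaphore"),
    Wait(52, 11842516, 48193, "waiting for disk write to complete"),
    Wait(51, 9703708, 27945, "waiting for disk write to complete"),
    Wait(55, 8071811, 5609, "waiting for disk write to complete"),
    Wait(36, 4774219, 20192, "wait for mass to stop changing"),
    Wait(272, 3886481, 134998, "waiting for lock on PLC"),
    Wait(54, 2438135, 30805, "waiting for disk write to complete"),
]


def rank_waits(rows, top=5):
    """Return (event, total wait time, avg wait per occurrence, description), largest total first."""
    ranked = sorted(rows, key=lambda w: w.wait_time, reverse=True)
    return [(w.event_id, w.wait_time, w.wait_time / w.waits if w.waits else 0.0, w.description)
            for w in ranked[:top]]


for event_id, total, avg, desc in rank_waits(QUIZ):
    print(f"{event_id:>4} {total:>12,} {avg:8.4f}  {desc}")
```

Event 250 tops this list; a later slide describes it as equivalent to "awaiting command", so it is often set aside before reading the rest of the ranking.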

First…A Familiar Picture…

[Diagram: ASE engines 0, 1 … N running on the operating system, each with its own registers, file descriptors and currently running task, all sharing the executable (program memory) and shared memory (proc cache, hash tables, lock chains, pending I/Os, other memory), with the kernel's run queues and sleep queue (lock sleep, disk I/O, send sleep)]

Applying What You Know…

When do SPIDs sleep??
• Pending physical disk I/O, network I/O, waiting on a lock

SPIDs & Timeslices…
• Each SPID gets a max of 1 timeslice (100ms) by default
• If a SPID needs a physical IO or a lock before then, it is put to sleep early
• A SPID is only woken when the IO completes or the lock is available
• When woken, the SPID's WaitTime is increased by the current timeslice minus the timeslice in which it went to sleep
– Could be 0 if woken in the same timeslice… but some time was still spent
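The wait-time bookkeeping described above can be modelled with integer ticks. A minimal, illustrative Python model (not ASE's actual code), assuming a 100ms tick and that WaitTime only advances when the tick counter changes; it shows why a short wait can be recorded as 0 while another equally short wait is recorded as a full tick.

```python
TICK_MS = 100  # default timeslice/clock tick, as assumed on the slide


def recorded_wait_ms(sleep_at_ms, woken_at_ms, tick_ms=TICK_MS):
    """WaitTime credited = (tick when woken - tick when put to sleep) * tick length."""
    return (woken_at_ms // tick_ms - sleep_at_ms // tick_ms) * tick_ms


# (put to sleep at, woken at) in ms since an arbitrary origin
for sleep_at, woken_at in [(120, 180), (190, 210), (150, 260)]:
    actual = woken_at - sleep_at
    print(f"actual={actual}ms recorded={recorded_wait_ms(sleep_at, woken_at)}ms")
```

If sleeps start at random points within a tick, the over- and under-counting averages out over many waits, which is the same reason tick-based IOTime values should be averaged.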

So… if I am only doing logical I/Os (100% cache hits), what happens when my timeslice expires?

• The SPID gets put back on the run queue (runnable state)

When do data page writes happen??
• Checkpoint process, housekeeper GC, wash marker
• Minimally logged I/O
– Fast bcp, select/into, writetext (no log), etc.
• DES scavenging (cache flush)

Pop Quiz Time

We all know that ASE can read more than one page per IO request at a time by doing a large I/O (especially APF)….

…How many pages can ASE write at one time per IO request??

A) 1: ASE does all 2K writes (or nK, where n is the page size)
B) 1 per data page, but more than one per log page, depending on the log I/O size of the database
C) Depends on sp_configure “i/o batch size” (default is 100)
D) Any number between 1 and 8 pages
E) However many fit in 32 disk blocks (512-byte blocks)

ASE ProxyDB MDA monProcessWaits

WaitEventID Waits WaitTime Description

214 182433 600 waiting on run queue after yield

55 181921 137000 waiting for disk write to complete

31 178274 200200 wait for buffer write to complete

171 9847 531700 waiting for CTLIB event to complete

36 3098 698500 waiting for MASS to finish writing before changing

29 806 8500 wait for buffer read to complete

251 500 0 waiting for network send to complete

150 33 400 waiting for semaphore

272 19 500 waiting for lock on PLC

250 6 400 waiting for incoming network data

259 3 85100 waiting until last chance threshold is cleared

Example from a platform migration test using proxy tables/db

Decoding the WaitEvents - CPU

CPU Waits
• WaitEvents
– 215: waiting on run queue after sleep
– 214: waiting on run queue after yield
– Large numbers of waits → IO problem: adding CPUs will hurt
• Adding memory will drive 215 → 214 and increase CPU issues as well
– Large wait time and low waits → adding CPUs may help
• What do they mean?
– 215 → most likely excessive physical IO (incl. APF) or locking
• Why?? Why does a process go to sleep??
• Answer: network send, physical read/write, lock wait
– 214 → bad QPs, table scanning in memory, high contention, etc.
• Why?? Because we don't sleep on logical I/Os…
– The SPID yields the CPU and jumps immediately onto the RUNNABLE queue
• Where to go next?
– monOpenObjectActivity (to find the tables involved)
– monSysStatement (to find the CPU pigs)

CPU Waits & Disk IO Activity

Wait Event ID Wait Class Description Wait Event Description Waits Wait Time Sec

215 waiting to be scheduled waiting on run queue after sleep 337,845 0.80

124 waiting for internal system event wait for mass read to finish when getting page 115,336 0.40

214 waiting to be scheduled waiting on run queue after yield 73,998 0.42

51 waiting for a disk write to complete waiting for last i/o on MASS to complete 71,169 0.47

31 waiting for a disk write to complete waiting for buf write to complete before writing 54,185 0.17

55 waiting for a disk write to complete wait for i/o to finish after writing last log page 46,714 0.25

29 waiting for a disk read to complete waiting for regular buffer read to complete 16,865 0.09

52 waiting for a disk write to complete waiting for i/o on MASS initated by another task 14,220 0.03

171 waiting to output to the network waiting for CTLIB event to complete 7,243 0.16

251 waiting to output to the network waiting for network send to complete 2,487 0.07

250 waiting for input from the network waiting for incoming network data 1,839 42.75

99 waiting for input from the network wait for data from client 223 0.10

36 waiting for memory or a buffer waiting for MASS to finish writing before changing 212 0.00

143 waiting for internal system event pause to synchronise with site manager 80 0.00

54 waiting for a disk write to complete waiting for write of the last log page to complete 49 0.00

CPU Waits & Network IO/Locking

WaitEventID Wait Class Description Wait Event Description Waits WaitTimeSec

215 waiting to be scheduled waiting on run queue after sleep 127,372,857 8.63

214 waiting to be scheduled waiting on run queue after yield 31,044,842 0.91

250 waiting for input from the network waiting for incoming network data 28,424,607 925.65

251 waiting to output to the network waiting for network send to complete 28,163,786 7.82

29 waiting for a disk read to complete waiting for regular buffer read to complete 4,502,133 4.94

150 waiting to take a lock waiting for a lock 1,963,195 6.78

35 waiting for memory or a buffer waiting for buffer validation to complete 1,701,818 2.20

272 waiting for internal system event waiting for lock on ULC 1,634,727 3.76

41 waiting for internal system event wait to acquire latch 1,206,897 4.04

283 waiting on another thread Waiting for Log writer to complete 1,172,009 12.80

308 waiting on another thread Waiting for ULC Flusher to queue dirty pages. 998,236 3.74

36 waiting for memory or a buffer waiting for MASS to finish writing before changing 973,005 8.65

307 waiting on another thread Waiting for tasks to queue ALS request. 659,928 3.45

55 waiting for a disk write to complete wait for i/o to finish after writing last log page 489,120 1.91

51 waiting for a disk write to complete waiting for last i/o on MASS to complete 328,118 8.02

52 waiting for a disk write to complete waiting for i/o on MASS initated by another task 309,834 2.10

309 waiting for a disk write to complete Waiting for last started disk write to complete 165,040 0.89

31 waiting for a disk write to complete waiting for buf write to complete before writing 144,209 1.34

124 waiting for internal system event wait for mass read to finish when getting page 102,072 0.11

53 waiting for memory or a buffer waiting for MASS to finish changing to start i/o 51,908 0.00

70 waiting for internal system event waiting for device semaphore 21,711 0.00

37 waiting for memory or a buffer wait for MASS to finish changing before changing 21,569 0.00

178 waiting for input from the network waiting while allocating new client socket 15,339 43.67

171 waiting to output to the network waiting for CTLIB event to complete 7,707 0.00

54 waiting for a disk write to complete waiting for write of the last log page to complete 6,254 0.02

Disk Write Waits…

Event ID Description

50 Write was restarted because previous attempt failed – if you see this check sys error log

51 waiting for last MASS on which i/o was issued

52 waiting for last MASS on which i/o was issued by some other task

53 waiting in writedes for mass to finish changing before writing buffer

54 waiting to write of the last page of the log

55 waiting after write of the last page of the log

In other words, 54 is log contention, and 55 is waiting for the physical log write to flush.

What’s a MASS???

Memory Address Space Segment
• Chunk of contiguous memory containing one or more 2K pages (the quantity being determined by the configured pool size: 2K, 4K, etc.)
– Analogous to “extents”
• Synchronizes access to buffers (data pages in memory) by waiting until no one else is writing the buffer
• With large IO, the state of any page in the MASS is taken to be the state of the MASS itself. This means, for example, that if you use 16K IO, access is synchronized across all 8 2K pages: if one is being written to, then all are considered to be written to.
– Large IO writes come from tempdb select/into, bcp, array inserts, etc. User queries will not reflect large I/O

…btw… this is documented (somewhat)
• MASS is defined in the Glossary

So… what to do??
• One likely cause is cache partitioning (especially if there is none)

MASS Operations

MASS Writes
• MASS (or some portion of it) is being written to disk
– Bulk I/O (select into, bcp)
– Wash marker, checkpoint, housekeeper GC
– Writedes (object being flushed from cache)

MASS Changes
• Pages being replaced with new data pages
– Normal LRU/MRU
• Pages being updated by logical writes by a SPID

Memory/Buffer Waits

30 wait to write MASS while MASS is changing
33 waiting for buffer read to complete
34 waiting for buffer write to complete
35 waiting for buffer validation to complete
36 waiting for MASS to finish writing before changing
37 wait for MASS to finish changing before changing
38 wait for mass to be validated or finish changing
53 waiting for MASS to finish changing to start i/o

MASS write to disk is delayed because someone is updating a page in the MASS

SPID wants to change MASS, but has to wait because it is still being flushed to disk

SPID wants to change MASS, but has to wait because someone else is changing it


Logical read / Logical write

Decoding the WaitEvents - Writes

Regular Disk Write Waits
• Wait Events
– 51: a lot of I/O, likely coming from the same process
• Or a page split caused a synchronous IO
– 36, 37, 52: last page contention with other users
• Including space (memory) allocation contention in tempdb
• Where to go next
– 51: monOpenObjectActivity
– 52: monOpenObjectActivity, paying close attention to DBID=2

Log Disk Write Waits
• Wait Events
– 54: contention on the last log page
– 55: waiting for the last log page to flush to disk
• Where to go next
– 54: monProcessActivity, looking at the ULC
• Also monOpenDatabases, to see which log it was
– 55: monDeviceIO/monIOQueue, looking at the IO times for the device
• Also monProcessActivity, looking at the ULC to reduce the number of log writes
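The "where to go next" rules in the CPU and disk-write slides can be kept as a lookup table and applied to whichever events dominate monSysWaits. A minimal Python sketch; only the mappings stated in these slides are included, the input (event ID mapped to total WaitTime) is hypothetical, and unmapped events are reported as such rather than guessed.

```python
# Next tables to check, per the "where to go next" bullets in these slides
NEXT_STEP = {
    215: ["monOpenObjectActivity", "monSysStatement"],
    214: ["monOpenObjectActivity", "monSysStatement"],
    51: ["monOpenObjectActivity"],
    52: ["monOpenObjectActivity (DBID=2)"],
    54: ["monProcessActivity (ULC)", "monOpenDatabases"],
    55: ["monDeviceIO/monIOQueue", "monProcessActivity (ULC)"],
}


def triage(wait_time_by_event, top=3):
    """Return [(event_id, next tables), ...] for the events with the largest WaitTime."""
    ranked = sorted(wait_time_by_event.items(), key=lambda kv: kv[1], reverse=True)
    return [(event_id, NEXT_STEP.get(event_id, ["no rule in these slides"]))
            for event_id, _ in ranked[:top]]


sample = {215: 1200, 55: 900, 54: 300, 29: 50}  # hypothetical WaitTime per event
for event_id, tables in triage(sample):
    print(event_id, "->", ", ".join(tables))
```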

Common Wait Events: Client S/W

Client-Related S/W Issues
• waiting for CTLIB event to complete
– Non-data related: i.e., waiting for TDS tokens such as ACKs for packets sent, or waiting on the next command to be sent (i.e., the gap between ct_command() and ct_send())
– If CIS is involved, it is waiting on ct_fetch()/result set materialization at the remote server… on the remote server itself, it is waiting for network access to send the data
– Next move is to look at the client code
– RepAgent will show this a lot due to deferred async network calls & the ASE scheduler
• waiting for network send to complete
– This is data-stream related; outbound commands (RPCs, RepAgent, etc.) will instead show ‘waiting for CTLIB event to complete’ while waiting for ct_sendpassthru(), etc. to execute
– Next table to check out is monProcessNetIO; the fix is probably going to be a change to the fetch block size in the program and/or the packet size
• waiting for incoming network data
– Equivalent to ‘awaiting command’: nothing expected… or…
– A big gap could point to network handling time for language commands (try ct_dynamic) or to BLOB processing

Common Wait Events (Config)

“waiting while no network read or write is required”
• Netserver checked and no network read/write was pending
• Server level: you shouldn't see this in monProcessWaits
• Check "i/o polling process count"
– If CPU & IO bound, reduce "i/o polling process count"
– For 12.5.3+, look at the following in monEngine: DiskIOChecks, DiskIOPolled, DiskIOCompleted

monSysWaits vs. monProcessWaits
• The above and others (many of the MASS waits) will only be apparent in monSysWaits, not in monProcessWaits
– For example, a checkpoint- or HK-initiated write may be delayed by a user process currently updating memory (i.e., Event 31)
• So, when viewing monSysWaits…
– Some system events can mostly be ignored (i.e., the one above)
– Others need to be viewed from the user aspect
• i.e., rather than asking why the HK is being held up, the question is whether the users are doing too much physical IO and thereby tripping the HK or checkpoint trigger.

Real Customer #1 (Slow Writes)

WaitEventID Description Waits WaitTimeSec delta_hms

215 waiting on run queue after sleep 151709 0.27 00:26:26

250 waiting for incoming network data 52137 155.38 00:26:26

214 waiting on run queue after yield 17435 0.02 00:26:26

55 wait for i/o to finish after writing last log page 15553 0.14 00:26:26

31 waiting for buf write to complete before writing 12436 0.1 00:26:26

51 waiting for last i/o on MASS to complete 10536 0.17 00:26:26

251 waiting for network send to complete 2577 0.03 00:26:26

29 waiting for regular buffer read to complete 2257 0.01 00:26:26

124 wait for mass read to finish when getting page 2226 0.03 00:26:26

178 waiting while allocating new client socket 1658 1.53 00:26:26

Event 31 is an example of a potential HK/checkpoint/wash marker IO delayed by a user modifying a page… but the question should be why we are doing so many physical IOs, especially with the 51s, 29s and 124s. One possibility is slow disks, especially given the 55s, so it is time to look at monDeviceIO/monIOQueue, and maybe at monOpenObjectActivity, as the 214s indicate table scans. A Cartesian product in tempdb could be causing both, but it is more likely that APFs are driving the reads and the 214s. The writes could be driven by a batch process scanning one table and inserting into another, but the network send numbers suggest the reads are going to the client (a lot of rows), and the writes are likely a different issue, as they are 5 times higher.

Real Customer (Bad Queries?)

Cause #2 is network send

Physical I/O delayed - look at APF’s and Phys Reads

2: monOpenObjectActivity

monOpenObjectActivity
• IndexID is key
• OptSelectCount
– Number of times the optimizer picked the index during optimization
– Picked ≠ Used (joins may change the plan)
• UsedCount
– Number of times the index was actually used for a query

Spotting TableScans
• IndexID=0 and UsedCount > 0

Unused Indexes
• Any index with UsedCount=0
• It will still have RowsInserted, etc. when DML operations affect the index values; the key is that it was never used for a query.
• Monitor over a period of time: the last thing you want is the weekend DBA calling you because a report didn't finish (because you dropped the index it used).

monOpenObjectActivity
DBID int, ObjectID int, IndexID int, DBName varchar(30), ObjectName varchar(30), LogicalReads int, PhysicalReads int, APFReads int, PagesRead int, PhysicalWrites int, PagesWritten int, RowsInserted int, RowsDeleted int, RowsUpdated int, Operations int, LockRequests int, LockWaits int, OptSelectCount int, LastOptSelectDate datetime, UsedCount int, LastUsedDate datetime


Top 7 monOpenObject Queries

Table Scans
• IndexID=0 and UsedCount > 0

Indexing Efficiency
• Order by LogicalReads desc, PagesRead desc

Caching/Table Scan Issues
• Order by PhysicalReads + APFReads desc

Hot OLTP Tables
• Order by PhysicalWrites desc
• Check for volatile indexes (RowsUpdated > 0 and IndexID > 0)

Tempdb Usage
• DBID=2 (plus all other user tempdbs)

Contention
• Order by LockWaits desc… sum(UsedCount) by ObjectID

Unused Indexes
• IndexID!=0 and UsedCount=0
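Three of these queries, sketched in T-SQL. `select top` assumes ASE 12.5.3 or later, and DBID 2 is only the system tempdb; the DBIDs of any user tempdbs would have to be added to the list by hand.

```sql
-- Caching/table scan issues: objects driving the most physical IO
select top 20 DBName, ObjectName, IndexID,
       PhysicalReads, APFReads, LogicalReads
from master..monOpenObjectActivity
order by PhysicalReads + APFReads desc
go

-- Contention: lock waits per table, plus how often its indexes were used
select DBName, ObjectName,
       TotalLockWaits    = sum(LockWaits),
       TotalLockRequests = sum(LockRequests),
       TotalUsedCount    = sum(UsedCount)
from master..monOpenObjectActivity
group by DBID, ObjectID, DBName, ObjectName
order by 3 desc
go

-- Tempdb usage (extend the list with user tempdb DBIDs)
select ObjectName, IndexID, LogicalReads, PhysicalWrites, RowsInserted
from master..monOpenObjectActivity
where DBID in (2)
order by LogicalReads desc
go
```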

Customer: System is Slow

ObjectName IndexID LogicalReads PhysicalReads APFReads PagesRead PhysicalWrites PagesWritten RowsInserted RowsDeleted RowsUpdated UsedCount

actv_eq 0 159,864,863 458,260 2,665,131 458,379 70,230 211,714 434 724 117,620 247

outb_trn 0 75,981,221 0 195,792 0 699 5,389 129 158 1,024 1,436

wrk_limits_rte 0 51,766,117 144 56,938 200 56 371 64 1 0 83

indus_wo_ln_h 0 38,183,009 2,649 2,942,396 2,768 1,217 8,854 1,730 0 0 227

indus_wo 0 23,691,223 3 12,789,387 3 402 1,914 81 89 495 3,249

indus_wo_ln 0 10,746,455 102 5,774,132 102 3,846 28,052 1,907 2,089 6,960 13,695

ns_oper_stn 0 9,262,399 0 87,346 0 0 0 0 0 0 734

yard_wo_ln_h 0 7,134,441 732 3,115,017 2,706 244 986 448 0 0 254

plnned_trn_events 0 5,362,519 315 2,104,867 378 35 252 217 168 0 200

rlse_sw_request 0 5,033,717 13 3,016,215 13 2,785 19,697 1,246 1,772 945 3,166

inbnd_trn_exception 0 4,056,119 0 2,853,452 0 1,229 5,450 2,466 1,746 16,286 11,587

event_type 0 3,039,052 4 0 4 0 0 0 0 0 12,564

outb_valid_trn 0 2,645,196 6 694,854 13 2,042 15,699 1,990 2,254 1,046 186

eq_asgn_pln 0 1,671,989 223 1,050,155 244 1,449 9,527 2,422 2,326 10 1,080

team_tk_cust 0 1,364,266 0 0 0 0 0 0 0 0 8,527

plnned_inbnd_cnsst 0 1,360,123 14 195,755 14 3,156 21,685 23,042 22,753 3,495 137

tran_loc 0 1,256,342 661 3,091 661 0 0 0 0 0 11

wb_final_class_cd 0 1,228,184 4,301 411,012 4,399 5,341 28,525 5,653 4,401 0 126

sec_grp_exception 0 1,174,012 0 0 0 0 0 0 0 0 293

Sample time 2h:25m

Did You Notice???

PagesPerWrite = PagesWritten / PhysicalWrites

ObjectName IndexID LogicalReads PhysicalReads APFReads PagesRead PhysicalWrites PagesWritten PagesPerWrite

actv_eq 0 159,864,863 458,260 2,665,131 458,379 70,230 211,714 3.0

outb_trn 0 75,981,221 0 195,792 0 699 5,389 7.7

wrk_limits_rte 0 51,766,117 144 56,938 200 56 371 6.6

indus_wo_ln_h 0 38,183,009 2,649 2,942,396 2,768 1,217 8,854 7.3

indus_wo 0 23,691,223 3 12,789,387 3 402 1,914 4.8

indus_wo_ln 0 10,746,455 102 5,774,132 102 3,846 28,052 7.3

ns_oper_stn 0 9,262,399 0 87,346 0 0 0 0

yard_wo_ln_h 0 7,134,441 732 3,115,017 2,706 244 986 4.0

plnned_trn_events 0 5,362,519 315 2,104,867 378 35 252 7.2

rlse_sw_request 0 5,033,717 13 3,016,215 13 2,785 19,697 7.1

inbnd_trn_exception 0 4,056,119 0 2,853,452 0 1,229 5,450 4.4

event_type 0 3,039,052 4 0 4 0 0 0

outb_valid_trn 0 2,645,196 6 694,854 13 2,042 15,699 7.7

eq_asgn_pln 0 1,671,989 223 1,050,155 244 1,449 9,527 6.6

team_tk_cust 0 1,364,266 0 0 0 0 0 0

plnned_inbnd_cnsst 0 1,360,123 14 195,755 14 3,156 21,685 6.9

tran_loc 0 1,256,342 661 3,091 661 0 0 0

wb_final_class_cd 0 1,228,184 4,301 411,012 4,399 5,341 28,525 5.3

sec_grp_exception 0 1,174,012 0 0 0 0 0 0

Proof that ASE does >2K (MASS) writes even on a 2K server
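The Pages Per Write column is just PagesWritten / PhysicalWrites. A sketch that computes it for every object with writes; a value above 1 means ASE wrote several pages per physical IO:

```sql
-- Pages per physical write (> 1 means multi-page / MASS writes)
select DBName, ObjectName, IndexID, PhysicalWrites, PagesWritten,
       PagesPerWrite = convert(numeric(6,1), PagesWritten * 1.0 / PhysicalWrites)
from master..monOpenObjectActivity
where PhysicalWrites > 0
order by PagesWritten desc
go
```

For actv_eq this gives 211,714 / 70,230 ≈ 3.0 and for indus_wo 1,914 / 402 ≈ 4.8, the same values as in the table.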

Answer: This is a replicate database (see rs_lastcommit). It turns out this is a WS with no active users, yet the updates and deletes on ScheduleWorkTable are causing table scans, which slow down replication and cause latency in the DSI. On average there is 1 update/delete on ScheduleWorkTable and ~1 update on AcquirerPaymentDetail per transaction group from the DSI. The XmlText scan may have been due to dbcc or reorg (only 1 scan in 10,000 operations).

ObjectName IndexID LogicalReads PhysicalReads APFReads PagesRead PhysicalWrites PagesWritten RowsInserted RowsDeleted RowsUpdated UsedCount

XmlText 0 9,980,995 223,881 9,964,364 1,374,975 1,114 1,226 4,877 0 5,249 1

ScheduleWorkTable 0 2,032,959 19 2,042,808 19 134 134 11,515 3,579 19,708 23,287

AcquirerPaymentDetail 0 107,813 0 41,792 0 939 6,182 2,205 8,578 23,614 1

rs_lastcommit 0 26,611 0 165 0 302 848 0 0 26,287 15

Heartbeat 0 5,787 0 1,132 0 69 69 238 238 0 379

AcquirerPayment 0 1,437 0 319 0 80 528 26 99 799 1

MonitorStatsUntil 0 68 0 68 0 33 33 0 0 34 34

BranchSpecificExtSequence 0 23 0 0 0 14 14 0 0 23 23

LastAcquirerPaymentId 0 21 0 0 0 21 21 0 0 21 21

ReportQueueSenderData 0 10 1 10 1 0 0 0 0 0 5

SeqBankOutTxId 0 9 0 0 0 9 9 0 0 9 9

SeqBankAccountId 0 1 0 0 0 1 1 0 0 1 1

Where Are The Problems??

Sample period was 1h:17m:34s

Are All Table Scans Equal??

ObjectName IndexID LogicalReads PhysicalReads APFReads Operations LockRequests UsedCount

RT_DFLT_DBF 0 275,500,414 6,524 47,553,199 302,811 0 71,701

ACCOUNT#NOV2006#ACC_DT1_DBF 0 111,774,616 132,981 23,929,468 33,738 143,776 880

TRN_ACA1_DBF 0 70,990,284 2,282 28,986,450 133,556 0 32,854

RT_LNDXG_DBF 0 42,692,422 720 8,227,783 188,884 0 46,959

RXM_CDFL_DBF 0 25,542,614 437 20,305,395 2,936 130,781,737 26,761

RXM_IRSN_DBF 0 18,379,485 1,923 8,718,369 7,216 310,900 36,576

ACC_RVDT_DBF 0 13,726,168 1,221 4,534,354 33,876 133 7,920

TRN_HDR_DBF 0 7,601,933 112,601 327,980 548,933 5,477 2,976

TRN_CPDF_DBF 0 7,409,830 3,334 1,877 427,752 0 100

TABLE#LIST#LEGALENT_DBF 0 7,191,939 3 7,194,301 7,184,173 279 1,796,111

TABLE#LIST#GL_ACCT_DBF 0 6,237,658 187 2,238,602 904 0 21,734

RTRN_HDR_DBF 0 4,872,419 59,713 470,002 57,357 1,615 14

TRN_PFLD_DBF 0 2,653,352 18 1,806,234 866,600 0 39,258

Sample time is just less than 15 hours (14h:59m)

With a 1:4 ratio of scans to logical reads, TABLE#LIST#LEGALENT_DBF most likely has only 4 pages, and as with all tables under 10 pages (or concurrency_opt_threshold), the optimizer is likely to just table scan. However, 7M APF reads suggest that this table keeps getting flushed from cache and has to be continuously re-read using physical IOs (APF or not, they are still physical reads that aren't necessary!!!!)
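The same scans-to-reads reasoning can be applied to every scanned table at once: logical reads per scan approximates the table size in pages, and APF reads per scan shows how much of each scan had to come from disk. A sketch; LogicalPerScan and APFPerScan are derived column names made up here, not MDA columns.

```sql
-- Per-scan ratios for tables that are being scanned (IndexID = 0)
select DBName, ObjectName, UsedCount, LogicalReads, APFReads,
       LogicalPerScan = LogicalReads / UsedCount,
       APFPerScan     = convert(numeric(10,2), APFReads * 1.0 / UsedCount)
from master..monOpenObjectActivity
where IndexID = 0
  and UsedCount > 0
order by APFReads desc
go
```

For TABLE#LIST#LEGALENT_DBF both ratios come out at about 4 (7,191,939 / 1,796,111 and 7,194,301 / 1,796,111), i.e. practically every scan of this small table was read from disk again.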

What Is Happening Here?

ObjectName IndexID LogicalReads PhysicalReads APFReads PagesRead PhysicalWrites PagesWritten RowsInserted RowsDeleted RowsUpdated Operations UsedCount

CCPayment 0 10,978,013 77 0 77 3,785 3,785 11,786 0 11,343 172,094 0

CCPayment 2 134,413 0 0 0 923 923 11,760 0 0 0 11,343

CCPayment 3 79,136 14,913 0 14,913 11,572 11,572 11,761 0 0 0 0

NO!!! We are using a non-unique index… how can we tell?
• Look at rows inserted: the table has 26 more rows inserted during the same period than the index does. If the index were unique, it would have one insert matching every table insert.
• The result is that we are using the index to position ourselves within the table and then scanning to find the rows to be updated… quite possibly using the index to position to the first occurrence and then scanning to the end of the table.

So… why index 2 instead of 3?
• Index 2 is smaller: 11,760 rows fit on 923 pages (~12/pg), whereas index 3 holds nearly 1 row/page, so using it would cost more.
• In fact, at 1 row/pg for the index and ~3 rows/pg for the table, it suggests that index 3 has max_rows_per_page=1

If the 11,343 updates are using the index, what are the inserts doing?
• This is a heap table… they are going to the end

Can you prove that minimal columns would help in this situation (if considering parallel DSIs)?
• Yes: none of the updated columns affected the indexes, so they would be “safe” if minimal columns were used… in fact, it may already be on, as the indexes show 0 rows updated…

Sample period was 1h:17m:34s

Everything is okay because we are using the index, right??
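The reasoning on this slide compares each index with its base table (IndexID 0): inserts the index did not receive, and rows per page written as a density hint. A sketch of that comparison; MissingInserts and RowsPerPageWritten are derived names made up here, and on a busy server you may prefer to copy monOpenObjectActivity into a #temp table before self-joining it.

```sql
-- Compare every layer of a table (including IndexID 0 itself) with IndexID 0
select i.ObjectName, i.IndexID,
       TableInserts   = t.RowsInserted,
       IndexInserts   = i.RowsInserted,
       MissingInserts = t.RowsInserted - i.RowsInserted,
       RowsPerPageWritten = case when i.PagesWritten > 0
                                 then convert(numeric(8,1),
                                      i.RowsInserted * 1.0 / i.PagesWritten)
                                 else null
                            end
from master..monOpenObjectActivity t,
     master..monOpenObjectActivity i
where t.DBID     = i.DBID
  and t.ObjectID = i.ObjectID
  and t.IndexID  = 0
  and i.IndexID  < 255           -- skip the text/image chain
order by i.ObjectName, i.IndexID
go
```

On the CCPayment sample this reproduces the slide's numbers: 26 missing inserts for index 2 (11,786 - 11,760), and roughly 3.1, 12.7 and 1.0 rows per page written for the table, index 2 and index 3.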

Why Don’t DML Rows Match?

ObjectName IndexID LogicalReads PhysicalWrites RowsInserted RowsDeleted RowsUpdated Operations LockRequests UsedCount

AccountTx 0 3,161,146 11,577 244,063 9,642 184,874 3,671,042 1,050,502 0

AccountTx 2 5,693,262 107,943 428,937 194,516 0 0 0 16

AccountTx 3 6,647,313 14,197 428,937 194,516 0 0 0 194,500

Journal 0 2,986,193 9,428 69,452 2,408 161,666 2,033,919 535,438 0

Journal 2 2,764,274 1,871 58,828 2,408 0 0 0 164,276

Journal 5 2,294,767 4,105 139,785 83,365 0 0 0 0

Journal 6 1,002,950 2,341 58,845 2,425 0 0 0 0

Journal 7 2,512,037 24,324 139,705 83,285 0 0 0 0

Journal 8 830,656 26,049 58,828 2,408 0 0 0 0

Journal 9 1,817,354 3,609 139,780 83,360 0 0 0 0

The actual inserts/updates/deletes to the table are represented by IndexID 0
• UsedCount ≈ Updates + Deletes, counted on whichever index (or IndexID 0) was used to find the rows; heap inserts don't 'use' an index

When an index key is modified by an update to the table…
• The new key value will logically appear elsewhere in the index tree
• This is accomplished by deleting the index row and re-inserting it within the tree
• This is a good indication of index volatility and of which indexes need update statistics more often
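Because a key change shows up as an index delete plus a re-insert, subtracting the table's RowsDeleted from each index's RowsDeleted gives a rough count of the updates that moved that index's key. A sketch of this estimate; EstKeyMoves is an approximation introduced here, not an MDA column.

```sql
-- Estimated key-moving updates per index
select i.ObjectName, i.IndexID,
       TableUpdates = t.RowsUpdated,
       IndexDeletes = i.RowsDeleted,
       EstKeyMoves  = i.RowsDeleted - t.RowsDeleted
from master..monOpenObjectActivity t,
     master..monOpenObjectActivity i
where t.DBID     = i.DBID
  and t.ObjectID = i.ObjectID
  and t.IndexID  = 0
  and i.IndexID between 1 and 254
order by 5 desc
go
```

On the AccountTx sample, indexes 2 and 3 give 194,516 - 9,642 = 184,874, exactly the table's RowsUpdated, so every update moved both keys. Journal index 5 gives 83,365 - 2,408 = 80,957 of 161,666 updates, while Journal index 2 gives 0.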

How’s Our Indexing???

ObjectName IndexID LogicalReads PhysicalReads APFReads PagesRead PhysicalWrites PagesWritten LockRequests LockWaits OptSelectCount UsedCount

actv_eq 0 159,864,863 458,260 2,665,131 458,379 70,230 211,714 1,189,619 385 247 247

wrk_limits_rte 6 96,226,458 683 7,721 683 64 64 0 0 31,595 31,618

outb_trn 0 75,981,221 0 195,792 0 699 5,389 15,397 71 1,436 1,436

outb_trn 6 61,864,256 1 0 1 294 294 0 0 13,382 14,282

wrk_limits_rte 0 51,766,117 144 56,938 200 56 371 589 0 83 83

indus_wo_ln_h 0 38,183,009 2,649 2,942,396 2,768 1,217 8,854 5,082 2 227 227

indus_wo 0 23,691,223 3 12,789,387 3 402 1,914 52,284 104 3,248 3,249

ns_oper_stn 7 14,616,981 199 86 199 0 0 0 0 22,494 27,392

indus_wo_ln 0 10,746,455 102 5,774,132 102 3,846 28,052 253,108 3 13,695 13,695

ns_oper_stn 5 9,705,081 0 0 0 0 0 0 0 385,142 681,196

eq_event_h 0 9,380,939 217,996 0 1,632,402 66,155 312,450 1,085,535 0 0 0

ns_oper_stn 0 9,262,399 0 87,346 0 0 0 4,216 0 734 734

yard_wo_ln_h 0 7,134,441 732 3,115,017 2,706 244 986 1,201 7 254 254

sakey_mq_prcs 2 6,243,771 545 0 832 2,499 8,239 0 0 35,253 35,253

eq_event_h 5 5,468,761 42,929 143,340 47,738 72,844 275,865 0 0 76,806 808,922

actv_eq 4 3,349,634 44,180 25,750 44,180 916 930 0 0 273,840 291,616

eq_mv_waybill 4 3,303,169 304 2,034 304 1,563 1,577 0 0 199,975 367,689

trn_schd 3 2,444,418 290 14,819 899 211 561 0 0 10,637 10,637

trn_schd 0 2,436,658 569 0 877 119 539 4,234 0 0 0

Sample time 2h:25m = 145min
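One way to rank the indexes in this table is by logical reads per use. A sketch; ReadsPerUse is a derived column, not an MDA one, and `select top` assumes ASE 12.5.3 or later.

```sql
-- Logical reads per index use: high values suggest poorly selective indexes
select top 20 DBName, ObjectName, IndexID,
       OptSelectCount, UsedCount, LogicalReads,
       ReadsPerUse = LogicalReads / UsedCount
from master..monOpenObjectActivity
where IndexID > 0
  and IndexID < 255
  and UsedCount > 0
order by 7 desc
go
```

In the sample above, wrk_limits_rte index 6 averages about 3,043 logical reads per use (96,226,458 / 31,618) against about 14 for ns_oper_stn index 5 (9,705,081 / 681,196), so the former is worth a closer look.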


1: monSysStatement

This is TRICKY
• We have to find where a proc begins
– It should have a LineNumber=0, but may not
• We have to find where a proc ends
– ProcedureID differs and ContextID stays the same… sort of

But we want metrics… so:
• We actually need to loop from the beginning
• …and track the nesting
• So we need to select into a #temp table with an identity column to keep the order
• I really think the fact that ContextID doesn't decrement when popping out of a context is a bug!!!
• Either way, this is ugly; if you want to do this, see me for a proc that does it for you.
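A minimal sketch of the first step only (it is not the proc mentioned above and does not do the full nesting analysis): copy monSysStatement into a #temp table with an identity column so the recorded execution order survives, then flag rows where the procedure changes. It assumes statement monitoring is enabled (`statement statistics active`, `statement pipe active`, `statement pipe max messages`); SPID 123 is a hypothetical placeholder.

```sql
-- Capture once: monSysStatement generally returns a row only once per
-- connection, so a second select would not see the same rows again.
select seq = identity(10),
       SPID, KPID, DBID, ProcedureID, BatchID, ContextID, LineNumber,
       CpuTime, WaitTime, LogicalReads, PhysicalReads, RowsAffected,
       StartTime, EndTime
into #stmt
from master..monSysStatement
where SPID = 123               -- hypothetical: the SPID being traced
go

-- Flag procedure changes between consecutive captured rows.
-- Ordering by seq keeps execution order; a time column may not.
select cur.seq, cur.BatchID, cur.ContextID, cur.LineNumber,
       ProcName   = object_name(cur.ProcedureID, cur.DBID),
       ProcChange = case when prv.seq is null
                           or prv.ProcedureID <> cur.ProcedureID
                         then 'Y' else '' end
from #stmt cur
     left outer join #stmt prv
       on prv.seq = cur.seq - 1
order by cur.seq
go
```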

monSysStatement columns: SPID smallint, KPID int, DBID int, ProcedureID int, PlanID int, BatchID int, ContextID int, LineNumber int, CpuTime int, WaitTime int, MemUsageKB int, PhysicalReads int, LogicalReads int, PagesModified int, PacketsSent int, PacketsReceived int, NetworkPacketSize int, PlansAltered int, RowsAffected int, ErrorStatus int, StartTime datetime, EndTime datetime

<pk,fk1,fk2,fk3><pk,fk1,fk2,fk3>

Before We Begin….

ASE Execution Path:
• Batch - this is the SQL text glob sent by the user
• Context - this is incremented for each proc, trigger, exec()
• Line # - this is the statement within the context
– For monitoring purposes, the statement needs to invoke IO or CPU
– It can repeat or skip (think while loops, if/else, etc.)
– Line 0 is a sub-proc call/context change

Insert into table1
Insert into table2
Exec procA
• update table1
• exec procB
  • insert into table3
Update table2
Go
Insert into table3
go

Batch 0

Batch 1

Batch 0; Context 0; Line 1Batch 0; Context 0; Line 2

Batch 0; Context 1; Line 0

Batch 1; Context 0; Line 1

MDA Tables Affected

monProcess columns: SPID smallint, KPID int, BatchID int, ContextID int, LineNumber int, SecondsConnected int, DBID int, EngineNumber smallint, Priority int, FamilyID smallint, Login varchar(30), Application varchar(30), Command varchar(30), NumChildren int, SecondsWaiting int, WaitEventID smallint, BlockingSPID smallint, BlockingXLOID int, DBName varchar(30), EngineGroupName varchar(30), ExecutionClass varchar(30), MasterTransactionID varchar(255)

monProcessStatement columns: SPID smallint, KPID int, DBID int, ProcedureID int, PlanID int, BatchID int, ContextID int, LineNumber int, CpuTime int, WaitTime int, MemUsageKB int, PhysicalReads int, LogicalReads int, PagesModified int, PacketsSent int, PacketsReceived int, NetworkPacketSize int, PlansAltered int, RowsAffected int, StartTime datetime

<pk,fk1,fk2><pk,fk1,fk2>

monSysStatement columns: SPID smallint, KPID int, DBID int, ProcedureID int, PlanID int, BatchID int, ContextID int, LineNumber int, CpuTime int, WaitTime int, MemUsageKB int, PhysicalReads int, LogicalReads int, PagesModified int, PacketsSent int, PacketsReceived int, NetworkPacketSize int, PlansAltered int, RowsAffected int, ErrorStatus int, StartTime datetime, EndTime datetime

<pk,fk1,fk2,fk3><pk,fk1,fk2,fk3>

Current Statement being executed

Previously Executed Statements

monSys??? vs monProcess???

• monProcess??? holds *current* values for the *current statement*
– When the statement finishes, metrics are aggregated/flushed to monSys???

• monProcessObject → monOpenObjectActivity
• monProcessProcedures (dropped)
• monProcessStatement → monSysStatement
• monProcessSQLText → monSysSQLText (after first statement)
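Looking at the *current* side of the pair is a plain query against monProcessStatement. A sketch, assuming statement statistics are enabled; a statement's row disappears from here once it finishes, so this shows only what is running at the moment of the query.

```sql
-- Statements executing right now, heaviest CPU first
select SPID, BatchID, ContextID, LineNumber,
       ProcName = object_name(ProcedureID, DBID),
       CpuTime, WaitTime, LogicalReads, PhysicalReads, StartTime
from master..monProcessStatement
where SPID <> @@spid           -- leave out this monitoring session
order by CpuTime desc
go
```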

That Pesky Line #

For SQL Batches - Always begins at 1

For Procs/Triggers
• LineNumber=0 means ASE isn't executing any line; it is searching through the proc/trigger code for the next executable line. Common sightings:
– At the beginning of a proc with a lot of comments/declare statements
– In the middle of procs due to if/else blocks, e.g. when skipping the 'else'
– Ditto, but when the while condition fails…
– Also seen at the end when the proc lacks a 'return'

• A call to the proc increments ContextID
– The calling proc is recorded after the exec completes

• For that row, BatchID is unchanged, ContextID stays at the called proc's value, ProcedureID differs (it is the caller's) and LineNumber is 1 higher than the calling routine's statement before the call

• If the proc was the first statement in the batch, BatchID & ContextID are equal but LineNumber=1 (i.e. proc call is 3,1,0; proc end is 3,1,1)

• The line numbers match the SQL text line numbers
– i.e. if you had blank lines in the script before the create proc, they are counted too…

• Things that don't get captured
– Declare @vars
– Begin/end statements

Don't use ORDER BY
• Statements are recorded in execution order in monSysStatement

The Test DDL

use demo_db
go

create table stmt_test (
    row_id   bigint identity not null,
    comment  varchar(40) null,
    ins_date datetime,
    primary key (row_id)
)
go

1.
2.
3. create procedure proc_2
4.     @iteration int
5. as begin
6.     insert into stmt_test (comment, ins_date)
7.     select 'proc_2: iteration #'+convert(varchar(5),@iteration), getdate()
8.     waitfor delay '00:00:01'
9.     return 0
10. end
11. go

1.
2.
3. create procedure proc_1
4.     @num_times int
5. as begin
6.     declare @n int
7.     insert into stmt_test (comment, ins_date) values ('proc_1: before calling proc_2',getdate())
8.     select @n=1
9.     while @n <= @num_times
10.    begin
11.        exec proc_2 @n
12.        insert into stmt_test (comment, ins_date) values ('proc_1: looping on proc_2',getdate())
13.        select @n=@n+1
14.    end
15.    insert into stmt_test (comment, ins_date) values ('proc_1: after calling proc_2',getdate())
16.
17.    return 0
18. end
19. go

Test Execution

1.
2. /*
3. ** Sybase ASE Transact-SQL script
4. **
5. ** Jeff Tallman/Sybase Enterprise Solutions
6. ** tallman@sybase.com
7. **
8. */
9.
10. use demo_db
11. go
12. select @@spid
13. go
14. insert into stmt_test (comment, ins_date) values ('batch #1, before proc, insert #1', getdate())
15. insert into stmt_test (comment, ins_date) values ('batch #1, before proc, insert #2', getdate())
16. exec proc_1 10
17. insert into stmt_test (comment, ins_date) values ('batch #1, after proc, insert #1', getdate())
18. go
19. insert into stmt_test (comment, ins_date) values ('batch #2, before proc, insert #1', getdate())
20. insert into stmt_test (comment, ins_date) values ('batch #2, before proc, insert #2', getdate())
21. exec proc_1 5
22. insert into stmt_test (comment, ins_date) values ('batch #2, after proc, insert #1', getdate())
23. go

Reading monSysStatement

Use demo_db

Select @@spid, first inserts

Call to proc_1

Call #1 to proc_2: Insert, Waitfor, Return 0

While loop repeat: Insert in loop, Select @n=@n+1

Exit call to proc_2

Exit from proc_1, Insert after proc_1

Proc_2 iteration #1

Proc_2 iteration #2

Proc_2 iteration #3

Second batch begins

Note: We did not use the ORDER BY

Note: see how the context ID doesn’t decrement until outer proc exits???

Help Is On Its Way

ASE 15.0.x??? / CR# 345353-1
• Two tables: aggregated per proc, and per execution
• Call today and add your voice to get this CR prioritized

Aggregate data
• Average and total execution time
• Average and total CPU usage
• Number of executions
• Most recent execution date and time
• Average and total logical and physical reads and writes

Per-execution data
• Total execution time
• CPU time
• Start and end date-time
• Number of logical and physical reads and writes

But…Until Then….

ProcedureName NumExecs ElapsedAvg ElapsedMax CpuTimeAvg CpuTimeMax WaitTimeAvg WaitTimeMax LogicalReadsAvg LogicalReadsMax RowsAffectedAvg

s_ero_parts_tbo_rs 89 1,759 25,570 1,481 20,315 215 5,100 5,989 20,992 21

s_path_summary 3 3,605 7,613 1,388 2,262 2,733 6,900 7,048 8,652 595

s_ero_parts_ordered_rs 24 1,136 4,790 503 1,234 629 4,786 73,708 108,474 705

sp_sysmon 4 77,786 308,230 272 769 359 936 37,058 111,823 8,047

s_ero_retail_vehinfo 30 180 1,083 107 764 71 1,083 2,720 4,760 6

sp_allblkinfo 1 16,594,500 16,594,500 707 707 622,492 622,492 17,848 17,848 51

sp_rs_part_apron_ps 1 355,550 355,550 554 554 7,830 7,830 4,847 4,847 147

s_ds_scan_queue 230 295 3,280 76 375 216 2,900 6,577 9,046 21

s_hist_get_repair_order 13 832 2,076 119 367 707 1,782 13,421 39,140 2,100

s_ero_tech_rs 63 223 6,066 64 251 158 6,000 7,879 30,323 28

sp_appr 95 533 4,390 62 236 419 4,200 5,700 9,965 136

tr_note_lines_iu 6 76 226 75 226 0 0 11,771 35,189 6

s_ero_options_needed_rs 1 0 0 223 223 100 100 2,379 2,379 29

s_get_esp_info 15 58 180 57 176 0 0 935 1,132 5

s_im_ds 8 152 163 151 162 0 0 3,171 4,166 104

s_ero_afs_base 7 231 410 117 158 111 283 1,720 2,648 31

Someone still has faith that sp_sysmon is really going to help them with this problem… look at the next slide and decide for yourself.
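Until that CR ships, per-procedure numbers have to be built from monSysStatement. Counting executions needs the nesting analysis, but resource totals per procedure do not, since each statement row can simply be summed. A sketch along those lines; it assumes ProcedureID 0 marks ad hoc batch statements, selecting from monSysStatement consumes the rows for this connection, and a bogus ~2 billion CpuTime on any row will distort the CPU totals.

```sql
-- Resource totals per procedure (statement rows summed, executions not counted)
select ProcName          = object_name(ProcedureID, DBID),
       Statements        = count(*),
       TotalElapsedMs    = sum(convert(numeric(18,0), datediff(ms, StartTime, EndTime))),
       TotalCpuTime      = sum(convert(numeric(18,0), CpuTime)),
       TotalWaitTime     = sum(convert(numeric(18,0), WaitTime)),
       TotalLogicalReads = sum(convert(numeric(18,0), LogicalReads))
from master..monSysStatement
where ProcedureID <> 0         -- assumed: 0 = ad hoc batch statements
group by DBID, ProcedureID
order by 4 desc                -- most CPU first
go
```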

Line by Line Detail….

ProcedureName LineNumber NumExecs ElapsedAvg ElapsedMax CpuTimeAvg CpuTimeMax WaitTimeAvg WaitTimeMax LogicalReadsAvg LogicalReadsMax RowsAffectedAvg

s_ero_parts_tbo_rs 97 51 2,733 24,063 2,531 20,263 201 3,800 6,427 10,157 5

s_path_summary 159 3 1,868 2,640 1,135 1,540 733 1,100 6,133 6,201 577

s_ero_parts_ordered_rs 606 18 582 1,513 543 1,210 38 400 83,962 84,172 120

s_ero_retail_vehinfo 72 24 154 696 113 663 41 500 1,977 3,461 0

s_path_summary 139 1 4,746 4,746 646 646 4,100 4,100 2,098 2,098 1

sp_sysmon 335 1 560 560 560 560 0 0 90,110 90,110 15,974

s_ds_scan_queue 59 185 79 880 63 323 16 600 5,621 5,662 5

s_ero_options_needed_rs 102 1 323 323 223 223 100 100 2,301 2,301 14

tr_note_lines_iu 151 2 223 223 223 223 0 0 35,126 35,126 0

s_get_esp_info 92 14 57 176 57 176 0 0 460 492 1

sp_sysmon 224 1 163 163 163 163 0 0 36,353 36,353 16,088

sp_appr 494 74 259 3,260 23 160 235 3,100 1,104 1,153 1

s_im_ds 179 6 154 156 154 156 0 0 2,962 3,117 30

s_ero_search_assoc 40 2 150 150 150 150 0 0 819 819 19

sp_sysmon 404 1 143 143 143 143 0 0 20,213 20,213 114

s_im_ds 110 2 130 130 130 130 0 0 2,069 2,069 0

sp_appr 501 79 172 2,116 26 130 146 2,000 2,632 2,669 0

tr_update_demo 71 14 228 993 29 126 199 993 1,348 1,456 0

s_ero_parts_tbo_rs 434 33 237 1,343 14 120 222 1,300 33 61 0

s_hist_get_repair_order 182 6 47 116 47 116 0 0 8,563 19,721 3,537
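The line-by-line view amounts to grouping monSysStatement by procedure and LineNumber. A sketch producing the same columns as the table above, ordered as the slide appears to be (by maximum CPU time). The numeric conversions guard against int overflow in the aggregates; reading monSysStatement consumes the rows for this connection, and ProcedureID 0 is assumed to mark ad hoc batch statements.

```sql
select ProcName        = object_name(ProcedureID, DBID),
       LineNumber,
       NumExecs        = count(*),
       ElapsedAvg      = avg(convert(numeric(18,0), datediff(ms, StartTime, EndTime))),
       ElapsedMax      = max(datediff(ms, StartTime, EndTime)),
       CpuTimeAvg      = avg(convert(numeric(18,0), CpuTime)),
       CpuTimeMax      = max(CpuTime),
       WaitTimeAvg     = avg(convert(numeric(18,0), WaitTime)),
       WaitTimeMax     = max(WaitTime),
       LogicalReadsAvg = avg(convert(numeric(18,0), LogicalReads)),
       LogicalReadsMax = max(LogicalReads),
       RowsAffectedAvg = avg(convert(numeric(18,0), RowsAffected))
from master..monSysStatement
where ProcedureID <> 0
group by DBID, ProcedureID, LineNumber
order by 7 desc                -- CpuTimeMax
go
```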

Weird CpuTimes Being Reported

What if CpuTime is reported as ~2 billion?
• e.g. 2147483645 (or nearly so)
• CpuTime is calculated as
– CpuTime = datediff(ms,StartTime,EndTime)-WaitTime
• ASE syncs its internal clock with the OS/HW clock every 60 seconds
• If the machine is experiencing clock drift, a short-running proc may appear to have a “negative” datediff
– CpuTime is still calculated and ends up being 2,147,483,647 (the maximum int) minus the magnitude of the negative diff, so the above example had a duration of “-2” ms
– This could also happen if WaitTime is high on longer-running procs (assuming wait time is nearly all of the elapsed time)

Resolution:
• Can't really be fixed by Sybase; it is a hardware issue with clock drift
• Best bet is to use a constant (e.g. 1) or a standard percentage of wait time (e.g. 10%) for CpuTime if/when this happens.
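Applying that resolution while reading the data might look like the sketch below. The 2,000,000,000 ms cut-off (about 23 days of CPU, impossible for a single statement) and the floor of 1 ms when WaitTime is too small for 10% to be meaningful are assumptions, not ASE behaviour.

```sql
-- Replace clock-drift CpuTime values with 10% of WaitTime, or 1 ms
select SPID, ProcedureID, LineNumber, WaitTime,
       RawCpuTime = CpuTime,
       AdjCpuTime = case
                      when CpuTime < 2000000000 then CpuTime   -- normal value
                      when WaitTime >= 10 then WaitTime / 10   -- 10% of wait time
                      else 1                                   -- constant fallback
                    end
from master..monSysStatement
go
```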
