top ten ase mda tables (20097) - mark gearhart · top ten ase mda tables (20097) how to use mda...
Post on 21-May-2018
Top Ten ASE MDA Tables (20097)
How to Use MDA Tables to Find Problems
Jeff Tallman, SW Engineer II/Architect, Sybase Inc.
Chris Brown, Principal SC, Sybase Inc.
Top Ten ASE MDA Tables
Review of 3 Tier MDA Monitoring Strategy
• DBA Dashboard, Application Dashboard, & Fault Isolation
Tables 10 to 6: Configuration & Dashboard
• monEngine
• monDataCache & monCachePool
• monCachedObject
• monProcedureCacheMemoryUsage & monProcedureCacheModuleUsage
• monCachedProcedures
Tables 5 to 1: Fault Isolation
• monDeviceIO & monIOQueue
• monProcessActivity
• monSysWaits
• monOpenObjectActivity
• monSysStatement
Caveat Emptor - Disclaimer
The Top Ten…
• Is not necessarily based on frequency of use…
• …but is based on the value to the DBA/Developer
  – For example monErrorLog is likely polled frequently
  – …but it isn’t as much help as monCachedObject
• …or based on how often used to identify & resolve problems
  – monDeviceIO/monIOQueue frequently used for diagnosing device waits… used more frequently than…
  – monCachedObject… which admittedly is fairly handy
• …based on our real world experience with customer panics
MDA Monitoring Strategy
“DBA Dashboard”: System Health (constant)
• monEngine, monIOQueue, monState, monDeadlocks, monLicense, monErrorLog, monDeviceIO, monSysWaits
“App Dashboard”: Config & Tuning (periodic & frequent)
• monOpenObjectActivity, monCachePool, monCachedObject, monDataCache, monCachedProcedures, monOpenDatabases, monProcedureCacheMemoryUsage
“Fault Isolation”: Problem Solving (as necessary)
• monSysStatement, monProcessStatement, monProcessActivity, monProcess, monLocks, monProcessWaits, monSysSQLText
Tables 10 to 6: DBA Dashboard & Application Config/Tuning
10 to 6: Config & Monitoring
10 monEngine
9 monCachePool
8 monCachedObject
7 monCachedProcedures
6 monProcedureCacheModuleUsage
10: monEngine
Notes:
• The most common info will be the CPU time metrics
  – UserCPUTime will reflect bad queries with table scans in memory, cursors, etc.
  – SystemCPUTime will reflect physical and network IO
• Other metrics will likely be used for fine grained tuning
monEngine columns:
• EngineNumber smallint <pk>
• int columns: CurrentKPID, PreviousKPID, CPUTime, SystemCPUTime, UserCPUTime, IdleCPUTime, Yields, Connections, DiskIOChecks, DiskIOPolled, DiskIOCompleted, ProcessesAffinitied, ContextSwitches, HkgcMaxQSize, HkgcPendingItems, HkgcHWMItems, HkgcOverflows
• Status varchar(20)
• StartTime datetime, StopTime datetime
• AffinitiedToCPU int, OSPID int
Diagram callouts:
• More Engines Needed???
• Table Scans in Memory???
• IO Polling Process Count (Added in 12.5.3 ESD #2)
• Runnable Process Search Count
• See WaitEvents!!!
• HouseKeeper GC
9 & 8: Data Cache
Relationships (two in the diagram): CacheID = CacheID, CacheName = CacheName
monCachedObject columns: CacheID int, DBID int, IndexID int, PartitionID int, CachedKB int, CacheName varchar(30), ObjectID int, DBName varchar(30), OwnerUserID int, OwnerName varchar(30), ObjectName varchar(30), PartitionName varchar(30), ObjectType varchar(30), TotalSizeKB int, ProcessesAccessing int
<pk,fk1><pk,fk2><pk,fk2><pk,fk2>
<pk,fk1><pk,fk2><pk,fk2>
<pk,fk2><pk,fk2>
monCachePool columns: CacheID int <pk,fk>, IOBufferSize int <pk>, AllocatedKB int, PhysicalReads int, Stalls int, PagesTouched int, PagesRead int, BuffersToMRU int, BuffersToLRU int, CacheName varchar(30) <pk,fk>
monDataCache columns: CacheID int <pk>, RelaxedReplacement int, BufferPools int, CacheSearches int, PhysicalReads int, LogicalReads int, PhysicalWrites int, Stalls int, CachePartitions smallint, CacheName varchar(30) <pk>
Diagram callouts:
• Pool Size
• Pool Allocation
• Cache Used
• Too small or Wash Size too small
• Text/Image: IndexID=255
• Cache Hogs
• Tempdb: DBID=2
Hit Rate = CacheSearches / LogicalReads
Misses = CacheSearches / PhysicalReads
Volatility = PhysicalWrites / (PhysicalReads + LogicalReads)
CacheUsage% = (PagesTouched * @@pagesize) / AllocatedKB
CacheEfficiency% = PagesRead / (PagesTouched * @@pagesize)
Cache Tuning Myths & Reality
Myths
• Most people focus on cache hit ratio(s)
  – The reality is that most tablescans on today’s larger memory systems cause this to be an unreliable indicator of overall system health
Reality
• Ignore cache hit ratios and concentrate on cache sizing and when objects are flushed from cache due to cache organization
  – Particularly look at reference tables, text/image columns, tempdb, report tables
• A lot of wasted memory
  – Named caches that are oversized
  – Large buffer pools that are oversized (or undersized)
  – Too many named caches
    • Not as flexible to changing workloads; real overhead is loss of memory to other tables/indexes
• Cache partitioning is a bigger player in SMP wrt performance
  – Requires careful tuning
Cache Monitoring Comments
Buffer Pools containing Transaction Logs
• Will show an artificially high pages touched
  – Reason is that each page is new… keeps appending until buffer pool limit is reached
• Watch PagesRead to see if it can be reduced
  – Log scanning activities: Triggers, Checkpoints & ASE Rep Agent
  – Triggers aren’t as much of an issue except for bulk statements such as huge deletes or updates on tables with triggers
  – Checkpoint will scan the log… so don’t make it too small… but don’t make it too big either
  – ASE Rep Agent tuning may dictate size for sites with RS
If >2 Buffer Pools, Probably a Configuration Error
• ASE will ONLY use 2 buffer pools - page size and largest pool size
• Exception is log buffer pool bindings
Let stabilize after reboot
• Remember, ASE reconfigures cache during recovery
TempDB & MDA Monitoring
Most MDA Tables work from an “open object”
• monOpenObjectActivity - table actively in use and still in cache
• monCachedObject - table still in cache (whether or not active)
Tables that are dropped are removed as open objects
• DES is cleared - removed from cache/open objects
Temp tables are fairly dynamic
• Most are created and dropped within seconds
Monitoring with MDA requires a bit of ingenuity
• Reducing the monitoring interval to less than 1 minute
• Using the MDA parameter of DBID=2 to reduce data volume
• Extrapolating based on objects “caught”
• If a separate cache is defined - watch PagesTouched
  – If constantly at the allocation - it may be too small
Cache Sizing & Activity
Cache Name          Pool   Allocated KB  Max Used KB  Max Pg Read KB  Comments
default data cache  2048   3,788,800     3,785,030    91,933,020      Likely too small as max used is near allocation
default data cache  4096   921,600       921,578      305,336         log cache is 600MB too big… give to 2K pool
default data cache  16384  768,000       726,294      53,233,792      Looks right-sized (95%)
oltp_cache          2048   5,324,800     5,322,432    171,414,882     Good reuse - on avg, each page is reread 30x
oltp_cache          4096   1,024,000     1,023,998    958,424         log cache… just keeps being appended to
oltp_cache          16384  768,000       512,784      356,244,832     Pool is 256MB too big
msg_queue_cache     2048   2,048,000     437,900      0               Pool is 1.5GB too big - and nothing ever read. Shift to the 16K pool or give to default data cache
Max Used KB = Max(PagesTouched); Max Pg Read KB = Max(PagesRead)
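The comments column above can be reproduced with simple arithmetic. The following is a minimal sketch, assuming the values have already been converted to KB as on the slide; the function name `pool_report` and the tuple layout are illustrative and not part of any ASE tooling.

```python
# Rows copied from the Cache Sizing & Activity table:
# (cache name, pool size, allocated KB, max used KB, max pages read KB)
POOLS = [
    ("default data cache", 16384, 768_000, 726_294, 53_233_792),
    ("oltp_cache", 2048, 5_324_800, 5_322_432, 171_414_882),
    ("oltp_cache", 16384, 768_000, 512_784, 356_244_832),
    ("msg_queue_cache", 2048, 2_048_000, 437_900, 0),
]


def pool_report(allocated_kb, max_used_kb, max_read_kb):
    """Return (percent of pool used, unused KB, average rereads per used KB)."""
    pct_used = 100.0 * max_used_kb / allocated_kb
    unused_kb = allocated_kb - max_used_kb
    # Guard against an idle pool where nothing was ever touched.
    rereads = max_read_kb / max_used_kb if max_used_kb else 0.0
    return pct_used, unused_kb, rereads


if __name__ == "__main__":
    for name, pool, alloc, used, read in POOLS:
        pct, unused, rereads = pool_report(alloc, used, read)
        print(f"{name:20s} {pool:>6d} used={pct:5.1f}% "
              f"unused={unused / 1024:6.0f}MB rereads={rereads:5.1f}x")
```

A pool that sits near 100% used is a candidate for growth, while a large unused figure (for example the 16K pool in oltp_cache) is memory that could go elsewhere.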
BLOBs & Cache Consumption
Columns: Cache Name, DBID, Table with Text/Image, Max Cached BLOB (KB), Max Cached BLOB (MB)
oltp_cache 20 Table_1 1,276,972 1,247
oltp_cache 20 Table_2 1,009,902 986
oltp_cache 21 Table_3 1,261,592 1,232
oltp_cache 21 NULL 96,780 95
oltp_cache 21 Table_4 1,340,726 1,309
msg_queue_cache 16 NULL 6,586 6
msg_queue_cache 16 NULL 5,202 5
msg_queue_cache 23 Table_5 337,880 330
msg_queue_cache 23 Table_6 252,976 247
~4.7GB of 7GB (67%)
BLOB vs. non-BLOB Cache
[Chart: Data (KB) and Index (KB) on the left axis "Data & Index (KB)", 0 to 50,000; BLOB (KB) on the right axis "BLOB (KB)", 0 to 1,500,000; x-axis is the sample time]
Comparison of BLOB vs. Data & Index cache consumption for 1 table involved in the message processing
Spikes are caused by table being truncated and swapped
7 & 6: Procedure Cache
monCachedProcedures columns: ObjectID int <pk>, OwnerUID int, DBID int <pk>, PlanID int <pk>, MemUsageKB int, CompileDate datetime, ObjectName varchar(30), ObjectType varchar(32), OwnerName varchar(30), DBName varchar(30)
monProcedureCache columns: Requests int, Loads int, Writes int, Stalls int
monProcedureCacheMemoryUsage columns: AllocatorID int, ModuleID int <fk>, Active int, HWM int, ChunkHWM int, NumReuseCaused int, AllocatorName varchar(30)
monProcedureCacheModuleUsage columns: ModuleID int <pk>, Active int, HWM int, NumPagesReused int, ModuleName varchar(30)
Join: ModuleID = ModuleID
Diagram callouts:
• Reads From Disk
• Proc Concurrency
• Too Small Flag
• ProcCache in Use for Procs = sum(MemUsageKB)
• These are new in ASE 15.0.1
• Current Memory vs. Max Used
• Who’s causing the cache flipping
MDA & Proc Cache Modules
Ah, yes, just how much proc cache did that create index or merge sort use??? (Answer in pgs)
• Statement Cache Sizing
• Fully Prepared Statements
• data_change()
• Partition impacts
Memory Usage by Module
Tables 5 to 1: Ones to Use When Panic Button is Pushed
The Top 5 (Problem Solvers)
5 monDeviceIO & monIOQueue
4 monProcessActivity
3 monSysWaits
2 monOpenObjectActivity
1 monSysStatement
5: monDeviceIO
Comments:
• These are ALL physical IOs
• monIOQueue.IOs = monDeviceIO.Reads + monDeviceIO.Writes
• IOTime
  – monDeviceIO is shared with sp_sysmon - forget “no_clear” and this stat is reset… otherwise should be same as monIOQueue.IOTime
  – Measured in ticks (100ms)
    • Some will have 0 as a result of completing in the same tick
    • Others will be in multiples of 100ms
    • Average it out for a good idea
Join: LogicalName = LogicalName
monDeviceIO columns: Reads int, APFReads int, Writes int, DevSemaphoreRequests int, DevSemaphoreWaits int, IOTime int, LogicalName varchar(30) <pk>, PhysicalName varchar(128)
monIOQueue columns: IOs int, IOTime int, LogicalName varchar(30) <pk,fk>, IOType varchar(12) <pk>
IOType values: User Data, User Log, Tempdb Data, Tempdb Log
sp_mda_deviceIO
Result set is based on a join between monDeviceIO and monIOQueue, so if a device has both data & log, the reads & writes will be duplicated for each IO type entry - to see the split between data/log, compare the split of IOs
Legend: Slow Device (>10ms), Avg Device (4-6ms), Fast Device (<2ms)
Columns: Logical Name, IOType, Delta hms, Reads, Reads Per Sec, APFReads, APF %, Writes, Writes Per Sec, Total IOs, IOs Per Sec, IOTime sec, Ms Per IO
data_device2 User Data 17:47:47 1,765,644 27 1,010,775 57.2 169,491 2 1,935,122 30 20,747 10
tempdb2 Tempdb Log 17:47:47 365,150 5 1,909 0.5 1,198,428 18 141,326 2 1,338 9
tempdb2 Tempdb Data 17:47:47 365,150 5 1,909 0.5 1,198,428 18 1,422,252 22 54,143 38
tempdb Tempdb Log 17:47:47 142,500 2 8,437 5.9 339,189 5 0 0 0 0
tempdb Tempdb Data 17:47:47 142,500 2 8,437 5.9 339,189 5 481,689 7 14,265 29
data_device3 User Data 17:47:47 17,824 0 10,863 60.9 2,905 0 20,729 0 175 8
data_device4 User Data 17:47:47 4,109 0 197 4.7 276 0 4,385 0 11 2
log_device2 User Log 17:47:47 1,795 0 0 0 543,259 8 539,718 8 8,575 15
log_device2 User Data 17:47:47 1,795 0 0 0 543,259 8 5,336 0 114 21
master User Log 17:47:47 376 0 0 0 12,190 0 1,125 0 4 3
APF = Table Scans
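The "Ms Per IO" column and the slow/avg/fast legend can be sketched as follows. This is an illustration only, assuming IOTime is already in seconds as in the report; the function names are hypothetical, and truncating to whole milliseconds is an assumption chosen because it matches the values shown on the slide.

```python
def ms_per_io(io_time_sec, total_ios):
    """Average milliseconds per I/O, truncated to a whole number.

    Returns None when the device did no I/O in the sample.
    """
    if total_ios == 0:
        return None
    return (io_time_sec * 1000) // total_ios


def device_speed(ms):
    """Bucket a device using the legend thresholds from the slide."""
    if ms is None:
        return "no I/O"
    if ms > 10:
        return "slow"
    if 4 <= ms <= 6:
        return "average"
    if ms < 2:
        return "fast"
    return "unclassified"  # the legend leaves gaps (2-3 ms and 7-10 ms)


if __name__ == "__main__":
    # (logical name, IO type, total IOs, IOTime sec) taken from the report above
    rows = [
        ("data_device2", "User Data", 1_935_122, 20_747),
        ("tempdb2", "Tempdb Data", 1_422_252, 54_143),
        ("tempdb", "Tempdb Log", 0, 0),
        ("data_device4", "User Data", 4_385, 11),
    ]
    for name, io_type, ios, io_time in rows:
        ms = ms_per_io(io_time, ios)
        print(name, io_type, ms, device_speed(ms))
```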
monDeviceIO Notes
Tracks Physical IO’s only
• monIOQueue
  – the IO queue is the IOs submitted to the OS
  – …so if you are waiting for a disk block, you aren’t here yet
• monDeviceIO
  – Number of completed IOs
  – Docs say Reads exclude APFReads…
    • …but raw math with IOs shows that it includes the APFReads
4: monProcessActivity
monProcessActivity columns:
• SPID smallint <pk,fk>, KPID int <pk,fk>
• int columns: ServerUserID, CPUTime, WaitTime, PhysicalReads, LogicalReads, PagesRead, PhysicalWrites, PagesWritten, MemUsageKB, LocksHeld, TableAccesses, IndexAccesses, TempDbObjects, WorkTables, ULCBytesWritten, ULCFlushes, ULCFlushFull, ULCMaxUsage, ULCCurrentUsage, Transactions, Commits, Rollbacks
Diagram callouts:
• If ULCFlushFull is high relative to ULCFlushes, ULC is too small
• Server TPS = sum(Δ(Transactions)) / Δ(sampletime(secs))
• Number of Locks = sum(LocksHeld)
• IO Hogs/Bad Queries
• Total CPU = ΔSampleTime - ΔWaitTime
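The Server TPS formula can be sketched from two samples of monProcessActivity. This is a minimal illustration, assuming each sample has been loaded into a dict keyed by (SPID, KPID) holding the cumulative Transactions counter; the function name and the handling of new or reset processes are assumptions, not documented ASE behaviour.

```python
def server_tps(prev, curr, elapsed_secs):
    """Sum the per-process Transactions deltas and divide by the sample interval.

    prev, curr: dicts mapping (SPID, KPID) -> cumulative Transactions.
    """
    total = 0
    for key, now in curr.items():
        before = prev.get(key, 0)      # process appeared after the first sample
        total += max(now - before, 0)  # ignore counters that went backwards
    return total / elapsed_secs


if __name__ == "__main__":
    first = {(10, 1001): 100, (11, 1002): 50}
    second = {(10, 1001): 400, (11, 1002): 50, (12, 1003): 30}
    print(server_tps(first, second, 60))  # transactions per second over 60 s
```

Keying on both SPID and KPID avoids mixing up two different connections that reuse the same SPID between samples.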
Notes on monProcessActivity
Useful for finding the resource hogs
• CPUTime is not cumulative (12.5.3 at least)
  – Due to a bug that has been fixed in later ESD’s
  – So… how do we find out how much CPU someone used??
    • Answer: ΔSampleTime - ΔWaitTime
    • …because if you ain’t waiting - you are running!!!
• Bad Queries - look at Logical Reads, Physical Reads & TotalCPU
  – TotalCPU as defined by ΔSampleTime - ΔWaitTime
Other Tips
• TableAccesses vs. IndexAccesses can give us an early clue the user is doing a tablescan
• TempDbObjects, WorkTables can give us clues as to if we might be causing issues via tempdb - or if TableAccesses are due to tempdb.
Who’s Hogging the CPU??
Columns: Server UserID, SPID, CPU Time, Derived CPU sec, Derived CPU%, Wait Time, Physical Reads, Logical Reads, Physical Writes, Pages Written, Table Accesses, Index Accesses, TempDb Objects, Work Tables
257 164 8,343 65,439 26.2 183,862 90,419 821,095,296 5,243 9,515 132,785 688,291,033 0 112,097
66 77 4,012 11,517 4.6 237,784 767,953 602,510,612 1,440,406 1,580,812 64,732,650 460,090,480 7,831 195,920
65 76 4,244 11,328 4.5 237,973 790,717 575,826,507 1,399,038 1,542,922 66,695,362 478,952,060 7,778 197,053
67 78 4,083 11,285 4.5 238,016 805,315 549,908,062 1,412,320 1,553,901 61,765,225 489,758,829 7,824 195,199
64 75 4,209 11,257 4.5 238,044 751,751 548,253,740 1,372,337 1,514,749 63,557,279 494,295,337 7,727 196,494
125 47 4,160 10,582 4.2 238,719 699,461 332,946,384 1,336,164 1,480,234 82,882,603 650,580,633 6,112 190,868
124 119 3,278 10,459 4.2 238,842 596,685 312,422,869 1,830,087 1,971,764 72,776,796 658,774,595 6,065 183,968
126 48 3,914 10,299 4.1 239,002 575,491 280,556,865 1,315,408 1,456,991 68,333,817 674,577,317 6,151 186,041
127 49 4,137 10,198 4.1 239,103 515,485 259,691,308 1,352,176 1,493,099 74,062,593 692,243,722 5,954 187,278
5 43 3,613 10,181 4.1 239,120 716,909 232,643,586 1,474,822 1,618,203 72,836,259 704,827,525 5,912 187,232
7 117 3,543 10,150 4.1 239,151 636,461 221,468,548 1,698,326 1,838,178 75,048,927 713,237,194 5,873 185,821
4 73 3,310 10,111 4.1 239,190 696,430 209,371,873 1,657,008 1,798,890 80,341,215 723,170,564 5,909 184,575
6 74 3,552 10,037 4.0 239,264 641,899 183,610,422 1,528,052 1,668,489 67,242,735 730,870,520 5,836 185,427
135 18 5,576 6,257 2.5 243,044 723,346 282,784,331 157,811 182,323 34,853,826 177,129,838 460 176,953
Sample time = 69h:15m:15s (~3 days)…249,301secs
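The Derived CPU columns follow directly from the ΔSampleTime - ΔWaitTime rule. A minimal sketch, assuming both values are in seconds as in the table above (the function name is illustrative):

```python
def derived_cpu(sample_secs, wait_secs):
    """Return (derived CPU seconds, derived CPU percent) for one process."""
    cpu_secs = sample_secs - wait_secs
    return cpu_secs, 100.0 * cpu_secs / sample_secs


if __name__ == "__main__":
    SAMPLE_SECS = 249_301  # sample time in seconds, as stated on the slide
    # (SPID, Wait Time) for the first two rows of the table
    for spid, wait in [(164, 183_862), (77, 237_784)]:
        cpu, pct = derived_cpu(SAMPLE_SECS, wait)
        print(f"SPID {spid}: {cpu:,} CPU sec ({pct:.1f}%)")
```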
3: monSysWaits
monSysWaits (Server Level Waits, Aggregated): WaitEventID smallint <pk,fk>, WaitTime int, Waits int
monProcessWaits (Process Level Waits): SPID smallint <pk,fk1>, KPID int <pk,fk1>, WaitEventID smallint <pk,fk2>, Waits int, WaitTime int
monWaitEventInfo (Static Values for Each ESD): WaitEventID smallint <pk>, WaitClassID smallint <fk>, Description varchar(50)
monWaitClassInfo (Static Values for Each ESD): WaitClassID smallint <pk>, Description varchar(50)
Joins: WaitEventID = WaitEventID (monSysWaits and monProcessWaits to monWaitEventInfo); WaitClassID = WaitClassID (monWaitEventInfo to monWaitClassInfo)
Where’s the Holdup???
monOpenDatabases (“db log contention”): DBID int <pk>, BackupInProgress int, LastBackupFailed int, TransactionLogFull int, AppendLogRequests int, AppendLogWaits int, DBName varchar(30) <pk>, BackupStartTime datetime, SuspendedProcesses int, QuiesceTag varchar(30)
monProcess (“Currently Waiting On”): SPID smallint <pk>, KPID int <pk>, BatchID int, ContextID int, LineNumber int, SecondsConnected int, DBID int, EngineNumber smallint, Priority int, FamilyID smallint, Login varchar(30), Application varchar(30), Command varchar(30), NumChildren int, SecondsWaiting int, WaitEventID smallint, BlockingSPID smallint, BlockingXLOID int, DBName varchar(30), EngineGroupName varchar(30), ExecutionClass varchar(30), MasterTransactionID varchar(255); two columns are marked <fk>
monProcessWaits (“Where I am spending all my time waiting”): SPID smallint <pk,fk2>, KPID int <pk,fk2>, WaitEventID smallint <pk,fk1>, Waits int, WaitTime int
monSysWaits (“Server Cumulative Waits”, aka Context Switches): WaitEventID smallint <pk,fk>, WaitTime int, Waits int
monWaitEventInfo: WaitEventID smallint <pk>, WaitClassID smallint <fk>, Description varchar(50)
monWaitClassInfo: WaitClassID smallint <pk>, Description varchar(50)
Joins: WaitClassID = WaitClassID; WaitEventID = WaitEventID (twice); SPID = SPID, KPID = KPID
Pop Quiz: Wall Street Customer
WaitEventID Waits WaitTime Description
----------- ----------- ----------- --------------------------------------------------
29 1399734780 4929606 wait for buffer read to complete
215 1127548743 27844254 waiting on run queue after sleep
35 670307183 4153181 wait for buffer validation to complete
179 335988317 10035324 waiting while no network read or write is required
250 324655266 167672101 waiting for incoming network data
124 256994610 6105113 wait for someone else to finish reading in mass
209 75463669 340471 waiting for a pipe buffer to read
251 62546271 344880 waiting for network send to complete
41 58470129 3473384 wait to acquire latch
31 32361806 36401 wait for buffer write to complete
214 19911597 1403956 waiting on run queue after yield
150 18083160 23776944 waiting for semaphore
52 11842516 48193 waiting for disk write to complete
51 9703708 27945 waiting for disk write to complete
55 8071811 5609 waiting for disk write to complete
36 4774219 20192 wait for mass to stop changing
272 3886481 134998 waiting for lock on PLC
54 2438135 30805 waiting for disk write to complete
What are you going to do next???
First…A Familiar Picture…
[Diagram: ASE architecture. Operating System; Engine 0, Engine 1 … Engine N, each with Registers, File Descriptors and a Running task; Shared Executable (Program Memory); Shared Memory containing Proc Cache, Hash, Kernel, Run Queues (EC1, EC2, EC3), Sleep Queue (lock sleep, disk I/O, send sleep), Lock Chains, Pending I/Os and Other Memory; DISK and NET I/O paths]
Applying What You Know…
When do SPID’s sleep??
• Pending physical disk I/O, network I/O, waiting on a lock
SPID’s & TimeSlices…
• Each SPID gets a max of 1 timeslice (100ms) by default
• If SPID needs physical IO or lock before then, it is put to sleep early
• A SPID is only woken when the IO completes or lock is available
• When woken, SPID WaitTime is current timeslice - previous timeslice
  – Could be 0 if woken in same timeslice… but still some time was spent
So… if I am only doing logical I/O’s (100% cache hits), what happens when my timeslice expires?
• Get put back on the run queue (runnable state)
When do data page writes happen??
• Checkpoint process, housekeeper GC, wash marker
• Minimally Logged I/O
  – Fast bcp, select/into, writetext (no log), etc.
• DES Scavenging (cache flush)
Pop Quiz Time
We all know that ASE can read more than one page per IO request at a time by doing a large I/O (especially APF)…
…How many pages can ASE write at one time per IO request??
A) 1. ASE does all 2K writes (or nK where n is page size)
B) 1 per data page, but more than one per log page depending on the log I/O size of the database
C) Depends on sp_configure “i/o batch size” - default is 100
D) Any number between 1 and 8 pages
E) However many fit in 32 disk blocks (512 byte blocks)
ASE ProxyDB MDA monProcessWaits
WaitEventID Waits WaitTime Description
214 182433 600 waiting on run queue after yield
55 181921 137000 waiting for disk write to complete
31 178274 200200 wait for buffer write to complete
51 169434 180200 waiting for disk write to complete
171 9847 531700 waiting for CTLIB event to complete
52 6953 5200 waiting for disk write to complete
36 3098 698500 waiting for MASS to finish writing before changing
29 806 8500 wait for buffer read to complete
251 500 0 waiting for network send to complete
54 48 1200 waiting for disk write to complete
150 33 400 waiting for semaphore
272 19 500 waiting for lock on PLC
250 6 400 waiting for incoming network data
259 3 85100 waiting until last chance threshold is cleared
Example from a platform migration test using proxy tables/db
Decoding the WaitEvents - CPU
CPU Waits
• WaitEvents
  – 215: waiting on run queue after sleep
  – 214: waiting on run queue after yield
  – Large numbers of waits → IO problem - adding CPUs will hurt
    • Adding memory will drive 215 → 214 and increase CPU issues as well
  – Large wait time and low waits → adding CPU’s may help
• What do they mean?
  – 215: Most likely excessive physical IO (incl APF) or locking
    • Why?? Why does a process go to sleep??
    • Answer: Network send, physical read/write, lock wait
  – 214: Bad QP’s, table scanning in memory, high contention, etc.
    • Why?? Because we don’t sleep on logical I/O’s…
      – SPID yields the CPU and jumps immediately on the RUNNABLE queue
• Where to go next?
  – monOpenObjectActivity (to find the tables involved)
  – monSysStatement (to find the CPU pigs)
CPU Waits & Disk IO Activity
Columns: Wait Event ID, Wait Class Description, Wait Event Description, Waits, Wait Time Sec
215 waiting to be scheduled waiting on run queue after sleep 337,845 0.80
124 waiting for internal system event wait for mass read to finish when getting page 115,336 0.40
214 waiting to be scheduled waiting on run queue after yield 73,998 0.42
51 waiting for a disk write to complete waiting for last i/o on MASS to complete 71,169 0.47
31 waiting for a disk write to complete waiting for buf write to complete before writing 54,185 0.17
55 waiting for a disk write to complete wait for i/o to finish after writing last log page 46,714 0.25
29 waiting for a disk read to complete waiting for regular buffer read to complete 16,865 0.09
52 waiting for a disk write to complete waiting for i/o on MASS initated by another task 14,220 0.03
171 waiting to output to the network waiting for CTLIB event to complete 7,243 0.16
251 waiting to output to the network waiting for network send to complete 2,487 0.07
250 waiting for input from the network waiting for incoming network data 1,839 42.75
99 waiting for input from the network wait for data from client 223 0.10
36 waiting for memory or a buffer waiting for MASS to finish writing before changing 212 0.00
143 waiting for internal system event pause to synchronise with site manager 80 0.00
54 waiting for a disk write to complete waiting for write of the last log page to complete 49 0.00
CPU Waits & Network IO/Locking
Columns: WaitEventID, Wait Class Description, Wait Event Description, Waits, WaitTimeSec
215 waiting to be scheduled waiting on run queue after sleep 127,372,857 8.63
214 waiting to be scheduled waiting on run queue after yield 31,044,842 0.91
250 waiting for input from the network waiting for incoming network data 28,424,607 925.65
251 waiting to output to the network waiting for network send to complete 28,163,786 7.82
29 waiting for a disk read to complete waiting for regular buffer read to complete 4,502,133 4.94
150 waiting to take a lock waiting for a lock 1,963,195 6.78
35 waiting for memory or a buffer waiting for buffer validation to complete 1,701,818 2.20
272 waiting for internal system event waiting for lock on ULC 1,634,727 3.76
41 waiting for internal system event wait to acquire latch 1,206,897 4.04
283 waiting on another thread Waiting for Log writer to complete 1,172,009 12.80
308 waiting on another thread Waiting for ULC Flusher to queue dirty pages. 998,236 3.74
36 waiting for memory or a buffer waiting for MASS to finish writing before changing 973,005 8.65
307 waiting on another thread Waiting for tasks to queue ALS request. 659,928 3.45
55 waiting for a disk write to complete wait for i/o to finish after writing last log page 489,120 1.91
51 waiting for a disk write to complete waiting for last i/o on MASS to complete 328,118 8.02
52 waiting for a disk write to complete waiting for i/o on MASS initated by another task 309,834 2.10
309 waiting for a disk write to complete Waiting for last started disk write to complete 165,040 0.89
31 waiting for a disk write to complete waiting for buf write to complete before writing 144,209 1.34
124 waiting for internal system event wait for mass read to finish when getting page 102,072 0.11
53 waiting for memory or a buffer waiting for MASS to finish changing to start i/o 51,908 0.00
70 waiting for internal system event waiting for device semaphore 21,711 0.00
37 waiting for memory or a buffer wait for MASS to finish changing before changing 21,569 0.00
178 waiting for input from the network waiting while allocating new client socket 15,339 43.67
171 waiting to output to the network waiting for CTLIB event to complete 7,707 0.00
54 waiting for a disk write to complete waiting for write of the last log page to complete 6,254 0.02
Disk Write Waits…
Event ID Description
50 Write was restarted because previous attempt failed – if you see this check sys error log
51 waiting for last MASS on which i/o was issued
52 waiting for last MASS on which i/o was issued by some other task
53 waiting in writedes for mass to finish changing before writing buffer
54 waiting to write of the last page of the log
55 waiting after write of the last page of the log
In other words, 54 is log contention, 55 is waiting for log physical write to flush,
What’s a MASS???
Memory Address Space Segment
• chunk of contiguous memory containing one or more 2K pages (the quantity being determined by the configured pool size, 2K, 4K, etc.)
  – Analogous to “extents”
• synchronizes access to buffers (data pages in memory) by waiting until no one else is writing the buffer
• With large IO the state of any page in the MASS is taken to be the state of the MASS itself. This means, for example, if you use 16K IO then access is synchronized across all 8 2K pages - if one is being written to then all are considered to be written to.
  – Large IO writes: tempdb select/into, bcp, array inserts, etc. User queries will not reflect large I/O
…btw… this is documented (somewhat)
• MASS is defined in Glossary
So… what to do??
• One likely cause is cache partitioning (especially if none)
MASS Operations
MASS Writes
• MASS (or some portion of it) is being written to disk
  – Bulk I/O (select into, bcp)
  – Wash marker, checkpoint, housekeeper GC
  – Writedes (object being flushed from cache)
MASS Changes
• Pages being replaced with new data pages
  – Normal LRU/MRU
• Pages being updated by logical writes by SPID
Memory/Buffer Waits
30 wait to write MASS while MASS is changing
   (MASS write to disk is delayed because someone is updating a page in the MASS)
33 waiting for buffer read to complete
34 waiting for buffer write to complete
35 waiting for buffer validation to complete
36 waiting for MASS to finish writing before changing
   (SPID wants to change MASS, but has to wait because it is still being flushed to disk)
37 wait for MASS to finish changing before changing
   (SPID wants to change MASS, but has to wait because someone else is changing it)
38 wait for mass to be validated or finish changing
53 waiting for MASS to finish changing to start i/o
   (MASS write to disk is delayed because someone is updating a page in the MASS)
Diagram labels: Logical read, Logical write
Decoding the WaitEvents - Writes
Regular Disk Write Waits
• Wait Events
  – 51 - A lot of I/O likely coming from same process
    • Or a page split caused a synchronous IO
  – 36, 37, 52 - Last page contention with other users
    • Including space (memory) allocation contention in tempdb
• Where to go next
  – 51 - monOpenObjectActivity
  – 52 - monOpenObjectActivity paying close attention to DBID=2
Log Disk Write Waits
• Wait Events
  – 54 - Contention on last log page
  – 55 - Waiting for last log page to flush to disk
• Where to go next
  – 54 - monProcessActivity and look at ULC
    • also monOpenDatabases and look for which log it was
  – 55 - monDeviceIO/monIOQueue and look at IO times for the device
    • Also monProcessActivity to look at ULC to reduce number of log writes
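The "where to go next" guidance for the CPU and disk write wait events can be captured as a simple lookup. This is only a sketch of that idea: the dictionary restates the advice from these slides, while the function name `where_next` and the input shape (pairs of WaitEventID and Waits) are assumptions made for illustration.

```python
# Next MDA tables to check per WaitEventID, as suggested on the slides.
NEXT_STEPS = {
    214: ["monOpenObjectActivity", "monSysStatement"],
    215: ["monOpenObjectActivity", "monSysStatement"],
    51: ["monOpenObjectActivity"],
    52: ["monOpenObjectActivity (DBID=2)"],
    54: ["monProcessActivity (ULC)", "monOpenDatabases"],
    55: ["monDeviceIO/monIOQueue", "monProcessActivity (ULC)"],
}


def where_next(wait_rows, top_n=3):
    """Return hints for the busiest wait events that have guidance.

    wait_rows: iterable of (WaitEventID, Waits) pairs.
    """
    ranked = sorted(wait_rows, key=lambda row: row[1], reverse=True)
    hints = []
    for event_id, waits in ranked:
        if event_id in NEXT_STEPS:
            hints.append((event_id, waits, NEXT_STEPS[event_id]))
        if len(hints) == top_n:
            break
    return hints


if __name__ == "__main__":
    # Top rows of the ProxyDB monProcessWaits example
    sample = [(214, 182433), (55, 181921), (31, 178274), (51, 169434), (171, 9847)]
    for event_id, waits, tables in where_next(sample):
        print(event_id, waits, "->", ", ".join(tables))
```

Ranking here is by number of waits; ranking by WaitTime instead is an equally reasonable choice and may surface different events.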
Common Wait Events: Client S/W
Client Related S/W Issues
• waiting for CTLIB event to complete
  – non-data related: i.e. waiting for TDS tokens such as ACK for packets sent, or waiting on next command to be sent (i.e. gap between ct_command() and ct_send())
  – if CIS is involved, it is waiting on ct_fetch()/result set materialization at remote server… if remote server, it is waiting for network access to send the data
  – Next move is to look at the client code
  – RepAgent will show this a lot due to deferred async network calls & ASE scheduler
• waiting for network send to complete
  – This is data stream related: outbound commands (RPC’s, RepAgent, etc.) will be ‘waiting for CTLIB event to complete’ due to waiting for ct_sendpassthru(), etc. to execute
  – Next table to check out is monProcessNetIO: probably going to be a change to fetch block size in program and/or packet size
• waiting for incoming network data
  – Equivalent to ‘awaiting command’: nothing expected, …or…
  – Big gap could point to network handling of language cmds time (try ct_dynamic) or BLOB processing
Common Wait Events (Config)
“waiting while no network read or write is required”
• Netserver checked and no network read/write pending
• Server level: shouldn’t see this in monProcessWaits
• Check "i/o polling process count"
  – If CPU & IO bound: reduce "i/o polling process count"
  – For 12.5.3+: look at the following in monEngine: DiskIOChecks, DiskIOPolled, DiskIOCompleted
monSysWaits vs. monProcessWaits
• The above and others (many of the MASS waits) will only be apparent in monSysWaits vs. monProcessWaits
  – For example, a checkpoint or HK initiated write may be delayed due to a user process currently updating memory (i.e. Event 31)
• So - when viewing monSysWaits…
  – Some system events can mostly be ignored (i.e. the one above)
  – Others need to be viewed from the user aspect
    • i.e. rather than the HK being held up, the question is whether the users are doing too much physical IO and thereby tripping the HK or Checkpoint trigger
Real Customer #1 (Slow Writes)
WaitEventID Description Waits WaitTimeSec delta_hms
215 waiting on run queue after sleep 151709 0.27 00:26:26
250 waiting for incoming network data 52137 155.38 00:26:26
214 waiting on run queue after yield 17435 0.02 00:26:26
55 wait for i/o to finish after writing last log page 15553 0.14 00:26:26
31 waiting for buf write to complete before writing 12436 0.1 00:26:26
51 waiting for last i/o on MASS to complete 10536 0.17 00:26:26
251 waiting for network send to complete 2577 0.03 00:26:26
29 waiting for regular buffer read to complete 2257 0.01 00:26:26
124 wait for mass read to finish when getting page 2226 0.03 00:26:26
178 waiting while allocating new client socket 1658 1.53 00:26:26
Event 31 is an example of a potential HK/Checkpoint/Washmarker IO delayed by a user modifying a page, but the viewpoint should be why we are doing so many physical IOs, especially with the 51’s, 29’s and 124’s.
• One possibility is slow disks, especially with 55: time to look at monDeviceIO/monIOQueue
• Maybe also monOpenObjectActivity, as 214 indicates table scans
• Maybe a cartesian product in tempdb is causing both, but it is likely that APF’s are driving the reads and the 214’s
• Possibly writes are driven by a batch process scanning one table and inserting into another, but the network send #’s suggest reads are going to the client (a lot of rows), and writes are likely a different issue as they are 5 times higher
Real Customer (Bad Queries?)
Cause #2 is network send
Physical I/O delayed - look at APF’s and Phys Reads
2: monOpenObjectActivity
monOpenObjectActivity
• IndexID is key
• OptSelectCount
  – Number of times the optimizer picked the index during optimization
  – Picked ≠ Used (joins may change plan)
• UsedCount
  – Number of times the optimizer actually used the index for a query
Spotting TableScans
• IndexID=0 and UsedCount >0
Unused Indexes
• Any index with UsedCount=0
• It will have RowsInserted, etc. when DML operations affect the index values - the key is that it was never used for a query.
• Monitor over a period of time
  – the last thing you want to have happen is the weekend DBA call you because a report didn’t finish (because you dropped the index it used).
monOpenObjectActivity columns: DBID int <pk,fk>, ObjectID int <pk>, IndexID int <pk>, DBName varchar(30) <pk,fk>, ObjectName varchar(30) <pk>, LogicalReads int, PhysicalReads int, APFReads int, PagesRead int, PhysicalWrites int, PagesWritten int, RowsInserted int, RowsDeleted int, RowsUpdated int, Operations int, LockRequests int, LockWaits int, OptSelectCount int, LastOptSelectDate datetime, UsedCount int, LastUsedDate datetime
46
Top 7 monOpenObject Queries
Table Scans
• IndexID=0 and UsedCount > 0
Indexing Efficiency
• Order by LogicalReads desc, PagesRead desc
Caching/Table Scan Issues
• Order by PhysicalReads + APFReads desc
Hot OLTP Tables
• Order by PhysicalWrites desc
• Check for volatile indexes (RowsUpdated > 0 and IndexID > 0)
Tempdb Usage
• DBID=2 (plus all other user tempdb’s)
Contention
• Order by LockWaits desc… sum(UsedCount) by ObjectID
Unused Indexes
• IndexID!=0 and UsedCount=0
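A few of the filters above can be sketched as queries. This is only a sketch: it assumes MDA monitoring is enabled, that the login has mon_role, and that the tables are read through master. The column names are the ones listed for monOpenObjectActivity earlier; the choice of output columns is illustrative.

```sql
-- Table scans: the data layer (IndexID = 0) was used to satisfy queries.
select DBName, ObjectName, UsedCount, LogicalReads, PhysicalReads, APFReads
from master..monOpenObjectActivity
where IndexID = 0
  and UsedCount > 0
order by LogicalReads desc

-- Caching / table scan issues: objects driving the most physical I/O.
select DBName, ObjectName, IndexID,
       PhysicalReads + APFReads as TotalPhysReads
from master..monOpenObjectActivity
order by PhysicalReads + APFReads desc

-- Candidate unused indexes: never used by a query during the sample.
select DBName, ObjectName, IndexID, RowsInserted, RowsDeleted, RowsUpdated
from master..monOpenObjectActivity
where IndexID != 0
  and UsedCount = 0
order by DBName, ObjectName, IndexID
go
```

The counters are cumulative, so the unused-index result only means something after a sample period long enough to include weekly or monthly jobs.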
47
Customer: System is Slow
ObjectName IndexID LogicalReads PhysicalReads APFReads PagesRead PhysicalWrites PagesWritten RowsInserted RowsDeleted RowsUpdated UsedCount
actv_eq 0 159,864,863 458,260 2,665,131 458,379 70,230 211,714 434 724 117,620 247
outb_trn 0 75,981,221 0 195,792 0 699 5,389 129 158 1,024 1,436
wrk_limits_rte 0 51,766,117 144 56,938 200 56 371 64 1 0 83
indus_wo_ln_h 0 38,183,009 2,649 2,942,396 2,768 1,217 8,854 1,730 0 0 227
indus_wo 0 23,691,223 3 12,789,387 3 402 1,914 81 89 495 3,249
indus_wo_ln 0 10,746,455 102 5,774,132 102 3,846 28,052 1,907 2,089 6,960 13,695
ns_oper_stn 0 9,262,399 0 87,346 0 0 0 0 0 0 734
yard_wo_ln_h 0 7,134,441 732 3,115,017 2,706 244 986 448 0 0 254
plnned_trn_events 0 5,362,519 315 2,104,867 378 35 252 217 168 0 200
rlse_sw_request 0 5,033,717 13 3,016,215 13 2,785 19,697 1,246 1,772 945 3,166
inbnd_trn_exception 0 4,056,119 0 2,853,452 0 1,229 5,450 2,466 1,746 16,286 11,587
event_type 0 3,039,052 4 0 4 0 0 0 0 0 12,564
outb_valid_trn 0 2,645,196 6 694,854 13 2,042 15,699 1,990 2,254 1,046 186
eq_asgn_pln 0 1,671,989 223 1,050,155 244 1,449 9,527 2,422 2,326 10 1,080
team_tk_cust 0 1,364,266 0 0 0 0 0 0 0 0 8,527
plnned_inbnd_cnsst 0 1,360,123 14 195,755 14 3,156 21,685 23,042 22,753 3,495 137
tran_loc 0 1,256,342 661 3,091 661 0 0 0 0 0 11
wb_final_class_cd 0 1,228,184 4,301 411,012 4,399 5,341 28,525 5,653 4,401 0 126
sec_grp_exception 0 1,174,012 0 0 0 0 0 0 0 0 293
Sample time 2h:25m
48
Did You Notice???
ObjectName IndexID LogicalReads PhysicalReads APFReads PagesRead PhysicalWrites PagesWritten PagesPerWrite
actv_eq 0 159,864,863 458,260 2,665,131 458,379 70,230 211,714 3.0
outb_trn 0 75,981,221 0 195,792 0 699 5,389 7.7
wrk_limits_rte 0 51,766,117 144 56,938 200 56 371 6.6
indus_wo_ln_h 0 38,183,009 2,649 2,942,396 2,768 1,217 8,854 7.3
indus_wo 0 23,691,223 3 12,789,387 3 402 1,914 4.8
indus_wo_ln 0 10,746,455 102 5,774,132 102 3,846 28,052 7.3
ns_oper_stn 0 9,262,399 0 87,346 0 0 0 0
yard_wo_ln_h 0 7,134,441 732 3,115,017 2,706 244 986 4.0
plnned_trn_events 0 5,362,519 315 2,104,867 378 35 252 7.2
rlse_sw_request 0 5,033,717 13 3,016,215 13 2,785 19,697 7.1
inbnd_trn_exception 0 4,056,119 0 2,853,452 0 1,229 5,450 4.4
event_type 0 3,039,052 4 0 4 0 0 0
outb_valid_trn 0 2,645,196 6 694,854 13 2,042 15,699 7.7
eq_asgn_pln 0 1,671,989 223 1,050,155 244 1,449 9,527 6.6
team_tk_cust 0 1,364,266 0 0 0 0 0 0
plnned_inbnd_cnsst 0 1,360,123 14 195,755 14 3,156 21,685 6.9
tran_loc 0 1,256,342 661 3,091 661 0 0 0
wb_final_class_cd 0 1,228,184 4,301 411,012 4,399 5,341 28,525 5.3
sec_grp_exception 0 1,174,012 0 0 0 0 0 0
Proof that ASE issues writes larger than 2K (MASS writes) even on a 2K page server
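The Pages Per Write column is simply PagesWritten divided by PhysicalWrites; for actv_eq that is 211,714 / 70,230 ≈ 3.0. A minimal sketch of how it could be derived, assuming the table is read through master and using str() only to round to one decimal place:

```sql
-- Average number of pages flushed per physical write, per object/index.
-- Rows with no physical writes are skipped to avoid dividing by zero.
select ObjectName, IndexID, PhysicalWrites, PagesWritten,
       str(1.0 * PagesWritten / PhysicalWrites, 8, 1) as PagesPerWrite
from master..monOpenObjectActivity
where PhysicalWrites > 0
order by LogicalReads desc
go
```

Any value above 1.0 means a single write covered more than one page.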
49
Answer: This is a replicate database (see rs_lastcommit). It turns out to be a WS (warm standby) with no active users, yet the updates and deletes on ScheduleWorkTable are causing table scans, which slow down replication and cause latency in the DSI. The average is 1 update/delete on ScheduleWorkTable and ~1 update on AcquirerPaymentDetail per transaction group from the DSI. The XmlText scan may have been due to dbcc or reorg (only 1 scan in 10,000 operations).
ObjectName IndexID LogicalReads PhysicalReads APFReads PagesRead PhysicalWrites PagesWritten RowsInserted RowsDeleted RowsUpdated UsedCount
XmlText 0 9,980,995 223,881 9,964,364 1,374,975 1,114 1,226 4,877 0 5,249 1
ScheduleWorkTable 0 2,032,959 19 2,042,808 19 134 134 11,515 3,579 19,708 23,287
AcquirerPaymentDetail 0 107,813 0 41,792 0 939 6,182 2,205 8,578 23,614 1
rs_lastcommit 0 26,611 0 165 0 302 848 0 0 26,287 15
Heartbeat 0 5,787 0 1,132 0 69 69 238 238 0 379
AcquirerPayment 0 1,437 0 319 0 80 528 26 99 799 1
MonitorStatsUntil 0 68 0 68 0 33 33 0 0 34 34
BranchSpecificExtSequence 0 23 0 0 0 14 14 0 0 23 23
LastAcquirerPaymentId 0 21 0 0 0 21 21 0 0 21 21
ReportQueueSenderData 0 10 1 10 1 0 0 0 0 0 5
SeqBankOutTxId 0 9 0 0 0 9 9 0 0 9 9
SeqBankAccountId 0 1 0 0 0 1 1 0 0 1 1
Where Are The Problems??
Sample period was 1h:17m:34s
50
Are All Table Scans Equal??
ObjectName IndexID LogicalReads PhysicalReads APFReads Operations LockRequests UsedCount
RT_DFLT_DBF 0 275,500,414 6,524 47,553,199 302,811 0 71,701
ACCOUNT#NOV2006#ACC_DT1_DBF 0 111,774,616 132,981 23,929,468 33,738 143,776 880
TRN_ACA1_DBF 0 70,990,284 2,282 28,986,450 133,556 0 32,854
RT_LNDXG_DBF 0 42,692,422 720 8,227,783 188,884 0 46,959
RXM_CDFL_DBF 0 25,542,614 437 20,305,395 2,936 130,781,737 26,761
RXM_IRSN_DBF 0 18,379,485 1,923 8,718,369 7,216 310,900 36,576
ACC_RVDT_DBF 0 13,726,168 1,221 4,534,354 33,876 133 7,920
TRN_HDR_DBF 0 7,601,933 112,601 327,980 548,933 5,477 2,976
TRN_CPDF_DBF 0 7,409,830 3,334 1,877 427,752 0 100
TABLE#LIST#LEGALENT_DBF 0 7,191,939 3 7,194,301 7,184,173 279 1,796,111
TABLE#LIST#GL_ACCT_DBF 0 6,237,658 187 2,238,602 904 0 21,734
RTRN_HDR_DBF 0 4,872,419 59,713 470,002 57,357 1,615 14
TRN_PFLD_DBF 0 2,653,352 18 1,806,234 866,600 0 39,258
Sample time is just less than 15 hours (14h:59m)
With a 1:4 ratio of scans to logical reads, it is likely that TABLE#LIST#LEGALENT_DBF only has 4 pages - and as with all tables < 10 pages (or concurrency_opt_threshold), the optimizer is likely just going to table scan. However, the 7M APF reads suggest that this table keeps getting flushed from cache and has to be continuously re-read using physical I/Os (APF or not, they are still physical reads that aren’t necessary!)
51
What Is Happening Here?
ObjectName IndexID LogicalReads PhysicalReads APFReads PagesRead PhysicalWrites PagesWritten RowsInserted RowsDeleted RowsUpdated Operations UsedCount
CCPayment 0 10,978,013 77 0 77 3,785 3,785 11,786 0 11,343 172,094 0
CCPayment 2 134,413 0 0 0 923 923 11,760 0 0 0 11,343
CCPayment 3 79,136 14,913 0 14,913 11,572 11,572 11,761 0 0 0 0
Everything is okay because we are using the index, right??
NO!!! We are using a non-unique index….how can we tell???
• Look at rows inserted: the table has 26 more rows inserted during the same period than the index does. If the index was unique, it would have one insert matching every table insert.
• The result is that we are using the index to position ourselves within the table and then scanning to find the rows to be updated, quite possibly using the index to position to the first occurrence and then scanning to the end of the table.
Soo….why index 2 instead of 3???
• Index 2 is smaller - 11,760 rows fit on 923 pages (~12/pg) whereas index 3 is at nearly 1 row/page - consequently index 3 would cost more.
• In fact, at 1 row/pg for the index and ~3 rows/pg for the table, it suggests that index 3 has max_rows_per_page=1
If the 11,343 updates are using the index, what are the inserts doing??
• This is a heap table….they are going to the end
Can you prove that minimal columns would help in this situation (if considering parallel DSI’s)?
• Yes - none of the columns updated affected the indices, so they would be “safe” if min cols were used….in fact, it may be on already as we have 0 rows updated on the indexes…..
Sample period was 1h:17m:34s
52
Why Don’t DML Rows Match?
ObjectName IndexID LogicalReads PhysicalWrites RowsInserted RowsDeleted RowsUpdated Operations LockRequests UsedCount
AccountTx 0 3,161,146 11,577 244,063 9,642 184,874 3,671,042 1,050,502 0
AccountTx 2 5,693,262 107,943 428,937 194,516 0 0 0 16
AccountTx 3 6,647,313 14,197 428,937 194,516 0 0 0 194,500
Journal 0 2,986,193 9,428 69,452 2,408 161,666 2,033,919 535,438 0
Journal 2 2,764,274 1,871 58,828 2,408 0 0 0 164,276
Journal 5 2,294,767 4,105 139,785 83,365 0 0 0 0
Journal 6 1,002,950 2,341 58,845 2,425 0 0 0 0
Journal 7 2,512,037 24,324 139,705 83,285 0 0 0 0
Journal 8 830,656 26,049 58,828 2,408 0 0 0 0
Journal 9 1,817,354 3,609 139,780 83,360 0 0 0 0
The actual inserts/updates/deletes to the table are represented by IndexID 0
• UsedCount = Updates + Deletes; Heap Inserts don’t ‘Use’ an Index
When an index key is modified by an update to the table…
• The new key value will logically appear elsewhere in the index tree
• This is accomplished by deleting the index row and re-inserting it within the tree
• A good indication of index volatility and which ones need update stats more often
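The delete-and-reinsert behaviour can be turned into a rough volatility ranking. In the sample above, AccountTx index 2 shows 194,516 index rows deleted against 9,642 table rows deleted; the difference, 184,874, matches the table’s RowsUpdated. The sketch below applies that same subtraction to every index. It assumes the counters for the table row (IndexID 0) and its index rows cover the same sample period; the alias names are illustrative.

```sql
-- Index rows deleted beyond the table's own deletes are, roughly,
-- key values moved by updates (each move = index delete + re-insert).
select i.ObjectName, i.IndexID,
       i.RowsInserted, i.RowsDeleted,
       i.RowsDeleted - t.RowsDeleted as ApproxKeyUpdates
from master..monOpenObjectActivity i
join master..monOpenObjectActivity t
  on  t.DBID = i.DBID
  and t.ObjectID = i.ObjectID
where t.IndexID = 0
  and i.IndexID > 0
order by i.RowsDeleted - t.RowsDeleted desc
go
```

Indexes at the top of this list are the candidates for more frequent update statistics.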
53
How’s Our Indexing???
ObjectName IndexID LogicalReads PhysicalReads APFReads PagesRead PhysicalWrites PagesWritten LockRequests LockWaits OptSelectCount UsedCount
actv_eq 0 159,864,863 458,260 2,665,131 458,379 70,230 211,714 1,189,619 385 247 247
wrk_limits_rte 6 96,226,458 683 7,721 683 64 64 0 0 31,595 31,618
outb_trn 0 75,981,221 0 195,792 0 699 5,389 15,397 71 1,436 1,436
outb_trn 6 61,864,256 1 0 1 294 294 0 0 13,382 14,282
wrk_limits_rte 0 51,766,117 144 56,938 200 56 371 589 0 83 83
indus_wo_ln_h 0 38,183,009 2,649 2,942,396 2,768 1,217 8,854 5,082 2 227 227
indus_wo 0 23,691,223 3 12,789,387 3 402 1,914 52,284 104 3,248 3,249
ns_oper_stn 7 14,616,981 199 86 199 0 0 0 0 22,494 27,392
indus_wo_ln 0 10,746,455 102 5,774,132 102 3,846 28,052 253,108 3 13,695 13,695
ns_oper_stn 5 9,705,081 0 0 0 0 0 0 0 385,142 681,196
eq_event_h 0 9,380,939 217,996 0 1,632,402 66,155 312,450 1,085,535 0 0 0
ns_oper_stn 0 9,262,399 0 87,346 0 0 0 4,216 0 734 734
yard_wo_ln_h 0 7,134,441 732 3,115,017 2,706 244 986 1,201 7 254 254
sakey_mq_prcs 2 6,243,771 545 0 832 2,499 8,239 0 0 35,253 35,253
eq_event_h 5 5,468,761 42,929 143,340 47,738 72,844 275,865 0 0 76,806 808,922
actv_eq 4 3,349,634 44,180 25,750 44,180 916 930 0 0 273,840 291,616
eq_mv_waybill 4 3,303,169 304 2,034 304 1,563 1,577 0 0 199,975 367,689
trn_schd 3 2,444,418 290 14,819 899 211 561 0 0 10,637 10,637
trn_schd 0 2,436,658 569 0 877 119 539 4,234 0 0 0
Sample time 2h:25m = 145min
55
1: monSysStatement
This is TRICKY
• We have to find where a proc begins
  – Should have a LineNumber=0 but may not
• We have to find where a proc ends
  – ProcID differs and Context stays the same..
  – Sorta….
But we want metrics…Sooo….
• We actually need to loop from the beginning
• …and track the nesting….
• So we need to select into #temp with an identity column to keep the order
• I really think the fact that ContextID doesn’t decrement when popping out of a context is a bug!!!
• Either way, this is ugly - if you want to do this - see me for a proc that does it for you.
monSysStatement columns:
SPID               smallint
KPID               int
DBID               int
ProcedureID        int
PlanID             int
BatchID            int
ContextID          int
LineNumber         int
CpuTime            int
WaitTime           int
MemUsageKB         int
PhysicalReads      int
LogicalReads       int
PagesModified      int
PacketsSent        int
PacketsReceived    int
NetworkPacketSize  int
PlansAltered       int
RowsAffected       int
ErrorStatus        int
StartTime          datetime
EndTime            datetime
Key annotations from the diagram, in extracted order: <pk,fk1,fk2,fk3> <pk,fk1,fk2,fk3> <fk2> <pk,fk2,fk3> <pk,fk2> <pk>
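The “select into #temp with an identity column” step might look like the sketch below. The temp table name, the seq column and the SPID value 42 are hypothetical; the identity column simply numbers the rows in the order they are returned, which the approach above relies on being execution order.

```sql
-- Snapshot the statement history for one session, numbering rows as they arrive.
select seq = identity(10), *
into #stmt_history
from master..monSysStatement
where SPID = 42        -- hypothetical SPID of interest
go

-- Walk the snapshot in captured order to follow batch/context/line transitions.
select seq, BatchID, ContextID, LineNumber,
       object_name(ProcedureID, DBID) as ProcName,
       CpuTime, WaitTime, LogicalReads
from #stmt_history
order by seq
go
```

Working from a snapshot also means the nesting logic can loop over the rows as many times as it needs without re-reading the monitoring table.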
56
Before We Begin….
ASE Execution Path:
• Batch - this is the SQL Text Glob sent by the user
• Context - this is incremented for each proc, trigger, exec()
• Line # - this is the statement within the context
  – For monitoring purposes, the statement needs to invoke IO or CPU
  – It can repeat or skip (think while loops, if/else, etc.)
  – Line 0 is a sub-proc call/context change
Example (Batch 0 runs to the first go, Batch 1 follows it):
Insert into table1
Insert into table2
Exec procA
• update table1
• exec procB
• insert into table3
Update table2
Go
Insert into table3
go
Diagram labels, in extracted order:
Batch 0; Context 0; Line 1
Batch 0; Context 0; Line 2
Batch 0; Context 1; Line 1
Batch 0; Context 2; Line 0
Batch 0; Context 1; Line 0
Batch 0; Context 2; Line 1
Batch 0; Context 1; Line 2
Batch 1; Context 0; Line 1
57
MDA Tables Affected
monProcess (current statement being executed):
SPID smallint, KPID int, BatchID int, ContextID int, LineNumber int, SecondsConnected int, DBID int, EngineNumber smallint, Priority int, FamilyID smallint, Login varchar(30), Application varchar(30), Command varchar(30), NumChildren int, SecondsWaiting int, WaitEventID smallint, BlockingSPID smallint, BlockingXLOID int, DBName varchar(30), EngineGroupName varchar(30), ExecutionClass varchar(30), MasterTransactionID varchar(255)
Key annotations from the diagram, in extracted order: <pk> <pk> <fk> <fk>

monProcessStatement (current statement being executed):
SPID smallint, KPID int, DBID int, ProcedureID int, PlanID int, BatchID int, ContextID int, LineNumber int, CpuTime int, WaitTime int, MemUsageKB int, PhysicalReads int, LogicalReads int, PagesModified int, PacketsSent int, PacketsReceived int, NetworkPacketSize int, PlansAltered int, RowsAffected int, StartTime datetime
Key annotations from the diagram, in extracted order: <pk,fk1,fk2> <pk,fk1,fk2> <pk> <pk,fk2> <pk> <pk,fk2>

monSysStatement (previously executed statements):
SPID smallint, KPID int, DBID int, ProcedureID int, PlanID int, BatchID int, ContextID int, LineNumber int, CpuTime int, WaitTime int, MemUsageKB int, PhysicalReads int, LogicalReads int, PagesModified int, PacketsSent int, PacketsReceived int, NetworkPacketSize int, PlansAltered int, RowsAffected int, ErrorStatus int, StartTime datetime, EndTime datetime
Key annotations from the diagram, in extracted order: <pk,fk1,fk2,fk3> <pk,fk1,fk2,fk3> <fk2> <pk,fk2,fk3> <pk,fk2> <pk>
58
monSys??? vs monProcess???
• monProcess??? is *current* values for *current statement*
  – When statement finishes, metrics are aggregated/flushed to monSys???
• monProcessObject → monOpenObjectActivity
• monProcessProcedures (dropped)
• monProcessStatement → monSysStatement
• monProcessSQLText → monSysSQLText (after first statement)
59
That Pesky Line #
For SQL Batches - Always begins at 1
For Procs/Triggers
• LineNumber=0 means ASE isn’t executing any line - it is searching through the proc/trigger code for the next executable line. Common sightings:
  – At the beginning of a proc with a lot of comments/declare statements
  – In the middle of procs due to if/else blocks - when skipping the ‘else’ for example
  – Ditto - but when the while condition fails…
  – Also seen at the end when proc lacks a ‘return’
• Call to the proc increments ContextID
  – Calling proc is recorded after exec completes
    • BatchID=BatchID, ContextID stays at proc call, ProcedureID ≠ ProcedureID and LineNumber is 1 higher than statement before calling routine
    • If proc was first statement in batch, BatchID & ContextID are equal but LineNumber=1 (i.e. Proc call is 3,1,0 Proc end is 3,1,1)
• Then the line #’s match the SQL text line numbers
  – i.e. If you had blank lines in the script before the create proc…
• Things that don’t get captured
  – Declare @vars
  – Begin/end statements
Don’t use ORDER BY
• Statements are recorded in execution order in monSysStatement
60
The Test DDL
use demo_db
go

create table stmt_test (
    row_id bigint identity not null,
    comment varchar(40) null,
    ins_date datetime,
    primary key (row_id)
)
go

1.
2.
3.  create procedure proc_2
4.  @iteration int
5.  as begin
6.      insert into stmt_test (comment, ins_date)
7.      select 'proc_2: iteration #'+convert(varchar(5),@iteration), getdate()
8.      waitfor delay '00:00:01'
9.      return 0
10. end
11. go

1.
2.
3.  create procedure proc_1
4.  @num_times int
5.  as begin
6.      declare @n int
7.      insert into stmt_test (comment, ins_date) values ('proc_1: before calling proc_2',getdate())
8.      select @n=1
9.      while @n <= @num_times
10.     begin
11.         exec proc_2 @n
12.         insert into stmt_test (comment, ins_date) values ('proc_1: looping on proc_2',getdate())
13.         select @n=@n+1
14.     end
15.     insert into stmt_test (comment, ins_date) values ('proc_1: after calling proc_2',getdate())
16.
17.     return 0
18. end
19. go
61
Test Execution
1.
2.  /*
3.  ** Sybase ASE Transact-SQL script
4.  **
5.  ** Jeff Tallman/Sybase Enterprise Solutions
6.  ** tallman@sybase.com
7.  **
8.  */
9.
10. use demo_db
11. go
12. select @@spid
13. go
14. insert into stmt_test (comment, ins_date) values ('batch #1, before proc, insert #1', getdate())
15. insert into stmt_test (comment, ins_date) values ('batch #1, before proc, insert #2', getdate())
16. exec proc_1 10
17. insert into stmt_test (comment, ins_date) values ('batch #1, after proc, insert #1', getdate())
18. go
19. insert into stmt_test (comment, ins_date) values ('batch #2, before proc, insert #1', getdate())
20. insert into stmt_test (comment, ins_date) values ('batch #2, before proc, insert #2', getdate())
21. exec proc_1 5
22. insert into stmt_test (comment, ins_date) values ('batch #2, after proc, insert #1', getdate())
23. go
62
Reading monSysStatement
[Screenshot: monSysStatement output for the test run. Callouts on the rows include: Use demo_db; Select @@spid; first inserts; Call to proc_1; Call #1 to proc_2 (Insert, Waitfor, Return 0); Exit call to proc_2; While loop repeat (Insert in loop, Select @n=@n+1); Proc_2 iteration #1, #2, #3; Exit from proc_1; Insert after proc_1; Second batch begins]
Note: We did not use the ORDER BY
Note: see how the context ID doesn’t decrement until outer proc exits???
63
Help Is On Its Way
ASE 15.0.x??? / CR# 345353-1
• Two tables - Aggregated to Proc; and Per Execution
• Call today and add your voice to get this CR prioritized
Aggregate data
• Average and total execution time
• Average and total CPU usage
• Number of executions
• Most recent execution date and time
• Average and total logical and physical reads and writes
Per-execution data
• Total execution time
• CPU time
• Start and end date-time
• Number of logical and physical reads and writes
64
But…Until Then….
ProcedureName NumExecs ElapsedAvg ElapsedMax CpuTimeAvg CpuTimeMax WaitTimeAvg WaitTimeMax LogicalReadsAvg LogicalReadsMax RowsAffectedAvg
s_ero_parts_tbo_rs 89 1,759 25,570 1,481 20,315 215 5,100 5,989 20,992 21
s_path_summary 3 3,605 7,613 1,388 2,262 2,733 6,900 7,048 8,652 595
s_ero_parts_ordered_rs 24 1,136 4,790 503 1,234 629 4,786 73,708 108,474 705
sp_sysmon 4 77,786 308,230 272 769 359 936 37,058 111,823 8,047
s_ero_retail_vehinfo 30 180 1,083 107 764 71 1,083 2,720 4,760 6
sp_allblkinfo 1 16,594,500 16,594,500 707 707 622,492 622,492 17,848 17,848 51
sp_rs_part_apron_ps 1 355,550 355,550 554 554 7,830 7,830 4,847 4,847 147
s_ds_scan_queue 230 295 3,280 76 375 216 2,900 6,577 9,046 21
s_hist_get_repair_order 13 832 2,076 119 367 707 1,782 13,421 39,140 2,100
s_ero_tech_rs 63 223 6,066 64 251 158 6,000 7,879 30,323 28
sp_appr 95 533 4,390 62 236 419 4,200 5,700 9,965 136
tr_note_lines_iu 6 76 226 75 226 0 0 11,771 35,189 6
s_ero_options_needed_rs 1 0 0 223 223 100 100 2,379 2,379 29
s_get_esp_info 15 58 180 57 176 0 0 935 1,132 5
s_im_ds 8 152 163 151 162 0 0 3,171 4,166 104
s_ero_afs_base 7 231 410 117 158 111 283 1,720 2,648 31
Someone still has faith that sp_sysmon is really going to help them with this problem….wait until the next page and you decide.
65
Line by Line Detail….
ProcedureName LineNum NumExecs ElapsedAvg ElapsedMax CpuTimeAvg CpuTimeMax WaitTimeAvg WaitTimeMax LogicalReadsAvg LogicalReadsMax RowsAffectedAvg
s_ero_parts_tbo_rs 97 51 2,733 24,063 2,531 20,263 201 3,800 6,427 10,157 5
s_path_summary 159 3 1,868 2,640 1,135 1,540 733 1,100 6,133 6,201 577
s_ero_parts_ordered_rs 606 18 582 1,513 543 1,210 38 400 83,962 84,172 120
s_ero_retail_vehinfo 72 24 154 696 113 663 41 500 1,977 3,461 0
s_path_summary 139 1 4,746 4,746 646 646 4,100 4,100 2,098 2,098 1
sp_sysmon 335 1 560 560 560 560 0 0 90,110 90,110 15,974
s_ds_scan_queue 59 185 79 880 63 323 16 600 5,621 5,662 5
s_ero_options_needed_rs 102 1 323 323 223 223 100 100 2,301 2,301 14
tr_note_lines_iu 151 2 223 223 223 223 0 0 35,126 35,126 0
s_get_esp_info 92 14 57 176 57 176 0 0 460 492 1
sp_sysmon 224 1 163 163 163 163 0 0 36,353 36,353 16,088
sp_appr 494 74 259 3,260 23 160 235 3,100 1,104 1,153 1
s_im_ds 179 6 154 156 154 156 0 0 2,962 3,117 30
s_ero_search_assoc 40 2 150 150 150 150 0 0 819 819 19
sp_sysmon 404 1 143 143 143 143 0 0 20,213 20,213 114
s_im_ds 110 2 130 130 130 130 0 0 2,069 2,069 0
sp_appr 501 79 172 2,116 26 130 146 2,000 2,632 2,669 0
tr_update_demo 71 14 228 993 29 126 199 993 1,348 1,456 0
s_ero_parts_tbo_rs 434 33 237 1,343 14 120 222 1,300 33 61 0
s_hist_get_repair_order 182 6 47 116 47 116 0 0 8,563 19,721 3,537
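A line-level report like the one above can be approximated by grouping a snapshot of monSysStatement on procedure and line number. This is a sketch rather than the proc used to produce the slide. The temp table name is hypothetical. It assumes that ProcedureID is 0 for ad hoc batches. It also assumes that values near 2 billion are the clock-drift anomaly and can be excluded.

```sql
-- Take one snapshot so every aggregate sees the same rows.
select *
into #stmt_detail
from master..monSysStatement
go

select object_name(ProcedureID, DBID) as ProcedureName,
       LineNumber,
       count(*) as NumExecs,
       avg(datediff(ms, StartTime, EndTime)) as ElapsedAvg,
       max(datediff(ms, StartTime, EndTime)) as ElapsedMax,
       avg(CpuTime) as CpuTimeAvg,
       max(CpuTime) as CpuTimeMax,
       avg(WaitTime) as WaitTimeAvg,
       max(WaitTime) as WaitTimeMax,
       avg(LogicalReads) as LogicalReadsAvg,
       max(LogicalReads) as LogicalReadsMax,
       avg(RowsAffected) as RowsAffectedAvg
from #stmt_detail
where ProcedureID != 0            -- assumed: 0 means an ad hoc batch
  and CpuTime < 2000000000        -- assumed cutoff for anomalous CpuTime values
group by DBID, ProcedureID, LineNumber
order by max(CpuTime) desc
go
```

Here NumExecs counts how many times each line ran, not how many times the procedure was called; per-procedure totals still need the nesting logic described earlier.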
66
Weird CpuTimes Being Reported
What if CpuTime is reported as ~2 billion
• i.e. 2147483645 (or nearly so)
• CpuTime is calculated as
  – CpuTime = datediff(ms,StartTime,EndTime)-WaitTime
• ASE syncs its internal clock every 60 seconds with the OS/HW clock
• If the machine is experiencing clock drift, a short executing proc may appear to have a “negative” datediff
  – CpuTime is still calculated and ends up being 2B minus the diff - so the above example had a duration of “-2” ms
  – This could also happen if the WaitTime is high on longer running procs (assuming wait time is nearly all the elapsed time)
Resolution:
• Can’t really be fixed by Sybase - hardware issue with clock drift
• Best bet is to use a constant (e.g. 1) or a standard percentage of wait time (e.g. 10%) for CpuTime if/when this happens.
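The substitution suggested above can be applied at query time. In the example value, 2147483645 is 2147483647 (the largest signed 32-bit integer) minus 2. The sketch below shows both options; the cutoff of 2,000,000,000 is an assumed threshold for “nearly 2 billion”, not a documented constant.

```sql
select SPID, ProcedureID, LineNumber, WaitTime,
       CpuTime as ReportedCpuTime,
       -- Option 1: replace anomalous values with a constant of 1 ms.
       case when CpuTime >= 2000000000 then 1
            else CpuTime
       end as CpuTimeConst,
       -- Option 2: replace them with 10% of the wait time.
       case when CpuTime >= 2000000000 then WaitTime / 10
            else CpuTime
       end as CpuTimePctWait
from master..monSysStatement
go
```

Either adjusted column can then be used in averages without one bad sample swamping the result.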