©OraInternals Riyaj Shamsudeen
RAC Hack: Deep review of LMS/LGWR process
By Riyaj Shamsudeen
LMS Processing (oversimplified)
[Flow diagram: user session on node 1; LMSx and LGWR on node 2]
1. The user session sends a GC message through the OS and network stack.
2. LMS receives the message and builds the CR or current (CUR) block.
3. If needed, LMS sends a message to LGWR; LGWR wakes up, processes the log buffer, writes the log file, and signals LMS.
4. LMS wakes up and sends the block back through the OS and network stack.
5. The block is copied into the requesting SGA and user session processing resumes.
GC CR latency
GC CR latency ~=
    Time spent sending the message to LMS +
    LMS processing (building blocks, etc.) + LGWR latency (if any) +
    LMS send time +
    Wire latency
The LMS processing, LGWR latency, and LMS send components are incurred on the remote node.
Averages can be misleading. Always review both the total time and the average to understand the issue.
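The point about averages can be made concrete with a small sketch (the numbers below are hypothetical): a low average can hide a very large total, and vice versa.

```python
# Hypothetical numbers: why total time and average must be read together.

# Case A: many fast waits -- low average, but a large total.
case_a_waits, case_a_total_ms = 1_000_000, 2_000_000   # avg 2 ms
# Case B: few slow waits -- high average, but a small total.
case_b_waits, case_b_total_ms = 500, 150_000           # avg 300 ms

avg_a = case_a_total_ms / case_a_waits
avg_b = case_b_total_ms / case_b_waits

print(f"Case A: avg {avg_a:.0f} ms, total {case_a_total_ms:,} ms")  # avg 2 ms
print(f"Case B: avg {avg_b:.0f} ms, total {case_b_total_ms:,} ms")  # avg 300 ms
# Judged by average alone, case A looks healthy, yet it contributes
# more than 13x the total wait time of case B.
```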
LMS process – A deep dive
The LMS process uses the pollsys system call to listen for incoming packets, with a 10ms timeout.
Sockets are file descriptors in UNIX.
truss -d -E -v all -p 1485 |more
1.8531 0.0000 pollsys(0xFFFFFD7FFFDFBA70, 7, 0xFFFFFD7FFFDFBA20, 0x00000000) = 0
fd=36 ev=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND rev=0
fd=29 ev=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND rev=0
fd=33 ev=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND rev=0
fd=41 ev=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND rev=0
fd=42 ev=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND rev=0
fd=39 ev=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND rev=0
fd=40 ev=POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND rev=0
timeout: 0.010000000 sec
1.8635 0.0000 pollsys(0xFFFFFD7FFFDFBA70, 7, 0xFFFFFD7FFFDFBA20, 0x00000000) = 0
Timeout 10ms
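The wait loop that the truss output shows can be sketched in Python, whose select.poll wraps the same poll(2)/pollsys interface. The sockets and ports here are illustrative stand-ins, not Oracle's code.

```python
import select
import socket
import time

# Create a few non-blocking UDP sockets, analogous to the SOCK_DGRAM
# descriptors listed by pfiles (ports here are OS-assigned).
socks = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM) for _ in range(3)]
for s in socks:
    s.bind(("127.0.0.1", 0))
    s.setblocking(False)

poller = select.poll()
for s in socks:
    poller.register(s.fileno(), select.POLLIN | select.POLLPRI)

start = time.monotonic()
events = poller.poll(10)            # 10 ms timeout, as in the truss output
elapsed_ms = (time.monotonic() - start) * 1000

print(events)                       # [] -- no packets, the call timed out
print(elapsed_ms > 5)               # True -- the timeout was (roughly) consumed
```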
LMS sockets
pfiles shows that these file descriptors are sockets; essentially, the LMS process is sending and receiving messages on these ports. pfiles 1485
36: S_IFSOCK mode:0666 dev:298,0 ino:10517 uid:0 gid:0 size:0
O_RDWR|O_NONBLOCK FD_CLOEXEC
SOCK_DGRAM
SO_SNDBUF(57344),SO_RCVBUF(57344),IP_NEXTHOP(0.224.0.0)
sockname: AF_INET 127.0.0.1 port: 33320
29: S_IFSOCK mode:0666 dev:298,0 ino:41400 uid:0 gid:0 size:0
O_RDWR|O_NONBLOCK FD_CLOEXEC
SOCK_DGRAM
SO_SNDBUF(262144),SO_RCVBUF(131072),IP_NEXTHOP(0.0.2.0)
sockname: AF_INET 169.254.106.96 port: 33318
33: S_IFSOCK mode:0666 dev:298,0 ino:10518 uid:0 gid:0 size:0
O_RDWR|O_NONBLOCK FD_CLOEXEC
SOCK_DGRAM
SO_SNDBUF(262144),SO_RCVBUF(131072),IP_NEXTHOP(0.0.2.0)
sockname: AF_INET 169.254.201.54 port: 33319
Demo: demo_lms_truss.ksh demo_lms_pfiles.ksh
LMS CPU usage
Just because the LMS process runs in RT mode does not mean that it is consuming CPU all the time.
#./trace_syscall_preempt_size.sh 1485
0 => pollsys timestamp : 45075622139230
0 | swtch:pswitch oracle sysinfo: timestamp : 45075622155047
0 | swtch:pswitch Vol context switch : 45075622155697 pswitch genunix`cv_timedwait_sig_hires+0x2ab
0 | resume:off-cpu On cpu 0 for: 92460
0 | resume:on-cpu Off cpu for: 10242512
0 <= pollsys timestamp : 45075632406018 elapsed : 10266788
Demo: as root trace_syscall_preempt_size.sh
The pollsys call voluntarily releases the CPU until a new packet arrives on a port or the timeout expires.
LMS uses very little CPU if there is no work to do: with no work, just 92 microseconds of CPU were used in a 10,242-microsecond window.
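Working that out from the trace (the counters appear to be nanoseconds, consistent with the 92-microsecond figure above):

```python
# Values taken from the trace output above (nanoseconds).
on_cpu_ns = 92_460       # "On cpu 0 for"
off_cpu_ns = 10_242_512  # "Off cpu for"

busy_pct = 100 * on_cpu_ns / (on_cpu_ns + off_cpu_ns)
print(f"{busy_pct:.2f}% busy")   # 0.89% busy -- well under 1% CPU
```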
LMS – early wakeup
The kernel will schedule the LMS process as soon as a network packet arrives on one of its ports.
#./trace_syscall_preempt_size.sh 1485
0 => pollsys timestamp : 45075592390763
0 | swtch:pswitch oracle sysinfo: timestamp : 45075592402281
0 | swtch:pswitch Vol context switch : 45075592402933 pswitch genunix`cv_timedwait_sig_hires+0x2ab
0 | resume:off-cpu On cpu 0 for: 55099
0 | resume:on-cpu Off cpu for: 29660449
0 <= pollsys timestamp : 45075622072662 elapsed : 29681899
Demo: as root trace_syscall_preempt_size.sh
The LMS process was woken up in 29 microseconds.
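The early wakeup can be demonstrated with the same poll sketch: when a datagram is already queued, poll returns immediately instead of sleeping out the 10ms timeout. The local sockets below are hypothetical stand-ins for the interconnect ports.

```python
import select
import socket
import time

rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

poller = select.poll()
poller.register(rx.fileno(), select.POLLIN)

tx.sendto(b"gc message", rx.getsockname())   # packet arrives before the poll
start = time.monotonic()
events = poller.poll(10)                     # same 10 ms timeout
elapsed_ms = (time.monotonic() - start) * 1000

print(len(events))          # 1 -- the receive socket is readable
print(elapsed_ms < 10)      # True -- woke up well before the timeout
```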
LMS count
Even in busy environments, I have seen LMS busy only 50% of the time.
To schedule a process, the CPU scheduler must load CPU registers, refill the instruction pipeline, etc., which is a costly operation.
If you have many LMS processes, the workload will be distributed among them. Due to their RT priority, they will be moving on and off the CPU.
In a multi-processor environment, this becomes more complicated.
Version 11.2 uses much more meaningful values for the LMS count.
LMS – prstat
In Solaris, another way to check the efficiency of the LMS process is through prstat microstate accounting.
# prstat -mL -p 18243
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
18243 prod 6.4 5.9 0.0 0.0 0.0 0.0 88 0.0 2K 0 30K 187 oracle/1
Demo: prstat command
prstat breaks the process time down into microstate accounting percentages. In this case, the breakdown of LMS CPU usage is:
6.4% in USR mode
5.9% in SYS mode
0% CPU latency
88% sleeping
2K voluntary context switches
0 involuntary context switches
LMS – session
LMS session-level statistics can be used to measure the workload distribution. Of course, these statistics are cumulative from instance startup.
@lms_workload_perc
INST_ID PGM NAME VAL PROC_TO_INST PROC_TO_TOT INST_TO_TOT
---------- ------- ------------------------------ ---------- ------------ ----------- -----------
1 (LMS0) gc cr blocks served 62960382 15 3 25
1 (LMS1) gc cr blocks served 58701920 13 3 25
1 (LMS2) gc cr blocks served 57757849 13 3 25
...
2 (LMS5) gc cr blocks served 44476702 14 2 18
2 (LMS6) gc cr blocks served 42312824 13 2 18
...
3 (LMS0) gc cr blocks served 38465965 14 2 15
3 (LMS1) gc cr blocks served 37541589 14 2 15
...
4 (LMS6) gc cr blocks served 95517242 14 5 40
4 (LMS5) gc cr blocks served 94879180 13 5 40
...
Demo: lms_workload_distr.sql – to measure workload from the instance start
LMS processes in node 4 are busy serving CR blocks. Why?
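The percentage columns in this report are straightforward ratios. A sketch of the arithmetic, using the instance 1 values shown above; since the "..." rows are elided on the slide, the totals below are hypothetical:

```python
# "gc cr blocks served" per LMS process in instance 1 (from the report).
lms_vals = {"LMS0": 62_960_382, "LMS1": 58_701_920, "LMS2": 57_757_849}

# Hypothetical totals (the remaining LMS rows are elided on the slide).
inst_total = 430_000_000     # all LMS processes in instance 1
grand_total = 1_700_000_000  # all LMS processes in all instances

for pgm, val in lms_vals.items():
    proc_to_inst = round(100 * val / inst_total)   # PROC_TO_INST column
    proc_to_tot = round(100 * val / grand_total)   # PROC_TO_TOT column
    print(f"{pgm}: {proc_to_inst}% of instance, {proc_to_tot}% of total")

# INST_TO_TOT column: this instance's share of all blocks served.
print(f"inst_to_tot: {round(100 * inst_total / grand_total)}%")
```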
LMS - workload
In a few cases, it is prudent to measure the current rate instead of relying on the cumulative rate.
@gc_lms_workload_distr_diff.sql
Enter value for search_string: gc cr blocks served
Enter value for sleep: 60
---------|--------------|----------------|----------|---------------|---------------|-------------|
Inst | Pgm | value | totvalue | instvalue | proc2inst |inst2total |
---------|--------------|----------------|----------|---------------|---------------|-------------|
1 | (LMS0)| 348| 16931| 2993| 11| 2|
1 | (LMS1)| 335| 16931| 2993| 11| 1|
...
2 | (LMS0)| 359| 16931| 2660| 13| 2|
2 | (LMS1)| 375| 16931| 2660| 14| 2|
...
3 | (LMS0)| 132| 16931| 1231| 10| 0|
3 | (LMS1)| 194| 16931| 1231| 15| 1|
...
4 | (LMS0)| 1164| 16931| 10047| 11| 6|
4 | (LMS1)| 1784| 16931| 10047| 17| 10|
---------|--------------|----------------|----------|---------------|---------------|-------------|
Demo: gc_lms_workload_distr_diff.sql
LMS processes in node 4 are busy here too.
LMS – applying undo
The LMS process applies undo records to construct the CR buffer to send.
The following session-level statistics can be reviewed to see the undo-application counters; in this example, 474K undo records were applied to create CR blocks.
Demo: get_sesstat_sid.sql
@get_sesstat_sid
Enter the wildcard character (Null=All):undo
Enter value threshold :1
Enter sid :11000
NAME VALUE
---------------------------------------------------------------- ----------
transaction tables consistent reads - undo records applied 13180
data blocks consistent reads - undo records applied 474036
LMS – snapper.sql
Tanel’s ultra-cool snapper is useful for finding the rate of a few statistics for an LMS session.
Demo: snapper.sql
@session_snapper out,gather=stw 15 4 11000
-- Session Snapper v2.01 by Tanel Poder ( http://www.tanelpoder.com )
----------------------------------------------------------------------------------------------------------------------
SID, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH
----------------------------------------------------------------------------------------------------------------------
11000, (LMS0) , STAT, cleanouts and rollbacks - consistent rea, 633, 42.2,
11000, (LMS0) , STAT, immediate (CR) block cleanout applicatio, 689, 45.93,
11000, (LMS0) , STAT, commit txn count during cleanout , 56, 3.73,
11000, (LMS0) , STAT, active txn count during cleanout , 633, 42.2,
11000, (LMS0) , STAT, cleanout - number of ktugct calls , 690, 46,
11000, (LMS0) , TIME, background cpu time , 696907, 46.46ms, 4.6%, |@ |
11000, (LMS0) , TIME, background elapsed time , 696907, 46.46ms, 4.6%, |@ |
11000, (LMS0) , WAIT, gcs remote message , 13987778, 932.52ms, 93.3%, |@@@@@@@@@@|
11000, (LMS0) , WAIT, events in waitclass Other , 150638, 10.04ms, 1.0%, |@ |
-- End of snap 2, end=2011-06-28 22:28:01, seconds=15
GC TX/RX %
You can find the TX and RX percentages using the gc_traffic_print.sql script too. These percentages are at the instance level.
Demo: gc_traffic_print.sql
@gc_traffic_print.sql
---------|--------------|---------|----------------|---------|---------------|---------|-------------|---------|
Inst | CR blocks Rx | CR Rx% | CUR blocks Rx | CUR RX %| CR blocks Tx | CR TX % | CUR blks TX | CUR TX% |
---------|--------------|---------|----------------|---------|---------------|---------|-------------|---------|
1 | 283| 3.6| 950| 27.12| 214| 3.33| 665| 16.8|
2 | 7185| 91.47| 1327| 37.89| 256| 3.98| 1117| 28.22|
3 | 119| 1.51| 886| 25.29| 5798| 90.22| 1617| 40.86|
4 | 268| 3.41| 339| 9.68| 158| 2.45| 558| 14.1|
In that sampling interval, node 3 was transmitting 90% of the CR blocks and node 2 was receiving 91% of them. This insight is useful for measuring workload distribution; use a larger sampling interval for a stable picture.
gcs log flush sync
Before sending a reconstructed CR or current (CUR) block, LMS verifies that the corresponding redo vectors have been flushed to disk.
If the redo vectors have not been flushed, LMS must wait on the ‘gcs log flush sync’ event after requesting a log flush from LGWR, analogous to the ‘log file sync’ event for foreground processes.
This is not an idle event, even though some older documentation suggests that it is.
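The rule above can be sketched as pseudologic in Python (an illustration of the described behavior, not Oracle’s actual implementation): a block may be shipped only once redo up to its highest change SCN is on disk.

```python
def can_send_block(block_high_scn: int, flushed_scn: int) -> bool:
    """A block is safe to ship only if all redo for its changes
    (up to block_high_scn) has already been written to disk."""
    return block_high_scn <= flushed_scn

flushed_scn = 1_000   # hypothetical on-disk redo position

print(can_send_block(990, flushed_scn))    # True  -> send immediately
print(can_send_block(1_005, flushed_scn))  # False -> request a log flush
                                           # and wait on 'gcs log flush sync'
```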
Gcs log flush sync - ASH
ASH shows that LMS waits on the ‘gcs log flush sync’ event.
In this database there is no issue, so the waits for ‘gcs log flush sync’ are not high. (The row with a blank event name represents samples where the process was on the CPU, not waiting.)
select event, count(*) from v$active_session_history
where session_id in (select sid from v$session where program like '%LMS0%')
and sample_time > sysdate -(4/24)
group by event
order by 2 desc;
EVENT COUNT(*)
---------------------------------------- ----------
767
gcs log flush sync 265
latch: KCL gc element parent latch 4
Gcs log flush sync – v$session_event
But v$session_event for the LMS process shows no waits for gcs log flush sync!
The ‘gcs log flush sync’ event is rolled up with other events into ‘events in waitclass Other’.
select event, trunc(time_waited_micro/1000) wait_milli, total_waits
from v$session_event where sid in (select sid from v$session where program like '%LMS0%')
order by 2 desc;
EVENT WAIT_MILLI TOTAL_WAITS
---------------------------------------- ---------- -----------
gcs remote message 218407919 83934373
events in waitclass Other 3970180 5237156
buffer busy waits 356 2897
latch: cache buffers chains 316 5197
latch: row cache objects 0 2
latch: shared pool 0 3
Gcs log flush sync - histogram
To review the impact of ‘gcs log flush sync’ waits, review v$event_histogram.
73% of the waits complete in under 1ms. This is probably not an issue.
@event_histogram.sql
Enter value for event_name: gcs log flush sync
INST_ID EVENT WAIT_TIME_MILLI WAIT_COUNT PER
---------- ---------------------------------------------------------------- --------------- ---------- ----------
1 gcs log flush sync 1 24490064 73.42
1 gcs log flush sync 2 6250630 18.74
1 gcs log flush sync 4 1848333 5.54
1 gcs log flush sync 8 597646 1.79
1 gcs log flush sync 16 142603 .42
1 gcs log flush sync 32 25006 .07
1 gcs log flush sync 64 66 0
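The PER column is simply each bucket’s share of the total wait count; recomputing it from the WAIT_COUNT values above:

```python
# WAIT_COUNT per WAIT_TIME_MILLI bucket, from the v$event_histogram output.
buckets = {1: 24_490_064, 2: 6_250_630, 4: 1_848_333, 8: 597_646,
           16: 142_603, 32: 25_006, 64: 66}
total = sum(buckets.values())

for wait_ms, cnt in buckets.items():
    print(f"{wait_ms:>3} ms bucket: {100 * cnt / total:.2f}%")
# The 1 ms bucket works out to 73.42%, matching the PER column.
```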
Gcs log flush sync – Not so good
The following histogram shows an example where there is a performance issue with log file sync (LFS).
If you have LFS waits and GC waits, then you should consider tuning log file sync before tuning GC events.
@event_histogram.sql
Enter value for event_name: gcs log flush sync
INST_ID EVENT WAIT_TIME_MILLI WAIT_COUNT PER
---------- ---------------------------------------------------------------- --------------- ---------- ----------
1 gcs log flush sync 1 28 .07
1 gcs log flush sync 2 24 .06
1 gcs log flush sync 4 31 .08
1 gcs log flush sync 8 33 .08
1 gcs log flush sync 16 35757 95.96
1 gcs log flush sync 32 1378 3.69
1 gcs log flush sync 64 6 .01
1 gcs log flush sync 128 2 0
Gcs log flush sync – LGWR interaction
If LGWR is suffering from performance issues, the LMS process can be seen waiting on the ‘gcs log flush sync’ wait event in a tight loop, with each wait timing out at about 10ms.
If you have LFS waits and GC waits, then you should consider tuning log file sync before tuning GC events.
LMS trace file:
...
WAIT #0: nam='gcs log flush sync' ela= 10281 waittime=3 poll=0 event=136 obj#=-1 tim=1381909996
WAIT #0: nam='gcs log flush sync' ela= 10274 waittime=3 poll=0 event=136 obj#=-1 tim=1381920366
WAIT #0: nam='gcs log flush sync' ela= 10291 waittime=3 poll=0 event=136 obj#=-1 tim=1381930735
WAIT #0: nam='gcs log flush sync' ela= 10321 waittime=3 poll=0 event=136 obj#=-1 tim=1381941178
...
GCS log flush sync - Example
Top 5 Timed Foreground Events
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Avg
wait % DB
Event Waits Time(s) (ms) time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
log file sync 2,054 23,720 11548 45.8 Commit
gc buffer busy acquire 19,505 10,382 532 20.0 Cluster
gc cr block busy 5,407 4,655 861 9.0 Cluster
enq: SQ - contention 140 3,432 24514 6.6 Configurat
db file sequential read 38,062 1,305 34 2.5 User I/O
Host CPU (CPUs: 24 Cores: 24 Sockets: 24)
~~~~~~~~ Load Average
Begin End %User %System %WIO %Idle
--------- --------- --------- --------- --------- ---------
1.18 1.16 2.7 2.6 0.0 94.7
Excessive waits for log file sync for the foreground processes, even though the host CPU is nearly 95% idle.
GCS log flush sync – GC waits
Global Cache and Enqueue Services - Workload Characteristics
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Avg global enqueue get time (ms): 7.4
Avg global cache cr block receive time (ms): 222.0
Avg global cache current block receive time (ms): 27.5
Avg global cache cr block build time (ms): 0.0
Avg global cache cr block send time (ms): 0.1
Global cache log flushes for cr blocks served %: 2.7
Avg global cache cr block flush time (ms): 15879.9
Avg global cache current block pin time (ms): 0.0
Avg global cache current block send time (ms): 0.1
Global cache log flushes for current blocks served %: 0.3
Avg global cache current block flush time (ms): 1701.3
The average global cache CR block receive time was 222ms.
The very high flush times indicate waits on the LGWR process.
GCS log flush sync – GC waits
Avg
%Time Total Wait wait Waits % bg
Event Waits -outs Time (s) (ms) /txn time
-------------------------- ------------ ----- ---------- ------- -------- ------
gcs log flush sync 80,695 51 1,862 23 34.7 32.9
log file parallel write 44,129 0 880 20 19.0 15.6
Log archive I/O 1,607 0 876 545 0.7 15.5
gc cr block busy 729 71 752 1031 0.3 13.3
db file parallel write 25,752 0 434 17 11.1 7.7
enq: CF - contention 166 64 307 1850 0.1 5.4
Background processes are waiting excessively on the gcs log flush sync event.
There are also high waits for log file parallel write.
LGWR is important
So, if you think LGWR performance is important in single-instance databases, it is ultra-important in RAC.
If you have LGWR-related performance issues, you can treat most other waits as mere symptoms.
It’s a pity that LGWR does not run in RT mode (or even in the FX class).
LMS processes run at elevated priority, but LGWR does not: a classic priority inversion!
Contact info: Email: [email protected] Blog : orainternals.wordpress.com URL : www.orainternals.com
Thank you for attending!