oracle heartbeat mechanisms · 2013-03-25 · io fencing mechanisms - split-brain상황에서...

Oracle Heartbeat Mechanisms

OCM

Seungtaek Lee(放浪DBA) 2012/06

Split-brain

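□ Split-brain

- A state in which nodes are still alive but the cluster heartbeat network between them is cut

- If I/O hits shared data during split-brain, the data is corrupted

- Node eviction is required to prevent split-brain

[Figure: Node 1 and Node 2, each running CSSD]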

Split-brain

□ IO fencing mechanisms

- In a split-brain situation, I/O is blocked so that pending I/O from the evicted node is not flushed

- There are two approaches: shutting the node down, or controlling access rights on the shared disk

□ STONITH (Shoot The Other Node In The Head, a.k.a. STOMITH)

- In a multi-node cluster, only the faulty node is shut down or powered off

- This is the mechanism Oracle uses

□ Reserve/Release (R/R)

- Only the node that holds the reservation can use the resource

- Because only one node survives, it can be used only in a two-node cluster (Two Node STONITH)

□ Persistent Reservation

- The resource is controlled through a reservation key

- I/O is controlled at the disk LUN level (SCSI-3)

- I/O can be restricted just by changing the key, so no node shutdown is needed
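The key-based idea behind Persistent Reservation can be illustrated with a toy model. This is only a sketch: `SharedLun`, its methods, and the key strings are hypothetical names invented for illustration, and nothing here talks to a real SCSI-3 device.

```python
class SharedLun:
    """Toy model of a LUN guarded by registration keys (not a real SCSI-3 implementation)."""

    def __init__(self):
        self.registered_keys = set()
        self.blocks = {}

    def register(self, key):
        self.registered_keys.add(key)

    def preempt(self, own_key, victim_key):
        # Only a registered node may remove another node's key.
        if own_key not in self.registered_keys:
            raise PermissionError("caller is not registered")
        self.registered_keys.discard(victim_key)

    def write(self, key, block_no, data):
        # I/O from a node whose key was removed is rejected at the LUN.
        if key not in self.registered_keys:
            raise PermissionError(f"key {key!r} is fenced")
        self.blocks[block_no] = data


lun = SharedLun()
lun.register("node1")
lun.register("node2")
lun.write("node2", 0, b"before split")
lun.preempt("node1", "node2")  # node1 fences node2 without shutting it down
try:
    lun.write("node2", 0, b"pending IO after eviction")
    fenced = False
except PermissionError:
    fenced = True
print(fenced, lun.blocks[0])
```

The evicted node keeps running, but its late write never reaches the shared block, which is the property the slide contrasts with STONITH.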

Split-brain

[Figure: Node1, Node2, and Node3 each run CSS and share a Voting File; every node reports "We all see 1&2&3"]

□ Network Heartbeat (NHB) : every 1 second

- Network Heartbeat : monitors remote nodes

- Local Heartbeat (LHB) : monitors the local CSSD

□ Disk Heartbeat (DHB) : every 1 second

- Used to resolve split-brain

- A node is evicted if it cannot access a majority of the Voting Files

□ Timeouts

- Misscount (MC) : network timeout, 30 seconds

- Disk Time Out (DTO) : Voting File (VF) timeout

. SIOT : during a reconfig, 27 seconds (MC - RBT)

. LIOT : during normal operation, 200 seconds

- ReBoot Time (RBT) : time taken to perform a reboot

2012-06-12 15:09:16.401: [ CSSD][2996292496]clssnmCheckDskInfo: Checking disk info...

2012-06-12 15:09:16.401: [ CSSD][2996292496]clssnmCheckSplit: Node 1, rac1, is alive, DHB (1339481341, 6345324) more than disk timeout of 27000 after the last NHB (1339481312, 6315574)
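The clssnmCheckSplit line above can be checked by hand: the gap between the last DHB and the last NHB must exceed the 27000 ms short disk timeout (MC - RBT = 30 - 3 seconds). A minimal sketch follows; it assumes the second number in each pair is a millisecond counter, which the log itself does not state, and `split_gap` is a helper name made up here.

```python
import re

LINE = ("2012-06-12 15:09:16.401: [ CSSD][2996292496]clssnmCheckSplit: Node 1, rac1, is alive, "
        "DHB (1339481341, 6345324) more than disk timeout of 27000 after the last NHB "
        "(1339481312, 6315574)")

PATTERN = re.compile(
    r"DHB \((\d+), (\d+)\) more than disk timeout of (\d+) after the last NHB \((\d+), (\d+)\)"
)


def split_gap(line):
    """Extract the DHB/NHB stamps and return the gaps between them."""
    m = PATTERN.search(line)
    if m is None:
        return None
    dhb_epoch, dhb_tick, timeout_ms, nhb_epoch, nhb_tick = map(int, m.groups())
    return {
        "gap_seconds": dhb_epoch - nhb_epoch,  # difference of the epoch-second stamps
        "gap_ticks": dhb_tick - nhb_tick,      # assumed to be milliseconds
        "timeout_ms": timeout_ms,
    }


info = split_gap(LINE)
print(info)
```

Both gaps (29 seconds and 29750 ticks) are above the 27-second timeout, which is why the remote node is reported as alive.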

Split-brain


□ Network Split Detection

- The timestamp of the most recent NHB is compared with the timestamp of the most recent DHB to decide whether a node is alive or dead

. When the difference between the timestamps of the most recent DHB and the last NHB is greater than the Disk Time Out (DTO), a node is considered still active

DHB-NHB > 27 = Alive

. When the difference between the timestamps is less than reboottime, the node is considered still alive

DHB-NHB < 3 = Alive

. If the time since the last DHB was read is more than DTO, the node is dead

Current Time - DHB > 27 = Dead

- If the difference between the timestamps is greater than reboottime and less than DTO, the status of the node is unclear and we must wait to make a decision until we fall into one of the 3 categories above

27 > DHB-NHB > 3 = Unclear
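The four rules above fit in one small function. This is a sketch under stated assumptions: times are plain seconds, DTO = 27 and RBT = 3 as on the slide, the "dead" test is applied first, and values that land exactly on a boundary are treated as unclear. Neither the ordering nor the boundary handling is specified by the source.

```python
DTO = 27  # short disk timeout in seconds (MC - RBT = 30 - 3)
RBT = 3   # reboot time in seconds


def node_status(now, last_dhb, last_nhb, dto=DTO, rbt=RBT):
    """Classify a remote node from its last DHB and NHB timestamps."""
    if now - last_dhb > dto:
        return "dead"      # Current Time - DHB > 27
    gap = last_dhb - last_nhb
    if gap > dto:
        return "alive"     # DHB-NHB > 27
    if gap < rbt:
        return "alive"     # DHB-NHB < 3
    return "unclear"       # 27 > DHB-NHB > 3, wait and re-check


print(node_status(now=100, last_dhb=99, last_nhb=70))
print(node_status(now=100, last_dhb=99, last_nhb=89))
print(node_status(now=100, last_dhb=60, last_nhb=59))
```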

Split-brain


□ Decision based on connectivity information

- When the network fails and nodes that are still up cannot communicate with each other, the network is considered split

- To maintain data integrity when a split occurs, one of the nodes must fail

- The surviving nodes should be an optimal sub-cluster of the original cluster

- Nodes that are not to survive are evicted

Split-brain


□ Decision based on connectivity information

- The DHB contains information about the nodes it can communicate with, as a byte map of perceived state (connected/member)

- The NHB contains bitmaps for members and for connected nodes

□ The RMN uses this info to calculate an optimal sub-cluster

- The RMN collects the voting information from each node and decides which nodes survive

- Bitmaps for connectivity and for membership

- Does bitwise AND of bitmaps to construct maximal transitive cohorts

□ Surviving cohort

- Cohort with the most nodes

- Cohort with lowest node number not in other cohort
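The cohort rules can be sketched with integer bitmaps. This is a simplified illustration, not CSSD's actual algorithm: bit n of a mask stands for node n, `conn` maps each node to the set of nodes it reports as connected, and the function names are invented here.

```python
def cohort_of(node, conn):
    """AND together the bitmaps of every node that `node` reports as connected."""
    mask = conn[node]
    result = mask
    for other in conn:
        if mask & (1 << other):
            result &= conn[other]
    return result


def members(mask):
    """List the node numbers whose bit is set in `mask`."""
    return [n for n in range(mask.bit_length()) if mask & (1 << n)]


def surviving_cohort(conn):
    cohorts = {c for c in (cohort_of(n, conn) for n in conn) if c}
    # Largest cohort wins; on a tie, the cohort holding the lowest node number wins.
    return max(cohorts, key=lambda c: (len(members(c)), -min(members(c))))


# Node 3 is cut off from nodes 1 and 2.
three_nodes = {1: 0b0110, 2: 0b0110, 3: 0b1000}
# Two-node split: each node sees only itself.
two_nodes = {1: 0b010, 2: 0b100}

print(members(surviving_cohort(three_nodes)))
print(members(surviving_cohort(two_nodes)))
```

In the two-node case both cohorts have one member, so the tie-break on the lowest node number decides the survivor.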

[Figure: Node1, Node2, and Node3 each run CSS and share a Voting File. Node1 and Node2 no longer see Node3 ("I do not see 3"; Node1: "I see 1&2", Node2: "I see 1&2" => "We should evict 3!"). Node3 reports "I see 3", then "I've been evicted! I'd better stop".]

Split-brain (RMN)

- The Reconfig Manager Node (RMN) runs the reconfig

- Not a fixed node; any node can become the RMN for a reconfig

- All nodes must communicate with the RMN to complete the reconfig

STONITH

Split-brain


2012-06-12 15:09:01.911: [ CSSD][2992057232]clssnmSendShutdown: req to node 2, kill time 6345474

2012-06-12 15:09:01.911: [ CSSD][2992057232]clssnmsendmsg: not connected to node 2

2012-06-12 15:09:01.911: [ CSSD][2992057232]clssnmSendShutdown: Send to node 2 failed

2012-06-12 15:09:01.911: [ CSSD][2992057232]clssnmWaitOnEvictions: Start

2012-06-12 15:09:01.911: [ CSSD][2992057232]clssnmWaitOnEvictions: node 2, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1339481355, 5529374, 6315), seedhbimpd TRUE

2012-06-12 15:09:01.911: [ CSSD][2990480272]clssnmDiscEndp: gipcDestroy 0x69ea

2012-06-12 15:09:01.912: [ CSSD][2996788112]clssgmUpdateEventValue: HoldRequest val 1, changes 9

2012-06-12 15:09:01.915: [ CSSD][3010980752]clssnmvDiskEvict: Kill block write, file ORCL:SYSTEM_3 flags 0x00010004, kill block unique 1339472144, stamp 6345474/6345474




2012-06-12 15:09:16.401: [ CSSD][2996292496]clssnmCheckSplit: Node 1, rac1, is alive, DHB (1339481341, 6345324) more than disk timeout of 27000 after the last NHB (1339481312, 6315574)

2012-06-12 15:09:16.401: [ CSSD][2996292496]clssnmCheckDskInfo: My cohort: 2

2012-06-12 15:09:16.401: [ CSSD][2996292496]clssnmCheckDskInfo: Surviving cohort: 1

2012-06-12 15:09:16.401: [ CSSD][2996292496](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, rac2, is smaller than cohort of 1 nodes led by node 1, rac1, based on map type 2

2012-06-12 15:09:16.401: [ CSSD][2996292496]###################################

2012-06-12 15:09:16.402: [ CSSD][2996292496]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread

2012-06-12 15:09:16.402: [ CSSD][2996292496]###################################

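□ Of the two log excerpts above, the first (15:09:01) is from Node1, which sends the kill request to node 2; the second (15:09:16) is from Node2, which aborts itself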

□ IPMI (Intelligent Platform Management Interface)

- Lets a node be shut down from a remote node (optional)

- Each node needs an IPMI driver and a Baseboard Management Controller (BMC)

- Even if the local CSS-based reboot fails, the node can still be shut down because its power can be controlled remotely

Split-brain


1. CSS spawns node termination thread

2. node termination thread communicates with remote BMC via management LAN

. establish authenticated session

. check power status

. request power-off

. repeat checking power status until OFF

. request power-on

3. node termination thread exits
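The power-cycle sequence in step 2 can be sketched as a polling loop. `FakeBmc` is a stand-in written for this example and does not model any real IPMI library; the credentials, the polling limit, and the method names are all hypothetical.

```python
class FakeBmc:
    """Stand-in for a remote BMC; a real one is reached over the management LAN."""

    def __init__(self, polls_until_off=2):
        self.power = "on"
        self.log = []
        self._polls_until_off = polls_until_off
        self._polls_left = None

    def open_session(self, user, password):
        self.log.append("session")

    def power_status(self):
        # After a power-off request, the node goes down a few polls later.
        if self._polls_left is not None:
            if self._polls_left == 0:
                self.power = "off"
                self._polls_left = None
            else:
                self._polls_left -= 1
        self.log.append(f"status:{self.power}")
        return self.power

    def power_off(self):
        self.log.append("power-off")
        self._polls_left = self._polls_until_off

    def power_on(self):
        self.log.append("power-on")
        self.power = "on"


def terminate_node(bmc, user, password, max_polls=10):
    bmc.open_session(user, password)  # establish authenticated session
    if bmc.power_status() == "on":    # check power status
        bmc.power_off()               # request power-off
    for _ in range(max_polls):        # repeat checking power status until OFF
        if bmc.power_status() == "off":
            bmc.power_on()            # request power-on
            return True
    return False


bmc = FakeBmc()
result = terminate_node(bmc, "admin", "secret")
print(result, bmc.log)
```

Power-on is requested only after the BMC confirms OFF, so the node is known to have stopped issuing I/O before it comes back.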

CSSD threads

□ Sending Thread

- Sends the NHB and LHB (every second)

□ Peer Listener Thread

- Receives messages from other nodes (mainly NHB, vote info, kill requests)

- Monitors the Sending Thread

- Event driven (avoids scheduling problems)

□ Polling Thread

- Periodically monitors node status using the NHB and initiates the reconfig

□ Disk Ping Thread

- Writes the DHB to the Voting File (every second)

- Reads the kill block

□ Disk Ping Monitor Thread

- Monitors the Disk Ping Thread

- If the Disk Ping Thread cannot read the kill block, the node is evicted

CSS Agent/Monitor

□ CSS Agent

- Monitors the state of CSS (CSSD State)

. Reboots the node when CSS fails

. Performs a filesystem sync before the node reboot

- Responsible for starting, stopping, and monitoring CSS

- Integrates the monitoring functions of oprocd/oclsomon/oclsvmon

□ CSS Monitor

- Introduced so that CSS stays monitored even when the CSS Agent fails (the OHAS daemon restarts the CSS Agent) (Dual Monitoring)

Process             Failure   Behavior on failure

CSS                 Down      cssdagent restarts the node

CSS                 Hang      cssdagent restarts the node after (MISSCOUNT-REBOOTTIME/2) seconds

CSS Agent/Monitor   Down      ohasd restarts the agent immediately (no node restart)

CSS Agent/Monitor   Hang      The node is restarted if the hang lasts longer than MISSCOUNT

CSS Agent/Monitor

□ CSSD State information

- The CSS Agent decides whether to reboot the node based on the CSSD State information

- The CSSD Sending Thread transmits the CSSD State information in the NHB and LHB

- CSSD State

. cTODC : cssd Time Of Day Clock (aTODC : agent Time Of Day Clock)

. cITC : cssd Invariant Time Clock (aITC : agent Invariant Time Clock)

. NTO : based on node with the longest time since the last NHB

. DTO : based on the amount of time since a majority of VFs was last online

2012-04-03 22:43:27.910: [ USRTHRD][1548] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 90% of limit 28257; disk timeout 28257, network timeout 25355, last heartbeat from CSSD at epoch seconds 1333507382.449, 25459 milliseconds ago based on invariant clock 2027712154; now polling at 100 ms

- When cssd hangs, cssdagent restarts the node after (MISSCOUNT-REBOOTTIME/2) seconds

- CSSD State

. cTODC (epoch seconds) : 1333507382.449 → Tue, 3 Apr 2012 22:43:02

. cITC (invariant clock) : 2027712154

. NTO (network timeout) : 25355

. DTO (disk timeout) : 28257
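The arithmetic in the clsnproc_needreboot message can be verified directly. A minimal sketch, assuming only what the log line prints; the regular expression and variable names are made up for this example.

```python
import re

LINE = ("2012-04-03 22:43:27.910: [ USRTHRD][1548] (:CLSN00111:)clsnproc_needreboot: "
        "Impending reboot at 90% of limit 28257; disk timeout 28257, network timeout 25355, "
        "last heartbeat from CSSD at epoch seconds 1333507382.449, 25459 milliseconds ago "
        "based on invariant clock 2027712154; now polling at 100 ms")

m = re.search(
    r"at (\d+)% of limit (\d+); disk timeout (\d+), network timeout (\d+), "
    r"last heartbeat from CSSD at epoch seconds ([\d.]+), (\d+) milliseconds ago",
    LINE,
)
percent, limit_ms, dto_ms, nto_ms = (int(m.group(i)) for i in range(1, 5))
silent_ms = int(m.group(6))               # time since the last heartbeat from CSSD

threshold_ms = limit_ms * percent / 100   # 90% of the limit
remaining_ms = limit_ms - silent_ms       # time left before the limit is reached

print(threshold_ms, silent_ms, remaining_ms)
```

CSSD had been silent for 25459 ms, just past 90% of the 28257 ms limit (25431.3 ms), leaving about 2.8 seconds before the agent would reboot the node. In this message the limit equals the disk timeout value.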

CSS Agent/Monitor Threads

□ Heartbeat Thread

- Receives the LHB (CSSD State) from CSSD

□ OPROCD Thread

- Periodically wakes up and goes back to sleep

- Timer Driven wake up

- On wake-up, compares the CSSD State with the local ITC to decide whether to reboot the node

Voting Files (VF)

□ Table Of Contents (TOC)

- Stores the location of each block

□ VF identifier Block

- Cluster GUID

- File UID

□ CIN Block

- CIN (Configuration Incarnation Number)

- FCD (Formation Critical Data)

. Settings for the overall CRS configuration (Misscount, etc.)

. List of Voting Files

□ Heartbeat Block

- Stores the Disk Heartbeat information (every second)

□ Kill Block

- Used to notify a node that it is being evicted

[root@rac2 ~]# crsctl query css votedisk

## STATE File Universal Id File Name Disk group

-- ----- ----------------- --------- ---------

1. ONLINE 04843d3136fd4f5dbf52169c417ac0dc (ORCL:SYSTEM_1) [DATA]

2. ONLINE 6ffbe942e7ed4ff8bffd1355c99c6bcd (ORCL:SYSTEM_2) [DATA]

3. ONLINE 0c4b6c1a73834fe6bf487175440a23e9 (ORCL:SYSTEM_3) [DATA]

Located 3 voting disk(s).
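The output above can be parsed to check whether a majority of Voting Files is ONLINE. A small sketch, assuming the column layout shown in the sample; the regular expression and function names are made up for this example, and the input is pasted text rather than a live `crsctl` call.

```python
import re

OUTPUT = """\
##  STATE    File Universal Id                File Name Disk group
--  -----    -----------------                --------- ---------
 1. ONLINE   04843d3136fd4f5dbf52169c417ac0dc (ORCL:SYSTEM_1) [DATA]
 2. ONLINE   6ffbe942e7ed4ff8bffd1355c99c6bcd (ORCL:SYSTEM_2) [DATA]
 3. ONLINE   0c4b6c1a73834fe6bf487175440a23e9 (ORCL:SYSTEM_3) [DATA]
Located 3 voting disk(s).
"""

ROW = re.compile(r"^\s*\d+\.\s+(\S+)\s+([0-9a-f]{32})\s+\((\S+)\)\s+\[(\S+)\]")


def parse_votedisks(text):
    """Return one dict per voting file row; header and footer lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            state, fuid, name, group = m.groups()
            rows.append({"state": state, "fuid": fuid, "name": name, "group": group})
    return rows


def has_majority(rows):
    online = sum(1 for r in rows if r["state"] == "ONLINE")
    return online > len(rows) // 2


rows = parse_votedisks(OUTPUT)
print(len(rows), has_majority(rows))
```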

Voting Files (VF)

□ Lease Block

- Node numbers are assigned dynamically

- One Lease Block is allocated per node number

- The Lease Block is owned by the host that holds that node number

- A node acquires its node number at startup

- The node first tries the node number it used last (the last node number used is stored in the OLR)

- When a lease expires, another node can take over that node number

- Stores the hostname and endpoints (port number)

- A node number can be pinned

. Rolling Upgrade, Pre-11.2 RDBMS to be able to connect to clusterware

. crsctl pin css -n <hostname>

[root@rac2 ~]# olsnodes -n -t

rac1 1 Unpinned

rac2 2 Unpinned

Voting Files (VF)

- CSS is started before the ASM instance so that it can perform its cluster role

1. When the OHAS daemon starts, it opens the OLR

2. The OHAS daemon starts the OHAS agents

3. oraagent starts gpnpd, and cssdagent starts CSS

4. gpnpd reads the GPnP profile and completes its startup

5. CSS communicates with gpnpd to obtain the profile (and with it the ASM discovery string)

6. CSS locates the Voting Files in the ASM disk headers and begins I/O

7. ASM starts after CSS startup completes

8. ASM retrieves the profile from gpnpd, looks up the spfile, and begins its startup

9. CRS starts after ASM startup completes

10. CRS retrieves the profile from gpnpd

11. CRS connects to CSS

12. CRS opens the OCR and starts the CRS agents and CRS resources


Voting Files (VF)

□ Voting Files in ASM

- Cluster configuration information is separated from the OCR : GPnP Profile, Voting File

- CSS can locate the Voting Files by itself (Discovery Thread) and perform I/O, so it can keep running even if ASM fails

- The number of Voting Files depends on the ASM disk group redundancy

. External redundancy : 1 VF

. Normal redundancy : 3 VF

. High redundancy : 5 VF

- Quorum Failure Groups are supported for Stretched Clusters

[root@rac1 disks]# kfed read VOTE_1 |grep vf

kfdhdb.vfstart: 64 ; 0x0ec: 0x00000040

kfdhdb.vfend: 96 ; 0x0f0: 0x00000060
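Combining the Voting File counts per redundancy level with the majority rule from the Disk Heartbeat slide gives the number of Voting Files a node can lose and still survive. A minimal sketch; the dictionary and function names are hypothetical.

```python
# Voting Files per ASM disk group redundancy level, as listed on the slide.
VF_COUNT = {"external": 1, "normal": 3, "high": 5}


def tolerated_failures(redundancy):
    """Voting Files that can be lost while a strict majority stays accessible."""
    total = VF_COUNT[redundancy]
    majority = total // 2 + 1
    return total - majority


for level in ("external", "normal", "high"):
    print(level, VF_COUNT[level], tolerated_failures(level))
```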

□ Stretched Clusters (a.k.a. Extended Clusters)

- OCR must be mirrored across both sites using Oracle provided mechanisms.

- Preferably have two voting disks at each site and tie-breaking voting disk at a third site. This third site only needs to be a supported NFS device over a WAN. This can be done via a NetApp filer or on most platforms this can be done via standard NFS.

- Starting in Oracle Clusterware 11g Release 2 this can be hosted on ASM on a dedicated Quorum Failure Group.

Voting Files (VF)


- Same principles apply

- Voting Disks are just geographically dispersed

Tie Breaking Voting Disk (via NFS or iSCSI)

Voting Files (VF)

□ Voting Files Backup

- From 11gR2, the Voting File is backed up to the OCR automatically on every configuration change (CIN)

- Voting Files in ASM are moved automatically by ASM when a disk is dropped, relocated, or fails

- The OCR is backed up automatically

□ Voting Files Restore

- Restoring Voting Files requires the OCR first (restore the OCR if it is missing)

- Start CRS in Exclusive Mode to perform the restore

. crsctl start crs -excl

. crsctl replace votedisk +sysdg

. crsctl stop crs

. crsctl start

- Exclusive Mode (crsctl start crs -excl)

. CRS can be started even when the Voting Files or the network have failed

. Possible on only one node

Instance Membership Reconfiguration (IMR)

□ RAC Heartbeat

- Control File Heartbeat : every 3 seconds the CKPT process updates its thread's dedicated Control File block (Checkpoint Progress Record) (X$KCCCP.CPHBT)

□ ORA-29740

- Reason 2 : Control File Timeout (_controlfile_enqueue_timeout) : 900 seconds

- Reason 3 : IPC Send Timeout (_cgs_send_timeout) : 300 seconds

After an IPC Send Timeout is detected, the instance is evicted if the cluster cannot resolve the split-brain (_imr_splitbrain_res_wait : 600 seconds)

- The instance which obtained the RR lock tallies the vote result from all nodes and updates the CFVRR (stored in the same block as the Checkpoint Progress Record)

□ CSS 11g Member

- If LMON hangs or otherwise misbehaves, the instance may not get evicted

- In 11g, LMON and the I/O-related processes are managed as members of the CSS database group

□ Member Kill

- Members of a CSS group can issue kill requests against each other

- If the kill fails, it escalates to a node shutdown

- Parameters

. _imr_evicted_member_kill = true

. _imr_evicted_member_kill_wait = 20

□ CSSD Threads involved in member kill

- Client Listener Thread : receives group join and kill requests

- Peer Listener Thread : receives kill requests from remote nodes

- Death Check Thread : provides confirmation of termination

- Member Kill : spawned to manage a member kill request

- Local Kill : spawned to carry out member kills on local node

Instance Membership Reconfiguration (IMR)


□ With 11.2.0.2 onwards, fencing may not mean reboot

- It starts with a failure – e.g. Network Heartbeat or Disk Heartbeat

- Then IO issuing processes are killed; it is made sure that no IO process remains

(For a RAC DB mainly the log writer and the database writer are of concern)

- Once all IO issuing processes are killed, remaining processes are stopped

(IF the check for a successful kill of the IO processes fails → reboot)

- Once all remaining processes are stopped, the stack stops itself with a “restart flag”

- OHASD will finally attempt to restart the stack after the graceful shutdown

[Figure: Oracle Clusterware stack with CSSD and OHASD, hosting App X, App Y, and RAC DB Inst. 1]

Rebootless Restart


□ Exceptions

- IF the check for a successful kill of the IO processes fails → reboot

- IF CSSD gets killed during the operation → reboot

- IF cssdmonitor (oprocd replacement) is not scheduled → reboot

- IF the stack cannot be shutdown in “short_disk_timeout”-seconds → reboot

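The rebootless-restart exceptions read naturally as a chain of checks. The sketch below encodes only the four "IF ... → reboot" conditions listed above; the function name, its parameters, and the 27-second default (taken from the SIOT value earlier in the deck) are assumptions for illustration.

```python
def fencing_action(io_kill_confirmed, cssd_alive, cssdmonitor_scheduled,
                   shutdown_seconds, short_disk_timeout=27):
    """Return 'reboot' or 'restart-stack' following the exception list."""
    if not io_kill_confirmed:
        return "reboot"   # kill of the IO processes could not be confirmed
    if not cssd_alive:
        return "reboot"   # CSSD was killed during the operation
    if not cssdmonitor_scheduled:
        return "reboot"   # cssdmonitor did not get scheduled
    if shutdown_seconds > short_disk_timeout:
        return "reboot"   # stack shutdown took longer than short_disk_timeout
    return "restart-stack"  # graceful shutdown; OHASD restarts the stack


print(fencing_action(True, True, True, shutdown_seconds=10))
print(fencing_action(False, True, True, shutdown_seconds=10))
print(fencing_action(True, True, True, shutdown_seconds=40))
```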

Rebootless Restart

[Figure: Oracle RAC DB Inst. 1 running on Oracle Clusterware]

Tracing

□ CSS Daemon Trace

- # crsctl set log res ora.cssd=2 -init

- # crsctl set log res ora.cssdmonitor=2 -init

- Logging Level

. level 2 = default

. level 3 = verbose (display each heartbeat message)

. level 4 = super verbose

□ CSS Log

- cssd Log : $GI_HOME/log/`hostname`/cssd/ocssd.log

- cssd Stack Dump : $GI_HOME/log/`hostname`/cssd/cssdOUT.log

: A stack dump is written just before a reboot (no need to change the diagwait value)

- Reboot Advisory (a.k.a. LastGasp)

. /etc/oracle/lastgasp/cssagent_`hostname`.lgl

. /etc/oracle/lastgasp/cssmonit_`hostname`.lgl

- cssdmonitor : $GI_HOME/log/`hostname`/agent/ohasd/oracssdmonitor_root/

- cssdagent : $GI_HOME/log/`hostname`/agent/ohasd/oracssdagent_root/

Troubleshooting 11.2 Clusterware Node Evictions (Reboots) [ID 1050693.1]


□ OCSSD Evictions

- Network failure or latency between nodes. It would take 30 consecutive missed checkins (by default - determined by the CSS misscount) to cause a node eviction.

- Problems writing to or reading from the CSS voting disk. If the node cannot perform a disk heartbeat to the majority of its voting files, then the node will be evicted.

- A member kill escalation. For example, database LMON process may request CSS to remove an instance from the cluster via the instance eviction mechanism. If this times out it could escalate to a node kill.

- An unexpected failure or hang of the OCSSD process, this can be caused by any of the above issues or something else.

- An Oracle bug.

□ CSSDAgent/CSSDMonitor Evictions

- An OS scheduler problem. For example, if the OS is getting locked up in a driver or hardware or there is excessive amounts of load on the machine (at or near 100% cpu utilization), thus preventing the scheduler from behaving reasonably.

- A thread(s) within the CSS daemon hung.

- An Oracle bug.

G-MES 2.0 SEDA Node Reboot Problem

□ Split-Brain Failure

- In a two-node RAC, after a split-brain occurred, Node1 was rebooted and CRS was restarted on Node2, causing a service outage

- ocssd.log (sedadb01)

2012-04-03 22:43:15.170: [ CSSD][5157]clssnmPollingThread: node sedadb02 (2) at 50% heartbeat fatal, removal in 14.138 seconds

2012-04-03 22:43:15.171: [ CSSD][5157]clssnmPollingThread: node sedadb02 (2) is impending reconfig, flag 2294796, misstime 15862

>>>>>>>>> (lines omitted)

2012-04-03 22:43:27.197: [ CSSD][5157]clssnmPollingThread: node sedadb02 (2) at 90% heartbeat fatal, removal in 2.108 seconds, seedhbimpd 1

2012-04-03 22:43:27.197: [ CSSD][2587]clssnmvDHBValidateNCopy: node 2, sedadb02, has a disk HB, but no network HB, DHB has rcfg 219040461, wrtcnt, 29378368, LATS 2027736901, lastSeqNo 29378362, uniqueness 1331483347, timestamp 1333507407/2027619439

2012-04-03 22:43:27.526: [ CSSD][5414]clssnmSendingThread: sending status msg to all nodes

2012-04-03 22:43:27.527: [ CSSD][5414]clssnmSendingThread: sent 4 status msgs to all nodes

2012-04-03 22:43:28.198: [ CSSD][2587]clssnmvDHBValidateNCopy: node 2, sedadb02, has a disk HB, but no network HB, DHB has rcfg 219040461, wrtcnt, 29378374, LATS 2027737902, lastSeqNo 29378368, uniqueness 1331483347, timestamp 1333507408/2027620446

>>>>>>>>> (lines omitted)

2012-04-03 22:51:43.866: [ CSSD][1]clssscmain: Starting CSS daemon, version 11.2.0.3.0, in (clustered) mode with uniqueness value 1333507903


- ocssd.log (sedadb02)

2012-04-03 22:43:29.450: [ CSSD][1]clssgmQueueGrockEvent: groupName(IGSEDAMESSEDAMES) count(2) master(1) event(2), incarn 2, mbrc 2, to member 2, events 0x0, state 0x0

2012-04-03 22:43:29.450: [ CSSD][5671]clssnmDoSyncUpdate: Terminating node 1, sedadb01, misstime(30001) state(5)

2012-04-03 22:43:29.451: [ CSSD][5671]clssnmDoSyncUpdate: Wait for 0 vote ack(s)


2012-04-03 22:43:29.451: [ CSSD][5671]clssnmCheckSplit: Node 1, sedadb01, is alive, DHB (1333507409, 2027739010) more than disk timeout of 27000 after the last NHB (1333507379, 2027709150)

2012-04-03 22:43:29.451: [ CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 3, mbrc 3, to member 2, events 0x0, state 0x0

2012-04-03 22:43:29.451: [ CSSD][5671]clssnmCheckDskInfo: My cohort: 2

2012-04-03 22:43:29.451: [ CSSD][5671]clssnmCheckDskInfo: Surviving cohort: 1

2012-04-03 22:43:29.451: [ CSSD][1]clssgmQueueGrockEvent: groupName(CRF-) count(2) master(2) event(2), incarn 626, mbrc 2, to member 1, events 0x38, state 0x0

2012-04-03 22:43:29.451: [ CSSD][5671](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, sedadb02, is smaller than cohort of 1 nodes led by node 1, sedadb01, based on map type 2

2012-04-03 22:43:29.451: [ CSSD][1]clssgmQueueGrockEvent: groupName(IGSEDAMESSEDAMESRO) count(1) master(2) event(2), incarn 1, mbrc 1, to member 2, events 0x0, state 0x0

2012-04-03 22:43:29.451: [ CSSD][5671]###################################

2012-04-03 22:43:29.451: [ CSSD][5671]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread

2012-04-03 22:43:29.451: [ CSSD][5671]###################################


- oracssdmonitor_root.log (sedadb01)

2012-04-03 22:43:27.910: [ USRTHRD][1548] (:CLSN00111:)clsnproc_needreboot: Impending reboot at 90% of limit 28257; disk timeout 28257, network timeout 25355, last heartbeat from CSSD at epoch seconds 1333507382.449, 25459 milliseconds ago based on invariant clock 2027712154; now polling at 100 ms

- When cssd hangs, cssdagent restarts the node after (MISSCOUNT-REBOOTTIME/2) seconds

- CSSD State

. cTODC (epoch seconds) : 1333507382.449 → Tue, 3 Apr 2012 22:43:02

. cITC (invariant clock) : 2027712154

. NTO (network timeout) : 25355

. DTO (disk timeout) : 28257


- Bug 13940331: VALUE FOR SETTING THREAD SCHEDULING IS INCORRECT IN SLTSTSPAWN

The default value of the inheritsched attribute is PTHREAD_INHERIT_SCHED. The attribute is set by calling the pthread_attr_setinheritsched subroutine. The current value of the attribute is returned by calling the pthread_attr_getinheritsched subroutine. So the platforms that we support all have as their default that the scheduling is inherited.

The AIX version of sltstspawn for as recently as 11.2.0.3 does not do the correct thing.


- Bug 13940331: VALUE FOR SETTING THREAD SCHEDULING IS INCORRECT IN SLTSTSPAWN

INTERNAL FIX DESCRIPTION:

a. Changes are made in the definitions of the following macros in sslts.h:

staxi10:/ade/fjlee_core112/oracle/oracore3/src/coreos/slts/inc> ade diff -label sslts.h

Executing tool /bin/diff ...

File1: /ade/fjlee_core112/oracle/tmp/sslts.h#ORACORE_11.2.0.4.0_AIX.PPC64_120326.721122

File2: /ade/fjlee_core112/oracore3/src/coreos/slts/inc/sslts.h

5c5,6

< /* Copyright (c) Oracle Corporation 1997, 1998. All Rights Reserved. */

---

> /* Copyright (c) 1996, 2012, Oracle and/or its affiliates.

> All rights reserved. */

48a50

> fjlee 04/09/12 - Fix bug 13940331

1085c1087

< (&attr, (flags & SLTST_INHERIT_SCHED

---

> /* Copyright (c) 1996, 2012, Oracle and/or its affiliates.

> All rights reserved. */

43a45

> fjlee

201a204

> #define SLTST_EXPLICIT_SCHED 0x00000002

259c262,263

< (ubig_ora)(SLTS_USESLTSFLAGS| SLTS_THR_BOUND));

---

BACKPORT FEASIBLE: Yes

FORWARD MERGE REQUIRED: No

The generic bug13935219 is filed to fix this problem on all platforms generically.

REDISCOVERY INFORMATION:

WORKAROUND:



root@cntp2202:/] ps -mp 11272300 -o THREAD

USER PID PPID TID ST CP PRI SC WCHAN F TT BND COMMAND

oracle 11272300 11796656 - A 1 0 32 * 10240103 - - /oracle/GRID/11203/bin/ocssd.bin

- - - 36372583 S 0 60 1 f1000f0a10022b40 8410400 - - -

- - - 36438127 S 0 60 1 f1000f0a10022c40 8410400 - - -

- - - 36569209 S 0 60 1 f1000f0a10022e40 8410400 - - -

- - - 37552261 S 0 60 1 f1000f0a10023d40 8410400 - - -

…….

- - - 48496841 S 0 60 1 f1000f0a1002e440 8410400 - - -

- - - 48562379 S 0 60 1 - 418400 - - -

- - - 48627917 S 0 60 1 f1000f0a1002e640 8410400 - - -

- - - 48693455 S 0 60 1 f1000f0a1002e740 8410400 - - -

- - - 48758993 S 0 60 1 f1000f0a1002e840 8410400 - - -

- - - 48824531 S 0 60 1 - 418400 - - -

- - - 49152223 Z 0 60 1 - c00001 - - -

- - - 68026571 Z 0 60 1 - c00001 - - -

- - - 68288571 Z 0 60 1 - c00001 - - -

root@cntp2202:/]

- CSSD thread priority before the Bug 13940331 patch (listing above)


- CSSD thread priority after the Bug 13940331 patch

oracle@sgmepd11:/u01/app/oracle/product/11.2.0/dbhome_1 # ps -mp 3015026 -o THREAD

USER PID PPID TID S CP PRI SC WCHAN F TT BND COMMAND

grid 3015026 3605094 - A 3 0 32 * 10240103 - - /u01/app/11.2.0/grid/

- - - 26280097 S 0 0 1 f1000f0a10019140 8410400 - - -

- - - 26935467 S 1 0 1 - 418400 - - -

- - - 27394273 S 0 0 1 f1000f0a1001a240 8410400 - - -

- - - 32243907 Z 0 0 1 - c00001 - - -

- - - 39583857 S 0 0 1 f1000f0a10025c40 8410400 - - -

- - - 43515929 Z 0 0 1 - c00001 - - -

- - - 60752031 S 0 0 1 - 418400 - - -

- - - 73138383 S 0 0 1 f1000f0a10045c40 8410400 - - -

………

- - - 22479531 S 1 0 1 - 418400 - - -

- - - 22545085 Z 0 0 1 - c00001 - - -

- - - 22610627 S 0 0 1 - 418400 - - -

- - - 35652099 S 0 0 1 f1000f0a10122040 8410400 - - -

- - - 40698613 S 0 0 1 f1000f0a10126d40 8410400 - - -

- - - 40829685 S 0 0 1 - 418400 - - -

- - - 52953797 S 0 0 1 f1000f0a10132840 8410400 - - -

- - - 61801081 S 0 0 1 f1000f0a1013af40 8410400 - - -

oracle@sgmepd11:/u01/app/oracle/product/11.2.0/dbhome_1 #
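The difference between the two `ps -mp ... -o THREAD` listings is the PRI column of the thread rows: 60 before the patch, 0 after it, while the process row shows 0 in both. A small sketch that extracts that column from a few rows copied from the listings follows. It assumes the column order of the header line (PRI is the seventh field), and the helper name is hypothetical. On AIX a lower PRI value is generally the more favored one, so the pre-patch threads appear not to have inherited the process priority.

```python
BEFORE = """\
- - - 36372583 S 0 60 1 f1000f0a10022b40 8410400 - - -
- - - 48562379 S 0 60 1 - 418400 - - -
- - - 49152223 Z 0 60 1 - c00001 - - -
"""

AFTER = """\
- - - 26280097 S 0 0 1 f1000f0a10019140 8410400 - - -
- - - 26935467 S 1 0 1 - 418400 - - -
- - - 32243907 Z 0 0 1 - c00001 - - -
"""


def thread_priorities(text):
    """Collect the distinct PRI values of the per-thread rows (USER column is '-')."""
    priorities = set()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 7 and fields[0] == "-" and fields[3].isdigit():
            priorities.add(int(fields[6]))
    return priorities


print(thread_priorities(BEFORE), thread_priorities(AFTER))
```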

Conclusion

□ Do not assume that a heartbeat error means a hardware (interconnect, I/O) problem

- Check for scheduling bugs (OS, Oracle)

- Check performance data (CPU, memory, swap, paging, top processes)

- Check for heartbeat errors on every node

(a network heartbeat failure should show up on at least two nodes)

- Check the CSSAgent/Monitor logs (they detect CSSD thread hangs)

→ A scheduling problem can make the cluster daemons misdetect a failure and reboot a node
