Clusterware Testing Failures



Oracle RAC Private Network Failure - Test Cases

(Flattened table columns: Clusterware Test Category; Test Code / Action / Target; Detailed Test Execution; Expected Test Outcome; Actual Test Outcome)

Test 1 PRV-Network-1 Preconditions:

Initiate all Workloads (esp. those that flood the private interconnect used by RAC cache fusion and CSS heartbeats, and OS stress)

Remove SINGLE Primary private network cable ==> CSS master node

Identify Vendor, CSS and CRS master nodes

Note: Since CSS only supports one physical interface in pre-11.2.0.2 versions, network interface teaming/bonding is needed on the private interconnects to accomplish this. The interconnect should be bonded.

Steps:

1- Physically remove the Primary private network cable from the CSS master / the vendor clusterware master

2- Wait 600 seconds

3- Restore the Primary network cable

4- Remove the Secondary network cable from the CSS master

5- Wait 600 seconds

6- Restore the Secondary network cable

Test 2 PRV-Network-2 Preconditions:

Initiate all Workloads (esp. those that flood the private interconnect used by RAC cache fusion and CSS heartbeats, and OS stress)

Remove Primary + Secondary private network cables ==> CSS master node

Identify both CSS and CRS master nodes

Note: Since CSS only supports one physical interface in pre-11.2.0.2 versions, network interface teaming/bonding is needed on the private interconnects to accomplish this.

Sanity Check

Steps:

1- Physically remove both Primary + Secondary private network cables from the CSS master

2- Re-attach both network cables after the CSS master is evicted (by either Oracle Clusterware or vendor clusterware, if present) and rebooted, in pre-11.2.0.2 versions. In 11.2.0.2, if the node doesn't reboot after cssd terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, re-attach both network cables, then manually use `crsctl start crs` to start the CRS stack (see the recovery command sketch after these steps).

3- Wait until the former CSS master node rejoins the cluster
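For reference, a minimal sketch of the 11.2.0.2 manual recovery sequence in step 2, run as root on the evicted node (assumes the Grid home bin directory is on root's PATH; that path detail is not specified in this document):

    # stop whatever is left of the clusterware stack after cssd terminates
    crsctl stop crs -f
    # ... physically re-attach both private network cables ...
    # restart the full CRS stack and confirm it comes up
    crsctl start crs
    crsctl check crs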

Variants:

Var 1 - Remove both Primary + Secondary private networks from the CRS (in lieu of CSS) master

Var 2 - Remove both cables, then replace them before either the vendor clusterware or CSS heartbeats expire. The preferred result is that no actions are taken, including by RAC and ASM.



Test 3 PRV-Network-3 Preconditions:

Initiate client Workloads

Remove Primary + Secondary private network cables ==> T staggered RAC hosts

Identify the Vendor and CSS master nodes

Identify a set of T=N-1 RAC hosts (N = number of clustered database hosts), including the CSS master.

Sanity Check

Steps:

1- Physically remove both Primary + Secondary private network cables from the current CSS master

2- Re-attach the private network cables after the CSS master is evicted and rebooted, in pre-11.2.0.2 versions. In 11.2.0.2, if the node doesn't reboot after cssd terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, re-attach both network cables, then manually use `crsctl start crs`.

3- Repeat Step 1 against the surviving nodes (do not wait for the rebooted node to come back and rejoin) until there is only one surviving node left.

Variants:

Var 1: Split the cluster such that the lowest-order vendor and CSS nodes are left in the smaller node group. For example, in a 4-node cluster, split the cluster 1-3 with the singleton as the lowest node. Similarly, for a 5-node cluster, split it 2-3 with the 2 in the lowest node group.

Var 2: Repeat these tests using ifdown rather than cable disconnect (see the sketch below).
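A minimal sketch of the ifdown variant; the interface name eth1 is an assumption standing in for whatever interface carries the private interconnect on the CSS master:

    # run as root on the CSS master; eth1 is a placeholder for the private interconnect interface
    ifconfig eth1 down     # or: ip link set eth1 down
    # ... wait for the eviction and recovery, then bring the interface back ...
    ifconfig eth1 up       # or: ip link set eth1 up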

Test 4 PRV-Network-4 Preconditions:

Initiate client Workloads

Power off private network switches. For redundant switches, power both down.

Identify the CSS master


Steps:

1- Power off both Primary and Secondary private network switches

2- Wait for at least CSS MISSCOUNT seconds before powering the private network switches back on (the configured value can be checked as sketched after these steps)

3- Wait until all nodes reboot and subsequently rejoin the cluster. In 11.2.0.2, if a node doesn't reboot after cssd terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, power the private network switch back on, then manually use `crsctl start crs`.
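If the current CSS misscount needs to be confirmed before timing the wait in step 2, a quick check run as root on any cluster node (the value is commonly 30 seconds, but confirm on the system under test):

    crsctl get css misscount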

    Variants:

    None

Test 5 PRV-Network-5 Preconditions:

Initiate client Workloads

Split brain resolution

Identify the CSS master

This test requires 2 network switches

Sanity Check

Note: Some vendor clusterware products may require the configuration of a quorum disk to be able to run this test.

BROWNOUT TIME DATA REQUIRED

Steps:

1- Pull the network cables simultaneously so that Node 1 can only communicate with Node 2, and Node 3 can only communicate with Node 4. Here it is assumed that either N1 or N2 is the CSS master.

2- Wait for at least 2 * CSS MISSCOUNT seconds so the split-brain resolution algorithm kicks in. N3 and N4 should reboot in pre-11.2.0.2 versions. In 11.2.0.2, if a node doesn't reboot after cssd terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, re-attach both network cables, then manually use `crsctl start crs`.

3- Restore the network so N3 and N4 can rejoin the cluster.



Expected Test Outcomes

The bonding software should fail over with no impact on CSS, ASM, or RAC.

    Vendor Clusterware:

    - Zero impact on all clusterware daemons

    ASM and RAC:

    - Zero impact on stability of all RAC hosts

    - Zero node evictions or cluster failures

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing (see the collection sketch below).
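The 60-second collection loop requested here (and in the later test cases) can be scripted; a minimal sketch, where the /crs_log directory matches the directory used later for the crsstat snapshots and the file name is illustrative:

    # capture the clusterware resource state every 60s for the duration of the run
    while true; do
        date               >> /crs_log/crsstat_loop.out
        crsctl stat res -t >> /crs_log/crsstat_loop.out
        sleep 60
    done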

Vendor Clusterware:

- When the vendor clusterware heartbeat is on the same private network (recommended), it detects the private network failure and determines cluster membership changes. Oracle Clusterware receives the notification and reports the membership change to CRS and the RDBMS. This is the best result for our shared customers.

Oracle Clusterware:

- When there is no vendor clusterware heartbeat, the customer must wait for MISSCOUNT to expire. (See misscount tuning.)

RAC:

- Zero impact on stability of surviving RAC hosts.

- Uninterrupted cluster-wide I/O operations.

- No report of complete cluster failures/reboots.

- For 11.2.0.2, if all CRS resources and ASM & RDBMS processes are cleaned up prior to cssd terminating, the node won't reboot after cssd terminates. Otherwise, the node will still reboot.

- Oracle Clusterware resources managed by the evicted node either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.



- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

- For 11gR2, collect `crsctl stat res -t` results in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

- For a policy-managed DB, the evicted server will be moved out of the Oracle server pool. If there is a server in the Free pool, that server will be added to the Oracle server pool and the DB instance can be started on it automatically.

Vendor Clusterware:

- Same as RAC

RAC:

- All N-1 node evictions result in successful cluster rejoins.

- Zero impact on RAC host stability.

- Uninterrupted cluster-wide I/O operations.

- No report of complete cluster failures/reboots.

- For 11.2.0.2, if all CRS resources and ASM & RDBMS processes are cleaned up prior to cssd terminating, the node won't reboot after cssd terminates. Otherwise, the node will still reboot.

- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

- Oracle Clusterware resources managed by the CRS master either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.

- After nodes come back, the SCAN VIP and SCAN Listener will disperse to different nodes; they should not all be on one node.

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

RAC:

- All node evictions result in successful cluster rejoins.

- Zero impact on RAC host stability.

- Uninterrupted cluster-wide I/O operations at both node leave and node join, as measured by the client application.

- No report of complete cluster failures/reboots.


- For 11.2.0.2, if all CRS resources and ASM & RDBMS processes are cleaned up prior to cssd terminating, the node won't reboot after cssd terminates. Otherwise, the node will still reboot.

- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

- Oracle Clusterware resources managed by the CRS master either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.

- After nodes come back, the SCAN VIP and SCAN Listener will disperse to different nodes; they should not all be on one node.

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

N3 and N4 reboot.

N3 and N4 rejoin the cluster.

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

- For 11.2.0.2, if all CRS resources and ASM & RDBMS processes are cleaned up prior to cssd terminating, the node won't reboot after cssd terminates. Otherwise, the node will still reboot.

- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.


Oracle RAC Public Network Failure - Test Cases

(Flattened table columns: Clusterware Test Category; Test Code / Action / Target; Detailed Test Execution; Expected Test Outcome; Actual Test Outcome)

Test 1 Pub-Network-1 Preconditions:

Initiate all Workloads

Remove Primary public network cable ==> CRS master node

Identify both CSS and CRS master nodes

    Steps:

    1- Physically remove the Primary public network cable

    from the CRS master

    2- Wait 120 seconds

    3- Restore Primary public network cable

4- Remove the Secondary network cable from the CRS master

    5- Wait 120 seconds

    6- Restore Secondary public network cable

    Variants:

    None

Test 2 Pub-Network-2 Preconditions:

Initiate all Workloads

Remove Primary + Secondary public network cables ==> CRS master node

    Identify both CSS and CRS master nodes

Sanity Check

Steps:

1- Physically remove both Primary + Secondary public network cables from the CRS master (do `crsctl stat res -t > crsstat.0` before removing the cables)

2- Wait until the Oracle VIP and dependent services fail over (i.e. those services whose CRS placement policies allow them to do so).

3- Note the time it takes for CRS to fail over the VIP (do `crsctl stat res -t > crsstat.1`)

4- Re-attach both public network cables.

Note: In 11gR2, the VIP should fail back automatically without human intervention, but the SCAN VIP and SCAN Listener shouldn't fail back automatically. (Do `crsctl stat res -t > crsstat.2` after re-attaching the public network cables in 11gR2.) Save crsstat.[012] to the /crs_log dir (see Appendix C). A capture sketch follows these steps.
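A minimal sketch of the three snapshots referenced above, assuming /crs_log already exists and the commands are run by a user with access to crsctl (the directory name comes from the steps; the timing comments are illustrative):

    crsctl stat res -t > /crs_log/crsstat.0   # before pulling the public network cables
    # ... pull both cables, wait for the VIP and dependent services to fail over ...
    crsctl stat res -t > /crs_log/crsstat.1   # after the failover completes
    # ... re-attach both cables, allow the VIP to fail back (11gR2) ...
    crsctl stat res -t > /crs_log/crsstat.2   # after re-attaching the cables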



Expected Test Outcomes

Vendor Clusterware:

- Same as RAC

RAC:

- Zero impact on stability of all RAC hosts

- Zero node evictions or cluster failures

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

    NAS/SAN:

    - No data corruption or I/O interruption reported from surviving nodes at

    both node leave and node join

    RAC:

    - Zero impact on stability of RAC hosts.

    - Uninterrupted cluster-wide I/O operations

    - No report of complete cluster failures/reboots.

- Oracle Clusterware resources managed by the affected node either go OFFLINE or fail over to another RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener.



Oracle RAC HOST Failures - Test Cases

(Flattened table columns: Clusterware Test Category; Test Code / Action / Target; Detailed Test Execution; Expected Test Outcome; Actual Test Outcome)

Test 1 Host-Test-1 Preconditions:

Initiate client Workloads

Hard fail (e.g. power off, hard reset) RAC host ==> Vendor master node

Induce stress conditions: high CPU in real and user time; low swap space.

Identify vendor, CSS and CRS master nodes

Steps:

1- Forcibly reset or power off the current vendor master

2- Wait until the original CSS master reboots and rejoins the cluster

Variants:

Var 1. Split the cluster during the clusterware reconfiguration.

Var 2. Fail the node that is the CSS master rather than the clusterware master, or fail both concurrently.

Var 3. Have CRS operations in progress, such as VIP failover, and hard reset the node.

Expected Test Outcome:

Vendor Clusterware:

RAC:

- Zero impact on stability of all surviving RAC hosts

- No other RAC hosts should fail as a result of the master node failure

- SCAN VIP and SCAN Listener should fail over to another node if they were on this node before the hard fail

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

Test 2 Host-Test-2 Preconditions:

Initiate client Workloads

Power off multiple RAC hosts ==> T staggered RAC hosts

Identify the vendor and CSS master nodes

Identify a set of T=N-1 RAC hosts (N = number of clustered database hosts), including the CSS master.

Sanity Check

RECONFIG TIME DATA REQUIRED

Steps:

1- Reboot the current CSS master

2- Repeat Step 1 against the surviving nodes until there is only one surviving node left

3- If possible, determine the interim time (in sec) that database I/Os experience freezes, if any

Variants:

None

Expected Test Outcome:

Vendor Clusterware:

- Vendor clusterware detects node members leaving and subsequently rejoining, and determines cluster reconfiguration changes

RAC:

- All node departures result in successful cluster rejoins.

- Zero impact on surviving RAC host stability.

- Uninterrupted cluster-wide I/O operations.

- No report of complete cluster failures/reboots.

- Oracle Clusterware resources managed by the CRS master either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: SCAN VIP and SCAN Listener, which should fail over to another node if they were on this node before the hard fail.

- After nodes come back, the SCAN VIP and SCAN Listener will disperse to different nodes; they should not all be on one node.

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.


Oracle High Availability Testing - Test Cases

(Flattened table columns: Clusterware Test Category; Test Code / Action / Target; Detailed Test Execution; Expected Test Outcome; Actual Test Outcome)

Test 1 HA-Test 1 Preconditions:

Type `cluvfy` to see all available command syntax and options

Run multiple cluvfy operations during the Oracle Clusterware and RAC install ==> All RAC hosts

Sanity Check

Steps:

1- Run the cluvfy precondition check

2- Do the next install step

3- Run the cluvfy post-condition check (`cluvfy comp software -n node_list`) to check the file permissions

No need to collect CRS/RDBMS logs for this test; you need to submit the cluvfy output. (A usage sketch follows.)
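A minimal sketch of the pre/post checks around a Clusterware install step, assuming a two-node cluster with hostnames node1 and node2 (the node names are placeholders; the stage names are the standard pre/post Clusterware-install checks):

    # pre-install verification before the Clusterware install step
    cluvfy stage -pre crsinst -n node1,node2
    # ... perform the next install step ...
    # post-install verification, plus the software/file-permission component check
    cluvfy stage -post crsinst -n node1,node2
    cluvfy comp software -n node1,node2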

Test 2 HA-Test 2 Preconditions:

Initiate all Workloads

Run concurrent `crsctl start/stop crs` commands to stop or start Oracle Clusterware in planned mode ==> All RAC hosts

Identify both CSS and CRS master nodes

Type `crsctl` as root to see all available command syntax and options

Sanity Check

Steps:

1- As root user, run the `crsctl stop crs` command concurrently on more than one RAC host, to stop the resident Oracle Clusterware stack

2- Wait until the target Oracle Clusterware stack is fully stopped (via the `ps` command)

3- As root user, run the `crsctl start crs` command concurrently on more than one RAC host, to start the resident Oracle Clusterware stack

(A concurrent-run sketch follows these steps.)
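A minimal sketch of driving the concurrent stop/start from a central host, assuming passwordless SSH as root to hosts named node1 and node2 and a Grid home of /u01/app/11.2.0/grid (the hostnames, SSH setup, and Grid home path are all assumptions):

    # stop the stack on two hosts at roughly the same time
    ssh root@node1 '/u01/app/11.2.0/grid/bin/crsctl stop crs' &
    ssh root@node2 '/u01/app/11.2.0/grid/bin/crsctl stop crs' &
    wait
    # confirm the daemons are gone on each host, then start concurrently
    ssh root@node1 'ps -ef | grep -E "crsd|ocssd|evmd" | grep -v grep'
    ssh root@node1 '/u01/app/11.2.0/grid/bin/crsctl start crs' &
    ssh root@node2 '/u01/app/11.2.0/grid/bin/crsctl start crs' &
    wait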

Test 3 HA-Test 3 Preconditions:

Initiate all Workloads

Run other concurrent crsctl commands, such as `crsctl check crs` ==> All RAC hosts

Identify both CSS and CRS master nodes

Type `crsctl` as root to see all available command syntax and options

Steps:

1- As root user, run `crsctl check crs` commands concurrently on all nodes

2- As root user, run `crsctl check cluster -all` commands concurrently on all nodes

(A sketch follows.)
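A minimal sketch of the two checks, run as root on each node; capturing the output per host is an illustrative convention for the "collect output" requirement, not prescribed by this document:

    crsctl check crs          > check_crs.$(hostname).out
    crsctl check cluster -all > check_cluster.$(hostname).out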



Test 4 HA-Test 4 Preconditions:

Remove and add voting disk files ==> Random RAC hosts

Ensure the Oracle Clusterware has 3 or more CSS voting disk files

Not the 11gR2 new feature; the voting files are not in an ASM diskgroup.

Type `crsctl` as root to see all available command syntax and options

Steps:

1- Make sure ocssd.bin is up on all nodes

2- As root user, run `crsctl query css votedisk`

3- Run multiple `crsctl delete css votedisk` commands until only one voting disk is left; CRS should not allow you to delete the very last one.

4- Run `crsctl add css votedisk` (e.g. by adding back the voting disk files that were previously deleted)

5- Finally, run `crsctl query css votedisk` again

(A command sketch follows these steps.)
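A minimal sketch of one remove/add cycle, assuming non-ASM voting files and using /dev/raw/raw3 as a placeholder path; depending on the version, `crsctl delete css votedisk` may expect the voting disk path or the File Universal Id shown by the query:

    crsctl query css votedisk                  # list current voting disks
    crsctl delete css votedisk /dev/raw/raw3   # remove one voting disk (placeholder path)
    crsctl add css votedisk /dev/raw/raw3      # add it back
    crsctl query css votedisk                  # confirm the final state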


Expected Test Outcomes

Vendor Clusterware:

- Same as RAC

RAC:

- Correct cluster verification checks given the state of the cluster hardware and software

- Please provide CVU-related logs under $CRS_HOME/cv/log

Vendor Clusterware:

- N/A

RAC:

- Stop: All Oracle Clusterware daemons stop without leaving open ports or zombie processes

- Start: All Oracle Clusterware daemons start without error messages in stdout or any of the CRS, CSS or EVM traces

- Start: All registered HA resource states match the target states, as per `crsctl stat res -t`

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

Vendor Clusterware:

- Same as RAC

RAC:

- Both `crsctl check crs` and `crsctl check cluster -all` commands produce the appropriate, useful output without any error messages

- Collect the output for step 1 and step 2


RAC:

- Voting disk files are added and removed without failures or error messages

- The crsctl query presents the correct state of all voting disk files


11gR2 New Features Failover Cases

(Flattened table columns: Clusterware Test Category; Test Code / Action / Target; Detailed Test Execution; Expected Test Outcome; Actual Test Outcome)

Test 1 11gR2-Case-1 Preconditions:

11gR2 new features of using ASM Voting files and ASM OCR files; this is the OCR/VF migration.

Make sure non-ASM voting files are used.

Make sure no ASM OCR files are used.

Make sure at least one normal-redundancy ASM Diskgroup with three failgroups is created and its compatible.asm attribute is set to 11.2.

Sanity Check

Steps:

1- Make sure the CRS stack is running on all nodes.

2- Run `crsctl query css votedisk` to check the configured VFs;

3- Run `crsctl replace votedisk +{ASM_DG_NAME}` (as the crs user or root user);

4- Run `crsctl query css votedisk` to get the new VF list;

5- Run `ocrconfig -add +{ASM_DG_NAME}` as root user;

6- Run `ocrcheck` to verify the OCR files;

7- Restart the CRS stack and then verify the VF/OCR after it comes back

(A migration command sketch follows the variants.)

Variants:

1. Add up to 5 OCR files and restart the CRS stack;

2. Try to migrate the VF from ASM back to non-ASM files and then restart the CRS stack
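A minimal sketch of the migration sequence in the steps above, using +OCRVF as a placeholder diskgroup name standing in for {ASM_DG_NAME}:

    crsctl query css votedisk            # current (non-ASM) voting files
    crsctl replace votedisk +OCRVF       # move the voting files into the ASM diskgroup
    crsctl query css votedisk            # confirm the new VF list
    ocrconfig -add +OCRVF                # add an OCR location in the diskgroup (run as root)
    ocrcheck                             # verify the OCR locations
    crsctl stop crs && crsctl start crs  # restart the stack on each node, then re-verify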

Test 2 11gR2-Case-2 Preconditions:

crsctl commands to manage the Oracle Clusterware stack

The CRS stack is up and running on all nodes.

Sanity Check

Steps:

1- Run `crsctl check cluster -all` to get the stack status on all cluster nodes. Make sure the stack status of all cluster nodes is correct

2- Run `crsctl stop cluster -all` to stop all CRS resources (CSSD, CRSD, EVMD) along with application resources

3- Run `crsctl status cluster -all` to make sure CRS resources are OFFLINE

4- Run `crsctl start cluster -all` to bring back the whole cluster stack

(A sketch follows.)
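A minimal sketch of the planned stop/start cycle above, run as root from any one node; the ps check mirrors the verification suggested in the expected outcome:

    crsctl check cluster -all    # stack status on every node
    crsctl stop cluster -all     # stop CSSD/CRSD/EVMD and application resources cluster-wide
    ps -ef | grep -E 'ocssd|crsd|evmd' | grep -v grep   # should return nothing on each node
    crsctl start cluster -all    # bring the whole cluster stack back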

Test 3 11gR2-Case-3 Preconditions:

    Initiate Workloads


11gR2 new feature: the OCR is stored in an ASM diskgroup.

Sanity Check

Steps:

1- Make sure only ASM OCR files are used, via `ocrcheck -config`

2- Kill the ASM pmon process on the OCR Master node (see the sketch after the variants)

Variants:

Repeat the same test on a non-OCR Master node.
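A minimal sketch of locating and killing the ASM pmon process on the target node; the instance name suffix (+ASM1) and the use of kill -9 are illustrative assumptions:

    # find the ASM pmon background process (named like asm_pmon_+ASM1)
    ps -ef | grep '[a]sm_pmon'
    # forcibly kill it, substituting the PID found above
    kill -9 <asm_pmon_pid>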

Test 4 11gR2-Case-4 Preconditions:

11.2.0.2 new feature: Redundant Interconnect Usage (HAIP)

During Clusterware installation, configure 2 or more private interconnect interfaces.


Identify Vendor, CSS and CRS master nodes

Steps:

1- Physically remove all configured interfaces from the CSS master / the vendor clusterware master

2- Re-attach both network cables after the CSS master is evicted (by either Oracle Clusterware or vendor clusterware, if present) and rebooted. In 11.2.0.2, if the node doesn't reboot after cssd terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, re-attach both network cables, then manually use `crsctl start crs`.

3- Wait until the former CSS master node rejoins the cluster

Note: Use `ifconfig` and `oifcfg getif -global` to save the output before/after the fault injection (a capture sketch follows).
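A minimal sketch of the before/after capture suggested in the note; the output file names are illustrative:

    ifconfig -a          > ifconfig.before.out
    oifcfg getif -global > oifcfg.before.out
    # ... inject the fault and recover ...
    ifconfig -a          > ifconfig.after.out
    oifcfg getif -global > oifcfg.after.out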


Expected Test Outcomes

RAC:

- In 11gR2, Voting Disks can be in an ASM diskgroup, and they are managed by the ASM instance when they reside in an ASM diskgroup. This means we cannot add/delete individual voting disks in that case.

- In 11gR2, up to 5 OCRs are supported.

RAC:

- After running `crsctl stop cluster -all`, make sure all ocssd/evmd/crsd processes are stopped on all cluster nodes, via `ps`

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

Clusterware:

- Because the OCR is stored in ASM, if ASM fails or is brought down, CRSD will fail because it depends on ASM for I/O.


- ASM, CRSD and the RDBMS instance will be automatically restarted.

- After the CRSD restart, the state of all resources shouldn't change. (CRSD should recover the resources' previous state.)

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

If one of the interfaces fails, the HAIP address moves to another one of the configured interfaces.

Vendor Clusterware:

- Zero impact on all clusterware daemons

ASM and RAC:

- Zero impact on stability of all RAC hosts

- Zero node evictions or cluster failures

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

Oracle Clusterware:

- When there is no vendor clusterware heartbeat, the customer must wait for MISSCOUNT to expire. (See misscount tuning.)

    RAC:

    - Zero impact on stability of surviving RAC

    hosts.


    - Uninterrupted cluster-wide I/O operations.

    - No report of complete cluster failures/reboots.

- For 11.2.0.2, if all CRS resources and ASM & RDBMS processes are cleaned up prior to cssd terminating, the node won't reboot after cssd terminates.

- Oracle Clusterware resources managed by the evicted node either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.

- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

- For 11gR2, collect `crsctl stat res -t` results in a 60s loop from the beginning until the end of the run. Attach the output for auditing.