Clusterware Testing Failures



Oracle RAC Private Network Failure - Test Cases

(Flattened table columns: Clusterware Test Category; Test Code / Action / Target; Detailed Test Execution; Expected Test Outcome; Actual Test Outcome)

Test 1 PRV-Network-1 Preconditions:

Initiate all Workloads (esp. those that flood the private interconnect used by RAC cache fusion and CSS heartbeats, and OS stress)

Remove SINGLE Primary private network cable ==> CSS master node

Identify Vendor, CSS and CRS master nodes

Note: Since CSS only supports one physical interface in pre-11.2.0.2 versions, network interface teaming/bonding is needed on the private interconnects to accomplish this. The interconnect should be bonded.

Steps:

1- Physically remove the Primary private network cable from the CSS master / the vendor clusterware master

2- Wait 600 seconds

3- Restore the Primary network cable

4- Remove the Secondary network cable from the CSS master

5- Wait 600 seconds

6- Restore the Secondary network cable

Test 2 PRV-Network-2 Preconditions:

Initiate all Workloads (esp. those that flood the private interconnect used by RAC cache fusion and CSS heartbeats, and OS stress)

Remove Primary + Secondary private network cables ==> CSS master node

Identify both CSS and CRS master nodes

Note: Since CSS only supports one physical interface in pre-11.2.0.2 versions, network interface teaming/bonding is needed on the private interconnects to accomplish this.

Sanity Check

Steps:

1- Physically remove both Primary + Secondary private network cables from the CSS master

2- Re-attach both network cables after the CSS master is evicted (by either Oracle Clusterware or vendor clusterware, if present) and rebooted, in pre-11.2.0.2 versions. In 11.2.0.2, if the node doesn't reboot after cssd terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, re-attach both network cables, then manually use `crsctl start crs` to start the CRS stack (see the recovery command sketch after these steps).

3- Wait until the former CSS master node rejoins the cluster
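For reference, a minimal sketch of the 11.2.0.2 manual recovery sequence in step 2, run as root on the evicted node (assumes the Grid home bin directory is on root's PATH; that path detail is not specified in this document):

    # stop whatever is left of the clusterware stack after cssd terminates
    crsctl stop crs -f
    # ... physically re-attach both private network cables ...
    # restart the full CRS stack and confirm it comes up
    crsctl start crs
    crsctl check crs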

Variants:

Var 1 - Remove both Primary + Secondary private networks from the CRS (in lieu of CSS) master

Var 2 - Remove both cables, then replace them before either the vendor clusterware or CSS heartbeats expire. The preferred result is that no actions are taken, including by RAC and ASM.



Test 3 PRV-Network-3 Preconditions:

Initiate client Workloads

Remove Primary + Secondary private network cables ==> T staggered RAC hosts

Identify the Vendor and CSS master nodes

Identify a set of T=N-1 RAC hosts (N = number of clustered database hosts), including the CSS master.

Sanity Check

Steps:

1- Physically remove both Primary + Secondary private network cables from the current CSS master

2- Re-attach the private network cables after the CSS master is evicted and rebooted, in pre-11.2.0.2 versions. In 11.2.0.2, if the node doesn't reboot after cssd terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, re-attach both network cables, then manually use `crsctl start crs`.

3- Repeat Step 1 against the surviving nodes (do not wait for the rebooted node to come back and rejoin) until there is only one surviving node left.

Variants:

Var 1: Split the cluster such that the lowest-order vendor and CSS nodes are left in the smaller node group. For example, in a 4-node cluster, split the cluster 1-3 with the singleton as the lowest node. Similarly, for a 5-node cluster, split it 2-3 with the 2 in the lowest node group.

Var 2: Repeat these tests using ifdown rather than cable disconnect (see the sketch below).
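A minimal sketch of the ifdown variant; the interface name eth1 is an assumption standing in for whatever interface carries the private interconnect on the CSS master:

    # run as root on the CSS master; eth1 is a placeholder for the private interconnect interface
    ifconfig eth1 down     # or: ip link set eth1 down
    # ... wait for the eviction and recovery, then bring the interface back ...
    ifconfig eth1 up       # or: ip link set eth1 up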

Test 4 PRV-Network-4 Preconditions:

Initiate client Workloads

Power off private network switches. For redundant switches, power both down.

Identify the CSS master


Steps:

1- Power off both Primary and Secondary private network switches

2- Wait for at least CSS MISSCOUNT seconds before powering the private network switches back on (the configured value can be checked as sketched after these steps)

3- Wait until all nodes reboot and subsequently rejoin the cluster. In 11.2.0.2, if a node doesn't reboot after cssd terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, power the private network switch back on, then manually use `crsctl start crs`.
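If the current CSS misscount needs to be confirmed before timing the wait in step 2, a quick check run as root on any cluster node (the value is commonly 30 seconds, but confirm on the system under test):

    crsctl get css misscount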

    Variants:

    None

Test 5 PRV-Network-5 Preconditions:

Initiate client Workloads

Split brain resolution

Identify the CSS master

This test requires 2 network switches

Sanity Check

Note: Some vendor clusterware products may require the configuration of a quorum disk to be able to run this test.

BROWNOUT TIME DATA REQUIRED

Steps:

1- Pull the network cables simultaneously so that Node 1 can only communicate with Node 2, and Node 3 can only communicate with Node 4. Here it is assumed that either N1 or N2 is the CSS master.

2- Wait for at least 2 * CSS MISSCOUNT seconds so the split-brain resolution algorithm kicks in. N3 and N4 should reboot in pre-11.2.0.2 versions. In 11.2.0.2, if a node doesn't reboot after cssd terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, re-attach both network cables, then manually use `crsctl start crs`.

3- Restore the network so N3 and N4 can rejoin the cluster.



Expected Test Outcomes

The bonding software should fail over with no impact on CSS, ASM, or RAC.

    Vendor Clusterware:

    - Zero impact on all clusterware daemons

    ASM and RAC:

    - Zero impact on stability of all RAC hosts

    - Zero node evictions or cluster failures

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing (see the collection sketch below).
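The 60-second collection loop requested here (and in the later test cases) can be scripted; a minimal sketch, where the /crs_log directory matches the directory used later for the crsstat snapshots and the file name is illustrative:

    # capture the clusterware resource state every 60s for the duration of the run
    while true; do
        date               >> /crs_log/crsstat_loop.out
        crsctl stat res -t >> /crs_log/crsstat_loop.out
        sleep 60
    done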

Vendor Clusterware:

- When the vendor clusterware heartbeat is on the same private network (recommended), it detects the private network failure and determines cluster membership changes. Oracle Clusterware receives the notification and reports the membership change to CRS and the RDBMS. This is the best result for our shared customers.

Oracle Clusterware:

- When there is no vendor clusterware heartbeat, the customer must wait for MISSCOUNT to expire. (See misscount tuning.)

RAC:

- Zero impact on stability of surviving RAC hosts.

- Uninterrupted cluster-wide I/O operations.

- No report of complete cluster failures/reboots.

- For 11.2.0.2, if all CRS resources and ASM & RDBMS processes are cleaned up prior to cssd terminating, the node won't reboot after cssd terminates. Otherwise, the node will still reboot.

- Oracle Clusterware resources managed by the evicted node either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.



- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

- For 11gR2, collect `crsctl stat res -t` results in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

- For a policy-managed DB, the evicted server will be moved out of the Oracle server pool. If there is a server in the Free pool, that server will be added to the Oracle server pool and the DB instance can be started on it automatically.

Vendor Clusterware:

- Same as RAC

RAC:

- All N-1 node evictions result in successful cluster rejoins.

- Zero impact on RAC host stability.

- Uninterrupted cluster-wide I/O operations.

- No report of complete cluster failures/reboots.

- For 11.2.0.2, if all CRS resources and ASM & RDBMS processes are cleaned up prior to cssd terminating, the node won't reboot after cssd terminates. Otherwise, the node will still reboot.

- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

- Oracle Clusterware resources managed by the CRS master either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.

- After nodes come back, the SCAN VIP and SCAN Listener will disperse to different nodes; they should not all be on one node.

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

RAC:

- All node evictions result in successful cluster rejoins.

- Zero impact on RAC host stability.

- Uninterrupted cluster-wide I/O operations at both node leave and node join, as measured by the client application.

- No report of complete cluster failures/reboots.


- For 11.2.0.2, if all CRS resources and ASM & RDBMS processes are cleaned up prior to cssd terminating, the node won't reboot after cssd terminates. Otherwise, the node will still reboot.

- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

- Oracle Clusterware resources managed by the CRS master either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.

- After nodes come back, the SCAN VIP and SCAN Listener will disperse to different nodes; they should not all be on one node.

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

N3 and N4 reboot.

N3 and N4 rejoin the cluster.

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

- For 11.2.0.2, if all CRS resources and ASM & RDBMS processes are cleaned up prior to cssd terminating, the node won't reboot after cssd terminates. Otherwise, the node will still reboot.

- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.


Oracle RAC Public Network Failure - Test Cases

(Flattened table columns: Clusterware Test Category; Test Code / Action / Target; Detailed Test Execution; Expected Test Outcome; Actual Test Outcome)

Test 1 Pub-Network-1 Preconditions:

Initiate all Workloads

Remove Primary public network cable ==> CRS master node

Identify both CSS and CRS master nodes

    Steps:

    1- Physically remove the Primary public network cable

    from the CRS master

    2- Wait 120 seconds

    3- Restore Primary public network cable

4- Remove the Secondary network cable from the CRS master

    5- Wait 120 seconds

    6- Restore Secondary public network cable

    Variants:

    None

Test 2 Pub-Network-2 Preconditions:

Initiate all Workloads

Remove Primary + Secondary public network cables ==> CRS master node

    Identify both CSS and CRS master nodes

Sanity Check

Steps:

1- Physically remove both Primary + Secondary public network cables from the CRS master (do `crsctl stat res -t > crsstat.0` before removing the cables)

2- Wait until the Oracle VIP and dependent services fail over (i.e. those services whose CRS placement policies allow them to do so).

3- Note the time it takes for CRS to fail over the VIP (do `crsctl stat res -t > crsstat.1`)

4- Re-attach both public network cables.

Note: In 11gR2, the VIP should fail back automatically without human intervention, but the SCAN VIP and SCAN Listener shouldn't fail back automatically. (Do `crsctl stat res -t > crsstat.2` after re-attaching the public network cables in 11gR2.) Save crsstat.[012] to the /crs_log dir (see Appendix C). A capture sketch follows these steps.
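A minimal sketch of the three snapshots referenced above, assuming /crs_log already exists and the commands are run by a user with access to crsctl (the directory name comes from the steps; the timing comments are illustrative):

    crsctl stat res -t > /crs_log/crsstat.0   # before pulling the public network cables
    # ... pull both cables, wait for the VIP and dependent services to fail over ...
    crsctl stat res -t > /crs_log/crsstat.1   # after the failover completes
    # ... re-attach both cables, allow the VIP to fail back (11gR2) ...
    crsctl stat res -t > /crs_log/crsstat.2   # after re-attaching the cables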



Expected Test Outcomes

Vendor Clusterware:

- Same as RAC

RAC:

- Zero impact on stability of all RAC hosts

- Zero node evictions or cluster failures

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

    NAS/SAN:

    - No data corruption or I/O interruption reported from surviving nodes at

    both node leave and node join

    RAC:

    - Zero impact on stability of RAC hosts.

    - Uninterrupted cluster-wide I/O operations

    - No report of complete cluster failures/reboots.

- Oracle Clusterware resources managed by the affected node either go OFFLINE or fail over to another RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener.



Oracle RAC HOST Failures - Test Cases

(Flattened table columns: Clusterware Test Category; Test Code / Action / Target; Detailed Test Execution; Expected Test Outcome; Actual Test Outcome)

Test 1 Host-Test-1 Preconditions:

Initiate client Workloads

Hard fail (e.g. power off, hard reset) RAC host ==> Vendor master node

Induce stress conditions: high CPU in real and user time; low swap space.

Identify vendor, CSS and CRS master nodes

Steps:

1- Forcibly reset or power off the current vendor master

2- Wait until the original CSS master reboots and rejoins the cluster

Variants:

Var 1. Split the cluster during the clusterware reconfiguration.

Var 2. Fail the node that is the CSS master rather than the clusterware master, or fail both concurrently.

Var 3. Have CRS operations in progress, such as VIP failover, and hard reset the node.

Expected Test Outcome:

Vendor Clusterware:

RAC:

- Zero impact on stability of all surviving RAC hosts

- No other RAC hosts should fail as a result of the master node failure

- SCAN VIP and SCAN Listener should fail over to another node if they were on this node before the hard fail

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

Test 2 Host-Test-2 Preconditions:

Initiate client Workloads

Power off multiple RAC hosts ==> T staggered RAC hosts

Identify the vendor and CSS master nodes

Identify a set of T=N-1 RAC hosts (N = number of clustered database hosts), including the CSS master.

Sanity Check

RECONFIG TIME DATA REQUIRED

Steps:

1- Reboot the current CSS master

2- Repeat Step 1 against the surviving nodes until there is only one surviving node left

3- If possible, determine the interim time (in sec) that database I/Os experience freezes, if any

Variants:

None

Expected Test Outcome:

Vendor Clusterware:

- Vendor clusterware detects node members leaving and subsequently rejoining, and determines cluster reconfiguration changes

RAC:

- All node departures result in successful cluster rejoins.

- Zero impact on surviving RAC host stability.

- Uninterrupted cluster-wide I/O operations.

- No report of complete cluster failures/reboots.

- Oracle Clusterware resources managed by the CRS master either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: SCAN VIP and SCAN Listener, which should fail over to another node if they were on this node before the hard fail.

- After nodes come back, the SCAN VIP and SCAN Listener will disperse to different nodes; they should not all be on one node.

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.


Oracle High Availability Testing - Test Cases

(Flattened table columns: Clusterware Test Category; Test Code / Action / Target; Detailed Test Execution; Expected Test Outcome; Actual Test Outcome)

Test 1 HA-Test 1 Preconditions:

Type `cluvfy` to see all available command syntax and options

Run multiple cluvfy operations during the Oracle Clusterware and RAC install ==> All RAC hosts

Sanity Check

Steps:

1- Run the cluvfy precondition check

2- Do the next install step

3- Run the cluvfy post-condition check (`cluvfy comp software -n node_list`) to check the file permissions

No need to collect CRS/RDBMS logs for this test; you need to submit the cluvfy output. (A usage sketch follows.)
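A minimal sketch of the pre/post checks around a Clusterware install step, assuming a two-node cluster with hostnames node1 and node2 (the node names are placeholders; the stage names are the standard pre/post Clusterware-install checks):

    # pre-install verification before the Clusterware install step
    cluvfy stage -pre crsinst -n node1,node2
    # ... perform the next install step ...
    # post-install verification, plus the software/file-permission component check
    cluvfy stage -post crsinst -n node1,node2
    cluvfy comp software -n node1,node2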

Test 2 HA-Test 2 Preconditions:

Initiate all Workloads

Run concurrent `crsctl start/stop crs` commands to stop or start Oracle Clusterware in planned mode ==> All RAC hosts

Identify both CSS and CRS master nodes

Type `crsctl` as root to see all available command syntax and options

Sanity Check

Steps:

1- As root user, run the `crsctl stop crs` command concurrently on more than one RAC host, to stop the resident Oracle Clusterware stack

2- Wait until the target Oracle Clusterware stack is fully stopped (via the `ps` command)

3- As root user, run the `crsctl start crs` command concurrently on more than one RAC host, to start the resident Oracle Clusterware stack

(A concurrent-run sketch follows these steps.)
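A minimal sketch of driving the concurrent stop/start from a central host, assuming passwordless SSH as root to hosts named node1 and node2 and a Grid home of /u01/app/11.2.0/grid (the hostnames, SSH setup, and Grid home path are all assumptions):

    # stop the stack on two hosts at roughly the same time
    ssh root@node1 '/u01/app/11.2.0/grid/bin/crsctl stop crs' &
    ssh root@node2 '/u01/app/11.2.0/grid/bin/crsctl stop crs' &
    wait
    # confirm the daemons are gone on each host, then start concurrently
    ssh root@node1 'ps -ef | grep -E "crsd|ocssd|evmd" | grep -v grep'
    ssh root@node1 '/u01/app/11.2.0/grid/bin/crsctl start crs' &
    ssh root@node2 '/u01/app/11.2.0/grid/bin/crsctl start crs' &
    wait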

Test 3 HA-Test 3 Preconditions:

Initiate all Workloads

Run other concurrent crsctl commands, such as `crsctl check crs` ==> All RAC hosts

Identify both CSS and CRS master nodes

Type `crsctl` as root to see all available command syntax and options

Steps:

1- As root user, run `crsctl check crs` commands concurrently on all nodes

2- As root user, run `crsctl check cluster -all` commands concurrently on all nodes

(A sketch follows.)
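A minimal sketch of the two checks, run as root on each node; capturing the output per host is an illustrative convention for the "collect output" requirement, not prescribed by this document:

    crsctl check crs          > check_crs.$(hostname).out
    crsctl check cluster -all > check_cluster.$(hostname).out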



Test 4 HA-Test 4 Preconditions:

Remove and add voting disk files ==> Random RAC hosts

Ensure the Oracle Clusterware has 3 or more CSS voting disk files

Not the 11gR2 new feature; the voting files are not in an ASM diskgroup.

Type `crsctl` as root to see all available command syntax and options

Steps:

1- Make sure ocssd.bin is up on all nodes

2- As root user, run `crsctl query css votedisk`

3- Run multiple `crsctl delete css votedisk` commands until only one voting disk is left; CRS should not allow you to delete the very last one.

4- Run `crsctl add css votedisk` (e.g. by adding back the voting disk files that were previously deleted)

5- Finally, run `crsctl query css votedisk` again

(A command sketch follows these steps.)
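A minimal sketch of one remove/add cycle, assuming non-ASM voting files and using /dev/raw/raw3 as a placeholder path; depending on the version, `crsctl delete css votedisk` may expect the voting disk path or the File Universal Id shown by the query:

    crsctl query css votedisk                  # list current voting disks
    crsctl delete css votedisk /dev/raw/raw3   # remove one voting disk (placeholder path)
    crsctl add css votedisk /dev/raw/raw3      # add it back
    crsctl query css votedisk                  # confirm the final state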


Expected Test Outcomes

Vendor Clusterware:

- Same as RAC

RAC:

- Correct cluster verification checks given the state of the cluster hardware and software

- Please provide CVU-related logs under $CRS_HOME/cv/log

Vendor Clusterware:

- N/A

RAC:

- Stop: All Oracle Clusterware daemons stop without leaving open ports or zombie processes

- Start: All Oracle Clusterware daemons start without error messages in stdout or any of the CRS, CSS or EVM traces

- Start: All registered HA resource states match the target states, as per `crsctl stat res -t`

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

Vendor Clusterware:

- Same as RAC

RAC:

- Both `crsctl check crs` and `crsctl check cluster -all` commands produce the appropriate, useful output without any error messages

- Collect the output for step 1 and step 2


RAC:

- Voting disk files are added and removed without failures or error messages

- The crsctl query presents the correct state of all voting disk files


11gR2 New Features Failover Cases

(Flattened table columns: Clusterware Test Category; Test Code / Action / Target; Detailed Test Execution; Expected Test Outcome; Actual Test Outcome)

Test 1 11gR2-Case-1 Preconditions:

11gR2 new features of using ASM Voting files and ASM OCR files; this is the OCR/VF migration.

Make sure non-ASM voting files are used.

Make sure no ASM OCR files are used.

Make sure at least one normal-redundancy ASM Diskgroup with three failgroups is created and its compatible.asm attribute is set to 11.2.

Sanity Check

Steps:

1- Make sure the CRS stack is running on all nodes.

2- Run `crsctl query css votedisk` to check the configured VFs;

3- Run `crsctl replace votedisk +{ASM_DG_NAME}` (as the crs user or root user);

4- Run `crsctl query css votedisk` to get the new VF list;

5- Run `ocrconfig -add +{ASM_DG_NAME}` as root user;

6- Run `ocrcheck` to verify the OCR files;

7- Restart the CRS stack and then verify the VF/OCR after it comes back

(A migration command sketch follows the variants.)

Variants:

1. Add up to 5 OCR files and restart the CRS stack;

2. Try to migrate the VF from ASM back to non-ASM files and then restart the CRS stack
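A minimal sketch of the migration sequence in the steps above, using +OCRVF as a placeholder diskgroup name standing in for {ASM_DG_NAME}:

    crsctl query css votedisk            # current (non-ASM) voting files
    crsctl replace votedisk +OCRVF       # move the voting files into the ASM diskgroup
    crsctl query css votedisk            # confirm the new VF list
    ocrconfig -add +OCRVF                # add an OCR location in the diskgroup (run as root)
    ocrcheck                             # verify the OCR locations
    crsctl stop crs && crsctl start crs  # restart the stack on each node, then re-verify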

Test 2 11gR2-Case-2 Preconditions:

crsctl commands to manage the Oracle Clusterware stack

The CRS stack is up and running on all nodes.

Sanity Check

Steps:

1- Run `crsctl check cluster -all` to get the stack status on all cluster nodes. Make sure the stack status of all cluster nodes is correct

2- Run `crsctl stop cluster -all` to stop all CRS resources (CSSD, CRSD, EVMD) along with application resources

3- Run `crsctl status cluster -all` to make sure CRS resources are OFFLINE

4- Run `crsctl start cluster -all` to bring back the whole cluster stack

(A sketch follows.)
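A minimal sketch of the planned stop/start cycle above, run as root from any one node; the ps check mirrors the verification suggested in the expected outcome:

    crsctl check cluster -all    # stack status on every node
    crsctl stop cluster -all     # stop CSSD/CRSD/EVMD and application resources cluster-wide
    ps -ef | grep -E 'ocssd|crsd|evmd' | grep -v grep   # should return nothing on each node
    crsctl start cluster -all    # bring the whole cluster stack back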

Test 3 11gR2-Case-3 Preconditions:

    Initiate Workloads


11gR2 new feature: the OCR is stored in an ASM diskgroup.

Sanity Check

Steps:

1- Make sure only ASM OCR files are used, via `ocrcheck -config`

2- Kill the ASM pmon process on the OCR Master node (see the sketch after the variants)

Variants:

Repeat the same test on a non-OCR Master node.
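A minimal sketch of locating and killing the ASM pmon process on the target node; the instance name suffix (+ASM1) and the use of kill -9 are illustrative assumptions:

    # find the ASM pmon background process (named like asm_pmon_+ASM1)
    ps -ef | grep '[a]sm_pmon'
    # forcibly kill it, substituting the PID found above
    kill -9 <asm_pmon_pid>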

Test 4 11gR2-Case-4 Preconditions:

11.2.0.2 new feature: Redundant Interconnect Usage (HAIP)

During Clusterware installation, configure 2 or more private interconnect interfaces.


Identify Vendor, CSS and CRS master nodes

Steps:

1- Physically remove all configured interfaces from the CSS master / the vendor clusterware master

2- Re-attach both network cables after the CSS master is evicted (by either Oracle Clusterware or vendor clusterware, if present) and rebooted. In 11.2.0.2, if the node doesn't reboot after cssd terminates, use `crsctl stop crs -f` to stop the remaining clusterware processes, re-attach both network cables, then manually use `crsctl start crs`.

3- Wait until the former CSS master node rejoins the cluster

Note: Use `ifconfig` and `oifcfg getif -global` to save the output before/after the fault injection (a capture sketch follows).
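A minimal sketch of the before/after capture suggested in the note; the output file names are illustrative:

    ifconfig -a          > ifconfig.before.out
    oifcfg getif -global > oifcfg.before.out
    # ... inject the fault and recover ...
    ifconfig -a          > ifconfig.after.out
    oifcfg getif -global > oifcfg.after.out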


Expected Test Outcomes

RAC:

- In 11gR2, Voting Disks can be in an ASM diskgroup, and they are managed by the ASM instance when they reside in an ASM diskgroup. This means we cannot add/delete individual voting disks in that case.

- In 11gR2, up to 5 OCRs are supported.

RAC:

- After running `crsctl stop cluster -all`, make sure all ocssd/evmd/crsd processes are stopped on all cluster nodes, via `ps`

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

Clusterware:

- Because the OCR is stored in ASM, if ASM fails or is brought down, CRSD will fail because it depends on ASM for I/O.


- ASM, CRSD and the RDBMS instance will be automatically restarted.

- After the CRSD restart, the state of all resources shouldn't change. (CRSD should recover the resources' previous state.)

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

If one of the interfaces fails, the HAIP address moves to another one of the configured interfaces.

Vendor Clusterware:

- Zero impact on all clusterware daemons

ASM and RAC:

- Zero impact on stability of all RAC hosts

- Zero node evictions or cluster failures

- For 11gR2, collect `crsctl stat res -t` in a 60s loop from the beginning until the end of the run. Attach the output for auditing.

Oracle Clusterware:

- When there is no vendor clusterware heartbeat, the customer must wait for MISSCOUNT to expire. (See misscount tuning.)

    RAC:

    - Zero impact on stability of surviving RAC

    hosts.


    - Uninterrupted cluster-wide I/O operations.

    - No report of complete cluster failures/reboots.

- For 11.2.0.2, if all CRS resources and ASM & RDBMS processes are cleaned up prior to cssd terminating, the node won't reboot after cssd terminates.

- Oracle Clusterware resources managed by the evicted node either go OFFLINE or fail over to a surviving RAC node. Resources that fail over include: VIP, SCAN VIP, SCAN Listener and singleton services.

- For 11.2.0.2, the CVU resource should also fail over to a surviving RAC node.

- For 11gR2, collect `crsctl stat res -t` results in a 60s loop from the beginning until the end of the run. Attach the output for auditing.