
Cisco Prime Network 3.8 Administrator Guide

OL-24644-01

Chapter 17 Using Veritas Gateway Server High Availability

These topics describe the gateway server high availability solutions that use Veritas software. Use the architecture described in these topics as a reference point and adjust them to meet the needs of specific deployments. Both the local redundancy and geographic redundancy configurations are independent of and compatible with the unit server high availability mechanism (described in Unit Server High Availability and AVM Protection, page 16-1).

• Veritas Gateway Server High Availability Architecture, page 17-1

• Veritas Local Redundancy, page 17-6

• Veritas Geographical Redundancy, page 17-13

For information on the gateway server high availability solution that uses Red Hat Cluster Suite and Oracle Active Data Guard, see Using RHCS/ADG Gateway Server High Availability, page 18-1.

Veritas Gateway Server High Availability Architecture

While different deployments may necessitate different architectures, the architecture described in these topics can be used as a reference point and be adjusted to meet the needs of specific deployments. Both the local redundancy and geographic redundancy configurations are independent of and compatible with the unit server high availability mechanism (described in Unit Server High Availability and AVM Protection, page 16-1). However, do not manage remote failover through unit server high availability. For example, a unit at a local site should not have a standby unit (and protection group) at the remote failover site. Members of a unit protection group should be at the same site.

Veritas Local Redundancy

Gateway local redundancy is implemented as a 1+1 warm standby in a dual-node cluster, as shown in Figure 17-1. This architecture consists of one cluster that contains two servers, both of which are normally active:

• One Prime Network (P1) server which hosts the Prime Network gateway processes. The Prime Network server has its own logical IP address.

• One Oracle (P2) server which hosts the Oracle database application. The Oracle server also has its own logical IP address.

Both servers are active during normal operation. Each server provides redundancy for the other server in case of failure.



The license directory NETWORKHOME/Main/ha/licenses should contain a copy of all license files for both nodes. This directory will be available to both nodes because it is part of the partition that is shared. If you add new licenses, you must copy them to this directory and run the resetLicenses.pl command to read the licenses. See Licensing and Gateway Server High Availability, page 5-2.
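The copy step described above could be scripted along the following lines. This is a minimal sketch, not part of the product: the function name `stage_license_files` and the idea of a separate source directory holding the new license files are assumptions made for illustration. The sketch only copies files; it does not invoke resetLicenses.pl, because this chapter does not show that command's arguments.

```python
import shutil
from pathlib import Path


def stage_license_files(source_dir: Path, shared_license_dir: Path) -> list[str]:
    """Copy every regular file from source_dir into the shared license directory.

    Returns the sorted names of the files that were copied.
    Both arguments are hypothetical paths chosen by the caller.
    """
    shared_license_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for item in sorted(source_dir.iterdir()):
        if item.is_file():  # skip subdirectories
            shutil.copy2(item, shared_license_dir / item.name)
            copied.append(item.name)
    return copied
```

After the files are in place, run resetLicenses.pl as described in the licensing topic so that the new licenses are read.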

The Prime Network and Oracle applications maintain their data on separate external volumes. Each external volume is connected to both the Prime Network and Oracle servers using redundant connections. The external volumes can be mounted on either server using VxVM. The two servers maintain a heartbeat between them that allows the VCS application (running on each server) to monitor the health of the other server.

Figure 17-1 Architecture for Veritas Gateway Local Redundancy

For hardware and software requirements for local redundancy, see the Cisco Prime Network 3.8 Installation Guide.

[Figure 17-1 labels: GUI clients; Web clients; Customer OSS/BSS; Cisco Prime Network gateways; Heartbeat; External storage; Server P1 (primary Cisco Prime Network); Server P2 (primary Oracle); Cisco Prime Network units; Dual-node cluster]

http://www.cisco.com/en/US/products/ps6776/prod_installation_guides_list.html





Veritas Geographical Redundancy

Gateway geographical redundancy is implemented by taking the dual-node cluster (used in the local redundancy configuration) and adding a single-node cluster at a geographically remote site, which acts as a 1+1 cold standby for the primary cluster. The two clusters form a single global cluster using the VCS Global Cluster option.

This architecture consists of two clusters, as follows:

• The primary or local site (dual-node cluster) has the same characteristics as the Local Redundancy architecture—that is, it contains two servers (P1 and P2) and two external data volumes. One server hosts the Prime Network gateway processes, and the other server hosts the Oracle database application.

• The secondary or remote site (single-node cluster) has the following characteristics:

– Contains a single server (S1) which is normally running but has no active applications.

– Is connected to two additional external data volumes that are replicas of the two data volumes (Prime Network and Oracle data) at the primary site. If the primary site fails over, these additional data volumes become the primary copy of the system data.

The license directory NETWORKHOME/Main/ha/licenses on the active gateway should contain a copy of all license files for all servers. This directory will be available to all servers because it is part of the Prime Network partition that is replicated among servers. If you add new licenses, you must copy them to this directory and run the resetLicenses.pl command to read the licenses. See Licensing and Gateway Server High Availability, page 5-2.

Replication is performed either using VVR or via storage-based replication.

The local and remote clusters maintain a heartbeat over the IP network that allows the VCS application on each server to monitor the health of the other servers.

An example of a global cluster that implements geographical redundancy is provided in Figure 17-2.




Figure 17-2 Architecture for Veritas Gateway Geographical Redundancy

In the geographical redundancy solution illustrated in Figure 17-2, only the gateway is protected. A full disaster recovery capability may require an additional set of unit servers at the remote site. This is illustrated in Figure 17-3.

[Figure 17-2 labels: Primary site/local cluster (dual-node cluster); Secondary site/remote cluster (single-node cluster); GUI clients; Web clients; Customer OSS/BSS; Cisco Prime Network gateway; Cisco Prime Network units; Server P1 (primary Cisco Prime Network); Server S1; External storage (at each site); Replication; Global heartbeat]




Figure 17-3 Architecture for Veritas Gateway Geographical Redundancy with Unit Redundancy

If a failure at the local site is also likely to affect any local unit servers, consider placing additional units at the remote site. The remote unit servers can provide full or partial geographical redundancy, as needed.

Note In this configuration, unit redundancy does not mean the units at both the local and remote sites are managed by the unit server high availability feature. The unit redundancy illustrated here differs from unit server high availability in two ways: in this scenario, the units are up but have no running AVMs, and after failover, you must manually move the AVMs between the two sites. (In unit server high availability, standby units are down until failover, and AVMs are automatically moved at failover.)

If a local site failure occurs and the local units are not affected, you can connect them to the redundant server at the remote single-node cluster. (Depending on the distance between the local units and remote server, communication may be significantly slower.)

Detailed requirements (hardware, software, and network) for both configurations are provided in the Cisco Prime Network 3.8 Installation Guide.

[Figure 17-3 labels: Primary site/local cluster (dual-node cluster + units); Secondary site/remote cluster (single-node cluster + units); GUI clients; Web clients; Customer OSS/BSS; Cisco Prime Network units; Server P1 (primary Cisco Prime Network); Server S1; External storage (at each site); Replication; Global heartbeat]




Veritas Local Redundancy

The following topics provide additional information on how to manage a local redundancy configuration:

• Configuration Details for Veritas Local Redundancy (Dual-Node Cluster), page 17-6, describes how the different components of a locally redundant network work together, including disks, partitions, IP addresses, service groups, and application dependency.

• How Automatic Failover is Triggered (Veritas Local Redundancy), page 17-11, describes how automatic failover is triggered, and how the Prime Network and Oracle applications react.

Configuration Details for Veritas Local Redundancy (Dual-Node Cluster)

The local redundancy configuration includes a dual-node cluster, and the two servers are both normally active. One server hosts the Prime Network gateway processes and has its own logical IP address; the other server hosts the Oracle database application and has its own logical IP address. Each server has its own external data volume. The hardware configuration is illustrated in Figure 17-4.

Figure 17-4 Hardware Configuration for Dual-Node Cluster in Veritas Local Redundancy


[Figure 17-4 labels:

• 2 internal disks (1 OS + 1 mirror)

• Oracle database server: 2 internal disks (1 OS + 1 mirror)

• Dual Gigabit Ethernet crossover connections for heartbeat

• Dual Gigabit Ethernet connections to different switches on LAN backbone for network and backup heartbeat

• Dual connections from each server to the external disk storage unit

• External storage, Database: 1 or more data volumes, 1 archive volume, 1 redo log volume, 1 backup volume (embedded database only), 1 SRL volume, all with mirroring/RAID protection

• External storage, Cisco Prime Network: 1 ANA volume with mirroring/RAID protection, 1 SRL volume with mirroring/RAID protection]




Disks and Replication

The disks in the external storage (where the Prime Network and Oracle data resides) are managed by VxVM. For the primary site, some type of redundancy method should be used, such as mirroring or RAID.

In either case, the specific disk being used at any time is transparent to the user. If a disk fails, the system will automatically failover to the redundant disk for continuous operation.

Disk Partitions

The internal disk on each server in the dual-node cluster contains the root (/) partition. Only the operating system and Veritas software are installed on this root partition. A server’s root partition is completely independent of the redundant server’s root partition. If you make any changes to any of the system files on one of the servers (such as /etc/system, /etc/hosts, /etc/passwd, or /etc/group), you must also manually make the change on the redundant server.

The disks on which the Prime Network and Oracle data resides are divided into multiple volumes. These volumes correspond to the partitions in Table 17-1.

For geographical redundancy, if using VVR, two additional SRL volumes are necessary to act as buffers for the replication (one for the Prime Network data volume, and one for the set of Oracle data volumes).

For more information on Oracle disk recommendations, see the Cisco Prime Network 3.8 Installation Guide.

IP Addresses

In addition to each server's physical IP address, there are two logical addresses: one for the ANA (Prime Network) service group and one for the Oracle service group. All applications connecting to Prime Network and Oracle use the logical addresses. In this way, the specific servers on which Prime Network and Oracle are running remain transparent. The logical addressing is illustrated in Figure 17-5.

Table 17-1 Disk Partitions for Veritas Local Redundancy

Partition        Contents                                          Required?
/export/home     Prime Network application and registry            Yes
/opt/db          Oracle application and data files                 Yes
/opt/dbarch      Oracle archive files                              Yes
/opt/dbbackup    Oracle backup files                               Yes, if using an embedded database (see note 1)
/opt/dblogs      Oracle log files                                  Yes
/opt/dbdata      Additional partition for Oracle data files        No
                 (can be added on separate external volume
                 if you want more historical data)

1. For information on embedded databases, see Working With an Embedded Database, page 11-1.






Figure 17-5 Logical Addressing for a Veritas Dual-Node Cluster (Example)

The examples in these topics use the following hostname aliases to represent the logical IP addresses:

• ana-cluster-ana for the ANA service group logical IP

• ana-cluster-oracle for the Oracle service group logical IP
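As an illustration of why clients address the service groups rather than the nodes, the sketch below maps the two aliases to the example logical addresses shown in Figure 17-5 (10.10.10.3 and 10.10.10.4). The helper names `client_target` and `hosts_file_lines` are hypothetical and exist only for this example; the addresses are the figure's example values, not values to use in a real deployment.

```python
# Logical (service group) addresses from the Figure 17-5 example.
LOGICAL_ADDRESSES = {
    "ana-cluster-ana": "10.10.10.3",     # ANA service group logical IP
    "ana-cluster-oracle": "10.10.10.4",  # Oracle service group logical IP
}

# Physical node addresses from the same example. Clients never use these,
# because a service group can run on either node.
PHYSICAL_ADDRESSES = {"Node A": "10.10.10.1", "Node B": "10.10.10.2"}


def client_target(alias: str) -> str:
    """Return the address a client should use for a service group alias."""
    return LOGICAL_ADDRESSES[alias]


def hosts_file_lines() -> list[str]:
    """Render the aliases as hosts-file style lines (address, tab, name)."""
    return [f"{ip}\t{alias}" for alias, ip in LOGICAL_ADDRESSES.items()]
```

Because the logical address follows the service group when it moves, the value returned by `client_target` stays the same regardless of which node currently hosts the application.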

Service Groups

For Prime Network, the cluster is configured using two service groups: an ANA (Prime Network) service group and an Oracle service group.

An individual service group contains, in a virtual manner, the software and hardware resources that are required by a specific application. Thus the ANA service group contains all of the hardware and software resources needed by the Prime Network application, and the Oracle service group contains all of the hardware and software resources needed by the Oracle application. Both the ANA and Oracle service groups contain, at the lowest level, the hardware resources: the NIC resource and the disk group. The IP resource depends on the NIC resource, and the mount resource depends on the disk group. These dependencies are illustrated in Figure 17-6. All of these groups and resources are monitored by standard Veritas agents.

Both the ANA and Oracle service groups contain components arranged in a hierarchical manner, as illustrated in Figure 17-6.

• ANA service group—Contains a custom ANA Gateway resource that starts, stops, and monitors the Prime Network gateway. The ANA Gateway resource depends on its IP and mount resources (which depend on the low-level NIC resource and disk group).

• Oracle service group—Contains the Oracle listener resource, which depends on the Oracle database resource. Finally, the database resource depends on the IP and mount resources (which depend on the low-level NIC resource and disk group). These are monitored by the Veritas Oracle agents.

[Figure 17-5 example values: Node A physical IP 10.10.10.1; Node B physical IP 10.10.10.2; ANA service group logical IP 10.10.10.3; Oracle service group logical IP 10.10.10.4]




Figure 17-6 Resource Hierarchy for a Dual-Node Cluster (Veritas Local Redundancy)

This hierarchy also determines the startup and shutdown order of the components within the resource group. Each group is brought online from bottom to top of the tree. For example, the components within the Oracle service group come online in the following order:

1. NIC resource and disk group

2. IP and mount resources

3. Oracle database application

4. Oracle listener

Likewise, the ANA service group's components come online in this order:

1. NIC resource and disk group

2. IP and mount resources

3. ANA Gateway resource

When taken offline, the components shut down in the reverse of the above.
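The bottom-up startup order can be derived mechanically from the dependencies described above. The following is a minimal sketch that groups resources into startup stages; the function `online_order` and the dictionary layout are illustrative assumptions and do not represent how VCS stores or evaluates its configuration. The resource names follow the labels used in Figure 17-6.

```python
def online_order(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group resources into startup stages, lowest-level resources first."""
    remaining = {name: set(deps) for name, deps in depends_on.items()}
    stages: list[list[str]] = []
    started: set[str] = set()
    while remaining:
        # A resource is ready once everything it depends on has started.
        ready = sorted(n for n, deps in remaining.items() if deps <= started)
        if not ready:
            raise ValueError("dependency cycle detected")
        stages.append(ready)
        started.update(ready)
        for name in ready:
            del remaining[name]
    return stages


# Each key depends on the resources in its value set.
ORACLE_GROUP = {
    "NIC": set(),
    "DiskGroup": set(),
    "IP": {"NIC"},
    "Mount": {"DiskGroup"},
    "Oracle": {"IP", "Mount"},
    "Netlsnr": {"Oracle"},
}

ANA_GROUP = {
    "NIC": set(),
    "DiskGroup": set(),
    "IP": {"NIC"},
    "Mount": {"DiskGroup"},
    "ANA Gateway": {"IP", "Mount"},
}

oracle_online = online_order(ORACLE_GROUP)
oracle_offline = list(reversed(oracle_online))  # shutdown is the reverse order
```

For the Oracle group this yields four stages (NIC and disk group, then IP and mount, then the database, then the listener), matching the numbered list above.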

Note Once the Oracle and Prime Network gateway applications are being monitored by VCS, they should be started and stopped only by using the Veritas Cluster Manager application or CLI commands. Stopping the applications using the regular application commands, without the awareness of the cluster software, can cause the service group to failover.

Prime Network Gateway Processes

If you are using gateway server high availability, note that there is no overlapping between the processes that Veritas Cluster Manager monitors (the ANA Gateway Veritas agent in Figure 17-7), and the processes that AVM 99 monitors. For an illustration of the unit server high availability processes, see Figure 16-2 on page 16-7.

[Figure 17-6 labels: ANA Service Group: ANA Gateway, Mount, IP, DiskGroup, NIC. Oracle Service Group: Netlsnr, Oracle, Mount (multiple), IP, DiskGroup, NIC]




Figure 17-7 AVM Management with Veritas Gateway Server High Availability

A custom ANA Gateway agent is available as part of the gateway server high availability installation. The ANA Gateway agent provides an interface to stop, start, and monitor Prime Network (or more specifically, AVM 99, which is the Prime Network bootstrap process). The ANA Gateway agent runs as the operating system root user. Before running the stop and start commands, the agent actually switches to network-user. (network-user is the operating system account for the Prime Network application, created when Prime Network is installed; an example of network-user is network38.)

The processes begin as follows:

1. The anactl_ha.csh wrapper script starts the ANA Gateway resource. anactl_ha.csh is installed with gateway server high availability and is stored in NETWORKHOME/Main/ha.

2. In a local redundancy setup, anactl_ha.csh starts Prime Network by calling the mvmcheck.sh script which is stored in NETWORKHOME/Main/scripts. (It does not call mvm.sh, which is usually called to start Prime Network.)

3. The mvmcheck.sh script verifies whether only AVM 99 is down, or all AVMs are down:

– If only AVM 99 is down, mvmcheck.sh runs mvm.csh with the -nokill option. This causes mvm.csh to start only AVM 99.

– If all AVMs are down, mvmcheck.sh runs mvm.csh without any command line options, which stops any running AVM processes using the system kill command.
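The decision made in step 3 can be summarized as a small function. This is a sketch of the logic as described in the steps above, not the contents of mvmcheck.sh. The function name `mvmcheck_action` is hypothetical, and two behaviors are assumptions because the text does not cover them: returning an empty string when AVM 99 is already running, and treating "some but not all other AVMs running" the same as "only AVM 99 is down".

```python
BOOTSTRAP_AVM = 99  # AVM 99 is the Prime Network bootstrap process


def mvmcheck_action(running_avms: set[int]) -> str:
    """Return the command line the described logic would choose."""
    if BOOTSTRAP_AVM in running_avms:
        # Assumption: nothing needs to be started if AVM 99 is already up.
        return ""
    if running_avms:
        # Only the bootstrap process is down: start AVM 99 and leave
        # the other running AVMs alone.
        return "mvm.csh -nokill"
    # All AVMs are down: plain start, which also stops stray AVM processes.
    return "mvm.csh"
```

For example, `mvmcheck_action({0, 11, 100})` selects the `-nokill` variant, while `mvmcheck_action(set())` selects the plain command.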

Figure 17-8 Process Diagram—Processes Called by ANA Gateway Agent

Application Dependency

During gateway startup, the ANA Gateway agent’s primary role is to make the Prime Network gateway process (AVM 11) start up with the Oracle listener in a synchronized manner, without creating a dependency between the ANA and Oracle service groups. When the agent comes online, it employs the logic shown in Figure 17-9:

[Figure 17-7 labels: ANA Gateway Veritas Agent; ANA Bootstrap Process (AVM 99); AVM 0; AVM 11; AVM 100; AVM 66]

[Figure 17-8 labels: ANA Gateway Agent; anactl_ha.csh (ana37); mvmcheck.sh (ana37)]




Figure 17-9 Process Diagram—Starting the ANA Gateway Agent
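The figure's decision points (whether the listener is running, a recheck interval, and a maximum number of checks) suggest a bounded polling loop. The sketch below is a simplified interpretation under stated assumptions: it collapses the flowchart into a single loop, and the function name `wait_for_listener`, the default of 5 checks, and the 10-second interval are hypothetical values, not the agent's real settings.

```python
import time
from typing import Callable


def wait_for_listener(
    listener_running: Callable[[], bool],
    max_checks: int = 5,              # hypothetical default
    recheck_interval: float = 10.0,   # hypothetical default, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the listener until it is running or the check budget is used up.

    Returns True when Prime Network should be started, and False when the
    agent should exit without starting it.
    """
    for attempt in range(1, max_checks + 1):
        if listener_running():
            return True
        if attempt < max_checks:
            sleep(recheck_interval)  # wait for the recheck interval
    return False
```

Passing the probe and the sleep function as parameters keeps the loop testable without a real Oracle listener. The point of the design is that the gateway waits for the listener without a formal dependency between the two service groups.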

If the Oracle application fails, the two Prime Network processes that connect to the database—the gateway process (AVM 11) and AVM 25— reconnect to the new instance on their own. This happens on both gateways and units.

How Automatic Failover is Triggered (Veritas Local Redundancy)

If one of the critical resources fails, the dual-node cluster is configured for automatic failover. The following topics describe what happens when these components failover:

• Heartbeat Failure, page 17-12

• Hardware Failure, page 17-12

• Oracle Listener/Database Failure, page 17-12

• Prime Network Application Failure, page 17-12

Each node provides redundancy for the other node. Because the Prime Network application directly depends on all resources in its service group, all of its resources—IP, mount, NIC, disks—are designated as critical, along with the Prime Network process (ANA Gateway). While allowances are made for restarts, if any of the resources fail, the ANA (Prime Network) service group will failover. The same is true for the Oracle service group: Any resource failure, apart from allowances for restarts, causes the Oracle service group to failover.

[Figure 17-9 flowchart labels: Is listener running?; Is listener starting up?; Wait for recheck interval; Maximum number of checks reached?; Start Prime Network; Exit (Prime Network not started)]




When a failover occurs, all of the resources on the current gateway are shut down (from the top of the tree down). This means that the ANA service group shuts down first because it depends on the Oracle service group. After the shutdown, the resources on the new active gateway are started from the bottom up, with the startup of Oracle service group followed by the startup of ANA service group.

By default, all resources are polled every 60 seconds, which means fault detection can take up to 60 seconds. Once a fault is detected:

• If Prime Network has a failover, it may take 2-3 minutes until the Prime Network application begins the startup process on the redundant server. At that point, Prime Network gateway startup time will vary, depending on the configuration. During this time, no alarms will be recorded by the gateway.

• Likewise, if Oracle has a failover, it may take 2-3 minutes until the Oracle database application begins the startup process on the redundant server. This process may take up to 20 minutes, depending on the number of transactions.

Default polling times can be changed or overridden using the Veritas Cluster Manager.
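To get a feel for the worst-case timings quoted above, the figures can be added up. This is a rough sketch that assumes the phases happen strictly one after another and do not overlap, which the text does not state explicitly; the constant and function names are illustrative.

```python
POLL_INTERVAL_S = 60            # default polling interval from the text
FAILOVER_DELAY_MAX_S = 3 * 60   # upper end of the "2-3 minutes" range
ORACLE_STARTUP_MAX_S = 20 * 60  # "up to 20 minutes"


def worst_case_seconds(*phases_s: int) -> int:
    """Sum sequential phases, assuming they do not overlap."""
    return sum(phases_s)


# Time until Prime Network begins its startup on the redundant server.
prime_network_until_startup = worst_case_seconds(POLL_INTERVAL_S, FAILOVER_DELAY_MAX_S)

# Time until Oracle has finished starting on the redundant server.
oracle_until_available = worst_case_seconds(
    POLL_INTERVAL_S, FAILOVER_DELAY_MAX_S, ORACLE_STARTUP_MAX_S
)

print(prime_network_until_startup // 60, "minutes")  # 4 minutes
print(oracle_until_available // 60, "minutes")       # 24 minutes
```

Under these assumptions, an Oracle failover could take roughly 24 minutes end to end in the worst case. The Prime Network gateway startup time itself is not included because it varies with the configuration.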

The dual-node cluster is designed to operate with one service group running on each server. If a failover occurs, both service groups will be running on the same server. A hardware failure on the redundant server could cause a situation where the Veritas application requires a relatively long time (approximately 10-15 minutes) to register the faults to both service groups and have both service groups failover to the other server. The problem should be corrected and cleared, and the service group switched back to its original server, as soon as possible.

Heartbeat Failure

The two gateways constantly exchange LLT heartbeats between them. In case of loss of heartbeat, VCS will automatically failover to the redundant gateway. This underscores the importance of each external volume having redundant connections to both servers. The redundant heartbeat path prevents the dangerous situation where the heartbeat is interrupted and VCS starts one of the applications on the other server. You can configure VCS to send an SMTP message and/or an SNMP trap to report the heartbeat failure.

Hardware Failure

In the event of a network or disk failure, the service group running on the faulty server will failover to the redundant server. The application in the affected service group is shut down, the service group's external shared disk is unmounted and then remounted on the other server, and the application is brought online on the redundant server. You can configure VCS to send an e-mail and/or SNMP trap to report the server hardware failure.

Oracle Listener/Database Failure

If the Oracle database application fails, the Oracle service group will failover to the redundant server as described in How Automatic Failover is Triggered (Veritas Local Redundancy), page 17-11. The two Prime Network processes that connect to the database—the gateway process (AVM 11) and AVM 25 (whether on the gateway or the units)—will reconnect to the new instance on their own.

If the listener fails, the Veritas agent will attempt one restart before failing over to the redundant server.

Prime Network Application Failure

In the event of a Prime Network process (AVM 99) failure, the ANA Gateway agent attempts one restart. If Prime Network cannot be restarted, the ANA service group will failover to the redundant server as described previously.
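The "one restart, then failover" behavior described for both the Oracle listener and the Prime Network process can be expressed as a small policy function. This is an illustrative sketch only: the function name `recover` and its parameters are hypothetical, and the real limits are set in the cluster configuration, not in code like this.

```python
from typing import Callable


def recover(try_restart: Callable[[], bool], restart_limit: int = 1) -> str:
    """Attempt up to restart_limit restarts, then fall back to failover.

    try_restart is a caller-supplied function that returns True when the
    resource came back online.
    """
    for _ in range(restart_limit):
        if try_restart():
            return "restarted"
    return "failover"
```

With the default limit of one, a single failed restart attempt results in `"failover"`, which corresponds to the service group moving to the redundant server.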



Veritas Geographical Redundancy

The following topics provide additional information on how to manage a geographical redundancy configuration:

• Configuration Details for Veritas Geographical Redundancy (Global Cluster), page 17-13, describes how the different components of a geographically redundant network work together, including disks, partitions, IP addresses, and service groups.

• Understanding Manual Failover (Veritas Geographical Redundancy), page 17-17, describes when to perform a manual failover and why automatic failover is not recommended.

• Veritas Geographical Redundancy Failure and Failback Scenarios, page 17-19, provides information about what happens during a gateway and/or unit failover and failback.

Configuration Details for Veritas Geographical Redundancy (Global Cluster)

A geographical redundancy configuration (the global cluster) includes a dual-node cluster at the primary site, and an additional redundant single-node cluster at a geographically remote site for a full DR solution. The server in the redundant single-node cluster is normally running, but with no active applications. If there is a failure at the local dual-node cluster site, an operator can manually switch to the remote redundant server. The single-node cluster at the remote site is connected to storage containing two additional data volumes that are replicas of the two data volumes (Prime Network and Oracle data) at the primary dual-node site. The hardware configuration is illustrated in Figure 17-10. (For a high-level illustration of geographical redundancy, see Figure 17-2 on page 17-4.)




Figure 17-10 Hardware Configuration for Additional Single-Node Cluster in Veritas Geographical Redundancy

Disks, Partitions, and Replication

The two external data volumes at the remote single-node site are replicas of the two data volumes (Prime Network and Oracle data) at the primary dual-node site. If the applications at the primary dual-node site fail over, these additional data volumes become the principal copy of the system data.

For the global cluster, the Prime Network and Oracle data partitions at the local and remote sites must be kept in sync. The initial synchronization between the local and remote data may require a considerable amount of time. To save time, the primary and secondary servers should be located near each other for the initial synchronization. Afterwards, the redundant server can be moved to the remote location. Data replication should be done asynchronously.

The disks in the remote (redundant) single-node cluster can be mirrored based on the level of redundancy desired. Data replication between the local dual-node cluster and the remote single-node cluster can be implemented in either of the following ways:

• Storage-based replication (NAS or SAN), using any required additional hardware.

• Software-based replication (VVR). See Replication with VVR, page 17-15.

Redundancy is not required at the remote single-node cluster site because the local dual-node cluster should have the ability to be restored in a short period of time. If you foresee a need for the remote site to operate for a long period of time, you can augment the single-node cluster as follows:

• Add another disk, mirroring the internal disk

• Add a second connector to each external disk unit (to protect against connection failure).

[Figure 17-10 labels:

• Cisco Prime Network gateway: 1 to 2 internal disks (1 OS + 0 or 1 mirror)

• Dual Gigabit Ethernet connections to different switches on LAN backbone for network and heartbeat

• Dual connections from the server to the external disk storage unit

• External storage, Database: 1 or more data volumes, 1 archive volume, 1 redo log volume, 1 backup volume (embedded database only), 1 SRL volume, all with mirroring/RAID protection

• External storage, Cisco Prime Network: 1 ANA volume with mirroring/RAID protection, 1 SRL volume with mirroring/RAID protection]




IP Address

Two additional IP addresses are required by the WAC (Wide Area Connection) resources that are used to implement the VCS global cluster. One address is for the local cluster and one address is for the remote cluster.

In addition, the remote single-node cluster is assigned a single logical address. All Prime Network-related applications, including Oracle and northbound applications, use the logical addresses. In the event of a failover to the gateway at the secondary site, all northbound applications will have to reconnect using the new logical IP address at the remote cluster. The unit servers will automatically be reconfigured to use the new IP address of the gateway.

Service Groups

For geographical redundancy, the baseline ANA and Oracle service groups require additional resources/service groups in order to integrate with the data replication process. The following example discusses the resource changes for replication using VVR. For storage-based replication, each replication solution supported by Veritas will have its own agent, and the setup will vary. See the Veritas product documentation for the latest list of supported replication solutions and existing agents.

Replication with VVR

For VVR, dedicated VVR service groups are added to both the local and remote clusters.

For the local (dual-node) cluster, each application service group (ANA and Oracle) has a corresponding VVR service group. The NIC and IP resources, as well as the disk group resource from each application service group, are moved to the VVR service groups. The VVR service groups add an RVG resource, which is dependent on the IP and Disk Group. The application service groups add RVGPrimary resources, on which the applications become dependent. See Figure 17-11 for an illustration.




Figure 17-11 Local Dual-Node Cluster Resources Using VVR (Geographical Redundancy)

In the remote single-node cluster, the setup is similar, except that there is a single VVR service group that serves both application groups. The single NIC and IP resources, as well as the two disk group resources, reside in the VVR service group. Dependent on them are the RVG resources for both Prime Network and Oracle. This hierarchy is illustrated in Figure 17-12.

[Figure 17-11 labels: ANA Service Group: ANA Gateway, Mount, RVGPrimary. Oracle Service Group: Netlsnr, Oracle, Mount (multiple), RVGPrimary. ANA VVR Service Group: RVG, IP, DiskGroup, NIC. Oracle VVR Service Group: RVG, IP, DiskGroup, NIC]




Figure 17-12 Remote Single-Node Cluster with VVR Resource Hierarchy (Geographical Redundancy)

Understanding Manual Failover (Veritas Geographical Redundancy)

In the global cluster scenario, a failure in the local dual-node cluster means that a critical resource has failed on both servers in the cluster. If this happens, a user can manually failover to the server in the remote single-node site.

If one of the critical resources fails, the dual-node cluster is configured for manual failover. These topics describe what happens when these components failover:

• Heartbeat Failure, page 17-18

• Local Dual-Node Cluster Failure, page 17-19

• Local -> Remote Cluster Failover, page 17-19

• Local -> Remote Unit Failover, page 17-21

• Remote -> Local Cluster Failback, page 17-21

• Remote -> Local Unit Return, page 17-22

[Figure 17-12 labels: ANA Service Group: ANA Gateway, Mount, RVGPrimary. Oracle Service Group: Oracle, Mount (multiple), RVGPrimary. VVR Service Group: RVG (two), IP, NIC, DiskGroup (two)]




Automatic failover is not recommended for the following reasons.

• Possibility of a split-brain scenario. A split-brain scenario is when both the local dual-node site and the remote single-node site assume they should be running. Because the heartbeat between the two sites is sent over the network, the heartbeat could be interrupted by a loss of connectivity between the two sites, which the remote site could interpret as a failure at the local site. If failover were automatic, the remote cluster would automatically start up and begin running in parallel with the local cluster. Therefore, failover between sites should be performed manually.

• Human intervention is warranted. In the event of a failure at the local dual-node site, an operator should verify the failure. A site failure (as opposed to a localized hardware failure) is a major event that requires human intervention to assess the situation. For example, even if the two servers at the local site fail, units at the primary site might still be functioning. If the remote (redundant) single-node site also included a set of redundant units, an operator would need to determine whether only the gateway should failover, or if the units should also failover.

The manual site-to-site failover process includes the shutdown of both application service groups in the local dual-node cluster, and the startup of the corresponding service groups in the remote single-node cluster. After a failure at the local dual-node site, if the local VVR service groups are still online, VVR will try to replicate any unsynched data in the SRLs as part of switching the server at the remote site from secondary to primary. (The unsynched data is what was queued up in the buffer, but was not yet replicated.)

Basic Steps in Manual Failover and Failback

The following are the basic manual steps for a failover in a geographical redundancy configuration:

1. When a heartbeat loss occurs, verify the cause (it could be a simple loss of connectivity).

2. Verify that the applications are down and the disks are unmounted at the local dual-node site.

3. Start the failover.

4. Configure northbound applications to re-login to the remote single-node gateway using the new IP address.

5. When the original gateway and database servers are up at the dual-node site, bring the VVR resources online. Because the remote site now contains the master copy of the data, this ensures that data written to the Prime Network and Oracle volumes at the remote site can be replicated back to the local site.

6. When appropriate, fail back to the local site.
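The ordering of these steps matters (for example, the failover must not start before the local disks are confirmed unmounted). The following is a minimal Python sketch of a checklist that refuses out-of-order completion. The `Runbook` class and the step labels are hypothetical illustrations that paraphrase the list above; they are not part of Prime Network or VCS.

```python
from dataclasses import dataclass, field

# Step labels paraphrase the numbered list above; they are not commands.
FAILOVER_STEPS = [
    "verify cause of heartbeat loss",
    "verify applications down and disks unmounted at local site",
    "start failover",
    "reconfigure northbound applications for new gateway IP",
    "bring VVR resources online at local site",
    "fail back to local site",
]


@dataclass
class Runbook:
    """Hypothetical helper that tracks progress through an ordered checklist."""
    steps: list
    done: list = field(default_factory=list)

    def next_step(self):
        """Return the next pending step, or None when every step is complete."""
        if len(self.done) < len(self.steps):
            return self.steps[len(self.done)]
        return None

    def complete(self, step):
        """Mark a step complete; raise ValueError if it is not the next one."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"out of order: expected {expected!r}, got {step!r}")
        self.done.append(step)
```

An operator tool built this way would call `complete()` only after the corresponding manual verification has been done.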

Heartbeat Failure

The primary (local) and secondary (remote) clusters constantly exchange LLT heartbeats over the shared IP network. If there is a heartbeat loss, VCS registers the fault and awaits manual intervention for failover. The operator must determine whether the problem is due to a failure in the local dual-node cluster or a loss of network connectivity between the two sites. If it is a network connectivity issue, no action is required. If it is due to a critical resource failure on both servers in the dual-node cluster, the operator should perform the manual failover of the remote gateway (and the unit servers, if they also fail). You can configure VCS to send an SMTP message and/or an SNMP trap to report the heartbeat failure.
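The operator's decision after a heartbeat loss can be summarized as a small decision function. This is only a sketch of the reasoning in the paragraph above: the inputs are findings the operator has verified by hand, and the `INVESTIGATE` outcome is an assumption added here to cover cases the text does not describe.

```python
from enum import Enum


class Action(Enum):
    NO_ACTION = "no action required"
    MANUAL_FAILOVER = "perform manual failover to the remote site"
    INVESTIGATE = "continue investigating"  # assumed catch-all, not from the guide


def triage_heartbeat_loss(inter_site_link_up, node_a_critical_fault, node_b_critical_fault):
    """Map the operator's verified findings to the action described in the text."""
    if node_a_critical_fault and node_b_critical_fault:
        # Critical resource failure on both servers of the dual-node cluster.
        return Action.MANUAL_FAILOVER
    if not inter_site_link_up:
        # Heartbeat loss explained by lost connectivity between the sites.
        return Action.NO_ACTION
    return Action.INVESTIGATE
```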




Veritas Geographical Redundancy Failure and Failback Scenarios

The basic steps for manual failover (and failback) are described in Understanding Manual Failover (Veritas Geographical Redundancy), page 17-17. That section also describes what happens when there is a heartbeat failure. These topics describe other failover and failback scenarios:

• Local Dual-Node Cluster Failure, page 17-19

• Local -> Remote Cluster Failover, page 17-19

• Local -> Remote Unit Failover, page 17-21

• Remote -> Local Cluster Failback, page 17-21

• Remote -> Local Unit Return, page 17-22

Local Dual-Node Cluster Failure

A local cluster failure occurs when either Prime Network or Oracle registers a fault on both servers in the dual-node cluster. If this occurs, both service groups are shut down, and the failover awaits manual intervention. VCS can be configured to send an SMTP message and/or an SNMP trap to report the failure.

Local -> Remote Cluster Failover

When the local dual-node cluster fails over to the remote single-node cluster, all resources on the local cluster are shut down (if possible) and then started up on the remote cluster.

Data replication requires that the volumes being replicated be mounted at only one site at a time. In other words, when the local dual-node cluster is up, the disk resources at the remote single-node cluster should be offline. If the local dual-node cluster fails:

• If the remote cluster can verify the state of the disk resources in the local cluster, and the local resources are offline, the operator can initiate the failover process.

• If the remote cluster cannot verify the state of the disk resources in the local cluster (because communications are interrupted or hardware has failed), the operator must do the following:

– Manually confirm that the disks at the local site are unmounted.

– Start the failover process.
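The two cases above reduce to one precondition: the failover may start only when the local disks are known to be offline, either because the remote cluster verified it or because the operator confirmed it manually. A minimal Python sketch follows; the function and parameter names are hypothetical.

```python
from typing import Optional


def may_start_failover(remote_sees_local_disks_offline: Optional[bool],
                       operator_confirmed_unmounted: bool = False) -> bool:
    """Return True when it is safe to start the failover.

    remote_sees_local_disks_offline:
        True  -> remote cluster verified that the local disk resources are offline
        False -> remote cluster verified that they are still online
        None  -> state cannot be verified (communications or hardware failure)
    """
    if remote_sees_local_disks_offline is None:
        # State unknown: rely on a manual check at the local site.
        return operator_confirmed_unmounted
    return remote_sees_local_disks_offline
```

Treating "cannot verify" as a separate state (rather than as "offline") keeps the volumes from being mounted at both sites at once.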

As mentioned in IP Address, page 17-15, switching to the remote (redundant) single-node gateway involves using a different gateway IP address. The anactl_ha.csh script will automatically reconfigure the Prime Network/Oracle addresses and LDAP settings on both the gateway and units (which enables unit communication with the new Prime Network/Oracle instance). Any northbound applications that use an IP address to connect to the gateway will have to log in again using the new address of the gateway. If applications use a hostname to connect to the gateway, and a DNS resource is configured for the remote single-node cluster, a reconnect should not be required.

The Oracle listener will automatically start with the correct address for the remote single-node site, and the Prime Network startup process will automatically reconfigure the gateway to use the new listener address. This is because the locally-configured hostname alias (in /etc/hosts on each server) is configured with a different address for the local and remote clusters.
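To illustrate the hostname alias mechanism, the sketch below looks up the same alias in two different /etc/hosts contents. The alias name `ana-db` and the addresses (taken from the documentation-only ranges 192.0.2.0/24 and 198.51.100.0/24) are assumptions made for this example; the actual alias used by an installation is not given in this section.

```python
def resolve_alias(hosts_text, alias):
    """Return the address mapped to alias in hosts-file text, or None."""
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if alias in fields[1:]:  # fields[0] is the address, the rest are names
            return fields[0]
    return None


# Hypothetical /etc/hosts contents at each site: same alias, different address.
LOCAL_HOSTS = "127.0.0.1 localhost\n192.0.2.10 ana-db  # local cluster\n"
REMOTE_HOSTS = "127.0.0.1 localhost\n198.51.100.10 ana-db  # remote cluster\n"
```

Because the alias is the same at both sites, a component configured with the alias picks up the site-specific address without any change to its own configuration.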

Figure 17-13 illustrates the steps that are invoked by the ANA Gateway resource when it fails over to the remote single-node cluster.




Figure 17-13 ANA Application Process Flow at Failover (Local to Remote Cluster)

[Flowchart. The text below is recovered from the figure; the original layout, including which steps belong to which script, is not preserved.]

Agents and scripts shown:

• ANAGateway Agent (user=root)

• ~network38/Main/ha/anactl_ha.csh (user=network38)

• ~network38/Main/ha/changeSite.pl (user=network38)

• ~network38/Main/scripts/ha/postFailover.pl (user=network38)

• ~network38/Main/ha/switchUnit.pl (user=network38)

• ~network38/Main/scripts/mvmvcheck.sh (user=network38)

Argument strings shown:

• -check -newgwip new_ana_ip -newdbip new_oracle_ip [-newldapurl ldap_url -newldapprefix ldap_prefix -newldapsuffix ldap_suffix -newldapisssl ldap_is_ssl]

• new_ana_ip new_oracle_ip ldap_url ldap_prefix ldap_suffix ldap_is_ssl

• new_ana_ip old_ana_ip new_db_ip old_db_ip

Steps shown:

• run background script to reset GW address on running AVMs

• change GW IP address in uplinks in 127.0.0.1/avm0.xml

• change GW IP address in uplinks in unit_ip/avm0.xml

• change GW IP address in gs and haservice in unit_ip/avm99.xml

• run script via ssh on each unit to switch ANA IP address in registry

• change GW IP address in localhost in 127.0.0.1/avm99.xml

• change DB Server IP in workflow in 127.0.0.1/avm66.xml

• change DB Server IP in ep and main in 127.0.0.1/persistency.xml

• change DB Server IP in ep and main in unit_ip/persistency.xml

• change DB Server IP in ep and main in 0.0.0.0/persistency.xml

• change any LDAP parameters in 127.0.0.1/authentication.xml

• stop AVMs 99, 25, 0

• change GW IP address in gs and haservice in avm99.xml

• change DB server IP address in ep and main in persistency.xml

• change GW IP address in uplinks in avm0.xml

• run management.attachToMvm command for each AVM on each unit

• get ANA IP and Oracle IP for dual-node cluster or single IP for single-node cluster
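As an illustration of the argument string that appears in the figure, the sketch below assembles an argument list with the mandatory address options and the optional LDAP options. The option names are copied from the figure text; their exact semantics, the association with anactl_ha.csh, and the accepted value formats are assumptions, and the function only builds a list (it does not run anything). The helper name and the dictionary keys are hypothetical.

```python
def build_failover_args(new_ana_ip, new_oracle_ip, ldap=None):
    """Assemble the option list shown in the figure.

    ldap, when given, is a dict with the hypothetical keys
    'url', 'prefix', 'suffix', and 'is_ssl' (all strings).
    """
    args = ["-check", "-newgwip", new_ana_ip, "-newdbip", new_oracle_ip]
    if ldap is not None:
        # The bracketed LDAP options in the figure are treated as all-or-nothing.
        args += [
            "-newldapurl", ldap["url"],
            "-newldapprefix", ldap["prefix"],
            "-newldapsuffix", ldap["suffix"],
            "-newldapisssl", ldap["is_ssl"],
        ]
    return args
```

For example, `build_failover_args("198.51.100.20", "198.51.100.21")` yields only the five address-related entries, with no LDAP options.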




Local -> Remote Unit Failover

Note Do not manage local and remote site unit redundancy using unit server high availability. In other words, do not create a protection group that contains units from the local and remote site.

If redundant units are provided at the remote single-node site, the units are normally up but no AVMs are running. (This is different from a standby unit in a unit server high availability scenario, where the standby units are down.) If the local dual-node site has a failure that affects both the gateway and units, the operator must first fail over the gateway and then perform some manual steps.

When the gateway is up and running at the remote single-node site, do the following:

1. Move AVMs from the local units to the remote units. How many AVMs can be moved depends on how many unit servers are at the remote site. See Moving and Deleting AVMs, page 4-13.

2. If the failover occurred on the gateway or unit that had a running AVM 100 (AVM 100 contains the Event Collector):

a. If AVM 100 was running on a unit that failed, start AVM 100 on the redundant unit. (If AVM 100 was running on a gateway that failed, it will be automatically restarted.) See Enabling a New Event Collector on a Unit, page 14-12.

b. Reconfigure devices to forward events to the new unit or gateway that is running the Event Collector (if this was not already done). This is required because the IP address of AVM 100 will be different. (A port watchdog script, which runs on all units and gateways, will receive the incoming traps and syslogs on the failed gateway or unit. This ensures that the device sending the traps and syslogs does not receive error messages.)

Depending on the location of devices, the connection between devices and a remote unit may be across the WAN. Assuming that the relevant ports have been opened in the firewall, this configuration is supported for the geographical redundancy gateway server high availability solution.
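Because the number of AVMs that can be moved depends on how many unit servers are available at the remote site, it can help to plan the moves before starting. The sketch below distributes AVM IDs round-robin across remote units and reports any AVMs that do not fit. The per-unit limit `max_avms_per_unit` is a hypothetical planning parameter, not a documented Prime Network value.

```python
def plan_avm_moves(avm_ids, remote_units, max_avms_per_unit):
    """Assign AVMs to remote units round-robin, respecting a per-unit limit.

    Returns (plan, unplaced): plan maps unit name -> list of AVM IDs,
    unplaced lists AVMs for which no remote unit had spare capacity.
    """
    units = list(remote_units)
    plan = {unit: [] for unit in units}
    unplaced = []
    start = 0  # index of the unit to try first for the next AVM
    for avm in avm_ids:
        placed = False
        for offset in range(len(units)):
            unit = units[(start + offset) % len(units)]
            if len(plan[unit]) < max_avms_per_unit:
                plan[unit].append(avm)
                start = (start + offset + 1) % len(units)
                placed = True
                break
        if not placed:
            unplaced.append(avm)
    return plan, unplaced
```

Any AVMs in `unplaced` indicate that the remote site cannot host the full local load, which the operator would need to resolve before or during the failover.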

Remote -> Local Cluster Failback

Once the primary site is online, you should switch the gateway back to the local cluster as soon as possible. As described in Disks, Partitions, and Replication, page 17-14, if the local dual-node site fails over, the two additional data volumes at the remote site become the principal copy of the system data. Before a switchover back to the local dual-node cluster, this data will have to be replicated back to the disks in the local cluster.

To minimize downtime, bring the two VVR service groups online as soon as possible so that replication from the remote site to the local site can begin. Depending on how long the primary site was down, there may be a large amount of data to replicate. Prime Network can continue to run on the remote single-node cluster while the data is replicated to the local site. When the data at the two sites is synchronized, you can initiate the manual failback procedure.

Note If you initiate failback before the data has been fully synchronized, the data synchronization will become part of the failback process. Depending on how much data needs to be synchronized, this may require a considerable amount of time.
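A rough estimate of the resynchronization time can help decide when to schedule the failback. The following sketch is simple arithmetic under stated assumptions: the backlog size, the inter-site link speed, and the `efficiency` factor (the fraction of the link actually usable for replication) are all values the operator must supply; none of them come from this guide.

```python
def estimate_resync_minutes(backlog_gb, link_mbps, efficiency=0.7):
    """Estimate minutes needed to replicate backlog_gb over a link_mbps link.

    Uses decimal units (1 GB = 8000 megabits). efficiency is an assumed
    fraction of the nominal link rate available to replication.
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be > 0 and efficiency in (0, 1]")
    backlog_megabits = backlog_gb * 8 * 1000
    seconds = backlog_megabits / (link_mbps * efficiency)
    return seconds / 60
```

For example, a 90 GB backlog over a 100 Mbps link at full efficiency works out to 720000 megabits / 100 Mbps = 7200 seconds, or 120 minutes.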

Failing back to the local dual-node cluster involves the same steps as the failover, but in reverse order. When the remote single-node cluster fails back to the local dual-node cluster, all resources on the remote cluster are shut down and then started up on the local cluster.



The gateway IP address will switch back. As described in Local -> Remote Cluster Failover, page 17-19, the anactl_ha.csh script will automatically reconfigure the Prime Network/Oracle addresses and LDAP settings on both the gateway and units. Northbound applications will have to log in again using the new address of the gateway. If applications use a hostname to connect to the gateway, and a DNS resource is configured for the remote single-node cluster, a reconnect should not be required.

The Oracle listener will automatically start with the correct address for the local dual-node site, and the Prime Network startup process will automatically reconfigure the gateway to use the new listener address. This is because the locally-configured hostname alias (in /etc/hosts on each server) is configured with a different address for the local and remote clusters.

Remote -> Local Unit Return

Once the primary site is online and the gateway in the local dual-node cluster is operational, you should do the following:

• Move all AVMs back to their original units in the primary site.

• Stop the running AVM 100 on the remote unit, and restart it on the original unit at the primary site.

Mean Time to Repair Veritas Gateway Server High Availability Failures

Table 17-2 provides some information about the average time required to recover from a component failure in a Veritas gateway server high availability configuration.

Note If the Event Collector (AVM 100) was running on a component that failed, the system will lose traps and syslogs sent from devices, and raw events will not be persisted. For more information on AVM failures and their impact, see Estimating the Impact of Unit or AVM Failures, page 16-4.

Table 17-2 Impact of Failure of Gateway Server High Availability Components (Veritas Solution)

| Component | Results of Failure | Average Time To Repair Failure |
|---|---|---|
| Failure of any resource in ANA service group | Gateway will not be available until failover to redundant gateway is complete. | 2-3 minutes plus gateway startup time (system dependent). |
| Failure of any resource in Oracle service group | Database will not be available until failover to redundant database is complete. | 2-3 minutes plus database startup time. Can be up to 20 minutes, depending on the transaction rate prior to failure. |
| Local site failure | Gateway and database will not be available until manual failover to remote server and database is complete. | Once initiated, 5 minutes plus time required to copy any unreplicated data in replication buffer plus gateway and database startup time (in parallel). |
| Local site failure including units | Gateway, database, and units will not be available until manual failover to remote site is complete. | Once initiated, 5 minutes plus time required to copy any unreplicated data in replication buffer plus gateway and database startup time (in parallel) plus unit startup time (system dependent). |
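The site-failure rows of Table 17-2 can be read as a sum of components. The sketch below encodes that reading, interpreting "in parallel" to mean that the longer of the gateway and database startup times applies. That interpretation, the function name, and all input values are assumptions for illustration; actual times are system dependent.

```python
def site_failover_minutes(unreplicated_copy_min, gateway_start_min,
                          db_start_min, unit_start_min=0.0):
    """Estimate time to repair a local site failure, once failover is initiated.

    5 minutes fixed, plus the copy of unreplicated buffer data, plus gateway
    and database startup (assumed to overlap, so the longer one counts),
    plus unit startup when units also fail over.
    """
    return (5
            + unreplicated_copy_min
            + max(gateway_start_min, db_start_min)
            + unit_start_min)
```

With hypothetical inputs of 3 minutes of buffer copy, 10 minutes of gateway startup, and 8 minutes of database startup, the estimate is 5 + 3 + 10 = 18 minutes; adding 4 minutes of unit startup gives 22 minutes.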

