TRANSCRIPT
Building a High-Availability PostgreSQL Cluster
Presenter: Devon Mizelle, System Administrator
Co-Author: Steven Bambling, System Administrator
ARIN — “critical internet infrastructure”
What is ARIN?
• Regional internet registry for North America and parts of the Caribbean
• Distributes IPv4 & IPv6 addresses and Autonomous System Numbers (Internet number resources) in the region
• Provides authoritative WHOIS services for number resources in the region
ARIN’s Internal Data
Inside our database exist all of the IPv4 and IPv6 networks that we manage, the organizations they belong to, and the contacts at those organizations. Data integrity, and how we store that data, is therefore extremely important.
Requirements
• Multi-member automatic failover
• Prevent a 'tainted' master from coming online
• Needs to be ACID-compliant
Why Not Slony or pgpool-II?
• Slony replaces PostgreSQL's built-in replication – why do this? Why not let PostgreSQL handle it?
• pgpool-II is not ACID-compliant – it doesn't confirm writes to multiple nodes
Our solution
• CMAN / Corosync – Red Hat's open-source solution for cross-node communication
• Pacemaker – Red Hat and Novell's solution for service management and fencing
• Both under active development by ClusterLabs
We were interested in using this stack because of its active development by ClusterLabs.
CMAN/ Corosync
• Provides a messaging framework between nodes
• Handles a heartbeat between nodes – "Are you up and available?" – It does not provide the 'status' of a service; Pacemaker does
• Pacemaker uses Corosync to send messages between nodes
CMAN can do more, but we use it only as a messaging framework.
CMAN / Corosync
Builds a cluster 'ring' using a configuration file. Used by Pacemaker to pass status messages between the nodes. Simply a framework for communication – no heavy lifting in our implementation.
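As a sketch of what that configuration file looks like, here is the general shape of a standalone corosync.conf defining a three-node ring (a CMAN deployment would express the same membership in cluster.conf instead); the cluster name and all addresses below are illustrative assumptions, not ARIN's actual values:

```conf
# /etc/corosync/corosync.conf -- illustrative sketch only
totem {
    version: 2
    cluster_name: pgcluster      # hypothetical cluster name
    transport: udpu              # unicast UDP between the nodes
}

nodelist {
    node {
        ring0_addr: 10.0.0.1     # example address for node1
        nodeid: 1
    }
    node {
        ring0_addr: 10.0.0.2     # example address for node2
        nodeid: 2
    }
    node {
        ring0_addr: 10.0.0.3     # example address for node3
        nodeid: 3
    }
}
```

Corosync only needs this membership list to form the ring; everything about what runs where is left to Pacemaker.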
About Pacemaker
• Developed / maintained by Red Hat and Novell
• Scalable – anywhere from a two-node to a 16-node setup
• Scriptable – resource scripts can be written in any language
• Monitoring – watches for service state changes
• Fencing – disables a box and switches roles when failures occur
• Shareable database between nodes about the status of services / nodes
Pacemaker
[Diagram: a Master node replicating to a Sync slave and an Async slave]
An XML 'database' (known as a CIB – cluster information base) is generated with the status of each resource and passed between nodes. The state of pgSQL is controlled by Pacemaker itself. Pacemaker uses a 'resource script' to interact with pgSQL and can determine the state of the service (Master / Sync / Async).
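A minimal sketch of how such a multi-state pgSQL resource might be defined with the crm shell, using the stock ocf:heartbeat:pgsql resource agent (all paths, node names, addresses, and intervals here are illustrative assumptions, not ARIN's actual configuration):

```shell
# Sketch only: define pgsql as a resource Pacemaker can start,
# monitor, and promote. Parameter values are assumptions.
crm configure primitive pgsql ocf:heartbeat:pgsql \
    params pgctl="/usr/pgsql/bin/pg_ctl" \
           pgdata="/var/lib/pgsql/data" \
           rep_mode="sync" node_list="node1 node2 node3" \
           master_ip="10.0.0.10" \
    op monitor interval="7s" \
    op monitor interval="2s" role="Master"

# Multi-state wrapper: run a copy on every node, promote exactly one
# to Master; the agent reports Sync/Async state for the rest.
crm configure ms msPostgresql pgsql \
    meta master-max="1" clone-max="3" notify="true"
```

The `op monitor` lines are what drive the state reporting described above: the agent's monitor action tells Pacemaker whether each copy is Master, Sync, or Async.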
Other Pacemaker Resources
Pacemaker also handles the following resources besides PostgreSQL:
• Fencing of resources
• IP address colocation
How does it all tie together? From the bottom up…
Pacemaker
[Diagram: Master, Sync, and Async nodes; the replication 'vip' and the client 'vip' both point at the Master, and the App connects through the client 'vip']
All slaves in the cluster point to a replication 'vip'. This interface moves to whichever node is the master – this is called a colocation constraint. Another 'vip', which our application servers connect to, follows the master as well.
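A hedged sketch of what those two vips and their colocation constraints could look like in the crm shell (resource names, addresses, and the msPostgresql name are assumptions for illustration):

```shell
# Sketch only: two IPaddr2 primitives, one per vip. Addresses and
# interface names are illustrative assumptions.
crm configure primitive vip-rep ocf:heartbeat:IPaddr2 \
    params ip="10.0.0.10" nic="eth0" \
    op monitor interval="10s"
crm configure primitive vip-client ocf:heartbeat:IPaddr2 \
    params ip="10.0.0.20" nic="eth0" \
    op monitor interval="10s"

# Colocation constraints: both vips must run on whichever node
# holds the Master role, and move with it on failover.
crm configure colocation vip-rep-with-master \
    inf: vip-rep msPostgresql:Master
crm configure colocation vip-client-with-master \
    inf: vip-client msPostgresql:Master
```

The `inf:` (INFINITY) score is what makes the constraint mandatory: the vips can never run anywhere except with the Master.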
Event Scenario
[Diagram: failover sequence – the Master fails and is fenced; the Sync slave becomes Master and the Async slave becomes the new Sync]
In the event that a node becomes unavailable, CMAN notifies Pacemaker to 'fence' the node, shutting off communication to it via SNMP commands to the switch. The SYNC slave becomes the Master. The ASYNC slave becomes the SYNC slave. Upon manual recovery, the old Master becomes the ASYNC slave. If any resources inside Pacemaker on the master fail their monitoring check, fencing occurs as well. These resources include both the replication and client 'vips'.
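One way SNMP-driven fencing through the switch can be expressed is with a STONITH resource; the sketch below uses fence_ifmib (a common agent that disables a switch port over SNMP) purely as an illustration – the agent choice, community string, switch address, and port mapping are assumptions, not necessarily what this deployment uses:

```shell
# Sketch: fence node1 by shutting its switch port via SNMP.
# All parameter values are illustrative assumptions.
crm configure primitive fence-node1 stonith:fence_ifmib \
    params ipaddr="10.0.0.254" community="private" \
           port="Gi0/1" pcmk_host_list="node1" \
    op monitor interval="60s"

# Never run a node's own fencing device on the node it protects.
crm configure location fence-node1-placement fence-node1 -inf: node1
```

One such device exists per node, so any surviving node can cut off a failed peer before promoting a new master.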
PostgreSQL
• Still in charge of replicating data
• The state of the service, and how it starts, is controlled by Pacemaker
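For context, the PostgreSQL side of this – synchronous streaming replication, which Pacemaker then manages – is driven by a handful of postgresql.conf settings on the master. The values below are a generic sketch for the 9.x-era releases this stack targets, not ARIN's actual configuration:

```conf
# postgresql.conf (master) -- illustrative values only
wal_level = hot_standby          # emit WAL that standbys can replay
max_wal_senders = 5              # allow walsender slots for the slaves
synchronous_standby_names = '*'  # commits confirmed on a sync standby
hot_standby = on                 # standbys may serve read-only queries
```

It is `synchronous_standby_names` that gives the cluster its ACID-compliant behavior across nodes: a commit is not acknowledged until the synchronous standby has confirmed it.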
Layout
[Diagram: a Client connecting to the Master, with two Slaves; each of the three nodes runs cman]
Using Tools to Look Deeper: Introspection…
# crm_mon -i 1 -Arf
We disable quorum within the Pacemaker HA cluster to allow for failure down to a single-node cluster in the event that multiple nodes fail.
• 8 resources configured
• ocf::heartbeat::IPaddr2 is the resource agent used to create the vip – resource scripts can be shell, Ruby, etc.
• Primitive vs. multi-state:
– Primitive – runs on only one node in the cluster (vips, fencing)
– Multi-state resource – runs on multiple nodes (pgsql)
• The vips are colocated. If anything happens to either of them, the entire node fails and the cluster moves to the next master
• There is a specific check interval for each resource
• STONITH is used for fencing
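Disabling quorum as described above comes down to one cluster property; a minimal sketch with the crm shell (property names are standard Pacemaker, the pairing with fencing reflects the setup described in this talk):

```shell
# Keep resources running even without quorum, so the cluster can
# degrade all the way down to a single surviving node.
crm configure property no-quorum-policy="ignore"

# Fencing stays enabled -- it is what makes ignoring quorum safe,
# since a partitioned node is powered off or cut off, not trusted.
crm configure property stonith-enabled="true"
```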
# crm_mon -i 1 -Arf (cont)
• All of the status comes from the pgsql Pacemaker resource script
• receiver-status shows an error because the resource script is written to monitor and check for cascading replication; we don't use cascading and haven't invested cycles in it
• master-postgresql is the 'weight'. Pacemaker uses the weight to determine who should be promoted next in line, which is why the async slave has -INFINITY
• STREAMING
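Those promotion weights are stored as node attributes in the CIB, so they can also be read directly; a hedged example using crm_attribute (the attribute name follows the pgsql agent's master-<resource> convention, and the node names are assumptions):

```shell
# Read the promotion score the pgsql resource agent set on a node.
# -INFINITY on the async slave means it is never promoted directly.
crm_attribute -N node1 -n master-postgresql -G   # current master's score
crm_attribute -N node3 -n master-postgresql -G   # async slave's score
```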
Questions?
Devon Mizelle