Enabling High Availability and Disaster Recovery in Couchbase Server
TRANSCRIPT
High Availability / Disaster Recovery
Mel Boulos, Solutions Engineer
Couchbase
©2015 Couchbase Inc. 3
Next 40 minutes …
Part I - High Availability
– Single node architecture
– Local data redundancy
– Rebalance and failover
– Node recovery

Part II - Disaster Recovery
– Business continuity for “mission-critical” applications
– Geo redundancy
– Backup-Restore for worst case scenarios
Part I - High Availability
Couchbase Server – Single Node Architecture
Single node type is the foundation for high availability architecture
No Single Point of Failure (SPOF)
Easy scalability
[Figure: three identical Couchbase Server nodes. Each node runs the Cluster Manager alongside the Data, Index, and Query Services, with a managed cache in front of storage; shards (e.g. Shard 5, 7, 9) are distributed across the nodes.]
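Because every node is identical, any client can locate a document deterministically. A minimal sketch of that placement logic follows; the CRC32-based mapping mirrors the algorithm Couchbase client libraries use to assign keys to partitions (vBuckets), but the function names and the flat `vbucket_map` list are illustrative.

```python
import zlib

NUM_VBUCKETS = 1024  # Couchbase Server's default partition count

def vbucket_for_key(key: bytes, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Map a document key to a vBucket (shard) id.

    Clients hash the key with CRC32 and use part of the digest to pick
    a partition, so every client agrees on where a document lives.
    """
    crc = (zlib.crc32(key) >> 16) & 0x7FFF
    return crc % num_vbuckets

def node_for_vbucket(vbucket_id: int, vbucket_map: list) -> int:
    # The cluster map records which server owns each vBucket's active copy.
    return vbucket_map[vbucket_id]
```

With this mapping there is no routing tier and no single point of failure: a client computes the vBucket locally and sends the request straight to the owning node.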
Intra-Cluster Replication – Data Redundancy
RAM to RAM replication
Max of 4 copies of data in a Cluster
Bandwidth optimized by de-duplicating (‘de-dup’) repeated mutations of the same item
Intra-cluster replication is the process of replicating data on multiple servers within a cluster in order to provide data redundancy.
Write Operation – Data Redundancy
[Figure: an application server writes DOC 1 into the managed cache; DCP streams the mutation to replica nodes and the indexer, while writes are flushed to disk.]
– Caching based on memcached: the app gets an ACK when the write is successfully in RAM, or RAM+Replicated, or RAM+Persisted, or RAM+Replicated+Persisted
– DCP-based replication: writes are queued to other nodes
– Couchstore-based storage: writes are queued for storage
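The four acknowledgement policies above can be condensed into one predicate. This is a toy model of when an ACK may be returned, not SDK code; the `WriteState` type and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class WriteState:
    in_ram: bool             # write accepted by the managed cache
    replicas_with_copy: int  # replica nodes holding the mutation in RAM
    copies_on_disk: int      # nodes that have persisted the mutation

def can_ack(state: WriteState, replicate_to: int = 0, persist_to: int = 0) -> bool:
    """ACK once the write is in RAM and the requested replication
    and persistence counts have both been met.

    replicate_to=0, persist_to=0 is the plain RAM ACK; raising either
    value gives the RAM+Replicated / RAM+Persisted / both variants.
    """
    return (state.in_ram
            and state.replicas_with_copy >= replicate_to
            and state.copies_on_disk >= persist_to)
```

Stricter settings trade write latency for durability: the ACK is delayed until the extra copies exist.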
Database Change Protocol – Data Redundancy
DCP is the new streaming replication protocol in Couchbase Server 3.0:
– High-performance, stream-based protocol
– Better resumability after blips and failures
– Consistent ordering
Used by:
– Intra-cluster replication
– Cross datacenter replication
– Incremental rebalance
– Incremental backup & restore
– Incremental map/reduce views
– Global secondary indexes
– Connectors (Kafka, Sqoop, Spark)
– External streams for Change Data Capture (CDC) in the future
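DCP's resumability comes from per-vBucket sequence numbers: a consumer that disconnects reconnects with the last seqno it processed and receives only what it missed. The toy consumer below illustrates that idea; it is not the actual wire protocol, and the tuple layout is an assumption for the sketch.

```python
def resume_stream(mutations, last_seqno):
    """Replay only mutations newer than the consumer's checkpoint.

    `mutations` is an ordered list of (seqno, key, value) tuples for one
    vBucket. After a blip, the consumer resumes from `last_seqno`
    instead of re-streaming the whole partition.
    """
    return [(s, k, v) for (s, k, v) in mutations if s > last_seqno]
```

The same checkpointing idea underlies incremental rebalance and incremental backup: both are consumers of the DCP stream that remember where they stopped.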
Auto-Tuning Shared Thread Pool - Durability
– Efficient auto-tuning engine: detects and allocates threads based on HW resources
– Pools threads for best resource utilization
– Improved latency across the board:
– Faster rebalance
– Faster node reactivation
– Faster durability writes with PersistTo
Rebalance Operation – Data Availability
Rebalance redistributes data partitions around the cluster:
– When adding nodes
– When removing nodes
– When nodes have failed over
The aim is to bring the cluster back to optimal health. Data partitions are moved between nodes automatically, and rebalance happens on an active cluster:
– Allows you to expand/shrink without pausing your application
– Client libraries automatically handle the rebalance and redistribute their requests accordingly
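The target state of a rebalance is easy to characterize: every node ends up owning nearly the same number of partitions. The sketch below computes only that even assignment; a real rebalance also moves replica copies and streams the data via DCP while the cluster stays online. The function name is illustrative.

```python
def rebalance_plan(num_vbuckets: int, nodes: list) -> dict:
    """Assign vBuckets round-robin so per-node counts differ by at most one.

    Returns a map of vBucket id -> owning node, the shape of the
    cluster map clients consult after the rebalance completes.
    """
    return {vb: nodes[vb % len(nodes)] for vb in range(num_vbuckets)}
```

Comparing the plan before and after a node is added yields the minimal set of partition moves, which is why adding a node touches only a fraction of the data.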
Failover Operation - Fault Tolerance
Failover automatically switches over to the replicas for a given database:
– Gracefully, under node maintenance
– Immediately, under auto-failover
– Can be triggered manually through the Admin UI/REST/CLI
Automatic failover in case of unplanned outages (system failures):
– Can be configured through the Admin UI/REST/CLI
– Constraints in place to avoid “split-brain” and false positives:
– 30-second delay, multiple heartbeat “pings”
– Clusters of >= 3 nodes
– Only one node down at a time
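The auto-failover safeguards combine into a single decision rule, sketched below under stated assumptions: `unresponsive` maps node name to seconds since its last heartbeat, and the function name is illustrative rather than anything in the product.

```python
def should_auto_failover(cluster_size: int,
                         unresponsive: dict,
                         timeout_s: int = 30) -> bool:
    """Apply the safeguards above before failing a node over automatically.

    Requires a cluster of at least three nodes, exactly one node past
    the heartbeat timeout, and the full timeout elapsed (default 30 s).
    """
    down = [n for n, secs in unresponsive.items() if secs >= timeout_s]
    return cluster_size >= 3 and len(down) == 1
```

Two nodes down at once could indicate a network partition rather than a failure, so the rule refuses to act; likewise a two-node cluster cannot distinguish "the other node died" from "I am isolated".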
Automatic Failover – “In action”
[Figure: a five-server cluster; each server holds active shards and replica shards, and app servers route requests through the Couchbase client library using the cluster map.]
1. App servers are accessing shards across the cluster
2. Requests to Server 3 fail
3. The cluster detects that the server has failed and promotes replicas of its shards to active
4. The cluster map is updated
5. Requests for those documents now go to the appropriate server
6. Typically a rebalance would follow
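Promoting replicas is a pure cluster-map transformation, which is why failover is fast: no data moves. A minimal sketch, assuming `active` and `replica` are maps from shard id to owning node (names illustrative):

```python
def fail_over(active: dict, replica: dict, failed_node: str) -> dict:
    """Promote replica copies of the failed node's shards to active.

    Returns a new cluster map in which every shard previously served by
    `failed_node` is now served by the node holding its replica. The
    input maps are left untouched.
    """
    new_active = dict(active)
    for shard, node in active.items():
        if node == failed_node:
            new_active[shard] = replica[shard]
    return new_active
```

The follow-up rebalance then recreates the missing replica copies so the cluster regains its redundancy.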
Node Recovery – Bring Cluster back to Capacity
Failed-over nodes can be re-added back to the cluster:
– Full recovery – add the node back as a fresh node
– Delta node recovery – add the failed node back incrementally into the cluster without having to rebuild the full node
Rack-Zone Awareness – Rack-Zone Availability
Grouping of servers into server groups so that each group is on a physically separate rack
Ensures that replica data partitions are not on the same rack as the primary partitions
[Figure: nine servers grouped into three racks.]
– Servers 1, 2, 3 on Rack 1; Servers 4, 5, 6 on Rack 2; Servers 7, 8, 9 on Rack 3
– The cluster has 2 replicas (3 copies of data)
– This is a balanced configuration
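The placement rule above can be sketched as: pick each replica from a server group other than the primary's. This is a simplified illustration (one replica per other rack, first node in each group); the real placement also balances load within groups, and the function name is illustrative.

```python
def place_replicas(primary_node: str, server_groups: dict, num_replicas: int) -> list:
    """Pick replica nodes from racks other than the primary's rack.

    `server_groups` maps rack name -> list of nodes. With the balanced
    nine-server layout above and two replicas, each copy of a shard
    lands on a different rack.
    """
    primary_rack = next(r for r, nodes in server_groups.items()
                        if primary_node in nodes)
    other_racks = [r for r in server_groups if r != primary_rack]
    return [server_groups[r][0] for r in other_racks[:num_replicas]]
```

Losing an entire rack then costs at most one copy of any shard, so the cluster can fail over the whole group and keep serving data.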
Couchbase Server - MDS Architecture (NEW in 4.0)
What is Multi-Dimensional Scalability?
MDS is the architecture that enables independent scaling of data, query, and indexing workloads. It also isolates services to minimize interference.
– Independent “zones” for the Query, Index, and Data Services
[Figure: a Couchbase cluster of nodes 1 through 8 partitioned into Query Service, Index Service, and Data Service zones.]
Part II – Disaster Recovery
Cross Datacenter Replication (XDCR)
Unidirectional replication:
– Hot spare / disaster recovery
– Development/testing copies
Bidirectional replication:
– Datacenter locality
– Multiple active masters
Cross Datacenter Replication (XDCR) using DCP
– Continuously replicates data from a source cluster to remote clusters, which may be spread across geographies
– Supports unidirectional and bidirectional operation
– Applications can read and write from both clusters (active-active replication)
– Automatically handles node addition and removal
– Simplified administration via the Admin UI, REST, and CLI
– Pause and resume XDCR replication (NEW in 4.0)
– Filtering of data on the replication stream
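Active-active replication means the same document can be updated in both datacenters, so a deterministic conflict resolution rule is needed. The sketch below illustrates revision-based ("most updates wins") resolution; the dict fields and the use of CAS as the only tie-breaker are simplifications for the example.

```python
def xdcr_winner(local_doc: dict, remote_doc: dict) -> dict:
    """Pick the winning revision under 'most updates wins' resolution.

    The document with the higher revision (update) count wins; on a
    tie, a secondary value (here CAS) breaks it, so both clusters
    converge on the same copy regardless of replication order.
    """
    key = lambda d: (d["rev"], d["cas"])
    return local_doc if key(local_doc) >= key(remote_doc) else remote_doc
```

Because the rule depends only on the two documents and not on which cluster evaluates it, bidirectional replication converges without coordination.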
XDCR – Memory-based using DCP
[Figure: an application server writes DOC 1 into the managed cache; DCP drives intra-cluster replication, the indexer, and persistence to disk on the source cluster, and streams the same mutation to a remote cluster via cross datacenter replication.]
Backup & Restore - “Oops”
The cbbackup tool provides backups of a running cluster:
– Entire cluster – across all buckets
– Single node – across all buckets
– Single node – single bucket
– Supports remote or local access
Efficient Recovery with Incremental Backup & Restore
Minimize time and resources during backups:
• Back up only the data updated since the last backup
• Differential backups
• Cumulative backups
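Incremental backup is another DCP-checkpoint consumer: it selects only mutations past a saved sequence number. In this toy sketch (tuple layout and function name illustrative), a differential backup passes the seqno of the most recent backup of any kind, while a cumulative backup passes the seqno of the last full backup.

```python
def incremental_backup(mutations, checkpoint_seqno):
    """Select only mutations newer than the given backup checkpoint.

    `mutations` is an ordered list of (seqno, key) tuples. Differential
    chains advance the checkpoint after every backup; cumulative chains
    keep it pinned at the last full backup.
    """
    return [m for m in mutations if m[0] > checkpoint_seqno]
```

Restores then replay the full backup plus either the whole differential chain or just the latest cumulative increment.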
Thank you.
Questions?