under the covers - couchbase server architecture: couchbase connect 2014
TRANSCRIPT
Under the Covers: Couchbase Server Architecture
Steve Yen | co-founder, Couchbase
version 20141003.2
©2014 Couchbase, Inc. 2
Agenda
System diagrams
Let there be a bucket
Rebalance

Definitions
Bucket: a logical container of keys & values
Partition: a sub-part or division of a Bucket; a Bucket is made up of multiple Partitions; we can allocate Partitions onto different server nodes
Rebalance: the orchestrated migration of Partitions amongst server nodes in order to spread the load and achieve replication constraints

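To make the Bucket / Partition / node relationship concrete, here is a minimal sketch. The hash function (CRC32), the partition count of 8, the round-robin allocation and all function names are assumptions chosen for illustration; the slides do not specify any of them.

```python
import zlib

NUM_PARTITIONS = 8  # hypothetical; kept small for readability


def partition_for_key(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition using a stable hash (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def allocate_partitions(num_partitions: int, nodes: list[str]) -> dict[int, str]:
    """Round-robin allocation of partitions onto nodes.

    A stand-in for a real partition map, which has to satisfy more constraints.
    """
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}


partition_map = allocate_partitions(NUM_PARTITIONS, ["A", "B", "C"])
node = partition_map[partition_for_key("user::42")]  # node that owns this key
```

The point of the indirection is that a key always maps to the same partition, so a Rebalance only has to change which node each partition lives on.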
Agenda
System diagrams
Let there be a bucket
Rebalance
Inside a datacenter
[Diagram: a datacenter containing a load balancer in front of three web app servers, which talk to a Couchbase Cluster of five Couchbase Server nodes]

Inside a couchbase cluster
[Diagram: five Couchbase Server nodes; each node contains a Cluster Manager and a Data Manager]

Inside a node
[Diagram: one Couchbase Server node, made up of a Cluster Manager and a Data Manager]

Inside a node / OS processes
babysitter (erlang)
ns-server / view-engine (erlang): ports 8091, 8092, 11214, 11215, …
godu (golang)
cert gen (golang)
map gen (golang)
memcached (c/c++): ports 11209, 11210
[Diagram: these OS processes drawn over the node's Cluster Manager and Data Manager]

Inside the Cluster Manager
ns-server / view-engine (erlang): ports 8091, 8092, 11214, 11215, …
erlang / OTP: framework for building reliable, clustered systems
ns-server and view-engine both run on erlang / OTP

Inside the Cluster Manager / ns-server
[Diagram: ns-server as a stack of layers: master-only services, per-node services, per-node-&-bucket services and REST admin, on top of generic distributed facilities and generic local facilities]

master-only services
Services that run only on a single master node
Master will be selected at runtime

per-node services
Services that run on every node
Examples:
- node heart beat
- XDCR

per-node-&-bucket services
Services that run on every node, for every bucket
Example:
- per node, per bucket stats collection

REST admin
+ client-side JS / admin web UI

generic facilities (distributed and local)

vclock, uuid, work queue, events, misc
Libraries: vector clocks, work queues, event pub/sub
github.com/couchbase/ns_server/… misc|vclock|event

logging (ALE)
Another Logger for Erlang
“ALE is the best!” “Awesome Logger for Erlang”
github.com/aartamonau/ale

local config store
simple local storage of configuration data
github.com/couchbase/ns_server … ns_config

distributed node discovery
… when nodes appear & disappear
github.com/couchbase/ns_server … node_disco

config gossip replication
eventually consistent distributed config
vector clock based
github.com/couchbase/ns_server … ns_config_rep

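As an illustration of "eventually consistent distributed config, vector clock based", the sketch below shows how two copies of one config value could be reconciled with vector clocks. It is not taken from ns_config_rep: the function names are hypothetical, and the conflict rule (pick the larger value deterministically) is an assumption made only so that all nodes converge.

```python
def vclock_increment(clock: dict, node: str) -> dict:
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def vclock_descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen every update that clock `b` has seen."""
    return all(a.get(n, 0) >= count for n, count in b.items())


def vclock_merge(a: dict, b: dict) -> dict:
    """Element-wise maximum of two clocks."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


def reconcile(local: tuple, remote: tuple) -> tuple:
    """Each side is (value, clock). Return the copy both nodes should keep."""
    (lval, lclock), (rval, rclock) = local, remote
    if vclock_descends(rclock, lclock):
        return remote                      # remote is newer (or equal)
    if vclock_descends(lclock, rclock):
        return local                       # local is newer
    # Concurrent updates: assumed tie-break, same result on every node.
    return (max(lval, rval), vclock_merge(lclock, rclock))
```

Because reconcile gives the same answer regardless of which node runs it, repeated gossip rounds can bring every node to the same config without a central coordinator.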
Inside the Data Manager
memcached (c/c++): ports 11209, 11210

Inside the Data Manager / libevent
libevent: high performance cross-platform library for non-blocking network I/O

Inside the Data Manager / networking
networking layer / conn thread pool: thread0, thread1, thread2, thread3
networking layer listen()’s & accept()’s connections; assigns connection to a worker thread; parses incoming bytes to messages

Inside the Data Manager / engine manager
engine manager loads and manages engines

Inside the Data Manager / ep-engine
[Diagram: three ep-engine (couchbase bucket type) instances and a file I/O thread pool, sitting above the engine manager, the networking layer / conn thread pool and libevent]

Inside ep-engine
[Diagram components, top to bottom]
Engine APIs (get, set, del, add, append, DCP, …)
Partition Hash Table (active), Partition Hash Table (replica), Partition Hash Table (active), …
Checkpoints (shown alongside each Partition Hash Table)
User Configured Replica Count = 1
Shared Thread Pool: Reader Threads, Writer Threads, Aux-IO Threads, Non-IO Threads
Tasks shown: Batch Readers, Flushers, Data Backfill, Data Replicator, I/O Completion Notifier, Item Pager, Expiry Pager, Checkpoint Manager, …
Append-only B-Tree Storage Engine
Chiyoung Seo’s Deep Dive Session at 10/6 4:20PM in this room
Mike Wiederhold’s DCP Session at 10/6 5:10PM in this room

Inside the Cluster Manager
erlang / OTP
ns-server, view-engine
Views!
Next Gen: split view-engine into a separate process
Sarath Lakshman’s Views Session at 10/7 11:40AM in this room
Gerald Sangudi’s N1QL & Indexing at 10/7 10AM, developer track

Agenda
System diagrams
Let there be a bucket
Rebalance

CREATE BUCKET
[Diagram: the node's OS processes (Cluster Manager and Data Manager), about to handle a CREATE BUCKET request]

Inside ns-server / CREATE BUCKET
1) REST admin layer receives request
2) BUCKET CREATE is dispatched to global orchestrator, which checks inputs and rules
3) …and then saves new bucket config to local config store
4) New bucket config is gossiped and replicated to other nodes
5) On the other nodes… Bucket Supervisor listens for bucket config change events
6) Bucket Supervisor spawns new per-node-&-bucket services
Meanwhile… concurrently:
4) the orchestrator generates a new partition map (partition map gen)
5) …and then schedules a run of the Master Janitor
Master Janitor looks for messes and cleans them up
6) Master Janitor sends commands to Janitor Agents on each node…
7) …such as to create buckets and partitions on the local data manager (memcached) process (ns-memcached)…
8) …and to set up DCP replication streams between data manager processes (DCP replicator)

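Steps 4 to 6 of the flow above (config change arrives, Bucket Supervisor reacts, per-node-&-bucket services are spawned) follow an event pub/sub pattern. The sketch below shows that pattern in a few lines; the class names, the topic name and the service naming are hypothetical and are not ns-server's actual modules.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-process pub/sub, standing in for a config event facility."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, payload):
        for handler in self._subscribers[topic]:
            handler(payload)


class BucketSupervisor:
    """Listens for bucket config changes and starts per-bucket services."""

    def __init__(self, node, bus):
        self.node = node
        self.services = {}  # bucket name -> list of running service names
        bus.subscribe("bucket_config_changed", self.on_config_change)

    def on_config_change(self, config):
        for bucket in config["buckets"]:
            if bucket not in self.services:  # spawn only for new buckets
                self.services[bucket] = [f"stats_collector:{self.node}:{bucket}"]


bus = EventBus()
supervisor = BucketSupervisor("node2", bus)
bus.publish("bucket_config_changed", {"buckets": ["default"]})
```

The supervisor never talks to the orchestrator directly; it only reacts to the replicated config, which is why the same code path works on every node.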
Inside the Data Manager / CREATE BUCKET
[Diagram: memcached (ports 11209, 11210) with libevent, networking layer / conn thread pool (thread0 to thread3), engine manager, ep-engine (couchbase bucket type) and file I/O thread pool]
A) CREATE BUCKET command received and parsed
B) …and forwarded to the engine manager
C) …and engine manager loads a new instance of the required engine
D) …which can then allocate resources (hashtables, queues, directories, files, etc)

CREATE BUCKET ✔

Agenda
System diagrams
Let there be a bucket
Rebalance

REBALANCE
[Diagram: the node's OS processes (Cluster Manager and Data Manager), about to handle a REBALANCE request]

Inside ns-server / REBALANCE
1) REST admin handles the REST call for REBALANCE by calling the global orchestrator
2) global orchestrator does sanity checks, calls Rebalancer to generate new “balanced” maps, and calls Master Janitor
3) Master Janitor remotely calls Janitor Agents for per-node operations and state changes
4) …including stopping and recreating replication streams (DCP replicator)

Balance
All nodes have partitions
Spread the load
Rack/zone awareness
Swap rebalance & failover cases
Clumpiness: out degree of connections from any node is limited

2nd Degree Of Balance
When you Failover a node, you still want some balance
[Partition maps: nodes A, B, C as columns, partitions 0 to 7 as rows; column alignment was lost in extraction]
A B C
0 MASTER replica
1 MASTER replica
2 replica MASTER
3 MASTER replica
4 MASTER replica
5 replica MASTER
6 MASTER replica
7 MASTER replica

Chaos Shark!!
A B C
0 MASTER MASTER
1 MASTER replica
2 replica MASTER
3 MASTER MASTER
4 MASTER replica
5 replica MASTER
6 MASTER MASTER
7 MASTER replica
Server B is now overloaded!

BETTER! This map has 2nd degree of balance
A B C
0 MASTER replica
1 MASTER replica
2 replica MASTER
3 MASTER replica
4 replica MASTER
5 replica MASTER
6 MASTER replica
7 replica MASTER

The Return of the Chaos Shark!
A B C
0 MASTER MASTER
1 MASTER replica
2 replica MASTER
3 MASTER MASTER
4 replica MASTER
5 replica MASTER
6 MASTER MASTER
7 replica MASTER

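The idea of a 2nd degree of balance can be checked mechanically: fail a node, promote the replicas of its partitions, and count masters on the surviving nodes. The sketch below does this for two hypothetical maps with one replica per partition. The maps are illustrative assumptions in the spirit of the slides, not a reconstruction of the exact tables, and the function name is made up.

```python
from collections import Counter


def masters_after_failover(partition_map, failed):
    """Count masters per surviving node after `failed` is failed over.

    partition_map: list of (master_node, replica_node), one entry per partition.
    """
    counts = Counter()
    for master, replica in partition_map:
        if master == failed:
            master = replica  # the replica is promoted to master
        counts[master] += 1
    return dict(counts)


# All of A's replicas sit on B: balanced at first, lopsided after A fails.
clumped = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "B"),
           ("B", "C"), ("C", "A"), ("A", "B"), ("B", "C")]

# A's replicas are spread over B and C: still balanced after A fails.
spread = [("A", "C"), ("B", "C"), ("C", "A"), ("A", "C"),
          ("B", "A"), ("C", "B"), ("A", "B"), ("B", "A")]

print(masters_after_failover(clumped, "A"))  # B carries 6 masters, C only 2
print(masters_after_failover(spread, "A"))   # B and C carry 4 masters each
```

A map generator could use a count like this as one of its scoring terms, next to the other constraints listed on the Balance slide.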
Balance
Also, try to minimize partition movements
Finding partition maps that meet all the constraints is hard!
Our search algorithms are far from perfect

3 phases of migrating a single partition
1) replica building phase (serialized)
get most of the data replicated
and, phase #1 is serialized; 1 partition at a time per node, to avoid crushing I/O, network
and, ensure #1 persists before moving onwards, for safety
2) indexing phase (concurrent)
ensure that view queries have consistent results even in midst of Rebalance
3) takeover phase (concurrent)

4 Simple Partition States
Partition State
When request arrives from a client …
ACTIVE process request as normal
PENDING server blocks the connection
REPLICA redirect response: you’re accessing the wrong server!
DEAD redirect response: you’re accessing the wrong server!
©2014 Couchbase, Inc. 116
4 Simple Partition States & Partition Takeover [ Server A Server B ]
1) Switch server B’s partition P state to PENDING;
So, any client requests to server B for partition P will block.
.
Partition State
When request arrives from a client …
ACTIVE process request as normal
PENDING server blocks the connection
REPLICA redirect response: you’re accessing the wrong server!
DEAD redirect response: you’re accessing the wrong server!
©2014 Couchbase, Inc. 117
4 Simple Partition States & Partition Takeover [ Server A Server B ]
1) Switch server B’s partition P state to PENDING;
So, any client requests to server B for partition P will block.
2) Setup DCP Takeover stream for partition P from server A to server B
.
Partition State
When request arrives from a client …
ACTIVE process request as normal
PENDING server blocks the connection
REPLICA redirect response: you’re accessing the wrong server!
DEAD redirect response: you’re accessing the wrong server!
©2014 Couchbase, Inc. 118
4 Simple Partition States & Partition Takeover [ Server A Server B ]
1) Switch server B’s partition P state to PENDING;
So, any client requests to server B for partition P will block.
2) Setup DCP Takeover stream for partition P from server A to server B
Server A tries to drain data to server B.
Partition State
When request arrives from a client …
ACTIVE process request as normal
PENDING server blocks the connection
REPLICA redirect response: you’re accessing the wrong server!
DEAD redirect response: you’re accessing the wrong server!
©2014 Couchbase, Inc. 119
4 Simple Partition States & Partition Takeover [ Server A Server B ]
1) Switch server B’s partition P state to PENDING;
So, any client requests to server B for partition P will block.
2) Setup DCP Takeover stream for partition P from server A to server B
Server A tries to drain data to server B
And, then, atomically, server A will… Send a TAKEOVER message to
server B, And, server A changes its partition to
DEAD.
Partition State
When request arrives from a client …
ACTIVE process request as normal
PENDING server blocks the connection
REPLICA redirect response: you’re accessing the wrong server!
DEAD redirect response: you’re accessing the wrong server!
©2014 Couchbase, Inc. 120
4 Simple Partition States & Partition Takeover [ Server A -> Server B ]
1) Switch server B’s partition P state to PENDING.
   So, any client requests to server B for partition P will block.
2) Set up a DCP Takeover stream for partition P from server A to server B.
   Server A tries to drain data to server B.
   And, then, atomically, server A will send a TAKEOVER message to server B, and server A changes its partition to DEAD.
   So, any client requests to server A will redirect.
Server B handles the TAKEOVER message
   When server B receives the TAKEOVER message, server B will atomically switch the state of its partition P from PENDING to ACTIVE.
   So, any clients previously blocked at server B will now proceed!
Partition State | When request arrives from a client …
ACTIVE | process request as normal
PENDING | server blocks the connection
REPLICA | redirect response: you’re accessing the wrong server!
DEAD | redirect response: you’re accessing the wrong server!
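The takeover steps and the four partition states can be sketched as a toy state machine. This is only an illustration under stated assumptions: the names `Server`, `begin_takeover`, and `finish_takeover` are hypothetical, the DCP drain is elided, and "atomically" is modeled as a plain sequential step rather than real concurrency control.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    PENDING = "pending"
    REPLICA = "replica"
    DEAD = "dead"


class Server:
    """Toy model of one server's copy of a single partition P."""

    def __init__(self, name, state):
        self.name = name
        self.state = state
        self.blocked = []  # requests parked while the partition is PENDING

    def handle(self, request):
        if self.state is State.ACTIVE:
            return f"{self.name}: processed {request}"
        if self.state is State.PENDING:
            self.blocked.append(request)  # connection blocks; no reply yet
            return None
        # REPLICA and DEAD both tell the client it has the wrong server.
        return f"{self.name}: wrong server for {request}"

    def on_takeover(self):
        """PENDING -> ACTIVE, then let the blocked requests proceed."""
        self.state = State.ACTIVE
        released, self.blocked = self.blocked, []
        return [self.handle(r) for r in released]


def begin_takeover(src, dst):
    # Step 1: the destination blocks clients while data is drained to it.
    dst.state = State.PENDING


def finish_takeover(src, dst):
    # Step 2 (draining over the DCP stream is elided here): the source goes
    # DEAD and the destination is told to take over.
    src.state = State.DEAD
    return dst.on_takeover()
```

In this sketch a request that reaches server B between `begin_takeover` and `finish_takeover` gets no reply at first, and is answered only once B has switched to ACTIVE.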
3 phases of migrating a single partition
1) replica building phase (serialized) ✔
2) indexing phase (concurrent) ✔
3) takeover phase (concurrent) ✔
Pencils Down Policy
Can stop (and restart) rebalance at any time
with no data loss
Inside a node / OS processes
[Diagram: OS processes inside one Couchbase Server node (Cluster Manager + Data Manager)]
babysitter (erlang)
ns-server / view-engine (erlang), ports 8091, 8092, 11214, 11215, …
memcached (c/c++), ports 11209, 11210
godu (golang)
cert gen (golang)
map gen (golang)
Inside a couchbase cluster
[Diagram: five Couchbase Server nodes, each containing a Cluster Manager and a Data Manager]
Inside a couchbase cluster
[Diagram: a Couchbase Cluster made up of five Couchbase Server nodes]
System diagrams
Let there be a bucket
Rebalance
Agenda
Under the Covers
When you look Under the Covers
you’re just going to see more covers
Thanks!
Questions?
Extra Slides
A client connects
- Cluster map
- Connect directly to data nodes
- accept() loop assigns conn to a thread in a thread pool
- Conn sticks to that worker thread for the life of the conn
- More conns == more CPU utilization
Data distribution
- Hash partitioned
- Hash(key) => vbucketId
- clusterMap[vbucketId] => master & replica nodes for that vbucketId
- Given a key, we know where the item lives
- CAP => Consistent (later: rebalance shows how we handle
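The two lookups on this slide can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the slide only says Hash(key), so CRC32 is a stand-in, and the vbucket count of 1024, the round-robin map, and the helper names `vbucket_id`, `build_cluster_map`, and `locate` are hypothetical.

```python
import zlib

NUM_VBUCKETS = 1024  # assumed count, for illustration only


def vbucket_id(key: bytes, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Hash(key) => vbucketId. CRC32 is a stand-in hash function."""
    return zlib.crc32(key) % num_vbuckets


def build_cluster_map(nodes, num_vbuckets=NUM_VBUCKETS):
    """Toy cluster map: vbucketId => (master node, replica node).

    Round-robin placement; a real map would also respect balance and
    replication constraints.
    """
    n = len(nodes)
    return [(nodes[i % n], nodes[(i + 1) % n]) for i in range(num_vbuckets)]


def locate(key: bytes, cluster_map):
    """Given a key, return (vbucketId, master, replica)."""
    vb = vbucket_id(key, len(cluster_map))
    master, replica = cluster_map[vb]
    return vb, master, replica
```

Because the client can compute the vbucketId itself and holds a copy of the map, it can go straight to the master node for a key without asking a coordinator.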
Inside memcached / ep-engine
[Diagram: memcached on ports 11209, 11210; libevent; networking layer / conn thread pool (thread0 thread1 thread2 thread3); engine manager; default-engine (memcached bucket type); ep-engine (couchbase bucket type) instances; file I/O thread pool; labels X and Y]
Agenda
- System diagrams
- Let there be a bucket
- Rebalance and failover
- A client connects
- A SET request and durability, replication, views, XDCR
- A GET request and background storage reads
A SET request
- bucket (associated with connection)
- operation (SET)
- vbucketId, key, value-size, value
- CAS, flags, expiration
In the data node
- Memcached cracks open request into data structure
- Asks engine to allocate memory
- Reads in bytes into memory buffer
- Calls the engine (ep-engine)
- Callback / EWOULDBLOCK style code
- All network I/O ops are evented (libevent: kqueue, epoll)
Memory management
- Shout out to Chiyoung here
- tcmalloc & jemalloc (Dave Rigby)
In ep-engine
- Hash-tables consulted
  - Vbucket hashtable (sharded; hashtable growth TBD)
- Set the entry into hashtable
- Add to end of queues
  - Persistence, replication, checkpoints
  - DCP
persistence
- What are all these files?
- Flushing dirty items
- Sorting
- Couchstore
  - Append only btree; robustness; restarts; SSD friendliness
  - Btree balance
compaction
- By ep-engine, orchestrated by ns-server
- History is available until compaction
- Deletion tombstones
- Forestdb & SSDs
View Maintenance
- In erlang, view-engine
- Run map() function on each document
  - JS “NIF”
  - Or delete
- Copy on write btree
View Queries
- Another day
Secondary Indexes
- Coming soon
A GET request
- When a memory cache “hit”
- When a memory cache “miss”
  - Schedule a background fetch (bgfetch)
  - Return EWOULDBLOCK to networking layer
  - When background I/O read thread gets the item back, notifies worker threads to retry the GET
- Eviction / ejection
  - Separate eviction thread
  - Separate expiration thread
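The cache-miss path can be sketched as follows. This is a single-threaded toy under stated assumptions: `Engine`, `run_bgfetcher`, and the `EWOULDBLOCK` sentinel object are hypothetical stand-ins, the "disk" is a dict, and the background I/O thread is modeled as a method the caller invokes by hand.

```python
from collections import deque

EWOULDBLOCK = object()  # sentinel standing in for the "would block" status


class Engine:
    def __init__(self, disk):
        self.cache = {}  # resident items (the in-memory hash table)
        self.disk = disk  # stand-in for persisted items
        self.bgfetch_queue = deque()

    def get(self, key, conn):
        if key in self.cache:
            return self.cache[key]  # cache hit: answer from memory
        # Cache miss: schedule a background fetch and tell the networking
        # layer to park this connection instead of waiting on disk I/O.
        self.bgfetch_queue.append((key, conn))
        return EWOULDBLOCK

    def run_bgfetcher(self):
        """Stand-in for the background I/O read thread.

        Loads each requested item into memory and returns the connections
        that should now retry their GET. A key absent from disk is cached
        as None, so the retry answers "not found" instead of blocking again.
        """
        to_retry = []
        while self.bgfetch_queue:
            key, conn = self.bgfetch_queue.popleft()
            self.cache[key] = self.disk.get(key)
            to_retry.append((key, conn))
        return to_retry
```

The point of the pattern is that a worker thread never sleeps on a disk read: it returns immediately, serves other connections, and the GET is retried once the item is resident.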
Inside a node / OS processes
[Diagram: OS processes inside one Couchbase Server node (Cluster Manager + Data Manager)]
babysitter (erlang)
ns-server / view-engine (erlang), ports 8091, 8092, 11214, 11215, …
moxi (c), port 11211
memcached (c/c++), ports 11209, 11210
godu (golang)
cert gen (golang)
map gen (golang)
On startup: warmup
observe
Stats everywhere in data node
Stats
Data Manager Architecture
[Diagram: Memcached (port 11210); Bucket Engine; Database Bucket, Database Bucket, Database Bucket …; Shared Thread Pool; storage interface; Storage Engine]
Partition Hash Table
- Multiple partitions per hash table
  - Each partition is maintained by a linked list of items
  - Engine parameter “ht_size” to pass the initial partition size to the database bucket
- Multiple locks to synchronize accesses to hash table partitions
  - Engine parameter “ht_locks” to pass the number of partition locks to the database bucket
- Hash table partitions are dynamically resized by the daemon task “hash table resizer”
  - NON-IO thread runs the hash table resizer task periodically
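A lock-striped hash table along these lines might look like the sketch below. It is not the ep-engine implementation: the class name is hypothetical, the parameter names are borrowed from the slide but their exact semantics are assumed (number of chains and number of locks), a Python list stands in for each linked list, CRC32 is an arbitrary hash, and resizing is left out.

```python
import threading
import zlib


class StripedHashTable:
    def __init__(self, ht_size=8, ht_locks=3):
        # One chain of (key, value) pairs per hash table partition.
        self.buckets = [[] for _ in range(ht_size)]
        # Fewer locks than partitions: several partitions share one lock.
        self.locks = [threading.Lock() for _ in range(ht_locks)]

    def _slot(self, key):
        idx = zlib.crc32(key.encode()) % len(self.buckets)
        return idx, self.locks[idx % len(self.locks)]

    def set(self, key, value):
        idx, lock = self._slot(key)
        with lock:
            chain = self.buckets[idx]
            for i, (k, _) in enumerate(chain):
                if k == key:
                    chain[i] = (key, value)  # overwrite existing item
                    return
            chain.append((key, value))

    def get(self, key):
        idx, lock = self._slot(key)
        with lock:
            for k, v in self.buckets[idx]:
                if k == key:
                    return v
        return None
```

Sharing a small pool of locks across many partitions keeps lock memory bounded while still letting threads that touch different stripes proceed in parallel.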
Partition Hash Table
[Diagram: a hash table with Partition 1, Partition 2, …, Partition 99, Partition 100; each partition is a linked list of items, and each item holds a Key (e.g. “K1”), Metadata (exp, cas, NRU, …), and a Value (e.g. “V1”)]
Doctors & Janitors
- Doctors: first, do no harm
- Janitors: clean up the mess and don’t make any new messes
- every 10 seconds, master janitor broadcasts to janitor agents on every node to “please give me your state”
  - and, compares “reality” with expected maps & expected states
  - and, requests state changes & replication streams as needed
  - (startup case also has ‘enable traffic’ step)
  - (also, if master janitor sees no vbucket map for a bucket, then gens new map)
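The "compare reality with expected states" step can be sketched as a pure function. The data shapes and the name `plan_state_changes` are assumptions for illustration; the real janitor also manages replication streams, which are left out here.

```python
def plan_state_changes(expected, reported):
    """Compare expected vbucket states with what the agents reported.

    Both arguments map node -> {vbucket_id: state}. Returns a list of
    (node, vbucket_id, current_state, wanted_state) for every mismatch.
    Vbuckets that a node did not report are treated as "missing".
    """
    changes = []
    for node, vbuckets in expected.items():
        actual = reported.get(node, {})
        for vb, want in sorted(vbuckets.items()):
            have = actual.get(vb, "missing")
            if have != want:
                changes.append((node, vb, have, want))
    return changes
```

The function only plans changes and never touches a vbucket that already matches, which is one way to read "don't make any new messes".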
Compactor
- Every 30 seconds
- Use local KV stats & view stats and compaction policy to ask data node to compact the relevant KV vbucket db file, or run compaction on view db file
XDCR
- replication config documents stored in view-engine replicator DB
- replication manager watches replicator DB for config document changes
- next-gen in golang coming soon
master election
- mb_master decides who is master node and spawns right processes on that node
DCP
- Intra-cluster replication
- Cross-datacenter replication (XDCR)
- Views and secondary indexes
- Incremental Backup & Restore
- 3rd party integrations (hadoop, elasticsearch, etc)
- Plug Mike’s DCP talk here
Replication streams
- DCP
- DCP replicator
- theme: data-node doesn’t connect to outside
- need diagram
Let there be a bucket
- Bucket created in REST / web UI (coming soon, more AUTH options)
- config entries saved
- bucket event handler (on every node) watches for bucket config change events, then creates/stops per bucket supervisor on node
- per bucket supervisor on a node…
  - spawns connections to data-node (ns_memcached)
  - spawns janitor-agent for the bucket on that node (janitor agent receives requests from master janitor to create vbuckets, change vbucket state, start/stop DCP replication)
  - spawns per-bucket stats collector & stats archiver
  - spawns CAPI view manager for view maintenance & queries
Inside ns-server
[Diagram: layered view of ns-server]
Layers: master-only services; per-node-&-bucket services; per-node services; generic distributed facilities; generic local facilities
Components shown: master election, master tick, global orchestrator, rebalancer, janitor, auto-failover detector, janitor agent, DCP replicator, bucket supervisor, stats collector/archiver, heart, doctor, XDCR services, REST admin, distributed node discovery, config gossip replication, local config store, logging (ALE), vclock, uuid, work queue, events, misc
Heart
- Every 5 seconds
- Grabs bucket & task states and broadcasts to entire cluster
Doctor
- Listens to Heart broadcasts and keeps cache of recent news
- Every node has sense of cluster health
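A cache of recent heartbeats could be kept roughly as below. This is a guess at the shape of the idea, not the ns-server code: the class layout, the `stale_after` threshold of 15 seconds, and the rule that a node with no recent heartbeat counts as unhealthy are all assumptions.

```python
class Doctor:
    """Keeps the most recent Heart broadcast seen from each node."""

    def __init__(self, stale_after=15.0):
        self.stale_after = stale_after  # seconds; hypothetical threshold
        self.recent = {}  # node -> (timestamp, status)

    def on_heartbeat(self, node, status, now):
        # Only the latest news per node is kept.
        self.recent[node] = (now, status)

    def healthy_nodes(self, now):
        # A node counts as healthy if it was heard from recently enough.
        return sorted(
            node
            for node, (seen, _) in self.recent.items()
            if now - seen <= self.stale_after
        )
```

Since every node runs its own copy of this cache, each one can form a local opinion of cluster health without asking a central service.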
Bucket Supervisor
- Top of local supervision tree of per-node-&-per-bucket services
XDCR Services
- Manages XDCR streams
- Next gen: separate process; golang; more flexible conflict resolution
Master Election
- Decides who is the master node
Tick
- Broadcasts global tick counter
- “lost N ticks”
Global Orchestrator
- Spawns Rebalancer if needed
- Spawns Janitor
Rebalancer
- Computes new partition maps
- Supervises the Rebalance dance steps
Master Janitor
- Detects any messes and cleans them up
- Tries to not make any new messes: conservative
- every 10 seconds… Master Janitor broadcasts to Janitor Agents on every node to “please give me your state”
  - and, compares reality with expected states
  - and, requests state changes & replication streams as needed
  - also, if Master Janitor sees no vbucket map for a bucket, then generates new map
Auto Failover Detector
- CONSERVATIVELY “presses” the failover button only once
Janitor Agent
- Handles commands from Master Janitor
Stats Collector & Archiver
DCP Replicator
- Intracluster replication
Web UI & REST Admin Service
- + client-side JavaScript (switching to AngularJS)
Memcached handles bucket create
- Spawns off new ep-engine instance
- Separate “apartments”
- Buckets share threads, IO mgr
Rebalance
- When you press the Rebalance button / API
- Cluster Manager computes new map (follows rack/zone rules and seeks balancedness)
Rebalance
- Orchestrator “conducts” the rebalance moves
- For each bucket
  - Generate new vbucket map
  - Then, spawn vbucket-movers
- A vbucket mover spawns per-vbucket-mover
- Orchestrator -> Rebalancer -> vb_mover -> single_vbucket_mover -> does the takeover dance, with consistent view index maneuvers
1) replica building phase (bulk of data replicated)
   (phase #1 is serialized per node, to avoid crushing I/O, network; 1 vbucket at a time per node)
   (and, make sure #1 persists to disk before moving onwards: safety)
2) indexing phase (concurrent)
3) takeover phase (concurrent)
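The "1 vbucket at a time per node" rule for phase #1 can be sketched as a greedy round scheduler. This is an illustration, not the vb_mover logic: the function name is hypothetical, and treating both the source and the destination node as "busy" during a replica build is an assumption, since the slide does not say which side the limit applies to.

```python
def schedule_replica_builds(moves):
    """Group vbucket moves into rounds for the replica building phase.

    `moves` is a list of (vbucket_id, source_node, destination_node).
    Within one round, each node takes part in at most one move, so no
    node handles more than one bulk transfer at a time.
    """
    rounds = []
    pending = list(moves)
    while pending:
        busy, this_round, deferred = set(), [], []
        for vb, src, dst in pending:
            if src in busy or dst in busy:
                deferred.append((vb, src, dst))  # wait for a later round
            else:
                busy.update((src, dst))
                this_round.append((vb, src, dst))
        rounds.append(this_round)
        pending = deferred
    return rounds
```

For example, moves A->B and C->D can share a round, while a second move out of A has to wait for the next one; the indexing and takeover phases are not constrained this way on the slide.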
4 Simple Partition States
Partition State | When request arrives from a client … | Used …
ACTIVE | process request as normal | during normal operations
PENDING | server blocks the connection | during Rebalance - transferring partition ownership between servers
REPLICA | error response: you’re accessing the wrong server! | to keep Couchbase consistent
DEAD | error response: you’re accessing the wrong server! | to keep Couchbase consistent
CMD_ENABLE_TRAFFIC
- switch from “tmp not ready error” during warmup to “not-my-vbucket” error
janitor
- janitor is a library
- the phases of janitor run…
  1) wait until everyone is ready
     a) all states of all vbuckets on all nodes are ready
     b) so, we know all buckets are created, etc
  2) change vbucket states and drop old replication streams
  3) create new replication streams
- so, warmup and bucket creation are treated very similarly
Inside a couchbase cluster
[Diagram: five Couchbase Server nodes, each containing a Cluster Manager and a Data Manager]
About Me
- co-founder, Couchbase
- co-founder, Escalate => GE Retail Systems
- co-founder, Kiva Software => Netscape Application Server
- Approach Software RDBMS => Lotus
fast forward vbucket map & CCCP