replication internals: the life of a write

Andy Schwerin Lead Engineer, MongoDB

Post on 08-Sep-2014


Page 1: Replication Internals: The Life of a Write

Andy Schwerin, Lead Engineer, MongoDB

Page 2: Replication Internals: The Life of a Write

• Goals of Replication
• Replication Architecture
• A representative write

Page 3: Replication Internals: The Life of a Write

• High availability for processing reads and writes
  – Automatic leader election

• Support many network topologies
  – Tag sets

• Accessible consistency model
  – Ordered operation log

• Client can trade latency for durability
  – Write concern

Page 4: Replication Internals: The Life of a Write
Page 5: Replication Internals: The Life of a Write

{ ts: 4, op: "i", ns: "d.c", o: { _id: 10, name: "john" } }

OPLOG

Page 6: Replication Internals: The Life of a Write

PRIMARY OPLOG: 4

SECONDARY OPLOG: 8 9

SECONDARY OPLOG: 4 5

When a secondary oplog is not a prefix of the primary oplog…
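The prefix relationship named on this slide can be made concrete with a small check. This is only a sketch, assuming each oplog is modelled as a list of timestamps; `is_prefix` and `last_common_index` are hypothetical helper names, not MongoDB functions, and the sample timestamps are invented for illustration.

```python
def is_prefix(secondary_ts, primary_ts):
    """Return True if the secondary's oplog is a prefix of the primary's."""
    return secondary_ts == primary_ts[:len(secondary_ts)]


def last_common_index(secondary_ts, primary_ts):
    """Return the index of the last entry both oplogs share, or -1 if none."""
    idx = -1
    for i, (s, p) in enumerate(zip(secondary_ts, primary_ts)):
        if s != p:
            break
        idx = i
    return idx


primary = [1, 2, 3, 4]
behind = [1, 2]          # a prefix: the secondary only needs to catch up
diverged = [1, 2, 8, 9]  # not a prefix: it differs after the second entry

print(is_prefix(behind, primary))
print(is_prefix(diverged, primary))
print(last_common_index(diverged, primary))
```

A secondary whose oplog is a prefix can simply keep fetching; one whose oplog is not a prefix holds entries the primary never had.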

Page 7: Replication Internals: The Life of a Write

w:?

Page 8: Replication Internals: The Life of a Write

w:1

A write can be lost, without notification, if the primary disappears.

Page 9: Replication Internals: The Life of a Write

w:majority

More than half of the nodes must fail for the write to be lost.

And in that case, an outside operator must intervene before new writes are accepted.

Page 10: Replication Internals: The Life of a Write

w:all

All nodes have the write before primary responds.

But writes cannot complete if any node is down.
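The three write concern settings on these slides differ only in how many nodes must hold a write before the primary responds. A minimal sketch of that arithmetic follows; `required_acks` and `can_acknowledge` are hypothetical names, and `"all"` is treated here as the slides' shorthand for "every node in the set".

```python
def required_acks(w, n_nodes):
    """Number of nodes that must hold a write before it is acknowledged."""
    if w == "majority":
        return n_nodes // 2 + 1   # strictly more than half
    if w == "all":
        return n_nodes            # slide shorthand: every node
    return int(w)                 # numeric write concern, e.g. w:1 or w:2


def can_acknowledge(w, n_nodes, n_nodes_with_write):
    """True once enough nodes hold the write to satisfy the write concern."""
    return n_nodes_with_write >= required_acks(w, n_nodes)


for w in (1, "majority", "all"):
    print(w, required_acks(w, 3))
```

With three nodes and one of them down, `can_acknowledge("majority", 3, 2)` holds while `can_acknowledge("all", 3, 2)` does not, which is the trade-off the slides describe.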

Page 11: Replication Internals: The Life of a Write

OPLOG

d.c

OPLOG

P TS:6

S1 TS:6

S2 TS:2

d.c.insert({_id: 10, name: 'john'}, {wC: {w: 2}})

1. Fetch oplog entries
2. Apply to collections
3. Write to local oplog
4. Notify primary
5. Repeat
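The five steps above can be sketched as a toy loop. This is an in-memory model under stated assumptions, not the server's implementation: `Node`, `apply_entry`, `sync_once` and the `progress` dictionary are all hypothetical, and only insert entries are modelled.

```python
class Node:
    def __init__(self):
        self.oplog = []        # ordered list of entries, oldest first
        self.collections = {}  # namespace -> {_id: document}

    def last_ts(self):
        return self.oplog[-1]["ts"] if self.oplog else 0


def apply_entry(node, entry):
    """Apply one oplog entry; only inserts (op "i") are modelled here."""
    if entry["op"] == "i":
        doc = entry["o"]
        node.collections.setdefault(entry["ns"], {})[doc["_id"]] = doc


def sync_once(primary, secondary, name, progress):
    # 1. Fetch oplog entries newer than what the secondary already has.
    fetched = [e for e in primary.oplog if e["ts"] > secondary.last_ts()]
    for entry in fetched:
        apply_entry(secondary, entry)   # 2. Apply to collections
        secondary.oplog.append(entry)   # 3. Write to local oplog
    # 4. Notify the primary how far this secondary has replicated.
    progress[name] = secondary.last_ts()
    return len(fetched)


primary, s1 = Node(), Node()
entry = {"ts": 4, "op": "i", "ns": "d.c", "o": {"_id": 10, "name": "john"}}
apply_entry(primary, entry)
primary.oplog.append(entry)

progress = {}
while sync_once(primary, s1, "S1", progress):  # 5. Repeat until caught up
    pass
print(progress, s1.collections)
```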

Page 12: Replication Internals: The Life of a Write

[Diagram labels: OPLOG, OBSERVER, BATCH, BATCH, PREFETCH, APPLIER, BATCH, x.y, d.c, d.c, OPLOG]

d.c.insert({_id: 10, name: 'john'}, {wC: {w: 2}})

P TS:6

S1 TS:6

S2 TS:2

Page 13: Replication Internals: The Life of a Write

OPLOG

d.c.insert({_id: 10, name: 'john'}, {wC: {w: 2}})

d.c

{ ts: 4, op: "i", ns: "d.c", o: { _id: 10, name: "john" } }

P TS:4

S1 TS:2

S2 TS:2

Page 14: Replication Internals: The Life of a Write

OPLOG

d.c.insert({_id: 10, name: 'john'}, {wC: {w: 2}})

OBSERVER

BATCH

d.c

OPLOG

P TS:6

S1 TS:2

S2 TS:2

Page 15: Replication Internals: The Life of a Write

OPLOG

d.c.insert({_id: 10, name: 'john'}, {wC: {w: 2}})

BATCH

d.c

OPLOG

OBSERVER

P TS:6

S1 TS:2

S2 TS:2

Page 16: Replication Internals: The Life of a Write

[Diagram labels: OBSERVER, BATCH, BATCH, PREFETCH, OPLOG]

• Split batch into arbitrary work units
• Assign work to prefetch threads
• Entries processed in any order
• All while admitting readers

Allow readers
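The prefetch bullets describe an embarrassingly parallel step: the batch is cut into units, and the units are handed to threads with no ordering requirement. A rough sketch follows, assuming a thread pool from Python's standard library; `split_into_units`, `prefetch_unit` and `prefetch_batch` are hypothetical names, and the body of `prefetch_unit` is a stand-in because the slides do not say what the prefetch work consists of.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_units(batch, unit_size):
    """Cut the batch into arbitrary fixed-size work units."""
    return [batch[i:i + unit_size] for i in range(0, len(batch), unit_size)]


def prefetch_unit(unit):
    # Stand-in for the real prefetch work: report which documents were touched.
    return [(entry["ns"], entry["o"]["_id"]) for entry in unit]


def prefetch_batch(batch, unit_size=2, n_threads=4):
    units = split_into_units(batch, unit_size)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # Units may run in any order; nothing here depends on timestamps.
        results = list(pool.map(prefetch_unit, units))
    return {key for unit_keys in results for key in unit_keys}


batch = [{"ts": t, "ns": "d.c", "o": {"_id": t}} for t in range(1, 6)]
print(len(split_into_units(batch, 2)), sorted(prefetch_batch(batch)))
```

Because nothing is modified in this stage, readers do not need to be excluded while it runs.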

Page 17: Replication Internals: The Life of a Write

[Diagram labels: OBSERVER, BATCH, BATCH, PREFETCH, OPLOG, BATCH, x.y, d.c]

APPLIER
• Assign entries to workers by target collection
• Disable schema constraints

Allow readers
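Assigning entries by target collection keeps every operation on one collection with a single worker, so their relative order is preserved. A small sketch under that reading; `assign_by_collection` is a hypothetical name, and handing out workers round-robin in order of first appearance is an assumption made to keep the example deterministic.

```python
from collections import defaultdict


def assign_by_collection(batch, n_workers):
    """Group entries so all entries for one namespace go to one worker."""
    queues = defaultdict(list)
    worker_of = {}  # namespace -> worker index
    for entry in batch:
        ns = entry["ns"]
        if ns not in worker_of:
            # Assumption: hand out workers round-robin by first appearance.
            worker_of[ns] = len(worker_of) % n_workers
        queues[worker_of[ns]].append(entry)
    return dict(queues)


batch = [
    {"ts": 4, "ns": "d.c"},
    {"ts": 5, "ns": "x.y"},
    {"ts": 6, "ns": "d.c"},
]
queues = assign_by_collection(batch, n_workers=2)
print({w: [e["ts"] for e in q] for w, q in queues.items()})
```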

Page 18: Replication Internals: The Life of a Write

[Diagram labels: OBSERVER, BATCH, BATCH, PREFETCH, OPLOG, BATCH, x.y, d.c]

APPLIER
• Concurrency control excludes readers
• Oplog entries applied in timestamp order

Exclude readers
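The two applier rules on this slide, readers kept out and entries applied in timestamp order, can be illustrated with a plain mutex. This is a single-threaded simplification: `Store` is a hypothetical class, and `threading.Lock` stands in for whatever concurrency control the server actually uses.

```python
import threading


class Store:
    def __init__(self):
        self._lock = threading.Lock()  # held by the applier to exclude readers
        self.docs = {}
        self.applied_ts = []

    def apply_batch(self, batch):
        with self._lock:
            # Apply strictly in timestamp order, whatever order the batch is in.
            for entry in sorted(batch, key=lambda e: e["ts"]):
                self.docs[(entry["ns"], entry["o"]["_id"])] = entry["o"]
                self.applied_ts.append(entry["ts"])

    def read(self, ns, _id):
        with self._lock:  # blocks while a batch is being applied
            return self.docs.get((ns, _id))


store = Store()
store.apply_batch([
    {"ts": 6, "ns": "x.y", "o": {"_id": 1}},
    {"ts": 4, "ns": "d.c", "o": {"_id": 10, "name": "john"}},
])
print(store.applied_ts, store.read("d.c", 10))
```

A reader never observes a half-applied batch, because `read` cannot take the lock until `apply_batch` releases it.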

Page 19: Replication Internals: The Life of a Write

[Diagram labels: OBSERVER, BATCH, BATCH, PREFETCH, OPLOG, BATCH, x.y, d.c, APPLIER]

Exclude readers
• Concurrency control excludes readers
• Oplog entries applied in timestamp order

Page 20: Replication Internals: The Life of a Write

[Diagram labels: OBSERVER, BATCH, BATCH, PREFETCH, OPLOG, BATCH, x.y, d.c, APPLIER]

Exclude readers
• Concurrency control excludes readers
• Oplog entries applied in timestamp order

Page 21: Replication Internals: The Life of a Write

[Diagram labels: OBSERVER, BATCH, BATCH, PREFETCH, APPLIER, BATCH, x.y, d.c, OPLOG]

• Readmit readers
• Move entries from batch to oplog
• Begin processing next batch

Allow readers

Page 22: Replication Internals: The Life of a Write

[Diagram labels: OPLOG, OBSERVER, BATCH, BATCH, PREFETCH, APPLIER, BATCH, x.y, d.c, d.c, OPLOG]

Allow readers

P TS:6

S1 TS:6

S2 TS:2

Page 23: Replication Internals: The Life of a Write

OPLOG

d.c

P TS:6

S1 TS:6

S2 TS:2

• Consults list of waiting clients
• Looks for those waiting for ts:6 or earlier on S1
• Sends acknowledgement!
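The final step, where the primary matches replication progress against waiting clients, might look like the sketch below. `acknowledge`, the waiter dictionaries and the `progress` map are hypothetical, and only numeric write concerns such as `w:2` are modelled; the node timestamps are the ones shown on this slide.

```python
def acknowledge(waiters, progress):
    """Split waiters into those that can be acknowledged and those that cannot.

    waiters:  list of {"client": name, "ts": timestamp of the write, "w": int}
    progress: node name -> last timestamp in that node's oplog (primary included)
    """
    acked, still_waiting = [], []
    for waiter in waiters:
        # Count the nodes whose oplog has reached this write's timestamp.
        have_it = sum(1 for ts in progress.values() if ts >= waiter["ts"])
        if have_it >= waiter["w"]:
            acked.append(waiter)
        else:
            still_waiting.append(waiter)
    return acked, still_waiting


progress = {"P": 6, "S1": 6, "S2": 2}
waiters = [
    {"client": "a", "ts": 6, "w": 2},  # satisfied by P and S1
    {"client": "b", "ts": 6, "w": 3},  # still needs S2
]
acked, still_waiting = acknowledge(waiters, progress)
print([w["client"] for w in acked], [w["client"] for w in still_waiting])
```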

Page 24: Replication Internals: The Life of a Write
Page 25: Replication Internals: The Life of a Write
Page 26: Replication Internals: The Life of a Write