Replication Internals: The Life of a Write
![Page 1: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/1.jpg)
Andy Schwerin, Lead Engineer, MongoDB
![Page 2: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/2.jpg)
• Goals of Replication
• Replication Architecture
• A representative write
![Page 3: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/3.jpg)
• High availability for processing reads and writes
  – Automatic leader election
• Support many network topologies
  – Tag sets
• Accessible consistency model
  – Ordered operation log
• Client can trade latency for durability
  – Write concern
![Page 4: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/4.jpg)
![Page 5: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/5.jpg)
{ ts: 4, op: "i", ns: "d.c", o: { _id: 10, name: "john" } }
OPLOG
…
![Page 6: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/6.jpg)
PRIMARY OPLOG: 4
SECONDARY OPLOG: 8 9
SECONDARY OPLOG: 4 5
When a secondary oplog is not a prefix of the primary oplog…
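The prefix condition named on this slide can be checked mechanically. The sketch below is only an illustration: it treats an oplog as a plain list of timestamps, and the function names and sample values are assumptions, not taken from the talk. It detects the condition only and says nothing about what the server does next.

```python
def common_prefix_length(primary_ts, secondary_ts):
    """Return how many leading entries the two oplogs share."""
    n = 0
    for p, s in zip(primary_ts, secondary_ts):
        if p != s:
            break
        n += 1
    return n


def is_prefix(primary_ts, secondary_ts):
    """True when every secondary entry matches the primary's, in order."""
    return common_prefix_length(primary_ts, secondary_ts) == len(secondary_ts)


# Hypothetical timestamps, not the values from the slide.
primary = [1, 2, 3, 4]
lagging = [1, 2, 3]        # a prefix: this secondary is merely behind
diverged = [1, 2, 8, 9]    # not a prefix: 8 and 9 are not in the primary oplog

print(is_prefix(primary, lagging))
print(is_prefix(primary, diverged))
print(common_prefix_length(primary, diverged))
```

The length of the common prefix marks the last point at which the two logs still agree.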
![Page 7: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/7.jpg)
w:?
![Page 8: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/8.jpg)
w:1
Could lose the write, without notification, when the primary disappears.
![Page 9: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/9.jpg)
w:majority
Over half of the nodes must fail to lose the write.
And if they do, an outside operator must intervene before new writes are accepted.
![Page 10: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/10.jpg)
w:all
All nodes have the write before the primary responds.
But writes cannot complete if any node is down.
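The trade-off across the three write concerns comes down to counting nodes. The following is a rough sketch of that arithmetic, using the labels from the slides (`1`, `majority`, `all`). The function names are made up for illustration, and the model deliberately ignores elections and whether a primary is available at all.

```python
def acks_required(w, node_count):
    """Number of nodes, primary included, that must hold the write."""
    if w == "majority":
        return node_count // 2 + 1
    if w == "all":
        return node_count
    return int(w)


def enough_nodes_up(w, node_count, nodes_down):
    """True when the surviving nodes could still satisfy the write concern.

    Assumption: only counts nodes; it does not model whether a primary
    can be elected.
    """
    return node_count - nodes_down >= acks_required(w, node_count)


for w in (1, "majority", "all"):
    print(w, acks_required(w, 3), enough_nodes_up(w, 3, nodes_down=1))
```

With three nodes and one down, `majority` can still be satisfied while `all` cannot, which is the limitation the `w:all` slide points out.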
![Page 11: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/11.jpg)
d.c.insert({_id: 10, name: 'john'}, {wC: {w: 2}})
[Diagram: OPLOG, d.c, OPLOG]
P TS:6, S1 TS:6, S2 TS:2
1. Fetch oplog entries
2. Apply to collections
3. Write to local oplog
4. Notify primary
5. Repeat
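The five steps above can be sketched as a small in-memory model. This is a simplification under stated assumptions: the `Node` class, `apply_entry`, and `sync_once` are hypothetical names, only inserts are modeled, and batching and networking are left out.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.oplog = []          # entries ordered by ts
        self.collections = {}    # ns -> {_id: document}
        self.progress = {}       # used on the primary: secondary name -> last ts reported

    def last_ts(self):
        return self.oplog[-1]["ts"] if self.oplog else 0


def apply_entry(node, entry):
    if entry["op"] == "i":       # only inserts are modeled in this sketch
        node.collections.setdefault(entry["ns"], {})[entry["o"]["_id"]] = entry["o"]


def sync_once(secondary, primary):
    # 1. Fetch oplog entries newer than what this secondary already has.
    batch = [e for e in primary.oplog if e["ts"] > secondary.last_ts()]
    for entry in batch:
        apply_entry(secondary, entry)     # 2. Apply to collections
        secondary.oplog.append(entry)     # 3. Write to local oplog
    # 4. Notify primary of how far this secondary has replicated.
    primary.progress[secondary.name] = secondary.last_ts()
    return len(batch)


primary = Node("P")
entry = {"ts": 4, "op": "i", "ns": "d.c", "o": {"_id": 10, "name": "john"}}
apply_entry(primary, entry)
primary.oplog.append(entry)

s1 = Node("S1")
while sync_once(s1, primary):             # 5. Repeat until nothing new arrives
    pass
print(s1.collections["d.c"][10], primary.progress)
```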
![Page 12: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/12.jpg)
d.c.insert({_id: 10, name: 'john'}, {wC: {w: 2}})
[Diagram: OPLOG, OBSERVER, BATCH, PREFETCH, APPLIER, collections x.y and d.c, OPLOG]
P TS:6, S1 TS:6, S2 TS:2
![Page 13: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/13.jpg)
d.c.insert({_id: 10, name: 'john'}, {wC: {w: 2}})
{ ts: 4, op: "i", ns: "d.c", o: { _id: 10, name: "john" } }
[Diagram: OPLOG, d.c]
P TS:4, S1 TS:2, S2 TS:2
![Page 14: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/14.jpg)
d.c.insert({_id: 10, name: 'john'}, {wC: {w: 2}})
[Diagram: OPLOG, OBSERVER, BATCH, d.c, OPLOG]
P TS:6, S1 TS:2, S2 TS:2
![Page 15: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/15.jpg)
d.c.insert({_id: 10, name: 'john'}, {wC: {w: 2}})
[Diagram: OPLOG, OBSERVER, BATCH, d.c, OPLOG]
P TS:6, S1 TS:2, S2 TS:2
![Page 16: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/16.jpg)
[Diagram: OBSERVER, BATCH, PREFETCH, OPLOG]
• Split batch into arbitrary work units
• Assign work to prefetch threads
• Entries processed in any order
• All while admitting readers
Allow readers
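One way to picture the prefetch stage described above is a thread pool working through slices of the batch. The sketch below assumes a hypothetical `touch` callback that stands in for whatever the prefetch stage does per entry, since the slides do not spell that out. The round-robin split is just one arbitrary choice of work units.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_units(batch, unit_count):
    """Deal entries round-robin into work units; any split would do."""
    return [batch[i::unit_count] for i in range(unit_count)]


def prefetch_unit(unit, touch):
    for entry in unit:
        touch(entry)   # stand-in for the real per-entry prefetch work
    return len(unit)


def prefetch(batch, touch, threads=4):
    """Run `touch` over every entry using a pool of threads; order is not preserved."""
    units = split_into_units(batch, threads)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return sum(pool.map(lambda unit: prefetch_unit(unit, touch), units))


batch = [{"ts": t, "ns": "d.c"} for t in range(1, 11)]
seen = []
count = prefetch(batch, lambda e: seen.append(e["ts"]))
print(count, sorted(seen))
```

Nothing here takes a lock against readers, matching the "all while admitting readers" bullet.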
![Page 17: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/17.jpg)
[Diagram: OBSERVER, BATCH, PREFETCH, OPLOG, APPLIER, collections x.y and d.c]
• Assign entries to workers by target collection
• Disable schema constraints
Allow readers
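Assigning entries by target collection might look like the following sketch. The function name and the slot-assignment rule are assumptions for illustration. The only property it aims to show is that every entry for a given namespace lands on the same worker, so their relative order within the batch is kept.

```python
def assign_by_collection(batch, worker_count):
    """Return one entry list per worker; a namespace always maps to the same worker."""
    workers = [[] for _ in range(worker_count)]
    slot_for_ns = {}
    for entry in batch:
        ns = entry["ns"]
        if ns not in slot_for_ns:
            # Hypothetical rule: hand out worker slots in order of first appearance.
            slot_for_ns[ns] = len(slot_for_ns) % worker_count
        workers[slot_for_ns[ns]].append(entry)
    return workers


batch = [
    {"ts": 3, "ns": "x.y"},
    {"ts": 4, "ns": "d.c"},
    {"ts": 5, "ns": "x.y"},
    {"ts": 6, "ns": "d.c"},
]
for i, work in enumerate(assign_by_collection(batch, 2)):
    print(i, [(e["ns"], e["ts"]) for e in work])
```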
![Page 18: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/18.jpg)
[Diagram: OBSERVER, BATCH, PREFETCH, OPLOG, APPLIER, collections x.y and d.c]
• Concurrency control excludes readers
• Oplog entries applied in timestamp order
Exclude readers
![Page 19: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/19.jpg)
![Page 20: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/20.jpg)
![Page 21: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/21.jpg)
[Diagram: OBSERVER, BATCH, PREFETCH, APPLIER, collections x.y and d.c, OPLOG]
• Readmit readers
• Move entries from batch to oplog
• Begin processing next batch
Allow readers
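The applier phase (exclude readers, apply in timestamp order, readmit readers, move the batch to the oplog) can be sketched with a single lock. This is a rough model, not the server's concurrency control: the `Secondary` class is hypothetical, only inserts are handled, and a plain `threading.Lock` also makes readers exclude each other, which a real reader/writer scheme would avoid.

```python
import threading


class Secondary:
    def __init__(self):
        self.collections = {}           # ns -> {_id: document}
        self.oplog = []
        self.lock = threading.Lock()    # stand-in for the real concurrency control

    def read(self, ns, _id):
        with self.lock:                 # readers wait while a batch is being applied
            return self.collections.get(ns, {}).get(_id)

    def apply_batch(self, batch):
        ordered = sorted(batch, key=lambda e: e["ts"])
        with self.lock:                 # exclude readers
            for entry in ordered:       # apply in timestamp order
                if entry["op"] == "i":
                    docs = self.collections.setdefault(entry["ns"], {})
                    docs[entry["o"]["_id"]] = entry["o"]
        # Lock released here: readers are readmitted.
        self.oplog.extend(ordered)      # move entries from batch to oplog
        batch.clear()                   # the batch is empty, ready for the next one


s = Secondary()
batch = [
    {"ts": 5, "op": "i", "ns": "x.y", "o": {"_id": 1}},
    {"ts": 4, "op": "i", "ns": "d.c", "o": {"_id": 10, "name": "john"}},
]
s.apply_batch(batch)
print([e["ts"] for e in s.oplog], s.read("d.c", 10))
```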
![Page 22: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/22.jpg)
[Diagram: OPLOG, OBSERVER, BATCH, PREFETCH, APPLIER, collections x.y and d.c, OPLOG]
Allow readers
P TS:6, S1 TS:6, S2 TS:2
![Page 23: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/23.jpg)
[Diagram: OPLOG, d.c]
P TS:6, S1 TS:6, S2 TS:2
• Consults list of waiting clients
• Looks for those waiting for ts:6 or earlier on S1
• Sends acknowledgement!
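The acknowledgement step can be modeled as a scan over waiting clients each time a node reports progress. In this sketch the data shapes (`progress`, `waiting`) and the function name are assumptions. The timestamps mirror the slide, and `w` counts the primary itself as one of the nodes holding the write.

```python
def on_progress(progress, waiting, node, ts):
    """Record that `node` has replicated through `ts`; return the clients to acknowledge."""
    progress[node] = ts
    done, still_waiting = [], []
    for waiter in waiting:
        # Count the nodes whose oplog has reached this client's write.
        holders = sum(1 for reached in progress.values() if reached >= waiter["ts"])
        (done if holders >= waiter["w"] else still_waiting).append(waiter)
    waiting[:] = still_waiting
    return [waiter["client"] for waiter in done]


progress = {"P": 6, "S1": 2, "S2": 2}
waiting = [
    {"client": "c1", "ts": 6, "w": 2},   # the w:2 insert from the example
    {"client": "c2", "ts": 6, "w": 3},   # hypothetical second client wanting all three nodes
]
print(on_progress(progress, waiting, "S1", 6))
print([waiter["client"] for waiter in waiting])
```

Once S1 reports ts 6, two nodes hold the write, so the `w: 2` client is acknowledged while the `w: 3` client keeps waiting for S2.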
![Page 24: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/24.jpg)
![Page 25: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/25.jpg)
![Page 26: Replication Internals: The Life of a Write](https://reader033.vdocuments.mx/reader033/viewer/2022052503/540d9a108d7f72747e8b4a98/html5/thumbnails/26.jpg)