1

Tashkent: Uniting Durability & Ordering in Replicated Databases

Sameh Elnikety, EPFL

Steven Dropsho, EPFL

Fernando Pedone, USI

2

Write-Many Replicated Database

• All replicas agree on
  – which update transactions commit
  – their commit order

• Total order
  – Determined by the middleware
  – Followed by each replica

[Figure: Replicas 1, 2, and 3, each with its own durability component; clients submit Tx A and Tx B.]

3

Order Determined Outside DB

[Figure: Tx A and Tx B reach the replication middleware (global ordering), which fixes the commit order A B. Replicas 1, 2, and 3 each apply A then B, and each replica provides its own durability.]

4

Enforce External Commit Order

[Figure: one replica. The middleware fixes the commit order A B. The proxy runs Task A and Task B against the database through the SQL interface; the database, which handles durability, is shown with the order B A.]

Cannot commit A & B concurrently!

5

Enforce Order = Serial Commit

[Figure: same replica. The middleware commit order is A B. The proxy submits Task A and Task B through the SQL interface one after the other, and the database commits them in the order A B.]

6

Commit Serialization is Slow

[Figure: timeline for middleware commit order A B C. The proxy sends Commit A, the database log becomes "Durability: A", and Ack A returns; then Commit B ("Durability: A B") and Ack B; then Commit C ("Durability: A B C") and Ack C.]

Root cause: durability & ordering separated → serial disk writes
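A back-of-the-envelope model shows why serial disk writes dominate. The sketch below is illustrative only: the 8 ms flush time and the function names are assumptions, not figures or code from the talk.

```python
def serial_commit_time(num_tx: int, flush_ms: float) -> float:
    """Each commit waits for its own synchronous log flush."""
    return num_tx * flush_ms


def group_commit_time(num_tx: int, flush_ms: float, group_size: int) -> float:
    """Commits in the same group share a single log flush."""
    num_flushes = -(-num_tx // group_size)  # ceiling division
    return num_flushes * flush_ms


FLUSH_MS = 8.0  # assumed cost of one synchronous disk write

serial_ms = serial_commit_time(3, FLUSH_MS)                 # A, B, C one by one
grouped_ms = group_commit_time(3, FLUSH_MS, group_size=3)   # A, B, C together
print(serial_ms, grouped_ms)
```

Under these assumptions, committing A, B, and C serially costs three flushes, while a single group costs one.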

7

Solution: Unite Durability & Ordering

1- Pass order info to DB
[Figure: the middleware (ordering) passes the order to each replica; the replicas keep durability.]

2- Move durability to MW
[Figure: the middleware provides ordering and durability; durability is OFF at each replica.]

8

1- Unite Dur. & Ord. in Database

[Figure: commit order A B C. The proxy sends "Commit A at 1", "Commit B at 2", "Commit C at 3", passing the order to the database; the database log becomes "Durability: A B C" and Ack A, Ack B, Ack C return.]

Solution 1: pass order info to DB
Durability & ordering in database → group commit
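The "Commit A at 1, Commit B at 2, Commit C at 3" idea can be sketched as a buffer that accepts commits in any arrival order and flushes them together in the required order. This is a minimal sketch, not the Tashkent-API interface: the class, its methods, and the in-memory list standing in for the disk log are all hypothetical.

```python
class OrderedCommitBuffer:
    """Collects commits tagged with a sequence number and flushes them in order."""

    def __init__(self) -> None:
        self.next_seq = 1            # next sequence number allowed to commit
        self.pending: dict[int, str] = {}
        self.durable_log: list[str] = []   # stands in for the on-disk log
        self.flush_count = 0         # number of simulated disk writes

    def commit(self, tx: str, seq: int) -> None:
        """Register a commit request; requests may arrive out of order."""
        self.pending[seq] = tx

    def flush(self) -> list[str]:
        """Write the longest in-order run of pending commits in one batch."""
        batch = []
        while self.next_seq in self.pending:
            batch.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        if batch:
            self.durable_log.extend(batch)  # one simulated disk write
            self.flush_count += 1
        return batch                        # transactions that can be acked


buf = OrderedCommitBuffer()
buf.commit("B", 2)
buf.commit("C", 3)
buf.commit("A", 1)
acked = buf.flush()
print(acked, buf.flush_count)
```

Because the order travels with each commit request, the three commits can be submitted concurrently and still reach the log as A B C in a single write.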

9

Solution: Unite Durability & Ordering

1- Pass order info to DB
[Figure: the middleware (ordering) passes the order to each replica; the replicas keep durability.]

2- Move durability to MW
[Figure: the middleware provides ordering and durability; durability is OFF at each replica.]

10

2- Unite D. & O. in Middleware

[Figure: commit order A B C. The middleware log becomes "Durability: A B C"; the proxy sends Commit A, Commit B, Commit C to the database, whose durability is OFF, and receives Ack A, Ack B, Ack C.]

Solution 2: move durability to MW
Durability & ordering in middleware → group commit

11

Roadmap

• Durability & ordering
  – Separated → serial commit → slow
  – United → group commit → fast

• Two implementations
  – Tashkent-API: united in DB
  – Tashkent-MW: united in MW

• Tashkent-MW
  – Implementation
  – Recovery
  – Performance

12

Tashkent-MW

[Figure: Tx A, Tx B, and Tx C reach the replication middleware (global ordering), which fixes the order A B C and provides durability. Replicas 1, 2, and 3 each apply A B C with durability OFF.]

13

Tashkent-MW: Durability & Ordering in Middleware

• Middleware logs tx effects
  – Durability of update tx
    • Guaranteed in middleware
    • Turned off at database

• Middleware performs durability & ordering
  – United → group commit → fast

• Database commits update tx serially
  – Commit = quick main memory operation
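The division of labor on this slide can be sketched as follows: the middleware appends the effects of a whole group of ordered transactions to its own log with one synchronous flush, and the database then applies them one by one in memory. This is a minimal sketch; the log format (JSON lines), the class and function names, and the dictionary standing in for a replica are assumptions, not details from the talk.

```python
import json
import os
import tempfile


class MiddlewareLog:
    """Append-only log of transaction effects kept by the middleware."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.fsync_count = 0

    def append_group(self, ordered_txs: list) -> None:
        """Log a group of (seq, writes) pairs with a single fsync."""
        with open(self.path, "a", encoding="utf-8") as f:
            for seq, writes in ordered_txs:
                f.write(json.dumps({"seq": seq, "writes": writes}) + "\n")
            f.flush()
            os.fsync(f.fileno())   # one disk write for the whole group
        self.fsync_count += 1


def apply_serially(db: dict, ordered_txs: list) -> None:
    """Database side: commit in order, in memory only (durability off)."""
    for _seq, writes in ordered_txs:
        db.update(writes)


log = MiddlewareLog(os.path.join(tempfile.mkdtemp(), "mw.log"))
group = [(1, {"x": 1}), (2, {"y": 2}), (3, {"x": 3})]   # A, B, C
log.append_group(group)

replica = {}
apply_serially(replica, group)
print(replica, log.fsync_count)
```

The serial step at the database touches only memory, so the one disk write per group at the middleware is the only synchronous I/O on the commit path in this sketch.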

14

Recovery in Tashkent-MW

[Figure: replication middleware (global ordering) with durability; Replicas 1, 2, and 3 run with durability OFF.]

15

Standard Database I/O

[Figure: database with data and log in memory and on disk. Labels: Tx A, Crash!, A bad, DB recovery.]

Log flushed for
1- Durability
2- Allowing dirty data pages to be cleaned { physical integrity }

16

Database I/O with Durability=off

[Figure: database with data and log in memory and on disk, with durability off. Labels: Tx A, Durability A, Crash!, A bad, DB recovery.]

Simple solution: recover from a data dump (checkpoint)
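One way to picture recovery from a data dump is to restore the checkpoint and then reapply the transaction effects that the middleware logged after it. The sketch below rests on assumptions the slides do not spell out: that the dump records the sequence number of the last transaction it contains, that the middleware log is a list of JSON lines in commit order, and that replaying it is how the replica catches up. All names are hypothetical.

```python
import json


def recover(checkpoint: dict, checkpoint_seq: int, log_lines: list) -> tuple:
    """Rebuild replica state from a dump plus the middleware log.

    checkpoint      -- database contents captured by the dump
    checkpoint_seq  -- sequence number of the last tx included in the dump (assumed)
    log_lines       -- middleware log entries, in commit order
    """
    db = dict(checkpoint)
    last_seq = checkpoint_seq
    for line in log_lines:
        entry = json.loads(line)
        if entry["seq"] <= last_seq:
            continue                     # already reflected in the dump
        db.update(entry["writes"])       # reapply the logged effects
        last_seq = entry["seq"]
    return db, last_seq


mw_log = [
    json.dumps({"seq": 1, "writes": {"x": 1}}),
    json.dumps({"seq": 2, "writes": {"y": 2}}),
    json.dumps({"seq": 3, "writes": {"x": 3}}),
]
state, seq = recover({"x": 1}, 1, mw_log)
print(state, seq)
```

Skipping entries at or below the checkpoint's sequence number keeps the replay from reapplying a transaction the dump already contains.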

17



Roadmap

• Tashkent-MW
  – Implementation
  – Recovery
  – Performance

18

Performance - Setup

• Metrics:
  – Throughput
  – Response time

• Workload:
  – AllUpdates: tx = { 1 update }, mix = 100% updates
  – TPC-B: tx = { 4 updates, 1 read }, mix = 100% updates
  – TPC-W: mix of long & short txs

• System configuration:
  – Linux cluster running PostgreSQL

19

AllUpdates Throughput

[Figure: throughput (tps, 0 to 800) vs. number of replicas (1 to 15). Series: Separated.]

20


[Figure: AllUpdates throughput (tps, 0 to 800) vs. number of replicas (1 to 15). Series: Separated, Standalone.]

21


[Figure: AllUpdates throughput (tps, 0 to 4000) vs. number of replicas (1 to 15). Series: Tashkent-MW, Separated, Standalone.]

22

AllUpdates Response Time

[Figure: response time (msec, 0 to 200) vs. number of replicas (1 to 15). Series: Separated, Tashkent-MW, Standalone.]

23

In the Paper

• Design & implementation
  – Tashkent-API

• Performance results
  – TPC-B & TPC-W
  – Recovery times
  – Other I/O subsystems

24

Conclusions



• Tashkent-MW system
  – Pure middleware replication
  – Significant performance improvement

27

Concurrency Control

• Generalized Snapshot Isolation (GSI)

• Conclusions valid whenever replicas agree
  1- on which update transactions commit
  2- on their commit order

• Example (bank database)
  – T1: set balance = $1000
  – T2: set balance = $2000
  – Replica 1: sees T1 then T2 → balance = $2000
  – Replica 2: sees T2 then T1 → balance = $1000
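The bank example can be replayed in a few lines. This is an illustrative sketch in which each transaction is reduced to the balance it sets; the function name is hypothetical.

```python
def apply_in_order(updates: list) -> int:
    """Apply blind writes to the balance one after another; the last write wins."""
    balance = 0
    for new_balance in updates:
        balance = new_balance
    return balance


T1 = 1000  # T1: set balance = $1000
T2 = 2000  # T2: set balance = $2000

replica1 = apply_in_order([T1, T2])  # sees T1 then T2
replica2 = apply_in_order([T2, T1])  # sees T2 then T1
print(replica1, replica2)
```

Both replicas commit the same two transactions, yet they end with different balances, which is why agreeing on the commit order matters as much as agreeing on which transactions commit.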

28

Durability and Ordering 1/2

[Figure: Replica 1 (proxy and database) sends T4 and T9 to the certifier. Certifier log: T4 T9. DB1 log: T4 T9.]

Scalability problem: one write per transaction.

29

Durability and Ordering 2/2

[Figure: Replica 1 and Replica 2 (each a proxy and a database) send transactions to the certifier: T4 and T9 from Replica 1, T3 and T8 from Replica 2, and further Ti's from other replicas. Certifier log: T1 T2 T3 T4 T5 T6 T7 T8 T9. DB1 log: T1, T2, T3 | T4 | T5, T6, T7, T8 | T9. Callout: one disk write.]

Scalability problem: two writes per transaction.

30

AllUpdates 1-Replica Throughput

[Figure: bar chart of throughput (tps, 0 to 600) for Standalone, Base, Tashkent-MW, Tashkent-API, and Tashkent-API no Cert.]

Low replication overhead: 1-replica system == standalone DB

31

AllUpdates Response Time

[Figure: response time (msec, 0 to 200) vs. number of replicas (1 to 15). Series: Separated, Tashkent-MW, Standalone.]

32

TPC-B Throughput

[Figure: throughput (tps, 0 to 800) vs. number of replicas (1 to 15). Series: Tashkent-United, Tashkent-Sep, Standalone.]

Low replication overhead: 1-replica system == standalone DB. Performance scales with multiple replicas.
