Post on 20-Dec-2015
1
Tashkent: Uniting Durability & Ordering in Replicated Databases
Sameh Elnikety, EPFL
Steven Dropsho, EPFL
Fernando Pedone, USI
2
Write-Many Replicated Database
• All replicas agree on
  – which update tx commit
  – their commit order
• Total order
  – Determined by middleware
  – Followed by each replica
[Diagram: Tx A and Tx B are submitted to Replicas 1, 2, and 3; each replica has its own durability]
3
Order Determined Outside DB
[Diagram: Tx A and Tx B each originate at one replica; the replication MW (global ordering) fixes the order A B; Replicas 1, 2, and 3, each with its own durability, all apply A B]
4
Enforce External Commit Order
[Diagram: the middleware fixes the commit order A B; at the replica, the proxy runs Task A and Task B and submits Tx A and Tx B through the SQL interface to the database (durability on), where the commits could land as B A]
Cannot commit A & B concurrently! Must serialize.
5
Enforce Order = Serial Commit
[Diagram: the middleware fixes the commit order A B; at the replica, the proxy runs Task A and Task B and submits the commits through the SQL interface one at a time, so the database (durability on) commits A, then B]
Serialization is slow.
6
Commit Serialization is Slow
[Diagram: middleware order A B C; the proxy issues Commit A, Commit B, and Commit C one at a time; the database timeline alternates CPU work with a durability write, making A, then A B, then A B C durable, and returns Ack A, Ack B, and Ack C in turn]
Root cause: durability & ordering are separated, which forces serial disk writes
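The cost gap named on this slide can be illustrated with a toy model. The sketch below is not the authors' code: `FakeDisk`, `serial_commit`, and `group_commit` are hypothetical names, and the "disk" only counts synchronous flushes instead of performing real I/O.

```python
class FakeDisk:
    """Counts synchronous flushes; a stand-in for a real log device (assumption)."""

    def __init__(self):
        self.flushes = 0
        self.records = []

    def flush(self, batch):
        # A single flush makes every record in the batch durable.
        self.records.extend(batch)
        self.flushes += 1


def serial_commit(disk, order):
    """Separated durability and ordering: one flush per commit, in order."""
    for tx in order:
        disk.flush([tx])


def group_commit(disk, order):
    """United durability and ordering: one flush covers the ordered batch."""
    disk.flush(list(order))


serial_disk, group_disk = FakeDisk(), FakeDisk()
serial_commit(serial_disk, ["A", "B", "C"])
group_commit(group_disk, ["A", "B", "C"])
print(serial_disk.flushes, group_disk.flushes)
```

Both disks end up with the same log contents in the same order; only the number of flushes differs.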
7
Solution: Unite Durability & Ordering
1- Pass order info to DB
[Diagram: the middleware (ordering) passes the order to each replica; durability stays in each replica's database]
2- Move durability to MW
[Diagram: the middleware handles both ordering and durability; durability is OFF at each replica]
8
1- Unite Dur. & Ord. in Database
[Diagram: middleware order A B C; the proxy passes the order along with the commits (Commit A at 1, Commit B at 2, Commit C at 3); the database makes A B C durable together and returns Ack A, Ack B, Ack C]
Solution 1: pass order info to DB. Durability & ordering are united in the database, which allows group commit.
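One way to picture "Commit A at 1, Commit B at 2, Commit C at 3" is a commit call that carries a sequence number. The following is a minimal sketch under that assumption; `OrderedCommitLog` and `commit_at` are hypothetical names for illustration, not the Tashkent-API interface or any real database API.

```python
class OrderedCommitLog:
    """Hypothetical database-side commit path that accepts a global sequence number."""

    def __init__(self):
        self.pending = {}   # seq -> tx, commits received but not yet durable
        self.next_seq = 1   # next sequence number allowed to become durable
        self.durable = []   # durable log, in commit order
        self.flushes = 0    # number of simulated disk writes

    def commit_at(self, tx, seq):
        # Commits may be submitted concurrently and arrive in any order.
        self.pending[seq] = tx

    def flush(self):
        # Collect the longest contiguous run starting at next_seq and
        # make it durable with a single simulated disk write.
        batch = []
        while self.next_seq in self.pending:
            batch.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        if batch:
            self.durable.extend(batch)
            self.flushes += 1
        return batch  # transactions that can now be acknowledged


log = OrderedCommitLog()
log.commit_at("B", 2)
first = log.flush()      # B cannot become durable before A
log.commit_at("C", 3)
log.commit_at("A", 1)
acked = log.flush()      # A, B, C become durable together, in order
print(first, acked, log.flushes)
```

Because the database knows the required order, it can accept the three commits concurrently and still write them in order with one flush.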
9
Solution: Unite Durability & Ordering
1- Pass order info to DB
[Diagram: the middleware (ordering) passes the order to each replica; durability stays in each replica's database]
2- Move durability to MW
[Diagram: the middleware handles both ordering and durability; durability is OFF at each replica]
10
2- Unite D. & O. in Middleware
[Diagram: middleware order A B C; the middleware makes A B C durable; the proxy issues Commit A, Commit B, and Commit C to the database, whose durability is OFF, and receives Ack A, Ack B, and Ack C]
Solution 2: move durability to MW. Durability & ordering are united in the middleware, which allows group commit.
11
Roadmap
• Durability & ordering
  – Separated: serial commit, slow
  – United: group commit, fast
• Two implementations
  – Tashkent-API: united in DB
  – Tashkent-MW: united in MW
• Tashkent-MW
  – Implementation
  – Recovery
  – Performance
12
Tashkent-MW
[Diagram: Tx A, Tx B, and Tx C each originate at one replica; the replication MW (global ordering) holds durability and the order A B C; Replicas 1, 2, and 3 run with durability OFF and all apply A B C]
13
Tashkent-MW: Durability & Ordering in Middleware
• Middleware logs tx effects
  – Durability of update tx
    • Guaranteed in middleware
    • Turn durability off at database
• Middleware performs durability & ordering
  – United: group commit, fast
• Database commits update tx serially
  – Commit = quick main memory operation
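The division of work described above can be sketched as follows. This is an illustrative model only: `MiddlewareLog`, `Replica`, and `commit_batch` are hypothetical names, writesets are plain dictionaries, and the flush is simulated by a counter.

```python
class MiddlewareLog:
    """Hypothetical middleware log holding the effects (writesets) of update txs."""

    def __init__(self):
        self.buffer = []    # appended but not yet durable
        self.durable = []   # durable entries, in global order
        self.flushes = 0

    def append(self, tx_id, writeset):
        self.buffer.append((tx_id, writeset))

    def flush(self):
        # One simulated disk write makes the whole ordered batch durable.
        self.durable.extend(self.buffer)
        self.buffer = []
        self.flushes += 1


class Replica:
    """In-memory database with durability turned off (assumption of this sketch)."""

    def __init__(self):
        self.data = {}
        self.version = 0

    def apply(self, writeset):
        # Commit is a quick main-memory operation: no disk write here.
        self.data.update(writeset)
        self.version += 1


def commit_batch(log, replicas, batch):
    for tx_id, writeset in batch:
        log.append(tx_id, writeset)
    log.flush()  # durability point, reached once for the whole batch
    for _tx_id, writeset in batch:
        for replica in replicas:
            replica.apply(writeset)  # serial, in the global order


log = MiddlewareLog()
replicas = [Replica(), Replica(), Replica()]
batch = [("A", {"x": 1}), ("B", {"y": 2}), ("C", {"x": 3})]
commit_batch(log, replicas, batch)
print(log.flushes, replicas[0].data)
```

Each replica still commits A, B, and C one after another, but the only disk write in the path is the single flush in the middleware.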
14
Recovery in Tashkent-MW
[Diagram: the replication MW (global ordering) holds durability; Replicas 1, 2, and 3 all run with durability OFF]
15
Standard Database I/O
[Diagram: database with Data and Log both in memory and on disk; Tx A updates A, then Crash!, then DB recovery; labels include "A bad"]
Log flushed for
1- Durability
2- Allow cleaning dirty data pages: { physical integrity }
16
Database I/O with Durability=off
[Diagram: database with Data and Log both in memory and on disk; Tx A updates A, then Crash!, then DB recovery; middleware order A B C, with A durable in the middleware; labels include "A bad"]
Simple Solution: Recover from a data dump (checkpoint)
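A minimal sketch of the simple solution, assuming the dump records the sequence number of the last update it contains and that the middleware log is an ordered list of `(sequence number, writeset)` pairs. The function name `recover` and both data layouts are assumptions made for illustration; the slides do not specify them.

```python
def recover(dump_data, dump_version, middleware_log):
    """Rebuild a replica from a data dump plus the durable middleware log.

    dump_data:      dict holding the database state captured by the dump
    dump_version:   sequence number of the last update included in the dump
    middleware_log: ordered list of (seq, writeset) pairs (assumed layout)
    """
    data = dict(dump_data)
    version = dump_version
    for seq, writeset in middleware_log:
        if seq > version:  # skip updates the dump already contains
            data.update(writeset)
            version = seq
    return data, version


# The dump was taken after update 1; updates 2 and 3 exist only in the middleware log.
dump = {"A": 10}
mw_log = [(1, {"A": 10}), (2, {"B": 20}), (3, {"A": 30})]
state, version = recover(dump, 1, mw_log)
print(state, version)
```

Nothing is read from the crashed replica's own files, so the state of its pages at crash time does not matter in this sketch.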
17
Roadmap
• Durability & ordering
  – Separated: serial commit, slow
  – United: group commit, fast
• Two implementations
  – Tashkent-API: united in DB
  – Tashkent-MW: united in MW
• Tashkent-MW
  – Implementation
  – Recovery
  – Performance
18
Performance - Setup
• Metrics:
  – Throughput
  – Response time
• Workload:
  – AllUpdates: tx = {1 update}, mix = 100% updates
  – TPC-B: tx = {4 updates, 1 read}, mix = 100% updates
  – TPC-W: mix of long & short txs
• System configuration:
  – Linux cluster running PostgreSQL
19
AllUpdates Throughput
[Chart: throughput (tps, axis 0 to 800) vs. number of replicas (1 to 15); series: Separated]
20
AllUpdates Throughput
[Chart: throughput (tps, axis 0 to 800) vs. number of replicas (1 to 15); series: Separated, Standalone]
21
AllUpdates Throughput
[Chart: throughput (tps, axis 0 to 4000) vs. number of replicas (1 to 15); series: Tashkent-MW, Separated, Standalone]
22
AllUpdates Response Time
[Chart: response time (msec, axis 0 to 200) vs. number of replicas (1 to 15); series: Separated, Tashkent-MW, Standalone]
23
In the Paper
• Design & implementation
  – Tashkent-API
• Performance results
  – TPC-B & TPC-W
  – Recovery times
  – Another I/O subsystem
24
Conclusions
• Durability & ordering
  – Separated: serial commit, slow
  – United: group commit, fast
• Two implementations
  – Tashkent-API: united in DB
  – Tashkent-MW: united in MW
• Tashkent-MW system
  – Pure middleware replication
  – Significant performance improvement
27
Concurrency Control
• Generalized Snapshot Isolation (GSI)
• Conclusions valid whenever replicas agree
  1- on which update transactions commit
  2- on their commit order
• Example (bank database)
  – T1: set balance = $1000
  – T2: set balance = $2000
  – Replica1: sees T1 then T2, balance = $2000
  – Replica2: sees T2 then T1, balance = $1000
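The bank example can be checked with a few lines of code. In this sketch each transaction is reduced to the value it writes, and `apply_in_order` is a hypothetical helper, not part of any system described in the talk.

```python
def apply_in_order(order):
    """Apply blind 'set balance' writes in the given order; return the final balance."""
    balance = 0
    for new_balance in order:
        balance = new_balance  # each transaction overwrites the balance
    return balance


T1, T2 = 1000, 2000
replica1 = apply_in_order([T1, T2])  # sees T1 then T2
replica2 = apply_in_order([T2, T1])  # sees T2 then T1
print(replica1, replica2)
```

The two replicas commit the same transactions yet end with different balances, which is why agreement on the commit order is required as well.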
28
Durability and Ordering 1/2
[Diagram: Replica 1 with a proxy and a database, plus the certifier; transactions T4 and T9; Cert. Log: T4 T9; DB1 Log: T4 T9]
Scalability problem: one write per trans.
29
Durability and Ordering 2/2
[Diagram: Replica 1 and Replica 2, each with a proxy and a database, plus the certifier; transactions T4 and T9 at one replica, T3 and T8 at the other, and further Ti's; Cert. Log: T1 T2 T3 T4 T5 T6 T7 T8 T9, labeled "One disk write"; DB1 Log: T1 T2 T3 T4 T5 T6 T7 T8 T9, also shown as T1,T2,T3 T4 T5,T6,T7,T8 T9]
Scalability problem: two writes per trans.
30
AllUpdates 1-Replica Throughput
[Chart: throughput (tps, axis 0 to 600) for Standalone, Base, Tashkent-MW, Tashkent-API, and Tashkent-API no Cert]
Low replication overhead: 1-replica == standalone DB
31
AllUpdates Response Time
[Chart: response time (msec, axis 0 to 200) vs. number of replicas (1 to 15); series: Separated, Tashkent-MW, Standalone]