gt 4420/6422 // spring 2019 // @joy arulraj lecture...
TRANSCRIPT
DATABASE SYSTEM IMPLEMENTATION
GT 4420/6422 // SPRING 2019 // @JOY_ARULRAJ
LECTURE #20: MULTI-VERSION CONCURRENCY CONTROL (PART II)
ANATOMY OF A DATABASE SYSTEM
Connection Manager + Admission Control
Query Parser
Query Optimizer
Query Executor
Lock Manager (Concurrency Control)
Access Methods (or Indexes)
Buffer Pool Manager
Log Manager
Memory Manager + Disk Manager
Networking Manager
2
QueryTransactional
Storage Manager
Query Processor
Shared Utilities
Process Manager
Source: Anatomy of a Database System
TODAY'S AGENDA
Microsoft Hekaton (SQL Server)TUM HyPerCMU Cicada
3
MICROSOFT HEKATON
Incubator project started in 2008 to create new OLTP engine for MSFT SQL Server (MSSQL).→ Led by DB ballers Paul Larson and Mike Zwilling
Had to integrate with MSSQL ecosystem.Had to support all possible OLTP workloads with predictable performance.→ Single-threaded partitioning (e.g., H-Store) works well
for some applications but terrible for others.
4
HEKATON MVCC
Each txn is assigned a timestamp when they begin(BeginTS) and when they commit (EndTS).Each tuple contains two timestamps that represents their visibility and current state:→ BEGIN: The BeginTS of the active txn or the EndTS of
the committed txn that created it.→ END: The BeginTS of the active txn that created the next
version or infinity or the EndTS of the committed txnthat created it.
5
HIGH-PERFORMANCE CONCURRENCY CONTROL MECHANISMS FOR MAIN-MEMORY DATABASESVLDB 2011
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
6
10 20 John $100
20 John $110∞
INDEX
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
7
10 20 John $100
20 John $110∞
BEGIN @ 25 INDEX
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
8
10 20 John $100
20 John $110∞
BEGIN @ 25 INDEX
Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
9
10 20 John $100
20 John $110∞
BEGIN @ 25 INDEX
Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
10
10 20 John $100
20 John $110∞
BEGIN @ 25 INDEX
Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
11
10 20 John $100
20 John $110∞
BEGIN @ 25 INDEX
Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
12
10 20 John $100
20 John $110∞
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
13
10 20 John $100
20 John $110∞
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
14
10 20 John $100
20 John $110∞
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
15
10 20 John $100
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
16
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
17
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
18
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
COMMIT @ 35
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
19
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
COMMIT @ 35
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
20
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
COMMIT @ 35
35
35
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
21
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
COMMIT @ 35
35
35
REWIND
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
22
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
23
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN @ 30
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
24
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN @ 30Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
25
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN @ 30Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
26
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN @ 30Read "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
27
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN @ 30Read "John"Update "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
28
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN @ 30Read "John"Update "John"
BEGIN END POINTER ATTR1 ATTR2
HEKATON: OPERATIONS
29
10 20 John $100
Txn25 ∞ John $130
20 John $110∞Txn25
BEGIN @ 25 INDEX
Update "John"Read "John"
BEGIN @ 30Read "John"Update "John"
HEKATON: TRANSACTION STATE MAP
Global map of all txns’ states in the system:→ ACTIVE: The txn is executing read/write operations.→ VALIDATING: The txn has invoked commit and the
DBMS is checking whether it is valid.→ COMMITTED: The txn is finished, but may have not
updated its versions’ TS.→ TERMINATED: The txn has updated the TS for all of
the versions that it created.
30
HEKATON: TRANSACTION META-DATA
Read Set→ Pointers to every version read.Write Set→ Pointers to versions updated (old and new), versions
deleted (old), and version inserted (new).Scan Set→ Stores enough information needed to perform each scan
operation.Commit Dependencies→ List of txns that are waiting for this txn to finish.
31
HEKATON: TRANSACTION VALIDATION
Read Stability→ Check that each version read is still visible as of the end
of the txn.Phantom Avoidance→ Repeat each scan to check whether new versions have
become visible since the txn began.Extent of validation depends on isolation level:→ SERIALIZABLE: Read Stability + Phantom Avoidance→ REPEATABLE READS: Read Stability→ SNAPSHOT ISOLATION: None→ READ COMMITTED: None
32
HEKATON: OPTIMISTIC VS. PESSIMISTIC
Optimistic Txns:→ Check whether a version read is still visible at the end of
the txn.→ Repeat all index scans to check for phantoms.
Pessimistic Txns:→ Use shared & exclusive locks on records and buckets.→ No validation is needed.→ Separate background thread to detect deadlocks.
33
HEKATON: OPTIMISTIC VS. PESSIMISTIC
34
0
0.5
1
1.5
2
0 6 12 18 24
Thro
ughp
ut (t
xn/s
ec)
Mill
ions
# Threads
Optimistic Pessimistic
Source: Paul Larson
Database: Single table with 1000 tuplesWorkload: 80% read-only txns + 20% update txns
Processor: 2 sockets, 12 cores
HEKATON: LESSONS
Use only lock-free data structures→ No latches, spin locks, or critical sections→ Indexes, txn map, memory alloc, garbage collector→ We already discussed about Bw-Tree.
Only one single serialization point in the DBMS to get the txn’s begin and commit timestamp→ Atomic Addition (CAS)
35
OBSERVATIONS
Read/scan set validations are expensive if the txnsaccess a lot of data.
Appending new versions hurts the performance of OLAP scans due to pointer chasing & branching.
Record-level conflict checks may be too coarse-grained and incur false positives.
36
HYPER MVCC
Column-store with delta record versioning.→ In-Place updates for non-indexed attributes→ Delete/Insert updates for indexed attributes.→ Newest-to-Oldest Version Chains→ No Predicate Locks / No Scan ChecksAvoids write-write conflicts by aborting txns that try to update an uncommitted object.
Designed for HTAP workloads.
37
FAST SERIALIZABLE MULTI-VERSION CONCURRENCY CONTROL FOR MAIN-MEMORY DATABASE SYSTEMSSIGMOD 2015
HYPER: STORAGE ARCHITECTURE
38
Delta Storage (Per Txn)Main Data Table
ATTR1
Tupac
IceT
B.I.G
DrDre
ATTR2
$100
$200
$150
$99
VersionVector
Ø(ATTR2→$122)
Txn #2
(ATTR2→$199)Txn #1
(ATTR2→$100)Txn #3
(ATTR2→$139)
HYPER: VALIDATION
First-Writer Wins→ The version vector always points to the last committed
version.→ Do not need to check whether write-sets overlap.
Check the undo buffers (i.e., delta records) of txnsthat committed after the validating txn started.→ Compare the committed txn's write set for phantoms
using Precision Locking.→ Only need to store the txn's read predicates and not its
entire read set.
39
HYPER: PRECISION LOCKING
40
Validating Txn
SELECT * FROM fooWHERE attr2 > 20AND attr2 < 30
SELECT COUNT(attr1)FROM fooWHERE attr2 IN (10,20,30)
Delta Storage (Per Txn)
Txn #1003(ATTR1→'IceCube',ATTR2→199)
SELECT attr1, AVG(attr2)FROM fooWHERE attr1 LIKE '%Ice%'GROUP BY attr1HAVING AVG(attr2) > 100
Txn #1002
(ATTR2→122)
Txn #1001
(ATTR2→99)
(ATTR2→33)
HYPER: PRECISION LOCKING
41
Validating Txn
SELECT * FROM fooWHERE attr2 > 20AND attr2 < 30
SELECT COUNT(attr1)FROM fooWHERE attr2 IN (10,20,30)
Delta Storage (Per Txn)
Txn #1003(ATTR1→'IceCube',ATTR2→199)
SELECT attr1, AVG(attr2)FROM fooWHERE attr1 LIKE '%Ice%'GROUP BY attr1HAVING AVG(attr2) > 100
Txn #1002
(ATTR2→122)
Txn #1001
(ATTR2→99)
(ATTR2→33)
99>20 AND 99<30
33>20 AND 33<30
HYPER: PRECISION LOCKING
42
Validating Txn
SELECT * FROM fooWHERE attr2 > 20AND attr2 < 30
SELECT COUNT(attr1)FROM fooWHERE attr2 IN (10,20,30)
Delta Storage (Per Txn)
Txn #1003(ATTR1→'IceCube',ATTR2→199)
SELECT attr1, AVG(attr2)FROM fooWHERE attr1 LIKE '%Ice%'GROUP BY attr1HAVING AVG(attr2) > 100
Txn #1002
(ATTR2→122)
Txn #1001
(ATTR2→99)
(ATTR2→33)
99>20 AND 99<30
33>20 AND 33<30
FALSE
HYPER: PRECISION LOCKING
43
Validating Txn
SELECT * FROM fooWHERE attr2 > 20AND attr2 < 30
SELECT COUNT(attr1)FROM fooWHERE attr2 IN (10,20,30)
Delta Storage (Per Txn)
Txn #1003(ATTR1→'IceCube',ATTR2→199)
SELECT attr1, AVG(attr2)FROM fooWHERE attr1 LIKE '%Ice%'GROUP BY attr1HAVING AVG(attr2) > 100
Txn #1002
(ATTR2→122)
Txn #1001
(ATTR2→99)
(ATTR2→33)99 IN (1
0,20,30)
33 IN (10,20,30)
HYPER: PRECISION LOCKING
44
Validating Txn
SELECT * FROM fooWHERE attr2 > 20AND attr2 < 30
SELECT COUNT(attr1)FROM fooWHERE attr2 IN (10,20,30)
Delta Storage (Per Txn)
Txn #1003(ATTR1→'IceCube',ATTR2→199)
SELECT attr1, AVG(attr2)FROM fooWHERE attr1 LIKE '%Ice%'GROUP BY attr1HAVING AVG(attr2) > 100
Txn #1002
(ATTR2→122)
Txn #1001
(ATTR2→99)
(ATTR2→33)99 IN (1
0,20,30)
33 IN (10,20,30)
FALSE
HYPER: PRECISION LOCKING
45
Validating Txn
SELECT * FROM fooWHERE attr2 > 20AND attr2 < 30
SELECT COUNT(attr1)FROM fooWHERE attr2 IN (10,20,30)
Delta Storage (Per Txn)
Txn #1003(ATTR1→'IceCube',ATTR2→199)
SELECT attr1, AVG(attr2)FROM fooWHERE attr1 LIKE '%Ice%'GROUP BY attr1HAVING AVG(attr2) > 100
Txn #1002
(ATTR2→122)
Txn #1001
(ATTR2→99)
(ATTR2→33)
NULL LIKE '%Ice%'
NULL L
IKE '%
Ice%'
FALSE
HYPER: PRECISION LOCKING
46
Validating Txn
SELECT * FROM fooWHERE attr2 > 20AND attr2 < 30
SELECT COUNT(attr1)FROM fooWHERE attr2 IN (10,20,30)
Delta Storage (Per Txn)
Txn #1003(ATTR1→'IceCube',ATTR2→199)
SELECT attr1, AVG(attr2)FROM fooWHERE attr1 LIKE '%Ice%'GROUP BY attr1HAVING AVG(attr2) > 100
Txn #1002
(ATTR2→122)
Txn #1001
(ATTR2→99)
(ATTR2→33)
HYPER: PRECISION LOCKING
47
Validating Txn
SELECT * FROM fooWHERE attr2 > 20AND attr2 < 30
SELECT COUNT(attr1)FROM fooWHERE attr2 IN (10,20,30)
Delta Storage (Per Txn)
Txn #1003(ATTR1→'IceCube',ATTR2→199)
SELECT attr1, AVG(attr2)FROM fooWHERE attr1 LIKE '%Ice%'GROUP BY attr1HAVING AVG(attr2) > 100
Txn #1002
(ATTR2→122)
Txn #1001
(ATTR2→99)
(ATTR2→33)
'Ice Cube' LIKE '%Ice%'TRUE
HYPER: PRECISION LOCKING
48
Validating Txn
SELECT * FROM fooWHERE attr2 > 20AND attr2 < 30
SELECT COUNT(attr1)FROM fooWHERE attr2 IN (10,20,30)
Delta Storage (Per Txn)
Txn #1003(ATTR1→'IceCube',ATTR2→199)
SELECT attr1, AVG(attr2)FROM fooWHERE attr1 LIKE '%Ice%'GROUP BY attr1HAVING AVG(attr2) > 100
Txn #1002
(ATTR2→122)
Txn #1001
(ATTR2→99)
(ATTR2→33)
'Ice Cube' LIKE '%Ice%'TRUE
HYPER: VERSION SYNOPSES
Store a separate column that tracks the position of the first and last versioned tuple in a block of tuples.
When scanning tuples, the DBMS can check for strides of tuples without older versions and execute more efficiently.
49
Main Data Table
ATTR1
Tupac
IceT
B.I.G
DrDre
ATTR2
$100
$200
$150
$99
VersionVector
Ø
Ø
Ø
RZA
GZA
ODB
$300
$300
$0
Ø
Ø
VersionSynopsis
[2,5)
HYPER: VERSION SYNOPSES
Store a separate column that tracks the position of the first and last versioned tuple in a block of tuples.
When scanning tuples, the DBMS can check for strides of tuples without older versions and execute more efficiently.
50
Main Data Table
ATTR1
Tupac
IceT
B.I.G
DrDre
ATTR2
$100
$200
$150
$99
VersionVector
Ø
Ø
Ø
RZA
GZA
ODB
$300
$300
$0
Ø
Ø
VersionSynopsis
[2,5) 0
1
2
3
4
5
6
Offsets
HYPER: VERSION SYNOPSES
Store a separate column that tracks the position of the first and last versioned tuple in a block of tuples.
When scanning tuples, the DBMS can check for strides of tuples without older versions and execute more efficiently.
51
Main Data Table
ATTR1
Tupac
IceT
B.I.G
DrDre
ATTR2
$100
$200
$150
$99
VersionVector
Ø
Ø
Ø
RZA
GZA
ODB
$300
$300
$0
Ø
Ø
VersionSynopsis
[2,5)
HYPER: VERSION SYNOPSES
Store a separate column that tracks the position of the first and last versioned tuple in a block of tuples.
When scanning tuples, the DBMS can check for strides of tuples without older versions and execute more efficiently.
52
Main Data Table
ATTR1
Tupac
IceT
B.I.G
DrDre
ATTR2
$100
$200
$150
$99
VersionVector
Ø
Ø
Ø
RZA
GZA
ODB
$300
$300
$0
Ø
Ø
VersionSynopsis
[2,5)
CMU CICADA
In-memory OLTP engine based on optimistic MVCC with append-only storage (N2O).→ Best-effort Inlining→ Loosely Synchronized Clocks→ Contention-Aware Validation→ Index Nodes Stored in Tables
Designed to be scalable for both low- and high-contention workloads.
53
CICADA: DEPENDABLY FAST MULTI-CORE IN-MEMORY TRANSACTIONSSIGMOD 2017
Record Meta-data
CICADA: BEST-EFFORT INLINING
Record meta-data is stored in a fixed location.
Threads will attempt to inline read-mostly version within this meta-data to reduce version chain traversals.
54
POINTER LATEST VERSION
EMPTY
KEY VALUE
XXX $111
POINTER
KEY VALUE
YYY $222
POINTER
CICADA: FAST VALIDATION
Contention-aware Validation→ Validate access to recently modified records first.
Early Consistency Check→ Pre-validate access set before making global writes.
Incremental Version Search→ Resume from last search location in version list.
55
Source: Hyeontaek Lim
CICADA: FAST VALIDATION
Contention-aware Validation→ Validate access to recently modified records first.
Early Consistency Check→ Pre-validate access set before making global writes.
Incremental Version Search→ Resume from last search location in version list.
56
Source: Hyeontaek Lim
Skip if all recent txnscommitted successfully.
CICADA: INDEX STORAGE
57
Index Node Table
NODE DATA
A1 Keys→[100,200]Pointers→[B,C]
POINTER
B2 Keys→[50,70]Pointers→[D,E]
E3 Keys→[10,30]Pointers→[RID,RID]
Ø
B1 Keys→[52,70]Pointers→[D,E] Ø
Index
A
B C
D E F G
E2 Keys→[11,30]Pointers→[RID,RID]
E1 Keys→[12,30]Pointers→[RID,RID]
CICADA: INDEX STORAGE
58
Index Node Table
NODE DATA
A1 Keys→[100,200]Pointers→[B,C]
POINTER
B2 Keys→[50,70]Pointers→[D,E]
E3 Keys→[10,30]Pointers→[RID,RID]
Ø
B1 Keys→[52,70]Pointers→[D,E] Ø
Index
A
B C
D E F G
E2 Keys→[11,30]Pointers→[RID,RID]
E1 Keys→[12,30]Pointers→[RID,RID]
CICADA: LOW CONTENTION
59
Workload: YCSB (95% read / 5% write) - 1 op per txn
Source: Hyeontaek Lim
CICADA: LOW CONTENTION
60
0
10
20
30
40
50
0 6 12 18 24
Thro
ughp
ut (t
xn/s
ec)
Mill
ions
# Threads
2PL Silo Silo' TicTocFOEDUS Hekaton ERMIA Cicada
Workload: YCSB (95% read / 5% write) - 1 op per txn
Source: Hyeontaek Lim
CICADA: HIGH CONTENTION
61
Workload: TPC-C (1 Warehouse)
Source: Hyeontaek Lim
CICADA: HIGH CONTENTION
62
0
0.11
0.22
0.33
0 6 12 18 24
Thro
ughp
ut (t
xn/s
ec)
Mill
ions
# Threads
2PL Silo Silo' TicTocFOEDUS Hekaton ERMIA Cicada
Workload: TPC-C (1 Warehouse)
Source: Hyeontaek Lim
CICADA: HIGH CONTENTION
63
0
0.11
0.22
0.33
0 6 12 18 24
Thro
ughp
ut (t
xn/s
ec)
Mill
ions
# Threads
2PL Silo Silo' TicTocFOEDUS Hekaton ERMIA Cicada
Workload: TPC-C (1 Warehouse)
Source: Hyeontaek Lim
PARTING THOUGHTS
There are different ways to check for phantoms in MVCC.
Andy considers HyPer and Cicada to be state-of-the-art as of 2019.
64
NEXT CLASS
Parallel Join Algorithms
65