STM in the Small: Trading Generality for Performance in Software Transactional Memory
Aleksandar Dragojević
Our motivation
• Concurrent data structures
• Hash-tables, skip-lists
– In-memory database indices [2]
– Key-value stores
[2] Larson et al. “High-performance concurrency control mechanisms for main-memory databases” VLDB Dec. 2011
2
Software Transactional Memory
• Simplifies data structures
– Ensures correct implementations
• Enables use of complex data structures
– More complex than when using CAS directly
3
What is STM performance like?
Graph from: Cascaval et al. “Software transactional memory: why is it only a research toy” ACMQ September 2008
[Plot, shown in several builds: speedup (axis 0 to 10); labels 1, 2, 4, 8, 16]
10
What is STM performance like?
[Plot: Skip-list throughput [Mops/s] (0 to 9) against threads (0 to 15) for STM and CAS; annotated differences of 71% and 78%]
12
SpecTM: Specialized STM for Concurrent Data Structures
[Diagram: STM and CAS, then SpecTM, placed on two axes: ease of programming (easy to hard) and performance (slow to fast)]
14
This talk
[Plot: Skip-list throughput [Mops/s] (0 to 9) against threads (0 to 15)]
• STM and CAS: annotated differences of 71% and 78%
• SpecTM and CAS: annotated differences of 0% and 0%
16
Overview
• STM overheads
• SpecTM
– Short-transaction API
– Collocating data and meta-data
– In-word meta-data
• Evaluation
17
STM overheads
• Book-keeping done in software
• Memory accesses become STM calls
• Significant overheads [3]
– 50% overhead with a single thread
– Requires 4 threads to outperform sequential code in 75% of cases
[3] Dragojević et al. “Why STM Can Be more than a Research Toy” CACM Apr. 2011
19
STM API
void StartTx();
void CommitTx();
word_t ReadWord(word_t *addr);
void WriteWord(word_t *addr, word_t val);
20
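A minimal sketch of how application code might call this API. Only the four signatures come from the slide. The definition of `word_t`, the fixed-size write log, and the `transfer` function are assumptions made for illustration, and the toy implementation is single-threaded: it buffers writes until commit and does no conflict detection, which a real STM must do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word_t;            /* assumption: one machine word */

/* Toy write log (hypothetical): (address, value) pairs held until commit. */
#define MAX_WRITES 16
static struct { word_t *addr; word_t val; } write_log[MAX_WRITES];
static size_t n_writes;

void StartTx(void) { n_writes = 0; }

word_t ReadWord(word_t *addr) {
    /* Read-after-write: return the newest buffered value, if there is one. */
    for (size_t i = n_writes; i-- > 0; )
        if (write_log[i].addr == addr) return write_log[i].val;
    return *addr;
}

void WriteWord(word_t *addr, word_t val) {
    assert(n_writes < MAX_WRITES);
    write_log[n_writes].addr = addr;
    write_log[n_writes].val = val;
    n_writes++;
}

void CommitTx(void) {
    /* Single-threaded toy: no locking or validation, just write back. */
    for (size_t i = 0; i < n_writes; i++)
        *write_log[i].addr = write_log[i].val;
    n_writes = 0;
}

/* Every shared access inside the transaction is an STM call. */
void transfer(word_t *from, word_t *to, word_t amount) {
    StartTx();
    WriteWord(from, ReadWord(from) - amount);
    WriteWord(to, ReadWord(to) + amount);
    CommitTx();
}
```

Even in this toy, each memory access costs a function call plus a log search or append, which is the kind of book-keeping the overhead figures above refer to.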
STM read
[Diagram, shown in several builds: the address (0 0 c 1 2 a b 0) is mapped to an index into the orec table (entries [c1a2aa], [c1a2ab], [c1a2ac]); the write-set holds address-value pairs and orec indices ([orec idx])]
• map(addr)
• check RaW
• read orec, read val, read orec
• o1 == o2 && !locked
25
STM read (cont’d)
[Diagram, shown in several builds: the read-set with head and curr pointers; count goes from 2 to 3 when the read is logged]
• log read
• orec <= validts: return val
• otherwise validate: return val if valid, abort if not
32
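The read path on the slides can be sketched roughly as follows. This is an illustration under stated assumptions, not the talk's implementation: the orec encoding (low bit as lock, remaining bits as version), the table size, the `global_clock` counter, and every identifier below are hypothetical, and the read-after-write check is only marked by a comment.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word_t;

#define OREC_COUNT 1024          /* assumption: size of the orec table */
#define LOCK_BIT   ((word_t)1)   /* assumption: low bit = locked, other bits = version */
#define MAX_READS  64            /* assumption: fixed-size read-set */

static _Atomic word_t orec_table[OREC_COUNT];
static _Atomic word_t global_clock;   /* hypothetical commit counter */

typedef struct { _Atomic word_t *orec; word_t seen; } read_entry_t;

typedef struct {
    word_t valid_ts;                  /* "validts" on the slides */
    read_entry_t read_set[MAX_READS];
    size_t count;
} tx_t;

/* map(addr): hash the data address to its ownership record. */
static _Atomic word_t *map_orec(_Atomic word_t *addr) {
    return &orec_table[((uintptr_t)addr / sizeof(word_t)) % OREC_COUNT];
}

/* validate: every orec read so far must be unchanged; then extend valid_ts. */
static bool validate(tx_t *tx) {
    word_t now = atomic_load(&global_clock);
    for (size_t i = 0; i < tx->count; i++)
        if (atomic_load(tx->read_set[i].orec) != tx->read_set[i].seen)
            return false;
    tx->valid_ts = now;
    return true;
}

/* Returns false where the slides say "abort". */
static bool tx_read(tx_t *tx, _Atomic word_t *addr, word_t *out) {
    _Atomic word_t *orec = map_orec(addr);
    /* check RaW (write-set lookup) would go here; omitted in this sketch */
    word_t o1 = atomic_load(orec);                 /* read orec */
    word_t val = atomic_load(addr);                /* read val  */
    word_t o2 = atomic_load(orec);                 /* read orec */
    if (o1 != o2 || (o1 & LOCK_BIT)) return false; /* o1 == o2 && !locked */
    if (tx->count == MAX_READS) return false;
    tx->read_set[tx->count].orec = orec;           /* log read */
    tx->read_set[tx->count].seen = o1;
    tx->count++;
    if (o1 > tx->valid_ts && !validate(tx)) return false;
    *out = val;                                    /* return val */
    return true;
}
```

Counting the loads, comparisons, and the log append makes it plausible that even the uncontended path runs well past a single load instruction.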
STM read fast-path
• Significantly more expensive than CPU read instructions
– >10 instructions
• Writes have comparable costs
– Incur a compare-and-swap at commit time
33
Overview
• STM overheads
• SpecTM
– Short-transaction API
– Collocating data and meta-data
– In-word meta-data
• Evaluation
34
Short transactions
• Short read-write transactions
• Short read-only transactions
• Single-access transactions
– Read, Write, CAS
36
Short read-write transactions
word_t TxRWRead_1(word_t *addr_1);
word_t TxRWRead_2(word_t *addr_2);
word_t TxRWRead_3(word_t *addr_3);
...
void TxRWCommit_1(word_t val);
void TxRWCommit_2(word_t v1, word_t v2);
void TxRWCommit_3(word_t v1,
                  word_t v2,
                  word_t v3);
...
40
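A small usage sketch for the short read-write API. The three signatures used here are from the slide; everything else is an assumption for illustration. In particular, the sketch assumes that the numeric suffix of `TxRWRead_N` names the position of the access within the transaction, and that `TxRWCommit_2(v1, v2)` stores `v1` and `v2` to the first and second locations read. The bodies are a single-threaded stand-in with no locking, and `swap_words` is a hypothetical caller.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t word_t;   /* assumption: one machine word */

/* Static write-set (toy): slot i holds the address given to TxRWRead_(i+1). */
static word_t *tx_slot[2];

word_t TxRWRead_1(word_t *addr_1) { tx_slot[0] = addr_1; return *addr_1; }
word_t TxRWRead_2(word_t *addr_2) { tx_slot[1] = addr_2; return *addr_2; }

/* Commit supplies the new value for each location, in access order. */
void TxRWCommit_2(word_t v1, word_t v2) {
    *tx_slot[0] = v1;
    *tx_slot[1] = v2;
}

/* Hypothetical caller: swap two words in one short transaction. */
void swap_words(word_t *a, word_t *b) {
    word_t va = TxRWRead_1(a);
    word_t vb = TxRWRead_2(b);
    TxRWCommit_2(vb, va);
}
```

Because the number of accesses is fixed by the function names, the write-set can be a fixed array indexed by position rather than a dynamically searched log.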
Short transaction optimizations
[Diagram, shown in several builds: the STM read path (map(addr), check RaW, read orec, read val, read orec, o1 == o2 && !locked) over the orec table and the write-set, annotated step by step]
• static write-set
• CAS (from write)
46
Short transaction optimizations (cont’d)
[Diagram, shown in several builds: the rest of the read path (log read, orec <= validts, validate, return val or abort) over the read-set]
• read-set merged with write-set
• static
49
Using short transactions
• With “traditional” transactions the whole operation is a single transaction (Tx)
• Split operation into a sequence of short transactions (Tx1 Tx2 Tx3 Tx4 Tx5 Tx6)
• “Glue” them together using techniques similar to lock-free algorithms (e.g. pointer marking)
• Is this simpler than using CAS directly? Yes: transactions eliminate tricky races
54
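Pointer marking, mentioned above as one way to glue short transactions together, is commonly done by borrowing the low bit of an aligned pointer. The helpers below are a generic sketch of that technique; the node type and function names are hypothetical and not taken from the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical list node; its alignment leaves the low pointer bit unused. */
typedef struct node {
    int key;
    struct node *next;
} node_t;

#define MARK_BIT ((uintptr_t)1)

/* Set the mark: the node owning this pointer is logically deleted. */
static inline node_t *mark(node_t *p) {
    return (node_t *)((uintptr_t)p | MARK_BIT);
}

/* Strip the mark to recover the real successor address. */
static inline node_t *unmark(node_t *p) {
    return (node_t *)((uintptr_t)p & ~MARK_BIT);
}

static inline bool is_marked(node_t *p) {
    return ((uintptr_t)p & MARK_BIT) != 0;
}
```

A later transaction that reads a marked `next` pointer can tell that an earlier step already deleted the node, without any extra shared state.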
Mix short and ordinary transactions
Tx1 Tx2 Tx3 Tx4 Tx5 Tx6
Tx
Implement corner cases using ordinary transactions
55
Example
[Diagram, shown in several builds: skip-list holding 10, 20, 30 and 40; the operation is Remove 20; removed links are crossed out]
With short transactions:
• TxSingleRead
• Short read-write transaction to mark the node and unlink it
With CAS:
• This becomes trickier when using CAS directly
• We have to use several CASes
• There might be a race at each of these steps
• We need to make sure other operations clean-up for us
• All interactions between operations have to be handled correctly
• E.g. removal of an element that is still being added to the list
74
Overview
• STM overheads
• SpecTM
– Short-transaction API
– Collocating data and meta-data
– In-word meta-data
• Evaluation
75
“Traditional” STM data layout
[Diagram: orecs O1 to O4 kept apart from the application data words W1 to W4]
• General
• False conflicts
• Cache misses
77
Collocate data and meta-data
[Diagram: each word stored next to its orec: W1 O1, W2 O2, W3 O3, W4 O4]
• Simplify mapping
• No false conflicts
• Same cache line
78
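The two layouts can be contrasted in a few lines of C. This is a sketch under assumptions: the table size, the hash, and the names `orec_for`, `colloc_word_t`, and `colloc_orec_for` are all hypothetical and chosen only to illustrate the three points on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word_t;

/* Traditional layout: orecs live in a separate table, found by hashing
 * the data address. Different words can hash to the same orec. */
#define OREC_COUNT 1024
static word_t orec_table[OREC_COUNT];

static word_t *orec_for(const word_t *addr) {
    return &orec_table[((uintptr_t)addr / sizeof(word_t)) % OREC_COUNT];
}

/* Collocated layout: each transactional word carries its own orec. */
typedef struct {
    word_t value;
    word_t orec;
} colloc_word_t;

/* The mapping is a field access: no hash, no separate table. */
static word_t *colloc_orec_for(colloc_word_t *w) {
    return &w->orec;
}
```

With the table, two words `OREC_COUNT` slots apart share an orec, so unrelated transactions can conflict; with collocation each word has a private orec that sits right next to the value.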
Collocated meta-data optimizations
[Diagram, shown in several builds: the short-transaction read path (map(addr), check RaW, read orec, read val, read orec, o1 == o2 && !locked) with the static write-set and CAS (from write), annotated step by step]
• simplified mapping
• better cache locality
81
Pure value-based validation
[Diagram: application data words W1 to W4 only]
• Restrict values
• Restrict access patterns
• Single-word operations
82
Value-based optimizations
[Diagram, shown in several builds: the same read path and annotations as before (static write-set, CAS (from write), simplified mapping, better cache locality)]
85
API changes
word_t ReadWord(TVar *addr);
void WriteWord(TVar *addr, word_t val);
word_t TVarToWord(TVar addr);
TVar WordToTVar(word_t val);
• Transactions access TVar, not word_t
• Calls to update TVar non-transactionally
89
Pure value-based short read-write
[Flow, shown in several builds]
• read val
• check !locked, otherwise abort
• CAS, abort on failure
• on success: log in the static write-set
• return val
97
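The flow above can be sketched as follows, assuming the lock is a reserved low bit inside the data word itself. That encoding, the `TVar` definition, the two-slot write-set, and the names `val_rw_read`, `val_rw_commit_2`, and `val_rw_abort` are all hypothetical; only the sequence of steps (read val, !locked, CAS, log, return val, abort) follows the slide.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t word_t;
typedef _Atomic word_t TVar;      /* assumption: a TVar is one atomic word */

#define LOCK_BIT ((word_t)1)      /* assumption: low bit reserved as the lock */

/* Static write-set (toy): one slot per access of a two-word transaction. */
static TVar *locked_slot[2];
static word_t old_val[2];

/* Returns false where the slide says "abort". */
static bool val_rw_read(int slot, TVar *addr, word_t *out) {
    word_t val = atomic_load(addr);                    /* read val */
    if (val & LOCK_BIT) return false;                  /* !locked  */
    if (!atomic_compare_exchange_strong(addr, &val, val | LOCK_BIT))
        return false;                                  /* CAS failed */
    locked_slot[slot] = addr;                          /* log */
    old_val[slot] = val;
    *out = val;                                        /* return val */
    return true;
}

/* Storing an unlocked value publishes the update and releases the lock. */
static void val_rw_commit_2(word_t v1, word_t v2) {
    assert(!(v1 & LOCK_BIT) && !(v2 & LOCK_BIT));      /* restricted values */
    atomic_store(locked_slot[0], v1);
    atomic_store(locked_slot[1], v2);
}

/* Undo: restore the values seen by the first n reads. */
static void val_rw_abort(int n) {
    for (int i = 0; i < n; i++)
        atomic_store(locked_slot[i], old_val[i]);
}
```

The `assert` in the commit function reflects the "Restrict values" point: application values must never use the bit that the transaction borrows as a lock.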
STM vs. SpecTM
[Diagram: STM and SpecTM side by side]
98
Overview
• STM overheads
• SpecTM
– Short-transaction API
– Collocating data and meta-data
– In-word meta-data
• Evaluation
99
Evaluation
[Plot, shown in several builds: Skip-list (90% lookups), throughput [Mops/s] (0 to 80) against threads (0 to 140)]
Annotation shown as each series is added alongside CAS:
• STM: 240%
• SpecTM-Short: 48%
• SpecTM-Short-Colloc: 12%
• SpecTM-Short-Val: 2%
104
Evaluation
[Plot: Hash-table (90% lookups), throughput [Mops/s] (0 to 450) against threads (0 to 140); series CAS, SpecTM-Short-Val, SpecTM-Short-Colloc, SpecTM-Short, STM; annotated 0%]
105
Evaluation
[Plot: Hash-table (90% lookups), throughput [Mops/s] (0 to 80) against threads (0 to 15); series CAS, SpecTM-Short-Val, SpecTM-Short-Colloc, SpecTM-Short, STM; annotated 0%]
106
Conclusions
• Trade generality for performance in SpecTM
– Restrict STM API
– Control data layout
• STM-based data structures perform as well as the ones built directly from CAS
• Do we need SpecTM with upcoming Intel TSX?
– Best-effort limited-size transactions
107
©2011 Microsoft Corporation. All rights reserved.