
STM in the Small: Trading Generality for Performance in Software Transactional Memory

Aleksandar Dragojević

Our motivation

• Concurrent data structures

• Hash-tables, skip-lists

– In-memory database indices [2]

– Key-value stores

[2] Larson et al. “High-performance concurrency control mechanisms for main-memory databases” VLDB Dec. 2011


Software Transactional Memory

• Simplifies data structures

– Ensures correct implementations

• Enables use of complex data structures

– More complex than when using CAS directly


What is STM performance like?

Graph from: Cascaval et al. “Software transactional memory: why is it only a research toy” ACMQ September 2008

[Figure, built up over several slides: speedup (0 to 10) plotted at 1, 2, 4, 8, and 16.]

[Figure: Skip-list throughput [Mops/s] (0 to 9) versus number of threads (0 to 15) for STM and CAS, annotated 71% and 78%.]

SpecTM: Specialized STM for Concurrent Data Structures

[Figure: ease of programming (easy to hard) plotted against performance (slow to fast), first with points for STM and CAS, then with SpecTM added.]

This talk

[Figure: Skip-list throughput [Mops/s] (0 to 9) versus number of threads (0 to 15). First STM and CAS, annotated 71% and 78%; then SpecTM and CAS, annotated 0%.]

Overview

• STM overheads

• SpecTM

– Short-transaction API

– Collocating data and meta-data

– In-word meta-data

• Evaluation


STM overheads

• Book-keeping done in software

• Memory accesses become STM calls

• Significant overheads [3]

– 50% overhead with a single thread

– Requires 4 threads to outperform sequential code in 75% of cases

[3] Dragojević et al. “Why STM Can Be more than a Research Toy” CACM Apr. 2011


STM API

void StartTx();

void CommitTx();

word_t ReadWord(word_t *addr);

void WriteWord(word_t *addr, word_t val);


STM read

• map(addr): map the address to an entry of the orec table, giving [orec idx]

• check RaW (read-after-write) against the write-set

• read orec, read val, read orec

• o1 == o2 && !locked

[Diagram, built up over several slides: address 0 0 c 1 2 a b 0 mapped into an orec table with entries [c1a2aa], [c1a2ab], [c1a2ac]; a write-set holding address-value pairs and orecs, with an index keyed by [orec idx].]

STM read (cont’d)

• log read: append the read to the read-set (head, curr, count)

• orec <= validts: return val

• otherwise validate: if valid, return val; else abort

[Diagram, built up over several slides: read-set with head and curr pointers; count goes from 2 to 3 as the read is logged.]
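The read path on these slides can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the talk's implementation: the orec encoding (version shifted left by one, low bit as the lock bit), the table size, the hash in `map_addr`, the `Tx` structure, and the choice to abort rather than wait on a locked orec are all assumptions. Extending `validts` to the version just read is a conservative simplification.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word_t;

#define OREC_COUNT 1024u          /* assumed table size */
#define LOCKED     ((word_t)1)    /* assumed: low orec bit is the lock bit */
#define MAX_LOG    64

static _Atomic word_t orec_table[OREC_COUNT];   /* (version << 1) | lock */

typedef struct { word_t *addr; word_t val; } WriteEntry;
typedef struct { _Atomic word_t *orec; word_t seen; } ReadEntry;

typedef struct {
    word_t     validts;                 /* time at which the read-set is valid */
    WriteEntry writes[MAX_LOG]; size_t nwrites;
    ReadEntry  reads[MAX_LOG];  size_t nreads;
    int        aborted;
} Tx;

/* map(addr): hash the address into the orec table. */
static _Atomic word_t *map_addr(word_t *addr)
{
    return &orec_table[((uintptr_t)addr / sizeof(word_t)) % OREC_COUNT];
}

/* validate: every logged orec must still hold the value we saw. */
static int validate(const Tx *tx)
{
    for (size_t i = 0; i < tx->nreads; i++)
        if (atomic_load(tx->reads[i].orec) != tx->reads[i].seen)
            return 0;
    return 1;
}

word_t stm_read(Tx *tx, word_t *addr)
{
    _Atomic word_t *orec = map_addr(addr);

    /* check RaW: return our own pending write, newest first */
    for (size_t i = tx->nwrites; i-- > 0; )
        if (tx->writes[i].addr == addr)
            return tx->writes[i].val;

    /* read orec, read val, read orec */
    word_t o1  = atomic_load(orec);
    word_t val = *(volatile word_t *)addr;
    word_t o2  = atomic_load(orec);
    if (o1 != o2 || (o1 & LOCKED)) { tx->aborted = 1; return 0; }

    /* log read */
    assert(tx->nreads < MAX_LOG);
    tx->reads[tx->nreads++] = (ReadEntry){ orec, o1 };

    /* orec <= validts is the fast path; otherwise validate or abort */
    if ((o1 >> 1) > tx->validts) {
        if (!validate(tx)) { tx->aborted = 1; return 0; }
        tx->validts = o1 >> 1;
    }
    return val;
}
```

Even on the fast path, one logical load turns into a hash, a write-set scan, three memory reads, and a log append.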

STM read fast-path

• Significantly more expensive than CPU read instructions

– >10 instructions

• Writes have comparable costs

– Incurs a compare-and-swap at commit time


Short transactions

• Short read-write transactions

• Short read-only transactions

• Single-access transactions

– Read, Write, CAS


Short read-write transactions

word_t TxRWRead_1(word_t *addr_1);

...

void TxRWCommit_1(word_t val);

void TxRWCommit_2(word_t v1, word_t v2);

void TxRWCommit_3(word_t v1, word_t v2, word_t v3);

...
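One way such an API might be used and implemented is sketched below. Only `TxRWRead_1` and `TxRWCommit_2` have signatures taken from the slide. `TxRWRead_2` is assumed by analogy, and `TxRWValid`, `TxRWAbort`, and `swap_words` are hypothetical helpers added so that the example can report conflicts. The sketch also assumes that a read locks the location's orec immediately, and it treats two words that share an orec as a conflict.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word_t;

#define OREC_COUNT 256u           /* assumed table size */
#define LOCKED     ((word_t)1)    /* assumed: low orec bit is the lock bit */

static _Atomic word_t orecs[OREC_COUNT];        /* (version << 1) | lock */

/* Static write-set: slot k belongs to TxRWRead_{k+1}; nothing is searched. */
typedef struct { word_t *addr; _Atomic word_t *orec; } Slot;
static _Thread_local Slot slots[2];
static _Thread_local int  tx_ok;

static _Atomic word_t *map_addr(word_t *addr)
{
    return &orecs[((uintptr_t)addr / sizeof(word_t)) % OREC_COUNT];
}

/* Lock the orec with one CAS, record it in a fixed slot, return the value. */
static word_t rw_read(int k, word_t *addr)
{
    _Atomic word_t *orec = map_addr(addr);
    word_t seen = atomic_load(orec);
    if ((seen & LOCKED) ||
        !atomic_compare_exchange_strong(orec, &seen, seen | LOCKED)) {
        tx_ok = 0;
        return 0;
    }
    slots[k] = (Slot){ addr, orec };
    return *addr;
}

/* Unlock every held orec; bump is 2 to advance the version, 0 to keep it. */
static void release_all(word_t bump)
{
    for (int k = 0; k < 2; k++) {
        if (slots[k].addr == NULL) continue;
        word_t locked = atomic_load(slots[k].orec);
        atomic_store(slots[k].orec, (locked & ~LOCKED) + bump);
        slots[k].addr = NULL;
    }
}

word_t TxRWRead_1(word_t *addr_1)
{
    tx_ok = 1;
    slots[0].addr = slots[1].addr = NULL;
    return rw_read(0, addr_1);
}

word_t TxRWRead_2(word_t *addr_2) { return tx_ok ? rw_read(1, addr_2) : 0; }
int    TxRWValid(void)            { return tx_ok; }     /* hypothetical */
void   TxRWAbort(void)            { release_all(0); }   /* hypothetical */

void TxRWCommit_2(word_t v1, word_t v2)
{
    assert(tx_ok && slots[0].addr && slots[1].addr);
    *slots[0].addr = v1;
    *slots[1].addr = v2;
    release_all(2);
}

/* Hypothetical client: atomically swap two words. */
int swap_words(word_t *a, word_t *b)
{
    word_t va = TxRWRead_1(a);
    word_t vb = TxRWRead_2(b);
    if (!TxRWValid()) { TxRWAbort(); return 0; }
    TxRWCommit_2(vb, va);
    return 1;
}
```

Because the number of accesses is encoded in the function names, the write-set is a fixed-size array and the commit needs no log traversal beyond two known slots.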

Short transaction optimizations

[Diagram, built up over several slides: the STM read path (map(addr), check RaW, read orec, read val, read orec, o1 == o2 && !locked) over the orec table and write-set, annotated for short transactions:]

• Write-set: static write-set

• CAS (from write)

Short transaction optimizations (cont’d)

[Diagram, built up over several slides: the rest of the STM read path (log read, orec <= validts, validate, return val or abort) over the read-set, annotated for short transactions:]

• Read-set: static, merged with write-set

Using short transactions

• With “traditional” transactions, the whole operation is a single transaction (Tx)

• Split the operation into a sequence of short transactions (Tx1 Tx2 Tx3 Tx4 Tx5 Tx6)

• “Glue” them together using techniques similar to lock-free algorithms (e.g. pointer marking)

• Is this simpler than using CAS directly? Yes: transactions eliminate tricky races

Mix short and ordinary transactions

• Implement corner cases using ordinary transactions

Example

[Diagram, built up over several slides: a skip-list holding 10, 20, 30, 40, from which 20 is removed.]

Remove 20

• TxSingleRead reads while searching for the node

• Short read-write transaction to mark the node and unlink it

This becomes trickier when using CAS directly

• We have to use several CASes

• There might be a race at each of these steps

• We need to make sure other operations clean up for us

• All interactions between operations have to be handled correctly

• E.g. removal of an element that is still being added to the list


“Traditional” STM data layout

[Diagram: application data words W1, W2, W3, W4, … and a separate table of orecs O1, O2, O3, O4, ….]

• General

• False conflicts

• Cache misses

Collocate data and meta-data

[Diagram: each data word stored next to its orec: W1 O1, W2 O2, W3 O3, W4 O4, ….]

• Simplify mapping

• No false conflicts

• Same cache line
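The difference between the two layouts can be made concrete with a small sketch. The names `map_hashed`, `map_colloc`, and `CollocWord`, the table size, and the hash are assumptions made for illustration; the talk does not show the actual declarations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word_t;

/* "Traditional" layout: orecs live in a separate table, so several
 * addresses can hash to the same orec (false conflicts) and the orec is
 * usually on a different cache line from the data. */
#define OREC_COUNT 1024u
static _Atomic word_t orec_table[OREC_COUNT];

static _Atomic word_t *map_hashed(const word_t *addr)
{
    return &orec_table[((uintptr_t)addr / sizeof(word_t)) % OREC_COUNT];
}

/* Collocated layout: each data word sits next to its own orec. The
 * alignment keeps the pair from straddling a cache-line boundary. */
typedef struct {
    _Alignas(2 * sizeof(word_t)) word_t data;
    _Atomic word_t orec;
} CollocWord;

static_assert(sizeof(CollocWord) == 2 * sizeof(word_t),
              "data and orec are expected to pack into two words");

/* The mapping is a constant offset: no hashing and no shared orecs. */
static _Atomic word_t *map_colloc(CollocWord *w)
{
    return &w->orec;
}
```

The cost of collocation is that the STM now dictates how application data is declared, which is the loss of generality named in the talk's title.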

Collocated meta-data optimizations

[Diagram, built up over several slides: the short-transaction read path (map(addr), check RaW, read orec, read val, read orec, o1 == o2 && !locked, CAS (from write)) with the static write-set, annotated:]

• Simplified mapping

• Better cache locality

Pure value-based validation

[Diagram: data words W1, W2, W3, W4, … only.]

• Restrict values

• Restrict access patterns

• Single-word operations

Value-based optimizations

[Diagram, built up over several slides: the short-transaction read path (map(addr), check RaW, read orec, read val, read orec, o1 == o2 && !locked, CAS (from write)) with the static write-set and the “simplified mapping” annotation.]


API changes

word_t ReadWord(TVar *addr);

void WriteWord(TVar *addr, word_t val);

• Transactions access TVar, not word_t

word_t TVarToWord(TVar addr);

TVar WordToTVar(word_t val);

• Calls to update TVar non-transactionally
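A possible shape for the two conversion calls is sketched below, assuming that a TVar is a single word whose lowest bit is reserved as a lock bit and whose remaining bits hold the value. That representation, the `TVAR_*` macros, and the `TVarIsLocked` helper are assumptions for illustration; the slides only give the two signatures.

```c
#include <assert.h>
#include <stdint.h>

typedef uintptr_t word_t;

/* Assumed representation: (value << 1) | lock bit. */
typedef struct { word_t raw; } TVar;

#define TVAR_LOCK_BIT  ((word_t)1)
#define TVAR_MAX_VALUE (UINTPTR_MAX >> 1)   /* one bit is lost to meta-data */

/* Wrap a plain word for non-transactional initialization or update. */
TVar WordToTVar(word_t val)
{
    assert(val <= TVAR_MAX_VALUE);          /* values are restricted */
    return (TVar){ val << 1 };
}

/* Unwrap a TVar, dropping the lock bit. */
word_t TVarToWord(TVar addr)
{
    return addr.raw >> 1;
}

/* Hypothetical helper: is the word currently locked by a transaction? */
int TVarIsLocked(TVar v)
{
    return (v.raw & TVAR_LOCK_BIT) != 0;
}
```

Making TVar a distinct type means the compiler rejects code that reads it as a plain `word_t` without going through a conversion call.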





Pure value-based short read-write

• read val

• !locked: continue; otherwise abort

• CAS: on success, log in the static write-set; otherwise abort

• return val
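The flow above might look roughly like this in code. It is a sketch under assumptions: the TVar is taken to be an atomic word laid out as (value << 1) | lock bit, the read reports success through a return code and an output parameter rather than returning the value directly, and `val_read`, `val_commit`, `val_abort`, and `move_one` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t word_t;
typedef _Atomic word_t TVar;            /* assumed: (value << 1) | lock bit */

#define LOCK_BIT ((word_t)1)

/* Static write-set: one fixed slot per access, no orecs anywhere. */
static _Thread_local struct { TVar *addr; word_t old; } wset[2];

/* Returns 1 and stores the value in *out, or returns 0 to signal abort. */
static int val_read(int slot, TVar *addr, word_t *out)
{
    word_t raw = atomic_load(addr);                     /* read val */
    if (raw & LOCK_BIT)                                 /* !locked? */
        return 0;
    if (!atomic_compare_exchange_strong(addr, &raw, raw | LOCK_BIT))
        return 0;                                       /* CAS failed */
    wset[slot].addr = addr;                             /* log */
    wset[slot].old  = raw;
    *out = raw >> 1;                                    /* return val */
    return 1;
}

/* Storing the new value with a clear lock bit publishes and unlocks. */
static void val_commit(int n, const word_t *vals)
{
    for (int i = 0; i < n; i++)
        atomic_store(wset[i].addr, vals[i] << 1);
}

/* Restore the pre-lock contents of the first n slots. */
static void val_abort(int n)
{
    for (int i = 0; i < n; i++)
        atomic_store(wset[i].addr, wset[i].old);
}

/* Hypothetical client: move one unit from *from to *to. */
int move_one(TVar *from, TVar *to)
{
    word_t a, b;
    if (!val_read(0, from, &a)) return 0;
    if (!val_read(1, to, &b)) { val_abort(1); return 0; }
    if (a == 0) { val_abort(2); return 0; }
    val_commit(2, (word_t[]){ a - 1, b + 1 });
    return 1;
}
```

Compared with the orec-based read path, each access is one load, one test, and one CAS on the data word itself.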

STM vs. SpecTM

[Figure: STM and SpecTM side by side.]


Evaluation

[Figure, built up over several slides: Skip-list (90% lookups), throughput [Mops/s] (0 to 80) versus number of threads (0 to 140). The annotation reads 240% with only STM and CAS plotted, 48% once SpecTM-Short is added, 12% with SpecTM-Short-Colloc, and 2% with SpecTM-Short-Val.]

[Figure: Hash-table (90% lookups), throughput [Mops/s] (0 to 450) versus number of threads (0 to 140) for CAS, SpecTM-Short-Val, SpecTM-Short-Colloc, STM, and SpecTM-Short; annotated 0%.]

[Figure: Hash-table (90% lookups), throughput [Mops/s] (0 to 80) versus number of threads (0 to 15) for STM, SpecTM-Short, SpecTM-Short-Colloc, CAS, and SpecTM-Short-Val; annotated 0%.]

Conclusions

• Trade generality for performance in SpecTM

– Restrict STM API

– Control data layout

• STM-based data structures perform as well as the ones built directly from CAS

• Do we need SpecTM with upcoming Intel TSX?

– Best-effort limited-size transactions

[Figure: ease of programming (easy to hard) plotted against performance (slow to fast), with points for SpecTM, CAS, and STM.]
