
Page 1

University of Cambridge, September 2017

Scal, Scalloc, and Selfie

Christoph Kirsch, University of Salzburg, Austria

scalloc.cs.uni-salzburg.at

multicore-scalable concurrent allocator

scal.cs.uni-salzburg.at

multicore-scalable concurrent data structures

selfie.cs.uni-salzburg.at

self-referential systems software for teaching

Page 2

Joint Work

✤ Martin Aigner

✤ Christian Barthel

✤ Mike Dodds

✤ Andreas Haas

✤ Thomas Henzinger

✤ Andreas Holzer

✤ Thomas Hütter

✤ Michael Lippautz

✤ Alexander Miller

✤ Simone Oblasser

✤ Hannes Payer

✤ Mario Preishuber

✤ Ana Sokolova

✤ Ali Sezgin

Page 3

How do we exchange data among increasingly many cores on a shared memory machine such that performance still increases with the number of cores?

–The Multicore Scalability Challenge

Page 4

Concurrent Shared Memory Access

[Diagram: Cores 1-4 (of ~100 cores) concurrently accessing shared memory.]

Page 5

How do we allocate and deallocate shared memory with increasingly many cores such that performance increases with the number of cores while memory consumption stays low?

–Multicore Shared Memory Allocation Problem

Page 6

Concurrent Shared Memory Allocation

[Diagram: Cores 1-4 (of ~100 cores) concurrently allocating from ~1TB of shared memory.]

Page 7

How do we teach computer science to students not necessarily majoring in computer science but who nevertheless code every day?

–The Computer Science Education Challenge

Page 8
Page 9

The Multicore Scalability Challenge

[Chart: throughput (y-axis) vs. number of cores (x-axis), contrasting linear scalability, positive scalability with high performance, positive scalability with low performance, and negative scalability.]
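A rough way to read this chart (an informal gloss, not a definition given in the talk): writing T(n) for the throughput achieved on n cores,

  linear scalability:    T(n) ≈ n · T(1)
  positive scalability:  T(n) keeps growing as n grows (at either a high or a low absolute level)
  negative scalability:  T(n) eventually shrinks as n grows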

Page 10

Timestamped (TS) Stack [POPL15]

Concurrent Stack

[Diagram: Cores 1-4 sharing a concurrent stack.]

Page 11

Timestamped (TS) Stack [POPL15]

[Figure 5: TS stack performance in the high-contention scenario on a 40-core machine (left) and a 64-core machine (right). Y-axis: operations per ms (more is better); x-axis: number of threads. Compared implementations: Treiber Stack, EB Stack, TS-atomic Stack, TS-CAS Stack, TS-hardware Stack, TS-interval Stack, TS-stutter Stack. Panels: (a) producer-consumer benchmark, 40-core machine; (b) producer-consumer benchmark, 64-core machine; (c) producer-only benchmark, 40-core machine; (d) producer-only benchmark, 64-core machine; (e) consumer-only benchmark, 40-core machine; (f) consumer-only benchmark, 64-core machine.]

push operations of the TS-atomic stack and the TS-stutter stack, which means that the delay in the TS-interval timestamping is actually shorter than the execution time of the TS-atomic timestamping and the TS-stutter timestamping. Perhaps surprisingly, TS-stutter, which does not require strong synchronisation, is slower than TS-atomic, which is based on an atomic fetch-and-increment instruction.

Pop performance. We measure the performance of pop operations of all data-structures in a consumer-only benchmark where each thread pops 1,000,000 elements from a pre-filled stack. Note that no elimination is possible in this benchmark. The stack is pre-filled concurrently, which means in case of the TS-interval stack and TS-stutter stack that some elements may have unordered timestamps. Again the TS-interval stack uses the same delay as in the high-contention producer-consumer benchmark.

Figure 5e and Figure 5f show the performance and scalability of the data-structures in the high-contention consumer-only benchmark. The performance of the TS-interval stack is significantly higher than the performance of the other stack implementations, except for low numbers of threads. The performance of TS-CAS is close to the performance of TS-interval. The TS-stutter stack is faster than the TS-atomic and TS-hardware stack due to the fact that some elements share timestamps and therefore can be removed in parallel. The TS-atomic stack and TS-hardware stack show the same performance because all elements have unique timestamps and therefore have to be removed sequentially. Also in the Treiber stack and the EB stack elements have to be removed sequentially. Depending on the machine, removing elements sequentially from a single list (Treiber stack) is sometimes less expensive than, and sometimes as expensive as, removing elements sequentially from multiple lists (TS stack).

Page 12

Elimination Through Shared Timestamps

[Figure 6: High-contention producer-consumer benchmark using TS-interval and TS-CAS timestamping with increasing delay (in ns) on the 40-core machine, exercising 40 producers and 40 consumers. Left y-axis: operations per ms (more is better); right y-axis: number of retries (less is better). Curves: performance and retries of the TS-interval stack and the TS-CAS stack.]

6.2 Analysis of Interval Timestamping

Figure 6 shows the performance of the TS-interval stack and the TS-CAS stack along with the average number of tryRem calls needed in each pop (one call is optimal, but contention may cause retries). These figures were collected with an increasing interval length in the high-contention producer-consumer benchmark on the 40-core machine. We used these results to determine the delays for the benchmarks in Section 6.1.

Initially the performance of the TS-interval stack increases with an increasing delay time, but beyond 7.5 µs the performance decreases again. After that point an average push operation is slower than an average pop operation and the number of pop operations which return EMPTY increases.

For the TS-interval stack the high performance correlates strongly with a drop in tryRem retries. We conclude from this that the impressive performance we achieve with interval timestamping arises from reduced contention in tryRem. For the optimal delay time we have 1.009 calls to tryRem per pop, i.e. less than 1% of pop calls need to scan the SP pools array more than once. In contrast, without a delay the average number of retries per pop call is more than 6.

The performance of the TS-CAS stack increases initially with an increasing delay time. However this does not decrease the number of tryRem retries significantly. The reason is that without a delay there is more contention on the global counter. Therefore the performance of TS-CAS with a delay is actually better than the performance without a delay. However, similar to TS-interval timestamping, with a delay time beyond 6 µs the performance decreases again. This is the point where an average push operation becomes slower than an average pop operation.
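To make the delay tuning above concrete, here is a minimal C sketch of interval timestamping, under our own names (interval_t, read_tsc, interval_timestamp are illustrative, not the paper's identifiers): an element's timestamp is the pair of hardware timestamps taken before and after a tunable busy-wait, and two elements are ordered only if their intervals do not overlap.

/* illustrative sketch of TS-interval timestamping, not the paper's code */
#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp */

typedef struct { uint64_t start; uint64_t end; } interval_t;

static inline uint64_t read_tsc(void) {
  unsigned int aux;
  return __rdtscp(&aux);           /* serialising timestamp counter read */
}

/* delay is in TSC ticks; the paper tunes the delay experimentally (Figure 6) */
static interval_t interval_timestamp(uint64_t delay) {
  interval_t ts;
  ts.start = read_tsc();
  while (read_tsc() - ts.start < delay)
    ;                              /* busy-wait to widen the interval */
  ts.end = read_tsc();
  return ts;
}

/* a is ordered before b only if a's interval ends before b's starts;
   overlapping intervals are unordered and can be removed in parallel */
static int interval_earlier(interval_t a, interval_t b) {
  return a.end < b.start;
}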

7. TS Queue and TS Deque Variants

In this paper, we have focussed on the stack variant of our algorithm. However, stored timestamps can be removed in any order, meaning it is simple to change our TS stack into a queue / deque. Doing this requires three main changes (change 1 is sketched below):

1. Change the timestamp comparison operator in tryRem.

2. Change the SP pool such that getYoungest returns the oldest / right-most / left-most element.

3. For the TS queue, remove elimination in tryRem. For the TS deque, enable it only for stack-like removal.
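As a hedged illustration of change 1 (names are ours, not the paper's API, and scalar timestamps stand in for the TS-interval comparison): the stack's tryRem keeps the candidate with the latest timestamp, while the queue variant keeps the earliest one; the rest of the removal loop stays the same.

typedef unsigned long long ts_t;

/* TS stack: tryRem removes one of the youngest elements,
   so a candidate wins if its timestamp is later */
int prefer_for_stack(ts_t candidate, ts_t best) {
  return candidate > best;
}

/* TS queue: tryRem removes one of the oldest elements instead */
int prefer_for_queue(ts_t candidate, ts_t best) {
  return candidate < best;
}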

The TS queue is the second fastest queue we know of. In our experiments the TS-interval queue outperforms the Michael-Scott queue [18] and the flat-combining queue [10] but the lack of elimination means it is not as fast as the LCRQ [1].

The TS-interval deque is the fastest deque we know of, although it is slower than the corresponding stack / queue. However, it still outperforms the Michael-Scott and flat-combining queues, and the Treiber and EB stacks.

8. Related Work

Timestamping. Our approach was initially inspired by Attiya et al.'s Laws of Order paper [2], which proves that any linearizable stack, queue, or deque necessarily uses the RAW or AWAR patterns in its remove operation. While attempting to extend this result to insert operations, we were surprised to discover a counter-example: the TS stack. We believe the Basket Queue [15] was the first algorithm to exploit the fact that enqueues need not take effect in order of their atomic operations, although unlike the TS stack it does not avoid strong synchronisation when inserting.

Gorelik and Hendler use timestamping in their AFC queue [6]. As in our stack, enqueued elements are timestamped and stored in single-producer buffers. Aside from the obvious difference in kind, our TS stack differs in several respects. The AFC dequeue uses flat-combining-style consolidation – that is, a combiner thread merges timestamps into a total order. As a result, the AFC queue is blocking. The TS stack avoids enforcing an internal total order, and instead allows non-blocking parallel removal. Removal in the AFC queue depends on the expensive consolidation process, and as a result their producer-consumer benchmark shows remove performance significantly worse than other flat-combining queues. Interval timestamping lets the TS stack trade insertion and removal cost, avoiding this problem. Timestamps in the AFC queue are Lamport clocks [17], not hardware-generated intervals. (We also experiment with Lamport clocks – see TS-stutter in §3.2). Finally, AFC queue elements are timestamped before being inserted – in the TS stack, this is reversed. This seemingly trivial difference enables timestamp-based elimination, which is important to the TS stack's performance.

The LCRQ queue [1] and the SP queue [11] both index elements using an atomic counter. However, dequeue operations do not look for one of the youngest elements as in our TS stack, but rather for the element with the enqueue index that matches the dequeue index exactly. Both approaches fall back to a slow path when the dequeue counter becomes higher than the enqueue counter. In contrast to indices, timestamps in the TS stack need not be unique or even ordered, and the performance of the TS stack does not depend on a fast path and a slow path, but only on the number of elements which share the same timestamp.

Our use of the x86 RDTSCP instruction to generate hardware timestamps is inspired by work on testing FIFO queues [7]. There the RDTSC instruction is used to determine the order of operation calls. (Note the distinction between the synchronised RDTSCP and unsynchronised RDTSC). RDTSCP has since been used in the design of an STM by Ruan et al. [20], who investigate the instruction's multi-processor synchronisation behaviour.

Correctness. Our stack theorem lets us prove that the TS stack is linearizable with respect to sequential stack semantics. This theorem builds on Henzinger et al. who have a similar theorem for queues [12]. Their theorem is …

Page 13

The order in which concurrently pushed elements appear on the TS stack is only determined when they are popped off the TS stack, even if they had been on the TS stack for, say, a year.

–Showing that the TS Stack is a stack is hard, hence POPL

Page 14

Concurrent Data Structures: scal.cs.uni-salzburg.at [POPL13, CF13, POPL15, NETYS15]

Concurrent Data Structure

[Diagram: Cores 1-4 sharing a concurrent data structure.]

Page 15

Scal

Name | Semantics | Year | Ref
Lock-based Singly-linked List Queue | strict queue | 1968 | [1]
Michael Scott (MS) Queue | strict queue | 1996 | [2]
Flat Combining Queue | strict queue | 2010 | [3]
Wait-free Queue | strict queue | 2012 | [4]
Linked Cyclic Ring Queue (LCRQ) | strict queue | 2013 | [5]
Timestamped (TS) Queue | strict queue | 2015 | [6]
Cooperative TS Queue | strict queue | 2015 | [7]
Segment Queue | k-relaxed queue | 2010 | [8]
Random Dequeue (RD) Queue | k-relaxed queue | 2010 | [8]
Bounded Size k-FIFO Queue | k-relaxed queue, pool | 2013 | [9]
Unbounded Size k-FIFO Queue | k-relaxed queue, pool | 2013 | [9]
b-RR Distributed Queue (DQ) | k-relaxed queue, pool | 2013 | [10]
Least-Recently-Used (LRU) DQ | k-relaxed queue, pool | 2013 | [10]
Locally Linearizable DQ (static, dynamic) | locally linearizable queue, pool | 2015 | [11]
Locally Linearizable k-FIFO Queue | locally linearizable queue | 2015 | [11]
Relaxed TS Queue | quiescently consistent queue (conjectured) | 2015 | [7]
Lock-based Singly-linked List Stack | strict stack | 1968 | [1]
Treiber Stack | strict stack | 1986 | [12]
Elimination-backoff Stack | strict stack | 2004 | [13]
Timestamped (TS) Stack | strict stack | 2015 | [6]
k-Stack | k-relaxed stack | 2013 | [14]
b-RR Distributed Stack (DS) | k-relaxed stack, pool | 2013 | [10]
Least-Recently-Used (LRU) DS | k-relaxed stack, pool | 2013 | [10]
Locally Linearizable DS (static, dynamic) | locally linearizable stack, pool | 2015 | [11]
Locally Linearizable k-Stack | locally linearizable stack | 2015 | [11]
Timestamped (TS) Deque | strict deque (conjectured) | 2015 | [7]
d-RA DQ and DS | strict pool | 2013 | [10]

Scal: A Benchmarking Suite for Concurrent Data Structures [NETYS15]

Page 16

How do we allocate and deallocate shared memory with increasingly many cores such that performance increases with the number of cores while memory consumption stays low?

–Multicore Shared Memory Allocation Problem

Page 17

Multicore Memory Allocation Problem

[Chart: throughput (y-axis) vs. number of cores (x-axis), contrasting linear scalability, positive scalability with high performance, positive scalability with low performance, and negative scalability.]

Page 18

Scalloc: Concurrent Memory Allocator scalloc.cs.uni-salzburg.at [OOPSLA15]

Concurrent Free-List (Distributed Stack [CF13])

[Diagram: Cores 1-4 sharing a concurrent free-list.]

Page 19
Page 20

Local Allocation & Deallocation

[Figures 6, 7, and 8 from the scalloc paper (OOPSLA15), comparing Hoard, jemalloc, llalloc, ptmalloc2, Streamflow, SuperMalloc, TBB, TCMalloc, compact, and scalloc. Figure 6, thread-local workload, Threadtest benchmark: (a) speedup with respect to ptmalloc2 (more is better) and (b) memory consumption in MB (less is better), both vs. number of threads. Figure 7, thread-local workload, Shbench benchmark: (a) speedup with respect to ptmalloc2 and (b) memory consumption in GB, both vs. number of threads. Figure 8, thread-local workload including thread termination, Larson benchmark: (a) throughput in million operations per sec (more is better) and (b) memory consumption in MB (less is better), both vs. number of threads.]

Page 21
Page 22

“Remote” Deallocation

[Figures 9 and 10 from the scalloc paper (OOPSLA15), comparing Hoard, jemalloc, llalloc, ptmalloc2, Streamflow, SuperMalloc, TBB, TCMalloc, compact, and scalloc. Figure 9, temporal and spatial performance for the producer-consumer experiment: (a) total per-thread allocator time in seconds (less is better) and (b) average per-thread memory consumption in MB (less is better), both vs. number of threads. Figure 10, temporal and spatial performance for the object-size robustness experiment at 40 threads: (a) total allocator time in seconds and (b) average memory consumption in MB (logscale, less is better), both vs. object size range in bytes (logscale).]

two threads causes on average 50% remote frees and running 40 threads causes on average 97.5% remote frees.

Figure 9a presents the total time each thread spends in the allocator for an increasing number of producers/consumers. Up to 30 threads scalloc and Streamflow provide the best temporal performance and for more than 30 threads scalloc outperforms all other allocators.

The average per-thread memory consumption illustrated in Figure 9b suggests that all allocators deal with blowup fragmentation, i.e., we do not observe unbounded growth in memory consumption. However, the absolute differences among different allocators are significant. Scalloc provides competitive spatial performance where only jemalloc and ptmalloc2 require less memory at the expense of higher total per-thread allocator time.

This experiment demonstrates that the approach of scalloc to distributing contention across spans with one remote free list per span works well in a producer-consumer workload and that using a lock-based implementation for reusing spans is not a performance bottleneck.
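A minimal sketch of the "one remote free list per span" idea mentioned above (illustrative only: the struct and function names are ours, not scalloc's, and scalloc's real frontend is more involved): the owning thread frees onto a private list without atomics, while any other thread pushes onto the span's lock-free remote list, so contention is spread across many spans instead of a single global structure.

#include <stdatomic.h>

typedef struct block { struct block *next; } block_t;

typedef struct span {
  block_t *local_free;                 /* only touched by the owning thread */
  _Atomic(block_t *) remote_free;      /* Treiber-style push by other threads */
} span_t;

/* owner fast path: no atomic operations needed */
void free_local(span_t *s, block_t *b) {
  b->next = s->local_free;
  s->local_free = b;
}

/* remote free: lock-free push onto the span's remote list */
void free_remote(span_t *s, block_t *b) {
  block_t *head = atomic_load_explicit(&s->remote_free, memory_order_relaxed);
  do {
    b->next = head;
  } while (!atomic_compare_exchange_weak_explicit(
      &s->remote_free, &head, b,
      memory_order_release, memory_order_relaxed));
}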

7.4 Robustness against False Sharing

False sharing occurs when objects that are allocated in the same cache line are read from and written to by different threads. In cache coherent systems this scenario can lead to performance degradation as all caches need to be kept consistent. An allocator is prone to active false sharing [3] if objects that are allocated by different threads (without communication) end up in the same cache line. It is prone to passive false sharing [3] if objects that are remotely


Page 23

Scalloc: Concurrent Memory Allocator scalloc.cs.uni-salzburg.at [OOPSLA15]

Eager Reuse of Deallocated Memory

[Diagram: Cores 1-4 sharing deallocated memory for eager reuse.]

Page 24

Virtual Spans: 64-bit Address Space

Scalloc: From Scalable Concurrent Data Structures to a Fast, Scalable, Low-Memory-Overhead Allocator

3. The Allocator in Detail

Like other allocators (e.g. [17] and [14]), scalloc can be divided into two main parts:

(1) a mutator-facing frontend that manages memory in so-called spans, and

(2) a backend for managing the spans (ideally returning them to the operating system when empty).

Scalloc maintains scalability with respect to performance and memory consumption by:

• introducing virtual spans that enable unified treatment of variable-size objects;

• providing a scalable backend for managing spans;

• providing a frontend with constant-time malloc and free calls that only consider live heap (no garbage collection cycles).

The following subsections describe these crucial concepts of scalloc.

3.1 Real Spans and Size Classes

A (real) span is a contiguous portion of memory partitioned into blocks of the same size. The size of blocks in a span determines which size class the span belongs to. All spans in a given size class have the same number of blocks. Hence, the size of a span is fully determined by its size class: it


[Figure 1: Structure of arena, virtual spans, and real spans. The arena consists of reserved addresses partitioned into virtual spans; each virtual span contains a real span (a header followed by the block payload), with the rest of the virtual span remaining unmapped memory.]

is the product of the block size and the number of blocks, plus a span header containing administrative information. In scalloc, there are 29 size classes but only 9 distinct real-span sizes which are all multiples of 4KB (the size of a system page).

The first 16 size classes, with block sizes ranging from 16 bytes to 256 bytes in increments of 16 bytes, are taken from TCMalloc [6]. This design of small size-classes limits block internal fragmentation. All these 16 size classes have the same real-span size. Size classes with larger blocks range from 512 bytes to 1MB, in increments that are powers of two. These size classes may have different real-span sizes, explaining the difference between 29 size classes and 9 distinct real-span sizes.

Objects of size larger than any size class are not managed by spans, but rather allocated directly from the operating system using mmap.
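A hedged sketch of the size-class mapping just described (the constants follow the text above, but scalloc's exact rounding and its full set of 29 classes are not reproduced; this only shows the shape of the mapping):

#include <stddef.h>

#define SMALL_CLASSES  16              /* 16B .. 256B in 16B steps        */
#define MAX_SMALL      256
#define MAX_CLASS_SIZE (1 << 20)       /* 1MB; larger requests use mmap   */

/* returns an illustrative class index, or -1 for "allocate with mmap" */
int size_class(size_t size) {
  if (size == 0)
    size = 1;
  if (size <= MAX_SMALL)
    return (int)((size + 15) / 16) - 1;      /* classes 0..15 */
  if (size > MAX_CLASS_SIZE)
    return -1;
  int cls = SMALL_CLASSES;
  size_t block = 512;
  while (block < size) {                     /* 512B, 1KB, 2KB, ..., 1MB */
    block = block * 2;
    cls = cls + 1;
  }
  return cls;
}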

3.2 Virtual Spans

A virtual span is a span allocated in a very large portion of virtual memory (32TB) which we call arena. All virtual spans have the same fixed size of 2MB and are 2MB-aligned in the arena. Each virtual span contains a real span, of one of the available size classes. By the size class of the virtual span we mean the size class of the contained real span. Typically, the real span is (much) smaller than the virtual span that contains it. The maximal real-span size is limited by the size of the virtual span. This is why virtual spans are suitable for big objects as well as for small ones. The structure of the
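One practical consequence of making every virtual span 2MB in size and 2MB-aligned is that the span owning any block can be recovered by address arithmetic alone. The sketch below is an assumption-laden illustration, not scalloc's code (it assumes the span header sits at the start of the virtual span, as in Figure 1):

#include <stdint.h>

#define VIRTUAL_SPAN_SIZE ((uintptr_t)2 << 20)          /* 2MB, 2MB-aligned */
#define VIRTUAL_SPAN_MASK (~(VIRTUAL_SPAN_SIZE - 1))

typedef struct span_header span_header_t;  /* administrative info at the span's start */

/* masking a block's address yields the enclosing virtual span,
   so no per-object header or table lookup is needed */
span_header_t *span_of(void *block) {
  return (span_header_t *)((uintptr_t)block & VIRTUAL_SPAN_MASK);
}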


Page 25

Object Size

[The same paper page as on Page 22, repeated here for the object-size discussion: Figure 9 (producer-consumer experiment) and Figure 10 (object-size robustness experiment at 40 threads; total allocator time and average memory consumption vs. object size range in bytes).]


Page 26

ACDC: Explorative Benchmarking of Memory Management acdc.cs.uni-salzburg.at [ISMM13, DLS14]

✤ configurable multicore-scalable mutator for mimicking virtually any allocation, deallocation, sharing, and access behavior

✤ written in C, tracks with minimal overhead:

1. memory allocation time

2. memory deallocation time

3. memory consumption

4. memory access time
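As a rough picture of what tracking allocation time with minimal overhead amounts to, here is an illustrative C sketch (not ACDC's code; ACDC's instrumentation, configuration, and reporting are more elaborate): wrap the allocator call with a monotonic clock and accumulate the elapsed time per thread.

#include <stdlib.h>
#include <time.h>

static _Thread_local long long alloc_ns;   /* per-thread allocation time */

void *timed_malloc(size_t size) {
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  void *p = malloc(size);
  clock_gettime(CLOCK_MONOTONIC, &t1);
  alloc_ns += (t1.tv_sec - t0.tv_sec) * 1000000000LL
            + (t1.tv_nsec - t0.tv_nsec);
  return p;
}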

Page 27

Memory Access

deallocated by a thread are immediately usable for allocation by this thread again.

We have conducted the false sharing avoidance evaluation benchmark from Berger et al. [3] (including active-false and passive-false benchmarks) to validate scalloc's design. The results we have obtained suggest that most allocators avoid active and passive false sharing. However, SuperMalloc and TCMalloc suffer from both active and passive false sharing, whereas Hoard is prone to passive false sharing only. We omit the graphs because they only show binary results (either false sharing occurs or not). Scalloc's design ensures that in the cases covered by the active-false and passive-false benchmarks no false sharing appears, as spans need to be freed to be reused by other threads for allocation. Only in case of thread termination (not covered by the active-false and passive-false benchmarks) threads may adopt spans in which other threads still have blocks, potentially causing false sharing. We have not encountered false sharing with scalloc in any of our experiments.
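For intuition, the active-false pattern referred to above looks roughly like the following hedged sketch (not Berger et al.'s benchmark code): each thread allocates a small private object without communicating and hammers it with writes; an allocator prone to active false sharing would co-locate those objects in one cache line, and the threads would then slow each other down through cache-coherence traffic.

#include <pthread.h>
#include <stdlib.h>

#define WRITES 100000000L

static void *worker(void *arg) {
  (void)arg;
  char *obj = malloc(8);          /* much smaller than a cache line */
  for (long i = 0; i < WRITES; i++)
    obj[0]++;                     /* repeatedly write the private object */
  return obj;                     /* keep the object live */
}

int main(void) {
  pthread_t t[4];
  for (int i = 0; i < 4; i++)
    pthread_create(&t[i], NULL, worker, NULL);
  for (int i = 0; i < 4; i++)
    pthread_join(t[i], NULL);
  return 0;
}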

7.5 Robustness for Varying Object Sizes

We configure the ACDC allocator benchmark [2] to allocate, access, and deallocate increasingly larger thread-local objects in 40 threads (number of native cores) to study the scalability of virtual spans and the span pool.

Figure 10a shows the total time spent in the allocator, i.e., the time spent in malloc and free. The x-axis refers to intervals [2^x, 2^(x+2)) of object sizes in bytes with 4 ≤ x ≤ 20 at increments of two. For each object size interval ACDC allocates 2^x KB of new objects, accesses the objects, and then deallocates previously allocated objects. This cycle is repeated 30 times. For object sizes smaller than 1MB scalloc outperforms all other allocators because virtual spans enable scalloc to rely on efficient size-class allocation. The only possible bottleneck in this case is accessing the span-pool. However, even in the presence of 40 threads we do not observe contention on the span-pool. For objects larger than 1MB scalloc relies on mmap which adds system call latency to allocation and deallocation operations and is also known to be a scalability bottleneck [6].

The average memory consumption (illustrated in Figure 10b) of scalloc allocating small objects is higher (yet still competitive) because the real-spans for size-classes smaller than 32KB have the same size and madvise is not enabled for them. For larger object sizes scalloc causes the smallest memory overhead, comparable to jemalloc and ptmalloc2.

This experiment demonstrates the advantages of trading virtual address space fragmentation for high throughput and low physical memory fragmentation.

7.6 Spatial Locality

In order to expose differences in spatial locality, we configure ACDC to access allocated objects (between 16 and 32 bytes) increasingly in allocation order (rather than out-of-allocation order). For this purpose, ACDC organizes allocated objects

[Figure 11: Memory access time for the locality experiment. Y-axis: total memory access time in seconds (less is better); x-axis: percentage of object accesses in allocation order.]

either in trees (in depth-first, left-to-right order, representing out-of-allocation order) or in lists (representing allocation order). ACDC then accesses the objects from the tree in depth-first, right-to-left order and from the list in FIFO order. We measure the total memory access time for an increasing ratio of lists, starting at 0% (only trees), going up to 100% (only lists), as an indicator of spatial locality. ACDC provides a simple mutator-aware allocator called compact to serve as an optimal (yet, without further knowledge of mutator behavior, unreachable) baseline. Compact stores the lists and trees of allocated objects without space overhead in contiguous memory for optimal locality.

Figure 11 shows the total memory access time for an increasing ratio of object accesses in allocation order. Only jemalloc and llalloc provide a memory layout that can be accessed slightly faster than the memory layout provided by scalloc. Note that scalloc does not require object headers and reinitializes span free-lists upon retrieval from the span-pool. For a larger ratio of object accesses in allocation order, the other allocators improve as well but not as much as llalloc, scalloc, Streamflow, and TBB which approach the memory access performance of the compact baseline allocator. Note also that we can improve memory access time with scalloc even more by setting its reusability threshold to 100%. In this case spans are only reused once they get completely empty and reinitialized through the span-pool at the expense of higher memory consumption. We omit this data for consistency reasons.

To explain the differences in memory access time we pick the data points for ptmalloc2 and scalloc at x=60% where the difference in memory access time is most significant and compare the number of all hardware events obtainable using perf (see https://perf.wiki.kernel.org). While most numbers are similar we identify two events where the numbers differ significantly. First, the L1 cache miss rate with ptmalloc2 is 20.8% while scalloc causes a


Page 28

How do we teach computer science to students not necessarily majoring in computer science but who nevertheless code every day?

–The Computer Science Education Challenge

Page 29
Page 30

Selfie: Teaching Computer Science [selfie.cs.uni-salzburg.at]

✤ Selfie is a self-referential 7k-line C implementation (in a single file) of:

1. a self-compiling compiler called starc that compiles a tiny subset of C called C Star (C*) to a tiny subset of MIPS32 called MIPSter,

2. a self-executing emulator called mipster that executes MIPSter code including itself when compiled with starc,

3. a self-hosting hypervisor called hypster that virtualizes mipster and can host all of selfie including itself,

4. a tiny C* library called libcstar utilized by all of selfie, and

5. a tiny, experimental SAT solver called babysat.

Page 31

int atoi(int *s) {
  int i;
  int n;
  int c;

  i = 0;
  n = 0;
  c = *(s+i);

  while (c != 0) {
    n = n * 10 + c - '0';

    if (n < 0)
      return -1;

    i = i + 1;
    c = *(s+i);
  }

  return n;
}

5 statements: assignment, while, if, return, procedure()

no data types other than int and int* and dereferencing: the * operator

integer arithmetics, pointer arithmetics

no bitwise operators, no Boolean operators

character literals, string literals

library: exit, malloc, open, read, write
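For a sense of what fits in this subset, here is a small made-up C* program (ours, not from selfie) that uses only the listed statements, the int and int* types, and the malloc library procedure:

// hypothetical C* example: sum the first n integers stored in a buffer
int* buffer;

int sum(int* a, int n) {
  int i;
  int s;

  i = 0;
  s = 0;

  while (i < n) {
    s = s + *(a + i);
    i = i + 1;
  }

  return s;
}

int main() {
  buffer = malloc(4 * 10);    // room for ten 32-bit integers

  *buffer = 40;
  *(buffer + 1) = 2;

  return sum(buffer, 2);      // exit code 42
}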

Page 32

> make
cc -w -m32 -D'main(a,b)=main(a,char**argv)' selfie.c -o selfie

bootstrapping selfie.c into x86 selfie executable using standard C compiler

(now also available for RISC-V machines)

Page 33

> ./selfie
./selfie: usage: selfie { -c { source } | -o binary | -s assembly | -l binary } [ ( -m | -d | -y | -min | -mob ) size ... ]

selfie usage

Page 34

> ./selfie -c selfie.c

./selfie: this is selfie's starc compiling selfie.c

./selfie: 176408 characters read in 7083 lines and 969 comments

./selfie: with 97779(55.55%) characters in 28914 actual symbols

./selfie: 261 global variables, 289 procedures, 450 string literals

./selfie: 1958 calls, 723 assignments, 57 while, 572 if, 243 return

./selfie: 121660 bytes generated with 28779 instructions and 6544 bytes of data

compiling selfie.c with x86 selfie executable

(takes seconds)

Page 35

> ./selfie -c selfie.c -m 2 -c selfie.c

./selfie: this is selfie's starc compiling selfie.c

./selfie: this is selfie's mipster executing selfie.c with 2MB of physical memory

selfie.c: this is selfie's starc compiling selfie.c

selfie.c: exiting with exit code 0 and 1.05MB of mallocated memory

./selfie: this is selfie's mipster terminating selfie.c with exit code 0 and 1.16MB of mapped memory

compiling selfie.c with x86 selfie executable into a MIPSter executable and

then running that MIPSter executable to compile selfie.c again (takes ~6 minutes)

Page 36

> ./selfie -c selfie.c -o selfie1.m -m 2 -c selfie.c -o selfie2.m

./selfie: this is selfie's starc compiling selfie.c

./selfie: 121660 bytes with 28779 instructions and 6544 bytes of data written into selfie1.m

./selfie: this is selfie's mipster executing selfie1.m with 2MB of physical memory

selfie1.m: this is selfie's starc compiling selfie.c

selfie1.m: 121660 bytes with 28779 instructions and 6544 bytes of data written into selfie2.m

selfie1.m: exiting with exit code 0 and 1.05MB of mallocated memory

./selfie: this is selfie's mipster terminating selfie1.m with exit code 0 and 1.16MB of mapped memory

compiling selfie.c into a MIPSter executable selfie1.m and

then running selfie1.m to compile selfie.c into another MIPSter executable selfie2.m

(takes ~6 minutes)

Page 37

Sandboxed Concurrency: 1-Week Homework Assignment

[Diagram: boxes labeled Compiler, Emulator, Formalism, and Machine, repeated three times; the second stack adds an Emulator running on an Emulator, and the third runs two Emulators side by side (Emulator || Emulator) inside an Emulator.]

Page 38

> ./selfie -c selfie.c -m 2 -c selfie.c -m 2 -c selfie.c

compiling selfie.c with x86 selfie executable and

then running that executable to compile selfie.c again and

then running that executable to compile selfie.c again (takes ~24 hours)

Page 39

Emulation versus Virtualization

[Diagram: boxes labeled Compiler, Emulator, Formalism, and Machine, repeated three times; the second stack adds an Emulator hosted on an Emulator, the third an Emulator hosted on a Hypervisor.]

Page 40

> ./selfie -c selfie.c -m 2 -c selfie.c -y 2 -c selfie.c

compiling selfie.c with x86 selfie executable and

then running that executable to compile selfie.c again and

then hosting that executable in a virtual machine to compile selfie.c again (takes ~12 minutes)

Page 41

“How do we introduce self-model-checking and maybe even self-verification into Selfie?”

https://github.com/cksystemsteaching/selfie/tree/vipster

Page 42
Page 43

Thank you!