TRANSCRIPT
University of Cambridge, September 2017
Scal, Scalloc, and Selfie
Christoph Kirsch, University of Salzburg, Austria

scalloc.cs.uni-salzburg.at: multicore-scalable concurrent allocator
scal.cs.uni-salzburg.at: multicore-scalable concurrent data structures
selfie.cs.uni-salzburg.at: self-referential systems software for teaching
Joint Work
✤ Martin Aigner
✤ Christian Barthel
✤ Mike Dodds
✤ Andreas Haas
✤ Thomas Henzinger
✤ Andreas Holzer
✤ Thomas Hütter
✤ Michael Lippautz
✤ Alexander Miller
✤ Simone Oblasser
✤ Hannes Payer
✤ Mario Preishuber
✤ Ana Sokolova
✤ Ali Sezgin
How do we exchange data among increasingly many cores on a shared memory machine such that performance still
increases with the number of cores?
–The Multicore Scalability Challenge
Concurrent Shared Memory
Access
Core 1 Core 2
Core 3 Core 4
~100 cores
How do we allocate and deallocate shared memory with increasingly many cores such that performance increases with the number of cores while memory consumption stays low?
–Multicore Shared Memory Allocation Problem
Concurrent Shared Memory
Allocation
Core 1 Core 2
Core 3 Core 4
~1TB memory
~100 cores
How do we teach computer science to students who are not necessarily majoring in computer science but who nevertheless code every day?
–The Computer Science Education Challenge
The Multicore Scalability Challenge
[Chart: throughput vs. number of cores, contrasting linear scalability, positive scalability with high performance, positive scalability with low performance, and negative scalability.]
Timestamped (TS) Stack [POPL15]
Concurrent Stack
Core 1 Core 2
Core 3 Core 4
Compared implementations: Treiber Stack, EB Stack, TS-atomic Stack, TS-CAS Stack, TS-hardware Stack, TS-interval Stack, TS-stutter Stack.
[Figure 5: TS stack performance in the high-contention scenario on the 40-core machine (left) and the 64-core machine (right). Panels: (a) producer-consumer benchmark, 40-core machine; (b) producer-consumer benchmark, 64-core machine; (c) producer-only benchmark, 40-core machine; (d) producer-only benchmark, 64-core machine; (e) consumer-only benchmark, 40-core machine; (f) consumer-only benchmark, 64-core machine. y-axis: operations per ms (more is better); x-axis: number of threads.]
push operations of the TS-atomic stack and the TS-stutter stack, which means that the delay in the TS-interval timestamping is actually shorter than the execution time of the TS-atomic timestamping and the TS-stutter timestamping. Perhaps surprisingly, TS-stutter, which does not require strong synchronisation, is slower than TS-atomic, which is based on an atomic fetch-and-increment instruction.
Pop performance. We measure the performance of pop operations of all data structures in a consumer-only benchmark where each thread pops 1,000,000 elements from a pre-filled stack. Note that no elimination is possible in this benchmark. The stack is pre-filled concurrently, which means in the case of the TS-interval stack and the TS-stutter stack that some elements may have unordered timestamps. Again the TS-interval stack uses the same delay as in the high-contention producer-consumer benchmark.
Figure 5e and Figure 5f show the performance and scalability of the data structures in the high-contention consumer-only benchmark. The performance of the TS-interval stack is significantly higher than the performance of the other stack implementations, except for low numbers of threads. The performance of TS-CAS is close to the performance of TS-interval. The TS-stutter stack is faster than the TS-atomic and TS-hardware stack due to the fact that some elements share timestamps and therefore can be removed in parallel. The TS-atomic stack and TS-hardware stack show the same performance because all elements have unique timestamps and therefore have to be removed sequentially. Also in the Treiber stack and the EB stack elements have to be removed sequentially. Depending on the machine, removing elements sequentially from a single list (Treiber stack) is sometimes less and sometimes as expensive as removing elements sequentially from multiple lists (TS stack).
Elimination Through Shared Timestamps
[Figure 6: High-contention producer-consumer benchmark using TS-interval and TS-CAS timestamping with increasing delay on the 40-core machine, exercising 40 producers and 40 consumers. Left y-axis: operations per ms (more is better); right y-axis: number of retries (less is better); x-axis: delay in ns, 0 to 15000. Curves: performance and retries for the TS-interval stack and the TS-CAS stack.]
6.2 Analysis of Interval Timestamping
Figure 6 shows the performance of the TS-interval stack and the TS-CAS stack along with the average number of tryRem calls needed in each pop (one call is optimal, but contention may cause retries). These figures were collected with an increasing interval length in the high-contention producer-consumer benchmark on the 40-core machine. We used these results to determine the delays for the benchmarks in Section 6.1.

Initially the performance of the TS-interval stack increases with an increasing delay time, but beyond 7.5 µs the performance decreases again. After that point an average push operation is slower than an average pop operation and the number of pop operations which return EMPTY increases.

For the TS-interval stack the high performance correlates strongly with a drop in tryRem retries. We conclude from this that the impressive performance we achieve with interval timestamping arises from reduced contention in tryRem. For the optimal delay time we have 1.009 calls to tryRem per pop, i.e. less than 1% of pop calls need to scan the SP pools array more than once. In contrast, without a delay the average number of retries per pop call is more than 6.

The performance of the TS-CAS stack increases initially with an increasing delay time. However this does not decrease the number of tryRem retries significantly. The reason is that without a delay there is more contention on the global counter. Therefore the performance of TS-CAS with a delay is actually better than the performance without a delay. However, similar to TS-interval timestamping, with a delay time beyond 6 µs the performance decreases again. This is the point where an average push operation becomes slower than an average pop operation.
7. TS Queue and TS Deque Variants
In this paper, we have focussed on the stack variant of our algorithm. However, stored timestamps can be removed in any order, meaning it is simple to change our TS stack into a queue / deque. Doing this requires three main changes:

1. Change the timestamp comparison operator in tryRem.
2. Change the SP pool such that getYoungest returns the oldest / right-most / left-most element.
3. For the TS queue, remove elimination in tryRem. For the TS deque, enable it only for stack-like removal.
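The first change above, the comparison operator, is where stack and queue behaviour diverge. A minimal sketch (identifiers are mine, not taken from the paper's code) of how flipping the comparison turns youngest-first selection into oldest-first selection:

```c
#include <assert.h>
#include <stdint.h>

// Sketch: each element carries a timestamp; removal scans a pool
// for a candidate. For the TS stack, tryRem looks for a youngest
// element (largest timestamp); for the TS queue it looks for an
// oldest element (smallest timestamp).
typedef struct elem { uint64_t ts; int value; struct elem *next; } elem;

// Stack-like comparison: candidate wins if it is younger.
static int better_for_stack(const elem *cand, const elem *best) {
  return best == 0 || cand->ts > best->ts;
}

// Queue-like comparison: candidate wins if it is older.
static int better_for_queue(const elem *cand, const elem *best) {
  return best == 0 || cand->ts < best->ts;
}

// Generic selection over a pool, parameterised by the comparison --
// the single point that changes between the TS stack and TS queue.
static elem *select_candidate(elem *pool,
                              int (*better)(const elem *, const elem *)) {
  elem *best = 0;
  for (elem *e = pool; e != 0; e = e->next)
    if (better(e, best))
      best = e;
  return best;
}
```

The real algorithm additionally has to remove the chosen element atomically and retry on contention; the sketch only shows the selection policy.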
The TS queue is the second fastest queue we know of. In our experiments the TS-interval queue outperforms the Michael-Scott queue [18] and the flat-combining queue [10] but the lack of elimination means it is not as fast as the LCRQ [1].

The TS-interval deque is the fastest deque we know of, although it is slower than the corresponding stack / queue. However, it still outperforms the Michael-Scott and flat-combining queues, and the Treiber and EB stacks.
8. Related Work
Timestamping. Our approach was initially inspired by Attiya et al.'s Laws of Order paper [2], which proves that any linearizable stack, queue, or deque necessarily uses the RAW or AWAR patterns in its remove operation. While attempting to extend this result to insert operations, we were surprised to discover a counter-example: the TS stack. We believe the Basket Queue [15] was the first algorithm to exploit the fact that enqueues need not take effect in order of their atomic operations, although unlike the TS stack it does not avoid strong synchronisation when inserting.

Gorelik and Hendler use timestamping in their AFC queue [6]. As in our stack, enqueued elements are timestamped and stored in single-producer buffers. Aside from the obvious difference in kind, our TS stack differs in several respects. The AFC dequeue uses flat-combining-style consolidation, that is, a combiner thread merges timestamps into a total order. As a result, the AFC queue is blocking. The TS stack avoids enforcing an internal total order, and instead allows non-blocking parallel removal. Removal in the AFC queue depends on the expensive consolidation process, and as a result their producer-consumer benchmark shows remove performance significantly worse than other flat-combining queues. Interval timestamping lets the TS stack trade insertion and removal cost, avoiding this problem. Timestamps in the AFC queue are Lamport clocks [17], not hardware-generated intervals. (We also experiment with Lamport clocks; see TS-stutter in §3.2.) Finally, AFC queue elements are timestamped before being inserted; in the TS stack, this is reversed. This seemingly trivial difference enables timestamp-based elimination, which is important to the TS stack's performance.

The LCRQ queue [1] and the SP queue [11] both index elements using an atomic counter. However, dequeue operations do not look for one of the youngest elements as in our TS stack, but rather for the element with the enqueue index that matches the dequeue index exactly. Both approaches fall back to a slow path when the dequeue counter becomes higher than the enqueue counter. In contrast to indices, timestamps in the TS stack need not be unique or even ordered, and the performance of the TS stack does not depend on a fast path and a slow path, but only on the number of elements which share the same timestamp.
Our use of the x86 RDTSCP instruction to generate hardware timestamps is inspired by work on testing FIFO queues [7]. There the RDTSC instruction is used to determine the order of operation calls. (Note the distinction between the synchronised RDTSCP and the unsynchronised RDTSC.) RDTSCP has since been used in the design of an STM by Ruan et al. [20], who investigate the instruction's multi-processor synchronisation behaviour.
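The interval-timestamping idea built on RDTSCP can be sketched as follows. This is a hedged illustration, not the paper's code: it is x86-64 only, the helper names are mine, and a real implementation would tune the delay as discussed above.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>

// Interval timestamping sketch: take one RDTSCP reading, spin for a
// configurable delay, then take a second reading. Elements whose
// [start, end] intervals overlap are unordered and may therefore be
// removed in parallel.
typedef struct { uint64_t start; uint64_t end; } interval_ts;

static uint64_t read_tsc(void) {
  unsigned int aux;            // RDTSCP also reports the core id
  return __rdtscp(&aux);       // waits for earlier instructions to retire
}

static interval_ts new_interval(uint64_t delay_cycles) {
  interval_ts ts;
  ts.start = read_tsc();
  while (read_tsc() - ts.start < delay_cycles)
    ;                          // busy-wait to widen the interval
  ts.end = read_tsc();
  return ts;
}

// Interval a precedes b only if a ended before b started;
// otherwise the two timestamps are unordered.
static int precedes(interval_ts a, interval_ts b) {
  return a.end < b.start;
}
```

Widening the interval makes more concurrent elements appear unordered, which reduces contention in removal at the cost of slower pushes, exactly the trade-off Figure 6 measures.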
Correctness. Our stack theorem lets us prove that the TS stack is linearizable with respect to sequential stack semantics. This theorem builds on Henzinger et al., who have a similar theorem for queues [12]. Their theorem is
The order in which concurrently pushed elements appear on the TS stack is only determined when they are popped off the TS stack, even if they had been on the TS stack for, say, a year.
–Showing that the TS Stack is a stack is hard hence POPL
Concurrent Data Structures: scal.cs.uni-salzburg.at [POPL13, CF13, POPL15, NETYS15]
Concurrent Data Structure
Core 1 Core 2
Core 3 Core 4
Scal
Name | Semantics | Year | Ref
Lock-based Singly-linked List Queue | strict queue | 1968 | [1]
Michael-Scott (MS) Queue | strict queue | 1996 | [2]
Flat Combining Queue | strict queue | 2010 | [3]
Wait-free Queue | strict queue | 2012 | [4]
Linked Cyclic Ring Queue (LCRQ) | strict queue | 2013 | [5]
Timestamped (TS) Queue | strict queue | 2015 | [6]
Cooperative TS Queue | strict queue | 2015 | [7]
Segment Queue | k-relaxed queue | 2010 | [8]
Random Dequeue (RD) Queue | k-relaxed queue | 2010 | [8]
Bounded Size k-FIFO Queue | k-relaxed queue, pool | 2013 | [9]
Unbounded Size k-FIFO Queue | k-relaxed queue, pool | 2013 | [9]
b-RR Distributed Queue (DQ) | k-relaxed queue, pool | 2013 | [10]
Least-Recently-Used (LRU) DQ | k-relaxed queue, pool | 2013 | [10]
Locally Linearizable DQ (static, dynamic) | locally linearizable queue, pool | 2015 | [11]
Locally Linearizable k-FIFO Queue | locally linearizable queue | 2015 | [11]
Relaxed TS Queue | quiescently consistent queue (conjectured) | 2015 | [7]
Lock-based Singly-linked List Stack | strict stack | 1968 | [1]
Treiber Stack | strict stack | 1986 | [12]
Elimination-backoff Stack | strict stack | 2004 | [13]
Timestamped (TS) Stack | strict stack | 2015 | [6]
k-Stack | k-relaxed stack | 2013 | [14]
b-RR Distributed Stack (DS) | k-relaxed stack, pool | 2013 | [10]
Least-Recently-Used (LRU) DS | k-relaxed stack, pool | 2013 | [10]
Locally Linearizable DS (static, dynamic) | locally linearizable stack, pool | 2015 | [11]
Locally Linearizable k-Stack | locally linearizable stack | 2015 | [11]
Timestamped (TS) Deque | strict deque (conjectured) | 2015 | [7]
d-RA DQ and DS | strict pool | 2013 | [10]
Scal: A Benchmarking Suite for Concurrent Data Structures [NETYS15]
How do we allocate and deallocate shared memory with increasingly many cores such that performance increases with the number of cores while memory consumption stays low?
–Multicore Shared Memory Allocation Problem
Multicore Memory Allocation Problem
[Chart: throughput vs. number of cores, contrasting linear scalability, positive scalability with high performance, positive scalability with low performance, and negative scalability.]
Scalloc: Concurrent Memory Allocator scalloc.cs.uni-salzburg.at [OOPSLA15]
Concurrent Free-List
(Distributed Stack [CF13])
Core 1 Core 2
Core 3 Core 4
Local Allocation & Deallocation
Allocators compared: Hoard, jemalloc, llalloc, ptmalloc2, Streamflow, SuperMalloc, TBB, TCMalloc, compact, scalloc.

[Figure 6: Thread-local workload: Threadtest benchmark. (a) Speedup with respect to ptmalloc2 (more is better); (b) memory consumption in MB (less is better); x-axis: number of threads, 1 to 40.]
[Figure 7: Thread-local workload: Shbench benchmark. (a) Speedup with respect to ptmalloc2 (more is better); (b) memory consumption in GB (less is better); x-axis: number of threads, 1 to 40.]
[Figure 8: Thread-local workload (including thread termination): Larson benchmark. (a) Throughput in million operations per second (more is better); (b) memory consumption in MB (less is better); x-axis: number of threads, 1 to 40.]
“Remote” Deallocation
Allocators compared: Hoard, jemalloc, llalloc, ptmalloc2, Streamflow, SuperMalloc, TBB, TCMalloc, compact, scalloc.

[Figure 9: Temporal and spatial performance for the producer-consumer experiment. (a) Total per-thread allocator time in seconds (less is better); (b) average per-thread memory consumption in MB (less is better); x-axis: number of threads, 2 to 40.]
[Figure 10: Temporal and spatial performance for the object-size robustness experiment at 40 threads. (a) Total allocator time in seconds (log scale, less is better); (b) average memory consumption in MB (log scale, less is better); x-axis: object size range in bytes (log scale), from 16-64B up to 1-4MB.]
two threads causes on average 50% remote frees and running 40 threads causes on average 97.5% remote frees.

Figure 9a presents the total time each thread spends in the allocator for an increasing number of producers/consumers. Up to 30 threads scalloc and Streamflow provide the best temporal performance and for more than 30 threads scalloc outperforms all other allocators.

The average per-thread memory consumption illustrated in Figure 9b suggests that all allocators deal with blowup fragmentation, i.e., we do not observe unbounded growth in memory consumption. However, the absolute differences among different allocators are significant. Scalloc provides competitive spatial performance where only jemalloc and ptmalloc2 require less memory at the expense of higher total per-thread allocator time.

This experiment demonstrates that the approach of scalloc to distributing contention across spans with one remote free list per span works well in a producer-consumer workload and that using a lock-based implementation for reusing spans is not a performance bottleneck.

7.4 Robustness against False Sharing
False sharing occurs when objects that are allocated in the same cache line are read from and written to by different threads. In cache coherent systems this scenario can lead to performance degradation as all caches need to be kept consistent. An allocator is prone to active false sharing [3] if objects that are allocated by different threads (without communication) end up in the same cache line. It is prone to passive false sharing [3] if objects that are remotely
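The hazard that this section evaluates can be made concrete with a small layout sketch (illustration only, not scalloc code; the 64-byte cache line size is an assumption typical of x86):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// In the "shared" layout both counters fall into one 64-byte cache
// line, so writes by one thread invalidate the other thread's cached
// copy even though the threads never touch the same field. Aligning
// each counter to its own line avoids that.
struct shared_line {
  uint64_t a;                  // thread 1 increments this
  uint64_t b;                  // thread 2 increments this -- same line
};

struct padded {
  _Alignas(64) uint64_t a;     // own cache line
  _Alignas(64) uint64_t b;     // own cache line
};
```

An allocator induces the same effect, without the programmer writing such a struct, whenever it hands blocks from one cache line to different threads; that is what the active-false and passive-false benchmarks probe.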
Scalloc: Concurrent Memory Allocator scalloc.cs.uni-salzburg.at [OOPSLA15]
Eager Reuse of Deallocated
Memory
Core 1 Core 2
Core 3 Core 4
Virtual Spans: 64-bit Address Space
Scalloc: From Scalable Concurrent Data Structuresto a Fast, Scalable, Low-Memory-Overhead Allocator
Abstract
1. Introduction
2. Related Work
3. The Allocator in Detail
Like other allocators (e.g. [17] and [14]), scalloc can be divided into two main parts:

(1) a mutator-facing frontend that manages memory in so-called spans, and
(2) a backend for managing the spans (ideally returning them to the operating system when empty).

Scalloc maintains scalability with respect to performance and memory consumption by:

• introducing virtual spans that enable unified treatment of variable-size objects;
• providing a scalable backend for managing spans;
• providing a frontend with constant-time malloc and free calls that only consider live heap (no garbage collection cycles).

The following subsections describe these crucial concepts of scalloc.
3.1 Real Spans and Size Classes
A (real) span is a contiguous portion of memory partitioned into blocks of the same size. The size of blocks in a span determines which size class the span belongs to. All spans in a given size class have the same number of blocks. Hence, the size of a span is fully determined by its size class: it
[Figure 1: Structure of arena, virtual spans, and real spans. The arena consists of 2MB-aligned virtual spans; each virtual span contains a real span (a header plus blocks of header and payload), followed by unmapped memory and reserved addresses.]
is the product of the block size and the number of blocks, plus a span header containing administrative information. In scalloc, there are 29 size classes but only 9 distinct real-span sizes, which are all multiples of 4KB (the size of a system page).

The first 16 size classes, with block sizes ranging from 16 bytes to 256 bytes in increments of 16 bytes, are taken from TCMalloc [6]. This design of small size classes limits block-internal fragmentation. All these 16 size classes have the same real-span size. Size classes with larger blocks range from 512 bytes to 1MB, in increments that are powers of two. These size classes may have different real-span sizes, explaining the difference between 29 size classes and 9 distinct real-span sizes.

Objects of size larger than any size class are not managed by spans, but rather allocated directly from the operating system using mmap.
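The size-class layout just described can be sketched as a mapping from request size to block size. This is a hedged reconstruction from the stated layout (16-byte steps up to 256 bytes, then powers of two from 512 bytes to 1MB); the function name and exact rounding are mine, and scalloc's internal code may differ.

```c
#include <assert.h>
#include <stddef.h>

// Round n up to the next multiple of `multiple`.
static size_t round_up(size_t n, size_t multiple) {
  return (n + multiple - 1) / multiple * multiple;
}

// Map a request size to the block size of its size class:
// classes cover 16..256 bytes in 16-byte steps, then powers of
// two from 512 bytes to 1MB.
static size_t class_block_size(size_t request) {
  if (request == 0)
    request = 1;
  if (request <= 256)
    return round_up(request, 16);   // small size classes
  size_t block = 512;
  while (block < request)           // next power of two >= request
    block *= 2;
  return block;                     // requests above 1MB use mmap
}
```

For example, a 257-byte request lands in the 512-byte class, which is the kind of block-internal fragmentation the 16-byte-step small classes are designed to limit.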
3.2 Virtual Spans
A virtual span is a span allocated in a very large portion of virtual memory (32TB) which we call the arena. All virtual spans have the same fixed size of 2MB and are 2MB-aligned in the arena. Each virtual span contains a real span of one of the available size classes. By the size class of the virtual span we mean the size class of the contained real span. Typically, the real span is (much) smaller than the virtual span that contains it. The maximal real-span size is limited by the size of the virtual span. This is why virtual spans are suitable for big objects as well as for small ones. The structure of the
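One consequence of the fixed 2MB size and 2MB alignment is cheap address arithmetic. The sketch below is my own inference from the stated layout (the paper does not spell out this computation): the virtual span containing any block address is found by clearing the low 21 address bits.

```c
#include <assert.h>
#include <stdint.h>

#define VSPAN_SIZE (2ULL * 1024 * 1024)   // 2MB virtual span

// Because virtual spans are 2MB-sized and 2MB-aligned, masking off
// the low bits of a block address yields the start of its span.
static uintptr_t span_of(uintptr_t block_addr) {
  return block_addr & ~(uintptr_t)(VSPAN_SIZE - 1);
}

// Index of the span within the arena; arena_base is assumed to be
// 2MB-aligned. A 32TB arena holds 32TB / 2MB = 16M virtual spans.
static uint64_t span_index(uintptr_t arena_base, uintptr_t block_addr) {
  return (uint64_t)(span_of(block_addr) - arena_base) / VSPAN_SIZE;
}
```

This is the usual trick behind header lookup in span-based allocators: free(ptr) can locate the span header without any per-object metadata.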
Object Size
ACDC: Explorative Benchmarking Memory Management acdc.cs.uni-salzburg.at [ISMM13,DLS14]
✤ configurable multicore-scalable mutator for mimicking virtually any allocation, deallocation, sharing, and access behavior
✤ written in C, tracks with minimal overhead:
1. memory allocation time
2. memory deallocation time
3. memory consumption
4. memory access time
Memory Access

deallocated by a thread are immediately usable for allocation by this thread again.
We have conducted the false sharing avoidance evaluation benchmark from Berger et al. [3] (including active-false and passive-false benchmarks) to validate scalloc's design. The results we have obtained suggest that most allocators avoid active and passive false sharing. However, SuperMalloc and TCMalloc suffer from both active and passive false sharing, whereas Hoard is prone to passive false sharing only. We omit the graphs because they only show binary results (either false sharing occurs or it does not). Scalloc's design ensures that in the cases covered by the active-false and passive-false benchmarks no false sharing appears, as spans need to be freed to be reused by other threads for allocation. Only in case of thread termination (not covered by the active-false and passive-false benchmarks) may threads adopt spans in which other threads still have blocks, potentially causing false sharing. We have not encountered false sharing with scalloc in any of our experiments.

7.5 Robustness for Varying Object Sizes
We configure the ACDC allocator benchmark [2] to allocate, access, and deallocate increasingly larger thread-local objects in 40 threads (the number of native cores) to study the scalability of virtual spans and the span pool.

Figure 10a shows the total time spent in the allocator, i.e., the time spent in malloc and free. The x-axis refers to intervals [2^x, 2^(x+2)) of object sizes in bytes with 4 ≤ x ≤ 20 at increments of two. For each object size interval ACDC allocates 2^x KB of new objects, accesses the objects, and then deallocates previously allocated objects. This cycle is repeated 30 times. For object sizes smaller than 1MB scalloc outperforms all other allocators because virtual spans enable scalloc to rely on efficient size-class allocation. The only possible bottleneck in this case is accessing the span-pool. However, even in the presence of 40 threads we do not observe contention on the span-pool. For objects larger than 1MB scalloc relies on mmap which adds system call latency to allocation and deallocation operations and is also known to be a scalability bottleneck [6].

The average memory consumption (illustrated in Figure 10b) of scalloc allocating small objects is higher (yet still competitive) because the real-spans for size-classes smaller than 32KB have the same size and madvise is not enabled for them. For larger object sizes scalloc causes the smallest memory overhead, comparable to jemalloc and ptmalloc2.

This experiment demonstrates the advantages of trading virtual address space fragmentation for high throughput and low physical memory fragmentation.

7.6 Spatial Locality
In order to expose differences in spatial locality, we configure ACDC to access allocated objects (between 16 and 32 bytes) increasingly in allocation order (rather than out-of-allocation order). For this purpose, ACDC organizes allocated objects
[Figure 11: Memory access time for the locality experiment. y-axis: total memory access time in seconds (less is better); x-axis: percentage of object accesses in allocation order, 0 to 100.]
either in trees (in depth-first, left-to-right order, representing out-of-allocation order) or in lists (representing allocation order). ACDC then accesses the objects from the tree in depth-first, right-to-left order and from the list in FIFO order. We measure the total memory access time for an increasing ratio of lists, starting at 0% (only trees), going up to 100% (only lists), as an indicator of spatial locality. ACDC provides a simple mutator-aware allocator called compact to serve as an optimal (yet, without further knowledge of mutator behavior, unreachable) baseline. Compact stores the lists and trees of allocated objects without space overhead in contiguous memory for optimal locality.

Figure 11 shows the total memory access time for an increasing ratio of object accesses in allocation order. Only jemalloc and llalloc provide a memory layout that can be accessed slightly faster than the memory layout provided by scalloc. Note that scalloc does not require object headers and reinitializes span free-lists upon retrieval from the span-pool. For a larger ratio of object accesses in allocation order, the other allocators improve as well but not as much as llalloc, scalloc, Streamflow, and TBB, which approach the memory access performance of the compact baseline allocator. Note also that we can improve memory access time with scalloc even more by setting its reusability threshold to 100%. In this case spans are only reused once they get completely empty and reinitialized through the span-pool, at the expense of higher memory consumption. We omit this data for consistency reasons.

To explain the differences in memory access time we pick the data points for ptmalloc2 and scalloc at x=60%, where the difference in memory access time is most significant, and compare the number of all hardware events obtainable using perf.7 While most numbers are similar we identify two events where the numbers differ significantly. First, the L1 cache miss rate with ptmalloc2 is 20.8% while scalloc causes a
7 See https://perf.wiki.kernel.org
How do we teach computer science to students who are not necessarily majoring in computer science but who nevertheless code every day?
–The Computer Science Education Challenge
Selfie: Teaching Computer Science [selfie.cs.uni-salzburg.at]
✤ Selfie is a self-referential 7k-line C implementation (in a single file) of:
1. a self-compiling compiler called starc that compiles a tiny subset of C called C Star (C*) to a tiny subset of MIPS32 called MIPSter,
2. a self-executing emulator called mipster that executes MIPSter code including itself when compiled with starc,
3. a self-hosting hypervisor called hypster that virtualizes mipster and can host all of selfie including itself,
4. a tiny C* library called libcstar utilized by all of selfie, and
5. a tiny, experimental SAT solver called babysat.
int atoi(int *s) {
  int i;
  int n;
  int c;

  i = 0;
  n = 0;
  c = *(s + i);

  while (c != 0) {
    n = n * 10 + c - '0';
    if (n < 0)
      return -1;
    i = i + 1;
    c = *(s + i);
  }

  return n;
}
C* features:
✤ 5 statements: assignment, while, if, return, procedure call
✤ no data types other than int and int*, and dereferencing: the * operator
✤ integer arithmetic, pointer arithmetic
✤ no bitwise operators, no Boolean operators
✤ character literals, string literals
✤ library: exit, malloc, open, read, write
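To make the restrictions above concrete, here is a small procedure that stays inside the C* subset: only int, the five statement kinds, and no Boolean or bitwise operators. (The example is mine, not from the selfie distribution.)

```c
#include <assert.h>

// Count the decimal digits of a nonnegative integer using only
// C* constructs: assignment, while, if, return, procedure call.
int count_digits(int n) {
  int c;

  c = 0;

  if (n == 0)
    return 1;

  while (n != 0) {
    n = n / 10;
    c = c + 1;
  }

  return c;
}
```

Note what is absent: no for loop, no && or ||, no compound assignment; conditions are built from single comparisons, which is exactly what keeps the starc compiler tiny.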
> make
cc -w -m32 -D'main(a,b)=main(a,char**argv)' selfie.c -o selfie
bootstrapping selfie.c into x86 selfie executable using standard C compiler
(now also available for RISC-V machines)
> ./selfie
./selfie: usage: selfie { -c { source } | -o binary | -s assembly | -l binary } [ ( -m | -d | -y | -min | -mob ) size ... ]
selfie usage
> ./selfie -c selfie.c
./selfie: this is selfie's starc compiling selfie.c
./selfie: 176408 characters read in 7083 lines and 969 comments
./selfie: with 97779(55.55%) characters in 28914 actual symbols
./selfie: 261 global variables, 289 procedures, 450 string literals
./selfie: 1958 calls, 723 assignments, 57 while, 572 if, 243 return
./selfie: 121660 bytes generated with 28779 instructions and 6544 bytes of data
compiling selfie.c with x86 selfie executable
(takes seconds)
> ./selfie -c selfie.c -m 2 -c selfie.c
./selfie: this is selfie's starc compiling selfie.c
./selfie: this is selfie's mipster executing selfie.c with 2MB of physical memory
selfie.c: this is selfie's starc compiling selfie.c
selfie.c: exiting with exit code 0 and 1.05MB of mallocated memory
./selfie: this is selfie's mipster terminating selfie.c with exit code 0 and 1.16MB of mapped memory
compiling selfie.c with x86 selfie executable into a MIPSter executable and
then running that MIPSter executable to compile selfie.c again (takes ~6 minutes)
> ./selfie -c selfie.c -o selfie1.m -m 2 -c selfie.c -o selfie2.m
./selfie: this is selfie's starc compiling selfie.c
./selfie: 121660 bytes with 28779 instructions and 6544 bytes of data written into selfie1.m
./selfie: this is selfie's mipster executing selfie1.m with 2MB of physical memory
selfie1.m: this is selfie's starc compiling selfie.c
selfie1.m: 121660 bytes with 28779 instructions and 6544 bytes of data written into selfie2.m
selfie1.m: exiting with exit code 0 and 1.05MB of mallocated memory
./selfie: this is selfie's mipster terminating selfie1.m with exit code 0 and 1.16MB of mapped memory
compiling selfie.c into a MIPSter executable selfie1.m and
then running selfie1.m to compile selfie.c into another MIPSter executable selfie2.m
(takes ~6 minutes)
Sandboxed Concurrency: 1-Week Homework Assignment
[Diagram: stacked layers Formalism / Compiler / Emulator / Machine, shown three times: running directly on the machine, running on an emulator, and running on two emulators composed in parallel (Emulator || Emulator).]
> ./selfie -c selfie.c -m 2 -c selfie.c -m 2 -c selfie.c
compiling selfie.c with x86 selfie executable and
then running that executable to compile selfie.c again and
then running that executable to compile selfie.c again (takes ~24 hours)
Emulation versus Virtualization
[Diagram: stacked layers Formalism / Compiler / Emulator / Machine, shown three times: running directly on the machine, running on an emulator (emulation), and running on a hypervisor (virtualization).]
> ./selfie -c selfie.c -m 2 -c selfie.c -y 2 -c selfie.c
compiling selfie.c with x86 selfie executable and
then running that executable to compile selfie.c again and
then hosting that executable in a virtual machine to compile selfie.c again (takes ~12 minutes)
“How do we introduce self-model-checking and maybe even self-verification into Selfie?”
https://github.com/cksystemsteaching/selfie/tree/vipster
Thank you!