Hoard: A Scalable Memory Allocator for Multithreaded Applications
Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson
Presented by Ivan Jibaja
(Some slides adapted from Emery Berger’s presentation)
Outline
• Motivation
• Problems in allocator design
  – False sharing
  – Fragmentation
• Existing approaches
• Hoard design
• Experimental evaluation
Motivation
• Parallel multithreaded programs are prevalent
  – Web servers, search engines, DB managers, etc.
  – Run on CMPs/SMPs for high performance
• Memory allocation is a bottleneck
  – Prevents scaling with the number of processors
Desired allocator attributes on a multiprocessor system
• Speed
  – Competitive with uniprocessor allocators on 1 CPU
• Scalability
  – Performance linear in the number of processors
• Fragmentation (= max allocated / max in use)
  – High fragmentation → poor data locality → paging
• False sharing avoidance
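The fragmentation metric just defined (maximum memory held by the allocator divided by maximum memory in use by the program) is easy to state as code. A minimal sketch; the usage traces are hypothetical, not measurements from the paper:

```python
def fragmentation(allocated_trace, in_use_trace):
    """Fragmentation as defined on the slide:
    max memory held by the allocator / max memory in use by the program."""
    return max(allocated_trace) / max(in_use_trace)

# Hypothetical traces (KB): the allocator peaks at 300 KB while the
# program never uses more than 100 KB at once.
print(fragmentation([100, 300, 200], [50, 100, 80]))  # → 3.0
```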
The problem of false sharing
• Programs cause false sharing
  – Allocate a number of objects in one cache line, pass the objects to different threads
• Allocators cause false sharing!
  – Actively: malloc satisfies different threads' requests from the same cache line
  – Passively: free allows a future malloc to produce false sharing
[Diagram: processor 1 executes x1 = malloc(s) and processor 2 executes x2 = malloc(s); both objects land in one cache line, which then thrashes back and forth between the two processors]
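The cache-line arithmetic behind this slide can be shown with a toy model; the 64-byte line size and the addresses are assumptions for illustration, not values from the paper:

```python
CACHE_LINE = 64  # assumed cache-line size in bytes

def cache_line_of(addr):
    """Index of the cache line an address falls in."""
    return addr // CACHE_LINE

# A malloc that packs 8-byte objects contiguously can hand two threads
# objects in the SAME line -- allocator-induced (active) false sharing.
x1, x2 = 0, 8
print(cache_line_of(x1) == cache_line_of(x2))  # → True

# Spacing objects a full line apart keeps the threads on different lines.
y1, y2 = 0, CACHE_LINE
print(cache_line_of(y1) == cache_line_of(y2))  # → False
```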
The problem of fragmentation
• Blowup: an increase in memory consumption that occurs when the allocator reclaims memory freed by the program but fails to use it for future requests
  – Mainly a problem for concurrent allocators
  – Can be unbounded (worst case) or bounded (O(P))
Example: Pure Private Heaps Allocator
• Pure private heaps: one heap per processor
• malloc gets memory from the processor's heap or the system
• free puts memory on the processor's heap
• Avoids heap contention
• Examples: STL, Cilk
[Diagram: processors 1 and 2 each malloc (x1…x4) and free from their own heaps; legend: allocated by heap 1 / free, on heap 2]
How to Break Pure Private Heaps: Fragmentation
• Pure private heaps: memory consumption can grow without bound!
• Producer-consumer:
  – processor 1 allocates
  – processor 2 frees
  – memory is always unavailable to the producer
[Diagram: processor 1 repeatedly calls x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 frees each block, so the freed memory piles up on heap 2]
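The producer-consumer blowup described above can be simulated with a toy model of a pure private-heaps allocator; block counts are illustrative, not from the paper:

```python
def pure_private_heaps(iterations):
    """Toy model of a pure private-heaps allocator under the slide's
    producer-consumer workload: processor 1 mallocs, processor 2 frees.
    Returns (blocks pulled from the system, blocks stranded on heap 2)."""
    heaps = {1: [], 2: []}       # per-processor free lists
    from_system = 0
    for _ in range(iterations):
        # processor 1 (producer): reuse its own heap, else go to the system
        if heaps[1]:
            block = heaps[1].pop()
        else:
            from_system += 1
            block = from_system
        # processor 2 (consumer): free puts the block on ITS heap,
        # where the producer can never see it again
        heaps[2].append(block)
    return from_system, len(heaps[2])

# Only one block is ever live at a time, yet consumption grows without
# bound in the number of iterations.
print(pure_private_heaps(1000))  # → (1000, 1000)
```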
Example II: Private Heaps with Ownership
• free puts memory back on the originating processor's heap
• Avoids unbounded memory consumption
• Examples: ptmalloc, LKmalloc
[Diagram: processor 1 calls x1 = malloc(s) and x2 = malloc(s); processor 2 frees both, and each block returns to heap 1]
How to Break Private Heaps with Ownership: Fragmentation
• Memory consumption can blow up by a factor of P
• Round-robin producer-consumer: processor i allocates, processor i+1 frees
• Program requires 1 block (in general, K), but the allocator acquires 3 blocks (in general, P·K)
[Diagram: processors 1, 2 and 3 each call malloc (x1, x2, x3); the next processor in the ring frees each block back to its originating heap]
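A toy simulation of the round-robin scenario; parameter names P and K follow the slide, but the model is a deliberate simplification:

```python
def ownership_heaps(P, K):
    """Toy model of private heaps WITH ownership under the slide's
    round-robin workload: processor i mallocs K blocks and processor
    i+1 frees them, ownership returning each block to heap i.
    Returns total blocks acquired from the system."""
    heaps = [[] for _ in range(P)]   # per-processor free lists
    from_system = 0
    for i in range(P):
        live = []
        for _ in range(K):
            if heaps[i]:                 # reuse own heap if possible
                live.append(heaps[i].pop())
            else:                        # otherwise grow from the system
                from_system += 1
                live.append(from_system)
        heaps[i].extend(live)            # freed back onto heap i (ownership)
    return from_system

# The program never needs more than K live blocks, but each of the P
# heaps ends up holding K blocks: a P-fold blowup (3 = P*K here).
print(ownership_heaps(P=3, K=1))  # → 3
```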
Existing approaches
Uniprocessor Allocators on Multiprocessors
• Fragmentation: Excellent
  – Very low for most programs [Wilson & Johnstone]
• Speed & Scalability: Poor
  – Heap contention: a single lock protects the heap
• Can exacerbate false sharing
  – Different processors can share cache lines
Existing Multiprocessor Allocators
• Speed:
  – One concurrent heap (e.g., a concurrent B-tree): O(log(#size-classes)) cost per memory operation; too many locks/atomic updates
  – Fast allocators use multiple heaps
• Scalability:
  – Allocator-induced false sharing
  – Other bottlenecks (e.g., the nextHeap global in Ptmalloc)
• Fragmentation:
  – P-fold increase or even unbounded
Hoard as the solution
Hoard Overview
• P per-processor heaps & 1 global heap
• Each thread accesses only its local heap & the global heap
• Manages memory in page-sized superblocks of same-sized objects (LIFO free list)
  – Avoids false sharing by not carving up cache lines
  – Avoids heap contention: local heaps allocate & free small blocks from their superblocks
• Avoids blowup by moving superblocks to the global heap when the fraction of free memory exceeds some threshold
Superblock management
• Emptiness threshold: (ui ≥ (1 − f)·ai) ∨ (ui ≥ ai − K·S), with f = 1/4 and K = 0
• Multiple heaps → avoids actively induced false sharing
• Block coalescing → avoids passively induced false sharing
• Superblocks transferred are usually empty, and transfers are infrequent
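The emptiness check can be written out directly from the invariant. f = 1/4, K = 0 and S = 8K follow the slides; the example numbers are made up:

```python
def crosses_emptiness_threshold(u, a, f=0.25, K=0, S=8192):
    """True when heap i violates Hoard's invariant
    (u >= (1 - f) * a) or (u >= a - K * S),
    i.e. when a superblock should move to the global heap.
    f = 1/4, K = 0 and S = 8K follow the slides."""
    return u < (1 - f) * a and u < a - K * S

# With a = 4 superblocks' worth held, the heap may release memory once
# less than 3 superblocks' worth is actually in use.
print(crosses_emptiness_threshold(u=2 * 8192, a=4 * 8192))  # → True
print(crosses_emptiness_threshold(u=3 * 8192, a=4 * 8192))  # → False
```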
Hoard pseudo-code

malloc(sz)
1. If sz > S/2, allocate the superblock from the OS and return it.
2. i ← hash(current thread)
3. Lock heap i
4. Scan heap i's list of superblocks from most full to least full (for the size class of sz)
5. If there is no superblock with free space {
6.   Check heap 0 (the global heap) for a superblock
7.   If there is none {
8.     Allocate S bytes as superblock s & set owner to heap i
9.   } Else {
10.    Transfer the superblock s to heap i
11.    u0 ← u0 − s.u; ui ← ui + s.u
12.    a0 ← a0 − S; ai ← ai + S
13.  }
14. }
15. ui ← ui + sz; s.u ← s.u + sz
16. Unlock heap i
17. Return a block from the superblock

free(ptr)
1. If the block is "large"
2.   Free the superblock to the OS and return
3. Find the superblock s this block comes from
4. Lock s
5. Lock heap i, the superblock's owner
6. Deallocate the block from the superblock
7. ui ← ui − block size
8. s.u ← s.u − block size
9. If (i = 0), unlock heap i and superblock s and return
10. If (ui < ai − K·S) and (ui < (1 − f)·ai) {
11.   Transfer a mostly-empty superblock s1 to heap 0 (the global heap)
12.   u0 ← u0 + s1.u; ui ← ui − s1.u
13.   a0 ← a0 + S; ai ← ai − S
14. }
15. Unlock heap i and superblock s
Heap contention
• Per-processor heap contention
  – 1 thread allocates / multiple threads free → inherently unscalable
  – Pairs of producer/consumer threads: malloc/free calls are serialized; at most a 2X slowdown (undesirable but scalable)
  – Empirically, only a small fraction of memory is freed by another thread → contention expected to be low
Heap contention (2)
• Global heap contention
  – Measure the number of global-heap lock acquisitions as an upper bound
  – Growing phase: each thread makes at most k/(f·S/s) acquisitions for k mallocs of size s
  – Shrinking phase: pathological case where the program frees (1 − f) of each superblock and then frees every block in a superblock one at a time
  – Empirically: no excessive shrinking and gradual growth of memory usage → low overall contention
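The growing-phase bound k/(f·S/s) is simple arithmetic; this sketch plugs in the slides' f and S with an assumed 8-byte request size:

```python
import math

def max_global_heap_acquisitions(k, s, f=0.25, S=8192):
    """Slide's upper bound on global-heap lock acquisitions during a
    growing phase: a thread touches the global heap at most once per
    f*S/s mallocs of size s, so k mallocs cost at most k/(f*S/s)."""
    return math.ceil(k / (f * S / s))

# 100,000 8-byte mallocs with f = 1/4 and S = 8 KB: at most one
# global-heap acquisition per 256 mallocs.
print(max_global_heap_acquisitions(k=100_000, s=8))  # → 391
```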
Experimental Evaluation
• Dedicated 14-processor Sun Enterprise
  – 400 MHz UltraSPARC
  – 2 GB RAM, 4 MB L2 cache
  – Solaris 7
  – Superblock size S = 8K, f = 1/4
• Comparison between
  – Hoard
  – Ptmalloc (GNU libc; multiple heaps & ownership)
  – Mtmalloc (Solaris multithreaded allocator)
  – Solaris (default system allocator)
Benchmarks
[Benchmark table not captured in the transcript]
Speed
• Size classes need to be handled more cleverly

Scalability - threadtest
• t threads allocate/deallocate 100,000/t 8-byte objects
• 278% faster than Ptmalloc on 14 CPUs

Scalability – Larson
• “Bleeding” is typical in server applications
• Mainly stays within the empty fraction during execution
• 18X faster than the next best allocator on 14 CPUs
Scalability - BEMengine
• Few times below the empty fraction → low synchronization
False sharing behavior
• Active-false: each thread allocates a small object, writes it a few times, frees it
• Passive-false: allocate objects, hand them to threads that free them, then emulate Active-false
• Illustrates the effects of contention in the coherence mechanism
Fragmentation results
• Within 20% of Lea's allocator
• A large number of size classes remains live for the duration of the program, scattered across blocks
Hoard Conclusions
• Speed: Excellent
  – As fast as a uniprocessor allocator on one processor
  – Amortized O(1) cost
  – 1 lock for malloc, 2 for free
• Scalability: Excellent
  – Scales linearly with the number of processors
  – Avoids false sharing
• Fragmentation: Very good
  – Worst case is provably close to ideal
  – Actual observed fragmentation is low
Discussion Points
• If we had to re-evaluate Hoard today, which benchmarks would we use?
• Are there any changes needed to make it work with languages like Java?