Hoard: A Scalable Memory Allocator for Multithreaded Applications
Emery Berger, Kathryn McKinley, Robert Blumofe, Paul Wilson
Presented by Ivan Jibaja
(Some slides adapted from Emery Berger’s presentation)
Outline
• Motivation
• Problems in allocator design
  – False sharing
  – Fragmentation
• Existing approaches
• Hoard design
• Experimental evaluation
Motivation
• Parallel multithreaded programs are prevalent
  – Web servers, search engines, DB managers, etc.
  – Run on CMPs/SMPs for high performance
• Memory allocation is a bottleneck
  – Prevents scaling with the number of processors
Desired allocator attributes on a multiprocessor system
• Speed
  – Competitive with uniprocessor allocators on 1 CPU
• Scalability
  – Performance linear in the number of processors
• Fragmentation (= max allocated / max in use)
  – High fragmentation → poor data locality → paging
• False sharing avoidance
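The fragmentation metric just defined (maximum memory held by the allocator divided by maximum memory in use by the program) is easy to state as code. A minimal sketch; the usage traces are hypothetical, not measurements from the paper:

```python
def fragmentation(allocated_trace, in_use_trace):
    """Fragmentation as defined on the slide:
    max memory held by the allocator / max memory in use by the program."""
    return max(allocated_trace) / max(in_use_trace)

# Hypothetical traces (KB): the allocator peaks at 300 KB while the
# program never uses more than 100 KB at once.
print(fragmentation([100, 300, 200], [50, 100, 80]))  # → 3.0
```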
The problem of false sharing
• Programs cause false sharing
  – Allocate a number of objects in one cache line, pass the objects to different threads
• Allocators cause false sharing!
  – Actively: malloc satisfies different threads' requests from the same cache line
  – Passively: free allows a future malloc to produce false sharing
[Diagram: processor 1 executes x1 = malloc(s) and processor 2 executes x2 = malloc(s); both objects land in one cache line, which then thrashes back and forth between the two processors]
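The cache-line arithmetic behind this slide can be shown with a toy model; the 64-byte line size and the addresses are assumptions for illustration, not values from the paper:

```python
CACHE_LINE = 64  # assumed cache-line size in bytes

def cache_line_of(addr):
    """Index of the cache line an address falls in."""
    return addr // CACHE_LINE

# A malloc that packs 8-byte objects contiguously can hand two threads
# objects in the SAME line -- allocator-induced (active) false sharing.
x1, x2 = 0, 8
print(cache_line_of(x1) == cache_line_of(x2))  # → True

# Spacing objects a full line apart keeps the threads on different lines.
y1, y2 = 0, CACHE_LINE
print(cache_line_of(y1) == cache_line_of(y2))  # → False
```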
The problem of fragmentation
• Blowup: an increase in memory consumption that occurs when the allocator reclaims memory freed by the program but fails to use it for future requests
  – Mainly a problem for concurrent allocators
  – Can be unbounded (worst case) or bounded (O(P))
Example: Pure Private Heaps Allocator
• Pure private heaps: one heap per processor
• malloc gets memory from the processor's heap or the system
• free puts memory on the processor's heap
• Avoids heap contention
• Examples: STL, Cilk
[Diagram: processors 1 and 2 each malloc (x1…x4) and free from their own heaps; legend: allocated by heap 1 / free, on heap 2]
How to Break Pure Private Heaps: Fragmentation
• Pure private heaps: memory consumption can grow without bound!
• Producer-consumer:
  – processor 1 allocates
  – processor 2 frees
  – memory is always unavailable to the producer
[Diagram: processor 1 repeatedly calls x1 = malloc(s), x2 = malloc(s), x3 = malloc(s); processor 2 frees each block, so the freed memory piles up on heap 2]
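The producer-consumer blowup described above can be simulated with a toy model of a pure private-heaps allocator; block counts are illustrative, not from the paper:

```python
def pure_private_heaps(iterations):
    """Toy model of a pure private-heaps allocator under the slide's
    producer-consumer workload: processor 1 mallocs, processor 2 frees.
    Returns (blocks pulled from the system, blocks stranded on heap 2)."""
    heaps = {1: [], 2: []}       # per-processor free lists
    from_system = 0
    for _ in range(iterations):
        # processor 1 (producer): reuse its own heap, else go to the system
        if heaps[1]:
            block = heaps[1].pop()
        else:
            from_system += 1
            block = from_system
        # processor 2 (consumer): free puts the block on ITS heap,
        # where the producer can never see it again
        heaps[2].append(block)
    return from_system, len(heaps[2])

# Only one block is ever live at a time, yet consumption grows without
# bound in the number of iterations.
print(pure_private_heaps(1000))  # → (1000, 1000)
```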
Example II: Private Heaps with Ownership
• free puts memory back on the originating processor's heap
• Avoids unbounded memory consumption
• Examples: ptmalloc, LKmalloc
[Diagram: processor 1 calls x1 = malloc(s) and x2 = malloc(s); processor 2 frees both, and each block returns to heap 1]
How to Break Private Heaps with Ownership: Fragmentation
• Memory consumption can blow up by a factor of P
• Round-robin producer-consumer: processor i allocates, processor i+1 frees
• Program requires 1 block (in general, K), but the allocator acquires 3 blocks (in general, P·K)
[Diagram: processors 1, 2 and 3 each call malloc (x1, x2, x3); the next processor in the ring frees each block back to its originating heap]
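A toy simulation of the round-robin scenario; parameter names P and K follow the slide, but the model is a deliberate simplification:

```python
def ownership_heaps(P, K):
    """Toy model of private heaps WITH ownership under the slide's
    round-robin workload: processor i mallocs K blocks and processor
    i+1 frees them, ownership returning each block to heap i.
    Returns total blocks acquired from the system."""
    heaps = [[] for _ in range(P)]   # per-processor free lists
    from_system = 0
    for i in range(P):
        live = []
        for _ in range(K):
            if heaps[i]:                 # reuse own heap if possible
                live.append(heaps[i].pop())
            else:                        # otherwise grow from the system
                from_system += 1
                live.append(from_system)
        heaps[i].extend(live)            # freed back onto heap i (ownership)
    return from_system

# The program never needs more than K live blocks, but each of the P
# heaps ends up holding K blocks: a P-fold blowup (3 = P*K here).
print(ownership_heaps(P=3, K=1))  # → 3
```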
Existing approaches
Uniprocessor Allocators on Multiprocessors
• Fragmentation: Excellent
  – Very low for most programs [Wilson & Johnstone]
• Speed & Scalability: Poor
  – Heap contention: a single lock protects the heap
• Can exacerbate false sharing
  – Different processors can share cache lines
Existing Multiprocessor Allocators
• Speed:
  – One concurrent heap (e.g., a concurrent B-tree): O(log(#size-classes)) cost per memory operation; too many locks/atomic updates
  – Fast allocators use multiple heaps
• Scalability:
  – Allocator-induced false sharing
  – Other bottlenecks (e.g., the nextHeap global in Ptmalloc)
• Fragmentation:
  – P-fold increase or even unbounded
Hoard as the solution
Hoard Overview
• P per-processor heaps & 1 global heap
• Each thread accesses only its local heap & the global heap
• Manages memory in page-sized superblocks of same-sized objects (LIFO free list)
  – Avoids false sharing by not carving up cache lines
  – Avoids heap contention: local heaps allocate & free small blocks from their superblocks
• Avoids blowup by moving superblocks to the global heap when the fraction of free memory exceeds some threshold
Superblock management
• Emptiness threshold: (ui ≥ (1 − f)·ai) ∨ (ui ≥ ai − K·S), with f = 1/4 and K = 0
• Multiple heaps → avoids actively induced false sharing
• Block coalescing → avoids passively induced false sharing
• Superblocks transferred are usually empty, and transfers are infrequent
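The emptiness check can be written out directly from the invariant. f = 1/4, K = 0 and S = 8K follow the slides; the example numbers are made up:

```python
def crosses_emptiness_threshold(u, a, f=0.25, K=0, S=8192):
    """True when heap i violates Hoard's invariant
    (u >= (1 - f) * a) or (u >= a - K * S),
    i.e. when a superblock should move to the global heap.
    f = 1/4, K = 0 and S = 8K follow the slides."""
    return u < (1 - f) * a and u < a - K * S

# With a = 4 superblocks' worth held, the heap may release memory once
# less than 3 superblocks' worth is actually in use.
print(crosses_emptiness_threshold(u=2 * 8192, a=4 * 8192))  # → True
print(crosses_emptiness_threshold(u=3 * 8192, a=4 * 8192))  # → False
```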
Hoard pseudo-code

malloc(sz)
1. If sz > S/2, allocate the superblock from the OS and return it.
2. i ← hash(current thread)
3. Lock heap i
4. Scan heap i's list of superblocks from most full to least full (for the size class of sz)
5. If there is no superblock with free space {
6.   Check heap 0 (the global heap) for a superblock
7.   If there is none {
8.     Allocate S bytes as superblock s & set owner to heap i
9.   } Else {
10.    Transfer the superblock s to heap i
11.    u0 ← u0 − s.u; ui ← ui + s.u
12.    a0 ← a0 − S; ai ← ai + S
13.  }
14. }
15. ui ← ui + sz; s.u ← s.u + sz
16. Unlock heap i
17. Return a block from the superblock

free(ptr)
1. If the block is "large"
2.   Free the superblock to the OS and return
3. Find the superblock s this block comes from
4. Lock s
5. Lock heap i, the superblock's owner
6. Deallocate the block from the superblock
7. ui ← ui − block size
8. s.u ← s.u − block size
9. If (i = 0), unlock heap i and superblock s and return
10. If (ui < ai − K·S) and (ui < (1 − f)·ai) {
11.   Transfer a mostly-empty superblock s1 to heap 0 (the global heap)
12.   u0 ← u0 + s1.u; ui ← ui − s1.u
13.   a0 ← a0 + S; ai ← ai − S
14. }
15. Unlock heap i and superblock s
Heap contention
• Per-processor heap contention
  – 1 thread allocates / multiple threads free → inherently unscalable
  – Pairs of producer/consumer threads: malloc/free calls are serialized; at most a 2X slowdown (undesirable but scalable)
  – Empirically, only a small fraction of memory is freed by another thread → contention expected to be low
Heap contention (2)
• Global heap contention
  – Measure the number of global-heap lock acquisitions as an upper bound
  – Growing phase: each thread makes at most k/(f·S/s) acquisitions for k mallocs of size s
  – Shrinking phase: pathological case where the program frees (1 − f) of each superblock and then frees every block in a superblock one at a time
  – Empirically: no excessive shrinking and gradual growth of memory usage → low overall contention
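The growing-phase bound k/(f·S/s) is simple arithmetic; this sketch plugs in the slides' f and S with an assumed 8-byte request size:

```python
import math

def max_global_heap_acquisitions(k, s, f=0.25, S=8192):
    """Slide's upper bound on global-heap lock acquisitions during a
    growing phase: a thread touches the global heap at most once per
    f*S/s mallocs of size s, so k mallocs cost at most k/(f*S/s)."""
    return math.ceil(k / (f * S / s))

# 100,000 8-byte mallocs with f = 1/4 and S = 8 KB: at most one
# global-heap acquisition per 256 mallocs.
print(max_global_heap_acquisitions(k=100_000, s=8))  # → 391
```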
Experimental Evaluation
• Dedicated 14-processor Sun Enterprise
  – 400 MHz UltraSPARC
  – 2 GB RAM, 4 MB L2 cache
  – Solaris 7
  – Superblock size S = 8K, f = 1/4
• Comparison between
  – Hoard
  – Ptmalloc (GNU libc; multiple heaps & ownership)
  – Mtmalloc (Solaris multithreaded allocator)
  – Solaris (default system allocator)
Benchmarks
[Benchmark table not captured in the transcript]
Speed
• Size classes need to be handled more cleverly

Scalability - threadtest
• t threads allocate/deallocate 100,000/t 8-byte objects
• 278% faster than Ptmalloc on 14 CPUs

Scalability – Larson
• “Bleeding” is typical in server applications
• Mainly stays within the empty fraction during execution
• 18X faster than the next best allocator on 14 CPUs
Scalability - BEMengine
• Few times below the empty fraction → low synchronization
False sharing behavior
• Active-false: each thread allocates a small object, writes it a few times, frees it
• Passive-false: allocate objects, hand them to threads that free them, then emulate Active-false
• Illustrates the effects of contention in the coherence mechanism
Fragmentation results
• Within 20% of Lea's allocator
• A large number of size classes remains live for the duration of the program, scattered across blocks
Hoard Conclusions
• Speed: Excellent
  – As fast as a uniprocessor allocator on one processor
  – Amortized O(1) cost
  – 1 lock for malloc, 2 for free
• Scalability: Excellent
  – Scales linearly with the number of processors
  – Avoids false sharing
• Fragmentation: Very good
  – Worst case is provably close to ideal
  – Actual observed fragmentation is low
Discussion Points
• If we had to re-evaluate Hoard today, which benchmarks would we use?
• Are there any changes needed to make it work with languages like Java?