TRANSCRIPT
Comparing GCs and Allocation
Richard Jones, Antony Hosking and Eliot Moss, 2012
Presented by Yarden Marton, 18.11.14
• Comparing different garbage collectors.
• Allocation – methods and considerations.
Outline
Comparing GCs
• What is the best GC?
• When we say "best", do we mean:
- Best throughput?
- Shortest pause times?
- Good space utilization?
- A compromise combination?
Comparing GCs
• More to consider:
- Application dependency
- Heap space availability
- Heap size
• Throughput
• Pause time
• Space
• Implementation
Comparing GCs - Aspects
• Primary goal for ‘batch’ applications or for systems experiencing delays.
• Does a faster collector mean a faster application? Not necessarily:
– Mutators pay part of the cost
Throughput
• Algorithmic complexity
• Mark-sweep:
- Cost of tracing and sweeping phases
- Requires visiting every object
• Copying:
- Cost of tracing phase only
- Requires visiting only live objects
Throughput
• Is copying collection faster?
• Not necessarily:
- Number of instructions executed to visit an object
- Locality
- Lazy sweeping
Pause Time
• Important for interactive applications, transaction processors and more.
• ‘Stop-the-world’ collectors
• Reference counting is immediately attractive
• However:
- Recursive freeing makes reference counting costly
- Both improvements of reference counting reintroduce a stop-the-world pause
Space
• Important for:
- Tight physical constraints on memory
- Large applications
• All collectors incur space overhead:
- Reference count fields
- Additional heap space
- Heap fragmentation
- Auxiliary data structures
- Room for garbage
Space
• Completeness – reclaiming all dead objects eventually.
- Basic reference counting is incomplete
• Promptness – reclaiming all dead objects at each collection cycle.
- Basic tracing collectors are prompt (but at a cost)
• Modern high-performance collectors typically trade immediacy for performance.
Implementation
• GC algorithms are difficult to implement, especially concurrent algorithms.
• Errors can manifest themselves long afterwards.
• Tracing:
- Advantage: simple collector-mutator interface
- Disadvantage: determining roots is complicated
• Reference counting:
- Advantage: can be implemented in a library
- Disadvantage: processing overheads, and the correctness of every reference count manipulation is essential
• In general, copying and compacting collectors are more complex than non-moving collectors.
Adaptive Systems
• Commercial systems often offer a choice between GCs, with a large number of tuning options.
• Researchers have developed systems that adapt to the environment:
- Java run-time (Soman et al [2004])
- Singer et al [2007a]
- Sun's Ergonomic tuning
Advice For Developers
• Know your application:
- Measure its behavior
- Track the size and lifetime distributions of the objects it uses
• Experiment with the different collector configurations on offer.
• Considered two styles of collection:
– Direct: reference counting
– Indirect: tracing collection
• Next: An abstract framework for a wide variety of collectors.
A Unified Theory of GC
• GC can be expressed as a fixed-point computation that assigns a reference count ρ(n) to each node n ∈ Nodes.
• Nodes with non-zero count are retained and the rest should be reclaimed.
• Use of abstract data structures whose implementations can vary.
• W – a work list of objects to be processed. When it is empty, the algorithm terminates.
Abstract GC
atomic collectTracing():
    rootsTracing(W)    // find root objects
    scanTracing(W)     // mark reachable objects
    sweepTracing()     // free dead objects
rootsTracing(R):
    for each fld in Roots
        ref ← *fld
        if ref ≠ null
            R ← R + [ref]
scanTracing(W):
    while not isEmpty(W)
        src ← remove(W)
        ρ(src) ← ρ(src) + 1
        if ρ(src) = 1
            for each fld in Pointers(src)
                ref ← *fld
                if ref ≠ null
                    W ← W + [ref]
Abstract Tracing GC Algorithm
sweepTracing():
    for each node in Nodes
        if ρ(node) = 0
            free(node)
        else
            ρ(node) ← 0

New():
    ref ← allocate()
    if ref = null
        collectTracing()
        ref ← allocate()
        if ref = null
            error "Out of memory"
    ρ(ref) ← 0
    return ref
Abstract Tracing GC Algorithm (Continued)
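The abstract tracing collector can be modelled in a few lines of Python (an illustrative sketch, not part of the handbook); here it is run on the four-object example from the slides, where the roots reach B and C, C references A and B, and D is garbage:

```python
# Toy model of the abstract tracing collector: nodes are names,
# Pointers maps each node to its out-edges, W is the work list.

def collect_tracing(nodes, pointers, roots):
    """Return (live_set, freed_set) after one tracing collection."""
    rho = {n: 0 for n in nodes}          # reference counts, all start at 0
    work = list(roots)                   # W is seeded from the roots
    while work:                          # scanTracing
        src = work.pop()
        rho[src] += 1
        if rho[src] == 1:                # first visit: scan the children
            for ref in pointers.get(src, []):
                work.append(ref)
    freed = {n for n in nodes if rho[n] == 0}   # sweepTracing
    live = set(nodes) - freed
    return live, freed

live, freed = collect_tracing({"A", "B", "C", "D"},
                              {"C": ["A", "B"], "D": ["C"]},
                              ["B", "C"])
```

This reproduces the example: counts end at A=1, B=2, C=1, D=0, so D is freed.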
[Figure: tracing example on a four-object heap A, B, C, D. The roots reference B and C; C references A and B; D is unreachable. The work list W is seeded with B and C; as objects are scanned their counts rise, ending at A=1, B=2, C=1, D=0; sweepTracing then frees D and resets all counts to 0.]
atomic collectCounting(I, D):
    applyIncrements(I)    // apply buffered increments
    scanCounting(D)       // apply buffered decrements recursively
    sweepCounting()       // free dead objects
applyIncrements(I):
    while not isEmpty(I)
        ref ← remove(I)
        ρ(ref) ← ρ(ref) + 1
scanCounting(W):
    while not isEmpty(W)
        src ← remove(W)
        ρ(src) ← ρ(src) - 1
        if ρ(src) = 0
            for each fld in Pointers(src)
                ref ← *fld
                if ref ≠ null
                    W ← W + [ref]
Abstract reference counting GC Algorithm
sweepCounting():
    for each node in Nodes
        if ρ(node) = 0
            free(node)
New():
    ref ← allocate()
    if ref = null
        collectCounting()
        ref ← allocate()
        if ref = null
            error "Out of memory"
    ρ(ref) ← 0
    return ref
Abstract reference counting GC Algorithm (Continued)
inc(ref):
    if ref ≠ null
        I ← I + [ref]

dec(ref):
    if ref ≠ null
        D ← D + [ref]

atomic Write(src, i, dst):
    inc(dst)
    dec(src[i])
    src[i] ← dst
Abstract reference counting GC Algorithm (Continued)
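A minimal Python model of the buffered scheme above (the class name `RCHeap` and its methods are illustrative assumptions, not handbook names). The write barrier buffers increments in I and decrements in D; `collect` applies both and sweeps:

```python
# Sketch of abstract reference counting with buffered increments (I)
# and decrements (D), mirroring the pseudocode above.

class RCHeap:
    def __init__(self):
        self.rho = {}        # reference count per node
        self.fields = {}     # node -> list of referents
        self.I, self.D = [], []
        self.freed = set()

    def new(self, name, nfields=0):
        self.rho[name] = 0
        self.fields[name] = [None] * nfields
        return name

    def write(self, src, i, dst):          # Write(src, i, dst)
        if dst is not None:
            self.I.append(dst)             # inc(dst)
        if self.fields[src][i] is not None:
            self.D.append(self.fields[src][i])   # dec(src[i])
        self.fields[src][i] = dst

    def collect(self):                     # collectCounting(I, D)
        while self.I:                      # applyIncrements
            self.rho[self.I.pop()] += 1
        work, self.D = self.D, []
        while work:                        # scanCounting
            src = work.pop()
            self.rho[src] -= 1
            if self.rho[src] == 0:
                work.extend(r for r in self.fields[src] if r is not None)
        for n in list(self.rho):           # sweepCounting
            if self.rho[n] == 0:
                self.freed.add(n)
                del self.rho[n], self.fields[n]
```

Seeding I and D with the buffers from the slide example (I = A, B, A, D, B, C, B; D = A, D; D's field refers to B) yields counts A=1, B=2, C=1 and frees D.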
[Figure: reference counting example. The increment buffer I holds A, B, A, D, B, C, B and the decrement buffer D holds A, D. applyIncrements raises the counts to A=2, B=3, C=1, D=1; scanCounting then applies the buffered decrements recursively, leaving A=1, B=2, C=1, D=0; sweepCounting frees D.]
atomic collectDrc(I, D):
    rootsTracing(I)       // add root objects to I
    applyIncrements(I)    // apply buffered increments
    scanCounting(D)       // apply buffered decrements recursively
    sweepCounting()       // free dead objects
    rootsTracing(D)       // keep the invariant
    applyDecrements(D)
New():
    ref ← allocate()
    if ref = null
        collectDrc(I, D)
        ref ← allocate()
        if ref = null
            error "Out of memory"
    ρ(ref) ← 0
    return ref
Abstract deferred reference counting GC Algorithm
atomic Write(src, i, dst):
    if src ≠ Roots
        inc(dst)
        dec(src[i])
    src[i] ← dst

applyDecrements(D):
    while not isEmpty(D)
        ref ← remove(D)
        ρ(ref) ← ρ(ref) - 1
Abstract deferred reference counting GC Algorithm (Continued)
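The deferred variant can be sketched as one Python function (illustrative, with counts and fields passed in explicitly): root pointers are not counted by the write barrier, so each collection temporarily adds the roots to I, collects, and then re-buffers root decrements to restore the invariant:

```python
# Sketch of deferred reference counting (DRC): roots are counted only
# during a collection, mirroring collectDrc(I, D) above.

def collect_drc(rho, fields, roots, I, D):
    I.extend(roots)                  # rootsTracing(I)
    for ref in I:                    # applyIncrements(I)
        rho[ref] += 1
    I.clear()
    while D:                         # scanCounting(D)
        src = D.pop()
        rho[src] -= 1
        if rho[src] == 0:
            D.extend(r for r in fields.get(src, []) if r is not None)
    freed = {n for n, c in rho.items() if c == 0}   # sweepCounting
    for n in freed:
        del rho[n]
    D.extend(r for r in roots if r in rho)   # rootsTracing(D): keep invariant
    for ref in D:                    # applyDecrements(D)
        rho[ref] -= 1
    D.clear()
    return freed
```

On the slide example (I = A, B, A, D, B; D-buffer = A, D; roots B and C; D's field refers to B) this frees D and leaves A=1, B=1, C=0; C is only reclaimed by a later collection once it loses its root.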
[Figure: deferred reference counting example. rootsTracing adds the roots B and C to I; applyIncrements raises the counts to A=2, B=3, C=1, D=1; scanCounting applies the buffered decrements, leaving A=1, B=2, C=1, D=0; sweepCounting frees D; finally rootsTracing(D) and applyDecrements remove the root contributions again, restoring the invariant with A=1, B=1, C=0.]
Comparing GCs Summary
• GC performance depends on various aspects.
- Therefore, no GC has an absolute advantage over the others.
• Garbage collection can be expressed in an abstract way.
- This highlights similarities and differences.
Allocation
• Three aspects to memory management:
- Allocation of memory in the first place
- Identification of live data
- Reclamation for future use
• Allocation and reclamation of memory are tightly linked.
• Several key differences between automatic and explicit memory management, in terms of allocating and freeing:
- A GC frees space all at once
- A system with GC has more information when allocating
- With GC, users tend to write programs in a different style
• Uses a large free chunk of memory.
• Given a request for n bytes, it allocates that much from one end of the free chunk.
sequentialAllocate(n):
    result ← free
    newFree ← result + n
    if newFree > limit
        return null
    free ← newFree
    return result
Sequential Allocation
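sequentialAllocate maps directly onto a bump-pointer allocator; the following Python sketch (class name illustrative, addresses are plain integers) shows the whole mechanism:

```python
# Bump-pointer (sequential) allocation over a fixed region,
# following sequentialAllocate(n) above.

class SequentialAllocator:
    def __init__(self, start, limit):
        self.free, self.limit = start, limit

    def allocate(self, n):
        result = self.free
        new_free = result + n
        if new_free > self.limit:
            return None          # out of space: the caller must collect
        self.free = new_free
        return result

a = SequentialAllocator(0, 100)
assert a.allocate(40) == 0
assert a.allocate(40) == 40
assert a.allocate(40) is None    # only 20 bytes remain
```

Allocation is just an addition and a bounds check, which is why this scheme is so fast and cache-friendly.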
[Figure: sequential allocation. The heap is divided into an allocated region and an available region; free points at the boundary and limit at the end. A request for n bytes returns result = free and advances free by n plus any alignment padding.]
• Properties:
– Simple
– Efficient
– Better cache locality
– May be less suitable for non-moving collectors
Sequential Allocation
• A data structure records the location and size of free cells of memory.
• The allocator considers each free cell in turn, and according to some policy, chooses one to allocate.
• Three basic types of free-list allocation:
– First-fit
– Next-fit
– Best-fit
Free-list Allocation
First-fit Allocation
• Use the first cell that can satisfy the allocation request.
• A split of the cell may occur unless the remainder is too small.
firstFitAllocate(n):
    prev ← addressOf(head)
    loop
        curr ← next(prev)
        if curr = null
            return null
        else if size(curr) < n
            prev ← curr
        else
            return listAllocate(prev, curr, n)
listAllocate(prev, curr, n):
    result ← curr
    if shouldSplit(size(curr), n)
        remainder ← result + n
        next(remainder) ← next(curr)
        size(remainder) ← size(curr) - n
        next(prev) ← remainder
    else
        next(prev) ← next(curr)
    return result
listAllocateAlt(prev, curr, n):
    if shouldSplit(size(curr), n)
        size(curr) ← size(curr) - n
        result ← curr + size(curr)
    else
        next(prev) ← next(curr)
        result ← curr
    return result
First-fit Allocation
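A compact Python model of first-fit (illustrative: free cells are kept as (address, size) pairs in a Python list rather than threaded through the heap, and any non-zero remainder is kept, i.e. `shouldSplit` always holds):

```python
# First-fit free-list allocation: take the first cell large enough,
# splitting it and keeping the remainder on the list.

class FirstFitAllocator:
    def __init__(self, cells):
        self.cells = list(cells)     # [(addr, size), ...], address-ordered

    def allocate(self, n):
        for i, (addr, size) in enumerate(self.cells):
            if size >= n:
                if size > n:                      # split the cell
                    self.cells[i] = (addr + n, size - n)
                else:                             # exact fit: unlink it
                    del self.cells[i]
                return addr
        return None

ff = FirstFitAllocator([(0, 150), (200, 100), (400, 170)])
assert ff.allocate(120) == 0        # splits the 150-unit cell
assert ff.cells[0] == (120, 30)     # the 30-unit remainder stays in front
```

Note how the small remainder sits at the head of the list, where every later search must step over it; this is exactly the accumulation problem described below.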
[Figure: first-fit example on a free list of 150KB, 100KB, 170KB, 300KB and 50KB cells. A 120KB request splits the 150KB cell (30KB remains); a 50KB request splits the 100KB cell (50KB remains); a 200KB request skips the smaller cells and splits the 300KB cell (100KB remains).]
• Small remainder cells accumulate near the front of the list, slowing down allocation.
• In terms of space utilization, it may behave similarly to best-fit.
• An issue is where in the list to insert a newly freed cell.
• It is usually most natural to keep the list in address order, as mark-sweep does.
First-fit Allocation
• A variation of first-fit.
• Method – start the search for a cell of suitable size from the point in the list where the last search succeeded.
• When reaching the end of the list, start over from the beginning.
• Idea – reduce the need to iterate repeatedly past the small cells at the head of the list.
• Drawbacks:
– Fragmentation
– Poor locality on accessing the list
– Poor locality of the allocated objects
Next-fit Allocation
nextFitAllocate(n):
    start ← prev
    loop
        curr ← next(prev)
        if curr = null
            prev ← addressOf(head)
            curr ← next(prev)
        if prev = start
            return null
        else if size(curr) < n
            prev ← curr
        else
            return listAllocate(prev, curr, n)
Next-fit Allocation Algorithm
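The same toy model adapted for next-fit (illustrative sketch: free cells as (address, size) pairs and a roving index in place of the `prev` pointer; after a successful split the search resumes past the remainder):

```python
# Next-fit: like first-fit, but the search resumes where the last
# successful search stopped, wrapping around at the end of the list.

class NextFitAllocator:
    def __init__(self, cells):
        self.cells = list(cells)    # [(addr, size), ...]
        self.pos = 0                # roving pointer into the list

    def allocate(self, n):
        count = len(self.cells)
        for step in range(count):
            i = (self.pos + step) % count
            addr, size = self.cells[i]
            if size < n:
                continue
            if size > n:
                self.cells[i] = (addr + n, size - n)  # keep the remainder
                self.pos = (i + 1) % len(self.cells)  # resume past it
            else:
                del self.cells[i]
                self.pos = i % len(self.cells) if self.cells else 0
            return addr
        return None

nf = NextFitAllocator([(0, 150), (200, 100), (400, 170)])
assert nf.allocate(120) == 0      # from the 150-unit cell
assert nf.allocate(20) == 200     # resumes at the 100-unit cell,
                                  # not at the 30-unit remainder
```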
[Figure: next-fit example on the same free list. A 120KB request splits the 150KB cell (30KB remains); a subsequent 20KB request resumes from that point and splits the 100KB cell (80KB remains); a 50KB request continues on and splits the 170KB cell (120KB remains).]
• Method – find the cell whose size most closely matches the allocation request.
• Idea:
– Minimize waste
– Avoid splitting large cells unnecessarily
• Bad worst case.
Best-fit Allocation
bestFitAllocate(n):
    best ← null
    bestSize ← ∞
    prev ← addressOf(head)
    loop
        curr ← next(prev)
        if curr = null || size(curr) = n
            if curr ≠ null
                bestPrev ← prev
                best ← curr
            else if best = null
                return null
            return listAllocate(bestPrev, best, n)
        else if size(curr) < n || bestSize < size(curr)
            prev ← curr
        else
            best ← curr
            bestPrev ← prev
            bestSize ← size(curr)
            prev ← curr
Best-fit Allocation Algorithm
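The same (address, size)-pair model for best-fit (an illustrative sketch): the whole list is scanned to find the tightest-fitting cell, with an early exit on an exact fit, as in the pseudocode above:

```python
# Best-fit: pick the smallest free cell that still satisfies the
# request; an exact fit ends the search immediately.

class BestFitAllocator:
    def __init__(self, cells):
        self.cells = list(cells)    # [(addr, size), ...]

    def allocate(self, n):
        best = None
        for i, (addr, size) in enumerate(self.cells):
            if size == n:           # exact fit: take it at once
                best = i
                break
            if size > n and (best is None or size < self.cells[best][1]):
                best = i            # tightest fit seen so far
        if best is None:
            return None
        addr, size = self.cells[best]
        if size > n:
            self.cells[best] = (addr + n, size - n)
        else:
            del self.cells[best]
        return addr

bf = BestFitAllocator([(0, 150), (200, 100), (400, 170)])
assert bf.allocate(90) == 200       # the 100-unit cell is the closest fit
assert bf.cells[1] == (290, 10)     # only a 10-unit sliver remains
```

The cost of the full scan is the "bad worst case" noted above; the balanced-tree representation on the next slides exists precisely to avoid it.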
[Figure: best-fit example on the same free list. A 90KB request picks the 100KB cell, the closest fit (10KB remains); a 50KB request takes the 50KB cell exactly; a 100KB request then splits the 150KB cell (50KB remains).]
• Use of a balanced binary tree.
• Sorted by size (for best-fit) or by address (for first-fit or next-fit).
• If sorted by size, only one cell of each size need be entered.
• Example: Cartesian tree for first/next-fit.
– Indexed by address (primary key) and size (secondary key)
– Total order by address
– Organized as a heap for the sizes
Speeding Free-list Allocation
• Searching in the Cartesian tree under the first-fit policy:
firstFitAllocateCartesian(n):
    parent ← null
    curr ← root
    loop
        if left(curr) ≠ null && max(left(curr)) ≥ n
            parent ← curr
            curr ← left(curr)
        else if prev < curr && size(curr) ≥ n
            prev ← curr
            return treeAllocate(curr, parent, n)
        else if right(curr) ≠ null && max(right(curr)) ≥ n
            parent ← curr
            curr ← right(curr)
        else
            return null
Speeding Free-list Allocation
• Dispersal of free memory across a possibly large number of small free cells.
• Negative effects:
– Can prevent allocation from succeeding
– May cause a program to use more address space, more resident pages and more cache lines
• Fragmentation is impractical to avoid:
– Usually the allocator cannot know what the future request sequence will be
– Even given a known request sequence, computing an optimal allocation is NP-hard
• Usually there is a trade-off between allocation speed and fragmentation.
Fragmentation
• Idea – use multiple free-lists whose members are segregated by size in order to speed allocation.
• Usually a fixed number k of size values s0 < s1 < … < sk-1.
• k+1 free lists f0, …, fk.
• For a free cell b on list fi, size(b) = si for i < k, and size(b) > sk-1 if i = k.
• When requesting a cell of size b ≤ sk-1, the allocator rounds the request size up to the smallest si such that b ≤ si.
• si is called a size class.
Segregated-fits Allocation
segregatedFitAllocate(j):
    result ← remove(freeLists[j])
    if result = null
        large ← allocateBlock()
        if large = null
            return null
        initialize(large, sizes[j])
        result ← remove(freeLists[j])
    return result
• List fk, for cells larger than sk-1, is organized to use one of the basic single-list algorithms.
• Per-cell overheads for large cells are a bit higher, but in total they are negligible.
• The main advantage: for all size classes other than fk, allocation typically requires constant time.
Segregated-fits Allocation
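A Python sketch of segregated-fits combined with the block-based repopulation described later (class and field names are illustrative; the oversize list fk is omitted and such requests simply return None):

```python
# Segregated-fits: one free-list per size class; a request is rounded
# up to the smallest class that fits, and an empty list is repopulated
# by slicing a fresh block into same-sized cells.
import bisect

class SegregatedFits:
    def __init__(self, sizes, block_size=128):
        self.sizes = sorted(sizes)               # s0 < s1 < ... < sk-1
        self.free_lists = {s: [] for s in self.sizes}
        self.block_size = block_size
        self.next_block = 0                      # toy bump-style block source

    def allocate(self, n):
        j = bisect.bisect_left(self.sizes, n)    # smallest si with n <= si
        if j == len(self.sizes):
            return None                          # would go to the big list fk
        s = self.sizes[j]
        if not self.free_lists[s]:               # repopulate from a new block
            base = self.next_block
            self.next_block += self.block_size
            for off in range(0, self.block_size - s + 1, s):
                self.free_lists[s].append(base + off)
        return self.free_lists[s].pop()

sf = SegregatedFits([16, 32, 64])
assert sf.allocate(20) == 96     # rounded up to the 32 class; a fresh
                                 # 128-unit block yields cells 0,32,64,96
```

Allocation on a non-empty list is a constant-time pop, which is the scheme's main advantage.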
[Figure: segregated free-lists f0 … fk-1, one per size class s0 … sk-1, plus a list fk holding cells larger than sk-1.]
• In simple free-list allocators – free cells that are too small to satisfy a request. Called external fragmentation.
• In segregated-fits allocation – wasted space inside an individual cell because the requested size was rounded up. Called internal fragmentation.
More on Fragmentation
• Important consideration – how to populate each free-list of segregated-fits.
• Two approaches:
– Dedicating whole blocks to particular sizes
– Splitting
Populating size classes
• Choose some block size B, a power of two.
• The allocator is provided with blocks.
• If the request is larger than one block, multiple contiguous blocks are allocated.
• For a size class s < B, we populate the free-list fs by allocating a block and immediately slicing it into cells of size s.
• Metadata for the cells is stored in the block.
Big Bag of Pages (Block-based Allocation)
• Disadvantage:
– Fragmentation, average waste of half a block (worst case (B-s)/B)
• Advantages:
– Reduced per-cell metadata
– Simple and efficient for the common case
Big Bag of Pages (Block-based Allocation)
• Like simple free-list schemes, split a cell if that is the only way to satisfy a request.
• Improvement: concatenate the remaining portion to a suitable free-list (if possible).
• For example – the buddy system:
– Size classes are powers of two
– Can split a cell of size 2^(i+1) into two cells of size 2^i
– Can combine in the opposite direction (only if the two small cells were split from the same large cell)
Splitting
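A Python sketch of the buddy system (illustrative: addresses are integers relative to an aligned base 0, so a cell's buddy is found by flipping one address bit with XOR):

```python
# Buddy system: power-of-two size classes; splitting halves a cell,
# freeing recombines a cell with its buddy when both halves are free.

class BuddyAllocator:
    def __init__(self, total, min_size):
        self.min, self.total = min_size, total
        self.free = {total: [0]}        # size -> list of free cell addresses

    def allocate(self, n):
        size = self.min
        while size < n:                 # round up to a power-of-two class
            size *= 2
        big = size
        while big <= self.total and not self.free.get(big):
            big *= 2                    # find a larger cell to split down
        if big > self.total:
            return None
        addr = self.free[big].pop()
        while big > size:               # split, freeing the upper buddies
            big //= 2
            self.free.setdefault(big, []).append(addr + big)
        return addr

    def free_cell(self, addr, n):
        size = self.min
        while size < n:
            size *= 2
        while size < self.total:
            buddy = addr ^ size         # the buddy differs in one bit
            if buddy in self.free.get(size, []):
                self.free[size].remove(buddy)
                addr = min(addr, buddy) # coalesce into the larger cell
                size *= 2
            else:
                break
        self.free.setdefault(size, []).append(addr)

b = BuddyAllocator(128, 16)
assert b.allocate(20) == 0      # 128 splits to 64+64, then 32+32
b.free_cell(0, 20)              # buddies recombine back to 128
assert b.free[128] == [0]
```

This mirrors the slide example: a 20KB request is served by a 32KB cell (internal fragmentation of 12KB), and freeing it coalesces all the way back up to the original 128KB block.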
The Buddy System
[Figure: buddy system example with a 128KB block and a 16KB minimum cell size. A 20KB request splits 128KB into 64KB+64KB and one 64KB into 32KB+32KB, allocating a 32KB cell (12KB wasted); a 10KB request splits the other 32KB cell into 16KB+16KB, allocating 16KB (6KB wasted). Freeing the 10KB cell recombines the two 16KB buddies into 32KB; freeing the 20KB cell recombines 32KB+32KB into 64KB and then 64KB+64KB back into the full 128KB block.]
• Alignment• Size constraints• Boundary tags• Heap parsability• Locality
Allocation’s Additional Considerations
• Allocated objects may require special alignment.
• For example, a double-word floating point:
– Making the granule a double-word is wasteful
– The header of an array in Java takes 3 words – one word is wasted or skipped
Alignment
• Some collection schemes require a minimum amount of space in each cell:
– Forwarding address
– Lock/status
• In that case, the allocator will allocate more words than requested.
Size Constraints
• An additional header or boundary tag associated with each cell.
• Found outside the storage available to the program.
• Indicates size and allocated/free status.
• Is one or two words long.
• A bitmap may be used instead.
Boundary Tags
• The ability to advance cell to cell in the heap.
• An object's header (one or two words):
– Type
– Hash code
– Synchronization information
– Mark bit
• The header comes before the data.
• The reference refers to the first element/field.
Heap Parsability
• How to handle alignment gaps?
– Zero all free space in advance
– Devise a distinct range of values to write at the start of the gap
• Parsing is easier with a bitmap indicating where each object starts.
– Requires additional space and time
Heap Parsability
• During allocation:
– Address-ordered free-lists and sequential allocation present good locality
• During freeing:
– Goal: objects being freed together should be near each other
– Empirically, objects allocated at the same time often become unreachable at about the same time
• Multiple threads allocating.
• Most steps in allocation need to be atomic.
• Can result in a bottleneck.
• Basic solution – each thread has its own allocation area.
• Use of a global pool and smart chunk handling.
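The per-thread-area idea can be sketched as follows (an illustrative Python model, all names assumed: each thread bump-allocates from a private chunk and only takes the global lock when it needs a fresh chunk from the shared pool; requests are assumed no larger than a chunk):

```python
# Per-thread allocation buffers: lock-free fast path, locked refill.
import threading

class ChunkedAllocator:
    CHUNK = 1024                       # chunk handed out per refill

    def __init__(self, total):
        self.lock = threading.Lock()
        self.global_free = 0           # bump pointer of the global pool
        self.total = total
        self.local = threading.local() # per-thread free/limit pointers

    def _refill(self):
        with self.lock:                # the only synchronized step
            if self.global_free + self.CHUNK > self.total:
                return False           # global pool exhausted
            self.local.free = self.global_free
            self.local.limit = self.global_free + self.CHUNK
            self.global_free = self.local.limit
            return True

    def allocate(self, n):             # fast path: private bump allocation
        if (not hasattr(self.local, "free")
                or self.local.free + n > self.local.limit):
            if not self._refill():
                return None
        result = self.local.free
        self.local.free += n
        return result

alloc = ChunkedAllocator(4096)
assert alloc.allocate(100) == 0
assert alloc.allocate(100) == 100   # second allocation never takes the lock
```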
Allocation Summary
• Methods:
- Sequential
- Free-list: first-fit, next-fit and best-fit
- Segregated-fits
• Various additional considerations to take into account.