TRANSCRIPT
Comparing GCs and Allocation
Richard Jones, Antony Hosking and Eliot Moss, 2012
Presented by Yarden Marton, 18.11.14
• Comparing different garbage collectors.
• Allocation – methods and considerations.
Outline
Comparing GCs
• What is the best GC?
• When we say "best", do we mean:
- Best throughput?
- Shortest pause times?
- Good space utilization?
- A compromise combination?
Comparing GCs
• More to consider:
- Application dependency
- Heap space availability
- Heap size
• Throughput
• Pause time
• Space
• Implementation
Comparing GCs - Aspects
• Primary goal for ‘batch’ applications or for systems experiencing delays.
• Does a faster collector mean a faster application? Not necessarily:
– Mutators pay part of the cost
Throughput
• Algorithmic complexity
• Mark-sweep:
- Cost of tracing and sweeping phases
- Requires visiting every object
• Copying:
- Cost of tracing phase only
- Requires visiting only live objects
Throughput
• Is copying collection faster?
• Not necessarily:
- Number of instructions executed to visit an object
- Locality
- Lazy sweeping
Pause Time
• Important for interactive applications, transaction processors and more.
• ‘Stop-the-world’ collectors
• Reference counting is immediately attractive
• However:
- Recursive freeing makes reference counting costly
- Both improvements of reference counting reintroduce a stop-the-world pause
Space
• Important for:
- Tight physical constraints on memory
- Large applications
• All collectors incur space overhead:
- Reference count fields
- Additional heap space
- Heap fragmentation
- Auxiliary data structures
- Room for garbage
Space
• Completeness – reclaiming all dead objects eventually.
- Basic reference counting is incomplete
• Promptness – reclaiming all dead objects at each collection cycle.
- Basic tracing collectors are prompt (but at a cost)
• Modern high-performance collectors typically trade immediacy for performance.
Implementation
• GC algorithms are difficult to implement, especially concurrent algorithms.
• Errors can manifest themselves long afterwards.
• Tracing:
- Advantage: simple collector-mutator interface
- Disadvantage: determining roots is complicated
• Reference counting:
- Advantage: can be implemented in a library
- Disadvantage: processing overheads, and the correctness of every reference count manipulation is essential
• In general, copying and compacting collectors are more complex than non-moving collectors.
Adaptive Systems
• Commercial systems often offer a choice between GCs, with a large number of tuning options.
• Researchers have developed systems that adapt to the environment:
- Java run-time (Soman et al [2004])
- Singer et al [2007a]
- Sun's Ergonomic tuning
Advice For Developers
• Know your application:
- Measure its behavior
- Track the size and lifetime distributions of the objects it uses
• Experiment with the different collector configurations on offer.
• Considered two styles of collection:
– Direct: reference counting
– Indirect: tracing collection
• Next: An abstract framework for a wide variety of collectors.
A Unified Theory of GC
• GC can be expressed as a fixed-point computation that assigns a reference count ρ(n) to each node n ∈ Nodes.
• Nodes with non-zero count are retained and the rest should be reclaimed.
• Use of abstract data structures whose implementations can vary.
• W – a work list of objects to be processed. When it is empty, the algorithm terminates.
Abstract GC
atomic collectTracing():
    rootsTracing(W)    // find root objects
    scanTracing(W)     // mark reachable objects
    sweepTracing()     // free dead objects
rootsTracing(R):
    for each fld in Roots
        ref ← *fld
        if ref ≠ null
            R ← R + [ref]
scanTracing(W):
    while not isEmpty(W)
        src ← remove(W)
        ρ(src) ← ρ(src) + 1
        if ρ(src) = 1
            for each fld in Pointers(src)
                ref ← *fld
                if ref ≠ null
                    W ← W + [ref]
Abstract Tracing GC Algorithm
sweepTracing():
    for each node in Nodes
        if ρ(node) = 0
            free(node)
        else
            ρ(node) ← 0

New():
    ref ← allocate()
    if ref = null
        collectTracing()
        ref ← allocate()
        if ref = null
            error "Out of memory"
    ρ(ref) ← 0
    return ref
Abstract Tracing GC Algorithm (Continued)
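The abstract tracing collector can be modelled in a few lines of Python (an illustrative sketch, not part of the handbook); here it is run on the four-object example from the slides, where the roots reach B and C, C references A and B, and D is garbage:

```python
# Toy model of the abstract tracing collector: nodes are names,
# Pointers maps each node to its out-edges, W is the work list.

def collect_tracing(nodes, pointers, roots):
    """Return (live_set, freed_set) after one tracing collection."""
    rho = {n: 0 for n in nodes}          # reference counts, all start at 0
    work = list(roots)                   # W is seeded from the roots
    while work:                          # scanTracing
        src = work.pop()
        rho[src] += 1
        if rho[src] == 1:                # first visit: scan the children
            for ref in pointers.get(src, []):
                work.append(ref)
    freed = {n for n in nodes if rho[n] == 0}   # sweepTracing
    live = set(nodes) - freed
    return live, freed

live, freed = collect_tracing({"A", "B", "C", "D"},
                              {"C": ["A", "B"], "D": ["C"]},
                              ["B", "C"])
```

This reproduces the example: counts end at A=1, B=2, C=1, D=0, so D is freed.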
[Figure: tracing example on a four-object heap A, B, C, D. The roots reference B and C; C references A and B; D is unreachable. The work list W is seeded with B and C; as objects are scanned their counts rise, ending at A=1, B=2, C=1, D=0; sweepTracing then frees D and resets all counts to 0.]
atomic collectCounting(I, D):
    applyIncrements(I)    // apply buffered increments
    scanCounting(D)       // apply buffered decrements recursively
    sweepCounting()       // free dead objects
applyIncrements(I):
    while not isEmpty(I)
        ref ← remove(I)
        ρ(ref) ← ρ(ref) + 1
scanCounting(W):
    while not isEmpty(W)
        src ← remove(W)
        ρ(src) ← ρ(src) - 1
        if ρ(src) = 0
            for each fld in Pointers(src)
                ref ← *fld
                if ref ≠ null
                    W ← W + [ref]
Abstract reference counting GC Algorithm
sweepCounting():
    for each node in Nodes
        if ρ(node) = 0
            free(node)
New():
    ref ← allocate()
    if ref = null
        collectCounting()
        ref ← allocate()
        if ref = null
            error "Out of memory"
    ρ(ref) ← 0
    return ref
Abstract reference counting GC Algorithm (Continued)
inc(ref):
    if ref ≠ null
        I ← I + [ref]

dec(ref):
    if ref ≠ null
        D ← D + [ref]

atomic Write(src, i, dst):
    inc(dst)
    dec(src[i])
    src[i] ← dst
Abstract reference counting GC Algorithm (Continued)
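A minimal Python model of the buffered scheme above (the class name `RCHeap` and its methods are illustrative assumptions, not handbook names). The write barrier buffers increments in I and decrements in D; `collect` applies both and sweeps:

```python
# Sketch of abstract reference counting with buffered increments (I)
# and decrements (D), mirroring the pseudocode above.

class RCHeap:
    def __init__(self):
        self.rho = {}        # reference count per node
        self.fields = {}     # node -> list of referents
        self.I, self.D = [], []
        self.freed = set()

    def new(self, name, nfields=0):
        self.rho[name] = 0
        self.fields[name] = [None] * nfields
        return name

    def write(self, src, i, dst):          # Write(src, i, dst)
        if dst is not None:
            self.I.append(dst)             # inc(dst)
        if self.fields[src][i] is not None:
            self.D.append(self.fields[src][i])   # dec(src[i])
        self.fields[src][i] = dst

    def collect(self):                     # collectCounting(I, D)
        while self.I:                      # applyIncrements
            self.rho[self.I.pop()] += 1
        work, self.D = self.D, []
        while work:                        # scanCounting
            src = work.pop()
            self.rho[src] -= 1
            if self.rho[src] == 0:
                work.extend(r for r in self.fields[src] if r is not None)
        for n in list(self.rho):           # sweepCounting
            if self.rho[n] == 0:
                self.freed.add(n)
                del self.rho[n], self.fields[n]
```

Seeding I and D with the buffers from the slide example (I = A, B, A, D, B, C, B; D = A, D; D's field refers to B) yields counts A=1, B=2, C=1 and frees D.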
[Figure: reference counting example. The increment buffer I holds A, B, A, D, B, C, B and the decrement buffer D holds A, D. applyIncrements raises the counts to A=2, B=3, C=1, D=1; scanCounting then applies the buffered decrements recursively, leaving A=1, B=2, C=1, D=0; sweepCounting frees D.]
atomic collectDrc(I, D):
    rootsTracing(I)       // add root objects to I
    applyIncrements(I)    // apply buffered increments
    scanCounting(D)       // apply buffered decrements recursively
    sweepCounting()       // free dead objects
    rootsTracing(D)       // keep the invariant
    applyDecrements(D)
New():
    ref ← allocate()
    if ref = null
        collectDrc(I, D)
        ref ← allocate()
        if ref = null
            error "Out of memory"
    ρ(ref) ← 0
    return ref
Abstract deferred reference counting GC Algorithm
atomic Write(src, i, dst):
    if src ≠ Roots
        inc(dst)
        dec(src[i])
    src[i] ← dst

applyDecrements(D):
    while not isEmpty(D)
        ref ← remove(D)
        ρ(ref) ← ρ(ref) - 1
Abstract deferred reference counting GC Algorithm (Continued)
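The deferred variant can be sketched as one Python function (illustrative, with counts and fields passed in explicitly): root pointers are not counted by the write barrier, so each collection temporarily adds the roots to I, collects, and then re-buffers root decrements to restore the invariant:

```python
# Sketch of deferred reference counting (DRC): roots are counted only
# during a collection, mirroring collectDrc(I, D) above.

def collect_drc(rho, fields, roots, I, D):
    I.extend(roots)                  # rootsTracing(I)
    for ref in I:                    # applyIncrements(I)
        rho[ref] += 1
    I.clear()
    while D:                         # scanCounting(D)
        src = D.pop()
        rho[src] -= 1
        if rho[src] == 0:
            D.extend(r for r in fields.get(src, []) if r is not None)
    freed = {n for n, c in rho.items() if c == 0}   # sweepCounting
    for n in freed:
        del rho[n]
    D.extend(r for r in roots if r in rho)   # rootsTracing(D): keep invariant
    for ref in D:                    # applyDecrements(D)
        rho[ref] -= 1
    D.clear()
    return freed
```

On the slide example (I = A, B, A, D, B; D-buffer = A, D; roots B and C; D's field refers to B) this frees D and leaves A=1, B=1, C=0; C is only reclaimed by a later collection once it loses its root.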
[Figure: deferred reference counting example. rootsTracing adds the roots B and C to I; applyIncrements raises the counts to A=2, B=3, C=1, D=1; scanCounting applies the buffered decrements, leaving A=1, B=2, C=1, D=0; sweepCounting frees D; finally rootsTracing(D) and applyDecrements remove the root contributions again, restoring the invariant with A=1, B=1, C=0.]
Comparing GCs Summary
• GC performance depends on various aspects.
- Therefore, no GC has an absolute advantage over the others.
• Garbage collection can be expressed in an abstract way.
- This highlights similarities and differences.
Allocation
• Three aspects to memory management:
- Allocation of memory in the first place
- Identification of live data
- Reclamation for future use
• Allocation and reclamation of memory are tightly linked.
• Several key differences between automatic and explicit memory management, in terms of allocating and freeing:
- A GC frees space all at once
- A system with GC has more information when allocating
- With GC, users tend to write programs in a different style
• Uses a large free chunk of memory.
• Given a request for n bytes, it allocates that much from one end of the free chunk.
sequentialAllocate(n):
    result ← free
    newFree ← result + n
    if newFree > limit
        return null
    free ← newFree
    return result
Sequential Allocation
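sequentialAllocate maps directly onto a bump-pointer allocator; the following Python sketch (class name illustrative, addresses are plain integers) shows the whole mechanism:

```python
# Bump-pointer (sequential) allocation over a fixed region,
# following sequentialAllocate(n) above.

class SequentialAllocator:
    def __init__(self, start, limit):
        self.free, self.limit = start, limit

    def allocate(self, n):
        result = self.free
        new_free = result + n
        if new_free > self.limit:
            return None          # out of space: the caller must collect
        self.free = new_free
        return result

a = SequentialAllocator(0, 100)
assert a.allocate(40) == 0
assert a.allocate(40) == 40
assert a.allocate(40) is None    # only 20 bytes remain
```

Allocation is just an addition and a bounds check, which is why this scheme is so fast and cache-friendly.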
[Figure: sequential allocation. The heap is divided into an allocated region and an available region; free points at the boundary and limit at the end. A request for n bytes returns result = free and advances free by n plus any alignment padding.]
• Properties:
– Simple
– Efficient
– Better cache locality
– May be less suitable for non-moving collectors
Sequential Allocation
• A data structure records the location and size of free cells of memory.
• The allocator considers each free cell in turn, and according to some policy, chooses one to allocate.
• Three basic types of free-list allocation:
– First-fit
– Next-fit
– Best-fit
Free-list Allocation
First-fit Allocation
• Use the first cell that can satisfy the allocation request.
• A split of the cell may occur unless the remainder is too small.
firstFitAllocate(n):
    prev ← addressOf(head)
    loop
        curr ← next(prev)
        if curr = null
            return null
        else if size(curr) < n
            prev ← curr
        else
            return listAllocate(prev, curr, n)
listAllocate(prev, curr, n):
    result ← curr
    if shouldSplit(size(curr), n)
        remainder ← result + n
        next(remainder) ← next(curr)
        size(remainder) ← size(curr) - n
        next(prev) ← remainder
    else
        next(prev) ← next(curr)
    return result
listAllocateAlt(prev, curr, n):
    if shouldSplit(size(curr), n)
        size(curr) ← size(curr) - n
        result ← curr + size(curr)
    else
        next(prev) ← next(curr)
        result ← curr
    return result
First-fit Allocation
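A compact Python model of first-fit (illustrative: free cells are kept as (address, size) pairs in a Python list rather than threaded through the heap, and any non-zero remainder is kept, i.e. `shouldSplit` always holds):

```python
# First-fit free-list allocation: take the first cell large enough,
# splitting it and keeping the remainder on the list.

class FirstFitAllocator:
    def __init__(self, cells):
        self.cells = list(cells)     # [(addr, size), ...], address-ordered

    def allocate(self, n):
        for i, (addr, size) in enumerate(self.cells):
            if size >= n:
                if size > n:                      # split the cell
                    self.cells[i] = (addr + n, size - n)
                else:                             # exact fit: unlink it
                    del self.cells[i]
                return addr
        return None

ff = FirstFitAllocator([(0, 150), (200, 100), (400, 170)])
assert ff.allocate(120) == 0        # splits the 150-unit cell
assert ff.cells[0] == (120, 30)     # the 30-unit remainder stays in front
```

Note how the small remainder sits at the head of the list, where every later search must step over it; this is exactly the accumulation problem described below.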
[Figure: first-fit example on a free list of 150KB, 100KB, 170KB, 300KB and 50KB cells. A 120KB request splits the 150KB cell (30KB remains); a 50KB request splits the 100KB cell (50KB remains); a 200KB request skips the smaller cells and splits the 300KB cell (100KB remains).]
• Small remainder cells accumulate near the front of the list, slowing down allocation.
• In terms of space utilization, it may behave similarly to best-fit.
• An issue is where in the list to insert a newly freed cell.
• It is usually most natural to keep the list in address order, as mark-sweep does.
First-fit Allocation
• A variation of first-fit.
• Method – start the search for a cell of suitable size from the point in the list where the last search succeeded.
• When reaching the end of the list, start over from the beginning.
• Idea – reduce the need to iterate repeatedly past the small cells at the head of the list.
• Drawbacks:
– Fragmentation
– Poor locality on accessing the list
– Poor locality of the allocated objects
Next-fit Allocation
nextFitAllocate(n):
    start ← prev
    loop
        curr ← next(prev)
        if curr = null
            prev ← addressOf(head)
            curr ← next(prev)
        if prev = start
            return null
        else if size(curr) < n
            prev ← curr
        else
            return listAllocate(prev, curr, n)
Next-fit Allocation Algorithm
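The same toy model adapted for next-fit (illustrative sketch: free cells as (address, size) pairs and a roving index in place of the `prev` pointer; after a successful split the search resumes past the remainder):

```python
# Next-fit: like first-fit, but the search resumes where the last
# successful search stopped, wrapping around at the end of the list.

class NextFitAllocator:
    def __init__(self, cells):
        self.cells = list(cells)    # [(addr, size), ...]
        self.pos = 0                # roving pointer into the list

    def allocate(self, n):
        count = len(self.cells)
        for step in range(count):
            i = (self.pos + step) % count
            addr, size = self.cells[i]
            if size < n:
                continue
            if size > n:
                self.cells[i] = (addr + n, size - n)  # keep the remainder
                self.pos = (i + 1) % len(self.cells)  # resume past it
            else:
                del self.cells[i]
                self.pos = i % len(self.cells) if self.cells else 0
            return addr
        return None

nf = NextFitAllocator([(0, 150), (200, 100), (400, 170)])
assert nf.allocate(120) == 0      # from the 150-unit cell
assert nf.allocate(20) == 200     # resumes at the 100-unit cell,
                                  # not at the 30-unit remainder
```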
[Figure: next-fit example on the same free list. A 120KB request splits the 150KB cell (30KB remains); a subsequent 20KB request resumes from that point and splits the 100KB cell (80KB remains); a 50KB request continues on and splits the 170KB cell (120KB remains).]
• Method – find the cell whose size most closely matches the allocation request.
• Idea:
– Minimize waste
– Avoid splitting large cells unnecessarily
• Bad worst case.
Best-fit Allocation
bestFitAllocate(n):
    best ← null
    bestSize ← ∞
    prev ← addressOf(head)
    loop
        curr ← next(prev)
        if curr = null || size(curr) = n
            if curr ≠ null
                bestPrev ← prev
                best ← curr
            else if best = null
                return null
            return listAllocate(bestPrev, best, n)
        else if size(curr) < n || bestSize < size(curr)
            prev ← curr
        else
            best ← curr
            bestPrev ← prev
            bestSize ← size(curr)
            prev ← curr
Best-fit Allocation Algorithm
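The same (address, size)-pair model for best-fit (an illustrative sketch): the whole list is scanned to find the tightest-fitting cell, with an early exit on an exact fit, as in the pseudocode above:

```python
# Best-fit: pick the smallest free cell that still satisfies the
# request; an exact fit ends the search immediately.

class BestFitAllocator:
    def __init__(self, cells):
        self.cells = list(cells)    # [(addr, size), ...]

    def allocate(self, n):
        best = None
        for i, (addr, size) in enumerate(self.cells):
            if size == n:           # exact fit: take it at once
                best = i
                break
            if size > n and (best is None or size < self.cells[best][1]):
                best = i            # tightest fit seen so far
        if best is None:
            return None
        addr, size = self.cells[best]
        if size > n:
            self.cells[best] = (addr + n, size - n)
        else:
            del self.cells[best]
        return addr

bf = BestFitAllocator([(0, 150), (200, 100), (400, 170)])
assert bf.allocate(90) == 200       # the 100-unit cell is the closest fit
assert bf.cells[1] == (290, 10)     # only a 10-unit sliver remains
```

The cost of the full scan is the "bad worst case" noted above; the balanced-tree representation on the next slides exists precisely to avoid it.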
[Figure: best-fit example on the same free list. A 90KB request picks the 100KB cell, the closest fit (10KB remains); a 50KB request takes the 50KB cell exactly; a 100KB request then splits the 150KB cell (50KB remains).]
• Use of a balanced binary tree.
• Sorted by size (for best-fit) or by address (for first-fit or next-fit).
• If sorted by size, only one cell of each size need be entered.
• Example: Cartesian tree for first/next-fit.
– Indexed by address (primary key) and size (secondary key)
– Total order by address
– Organized as a heap for the sizes
Speeding Free-list Allocation
• Searching in the Cartesian tree under the first-fit policy:
firstFitAllocateCartesian(n):
    parent ← null
    curr ← root
    loop
        if left(curr) ≠ null && max(left(curr)) ≥ n
            parent ← curr
            curr ← left(curr)
        else if prev < curr && size(curr) ≥ n
            prev ← curr
            return treeAllocate(curr, parent, n)
        else if right(curr) ≠ null && max(right(curr)) ≥ n
            parent ← curr
            curr ← right(curr)
        else
            return null
Speeding Free-list Allocation
• Dispersal of free memory across a possibly large number of small free cells.
• Negative effects:
– Can prevent allocation from succeeding
– May cause a program to use more address space, more resident pages and more cache lines
• Fragmentation is impractical to avoid:
– Usually the allocator cannot know what the future request sequence will be
– Even given a known request sequence, computing an optimal allocation is NP-hard
• Usually there is a trade-off between allocation speed and fragmentation.
Fragmentation
• Idea – use multiple free-lists whose members are segregated by size in order to speed allocation.
• Usually a fixed number k of size values s0 < s1 < … < sk-1.
• k+1 free lists f0, …, fk.
• For a free cell b on list fi, size(b) = si for i < k, and size(b) > sk-1 if i = k.
• When requesting a cell of size b ≤ sk-1, the allocator rounds the request size up to the smallest si such that b ≤ si.
• si is called a size class.
Segregated-fits Allocation
segregatedFitAllocate(j):
    result ← remove(freeLists[j])
    if result = null
        large ← allocateBlock()
        if large = null
            return null
        initialize(large, sizes[j])
        result ← remove(freeLists[j])
    return result
• List fk, for cells larger than sk-1, is organized to use one of the basic single-list algorithms.
• Per-cell overheads for large cells are a bit higher, but in total they are negligible.
• The main advantage: for all size classes other than fk, allocation typically requires constant time.
Segregated-fits Allocation
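A Python sketch of segregated-fits combined with the block-based repopulation described later (class and field names are illustrative; the oversize list fk is omitted and such requests simply return None):

```python
# Segregated-fits: one free-list per size class; a request is rounded
# up to the smallest class that fits, and an empty list is repopulated
# by slicing a fresh block into same-sized cells.
import bisect

class SegregatedFits:
    def __init__(self, sizes, block_size=128):
        self.sizes = sorted(sizes)               # s0 < s1 < ... < sk-1
        self.free_lists = {s: [] for s in self.sizes}
        self.block_size = block_size
        self.next_block = 0                      # toy bump-style block source

    def allocate(self, n):
        j = bisect.bisect_left(self.sizes, n)    # smallest si with n <= si
        if j == len(self.sizes):
            return None                          # would go to the big list fk
        s = self.sizes[j]
        if not self.free_lists[s]:               # repopulate from a new block
            base = self.next_block
            self.next_block += self.block_size
            for off in range(0, self.block_size - s + 1, s):
                self.free_lists[s].append(base + off)
        return self.free_lists[s].pop()

sf = SegregatedFits([16, 32, 64])
assert sf.allocate(20) == 96     # rounded up to the 32 class; a fresh
                                 # 128-unit block yields cells 0,32,64,96
```

Allocation on a non-empty list is a constant-time pop, which is the scheme's main advantage.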
[Figure: segregated free-lists f0 … fk-1, one per size class s0 … sk-1, plus a list fk holding cells larger than sk-1.]
• In simple free-list allocators – free cells that are too small to satisfy a request. Called external fragmentation.
• In segregated-fits allocation – wasted space inside an individual cell because the requested size was rounded up. Called internal fragmentation.
More on Fragmentation
• Important consideration – how to populate each free-list of segregated-fits.
• Two approaches:
– Dedicating whole blocks to particular sizes
– Splitting
Populating size classes
• Choose some block size B, a power of two.
• The allocator is provided with blocks.
• If the request is larger than one block, multiple contiguous blocks are allocated.
• For a size class s < B, we populate the free-list fs by allocating a block and immediately slicing it into cells of size s.
• Metadata for the cells is stored in the block.
Big Bag of Pages (Block-based Allocation)
• Disadvantage:
– Fragmentation, average waste of half a block (worst case (B-s)/B)
• Advantages:
– Reduced per-cell metadata
– Simple and efficient for the common case
Big Bag of Pages (Block-based Allocation)
• Like simple free-list schemes, split a cell if that is the only way to satisfy a request.
• Improvement: concatenate the remaining portion to a suitable free-list (if possible).
• For example – the buddy system:
– Size classes are powers of two
– Can split a cell of size 2^(i+1) into two cells of size 2^i
– Can combine in the opposite direction (only if the two small cells were split from the same large cell)
Splitting
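A Python sketch of the buddy system (illustrative: addresses are integers relative to an aligned base 0, so a cell's buddy is found by flipping one address bit with XOR):

```python
# Buddy system: power-of-two size classes; splitting halves a cell,
# freeing recombines a cell with its buddy when both halves are free.

class BuddyAllocator:
    def __init__(self, total, min_size):
        self.min, self.total = min_size, total
        self.free = {total: [0]}        # size -> list of free cell addresses

    def allocate(self, n):
        size = self.min
        while size < n:                 # round up to a power-of-two class
            size *= 2
        big = size
        while big <= self.total and not self.free.get(big):
            big *= 2                    # find a larger cell to split down
        if big > self.total:
            return None
        addr = self.free[big].pop()
        while big > size:               # split, freeing the upper buddies
            big //= 2
            self.free.setdefault(big, []).append(addr + big)
        return addr

    def free_cell(self, addr, n):
        size = self.min
        while size < n:
            size *= 2
        while size < self.total:
            buddy = addr ^ size         # the buddy differs in one bit
            if buddy in self.free.get(size, []):
                self.free[size].remove(buddy)
                addr = min(addr, buddy) # coalesce into the larger cell
                size *= 2
            else:
                break
        self.free.setdefault(size, []).append(addr)

b = BuddyAllocator(128, 16)
assert b.allocate(20) == 0      # 128 splits to 64+64, then 32+32
b.free_cell(0, 20)              # buddies recombine back to 128
assert b.free[128] == [0]
```

This mirrors the slide example: a 20KB request is served by a 32KB cell (internal fragmentation of 12KB), and freeing it coalesces all the way back up to the original 128KB block.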
The Buddy System
[Figure: buddy system example with a 128KB block and a 16KB minimum cell size. A 20KB request splits 128KB into 64KB+64KB and one 64KB into 32KB+32KB, allocating a 32KB cell (12KB wasted); a 10KB request splits the other 32KB cell into 16KB+16KB, allocating 16KB (6KB wasted). Freeing the 10KB cell recombines the two 16KB buddies into 32KB; freeing the 20KB cell recombines 32KB+32KB into 64KB and then 64KB+64KB back into the full 128KB block.]
• Alignment• Size constraints• Boundary tags• Heap parsability• Locality
Allocation’s Additional Considerations
• Allocated objects may require special alignment.
• For example, a double-word floating point:
– Making the granule a double-word is wasteful
– The header of an array in Java takes 3 words – one word is wasted or skipped
Alignment
• Some collection schemes require a minimum amount of space in each cell:
– Forwarding address
– Lock/status
• In that case, the allocator will allocate more words than requested.
Size Constraints
• An additional header or boundary tag associated with each cell.
• Found outside the storage available to the program.
• Indicates size and allocated/free status.
• Is one or two words long.
• A bitmap may be used instead.
Boundary Tags
• The ability to advance cell to cell in the heap.
• An object's header (one or two words):
– Type
– Hash code
– Synchronization information
– Mark bit
• The header comes before the data.
• The reference refers to the first element/field.
Heap Parsability
• How to handle alignment gaps?
– Zero all free space in advance
– Devise a distinct range of values to write at the start of the gap
• Parsing is easier with a bitmap indicating where each object starts.
– Requires additional space and time
Heap Parsability
• During allocation:
– Address-ordered free-lists and sequential allocation present good locality
• During freeing:
– Goal: objects being freed together should be near each other
– Empirically, objects allocated at the same time often become unreachable at about the same time
• Multiple threads allocating.
• Most steps in allocation need to be atomic.
• Can result in a bottleneck.
• Basic solution – each thread has its own allocation area.
• Use of a global pool and smart chunk handling.
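The per-thread-area idea can be sketched as follows (an illustrative Python model, all names assumed: each thread bump-allocates from a private chunk and only takes the global lock when it needs a fresh chunk from the shared pool; requests are assumed no larger than a chunk):

```python
# Per-thread allocation buffers: lock-free fast path, locked refill.
import threading

class ChunkedAllocator:
    CHUNK = 1024                       # chunk handed out per refill

    def __init__(self, total):
        self.lock = threading.Lock()
        self.global_free = 0           # bump pointer of the global pool
        self.total = total
        self.local = threading.local() # per-thread free/limit pointers

    def _refill(self):
        with self.lock:                # the only synchronized step
            if self.global_free + self.CHUNK > self.total:
                return False           # global pool exhausted
            self.local.free = self.global_free
            self.local.limit = self.global_free + self.CHUNK
            self.global_free = self.local.limit
            return True

    def allocate(self, n):             # fast path: private bump allocation
        if (not hasattr(self.local, "free")
                or self.local.free + n > self.local.limit):
            if not self._refill():
                return None
        result = self.local.free
        self.local.free += n
        return result

alloc = ChunkedAllocator(4096)
assert alloc.allocate(100) == 0
assert alloc.allocate(100) == 100   # second allocation never takes the lock
```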
Allocation Summary
• Methods:
- Sequential
- Free-list: first-fit, next-fit and best-fit
- Segregated-fits
• Various additional considerations to take into account.