1 Towards Practical Page Coloring Based Multi-core Cache Management Xiao Zhang Sandhya Dwarkadas Kai Shen


1

Towards Practical Page Coloring Based

Multi-core Cache Management

Xiao Zhang

Sandhya Dwarkadas

Kai Shen

2

The Multi-Core Challenge

• Multi-core chip
  – Dominant on the market
  – The last-level cache is commonly shared by sibling cores; however, the sharing is not well controlled

• Challenge: performance isolation
  – Poor performance due to conflicts
  – Unpredictable performance
  – Denial-of-service attacks

source: http://www.intel.com

3

Possible Software Approach: Page Coloring

• Partition the cache at coarse granularity

• Page coloring: advocated by many previous works
  – [Bershad’94, Bugnion’96, Cho’06, Tam’07, Lin’08, Soares’08]

[Figure: Thread A and Thread B mapped to disjoint cache regions; a memory page’s color determines which cache sets (across Way-1 … Way-n) it can occupy]

# of colors = CacheSize / (PageSize × CacheAssociativity)
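The formula above can be checked with a minimal sketch. The 4MB, 16-way, 4KB-page parameters match the Core 2 Duo L2 used later in the talk; treating them as the intended example here is an assumption:

```python
def num_colors(cache_size, page_size, associativity):
    """# of colors = CacheSize / (PageSize * CacheAssociativity)."""
    return cache_size // (page_size * associativity)

def page_color(page_frame_number, colors):
    """A physical page's color is its frame number modulo the color count."""
    return page_frame_number % colors

# 4MB, 16-way L2 cache with 4KB pages -> 64 page colors
colors = num_colors(4 * 1024 * 1024, 4 * 1024, 16)
```

With 64 colors, a thread given 16 of them is confined to 1/4 of the cache sets.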

4

Challenges for Page Coloring

• Expensive page re-coloring
  – Re-coloring is needed when the optimization goal or the co-runner changes
  – Without extra hardware support, re-coloring means memory copying
  – ~3 microseconds per page copy, >10K pages to copy, possibly at every time quantum

• Artificial memory pressure
  – Restricting the cache share also restricts the memory share
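The copying cost quoted above can be put in perspective with back-of-the-envelope arithmetic; the 100 ms scheduler time quantum is an assumed typical value, not a number from the slide:

```python
pages = 10_000        # >10K pages to copy (from the slide)
copy_us = 3           # ~3 microseconds per page copy (from the slide)
quantum_ms = 100      # assumed scheduler time quantum

recolor_ms = pages * copy_us / 1000   # total copy time in milliseconds
overhead = recolor_ms / quantum_ms    # fraction of the quantum spent copying
```

At these numbers a full re-coloring costs 30 ms, i.e. 30% of a 100 ms quantum, which is why naive re-coloring is impractical.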

5

Hotness-based Page Coloring

• Basic idea
  – Restrict page coloring to a small group of hot pages

• Challenge
  – How to efficiently identify hot pages

• Outline
  – Efficient hot page identification
  – Cache partition policy
  – Hot page coloring

6

Method to Track Page Hotness

• Hardware access bits + sequential page table scan
  – Generally available on x86, automatically set by hardware
  – One bit per Page Table Entry (PTE)

• Conventional wisdom: scanning the whole page table is expensive
  – Not entirely true: per-entry scan latency is overlapped by hardware prefetching
  – A sequential scan spends a large portion of its time on non-accessed pages, but we can improve on that

7

Accelerate Sequential Scan

• Programs exhibit spatial locality even at page granularity
  – Page non-access correlation metric: Prob(next X neighbors are not accessed | current page is not accessed)

[Plot: Prob vs. # of contiguous non-accessed pages, for the SPECcpu2k benchmark mesa]

8

Locality-based Jumping

• Start in sequential mode
  – Switch to jumping mode once a non-accessed entry is seen

• If an entry we jumped to is
  – not accessed: double the next jump range
  – accessed: roll back, reset the jump range to 1, and switch back to sequential mode

• Randomized to avoid overlooking pathological access patterns

[Figure: example scan over the access bits of pages 0–13; jumps skip runs of non-accessed pages (bit 0) and roll back on landing on an accessed page (bit 1)]
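The jumping rules above can be sketched as follows. This is a simplified sketch: the randomization the slide mentions is omitted, and entries skipped mid-jump are assumed non-accessed, which is where both the speedup and the approximation come from:

```python
def locality_jumping_scan(access_bits):
    """Scan a run of PTE access bits with locality-based jumping.
    Returns the set of page indices observed as accessed; pages skipped
    over during a jump are never examined."""
    accessed = set()
    i, jump, n = 0, 1, len(access_bits)
    while i < n:
        if access_bits[i]:
            accessed.add(i)
            if jump > 1:
                # landed on an accessed entry mid-jump: roll back to just
                # after the last entry actually examined, resume sequentially
                i = i - jump // 2 + 1
            else:
                i += 1
            jump = 1          # reset to sequential mode
        else:
            i += jump         # skip ahead over a presumed-cold run
            jump *= 2         # double the next jump distance
    return accessed
```

On the figure’s pattern (pages 0–1 accessed, 2–11 cold, 12 accessed) the scan examines only a handful of entries; the isolated hot page 12 may be missed, illustrating the accuracy/speed trade-off.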

9

Sampling of Access Bits

• Recycle spare bits in the PTE as a hotness counter
  – The counter is aged to reflect both recency and frequency
  – Could be extended to support LFU page replacement

• Decouple the sampling frequency and the sampling window
  – Sampling frequency N
  – Sampling time window T

[Timeline: all access bits are cleared at times 0, N, 2N, 3N, 4N, … and checked a window T later, at T, N+T, 2N+T, 3N+T, 4N+T, …]
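One plausible aging scheme for such a counter (an illustrative sketch; the slide does not specify the exact update rule or counter width) is an exponentially decaying shift register kept in a few spare PTE bits:

```python
def age_counter(counter, accessed_bit, width=4):
    """One sampling step for a per-page hotness counter kept in spare PTE
    bits: shift right to age old history, then fold the freshly sampled
    access bit into the high-order position. Recent accesses weigh most,
    and sustained accesses keep the counter near its maximum -- capturing
    both recency and frequency."""
    return (counter >> 1) | (accessed_bit << (width - 1))

c = 0
for bit in [1, 1, 1, 1]:   # a page accessed in four consecutive windows
    c = age_counter(c, bit)
```

A page accessed in every window saturates the 4-bit counter at 15, while a page that goes idle decays back toward 0, halving each window.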

10

Hot Page Identification Efficiency

• Entries skipped using locality-based jumping: >60% on average

• Runtime overhead
  – Tested 12 SPECcpu2k benchmarks on an Intel 3.0 GHz Core 2 Duo processor
  – On average 2%/7% overhead at 100/10 millisecond sampling frequency
  – Saves 20%/58% over a sequential scan

11

Hot Page Identification Accuracy

• No major accuracy loss due to jumping, as measured by two metrics (Jeffrey divergence and rank error rate)

• Fairly accurate results

12

Roadmap

• Efficient hot page identification - locality jumping

• Cache partition policy - MRC-based

• Hot page coloring

13

Cache Partition Policy

• Miss-Ratio-Curve (MRC) based performance model
  – MRC profiled offline
  – A single app’s execution time ≈ Misses × Memory_Latency + Hits × Cache_Latency

• Cache partition goal: optimize overall system performance
  – System performance metric: geometric mean of all apps’ normalized performance; the normalization baseline is an app’s performance when it monopolizes the whole cache

14

MRC-driven Cache Partition Policy

[Figure: Thread A’s and Thread B’s miss-ratio curves plotted against cache allocation (0 to 4MB each). The optimal partition point maximizes the geometric mean of the two apps’ normalized performance, subject to Cache Size = Σ_{A,B} Cache Allocation]
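The partition search on this slide can be sketched with a brute-force sweep. The miss-ratio curves and latency constants below are illustrative assumptions, not profiled data from the talk:

```python
import math

# Illustrative miss-ratio curves: miss ratio vs. cache allocation in MB.
MRC_A = {0: 0.90, 1: 0.50, 2: 0.30, 3: 0.25, 4: 0.20}
MRC_B = {0: 0.95, 1: 0.50, 2: 0.40, 3: 0.35, 4: 0.30}

MEM_LAT, CACHE_LAT = 200, 10   # assumed latencies in cycles

def exec_time(miss_ratio):
    # time per access ~= miss * Memory_Latency + hit * Cache_Latency
    return miss_ratio * MEM_LAT + (1 - miss_ratio) * CACHE_LAT

def best_partition(mrc_a, mrc_b, cache_mb=4):
    """Try every split of the cache; score each by the geometric mean of
    both apps' performance, normalized to each app monopolizing the cache."""
    base_a, base_b = exec_time(mrc_a[cache_mb]), exec_time(mrc_b[cache_mb])
    best_alloc, best_score = None, -1.0
    for a in range(cache_mb + 1):
        b = cache_mb - a
        score = math.sqrt((base_a / exec_time(mrc_a[a])) *
                          (base_b / exec_time(mrc_b[b])))
        if score > best_score:
            best_alloc, best_score = (a, b), score
    return best_alloc, best_score
```

With only a handful of colors per thread the search space is tiny, so exhaustive enumeration is cheap even online.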

15

Hot Page Coloring

• Budget control of page re-coloring overhead
  – A percentage of the time slice, e.g. 5%

• Re-color from the hottest pages until the budget is exhausted
  – Maintain a set of hotness bins during sampling
    • bin[ i ][ j ] = # of pages in color i with normalized hotness in range [ j, j+1 ]
  – Given a budget of K pages, the K-th hottest page’s hotness value is estimated in constant time by searching the hotness bins
  – Make sure hot pages are uniformly distributed among colors
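The bin search above can be sketched as a histogram walk. Names and bin contents here are illustrative; the point is that the cost depends on the number of colors and hotness levels, not on the number of pages, which is the sense in which the estimate is constant time:

```python
def kth_hottest_threshold(bins, budget_k):
    """Estimate the hotness level of the K-th hottest page by walking the
    per-color hotness histogram from the hottest level downward.
    bins[i][j] = # of pages of color i whose hotness falls in level j."""
    num_levels = len(bins[0])
    # pages at each hotness level, summed over all colors
    per_level = [sum(color[j] for color in bins) for j in range(num_levels)]
    count = 0
    for level in range(num_levels - 1, -1, -1):
        count += per_level[level]
        if count >= budget_k:
            return level
    return 0   # budget exceeds the total page count: every page qualifies

# two colors, three hotness levels (illustrative)
bins = [[5, 2, 1],
        [4, 3, 0]]
```

Pages at or above the returned level are the ones worth re-coloring within the budget.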

16

Re-coloring Procedure

[Figure: re-coloring example with a budget of 3 pages under a cache-share decrease. Pages in colors Red, Blue, Green, and Gray are sorted in ascending order of hotness counter value; color Gray is dropped, and the hottest pages are migrated first until the budget is reached]

17

Performance Comparisons

4 SPECcpu2k benchmarks running on 2 sibling cores (Intel Core 2 Duo) that share a 4MB L2 cache.

{art, equake} vs. {mcf, twolf}

18

Relieve Artificial Memory Pressure

[Figure: cache ways and memory pages; Thread A’s footprint (pages A1–A5) is larger than the pages of its entitled memory colors, so some of its pages spill into Thread B’s colors]

• An app’s footprint may be larger than its entitled memory color pages

• The app may then “steal” other apps’ colors, a.k.a. “polluting” their cache share

• Hotness-aware pollution preferentially copies cold pages into other apps’ memory colors (in round-robin fashion, so as not to impose new pressure on any one victim)

19

Relieve Artificial Memory Pressure

On a dual-core chip, the L2 cache was originally evenly partitioned between the polluting and victim benchmarks. Because of memory pressure, the polluting benchmark moves 1/3 (~62MB) of its footprint into the victim’s shares.

[Chart: results for space-sensitive vs. non-space-sensitive apps]

20

Summary

• Contributions:
  – Efficient hot page identification that can potentially be used by multiple applications
  – Hotness-based page coloring to mitigate two drawbacks: memory pressure & re-coloring cost
    • Caveat: a large time quantum is still required to amortize the overhead

• Ongoing work:
  – Exploring other possible approaches
    • e.g. execution-throttling-based cache management [USENIX’09]