1 Towards Practical Page Coloring Based Multi-core Cache Management Xiao Zhang Sandhya Dwarkadas Kai Shen


1

Towards Practical Page Coloring Based

Multi-core Cache Management

Xiao Zhang

Sandhya Dwarkadas

Kai Shen

2

The Multi-Core Challenge

• Multi-core chip
  – Dominant on the market
  – The last-level cache is commonly shared by sibling cores; however, the sharing is not well controlled

• Challenge: performance isolation
  – Poor performance due to conflicts
  – Unpredictable performance
  – Denial-of-service attacks

source: http://www.intel.com

3

Possible Software Approach: Page Coloring

• Partition the cache at coarse granularity

• Page coloring: advocated by many previous works
  – [Bershad’94, Bugnion’96, Cho’06, Tam’07, Lin’08, Soares’08]

[Figure: Thread A and Thread B mapped to disjoint cache regions; a memory page’s color determines which cache sets (across Way-1 … Way-n) it can occupy]

# of colors = CacheSize / (PageSize × CacheAssociativity)
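The formula above can be checked with a minimal sketch. The 4MB, 16-way, 4KB-page parameters match the Core 2 Duo L2 used later in the talk; treating them as the intended example here is an assumption:

```python
def num_colors(cache_size, page_size, associativity):
    """# of colors = CacheSize / (PageSize * CacheAssociativity)."""
    return cache_size // (page_size * associativity)

def page_color(page_frame_number, colors):
    """A physical page's color is its frame number modulo the color count."""
    return page_frame_number % colors

# 4MB, 16-way L2 cache with 4KB pages -> 64 page colors
colors = num_colors(4 * 1024 * 1024, 4 * 1024, 16)
```

With 64 colors, a thread given 16 of them is confined to 1/4 of the cache sets.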

4

Challenges for Page Coloring

• Expensive page re-coloring
  – Re-coloring is needed when the optimization goal or the co-runner changes
  – Without extra hardware support, re-coloring means memory copying
  – ~3 microseconds per page copy, >10K pages to copy, possibly at every time quantum

• Artificial memory pressure
  – Restricting the cache share also restricts the memory share
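The copying cost quoted above can be put in perspective with back-of-the-envelope arithmetic; the 100 ms scheduler time quantum is an assumed typical value, not a number from the slide:

```python
pages = 10_000        # >10K pages to copy (from the slide)
copy_us = 3           # ~3 microseconds per page copy (from the slide)
quantum_ms = 100      # assumed scheduler time quantum

recolor_ms = pages * copy_us / 1000   # total copy time in milliseconds
overhead = recolor_ms / quantum_ms    # fraction of the quantum spent copying
```

At these numbers a full re-coloring costs 30 ms, i.e. 30% of a 100 ms quantum, which is why naive re-coloring is impractical.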

5

Hotness-based Page Coloring

• Basic idea
  – Restrict page coloring to a small group of hot pages

• Challenge
  – How to efficiently identify hot pages

• Outline
  – Efficient hot page identification
  – Cache partition policy
  – Hot page coloring

6

Method to Track Page Hotness

• Hardware access bits + sequential page table scan
  – Generally available on x86, automatically set by hardware
  – One bit per Page Table Entry (PTE)

• Conventional wisdom: scanning the whole page table is expensive
  – Not entirely true: per-entry scan latency is overlapped by hardware prefetching
  – A sequential scan spends a large portion of its time on non-accessed pages, but we can improve on that

7

Accelerate Sequential Scan

• Programs exhibit spatial locality even at page granularity
  – Page non-access correlation metric: Prob(next X neighbors are not accessed | current page is not accessed)

[Plot: Prob vs. # of contiguous non-accessed pages, for the SPECcpu2k benchmark mesa]

8

Locality-based Jumping

• Start in sequential mode
  – Switch to jumping mode once a non-accessed entry is seen

• If an entry we jumped to is
  – not accessed: double the next jump range
  – accessed: roll back, reset the jump range to 1, and switch back to sequential mode

• Randomized to avoid overlooking pathological access patterns

[Figure: example scan over the access bits of pages 0–13; jumps skip runs of non-accessed pages (bit 0) and roll back on landing on an accessed page (bit 1)]
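The jumping rules above can be sketched as follows. This is a simplified sketch: the randomization the slide mentions is omitted, and entries skipped mid-jump are assumed non-accessed, which is where both the speedup and the approximation come from:

```python
def locality_jumping_scan(access_bits):
    """Scan a run of PTE access bits with locality-based jumping.
    Returns the set of page indices observed as accessed; pages skipped
    over during a jump are never examined."""
    accessed = set()
    i, jump, n = 0, 1, len(access_bits)
    while i < n:
        if access_bits[i]:
            accessed.add(i)
            if jump > 1:
                # landed on an accessed entry mid-jump: roll back to just
                # after the last entry actually examined, resume sequentially
                i = i - jump // 2 + 1
            else:
                i += 1
            jump = 1          # reset to sequential mode
        else:
            i += jump         # skip ahead over a presumed-cold run
            jump *= 2         # double the next jump distance
    return accessed
```

On the figure’s pattern (pages 0–1 accessed, 2–11 cold, 12 accessed) the scan examines only a handful of entries; the isolated hot page 12 may be missed, illustrating the accuracy/speed trade-off.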

9

Sampling of Access Bits

• Recycle spare bits in the PTE as a hotness counter
  – The counter is aged to reflect both recency and frequency
  – Could be extended to support LFU page replacement

• Decouple the sampling frequency and the sampling window
  – Sampling frequency N
  – Sampling time window T

[Timeline: all access bits are cleared at times 0, N, 2N, 3N, 4N, … and checked a window T later, at T, N+T, 2N+T, 3N+T, 4N+T, …]
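One plausible aging scheme for such a counter (an illustrative sketch; the slide does not specify the exact update rule or counter width) is an exponentially decaying shift register kept in a few spare PTE bits:

```python
def age_counter(counter, accessed_bit, width=4):
    """One sampling step for a per-page hotness counter kept in spare PTE
    bits: shift right to age old history, then fold the freshly sampled
    access bit into the high-order position. Recent accesses weigh most,
    and sustained accesses keep the counter near its maximum -- capturing
    both recency and frequency."""
    return (counter >> 1) | (accessed_bit << (width - 1))

c = 0
for bit in [1, 1, 1, 1]:   # a page accessed in four consecutive windows
    c = age_counter(c, bit)
```

A page accessed in every window saturates the 4-bit counter at 15, while a page that goes idle decays back toward 0, halving each window.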

10

Hot Page Identification Efficiency

• Entries skipped using locality-based jumping: >60% on average

• Runtime overhead
  – Tested 12 SPECcpu2k benchmarks on an Intel 3.0 GHz Core 2 Duo processor
  – On average 2%/7% overhead at 100/10 millisecond sampling frequency
  – Saves 20%/58% over a sequential scan

11

Hot Page Identification Accuracy

• No major accuracy loss due to jumping, as measured by two metrics (Jeffrey divergence and rank error rate)

• Fairly accurate results

12

Roadmap

• Efficient hot page identification - locality jumping

• Cache partition policy - MRC-based

• Hot page coloring

13

Cache Partition Policy

• Miss-Ratio-Curve (MRC) based performance model
  – MRC profiled offline
  – A single app’s execution time ≈ Misses × Memory_Latency + Hits × Cache_Latency

• Cache partition goal: optimize overall system performance
  – System performance metric: geometric mean of all apps’ normalized performance; the normalization baseline is an app’s performance when it monopolizes the whole cache

14

MRC-driven Cache Partition Policy

[Figure: Thread A’s and Thread B’s miss-ratio curves plotted against cache allocation (0 to 4MB each). The optimal partition point maximizes the geometric mean of the two apps’ normalized performance, subject to Cache Size = Σ_{A,B} Cache Allocation]
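The partition search on this slide can be sketched with a brute-force sweep. The miss-ratio curves and latency constants below are illustrative assumptions, not profiled data from the talk:

```python
import math

# Illustrative miss-ratio curves: miss ratio vs. cache allocation in MB.
MRC_A = {0: 0.90, 1: 0.50, 2: 0.30, 3: 0.25, 4: 0.20}
MRC_B = {0: 0.95, 1: 0.50, 2: 0.40, 3: 0.35, 4: 0.30}

MEM_LAT, CACHE_LAT = 200, 10   # assumed latencies in cycles

def exec_time(miss_ratio):
    # time per access ~= miss * Memory_Latency + hit * Cache_Latency
    return miss_ratio * MEM_LAT + (1 - miss_ratio) * CACHE_LAT

def best_partition(mrc_a, mrc_b, cache_mb=4):
    """Try every split of the cache; score each by the geometric mean of
    both apps' performance, normalized to each app monopolizing the cache."""
    base_a, base_b = exec_time(mrc_a[cache_mb]), exec_time(mrc_b[cache_mb])
    best_alloc, best_score = None, -1.0
    for a in range(cache_mb + 1):
        b = cache_mb - a
        score = math.sqrt((base_a / exec_time(mrc_a[a])) *
                          (base_b / exec_time(mrc_b[b])))
        if score > best_score:
            best_alloc, best_score = (a, b), score
    return best_alloc, best_score
```

With only a handful of colors per thread the search space is tiny, so exhaustive enumeration is cheap even online.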

15

Hot Page Coloring

• Budget control of page re-coloring overhead
  – A percentage of the time slice, e.g. 5%

• Re-color from the hottest pages until the budget is exhausted
  – Maintain a set of hotness bins during sampling
    • bin[ i ][ j ] = # of pages in color i with normalized hotness in range [ j, j+1 ]
  – Given a budget of K pages, the K-th hottest page’s hotness value is estimated in constant time by searching the hotness bins
  – Make sure hot pages are uniformly distributed among colors
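The bin search above can be sketched as a histogram walk. Names and bin contents here are illustrative; the point is that the cost depends on the number of colors and hotness levels, not on the number of pages, which is the sense in which the estimate is constant time:

```python
def kth_hottest_threshold(bins, budget_k):
    """Estimate the hotness level of the K-th hottest page by walking the
    per-color hotness histogram from the hottest level downward.
    bins[i][j] = # of pages of color i whose hotness falls in level j."""
    num_levels = len(bins[0])
    # pages at each hotness level, summed over all colors
    per_level = [sum(color[j] for color in bins) for j in range(num_levels)]
    count = 0
    for level in range(num_levels - 1, -1, -1):
        count += per_level[level]
        if count >= budget_k:
            return level
    return 0   # budget exceeds the total page count: every page qualifies

# two colors, three hotness levels (illustrative)
bins = [[5, 2, 1],
        [4, 3, 0]]
```

Pages at or above the returned level are the ones worth re-coloring within the budget.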

16

Re-coloring Procedure

[Figure: re-coloring example with a budget of 3 pages under a cache-share decrease. Pages in colors Red, Blue, Green, and Gray are sorted in ascending order of hotness counter value; color Gray is dropped, and the hottest pages are migrated first until the budget is reached]

17

Performance Comparisons

4 SPECcpu2k benchmarks running on 2 sibling cores (Intel Core 2 Duo) that share a 4MB L2 cache.

{art, equake} vs. {mcf, twolf}

18

Relieve Artificial Memory Pressure

[Figure: cache ways and memory pages; Thread A’s footprint (pages A1–A5) is larger than the pages of its entitled memory colors, so some of its pages spill into Thread B’s colors]

• An app’s footprint may be larger than its entitled memory color pages

• The app may then “steal” other apps’ colors, a.k.a. “polluting” their cache share

• Hotness-aware pollution preferentially copies cold pages into other apps’ memory colors (in round-robin fashion, so as not to impose new pressure on any one victim)

19

Relieve Artificial Memory Pressure

On a dual-core chip, the L2 cache was originally evenly partitioned between the polluting and victim benchmarks. Because of memory pressure, the polluting benchmark moves 1/3 (~62MB) of its footprint into the victim’s shares.

[Chart: results for space-sensitive vs. non-space-sensitive apps]

20

Summary

• Contributions:
  – Efficient hot page identification that can potentially be used by multiple applications
  – Hotness-based page coloring to mitigate two drawbacks: memory pressure & re-coloring cost
    • Caveat: a large time quantum is still required to amortize the overhead

• Ongoing work:
  – Exploring other possible approaches
    • e.g. execution-throttling-based cache management [USENIX’09]