1
Towards Practical Page Coloring Based
Multi-core Cache Management
Xiao Zhang
Sandhya Dwarkadas
Kai Shen
2
The Multi-Core Challenge
• Multi-core chips dominate the market
– The last-level cache is commonly shared by sibling cores, but the sharing is not well controlled
• Challenge: performance isolation
– Poor performance due to conflicts
– Unpredictable performance
– Denial-of-service attacks
source: http://www.intel.com
3
Possible Software Approach: Page Coloring
• Partition cache at coarse granularity
• Page coloring: advocated by many previous works
– [Bershad’94, Bugnion’96, Cho’06, Tam’07, Lin’08, Soares’08]
[Figure: threads A and B are given disjoint cache regions by placing their memory pages in different colors; cache ways 1…n shown]
Number of colors = CacheSize / (PageSize × CacheAssociativity)
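As a concrete check of the formula above, a minimal sketch (the 16-way associativity is an assumed figure for illustration; the 4MB size matches the L2 cache used later in the talk):

```python
def num_colors(cache_size, page_size, associativity):
    # Color count = CacheSize / (PageSize * CacheAssociativity)
    return cache_size // (page_size * associativity)

# Example: a 4 MB L2 with 4 KB pages, assuming 16-way set associativity
print(num_colors(4 * 1024 * 1024, 4096, 16))  # 64
```

With 64 colors, an app entitled to half the cache would be given pages from 32 of them.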
4
Challenges for Page Coloring
• Expensive page re-coloring
– Re-coloring is needed when the optimization goal or the co-runner changes
– Without extra hardware support, re-coloring means memory copying
– ~3 microseconds per page copy, >10K pages to copy, possibly every time quantum
• Artificial memory pressure
– Restricting an app’s cache share also restricts its memory share
5
Hotness-based Page Coloring
• Basic idea
– Restrict page coloring to a small group of hot pages
• Challenge
– How to efficiently identify hot pages
• Outline
– Efficient hot page identification
– Cache partition policy
– Hot page coloring
6
Method to Track Page Hotness
• Hardware access bits + sequential page table scan
– Generally available on x86; set automatically by the hardware
– One bit per page table entry (PTE)
• Conventional wisdom: scanning the whole page table is expensive
– Not entirely true: per-entry scan latency is overlapped by hardware prefetching
– A sequential scan spends a large portion of its time on non-accessed pages, but we can improve on that
7
Accelerate Sequential Scan
• Programs exhibit spatial locality even at page granularity
– Page non-access correlation metric: Prob(next X neighbors are not accessed | current page is not accessed)
[Plot: page non-access correlation probability vs. number of contiguous non-accessed pages, for the SPECcpu2k benchmark mesa]
8
Locality-based Jumping
• Start in sequential mode
– Switch to jumping mode once a non-accessed entry is seen
• If the entry we jump to is
– not accessed: double the next jump range
– accessed: roll back, reset the jump range to 1, and switch back to sequential mode
• Jump targets are randomized to avoid overlooking pathological access patterns
[Figure: a page table with access bits; pages 0–1 accessed, pages 2–11 not accessed, page 12 accessed. The scan jumps over the non-accessed run, then rolls back upon landing on accessed page 12]
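The jumping policy above can be sketched as follows (randomization omitted, so this is a simplified illustration rather than the authors' implementation):

```python
def locality_jumping_scan(bits):
    """Scan access bits, skipping runs of non-accessed entries with
    exponentially growing jumps (sketch of locality-based jumping;
    the randomization step is omitted). bits[i] is page i's bit."""
    accessed, scanned = [], 0
    i, jump, origin = 0, 1, 0
    n = len(bits)
    while i < n:
        scanned += 1                  # we examine this entry
        if bits[i] == 0:
            origin = i                # remember where we jumped from
            i += jump                 # jumping mode: skip ahead...
            jump *= 2                 # ...and double the next range
        elif jump > 1:
            i = origin + 1            # jump landed on an accessed entry:
            jump = 1                  # roll back, resume sequentially
        else:
            accessed.append(i)        # sequential mode: record hot page
            i += 1
    return accessed, scanned
```

On a fully non-accessed table of 16 entries the sketch examines only 5 of them, which is the source of the >60% entry-skipping reported on a later slide.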
9
Sampling of Access Bits
• Recycle spare bits in the PTE as a hotness counter
– The counter is aged to reflect both recency and frequency
– Could be extended to support LFU page replacement
• Decouple the sampling frequency and the sampling window
– Sampling frequency N
– Sampling time window T
[Timeline: access bits are cleared at times 0, N, 2N, 3N, 4N and checked at T, N+T, 2N+T, 3N+T, 4N+T]
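One plausible aging rule for such a recycled counter, in the spirit of the slide (the 4-bit width and the shift-based aging rule are assumptions for illustration, not the authors' exact scheme):

```python
def update_hotness(counter, accessed, max_count=15):
    """Age a per-page hotness counter at the end of a sampling window.
    Right-shifting halves the old value (recency decay); a set access
    bit adds fresh weight in the top bit (frequency). max_count=15
    models a 4-bit counter recycled from spare PTE bits."""
    counter = counter >> 1
    if accessed:
        counter |= (max_count + 1) >> 1   # set the top bit (here: +8)
    return counter
```

Recently and frequently accessed pages keep a high counter, while pages that go unaccessed decay toward zero within a few windows.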
10
Hot Page Identification Efficiency
• Entries skipped using locality-based jumping: >60% on average
• Runtime overhead
– Tested 12 SPECcpu2k benchmarks on a 3.0 GHz Intel Core 2 Duo processor
– On average 2%/7% overhead at 100/10-millisecond sampling frequency
– Saves 20%/58% over a sequential scan
11
Hot Page Identification Accuracy
• No major accuracy loss due to jumping, as measured by two metrics (Jeffrey divergence and rank error rate)
• Fairly accurate results
12
Roadmap
• Efficient hot page identification - locality jumping
• Cache partition policy - MRC-based
• Hot page coloring
13
Cache Partition Policy
• Miss-Ratio-Curve (MRC) based performance model
– MRC profiled offline
– A single app’s execution time ≈ Misses × Memory_Latency + Hits × Cache_Latency
• Cache partition goal: optimize overall system performance
– System performance metric: geometric mean of all apps’ normalized performance; the normalization baseline is each app’s performance when it monopolizes the whole cache
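The execution-time model above can be written out directly (the latency values below are assumed example numbers in cycles, not measured figures):

```python
def exec_time(miss_ratio, accesses, mem_lat=200, cache_lat=10):
    """MRC-based model from the slide:
    time ≈ Misses * Memory_Latency + Hits * Cache_Latency."""
    misses = miss_ratio * accesses
    hits = (1.0 - miss_ratio) * accesses
    return misses * mem_lat + hits * cache_lat
```

Feeding in the miss ratio read off an app's MRC at a candidate allocation yields the predicted time for that allocation, which is what the partition search below compares.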
14
MRC-driven Cache Partition Policy
[Figure: Thread A’s and Thread B’s miss-ratio curves (miss ratio vs. cache allocation, 0 to 4MB each); the optimal partition point maximizes the geometric mean of the two apps’ normalized performance, subject to Cache Size = Σ(A,B) Cache Allocation]
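A brute-force version of the search the figure illustrates might look like this (the miss-ratio curves and latencies used in the example are made up for illustration, not measured data):

```python
def best_partition(mrc_a, mrc_b, mem_lat=200, cache_lat=10):
    """Try every split of the shared cache between two apps and pick
    the one maximizing the geometric mean of normalized performance.
    mrc[k] is an app's miss ratio when given k cache units."""
    n = len(mrc_a) - 1                          # total cache units
    time = lambda m: m * mem_lat + (1 - m) * cache_lat
    def perf(mrc, alloc):
        # normalized performance = time with the whole cache / time here
        return time(mrc[n]) / time(mrc[alloc])
    return max(range(1, n),
               key=lambda a: (perf(mrc_a, a) * perf(mrc_b, n - a)) ** 0.5)
```

With only a handful of colors per app, enumerating every split is cheap, so no cleverer search is needed here.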
15
Hot Page Coloring
• Budget control of page re-coloring overhead
– A percentage of the time slice, e.g. 5%
• Re-color from the hottest page down until the budget is reached
– Maintain a set of hotness bins during sampling
• bin[i][j] = number of pages in color i with normalized hotness in range [j, j+1)
– Given a budget of K pages, the K-th hottest page’s hotness value is estimated in constant time by searching the hotness bins
– Make sure hot pages are uniformly distributed among colors
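The threshold lookup could be sketched like this (the bin layout is an assumption based on the bin[i][j] definition above):

```python
def hotness_threshold(bins, budget):
    """Estimate the hotness value of the budget-th hottest page by
    walking the per-color hotness bins from hot to cold.
    bins[color][level] = number of pages of that color whose
    normalized hotness falls in [level, level + 1)."""
    levels = len(bins[0])
    for level in range(levels - 1, -1, -1):    # hottest bin first
        count = sum(color_bins[level] for color_bins in bins)
        if count >= budget:
            return level                       # threshold hotness level
        budget -= count
    return 0
```

The work depends only on the (fixed, small) number of colors and hotness levels, not on the number of pages, which is why the estimate costs constant time.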
16
Re-coloring Procedure
[Figure: re-coloring procedure with a budget of 3 pages when the cache share decreases. Each color (Red, Blue, Green, Gray) holds pages sorted in ascending hotness order, annotated with hotness counter values (e.g. 100, 99, 97, … down to 1); only the hottest pages within the budget are re-colored out of the revoked color]
17
Performance Comparisons
4 SPECcpu2k benchmarks run on 2 sibling cores (Intel Core 2 Duo) that share a 4MB L2 cache.
{art, equake} vs. {mcf, twolf}
18
Relieve Artificial Memory Pressure
[Figure: Thread A’s footprint (pages A1–A5) exceeds the memory pages of its entitled colors and spills into Thread B’s colors]
• An app’s footprint may be larger than its entitled memory color pages
• The app may “steal” other apps’ colors, a.k.a. “polluting” their cache shares
• Hotness-aware pollution: preferentially copy cold pages into others’ memory colors (in round-robin fashion, so as not to impose new pressure)
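The round-robin placement of spilled cold pages could look like this (names and structure are illustrative, not the authors' code):

```python
from itertools import cycle

def place_overflow(cold_pages, other_colors):
    """Hotness-aware pollution sketch: spill an app's coldest pages
    into other apps' colors round-robin, so no single victim color
    absorbs all the extra pressure. cold_pages is assumed to be
    pre-sorted coldest-first."""
    targets = cycle(other_colors)          # rotate through victim colors
    return {page: next(targets) for page in cold_pages}
```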
19
Relieve Artificial Memory Pressure
On a dual-core chip, the L2 cache was originally evenly partitioned between the polluting and victim benchmarks. Because of memory pressure, the polluting benchmark moves 1/3 (~62MB) of its footprint into the victim’s shares.
[Chart legend: space-sensitive apps vs. non-space-sensitive apps]
20
Summary
• Contributions:
– Efficient hot page identification that can potentially be used by multiple applications
– Hotness-based page coloring to mitigate two drawbacks: memory pressure and re-coloring cost
• Caveat: a large time quantum is still required to amortize the overhead
• Ongoing work:
– Exploring other possible approaches
• e.g., execution-throttling-based cache management [USENIX’09]