1
Energy-efficiency potential of a phase-based cache resizing
scheme for embedded systems
G. Pokam and F. Bodin
3
Motivation (1/3)
High performance is difficult to reconcile with low power.
Consider the cache hierarchy, for instance. Benefits of large caches:
- keep the embedded code + data workload on-chip
- reduce off-chip memory traffic
However, caches account for ~80% of the transistor count: we usually devote half of the chip area to caches.
4
Motivation (2/3)
Cache impact on energy consumption:
- static energy is disproportionately large compared to the rest of the chip: the ~80% of transistors in caches contribute steadily to leakage power
- dynamic energy (transistor switching activity) represents an important fraction of the total energy, due to the high access frequency of caches
Cache design is therefore critical in the context of high-performance embedded systems.
5
Motivation (3/3)
We seek to address cache energy management via hardware/software interaction.
Any good way to achieve that? Yes: add flexibility to allow the cache to be reconfigured efficiently.
How? Follow program phases and adapt the cache structure accordingly.
6
Previous work (1/2)
Some configurable cache proposals that apply to embedded systems include:
- Albonesi [MICRO'99]: selective cache ways, to disable/enable individual cache ways of a highly set-associative cache
- Zhang et al. [ISCA'03]: way-concatenation, to reduce the cache associativity while still maintaining the full cache capacity
7
Previous work (2/2)
These approaches only consider configuration on a per-application basis.
Problems:
- empirically, no single best cache size exists for a given application
- the dynamic cache behavior varies within an application, and from one application to another
Therefore, these approaches do not accommodate program phase changes well.
8
Our approach
Objective: emphasize application-specific cache architectural parameters.
To do so, we consider a cache with a fixed line size and a modulus set-mapping function:
- power/performance is dictated by size and associativity
- not all dynamic program phases have the same requirements on cache size and associativity!
We dynamically vary size and associativity to leverage the power/performance tradeoff at phase level.
9
Cache model (1/8)
Baseline cache model: way-concatenation cache [Zhang ISCA'03].
Functionality of the way-concatenation cache:
- on each cache lookup, a logic selects the m active cache ways out of the n available cache ways
- virtually, each active cache way is a multiple of the size of a single bank in the base n-way cache
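The way-concatenation index logic can be sketched as follows. This is an illustrative model, not the paper's circuit: bank count, sizes, and the bit-slicing below are assumptions matching the 32KB, 4-bank, 32B-line cache used later in the talk.

```python
LINE = 32                           # block size in bytes
BANKS = 4                           # physical banks, 8 KB each
SETS_PER_BANK = 8 * 1024 // LINE    # 256 sets per bank

def lookup_banks(addr, ways):
    """Return (set index, banks probed) for a configuration with `ways` active ways.

    With `ways` active ways, BANKS // ways banks are concatenated per way,
    so extra address bits select which physical bank holds the set;
    full capacity is kept while associativity shrinks."""
    concat = BANKS // ways                                 # banks concatenated per way
    set_idx = (addr // LINE) % SETS_PER_BANK               # index within a bank
    bank_sel = (addr // (LINE * SETS_PER_BANK)) % concat   # which concatenated bank
    probed = [w * concat + bank_sel for w in range(ways)]  # one bank per active way
    return set_idx, probed
```

With `ways=4` every bank is probed (4-way set-associative); with `ways=1` the address bits pick a single bank, giving a direct-mapped cache of the same total capacity.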
10
Cache model (2/8)
Our proposal:
- modify the associativity while guaranteeing cache coherency
- modify the cache size while preserving data availability in unused cache portions
11
Cache model (3/8)
First enhancement: associativity level.
Problem with the baseline model: consider the following scenario (banks 0-3):
- Phase 0: 32K 2-way, active banks are 0 and 2; @A is cached in one of them (say, bank 0)
- Phase 1: 32K 1-way, active bank is 2; @A is modified there, leaving a stale old copy of @A in bank 0 that requires invalidation
12
Cache model (4/8)
Proposed solution: assume a write-through cache.
- the unused tag and status arrays must be made accessible on a write to ensure coherency across cache configurations => associative tag array
- actions of the cache controller: access all tag arrays on a write request and set the corresponding status bit to invalid
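The controller's write action can be sketched as below. Class and method names are illustrative, and the placement policy is elided; only the invalidate-on-write behavior follows the slide.

```python
class TagArrays:
    """Tag/status arrays kept for ALL ways, active or not (associative tag array)."""

    def __init__(self, ways, sets):
        self.tags = [[None] * sets for _ in range(ways)]
        self.valid = [[False] * sets for _ in range(ways)]

    def write(self, set_idx, tag, active_ways):
        """Write-through write: refresh the line in an active way and
        invalidate stale copies of the same line in every other way,
        including currently inactive ones."""
        for w in range(len(self.tags)):
            if w in active_ways:
                continue
            if self.valid[w][set_idx] and self.tags[w][set_idx] == tag:
                self.valid[w][set_idx] = False      # kill the old copy
        w0 = active_ways[0]                         # placement policy elided
        self.tags[w0][set_idx] = tag
        self.valid[w0][set_idx] = True
```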
13
Cache model (5/8)
Second enhancement: cache size level.
Problem with the baseline model: gated-Vdd is used to disconnect a bank => data are not preserved across two configurations!
Proposed solution: unused cache ways are put in a low-power mode => drowsy mode [Flautner et al. ISCA'02]; the tag portion is left unchanged!
Main advantage: we can reduce the cache size and preserve the state of the unused memory cells across program phases, while still reducing leakage energy!
14
Cache model (6/8)
Overall cache model
16
Cache model (8/8)
Drowsy circuitry accounts for less than 3% of the chip area.
Accessing a line in drowsy mode requires a 1-cycle delay [Flautner et al. ISCA'02].
ISA extension: we assume the ISA can be extended with a reconfiguration instruction having the following effects on the WCR:

way-mask   drowsy bit   config
0          0/1          32K1W / 8K1W
1          0/1          32K2W / 16K1W
2          0/1          32K2W / 16K2W
3          0            32K4W
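A direct reading of that table as a WCR decode, purely as a sketch (the encoding of the fields is assumed, not taken from the paper):

```python
# Map (way-mask, drowsy bit) -> resulting cache configuration, per the table.
CONFIGS = {
    (0, 0): "32K1W", (0, 1): "8K1W",
    (1, 0): "32K2W", (1, 1): "16K1W",
    (2, 0): "32K2W", (2, 1): "16K2W",
    (3, 0): "32K4W",
}

def decode_wcr(way_mask, drowsy):
    """Return the cache configuration selected by the reconfiguration instruction."""
    return CONFIGS[(way_mask, drowsy)]
```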
17
Trace-based analysis (1/3)
Goal: extract performance and energy profiles from the trace, in order to adapt the cache structure to the dynamic application requirements.
Assumptions: LRU replacement policy, no prefetching.
18
Trace-based analysis (2/3)
Notation:
- i = sample interval
- map_j = set mapping function (for varying the associativity)
- d <= x = LRU-stack distance bound (for varying the cache size)
Then, define the LRU-stack performance profile P_i(map_j(x)): for each pair (map_j, x), this expression gives the number of dynamic references in interval i that hit in caches with set mapping map_j and LRU-stack distance d <= x.
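A minimal trace-profiling sketch of such a profile, under the slide's assumptions (LRU, no prefetching); function names and the list-based stack are illustrative, not the authors' tooling.

```python
from collections import defaultdict

def lru_stack_profile(trace, map_j, max_dist):
    """trace: iterable of line addresses; map_j: address -> set index.
    Returns P where P[x] = number of references with LRU-stack distance d <= x."""
    stacks = defaultdict(list)            # per-set LRU stacks, MRU first
    hits = [0] * (max_dist + 1)
    for addr in trace:
        stack = stacks[map_j(addr)]
        if addr in stack:
            d = stack.index(addr) + 1     # 1-based stack distance
            if d <= max_dist:
                hits[d] += 1
            stack.remove(addr)
        stack.insert(0, addr)             # move/insert at MRU position
    # accumulate: P[x] counts all hits with distance <= x
    for x in range(1, max_dist + 1):
        hits[x] += hits[x - 1]
    return hits
```

One pass over the trace then yields the profile for every cache size at once: a cache of associativity a under map_j hits exactly the references with d <= a in the corresponding set.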
19
Trace-based analysis (3/3)
Energy profile E_i(map_j(x)):

E_i(map_j(x)) = P_i(map_j(x)) * E_cache    (cache energy)
              + Tag_i * E_tag              (tag energy)
              + N_i * E_drowsy             (drowsy-transition energy)
              + Write_i * E_memory         (memory energy)

where E_memory = ratio * E_cache.
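Plugging the profile into that model is a one-liner; the E_* constants below are placeholders for illustration, not the CACTI/HotLeakage values used in the talk.

```python
def interval_energy(P, Tag, N_drowsy, Writes,
                    E_cache=1.0, E_tag=0.1, E_drowsy=0.05, ratio=50):
    """Energy of one sample interval, following E_i(map_j(x)) above.
    P: cache hits in the interval; Tag: tag-array accesses;
    N_drowsy: drowsy transitions; Writes: write-through memory accesses."""
    E_memory = ratio * E_cache            # off-chip access is ~ratio x costlier
    return (P * E_cache                   # cache energy
            + Tag * E_tag                 # tag energy
            + N_drowsy * E_drowsy         # drowsy-transition energy
            + Writes * E_memory)          # memory energy
```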
20
Experimental setup (1/2)
Focus on the data cache.
Simulation platform:
- 4-issue VLIW processor [Faraboschi et al. ISCA'00]
- 32KB 4-way data cache, 32B block size, 20-cycle miss penalty
Benchmarks:
- MiBench: fft, gsm, susan
- MediaBench: mpeg, epic
- PowerStone: summin, whestone, v42bis
21
Experimental setup (2/2)
CACTI 3.0: used to obtain energy values; we extend it to provide leakage energy values for each simulated cache configuration.
HotLeakage: from which we adapted the leakage energy calculation for each simulated leakage reduction technique.
Estimated memory ratio = 50; drowsy energy from [Flautner et al. ISCA'02].
22
Program behavior (1/4)
[Figure: GSM energy/performance profiles, log10 scales on both axes. Visible features: all-32K configurations, all-16K configurations, 8K configurations; a capacity-miss effect; tradeoff, sensitive, and insensitive regions.]
23
Program behavior (2/4)
[Figure: FFT energy/performance profiles.]
24
Program behavior (3/4)
Working-set size sensitivity property: the working set can be partitioned into clusters with similar cache sensitivity.
Capturing sensitivity through working-set size clustering: the partitioning is done relative to the base cache configuration.
We use a simple metric based on the Manhattan distance between two points v^{k1} and v^{k2}: sum over i of |v_i^{k2} - v_i^{k1}|.
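The clustering metric above is straightforward to write down; the thresholding policy in `same_cluster` is an assumed illustration, not specified on the slide.

```python
def manhattan(v1, v2):
    """Manhattan distance sum_i |v2_i - v1_i| between two profile vectors
    (one component per sample interval i)."""
    assert len(v1) == len(v2)
    return sum(abs(a - b) for a, b in zip(v1, v2))

def same_cluster(v1, v2, threshold):
    """Assumed policy: two working-set points fall in the same cluster
    when their profiles differ by less than a chosen threshold."""
    return manhattan(v1, v2) < threshold
```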
25
Program behavior (4/4)
More energy/performance profiles: summin, whestone.
26
Results (1/3)
Dynamic energy reduction
27
Results (2/3)
Leakage energy savings (0.07 µm technology): gated-Vdd does better here.
28
Results (3/3)
Performance: worst-case degradation (65%, due to drowsy transitions).
29
Conclusions and future work
- We can do better on performance: reduce the frequency of drowsy transitions within a phase with refined cache-bank access policies.
- Manage reconfiguration at the compiler level: insert basic-block (BB) annotations in the trace; exploit feedback-directed compilation.
- A promising scheme for embedded systems.