A Self-Tuning Cache Architecture for Embedded Systems
Chuanjun Zhang, Vahid F., Lysecky R.
Proceedings of Design, Automation and Test in Europe Conference and Exhibition
Volume: 1
Pages: 142 – 147
Feb. 2004
112/04/21 A Self-Tuning Cache Architecture for Embedded Systems 2/18
Abstract
Memory accesses can account for about half of a microprocessor system's power consumption. Customizing a microprocessor cache's total size, line size, and associativity to a particular program is well known to have tremendous benefits for performance and power. Until recently, customizing caches was restricted to core-based flows, in which a new chip is fabricated. However, several configurable cache architectures have recently been proposed for use in pre-fabricated microprocessor platforms. Tuning those caches to a program, however, is still a cumbersome task left to designers, assisted in part by recent computer-aided design (CAD) tuning aids.
We propose to move that CAD on-chip, which can greatly increase the acceptance of configurable caches. We introduce on-chip hardware implementing an efficient cache tuning heuristic that can automatically, transparently, and dynamically tune the cache to an executing program. We carefully designed the heuristic to avoid any cache flushing, since flushing is costly in both power and performance. By simulating numerous Powerstone and MediaBench benchmarks, we show that such a dynamic self-tuning cache can reduce memory-access energy by 45% to 55% on average, and by as much as 97%, compared with a four-way set-associative base cache, completely transparently to the programmer.
What’s the Problem
Tuning a configurable cache to an application benefits both power and performance. But how do we obtain the best cache configuration?
Sometimes increasing the cache size (or associativity) improves performance only slightly while increasing energy greatly.
Determining the best cache configuration via simulation is straightforward, but slow, and it cannot capture runtime behavior.
Thus, it is essential to tune a configurable cache automatically and dynamically as an application executes.
Introduction
Previous work by this team: a highly configurable cache architecture [13],[14] with four parameters that designers can configure:
1) Cache total size: 8, 4, or 2 KB
2) Associativity: 4, 2, or 1 way for 8 KB; 2 or 1 way for 4 KB; 1 way for 2 KB
3) Cache line size: 64, 32, or 16 bytes
4) Way prediction: ON or OFF
The proposed dynamic cache tuning method: a cache tuning heuristic implemented in on-chip hardware that, without exhaustively trying all possible cache configurations, dynamically tunes the cache to an executing program and automates the process of finding the best configuration.
The configuration space may grow much larger in future configurable caches.
Energy Evaluation
Equation for total memory-access energy consumption:
  E_mem = hits × E_hit + misses × E_miss + total_cycles × E_static_per_cycle
  E_hit: cache hit energy per cache access (related to cache size and associativity)
  E_miss: cache miss energy (related to cache line size)
  E_static_per_cycle: static energy dissipation per cycle (related to cache size)
Equation for the heuristic cache tuner's energy consumption:
  E_tuner = P_tuner × Time_total × NumSearch
  Time_total: the total time used to finish one cache-configuration search
  NumSearch: the number of cache configurations searched
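The total-energy equation above can be sketched as a small Python function. The function name is our own, and the constants in the test below are illustrative, not values reported by the paper:

```python
def memory_access_energy(hits, misses, cycles,
                         e_hit, e_miss, e_static_per_cycle):
    """E_mem = hits*E_hit + misses*E_miss + cycles*E_static_per_cycle.

    Dynamic energy comes from hit and miss counts; static energy
    accumulates with every executed cycle.
    """
    return hits * e_hit + misses * e_miss + cycles * e_static_per_cycle
```

On chip, the hit, miss, and cycle counts come from hardware counters, while the per-event energy values are application-independent constants stored in tuner registers.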
Problem Overview
A naive tuning approach exhaustively tries all possible cache configurations. It has two main drawbacks:
- It involves too many configurations
- It requires too many cache flushes (searching in an arbitrary order may require flushing the cache)
Goal: develop a self-tuning heuristic that minimizes the number of cache configurations examined and minimizes cache flushing, while still finding a near-optimal cache configuration. The heuristic
1) tunes dynamically during execution, and
2) can be enabled or disabled by software.
Heuristic Development Through Analysis
Energy dissipation for the benchmark parser at cache sizes from 1 KB to 1 MB:
As the cache grows, the energy dissipation of off-chip memory decreases rapidly, and increased cache performance together with decreased total energy is observed, up to a tradeoff point beyond which a larger cache improves performance only slightly while increasing energy significantly.
However, this tradeoff point differs across applications, and it exists not only for cache size but also for cache associativity and line size. Therefore, the goal of the search heuristic is to find the configuration at each such tradeoff point.
Determine the Impact of Each Parameter
The parameter with the greatest impact is configured first:
- Varying cache size has the biggest impact on miss rate and energy
- Varying line size causes little energy variation for the I-cache but more variation for the D-cache
- Varying associativity has the smallest impact on energy consumption
(Figure: energy under different line sizes and under different associativities.)
We therefore develop a search heuristic that finds the best cache size first, then the best line size, and finally the best associativity.
Minimizing Cache Flushing
The order in which each parameter's values are varied matters: one order may require flushing, while a different order may not.
Cache flush analysis when changing cache size:
- Increasing the cache size is preferable to decreasing it.
- When decreasing the cache size, an original hit may turn into a miss. Example: addresses 000 (index=00) and 110 (index=10) become misses after the corresponding ways are shut down. For the D-cache, dirty data in the shut-down ways must be written back.
- Increasing the cache size does not require flushing. Example: addresses 100 (index=0) and 010 (index=0) remain hits, so no write-back is needed and flushing is avoided.
(Figure: 8-byte memory example.)
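Why shrinking the cache can turn hits into misses can be seen from the index computation alone. The sketch below uses a simplified direct-mapped model, not the paper's way-shutdown cache, but the effect is the same: halving the number of sets changes which set an address maps to, so a resident line may no longer be found:

```python
def cache_index(addr, line_size, num_sets):
    """Set index of a byte address: (block number) mod (number of sets)."""
    return (addr // line_size) % num_sets

# With 16-byte lines, address 80 maps to set 5 in an 8-set cache,
# but to set 1 once the cache is halved to 4 sets; a lookup in the
# smaller cache would miss even though the data was resident before.
before = cache_index(80, 16, 8)
after = cache_index(80, 16, 4)
```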
Minimizing Cache Flushing (cont.)
Cache flush analysis when changing associativity:
- Increasing the associativity is preferable to decreasing it.
- Decreasing the associativity may turn a hit into a miss. Example: addresses 000 (index=0) and 100 (index=0).
- Increasing the associativity causes no extra misses. Example: addresses 000 (index=00) and 010 (index=10) both still hit after the associativity is increased.
Search Heuristic for Determining the Best Cache Configuration
Inputs to the heuristic:
- Cache size: C[i], 1 ≤ i ≤ n (n = 3 in our configurable cache): C[1] = 2 KB, C[2] = 4 KB, C[3] = 8 KB
- Line size: L[j], 1 ≤ j ≤ p (p = 3): L[1] = 16 bytes, L[2] = 32 bytes, L[3] = 64 bytes
- Associativity: A[k], 1 ≤ k ≤ m (m = 3): A[1] = 1 way, A[2] = 2 way, A[3] = 4 way
- Way prediction: W[1] = OFF, W[2] = ON
Starting from the smallest configuration with energy E[1], the heuristic first increases the cache size as long as each increase results in a total energy decrease; it then tunes the line size, and then the associativity, in the same way; finally, it tries enabling way prediction.
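The parameter-at-a-time search above can be sketched in Python. This is a minimal model of the heuristic, not the hardware implementation: `measure_energy` stands in for running the application for an interval under a configuration and evaluating the energy equations from the counters.

```python
def tune_cache(measure_energy):
    """Greedy one-parameter-at-a-time search (sketch of the heuristic).

    Tunes cache size first, then line size, then associativity, stopping
    a parameter's climb as soon as energy rises; finally tries enabling
    way prediction. measure_energy(size=..., line=..., assoc=..., wp=...)
    returns the measured energy for that configuration.
    """
    sizes, lines, assocs = [2, 4, 8], [16, 32, 64], [1, 2, 4]
    cfg = {"size": 2, "line": 16, "assoc": 1, "wp": False}
    best_e = measure_energy(**cfg)
    for key, values in (("size", sizes), ("line", lines), ("assoc", assocs)):
        for v in values[1:]:
            trial = {**cfg, key: v}
            e = measure_energy(**trial)
            if e < best_e:
                cfg, best_e = trial, e
            else:
                break  # energy rose: keep the previous value, move on
    trial = {**cfg, "wp": True}
    if measure_energy(**trial) < best_e:
        cfg = trial
    return cfg
```

Because each parameter's climb stops at the first energy increase, and sizes are searched smallest to largest, the search visits few configurations and avoids the cache-shrinking transitions that would require flushing.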
The Efficiency of the Search Heuristic
Suppose there are n configurable parameters and each parameter has m values. There are m^n different combinations in total, but our heuristic searches at most m*n combinations.
Example: with 10 configurable parameters, each with 10 values, brute-force searching examines 10^10 combinations; our search heuristic examines only 100.
Thus, our search heuristic both minimizes the number of cache configurations examined and avoids most cache flushing.
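The counting claim is easy to verify directly:

```python
# Exhaustive search grows exponentially in the number of parameters,
# while one-parameter-at-a-time search grows only linearly.
m, n = 10, 10            # 10 values per parameter, 10 parameters
brute_force = m ** n     # every combination
heuristic_max = m * n    # one pass per parameter, m values each
```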
Implementing the Heuristic in Hardware
A hardware-based approach is preferable to a software one: a software approach not only changes the runtime behavior of the application but also disturbs the very cache behavior being measured.
FSMD of the cache tuner:
- E_hit values: correspond to 8 KB 4-way, 2-way, and 1-way; 4 KB 2-way and 1-way; and 2 KB 1-way
- E_miss values: correspond to line sizes of 16, 32, and 64 bytes
- E_static_per_cycle values: correspond to cache sizes of 8, 4, and 2 KB
- Configure register (7 bits wide): 2 bits for cache size, 2 bits for line size, 2 bits for associativity, and 1 bit for way prediction
(Figure: the tuner datapath combines runtime information from the cache with these application-independent energy values; the result of the energy calculation for the lowest configuration tested is used to configure the cache.)
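A possible packing of the 7-bit configure register is sketched below. The paper specifies only the field widths (2 + 2 + 2 + 1 bits); the field order and the binary encodings chosen here are our own illustrative assumptions:

```python
def encode_config(size_kb, line_bytes, assoc, way_pred):
    """Pack a cache configuration into a 7-bit configure register.

    Field layout (an assumption, not from the paper):
    [6:5] cache size, [4:3] line size, [2:1] associativity, [0] way pred.
    """
    size_bits = {2: 0, 4: 1, 8: 2}[size_kb]
    line_bits = {16: 0, 32: 1, 64: 2}[line_bytes]
    assoc_bits = {1: 0, 2: 1, 4: 2}[assoc]
    return (size_bits << 5) | (line_bits << 3) | (assoc_bits << 1) | int(way_pred)
```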
Implementing the Heuristic in Hardware (cont.)
FSM of the cache tuner, composed of three smaller state machines:
- PSM: tunes each cache parameter in turn, determining the best cache size, then line size, then associativity, then way prediction
- VSM: determines the energy for the possible values of each parameter. Example: if the current PSM state is P1, VSM state V1 determines the energy of the 2 KB cache, V2 that of the 4 KB cache, and V3 that of the 8 KB cache
- CSM: controls the calculation of energy. Why do we need the CSM? Because the energy equation involves three multiplications but the tuner has only one multiplier, so four states are used to compute the energy
PSM states depend on the VSM, and VSM states depend on the CSM.
Results of the Search Heuristic
Searches 5.8 configurations on average, compared with 27 configurations for an exhaustive search.
Finds the optimal configuration in nearly all cases; the exceptions are the D-cache configurations of pjpeg and mpeg2.
The Reason for the Inaccuracy
A larger cache consumes more dynamic and static energy, so it is preferable only if the reduction in E_off_chip_mem overcomes that increase. For mpeg2 with an 8 KB cache, the reduction in E_off_chip_mem is not large enough to overcome the energy added by the larger cache, so the heuristic selects a cache size of 4 KB.
However, when associativity is later considered (increased from 1 way to 2 way), the miss rate of the 8 KB cache is significantly reduced.
The heuristic fails to choose the optimal configuration because, while it is determining the best cache size, it cannot predict what will happen when the associativity is later increased.
Area and Power of the Tuning Hardware
- The area of the cache tuner is about 4,000 gates, or 0.039 mm^2 in 0.18 um technology: an increase of just 3% over a MIPS 4Kp with cache.
- The power consumption of the cache tuner is 2.69 mW at 200 MHz: only 0.5% of the power consumed by a MIPS processor.
- Average energy consumption of the cache tuner: evaluating one cache configuration takes 164 cycles, and 5.4 configurations are searched on average, giving E_tuner = 2.69 mW × (164 / 200 MHz) × 5.4 = 11.9 nJ. Compared with the benchmarks' average energy dissipation of 2.34 J, this is negligible.
- Impact of avoiding flushes by careful ordering of the search: if the cache sizes were instead searched from largest to smallest, i.e. 8 KB down to 2 KB, the average energy consumed writing back dirty data would be 5.38 mJ, about 480,000 times the energy of the cache tuner itself.
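The tuner-energy arithmetic above checks out (units made explicit):

```python
# E_tuner = P_tuner * Time_total * NumSearch, using the slide's figures.
power_w = 2.69e-3                  # 2.69 mW tuner power at 200 MHz
time_per_search_s = 164 / 200e6    # 164 cycles at 200 MHz
searches = 5.4                     # average configurations examined
e_tuner_j = power_w * time_per_search_s * searches   # about 11.9 nJ
```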
Conclusions
Proposed a self-tuning on-chip CAD method that finds the best cache configuration automatically, relieving designers of the burden of determining the best configuration and increasing the usefulness and acceptance of configurable caches.
Our cache tuning heuristic:
- Minimizes the number of configurations examined
- Minimizes the need for cache flushing
- Reduces memory-access energy by 40% on average, compared with a standard cache