-
8/14/2019 A Highly Configurable Cache Architecture for Embedded
1/29
Chuanjun Zhang, UC Riverside
1
A Highly Configurable Cache Architecture for Embedded Systems
Chuanjun Zhang*, Frank Vahid**, and Walid Najjar
*Dept. of Electrical Engineering
Dept. of Computer Science and Engineering
University of California, Riverside
**Also with the Center for Embedded Computer Systems at UC Irvine
This work was supported by the National Science Foundation and the Semiconductor Research Corporation
-
Outline
Why a Configurable Cache?
What Parameters?
  Configurable Associativity by Way Concatenation
  Configurable Size by Way Shutdown
  Configurable Line Size
How to Configure Cache
  Cache Parameter Explorer
  A Heuristic Algorithm Searches the Pareto Set of Cache Parameters: Tradeoff Between Energy Dissipation and Performance
  The Explorer is Synthesized Using Synopsys
Conclusions and Future Work
-
Why Choose Cache: Impacts Performance and Power
Performance impacts are well known
Power
  ARM920T: Caches consume 50% of total processor system power (Segars 01)
  M*CORE: Unified cache consumes 50% of total processor system power (Lee/Moyer/Arends 99)
We'll show that a configurable cache can reduce that power nearly in half on average
-
Why a Configurable Cache?
An embedded system may execute one application forever
  Tuning the cache configuration (size, associativity, line size) can save a lot of energy
Associativity example
  40% difference in memory access energy
[Figure: miss rate (0.0% to 2.0%) and normalized energy (0% to 100%) versus associativity (1, 2, 4) for epic and mpeg2 from MediaBench]
-
Benefits of Configurable Cache
Mass production
  Unique chips getting more expensive as technology scales down (ITRS)
  Huge benefits to mass producing a single chip
  Harder to produce chips distinguished by cache when we have 50-100 processors per chip
Adapt to program phases
  Recent research shows programs have different cache requirements over time
  Much research assumes a configurable cache
-
Caches Vary Greatly in Embedded Processors
(As. = associativity, DM = direct mapped, Line = line size in bytes, /U = unified cache)

Processor                Instruct. Cache          Data Cache
                         Size    As.   Line       Size    As.   Line
AMD-K6-IIIE              32K     2     32         32K     2     32
Alchemy AU1000           16K     4     32         16K     4     32
ARM 7                    8K/U    4     16         8K/U    4     16
ColdFire                 0-32K   DM    16         0-32K   N/A   N/A
Hitachi SH7750S (SH4)    8K      DM    32         16K     DM    32
Hitachi SH7727           16K/U   4     16         16K/U   4     16
IBM PPC 750CX            32K     8     32         32K     8     32
IBM PPC 7603             16K     4     32         16K     4     32
IBM750FX                 32K     8     32         32K     8     32
IBM403GCX                16K     2     16         8K      2     16
IBM Power PC 405CR       16K     2     32         8K      2     32
Intel 960JA              2K      2     N/A        1K      2     N/A
Intel 960JD              4K      2     N/A        2K      2     N/A
Intel 960IT              16K     2     N/A        4K      2     N/A
Motorola MPC8240         16K     4     32         16K     4     32
Motorola MPC8540         32K     4     32/64      32K     4     32/64
Motorola MPC7455         32K     8     32         32K     8     32
NEC VR5500               32K     2     32         32K     2     32
NEC VR4131               16K     2     16/32      16K     2     16/32
NEC VR4181               4K      DM    16         4K      DM    16
NEC VR4181A              8K      DM    32         8K      DM    32
NEC VR4121               16K     DM    16         8K      DM    16
PMC Sierra RM9000X2      16K     4     N/A        16K     4     N/A
PMC Sierra RM7000A       16K     4     32         16K     4     32
SandCraft sr71000        32K     4     32         32K     4     32
Sun Ultra SPARC IIe      16K     2     N/A        16K     DM    N/A
SuperH                   32K     4     32         32K     4     32
TI TMS320C6414           16K     DM    N/A        16K     2     N/A
TriMedia TM32A           32K     8     64         16K     8     64
Xilinx Virtex IIPro      16K     2     32         8K      2     32
-
Configurable Associativity by Way Concatenation
Four-way set-associative base cache
Ways can be concatenated to form two-way
Can be further concatenated to direct-mapped
Concatenation is logical: only 1 array accessed
[Figure: Way 1 to Way 4 as four-way; Way 1 and Way 2 as two-way; Way 1 as direct mapped]
C. Zhang (ISCA 03)
-
Way-Concatenate Cache Architecture
Trivial area overhead
No performance overhead
  NAND transistors enlarged to match inverter speed
  Configuration circuit operates concurrent to decoders
Address bits: a31 to a13 tag address; a12, a11; a10 to a5 index; a4 to a0 line offset
Configuration registers:
  reg0  reg1  ways
  0     0     DM
  0     1     2
  1     0     2
  1     1     4
[Figure: way-concatenate cache datapath showing the configuration circuit (reg0 and reg1 producing c0 to c3), 6x64 decoders, tag part, data array, mux driver, data output, and the critical path]
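The reg0/reg1 table can be read as two mask bits that decide whether address bits a11 and a12 take part in selecting a way. The following is a minimal Python sketch of that reading, not the circuit from the slide: it assumes reg0 gates a11 and reg1 gates a12 (the slide does not say which register pairs with which bit), numbers the ways 0 to 3, and uses a hypothetical function name `enabled_ways`.

```python
from typing import List


def enabled_ways(reg0: int, reg1: int, a11: int, a12: int) -> List[int]:
    """Return the ways (0..3) activated for one access.

    Assumption: a register bit of 1 means the matching address bit is
    ignored, so reg0 = reg1 = 1 gives four-way and reg0 = reg1 = 0 gives
    direct mapped, as in the slide's table.
    """
    ways = []
    for way in range(4):
        low, high = way & 1, (way >> 1) & 1
        low_ok = reg0 == 1 or a11 == low      # a11 compared only if reg0 is 0
        high_ok = reg1 == 1 or a12 == high    # a12 compared only if reg1 is 0
        if low_ok and high_ok:
            ways.append(way)
    return ways
```

Under these assumptions the number of activated ways is 4, 2, or 1, which is where the dynamic-energy saving of the lower associativities comes from.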
-
Previous Method: Way Shutdown
Albonesi proposed a cache where ways could be shut down
  To save dynamic power
Motorola M*CORE has same way-shutdown feature
  Unified cache even allows setting each way as I, D, both, or off
[Figure: Way 1, Way 2, Way 3, Way 4]
Reduces dynamic power by accessing fewer ways
But, decreases total size, so may increase miss rate
-
Way Shutdown Can be Good for Static Power
Static power (leakage) increasingly important in nanoscale technologies
We combine way shutdown with way concatenate
  Use sleep transistor method of Powell (ISLPED 2000)
[Figure: SRAM cell with Vdd, Gnd, two bitlines, and a Gated-Vdd control transistor]
When off, prevents leakage. But 4% performance overhead
-
Cache Line Size
[Figure: case A, a 64B cache line filled with 64B of consecutive code; case B, a 64B cache line holding 16B of non-consecutive code, so 48B are wasted]
C. Zhang (ISVLSI 03)
-
Configurable Cache Line Size With Line Concatenation
The physical line size is 16 bytes
A programmable counter is used to designate the line size
An interleaved off-chip memory organization
4 physical lines are filled when line size is 64 bytes
[Figure: one way of the cache with 16-byte physical lines, connected through a counter and a bus to off-chip memory]
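To make the fill behaviour concrete, here is a small Python sketch of which 16-byte physical lines would be fetched on a miss for a given configured line size. It is an illustration under assumptions: the function name `physical_lines_to_fill` is hypothetical, the supported logical sizes are taken to be 16, 32, and 64 bytes, and the fill is assumed to start at the aligned base of the logical line.

```python
from typing import List

PHYSICAL_LINE_BYTES = 16  # physical line size stated on the slide


def physical_lines_to_fill(miss_address: int, line_size: int) -> List[int]:
    """Return the base addresses of the physical lines filled on a miss."""
    if line_size not in (16, 32, 64):
        raise ValueError("line size assumed to be 16, 32, or 64 bytes")
    # Align the miss address down to the start of the logical line.
    base = miss_address - (miss_address % line_size)
    # The counter value corresponds to the number of physical lines per fill.
    count = line_size // PHYSICAL_LINE_BYTES
    return [base + i * PHYSICAL_LINE_BYTES for i in range(count)]
```

With a 64-byte line size the list has four entries, matching the "4 physical lines are filled" statement; with a 16-byte line size only the missing physical line is fetched.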
-
Computing Total Memory-Related Energy
Considers CPU stall energy and off-chip memory energy
Excludes CPU active energy
Thus, represents all memory-related energy

energy_mem = energy_dynamic + energy_static
energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
energy_miss = energy_offchip_access + energy_uP_stall + energy_cache_block_fill
energy_static = cycles * energy_static_per_cycle

energy_miss = k_miss_energy * energy_hit
energy_static_per_cycle = k_static * energy_total_per_cycle
(We varied the k's to account for different system implementations)

Measured quantities:
  SimpleScalar (cache_hits, cache_misses, cycles)
  Our layout or data sheets (others)
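The energy equations on this slide translate directly into a few lines of Python. The sketch below simply evaluates them; the function name `memory_energy` and the choice to pass every quantity as an argument are illustrative assumptions, and any numbers fed to it are placeholders rather than measurements from the work.

```python
def memory_energy(cache_hits: int, cache_misses: int, cycles: int,
                  energy_hit: float, energy_offchip_access: float,
                  energy_up_stall: float, energy_cache_block_fill: float,
                  energy_static_per_cycle: float) -> float:
    """Evaluate energy_mem = energy_dynamic + energy_static."""
    # Energy of one miss: off-chip access + processor stall + block fill.
    energy_miss = (energy_offchip_access + energy_up_stall
                   + energy_cache_block_fill)
    # Dynamic energy from hit and miss counts (e.g. reported by a simulator).
    energy_dynamic = cache_hits * energy_hit + cache_misses * energy_miss
    # Static (leakage) energy accumulates over every cycle.
    energy_static = cycles * energy_static_per_cycle
    return energy_dynamic + energy_static
```

Because the miss term bundles off-chip access, stall, and fill energy, a configuration with a slightly higher hit energy can still win overall if it removes enough misses.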
-
Energy Savings
Energy savings when way concatenation, way shutdown, and cache line size concatenation are implemented. (C. Zhang, ACM TECS, to appear)
[Figure: normalized energy (0% to 120%) for benchmarks padpcm, crc, auto2, bcnt, bilv, binary, blit, brev, g3fax, fir, pjepg, ucbqsort, v42, adpcm, epic, g721, pegwit, mpeg, jpeg, art, mcf, parser, and vpr under configurations cnv8K4W32B, cnv8K1W32B, cfg8Kwc32B, cfg8Kwcws32B, and cfg8Kwcwslc; labels on off-scale bars: 127%, 620%, 12]
-
Cache Parameters that Consume the Lowest Energy Vary Across Applications

           Best Configuration
Ben.       I$         D$
padpcm     8K1W32B    8K1W32B
crc        2K1W32B    4K1W64B
auto       8K2W16B    4K1W32B
bcnt       2K1W32B    2K1W64B
bilv       4K1W32B    2K1W32B
binary     2K1W32B    2K1W32B
blit       2K1W16B    8K2W32B
brev       4K1W32B    2K1W32B
g3fax      4K1W32B    4K1W16B
fir        4K1W32B    2K1W32B
jpeg       8K4W32B    4K2W32B
pjepg      4K1W32B    4K2W64B
ucbqsort   4K1W16B    4K1W64B
v42        8K1W16B    8K2W16B
adpcm      2K1W16B    4K1W16B
epic       2K1W64B    8K1W16B
g721       8K4W16B    2K1W16B
pegwit     4K1W16B    4K1W16B
mpeg2      4K1W32B    8K2W16B
art        2K1W32B    2K1W16B
parser     8K4W16B    8K2W64B
mcf        8K4W16B    8K1W16B
vpr        8K4W32B    2K1W16B
-
How to Configure Cache
Simulation-based methods
  Drawback: slowness. Seconds of real-time work may take tens of hours to simulate
  Setting up the simulation tools increases the time
Self-exploring method
  Cache parameter explorer
  Incorporated on a prototype platform
  Pareto parameters: a set of parameters that shows the performance and energy tradeoff
-
Cache self-exploring hardware
An explorer is used to detect the Pareto set of cache parameters
The explorer stands aside to collect information used to calculate the energy
[Figure: block diagram with Processor, I$, D$, Mem, and the Explorer]
-
Pareto parameter sets
[Figure: pegwit, time (million cycles, 56 to 72) versus energy (mJ, 0.04 to 0.16). Point A: lowest energy. Point B: best performance. Region C: tradeoff between energy and performance. Point D: not a Pareto point.]
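The distinction between points A, B, C and the non-Pareto point D can be stated as a dominance test: a configuration is a Pareto point if no other configuration is at least as good in both energy and time and strictly better in one. A minimal Python sketch follows; the function name `pareto_points` and the sample numbers in the usage note are illustrative assumptions, not values read from the pegwit plot.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (energy, time); lower is better for both


def pareto_points(points: List[Point]) -> List[Point]:
    """Return the points that no other point dominates."""
    front = []
    for i, (energy, time) in enumerate(points):
        dominated = any(
            other_e <= energy and other_t <= time
            and (other_e < energy or other_t < time)
            for j, (other_e, other_t) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append((energy, time))
    return front
```

For example, with made-up points `[(0.05, 70), (0.15, 57), (0.08, 62), (0.12, 66)]`, the last point is dropped because `(0.08, 62)` uses less energy and less time, which is the situation of point D.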
-
Heuristic algorithm
Search all possible cache configurations
  Time consuming. Considering other configurable parameters (voltage levels, bus width, etc.), the search space will increase very quickly to millions.
A heuristic is proposed
  First, search for point A
    The order in which parameters are searched matters; no cache flush is needed
  Then search for point B
  Last, search for points in region C
[Figure: time versus energy (mJ, 0.04 to 0.16) marking A (lowest energy), B (best performance), and C (tradeoff)]
-
Impact of Cache Parameters on Miss Rate and Energy
Average instruction cache miss rate and normalized energy of the benchmarks.
[Figure: two panels, average I-cache miss rate (0% to 12%) and average I-cache energy (0.0 to 1.0), each for cache sizes 8k, 4k, and 2k; line size is varied over 16B, 32B, 64B with one way, and associativity over 1W, 2W, 4W with line size 32B]
-
Energy Dissipation on On-Chip Cache and Off-Chip Memory
[Figure: energy (J, 0 to 5) versus cache size (1KB to 1MB) with curves for cache, memory, and total. Benchmark: parser]
-
Searching for Point A
Point A: the least energy cache configuration
Search order:
  Search cache size
  Search line size
  Search associativity
  Way prediction
[Figure: ways W1 to W4; time versus energy (mJ, 0.04 to 0.16) marking A, lowest energy]
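The search order for point A (cache size, line size, associativity, way prediction) can be sketched as a one-parameter-at-a-time search. The Python below is a hedged illustration, not the explorer's actual logic: `measure_energy` is a hypothetical callback standing in for whatever the explorer measures, the value ranges are the ones shown in the deck's charts, starting from the smallest values and stopping a parameter as soon as energy stops improving are both assumptions, and the sketch does not check whether a combination is realizable in the way-concatenate cache.

```python
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]

# Search order from the slide; starting at the smallest value is an assumption.
SEARCH_ORDER: List[Tuple[str, List[int]]] = [
    ("size_kb", [2, 4, 8]),
    ("line_bytes", [16, 32, 64]),
    ("ways", [1, 2, 4]),
    ("way_prediction", [0, 1]),
]


def search_point_a(
    measure_energy: Callable[[Config], float],
) -> Tuple[Config, float, int]:
    """Greedy search; returns (config, energy, number of evaluations)."""
    config = {name: values[0] for name, values in SEARCH_ORDER}
    best = measure_energy(config)
    evaluations = 1
    for name, values in SEARCH_ORDER:
        for value in values[1:]:
            trial = dict(config, **{name: value})
            energy = measure_energy(trial)
            evaluations += 1
            if energy >= best:
                break  # assumed early stop: no improvement, next parameter
            config, best = trial, energy
    return config, best, evaluations
```

Each parameter is visited once with the others held fixed, so the number of evaluations grows with the sum of the parameter ranges rather than their product.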
-
Searching for Point B
Point B: the best performance cache configuration
  High associativity doesn't mean high performance
  Large line size may not be good for data cache
Search order:
  Fix cache size
  Search line size
  Search associativity
  No way prediction
[Figure: ways W1 to W4; time versus energy (mJ, 0.04 to 0.16) marking A and B, best performance]
-
Searching for Point C
Cache parameters in region C represent the tradeoff between energy and performance
Choose cache parameters between points A and B.
  If the cache sizes at points A and B are 8K and 4K respectively, then the cache size of points in region C will be tested at 8K and 4K.
  Combinations of point A's and B's parameters are tested.

Point           A     B     C
Line size       64    64    64
Cache size      2K    8K    4K, 8K
Associativity   1W    4W    1W, 1W, 2W

[Figure: time versus energy (mJ, 0.04 to 0.16) marking A, B, and region C, the tradeoff between energy and performance]
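One way to read "choose cache parameters between points A and B" is to enumerate every combination whose parameter values lie between A's and B's values, then test those candidates. The Python sketch below follows that reading, which is an assumption; the name `region_c_candidates` and the `ALLOWED` value lists are illustrative, and the real explorer may test fewer combinations.

```python
from itertools import product
from typing import Dict, List

Config = Dict[str, int]

# Assumed legal values per parameter, taken from the ranges shown in the deck.
ALLOWED: Dict[str, List[int]] = {
    "size_kb": [2, 4, 8],
    "line_bytes": [16, 32, 64],
    "ways": [1, 2, 4],
}


def region_c_candidates(point_a: Config, point_b: Config) -> List[Config]:
    """List configurations lying between A and B, excluding A and B."""
    names = list(ALLOWED)
    ranges = []
    for name in names:
        low, high = sorted((point_a[name], point_b[name]))
        ranges.append([v for v in ALLOWED[name] if low <= v <= high])
    candidates = []
    for combo in product(*ranges):
        config = dict(zip(names, combo))
        if config != point_a and config != point_b:
            candidates.append(config)
    return candidates
```

With A at 2K, 1W, 64B and B at 8K, 4W, 64B, as in the table, every candidate keeps the 64-byte line and only size and associativity vary.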
-
Implementing the Heuristic in Hardware
Total size of the explorer
  About 4,200 gates, or 0.041 mm^2 in 0.18 micron CMOS technology.
Area overhead
  Compared to the reported size of the MIPS 4Kp with cache, this represents just over a 3% area overhead.
Power consumption
  2.69 mW at 200 MHz. The power overhead compared with the MIPS 4Kp would be less than 0.5%.
  Furthermore, the exploring hardware is used only during the exploring stage, and can be shut down after the best configuration is determined.
-
How good is the heuristic?
Time complexity
  Search all space: O(m x n x l x p)
  Heuristic: O(m + n + l + p)
  m: number of associativities, n: number of cache sizes, l: number of cache line sizes, p: way prediction on/off
Efficiency
  On average, 5 searches instead of 27 total searches can find point A
  2 out of 19 benchmarks miss the lowest power cache configuration.
  With a different search order (line size, associativity, way prediction, and cache size), 11 out of 19 benchmarks miss the best configuration
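The gap between the two complexity bounds is easy to see with concrete counts. The short Python sketch below just evaluates the product and the sum from the slide; the function name `search_counts` is hypothetical, and treating the sum as an upper bound on heuristic evaluations is an assumption.

```python
from typing import Tuple


def search_counts(m: int, n: int, l: int, p: int) -> Tuple[int, int]:
    """Return (exhaustive, heuristic) configuration counts.

    m, n, l, p: number of associativities, cache sizes, line sizes, and
    way-prediction settings, as defined on the slide.
    """
    exhaustive = m * n * l * p   # every combination is tried
    heuristic = m + n + l + p    # each parameter is swept once
    return exhaustive, heuristic
```

With three values each for associativity, size, and line size and way prediction left out (p = 1), the exhaustive count is 27, which appears to be the "27 total searches" quoted above; that correspondence is an inference rather than something the slide states.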
-
Results of Some Other Benchmarks
[Figure: four panels plotting time (cycles) versus energy for bilv (energy in mJ), pegwit (energy in mJ), padpcm (energy in nJ), and crc (energy in mJ)]
-
Conclusion and Future Work
A configurable cache architecture is proposed.
  Associativity, size, line size.
A cache parameter explorer is implemented to find the cache parameters.
A heuristic algorithm is proposed to search the Pareto cache parameter sets.
  The complexity of the heuristic is O(m+n+l) instead of O(m*n*l)
  Only 95% of the Pareto points can be found by the heuristic
Overhead
  Little area and power overhead, and no performance overhead.
Future Work
  Dynamically detect the cache parameters.