Architectural and Compiler Techniques for Energy
Reduction in High-Performance Microprocessors
Nikolaos Bellas, Ibrahim N. Hajj, Fellow, IEEE, Constantine D. Polychronopoulos, and George Stamoulis
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 3, JUNE 2000
Presenter: R.T.-Gu
Abstract
In this paper, we focus on low-power design techniques for high-performance processors at the architectural and compiler levels. We focus mainly on developing methods for reducing the energy dissipated in the on-chip caches. Energy dissipated in caches represents a substantial portion of the energy budget of today's processors. Extrapolating current trends, this portion is likely to increase in the near future, since the devices devoted to the caches occupy an increasingly larger percentage of the total area of the chip.
We propose a method that uses an additional mini-cache located between the I-Cache and the central processing unit (CPU) core, which buffers instructions that are nested within loops and would otherwise be fetched repeatedly from the I-Cache. This mechanism is combined with code modifications, through the compiler, that greatly simplify the required hardware, eliminate unnecessary instruction fetching, and consequently reduce signal switching activity and the dissipated energy.
We show that the additional cache, dubbed L-Cache, is much smaller and simpler than the I-Cache when the compiler assumes the role of allocating instructions to it. Through simulation, we show that for the SPECfp95 benchmarks, the I-Cache remains disabled most of the time, and the "cheaper" extra cache is used instead. We also propose different techniques that are better adapted to nonnumeric, non-loop-intensive code.
Outline
What's the problem?
The main idea
Software support
Hardware support
Energy estimation
Results & discussion
Conclusion
What’s the problem?
The I-Cache subsystem is one of the main power consumers in most of today's microprocessors. The on-chip L1 and L2 caches of the DEC Alpha chip dissipate 25% of the total power of the processor [1], and the I-Cache of the StrongARM SA-110 dissipates about 27% of the total power [2].
The I-Cache subsystem is such a large power consumer because it runs at a very high clock rate (the same as the CPU) and has very high switching activity.
How can we reduce the switching activity of the cache?
The main idea
Use an additional mini-cache located between the I-Cache and the CPU. During a loop execution, the I-Cache unit frequently repeats its previous tasks over and over again.
Reduce the fetch frequency of the IF unit by buffering loop instructions in another, smaller cache (the L-Cache).
Software support
We have to find out the most frequently executed basic blocks in loops
The compiler lays out the target program so that the selected blocks can be placed in L-Cache
How to find out the most frequently executed basic blocks
The compiler places the basic blocks in the L-Cache by determining their nesting and using their execution profile.
The determination consists of two distinct phases:
• Function inlining: the compiler tries to expose as many basic blocks as possible in frequently executed routines.
• Block placement: the compiler selects basic blocks and places them so that the utilization of the extra cache is maximized.
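The block-placement phase can be illustrated with a short sketch. This is a minimal, hypothetical greedy selection, not the paper's exact algorithm: the block fields, the frequency threshold, and the frequency-per-size ranking are illustrative assumptions.

```python
# Hypothetical sketch of the block-placement step: greedily pick the most
# frequently executed nested basic blocks until the L-Cache is full.
# Block fields and thresholds are illustrative, not the paper's exact ones.

def select_blocks(blocks, cache_size, freq_threshold):
    """blocks: list of (name, size_in_instructions, exec_frequency, is_nested)."""
    # Consider only nested blocks above the frequency threshold.
    candidates = [b for b in blocks if b[3] and b[2] >= freq_threshold]
    # Favor blocks that amortize their size over many executions.
    candidates.sort(key=lambda b: b[2] / b[1], reverse=True)
    placed, used = [], 0
    for name, size, freq, _ in candidates:
        if used + size <= cache_size:
            placed.append(name)
            used += size
    return placed

blocks = [
    ("B1", 8, 5000, True),   # hot inner-loop block
    ("B2", 16, 4000, True),
    ("B3", 4, 100, True),    # below the frequency threshold
    ("B4", 8, 9000, False),  # frequent but not nested in a loop
]
print(select_blocks(blocks, cache_size=24, freq_threshold=500))  # ['B1', 'B2']
```

The sketch captures the key idea: only nested, frequently executed blocks are candidates, and the cache capacity bounds what is placed.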
How to select and then place
First step: Nesting Computation
Second step: LabelTree Construction
Third step: Basic Block Selection and Placement
Fourth and fifth steps: Global Placement in the Memory
Nesting Computation
A control flow graph is built for each function.
Find the loops and the nesting for every basic block, and record it as LabelSet(B).
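The LabelSet computation can be sketched as follows. This assumes loop detection (back edges, natural loops) has already been done on the control flow graph; the loop labels and block names are illustrative.

```python
# Illustrative sketch of the nesting computation: given the loops found in a
# function's control-flow graph, LabelSet(B) collects the labels of every loop
# that contains basic block B. Loop detection itself (back edges, natural
# loops) is assumed to have run already.

def label_sets(loops):
    """loops: dict mapping loop label -> set of basic-block names in its body."""
    sets = {}
    for label, body in loops.items():
        for block in body:
            sets.setdefault(block, set()).add(label)
    return sets

# Two nested loops: L2 is inside L1, so B2 and B3 carry both labels.
loops = {"L1": {"B1", "B2", "B3"}, "L2": {"B2", "B3"}}
print(label_sets(loops))
```

A block's nesting depth is then simply the size of its LabelSet, which is what the later selection steps use to prefer deeply nested blocks.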
LabelTree Construction
Placement algorithm
Global Placement in the Memory
The user can adjust the thresholds used to select basic blocks in the first stage, trading off performance degradation against power savings.
Hardware support
A 32-b register holds the address of the first nonplaced block in the main memory layout. If the PC is less than that address, a comparator sets blocked_part on; otherwise the signal is set off.
Energy Estimation
The energy model is based on the work by Wilton and Jouppi [16], who propose a timing analysis model for SRAM-based caches.
The model uses run-time information about cache utilization (number of accesses, number of hits and misses, input statistics, etc.) gathered during simulation. A 0.8-μm technology with a 3.3-V supply voltage is assumed.
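The shape of such an estimate can be sketched as below. The per-event energy values are placeholders for illustration only, not the numbers produced by the SRAM model used in the paper.

```python
# Hedged sketch of turning run-time cache statistics into an energy estimate:
# every access pays a base access energy, and each miss additionally pays a
# (much larger) refill energy. The per-event energies are placeholders, not
# values from the Wilton-Jouppi-based model.

def cache_energy(n_accesses, n_misses, e_access_nj=0.5, e_miss_nj=5.0):
    """Total cache energy in nanojoules from simulation-gathered counts."""
    return n_accesses * e_access_nj + n_misses * e_miss_nj

print(cache_energy(n_accesses=1_000_000, n_misses=20_000))  # 600000.0 nJ
```

The point of the L-Cache is visible directly in this form: redirecting fetches to a smaller structure lowers the effective per-access energy term.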
Simulation results
Discussion
A larger L-Cache saves more power: it is more successful at storing basic blocks, and therefore at disabling the I-Cache for a larger fraction of the time.
A 256-instruction L-Cache is enough: in most cases, it approximates the performance of an infinite-size L-Cache.
Performance is worse on integer benchmarks: most integer benchmarks do not have a large number of basic blocks that can be cached in the L-Cache (they are not nested in loops).
Modified Scheme for Integer Benchmarks
Focus on a function (not a nested loop): select the function with the largest contribution to the execution time, and place its most important basic blocks permanently in the L-Cache.
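The first step of the modified scheme can be sketched as a simple profile lookup. The function names and time fractions below are made up for illustration.

```python
# Sketch of the modified selection for integer code: pick the single function
# with the largest share of execution time; its hottest basic blocks are then
# placed permanently in the L-Cache. The profile data here is illustrative.

def pick_function(profile):
    """profile: dict mapping function name -> fraction of execution time."""
    return max(profile, key=profile.get)

profile = {"main": 0.10, "compress": 0.55, "output": 0.35}
print(pick_function(profile))  # compress
```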
Experimental Evaluation of the Modified Scheme
The results are very encouraging for benchmarks that have poor performance under the initial method.
Conclusion
They have developed techniques for hardware/software codesign in high-performance processors that result in energy/power reduction at the system level.
Major energy gains can be obtained if the compiler and the hardware are designed with low energy in mind.
The dynamic mix of instructions for SPECint95 benchmarks
An instruction belongs to one of the six following categories:
1) "P" if it has been selected by the algorithm to be positioned in the L-Cache;
2) "U" if it is in a basic block with a small execution frequency (unimportant);
3) "NN" if it is in a block with large execution frequency but not nested in a loop;
4) "SD" if it is in a nested block with large execution frequency but small execution density;
5) "SS" if it belongs to a nested block with large frequency and execution density but small size;
6) "L" if it satisfies all the above criteria but does not fit in the L-Cache.
The frequency threshold is 1/10,000 of the execution time of the program; the execution density threshold is five executions per function.
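The six-way classification above can be sketched as a chain of checks. The concrete thresholds and the boolean "fits" input are simplified stand-ins for the algorithm's actual bookkeeping, used only to illustrate the decision order.

```python
# Illustrative classifier for the six instruction categories. The order of the
# checks mirrors the numbered list above; the thresholds and the "fits"
# predicate are simplified stand-ins, not the paper's exact machinery.

def classify(freq, nested, density, size, placed, fits,
             freq_thr=1e-4, density_thr=5, size_thr=2):
    if placed:
        return "P"    # selected for placement in the L-Cache
    if freq < freq_thr:
        return "U"    # unimportant: small execution frequency
    if not nested:
        return "NN"   # frequent but not nested in a loop
    if density < density_thr:
        return "SD"   # small execution density
    if size < size_thr:
        return "SS"   # too small to be worth placing
    if not fits:
        return "L"    # meets all criteria but does not fit in the L-Cache
    return "P"

print(classify(freq=1e-3, nested=False, density=9, size=4,
               placed=False, fits=True))  # NN
```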