
Branch Prediction On Demand: an Energy-Efficient Solution

Daniel Chaver, Luis Piñuel, Manuel Prieto, Francisco Tirado

Dpto. Arquitectura de Computadores, Universidad Complutense, Madrid, Spain

Michael C. Huang

Dept. of Electrical & Computer Engineering, University of Rochester, Rochester, New York

Presented by Shugen Li

Introduction

Branch prediction is a central piece of technology in exploiting instruction-level parallelism.

Modern high-end processors use an array of large tables for branch direction and target prediction.

This results in significant energy consumption, sometimes more than 10% of the total chip power.

The authors apply the general principle of “on-demand resource allocation” to the branch predictor and dynamically adjust the strength of the predictor.

Reduce the overhead caused by:

• Reconfiguration overhead

• Energy waste due to increased mis-speculation

Methods:

• Partition an application into smaller units called modules

• Profiling

• Instrument the application to reconfigure the predictor at runtime

Twofold benefits

Introduction (cont’d)

Two techniques at the circuit level

ACCESS GATING: a novel access gating technique to reduce ineffectual switching.

STRUCTURE RESIZING: table resizing to reduce unnecessarily large capacitive load.

ON-DEMAND BRANCH PREDICTION

Branch Predictor Reconfiguration

1. Adaptive Hybrid Predictor through Access Gating

2. Adaptive BTB through Dynamic Resizing

The Profiling Approach

Adaptive Hybrid Predictor through Access Gating

A de-aliased hybrid direction predictor: 2Bc-gskew-pskew

majority voting.

Meta tables
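To make the idea of access gating concrete, here is a minimal Python sketch. It is not the 2Bc-gskew-pskew design itself: the meta tables and the pskew component are left out, the index hashes are hypothetical stand-ins for the real skewing functions, and the counter of table reads is only a crude proxy for access energy. It shows a majority vote over a bimodal table and two global-history tables, and how gating the global tables avoids reading them at all.

```python
from dataclasses import dataclass, field

TABLE_SIZE = 4096  # 4K entries, as in the baseline tables listed in the slides


def is_taken(counter: int) -> bool:
    """A 2-bit saturating counter predicts taken in its upper half (2 or 3)."""
    return counter >= 2


@dataclass
class GatedHybridPredictor:
    # Counters start at 1 (weakly not-taken); this initial value is an assumption.
    bimodal: list = field(default_factory=lambda: [1] * TABLE_SIZE)
    g0: list = field(default_factory=lambda: [1] * TABLE_SIZE)
    g1: list = field(default_factory=lambda: [1] * TABLE_SIZE)
    gskew_enabled: bool = True
    table_reads: int = 0  # crude proxy for dynamic access energy

    def predict(self, pc: int, history: int) -> bool:
        bim_idx = (pc >> 2) % TABLE_SIZE
        self.table_reads += 1
        bim_vote = is_taken(self.bimodal[bim_idx])
        if not self.gskew_enabled:
            # Gated: the global-history tables are never read.
            return bim_vote
        # Hypothetical index hashes, NOT the real skewing functions.
        g0_idx = ((pc >> 2) ^ history) % TABLE_SIZE
        g1_idx = ((pc >> 2) ^ (history << 1)) % TABLE_SIZE
        self.table_reads += 2
        votes = [bim_vote, is_taken(self.g0[g0_idx]), is_taken(self.g1[g1_idx])]
        return sum(votes) >= 2  # majority of three
```

With the global tables gated, each prediction costs one table read instead of three in this model, at the price of falling back to the bimodal vote alone.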

Adaptive BTB through Dynamic Resizing

The branch target buffer (BTB) helps provide the target address quickly.

Increasing the size of the BTB helps reduce conflict and capacity misses. However, large structures can be a waste.

The authors propose to resize the BTB on the fly. Resizing schemes are distinguished by the kind of partitioning they are based on: bitline (selective-sets [15]) and wordline (selective-ways [1]).

The example scales the size down to a 256-set/2-way configuration or a 2048-set/1-way configuration.
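The following Python sketch illustrates how selective-sets and selective-ways resizing change the behaviour of a BTB. It is a functional model only, under several assumptions that do not come from the slides: the class name `ResizableBTB`, the full-size geometry passed to the constructor, flushing all entries on a resize, LRU replacement, and using the whole word address as the tag.

```python
class ResizableBTB:
    """Functional model of a set-associative BTB whose active size can shrink."""

    def __init__(self, max_sets: int, max_ways: int):
        self.max_sets, self.max_ways = max_sets, max_ways
        self.active_sets, self.active_ways = max_sets, max_ways
        # Each set is a list of (tag, target) pairs, most recently used first.
        self.sets = [[] for _ in range(max_sets)]

    def resize(self, active_sets: int, active_ways: int) -> None:
        power_of_two = active_sets > 0 and active_sets & (active_sets - 1) == 0
        if not power_of_two or active_sets > self.max_sets:
            raise ValueError("active_sets must be a power of two <= max_sets")
        if not 0 < active_ways <= self.max_ways:
            raise ValueError("active_ways must be in 1..max_ways")
        self.active_sets, self.active_ways = active_sets, active_ways
        for s in self.sets:
            s.clear()  # assumption: contents are flushed on reconfiguration

    def _index_and_tag(self, pc: int):
        word = pc >> 2
        # Selective-sets: fewer active sets means fewer index bits.
        return word & (self.active_sets - 1), word

    def lookup(self, pc: int):
        idx, tag = self._index_and_tag(pc)
        for t, target in self.sets[idx]:
            if t == tag:
                return target
        return None  # BTB miss

    def update(self, pc: int, target: int) -> None:
        idx, tag = self._index_and_tag(pc)
        entries = [e for e in self.sets[idx] if e[0] != tag]
        entries.insert(0, (tag, target))
        # Selective-ways: only active_ways entries per set are kept.
        self.sets[idx] = entries[: self.active_ways]
```

In this model, three branches at `0x1000`, `0x1400` and `0x1800` occupy different sets with 2048 active sets, but collide in one set once the BTB is resized to 256 sets, so a 2-way configuration can hold only two of them.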

The Profiling Approach

Adaptive architecture: a profile-based feedback mechanism estimates branch prediction demand without incurring runtime overhead.

A module is the smallest unit to which branch prediction reconfiguration is applied.

Two thresholds are used in selecting subroutines: the average length per invocation, Thgrain, and the total execution time weight, Thweight.

Profiling stage: using the training input, select the best configuration for each module.
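The two-threshold selection of modules can be sketched as follows. The slides only name the thresholds, so the details here are assumptions: instruction counts stand in for length and execution time, a subroutine qualifies when it meets or exceeds both thresholds, and the names `SubroutineProfile` and `select_modules` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SubroutineProfile:
    name: str
    invocations: int
    total_instructions: int  # assumed proxy for time spent in the subroutine


def select_modules(profiles, th_grain: float, th_weight: float):
    """Return names of subroutines that pass both thresholds (assumed >=)."""
    program_total = sum(p.total_instructions for p in profiles)
    selected = []
    for p in profiles:
        if p.invocations == 0 or program_total == 0:
            continue
        avg_length = p.total_instructions / p.invocations  # compared to Thgrain
        weight = p.total_instructions / program_total      # compared to Thweight
        if avg_length >= th_grain and weight >= th_weight:
            selected.append(p.name)
    return selected
```

The intuition behind the two tests is that a subroutine that is too short per invocation would reconfigure too often, and one with too little total weight offers little energy to save.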

EXPERIMENTAL SETUP

The baseline branch predictor used is a 2Bc-gskew-pskew predictor configured with:

• Two 4K-entry meta tables

• A 4K-entry bimodal table

• A gskew component consisting of two 4K-entry global history tables that use 10 bits of global history

• A pskew component consisting of a 1K-entry local history table (8 bits wide) with two 2K-entry PHTs
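As a rough check on the size of this configuration, the table storage can be added up. The entry counts and the 8-bit local history width come from the list above; the 2 bits per entry for every other table is an assumption (2-bit counters, as the "2Bc" name suggests), not a figure stated in the slides.

```python
K = 1024

# (table, number of tables, entries per table, bits per entry)
# Bits per entry are ASSUMED to be 2 except for the 8-bit local history table.
tables = [
    ("meta", 2, 4 * K, 2),
    ("bimodal", 1, 4 * K, 2),
    ("gskew global tables", 2, 4 * K, 2),
    ("pskew local history", 1, 1 * K, 8),
    ("pskew PHTs", 2, 2 * K, 2),
]

total_bits = sum(count * entries * bits for _, count, entries, bits in tables)
total_kib = total_bits / 8 / 1024

for name, count, entries, bits in tables:
    print(f"{name}: {count * entries * bits} bits")
print(f"total: {total_bits} bits ({total_kib:g} KiB)")
```

Under that assumption the direction predictor holds 57344 bits of state, which is 7 KiB, not counting the BTB.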

EVALUATIONS 1. Adaptive BTB

Many BTB entries are underused, as shown in Figures 2, 3 and 4.

To gauge an application’s BTB size requirement, see Figure 5.

Observe the following:

• For certain modules, moderate BTB resizing can produce relatively significant energy reduction without incurring much slowdown.

• Beyond a certain size, further reducing the size of the BTB is counterproductive.

Two production runs are performed, using the training input and the reference input.

Dynamically changing the size of the BTB can be profitable: around 20-70% of branch predictor energy and up to 8.6% of total chip energy can be saved with very little performance degradation.

EVALUATIONS 2. Adaptive Hybrid Predictor

EVALUATIONS 2. Adaptive Hybrid Predictor (cont’d)

Experiments show that disabling (gating accesses to) the gskew and pskew components inside the predictor does result in notable energy savings (3% of total processor energy) without noticeable performance degradation for this application.

EVALUATIONS 3. Combining the Two Adaptations

Table 4 shows the energy savings and performance degradation when both adaptive techniques are used to save as much energy as possible given a tolerable performance degradation limit of 0.5%.
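The selection rule described here, saving as much energy as possible within a 0.5% degradation limit, can be sketched as a small constrained choice. The `Measurement` type, the `pick_config` function, and every number in the sample data are hypothetical illustrations, not results from the paper.

```python
from typing import NamedTuple, Optional, Sequence


class Measurement(NamedTuple):
    config: str            # label for a BTB size / gating combination
    energy_saving: float   # fraction of energy saved versus the baseline
    slowdown: float        # fractional performance degradation


def pick_config(
    measurements: Sequence[Measurement], max_slowdown: float = 0.005
) -> Optional[Measurement]:
    """Pick the largest energy saving among configs within the slowdown limit."""
    feasible = [m for m in measurements if m.slowdown <= max_slowdown]
    if not feasible:
        return None  # keep the baseline configuration
    return max(feasible, key=lambda m: m.energy_saving)


# Hypothetical profile of one module; the values are made up for illustration.
sample = [
    Measurement("full BTB, all components on", 0.00, 0.000),
    Measurement("small BTB, all components on", 0.04, 0.002),
    Measurement("small BTB, gskew/pskew gated", 0.07, 0.004),
    Measurement("tiny BTB, gskew/pskew gated", 0.09, 0.020),
]
print(pick_config(sample).config)
```

In this made-up data the most aggressive configuration saves the most energy but violates the 0.5% limit, so the rule settles on the next one down.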

EVALUATIONS 3. Combining the Two Adaptations (cont’d)

Observations:

• Using the same profile (based on the training input), we obtain very similar results for the two sets of production runs using different inputs.

• Branch prediction demand largely depends on the code.

• The results shown in Tables 2, 3, and 4 suggest that the two techniques are largely independent under the tested scenario.

Finally, we note that the overall energy savings depend not only on the effectiveness of the proposed adaptive system but also on the branch prediction demand of the application.

RELATED WORK

Some earlier research also looks at energy issues related to branch prediction.

Parikh et al. [11] point out that modern processors access branch predictors very early in the pipeline, which results in many unnecessary accesses.

Hu et al. [7] let entries that are unused for a long time decay, which reduces the leakage energy of the branch predictor.

Some other work also proposes to dynamically adjust hardware resources to reduce energy consumption while still meeting application demand.

SUMMARY

Sophisticated branch predictors are often underutilized, causing unnecessary energy waste.

By adapting the size of the branch target buffer and dynamically disabling components of a hybrid predictor, a significant amount of energy can be saved with very little performance degradation.

For a set of eight applications, an average of 71.7% (up to 89.5%) of the energy consumed in the branch predictor, or 6.2% (up to 11.5%) of processor-wide energy consumption, can be saved with negligible performance degradation.

REFERENCES

[1] D. Albonesi. Selective Cache Ways: On-Demand Cache Resource Allocation. Journal of Instruction-Level Parallelism, 2, 2000.

[2] R. Bahar and S. Manne. Power and Energy Reduction Via Pipeline Balancing. In International Symposium on Computer Architecture, pages 218–229, Göteborg, Sweden, June–July 2001.

[3] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory Hierarchy Reconfiguration for Energy and Performance in General-Purpose Processor Architectures. In International Symposium on Microarchitecture, pages 245–257, Monterey, CA, December 2000.

[4] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A Framework for Architectural-Level Power Analysis and Optimizations. In International Symposium on Computer Architecture, pages 83–94, Göteborg, Sweden, June–July 2001.

[5] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. McGraw Hill, 1989.

[6] D. Folegnani and A. González. Energy-Effective Issue Logic. In International Symposium on Computer Architecture, pages 230–239, Göteborg, Sweden, June–July 2001.

[7] Z. Hu, P. Juang, K. Skadron, D. Clark, and M. Martonosi. Applying Decay Strategies to Branch Predictors for Leakage Energy Savings. In International Conference on Computer Design, pages 442–445, Freiburg, Germany, September 2002.

[8] M. Huang, J. Renau, and J. Torrellas. Positional Adaptation of Processors: Application to Energy Reduction. In International Symposium on Computer Architecture, San Diego, CA, June 2003.

[9] V. Krishnan and J. Torrellas. A Direct-Execution Framework for Fast and Accurate Simulation of Superscalar Processors. In International Conference on Parallel Architectures and Compilation Techniques, pages 286–293, Paris, France, October 1998.

[10] S. Manne, A. Klauser, and D. Grunwald. Pipeline Gating: Speculation Control for Energy Reduction. In International Symposium on Computer Architecture, pages 132–141, Barcelona, Spain, June–July 1998.

[11] D. Parikh, K. Skadron, Y. Zhang, M. Barcella, and M. Stan. Power Issues Related to Branch Prediction. In International Symposium on High-Performance Computer Architecture, pages 233–244, Cambridge, MA, February 2002.

[12] S. Wilton and N. Jouppi. CACTI: an Enhanced Cache Access and Cycle Time Model. IEEE Journal of Solid-State Circuits, 31(5):677–688, May 1996.

[13] A. Seznec, S. Felix, V. Krishnan, and Y. Sazeides. Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor. In International Symposium on Computer Architecture, pages 296–306, Anchorage, AK, May 2002.

[14] A. Seznec and P. Michaud. De-aliased Hybrid Branch Predictors. Technical Report No. 3618, Institut National de Recherche en Informatique et en Automatique (INRIA), February 1999.

[15] S. Yang, M. Powell, B. Falsafi, K. Roy, and T. Vijaykumar. An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches. In International Symposium on High-Performance Computer Architecture, pages 147–157, Nuevo Leone, Mexico, January 2001.

[16] K. C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2):28–40, April 1996.

END

Thank You!

Questions?
