Post on 20-Dec-2015
Branch Prediction On Demand:
an Energy-Efficient Solution
Daniel Chaver, Luis Piñuel, Manuel Prieto, Francisco Tirado
Dpto. Arquitectura de Computadores, Universidad Complutense, Madrid, Spain
Michael C. Huang
Dept. of Electrical & Computer Engineering, University of Rochester, Rochester, New York
Presented by Shugen Li
Introduction
• Branch prediction is a central piece of technology in exploiting instruction-level parallelism.
• Modern high-end processors use an array of large tables for branch direction and target prediction.
• This results in significant energy consumption, sometimes more than 10% of the total chip power.
• Idea: apply the general principle of “on-demand resource allocation” to the branch predictor and dynamically adjust the strength of the predictor.
• Overheads to reduce: reconfiguration overhead, and energy waste due to increased mis-speculation.
• Method: partition the application into smaller units called modules, profile them, and instrument the application to reconfigure the predictor at runtime.
• Twofold benefits
Introduction (cont'd)
Two techniques at the circuit level:
• ACCESS GATING: a novel access gating technique to reduce ineffectual switching.
• STRUCTURE RESIZING: table resizing to reduce unnecessarily large capacitive load.
ON-DEMAND BRANCH PREDICTION
Branch Predictor Reconfiguration
1. Adaptive Hybrid Predictor through Access Gating
2. Adaptive BTB through Dynamic Resizing
The Profiling Approach
Adaptive Hybrid Predictor through Access Gating
A de-aliased hybrid direction predictor: 2Bc-gskew-pskew
majority voting.
Meta tables
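The majority vote and meta selection named on this slide can be illustrated with a small Python sketch. It is a simplified model, not the authors' hardware design: the function name `predict`, the table labels `G0`, `G1`, `P0`, `P1`, and the assumption that the bimodal prediction takes part in both majority votes are all illustrative choices. Gating a component simply means its tables are not read, and the predictor falls back to what remains.

```python
def majority(a: bool, b: bool, c: bool) -> bool:
    """Return True when at least two of the three votes are 'taken'."""
    return (int(a) + int(b) + int(c)) >= 2


def predict(bim, g0, g1, p0, p1, meta_prefers_pskew,
            gskew_on=True, pskew_on=True):
    """Toy model of a gated hybrid predictor (hypothetical interface).

    bim, g0, g1, p0, p1 are the per-table direction predictions.
    Returns (prediction, list of tables that were actually read).
    """
    accessed = ["bimodal"]          # the bimodal table is always read
    gskew_vote = pskew_vote = None

    if gskew_on:                    # global-history component enabled
        accessed += ["G0", "G1"]
        gskew_vote = majority(bim, g0, g1)
    if pskew_on:                    # per-address-history component enabled
        accessed += ["P0", "P1"]
        pskew_vote = majority(bim, p0, p1)

    if gskew_vote is None and pskew_vote is None:
        return bim, accessed        # both gated: bimodal only
    if gskew_vote is None:
        return pskew_vote, accessed
    if pskew_vote is None:
        return gskew_vote, accessed

    accessed.append("meta")         # both active: meta picks a component
    return (pskew_vote if meta_prefers_pskew else gskew_vote), accessed
```

With both components gated, only the bimodal table is read, which is where the access-energy reduction in this toy model comes from.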
Adaptive BTB through Dynamic Resizing
• The branch target buffer (BTB) helps provide the target address quickly.
• Increasing the size of the BTB helps reduce conflict and capacity misses. However, large structures can be a waste.
• The authors propose to resize the BTB on the fly. Resizing schemes differ in the kind of partitioning they are based on: bitline (selective-sets [15]) and wordline (selective-ways [1]).
• In the example, the BTB is scaled down to a 256-set/2-way configuration or a 2048-set/1-way configuration.
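To make the two example configurations concrete, the following Python sketch shows how the number of sets changes the index and tag derived from a branch address, and how sets and ways together give the entry count. The function names, the 4-byte instruction size, and the modulo indexing are assumptions for illustration; the slides do not describe the actual indexing function.

```python
def btb_entries(num_sets: int, num_ways: int) -> int:
    """Total number of BTB entries in a given configuration."""
    return num_sets * num_ways


def btb_index_and_tag(pc: int, num_sets: int, instr_bytes: int = 4):
    """Split a branch address into (set index, tag).

    Assumes power-of-two set counts and fixed-size instructions;
    both are illustrative assumptions, not taken from the source.
    """
    if num_sets <= 0 or num_sets & (num_sets - 1):
        raise ValueError("num_sets must be a power of two")
    word = pc // instr_bytes        # drop the byte-offset bits
    return word % num_sets, word // num_sets


# The two scaled-down configurations mentioned on the slide.
for sets, ways in [(256, 2), (2048, 1)]:
    index, tag = btb_index_and_tag(0x4A3C, sets)
    print(f"{sets}-set/{ways}-way: {btb_entries(sets, ways)} entries, "
          f"index={index}, tag={tag}")
```

Shrinking the number of sets moves address bits from the index into the tag, while shrinking the number of ways leaves the index unchanged and only removes entries per set.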
The Profiling Approach
• Adaptive architecture: a profile-based feedback mechanism estimates branch prediction demand without incurring runtime overhead.
• A module is the smallest unit to which branch prediction reconfiguration is applied.
• Two thresholds are used in selecting subroutines: average length per invocation (Thgrain) and total execution time weight (Thweight).
• Profiling stage: using the training input, select the best configuration for each module.
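The threshold-based selection of subroutines as modules could look roughly like the sketch below. It is only an illustration: the class `SubroutineProfile`, the function `select_modules`, the use of instruction counts as a stand-in for execution time, and the threshold values in the usage example are all assumptions, not details given in the slides.

```python
from dataclasses import dataclass


@dataclass
class SubroutineProfile:
    name: str
    invocations: int          # how many times the subroutine was called
    total_instructions: int   # instructions executed across all calls


def select_modules(profiles, th_grain, th_weight):
    """Return names of subroutines that pass both thresholds.

    th_grain:  minimum average length per invocation (Thgrain)
    th_weight: minimum share of total execution (Thweight), 0..1
    """
    program_total = sum(p.total_instructions for p in profiles)
    selected = []
    for p in profiles:
        if p.invocations == 0 or program_total == 0:
            continue                       # never executed: nothing to tune
        avg_length = p.total_instructions / p.invocations
        weight = p.total_instructions / program_total
        if avg_length >= th_grain and weight >= th_weight:
            selected.append(p.name)
    return selected


# Hypothetical profile data and thresholds, for illustration only.
profiles = [
    SubroutineProfile("main_loop", 10, 900_000),
    SubroutineProfile("helper", 50_000, 100_000),
    SubroutineProfile("init", 1, 1_000),
]
print(select_modules(profiles, th_grain=1_000, th_weight=0.05))
```

In this example `helper` is rejected because each invocation is too short to amortize a reconfiguration, and `init` is rejected because it contributes too little to total execution.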
EXPERIMENTAL SETUP
The baseline branch predictor is a 2Bc-gskew-pskew predictor configured with:
• Two 4K-entry meta tables
• A 4K-entry bimodal table
• A gskew component consisting of two 4K-entry global history tables that use 10 bits of global history
• A pskew component consisting of a 1K-entry local history table (8 bits wide) with two 2K-entry PHTs
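As a rough sanity check on the size of this configuration, the table storage can be added up as in the sketch below. The slides give entry counts but not the counter width, so the 2-bit counters assumed here for the meta, bimodal, gskew, and PHT tables are an assumption, and the resulting total is an estimate under that assumption rather than a figure from the paper. The BTB is not included.

```python
K = 1024
COUNTER_BITS = 2   # assumption: 2-bit saturating counters per entry

# name -> (number of entries, bits per entry)
tables = {
    "meta (2 tables)":        (2 * 4 * K, COUNTER_BITS),
    "bimodal":                (4 * K, COUNTER_BITS),
    "gskew (2 tables)":       (2 * 4 * K, COUNTER_BITS),
    "pskew local history":    (1 * K, 8),   # 8 bits wide, from the slide
    "pskew PHTs (2 tables)":  (2 * 2 * K, COUNTER_BITS),
}


def total_bits(table_spec):
    """Sum entries * width over all tables."""
    return sum(entries * width for entries, width in table_spec.values())


bits = total_bits(tables)
print(f"{bits} bits = {bits // 8} bytes")
```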
EVALUATIONS 1. Adaptive BTB
Many BTB entries are underused, as shown in Figures 2, 3 and 4.
Figure 5 is used to gauge an application's BTB size requirement.
Observe the following:
• For certain modules, moderate BTB resizing can produce relatively significant energy reduction without incurring much slowdown.
• Beyond a certain size, further reducing the size of the BTB is counterproductive.
Two production runs are performed, using the training input and the reference input.
Dynamically changing the size of the BTB can be profitable: around 20-70% of branch predictor energy and up to 8.6% of total chip energy can be saved with very little performance degradation.
EVALUATIONS 2. Adaptive Hybrid Predictor
EVALUATIONS 2. Adaptive Hybrid Predictor (cont'd)
Experiments show that disabling (gating accesses to) the gskew and pskew components inside the predictor does result in notable energy savings (3% of total processor energy) without noticeable performance degradation for this application.
EVALUATIONS 3. Combining the Two Adaptations
Table 4 shows energy savings and performance degradation using both adaptive techniques to save as much energy as possible given a tolerable performance degradation limit of 0.5%.
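The selection rule described here, saving as much energy as possible while staying within a tolerable slowdown, can be sketched as a constrained maximum. The names `Candidate` and `best_config` are hypothetical, and the numbers in the usage example are invented purely to exercise the code; they are not measurements from the paper.

```python
from typing import NamedTuple, Optional, Sequence


class Candidate(NamedTuple):
    name: str
    energy_saving: float   # fraction of energy saved vs. the baseline
    slowdown: float        # fractional performance degradation


def best_config(candidates: Sequence[Candidate],
                max_slowdown: float = 0.005) -> Optional[Candidate]:
    """Pick the largest energy saving among configurations whose
    slowdown stays within the limit (0.5% by default)."""
    feasible = [c for c in candidates if c.slowdown <= max_slowdown]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c.energy_saving)


# Invented numbers, for illustration only.
candidates = [
    Candidate("full predictor, full BTB", 0.00, 0.000),
    Candidate("reduced BTB", 0.04, 0.001),
    Candidate("reduced BTB + gated gskew/pskew", 0.07, 0.004),
    Candidate("smallest BTB + bimodal only", 0.09, 0.020),
]
print(best_config(candidates).name)
```

The most aggressive configuration is rejected here because its slowdown exceeds the limit, even though it would save the most energy.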
EVALUATIONS 3. Combining the Two Adaptations (cont'd)
Observations
• Using the same profile (based on the training input), we obtain very similar results for the two sets of production runs using different inputs.
• Branch prediction demand largely depends on the code.
• The results shown in Tables 2, 3, and 4 suggest that the two techniques are largely independent under the tested scenario.
• The overall energy savings depend not only on the effectiveness of the proposed adaptive system but also on the branch prediction demand of the application.
RELATED WORK
Some earlier research also looks at energy issues related to branch prediction.
• Parikh et al. [11] point out that modern processors access branch predictors very early in the pipeline, which results in many unnecessary accesses.
• Hu et al. [7] let entries that are unused for a long time decay. This reduces the leakage energy of the branch predictor.
• Other work also proposes to dynamically adjust hardware resources to reduce energy consumption while still meeting application demand.
SUMMARY
Sophisticated branch predictors are often underutilized, causing unnecessary energy waste.
By adapting the size of the branch target buffer and dynamically disabling components of a hybrid predictor, a significant amount of energy can be saved with very little performance degradation.
For a set of eight applications, an average of 71.7% (up to 89.5%) of the energy consumed in the branch predictor, or 6.2% (up to 11.5%) of processor-wide energy consumption, can be saved with negligible performance degradation.
REFERENCES
[1] D. Albonesi. Selective Cache Ways: On-Demand Cache Resource Allocation. Journal of Instruction-Level Parallelism, 2, 2000.
[2] R. Bahar and S. Manne. Power and Energy Reduction Via Pipeline Balancing. In International Symposium on Computer Architecture, pages 218–229, Göteborg, Sweden, June–July 2001.
[3] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory Hierarchy Reconfiguration for Energy and Performance in General-Purpose Processor Architectures. In International Symposium on Microarchitecture, pages 245–257, Monterey, CA, December 2000.
[4] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A Framework for Architectural-Level Power Analysis and Optimizations. In International Symposium on Computer Architecture, pages 83–94, Göteborg, Sweden, June–July 2001.
[5] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. McGraw Hill, 1989.
[6] D. Folegnani and A. González. Energy-Effective Issue Logic. In International Symposium on Computer Architecture, pages 230–239, Göteborg, Sweden, June–July 2001.
[7] Z. Hu, P. Juang, K. Skadron, D. Clark, and M. Martonosi. Applying Decay Strategies to Branch Predictors for Leakage Energy Savings. In International Conference on Computer Design, pages 442–445, Freiburg, Germany, September 2002.
[8] M. Huang, J. Renau, and J. Torrellas. Positional Adaptation of Processors: Application to Energy Reduction. In International Symposium on Computer Architecture, San Diego, CA, June 2003.
[9] V. Krishnan and J. Torrellas. A Direct-Execution Framework for Fast and Accurate Simulation of Superscalar Processors. In International Conference on Parallel Architectures and Compilation Techniques, pages 286–293, Paris, France, October 1998.
[10] S. Manne, A. Klauser, and D. Grunwald. Pipeline Gating: Speculation Control for Energy Reduction. In International Symposium on Computer Architecture, pages 132–141, Barcelona, Spain, June–July 1998.
[11] D. Parikh, K. Skadron, Y. Zhang, M. Barcella, and M. Stan. Power Issues Related to Branch Prediction. In International Symposium on High-Performance Computer Architecture, pages 233–244, Cambridge, MA, February 2002.
[12] S. Wilton and N. Jouppi. CACTI: an Enhanced Cache Access and Cycle Time Model. IEEE Journal of Solid-State Circuits, 31(5):677–688, May 1996.
[13] A. Seznec, S. Felix, V. Krishnan, and Y. Sazeides. Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor. In International Symposium on Computer Architecture, pages 296–306, Anchorage, AK, May 2002.
[14] A. Seznec and P. Michaud. De-aliased Hybrid Branch Predictors. Technical Report No. 3618, Institut National de Recherche en Informatique et en Automatique (INRIA), February 1999.
[15] S. Yang, M. Powell, B. Falsafi, K. Roy, and T. Vijaykumar. An Integrated Circuit/Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches. In International Symposium on High-Performance Computer Architecture, pages 147–157, Nuevo Leon, Mexico, January 2001.
[16] K. C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 6(2):28–40, April 1996.
END
Thank You!
Question?