
PBExplore: A Framework for CIL Exploration of Partial Bypasses in Embedded Processors

Aviral Shrivastava1, Nikil Dutt1, Alex Nicolau1, Eugene Earlie2

1 Center for Embedded Computer Systems, University of California, Irvine, CA, USA
2 Strategic CAD Labs, Intel, Hudson, MA, USA


Copyright © 2005 UCI ACES Laboratory. DATE, March 10, 2005

Bypassing Improves Performance

Pipelining improves performance
  Limited by pipeline hazards
Bypasses eliminate certain data hazards
  Further improve performance

[Figure: pipeline (F, D, OR, X1, X2, WB, with register file RF) executing R1 ← R2 + R3 followed by R4 ← R4 + R1, shown without and with a bypass for R1.]


Impact of Bypassing

Area and power consumption
  Wide multiplexers
  Bypass control logic
  Bypass wires
Cycle time
  Bypasses may be part of the timing-critical path
Wiring congestion
Overall chip complexity
  Deeply pipelined out-of-order processors

[Figure: pipeline F, D, OR, X1, X2, WB with register file RF and multiplexers M1 and M2.]

P. Ahuja et al., The Performance Impact of Incomplete Bypassing in Processor Pipelines, MICRO 1995.
A. Abnous and N. Bagherzadeh, Pipelining and Bypassing in a VLIW Processor, IEEE Trans... 1995.


Problem, Solution and Problem

Problem: How do I customize bypasses?
  Important for embedded systems
Solution: Keep only the most beneficial bypasses
  Area, power and performance trade-off
Problems: How to compile for a processor with partial bypassing?
  Requires Compiler-in-the-Loop exploration

[Figure: pipeline F, D, OR, X1, X2, WB with register file RF.]


Related Work

Optimizations for partial bypassing
  P. Ahuja et al. [MICRO’95]: manual code generation
  M. Buss et al. [CASES’01]: optimize inter-cluster copy operations
  K. Fan et al. [ASAP’03]: FU-allocation strategy
  Only for VLIW processors
  A. Shrivastava et al. [CODES’04]: a generic “pipeline hazard detection” mechanism to generate bypass-sensitive code
We present
  A generic Compiler-in-the-Loop bypass exploration framework
  Area-power-performance trade-offs on the Intel XScale, obtained by varying bypasses


PBExplore: A CIL Exploration Framework

[Figure: PBExplore flow. Inputs: Application, Bypass Configuration. Blocks: Bypass-sensitive Compiler, Cycle-accurate Simulator, Power Simulator, Synthesis Tool. Intermediate artifacts: Executable, Stimulus, Bypass-control Logic. Report: Execution Cycles, Energy Estimate, Area Estimate.]
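The flow in the figure can be read as a loop over bypass configurations. The following is a minimal sketch of that loop, not the PBExplore implementation: the function `explore`, its parameters, and the `toy_*` stand-ins for the compiler, simulator, synthesis tool and power simulator are all hypothetical, and the numbers they return are placeholders rather than measurements.

```python
from typing import Callable, Dict, List


def explore(app: str,
            configs: List[int],
            compile_fn: Callable[[str, int], str],
            simulate_fn: Callable[[str, int], int],
            area_fn: Callable[[int], float],
            energy_fn: Callable[[str, int], float]) -> List[Dict]:
    """Run every bypass configuration through the compiler-in-the-loop flow."""
    report = []
    for cfg in configs:
        exe = compile_fn(app, cfg)  # recompile for this bypass configuration
        report.append({
            "config": cfg,
            "cycles": simulate_fn(exe, cfg),  # cycle-accurate simulation
            "area": area_fn(cfg),             # synthesis of bypass control logic
            "energy": energy_fn(exe, cfg),    # power simulation
        })
    return report


# Hypothetical placeholder tools so the sketch runs; values are made up.
def toy_compile(app, cfg):
    return f"{app}.cfg{cfg}"


def toy_simulate(exe, cfg):
    return 1000 - 10 * bin(cfg).count("1")


def toy_area(cfg):
    return float(bin(cfg).count("1"))


def toy_energy(exe, cfg):
    return 0.5 * bin(cfg).count("1")


for row in explore("bitcount", [0, 28, 127],
                   toy_compile, toy_simulate, toy_area, toy_energy):
    print(row)
```

The point of the structure is that compilation happens inside the loop, so each configuration is measured with code generated for it.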


Bypass Sensitive Scheduling

Bypasses transfer data between dependent operations
Missing bypasses cause pipeline hazards
A bypass-sensitive compiler should be able to detect and avoid pipeline hazards

[Figure: pipeline F, D, OR, X1, X2, WB with register file RF, executing R1 ← R2 + R3 followed by R4 ← R4 + R1; labelled "No Hazard" when R1 is bypassed and "Hazard" when the bypass is missing.]


Operation Table

An Operation Table (OT) is a binding between an operation and the processor resources and registers
Can detect resource hazards
  OTs model processor resources
Can detect data hazards
  OTs model processor registers

[Figure: pipeline F, D, OR, X1, X2, WB with register files RF and BRF and connections C1 to C5.]

Operation Table for ADD R1 R2 R3
  1. F
  2. D
  3. OR
       ReadOperands
         R2: C1 RF
         R3: C2 RF, C5 BRF
       DestOperands
         R1: RF
  4. X1
       WriteOperands
         R1: C4 BRF
  5. X2
  6. WB
       WriteOperands
         R1: C3 RF

Details are in the paper.
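As an illustration of what such a table holds, here is a minimal sketch of the OT for ADD R1 R2 R3 as a Python data structure. The class names (`Read`, `Write`, `OTCycle`) and the helper `bypassable_reads` are hypothetical and only mirror the entries listed on the slide; the last stage is written WB, following the pipeline figure.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Read:
    reg: str
    paths: List[Tuple[str, str]]  # alternative (connection, register file) pairs


@dataclass
class Write:
    reg: str
    path: Tuple[str, str]  # (connection, register file)


@dataclass
class OTCycle:
    stage: str
    reads: List[Read] = field(default_factory=list)
    dests: List[Tuple[str, str]] = field(default_factory=list)  # (register, file)
    writes: List[Write] = field(default_factory=list)


# One entry per cycle of the operation, as listed on the slide.
ADD_R1_R2_R3 = [
    OTCycle("F"),
    OTCycle("D"),
    OTCycle("OR",
            reads=[Read("R2", [("C1", "RF")]),
                   Read("R3", [("C2", "RF"), ("C5", "BRF")])],
            dests=[("R1", "RF")]),
    OTCycle("X1", writes=[Write("R1", ("C4", "BRF"))]),
    OTCycle("X2"),
    OTCycle("WB", writes=[Write("R1", ("C3", "RF"))]),
]


def bypassable_reads(ot: List[OTCycle]) -> List[str]:
    """Source registers that have at least one read path that is not the RF."""
    return [r.reg for c in ot for r in c.reads
            if any(src != "RF" for _, src in r.paths)]


print(bypassable_reads(ADD_R1_R2_R3))  # ['R3']
```

In this table only the second source operand has a BRF path, which is how a partial bypass shows up in the data: R2 can be read from the RF alone.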


Experiments

Experiments I: Need of a CIL framework
  Need of bypass-sensitive Compiler-in-the-Loop exploration
  Traditional exploration versus bypass-sensitive Compiler-in-the-Loop exploration
Experiments II: CIL exploration
  Use of bypass-sensitive Compiler-in-the-Loop exploration
  Perform power-performance-area trade-offs
  Identify alternate interesting design points


Experiments I - Framework

Traditional exploration versus bypass-sensitive Compiler-in-the-Loop exploration

[Figure: two flows, both taking the Application and the Bypass Configuration. Traditional exploration: gcc -O3, Executable, Cycle-accurate Simulator, Traditional Cycles. Bypass-sensitive Compiler-in-the-Loop exploration: OT-based Compiler, Executable, Cycle-accurate Simulator, CIL Cycles.]


Experiments I - Setup

7 pipeline stages can bypass a result
We vary which pipeline stages bypass a result
  2^7 = 128 bypass configurations
Encode the bypass configuration as <DWB D2 MWB M2 XWB X2 X1>
  Configuration 28 = <0011100>
  Bypass paths from MWB, M2 and XWB are present

[Figure: XScale pipeline. F1, F2, ID, RF, followed by three pipes: X1, X2, XWB; M1, M2, MWB; D1, D2, DWB.]
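The encoding above is a 7-bit mask, one bit per bypass source. A small sketch of the decoding, assuming the leftmost name in <DWB D2 MWB M2 XWB X2 X1> is the most significant bit (which is what the example 28 = <0011100> implies); the names `SOURCES`, `bits` and `decode` are illustrative only:

```python
# Bypass sources in the order given on the slide, most significant bit first.
SOURCES = ["DWB", "D2", "MWB", "M2", "XWB", "X2", "X1"]


def bits(config: int) -> str:
    """Return the configuration number as a 7-character bit string."""
    if not 0 <= config < 2 ** len(SOURCES):
        raise ValueError("configuration must be in the range 0..127")
    return format(config, "07b")


def decode(config: int) -> list:
    """Return the bypass sources that are present in a configuration."""
    return [name for name, bit in zip(SOURCES, bits(config)) if bit == "1"]


print(2 ** len(SOURCES))  # 128
print(bits(28))           # 0011100
print(decode(28))         # ['MWB', 'M2', 'XWB']
```

Configuration 0 has no bypasses and configuration 127 has all seven.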


Bypass Explorations on XScale

The CIL compiler can effectively exploit the bypass configuration
Significant performance difference

[Figure: bitcount. Execution cycles (850,000 to 1,250,000) versus bypass source configuration (0 to 128), for Traditional and CIL.]


X-bypass explorations in XScale

Difference in trends

[Figure: bitcount. Execution cycles (850,000 to 1,200,000) for each X-bypass configuration (every combination of X1, X2 and XWB, including none), for Traditional and CIL. Inset: XScale pipeline F1, F2, ID, RF, followed by X1, X2, XWB; M1, M2, MWB; D1, D2, DWB.]


M-bypass explorations in XScale

Difference in trends

[Figure: bitcount. Execution cycles (875,000 to 895,000) for each M-bypass configuration (-, M2, MWB, MWB M2), for Traditional and CIL. Inset: XScale pipeline F1, F2, ID, RF, followed by X1, X2, XWB; M1, M2, MWB; D1, D2, DWB.]


D-bypass exploration in XScale

Difference in trends

[Figure: bitcount. Execution cycles (860,000 to 980,000) for each D-bypass configuration (-, DWB, D2, DWB D2), for Traditional and CIL. Inset: XScale pipeline F1, F2, ID, RF, followed by X1, X2, XWB; M1, M2, MWB; D1, D2, DWB.]


Experiments II - Setup

Power-performance-area trade-offs
Scheduler
  Exhaustive instruction reordering within basic blocks
Synthesis tool
  Synopsys Design Compiler 2001.10
  0.8µ library lsi_10k
Power estimation
  Synopsys power_estimate

[Figure: PBExplore flow with Application, Bypass Configuration, Bypass-sensitive Compiler, Executable, Cycle-accurate Simulator, Power Simulator, Bypass Control Logic, Synthesis Tool and Report.]

Intel XScale Microarchitecture Programmers Reference Manual, http://www.developer.intel.com
M. R. Guthaus et al., MiBench: A free commercially representative…, IEEE Workshop… 2001
Synopsys Design Compiler, 2001, http://www.synopsys.com/products/logic/design_compiler.html
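Exhaustive reordering within a basic block can be sketched as follows. This is an illustration under stated assumptions, not the scheduler used in the experiments: the instruction format, the helper names, and in particular the cost function `adjacent_raw` are hypothetical. The toy cost simply counts instructions that read the result of their immediate predecessor; a bypass-sensitive scheduler would instead use a cycle estimate derived from the Operation Tables.

```python
from itertools import permutations
from typing import Callable, List, Sequence, Tuple

Instr = Tuple[str, str, Tuple[str, ...]]  # (name, destination, sources)


def depends(a: Instr, b: Instr) -> bool:
    """True if b must stay after a (RAW, WAR or WAW dependence)."""
    _, a_dst, a_src = a
    _, b_dst, b_src = b
    return a_dst in b_src or b_dst in a_src or a_dst == b_dst


def legal(order: Sequence[int], block: List[Instr]) -> bool:
    """Legal if every dependent pair keeps its original relative order."""
    pos = {idx: p for p, idx in enumerate(order)}
    return all(pos[i] < pos[j]
               for i in range(len(block))
               for j in range(i + 1, len(block))
               if depends(block[i], block[j]))


def best_order(block: List[Instr],
               cost: Callable[[List[Instr]], int]) -> List[Instr]:
    """Try every legal permutation of the block and keep the cheapest."""
    candidates = (o for o in permutations(range(len(block))) if legal(o, block))
    best = min(candidates, key=lambda o: cost([block[i] for i in o]))
    return [block[i] for i in best]


def adjacent_raw(seq: List[Instr]) -> int:
    """Toy cost: instructions that read the result of the one just before."""
    return sum(1 for a, b in zip(seq, seq[1:]) if a[1] in b[2])


block = [("MUL", "R1", ("R2", "R3")),
         ("ADD", "R4", ("R4", "R1")),
         ("SUB", "R5", ("R6", "R7"))]
print([name for name, _, _ in best_order(block, adjacent_raw)])
# ['MUL', 'SUB', 'ADD']
```

The search is factorial in the block length, so it is only practical because basic blocks are short.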


Performance-Energy-Area Trade-off

[Figure: Performance Area Trade-off. Area compared to full bypassing (60% to 105%) versus execution cycles compared to full bypassing (100% to 130%); Points 1 and 2 are marked.]

[Figure: Performance Energy Trade-off. Energy compared to full bypassing (70% to 105%) versus execution cycles compared to full bypassing (100% to 130%); Points 1 and 2 are marked.]

Design Point 1
  No bypass from MWB and XWB to the first operand
  18% less area and 14% less energy consumption of bypass control logic
  2% performance loss
Design Point 2
  Only D2 and X2 bypass to the first operand
  25% less area and 16% less energy consumption of bypass control logic
  6% performance loss
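One way to pick out such design points from an exploration report is a Pareto filter over cycles, area and energy. The sketch below is illustrative: the helper names are hypothetical, and the values are the percentages quoted for the two design points relative to full bypassing, reading an x% performance loss as x% more execution cycles (an assumption).

```python
from typing import Dict, List

# Values relative to full bypassing (= 100), taken from the design points.
POINTS = {
    "full bypassing": {"cycles": 100, "area": 100, "energy": 100},
    "design point 1": {"cycles": 102, "area": 82, "energy": 86},
    "design point 2": {"cycles": 106, "area": 75, "energy": 84},
}


def dominates(a: Dict[str, float], b: Dict[str, float]) -> bool:
    """a dominates b if it is no worse on every metric and better on one."""
    return all(a[k] <= b[k] for k in a) and any(a[k] < b[k] for k in a)


def pareto(points: Dict[str, Dict[str, float]]) -> List[str]:
    """Names of the points that no other point dominates."""
    return [name for name, p in points.items()
            if not any(dominates(q, p)
                       for other, q in points.items() if other != name)]


print(pareto(POINTS))
# ['full bypassing', 'design point 1', 'design point 2']
```

All three points survive the filter because each one is best on at least one metric, which is what makes them trade-offs rather than strict improvements.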


Summary

Bypassing improves performance but is costly in terms of area and power
Partial bypassing presents valuable trade-offs but poses challenges in compilation
We presented PBExplore, a Compiler-in-the-Loop exploration framework to explore partial bypasses
  PBExplore uses Operation Tables to generate bypass-sensitive code
  PBExplore automatically synthesizes bypass control logic to explore power and area trade-offs
PBExplore is able to discover interesting design points that trade off performance for power and area of bypass control logic


Thank You


Pipeline Hazard Detection using OT

[Figure: pipeline F, D, OR, X1, X2, WB with register files RF and BRF and connections C1 to C5.]

Cycle  Busy Resources (MUL R1 R2 R3)  !RF  BRF
1      F                              --   --
2      D                              --   --
3      OR, C1, C2                     --   --
4      X1                             R1   --
5      X1, C4                         R1   R1
6      X2                             R1   --
7      WB, C3                         --   --
8                                     --   --
9                                     --   --
10                                    --   --
11                                    --   --


Resource Hazard Detection

[Figure: pipeline F, D, OR, X1, X2, WB with register files RF and BRF and connections C1 to C5.]

Cycle  MUL R1 R2 R3  ADD R4 R2 R3  !RF     BRF
1      F                           --      --
2      D             F             --      --
3      OR, C1, C2    D             --      --
4      X1            OR, C1, C2    R1      --
5      X1, C4        RH            R1, R4  R1
6      X2            X1, C4        R1, R4  R4
7      WB, C3        X2            R4      --
8                    WB, C3        --      --
9                                  --      --
10                                 --      --
11                                 --      --

RH = Resource Hazard


Data Hazard Detection

[Figure: pipeline F, D, OR, X1, X2, WB with register files RF and BRF and connections C1 to C5.]

Cycle  MUL R1 R2 R3  ADD R4 R2 R3  SUB R5 R4 R2  !RF     BRF
1      F                                         --      --
2      D             F                           --      --
3      OR, C1, C2    D             F             --      --
4      X1            OR, C1, C2    D             R1      --
5      X1, C4        RH            DH            R1, R4  R1
6      X2            X1, C4        DH            R1, R4  R4
7      WB, C3        X2            DH            R4      --
8                    WB, C3        OR, C1, C2    --      --
9                                  X1, C4        R5      R5
10                                 X2            R5      --
11                                 WB, C3        --      --

DH = Data Hazard
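The cycle-by-cycle tables can be reproduced with a small in-order pipeline model. This is a simplified sketch, not the OT-based detection from the paper, and it rests on assumptions read off the tables: each stage holds one operation; the first source is read from the RF only (C1) while the second may also come from the BRF (C5); a result sits in the BRF only during the producer's last X1 cycle; a result can be read from the RF in the cycle its producer is in WB; MUL spends two cycles in X1. The names `Op` and `schedule` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

STAGES = ["F", "D", "OR", "X1", "X2", "WB"]


@dataclass
class Op:
    name: str
    dest: str
    src1: str           # first source: read from RF only (connection C1)
    src2: str           # second source: read from RF (C2) or BRF (C5)
    x1_cycles: int = 1  # cycles spent in X1 (2 for the MUL in the tables)


def schedule(ops: List[Op]) -> List[Dict[str, int]]:
    """Return, for each op, the cycle in which it enters each stage."""
    rf_ready: Dict[str, int] = {}  # reg -> first cycle a pending result is in RF
    in_brf: Dict[str, int] = {}    # reg -> the one cycle the result is in BRF
    prev: Dict[str, int] = {}      # stage entry cycles of the previous op
    out = []
    for op in ops:
        enter: Dict[str, int] = {}
        t = 1
        for i, stage in enumerate(STAGES):
            if prev:
                # Resource hazard: wait until the previous op has left the stage.
                if i + 1 < len(STAGES):
                    t = max(t, prev[STAGES[i + 1]])
                else:
                    t = max(t, prev[stage] + 1)
            if stage == "OR":
                # Data hazard: wait until both sources can be read.
                while not (rf_ready.get(op.src1, 0) <= t and
                           (rf_ready.get(op.src2, 0) <= t
                            or in_brf.get(op.src2) == t)):
                    t += 1
            enter[stage] = t
            t += op.x1_cycles if stage == "X1" else 1
        rf_ready[op.dest] = enter["WB"]
        in_brf[op.dest] = enter["X1"] + op.x1_cycles - 1
        prev = enter
        out.append(enter)
    return out


program = [Op("MUL", "R1", "R2", "R3", x1_cycles=2),
           Op("ADD", "R4", "R2", "R3"),
           Op("SUB", "R5", "R4", "R2")]
for op, cycles in zip(program, schedule(program)):
    print(op.name, cycles)
```

Under these assumptions SUB enters OR in cycle 8, as in the table, because R4 is its first operand and can only come from the RF. With the operands swapped (SUB R5 R2 R4) the model lets SUB read R4 from the BRF and enter OR in cycle 6.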
