PBExplore: A Framework for CIL Exploration of Partial Bypasses in Embedded Processors
Aviral Shrivastava (1), Nikil Dutt (1), Alex Nicolau (1), Eugene Earlie (2)
(1) Center for Embedded Computer Systems, University of California, Irvine, CA, USA
(2) Strategic CAD Labs, Intel, Hudson, MA, USA
Copyright © 2005 UCI ACES Laboratory. DATE, March 10, 2005
Bypassing Improves Performance
Pipelining improves performance, but is limited by pipeline hazards.
Bypasses eliminate certain data hazards and further improve performance.
[Figure: two pipeline diagrams (stages F, D, OR, X1, X2, WB and register file RF) executing R1 ← R2 + R3 followed by R4 ← R4 + R1; a bypass is labeled R1]
Impact of Bypassing
Area and power consumption: wide multiplexers, bypass control logic, bypass wires
Cycle time: bypasses may be part of the timing-critical path
Wiring congestion
Overall chip complexity (deeply pipelined out-of-order processors)
[Figure: pipeline stages F, D, OR, X1, X2, WB with register file RF and stages M1, M2]
P. Ahuja et al., The Performance Impact of Incomplete Bypassing in Processor Pipelines, MICRO 1995.
A. Abnous and N. Bagherzadeh, Pipelining and Bypassing in a VLIW Processor, IEEE Trans... 1995.
Problem, Solution and Problem
Problem: How do I customize bypasses? Important for embedded systems.
Solution: Keep only the most beneficial bypasses (area, power and performance trade-off).
[Figure: pipeline stages F, D, OR, X1, X2, WB with register file RF]
Problems: How to compile for a processor with partial bypassing? Requires Compiler-in-the-Loop exploration.
Related Work
Optimizations for partial bypassing:
  P. Ahuja et al. [MICRO’95]: manual code generation
  M. Buss et al. [CASES’01]: optimize inter-cluster copy operations
  K. Fan et al. [ASSP’03]: FU-allocation strategy
  Only for VLIW processors
  A. Shrivastava et al. [CODES’04]: a generic “pipeline hazard detection” mechanism to generate bypass-sensitive code
We present:
  A generic Compiler-in-the-Loop bypass exploration framework
  Area-power-performance trade-offs on Intel XScale by varying bypasses
PBExplore: A CIL Exploration Framework
[Figure: block diagram of PBExplore. Blocks: Application, Bypass Configuration, Bypass-sensitive Compiler, Executable, Cycle-accurate Simulator, Synthesis Tool, Bypass-control Logic, Stimulus, Power Simulator, Report. Reported outputs: Execution Cycles, Area Estimate, Energy Estimate.]
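The framework on this slide can be read as a loop over bypass configurations. The following is only a minimal sketch of that reading, not the authors' implementation: `DesignPoint`, `explore` and the four callables are hypothetical names standing in for the bypass-sensitive compiler, the cycle-accurate simulator, the synthesis tool and the power simulator.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class DesignPoint:
    config: int    # bypass configuration identifier
    cycles: int    # execution cycles reported by the simulator
    area: float    # area estimate of the bypass control logic
    energy: float  # energy estimate of the bypass control logic


def explore(
    configs: Iterable[int],
    compile_for: Callable[[int], object],             # hypothetical compiler hook
    simulate: Callable[[object, int], int],           # hypothetical simulator hook
    synthesize_area: Callable[[int], float],          # hypothetical synthesis hook
    estimate_energy: Callable[[int, object], float],  # hypothetical power hook
) -> List[DesignPoint]:
    """Evaluate every bypass configuration with the compiler in the loop."""
    points = []
    for config in configs:
        executable = compile_for(config)       # recompile for this configuration
        cycles = simulate(executable, config)  # measure performance
        area = synthesize_area(config)         # synthesize the bypass control logic
        energy = estimate_energy(config, executable)
        points.append(DesignPoint(config, cycles, area, energy))
    return points
```

The point of the structure is that the application is recompiled for every configuration, so the cycle counts reflect code that was scheduled with that configuration in mind.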
Bypass Sensitive Scheduling
Bypasses transfer data between dependent operations.
Missing bypasses cause pipeline hazards.
[Figure: pipeline stages F, D, OR, X1, X2, WB with register file RF, executing R1 ← R2 + R3 followed by R4 ← R4 + R1; cases labeled "No Hazard" and "Hazard"]
A bypass-sensitive compiler should be able to detect and avoid pipeline hazards.
Operation Table
An Operation Table (OT) is a binding between an operation and the processor resources and registers.
Can detect resource hazards: OTs model processor resources.
Can detect data hazards: OTs model processor registers.
[Figure: pipeline stages F, D, OR, X1, X2, WB with register file RF, bypass register file BRF and connections C1 to C5]
Operation Table for ADD R1 R2 R3
1. F
2. D
3. OR
   ReadOperands
     R2: C1 RF
     R3: C2 RF, C5 BRF
   DestOperands
     R1: RF
4. X1
   WriteOperands
     R1: C4 BRF
5. X2
6. XWB
   WriteOperands
     R1: C3 RF
Details are in the paper!
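One way to hold the Operation Table for ADD R1 R2 R3 in a program is sketched below. The field names mirror the slide (ReadOperands, DestOperands, WriteOperands), but the classes `OTRow` and the helper `read_sources` are assumptions made for illustration, not the representation used in the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# A path is (connection, storage), e.g. ("C1", "RF").
Path = Tuple[str, str]


@dataclass
class OTRow:
    unit: str                                                     # pipeline unit held in this cycle
    read_operands: Dict[str, List[Path]] = field(default_factory=dict)
    dest_operands: Dict[str, str] = field(default_factory=dict)
    write_operands: Dict[str, List[Path]] = field(default_factory=dict)


# Rows copied from the slide, one per cycle of the operation.
ADD_R1_R2_R3 = [
    OTRow("F"),
    OTRow("D"),
    OTRow(
        "OR",
        read_operands={
            "R2": [("C1", "RF")],                 # first operand: register file only
            "R3": [("C2", "RF"), ("C5", "BRF")],  # second operand: RF or bypass
        },
        dest_operands={"R1": "RF"},
    ),
    OTRow("X1", write_operands={"R1": [("C4", "BRF")]}),
    OTRow("X2"),
    OTRow("XWB", write_operands={"R1": [("C3", "RF")]}),
]


def read_sources(table: List[OTRow], reg: str) -> List[str]:
    """Return the storage locations from which `reg` may be read."""
    for row in table:
        if reg in row.read_operands:
            return [storage for _connection, storage in row.read_operands[reg]]
    return []
```

With this table, `read_sources(ADD_R1_R2_R3, "R3")` lists both RF and BRF, while `R2` can only come from RF, which is exactly the asymmetry that makes the bypassing partial.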
Experiments
Experiments I: Need for a CIL framework
  Need for bypass-sensitive Compiler-in-the-Loop exploration
  Traditional exploration versus bypass-sensitive Compiler-in-the-Loop exploration
Experiments II: CIL Exploration
  Use of bypass-sensitive Compiler-in-the-Loop exploration
  Perform power-performance-area trade-offs
  Identify alternate interesting design points
Experiments I - Framework
Traditional Exploration versus Bypass-sensitive Compiler-in-the-Loop Exploration
[Figure: two flows sharing one Application and one Bypass Configuration. Traditional Exploration: gcc -O3, Executable, Cycle Accurate Simulator, Traditional Cycles. Bypass-sensitive Compiler-in-the-Loop Exploration: OT-based Compiler, Executable, Cycle Accurate Simulator, CIL Cycles.]
Experiments I - Setup
7 pipeline stages can bypass a result.
We vary which pipeline stages bypass a result: 2^7 = 128 bypass configurations.
Bypass configurations are encoded as <DWB D2 MWB M2 XWB X2 X1>.
Configuration 28 = <0011100>: bypass paths from MWB, M2 and XWB are present.
[Figure: XScale pipeline. Stages F1, F2, ID, RF, followed by X1, X2, XWB; M1, M2, MWB; D1, D2, DWB]
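The configuration numbering on this slide is a 7-bit mask over the bypass sources, most significant bit first. A small sketch of the encoding follows; the function names `decode` and `encode` are illustrative and do not come from the slides.

```python
from typing import Iterable, List

# Bit order from the slide, most significant bit first.
STAGES = ["DWB", "D2", "MWB", "M2", "XWB", "X2", "X1"]


def decode(config: int) -> List[str]:
    """Return the stages whose bypass is present in configuration `config`."""
    if not 0 <= config < 2 ** len(STAGES):
        raise ValueError("configuration must be in 0..127")
    bits = format(config, "07b")  # e.g. 28 -> "0011100"
    return [stage for stage, bit in zip(STAGES, bits) if bit == "1"]


def encode(stages: Iterable[str]) -> int:
    """Return the configuration number for a set of bypass sources."""
    return sum(1 << (len(STAGES) - 1 - STAGES.index(stage)) for stage in set(stages))
```

For the slide's example, `decode(28)` yields MWB, M2 and XWB, and `encode` maps that set back to 28.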
Bypass Explorations on XScale
The CIL compiler can effectively exploit the bypass configuration.
Significant performance difference.
[Figure: bitcount. Execution Cycles (850000 to 1250000) versus Bypass Source Configurations (0 to 128); series: Traditional, CIL]
X-bypass explorations in XScale
Difference in trends
[Figure: bitcount. Execution Cycles (850000 to 1200000) versus X-bypass Configuration (labels: -, X1, X2, XWB, X2 X1, XWB X1, XWB X2, XWB X2 X1); series: Traditional, CIL. Inset: XScale pipeline F1, F2, ID, RF, X1, X2, XWB; M1, M2, MWB; D1, D2, DWB]
M-bypass explorations in XScale
Difference in trends
[Figure: bitcount. Execution Cycles (875000 to 895000) versus M Bypass Configurations (labels: -, M2, MWB, MWB M2); series: Traditional, CIL. Inset: XScale pipeline F1, F2, ID, RF, X1, X2, XWB; M1, M2, MWB; D1, D2, DWB]
D-bypass exploration in XScale
Difference in trends
[Figure: bitcount. Execution Cycles (860000 to 980000) versus D Bypass Configurations (labels: -, DWB, D2, DWB D2); series: Traditional, CIL. Inset: XScale pipeline F1, F2, ID, RF, X1, X2, XWB; M1, M2, MWB; D1, D2, DWB]
Experiments II - Setup
Power-Performance-Area trade-offs
Scheduler: exhaustive instruction reordering within basic blocks
Synthesis Tool: Synopsys Design Compiler 2001.10, 0.8µ library lsi_10k
Power Estimation: Synopsys power_estimate
[Figure: PBExplore flow. Blocks: Application, Bypass Configuration, Bypass-sensitive Compiler, Executable, Cycle-accurate Simulator, Synthesis Tool, Bypass Control Logic, Power Simulator, Report]
Intel XScale Microarchitecture Programmers Reference Manual, http://www.developer.intel.com
M. R. Guthaus et al., MiBench: A free commercially representative…, IEEE Workshop… 2001
Synopsys Design Compiler, 2001, http://www.synopsys.com/products/logic/design compiler.html
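The scheduler is described only as exhaustive instruction reordering within basic blocks. A minimal sketch of what such a search could look like is given below; `best_order`, the dependence encoding and the `cost` callable are assumptions, and in the real flow the cost would presumably come from a hazard-aware cycle count rather than the toy functions one might pass here.

```python
from itertools import permutations
from typing import Callable, List, Optional, Sequence, Set, Tuple


def best_order(
    n_ops: int,
    deps: Set[Tuple[int, int]],            # (a, b): operation a must precede b
    cost: Callable[[Sequence[int]], int],  # hypothetical cycle estimate for an order
) -> Optional[List[int]]:
    """Try every dependence-respecting order of a basic block; keep the cheapest."""
    best: Optional[List[int]] = None
    best_cost: Optional[int] = None
    for order in permutations(range(n_ops)):
        position = {op: i for i, op in enumerate(order)}
        if any(position[a] > position[b] for a, b in deps):
            continue  # this order violates a dependence
        current = cost(order)
        if best_cost is None or current < best_cost:
            best, best_cost = list(order), current
    return best
```

The search is factorial in the block size, so it is only practical for small basic blocks; ties are resolved in favor of the first order found.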
Performance-Energy-Area Trade-off
[Figure: Performance Area Trade-off. Area compared to full bypassing (60% to 105%) versus execution cycles compared to full bypassing (100% to 130%); Points 1 and 2 marked]
[Figure: Performance Energy Trade-off. Energy compared to full bypassing (70% to 105%) versus execution cycles compared to full bypassing (100% to 130%); Points 1 and 2 marked]
Design Point 1: no bypass from MWB and XWB to the first operand; 18% less area and 14% less energy consumption of the bypass control logic; 2% performance loss.
Design Point 2: only D2 and X2 bypass to the first operand; 25% less area and 16% less energy consumption of the bypass control logic; 6% performance loss.
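Design points like these are interesting when no other configuration is at least as good on every axis. The sketch below shows one common way to filter for such points (a Pareto filter); it is an illustration, not the selection method stated on the slides. The three tuples restate the slide's figures as percentages of full bypassing, reading "2% performance loss" as 102% execution cycles, which is an assumption.

```python
from typing import Dict, List, Tuple

# (execution cycles, area, energy), each as a percentage of full bypassing.
# Lower is better on every axis. Values restate the slide; the cycle figures
# assume that an N% performance loss means (100 + N)% execution cycles.
POINTS: Dict[str, Tuple[float, float, float]] = {
    "full bypassing": (100.0, 100.0, 100.0),
    "design point 1": (102.0, 82.0, 86.0),
    "design point 2": (106.0, 75.0, 84.0),
}


def pareto_front(points: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the names of points that no other point dominates."""

    def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
        # a dominates b if it is no worse everywhere and better somewhere.
        return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

    return [
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    ]
```

All three points on the slide survive the filter, since each one gives up something to gain something else.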
Summary
Bypassing improves performance but is costly in terms of area and power.
Partial bypassing presents valuable trade-offs, but poses challenges in compilation.
We presented PBExplore, a Compiler-in-the-Loop exploration framework to explore partial bypasses.
  PBExplore uses Operation Tables to generate bypass-sensitive code.
  PBExplore automatically synthesizes bypass control logic to explore power and area trade-offs.
PBExplore is able to discover interesting design points that trade off performance for power and area of the bypass control logic.
Thank You
Pipeline Hazard Detection using OT
[Figure: pipeline stages F, D, OR, X1, X2, WB with register file RF, bypass register file BRF and connections C1 to C5]
Cycle | Busy Resources: MUL R1 R2 R3 | !RF | BRF
1     | F                            | --  | --
2     | D                            | --  | --
3     | OR, C1, C2                   | --  | --
4     | X1                           | R1  | --
5     | X1, C4                       | R1  | R1
6     | X2                           | R1  | --
7     | WB, C3                       | --  | --
8     |                              | --  | --
9     |                              | --  | --
10    |                              | --  | --
11    |                              | --  | --
Resource Hazard Detection
[Figure: pipeline stages F, D, OR, X1, X2, WB with register file RF, bypass register file BRF and connections C1 to C5]
Cycle | Busy Resources: MUL R1 R2 R3 | Busy Resources: ADD R4 R2 R3 | !RF    | BRF
1     | F                            |                              | --     | --
2     | D                            | F                            | --     | --
3     | OR, C1, C2                   | D                            | --     | --
4     | X1                           | OR, C1, C2                   | R1     | --
5     | X1, C4                       | RH                           | R1, R4 | R1
6     | X2                           | X1, C4                       | R1, R4 | R4
7     | WB, C3                       | X2                           | R4     | --
8     |                              | WB, C3                       | --     | --
9     |                              |                              | --     | --
10    |                              |                              | --     | --
11    |                              |                              | --     | --
RH = Resource Hazard
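The resource hazard in cycle 5 can be reproduced with a very small in-order simulation over the per-cycle resource lists from the table. This is a sketch under stated assumptions rather than the paper's algorithm: operations advance in program order, an operation advances only if every resource of its next row is free, and a stalled operation keeps holding the resources of its current row (that last rule is a modeling choice made here).

```python
from typing import Dict, List

# Per-cycle resources copied from the slide's table.
MUL = [["F"], ["D"], ["OR", "C1", "C2"], ["X1"], ["X1", "C4"], ["X2"], ["WB", "C3"]]
ADD = [["F"], ["D"], ["OR", "C1", "C2"], ["X1", "C4"], ["X2"], ["WB", "C3"]]


def simulate(ops: List[List[List[str]]]) -> List[Dict[int, str]]:
    """Return, per operation, a map from cycle to resources held ("RH" on a stall)."""
    step = [-1] * len(ops)  # last completed table row of each operation
    trace: List[Dict[int, str]] = [{} for _ in ops]
    cycle = 0
    while any(step[i] < len(ot) - 1 for i, ot in enumerate(ops)):
        cycle += 1
        busy = set()
        for i, ot in enumerate(ops):  # program order: older operations go first
            if step[i] == len(ot) - 1:
                continue  # already left the pipeline
            wanted = ot[step[i] + 1]
            if any(resource in busy for resource in wanted):
                if step[i] >= 0:  # stalled inside the pipeline
                    trace[i][cycle] = "RH"
                    busy.update(ot[step[i]])  # assumption: keep current resources
                continue  # either stalled or not fetched yet
            busy.update(wanted)
            step[i] += 1
            trace[i][cycle] = ", ".join(wanted)
    return trace
```

Calling `simulate([MUL, ADD])` gives ADD an "RH" entry in cycle 5, because MUL occupies X1 for a second cycle, and ADD then completes in cycle 8 as in the table.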
Data Hazard Detection
[Figure: pipeline stages F, D, OR, X1, X2, WB with register file RF, bypass register file BRF and connections C1 to C5]
Cycle | Busy Resources: MUL R1 R2 R3 | Busy Resources: ADD R4 R2 R3 | Busy Resources: SUB R5 R4 R2 | !RF    | BRF
1     | F                            |                              |                              | --     | --
2     | D                            | F                            |                              | --     | --
3     | OR, C1, C2                   | D                            | F                            | --     | --
4     | X1                           | OR, C1, C2                   | D                            | R1     | --
5     | X1, C4                       | RH                           | DH                           | R1, R4 | R1
6     | X2                           | X1, C4                       | DH                           | R1, R4 | R4
7     | WB, C3                       | X2                           | DH                           | R4     | --
8     |                              | WB, C3                       | OR, C1, C2                   | --     | --
9     |                              |                              | X1, C4                       | R5     | R5
10    |                              |                              | X2                           | R5     | --
11    |                              |                              | WB, C3                       | --     | --
RH = Resource Hazard, DH = Data Hazard
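The DH entries for SUB R5 R4 R2 follow from the register columns of the table together with the read paths of the Operation Table slide: the first source operand can be read only from RF, the second from RF or BRF. The check below is a sketch built on that reading; it interprets !RF as "registers whose value is not yet available in the register file" and BRF as "registers currently held in the bypass register file", which is an assumption about the slide's notation, and `has_data_hazard` is a hypothetical name.

```python
from typing import List, Set

# Read paths per source operand position, taken from the ADD operation table:
# first operand via C1 from RF only; second via C2 from RF or via C5 from BRF.
READ_PATHS = [["RF"], ["RF", "BRF"]]


def has_data_hazard(sources: List[str], not_in_rf: Set[str], in_brf: Set[str]) -> bool:
    """True if some source operand cannot be read over any of its paths this cycle."""
    for reg, paths in zip(sources, READ_PATHS):
        from_rf = "RF" in paths and reg not in not_in_rf
        from_brf = "BRF" in paths and reg in in_brf
        if not (from_rf or from_brf):
            return True
    return False


# Register state per cycle, copied from the !RF and BRF columns of the table.
STATE = {
    5: ({"R1", "R4"}, {"R1"}),
    6: ({"R1", "R4"}, {"R4"}),
    7: ({"R4"}, set()),
    8: (set(), set()),
}
```

For SUB R5 R4 R2, the sources are `["R4", "R2"]`: the check reports a hazard in cycles 5 to 7 and none in cycle 8, matching the table. If the operands were swapped (a hypothetical SUB R5 R2 R4), R4 could be taken from BRF in cycle 6 and the check would report no hazard there, which illustrates why operand position matters under partial bypassing.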