Control-Flow Decoupling
Rami Sheikh, James Tuck, Eric Rotenberg
North Carolina State University
Figure: Branch mispredictions by category (Separable, Hammock, Inseparable, Not Analyzed).
Motivation
Single-thread performance is important for both single- and multi-threaded applications. Per-core energy consumption is at a premium. Better branch handling is a BIG win: it improves performance, reduces energy, and enables memory latency tolerance.
CFD Compiler Implementation in GCC
Conclusion
A third of mispredictions come from separable branches. CFD is a software/hardware collaboration for exploiting separability with low complexity and high efficacy.
CFD is comparable to if-conversion in terms of the number of static branches and MPKI (mispredictions per kilo-instruction) contribution.
Figure: Instructions per cycle (IPC) vs. window size (96, 128, 168, 192, 256, 384, 512) for the baseline and for the baseline with perfect branch prediction; window sizes are annotated with processor generations (Conroe, Nehalem, Sandy Bridge, Haswell, future generations).
Control-Flow Decoupling (CFD)
Key idea: separate the loop into two loops:
• The first contains only the branch's predicate computation.
• The second contains the branch and its control-dependent instructions.
Results
Figure: Speedup and normalized energy of CFD applied manually and automatically (compiler) on eclat, jpeg-compr, mcf, soplex(pds), soplex(ref), and tiff-2-bw. Plotted speedups range from 1.01 to 1.43; plotted normalized energy ranges from 0.59 to 1.00.
IDENTIFY
• Branch slice
• Control-dependent region
CLONE LOOP
• Connect loop exits to the clone's pre-header (to provide in-order fetch)
INSERT
• PUSH in loop: after the branch slice
• POP in clone: to replace the branch
CLEAN-UP
• Dead and redundant code elimination
Interesting Observation
A third of mispredictions come from separable branches, which meet two conditions:
• The branch has a large control-dependent (CD) region, so if-conversion is not profitable.
• The branch does not depend on its own CD instructions via a loop-carried data dependence.
Figure: Energy reduction; the seven plotted values range from 63% to 69%.
CFD
ISA Support
• Branch queue (BQ) specification
• New push/pop instructions
Software Side
• The BQ size is finite and loops can have high trip counts, so loops are strip-mined
Hardware Side
• BQ microarchitecture, length, and recovery
• Interaction with pipelining and out-of-order (OoO) execution
Figure: Original loop (branch slice, branch, control-dependent region) transformed into CFD loops: the first runs the branch slice and Push_BQ, the second runs Branch_on_BQ and the control-dependent region, with the BQ connecting them.
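Execution Scenarios
The BQ drives fetch.
• Common case (BQ hit): the slice has already executed when the branch is fetched, so fetch uses the queued predicate directly.
• Uncommon case (BQ miss): the branch is fetched before its slice has executed; fetch must speculate or stall.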
Other interesting aspects of CFD:
• Supports partially separable branches.
• Supports nested branches through multi-level decoupling.
• Overheads can be significantly reduced through value communication (called CFD+ in the paper).
Problem #1: No fetch separation, so the branch still needs branch prediction.
Problem #2: No mechanism to communicate predicates to the fetch unit.
Figure: Original vs. CFD pipeline timing (IF to EX of slice and branch instructions), with the BQ feeding fetch in the CFD case.
CFD provides:
• Fetch separation
• A mechanism to communicate predicates to the fetch unit