independent work in between


Upload: javan

Post on 24-Feb-2016



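[Figure: pie chart of branch mispredictions by branch class. Legend: Separable, Hammock, Inseparable, Not Analyzed. Slice values as extracted: 37.8%, 27.2%, 18.7%, 16.3%.]

[Diagram label: independent work in between]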

Control-Flow Decoupling

Rami Sheikh, James Tuck, Eric Rotenberg

North Carolina State University

Motivation

• Single-thread performance is important for single- and multi-threaded applications.

• Per-core energy consumption is at a premium.

• Better branch handling is a BIG win: it improves performance, reduces energy, and enables memory latency tolerance.

CFD Compiler Implementation in GCC

Conclusion

• A third of mispredictions come from separable branches.

• CFD is a software/hardware collaboration for exploiting separability with low complexity and high efficacy.

• CFD is comparable to if-conversion in terms of number of static branches and MPKI contribution.

[Figure: Instructions per Cycle (IPC, axis 0 to 3) versus Window Size (96, 128, 168, 192, 256, 384, 512) for two series: baseline and baseline + perfect prediction. Window sizes are annotated with Conroe, Nehalem, Sandy Bridge, Haswell, and Future Generations.]

Control-Flow Decoupling (CFD)

Key idea: separate the loop into two loops:

• The first contains only the branch's predicate computation.

• The second contains the branch and its control-dependent instructions.
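The two-loop split can be illustrated with a minimal sketch. It is written in Python purely for readability; the poster's transformation applies to compiled loops, and the function names, the threshold test, and the `predicates` list (a software stand-in for the hardware queue) are assumptions made for illustration, not taken from the poster.

```python
def original_loop(values, threshold):
    """One loop: predicate computation, branch, and dependent work together."""
    total = 0
    for v in values:
        if v > threshold:          # branch slice + branch
            total += 2 * v + 1     # control-dependent region
    return total


def decoupled_loops(values, threshold):
    """Loop 1 computes predicates only; loop 2 branches on stored predicates."""
    predicates = [v > threshold for v in values]   # first loop
    total = 0
    for v, taken in zip(values, predicates):       # second loop
        if taken:                  # outcome is already known here
            total += 2 * v + 1     # control-dependent region
    return total


data = [5, 12, 7, 30, 1, 18]
# Both forms must compute the same result for the split to be legal.
assert original_loop(data, 10) == decoupled_loops(data, 10)
```

In this sketch the second loop never has to guess the branch outcome, because every predicate was produced before the loop started.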

Results

Applying CFD manually:

[Figure: two bar charts, Speedup (axis 0.0 to 1.6) and Normalized Energy (axis 0.0 to 1.0). Benchmark labels were not preserved. Speedup values as extracted: 1.18, 1.34, 1.43, 1.02, 1.17, 1.02, 1.01, 1.07, 1.13, 1.06, 1.14. Normalized Energy values as extracted: 0.63, 0.59, 0.61, 0.97, 0.85, 0.91, 1.00, 0.92, 0.79, 0.96, 0.81.]

Applying CFD automatically (compiler):

[Figure: two bar charts comparing Manual and Automated CFD, Speedup (axis 0.0 to 1.6) and Normalized Energy (axis 0.0 to 1.0), for eclat, jpeg-compr, mcf, soplex(pds), soplex(ref), and tiff-2-bw. Bar values were not preserved.]

IDENTIFY

• Branch slice

• Control-dependent region

CLONE LOOP

• Connect loop exits to the clone’s pre-header (provide in-order fetch)

INSERT

• PUSH in loop: after branch slice

• POP in clone: to replace the branch

CLEAN-UP

• Dead and redundant code elimination
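The effect of the four steps on a single loop might look like the sketch below. It is a hand-written Python illustration, not output of the GCC pass: `collections.deque` stands in for the BQ, `append` for PUSH, `popleft` for POP, and the squaring work in the control-dependent region is a hypothetical example.

```python
from collections import deque


def cfd_with_queue(values, threshold):
    bq = deque()  # software stand-in for the BQ (first in, first out)

    # Original loop after INSERT and CLEAN-UP: only the branch slice
    # and the PUSH remain.
    for v in values:
        bq.append(v > threshold)   # PUSH placed after the branch slice

    # Clone, entered from the first loop's exit so that pushes and pops
    # occur in the same order.
    out = []
    for v in values:
        if bq.popleft():           # POP replaces the original branch
            out.append(v * v)      # control-dependent region
    assert not bq                  # every pushed predicate was consumed
    return out
```

For example, `cfd_with_queue([3, 11, 9, 20], 10)` returns `[121, 400]`, the same result the untransformed loop would produce.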

Interesting Observation

A third of mispredictions come from separable branches:

• The branch has a large CD region (if-conversion not profitable).

• The branch does not depend on its own CD instructions via a loop-carried data dependence.
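The second condition can be made concrete with a counterexample. The following Python sketch uses hypothetical function names and a made-up accumulation loop to show a branch that is not separable: its predicate reads `acc`, which the control-dependent region writes, so computing all predicates ahead of time changes the result.

```python
def inseparable_loop(values, limit):
    """Predicate depends on acc, which the CD region updates."""
    acc = 0
    for v in values:
        if acc + v <= limit:   # predicate reads acc ...
            acc += v           # ... and the CD region writes it
    return acc


def naive_decoupling(values, limit):
    """Incorrect split: predicates are computed from a stale acc of 0."""
    predicates = [0 + v <= limit for v in values]
    acc = 0
    for v, taken in zip(values, predicates):
        if taken:
            acc += v
    return acc


data = [4, 5, 3, 6]
# The loop-carried dependence makes the two versions disagree.
assert inseparable_loop(data, 10) != naive_decoupling(data, 10)
```

A branch whose predicate only reads values untouched by its CD region, such as `v > threshold`, has no such dependence and can be split safely.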

[Figure labels: branch-slice, branch, control-dependent region.]

[Figure annotation: Energy Reduction values as extracted: 63%, 65%, 67%, 68%, 69%, 67%, 65%.]

CFD

ISA Support

• Branch queue (BQ) specification

• New push/pop instructions

Software Side

BQ size is finite + loops with high trip counts = loop strip-mining
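Strip-mining can be sketched as follows: the iteration space is cut into strips no longer than the queue, and each strip runs its predicate loop and then its branch loop. The `bq_size` parameter, its default of 4, and the summing work are hypothetical values chosen for illustration.

```python
def strip_mined_cfd(values, threshold, bq_size=4):
    """Run the two decoupled loops one strip at a time."""
    total = 0
    for start in range(0, len(values), bq_size):
        strip = values[start:start + bq_size]

        # First loop of the strip: at most bq_size predicates are queued.
        bq = [v > threshold for v in strip]
        assert len(bq) <= bq_size

        # Second loop of the strip: consume the queued predicates in order.
        for v, taken in zip(strip, bq):
            if taken:
                total += v
    return total
```

With ten elements and `bq_size=4`, the loop runs as strips of 4, 4, and 2 iterations, so the queue never holds more than four predicates at once.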

Hardware Side

• BQ microarch., length and recovery

• Interaction with pipelining and OoO execution

[Figure: loop structure before and after CFD.]

Original Loop

• Each iteration: branch-slice, branch, control-dependent region

CFD Loops

• First loop: branch-slice, Push_BQ

• Second loop: Branch_on_BQ, control-dependent region

Execution Scenarios

• Common Case (BQ hit): BQ drives fetch.

• Uncommon Case (BQ miss): Speculate or Stall.

[Figure: IF-to-EX pipeline timelines of the slice and branch instructions for the BQ hit and BQ miss cases.]

Other interesting aspects of CFD:

• Supports partially separable branches

• Supports nested branches through multi-level decoupling

• Overheads can be significantly reduced through value communication (called CFD+ in the paper)

Problem #1

No fetch separation: need branch prediction

Problem #2

No mechanism to communicate predicates to the Fetch Unit

[Figure: IF-to-EX pipeline timelines of slice and branch instructions, shown for the Original loop and for CFD, with the BQ between the CFD slice and branch instructions.]

CFD provides:

• Fetch separation

• Mechanism to communicate predicates to the Fetch Unit (BQ)