university of michigan electrical engineering and computer science 1 systematic register bypass...

University of Michigan, Electrical Engineering and Computer Science

Systematic Register Bypass Customization for Application-Specific Processors

Kevin Fan, Nathan Clark, Michael Chu, K. V. Manjunath, Rajiv Ravindran, Mikhail Smelyanskiy, Scott Mahlke

Advanced Computer Architecture Laboratory

University of Michigan


Introduction

• Bypass network allows for data forwarding to reduce pipeline stalls

• Full bypass: any FU can bypass from any other FU and from any pipeline stage

$\#\,\text{paths} = (\text{issue width})^2 \times (\text{bypassable stages}) \times (\text{input ports per FU}) \times (\text{output ports per FU})$
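To get a feel for how quickly a full bypass network grows, the path count can be evaluated directly. The sketch below reads the formula above as a product of its four terms; the function name and the machine parameters are hypothetical and chosen only for illustration.

```python
def full_bypass_paths(issue_width, bypassable_stages, inputs_per_fu, outputs_per_fu):
    """Path count of a fully bypassed machine, read as a product of the four terms."""
    return issue_width ** 2 * bypassable_stages * inputs_per_fu * outputs_per_fu


# Hypothetical machines: same pipeline depth and port counts, different issue widths.
narrow = full_bypass_paths(issue_width=2, bypassable_stages=2, inputs_per_fu=2, outputs_per_fu=1)
wide = full_bypass_paths(issue_width=4, bypassable_stages=2, inputs_per_fu=2, outputs_per_fu=1)
print(narrow, wide)  # 16 64
```

Doubling the issue width quadruples the path count, which is the quadratic growth in width that the formula implies.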


Bypass Path Utilization

• As processors get wider and deeper, cost of bypass network increases quadratically [Palacharla ’98]

• Only a few bypasses are heavily utilized

[Figure: normalized cumulative number of bypasses (y-axis, 0.65 to 1) versus percent utilization (x-axis, 0 to 16)]


Designing a Partial Bypass Network

• Reduce hardware at the cost of runtime

• Design a sparse bypass network while minimizing performance impact

• Challenges:
– Reconcile different requirements for different program regions
– Interplay between different bypass paths
– Huge search space, exponential number of possible configurations


Spacewalking Partial Bypass

[Diagram: bypasses ranked by importance, from most useful to least useful. Loop: remove the least useful bypass, evaluate the new machine, replace the bypass if performance drops too much.]

• Profile-guided Pareto ascent
– Rank bypass paths by importance
– Remove least important path and evaluate performance impact
– Update rankings with new statistics
– Repeat until performance degrades too far
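The loop described above can be sketched as a greedy removal procedure. This is a minimal illustration under stated assumptions, not the authors' implementation: `spacewalk`, `evaluate`, `profile`, the bypass names, the usage counts, and the one-stall-per-missing-bypass cost model are all hypothetical.

```python
# Hypothetical profile data: how many times each bypass path is used by the program.
USES = {"I1->I1": 50, "I2->I1": 5, "M3->I1": 30, "I1->I2": 2}
BASE_CYCLES = 1000  # hypothetical cycle count on the fully bypassed machine


def evaluate(kept):
    """Toy cost model: every use of a removed bypass costs one stall cycle."""
    return BASE_CYCLES + sum(n for b, n in USES.items() if b not in kept)


def profile(kept):
    """Toy re-profiling step: importance is just the usage count of each kept bypass."""
    return {b: USES[b] for b in kept}


def spacewalk(bypasses, max_slowdown):
    """Greedily drop the least important bypass until the slowdown limit is hit."""
    kept = set(bypasses)
    baseline = evaluate(kept)
    designs = [(frozenset(kept), baseline)]
    while kept:
        ranking = profile(kept)                      # update rankings
        victim = min(kept, key=lambda b: (ranking[b], b))
        cycles = evaluate(kept - {victim})           # evaluate the new machine
        if cycles > baseline * max_slowdown:         # dropped too much: keep the bypass, stop
            break
        kept.remove(victim)
        designs.append((frozenset(kept), cycles))
    return designs


designs = spacewalk(USES, max_slowdown=1.05)
for kept, cycles in designs:
    print(sorted(kept), cycles)
```

Each entry in `designs` is one candidate machine along the walk, so the list traces a cost/performance curve from the fully bypassed design down to the sparsest design within the slowdown limit.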

[Figure: flow from program to usage statistics to Pareto machines; plot labels: Cost, Performance, Cost/Performance]


Ranking Bypass Paths

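$\text{Importance} = \dfrac{\%\,\text{utilization}}{\text{offload potential}}$

$\%\,\text{utilization} = \dfrac{\text{cycles bypass was used}}{\text{total cycles}}$

$\text{offload potential} = \dfrac{\text{redundant cycles}}{\text{cycles bypass was used}}$

[Diagram: a bypass path and its equivalent bypass paths (+1, +2)]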
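A small worked example may help make the ranking metric concrete. The sketch assumes that importance is the ratio of % utilization to offload potential, as the slide layout suggests; the function name, the handling of the zero cases, and the cycle counts are assumptions made for illustration.

```python
def importance(cycles_used, total_cycles, redundant_cycles):
    """Importance = % utilization / offload potential (assumed reading of the slide)."""
    if cycles_used == 0:
        return 0.0                      # assumption: an unused bypass has no importance
    utilization = cycles_used / total_cycles
    offload_potential = redundant_cycles / cycles_used
    if offload_potential == 0:
        return float("inf")             # assumption: nothing can be offloaded elsewhere
    return utilization / offload_potential


# Hypothetical counts: both bypasses are used in 200 of 1000 cycles.
hard_to_replace = importance(cycles_used=200, total_cycles=1000, redundant_cycles=50)
easy_to_replace = importance(cycles_used=200, total_cycles=1000, redundant_cycles=200)
print(hard_to_replace > easy_to_replace)  # True
```

With equal utilization, the bypass whose uses could mostly be served by an equivalent path ranks lower, so it is removed first.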


A Closer Look

• Uses more bypasses than necessary

• Not all edges require 1-stage bypass

[Figure: data-flow graph of operations Ma, Ib, Ic, Id, Ie, If with critical edges marked, on a machine with units I1, I2, M3. Example 4-cycle schedules (ops issued per cycle): "a; b c; d e; f", "a; b c; d; f e", and "a; b; d c; f e".]


Compiling for Partial Bypass

• Difficulties:
– Latencies between operations vary depending on resource assignments
– Current assignment will affect future decisions

• Naïve scheduler will arbitrarily place Op c

• Need to provide resource hints to the scheduler to break ties

[Figure: data-flow graph of operations Ma, Ib, Ic, Id, Ie, If on a machine with units I1, I2, M3; every edge has possible latencies 1 or 2.]

Optimal:

Time  I1  I2  M3
0             a
1     b
2     d   c
3     f   e

Scheduler (c?, d?, e? mark operations that could be placed on either integer unit):

Time  Op
0     a
1     b
2     c
3     d
4     e
5     f
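The effect of resource assignment on edge latency can be illustrated with a small model. Everything in this sketch is an assumption made for illustration: the set of 1-cycle bypasses, the data-flow edges (the edges of the slide's graph are not recoverable from the text), the 2-cycle latency when no bypass exists, and the function names.

```python
# Hypothetical machine: 1-cycle forwarding exists only on these (producer, consumer) unit pairs.
BYPASSES = {("M3", "I1"), ("I1", "I1"), ("I2", "I2")}
# Hypothetical data-flow edges (op -> consumers), with keys listed in topological order.
DFG = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": ["f"], "e": [], "f": []}


def latency(src_unit, dst_unit):
    """1 cycle over a bypass, otherwise 2 cycles (assumed register-file round trip)."""
    return 1 if (src_unit, dst_unit) in BYPASSES else 2


def schedule(assignment):
    """Return {op: issue cycle} for a fixed op -> unit assignment."""
    preds = {op: [] for op in DFG}
    for src, dsts in DFG.items():
        for dst in dsts:
            preds[dst].append(src)
    start, busy = {}, set()
    for op in DFG:  # dict order is topological for this graph
        unit = assignment[op]
        t = max((start[p] + latency(assignment[p], unit) for p in preds[op]), default=0)
        while (unit, t) in busy:  # one op per unit per cycle
            t += 1
        busy.add((unit, t))
        start[op] = t
    return start


good = schedule({"a": "M3", "b": "I1", "c": "I2", "d": "I1", "e": "I2", "f": "I1"})
bad = schedule({"a": "M3", "b": "I2", "c": "I1", "d": "I1", "e": "I2", "f": "I2"})
print(max(good.values()) + 1, max(bad.values()) + 1)  # 4 7
```

The same six operations take 4 cycles with one assignment and 7 with another, purely because the second assignment routes values over unit pairs that have no bypass.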


BUG Preference Algorithm

• Perform pre-scheduling pass over the DFG

• Bottom-Up Greedy algorithm based on [Ellis ’85]

• Traverse DFG, critical paths first

• Select bypass paths to achieve earliest completion time for each operation

• Take into account time to:
– Get inputs
– Execute
– Send outputs to consumers
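The bullets above can be sketched as a simplified greedy pre-pass. It is not the BUG algorithm of [Ellis ’85] and not the authors' code: it recurses from the deepest sinks toward the sources, then gives each operation the unit with the earliest start time, accounting only for input arrival and unit availability (not for sending outputs to consumers). The bypass set, data-flow edges, latencies, and names are hypothetical.

```python
# Hypothetical machine and program, chosen only for illustration.
BYPASSES = {("M3", "I1"), ("I1", "I1"), ("I2", "I2")}  # 1-cycle forwarding paths
UNITS = {"mem": ["M3"], "int": ["I1", "I2"]}
OPS = {"a": "mem", "b": "int", "c": "int", "d": "int", "e": "int", "f": "int"}
PREDS = {"a": [], "b": ["a"], "c": ["a"], "d": ["b"], "e": ["c"], "f": ["d"]}


def latency(src_unit, dst_unit):
    """1 cycle over a bypass, otherwise 2 cycles (assumption)."""
    return 1 if (src_unit, dst_unit) in BYPASSES else 2


def depth(op):
    """Length of the longest dependence chain ending at op."""
    return 1 + max((depth(p) for p in PREDS[op]), default=0)


def assign_preferences():
    unit_of, start, busy = {}, {}, set()

    def visit(op):
        if op in unit_of:
            return
        for p in sorted(PREDS[op], key=depth, reverse=True):  # deepest producer first
            visit(p)
        best = None
        for unit in UNITS[OPS[op]]:
            # Earliest cycle at which all inputs can reach this unit.
            t = max((start[p] + latency(unit_of[p], unit) for p in PREDS[op]), default=0)
            while (unit, t) in busy:  # wait for a free slot on the unit
                t += 1
            if best is None or t < best[0]:
                best = (t, unit)
        start[op], unit_of[op] = best
        busy.add((best[1], best[0]))

    for sink in sorted(OPS, key=depth, reverse=True):  # critical paths first
        visit(sink)
    return unit_of, start


unit_of, start = assign_preferences()
print(unit_of)
print(start)
```

On this toy input the pass keeps the chain b, d, f on I1, where the M3 result is forwarded in one cycle, and places c and e on the free unit I2.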


BUG Example

• Place ops b, d, f on unit 1 since M bypasses to it

• Place ops c, e on unit 2 since resource is free

[Figure: data-flow graph of operations Ma, Ib, Ic, Id, Ie, If on a machine with units I1, I2, M3, annotated step by step with unit preference sets such as {1,2}, {1}, {2}, and {3}.]

Time  I1  I2  M3
0             a
1     b
2     d   c
3     f   e


Bypass Cost Savings

[Figure: relative cost (y-axis, 0 to 1) per benchmark, with one series for each relative performance level: 1, 0.95, 0.9, 0.8, 0.7]


Pareto-optimal Machines

[Figure: two panels, djpeg (5-wide) and g721dec (9-wide). Relative dynamic cycle count (y-axis, 1 to 2) versus cost in gates (x-axis, 0 to 40000 for djpeg, 0 to 50000 for g721dec), comparing BUG preferences with ILP preferences.]


Bypass Usage is Variable

[Figure: utilization (more to less) of individual bypass paths for each benchmark: bfish, cjpeg, djpeg, epic, unepic, g721enc, g721dec, gsmenc, gsmdec, mesa, mpeg2enc, pegenc, pegdec, rasta, rawc, rawd]


Conclusion

• Significant bypass network cost can be saved without much performance loss

• Our approach:
– Intelligent bypass spacewalking
– Resource hints allow compiler to schedule code effectively
– 95% of original performance maintained when removing 60% of utilized bypasses

• http://cccp.eecs.umich.edu
