Exploiting Criticality to Reduce Bottlenecks in Distributed Uniprocessors
Behnam Robatmili and Sibi Govindan, University of Texas at Austin
Doug Burger, Microsoft Research
Stephen W. Keckler, Architecture Research Group, NVIDIA & University of Texas at Austin
2
Motivation
Do we still care about single thread execution?
Running each single thread faster and more power-efficiently by using
multiple cores:
1. increases parallel system efficiency
2. lessens the need for heterogeneity and its software complexity!
3
Summary
Distributed uniprocessors: multiple cores share their resources to run a single thread across them
Scalable complexity but cross-core delay overheads
Which overheads limit performance scalability? Registers, memory, fetch, branches, etc.?
Measure critical cross-core delays using profile-based critical path analysis
Low-overhead distributed mechanisms to mitigate these bottlenecks
4
Distributed Uniprocessors
• Partition the single-thread instruction stream across cores
• Distributed resources (RF, BP and L1) act like a large processor
• Inter-core instruction, data and control communication
• Goal: Reduce these overheads
[Figure: several cores, each with its own RF, BP, and L1, connected by inter-core data and control communication links; complexity grows linearly with core count]
5
Example Distributed Uniprocessors
Feature | CoreFusion | TFlex
ISA | x86 | EDGE
Instruction partitioning | Dynamic: centralized register management unit (RMU) | Static: compiler-generated predicated dataflow blocks
Fetch and control dependences | Dynamic: centralized fetch management unit (FMU) | Dynamic: next-block prediction (no intra-block control flow)
Cross-core instruction communication | Dynamic: centralized RMU | Dynamic: distributed register RW queues
Scalability | 4 2-wide cores | 8 2-wide cores
This study uses TFlex as the underlying distributed uniprocessor
Older designs: Multiscalar and TLS use a noncontiguous instruction window
Recent designs: CoreFusion, TFlex, WiDGET and Forwardflow
6
TFlex Distributed Uniprocessor
[Figure: 32 physical cores arranged alongside the L2 bank array; the cores are grouped into 8 logical processors (threads T0-T7)]
• Maps one predicated data-flow block to each core
• Blocks communicate across registers (via register home cores); example: B2 on C2 communicates to B3 on C3 through R1 on C1
• Intra-block communication is all dataflow
• 32 physical cores
• 8 logical processors (threads)
[Figure: 4-core example (C0-C3) with blocks B0-B3 mapped one per core; register home cores hold R0-R3; intra-block IQ-local communication, inter-block cross-core communication, and control dependences are shown]
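A hypothetical illustration of how register home cores could be assigned when cores are fused; the modulo striping below is an assumption of this sketch (the exact TFlex assignment is not given in this transcript), but it matches the example above, where R1's home is C1.

def register_home_core(reg_num, num_cores):
    # Illustrative striping of architectural registers across the fused cores.
    return reg_num % num_cores

assert register_home_core(1, 4) == 1   # B2 on C2 writes R1, whose home core is C1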
7
Profile-based Critical Path Bottleneck Analysis
Using critical path analysis to quantify scalable resources and bottlenecks
[Chart: SPEC INT critical-path breakdown (real work, network, fetch, register communication, ...); annotations point out the fetch bottleneck caused by mispredicted blocks, the register communication overhead, and the network as one of the scalable resources]
Distributed Criticality Analyzer
[Block diagram: on the executing core, the pipeline stages Fetch, Decode, Decode/Merge, Issue, Execute, RegWrite, RegWrite/Bypass, and Commit interact with the coordinator components: a Criticality Predictor and a Block Reissue Engine. Each entry in the block criticality status table holds the requested block PC, the predicted communication-critical instructions (pred_input, pred_output), the counters i_counter and o_counter, and an available_blocks_bitpattern; the coordinator also selects the core for running a fetch-critical block.]
8
• A statically-selected coordinator core is assigned to each region of the code executing on a core
  – Each coordinator core holds and maintains the criticality data for the regions assigned to it
  – Sends the criticality data to the executing core when the region is fetched
  – Enables register bypassing, dynamic merging, block reissue, etc. (see the table-entry sketch below)
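A minimal sketch of what one coordinator-held criticality entry might look like, using the field names from the diagram above (pred_input, pred_output, i_counter, o_counter, available_blocks_bitpattern); field widths, types, and the update interface are illustrative assumptions, not taken from the paper.

from dataclasses import dataclass, field

@dataclass
class BlockCriticalityEntry:
    """One entry of the block criticality status table kept by a coordinator core.
    Field names follow the diagram; sizes and semantics here are assumptions."""
    block_pc: int                                        # PC of the block this entry describes
    pred_input: set = field(default_factory=set)         # instruction IDs predicted input-critical
    pred_output: set = field(default_factory=set)        # instruction IDs predicted output-critical
    i_counter: int = 0                                    # confidence counter for input-criticality prediction
    o_counter: int = 0                                    # confidence counter for output-criticality prediction
    available_blocks_bitpattern: int = 0                  # cores still holding a decoded copy of this block

    def send_to_executing_core(self):
        # On a fetch of this block, the coordinator ships the predictions to the core
        # that will execute it, enabling register bypassing, dynamic merging, and
        # block reissue (illustrative return value, not the hardware message format).
        return {"pc": self.block_pc,
                "critical_inputs": self.pred_input,
                "critical_outputs": self.pred_output,
                "available_on": self.available_blocks_bitpattern}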
9
Register Bypassing
[Figure: 4-core example (C0-C3) with blocks B0-B3 and register home cores holding R0-R3; intra-block IQ-local communication vs. inter-block cross-core communication; output-critical = last departing instruction, input-critical = last arriving instruction]
Sample execution: block B2 communicates to B3 through register paths 1 & 2 (path 2 is slow)
Coordinator core C0 predicts the late communication instructions B2₁ & B3₁ (only path 2 is predicted)
Critical register values on the critical path are bypassed directly (register bypassing), coordinated via coordination signals (see the sketch below)
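A minimal sketch of the forwarding decision this slide describes, assuming the set of output-critical register writes comes from the coordinator's predictor above; the core/register naming and the routing helper are hypothetical, not the hardware protocol.

def route_register_write(reg, value, consumer_core, home_core, predicted_output_critical):
    """Decide how a register value produced by one block reaches the next block.
    predicted_output_critical: registers the coordinator predicted to be on the
    critical path (an assumption of this sketch)."""
    if reg in predicted_output_critical:
        # Register bypassing: send the value straight to the core running the
        # consuming block, skipping the register home core on the critical path.
        send(value, dest=consumer_core, tag=("bypass", reg))
    else:
        # Default path: write through the register home core's read/write queue;
        # the consumer picks it up from there.
        send(value, dest=home_core, tag=("rwq_write", reg))

def send(value, dest, tag):
    # Placeholder for the on-chip network operation; illustration only.
    print(f"send {tag} -> core {dest}: {value}")

# Example: R1 is predicted output-critical, so it goes straight to C3 instead of its home core C1.
# route_register_write("R1", 42, consumer_core=3, home_core=1, predicted_output_critical={"R1"})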
10
Optimization Mechanisms
• Output criticality: register bypassing – explained on the previous slide (saves delay)
• Input criticality: dynamic merging – decode-time dependence height reduction for critical input chains (saves delay)
• Fetch criticality: block reissue – reissuing critical instructions following pipeline flushes (saves energy & delay by reducing fetches by about 40%)
11
Aggregate Performance
16-core individual and aggregate results
[Bar chart: speedup (y-axis 0.95-1.40) per SPEC FP benchmark (168.wupwise, 171.swim, 172.mgrid, 177.mesa, 179.art, 183.equake, 188.ammp, 301.apsi) and SPEC INT benchmark (164.gzip, 175.vpr, 181.mcf, 186.crafty, 197.parser, 253.perlbmk, 256.bzip2, 300.twolf), plus FP, INT, and overall averages, for each optimization mechanism: bypass, merge, breissue, and aggregate]
12
Final Critical Path Analysis
[Chart: SPEC INT critical-path breakdown for the 1-core base, 8-core base and optimized, and 16-core base and optimized configurations; the optimized configurations show an improved distribution (categories include the network)]
13
Performance Scalability Results
[Line charts: speedup over a single dual-issue core vs. number of cores (1, 2, 4, 8, 16) for SPEC FP and SPEC INT; curves for baseline, bypass, bypass_merge, and bypass_merge_breissue, with Pollack's rule as a reference; the FP axis extends to about 5.5x and the INT axis to about 3x]
16-core INT: 22% speedup
Follows Pollack's rule up to 8 cores
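For reference only (the standard statement of Pollack's rule, not a number from this study): single-thread performance grows roughly with the square root of the resources devoted to it, so the reference curve is approximately Speedup(N) ≈ √N, i.e. about 1.4x, 2x, 2.8x, and 4x at N = 2, 4, 8, and 16 cores.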
14
Energy-Delay-Squared Product (ED²)
8-core INT: 50% increase in ED²
The energy-efficient configuration changes from 4 cores to 8 cores
65 nm, 1.0 V, 1 GHz
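As a reminder of the metric (standard definition, not specific to this work): ED² = energy × delay², which weights delay more heavily than energy; for example, a configuration that is 20% faster at equal energy improves ED² by a factor of 1/1.2² ≈ 0.69.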
15
Conclusions and Future Work
• Goal: A power/performance scalable distributed uniprocessor
• This work addressed several key performance scalability limitations
• Next steps (toward 4x speedup on SPEC INT):
Overhead | How to address | Status
Low-accuracy next block prediction | OGEHL-based integrated branch and predicate predictor (IPP) | submitted
Branches converted to predicates | OGEHL-based integrated branch and predicate predictor (IPP) | submitted
Dataflow fanout delay and power overhead | Low-power compiler-exposed operand broadcasts (EOBs) | submitted
Icache utilization | Variable block sizes | MSR E2
Questions?
17
Backup Slides
• Setup and Benchmarks
• CPA Example
• Single Core IPCs
• Communication Criticality Example
• Fetch Criticality Example
• Full Performance Results
• Criticality Predictor
• Motivation
18
Backup Slides
19
Summary
Do we still care about single thread execution?
Running each single thread effectively across multiple cores significantly increases parallel system efficiency and lessens the need for heterogeneity and its software complexity!
Distributed uniprocessors: multiple cores can share their resources to run a single thread across them
Scalable complexity but cross-core delay overheads
What are the overheads that limit performance scalability? Registers, memory, fetch, branches, etc.?
We measure critical cross-core delays using static critical path analysis and find ways to hide them
Major detected bottlenecks: cross-core register communication and fetches on flushes
We propose low-overhead distributed mechanisms to mitigate these bottlenecks
20
Motivation
• Need for scaling single-thread performance/power in multicores
  – Amdahl's law
  – Optimized power/performance for each thread
• Distributed uniprocessors
  – Running single-thread code across distributed cores
  – Sharing resources, but also partitioning overhead
• Focus of this work
  – Static critical path analysis to quantify bottlenecks
  – Dynamic hardware to reduce critical cross-core latencies
21
Distributed Uniprocessors
• Partition single-thread instruction stream across cores
• Distributed resources (RF, BP and L1) act like a large processor
[Figure: several cores, each with its own RF, BP, and L1, acting together as one large processor]
22
Exploiting Communication Criticality
[Figure: 4-core example with blocks B0-B3 and register home cores holding R0-R3; intra-block IQ-local communication vs. inter-block cross-core communication; a dataflow fanout tree distributes a register value; output-critical = last departing, input-critical = last arriving]
Sample execution: block B0 communicating to B1 through B2
Predicting the critical instructions in blocks B0 and B1; forwarding the critical register value (register forwarded)
Replacing the fanout for the critical input with broadcast messages
23
Dynamic Merging Results
cfactor: No. of predicted late inputs per block
full merge: running the algorithm on all register inputs (16-core runs)
[Bar chart: speedup over no merging (y-axis 1.00-1.20) for merge cfactor 1, 2, 3, and full merge]
65% of the maximum benefit using a cfactor of 1
24
Block Reissue Results
Block hit rates by instruction-queue size: 1x IQ: 46%, 2x IQ: 57%, 4x IQ: 65%, 8x IQ: 71%
[Bar chart: speedup over no block reissue (y-axis 0.96-1.20) per SPEC FP and SPEC INT benchmark and their averages, for 1x, 2x, 4x, and 8x IQ sizes; 16-core runs; results are affected by dependence prediction]
25
Critical Path Bottleneck Analysis
Using critical path analysis to quantify
scalable resources and bottlenecks
[Chart: SPEC INT critical-path breakdown, annotated with the fetch bottleneck caused by mispredicted blocks, the register communication overhead, and one of the scalable resources]
26
Performance Scalability Results
[Line charts: speedup over single dual-issue cores vs. number of cores (1, 2, 4, 8, 16) for SPEC FP and SPEC INT; curves for baseline, bypass, bypass_merge, and bypass_merge_breissue]
16-core INT: 22% speedup
Follows Pollack's rule up to 8 cores
27
Block Reissue
• Each core maintains a table of available blocks and the status of their cores
• Done by extending the block allocate/commit protocols
• Policies (see the sketch below)
  – Block lookup: previously executed copies of the predicted block should be spotted
  – Block replacement: refetch if the predicted block is not spotted in any core
• Major power saving on fetch/decode
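A minimal sketch of the lookup/replacement policy described above, assuming a simple map from block PC to the set of cores still holding an executed copy; the class and method names are illustrative only, not the hardware protocol.

class BlockReissueTable:
    """Tracks which cores still hold an already-fetched copy of each block.
    Updated by the (extended) allocate/commit protocol: allocate records a copy,
    a flush or eviction removes it. Interface names are assumptions of this sketch."""
    def __init__(self):
        self.available = {}   # block PC -> set of core IDs holding a decoded copy

    def on_allocate(self, block_pc, core_id):
        self.available.setdefault(block_pc, set()).add(core_id)

    def on_evict(self, block_pc, core_id):
        self.available.get(block_pc, set()).discard(core_id)

    def lookup(self, predicted_pc, free_cores):
        """Block lookup: reissue on a core that already holds the block and is free."""
        for core in self.available.get(predicted_pc, set()):
            if core in free_cores:
                return ("reissue", core)       # skip fetch/decode entirely
        # Block replacement: no usable copy found, fall back to a normal fetch.
        return ("refetch", None)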
[Figure: TFlex core array and the L2 bank array; 1-cycle latency]
28
• Each core has (shared when fused)
  – 1-ported cache bank (LSQ), 1-ported register banks (RWQ)
  – 128-entry RAM-based IQ, a branch prediction table
• When fused
  – Registers, memory locations, and BP tables are striped across the cores
[Figure: single TFlex core microarchitecture (instruction queue, register file, L1 cache, RWQ, LSQ, branch predictor) alongside the TFlex core and L2 bank arrays; 1-cycle latency; figure courtesy of Katie Coons]
29
• Each core has the minimum resources for one block
  – 1-ported cache bank, 1-ported register bank (128 regs)
  – 128-entry RAM-based IQ, a branch prediction table
  – The RWQ and LSQ hold the transient architectural state during execution and commit it at commit time
  – The LSQ supports memory dependence prediction
[Figure: single TFlex core microarchitecture (instruction queue, register file, L1 cache, RWQ, LSQ, branch predictor); figure courtesy of Katie Coons]
30
Critical Output Bypassing
• Bypass late outputs to their destination instructions directly
  – Similar to memory bypassing and cloaking [Sohi '99], but no speculation needed
  – Uses predicted late outputs
  – Restricted to communication between subsequent blocks
[Stacked bar chart: % of inter-core register data transfers (0-100%) per SPEC FP and SPEC INT benchmark and their averages, broken down into categories 1, 2, 3, and >3]
31
Simulation Setup
Parameter | Setup
iCache | Partitioned 8KB (1-cycle hit)
Branch predictor | Local/Gshare tournament predictor (8K+256 bits, 3-cycle latency)
Single core | Out-of-order, RAM-structured 128-entry issue window, dual-issue (up to two INT and one FP) or single-issue
L1 cache | Partitioned 8KB (2-cycle hit, 2-way set-associative, 1 read port and 1 write port), 44-entry LSQ banks
L2 and memory | S-NUCA L2 cache; L2 hit latency varies from 5 to 27 cycles; average main memory latency is 150 cycles

Benchmark type | Names
8 SPEC FP | wupwise, swim, mgrid, mesa, art, equake, ammp, apsi
8 SPEC INT | gzip, vpr, mcf, crafty, parser, perlbmk, bzip2, twolf
32
Predicting Critical Instructions
• State-of-the-art predictor [Fields '01]
  – High communication and power overheads
  – Large storage overhead
  – Complex token-passing hardware
• Even more complicated to port to a dynamic CMP
• Need a simple, low-overhead yet efficient predictor
33
Proposed Mechanisms
Bottlenecks and the mechanisms proposed for them:
• Cross-core register communication → register forwarding
• Dataflow software fanout trees → dynamic instruction merging
• Expensive refill after pipeline flushes → block reissue
Remaining bottlenecks:
• Fixed block sizes
• Poor next-block prediction accuracy
• Predicates not being predicted
Critical Path Analysis
• Processes a program dependence graph [Bodik '01]
  – Nodes: uarch events
  – Edges: data and uarch dependences
  – Measures the contribution of each uarch resource (see the sketch below)
• More effective than simulation- or profile-based techniques
• Built on top of [Nagarajan '06]
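A minimal sketch of the kind of last-arrival analysis such a tool performs, assuming a dependence graph whose nodes are uarch events with latencies and whose edges are data/uarch dependences; the graph format and category names are illustrative, not the tool's actual interface.

from collections import defaultdict

def critical_path_breakdown(nodes, edges):
    """Compute the critical (longest) path through a dependence graph of uarch events
    and attribute its cycles to resource categories.
    nodes: dict event_id -> (latency_cycles, category), e.g. "fetch", "reg_comm", "execute"
    edges: list of (src_event, dst_event) dependences; the graph is assumed acyclic and
           nodes are assumed to be listed in topological order (a sketch assumption)."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    finish, best_pred = {}, {}
    for ev, (lat, _) in nodes.items():                       # topological order assumed
        start = max((finish[p] for p in preds[ev]), default=0)
        finish[ev] = start + lat
        best_pred[ev] = max(preds[ev], key=lambda p: finish[p], default=None)

    # Walk back from the last-finishing event, charging each event's latency
    # to its resource category.
    breakdown = defaultdict(int)
    ev = max(finish, key=finish.get)
    while ev is not None:
        lat, category = nodes[ev]
        breakdown[category] += lat
        ev = best_pred[ev]
    return dict(breakdown)

# Example: critical_path_breakdown({"f0": (3, "fetch"), "e0": (1, "execute")}, [("f0", "e0")])
# charges 3 cycles to "fetch" and 1 cycle to "execute".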
34
[Diagram: Simulator → Event Interface → Critical Path Analysis Tool]
35
Block Reissue Hit rates
[Bar chart: block reissue hit rate (0-100%) per SPEC FP and SPEC INT benchmark and their averages, for 1x, 2x, 4x, and 8x IQ sizes]
36
IPC of a Single 2-wide TFlex Core
• SPEC INT: IPC = 0.8
• SPEC FP: IPC = 0.9
37
Speculation Aware
[Chart: SPEC INT results, cf = 1]
38
Critical Path Analysis
• Critical path: the longest dependence path during program execution
  – Determines execution time
• Critical path analysis [Bodik '01]
  – Measures the contribution of each uArch resource on critical cycles
• Built on top of the TRIPS CPA [Nagarajan '06]
39
Exploiting Fetch Criticality
[Figure: 4-core example (C0-C3); the CFG contains blocks B0 and B1; cross-core block control order and coordination signals; fetched, refetched, and reissued blocks are marked]
Predicted fetched blocks: B0, B1, B0, B0
Actual block order: B0, B0, B0, B0
Without block reissue, all 3 blocks would be flushed
With block reissue, the coordinator core (C0) detects the B0 instances on C2-C3 and reissues them (replayed in the snippet below)
50% reduction in fetch and decode operations
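Replaying this example through the earlier BlockReissueTable sketch (hypothetical code, not the hardware protocol); the PCs and core IDs are illustrative.

# Cores C2 and C3 still hold executed copies of B0 when the B1 misprediction is detected.
table = BlockReissueTable()
table.on_allocate(block_pc=0xB0, core_id=2)
table.on_allocate(block_pc=0xB0, core_id=3)

# After the flush, the next-block predictor asks for B0 again; both holders are free,
# so the lookup reissues instead of refetching.
print(table.lookup(predicted_pc=0xB0, free_cores={2, 3}))   # ('reissue', 2) or ('reissue', 3)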
40
Full Performance Comparison
41
Full Energy Comparison
Communication Criticality Predictor
• Block-atomic execution: late inputs and outputs are critical
  – Last outputs/inputs departing/arriving before block commit
• 70% and 50% of late inputs/outputs are critical for SPEC INT and FP
• Extend the next-block predictor protocol
  – MJRTY algorithm [Moore '82] to predict/train
  – Increment/decrement a confidence counter upon correct/incorrect prediction of the current majority (see the sketch below)
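A minimal sketch of the majority-vote style training described above, loosely following the MJRTY idea [Moore '82] of keeping one candidate plus a counter; the per-block bookkeeping, threshold, and method names are illustrative assumptions.

class CriticalInstPredictor:
    """Predicts which instruction of a block is communication-critical.
    One (candidate, counter) pair per block, in the spirit of the Boyer-Moore
    MJRTY algorithm: the counter rises when the observed critical instruction
    matches the current candidate and falls otherwise; at zero, the candidate
    is replaced. The prediction threshold here is an assumption."""
    def __init__(self):
        self.candidate = {}   # block PC -> predicted critical instruction ID
        self.counter = {}     # block PC -> confidence counter

    def predict(self, block_pc):
        # Only predict once there is some confidence in the current majority.
        if self.counter.get(block_pc, 0) >= 1:
            return self.candidate.get(block_pc)
        return None

    def train(self, block_pc, observed_critical_inst):
        cand = self.candidate.get(block_pc)
        cnt = self.counter.get(block_pc, 0)
        if cand is None or cnt == 0:
            self.candidate[block_pc], self.counter[block_pc] = observed_critical_inst, 1
        elif observed_critical_inst == cand:
            self.counter[block_pc] = cnt + 1          # correct majority: raise confidence
        else:
            self.counter[block_pc] = cnt - 1          # wrong majority: lower confidence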
42
43
Exploiting Communication Criticality
• Selective register forwarding
  – Critical register outputs are forwarded directly to subsequent cores
  – Other outputs use the original indirect register forwarding through the RWQs
• Selective instruction merging
  – Specializes the decode of instructions dependent on a critical register input
  – Eliminates dataflow fanout moves in address computation networks
44
Exploiting Fetch Criticality
• Blocks after mispredictions are critical
• Many flushed blocks may be re-fetched right after a misprediction
• Blocks are predicated, so old blocks can be reissued if their cores are free
  – Each owner core keeps track of its blocks
  – Extended allocate/commit protocols
• Major power saving on fetch/decode
45
Exploiting Communication Criticality
[Figure: 4-core example (C0-C3) with blocks B0-B3 and register home cores holding R0-R3; intra-block IQ-local communication vs. inter-block cross-core communication; output-critical = last departing, input-critical = last arriving]
Sample execution: block B2 communicates to B3 through register paths 1 & 2 (path 2 is slow)
Coordinator core C0 predicts the late communication instructions B2₁ & B3₁ (only path 2 is predicted)
Fast-forwarding the critical register value on the critical path (register bypassing), coordinated via coordination signals
46
Summary
Do we still care about single thread execution?
Running each single thread effectively across multiple cores significantly increases parallel system efficiency and lessens the need for heterogeneity and its software complexity!
Distributed uniprocessors: multiple cores can share their resources to run a single thread across them
Scalable complexity but cross-core delay overheads
What are the overheads that limit performance scalability? Registers, memory, fetch, branches, etc.?
We measure critical cross-core delays using static critical path analysis and find ways to hide them
We propose low-overhead distributed mechanisms to mitigate these bottlenecks