Exploiting Criticality to Reduce Bottlenecks in Distributed Uniprocessors
Behnam Robatmili and Sibi Govindan, University of Texas at Austin
Doug Burger, Microsoft Research
Stephen W. Keckler, Architecture Research Group, NVIDIA & University of Texas at Austin
2
Motivation
Do we still care about single thread execution?
Running each single thread faster and more power-efficiently by using
multiple cores:
1. increases parallel system efficiency
2. lessens the need for heterogeneity and its software complexity!
3
Summary
Distributed uniprocessors: multiple cores share their resources to run a single thread across them
Scalable complexity but cross-core delay overheads
Which overheads limit performance scalability? Registers, memory, fetch, branches, etc.?
Measure critical cross-core delays using profile-based critical path analysis
Low-overhead distributed mechanisms to mitigate these bottlenecks
4
Distributed Uniprocessors
• Partition the single-thread instruction stream across cores
• Distributed resources (RF, BP and L1) act like a large processor
• Inter-core instruction, data and control communication
• Goal: Reduce these overheads
[Figure: several cores, each with its own RF, BP, and L1, connected by inter-core data and control communication links; complexity grows linearly with core count]
5
Example Distributed Uniprocessors
Feature | CoreFusion | TFlex
ISA | x86 | EDGE
Instruction partitioning | Dynamic: centralized register management unit (RMU) | Static: compiler-generated predicated dataflow blocks
Fetch and control dependences | Dynamic: centralized fetch management unit (FMU) | Dynamic: next-block prediction (no intra-block control flow)
Cross-core instruction communication | Dynamic: centralized RMU | Dynamic: distributed register RW queues
Scalability | 4 2-wide cores | 8 2-wide cores
This study uses TFlex as the underlying distributed uniprocessor
Older designs: Multiscalar and TLS use a noncontiguous instruction window
Recent designs: CoreFusion, TFlex, WiDGET and Forwardflow
6
TFlex Distributed Uniprocessor
[Figure: 32 physical cores arranged alongside the L2 bank array; the cores are grouped into 8 logical processors (threads T0-T7)]
• Maps one predicated data-flow block to each core
• Blocks communicate across registers (via register home cores); example: B2 on C2 communicates to B3 on C3 through R1 on C1
• Intra-block communication is all dataflow
• 32 physical cores
• 8 logical processors (threads)
[Figure: 4-core example (C0-C3) with blocks B0-B3 mapped one per core; register home cores hold R0-R3; intra-block IQ-local communication, inter-block cross-core communication, and control dependences are shown]
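A hypothetical illustration of how register home cores could be assigned when cores are fused; the modulo striping below is an assumption of this sketch (the exact TFlex assignment is not given in this transcript), but it matches the example above, where R1's home is C1.

def register_home_core(reg_num, num_cores):
    # Illustrative striping of architectural registers across the fused cores.
    return reg_num % num_cores

assert register_home_core(1, 4) == 1   # B2 on C2 writes R1, whose home core is C1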
7
Profile-based Critical Path Bottleneck Analysis
Using critical path analysis to quantify scalable resources and bottlenecks
[Chart: SPEC INT critical-path breakdown (real work, network, fetch, register communication, ...); annotations point out the fetch bottleneck caused by mispredicted blocks, the register communication overhead, and the network as one of the scalable resources]
Distributed Criticality Analyzer
[Block diagram: on the executing core, the pipeline stages Fetch, Decode, Decode/Merge, Issue, Execute, RegWrite, RegWrite/Bypass, and Commit interact with the coordinator components: a Criticality Predictor and a Block Reissue Engine. Each entry in the block criticality status table holds the requested block PC, the predicted communication-critical instructions (pred_input, pred_output), the counters i_counter and o_counter, and an available_blocks_bitpattern; the coordinator also selects the core for running a fetch-critical block.]
8
• A statically-selected coordinator core is assigned to each region of the code executing on a core
  – Each coordinator core holds and maintains the criticality data for the regions assigned to it
  – Sends the criticality data to the executing core when the region is fetched
  – Enables register bypassing, dynamic merging, block reissue, etc. (see the table-entry sketch below)
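A minimal sketch of what one coordinator-held criticality entry might look like, using the field names from the diagram above (pred_input, pred_output, i_counter, o_counter, available_blocks_bitpattern); field widths, types, and the update interface are illustrative assumptions, not taken from the paper.

from dataclasses import dataclass, field

@dataclass
class BlockCriticalityEntry:
    """One entry of the block criticality status table kept by a coordinator core.
    Field names follow the diagram; sizes and semantics here are assumptions."""
    block_pc: int                                        # PC of the block this entry describes
    pred_input: set = field(default_factory=set)         # instruction IDs predicted input-critical
    pred_output: set = field(default_factory=set)        # instruction IDs predicted output-critical
    i_counter: int = 0                                    # confidence counter for input-criticality prediction
    o_counter: int = 0                                    # confidence counter for output-criticality prediction
    available_blocks_bitpattern: int = 0                  # cores still holding a decoded copy of this block

    def send_to_executing_core(self):
        # On a fetch of this block, the coordinator ships the predictions to the core
        # that will execute it, enabling register bypassing, dynamic merging, and
        # block reissue (illustrative return value, not the hardware message format).
        return {"pc": self.block_pc,
                "critical_inputs": self.pred_input,
                "critical_outputs": self.pred_output,
                "available_on": self.available_blocks_bitpattern}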
9
Register Bypassing
[Figure: 4-core example (C0-C3) with blocks B0-B3 and register home cores holding R0-R3; intra-block IQ-local communication vs. inter-block cross-core communication; output-critical = last departing instruction, input-critical = last arriving instruction]
Sample execution: block B2 communicates to B3 through register paths 1 & 2 (path 2 is slow)
Coordinator core C0 predicts the late communication instructions B2₁ & B3₁ (only path 2 is predicted)
Critical register values on the critical path are bypassed directly (register bypassing), coordinated via coordination signals (see the sketch below)
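A minimal sketch of the forwarding decision this slide describes, assuming the set of output-critical register writes comes from the coordinator's predictor above; the core/register naming and the routing helper are hypothetical, not the hardware protocol.

def route_register_write(reg, value, consumer_core, home_core, predicted_output_critical):
    """Decide how a register value produced by one block reaches the next block.
    predicted_output_critical: registers the coordinator predicted to be on the
    critical path (an assumption of this sketch)."""
    if reg in predicted_output_critical:
        # Register bypassing: send the value straight to the core running the
        # consuming block, skipping the register home core on the critical path.
        send(value, dest=consumer_core, tag=("bypass", reg))
    else:
        # Default path: write through the register home core's read/write queue;
        # the consumer picks it up from there.
        send(value, dest=home_core, tag=("rwq_write", reg))

def send(value, dest, tag):
    # Placeholder for the on-chip network operation; illustration only.
    print(f"send {tag} -> core {dest}: {value}")

# Example: R1 is predicted output-critical, so it goes straight to C3 instead of its home core C1.
# route_register_write("R1", 42, consumer_core=3, home_core=1, predicted_output_critical={"R1"})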
10
Optimization Mechanisms
• Output criticality: register bypassing – explained on the previous slide (saves delay)
• Input criticality: dynamic merging – decode-time dependence height reduction for critical input chains (saves delay)
• Fetch criticality: block reissue – reissuing critical instructions following pipeline flushes (saves energy & delay by reducing fetches by about 40%)
11
Aggregate Performance
16-core individual and aggregate results
[Bar chart: speedup (y-axis 0.95-1.40) per SPEC FP benchmark (168.wupwise, 171.swim, 172.mgrid, 177.mesa, 179.art, 183.equake, 188.ammp, 301.apsi) and SPEC INT benchmark (164.gzip, 175.vpr, 181.mcf, 186.crafty, 197.parser, 253.perlbmk, 256.bzip2, 300.twolf), plus FP, INT, and overall averages, for each optimization mechanism: bypass, merge, breissue, and aggregate]
12
Final Critical Path Analysis
[Chart: SPEC INT critical-path breakdown for the 1-core base, 8-core base and optimized, and 16-core base and optimized configurations; the optimized configurations show an improved distribution (categories include the network)]
13
Performance Scalability Results
[Line charts: speedup over a single dual-issue core vs. number of cores (1, 2, 4, 8, 16) for SPEC FP and SPEC INT; curves for baseline, bypass, bypass_merge, and bypass_merge_breissue, with Pollack's rule as a reference; the FP axis extends to about 5.5x and the INT axis to about 3x]
16-core INT: 22% speedup
Follows Pollack's rule up to 8 cores
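For reference only (the standard statement of Pollack's rule, not a number from this study): single-thread performance grows roughly with the square root of the resources devoted to it, so the reference curve is approximately Speedup(N) ≈ √N, i.e. about 1.4x, 2x, 2.8x, and 4x at N = 2, 4, 8, and 16 cores.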
14
Energy-Delay-Squared Product (ED²)
8-core INT: 50% increase in ED²
The energy-efficient configuration changes from 4 cores to 8 cores
65 nm, 1.0 V, 1 GHz
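As a reminder of the metric (standard definition, not specific to this work): ED² = energy × delay², which weights delay more heavily than energy; for example, a configuration that is 20% faster at equal energy improves ED² by a factor of 1/1.2² ≈ 0.69.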
15
Conclusions and Future Work
• Goal: A power/performance scalable distributed uniprocessor
• This work addressed several key performance scalability limitations
• Next steps (toward 4x speedup on SPEC INT):
Overhead | How to address | Status
Low-accuracy next block prediction | OGEHL-based integrated branch and predicate predictor (IPP) | submitted
Branches converted to predicates | OGEHL-based integrated branch and predicate predictor (IPP) | submitted
Dataflow fanout delay and power overhead | Low-power compiler-exposed operand broadcasts (EOBs) | submitted
Icache utilization | Variable block sizes | MSR E2
Questions?
17
Backup Slides
• Setup and Benchmarks
• CPA Example
• Single Core IPCs
• Communication Criticality Example
• Fetch Criticality Example
• Full Performance Results
• Criticality Predictor
• Motivation
18
Backup Slides
19
Summary
Do we still care about single thread execution?
Running each single thread effectively across multiple cores significantly increases parallel system efficiency and lessens the need for heterogeneity and its software complexity!
Distributed uniprocessors: multiple cores can share their resources to run a single thread across them
Scalable complexity but cross-core delay overheads
What are the overheads that limit performance scalability? Registers, memory, fetch, branches, etc.?
We measure critical cross-core delays using static critical path analysis and find ways to hide them
Major detected bottlenecks: cross-core register communication and fetches on flushes
We propose low-overhead distributed mechanisms to mitigate these bottlenecks
20
Motivation
• Need for scaling single-thread performance/power in multicores
  – Amdahl's law
  – Optimized power/performance for each thread
• Distributed uniprocessors
  – Running single-thread code across distributed cores
  – Sharing resources, but also partitioning overhead
• Focus of this work
  – Static critical path analysis to quantify bottlenecks
  – Dynamic hardware to reduce critical cross-core latencies
21
Distributed Uniprocessors
• Partition single-thread instruction stream across cores
• Distributed resources (RF, BP and L1) act like a large processor
[Figure: several cores, each with its own RF, BP, and L1, acting together as one large processor]
22
Exploiting Communication Criticality
[Figure: 4-core example with blocks B0-B3 and register home cores holding R0-R3; intra-block IQ-local communication vs. inter-block cross-core communication; a dataflow fanout tree distributes a register value; output-critical = last departing, input-critical = last arriving]
Sample execution: block B0 communicating to B1 through B2
Predicting the critical instructions in blocks B0 and B1; forwarding the critical register value (register forwarded)
Replacing the fanout for the critical input with broadcast messages
23
Dynamic Merging Results
cfactor: No. of predicted late inputs per block
full merge: running the algorithm on all register inputs (16-core runs)
[Bar chart: speedup over no merging (y-axis 1.00-1.20) for merge cfactor 1, 2, 3, and full merge]
65% of the maximum benefit using a cfactor of 1
24
Block Reissue Results
Block hit rates by instruction-queue size: 1x IQ: 46%, 2x IQ: 57%, 4x IQ: 65%, 8x IQ: 71%
[Bar chart: speedup over no block reissue (y-axis 0.96-1.20) per SPEC FP and SPEC INT benchmark and their averages, for 1x, 2x, 4x, and 8x IQ sizes; 16-core runs; results are affected by dependence prediction]
25
Critical Path Bottleneck Analysis
Using critical path analysis to quantify
scalable resources and bottlenecks
[Chart: SPEC INT critical-path breakdown, annotated with the fetch bottleneck caused by mispredicted blocks, the register communication overhead, and one of the scalable resources]
26
Performance Scalability Results
[Line charts: speedup over single dual-issue cores vs. number of cores (1, 2, 4, 8, 16) for SPEC FP and SPEC INT; curves for baseline, bypass, bypass_merge, and bypass_merge_breissue]
16-core INT: 22% speedup
Follows Pollack's rule up to 8 cores
27
Block Reissue
• Each core maintains a table of available blocks and the status of their cores
• Done by extending the block allocate/commit protocols
• Policies (see the sketch below)
  – Block lookup: previously executed copies of the predicted block should be spotted
  – Block replacement: refetch if the predicted block is not spotted in any core
• Major power saving on fetch/decode
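A minimal sketch of the lookup/replacement policy described above, assuming a simple map from block PC to the set of cores still holding an executed copy; the class and method names are illustrative only, not the hardware protocol.

class BlockReissueTable:
    """Tracks which cores still hold an already-fetched copy of each block.
    Updated by the (extended) allocate/commit protocol: allocate records a copy,
    a flush or eviction removes it. Interface names are assumptions of this sketch."""
    def __init__(self):
        self.available = {}   # block PC -> set of core IDs holding a decoded copy

    def on_allocate(self, block_pc, core_id):
        self.available.setdefault(block_pc, set()).add(core_id)

    def on_evict(self, block_pc, core_id):
        self.available.get(block_pc, set()).discard(core_id)

    def lookup(self, predicted_pc, free_cores):
        """Block lookup: reissue on a core that already holds the block and is free."""
        for core in self.available.get(predicted_pc, set()):
            if core in free_cores:
                return ("reissue", core)       # skip fetch/decode entirely
        # Block replacement: no usable copy found, fall back to a normal fetch.
        return ("refetch", None)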
[Figure: TFlex core array and the L2 bank array; 1-cycle latency]
28
• Each core has (shared when fused)
  – 1-ported cache bank (LSQ), 1-ported register banks (RWQ)
  – 128-entry RAM-based IQ, a branch prediction table
• When fused
  – Registers, memory locations, and BP tables are striped across the cores
[Figure: single TFlex core microarchitecture (instruction queue, register file, L1 cache, RWQ, LSQ, branch predictor) alongside the TFlex core and L2 bank arrays; 1-cycle latency; figure courtesy of Katie Coons]
29
• Each core has the minimum resources for one block
  – 1-ported cache bank, 1-ported register bank (128 regs)
  – 128-entry RAM-based IQ, a branch prediction table
  – The RWQ and LSQ hold the transient architectural state during execution and commit it at commit time
  – The LSQ supports memory dependence prediction
[Figure: single TFlex core microarchitecture (instruction queue, register file, L1 cache, RWQ, LSQ, branch predictor); figure courtesy of Katie Coons]
30
Critical Output Bypassing
• Bypass late outputs to their destination instructions directly
  – Similar to memory bypassing and cloaking [Sohi '99], but no speculation needed
  – Uses predicted late outputs
  – Restricted to communication between subsequent blocks
[Stacked bar chart: % of inter-core register data transfers (0-100%) per SPEC FP and SPEC INT benchmark and their averages, broken down into categories 1, 2, 3, and >3]
31
Simulation Setup
Parameter | Setup
iCache | Partitioned 8KB (1-cycle hit)
Branch predictor | Local/Gshare tournament predictor (8K+256 bits, 3-cycle latency)
Single core | Out-of-order, RAM-structured 128-entry issue window, dual-issue (up to two INT and one FP) or single-issue
L1 cache | Partitioned 8KB (2-cycle hit, 2-way set-associative, 1 read port and 1 write port), 44-entry LSQ banks
L2 and memory | S-NUCA L2 cache; L2 hit latency varies from 5 to 27 cycles; average main memory latency is 150 cycles

Benchmark type | Names
8 SPEC FP | wupwise, swim, mgrid, mesa, art, equake, ammp, apsi
8 SPEC INT | gzip, vpr, mcf, crafty, parser, perlbmk, bzip2, twolf
32
Predicting Critical Instructions
• State-of-the-art predictor [Fields '01]
  – High communication and power overheads
  – Large storage overhead
  – Complex token-passing hardware
• Even more complicated to port to a dynamic CMP
• Need a simple, low-overhead yet efficient predictor
33
Proposed Mechanisms
Bottlenecks and the mechanisms proposed for them:
• Cross-core register communication → register forwarding
• Dataflow software fanout trees → dynamic instruction merging
• Expensive refill after pipeline flushes → block reissue
Remaining bottlenecks:
• Fixed block sizes
• Poor next-block prediction accuracy
• Predicates not being predicted
Critical Path Analysis
• Processes a program dependence graph [Bodik '01]
  – Nodes: uarch events
  – Edges: data and uarch dependences
  – Measures the contribution of each uarch resource (see the sketch below)
• More effective than simulation- or profile-based techniques
• Built on top of [Nagarajan '06]
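A minimal sketch of the kind of last-arrival analysis such a tool performs, assuming a dependence graph whose nodes are uarch events with latencies and whose edges are data/uarch dependences; the graph format and category names are illustrative, not the tool's actual interface.

from collections import defaultdict

def critical_path_breakdown(nodes, edges):
    """Compute the critical (longest) path through a dependence graph of uarch events
    and attribute its cycles to resource categories.
    nodes: dict event_id -> (latency_cycles, category), e.g. "fetch", "reg_comm", "execute"
    edges: list of (src_event, dst_event) dependences; the graph is assumed acyclic and
           nodes are assumed to be listed in topological order (a sketch assumption)."""
    preds = defaultdict(list)
    for src, dst in edges:
        preds[dst].append(src)

    finish, best_pred = {}, {}
    for ev, (lat, _) in nodes.items():                       # topological order assumed
        start = max((finish[p] for p in preds[ev]), default=0)
        finish[ev] = start + lat
        best_pred[ev] = max(preds[ev], key=lambda p: finish[p], default=None)

    # Walk back from the last-finishing event, charging each event's latency
    # to its resource category.
    breakdown = defaultdict(int)
    ev = max(finish, key=finish.get)
    while ev is not None:
        lat, category = nodes[ev]
        breakdown[category] += lat
        ev = best_pred[ev]
    return dict(breakdown)

# Example: critical_path_breakdown({"f0": (3, "fetch"), "e0": (1, "execute")}, [("f0", "e0")])
# charges 3 cycles to "fetch" and 1 cycle to "execute".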
34
[Diagram: Simulator → Event Interface → Critical Path Analysis Tool]
35
Block Reissue Hit rates
[Bar chart: block reissue hit rate (0-100%) per SPEC FP and SPEC INT benchmark and their averages, for 1x, 2x, 4x, and 8x IQ sizes]
36
IPC of a Single 2-wide TFlex Core
• SPEC INT: IPC = 0.8
• SPEC FP: IPC = 0.9
37
Speculation Aware
[Chart: SPEC INT results, cf = 1]
38
Critical Path Analysis
• Critical path: the longest dependence path during program execution
  – Determines execution time
• Critical path analysis [Bodik '01]
  – Measures the contribution of each uArch resource on critical cycles
• Built on top of the TRIPS CPA [Nagarajan '06]
39
Exploiting Fetch Criticality
[Figure: 4-core example (C0-C3); the CFG contains blocks B0 and B1; cross-core block control order and coordination signals; fetched, refetched, and reissued blocks are marked]
Predicted fetched blocks: B0, B1, B0, B0
Actual block order: B0, B0, B0, B0
Without block reissue, all 3 blocks would be flushed
With block reissue, the coordinator core (C0) detects the B0 instances on C2-C3 and reissues them (replayed in the snippet below)
50% reduction in fetch and decode operations
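Replaying this example through the earlier BlockReissueTable sketch (hypothetical code, not the hardware protocol); the PCs and core IDs are illustrative.

# Cores C2 and C3 still hold executed copies of B0 when the B1 misprediction is detected.
table = BlockReissueTable()
table.on_allocate(block_pc=0xB0, core_id=2)
table.on_allocate(block_pc=0xB0, core_id=3)

# After the flush, the next-block predictor asks for B0 again; both holders are free,
# so the lookup reissues instead of refetching.
print(table.lookup(predicted_pc=0xB0, free_cores={2, 3}))   # ('reissue', 2) or ('reissue', 3)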
40
Full Performance Comparison
41
Full Energy Comparison
Communication Criticality Predictor
• Block-atomic execution: late inputs and outputs are critical
  – Last outputs/inputs departing/arriving before block commit
• 70% and 50% of late inputs/outputs are critical for SPEC INT and FP
• Extend the next-block predictor protocol
  – MJRTY algorithm [Moore '82] to predict/train
  – Increment/decrement a confidence counter upon correct/incorrect prediction of the current majority (see the sketch below)
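A minimal sketch of the majority-vote style training described above, loosely following the MJRTY idea [Moore '82] of keeping one candidate plus a counter; the per-block bookkeeping, threshold, and method names are illustrative assumptions.

class CriticalInstPredictor:
    """Predicts which instruction of a block is communication-critical.
    One (candidate, counter) pair per block, in the spirit of the Boyer-Moore
    MJRTY algorithm: the counter rises when the observed critical instruction
    matches the current candidate and falls otherwise; at zero, the candidate
    is replaced. The prediction threshold here is an assumption."""
    def __init__(self):
        self.candidate = {}   # block PC -> predicted critical instruction ID
        self.counter = {}     # block PC -> confidence counter

    def predict(self, block_pc):
        # Only predict once there is some confidence in the current majority.
        if self.counter.get(block_pc, 0) >= 1:
            return self.candidate.get(block_pc)
        return None

    def train(self, block_pc, observed_critical_inst):
        cand = self.candidate.get(block_pc)
        cnt = self.counter.get(block_pc, 0)
        if cand is None or cnt == 0:
            self.candidate[block_pc], self.counter[block_pc] = observed_critical_inst, 1
        elif observed_critical_inst == cand:
            self.counter[block_pc] = cnt + 1          # correct majority: raise confidence
        else:
            self.counter[block_pc] = cnt - 1          # wrong majority: lower confidence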
42
43
Exploiting Communication Criticality
• Selective register forwarding
  – Critical register outputs are forwarded directly to subsequent cores
  – Other outputs use the original indirect register forwarding through the RWQs
• Selective instruction merging
  – Specializes the decode of instructions dependent on a critical register input
  – Eliminates dataflow fanout moves in address computation networks
44
Exploiting Fetch Criticality
• Blocks after mispredictions are critical
• Many flushed blocks may be re-fetched right after a misprediction
• Blocks are predicated, so old blocks can be reissued if their cores are free
  – Each owner core keeps track of its blocks
  – Extended allocate/commit protocols
• Major power saving on fetch/decode
45
Exploiting Communication Criticality
[Figure: 4-core example (C0-C3) with blocks B0-B3 and register home cores holding R0-R3; intra-block IQ-local communication vs. inter-block cross-core communication; output-critical = last departing, input-critical = last arriving]
Sample execution: block B2 communicates to B3 through register paths 1 & 2 (path 2 is slow)
Coordinator core C0 predicts the late communication instructions B2₁ & B3₁ (only path 2 is predicted)
Fast-forwarding the critical register value on the critical path (register bypassing), coordinated via coordination signals
46
Summary
Do we still care about single thread execution?
Running each single thread effectively across multiple cores significantly increases parallel system efficiency and lessens the need for heterogeneity and its software complexity!
Distributed uniprocessors: multiple cores can share their resources to run a single thread across them
Scalable complexity but cross-core delay overheads
What are the overheads that limit performance scalability? Registers, memory, fetch, branches, etc.?
We measure critical cross-core delays using static critical path analysis and find ways to hide them
We propose low-overhead distributed mechanisms to mitigate these bottlenecks