Using Criticality to Attack Performance Bottlenecks
Brian Fields, UC-Berkeley
(Collaborators: Rastislav Bodik, Mark Hill, Chris Newburn)
Bottleneck Analysis
Bottleneck Analysis: determining the performance effect of an event on execution time

An event could be:
• an instruction’s execution
• an instruction-window-full stall
• a branch mispredict
• a network request
• inter-processor communication
• etc.
Why is Bottleneck Analysis Important?
Bottleneck Analysis Applications
Run-time Optimization
• Resource arbitration
  • e.g., how to schedule memory accesses?
• Effective speculation
  • e.g., which branches to predicate?
• Dynamic reconfiguration
  • e.g., when to enable hyperthreading?
• Energy efficiency
  • e.g., when to throttle frequency?

Design Decisions
• Overcoming technology constraints
  • e.g., how to mitigate the effect of long wire latencies?

Programmer Performance Tuning
• Where have the cycles gone?
  • e.g., which cache misses should be prefetched?
Why is Bottleneck Analysis Hard?
Current state of the art

Event counts:
Exe. time = (CPU cycles + Mem. cycles) × Clock cycle time
where Mem. cycles = Number of cache misses × Miss penalty

[Figure: two cache misses, miss 1 and miss 2 (100 cycles each), overlapping in time.]
2 misses, but only 1 miss penalty (a sketch of the discrepancy follows)
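A minimal sketch, in Python with illustrative values only (not measurements from the talk), of why the event-count model breaks down: the naive formula charges a full penalty per miss, while the overlapped timeline pays the shared penalty once.

```python
# Naive event-count model vs. overlapped misses (illustrative values).
MISS_PENALTY = 100

def naive_memory_cycles(num_misses):
    # Event counts: Mem. cycles = number of misses * miss penalty
    return num_misses * MISS_PENALTY

def overlapped_memory_cycles(miss_start_times):
    # Union of the miss intervals: overlapping penalties are paid once.
    busy_cycles = set()
    for start in miss_start_times:
        busy_cycles.update(range(start, start + MISS_PENALTY))
    return len(busy_cycles)

print(naive_memory_cycles(2))             # 200
print(overlapped_memory_cycles([0, 0]))   # 100: 2 misses, 1 miss penalty
```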
Parallelism in systems complicates performance understanding
Parallelism
• A branch mispredict and full-store-buffer stall occur in the same cycle that three loads are waiting on the memory system and two floating-point multiplies are executing
• Two parallel cache misses
• Two parallel threads
Criticality Challenges
• Cost
  • How much speedup is possible from optimizing an event?
• Slack
  • How much can an event be “slowed down” before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware
Our Approach
Our Approach: Criticality
Critical events affect execution time; non-critical events do not
Bottleneck Analysis: determining the performance effect of an event on execution time
Defining criticality
Need Performance Sensitivity
• slowing down a “critical” event should slow down the entire program
• speeding up a “noncritical” event should leave execution time unchanged
Standard Waterfall Diagram
[Figure: waterfall diagram, cycles 1–15. Each instruction of the sequence (R5 = 0; R3 = 0; R1 = #array + R3; R6 = ld[R1]; R3 = R3 + 1; R5 = R6 + R5; cmp R6, 0; bf L1; R5 = R5 + 100; R0 = R5; Ret R0) passes through Fetch (F), Execute (E), and Commit (C).]
Annotated with Dependence Edges
[Figure: the same waterfall diagram annotated with dependence edges; the edge types are Fetch BW, ROB, Data Dep, and Branch Misp. (one branch is mispredicted, MISP).]
Edge Weights Added
[Figure: the annotated diagram with latency weights (0–3 cycles) on each edge.]
Convert to Graph
[Figure: the waterfall collapsed into a dependence graph: an F, E, and C node per instruction, connected by the weighted dependence edges.]
Smaller graph instance
[Figure: a five-instruction F/E/C graph with edge weights; one edge is marked “Non-critical, but how much slack?” and a critical icache miss is marked “Critical, but how costly?”]
Add “hidden” constraints
[Figure: the same graph with the machine’s implicit constraints added as extra weighted edges; the same two questions remain.]
Add “hidden” constraints
[Figure: with the hidden constraints in place, the answers can be read off the graph: Slack = 13 – 7 = 6 cycles; Cost = 13 – 7 = 6 cycles.]
Slack “sharing”
[Figure: two edges each show Slack = 6 cycles.]
Can delay one edge by 6 cycles, but not both!
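A small sketch of slack sharing on an assumed graph (not the exact one on the slide): a 13-cycle critical path runs in parallel with a 7-cycle chain, so each chain edge has 6 cycles of slack, but the 6 cycles are shared between them.

```python
# Longest path in a tiny DAG; nodes are listed in topological order.
def cp_length(edges, order, dst):
    dist = {n: float("-inf") for n in order}
    dist[order[0]] = 0
    for u in order:
        for v, w in edges.get(u, ()):
            dist[v] = max(dist[v], dist[u] + w)
    return dist[dst]

order = ["s", "a", "b", "t"]
edges = {"s": [("a", 3), ("t", 13)],    # direct 13-cycle critical path
         "a": [("b", 2)],               # s -> a -> b -> t totals 7 cycles
         "b": [("t", 2)]}
print(cp_length(edges, order, "t"))     # 13
edges["a"] = [("b", 2 + 6)]             # delay one edge by its slack of 6
print(cp_length(edges, order, "t"))     # still 13
edges["b"] = [("t", 2 + 6)]             # ...but delaying both is not safe
print(cp_length(edges, order, "t"))     # 19: execution time grows
```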
Machine Imbalance
[Graph: percent of dynamic instructions (0–100) vs. number of cycles of slack (0–100) for perl, with apportioned and global curves.]
~80% of instructions have at least 5 cycles of apportioned slack
Criticality Challenges
• Cost
  • How much speedup is possible from optimizing an event?
• Slack
  • How much can an event be “slowed down” before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware
Simple criticality not always enough
Sometimes events have nearly equal criticality
[Figure: two parallel cache misses: miss #1 (99 cycles), miss #2 (100 cycles).]
Want to know:
• how critical is each event?
• how far from critical is each event?
Actually, even that is not enough
Our solution: measure interactions

Two parallel cache misses: miss #1 (99 cycles), miss #2 (100 cycles)

Cost(miss #1) = 0
Cost(miss #2) = 1
Cost({miss #1, miss #2}) = 100

Aggregate cost (100) > sum of individual costs (0 + 1) ⇒ parallel interaction
icost = aggregate cost – sum of individual costs = 100 – 0 – 1 = 99
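The definition translates directly into code; a minimal sketch (the classification follows the interpretation given on the next slides):

```python
# icost = aggregate cost - sum of individual costs (definition from the talk)
def icost(aggregate_cost, individual_costs):
    return aggregate_cost - sum(individual_costs)

def classify(ic):
    if ic > 0:
        return "parallel interaction"
    if ic < 0:
        return "serial interaction"
    return "independent"

ic = icost(100, [0, 1])    # the two parallel misses above
print(ic, classify(ic))    # 99 parallel interaction
```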
Interaction cost (icost)

icost = aggregate cost – sum of individual costs

1. Positive icost ⇒ parallel interaction (two overlapping misses)
2. Zero icost ⇒ independent (two misses far apart in the program)
3. Negative icost ⇒ ?
Negative icost

Two serial cache misses (data dependent), miss #1 (100) then miss #2 (100), running alongside a 110-cycle chain of ALU latency.

Cost(miss #1) = 90
Cost(miss #2) = 90
Cost({miss #1, miss #2}) = 90
(with either miss removed, the 110-cycle ALU chain becomes the bottleneck, so each miss alone, and both together, cost the same 90 cycles)

icost = aggregate cost – sum of individual costs = 90 – 90 – 90 = –90 ⇒ serial interaction
Interaction cost (icost)

icost = aggregate cost – sum of individual costs

1. Positive icost ⇒ parallel interaction
2. Zero icost ⇒ independent
3. Negative icost ⇒ serial interaction (two misses on one dependence chain, in parallel with ALU latency; other serially interacting events include branch mispredicts, fetch BW, load-replay traps, and LSQ stalls)
Why care about serial interactions?

(Recall: two serial, data-dependent misses of 100 cycles each, in parallel with a 110-cycle ALU chain.)

Reason #1: we are over-optimizing! Prefetching miss #2 doesn’t help if miss #1 is already prefetched (but the overhead still costs us).
Reason #2: we have a choice of what to optimize. Prefetching miss #2 has the same effect as prefetching miss #1.
Icost Case Study: Deep pipelines
Looking for serial interactions!
[Figure: pipeline with the Dcache (DL1) access latency grown from 1 to 4 cycles.]
Icost Breakdown (6-wide, 64-entry window)

               gcc      gzip     vortex
DL1            18.3 %   30.5 %   25.8 %
DL1+window     -4.2     -15.3    -24.5
DL1+bw         10.0     6.0      15.5
DL1+bmisp      -7.0     -3.4     -0.3
DL1+dmiss      -1.4     -0.4     -1.4
DL1+alu        -1.6     -8.2     -4.7
DL1+imiss      0.1      0.0      0.4
...            ...      ...      ...
Total          100.0    100.0    100.0
Icost Case Study: Deep pipelines
[Figure: dependence graph for instructions i1–i6 with F/E/C nodes and edge weights; the DL1 access edges (latency 4) and the instruction-window edge are highlighted, stepped through over several slides.]
Criticality Challenges
• Cost
  • How much speedup is possible from optimizing an event?
• Slack
  • How much can an event be “slowed down” before increasing execution time?
• Interactions
  • When do multiple events need to be optimized simultaneously?
  • When do we have a choice?
• Exploit in Hardware
Exploit in Hardware
• Criticality Analyzer
  • Online, fast feedback
  • Limited to critical/not-critical
• Replacement for Performance Counters
  • Requires offline analysis
  • Constructs the entire graph
Only last-arriving edges can be critical
Observation: for R1 ← R2 + R3, if the dependence into R2 is on the critical path, then the value of R2 arrived last:

critical ⇒ arrives last
arrives last ⇏ necessarily critical

[Figure: an E node with operand edges from R2 and R3; the R3 dependence resolved early, so its edge cannot be critical.]
Determining last-arrive edges

Observe events within the machine:

last_arrive[F] =
• EF if branch mispredicted
• CF if ROB stall
• FF otherwise

last_arrive[E] =
• FE if data ready on fetch
• EE otherwise (observe the arrival order of operands)

last_arrive[C] =
• EC if the commit pointer is delayed
• CC otherwise

Last-arrive edges
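A hedged sketch of the three rules as plain functions; the boolean observations (branch mispredicted, ROB stall, data ready on fetch, commit pointer delayed) stand in for the machine events the hardware would watch directly.

```python
# Which incoming edge arrived last at each node of an instruction.
def last_arrive_F(branch_mispredicted, rob_stalled):
    if branch_mispredicted:
        return "EF"   # fetch waited on a mispredicted branch's execute
    if rob_stalled:
        return "CF"   # fetch waited on an older instruction's commit
    return "FF"       # in-order fetch from the previous instruction

def last_arrive_E(data_ready_on_fetch):
    if data_ready_on_fetch:
        return "FE"   # execute waited only on this instruction's fetch
    return "EE"       # the last-arriving operand wins

def last_arrive_C(commit_pointer_delayed):
    if commit_pointer_delayed:
        return "EC"   # commit pointer was delayed by this execute
    return "CC"       # in-order commit behind the previous instruction
```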
The last-arrive rule: the CP consists only of “last-arrive” edges.

Prune the graph: only last-arrive edges need to be kept; no other edges could be on the CP.

Backward-propagate along last-arrive edges, starting from the newest instruction... and we’ve found the critical path!

Found the CP by only observing last-arrive edges, but this still requires constructing the entire graph.
Step 2. Reducing storage requirements

The CP is a “long” chain of last-arrive edges: the longer a given chain of last-arrive edges, the more likely it is part of the CP.

Algorithm: find sufficiently long last-arrive chains
1. Plant a token into a node n
2. Propagate it forward, only along last-arrive edges
3. Check for the token after several hundred cycles
4. If the token is alive, n is assumed critical
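A minimal sketch of the token-passing algorithm over an assumed forward adjacency map of last-arrive edges; the Python set is for illustration, while the hardware version is a small table updated as instructions retire (see the SRAM implementation in the backup slides).

```python
def token_survives(last_arrive_succ, start, horizon):
    """last_arrive_succ: node -> iterable of nodes reached by a
    last-arrive edge leaving that node (forward direction)."""
    frontier = {start}                            # 1. plant token
    for _ in range(horizon):                      # 2. propagate forward
        frontier = {v for u in frontier
                    for v in last_arrive_succ.get(u, ())}
        if not frontier:
            return False    # token died: node assumed non-critical
    return True             # 3-4. token alive: node assumed critical
```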
Online Criticality Detection
[Figure: tokens are planted and propagated forward along last-arrive edges; tokens off the critical path “die”, while a token planted on the CP survives.]
Putting it all together
[Diagram: the OOO core emits last-arrive edges as each producer instruction retires; the token-passing analyzer consumes them (training path) to train a PC-indexed CP prediction table, which answers “E-critical?” queries (prediction path).]
Results

Performance (speed)
• Scheduling in clustered machines: 10% speedup
• Selective value prediction
• Deferred scheduling (Crowe et al.): 11% speedup
• Heterogeneous cache (Rakvic et al.): 17% speedup

Energy
• Non-uniform machine with fast and slow pipelines: ~25% less energy
• Instruction queue resizing (Sasanka et al.)
• Multiple frequency scaling (Semeraro et al.): 19% less energy with 3% less performance
• Selective pre-execution (Petric et al.)
Exploit in Hardware
• Criticality Analyzer
  • Online, fast feedback
  • Limited to critical/not-critical
• Replacement for Performance Counters
  • Requires offline analysis
  • Constructs the entire graph
Profiling goal

Goal: construct the graph over many dynamic instructions
Constraint: can only sample sparsely
Genome sequencing analogy

“Shotgun” genome sequencing: a DNA strand is too long to read end to end, so many short fragments are sampled at random and the overlaps among the samples are found to reconstruct the strand.
Mapping “shotgun” to our situation
[Figure: a long stream of dynamic instructions punctuated by events (icache miss, dcache miss, branch mispredict, no event); short detailed samples are collected around them.]

Profiler hardware requirements
[Figure: detailed samples are matched against a long sample until their contexts match.]
Sources of error

Error source                           Gcc      Parser   Twolf
Modeling execution as a graph          2.1 %    6.0 %    0.1 %
Errors in graph construction           5.3 %    1.5 %    1.6 %
Sampling only a few graph fragments    4.8 %    6.5 %    7.2 %
Total                                  12.2 %   14.0 %   8.9 %
Conclusion: Grand Challenges

• Cost: how much speedup is possible from optimizing an event?
• Slack: how much can an event be “slowed down” before increasing execution time?
• Interactions: when do multiple events need to be optimized simultaneously? When do we have a choice?

Addressed with: modeling, the token-passing analyzer, parallel interactions, serial interactions, and shotgun profiling.
Conclusion: Bottleneck Analysis Applications

Run-time Optimization
• Effective speculation: selective value prediction
• Resource arbitration: scheduling and steering in clustered processors
• Dynamic reconfiguration: resizing the instruction window
• Energy efficiency: non-uniform machines

Design Decisions
• Overcoming technology constraints: helped cope with a high-latency dcache

Programmer Performance Tuning
• Where have the cycles gone? Measured the cost of cache misses and branch mispredicts
Outline

Simple Criticality
• Definition (ISCA ’01)
• Detection (ISCA ’01)
• Application (ISCA ’01–’02)

Advanced Criticality
• Interpretation (MICRO ’03)
  • What types of interactions are possible?
• Hardware Support (MICRO ’03, TACO ’04)
  • Enhancement to performance counters
Backup Slides
Related Work

Criticality Prior Work

Critical-Path Method, PERT charts
• Developed for the Navy’s “Polaris” project, 1957
• Used as a project-management tool
• Simple critical-path and slack concepts

“Attribution” Heuristics
• Rosenblum et al. (SOSP 1995), and many others
• Mark the instruction at the head of the ROB as critical, etc.
• Empirically, limited accuracy
• Do not account for interactions between events
Related Work: Microprocessor Criticality

Latency tolerance analysis: Srinivasan and Lebeck (MICRO 1998)
Heuristics-driven criticality predictors: Tune et al. (HPCA 2001); Srinivasan et al. (ISCA 2001)
“Local” slack detector: Casmira and Grunwald (Kool Chips Workshop 2000)
ProfileMe with pair-wise sampling: Dean et al. (MICRO 1997)
Unresolved Issues

Alternative I: Addressing Unresolved Issues

Modeling and Measurement
• What resources can we model effectively?
  • difficulty with mutual-exclusion-type resources (ALUs)
• Efficient algorithms
• Release a tool for measuring cost/slack

Hardware
• Detailed design for the criticality analyzer
• Shotgun profiler simplifications
  • gradual path from counters

Optimization
• Explore heuristics for exploiting interactions
Alternative II: Chip Multiprocessors

Design Decisions
• Should each core support out-of-order execution?
• Should SMT be supported?
• How many processors are useful?
• What is the effect of inter-processor latency?

Programmer Performance Tuning: parallelizing applications
• What makes a good division into threads?
• How can we find them automatically, or at least help programmers to find them?
Unresolved issues: Modeling and Measurement
• What resources can we model effectively?
  • difficulty with mutual-exclusion-type resources (ALUs)
  • in other words, unanticipated side effects
[Figure: Original Execution. Code: 1. ld r2, [Mem] (cache miss); 2. add r3 ← r2 + 1; 3. ld r4, [Mem] (cache miss); 4. add r6 ← r4 + 1. The F/E/C graph shows the two 10-cycle misses and no adder contention.]

[Figure: Altered Execution, to compute the cost of inst #3’s cache miss. With inst #3 a cache hit, insts #2 and #4 now contend for the adder; an unmodeled contention edge appears (“should not be here”), yielding an incorrect critical path.]
Unresolved issues

Modeling and Measurement (cont.)
• How should processor policies be modeled?
  • relationship to the icost definition
• Efficient algorithms for measuring icosts
  • pairs of events, etc.
• Release a tool for measuring cost/slack
Unresolved issues

Hardware
• Detailed design for the criticality analyzer
  • help convince industry types to build it
• Shotgun profiler simplifications
  • gradual path from counters

Optimization
• Explore icost optimization heuristics
  • icosts are difficult to interpret
Validation

Validation: can we trust our model?

Run two simulations:
• Reduce CP latencies → expect “big” speedup
• Reduce non-CP latencies → expect no speedup
Validation: can we trust our model?
[Graph: speedup per cycle reduced (0–1) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa; reducing CP latencies vs. reducing non-CP latencies.]
Validation

Two steps:
1. Increase instruction latencies by their apportioned slack, for three apportioning strategies:
   1) latency + 1
   2) 5 cycles to as many instructions as possible
   3) 12 cycles to as many loads as possible
2. Compare to the baseline (no delays inserted)
Validation
[Graph: percent of execution time (0–120%) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average; bars for baseline, latency + 1, 12 cycles to loads, and five cycles.]
Worst case: inaccuracy of 0.6%
Slack Measurements

Three slack variants (a sketch computing the first two follows):

Local slack: the number of cycles latency can be increased without delaying any subsequent instruction
Global slack: the number of cycles latency can be increased without delaying the last instruction in the program
Apportioned slack: global slack distributed among instructions using an apportioning strategy
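A hedged sketch computing local and global slack for every edge of an assumed dependence DAG via forward and backward longest paths; apportioned slack would then distribute the global slack among edges by some strategy, which is omitted here.

```python
def edge_slacks(order, edges, sink):
    """order: topological node order; edges: u -> [(v, latency)]."""
    fwd = {n: 0 for n in order}              # earliest arrival times
    for u in order:
        for v, w in edges.get(u, ()):
            fwd[v] = max(fwd[v], fwd[u] + w)
    bwd = {n: 0 for n in order}              # longest path to the sink
    for u in reversed(order):
        for v, w in edges.get(u, ()):
            bwd[u] = max(bwd[u], w + bwd[v])
    cp = fwd[sink]                           # critical-path length
    local, global_ = {}, {}
    for u in order:
        for v, w in edges.get(u, ()):
            # headroom before the consumer v is delayed
            local[(u, v)] = fwd[v] - (fwd[u] + w)
            # headroom before the last instruction is delayed
            global_[(u, v)] = cp - (fwd[u] + w + bwd[v])
    return local, global_
```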
Slack measurements
[Graph: percent of dynamic instructions (0–100) vs. number of cycles of slack (0–100) for perl, with local, apportioned, and global curves.]
• ~21% of instructions have at least 5 cycles of local slack
• ~90% of instructions have at least 5 cycles of global slack
• ~80% of instructions have at least 5 cycles of apportioned slack

A large amount of exploitable slack exists
Application-centered Slack Measurements

Load slack: can we tolerate a long-latency L1 hit?
• design: wire-constrained machine, e.g., Grid
• non-uniformity: multi-latency L1
• apportioning strategy: apportion ALL slack to load instructions
Apportion all slack to loads
[Graph: percent of dynamic loads (0–100) vs. number of cycles of slack on load instructions (0–100) for gcc, perl, gzip.]
Most loads can tolerate an L2 cache hit
Multi-speed ALUs: can we tolerate ALUs running at half frequency?
• design: fast/slow ALUs
• non-uniformity: multi-latency execution, bypass
• apportioning strategy: give slack equal to original latency + 1
Latency+1 apportioning
[Graph: percent of dynamic instructions (0–100%) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average.]
Most instructions can tolerate doubling their latency
Slack Locality and Prediction

Predicting slack: two steps to a PC-indexed, history-based prediction:
1. Measure the slack of a dynamic instruction
2. Store it in an array indexed by the PC of the static instruction

Two requirements:
1. Locality of slack
2. The ability to measure the slack of a dynamic instruction
Locality of slack
[Graph: percent of (weighted) static instructions for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average; bars for the ideal case and for 100%, 95%, and 90% thresholds.]
A PC-indexed, history-based predictor can capture most of the available slack
Slack Detector

“Delay and observe”: effective for a hardware predictor.

Problem #1: iterating repeatedly over the same dynamic instruction
Solution: only sample each dynamic instruction once

Problem #2: determining if overall execution time increased
Solution: check if the delay made the instruction critical

Goal: determine whether an instruction has n cycles of slack (see the sketch below)
1. Delay the instruction by n cycles
2. Check if it is critical (via the critical-path analyzer)
   • No → the instruction has n cycles of slack
   • Yes → the instruction does not have n cycles of slack
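The delay-and-observe loop is nearly a one-liner given a criticality analyzer; a sketch with assumed hooks delay_instruction() and is_critical() (hypothetical names standing in for the hardware mechanisms):

```python
def has_n_cycles_of_slack(inst, n, delay_instruction, is_critical):
    # 1. Artificially delay the (sampled-once) dynamic instruction.
    delay_instruction(inst, n)
    # 2-4. If the delay did not make it critical, it had n cycles of slack.
    return not is_critical(inst)
```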
Slack Application

Fast/slow cluster microarchitecture
[Diagram: Fetch + Rename feeds a steering stage; a fast 3-wide cluster and a slow 3-wide cluster (each with its own window, registers, and ALUs) share a bypass bus and the data cache. The slow cluster saves ~37% core power.]

Aggressive non-uniform design:
• higher execution latencies
• increased (cross-domain) bypass latency
• decreased effective issue bandwidth
Picking bins for the slack predictor

Two decisions:
1. Steer to the fast or slow cluster
2. Schedule with high or low priority within a cluster

Use an implicit slack predictor with four bins (see the sketch below):
1. Steer to fast cluster + schedule with high priority
2. Steer to fast cluster + schedule with low priority
3. Steer to slow cluster + schedule with high priority
4. Steer to slow cluster + schedule with low priority
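A small sketch of the four-bin encoding; the bin numbering is an assumption, since the slides only name the four combinations:

```python
# Implicit slack predictor bins -> (cluster, scheduling priority).
BINS = {
    0: ("fast", "high"),
    1: ("fast", "low"),
    2: ("slow", "high"),
    3: ("slow", "low"),
}

def steer_and_schedule(predicted_bin):
    cluster, priority = BINS[predicted_bin]
    return cluster, priority

print(steer_and_schedule(2))   # ('slow', 'high')
```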
Slack-based policies
[Graph: normalized IPC (0–1.1) for ammp, art, gcc, gzip, mesa, parser, perl, vortex, average; bars for two fast high-power clusters, the slack-based policy, and register-dependence steering.]
10% better performance from hiding non-uniformities
CMP case study

Multithreaded Execution Case Study

Two questions:
• How should a program be divided into threads?
  • what makes a good cutpoint?
  • how can we find them automatically, or at least help programmers find them?
• What should a multiple-core design look like?
  • should each core support out-of-order execution?
  • should SMT be supported?
  • how many processors are useful?
  • what is the effect of inter-processor latency?
Parallelizing an application

Why parallelize a single-thread application?
• Legacy code, large code bases
• Difficult-to-parallelize apps: interpreted code, kernels of operating systems
• The freedom to use better programming languages: Scheme or Java instead of C/C++
Parallelizing an application

Simplifying assumption: the program binary is unchanged

Simplified problem statement: given a program of length L, find the cutpoint that divides the program into the two threads providing maximum speedup

Must consider: data dependences, execution latencies, control dependences, and proper load balancing
Parallelizing an application

Naive solution: try every possible cutpoint
Our solution: efficiently determine the effect of every possible cutpoint
• model execution before and after every cut (see the sketch below)
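A deliberately simplified sketch of “try every cutpoint”: here each instruction gets an assumed scalar cost and the two threads are assumed to overlap perfectly, so the model time of a cut is the larger half. The real analysis instead models each cut on the dependence graph, with data dependences, latencies, and synchronization edges.

```python
def best_cutpoint(costs):
    """costs: per-instruction serial cycle counts (illustrative model)."""
    total, prefix = sum(costs), 0
    best_time, best_cut = float("inf"), None
    for k in range(1, len(costs)):          # cut before instruction k
        prefix += costs[k - 1]
        t = max(prefix, total - prefix)     # two threads, perfect overlap
        if t < best_time:
            best_time, best_cut = t, k
    return best_time, best_cut

print(best_cutpoint([3, 1, 4, 1, 5, 9, 2]))   # (14, 5): balanced halves
```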
Solution
[Figure: the dependence graph from the first instruction to the last instruction, with weighted F/E/C edges; a “start” cut splits it into the two modeled threads.]
Parallelizing an application

Considerations:
• Synchronization overhead: add latency to EE edges
• Synchronization may involve turning EE edges into EF edges
• Scheduling of threads: additional CF edges

Challenges:
• State behavior (one thread spread across multiple processors): caches, branch predictor
• Control behavior: limits where cutpoints can be made
Parallelizing an application

More general problem: divide a program into N threads
• NP-complete

Icost can help:
• icost(p1, p2) << 0 implies p1 and p2 are redundant
• action: move p1 and p2 further apart
Preliminary Results

Experimental setup
• Simulator, based loosely on SimpleScalar
• Alpha SPECint binaries

Procedure
1. Assume the execution trace is known
2. Look at each 1k-instruction run
3. Test every possible cutpoint using 1k-instruction graphs
Dynamic Cutpoints

Cost Distribution of Dynamic Cutpoints
[Graph: cumulative percent of cutpoints (0–100) vs. execution-time reduction in cycles (0–100) for bzip, crafty, eon, gap, gcc, parser, perl, twolf, vpr.]
Only 20% of cuts yield benefits of > 20 cycles
Usefulness of cost-based policy
[Graph: speedup % (0–30) from parallelizing programs for a two-processor system, for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vpr; bars for a fixed-interval policy and a simple cost-based policy.]
Static Cutpoints

Cost Distribution of Static Cutpoints
[Graph: cumulative percent of static instructions (0–100) vs. average per-dynamic-instance cost (0–180 cycles) for bzip, crafty, eon, gap, gcc, gzip, mcf, parser, perl, twolf, vpr.]
Up to 60% of cuts yield benefits of > 20 cycles
Future Avenues of Research
• Map cutpoints back to actual code
  • compare automatically generated cutpoints to human-generated ones
  • see what the performance gains are in a simulator, as opposed to just on the graph
• Look at the effect of synchronization operations
  • what additional overhead do they introduce?
• Deal with state and control problems
  • might need some technique outside of the graph
CMP design study

What we can do:
• Try out many configurations quickly
  • dramatic changes in architecture are often only small changes in the graph
• Identify bottlenecks
  • especially interactions
CMP design study: Out-of-orderness

Is out-of-order execution necessary in a CMP?

Procedure
• model execution with different configurations
  • adjust CD edges
• compute breakdowns
  • notice which resources/events interact with the CD edges
CMP design study: Out-of-orderness
[Figure: the first-to-last-instruction dependence graph used to model each configuration.]
CMP design study: Out-of-orderness

Results summary
• Single core: performance taps out at 256 entries
• CMP: performance gains up through 1024 entries
  • some benchmarks see gains up to 16k entries

Why more beneficial? Use breakdowns to find out...

Components of window cost
• cache misses holding up retirement?
• long strands of data dependencies?
• predictable control flow?

Icost breakdowns give quantitative and qualitative answers
CMP design study: Out-of-orderness

cost(window) + icost(window, A) + icost(window, B) + icost(window, AB) = 0

[Figure: window cost shown as a 100% bar, broken down against ALU and cache-miss events for three cases: independent (the components are equal to the window cost), parallel interaction, and serial interaction.]
Summary of Preliminary Results

icost(window, ALU operations) << 0
• primarily communication between processors
• the window is often stalled waiting for data

Implications
• a larger window may be overkill
• need a cheap non-blocking solution, e.g., continual-flow pipelines
CMP design study: SMT?

Benefits
• reduced thread start-up latency
• reduced communication costs

How we could help
• distribution of thread lengths
• breakdowns to understand the effect of communication

[Figure: timeline of thread #1 starting thread #2, with the two overlapping.]
CMP design study: How many processors?

CMP design study: Other Questions

• What is the effect of inter-processor communication latency?
  • understand hidden vs. exposed communication
• Allocating processors to programs
  • methodology for the O/S to better assign programs to processors
Waterfall To Graph Story

[Figure sequence, restating the construction from the talk: Standard Waterfall Diagram → Annotated with Dependence Edges (edge types: Fetch BW, Data Dep, ROB, Branch Misp.) → Edge Weights Added → Convert to Graph → Find Critical Path → Add Non-last-arriving Edges → Graph Alterations (the branch misprediction made correct).]
Token-passing analyzer

Step 1. Observing

Observation: for R1 ← R2 + R3, if the dependence into R2 is on the critical path, then the value of R2 arrived last (critical ⇒ arrives last).

Determining last-arrive edges: observe events within the machine.
• last_arrive[F] = EF if branch mispredicted; CF if ROB stall; FF otherwise
• last_arrive[E] = FE if data ready on fetch; EE otherwise (observe the arrival order of operands)
• last_arrive[C] = EC if the commit pointer is delayed; CC otherwise
Last-arrive edges: a CPU stethoscope
[Figure: the CPU emits a stream of last-arrive edges (EC, EF, FE, FF, CC, ...).]

Last-arrive edges
[Figure: the F/E/C graph with only last-arrive edges.]

Remove latencies: once only last-arrive edges remain, explicit weights are not needed.
The last-arrive rule: the CP consists only of last-arrive edges, so the graph can be pruned to them; backward-propagating from the newest instruction finds the critical path, but this still requires constructing the entire graph.
Step 2. Efficient analysis

The longer a given chain of last-arrive edges, the more likely it is part of the CP. Algorithm: plant a token into a node n; propagate it forward only along last-arrive edges; check for the token after several hundred cycles; if the token is alive, n is assumed critical.
Token-passing example
1. Plant the token
2. Propagate the token
3. Is the token alive?
4. Yes → train as critical

Found the CP without constructing the entire graph
Implementation: a small SRAM array
[Diagram: a token queue indexed by the last-arrive producer node (inst id, type); written as last-arrive edges are observed and read as instructions (inst id, type) are committed.]
Size of SRAM: 3 bits × ROB size, i.e., < 200 bytes. Simply replicate for additional tokens.
Scheduling and Steering

Case Study #1: Clustered architectures
[Diagram: a steering stage distributes instructions among per-cluster issue windows, where scheduling picks them for issue.]

1. Current state of the art (Base)
2. Base + CP scheduling
3. Base + CP scheduling + CP steering
Current State of the Art
[Graph: normalized IPC (0.60–1.10) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa: unclustered vs. 2-cluster vs. 4-cluster, at constant issue width and clock frequency.]
Average clustering penalty for 4 clusters: 19%
CP Optimizations: Base + CP Scheduling
[Graph: the same normalized-IPC comparison with CP scheduling added.]
CP Optimizations: Base + CP Scheduling + CP Steering
[Graph: the same comparison with CP steering added.]
Average clustering penalty reduced from 19% to 6%
Token-passing vs. Heuristics: Local vs. Global Analysis
[Graph: speedup (−5% to 25%) for crafty, eon, gcc, gzip, perl, vortex, galgel, mesa: oldest-uncommitted and oldest-unissued heuristics vs. token-passing.]
Previous CP predictors made local, resource-sensitive predictions (HPCA ’01, ISCA ’01); CP exploitation seems to require global analysis
Icost case study

Icost Case Study: Deep pipelines

Deep pipelines cause long-latency loops: level-one (DL1) cache access, issue-wakeup, branch misprediction, ...

But these loops can often be mitigated indirectly. Assume a 4-cycle DL1 access; how should it be mitigated? Increase cache ports? Increase window size? Increase fetch BW? Reduce cache misses?

Really, we are looking for serial interactions!
[Figure: the i1–i6 dependence graph with the DL1 access edges (latency 4) and the window edge highlighted, repeated as the breakdown below is built up.]
Vortex Breakdowns, enlarging the window

               64       128      256
DL1            25.8     8.9      3.9
DL1+window     -24.5    -7.7     -2.6
DL1+bw         15.5     16.7     13.2
DL1+bmisp      -0.3     -0.6     -0.8
DL1+dmiss      -1.4     -2.1     -2.8
DL1+alu        -4.7     -2.5     -0.4
DL1+imiss      0.4      0.5      0.3
...            ...      ...      ...
Total          100.0    80.8     75.0
Shotgun Profiling
Offline Profiler Algorithm
[Figure: if a detailed sample’s signature equals the signature at some point in the long sample, then the detailed sample is pieced in at that point.]

Design issues: identifying the microexecution context
• Choosing signature bits
• Determining PCs (for better detailed-sample matching)
[Figure: a long sample with a start PC and branch outcomes; each branch encodes a taken/not-taken bit in the signature. A sketch of the matching step follows.]
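A toy sketch of the matching step, assuming signatures are bit strings of branch outcomes (an illustrative encoding, not the hardware format): slide a detailed sample’s signature along the long sample’s signature until they line up.

```python
def find_match(long_signature, sample_signature):
    """Return the offset where the detailed sample matches, else None."""
    n = len(sample_signature)
    for offset in range(len(long_signature) - n + 1):
        if long_signature[offset:offset + n] == sample_signature:
            return offset
    return None

print(find_match("0110100111", "1001"))   # 4
```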
Sources of error

Error source                           Gcc      Parser   Twolf
Building graph fragments               5.3 %    1.5 %    1.6 %
Sampling only a few graph fragments    4.8 %    6.5 %    7.2 %
Modeling execution as a graph          2.1 %    6.0 %    0.1 %
Total                                  12.2 %   14.0 %   8.9 %
Icost vs. Sensitivity Study

Compare Icost and Sensitivity Study

Corollary to the DL1/ROB serial interaction: as load latency increases, the benefit from enlarging the ROB increases.
[Figure: the i1–i6 dependence graph with 1-cycle DL1 access edges, for comparison with the 4-cycle case.]
Compare Icost and Sensitivity Study
[Graph: speedup (0–25) vs. ROB size (64–256), one curve per DL1 latency from 1 to 10 cycles.]
Compare Icost and Sensitivity Study

Sensitivity-study advantages
• more information, e.g., concave or convex curves

Interaction-cost advantages
• easy (automatic) interpretation: sign and magnitude have well-defined meanings
• concise communication: “DL1 and ROB interact serially”
Outline

• Definition (ISCA ’01): what does it mean for an event to be critical?
• Detection (ISCA ’01): how can we determine what events are critical?
• Interpretation (MICRO ’04, TACO ’04): what does it mean for two events to interact?
• Application (ISCA ’01–’02, TACO ’04): how can we exploit criticality in hardware?
Our solution: measure interactions

Two parallel cache misses (each 100 cycles):
Cost(miss #1) = 0
Cost(miss #2) = 0
Cost({miss #1, miss #2}) = 100

Aggregate cost (100) > sum of individual costs (0 + 0) ⇒ parallel interaction
icost = aggregate cost – sum of individual costs = 100 – 0 – 0 = 100
Criticality Analyzer (ISCA ’01)

Goal: detect the criticality of dynamic instructions

Procedure
1. Observe last-arriving edges, using simple rules
2. Propagate a token forward along last-arriving edges; at worst, a read-modify-write sequence to a small array
3. If the token dies, non-critical; otherwise, critical
Slack Analyzer (ISCA ’02)

Goal: detect the likely slack of static instructions

Procedure
1. Delay the instruction by n cycles
2. Check if it is critical (via the critical-path analyzer)
   • No → the instruction has n cycles of slack
   • Yes → the instruction does not have n cycles of slack
Shotgun Profiling (TACO ’04)

Goal: create representative graph fragments

Procedure
• Enhance ProfileMe counters with context
• Use the context to piece together counter samples