Motivation
Baseline decoupled look-ahead
Look-ahead thread acceleration
Additional insights and summary
Accelerating Decoupled Look-ahead to Exploit Implicit Parallelism
Raj Parihar
Advisor: Prof. Michael C. Huang
Department of Electrical & Computer Engineering
University of Rochester, Rochester, NY
Raj Parihar Advanced Computer Architecture Lab University of Rochester
Motivation
Despite the proliferation of multi-core, multi-threaded systems
High single-thread performance is still an important CPU design goal
Modern programs do not lack instruction level parallelism
[Figure: IPC (y-axis, ticks at 1, 10, 50) for bzip2, crafty, eon, gap, gcc, gzip, mcf, pbmk, twolf, vortex, vpr, and Gmean; series: ideal:128, ideal:512, ideal:2K, real:128, real:512, real:2K]
Real challenge: exploit implicit parallelism without undue cost
One effective approach: Decoupled look-ahead architecture
Decoupled look-ahead architecture targets
Performance hurdles: branch mispredictions, cache misses, etc.
Exploration of parallelization opportunities, dependence information
Microarchitectural complexity, energy inefficiency through decoupling
The look-ahead thread can often become a new bottleneck
Lack of correctness constraint allows many optimizations
Weak dependence: Instructions that contribute marginally to the outcome can be removed without affecting the quality of look-ahead
Do-It-Yourself branches: Side-effect-free, “easy-to-predict” branches can be skipped in the look-ahead thread
Outline
Motivation
Baseline decoupled look-ahead
Look-ahead: a new bottleneck
Look-ahead thread acceleration
Weak dependences/instructions
Do-It-Yourself branches & skeleton tuning
Experimental analysis
Additional insights and summary
Baseline Decoupled Look-ahead Architecture
Skeleton generated just for look-ahead purposes
The skeleton runs on a separate core:
Speculative state is completely contained within the look-ahead context
Branch outcomes are sent to the main core through a FIFO queue; look-ahead also helps prefetching
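[Figure: baseline decoupled look-ahead architecture. Labeled elements: look-ahead core (executes the look-ahead skeleton), main core (executes the program binary), branch queue, L0$, L1$, L2$, main memory; labeled flows: branch prediction, prefetching hints, register state synchronization; numbered markers 1 and 2]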
A. Garg and M. Huang, “A Performance-Correctness Explicitly Decoupled Architecture”, MICRO’08
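The branch queue in this architecture can be pictured as a bounded FIFO with the look-ahead core as producer and the main core as consumer. The following is a minimal software sketch of that idea only. The class name `BranchQueue`, the `(pc, taken)` entry format, the capacity, and the stall-on-full behavior are all assumptions made for illustration; the slides state only that branch outcomes travel through a FIFO queue.

```python
from collections import deque


class BranchQueue:
    """Bounded FIFO carrying branch outcomes from the look-ahead core to the main core.

    Hypothetical model: entry layout and capacity are assumptions, not from the slides.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self._fifo = deque()

    def push(self, pc, taken):
        """Look-ahead side: enqueue an outcome; return False (stall) if the queue is full."""
        if len(self._fifo) >= self.capacity:
            return False
        self._fifo.append((pc, taken))
        return True

    def pop(self):
        """Main-core side: dequeue the oldest outcome, or None if the queue is empty."""
        return self._fifo.popleft() if self._fifo else None


def run_demo():
    """Push three outcomes into a two-entry queue, then drain it."""
    bq = BranchQueue(capacity=2)
    outcomes = [(0x10, True), (0x20, False), (0x30, True)]
    accepted = [bq.push(pc, taken) for pc, taken in outcomes]
    consumed = [bq.pop(), bq.pop(), bq.pop()]
    return accepted, consumed
```

In this sketch a full queue stalls the producer and an empty queue leaves the consumer without a hint; how the real design handles either case is not described in this section.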
Program binary:
addq v0, v0, v0
subq v0, t0, a2
cmovge a2, a2, v0
addq v0, v0, v0
subq v0, t0, a2
cmovge a2, a2, v0
subq a1, 0x2, a1
addq v0, v0, v0
bgt a1, 0x12001f9a0
subq v0, t0, a2
Skeleton:
addq v0, v0, v0
nop
...
...
bgt a1, 0x12001f9a0
subq v0, t0, a2
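One way to picture the relationship between a program binary and its skeleton is as a keep/drop mask over the instruction stream, with dropped instructions replaced by `nop`. The sketch below is illustrative only: `build_skeleton` and the mask values are hypothetical, and the slides do not say which instructions the actual skeleton generator keeps.

```python
def build_skeleton(instructions, keep_mask):
    """Return a skeleton: kept instructions stay in place, the rest become nops."""
    if len(instructions) != len(keep_mask):
        raise ValueError("mask must have one entry per instruction")
    return [inst if keep else "nop" for inst, keep in zip(instructions, keep_mask)]


# A few instructions in the style of the listing; the mask is a made-up example.
program = [
    "addq v0, v0, v0",
    "subq v0, t0, a2",
    "cmovge a2, a2, v0",
    "subq a1, 0x2, a1",
    "bgt a1, 0x12001f9a0",
]
keep_mask = [1, 0, 0, 1, 1]  # hypothetical: 1 = keep, 0 = replace with nop
skeleton = build_skeleton(program, keep_mask)
```

Keeping the instruction count unchanged means addresses and branch targets stay aligned between the two versions, which is one plausible reason to use `nop` rather than deleting instructions outright.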
Look-ahead: A New Bottleneck
Comparing four systems to discover new bottlenecks
Single-thread, decoupled look-ahead, ideal, and look-ahead limit
Application categories:
Bottleneck removed or speed of look-ahead is not an issue (left half of the figure)
Look-ahead thread is the new bottleneck (right half of the figure)
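[Figure, built up over four slides: IPC (y-axis, 0 to 4) for aplu, msa, wup, mgri, six, swim, facr, gal, gcc, gap, eon, fma3, gzip, craf, vrtx, apsi, vpr, bzp2, equk, amp, luc, art, perl, mcf, two; series: single-thread, decoupled look-ahead, ideal (cache, branch), look-ahead limit]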
Weak Dependences/Instructions
Not all instructions are equally important and critical
Examples of weak instructions:
Inconsequential adjustments
Load and store instructions that are (mostly) silent
Dynamic NOP instructions
Plenty of weak instructions are present in programs (hundreds)
Weak instructions can be experimentally defined and their impact quantified in isolation
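Defining weakness experimentally can be read as a one-at-a-time removal test: drop a single candidate from the skeleton, measure, and compare against the unmodified skeleton. The sketch below assumes a black-box `measure_ipc` callback that takes the set of removed instructions. That callback, the tolerance parameter, and the toy performance model are hypothetical stand-ins for a real simulator run.

```python
def find_weak_in_isolation(candidates, measure_ipc, tolerance=0.0):
    """Return the candidates whose removal, on its own, does not hurt performance."""
    baseline = measure_ipc(frozenset())
    weak = []
    for inst in candidates:
        ipc = measure_ipc(frozenset([inst]))
        # Weak in isolation: performance stays at or above the baseline (within tolerance).
        if ipc >= baseline * (1.0 - tolerance):
            weak.append(inst)
    return weak


def toy_measure_ipc(removed):
    """Hypothetical performance model: a fixed IPC change per removed instruction."""
    effect = {"i1": 0.02, "i2": -0.10, "i3": 0.0}
    return 1.0 + sum(effect[i] for i in removed)


weak_set = find_weak_in_isolation(["i1", "i2", "i3"], toy_measure_ipc)
```

The toy model is additive by construction, so it illustrates only the isolation test itself and says nothing about how removals interact.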
Challenge #1: Weak instructions do not look different
After-the-fact analysis based on static attributes of instructions reveals:
Static attributes of weak and regular instructions are remarkably similar
Correlation coefficient of the two distributions is very high (0.96)
Weakness has very poor correlation with static attributes
Hard to identify weak instructions through static heuristics
[Figure: number of inputs (y-axis, 0 to 2) by instruction type (x-axis), in two panels. Weak instructions: addq, clr, cmovne, cmptlt, divt, fneg, ldah, ldt, muls, s4addq, sll, stq, subq, zapnot. Strong instructions: addq, clr, cmovne, cmptlt, divt, fneg, ldah, ldt, mult, s4addq, sll, stq, subq, zap]
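The correlation coefficient quoted for the two distributions is presumably a Pearson correlation computed over matching attribute bins, though the slides do not say so explicitly. As a sketch under that assumption, the function below computes Pearson's r for two equally binned histograms. The counts are invented for illustration and are not the thesis data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical per-bin counts for weak and strong instructions (same bin order).
weak_counts = [30, 5, 12, 8]
strong_counts = [28, 6, 10, 9]
r = pearson(weak_counts, strong_counts)
```

A value of r close to 1 means the two histograms have nearly the same shape, which is why a static attribute with such a distribution cannot separate weak instructions from strong ones.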
Challenge #2: False positives are extremely costly
After-the-fact analysis and close inspection also reveal:
Some instructions are more likely to be weak than others
Even then, a single false positive can negate all the gains
Case in point: zapnot in gap
zapnot Ra Rb Rc
84% of the zapnot instructions are weak in isolation: 3.4% speedup
A single false-positive zapnot instruction: 6% slowdown
More than one false-positive instruction can cause a slowdown of up to 13%
Challenge #3: Neither absolute nor additive
Weakness is context dependent and non-linear, much like Jenga
All weak instructions combined together are not weak!
Example: weak instruction combining in perlbmk
About 300 instructions are weak when tested in isolation
All combined together can result in up to 40% slowdown
[Figure: perlbmk. Performance impact over baseline look-ahead (y-axis, -40% to 20%) versus cumulative weak instructions (x-axis, 0 to 300)]
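The non-additive behavior can be reproduced with a toy model in which each removal helps a little on its own but one particular pair must not be removed together. Everything in the sketch below (the instruction names, the 1% gain, the 20% penalty) is invented to illustrate the effect and is not derived from the perlbmk data.

```python
from itertools import combinations


def toy_speedup(removed):
    """Hypothetical model: +1% per removed instruction, -20% if both 'a' and 'b' are removed."""
    gain = 0.01 * len(removed)
    penalty = 0.20 if {"a", "b"} <= set(removed) else 0.0
    return gain - penalty


candidates = ["a", "b", "c"]

# Every instruction looks weak when it is the only one removed.
isolated = {inst: toy_speedup({inst}) for inst in candidates}

# Removing all of them at once is a net slowdown.
combined = toy_speedup(set(candidates))

# Exhaustive search over subsets finds the best combination (feasible only for tiny sets).
subsets = [set(c) for k in range(len(candidates) + 1) for c in combinations(candidates, k)]
best = max(subsets, key=toy_speedup)
```

With about 300 candidates the subset space has roughly 2^300 members, so exhaustive search is out of the question and some guided search is needed instead.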
Metaheuristic-Based Trial-and-Error Approach
Recap: Challenges in identifying weak instructions
Weak instructions look very similar to regular instructions
False positives are extremely costly and can negate all the gains
Weakness is context dependent: neither absolute nor additive
Our approach: Metaheuristic-based self-tuning
Experimentally identify/verify weakness
Search for a profitable combination via a metaheuristic
Metaheuristic: Completely agnostic to the meaning of the solution
Derive new solutions from current solutions through modifications
Examples: genetic algorithm, simulated annealing, etc.
R. Parihar, M. Huang, “Accelerating Decoupled Look-ahead via Weak Dependence Removal”, HPCA’14
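To make "derive new solutions from current solutions through modifications" concrete, here is a small simulated-annealing loop over bit vectors. It treats the fitness function as a black box, which is the sense in which a metaheuristic is agnostic to what a solution means. This is a generic textbook-style sketch, not the tuning procedure from the cited paper. The cooling schedule, step count, and toy fitness function are assumptions.

```python
import math
import random


def simulated_annealing(fitness, n_bits, steps=200, t0=1.0, seed=0):
    """Maximize a black-box fitness over bit vectors by flipping one bit per step."""
    rng = random.Random(seed)
    current = [0] * n_bits
    cur_fit = fitness(current)
    best, best_fit = current[:], cur_fit
    for step in range(steps):
        # Linear cooling; the small offset avoids division by zero at the end.
        temp = t0 * (1.0 - step / steps) + 1e-9
        neighbor = current[:]
        neighbor[rng.randrange(n_bits)] ^= 1  # modify the current solution
        fit = fitness(neighbor)
        # Always accept improvements; accept regressions with a temperature-dependent probability.
        if fit >= cur_fit or rng.random() < math.exp((fit - cur_fit) / temp):
            current, cur_fit = neighbor, fit
            if cur_fit > best_fit:
                best, best_fit = current[:], cur_fit
    return best, best_fit


def toy_fitness(bits):
    """Hypothetical objective: reward set bits, penalize setting bits 0 and 1 together."""
    return sum(bits) - (5 if bits[0] and bits[1] else 0)


solution, score = simulated_annealing(toy_fitness, n_bits=6)
```

Here each bit would stand for one candidate weak instruction being removed, and each fitness evaluation would stand for one performance measurement of the resulting skeleton.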
Genetic Algorithm based Framework
The problem naturally maps to genetic algorithm
Skeleton is represented by a bit vector
Natural mapping: weak inst → gene, collection → chromosome
Objective: find optimal combination (chromosome)
Genetic evolution: Procreation, mutation, fitness-based selection
[Figure: GA-based framework with numbered steps 1 to 8. Labeled elements: program binary, binary parser, single-instruction genes, multi-instruction genes, chromosome creation, single-gene chromosome, superposition chromosome, orthogonal chromosome, initial chromosome population, GA evolution, parents pool, roulette-wheel parent selection, reproduction, crossover and mutation, children pool, de-duplication, fitness test and elitism, look-ahead construction, look-ahead binary]
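The pieces named in the framework (bit-vector chromosomes, roulette-wheel parent selection, crossover and mutation, de-duplication, fitness test with elitism) can be assembled into a compact loop. The sketch below is a generic genetic algorithm written under several assumptions: the population size, mutation rate, elite count, and toy fitness function are invented, the initial population uses single-gene chromosomes only, and the superposition and orthogonal chromosomes and multi-instruction genes from the diagram are not modeled.

```python
import random


def roulette_pick(population, fitnesses, rng):
    """Pick one parent with probability proportional to its (shifted) fitness."""
    low = min(fitnesses)
    weights = [f - low + 1e-6 for f in fitnesses]  # shift so every weight is positive
    return rng.choices(population, weights=weights, k=1)[0]


def crossover(a, b, rng):
    """Single-point crossover of two equal-length chromosomes (length >= 2)."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(chrom, rate, rng):
    """Flip each gene independently with the given probability."""
    return tuple(bit ^ 1 if rng.random() < rate else bit for bit in chrom)


def evolve(fitness, n_genes, pop_size=8, generations=10, mutation_rate=0.05, elite=2, seed=0):
    """Evolve bit-vector chromosomes; bit i set means candidate i is removed."""
    rng = random.Random(seed)
    # Initial population: one chromosome per single gene.
    population = [tuple(1 if i == g else 0 for i in range(n_genes))
                  for g in range(min(pop_size, n_genes))]
    for _ in range(generations):
        fits = [fitness(c) for c in population]
        ranked = [c for _, c in sorted(zip(fits, population), reverse=True)]
        children = ranked[:elite]  # elitism: the best chromosomes survive unchanged
        attempts = 0
        while len(children) < pop_size and attempts < 100 * pop_size:
            attempts += 1
            mom = roulette_pick(population, fits, rng)
            dad = roulette_pick(population, fits, rng)
            child = mutate(crossover(mom, dad, rng), mutation_rate, rng)
            if child not in children:  # de-duplication
                children.append(child)
        population = children
    return max(population, key=fitness)


def toy_fitness(chrom):
    """Hypothetical objective: reward set genes, penalize setting genes 0 and 1 together."""
    return sum(chrom) - (5 if chrom[0] and chrom[1] else 0)


best_chromosome = evolve(toy_fitness, n_genes=6)
```

Because the elite chromosomes are carried over unchanged, the best fitness in the population never decreases from one generation to the next, which suits a setting where a single bad choice can cancel the gains.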
Speedup of Weak Dependence Removal
Applications in which the look-ahead thread is a bottleneck
Self-tuned, genetic algorithm based decoupled look-ahead
Speedup over baseline decoupled look-ahead: 1.11x (geomean)
Overall speedup over single-thread baseline: 1.48x
[Figure: speedup over single-thread (y-axis, 1 to 6) for craf, eon, gap, gzip, mcf, pbmk, two, vrtx, vpr, amp, art, eqk, fma3, luc, and Gmean; series: baseline look-ahead, GA based look-ahead]
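The two headline numbers can be related with a little arithmetic. If both are geometric means over the same set of applications (an assumption; the slides do not state it outright), then the baseline look-ahead's implied speedup over single-thread is their ratio, because a ratio of geometric means equals the geometric mean of the per-application ratios. The per-application values in the sketch are hypothetical and only demonstrate that identity.

```python
import math


def geomean(values):
    """Geometric mean of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Figures quoted on the slide.
ga_over_baseline = 1.11
ga_over_single_thread = 1.48

# Implied baseline look-ahead speedup over single-thread, roughly 1.33x.
baseline_over_single_thread = ga_over_single_thread / ga_over_baseline

# Hypothetical per-application speedups over single-thread, to check the identity.
ga = [1.2, 2.0, 1.5]
baseline = [1.1, 1.6, 1.5]
ratio_of_geomeans = geomean(ga) / geomean(baseline)
geomean_of_ratios = geomean([g / b for g, b in zip(ga, baseline)])
```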
![Page 33: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/33.jpg)
Progress of Genetic Evolution Process
Per generation progress compared to the final best solution
After 2 generations, more than half of the benefits are achieved
After 5 generations, significant performance benefits are achieved
GA evolution, helped by hybridization, shows good progress
[Figure: progress relative to best GA solution (y-axis, 0% to 100%) vs. number of generations (x-axis, 1 to 7) for eon, mcf, pbmk, twolf, vpr, art, eqk, fma, amp, lucas.]
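A minimal sketch of this kind of evolution, assuming the skeleton is encoded as a bit-vector mask and that a fitness score is available for each mask. The `toy_fitness` function and `TARGET` mask are hypothetical stand-ins (in the slides, fitness comes from running the program), and the hybridization mentioned above is not modeled.

```python
import random

def evolve(fitness, n_bits, pop_size=20, generations=7, mutation_rate=0.02, seed=0):
    """Evolve bit-vector masks; return (best_mask, best_fitness_history)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: pop_size // 2]       # truncation selection
        children = [ranked[0][:]]               # elitism: carry the best over unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)      # one-point crossover
            child = a[:cut] + b[cut:]
            # Bit-flip mutation: XOR each bit with a rare random flip.
            child = [bit ^ (rng.random() < mutation_rate) for bit in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, history + [fitness(best)]

# Hypothetical fitness: similarity to an arbitrary "ideal" mask.
TARGET = [1, 0] * 16

def toy_fitness(mask):
    return sum(m == t for m, t in zip(mask, TARGET))

best, history = evolve(toy_fitness, n_bits=len(TARGET))
# Per-generation progress relative to the final best solution, as in the figure.
progress = [h / history[-1] for h in history]
```

Because the best mask is carried over unchanged each generation, `history` never decreases, which is what makes the progress curve well defined.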
![Page 35: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/35.jpg)
Evolution can be Online or Offline
Offline evolution: one time tuning (e.g. install time)
Fitness tests need not take long (2-20s on target machine)
Different inputs and configurations do not invalidate the result
Online evolution: takes longer but has little overhead
Additional work is minimal: bookkeeping, bit vector manipulation
Main source of slowdown: testing bad configurations
[Figure: accumulated IPC (y-axis, 1 to 3) vs. number of instructions in millions (x-axis, 1 to 4716). Series: single-thread baseline, baseline decoupled look-ahead, online self-tuned look-ahead.]
![Page 37: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/37.jpg)
A Locomotive and Cargo Analogy
Skeleton payload: look-ahead tasks, associated housekeeping
Locomotive: look-ahead thread; Cargo: skeleton payload
Dilemma: heavy cargo (slower locomotive) vs. lighter cargo (underutilization of locomotive’s capability)
[Figure labels: L1 prefetches, L2 prefetches]
![Page 39: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/39.jpg)
Idea of Do-It-Yourself (DIY) Branches
Extends the idea of weak instructions to easy-to-predict branches
To accelerate the look-ahead thread via DIY branches
Either skip completely or partially execute in the skeleton
[Figure: DIY transformations on code with branch BR and blocks A, B, C: (1) DIY [C], (2) DIY [BR -> A -> C], (3) DIY [BR -> B -> C], (4) DIY [A -> C], (5) DIY [A -> B -> BR -> C]. Edge labels: ZAP, LEFT, FALL, RIGHT.]
(A) Forward conditional branch (If-Then, If-Then-Else) transformations
(B) Backward conditional branch (Loop) transformations
Tune skeleton via selectively including/excluding prefetches
R. Parihar, M. Huang, “Load Balancing in Decoupled Look-ahead via DIY Branches and Payload Tuning”, (in draft)
![Page 42: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/42.jpg)
Hardware Support for DIY Branches
Hardware support needed to synchronize after DIY regions
Additional BOQ bit to indicate the beginning of DIY region
Main thread has its own branch predictor for DIY region
DIY call depth register to keep track of nesting/recursion
[Figure: block diagram. Labels: look-ahead thread, main thread, branch queue, branch predictor, DIY mode, direction + DIY info, DIY call depth register + DIY mode bit. Example skeleton annotations in the form [mask, diy, duty]: i1: add [1, 0]; i2: call [1, 1, 25]; i3: ldq [1, 2]; i4: stq [1, 0].]
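One way a call depth register could track nesting is sketched below. This is an illustrative guess, not the design on the slide: it assumes a DIY region is opened by a call flagged as DIY and closed by the matching return. The event format and the `trace_diy_mode` name are hypothetical.

```python
def trace_diy_mode(events):
    """Return, for each event, whether DIY mode is on after processing it.

    `events` is a list of (kind, starts_diy) pairs, where kind is "call" or
    "ret" and starts_diy marks a call flagged as the beginning of a DIY region.
    """
    diy_mode = False      # models the DIY mode bit
    call_depth = 0        # models the DIY call depth register
    states = []
    for kind, starts_diy in events:
        if kind == "call":
            if diy_mode:
                call_depth += 1                   # nested or recursive call inside the region
            elif starts_diy:
                diy_mode, call_depth = True, 1    # flagged call opens a DIY region
        elif kind == "ret" and diy_mode:
            call_depth -= 1
            if call_depth == 0:
                diy_mode = False                  # matching return closes the region
        states.append(diy_mode)
    return states

events = [
    ("call", False),  # ordinary call, outside any DIY region
    ("call", True),   # flagged call: DIY region begins
    ("call", False),  # nested call inside the region
    ("ret", False),   # returns from the nested call
    ("ret", False),   # returns from the flagged call: region ends
    ("ret", False),   # returns from the ordinary call
]
states = trace_diy_mode(events)
```

In this model the depth counter is what lets the region end at the right return even when calls nest or recurse inside it.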
![Page 44: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/44.jpg)
Experimental Setup
Program/binary analysis tool: ALTO
Simulator: detailed out-of-order, cycle-level, in-house
SMT, look-ahead and speculative parallelization support
True execution-driven simulation (faithful value modeling)
Genetic algorithm framework
Modeled as offline and online extension to the simulator
Microarchitectural configurations:

| Parameter | Configuration |
| --- | --- |
| Baseline core | Similar to POWER5 |
| Fetch / Decode / Issue / Commit | 8 / 4 / 6 / 6 |
| ROB | 128 |
| Functional units | INT 2 + 1 mul + 1 div, FP 2 + 1 mul + 1 div |
| Fetch Q / Issue Q / Reg. (int, fp) | (32, 32) / (32, 32) / (80, 80) |
| LSQ (LQ, SQ) | 64 (32, 32), 2 search ports |
| Branch predictor | Gshare, 8K entries, 13-bit history |
| Br. mispred. penalty | at least 7 cycles |
| L1 data cache (private) | 32KB, 4-way, 64B line, 2 cycles, 2 ports |
| L1 inst cache (private) | 64KB, 2-way, 128B, 2 cycles |
| L2 cache (shared) | 1MB, 8-way, 128B, 15 cycles |
| Memory access latency | 200 cycles |
| Look-ahead core | Baseline core with only LQ, no SQ |
| L0 cache | 32KB, 4-way, 64B line, 2 cycles |
| Round trip latency to L1 | 6 cycles |
| Communication: Branch Output Queue | 512 entries |
| Communication: Reg copy latency (recovery) | 64 cycles |

Table 1: Microarchitectural configurations.
![Page 45: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/45.jpg)
Individual Performance Gains
Speedup of DIY branches over baseline look-ahead: 1.08x
Speedup of Skeleton Payload Tuning: 1.12x
Combined speedup (DIY + Payload Tuning): 1.15x
[Figure: speedup over single-thread (y-axis, 1.0 to 2.2) for gcc, mcf, eon, pbmk, bzip2, twolf, wup, mgrid, art, eqk, face, ammp, lucas, fma3d, Gmean. Series: baseline decoupled look-ahead, DIY branch based decoupled look-ahead, skeleton payload tuned decoupled look-ahead, DIY + skeleton payload tuned look-ahead. Annotations: 15%; weak insts removal (16.2%).]
![Page 48: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/48.jpg)
Overall Performance Gain
Final decoupled look-ahead system
Skeleton payload tuning + Weak dependence + DIY branches
Performance speedup over:
Baseline look-ahead: 1.20x
Single-thread: 1.61x
[Figure: speedup over baseline DLA (y-axis, 1.0 to 2.0) for gcc, mcf, eon, pbm, bzp, two, wup, mgri, art, eqk, face, amp, luc, fm3, gmean. Series: weak dependence removed DLA, weak dep + DIY + payload tuned DLA.]
![Page 50: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/50.jpg)
Salient Features of Decoupled Look-ahead
Hard-to-predict, data-dependent branches
Conventional predictors do not capture data-dependent behaviors
Prefetching for cold misses
Conventional prefetchers take time to learn access/miss patterns
Other potential advantages:
Can pass dependence information for thread level speculation
Can assist in value prediction (close to 90% correctness)
Potential hurdles and showstoppers:
If no distillation possible: look-ahead can run at higher clock
JIT: fixed mask an issue, but mask can be evolved dynamically
![Page 54: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/54.jpg)
Future Explorations
Effective look-ahead to improve L1 prefetching performance
Speeding up critical threads and serial bottlenecks via a shared look-ahead agent in multi-threaded applications
Cost effective SMT implementation of decoupled look-ahead
Role of look-ahead to promote parallelization, value predictions and acceleration of interpreted programs
Backward strawman: integrate non-speculative look-ahead computations in the main thread directly
![Page 55: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/55.jpg)
Details in the Thesis and Papers
Decoupled Look-ahead Architecture:
Weak dependence removal in decoupled look-ahead [HPCA’14]
Load balancing in look-ahead via DIY branches [PACT-SRC’15]
Speculative parallelization in decoupled look-ahead [PACT’11]
DIY branches and payload tuning [in prep. for HPCA’17]
Shared Cache Management:
Hardware support for protective and collaborative caches [ISMM’16]
Protection and utilization in shared cache via rationing [PACT’14]
A coldness metric for cache optimization [MSPC’13]
![Page 56: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/56.jpg)
Motivation: Cache Rationing
Compute systems with shared resources are prevalent today
Multi-core clusters, cloud computing, data centers, server farms
Programs often compete for shared caches and other resources
Significant performance loss due to co-run interference: >25%
[Figure: IPC normalized to solo run with 512 KB L2 cache (y-axis, 0.7 to 1.3; annotation: 2.32) for co-run programs 1 to 26 under equal partitioning, no partitioning, rationing, and PIPP-equal. SPEC 2000 co-run with equake (2 cores, 1 MB L2 cache).]
![Page 58: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/58.jpg)
Idea of Cache Rationing
Achieve both resource protection and utilization simultaneously
Rationing policy:
Initial ration: Every program is assigned an initial portion of cache
Non-intrusive sharing: A program can exceed its allocated ration only if another program is not using its ration
Entitlement: If a program is using its ration, it cannot be taken away by peer programs
Conservative sharing: provides a safety net for less aggressive programs in the presence of non-cooperative programs
R. Parihar, J. Brock, C. Ding, M. Huang, “Hardware support for protective and collaborative cache sharing”, ISMM’16
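The three rules of the rationing policy can be sketched as a victim-selection routine for one full cache set. This is a simplified illustration, not the mechanism from the cited paper. It assumes the rations of all programs sum to the number of ways, that "using" a block means its access bit is set, and it ignores recency (it evicts the first matching block). `Block` and `pick_victim` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Block:
    owner: int       # program that allocated the block
    accessed: bool   # access bit: set when the owner has used the block

def pick_victim(blocks, requester, ration):
    """Return the index of the block to evict when `requester` misses in a full set."""
    usage = {}
    for b in blocks:
        usage[b.owner] = usage.get(b.owner, 0) + 1

    # Entitlement: a requester below its ration reclaims space from a peer
    # that currently holds more than its own ration.
    if usage.get(requester, 0) < ration[requester]:
        for i, b in enumerate(blocks):
            if b.owner != requester and usage[b.owner] > ration[b.owner]:
                return i

    # Non-intrusive sharing: beyond its ration, a requester may only take
    # blocks that a peer is not using (access bit clear).
    for i, b in enumerate(blocks):
        if b.owner != requester and not b.accessed:
            return i

    # Otherwise the requester replaces one of its own blocks.
    for i, b in enumerate(blocks):
        if b.owner == requester:
            return i
    raise ValueError("no evictable block under the stated assumptions")

ration = {0: 2, 1: 2}   # 4-way set shared equally by two programs

# Program 1 is below its ration, program 0 is above: reclaim from program 0.
reclaim = pick_victim([Block(0, True), Block(0, True), Block(0, True), Block(1, True)], 1, ration)
# Both at ration and all blocks in use: program 0 replaces its own block.
own = pick_victim([Block(0, True), Block(0, True), Block(1, True), Block(1, True)], 0, ration)
# Program 1 leaves a block unused: program 0 may borrow it.
borrow = pick_victim([Block(0, True), Block(0, True), Block(1, True), Block(1, False)], 0, ration)
```

The three calls exercise entitlement, staying within the ration, and non-intrusive borrowing in turn.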
![Page 60: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/60.jpg)
Hardware Support
Ration Accounting: ration counter-register pairs
To track the current usage of a program, maintained per core per set
Usage Tracking: access-bit and block owner
To detect unused ration and ensure entitlement, 1 per cache blk
[Figure: cache with s sets and w ways (blk 1 ... blk w-1). Labels: tag array, status bit, block owner, data array, access bit, ration tracker, p counter-register pairs, ration counter, owner allocation.]
Additional storage overhead: <1% of total cache storage
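A back-of-the-envelope estimate of this overhead, as a sketch under stated assumptions: a 1 MB cache shared by 2 programs (as in the co-run figures), with 8 ways and 128B lines borrowed from the L2 in Table 1. The bit widths are guesses: one access bit plus an owner field per block, and one counter-register pair per program per set, each wide enough to count 0 to w. The ratio is taken against data storage only, which slightly overstates it.

```python
import math

def rationing_overhead(cache_bytes, ways, line_bytes, programs):
    """Estimated extra storage as a fraction of the cache's data storage."""
    sets = cache_bytes // (ways * line_bytes)
    blocks = sets * ways
    owner_bits = max(1, math.ceil(math.log2(programs)))   # block owner field
    per_block_bits = 1 + owner_bits                       # access bit + block owner
    count_bits = math.ceil(math.log2(ways + 1))           # counts 0..ways
    per_set_bits = programs * 2 * count_bits              # counter + register per program
    extra_bits = blocks * per_block_bits + sets * per_set_bits
    return extra_bits / (cache_bytes * 8)

overhead = rationing_overhead(cache_bytes=1 << 20, ways=8, line_bytes=128, programs=2)
print(f"{overhead:.2%}")
```

With these assumed widths the estimate stays below the 1% bound quoted above.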
![Page 62: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/62.jpg)
Resource Protection Co-Run
Co-run with a high-pressure peer (mcf)
Rationing: achieves good resource protection, similar to partitioning
No partitioning: almost every co-run is unhealthy with high damage
[Figure: IPC normalized to solo run with 512 KB L2 cache (y-axis, 0.7 to 1.3; annotation: 1.52) for co-run programs 1 to 26 under equal partitioning, no partitioning, rationing, and PIPP-equal. SPEC 2000 co-run with mcf (2 cores, 1 MB L2 cache).]
INT: 1-gzip, 2-vpr, 3-gcc, 4-mcf, 5-crafty, 6-parser, 7-eon, 8-perlbmk, 9-gap, 10-vortex, 11-bzip2, 12-twolf
FP: 13-wupwise, 14-swim, 15-mgrid, 16-applu, 17-mesa, 18-galgel, 19-art, 20-equake, 21-facerec, 22-ammp, 23-lucas, 24-fma3d, 25-sixtrack, 26-apsi
Raj Parihar Advanced Computer Architecture Lab University of Rochester 36
![Page 64: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/64.jpg)
Capacity Utilization Co-Run
Co-run with a low-pressure peer (eon): cache demand <128 KB
Rationing: utilizes cache well and speeds up 14 applications without slowing down any co-running program
No partitioning: also speeds up 13 applications, at the cost of slowing down 11 co-running programs
[Figure: IPC normalized to solo run with 512 KB L2 cache (y-axis 0.8 to 1.4; one off-scale bar labeled 1.744) for benchmarks 1 to 26 under four schemes: equal partitioning, no partitioning, rationing, PIPP-equal. SPEC 2000 co-run with eon (2 cores, 1 MB L2 cache).]
![Page 66: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/66.jpg)
Summary
Decoupled look-ahead can uncover significant implicit parallelism
However, the look-ahead thread often becomes a new bottleneck
Fortunately, look-ahead, due to its lack of correctness constraints, lends itself to various optimizations
Weak instructions can be removed w/o affecting look-ahead quality
Side-effect-free, “easy-to-predict” DIY branches can be skipped
Skeleton payload can be tuned w/o incurring extra recoveries
Metaheuristic-based self-tuning approach is simple and robust
Improves single-thread performance by 1.61x
Much better compared to conventional turbo boost and frequency scaling
Multi-threaded workloads can benefit from speeding up serial sections and bottleneck threads in critical regions
![Page 70: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/70.jpg)
Acknowledgments
Funding agencies: NSF, NSFC
Prof. Michael C. Huang, Alok Garg
Prof. Chen Ding and his research group at URCS
Past & current members of Advanced Computer Architecture Lab at University of Rochester
![Page 71: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/71.jpg)
Backup Slides
![Page 72: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/72.jpg)
Summary of Distillation Techniques
Convert biased branches to unconditional “taken” or “not taken”
Eliminate stores from long-distance store-load pairs
Stores would have been committed by main thread
Selective value (zero) substitution for L2 misses
If the look-ahead distance drops below a threshold
Speculative parallelization of skeleton w/o any rollback support
Weak dependence/instruction removal from skeleton
Consecutive loop iterations accessing the same cache line
A wide variety of library calls and reduction operations
Selective payload: eliminate payload if it slows look-ahead
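The first distillation technique, converting biased branches to unconditional ones, can be sketched as a pass over a branch profile. This is only an illustration: the profile format, the 0.99 bias cutoff, and the function name are assumptions, since the slides do not say how bias is measured. The conversion is tolerable because the look-ahead thread has no correctness constraint.

```python
def classify_biased_branches(profile, threshold=0.99):
    """Map branch PC -> 'taken' / 'not taken' for strongly biased branches.

    profile: dict of branch PC -> (taken_count, total_count).
    threshold: assumed bias cutoff; the slides do not give a value.
    """
    decisions = {}
    for pc, (taken, total) in profile.items():
        if total == 0:
            continue  # never executed in the profile; leave untouched
        taken_ratio = taken / total
        if taken_ratio >= threshold:
            decisions[pc] = "taken"
        elif taken_ratio <= 1 - threshold:
            decisions[pc] = "not taken"
    return decisions


# Hypothetical profile: PCs and counts are made up for illustration.
profile = {
    0x400A10: (9990, 10000),  # almost always taken
    0x400B24: (3, 10000),     # almost never taken
    0x400C38: (5200, 10000),  # unbiased, stays conditional
}
decisions = classify_biased_branches(profile)
```

Branches that fall between the two cutoffs stay conditional in the skeleton.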
![Page 78: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/78.jpg)
Practical Advantages of Decoupled Look-ahead
Micro helper thread based approach:
Targets top cache misses and branch mispredictions (low coverage)
Support for quick spawning and register communication (not trivial)
Decoupled look-ahead approach:
Easy to disable, low management overhead on main thread
Natural throttling to prevent run-away prefetching, cache pollution
[Figure: speedup over single-thread (y-axis 1.0 to 2.4; one off-scale bar labeled 4.0) per benchmark (gzp, vpr, gcc, mcf, cra, eon, pbm, gap, vrtx, bzp, twlf, wup, swm, mgr, apl, msa, gal, art, eqk, fac, amp, luc, fma, six, apsi) and Gmean. Series: speculative slice limit (ideal), decoupled look-ahead.]
![Page 79: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/79.jpg)
Practical Advantages of Decoupled Look-ahead
Look-ahead thread is a self-reliant agent, completely independent of main thread
No need for quick spawning and register communication support
Low management overhead on main thread
Easier for run-time control to disable
Natural throttling mechanism to prevent run-away prefetching, cache pollution
Look-ahead thread size comparable to aggregation of short helper threads

Cache misses
          90%            95%
          DI      SI     DI      SI
bzip2     1.86    17     3.15    27
crafty    0.73    23     1.04    38
eon       2.28    50     3.34    159
gap       1.35    15     1.44    23
gcc       8.49    153    8.84    320
gzip      0.1     6      0.1     6
mcf       13.1    13     14.7    16
parser    1.31    41     1.59    57
pbmk      1.87    35     2.11    52
twolf     2.69    23     3.28    28
vortex    1.96    42     2       67
vpr       7.47    16     11.6    22
Avg       3.60%   36     4.44%   68

Branch mispredictions
          90%            95%
          DI      SI     DI      SI
bzip2     3.9     52     4.49    64
crafty    5.33    235    6.14    309
eon       2.02    19     2.31    23
gap       2.02    77     2.64    130
gcc       8.08    1103   8.41    1700
gzip      8.41    40     8.66    52
mcf       9.99    14     10.2    18
parser    6.81    130    7.3     183
pbmk      2.88    92     3.21    127
twolf     5.75    41     6.48    56
vortex    1.24    114    1.97    167
vpr       4.8     6      4.88    7
Avg       5.10%   160    5.56%   236
![Page 82: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/82.jpg)
Correlation with RTL/FPGA Accurate Simulator
Reported performance improvement results are very pessimistic
Optimistic branch misprediction latency: 7 vs 15 cycles
Fixed memory latency, no queuing delays in L1/L2 interfaces
RTL accurate simulator: shows 2x more performance potential
[Figure: speedup over single-thread (y-axis 1.0 to 2.2) for Perfect BP, Perfect L2, Perfect L2+BP, Perfect L1, Perfect L1+BP, and DLA. Series: SimpleScalar, IMG-psim. Annotation: 1.56x (projected).]
![Page 84: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/84.jpg)
Simplified Look-ahead Core
Baseline skeleton: 71%
After distillation: 57%
2-wide look-ahead core (front end is still 8-wide):
2x power savings for RAT and other traditional hotspots
Reduces overall power overhead of look-ahead system by 10%
[Figure: % power (normalized to 4-wide DLA, y-axis 0 to 100) per look-ahead core component: Rename, ROB, Int IQ, Fp IQ, Decode, RAT-dec, RAT-wl, RAT-bl, DCL-cmp, LSQ, Total. Series: 3-wide look-ahead core, 2-wide look-ahead core.]
![Page 86: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/86.jpg)
Baseline vs. Tuned Skeleton
Distilled skeleton enables simplification of look-ahead core
Better power and energy efficiency w/o compromising speed
Energy efficiency: 17% better compared to single-thread
Power overhead: 1.38x over single-thread, used to be 1.53x for baseline decoupled look-ahead
[Figure: speedup over single-thread (y-axis 1 to 1.5) for 4-wide, 3-wide, and 2-wide look-ahead cores, shown separately for INT and FP. Series: baseline decoupled look-ahead, DIY+skeleton payload look-ahead.]
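One way to relate the energy and power figures on this slide is the usual performance-per-watt reading of energy efficiency. The sketch below combines the 1.38x power overhead with the 1.61x single-thread speedup quoted on the Summary slide. Whether the 17% figure was derived from exactly these two numbers is an assumption; the arithmetic only shows that they are mutually consistent.

```python
speedup = 1.61          # single-thread speedup quoted on the Summary slide
power_overhead = 1.38   # tuned look-ahead system power vs. single-thread

# Assumption: energy efficiency means work per unit energy, so
# relative efficiency = speedup / relative power.
efficiency_gain = speedup / power_overhead - 1
print(f"energy efficiency gain: {efficiency_gain:.0%}")
```

Under that reading the gain rounds to 17%.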
![Page 88: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/88.jpg)
Hybridization: Heuristically Designed Initial Solutions
Genetic evolution could be a slow and lengthy process
Heuristic based solutions are helpful to jump start the evolution
Heuristically designed solutions in our system:
Superposition chromosome; Orthogonal subroutine chromosome
[Figure: initial chromosomes built from single-instruction and multi-instruction genes, with X marks on selected genes. Panels: (a) single-gene chromosomes, (b) superposition chromosomes, (c) orthogonal chromosomes across subroutines A, B, C.]
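A minimal sketch of how such initial chromosomes might be seeded is shown below. It is one reading of the figure, not the author's implementation: a chromosome is modeled as the set of gene indices marked for removal, a superposition chromosome as the union of single-gene chromosomes that each beat the baseline, and an orthogonal chromosome as the union of the best chromosome per subroutine. The fitness function and all gain values are hypothetical.

```python
def single_gene_chromosomes(num_genes):
    """One chromosome per gene; a chromosome is a frozenset of gene indices."""
    return [frozenset([g]) for g in range(num_genes)]


def superposition(chromosomes, fitness, baseline):
    """Union of all chromosomes that individually beat the baseline."""
    winners = [c for c in chromosomes if fitness(c) > baseline]
    return frozenset().union(*winners)


def orthogonal(best_per_subroutine):
    """Combine the best chromosome found for each subroutine."""
    return frozenset().union(*best_per_subroutine.values())


# Hypothetical per-gene speed gains; a real system would measure these.
gains = [0.02, -0.01, 0.03, 0.0]


def fitness(chromosome):
    return 1.0 + sum(gains[g] for g in chromosome)


singles = single_gene_chromosomes(len(gains))
seed_a = superposition(singles, fitness, baseline=1.0)
seed_b = orthogonal({"A": frozenset({0}), "B": frozenset({2, 3})})
```

Seeds like these would join the initial population so that evolution starts from configurations already expected to help.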
![Page 90: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/90.jpg)
Online Genetic Evolution: equake
Primary overhead comes from testing bad skeleton configs
Break-even point: 1.8 billion insts (1-2 sec of native execution)
By 4.6 billion insts: overall cumulative speed is already 10% faster
[Figure: two panels plotted against number of instructions (epochs 1 to 4659; 1 epoch = 1 million instructions). Top panel: accumulated IPC (y-axis 1 to 3). Bottom panel: distributed IPC (y-axis 0.5 to 3.5). Series in both: single-thread baseline, baseline decoupled look-ahead, online self-tuned look-ahead.]
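The break-even point can be illustrated with a small sketch that accumulates cycles epoch by epoch. It assumes that accumulated IPC is total instructions over total cycles, that every epoch retires the same number of instructions, and that break-even is the first epoch at which the self-tuned run has spent no more cycles than the reference run. The IPC traces are hypothetical, not equake data, and the slide does not state which baseline the break-even is measured against.

```python
def break_even_epoch(tuned_ipc, baseline_ipc):
    """Return the first 1-based epoch at which the tuned run has used no
    more cycles than the baseline for the same instruction count, or None."""
    tuned_cycles = 0.0
    base_cycles = 0.0
    for epoch, (t, b) in enumerate(zip(tuned_ipc, baseline_ipc), start=1):
        # Each epoch retires the same number of instructions, so the
        # cycle cost of an epoch is proportional to 1 / IPC.
        tuned_cycles += 1.0 / t
        base_cycles += 1.0 / b
        if tuned_cycles <= base_cycles:
            return epoch
    return None


# Hypothetical traces: three slow exploration epochs, then a tuned skeleton.
tuned = [0.8, 0.8, 0.8] + [1.5] * 7
baseline = [1.0] * 10
epoch = break_even_epoch(tuned, baseline)
```

Early epochs spent testing bad skeleton configurations push the break-even point later, which matches the overhead noted on the slide.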
![Page 91: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/91.jpg)
Comparison with Other Proposals
Speculative slices [Zilles and Sohi: ISCA’00, ISCA’01]
Speculative slice achieves only 57% of their ideal speedup of 13%
Dual core execution or DCE [Zhou: PACT’05]
DCE achieves about 16% speedup over single-thread
For integer codes the speedup is substantially lower (<10%)
[Figure: speedup over single-thread (y-axis 1.0 to 2.4; one off-scale bar labeled 5.94) per benchmark (gzp, vpr, gcc, mcf, cra, eon, pbm, gap, vrtx, bzp, twlf, wup, swm, mgr, apl, msa, gal, art, eqk, fac, amp, luc, fma, six, apsi) and Gmean. Series: speculative slice limit, dual-core execution (DCE_64), self tuned decoupled look-ahead.]
![Page 94: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/94.jpg)
Potential DIY Modules
Loop iterations accessing same cache line, reduction operations
Library function calls: printf, OtsMove, OtsFill etc.
A case in point: mark_modified_reg() from 176.gcc
Dynamic contribution: 3%; performance speedup: 10%
static void mark_modified_reg (dest, x)
     rtx dest; rtx x;
{
  int regno, i;
  if (GET_CODE (dest) == SUBREG) dest = SUBREG_REG (dest);
  if (GET_CODE (dest) == MEM) modified_mem = 1;
  if (GET_CODE (dest) != REG) return;
  regno = REGNO (dest);
  if (regno >= FIRST_PSEUDO_REGISTER) modified_regs[regno] = 1;
  else
    for (i = 0; i < HARD_REGNO_NREGS (regno, GET_MODE (dest)); i++)
      modified_regs[regno + i] = 1;
}
![Page 96: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/96.jpg)
Skeleton Payload Distribution
Baseline skeleton payload: biased branches turned unconditional + L2 prefetches + L1 prefetches + software prefetches
Optimal only 30% of the time
For the remaining 70% of the time, other payloads are optimal
Performance potential of customized payloads: 1.21x
[Figure: # of best epochs (%) (axis 0 to 80) per skeleton payload (epoch = 10k insts). Payloads: DLA, bB, bB+L2, bB+Sf, bB+L1, bB+L2+Sf, bB+L1+L2, bB+L1+Sf, B, B+L2, B+Sf, B+L1, B+L2+Sf, B+L1+L2, B+L1+Sf, B+L1+L2+Sf, All-inst, ST. Benchmarks: gcc, mcf, eon, pbmk, bzip2, twolf, wup, mgrd, art, eqk, face, amp, luc, fma.]
![Page 98: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/98.jpg)
Skeleton Payload Tuning Framework
Collects performance of various payloads in regular epochs
Associates each static code region with its best payload
[Figure: tuning flow in five stages]
(A) Initial payload performance: one table per skeleton payload (B+L1+L2, bB+L1, bB+L2), each listing InstCount (1k, 2k, …, Nk), StartPC (0xAA, 0xBB, ...) and Cycles (e.g. 600, 700, ..., 400) per epoch
(B) Best payload per epoch: InstCount (1k, 2k, …, Nk), StartPC (0xAA, 0xBB, ..., 0xAA) and Best Skt (B+L1+L2, bB+L2, ..., bB+L1)
(C) Per PC payload tuples: for each PC (0xAA, 0xBB, …, 0xZZ), a list of <Payload#: Cnt> tuples (e.g. #1:50, #2:30, ..., #N:100)
(D) Best payload per PC: 0xAA: #N:100, 0xBB: #5:50, …, 0xZZ: #2:30
(E) Final skeleton: 0xAA: bB+L2, 0xBB: B+L1+L2, 0xZZ: bB+L1
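The selection steps of the tuning framework can be sketched in C roughly as follows. This is a minimal illustration, not the thesis implementation: it assumes the best payload for an epoch is the one with the fewest cycles, and that the per-PC count is the number of epochs a payload wins at that start PC. The names `epoch_sample`, `best_payload_for_epoch`, `best_payload_for_pc` and the constant `NUM_PAYLOADS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_PAYLOADS 3 /* hypothetical: e.g. B+L1+L2, bB+L1, bB+L2 */

/* One profiled epoch: its starting PC and the cycles each payload took. */
struct epoch_sample {
    unsigned long start_pc;
    unsigned long cycles[NUM_PAYLOADS];
};

/* Best payload per epoch: index of the payload with the fewest cycles.
 * Ties go to the lowest index (an arbitrary choice for this sketch). */
int best_payload_for_epoch(const struct epoch_sample *e)
{
    int best = 0;
    assert(e != NULL);
    for (int p = 1; p < NUM_PAYLOADS; p++)
        if (e->cycles[p] < e->cycles[best])
            best = p;
    return best;
}

/* Best payload per PC: count how often each payload wins among epochs that
 * start at `pc`, then return the most frequent winner (-1 if `pc` never
 * appears in the profile). */
int best_payload_for_pc(const struct epoch_sample *epochs, size_t n,
                        unsigned long pc)
{
    unsigned wins[NUM_PAYLOADS] = {0};
    int seen = 0, best = 0;

    for (size_t i = 0; i < n; i++) {
        if (epochs[i].start_pc != pc)
            continue;
        wins[best_payload_for_epoch(&epochs[i])]++;
        seen = 1;
    }
    if (!seen)
        return -1;
    for (int p = 1; p < NUM_PAYLOADS; p++)
        if (wins[p] > wins[best])
            best = p;
    return best;
}
```

Calling `best_payload_for_pc` once per distinct start PC would yield a PC-to-payload map analogous to the final skeleton in the figure.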
![Page 99: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/99.jpg)
Performance Impact of Duty Cycle
One DIY call example from 179.art
[Figure: performance gain over DLA (%) (axis 0 to 8) versus duty cycle (5, 10, 20, 25, 50, 75, 80, 90, 100; not to scale) for WeightAdj() in 179.art.]
![Page 100: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/100.jpg)
Weak Dependence: Insights and Findings
The evolution process is remarkably robust
Different inputs and configurations do not invalidate results
Can use sampling to accelerate the fitness test without appreciable impact on the quality of the solution found
Energy reduction due to less activity and stalling
About 10% of dynamic instructions removed from the skeleton
11% energy saving over baseline decoupled look-ahead
Impact of weak instruction removal on look-ahead quality is very small
Similar prefetch and branch hint accuracy
![Page 101: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/101.jpg)
Comparison with Speculative Parallel Look-ahead
Self-tuned skeleton is used in the speculative parallel look-ahead
In some cases, self-tuned and speculative parallel look-ahead techniques are synergistic (ammp, art)
![Page 102: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/102.jpg)
Unique Opportunities for Speculative Parallelization
Skeleton code offers more parallelism
Certain dependencies removed during slicing for skeleton
Short-distance dependence chains become long-distance chains, suitable for TLP exploitation
Look-ahead is inherently error-tolerant
Can ignore dependence violations
Little to no support needed, unlike in conventional TLS
[Figure: example instruction sequence with numbered instructions and numeric labels (1, 2, 13, 18). Instructions shown:]
0x120011490 ldt $f0, 0(a0)
0x1200119a0 ldt $f12, 32(sp)
0x1200119ac lda t8, 168(sp)
0x1200114bc stt $f0, 0(a2)
0x12000da84 lda a5, 744(sp)
0x12000daec lda a5, 4(a5)
0x12000dac0 ldl t7, 0(a5)
0x120011984 ldq a0, 80(sp)
0x1200119f8 bis 0, t8, t11
0x120011b04 lda a0, 8(a0)
A. Garg, R. Parihar, M. Huang, “Speculative Parallelization in Decoupled Look-ahead”, PACT’11
![Page 104: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/104.jpg)
Software Support
Dependence analysis
Profile guided, coarse-grain at basic block level
Spawn and Target points
Basic blocks with consistent dependence distance of more than threshold of DMIN
Spawned thread executes from target point
Loop level parallelism is also exploited
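The target-point criterion can be illustrated with a small C sketch. It is only one plausible reading of "consistent dependence distance of more than DMIN": here a basic block qualifies when every profiled distance exceeds the threshold. The struct `bb_profile`, the functions `is_target_candidate` and `select_targets`, and the strict all-samples rule are assumptions for illustration; the value 15 matches the DMIN = 15BB quoted on the parallelism-potential slide.

```c
#include <assert.h>
#include <stddef.h>

#define DMIN 15 /* threshold in basic blocks (15BB on a later slide) */

/* Profiled dependence distances (in basic blocks) observed for one block. */
struct bb_profile {
    const unsigned *dist;
    size_t n;
};

/* Assumed rule: a block qualifies only if it was profiled at least once
 * and every observed distance exceeds DMIN. */
int is_target_candidate(const struct bb_profile *bb)
{
    assert(bb != NULL);
    if (bb->n == 0)
        return 0;
    for (size_t i = 0; i < bb->n; i++)
        if (bb->dist[i] <= DMIN)
            return 0;
    return 1;
}

/* Writes the indices of qualifying blocks to `out` (capacity >= nbbs)
 * and returns how many were written. */
size_t select_targets(const struct bb_profile *bbs, size_t nbbs, size_t *out)
{
    size_t count = 0;
    for (size_t i = 0; i < nbbs; i++)
        if (is_target_candidate(&bbs[i]))
            out[count++] = i;
    return count;
}
```

A softer rule, such as requiring only a fixed fraction of samples to exceed DMIN, would be an equally reasonable sketch; the slide does not specify how consistency is measured.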
![Page 107: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/107.jpg)
Parallelism Potential in Look-ahead Binary
Available parallelism for a 2-core/context system; DMIN = 15 BB
Skeleton exhibits significantly more BB-level parallelism (17%)
Loop-based FP applications exhibit more BB-level parallelism
[Figure: approximate parallelism (axis 1.0 to 2.0) for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas and gmean. Series: original binary, skeleton.]
![Page 110: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/110.jpg)
Hardware and Runtime Support
Thread spawning and merging are very similar to regular thread spawning except
Spawned thread shares the same register and memory state
Spawning thread terminates at the target PC
Value communication
Register-based naturally through shared registers in SMT
Memory-based communication can be supported at different levels
Partial versioning in cache at line level
[Figure: timeline of look-ahead thread 0 and look-ahead thread 1 over instructions numbered 1 to 23. Labels: Spawn, Duplicate rename table and set up context, Merge, Cleanup duplicated state.]
![Page 112: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/112.jpg)
Speedup of Speculative Parallelization
Applications in which the look-ahead thread is a bottleneck
Speculative look-ahead over decoupled look-ahead: 1.13x
[Figure: speedup over single-thread (axis 1 to 5) for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas and gmean. Series: baseline look-ahead, speculatively parallel look-ahead.]
![Page 114: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/114.jpg)
Speculative Look-ahead vs Conventional TLS
Skeleton provides more opportunities for parallelization
Speculative look-ahead over decoupled LA baseline: 1.13x
Speculative main thread over single thread baseline: 1.07x
[Figure: speedup over respective baseline (axis 0.8 to 1.6) for crafty, eon, gzip, mcf, pbmk, twolf, vortex, vpr, ammp, art, eqk, fma3d, galgel, lucas and gmean. Series: speculatively parallel main, speculatively parallel look-ahead. One off-scale bar is labeled 1.65.]
![Page 118: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/118.jpg)
Baseline Cache Partitioning
Baseline (naive) cache partitioning/sharing policies:
Hard partition: every program gets equal cache share
No partition: programs can use any portion of shared caches
Two extremes: Resource protection vs. Capacity utilization
Unrelated program co-run: individual slowdowns may not be justifiable if from different users
Unlike slowing down a thread occasionally to improve throughput
Cache rationing: achieves good cache protection and cache utilization without slowing down individual programs
![Page 121: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/121.jpg)
Microthreads vs Decoupled Look-ahead
[Figure: two panels titled "Lightweight Microthreads" and "Decoupled Look-ahead".]
![Page 122: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/122.jpg)
Look-ahead Skeleton Construction
![Page 123: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/123.jpg)
Under-clocked Dual-core Speedup
Typically a dual-core can be clocked only up to 90% of the clock frequency of a single-core system
After adjusting the frequency of single-core
Single-core IPC: 1.80 (INT), 2.28 (FP), 2.05 (Combined)
Baseline look-ahead over 10% over-clocked single-thread
Speedup: 1.13x (INT), 1.34x (FP), 1.24x (Combined)
Self-tuned look-ahead over single-thread (for 14 applications)
Speedup: 1.20x (INT), 1.96x (FP), 1.43x (Combined)
![Page 124: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/124.jpg)
Self-tuned Look-ahead: SPEC 2006
Self-tuned look-ahead achieves 1.10x speedup over baseline look-ahead for SPEC CPU 2006 applications
[Figure: speedup over single-thread (axis ticks 1 to 8) for perl, bzp, gcc, mcf, go, hmer, sjen, libq, h264, omn, astr, xaln, milc, deal, splx and Gmean. Series: baseline look-ahead, GA based look-ahead.]
![Page 125: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/125.jpg)
Self-tuned Look-ahead: Speedup Analysis
A larger code (with more genes) takes slightly more time to evolve
$\text{Relative Performance Gain} = \dfrac{\text{GA} - \text{DLA}}{\text{Ideal} - \text{DLA}}$
[Figure: relative performance gain (axis ticks 1, 10, 100) versus # of static instructions (0 to 30,000) for SPEC 2006 and SPEC 2000 applications, with a linear regression line (r = −0.46).]
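As a small illustration of the relative performance gain metric, the C sketch below assumes the metric is the fraction of the ideal improvement over baseline decoupled look-ahead (DLA) that the GA-tuned skeleton achieves. The function name `relative_gain` and the zero-headroom convention are hypothetical.

```c
#include <assert.h>

/* Fraction of the ideal improvement over DLA that the GA solution achieves.
 * All three inputs use the same higher-is-better metric (e.g. speedup over
 * single-thread). Returns 0.0 when the ideal offers no headroom over DLA,
 * which avoids dividing by zero (a convention chosen for this sketch). */
double relative_gain(double ga, double dla, double ideal)
{
    assert(ideal >= dla);
    if (ideal == dla)
        return 0.0;
    return (ga - dla) / (ideal - dla);
}
```

Multiplying the result by 100 would express it as a percentage, which may be how the plot's 1 to 100 axis is scaled.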
![Page 126: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/126.jpg)
Self-tuned Look-ahead: Speedup Analysis
Performance gain has strong correlation with # of generations
Saturation generation (>= 90% of the best GA solution)
[Figure: saturation generation (axis 2 to 10) versus # of static instructions (0 to 30,000) for SPEC 2006 and SPEC 2000 applications, with a linear regression line (r = 0.56).]
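The saturation generation can be computed from a per-generation fitness trace along these lines. This is a hedged sketch: it assumes fitness is positive and higher is better, that the "best GA solution" is the maximum over the whole run, and that generations are numbered from 1. The function name `saturation_generation` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the 1-based index of the first generation whose fitness reaches
 * at least 90% of the best fitness seen in the run, or 0 for an empty run.
 * Assumes fitness values are positive and higher is better. */
size_t saturation_generation(const double *fitness, size_t n)
{
    double best;

    if (n == 0)
        return 0;
    assert(fitness != NULL);

    /* First pass: best fitness over the whole run. */
    best = fitness[0];
    for (size_t i = 1; i < n; i++)
        if (fitness[i] > best)
            best = fitness[i];

    /* Second pass: earliest generation within 10% of that best. */
    for (size_t i = 0; i < n; i++)
        if (fitness[i] >= 0.9 * best)
            return i + 1;
    return 0; /* not reached when fitness values are positive */
}
```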
![Page 127: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/127.jpg)
Partial Recovery in Speculative Parallelization
![Page 128: Accelerating Decoupled Look-ahead to Exploit Implicit ...parihar/thesis_pres.pdf · Motivation Baseline decoupled look-ahead Look-ahead thread acceleration Additional insights and](https://reader034.vdocuments.mx/reader034/viewer/2022050411/5f8857a9bfacf91b7b552794/html5/thumbnails/128.jpg)
MotivationBaseline decoupled look-aheadLook-ahead thread acceleration
Additional insights and summary
Flexibility in Look-ahead Hardware Design
Comparison of regular (partial versioning) cache support with two other alternatives
No cache versioning support
Dependence violation detection and squash
Genetic Algorithm Evolution
Multi-instruction Gene Examples
Superposition based Chromosomes
Recovery based Early Termination of Fitness Test
Optimizations to Implementation
Fitness test optimizations
Sampling based fitness
Multi-instruction genes
Early termination of tests
GA framework optimizations
Hybridization of solutions
Adaptive mutation rate
Unique chromosomes
Fusion crossover operator
Elitism policy
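Three of the framework optimizations listed above (elitism, unique chromosomes, and an adaptive mutation rate) can be sketched in a toy genetic algorithm. Everything here is an illustrative assumption rather than the presented implementation: the bit-per-gene encoding, the function name `evolve`, the rate-doubling rule, and the bit-counting fitness are all hypothetical. Plain uniform crossover stands in for the fusion crossover operator, whose definition is not given on this slide.

```python
import random
from typing import Callable, List, Tuple

Chromosome = Tuple[int, ...]  # one bit per gene (hypothetical encoding)


def evolve(
    fitness: Callable[[Chromosome], float],
    n_genes: int,
    pop_size: int = 8,
    generations: int = 20,
    n_elite: int = 2,
    seed: int = 0,
) -> Tuple[Chromosome, List[float]]:
    """Toy GA with elitism, unique chromosomes, and an adaptive mutation rate."""
    if not (4 <= pop_size <= 2 ** n_genes) or not (1 <= n_elite < pop_size):
        raise ValueError("unsupported population settings")
    rng = random.Random(seed)
    base_rate = 1.0 / n_genes
    rate = base_rate

    # Unique chromosomes: a set rejects duplicates in the initial population.
    pool = set()
    while len(pool) < pop_size:
        pool.add(tuple(rng.randint(0, 1) for _ in range(n_genes)))
    population = sorted(pool, key=fitness, reverse=True)
    history = [fitness(population[0])]

    for _ in range(generations):
        # Elitism: the best n_elite chromosomes survive unchanged.
        next_gen = list(population[:n_elite])
        seen = set(next_gen)
        parents = population[: pop_size // 2]
        while len(next_gen) < pop_size:
            a, b = rng.sample(parents, 2)
            # Uniform crossover (a stand-in, not the fusion operator).
            child = tuple(rng.choice(pair) for pair in zip(a, b))
            # Flip each bit independently with the current mutation rate.
            child = tuple(g ^ 1 if rng.random() < rate else g for g in child)
            if child not in seen:  # keep every chromosome unique
                seen.add(child)
                next_gen.append(child)
        population = sorted(next_gen, key=fitness, reverse=True)
        best = fitness(population[0])
        # Adaptive mutation: explore harder while the best score stalls.
        rate = base_rate if best > history[-1] else min(0.5, rate * 2)
        history.append(best)
    return population[0], history


# Toy fitness (assumption): count of set bits, standing in for a measured speedup.
best, history = evolve(lambda c: float(sum(c)), n_genes=12)
print(best, history[-1])
```

Because the elite chromosomes are copied forward unchanged, the best score in `history` never decreases from one generation to the next, which is the property an elitism policy is meant to guarantee.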
Sampling based Fitness Test
L2 Cache Sensitivity Study
Speedup for various L2 caches is quite stable
1.139x (1 MB), 1.133x (2 MB), and 1.131x (4 MB) L2 caches
Avg. speedups, shown in the figure, are relative to single-threaded execution with a 1 MB L2 cache
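As a quick check of the stability claim, the spread between the largest and smallest reported average speedups can be computed directly. This is only arithmetic on the three numbers quoted above; the variable names are hypothetical.

```python
# Average speedups quoted on the slide, keyed by L2 cache size.
speedups = {"1 MB": 1.139, "2 MB": 1.133, "4 MB": 1.131}

highest = max(speedups.values())
lowest = min(speedups.values())

# Relative spread between the best and worst configuration.
spread = (highest - lowest) / highest
print(f"{spread:.2%}")
```

The spread comes out at roughly 0.7%, so quadrupling the L2 capacity changes the average speedup by less than one percent.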
Approximable Program Paradigm
Weak dependence removal and speculative parallelization techniques can be applied to any approximate program
A few real-life examples of approximate computing
Google search: does not work with a coherent, up-to-date database
Map-Reduce paradigm: ignores consistently failing records
Media applications: photo, audio and video have some tolerance
Algorithm- and application-level approximations
Modern benchmarks, e.g., PARSEC, are fundamentally approximate
Application space: clustering, predictions, optimizations, etc.
BenchNN: a neural network based alternative to PARSEC