A Distributed Stallable Architecture to Handle Delay Variations
DESCRIPTION
A summary of Alberto A. Del Barrio's work during his stay at UCLA, under the direction of Prof. Jason Cong
TRANSCRIPT
![Page 1: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/1.jpg)
A Distributed Stallable Architecture to Handle Delay Variations
Dr. Alberto A. Del Barrio
Complutense University of Madrid
![Page 2: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/2.jpg)
![Page 3: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/3.jpg)
UCLA Stay
• VLSI, Architecture, Synthesis and Technology (VAST) Laboratory
– http://cadlab.cs.ucla.edu/beta/cadlab/news
• Led by Prof. Jason Cong
• Around 20 students (postdocs, predocs, master's, undergrads, visitors)
• More than 400 papers
• Tool releases, startups
– xPilot → AutoESL → Vivado HLS (Xilinx)
![Page 4: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/4.jpg)
UCLA research summary
• Goal: handling delay variations by applying the Distributed Architecture developed in my thesis
• How to do this?
– Simulator & HW implementation
– Binding algorithm
![Page 5: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/5.jpg)
![Page 6: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/6.jpg)
How to model delay while considering process variations [Jung and Kim, ICCAD'07]
![Page 7: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/7.jpg)
Comparisons
• State of the art: Worst Case
– Overpessimistic
• CODES'09: BTW + Centralized Stallable Arch.
– Every failure in execution time will stall the whole datapath
• Many operations finishing their execution at the same time will incur an extra cycle penalty
• Dynamic behavior that escapes the static analysis will stall the datapath
– Can only recover from failures of up to 1 cycle
– Worse behavior when sharing resources
• Proposal: Distributed Stallable Arch.
![Page 8: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/8.jpg)
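An example: Differential Equation Solver (DES)
[Figure: data-flow graph of the DES benchmark with 11 numbered operations, drawn in four rows. Row 1: ×1, ×2, +5. Row 2: ×6, ×3, <8. Row 3: −9, ×7, ×4. Row 4: −11, +10.]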
![Page 9: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/9.jpg)
Worst Case vs Best Case
Worst Case = Better Than Worst Case
Compared with the Best Case, there are 5 cycles of difference. The Distributed Arch. will be close to the BC
![Page 10: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/10.jpg)
![Page 11: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/11.jpg)
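Razor Register
[Figure: Razor register schematic. Blocks: Main register, Shadow register, 1/0 multiplexer, comparator (Comp). Signals: din, clk, dclk, hit.]
Comparison is performed between two registers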
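The comparison between the two registers can be illustrated with a small behavioral sketch. It assumes the usual Razor scheme, in which the main register samples `din` on `clk` and the shadow register samples the same input a little later on the delayed clock `dclk`. The class name, the `settle_time` parameter, the polarity of `hit` (high when both registers agree) and the recovery step are illustrative assumptions, not details taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class RazorRegister:
    """Behavioral model of a Razor register (illustrative, not the slide's RTL)."""
    clk_period: float   # time at which the main register samples din
    dclk_shift: float   # extra delay before the shadow register samples din
    main: int = 0
    shadow: int = 0

    def capture(self, old_value: int, new_value: int, settle_time: float) -> bool:
        """Sample din with both registers and return the hit flag.

        din is modeled as holding old_value until settle_time and
        new_value from settle_time onwards.
        """
        def din_at(t: float) -> int:
            return new_value if t >= settle_time else old_value

        self.main = din_at(self.clk_period)
        self.shadow = din_at(self.clk_period + self.dclk_shift)
        # Assumption: hit is high when both registers agree (no timing failure).
        hit = self.main == self.shadow
        if not hit:
            # Assumed recovery: the later (shadow) sample replaces the main value.
            self.main = self.shadow
        return hit


reg = RazorRegister(clk_period=10.0, dclk_shift=3.0)
on_time = reg.capture(old_value=0, new_value=42, settle_time=8.0)
late = reg.capture(old_value=42, new_value=7, settle_time=11.5)
print(on_time, late, reg.main)
```

In this model a transition that arrives after `clk_period + dclk_shift` is sampled as the old value by both registers, so it goes undetected. That mirrors the limited recovery window of the scheme.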
![Page 12: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/12.jpg)
Razor Register Chronogram
[Figure: chronogram of the signals D and dclk over the clock period T.]
If the inputs change during the dclk_shift interval, the value stored in the shadow register could be dirty
If an FU is shared, the worst-case delay is allocated for its operations
![Page 13: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/13.jpg)
Literature: Centralized Stallable Architecture
[Figure: centralized stallable datapath. Blocks: Buffer, Rz. Input Register, Combinational Logic, Rz. Output Register, FSM Combinational Logic, Rz. State Register, Rz. Stab. Register.]
By Cong et al., CODES'09
Problems: FU sharing restricts the possibilities of the design. The worst-case timing must be allocated whenever sharing happens
Deals with process variations
![Page 14: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/14.jpg)
Better Than Worst Case (unconstrained)
Additional slack guarantees that ops. 4 and 7 are correct
![Page 15: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/15.jpg)
Better Than Worst Case (unconstrained)
If only operations 4 and 7 have problems, it's OK
![Page 16: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/16.jpg)
Better Than Worst Case (unconstrained)
The failure of operation 5 was not considered in the static analysis
Every failure is translated into an extra cycle
The 1st iteration finishes after 11 cycles (1 failure)
![Page 17: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/17.jpg)
My approach: Distributed Stallable Architecture
[Figure: distributed stallable datapath. Blocks: Buffer, Rz. Input Register, Combinational Logic, Rz. Output Register, Rz. Stab. Register, local controllers FSM CL 1 … FSM CL N with state registers St Reg 1 … St Reg N, and a Commit Signals Logic Unit.]
Distributed Architecture, by Del Barrio et al., DATE'10, TCAD (March 2011)
The controller is split into several local controllers, plus a coordinator responsible for checking hazards dynamically
But how to integrate it with Razor registers?
![Page 18: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/18.jpg)
Usefulness of a Distributed Architecture
[Figure: two schedules of the DES graph over csteps 1 to 4. Panels: BTW static schedule (unconstrained); Priority-list static schedule (unconstrained).]
Operations 7 and 4 are covered: if a failure happens, it will have no impact on the latency
![Page 19: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/19.jpg)
Usefulness of a Distributed Architecture
[Figure: two executions of the DES graph over csteps 1 to 4, with some operations marked R and stall cycles labeled 1-stall to 4-stall. Panels: BTW execution example; Distributed execution example.]
But what if more failures happen? BTW might not be enough
![Page 20: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/20.jpg)
Distributed Architecture: Best Case Static Scheduling (unconstrained)
![Page 21: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/21.jpg)
Distributed Architecture
We schedule considering the Best Case, but the datapath is able to reschedule on the fly, and some failures can also be hidden
If there are no more failures, the 1st iteration will finish in 10 cycles, while hiding 2 failures
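The difference between stalling the whole datapath and rescheduling on the fly can be illustrated with a toy latency model. This is only a sketch under strong assumptions: every operation has its own FU (unconstrained), an operation takes 1 cycle in the best case and 2 cycles when it fails, the centralized datapath loses one cycle for every control step that contains at least one failure, and BTW slack insertion is ignored. The four-operation graph `DEPS` and all function names are hypothetical; it is not the DES graph from the slides.

```python
def asap_levels(deps):
    """Control step of each operation in a best-case ASAP schedule (1 cycle per op)."""
    level = {}

    def visit(op):
        if op not in level:
            level[op] = 1 + max((visit(p) for p in deps[op]), default=0)
        return level[op]

    for op in deps:
        visit(op)
    return level


def centralized_latency(deps, failing):
    """Static schedule; each control step with a failure stalls the whole datapath once."""
    level = asap_levels(deps)
    stalled_csteps = {level[op] for op in failing}
    return max(level.values()) + len(stalled_csteps)


def distributed_latency(deps, failing):
    """Each operation starts as soon as its own predecessors have committed."""
    finish = {}

    def visit(op):
        if op not in finish:
            start = max((visit(p) for p in deps[op]), default=0)
            finish[op] = start + (2 if op in failing else 1)
        return finish[op]

    return max(visit(op) for op in deps)


# Hypothetical graph: a -> b -> c, and d -> c (d has one cycle of slack).
DEPS = {"a": [], "b": ["a"], "c": ["b", "d"], "d": []}

for failing in [set(), {"d"}, {"b"}]:
    print(sorted(failing), centralized_latency(DEPS, failing), distributed_latency(DEPS, failing))
```

In this model a failure in `d` costs the centralized datapath one cycle, while the distributed one absorbs it in the slack of `d`. A failure on the critical path (`b`) costs one cycle in both.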
![Page 22: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/22.jpg)
Simulator Results:
Unconstrained

| Benchmark | Codes BTW | DisM Barrier | DisM |
|---|---|---|---|
| DiffEq | 10.5 | 10.394 | 6.31 |
| ARF [13] | 18.79 | 18.379 | 14.103 |
| FFT [12] | 11.57 | 12.218 | 11.844 |
| FIR16 [12] | 20.65 | 20.342 | 16.074 |
| EWF [13] | 12.892 | 12.829 | 12.316 |

Codes and DisM Barrier have similar results. DisM reduces latency by 17%

RC-constrained: 4+, 4*

| Benchmark | Codes BTW | DisM Barrier | DisM |
|---|---|---|---|
| DiffEq | 10.66 | 10.374 | 8.305 |
| ARF [13] | 23.32 | 20.318 | 17.825 |
| FFT [12] | 16.51 | 14.92 | 14.38 |
| FIR16 [12] | 24.27 | 21.22 | 16.6 |
| EWF [13] | 16.1 | 12.79 | 12.47 |

DisM Barrier and DisM reduce latency by 12% and 23%, respectively
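The reported percentages can be checked against the tabulated latencies. The sketch below assumes that each figure is the mean of the per-benchmark latency reductions relative to the Codes BTW column. That averaging method is a guess, not something stated on the slide, and the variable and function names are illustrative.

```python
# Latencies copied from the two tables: (Codes BTW, DisM Barrier, DisM).
UNCONSTRAINED = {
    "DiffEq": (10.5, 10.394, 6.31),
    "ARF": (18.79, 18.379, 14.103),
    "FFT": (11.57, 12.218, 11.844),
    "FIR16": (20.65, 20.342, 16.074),
    "EWF": (12.892, 12.829, 12.316),
}
RC_CONSTRAINED = {
    "DiffEq": (10.66, 10.374, 8.305),
    "ARF": (23.32, 20.318, 17.825),
    "FFT": (16.51, 14.92, 14.38),
    "FIR16": (24.27, 21.22, 16.6),
    "EWF": (16.1, 12.79, 12.47),
}


def mean_reduction(table, column):
    """Mean per-benchmark latency reduction of `column` (1 or 2) versus column 0."""
    reductions = [1.0 - row[column] / row[0] for row in table.values()]
    return sum(reductions) / len(reductions)


for name, table in [("unconstrained", UNCONSTRAINED), ("RC-constrained", RC_CONSTRAINED)]:
    barrier = mean_reduction(table, 1)
    dism = mean_reduction(table, 2)
    print(f"{name}: DisM Barrier {barrier:.1%}, DisM {dism:.1%}")
```

Under this assumption the computed means land within about one percentage point of the figures quoted on the slide. In the unconstrained case DisM Barrier comes out essentially level with Codes BTW.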
![Page 23: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/23.jpg)
Implementation Results (P & R)
| Toolchain | Resources | Latency | Cycle Time (ns) | Ex. Time (ns) | DSPs | LUTs | Regs | %DSPs | %LUTs | %Regs |
|---|---|---|---|---|---|---|---|---|---|---|
| 10 ns Vivado+NP*+Xilinx 14.4 | (2*,2+) | 4 | 8.165 | 32.66 | 2 | 135 | 71 | 0.26 | 0.04 | 0.05 |
| 5 ns Vivado+NP*+Xilinx 14.4 | (2*,2+) | 5 | 9.003 | 45.015 | 2 | 157 | 73 | 0.26 | 0.05 | 0.05 |
| Alb+Xilinx 14.4 | (2*,2+) | 3 | 8.7 | 26.1 | 2 | 246 | 118 | 0.26 | 0.08 | 0.08 |
| Alb tuned+Xilinx 14.4 | (2*,2+) | 3 | 9.535 | 28.605 | 2 | 204 | 86 | 0.26 | 0.07 | 0.06 |

Time columns: Latency, Cycle Time, Ex. Time. Area columns: DSPs, LUTs, Regs, %DSPs, %LUTs, %Regs.

Distr. Arch. implements modulo scheduling dynamically, as it executes operations when ready

Comparison is performed between two registers, so there is no problem with FU sharing, as din (combinational) will not influence the comparison result
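The Ex. Time column is consistent with latency multiplied by cycle time, which gives a direct way to compare the toolchains. A minimal sketch, using the latency and cycle-time values from the results above; the helper names are illustrative.

```python
# (toolchain, latency in cycles, cycle time in ns), copied from the results above.
ROWS = [
    ("10 ns Vivado+NP*+Xilinx 14.4", 4, 8.165),
    ("5 ns Vivado+NP*+Xilinx 14.4", 5, 9.003),
    ("Alb+Xilinx 14.4", 3, 8.7),
    ("Alb tuned+Xilinx 14.4", 3, 9.535),
]


def execution_times(rows):
    """Execution time in ns per toolchain: latency (cycles) x cycle time (ns)."""
    return {name: round(latency * cycle_ns, 3) for name, latency, cycle_ns in rows}


def relative_saving(baseline_ns, candidate_ns):
    """Fraction of the baseline execution time saved by the candidate."""
    return (baseline_ns - candidate_ns) / baseline_ns


times = execution_times(ROWS)
baseline = times["10 ns Vivado+NP*+Xilinx 14.4"]
for name, ex_time in times.items():
    print(f"{name}: {ex_time} ns, saving vs. 10 ns Vivado {relative_saving(baseline, ex_time):.1%}")
```

Read this way, the shorter latency of the distributed implementations outweighs their slightly longer cycle time relative to the 10 ns Vivado run, at the cost of more LUTs and registers.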
![Page 24: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/24.jpg)
![Page 25: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/25.jpg)
Binding Problem: bad binding
cstep 1: ×1, ×2, +5
cstep 2: ×6, ×3, <8
cstep 3: −9, ×7, ×4
cstep 4: −11, +10
The hazard between 8 and 9 stalls several components of the graph
![Page 26: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/26.jpg)
[Figure: cycle-by-cycle execution trace for the bad binding, clock cycles 1 to 10. Columns: Clock Cycle; State and T value of each FU controller (M1, M2, A1, A2); Issued operations; Committed operations.]
![Page 27: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/27.jpg)
Binding problem: good binding
cstep 1: ×1, ×2, +5
cstep 2: ×6, ×3, <8
cstep 3: −9, ×7, ×4
cstep 4: −11, +10
The hazard between 8 and 10 is less damaging than the one between 6 and 4, because of the extra cstep
The cost function depends on two bound operations
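A pairwise cost of this kind can be plugged into a simple greedy binder. The sketch below is hypothetical and is not the author's algorithm: the cost of two operations sharing an FU is assumed to be the inverse of their cstep distance, so that an extra cstep between them makes the hazard cheaper, and the comparison and subtractions are assumed to run on the adders. The schedule is the one listed on the slide; the FU names M1, M2, A1, A2 follow the trace headers.

```python
import math

# Schedule from the slide: operation id -> (FU kind, cstep).
# Assumption: "<" and "-" operations are executed on adders.
SCHEDULE = {
    1: ("mul", 1), 2: ("mul", 1), 5: ("add", 1),
    6: ("mul", 2), 3: ("mul", 2), 8: ("add", 2),
    9: ("add", 3), 7: ("mul", 3), 4: ("mul", 3),
    11: ("add", 4), 10: ("add", 4),
}
FUS = {"mul": ["M1", "M2"], "add": ["A1", "A2"]}


def pair_cost(cstep_a, cstep_b):
    """Hypothetical hazard cost of two operations bound to the same FU."""
    if cstep_a == cstep_b:
        return math.inf  # two operations of one cstep cannot share an FU
    return 1.0 / abs(cstep_a - cstep_b)


def greedy_binding(schedule, fus):
    """Bind operations in cstep order to the FU with the lowest added pair cost."""
    bound = {fu: [] for names in fus.values() for fu in names}
    binding = {}
    for op in sorted(schedule, key=lambda o: (schedule[o][1], o)):
        kind, cstep = schedule[op]

        def added_cost(fu):
            return sum(pair_cost(cstep, schedule[other][1]) for other in bound[fu])

        best = min(fus[kind], key=added_cost)
        if math.isinf(added_cost(best)):
            raise ValueError(f"no free {kind} unit for operation {op}")
        bound[best].append(op)
        binding[op] = best
    return binding


print(greedy_binding(SCHEDULE, FUS))
```

Because the cost only looks at pairs of already bound operations, the result depends on the visiting order. A greedy pass like this gives no optimality guarantee.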
![Page 28: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/28.jpg)
[Figure: cycle-by-cycle execution trace for the good binding, clock cycles 1 to 8. Columns: Clock Cycle; State and T value of each FU controller (M1, M2, A1, A2); Issued operations; Committed operations.]
![Page 29: A Distributed Stallable Architecture to Handle Delay Variations](https://reader034.vdocuments.mx/reader034/viewer/2022042715/55842dcad8b42a0b6d8b4e02/html5/thumbnails/29.jpg)
State of the Research
• Simulator: OK
• Implementation:
– Sharing problem not solved yet
• Binding algorithm
– Greedy version: OK
– ILP formulation: difficult to model, not working
– Network Flow formulation: possible target
• Study of controller granularity
– 1 FSM per FU
– 1 FSM per operation cluster … but how to define an operation cluster?