a pseudo-synchronous implementation flow for wchb qdi
TRANSCRIPT
A Pseudo-Synchronous Implementation Flow for WCHB QDI Asynchronous Circuits
Yvain Thonnart, Edith Beigné, Pascal Vivet CEA-LETI, Minatec, Grenoble, France
Async’2012, May 8th 2012
DTU, Copenhagen, Denmark
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 2
Asynchronous circuits
A handcrafted piece of art Entangled uneven loops
Requires minute attention to detail
Very valuable for specific needs
But very expensive design time
A powerful heavy machinery Backed-up by big EDA companies
Obsessed about clocks
Scared of loops
with synchronous CAD tools?
Pseudo-synchronous implementation “Mass-produced”
Much cheaper design time
Can run fast, nevertheless!
Trick the
chain link model
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 3
Outline
Asynchronous circuits with synchronous CAD tools ?
Pseudo-synchronous models for C-elements
Pseudo-synchronous circuit implementation
Benchmarking against asynchronous implementation
Real-world implementations
Conclusion & perspectives
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 4
DIMS WHCB pipeline combinational loops & optimization
Performance is given by the loops cycle times Design optimization needs to constrain those loops
Synchronous CAD tools can’t handle them
need to cut the loops in the timing graph & constrain loop segments
Where to cut for a systematic approach in the WCHB C-elements: the ones gathering forward and backward
data (they must be Resetted)
C
Fwd Logic
Bwd Logic
Reset
C
C
Fwd Logic
Bwd Logic
Reset
C
C
Fwd Logic
Bwd Logic
Reset
C
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 5
Asynchronous Implementation: cost & flaws Resulting timing constraints:
For each WCHB C-element in the cell library, disable timing arcs to cut the loops set_disable_timing ‘C_element’ –from ‘in’ –to ‘out’
For each path segment between two WCHB C-elements, specify a target maximum delay set_max_delay –from ‘C/elt/inst1/out’ –to ‘C/elt/inst2/in’ 0.5ns
Limitation: The WCHB C-elements themselves are not optimized Minimal or no drive adaptation of cells depending on cell load
No consideration on signal slope on path end
Cells can be moved back and forth during placement
Synchronous CAD tools do not manage asynchronous path ends correctly
Use pseudo-synchronous models for WCHB C-elements to cut timing loops without disabling timing arcs
to improve tool control over path ends
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 6
Pseudo-synchronous circuit timing paths
Loops are cut naturally at pseudo-synchronous C-elts No need to disable a timing arc
Creates 2 kinds of paths in WCHB pipeline: forward paths
backward paths
How to derive pseudo synchronous models ?
How to constrain resulting paths ?
C Fwd Logic
Bwd Logic
Reset=clk
C Fwd Logic
Bwd Logic
Reset=clk
C Fwd Logic
Bwd Logic
Reset=clk
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 7
Asynchronous .lib characterization .lib files in Liberty format to model cell timing arcs
As a function of input transition times and output capacitance
4 values per arc : rise delay, fall delay, rise transition, fall transition
Reset
A
B
Z
when B=1 and Reset inactive
A
Z
rise_delay
rise_tran
rise_delay(AZ):
30ps 120ps 200ps
80ps 160ps 250ps
130ps 210ps 300ps
Z output capacitance
10fF 40fF 100fF
A input transition
10ps
80ps
200ps
rise_tran(AZ):
12ps 80ps 320ps
20ps 85ps 320ps
28ps 90ps 320ps
Z output capacitance
10fF 40fF 100fF
A input transition
10ps
80ps
200ps
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 8
Pseudo-synchronous .lib derivation
Clk (was Reset)
A
B
Z
C-element is modeled like a synchronous flip-flop Reset pin is used as a dummy clock input
New arc uses first row of AZ arc, old arcs are turned to setup checks
A
Z
rise_delay
rise_tran
rise_delay(ClkZ):
30ps 120ps 200ps
80ps 160ps 250ps
130ps 210ps 300ps
Z output capacitance
10fF 40fF 100fF
rise_tran(ClkZ):
12ps 80ps 320ps
20ps 85ps 320ps
28ps 90ps 320ps
Z output capacitance
10fF 40fF 100fF
Clk
setup rise constraint
setup_rise(AClk)
computed as diff.
between 1st column of
previous rise_delay(AZ)
and new rise_delay(ClkZ)
setup_rise(BClk) Idem with previous BZ
A input transition
10ps
80ps
200ps
0ps
50ps
100ps
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 9
Simple pseudo-synchronous constraint Declaring a clock on the reset signal constrains all paths
to a given “dummy” period Actual asynchronous cycle time given by biggest sum of 2 fwd + 2bwd delays
on the loops (for token+bubble)
as bad as 4x dummy target period
often less (2x-3x) as no hold fixing is done
Dummy clock period limitation: Logic depth can be different on each path
Relaxes all paths to worst path length
Actual throughput not optimal when forward and backward logic are not balanced (on most critical local loop)
Actual forward latency can be really sub-optimal (given by sum of fwd delays)
What about over-constraining the design ? Negative slack is not a big deal for implementation, circuit is QDI after all !
But over-constrained paths will distract the optimization kernels…
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 10
Refined pseudo-synchronous timing constraints Use dummy clock declaration to identify paths, not to
constrain design with a given period Declare clock to break loops, with any period (e.g. 0ns)
Override delays on all paths with reg2reg set_max_delay constraints
set_max_delay 0.23ns –from C/elt/inst1 –to C/elt/inst2 (no pins given preserve all arcs inferred by clock declaration)
Resulting constraints very similar to asynchronous ones, but with no timing arc disabled Better control on timing paths for optimization tools
Leverage on all existing asynchronous STA methods to predict performance
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 11
WHCB isochronic forks handling
Green fork needs no isochronic assumption Both branches are acknowledged by protocol (C-element on point of reconvergence)
Red forks should be isochronic (or relaxed) Only one of the branches is acknowledged (reconvergence on a combinational gate)
BUT
they always occur at path ends (previous logic is shared)
Shortest adversary path goes through 2 C-elements and at least 1 inverting bwd logic
Constraining paths through the fork for shortest possible delays (with refined ‘set_max_delay’ constraints) also balances any buffer tree needed at the fork
Adversary path isochronic hypothesis is easily met
C
Fwd Logic
Bwd Logic
Reset
C
C
Fwd Logic
Bwd Logic
Reset
C
C
Fwd Logic
Bwd Logic
Reset
C
slow branch
fast branch
+ Adversary path
2nd segment
Adversary
path 1st
segment
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 12
Pseudo-synchronous implementation flow Source
Netlist.ref.v
Netlist.final.v
Map & Opt
preCTS.sdc
Place & IPO
CTS
Route & IPO
postCTS.sdc
Delay Calc
GDS SPEF
SDF
Final sim.
DRC, LVS… Async.lib
PSync.lib dummy.ctsspec
Reset
C
Reset=clk
C
Tape-out
PSyncIP.lib
< Your preferred asynchronous sythesis method here >
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 13
Linear pipeline case study
Implemented down to layout with Cadence SoC Encounter
STMicro 65nm LP technology
Very narrow floorplan 20µm*600µm to model a long NoC link
C
Reset
C
C
C
C
Reset
C
C
C
C
Reset
C
C
C
C
Reset
C
C
C
C
Reset
C
C
C
C
Reset
C
C
C
x17 MR4
Physically implemented & optimized with different strategies
Instantiated 4x to inject the 4 different input values on each MR4
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 14
Timing constraints strategies
Asynchronous modeling combinational loops broken at
C-elements inputs.
zero-delay target: ‘set_max_delay 0’ on all paths
zero slack: iterations on place-and-route
flow adjusting per path ‘set_max_delay’ values until implementation reports final slack of 0ps.
-40ps slack: same as above, but stop iterating
as soon as final negative slack is lesser than 40ps.
Pseudo-synchronous modeling
zero-delay target: ‘create_clock Reset -period 0’
simple: ‘create_clock Reset -period N’
with iterations until N cannot be reduced with a final slack of 0ps.
zero slack: ‘create_clock Reset -period 0’,
plus iterations on per path ‘set_max_delay’ values until implementation reports a final slack of 0ps.
-20ps slack: same as above, with a 20ps
target.
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 15
Benchmarking results @tt65_1.2V_25C
With asynchronous modeling, disabling timing arcs to break loops at C-elements degrades performance
Simple and 0 target synchronous are comparable in performance Less iterations for 0 target, but slightly bigger area
Ad-hoc synchronous constraints give best results
0
5
10
15
20
25
30
35
300 325 350 375 400 425 450 475 500 525 550 575 600 625 650
cycle time (ps)
nu
mb
er
of o
ccu
ren
ce
s
Async 0 target
Async 0ps slack
Async -40ps slack
Sync simple
Sync 0 target
Sync 0ps slack
Sync -20ps slack
0
5
10
15
20
25
30
35
40
175 225 275 325 375 425 475 525 575 625 675 725 775 825 875 925 975 1025
latency (ps)
nu
mb
er
of o
ccu
ren
ce
s
Async 0 target
Async 0ps slack
Async -40ps slack
Sync simple
Sync 0 target
Sync 0ps slack
Sync -20ps slack
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 16
ANoC implementations
ANoC router made of 6 kinds of WCHB processes 3 per input stage, 3 per output stage
Generic data path size
Any possible combination of input stages and output stages
60 “generic” ‘set_max_delay’ constraints cover all possible arrangements of processes in NoC topology 60 values to refine for zero-slack
strategies
Recent implementation in 3 chips with industrial partnership in 2011/2012 2D-mesh based, in STMicro 65nm LP
Req-Resp Master-Slave based in STMicro 32nm and 28nm LP
30
20
10
00
3D
(ftol)
TX_BITtx_bit00n
TX_BITtx_bit00n
31
21
11
01
32
22
12
02
33
23
13
03
2 4 2
6 1 6
6 1 6
2 4 2
2
6
2
5 4 4 4
6 5 5 6
6 3 4 4
1
MEPHISTOmep_01n
MEPHISTOmep_01n
MEPHISTOmep_02n
MEPHISTOmep_02n
TRX_OFDMtrx_ofdm_03n
TRX_OFDMtrx_ofdm_03n
ARM11arm11_00w
ARM11arm11_00w
TRX_OFDMtrx_ofdm_03e
TRX_OFDMtrx_ofdm_03e
SMEsme_01
SMEsme_01
SMEsme_03
SMEsme_03
SME_EXTsme_10
SME_EXTsme_10
SME_WIDEIO
sme_11
SME_WIDEIO
sme_11SME_WIDEIO
sme_12
SME_WIDEIO
sme_12UDECASIP
asip_13
UDECASIPasip_13
UDECASIPasip_13
UDECASIPasip_13
SME_WIDEIO
sme_21
SME_WIDEIO
sme_21SME_WIDEIO
sme_22
SME_WIDEIO
sme_22RX_BITrx_bit23
RX_BITrx_bit23
SMEsme_31
SMEsme_31
SMEsme_33
SMEsme_33
MEPHISTO_
HEATER
mep_30w
MEPHISTO_
HEATER
mep_30w
MEPHISTOmep_33e
MEPHISTOmep_33e
MEPHISTOmep_33s
MEPHISTOmep_33s
MEPHISTOmep_32s
MEPHISTOmep_32s
TRX_OFDMtrx_ofdm_30s
TRX_OFDMtrx_ofdm_30s
TRX_OFDMtrx_ofdm_31s
TRX_OFDMtrx_ofdm_31s
nocif2
nocif1
3D
(serial2)
3D
(serial2r)
3D
(normal)
TEST
3DNoC
TEST
3DNoC
TEST
Wide IO
TEST
Wide IO
C1_m1
C2_m1
C3_m1
C4_m1
L2_s1
L2_s2
L2_s3
L2_s4
C1_m2
C2_m2
C3_s
C4_s
L3_s1
C3_m2
C4_m2
L3_s2
C1_s
C2_s
FC_m1
L3_m
FC_m2
FC_s
GANoC-L2
GANoC-L3
MAG3D
P2012_CO
ST 65nm LP
16 routers
2 channels
34b datapath
1MGate ANoC
ST 28nm LP
10 routers
76b requests
68b responses
400kGate ANoC
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 17
28nm P2012_CO ANoC synthesis results
According to dummy period: Area increase up to +30% cycle time & latency reduction up to -30%
Ad-hoc pseudo-sync. constraints allow for: reproducible best performance @ 1280Mflit/s with reasonable area increase by ~20% compared to under-constrained design
Quality of Results
300
320
340
360
380
400
420
440
0.0 0.2 0.4 0.6 0.8 1.0
Pseudo-Sync Dummy Period (ns)
Are
a (
KG
ate
)
0.30
0.40
0.50
0.60
0.70
0.80
0.90
Cri
tica
l ¨P
ath
Le
ng
th (
ns)
Area (Kgate)
Critical Path (ns)
Negative Slack
Positive Slack
Performance tt28_1.00V_25C
4.00
4.50
5.00
5.50
6.00
6.50
7.00
7.50
8.00
8.50
0.0 0.2 0.4 0.6 0.8 1.0
Pseudo-Sync Dummy Period (ns)
Pe
ak B
an
dw
idth
(G
B/s
)
2.00
2.20
2.40
2.60
2.80
3.00
3.20
3.40
Asyn
c. e
nd
to
en
d L
ate
ncy (
ns)
Peak Bandw idth (GB/s)
Async. end2end Latency (ns)
Ad-hoc max delay
pseudo-sync constraints
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 18
MAG3D implementation results
Technology STMicroelectronics
cmos 65nm low-power process
Implementation strategy Pseudo-synchronous hard-macro for
routers Mixed integration on top
Synchronous DfT Pseudo-synchronous ANoC links
P&R Runtime ~ 17h
ANoC Area 1M Gate
Performance @tt65_1.2V_25C 7 routers path ~10 mm links Average throughput:
850 Mflit/s Average latency:
9.81ns
~8.5mm
Measured NoC path
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 19
Conclusion Asynchronous circuits turned synchronous (not really…)
For the designs a bit more performance DIMS WCHB circuits are not as bad as you would think, aren’t they ?
For the designers a systematic approach for loop breaking and design constraints Large asynchronous designs within easy reach
For the community a “benevolent” betrayal Don’t banish me, please…
For the industry a comfortable well-known CAD environment Energy-efficient off-the-shelf soft IPs
OK, they are actually asynchronous, but only if they ask…
But will it work for more than ANoC or DIMS WCHB ?
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 20
Pseudo-synchronous timing paths in QDI (PCHB/PCFB/RSPCHB…) pipelines
Up to 5 types of pseudo-synchronous paths instead of 2 (+ WCHB like paths for state variable in PCFB) Not necessarily balanced in delays ad-hoc constraints to be considered,
dummy period could be insufficient
When no Reset input is present on the cells, create and rely on an “internal pin” for dummy clock pin(dummy) {direction : “internal”; […]} in .lib file create_clock –name ‘dummy_clk’ [$all_dummy_pins_in_design] in .sdc file
Blue paths form an isochronic fork for “bubbles” Need special handling to guarantee data deactivation before EN re-activation
Ra EN Ra EN
© CEA. All rights reserved
Yvain Thonnart – ASYNC’12 Symposium, Copenhagen, Denmark | May 8th 2012 | 21
timing arcs diversion and timing margin Alternatives for relative delay constraint
on isochronic fork specify ‘set_data_check’
reduce max delay constraints separately on both paths to guarantee there is no positive slack
Add security margin to data arcs
Compatible with simple dummy clk period constraint
Specify margin thanks to dummy clk transition time
EN Ra
A
B
Dummy (or Reset)
Z
modified comb arc
computed from EN
setup setup
setup
(with security
margin / EN)
setup
(with security
margin / EN)
A
Z
rise_delay
rise_tran
Clk
setup rise constraint
margin
spec
margin
Many thanks to My co-authors for their 9-year contribution &
support
The reviewers for their inspiring feedback
The audience for your questions ?