Clocking and Timing in Fault-Tolerant Systems-on-Chip
Andreas Steininger
Outline
• The Clock as a Blessing
• The Clock as a Curse
• Alternative Synchronization Schemes
- GALS
- fully asynchronous
- the DARTS approach
• Conclusion
Contributors to this Work
The DARTS project team
TU Vienna: Gottfried Fuchs, Matthias Fuegger, Ulrich Schmid, Thomas Handl
RUAG Space: Gerald Kempf, Manfred Sust, Wolfgang Zangerl
The Need for Fault Tolerance
miniaturization is key to progress in VLSI
=> smaller structures
=> lower voltage swing
=> smaller critical charge
=> higher operating frequencies
…result in higher susceptibility to faults (SET, EMI,…)
=> cannot avoid faults, need to tolerate them
The Role of Time
“The only reason for time is so that everything doesn’t happen at once”, Albert Einstein
The Need for Clocking
activities need to be co-ordinated
• on system level (braking of wheels, …)
• on algorithmic level (consensus, …)
• on communication level
• on logic level (state machine switching,…)
co-ordination in the time domain (synchronization) is an efficient way to attain this
=> need a global notion of time (discrete „ticks“)
The Quality of Synchronization
[Figure: local time (number of ticks) plotted against real time; precision π marked]
Typical Precision Values
on system level: ms … ms
on algorithm level: ms … ms
on communication level: ns … ms
on logic level: ps … ns
Synchronization Requirements
phase synchronisation (for „hardware clock“ on logic level)
clock synchronisation (for distributed time base on algorithmic level)
1 µs is excellent precision for a distributed clock
at 1 GHz this means 360,000° phase shift
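The phase-shift figure follows from multiplying the time offset by the clock frequency: a 360,000° shift at 1 GHz corresponds to an offset of 1 µs, i.e. 1000 full clock periods. A minimal sketch of that arithmetic (the function name is made up for illustration):

```python
def phase_shift_degrees(offset_seconds: float, clock_hz: float) -> float:
    """Phase shift that a given time offset represents at a given clock frequency."""
    cycles = offset_seconds * clock_hz   # number of clock periods covered by the offset
    return cycles * 360.0                # one period equals 360 degrees

# 1 microsecond of clock-synchronisation precision, seen from a 1 GHz hardware clock
print(phase_shift_degrees(1e-6, 1e9))
```

The same function shows why the two requirements differ by orders of magnitude: keeping the phase shift well below 360° at 1 GHz needs an offset far below 1 ns.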
Globally Synchronous Design
• whole design is „isochronic“ („perfect“ precision)
- time conveyed by clock transitions
- perfect co-ordination of all activities
• very efficient design
- can assume consistent states
- high level of abstraction
• very efficient implementation:
- single crystal oscillator
- single control line (clock net)
„Isochronic“ Regions ?
speed of light (in medium) = 2 × 10^8 m/s = 20 cm/ns
[Figure: a 2 cm distance with signals at 1 GHz, 4 GHz and 8 GHz shown against a reference (Ref)]
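To see why a 2 cm region stops being „isochronic“ at high frequencies, the flight time over that distance can be compared with the clock period. A small sketch using only the 20 cm/ns figure from the slide (function and constant names are made up for illustration):

```python
SIGNAL_SPEED_CM_PER_NS = 20.0  # speed of light in the medium, as given on the slide

def flight_time_ns(distance_cm: float) -> float:
    """Time a signal needs to travel the given distance."""
    return distance_cm / SIGNAL_SPEED_CM_PER_NS

def fraction_of_period(distance_cm: float, clock_ghz: float) -> float:
    """Flight time expressed as a fraction of one clock period."""
    period_ns = 1.0 / clock_ghz
    return flight_time_ns(distance_cm) / period_ns

for ghz in (1.0, 4.0, 8.0):
    print(f"{ghz:.0f} GHz: {fraction_of_period(2.0, ghz):.0%} of a clock period")
```

Over 2 cm the flight time is 0.1 ns, which is a tenth of the period at 1 GHz but most of the period at 8 GHz.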
The Variation Problem
[Figure: Designer: system model, projected conditions, worst case, safety margins. User: actual system, actual conditions, marked with „? (unknown)“ and „? (imperfections)“]
Timing completely fixed after design
No way to react to actual conditions & system („PVT variations“)
Fault-Tolerant Architectures
Duplication & Comparison
Triple-Modular Redundancy
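[Figure: Duplication & Comparison: two FUs feed a comparator (=?) that signals ERR. Triple-Modular Redundancy: three FUs feed a voter that produces output Y]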
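The two architectures can be sketched behaviourally as follows. This is only an illustration: the function units are modelled as plain Python callables, and the names `duplicate_and_compare` and `tmr_vote` are made up here.

```python
from collections import Counter
from typing import Callable, Sequence, Tuple

def duplicate_and_compare(fu_a: Callable[[int], int],
                          fu_b: Callable[[int], int],
                          x: int) -> Tuple[int, bool]:
    """Duplication & comparison: returns (output, ERR). Detects a fault, cannot mask it."""
    ya, yb = fu_a(x), fu_b(x)
    return ya, ya != yb

def tmr_vote(outputs: Sequence[int]) -> int:
    """Triple-modular redundancy: majority vote over three FU outputs masks one fault."""
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        raise ValueError("no majority: more than one replica disagrees")
    return value

good = lambda x: x + 1
faulty = lambda x: x + 2          # stands in for a faulty replica

print(duplicate_and_compare(good, faulty, 3))            # ERR flag is raised
print(tmr_vote([good(3), good(3), faulty(3)]))           # faulty output is outvoted
```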
Lock-Step Operation: single clock
[Figure: three FUs driven by a single clock feed a voter with output Y; each replica steps through states „3“, „4“]
single point of failure
good replica determinism
Lock-Step Operation: independent clocks
[Figure: three FUs with independent clocks feed a voter with output Y; each replica steps through states „3“, „4“]
single fault tolerant
bad replica determinism
Fault-Tolerant HW-Clocking
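[Figure: three FUs feeding a voter with output Y, with three additional voters (v) for the clocks]
[Figure, continued: the same structure with two elements labelled D added and two question marks]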
The Charm of SoCs
billions of transistors fit on one die
=> structuring into (IP) modules
„System-on-Chip“
BUT:
• large clock distribution networks => „isochronic“??
• FT clocking does not work with large skew
• may need individual clocks for function modules
=> clock-synchrony neither attainable nor desirable
Co-ordination of Data Exchange
[Figure: SRC and SNK connected through f(x) stages]
When can SNK use its input? When it is valid and consistent
When can SRC apply the next input? When SNK has consumed the previous one
The Synchronous Approach
[Figure: SRC and SNK connected through f(x) stages]
co-ordination based on (global) time
Alternative: Asynchronous Design
[Figure: SRC and SNK connected through f(x) stages]
co-ordination based on handshaking
REQ: „Data word valid, you can use it“
ACK: „Data word consumed, send the next“
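The REQ/ACK co-ordination can be sketched as a sequence of signal changes. The slide does not say which handshake protocol is meant; the sketch below assumes a four-phase (return-to-zero) handshake, and `four_phase_transfer` is a made-up name.

```python
def four_phase_transfer(words):
    """Transfer words from SRC to SNK; returns (received words, trace of (REQ, ACK))."""
    req = ack = 0
    received, trace = [], []
    for word in words:
        data = word          # SRC puts the data word on the lines ...
        req = 1              # ... and raises REQ: "data word valid, you can use it"
        trace.append((req, ack))
        received.append(data)
        ack = 1              # SNK raises ACK: "data word consumed, send the next"
        trace.append((req, ack))
        req = 0              # SRC withdraws REQ
        trace.append((req, ack))
        ack = 0              # SNK withdraws ACK; both sides are idle again
        trace.append((req, ack))
    return received, trace

received, trace = four_phase_transfer([0xFF14])
print(received, trace)
```

Each step waits for the other side's previous step, so the transfer time adapts to however long SRC and SNK actually take.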
Async. Design – Advantages
• closed-loop control makes timing much more robust and adaptive to PVT variations
• no need for worst-case timing
• local handshakes replace global clock
• activity only when needed
• beneficial for EMI
• tends to stop operation in case of fault
Async. Design – Disadvantages
• Need to handle race between REQ and data
[Figure: SRC and SNK connected through f(x) stages; REQ: „Data word valid, you can use it“]
Solution 1: „Bundled Data“
Solution 2: „Delay Insensitive“ (Coding), with completion detection
• significant HW overhead (coding, delay elements)
• „adaptive“ timing not as predictable
• more difficult to design
• classical fault-tolerance schemes not applicable
• tends to stop operation in case of fault
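For the delay-insensitive solution, one common coding is dual-rail, where the data itself tells the receiver when it is complete and no separate REQ has to be timed against it. The slide does not name the code used, so dual-rail with an all-zero spacer is an assumption here, as are the function names.

```python
SPACER = (0, 0)   # assumed "no data" state of one dual-rail bit

def encode(bits):
    """Dual-rail encoding: each bit becomes a (true_rail, false_rail) pair."""
    return [(1, 0) if b else (0, 1) for b in bits]

def is_complete(word):
    """Completion detection: every bit has exactly one rail high."""
    return all(t != f for t, f in word)

def decode(word):
    """Recover the bits of a complete dual-rail word."""
    if not is_complete(word):
        raise ValueError("word not complete yet")
    return [t for t, _ in word]

word = encode([1, 0, 1])
partial = [word[0], SPACER, word[2]]   # middle bit has not arrived yet
print(is_complete(partial), is_complete(word), decode(word))
```

The price is visible in the encoding: two wires per bit plus the completion detector, which is the hardware overhead listed among the disadvantages.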
Best of Both Worlds
GALS: Globally Asynchronous Locally Synchronous
retain efficiency of synchronous design wherever possible: „intra-module“
use asynchronous principle where clock distribution is too cumbersome: „inter-module“
First mention in PhD thesis by Chapiro, Stanford, 1984
A GALS Example
[Figure: four clock domains: CPU (2 GHz), PCI-IF (533 MHz), DSP (2.7 GHz), USB-IF (24 MHz)]
Communication in GALS
Shared Memory: producer writes to memory, consumer reads from there
pro: control flow stays independent
• shared single-port memory
• true dual-port memory
Direct Messages (data words): move data word from producer's output register to consumer's input register
• non-buffered / buffered (FIFO-queues)
• clock fixed, data-driven or pausible
Shared Memory
decoupling of clock domains by memory acting as a third party => high area overhead => unusual
for single port memory arbitration required
• arbitration problem (unbounded delay…)
• one side may block the other at the arbiter
for multiport memory problems are confined to access to the same cell
• busy flag may become metastable
• blocking still possible for one specific address
Shared Memory
[Figure: CPU (2 GHz) and DSP (2.7 GHz) access a shared memory location (0xff14) through arbitration]
• perfect decoupling of data path
• potential metastability problems at arbitration logic
• potential blocking through arbitration
Direct Messages
clock domain boundary is between producer‘s output register and consumer‘s input register
in general a synchronizer is needed at consumer's input
• definitely for conventional (fixed) clock
• can be avoided by data-driven / pausible clocking
control flows of producer and consumer are strongly coupled: not maintaining the input/output register blocks other party
buffers/queues/FIFOs can
• mitigate, but not avoid this problem (full/empty)
• compensate variations in the data rate on both sides, but not different average data rates
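The last point about FIFOs can be illustrated with a small step-based model. It is a simplification made up for this purpose: producer and consumer follow repeating activity patterns, and a blocked attempt is only counted, not retried.

```python
from collections import deque
from itertools import cycle, islice

def simulate_fifo(produce, consume, depth, steps):
    """Returns (producer stalls on full FIFO, consumer stalls on empty FIFO)."""
    fifo = deque()
    producer_stalls = consumer_stalls = 0
    for wants_put, wants_get in islice(zip(cycle(produce), cycle(consume)), steps):
        if wants_get:
            if fifo:
                fifo.popleft()
            else:
                consumer_stalls += 1      # FIFO empty: consumer is blocked
        if wants_put:
            if len(fifo) < depth:
                fifo.append(1)
            else:
                producer_stalls += 1      # FIFO full: producer is blocked
    return producer_stalls, consumer_stalls

bursty = [1, 1, 1, 1, 0, 0, 0, 0]   # bursts of four words, average rate 1/2
steady = [1, 0]                     # one word every second step, average rate 1/2

print(simulate_fifo(bursty, steady, depth=4, steps=80))   # same average rate, deep enough FIFO
print(simulate_fifo(bursty, steady, depth=1, steps=80))   # same average rate, shallow FIFO
print(simulate_fifo([1], steady, depth=16, steps=1000))   # producer twice as fast on average
```

With equal average rates a sufficiently deep FIFO absorbs the bursts; with a faster producer the FIFO fills up after a while regardless of its depth, and the producer is blocked again.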
Direct Messages
data moving over clock domain boundary
metastability problems
=> need to insert handshake
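…with synchronizers and (optional) buffers
[Figure: data word 0xff14 passes from CPU (2 GHz) to DSP (2.7 GHz) across the clock domain boundary, with synchronizers (S) on the handshake signals]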
Arbiter: Principle
purpose:
○ manage concurrent requests to shared resource
method:
○ handle pairs of request_in / grant_out
○ requests may arrive in any order
○ arbiter must activate only one grant_out at a time (respond to the first requester): Mutual Exclusion (MUTEX)
problem:
○ resolve concurrent requests => metastability problem
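A purely behavioural sketch of the request/grant rule is given below. It assumes that requests always arrive one after the other, so it deliberately leaves out the hard part, namely truly simultaneous requests and the resulting metastability. The class and method names are made up.

```python
class Mutex:
    """Grants a shared resource to one requester at a time, first come first served."""

    def __init__(self):
        self.pending = []            # active requests in order of arrival

    def request(self, who):
        if who not in self.pending:
            self.pending.append(who)

    def release(self, who):
        if who in self.pending:
            self.pending.remove(who)

    def grant(self):
        """The single requester whose grant_out is active, or None if idle."""
        return self.pending[0] if self.pending else None

m = Mutex()
m.request("R1")
m.request("R2")        # arrives while R1 holds the grant, so it has to wait
print(m.grant())
m.release("R1")
print(m.grant())
```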
Arbiter: Circuit
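[Figure: arbiter circuit with requests R1, R2, latch outputs G1’, G2’ and grants G1, G2; inset plot of Vout,FF over t with the levels Vmeta and Vth,inv]
MUTEX-element: SR-latch
„Metastability filter“: e.g., hi-threshold inverter
[from D. J. Kinniment, „Synchronization and Arbitration in Digital Systems“, Wiley]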
Arbiter: Operation
[Figure: arbiter circuit (R1, R2, G1’, G2’, G1, G2) with waveforms of R1, G1, R2, G2]
Muller C-Element
IF a = b THEN y = a ELSE hold y
[Figure: C-element symbol (C) with inputs a, b and output y, and a realization with an RS latch (set, reset)]
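The rule „IF a = b THEN y = a ELSE hold y“ translates directly into a small state-holding model. This is a behavioural sketch only; the class name and the reset value of y are assumptions.

```python
class CElement:
    """Muller C-element: output follows the inputs when they agree, else keeps its value."""

    def __init__(self, y: int = 0):
        self.y = y                 # assumed initial output

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.y = a             # both inputs agree: output takes their value
        return self.y              # inputs differ: output is held

c = CElement()
for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    print(a, b, c.update(a, b))
```

The output rises only after both inputs have risen and falls only after both have fallen, which is what makes the element useful for joining handshakes.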
Muller C-Element: Circuit
[Figure: C-element circuit, from Alain Martin, Caltech]
Data-Driven Clocking
Principle:
○ as soon as new data arrive => start clocking
○ determine number k of clock cycles required to process new data
○ stop clocking after k cycles, wait for next data
Properties:
○ need to switch clock on and off => beware spurious clock pulses!
○ no metastability problem: data stable as soon as consumer clock starts
○ potential for power saving
○ useful for specific applications only (no pipe!)
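The principle can be sketched as a function that turns data arrival times into rising clock edges. This is an idealized model: times are plain numbers, k is fixed per call, and data arriving while the clock is still running is assumed to wait until the current burst has finished.

```python
def data_driven_clock(arrivals, k, half_period):
    """Rising-edge times of a clock that runs for k cycles after each data arrival."""
    edges = []
    free_at = 0                                   # time at which the clock is idle again
    for t in arrivals:
        start = max(t, free_at)                   # data is stable before the clock starts
        for i in range(k):
            edges.append(start + 2 * i * half_period)
        free_at = start + 2 * k * half_period     # clock stops after k cycles
    return edges

print(data_driven_clock([0, 100], k=3, half_period=5))
```

Between the bursts there are no clock edges at all, which is where the potential for power saving comes from.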
Data-Driven Clock: Circuit / 1
[Figure: oscillator loop with delay element D and output CLK out, with the CLK out waveform]
CLK half period determined by D
Data-Driven Clock: Circuit / 2
[Figure: Muller C-element (C) and delay element D with REQ/ACK handshake and output CLK out, with waveforms of REQ, ACK and CLK out]
transition on REQ answered by transition on CLK out
min CLK half period determined by D
Pausible Clocking
Principle:
○ producer requests consumer's clock to pause
○ data provided to input register during idle time
○ consumer's clock may resume
- free running („pausible clock“)
- with one cycle only („stoppable clock“)
Properties:
○ need to switch clock on and off => beware spurious clock pulses! => beware of clock tree delays!
○ producer controls consumer's clock (blocking!)
○ applications must cope with paused clock
Pausible Clock: Circuit / 1
[Figure: Muller C-element (C) and delay element D in a REQ/ACK loop driving CLK out, with waveforms of REQ, ACK and CLK out]
inverter generates next REQ from ACK
self-oscillation
Pausible Clock: Circuit / 2
[Figure: the clock generator (C, D) extended with an arbiter (Arb) and an external port REQ’ / ACK’, with waveforms of CLK out, REQ’ and ACK’]
external unit can safely stop CLK by activating REQ’
… and gets ACK’ as a response
Pausible Clock: Circuit / 3
[Figure: clock generator (C, D) with one arbiter (Arb) per external port, REQ1/ACK1 to REQn/ACKn, driving CLK out]
for more external sources arbiters can be added and “anded” before the Muller C-Element
the two inverters can be eliminated by using a Muller C-Element with inverting output
Advantages of GALS
• synchronous islands can be designed efficiently
• modules operate independently
• can use module-specific clock & timing
• clocking is no single point of failure
Problems with GALS
• operation of modules not (inherently) co-ordinated
synchrony for communication but not on system / algorithm level
• communication has to cross clock boundaries
• potential for metastability
=> performance penalty through synchronizers
OR
=> module must handle irregular clocking
The DARTS Idea
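DARTS: Distributed Algorithms for Robust Tick Synchronization
[Figure: tick synchronisation shown between phase synchronisation and clock synchronisation]
[Figure: function units Fu1, Fu2, Fu3 on a data bus, each with a TG-Alg; the TG-Algs are connected by the TG-Net]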
The DARTS Approach
Concept: Multiple synchronized tick generators
Method: Distributed algorithm for fault-tolerant tick generation implemented in (asynchronous) digital logic
Advantages
- No crystal oscillator(s)
- No critical clock tree
- Clock is no single point of failure!
- Reasonable synchrony
The DARTS Principle
Every function unit Fu_i is augmented with a simple local clock unit (TG-Alg)
TG-Algs communicate over dedicated TG-Net to generate tick-synchronized local clock signals
Up to f TG-Algs can be Byzantine faulty; need n ≥ 3f + 2 TG-Algs
Formally proven synchronization properties
[Figure: standard synchronous clocking (Fu1, Fu2, Fu3 on a data bus, fed by a clock tree) next to DARTS clocks (TG-Algs connected by the TG-Net)]
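The bound n ≥ 3f + 2 can be turned around to see how many Byzantine faults a given number of TG-Algs covers. A small sketch (function names are made up):

```python
def min_tg_algs(f: int) -> int:
    """Smallest number of TG-Algs that tolerates f Byzantine faulty ones."""
    return 3 * f + 2

def max_faults(n: int) -> int:
    """Largest f that satisfies n >= 3f + 2 for a system of n TG-Algs."""
    return max((n - 2) // 3, 0)

for n in (5, 8, 11, 12):
    print(n, "TG-Algs tolerate", max_faults(n), "Byzantine fault(s)")
```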
A Comparison
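[Figure: three architectures side by side]
synchronous SoC: Fu1, Fu2, Fu3 on a data bus, one oscillator and a clock tree; single point of failure
GALS: Fu1, Fu2, Fu3 on a data bus, one oscillator per function unit; no single point of failure; NO (inherent) global synchrony
DARTS: Fu1, Fu2, Fu3 on a data bus, TG-Algs connected by the TG-Net, clocks Fu1 clk and Fu2 clk shown around tick(3) and tick(4); no single point of failure
The remaining labels, „global synchrony (potentially 1 tick)“ and „global synchrony (< 1 tick)“, belong to the synchronous SoC and DARTS panels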
The Distributed Algorithm
Initially:
    send tick(0) to all; clock := 0;
“Relay Rule”
    If received tick(m) from at least f+1 remote nodes and m > clock:
        send tick(clock+1), …, tick(m) to all [once]; clock := m;
“Increment Rule”
    If received tick(m) from at least 2f+1 remote nodes and m >= clock:
        send tick(m+1) to all [once]; clock := m+1;
[Srikanth & Toueg, 87]
[Figure: six tick generators, TG-Alg 1 to TG-Alg 6, connected via the TG-Net]
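The two rules can be tried out in a small event-driven simulation. This is a software sketch under simplifying assumptions, not the DARTS hardware: all nodes are fault-free, every link has a fixed delay given by a caller-supplied function, and a node that has sent tick(m) is taken to have sent all lower ticks as well. All names are made up.

```python
import heapq

def simulate_ticks(n, f, max_tick, delay):
    """Run the relay and increment rules on n fault-free nodes.

    delay(src, dst) is an assumed fixed, positive link delay.
    Returns (final clocks, largest clock difference seen between two nodes).
    """
    clock = [0] * n
    # seen[i][j]: highest tick node i has received from node j (-1 = none yet)
    seen = [[-1] * n for _ in range(n)]
    events = []          # heap of (arrival time, sequence number, src, dst, tick)
    seq = 0
    max_skew = 0

    def broadcast(now, src, first, last):
        nonlocal seq
        for m in range(first, last + 1):
            for dst in range(n):
                if dst != src:
                    heapq.heappush(events, (now + delay(src, dst), seq, src, dst, m))
                    seq += 1

    for node in range(n):
        broadcast(0, node, 0, 0)              # Initially: send tick(0) to all

    while events and min(clock) < max_tick:
        now, _, src, dst, m = heapq.heappop(events)
        seen[dst][src] = max(seen[dst][src], m)
        received = sorted((seen[dst][j] for j in range(n) if j != dst), reverse=True)
        new_clock = clock[dst]
        relay_m = received[f]                 # highest tick seen from >= f+1 remote nodes
        if relay_m > new_clock:               # Relay Rule
            new_clock = relay_m
        incr_m = received[2 * f]              # highest tick seen from >= 2f+1 remote nodes
        if incr_m >= new_clock:               # Increment Rule
            new_clock = incr_m + 1
        if new_clock > clock[dst]:
            broadcast(now, dst, clock[dst] + 1, new_clock)   # every tick is sent once
            clock[dst] = new_clock
            max_skew = max(max_skew, max(clock) - min(clock))
    return clock, max_skew

clocks, skew = simulate_ticks(n=5, f=1, max_tick=5, delay=lambda s, d: 1)
print(clocks, skew)
```

With n = 5 and f = 1 each node waits for tick(m) from three remote nodes before it advances, so no oscillator is involved: the tick rate is set by the message delays alone.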
Implementation Challenges
Moving from the software-based algorithm to hardware:
• k-bit messages, k unbounded => replacement by zero-bit messages
• Atomicity of actions => to be ensured by the architecture and delay constraints
• Threshold functions for fault tolerance => glitch-free asynchronous implementation
[Figure: k-bit msg vs. zero-bit tick: TICK(k), TICK(k-1), ..., TICK(1), TICK(0)]
The DARTS Prototype
ASIC design:
• radhard 180nm technology
• 2 designs:
- flexible
- fast
Prototype board: 8 chips plus fixed & programmable interconnect
Proof of Concept
Frequency Stability (Warm-up)
[Plot: frequency in MHz (axis from 53.15 to 53.45) over time in hours (0 to 18)]
Frequency Stability (detail)
[Plots: frequency in MHz (axis from 51.94 to 52.0) and core voltage in V (axis from 1.7968 to 1.7974), each over time in min (0 to 15)]
DARTS – General Properties
Fully asynchronous implementation, NO oscillators
Tolerates up to three Byzantine faulty nodes (configurable number of TG-Algs; 5 to 12)
Adapts to operating conditions (asynchronous logic)
Still Room for Improvements
o Transient faults are permanently stored in the elastic pipelines
o No on-the-fly integration of TG-Alg
o Relatively low clock speed
o Interfacing to traditional synchronous designs
o Scaling with number of faults is costly
Summary: Trends & Needs
• Ongoing miniaturization necessitates fault tolerance
• Co-ordination of activities is fundamental, thus tight synchrony is a desirable feature on all levels
• SoCs are large modular designs on a single die
Summary: SoC Clocking
• globally synchronous clock:
+ ideal synchrony, efficient in design & implementation
- isochrony unrealistic, single point of failure
• DARTS clock
+ best attainable global synchrony, adaptive timing, FT
- high implementation efforts, frequency not stable
• GALS
+ uses best of syn & asyn, indep. & module-specific clock
- no global synchrony, metastability issues
• asynchronous design
+ power-efficient, robust against faults & PVT
- high overheads, difficult to design, timing hard to predict
More information on DARTS
http://ti.tuwien.ac.at/ecs/research/projects/darts