udirec: unified diagnosis and reconfiguration for frugal ... · ritesh parikh and valeria bertacco...
TRANSCRIPT
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
1/45 1/22
uDIREC: Unified Diagnosis and Reconfiguration for Frugal Bypass of NoC Faults
Ritesh Parikh and Valeria Bertacco
Electrical Engineering & Computer Science Department The University of Michigan, Ann Arbor
1
MICRO-46, 9th December- 2013 Davis, California
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
2/45 2/22
Device Scaling and Processor Evolution
Shrinking transistor size
32nm 65nm 45nm
Waning reliability
Core2Duo - 2007
Simple bus
Intel Nehalem - 2008
Point-to-point bus
Transition to multicore NoC based interconnect
Intel Sandy Bridge - 2011
High-speed ring network
22nm
2D mesh network-on-chip
Intel SCC- 2009
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
3/45 3/22
Device Scaling and Processor Evolution
Shrinking transistor size
32nm 65nm 45nm
Waning reliability
Core2Duo - 2007
Simple bus
Intel Nehalem - 2008
Point-to-point bus
Transition to multicore NoC based interconnect
Intel Sandy Bridge - 2011
High-speed ring network
22nm
2D mesh network-on-chip
Intel SCC- 2009
Growing adoption of Network-on-Chip: ―Sole medium of on-chip communication ―Potential single-point of failure
Intel Identifies Cougar Point Error, SATA Connection Single Point of Failure. January, 2011. Cost: $700 Million
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
4/45 4/22
4
On-Chip Interconnect Evolution
NoC-based multicore/SoC
data packet H B B B T
― Routers connected via links ― Cores connected via network interface (NI)
B T B B H
B T B B H header
tail body
flits
router #1
network interface
core-1
router #2
network interface
core-2
router #3
network interface
router #4
network interface
core-3 core-4 link
Bus-based multicore/SoC
bus
core/IP
4
arbiters /allocators
crossbar
FIFO i n p
u t s
o u t p u
t s
routing
credit, FIFO and routing control
output port input port
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
5/45 5/22
Permanent Faults in NoCs ―Permanent wear-out ―Device aging
Detection Diagnosis Reconfiguration Recovery
Recover and resume normal
operation
Detect if fault has occurred
Diagnose where fault
has occurred
Reconfigure network to
account for fault
cannot send on faulty path need to re-route around fault
S D uDIREC solves two problems: ― Fault diagnosis at a fine resolution ― Reconfiguration to find new routes
NI
rtr
proc
NoCs: significant silicon footprint and heavy activity
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
6/45 6/22
6
uDIREC (for unified DIagnosis and REConfiguration) incorporates: ― routing-aware scheme for diagnosis ― route-reconfiguration to circumvent faults
Fine-grain fault model for NoCs derived from: ― end-to-end diagnosis scheme ― frugal reconfiguration algorithm
A deadlock-free routing algorithm for irregular networks with unidirectional links Tightly integrated software implementation enables: ― low overhead ― no performance overhead during fault-free execution
Contributions
Error!! Error!! Error!!
0 1 2
3 S D
6 7 8
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
7/45 7/22
Prior Work
Fine-Resolution Diagnosis
Routing Algorithm
Reconfiguration Algorithm
Experimental Evaluation
Rest of This Talk
0 1 2
3 4 5
6 7 8
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
8/45 8/22
8
Prior Art Permanent Fault Diagnosis
Reconfiguration around Faults
―Entire regions/routers [Puente’04] ―One or more bidirectional links [Fick’09]
0 1 2
3 4 5
6 7 8
― Based on routing on irregular networks [Aisopos’11]
― Constrained by number and location of faults [Flich’07, Fukushima’09]
Dedicated testing and high overhead
0
10
20
30
40
0 20 40 60
lost
nod
es in
a
64-n
ode
CM
P
# of permanent faults
~2/3rd loss of on-chip processor nodes
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
9/45 9/22
NI
data packet
router
core
NI
router
core
NI
router
core
NI
router
core
host core (software
score-board)
ECC
Datapath faults
― End-to-end software based ― Scoreboard to collect symptoms of corrupted data
Detection and Diagnosis Mechanism Inspired by Ghofrani et al., VTS’12
― Diagnose both datapath and control faults at low area overhead (< 3%) ― No runtime performance overhead when NoC is fault-free
counting and timeout error
Control faults
― Distributed hardware based ― Counting and timeout techniques
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
10/45 10/22
10
Fine-Resolution Fault Diagnosis
0 1 2
3 4 5
6 7 8
1
1 1
1 1
0 1 2
3 4 5
6 7 8
5 9
4
2 1
0 0
2 0
0 0
2 1
2 0
0
1 2
0 0
0
0
0
0
+1
A
B Differentiable from
network ends
A B Undistinguishable from network ends
Packets are augmented with ECC [Shamshiri et al., ITC’11]
Erroneous transmission are reported to SW supervisor node
Finest granularity of: ― Routing reconfiguration ― End-to-end diagnosis
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
11/45 11/22
11
Fine-Grain Fault Model
96% faults localized to a single datapath
segment, or a unidirectional link
A B Undistinguishable: ― O/P port ― Unidirectional link ― I/P port ― Crossbar contacts
KEY OBSERVATION: disable only ONE unidirectional
link on most faults
0 1 2
3 4 5
6 7 8
Error!! Error!! Error!!
crossbar crossbarbufferupstream router downstream router
datapath segment
packets entering same packets leaving
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
12/45 12/22
1 2
0
12
Benefits of the Fine-Grain Fault Model
X 1 2
0
1 2
0
Fault manifestation Coarse-grain fault model Fine-grain fault model
fault! 2 → 1 2 → 1
Pat
h di
vers
ity
Con
nect
ivity
1 2
0
1 2
0
1 2
0
fault!
connected!
fault! disconnected! from node0
to node0
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
13/45 13/22
Prior Work
Fine-Resolution Diagnosis
Routing Algorithm
Reconfiguration Algorithm
Experimental Evaluation
Rest of This Talk
0 1 2
3 4 5
6 7 8
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
14/45 14/22
14
Routing in Irregular Networks ― Routing algorithm should disable paths that lead to deadlock ― Up*/down* routing disables turns to avoid deadlock
1. Construct spanning tree rooted at a node (assumes bidirectional links) 2. Mark links towards root: up (down otherwise) 3. Disable all down→up turns 4. Follow up links towards root and down links to destination
root
darker → closer to root
dist 0 dist 1
dist 1 dist 2
0 1
3 2
0 1
3 2 disabled turns
0 1downup
dow
nup
down
dow
n
up
up
up
downro
ottowards r
oot
node
away
from
root
node
3 0
Up*/down* does not work with unidirectional links
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
15/45 15/22
15
Routing with Unidirectional Links
to
to from
from
― Separate spanning trees for up (up-tree) and down (down-tree) links Up-tree: unidirectional links towards root Down-tree: unidirectional links away from root Consistent ordering/labeling: no link marked both up and down
― As links are marked according to up*/down* principle (and no conflicts) uDIREC routing is deadlock free (disable down→up turns) Network is connected if both trees are complete
1 2
0
1 2
0
up-tree down-tree
root root
conflict! {0th}
{2nd} {1st}
{0th}
{1st} {2nd}
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
16/45 16/22
16
Growing Trees Concurrently
down
Step#1
down up
Up-tree and down-tree can be constructed: ― Independently: may lead to inconsistent marking ― Concurrently: consistent labeling ensured by construction
Grow tree beyond a node only if reachable by both up-tree and down-tree
Choice of root node affects connectivity
up 1 2
0
root
haltdown
forward
COMPLETE
Step#2
disable down→up turn
up down
1 2
0 root
haltdown
Step#1
haltup
INCOMPLETE
{0th}
{-} {-} {0th} {1st}
{2nd}
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
17/45 17/22
Reconfiguration Overview
Integrated with the software implementation of diagnosis scheme
― Suspension of network operation on fault detection
― Diagnosis of fault site via software scoreboard
― Determination of surviving topology at supervisor
― Calculation of new routes in software Selection of root that maximizes connectivity
― Distribution of new deadlock-free routes to routers
― Resumption of operation from uncorrupted state
Supervisor
NI
router
core NI
router
core
NI
router
core
NI
router
core
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
18/45 18/22
Reconfiguration Algorithm
Root selection process
― Exhaustive search of the optimal root node
Optimality of the root node
― Based on number of connected nodes (in our experiments) ― Based on critical functionality within sub-network
0 1 2
3 4 5
root
2 1 root
root
node
s=5
node
s=1
node
s=3
8 7 8 6
root trial #1: node 1 is root
…
…
root trial #2: node 2 is root
root trial #6: node 6 is root
root trial #8: node 6 is root
local winner!
local winner!
local winner!
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
19/45 19/22
RT
router
1-bit ctrl wire
1-bit data wire
procproc
proc proc
unid
irect
iona
l rin
g
RT
RT
RT ctrl data
decode/ forward
from previous controller
ctrl datato next
controller
from
pro
c ctrl
data assemble
/forward
update RT
19
Reconfiguration Implementation
Supervisor node
Routing table
-1-1-0-0-1-0-1-
Serial transfer of control and data
Distributed controller
― Permanent faults are rare occurrences ― Routing table data = 4 (directions) * 64 (destinations) * 64 (RTs) < 2 KB
― Distribute using unidirectional ring of 1-control and 1-data wire
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
20/45 20/22
0
0.2
0.4
0.6
0.8
1
0 10 20 30 40 50
pack
et d
eliv
ery
rate
injected faults
up*/down*uDIREC
0
5
10
15
20
25
30
0 10 20 30 40 50
avg.
no.
of d
ropp
ed n
odes
injected faults
up*/down*uDIREC
20
Reliability Results
― As faults accumulate networks become disconnected
― uDIREC looses fewer nodes and partitions into fewer networks
1/3rd dropped nodes 2x less partitioning
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
21/45 21/22
0
3
6
9
0 10 20 30 40 50sa
tura
tion
thro
ughp
ut
injected faults
up*/down*uDIREC
0
10
20
30
40
0 10 20 30 40 50
avg.
net
wor
k la
tenc
y
injected faults
up*/down*uDIREC
21
Performance Results
path diversity dominates
― Initially latency degrades gracefully; at higher fault rates up*/down* drops
much more nodes, hence lower latency
― uDIREC consistently delivers higher throughput
connectivity dominates
>20% higher throughput
7% lower latency
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
22/45 22/22
22
Conclusions Proposed uDIREC: a unified diagnosis and reconfiguration solution Fine-grained fault diagnosis, a novel fault model, a deadlock-free routing algorithm and a software-based reconfiguration algorithm uDIREC drops only 1/3rd nodes compared to state-of-the-art and delivers >20% higher throughput beyond 15 faults uDIREC incurs less than 1% wiring overhead
1/3rd
loss of nodes
>20% higher
2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech
23/45 23/22
23
Thank you! Questions?