udirec: unified diagnosis and reconfiguration for frugal ... · ritesh parikh and valeria bertacco...

23
1/45 1/22 uDIREC: Unified Diagnosis and Reconfiguration for Frugal Bypass of NoC Faults Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan, Ann Arbor MICRO-46, 9 th December- 2013 Davis, California

Upload: others

Post on 26-Jul-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

1/45 1/22

uDIREC: Unified Diagnosis and Reconfiguration for Frugal Bypass of NoC Faults

Ritesh Parikh and Valeria Bertacco

Electrical Engineering & Computer Science Department The University of Michigan, Ann Arbor

1

MICRO-46, 9th December- 2013 Davis, California

Page 2: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

2/45 2/22

Device Scaling and Processor Evolution

Shrinking transistor size

32nm 65nm 45nm

Waning reliability

Core2Duo - 2007

Simple bus

Intel Nehalem - 2008

Point-to-point bus

Transition to multicore NoC based interconnect

Intel Sandy Bridge - 2011

High-speed ring network

22nm

2D mesh network-on-chip

Intel SCC- 2009

Page 3: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

3/45 3/22

Device Scaling and Processor Evolution

Shrinking transistor size

32nm 65nm 45nm

Waning reliability

Core2Duo - 2007

Simple bus

Intel Nehalem - 2008

Point-to-point bus

Transition to multicore NoC based interconnect

Intel Sandy Bridge - 2011

High-speed ring network

22nm

2D mesh network-on-chip

Intel SCC- 2009

Growing adoption of Network-on-Chip: ―Sole medium of on-chip communication ―Potential single-point of failure

Intel Identifies Cougar Point Error, SATA Connection Single Point of Failure. January, 2011. Cost: $700 Million

Page 4: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

4/45 4/22

4

On-Chip Interconnect Evolution

NoC-based multicore/SoC

data packet H B B B T

― Routers connected via links ― Cores connected via network interface (NI)

B T B B H

B T B B H header

tail body

flits

router #1

network interface

core-1

router #2

network interface

core-2

router #3

network interface

router #4

network interface

core-3 core-4 link

Bus-based multicore/SoC

bus

core/IP

4

arbiters /allocators

crossbar

FIFO i n p

u t s

o u t p u

t s

routing

credit, FIFO and routing control

output port input port

Page 5: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

5/45 5/22

Permanent Faults in NoCs ―Permanent wear-out ―Device aging

Detection Diagnosis Reconfiguration Recovery

Recover and resume normal

operation

Detect if fault has occurred

Diagnose where fault

has occurred

Reconfigure network to

account for fault

cannot send on faulty path need to re-route around fault

S D uDIREC solves two problems: ― Fault diagnosis at a fine resolution ― Reconfiguration to find new routes

NI

rtr

proc

NoCs: significant silicon footprint and heavy activity

Page 6: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

6/45 6/22

6

uDIREC (for unified DIagnosis and REConfiguration) incorporates: ― routing-aware scheme for diagnosis ― route-reconfiguration to circumvent faults

Fine-grain fault model for NoCs derived from: ― end-to-end diagnosis scheme ― frugal reconfiguration algorithm

A deadlock-free routing algorithm for irregular networks with unidirectional links Tightly integrated software implementation enables: ― low overhead ― no performance overhead during fault-free execution

Contributions

Error!! Error!! Error!!

0 1 2

3 S D

6 7 8

Page 7: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

7/45 7/22

Prior Work

Fine-Resolution Diagnosis

Routing Algorithm

Reconfiguration Algorithm

Experimental Evaluation

Rest of This Talk

0 1 2

3 4 5

6 7 8

Page 8: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

8/45 8/22

8

Prior Art Permanent Fault Diagnosis

Reconfiguration around Faults

―Entire  regions/routers  [Puente’04] ―One  or  more  bidirectional  links  [Fick’09]

0 1 2

3 4 5

6 7 8

― Based on routing on irregular networks  [Aisopos’11]

― Constrained by number and location  of  faults  [Flich’07,  Fukushima’09]

Dedicated testing and high overhead

0

10

20

30

40

0 20 40 60

lost

nod

es in

a

64-n

ode

CM

P

# of permanent faults

~2/3rd loss of on-chip processor nodes

Page 9: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

9/45 9/22

NI

data packet

router

core

NI

router

core

NI

router

core

NI

router

core

host core (software

score-board)

ECC

Datapath faults

― End-to-end software based ― Scoreboard to collect symptoms of corrupted data

Detection and Diagnosis Mechanism Inspired by Ghofrani et al., VTS’12

― Diagnose both datapath and control faults at low area overhead (< 3%) ― No runtime performance overhead when NoC is fault-free

counting and timeout error

Control faults

― Distributed hardware based ― Counting and timeout techniques

Page 10: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

10/45 10/22

10

Fine-Resolution Fault Diagnosis

0 1 2

3 4 5

6 7 8

1

1 1

1 1

0 1 2

3 4 5

6 7 8

5 9

4

2 1

0 0

2 0

0 0

2 1

2 0

0

1 2

0 0

0

0

0

0

+1

A

B Differentiable from

network ends

A B Undistinguishable from network ends

Packets are augmented with ECC [Shamshiri et al., ITC’11]

Erroneous transmission are reported to SW supervisor node

Finest granularity of: ― Routing reconfiguration ― End-to-end diagnosis

Page 11: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

11/45 11/22

11

Fine-Grain Fault Model

96% faults localized to a single datapath

segment, or a unidirectional link

A B Undistinguishable: ― O/P port ― Unidirectional link ― I/P port ― Crossbar contacts

KEY OBSERVATION: disable only ONE unidirectional

link on most faults

0 1 2

3 4 5

6 7 8

Error!! Error!! Error!!

crossbar crossbarbufferupstream router downstream router

datapath segment

packets entering same packets leaving

Page 12: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

12/45 12/22

1 2

0

12

Benefits of the Fine-Grain Fault Model

X 1 2

0

1 2

0

Fault manifestation Coarse-grain fault model Fine-grain fault model

fault! 2 → 1 2 → 1

Pat

h di

vers

ity

Con

nect

ivity

1 2

0

1 2

0

1 2

0

fault!

connected!

fault! disconnected! from node0

to node0

Page 13: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

13/45 13/22

Prior Work

Fine-Resolution Diagnosis

Routing Algorithm

Reconfiguration Algorithm

Experimental Evaluation

Rest of This Talk

0 1 2

3 4 5

6 7 8

Page 14: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

14/45 14/22

14

Routing in Irregular Networks ― Routing algorithm should disable paths that lead to deadlock ― Up*/down* routing disables turns to avoid deadlock

1. Construct spanning tree rooted at a node (assumes bidirectional links) 2. Mark links towards root: up (down otherwise) 3. Disable all down→up turns 4. Follow up links towards root and down links to destination

root

darker → closer to root

dist 0 dist 1

dist 1 dist 2

0 1

3 2

0 1

3 2 disabled turns

0 1downup

dow

nup

down

dow

n

up

up

up

downro

ottowards r

oot

node

away

from

root

node

3 0

Up*/down* does not work with unidirectional links

Page 15: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

15/45 15/22

15

Routing with Unidirectional Links

to

to from

from

― Separate spanning trees for up (up-tree) and down (down-tree) links Up-tree: unidirectional links towards root Down-tree: unidirectional links away from root Consistent ordering/labeling: no link marked both up and down

―    As  links  are  marked  according  to  up*/down*  principle  (and  no  conflicts) uDIREC routing is deadlock free (disable down→up turns) Network is connected if both trees are complete

1 2

0

1 2

0

up-tree down-tree

root root

conflict! {0th}

{2nd} {1st}

{0th}

{1st} {2nd}

Page 16: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

16/45 16/22

16

Growing Trees Concurrently

down

Step#1

down up

Up-tree and down-tree can be constructed: ― Independently: may lead to inconsistent marking ― Concurrently: consistent labeling ensured by construction

Grow tree beyond a node only if reachable by both up-tree and down-tree

Choice of root node affects connectivity

up 1 2

0

root

haltdown

forward

COMPLETE

Step#2

disable down→up turn

up down

1 2

0 root

haltdown

Step#1

haltup

INCOMPLETE

{0th}

{-} {-} {0th} {1st}

{2nd}

Page 17: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

17/45 17/22

Reconfiguration Overview

Integrated with the software implementation of diagnosis scheme

― Suspension of network operation on fault detection

― Diagnosis of fault site via software scoreboard

― Determination of surviving topology at supervisor

― Calculation of new routes in software Selection of root that maximizes connectivity

― Distribution of new deadlock-free routes to routers

― Resumption of operation from uncorrupted state

Supervisor

NI

router

core NI

router

core

NI

router

core

NI

router

core

Page 18: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

18/45 18/22

Reconfiguration Algorithm

Root selection process

― Exhaustive search of the optimal root node

Optimality of the root node

― Based on number of connected nodes (in our experiments) ― Based on critical functionality within sub-network

0 1 2

3 4 5

root

2 1 root

root

node

s=5

node

s=1

node

s=3

8 7 8 6

root trial #1: node 1 is root

root trial #2: node 2 is root

root trial #6: node 6 is root

root trial #8: node 6 is root

local winner!

local winner!

local winner!

Page 19: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

19/45 19/22

RT

router

1-bit ctrl wire

1-bit data wire

procproc

proc proc

unid

irect

iona

l rin

g

RT

RT

RT ctrl data

decode/ forward

from previous controller

ctrl datato next

controller

from

pro

c ctrl

data assemble

/forward

update RT

19

Reconfiguration Implementation

Supervisor node

Routing table

-1-1-0-0-1-0-1-

Serial transfer of control and data

Distributed controller

― Permanent faults are rare occurrences ― Routing table data = 4 (directions) * 64 (destinations) * 64 (RTs) < 2 KB

― Distribute using unidirectional ring of 1-control and 1-data wire

Page 20: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

20/45 20/22

0

0.2

0.4

0.6

0.8

1

0 10 20 30 40 50

pack

et d

eliv

ery

rate

injected faults

up*/down*uDIREC

0

5

10

15

20

25

30

0 10 20 30 40 50

avg.

no.

of d

ropp

ed n

odes

injected faults

up*/down*uDIREC

20

Reliability Results

― As faults accumulate networks become disconnected

― uDIREC looses fewer nodes and partitions into fewer networks

1/3rd dropped nodes 2x less partitioning

Page 21: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

21/45 21/22

0

3

6

9

0 10 20 30 40 50sa

tura

tion

thro

ughp

ut

injected faults

up*/down*uDIREC

0

10

20

30

40

0 10 20 30 40 50

avg.

net

wor

k la

tenc

y

injected faults

up*/down*uDIREC

21

Performance Results

path diversity dominates

― Initially latency degrades gracefully; at higher fault rates up*/down* drops

much more nodes, hence lower latency

― uDIREC consistently delivers higher throughput

connectivity dominates

>20% higher throughput

7% lower latency

Page 22: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

22/45 22/22

22

Conclusions Proposed uDIREC: a unified diagnosis and reconfiguration solution Fine-grained fault diagnosis, a novel fault model, a deadlock-free routing algorithm and a software-based reconfiguration algorithm uDIREC drops only 1/3rd nodes compared to state-of-the-art and delivers >20% higher throughput beyond 15 faults uDIREC incurs less than 1% wiring overhead

1/3rd

loss of nodes

>20% higher

Page 23: uDIREC: Unified Diagnosis and Reconfiguration for Frugal ... · Ritesh Parikh and Valeria Bertacco Electrical Engineering & Computer Science Department The University of Michigan,

2 "A4rb_Premium" – 2012-02_v02 – do not delete this text object! Speech

23/45 23/22

23

Thank you! Questions?