an asynchronous routing algorithm for clos networks

Post on 24-Oct-2021

6 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

An Asynchronous Routing Algorithm for Clos Networks

Wei Song and Doug Edwards

The Advanced Processor Technologies Group (APT)

School of Computer Science

The University of Manchester

{songw, doug}@cs.man.ac.uk

1

Outline

• Introduction of Clos networks and asynchronous circuits

• Classic Synchronous dispatching algorithms – Random dispatching (RD)

– Concurrent Round-Robin Dispatching (CRRD)

• Asynchronous Dispatching (AD) algorithm

• Implementation

• Outcome

– 32-port Clos network. Setup 6.2 ns, release 3.9 ns.

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

2

Switching Networks

• Telephone networks

• Asynchronous transfer mode (ATM) and

IP networks

–ATLANTA chip (622Mb/s/port, 1997)

–Petastar Optical Switch (160Gb/s/port, 2003)

• Intra/inter chip interconnect

–High-radix router with many narrow ports

–SDM: delay guaranteed services

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

3

Asynchronous Circuit

• Asynchronous vs. synchronous

–Low power (clockless)

–Possible performance boost (average latency)

–Tolerance to process variation

• Implementation style

–2-phase vs. 4-phase

–Bundled-data vs. quasi-delay insensitive (QDI)

–4-phase QDI

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

4

Research Target

• An area-efficient and high-speed switching

network for on-chip networks

–High-radix router

–Low-power consumption

–Tolerance to process variation

–Area efficient

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

5

Clos Switching Network

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

6

3 stages: IMs, CMs and OMs

C(n, k, m):k IM/OMsn ports per IM/OMm CMs

LI links between IM/CMLO links between CM/OM

SMS: buffer CMMSM: buffer IM/OMS3: no buffer

n´m

n´m

n´m

k´k

k´k

k´k

m´n

m´n

m´n

IM(0)

IM(k-1)

IM(i)

CM(0)

CM(m-1)

CM(r)

OM(0)

OM(k-1)

OM(j)

Non-Blocking

• Blocking

– An available input/output pairs may not be connected due to internal blocking.

• Strict Non-Blocking (SNB)– All available input/output pairs are connectable.

• Rearrangeable Non-Blocking (RNB)– Connecting available input/output pairs may

require internal permutation.

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

7

12 nmn

12 nm

Area Consumption

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

8

Routing Algorithms

• Optimal algorithms

–Algorithms provide guaranteed results for all matches but with a higher complexity in time and implementation.

• Heuristic algorithms

–Algorithms provide all or partial connections in much lower time complexity.

• Most dynamically reconfigurable Clos

networks use heuristic algorithms

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

9

Dispatching Algorithms

Every Input/output pair has mpaths. Packets are dispatched evenly to all CMs.

Dispatching algorithm: the algorithm used in IM to CM packet dispatching.

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

10

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

C(2,3,2)

Random Dispatching (RD)

• Phase 1

– IPs send

requests to

LIs

– LIs select IPs

• Phase 2

– CMs select

LIs

– Configure

granted IPs

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

11

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

Random Dispatching (RD)

• Phase 1

– IPs send

requests to

LIs

– LIs select IPs

• Phase 2

– CMs select

LIs

– Configure

granted IPs

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

12

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

Random Dispatching (RD)

• Phase 1

– IPs send

requests to

LIs

– LIs select IPs

• Phase 2

– CMs select

LIs

– Configure

granted IPs

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

13

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

RD cannot resolve contention in IMs.

Concurrent Round-Robin Dispatching (CRRD)

• Phase 1

– IPs request

– LIs select

– IPs select

– Go back

(iteration)

• Phase 2

– Same as RD

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

14

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

Concurrent Round-Robin Dispatching (CRRD)

• Phase 1

– IPs request

– LIs select

– IPs select

– Go back

(iteration)

• Phase 2

– Same as RD

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

15

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

Concurrent Round-Robin Dispatching (CRRD)

• Phase 1

– IPs request

– LIs select

– IPs select

– Go back

(iteration)

• Phase 2

– Same as RD

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

16

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

Concurrent Round-Robin Dispatching (CRRD)

• Phase 1

– IPs request

– LIs select

– IPs select

– Go back

(iteration)

• Phase 2

– Same as RD

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

17

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

Concurrent Round-Robin Dispatching (CRRD)

• Phase 1

– IPs request

– LIs select

– IPs select

– Go back

(iteration)

• Phase 2

– Same as RD

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

18

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

Concurrent Round-Robin Dispatching (CRRD)

• Phase 1

– IPs request

– LIs select

– IPs select

– Go back

(iteration)

• Phase 2

– Same as RD

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

19

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

CRRD cannot resolve contention in CMs.

Asynchronous Dispatching (AD)

• Difficulties

– Packets arrive asynchronously

– Modules are event-driven

– Orphans are generated by failed requests

• Solution

– Independent algorithms (allocators) run in IMs and CMs

– Failed requests use status feedback from CMs as acknowledge.

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

20

Asynchronous Dispatching (AD)

• IM alg.

– IPs request

– LIs select

– OK? Send data

– Fail? Go back

• CM alg.

– CMs grant LIs

– Update status

feedback

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

21

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

Asynchronous Dispatching (AD)

• IM alg.

– IPs request

– LIs select

– OK? Send data

– Fail? Go back

• CM alg.

– CMs grant LIs

– Update status

feedback

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

22

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

Asynchronous Dispatching (AD)

• IM alg.

– IPs request

– LIs select

– OK? Send data

– Fail? Go back

• CM alg.

– CMs grant LIs

– Update status

feedback

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

23

IM(0)

CM(0)

IM(1)

OM(0)

OM(1)

CM(1)

CM(2)

0

2

1

3

Comparison of Algorithms

• Random Dispatching (AD)

– Cannot resolve contention in IMs or CMs.

• Concurrent Round-Robin Dispatching (CRRD)– Handle contention in IMs using iterations.

– Cannot resolve contention in CMs.

• Asynchronous Dispatching (AD)– Contention in IMs is handled by asynchronous

arbiter naturally.

– Handle contention in CMs using status feedback.

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

24

Scheduler Architecture

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

25

OM_Sch

(0)

IM_Sch(i)

IM_Sch(7)

CM_Sch(r)

CM_Sch(3)

OM_Sch

(7)

OM_Sch

(j)

RIM

IM_Dis

InpG0

InpG1

InpG2

InpG3

OMr

IMr

OM

r

IMr

RCM

CM_Dis

CM_Sch(0)IM_Sch(0)

req(0,0)

req(0,1)

req(0,2)

req(0,3)

req(i,0)

req(i,1)

req(i,2)

req(i,3)

req(7,0)

req(7,1)

req(7,2)

req(7,3)

C(4,8,4)

CM Dispatcher/OM Scheduler

CMa: CM ack

CMs: CM status feedback

CMa and CMs are

mutually exclusive

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

26

Tre

e A

rbite

r

CMr( i, j)

CMr( 0, j)

CMr( 7, j)

CMa( i, j)

CMa( 0, j)

CMa( 7, j)

CMa( i', j)

i' = 0 ~ 7, i' ≠ i

CMa( i, j')

j' = 0 ~ 7, j' ≠ j

CMs( i, j)

C(4,8,4)

OM_Sch

(0)

IM_Sch(i)

IM_Sch(7)

CM_Sch(r)

CM_Sch(3)

OM_Sch

(7)

OM_Sch

(j)

RIM

IM_Dis

InpG0

InpG1

InpG2

InpG3

OMr

IMr

OM

r

IMr

RCM

CM_Dis

CM_Sch(0)IM_Sch(0)

req(0,0)

req(0,1)

req(0,2)

req(0,3)

req(i,0)

req(i,1)

req(i,2)

req(i,3)

req(7,0)

req(7,1)

req(7,2)

req(7,3)

IM Dispatcher

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

27

IM/CM

Match

ArbiterIM/CM

Match

Arbiter

LI Arbiter

LI Arbiter

IMcfg

IMr(0)

IMr(1)

IMr(2)

IMr(3)

ImCmA(7)

ImCmA(0)

CMs(3)

CMs(0)

CMr(3)

CMr(0)

IMa

8

4*44

8

8

CfgG

CMa(0)

CMa(3)8

C(4,8,4)OM_Sch

(0)

IM_Sch(i)

IM_Sch(7)

CM_Sch(r)

CM_Sch(3)

OM_Sch

(7)

OM_Sch

(j)

RIM

IM_Dis

InpG0

InpG1

InpG2

InpG3

OMr

IMr

OM

r

IMr

RCM

CM_Dis

CM_Sch(0)IM_Sch(0)

req(0,0)

req(0,1)

req(0,2)

req(0,3)

req(i,0)

req(i,1)

req(i,2)

req(i,3)

req(7,0)

req(7,1)

req(7,2)

req(7,3)

IM/CM Match Arbiter(Multi-resource Arbiter)

• S. Golubcovs, D. Shang, F. Xia, A. Mokhov, and A. Yakovlev, “Modular approach to

multi-resource arbiter design,” ASYNC 2009.

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

28

Tree Arbiter

Tre

e A

rbite

r

ri cb

ocb

o

ri

rbo

rbi

rbi

rbocbi

cb

o

riri cb

o

rbi

rbocbi

rbo

rbi

cbi

cbi

ci

ci ci

ci

IMr(0, j)

IMr(3, j)

IMrb(0, j)

IMrb(3, j)

CMs(0, j)

CMsb(0, j)

CMs(3, j)

CMsb(3, j)

h

h( j,3

,0)

h

h( j,0

,0)

hh( j,0,3)

hh( j,3,3)

C(4,8,4)

Request Generation

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

29

h( j, h, r)

IMr( h, j)

CMs( r, j)

CMrMx( r, j, h)

CMrMx( r, j, 0)

CMrMx( r, j, 3)

CMrME( r, j)

LI A

rbite

rCMrME( r, 0)

CMrME( r, 7)

CMr( r, 0)

CMr( r, j)

CMr( r, 7)

CMsb( r, j)

CMrMx(0, j, h)

CMrMx(3, j, h)

IMrb( h, j)

C(4,8,4)

Configuration Generation

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

30

CMa( r, j)

CMrMx( r, j, h)

CMr( r, j)

cfgMx( r, h, j)

cfgMx( r, h, 0)

cfgMx( r, h, 7)

IMcfg( r, h)

IMa(h)

IMcfg(0, h)

IMcfg(3, h)

C(4,8,4)

Area and Speed

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

31

Dispatching:setup 4.6 ns release 2.9 ns

OM Scheduler:setup 1.6 ns release 1.0 ns

Total: setup 6.2 ns release 3.9 ns

C(4,8,4)

MATLAB Simulation Model

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

32

Throughput Analysis

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

33

C(4,8,4) non-blocking loadRD 55%CRRD (i=4) 70%AD (i=3) 76%

C(4,8,m) non-blocking loadRD (m=7) 74%CRRD (m=7) 82%AD (m=7) 96%

Conclusions

• The first asynchronous routing algorithm for general 3-stage Clos networks.

• Better throughput performance than RD and CRRD

• Future works

–Optimize the Clos structure

–Reduce area and latency

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

34

Thank you!

Questions?

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

35

Iterations

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

36

Uniform Traffic

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

37

Results after Optimization

• Async

– Setup 3.7 ns

– Reset 3.2 ns

– Period ~ 7 ns (140M)

– 28K Gates

• Sync

– 350 MHz

– Cell time = Iter+1 =5 (70M)

– 22K Gates

• Optimization

– Replace multi-resource arbiter

– Eager request (InpG)

– MUTEX arbiter

– Optimized tree arbiter

June 23rd 2010Advanced Processor Technologies Group

The School of Computer Science

38

top related