Title: Feedback-Based Two-Stage Switch Architecture for High Speed Router Design
Author(s): Hu, Bing
Issue Date: 2010
URL: http://hdl.handle.net/10722/56798
Rights: unrestricted



FEEDBACK–BASED TWO-STAGE SWITCH

ARCHITECTURE FOR HIGH SPEED ROUTER

DESIGN

BY

HU BING

PH.D. THESIS

DECEMBER 2009


Abstract of thesis entitled

Feedback–Based Two-Stage Switch Architecture for

High Speed Router Design 

submitted by

Hu Bing 

for the degree of Doctor of Philosophy

at The University of Hong Kong

in December 2009

Due to the widespread use of WDM technology in optical fiber, transmission capacity has increased sharply, while the processing capacity of commercial routers has grown only slowly. This speed mismatch between fiber and router creates a pressing need for building next-generation high-speed routers. A major bottleneck in high-speed router design is the switch architecture, which determines how packets are moved from one linecard to another. In this thesis, we focus on designing efficient and scalable switch architectures to enable the next generation of high-speed routers.

A load-balanced two-stage switch configures its two switch fabrics according to a pre-determined and periodic sequence of switch configurations. It is attractive because no centralized scheduler is required and close to 100% throughput can be obtained. But it also faces two major challenges: packet mis-sequencing and poor delay performance. In this thesis, we propose a feedback-based two-stage switch


architecture to address these two challenges simultaneously. Notably, we require only a single-packet buffer for each middle-stage port VOQ, which greatly cuts down the average packet delay. At the same time, in-order packet delivery and high throughput are ensured by properly selecting and coordinating the two sequences of switch configurations. Compared with existing load-balanced switch architectures and scheduling algorithms, our feedback-based switch imposes a modest requirement on switch hardware, yet consistently yields the best delay-throughput performance.
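As background, the pre-determined periodic configurations of a classic load-balanced two-stage switch can be sketched as follows. This is an illustrative round-robin scheme for the standard architecture, not the specific joint sequences designed in this thesis:

```python
# Classic N x N load-balanced switch (illustrative sketch): at slot t, stage 1
# connects input i to middle port (i + t) mod N, and stage 2 connects middle
# port j to output (j + t) mod N. The sequence is fixed and periodic, so no
# centralized scheduler is needed.
def stage1_config(t, n):
    return {i: (i + t) % n for i in range(n)}

def stage2_config(t, n):
    return {j: (j + t) % n for j in range(n)}

n = 4
for t in range(n):
    # Each slot's configuration is a permutation: no two inputs share a middle port.
    assert sorted(stage1_config(t, n).values()) == list(range(n))

# Over one period of N slots, every input visits every middle port exactly once,
# which is what spreads (load-balances) the arriving traffic evenly.
assert {stage1_config(t, n)[0] for t in range(n)} == set(range(n))
```

Because both stages follow this fixed schedule, throughput close to 100% is achievable without any per-slot matching computation.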

To further enhance the performance of the feedback-based switch, original extensions and refinements are made. Specifically, a three-stage switch architecture is proposed to further cut down the average packet delay. A feedback suppression scheme is designed to reduce the communication overhead. A multicast scheduling algorithm is devised to carry multicast traffic over the same unicast switch fabric. A batch scheduler is designed for multi-cabinet implementation of the feedback-based switch. To address the fairness issue in handling inadmissible traffic patterns, a fair scheduler is designed to allocate the bandwidth of over-subscribed outputs based on the max-min fairness criterion. Last but not least, an optical implementation of the feedback-based two-stage switch is proposed.


Feedback-Based Two-Stage Switch Architecture for
High Speed Router Design

by

Hu Bing

B.Eng., M.Phil. (U.E.S.T.C.)

A thesis submitted in partial fulfillment of the requirements for
the Degree of Doctor of Philosophy
at The University of Hong Kong

December 2009


Declaration

I declare that this thesis represents my own work, except where due acknowledgement is made, and that it has not been previously included in a thesis, dissertation or report submitted to this University or to any other institution for a degree, diploma or other qualification.

Signed  _________________________________ 

Hu Bing


Acknowledgments

First, I would like to express my deep gratitude to my research supervisor, Dr. Kwan L. Yeung, for his guidance and encouragement throughout my graduate study. Dr. Yeung's unreserved support has covered every detail of my research work, from teaching me research methodologies to taking pains to polish papers. His instruction and infinite patience were essential to completing this thesis. I feel privileged to have had this opportunity to study under his supervision.

I thank the Department of Electrical and Electronic Engineering at The University of Hong Kong for creating such a great education and research environment, and all staff members in the department for their kind help and warm assistance. I also thank The University of Hong Kong for the financial support that enabled me to complete my Ph.D. study. My thanks also go to my lab-mates and friends, whose encouragement and help have been essential.

Along the way, I have been incredibly fortunate to have the endless support of my dear parents, both material and spiritual.


Table of Contents 

Declaration .......................................................................................................... i 

Acknowledgments ............................................................................................... ii 

Table of Contents ................................................................................................ iii 

List of Figures ..................................................................................................... viii 

List of Symbols ................................................................................................... xi 

List of Abbreviations .......................................................................................... xiv 

Chapter 1 Introduction

1.1 Overview of Routers .......................................................................... 1

1.2 Switch Architectures .......................................................................... 7

1.2.1 Output-queued Switches ........................................................ 8

1.2.2 Input-queued Switches ........................................................... 8

1.2.3 CIOQ and Buffered Crossbar Switches ................................. 10

1.2.4 Load-Balanced Two-Stage Switches ..................................... 12

1.3 Contributions ..................................................................................... 13

1.4 Thesis Overview ................................................................................ 16

Chapter 2 Feedback-Based Two-Stage Switch Design

2.1 Introduction ....................................................................... 18

2.2 Related Work ..................................................................... 22

2.2.1 Using Re-sequencing Buffers ................................................ 22

2.2.2 Preventing Packets from Becoming Mis-sequenced ............ 23

2.3 Feedback-Based Two-Stage Switch .................................................. 26

2.3.1 Some Observations and Motivations ..................................... 26

2.3.2 Designing Scalable Feedback Mechanism ............................. 28

2.3.3 Solving Packet Mis-sequencing Problem .............................. 31

2.3.4 Feedback-Based Scheduling Algorithms ............................... 34


2.4 Performance Evaluations ................................................................... 36

2.4.1 Performance under Uniform Traffic ...................................... 37

2.4.2 Performance under Uniform Bursty Traffic ........................... 38

2.4.3 Performance under Hotspot Traffic ....................................... 39

2.5 The Stability of Feedback-Based Two-Stage Switch ........................ 41

2.5.1 The Existing Approaches ......................................................... 41

2.5.2 Fluid Model for Feedback-Based Two-Stage Switch ............. 42

2.5.3 100% Throughput Proof .......................................................... 45

2.6 Chapter Summary .............................................................................. 49

Chapter 3 Cutting Down Average Packet Delay

3.1 Introduction ....................................................................................... 50

3.2 Optimal Joint Sequence Design .......................................................... 52

3.2.1 In-order Packet Delivery Only ............................................... 53

3.2.2 Both In-order Packet Delivery and Staggered Symmetry ...... 59

3.2.3 Finding the Number of Different Joint Sequences ................. 61

3.2.4 Discussions ............................................................................. 63

3.3 Three-Stage Switch ............................................................................ 64

3.3.1 Three-Stage Switch Architecture ........................................... 64

3.3.2 Traffic Matrix Estimation ....................................................... 69

3.3.3 Performance Evaluations ....................................................... 70

3.4 Chapter Summary .............................................................................. 73

Chapter 4 Cutting Down Communication Overhead

4.1 Introduction ....................................................................................... 74

4.2 Feedback Suppression Algorithms .................................................... 75

4.2.1 Set-based Feedback (Set-feedback) ....................................... 77

4.2.2 Queue-based Feedback Version 1 (Q-feedback-1) ................ 78

4.2.3 Queue-based Feedback Version 2 (Q-feedback-2) ................ 79

4.3 Performance Evaluations ................................................................... 80

4.3.1 Performance under Uniform Traffic ...................................... 81


4.3.2 Performance under Uniform Bursty Traffic ........................... 82

4.3.3 Performance under Hotspot Traffic ....................................... 82

4.3.4 Performance under Different Switch Size N  .......................... 83

4.4 Chapter Summary .............................................................................. 84

Chapter 5 Supporting Multicast Traffic

5.1 Introduction ....................................................................................... 85

5.2 Related Work ..................................................................................... 87

5.2.1 Multicast Switches Based on Bufferless Switch Fabrics ....... 87

5.2.2 Buffered Crossbar Based Multicast Switches ........................ 89

5.3 Multicast Scheduling in Feedback-Based Two-Stage Switch ........... 90

5.3.1 Multicast Scheduling .............................................................. 90

5.3.2 Discussions ............................................................................. 92

5.4 Performance Evaluations ................................................................... 93

5.4.1 Performance under Uniform Mixing Traffic ......................... 94

5.4.2 Performance under Uniform Bursty Mixing Traffic .............. 96

5.4.3 Performance under Binomial Mixing Traffic ........................ 97

5.5 Chapter Summary .............................................................................. 99

Chapter 6 Multi-cabinet Implementation

6.1 Introduction ....................................................................................... 100

6.2 Related Work ..................................................................................... 102

6.2.1 Multi-cabinet Implementation of Input-queued Switch ......... 102

6.2.2 Multi-cabinet Implementation of Buffered Crossbar Switch . 103

6.3 Multi-cabinet Implementation of Feedback-Based Switch ............... 103

6.3.1 Revamped Feedback Mechanism ........................................... 103

6.3.2 Batch Scheduler Design ......................................................... 106

6.3.3 Some Properties ..................................................................... 107

6.4 Performance Evaluations ................................................................... 109

6.4.1 Performance under Uniform Traffic ...................................... 110

6.4.2 Performance under Uniform Bursty Traffic ........................... 111


6.4.3 Performance under Hotspot Traffic ....................................... 112

6.5 Chapter Summary .............................................................................. 112

Chapter 7 Scheduling Inadmissible Traffic Patterns

7.1 Introduction ....................................................................................... 113

7.2 Related Work ..................................................................................... 115

7.2.1 Fair Scheduling under Admissible Traffic .............................. 115

7.2.2 Fair Scheduling with Over-Subscribed Output Ports Only .... 115

7.2.3 Fair Scheduling with Over-Subscribed Input and Output Ports 116

7.3 Our Approach .................................................................... 117

7.4 Max-min Fairness Criterion ............................................... 120

7.5 Performance Evaluations ................................................................... 122

7.5.1 Under Server-client Traffic Model ........................................ 122

7.5.2 Attack-traffic Scenario ........................................................... 124

7.6 Chapter Summary .............................................................................. 125

Chapter 8 An Optical Implementation of Feedback-Based Switch

8.1 Introduction ....................................................................................... 126

8.2 Related Work ..................................................................................... 127

8.3 Load Balanced Optical Switch (LBOS) ............................................ 128

8.3.1 Switch Architecture ................................................................. 128

8.3.2 Switch Operation ..................................................................... 130

8.3.3 Equivalence to Load Balanced Electronic Switches .............. 133

8.4 Extensions and Refinements of LBOS .............................. 134

8.4.1 Cutting down the Average Delay by Reconfiguration ........... 134

8.4.2 Supporting Multicast ............................................................... 136

8.4.3 Implementing Fair Scheduler Optically ................................. 137

8.5 Performance Evaluations ................................................................... 137

8.5.1 Performance under Uniform Traffic ........................................ 138

8.5.2 Performance under Uniform Bursty Traffic ............................ 139

8.5.3 Performance under Hotspot Traffic ....................................... 139


8.5.4 Performance for Linecard Placement ..................................... 141

8.6 Chapter Summary .............................................................................. 141

Chapter 9 Conclusion

9.1 Our Contributions ................................................................................ 142

9.2 Future Work ........................................................................................ 145

9.2.1 100% Throughput Proof without Speedup ............................. 145

9.2.2 Building a Large Feedback-Based Two-Stage Switch ....... 146

9.2.3 More Scalable Fairness Algorithm in LBOS ......................... 146

9.2.4 Scalable Iterative Algorithm for Input-queued Switch ............ 146

References ........................................................................................................... 147

Publications ......................................................................................................... 157


List of Figures

Fig. 1.1: A generic router ................................................................................ 2

Fig. 1.2: A router works in two different planes ............................................. 2

Fig. 1.3: The first generation routers architecture .......................................... 3

Fig. 1.4: The second generation routers architecture ...................................... 4

Fig. 1.5: The third generation routers architecture ......................................... 5

Fig. 1.6: The fourth generation routers architecture ....................................... 6

Fig. 1.7: An input-queued switch with Virtual Output Queues (VOQs) ........ 9

Fig. 1.8: A buffered crossbar switch ............................................................... 11

Fig. 2.1 A load-balanced two-stage switch architecture ................................ 19

Fig. 2.2 Some joint sequences for a 4 x 4 load-balanced switch ................... 21

Fig. 2.3 Feedback operation in joint sequences with staggered symmetry ... 30

Fig. 2.4 Delay vs input load p, with uniform traffic ........................................ 38

Fig. 2.5 Delay vs input load p, with uniform bursty traffic ........................... 39

Fig. 2.6 Delay vs input load p, with bursty traffic under different burst sizes 40

Fig. 2.7 Delay vs input load p, with hot-spot traffic ...................................... 40

Fig. 3.1 The feedback-based two-stage switch architecture ............ 51

Fig. 3.2 Some joint sequences for a 4 x 4 load-balanced switch ................... 52

Fig. 3.3 The relation between staggered symmetry and in-order delivery .... 53

Fig. 3.4 The generic joint configuration at time slot t  ................................... 56

Fig. 3.5 Generic joint sequence with anchor output and ordered properties . 57

Fig. 3.6 Joint sequence with staggered symmetry and in-order delivery ...... 60

Fig. 3.7 A three-stage switch architecture ...................................................... 65

Fig. 3.8 An example of using three-stage switch .......................................... 66

Fig. 3.9 Traffic matrix and delay matrix ......................................................... 66

Fig. 3.10 An example of identifying the minimum independent set ................ 67

Fig. 3.11 Third-stage configuration for traffic/delay matrix in Fig. 3.9(b) ..... 69


Fig. 3.12 Delay vs input load p, under hot-spot traffic with 3-stage switch ...... 71

Fig. 3.13 Delay vs number of sample intervals T , with 3-stage switch ............ 72

Fig. 4.1 Timing diagram of feedback switch with feedback suppression ...... 76

Fig. 4.2 Delay vs input load p, under uniform traffic with partial feedback .. 81

Fig. 4.3 Delay vs input load p, under bursty traffic with partial feedback ...... 82

Fig. 4.4 Delay vs input load p, under hot-spot traffic with partial feedback ... 83

Fig. 4.5 Throughput vs switch size N , with partial feedback ......................... 84

Fig. 5.1 Delay vs output load λ , with uniform mixing traffic ........................ 94

Fig. 5.2 Delay vs fan-out k , with uniform mixing traffic at λ =0.7 ................. 95

Fig. 5.3 Delay vs output load λ , with bursty mixing traffic ........................... 97

Fig. 5.4 Delay vs fan-out k , with bursty mixing traffic at λ =0.7 ................... 98

Fig. 5.5 Delay vs output load λ , with binomial mixing traffic ...................... 99

Fig. 6.1 The timing diagram of switch with large propagation delay .............. 101

Fig. 6.2 Multi-cabinet implementation of the feedback-based switch ........... 103

Fig. 6.3 Feedback operation in multi-cabinet implementation ...................... 104

Fig. 6.4 Delay vs input load p, under uniform traffic for multi-cabinet ......... 110

Fig. 6.5 Delay vs input load p, under bursty traffic for multi-cabinet ............. 111

Fig. 6.6 Delay vs input load p, under hot-spot traffic for multi-cabinet .......... 112

Fig. 7.1 A 4×4 feedback-based switch with output port 3 oversubscribed by inputs

0, 1, 2 and 3. ...................................................................................... 114

Fig. 7.2 Output 0’s throughput vs its output load  λ, under server-client traffic 123

Fig. 7.3 Output 0’s throughput vs its output load λ , under attack traffic ....... 124

Fig. 8.1 A 4×4 load balanced optical switch .................................. 129

Fig. 8.2 The internal structure of linecard i ................................... 129


Fig. 8.3 Time diagram for load balanced optical switch ............................... 131

Fig. 8.4 Timing diagram for pipelined packet sending and receiving ............ 132

Fig. 8.5 A joint sequence in load-balanced switch .......................................... 133

Fig. 8.6 Two possible linecard placement patterns using OXC ....................... 136

Fig. 8.7 Delay vs input load, under uniform traffic in LBOS ......................... 139

Fig. 8.8 Delay vs input load, under uniform bursty traffic in LBOS ............... 140

Fig. 8.9 Delay vs input load, under hot-spot traffic in LBOS ......................... 140


List of Symbols

N  Switch size

VOQ1(i,k)  The VOQ at input port i with packets destined for output k

VOQ2(j,k)  The VOQ at middle-stage port j with packets destined for output k

flow(i,k)  Packets arriving at input i and destined for output k

K  Anchor output port for an input port i

p  Input load for an input port

s  Burst size in uniform bursty traffic

S_j  The set of VOQ2(j,k) (for k = 0, 1, …, N-1) with 0-occupancy

d  The middle-stage port delay experienced in a feedback switch

{r_i,j}  The N×N matrix whose entry r_i,j denotes the number of requests from flow(i,j)

Z_ij(n)  The number of packets in VOQ1(i,j) at the beginning of time slot n

A_ij(n)  The cumulative number of arrivals for VOQ1(i,j) at the beginning of time slot n

D_ij(n)  The cumulative number of departures for VOQ1(i,j) at the beginning of time slot n

B_ij(n)  The number of packets in VOQ2(i,j) at the beginning of time slot n

X_ij(n)  The cumulative number of arrivals for VOQ2(i,j) at the beginning of time slot n

Y_ij(n)  The cumulative number of departures for VOQ2(i,j) at the beginning of time slot n

λ_ij  The mean packet arrival rate to VOQ1(i,j)

ω  A sample in a random event

A_ij(t,ω)  The cumulative number of arrivals to VOQ1(i,j) for a fixed ω at time t

Z_ij(t,ω)  The number of packets in VOQ1(i,j) for a fixed ω at time t

D_ij(t,ω)  The cumulative number of departures from VOQ1(i,j) for a fixed ω at time t

X_ij(t,ω)  The cumulative number of arrivals to VOQ2(i,j) for a fixed ω at time t

B_ij(t,ω)  The number of packets in VOQ2(i,j) for a fixed ω at time t


Y_ij(t,ω)  The cumulative number of departures from VOQ2(i,j) for a fixed ω at time t

C_ij(t)  The joint queue occupancy of all packets arrived at input port i plus all packets destined for output j

{r_n}  Any sequence {r_n} with r_n → ∞ as n → ∞

S  The speedup factor

f(t)  A non-negative, absolutely continuous function defined on R+ ∪ {0}

q  If VOQ2(j,k) is not empty, the packet in VOQ2(j,k) is transmitted to output port k with fixed delay q

M  The number of reduced Latin squares

{d_ij}  The delay matrix, where d_ij is the traffic-weighted-average middle-stage packet delay of all N flows destined to output port i-1

Q_i,j  The packet counter associated with each VOQ1(i,j)

T  The sampling interval

u  The number of non-overlapped sets per port

g  The number of VOQs per non-overlapped set

G_m  A non-overlapped set of VOQs

F  Denotes that VOQ1(i,F) is the longest queue at input i at time t

b  The number of bits sent in the second stage of Q-feedback-1

z  Denotes that VOQ2(j,z) is an empty VOQ at middle-stage port j

C  The number of sets G_m sent when cutting down the feedback bits

m  The number of multicast VOQs at each input port

E_y  The vector that reports the occupancy status from VOQ2(j, yN/m) to VOQ2(j, yN/m + N/m - 1)

T_c  The overall average delay experienced by all copies of all multicast packets

T_p  The average delay experienced by the last copy of all multicast packets

T_c(k)  The average delay for multicast packets with fan-out k

T_p(k)  The last-copy delay for multicast packets with fan-out k

λ  The switch output load

P_k  The probability of generating a fan-out set of size k in binomial mixing traffic

h  The mean fan-out size in binomial mixing traffic


List of Abbreviations

ACK Acknowledgement

AMFS Adaptive Max-min Fair Scheduling

AWGR Arrayed Waveguide Grating Router 

 bps bits per second

CIOQ Combined Input Output Queuing

CMS Concurrent Matching Switch

CP Cross Point

CPU Central Processing Unit

CR Contention and Reservation

DRRM Dual Round Robin Matching

EDF Earliest Departure First

EDFA Erbium-Doped Fiber Amplifier

FDL Fiber Delay Line

FIFO First In First Out

F-MWM Fair Maximum Weight Matching

FOFF Full Ordered Frames First

GPS-SW Generalized Processor Sharing in network Switch

HOL Head Of Line

i.i.d. Independent and Identically Distributed

ILP Integer Linear Programming

I-SMCB Input-based Shared Memory Crosspoint Buffer 

LBOS Load-Balanced Optical Switch

LQF Longest Queue First

MEMS Micro Electro Mechanical Systems

MSM Maximal Size Matching

MWM Maximum Weight Matching

MURS Multicast and Unicast Round robin Scheduling

O-E-O Optic-Electric-Optic

O-SMCB Output-based Shared Memory Crosspoint Buffer

OXC Optical Cross-Connect


PF Padded Frame

PIM Parallel Iterative Matching

RR Round Robin

RTT Round Trip Time

SRR Synchronous Round Robin

TCAM Ternary Content Addressable Memory

TDMA Time Division Multiplexing Access

TFQA Tracking Fair Quota Allocation

UFS Uniform Frame Spreading

VOD Video On Demand

VOQ Virtual Output Queue

WDM Wavelength Division Multiplexing

WF2Q+ Worst-case Fair Weighted Fair Queueing+

w.r.t. With Respect To


Chapter 1

Introduction

1.1 Overview of Routers

The Internet is a network of networks. The basic unit of data exchange on the Internet is an IP packet. Routers play a crucial role in the Internet by connecting different networks together and forwarding each IP packet to its correct destination. An N×N generic router is shown in Fig. 1.1. It consists of a routing processor, a switch fabric and N linecards. The routing processor executes the routing protocols, maintains the routing information and forwarding tables, and performs network management functions within the router. A linecard is a subsystem that receives


datagrams either on an external ingress link or on an internal egress from the switch fabric. Each linecard is (logically) divided into an input port (for processing ingress traffic) and an output port (for processing egress traffic). The switch fabric allows inputs to be connected to outputs for packet forwarding.

Fig. 1.1 A generic router 
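The components above can be captured in a minimal sketch (class and route names are illustrative, not from the thesis): the routing processor installs forwarding state, and the switch fabric connects an ingress linecard to the egress linecard chosen by a table lookup.

```python
class Router:
    """Toy model of the generic N x N router of Fig. 1.1 (illustrative names)."""

    def __init__(self, n_linecards):
        self.n = n_linecards
        self.forwarding_table = {}   # destination prefix -> egress linecard index

    def install_route(self, prefix, egress):
        # Routing processor: maintains routing information and forwarding tables.
        assert 0 <= egress < self.n
        self.forwarding_table[prefix] = egress

    def forward(self, ingress, dst_prefix):
        # Switch fabric: connects the input port to the looked-up output port.
        egress = self.forwarding_table[dst_prefix]
        return (ingress, egress)     # the (input, output) pair the fabric connects

router = Router(4)
router.install_route("10.0.0.0/8", 2)
assert router.forward(0, "10.0.0.0/8") == (0, 2)
```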

Fig. 1.2 A router works in two different planes

A router operates in two different planes [1,2]: control and forwarding (Fig. 1.2). The control plane constructs a routing table using the routing protocol; through it, the router learns which linecard is the most appropriate for forwarding specific


packets to specific destinations. The forwarding plane, the predominant plane in a router, is responsible for the actual process of switching a packet received from one linecard to another. Forwarding involves packet-by-packet processing and is generally more time-critical than operations in the control plane.
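The separation of the two planes can be sketched as follows (function names are hypothetical): the control plane runs routing-protocol logic off the per-packet critical path, while the forwarding plane performs only a table lookup per packet.

```python
# Hypothetical sketch of the two planes. The control plane is slow-path code
# that builds the table; the forwarding plane is the time-critical fast path.
forwarding_table = {}

def control_plane_update(route_advertisements):
    # Routing-protocol processing: runs occasionally, not per packet.
    for prefix, linecard in route_advertisements:
        forwarding_table[prefix] = linecard

def forwarding_plane_lookup(prefix):
    # Per-packet work: a single lookup, no protocol logic on this path.
    return forwarding_table.get(prefix)

control_plane_update([("net-a", 1), ("net-b", 3)])
assert forwarding_plane_lookup("net-b") == 3
assert forwarding_plane_lookup("unknown") is None
```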

Fig. 1.3 The first generation routers architecture

From Fig. 1.1, we can see that the switch fabric is at the very heart of a router. In fact, the evolution of routers has been accompanied by the evolution of switch fabrics. Historically, routers were realized with packet-switching software executing on a general-purpose CPU. These first generation routers appeared before the early 1990s, consisting of a CPU, a centralized memory and several linecards (Fig. 1.3). Linecards are connected to the CPU and centralized memory via a shared bus [4] (instead of a dedicated switch fabric). The CPU is responsible for all operations in the control and forwarding planes. When a packet arrives at an input linecard, it crosses the shared bus to the centralized memory. Once the output linecard has been identified by the CPU, the packet is read out from the memory and forwarded to the output linecard via the shared bus again. As each packet needs to traverse the shared bus twice, the bus bandwidth limits the router performance. Besides, the use of a single CPU also undermines the router performance. An example of the first generation routers is the Huawei Quidway AR18 series [3].

Fig. 1.4 The second-generation router architecture

In a second-generation router, shown in Fig. 1.4, a route cache, a satellite

processor and memory are placed on each linecard. The operations at the

forwarding plane are segregated from the central CPU and carried out by distributed

linecards. If routing information can be found in the local linecard route cache, a

 packet will traverse the shared bus once, by going to the destination linecard directly.

Otherwise, the packet will be sent to the centralized memory for processing by the

central CPU, as in the case of first-generation routers. A major limitation of the second

generation router is the shared bus, which can support at most one packet traversal at

a time. Cisco 7500 series routers [6] belong to the second generation of routers.
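The cache-hit/cache-miss forwarding decision just described can be sketched as follows. This is an illustrative simplification, not a real router's code path: the function names are ours, and the flat dictionary stands in for an actual route cache (which would use longest-prefix matching).

```python
# Sketch of the second-generation forwarding decision described above.
# A cache hit lets the linecard forward directly (one bus traversal);
# a miss sends the packet through the central CPU (two traversals).

def forward(packet_dst, route_cache, cpu_lookup):
    """Return (output_linecard, bus_traversals) for one packet."""
    if packet_dst in route_cache:
        # Cache hit: forward directly across the shared bus once.
        return route_cache[packet_dst], 1
    # Cache miss: packet crosses the bus to the CPU and back again.
    linecard = cpu_lookup(packet_dst)
    route_cache[packet_dst] = linecard   # populate cache for later packets
    return linecard, 2

cache = {}
lookup = lambda dst: 2                   # stand-in for the CPU's full table lookup
card1, hops1 = forward("10.0.0.1", cache, lookup)   # miss: two traversals
card2, hops2 = forward("10.0.0.1", cache, lookup)   # hit: one traversal
```

The cache thus converts the common case into a single bus traversal, which is exactly why the shared bus (one packet at a time) remains the limiting resource.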

To alleviate the bottleneck of using a single shared bus, the third-generation

router introduces an interconnection network as the switch fabric (Fig. 1.5). This

enables multiple packets to traverse the switch fabric in parallel and without

contention. This architecture improves the routers’ switching capacity from the

second generation’s 2 Gbps to about 1 Tbps. An implicit requirement for 

implementing the architecture in Fig. 1.5 is that all linecards and the switch fabric

must be housed in the same standard-sized switch cabinet. A typical cabinet [7]

measures 2.1 m × 0.6 m × 1.0 m and is supplied with no more than 14 kW of power.

Accordingly, each cabinet can house only up to 16 linecards. An example of this

generation of routers is the Cisco 12000 series routers [7].

Fig. 1.5 The third-generation router architecture

To accommodate Internet traffic in the range of 10 Tbps, a large number of 

linecards and a huge power budget are necessary. A report [73] shows that a

router consumes 0.01 kW of power per 1 Gbps of capacity, with each linecard

supporting 40 Gbps. Handling 10 Tbps of data would thus require 100 kW of power

and 250 linecards, which cannot be supported by the third-generation router architecture shown in Fig. 1.5.

The fourth generation routers remove the limitations of space and power by

distributing the linecards to different cabinets, as shown in Fig. 1.6. The burdens of 

space and power are parceled out. Optical fibers connect all cabinets to the central

electronic switch fabric. (Note that it is difficult to run copper wires at high-speed

due to insertion loss, near-end crosstalk, electromagnetic emissions, echo and

propagation skew [96-97]). As the centralized switch fabric works in the electrical

domain, packets arriving on fiber must be converted to electrical signals for switching,

and vice versa when they depart the switch fabric. This extra O-E-O conversion and

the need for a centralized scheduler (for configuring the switch fabric on a per-slot

basis) limit the fourth-generation router from reaching even higher speeds. Cisco CRS-

1 [8] is an example of a fourth-generation router. Notably, it can scale its

switching capacity to 90 Tbps, with 1152 linecards each running at 40

Gbps.

Fig. 1.6 The fourth-generation router architecture

Nowadays, commercial dense WDM systems [9] can support up to 160

parallel wavelengths in a single fiber, with a transmission rate of up to 80 Gbps on each

wavelength. Consequently, the fourth-generation routers can only process packets

coming from 4 fibers. Besides, due to the speed mismatch between the linecard

 processing rate (e.g. 40 Gbps in Cisco CRS-1) and the fiber, a linecard cannot be

directly connected to a dense WDM fiber. Therefore, there is a pressing need

for building high-speed routers that can fully exploit the capacity of a fiber.

1.2 Switch Architectures

In a router, the forwarding plane involves packet-by-packet processing, which

is generally more time-critical than the operations at the control plane [2]. In Fig. 1.2,

the forwarding plane comprises two major functions: table lookup for identifying the

correct output linecard of a packet, and switching for actual delivery of the packet.

IP table lookup algorithms can be classified into trie-based [74-79], range-based

[80-81], and hash-based algorithms [82-88]. These algorithms can be

implemented by software, hardware or both. Software schemes can benefit from low

cost and flexibility. Hardware solutions, e.g. TCAM (Ternary Content Addressable

Memory [89-95]), are more efficient as they can search contents in parallel and

complete the lookup in a single clock cycle. Nevertheless, as the table lookup process can be

distributed to each linecard, its high-speed implementation tends to be less critical

than switching. Indeed, table lookup at 100 Gbps per linecard is reported in

[88], while due to the limitations of O-E-O conversion and the centralized scheduler, a

switching rate of 40 Gbps per linecard seems to be the current limit.

In this thesis, we focus on designing efficient and scalable switch architecture

to enable the next generation high-speed routers. Based on the switch architecture,

the routers can be generally classified into output-queued, input-queued, and combined

input-output queued (CIOQ).

1.2.1 Output-queued Switches

In an output-queued switch, all packets can be switched to their respective

output linecards as soon as they arrive at the inputs. Accordingly, no input port

 buffer is required and the output-queued switch provides the optimal packet delay-

throughput performance. But the switch fabric must be powerful enough to deliver 

up to  N  packets to any output port, and the output buffer must be fast enough to

receive up to N packets in each time slot, where N is the switch size (i.e. the number of

linecards). In other words, the switch fabric and output ports must operate at N times

the individual link rate. This makes high-speed output-queued switches expensive

to build, and difficult to scale.

It should be noted that the complexity of a switch fabric can be measured by

the number of switch configurations it needs to realize. A switch configuration is an

internal switch fabric connection pattern for mapping the set of N input packets to N

outputs. For an output-queued switch (fabric), it needs to realize N^N configurations, as

up to N packets can go to the same output.

1.2.2 Input-queued Switches

For an input-queued switch, all packets are buffered at input ports and wait

for their turns to be served by the switch fabric. No switch fabric speedup is required

(i.e. the fabric only needs to run at the same speed as each input link), whereas each

input can send at most one packet and each output can receive at most one packet in

every time slot. Accordingly, the number of switch configurations to be realized by

an input-queued switch is N!, which is substantially smaller than the N^N required by

an output-queued switch. This makes input-queued switches more suitable for 

building high-speed routers with a large port count.
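The gap between the two configuration counts grows very quickly with the switch size. A quick numerical check (our own illustration, using the N^N and N! figures stated above):

```python
from math import factorial

# Configuration counts per the text: an output-queued fabric must realize
# N**N configurations (up to N packets may target one output), while an
# input-queued fabric realizes only the N! permutations (one packet per
# input and per output in each time slot).
for N in (4, 8, 16, 32):
    print(f"N={N:2d}: N! = {factorial(N):.3e}, N^N = {float(N**N):.3e}")
```

For N = 32, N! is roughly 2.6e35 while N^N is roughly 1.5e48, so the input-queued fabric's configuration set is smaller by about thirteen orders of magnitude.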

Fig. 1.7 An input-queued switch with Virtual Output Queues (VOQs)

On the other hand, input-queued switches suffer from the well-known

 problem of head-of-line (HOL) blocking. This limits the maximum throughput of an

input-queued switch to just 58.6% under uniform traffic [10]. To eliminate the HOL

 blocking, Virtual Output Queue (VOQ) is proposed [11], where each input port

maintains a separate queue for each output (Fig. 1.7). A centralized scheduler is

needed to maximize the throughput of a VOQ switch. The scheduling problem is

equivalent to the matching problem in a bipartite graph [98]. It is found that for any

admissible traffic patterns, 100% throughput can be achieved by MWM (Maximum

Weight Matching [12]). However, the MWM algorithm has a high time complexity of

O(N^3·log N). MSM (Maximal Size Matching) algorithms with lower computation

overheads, notably, PIM (Parallel Iterative Matching [13]), iSLIP [14,15] and

DRRM (Dual Round Robin Matching [16]), are then proposed. They are iterative

algorithms involving a non-negligible amount of communication overhead for state

information exchange, which scales up very quickly with the number of iterations to

be carried out, the link speed and the switch size.
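The request–grant–accept handshake behind these iterative schedulers can be sketched as follows. This is a minimal, single-iteration sketch in the spirit of iSLIP, not the published algorithm: the function names are ours, real iSLIP runs several iterations per time slot, and many refinements are omitted.

```python
def rr_pick(candidates, pointer, N):
    """Pick the candidate at or after `pointer` in round-robin order."""
    for step in range(N):
        c = (pointer + step) % N
        if c in candidates:
            return c
    return None

def one_iteration(voq_occupied, grant_ptr, accept_ptr):
    """voq_occupied[i][k] is True if input i holds a packet for output k.
    Returns a partial matching {input: output}. Pointers advance only when
    a grant is accepted (the iSLIP-style update rule)."""
    N = len(voq_occupied)
    # Request: each input requests every output with a non-empty VOQ.
    requests = {k: {i for i in range(N) if voq_occupied[i][k]} for k in range(N)}
    # Grant: each output grants the requesting input nearest its pointer.
    grants = {}
    for k in range(N):
        i = rr_pick(requests[k], grant_ptr[k], N)
        if i is not None:
            grants.setdefault(i, set()).add(k)
    # Accept: each input accepts the granting output nearest its pointer.
    match = {}
    for i, ks in grants.items():
        k = rr_pick(ks, accept_ptr[i], N)
        if k is not None:
            match[i] = k
            grant_ptr[k] = (i + 1) % N
            accept_ptr[i] = (k + 1) % N
    return match

voq = [[True, True], [True, True]]   # 2x2 switch, all VOQs non-empty
gp, ap = [0, 0], [0, 0]
match = one_iteration(voq, gp, ap)   # first iteration: input 0 matched to output 0
```

Even this toy version shows why the overhead matters: every iteration requires each input and output to exchange request/grant state, and the number of such exchanges grows with the iteration count and the switch size, as noted above.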

As an example, the ATLANTA architecture proposed in [105] is based on

input-queued switch architecture. Notably, its switch fabric is implemented by a

three-stage (memory/space/memory) Clos network, where packets are buffered at the

first and third stage while the second stage is constructed using crossbar switch

modules. To avert overflow in the fabric-embedded buffers of the first and third stages,

backpressure signals are sent from the first stage to the input ports, as well as from the

third stage to the second stage crossbars. Nevertheless, its performance is limited by

the required packet/slot-based switch re-configurations.

1.2.3 CIOQ and Buffered Crossbar Switches

In a CIOQ switch, packets are buffered at both input and output ports [17].

The switch fabric is the same as the input-queued switch fabric, where in each time

slot at most a single packet can leave/join an input/output port. A centralized

scheduler is responsible for selecting the most “critical” packet to deliver in each

time slot. A packet may arrive at an output port out-of-order. Therefore an output

 buffer/queue is required. It has been shown that [17] with a speedup of two (i.e. in

each time slot, up to two packets can leave/join an input/output port), CIOQ switch

can provide precise emulation of an output-queued switch. Like an input-queued

switch, the number of switch configurations to be realized by a CIOQ switch is also

N!. But the complexity of the centralized scheduler is by no means less than that of

an input-queued switch.

Notably, the buffered crossbar switch [18-20] is an elegant way of

implementing CIOQ switches by adopting a distributed approach to scheduling. In

addition to buffering packets at each input, buffered crossbar switch allows packets

to be buffered at each crosspoint of the switch fabric, as shown in Fig. 1.8. It has

 been shown that buffered crossbar can yield performance comparable to output-

queued switches. Although buffered crossbar is touted for its technology feasibility

and a simpler scheduler, it requires 2N schedulers (one for each input/output port), N^2

in-fabric crosspoint buffers, and the switch configuration must be determined on a

slot-by-slot basis. It should be noted that the total N^2 crosspoint buffers are very

difficult to build. A report [100] shows that a 512-bit memory word occupies

0.0278 mm^2 of silicon, even under state-of-the-art 0.18 um VLSI technology.

Assuming a switch size of N = 32 and a 1000-bit buffer at each crosspoint, holding all

crosspoint buffers would require 55.6 mm^2 of silicon, which dominates the cost in

terms of area and is prohibitive [47-48,100].
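The area figure can be reproduced with a short calculation; the per-word area is taken directly from [100] as quoted above, and the rest is straightforward arithmetic.

```python
# Reproducing the crosspoint-buffer area estimate quoted in the text:
# 0.0278 mm^2 per 512-bit memory word in 0.18 um technology.
N = 32                        # switch size
bits_per_crosspoint = 1000    # buffer per crosspoint, as assumed in the text
mm2_per_512bit_word = 0.0278

words = N * N * bits_per_crosspoint / 512   # total 512-bit words needed
area = words * mm2_per_512bit_word          # total silicon area in mm^2
print(round(area, 1))                       # prints 55.6
```

That is, 1024 crosspoints at 1000 bits each amount to 2000 words of 512 bits, hence 55.6 mm^2 for the buffers alone.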

Fig. 1.8 A buffered crossbar switch

For a buffered crossbar switch, since the input/output contention is relaxed,

the total number of switch configurations to be realized is N^N, the same complexity

as an output-queued switch fabric. Besides, the communication overhead of collecting

the queue size at each crosspoint buffer (for input/output arbitration) can be a potential

 performance bottleneck.

1.2.4 Load-Balanced Two-Stage Switches

Load-balanced two-stage switches (or load-balanced switches) have received

a great deal of attention recently [21-32] because they are more scalable and can

 provide close to 100% throughput. A load-balanced switch consists of two stages of 

switch fabrics, as shown in Fig. 2.1. Each switch fabric is configured according to a

pre-determined and periodic sequence of switch configurations. To this end, each

switch fabric only needs to realize N switch configurations (instead of N! for

input-queued and CIOQ switches, and N^N for output-queued and buffered crossbar

switches). This greatly facilitates high-speed implementation.

Besides, due to the pre-determined nature of the sequence of configurations,

a load-balanced switch removes the need for a centralized scheduler – another major

bottleneck in designing high-speed switches. As a load-balanced switch provides

multiple paths for packets belonging to the same flow to arrive at the same output

 port, packets may arrive out-of-order due to different middle-stage port delays

experienced en route. Many efforts [22-32] are then made to address this notorious

 packet mis-sequencing problem (to be reviewed in Chapter 2). It is not difficult to

see that higher switch throughput is usually at the cost of poorer delay performance.

This is because throughput is improved by better load balancing, but better load

 balancing tends to aggravate the packet mis-sequencing problem.

1.3 Contributions

In this dissertation, we dedicate our efforts to designing efficient and scalable

switch architecture for next generation high-speed routers. We have two key design

objectives:

  No need for a centralized scheduler, as a centralized scheduler is a major

obstacle to a scalable switch architecture; and

  Amenable to optics, which can avoid the extra O-E-O conversion in the

fourth generation routers when packets are switched from one linecard to

another.

We follow the load-balanced switch approach due to its scalability (no

centralized scheduler) and close to 100% throughput performance. But its notorious

 packet mis-sequencing problem must be properly addressed. Otherwise, the

complexity of the load-balanced switch as well as its delay performance would suffer.

To this end, an elegant solution called the feedback-based two-stage switch (or feedback-

based switch for short) is proposed in this thesis. Before diving into the details, our

major contributions made in this thesis are outlined below.

  Feedback-based Two-stage Switch Design: Unlike other load-balanced

switches, at each middle-stage port between the two switch fabrics of our 

feedback-based two-stage switch, only a single-packet-buffer for each VOQ

is required. Although packets belonging to the same flow pass through

different middle-stage VOQs, the delays they experience at different middle-

stage ports will be identical. This is made possible by properly selecting and

coordinating the two sequences of switch configurations to form a joint

sequence with both staggered symmetry property and in-order packet delivery

 property. Based on the staggered symmetry property, an efficient feedback 

mechanism is designed to allow the right middle-stage port  N -bit occupancy

vector to be delivered to the right input port at the right time. As compared

with the existing load-balanced switch architectures and scheduling

algorithms, our solution imposes a modest requirement on switch hardware,

but consistently yields the best delay-throughput performance.

  Cutting down the average packet delay of switch: As different flows

experience different middle-stage delays, we can cut down the average packet

delay by assigning heavy flows to experience smaller middle-stage delays. For a

given traffic matrix, we can find an optimal joint sequence that can minimize

the average middle-stage delay. But this involves tedious computation. A

three-stage switch architecture is thus proposed by adding another stage of 

switch fabric for dynamically mapping heavy flows to experience smaller

middle-stage port delay.

  Cutting down the communication overhead of feedback-based switch: In

a feedback-based switch, each middle-stage port needs to piggyback an N -bit

occupancy vector to its connected output in each time slot. To cut down this

communication overhead, the size of an occupancy vector can be reduced by

only reporting the status of selected middle-stage VOQs. To identify VOQs

of interest, we first partition the N VOQs into u non-overlapping sets, each

 being identified by a set number. In each time slot, every input port

 piggybacks its set numbers of interest to the connected middle-stage port.

This guides a middle-stage port to only report the status of the VOQs of 

interest.

  Supporting multicast: By slightly modifying the operation of the original

feedback-based two-stage switch, we show that feedback-based switch

supports multicast traffic efficiently. A notable feature of this multicast

extension is that the switch fabric remains unicast, whereas packet

duplication is distributed to both input and middle-stage ports.

  Multi-cabinet implementation: In a single-cabinet implementation, the

 propagation delay between linecards and switch fabric is negligible. In a

multi-cabinet implementation, due to the non-negligible propagation delay

 between linecards and switch fabric, the requirement that occupancy vectors

must arrive at output/input ports within a single time slot will significantly

lower the efficiency of the feedback-based switch. To this end, we revamp the

original feedback mechanism to support multi-cabinet implementation, and a

new batch scheduler is also designed.

  Fairness support for switching inadmissible traffic: As long as the traffic

is admissible, due to the close to 100% throughput in our feedback switch,

 packets can arrive at outputs with bounded delays, so fairness in throughput is

not an issue. Under inadmissible traffic (i.e. some output ports are over-

subscribed), the feedback switch may suffer from the ring-fairness problem,

i.e. “up-stream” input ports can starve some “down-stream” input ports. To

address this ring-fairness problem, an algorithm that can allocate the

bandwidth of over-subscribed outputs based on the max-min fairness criterion is

designed.

  Optical implementation of feedback-based switch: To ensure packets can

 be switched from one linecard to another all-optically, an optical feedback-

 based switch called Load-Balanced Optical Switch (LBOS) is proposed.

LBOS leverages an  N -wavelength WDM fiber ring to connect  N  linecards

together. The ring network is engineered such that the amount of time a

 packet should be buffered at a middle-stage port exactly matches the

 propagation delay that this packet would experience en route.

1.4 Thesis Overview

This thesis consists of nine chapters. In Chapter 2, we first review the

existing work for solving the packet mis-sequencing problem of load-balanced

switches. Then the framework of our proposed feedback-based two-stage switch is

introduced. The delay and throughput performance of the feedback-based switch is

compared with that of other existing algorithms by simulation. With a speedup of two, the

stability of the feedback-based switch is also proved.

In Chapter 3, we cut down the average packet delay of a feedback-based

switch by assigning heavy flows to experience smaller middle-stage port delays. In

Chapter 4, we focus on designing efficient feedback suppression schemes for cutting

down the communication overhead of sending middle-stage occupancy vectors. In

Chapter 5, we extend the feedback-based switch to support multicast traffic. In

Chapter 6, the feedback-based switch is refined to support multi-cabinet

implementation. In Chapter 7, a fair scheduling algorithm for inadmissible traffic is

 proposed. An optical implementation of the feedback-based switch, called LBOS, is

introduced in Chapter 8. Finally, Chapter 9 summarizes our contributions in this

thesis, and highlights some interesting future research directions.

Chapter 2

Feedback-Based Two-Stage Switch Design

2.1 Introduction

Due to its more scalable switch fabric, the input-queued switch architecture is

more suitable than the output-queued switch for high-speed router implementation.

However, an input-queued switch requires a centralized scheduler to determine its

switch configuration on a slot-by-slot basis. The requirement for a centralized

scheduler is thus the major bottleneck in further increasing the router’s capacity.

Load-balanced two-stage switches [21-32] remove the bottleneck of 

the centralized scheduler and can provide close to 100% throughput. A load-balanced

two-stage switch architecture consists of two stages of switch fabrics, as shown in

Fig. 2.1. Each fabric is configured according to a pre-determined and periodic

sequence of switch configurations, with the only requirement that each input

connects to each output exactly once in the sequence. The two fabrics can use

different sequences. There are many ways to generate such a sequence, e.g., a

sequence can be constructed by cyclically shifting the set of input/output connections

used in each time slot, such that at time slot t , input i (for  i = 0,1,2,…, N -1) is

connected to output j, where j is given by

j = (i + t) mod N.    (2.1)

In Fig. 2.2(a), the sequence of blue/dotted configurations represents the

configurations used by the first-stage switch fabric in Fig. 2.1, and it is generated

based on (2.1). Note that each switch port is abstracted as a circle in Fig. 2.2.
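Eq. (2.1) can be turned into a small generator for the whole periodic sequence. The following sketch (the function name is ours) also checks the defining requirement stated above: every configuration is a permutation, and over one period each input meets each output exactly once.

```python
def cyclic_configs(N):
    """Generate the periodic sequence of N switch configurations from
    Eq. (2.1): at slot t, input i connects to output (i + t) mod N."""
    return [[(i + t) % N for i in range(N)] for t in range(N)]

seq = cyclic_configs(4)
# Each configuration is a valid permutation of the outputs...
for config in seq:
    assert sorted(config) == list(range(4))
# ...and over one period, every input meets every output exactly once,
# which is the only requirement a load-balanced sequence must satisfy.
for i in range(4):
    assert {config[i] for config in seq} == set(range(4))
```

Any other sequence with the same "each input meets each output exactly once" property would also do; the cyclic shift is simply the easiest one to generate.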

Fig. 2.1 A load-balanced two-stage switch architecture.

For the generic load-balanced switch architecture shown in Fig. 2.1, we use

VOQ1(i,k ) to represent the VOQ at input port i with packets destined for output k ,

and VOQ2( j,k ) to denote the VOQ at middle-stage port  j with packets destined for 

output k . We define flow(i,k ) as packets arriving at input i and destined for output k .

Packets from flow(i,k ) are buffered at VOQ1(i,k ). Packets (from different inputs)

destined for output k  are buffered at VOQ2( j,k ) for  j = 0, 1, …,  N -1. Aiming at

converting the incoming non-uniform traffic to uniform, the first stage switch fabric

spreads packets evenly over all middle-stage ports. Then the second stage switch

fabric delivers the packets from middle-stage ports to their respective outputs. From

the above, we can see that in each time slot, there are two switch configurations, one

at each fabric. We call them a  joint  configuration. The sequence of   N  joint

configurations forms a  joint sequence. Three possible joint sequences are shown in

Fig. 2.2. It is important to point out that all three joint sequences in Fig. 2.2 meet the

 basic requirement of a load-balanced two-stage switch, but they have different

 properties, namely, in-order packet delivery and staggered symmetry. These two

properties, which form the basis of our feedback-based two-stage switch design,

will be discussed in detail in Section 2.3. In Chapter 3, the problem of optimal joint

sequence design will be investigated.

Due to the two-stage nature, flow(i,k ) packets may arrive at output k  via

different middle-stage VOQ2( j,k )’s (for  j = 0, 1, …,  N -1) and thus may experience

different amounts of middle-stage port delay. This leads to the problem of packet

mis-sequencing. Many efforts [21-32] are then made to address this notorious packet

mis-sequencing problem (reviewed in Section 2.2). It is not difficult to see that

higher switch throughput is usually at the cost of poorer delay performance. This is

because throughput is improved by better load balancing, but better load

balancing tends to aggravate the packet mis-sequencing problem.

Fig. 2.2 Some joint sequences for a 4 × 4 load-balanced switch.

In this chapter, we show that the efforts made in load balancing and keeping

 packets in-order can complement each other in improving both delay and throughput

 performance of the switch. We adopt a simple load-balanced switch architecture

where each middle-stage port between the two stages of switch fabrics only has a

single-packet-buffer for each VOQ. Although packets belonging to the same flow

will pass through different middle-stage VOQs, the delays they experience at

different middle-stage ports will be identical. This is made possible by properly

selecting and coordinating the two sequences of switch configurations (used by the

two stages of switch fabrics) to form a joint sequence with both staggered symmetry

 property and in-order packet delivery property. Based on the staggered symmetry

 property, an efficient feedback mechanism is designed to allow the right middle-stage

 port occupancy vector to be delivered to the right input port at the right time.

Accordingly, the performance of load balancing as well as switch throughput is

significantly improved.
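The role of the N-bit occupancy vector can be illustrated with a simplified sketch. This is our own illustration of the general idea, not the thesis's exact protocol: bit k of the vector marks whether the single-packet buffer VOQ2(j, k) at a middle-stage port j is occupied, and an input uses the fed-back vector to send only to free buffers.

```python
def encode_occupancy(voq2_j):
    """Pack the occupancy of a middle-stage port's N single-packet
    buffers into an integer bit vector (bit k set = VOQ2(j, k) full)."""
    return sum(1 << k for k, pkt in enumerate(voq2_j) if pkt is not None)

def pick_sendable(voq1_i, occupancy, N):
    """At the input, pick a destination k with a waiting packet whose
    middle-stage buffer is free (bit k clear). The linear scan order is
    illustrative; any work-conserving scheduler could be used instead."""
    for k in range(N):
        if voq1_i[k] and not (occupancy >> k) & 1:
            return k
    return None

voq2_j = ["pkt", None, "pkt", None]   # middle-stage buffers for outputs 0..3
vec = encode_occupancy(voq2_j)        # 0b0101: outputs 0 and 2 occupied
voq1_i = [["p"], ["p"], [], ["p"]]    # input VOQs for outputs 0..3
k = pick_sendable(voq1_i, vec, 4)     # output 0 blocked, so output 1 is chosen
```

The point of the staggered symmetry property is precisely to guarantee that each input receives the vector of the middle-stage port it is about to connect to, so decisions like `pick_sendable` are always made on fresh information.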

The rest of this chapter is organized as follows. In the next section, we review

the existing work for solving the packet mis-sequencing problem of load-balanced

switches. In Section 2.3, our proposed feedback switch framework is introduced. The

delay and throughput performance of our proposed solutions is compared with that of

other existing algorithms in Section 2.4 by simulation. In Section 2.5, we prove that for

any arbitrary work-conserving input port scheduler, the feedback-based switch can

achieve 100% throughput under a speedup of two. Finally, we conclude this chapter 

in Section 2.6.

2.2 Related Work

Two main approaches can be followed to solve the mis-sequencing problem

of load-balanced switches: using re-sequencing buffers at outputs, or preventing

 packets from becoming mis-sequenced in the first place.

2.2.1 Using Re-sequencing Buffers

When out-of-order packets arrive at an output port, they are temporarily

stored in the re-sequencing buffer (not shown in Fig. 2.1), waiting to be read out and

written onto the output link in the correct order. To this end, each packet header 

should have a sequence number field (or timestamp), which is added to the packet

upon its arrival at an input port. With the original two-stage switch architecture [21],

 packets can be mis-sequenced by an arbitrary amount, thus a finite re-sequencing

 buffer is not possible. Efforts are made to bound the delay at additional costs, such as

 N writes to memory in one time slot [22], and a 3-D re-sequencing buffer [23].

In [24], a three-stage load-balanced switch is presented where each of the

three stage switch fabrics is configured by predetermined and periodic configurations.

The buffers ahead of each stage of switch fabric are separately called first-stage

buffer, second-stage buffer and third-stage buffer (i.e. the re-sequencing buffer).

Every arriving packet first reserves a position in the third-stage buffer. Upon

successful reservation, the packet is forwarded into the first-stage buffer by a flow

splitter according to its assigned position number of the third-stage buffer. Packets

are transmitted through the first two switches in a FIFO manner and are inserted into

their reserved positions in the third-stage buffer. Although the switch is proved to be

stable, this design requires additional hardware as well as global information

exchange for buffer reservation. The high implementation complexity may defeat the

original purpose of using a load-balanced switch.

2.2.2 Preventing Packets from Becoming Mis-sequenced

Instead of re-ordering packets at each output, we can prevent packets from

 becoming mis-sequenced in the first place [25-32]. This not only removes the re-

sequencing buffers, but also the corresponding re-sequencing delay. The majority

of work along this direction [26-29] adopts the notion of a “frame”. For an N × N switch, a

frame consists of  N packets belonging to the same flow. At each input port, incoming

packets join their respective VOQs. If the size of a VOQ is at least N packets, the

flow is said to have a full frame of packets. With the UFS (Uniform Frame Spreading)

algorithm [26], an input port is allowed to send only from flows/VOQs with at least a

frame of packets. Once a frame transmission starts, N packets from the selected flow

will be sent in the next N slots, where each packet arrives at a distinct middle-stage

port from 0 to N-1. A frame transmission starts when an input port is connected to a

particular middle-stage port, say port 0. Each input has a distinct frame starting time

because the inputs connect to middle-stage port 0 at different slots. Based on the above

frame notion, upon joining the VOQ at each middle port, each packet in the frame

will see the same middle-stage VOQ size. If the transmission at the second stage

switch fabric is coordinated such that an output is connected to middle-stage ports in

the same (cyclic) order as an input is connected to middle-stage ports, in-order 

 packet delivery is guaranteed.
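The UFS rule described above can be sketched as follows (a minimal Python sketch for one input port; the class name, packet representation and the per-slot `select` interface are illustrative assumptions, not taken from [26]):

```python
class UFSInput:
    """Sketch of the UFS rule at one input port of an N x N switch."""

    def __init__(self, N):
        self.N = N
        self.voq = [[] for _ in range(N)]  # one FIFO VOQ per output
        self.active = None                 # output whose frame is being spread
        self.sent = 0                      # packets of the frame already sent

    def enqueue(self, dest, pkt):
        self.voq[dest].append(pkt)

    def select(self, middle_port):
        # A frame may only start when this input is connected to
        # middle-stage port 0, and only a flow holding a full frame
        # (>= N packets) may start sending.
        if self.active is None:
            if middle_port != 0:
                return None
            full = [k for k in range(self.N) if len(self.voq[k]) >= self.N]
            if not full:
                return None
            self.active = full[0]
            self.sent = 0
        # Spread one packet per slot; since the input's middle-stage
        # connection advances cyclically, the N packets reach the N
        # distinct middle-stage ports.
        pkt = self.voq[self.active].pop(0)
        self.sent += 1
        if self.sent == self.N:
            self.active = None             # frame finished
        return pkt
```

For example, with N = 4 and a full frame queued for output 2, the four packets go out in the four slots after the input reaches middle-stage port 0, and a partial frame for output 1 stays queued.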

A downside of the UFS algorithm is that when traffic load is light, it takes

time to form a full frame of packets, thus the delay performance suffers. To cut down

the delay, FOFF (Full Ordered Frames First) [27] is proposed. Instead of waiting for 

full frames of packets, FOFF allows mis-sequencing due to sending partial frames.

But the amount of mis-sequencing at middle-stage ports is bounded. As a result, the

amount of re-sequencing buffer at each output is also bounded.

PF (Padded Frame) algorithm [28] also improves the delay performance of 

UFS, but without the re-sequencing buffer of FOFF. The idea is that when no full

frames are available for sending, a partial frame can be sent as a “faked” full frame

 by padding the partial frame with dummy packets. CR (Contention and Reservation)

algorithm [29] can further improve the performance of PF by supporting two modes

of frame transmission: contention and reservation. As long as an input i has a full

frame of packets when i connects to middle port 0, i enters the reservation mode and

the transmission in the next N slots is governed by UFS. Otherwise, input i enters the

contention mode, where the packet sent in each slot is selected using a round robin

scheduler, and must be acknowledged at the end of each time slot. A packet is

removed from the input VOQ only if a positive ACK (ACKnowledgement) is

received.

CR algorithm requires dedicated feedback/acknowledgement from each

middle-stage port in each time slot. The feedback path construction is not discussed

in [29]. Similarly, the Mailbox switch [30] also requires a feedback path. It is

smartly constructed by adopting the joint sequence of switch configurations in Fig.

2.2(c), where input i and output i are always connected to the same middle-stage port.

In each time slot, when a packet arrives at a middle-stage port (from say input i), the

middle-stage port calculates its departure time (i.e. when it will be sent to its

destination output) based on its location in the VOQ. Then the departure time is sent

to the connected output port i using the second switch fabric. As input i and output i 

reside on the same switch linecard, output i can relay the departure time of the

 packet to input i at negligible cost. A feedback path for reporting middle-stage packet

departure time is thus created. Based on the received packet departure time, the next

 packet in the flow will be dispatched and inserted in a middle-stage VOQ if it will

depart no earlier than the previous packet of the same flow. Although the packet

order is maintained by Mailbox switch without relying on the frame notion, the

overall switch throughput is limited.

In [31], a distributed and iterative scheduling algorithm CMS (Concurrent

Matching Switch) is introduced. Although a fixed uniform mesh is used in both

stages of switch fabrics, its logical configurations are the same as the joint sequence

in Fig. 2.2(c). For every arriving packet, the input port sends a request to the current

(logically) connected middle-stage port. Each middle port records the received

requests in its own  N × N  matrix {r i, j}, where r i, j denotes the request number from

flow(i, j). Every N time slots each middle-stage port concurrently and independently

finds a matching based on its own {r i, j}. (Note that CMS can achieve stability using

randomized scheduling with amortized constant time and hardware complexity per 

 port, independent of  N .) In the following  N  time slots, the packets matched are

transmitted to the middle-stage ports. As soon as they arrive, middle-stage ports

forward them to the connected output ports. Since the packets selected in each slot

traverse the two switches in parallel and without conflicts, there is no out of order 

 problem. However, the packet delay performance can be quite large, where the best-

case is 3 N time slots when a parallel optical mesh is used. Having said that, the delay

 performance of Chang's original architecture [21] is on the order of O( N ) if it is

implemented using an R/ N optics abstraction.

2.3 Feedback-based Two-stage Switch

2.3.1 Some Observations and Motivations

The delay and throughput performance of a load-balanced switch hinges on

how well the load-balancing and in-order packet delivery are implemented.

Obviously, if the incoming traffic is well-balanced by the first stage switch, the

throughput performance will be improved as the second stage switch can maximize

the number of packets sent in each time slot. Consequently, the packet delay will also

 be reduced due to higher throughput.

But how to measure the load-balancing performance? Many scheduling

algorithms (e.g. in [23, 25]) try to ensure all middle-stage VOQs have the same

queue size. But as far as the throughput performance is concerned, we only need to

ensure each middle-stage VOQ2( j,k ) (in Fig. 2.1) does not suffer from either buffer 

underflow or overflow problem. A buffer underflow occurs if there are packets

waiting in some input ports for a particular output k , but VOQ2( j,k ) is empty at the

time that middle-stage port  j is connected to output k , yielding an idle transmission

slot on the second stage switch. On the other hand, buffer overflow is equally

undesirable as the overflowed packet is dropped, and the transmission slot in the first

stage switch is wasted. Indeed, as long as no buffer underflow and overflow at each

VOQ2( j,k ) is ensured, the actual buffer size for each VOQ2( j,k ) has no impact on the

throughput performance of the switch. Therefore, it may not be appropriate to

increase the buffer size of VOQ2( j,k ) for boosting throughput performance.

In a load-balanced switch, the head of line packet in each middle-stage VOQ

will experience an average delay of  N /2 slots (due to the deterministic nature of the N  

configurations), and each additional packet in the line will experience an additional

delay of  N slots. To minimize delay, a small buffer size at each VOQ2( j,k ) is preferred.

In general, mechanisms for ensuring in-order packet delivery tend to penalize

the packet delay performance more than throughput. If re-sequencing buffers are

used for solving the mis-sequencing problem, packets suffer from the additional re-

sequencing delay. Since packet mis-sequencing is due to packets of the same flow

experiencing different delays at different middle-stage ports, a smaller buffer size at

each VOQ2( j,k ) is favored because middle-stage packet delay can be reduced and

thus the mis-sequencing problem can be eased. Consequently, a smaller re-

sequencing buffer/delay is also possible. In fact, buffering a packet at an input port

(instead of a middle-stage port) gives more flexibility in sending because an input

can retry in the subsequent slots at different middle-stage ports (which may even

have a shorter queue size).

If the frame notion is used for ensuring in-order packet delivery, the time

required for forming a frame dominates the delay performance especially when the

load is light. Besides, frame-based transmission tends to make the traffic to

downstream switches more bursty, resulting in poor delay jitter performance.

Although PF [28] and CR [29] improve the delay performance of UFS [26], the use

of fake frames/packets undermines the load-balancing performance. In this chapter,

we are interested in designing a scheduling algorithm without using re-sequencing

 buffers for in-order packet delivery, and without incurring the frame-based

scheduling overheads.

From our observations above, we can see that a smaller buffer size at each

VOQ2( j,k ) is preferred if we can ensure (a) no underflow and overflow at each

VOQ2( j,k ), and (b) no packet mis-sequencing. The smallest buffer size at each

VOQ2( j,k ) is 1. In the rest of this chapter, we shall focus on using a single-packet-

 buffer at each VOQ2( j,k ).

2.3.2 Designing Scalable Feedback Mechanism

 Now the issue is how to ensure each single-packet-buffered VOQ2( j,k ) is free

of either buffer overflow or underflow. If an input port knows the occupancy of its

connected VOQ2( j,k ) before sending a packet to it, the buffer overflow problem can

 be easily solved. Then, do we have an efficient feedback mechanism for reporting the

occupancy of VOQ2( j,k ) to input ports?

We propose a simple yet novel feedback mechanism based on a joint

sequence with  staggered symmetry property. A joint sequence of switch

configurations has the staggered symmetry property if, whenever middle-stage port j is

connected to output port k at time slot t, input port k is connected to the same

middle-stage port j at the next slot (t+1). In essence, for each given sequence in the

first stage switch, the second stage sequence (and thus the joint sequence) can be

obtained directly from the property itself. In Fig. 2.2(a), the first stage sequence is

constructed from (2.1) by cyclically shifting the set of connections used in each slot. Each

configuration in the second stage is obtained from the staggered symmetry property.

We can see that for every pair of  staggered configurations, e.g. the second switch

configuration at t =0 and the first switch configuration at t =1, they are mirror images

of each other.
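The construction can be checked numerically. A minimal sketch, assuming the first-stage sequence takes the cyclic-shift form j = (i + t) mod N (the form of (2.1) consistent with the derivation of (2.3)):

```python
def first_stage(i, t, N):
    # Assumed cyclic-shift form of (2.1): input i -> middle-stage port j at slot t.
    return (i + t) % N

def second_stage(j, t, N):
    # Eq. (2.2): middle-stage port j -> output k at slot t.
    return (j + N - 1 - t) % N

def has_staggered_symmetry(N):
    # Middle port j serves output k at slot t  =>  input k reaches j at slot t+1.
    return all(
        first_stage(second_stage(j, t, N), t + 1, N) == j
        for j in range(N) for t in range(N)
    )
```

The check holds for every N, since first_stage(k, t+1) = (j + N − 1 − t + t + 1) mod N = j.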

As each VOQ2( j,k ) only has a single packet buffer, a single bit is sufficient to

denote its occupancy. For the N VOQ2( j,k )’s at middle-stage port j (for k =0, …, N -1),

their joint occupancy can be denoted by an  N -bit occupancy vector. Since each pair 

of input k and output k reside on the same linecard, the occupancy vector at middle-

stage port  j can be  piggybacked on the data packet sent to output k , which is then

made available to input k at negligible cost. Due to the staggered symmetry property

of the joint sequence used, input k will be connected to middle port j in the next time

slot. This gives a very efficient feedback path, allowing the occupancy vector from

the right middle-stage port to be delivered to the right input at the right time. In the

next time slot, each input port scheduler will select a packet for sending based on the

received occupancy vector. If the packet is properly selected, both buffer overflow

and underflow at a middle-stage VOQ2( j,k ) can be avoided. (In Section 2.3.4, three

simple input port schedulers are designed.)

Fig. 2.3 Feedback operation in joint sequences with staggered symmetry.

The timing diagram in Fig. 2.3 summarizes the feedback operation, while

assuming each switch reconfiguration involves certain overhead. We can see that

switch reconfiguration takes place in parallel with relaying the occupancy vector 

from output k  to input k  and the execution of the scheduling algorithm. The

occupancy vector is created by taking both packet arrival/departure in the current slot

into account. In creating the vector, the occupancy bit of VOQ2( j,k ’) is always set to

0 if middle port j will connect to output k ’ in the next slot. This is because the packet

(if any) in VOQ2( j,k ’) is guaranteed to be sent in the next time slot. Besides, when a

 buffered packet in VOQ2( j,k ’) is being sent, VOQ2( j,k ’) can receive another packet

simultaneously. Due to parallel packet transmission in the two switch stages, a packet

cannot be delivered from an input to an output in a single time slot, i.e. the minimum

delay a packet experiences at a middle-stage port is one slot.
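The vector-creation rule can be sketched as follows (an illustrative sketch; `B_j` is the bitmap of the N single-packet buffers at middle-stage port j, and the next-slot output follows from (2.2)):

```python
def occupancy_vector(B_j, j, t, N):
    """Vector middle-stage port j piggybacks at slot t.

    B_j[k] = 1 iff VOQ2(j, k) currently holds a packet.  The bit of the
    output that j will serve in the next slot is cleared, because that
    packet (if any) is guaranteed to depart before a newly sent packet
    could overwrite it.
    """
    vec = list(B_j)
    k_next = (j + N - 1 - (t + 1)) % N   # output served at slot t+1, Eq. (2.2)
    vec[k_next] = 0
    return vec
```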

From Fig. 2.3, we can also see that the feedback operation requires accurate

timing synchronization within a time slot. We notice that accurate synchronization of 

less than 10 ns is reported in [106], and a scheme to achieve 1 ns synchronization is

 proposed in [107]. Therefore, synchronization within a time slot of, say 40 ns, would

not be a major issue.

 Note that the joint sequence in Fig. 2.2(c) does not have the staggered

symmetry property. If it is used for implementing the feedback path (as in [30,32]),

the occupancy vector cannot be piggybacked onto the data packet. Instead, a dedicated 

feedback packet must be sent from each middle-stage port to its connected output in

each time slot. This incurs not only extra propagation delay for sending the feedback 

 packet, but also extra packetization and synchronization overhead. As a result, the

duration of a time slot in [30,32] would be much longer than that shown in Fig. 2.3.

If the switch performance is studied using the number of time slots, the inefficiencies

of using a “larger” time slot could be easily overlooked.

2.3.3 Solving Packet Mis-sequencing Problem

If the load-balanced switch in Fig. 2.1 is configured by the joint sequence in

Fig. 2.2(a), will we face the packet mis-sequencing problem? We know that packet

order will be preserved if  every packet of a flow experiences the same amount of 

delay when passing through any middle-stage port. This is obviously true if middle-

stage ports are bufferless, so that every packet experiences the same 0-slot delay.

Will it be still true for the case of single-packet-buffer-per-VOQ2( j,k )?

Surprisingly, a closer examination of the joint sequence in Fig. 2.2(a) reveals

that packets of the same flow do experience the same middle-stage port delay. Take

flow(0,1) in Fig. 2.2(a) as an example. If a packet is sent (from input 0) to middle-

stage port 0 at t =0, it will be buffered at VOQ2(0,1) for 2 slots until VOQ2(0,1) is

connected to output 1 at t =2. If the next packet of the flow is sent to middle-stage

 port 1 at t =1, it will be buffered at VOQ2(1,1) for, again, 2 slots until VOQ2(1,1) is

connected to output 1 at t =3.

In the following, we prove that this is true for each and every flow, and for 

any switch size N . Consider the joint sequence in Fig. 2.2(a). The sequence used by

the first stage switch is constructed from (2.1). The sequence used by the second

stage switch is constructed according to the staggered symmetry property, which can

 be represented by (2.2). That is at time t (for 0≤t < N ), middle-stage port j is connected

to output k , where k is given by

k = ( j + N  – 1 – t ) mod N  (2.2)

Statement 1: (Anchor Output). In Fig. 2.2(a), input i is always connected to

output K , where K = [(i+ N  –1) mod N ], via one of the middle-stage ports.

 Proof: At time t , input i is connected to output k  via middle-stage port  j.

Substitute j from (2.1) into (2.2), we can express k in terms of i.

k = [(i + t) mod N + N − 1 − t] mod N = (i + N − 1) mod N = K    (2.3)

We can see that K depends only on i. Thus for a given input i, it is always connected

to the same anchor output K . #
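Statement 1 can also be verified by brute force (a sketch, again assuming the cyclic-shift form j = (i + t) mod N for (2.1)):

```python
# For several switch sizes, every input i reaches the same anchor output
# K = (i + N - 1) mod N at every slot t, via middle port j.
for N in (4, 8, 16):
    for i in range(N):
        K = (i + N - 1) % N              # claimed anchor output, Eq. (2.3)
        for t in range(N):
            j = (i + t) % N              # Eq. (2.1), assumed form
            k = (j + N - 1 - t) % N      # Eq. (2.2)
            assert k == K
```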

Statement 2: (Deterministic Delay at Middle-stage Ports). Let  K  be the

anchor output of input i. For every packet of flow(i,k ), it experiences the same d slots

delay in one of the middle-stage ports, where d is given by

d = N, if K = k;  d = K − k, if K > k;  d = K + N − k, if K < k.    (2.4)

 Proof: Suppose at slot t , input i is connected to its anchor output  K  via

middle-stage port j and a packet is sent to join VOQ2( j,k ). From (2.2), middle port j is

connected to each output in descending order of the output port number. Then if  K ≠k ,

this packet will experience exactly ( K -k ) module  N slots delay in VOQ2( j,k ) due to

the single packet buffer at VOQ2( j,k ). If  K =k , this packet can only be sent when

middle port j connects to output port K again, so its middle stage delay is N time slots.

In short, this packet will experience exactly d slots delay calculated by (2.4), and d is

 bounded by [1, N ]. #
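Eq. (2.4) collapses to a single modular expression; the 2-slot delay of flow(0,1) in the earlier 4×4 example (anchor output K = 3 for input 0) serves as a sanity check:

```python
def middle_stage_delay(K, k, N):
    """Slots a packet of flow(i, k) waits in VOQ2, per Eq. (2.4), where
    K = (i + N - 1) % N is the anchor output of input i."""
    if K == k:
        return N          # must wait a full cycle for output K to recur
    return (K - k) % N    # covers both the K > k and K < k branches
```

For N = 4, K = 3 and k = 1 this gives 2 slots, matching the example of flow(0,1); for K = k it gives the full N slots, and the delay always lies in [1, N].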

Statement 3 (In-order Packet Delivery). In-order packet delivery is

guaranteed if the joint sequence of configurations is constructed using (2.1) and (2.2).

 Proof: Assume packets A and B of flow(i,k ) join VOQ2( j1,k ) and VOQ2( j2,k )

at time t A and t B (where t B>t A), respectively. Let d A and d B be their respective delays

experienced in VOQ2. Mis-sequencing occurs only if packet B reaches output k  

earlier than packet A, i.e. t A+d A>t B+d B. However, this will never happen because

t B>t A and d A=d B from Statement 2. #

It can be easily seen that the delay a packet experiences at a middle-stage

 port is bounded between [1, N ] slots, and the average middle-stage packet delay is

merely ( N +1)/2 slots for uniform traffic. From Fig. 2.2, we can see that some joint

sequences have the staggered symmetry property only, some have the in-order packet

delivery property only, and some have both properties.  For instance, the joint

sequence in Fig. 2.2(b) has the staggered symmetry property but cannot ensure in-

order packet delivery. Consider packets from flow(0,1). Two different middle-stage

delays will be experienced, 2-slot via middle port 3 and 4-slot via middle port 1. This

causes packets to arrive out of order. On the other hand, the joint sequence in Fig. 2.2(c) can

 provide in-order packet delivery but lacks the staggered symmetry property. The

systematic study of joint sequences is carried out in Chapter 3, but as far as this

chapter is concerned, we only focus on the joint sequence in Fig. 2.2(a).

2.3.4 Feedback-Based Scheduling Algorithms

Based on the received occupancy vector, each input port selects a packet for 

sending. Such an input port scheduler should be designed to avoid both buffer 

overflow and underflow at the connected middle-stage VOQ. Suppose input i is

connected to middle-stage port j at slot t , and its anchor output is K . Based on the N -

 bit occupancy vector received from middle-stage  j in the previous slot t -1, we find

candidate set S  j, i.e. the set of VOQ2( j,k ) (for k =0,1,…, N -1) with 0-occupancy. Input i 

can only choose a HOL packet from a VOQ in S  j for sending. This avoids buffer 

overflow at VOQ2( j,h).

From Fig. 2.2(a), we can see that middle port j is connected to each output in

descending order of the output port number. Therefore, we know a priori that in the

next slot t +1, port  j will be connected to output  K -1 (wrapped around by  N ). If 

VOQ2( j, K -1) is empty and VOQ1(i, K -1) is not, we will face an underflow in

VOQ2( j, K -1) at slot t +1. As such, the scheduling algorithm should always give the

highest priority to schedule the HOL packet of VOQ1(i, K-1) at slot t . With the above

considerations in mind, we present three simple input port schedulers below.

   RR (Round-Robin): If VOQ1(i,h’ ) is selected in the previous slot, then the

next non-empty VOQ1(i,h) with VOQ2( j,h) ∈ S j is selected. Comment: RR 

gives fair access to each VOQ1, and RR is amenable to hardware

implementation [33].

   LQF  (Longest Queue First): Among all the non-empty VOQ1(i,h)’s with

VOQ2( j,h) ∈ S j, the one with the longest queue size is selected. Comment: 

LQF is good for non-uniform traffic, but requires O( N ) comparisons. We can

replace it by Quasi-LQF [34], a very efficient sub-optimal LQF algorithm

requiring only a single comparison per time slot.

   EDF  (Earliest Departure First): Among all the non-empty VOQ1(i,h)’s with

VOQ2( j,h) ∈ S j, the one with the earliest departure time at the middle-stage

 port is selected. The departure time is calculated from (2.4). Comment: EDF

should not be confused with the classic Earliest Deadline First. Our EDF aims

at minimizing the chance of buffer overflow at each VOQ2, which is achieved

 by always giving priority to the VOQ1 with the minimum middle-stage delay

to send first.

Take an example. Assume a 4×4 feedback switch is configured by the joint

sequence of Fig. 2.2(a) and at time slot 0, a packet of VOQ1(0,0) is sent. Further 

assume that at time slot 1, there are 1, 2, 0, 3 packets in VOQ1(0,0),

VOQ1(0,1), VOQ1(0,2), VOQ1(0,3) respectively and the feedback indicates that the

corresponding middle stage buffer for output port 0 is not empty. Therefore, only

VOQ1(0,1) and VOQ1(0,3) are legitimate candidates, i.e. VOQ1(0,1) and VOQ1(0,3)

∈ S j. Then at time slot 1, RR and EDF would select the packet at VOQ1(0,1) for 

sending but LQF would transmit the HOL packet of VOQ1(0,3).
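The three selection rules, applied to this worked example, can be sketched as follows (a minimal sketch; `q` holds the VOQ1 lengths, `occ` the received occupancy vector, and EDF reuses the delay of Eq. (2.4)):

```python
def candidates(q, occ):
    # Outputs with a waiting packet here and a free single-packet buffer there.
    return [k for k in range(len(q)) if q[k] > 0 and occ[k] == 0]

def rr_pick(q, occ, last):
    # Round-robin: first eligible output after the one served last.
    N = len(q)
    for step in range(1, N + 1):
        k = (last + step) % N
        if q[k] > 0 and occ[k] == 0:
            return k
    return None

def lqf_pick(q, occ):
    # Longest queue first among eligible outputs.
    cand = candidates(q, occ)
    return max(cand, key=lambda k: q[k]) if cand else None

def edf_pick(q, occ, K, N):
    # Earliest departure first: smallest middle-stage delay per Eq. (2.4);
    # note this automatically favors output K - 1, whose delay is 1.
    cand = candidates(q, occ)
    d = lambda k: N if K == k else (K - k) % N
    return min(cand, key=d) if cand else None
```

With the example's state (q = [1, 2, 0, 3], buffer for output 0 occupied, last served output 0, anchor K = 3), RR and EDF pick output 1 while LQF picks output 3, as stated above.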

To give a scheduler more time to execute, batch scheduling [35] can be used,

where a single scheduling decision is made over a batch of time slots (instead of per 

slot). Packets arrived in the current batch of slots will be considered in the next batch.

Indeed, the multi-cabinet implementation of the feedback-based switch in Chapter 6

 belongs to this category.

2.4 Performance Evaluations

In this section, the performance of our proposed feedback-based scheduling

algorithms is compared with some representative algorithms by simulations. In the

following, we only present simulation results for a switch of size N = 32, although

similar conclusions apply to other sizes (unless explicitly stated otherwise, the default

switch size is N = 32 in all simulation results of this thesis). In our simulations,

we focus on studying the performance of the three proposed feedback-based

scheduling algorithms in Section 2.3, i.e. round robin (RR), longest queue first (LQF)

and earliest departure first (EDF). For comparison, we also implement:

  LQF with byte-focal switch architecture (LQF_Byte-Focal) [23], which

outperforms FOFF and in general is the best performing algorithm based on

resequencing buffer.

  CR algorithm [29], which is the best performing frame-based scheduling

algorithm.

  iSLIP algorithm [15], which serves as a benchmark for single-stage input-

queued switches. Specifically, we implement iSLIP with a single iteration

(iSLIP-1), as multi-iterations involve heavy communication overhead.

  Output-queued switch, which serves as a lower bound.

2.4.1 Performance under Uniform Traffic

Uniform traffic is generated as follows. At each time slot for each input, a

 packet arrives with probability p and is destined to each output with equal probability. 

Fig. 2.4 shows the delay-throughput performance under uniform traffic. We can see

that the three input port schedulers RR, LQF and EDF yield comparable and less-than-

20-slot delay performance for input load up to p = 0.9. When p > 0.94, LQF gives the

 best performance (as it always serves the most needed flow first), followed by

EDF and RR. The average packet delay at middle-stage ports can be easily derived:

(1+ N )/2 = 16.5 time slots. If we deduct this portion from the overall delay, we can

see that the (input port) delay of our scheduling algorithms matches the output-

queued switch performance very well. Compared with LQF_Byte-Focal, our three

schedulers give significantly smaller delay. When  p is reasonably large (>0.6), our 

algorithms also beat iSLIP and CR. When  p=0.7, the delay of LQF_Byte-Focal is 95

time slots, iSLIP 44, CR 152 and ours only 20.

Fig. 2.4 Delay vs input load p, with uniform traffic.

2.4.2 Performance under Uniform Bursty Traffic

Bursty arrivals are modeled by the ON/OFF traffic model. In the ON state, a

 packet arrives in every time slot. In the OFF state, no packet arrivals are generated.

Packets of the same burst have the same output and the output for each burst is

uniformly distributed. Given the average input load of  p and average burst size s p, the

state transition probability from OFF to ON is  p/[ s p(1- p)] and from ON to OFF is

1/ s p. Without loss of generality, we set burst size s p = 30 packets.
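The ON/OFF source can be sketched as a two-state Markov chain (an illustrative sketch; with these transition probabilities the long-run fraction of ON slots, and hence the load, works out to p):

```python
import random

def onoff_arrivals(p, sp, slots, seed=1):
    """One input's per-slot arrival indicators under the ON/OFF model:
    OFF -> ON with probability p/(sp*(1-p)), ON -> OFF with 1/sp.
    Requires p <= sp/(1+sp) so the OFF -> ON probability stays <= 1."""
    rng = random.Random(seed)
    p_on = p / (sp * (1.0 - p))
    arrivals, on = [], False
    for _ in range(slots):
        if on and rng.random() < 1.0 / sp:
            on = False                 # burst ends (mean length sp slots)
        elif not on and rng.random() < p_on:
            on = True                  # new burst starts
        arrivals.append(1 if on else 0)
    return arrivals
```

Over a long seeded run the empirical load should settle near p; each burst's destination would be drawn uniformly, per the model above.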

Fig. 2.5 shows the delay-throughput performance under uniform-bursty traffic.

In Fig. 2.5, we can see that delay builds up quickly with input load, which is due to

the bursty traffic nature. Nevertheless, our RR, LQF and EDF still outperform

LQF_Byte-Focal and CR algorithms. At  p=0.8, the delay of LQF_Byte-Focal is 224

time slots, 232 for CR, 156 for our RR/LQF/EDF, and 114 for output-queued switch.

Fig. 2.6 shows the delay performance of LQF under uniform-bursty traffic with

different burst sizes. We can see that average packet delay increases almost linearly

with burst size.

Fig. 2.5 Delay vs input load p, with uniform bursty traffic.

2.4.3 Performance under Hotspot Traffic

Packets arrive at each input port in each time slot with probability  p. Packet

destinations are generated as follows. For input port i, a packet goes to output (i + N/2) mod N

with probability ½, and goes to any other output with probability 1/[2( N -1)]. Fig. 2.7

shows the delay-throughput performance under hotspot traffic.
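The hotspot destination rule can be sketched as follows (an illustrative sketch; the hotspot output is taken modulo N to stay in range):

```python
import random

def hotspot_dest(i, N, rng):
    """Destination of a packet from input i under the hotspot model:
    output (i + N/2) mod N with probability 1/2, and any other output
    with probability 1/(2(N-1))."""
    hot = (i + N // 2) % N
    if rng.random() < 0.5:
        return hot
    others = [k for k in range(N) if k != hot]
    return rng.choice(others)
```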

Fig. 2.6 Delay vs input load p, with bursty traffic under different burst sizes.

Fig. 2.7 Delay vs input load p, with hot-spot traffic.

From Fig. 2.7, again we can see that our three schedulers are consistently

 better than the others. Among the three, LQF again gives the best (lowest) delay

 performance. Nevertheless, it is interesting to point out that the performance

difference among the three schedulers is much smaller than that in a single-stage

switch, and this is due to the use of the first stage switch for load balancing. For 

simplicity, we shall only concentrate on LQF below.

2.5 The Stability of Feedback-Based Two-Stage Switch

Simulation results in the previous section allow us to study the average

 performance under specific traffic patterns. In this section, we prove that under a

speedup of two, a feedback-based switch using any work-conserving port-

based scheduling algorithm (not just RR, LQF and EDF) is stable under any

admissible traffic patterns.

2.5.1 The Existing Approaches

Generally there are two approaches in proving 100% throughput, either using

the Lyapunov method or based on the fluid model. The Lyapunov method consists of 

three steps [14,17]. First, model the VOQ-length process by a Markov chain. Then

convert the stability problem to a linear programming problem. Finally use

appropriate Lyapunov functions. Based on this approach, switches using MWM [12],

MSM [14] and CIOQ [17] are proved to be stable.

In the Lyapunov method, the packet arrival process at each input is required to

 be Bernoulli i.i.d. (Independent and Identically Distributed). To remove this limitation,

the fluid model approach can be used. Under the assumption that the packet arrival

 process at each input obeys the law of large numbers, a much broader class of traffic

can be accounted for. The 100% throughput proofs for MWM and CIOQ in [36], and

for buffered crossbar switch in [20,37], are based on the fluid model.

2.5.2 Fluid Model for Feedback-Based Two-Stage Switch

Like [20,36-37], we first establish a fluid model for scheduling packets. Let

the number of packets in VOQ1(i, j) at the beginning of time slot n be Z ij(n). Let the

cumulative number of arrivals and departures for VOQ1(i, j) at the beginning of slot n 

 be Aij(n) and Dij(n), respectively. We have:

 Z ij(n) = Z ij(0) + Aij(n) − Dij(n),  n ≥ 0,  i, j = 1, …, N    (2.5)

Let the number of packets in VOQ2(i, j) at the beginning of slot n  be  Bij(n).

Because there is only one packet buffer for each VOQ2(i, j), we have  Bij(n) = 0 if 

VOQ2(i, j) is empty and  Bij(n) = 1 if VOQ2(i, j) is occupied. The cumulative number of 

arrivals and departures in VOQ2(i, j) at the beginning of slot n are  X ij(n) and Y ij(n),

respectively. The following relationship holds:

 Bij(n) = Bij(0) + X ij(n) − Y ij(n),  n ≥ 0,  i, j = 1, …, N    (2.6)

We assume that the packet arrival process obeys the strong law of large

numbers with probability one, i.e.

lim n→∞ Aij(n)/n = λij,   i, j = 1, …, N,

where λ ij  is the mean packet arrival rate to VOQ1(i, j). The switch is, by definition,

rate stable if:

lim n→∞ Dij(n)/n = λij,   i, j = 1, …, N.

An admissible traffic matrix is defined as the one that satisfies the following

constraints.

Σi λij ≤ 1 for every output j, and Σj λij ≤ 1 for every input i.    (2.7)
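Condition (2.7) amounts to a row-sum and column-sum check on the rate matrix (a minimal sketch, with `lam` as the N × N matrix of λij):

```python
def is_admissible(lam):
    """Check Eq. (2.7): every row sum (per input) and every column sum
    (per output) of the rate matrix is at most 1."""
    N = len(lam)
    rows_ok = all(sum(lam[i]) <= 1.0 for i in range(N))
    cols_ok = all(sum(lam[i][j] for i in range(N)) <= 1.0 for j in range(N))
    return rows_ok and cols_ok
```

For example, uniform traffic at load p < 1 (λij = p/N) is admissible, while any row or column summing above 1 is not.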

If a switch is rate stable for an admissible traffic matrix, then the switch delivers

100% throughput.

The fluid model is determined by a limiting procedure illustrated below. First,

the discrete functions are extended to right continuous functions. For arbitrary time t 

∈ [n, n+ 1):

 Aij(t ) = Aij(n);

 Z ij(t ) = Z ij(n);

 Dij(t ) = Dij(n) + (t - n)( Dij(n + 1) - Dij(n) );

 X ij(t ) = X ij(n);

 Bij(t ) = Bij(n);

Y ij(t ) = Y ij(n) + (t - n)( Y ij(n + 1) - Y ij(n) );

 Note that all functions are random elements of D[0, ∞). We shall sometimes

use the notation Aij(· ,ω),  Z ij(· ,ω),  Dij(· ,ω)  X ij(· ,ω), Bij(· ,ω), and Y ij(· ,ω) to explicitly

denote the dependency on the sample path ω. For a fixed ω, at time t , we have [36]:

 Aij(t ,ω), the cumulative number of arrivals to VOQ1(i, j)

 Z ij(t ,ω), the number of packets in VOQ1(i, j)

 Dij(t ,ω), the cumulative number of departures from VOQ1(i, j)

Page 64: FEEDBACK–BASED TWO-STAGE SWITCH

8/6/2019 FEEDBACK–BASED TWO-STAGE SWITCH

http://slidepdf.com/reader/full/feedbackbased-two-stage-switch 64/177

- 44 -

 X ij(t ,ω), the cumulative number of arrivals to VOQ2(i, j) 

 Bij(t ,ω), the number of packets in VOQ2(i, j) 

Y ij(t ,ω), the cumulative number of departures from VOQ2(i, j)

For each r > 0, we define

$$\bar{A}^{r}_{ij}(t,\omega) = \tfrac{1}{r} A_{ij}(rt,\omega); \quad \bar{Z}^{r}_{ij}(t,\omega) = \tfrac{1}{r} Z_{ij}(rt,\omega); \quad \bar{D}^{r}_{ij}(t,\omega) = \tfrac{1}{r} D_{ij}(rt,\omega);$$

$$\bar{X}^{r}_{ij}(t,\omega) = \tfrac{1}{r} X_{ij}(rt,\omega); \quad \bar{B}^{r}_{ij}(t,\omega) = \tfrac{1}{r} B_{ij}(rt,\omega); \quad \bar{Y}^{r}_{ij}(t,\omega) = \tfrac{1}{r} Y_{ij}(rt,\omega).$$

It is shown in [20,37] that for each fixed ω satisfying (2.5), (2.6) and any sequence

{r_n} with r_n → ∞ as n → ∞, there exists a subsequence $\{r_{n_k}\}$ and continuous

functions $\bar{A}_{ij}(\cdot), \bar{Z}_{ij}(\cdot), \bar{D}_{ij}(\cdot), \bar{X}_{ij}(\cdot), \bar{B}_{ij}(\cdot), \bar{Y}_{ij}(\cdot)$ such that, uniformly on

compacts as k → ∞, for any t ≥ 0:

$$\bar{A}^{r_{n_k}}_{ij}(t,\omega) \to \lambda_{ij} t; \quad \bar{Z}^{r_{n_k}}_{ij}(t,\omega) \to \bar{Z}_{ij}(t); \quad \bar{D}^{r_{n_k}}_{ij}(t,\omega) \to \bar{D}_{ij}(t);$$

$$\bar{X}^{r_{n_k}}_{ij}(t,\omega) \to \bar{X}_{ij}(t); \quad \bar{B}^{r_{n_k}}_{ij}(t,\omega) \to \bar{B}_{ij}(t); \quad \bar{Y}^{r_{n_k}}_{ij}(t,\omega) \to \bar{Y}_{ij}(t). \qquad (2.8)$$
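The fluid scaling and its limit are the functional strong law of large numbers at work: as r grows, $\bar{A}^{r}_{ij}(t) = A_{ij}(rt)/r$ concentrates around $\lambda_{ij} t$. A quick numerical illustration with hypothetical Bernoulli arrivals of rate λ = 0.4 (an assumed value for demonstration only):

```python
import random

random.seed(7)

def cumulative_arrivals(T, lam):
    """Bernoulli(lam) arrivals per slot; return the cumulative counts A(0..T)."""
    A = [0]
    for _ in range(T):
        A.append(A[-1] + (1 if random.random() < lam else 0))
    return A

lam, t = 0.4, 1.0
for r in (10, 100, 10000):
    A = cumulative_arrivals(int(r * t), lam)
    print(r, A[-1] / r)   # the scaled process A(rt)/r; tends to lam * t as r grows
```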

 Definition 1: Any function obtained through the limiting procedure in (2.8) is

said to be a fluid limit of the switch. The fluid model equations under our proposed

scheduling algorithms are:

$$\bar{Z}_{ij}(t) = \bar{Z}_{ij}(0) + \lambda_{ij} t - \bar{D}_{ij}(t), \quad t \ge 0 \qquad (2.9)$$

$$\bar{B}_{ij}(t) = \bar{B}_{ij}(0) + \bar{X}_{ij}(t) - \bar{Y}_{ij}(t), \quad t \ge 0 \qquad (2.10)$$

 Definition 2: The fluid model of a switch operating under a scheduling

algorithm is said to be weakly stable if every fluid model solution $(\bar{D}, \bar{Z})$

with $\bar{Z}(0) = 0$ satisfies $\bar{Z}(t) = 0$ for almost every t ≥ 0.

From [36], the switch is rate stable if the corresponding fluid model is weakly

stable. Our goal here is to prove that every fluid model solution $(\bar{D}, \bar{Z})$ under our

scheduling algorithms satisfies $\bar{Z}(t) = 0$ for almost every t. To prove this, we will

use the following Fact 1 from [36]:

 Fact 1: Let f be a non-negative, absolutely continuous function defined on R+∪{0}

with f(0) = 0. Assume that $\dot{f}(t) \le 0$ for almost every t such that $f(t) > 0$. Then f(t) = 0

for almost every t ≥ 0. (Note that R+ is the set of positive real numbers, and $\dot{f}(t)$

denotes the derivative of the function f(t) at time t.)

2.5.3 100% Throughput Proof 

In the following, we show that our proposed scheduling algorithms give

100% throughput. The result is quite strong in the sense that it holds for any

work-conserving scheduling algorithm with a speedup of two. In other words, each

input i can choose to serve any non-empty VOQ1(i,k) for which VOQ2(j,k) is empty,

where j is the middle-stage port currently connected to input i.

Theorem 1: (Sufficiency) A work-conserving scheduling algorithm can

achieve 100% throughput with a speedup of two for any admissible traffic pattern

obeying the strong law of large numbers.

 Proof: Let Cij(t) denote the joint queue occupancy of all packets that arrived at

input port i, plus all packets destined for output j. We have

$$C_{ij}(t) = \sum_{p} \bar{Z}_{ip}(t) + \sum_{m} [\bar{Z}_{mj}(t) + \bar{B}_{mj}(t)] \qquad (2.11)$$

$\bar{Z}_{ij}(t)$ and $\bar{B}_{ij}(t)$ are all non-negative, absolutely continuous functions, so Cij(t) is non-

negative and absolutely continuous too. We can see that Cij(0) = 0, and

$$\dot{C}_{ij}(t) = \sum_{p} \dot{\bar{Z}}_{ip}(t) + \sum_{m} [\dot{\bar{Z}}_{mj}(t) + \dot{\bar{B}}_{mj}(t)]$$

Combined with (2.9) and (2.10), we get

$$\dot{C}_{ij}(t) = \sum_{p} [\lambda_{ip} - \dot{\bar{D}}_{ip}(t)] + \sum_{m} [\lambda_{mj} - \dot{\bar{D}}_{mj}(t) + \dot{\bar{X}}_{mj}(t) - \dot{\bar{Y}}_{mj}(t)]$$

With a work-conserving scheduling algorithm, packets leaving VOQ1(m, j) enter

VOQ2(m, j), for m = 1, ..., N, so $\sum_{m} \bar{D}_{mj}(t) = \sum_{m} \bar{X}_{mj}(t)$, and therefore

$$\dot{C}_{ij}(t) = \sum_{p} \lambda_{ip} + \sum_{m} \lambda_{mj} - \sum_{p} \dot{\bar{D}}_{ip}(t) - \sum_{m} \dot{\bar{Y}}_{mj}(t).$$

From the admissible traffic condition (2.7), we get

$$\dot{C}_{ij}(t) \le 2 - \sum_{p} \dot{\bar{D}}_{ip}(t) - \sum_{m} \dot{\bar{Y}}_{mj}(t) \qquad (2.12)$$


For any non-empty VOQ1(i, j), i.e. $\bar{Z}_{ij}(t) > 0$, by the continuity of $\bar{Z}_{ij}$ there exists

$\bar{t} > t$ such that $\bar{Z}_{ij}(t') > 0$ for all $t' \in [t, \bar{t}]$. Set

$$a = \min_{t' \in [t, \bar{t}]} \bar{Z}_{ij}(t').$$

For large enough k, we have $\bar{Z}^{r_{n_k}}_{ij}(t') \ge a/2$ for $t' \in [t, \bar{t}]$. Also, for large

enough k we have $r_{n_k} a/2 \ge 1$. Thus $Z_{ij}(t') \ge 1$ for $t' \in [r_{n_k} t, r_{n_k} \bar{t}]$, which means

that VOQ1(i, j) holds at least one packet throughout the long interval $[r_{n_k} t, r_{n_k} \bar{t}]$. With a

work-conserving scheduling algorithm, flow(i, j) packets always experience the same

fixed middle-stage port delay of d slots, where d is given by (2.4). During the time

interval $[r_{n_k} t, r_{n_k} \bar{t}]$, whenever input port i is connected to a middle port g:

  if VOQ2(g, j) is empty, a packet is transmitted from input port i to middle port

g, and $\sum_{k} D_{ik}(t)$ is increased by one.

  if VOQ2(g, j) is not empty, the packet in VOQ2(g, j) will be transmitted to

output port j with a fixed delay of q slots, where q = d mod N, so $\sum_{m} Y_{mj}(t)$ is

increased by one after q slots. (The packet in VOQ2(g, j) is sent when

middle port g next connects to output j. If this occurs in the current time slot, q

= 0; otherwise, it takes another q slots.)

If the switch is operated with a speedup of S, then over the long time interval

$[r_{n_k} t, r_{n_k} \bar{t}]$ (corresponding to $[t, \bar{t}]$) it fulfills:

$$\sum_{p} [D_{ip}(r_{n_k}\bar{t}) - D_{ip}(r_{n_k}t)] + \sum_{m} [Y_{mj}(r_{n_k}\bar{t} + q) - Y_{mj}(r_{n_k}t + q)] \ge S\, r_{n_k} (\bar{t} - t)$$

 Note that $\sum_{m} Y_{mj}(t)$ is monotonically non-decreasing and increases by at most one in

every time slot. So we have:

$$\sum_{m} Y_{mj}(r_{n_k}\bar{t} + q) \le \sum_{m} Y_{mj}(r_{n_k}\bar{t}) + q;$$

$$\sum_{m} Y_{mj}(r_{n_k}t + q) \ge \sum_{m} Y_{mj}(r_{n_k}t).$$

Combining them together, we have

$$\sum_{p} [D_{ip}(r_{n_k}\bar{t}) - D_{ip}(r_{n_k}t)] + \sum_{m} [Y_{mj}(r_{n_k}\bar{t}) - Y_{mj}(r_{n_k}t)] \ge S\, r_{n_k}(\bar{t} - t) - q$$

Since q is pre-determined and within [0, N−1], its impact is insignificant in the fluid

limit [20]. Dividing the above inequality by $r_{n_k}$ and letting k → ∞, the fluid limits are

obtained as:

$$\sum_{p} [\bar{D}_{ip}(\bar{t}) - \bar{D}_{ip}(t)] + \sum_{m} [\bar{Y}_{mj}(\bar{t}) - \bar{Y}_{mj}(t)] \ge S(\bar{t} - t)$$

Further dividing the above inequality by $(\bar{t} - t)$ and letting $\bar{t} \to t$, the

derivative of the fluid limit is

$$\sum_{p} \dot{\bar{D}}_{ip}(t) + \sum_{m} \dot{\bar{Y}}_{mj}(t) \ge S \qquad (2.13)$$

With a speedup of two (i.e. S = 2), combining (2.12) and (2.13) gives

$$\dot{C}_{ij}(t) \le 0.$$

Based on Fact 1, Cij(t) = 0 for almost every t ≥ 0. Due to (2.11) and Cij(t) = 0,

we have $\bar{Z}_{ij}(t) = 0$ for almost every t ≥ 0. Theorem 1 is proved. #


It should be noted that existing stability proofs [21-30] adopt a common

approach of showing that the delay performance of a specific algorithm is within a

finite bound of that of an output-queued switch. Since the buffer size at each middle-stage

port is usually assumed to be infinite, the derived bound w.r.t. (with respect to) the

output-queued switch can be unrealistically large.

2.6 Chapter Summary

In this chapter, a framework for designing feedback-based scheduling

algorithms was proposed to elegantly solve the notorious packet mis-sequencing

problem of a load-balanced switch without sacrificing the switch's delay and

throughput performance. Unlike existing approaches, we showed that the efforts

made in load balancing and in keeping packets in order can complement each other.

Specifically, at each middle-stage port between the two switch fabrics of a load-

balanced switch, only a single packet buffer for each VOQ is required. In-order

packet delivery is made possible by properly selecting and coordinating the two

sequences of switch configurations to form a joint sequence with both the staggered

symmetry property and the in-order packet delivery property. As compared with

existing load-balanced switch architectures and scheduling algorithms, our solutions

have modest requirements on switch hardware, but consistently yield the best

delay and throughput performance under various traffic conditions.


Chapter 3

Cutting Down Average Packet Delay

3.1 Introduction

For an N × N switch, there are N² input-output pairs, and thus it needs to carry

a total of N² different packet flows. In a feedback-based switch (Fig. 3.1), although

the amount of middle-stage port delay experienced by packets of the same flow is the

same, packets of different flows may experience different middle-stage port delays.

The feedback-based switch in Fig. 3.1 is configured with the joint sequence in Fig.

3.2(a). Flow(0,1) packets experience a 2-slot middle port delay, e.g. arriving at

middle port 0 at t = 0 and leaving at t = 2. On the other hand, flow(0,2) packets just


experience a 1-slot middle port delay, e.g. arriving at middle port 0 at t = 0 and leaving

at t = 1. Assume flow(0,1) and flow(0,2) are the only flows in the switch, and the

packet arrival rate of flow(0,1) is much higher than that of flow(0,2). To minimize

the average packet delay, can we swap the two flows so that flow(0,1)

packets experience the 1-slot middle port delay instead? In general, if the traffic rate

matrix of a switch is known (e.g. by measurement), can we cut down the average

middle-stage packet delay by assigning heavy flows to experience smaller middle-stage

delays? This problem is investigated in this chapter along two directions.

Fig. 3.1 The feedback-based two-stage switch architecture.

First, from Chapter 2 we know that there exists a set of joint sequences with

both the staggered symmetry and in-order packet delivery properties (the joint sequence

in Fig. 3.2(a) is just a particular instance). For a given traffic matrix, we then try to

find an optimal joint sequence that minimizes the average middle-stage delay. But

the search involves rather tedious computation. A more practical solution is therefore

proposed: add another stage of switch fabric that dynamically maps heavy flows

to smaller middle-stage port delays. We call it a feedback-based three-stage

switch.


The rest of this chapter is organized as follows. In the next section, we design

the optimal joint sequence for the feedback-based two-stage switch under a given traffic matrix.

In Section 3.3, the three-stage switch architecture is introduced to minimize the

average middle-stage delay. Finally, we conclude this chapter in Section 3.4.

Fig. 3.2 Some joint sequences for a 4 x 4 load-balanced switch.

3.2 Optimal Joint Sequence Design

A feedback-based two-stage switch has a single packet buffer at each middle-

stage VOQ2( j,k ). It is configured by a pre-determined joint sequence of  N  joint

configurations. A joint sequence consists of two (component) sequences of   N  

configurations, one for each switch stage, called first stage sequence and second


stage sequence. From Fig. 3.2 and our discussion in Chapter 2, we can see that some

joint sequences have the staggered symmetry property only, some have the in-order

packet delivery property only, some have both properties, and yet others have neither

property (not shown). The relationship among them is depicted in Fig. 3.3.

For a feedback-based switch to function properly, a joint sequence should have both

the staggered symmetry and in-order packet delivery properties. To find the optimal joint

sequence for a given traffic matrix, we have to answer the following two questions:

1.  What is the necessary and sufficient condition for both staggered symmetry

and in-order delivery in a feedback-based two-stage switch?

2.  How many such joint sequences exist?

Fig. 3.3 The relation between staggered symmetry and in-order delivery.

A broader definition of the feedback-based switch shall be adopted in this

section to denote any load-balanced switch with a single packet buffer at each middle-

stage VOQ2(j,k). If the staggered symmetry property is also required, we spell it out

explicitly as a feedback-based switch with the staggered symmetry property.

3.2.1 In-Order Packet Delivery Only

Statement 4: A constant middle-stage delay for all packets belonging to the

same flow (and for all N 2 flows) is a necessary and sufficient condition for packet in-

order delivery in a feedback-based switch.


 Proving sufficient condition: If the middle-stage delay is constant for all

 packets of the same flow, then middle-stage ports will not cause any packet out-of-

order problem. #

 Note that this sufficient condition is not limited to the feedback-based two-

stage switch architecture. It can be applied to other load-balanced switch

architectures [21, 31].

 Proving   necessary condition: In a feedback-based switch, assume flow(i,k )

 packets do not experience a constant middle port delay. Nevertheless, based on the

 periodicity of joint sequence, the middle-stage port delay is always bounded between

[1, N ] slots and if packets A and B of flow(i,k ) enter middle-stage ports at time slot t  

and t + N respectively, they will still experience the same middle port delay. Therefore,

there exist packets C and D belonging to flow(i,k ), such that C enters a middle-stage

 port at slot t and experiences a middle-stage delay of  d  s slots, whereas  D enters a

middle-stage port at slot t+1 and experiences a middle-stage delay of d (d < d_s) slots.

Because d and d_s are both positive integers, we have:

d + 1 ≤ d_s

Packets C and D leave the middle-stage ports (and thus the switch, as there is no

output buffer) at slots t+d_s and t+1+d respectively. If d+1 = d_s, then t+d_s = t+1+d, which

means C and D would leave at the same time slot. This contradicts the properties of a

joint sequence, so d+1 ≠ d_s, i.e.


d + 1 < d_s (3.1)

From (3.1), we get t+1+d < t+d_s. In other words, packet C leaves its middle port

after D, putting the packets out of sequence. Since a non-constant middle-stage delay

thus causes packets to go out of sequence, the necessary condition is proved. #

This necessary condition is only valid under the feedback-based two-stage

switch architecture, in which there is only one packet buffer for every

VOQ2(j,k) and the two switch fabrics are configured by a joint sequence. Note that in a

feedback-based two-stage switch, the middle-stage port delay is always bounded

between [1, N] slots. If the middle port delay were not upper-bounded by N slots, the

necessary condition could fail. For example, if every next packet incurred a larger

middle-stage delay than the previous packets of the same flow, in-order delivery

could still be sustained.

In Fig. 3.2(c), we can see that each input port always connects to a fixed

output port (via some middle-stage port) in all time slots. We call this the anchor output

property. In this case, outputs 0, 1, 2 and 3 are the anchor outputs of inputs 0, 1, 2

and 3, respectively. Further consider input 0 in Fig. 3.2(c): it connects to middle ports

0, 1, 2 and 3 in a cyclic manner in each subsequent time slot. We denote this cycle by

(0, 1, 2, 3). Similarly, we can see that inputs 1, 2 and 3 connect to middle ports

following cycles (3, 0, 1, 2), (2, 3, 0, 1) and (1, 2, 3, 0), respectively. Indeed, (0, 1, 2,

3), (3, 0, 1, 2), (2, 3, 0, 1) and (1, 2, 3, 0) are just different ways to express the same

cycle (0, 1, 2, 3). If all input ports of a switch connect to middle ports following the


same cycle, we say the sequence of  N configurations is ordered . We can see that both

first and second sequences of configurations in Fig. 3.2(a) and (c) are ordered.

Statement 5: If a joint sequence of configurations has the anchor output

 property, and one of its two sequences is ordered, then the other sequence is also

ordered.

 Proof : Without loss of generality, let the first stage sequence of configurations

 be ordered based on cycle ( j1, j2, j3, j4 ...  j N ). At time slot t , let middle ports j1, j2 , j3 ,

 j4 ...  j N  be connected by input ports i1, i2 , i3 , i4 ... i N  respectively. Further let k 1, k 2 , k 3 ,

k 4 ...  k  N  be the anchor outputs for  i1, i2 , i3 , i4 ...  i N . We can get the generic joint

configuration at time slot t as shown in Fig. 3.4:

Fig. 3.4: The generic joint configuration at time slot t.

Similarly, the joint configurations at each subsequent time slot up to t + N -1

can be constructed based on the anchor output and ordered sequence properties as

shown in Fig. 3.5, resulting in a joint sequence of   N  joint configurations. By

construction, we can see that the second sequence of configurations (identified by


solid lines) is also ordered, and follows the cycle of (k 1, k  N , k  N -1, k  N -2 …k 2). #

Fig. 3.5: Generic joint sequence with anchor output and ordered properties.

If the two component sequences of a joint sequence are both ordered, we say

the joint sequence has the ordered property. Note that the tuple (i x, j x, k  x) could take

any value in [0, N -1], so the joint sequence in Fig. 3.5 is a generic expression for all

 possible joint sequences with anchor output and ordered properties.

Statement 6: Anchor output and ordered properties are the necessary and

sufficient condition for a constant middle port delay for packets of the same flow in a

feedback-based two-stage switch.

 Proving sufficient condition: Let the first stage sequence of configurations be

ordered based on cycle ( j1, j2, j3, j4 ...  j N ). Further let k 1, k 2 , k 3 , k 4 ... k  N  be the anchor 

outputs for i1, i2 , i3 , i4 ... i N . From Statement 5, the second sequence of configurations

is also ordered, and follows the cycle of (k 1, k  N , k  N -1, k  N -2 …k 2). This joint sequence is

shown in Fig. 3.5. Consider a packet  A of flow(i1,k  N ) being transmitted to some

middle port j. Due to anchor output, j connects to (anchor) output port k 1 (of input i1)


at the current time slot. Packet A arrives and waits at middle port j until j connects to

k  N again. Since the second stage sequence of configurations is ordered in cycle (k 1, k  N , 

k  N -1, k  N -2 …k 2), j will connect to output port k  N after one time slot. That means for any

arbitrary middle port  j, the middle-stage port delay for flow(i1,k  N ) is always 1-slot.

Repeating the above procedure for all N² possible flows, we can see that each flow has a

constant middle-stage delay. The sufficient condition is proved. #
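Statement 6 can also be checked numerically for a small switch. The sketch below builds one concrete anchor-output, ordered joint sequence (input i follows the cycle (0, 1, ..., N−1) through the middle ports, and the anchor of input i is taken to be output i; this particular choice is an illustrative assumption, not the generic construction of Fig. 3.5) and confirms that every flow sees a constant middle-stage delay:

```python
N = 4
first  = lambda i, t: (i + t) % N   # first stage: input i -> middle port, ordered cycle
second = lambda j, t: (j - t) % N   # second stage: middle j -> output (anchor of i is i)

def middle_delay(i, k, t):
    """Slots a flow(i, k) packet entering the middle stage at slot t waits."""
    j = first(i, t)                          # middle port the packet lands on
    for delta in range(1, N + 1):            # delay is bounded within [1, N]
        if second(j, t + delta) == k:
            return delta

for i in range(N):
    for k in range(N):
        delays = {middle_delay(i, k, t) for t in range(3 * N)}
        assert len(delays) == 1              # constant per-flow delay (Statement 6)
print("every flow sees a constant middle-stage delay")
```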

 Proving   necessary condition: In a feedback-based two-stage switch, the

middle port delay is bounded between [1, N ] slots. Due to the connectivity of a joint

sequence, different flows arriving at an input port must experience distinct amounts of

middle-stage port delays. In other words, at each input port, there exists exactly one

flow(i,k ) experiencing a constant middle-stage port delay of  d  time slots, for  d  =

1, …, N . Assume flow(i,k ) experiences the constant middle-stage port delay of d = N  

time slots. At time slot t , input i connects to some middle-stage port j′ and j′ connects

some output port k ′. If a packet B of flow(i,k ) is transmitted to middle port  j′ in this

slot, because of the constant N time slots middle-stage port delay for flow(i,k ), j′ will

connect to output port k after  N slots. The joint sequence is periodic with a cycle of  N  

slots, so k=k ′. For arbitrary time slot t and middle port j′, k=k ′ is always true and this

shows that output k is the anchor output for input i. Repeating the above procedure for

all input ports, we can see that each input port has a distinct anchor output port. This

shows that anchor output is the necessary condition for a constant middle-stage port

delay.

At input i, there exists flow(i,k ′) with constant 1-slot middle-stage port delay.

Let output k be the anchor output for  i, as proved above. When input i connects to


some middle-stage port j at any time slot t , due to anchor output property, j connects

to output k . At current time slot, if a packet C of flow(i,k ′) is sent to middle port  j,

then one time slot later, j will connect to output k ′ to keep the constant 1-slot middle

 port delay. Since both middle port j and time slot t are arbitrarily selected, all middle

 ports connect to output ports k and k ′ following the same order of k first and then k ′.

Repeating the above process from the 1-slot middle-stage port delay up to (N−1) slots, we can

show that all middle ports connect to the output ports following the same ordered sequence.

This proves the necessary condition. #

From Statements 4 and 6, we can directly get Statement 7:

Statement 7: Anchor output and ordered properties are the necessary and

sufficient condition for packet in-order delivery in a feedback-based two-stage switch.

3.2.2 Both In-Order Packet Delivery and Staggered Symmetry

Statement 8: If one sequence of configurations is ordered and the other 

sequence is constructed by the staggered symmetry property, then the resulting joint

sequence has the anchor output property.

 Proof : Staggered symmetry property refers to the fact that for any middle-

stage port j, if it is connected to output k at time slot t , then at next slot (t +1) input k  

is connected to the same middle-stage port  j. In other words, the second

configuration at time slot t  is a (vertical) mirror image of the first configuration at

time slot t +1, and the second configuration at t + N -1 wraps around to become the

mirror image of the first configuration at t .


Fig. 3.6: Joint sequence with staggered symmetry and in-order delivery.

Without loss of generality, let the first stage sequence of configurations be

ordered with cycle (j1, j2, j3, j4 ... jN). Due to the staggered symmetry, we can see that

the second stage sequence is also ordered, but interestingly, based on the cycle

(iN, iN−1, ..., i2, i1), which runs in the opposite direction to that followed by the first stage.

The resulting joint sequence is shown in Fig. 3.6. In each time slot, the connection

pattern in the first stage fabric is shifted downwards once (i.e. towards the right-hand

side of (j1, j2, j3, j4 ... jN)), whereas the connection pattern in the second stage is

shifted upwards once. From an input port's point of view, the net effect is that the

shifts in opposite directions cancel each other out, and the input connects to the

same output (but via a different middle-stage port) as in the previous time slot. This

proves Statement 8. #
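The construction in Statement 8 can be exercised in a few lines. A sketch, assuming an illustrative ordered first-stage sequence (input i to middle port (i+t) mod N): the second stage is derived purely from staggered symmetry, and the anchor-output property then emerges on its own:

```python
N = 4
# First-stage sequence: ordered, every input follows the cycle (0, 1, ..., N-1).
first = lambda i, t: (i + t) % N        # input i -> middle port at slot t

def second(j, t):
    """Staggered symmetry: if input x meets middle port j at slot t+1,
    then middle port j must meet output x at slot t (indices wrap mod N)."""
    return (j - (t + 1)) % N            # the input connected to j at slot t+1

# Verify the anchor-output property (Statement 8): in every slot, input i
# reaches the same output, whichever middle port it is currently given.
for i in range(N):
    outputs = {second(first(i, t), t) for t in range(3 * N)}
    assert len(outputs) == 1
    print("input", i, "-> anchor output", outputs.pop())
```

With this particular first stage, input i anchors to output (i−1) mod N; a different ordered cycle yields a different, but equally fixed, anchor pattern.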

Statement 9: For a feedback-based two-stage switch, the necessary and

sufficient conditions for both in-order packet delivery and staggered symmetry are:

one sequence of configurations is ordered and the other sequence is constructed by

the staggered symmetry property.


 Proof: The sufficiency is a direct consequence of Statements 7 and 8. If in-order

packet delivery is guaranteed, then from Statement 7 an ordered sequence of configurations

is a necessary condition. Obviously the staggered symmetry property itself is a

necessary condition for having both the in-order packet delivery and staggered symmetry

properties. #

3.2.3 Finding the Number of Different Joint Sequences

Statement 9 answers question 1 (i.e. the necessary and sufficient

condition for both staggered symmetry and in-order delivery). For the sake of

finding the optimal joint sequence, in the following we focus on question 2 (i.e.

how many such joint sequences exist).

  All possible joint sequences: To find the number of sequences that satisfy the

requirement of each input visiting each output exactly once in the sequence,

we can make use of the solution for the classic problem of  Latin square [38].

A Latin square is an N × N table filled with N different symbols in such a way

that each symbol occurs exactly once in each row and exactly once in each

column. From [38], the total number of Latin squares is given by [ N !( N -1)!] M ,

where  M  is the number of reduced Latin squares (and  M  ≥1). Unlike Latin

square, in a load-balanced switch the configuration sequence is periodic with

 N , and “sequences” beginning with different starting time slots should be

counted once. Accordingly, the number of configuration sequences in the first

stage fabric is N times smaller than the number of Latin squares, or [(N−1)!]²M.

For a given first stage sequence, there are [N!(N−1)!]M ways to select


the second sequence, resulting in a total of N![(N−1)!]³M² possible joint

sequences. (Note that in this case, “sequences” with different starting time

slots are counted individually because they produce different joint sequences.)

  Joint sequences with in-order delivery property only: Based on Statement

7, the number of joint sequences providing in-order packet delivery is the

 product of the number of different anchor output patterns and the number of 

ordered sequences. Since each input must have a distinct anchor output, there

are  N ! ways to select an anchor output pattern. Similarly, the number of 

 possible configurations in a time slot is  N !, and there are ( N -1)! possible

cycles that a configuration (sequence) can follow. This results in  N !( N -1)!

 possible choices. But among them, we only count “sequences” with different

starting time slots once, so the total number of ordered sequences is [(N−1)!]².

Then, the total number of joint sequences that can keep packets in order is

[(N−1)!]²N!.

  Joint sequences with staggered symmetry property only: If the sequence

of configurations used by one switch fabric is known, we can always

construct a unique joint sequence with the staggered symmetry property. So

the number of joint sequences with the staggered symmetry property equals the

number of possible single-stage sequences, or [(N−1)!]²M, where M is the

number of reduced Latin squares [38].

  Number of joint sequences with both properties: From Statement 9,

once the first stage sequence of configurations is determined, so is the second

stage, by the staggered symmetry property. Then the number of joint

sequences with both properties equals the number of ordered

sequences, which is given by [(N−1)!]². It should be noted that if all the


isomorphic joint sequences are counted once, then there are only (N−1)!

unique (non-isomorphic) joint sequences, each yielding a different delay

experience at the middle-stage ports.
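The counts above can be sanity-checked by brute force for N = 3, where the number of reduced Latin squares is M = 1. The sketch below enumerates all sequences of 3 configurations in which each input meets each middle port exactly once (i.e. 3 × 3 Latin squares, with rows as time slots) and recovers both N!(N−1)!M = 12 and, after identifying cyclic rotations, [(N−1)!]²M = 4:

```python
from itertools import permutations

N = 3
count = 0
for rows in permutations(list(permutations(range(N))), N):
    # rows[t][i] = middle port of input i at slot t. Each row is a matching
    # by construction; only the per-input (column) constraint needs checking.
    if all(len({rows[t][i] for t in range(N)}) == N for i in range(N)):
        count += 1
print(count)       # 12 Latin squares of order 3 = N!(N-1)!M with M = 1
print(count // N)  # 4 distinct periodic sequences once rotations are merged
```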

3.2.4 Discussions

Both questions have now been addressed; based on the answers, we would like

to identify the optimal joint sequence for a given traffic matrix. Statement 9 provides

an efficient mechanism to design a joint sequence for feedback-based two-stage

switches. We first show how a joint sequence can be constructed based on Statement

9. Assume the first stage sequence of configurations is ordered based on cycle ( j1, j2,

 j3, j4 ...  j N ). At time slot t , let middle ports j1, j2 , j3 , j4 ...  j N  be connected by input ports

i1, i2 , i3 , i4 ...  i N  respectively. Due to the ordered property, the first stage

configurations at each subsequent time slot and up to t + N -1 can be constructed.

When the first stage sequence is obtained, the second stage sequence of 

configurations can be constructed directly from the staggered symmetry property.

The resulting joint sequence is shown in Fig. 3.6. Note that the tuple ( i x, j x) could

take any value in [0, N -1], so the joint sequence in Fig. 3.6 is a generic expression for 

all possible joint sequences with both in-order packet delivery and staggered

symmetry properties. By substituting all possible values for (i x, j x) into Fig. 3.6, we

can systematically find all joint sequences with both staggered symmetry and in-

order packet delivery properties.

Let us take a closer look at Fig. 3.6. We observe that the middle-stage port

delay for flow(i,i) (i= i1, i2 , i3 , i4 ... i N ) is always  N -1 slots. In other words, it is not

 possible to map flow(i,i) (i=0,1… N -1) to experience less than ( N -1)-slot middle-


stage delay by any joint sequence satisfying Statement 9 (i.e. with both staggered

 property and in-order packet delivery properties). Also from Fig. 3.6, if output port j 

is the anchor output for input port i, then the middle-stage port delay for flow( j,i) is

always N−2 slots, regardless of the values of i and j.

We can see that, when using a joint sequence with both the in-order delivery and

staggered symmetry properties, the delays of different flows are intricately correlated

with each other. To find the optimal joint sequence that gives the minimum overall

switch delay, we have to use brute force to check every possible

joint sequence in the pool of [(N−1)!]², which involves rather tedious computation.

3.3 Three-Stage Switch

In this section, we follow another, more practical approach, which adds

another stage of switch fabric for dynamically mapping heavy flows to smaller

middle-stage port delays, called the three-stage switch.

3.3.1 Three-Stage Switch Architecture

The three-stage switch architecture is shown in Fig. 3.7. Any joint sequence

with staggered symmetry and in-order packet delivery properties, e.g. the one in Fig.

3.2(a), can be used by the first two switch fabrics. The selected joint sequence will

not be changed according to traffic. Instead, the configuration of the third-stage switch is designed/adjusted to map heavy flows to smaller middle-stage delays. As the configuration of the third switch fabric is traffic-dependent, it is updated only when there is a significant enough change in the traffic pattern. Since no buffer is required at

the virtual output ports (in Fig. 3.7), adding the third stage switch fabric does not

increase the packet delay (assuming propagation delay is negligible). In other words,

as soon as packets arrive at virtual outputs, they are re-directed to outputs via the

configuration in the third fabric. The zero delay at the virtual outputs (since they are bufferless) also ensures no packet mis-sequencing and no interruption to the original middle-stage VOQ occupancy feedback mechanism.

Fig. 3.7 A three-stage switch architecture.

An example is shown in Fig. 3.8. With the three-stage switch in Fig. 3.8(b),

 packets of flow(0,3) are delivered to virtual output 2 (instead of 3). After staying at

middle-stage ports for one slot, a packet arrives at virtual output 2 and is immediately

re-directed to output 3. We can see that the middle-stage delay of flow(0,3) packets is

 just one slot, whereas 4 slots are required using the two-stage switch implementation

in Fig. 3.8(a).

Without loss of generality, assume the traffic matrix {λij} is obtained. Then a delay matrix {dij} can be constructed, where entry dij corresponds to connecting virtual output port j-1 to output port i-1, and the value of dij is the traffic-weighted average


middle-stage packet delay of all the  N  flows destined to output port i-1. From

Chapter 2, each of the  N  flows destined to an output port experiences a distinct

middle-stage delay, ranging from 1 to N slots. For the 4×4 traffic matrix {λ ij} in Fig.

3.9(a), the corresponding delay matrix {d ij} is found and shown in Fig. 3.9(b). As an

example,

d34 = 4λ13 + λ23 + 2λ33 + 3λ43 = 0.8 + 0.1 + 0.2 + 1.2 = 2.3 slots.

Fig. 3.8 An example of using three-stage switch.

0.3 0.2 0.2 0.1          1.9 1.9 1.9 2.3
0.1 0.2 0.1 0.4          2.1 3.1 2.5 2.3
0.2 0.5 0.1 0.1          1.9 1.5 2.3 2.3
0.2 0.1 0.4 0.3          2.6 2.1 2.4 1.9

(a) Traffic matrix {λij} (b) Delay matrix {dij}

Fig. 3.9 Traffic matrix and delay matrix.
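As a quick check of how a delay-matrix entry is assembled, the sketch below (Python; the per-flow slot delays 4, 1, 2 and 3 are those of the d34 example in the text) recomputes d34 from the corresponding column of the traffic matrix:

```python
def delay_entry(col_loads, slot_delays):
    """One delay-matrix entry: the traffic-weighted average middle-stage
    delay of the N flows destined to one output, i.e. the sum of each
    flow's middle-stage delay (in slots) times its load."""
    return sum(w * lam for w, lam in zip(slot_delays, col_loads))

# Loads lambda_13, lambda_23, lambda_33, lambda_43 of Fig. 3.9(a) (the
# flows destined to one output) and the 4-, 1-, 2- and 3-slot delays
# used in the d34 example of the text:
d34 = delay_entry([0.2, 0.1, 0.1, 0.4], [4, 1, 2, 3])
# 0.8 + 0.1 + 0.2 + 1.2 = 2.3 slots
```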

Definition 3: A set of entries of a matrix is independent if no two of them occupy the same row or column.

A legitimate configuration in the third stage switch fabric must correspond to

an independent set. In Fig. 3.9(b), [d 11, d 22,d 33, d 44] is an independent set with virtual

output i mapping to output i. In this case, the three-stage switch degenerates into the


two-stage switch. The average middle-stage packet delay experienced by all N^2 flows in the two-stage switch is thus

d11 + d22 + d33 + d44 = 1.9 + 3.1 + 2.3 + 1.9 = 9.2 slots.

Minimizing the overall average middle-stage packet delay thus reduces to finding an independent set of the delay matrix whose entries sum to a minimum. Optimal algorithms with polynomial running time exist [39,40], with a time complexity of O(N^3). This is acceptable because the configuration of the third switch fabric is not changed on a per-slot basis. For completeness, the algorithm is summarized below:

After Step 1 (each row subtracts its minimum; the column minima are then already zero), the delay matrix of Fig. 3.9(b) reduces to

0   0   0   0.4
0   1.0 0.4 0.2
0.4 0   0.8 0.8
0.7 0.2 0.5 0

The subsequent starring and priming steps settle the starred zeros at positions (1,3), (2,1), (3,2) and (4,4), which identify the minimum independent set.

Fig. 3.10 An example of identifying the minimum independent set.

For a given matrix, the algorithm finds the independent set with the minimum weight [39,40].

1.  Each row of the matrix subtracts the smallest element in this row; each column then subtracts the smallest element in this column.

2.  Find a zero element, Z. If there is no starred zero in its row or its column, mark Z with a star. Repeat for each zero of the matrix. Go to Step 3.

3.  Cover every column containing a starred zero with a line. If all columns are covered, the starred zeros form the desired independent set; exit. Otherwise, go to Step 4.


4.  Choose an uncovered zero and mark it with a prime. If there is no starred zero in its row, go to Step 5. If there is a starred zero Z in its row, cover this row with a line and uncover the column of Z. Repeat until all zeros are covered, then go to Step 6.

5.  There is a sequence of alternating starred and primed zeros constructed as

follows: let Z 0 denote the uncovered 0'. Let Z 1 denote the 0* in Z 0's column (if 

any). Let  Z 2 denote the 0' in  Z 1's row. Continue in a similar way until the

sequence stops at a 0', Z 2k , which has no 0* in its column. Unstar each starred

zero of the sequence, and star each primed zero of the sequence. Erase all

 primes and uncover every line. Return to Step 3.

6.  Let h denote the smallest uncovered element of the matrix; it will be positive.

Add h to each covered row; then subtract h from each uncovered column.

Return to Step 4 without altering any asterisks, primes, or covered lines.
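For small N, the optimal mapping can also be found by brute-force enumeration of all N! row-to-column assignments. The sketch below (Python; a didactic stand-in for the O(N^3) algorithm of [39,40], not an implementation of it) reproduces the result for the delay matrix of Fig. 3.9(b):

```python
from itertools import permutations

def min_independent_set(d):
    """Brute-force minimum-weight independent set of an N x N matrix.

    Mapping row i to column perm[i] picks one entry per row and per
    column, so every permutation is an independent set. Enumeration is
    O(N!), fine only for tiny N; the O(N^3) algorithm scales better.
    """
    n = len(d)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        cost = sum(d[i][perm[i]] for i in range(n))
        if cost < best_cost:
            best_perm, best_cost = perm, cost
    return best_perm, best_cost

# Delay matrix {d_ij} of Fig. 3.9(b):
d = [[1.9, 1.9, 1.9, 2.3],
     [2.1, 3.1, 2.5, 2.3],
     [1.9, 1.5, 2.3, 2.3],
     [2.6, 2.1, 2.4, 1.9]]

perm, cost = min_independent_set(d)
# perm maps rows 1..4 to columns 3, 1, 2, 4 -- i.e. [d13, d21, d32, d44]
# with a total of 7.4 slots, matching the text.
```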

For the delay matrix in Fig. 3.9(b), the minimum independent set found is [d13, d21, d32, d44]. Fig. 3.10 shows the detailed steps and Fig. 3.11 shows the resulting third-stage configuration. The minimum average middle-stage packet delay is

d13 + d21 + d32 + d44 = 1.9 + 2.1 + 1.5 + 1.9 = 7.4 slots.

This gives a 19.6% reduction in middle-stage delay as compared with the two-stage switch counterpart.

While changing the third-stage configuration, attention should be paid to the

in-flight packets buffered at middle-stage ports. Their destinations are based on the

old mapping rendered by the old third-stage configuration. As such, we have to suspend the inputs from sending packets to middle-stage ports for N slots; otherwise, packets based on different mappings would coexist at the middle-stage ports. During this suspension period, the buffered middle-stage packets are properly cleared, and the new configuration of the third switch fabric is enforced immediately afterwards. We call this N-slot suspension period the reconfiguration penalty.

Fig. 3.11 Third-stage configuration for traffic/delay matrix in Fig. 3.9(b).

3.3.2 Traffic Matrix Estimation

Traffic matrix estimation among all the nodes in a network is generally

difficult. Fortunately, here we only need to find the traffic matrix at a single node, i.e.

 between the N switch inputs and the N switch outputs. In this section, a simple traffic

matrix estimation algorithm is presented. In particular, a packet counter  Qi,j is

associated with each of the N^2 flows/VOQ1(i,j)s. At the beginning of each sampling

interval of  T  time slots, Qi,j is initialized to 0 and is increased by one for every

subsequent packet arrival. Let λ ij be the estimated traffic rate/load for flow(i, j). λ ij is

updated every T slots using the following exponentially weighted moving averaging

function:

λij = 0.875·λ′ij + 0.125·Qi,j

where λ′ij

is the previous estimate and the weighting on the current sample is

assumed to be 0.125 (which is deemed suitable by simulations.)
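A minimal sketch of the per-flow estimator (Python): dividing the counter Qi,j by T to obtain a per-slot rate is our assumption, made so that the estimate stays a load in [0, 1]:

```python
ALPHA = 0.125  # weighting on the current sample, as stated in the text

def update_rate(prev_rate, packet_count, T):
    """One EWMA update of the estimated load of a single flow(i,j).

    prev_rate:    previous estimate (packets/slot)
    packet_count: counter Q_ij accumulated over the last sampling interval
    T:            sampling-interval length in slots (normalization by T is
                  an assumption, not spelled out in the text)
    """
    sample = packet_count / T            # per-slot rate over this interval
    return (1 - ALPHA) * prev_rate + ALPHA * sample

# Example: previous estimate 0.4, 50,000 arrivals over T = 10^5 slots:
lam = update_rate(0.4, 50_000, 10**5)
# 0.875 * 0.4 + 0.125 * 0.5 = 0.4125
```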


We also introduce another criterion for suppressing unnecessary updates of 

the third-stage configuration, so as to minimize the reconfiguration penalty.

Specifically, when a new input load λij is obtained, we check whether the load change is insignificant enough to ignore:

λij / λ′ij ∈ [0.9, 1.1].    (3.2)

If all flows satisfy (3.2), the existing third-stage configuration is kept. Otherwise, a new third-stage configuration is determined based on the updated traffic matrix. The fluctuation range in (3.2) can also be tuned to balance the reconfiguration penalty against the possible delay performance gain.
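The suppression test of (3.2) can be sketched as follows (Python; treating a flow whose previous estimate is zero as a significant change is our assumption, not spelled out in the text):

```python
def needs_reconfiguration(new_rates, old_rates, low=0.9, high=1.1):
    """True if any flow's load ratio new/old falls outside [low, high],
    i.e. the change is significant enough to warrant a new third-stage
    configuration (and the N-slot reconfiguration penalty)."""
    for new, old in zip(new_rates, old_rates):
        if old == 0:
            if new > 0:      # a flow appearing from nothing: significant
                return True
            continue
        if not (low <= new / old <= high):
            return True
    return False

drifted = needs_reconfiguration([0.42], [0.40])  # 5% drift, inside [0.9, 1.1]
jumped = needs_reconfiguration([0.48], [0.40])   # 20% jump, outside the range
# drifted == False, jumped == True
```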

Finally, the three-stage switch architecture above is resilient to errors in estimating the traffic matrix. This is because the close-to-100% throughput is guaranteed by the joint sequence used in the first two switch fabrics, whereas the third fabric serves purely to cut down delay. Therefore, adding the third-stage fabric has no negative impact on switch throughput, packet order, or the middle-stage VOQ occupancy feedback mechanism.

3.3.3 Performance Evaluations

In Chapter 2, the excellent delay-throughput performance of the feedback-based two-stage switch architecture was demonstrated under various traffic conditions. In this section, we focus only on the improvement of the three-stage switch over the original feedback switch.

We first study the performance under the hot-spot traffic model. For input port i, a packet goes to the hot-spot output (i+x) mod N with probability 1/2, and to each of the other outputs with probability 1/[2(N-1)]. The hot-spot can be moved by varying x. This traffic model is chosen because the overall traffic pattern remains admissible while increasing the input load p from 0 to 1, or while varying x. Without loss of generality, the joint

sequence shown in Fig. 3.2(a) is assumed.
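The hot-spot model translates directly into a destination distribution; a sketch (Python) that builds the probability vector for one input:

```python
def hotspot_distribution(i, x, N):
    """Destination probabilities for packets arriving at input port i:
    1/2 to the hot-spot output (i + x) mod N, and 1/(2(N-1)) to each
    of the remaining N - 1 outputs."""
    hot = (i + x) % N
    p = [1.0 / (2 * (N - 1))] * N
    p[hot] = 0.5
    return p

# N = 32 with hot-spot offset x = 30, as in Fig. 3.12:
probs = hotspot_distribution(i=0, x=30, N=32)
# probs[30] == 0.5 and the vector sums to 1, so at input load p each
# input (and, by symmetry, each output) carries at most 1 packet/slot,
# keeping the pattern admissible.
```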

Fig. 3.12 Delay vs input load p, under hot-spot traffic with 3-stage switch.

In the hot-spot traffic model, the heavy flow can be easily and correctly identified by our proposed traffic estimation algorithm. Therefore, Fig. 3.12 only shows the delay-throughput performance (for a switch of size N=32) against the input load. The y-axis is the overall average switch delay, which combines both input delay and middle-stage delay. With the two-stage switch architecture, varying the hot-spot x results in different delay-throughput curves. From Fig. 3.12, we can see that when the hot-spot is at output (i+30), the lowest/best delay is obtained because the hot-spot


flow is assigned a 1-slot middle-stage delay. When the hot-spot is at output (i+31), the highest/poorest delay results because the hot-spot flow is assigned the largest, 32-slot, middle-stage delay.

With our three-stage switch architecture, we can always map the hot-spot flow to the lowest, 1-slot, middle-stage delay by properly configuring the third switch fabric. That means that no matter what the value of x is, the overall delay-throughput performance rendered by our three-stage architecture is always the same as the case of the hot-spot at output (i+30). This cuts the delay by as much as 15 time slots, giving a 60.7% delay improvement at p=0.6 and 43.4% at p=0.95.

Fig. 3.13 Delay vs number of sample intervals T , with 3-stage switch.

Fig. 3.13 shows the delay versus time, measured in the number of sampling intervals, where each sampling interval is T = 10^5 slots. The initial traffic pattern/matrix changes


twice during the simulation, at the 40-th and the 70-th sampling intervals,

respectively. Each change is represented by a randomly generated traffic matrix.

(Each matrix entry is uniformly distributed between 0 and 1, and the whole matrix is regulated to be admissible.) From Fig. 3.13, we can see that our traffic estimation algorithm is quite effective in adapting to changes in the traffic pattern, and the overall improvement of the three-stage switch over the original two-stage switch is about 8%.

3.4 Chapter Summary

In this chapter, we improved the delay performance of the feedback-based two-stage switch by assigning heavy flows to smaller middle-stage delays. We followed two approaches. First, for a given traffic matrix, we can find an optimal joint sequence that minimizes the average middle-stage delay. Second, we extended the feedback-based two-stage switch architecture to three stages, whereby the third switch fabric dynamically maps heavy flows to smaller middle-stage port delays.


Chapter 4

Cutting Down Communication Overhead

4.1 Introduction

The occupancy vector in our feedback-based two-stage switch requires N bits.

When the switch size N is large, the N-bit occupancy vector may become a bottleneck. For example, with a 1024×1024 switch carrying 128-byte packets, the (second) switch fabric must operate at a speedup of two to carry the extra 1024 bits of occupancy vector.

In this chapter, we focus on cutting down the communication overhead. The


size of an occupancy vector can be reduced by only reporting the status of selected

middle-stage VOQs. To identify VOQs of interest, we first partition the  N VOQs

into u non-overlapped sets, each identified by a set number. In each time slot, every

input port piggybacks its set numbers of interest to the connected middle-stage port.

This “guides” each middle-stage port to only report the status of selected VOQs.

The rest of this chapter is organized as follows. In the next section, by exploiting the feedback path in the first stage, a set of efficient feedback suppression algorithms is designed. In Section 4.3, we compare all the proposed algorithms by simulations. Finally, we conclude the chapter in Section 4.4.

4.2  Feedback Suppression Algorithms

First, we partition the N VOQs at each port, either input or middle-stage, into u non-overlapped sets, denoted by G1, G2, …, Gu. Without loss of generality, assume g = N/u is an integer. Then each set Gm (m = 1, 2, …, u) contains g queues. Specifically, at input k,

Gm={VOQ1(k ,(m-1) g +1),VOQ1(k ,(m-1) g +2),…, VOQ1(k , mg )}.

At middle-stage port j,

Gm={VOQ2( j,(m-1) g +1), VOQ2( j,(m-1) g +2),…,VOQ2( j, mg )}.
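The set number of a queue follows directly from this partition; a sketch (Python, using the 0-based queue indices of the example in Section 4.2.1):

```python
def set_number(F, g):
    """Set number m (1-based) of queue index F (0-based), where g = N/u
    queues go in each set: queues 0..g-1 form G1, queues g..2g-1 form
    G2, and so on."""
    return F // g + 1

# Example of Section 4.2.1: N = 4, u = 2, so g = 2. Queue VOQ1(1,3)
# (F = 3) falls in G2, and log u = 1 bit suffices to piggyback m.
m = set_number(F=3, g=2)
# m == 2
```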

To cut down the communication overhead, the size of an occupancy vector can be

reduced by only reporting the status of selected Gm.

To maximize switch performance, longer queues should be given more

chances to send packets. With full  N -bit occupancy vector, the LQF scheduling


 provides the best performance by always selecting the longest queue from all the N  

VOQ1(i, j)’s at each input port. If we can select Gm based on where the longest queue

resides, the performance would not drop. We propose to construct another feedback mechanism for an input to piggyback its set numbers of interest to the connected

middle-stage port. We can make use of the otherwise wasted bandwidth in the first

stage switch for this purpose, as shown in Fig. 4.1. (Note that the speedup required

for carrying feedback in the second stage switch is also applied to the first stage

switch.) But unlike the feedback mechanism in the second stage (for middle-stage

ports to inform outputs/inputs), the (identity of the) longest queue received from an input i by middle port j at slot t can only be used N slots later, i.e. the next time middle port j is connected to output i. Since packets arrive and depart in every slot, the longest queue identified N slots ago may not be the current longest queue – this is the price we must pay. Nevertheless, for highly skewed non-uniform traffic patterns, the historical data usually serves as a good estimate.

Fig. 4.1 Timing diagram of feedback switch with feedback suppression.


With the above feedback mechanism in the first stage switch, three packet

scheduling algorithms are designed.

4.2.1 Set-Based Feedback (Set-feedback)

Let VOQ1(i,F ) denote the longest queue at input i at time t . If  F ∈Gm, then

the value of m is stored at input i and piggybacked (using log u bits) on the packet

sent to the connected middle-stage port j. Port j stores the value of m and when it is

connected to output i at time t + N -1,  j sends a  g -bit vector, corresponding to the

occupancy of the  g  queues in set Gm. Input/output i knows which set the  g -bit

occupancy vector refers to, based on the stored value of m at time t. At slot t + N , input

i selects a packet to send from the longest available queue in {VOQ1(i,(m-1) g +1),

VOQ1(i,(m-1) g +2), …, VOQ1(i,mg )}. “Available” means the corresponding

VOQ2(j,k) is empty and VOQ1(i,k) is not. In doing so, the likelihood that the selected packet comes from the longest queue among all N VOQ1s at input i is increased.

The feedback bits required in the first and second stages are log u and  g  bits

respectively.

An example: Consider a 4×4 ( N =4) feedback-based switch, at each

input/middle-stage port, VOQs are partitioned into 2 (u=2) non-overlapped sets,

denoted by G1 and G2. Then at input 1, set G1 contains {VOQ1(1,0),VOQ1(1,1)} and

G2 contains {VOQ1(1,2),VOQ1(1,3)}. Assume VOQ1(1,3) is the longest queue at input 1 at time slot 0. Since VOQ1(1,3)∈G2, the value 2 (the identity of G2) is stored at input 1 and piggybacked using 1 (= log u) bit on the packet sent to the connected

middle-stage port 0. Middle port 0 stores the value of 2 and when it is connected to

output 1 at time slot 3, middle port 0 sends a 2-bit ( g =2) vector, corresponding to the


occupancy of {VOQ2(0,2),VOQ2(0,3)} in set G2. Input/output 1 knows which set the 2-bit occupancy vector refers to, based on the value 2 stored at time slot 0. At slot 4,

input 1 selects a packet to send from the longest available queue in

{VOQ1(1,2),VOQ1(1,3)}.

4.2.2 Queue-Based Feedback Version 1 (Q-feedback-1)

Let VOQ1(i,F ) denote the longest queue at input i at time t . Unlike Set-

 feedback , the value of  F is stored at input i and piggybacked (using log N bits) on the

 packet sent to middle-stage port j. Port j stores the value of  F . When it is connected

to output i at slot t+N-1, j sends a b-bit occupancy vector containing the occupancy of the b queues from VOQ2(j,F) to VOQ2(j,F+b-1) (wrapping around modulo N). Input/output

i knows which queues the b-bit occupancy vector refers to, based on the value of  F  

stored at time t. At slot t + N , input i selects a packet to send from the longest available

queue in {VOQ1(i, F ), VOQ1(i, F +1), …, VOQ1(i, F +b-1)}. The feedback bits required

in the first and second stages are log  N and b bits respectively. (Note that b= g is not

necessary.)

An example: For a 4×4 ( N =4) feedback-based switch, assume VOQ1(1 ,3) is

the longest queue at input 1 at time slot 0. Then the value of 3 is stored at input 1 and

piggybacked using 2 (= log N) bits on the packet sent to the connected middle-stage port 0. Middle port 0 stores the value 3 (the identity of VOQ1(1,3)) and, when it is connected to output 1 at time slot 3, sends a 3-bit (b=3) vector corresponding to the occupancy of {VOQ2(0,3), VOQ2(0,0), VOQ2(0,1)}. Input/output 1 knows which queues the 3-bit occupancy vector refers to, based on the value 3 stored at time slot 0. At slot 4, input 1 selects a packet to send from the longest


available queue in {VOQ1(1,3),VOQ1(1,0), VOQ1(1,1)}.
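Q-feedback-1's b-queue reporting window can be sketched as follows (Python):

```python
def feedback_window(F, b, N):
    """Indices of the b middle-stage VOQs reported by Q-feedback-1:
    VOQ2(j,F) through VOQ2(j,F+b-1), wrapping around modulo N."""
    return [(F + k) % N for k in range(b)]

# Example from the text: N = 4, F = 3 and a 3-queue window reports
# queues 3, 0 and 1, matching {VOQ2(0,3), VOQ2(0,0), VOQ2(0,1)}.
window = feedback_window(F=3, b=3, N=4)
# window == [3, 0, 1]
```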

4.2.3 Queue-Based Feedback Version 2 (Q-feedback-2)

This algorithm is the same as Q-feedback-1 except that the second stage

feedback is generated as follows. When middle-stage port j is connected to output i at

slot t + N -1, we randomly select an empty queue VOQ2( j,z ). Middle-stage port j then

sends a (1+log N)-bit occupancy vector, whose first bit indicates the occupancy of VOQ2(j,F) and whose following log N bits carry the value z. At slot t+N, input i selects a packet from the longest available queue in {VOQ1(i,F), VOQ1(i,z)}. The feedback

 bits required in the first and second stages are log N and 1+log N bits respectively.

An example: For a 4×4 ( N =4) feedback-based switch, assume VOQ1(1 ,3) is

the longest queue at input 1 at time slot 0. Then the value 3 is stored at input 1 and piggybacked using 2 (= log N) bits on the packet sent to the connected middle-stage port 0. Middle port 0 stores the value 3 (the identity of VOQ1(1,3)) and, when it is connected to output 1 at time slot 3, randomly selects an empty queue (say VOQ2(0,2)). Middle port 0 then sends a 3-bit (1+log N) occupancy vector, whose first bit indicates the occupancy of VOQ2(0,3) and whose following 2 (= log N) bits carry the value 2 (the identity of VOQ2(0,2)). At slot 4, input 1 selects a packet to send from the longest available queue in {VOQ1(1,3),VOQ1(1,2)}.

 Note that the three algorithms above can all be extended to carry the feedback 

of the top-C longest queues (instead of the longest queue only). In Set-feedback , this

requires C ·log u bits in the first stage (for identifying up to C sets of Gm that contain

the top-C longest queues), and C · g bits in the second stage. Similarly, in Q-feedback-


1, we need C ·log  N bits in the first stage and C ·b bits in the second stage. For  Q-

 feedback-2, we need C ·log N bits and C ·(1+ log N ) bits, respectively.

4.3  Performance Evaluations

In this section, the delay-throughput performance of the proposed scheduling

algorithms is studied by simulations. Without loss of generality, a switch of size N=32 is assumed unless otherwise specified. Scheduling algorithms with full feedback (in the second-stage switch) require 32 bits. With our proposed feedback schemes, we target using only 12 bits (roughly 1/3). The detailed parameter settings are as follows:

  For Set-feedback, in order to form a 12-bit feedback, we partition the 32 VOQs into u=8 sets, each containing g=4 queues. Feedback of the Top-3 longest queues is used, i.e. C=3. The feedback bits required in the first stage and second stage are 9 bits and 12 bits respectively.

  For Q-feedback-1, we set b to 6 and C to 2. The feedback bits required in the

first stage and second stage become 10 bits and 12 bits respectively, which

are comparable to that of Set-feedback .

  For Q-feedback-2, we set C to 2. The feedback bits required in the first stage

and second stage are 10 bits and 12 bits respectively.
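The feedback budgets quoted in these settings follow from the top-C formulas at the end of Section 4.2; a sketch (Python; log here is log2, the usual base for bit counts) recomputes them for N = 32:

```python
from math import log2

def set_feedback_bits(C, u, g):
    """Top-C Set-feedback: C*log(u) bits in the first stage, C*g in the second."""
    return C * int(log2(u)), C * g

def q_feedback1_bits(C, N, b):
    """Top-C Q-feedback-1: C*log(N) bits in the first stage, C*b in the second."""
    return C * int(log2(N)), C * b

def q_feedback2_bits(C, N):
    """Top-C Q-feedback-2: C*log(N) and C*(1+log(N)) bits, respectively."""
    return C * int(log2(N)), C * (1 + int(log2(N)))

# Parameter settings above, for N = 32:
set_fb = set_feedback_bits(C=3, u=8, g=4)  # (9, 12)
qfb1 = q_feedback1_bits(C=2, N=32, b=6)    # (10, 12)
qfb2 = q_feedback2_bits(C=2, N=32)         # (10, 12)
```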

For comparison, we also implement a) the original feedback switch with full N-bit feedback (the LQF algorithm); b) the iSLIP algorithm [15] (with a single iteration), which serves as a benchmark for single-stage input-queued switches; and c) an output-queued switch, which serves as a performance lower bound. In simulations, we use the same traffic models as in Chapter 2.4, i.e. the uniform, uniform bursty and hot-spot traffic.

4.3.1 Performance under Uniform Traffic

Fig. 4.2 Delay vs input load p, under uniform traffic with partial feedback.

Fig. 4.2 compares the delay performance of the six schemes under uniform traffic. We can see that the delay gap between full-feedback and our proposed Set-feedback, Q-feedback-1 and Q-feedback-2 increases with the input load. At p=0.1, they give almost the same delay performance. At p=0.8, the delay gap grows to about 20 slots. But when compared with iSLIP, our proposed schemes require 40+ fewer slots, yielding a 55% cut in delay. Among the three proposed schemes, Set-feedback generally outperforms the other two. With a fixed number of bits for conveying feedback occupancy, Set-feedback can convey the Top-3 longest queues instead of the Top-2 (in Q-feedback), and can thus identify the longest VOQ with higher accuracy.

4.3.2 Performance under Uniform Bursty Traffic

From Fig. 4.3, the delay performance under bursty traffic, we can see that Set-feedback gives the best performance (lowest delay), followed by Q-feedback-1 and Q-feedback-2. iSLIP has smaller delay only at low input load (p≤0.5). At p=0.6, the delay is 183 slots for iSLIP and 92 slots for Set-feedback, yielding a 50% cut in delay. Compared with full-feedback, the delay at p=0.6 increases from 70 slots to 92, which is the price paid for minimizing the feedback bits.

Fig. 4.3 Delay vs input load p, under bursty traffic with partial feedback.

4.3.3 Performance under Hotspot Traffic

From Fig. 4.4, the delay performance under hot-spot traffic, again we can see that Set-feedback, Q-feedback-1 and Q-feedback-2 give comparable performance.

Fig. 4.4 Delay vs input load p, under hot-spot traffic with partial feedback.

4.3.4 Performance under Different Switch Size N  

Based on the above simulation results, Set-feedback gives the best performance among our proposed algorithms. In the following, we focus on the performance of Set-feedback under different traffic patterns and different switch sizes N. Note that we still limit the feedback for Set-feedback to 12 bits, regardless of the switch size. Specifically, when N=64, we set u=16 (so g=64/16=4) and C=3; the feedback bits required in the first and second stages are then both 12. When N=128, we set u=32 (so g=4) and C=3; the feedback bits required in the first and second stages are 15 and 12 bits respectively.

From Fig. 4.5, we can see that when N=128, the 12-bit Set-feedback yields 94.5% throughput under uniform traffic. In other words, Set-feedback trades just 5.5% (lower) throughput for an 88.3% saving in communication overhead.

Fig. 4.5 Throughput vs. switch size N , with partial feedback.

4.4  Chapter Summary

In this chapter, we focused on cutting down the communication overhead in

feedback-based two-stage switch. The size of an occupancy vector, which is sent by

middle-stage port to output port in every time slot, is reduced by only reporting the

status of selected middle-stage VOQs. To identify VOQs of interest, we first

 partitioned the  N VOQs into u non-overlapped sets, each being identified by a set

number. In each time slot, every input port piggybacks its set numbers of interest to

the connected middle-stage port. This guides a middle-stage port to only report the

status of the VOQs of interest. Extensive simulation results showed that our proposed

feedback suppression algorithms are very efficient.


Chapter 5

Supporting Multicast Traffic

5.1 Introduction

The migration of broadcasting and multicasting services, such as cable TV and multimedia-on-demand, to packet-oriented networks will play a dominant role in the near future. These highly popular applications have the potential of loading up the Internet. To keep up with the bandwidth demand of such applications, the next generation of packet switches/routers needs to provide efficient multicast switching and packet replication.

When a multicast packet arrives at a switch, the set of output ports the packet


destined for, i.e. the packet’s fan-out set , is retrieved from the local forwarding table

(like IP multicast). The cardinality of the fan-out set, i.e. its  fan-out , denotes the

number of copies into which the packet should be cloned. Packets arriving at the same input port and destined for the same fan-out set belong to the same multicast flow. The total number of possible multicast (and unicast) flows at an input port is 2^N - 1. An

admissible multicast traffic pattern requires no over-subscribed input and output

 ports. That means the packet arrival rate at each input port should be less than or 

equal to its capacity, or 1 packet/slot. Similarly, the aggregated packet arrival rate at

each output port (after packet duplication) must also be smaller than or equal to 1

 packet/slot. A multicast switch aims at providing 100% throughput for any

admissible multicast traffic pattern with minimum possible packet delay.
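The admissibility conditions can be checked mechanically; a sketch (Python) in which each multicast flow is described by an (input port, fan-out set, rate) triple of our own choosing:

```python
def is_admissible(flows, N):
    """Check that no input or output port is over-subscribed.

    flows: list of (input_port, fanout_set, rate) triples, where rate is
    in packets/slot and fanout_set is the set of destination outputs.
    Each multicast packet counts once at its input, but once per member
    of its fan-out set at the outputs (after packet duplication).
    """
    in_load = [0.0] * N
    out_load = [0.0] * N
    for i, fanout, rate in flows:
        in_load[i] += rate
        for j in fanout:
            out_load[j] += rate
    return all(l <= 1.0 for l in in_load + out_load)

# Two fan-out-2 flows at rate 0.5 exactly fill both outputs of a 2x2 switch...
ok = is_admissible([(0, {0, 1}, 0.5), (1, {0, 1}, 0.5)], N=2)   # True
# ...while any extra traffic to output 0 over-subscribes it.
bad = is_admissible([(0, {0, 1}, 0.6), (1, {0}, 0.5)], N=2)     # False
```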

For the sake of scalability, multicast switches are mainly designed based on

input-queued switch architecture, where a centralized scheduler is responsible for 

scheduling. Switch fabrics used can be bufferless [41-45] or buffered [46-49]. For 

multicast switches based on bufferless switch fabrics [41-45], in-switch multicast

capability (i.e. in-switch packet duplication and forwarding) is usually assumed,

where an input port can send a (multicast) packet to multiple output ports in a single

time slot. Such multicast fabrics are more expensive than their unicast counterparts.

Besides, the centralized scheduling algorithms are usually derived from their unicast

counterparts. Note that even for (simpler) unicast switches, a major bottleneck is the

implementation of the centralized scheduler.

For multicast switches with buffered switch fabrics, they mainly adopt the

 buffered crossbar [18-20] as their switch fabrics. Recall that for the buffered crossbar 


switch introduced in Chapter 1, even though the scheduler is simpler, its switch

fabric needs to realize N^N switch configurations, the same complexity as an output-

queued switch fabric.

In short, two limiting factors for high-speed multicast switch design are the

switch fabric complexity and the need for a sophisticated centralized scheduler. In

this chapter, we show that feedback-based two-stage switch can support multicast

traffic efficiently by slightly modifying its original operations. It elegantly

overcomes these two major obstacles. Specifically, it does not require a centralized

scheduler, and relies on a unicast switch fabric (realizing only  N  switch

configurations) to carry both unicast and multicast traffic.

The rest of the chapter is organized as follows. In the next section, we review

some related work on multicast switch design. The feedback-based two-stage switch

is modified to support multicast traffic in Section 5.3 and simulation results are

 presented in Section 5.4. We conclude the chapter in Section 5.5.

5.2  Related Work

5.2.1  Multicast Switches Based on Bufferless Switch Fabrics

Multicast switches based on bufferless switch fabrics [41-45] usually assume

in-fabric multicast capability (i.e. in-fabric packet duplication and forwarding), and

require a rather sophisticated centralized scheduler. In [41], each switch input port

maintains N +1 virtual queues, N for unicast and one for multicast. Priority is given to

schedule multicast traffic. If there are still idle inputs/outputs after scheduling

multicast packets, unicast packets are considered to increase switch utilization.

Although a multicast packet can be “split” and sent over multiple time slots, multicast

traffic suffers from severe head-of-line (HOL) blocking due to the single multicast

queue.

In [42], the number of multicast queues is increased to m to reduce HOL

 blocking. When a multicast packet arrives, it selects a multicast queue to join in order 

to balance the load among different multicast queues. But packets assigned to

different queues generally have overlapped fan-out sets. Priority is given to schedule

a unicast packet first or a multicast packet first depending on the service ratio

 between the two classes. An iterative algorithm is also adopted to maximize the

throughput in each time slot.

In [43],  packet splitting is allowed to further cut down the HOL blocking.

Specifically, each input maintains k  unicast/multicast shared queues, one for each

non-overlapped set of outputs. When a multicast packet arrives and if its fan-out set

intersects with the fan-out sets of multiple queues, packet-splitting “breaks”  the

original packet into “smaller” ones, each with a modified fan-out set (such that it does

not intersect the fan-out set of the queue it joins). An iterative algorithm is then

used to maximize the switch throughput. Simulation results show that high

throughput can only be achieved with a large number of iterations. But a large

number of iterations is not suitable for high-speed implementation.

In [44], the number of unicast/multicast shared pointer queues increases to

k = N , one for each output port (like the classic VOQs for unicast traffic). When a

multicast/unicast packet arrives, it is time-stamped and stored in a shared memory.

Then its memory address (i.e. a pointer) is stored in all pointer queues that overlap

with the packet’s fan-out set. An iterative scheduling algorithm based on the

timestamps of buffered packets is designed for maximizing throughput. The major 

problem with this approach, again, is its high communication overhead.

In [45], dynamic queuing policies are studied, where packet splitting upon

arrival is not allowed. The switch needs to identify active flows and then assign them

to different shared multicast queues based on the current switch load.

5.2.2  Buffered Crossbar Based Multicast Switches

Buffered crossbar switch architecture [18-20] is touted for its technology

feasibility and simpler scheduler. However, the buffered crossbar switch is not

scalable due to its 2N separate schedulers, N^2 in-fabric crosspoint buffers and the

need for N^N switch configurations. Buffered crossbar has also been extended to

support multicast traffic [46-49]. MURS [46] gives priority to schedule unicast and

multicast traffic in a round robin fashion. Specifically, if unicast gets priority in time

slot t , unicast traffic will be scheduled first. If there are still idle outputs after 

scheduling unicast traffic, multicast traffic is considered. Then in slot t +1, multicast

traffic gets the scheduling priority.

To reduce the hardware cost, I-SMCB (Input-based Shared Memory

Crosspoint Buffer [47]) and O-SMCB (Output-based Shared Memory Crosspoint

Buffer [48]) aim at cutting down the crosspoint buffers from N^2 to N^2/2. The key idea

is to share one crosspoint buffer by two adjacent input ports [47] or two adjacent

output ports [48]. But such a hardware cost reduction is offset by its throughput

degradation. In [49], the theoretical relationship between throughput performance

and crosspoint buffer size is studied under a special multicast traffic pattern. It is

concluded that to avoid throughput degradation, the amount of buffer to be deployed

at every crosspoint must scale logarithmically with the switch size N .

5.3  Multicast Scheduling in Feedback-Based Two-Stage Switch

5.3.1  Multicast Scheduling

We extend the feedback-based two-stage switch (Fig. 2.1) to support

multicast traffic. At each input port, in addition to the N unicast VOQ1(i,k )’s, we add

another m shared queues for multicast. We adopt a simple queuing policy that divides

the outputs into m equal and non-overlapped sets (assuming  N /m is an integer),

where set x (1 ≤ x ≤ m) contains outputs {(x−1)N/m, (x−1)N/m+1, …, x·N/m − 1}. Packet

splitting is used to “split” multicast packets to join different queues. So when a

multicast packet arrives and if its fan-out set intersects with the fan-out sets of 

multiple queues, then the original packet is “split” into “smaller” ones, each with a

modified fan-out set (which will not intersect with the fan-out set of the target queue).

 Note that the packet after splitting usually remains as a multicast packet but with a

smaller fan-out set. It is worth noting that when m=1, all multicast packets share the

same multicast queue; and when m= N , packet splitting converts all multicast packets

into unicast.
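As a minimal sketch, the queuing policy just described can be written as follows, assuming 0-indexed outputs and N/m an integer (function and variable names are illustrative, not from the thesis):

```python
def split_multicast(fanout_set, N, m):
    """Split a multicast packet's fan-out set across m multicast queues.

    Queue x (0-indexed) is responsible for the non-overlapping output set
    {x*N/m, ..., (x+1)*N/m - 1}. Returns a dict mapping queue index to the
    sub-fan-out set carried by the split packet joining that queue.
    """
    assert N % m == 0, "the policy assumes N/m is an integer"
    size = N // m
    pieces = {}
    for out in fanout_set:
        q = out // size                  # the queue whose output set holds `out`
        pieces.setdefault(q, set()).add(out)
    return pieces
```

With m=1 the dict has a single entry (one shared queue); with m=N every output maps to its own queue, i.e. the multicast packet is converted into unicast packets, matching the two extremes discussed above.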

Without loss of generality, we assume the two stages of switch fabrics are

configured using the joint sequence of Fig. 2.2(a). In each time slot, based on the

received occupancy vector of middle-stage port k , input i selects a packet for sending

among its  N+m local queues. Priority is given to schedule multicast traffic by

examining the m multicast queues first. Here we only consider giving the higher 

priority to multicast traffic, as in general it is more time-critical than unicast, but it

should be noted that our multicast scheduler can be revised to schedule unicast and

multicast packets depending on the service ratio (like [42]) or in a round robin

fashion (like [46]). Specifically, the HOL packet whose fan-out set has the largest

overlap with the set of empty queues at middle-port k  is selected. (If no overlap, a

unicast packet is selected instead.) A copy of the selected packet is sent to the

middle-port together with an  N -bit duplication vector , which identifies the overlap

 between the empty VOQ2( j,k )’s and the packet fan-out set. Then, the fan-out set of 

the selected multicast packet is updated to exclude those in the duplication vector. If 

the updated fan-out set is empty, the selected multicast packet is removed from the

multicast queue. When a packet arrives at the middle-stage port, it will be cloned and

stored at the corresponding empty (unicast) VOQ2( j,k )’s based on the duplication

vector.
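The input-port selection step described above can be sketched as follows. This is an illustrative rendering only: `hol_fanouts` holds the fan-out set of the HOL packet of each multicast queue (None if the queue is empty), and `empty_voq2` is the set of outputs k with VOQ2(j,k) empty, derived from the received occupancy vector:

```python
def select_multicast(hol_fanouts, empty_voq2, N):
    """Pick the HOL multicast packet whose fan-out overlaps most with the
    empty VOQ2(j,k)'s, and build the N-bit duplication vector.

    Returns (queue_index, duplication_vector, residual_fanout), or None when
    no HOL fan-out overlaps the empty VOQ2's (a unicast packet is sent instead).
    """
    best_q, best_overlap = None, set()
    for q, fanout in enumerate(hol_fanouts):
        if fanout is None:
            continue
        overlap = fanout & empty_voq2
        if len(overlap) > len(best_overlap):
            best_q, best_overlap = q, overlap
    if best_q is None:
        return None
    dup_vector = [1 if k in best_overlap else 0 for k in range(N)]
    residual = hol_fanouts[best_q] - best_overlap  # empty => dequeue the packet
    return best_q, dup_vector, residual
```

When the returned residual fan-out is empty, the selected packet is removed from its multicast queue; otherwise its fan-out set is updated to the residual, as described in the text.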

If there are no backlogged multicast packets or none of them can be selected

(due to zero-overlap between the empty VOQ2( j,k )’s and any multicast packet’s fan-

out set), we select a unicast packet for sending using the LQF scheduler. In this case,

the duplication vector is set to all 0’s. Note that the packet transmission in the

second-stage switch fabric is the same as in a unicast switch. Following the pre-

determined sequence of configurations, when middle-stage port j connects to output

k , the packet (if any) at VOQ2( j,k ) is sent together with the occupancy vector of 

middle-port j.

5.3.2  Discussions

In our proposed multicast scheduling algorithm, packet duplication takes

 place at both input ports and middle-stage ports. Packet duplication at input ports

“breaks” a multicast packet into smaller ones. Since multicast packets in different

multicast queues have non-overlapped fan-out sets, both HOL blocking and output

contention can be eased. Besides, storing multicast packets at inputs reduces the

input port buffer requirement. Since both switch fabrics in the feedback switch are

unicast, a multicast packet is sent in the first fabric as a unicast packet. The

complicated switch fabric with in-fabric duplication (as in [41-45]) is not required.

When a split multicast packet arrives at a middle-stage port, the second-stage packet

duplication occurs, which converts all multicast packets into unicast for delivery by

the second switch fabric.

When there is only a single multicast queue (m = 1), all packet duplication is

carried out at middle-stage ports. Under light traffic, input port queue size can be

minimized. But for heavy traffic, the switch will experience severe HOL blocking

 because a multicast packet will not be removed (from the only queue) until all its

copies are sent. With m > 1, packet splitting ensures that packet duplication occurs

 partially at input ports and packets in different queues have non-overlapped

destinations. This reduces the HOL blocking. Let the switch size be  N . When m= N ,

all packet duplication is carried out at input ports. In this case, there is no need for 

“multicast” queues because they only store unicast packets. In other words, each

input port only needs to maintain N unicast queues. The HOL blocking is completely

eliminated and the stability proof in Chapter 2 can also be applied for multicast

feedback switch with m= N .

Unlike the feedback-based two-stage unicast switch, the load-balancing in the

first stage switch is based on multicast packets. Extensive simulation results show

that the final unicast traffic presented to the second stage switch is generally uniform.

This is attributable to the use of a single-packet buffer per middle-stage VOQ2(j,k), and

the efficient feedback mechanism for reporting the middle-stage port occupancy. To

further increase the buffer utilization, we can use pointer queues [44] to separately

store a packet and its memory address. So a multicast packet needs to be stored only

once at an input port, and an entry in VOQ1(i,k ) only contains the memory address of 

the packet. Likewise, this can be applied to buffers at middle-stage ports.

The proposed multicast scheduling algorithm inherits the in-order packet

delivery property from its unicast counterpart. This is because we can treat each

distributary of a multicast flow as a unicast flow. In Chapter 2, it has been shown that

 packets belonging to the same unicast flow always experience the same middle-stage

 port delay. Therefore, when they arrive at the output port, they will be in order. If 

packets belonging to every distributary flow arrive in order at their respective outputs,

the corresponding multicast flow will not experience the packet mis-sequencing problem.

5.4  Performance Evaluations

To the best of our knowledge, our proposed multicast scheduling is the only

one that does not rely on a centralized scheduler, and its switch fabric only needs to

realize N switch configurations (instead of  N !). To study its performance, we vary the

number of multicast queues (m) at each input port. In our simulations, we try to

distinguish between the overall average delay experienced by all copies (T c) of a

multicast packet and the average delay experienced by the last-copy (T  p) of all

multicast packets. T  p corresponds to the worst-case delay and provides us some

insight on the delay variation among different copies of a multicast packet. For 

multicast packets with fan-out k , T c(k ) and T  p(k ) denote their average delay and

average last-copy delay respectively. They show the fairness performance in

handling packets with different fan-outs. Although we only present simulation results

for switch with size  N =32 below, the same conclusions and observations apply for 

other switch sizes.

5.4.1  Performance under Uniform Mixing Traffic

Fig. 5.1 Delay vs output load λ , with uniform mixing traffic

At every time slot for each input, a packet arrives with probability  p (i.e.

input load is  p). If a packet arrives, it has equal probability of being unicast or 

multicast. If the packet is unicast, it is destined to each output with equal probability. If

the packet is multicast, its fan-out size k  is randomly selected between [2, 32], and

the identity of each output in the fan-out set is also randomly selected from all output

 ports. Fig. 5.1 shows the switch delay performance against switch output load λ ,

where

λ = p[0.5+0.5(2+32)/2] = 9 p. (5.1)

To ensure the traffic in our simulations is always admissible, we must have λ  ≤ 1 (or 

 p ≤1/9).

Fig. 5.2 Delay vs fan-out k , with uniform mixing traffic at λ =0.7

From the delay-throughput performance in Fig. 5.1, we can see that for output

load λ < 0.85, m=1 and m=2 provide a lower average packet delay than m=32. At λ =

0.7, m=2 cuts down the overall average delay (T c) by 58.8% and the average last-

copy delay T  p by 51%. When λ >0.85, m=32 (packet duplication at input ports only)

yields a better/lower delay performance because there is no HOL blocking, while the

HOL blocking in m=1 (packet duplication at middle-stage ports only) intensifies with the

traffic load. This also explains why m=2 (packet duplication at both input and

middle-stage ports) is better than m=1.

Fig. 5.2 shows the delay performance against different fan-outs, while fixing

λ = 0.7. When m=2, we can see that T c(k ), the average delay for packets with fan-out

k , is the lowest, and remains almost constant at 20 slots as fan-out k increases. Even

T  p(k ), the average last-copy delay for packets with fan-out k , increases rather slowly

with k . This shows that m=2 is fair in handling packets with different fan-outs. On

the contrary, with m=32, both T c(k ) and T  p(k ) increase more rapidly with fan-out size.

5.4.2  Performance under Uniform Bursty Mixing Traffic

We use the same traffic generator except that bursty arrivals are modeled by

the ON/OFF traffic model of Chapter 2.4. In the ON state, a packet arrival is

generated in every time slot, which has equal probability of being unicast or 

multicast. Simulation results in Figs. 5.3&5.4 are based on burst size  s p=30 packets.

Again, we can express the aggregated load at each output port by (5.1).

From Fig. 5.3, the performance gap between m=2 and m=32 is much wider 

than that in Fig. 5.1. This is because bursty traffic causes more unevenly distributed

queue sizes in the input ports when m=32. With m=2, packet duplication mainly 

occurs at middle-stage ports. In this case, both input port queue size and input port

delay are reduced. With m=1, packet duplication only occurs at middle-stage ports.

Throughput suffers from severe HOL blocking. From Fig. 5.4, we can

again see that m=2 is fair in handling packets with different fan-outs. Although m=32

also gives improved fairness performance, this is at the cost of very high average

delay (T c(k )> 750 slots).

Fig. 5.3 Delay vs output load λ , with bursty mixing traffic

5.4.3  Performance under Binomial Mixing Traffic

Binomial mixing traffic [45] is the same as the Bernoulli uniform mixing

traffic model except in generating the fan-out size of a multicast packet. Let P k be the

 probability of generating a fan-out set with size k . The k destinations are uniformly

distributed over all output ports. The value of k is chosen according to a non-uniform

 binomial distribution, with mean fan-out h:

P_k = C(N, k) · (h/N)^k · (1 − h/N)^(N−k)

Fig. 5.4 Delay vs fan-out k , with bursty mixing traffic at λ =0.7

In our simulations, we set mean fan-out h = 17. Then the output load λ is:

λ = p[0.5+0.5×17] = 9 p 
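As a sketch of the fan-out generator assumed by this traffic model, k can be drawn as a Binomial(N, h/N) variable, whose mean is N·(h/N) = h, with the k destinations then sampled uniformly without replacement (names are illustrative):

```python
import random

def binomial_fanout(N=32, h=17, rng=random):
    """Draw a fan-out set whose size k ~ Binomial(N, h/N), i.e. mean fan-out h;
    the k destinations are chosen uniformly from the N output ports."""
    k = sum(1 for _ in range(N) if rng.random() < h / N)  # Binomial(N, h/N)
    return set(rng.sample(range(N), k))
```

Over many draws the average fan-out concentrates around h = 17, consistent with the mean used in the load expression above.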

The delay performance shown in Fig. 5.5 is comparable with that in Fig. 5.1. This is

 because the two traffic models are quite similar. Specifically, they have the same

Bernoulli packet arrival, same average fan-out size of 17, and their fan-out sets are

all uniformly selected from all outputs. We skip the figure of delay vs fan-out

 because it has a similar trend as that in Fig. 5.2.

From the simulation results above, we can see that setting m=2 is sensible as

it ensures sufficiently low packet delay and high throughput. Besides, the extra

complexity involved in maintaining two multicast queues is marginal.

Fig. 5.5 Delay vs output load λ , with binomial mixing traffic

5.5  Chapter Summary

In this chapter, feedback-based two-stage switch is extended for scheduling

multicast traffic by slightly modifying its operations. The resulting switch not only

removes the centralized scheduler but also supports multicast traffic with a simple

unicast switch fabric. Simulation results showed that with packet duplication at both

input ports and middle-stage ports, the proposed multicast scheduling algorithm is

effective in cutting down both average delay and delay variation among different

copies of the same multicast packet.

Chapter 6

Multi-cabinet Implementation

6.1 Introduction

To accommodate the growth of the Internet traffic, high-speed routers consist

of a large number of linecards (e.g. 1152 linecards in Cisco CRS-1 [8]), resulting in

larger physical space and power requirements. Consequently, a multi-cabinet

implementation of routers is needed [50-51], where the distance between linecards

and (central) switch fabrics can be tens of meters.

In a single-cabinet implementation, the propagation delay between linecards

and switch fabrics is negligible. In a multi-cabinet implementation, due to the non-

negligible propagation delay, the requirement that occupancy vectors must arrive at

input ports within a single time slot will significantly lower the feedback-based

switch efficiency. This is illustrated in Fig. 6.1. Since the occupancy vector needs to

take the in-flight packet (in the first switch fabric) into account, it can only be

generated when the packet (at least partly) arrives. A dedicated feedback packet is

required, as piggybacking the occupancy vector onto a data packet is not possible. Finally,

an input port must wait for the occupancy vector to arrive before another packet can

 be scheduled for sending. From Fig. 6.1, we can see that the duration of a slot must

 be at least twice the propagation delay between linecards and the switch fabrics. But

in each slot, only a single packet can be sent. Since a switch fabric cannot be

reconfigured while there are in-flight packets, the slot duration is (roughly) the

duration that a switch configuration lasts.

Fig. 6.1 The timing diagram of switch with large propagation delay

In this chapter, we revamp the original feedback mechanism and design a new

 batch scheduler to solve this problem. The basic idea is to schedule and send multiple

packets while each switch configuration lasts. The key challenge is how to keep

the original close-to-100% throughput performance and ensure in-order packet

delivery.

The rest of the chapter is organized as follows. In the next section, we review

some related work on addressing the impact brought by propagation delay. In Section

6.3, the feedback mechanism is revamped and a new batch scheduler is designed. Its

 performance is evaluated in Section 6.4 and we conclude the chapter in Section 6.5.

6.2 Related Work

6.2.1 Multi-cabinet Implementation of Input-queued Switch

To improve the performance of input-queued switch under multi-cabinet

implementation, SRR (Synchronous Round Robin) scheduler is proposed [51]. SRR 

is a distributed and iterative scheme, in which one input port sends only one request

based on a cyclic, TDMA-like (Time Division Multiple Access) preferential

scheduling of VOQs. A request is selected by logically numbering the slots with an

incremental counter ranging from 0 to N −1. If the preferred VOQ is empty, then the

longest one is selected. Each output also has a preferential input to grant based on the

same TDMA-like cycle. If the preferred input request does not arrive, one request is

randomly selected to grant. An input port receives the grant for the current request

one round-trip-time later. While waiting for the grant to arrive, each input continually

sends its preferred request on a slot-by-slot basis. From [51], we can see that when

the traffic is bursty, the switch throughput is rather limited.

6.2.2 Multi-cabinet Implementation of Buffered Crossbar Switch

A multi-cabinet implementation of buffered crossbar switch is studied in [19],

where a large packet buffer size at each crosspoint is required to achieve high

throughput. This imposes further challenges to the implementation of buffered

crossbar switch. In [18], virtual crosspoint queues are introduced to alleviate the in-

fabric buffer requirement but the resulting switch gives poor throughput performance

under some traffic conditions.

6.3  Multi-cabinet Implementation of Feedback-Based Switch

6.3.1 Revamped Feedback Mechanism

Fig. 6.2 Multi-cabinet implementation of the feedback-based switch

Fig. 6.2 shows a multi-cabinet implementation of a feedback-based two-stage

switch. To increase the switch efficiency, we can send multiple packets in a slot. The

minimum duration of a slot is the round trip propagation time between linecards and

switch fabrics, or  RTT seconds. Let the (maximum) number of packets that can be

sent in each slot be x. The value of  x depends on packet size ( B bytes), RTT , and the

line rate ( R bps). Roughly, we have

x = RTT / packet_duration = (RTT · R) / (8B)

For a typical distance of 20 meters between linecards and switch fabrics, the

(minimum) slot duration is  RTT =200 ns. To transmit a packet of 200 bytes on a

40Gbps line, 40 ns are required. Reserving some guard times for control, we can

transmit x = 4 packets in a slot, as shown in Fig. 6.3. 
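The arithmetic above can be checked in a few lines (illustrative; `round()` reflects the "roughly" in the formula, and the guard-time reservation is applied on top of the result):

```python
def batch_size(rtt_s, rate_bps, pkt_bytes):
    """Maximum packets per slot: x ~= RTT / packet_duration = RTT*R / (8*B)."""
    return round(rtt_s * rate_bps / (8 * pkt_bytes))

# 20 m of cabling gives RTT = 200 ns; a 200-byte packet on a 40 Gb/s line
# takes 40 ns, so up to 5 packets fit per slot. Reserving guard times for
# control leaves x = 4, matching the text above.
```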

Fig. 6.3 Feedback operation in multi-cabinet implementation

But can we still keep the in-order packet delivery and high-throughput

 properties of a single-cabinet implementation of the feedback switch? With the

following modifications, the answer is yes. First of all, the buffer size at each middle-

stage port VOQ2( j,k ) is increased to x to accommodate up to x packet arrivals in each

time slot. The occupancy vector is expanded to N ·log x bits, as the size of each VOQ

requires log x bits.

The feedback operation is also revamped. Refer to Fig. 6.3. Assume at time

slot t input port i connects to output k via middle-stage port j. At the beginning of slot

t , (based on the occupancy vector received in the previous slot) input i uses a local

 batch scheduler (to be detailed in Section 6.3.2) to select up to x packets for sending.

A special header (destination report) is appended to the first packet sent, which

contains the destinations of the x packets (to be) sent in this slot. As each destination

requires log N bits to denote, the destination report consists of  x·log N bits.

While input ports are sending packets to middle-stage ports, middle-stage

 ports are sending packets to output ports in parallel. When a middle-stage port ( j) is

connected to an output port (k ), all backlogged packets (at most x) in VOQ2( j,k ) will

be completely cleared. (Backlogged packets refer to packets that arrived in previous time

slots, excluding packets arriving in the current slot.) In fact, due to the

 predetermined sequence of configurations used, middle port  j knows beforehand

which VOQ2( j,k ) will be cleared at which time slot.

Middle-stage port  j generates the occupancy vector   upon receiving the

destination report from input i. The destination report contains the destinations of all  

the packets to arrive in the following slot duration. Therefore, at the time the

occupancy vector is generated (in the middle of slot t ), it already looks ahead to get

the accurate VOQ status at the time the last packet sent in slot t arrives at middle-

stage port j (see Fig. 6.3). The occupancy vector is then appended to the next packet

sent in the second switch fabric for transmission, i.e. packet 3 in Fig. 6.3.

When the occupancy vector arrives at output k and is made available to input

k at the beginning of slot t +1, the input port batch scheduler selects and sends up to x 

 packets to middle-stage port j. It should be emphasized that the scheduling is based

on what will happen when the selected packets arrive at middle-stage port j (i.e. the

information in the occupancy vector received). Notably, the first packet from input k  

will arrive at middle-stage port  j right after the last packet from input i. The

bandwidth of the switch fabric is thus fully utilized.

6.3.2 Batch Scheduler Design

 Now we focus on the batch scheduler design. Without loss of generality, we

assume a LQF batch scheduler at input port k . Specifically, input k identifies the set

of VOQ2’s at middle-port j that have room for new packets; denote this set by S j. Find

the longest queue VOQ1(k ,h), such that VOQ2( j,h) belongs to S  j. Then the HOL

 packet at VOQ1(k ,h) is scheduled for sending. Update S  j and the size of VOQ1(k ,h).

Then the above process is repeated until x packets are scheduled (or no more packets

available).
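The LQF batch scheduler just described can be sketched as follows (illustrative names: `voq1_len[h]` is the length of VOQ1(k,h), and `room[h]` is the free space in VOQ2(j,h) as reported by the occupancy vector):

```python
def lqf_batch(voq1_len, room, x):
    """Select up to x packets: repeatedly pick the longest VOQ1 whose
    corresponding VOQ2 still has room, updating both after each pick."""
    schedule = []
    voq1 = dict(voq1_len)
    space = dict(room)
    while len(schedule) < x:
        candidates = [h for h in voq1 if voq1[h] > 0 and space.get(h, 0) > 0]
        if not candidates:
            break                        # no more schedulable packets
        h = max(candidates, key=lambda q: voq1[q])   # longest queue first
        schedule.append(h)
        voq1[h] -= 1
        space[h] -= 1
    return schedule
```

Each iteration mirrors one round of the description above: identify S j (the candidates), pick the longest eligible VOQ1, then update S j and the queue size before repeating.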

As with the scheduler in the single-cabinet implementation, we also include the

following refinements in the batch scheduler:

   Forced-zero-queue-size: If middle port  j will connect to output k ’ in the next

slot t +1, then in current slot t  middle port  j reports/feedbacks a zero queue

size for VOQ2( j,k ’). This is because VOQ2( j,k ’) is guaranteed to be exhausted

at the end of slot t+1 (i.e. all its packets will be sent to output k’). With

forced-zero-queue-size, the batch scheduler has more flexibility in selecting

 packets to send.

   Preventing underflow: Assume input port i connects to middle port j in time

slot t, and j will connect to output k’ in the next slot t+1. If flow(i,k’) has

 packets waiting in VOQ1(i,k’ ) but VOQ2( j,k’ ) does not have  x packets ready

for sending in slot t +1, an underflow will occur. To avoid the possible loss of 

efficiency due to underflow at VOQ2( j,k’ ), at slot t input i should always give

 priority to send packets from VOQ1(i,k’ ) to VOQ2( j,k’ ) first.
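The forced-zero-queue-size refinement amounts to one line on top of the normal occupancy report. A hypothetical sketch (names are illustrative):

```python
def occupancy_vector(voq2_len, k_next):
    """Report the VOQ2 lengths at middle port j, forcing a zero for the
    output k_next that j will connect to in the next slot: that queue is
    guaranteed to be drained by the end of slot t+1, so inputs may treat
    it as empty when scheduling."""
    vec = dict(voq2_len)
    vec[k_next] = 0          # forced zero: exhausted before new arrivals land
    return vec
```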

6.3.3 Some Properties

The new batch scheduler operates on the architecture of Fig. 6.2, which

adopts the same joint sequence of Fig. 2.2(a). In the following, we show that the

multi-cabinet implementation of the feedback-based scheduler can ensure in-order 

packet delivery and 100% throughput under a speedup of two, and supports asymmetric

reconfiguration:

   In-order packet delivery. Each flow having a constant middle-stage delay is a

sufficient condition for packet in-order delivery in two-stage switch (proven

in Chapter 3). While extending the feedback-based switch to multi-cabinet

implementation, we allow x packets to be sent in each time slot. The constant

middle port delay for packets of the same flow is still guaranteed by the

adopted joint sequence. The delay a packet experiences at a middle-stage port

is again bounded by [1, N ] slots. Without loss of generality, assume m (out of 

x) packets arriving at a middle-stage port j in the same time slot belong to the

same flow(i,k). Those m packets will be buffered at VOQ2(j,k) for the same amount of time until middle port j is connected to output k. Then, they will be delivered to output k (possibly together with (x - m) packets from other flows) in the same slot. So the constant middle-stage delay is still guaranteed, and thus the in-order delivery property is still ensured.

• 100% throughput under a speedup of two. For the multi-cabinet implementation with a batch size of x packets, we can treat each batch as a single aggregate packet. The multi-cabinet switch is then equivalent to a single-cabinet switch. In other words, the propagation delay between linecards and switch fabrics does not reduce the throughput performance of a multi-cabinet switch.

• Asymmetric reconfiguration. In Fig. 6.3, when the last bit of the x-th packet arrives at the middle-stage port, the first-stage switch fabric can start to reconfigure. When the last bit of the x-th packet departs the second switch fabric, the second switch fabric can start to reconfigure. In other words, the reconfiguration of the second fabric can start before the last bit of the x-th packet arrives at the output port. For optical switch fabrics with a non-negligible reconfiguration overhead, such pipelined packet transmission and asymmetric reconfiguration can be very efficient.

• Cutting down the communication overhead. In the original feedback-based switch, the communication overhead for sending a single packet is N bits. From Fig. 6.3, we can see that only a single occupancy vector of N·log x bits is required for the x packets sent by the batch scheduler. The per-packet communication overhead is thus reduced from N bits to (N·log x)/x bits.
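As a quick check of this saving, the per-packet overhead can be computed directly (a minimal sketch; the function name is ours, and x is assumed to be a power of two so that log x is a whole number of bits):

```python
from math import log2

def per_packet_overhead_bits(n, x):
    """Per-packet feedback overhead: one occupancy vector of n*log2(x)
    bits is shared by a batch of x packets (x a power of two, x > 1)."""
    return n * log2(x) / x
```

For example, with N = 32 a batch of x = 4 packets cuts the per-packet overhead from 32 bits to 16 bits.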


6.4  Performance Evaluations

In this section, we study the performance of our multi-cabinet implementation of the feedback-based switch by simulations. In the following, we only present simulation results for a switch of size N=32, although similar conclusions apply to other sizes. As the duration of a time slot may differ when different propagation delays are considered (see Figs. 6.1 and 6.3), the delay performance is measured in time units, where each time unit equals the transmission time of one packet at line rate. In our simulations, we use the same traffic models as in Chapter 2.4, i.e. uniform, uniform bursty and hot-spot. We assume the propagation delay between linecards and switch fabrics is y, which varies from 1 to 2 time units. For simplicity, we ignore the overheads for switch reconfiguration, scheduling, etc. Three scheduling algorithms are compared:

• LQF without batch scheduling. When the propagation delay is y time units, we denote the algorithm by LQF/y. The operation of LQF/y is based on Fig. 6.1, where only one packet can be sent in each slot. In other words, this is a direct extension of the single-cabinet case.

• LQF with batch scheduling (as shown in Fig. 6.3). When the propagation delay is y, we denote the algorithm by B-LQF/y; the number of packets that can be sent in each time slot is 2y.

• SRR algorithm [51]. When the propagation delay is y, we denote SRR by SRR/y. We regard SRR as a "generalization" of iSLIP [15] for multi-cabinet implementation; in other words, SRR serves as a benchmark for single-stage input-queued switches. Note that we do not compare with LQF-Byte-focal [23] and CR [29] because they cannot be used for multi-cabinet


implementation.

6.4.1 Performance under Uniform Traffic

From Fig. 6.4, we can see that due to the inefficiency caused by the propagation delay, LQF/y can only attain up to 25% and 50% throughput for y=2 and y=1, respectively. With B-LQF/y, close-to-100% throughput can be obtained. Note that the average middle-stage port delay is still 16.5 slots. Since the duration of a slot is 2y time units, the average middle-stage port delay is 33 time units for y=1 and 66 for y=2.
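The conversion from slots to time units is simple arithmetic (a minimal sketch; the function name is ours):

```python
def middle_delay_time_units(delay_slots, y):
    """Middle-stage port delay in time units: with batch scheduling, the
    slot duration is 2*y time units, where y is the propagation delay."""
    return delay_slots * 2 * y
```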

Fig. 6.4 Delay vs input load p, under uniform traffic for multi-cabinet


6.4.2  Performance under Uniform Bursty Traffic

In Fig. 6.5, our B-LQF/y again yields close-to-100% throughput under bursty traffic. Despite the fact that the middle-stage packet delay increases with the slot duration, it is interesting to observe that when the input load p > 0.94, B-LQF/2 starts to outperform B-LQF/1, though only slightly. The reason is as follows. In a time slot, each input port can send up to 2y packets to a middle port with B-LQF, so packets in B-LQF/2 tend to have a higher chance of entering the middle port than in B-LQF/1. The earlier a packet enters the middle port, the less input port delay it experiences. So with B-LQF/2, packets tend to experience less input port delay. Under heavy bursty loading, the input port delay dominates the overall delay performance. For B-LQF/2, the drop in input port delay starts to outweigh the increase in middle port delay at p = 0.94.

Fig. 6.5 Delay vs input load p, under bursty traffic for multi-cabinet


6.4.3  Performance under Hotspot Traffic

In Fig. 6.6, we can again see that B-LQF/y yields close-to-100% throughput and is significantly better than its non-batch-scheduling counterpart.

Fig. 6.6 Delay vs input load p, under hot-spot traffic for multi-cabinet

6.5  Chapter Summary

In a multi-cabinet implementation of the feedback-based switch, due to the non-negligible propagation delay between linecards and switch fabrics, the requirement that occupancy vectors must arrive at output/input ports within a single time slot significantly lowers the switch efficiency. In this chapter, we revamped the original feedback mechanism and designed a new batch scheduler to address this problem. We showed that with the multi-cabinet implementation, the refined feedback-based two-stage switch still guarantees in-order packet delivery and provides close-to-100% throughput performance.


Chapter 7

Scheduling Inadmissible Traffic Patterns

7.1 Introduction

In the previous chapters, the feedback-based switch was designed with a focus on handling admissible traffic patterns (i.e. neither the input ports nor the output ports are over-subscribed), as in [21-32]. For any admissible traffic pattern, as long as the switch is stable, all packets arrive at the outputs with bounded delays. In this case, fairness in throughput is not an issue. But in practice, admissible traffic patterns cannot be ensured, as an output port can experience oversubscription from time to time. Therefore, a router should also be designed to efficiently handle


inadmissible traffic patterns.

It is interesting to note that under an inadmissible traffic pattern where some output ports are over-subscribed, the overall throughput of the feedback-based switch is not affected, as the over-subscribed outputs are always fully occupied (due to the work-conserving nature of the port scheduler used). However, different input ports will have an unfair throughput share of the oversubscribed outputs. In other words, the feedback-based switch suffers from the ring-fairness problem: for packets going to the same over-subscribed output (e.g. output 3 in Fig. 7.1), the further-away "up-stream" input ports (e.g. input 0 in Fig. 7.1) can throttle the nearby "down-stream" input ports (e.g. input 3 in Fig. 7.1).

Fig. 7.1 A 4×4 feedback-based switch with output port 3 oversubscribed by inputs 0, 1, 2 and 3.

To address the ring-fairness issue for over-subscribed outputs, a fair scheduler is designed for the feedback-based switch in this chapter. The basic idea of the fair scheduler is to reserve middle-stage buffers for the flows whose input VOQs exceed a pre-determined threshold Q. The bandwidth of an over-subscribed output is then allocated to those input VOQs (exceeding Q) using a simple round robin (RR) scheduler. We show that the optimal value of the threshold is equal to the switch size


(Q=N) and that the resulting algorithm meets the max-min fairness criterion.

The rest of the chapter is organized as follows. In the next section, we review related work on fair scheduling algorithm design. In Section 7.3, our fair scheduling algorithm is proposed. In Section 7.4, we show that the proposed algorithm satisfies the max-min fairness criterion. Its performance is then evaluated in Section 7.5 by simulations. Finally, we conclude the chapter in Section 7.6.

7.2  Related Work

In the literature, fair schedulers are designed to handle both admissible and inadmissible traffic patterns. For inadmissible traffic patterns, algorithms can be further divided into two types: those assuming over-subscribed output ports only, and those allowing both over-subscribed input and output ports.

7.2.1 Fair Scheduling under Admissible Traffic

In [53], a centralized algorithm called GPS-SW (Generalized Processor Sharing in network Switch) is proposed. Under the assumption that the traffic is admissible, GPS-SW uses a matrix-scaling approach to maximize throughput while distributing the excess available bandwidth in a fair fashion. However, the example in [54] shows that under admissible traffic, achieving both max-min fairness and 100% throughput at the same time is impossible. For the sake of fairness under admissible traffic, GPS-SW sacrifices its throughput performance.

7.2.2 Fair Scheduling with Over-Subscribed Output Ports Only


The F-MWM (Fair-MWM) algorithm, proposed in [55] for input-queued switches, is based on the assumption that output ports can be oversubscribed but input ports cannot. Therefore, F-MWM only considers fairly allocating the bandwidth of the output ports. As soon as an (input) VOQ's length exceeds a pre-set threshold, the VOQ is moved to a congested list. Each VOQ in the congested list is served exactly once every N time slots. The VOQs that are not in the congested list are scheduled using LQF.

TFQA (Tracking Fair Quota Allocation [56]) is a variant of F-MWM that applies to buffered crossbar switches. Unlike F-MWM, it maintains an adaptive threshold for each input port. The VOQs that exceed the threshold are included in a primary class. Dual round robin pointers in each input port are responsible for scheduling the packets: one pointer for the primary class and another for all VOQs. Higher priority is always given to the primary class.

7.2.3 Fair Scheduling with Over-Subscribed Input and Output Ports

In [54, 52], input and output ports can both be oversubscribed, so the bandwidth of all inputs and outputs must be fairly allocated. The operation of the algorithm for input-queued switches [52] can be divided into two main phases. In the first phase, only the output port bandwidth is considered. At the end of the first phase, the only possible bottlenecks for the flows are the input ports. In the second phase, the algorithm allocates bandwidth at the input ports in a max-min fair fashion, thus resulting in an allocation that is overall max-min fair.

AMFS (Adaptive Max-min Fair Scheduling) [54] is based on the architecture


of the buffered crossbar switch. AMFS maintains two systems: a virtual system that exactly emulates WF2Q+ (Worst-case Fair Weighted Fair Queueing+ [57]), and a real system, AMFS itself, that is responsible for actually scheduling the flows. The virtual WF2Q+ calculates per-flow virtual start and finish times, which AMFS attempts to emulate. It has been proven that AMFS can sustain 100% throughput for admissible traffic and ensure max-min fairness for inadmissible traffic without any speedup. However, this proof assumes an infinite crosspoint buffer size. Furthermore, the algorithm incurs the overhead of maintaining the virtual WF2Q+ system.

7.3  Our Approach

Like [55-56], we consider inadmissible traffic patterns with oversubscribed output ports only. This assumption is reasonable, as the input ports can indeed avoid being over-subscribed [55] due to the physical line-rate constraint on each ingress port. An output port, however, is responsible for processing the egress traffic of N incoming flows, so output port bandwidth over-subscription is difficult to avoid.

First of all, an overload vector {wi} (i=0,1,…,N-1) and a reservation vector {qi} (i=0,1,…,N-1) are required for conveying reservation requests and grants at each middle-stage port j. All the elements of the two vectors are initialized to -1. If wi = l and l > -1, it indicates that input port i has Q or more packets destined for output l. If qi = m, then VOQ2(j,i) of the current middle-stage port j is reserved for input port m (for sending a packet to output port i). In each time slot, based on the values of {wi} and {qi}, the following operations are carried out at each


middle port j in parallel:

• Sending a reservation/overload request. For any input port m, among its VOQs of length ≥ Q, one VOQ1(m,l) is selected based on a round robin (RR) scheduler, and the identity of VOQ1(m,l) is piggybacked (using log N bits) onto the current packet transmission to middle port j. Middle port j then updates its overload vector so that wm = l.

• Determining the winner. Assume a middle-stage port j connects to an output port k. Middle port j examines its {wi}. If all wi ≠ k (i=0,1,…,N-1), it makes sure that the reservation vector {qi} has qk = -1, which means no reservation of VOQ2(j,k) is required (as none of the input ports has Q or more packets for output k). If some wi = k (i=0,1,…,N-1), then one of them, say wl = k, is selected based on an RR scheduler, and qk is set to l. This indicates that VOQ2(j,k) (of middle port j) is reserved for input port l. Then all wi = k (i=0,1,…,N-1) are reset to wi = -1 to indicate that the corresponding reservation requests for VOQ2(j,k) have been processed.

• Ensuring a reservation is honored. Before middle-stage port j sends its occupancy vector to its connected output port k, j first examines its reservation vector {qi}. If there is any qi = m, where m > -1 and m ≠ k, middle-stage port j knows that VOQ2(j,i) is not available, as it has been successfully reserved by input port m. Therefore, the feedback bit for VOQ2(j,i) in the occupancy vector is overwritten to 1. This ensures that VOQ2(j,i) can only be used by input port m.

• Input port scheduling. Any VOQ1 that sent a reservation request at time slot t is given the highest priority for scheduling at time slot t+N. Otherwise, the HOL packet of the longest VOQ1 whose corresponding middle-stage VOQ2 is empty is sent (as in the original feedback-based switch with the LQF port scheduler).
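The middle-port bookkeeping described by these operations can be sketched as follows (a simplified model of a single middle port under our own naming; the RR tie-breaking and vector encodings are assumptions, not the thesis's exact implementation):

```python
class MiddlePortFairScheduler:
    """Per-middle-port state of the fair scheduler: an overload vector w
    and a reservation vector q, both of length n, initialized to -1."""

    def __init__(self, n):
        self.n = n
        self.w = [-1] * n   # w[i] = l: input i has >= Q packets for output l
        self.q = [-1] * n   # q[k] = m: VOQ2(j,k) is reserved for input m
        self.rr = 0         # round-robin pointer for granting

    def record_request(self, input_port, output_port):
        # Piggybacked reservation/overload request from an input port.
        self.w[input_port] = output_port

    def grant(self, k):
        """When this middle port connects to output k, pick one requester
        for VOQ2(j,k) in round-robin order; return -1 if there is none."""
        requesters = [i for i in range(self.n) if self.w[i] == k]
        if not requesters:
            self.q[k] = -1
            return -1
        winner = min(requesters, key=lambda i: (i - self.rr) % self.n)
        self.q[k] = winner
        self.rr = (winner + 1) % self.n
        for i in requesters:            # requests for output k processed
            self.w[i] = -1
        return winner

    def occupancy_mask(self, k, occupied):
        """Overwrite the feedback bit of any reserved VOQ2(j,i) to 1 so
        that only the reserving input port can use it."""
        return [1 if (self.q[i] != -1 and self.q[i] != k) else occupied[i]
                for i in range(self.n)]
```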

An example: Consider a 4×4 feedback-based switch (with the fair scheduler) configured by the joint sequence of Fig. 2.2(a). Further assume that at time slot 0, the lengths of VOQ1(0,0) and VOQ1(0,2) exceed the threshold of 4, so input 0 selects one of them (say VOQ1(0,2)) based on RR for sending a reservation request to its currently connected middle port 1. Both input port 0 and middle port 1 record the identity of VOQ1(0,2). When middle port 1 connects to output port 2 at time slot 1, it checks the reservation requests it has received for output port 2 and selects one (say that of VOQ1(0,2)) to grant based on RR. The middle-port VOQ2(1,2) can then only be used by input port 0 in the following 4 time slots. Meanwhile, VOQ1(0,2) is given the highest priority for scheduling at time slot 4.

An input port generates a reservation request when a VOQ1 exceeds a pre-determined threshold Q. The delay between an input port generating a reservation request and knowing the result is N time slots (one round-trip time for the joint sequence). Within these N time slots, each input port can send up to N packets. If Q is smaller than N packets and the reservation is successful, then by the time the input port knows the result, the corresponding VOQ1 may be empty, as the backlogged packets may have been exhausted while waiting for the result to arrive. This would create a wasted slot. (On the other hand, if the reservation fails, no slot is wasted even though the corresponding VOQ1 may still be empty by then.) If Q ≥ N, it is guaranteed that there will be at least one packet in the queue to make use of the reserved slot. However, a large Q would adversely affect the packet


delay performance. Therefore, we use Q = N in our proposed fair scheduler to obtain the best delay-throughput performance.
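The choice Q = N can be checked with a little arithmetic: a VOQ1 that exceeded the threshold held at least Q+1 packets, and at most N of them can drain during the N-slot round trip (a minimal sketch; the function names are ours):

```python
def packets_left_on_grant(q_threshold, n):
    """Worst-case VOQ1 backlog when the reservation result returns:
    at least q_threshold + 1 packets at request time, minus at most
    n packets drained (one per slot) over the n-slot round trip."""
    return max(q_threshold + 1 - n, 0)

def reservation_can_be_wasted(q_threshold, n):
    # A reserved slot can be wasted only if the queue may be empty.
    return packets_left_on_grant(q_threshold, n) == 0
```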

7.4  Max-min Fairness Criterion

In the following, we show that our fair scheduler satisfies the max-min fairness criterion. First, we borrow two definitions from [52, 58]:

Definition 4: The allocation vector {ai} is said to be feasible if and only if:
• Each entity receives an allocation greater than or equal to zero; that is, for all i, ai ≥ 0.
• The total allocated resource is less than or equal to the available resource U; that is, ∑ai ≤ U.

Definition 5: For the demand vector {bi}, the allocation vector {ai} is said to be max-min fair if:
1. It is feasible.
2. No entity receives an allocation greater than its demand; that is, for all i, ai ≤ bi.
3. For all i, the allocation of entity i cannot be increased while satisfying the above two conditions without reducing the allocation of some other entity j for which aj ≤ ai.
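An allocation satisfying Definition 5 can be computed by the classical progressive-filling (water-filling) construction, shown here for reference (a standard algorithm, not part of the thesis's scheduler; names are ours):

```python
def max_min_fair_allocation(demands, capacity):
    """Progressive filling: in each round, split the remaining capacity
    equally among the unsatisfied entities, capping each at its demand."""
    n = len(demands)
    alloc = [0.0] * n
    remaining = capacity
    unsatisfied = set(range(n))
    while unsatisfied and remaining > 1e-12:
        share = remaining / len(unsatisfied)
        for i in list(unsatisfied):
            give = min(share, demands[i] - alloc[i])
            alloc[i] += give
            remaining -= give
            if alloc[i] >= demands[i] - 1e-12:
                unsatisfied.discard(i)
    return alloc
```

For demands {0.2, 0.5, 1.0} and capacity U = 1, progressive filling yields {0.2, 0.4, 0.4}: entity 0 is demand-limited, and the other two split the remainder equally.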

As long as an algorithm meets the three conditions above, it satisfies the max-min fairness criterion. Note that in our fair scheduler, the demand bi is the traffic load from input port i to an over-subscribed output port j. Let the capacity of


output port j be U, i.e. the available resource is U. Assume the fair scheduler allocates U among the input ports, with input i receiving ai (i=0,1,…,N-1). Obviously, ai ≥ 0 and ∑ai ≤ U (i=0,1,…,N-1), so {ai} is feasible (condition 1). By setting the threshold for generating a reservation request at Q=N, the fair scheduler does not waste any reserved slot, so ai ≤ bi is ensured for all i (condition 2). In the following, we focus on condition 3, i.e. we try to increase some bandwidth allocation ai and see how this affects the other inputs. Assume the switch has been "warmed up". Let ci be the number of times that input i's VOQ(i,j) exceeds the threshold Q during L time slots. We have

ci ≤ L for all i (i=0,1,…,N-1).  (7.1)

If input i has a larger ci than the ck of input k, then according to the fair scheduler, input i generates more reservation requests and thus gets a larger share of output j's bandwidth (as output j is over-subscribed). That is,

ai ≥ ak, if ci ≥ ck.  (7.2)

Taking a closer look at ci, there are two possible cases:

• ci < L: In one or more time slots, the length of VOQ(i,j) is less than the threshold Q. The traffic load bi is then fully satisfied by the bandwidth allocation ai, i.e. lim_{L→∞} ci/L < 1 implies ai = bi. Therefore, ai cannot be further increased, because ai already conforms to condition 2.

• ci = L: The length of VOQ(i,j) always exceeds the threshold Q. This indicates that the traffic load bi cannot be satisfied by the bandwidth allocation ai, because output port j is over-subscribed:

∑ai = U  (7.3)

From (7.1), we have:


ci ≥ ck for all k (k=0,1,…,N-1).

Combining this with (7.2), we get:

ai ≥ ak for all k (k=0,1,…,N-1).  (7.4)

To increase ai, we would have to reduce some ak (k=0,1,…,N-1) due to (7.3). But then we reduce the allocation of some input port k for which ai ≥ ak (by (7.4)), which proves that condition 3 is ensured. Combining the proofs of all three conditions in Definition 5 above, our fair scheduler satisfies the max-min fairness criterion. Note that here we only focus on max-min fairness; proportional fairness can also be supported by a minor revision of the fair scheduler.

7.5  Performance Evaluations

In this section, we focus on the fairness performance in allocating the bandwidth of an over-subscribed output port, using the original feedback-based switch (Feedback) and the proposed fair scheduler (Feedback-F). (Note that for admissible traffic patterns, the fair scheduler gives the same performance as the original feedback-based switch, and those results are thus not shown in this chapter.)

7.5.1 Server-client Traffic Model

The server-client traffic model in [59] is first adopted for generating inadmissible traffic. At each time slot, for every input, a packet arrives with probability p. Linecards are partitioned into two types: a server (i.e. linecard 0) and N-1 clients. The server transmits packets with equal probability to all clients. Each client transmits 1/3 of its traffic toward the server and 2/3 to the other N-2 clients


with equal probability. The server is a hotspot, and when N=32, the amount of traffic going to the server is given by

λ = p(N-1)/3 = 31p/3.
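Under this model, the server load follows directly (a minimal sketch; the function name is ours):

```python
def server_load(p, n):
    """Load on the server (linecard 0): each of the n-1 clients sends
    one third of its arrivals (rate p) toward the server."""
    return p * (n - 1) / 3
```

So for N = 32, the server becomes over-subscribed (λ > 1) once p exceeds 3/31 ≈ 0.097.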

Fig. 7.2 shows the bandwidth shares of three representative flows, (1,0), (9,0) and (2,0), at the server, versus the total server loading λ. Note that to reach output port 0, the middle-stage port delays for flows (1,0), (9,0) and (2,0) are 32, 8 and 1 time slots, respectively. The server becomes over-subscribed (i.e. the traffic becomes inadmissible) when λ > 1. With the original feedback-based switch, flow(2,0) (yellow) and flow(9,0) (purple) are quickly throttled by flow(1,0) (light blue), due to the ring-fairness problem. With Feedback-F, the three flows equally share the oversubscribed server bandwidth (together with the remaining 28 flows not shown), due to its proven capability of max-min fair allocation.

Fig. 7.2 Output 0's throughput vs its output load λ, under server-client traffic.


7.5.2 Attack-traffic Scenario

We also emulate an attack-traffic scenario, where output 0 is gradually dominated by traffic coming from input 1. The detailed traffic model is as follows. At each time slot, for each input port, a packet arrives with probability p. From input port 1, an arriving packet goes to output port 0 with probability 0.5 (we call this the attack-flow), and the remaining 0.5 probability is equally shared by all other output ports. From any other input port, an arriving packet goes to the N-1 output ports with equal probability. Therefore, at the over-subscribed output 0, when N=32, the output load λ is:

λ = 0.5p + p·(N-2)/(N-1) = 91p/62.
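The load on output 0 can likewise be checked (a minimal sketch; the function name is ours):

```python
def attack_output_load(p, n):
    """Load on output 0: input 1 sends half of its arrivals (rate p)
    there, while each of the n-2 other inputs that can reach output 0
    contributes p/(n-1)."""
    return 0.5 * p + p * (n - 2) / (n - 1)
```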

Fig. 7.3 Output 0's throughput vs its output load λ, under attack traffic

From Fig. 7.3, as the output load λ increases, with Feedback the throughput shares of flow(2,0) (yellow) and flow(9,0) (purple) quickly drop to 0, while the


throughput of the attack-flow(1,0) (light blue) increases linearly. When Feedback-F is used, the attack-flow(1,0) is regulated, due to the max-min fair allocation. Specifically, the attack-flow(1,0) can only make use of the excess bandwidth (if any) left by the other flows with smaller traffic demands, i.e. flow(i,0) for i=2,3,…,31. From Fig. 7.3, we can see that the malicious flow can be identified and punished by the proposed fair scheduler.

7.6  Chapter Summary

For an inadmissible traffic pattern where some outputs are over-subscribed, the feedback-based two-stage switch suffers from the ring-fairness problem. To this end, a fair scheduler for the feedback-based switch was designed in this chapter. We adopted a simple idea: reserving a middle-stage port for any input VOQ exceeding a threshold Q. The bandwidth of over-subscribed outputs is then allocated to the input VOQs (exceeding Q) on an RR basis. We proved that the resulting algorithm satisfies the max-min fairness criterion. Indeed, the simulation results also demonstrated the max-min fairness of the proposed fair scheduler.


Chapter 8

An Optical Implementation of Feedback-Based Switch 

8.1 Introduction

For routers with an electronic switch fabric (e.g. Fig. 1.6), packets must go through additional O-E-O conversions while being switched from one linecard to another. This not only limits the router speed, but also increases the difficulty of designing a high-speed electronic switch fabric. In this chapter, we propose an optical implementation of our (electronic) feedback-based switch that enables a packet to be switched all-optically from one linecard to another. We call the resulting switch the load-balanced optical switch (LBOS).


It should be noted that despite all the advantages of optics [60-61], implementing an all-optical router is still far from practical because of the immaturity of optical processing and buffering technologies. In this chapter, we focus on designing hybrid electro-optic routers, where packet buffering and table lookup are carried out in the electrical domain, and switching is done optically.

The rest of this chapter is organized as follows. In the next section, we review related work on optical switches used in hybrid electro-optic routers. In Section 8.3, the design and operation of LBOS are detailed. In Section 8.4, LBOS is enriched and refined like the electrical feedback-based switch. Simulation results are presented in Section 8.5, and we conclude the chapter in Section 8.6.

8.2 Related Work

There are various efforts in designing efficient optical switches for high-speed routers. Notably, in the 100 Tb/s router project [27], an optical implementation of a load-balanced electronic switch [21] is considered. A three-stage Clos network architecture is adopted, where the center stage is implemented using optical MEMS [62]. But all-optical packet transmission from an input linecard/port to an output linecard/port is not possible, as packets must be temporarily stored and processed in the electrical domain between different stages of the Clos network. Besides, to tackle the packet mis-sequencing problem, a large re-sequencing buffer of N²+1 packets is required at each output port, where N is the switch size.


Recently, Fasnet [63], an optical switch fabric comprising N switch linecards connected by two counter-rotating WDM fiber rings, was proposed. The notion of counter-rotating WDM fiber rings originally appeared in the design of metro networks [64], and was further refined in [59, 65-68]. In Fasnet [63], one ring is used for transmission, while the other is for reception. The N wavelengths in the transmission ring are switched to the reception ring at a folding point between the two rings. Only a special input port (called the master input) can generate a frame header (called a locomotive). Other input ports can append their packets to the end of a frame as its frame header passes by. At each input port, the maximum number of packets that can be attached after one frame header is limited by a fairness quota of Y packets. Y can be accumulated, but has an upper bound of U×Y, where the values of Y and U must be given in advance. Unlike [27], this ring-based switch architecture allows all-optical packet transmission from one linecard to another. But its delay-throughput performance is rather limited, and is further aggravated by the fairness algorithm adopted.

8.3 Load Balanced Optical Switch (LBOS)

8.3.1 Switch Architecture

Our load-balanced optical switch (LBOS) is targeted at the all-optical switching of a packet from one linecard to another. As depicted in Fig. 8.1, LBOS consists of N linecards connected by an N-wavelength WDM fiber ring. Each linecard i has two ports, input i and output i. Linecard/output i is configured to receive (only) on its dedicated wavelength channel λi. To send a packet to linecard j, linecard/input i needs to transmit the packet on channel λj when λj is idle.


Fig. 8.1 A 4×4 load-balanced optical switch.

Fig. 8.2 The internal structure of linecard i.

The internal structure of linecard i is similar to that used by Fasnet [63], and is shown in Fig. 8.2. For simplicity, the electrical buffers implementing the virtual output queues (VOQ(i,k)'s) at each input port are not shown. A linecard has three major modules: a receiver on channel λi, a "tunable" transmitter (implemented using a fixed laser array) and a wavelength monitor. In Fig. 8.2, the EDFA (Erbium-Doped Fiber Amplifier) is used to compensate for the optical signal loss en route. A

filter drops wavelength  λi from the fiber and passes all other channels to a splitter.

The dropped λi enters the high bit-rate burst-mode receiver. The splitter

taps out a fraction of light and feeds it to the monitor module. The remaining signals

in the fiber will go through a FDL (Fiber Delay Line) of  t d seconds, where t d is the

time required for the monitor to identify an idle channel (detailed in the next

 paragraph) and the transmitter to start sending a packet onto a selected idle channel.

For the fraction of light entering the monitor, a demultiplexer converts it into

 N -1 separate  λ’s, and directs them to the dc-coupled photodiode array. A threshold

comparator is used to detect idle wavelength channels. Among all the idle channels,

the linecard controller identifies its longest  VOQ(i, j), and the head of line packet

from VOQ(i, j) is sent using the transmitter module. (We call it the LQF scheduler.)

The transmitter module consists of a fixed laser array, where laser  λ j will be used to

send the packet destined to linecard j. (A fixed laser array can be more cost-effective

than a single fast tunable laser.) Finally, the transmitted packet is merged back to the

fiber ring by the optical coupler (in Fig. 8.2) and continues its journey to the next

linecard.
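The LQF selection just described can be sketched in a few lines; voq_len and idle are assumed, illustrative data structures, not the thesis's implementation:

```python
# Hedged sketch of the LQF scheduler described above: among the idle
# wavelength channels reported by the monitor, pick the destination j whose
# VOQ(i, j) is longest and send its head-of-line packet on laser j.
def lqf_select(voq_len, idle, i):
    """voq_len[j] = occupancy of VOQ(i, j); idle = set of idle channels."""
    candidates = [j for j in sorted(idle) if j != i and voq_len.get(j, 0) > 0]
    if not candidates:
        return None                       # nothing can be sent this slot
    return max(candidates, key=lambda j: voq_len[j])
```

With VOQ lengths {1: 2, 2: 7, 3: 4} and channels λ1, λ2, λ3 idle, linecard 0 would transmit the head-of-line packet of VOQ(0,2) on laser λ2.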

8.3.2 Switch Operation

Let the packet duration, i.e. the amount of time required to send a packet

(onto a wavelength channel), be t  pkt seconds. We define the duration of a time slot to

 be t d+t  p seconds, where t d is the propagation delay of the FDL in Fig. 8.2 and t  p is the

 propagation delay of the fiber from the coupler in Fig. 8.2 to the drop filter of the

next linecard. Assume the whole system is synchronized, and in each time slot, at


most one packet can be transmitted and/or received by each linecard. For the proper 

operation of the switch, we must have t d ≥ t  pkt and t  p ≥ t  pkt. This is illustrated by Fig.

8.3, where linecard i starts to receive a packet at the beginning of slot t and it takes

t  pkt seconds to receive the entire packet. Meanwhile, the monitor identifies the idle

channels, and a packet is sent onto the selected idle channel. The optical coupler adds

the packet back to the fiber ring at t d seconds after the beginning of the current slot. It

takes another t  p seconds for the first bit of the packet to arrive at linecard i+1. This

marks the end of time slot t  and the beginning of slot t +1. It is easy to see that a

 packet sent by linecard i will arrive at linecard j after ( j –  i) mod N time slots.
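The hop-count rule above can be captured directly:

```python
# Slots for a packet to travel from linecard i to linecard j on the N-node
# ring, as stated in the text: (j - i) mod N.
def hop_delay(i, j, N):
    return (j - i) % N
```

In the 4×4 switch of Fig. 8.1, a packet from linecard 0 to 3 takes hop_delay(0, 3, 4) = 3 slots, while one from linecard 3 to 1 takes hop_delay(3, 1, 4) = 2.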

Fig. 8.3 Timing diagram for load balanced optical switch (LBOS).

From Fig. 8.3, we can see that in each time slot, the transmitter is idle in the

first t d seconds, whereas the receiver and monitor are idle for the last t  p seconds. As

only a single packet is sent/received in each slot, the efficiency of LBOS is t  pkt/(t d+t  p),

or at most 50% (assuming t  p=t d=t  pkt). To enhance the efficiency, transmitter, receiver 


and monitor can operate in parallel for pipelined packet sending, receiving and

scheduling, as shown in Fig. 8.4. Specifically, in the first half of time slot t , the

transmitter can send a packet scheduled in the second half of slot t  –1. In the second

half of slot t, the receiver can receive a packet sent by some linecard in the first half of an earlier time slot, and meanwhile, the monitor can schedule another packet for sending in the first half of slot t+1. In other words, two packets can be received and transmitted in each time slot. (We call it the pipelined LBOS when we need to distinguish it from the original LBOS.)
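The efficiency argument above can be checked numerically; this helper simply evaluates t_pkt/(t_d + t_p), with packets_per_slot = 2 modeling the pipelined variant (an assumed parameterization):

```python
# Fraction of a (t_d + t_p)-second slot spent carrying payload.
# packets_per_slot = 1 for the basic LBOS, 2 for the pipelined variant.
def efficiency(t_pkt, t_d, t_p, packets_per_slot=1):
    return packets_per_slot * t_pkt / (t_d + t_p)
```

With t_p = t_d = t_pkt, efficiency(1, 1, 1) gives 0.5 for the basic LBOS and efficiency(1, 1, 1, 2) gives 1.0 for the pipelined one, matching the text.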

Fig. 8.4 Timing diagram for pipelined packet sending and receiving.

From the operations of the LBOS above, we can see that LBOS effectively

 balances the loading in the ring network by spreading (i) packets going to different

destinations over different wavelength channels (i.e. space/wavelength domain load

 balancing), and (ii) packets going to the same destination over different time slots (i.e.

time domain load balancing). In next sub-section, we show that our LBOS is an


optical counterpart of the load-balanced electronic switch architecture in [32].

8.3.3 Equivalence to Load-Balanced Electronic Switches

Consider the basic LBOS operating based on the timing diagram in Fig. 8.3.

If we treat the fiber ring as a FDL, then the ring network “buffers” a packet from

linecard i to  j for exactly ( j –  i) mod  N time slots. Since one round trip time (RTT)

along the ring is  N  time slots, a specific wavelength channel on the ring can

carry/buffer up to N in-flight optical packets. With N wavelengths, the fiber ring can buffer up to N^2 packets. Therefore, (optical) packets are "buffered" as they propagate

along the fiber ring in different wavelengths, which exactly mimics the buffering

services rendered by the middle-stage VOQ2( j,k )’s in LBES (Fig. 2.1). In a specific

time slot, the channel status (i.e. idle or not) of all the wavelengths passing by, which

is equivalent to the occupancy of VOQ2( j,k )’s in Fig. 2.1, will be conveniently

detected by the wavelength monitor on each linecard – the need for dedicated

feedback packets/vectors is thus removed.

Fig. 8.5 A joint sequence in load-balanced switch.

Assume the LBES (with a single-packet buffer at each VOQ2(j,k)) is configured by the sequence of configurations shown in Fig. 8.5. Then we can easily find a one-to-one mapping between every instance of the sequence in Fig. 8.5 and the


corresponding operation on the ring network in Fig. 8.1. Due to the equivalence

relationship, our LBOS inherits all the nice features of the LBES [32-33], such as

 being scalable, distributed, and yielding close-to-100% throughput and low average

packet delay.

8.4 Extensions and Refinements of LBOS

8.4.1 Cutting down the Average Delay by Reconfiguration

In LBOS, the delay experienced by a packet is the summation of the queuing

delay at the input linecard and the propagation delay between the input and output

linecards. Due to the way linecards are connected in a ring, the propagation delay is

 predetermined and fixed. For example, in Fig. 8.1 each packet of flow(0,3) requires 3

time slots from linecard 0 to linecard 3. Then for a given traffic matrix { λi,j}, the

average packet propagation delay is:

H = Σi Σj λi,j · hi,j   (8.1)

where λi,j is the arrival rate and hi,j is the propagation delay of flow(i,j), respectively. We have 0 ≤ λi,j ≤ 1 and 0 ≤ hi,j ≤ N-1 for all i, j ∈ [0, N-1]. In LBOS, we assume that flow(i,i) does not enter the ring, and thus hi,i = 0.
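Eq. (8.1) can be evaluated for the default ring placement, where hi,j = (j - i) mod N; lam is an assumed N×N arrival-rate matrix:

```python
# Average propagation delay H of Eq. (8.1) on the default ring placement:
# h(i, j) = (j - i) mod N, with h(i, i) = 0 because flow(i, i) never
# enters the ring.
def avg_prop_delay(lam, N):
    H = 0.0
    for i in range(N):
        for j in range(N):
            if i != j:
                H += lam[i][j] * ((j - i) % N)
    return H
```

For the single flow λ0,3 = 1 in the 4×4 switch of Fig. 8.1, this returns H = 3 slots.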

Assume  λ0,3=1 and it is the only flow of the switch in Fig. 8.1. From (8.1), we

have H=3 slots. If we swap the positions of linecards 0 and 2 in Fig. 8.1, H will become 1. This shows that by judiciously connecting linecards to form a ring, the propagation delay (as well as the average packet delay) can be minimized. It is not

difficult to show that for a given traffic matrix, finding the optimal linecard

 placement pattern for minimizing H has the same complexity as the classic traveling


salesman problem [71]. Nevertheless, such a linecard placement problem can be

formulated as an ILP (Integer Linear Programming) problem.

Notations:

- xi: the propagation delay experienced by packets of flow(0,i), where 0 ≤ xi ≤ N-1, for i ∈ [0, N-1]. In fact, xi indicates linecard i's relative position (to linecard 0) in the ring.

- f i,j: a binary variable, defined for j > i, with i, j ∈ [0, N-1]. If f i,j = 1, then xi > xj; if f i,j = 0, then xi < xj.

Objective:

minimize  Σi Σj>i { λi,j [(xj - xi) + N·f i,j] + λj,i [(xi - xj) + N·(1 - f i,j)] }   (8.2)

Subject to the following ring topology constraints:

x0 = 0   (8.3)

1 ≤ xi ≤ N-1, for i ∈ [1, N-1]   (8.4)

xi - xj - N·f i,j ≥ 1 - N, for all j > i, i, j ∈ [0, N-1]   (8.5)

xj - xi + N·f i,j ≥ 1, for all j > i, i, j ∈ [0, N-1]   (8.6)

Notably, constraints (8.5) and (8.6) above ensure that xi ≠ xj whenever i ≠ j.
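For small N, the optimum of the ILP (8.2)-(8.6) can be cross-checked without a solver by enumerating the (N-1)! placements with linecard 0 pinned at position 0; this brute-force sketch is illustrative only, not the method proposed here:

```python
from itertools import permutations

def best_placement(lam, N):
    # Enumerate ring positions x[i] for linecards 1..N-1 (linecard 0 pinned
    # at position 0, as in constraint (8.3)) and keep the placement that
    # minimizes H = sum of lam[i][j] * ((x[j] - x[i]) mod N).
    best_H, best_x = float("inf"), None
    for perm in permutations(range(1, N)):
        x = [0] + list(perm)          # x[i] = ring position of linecard i
        H = sum(lam[i][j] * ((x[j] - x[i]) % N)
                for i in range(N) for j in range(N) if i != j)
        if H < best_H:
            best_H, best_x = H, x
    return best_x, best_H
```

For the single flow λ0,3 = 1, it places linecard 3 one hop after linecard 0 and reports H = 1, matching the swap of linecards 0 and 2 discussed above.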

Note that the linecard placement pattern is changed only if there is a sufficiently large change in the traffic matrix. Even so, it is generally not feasible to reconnect linecards manually. To this end, we can implement an LBOS using an OXC

(Optical cross-Connect), as shown in Fig. 8.6. Note that all ( N -1)! possible linecard

 placement patterns can be realized by an OXC, which supports  N ! configurations.


Further note that inexpensive OXC (with millisecond or more reconfiguration delay)

can be used if the reconfiguration takes place infrequently.

(a) (b)

Fig. 8.6 Two possible linecard placement patterns using OXC: (a) {0-1-2-3}

and (b) {0-3-1-2}.

8.4.2 Supporting Multicast

The transmitter module in Fig. 8.2 consists of a fixed laser array. The lasers are turned on by direct current injection when a packet is to be sent. Data bits are then "written" onto a channel by an external modulator. The laser array facilitates

multicasting, where bits can be written  simultaneously by the external modulator on

multiple wavelengths (the corresponding lasers have been turned on for carrying a

multicast packet). In this way, the “replication” of packets is obtained in optical

domain (in which bandwidth efficiency is less critical). In other words, multicasting

can be implemented without increasing the (expensive) bandwidth requirement of the

electronic transmitters, as the electronic cost of sending a packet to multiple

destinations is the same as for sending a packet to a single destination. All the multicast scheduling algorithms in Chapter 5 can be implemented in the multicast LBOS.


8.4.3 Implementing Fair Scheduler Optically

To implement the fair scheduler in Chapter 7 optically, an optical control

channel λ  N  is required for conveying reservation requests and grants (shown in Fig.

8.7), which is comparable to the control channel in an OBS network for making data

 burst reservations. In other words, an extra transceiver on channel λ  N  is required at

each linecard for processing the control packets in the electrical domain. Referring to Fig. 8.2, the λN receiver is added in parallel with the λi receiver in the receiver module, and the λN transmitter is added to the laser array in the transmitter module. Due to the relatively low data rate on the control channel, an inexpensive low-speed transceiver can be used, e.g. using LEDs instead of laser diodes.

Assume pipelined LBOS (in Fig. 8.4) is used and the traffic carried on the

ring network (in Fig. 8.1) is shown in Fig. 8.7. We focus on the control channel λ  N  

(where N =4 in Fig. 8.7). In each packet duration, λ  N  carries two vectors, an overload

vector {wi} and a reservation vector {qi}, where i=0, 1… N -1. During a packet

duration, linecard k  drops λ  N  and uploads the updated {wi} and {qi} on λ  N  again.

Meanwhile, the operations of the fair scheduler (in Chapter 7) are carried out in the electrical domain.

8.5 Performance Evaluations

In this section, we study the performance of our proposed LBOS under the same three types of traffic patterns as in Chapter 2, i.e. uniform, uniform bursty and hot-spot traffic. For comparison, Fasnet [63], which has a hardware complexity similar to that of LBOS, is implemented. In simulating Fasnet, we adopt the best


 parameters as reported in [63], i.e. a fairness quota Y =100 packets and the maximum

accumulated quota of U ×Y =500 packets. For both LBOS and Fasnet, we assume the

 propagation delay between adjacent linecards is 100 ns (t  p=100 ns) and each linecard

introduces a (FDL) delay of 100 ns (t d=100 ns). The duration of a time slot is thus

200 ns, or two time units. We assume packets arrive at the beginning of each time

unit. For the non-pipelined LBOS (in Fig. 8.3), only one packet can be sent/received

in every two time units. With pipelined LBOS (in Fig. 8.4), one packet can be sent in

each time unit.

We also implement the iSLIP algorithm [14] (with a single iteration), which serves as a benchmark for input-queued switches, and the output-queued switch, which serves as a lower bound. In simulating them, zero propagation delay between linecards is assumed (in their favor). It should be noted that both iSLIP and the output-queued switch are generally not practical for optical implementation.

For simplicity, we only present simulation results below for a switch with N=32 linecards, but similar conclusions and observations can be obtained for other switch sizes.

8.5.1 Performance under Uniform Traffic

From Fig. 8.7, we can see that without pipelined sending and receiving,

LBOS can only obtain up to 50% throughput. For pipelined LBOS, close-to-100%

throughput can be obtained. Note that the delay reported is the total delay a packet experiences at the input port and en route. For LBOS, the average propagation delay is 32 time units or 16 time slots (i.e., (1+N-1)/2 = N/2 slots under uniform traffic, with N=32).


Fig. 8.8 Delay vs input load, under uniform bursty traffic in LBOS.

Fig. 8.9 Delay vs input load, under hot-spot traffic in LBOS.

From Fig. 8.9, again we can see that pipelined LBOS consistently


outperforms Fasnet and delivers close-to-100% throughput.

8.5.4 Performance for Linecard Placement

We randomly generate 20 16×16 admissible traffic matrices. For each matrix,

the average propagation delay is calculated using (8.1) and the average of the 20

matrices is found to be H =16.1 time units. With the optimized linecard placement (by

solving the ILP in (8.2)-(8.6)), we can get an average propagation delay of 14.1 time

units. A saving of 12.3 % in propagation delay is observed.

We then carry out simulations to get the average packet delay (i.e. to take the

input port queuing delay into account) for each scenario. We found that without optimized linecard placement, the average delay is 25.9 time units, and with optimized placement, the delay drops to 22.9 time units.

8.6 Chapter Summary

In this chapter, we designed an optical implementation of the feedback-based switch, called LBOS, for use in a hybrid electro-optic router. It comprises N linecards connected by an N-wavelength WDM fiber ring. Each linecard i is configured to receive on channel λi. To send a packet, a linecard selects an idle channel based on the packet's destination and transmits on it. Packets are switched from one linecard to another all-optically, so the extra O-E-O conversion found in state-of-the-art routers is removed. We also showed that LBOS inherits all the nice features of a load-balanced electronic switch.


Chapter 9

Conclusion

9.1 Our Contributions

In this dissertation, we dedicated our efforts to designing efficient and scalable switch architectures for next generation high-speed routers. The two major design objectives are eliminating the need for a centralized scheduler and being amenable to optical implementation.

In Chapter 2, we focused on removing the centralized scheduler by following the approach of the load-balanced switch, owing to its scalability and close-to-100% throughput.


In a feedback-based switch, each middle-stage port needs to piggyback an  N -

 bit occupancy vector to its connected output in each time slot. In Chapter 4, we

concentrated on cutting down this communication overhead. The size of an

occupancy vector can be reduced by only reporting the status of selected middle-

stage VOQs. To identify the VOQs of interest, we partition the N VOQs into u non-overlapping sets, each identified by a set number. In each time slot, every input

 port piggybacks its set numbers of interest to the connected middle-stage port. This

guides a middle-stage port to only report the status of the VOQs of interest.
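The reporting step summarized above might look as follows; the modulo partition and the function name are assumptions for illustration (Chapter 4 defines the actual partition):

```python
# occupancy[k] = 1 if middle-stage VOQ k is occupied; wanted_sets holds the
# set numbers the input piggybacked. Report only the VOQs whose set number
# (k mod u -- an assumed, illustrative partition) was requested.
def report_bits(occupancy, wanted_sets, u):
    return {k: occupancy[k]
            for k in range(len(occupancy)) if k % u in wanted_sets}
```

With N = 4, u = 2 and occupancy [1, 0, 1, 1], a request for set 0 returns only the bits of VOQs 0 and 2.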

In Chapter 5, by slightly modifying the operation of the original feedback-

 based two-stage switch, we showed that feedback-based switch supports multicast

traffic efficiently. A notable feature of this multicast extension is that the switch

fabric remains unicast, whereas packet duplication is distributed to both input and

middle-stage ports.

In a single-cabinet implementation, the propagation delay between linecards

and switch fabric is negligible. In a multi-cabinet implementation, due to the non-

negligible propagation delay between linecards and switch fabric, the requirement

that occupancy vectors must arrive at output/input ports within a single time slot will

significantly lower the feedback-based switch efficiency. To this end, we revamped

the original feedback mechanism in Chapter 6 for multi-cabinet implementation, and

a new batch scheduler was also devised.

As long as the incoming traffic is admissible, due to the close to 100%


throughput performance of our feedback switch, packets can arrive at outputs with

 bounded delays, so fairness in throughput is not an issue. Under inadmissible traffic

(i.e. some output ports are over-subscribed), the feedback switch suffers from the

ring-fairness problem, i.e. “up-stream” input ports can starve some “down-stream”

input ports. To address this ring-fairness problem, an algorithm that allocates the bandwidth of over-subscribed outputs based on the max-min fairness criterion was proposed in Chapter 7.

In Chapter 8, we proposed an optical implementation of the feedback-based

switch, called Load-Balanced Optical Switch (LBOS). LBOS leverages an  N -

wavelength WDM fiber ring to connect  N linecards together. The ring network was

engineered such that the amount of time a packet should be buffered at a middle-

stage port exactly matches the propagation delay that this packet would experience

en route. We showed that with LBOS, all-optical packet transmission from an input

linecard to an output linecard is ensured.

9.2 Future Work

9.2.1 100% Throughput Proof without Speedup

In Chapter 2, we proved that under a speedup of two, the feedback-based switch using any arbitrary work-conserving port scheduler is stable. Indeed, our simulation results suggest that LQF without speedup is stable over a wide range of traffic patterns. However, stability without speedup has yet to be proved. In future work, we hope to come up with a 100% throughput proof (without speedup) by appealing to other powerful mathematical models.


9.2.2 Building a Large Feedback-Based Two-Stage Switch

In a feedback-based switch, the average packet delay grows linearly with the switch size N. Therefore, when N is large, the average delay suffers. To this end, it is our hope that a large feedback switch can be constructed from a number of small feedback switch modules. The delay performance would then grow linearly with the module size, instead of with the whole switch size.

9.2.3 More Scalable Fairness Algorithm in LBOS 

In Chapter 8, a dedicated control wavelength channel is required for implementing the fair scheduler in LBOS. In this approach, an extra fixed receiver and transmitter (on the control channel) are required at each linecard, which increases the hardware complexity of LBOS. It is a worthwhile research direction to implement the fair scheduler without this extra hardware.

9.2.4 Scalable Iterative Algorithm for Input-queued Switch 

Besides load-balanced switches, we can also refine other switch architectures for use in next generation high-speed routers. The input-queued switch with iterative algorithms, as introduced in Chapter 1, is not scalable because of the communication overhead of the up to N iterations needed to find a maximal-size matching. A very interesting question is whether a maximal-size matching can be achieved in only one iteration, with this single iteration functioning as "N iterations". To accomplish this objective, the "weight" information (e.g. queue size) should be considered in the single-iteration matching. This idea merits further investigation.


[13]  T. E. Anderson, S. S. Owicki, J. B. Saxe and C. P. Thacker, "High-speed switch scheduling for local area networks," ACM Transactions on Computer Systems, Vol. 11, pp. 319-352, 1993.

[14]   N. McKeown, “Scheduling algorithms for input-queued cell switches,” PhD.

Thesis, University of California at Berkeley, 1995.

[15]   N. McKeown, “The iSLIP scheduling algorithm for input-queued switches,”

IEEE/ACM Transactions on Networking, Vol. 7, No. 2, pp. 188-201, April 1999.

[16]  Y. Li, S. Panwar and H. J. Chao, “On the performance of a dual round-robin

switch,” INFOCOM 1998, March 1998, San Francisco, USA.

[17]  S. T. Chuang, A. Goel, N. McKeown and B. Prabhakar, "Matching output queuing with a combined input/output-queued switch," IEEE Journal on

Selected Areas in Communications, Vol. 17, pp. 1030 – 1039, June 1999.

[18]  K. Yoshigoe, “Threshold-based exhaustive round-robin for the CICQ switch

with virtual crosspoint queues,” ICC 2007 , June 2007, Glasgow, Scotland.

[19]  R. Luijten, C. Minkenberg and M. Gusat, “Reducing memory size in buffered

crossbars with large internal flow control latency," GLOBECOM 2003, Dec. 2003, San Francisco, USA.

[20]  Y. Shen, S. S. Panwar and H. J. Chao, “Providing 100% throughput in a

 buffered crossbar switch,” IEEE HPSR 2007 , May 2007, New York, USA.

[21]  C. S. Chang, D. S. Lee and Y. S. Jou, “Load balanced Birkhoff-von Neumann

switches, part I: one-stage buffering,” Computer Communications, Vol. 25, pp.

611 – 622, 2002.

[22]  C. S. Chang, D. S. Lee and C. M. Lien, “Load balanced Birkhoff-von

 Neumann switches, part II: multi-stage buffering,” Computer 

Communications, Vol. 25, pp. 623 – 634, 2002.

[23]  Y. Shen, S. Jiang, S. S. Panwar and H. J. Chao, “Byte-focal: a practical load-

 balanced switch,” IEEE HPSR 2005, May 2005, Hong Kong.

[24]  X. L. Wang, Y. Cai, S. Xiao and W. B. Gong, “A three-stage load-balancing


switch," INFOCOM 2008, April 2008, Phoenix, AZ, USA.

[25]  I. Keslassy and N. McKeown, “Maintaining packet order in two-stage

switches,” INFOCOM 2002, June 2002, New York, USA.

[26]  I. Keslassy, “The load-balanced router,”  PhD. Thesis, Stanford University,

2004.

[27]  I. Keslassy, S. T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard and N.

McKeown, "Scaling internet routers using optics," ACM SIGCOMM '03,

Aug. 2003, Karlsruhe, Germany.

[28]  J. J. Jaramillo, F. Milan and R. Srikant, “Padded frames: a novel algorithm for stable scheduling in load-balanced switches,” IEEE/ACM Transactions on

 Networking , Vol. 16, No. 5, Oct. 2008

[29]  C. L. Yu, C. S. Chang and D. S. Lee, “CR switch: a load-balanced switch

with contention and reservation,”  INFOCOM 2007 , May 2007, Anchorage,

Alaska, USA.

[30]  C. S. Chang, D. S. Lee and Y. J. Shih, "Mailbox switch: a scalable two-stage switch architecture for conflict resolution of ordered packets," INFOCOM 2004, March 2004, Hong Kong.

[31]  B. Lin and I. Keslassy, “The concurrent matching switch architecture,” 

 INFOCOM 2006 , April 2006, Barcelona, Spain.

[32]  H. I. Lee, "A two-stage switch with load balancing scheme maintaining packet sequence," IEEE Communications Letters, Vol. 10, pp. 290-292, Apr. 2006.

[33]  P. Gupta and N. McKeown, “Design and Implementation of a Fast Crossbar 

Scheduler,” IEEE Micro, Vol. 19, Issue 1, pp. 20 - 28, Jan.-Feb. 1999.

[34]  Y. S. Lin and C. B. Shung, “Quasi-pushout cell discarding,”  IEEE 

Communications Letters, Vol. 1, pp. 146-148, Sept. 1997

[35]  B. Wu, K. L. Yeung, M. Hamdi and X. Li, “Minimizing internal speedup for 


 buffered crossbar switches,” IEEE HPSR 2006 , June 2006, Poznan, Poland.

[47]  Z. Q. Dong and R. Rojas-Cessa, "Packet switching and replication of multicast

traffic by crosspoint buffered packet switches,” IEEE HPSR 2007 , May 2007,

 New York, USA.

[48]  Z. Q. Dong and R. Rojas-Cessa, "Input- and output-based shared-memory

crosspoint buffered packet switches for multicast traffic switching and

replication,” ICC 2008, May 2008, Beijing, China.

[49]  P. Giaccone and E. Leonardi, “Asymptotic performance limits of switches

with buffered crossbars supporting multicast traffic,”  IEEE Transactions on

 Information theory, Vol. 54, No. 2, Feb. 2008.

[50]  C. Minkenberg, R. Luijten, F. Abel, W. Denzel and M. Gusat, "Current issues in packet switch design," ACM SIGCOMM Computer Communication Review, Vol. 33, No. 1, pp. 119-124, Jan. 2003.

[51]  A. Scicchitano, A. Bianco, P. Giaccone, E. Leonardi and E. Schiattarella,

“Distributed scheduling in input queued switches”  ICC 2007 , June 2007,

Glasgow, Scotland.

[52]  M. Hosaagrahara and H. Sethu, “Max-min fair scheduling in input-queued

switches” IEEE Transaction on Parallel and Distributed System, Vol. 19, NO.

4, April 2008.

[53]  R. Yim, N. Devroye, V. Tarokh, and H. T. Kung, “Achieving fairness in

generalized processor sharing for network switches,”  Proc. 22nd Biennial 

Symp. Comm., pp. 185-187, 2004.

[54]  X. Zhang, S. R. Mohanty and L. N. Bhuyan, “Adaptive max-min fair 

scheduling in buffered crossbar switches without speedup,” INFOCOM 2007 ,

May 2007, Anchorage, Alaska , USA

[55]   N. Kumar, R. Pan, and D. Shah, “Fair scheduling in input-queued switches

under inadmissible traffic,” GLOBECOM 2004,Vol. 3, No. 29, pp. 1713-1717,

Dec. 2004, Dallas, Texas, USA.


[56]   N. Hua, P. Wang, D. P. Jin, L. G. Zeng, B. Liu and G. Feng, “Simple and fair 

scheduling algorithm for combined input-crosspoint-queued switch,” ICC 

2007 , June 2007, Glasgow, Scotland.

[57]  J. R. Bennett and H. Zhang, “Hierarchical packet fair queueing algorithms,”

 IEEE/ACM Transactions on Networking , vol. 5, no. 5, pp. 675–689, Oct.

1997.

[58]  D. P. Bertsekas and R. Gallager  , “Data networks,”  Englewood Cliffs, NJ:

Prentice-Hall, 1992.

[59]  A. Bianco, D. Cuda, J. Finochietto and F. Neri, “Multi-metaring protocol:

fairness in optical packet ring networks," ICC 2007, June 2007, Glasgow, Scotland.

[60]  H. Kogan and I. Keslassy, “Optimal-complexity optical router,”  INFOCOM 

2007, May 2007, Anchorage, Alaska, USA.

[61]  M. Maier and M. Reisslein, “Trends in optical switching techniques: a short

survey,” IEEE Network ,  pp. 42 – 47, Nov./Dec. 2008.

[62]  R. Ryf et al., “1296-port MEMS transparent optical crossconnect with 2.07 petabit/s switch capacity,” Optical Fiber Comm. Conf. and Exhibit (OFC) '01,

Vol. 4, pp. PD28 -P1-3, 2001.

[63]  A. Bianco, E. Carta, D. Cuda, J. M. Finochietto and F. Neri, "A distributed

scheduling algorithm for an optical switching fabric,”  ICC 2008, May 2008,

Beijing, China.

[64]  A. Carena, V. D. Feo, J. Finochietto, R. Gaudino, F. Neri, C. Piglione and P. Poggiolini, "RINGO: an experimental WDM optical packet network for

metro applications,” IEEE Journal on Selected Areas in Communications, Vol.

22, No. 8, pp. 1561-1571, Oct. 2004

[65]  A. Bianco, J. M. Finochietto, G. Giarratana, F. Neri and C. Piglione,

“Measurement-based reconfiguration in optical ring metro networks,”

 Journal of Lightwave Technology, Vol. 23, No. 10, pp. 3156-3166, Oct. 2005

[66]  A. Antonino, A. Bianco, A. Bianciotto, V. D. Feo, J. M. Finochietto, R.


Gaudino and F. Neri, "WONDER: a resilient WDM packet network for metro applications," Optical Switching and Networking, Vol. 5, pp. 19-28, 2008.

[67]  A. Bianco, D. Cuda, J. M. Finochietto, F. Neri and M. Valcarenghi, "WONDER: a PON over a folded bus," GLOBECOM 2008, Nov. 2008, New Orleans, LA, USA.

[68]  A. Bianco, D. Cuda, J. M. Finochietto, F. Neri and C. Piglione, “Multi-fasnet

protocol: short-term fairness control in WDM slotted MANs," ICC 2006, June 2006, Istanbul, Turkey.

[69]  X. Wang and K. L. Yeung. “Load balanced two-stage switches using arrayed

waveguide grating routers,” IEEE HPSR 2007 , June, 2007, New York, USA.

[70]  J. C. Palais, "Fiber optic communications," 5th ed., Upper Saddle River, NJ: Pearson/Prentice Hall, 2005.

[71]  A. Desai and S. Milner, “Autonomous reconfiguration in free-space optical

sensor networks,”  IEEE Journal on Selected Areas in Communications 

(JSAC), Vol. 23, No. 8, pp. 1556-1563, Aug. 2005

[72]  T. Akin, "Hardening Cisco routers," O'Reilly Networking, Feb. 2002.

[73]  A. Vukovic, “Network power density challenges,”  ASHRAE Journal , Vol. 47,

Issue 4, pp. 55-59, Apr. 2005.

[74]  M. Degermark, A. Brodnik, S. Carlsson and S. Pink, “Small forwarding

tables for fast routing lookups,” ACM SIGCOMM Computer Communication

 Review, Vol. 27, Issue 4, pp.3-14, Oct. 1997

[75]  W. Eatherton, G. Varghese and Z. Dittia, “Tree bitmap: hardware/software IP

lookups with Incremental updates”  ACM SIGCOMM   Computer 

Communication Review, Vol. 34, Issue 2, pp.97-122, April 2004

[76]  H. Song, J. Turner and J. Lockwood, “Shape shifting tries for faster IP

lookup,” IEEE ICNP2005, pp.358-367, 2005

[77] V. Srinivasan and G. Varghese, "Faster IP lookups using controlled prefix expansion," ACM SIGMETRICS Performance Evaluation Review, Vol. 26, Issue 1, pp. 1-10, June 1998.

[78] S. Nilsson and G. Karlsson, "IP-address lookup using LC-tries," IEEE Journal on Selected Areas in Communications, Vol. 17, pp. 1083-1092, June 1999.

[79] L. C. Wuu, K. M. Chen and T. J. Liu, "A longest prefix first search tree for IP lookup," IEEE ICC 2005, May 2005, Seoul, Korea.

[80] P. R. Warkhede, S. Suri and G. Varghese, "Multi-way range trees: scalable IP lookup with fast updates," Computer Networks, Vol. 44, No. 3, pp. 289-303, 2002.

[81] H. Lu and S. Sahni, "A B-tree dynamic router-table design," IEEE Transactions on Computers, Vol. 54, pp. 813-823, 2005.

[82] H. Lu and S. Sahni, "O(log W) multidimensional packet classification," IEEE/ACM Transactions on Networking, Vol. 15, Issue 2, pp. 462-472, April 2007.

[83] P. C. Wang, C. L. Lee, C. T. Chan and H. Y. Chang, "Performance improvement of two-dimensional packet classification by filter rephrasing," IEEE/ACM Transactions on Networking, Vol. 15, Issue 4, pp. 906-917, Aug. 2007.

[84] M. Waldvogel, G. Varghese, J. Turner and B. A. Plattner, "Scalable high speed IP routing lookups," ACM SIGCOMM 1997, pp. 25-36, Sept. 1997, Cannes, France.

[85] Q. Sun, X. H. Huang, X. J. Zhou and Y. Ma, "A dynamic binary hash scheme for IPv6 lookup," GLOBECOM 2008, Nov. 2008, New Orleans, LA, USA.

[86] S. Dharmapurikar, P. Krishnamurthy and D. Taylor, "Longest prefix matching using Bloom filters," ACM SIGCOMM 2003, pp. 201-212, 2003.

[87] R. Sangireddy, N. Futamura, S. Aluru and A. K. Somani, "Scalable, memory efficient, high-speed IP lookup algorithms," IEEE/ACM Transactions on Networking, Vol. 13, Issue 4, pp. 802-812, Aug. 2005.


[88] H. Y. Song, F. Hao, M. Kodialam and T. V. Lakshman, "IPv6 lookups using distributed and load balanced Bloom filters for 100Gbps core router line cards," INFOCOM 2009, April 2009, Rio de Janeiro, Brazil.

[89] H. Y. Song and J. Turner, "Fast filter updates for packet classification using TCAM," GLOBECOM 2006, Nov. 2006, San Francisco, USA.

[90] R. Panigrahy and S. Sharma, "Reducing TCAM power consumption and increasing throughput," 10th IEEE Symposium on High Performance Interconnects (HOTI '02), pp. 107-112, 2002.

[91] F. Zane, G. Narlikar and A. Basu, "CoolCAMs: power-efficient TCAMs for forwarding engines," INFOCOM 2003, April 2003, San Francisco, USA.

[92] K. Zheng, C. C. Hu, H. B. Liu and B. Liu, "An ultra high throughput and power efficient TCAM-based IP lookup engine," INFOCOM 2004, May 2004, Hong Kong.

[93] M. J. Akhbarizadeh, M. Nourani, R. Panigrahy and S. Sharma, "High-speed and low-power network search engine using adaptive block-selection scheme," Proceedings of the 13th Symposium on High Performance Interconnects, pp. 73-78, 2005.

[94] H. Yu, J. Chen, J. Wang, S. Q. Zheng and M. Nourani, "An improved TCAM-based IP lookup engine," IEEE HPSR 2008, May 2008, Shanghai, China.

[95] H. Yu, J. Chen, J. P. Wang and S. Q. Zheng, "High-performance TCAM-based IP lookup engines," INFOCOM 2008, April 2008, Phoenix, AZ, USA.

[96] A. Enteshari and M. Kavehrad, "40-100Gbps transmission over copper," DesignCon 2009, Feb. 2009, Santa Clara, CA, USA.

[97] M. Kavehrad and J. F. Doherty, "10Gbps transmission over standard category-5 copper cable," GLOBECOM 2003, Dec. 2003, San Francisco, CA, USA.

[98] G. Chartrand, "Introductory graph theory," New York: Dover, p. 116, 1985.


[99] D. Gale and L. S. Shapley, "College admissions and the stability of marriage," Amer. Math. Monthly, Vol. 69, pp. 9-15, 1962.

[100] G. Kornaros, "BCB: a buffered crossbar switch fabric utilizing shared memory," Proc. Ninth EUROMICRO Conf. Digital System Design (DSD '06), pp. 180-188, Aug. 2006.

[101] H. Arimoto, T. Kitatani, T. Tsuchiya, K. Shinoda, T. Ohtoshi, M. Aoki and S. Tsuji, "N-type doping to an active-short cavity DBR laser to expand its continuous tuning range," IEEE Photonics Technology Letters, Vol. 20, No. 16, Aug. 15, 2008.

[102] J. E. Simsarian, M. C. Larson, H. E. Garrett, H. Xu and T. A. Strand, "Less than 5-ns wavelength switching with an SG-DBR laser," IEEE Photonics Technology Letters, Vol. 16, No. 4, Feb. 15, 2006.

[103] F. O. Ilday, J. Buckley, L. Kuznetsova and F. W. Wise, "Generation of 36-femtosecond pulses from a ytterbium fiber laser," Conference on Lasers and Electro-Optics 2004 (CLEO), Vol. 2, p. 3, May 2004.

[104] A. V. Konyashchenko, L. L. Losev and S. Y. Tenyakov, "Raman frequency shifter for laser pulses shorter than 100 fs," Optics Express, Vol. 15, No. 19, pp. 11855-11859, Sep. 2007.

[105] F. M. Chiussi, J. G. Kneuer and V. P. Kumar, "Low-cost scalable switching solutions for broadband networking: the ATLANTA architecture and chipset," IEEE Communications Magazine, pp. 44-53, Dec. 1997.

[106] A. E. Tan, "IEEE 1588 precision time protocol time synchronization performance," National Semiconductor Application Note 1728, Oct. 2007.

[107] R. Palaniappan, Y. Wang, T. Clarke and B. Goldiez, "Simulation of an ultra-wide band enhanced time difference of arrival system," Parallel and Distributed Computing and Systems, pp. 306-309, Nov. 2007.
