Router Design and Packet Scheduling
2
IP Router
• A router consists of:
– A set of input interfaces at which packets arrive
– A set of output interfaces from which packets depart
• A router implements two main functions:
– Forward packets to the corresponding output interface
– Manage congestion
3
Generic Router Architecture
• Input and output interfaces are connected through a backplane
• A backplane can be implemented by:
– Shared memory: low-capacity routers (e.g., PC-based routers)
– Shared bus: medium-capacity routers
– Point-to-point (switched) bus: high-capacity routers
[Diagram: input interfaces connected to output interfaces through an interconnection medium (backplane)]
4
What a Router Looks Like
• Cisco GSR 12416: 19” wide, 6 ft tall, 2 ft deep; capacity: 160 Gb/s; power: 4.2 kW
• Juniper M160: 19” wide, 3 ft tall, 2.5 ft deep; capacity: 80 Gb/s; power: 2.6 kW
5
Points of Presence (POPs)
[Diagram: a wide-area backbone with POP1–POP8; access routers A–F attach at different POPs]
6
Basic Architectural Components of an IP Router
• Control plane: routing protocols, routing table
• Datapath (per-packet processing): forwarding table, switching
7
Per-packet Processing in an IP Router
1. Accept packet arriving on an ingress line.
2. Look up the packet's destination address in the forwarding table to identify the outgoing interface(s).
3. Manipulate the packet header: e.g., decrement TTL, update header checksum.
4. Send the packet to the outgoing interface(s).
5. Queue until the line is free.
6. Transmit the packet onto the outgoing line.
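Step 2, the forwarding lookup, is a longest-prefix match. A minimal sketch (a linear scan for clarity; real routers use tries or TCAMs, and the FIB contents here are made up):

```python
import ipaddress

def longest_prefix_match(fib, dst):
    """Return the next hop of the longest prefix containing dst.
    fib: list of (network, next_hop) pairs."""
    best_net, best_hop = None, None
    for net, hop in fib:
        if dst in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_hop = net, hop
    return best_hop

# Illustrative FIB: a default route plus two nested prefixes.
fib = [
    (ipaddress.ip_network("0.0.0.0/0"), "if0"),
    (ipaddress.ip_network("10.0.0.0/8"), "if1"),
    (ipaddress.ip_network("10.1.0.0/16"), "if2"),
]
hop = longest_prefix_match(fib, ipaddress.ip_address("10.1.2.3"))
```

With the FIB above, 10.1.2.3 matches all three prefixes and the /16 wins.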
8
Generic Router Architecture
[Diagram: header processing per packet: IP address lookup against an address table (~1M prefixes, off-chip DRAM) maps IP address to next hop, then header update, then packet queuing in buffer memory (~1M packets, off-chip DRAM)]
9
Generic Router Architecture
[Diagram: the lookup/header-update pipeline with its address table replicated per input port; a buffer manager with buffer memory per output port]
10
Packet Processing Is Getting Harder
[Chart: CPU instructions available per minimum-length packet, 1996–2001, log scale from 1 to 1000, declining over time]
11
Speedup
• C – input/output link capacity
• RI – maximum rate at which an input interface can send data into the backplane
• RO – maximum rate at which an output can read data from the backplane
• B – maximum aggregate backplane transfer rate
• Backplane speedup: B/C
• Input speedup: RI/C
• Output speedup: RO/C
[Diagram: input and output interfaces at rate C connected through the interconnection medium (backplane); rates RI, RO, and B annotated]
12
Function Division
• Input interfaces:
– Must perform packet forwarding: need to know to which output interface to send packets
– May enqueue packets and perform scheduling
• Output interfaces:
– May enqueue packets and perform scheduling
13
Three Router Architectures
• Output queued
• Input queued
• Combined input-output queued
14
Output Queued (OQ) Routers
• Only output interfaces store packets
• Advantages
– Easy to design algorithms: only one congestion point
• Disadvantages
– Requires an output speedup of N, where N is the number of interfaces: not feasible
15
Input Queueing (IQ) Routers
• Only input interfaces store packets
• Advantages
– Easy to build: store packets at inputs if there is contention at outputs
– Relatively easy to design algorithms: only one congestion point, but not at the output… need to implement backpressure
• Disadvantages
– Hard to achieve utilization 1 (due to output contention and head-of-line blocking)
• However, theoretical and simulation results show that for realistic traffic an input/output speedup of 2 is enough to achieve utilization close to 1
16
Combined Input-Output Queueing (CIOQ) Routers
• Both input and output interfaces store packets
• Advantages
– Easy to build
– Utilization 1 can be achieved with limited input/output speedup (<= 2)
• Disadvantages
– Harder to design algorithms: two congestion points; need to design flow control
– Note: recent results show that with an input/output speedup of 2, a CIOQ router can emulate any work-conserving OQ router [G+98, SZ98]
17
Generic Architecture of a High-Speed Router Today
• Combined input-output queued architecture
– Input/output speedup <= 2
• Input interface
– Performs packet forwarding (and classification)
• Output interface
– Performs packet (classification and) scheduling
• Backplane
– Point-to-point (switched) bus; speedup N
– Schedules packet transfers from input to output
18
Backplane
• A point-to-point switch allows simultaneous packet transfers between any disjoint pairs of input-output interfaces
• Goal: come up with a schedule that
– Meets flow QoS requirements
– Maximizes router throughput
• Challenges:
– Address head-of-line blocking at inputs
– Resolve input/output speedup contention
– Avoid packet dropping at outputs if possible
• Note: packets are fragmented into fixed-size cells (why?) at the inputs and reassembled at the outputs
– In Partridge et al., a cell is 64 B (what are the trade-offs?)
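The fragmentation and reassembly step can be sketched as follows (a toy illustration; the 64 B cell size matches Partridge et al., but the padding scheme and the idea of carrying the original length out of band are assumptions for this sketch):

```python
def fragment(packet: bytes, cell_size: int = 64):
    """Split a packet into fixed-size cells, zero-padding the last one.
    Fixed-size cells make backplane transfers uniform in duration, so
    the allocator can schedule them in fixed epochs."""
    cells = [packet[i:i + cell_size] for i in range(0, len(packet), cell_size)]
    if cells:
        cells[-1] = cells[-1].ljust(cell_size, b"\x00")
    return cells

def reassemble(cells, original_length: int) -> bytes:
    """Concatenate cells at the output and strip the padding."""
    return b"".join(cells)[:original_length]

pkt = bytes(range(100))   # a 100-byte packet -> two 64 B cells
cells = fragment(pkt)
```

The trade-off the slide hints at: smaller cells waste more capacity on padding and per-cell overhead, larger cells increase scheduling granularity and latency.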
19
Head-of-line Blocking
• The cell at the head of an input queue cannot be transferred, thus blocking the following cells
[Diagram: three input FIFOs and three outputs; one head cell cannot be transferred because its output buffer is full, and the cell behind it is blocked by that head cell even though its own output is free]
20
Solution to Avoid Head-of-line Blocking
• Maintain at each input N virtual queues, i.e., one per output
[Diagram: each of the three inputs keeps a separate virtual queue per output, so cells destined for different outputs no longer block one another]
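A minimal sketch of virtual output queues (VOQs) at one input port (the class and the cell representation are invented for illustration):

```python
from collections import deque

class VOQInput:
    """One input port with a virtual output queue per output.
    A head cell blocked on a busy output no longer blocks cells
    destined for other outputs."""
    def __init__(self, num_outputs: int):
        self.voq = [deque() for _ in range(num_outputs)]

    def enqueue(self, cell, output: int):
        self.voq[output].append(cell)

    def head(self, output: int):
        # The backplane scheduler inspects each VOQ head independently.
        return self.voq[output][0] if self.voq[output] else None

    def dequeue(self, output: int):
        return self.voq[output].popleft()

inp = VOQInput(num_outputs=3)
inp.enqueue("red", output=0)     # suppose output 0 is congested
inp.enqueue("green", output=1)   # can still be transferred immediately
```

With a single FIFO, "green" would sit behind "red"; with VOQs the scheduler can serve it right away.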
21
Cell Transfer
• Schedule:
– Ideally: find the maximum number of input-output pairs such that:
• Input/output contentions are resolved
• Packet drops at outputs are avoided
• Packets meet their time constraints (e.g., deadlines), if any
• Example
– Assign cell preferences at inputs, e.g., their position in the input queue
– Assign cell preferences at outputs, e.g., based on packet deadlines, or the order in which cells would depart in an OQ router
– Match inputs and outputs based on their preferences
• Problem:
– Achieving a high-quality matching is complex, i.e., hard to do in constant time
22
A Case Study [Partridge et al. ’98]
• Goal: show that routers can keep pace with improvements in transmission link bandwidth
• Architecture
– A CIOQ router
– 15 (input/output) line cards: C = 2.4 Gbps
• Each input card can handle up to 16 (input/output) interfaces
• Separate forwarding engines (FEs) perform routing
– Backplane: point-to-point (switched) bus, capacity B = 50 Gbps (32 MPPS)
• B/C ≈ 20, but 25% of B is lost to overhead (control) traffic
23
Router Architecture
[Diagram: data path through the router; only the packet header is passed to the forwarding engine]
24
Router Architecture
[Diagram: 15 input/output line cards connected by the backplane to the forwarding engines and the network processor; data flows in and out through the line cards, while control data (e.g., routing) goes to the network processor, which updates routing tables and sets scheduling (QoS) state]
25
Router Architecture: Data Plane
• Line cards
– Input processing: can handle input links up to 2.4 Gbps (3.3 Gbps including overhead)
– Output processing: uses a 52 MHz FPGA; implements QoS
• Forwarding engine:
– 415 MHz DEC Alpha 21164 processor, three-level cache to store recent routes
• Up to 12,000 routes in the second-level cache (96 kB); ~95% hit rate
• Entire routing table in the tertiary cache (16 MB divided into two banks)
26
Router Architecture: Control Plane
• Network processor: 233 MHz Alpha 21064 running NetBSD 1.1
– Updates routing
– Manages link status
– Implements reservations
• Backplane allocator: implemented by an FPGA
– Schedules transfers between input/output interfaces
27
Data Plane Details: Checksum
• Verifying the checksum takes too much time
– Increases forwarding time by 21%
• Take an optimistic approach: just incrementally update it
– Safe operation: if the checksum was correct, it remains correct
– If the checksum is bad, it will be caught by the end-host anyway
• Note: IPv6 does not include a header checksum anyway!
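The incremental update is the standard RFC 1624 technique. A sketch, with a made-up header, checking that the incremental result matches a full recomputation after a TTL decrement:

```python
def ones_complement_sum(words):
    """16-bit one's-complement sum with end-around carry."""
    s = 0
    for w in words:
        s += w
        s = (s & 0xFFFF) + (s >> 16)
    return s

def header_checksum(words):
    # Full checksum: complement of the one's-complement sum, computed
    # over the header words with the checksum field itself set to zero.
    return ~ones_complement_sum(words) & 0xFFFF

def incremental_update(checksum, old_word, new_word):
    """RFC 1624, eqn 3: HC' = ~( ~HC + ~m + m' )."""
    s = ones_complement_sum([~checksum & 0xFFFF, ~old_word & 0xFFFF, new_word & 0xFFFF])
    return ~s & 0xFFFF

# Example IPv4 header words (checksum field omitted); values are invented.
words = [0x4500, 0x0054, 0x1C46, 0x4000, 0x4006,
         0xAC10, 0x0A63, 0xAC10, 0x0A0C]
hc = header_checksum(words)
old = words[4]            # 0x4006: TTL = 0x40, protocol = 6
new = old - 0x0100        # decrement TTL: 0x40 -> 0x3F
hc2 = incremental_update(hc, old, new)
words2 = list(words)
words2[4] = new
```

The "safe operation" property on the slide follows directly: the update changes the checksum by exactly the change in the summed words, so a correct checksum stays correct and an incorrect one stays incorrect.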
28
Data Plane Details: Slow Path Processing
1. Headers whose destination misses in the cache
2. Headers with errors
3. Headers with IP options
4. Datagrams that require fragmentation
5. Multicast datagrams
– Requires multicast routing, which is based on the source address and inbound link as well
– Requires multiple copies of the header to be sent to different line cards
29
Control Plane: Backplane Allocator
• Time is divided into epochs
– An epoch consists of 16 ticks of the data clock (8 allocation clocks)
• Transfer unit: 64 B (8 data clock ticks)
• During one epoch, up to 15 simultaneous transfers
– One transfer: two transfer units (128 B of data + 176 auxiliary bits)
• Minimum of 4 epochs to schedule and complete a transfer, but scheduling is pipelined:
1. Source card signals that it has data to send to the destination card
2. Switch allocator schedules the transfer
3. Source and destination cards are notified and told to configure themselves
4. Transfer takes place
• Flow control through inhibit pins
30
The Switch Allocator Card
• Takes connection requests from function cards
• Takes inhibit requests from destination cards
• Computes a transfer configuration for each epoch
• 15×15 = 225 possible input-output pairings, with 15! possible patterns
31
Allocator Algorithm
32
The Switch Allocator
• Disadvantages of the simple allocator
– Unfair: there is a preference for low-numbered sources
– Requires evaluating 225 positions per epoch, which is too fast for an FPGA
• Solution to the unfairness problem: random shuffling of sources and destinations
• Solution to the timing problem: parallel evaluation of multiple locations
• Priority to requests from forwarding engines over line cards, to avoid header contention on line cards
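A toy version of greedy allocation with random shuffling (the data layout is invented; the real allocator evaluates the 15×15 request matrix in hardware):

```python
import random

def allocate(requests, rng=random):
    """Greedy epoch allocator: grant each input its first free requested
    output. Shuffling the scan order removes the fixed bias toward
    low-numbered sources that a deterministic scan would have.
    requests: dict mapping input card -> list of requested outputs."""
    order = list(requests)
    rng.shuffle(order)           # fairness: randomize source order
    granted = set()              # outputs already matched this epoch
    matching = {}
    for i in order:
        for o in requests[i]:
            if o not in granted:  # first free requested output wins
                matching[i] = o
                granted.add(o)
                break
    return matching

# Two inputs requesting disjoint outputs are both granted.
m = allocate({0: [0], 1: [1]}, rng=random.Random(42))
```

Note this greedy pass is a maximal, not maximum, matching; that is exactly the quality-vs-time trade-off the slides describe.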
33
Summary: Design Decisions (Innovations)
1. Each FE has a complete set of the routing tables
2. A switched fabric is used instead of the traditional shared bus
3. FEs are on boards distinct from the line cards
4. Use of an abstract link-layer header
5. QoS processing is included in the router
Packet Scheduling
35
Packet Scheduling
• Decide when and what packet to send on the output link
– Usually implemented at the output interface
[Diagram: arriving packets pass through a classifier into per-flow queues (flow 1 … flow n) with buffer management; a scheduler picks which queue to serve on the output link]
36
Why Packet Scheduling?
• Can provide per-flow or per-aggregate protection
• Can provide absolute and relative differentiation in terms of
– Delay
– Bandwidth
– Loss
37
Fair Queueing
• In a fluid flow system it reduces to bit-by-bit round robin among flows
– Each flow receives min(ri, f), where
• ri – flow arrival rate
• f – link fair rate (see next slide)
• Weighted Fair Queueing (WFQ) – associate a weight with each flow [Demers, Keshav & Shenker ’89]
– In a fluid flow system it reduces to weighted bit-by-bit round robin
• WFQ in a fluid flow system = Generalized Processor Sharing (GPS) [Parekh & Gallager ’92]
38
Fair Rate Computation
• If the link is congested, compute f such that
Σi min(ri, f) = C
• Example: three flows with arrival rates 8, 6, and 2 share a link of capacity C = 10
– f = 4: min(8, 4) = 4; min(6, 4) = 4; min(2, 4) = 2; allocations sum to 10
39
Fair Rate Computation in GPS
• Associate a weight wi with each flow i
• If the link is congested, compute f such that
Σi min(ri, f · wi) = C
• Example: the same three flows (rates 8, 6, 2) with weights w1 = 3, w2 = 1, w3 = 1 and C = 10
– f = 2: min(8, 2·3) = 6; min(6, 2·1) = 2; min(2, 2·1) = 2; allocations sum to 10
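Both computations can be sketched with a bisection on f (an illustrative sketch; a closed-form solution by sorting the ratios ri/wi is also possible). The left-hand side Σi min(ri, f·wi) is continuous and nondecreasing in f, which is what makes bisection work:

```python
def fair_rate(rates, weights, capacity):
    """Find f such that sum_i min(r_i, f * w_i) = capacity.
    Assumes the link is congested: sum(rates) > capacity.
    With all weights 1 this is the plain fair rate of slide 38."""
    lo, hi = 0.0, max(r / w for r, w in zip(rates, weights))
    for _ in range(60):                      # bisection to high precision
        f = (lo + hi) / 2
        if sum(min(r, f * w) for r, w in zip(rates, weights)) < capacity:
            lo = f
        else:
            hi = f
    return (lo + hi) / 2

f_plain = fair_rate([8, 6, 2], [1, 1, 1], 10)   # slide 38 example: f = 4
f_gps = fair_rate([8, 6, 2], [3, 1, 1], 10)     # slide 39 example: f = 2
```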
40
Generalized Processor Sharing
[Figure: GPS service on one link over time 0–15, shared by several flows (weights 5, 1, 1, 1, 1, 1)]
• Red session has packets backlogged between time 0 and 10
• Other sessions have packets continuously backlogged
41
Generalized Processor Sharing
• A work-conserving GPS is defined as
Wi(t, t+dt) = (wi / Σj∈B(t) wj) · W(t, t+dt),  for all i ∈ B(t)
• where
– wi – weight of flow i
– Wi(t1, t2) – total service received by flow i during [t1, t2)
– W(t1, t2) – total service allocated to all flows during [t1, t2)
– B(t) – set of flows backlogged at time t
42
Properties of GPS
• End-to-end delay bounds for guaranteed service [Parekh and Gallager ‘93]
• Fair allocation of bandwidth for best effort service [Demers et al. ‘89, Parekh and Gallager ‘92]
• Work-conserving for high link utilization
43
Packet vs. Fluid System
• GPS is defined in an idealized fluid flow model
– Multiple queues can be serviced simultaneously
• Real systems are packet systems
– One queue is served at any given time
– Packet transmission cannot be preempted
• Goal
– Define packet algorithms approximating the fluid system
– Maintain most of the important properties
44
Packet Approximation of Fluid System
• Standard technique for approximating fluid GPS
– Select the packet that finishes first in GPS, assuming that there are no future arrivals
• Important property of GPS
– The finishing order of packets currently in the system is independent of future arrivals
• Implementation based on virtual time
– Assign a virtual finish time to each packet upon arrival
– Serve packets in increasing order of virtual times
45
Approximating GPS with WFQ
• Fluid GPS system service order
[Figure: GPS service order on a timeline from 0 to 10]
• Weighted Fair Queueing
– select the first packet that finishes in GPS
46
System Virtual Time
• Virtual time VGPS(t) – service that a backlogged flow with weight 1 would receive in GPS
Wi(t, t+dt) = (wi / Σj∈B(t) wj) · W(t, t+dt),  i ∈ B(t)
dVGPS(t)/dt = (1 / Σj∈B(t) wj) · dW(t)/dt
Wi(t1, t2) = wi · ∫ from t1 to t2 of (1 / Σj∈B(t) wj) dW(t),  if i ∈ B(t) throughout [t1, t2)
47
Service Allocation in GPS
• The service received by flow i during an interval [t1, t2), while it is backlogged, is
Wi(t1, t2) = wi · ∫ from t1 to t2 of dVGPS(t),  i ∈ B(t)
which gives
Wi(t1, t2) = wi · (VGPS(t2) − VGPS(t1)),  i ∈ B(t)
48
Virtual Time Implementation of Weighted Fair Queueing
VGPS(0) = 0
S_j^k = F_j^(k−1)                       if session j is backlogged
S_j^k = max(F_j^(k−1), VGPS(a_j^k))     if session j is un-backlogged
F_j^k = S_j^k + L_j^k / wj
• a_j^k – arrival time of packet k of flow j
• S_j^k – virtual starting time of packet k of flow j
• F_j^k – virtual finishing time of packet k of flow j
• L_j^k – length of packet k of flow j
49
Virtual Time Implementation of Weighted Fair Queueing
• Need to keep virtual start and finish times per flow only, instead of per packet
• The system virtual time is used to reset a flow’s virtual start time when the flow becomes backlogged again after being idle
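The per-flow bookkeeping above can be sketched as follows. This is an illustrative sketch: the per-flow start/finish updates follow the slide's formulas, but the system virtual time update here simply jumps to the served packet's finish time, which only approximates the true GPS emulation:

```python
import heapq

class WFQ:
    """Packet WFQ sketch: serve packets in increasing virtual finish time.
    Per flow (as on slide 48): S = max(V, F_prev); F = S + L / w."""
    def __init__(self):
        self.V = 0.0           # system virtual time (approximate)
        self.last_finish = {}  # per-flow virtual finish time of last packet
        self.heap = []         # (virtual finish, arrival seq, flow)
        self.seq = 0

    def enqueue(self, flow, length, weight):
        start = max(self.V, self.last_finish.get(flow, 0.0))
        finish = start + length / weight
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, self.seq, flow))
        self.seq += 1

    def dequeue(self):
        finish, _, flow = heapq.heappop(self.heap)
        self.V = max(self.V, finish)  # crude virtual-time advance (see note)
        return flow

wfq = WFQ()
wfq.enqueue("a", length=1000, weight=1)  # F = 1000
wfq.enqueue("b", length=1000, weight=2)  # F = 500: served first
order = [wfq.dequeue() for _ in range(2)]
```

The equal-length packet of the heavier flow gets the smaller virtual finish time, so it departs first.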
50
System Virtual Time in GPS
[Figure: VGPS(t) over time 0–16 for flows with weights 1/2, 1/8, 1/8, 1/8, 1/8; the slope of VGPS changes (e.g., between C and 2C) as the set of backlogged flows changes]
51
Virtual Start and Finish Times
• Utilize the times S_i^k and F_i^k at which packet k of flow i would start and finish in a fluid system
F_i^k = S_i^k + L_i^k / wi
[Figure: virtual start and finish times of a packet read off VGPS(t) over time 0–16]
52
Goals in Designing Packet Fair Queueing Algorithms
• Improve worst-case fairness (see next):
– Use the Smallest Eligible virtual Finish time First (SEFF) policy
– Examples: WF2Q, WF2Q+
• Reduce complexity
– Use simpler virtual time functions
– Examples: SCFQ, SFQ, DRR, FBFQ, leap-forward Virtual Clock, WF2Q+
• Improve resource allocation flexibility
– Service curves
53
Worst-case Fair Index (WFI)
• Maximum discrepancy between the service received by a flow in the fluid flow system and in the packet system
• In WFQ, WFI = O(n), where n is the total number of backlogged flows
• In WF2Q, WFI = 1
54
WFI Example
[Figure: service order in the fluid-flow (GPS) system]
• WFQ (smallest finish time first): WFI = 2.5
• WF2Q (smallest eligible finish time first): WFI = 1
55
Hierarchical Resource Sharing
• Resource contention/sharing at different levels
• Resource management policies should be set at different levels, by different entities
– Resource owner
– Service providers
– Organizations
– Applications
[Diagram: a 155 Mbps link shared hierarchically: Provider 1 and Provider 2 receive 50 Mbps each; below them, organizations (Stanford, Berkeley, Campus, EECS) and applications (seminar video, seminar audio, WEB, Stat) with allocations such as 100 Mbps, 55 Mbps, 20 Mbps, and 10 Mbps]
56
Hierarchical-GPS Example
• Red session has packets backlogged at time 5
• Other sessions have packets continuously backlogged
[Figure: two-level GPS hierarchy (e.g., weights 4 and 1 at the top level, 1s below) over time 0–20; the first red packet arrives at 5 and is served at 7.5]
57
Packet Approximation of H-GPS
• Idea 1
– Select the packet finishing first in H-GPS, assuming there are no future arrivals
– Problem:
• The finish order in the system depends on future arrivals
• A virtual time implementation won’t work
• Idea 2
– Use a hierarchy of PFQ schedulers to approximate H-GPS
[Diagram: H-GPS as a tree of GPS nodes (e.g., weight 10 at the root, weights 6, 4, 3, 2, 1 below) alongside its packetized counterpart with a PFQ scheduler at each node]
58
Problems with Idea 1
• The order of the fourth blue packet's finish time and the first green packet's finish time changes as a result of a red packet arrival
[Figure: two arrival scenarios; depending on whether the red packet arrives, either the blue packet or the green packet finishes first, yet the scheduling decision must be made earlier]
59
Hierarchical-WFQ Example
• A packet at the second level can miss its deadline (finish time) by an amount of time that, in the worst case, is proportional to WFI
[Figure: first-level and second-level packet schedules; the first red packet arrives at 5 but is served at 11!]
60
Hierarchical-WF2Q Example
• In WF2Q, all packets meet their deadlines modulo the time to transmit a packet (at the line speed) at each level
[Figure: first-level and second-level packet schedules; the first red packet arrives at 5 and is served at 7]
61
WF2Q+
• WFQ and WF2Q
– Need to emulate a fluid GPS system
– High complexity
• WF2Q+
– Provides the same delay bound and WFI as WF2Q
– Lower complexity
• Key difference: virtual time computation
V_WF2Q+(t + τ) = max( V_WF2Q+(t) + W(t, t + τ),  min over i ∈ B(t) of S_i^(h_i(t)) )
– h_i(t) – sequence number of the packet at the head of the queue of flow i
– S_i^(h_i(t)) – virtual starting time of that packet
– B(t) – set of flows backlogged at time t in the packet system
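The update rule can be written directly (a sketch; the variable names follow the slide, and the formula is the reconstruction given above):

```python
def wf2q_plus_virtual_time(V, work, head_start_times):
    """V(t + tau) = max( V(t) + W(t, t + tau), min_i S_i^{h_i(t)} ).
    V: current virtual time.
    work: W(t, t + tau), the amount of service delivered in (t, t + tau].
    head_start_times: virtual start times of the packets at the heads of
    the backlogged flows' queues (empty if no flow is backlogged)."""
    if not head_start_times:
        return V + work
    return max(V + work, min(head_start_times))

# Virtual time never falls behind the least-advanced backlogged flow.
v = wf2q_plus_virtual_time(V=10.0, work=2.0, head_start_times=[15.0, 18.0])
```

The max with the minimum head start time is what lets WF2Q+ avoid emulating the fluid system: it keeps V from drifting below every eligible packet without integrating GPS exactly.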
62
Example Hierarchy
63
Uncorrelated Cross Traffic
[Figure: packet delay (20–60 ms scale) under H-WFQ, H-SCFQ, H-SFQ, and H-WF2Q+ with uncorrelated cross traffic]
64
Correlated Cross Traffic
[Figure: packet delay (20–60 ms scale) under H-WFQ, H-SCFQ, H-SFQ, and H-WF2Q+ with correlated cross traffic]
65
Recap: System Virtual Time
• Let ta be the starting time of a backlogged interval
– Backlogged interval – an interval during which the queue is never empty
• Let t be an arbitrary time during the backlogged interval starting at ta
• Then the system virtual time at time t, V(t), represents the service that a flow that (1) has weight 1, and (2) is continuously backlogged during the interval [ta, t), would receive during [ta, t)
66
Why Service Curve?
• WFQ, WF2Q, H-WF2Q+
– Guarantee a minimum rate: C · wi / Σ from j=1 to N of wj, where N is the total number of flows
– A packet is served no later than its finish time in GPS (H-GPS), modulo the sum of the maximum packet transmission times at each level
• For better resource utilization we need to specify more sophisticated services (example to follow shortly)
• Solution: the service curve QoS model
67
What is a Service Model?
• The QoS measures (delay, throughput, loss, cost) depend on offered traffic, and possibly other external processes
• A service model attempts to characterize the relationship between offered traffic, delivered traffic, and possibly other external processes
[Diagram: offered traffic enters a network element and leaves as delivered traffic, influenced by an “external process” (connection oriented)]
68
Arrival and Departure Process
• Rin(t) = arrival process = amount of data arriving up to time t
• Rout(t) = departure process = amount of data departing up to time t
[Figure: cumulative curves Rin(t) and Rout(t); the horizontal gap between them is the delay, the vertical gap is the buffer occupancy]
69
Traffic Envelope (Arrival Curve)
• Maximum amount of traffic that a flow can send during an interval of length t (“burstiness constraint”)
[Figure: envelope b(t) with slope = peak rate near the origin and slope = maximum average rate at longer time scales]
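The two-slope envelope in the figure is the standard dual token-bucket arrival curve; a minimal sketch (the numeric parameters are invented for illustration):

```python
def envelope(t, peak, rate, burst):
    """b(t) = min(peak * t, burst + rate * t): traffic is limited by the
    peak rate over short intervals and by the long-term average rate plus
    the burst allowance over long ones."""
    return min(peak * t, burst + rate * t)

# Illustrative parameters: peak 10 Mb/s, average 1 Mb/s, burst 4 Mb.
b = [envelope(t, peak=10, rate=1, burst=4) for t in (0.0, 0.25, 1.0, 10.0)]
```

The crossover between the two slopes happens where peak·t = burst + rate·t, here at t = 4/9 s.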
70
Service Curve
• Assume a flow that is idle at time s and is backlogged during the interval (s, t)
• Service curve: the minimum service received by the flow during the interval (s, t)
71
Big Picture
[Figure: arrival process Rin(t), the service curve, and the resulting departure process Rout(t); the link capacity appears as a line of slope C]
72
Delay and Buffer Bounds
[Figure: envelope E(t) and service curve S(t); the maximum horizontal distance between them is the maximum delay, and the maximum vertical distance is the maximum buffer]
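Those two distances can be checked numerically for concrete curves. In this sketch, E is assumed to be a token-bucket envelope and S a rate-latency service curve; both concrete choices are assumptions for illustration, not from the slides:

```python
def max_backlog(E, S, horizon, dt=0.01):
    """Maximum vertical distance between envelope and service curve."""
    steps = int(horizon / dt) + 1
    return max(E(i * dt) - S(i * dt) for i in range(steps))

def max_delay(E, S, horizon, dt=0.01):
    """For each t, the smallest d with S(t + d) >= E(t); worst case over t."""
    worst = 0.0
    for i in range(int(horizon / dt) + 1):
        t, d = i * dt, 0.0
        while S(t + d) < E(t):
            d += dt
        worst = max(worst, d)
    return worst

E = lambda t: 4 + 1 * t              # token bucket: burst 4, rate 1
S = lambda t: max(0.0, 2 * (t - 1))  # rate-latency: rate 2, latency 1
buffer_bound = max_backlog(E, S, horizon=10)  # analytic value: 5 (at t = 1)
delay_bound = max_delay(E, S, horizon=10)     # analytic value: 3 (at t = 0)
```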
73
Service Curve-based Earliest Deadline (SCED)
• Packet deadline – the time at which the packet would be served assuming that the flow receives no more service than its service curve
• Serve packets in increasing order of their deadlines
• Properties
– If the sum of all service curves is <= C·t, all packets will meet their deadlines modulo the transmission time of a maximum-length packet, i.e., Lmax/C
[Figure: deadlines of packets 1–4 read off the flow's service curve]
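For the special case of a linear service curve, the deadline computation can be sketched as follows (illustrative; SCED's strength is that it also handles non-linear service curves, which this sketch does not):

```python
def sced_deadlines(packets, rate):
    """Deadlines for one flow under a linear service curve S(t) = rate * t.
    packets: list of (arrival_time, length_bits).
    Each packet's deadline is the time at which the flow's cumulative
    service curve reaches the flow's cumulative bits."""
    deadlines, service_horizon = [], 0.0
    for arrival, length in packets:
        # Service for this packet cannot start before the packet arrives,
        # nor before the curve has covered the previous packets.
        start = max(arrival, service_horizon)
        service_horizon = start + length / rate
        deadlines.append(service_horizon)
    return deadlines

# Back-to-back packets consume the curve sequentially; a late arrival
# restarts the deadline clock from its own arrival time.
d = sced_deadlines([(0.0, 100), (0.0, 100), (5.0, 100)], rate=100)
```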
74
Linear Service Curves: Example
[Figure: arrival curves, linear service curves, arrival processes, and deadline computation for a Video flow and an FTP flow; with linear service curves, video packets have to wait behind FTP packets]
75
Non-Linear Service Curves: Example
[Figure: the same flows with non-linear service curves; video packets are transmitted as soon as they arrive]
76
Summary
• WF2Q+ guarantees that each packet is served no later than its finish time in GPS, modulo the transmission time of a maximum-length packet
– Supports hierarchical link sharing
• SCED guarantees that each packet meets its deadline, modulo the transmission time of a maximum-length packet
– Decouples bandwidth and delay allocations
• Question: does SCED support hierarchical link sharing?
– No (why not?)
• Hierarchical Fair Service Curve (H-FSC) [Stoica, Zhang & Ng ’97]
– Supports nonlinear service curves
– Supports hierarchical link sharing