Router Design and Packet Scheduling
2
IP Router
• A router consists of:
– A set of input interfaces at which packets arrive
– A set of output interfaces from which packets depart
• A router implements two main functions:
– Forward packets to the corresponding output interface
– Manage congestion
3
Generic Router Architecture
• Input and output interfaces are connected through a backplane
• A backplane can be implemented by:
– Shared memory: low-capacity routers (e.g., PC-based routers)
– Shared bus: medium-capacity routers
– Point-to-point (switched) bus: high-capacity routers
[Diagram: input interfaces connected to output interfaces through an interconnection medium (backplane)]
4
What a Router Looks Like
• Cisco GSR 12416: 19” wide, 6 ft tall, 2 ft deep; capacity: 160 Gb/s; power: 4.2 kW
• Juniper M160: 19” wide, 3 ft tall, 2.5 ft deep; capacity: 80 Gb/s; power: 2.6 kW
5
Points of Presence (POPs)
[Diagram: a wide-area backbone with POP1–POP8; access routers A–F attach at different POPs]
6
Basic Architectural Components of an IP Router
• Control plane: routing protocols, routing table
• Datapath (per-packet processing): forwarding table, switching
7
Per-packet Processing in an IP Router
1. Accept packet arriving on an ingress line.
2. Look up the packet's destination address in the forwarding table to identify the outgoing interface(s).
3. Manipulate the packet header: e.g., decrement TTL, update header checksum.
4. Send the packet to the outgoing interface(s).
5. Queue until the line is free.
6. Transmit the packet onto the outgoing line.
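Step 2, the forwarding lookup, is a longest-prefix match. A minimal sketch (a linear scan for clarity; real routers use tries or TCAMs, and the FIB contents here are made up):

```python
import ipaddress

def longest_prefix_match(fib, dst):
    """Return the next hop of the longest prefix containing dst.
    fib: list of (network, next_hop) pairs."""
    best_net, best_hop = None, None
    for net, hop in fib:
        if dst in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_hop = net, hop
    return best_hop

# Illustrative FIB: a default route plus two nested prefixes.
fib = [
    (ipaddress.ip_network("0.0.0.0/0"), "if0"),
    (ipaddress.ip_network("10.0.0.0/8"), "if1"),
    (ipaddress.ip_network("10.1.0.0/16"), "if2"),
]
hop = longest_prefix_match(fib, ipaddress.ip_address("10.1.2.3"))
```

With the FIB above, 10.1.2.3 matches all three prefixes and the /16 wins.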
8
Generic Router Architecture
[Diagram: header processing per packet: IP address lookup against an address table (~1M prefixes, off-chip DRAM) maps IP address to next hop, then header update, then packet queuing in buffer memory (~1M packets, off-chip DRAM)]
9
Generic Router Architecture
[Diagram: the lookup/header-update pipeline with its address table replicated per input port; a buffer manager with buffer memory per output port]
10
Packet Processing Is Getting Harder
[Chart: CPU instructions available per minimum-length packet, 1996–2001, log scale from 1 to 1000, declining over time]
11
Speedup
• C – input/output link capacity
• RI – maximum rate at which an input interface can send data into the backplane
• RO – maximum rate at which an output can read data from the backplane
• B – maximum aggregate backplane transfer rate
• Backplane speedup: B/C
• Input speedup: RI/C
• Output speedup: RO/C
[Diagram: input and output interfaces at rate C connected through the interconnection medium (backplane); rates RI, RO, and B annotated]
12
Function Division
• Input interfaces:
– Must perform packet forwarding: need to know to which output interface to send packets
– May enqueue packets and perform scheduling
• Output interfaces:
– May enqueue packets and perform scheduling
13
Three Router Architectures
• Output queued
• Input queued
• Combined input-output queued
14
Output Queued (OQ) Routers
• Only output interfaces store packets
• Advantages
– Easy to design algorithms: only one congestion point
• Disadvantages
– Requires an output speedup of N, where N is the number of interfaces: not feasible
15
Input Queueing (IQ) Routers
• Only input interfaces store packets
• Advantages
– Easy to build: store packets at inputs if there is contention at outputs
– Relatively easy to design algorithms: only one congestion point, but not at the output… need to implement backpressure
• Disadvantages
– Hard to achieve utilization 1 (due to output contention and head-of-line blocking)
• However, theoretical and simulation results show that for realistic traffic an input/output speedup of 2 is enough to achieve utilization close to 1
16
Combined Input-Output Queueing (CIOQ) Routers
• Both input and output interfaces store packets
• Advantages
– Easy to build
– Utilization 1 can be achieved with limited input/output speedup (<= 2)
• Disadvantages
– Harder to design algorithms: two congestion points; need to design flow control
– Note: recent results show that with an input/output speedup of 2, a CIOQ router can emulate any work-conserving OQ router [G+98, SZ98]
17
Generic Architecture of a High-Speed Router Today
• Combined input-output queued architecture
– Input/output speedup <= 2
• Input interface
– Performs packet forwarding (and classification)
• Output interface
– Performs packet (classification and) scheduling
• Backplane
– Point-to-point (switched) bus; speedup N
– Schedules packet transfers from input to output
18
Backplane
• A point-to-point switch allows simultaneous packet transfers between any disjoint pairs of input-output interfaces
• Goal: come up with a schedule that
– Meets flow QoS requirements
– Maximizes router throughput
• Challenges:
– Address head-of-line blocking at inputs
– Resolve input/output speedup contention
– Avoid packet dropping at outputs if possible
• Note: packets are fragmented into fixed-size cells (why?) at the inputs and reassembled at the outputs
– In Partridge et al., a cell is 64 B (what are the trade-offs?)
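The fragmentation and reassembly step can be sketched as follows (a toy illustration; the 64 B cell size matches Partridge et al., but the padding scheme and the idea of carrying the original length out of band are assumptions for this sketch):

```python
def fragment(packet: bytes, cell_size: int = 64):
    """Split a packet into fixed-size cells, zero-padding the last one.
    Fixed-size cells make backplane transfers uniform in duration, so
    the allocator can schedule them in fixed epochs."""
    cells = [packet[i:i + cell_size] for i in range(0, len(packet), cell_size)]
    if cells:
        cells[-1] = cells[-1].ljust(cell_size, b"\x00")
    return cells

def reassemble(cells, original_length: int) -> bytes:
    """Concatenate cells at the output and strip the padding."""
    return b"".join(cells)[:original_length]

pkt = bytes(range(100))   # a 100-byte packet -> two 64 B cells
cells = fragment(pkt)
```

The trade-off the slide hints at: smaller cells waste more capacity on padding and per-cell overhead, larger cells increase scheduling granularity and latency.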
19
Head-of-line Blocking
• The cell at the head of an input queue cannot be transferred, thus blocking the following cells
[Diagram: three input FIFOs and three outputs; one head cell cannot be transferred because its output buffer is full, and the cell behind it is blocked by that head cell even though its own output is free]
20
Solution to Avoid Head-of-line Blocking
• Maintain at each input N virtual queues, i.e., one per output
[Diagram: each of the three inputs keeps a separate virtual queue per output, so cells destined for different outputs no longer block one another]
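A minimal sketch of virtual output queues (VOQs) at one input port (the class and the cell representation are invented for illustration):

```python
from collections import deque

class VOQInput:
    """One input port with a virtual output queue per output.
    A head cell blocked on a busy output no longer blocks cells
    destined for other outputs."""
    def __init__(self, num_outputs: int):
        self.voq = [deque() for _ in range(num_outputs)]

    def enqueue(self, cell, output: int):
        self.voq[output].append(cell)

    def head(self, output: int):
        # The backplane scheduler inspects each VOQ head independently.
        return self.voq[output][0] if self.voq[output] else None

    def dequeue(self, output: int):
        return self.voq[output].popleft()

inp = VOQInput(num_outputs=3)
inp.enqueue("red", output=0)     # suppose output 0 is congested
inp.enqueue("green", output=1)   # can still be transferred immediately
```

With a single FIFO, "green" would sit behind "red"; with VOQs the scheduler can serve it right away.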
21
Cell Transfer
• Schedule:
– Ideally: find the maximum number of input-output pairs such that:
• Input/output contentions are resolved
• Packet drops at outputs are avoided
• Packets meet their time constraints (e.g., deadlines), if any
• Example
– Assign cell preferences at inputs, e.g., their position in the input queue
– Assign cell preferences at outputs, e.g., based on packet deadlines, or the order in which cells would depart in an OQ router
– Match inputs and outputs based on their preferences
• Problem:
– Achieving a high-quality matching is complex, i.e., hard to do in constant time
22
A Case Study [Partridge et al. ’98]
• Goal: show that routers can keep pace with improvements in transmission link bandwidth
• Architecture
– A CIOQ router
– 15 (input/output) line cards: C = 2.4 Gbps
• Each input card can handle up to 16 (input/output) interfaces
• Separate forwarding engines (FEs) perform routing
– Backplane: point-to-point (switched) bus, capacity B = 50 Gbps (32 MPPS)
• B/C ≈ 20, but 25% of B is lost to overhead (control) traffic
23
Router Architecture
[Diagram: data path through the router; only the packet header is passed to the forwarding engine]
24
Router Architecture
[Diagram: 15 input/output line cards connected by the backplane to the forwarding engines and the network processor; data flows in and out through the line cards, while control data (e.g., routing) goes to the network processor, which updates routing tables and sets scheduling (QoS) state]
25
Router Architecture: Data Plane
• Line cards
– Input processing: can handle input links up to 2.4 Gbps (3.3 Gbps including overhead)
– Output processing: uses a 52 MHz FPGA; implements QoS
• Forwarding engine:
– 415 MHz DEC Alpha 21164 processor, three-level cache to store recent routes
• Up to 12,000 routes in the second-level cache (96 kB); ~95% hit rate
• Entire routing table in the tertiary cache (16 MB divided into two banks)
26
Router Architecture: Control Plane
• Network processor: 233 MHz Alpha 21064 running NetBSD 1.1
– Updates routing
– Manages link status
– Implements reservations
• Backplane allocator: implemented by an FPGA
– Schedules transfers between input/output interfaces
27
Data Plane Details: Checksum
• Verifying the checksum takes too much time
– Increases forwarding time by 21%
• Take an optimistic approach: just incrementally update it
– Safe operation: if the checksum was correct, it remains correct
– If the checksum is bad, it will be caught by the end-host anyway
• Note: IPv6 does not include a header checksum anyway!
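The incremental update is the standard RFC 1624 technique. A sketch, with a made-up header, checking that the incremental result matches a full recomputation after a TTL decrement:

```python
def ones_complement_sum(words):
    """16-bit one's-complement sum with end-around carry."""
    s = 0
    for w in words:
        s += w
        s = (s & 0xFFFF) + (s >> 16)
    return s

def header_checksum(words):
    # Full checksum: complement of the one's-complement sum, computed
    # over the header words with the checksum field itself set to zero.
    return ~ones_complement_sum(words) & 0xFFFF

def incremental_update(checksum, old_word, new_word):
    """RFC 1624, eqn 3: HC' = ~( ~HC + ~m + m' )."""
    s = ones_complement_sum([~checksum & 0xFFFF, ~old_word & 0xFFFF, new_word & 0xFFFF])
    return ~s & 0xFFFF

# Example IPv4 header words (checksum field omitted); values are invented.
words = [0x4500, 0x0054, 0x1C46, 0x4000, 0x4006,
         0xAC10, 0x0A63, 0xAC10, 0x0A0C]
hc = header_checksum(words)
old = words[4]            # 0x4006: TTL = 0x40, protocol = 6
new = old - 0x0100        # decrement TTL: 0x40 -> 0x3F
hc2 = incremental_update(hc, old, new)
words2 = list(words)
words2[4] = new
```

The "safe operation" property on the slide follows directly: the update changes the checksum by exactly the change in the summed words, so a correct checksum stays correct and an incorrect one stays incorrect.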
28
Data Plane Details: Slow Path Processing
1. Headers whose destination misses in the cache
2. Headers with errors
3. Headers with IP options
4. Datagrams that require fragmentation
5. Multicast datagrams
– Requires multicast routing, which is based on the source address and inbound link as well
– Requires multiple copies of the header to be sent to different line cards
29
Control Plane: Backplane Allocator
• Time is divided into epochs
– An epoch consists of 16 ticks of the data clock (8 allocation clocks)
• Transfer unit: 64 B (8 data clock ticks)
• During one epoch, up to 15 simultaneous transfers
– One transfer: two transfer units (128 B of data + 176 auxiliary bits)
• Minimum of 4 epochs to schedule and complete a transfer, but scheduling is pipelined:
1. Source card signals that it has data to send to the destination card
2. Switch allocator schedules the transfer
3. Source and destination cards are notified and told to configure themselves
4. Transfer takes place
• Flow control through inhibit pins
30
The Switch Allocator Card
• Takes connection requests from function cards
• Takes inhibit requests from destination cards
• Computes a transfer configuration for each epoch
• 15×15 = 225 possible input-output pairings, with 15! possible patterns
31
Allocator Algorithm
32
The Switch Allocator
• Disadvantages of the simple allocator
– Unfair: there is a preference for low-numbered sources
– Requires evaluating 225 positions per epoch, which is too fast for an FPGA
• Solution to the unfairness problem: random shuffling of sources and destinations
• Solution to the timing problem: parallel evaluation of multiple locations
• Priority to requests from forwarding engines over line cards, to avoid header contention on line cards
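A toy version of greedy allocation with random shuffling (the data layout is invented; the real allocator evaluates the 15×15 request matrix in hardware):

```python
import random

def allocate(requests, rng=random):
    """Greedy epoch allocator: grant each input its first free requested
    output. Shuffling the scan order removes the fixed bias toward
    low-numbered sources that a deterministic scan would have.
    requests: dict mapping input card -> list of requested outputs."""
    order = list(requests)
    rng.shuffle(order)           # fairness: randomize source order
    granted = set()              # outputs already matched this epoch
    matching = {}
    for i in order:
        for o in requests[i]:
            if o not in granted:  # first free requested output wins
                matching[i] = o
                granted.add(o)
                break
    return matching

# Two inputs requesting disjoint outputs are both granted.
m = allocate({0: [0], 1: [1]}, rng=random.Random(42))
```

Note this greedy pass is a maximal, not maximum, matching; that is exactly the quality-vs-time trade-off the slides describe.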
33
Summary: Design Decisions (Innovations)
1. Each FE has a complete set of the routing tables
2. A switched fabric is used instead of the traditional shared bus
3. FEs are on boards distinct from the line cards
4. Use of an abstract link-layer header
5. QoS processing is included in the router
Packet Scheduling
35
Packet Scheduling
• Decide when and what packet to send on the output link
– Usually implemented at the output interface
[Diagram: arriving packets pass through a classifier into per-flow queues (flow 1 … flow n) with buffer management; a scheduler picks which queue to serve on the output link]
36
Why Packet Scheduling?
• Can provide per-flow or per-aggregate protection
• Can provide absolute and relative differentiation in terms of
– Delay
– Bandwidth
– Loss
37
Fair Queueing
• In a fluid flow system it reduces to bit-by-bit round robin among flows
– Each flow receives min(ri, f), where
• ri – flow arrival rate
• f – link fair rate (see next slide)
• Weighted Fair Queueing (WFQ) – associate a weight with each flow [Demers, Keshav & Shenker ’89]
– In a fluid flow system it reduces to weighted bit-by-bit round robin
• WFQ in a fluid flow system = Generalized Processor Sharing (GPS) [Parekh & Gallager ’92]
38
Fair Rate Computation
• If the link is congested, compute f such that
Σi min(ri, f) = C
• Example: three flows with arrival rates 8, 6, and 2 share a link of capacity C = 10
– f = 4: min(8, 4) = 4; min(6, 4) = 4; min(2, 4) = 2; allocations sum to 10
39
Fair Rate Computation in GPS
• Associate a weight wi with each flow i
• If the link is congested, compute f such that
Σi min(ri, f · wi) = C
• Example: the same three flows (rates 8, 6, 2) with weights w1 = 3, w2 = 1, w3 = 1 and C = 10
– f = 2: min(8, 2·3) = 6; min(6, 2·1) = 2; min(2, 2·1) = 2; allocations sum to 10
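Both computations can be sketched with a bisection on f (an illustrative sketch; a closed-form solution by sorting the ratios ri/wi is also possible). The left-hand side Σi min(ri, f·wi) is continuous and nondecreasing in f, which is what makes bisection work:

```python
def fair_rate(rates, weights, capacity):
    """Find f such that sum_i min(r_i, f * w_i) = capacity.
    Assumes the link is congested: sum(rates) > capacity.
    With all weights 1 this is the plain fair rate of slide 38."""
    lo, hi = 0.0, max(r / w for r, w in zip(rates, weights))
    for _ in range(60):                      # bisection to high precision
        f = (lo + hi) / 2
        if sum(min(r, f * w) for r, w in zip(rates, weights)) < capacity:
            lo = f
        else:
            hi = f
    return (lo + hi) / 2

f_plain = fair_rate([8, 6, 2], [1, 1, 1], 10)   # slide 38 example: f = 4
f_gps = fair_rate([8, 6, 2], [3, 1, 1], 10)     # slide 39 example: f = 2
```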
40
Generalized Processor Sharing
[Figure: GPS service on one link over time 0–15, shared by several flows (weights 5, 1, 1, 1, 1, 1)]
• Red session has packets backlogged between time 0 and 10
• Other sessions have packets continuously backlogged
41
Generalized Processor Sharing
• A work-conserving GPS is defined as
Wi(t, t+dt) = (wi / Σj∈B(t) wj) · W(t, t+dt),  for all i ∈ B(t)
• where
– wi – weight of flow i
– Wi(t1, t2) – total service received by flow i during [t1, t2)
– W(t1, t2) – total service allocated to all flows during [t1, t2)
– B(t) – set of flows backlogged at time t
42
Properties of GPS
• End-to-end delay bounds for guaranteed service [Parekh and Gallager ‘93]
• Fair allocation of bandwidth for best effort service [Demers et al. ‘89, Parekh and Gallager ‘92]
• Work-conserving for high link utilization
43
Packet vs. Fluid System
• GPS is defined in an idealized fluid flow model
– Multiple queues can be serviced simultaneously
• Real systems are packet systems
– One queue is served at any given time
– Packet transmission cannot be preempted
• Goal
– Define packet algorithms approximating the fluid system
– Maintain most of the important properties
44
Packet Approximation of Fluid System
• Standard technique for approximating fluid GPS
– Select the packet that finishes first in GPS, assuming that there are no future arrivals
• Important property of GPS
– The finishing order of packets currently in the system is independent of future arrivals
• Implementation based on virtual time
– Assign a virtual finish time to each packet upon arrival
– Serve packets in increasing order of virtual times
45
Approximating GPS with WFQ
• Fluid GPS system service order
[Figure: GPS service order on a timeline from 0 to 10]
• Weighted Fair Queueing
– select the first packet that finishes in GPS
46
System Virtual Time
• Virtual time VGPS(t) – service that a backlogged flow with weight 1 would receive in GPS
Wi(t, t+dt) = (wi / Σj∈B(t) wj) · W(t, t+dt),  i ∈ B(t)
dVGPS(t)/dt = (1 / Σj∈B(t) wj) · dW(t)/dt
Wi(t1, t2) = wi · ∫ from t1 to t2 of (1 / Σj∈B(t) wj) dW(t),  if i ∈ B(t) throughout [t1, t2)
47
Service Allocation in GPS
• The service received by flow i during an interval [t1, t2), while it is backlogged, is
Wi(t1, t2) = wi · ∫ from t1 to t2 of dVGPS(t),  i ∈ B(t)
which gives
Wi(t1, t2) = wi · (VGPS(t2) − VGPS(t1)),  i ∈ B(t)
48
Virtual Time Implementation of Weighted Fair Queueing
VGPS(0) = 0
S_j^k = F_j^(k−1)                       if session j is backlogged
S_j^k = max(F_j^(k−1), VGPS(a_j^k))     if session j is un-backlogged
F_j^k = S_j^k + L_j^k / wj
• a_j^k – arrival time of packet k of flow j
• S_j^k – virtual starting time of packet k of flow j
• F_j^k – virtual finishing time of packet k of flow j
• L_j^k – length of packet k of flow j
49
Virtual Time Implementation of Weighted Fair Queueing
• Need to keep virtual start and finish times per flow only, instead of per packet
• The system virtual time is used to reset a flow’s virtual start time when the flow becomes backlogged again after being idle
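The per-flow bookkeeping above can be sketched as follows. This is an illustrative sketch: the per-flow start/finish updates follow the slide's formulas, but the system virtual time update here simply jumps to the served packet's finish time, which only approximates the true GPS emulation:

```python
import heapq

class WFQ:
    """Packet WFQ sketch: serve packets in increasing virtual finish time.
    Per flow (as on slide 48): S = max(V, F_prev); F = S + L / w."""
    def __init__(self):
        self.V = 0.0           # system virtual time (approximate)
        self.last_finish = {}  # per-flow virtual finish time of last packet
        self.heap = []         # (virtual finish, arrival seq, flow)
        self.seq = 0

    def enqueue(self, flow, length, weight):
        start = max(self.V, self.last_finish.get(flow, 0.0))
        finish = start + length / weight
        self.last_finish[flow] = finish
        heapq.heappush(self.heap, (finish, self.seq, flow))
        self.seq += 1

    def dequeue(self):
        finish, _, flow = heapq.heappop(self.heap)
        self.V = max(self.V, finish)  # crude virtual-time advance (see note)
        return flow

wfq = WFQ()
wfq.enqueue("a", length=1000, weight=1)  # F = 1000
wfq.enqueue("b", length=1000, weight=2)  # F = 500: served first
order = [wfq.dequeue() for _ in range(2)]
```

The equal-length packet of the heavier flow gets the smaller virtual finish time, so it departs first.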
50
System Virtual Time in GPS
[Figure: VGPS(t) over time 0–16 for flows with weights 1/2, 1/8, 1/8, 1/8, 1/8; the slope of VGPS changes (e.g., between C and 2C) as the set of backlogged flows changes]
51
Virtual Start and Finish Times
• Utilize the times S_i^k and F_i^k at which packet k of flow i would start and finish in a fluid system
F_i^k = S_i^k + L_i^k / wi
[Figure: virtual start and finish times of a packet read off VGPS(t) over time 0–16]
52
Goals in Designing Packet Fair Queueing Algorithms
• Improve worst-case fairness (see next):
– Use the Smallest Eligible virtual Finish time First (SEFF) policy
– Examples: WF2Q, WF2Q+
• Reduce complexity
– Use simpler virtual time functions
– Examples: SCFQ, SFQ, DRR, FBFQ, leap-forward Virtual Clock, WF2Q+
• Improve resource allocation flexibility
– Service curves
53
Worst-case Fair Index (WFI)
• Maximum discrepancy between the service received by a flow in the fluid flow system and in the packet system
• In WFQ, WFI = O(n), where n is the total number of backlogged flows
• In WF2Q, WFI = 1
54
WFI Example
[Figure: service order in the fluid-flow (GPS) system]
• WFQ (smallest finish time first): WFI = 2.5
• WF2Q (smallest eligible finish time first): WFI = 1
55
Hierarchical Resource Sharing
• Resource contention/sharing at different levels
• Resource management policies should be set at different levels, by different entities
– Resource owner
– Service providers
– Organizations
– Applications
[Diagram: a 155 Mbps link shared hierarchically: Provider 1 and Provider 2 receive 50 Mbps each; below them, organizations (Stanford, Berkeley, Campus, EECS) and applications (seminar video, seminar audio, WEB, Stat) with allocations such as 100 Mbps, 55 Mbps, 20 Mbps, and 10 Mbps]
56
Hierarchical-GPS Example
• Red session has packets backlogged at time 5
• Other sessions have packets continuously backlogged
[Figure: two-level GPS hierarchy (e.g., weights 4 and 1 at the top level, 1s below) over time 0–20; the first red packet arrives at 5 and is served at 7.5]
57
Packet Approximation of H-GPS
• Idea 1
– Select the packet finishing first in H-GPS, assuming there are no future arrivals
– Problem:
• The finish order in the system depends on future arrivals
• A virtual time implementation won’t work
• Idea 2
– Use a hierarchy of PFQ schedulers to approximate H-GPS
[Diagram: H-GPS as a tree of GPS nodes (e.g., weight 10 at the root, weights 6, 4, 3, 2, 1 below) alongside its packetized counterpart with a PFQ scheduler at each node]
58
Problems with Idea 1
• The order of the fourth blue packet's finish time and the first green packet's finish time changes as a result of a red packet arrival
[Figure: two arrival scenarios; depending on whether the red packet arrives, either the blue packet or the green packet finishes first, yet the scheduling decision must be made earlier]
59
Hierarchical-WFQ Example
• A packet at the second level can miss its deadline (finish time) by an amount of time that, in the worst case, is proportional to WFI
[Figure: first-level and second-level packet schedules; the first red packet arrives at 5 but is served at 11!]
60
Hierarchical-WF2Q Example
• In WF2Q, all packets meet their deadlines modulo the time to transmit a packet (at the line speed) at each level
[Figure: first-level and second-level packet schedules; the first red packet arrives at 5 and is served at 7]
61
WF2Q+
• WFQ and WF2Q
– Need to emulate a fluid GPS system
– High complexity
• WF2Q+
– Provides the same delay bound and WFI as WF2Q
– Lower complexity
• Key difference: virtual time computation
V_WF2Q+(t + τ) = max( V_WF2Q+(t) + W(t, t + τ),  min over i ∈ B(t) of S_i^(h_i(t)) )
– h_i(t) – sequence number of the packet at the head of the queue of flow i
– S_i^(h_i(t)) – virtual starting time of that packet
– B(t) – set of flows backlogged at time t in the packet system
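The update rule can be written directly (a sketch; the variable names follow the slide, and the formula is the reconstruction given above):

```python
def wf2q_plus_virtual_time(V, work, head_start_times):
    """V(t + tau) = max( V(t) + W(t, t + tau), min_i S_i^{h_i(t)} ).
    V: current virtual time.
    work: W(t, t + tau), the amount of service delivered in (t, t + tau].
    head_start_times: virtual start times of the packets at the heads of
    the backlogged flows' queues (empty if no flow is backlogged)."""
    if not head_start_times:
        return V + work
    return max(V + work, min(head_start_times))

# Virtual time never falls behind the least-advanced backlogged flow.
v = wf2q_plus_virtual_time(V=10.0, work=2.0, head_start_times=[15.0, 18.0])
```

The max with the minimum head start time is what lets WF2Q+ avoid emulating the fluid system: it keeps V from drifting below every eligible packet without integrating GPS exactly.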
62
Example Hierarchy
63
Uncorrelated Cross Traffic
[Figure: packet delay (20–60 ms scale) under H-WFQ, H-SCFQ, H-SFQ, and H-WF2Q+ with uncorrelated cross traffic]
64
Correlated Cross Traffic
[Figure: packet delay (20–60 ms scale) under H-WFQ, H-SCFQ, H-SFQ, and H-WF2Q+ with correlated cross traffic]
65
Recap: System Virtual Time
• Let ta be the starting time of a backlogged interval
– Backlogged interval – an interval during which the queue is never empty
• Let t be an arbitrary time during the backlogged interval starting at ta
• Then the system virtual time at time t, V(t), represents the service that a flow that (1) has weight 1, and (2) is continuously backlogged during the interval [ta, t), would receive during [ta, t)
66
Why Service Curve?
• WFQ, WF2Q, H-WF2Q+
– Guarantee a minimum rate: C · wi / Σ from j=1 to N of wj, where N is the total number of flows
– A packet is served no later than its finish time in GPS (H-GPS), modulo the sum of the maximum packet transmission times at each level
• For better resource utilization we need to specify more sophisticated services (example to follow shortly)
• Solution: the service curve QoS model
67
What is a Service Model?
• The QoS measures (delay, throughput, loss, cost) depend on offered traffic, and possibly other external processes
• A service model attempts to characterize the relationship between offered traffic, delivered traffic, and possibly other external processes
[Diagram: offered traffic enters a network element and leaves as delivered traffic, influenced by an “external process” (connection oriented)]
68
Arrival and Departure Process
• Rin(t) = arrival process = amount of data arriving up to time t
• Rout(t) = departure process = amount of data departing up to time t
[Figure: cumulative curves Rin(t) and Rout(t); the horizontal gap between them is the delay, the vertical gap is the buffer occupancy]
69
Traffic Envelope (Arrival Curve)
• Maximum amount of traffic that a flow can send during an interval of length t (“burstiness constraint”)
[Figure: envelope b(t) with slope = peak rate near the origin and slope = maximum average rate at longer time scales]
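The two-slope envelope in the figure is the standard dual token-bucket arrival curve; a minimal sketch (the numeric parameters are invented for illustration):

```python
def envelope(t, peak, rate, burst):
    """b(t) = min(peak * t, burst + rate * t): traffic is limited by the
    peak rate over short intervals and by the long-term average rate plus
    the burst allowance over long ones."""
    return min(peak * t, burst + rate * t)

# Illustrative parameters: peak 10 Mb/s, average 1 Mb/s, burst 4 Mb.
b = [envelope(t, peak=10, rate=1, burst=4) for t in (0.0, 0.25, 1.0, 10.0)]
```

The crossover between the two slopes happens where peak·t = burst + rate·t, here at t = 4/9 s.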
70
Service Curve
• Assume a flow that is idle at time s and is backlogged during the interval (s, t)
• Service curve: the minimum service received by the flow during the interval (s, t)
71
Big Picture
[Figure: arrival process Rin(t), the service curve, and the resulting departure process Rout(t); the link capacity appears as a line of slope C]
72
Delay and Buffer Bounds
[Figure: envelope E(t) and service curve S(t); the maximum horizontal distance between them is the maximum delay, and the maximum vertical distance is the maximum buffer]
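Those two distances can be checked numerically for concrete curves. In this sketch, E is assumed to be a token-bucket envelope and S a rate-latency service curve; both concrete choices are assumptions for illustration, not from the slides:

```python
def max_backlog(E, S, horizon, dt=0.01):
    """Maximum vertical distance between envelope and service curve."""
    steps = int(horizon / dt) + 1
    return max(E(i * dt) - S(i * dt) for i in range(steps))

def max_delay(E, S, horizon, dt=0.01):
    """For each t, the smallest d with S(t + d) >= E(t); worst case over t."""
    worst = 0.0
    for i in range(int(horizon / dt) + 1):
        t, d = i * dt, 0.0
        while S(t + d) < E(t):
            d += dt
        worst = max(worst, d)
    return worst

E = lambda t: 4 + 1 * t              # token bucket: burst 4, rate 1
S = lambda t: max(0.0, 2 * (t - 1))  # rate-latency: rate 2, latency 1
buffer_bound = max_backlog(E, S, horizon=10)  # analytic value: 5 (at t = 1)
delay_bound = max_delay(E, S, horizon=10)     # analytic value: 3 (at t = 0)
```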
73
Service Curve-based Earliest Deadline (SCED)
• Packet deadline – the time at which the packet would be served assuming that the flow receives no more service than its service curve
• Serve packets in increasing order of their deadlines
• Properties
– If the sum of all service curves is <= C·t, all packets will meet their deadlines modulo the transmission time of a maximum-length packet, i.e., Lmax/C
[Figure: deadlines of packets 1–4 read off the flow's service curve]
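For the special case of a linear service curve, the deadline computation can be sketched as follows (illustrative; SCED's strength is that it also handles non-linear service curves, which this sketch does not):

```python
def sced_deadlines(packets, rate):
    """Deadlines for one flow under a linear service curve S(t) = rate * t.
    packets: list of (arrival_time, length_bits).
    Each packet's deadline is the time at which the flow's cumulative
    service curve reaches the flow's cumulative bits."""
    deadlines, service_horizon = [], 0.0
    for arrival, length in packets:
        # Service for this packet cannot start before the packet arrives,
        # nor before the curve has covered the previous packets.
        start = max(arrival, service_horizon)
        service_horizon = start + length / rate
        deadlines.append(service_horizon)
    return deadlines

# Back-to-back packets consume the curve sequentially; a late arrival
# restarts the deadline clock from its own arrival time.
d = sced_deadlines([(0.0, 100), (0.0, 100), (5.0, 100)], rate=100)
```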
74
Linear Service Curves: Example
[Figure: arrival curves, linear service curves, arrival processes, and deadline computation for a Video flow and an FTP flow; with linear service curves, video packets have to wait behind FTP packets]
75
Non-Linear Service Curves: Example
[Figure: the same flows with non-linear service curves; video packets are transmitted as soon as they arrive]
76
Summary
• WF2Q+ guarantees that each packet is served no later than its finish time in GPS, modulo the transmission time of a maximum-length packet
– Supports hierarchical link sharing
• SCED guarantees that each packet meets its deadline, modulo the transmission time of a maximum-length packet
– Decouples bandwidth and delay allocations
• Question: does SCED support hierarchical link sharing?
– No (why not?)
• Hierarchical Fair Service Curve (H-FSC) [Stoica, Zhang & Ng ’97]
– Supports nonlinear service curves
– Supports hierarchical link sharing