Packet Arbitration in VoQ Switches and Others, and QoS

DESCRIPTION

Packet arbitration in VoQ switches and others, and QoS. Recap: high-performance switch design requires scalable switch fabrics (crossbar, bit-sliced crossbar, Clos networks) and a solution to the memory bandwidth problem; our conclusion is to go for input-queued switches.

TRANSCRIPT
1CSIT560 By M. Hamdi
Packet Arbitration in VoQ Switches and Others, and QoS
2CSIT560 By M. Hamdi
Recap
• High-Performance Switch Design
– We need scalable switch fabrics – crossbar, bit-sliced crossbar, Clos networks.
– We need to solve the memory bandwidth problem
– Our conclusion is to go for input-queued switches
– We need to use VOQs instead of FIFO queues
– For these switches to function at high-speed, we need efficient and practically implementable scheduling/arbitration algorithms
3CSIT560 By M. Hamdi
Switch core architecture

[Figure: port processors (Port #1 … Port #256) connect via optics and the LCS protocol to a crossbar; a central scheduler receives Requests and returns Grants/Credits before cell data is transferred.]
4CSIT560 By M. Hamdi
Algorithms for VOQ Switching
• We analyzed several algorithms for matching inputs and outputs
– Maximum size matching: these are based on bipartite maximum matching, which can be solved using max-flow techniques in O(N^2.5)
• These are not practical for high-speed implementations
• They are stable (100% throughput) for uniform traffic
• They are not stable for non-uniform traffic
– Maximal size matching: they try to approximate maximum size matching
• PIM, iSLIP, SRR, etc. These are practical – they can be executed in parallel in O(log N) or even O(1)
• They are stable for uniform traffic and unstable for non-uniform traffic
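The request–grant–accept structure shared by these maximal-matching schemes can be sketched in Python. This is an illustrative single-iteration model (not code from the lecture); `islip_iteration`, `voq`, and the pointer arrays are hypothetical names.

```python
def islip_iteration(voq, grant_ptr, accept_ptr):
    """One request-grant-accept iteration over an N x N switch.

    voq[i][j] > 0 means input i has cells queued for output j.
    grant_ptr[j]  = round-robin pointer of output j's grant arbiter.
    accept_ptr[i] = round-robin pointer of input i's accept arbiter.
    Returns the matched (input, output) pairs and updates the pointers.
    """
    n = len(voq)
    # Step 1 (request): every input requests all outputs with a non-empty VOQ.
    requests = {j: [i for i in range(n) if voq[i][j] > 0] for j in range(n)}
    # Step 2 (grant): each output grants the first requester at or after its pointer.
    grants = {}  # input -> list of granting outputs
    for j in range(n):
        for k in range(n):
            i = (grant_ptr[j] + k) % n
            if i in requests[j]:
                grants.setdefault(i, []).append(j)
                break
    # Step 3 (accept): each input accepts the first grant at or after its pointer.
    matches = []
    for i, outs in grants.items():
        for k in range(n):
            j = (accept_ptr[i] + k) % n
            if j in outs:
                matches.append((i, j))
                # iSLIP-style rule: pointers advance one beyond the matched
                # port, and only on accept, which is what avoids starvation.
                grant_ptr[j] = (i + 1) % n
                accept_ptr[i] = (j + 1) % n
                break
    return matches
```

Running more iterations over the still-unmatched ports (with pointers frozen after the first) would grow the match toward a maximal one, as PIM and iSLIP do.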
5CSIT560 By M. Hamdi
Algorithms for VOQ Switching
– Maximum weight matching: these are maximum matchings based on weights such as queue length (LQF, LPF) or age of cell (OCF), with a complexity of O(N^3 log N)
• These are not practical for high-speed implementations. Much more difficult to implement than maximum size matching
• They are stable (100% throughput) under any admissible traffic
– Maximal weight matching: they try to approximate maximum weight matching. They use a request-grant-accept (RGA) mechanism like iSLIP
• iLQF, iLPF, iOCF, etc.
• These are “somewhat” practical – they can be executed in parallel in O(log N) or even O(1) like iSLIP, BUT the arbiters are much more complex to build
6CSIT560 By M. Hamdi
Differences between RRM, iSlip & FIRM
Pointer update rule            RRM                                     iSlip        FIRM
Input – no grant               unchanged                               unchanged    unchanged
Input – granted                one location beyond the accepted one    (same)       (same)
Output – no request            unchanged                               unchanged    unchanged
Output – grant accepted        one location beyond the granted one     (same)       (same)
Output – grant not accepted    one location beyond the previously      unchanged    the granted one
                               granted one
7CSIT560 By M. Hamdi
Algorithms for VOQ Switching
– Randomized algorithms
• They try, in a smart way, to approximate maximum weight matching while avoiding an iterative process
• They are stable under any admissible traffic
• Their time complexity is small (depending on the algorithm)
• Their hardware complexity is as yet untested
8CSIT560 By M. Hamdi
Can we avoid having schedulers altogether?
9CSIT560 By M. Hamdi
Remember: Two Successive Scaling Problems

OQ routers: + work-conserving (QoS); – memory bandwidth = (N+1)R
IQ routers: + memory bandwidth = 2R; – arbitration complexity (bipartite matching)
10CSIT560 By M. Hamdi
IQ Arbitration Complexity

Today: 64 ports at 10 Gbps, 64-byte cells.
• Arbitration time = 64 bytes / 10 Gbps = 51.2 ns
• Request/grant communication BW = 17.5 Gbps

Scaling to 160 Gbps:
• Arbitration time = 3.2 ns
• Request/grant communication BW = 280 Gbps

Two main alternatives for scaling:
1. Increase cell size
2. Eliminate arbitration
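As a sanity check on the arbitration-time arithmetic, the cell time is simply the cell size divided by the line rate. A small illustrative sketch (`cell_time_ns` is not a name from the slides):

```python
def cell_time_ns(cell_bytes, line_rate_gbps):
    """Time to transmit one cell at the line rate, in nanoseconds.
    This is the time budget the arbiter has to compute each matching:
    bits / (Gbit/s) comes out directly in ns."""
    return cell_bytes * 8 / line_rate_gbps

# 64-byte cells at 10 Gbps leave 51.2 ns per arbitration decision;
# at 160 Gbps the budget shrinks to 3.2 ns.
```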
11CSIT560 By M. Hamdi
Desirable Characteristics for Router Architecture
Ideal: OQ
• 100% throughput
• Minimum delay
• Maintains packet order

Necessary: the fabric must be able to regularly connect any input to any output.

What if the world were perfect? Assume Bernoulli i.i.d. uniform arrival traffic…
12CSIT560 By M. Hamdi
Round-Robin Scheduling
• Uniform & non-bursty traffic => 100% throughput
• Problem: real traffic is non-uniform & bursty
13CSIT560 By M. Hamdi
Two-Stage Switch (I)
[Figure: external inputs 1…N enter a first round-robin stage, whose internal inputs feed a second round-robin stage leading to external outputs 1…N.]
14CSIT560 By M. Hamdi
Two-Stage Switch (I)
[Figure: the same two-stage structure; the first round-robin stage performs load balancing across the internal inputs.]
15CSIT560 By M. Hamdi
Two-Stage Switch Characteristics

• 100% throughput
• Problem: unbounded mis-sequencing

[Figure: external inputs 1…N pass through two cyclic-shift stages to external outputs; cells 1 and 2 of the same flow can take different internal paths and arrive out of order.]
16CSIT560 By M. Hamdi
Two-Stage Switch (II)
[Figure: each external input i has a flow splitter and load balancer feeding VOQs; the first-stage round-robin connects them via internal outputs/inputs j to a second set of VOQs, and the second-stage round-robin delivers each flow F_ik to its external output k. New requirement: N^3 VOQs instead of N^2.]
17CSIT560 By M. Hamdi
Expanding VOQ Structure
Solution: expand VOQ structure by distinguishing among switch inputs
18CSIT560 By M. Hamdi
What is being done in practice (Cisco, for example)

• They want schedulers that achieve 100% throughput and very low delay (like MWM)
• They want it to be as simple as iSLIP in terms of hardware implementation
• Is there any solution to this?
19CSIT560 By M. Hamdi
Typical Performance of ISLIP-like Algorithms
PIM with 4 iterations
20CSIT560 By M. Hamdi
What is being done in practice (Cisco, for example)

Company    Switching capacity        Switch architecture    Fabric overspeed
Agere      40 Gbit/s – 2.5 Tbit/s    Arbitrated crossbar    2x
AMCC       20–160 Gbit/s             Shared memory          1.0x
AMCC       40 Gbit/s – 1.2 Tbit/s    Arbitrated crossbar    1–2x
Broadcom   40–640 Gbit/s             Buffered crossbar      1–4x
Cisco      40–320 Gbit/s             Arbitrated crossbar    2x
21CSIT560 By M. Hamdi
Can we make these scheduling algorithms simpler? Using a Simpler Architecture
22CSIT560 By M. Hamdi
Buffered Crossbar Switches
• A buffered crossbar switch is a switch with a buffered fabric (memory inside the crossbar).
• A pure buffered crossbar architecture has buffering only inside the fabric and none anywhere else.
• Due to the HoL blocking problem, VOQs are used on the input side.
23CSIT560 By M. Hamdi
Buffered Crossbar Architecture
[Figure: N input cards, each with VOQs 1…N and an input arbiter, connect through the buffered crossbar (one small internal buffer per crosspoint) to output cards 1…N, each with its own output arbiter; cell data flows forward while flow-control signals return to the inputs.]
24CSIT560 By M. Hamdi
Scheduling Process

Scheduling is divided into three steps:
– Input scheduling: each input selects, in a certain way, one cell from the HoL of an eligible queue and sends it to the corresponding internal buffer.
– Output scheduling: each output selects, in a certain way, one of the internally buffered cells in the crossbar to be delivered to its output port.
– Delivery notifying: for each delivered cell, the corresponding input is informed of the internal buffer status.
25CSIT560 By M. Hamdi
Advantages
• Total independence between input and output arbiters (distributed design; 1/N complexity compared to centralized schedulers)
• Switch performance is much better (because there is much less output contention) – a combination of IQ and OQ switching
• Disadvantage: the crossbar is more complicated
26CSIT560 By M. Hamdi
I/O Contention Resolution

[Figure: a 4×4 switch example with inputs 1–4 holding cells 1–4 and resolving contention for the outputs.]
27CSIT560 By M. Hamdi
I/O Contention Resolution (continued)

[Figure: the same 4×4 example, one scheduling step later.]
28CSIT560 By M. Hamdi
The Round-Robin Algorithm: InRr-OutRr

• Input scheduling: InRr (round-robin)
– Each input selects the next eligible VOQ, based on its highest-priority pointer, and sends its HoL packet to the internal buffer.
• Output scheduling: OutRr (round-robin)
– Each output selects the next non-empty internal buffer, based on its highest-priority pointer, and sends its cell to the output link.
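The two InRr-OutRr steps can be sketched in Python. This is an illustrative model (not from the slides), assuming one-cell internal buffers per crosspoint; all names are hypothetical.

```python
def input_schedule(voqs, xbuf, in_ptr):
    """InRr: each input i picks, round-robin from in_ptr[i], the next VOQ
    that is non-empty AND whose crosspoint buffer xbuf[i][j] is free,
    and moves one cell into that buffer."""
    n = len(voqs)
    for i in range(n):
        for k in range(n):
            j = (in_ptr[i] + k) % n
            if voqs[i][j] > 0 and xbuf[i][j] == 0:
                voqs[i][j] -= 1
                xbuf[i][j] = 1
                in_ptr[i] = (j + 1) % n   # advance one beyond the served VOQ
                break

def output_schedule(xbuf, out_ptr, delivered):
    """OutRr: each output j picks, round-robin from out_ptr[j], the next
    non-empty crosspoint buffer in its column and delivers that cell."""
    n = len(xbuf)
    for j in range(n):
        for k in range(n):
            i = (out_ptr[j] + k) % n
            if xbuf[i][j] == 1:
                xbuf[i][j] = 0            # delivery frees the buffer (the
                delivered[j] += 1         # flow-control notification step)
                out_ptr[j] = (i + 1) % n
                break
```

One time slot of the switch is one call to `input_schedule` followed by one call to `output_schedule`; the two loops are fully independent per port, which is the distributed-design advantage noted earlier.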
29CSIT560 By M. Hamdi
Input Scheduling (InRr)

[Figure: 4×4 example – each input scans its VOQs round-robin and forwards one HoL cell to the corresponding internal buffer.]
30 CSIT560 By M. Hamdi

Output Scheduling (OutRr)

[Figure: each output scans its column of internal buffers round-robin and delivers one cell to its output link.]
31 CSIT560 By M. Hamdi

Output Pointer Update + Notification Delivery

[Figure: after delivery, each output advances its pointer one position beyond the served buffer, and a notification is sent back to the corresponding input.]
32CSIT560 By M. Hamdi
Performance study
Stability performance is studied via a quadratic function of the queue lengths, L(n) = VOQ_{1,1}(n)^2 + … + VOQ_{1,N}(n)^2 + … + VOQ_{N,N}(n)^2 (summed over all N^2 VOQs).

Delay/throughput are evaluated under Bernoulli uniform and bursty uniform arrivals.
33CSIT560 By M. Hamdi
Bernoulli Uniform Arrivals

[Plot: average delay vs. normalized load (0.3–1.0) for a 32×32 switch under Bernoulli uniform traffic; curves for OQ, RR-RR, 1-SLIP, and 4-SLIP.]
34CSIT560 By M. Hamdi
Bursty Uniform Arrivals

[Plot: average delay vs. normalized load (0.3–1.0) for a 32×32 switch under bursty uniform traffic; curves for OQ, RR-RR, 1-SLIP, and 4-SLIP.]
35CSIT560 By M. Hamdi
Scheduling Process
Because the arbitration is simple:
– We can afford to use algorithms based on weights (e.g., LQF, OCF).
– We can afford algorithms that provide QoS.
36CSIT560 By M. Hamdi
Buffered Crossbar Solution: Scheduler
• The algorithm MVF-RR is composed of two parts:
– Input scheduler – MVF (most vacancies first)
Each input selects the column of internal buffers (destined to the same output) with the most vacancies (non-full buffers).
– Output scheduler – round-robin
Each output chooses the internal buffer that appears next on its static round-robin schedule starting from the highest-priority one, and updates the pointer to one location beyond the chosen one.
37CSIT560 By M. Hamdi
Buffered Crossbar Solution: Scheduler
• The algorithm ECF-RR is composed of two parts:
– Input scheduler – ECF (empty column first)
Each input selects the first empty column of internal buffers (destined to the same output). If there is no empty column, it selects on a round-robin basis.
– Output scheduler – round-robin
Each output chooses the internal buffer that appears next on its static round-robin schedule starting from the highest-priority one, and updates the pointer to one location beyond the chosen one.
38CSIT560 By M. Hamdi
Buffered Crossbar Solution: Scheduler
• The algorithm RR-REMOVE is composed of two parts:
– Input scheduler – round-robin (with remove-request signal sending)
Each input chooses the non-empty VOQ that appears next on its static round-robin schedule starting from the highest-priority one, and updates the pointer to one location beyond the chosen one. It then sends at most one remove-request signal to the outputs.
– Output scheduler – REMOVE
For each output: if it receives any remove-request signals, it chooses one of them based on its highest-priority pointer and removes the cell; if no signal is received, it performs simple round-robin arbitration.
39CSIT560 By M. Hamdi
Buffered Crossbar Solution: Scheduler
• The algorithm ECF-REMOVE is composed of two parts:
– Input scheduler – ECF (with remove-request signal sending)
Each input selects the first empty column of internal buffers (destined to the same output). If there is no empty column, it selects on a round-robin basis. It then sends at most one remove-request signal to the outputs.
– Output scheduler – REMOVE
For each output: if it receives any remove-request signals, it chooses one of them based on its highest-priority pointer and removes the cell; if no signal is received, it performs simple round-robin arbitration.
40CSIT560 By M. Hamdi
Hardware Implementation of ECF-RR: An Input Scheduling Block
[Figure: an input scheduling block built from round-robin arbiters and selectors 0…N−1, driven by the highest-priority pointer; occupancy signals of the VOQs and internal buffers produce the grants and the final arbitration result.]
41CSIT560 By M. Hamdi
Performance Evaluation: Simulation Study – Uniform Traffic

[Plot: relative average delay (1.0–1.2) vs. normalized load (0.1–1.0) for a 32×32 switch under uniform traffic; curves for RR-RR (IBM), MVF-RR, ECF-RR, RR-REMOVE, ECF-REMOVE, and Output queueing.]
42CSIT560 By M. Hamdi
Performance Evaluation: Simulation Study
ECF-REMOVE over RR-RR:

Load                               0.5   0.6   0.7   0.8   0.9   0.95  0.99
Improvement percentage             1%    1%    3%    6%    13%   17%   12%
Normalized improvement percentage  1%    1%    3%    6%    12%   15%   11%
Improvement factor                 1.01  1.01  1.03  1.06  1.13  1.17  1.12
43CSIT560 By M. Hamdi
Performance Evaluation: Simulation Study – Bursty Traffic

[Plot: average delay vs. normalized load (0.1–1.0) for a 32×32 switch under uniform bursty traffic (average burst size 16); curves for RR-RR (IBM), MVF-RR, ECF-RR, RR-REMOVE, ECF-REMOVE, and Output queueing.]
44CSIT560 By M. Hamdi
Performance Evaluation: Simulation Study
ECF-REMOVE over RR-RR:

Load                               0.5   0.6   0.7   0.8   0.9   0.95  0.99
Improvement percentage             10%   13%   16%   20%   22%   18%   11%
Normalized improvement percentage  9%    12%   14%   16%   18%   16%   10%
Improvement factor                 1.10  1.13  1.16  1.20  1.22  1.18  1.11
45CSIT560 By M. Hamdi
Performance Evaluation: Simulation Study – Hotspot Traffic

[Plot: average delay vs. normalized load (0.05–0.55) for a 32×32 switch under hotspot traffic; curves for RR-RR (IBM), MVF-RR, ECF-RR, RR-REMOVE, ECF-REMOVE, and Output queueing.]
46CSIT560 By M. Hamdi
Performance Evaluation: Simulation Study
ECF-REMOVE over RR-RR:

Load                               0.31   0.36   0.41   0.45   0.49   0.51
Improvement percentage             0.2%   0.3%   0.5%   0.8%   1%     0.7%
Normalized improvement percentage  0.2%   0.3%   0.5%   0.8%   1%     0.7%
Improvement factor                 1.002  1.003  1.005  1.008  1.01   1.007
47CSIT560 By M. Hamdi
Quality of Service Mechanisms for Switches/Routers and the Internet
48CSIT560 By M. Hamdi
VOQ Algorithms and Delay

• But, delay is key
– Because users don’t care about throughput alone
– They care (more) about delays
– Delay = QoS (= $ for the network operator)
• Why is delay difficult to approach theoretically?
– Mainly because it is a statistical quantity
– It depends on the traffic statistics at the inputs
– It depends on the particular scheduling algorithm used
• The last point makes it difficult to analyze delays in input-queued switches. For example, in VOQ switches it is almost impossible to give any guarantees on delay.
49CSIT560 By M. Hamdi
VOQ Algorithms and Delay

• This does not mean that we cannot have an algorithm that can do that; it means none exists at this moment.
• For this exact reason, almost all quality-of-service schemes (whether for delay or bandwidth guarantees) assume that you have an output-queued switch.

[Figure: a router with four links, each with an ingress and an egress side.]
50CSIT560 By M. Hamdi
QoS Router
[Figure: per-link classifiers and policers feed per-flow queues with queue management; a scheduler and shaper serve the queues onto each egress link.]
51CSIT560 By M. Hamdi
VOQ Algorithms and Delay
• WHY: Because an OQ switch has no “fabric” scheduling/arbitration algorithm – delay simply depends on the traffic statistics. Researchers have shown that you can provide many QoS algorithms (like WFQ) using a single server, based on the traffic statistics.
• But OQ switches are extremely expensive to build
– The memory bandwidth requirement is very high
– Hence these QoS scheduling algorithms have little practical significance for scalable, high-performance switches/routers.
52CSIT560 By M. Hamdi
Output Queueing: The “Ideal”

[Figure: cells labelled 1 and 2 arriving at an output-queued switch are placed directly in their output queues and depart in order.]
53CSIT560 By M. Hamdi
How to get good delay cheaply?
• Enter speedup…
– The fabric speedup for an IQ switch equals 1 (memory bandwidth = 2R)
– The fabric speedup for an OQ switch equals N (memory bandwidth = (N+1)R)
– Suppose we consider switches with fabric speedup S, 1 < S << N
– Such a switch will require buffers both at the input and the output – call these combined input- and output-queued (CIOQ) switches
• Such switches could help if…
– With very small values of S
– We get the performance – both delay and throughput – of an OQ switch
54CSIT560 By M. Hamdi
A CIOQ switch
• Consists of
– An (internally non-blocking, e.g., crossbar) fabric with speedup S > 1
– Input and output buffers
– A scheduler to determine matchings
55CSIT560 By M. Hamdi
A CIOQ switch
• For concreteness, suppose S = 2. The operation of the switch consists of:
– Transferring no more than 2 cells from (to) each input (output) per time slot
– Logically, we think of each time slot as consisting of two phases
– Arrivals to (departures from) the switch occur at most once per time slot
– The transfer of cells from inputs to outputs can occur in each phase
56CSIT560 By M. Hamdi
Using Speedup
[Figure: with speedup 2, cells 1 and 2 are both transferred across the fabric within the two phases of a single time slot.]
57CSIT560 By M. Hamdi
Performance of CIOQ switches
• Now that we have a higher speedup, do we get a handle on delay?
– Can we say something about delay (e.g., every packet from a given flow should be delivered within 15 msec)?
– There is one way of doing this: competitive analysis – the idea is to compete with the performance of an OQ switch
58CSIT560 By M. Hamdi
Intuition

Speedup = 1: fabric throughput = 0.58; average input queue = too large
Speedup = 2: fabric throughput = 1.16; average input queue = 6.25
59CSIT560 By M. Hamdi
Intuition (continued)

Speedup = 3: fabric throughput = 1.74; average input queue = 1.35
Speedup = 4: fabric throughput = 2.32; average input queue = 0.75
60CSIT560 By M. Hamdi
Performance of CIOQ switches
• The setup
– Under arbitrary, but identical, inputs (packet by packet)
– Is it possible to replace an OQ switch by a CIOQ switch and schedule the CIOQ switch so that the outputs are identical, packet by packet – i.e., to exactly mimic an OQ switch?
– If yes, what is the scheduling algorithm?
61CSIT560 By M. Hamdi
What is exact mimicking?
Apply the same inputs to an OQ and a CIOQ switch, packet by packet; obtain the same outputs, packet by packet.
62CSIT560 By M. Hamdi
Consequences
• Suppose, for now, that a CIOQ switch is competitive with respect to an OQ switch. Then
– We get perfect emulation of an OQ switch
– This means we inherit all its throughput and delay properties
– Most importantly, all QoS scheduling algorithms originally designed for OQ switches can be used directly on a CIOQ switch
– But at the cost of introducing a scheduling algorithm – which is the key issue
63CSIT560 By M. Hamdi
Emulating OQ Switches with CIOQ
• Consider an N x N switch with (integer) speedup S > 1
– We’re going to see if this switch can emulate an OQ switch
• We’ll apply the same inputs, cell by cell, to both switches
– We’ll assume that the OQ switch sends out packets in FIFO order
– And we’ll see if the CIOQ switch can match cells on the output side
64CSIT560 By M. Hamdi
Key concept: Urgency
Urgency of a cell at any time = its departure time − the current time
• It indicates the time at which the cell will depart the OQ switch
• This value is decremented after each time slot
• When the value reaches 0, the cell must depart (it is at the HoL of its output queue)
65CSIT560 By M. Hamdi
Key concept: Urgency
• Algorithm: Most Urgent Cell First (MUCF). In each “phase”:
1. Outputs try to get their most urgent cells from inputs.
2. Inputs grant to the output whose cell is most urgent; in case of ties, output i takes priority over output i + k.
3. Losing outputs try to obtain their next most urgent cell from another (unmatched) input.
4. When no more matchings are possible, cells are transferred.
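The four steps above can be sketched in Python. This is an illustrative model (not from the lecture), under the simplifying assumption that each input holds at most one cell per output with a known urgency (smaller = more urgent); `mucf_phase` and `urgency` are hypothetical names.

```python
def mucf_phase(urgency):
    """One MUCF matching phase.

    urgency[i][j] is the urgency of input i's cell for output j (smaller =
    more urgent), or None if there is no such cell.
    Returns a dict mapping each matched output to its input.
    """
    n = len(urgency)
    matched_in, matched_out = set(), {}
    progress = True
    while progress:
        progress = False
        # Step 1: each still-unmatched output requests its most urgent cell
        # among the still-unmatched inputs.
        proposals = {}  # input -> list of (urgency, output)
        for j in range(n):
            if j in matched_out:
                continue
            cand = [(urgency[i][j], i) for i in range(n)
                    if i not in matched_in and urgency[i][j] is not None]
            if cand:
                u, i = min(cand)
                proposals.setdefault(i, []).append((u, j))
        # Step 2: each input grants the output whose cell is most urgent;
        # ties are broken in favor of the lowest output index.
        for i, reqs in proposals.items():
            u, j = min(reqs)
            matched_in.add(i)
            matched_out[j] = i
            progress = True
        # Step 3: losing outputs simply retry on the next loop iteration.
    return matched_out  # Step 4: transfer the cells of this matching
```

On the slide's example (outputs 1 and 2 tied at input 1; output 2's next most urgent cell at input 2 with urgency 3), the phase matches output 1 with input 1 and output 2 with input 2.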
66CSIT560 By M. Hamdi
Key concept: Urgency - Example
• At the beginning of phase 1, outputs 1 and 2 both request input 1 to obtain their most urgent cells
• Since there is a tie, input 1 grants to output 1 (the lowest port number)
• Output 2 proceeds to get its next most urgent cell (from input 2, with urgency 3)
67CSIT560 By M. Hamdi
Implementing MUCF
• The way in which MUCF matches inputs to outputs is similar to the “stable marriage problem” (SMP)
• The SMP finds “stable” matchings in bipartite graphs
– There are N women and N men
– Each woman (man) ranks each man (woman) in order of preference for marriage
68CSIT560 By M. Hamdi
An example
• Consider the example we have already seen
• Executing the Gale-Shapley algorithm (GSA)…
– With men proposing we get the matching (1, 1), (2, 4), (3, 2), (4, 3) – this takes 7 proposals (iterations)
– With women proposing we get the matching (1, 1), (2, 3), (3, 2), (4, 4) – this also takes 7 proposals (iterations)
– Both matchings are stable
– The first is man-optimal: men get the best partners of any stable matching
– Likewise, the second is woman-optimal
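The men-proposing Gale-Shapley algorithm used above can be sketched as follows (an illustrative Python version with 0-indexed preference lists; the slide's 4-port example preferences are not reproduced here):

```python
def gale_shapley(men_pref, women_pref):
    """Men-proposing Gale-Shapley stable matching; returns {man: woman}.
    men_pref[m] / women_pref[w] list partners in decreasing preference."""
    n = len(men_pref)
    # rank[w][m] = position of man m in woman w's list (lower = preferred)
    rank = [{m: r for r, m in enumerate(women_pref[w])} for w in range(n)]
    next_prop = [0] * n      # index of the next woman each man proposes to
    fiance = [None] * n      # fiance[w] = woman w's current partner
    free = list(range(n))    # currently unengaged men
    while free:
        m = free.pop()
        w = men_pref[m][next_prop[m]]
        next_prop[m] += 1
        if fiance[w] is None:
            fiance[w] = m                      # w accepts her first proposal
        elif rank[w][m] < rank[w][fiance[w]]:
            free.append(fiance[w])             # w trades up; old partner freed
            fiance[w] = m
        else:
            free.append(m)                     # w rejects m; he stays free
    return {m: w for w, m in enumerate(fiance)}
```

The proposer side gets its optimal stable partners, which is exactly the man-optimal/woman-optimal asymmetry noted on the slide.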
69CSIT560 By M. Hamdi
Theorem
• A CIOQ switch with a speedup of 4, operating under the MUCF algorithm, exactly mimics a FIFO output-queued switch.
• Similar results hold even for non-FIFO OQ scheduling schemes (e.g., WFQ, strict priority)
• We can achieve similar results with S = 2
70CSIT560 By M. Hamdi
Implementation - a closer look
Difficulty of implementation:
– Estimating urgency (and communicating this information among inputs and outputs)
– The matching process – too many iterations?

Estimating urgency depends on what is being emulated:
– FIFO, strict priorities – no problem
– WFQ, etc. – problems
71CSIT560 By M. Hamdi
QoS Scheduling Algorithms
72CSIT560 By M. Hamdi
Principles for QoS Guarantees
• Consider a phone application at 1 Mbps and an FTP application sharing a 1.5 Mbps link.
– Bursts of FTP traffic can congest the router and cause audio packets to be dropped.
– We want to give priority to audio over FTP.
• PRINCIPLE 1: Packet marking is needed for the router to distinguish between different classes, and a new router policy is needed to treat packets accordingly.
73CSIT560 By M. Hamdi
Principles for QoS Guarantees (more)
• Applications misbehave (e.g., audio sends packets at a rate higher than the 1 Mbps assumed above)
• PRINCIPLE 2: provide protection (isolation) for one class from other classes (fairness)
74CSIT560 By M. Hamdi
QoS Differentiation: Two options
• Stateful (per flow)
– IETF Integrated Services (IntServ) / RSVP
• Stateless (per class)
– IETF Differentiated Services (DiffServ)
75CSIT560 By M. Hamdi
The Building Blocks: May contain more functions
• Classifier
• Shaper
• Policer
• Scheduler
• Dropper
76CSIT560 By M. Hamdi
QoS Mechanisms
• Admission Control
– Determines whether a flow can/should be allowed to enter the network.
• Packet Classification
– Classifies the data, based on admission control, for the desired treatment through the network.
• Traffic Policing
– Measures the traffic to determine whether it is out of profile. Packets determined to be out of profile can be dropped or marked differently (so they may be dropped later if needed).
• Traffic Shaping
– Provides some buffering, thereby delaying some of the data, to make sure the traffic fits the profile (may affect only bursts, or all traffic, to make it similar to constant bit rate).
• Queue Management
– Determines the behavior of data within a queue; parameters include queue depth and drop policy.
• Queue Scheduling
– Determines how the different queues empty onto the outbound link.
77CSIT560 By M. Hamdi
QoS Router
[Figure: the same QoS router block diagram as before – classifiers, policers, per-flow queues with queue management, schedulers, and shapers.]
78CSIT560 By M. Hamdi
Queue Scheduling Algorithms
CSIT560 By M. Hamdi
Scheduling at the output link of an OQ Switch
• Sharing always results in contention
• A scheduling discipline resolves contention:
– It decides when and which packet to send on the output link
– It is usually implemented at the output interface
– Scheduling is key to fairly sharing resources and providing performance guarantees

[Figure: a router with four links, each with an ingress and an egress side.]
80CSIT560 By M. Hamdi
Output Scheduling

[Figure: a scheduler serving multiple queues onto the output link.]

• Allocating output bandwidth
• Controlling packet delay
81CSIT560 By M. Hamdi
Types of Queue Scheduling
• Strict Priority
– Empties the highest-priority non-empty queue first, before servicing lower-priority queues. It can cause starvation of lower-priority queues.
• Round Robin
– Services each queue by emptying a certain amount of data and then going to the next queue in order.
• Weighted Fair Queuing (WFQ)
– Empties an amount of data from a queue based on the queue’s relative weight (driven by reserved bandwidth) before servicing the next queue.
• Earliest Deadline First
– Determines the latest time a packet must leave to meet its delay requirement, and services the queues in that order.
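The strict-priority rule above amounts to scanning the queues in priority order; a minimal illustrative sketch (names are hypothetical):

```python
def strict_priority_pick(queues):
    """Return the index of the highest-priority (lowest-index) non-empty
    queue, or None if all queues are empty. Lower-priority queues are
    served only when every higher-priority queue is empty - which is
    exactly why they can starve under sustained high-priority load."""
    for level, q in enumerate(queues):
        if q:
            return level
    return None
```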
82CSIT560 By M. Hamdi
Scheduling: Deterministic Priority
• A packet is served from a given priority level only if no packets exist at higher levels (multilevel priority with exhaustive service)
• The highest level gets the lowest delay
• Watch out for starvation!
• Usually, priority levels are mapped to delay classes

[Figure: low-bandwidth urgent messages, realtime, and non-realtime queues feeding a priority scheduler.]
83CSIT560 By M. Hamdi
Scheduling: No Classification
FIFO: first come, first served
• This is the simplest possible discipline, but we cannot provide any guarantees.
• With FIFO queues, if the depth of the queue is not bounded, there is very little that can be done.
• We can perform preferential dropping.
• We can use other service disciplines on a single queue (e.g., EDF).
84CSIT560 By M. Hamdi
Scheduling: Class Based Queuing
• At each output port, packets of the same class are queued in distinct queues.
• The service discipline within each queue can vary (e.g., FIFO, EDF); usually it is FIFO.
• The service discipline between classes can vary as well (e.g., strict priority, some kind of sharing).

[Figure: classes 1–4 queued separately and served by a class-based scheduler.]
85CSIT560 By M. Hamdi
Per-Flow Packet Scheduling
• Each flow is allocated a separate “virtual queue”
– Lowest level of aggregation
– The service discipline between the flows can vary (FIFO, SP, etc.)

[Figure: a classifier steers flows 1…n into per-flow queues under buffer management, served by a scheduler.]
86CSIT560 By M. Hamdi
Per-flow classification
87CSIT560 By M. Hamdi
Per-flow buffer management
88CSIT560 By M. Hamdi
Per-flow scheduling
89CSIT560 By M. Hamdi
The problems caused by FIFO queues in routers
1. In order to maximize its chances of success, a source has an incentive to maximize the rate at which it transmits.
2. (Related to #1) When many flows pass through it, a FIFO queue is “unfair” – it favors the most greedy flow.
3. It is hard to control the delay of packets through a network of FIFO queues.

(Problems 1–2 concern fairness; problem 3 concerns delay guarantees.)
90CSIT560 By M. Hamdi
Round Robin (RR)
• RR avoids starvation
• Here, all sessions have the same weight and the same packet length:

[Figure: sessions A, B, and C served one packet each per round (rounds #1, #2, …).]
91CSIT560 By M. Hamdi
RR with variable packet length
[Figure: with variable packet lengths, sessions A, B, and C no longer receive equal service per round – even though the weights are equal!]
92CSIT560 By M. Hamdi
Solution…
[Figure: sessions A, B, and C served over shorter rounds #1–#4, evening out the service.]
93CSIT560 By M. Hamdi
Weighted Round Robin (WRR)
W_A = 3, W_B = 1, W_C = 4

[Figure: in each round of length 8, A sends 3 packets, B sends 1, and C sends 4.]
94CSIT560 By M. Hamdi
WRR – non Integer weights
W_A = 1.4, W_B = 0.2, W_C = 0.8

Normalize to integer weights: W_A = 7, W_B = 1, W_C = 4 (round length = 12)
95CSIT560 By M. Hamdi
Weighted round robin
• Serve a packet from each non-empty queue in turn
– Provides protection against starvation
– Easy to implement in hardware
• Unfair if packets are of different lengths or weights are not equal
• What is the solution?
– Different weights, fixed packet size: serve more than one packet per visit, after normalizing to obtain integer weights
96CSIT560 By M. Hamdi
Problems with Weighted Round Robin
• Different weights, variable-size packets
– Normalize the weights by the mean packet size
• e.g., weights {0.5, 0.75, 1.0}, mean packet sizes {50, 500, 1500}
• normalized weights: {0.5/50, 0.75/500, 1.0/1500} = {0.01, 0.0015, 0.000666}; normalizing again to integers gives {60, 9, 4}
• With variable-size packets, we need to know the mean packet size in advance
• Fairness is only provided at time scales larger than one round of the schedule
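One WRR round with integer (normalized) weights can be sketched as follows – an illustrative model, serving w packets per visit as described above (`wrr_order` is a hypothetical name):

```python
def wrr_order(queues, weights):
    """One WRR round: visit each queue in turn and serve up to
    weights[name] packets from it. queues maps name -> list of packets;
    returns the service order for this round."""
    order = []
    for name, w in weights.items():
        for _ in range(w):
            if queues[name]:                 # stop early if queue drains
                order.append(queues[name].pop(0))
    return order
```

With the slide's weights W_A = 3, W_B = 1, W_C = 4 and all queues backlogged, one round serves 8 packets, matching the stated round length.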
97CSIT560 By M. Hamdi
Fairness
[Figure: flows A and B, entering on 10 Mb/s and 100 Mb/s links, share a 1.1 Mb/s bottleneck at router R1 toward C, each receiving 0.55 Mb/s.]

What is the “fair” allocation: (0.55 Mb/s, 0.55 Mb/s) or (0.1 Mb/s, 1 Mb/s)?

(A “flow” here is, e.g., an HTTP flow identified by the tuple (IP SA, IP DA, TCP SP, TCP DP).)
98CSIT560 By M. Hamdi
Fairness
[Figure: the same topology with an additional 0.2 Mb/s flow C, sharing router R1 toward D.]

What is the “fair” allocation now?
99CSIT560 By M. Hamdi
Max-Min Fairness
The minimum over the flows should be as large as possible.

Max-min fairness for a single resource:
– Bottlenecked (unsatisfied) connections share the residual bandwidth equally
– Their share is >= the share held by the connections not constrained by this bottleneck

Example: C = 10; demands F1 = 25 and F2 = 6; allocations F1′ = 5 and F2′ = 5.
100CSIT560 By M. Hamdi
Max-Min Fairness
• An allocation is fair if it satisfies max-min fairness
– Each connection gets no more than what it wants
– The excess, if any, is shared equally

[Figure: half of each satisfied connection’s excess is transferred to the connections with unsatisfied demand.]
101CSIT560 By M. Hamdi
Max-Min Fairness: a common way to allocate flows

N flows share a link of rate C. Flow f wishes to send at rate W(f), and is allocated rate R(f).
1. Pick the flow, f, with the smallest requested rate.
2. If W(f) < C/N, then set R(f) = W(f).
3. If W(f) > C/N, then set R(f) = C/N.
4. Set N = N − 1 and C = C − R(f).
5. If N > 0, go to step 1.
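The five steps above translate directly into code. An illustrative sketch (`max_min_allocation` is a hypothetical name):

```python
def max_min_allocation(demands, capacity):
    """Max-min fair allocation following the algorithm above: repeatedly
    take the smallest remaining demand; satisfy it fully if it is under
    the current fair share C/N, otherwise cap it at C/N."""
    alloc = {}
    remaining = dict(demands)   # flow -> requested rate W(f)
    c, n = capacity, len(demands)
    while remaining:
        f = min(remaining, key=remaining.get)   # step 1: smallest request
        share = c / n
        alloc[f] = min(remaining[f], share)     # steps 2-3
        c -= alloc[f]                           # step 4
        n -= 1
        del remaining[f]                        # step 5: loop on the rest
    return alloc
```

On the example of the next slide (C = 1; demands 0.1, 0.5, 10, 5) this yields 0.1, 0.3, 0.3, 0.3, matching the four rounds shown there.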
102CSIT560 By M. Hamdi
Max-Min Fairness: an example

Four flows arrive at router R1 sharing an output link of capacity C = 1, with demands W(f1) = 0.1, W(f2) = 0.5, W(f3) = 10, and W(f4) = 5.

Round 1: set R(f1) = 0.1
Round 2: set R(f2) = 0.9/3 = 0.3
Round 3: set R(f4) = 0.6/2 = 0.3
Round 4: set R(f3) = 0.3/1 = 0.3
103CSIT560 By M. Hamdi
Max-Min Fairness
• How can an Internet router “allocate” different rates to different flows?
• First, let’s see how a router can allocate the “same” rate to different flows…
104CSIT560 By M. Hamdi
Fair Queueing
1. Packets belonging to a flow are placed in a FIFO – this is called “per-flow queueing”.
2. The FIFOs are scheduled one bit at a time, in round-robin fashion.
3. This is called bit-by-bit fair queueing.

[Figure: classification steers flows 1…N into per-flow FIFOs, which a bit-by-bit round-robin scheduler serves.]
105CSIT560 By M. Hamdi
Weighted Bit-by-Bit Fair Queueing
Likewise, flows can be allocated different rates by servicing a different number of bits for each flow during each round.

Example: at router R1 with capacity C, allocate R(f1) = 0.1, R(f2) = 0.3, R(f3) = 0.3, R(f4) = 0.3. Order of service for the four queues: … f1, f2, f2, f2, f3, f3, f3, f4, f4, f4, f1, …

This is also called “Generalized Processor Sharing (GPS)”.
106CSIT560 By M. Hamdi
Understanding bit-by-bit WFQ: 4 queues sharing 4 bits/sec of bandwidth, equal weights (1:1:1:1)

Packets (name = size in bits): A1 = 4, B1 = 3, C1 = 1, C2 = 1, D1 = 1, D2 = 2; A2 = 2 and C3 = 2 arrive during round 1.

[Figure sequence, serving one bit per queue per round:]
– Round 1: D1 and C1 depart at R = 1 (A2 and C3 arrive)
– Round 2: C2 departs at R = 2
107CSIT560 By M. Hamdi
Understanding bit-by-bit WFQ (continued): equal weights (1:1:1:1)

– Round 3: D2 and B1 depart at R = 3
– Round 4: A1 departs at R = 4
– C3 and A2 depart at R = 6

Departure order for packet-by-packet WFQ – sort by the finish round of the packets: C1, D1, C2, B1, D2, A1, C3, A2.
108CSIT560 By M. Hamdi
Understanding bit-by-bit WFQ: 4 queues sharing 4 bits/sec of bandwidth, weights 3:2:2:1

Same packets as before (A1 = 4, A2 = 2, B1 = 3, C1 = 1, C2 = 1, C3 = 2, D1 = 1, D2 = 2), with queue weights A:B:C:D = 3:2:2:1 – per round, A is served 3 bits, B and C 2 bits each, and D 1 bit.

[Figure sequence:]
– Round 1: D1, C2, and C1 depart at R = 1
109CSIT560 By M. Hamdi
Understanding bit-by-bit WFQ (continued): weights 3:2:2:1

– Round 2: B1, A1, A2, and C3 depart at R = 2
– Round 3: D2 departs at R = 3

Departure order for packet-by-packet WFQ – sort by the finish time of the packets: C1, C2, D1, A1, A2, B1, C3, D2.
110CSIT560 By M. Hamdi
Packetized Weighted Fair Queueing (WFQ)
Problem: we need to serve a whole packet at a time.

Solution:
1. Determine the time at which a packet p would complete if we served its flow bit by bit. Call this the packet’s finishing time, F_p.
2. Serve packets in order of increasing finishing time.

This is also called “Packetized Generalized Processor Sharing (PGPS)”.
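The finishing-time rule can be sketched in Python. This is an illustrative simplification (not from the slides) that ignores arrival times and assumes every packet is present at time 0, so a packet's finish round is just its queue's cumulative bits divided by the queue weight; even so, it yields the same departure order as the earlier equal-weight example.

```python
def wfq_order(queues, weights):
    """Packet-by-packet WFQ for packets all present at time 0.
    queues maps queue name -> list of (packet_name, length_bits);
    weights maps queue name -> bits served per round.
    Returns packet names in order of increasing finish round."""
    tagged = []
    for q, pkts in queues.items():
        finish = 0.0
        for name, length in pkts:
            # Each packet finishes L/w rounds after its predecessor
            # in the same queue (bit-by-bit service at weight w).
            finish += length / weights[q]
            tagged.append((finish, q, name))
    # Serve in order of finish round; ties broken deterministically.
    return [name for _, _, name in sorted(tagged)]
```

A full implementation would track a virtual-time clock so that late arrivals are stamped correctly; that bookkeeping is exactly what makes real WFQ costly, as the next slide discusses.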
111CSIT560 By M. Hamdi
WFQ is complex
• There may be hundreds to millions of flows; the linecard needs to manage a FIFO queue per flow.
• The finishing time must be calculated for each arriving packet.
• Packets must be sorted by their departure time.
• Most effort in QoS scheduling algorithms goes into practical algorithms that approximate WFQ!

[Figure: on the egress linecard, arriving packets are placed in queues 1…N; F_p is calculated for each, and the packet with the smallest F_p departs.]
112CSIT560 By M. Hamdi
When can we Guarantee Delays?
• Theorem
If flows are leaky bucket constrained and all nodes employ GPS (WFQ), then the network can guarantee worst-case delay bounds to sessions.
113CSIT560 By M. Hamdi
Deterministic analysis of a router queue: FIFO case

[Figure: model of a router queue with service rate R – cumulative arrivals A(t), departures D(t), backlog B(t), and the FIFO delay d(t) shown as the horizontal gap between A(t) and D(t).]
114CSIT560 By M. Hamdi
[Figure: classification steers arrivals A_1(t)…A_N(t) of flows 1…N into per-flow queues served by a WFQ scheduler at rates R(f_1)…R(f_N), producing departures D_1(t)…D_N(t).]

Key idea: in general, we don’t know the arrival process, so let’s constrain it.
115CSIT560 By M. Hamdi
Let’s say we can bound the arrival process

The number of bytes that can arrive in any period of length t is bounded by: σ + ρt

This is called “(σ, ρ) regulation”

[Figure: cumulative bytes A_1(t) bounded above by the line σ + ρt.]
116CSIT560 By M. Hamdi
The leaky bucket “(σ, ρ)” regulator

[Figure: tokens arrive at rate ρ into a token bucket of size σ; packets wait in a packet buffer, and each departing byte (or packet) consumes one token.]
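A (σ, ρ) conformance check can be sketched as follows – an illustrative slotted-time model (not from the slides), applying the bound σ + ρt over every window of slots:

```python
def leaky_bucket_conforming(arrivals, sigma, rho):
    """Check whether an arrival sequence is (sigma, rho)-constrained:
    over any window of slots, at most sigma + rho * (window length)
    bytes may arrive. arrivals[t] = bytes arriving in slot t."""
    n = len(arrivals)
    for start in range(n):
        total = 0
        for t in range(start, n):
            total += arrivals[t]
            if total > sigma + rho * (t - start + 1):
                return False    # burst exceeds bucket depth + token rate
    return True
```

An equivalent online form simulates the token bucket directly: add ρ tokens per slot (capped at σ) and reject any slot whose arrivals exceed the available tokens.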
117CSIT560 By M. Hamdi
(σ, ρ)-Constrained Arrivals and Minimum Service Rate

[Figure: cumulative bytes – arrivals A_1(t) and departures D_1(t) at service rate R(f_1); the maximum backlog is B_max and the maximum delay is d_max.]

For no packet loss: if the buffer B ≥ σ, then the delay satisfies d(t) ≤ σ / R(f_1).

Theorem [Parekh, Gallager ’93]: if flows are leaky-bucket constrained and routers use WFQ, then end-to-end delay guarantees are possible.