Chapter 6 Internet QoS - National Chiao Tung University



Modern Computer Networks: An Open Source Approach Chapter 6


Chapter 6 Internet QoS

Problem Statement

The Internet, an IP-based network, has existed for many years. To transmit data from one end to the other, TCP, the most popular transport protocol in the world, was developed to solve the problems of end-to-end congestion control and data reliability, as discussed in Chapter 4. However, more and more new services are provided on the Internet, such as the WWW, E-Commerce, and video conferencing, discussed in Chapter 5. The quality of service provided by a reliable end-to-end congestion control protocol over a network without resource management is not enough to support these applications. A path with specific qualities, such as sufficient bandwidth, low delay, and low jitter, is necessary for the new applications.

Supporting quality of service in a network is nothing new. The ATM network was capable of it 20 years ago. Why, then, do we need to construct a QoS network on the Internet? Low cost, simplicity, and popularity are the major reasons. Today, almost all popular network services are based on the TCP/IP architecture. To obtain QoS, it is impractical and ineffective either to change all hardware and software at the hosts or to interoperate with another network architecture through specific translators.

So, how about an IP-based QoS network? An architecture that is based on the IP network and able to provide quality of service may be a good solution currently. However, there is no QoS in the current Internet. Most routers support only the best-effort service. What is the best-effort service? The simple answer is that routers do their best without caring about what the applications actually get. More specifically, on the one hand arriving packets are inserted into the queue until the queue overflows, and on the other hand the server continuously sends out packets from the queue at the maximal rate until the queue is empty. When the server load is light, the performance of the best-effort service may be enough for most users. Unfortunately, the load always turns heavy because of people's greed.
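The best-effort behavior just described can be sketched as a drop-tail FIFO. The following is a minimal illustration (all names and the tiny capacity are our own, not taken from any real router): packets are accepted until the queue overflows, and the server drains the queue as fast as it can.

```c
/* Hypothetical drop-tail FIFO; names are illustrative only. */
#define FIFO_CAPACITY 4

typedef struct {
    int pkt[FIFO_CAPACITY];   /* packet lengths stand in for packets */
    int head, count;
} fifo_t;

/* Enqueue: accept until the queue overflows, then drop (return 0).
 * Best effort makes no promise about which packets survive. */
static int fifo_enqueue(fifo_t *q, int len) {
    if (q->count == FIFO_CAPACITY)
        return 0;                               /* tail drop */
    q->pkt[(q->head + q->count) % FIFO_CAPACITY] = len;
    q->count++;
    return 1;
}

/* Dequeue: the server drains the queue at full rate until empty (-1). */
static int fifo_dequeue(fifo_t *q) {
    int len;
    if (q->count == 0)
        return -1;
    len = q->pkt[q->head];
    q->head = (q->head + 1) % FIFO_CAPACITY;
    q->count--;
    return len;
}
```

Note that nothing here distinguishes one application's packets from another's: once the queue is full, whoever arrives next is dropped, which is exactly why a heavily loaded best-effort router cannot maintain fairness.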

Thus, a QoS-aware IP network is necessary to maintain fairness. Following the intuition and precedent of the QoS design in the ATM network, people built a similar and ideal architecture, the Integrated Services architecture (IntServ). It is expected to provide service with accurate and guaranteed quality to the end users that need it. Just as in ATM, users need to build a path and reserve the resources before they get the quality of service. All routers on the reserved path not only need to know where packets should be sent, as they do now, but also need to know which flow the packets belong to and exactly when they should be sent out. In fact, IntServ is too difficult and complex to implement in a large environment.

However, many algorithms of the key components in IntServ are good and mature. What remains is to simplify the IntServ architecture by loosening some of the exact service boundaries while still satisfying the requirements of the most popular applications, such as the WWW and FTP. The Differentiated Services architecture (DiffServ) was proposed with this concept. Building a simple and coarse-grained network that provides only differentiated classes of service for users over a large-scale area is the basic objective of DiffServ. This architecture attempts to supply end-to-end QoS via different per-hop forwarding behaviors, and signing contracts is necessary before getting the service. From some aspects, DiffServ is indeed a simple model. In fact, however, the description in the DiffServ standard is a little rough, and many points still need to be made concrete. Even so, at this time it still has a high probability of being deployed on the Internet some day.

Although no large-scale QoS IP network exists now, some IP traffic control components have been provided in many operating systems, such as the traffic control (TC) modules of the Linux system. Of course, a DiffServ experiment environment consisting of these traffic control modules can also be built in this system. In this chapter, we introduce the two architectures mentioned above, IntServ and DiffServ. Along with the discussion in each section, the open source code of TC in Linux is given to help the reader understand more clearly how QoS is supplied in a router on the Internet.

6.1 Issues

To provide QoS in the network, many additional functions are necessary in a traditional router. In general, they can be cataloged into the six components shown in Figure 6.1. In this section, we introduce their concepts and capabilities respectively. From the discussion, you will also learn the difficult problems in designing them. A more detailed description of each component is given in the following sections of the chapter.

Signal Protocol

A signaling protocol, a common language used to talk with each router, is the first requirement in a QoS network, because quality of service is provided by the cooperation of all routers in the network. Several signaling protocols have been proposed for various purposes. Among them, the most famous example is the Resource ReserVation Protocol (RSVP), which is used by an application to reserve resources from the network. A further introduction to RSVP is given in Section 6.2.3. Besides, the Common Open Policy Service (COPS) protocol, currently being examined by the IETF, is another example. It is a simple query-response protocol used in the policy control system, which is one of the QoS management architectures.

QoS Routing

If routing is regarded as placing road signs and guide boards at the forks of a road, QoS routing can be imagined as an advanced roadway system that provides more detailed guidance, such as the expected arrival time at the destination or the congestion conditions of the roads. In the current IP network, routers provide only some basic information, such as selecting the path with the fewest hop counts to the destination address. This is insufficient if we want to further integrate interactive media transmission into the IP network. Besides the number of hops traversed, guarantees on bandwidth and delay are also required to provide users a specific level of QoS. In fact, it is difficult and complex to offer this capability in a large network because of the necessary cooperation among all routers, especially when the suitable path of an application must be decided dynamically. A more detailed introduction and discussion, including the categories and possible architectures of QoS routing, is given in Section 6.2.4.

Figure 6.1: The six basic components of a QoS-aware network element: the signaling protocol, QoS routing, and admission control in the control plane; classification, policing, and scheduling in the data plane.

Figure 6.2: A possible QoS routing architecture: a sender reaches a receiver through QoS-aware network elements A-D under bandwidth and packet loss constraints.

Admission Control

Even after the advanced roadway system is built, the road may still be congested; the only difference is that you can learn the arrival time. Thus, we need to further control the number and type of cars allowed on the road. The admission control component is responsible for this job. Depending on the type of network architecture, its controlled target differs: either a single link or an entire network area. In general, because the information and quality of the road are provided by QoS routing, the resources controlled by admission control do not go beyond that scope, such as bandwidth and delay. It seems a simple job, assuming that the current condition of the network is known. However, that thought is wrong. In fact, the admission decision is not applied to a single packet. In other words, an admission may represent an agreement covering all well-behaved packets belonging to one traffic source, where "well-behaved" may imply some statistical characteristics. Figure 6.3 shows a simple example. A request with a 3 MB/s bandwidth constraint arrives at router A. Router A then decides whether to accept the request based on the variable bandwidth usage. The difficulty is how to estimate the resource usage correctly so as to reserve enough resources for the successful transmission of each admitted traffic source while keeping the total network resources at high utilization.

Figure 6.3: A simple operating example of the admission control component: a request for a 3 MB/s path arrives at router A, whose 10 MB/s link carries a varying current bandwidth usage over time.
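The admission decision in this example can be sketched as a simple parameter-based test: accept a request only if the admitted reservations plus the new request fit within the link capacity. This is only an illustration with invented names; real admission control may instead rely on measured usage, which is exactly the estimation difficulty mentioned above.

```c
/* Hypothetical parameter-based admission control; names are illustrative. */
typedef struct {
    double capacity_mbps;   /* maximum supported bandwidth of the link */
    double reserved_mbps;   /* sum of bandwidths already admitted */
} link_state_t;

/* Admit a new flow only if the leftover capacity covers its request.
 * Measurement-based admission control would replace reserved_mbps with
 * an estimate of actual usage to reach higher utilization. */
static int admit(link_state_t *link, double request_mbps) {
    if (link->reserved_mbps + request_mbps > link->capacity_mbps)
        return 0;                       /* reject: would overbook the link */
    link->reserved_mbps += request_mbps;
    return 1;                           /* accept and record the reservation */
}
```

With a 10 MB/s link, a 3 MB/s request is accepted; later requests are accepted only while the sum of reservations stays within capacity. The trade-off stated in the text is visible here: parameter-based admission never overbooks, but it also never exploits the gap between what flows reserve and what they actually send.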

Packet Classification

After the road is selected or reserved, which is supported by the above three components, the cars start moving. The road system needs a component to identify the differences among cars in order to bill or manage the traffic. For example, we may need to find some very old cars and forbid them from entering the highway system, or we may need to know the owner of a car to charge the toll. As you can imagine, there exist many different rules for classifying packets, and a packet may have to pass several rule verifications before it is classified into a particular class. Although the job is heavy, speed is the first necessity. Thus, how to classify packets quickly against many different rules becomes the major issue of this component.

In IntServ, this component is applied to identify which traffic source a packet belongs to according to the values of five fixed fields in the packet. A further description is given in Section 6.2.4. In DiffServ, it plays a more important and complex role, where it is expected to provide a multi-field range-matching capability. By reading Section 6.3.5, you can understand its difficulty and possible resolutions.
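As a concrete illustration of the IntServ case, the following sketch matches a packet's five fixed fields against a table of reserved flows. The names and the linear scan are our own simplification; fast classifiers use hashing or more elaborate data structures, which is precisely the issue raised above.

```c
#include <stdint.h>

/* Hypothetical flow table keyed by the IntServ five-tuple; illustrative only. */
typedef struct {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
} five_tuple_t;

#define MAX_FLOWS 8

typedef struct {
    five_tuple_t key[MAX_FLOWS];   /* one entry per reserved flow */
    int          n;                /* number of reserved flows */
} flow_table_t;

static int tuple_eq(const five_tuple_t *a, const five_tuple_t *b) {
    return a->src_ip == b->src_ip && a->dst_ip == b->dst_ip &&
           a->src_port == b->src_port && a->dst_port == b->dst_port &&
           a->proto == b->proto;
}

/* Return the index of the packet's reserved flow, or -1 for best effort. */
static int flow_classify(const flow_table_t *t, const five_tuple_t *pkt) {
    for (int i = 0; i < t->n; i++)   /* linear scan: O(n) per packet */
        if (tuple_eq(&t->key[i], pkt))
            return i;
    return -1;
}
```

A packet that differs in even one field (here, the destination port) falls through to best effort, which is why exact five-tuple matching is simple; DiffServ's multi-field *range* matching, by contrast, cannot be answered by a single equality test.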

Policing

There always exist cars that exceed the speed limit on the road, which endangers other cars. The same happens in the network, so we need a policing component to monitor the network traffic. If the arrival rate of some traffic source exceeds its allocated rate, the policing component needs to mark, drop, or delay the excess packets. However, in most cases the policing threshold is not an exact value, just like the estimated resource usage in the admission control component; some level of variation is tolerated. The most popular scheme, the leaky bucket, also called the token bucket, is an example. It limits the mean rate of the policed traffic while permitting it to send at a burst rate for a period of a particular length. Besides policing, the token bucket is in fact also used to describe the traffic source model, since it can also be regarded as a traffic regulator. Because it is so widely applied, we will introduce it in Section 6.2.2, Traffic Description Model.

Scheduling

Scheduling is the most important and classical component of a QoS network. Its general goal is to enforce resource sharing among different users according to some predefined rules or ratios. Various algorithms have been proposed to reach specific purposes. Some methods are very simple and primitive, like FIFO, and some are complex and ingenious enough to provide a good guarantee of fair resource sharing. Regardless of its type, as shown in Figure 6.4, a scheduler basically should offer two default functions, enqueue and dequeue, which are used to handle a newly arriving packet and to choose the next packet going to the output driver. Looking inside the black box, we can further divide its job into buffer management within one queue and resource sharing among multiple queues. Because all of these processes concern queue management, scheduling is also called queuing discipline.
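The enqueue/dequeue pair can be captured as a small interface. The sketch below uses our own naming, loosely inspired by the idea behind Linux queuing disciplines but not the kernel's actual API: the black box is a pair of function pointers, and a trivial one-packet discipline exercises them.

```c
#include <stddef.h>

/* Illustrative queuing-discipline interface (invented names). */
typedef struct qdisc {
    int  (*enqueue)(struct qdisc *q, int pkt_len); /* 1 = queued, 0 = dropped */
    int  (*dequeue)(struct qdisc *q);              /* packet length, -1 = empty */
    void *priv;                                    /* per-discipline state */
} qdisc_t;

/* A trivial one-packet discipline, only to exercise the interface. */
static int slot = -1;                              /* -1 means "empty" */

static int slot_enqueue(struct qdisc *q, int pkt_len) {
    (void)q;
    if (slot != -1)
        return 0;                                  /* buffer full: tail drop */
    slot = pkt_len;
    return 1;
}

static int slot_dequeue(struct qdisc *q) {
    (void)q;
    int p = slot;
    slot = -1;
    return p;
}
```

The point of the indirection is that a FIFO, a fair queuing discipline, or a buffer-management scheme can all sit behind the same two entry points, which is how the "black box" view in the text becomes pluggable in practice.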

As mentioned above, there are many designs inside the black box for different purposes. In IntServ, the architecture with an exquisite selector and multiple queues, called the fair queuing discipline, is in common use, because it attempts to provide each user good isolation from the others and an exact bandwidth guarantee. We will discuss this class of scheduling in Section 6.2.6. In DiffServ, multiple styles of scheduling algorithms are required for different functions. The single-queue style of scheduling is introduced in Section 6.3.6. Such scheduling, also called buffer management, intends to supply several differentiated degrees of service with simple architectures. Although the quality they provide is limited, they are in general easier to implement in routers than fair queuing disciplines.
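To make the multiple-queue style concrete, here is a toy weighted round-robin selector. It is an invented, simplified example, not one of the fair queuing disciplines discussed later: each class receives service slots in proportion to its weight.

```c
/* Toy weighted round-robin over per-class queues (invented names). */
#define NQ 2

typedef struct {
    int backlog[NQ];   /* packets waiting in each class queue */
    int weight[NQ];    /* service share of each class per round */
    int credit[NQ];    /* remaining quantum in the current round */
    int cur;           /* class currently being served */
} wrr_t;

/* Pick the next class to serve; returns the class index or -1 if all empty. */
static int wrr_next(wrr_t *w) {
    for (int scanned = 0; scanned < 2 * NQ; scanned++) {
        int c = w->cur;
        if (w->backlog[c] > 0 && w->credit[c] > 0) {
            w->backlog[c]--;           /* serve one packet from class c */
            w->credit[c]--;
            return c;
        }
        /* class exhausted or empty: move on and refill the next quantum */
        w->cur = (c + 1) % NQ;
        w->credit[w->cur] = w->weight[w->cur];
    }
    return -1;
}
```

With weights 2:1 and both classes backlogged, the service order repeats 0, 0, 1, so class 0 receives two-thirds of the link. Real fair queuing disciplines refine this idea to account for variable packet lengths and to bound each flow's worst-case delay.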

Open Source Implementation 6.1: Traffic Control Elements in Linux

Figure 6.4: The concept and possible architectures of scheduling: a black box with enqueue and dequeue functions between arriving and departing packets, realized either as a single FIFO queue or as multiple FIFO queues feeding a selector.


Linux kernels provide a wide variety of traffic control functions. Users can employ these functions to construct an IntServ-aware router, a DiffServ-aware router, or any other QoS-aware router. The relationship between TC and the other router functions is given in Figure 6.5. As shown in the figure, TC attempts to replace the original role of output queuing with a series of control elements. These elements, coded in Linux, consist of the following three major conceptual elements:

filters

queuing disciplines

classes

The filters play the role of classifiers, classifying packets based on particular rules or fields in the packet header. Their source files are named beginning with “cls_” and are located in the directory /usr/src/linux/net/sched. For example, the element implemented in the file “cls_rsvp.c” provides the flow identification capability required in an IntServ router.

The queuing disciplines support two basic functions, enqueue and dequeue. The enqueue function decides whether to drop or queue a packet, while the dequeue function determines the transmission order of packets or simply delays sending some packets. A simple queuing discipline may be a FIFO, which queues arriving packets until the queue is full and continuously sends out the packet at the head of the queue. However, some queuing disciplines are more complex, such as the CBQ element implemented in “sch_cbq.c”, where packets are further classified into several classes. In that case, a queuing discipline may include several classes, and each class owns its own queuing discipline. The source files of queuing disciplines and classes are located in the same directory as the filters, but their filenames begin with “sch_”.


The bottom of Figure 6.5 shows a possible combination of the control elements mentioned above. As shown in the figure, the combinations are varied and free: a queuing discipline may consist of multiple classes, and multiple filters may guide packets into the same class. Users can design the structure according to their needs from user space with configuration scripts. In the following open source implementation blocks, we will introduce in detail several TC elements related to the text discussed in this chapter. A more detailed description of traffic control in Linux can be found in [WA99].

6.2 Integrated Service

Before beginning to discuss IntServ, the definition and concept of a flow must be described first. A flow is the basic manageable unit in IntServ, and flow isolation is an important capability of IntServ. In other words, each flow in IntServ should own resources that are allocated based on its own particular requirements and independent of other flows.

In general, the creation and resource reservation of a flow is requested by the application and negotiated with all network elements on the path. In the following, we introduce the internal operating processes of the IntServ router in Section 6.2.1. In Section 6.2.2, we look at the services provided in IntServ. From Section 6.2.3 on, we give a detailed introduction to each key component. The last subsection is a short summary of IntServ.

Figure 6.5: A simple combination of TC elements in Linux: in the forwarding path (input device, IP forwarding, upper-layer processing, output device), Traffic Control replaces the plain output queuing used without QoS; filters, optionally with policing, guide packets into classes and their queuing disciplines.

6.2.1 Basic Concept

This subsection gives an introduction to the general operating processes of IntServ. We first talk about the reservation request from the viewpoint of an application and then describe how a router in IntServ handles the request.

The Trip of a Resource Reservation Request

A QoS request is issued by an application. To get resources in the IntServ domain, the application first needs to decide the service type and the values of the quality parameters according to its traffic type and requirements. Then the request is sent to the nearest IntServ router, as described in Figure 6.6. The router decides whether to accept the requirement based on its status and, if so, forwards the request to the next router. Assuming the request is accepted by all routers on the path, which means the resource reservation process is finished, the application can then begin to send data packets with a guaranteed quality of service. Generally speaking, the common resource reservation protocol in IntServ is RSVP. A more detailed description of the handling of resource reservation requests in IntServ is given in Section 6.2.3.

The Request Response in IntServ Routers

Once it receives a resource reservation request, the IntServ router passes it to the signal processing component, which corresponds to the signaling protocol component shown in Figure 6.1. According to the result negotiated with the admission control component, this component updates the signaling packet and forwards it to the next router. In other words, the signal processing component is only a “transcriber”, and the decision is controlled by the admission control component. The admission control component attempts to dynamically manage and allocate the resources of the output link for the requesting application. Looking further inside admission control, we can divide its functions into two parts. One is to gather and maintain the current usage of the output link, and the other is to decide whether the resources are enough to satisfy the requirements of the new request. Besides these two components, there is another one in the control plane of the IntServ router not mentioned here, namely QoS routing. In fact, because IntServ intends to employ admission control to manage the link bandwidth allocation, the QoS routing component is not emphasized here. However, it can still be used to create virtual static paths with QoS guarantees. The existence of such paths helps reduce the complexity and time of resource reservation.

Figure 6.6: The operating process from the viewpoint of an application: a reservation request travels from the application across the QoS-aware routers of the IntServ domain toward the server; each router either accepts and forwards the request or rejects it.

The Request Enforcement in IntServ Routers

Once the path has been created successfully, data packets begin to be transmitted on it. The routers on the path should guarantee that the treatment received by the application conforms to its previous requirements, or that its packets are transmitted as if on a lightly loaded path. Three basic components in the data plane of the router enforce these promises, as shown in Figure 6.7. The entrance for data packets into the router is the flow identification component. It attempts to identify whether a packet belongs to one of the reserved flows according to five fixed fields of the packet. Packets belonging to a particular reserved flow are inserted into the corresponding flow queue. Basically, in the IntServ architecture, each reserved flow has its own individual queue. Packets not belonging to any reserved flow are classified as best-effort traffic and, in most cases, inserted into a single FIFO queue. In a better situation, these packets are placed into some improved version of a FIFO queue. In any case, it is necessary to reserve some portion of the resources for the best-effort traffic in order to avoid starving it.


After a packet enters its individual flow queue, the next component discussed is policing. It monitors the incoming rate of each traffic source to determine whether it conforms to the behavior declared previously. Out-of-profile packets may be dropped or delayed until they conform to the agreed behavior. Sometimes the policing component is omitted; its existence depends on the need for absolute rate control. Take an example to explain this demand further: for schedulers that innately share the residual bandwidth among other flows, the component is necessary if there is an upper bound on the bandwidth obtained by one flow. Next, the scheduler selects the packet to deliver among the policed head packets of the flow queues according to their bandwidth reservation requirements. The common goal of the scheduler in IntServ is to do its best to reduce the worst-case latency among all packets and the difference in treatment among all reserved flows. The role of the scheduler is important in IntServ, because it is the key to providing the guaranteed service, with its critical end-to-end delay bound, and the characteristic of flow isolation. A more detailed explanation of this capability is given in Section 6.2.5. The packets selected by the scheduler are sent to the output device. In most cases, the output device does not queue the packets anymore, because the scheduled rate must be smaller than or equal to the physical output rate.

6.2.2 Service Type

In the current Internet, most routers provide only the best-effort service, which has no quality control. Obviously, it is difficult to satisfy the requirements of smooth media transmission, so two additional service types are defined in the IntServ specification: guaranteed service and controlled-load service. In this section we introduce these two service types. Before discussing them, however, let us look at how source traffic is described in IntServ, because an application has the responsibility to describe its traffic behavior first in order to help routers allocate and guarantee resources.

Figure 6.7: The internal operating procedures of the data plane in an IntServ router: flow identification by source IP, destination IP, source port, destination port, and protocol ID; per-flow queues Fq1 to Fqn plus a best-effort queue; policing; and the scheduler.

Traffic Description Model

In IntServ, traffic is usually described by a leaky bucket model. The leaky bucket traffic model is an important concept and is applied in many different mechanisms. Located at the traffic source, it can be a rate regulator that controls the packet sending rate. At a router, it can act as a policer that monitors the incoming traffic behavior of flows and shapes it to the description in the TSpec. In the following we introduce the leaky bucket model in its role as a source traffic regulator.

As shown in Figure 6.8, there is a leaky bucket that can accumulate water up to a volume limit B. Above it, a water stream fills the bucket at a fixed rate r. You can imagine that a packet is a person and that water is what a person needs: the basic principle is that enough water is necessary to permit a packet to pass through, and the amount of water required is equal to the length of the packet. If the water is insufficient, the packet is blocked in the queue, and newly arriving packets may even be dropped if the flow queue is full. On the other hand, if there are no packets in the queue, water accumulates in the bucket, but no more than the bucket volume B. As new packets arrive, the water is consumed until the bucket is empty. According to this principle, the amount of traffic permitted by the leaky bucket in any interval of length t is bounded by the linear function A(t) = r * t + B.
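The bound A(t) = r * t + B can be checked with a small simulation of the bucket. The sketch below is our own illustration of the principle, with invented names and one token per byte: tokens accumulate at rate r up to depth B, and a packet passes only if enough tokens are available.

```c
/* Illustrative token-bucket regulator; names and units are our choice. */
typedef struct {
    double r;       /* fill rate: tokens per time unit */
    double b;       /* bucket depth B: maximum accumulated tokens */
    double tokens;  /* current content of the bucket */
    double last;    /* time of the previous update */
} tbucket_t;

/* Ask whether a packet of pkt_len bytes may pass at time now. */
static int tb_conform(tbucket_t *tb, double now, double pkt_len) {
    tb->tokens += (now - tb->last) * tb->r;   /* water accumulates at rate r */
    if (tb->tokens > tb->b)
        tb->tokens = tb->b;                   /* the bucket holds at most B */
    tb->last = now;
    if (tb->tokens < pkt_len)
        return 0;                             /* insufficient: block or drop */
    tb->tokens -= pkt_len;                    /* consume and pass */
    return 1;
}
```

Because the bucket never holds more than B tokens and refills at rate r, the total bytes admitted in any window of length t can never exceed r * t + B, which is exactly the linear bound above.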

Why is the traffic in IntServ described with the leaky bucket model? The major reason is that it is not an exact traffic model; it only bounds the traffic rate within a region, which makes it more resilient and suitable for application to many kinds of traffic sources.

Open Source Implementation 6.2: Token Bucket Filter in Linux

The token bucket algorithm is widely used for policing or shaping traffic. You can find its traces in the code of police.c and sch_tbf.c. Below we examine the code in sch_tbf.c to introduce the implementation of the token bucket in Linux. The basic parameters and variables used by the token bucket are defined as:

struct tbf_sched_data
{
/* Parameters */
	u32		limit;		/* Maximal length of backlog: bytes */
	u32		buffer;		/* Token bucket depth/rate: MUST BE >= MTU/B */
	u32		mtu;
	u32		max_size;
	struct qdisc_rate_table	*R_tab;
	struct qdisc_rate_table	*P_tab;
/* Variables */
	long		tokens;		/* Current number of B tokens */
	long		ptokens;	/* Current number of P tokens */
	psched_time_t	t_c;		/* Time check-point */
	struct timer_list	wd_timer;	/* Watchdog timer */
};

Figure 6.9: The flowchart of the function enqueue() in sch_tbf.c (enqueue a packet from an sk_buff: if skb->len > q->max_size, kfree_skb(skb); otherwise skb_queue_tail()).


Relating this to the introduction in this subsection, the token bucket (R, B, b) is mapped to the tbf_sched_data fields (R_tab, buffer, tokens). In the structure tbf_sched_data, the basic accounting unit of the token bucket is time: given transmission rate R, the buffer parameter represents the maximum bound on transmission time, and tokens represents the transmission time currently admitted.

As described before, a packet is admitted to be sent out at the peak rate when the token amount in the bucket is at least the packet size. In the original design, the peak rate is the line rate, i.e., the maximum sending rate of a device; however, a user may sometimes set a peak rate of his own. Thus TC uses a second token bucket (P_tab, mtu, ptokens) to support this case. The two-bucket architecture ensures that traffic passing through conforms to mean rate R with a maximum burst time of buffer, under peak rate P.

The flowchart of the TBF function enqueue() is shown in Figure 6.9. It is easy to see that a packet is admitted into the queue only if its size is not larger than the maximum size max_size. The function dequeue() is shown in Figure 6.10.

Figure 6.10: The flowchart of the function dequeue() of source code sch_tbf.c (get one packet from the skb queue; calculate the interval between the last query and now; if a peak rate is set, calculate the current transmit time ptoks at the peak rate; calculate the current transmit time toks at the average rate; if both ptoks and toks are >= 0, admit the packet for transmission and update tbf_sched_data; otherwise reinsert the packet at the head of the skb queue).


Guaranteed Service

Guaranteed service provides applications with delivery along a path with guaranteed available bandwidth and a guaranteed end-to-end worst-case delay bound. What is the end-to-end worst-case delay bound guarantee? It means that for all packets transmitted on the path, the total transmission time is smaller than the required bound, provided the sender sends packets within the reserved constraint. This characteristic is very important for interactive, real-time media transmission, because it is the basis for guaranteeing low delay jitter, which directly affects media playback at the receiver.

Based on the description in its RFC, a guaranteed service is invoked by a sender specifying the flow's traffic parameters and the receiver subsequently requesting a desired service level. The former, named the Traffic Specification (TSpec), includes information about the traffic that will be injected into the network, and the latter, called the Reservation Specification (RSpec), describes the resource requirement of the receiver. The TSpec is composed of five traffic description parameters that originate from the leaky bucket traffic model. The RSpec consists of two parameters, a data rate R and a slack term S. In most cases, in order to get an error-free service, the requirement described in the RSpec is larger than that in the TSpec. You can find a more detailed introduction to guaranteed service in RFC 2212.

Controlled-Load Service

Compared to the clear definition of guaranteed service, the definition of controlled-load service is blurred. Its definition in IntServ is to provide a path on which transmission behaves as if it passed through a low-utilization link. What is a low-utilization link? The RFC does not spell this out very clearly.

Table 6.1: The service types provided in the IntServ

  Service Type     RFC       Provided QoS                              Parameters
  Best Effort      None      None                                      None
  Controlled Load  RFC 2211  Emulate a lightly loaded network for AP   TSpec
  Guaranteed       RFC 2212  Guaranteed BW, E2E* delay bound           RSpec > TSpec

  *) "E2E" implies End-to-End

The first explanation offered in the RFC is

1. A very high percentage of transmitted packets will be successfully

delivered by the network to the receivers.

2. The transit queuing delay experienced by a very high percentage of

delivered packets will not greatly exceed the minimum delay.

The second explanation is

1. Little or no average packet queuing delay over all time scales significantly

larger than the burst time.

2. Little or no congestion losses over all time scales significantly larger than

the burst time.

You could say that controlled-load service is merely a better-than-best-effort service. Although it does not seem as good as guaranteed service, it is suitable for some non-critical services because of its lower cost. In contrast to the low utilization of guaranteed service, which comes from offering absolute guarantees, controlled-load service takes a resource-sharing approach so as to increase resource utilization and reduce the cost for each user.

Since it only emulates a lightly loaded link, no further specific guarantees on bandwidth or delay bound are provided. For an application invoking this service, a TSpec describing its traffic behavior is the only required specification; an RSpec is no longer needed. According to the TSpec, the router decides whether the resources it owns are enough to make the packets of the flow feel they are passing through a low-loss, low-congestion link. More details can be found in RFC 2211.

6.2.3 Resource Reservation Protocol (RSVP)

As mentioned in Section 6.2.1, an application in IntServ needs to invoke a reservation process to build a path before it begins to send data packets. Thus, a common resource reservation protocol is necessary between applications and routers. RSVP was developed by the IETF for this purpose. However, the design of RSVP is general and not limited to a specific QoS architecture. In other words, RSVP does not directly define the internal parameters related to traffic control; all detailed parameter formats, such as the resource reservation information, are packaged in objects that are opaque to RSVP. RSVP simply plays the role of a signaling protocol and is responsible for delivering these messages.

In keeping with this general-purpose principle, RSVP also avoids the difficult routing problem: it only queries the routing table and obtains the next hop to which to forward a control message. Since the path-building messages are routed by the routing table, it is not RSVP's business to decide whether the resources on the path are enough to satisfy the application's requirement; in fact, that is handled by the admission control or QoS routing component.

According to the description in the RFC xx, the resource reservation request for creating a path, called RESV, is receiver-oriented. In other words, the receiver gathers the TSpec from the sender's PATH message, sets up its own RSpec, and then sends it in the RESV message to reserve the resource. Because a reservation is simplex, interactive applications such as video conferencing or VoIP invoke two path reservations, one from each end. After receiving the RESV message, the sender can begin to transmit packets along the reserved path. A simple flow of the messages is given in Figure 6.11.

If you are familiar with ATM networks, you may find the style of resource reservation in RSVP very similar to SVC setup. Indeed it is, but in order to adapt to the Internet's dynamic network topology, RSVP takes a soft-state approach to maintaining the reservation status in the routers along the path. The reservation in each router has a timeout and is automatically deleted once the timer expires. The advantage is that when the path is changed by the lower-level routing component, the old reservation is cancelled naturally. However, periodic refresh messages are necessary, which brings the network some additional burden.

6.2.4 QoS Routing

Figure 6.11: Traffic flow of the RSVP messages (the sender's RSVP PATH messages travel downstream through the RSVP-aware routers of the IntServ domain to the receiver; the receiver's RSVP RESV messages travel back upstream; data-plane packets then follow the reserved path).

For each resource reservation request, the IntServ network needs to decide which path satisfies the application's requirement. Current IP routing protocols use a simple metric such as hop count or delay to calculate the shortest path for a connection. That information is not enough to handle complex QoS requirements; for example, the path with the fewest hops is not necessarily a path with enough bandwidth. Thus, IntServ needs a routing protocol with QoS awareness to obtain information about network resource usage and supply a suitable path for the application's request.

We first look at the contents of a QoS routing problem. A basic routing problem consists of a request target and a request behavior. The targets mostly concern delay, cost, and bandwidth. A bandwidth requirement must be guaranteed by each link on the path, so it is a link routing problem; delay or cost accumulates over all links on the path, so it is a path routing problem. On the other hand, requests on the same target may behave differently: some requirements are greedy and some are critical. The former gives an optimization routing problem and the latter a constrained routing problem. Table 6.2 summarizes the basic routing problems, all of which have polynomial complexity. Sometimes, however, a QoS routing request combines multiple requirements, e.g., a path on which every link has at least 1 Mb/s of bandwidth and the end-to-end delay is smaller than 20 ms. The complexity of such composite routing problems is higher; some are even NP-complete, as listed in Table 6.3.

Table 6.2: The basic routing problems (all of polynomial complexity)

  Basic Routing Problem      Applies to    Example
  Link Optimization (LO)     Bandwidth     BW-optimization routing
  Link Constrained (LC)      Bandwidth     BW-constrained routing
  Path Optimization (PO)     Delay, Cost   Least-cost routing
  Path Constrained (PC)      Delay, Cost   Delay-constrained routing

In order to find paths that satisfy such complex requirements, many routing architectures have been designed, but there is as yet no standard specifying how to do QoS routing. According to where state information is kept, the architectures can be classified into three classes, local, global, and hierarchical, as shown in Table 6.5. In the local architecture, each router maintains its own up-to-date local state and routing is done on a hop-by-hop basis. This architecture is the extreme of distributed decision making, and its major issue is how to decide the path quickly. The second class is the centralized example: every node maintains global state by exchanging the local state of every node, and a feasible path is computed locally at the source node, where the path decision is easier than in the first architecture. However, an additional message-exchange protocol is required, and it is difficult and non-scalable for each node to keep up-to-date state information about all nodes. Thus the third, mixed architecture is proposed: each node keeps detailed state information about the nodes in its own group and aggregated state information about other groups. It is scalable, but there is a significant negative impact on the quality of QoS routing.

Table 6.5: The major issues of QoS routing

  Class         Maintenance of state information             Routing strategy
  Local         Local state: each node maintains its         Distributed routing: routing is done
                up-to-date local state                       on a hop-by-hop basis
  Global        Global state: every node maintains global    Source routing: a feasible path is
                state by exchanging the local state of       locally computed at the source node
                every node
  Hierarchical  Aggregated global state: detailed state      Hierarchical routing: scalability, but
                info about the nodes in the same group,      a significant negative impact on
                aggregated state info about other groups     QoS routing

Table 6.3: The composite routing problems

  Composite Routing  Complexity  Example
  LC + PC            P           BW-delay-constrained routing
  PC + LO            P           Delay-constrained BW-optimization routing
  PC + PC            NP          Delay-delay-jitter-constrained routing
  PC + PO            NP          Delay-constrained least-cost routing

QoS routing is a difficult problem and has not yet been implemented in any product. A large volume of dynamic QoS routing requests coming directly from applications is not easy to handle, so currently QoS routing algorithms are used to allocate virtual links with guaranteed resources between routers, while the admission control component manages those resources and dynamically decides whether to allocate them to a requesting application.

6.2.5 Admission Control

After an application sends its QoS request to a router, the admission control component needs to decide whether it can accept the request. The admission is based on the current resource usage of the output link and the requirement of the application. How to get information about link resource usage efficiently is the first major issue; deciding whether the residual resource is enough to satisfy a user's request without sacrificing resource utilization is the other important issue.

Usually, the approaches are cataloged into two classes: one is statistics based and the other is measurement based.

Statistical Based

In the early years, quality of service was mostly required for multimedia data transmission, and the behavior of such traffic sources is specific; in other words, they are easy to describe with a mathematical model. Thus, in the statistics-based approach, the requirement consists of several parameters, and the router simply evaluates a specific traffic function with the values from the request to decide whether to accept it. However, it is not always true that traffic can be modeled so easily.

Besides, the real difficulty for this class of algorithm is how to define the traffic amount function, trading off between utilization and loss probability. For example, we could describe traffic by two parameters, peak rate and average rate, and assume an on-off traffic model that either transmits at its peak rate or is idle. The traffic amount function could simply be the sum of the peak rates of all connections: if the result remains under the maximum constraint after adding the peak rate of the new request, the request is accepted; otherwise it is rejected. This algorithm provides a 100% guaranteed allocation for applications, but it causes low resource utilization. What if we instead use the sum of average rates as the traffic amount function? Then a user can expect about a 50% probability of meeting congestion and suffering delay or drops.

Measured Based

Since it is hard to find a suitable traffic amount function, another approach is to measure the current resource usage directly. To obtain a representative measured value and avoid sudden bursts, one common method computes the new usage estimate by exponential averaging over consecutive measurements:

    Estimation_new = (1 - w) * Estimation_old + w * Measured_new

The weight w lets the admission control component decide how much the past status affects the estimate. A larger w makes history be forgotten more easily, which means the algorithm is more aggressive and keeps the link at high utilization; however, users may then fail to get the treatment they required. For example, admission control may accept a request when the current measured estimate happens to be just below some constraint, yet the estimate may return to its original high value a moment later; the acceptance would then degrade the treatment received by all previously reserved flows.

The other measurement approach is the time window: the estimate is taken over several consecutive measurement intervals,

    EstimatedRate = max(C1, C2, C3, ..., Cn)

where Ci is the average rate measured in one sample period. As n increases, the algorithm becomes more conservative.

Open Source Implementation 6.3: Traffic Estimator

TC provides a simple traffic estimation module for estimating the current sending rate in bits and packets per second. You can find the module in the file estimator.c, which contains three functions. The function qdisc_new_estimator() handles the creation of a new estimator, and qdisc_kill_estimator() deletes an estimator that is no longer in use. est_timer() is invoked by the system whenever the configured timer fires, where the time interval can be (1 << interval) seconds for interval > 0. In est_timer(), the sending rate is calculated and the EWMA is obtained by the code shown in Figure 6.12.

6.2.6 Flow Identification

In IntServ, each flow has its own resource reservation requirement, so flow identification is necessary: every packet must be examined to decide which flow, if any, it belongs to. A resource reservation table containing the mapping between flow numbers and flow identifiers is necessary to help identify packets. The identifier of a flow in IntServ is composed of five fields of the TCP/IP headers: the source IP address and port, the destination IP address and port, and the protocol ID. The identifier is 104 bits long, much larger than a 32-bit word, so an effective data structure is needed to store the table and process the identification. On the other hand, because identification is performed for every packet, identification speed is also an important issue.

Identification is a classical data search problem. There are many kinds of data structures for storing the table, but all of them trade off speed against space. A simple structure is the binary tree: its storage requirement is small, but identifying a packet requires multiple memory accesses. The other extreme is direct memory mapping, which fails the space requirement. To balance the space and speed requirements, a hash structure is a common and popular choice. However, if we study the hash structure further, we find many sources of uncertainty that affect the performance of flow identification in a hash table. Thus, in fact, the better solution is an advanced tree structure like xxxyyy. A more advanced introduction to, and example of, such structures will be given in Section 6.3.x, because flow identification in IntServ is a small subset of packet classification in DiffServ.

    nbytes = st->bytes;
    npackets = st->packets;
    rate = (nbytes - e->last_bytes)<<(7 - idx);
    e->last_bytes = nbytes;
    e->avbps += ((long)rate - (long)e->avbps) >> e->ewma_log;
    st->bps = (e->avbps+0xF)>>5;
    rate = (npackets - e->last_packets)<<(12 - idx);
    e->last_packets = npackets;
    e->avpps += ((long)rate - (long)e->avpps) >> e->ewma_log;
    e->stats->pps = (e->avpps+0x1FF)>>10;

Figure 6.12: A portion of code in function est_timer() of estimator.c

Open Source Implementation 6.2: Flow Identification

According to the definition of IntServ, a flow is identified by five fields. In TC of Linux, flow identification is implemented with the two-level hash structure shown in Figure 6.13. The first-level hash is keyed on the destination address, protocol ID, and tunnel ID, and its result locates the RSVP session a packet belongs to (in RSVP, a session is identified by the destination address, port, and protocol ID). Then the second-level hash, owned by the RSVP session and keyed on the source address and port, identifies the flow the packet belongs to. The major function supporting flow identification is rsvp_classify(), whose flowchart is shown in the left part of Figure 6.14. The flowchart of the function rsvp_change(), which lets the user add a new flow identification filter or modify an existing one, is shown in the right part of Figure 6.14.

Figure 6.13: The double-level hash structure in the source code CLS_RSVP (an arriving packet is hashed by hash_dst() into rsvp_head, the first-level hash of 256 (dst, protocol id, tunnel id) buckets of rsvp_session lists; each rsvp_session owns a second-level hash keyed by hash_src(), with 16 (src, src port) lists plus 1 wildcard-src list of rsvp_filter entries).


6.2.7 Packet Scheduling

Many scheduling algorithms have been proposed in different domains. In the IntServ architecture, every reserved flow has its own flow queue, and all packets belonging to the flow are inserted into that queue, so a scheduler should at least give all flows their expected treatment. Besides, a worst-case delay bound is important for some critical traffic. Thus, the schedulers discussed in this subsection are constrained to the fair-queuing style, with the additional feature of sharing the residual bandwidth among the flows that require bandwidth, in proportion to their allocated ratios. According to their design concepts, we can catalog the schedulers into two classes: one is round-robin based and the other is sorted based.

Round Robin Based

The algorithms in this class are heuristic. Below we take the Weighted Round Robin (WRR) scheduler, the most popular one, to introduce the class. In WRR, each active flow can send out a particular number of packets in one round, where the number of packets a flow sends corresponds to the value of its weight. The method is simple, but it performs well only in environments where all packets have a fixed length. An improved version, Deficit Round Robin (DRR), was proposed to solve this problem; compared with WRR, it is better adapted to current network environments, such as the Internet, where packet sizes vary.

Figure 6.14: The flowcharts of two functions in the source code CLS_RSVP (left, rsvp_classify(): hash_dst(), then a sequential search in the rsvp_session list; on a hit, hash_src(), then a sequential search in the rsvp_filter list, ending in match or no match. Right, rsvp_change(): if the rsvp_session does not exist, create and insert an rsvp_session and then insert an rsvp_filter; if it exists and an rsvp_filter is already assigned, adjust the classid, otherwise insert an rsvp_filter).


The implementation of a round-robin based scheduler is simple, but this class of scheduler can hardly support fine-grained bandwidth guarantees. When the number of flows is large, a flow may wait a long time for its turn to send packets. If the flow's source traffic arrives at a constant bit rate, the long waiting time may impose a large delay jitter on the flow, meaning some packets are sent out quickly while others are not. In other words, traditional round-robin based algorithms can support fairness only over the time interval of one round.

Sorted Based

The design concept of sorted-based schedulers is very different from that of round-robin based schedulers. Before describing it, we first introduce a conceptual scheduler that applies only to a fluid-model network architecture. Assume three flows fairly share a 3 Mb/s link. In the fluid-model architecture, the scheduler is expected to divide the link completely into three virtual links: each flow can send packets at 1 Mb/s continuously in its own virtual link without any delay caused by other flows. Besides, when one flow has no packets to send, the residual bandwidth is shared fairly among the other flows; in other words, the other two flows then get 1.5 Mb/s each. This ideal scheduler is called generalized processor sharing (GPS). But it cannot be implemented in the current network architecture, which in principle transmits one packet at a time and is therefore called the packetized-model architecture. Figure 6.15 shows the difference between the fluid model and the packetized model.

The optimal scheduler does not exist, but the order in which packets would finish transmission in the fluid model can be computed. Thus it is commonly accepted that a packetized-model scheduler is good if it selects packets to send out such that their finishing order is similar to what the fluid-model scheduler would produce. The idea is good, but for such a scheduler the nightmare is how to obtain that transmission order simply and quickly. Many variant algorithms have been proposed to solve the problem; they trade off exactness of bandwidth sharing against implementation complexity. Below we use one sorted-based scheduler, packetized GPS (PGPS), to describe in detail how such a scheduler works.

Figure 6.15: The difference in packet transmission order between the fluid model (flows A and B transmit their packets A1 A2 A3 and B1 B2 simultaneously) and the packetized model (one packet at a time: A1 B1 A2 B2 A3).

Packetized GPS

PGPS is also called weighted fair queuing (WFQ). The basic operating architecture is that each packet gets a virtual finish timestamp (VFT) as it arrives at its flow queue, and the scheduler always selects the packet with the smallest VFT to send out. The computation of the VFT involves the virtual system time (VST) at arrival, the size of the packet, and the reserved bandwidth of the flow the packet belongs to. Since the VFTs of packets determine their transmission order, a good VFT computation is the key to emulating the fluid-model scheduler well.

According to the algorithm, if the flow is active, which means packets exist in its flow queue, the VFT of the next arriving packet equals

    F_i^k = F_i^(k-1) + L_i^k / φ_i

where F_i^k is the VFT of the k-th packet of flow i, L_i^k is the length of the k-th packet of flow i, and φ_i is the flow's allocated ratio of bandwidth. Theoretically, if the first packets of all flows arrived at the same time and all flows stayed backlogged forever, the above equation alone would yield the finishing order of packets in the fluid-model scheduler. Unfortunately, that is impossible. For a non-active flow, the VFT of its first arriving packet is calculated by

    F_i^k = V(t) + L_i^k / φ_i

where V(t) is the virtual system time, a linear function of real time t within each time interval. In fact, the maintenance of the VST is the really difficult point of such schedulers: a bad V(t) causes a newly active flow to share more or less bandwidth than the flows active from the beginning, which further affects the scheduler's worst-case delay guarantee.

Open Source Implementation 6.3: Packet Scheduling

For each flow, the csz_qdisc_ops module allocates a structure csz_flow to keep its information. Two variables, start and finish, keep the minimal and maximal finish timestamps of the packets in its flow queue: in principle, the head packet of the flow queue has the smallest finish timestamp and the tail packet the largest. Besides the structure csz_flow, the csz_qdisc_ops module maintains two lists, s and f, in order to implement the PGPS scheduler conveniently. Each item in the lists is a pointer to a csz_flow structure. The list s is ordered by the variable start and contains only the active flows; it allows the function csz_dequeue() to quickly pick the next packet to transmit from the proper flow queue. The list f is ordered by the variable finish and supports the calculation of the virtual system time of PGPS in the function csz_update(). Below we introduce the three major functions of the csz_qdisc_ops module and show their flowcharts.

The function csz_enqueue() is the entry point of the module; its flowchart is shown in Figure 6.16. For an arriving packet, csz_enqueue() first calculates its virtual finish timestamp (VFT). When calculating the VFT of a packet belonging to a non-active flow, the current system virtual time is needed, so the function csz_update() is invoked beforehand. For a non-active flow, csz_enqueue() additionally needs to wake the flow up by inserting its pointer into the list s, which gives the flow a chance to send out packets again.

The function csz_dequeue() continuously sends out the head packet of the flow queue pointed to by the first item in the list s. Every time a packet of a flow is sent out, csz_dequeue() calls csz_insert_start() to re-insert the address of the flow into the list s, keeping its chance to transmit in the next round if the flow queue is non-empty. A flow whose queue is empty disappears from the list s, to avoid wasting system resources.

[Figure 6.16 sketches the flowchart of csz_enqueue(): csz_classify() gets the flow id; csz_update() updates the VST; the length of the flow queue is checked and the packet is dropped if the queue is full; if the flow is active, the new VFT is calculated based on the last VST, otherwise it is calculated based on the current VST and the flow is woken up via csz_insert_start() and csz_insert_finish(); finally skb_queue_tail() enqueues the packet.]

Figure 6.16: The flowchart of the function csz_enqueue()
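To make the enqueue step concrete, below is a minimal user-space sketch of the VFT bookkeeping; it is not the kernel csz code, and the names flow_state and vft_on_enqueue are hypothetical. A packet of an active flow chains its VFT from the previous packet, while a packet of a non-active flow starts from the current system virtual time.

```c
#include <assert.h>

/* Minimal user-space sketch of the PGPS virtual-finish-timestamp (VFT)
 * bookkeeping done on enqueue; this is NOT the kernel csz code, and the
 * names (flow_state, vft_on_enqueue) are hypothetical. */
struct flow_state {
    double finish;  /* VFT of the last queued packet */
    double rate;    /* share of the link allocated to this flow */
    int    backlog; /* number of packets currently queued */
};

/* Compute the VFT of a newly arriving packet of length `len` bytes.
 * For an active flow the new VFT chains from the previous packet's VFT;
 * for a non-active flow it starts from the current system virtual time. */
double vft_on_enqueue(struct flow_state *f, double system_vtime, int len)
{
    double start = (f->backlog > 0 && f->finish > system_vtime)
                       ? f->finish      /* active: continue from last VFT */
                       : system_vtime;  /* non-active: wake up at current VST */
    f->finish = start + (double)len / f->rate;
    f->backlog++;
    return f->finish;
}
```

With a rate of 100 bytes per virtual time unit and 100-byte packets, each enqueued packet advances the flow's VFT by one unit.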


The third function, csz_update, is the magic key of the csz_qdisc_ops module: it calculates the system virtual time. According to the description of PGPS, the system virtual time should be recalculated every time a packet arrives or departs. However, thanks to the maintenance of the list f, the csz_qdisc_ops

[Figure 6.18 sketches the flowchart of csz_update(): get the time interval, delay, between now and the last packet arrival; while any flow remains active, take the minimum VFT, F, from the head item of the list f; assume all flows stayed active and calculate the current VST from delay and the last VST; if F > VST, all remaining flows are indeed active and the VST just calculated is correct; otherwise the flow pointed to by the head item of the list f is no longer active, so compute the VST at the time A when that flow sent its last packet and adjust delay to the interval between now and A, then repeat.]

Figure 6.18: The flowchart of the function csz_update()

[Figure 6.17 sketches the flowchart of csz_dequeue(): get the csz_flow whose head packet has the smallest VFT; skb_dequeue() removes that head packet; the minimum VFT of the flow is recalculated; if the flow queue is non-empty, csz_insert_start() re-inserts the flow; the packet is returned for sending out.]

Figure 6.17: The flowchart of the function csz_dequeue()


module recalculates the SVT only when a packet arrives. The SVT is maintained by the function csz_update. First, csz_update gets the time interval, delay, between now and the last time it was invoked. Secondly, it assumes that all flows have stayed active since the last invocation and calculates the current SVT accordingly. Then the SVT is compared with the finish variable of the head item in the list f. If the finish value is not larger than the SVT, that flow must have become inactive. csz_update removes it from the list f, calculates the SVT at the moment the flow became inactive, and corrects delay to the interval between that moment and now. It then executes the second step again until the correct SVT is obtained and all inactive flows have been removed from the list f.
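The update loop just described can be sketched in user space as follows; a sorted array of minimal VFTs stands in for the list f, unit weights are assumed, and all names are hypothetical, so this is only an illustration of the idea, not the kernel code.

```c
#include <assert.h>

/* Sketch of the system-virtual-time (SVT) update loop described above.
 * finish[] holds the minimal VFTs of the active flows in ascending
 * order, and *weight_sum is the total rate share of those flows.
 * Hypothetical names; unit weight is assumed for every flow. */
double update_svt(double svt, double delay, double link_rate,
                  const double *finish, double *weight_sum, int n)
{
    double w = *weight_sum;
    int i = 0;
    while (i < n) {
        /* Step 2: assume all remaining flows stayed active. */
        double candidate = svt + delay * link_rate / w;
        if (finish[i] > candidate) {
            *weight_sum = w;
            return candidate;    /* assumption holds: this SVT is correct */
        }
        /* The head flow of list f became inactive at virtual time
         * finish[i]: advance svt there, shrink delay, drop the flow. */
        delay -= (finish[i] - svt) * w / link_rate;
        svt = finish[i];
        w -= 1.0;                /* sketch assumes unit weight per flow */
        i++;
    }
    *weight_sum = w;
    return svt;                  /* no active flows remain */
}
```

If the head finish value exceeds the candidate SVT, the loop stops after one pass; otherwise the inactive flow is peeled off and the remaining interval is re-spread over the smaller weight sum.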

6.2.8 Summary

We introduced the IntServ architecture and its key components in this chapter. The architecture attempts to provide an end-to-end QoS resource guarantee on top of an IP network. Two QoS service levels, guaranteed service and controlled-load service, are specified in the RFCs. RSVP is used as a common language to negotiate resource reservations with the admission control component of each router on the path between the sender and the receiver.

To enforce the negotiated result, each IntServ router first needs to identify the flow that every packet belongs to and then transmit the packets at the correct time. To control the transmission times of all flows efficiently, a packet scheduling component is deployed before the output device.

To summarize, IntServ is indeed an explicit architecture, though it may be too complex for ISPs to deploy commercially. The traffic control components developed for the architecture, such as packet scheduling, are the foundation of later architectures like Differentiated Service. Viewed from a historical standpoint of QoS IP networks, IntServ can be regarded as the development period of QoS tools.

6.3 Differentiated Service

In this section we introduce another IP-based QoS architecture, which should be a more feasible model for deployment in the Internet. We describe the architecture and concept of DiffServ in Subsection 6.3.1, including a comparison with IntServ. In Subsection 6.3.2, the IP header field used by DiffServ is introduced in detail. The remaining


subsections describe the key elements of DiffServ respectively.

6.3.1 Concept

Although IntServ supplies an accurate quality of service, the IntServ architecture is too complex for ISPs. Especially in a core router, which must identify the huge traffic volume arriving from different applications, the highly complex design makes good performance hard to achieve. Besides, too many applications are already in daily use, and it is impossible to change the code of all of them to adapt to an IntServ network in the short term. Thus, a simpler, more scalable, and more manageable solution was needed. Differentiated Services (DiffServ) is designed for this goal.

General Model

A DiffServ network is composed of one or more DiffServ domains, and a DiffServ domain is composed of several routers. According to their capability, the routers in DiffServ are categorized into two types, as shown in Figure 6.19. A router at the boundary of the domain is called an edge router, while one in the interior is called a core router. When packets enter a DiffServ domain, they must pass through an edge router first. Each packet goes through two stages at the edge router. The first stage identifies and marks packets based on some predefined policies; the mark on the packet affects the forwarding treatment the packet receives in the domain. The second stage polices and shapes packets based on the traffic profile described and negotiated by the customer and the service provider. This stage assures that the traffic injected into the domain stays within the service capability of the domain. Within the interior of the domain, no further classification or profiling is performed. The

[Figure 6.19 shows a DiffServ domain: packets enter through an ingress edge router, which polices, marks, shapes, and drops packets; the core routers in the interior simply forward packets; packets leave through an egress edge router.]

Figure 6.19: The basic architecture of the DiffServ network.


remaining stage, executed by the core routers, forwards packets with a particular behavior according to their mark.

Comparison with IntServ

Compared with IntServ, the DiffServ architecture is simpler but coarser. Table 6.x shows the major differences between the two architectures. First, DiffServ does not support resource reservation for a single flow. A large number of flows seriously reduces the performance of several key components in IntServ, such as the scheduler and the classifier. In DiffServ, arriving traffic is only divided into several groups called forwarding classes. Each forwarding class represents a predefined forwarding treatment. We introduce the different forwarding treatments of DiffServ in detail in Subsection 6.3.3.

Secondly, packet classification is handled only at the boundary of the DiffServ domain. That is to say, only the edge routers need to classify and mark each packet entering the domain according to some predefined policies. The core routers forward packets with different behaviors based solely on the marks in the packet headers. This design avoids the complex and difficult problem of classifying and scheduling a huge number of packets in a high-speed core router, which is one of the major reasons for the failure of IntServ.

Third, the IntServ specification clearly defines the services that IntServ provides, but the DiffServ specification only specifies the forwarding behaviors

Table 6.x: The differences between DiffServ and IntServ

Compared item        IntServ       DiffServ
Work region          End-to-end    Domain
Guarantee required   Reservation   Provisioning
Defined in standard  Service type  Forwarding behavior
Router capability    All in one    Edge and core
Manageable unit      Flow          Class

[Figure 6.20 shows the 8-bit field in two interpretations. IP TOS: bits 1-3 are Precedence, followed by the D, T, and R bits and two zero bits. DS: bits 1-6 form the DSCP and the last 2 bits are unused, giving 2^6 = 64 behaviors, which cover 12 AF PHBs, 1 EF PHB, 1 best-effort PHB, and 8 class selector PHBs.]

Figure 6.20: The DS field redefined from the TOS field of the IPv4 header


in the core router. A forwarding behavior describes how a packet is forwarded at one hop and affects the service the packet receives. The services provided in DiffServ are decided and designed by the service provider.

The fourth difference is that in DiffServ the resources are estimated and provisioned before customers use them, while in IntServ the resources are allocated and reserved when the customer issues a request. Besides, the DiffServ architecture is composed of multiple DiffServ domains, which is helpful for network management. IntServ only emphasizes end-to-end service, which is hard for service providers to implement over a large-scale area.

6.3.2 DS Field

A packet entering a DiffServ domain is marked by the edge router. This mark lets the core routers decide how to treat the packet. Since DiffServ is built directly on the IP network without adding another layer, the mark must use a field in the current IP header. Thus, DiffServ reclaims the 8-bit Type of Service (TOS) field in the IPv4 header to indicate forwarding behaviors. The replacement field is called the DS field, and only 6 of its bits are used as a DS CodePoint (DSCP) in DiffServ to encode the PHB, as shown in Figure 6.20. The DSCP field can represent 64 distinct values, and many allocations of the codepoint space have been proposed. The standard divides the space into three pools, as shown in Table 6.7. The codepoints defined in pool 1 correspond to the major standard PHBs, which are described in detail in Subsection 6.3.3. The codepoints in the other two pools are reserved for experimental and local use.
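As an illustration of how an end host (rather than an edge router) can write this byte, the sketch below builds the TOS byte from a DSCP; dscp_to_tos is a hypothetical helper, while setsockopt with IP_TOS is the standard socket-level way to set the field.

```c
#include <assert.h>

/* The DSCP occupies the upper 6 bits of the old TOS byte, with the
 * lowest 2 bits unused by DiffServ.  dscp_to_tos() is a hypothetical
 * helper that shifts a DSCP into place. */
unsigned char dscp_to_tos(unsigned char dscp)
{
    return (unsigned char)((dscp & 0x3F) << 2);  /* DSCP into bits 7..2 */
}

/* Example (sketch only):
 *   unsigned char tos = dscp_to_tos(46);   // EF PHB, DSCP 101110
 *   setsockopt(sock, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
 */
```

For the EF codepoint 46 (101110), the resulting byte is 0xB8, the value commonly seen on the wire for EF-marked packets.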

6.3.3 Per-Hop Forward Behavior

In this subsection, we introduce the 4 major forwarding behavior types and their corresponding recommended codepoints defined in the standard. The first

Table 6.7: The allocated space of the codepoints

Pool  Codepoint space  Assignment policy
1     xxxxx0           Standards action
2     xxxx11           Experimental and local use
3     xxxx01           Experimental and local use, but may be subject to standards action


two types exist to provide some limited backward compatibility, since the DS field is redefined from the original IP TOS field. The other two PHB groups, Assured Forwarding (AF) PHBs and Expedited Forwarding (EF) PHB, are newly standardized by the IETF and provide services with some degree of quality guarantee.

Default PHB

For most packets in the original IP network, the TOS field is unused and its value is set to zero. To let these DiffServ-unaware packets pass through a DiffServ network painlessly, DiffServ defines the default DSCP value as 000000, which exactly equals the TOS value of most DiffServ-unaware packets. DiffServ inserts these packets into a non-policed queue and reserves a minimal bandwidth for them.

Class Selector PHB Group

Though in most cases the TOS field is not used, some vendors do use its first 3 bits. To allow older IP implementations to coexist with DiffServ implementations, a DSCP of the form xxx000 is recommended to map to a group of PHBs that corresponds to a set of relative priorities for the traffic. A packet with a higher DSCP value is expected to receive a higher relative priority than one with a lower value.

AF PHB Group

The PHBs in this group are expected to forward all packets successfully as long as the traffic source conforms to its traffic profile, the Traffic Conditioning Agreement (TCA). Traffic exceeding its TCA is forwarded on a best-effort basis: if the traffic load is not heavy, such packets are still forwarded successfully, but if the load is heavy, they are discarded with a higher probability. It deserves mention that each hop must preserve the original transmission order of all successfully forwarded packets.

There are 4 forwarding rate classes in the AF PHB group, and each class is allocated a certain bandwidth and buffer space. Within each class, traffic is divided into 3 drop precedence levels; in other words, there are 12 individual PHBs in the group in total. When the buffer of a class is nearly full, which implies that the arriving traffic exceeds the allocated bandwidth of the class, packets in the class with a high drop precedence level are discarded with a higher probability than packets with a low drop precedence level.
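The recommended AF codepoints encode the class in DSCP bits 5..3 and the drop precedence in bits 2..1, so all 12 PHBs can be built mechanically; af_dscp below is a hypothetical helper illustrating this layout.

```c
#include <assert.h>

/* The recommended AF codepoints encode the class in DSCP bits 5..3
 * and the drop precedence in bits 2..1.  af_dscp() is a hypothetical
 * helper that builds the DSCP for AFxy. */
int af_dscp(int class_id, int drop_prec)  /* class 1..4, precedence 1..3 */
{
    if (class_id < 1 || class_id > 4 || drop_prec < 1 || drop_prec > 3)
        return -1;                        /* not a valid AF PHB */
    return (class_id << 3) | (drop_prec << 1);
}
```

For example, AF11 yields 001010 (decimal 10) and AF43 yields 100110 (decimal 38), spanning the 12 codepoints of the group.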

To avoid congestion within a class, the amount of traffic arriving into the class needs to be controlled. Moreover, because the class of a packet does not change within the same DiffServ domain, the edge router needs to admit, shape, and even drop packets to keep the DiffServ domain from being overloaded. As discussed in Subsection 6.3.1, providing quality of service in DiffServ is based on the provisioning and monitoring of the edge routers.

In fact, deciding whether congestion has happened is an interesting research issue. Many buffer management algorithms have been proposed to detect congestion and reduce its effect in advance, such as random early drop (RED). We will look at these buffer management algorithms in Subsection 6.3.6. For a more detailed and formal description of the AF PHBs, you can read the RFC xxxx.

EF PHB

The EF PHB attempts to forward packets with low loss, low latency, and low jitter. It aims to provide performance similar to the traditional point-to-point leased-line service. To offer these three characteristics in DiffServ, the core router must at all times offer at least enough bandwidth to transmit the EF traffic at the rate given in its profile.

Besides, EF traffic is allowed to preempt other traffic types in the core router to support the three "low" characteristics more easily. That is to say, if the core router uses a priority queue to manage the bandwidth among all types of forwarding behaviors, the EF traffic may own the highest forwarding priority. However, to avoid starving all other traffic and to assure that the EF traffic itself is forwarded smoothly, a strict bandwidth constraint is very important and necessary. A shaper implemented as a leaky bucket installed at the edge router may be a good tool to reach this goal. Out-of-profile traffic may be forwarded with the default PHB or even be discarded at the edge router.
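A minimal token-bucket sketch of the edge-router profile check might look as follows; the names are hypothetical, and a real shaper would delay out-of-profile packets until they conform rather than merely flag them.

```c
#include <assert.h>

/* Token-bucket sketch of the EF profile decision at the edge router:
 * a packet is in-profile if enough tokens have accumulated at rate
 * `rate` (bytes per time unit) up to depth `burst`.  Hypothetical
 * names; a real shaper would delay, not just classify, packets. */
struct bucket {
    double tokens;     /* current token level, in bytes */
    double rate;       /* fill rate, bytes per time unit */
    double burst;      /* bucket depth, in bytes */
    double last_time;  /* time of the previous packet */
};

int in_profile(struct bucket *b, double now, int pkt_len)
{
    b->tokens += (now - b->last_time) * b->rate;   /* accumulate tokens */
    if (b->tokens > b->burst)
        b->tokens = b->burst;                      /* cap at bucket depth */
    b->last_time = now;
    if (b->tokens >= pkt_len) {                    /* enough tokens: conform */
        b->tokens -= pkt_len;
        return 1;
    }
    return 0;                                      /* out of profile */
}
```

The small burst parameter captures how little burstiness the EF service tolerates compared with AF.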

Compared to the AF PHBs, the EF PHB offers a higher service quality but tolerates less traffic burstiness. For constant-bit-rate traffic with a higher transmission quality requirement, the EF PHB is a good choice, while the AF PHBs better suit traffic whose rate is somewhat bursty but which is loss tolerant. A comparison of the AF PHB and EF PHB and their relative features is shown in table xxxx. A more detailed description can be found in RFC 2598.

6.3.4 A Packet Life in a DiffServ Domain

A packet passing through a DiffServ domain goes through three stages: the ingress, egress, and interior stages. The first two are handled by edge routers, while the latter is handled by the core


routers. In the following, we introduce each stage by describing the operations in the routers in detail.

Ingress Stage

As shown in Figure 6.21, in the ingress stage a packet passes through three blocks: traffic classification, traffic conditioning, and traffic forwarding. In the first block, the classifier identifies the arriving traffic based on some policies and tells the subsequent components which traffic profile should govern the behavior of the traffic. The classified packet streams are then passed into the second block, traffic conditioning.

In the second block, according to the definitions in the profile, the meter measures the traffic and catalogs packets as in-profile or out-of-profile. For in-profile packets, the marker sets a suitable codepoint that, in principle, lets them pass through the domain successfully. Out-of-profile packets may be dropped, or marked with a codepoint corresponding to a forwarding behavior with a high drop probability. Alternatively, they are simply passed into the shaper like the in-profile packets; however, unlike the in-profile packets, which pass through the shaper almost without any delay, they are held back until they conform to their traffic profile.

The marked packets are inserted into the corresponding class queues in the traffic forwarding block. The DSCP classifier there is far simpler to implement than the packet classifier of the first block: it only looks at the DS

[Figure 6.21 shows the ingress stage in the edge router: a packet classifier feeds the traffic conditioning block (meter, DSCP marker, shaper, and dropper), which feeds the traffic forwarding block (DSCP classifier and class scheduler) toward the DS domain.]

Figure 6.21: The ingress stage of a packet in the edge router


field of the packet, marked in the traffic conditioning block, and then dispatches the packet to the corresponding class queue. The class scheduler then forwards packets from each class queue at the particular forwarding rate decided by the design of the network service.

Interior Stage

Compared to the multiple processing blocks of the ingress stage, there is only one block in the interior stage, as shown in Figure 6.22. The simple architecture reduces the implementation cost of the core router and increases its forwarding performance. The core router is only responsible for triggering per-hop behaviors based on the DSCP of the packets, which is similar to the processing in the third block of the ingress stage.

Egress Stage

6.3.5 Packet Classification

The quality of service provided in a DiffServ domain is based on provisioning; thus it is important for DiffServ to verify the arriving traffic against its TCA and to control the amount of traffic injected into the domain. To know which traffic a profile applies to, a packet classifier is necessary. Classification is applied not only in DiffServ but also in many other domains, such as in a firewall.

[Figure 6.22 shows the interior stage in the core router: on the data plane, a DSCP classifier and a class scheduler form the traffic forwarding block toward the DS domain; the control plane holds the routing database.]

Figure 6.22: The interior stage of a packet in the core router


Basic Requirements

The traditional role of the classifier in a router is to help it find the forwarding target of a packet, which is a one-dimensional longest-prefix-matching classification. There is only one matching field, the destination IP address, but the matching value may fall in a range. In IntServ, the classifier lets the router identify which flow a packet belongs to. A flow in IntServ is defined by five fixed fields in the IP and TCP packet headers; compared with traditional routing, the number of matching fields is 5, with a total bit length of 104 bits, but only one specific pattern identifies a flow. Compared with these two examples, the classifier in DiffServ is far more complex. DiffServ attempts to provide a very flexible way to describe what kinds of packets belong to a class, so the classifier in DiffServ is a multi-dimensional range classifier. The classification conditions may include the values of several IP, TCP, and UDP packet header fields. For example, we can catalog all packets with a source IP address between 140.113.88.1 and 140.113.88.254 and a port number equal to 100 into the same class, or catalog all UDP packets whose port number is between 5000 and 6000, which may belong to audio traffic, into one class. Obviously, the packet classifier in DiffServ involves two difficult problems: multi-dimensional matching and range matching.

This style of classifier is widely applied in many types of equipment, such as security firewalls and bandwidth controllers, which may have to handle a large amount of traffic. Moreover, because the classifier is the entry point of the router in most cases, even a little delay is hard to tolerate. Thus, scalability and speed are still the important issues in the design of a classification algorithm. Besides, all the traditional issues also remain, such as low storage requirements and fast updates. The only change is that the problems mentioned above become more complex and difficult.

Classification Algorithms

Below we look at two different basic approaches to multi-dimensional range matching: the trie approach and the geometric approach. The trie approach has been widely used for longest-prefix matching in IP address lookup. That traditional application is a one-dimensional range matching problem, which is a special case of multi-dimensional range matching.

The other approach turns the classification problem into a geometric problem. The values of one field are projected onto a number line, so k classification fields compose a k-dimensional space. A k-dimensional classification rule is transformed into a k-dimensional region, and a packet can be represented as a point


in the same space. The packet classification problem is then equivalent to deciding which region of the k-dimensional space a point belongs to. Let us take a 2D example to explain the transformation. Assume we want to classify packets according to their 3-bit source address and 3-bit destination address, which compose a 2-dimensional space as shown in Figure 6.23. Assume we have three rules, and rule A states that all packets with a source address between (100, 110) and a destination address between (001, 101) belong to class 1. That means we can plot a rectangle A in the space, as shown in Figure 6.23. Rules B and C are drawn based on the same principle. To classify which rule a packet with (source address, destination address) = (101, 011) matches is then to determine which rectangle the point (101, 011) falls in. A 2D classification algorithm based on this concept is presented in [LAKSH98]; the reader is encouraged to read the paper directly.
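The 2D example above amounts to a point-in-rectangle test; classify2d and rule2d below are hypothetical names, and rule A encodes the ranges given in the text.

```c
#include <assert.h>

/* 2D sketch of the geometric view: each rule is a rectangle over
 * (source address, destination address) and a packet is a point.
 * Rule A below matches the text's example: src in [100,110] (4..6)
 * and dst in [001,101] (1..5).  classify2d() is a hypothetical helper
 * that returns the class of the first matching rule, or -1. */
struct rule2d {
    int src_lo, src_hi;   /* inclusive source-address range */
    int dst_lo, dst_hi;   /* inclusive destination-address range */
    int class_id;
};

int classify2d(const struct rule2d *rules, int n, int src, int dst)
{
    for (int i = 0; i < n; i++)
        if (src >= rules[i].src_lo && src <= rules[i].src_hi &&
            dst >= rules[i].dst_lo && dst <= rules[i].dst_hi)
            return rules[i].class_id;   /* point falls inside rectangle i */
    return -1;                          /* no rectangle contains the point */
}
```

The packet (101, 011), i.e., the point (5, 3), falls inside rectangle A and is assigned class 1. Real algorithms such as [LAKSH98] avoid this linear scan, which is the whole point of the research.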

6.3.6 Packet Discard

In DiffServ, besides the necessity of scheduling among class queues, the management within one queue is also important, since a queue is always shared by a group of users. Especially in the EF service, a packet discard policy is expected to tolerate small variations in user traffic and, at the same time, avoid the occurrence of congestion. Below we introduce two kinds of policy.

Tail Drop

This is the simplest and most basic packet discard policy. The policy, normally used in conjunction with FIFO queuing, drops newly arriving packets when there is no

[Figure 6.23 plots the 2D example on a grid of 3-bit source addresses (srcaddr, 000-111) against 3-bit destination addresses (destaddr, 000-111), with rectangles A and B marking the classification rules.]

Figure 6.23: The 2D example of the geometric classification


more space left in the queue. Packets continue to be dropped until queue space becomes available again.

Because tail drop is the default policy of FIFO queuing, some problems often imputed to FIFO queuing also belong to tail drop. For example, when a bursting source shares a FIFO queue with other well-behaved sources, it may occupy all available queue space in a short time, forcing newly arriving packets of the well-behaved sources to be dropped. The problem could be avoided by dividing the single queue into multiple queues, where each traffic source owns its own length-limited queue. However, this means some packets may be dropped even when the router still has queue space.

So the approach implemented in many routers today is longest-queue tail drop. All service queues share a common memory pool, and when there is no more space to queue a newly arriving packet, the packet at the tail of the longest service queue is dropped first. With this refinement, service classes that exceed their allocated service rate have a high dropping probability, while service classes operating within their allocation maintain short queues and therefore experience a low dropping probability.
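Selecting the victim queue for longest-queue tail drop is a simple scan; the sketch below uses packet counts, though a real router would typically compare queue lengths in bytes, and pick_drop_queue is a hypothetical name.

```c
#include <assert.h>

/* Sketch of longest-queue tail drop: when the shared memory pool is
 * full, drop from the tail of the currently longest service queue.
 * Hypothetical helper; real routers usually track lengths in bytes. */
int pick_drop_queue(const int *qlen, int nqueues)
{
    int victim = 0;
    for (int i = 1; i < nqueues; i++)
        if (qlen[i] > qlen[victim])
            victim = i;        /* longest queue so far */
    return victim;             /* drop from the tail of this queue */
}
```

The class exceeding its service rate accumulates the longest queue and therefore absorbs the drops, which is exactly the fairness refinement described above.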

Early Random Drop

If we do not attempt to classify services into different queues, a single queue with a fair-sharing dropping policy is necessary. A possible way is to drop newly arriving packets with some probability when the queue space is expected to become full, which is termed early drop. The policy is expected to give the sources early warning that queue space will be insufficient, and to avoid dropping consecutive packets in a short time, as tail drop does, which may bring serious harm to some current versions of TCP.

The key of the policy is to decide whether the queue space is going to be full. A threshold on the queue length may be the most direct way: once the queue length exceeds the threshold, a probability function is applied to either queue or discard newly arriving packets. However, because the variation of the instantaneous queue length is very large, a threshold on the queue length directly may not be very suitable.

An algorithm proposed in [Floyd93] presents a better way to reduce the variation of the queue length and is expected to estimate the onset of a full queue more correctly. The algorithm calculates the average queue size and adjusts the packet discard probability with that value.
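The averaging idea can be sketched as an EWMA of the queue length plus a linear drop-probability ramp between two thresholds; the parameter names follow common RED notation (wq, min_th, max_th, max_p), but the code is only an illustration, not the [Floyd93] algorithm verbatim.

```c
#include <assert.h>

/* Sketch of the [Floyd93] idea: keep an exponentially weighted moving
 * average of the queue length and grow the drop probability linearly
 * between a minimum and a maximum threshold.  Illustration only. */
double red_avg(double avg, int qlen, double wq)
{
    return (1.0 - wq) * avg + wq * (double)qlen;   /* EWMA update */
}

double red_drop_prob(double avg, double min_th, double max_th, double max_p)
{
    if (avg < min_th)  return 0.0;                 /* no early dropping */
    if (avg >= max_th) return 1.0;                 /* force drop */
    return max_p * (avg - min_th) / (max_th - min_th);
}
```

Because the average moves slowly (wq is small in practice), short bursts do not trigger drops, while a persistently growing queue steadily raises the drop probability.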

6.3.7 Summary


6.4 Pitfalls and Misleading

1. Shaper and Scheduler

2. WFQ and WRR

3. Service and Forwarding Behavior

6.5 Further Reading

6.6 Exercises

Hands-on Exercises

Written Exercises

1. As mentioned in Section 6.1, there are six basic components required by a QoS-aware router. Can you give a block diagram describing how to design an IntServ router from these components and the operating relationships among them? Of course, adding some components as the design demands is allowed.

2. Assume traffic is regulated by a token bucket with parameters (r, p, B). Can you further discuss the effect of the token bucket? For example, what is its expected output? Or, if we modify any one parameter, what does the result become?

3. Two common traffic estimation methods are used in measurement-based admission control: one is EWMA and the other is the time window. Can you further compare the difference in estimation between them?

4. There is a 10^7 bits/sec link scheduled by WRR. Suppose N flows attempt to share the link and the size of their packets is 125 bytes. We plan to fairly allocate 8*10^6 bits/sec of bandwidth to half of the flows and the residual bandwidth to the other half. If N-1 flows are backlogged, what is the worst possible delay a non-active flow waits before sending its first packet once that packet arrives?

5. Generally speaking, WRR is suitable for networks whose packet size is fixed, and DRR is an improved version able to handle variable-length packets. In fact, due to its simple implementation, DRR is more and more popular. However, it still has a drawback in providing a small


worst-case delay guarantee. Can you further study their worst-case delay guarantees? Does DRR guarantee a smaller worst-case delay than WRR?

6. The original RED algorithm requires tracing the queue length and periodically calculating the average queue length, which is a heavy load for an implementation. In TC, a better technique is used to reduce the load. Observe the source code in the file sch_red.c, try to draw a flowchart, and describe how the problem is solved in the implementation.