
IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 5, NO. 5, OCTOBER 1997 675

Hierarchical Packet Fair Queueing Algorithms

Jon C. R. Bennett, Member, IEEE, and Hui Zhang, Member, IEEE

Abstract: In this paper, we propose to use the idealized Hierarchical Generalized Processor Sharing (H-GPS) model to simultaneously support guaranteed real-time, rate-adaptive best-effort, and controlled link-sharing services. We design Hierarchical Packet Fair Queueing (H-PFQ) algorithms to approximate H-GPS by using one-level variable-rate PFQ servers as basic building blocks. By computing the system virtual time and per-packet virtual start/finish times in units of bits instead of seconds, most of the PFQ algorithms in the literature can be properly defined as variable-rate servers. We develop techniques to analyze delay and fairness properties of variable-rate and hierarchical PFQ servers. We demonstrate that in order to provide tight delay bounds with an H-PFQ server, it is essential for the one-level PFQ servers to have small Worst-case Fair Indices (WFI). We propose a new PFQ algorithm called WF²Q+ that is the first to have all of the following three properties: 1) providing the tightest delay bound among all PFQ algorithms; 2) having the smallest WFI among all PFQ algorithms; and 3) having a relatively low asymptotic complexity of O(log N). Simulation results are presented to evaluate the delay and link-sharing properties of H-WF²Q+, H-WFQ, H-SFQ, and H-SCFQ.

Index Terms: Hierarchical packet scheduling, link-sharing, Quality of Service, real-time, resource management.

I. INTRODUCTION

FUTURE integrated services networks will support multiple service classes that include real-time service, best-effort service, and others. In addition, they will need to support link-sharing [6], which allows resource sharing among traffic streams that are grouped according to administrative affiliation, protocol, traffic type, or other criteria. Fig. 1(a) shows an example where there are 11 agencies sharing the output link. The administrative policy dictates that Agency A1 gets at least 50% of the link bandwidth whenever it has traffic. In addition, to avoid starvation of the best-effort traffic, of the 50% of the bandwidth assigned to A1, best-effort traffic should get at least 20% if there is sufficient demand.

It is important to design mechanisms to meet the goals of link sharing and different service classes simultaneously. The fluid Hierarchical Generalized Processor Sharing (H-GPS) system provides a general and flexible framework to support hierarchical link sharing and traffic management for different service classes. H-GPS can be viewed as a hierarchical integration of one-level GPS servers. With a one-level GPS, there are multiple packet queues, each associated with a service share. During any time interval when there are backlogged queues, the server services all backlogged queues simultaneously in proportion to their corresponding service shares. With an H-GPS server, the queue at each internal node is a logical one, and the service it receives is distributed instantaneously to its child nodes in proportion to their relative service shares. This service distribution follows the hierarchy until it reaches the leaf nodes where there are physical queues.

Fig. 1. A link sharing example, panels (a) and (b).

Manuscript received September 5, 1996; revised February 21, 1997 and May 5, 1997; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor S. Floyd. This work was supported by DARPA under Contract N66001-96-C-8528 and Contract N00174-96-K-0002, and by an NSF Career Award under Grant NCR-9624979. Views and conclusions contained in this paper are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Government.

J. Bennett was with FORE Systems, Pittsburgh, PA. He is now with the Division of Engineering and Applied Sciences, Harvard University, Cambridge, MA 02138 USA (e-mail: [email protected]) and also with BBN Technologies, Cambridge, MA 02138 USA (e-mail: [email protected]).

H. Zhang is with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213-3891 USA (e-mail: [email protected]).

Publisher Item Identifier S 1063-6692(97)07263-4.

It has been shown that with a one-level GPS: 1) an end-to-end delay bound can be provided to a session if the traffic on that session is leaky bucket constrained [12]; 2) bandwidth is fairly distributed to competing sessions [5]; and 3) the sources can accurately estimate the bandwidth available to them in a distributed fashion [8]. The first property forms the basis for supporting real-time traffic [2] and the third property enables robust and distributed end-to-end traffic management algorithms for best-effort traffic [8], [14]. H-GPS will maintain the first and the third properties, but distribute excess bandwidth unused by a session according to the hierarchy rather than just the service shares of sessions. Therefore, the simple H-GPS configuration in Fig. 1(b) simultaneously supports all three goals, namely, link-sharing, real-time traffic management, and best-effort traffic management.

1063–6692/97$10.00 © 1997 IEEE

While H-GPS provides a simple model for supporting integrated services networks, it is defined in a hypothetical fluid system that cannot be precisely implemented. In this paper, we design packet approximation algorithms of H-GPS. In the literature, a number of one-level Packet Fair Queueing (PFQ) algorithms have been proposed to approximate the fluid GPS algorithm [1], [5], [7], [11]–[13], [17]. To reduce the implementation complexity, they all use the notion of a system virtual time function that tracks the progress in the fluid system. As we will show in Section II, the same technique based on a single system virtual time function does not apply to packet algorithms approximating H-GPS.

In this paper, we propose to approximate H-GPS by using one-level PFQ servers as basic building blocks and organizing them in a hierarchical structure. The resulting Hierarchical Packet Fair Queueing (H-PFQ) algorithms should have the following properties: 1) tight per-session delay bounds that are comparable to those of an H-GPS server; 2) bandwidth distribution in a hierarchical fashion that is similar to an H-GPS server; and 3) a relatively low complexity. To construct such an H-PFQ server, the one-level server needs to have the following properties: a) a tight per-session delay bound as compared to the one-level GPS server and b) a relatively low complexity. In addition, as we will demonstrate, to achieve tight delay bounds in the H-PFQ server, one-level PFQ servers also need to provide c) small Worst-case Fair Indices (WFI's) as defined in [1]. Most of the previously proposed PFQ algorithms [5], [7], [11], [12], [17] do not have small WFI's. In fact, they all have WFI's that grow proportionally to the number of queues in the system. As a result, the delay bounds provided by H-PFQ servers made of these PFQ's are much larger than those provided by H-GPS. The only exception is the Worst-case Fair Weighted Fair Queueing algorithm (WF²Q), which is proven to provide the smallest WFI's among all PFQ algorithms [1]. However, WF²Q uses a system virtual time function with a complexity of O(N).

We propose a new algorithm that maintains all the important properties of WF²Q, but has a lower complexity than WF²Q. We call the new algorithm WF²Q+. Simulation results are presented to illustrate the advantages of H-WF²Q+ over H-WFQ, H-SFQ, and H-SCFQ.

II. FLUID AND PACKET SYSTEMS

Throughout the paper, we discuss two types of systems: fluid systems, in which the traffic is infinitely divisible and multiple traffic streams can receive service simultaneously, and packet systems, in which only one traffic stream can receive service at a time and the minimum service unit is a packet. While fluid systems cannot be realized in the real world, they are conceptually simple and some of them have properties that are highly desirable for network control. For these fluid systems, people have studied the corresponding packet approximation algorithms.

In this section, we first review Generalized Processor Sharing (GPS), and illustrate how it can be approximated by packet algorithms based on virtual time functions. We then define Hierarchical GPS (H-GPS) and show that the same technique cannot be applied directly to H-GPS.

A. Packet Approximation of GPS

A one-level GPS server with $N$ queues is characterized by $N$ positive real numbers, $\phi_1, \phi_2, \ldots, \phi_N$. Let $W_i(t_1, t_2)$ be the amount of session $i$ traffic served in the interval $[t_1, t_2]$, and let $W(t_1, t_2)$ be the total amount of service provided by the server during the same period. A work-conserving GPS server is defined as one for which

$$W_i(t_1, t_2) = \frac{\phi_i}{\sum_{j \in B(t_1)} \phi_j} \, W(t_1, t_2) \tag{1}$$

holds for any interval $[t_1, t_2]$ during which $B(t)$, the set of backlogged sessions at time $t$, does not change.

There are two noteworthy points. First, in the definition of the GPS algorithm, there is no assumption on whether the server rate is constant or variable. Since an internal node in a hierarchical server is a variable-rate server, this ensures that H-GPS is properly defined. Second, while the $\phi_i$'s can be arbitrary positive numbers, it is the relative ratios among them rather than the exact numbers that are important. For example, given an arbitrary set of $\phi_i$'s, we can define a new set of $\phi'_i$'s that are normalized with respect to the sum of all $\phi_i$'s, i.e.,

$$\phi'_i = \frac{\phi_i}{\sum_{j=1}^{N} \phi_j}.$$

The GPS systems defined by the $\phi_i$'s and the $\phi'_i$'s are identical. Without loss of generality, we assume that $\sum_{i=1}^{N} \phi_i = 1$ holds.

From (1), it immediately follows that

$$\frac{W_i(t_1, t_2)}{W_j(t_1, t_2)} = \frac{\phi_i}{\phi_j} \tag{2}$$

holds for any interval $[t_1, t_2]$ during which queues $i$ and $j$ are continuously backlogged. That is, the server services all backlogged sessions simultaneously, in proportion to their service shares. In addition,

$$W_i(t_1, t_2) \ge \phi_i \, W(t_1, t_2) \tag{3}$$

holds for any interval $[t_1, t_2]$ during which queue $i$ is continuously backlogged, i.e., queue $i$ gets a minimum share $\phi_i$ of the server's capacity during any of its backlogged periods regardless of the behavior of other sessions. In the special case of a fixed-rate server with rate $r$, i.e., $W(t_1, t_2) = r (t_2 - t_1)$, (3) becomes

$$W_i(t_1, t_2) \ge r_i (t_2 - t_1) \tag{4}$$

where $r_i = \phi_i r$ is the minimum rate guaranteed to the session. With such a strong bandwidth guarantee, GPS can also provide a worst-case delay bound for a session that is constrained by a leaky bucket with an average rate no greater than $r_i$ [12].
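As an illustration of (1)–(3), the following minimal Python sketch computes the instantaneous service rate that a fluid GPS server gives to each backlogged session. The function name `gps_rates` and the three-session share table are hypothetical and chosen only for this example; they do not come from the paper.

```python
from fractions import Fraction

def gps_rates(shares, backlogged, link_rate=Fraction(1)):
    """Instantaneous GPS rate of each backlogged session, following (1)."""
    total = sum(shares[i] for i in backlogged)
    return {i: link_rate * shares[i] / total for i in backlogged}

# Hypothetical shares that already sum to 1.
shares = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}

all_busy = gps_rates(shares, {"a", "b", "c"})
without_b = gps_rates(shares, {"a", "c"})   # session b is idle

print(all_busy["a"], without_b["a"], without_b["c"])
```

When session b is idle, its share is redistributed to a and c in the ratio of their shares, so the ratio in (2) is unchanged and every backlogged session still receives at least its guaranteed fraction, as (3) requires.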


A good packet approximation algorithm of GPS would be one that serves packets in increasing order of their finish times in the fluid GPS system [5], [12]. However, when the packet system is ready to choose the next packet to transmit, it is possible that the next packet to depart under the fluid system has not yet arrived at the packet system. Waiting for it requires knowledge of the future and also causes the system to be non-work-conserving. To have a work-conserving packet system, the packet server must choose a packet to transmit based only on the state of the fluid system up to time $t$. In Weighted Fair Queueing (WFQ) [5], when the server is ready to transmit the next packet at time $t$, it picks, among all the packets queued in the system at $t$, the first packet that would complete service in the corresponding GPS system if no additional packets were to arrive after time $t$. Since packet finish times will change when sessions become backlogged or unbacklogged, a naive implementation of WFQ is to recompute the finish times of all packets in the GPS system whenever a session becomes backlogged or unbacklogged.

By observing the following important property of GPS [12], a more practical implementation of WFQ is possible.

Property 1: The relative finish order of all packets that are in the system at time $t$ is independent of any packet arrivals to the system after time $t$. That is, for any two packets $p$ and $p'$ in a GPS system at time $t$, if $p$ finishes service before $p'$ assuming there are no arrivals after time $t$, then $p$ will finish service before $p'$ for any pattern of arrivals after time $t$.

With this property, it is possible to maintain the relative GPS finish order for packets in the WFQ system by using a priority queue mechanism [5], [12]. Such an implementation is based on the notion of a system virtual time function $V_{GPS}(\cdot)$, which is the normalized fair amount of service that all backlogged sessions should receive by time $t$ in the GPS system. Each packet $p_i^k$ (the $k$th packet on session $i$) has a virtual start time $S_i^k$ and a virtual finish time $F_i^k$, where $V_{GPS}^{-1}(S_i^k)$ and $V_{GPS}^{-1}(F_i^k)$ are the times packet $p_i^k$ starts and finishes service in the GPS system, respectively. Another way of interpreting $F_i^k$ is that it represents the amount of service, normalized with respect to its service share, that session $i$ has received right after packet $p_i^k$ is served. In the GPS system, all backlogged sessions should receive the same normalized amount of service. Since both the system virtual time and the per-packet virtual start/finish times represent the normalized amount of service, they are measured in units of bits. In the special case of a fixed-rate server, the elapsed time of a backlogged period is also a measure of the service provided by the server; therefore, virtual times can also be measured in units of seconds. The exact algorithms for computing virtual times are shown in Table I, where $B(t)$ is the set of backlogged queues at time $t$, $\beta(t)$ is the beginning of the system backlogged period that includes $t$, $r(t)$ is the server rate at time $t$, and $a_i^k$ and $L_i^k$ are the arrival time and the length of packet $p_i^k$, respectively. Notice that the definition of virtual times in units of bits is more general, and is applicable to both fixed-rate and variable-rate servers.

TABLE I: CALCULATION OF VIRTUAL TIMES

TABLE II: NOTATION USED IN SECTION III

Based on Property 1, for all packets present in the packet system at time $t$, their relative finish order in GPS is the same as the relative order of their virtual finish times. Therefore, WFQ can be implemented by the “Smallest virtual Finish time First” (SFF) policy: when the server selects the next packet for service at time $t$, it picks the packet with the smallest virtual finish time. An important advantage of this virtual-time-function-based implementation is that the virtual finish time of a packet can be computed at the packet arrival time and need not be recomputed even if the set of backlogged sessions changes in the future. The system virtual time function, though, does need to be recomputed when the set of backlogged sessions changes. Since there can be $N$ sessions that become backlogged or unbacklogged during an arbitrarily small interval, the worst-case complexity of computing $V_{GPS}$ is $O(N)$ [7]. It is possible to have other Packet Fair Queueing (PFQ) algorithms based on virtual time functions with lower worst-case complexity [7], [11], [17]. Later in this paper, we will propose a more accurate virtual time function with a worst-case complexity of $O(\log N)$. Therefore, by exploiting Property 1 of GPS, it is possible to design virtual-time-function-based PFQ algorithms that have an overall complexity of $O(\log N)$.
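The SFF policy can be sketched in a few lines of Python. The sketch below is not the algorithm of Table I: it assumes the standard tagging rule (virtual start time equal to the larger of the previous finish tag and the virtual time at arrival, virtual finish time equal to the start tag plus packet length divided by the share) and covers only the special case in which every packet is queued at time 0, so the system virtual time never has to be advanced while tagging. The function name `sff_order`, the share values, and the tie-break by session number are all illustrative assumptions.

```python
import heapq
from fractions import Fraction

def sff_order(shares, backlog, length=1):
    """Return (F, session, k) tuples in Smallest-virtual-Finish-time-First order.

    Assumes every packet is already queued at time 0, so V(arrival) = 0.
    """
    heap = []
    for session, count in backlog.items():
        finish = Fraction(0)                      # finish tag of the previous packet
        for k in range(1, count + 1):
            start = max(finish, Fraction(0))      # S = max(F_prev, V(arrival))
            finish = start + Fraction(length) / shares[session]  # F = S + L / phi
            # Ties on F are broken by session number (an assumption of this sketch).
            heapq.heappush(heap, (finish, session, k))
    return [heapq.heappop(heap) for _ in range(len(heap))]

# Hypothetical example: three sessions, unit-length packets.
shares = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 4)}
order = sff_order(shares, {1: 3, 2: 1, 3: 1})
for finish, session, k in order:
    print(f"session {session} packet {k}: F = {finish}")
```

Because each tag is computed once, when the packet is enqueued, selecting the next packet reduces to a priority-queue operation; the cost that remains in a full implementation is maintaining the system virtual time itself.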

B. H-GPS

An H-GPS server can be represented by a tree with a positive number $\phi_n$ associated with each node $n$. The root node, denoted by $R$, corresponds to the physical link and each leaf node corresponds to a session with a queue of packets. A nonleaf node is called backlogged if at least one of its leaf descendant nodes is backlogged. Let $W_i(t_1, t_2)$ be the amount of session $i$ traffic served in the interval $[t_1, t_2]$, and $W_n(t_1, t_2) = \sum_{i \in \mathrm{leaf}(n)} W_i(t_1, t_2)$, where $\mathrm{leaf}(n)$ is the set of the leaf descendant nodes for node $n$. Also, for any node $n$, let $p(n)$ and $\mathrm{child}(n)$ denote its parent node and set of child nodes, respectively. A work-conserving H-GPS server is defined as one for which

$$W_n(t_1, t_2) = \frac{\phi_n}{\sum_{m \in B_{p(n)}(t_1)} \phi_m} \, W_{p(n)}(t_1, t_2) \tag{5}$$

holds for any interval $[t_1, t_2]$ during which node $n$ is continuously backlogged and $B_{p(n)}(t)$, the set of backlogged child nodes of $p(n)$ at time $t$, does not change. It immediately follows that

$$\frac{W_m(t_1, t_2)}{W_n(t_1, t_2)} = \frac{\phi_m}{\phi_n} \tag{6}$$

holds for any interval $[t_1, t_2]$ during which two sibling nodes $m$ and $n$ are continuously backlogged.

Assuming $\phi_R = 1$ and $\sum_{m \in \mathrm{child}(n)} \phi_m = \phi_n$, it can be shown that (3) holds also for H-GPS. Therefore, H-GPS can provide the same minimum bandwidth and delay bound guarantees for each session as GPS. The major difference between GPS and H-GPS is that while (2) holds for any two queues in GPS, (6) holds only for sibling nodes in H-GPS. Since in H-GPS packet queues are associated with only leaf nodes, the bandwidth is not always distributed to all queues in proportion to their service shares as in GPS. When a session cannot fully utilize its share of the service, the excess service is distributed according to the hierarchy.

H-GPS is also defined in a fluid system and therefore needs to be approximated by a packet algorithm. While it is possible to design practical packet approximation algorithms for GPS based on a single system virtual time function, this is not the case with H-GPS. The main reason is that Property 1 does not hold for H-GPS, i.e., with H-GPS, the relative order of packet finish times depends on future arrivals.

Consider the example where the root of H-GPS has two children A and B with service shares of 0.8 and 0.2, respectively. Node B is a leaf node while node A has two child leaf nodes A1 and A2 with service shares of 0.75 and 0.05. Let the link speed be 1 and all packets have the same length of 1. At time 0, A1 has an empty queue, while A2 and B have many packets queued. Thus, A2 and B will have 80% and 20% of the link bandwidth, respectively. With the assumption that there are no future arrivals, the finish times in the H-GPS server are 1.25, 2.5, 3.75, … for A2 packets, and 5, 10, 15, … for B packets. Therefore, at time 0, the relative order of packets places the first three A2 packets (finishing at 1.25, 2.5, and 3.75) ahead of the first B packet (finishing at 5), and so on. Now assume that a sequence of A1 packets arrives at time 1. According to the bandwidth distribution hierarchy, the bandwidth shares for A1, A2, and B will be 75%, 5%, and 20%, respectively. While this will not affect the finish times for session B packets, it does affect the finish times for session A2 packets. During the interval [0, 1], only 80% of the first packet of session A2 has been served. The remaining 20% of the first packet and all the remaining packets will be served at 5% of the link rate. Therefore, the finish times of the A2 packets become 5, 25, 45, 65, …. That is, the relative ordering between session A2 and B packets has changed after the arrival of session A1's packets.
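The finish times in this example can be checked with a short Python sketch. The helper `finish_times` is hypothetical: it takes a piecewise-constant service rate for one session and returns the fluid finish times of its back-to-back unit-length packets.

```python
from fractions import Fraction

def finish_times(rate_schedule, count, length=1):
    """Finish times of `count` back-to-back packets of one session.

    rate_schedule: list of (start_time, rate) pairs sorted by start_time;
    each rate holds until the next start_time (the last one holds forever).
    """
    times, served, target = [], Fraction(0), Fraction(length)
    for idx, (start, rate) in enumerate(rate_schedule):
        end = rate_schedule[idx + 1][0] if idx + 1 < len(rate_schedule) else None
        now = Fraction(start)
        while len(times) < count:
            need = (target - served) / rate       # time to finish current packet
            if end is not None and now + need > end:
                served += (end - now) * rate      # partial service, then rate changes
                break
            now += need
            times.append(now)
            served = Fraction(0)
    return times

# A2 gets 80% of the link while A1 is idle, and 5% once A1 is backlogged at time 1.
a2_alone = finish_times([(0, Fraction(4, 5))], 3)
a2_after_a1 = finish_times([(0, Fraction(4, 5)), (1, Fraction(1, 20))], 4)
b_packets = finish_times([(0, Fraction(1, 5))], 3)

print([float(t) for t in a2_alone])     # A2 with no future arrivals
print([float(t) for t in a2_after_a1])  # A2 when A1 arrives at time 1
print([float(t) for t in b_packets])    # B in both cases
```

Without the A1 arrival, the second A2 packet (2.5) finishes before the first B packet (5); with it, the second A2 packet (25) finishes after the first B packet, which is the order change the example describes.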

In a GPS system, during any period in which two sessions are both backlogged, the ratio between the services they receive is a constant regardless of future packet arrivals of other sessions. In an H-GPS system, this ratio is affected by other sessions in the hierarchy. This is the fundamental reason why Property 1 does not hold for H-GPS.

Without the relative-packet-order-invariant property, the concept of packet virtual finish times is not applicable. Therefore, the technique based on a single system virtual time function to approximate GPS does not apply to H-GPS.

III. H-PFQ

In this paper, we propose to approximate H-GPS by using one-level PFQ servers as basic building blocks and organizing them in a hierarchical structure. We call the resulting algorithms Hierarchical Packet Fair Queueing (H-PFQ).

A PFQ server node in a hierarchy differs from a standalone PFQ server in two aspects: it is a variable-rate server and the queues it serves do not have to be FIFO. As we discussed in the previous section, both GPS and WFQ are properly defined as variable-rate servers. In fact, by computing the system virtual time and per-packet virtual start/finish times in units of bits, most of the PFQ algorithms proposed in the literature [1], [7], [11], [13], [17] are variable-rate servers. Therefore, for PFQ nodes in an H-PFQ hierarchy, the virtual times should be measured in units of bits.

The second difference between a standalone server and a server node in a hierarchy is that a standalone server serves per-session FIFO queues, whereas a server node serves per-subtree logical queues that are not necessarily FIFO. A number of operations in the implementation of PFQ servers need to use packets from the head of each queue. While it is obvious which packet is at the head of a FIFO queue, we need to define the head packet for the logical queue that is associated with a child subtree.

In the following, we present an implementation framework of H-PFQ where an internal server node can be any PFQ algorithm that is properly defined as a variable-rate server. The main data structure is a tree representation of the hierarchy. The root node represents the physical link and a leaf node represents a physical queue. Each nonroot node $n$ is connected to its parent by a logical queue $Q_n$. For the parent node to implement a PFQ algorithm, only the head of the logical queue is needed. Therefore, at any given time, only the reference to the packet that is the head of the logical queue is stored in queue $Q_n$. The actual packet remains stored in the real queue at the leaf node until the link finishes transmission of the packet. For consistency, we also define $Q_R$ for the root server to be the packet currently being transmitted. At any given time when the server is busy, there exists a path from a leaf to the root such that the logical queues of all nodes traversed by the path point to the same physical packet that is currently being transmitted. The logical queues and associated data structures at each node are updated when a packet arrives at an empty session queue at the leaf node, or when the link is picking the next packet to transmit. In the following, we present the pseudocode to describe the details of the algorithm.

ARRIVE(i, Packet)
    ENQUEUE(Queue_i, Packet)
    if Q_i ≠ ∅
        then return
    Q_i ← Packet
    s_i ← max(f_i, V_{p(i)})
    f_i ← s_i + L / φ_i
    if ¬Busy_{p(i)}
        then RESTART-NODE(p(i))

When a packet arrives at a leaf node $i$, if session $i$'s logical queue $Q_i$ for its parent node is not empty, the packet is just appended to the end of the physical FIFO queue for the session. Otherwise, the packet also becomes the head of the logical queue $Q_i$. The virtual start and finish times for the logical queue are then updated, and the procedure Restart-Node() is called with the parent node if it is currently idle.

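RESTART-NODE(n)
    m ← SELECT-NEXT(n)
    if m ≠ ∅
        then Q_n ← Q_m
            …
            if Busy_n
                then s_n ← f_n
                else s_n ← max(f_n, V_{p(n)})
            f_n ← s_n + L / φ_n
            Busy_n ← TRUE
            UPDATE-V(n)
        else Q_n ← ∅
            Busy_n ← FALSE
            …
    if n ≠ R and Q_{p(n)} = ∅
        then RESTART-NODE(p(n))
    if n = R and Q_n ≠ ∅
        then TRANSMIT-PACKET-TO-LINK(Q_n)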

A node is restarted whenever it needs to select a new packet to transmit. This occurs either when a packet arrives at an idle node or when the last packet finishes being transmitted on the physical link. If a packet arrives at an idle node $n$, the $Busy_n$ flag will be FALSE, in which case the new start time for the node is computed using $\max(f_n, V_{p(n)})$. If the node is not idle and has a packet to transmit, the new start time will be set to the previous finish time. If the node has no more packets to send, the busy flag will be cleared. If the current node is not the root node, and its parent node does not have a packet in its logical queue, the node will restart its parent node. If the current node is the root node and there is a packet in the queue, the packet will be transmitted over the link. Different PFQ algorithms have different packet selection and system virtual time updating algorithms. For example, SFQ uses the Smallest virtual Start time First (SSF) policy and updates the system virtual time based on the virtual start time of the packet currently being served, whereas SCFQ uses the Smallest virtual Finish time First (SFF) policy and updates the system virtual time based on the virtual finish time of the packet currently being served. These algorithms are implemented in the functions Select-Next and Update-V.

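RESET-PATH(n)
    Q_n ← ∅
    if n is a leaf
        then …
            DEQUEUE(Queue_n)
            if Queue_n ≠ ∅
                then Q_n ← HEAD(Queue_n)
                    s_n ← f_n
                    f_n ← s_n + L / φ_n
            …
            RESTART-NODE(p(n))
        else m ← ActiveChild_n
            …
            RESET-PATH(m)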

When the link finishes transmitting a packet, it calls Reset-Path(R). Reset-Path descends the tree along the path to theleaf node whose packet just finished transmission. At eachnode along the path, it resets the logical queue to be empty.When the leaf node is reached, the first packet of the queue isdequeued and its parent node is restarted. During the descent,all pointers are cleared, but not the busy flags. During theprocess of picking a new packet, the busy flag acts as areminder to the Restart-Node function that a packet has justfinished transmission. If there are no more packets for thisnode to send, Restart-Node will clear the busy flag.
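The three procedures can be put together in a small runnable Python sketch. It follows the prose description above rather than the exact pseudocode, and several details are assumptions of the sketch: the field names, the use of each node's share directly in the finish-time update, SFF as the Select-Next policy, and an SCFQ-style Update-V that sets a node's virtual time to the finish tag of the selected child. The list `sent` stands in for the physical link.

```python
from collections import deque
from fractions import Fraction

class Node:
    def __init__(self, name, share, parent=None):
        self.name, self.share, self.parent = name, Fraction(share), parent
        self.children = []
        self.fifo = deque()            # physical FIFO queue (leaves only)
        self.head = None               # logical queue Q_n: head packet or None
        self.s = self.f = Fraction(0)  # virtual start/finish time of the head
        self.v = Fraction(0)           # system virtual time of this node's server
        self.busy = False
        self.active = None             # child whose packet was selected
        if parent is not None:
            parent.children.append(self)

sent = []                              # stands in for the physical link

def tag(n, length, continuing):
    """Update the virtual start/finish times of n's logical queue head."""
    v_parent = n.parent.v if n.parent is not None else Fraction(0)
    n.s = n.f if continuing else max(n.f, v_parent)
    n.f = n.s + Fraction(length) / n.share

def arrive(leaf, packet):
    leaf.fifo.append(packet)
    if leaf.head is not None:          # session already has a head packet
        return
    leaf.head = packet
    tag(leaf, packet[1], continuing=False)
    if not leaf.parent.busy:
        restart_node(leaf.parent)

def restart_node(n):
    ready = [c for c in n.children if c.head is not None]
    if ready:
        m = min(ready, key=lambda c: c.f)   # Select-Next: SFF (assumption)
        n.active, n.head = m, m.head
        tag(n, n.head[1], continuing=n.busy)
        n.busy = True
        n.v = m.f                           # Update-V: SCFQ-style (assumption)
    else:
        n.active, n.head, n.busy = None, None, False
    if n.parent is not None and n.parent.head is None:
        restart_node(n.parent)
    if n.parent is None and n.head is not None:
        sent.append(n.head)                 # "transmit" the packet

def reset_path(n):
    n.head = None
    if not n.children:                      # leaf: drop the packet just sent
        n.fifo.popleft()
        if n.fifo:
            n.head = n.fifo[0]
            tag(n, n.head[1], continuing=True)
        restart_node(n.parent)
    else:
        reset_path(n.active)

# Hierarchy of the Section II example; packets are (session, length) tuples.
root = Node("R", 1)
a, b = Node("A", Fraction(4, 5), root), Node("B", Fraction(1, 5), root)
a1, a2 = Node("A1", Fraction(3, 4), a), Node("A2", Fraction(1, 20), a)
for _ in range(8):
    arrive(a2, ("A2", 1))
for _ in range(2):
    arrive(b, ("B", 1))
while root.head is not None:                # link finishes one packet per turn
    reset_path(root)
print([name for name, _ in sent])
```

With A1 idle, subtree A receives roughly four transmissions for each one given to B, in line with the 0.8 and 0.2 shares at the root. Swapping in a different selection policy or virtual time update only changes the two lines marked as assumptions in `restart_node`.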

IV. DELAY ANALYSIS OF H-PFQ

In the previous section, we presented an algorithm to implement H-PFQ by integrating one-level PFQ's into a hierarchy. While most of the PFQ algorithms proposed in the literature can be used for this purpose, the delay bounds provided by the resulting H-PFQ servers can vary significantly with different PFQ algorithms.

In this section, we first give an example to show that with most of the PFQ algorithms proposed in the literature, the resulting H-PFQ servers provide much larger delay bounds than those provided by H-GPS. We then present the concept of Worst-case Fair Index (WFI) and demonstrate that the delay bounds provided by an H-PFQ server relate not only to the delay bounds provided by PFQ server nodes in the hierarchy, but also to the WFI's provided by the PFQ servers. In particular, in order to achieve tight delay bounds for H-PFQ, PFQ server nodes in the hierarchy need to have small WFI's. WF²Q is the only algorithm proposed in the literature that provides tight WFI's; however, it has a relatively high complexity. We propose a new algorithm called WF²Q+ that not only provides tight delay bounds and low WFI's, but also has a relatively low complexity.


A. Limitation of Existing PFQ Algorithms

In [1], the following example is used to illustrate the large discrepancies between the services provided by GPS and WFQ. Assume that there are 11 sessions with packet size of 1 sharing a link with the speed of 1, $\phi_1 = 0.5$, and $\phi_i = 0.05$, $i = 2, \ldots, 11$. Session 1 sends 11 back-to-back packets starting at time 0 while each of the other 10 sessions sends only one packet at time 0. If the server is GPS, it will take 2 time units to transmit each of the first 10 packets of session 1, one time unit to transmit the 11th packet, and 20 time units to transmit the first packet from each of the other sessions. Denote the $k$th packet of session $i$ by $p_i^k$; then in the GPS system, the finish time is $2k$ for $p_1^k$, $k = 1, \ldots, 10$, 21 for $p_1^{11}$, and 20 for $p_i^1$, $i = 2, \ldots, 11$. Under WFQ, packets will be transmitted according to their finish times in the GPS system. Therefore, the first 10 packets of session 1 ($p_1^k$, $k = 1, \ldots, 10$) will be transmitted, followed by one packet from each of sessions $2, \ldots, 11$ ($p_i^1$, $i = 2, \ldots, 11$), and then the 11th packet of session 1 ($p_1^{11}$). In the example, between time 0 and 10, WFQ serves 10 packets from session 1 while GPS serves only 5. After such a period, WFQ needs to serve other sessions in order for them to catch up. Intuitively, the difference between the amounts of service provided to each session by WFQ and GPS is a measure of the inaccuracy of WFQ in approximating GPS. In this case, the inaccuracy is $N/2$ packets, where $N$ is the number of sessions sharing the link.
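The GPS finish times quoted in this example can be reproduced with a small event-driven fluid simulation. The following Python sketch is illustrative only: the function name `gps_finish_times` is hypothetical, and it handles just the case used here, in which every packet is queued at time 0.

```python
from fractions import Fraction

def gps_finish_times(shares, backlog, length=1, rate=1):
    """Fluid GPS finish times when every packet is queued at time 0."""
    remaining = {i: Fraction(length) for i, n in backlog.items() if n > 0}
    left = dict(backlog)                   # packets not yet finished, per session
    now, finished = Fraction(0), {i: [] for i in backlog}
    while remaining:
        total = sum(shares[i] for i in remaining)
        speed = {i: Fraction(rate) * shares[i] / total for i in remaining}
        step = min(remaining[i] / speed[i] for i in remaining)  # next completion
        now += step
        for i in list(remaining):
            remaining[i] -= step * speed[i]
            if remaining[i] == 0:
                finished[i].append(now)
                left[i] -= 1
                if left[i] > 0:
                    remaining[i] = Fraction(length)   # start the next packet
                else:
                    del remaining[i]                  # session leaves the backlog
    return finished

shares = {1: Fraction(1, 2), **{i: Fraction(1, 20) for i in range(2, 12)}}
backlog = {1: 11, **{i: 1 for i in range(2, 12)}}
gps = gps_finish_times(shares, backlog)

print([int(t) for t in gps[1]])   # session 1
print(int(gps[2][0]))             # first packet of session 2
```

Counting the session 1 finish times that are at most 10 gives 5, which is the number of session 1 packets GPS has served by time 10, against the 10 packets that WFQ has transmitted by then.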

Such an inaccuracy introduced by WFQ will significantly affect the delay bound provided by H-WFQ. Consider the example with the link sharing structure in Fig. 1(a) and the packet arrival sequence in Fig. 2. Assume that WFQ is used instead of GPS and that the first 10 packets of class A1 belong to the best-effort subclass and the 11th packet belongs to the real-time subclass. Even though the real-time subclass of A1 reserves 30% of the link bandwidth, when a real-time packet arrives, it may still have to wait 10 packet transmission times. Now consider the example where there are 1001 classes sharing a 100 Mbps link with a maximum packet size of 1500 bytes. For a real-time session reserving 30% of the link bandwidth, its packet may be delayed by 120 ms in just one hop! In contrast, if GPS or H-GPS is used, the worst-case delay for a packet arriving at an empty A1 real-time queue is 0.4 ms. Similar examples can be constructed for SCFQ [7], SFQ [11], and FBFQ [17].
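The two delay figures follow from simple arithmetic, sketched below in Python. The sketch assumes, by analogy with the 11-session example, that the 120 ms case corresponds to one maximum-size packet from each of the 1000 other classes being transmitted ahead of the real-time packet; the variable names are illustrative.

```python
from fractions import Fraction

LINK_BPS = 100_000_000          # 100 Mbps link
PACKET_BITS = 1500 * 8          # maximum packet size in bits
OTHER_CLASSES = 1000            # 1001 classes minus the real-time one (assumption)

# Packet case: one maximum-size packet from every other class goes first.
wfq_wait_s = Fraction(OTHER_CLASSES * PACKET_BITS, LINK_BPS)

# Fluid case: the packet is served at its reserved 30% of the link immediately.
gps_delay_s = Fraction(PACKET_BITS) / (Fraction(3, 10) * LINK_BPS)

print(float(wfq_wait_s * 1000), "ms")
print(float(gps_delay_s * 1000), "ms")
```

The ratio between the two figures is 300, and the packet-case wait grows linearly with the number of classes while the fluid-case delay does not depend on it.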

B. WFI and Its Effect on Delay Bounds of H-PFQ

In [1], we introduce a metric called Worst-case Fair Index (WFI) to characterize PFQ servers. In this section, we will develop analysis techniques to show that the delay bounds provided by an H-PFQ server relate not only to the delay bounds provided by the PFQ server nodes in the hierarchy, but also to the WFI’s provided by those PFQ servers. In particular, in order to achieve tight delay bounds for H-PFQ, PFQ server nodes in the hierarchy need to have small WFI’s.

Definition 1: A server $s$ is said to guarantee a Time Worst-case Fair Index (T-WFI) of $A_{i,s}$ for session $i$ if, for any time $\tau$, the delay of a packet arriving at $\tau$ is bounded above by $Q_i(\tau)/r_i + A_{i,s}$, that is,

$$d_i^k - a_i^k \le \frac{Q_i(a_i^k)}{r_i} + A_{i,s} \tag{7}$$

where $r_i$ is the rate guaranteed to session $i$, $Q_i(\tau)$ is the number of bits in the session queue at time $\tau$ (including the packet that arrives at time $\tau$), and $a_i^k$ and $d_i^k$ are the arrival and departure times of the $k$th packet of session $i$, respectively.

Fig. 2. WFQ, SFQ, and WF2Q.

For the purpose of this paper, a packet is said to arrive or leave the server if its last bit arrives or leaves the server. Intuitively, the T-WFI $A_{i,s}$ represents the maximum time a packet coming to an empty queue needs to wait before receiving its guaranteed service rate. An important observation is that both GPS and H-GPS have a WFI of 0. That is, with GPS or H-GPS, a packet coming to an empty queue can receive its guaranteed service rate immediately after its arrival. However, as illustrated in the example in Fig. 2, the T-WFI for WFQ can increase linearly as a function of the number of sessions.

Since the previous definition of WFI applies only to a standalone server, which is fixed-rate and has only one level, we introduce a general definition of WFI that applies also to server nodes in a hierarchy, which are variable rate servers. Again, as in Sections II and III, we generalize the definition by measuring WFI in units of bits instead of seconds.

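Definition 2: A server node $s$ is said to guarantee a Bit Worst-case Fair Index (B-WFI) of $\alpha_{i,s}$ for session $i$ if, for any packet $p_i^k$, the following holds:

$$W_i(t_1, d_i^k) \ge \frac{\phi_i}{\phi_s} W_s(t_1, d_i^k) - \alpha_{i,s} \tag{8}$$

where $d_i^k$ is the time $p_i^k$ departs the server, $t_1$ is any time instant such that $t_1 < d_i^k$ and session $i$ is continuously backlogged during $[t_1, d_i^k]$, and $\phi_i/\phi_s$ is the service share guaranteed to queue $i$ by server $s$. Here $W_i(t_1, t_2)$ and $W_s(t_1, t_2)$ denote the amounts of service received by session $i$ and provided by server $s$ during $[t_1, t_2]$.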

For a constant rate one-level server, $W_s(t_1, d_i^k) = r (d_i^k - t_1)$, and $r_i = (\phi_i/\phi_s) r$. Therefore, (8) is equivalent to

$$W_i(t_1, d_i^k) \ge r_i (d_i^k - t_1) - \alpha_{i,s} \tag{9}$$

In this case, Definition 2 subsumes Definition 1 and $A_{i,s} = \alpha_{i,s}/r_i$ holds. This can be easily established by letting $t_1 = a_i^k$ and using the following property for an FIFO queue:

$$W_i(a_i^k, d_i^k) = Q_i(a_i^k) \tag{10}$$

Before we proceed to establish the relationship between the delay bound of an H-PFQ server and the WFI’s of PFQ server nodes, we first give the following definition of guaranteed service burstiness index (SBI), which is a generalized bounded delay property that applies to both constant-rate and variable-rate servers.

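Definition 3: A server $s$ is said to guarantee a service burstiness index (SBI) of $\gamma_{i,s}$ to session $i$ if, for any packet $p_i^k$, there exists a time instant $t_1$, at the start of a session $i$ backlogged period and within the server’s busy period that also includes $d_i^k$, such that

$$W_i(t_1, d_i^k) \ge \frac{\phi_i}{\phi_s} W_s(t_1, d_i^k) - \gamma_{i,s} \tag{11}$$

holds, where $d_i^k$ is the time $p_i^k$ departs the server and $\phi_i/\phi_s$ is the service share guaranteed to queue $i$ by server $s$.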

The definition of SBI has its root in the guaranteed service curve concept proposed by Cruz [3]. However, there are several differences. First, SBI is applicable to both constant-rate and variable-rate servers, while the guaranteed service curve is defined only for constant-rate servers. Second, for SBI, we consider only intervals that end at packet departure times, while Cruz considers intervals that end at arbitrary time instants. Since both SBI and the guaranteed service curve are used to reason about a session’s delay property, considering only time intervals that end at packet departure times will result in a tighter bound. Finally, in the definition of SBI, we require that the interval endpoints $t_1$ and $d_i^k$ be within the same system busy period (but not necessarily in the same session $i$ backlogged period), while Cruz’s definition does not have such a requirement. Therefore, SBI represents a stronger guarantee than the guaranteed service curve. However, it can be shown that all the analysis and results in [3] are applicable with our definition of SBI. Intuitively, for any work-conserving queueing system (including PFQ systems), system busy periods are invariant with respect to the scheduling policy used, and therefore can be independently analyzed.

While the definitions of WFI and SBI look similar, worst-case fairness is a stronger property than bounded service burstiness. In the case of the worst-case fair property, (8) needs to hold for all intervals that end with the packet departure time $d_i^k$ and during which session $i$ is continuously backlogged. In the case of the guaranteed service burstiness property, (11) needs to hold for only one interval that ends at $d_i^k$ and starts at the beginning of a session’s backlogged period. By letting $t_1$ be the start of the backlogged period that includes $d_i^k$, it immediately follows that a session’s guaranteed WFI is also the session’s guaranteed SBI. The opposite is not always true. For example, with WFQ, the guaranteed SBI for any session is . This is much smaller than the guaranteed WFI value, which can be as large as .

In the following lemma, we establish the relationship between the guaranteed SBI and the guaranteed delay bound to a leaky bucket constrained session. A session $i$ is constrained by a leaky bucket $(\sigma_i, r_i)$ if the following holds for any interval $[\tau, t]$:

$$A_i(\tau, t) \le \sigma_i + r_i (t - \tau) \tag{12}$$

where $A_i(\tau, t)$ is the amount of session $i$ bits arrived during $[\tau, t]$.

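Lemma 1: Consider session $i$ that is leaky bucket constrained by $(\sigma_i, r_i)$. If a standalone server with a constant rate $r$ guarantees an SBI of $\gamma_{i,s}$ to session $i$, it can guarantee a delay bound of

$$\frac{\sigma_i}{r_i} + \frac{\gamma_{i,s}}{r_i}$$

where $r_i = (\phi_i/\phi_s) r$.

Proof: For a constant rate server, $W_s(t_1, t_2) = r (t_2 - t_1)$ holds during any system backlogged period. Since the server guarantees an SBI to session $i$, there exists a time instant $t_1$, at the start of a session $i$ backlogged period and in the same system busy period as $d_i^k$, such that

$$W_i(t_1, d_i^k) \ge r_i (d_i^k - t_1) - \gamma_{i,s} \tag{13}$$

Since session $i$ has an FIFO queue and its queue is empty just before $t_1$, we have

$$W_i(t_1, d_i^k) = A_i(t_1, a_i^k) \tag{14}$$

In addition, session $i$ is leaky bucket constrained, thus

$$A_i(t_1, a_i^k) \le \sigma_i + r_i (a_i^k - t_1) \tag{15}$$

Combining (13)–(15), we have

$$\sigma_i + r_i (a_i^k - t_1) \ge r_i (d_i^k - t_1) - \gamma_{i,s} \tag{16}$$

Rearranging terms and dividing both sides by $r_i$, we have

$$d_i^k - a_i^k \le \frac{\sigma_i}{r_i} + \frac{\gamma_{i,s}}{r_i} \tag{17}$$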

Q.E.D.

For most rate-based service disciplines [20], if the server guarantees a delay bound of $\sigma_i/r_i + \gamma_{i,s}/r_i$ to a leaky bucket constrained session, it also guarantees an SBI of $\gamma_{i,s}$ to the session [3]. Therefore, the bounded service burstiness property and the bounded delay property are equivalent for standalone rate-based servers. Since the bounded service burstiness property applies also to variable rate servers, it can be viewed as the generalized bounded delay property.

From Lemma 1, it immediately follows that a bounded WFI for a session also implies a bounded delay. However, the delay bound calculated from the WFI may not be tight in some cases. For example, while the tight delay bound for a leaky bucket constrained session in a WFQ server is

$$\frac{\sigma_i}{r_i} + \frac{L_{\max}}{r}$$

the delay bound based on the WFI can be

which is much larger. Intuitively, WFI is the maximum amount of time a packet has to wait to receive its fair share service when it comes to an empty queue. The reason that a packet may have to wait for a long time is that some packets related to it have received more service than deserved in a previous time period. In the case of a standalone server, these packets must belong to the same session. In the case of a hierarchical server, these packets may belong to sessions that share an ancestor node with session $i$. Therefore, WFI does not bound delay tightly in the case of a standalone server, since it does not take into account the fact that packets from the same session may have received more service in a previous time period. However, WFI is important in characterizing the delay in a hierarchical server, since the extra service received in the previous time period may have been received by a session other than the one being considered.

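Now that we have defined WFI and SBI that are applicable to both standalone servers and server nodes in a hierarchy, we are ready to derive the WFI’s and delay bounds provided by an H-PFQ server. For a session $i$ with $H$ ancestors in an H-PFQ server, we use $p(i)$ to represent its parent node and $p^h(i)$ to represent the parent node of $p^{h-1}(i)$ for $h = 1, \ldots, H$, where $p^0(i) = i$, $p^1(i) = p(i)$, and $p^H(i) = R$ is the root node.

Theorem 1: For a session $i$ with $H$ ancestors in an H-PFQ server, it is guaranteed the following B-WFI:

$$\alpha_{i,\text{H-PFQ}} = \sum_{h=0}^{H-1} \frac{\phi_i}{\phi_{p^h(i)}} \alpha_{p^h(i)} \tag{18}$$

where $\alpha_{p^h(i)}$ is the B-WFI guaranteed to the logical queue at node $p^h(i)$ by the server node $p^{h+1}(i)$, $h = 0, \ldots, H-1$.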

The proof is given in Appendix A. Basically, the theorem states that the WFI provided to a session by an H-PFQ server is the weighted sum of the WFI’s of all the session’s ancestor servers.
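The weighted sum can be sketched in a few lines. This is an illustration rather than the authors' code: it assumes that the weight applied to the B-WFI at level $h$ is $\phi_i/\phi_{p^h(i)}$, the fraction of that ancestor's service guaranteed to session $i$, and the function name and sample numbers are hypothetical.

```python
def hierarchical_b_wfi(phi_session, levels):
    """Weighted sum of per-level B-WFI values along a session's ancestor path.

    levels: list of (phi_node, alpha_bits) for nodes p^h(i), h = 0..H-1,
            starting with the session itself (h = 0). alpha_bits is the
            B-WFI that the parent server node guarantees to that node.
    Assumption: the weight of level h is phi_session / phi_node.
    """
    return sum(phi_session / phi_node * alpha for phi_node, alpha in levels)


# Hypothetical two-level path: a session with share 0.3 under a class with
# share 0.5, each level contributing a B-WFI of one 1500-byte packet.
packet_bits = 1500 * 8
total = hierarchical_b_wfi(0.3, [(0.3, packet_bits), (0.5, packet_bits)])
print(round(total), "bits")
```

The session's own level always has weight one; higher levels are discounted by the session's share within them.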

Since a bounded WFI also implies a bounded delay, Corollary 1 immediately follows.

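Corollary 1: For a session $i$ with $H$ ancestors in an H-PFQ server, if it is constrained by a leaky bucket $(\sigma_i, r_i)$, the delay of any packet in the session is bounded by

$$\frac{\sigma_i}{r_i} + \frac{1}{r_i} \sum_{h=0}^{H-1} \frac{\phi_i}{\phi_{p^h(i)}} \alpha_{p^h(i)} \tag{19}$$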

While Corollary 1 gives the delay bound for a leaky bucket constrained session in an H-PFQ server, the bound is not the tightest, as it does not account for the situation where packets from the same session received more service in a previous time period. The following theorem provides a tighter bound.

Theorem 2: For a session $i$ with $H$ ancestors in an H-PFQ server, if it is constrained by a leaky bucket $(\sigma_i, r_i)$, the delay of any packet in the session is bounded by

$$\frac{\sigma_i}{r_i} + \frac{\gamma_{i,p(i)}}{r_i} + \frac{1}{r_i} \sum_{h=1}^{H-1} \frac{\phi_i}{\phi_{p^h(i)}} \alpha_{p^h(i)} \tag{20}$$

where $\alpha_{p^h(i)}$ is the B-WFI guaranteed to the logical queue at node $p^h(i)$ by the server node $p^{h+1}(i)$, $h = 1, \ldots, H-1$, and $\gamma_{i,p(i)}$ is the SBI guaranteed to session $i$ by its parent server node, i.e., $\sigma_i/r_i + \gamma_{i,p(i)}/r_i$ is the delay bound guaranteed to session $i$ by a standalone server.

The proof of the theorem is given in Appendix B. The theorem states that the delay bound provided by an H-PFQ server to session $i$ is the sum of the delay bound provided by session $i$’s parent server to session $i$ and the WFI’s of all of the session’s other ancestor nodes weighted by their guaranteed shares. This bound is tight when the $\alpha_{p^h(i)}$’s and $\gamma_{i,p(i)}$ are tight. This can be easily shown by constructing examples as in Section IV-A. Therefore, to achieve tight delay bounds in an H-PFQ server, the WFI’s for the internal and root server nodes should be small.

V. WORST-CASE FAIR PFQ ALGORITHMS

A. WF2Q

Among all PFQ algorithms proposed in the literature, Worst-case Fair Weighted Fair Queueing (WF2Q) [1] is the only one that provides tight WFI’s.

WF2Q differs from WFQ in that it uses the “Smallest Eligible virtual Finish time First” (SEFF) policy instead of the popular SFF or SSF policies. With WF2Q, when the server picks the next packet to transmit at time $t$, rather than selecting it from among all the packets at the server as in WFQ, the server only considers the set of packets that have started service in the corresponding GPS system, and selects the packet among them that has the smallest virtual finish time. A packet is said to be eligible at time $t$ if its virtual start time is no greater than the current system virtual time.

If we consider again the example in Section IV-A, at time 0, all packets at the head of each session’s queue, $p_j^1$, $j = 1, \ldots, 11$, have started service in the GPS system. Among them, $p_1^1$ has the smallest finish time in GPS, so it will be transmitted first in WF2Q. At time 1, there are still 11 packets at the head of the queues: $p_1^2$ and $p_j^1$, $j = 2, \ldots, 11$. Although $p_1^2$ has the smallest virtual finish time, it will not start service in GPS until time 2; therefore, it will not be eligible for transmission at time 1. The other 10 packets have all started service at time 0 in the GPS system; thus, they are eligible, and one of them will be transmitted. At time 2, $p_1^2$ becomes eligible and has the smallest finish time among all packets; thus, it will be transmitted next. The service order for all packets under WF2Q is shown as the last time line in Fig. 2. As can be seen in the example, during any time interval, the difference between the amounts of bits transmitted by GPS and WF2Q is less than one packet size. Therefore, WF2Q is a more accurate approximation of GPS than WFQ. The following theorem is proven in [1].
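The SEFF rule can be traced on this example with a small script. The sketch below is illustrative: it uses the real-time GPS start and finish times of the example (unit-size packets, unit-rate link) in place of virtual times, which is adequate here because every session stays backlogged in GPS until time 20, and it breaks ties among equal finish times by session number (an assumption).

```python
# (session j, index k) -> (GPS start time, GPS finish time) for the example.
gps = {(1, k): (2 * (k - 1), 2 * k) for k in range(1, 11)}
gps[(1, 11)] = (20, 21)
gps.update({(j, 1): (0, 20) for j in range(2, 12)})


def seff_order(gps_times):
    """Service order under Smallest Eligible virtual Finish time First."""
    pending = dict(gps_times)
    order, t = [], 0
    while pending:
        # Head-of-line packet of each session is its lowest pending index.
        heads = {}
        for j, k in pending:
            heads[j] = min(k, heads.get(j, k))
        # Eligible: packets that have already started service in GPS.
        eligible = [(pending[(j, k)][1], j, k)
                    for j, k in heads.items() if pending[(j, k)][0] <= t]
        # In this example at least one packet is always eligible.
        _, j, k = min(eligible)
        order.append((j, k))
        del pending[(j, k)]
        t += 1  # each packet takes one time unit on the link
    return order


order = seff_order(gps)
print(order[:4], "...", order[-1])
```

Session 1 now alternates with the other sessions instead of sending its first 10 packets back to back, so its service over the first 10 time units matches what GPS provides.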

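Theorem 3: 1) WF2Q is a work-conserving policy.

2) WF2Q is worst-case fair for session $i$ with the following worst-case fair index (with $L_{i,\max}$ and $L_{\max}$ the maximum packet sizes of session $i$ and of all sessions):

$$\alpha_{i,\text{WF2Q}} = L_{i,\max} + (L_{\max} - L_{i,\max}) \frac{r_i}{r} \tag{21}$$

3) For a session $i$ constrained by a leaky bucket $(\sigma_i, r_i)$, WF2Q guarantees a delay bound of

$$\frac{\sigma_i}{r_i} + \frac{L_{\max}}{r}$$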


As can be seen, the WFI provided by WF2Q is independent of the number of sessions sharing the server. In the case of $L_{i,\max} = L_{\max}$, $\alpha_{i,\text{WF2Q}}$ will simply be $L_{\max}$. Since the B-WFI for a packet system is at least one packet size, WF2Q is an optimal packet policy with respect to the worst-case fair property. In addition, since the minimum difference between a delay bound provided by a PFQ server and a GPS server is one packet transmission time, both WFQ and WF2Q provide the tightest delay bound among all PFQ algorithms.

B. WF2Q+

While WF2Q provides the tightest delay bound and smallest WFI among all PFQ algorithms, it has the same worst-case complexity of $O(N)$ as WFQ because they both need to compute $V_{\text{GPS}}(\cdot)$.

In this section, we present a new packet algorithm that provides the same delay bound and WFI as WF2Q, but with a lower complexity. Since this policy is also worst-case fair, but is simpler than WF2Q, we call it WF2Q+. WF2Q+ also uses the SEFF policy. The novel aspect of WF2Q+ is the use of a new system virtual time function $V_{\text{WF2Q+}}(\cdot)$ that achieves both low complexity and high accuracy in approximating the ideal virtual time function used in GPS. While a number of new virtual time functions have been proposed to simplify the implementation of WFQ [7], [17], they all result in PFQ algorithms with large WFI’s. The unique advantage of $V_{\text{WF2Q+}}(\cdot)$ is that the resulting WF2Q+ algorithm combines all three properties that are important for a PFQ algorithm to be used in an H-PFQ server: tight delay bound, small WFI, and low algorithmic complexity.

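With WF2Q+, the virtual time function is defined as

$$V_{\text{WF2Q+}}(t + \tau) = \max\left( V_{\text{WF2Q+}}(t) + W(t, t + \tau),\ \min_{i \in \hat{B}(t + \tau)} S_i^{h_i(t + \tau)} \right) \tag{22}$$

where $W(t, t + \tau)$ is the total amount of service provided by the server during the period $[t, t + \tau]$, $\hat{B}(t)$ is the set of sessions backlogged in the WF2Q+ system at time $t$, $h_i(t)$ is the sequence number of the packet at the head of session $i$’s queue, and $S_i^{h_i(t)}$ is the virtual start time of that packet.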

There are several noteworthy properties of $V_{\text{WF2Q+}}(\cdot)$. First, if we view the system virtual time function as a function of the amount of service provided by the server, $V_{\text{WF2Q+}}(\cdot)$ is a strictly monotonically increasing function of time with a minimum slope of one. We call this the “minimum slope property” of $V_{\text{WF2Q+}}(\cdot)$, which is important for a PFQ server to provide delay bounds to leaky bucket constrained sources that are within one packet transmission time of those provided by GPS. The virtual time function $V_{\text{GPS}}(\cdot)$, used by both WFQ and WF2Q, has this property by using the marginal service rate of the GPS server as the slope, which has a minimum value of one. Therefore, both WFQ and WF2Q can provide tight delay bounds. On the other hand, the virtual time functions used by SCFQ [7] and SFQ [11] may have a slope of 0 during certain periods, and the delay bounds provided by the resulting SCFQ and SFQ algorithms are much larger than those provided by WFQ and WF2Q. The second important property of $V_{\text{WF2Q+}}(\cdot)$, as provided by the max over min operation in (22), is that it is at least as large as the minimum virtual start time of all packets at the head of all queues. This has two implications. First, this ensures that a newly backlogged session has a virtual start time at least as large as one of the existing backlogged sessions. This is important for the resulting WF2Q+ algorithm to achieve a low WFI. In addition, the property also ensures that at least one packet in the system has a virtual start time no greater than the current system virtual time. This guarantees the resulting SEFF policy to be work-conserving, as only packets with a virtual start time no greater than the current system virtual time are eligible for transmission.
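The max over min update and its two properties can be sketched as a single function. This is a minimal illustration, not the authors' implementation: the function name and the sample values are hypothetical, and virtual time is measured in bits.

```python
def advance_virtual_time(v_now, service_bits, head_start_times):
    """One update of the WF2Q+ system virtual time.

    v_now:            system virtual time at time t
    service_bits:     W(t, t + tau), service provided since the last update
    head_start_times: virtual start times of the head-of-line packets of all
                      sessions backlogged at t + tau (may be empty)
    """
    # Minimum slope property: virtual time advances at least as fast as service.
    candidate = v_now + service_bits
    if not head_start_times:
        return candidate
    # Max over min: never fall below the smallest head-of-line start time,
    # so at least one backlogged packet is always eligible.
    return max(candidate, min(head_start_times))


# The second term dominates: virtual time jumps to the smallest start time.
jumped = advance_virtual_time(100, 50, [400, 900])
# The first term dominates: virtual time advances by the service provided.
advanced = advance_virtual_time(100, 50, [120, 900])
print(jumped, advanced)
```

In both calls the returned value is at least the smallest head-of-line start time, which is what keeps the SEFF policy work-conserving.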

To simplify the implementation, we also modify the definition of virtual start and finish times. With the old definition as in Table I, virtual start and finish times need to be maintained on a per packet basis. Usually this means stamping the values of $S_i^k$ and $F_i^k$ in the header of packet $p_i^k$. This overhead may not be acceptable for networks with small packet sizes, such as ATM networks. With the following definition, there is only one pair of $S_i$ and $F_i$ that needs to be maintained for each session $i$. Whenever a packet $p_i^k$ reaches the head of the queue, $S_i$ and $F_i$ are updated according to the following:

$$S_i = \begin{cases} F_i & \text{if } Q_i(a_i^k-) \ne 0 \\ \max(F_i, V(a_i^k)) & \text{if } Q_i(a_i^k-) = 0 \end{cases} \tag{23}$$

$$F_i = S_i + \frac{L_i^k}{r_i} \tag{24}$$

where $Q_i(a_i^k-)$ is the queue size of session $i$ just before time $a_i^k$. With this definition, the per session $S_i$ and $F_i$ are also the virtual start and finish times of the packet at the head of the session queue.
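The per session bookkeeping can be sketched as follows. This is an illustrative reading of the update rule, not the authors' code: the class and function names are hypothetical, and the session rate is expressed as a fraction of the link rate so that the tags are in bits (an assumption about units).

```python
from dataclasses import dataclass


@dataclass
class SessionTags:
    rate: float          # guaranteed rate as a fraction of the link rate (assumed unit)
    start: float = 0.0   # S_i, virtual start time of the head-of-line packet
    finish: float = 0.0  # F_i, virtual finish time of the head-of-line packet


def on_head_of_queue(tags, packet_bits, queue_was_empty, v_now):
    """Update (S_i, F_i) when a packet reaches the head of the session queue."""
    if queue_was_empty:
        # Newly backlogged session: start no earlier than the system virtual time.
        tags.start = max(tags.finish, v_now)
    else:
        # Continuously backlogged session: start where the previous packet finished.
        tags.start = tags.finish
    tags.finish = tags.start + packet_bits / tags.rate


tags = SessionTags(rate=0.5)
on_head_of_queue(tags, 1000, queue_was_empty=True, v_now=300.0)
first = (tags.start, tags.finish)
on_head_of_queue(tags, 500, queue_was_empty=False, v_now=9999.0)
second = (tags.start, tags.finish)
print(first, second)
```

Only one pair of tags is stored per session, so nothing needs to be stamped into individual packets.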

There are two major tasks associated with implementing WF2Q+: 1) computing the system virtual time function and 2) maintaining the set of eligible sessions sorted by virtual finish times. Both can be accomplished with $O(\log N)$ complexity [19].

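The delay and worst-case fairness properties of WF2Q+ are given by the following theorem.

Theorem 4: 1) WF2Q+ is work-conserving.

2) WF2Q+ is worst-case fair for session $i$ with

$$\alpha_{i,\text{WF2Q+}} = L_{i,\max} + (L_{\max} - L_{i,\max}) \frac{r_i}{r} \tag{25}$$

3) For a session $i$ constrained by a leaky bucket $(\sigma_i, r_i)$, WF2Q+ guarantees a delay bound of

$$\frac{\sigma_i}{r_i} + \frac{L_{\max}}{r}$$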

Since the proof is rather long, we will just present its outline in this paper. The full proof will appear in a followup paper. The proof is based on the theory of rate-proportional servers developed in [18].

While the definition of rate-proportional servers in [18] is based on virtual time functions measured in units of seconds, it can be easily extended to the more general definition with virtual time functions measured in units of bits. A fluid rate-proportional server with sessions is characterized by numbers , and a system virtual time function, which must satisfy the following two conditions:

(26)

(27)

where is any interval in a system backlog period, is the set of backlogged sessions in the fluid system at time , and is the virtual time function for session , which is iteratively defined as follows:

if

if

if

(28)

At any given time, the server simultaneously services all sessions that have the minimum virtual time function, proportionally to their relative service shares. Formally, during any time period in which , the set of backlogged sessions with the minimum virtual time function , is unchanged, the server services all sessions in such that

(29)

It can be shown that GPS is a special rate-proportional server with the system virtual time function . In fact, with GPS, holds for any two sessions backlogged at time , and therefore holds for any time instant .

For each fluid rate-proportional server, two PFQ algorithms can be defined based on the SFF and the SEFF packet selection policies. For example, for GPS, WFQ and WF2Q are the corresponding PFQ algorithms with the SFF and SEFF packet selection policies, respectively. By applying techniques similar to those used in [1] to prove the properties of WF2Q, the following theorem can be established.

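Theorem 5: For any rate-proportional server, its corresponding PFQ algorithm with the SEFF policy can provide to session $i$

1) a delay bound of

$$\frac{\sigma_i}{r_i} + \frac{L_{\max}}{r}$$

if the session is constrained by a leaky bucket $(\sigma_i, r_i)$, and

2) a WFI of $L_{i,\max} + (L_{\max} - L_{i,\max}) r_i / r$.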

In addition, the following theorem holds.

Theorem 6: WF2Q+ is a packet rate-proportional server with the SEFF packet selection policy.

The proof is rather long and will be presented in a follow-up paper. The main results of Theorem 4 follow directly from Theorems 5 and 6. In addition, WF2Q+ is work-conserving because there is at least one packet eligible for service during any system backlogged period.

Therefore, any PFQ algorithm that approximates a fluid rate-proportional server with the SEFF policy achieves the same worst-case fairness and bounded delay properties as WF2Q. The unique advantage of WF2Q+ is that it uses a novel virtual time function with a lower complexity.

Fig. 3. Example 1.

The corollary below, which gives the delay bound for H-WF2Q+, follows directly from Theorems 2 and 4.

Corollary 2: For a session $i$ with $H$ ancestors in an H-WF2Q+ server, if it is constrained by a leaky bucket $(\sigma_i, r_i)$ and , the delay of any packet in the session is bounded by

(30)

VI. SIMULATION EXPERIMENTS

In this section, we present simulation experiments to illustrate the bounded delay and hierarchical link-sharing properties of H-WF2Q+. For verification purposes, we had two independent implementations of all the algorithms in two different simulators. All experiments were conducted in both simulators and the results matched each other.

A. Delay Characteristics

In this subsection, we compare the packet delay distributions for a real-time session under four different H-PFQ servers: H-WF2Q+, H-WFQ, H-SFQ, and H-SCFQ. The service hierarchy is shown in Fig. 3. The rate above the node is the guaranteed service rate for the node. The value inside the node represents the node’s service share with respect to its parent node.

The real-time session being measured is the leaf node labeled RT-1. It has a guaranteed service share of 0.81 from its parent node, which translates into a guaranteed rate of 9 Mbps. Session RT-1 is a deterministic on/off source with a 25-ms on-period and a 75-ms off-period. Session RT-1 has a continuously backlogged sibling session BE-1. As a result, nodes N1, N2, and NR are also continuously backlogged. We use two additional types of background traffic: Poisson sources that are labeled PS-n and constant rate sessions that are labeled CS-n. All constant rate sessions have identical start times and a peak transmission rate equal to their guaranteed rate. They first pass through a multiplexer before they arrive at the server, so that they do not have simultaneous arrivals, but rather model the sort of packet train burst that could be sent by individual users and/or networks with high speed connections. For simplicity, we assume all sessions transmit 8-kB packets.

Fig. 4. Delay of RT-1 with uncorrelated cross traffic. (a) H-WFQ. (b) H-SCFQ. (c) H-SFQ. (d) H-WF2Q+.

We consider two scenarios: 1) uncorrelated cross traffic and 2) correlated cross traffic.

Fig. 4 shows the case when the PS-n sources are transmitting at an average of 1.5 times their guaranteed rate and the constant rate sources are not transmitting. As a result, all the PS-n sessions eventually become persistently backlogged. As we can see, while on average the delays for all four algorithms are similar, the worst-case packet delays under H-SFQ, H-SCFQ, and H-WFQ are larger than those under H-WF2Q+.

For the experiments shown in Fig. 5, everything remains the same except that the correlated constant rate sessions are turned on. As can be seen, the worst-case delay increases substantially under H-WFQ, H-SFQ, and H-SCFQ, but remains almost the same for H-WF2Q+. H-SFQ is affected most by the presence of correlated traffic. This can be understood by the following intuitive explanation. Since the packets are well spaced out for constant rate sessions, they usually arrive at an empty session queue. With SFQ, these packets will be assigned the same virtual start time as the packet currently being served, which has the smallest virtual start time among all backlogged packets. Since SFQ picks the packet with the smallest virtual start time, the newly arrived packets will be served ahead of other backlogged packets, including RT-1 packets. If there is a burst of packet arrivals into empty queues over a short interval, the system virtual time will stop advancing, and only start advancing again after these packets finish service. As a result, the delay for other packets in the system will increase. This can be seen also from the example illustrated in Fig. 2.

B. Hierarchical Link Sharing

We consider the link-sharing structure shown in Fig. 6(a), which has a multilevel hierarchy with two types of sources: TCP sources and deterministic on-off sources. We will examine the performance of sessions labeled TCP- under link-sharing when on-off sources alternate between active and idle states. To see the effect of hierarchical link-sharing, we use one on-off source for each level in the hierarchy. The bandwidths and active periods of the on-off sources are shown in Fig. 6(b).

Fig. 7 shows the bandwidth versus time plots for each of the TCP sessions under consideration. The bandwidth is measured by averaging over 100 ms windows, with two adjacent windows overlapping by 50 ms. As can be seen, all four algorithms perform well. While 100 ms provides a very fine granularity for measuring bandwidth, it is a very large number when it comes to one-hop average packet delay. Therefore, even though the worst-case packet delays vary significantly with different H-PFQ algorithms, the bandwidth distributions are very similar.

Fig. 5. Delay of RT-1 with correlated cross traffic. (a) H-WFQ. (b) H-SCFQ. (c) H-SFQ. (d) H-WF2Q+.
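The measurement procedure described above (100 ms averaging windows, adjacent windows overlapping by 50 ms) can be sketched as follows. The function name, the trace format, and the sample trace are assumptions made for illustration; they are not taken from the simulators used in the paper.

```python
def windowed_bandwidth(deliveries, horizon_ms, window_ms=100, step_ms=50):
    """Average bandwidth over sliding windows.

    deliveries: list of (time_ms, bits) records for one session
    horizon_ms: end of the measurement period
    Returns a list of (window_start_ms, bits_per_second).
    """
    samples = []
    for start in range(0, horizon_ms - window_ms + 1, step_ms):
        # Sum the bits delivered in the half-open window [start, start + window).
        bits = sum(b for t, b in deliveries if start <= t < start + window_ms)
        samples.append((start, bits * 1000 / window_ms))
    return samples


# Hypothetical trace: three 1000-byte packets delivered at 10, 60, and 120 ms.
trace = [(10, 8000), (60, 8000), (120, 8000)]
samples = windowed_bandwidth(trace, horizon_ms=200)
print(samples)
```

A 100 ms window spans many packet transmission times, so per-packet delay differences between schedulers are averaged out in such a plot.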

VII. RELATED WORK

In [15], H-WFQ is used to support integrated traffic management. The negative effects introduced by WFQ’s high WFI on link-sharing and traffic management algorithms are not studied. To provide tighter bounds for real-time traffic, all real-time queues need to be children of the root node, and link-sharing between real-time and nonreal-time sessions is accomplished via a separate mechanism.

In [10], an implementation of H-WFQ is presented. The scheduler implemented is actually not an H-WFQ server, but a WFQ server in which the weights are dynamically changed according to the set of backlogged sessions in the packet server. It is easy to show that such an implementation will not only yield much larger delay bounds but also violate the link-sharing goals in certain situations. The key problem is that at any time instant, the set of backlogged sessions in a packet system can be quite different from that in the corresponding fluid system. Adjusting the weights according to the set of backlogged sessions in the packet system can result in large errors.

In [6], a Class-Based Queueing (CBQ) algorithm is presented to support link-sharing and integrated services. A CBQ server consists of a link-sharing scheduler and a general scheduler. The link-sharing scheduler decides whether to regulate a class based on link-sharing rules and marks packets of regulated classes as ineligible. The general scheduler services eligible packets using a static priority policy. Our work differs from this work in that we build our framework on H-GPS, which has theoretically proven properties for supporting link-sharing, real-time service, and best-effort service.

A number of algorithms such as Self-Clocked Fair Queueing [4], [7], Stochastic Fair Queueing [9], Deficit Round Robin [16], Frame-based Fair Queueing [17], and Start-time Fair Queueing [11] have been proposed to approximate GPS with a lower complexity. However, none of them addresses the issue of worst-case fairness, and all of them have large WFI’s. As shown in the paper, H-PFQ algorithms based on these algorithms result in much larger worst-case delay than that under H-WF2Q+. The Leap Forward Virtual Clock [13] achieves low WFI by using an SEFF policy similar to that used by WF2Q and WF2Q+. We hypothesize that it belongs to the class of PFQ algorithms that approximate a fluid rate-proportional server using an SEFF policy.

The idea of implementing an H-PFQ algorithm by integrating one-level PFQ algorithms into a hierarchy was also independently proposed in [11]. However, the details of the algorithm are not presented and the analysis applies only to H-SFQ. In addition, there are two claims made in [11] that we don’t believe are accurate. In [11], it was claimed that SFQ had two unique advantages compared to other PFQ algorithms: first, it is the only algorithm that is fair when the server is variable rate; second, it is the only algorithm that does not require admission control (the sum of service shares does not need to be less than 1). As discussed in Section II, by measuring virtual time functions in units of bits instead of seconds, all existing PFQ algorithms are fair even when the server is variable rate, and therefore, they can all be used to implement H-PFQ algorithms. As shown in Section VI, even though the delay bounds provided by the resulting H-PFQ algorithms can vary significantly, the fairness (link-sharing) property is maintained by all algorithms. For admission control, as discussed in Section II, it is the relative ratios among sessions’ service shares that are important. The exact values of the service shares are not important. Normalizing all service shares with respect to the sum of service shares will result in an identical policy as before. Admission control is required only for sessions that require minimum service guarantees: the total amount of service allocated to the sessions requiring performance guarantees should be less than the total server capacity. This condition needs to hold for all PFQ algorithms, including SFQ.

Fig. 6. Class hierarchy, on/off sources, and measured bandwidth. (a) Class hierarchy. (b) Active periods for on/off sources.
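The point that only the relative ratios of service shares matter can be illustrated with a small fluid-model calculation. The sketch below is an assumption-laden toy: it computes the instantaneous rates a GPS server would give to a set of backlogged sessions, and the function name and share values are hypothetical.

```python
def gps_rates(shares, link_rate):
    """Instantaneous GPS rates for a set of backlogged sessions."""
    total = sum(shares.values())
    return {name: link_rate * phi / total for name, phi in shares.items()}


# Shares that do not sum to 1, and the same shares after normalization.
raw = {"A": 2.0, "B": 6.0}
total = sum(raw.values())
normalized = {name: phi / total for name, phi in raw.items()}

rates_raw = gps_rates(raw, link_rate=100.0)
rates_normalized = gps_rates(normalized, link_rate=100.0)
print(rates_raw, rates_normalized)
```

Both share assignments yield the same allocation, which is why the sum of shares need not be constrained unless minimum service guarantees are required.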

VIII. CONCLUSION

We have made several contributions in this paper. First, we proposed a formal model based on the idealized H-GPS system to simultaneously support guaranteed real-time, adaptive best-effort, and controlled link-sharing services. Second, we presented an algorithm to implement H-PFQ by organizing one-level PFQ servers in a hierarchical structure. Most PFQ algorithms can be used for this purpose. The key is to compute the system virtual time and per packet virtual start/finish times in units of bits instead of seconds. Third, we developed a general framework for analyzing the delay and fairness properties of variable-rate and hierarchical servers. We demonstrated, both empirically and analytically, that having a PFQ algorithm with a low WFI value is a prerequisite for constructing H-PFQ servers that provide tight delay bounds. Finally, we proposed a new PFQ algorithm called WF2Q+ that is the first to have the following three properties: 1) providing the tightest delay bound among all PFQ algorithms; 2) having the smallest WFI among all PFQ algorithms; and 3) having a relatively low complexity of $O(\log N)$. The resulting H-WF2Q+ provides similar delay bounds and bandwidth distribution to those provided by the idealized H-GPS server, and is the first in the literature that provides both provably tight delay bounds for real-time sessions and the full semantics of hierarchical link-sharing service.

APPENDIX A
PROOF OF THEOREM 1

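Let $d_i^k$ be the time packet $p_i^k$ departs the H-PFQ server and $[t_1, d_i^k]$ be a time period during which session $i$ is continuously backlogged. It immediately follows that a) $d_i^k$ is also the time that packet $p_i^k$ departs from server node $p^{h+1}(i)$; and b) the logical queue at node $p^h(i)$ is continuously backlogged during $[t_1, d_i^k]$ with respect to server node $p^{h+1}(i)$, $h = 0, \ldots, H-1$.

Since server node $p^{h+1}(i)$ is worst-case fair with the logical queue at node $p^h(i)$, the following holds for $h = 0, \ldots, H-1$:

$$W_{p^h(i)}(t_1, d_i^k) \ge \frac{\phi_{p^h(i)}}{\phi_{p^{h+1}(i)}} W_{p^{h+1}(i)}(t_1, d_i^k) - \alpha_{p^h(i)} \tag{31}$$

where $W_{p^h(i)}(t_1, d_i^k)$ is the amount of service received by node $p^h(i)$ in $[t_1, d_i^k]$. Multiplying both sides of (31) by $\phi_i/\phi_{p^h(i)}$, we have

$$\frac{\phi_i}{\phi_{p^h(i)}} W_{p^h(i)}(t_1, d_i^k) \ge \frac{\phi_i}{\phi_{p^{h+1}(i)}} W_{p^{h+1}(i)}(t_1, d_i^k) - \frac{\phi_i}{\phi_{p^h(i)}} \alpha_{p^h(i)} \tag{32}$$

Summing (32) for $h = 0, \ldots, H-1$ and eliminating common terms on both sides, we have

$$W_i(t_1, d_i^k) \ge \frac{\phi_i}{\phi_{p^H(i)}} W_{p^H(i)}(t_1, d_i^k) - \sum_{h=0}^{H-1} \frac{\phi_i}{\phi_{p^h(i)}} \alpha_{p^h(i)} \tag{33}$$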

Q.E.D.


Fig. 7. Hierarchical link-sharing for H-PFQ’s. (a) H-WFQ. (b) H-SCFQ. (c) H-SFQ. (d) H-WF2Q+.

APPENDIX B
PROOF OF THEOREM 2

Consider the th packet of session. Let and beits arrival and departure times, respectively. Based on thedefinition of SBI, for , there exists an instant within thenode busy period that includes also , where

, and holds, such that

(34)

Since both and are in the same server busy periodof node , the logical queue at node is continu-ously backlogged with respect to server node ,

. Also, server node is worst-case fairwith the logical queue at node , therefore, (31) holds.Multiplying at both sides of (31), we have

(35)

Summing (35) for , and eliminating commonterms on both sides, we have

(36)

Combining (34) and (36), we have

(37)

Since session queue is FIFO and leaky bucket constrained, (14) and (15) hold. Combining them with (37), we have

(38)

Rearranging terms and dividing both sides by

(39)

Q.E.D.

ACKNOWLEDGMENT

The authors would like to thank D. Stephens and I. Stoica for helping with the simulations, H. Vin for stimulating discussions on SFQ, and anonymous reviewers for their insightful comments.

REFERENCES

[1] J. C. R. Bennett and H. Zhang, “WF2Q: Worst-case fair weighted fair queueing,” in Proc. IEEE INFOCOM’96, San Francisco, CA, Mar. 1996, pp. 120–128.

[2] D. Clark, S. Shenker, and L. Zhang, “Supporting real-time applications in an integrated services packet network: Architecture and mechanism,” in Proc. ACM SIGCOMM’92, Baltimore, MD, Aug. 1992, pp. 14–26.

[3] R. Cruz, “Service burstiness and dynamic burstiness measures: A framework,” J. High Speed Networks, vol. 1, no. 2, pp. 105–127, 1992.

[4] J. Davin and A. Heybey, “A simulation study of fair queueing and policy enforcement,” Computer Commun. Rev., vol. 20, no. 5, pp. 23–29, Oct. 1990.

[5] A. Demers, S. Keshav, and S. Shenker, “Analysis and simulation of a fair queueing algorithm,” J. Internetworking Res. Experience, pp. 3–26, Oct. 1990; also in Proc. ACM SIGCOMM’89, pp. 3–12.

[6] S. Floyd and V. Jacobson, “Link-sharing and resource management models for packet networks,” IEEE/ACM Trans. Networking, vol. 3, Aug. 1995.

[7] S. Golestani, “A self-clocked fair queueing scheme for broadband applications,” in Proc. IEEE INFOCOM’94, Toronto, Ont., Canada, June 1994, pp. 636–646.

[8] S. Keshav, “A control-theoretic approach to flow control,” in Proc. ACM SIGCOMM’91, Zurich, Switzerland, Sept. 1991, pp. 3–15.

[9] P. McKenney, “Stochastic fair queueing,” in Proc. IEEE INFOCOM’90, San Francisco, CA, June 1990.

[10] O. Ndiaye, “An efficient implementation of a hierarchical weighted fair queue packet scheduler,” Master’s thesis, Mass. Inst. Technol., May 1994.

[11] P. Goyal, H. M. Vin, and H. Chen, “Start-time fair queuing: A scheduling algorithm for integrated services,” in Proc. ACM SIGCOMM’96, Palo Alto, CA, Aug. 1996, pp. 157–168.

[12] A. Parekh and R. Gallager, “A generalized processor sharing approach to flow control—The single node case,” IEEE/ACM Trans. Networking, vol. 1, pp. 344–357, June 1993.

[13] S. Suri, G. Varghese, and G. Chandranmenon, “Leap forward virtual clock,” in Proc. INFOCOM’97, Kobe, Japan, Apr. 1997.

[14] S. Shenker, “Making greed work in networks: A game theoretical analysis of switch service disciplines,” in Proc. ACM SIGCOMM’94, London, U.K., Aug. 1994, pp. 47–57.

[15] S. Shenker, D. Clark, and L. Zhang, “A scheduling service model and a scheduling architecture for an integrated services network,” 1993, preprint.

[16] M. Shreedhar and G. Varghese, “Efficient fair queueing using deficit round robin,” in Proc. SIGCOMM’95, Boston, MA, Sept. 1995, pp. 231–243.

[17] D. Stiliadis and A. Varma, “Design and analysis of frame-based fair queueing: A new traffic scheduling algorithm for packet-switched networks,” in Proc. ACM SIGMETRICS’96, May 1996.

[18] D. Stiliadis and A. Varma, “Latency-rate servers: A general model for analysis of traffic scheduling algorithms,” in Proc. IEEE INFOCOM’96, San Francisco, CA, Mar. 1996, pp. 111–119.

[19] I. Stoica and H. Abdel-Wahab, “Earliest eligible virtual deadline first: A flexible and accurate mechanism for proportional share resource allocation,” Tech. Rep. TR-95-22, Old Dominion Univ., Nov. 1995.

[20] H. Zhang and S. Keshav, “Comparison of rate-based service disciplines,” in Proc. ACM SIGCOMM’91, Zurich, Switzerland, Sept. 1991, pp. 113–122.

Jon C. R. Bennett (M’95) received the B.S. degree in applied mathematics/computer science from Carnegie Mellon University, Pittsburgh, PA, in 1992 and is currently working toward the Ph.D. degree at Harvard University, Cambridge, MA.

He is currently a Senior Network Scientist at BBN Technologies engaged in research on advanced internetworking technologies. Prior to joining BBN he worked as a Systems Architect at FORE Systems on the design and implementation of advanced ATM switches. He holds several patents for efficient hardware algorithms and is currently a technical editor for IEEE Network.

He is a member of the ACM SIGCOMM and the IEEE ComSoc where he is a member of the ComSoc Board of Education and a member of the editorial board of IEEE Network.

Hui Zhang (M’95) received the B.S. degree in computer science from Beijing University in 1988, the M.S. degree in computer engineering from Rensselaer Polytechnic Institute in 1989, and the Ph.D. degree in computer science from the University of California at Berkeley in 1993.

Since January 1995, he has been an Assistant Professor at the School of Computer Science and the Department of Electrical and Computer Engineering of Carnegie Mellon University (CMU). Prior to joining CMU, he was a Post-Doctoral Fellow at the Lawrence Berkeley National Laboratory. His research interests are in integrated services computer networks, resource management algorithms, and multimedia systems. He was one of the principal designers and implementors of the U.C. Berkeley Tenet Real-Time Protocol Suite.

Dr. Zhang received the National Science Foundation Career Award in 1996.