Tight Analyses of Two Local Load Balancing Algorithms

Bhaskar Ghosh [1], F. T. Leighton [2], Bruce M. Maggs [3,4], S. Muthukrishnan [5], C. Greg Plaxton [6,7], R. Rajaraman [6,7], Andréa W. Richa [3], Robert E. Tarjan [8], David Zuckerman [6,9]

August 3, 1995
CMU-CS-95-131

School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

[1] Department of Computer Science, Yale University, New Haven, CT 06520. Supported by ONR Grant 4-91-J-1576 and a Yale/IBM joint study. Email: [email protected]
[2] Department of Mathematics and Laboratory for Computer Science, MIT, Cambridge, MA 02139. Supported by ARPA Contracts N00014-91-J-1698 and N00014-92-J-1799. Email: [email protected]
[3] Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213. Email: aricha, [email protected]
[4] Supported in part by an NSF National Young Investigator Award, No. CCR-9457766, and by ARPA Contract F33615-93-1-1330.
[5] DIMACS, Rutgers University, Piscataway, NJ 08855. Supported by DIMACS, Center for Discrete Mathematics and Theoretical Computer Science, a National Science Foundation Science and Technology Center, under NSF Contract STC-8809648. Email: [email protected]
[6] Department of Computer Science, University of Texas at Austin, Austin, TX 78712. Email: diz, plaxton, [email protected]
[7] Supported by the Texas Advanced Research Program under Grant No. ARP-93-003658-461.
[8] Department of Computer Science, Princeton University, 35 Olden Street, Princeton, NJ 08544, and NEC Research Institute. Research at Princeton University supported in part by the National Science Foundation, Grant No. CCR-8920505, and the Office of Naval Research, Contract No. N0014-91-J-1463. Email: [email protected]
[9] Supported in part by an NSF National Young Investigator Award, No. CCR-9457799.

The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of any of the sponsoring agencies or the U.S. Government.

Abstract

This paper presents an analysis of the following load balancing algorithm. At each step, each node in a network examines the number of tokens at each of its neighbors and sends a token to each neighbor with at least 2d+1 fewer tokens, where d is the maximum degree of any node in the network. We show that within O(Δ/α) steps, the algorithm reduces the maximum difference in tokens between any two nodes to at most O((d² log n)/α), where Δ is the maximum difference between the number of tokens at any node initially and the average number of tokens, n is the number of nodes in the network, and α is the edge expansion of the network. The time bound is tight in the sense that for any graph with edge expansion α, and for any value Δ, there exists an initial distribution of tokens with imbalance Δ for which the time to reduce the imbalance to even Δ/2 is at least Ω(Δ/α). The bound on the final imbalance is tight in the sense that there exists a class of networks that can be locally balanced everywhere (i.e., the maximum difference in tokens between any two neighbors is at most 2d), while the global imbalance remains Ω((d² log n)/α). Furthermore, we show that upon reaching a state with a global imbalance of O((d² log n)/α), the time for this algorithm to locally balance the network can be as large as Ω(n^{1/2}). We extend our analysis to a variant of this algorithm for dynamic and asynchronous networks. We also present tight bounds for a randomized algorithm in which each node sends at most one token in each step.

Keywords: load balancing, distributed network algorithms.

1 Introduction

A natural way to balance the workload in a distributed system is to have each work station periodically poll the other stations to which it is connected, and send some of its work to stations with less work pending. This paper analyzes the effectiveness of this local load balancing strategy in the simplified scenario in which each work station has a collection of independent unit-size jobs (called tokens) that can be executed on any other work station. We model a distributed system as a graph, where nodes correspond to work stations and edges correspond to connections between stations, and we assume that in one unit of time, at most one token can be transmitted across an edge of the graph in each direction. Our analysis addresses only the static load balancing aspect of this problem; we assume that each processor has an initial collection of tokens, and that no tokens are created or destroyed while the tokens are being balanced.

We analyze the algorithms in this paper in terms of the initial imbalance of tokens, i.e., the maximum difference between the number of tokens at any node and the average number of tokens, which we denote Δ; the number of nodes in the graph, which we denote n; the maximum degree of the graph, d; and the node and edge expansion of the graph. We define the node expansion μ of a graph G to be the largest value such that every set S of n/2 or fewer nodes in G has at least μ|S| neighbors outside of S. We define the edge expansion α of a graph G to be the largest value such that for every set S of n/2 or fewer nodes in G, there are at least α|S| edges in G with one endpoint in S and the other not in S.

The performance of an algorithm is characterized by the time that it takes to balance the tokens, and by the final balance that it achieves. We say that an algorithm globally balances (or just balances) to within t tokens if the maximum difference in the number of tokens between any two nodes in the graph is at most t.
We say that an algorithm locally balances to within t tokens if the maximum difference in the number of tokens between any neighboring nodes in the graph is at most t.

We analyze two different types of algorithms in this paper, single-port and multi-port. In the single-port model, a node may transmit or receive at most one token in one unit of time. In the multi-port model, a node may simultaneously transmit or receive a token across all of its edges (there may be as many as d) in a single unit of time. Not surprisingly, the load balancing algorithms run faster in the multi-port model. In practice, however, single-port nodes may be preferred to multi-port nodes because they are easier and less costly to build.

1.1 Our results

This paper analyzes the simplest and most natural local algorithms in both the single-port and multi-port models.

In the single-port algorithm, a matching is randomly chosen at each step. First, each (undirected) edge in the network is independently selected to be a candidate with probability 1/(4d). Then each candidate edge (u, v) for which there is another candidate edge (u, x) or (y, v) is removed from the set of candidates. The remaining candidates form a matching M in the graph. For each edge (u, v) in M, if u and v have the same number of tokens, then nothing is sent across (u, v). Otherwise, a token is sent from the node with more tokens to the node with fewer. This algorithm was first analyzed in [14].

We analyze the performance of the single-port algorithm in terms of both the edge expansion and the node expansion of the graph. In terms of edge expansion, we show that the single-port algorithm balances to within O(d log n/α) tokens in O(dΔ/α) steps, with high probability. In terms of node expansion, the final imbalance is O(log n/μ), and the time is O(dΔ/μ), with high probability. (To compare these bounds, note that μ ≤ α ≤ dμ.)
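The random-matching step of the single-port algorithm described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the adjacency-dict interface, the function name, and the argument names are assumptions made for the example.

```python
import random

def single_port_step(graph, tokens, d):
    """One round of the single-port algorithm (illustrative sketch).

    graph: dict mapping each node to the set of its neighbors.
    tokens: dict mapping each node to its current number of tokens.
    d: maximum degree of the graph.
    """
    edges = {tuple(sorted((u, v))) for u in graph for v in graph[u]}
    # Each undirected edge becomes a candidate independently with probability 1/(4d).
    candidates = [e for e in edges if random.random() < 1 / (4 * d)]
    # Discard every candidate that shares an endpoint with another candidate;
    # the survivors form a matching M.
    degree = {}
    for u, v in candidates:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    matching = [(u, v) for u, v in candidates if degree[u] == 1 and degree[v] == 1]
    # Across each matched edge, one token moves from the heavier to the lighter node.
    for u, v in matching:
        if tokens[u] > tokens[v]:
            tokens[u] -= 1; tokens[v] += 1
        elif tokens[v] > tokens[u]:
            tokens[v] -= 1; tokens[u] += 1
    return tokens
```

Whatever the random choices, each node sends or receives at most one token per round, which is exactly the single-port constraint.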
The time bounds are tight in the sense that for many values of n, d, Δ, and α, there is an n-node, maximum degree d graph with edge expansion α and an initial placement of tokens with imbalance Δ for which the time (for any algorithm) to balance to within even Δ/2 tokens is at least Ω(dΔ/α). Similarly, in terms of node expansion, there exist classes of graphs where the time to balance to within even Δ/2 tokens is at least Ω(dΔ/μ).

The multi-port algorithm is simpler and deterministic. At each step, a token is sent from node u to node v across edge (u, v) if at the beginning of the step node u contained at least 2d + 1 more tokens than node v. This algorithm was first analyzed in [2].

As in the single-port case, we analyze the multi-port algorithm in terms of both edge expansion and node expansion. In terms of edge expansion, the algorithm balances to within O(d² log n/α) tokens in O(Δ/α) steps. This bound is tight in the sense that for any network with edge expansion α, and any value Δ, there exists an initial distribution of tokens with imbalance Δ such that the time to reduce the imbalance to even Δ/2 is Ω(Δ/α). In terms of node expansion, the algorithm balances to within O(d log n/μ) tokens in O(Δ/μ) time. This bound is tight in the sense that for many values of d, n, and μ, and any value Δ, there exists an n-node, maximum degree d graph with node expansion μ and an initial distribution of tokens with imbalance Δ for which the time to balance to within Δ/2 tokens is Ω(Δ/μ).

Both the single-port and multi-port algorithms will eventually locally balance the network, the single-port algorithm to within one token, and the multi-port algorithm to within 2d tokens. However, even after reducing the global imbalance to a small value, the time for either of these algorithms to reach a locally balanced state can be quite large. In particular, we show that after reaching a state that is globally balanced to within O(d log n/μ) tokens, the multi-port algorithm may take another Ω(n^{1/2}) steps to reach a state that is locally balanced to within 2d tokens. For networks with large node expansion and small degree, e.g., μ = Ω(1) and d = O(1), and small initial imbalance, e.g., Δ = O(d log² n/μ), the time to locally balance the network, Ω(n^{1/2}), may be much larger than the time, O(Δ/μ) = O(d log² n/μ²) = O(log² n), to reach a state that is globally balanced to within O(d log n/μ) tokens.
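One synchronous round of the multi-port rule is easy to sketch, since every transfer is decided from the token counts at the start of the round. The following Python sketch is illustrative only (the adjacency-dict interface and names are assumptions, not the paper's notation):

```python
def multi_port_step(graph, tokens, d):
    """One synchronous round of the multi-port algorithm (illustrative sketch).

    A token crosses edge (u, v) from u to v whenever, at the start of the
    round, u holds at least 2d + 1 more tokens than v; a node may send or
    receive along all of its edges in the same round.
    """
    # Decide every transfer from the pre-round counts, then apply them all.
    sends = [(u, v) for u in graph for v in graph[u]
             if tokens[u] >= tokens[v] + 2 * d + 1]
    for u, v in sends:
        tokens[u] -= 1
        tokens[v] += 1
    return tokens
```

Iterating this step on a small path with all tokens piled at one end shows the algorithm settling into a fixed point in which every pair of neighbors differs by at most 2d tokens, i.e., a locally balanced state in the sense defined above.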
We prove similar bounds in terms of edge expansion and also for the single-port algorithm.

Thus far we have described a network model in which the nodes are synchronized by a global clock (i.e., a synchronous network), and in which the edges are assumed not to fail. With minor modifications, however, the load balancing algorithms can be made to work in both asynchronous and dynamic networks. In a dynamic network, the set of edges in the network may vary at each time step. In any time step, a live edge is one that can transmit one message in each direction. We assume that at each time step, each node in a synchronous dynamic network knows which of its edges are live. In an asynchronous network, the topology is fixed, but an adversary determines the speed at which each edge operates at every instant of time. For every undirected edge between two nodes, we allow at most two messages to be in transit at any instant in time. These messages may travel in opposite directions across the edge, or both may travel in one direction while no message travels in the opposite direction. An edge is said to be live for a unit interval of time if every message that was in transit across the edge (in either direction) at the beginning of the interval is guaranteed to reach the end of the edge by the end of the interval. We analyze the performance of the multi-port load balancing algorithm under the assumption that at each time step, the set of live edges has some edge expansion α, or node expansion μ.

We also study the off-line load balancing problem, in which every node has knowledge of the global state of the network. This problem has been studied on static synchronous networks in [28]. We use their results to obtain tight bounds on off-line load balancing in terms of edge expansion and node expansion. In particular, we prove that any network can be balanced off-line in ⌈(1+α)Δ/α⌉ steps such that no node has more than two tokens over the average.
This result can be used to show that any network can be balanced off-line to within three tokens in at most 2⌈(1+α)Δ/α⌉ steps in the single-port model. Moreover, there exists a network and an initial token distribution for which any single-port off-line algorithm takes more than ⌈(1+α)Δ/α⌉ steps to balance the network to within one token. Similarly, in the multi-port model, any network can be balanced off-line in at most ⌈Δ/α⌉ steps so that no node contains more than d tokens over the average. Using this result, we show that any network can be balanced to within d+1 tokens in at most 2⌈Δ/α⌉ steps. It is easy to observe that for any network G there exists an initial token distribution such that any algorithm will take at least ⌈Δ/α⌉ steps to balance G to within one token.

1.2 Previous and related work

Load balancing has been studied extensively since it comes up in a wide variety of settings, including adaptive mesh partitioning [16, 38], fine-grain functional programming [15], job scheduling in operating systems [13, 24], and distributed game tree searching [21, 25]. A number of models have been proposed for load balancing, differing chiefly in the amount of global information used by the algorithm [2, 11, 12, 14, 26, 30]. In these models, algorithms have been proposed for specific applications; also, proposed heuristics and algorithms have been analyzed using simulations and queuing-theoretic techniques [27, 34, 36]. In what follows, we focus on models that allow only local algorithms and on prior work that takes an analytical approach to the load balancing problem.

Local algorithms restricted to particular networks have been studied on counting networks [4, 22], hypercubes [19, 33], and meshes [16, 28]. Another class of networks on which load balancing has been studied is the class of expanders. Peleg and Upfal [31] pioneered this study by identifying certain small-degree expanders as being suitable for load balancing. Their work has been extended in [9, 17, 32]. These algorithms either use strong expanders to approximately balance the network, or the AKS sorting network [3] to perfectly balance the network. Thus, they do not work on networks of arbitrary topology. Also, these algorithms work by setting up fixed paths through the network along which load is moved, and therefore they fail when the network changes. In contrast, our local algorithm works on any arbitrary dynamic network that remains connected.

On arbitrary topologies, load balancing has been studied under two models. In the first model, any amount of load can be moved across a link in any time step [8, 12, 14, 18, 35]. The second model is the one that we adopt here, namely one in which at most one unit of load can be moved across a link in each time step.
Load balancing algorithms in the second model were first proposed and analyzed in [2] for the multi-port variant and in [14] for the single-port variant. The upper bounds established there are suboptimal by a factor of Ω(√n) or Ω(log(nΔ)) in general. Our result here is an improved, in fact optimal, bound for the same problem.

As remarked earlier, our multi-port results (and those in [2]) hold even for dynamic or asynchronous networks. In general, work on dynamic and asynchronous networks has been limited. In work related to load balancing, for instance, an end-to-end communication problem, namely one in which messages are routed from a single source to a single destination, has been studied in [1, 7] on dynamic networks. Our scenario is substantially more involved, since we are required to move load between several sources and destinations simultaneously. Another result on dynamic networks is the recent analysis of a local algorithm for the approximate multicommodity flow problem [5, 6]. While their result has several applications, including the end-to-end communication problem mentioned above, it does not seem to extend to load balancing. Our result on load balancing is related to their work in technique; however, our algorithm and analysis are simpler, and we obtain optimal bounds for our problem.

The convergence of local load balancing algorithms is related to that of random walks on Markov chains. Indeed, the convergence bounds in both cases depend on the expansion properties of the underlying graph, and they are established using potential function arguments. There are, however, two important differences. First, the analysis of the rapid convergence of random walks [20, 29] relies on averaging arbitrary probabilities across any edge. This corresponds to sending an arbitrary (possibly nonintegral) load along an edge, which is forbidden in our model. In this sense, the analysis in [12] (and all references in the unbounded capacity model) is similar to the random walk analysis.
Second, our argument uses an exponential potential function. The analyses in [12, 20, 29], in contrast, use quadratic potential functions. Our potential function and our amortized analysis were necessary because a number of previous attempts using quadratic potential functions yielded suboptimal results [2, 14] for local load balancing.

1.3 Outline

The remainder of this paper is organized as follows. Section 2 contains some definitions. Section 3.1 analyzes the performance of the single-port algorithm. Section 3.2 analyzes the performance of the multi-port algorithm. In Section 4, we show that the time to reach a locally balanced state can be quite large, even if the network starts in a state that is well balanced globally. Section 5 describes extensions to dynamic and asynchronous networks. Finally, Section 6 presents tight bounds on off-line load balancing.

2 Preliminaries

For any network G = (V, E) with n nodes, m edges, and edge expansion α, we denote the number of tokens at v ∈ V by w(v). We denote the average number of tokens by ρ. For simplicity, throughout this paper we assume that ρ is an integer. We assign a unique rank from [1, w(v)] to every token at v. The height of a token is its rank minus ρ. The height of a node is the maximum among the heights of all its tokens.

Consider a partition of V given by {S_i}, where the index i is an integer and S_i may be empty for any i. Let S_{>j} be ∪_{i>j} S_i. Similarly we define S_{≥j}, S_{<j}, and S_{≤j}. We define index i to be good if |S_i| ≤ α|S_{>i}|/(2d). An index that is not good is called a bad index. For any bad index i, we have |S_{>i}| < |S_{≥i}|/(1 + α/(2d)). Since |S_{>0}|/(1 + α/(2d))^{log_{1+α/(2d)} n} ≤ 1, there can be at most ⌈log_{1+α/(2d)} n⌉ bad indices. It follows that at least half of the indices in [1, 2⌈log_{1+α/(2d)} n⌉] are good.

3 Analysis for static synchronous networks

3.1 The single-port model

In this section, we analyze the single-port load balancing algorithm that is described in Section 1.1.

Theorem 3.1  For an arbitrary network G with n nodes, maximum degree d, edge expansion α, and initial imbalance Δ, the single-port algorithm balances to within O((d log n)/α) tokens in O((dΔ)/α) steps, with high probability.

Before every step we partition the set of nodes according to how many tokens they contain. For every integer i, we denote the set of nodes having ρ + i tokens as S_i.
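The good/bad classification of indices from Section 2 is easy to check on a concrete partition. The following sketch classifies every index of a partition {S_i} given only the set sizes; the dict-based interface and the function name are illustrative assumptions, not from the paper.

```python
def classify_indices(sizes, alpha, d):
    """Mark each index i of a partition {S_i} as good or bad.

    Index i is good when |S_i| <= alpha * |S_{>i}| / (2d), i.e., when the
    level set S_i is small relative to everything above it.
    sizes: dict mapping index i to |S_i| (illustrative interface).
    """
    good = {}
    for i in sizes:
        # |S_{>i}| is the total size of all level sets above index i.
        above = sum(sizes[k] for k in sizes if k > i)
        good[i] = sizes[i] <= alpha * above / (2 * d)
    return good
```

Because every bad index shrinks |S_{>i}| by a factor of 1 + α/(2d), at most ⌈log_{1+α/(2d)} n⌉ indices can be bad, which is the counting fact the arguments below rely on.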
Consider the first T steps of the algorithm, with T to be specified later. It holds that either |S_{>0}| ≤ n/2 at the start of at least half the steps, or |S_{≤0}| ≤ n/2 at the start of at least half the steps. Without loss of generality, assume the former is true. Since at least half of the indices in [1, 2⌈log_{1+α/(2d)} n⌉] are good in any time step t, there exists an index j in [1, 2⌈log_{1+α/(2d)} n⌉] that is good in at least half of those time steps in which |S_{>0}| ≤ n/2. Hence j is good in at least T/4 steps.

With every token at height h we associate a potential of φ(h), where φ : N → R is defined as follows:

    φ(x) = 0 if x ≤ j, and φ(x) = (1+δ)^x otherwise,    (1)

where δ = α/(cd), and c > 1 is a real constant to be specified later. The potential of the network is the sum of the potentials of all tokens in the network. While transmitting a token, every node sends its token with maximum height. Similarly, any token arriving at a node with height h is assigned height h + 1. It follows from the definition of the potential function, and the fact that the height of a token never increases, that the potential of the network never increases. In the following, we show that during any step when j is good, the expected decrease in the potential of the network is at least an εδ² fraction of the potential before the step, where ε > 0 is a real constant to be specified later.

Before proving Theorem 3.1, we present an informal outline of the proof. For simplicity, let us assume that G is a constant-degree expander, i.e., d = O(1) and α = Ω(1). Consider the scenario in which all of

the indices greater than j are bad. In this situation, for every i greater than j, the size of the set S_{≥i} grows exponentially with decreasing i, and hence the number of tokens with height i grows exponentially with decreasing i. If the growth of φ(x) with increasing x is slower than the growth of S_{≥i} with decreasing i, then the total potential due to tokens at height i "dominates" the total potential due to tokens at height greater than i. In such a case the potential of S_{>j} is essentially a constant times the potential of the tokens at height j+1. In addition, if the potential of tokens of height at most j is zero, then in every step when j is good, there is a constant-fraction potential drop, because a constant fraction of the nodes in S_{>j} send tokens to S_{<j} in such a step. The exponential function we have defined in Equation (1) satisfies the properties described above.

In general, the indices greater than j could form any sequence of good and bad indices that respects the upper bound on the number of bad indices. We consider the indices greater than j in reverse order and show by an amortized analysis that for each index i we can "view" all indices greater than or equal to i as bad. If i is bad, then this view is trivially preserved; otherwise, there is a significant potential drop across the cut (S_{≤i}, S_{>i}), and this drop can be used to rearrange the potential of S_{>i} in order to maintain the view that all indices greater than i are bad. We then invoke the argument for the case in which all indices greater than j are bad and complete the proof.

Consider step t of the algorithm. Let Φ_t denote the potential of the network after step t > 0. Let M_i be the set of tokens that are sent from a node in S_{>i} to a node in S_{<i}, and let m_i = |M_i|. We say that a token p has an i-drop of φ(i+1) − φ(i) if p moves from a node in S_{>i} to a node in S_{<i}. Thus, the potential drop due to a token moving on an edge from node u ∈ S_i to node v ∈ S_{i'}, i > i' + 1, can be expressed as the sum of its k-drops for i' < k < i.
In Lemma 3.1, we use this notion of i-drops to relate the total potential drop in step t, which we denote Ψ, to the m_i's.

Lemma 3.1  Ψ = (Σ_{i>j} m_i δ(1+δ)^i) + m_j (1+δ)^{j+1}.

Proof: Let M be the set of tokens that are moved from a node in S_{>j}. (Note that tokens that start from and end at nodes in S_{>j} also belong to M.) For any token p, let a(p) (resp., b(p)) be the height of p before (resp., after) step t. Then

    Ψ = Σ_{p ∈ M} (φ(a(p)) − φ(b(p)))
      = Σ_{p ∈ M, p ∉ M_j} Σ_{b(p) ≤ i < a(p)} (φ(i+1) − φ(i)) + Σ_{p ∈ M_j} Σ_{j ≤ i < a(p)} (φ(i+1) − φ(i))
      = (Σ_{i>j} Σ_{p ∈ M_i} δ(1+δ)^i) + (Σ_{p ∈ M_j} (1+δ)^{j+1})
      = (Σ_{i>j} m_i δ(1+δ)^i) + m_j (1+δ)^{j+1}.

(In the third step we rearrange the terms in the summations and also use the equation φ(i+1) − φ(i) = δ(1+δ)^i for all i > j.)

We now describe the amortized analysis, alluded to earlier in this section, that we use to prove Theorem 3.1. We associate a charge of εδ²φ(h) with each token at height h. We show that we can pay for all the charges using the expected potential drop E[Ψ], and thus place a lower bound on E[Ψ]. We consider the indices in [j+1, ℓ] in reverse order, where ℓ is the maximum token height. Corresponding to every i in [j, ℓ], we maintain a "debt" term, given by D_i below, which is the difference between the charges due to

tokens at height at least i and the sum of the k-drops for k ≥ i. Hence D_i is calculated by subtracting the sum of the i-drops from D_{i+1} and adding the charges due to tokens at height i. We will show that E[D_i] is small enough that the indices in [i+1, ℓ] can be viewed as a sequence of bad indices. In other words, we upper bound E[D_i] by O(δ|S_{≥i}|(1+δ)^i). It follows from this upper bound and the informal argument outlined earlier in this section that E[D_{j+1}], i.e., the expected total debt, can be paid for by the expected drop across index j.

Formally, for any i > j, we define

    Ψ_i = Σ_{k ≥ i} m_k δ(1+δ)^k,   and
    D_i = (εδ²) (Σ_{p: a(p) ≥ i} (1+δ)^{a(p)}) − Ψ_i.

We also define

    Θ = (εδ²) (Σ_{p: a(p) > j} (1+δ)^{a(p)}) − Ψ.

Note that Φ_{t−1} = Σ_{p: a(p) > j} (1+δ)^{a(p)} is the total potential of S_{>j} prior to step t.

In order to prove the upper bound on E[D_i], we place a lower bound on E[m_i] that is obtained from the following lemma of [14].

Lemma 3.2 ([14])  For any edge e ∈ E, the probability that e is selected in the matching is at least 1/(8d).

Lemma 3.3  There exists a real constant ε > 0 such that for all i > j, we have E[D_i] ≤ (εδ)|S_{≥i}|(1+δ)^i.

Proof: The proof is by reverse induction on i. For i > ℓ, the claim holds trivially. Therefore, for the base case we consider i = ℓ. Since m_ℓ = 0, we have Ψ_ℓ = 0. Thus, D_ℓ = (εδ²)|S_ℓ|(1+δ)^ℓ ≤ (εδ)|S_{≥ℓ}|(1+δ)^ℓ, since δ = α/(cd) ≤ 1/c ≤ 1 by our choice of c.

For the induction step we consider two cases, depending on whether i is good or bad. If i is good, then |S_i| ≤ α|S_{>i}|/(2d). Since there are at most α|S_{>i}|/2 edges from S_{>i} to S_i, by the expansion property of the graph we find that there are at least α|S_{>i}|/2 edges from S_{>i} to S_{<i}. By Lemma 3.2 we have E[m_i] ≥ α|S_{>i}|/(16d). Therefore, we have

    E[D_i] = E[D_{i+1}] + (εδ²)|S_{≥i}|(1+δ)^i − E[m_i] δ(1+δ)^i
           ≤ E[D_{i+1}] + (εδ²)|S_{≥i}|(1+δ)^i − cδ²|S_{>i}|(1+δ)^i/16
           ≤ E[D_{i+1}] − (δ²)|S_{≥i}|(1+δ)^i (f(c, α, d) − ε)
           ≤ (εδ)|S_{>i}|(1+δ)^{i+1} − (δ²)|S_{≥i}|(1+δ)^i (f(c, α, d) − ε)
           ≤ (εδ)|S_{≥i}|(1+δ)^i ((1+δ) − δ(f(c, α, d) − ε)/ε),

where f(c, α, d) = c/(16(1 + α/(2d))).
(The third inequality follows from the fact that |S_{>i}| ≥ |S_{≥i}|/(1 + α/(2d)). The fourth inequality follows from the induction hypothesis.)

The second case is when i is bad. Thus |S_i| > α|S_{>i}|/(2d). We have

    E[D_i] ≤ E[D_{i+1}] + (εδ²)|S_{≥i}|(1+δ)^i
           ≤ (εδ)|S_{>i}|(1+δ)^{i+1} + (εδ²)|S_{≥i}|(1+δ)^i
           ≤ (εδ)|S_{≥i}|(1+δ)^i ((1+δ)/(1 + cδ/2) + δ).

We now complete the induction step by determining values for c and ε such that the inequalities

    (1+δ) − δ(f(c, α, d) − ε)/ε ≤ 1    (2)
    (1+δ)/(1 + cδ/2) + δ ≤ 1           (3)

hold, and thus establish the induction step. Since α ≤ d, we can set c to be any constant greater than or equal to (α/d) + 4 (e.g., c = 5). For this choice of c, δ ≤ (c−4)/c, and hence δ ≤ (cδ − cδ²)/4. Therefore,

    (1+δ)/(1 + cδ/2) + δ = (1 + 2δ + cδ²/2)/(1 + cδ/2) ≤ (1 + cδ/2)/(1 + cδ/2) = 1.

Thus, Equation (3) is satisfied. Since α ≤ d, we find that f(c, α, d) ≥ c/24. We now set ε = c/48 to establish Equation (2). (For example, c = 5 and ε = 5/48.)

Applying Lemma 3.3 with i = j+1, it follows that E[D_{j+1}] ≤ (εδ)|S_{≥j+1}|(1+δ)^{j+1}. If j is good then E[m_j] ≥ α|S_{>j}|/(16d), and by the definitions of Θ, D_{j+1}, and Ψ, we have

    E[Θ] = E[D_{j+1}] − E[m_j](1+δ)^{j+1}
         ≤ E[D_{j+1}] − α|S_{>j}|(1+δ)^{j+1}/(16d)
         ≤ (εδ)|S_{>j}|(1+δ)^{j+1} − α|S_{>j}|(1+δ)^{j+1}/(16d)
         = δ|S_{>j}|(1+δ)^{j+1}(ε − c/16)
         ≤ 0,

since c/16 ≥ ε.

By the definitions of Ψ and Θ, we have Φ_t ≤ Φ_{t−1} − Ψ and Θ = εδ²Φ_{t−1} − Ψ. If j is good during step t, we have E[Θ] ≤ 0, and therefore E[Φ_t] ≤ Φ_{t−1}(1 − εδ²), where the expectation is taken over the random matching selected in step t. Thus, we have E[Φ_{t+T}] ≤ Φ_t (1 − εδ²)^{T/4}, where the expectation is over all the random matchings in the T steps. By setting T = ⌈(4 ln 4)/(εδ²)⌉, we obtain E[Φ_{t+T}] ≤ Φ_t/4. By Markov's inequality, the probability that Φ_{t+T} ≥ Φ_t/2 is at most 1/2. Therefore, using standard Chernoff bounds [10], we can show that in T' = 8aT⌈log Φ_0 + log n⌉ steps, we have Φ_{T'} > 1 with probability at most O(1/Φ_0^a + 1/n^a), for any constant a > 0. Since Φ_0 ≤ n(1+δ)^{Δ+1}/δ, we have log Φ_0 ≤ (Δ+1)δ + log n − log δ.
Therefore, for T' = O(Δd/α + d² log n/α²), we have Φ_{T'} < 1 with high probability, which implies that after T' steps, |S_{>2 log_{1+α/(2d)} n}| = 0 with high probability.

To establish balance in the number of tokens below the average, we use an averaging argument to show that after T' steps |S_{<−2 log_{1+α/(2d)} n}| ≤ n/2 with high probability, and then repeat the above arguments, redefining the potential appropriately. This proves Theorem 3.1.

3.2 The multi-port model

In this section, we analyze the deterministic multi-port algorithm described in Section 1.1.

Theorem 3.2  For an arbitrary network G with n nodes, maximum degree d, edge expansion α, and initial imbalance Δ, the multi-port algorithm balances to within O((d² log n)/α) tokens in O(Δ/α) steps.

The proof of Theorem 3.2 is similar to that of Theorem 3.1. We assign to every token a potential that grows exponentially with its height. We then show by means of an amortized analysis that a suitable rearrangement of the potential reduces every instance of the problem to a special instance that we understand well.

Before every step we partition the set of nodes according to how many tokens they contain, where the groups are separated by 2d. For every integer i, we denote the set of nodes having between ρ − d + 2id and ρ + d − 1 + 2id tokens as S_i.

Consider the first T steps of the algorithm, with T to be specified later. Without loss of generality, we assume that |S_{>0}| ≤ n/2 holds in at least half of these steps. By Section 2, there exists an index j in [1, 2⌈log_{1+α/(2d)} n⌉] that is good in at least half of those steps in which |S_{>0}| ≤ n/2. Hence in T steps of the algorithm, j is good in at least T/4 steps.

With every token at height h we associate a potential of φ(h), where φ : N → R is defined as follows:

    φ(x) = 0 if x ≤ 2jd, and φ(x) = (1+δ)^x otherwise,

where δ = α/(cd²), and c > 0 is a constant to be specified later. The potential of the network is the sum of the potentials of all tokens in the network. While transmitting some number (say m) of tokens in a particular step, a node sends the m highest-ranked tokens. Similarly, if m tokens arrive at a node during a step, they are assigned the m highest ranks within the node. Thus, tokens that do not move retain their ranks after the step. It follows from the definition of the potential function and the fact that the height of a token never increases that the network potential never increases. In the following we show that whenever j is good, the potential of S_{>j} decreases by a factor of εδ²d², where ε > 0 is a real constant to be specified later. (For the sake of simplicity, we assume that d is even. If d is odd, we can replace d by d + 1 in our argument without affecting the bounds by more than constant factors.)

Consider step t of the algorithm. Let u be a node in S_{<i} with height h. Let A (resp., B) be the set of tokens that u receives from nodes in S_{>i} (resp., S_{≤i}). We assign new ranks to the tokens in A and B such that the rank of every token in A is less than that of every token in B. Let C be the set of tokens in A that attain height at most h + d/2 after the step. Since |A| ≤ d, by the choice of our ranking, we have |C| ≥ |A|/2. We call C the set of primary tokens.
Also, all tokens leaving node u are at height at least h − d + 1 prior to this step.

For any token p, let a(p) (resp., b(p)) be the index of the group of the node that contains p before (resp., after) the step. (Note that the indexing is done prior to the step.) Let M_i be the set of primary tokens received by nodes in S_{<i}, and let m_i = |M_i|. (Note that m_i is at least half the number of edges joining some node in S_{<i} and some node in S_{>i}.) Lemma 3.4 establishes the relationship between the total potential drop Ψ in step t and the m_i's.

Lemma 3.4  Ψ ≥ (½ Σ_{i>j} m_i δd(1+δ)^{(2i−1)d}) + m_j (1+δ)^{2jd+1}.

Proof: Let M be the set of primary tokens that are moved from nodes in S_{>j}. (Note that primary tokens that start from a node in S_{>j} and end at a node in S_{>j} are in M.) By the definition of a primary token, the height of p prior to the step is at least 2a(p)d − 2d + 1 and the height after the step is at most 2b(p)d + 3d/2. Moreover, p belongs to M_i for all i such that b(p) < i < a(p). Thus

    Ψ ≥ Σ_{p ∈ M} [φ(2a(p)d − 2d + 1) − φ(2b(p)d + 3d/2)]
      ≥ Σ_{p ∈ M, p ∉ M_j} Σ_{b(p) < i < a(p)} [φ(2(i+1)d − 2d + 1) − φ(2(i−1)d + 3d/2)]
        + Σ_{p ∈ M_j} (Σ_{j < i < a(p)} [φ(2(i+1)d − 2d + 1) − φ(2(i−1)d + 3d/2)] + φ(2(j+1)d − 2d + 1))
      ≥ (½ Σ_{i>j} Σ_{p ∈ M_i} δd(1+δ)^{2id−d}) + Σ_{p ∈ M_j} (1+δ)^{2d(j+1)−2d+1}

      ≥ (½ Σ_{i>j} m_i δd(1+δ)^{2id−d}) + m_j (1+δ)^{2jd+1}.

For any token p, let h(p) denote the height of p prior to the step. Thus 2a(p)d − d ≤ h(p) ≤ 2a(p)d + d − 1. For i > j and for a suitable constant ε > 0 to be specified later, we define

    Ψ_i = ½ Σ_{k ≥ i} m_k δd(1+δ)^{2kd−d},   and
    D_i = (εδ²d²) (Σ_{p: h(p) ≥ 2id−d} (1+δ)^{h(p)}) − Ψ_i.

We also define

    Θ = (εδ²d²) (Σ_{p: h(p) > 2jd} (1+δ)^{h(p)}) − Ψ.

For any step t', let Φ_{t'} denote the total potential after step t'. Thus, Φ_{t−1} = Σ_{p: h(p) > 2jd} (1+δ)^{h(p)} is the total potential prior to step t.

Lemma 3.5  There exists a real constant β > 0 such that for all i > j, we have D_i ≤ (βδd²)|S_{≥i}|(1+δ)^{2id−d}.

Proof: The proof is by reverse induction on i. Let ℓ be the maximum token height. For i > ⌊(ℓ+d)/(2d)⌋, the claim is trivial. Therefore, for the base case we consider i = ⌊(ℓ+d)/(2d)⌋. Since Ψ_i = 0, we have

    D_i ≤ (2εδ²d³)|S_i|(1+δ)^ℓ
        ≤ (2εδ²d³)|S_{≥i}|(1+δ)^{2d(i+1)−d}
        ≤ (βδd²)|S_{≥i}|(1+δ)^{2id−d},

where β and ε are chosen such that β > 2εδd(1+δ)^{2d}. (Note that for c sufficiently large, (1+δ)^{2d} can be made arbitrarily close to one.)

For the induction step we consider two cases. If i is good, then |S_i| ≤ α|S_{>i}|/(2d) and m_i ≥ α|S_{>i}|/2. Therefore, we have

    D_i ≤ D_{i+1} + (2εδ²d³)|S_{≥i}|(1+δ)^{2id+d−1} − m_i δd(1+δ)^{2id−d}/2
        ≤ D_{i+1} + (2εδ²d³)|S_{≥i}|(1+δ)^{2id+d−1} − cδ²d³|S_{>i}|(1+δ)^{2id−d}/4
        ≤ D_{i+1} − (δ²d³)|S_{≥i}|(1+δ)^{2id−d} (f(c, α, d) − 2ε(1+δ)^{2d})
        ≤ (βδd²)|S_{>i}|(1+δ)^{2(i+1)d−d} − (δ²d³)|S_{≥i}|(1+δ)^{2id−d} (f(c, α, d) − 4ε)
        ≤ (βδd²)|S_{≥i}|(1+δ)^{2id−d} ((1+δ)^{2d} − δd(f(c, α, d) − 4ε)/β),

where f(c, α, d) = c/(4(1 + α/(2d))). (The third inequality follows from the fact that |S_{>i}| ≥ |S_{≥i}|/(1 + α/(2d)). The fourth inequality follows from the induction hypothesis and the inequality (1+δ)^{2d} ≤ 2 for c sufficiently large.)

The second case is when i is bad. Thus |S_i| > α|S_{>i}|/(2d). We have

    D_i ≤ D_{i+1} + (2εδ²d³)|S_{≥i}|(1+δ)^{2id+d−1}
        ≤ (βδd²)|S_{>i}|(1+δ)^{2(i+1)d−d} + (2εδ²d³)|S_{≥i}|(1+δ)^{2id+d−1}
        ≤ (βδd²)|S_{≥i}|(1+δ)^{2id−d} ((1+δ)^{2d}/(1 + α/(2d)) + 2εδd(1+δ)^{2d}/β).

We now set $c$, $\lambda$, and $\varepsilon$ such that $c > 4$, $c/6 - 4\varepsilon \ge 4\lambda$, and $c/4 - 2\varepsilon/\lambda \ge 4$. (One set of choices is $c = 25$, $\lambda = 1$, and $\varepsilon = 1/24$.) Since $\alpha \le d$, we have $f(c,\alpha,d) \ge c/6$. Since $c > 4$, we have $2\delta d < 1/2$, and hence $(1+\delta)^{2d} \le 1 + \sum_{i>0}(2\delta d)^i \le 1 + 2\delta d/(1 - 2\delta d) \le 1 + 4\delta d$. Thus,

$(1+\delta)^{2d} - \delta d(f(c,\alpha,d) - 4\varepsilon)/\lambda \le 1 + 4\delta d - 4\delta d \le 1.$

Since $\alpha/(2d) \le 1/2$, we have $1/(1+\alpha/(2d)) \le 1 - \alpha/(4d)$, and hence

$(1+\delta)^{2d}/(1+\alpha/(2d)) + 2\varepsilon\delta d(1+\delta)^{2d}/\lambda \le (1+\delta)^{2d}\big(1/(1+\alpha/(2d)) + 2\varepsilon\delta d/\lambda\big)$
$\ \le (1+\delta)^{2d}(1 - \alpha/(4d) + 2\varepsilon\delta d/\lambda)$
$\ = (1+\delta)^{2d}(1 - c\delta d/4 + 2\varepsilon\delta d/\lambda)$
$\ \le (1 + 4\delta d)(1 - 4\delta d) < 1.$

Thus, in both cases, $\Gamma_i \le (\lambda\delta d^2)|S_{\ge i}|(1+\delta)^{2id-d}$. This completes the inductive step.

Corollary 3.5.1 If $j$ is good on step $t$ then we have $\Psi \ge \varepsilon\delta^2 d^2\,\Phi_{t-1}$.

Proof: Applying Lemma 3.5 with $i = j+1$, it follows that $\Gamma_{j+1} \le (\lambda\delta d^2)|S_{\ge j+1}|(1+\delta)^{2(j+1)d-d}$. If $j$ is good then $|S_j| \le \alpha|S_{>j}|/(2d)$ and $m_j \ge \alpha|S_{>j}|/2$. Therefore,

$\Gamma = \Gamma_j \le \Gamma_{j+1} + \varepsilon\delta^2 d^3|S_j|(1+\delta)^{2jd+d-1} - \alpha|S_{>j}|(1+\delta)^{2jd+1}/2$
$\ \le (\lambda\delta d^2)|S_{>j}|(1+\delta)^{2(j+1)d-d} + (\alpha\varepsilon\delta^2 d^2)|S_{>j}|(1+\delta)^{2jd+d-1}/2 - \alpha|S_{>j}|(1+\delta)^{2jd+1}/2$
$\ \le (\delta d^2)|S_{>j}|(1+\delta)^{2(j+1)d-d}\big(\lambda + \varepsilon\alpha^2/(2cd^2) - c/4\big)$
$\ \le 0$,

for $c$, $\lambda$, and $\varepsilon$ chosen as above. (In the first inequality, the term $\varepsilon\delta^2 d^3|S_j|(1+\delta)^{2jd+d-1}$ is an upper bound on the contribution to $\Gamma_j$ by tokens in $S_j$. The third inequality follows from the fact that for $c > 4$ we have $(1+\delta)^d \le 2$.)

By the definitions of $\Gamma$ and $\Psi$, we have $\Phi_t \le \Phi_{t-1} - \Psi$ and $\Gamma = \varepsilon\delta^2 d^2\,\Phi_{t-1} - \Psi$. If $j$ is good during step $t$, then $\Gamma \le 0$, and the desired claim follows.

By Corollary 3.5.1, if $j$ is good during step $t$ then we have

$\Phi_t \le \Phi_{t-1}(1 - \varepsilon\delta^2 d^2).$

After $T = \lceil 4\lceil\ln\Phi_0\rceil/(\varepsilon\delta^2 d^2)\rceil$ steps, we have $\Phi_T \le \Phi_0(1 - \varepsilon\delta^2 d^2)^{T/4} < 1$. Since $\Phi_0 \le n(1+\delta)^{\Delta+1}/\delta$, $\ln\Phi_0 = O(\delta\Delta + \log n)$. Substituting for $\delta$, we obtain that within $O(\Delta/\alpha + d^2\ln n/\alpha^2)$ steps, $|S_{>2\log_{1+\alpha/(2d)} n}| \le |S_{>j}| = 0$.

We use an averaging argument to show that after $T$ steps, $|S_{<-2\log_{1+\alpha/(2d)} n}| \le n/2$. By redefining the potential function and repeating the above analysis in the other direction, we obtain that in another $T$ steps $|S_{<-4\log_{1+\alpha/(2d)} n}| = 0$.
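The arithmetic behind this choice of constants can be verified mechanically. The following sketch (our own check, not from the paper) confirms the three constraints on $c$, $\lambda$, $\varepsilon$ and the two derived product bounds for $c = 25$, $\lambda = 1$, $\varepsilon = 1/24$, sampling $\delta = \alpha/(cd^2)$ at its largest value $\alpha = 1$.

```python
# Check the constant choices c = 25, lambda = 1, epsilon = 1/24 used above.
c, lam, eps = 25, 1.0, 1.0 / 24

assert c > 4
assert c / 6 - 4 * eps >= 4 * lam      # makes the good-case factor <= 1
assert c / 4 - 2 * eps / lam >= 4      # makes the bad-case factor < 1

for d in (2, 4, 8, 16):
    delta = 1.0 / (c * d * d)          # delta = alpha/(c d^2) with alpha = 1
    f = c / 6                          # lower bound on f(c, alpha, d)
    # (1+delta)^(2d) <= 1 + 4*delta*d, used repeatedly in the proof:
    assert (1 + delta) ** (2 * d) <= 1 + 4 * delta * d
    # good case: (1+delta)^(2d) - delta*d*(f - 4*eps)/lam <= 1
    assert (1 + delta) ** (2 * d) - delta * d * (f - 4 * eps) / lam <= 1
    # bad case: (1+delta)^(2d) * (1 - c*delta*d/4 + 2*eps*delta*d/lam) < 1
    assert (1 + delta) ** (2 * d) * (1 - c * delta * d / 4
                                     + 2 * eps * delta * d / lam) < 1
```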
This completes the proof of Theorem 3.2.

3.3 Results in terms of node expansion

The proofs of Theorems 3.1 and 3.2 can be easily modified to analyze the algorithm in terms of the node expansion $\mu$ of the graph instead of the edge expansion $\alpha$. (The primary modifications that need to be made are to alter the definition of a good index and to set $\delta$ appropriately. We set $\delta = \mu/c$ (resp., $\delta = \mu/(cd)$) for the single-port model (resp., multi-port model) and call index $i$ good if $|S_i| \le \mu|S_{>i}|/2$.)

Theorem 3.3 For an arbitrary network $G$ with $n$ nodes, maximum degree $d$, node expansion $\mu$, and initial imbalance $\Delta$, the single-port algorithm load balances to within $O((\log n)/\mu)$ tokens in $O(d\Delta/\mu)$ steps with high probability.

Corollary 3.3.1 If $\Delta \ge (d\log n)/\mu$, the single-port algorithm balances to within $O((\log n)/\mu)$ tokens in $O((d\Delta)/\mu)$ steps with high probability. If $\Delta < (d\log n)/\mu$, the single-port algorithm balances to within $O((\log n)/\mu)$ tokens in $O((d^2\log n)/\mu^2)$ steps with high probability.

Theorem 3.4 For an arbitrary network $G$ with $n$ nodes, maximum degree $d$, node expansion $\mu$, and initial imbalance $\Delta$, the multi-port algorithm load balances to within $O((d\log n)/\mu)$ tokens in $O(\Delta/\mu)$ steps.

Corollary 3.4.1 If $\Delta \ge (d^2\log n)/\mu$, the multi-port algorithm balances to within $O((d\log n)/\mu)$ tokens in $O(\Delta/\mu)$ steps. If $\Delta < (d^2\log n)/\mu$, the multi-port algorithm balances to within $O((d\log n)/\mu)$ tokens in $O((d^2\log n)/\mu^2)$ steps.

4 Local load balancing can be expensive

In this section, we show that locally load balancing to within $2d$ tokens using the multi-port algorithm of [2] described in Section 1.1 can take $\Omega(n^{1/2})$ more time than globally load balancing to within $O((d\log n)/\mu)$ tokens. We extend this bound to the single-port algorithm presented in [14]; i.e., the expected number of additional steps this algorithm may take to perform local (to within one token) rather than global balancing is $\Omega(dn^{1/2})$. Furthermore, we show that in the single-port case, we may be one step away from being locally balanced to within one token but have an expected running time of $\Omega(\alpha n^{1/2})$ for reaching a locally balanced state. Finally, we prove upper bounds on the time each algorithm takes to reach a locally balanced state.

We will now construct a graph $G = (V,E)$ on $n$ nodes and show that this graph has node expansion $\Omega(\alpha)$. Let $0 < \alpha \le 1$ (we will actually need $\alpha \le 1/3$, as seen below). Let $V = (\cup_{i=0}^{k} L_i) \cup (\cup_{i=0}^{k-1} R_i)$, where $L_i$ and $R_i$ are sets of $(1+\alpha)^i$ nodes each. For convenience, let $L_{-1}$ (resp., $R_{-1}$) denote $R_0$ (resp., $L_0$), and let $R_k$ denote $L_k$. Note that each $L_i$ and each $R_i$ has size $(1+\alpha)^i = \alpha(\sum_{j=0}^{i-1}|L_j|) + 1 = \alpha(\sum_{j=0}^{i-1}|R_j|) + 1$. Thus $k = O(\log n/\log(1+\alpha)) = O(\log n/\alpha)$. Here we assume w.l.o.g. that $k = \log n/\log(1+\alpha)$ and that $k$ is even.
Let $0 < \alpha_0 \le 1/3$ and $\alpha \le \alpha_0$.

For each $i$, $0 \le i \le k$, let every subset $S$ of $L_i$ of size at most $3|L_i|/4$ have at least $\alpha_0|S|$ neighbors in $L_i \setminus S$, and let $L_i$ have maximum degree $d/2$, for some suitable value of $d$, which depends on $\alpha_0$. Let each node in $L_i$ have exactly $d(1+\alpha)/(2(2+\alpha))$ neighbors in $L_{i+1}$ and each node in $L_{i+1}$ have exactly $d/(2(2+\alpha))$ neighbors in $L_i$, $0 < i \le k-1$ (note that w.l.o.g. we will ignore integrality constraints, since we can always find many values of $d$ and $\alpha$ such that integrality holds). Using a counting argument on the number of edges between levels that are adjacent to each node, we see that every $S \subseteq L_i$ has at least $(1+\alpha)|S|$ neighbors in $L_{i+1}$. Now we consider how $L_{i+1}$ "expands" into $L_i$. We can use an approach similar to that of [23, 37] to show that we can choose the edges between $L_i$ and $L_{i+1}$, respecting the degree constraints above, such that any $S \subseteq L_{i+1}$ with $|S| \le |L_{i+1}|/(1+\alpha)^2$ has at least $(1+\alpha)|S|$ neighbors in $L_i$. Let the only node in $L_0$ be adjacent to every node in $L_1$. We obtain a similar structure for the $R_i$'s by replacing $L_j$ by $R_j$ in this paragraph. Note that the maximum degree of $G$ is at most $d$.

Lemma 4.1 $G$ has node expansion $\Omega(\alpha\alpha_0)$.

Proof: Let $S \subseteq V$, $|S| \le n/2$. Let $S_i = S \cap L_i$ and $S'_i = S \cap R_i$, $0 \le i \le k$. Let $N_Y(X) = \{y \in Y \mid (x,y) \in E,\ x \in X\}$ for any $X, Y \subseteq V$; if the set $Y$ is not specified, assume $Y = V$. Let $0 \le i \le k$:

1. If $|S_i| \le 3|L_i|/4$, then $S_i$ has at least $\alpha_0|S_i|$ neighbors inside $L_i \setminus S_i$.

2. If $i < k$ and $|S_i| > 2|L_i|/3$ and $|S_{i+1}| \le 2|L_{i+1}|/3$, then:

(a) if $2|L_i|/3 < |S_i| \le 3|L_i|/4$ then $|N_{L_i}(S_i)| - |S_i| \ge \alpha_0\cdot 2|L_i|/3 = (\alpha_0/3)(2|L_i|) > (\alpha_0/3)(\alpha\sum_{j=0}^{i-1}|L_j| + |L_i|) \ge (\alpha\alpha_0/3)(\sum_{j=0}^{i}|L_j|)$;

(b) if $|S_i| > 3|L_i|/4$ then $|N_{L_{i+1}}(S_i)| - |S_{i+1}| > (1+\alpha)\cdot 3|L_i|/4 - 2|L_{i+1}|/3 = 3|L_{i+1}|/4 - 2|L_{i+1}|/3 = |L_{i+1}|/12 > (\alpha/12)(\sum_{j=0}^{i}|L_j|)$.

Thus $|N(S_i)| - |S_i| \ge (\alpha\alpha_0/12)(\sum_{j=0}^{i}|S_j|)$.

Similar results hold if we replace $S_j$ by $S'_j$ and $L_j$ by $R_j$ in the items above.

Now we consider every $S_i$ (resp., $S'_i$) such that $|S_i| > 2|L_i|/3$ (resp., $|S'_i| > 2|R_i|/3$) but there is no $z > i$ such that $|S_z| \le 2|L_z|/3$ (resp., $|S'_z| \le 2|R_z|/3$). Let $i_\ell = \max_i\{i > 1 \mid |S_i| > 2|L_i|/3$ and $|S_{i-1}| \le 2|L_{i-1}|/3\}$ and $i_r = \max_j\{j > 1 \mid |S'_j| > 2|R_j|/3$ and $|S'_{j-1}| \le 2|R_{j-1}|/3\}$. Let $i_\ell = 0$ (resp., $i_r = 0$) in case no such $i$ (resp., $j$) exists. If $i_\ell$ and $i_r$ are not both equal to 0, then we must have either $\sum_{i=0}^{i_\ell-1}|L_i| \ge n/8$ or $\sum_{i=0}^{i_r-1}|R_i| \ge n/8$ (since $|S| \le n/2$). Assume w.l.o.g. that the former is true. This implies that $|L_{i_\ell}| > \alpha n/8$. Then:

a. if $|S_{i_\ell}| \ge |L_{i_\ell}|/(1+\alpha)^2$ then $N_{L_{i_\ell-1}}(S_{i_\ell}) = L_{i_\ell-1}$ and $|N_{L_{i_\ell-1}}(S_{i_\ell})| - |S_{i_\ell-1}| \ge |L_{i_\ell-1}|/3 = |L_{i_\ell}|/(3(1+\alpha)) > \alpha(\sum_{j=0}^{i_\ell-1}|L_j|)/(3(1+\alpha)) \ge \alpha n/48 = (\alpha/24)(n/2)$;

b. if $|L_{i_\ell}|/(1+\alpha)^2 > |S_{i_\ell}| > 3|L_{i_\ell}|/4$ then $|N_{L_{i_\ell-1}}(S_{i_\ell})| - |S_{i_\ell-1}| \ge (1+\alpha)|S_{i_\ell}| - 2|L_{i_\ell-1}|/3 > (1+\alpha)\cdot 3|L_{i_\ell}|/4 - 2|L_{i_\ell}|/(3(1+\alpha)) > (1+\alpha)\cdot 3|L_{i_\ell}|/4 - 2|L_{i_\ell}|/3 > |L_{i_\ell}|/12 > (\alpha n/8)/12 = (\alpha/48)(n/2)$;

c. if $|S_{i_\ell}| \le 3|L_{i_\ell}|/4$ then $|N_{L_{i_\ell}}(S_{i_\ell})| - |S_{i_\ell}| \ge \alpha_0|S_{i_\ell}| > \alpha_0\cdot 2|L_{i_\ell}|/3 \ge (2\alpha_0/3)(\alpha n/8) = (\alpha\alpha_0/6)(n/2)$.

Hence, since we count each neighbor of $S$ at most three times, and since we account for each $S_i$ and each $S'_i$ at least once in items (1) and (2) and in the preceding paragraph, $S$ has $\Omega(\alpha\alpha_0|S|)$ neighbors outside $S$ in $G$.

Corollary 4.1.1 If $\alpha_0 = 1/3$ then $G$ has node expansion $\Omega(\alpha)$, $0 < \alpha \le 1/3$.

Let $L_0 = \{u\}$ and $R_0 = \{v\}$. Let $G' = (V, E')$ be the graph obtained by adding to $G$ an edge $e$ connecting $v$ to $u$ (see Figures 1, 2). Note that the node expansion of $G$ cannot be reduced by the addition of new edges, and so the node expansion of $G'$ is also $\Omega(\alpha\alpha_0)$ ($= \Omega(\alpha)$, if we fix $\alpha_0$).
We state all the results in this section in terms of the node expansion of the network, rather than in terms of its edge expansion. This is done for the sake of making our arguments more intuitive and clear.

We present the cases below according to the initial distribution of tokens over the nodes of $G'$. We use $w(x)$ to denote the number of tokens on node $x$. We will use the fact that at any time step every node in $L_i$ (resp., $R_i$), $0 \le i \le k$, has exactly the same number of tokens, provided they had the same number of tokens initially. We prove this by induction. W.l.o.g. we prove it for the sets $L_i$ only; the same proof holds if we replace $L_j$ by $R_j$ below. Suppose every node in $L_i$ had the same number of tokens at time step $t-1$. A node $x \in L_i$ only sends a token to one of its neighbors in $L_{i+1}$ when it has at least $2d+1$ more tokens than its neighbor has. Thus if at time $t$, $x$ sends a token to some $y \in L_{i+1}$, then it sends a token to all its neighbors in that set, since all of them had the same number of tokens at time $t-1$ and the number of neighbors of $x$ in $L_{i+1}$ is at most $d$. Hence at time $t$, every edge between $L_i$ and $L_{i+1}$ is traversed by a token. Since every node in $L_{i+1}$ sees exactly the same number of nodes in $L_i$, they will all have received the same number of tokens from $L_i$. We can use a similar argument for tokens that move from $L_i$ to $L_{i-1}$. No token moves from a node in $L_i$ to another node inside $L_i$, since all nodes in $L_i$ had the same number of tokens at time $t-1$. When $i = k$, we only consider tokens moving from $L_k$ and $R_k$ to $L_{k-1}$ and $R_{k-1}$, resp. From the remark above, we see that tokens never move inside any $L_i$ or $R_i$, and so we ignore the edges inside each of these sets.

We group the sets $R_i$ and $L_i$ into L and R, groups of $k/2$ consecutive sets, and M, a group of $k+1$ consecutive sets (note that we have $2k+1$ distinct sets). Let L $= \{L_0, L_1, \ldots, L_{k/2-1}\}$, R $= \{R_0, R_1, \ldots, R_{k/2-1}\}$, and M $= \{L_{k/2}, L_{k/2+1}, \ldots, L_{k-1}, R_k(= L_k), R_{k-1}, \ldots, R_{k/2}\}$.
Our choice of L, M, and R is such that the number of sets in L is roughly half the number of sets in M.
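The level and group structure just described can be checked numerically. The sketch below (our own illustration; variable names are not from the paper) builds the real-valued level sizes $(1+\alpha)^i$, ignoring integrality as the text does, and verifies the identity $(1+\alpha)^i = \alpha\sum_{j<i}(1+\alpha)^j + 1$ together with the count of sets in the groups L, R, and M.

```python
# Sketch of the level structure of G (Section 4): levels L_0..L_k and
# R_0..R_{k-1}, each of (real-valued) size (1+alpha)^i. Integrality is
# ignored, as in the text; alpha and k below are arbitrary sample values.
alpha = 0.25     # construction parameter, 0 < alpha <= 1/3 eventually
k = 24           # assumed even

size = [(1 + alpha) ** i for i in range(k + 1)]

# Each level exceeds alpha times the total size of the levels below it by 1:
for i in range(k + 1):
    assert abs(size[i] - (alpha * sum(size[:i]) + 1)) < 1e-6

# Groups L and R hold k/2 sets each; M holds the remaining k+1 sets.
L_group = size[: k // 2]
R_group = size[: k // 2]
M_group = size[k // 2 : k + 1] + size[k // 2 : k]  # L_{k/2}..L_k, R_{k/2}..R_{k-1}
assert len(L_group) + len(R_group) + len(M_group) == 2 * k + 1
```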

4.1 The local algorithm may take $\Omega(n^{1/2})$ steps to locally balance $G'$

Suppose that initially $w(z) = m + 1$ for every node $z$ in R, where $m$ is an integer such that $m \ge 2kd$. For all $z \in R_i$, $R_i \in$ M, let $w(z) = m - 2(i - k/2)d$; for all $z \in L_i$, $L_i \in$ M, let $w(z) = m - 2(3k/2 - i)d$. For all $z \in L_i$, $L_i \in$ L, let $w(z) = m - 2(i + k/2 + 1)d$. Then $w$ is globally balanced to within $O((d\log n)/\alpha)$ tokens, but it is not locally balanced to within $2d$ tokens, since $w(v) - w(u) > kd$ ($\ge 2d$). See Figure 1.

There are at least $(1+\alpha)^{k/2-1} = (1+\alpha)^{\frac{\log n}{2\log(1+\alpha)}-1} = \sqrt{n}/(1+\alpha) \ge \sqrt{n}/2$ nodes in R, as there are in L. We claim that in order for $G'$ to be locally balanced to within $2d$, we need to move at least $\sqrt{n}/2$ tokens from R to L across edge $e$. Since at most one token at a time can traverse $e$, it will take time $\Omega(n^{1/2})$ to locally balance $G'$ to within $2d$ tokens. Our proof proceeds as follows:

i. By observing how the tokens flow in $G'$ (every node in $R_j$ (resp., $L_j$) is identical with respect both to the number of tokens it has and to the number of neighbors it sees in $R_{j-1}$ and $R_{j+1}$ (resp., $L_{j-1}$ and $L_{j+1}$)), we can see that no token ever moves from left to right (i.e., from $L_i$ to $L_{i-1}$ or from $R_i$ to $R_{i+1}$) in $G'$. In particular, no token ever moves from R to M or from M to L.

ii. Now we show that we can only have $w(u) > m - (k+2)d$ and $w(x) - w(y) \le 2d$, $\forall x \in L_i$, $\forall y \in L_{i+1}$, $\forall L_i \in$ L, after $\sqrt{n}/2$ steps. Suppose we reach such a configuration for $w$ at some time $t$. Then every node in L has at least one more token than it had initially. That is, we have at least $|$L$| \ge \sqrt{n}/2$ "extra" tokens in L at time $t$, all of which must have reached L by traversing $e$ from $v$ to $u$, since no token moves from M to L. And since only one token at a time can traverse $e$, we have $t \ge \sqrt{n}/2$.

iii. We also show that we can only have $w(v) \le m - kd$ and $w(y) - w(x) \le 2d$, $\forall x \in R_i$, $\forall y \in R_{i+1}$, $\forall R_i \in$ R, after $\sqrt{n}/2$ steps.
A counting argument (similar to the one above) on the number of tokens in R, together with the fact that no token is ever sent from R to M, suffices to show this.

It follows from (ii) and (iii) that in any of the first $\sqrt{n}/2$ steps, either $v$ sends a token to $u$, or R or L is not $2d$-locally balanced; so $G'$ is not locally balanced before these steps have elapsed. Hence the algorithm will take additional time $\Omega(\sqrt{n})$ to locally balance $G'$.

Note that the arguments above can be easily modified to hold for the single-port model (divide the number of tokens above height $m$ by $2d$ on each node not in R $\cup\, R_{k/2}$, consider differences of at most one token between "adjacent" $R_i$'s or $L_i$'s, and account for randomization), since the lower bound on the number of steps required to reach a locally balanced state is given only in terms of how many tokens need to traverse the edge $e$. Any edge is selected with probability $O(1/d)$ at each iteration of the single-port algorithm. Thus the expected time until $e$ is selected at some time step is $\Omega(d)$. Hence, if $G'$ is globally balanced to within $O((\log n)/\alpha)$ tokens, it will take $\Omega(dn^{1/2})$ expected time for $G'$ to reach a locally balanced state to within 1 token in the single-port model.

4.2 The local algorithm may take one step or $\Omega(\alpha n^{1/2})$ expected time to reach local balance

Here we consider the single-port model. Suppose we have the following distribution of tokens: for all $z \in R_{k/2-1}$, let $w(z) = m+1$ (where $m$ is an integer such that $m \ge k$); for all $z \in R_i$, $i \le k/2-2$ (note that $R_i \in$ R), let $w(z) = m - (k/2 - i - 1)$; for all $z \in R_{k/2}$, let $w(z) = m - 1$; for all $z \in R_i$, $i \ge k/2+1$ (note that $R_i \in$ M), let $w(z) = m - (i - k/2)$; for all $z \in L_i$, $L_i \in$ M, let $w(z) = m - (3k/2 - i)$. For all $z \in L_i$, $L_i \in$ L, let $w(z) = m - (k/2 + i)$.
Thus $w$ is globally balanced to within $O((\log n)/\alpha)$ tokens, but it is not locally balanced to within 1 token, since $w(x) - w(y) > 1$ for any $x \in R_{k/2-1}$ and $y \in R_{k/2} \cup R_{k/2-2}$. See Figure 2.

The intuition for this case is that if all tokens move in the "right direction" initially, we reach a locally balanced state in a single time step. Otherwise, if a large number of tokens move in the "wrong direction"

in the first step, it will take a long time to reach such a state. If every node in $R_{k/2-1}$ is matched with some node in $R_{k/2}$ (we can assume that every node in $R_{k/2-1}$ has a distinct neighbor in $R_{k/2}$), $G'$ reaches local balance in a single time step. On the other hand, if we move tokens across a matching between the nodes in $R_{k/2-1}$ and $R_{k/2-2}$, then these tokens will clearly continue moving "down" (non-increasing indices of $R_i$), and never move "up". The expected size of such a matching is $\Omega(|R_{k/2-1}|/d)$ (each node in $R_{k/2-1}$ has exactly $d/(2(2+\alpha)) \ge d/5$ neighbors in $R_{k/2-2}$). Using an analysis similar to that of the previous case (taking care of randomization), we see that no token ever moves from M to L, and that either $w(v) > m - k/2 + 1$ and $w(u) = m - k/2$, or L $\cup$ R is not locally balanced, for each of the $\Omega(|R_{k/2-1}|/d)$ initial steps. Thus, once any of these tokens (that moved from $R_{k/2-1}$ to $R_{k/2-2}$ in the first step) reaches $v$, it will eventually traverse $e$ onto $u$.

Hence, eventually all tokens that moved from $R_{k/2-1}$ to $R_{k/2-2}$ in the initial step will reach $u$. Since $|R_{k/2-1}| \ge \alpha n^{1/2}/(1+\alpha)^2$ and $1 \le (1+\alpha)^2 \le 2$, and since the expected time for edge $e$ to be in the selected matching is $\Omega(d)$, the expected time for $G'$ to reach local balance to within one token is $\Omega(\alpha n^{1/2})$.

4.3 Convergence to a locally balanced state

We now prove that if the graph $G$ is globally balanced to within $\Delta$ tokens, then in $O(n\Delta^2/d)$ subsequent steps the multi-port algorithm locally balances $G$ to within $2d$ tokens. Define the potential $\Phi$ of the network as $\sum_{v\in V}(w(v) - \rho)^2$, where $\rho$ denotes the average load. If the network is globally balanced to within $\Delta$ tokens, then $\Phi = O(n\Delta^2)$. At any step, if there exists an edge $(u,v)$ such that $|w(u) - w(v)| \ge 2d+1$, then a token is transmitted along $(u,v)$, resulting in a potential drop of at least $d$. Thus, within $O(n\Delta^2/d)$ steps the network is locally balanced.
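The potential argument of Section 4.3 can be exercised on a toy instance. The sketch below (our own code, with invented names) runs a simplified sequential variant of the multi-port rule, repeatedly moving one token across some edge whose endpoints differ by at least $2d+1$, and checks that each move lowers the potential by at least $d$ (in fact by $2(w(u)-w(v)) - 2 \ge 4d$).

```python
# Toy check of the Section 4.3 potential argument (sequential sketch only;
# the actual multi-port algorithm moves tokens on many edges at once).
def phi(w, rho):
    return sum((x - rho) ** 2 for x in w)

def locally_balance(w, edges, d):
    """Move single tokens across overloaded edges until none remains."""
    rho = sum(w) / len(w)
    moves = 0
    while True:
        hot = [(u, v) for (u, v) in edges if abs(w[u] - w[v]) >= 2 * d + 1]
        if not hot:
            return moves
        u, v = hot[0]
        hi, lo = (u, v) if w[u] > w[v] else (v, u)
        before = phi(w, rho)
        w[hi] -= 1
        w[lo] += 1
        # moving one token drops the potential by 2(w(hi)-w(lo)) - 2 >= 4d
        assert phi(w, rho) <= before - d
        moves += 1

# Example: a path on 6 nodes (maximum degree d = 2), all tokens at one end.
w = [30, 0, 0, 0, 0, 0]
path = [(i, i + 1) for i in range(5)]
locally_balance(w, path, d=2)
assert all(abs(w[u] - w[v]) <= 2 * 2 for (u, v) in path)
assert sum(w) == 30
```

Since the potential starts at $O(n\Delta^2)$ and each move recovers at least $d$, the $O(n\Delta^2/d)$ step bound follows for this sequential variant.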
Similarly, we show that if the graph $G$ is globally balanced to within $\Delta$ tokens, then the single-port algorithm locally balances to within 1 token in $O(nd\Delta^2)$ subsequent steps with high probability.

5 Extension to dynamic and asynchronous networks

In this section, we extend our results for the multi-port model of Section 3.2 to dynamic and asynchronous networks. We first prove that a variant of the local multi-port algorithm is optimal on dynamic synchronous networks in the same sense as for static synchronous networks. We then use a result of [2] that relates the dynamic synchronous and asynchronous models to extend our results to asynchronous networks.

In the dynamic synchronous model, the edges of the network may fail or succeed dynamically. An edge $e \in E$ is live during step $t$ if $e$ can transmit a message in each direction during step $t$. We assume that at each step each node knows which of its adjacent edges are live. The local load balancing algorithm for static synchronous networks can be modified to work on dynamic synchronous networks. The algorithm presented here is essentially the same as in [2]. Every node $u$ maintains an estimate $e_u(v)$ of the number of tokens at $v$ for every neighbor $v$ of $u$. (The value of $e_u(v)$ at the start of the algorithm is arbitrary.) In every step of the algorithm, which we call DS, each node $u$ does the following:

(1) For each live neighbor $v$ of $u$, if $w(u) - e_u(v) > 12d$, $u$ sends a message consisting of $w(u)$ and a token; otherwise $u$ sends a message consisting only of $w(u)$.

(2) For each message received from a live neighbor $v$, $e_u(v)$ is updated according to the message, and if the message contains a token, $w(u)$ is increased by one.

For every integer $i$, we denote by $S_i$ the set of nodes having between $\rho - 12d + 24id$ and $\rho + 12d - 1 + 24id$ tokens. Consider $T$ steps of DS. We assume without loss of generality that $|S_{>0}| \ge n/2$ at the start of at least $T/2$ steps.
There exists an index $j$ in $[1, \lceil 2\log_{1+\alpha/(2d)} n\rceil]$ that is good in at least half the steps at the start of which $|S_{>0}| \ge n/2$. If index $j$ is good at the start of step $t$, we call $t$ a good step. For any token $p$, let $h_t(p)$ denote the height of $p$ after step $t$, $t > 0$. For convenience, we denote the height of $p$ at the start of DS by $h_0(p)$. Similarly, for $t \ge 0$, we define $h_t(u)$ for every node $u$ and $e^u_t(v)$ for every edge $(u,v)$.
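The per-node behavior of DS described above can be sketched as follows. This is our own illustrative reading of steps (1) and (2): messages carry the sender's start-of-step load plus a token flag, and a receiver updates its estimate to the reported load, minus one if a token accompanied the message (the paper only says the estimate is "updated according to the message").

```python
# One synchronous round of DS on the currently live edges (sketch).
# w: node -> token count; est: est[u][v] = u's estimate of v's load.
def ds_step(w, est, live_edges, d):
    start = dict(w)  # loads as reported in this round's messages
    msgs = []        # (sender, receiver, reported_load, has_token)
    for (u, v) in live_edges:
        for a, b in ((u, v), (v, u)):
            has_token = start[a] - est[a][b] > 12 * d
            msgs.append((a, b, start[a], has_token))
            if has_token:
                w[a] -= 1   # the token leaves the sender
    for (a, b, load, has_token) in msgs:
        est[b][a] = load - (1 if has_token else 0)
        if has_token:
            w[b] += 1   # step (2): receive the token, update the estimate
    return w, est

# Two nodes, d = 1, arbitrary initial estimates; tokens flow until the
# sender's lead over its estimate is at most 12d.
w = {0: 30, 1: 0}
est = {0: {1: 0}, 1: {0: 0}}
for _ in range(15):
    ds_step(w, est, [(0, 1)], d=1)
assert w[0] + w[1] == 30          # tokens are conserved
assert w[0] - w[1] <= 12 * 1 + 2  # the gap has closed to O(d)
```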

With every token at height $h$, we associate a potential of $\phi(h)$, where $\phi : \mathbf{N} \to \mathbf{R}$ is defined as follows: $\phi(x) = 0$ if $x \le 24jd - 11d$, and $\phi(x) = (1+\delta)^x$ otherwise, where $\delta = \alpha/(cd^2)$ and $c > 0$ is a constant to be specified later. Let $\Phi_t$ denote the total potential of the network after step $t$. Let $\Psi_t$ denote the potential drop during step $t$.

We analyze DS by means of an amortized analysis over the steps of the algorithm. Let $E_t$ be the set $\{(u,v) : (u,v)$ is live, $u \in S_{>j}$, and $h_{t-1}(u) - h_{t-1}(v) \ge 24d\}$. For every step $t$, we assign an amortized potential drop of

$\hat\Psi_t = \frac{1}{50}\sum_{(u,v)\in E_t,\ h_{t-1}(u)>h_{t-1}(v)} \big(\phi(h_{t-1}(u) - d) - \phi(h_{t-1}(v) + d)\big).$

The definition of $\hat\Psi_t$ is analogous to the amount of potential drop that we use in step $t$ in the argument of Section 3.2 for the static case. By modifying that argument slightly and choosing appropriate values for the constants $c$ and $\varepsilon$, we can show the following lemma.

Lemma 5.1 If the live edges of $G$ have an edge expansion of $\alpha$ during every step of DS, then for every good step $t$ we have $\hat\Psi_t \ge \varepsilon\delta^2 d^2\,\Phi_{t-1}$, where $\varepsilon$ is a constant chosen appropriately.

Proof Sketch: Let $M_i$ denote the set of live edges between nodes in $S_{<i}$ and nodes in $S_{>i}$. Let $m_i = |M_i|$. For any node $u$, let $g(u)$ denote the group to which $u$ belongs prior to step $t$. We now place a lower bound on $\hat\Psi_t$ that is analogous to the bound on $\Psi$ in Lemma 3.4 of Section 3.2. By the definition of $\hat\Psi_t$, we have

$\hat\Psi_t = \frac{1}{50}\sum_{(u,v)\in E_t,\ (u,v)\notin M_j,\ h_{t-1}(u)>h_{t-1}(v)} \big(\phi(h_{t-1}(u) - d) - \phi(h_{t-1}(v) + d)\big) + \frac{1}{50}\sum_{(u,v)\in E_t,\ (u,v)\in M_j,\ h_{t-1}(u)>h_{t-1}(v)} \big(\phi(h_{t-1}(u) - d) - \phi(h_{t-1}(v) + d)\big)$

$\ \ge \frac{1}{50}\sum_{(u,v)\in E_t,\ (u,v)\notin M_j,\ h_{t-1}(u)>h_{t-1}(v)}\ \sum_{g(v)<i<g(u)} \big(\phi(24(i+1)d - 13d) - \phi(24(i-1)d + 13d)\big)$
$\qquad + \frac{1}{50}\sum_{(u,v)\in E_t,\ (u,v)\in M_j,\ h_{t-1}(u)>h_{t-1}(v)}\ \sum_{j\le i<g(u)} \big(\phi(24(i+1)d - 13d) - \phi(24(i-1)d + 13d)\big)$

$\ \ge \frac{2}{250}\sum_{i>j}\ \sum_{(u,v)\in M_i,\ h_{t-1}(u)>h_{t-1}(v)} \big(\phi(24(i+1)d - 13d) - \phi(24(i-1)d + 13d)\big) + \frac{1}{50}\sum_{(u,v)\in M_j,\ h_{t-1}(u)>h_{t-1}(v)} \phi(24(j+1)d - 13d)$

$\ \ge \frac{2}{250}\sum_{i>j}\ \sum_{(u,v)\in M_i,\ h_{t-1}(u)>h_{t-1}(v)} \delta d\,(1+\delta)^{24id-11d} + \frac{1}{50}\sum_{(u,v)\in M_j,\ h_{t-1}(u)>h_{t-1}(v)} (1+\delta)^{24jd+11d}$

$\ \ge \frac{2}{250}\Big(\sum_{i>j} m_i\,\delta d\,(1+\delta)^{24id-11d}\Big) + \frac{1}{50}\,m_j(1+\delta)^{24jd+11d}.$

(The second step can be verified by expanding the two inner summations indexed by $i$. The third step is obtained from the second step by combining the two summations and redistributing the terms according to the value of index $i$. In the fourth step we use the inequality $(1+x)^{22d} \ge 1 + 22dx$ for $x \ge 0$.)

We can then establish claims similar to Lemma 3.5 and Corollary 3.5.1 of Section 3.2 by just modifying the constants in the proofs. Thus we have $\hat\Psi_t \ge \varepsilon\delta^2 d^2\,\Phi_{t-1}$ for a constant $\varepsilon$ chosen appropriately.

The following crucial lemma relates the amortized potential drops to the actual potential drops.

Lemma 5.2 For any initial load distribution and any step $t_0 > 0$, we have

$\sum_{t\le t_0}\Psi_t \ge \Big(\sum_{t\le t_0}\hat\Psi_t\Big) - 2\Phi_0 - n^2\phi(24jd)$.

The main result follows from Lemmas 5.1 and 5.2. We first show that within $O(1/(\varepsilon\delta^2 d^2))$ steps, there is a step at which the actual potential of the network either decreases by a factor of 2 or falls below a threshold value.

Lemma 5.3 Let $t$ be any integer such that at least $7/(\varepsilon\delta^2 d^2)$ of the first $t$ steps are good. There exists $t' \le t$ such that $\Phi_{t'} \le \max\{\Phi_0/2,\ n^2\phi(24jd)\}$.

Proof: If $\Phi_0 \le n^2\phi(24jd)$, then the claim is proved for $t' = 0$. For the remainder of the proof, we assume that $\Phi_0 \ge n^2\phi(24jd)$. If $\Phi_{t'} \le \Phi_0/2$ for any $t' < t$, the claim is proved. Otherwise, for all $t' < t$, we have $\Phi_{t'} > \Phi_0/2$. In this case, we obtain

$\Phi_t = \Phi_0 - \sum_{t'\le t}\Psi_{t'}$
$\ \le 3\Phi_0 + n^2\phi(24jd) - \sum_{t'\le t}\hat\Psi_{t'}$
$\ \le 4\Phi_0 - \sum_{t'\le t,\ t'\ \mathrm{good}} (\varepsilon\delta^2 d^2)\,\Phi_{t'-1}$
$\ \le \Phi_0/2.$

(In the second step, we invoke Lemma 5.2. In the third step, we invoke Lemma 5.1 and use the inequalities $\Phi_0 \ge n^2\phi(24jd)$ and $\hat\Psi_{t'} \ge 0$ for every $t'$.
In the fourth step, we use the fact that at least $7/(\varepsilon\delta^2 d^2)$ of the $t$ steps are good and the inequality $\Phi_{t'-1} > \Phi_0/2$ for every $t' \le t$.)

Theorem 5.1 For an arbitrary network $G$ with $n$ nodes, degree $d$, and initial imbalance $\Delta$, if the live edges of $G$ have edge expansion $\alpha$ at every step $t$, then the dynamic synchronous multi-port algorithm load balances to within $O(d^2(\log n)/\alpha)$ tokens in $O(\Delta/\alpha)$ steps.

Proof: By repeatedly invoking Lemma 5.3, we obtain that within $\lceil 7/(\varepsilon\delta^2 d^2)\rceil\lceil\log\Phi_0\rceil$ good steps, there exists a step after which the potential of the network is at most $n^2\phi(24jd)$. (Note that the fact that Lemma 5.2 holds for arbitrary initial values of the estimates, the $e_u(v)$'s, is crucial here.) Since at least $T/4$ of the first $T$ steps are good, there exists $t \le 4\lceil 7/(\varepsilon\delta^2 d^2)\rceil\lceil\log\Phi_0\rceil$ such that $\Phi_t \le n^2\phi(24jd)$. Since $\Phi_0 \le n(1+\delta)^{\Delta+1}/\delta$, we have $\log\Phi_0 \le \log n + (\Delta+1)\log(1+\delta) - \log\delta$. Since $\delta = \alpha/(cd^2)$ and $\log(1+\delta) = O(\delta)$, we have $t = O((\Delta/\alpha) + d^2(\log n)/\alpha^2)$.

Let $h$ be the maximum height of any node after step $t$. We thus have

$\phi(h) \le \Phi_t \le n^2(1+\delta)^{24jd}$.

Therefore $h \le \max\{24jd - 11d,\ \log_{1+\delta}(n^2(1+\delta)^{24jd})\}$. Since $\log(1+\delta) \ge \delta/2$ for $c$ appropriately large, we have

$h \le 24jd + (2\log n)/\log(1+\delta) \le 24jd + (4\log n)/\delta = O((d^2\log n)/\alpha).$

(The last step follows from the relations $\delta = \alpha/(cd^2)$ and $j = O((d\log n)/\alpha)$.) Thus, at the end of step $t$, no node has more than $a = \rho + h$ tokens. We now prove by contradiction that for every step $t' > t$, no node has more than $a + d$ tokens. Let $t'$ be the first step after step $t$ such that there exists some node $u$ with more than $a + d$ tokens. (If no such $t'$ exists, the claim holds trivially.) Of the $d+1$ highest tokens received by $u$ after step $t$, at least 2 tokens (say $p$ and $q$) were last sent by the same neighbor $v$ of $u$. Without loss of generality, we assume that $p$ arrived at $u$ before $q$. Let $t_1$ be the step at which $p$ was last sent by $v$ to $u$. Therefore, we have $e^v_{t_1}(u) - \rho \ge h_{t_1}(p) - d \ge a - d$. Hence $q$ can be sent to $u$ only when $v$ has height at least $a + 11d$, which contradicts our choice of $t'$.

We have shown that after $O(\Delta/\alpha + (d^2\log n)/\alpha^2)$ steps, no node ever has more than $\rho + O((d^2\log n)/\alpha)$ tokens. An easy averaging argument shows that there exists $k = O((d\log n)/\alpha)$ such that after every step $t' \ge t$, $|S_{<-k}| \le n/2$. By defining an appropriate potential function for tokens with heights below the average and repeating the analysis done for $S_{>j}$, we show that in another $O((\Delta/\alpha) + d^2(\log n)/\alpha^2)$ steps, all nodes have more than $\rho - O(d^2(\log n)/\alpha)$ tokens.

We now prove Lemma 5.2. In order to do this, we need to address two issues that arise in the dynamic setting: (i) potential gains, i.e., when a token gains height, and (ii) the absence of an actual potential drop across edges that join nodes differing by at least $24d$ tokens. We show that for either of these events to occur, "many" tokens must have lost height in previous steps. We use a part of this prior potential drop to account for (i) and (ii). We now define a notion of goodness of the tokens.
Initially, all tokens are unmarked. After any step $t$, every token $p$ that is moved along an edge is marked good if $h_{t-1}(p) - h_t(p) \ge 6d$; otherwise $p$ is marked bad. The marking of tokens that do not move is unchanged.

Lemma 5.4 For any two bad tokens $p_1$ and $p_2$ present at any node $u$ at the start of any step $t$, if $p_1$ and $p_2$ were last sent to $u$ by the same neighbor $v$ of $u$, then $|h_t(p_1) - h_t(p_2)| > 4d$.

Proof: Let $t_1$ (resp., $t_2$) be the step during which $p_1$ (resp., $p_2$) was last sent to $u$. Without loss of generality, we assume $t_1 < t_2$. Thus we have $h_t(p_1) < h_t(p_2)$. By the definition of the algorithm, $e^v_{t_1}(u) \ge \rho + h_{t_1}(p_1) - d$. Since $p_1$ remains at $u$ during the interval $[t_1, t_2)$, we find that $e^v_{t'}(u) \ge \rho + h_{t'}(p_1) - d$ for every step $t'$ in $[t_1, t_2)$. Therefore, $h_{t_2-1}(p_2) \ge h_{t_2-1}(v) - d \ge e^v_{t_2-1}(u) - \rho + 11d \ge h_{t_2-1}(p_1) + 10d$. Since $p_2$ is bad, we also have $h_t(p_2) > h_{t_2-1}(p_2) - 6d \ge h_{t_2-1}(p_1) + 4d$. Since $h_t(p_2) = h_{t_2}(p_2)$ and $h_t(p_1) = h_{t_2-1}(p_1)$, the lemma follows.

Corollary 5.4.1 At any time, for any node $u$ and integer $i > 0$, there are at most $d$ bad tokens with heights in $[i, i+4d]$.

Proof of Lemma 5.2: Consider an arbitrary step $t$ of the algorithm. For every token $p$ transferred from $u$ to $v$ in step $t$, we assign some credit to every edge adjacent to $u$ or $v$. Specifically, if $p$ is marked good after step $t$, we assign a credit of $9(\phi(h_t(p) + 6d) - \phi(h_t(p)))/(20d)$ units to every edge adjacent to $u$

or $v$. If $p$ is marked bad, we assign a credit of $(\phi(h_t(p) + d) - \phi(h_t(p)))/(20d)$ units to every edge adjacent to $u$ or $v$. For a token transferred from $u$ to $v$, the credit assigned to edges adjacent to $u$ (resp., $v$) is called outgoing credit (resp., incoming credit). Also, for each pair $(u,v)$, we assign an initial credit of $2\max\{\phi(24jd),\ \phi(h_0(u) - d) + \phi(h_0(v) - d)\}$ units at the start of the analysis. The total initial credit $I$ is bounded as follows:

$I \le 2\binom{n}{2}\phi(24jd) + \sum_{(u,v)\in E} 2(\phi(h_0(u) - d) + \phi(h_0(v) - d))$
$\ \le n^2\phi(24jd) + \sum_{u\in V}\sum_{0\le \ell < d} 2\phi(h_0(u) - \ell)$
$\ \le n^2\phi(24jd) + 2\Phi_0.$

(This bound corresponds to the negative terms in the statement of Lemma 5.2.)

We now show that using the above accounting method, we can account for the amortized potential drop of $(\phi(h_{t-1}(u) - d) - \phi(h_{t-1}(v) + d))/50$ units at step $t$ for every live edge $(u,v) \in E_t$. To accomplish this, for every live edge $(u,v)$ ($(u,v)$ not necessarily in $E_t$), we consider three cases: (i) a token $p$ sent from $u$ to $v$ is marked good, (ii) a token $p$ sent from $u$ to $v$ is marked bad, (iii) no token is sent from $u$ to $v$.

We first consider case (i). When a token $p$ is marked good after being sent along $(u,v)$, we pay for the amortized drop $D_1$ as well as the credit $D_2$ associated with $p$ from the actual potential drop of $p$:

$D_1 + D_2 = (\phi(h_{t-1}(u) - d) - \phi(h_{t-1}(v) + d))/50 + 2d\cdot 9(\phi(h_t(p) + 6d) - \phi(h_t(p)))/(20d)$
$\ \le (\phi(h_{t-1}(p)) - \phi(h_t(p)))/50 + 9(\phi(h_{t-1}(p)) - \phi(h_t(p)))/10$
$\ \le \phi(h_{t-1}(p)) - \phi(h_t(p)).$

(The second step follows from the fact that $p$ is a good token, which implies that $h_t(p) \le h_{t-1}(p) - 6d$.)

We now consider case (ii). In this case we need to account for: (1) if $h_t(p) > h_{t-1}(p)$, an amount equal to the potential increase, $D_1 = \phi(h_t(p)) - \phi(h_{t-1}(p))$ units, and (2) a credit of $D_2 = (\phi(h_t(p) + d) - \phi(h_t(p)))/10$ units. We have two subcases, depending on whether $t$ is the first step during which $(u,v)$ is live (subcase (a)) or not (subcase (b)). In subcase (a), if $h_0(v) \ge h_t(p) - d$, the initial credit $C_0$ associated with $(u,v)$ is at least $2\max\{\phi(24jd),\ \phi(h_t(p) - 2d)\}$.
It follows from Corollary A.1.1 that $3C_0/4 \ge \phi(h_t(p))$ and $C_0/4 \ge \phi(h_t(p) + d)/10$. Therefore, we have $C_0 \ge D_1 + D_2$.

We now consider subcase (a) under the assumption that $h_0(v) < h_t(p) - d$. In order to do the accounting, we use part of the incoming credit associated with the edge $(u,v)$ due to the set $X$ of good tokens of $v$ with heights in the interval $[h_0(v), h_t(p) - d]$. (Note that each token in $X$ is marked and thus has contributed incoming credit to all edges adjacent to $v$.) For each token $x$ in $X$, we use $c_x$ units of credit, equal to $8(\phi(h_t(x) + 6d) - \phi(h_t(x)))/(20d)$ units of the incoming credit assigned to $(u,v)$ by $x$. Let $C_1$ denote $\sum_{x\in X} c_x$. By invoking Corollary 5.4.1, we obtain:

$C_1 \ge \frac{3d\cdot 8}{20d}\sum_{1\le i\le \lfloor (h_t(p)-d-h_0(v))/(4d)\rfloor} \big(\phi(h_t(p) - d - 4id + 6d) - \phi(h_t(p) - d - 4id)\big)$
$\ \ge (3\cdot 8)(\phi(h_t(p) + d) - \phi(h_0(v) + 4d))/20$
$\ = 6(\phi(h_t(p) + d) - \phi(h_0(v) + 4d))/5.$

Since $p$ is marked bad after step $t$, we have $h_t(p) > h_{t-1}(p) - 6d$. Therefore,

$C_0 + C_1 \ge 6(\phi(h_t(p) + d) - \phi(h_0(v) + 4d))/5 + 2\max\{\phi(24jd),\ \phi(h_0(v) - d)\}$
$\ \ge 6\phi(h_t(p) + d)/5$
$\ \ge \phi(h_t(p)) - \phi(h_{t-1}(p)) + (\phi(h_t(p) + d) - \phi(h_t(p)))/10$
$\ \ge D_1 + D_2.$

(In the second step we use the inequality $2\max\{\phi(24jd),\ \phi(h_0(v) - d)\} \ge 6\phi(h_0(v) + 4d)/5$, which follows from Corollary A.1.1.)

We use a similar argument to handle subcase (b). Here the set $X$ is the set of good tokens of $v$ with heights in the interval $[e^u_{t-1}(v) - \rho,\ h_t(p) - d]$. Let $c_x$ and $C_1$ be defined as in subcase (a). Let $y$ denote $e^u_{t-1}(v) - \rho + 4d$. We first observe that since $u$ sent a token to $v$ during step $t$, $y \le h_{t-1}(u) - 8d \le h_{t-1}(p) - 7d$. Also, since $p$ is a bad token, we have $y \le h_{t-1}(p) - 7d \le h_t(p) - d$. As in subcase (a), we have:

$C_1 \ge 6(\phi(h_t(p) + d) - \phi(y))/5$
$\ = (\phi(h_t(p) + d) - \phi(y)) + (\phi(h_t(p) + d) - \phi(y))/5$
$\ \ge \phi(h_t(p) + d) - \phi(h_{t-1}(p) + d) + (\phi(h_t(p) + d) - \phi(h_t(p)))/5$
$\ \ge \phi(h_t(p)) - \phi(h_{t-1}(p)) + D_2$
$\ = D_1 + D_2.$

To complete the proof for case (ii), we show that any incoming credit assigned by a token $q$ of $v$ to edge $(u,v)$ that is used at step $t$ for case (ii) is not used again for case (ii). To prove this, it is enough to note that if $q$ belongs to $X$, then for every later step $t' > t$ until $q$ is transferred by $v$, we have $h_{t'}(q) \le e^u_{t'-1}(v) - \rho$.

We need to consider case (iii) only under the assumption that $(u,v) \in E_t$. In this case we account for $D = (\phi(h_{t-1}(u) - d) - \phi(h_{t-1}(v) + d))/50$ units of potential. Again we consider two subcases, depending on whether $t$ is the least step in which $(u,v)$ is live (subcase (a)) or not (subcase (b)). We first consider subcase (a). If $h_0(u) \ge h_{t-1}(u) - 12d$, then we use $C_0 = 2\max\{\phi(24jd),\ \phi(h_0(u) - d)\}$ units of the initial credit associated with $(u,v)$. It follows from Corollary A.1.1 that $C_0 \ge \phi(h_{t-1}(u) - d)/50 \ge D$. If $h_0(u) < h_{t-1}(u) - 12d$, then in addition to $C_0$ we also use part of the incoming credit associated with the set of tokens $Y = \{y : y$ is a token of $u$ and $h_0(u) < h_t(y) \le h_{t-1}(u)\}$. Specifically, for every token $y$ in $Y$, we use $(\phi(h_t(y) + d) - \phi(h_t(y)))/(50d)$ units of the incoming credit assigned to $(u,v)$ by $y$. (Note that since $h_t(y) > h_0(u)$, token $y$ has moved and hence has some associated credit.
Moreover, at least $(\phi(h_t(y) + d) - \phi(h_t(y)))/(20d)$ units of incoming credit remain unused in the analysis of case (ii).) Let this credit be denoted $C_1$. Thus, we obtain

$C_0 + C_1 \ge C_0 + \sum_{h_0(u) < k \le h_{t-1}(u)} (\phi(k + d) - \phi(k))/(50d)$
$\ \ge C_0 + (\phi(h_{t-1}(u)) - \phi(h_0(u) + d))/50$
$\ \ge \phi(h_{t-1}(u))/50$
$\ \ge D.$

(In the third step we use the inequality $C_0 \ge \phi(h_0(u) + d)/50$, which follows from Corollary A.1.1.)

We now consider subcase (b) of (iii). Since no token was sent along $(u,v)$ in step $t$, we have $e^u_{t-1}(v) - \rho > h_{t-1}(u) - 12d$ ($\ge 24jd$). It follows that $e^u_{t-1}(v) - \rho > h_{t-1}(v) + 12d$. Since the last step in which $(u,v)$ was live, at least $e^u_{t-1}(v) - \rho - h_{t-1}(v)$ tokens have left $v$. We use the outgoing credit $C_2$ assigned to $(u,v)$ due to these token transmissions. We have

$C_2 \ge \sum_{h_{t-1}(v) < k \le e^u_{t-1}(v) - \rho} (\phi(k + d) - \phi(k))/(20d)$
$\ \ge (\phi(e^u_{t-1}(v) - \rho) - \phi(h_{t-1}(v) + d))/20$
$\ \ge (\phi(e^u_{t-1}(v) - \rho + 11d) - \phi(h_{t-1}(v) + d))/50$
$\ \ge (\phi(h_{t-1}(u) - d) - \phi(h_{t-1}(v) + d))/50$
$\ = D.$

(The third step follows from Lemma A.2. The fourth step follows from the inequality $e^u_{t-1}(v) - \rho > h_{t-1}(u) - 12d$.)

We note that the outgoing credit due to any such token $x$ associated with edge $(u,v)$ is used at most once in case (iii), because after step $t$ we have $e^u_t(v) - \rho \le h_{t-1}(v)$.

A simple variant of DS, as suggested in [2], can be defined for the asynchronous network model, in which the topology is fixed but an adversary determines the speed at which each edge operates at every instant of time. An edge is said to be live for a unit interval of time if every message that was in transit across the edge (in either direction) at the beginning of the interval is guaranteed to reach the end of the edge by the end of the interval. As shown in [2], the analysis for the dynamic synchronous case can be used for asynchronous networks to yield the same time bounds, under the assumption that in every unit interval of time the live edges of the network have edge expansion $\alpha$.

6 Tight bounds on off-line load balancing

In this section, we analyze the load balancing problem in the off-line setting for both the single-port and multi-port models. We derive nearly tight bounds on the minimum number of steps required to balance on arbitrary networks in terms of the node and edge expansion of the networks. We assume that the network is synchronous.

We first consider the network $G = (V,E)$ under the single-port model. For any subset $X$ of $V$, let $\bar{X}$ denote $V \setminus X$, let $m(X)$ denote the number of edges in a maximum matching between $X$ and $\bar{X}$, let $A(X)$ denote the set $\{v \in \bar{X} : \exists x \in X$ such that $(x,v) \in E\}$, and let $B(X)$ denote the set $\{x \in X : \exists y \in A(X)$ such that $(x,y) \in E\}$.
For subsets $X$ and $Y$ of $V$, let $M(X, Y)$ denote the set of edges with one endpoint in $X$ and the other in $Y$.

Lemma 6.1 For any network $G = (V, E)$ with node expansion $\alpha$ and any subset $X$ of $V$, we have $m(X) \ge \alpha \min\{|X|, |\overline{X}|\}/(1 + \alpha)$. Moreover, there exists a set $X$ of size at most $(1 + \alpha)|V|/2$ such that $m(X) \le \alpha|X|/(1 + \alpha)$.

Proof: Without loss of generality, assume that $|X| \le |\overline{X}|$. Consider the bipartite graph $H = (B(X), A(X), M(X, \overline{X}))$. A maximum matching in $H$ is equal to a maximum flow in the graph $I = (B(X) \cup A(X) \cup \{s, t\}, M(X, \overline{X}) \cup \{(s, x) : x \in B(X)\} \cup \{(x, t) : x \in A(X)\})$ from source $s$ to sink $t$. (All of the edges of $I$ have unit capacity.) We will show that every cut $C$ of $I$ separating $s$ and $t$ has cardinality at least $\alpha|X|/(1 + \alpha)$.

Consider any cut $C = (S, T)$ with $s \in S$ and $t \in T$. Let $Y = T \cap B(X)$ and $Z = T \cap A(X)$. The cardinality of $C$ is $|Y| + |M(Y, A(X) \setminus Z)| + |M(B(X) \setminus Y, Z)| + |A(X) \setminus Z| \ge |Y| + |M(B(X) \setminus Y, Z)| + |A(X) \setminus Z| \ge |A(X \setminus Y)|$. Since $|C| \ge |Y|$ and $|C| \ge |A(X \setminus Y)| \ge \alpha|X \setminus Y| \ge \alpha(|X| - |Y|)$, we have $(1 + \alpha)|C| \ge \alpha|X|$, and hence $|C| \ge \alpha|X|/(1 + \alpha)$.

For the second part of the lemma, consider a set $Y$ of size at most $|V|/2$ such that $|A(Y)| = \alpha|Y|$. If we set $X = Y \cup A(Y)$, then every edge leaving $X$ is incident on $A(Y)$, so $m(X) \le |A(Y)| = \alpha|X|/(1 + \alpha)$.

Theorem 1 of [28] obtains tight bounds on the off-line complexity of load balancing in terms of the function $m$. By invoking Lemma 6.1, we obtain:

Lemma 6.2 ([28]) Any network $G$ with node expansion $\alpha$ and initial imbalance $\Delta$ can be balanced in at most $\lceil \Delta(1 + \alpha)/\alpha \rceil$ steps so that every node has at most $\lceil \bar{w} \rceil + 1$ tokens, where $\bar{w}$ denotes the average number of tokens per node. Moreover, there exists a network $G$ and an initial load distribution with imbalance $\Delta$ such that any algorithm takes at least $\lceil \Delta(1 + \alpha)/\alpha \rceil$ steps to balance $G$ such that every node has at most $\lceil \bar{w} \rceil$ tokens.

By using the techniques of [28], we can modify the proof of Lemma 6.2 to show that any network $G$ with node expansion $\alpha$ and initial imbalance $\Delta$ can be globally balanced to within 3 tokens in at most $2\lceil \Delta(1 + \alpha)/\alpha \rceil$ steps. Lemma 6.2 implies that our local single-port algorithm is not optimal for all networks.
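The matching-to-flow reduction in the proof of Lemma 6.1 is easy to exercise directly. Below is a minimal sketch that computes $m(X)$ with Kuhn's augmenting-path algorithm, which is equivalent to unit-capacity maximum flow in the graph $I$; the adjacency-dict representation and the name `max_cross_matching` are illustrative, not from the paper.

```python
def max_cross_matching(adj, X):
    """Size of a maximum matching m(X) between X and its complement,
    using only the cross edges M(X, X-bar) from Section 6.
    `adj` maps every vertex to an iterable of its neighbors;
    Kuhn's augmenting-path algorithm plays the role of the
    unit-capacity max flow in the graph I of Lemma 6.1."""
    X = set(X)
    match = {}                      # vertex in X-bar -> matched partner in X

    def augment(x, seen):
        # Try to match x in X to some unmatched (or re-matchable)
        # neighbor on the X-bar side.
        for v in adj[x]:
            if v in X or v in seen:
                continue
            seen.add(v)
            if v not in match or augment(match[v], seen):
                match[v] = x
                return True
        return False

    return sum(augment(x, set()) for x in X)
```

On a 6-cycle with $X = \{0, 1, 2\}$, the only cross edges are $(0, 5)$ and $(2, 3)$, so the function returns $m(X) = 2$, consistent with the lemma's lower bound.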

However, there exists a class of graphs, e.g., constant-degree expanders, for which the local algorithm is optimal. A network for which the local algorithm is not optimal is the hypercube. The local algorithm balances in $\Theta(\Delta \log n)$ time, while there exists an $O(\Delta\sqrt{\log n} + \log^2 n)$-time load balancing algorithm for the hypercube [33], which is optimal for $\Delta$ sufficiently large.

The proof of Lemma 6.2 can be modified to establish the following result for the multi-port model.

Lemma 6.3 Any network $G$ with edge expansion $\alpha$ and initial imbalance $\Delta$ can be balanced in at most $\lceil \Delta/\alpha \rceil$ steps so that every node has at most $\lceil \bar{w} \rceil + d$ tokens, where $\bar{w}$ denotes the average number of tokens per node. Moreover, for every network $G$, there exists an initial load distribution with imbalance $\Delta$ such that any algorithm takes at least $\lceil \Delta/\alpha \rceil$ steps to balance $G$ so that every node has at most $\lceil \bar{w} \rceil$ tokens.

Proof Sketch: We prove that there exists an off-line algorithm that balances to within $d$ tokens in at most
\[
T = \max_{\emptyset \subsetneq X \subsetneq V} \left\lceil \frac{|I(X) - \bar{w}|X||}{|M(X, \overline{X})|} \right\rceil
\]
steps, where $I(X)$ is the number of tokens held by nodes in $X$ in the initial distribution. For all $X \subseteq V$, we have (i) $|I(X) - \bar{w}|X|| \le \Delta \min\{|X|, |\overline{X}|\}$ and (ii) $|M(X, \overline{X})| \ge \alpha \min\{|X|, |\overline{X}|\}$. It follows from (i) and (ii) that $T \le \lceil \Delta/\alpha \rceil$.

We essentially modify the proofs of Theorem 1 and Lemma 4 of [28] (where the single-port model was assumed) to establish the desired claims for the multi-port model. We transform the load balancing problem on $G$ to a network flow problem on a directed graph $H = (V', E')$, which is constructed as follows. Let $V_i = \{\langle v, i \rangle : v \in V\}$, $0 \le i \le T$. Let $E_i = \{(\langle u, i \rangle, \langle v, i+1 \rangle) : (u, v) \in E \text{ or } u = v\}$, $0 \le i < T$. We set $V' = \{s\} \cup \bigcup_{0 \le i \le T} V_i \cup \{t\}$, and $E' = \{(s, \langle v, 0 \rangle) : v \in V\} \cup \bigcup_{0 \le i < T} E_i \cup \{(\langle v, T \rangle, t) : v \in V\}$. For any $v$ in $V$, the capacity of the edge $(s, \langle v, 0 \rangle)$ is $w(v)$. For any $(u, v)$ in $E$, the capacity of any edge $(\langle u, i \rangle, \langle v, i+1 \rangle)$, $0 \le i < T$, is 1. For any $v$ in $V$, the capacity of any edge $(\langle v, i \rangle, \langle v, i+1 \rangle)$, $0 \le i < T$, is $\infty$, since arbitrarily many tokens may remain in place.
For any $v$ in $V$, the capacity of the edge $(\langle v, T \rangle, t)$ is $\lceil \bar{w} \rceil + d$.

We show that the value of the maximum integral flow in $H$ is equal to the total number of tokens $N$ in $V$, from which it follows that there exists an off-line algorithm that balances to within $d$ tokens in $T$ steps. Consider any cut $C = (S, \overline{S})$ of $H$ separating $s \in S$ and $t \in \overline{S}$. Let $S_i = S \cap V_i$ and $D(S_i) = \{v \in V : \langle v, i \rangle \in S_i\}$. If $S_0 = \emptyset$, or $S_T = V_T$, or there is an edge of infinite capacity in the cut, then the capacity of $C$ is at least $N$. Otherwise, the number of edges from $V_i$ to $V_{i+1}$ that belong to the cut is at least $|M(D(S_i), \overline{D(S_i)})| - d(|S_{i+1}| - |S_i|)$. Thus the capacity of $C$ is at least
\[
\sum_{v \in V \setminus D(S_0)} w(v) + \sum_{i=0}^{T-1} \Bigl(|M(D(S_i), \overline{D(S_i)})| - d(|S_{i+1}| - |S_i|)\Bigr) + (\lceil \bar{w} \rceil + d)|S_T| \ge N.
\]
Since the capacity of the cut $(\{s\}, V' \setminus \{s\})$ equals $N$, the maximum flow in $H$ is $N$.

To prove the second part of the lemma, given any network $G$ with a partition $(V_1, V_2)$ of its nodes such that $|V_1| \le n/2$ and $|M(V_1, V_2)| = \alpha|V_1|$, we define an initial load distribution in which each node in $V_1$ has $\lambda$ tokens and each node in $V_2$ has zero tokens. It is easily verified that a proper choice of $\lambda$ establishes the desired claim.

Lemma 6.3 implies that the local multi-port algorithm is asymptotically optimal for all networks. As in the single-port case, we can modify the above proof to obtain upper bounds on the off-line complexity of globally balancing a network. We can show that any network $G$ with edge expansion $\alpha$ and initial imbalance $\Delta$ can be globally balanced to within $d + 1$ tokens in at most $2\lceil \Delta/\alpha \rceil$ steps.

References

[1] Y. Afek, E. Gafni, and A. Rosen. The slide mechanism with applications in dynamic networks. In Proceedings of the 11th ACM Symposium on Principles of Distributed Computing, pages 35-46, August 1992.

[2] W. Aiello, B. Awerbuch, B. Maggs, and S. Rao. Approximate load balancing on dynamic and asynchronous networks. In Proceedings of the 25th Annual ACM Symposium on the Theory of Computing, pages 632-641, May 1993.

[3] M. Ajtai, J. Komlos, and E. Szemeredi. Sorting in c log n parallel steps. Combinatorica, 3:1-19, 1983.

[4] J. Aspnes, M. Herlihy, and N. Shavit. Counting networks and multiprocessor co-ordination. In Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, pages 348-358, May 1991.

[5] B. Awerbuch and T. Leighton. A simple local-control approximation algorithm for multi-commodity flow. In Proceedings of the 34th Annual Symposium on Foundations of Computer Science, pages 459-468, October 1993.

[6] B. Awerbuch and T. Leighton. Improved approximation algorithms for the multi-commodity flow problem and local competitive routing in dynamic networks. In Proceedings of the 26th Annual ACM Symposium on the Theory of Computing, pages 487-496, May 1994.

[7] B. Awerbuch, Y. Mansour, and N. Shavit. End-to-end communication with polynomial overhead. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 358-363, October 1989.

[8] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Chapter 7, Prentice-Hall, Englewood Cliffs, NJ, 1989.

[9] A. Broder, A. M. Frieze, E. Shamir, and E. Upfal. Near-perfect token distribution. In Proceedings of the 19th International Colloquium on Automata, Languages and Programming, pages 308-317, July 1992.

[10] H. Chernoff. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Annals of Mathematical Statistics, 23:493-507, 1952.

[11] Y. C. Chow and W. Kohler. Models for dynamic load balancing in a heterogeneous multiple processor system. IEEE Transactions on Computers, C-28(5):57-68, November 1980.

[12] G. Cybenko. Dynamic load balancing for distributed memory multiprocessors.
Journal of Parallel and Distributed Computing, 7(2):279-301, 1989.

[13] D. Eager, E. Lazowska, and J. Zahorjan. Adaptive load sharing in homogeneous distributed systems. IEEE Transactions on Software Engineering, SE-12(5):662-675, 1986.

[14] B. Ghosh and S. Muthukrishnan. Dynamic load balancing on parallel and distributed networks by random matchings. In Proceedings of the 6th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 226-235, June 1994.

[15] B. Goldberg and P. Hudak. Implementing functional programs on a hypercube multiprocessor. In Proceedings of the 4th Conference on Hypercubes, Concurrent Computers and Applications, volume 1, pages 489-503, 1989.

[16] A. Heirich and S. Taylor. A parabolic theory of load balance. Research Report Caltech-CS-TR-93-25, Caltech Scalable Concurrent Computation Lab, Pasadena, CA, March 1993.

[17] K. T. Herley. A note on the token distribution problem. Information Processing Letters, 28:329-334, 1991.

[18] S. H. Hosseini, B. Litow, M. Malkawi, J. McPherson, and K. Vairavan. Analysis of a graph coloring based distributed load balancing algorithm. Journal of Parallel and Distributed Computing, 10(2):160-166, 1990.

[19] J. JaJa and K. W. Ryu. Load balancing and routing on the hypercube and related networks. Journal of Parallel and Distributed Computing, 14(4):431-435, 1992.

[20] M. R. Jerrum and A. Sinclair. Conductance and the rapid mixing property for Markov chains: the approximation of the permanent resolved. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing, pages 235-244, May 1988.

[21] R. Karp and Y. Zhang. A randomized parallel branch-and-bound procedure. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing, pages 290-300, May 1988.

[22] M. Klugerman and C. G. Plaxton. Small depth counting networks. In Proceedings of the 24th Annual ACM Symposium on the Theory of Computing, pages 417-428, May 1992.

[23] T. Leighton, C. E. Leiserson, and D. Kravets. Theory of parallel and VLSI computation. Research Seminar Series Report MIT/LCS/RSS 8, MIT Laboratory for Computer Science, May 1990.

[24] F. C. H. Lin and R. M. Keller. The gradient model load balancing method. IEEE Transactions on Software Engineering, SE-13(1):32-38, 1987.

[25] R. Luling and B. Monien. Load balancing for distributed branch and bound algorithms. In Proceedings of the 6th International Parallel Processing Symposium, pages 543-549, March 1992.

[26] R. Luling and B. Monien. A dynamic distributed load balancing algorithm with provable good performance. In Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 164-172, June 1993.

[27] R. Luling, B. Monien, and F. Ramme. Load balancing in large networks: A comparative study. In Proceedings of the 3rd IEEE Symposium on Parallel and Distributed Processing, pages 686-689, December 1991.

[28] F. Meyer auf der Heide, B. Osterdiekhoff, and R. Wanka. Strongly adaptive token distribution.
In Proceedings of the 20th International Colloquium on Automata, Languages and Programming, pages 398-409, July 1993.

[29] M. Mihail. Conductance and convergence of Markov chains - a combinatorial treatment of expanders. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science, pages 526-531, October 1989.

[30] L. M. Ni, C. Xu, and T. B. Gendreau. Distributed drafting algorithm for load balancing. IEEE Transactions on Software Engineering, SE-11(10):1153-1161, October 1985.

[31] D. Peleg and E. Upfal. The generalized packet routing problem. Theoretical Computer Science, 53:281-293, 1987.

[32] D. Peleg and E. Upfal. The token distribution problem. SIAM Journal on Computing, 18:229-243, 1989.

[33] C. G. Plaxton. Load balancing, selection and sorting on the hypercube. In Proceedings of the 1989 ACM Symposium on Parallel Algorithms and Architectures, pages 64-73, June 1989.

[34] J. Stankovic. Simulations of three adaptive, decentralized controlled, job scheduling algorithms. Computer Networks, 8:199-217, 1984.

[35] R. Subramanian and I. D. Scherson. An analysis of diffusive load balancing. In Proceedings of the 1994 ACM Symposium on Parallel Algorithms and Architectures, pages 220-225, June 1994.

[36] A. N. Tantawi and D. Towsley. Optimal static load balancing in distributed computer systems. Journal of the ACM, 32:445-465, April 1985.

[37] E. Upfal. An O(log N) deterministic packet routing scheme. pages 241-250, May 1989.

[38] R. D. Williams. Performance of dynamic load balancing algorithms for unstructured mesh calculations. Concurrency: Practice and Experience, 3(5):457-481, 1991.

A Some Inequalities

Let $\epsilon$ equal $\alpha/(cd^2)$. For the following we set $c$ large enough so that $(1 + \epsilon)^{12d} \le 3/2$.

Lemma A.1 For any integer $x$, if $\Phi(x) > 0$, then $\Phi(x + 12d) \le 3\Phi(x)/2$.

Proof: Since $\Phi(x) > 0$, we have $\Phi(x + 12d) = (1 + \epsilon)^{12d}\Phi(x) \le 3\Phi(x)/2$.

Corollary A.1.1 For any integer $x$, we have
\[
\max\{\Phi(24jd), \Phi(x - 12d)\} \ge 2\Phi(x)/3.
\]

Lemma A.2 For any integers $x$ and $y$, if $\Phi(x) > 0$ and $x - y \ge 11d$, then we have $\Phi(x) - \Phi(y) \ge 2(\Phi(x + 11d) - \Phi(y))/5$.

Proof:
\[
\begin{aligned}
2(\Phi(x + 11d) - \Phi(y))/5 &= 2(\Phi(x + 11d) - \Phi(x))/5 + 2(\Phi(x) - \Phi(y))/5 \\
&\le 2(1 + \epsilon)^{11d}(\Phi(x) - \Phi(x - 11d))/5 + 2(\Phi(x) - \Phi(y))/5 \\
&\le 2(1 + \epsilon)^{11d}(\Phi(x) - \Phi(y))/5 + 2(\Phi(x) - \Phi(y))/5 \\
&\le \Phi(x) - \Phi(y).
\end{aligned}
\]
(In the second step we use the inequality $x - 11d \ge y$. In the last step we use the inequality $(1 + \epsilon)^{11d} \le 3/2$.)
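The appendix inequalities hinge only on the bound $(1 + \epsilon)^{12d} \le 3/2$. The snippet below checks Lemmas A.1 and A.2 numerically for a pure exponential potential $\Phi(x) = (1 + \epsilon)^x$; the constants `d`, `alpha`, and `c` are arbitrary illustrative choices, and the paper's actual $\Phi$ also vanishes below a threshold, which the same algebra handles.

```python
# Numeric sanity check of Lemmas A.1 and A.2 for the exponential
# potential phi(x) = (1 + eps)^x, with eps = alpha / (c d^2) as in
# Appendix A.  The values of d, alpha, and c below are illustrative;
# c merely has to be large enough that (1 + eps)^(12d) <= 3/2.
d = 4
alpha = 0.5
c = 50
eps = alpha / (c * d * d)

def phi(x):
    return (1 + eps) ** x

# The constant-setting condition from Appendix A.
assert phi(12 * d) <= 1.5

for x in range(0, 200):
    # Lemma A.1: phi(x + 12d) <= 3 phi(x) / 2.
    assert phi(x + 12 * d) <= 1.5 * phi(x)
    # Lemma A.2: for x - y >= 11d,
    #   phi(x) - phi(y) >= 2 (phi(x + 11d) - phi(y)) / 5.
    for y in range(0, x - 11 * d + 1):
        assert phi(x) - phi(y) >= 2 * (phi(x + 11 * d) - phi(y)) / 5
```

Both loops pass because $\Phi(x + kd) = (1 + \epsilon)^{kd}\Phi(x)$ and $(1 + \epsilon)^{12d} \le 3/2$, exactly the two facts the proofs use.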

Figure 1: The initial token distribution on $G$ for the first case. [Figure omitted: a chain of node sets $L_1, \ldots, L_k = R_k, \ldots, R_1$ with the edge $e = (u, v)$ and node loads ranging from $m + 1$ down to $m - 2kd$.]

Figure 2: The initial token distribution on $G$ for the second case. [Figure omitted: the same chain with the loads for the second case.]