approximation algorithms for the metric maximum clustering problem with given cluster sizes
Post on 02-Jul-2016
220 Views
Preview:
TRANSCRIPT
Available online at www.sciencedirect.com
Operations Research Letters 31 (2003) 179–184
OperationsResearchLetters
www.elsevier.com/locate/dsw
Approximation algorithms for the metric maximum clusteringproblem with given cluster sizes
Refael Hassin∗, Shlomi RubinsteinDepartment of Statistics and Operations Research, School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel
Received 4 November 2002; accepted 3 December 2002
Abstract
The input to the METRIC MAXIMUM CLUSTERING PROBLEM WITH GIVEN CLUSTER SIZES consists of a complete graph G=(V; E)with edge weights satisfying the triangle inequality, and integers c1 ; : : : ; cp that sum to |V |. The goal is to >nd a partition ofV into disjoint clusters of sizes c1; : : : ; cp, that maximizes the sum of weights of edges whose two ends belong to the samecluster. We describe approximation algorithms for this problem.c© 2003 Elsevier Science B.V. All rights reserved.
Keywords: Approximation algorithms; Maximum clustering
1. Introduction
In this paper we approximate the METRIC MAXIMUM
CLUSTERING PROBLEM WITH GIVEN CLUSTER SIZES. Theinput for the problem consists of a complete graphG = (E; V ), V = {1; : : : ; n}, with non-negative edgeweights w(i; j), (i; j)∈E, that satisfy the triangle in-equality. In the general case of the problem, clustersizes c1¿ c2¿ · · ·¿ cp¿ 1 such that c1+· · ·+cp=nare given. In the uniform case, c1 = c2 = · · ·= cp. Theproblem is to partition V into sets of the given sizes,so that the total weight of edges inside the clustersis maximized. See [6] and its references for someapplications.Hassin and Rubinstein [3] gave a approximation
algorithm whose error ratio is bounded by 1=2√2 ≈
0:353 for the general problem. We improve this
∗ Corresponding author.E-mail addresses: hassin@post.tau.ac.il (R. Hassin),
shlomiru@post.tau.ac.il (S. Rubinstein).
result for the case in which cluster sizes are large. Inparticular, when the minimum cluster size increases,the performance guarantee of our algorithm increasesasymptotically to 0.375.Feo and Khellaf [2] treated the uniform case and
developed a polynomial algorithm whose error ratio isbounded by c=2(c−1) or (c+1)=2c, where the clustersize is c=n=p, and it is even or odd, respectively. Thebound decreases to 1=2 as c approaches ∞. The algo-rithm’s time complexity is dominated by computationof a maximum weight perfect matching. (Without thetriangle inequality assumption, the bound is 1=(c− 1)or 1=c, respectively, but Feo, Goldschmidt and Khel-laf [1] improved the bound to 1
2 in the cases of c = 3and 4.) We describe an alternative algorithm for theuniform case that achieves the ratio of 1=2 and has alower O(n2) complexity.Hassin, Rubinstein and Tamir [4] generalized
the algorithm of [2] and obtained a bound of 12
for computing k clusters of size c each (16 k6 n=c)with maximum total weight. Our discussion
0167-6377/03/$ - see front matter c© 2003 Elsevier Science B.V. All rights reserved.doi:10.1016/S0167-6377(02)00235-3
180 R. Hassin, S. Rubinstein /Operations Research Letters 31 (2003) 179–184
concerning the uniform case does not apply to thisgeneralization.For E′ ⊂ E we denote by w(E′) the total weight of
edges in E′. For V ′ ⊆ V we denote by E(V ′) the edgeset of the subgraph induced by V ′. To simplify thepresentation, we denote the weight w(E(V ′)) of theedges in the subgraph induced by a vertex set V ′ byw(V ′). We denote by opt the optimal solution value,and by apx the approximate value returned by a givenapproximation algorithm. A p-matching is a set of pvertex-disjoint edges in a graph. A p-matching withp = �n=2� is called perfect. A greedy p-matching isobtained by sorting the edges in non-increasing orderof their weights, and then scanning the list and se-lecting edges as long as they are vertex-disjoint to thepreviously selected edges and their number does notexceed p. A greedy perfect matching has p= �n=2�.
2. A 38 -approximation algorithm
Lemma 1. Let M g be a greedy k-matching. Let M ′
be an arbitrary 2k-matching. Then, for i = 1; : : : ; k,the weight of the ith largest edge in M g is greaterthan or equal to the weight of the (2i − 1)st largestedge in M ′.
Proof. Let e1; : : : ; ek be the edges of M ′ in non-increasing order of weight. By the greedy construc-tion, every edge of e′ ∈M ′ \M g is incident to an edgeof e∈M g with w(e)¿w(e′). Since every edge ofM g can take the above role at most twice, it followsthat for e1; : : : ; e2i−1 we use at least i edges of M g allof which are at least as large as w(e2i−1).
Lemma 2. If a cluster C ⊂ V of size |C|=c containsa k-matching M of weight W , then w(C)¿ (c−k)W .
Proof. Let V (M) be the set of vertices of the edgesin M . Then, |V (M)| = 2k. Consider an edge e =(u; v)∈M . By the triangle inequality, for every vertexx, w(u; x) + w(v; x)¿w(u; v). Summing over x∈C \V (M), the total weight of the edges of C that con-nect u and v with vertices in C \ V (M) is at least(c−2k)w(u; v). Summation over M gives a weight ofat least (c − 2k)W .A similar summation over x∈V (M) (including x=
u; v) gives a total weight of 2kw(u; v). However, every
edge that contributes to this sum (including (u; v)) iscounted twice. Hence, summation over M only givesthat the total weight of these edges is at least kW .Altogether, w(C)¿ [(c − 2k) + k]W = (c −
k)W .
We >rst present our algorithm assuming that thecluster sizes are divisible by 4. After analyzing thiscase, we will show how to modify the algorithm forthe general case, and obtain the same bound asymp-totically for big cluster sizes.
Theorem 1. Let apx be the total weight of edges inthe clusters returned by Algorithm metric (Fig. 1).Then,
apx¿38opt:
Proof. Consider an optimal partition O1; : : : ; Op. LetMi be a maximum matching in the subgraph inducedby Oi, i=1; : : : ; p. Let M =M1 ∪ · · · ∪Mp (note that|Mi| = 1
2ci and |M | = 12 |V |). Let M ′
1 be the match-ing consisting of the 1
2c1 heaviest edges in M , M ′2
the next 12c2 heaviest edges, and so on up to M ′
p
which consists of the 12cp lightest edges in M . From
the assumption that c1¿ · · ·¿ cp it follows that∑i ciw(Mi)6
∑i ciw(M ′
i ). The edge set of E(Oi)can be covered by a set of ci − 1 disjoint matchings.Since Mi is a maximum matching in Oi it follows thatw(Oi)6 (ci − 1)w(Mi) and therefore
opt =p∑
i=1
w(Oi)6p∑
i=1
ciw(Mi)6p∑
i=1
ciw(M ′i ):
Let M gi be the greedy matching’s edges whose end
vertices were inserted by Algorithm metric to Si. Notethat |M g
i |= ci=4.By Lemma 1, w(M ′
i )6 2w(M gi ). Therefore, by
Lemma 2,
apx¿p∑
i=1
(ci − |M gi |)w(M g
i )
=34
p∑i=1
ciw(Mgi )
¿38
p∑i=1
ciw(M ′i )¿
38opt:
R. Hassin, S. Rubinstein /Operations Research Letters 31 (2003) 179–184 181
Fig. 1. Algorithm metric.
The situation is more complex when the restrictionthat the cluster sizes are divisible by 4 is removed.We propose to apply Algorithm Metric with sizesc′1; : : : ; c
′p, where c′i = 4�ci=4�, and complete the clus-
ters arbitrarily to sizes c1; : : : ; cp. If the cluster sizesare large enough, the bound is not greatly aMected bythis change and will be asymptotically 3
8 . We nowdescribe the changes in the algorithm and analysisthat are necessary to account for any cluster sizes atleast 4.Suppose that for i = 1; : : : ; p, ci = 4ki + �i, where
ki¿ 1 is an integer and �i ∈{0; 1; 2; 3}. Let Oi be theith cluster in an optimal solution and let O′
i be a max-imum weight subset of Oi resulting from the deletionof �i vertices. Then
w(O′i)
w(Oi)¿
|E(O′i)|
|E(Oi)| =
4ki
2
ci
2
and therefore
opt =p∑
i=1
w(Oi)6p∑
i=1
ci
2
4ki
2
w(O′i): (1)
Since |O′i | = 4ki, E(O′
i) can be covered by 4ki−1disjoint matchings. Hence, a maximum matching,Mi, in the subgraph induced by O′
i satis>es w(O′i)6
(4ki − 1)w(Mi) and with (1)
opt6p∑
i=1
fiw(Mi); (2)
where fi ≡ f(ci) = (4ki − 1)
(ci
2
)/(4ki
2
)=
ci(ci−1)=4ki. Note that f is not necessarily monotoneincreasing. Let j1; : : : ; jp be a permutation of 1; : : : ; psuch that fj1 ¿ · · ·¿fjp .Let M =M1∪· · ·∪Mp. Note that |Mi|=2ki. Let M ′
1be the matching consisting of the 2kj1 heaviest edgesin M , M ′
2 the next 2kj2 heaviest edges, and so on upto M ′
p which consists of the 2kjp lightest edges in M .Clearly,
p∑i=1
fiw(Mi)6p∑
i=1
fjiw(M′ji): (3)
Consider a greedy matching M g of size 12 |M |. Let
M gj1 be the matching consisting of the kj1 heaviest
edges in M g, M gj2 the next kj2 heaviest edges, and so
on. Let Sgi be the set of end vertices of the edges in
M gi , and let {S1; : : : ; Sp} be an arbitrary completion of
(Sg1 ; : : : ; S
gp) to disjoint clusters of sizes (c1; : : : ; cp).
182 R. Hassin, S. Rubinstein /Operations Research Letters 31 (2003) 179–184
By Lemma 1
w(M ′ji)6 2w(M g
i ); (4)
and by Lemma 2
w(Si)¿ (ci − ki)w(Mgi ): (5)
Combining (2)–(5) we obtain
opt6p∑
i=1
giw(Si);
where gi ≡ g(ci) = 2fi=(ci − ki) = ci(ci − 1)=2ki(ci − ki).Let � = [max{g(ci) | i = 1; : : : ; p}]−1, then
apx =p∑
i=1
w(Si)¿ �opt:
For example, for ci = 4; : : : ; 12 1=gi is 0.5, 0.4,0.333, 0.286, 0.428, 0.389, 0.356, 0.327, and 0.409.The bound improves over the previously known boundof 1
2√2≈ 0:353 if ci �∈ {2; 3; 6; 7; 11; 15; 19}, and in
particular if all cluster sizes are at least 20. Whenmin{ci: i = 1; : : : ; p} → ∞, � → 3
8 .
3. The uniform case
We now consider the uniform case, that is ci=c fori = 1; : : : ; p. Consider the set of partitions of V intoclusters of size c each. A random solution is obtainedby randomly (uniformly) selecting such a partition.The following theorem states a bound on the expectedvalue of a random solution. When the cluster sizes arenot identical, Example 1 in Section 4 shows that theexpected weight of a random solution is not a goodapproximation. Also note that in contrast to a similarbound for the related MAXIMUM CUT PROBLEM, our resultrequires that the edge weights satis>es the triangleinequality.
Theorem 2. The expected weight of a random solu-tion is at least 1
2opt.
Proof. Consider an optimal partition OPT =(O1; : : : ; Op). Let M be a matching consisting of theunion of maximum perfect matchings in the subgraphsinduced by O1; : : : ; Op.
Suppose >rst that c is even. Then M is a perfectmatching in G. Let Sv be the (random) set of edgesthat are incident to v in the cluster that containsv in a random solution. The weight of a randomsolution is
12
∑v∈V
w(Sv) =12
∑(u;v)∈M
[w(Su) + w(Sv)]:
Consider an edge (u; v)∈M . A solution in which uand v are contained in the same cluster satis>es, bythe triangle inequality, that the average weight of anedge in Su∪Sv is at least 1
2w(u; v). As for the solutionsin which u and v belong to distinct clusters, we pairthese solutions so that each pair consists of a solutionand another one obtained by swapping u and v. Againit follows from the triangle inequality that the averageweight of an edge in Su ∪ Sv in such a pair of solu-tions is at least 1
2w(u; v). Hence the expected averageweight of an edge in a random solution is at least 1=2the average edge weight in M . The claim follows byobserving that the average edge weight in M is at leastas large as the average edge weight in OPT (see forexample [4]).Suppose now that c is odd. Then, M leaves out one
vertex from each subset. We consider the stars incidentto these vertices in the optimal and random solutions.Since M is the union of maximum matchings, the sumof these stars in OPT is at most 2w(M). On the otherhand, from the same pairing argument and the triangleinequality, the expected total weight of these stars ina random solution is at least w(M). The rest of theproof of this case follows the same arguments as whenc was assumed to be even.
We use the method of conditional probabilities [5]to de-randomize the algorithm while preserving itsperformance guarantee. At a given stage we havealready determined the contents of several clustersand we deal with one active cluster that may now bepartially >lled. Let r be the number of clusters yet tobe constructed excluding the active one. Let A be thevertices already assigned to the active cluster, a= |A|,and let B be the yet undecided vertices, b = |B|: Wemaintain and update for each u∈B %u=
∑v∈A w(u; v),
% =∑
u∈B %u and � =∑
u;v∈B w(u; v): Altogether,these updates take O(n2) time.
R. Hassin, S. Rubinstein /Operations Research Letters 31 (2003) 179–184 183
The expected weight of a random completion of theassignment of vertices to clusters is
c − ab
% +
(c − a
2
)+ r
(c
2
)(
b
2
) �:
The >rst term is the expected weight of edges betweenvertices that will be added to the active cluster and thevertices already in it, the second is the expected weightof edges between vertices in B that will be placed inthe same cluster.At each step we examine the insertion of each u∈B
to the active cluster (contributing %u) followed by arandom completion, and select the one that gives max-imum expected weight. There are n insertions and eachrequires O(n) examinations. Thus, altogether the com-plexity is O(n2).
4. Some bad examples
In this section, we propose some natural algorithmsand provide for each of them an instance for whichthe algorithm performs badly.In Section 3 we have shown that the expected
weight of a random solution is at least 12opt when the
clusters have a common size. The following exampleshows that when the cluster sizes are not identical theexpected weight of a random solution may be verysmall relative to opt.
Example 1. Consider an instance with weightsw(1; j) = 1; j = 2; : : : ; n, and w(i; j) = 0 otherwise.Let c1 = 2 and ci = 1, i = 2; : : : ; n− 1. Then, only so-lutions in which vertex 1 is placed in the large clusterhave a positive weight. This happens with probability2n . Thus, opt = 1 whereas the expected weight of arandom solution is less than 2=n, and the ratio can bemade arbitrarily small.
Next, we show that a natural local search approachdoes not guarantee a constant bound, even in the uni-form case.
Example 2. We construct an instance with 0/1weights. The vertex set V is composed of disjoint sub-
sets V0; V1; : : : ; V‘. |V0|=(‘−k)‘, and for i=1; : : : ; ‘,Vi has a distinguished vertex vi ∈Vi, and |Vi| = k.Thus, |V | = ‘2. The subgraph induced by the unitweight edges consists of the cliques induced by Vi,i = 1; : : : ; ‘, and the clique induced by the distin-guished vertices v1; : : : ; v‘. (Vertices in V0 are notadjacent to any edge with unit weight.) Suppose thatV must be partitioned into ‘ clusters of size ‘ each,and that 1�k2�‘�|V |. The optimal solution con-tains a cluster with the distinguished vertices and itsweight is of order ‘2. A solution in which each Vi,i = 1; : : : ; ‘ is contained in a diMerent cluster cannotbe improved by relocating less than k vertices, andits weight is O(‘k2).
A perfect matching M that, for each p=1; : : : ; |M |,contains p edges whose total weight is at least %times the maximum weight of a p-matching is saidto be %-robust. Hassin and Rubinstein [3] proved thatthere always exists a 1=
√2-robust matching, and that
there are instances where a higher robustness is notpossible.Consider the following extension of the algorithm
of Feo and Khellaf [2] and Hassin et al. [4]:
1. Compute an %-robust matching S.Let S = {(uj; vj) j = 1; : : : ; m}, where
w(uj; vj)¿w(uj+1; vj+1) j = 1; : : : ; m− 1.2. For j = 1; : : : ; p, let dj = �cj=2� and Dj = d1
+ · · · + dj j = 1; : : : ; p. Let D0 = 0. Set Vi ={uj; vj | j = Di−1 + 1; : : : ; Di} i = 1; : : : ; p.
3. For each i such that ci is odd, add to Vi an arbi-trary yet unassigned vertex.
Hassin and Rubinstein [3] proved that the algo-rithm returns an %=2-approximation for the maximumclustering problem with cluster sizes c1; : : : ; cp. Thus,the best bound derived this way is 1=2
√2. Moreover,
since both a maximum matching and a greedy perfectmatching are 1
2 -robust, using such matchings in thealgorithm guarantees at least a 1
4 -approximation.The following example demonstrates that the 1
4bound is the best possible when using a maximummatching:
Example 3. Let V =A∪B∪C∪D∪F where |A|=M ,|B|= |C|= |D|= |F |=M=2, D= {d1; : : : ; dM=2}, and
184 R. Hassin, S. Rubinstein /Operations Research Letters 31 (2003) 179–184
F = {e1; : : : ; eM=2}. Let
w(i; j) =
2; i∈B j∈C;
1; i∈B j �∈ C or i∈C j �∈ B;
1; (i; j) = (dl; el) l= 1; : : : ; eM=2;
0; i; j∈A;
12 otherwise:
Let c = (M; 1; : : : ; 1) where the number of clusters ofsize 1 is 2M , so that
∑i ci = 3M = |V |: An optimal
solution will choose for the big cluster the set B ∪ Cand thus
opt = 2 (M)2 :
The edges {(d1; e1); : : : ; (dM=2; eM=2)} completed by ar-bitrary disjoint edges between A and B∪C constitute amaximum matching. The algorithm may choose sucha matching and then have D ∪ F as the large cluster,and then
apx =12
(M2
)2+
M4
:
This gives asymptotically a bound of 14 .
We observe that a greedy matching may havean advantage over a maximum matching since itchooses larger edges for the large clusters. The fol-lowing example shows however that the choice ofa greedy matching does not guarantee more than a38 -approximation, even for the case of two uniformclusters.
Example 4. Let V =A∪B∪C with |A|= |B|=M=2,|C| = M , A = {a1; : : : ; aM=2}, and B = {b1; : : : ; bM=2}.
Let c = (M;M), and
w(i; j) =
2; i∈C j∈A ∪ B;
2; (i; j) = (al; bl) l= 1; : : : ; M=2;43 ; i; j∈A or i; j∈B;23 ; i∈A j∈B or i∈B j∈A;
0; i; j∈C:
The algorithm may choose the edges (ai; bi) for thegreedy matching, resulting in clusters A ∪ B and C.The weight of this solution is
apx =23M 2
2+ 2× 4
3
(M2
2
)+
M2
× 2 ≈ M 2
2:
Consider a solution with sets A ∪ C1 and B ∪ C2,where C1; C2 ⊂ C are disjoint and of cardi-nality M=2 each. The value of this solution is
2
[2(
M2
)2+
43
(M2
2
)]≈ 4
3M2. Therefore the al-
gorithm achieves in this case asymptotically no morethan 3
8opt.
References
[1] T. Feo, O. Goldschmidt, M. Khellaf, One half approximationalgorithms for the ,-partition problem, Oper. Res. 40 (1992)S170–S172.
[2] T. Feo, M. Khellaf, A class of bounded approximationalgorithms for graph partitioning, Networks 20 (1990)181–195.
[3] R. Hassin, S. Rubinstein, Robust matchings, SIAM J. DiscreteMath. 15 (2002) 530–537.
[4] R. Hassin, S. Rubinstein, A. Tamir, Approximation algorithmsfor maximum dispersion, Oper. Res. Lett. 21 (1997) 133–137.
[5] P. Raghavan, Probabilistic construction of deterministicalgorithms: approximating packing integer programs, J.Comput. System Sci. 37 (1988) 130–143.
[6] R.R. Weitz, S. Lakshminarayanan, An empirical comparisonof heuristic methods for creating maximally diverse groups, J.Oper. Res. Soc. 49 (1998) 635–646.
top related