
Available online at www.sciencedirect.com

Operations Research Letters 31 (2003) 179–184


www.elsevier.com/locate/dsw

Approximation algorithms for the metric maximum clustering problem with given cluster sizes

Refael Hassin∗, Shlomi Rubinstein

Department of Statistics and Operations Research, School of Mathematical Sciences, Tel-Aviv University, Tel-Aviv 69978, Israel

Received 4 November 2002; accepted 3 December 2002

Abstract

The input to the METRIC MAXIMUM CLUSTERING PROBLEM WITH GIVEN CLUSTER SIZES consists of a complete graph $G = (V, E)$ with edge weights satisfying the triangle inequality, and integers $c_1, \ldots, c_p$ that sum to $|V|$. The goal is to find a partition of $V$ into disjoint clusters of sizes $c_1, \ldots, c_p$ that maximizes the sum of weights of edges whose two ends belong to the same cluster. We describe approximation algorithms for this problem.

© 2003 Elsevier Science B.V. All rights reserved.

Keywords: Approximation algorithms; Maximum clustering

1. Introduction

In this paper we approximate the METRIC MAXIMUM CLUSTERING PROBLEM WITH GIVEN CLUSTER SIZES. The input for the problem consists of a complete graph $G = (V, E)$, $V = \{1, \ldots, n\}$, with non-negative edge weights $w(i, j)$, $(i, j) \in E$, that satisfy the triangle inequality. In the general case of the problem, cluster sizes $c_1 \ge c_2 \ge \cdots \ge c_p \ge 1$ such that $c_1 + \cdots + c_p = n$ are given. In the uniform case, $c_1 = c_2 = \cdots = c_p$. The problem is to partition $V$ into sets of the given sizes, so that the total weight of edges inside the clusters is maximized. See [6] and its references for some applications.

Hassin and Rubinstein [3] gave an approximation algorithm whose error ratio is bounded by $1/(2\sqrt{2}) \approx 0.353$ for the general problem. We improve this result for the case in which cluster sizes are large. In particular, when the minimum cluster size increases, the performance guarantee of our algorithm increases asymptotically to 0.375.

∗ Corresponding author. E-mail addresses: hassin@post.tau.ac.il (R. Hassin), shlomiru@post.tau.ac.il (S. Rubinstein).

Feo and Khellaf [2] treated the uniform case and developed a polynomial algorithm whose error ratio is bounded by $c/(2(c-1))$ or $(c+1)/(2c)$, where the cluster size is $c = n/p$, and it is even or odd, respectively. The bound decreases to $1/2$ as $c$ approaches $\infty$. The algorithm's time complexity is dominated by computation of a maximum weight perfect matching. (Without the triangle inequality assumption, the bound is $1/(c-1)$ or $1/c$, respectively, but Feo, Goldschmidt and Khellaf [1] improved the bound to $\frac{1}{2}$ in the cases of $c = 3$ and 4.) We describe an alternative algorithm for the uniform case that achieves the ratio of $1/2$ and has a lower $O(n^2)$ complexity.

Hassin, Rubinstein and Tamir [4] generalized the algorithm of [2] and obtained a bound of $\frac{1}{2}$ for computing $k$ clusters of size $c$ each ($1 \le k \le n/c$) with maximum total weight. Our discussion concerning the uniform case does not apply to this generalization.

For $E' \subset E$ we denote by $w(E')$ the total weight of edges in $E'$. For $V' \subseteq V$ we denote by $E(V')$ the edge set of the subgraph induced by $V'$. To simplify the presentation, we denote the weight $w(E(V'))$ of the edges in the subgraph induced by a vertex set $V'$ by $w(V')$. We denote by opt the optimal solution value, and by apx the approximate value returned by a given approximation algorithm. A $p$-matching is a set of $p$ vertex-disjoint edges in a graph. A $p$-matching with $p = \lfloor n/2 \rfloor$ is called perfect. A greedy $p$-matching is obtained by sorting the edges in non-increasing order of their weights, and then scanning the list and selecting edges as long as they are vertex-disjoint to the previously selected edges and their number does not exceed $p$. A greedy perfect matching has $p = \lfloor n/2 \rfloor$.
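The greedy $p$-matching defined above can be sketched in a few lines of Python. This is only an illustration: the function name `greedy_matching` and the representation of the weights as a symmetric list-of-lists matrix `w` are assumptions made here, not part of the paper.

```python
from itertools import combinations

def greedy_matching(w, p):
    """Return up to p vertex-disjoint edges, scanning edges by non-increasing weight.

    w is assumed to be a symmetric n x n weight matrix (hypothetical input format).
    """
    n = len(w)
    # Sort all edges of the complete graph by non-increasing weight.
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: w[e[0]][e[1]], reverse=True)
    used, matching = set(), []
    for u, v in edges:
        if len(matching) == p:
            break
        # Select the edge only if it is vertex-disjoint from earlier selections.
        if u not in used and v not in used:
            matching.append((u, v))
            used.update((u, v))
    return matching

# Four points on a line at positions 0, 1, 3, 7; weights are distances.
pos = [0, 1, 3, 7]
w = [[abs(a - b) for b in pos] for a in pos]
print(greedy_matching(w, 2))  # [(0, 3), (1, 2)]
```

Sorting all $n(n-1)/2$ edges dominates the running time of this sketch.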

2. A $\frac{3}{8}$-approximation algorithm

Lemma 1. Let $M^g$ be a greedy $k$-matching. Let $M'$ be an arbitrary $2k$-matching. Then, for $i = 1, \ldots, k$, the weight of the $i$th largest edge in $M^g$ is greater than or equal to the weight of the $(2i-1)$st largest edge in $M'$.

Proof. Let $e_1, \ldots, e_{2k}$ be the edges of $M'$ in non-increasing order of weight. By the greedy construction, every edge $e' \in M' \setminus M^g$ is incident to an edge $e \in M^g$ with $w(e) \ge w(e')$. Since every edge of $M^g$ can take the above role at most twice, it follows that for $e_1, \ldots, e_{2i-1}$ we use at least $i$ edges of $M^g$, all of which have weight at least $w(e_{2i-1})$.

Lemma 2. If a cluster $C \subset V$ of size $|C| = c$ contains a $k$-matching $M$ of weight $W$, then $w(C) \ge (c-k)W$.

Proof. Let $V(M)$ be the set of vertices of the edges in $M$. Then $|V(M)| = 2k$. Consider an edge $e = (u, v) \in M$. By the triangle inequality, for every vertex $x$, $w(u, x) + w(v, x) \ge w(u, v)$. Summing over $x \in C \setminus V(M)$, the total weight of the edges of $C$ that connect $u$ and $v$ with vertices in $C \setminus V(M)$ is at least $(c-2k)\,w(u, v)$. Summation over $M$ gives a weight of at least $(c-2k)W$.

A similar summation over $x \in V(M)$ (including $x = u, v$) gives a total weight of at least $2k\,w(u, v)$. However, every edge that contributes to this sum (including $(u, v)$) is counted twice. Hence, summation over $M$ only gives that the total weight of these edges is at least $kW$.

Altogether, $w(C) \ge [(c-2k) + k]W = (c-k)W$.
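As a numerical sanity check of Lemma 2, the following sketch evaluates $w(C) - (c-k)W$ on random points in the plane, where Euclidean distances satisfy the triangle inequality. The helper name `lemma2_gap` and the choice of random planar points are assumptions made for illustration only.

```python
import math
import random
from itertools import combinations

def lemma2_gap(points, matching):
    """Return w(C) - (c - k) * W, where C is the cluster formed by all points."""
    c, k = len(points), len(matching)

    def dist(i, j):
        return math.dist(points[i], points[j])

    w_C = sum(dist(i, j) for i, j in combinations(range(c), 2))  # w(C)
    W = sum(dist(u, v) for u, v in matching)                     # weight of M
    return w_C - (c - k) * W

random.seed(0)
points = [(random.random(), random.random()) for _ in range(8)]
matching = [(0, 1), (2, 3), (4, 5)]  # any 3 vertex-disjoint pairs inside C
print(lemma2_gap(points, matching) >= 0)
```

By the lemma the gap is non-negative for every metric instance and every matching inside the cluster, so a negative value would indicate a bug in the check rather than a counterexample.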

We first present our algorithm assuming that the cluster sizes are divisible by 4. After analyzing this case, we will show how to modify the algorithm for the general case, and obtain the same bound asymptotically for big cluster sizes.

Theorem 1. Let apx be the total weight of edges in the clusters returned by Algorithm metric (Fig. 1). Then,

$$\mathrm{apx} \ge \tfrac{3}{8}\,\mathrm{opt}.$$

Proof. Consider an optimal partition $O_1, \ldots, O_p$. Let $M_i$ be a maximum matching in the subgraph induced by $O_i$, $i = 1, \ldots, p$. Let $M = M_1 \cup \cdots \cup M_p$ (note that $|M_i| = \frac{1}{2}c_i$ and $|M| = \frac{1}{2}|V|$). Let $M'_1$ be the matching consisting of the $\frac{1}{2}c_1$ heaviest edges in $M$, $M'_2$ the next $\frac{1}{2}c_2$ heaviest edges, and so on up to $M'_p$ which consists of the $\frac{1}{2}c_p$ lightest edges in $M$. From the assumption that $c_1 \ge \cdots \ge c_p$ it follows that $\sum_i c_i w(M_i) \le \sum_i c_i w(M'_i)$. The edge set $E(O_i)$ can be covered by a set of $c_i - 1$ disjoint matchings. Since $M_i$ is a maximum matching in $O_i$ it follows that $w(O_i) \le (c_i - 1)\,w(M_i)$ and therefore

$$\mathrm{opt} = \sum_{i=1}^{p} w(O_i) \le \sum_{i=1}^{p} c_i\, w(M_i) \le \sum_{i=1}^{p} c_i\, w(M'_i).$$

Let $M^g_i$ be the greedy matching's edges whose end vertices were inserted by Algorithm metric to $S_i$. Note that $|M^g_i| = c_i/4$. By Lemma 1, $w(M'_i) \le 2w(M^g_i)$. Therefore, by Lemma 2,

$$\mathrm{apx} \ge \sum_{i=1}^{p} (c_i - |M^g_i|)\, w(M^g_i) = \frac{3}{4}\sum_{i=1}^{p} c_i\, w(M^g_i) \ge \frac{3}{8}\sum_{i=1}^{p} c_i\, w(M'_i) \ge \frac{3}{8}\,\mathrm{opt}.$$


Fig. 1. Algorithm metric.
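The figure itself is not reproduced in this text. Judging only from the proof of Theorem 1, the algorithm appears to compute a greedy matching with $n/4$ edges, place the end vertices of the $c_1/4$ heaviest edges in $S_1$, those of the next $c_2/4$ edges in $S_2$, and so on, and then complete the clusters arbitrarily. The sketch below follows that reading for sizes divisible by 4; the names `algorithm_metric` and `cluster_weight` and the matrix input format are assumptions made here, and the code should not be taken as the authors' exact figure.

```python
from itertools import combinations

def cluster_weight(w, clusters):
    """Total weight of edges whose two ends lie in the same cluster."""
    return sum(w[u][v] for cl in clusters for u, v in combinations(cl, 2))

def algorithm_metric(w, sizes):
    """Sketch for sizes that are non-increasing, divisible by 4, and sum to n."""
    n = len(w)
    assert sum(sizes) == n and all(c % 4 == 0 for c in sizes)
    # Greedy matching with n/4 edges.
    edges = sorted(combinations(range(n), 2), key=lambda e: -w[e[0]][e[1]])
    used, matching = set(), []
    for u, v in edges:
        if len(matching) == n // 4:
            break
        if u not in used and v not in used:
            matching.append((u, v))
            used.update((u, v))
    # Seed cluster i with the end vertices of the next c_i/4 heaviest greedy edges.
    clusters, start = [], 0
    for c in sizes:
        block = matching[start:start + c // 4]
        clusters.append([x for e in block for x in e])
        start += c // 4
    # Complete the clusters arbitrarily with the unmatched vertices.
    rest = [v for v in range(n) if v not in used]
    for cluster, c in zip(clusters, sizes):
        while len(cluster) < c:
            cluster.append(rest.pop())
    return clusters

# Eight points on a line; weights are distances, so the triangle inequality holds.
pos = [0, 1, 2, 4, 7, 8, 12, 13]
w = [[abs(a - b) for b in pos] for a in pos]
clusters = algorithm_metric(w, [4, 4])
print(clusters, cluster_weight(w, clusters))
```

The completion step is arbitrary by design: the analysis only uses the greedy edges placed in each cluster, through Lemma 2.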

The situation is more complex when the restriction that the cluster sizes are divisible by 4 is removed. We propose to apply Algorithm metric with sizes $c'_1, \ldots, c'_p$, where $c'_i = 4\lfloor c_i/4 \rfloor$, and complete the clusters arbitrarily to sizes $c_1, \ldots, c_p$. If the cluster sizes are large enough, the bound is not greatly affected by this change and will be asymptotically $\frac{3}{8}$. We now describe the changes in the algorithm and analysis that are necessary to account for any cluster sizes at least 4.

Suppose that for $i = 1, \ldots, p$, $c_i = 4k_i + \delta_i$, where $k_i \ge 1$ is an integer and $\delta_i \in \{0, 1, 2, 3\}$. Let $O_i$ be the $i$th cluster in an optimal solution and let $O'_i$ be a maximum weight subset of $O_i$ resulting from the deletion of $\delta_i$ vertices. Then

$$\frac{w(O'_i)}{w(O_i)} \ge \frac{|E(O'_i)|}{|E(O_i)|} = \frac{4k_i(4k_i-1)}{c_i(c_i-1)},$$

and therefore

$$\mathrm{opt} = \sum_{i=1}^{p} w(O_i) \le \sum_{i=1}^{p} \frac{c_i(c_i-1)}{4k_i(4k_i-1)}\, w(O'_i). \tag{1}$$

Since $|O'_i| = 4k_i$, $E(O'_i)$ can be covered by $4k_i - 1$ disjoint matchings. Hence, a maximum matching, $M_i$, in the subgraph induced by $O'_i$ satisfies $w(O'_i) \le (4k_i - 1)\,w(M_i)$ and with (1)

$$\mathrm{opt} \le \sum_{i=1}^{p} f_i\, w(M_i), \tag{2}$$

where $f_i \equiv f(c_i) = (4k_i - 1)\,\frac{c_i(c_i-1)}{4k_i(4k_i-1)} = \frac{c_i(c_i-1)}{4k_i}$. Note that $f$ is not necessarily monotone increasing. Let $j_1, \ldots, j_p$ be a permutation of $1, \ldots, p$ such that $f_{j_1} \ge \cdots \ge f_{j_p}$.

Let $M = M_1 \cup \cdots \cup M_p$. Note that $|M_i| = 2k_i$. Let $M'_{j_1}$ be the matching consisting of the $2k_{j_1}$ heaviest edges in $M$, $M'_{j_2}$ the next $2k_{j_2}$ heaviest edges, and so on up to $M'_{j_p}$ which consists of the $2k_{j_p}$ lightest edges in $M$. Clearly,

$$\sum_{i=1}^{p} f_i\, w(M_i) \le \sum_{i=1}^{p} f_{j_i}\, w(M'_{j_i}). \tag{3}$$

Consider a greedy matching $M^g$ of size $\frac{1}{2}|M|$. Let $M^g_{j_1}$ be the matching consisting of the $k_{j_1}$ heaviest edges in $M^g$, $M^g_{j_2}$ the next $k_{j_2}$ heaviest edges, and so on. Let $S^g_i$ be the set of end vertices of the edges in $M^g_i$, and let $\{S_1, \ldots, S_p\}$ be an arbitrary completion of $(S^g_1, \ldots, S^g_p)$ to disjoint clusters of sizes $(c_1, \ldots, c_p)$.

By Lemma 1

$$w(M'_{j_i}) \le 2\,w(M^g_{j_i}), \tag{4}$$

and by Lemma 2

$$w(S_i) \ge (c_i - k_i)\,w(M^g_i). \tag{5}$$

Combining (2)–(5) we obtain

$$\mathrm{opt} \le \sum_{i=1}^{p} g_i\, w(S_i),$$

where $g_i \equiv g(c_i) = 2f_i/(c_i - k_i) = c_i(c_i-1)/(2k_i(c_i-k_i))$. Let $\rho = [\max\{g(c_i) \mid i = 1, \ldots, p\}]^{-1}$, then

$$\mathrm{apx} = \sum_{i=1}^{p} w(S_i) \ge \rho\,\mathrm{opt}.$$

For example, for $c_i = 4, \ldots, 12$, $1/g_i$ is 0.5, 0.4, 0.333, 0.286, 0.428, 0.389, 0.356, 0.327, and 0.409. The bound improves over the previously known bound of $1/(2\sqrt{2}) \approx 0.353$ if $c_i \notin \{2, 3, 6, 7, 11, 15, 19\}$, and in particular if all cluster sizes are at least 20. When $\min\{c_i : i = 1, \ldots, p\} \to \infty$, $\rho \to \frac{3}{8}$.
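The per-size guarantees quoted above can be recomputed directly from $g(c) = c(c-1)/(2k(c-k))$ with $k = \lfloor c/4 \rfloor$. The short sketch below does so; the function name `guarantee` is an assumption introduced for illustration.

```python
import math

def guarantee(c):
    """Return 1/g(c) for a cluster size c >= 4, with k = floor(c / 4)."""
    k = c // 4
    g = c * (c - 1) / (2 * k * (c - k))
    return 1 / g

for c in range(4, 13):
    print(c, round(guarantee(c), 3))

# Sizes c >= 4 for which the guarantee does not exceed 1/(2*sqrt(2)).
threshold = 1 / (2 * math.sqrt(2))
print([c for c in range(4, 200) if guarantee(c) <= threshold])  # [6, 7, 11, 15, 19]
```

The printed values agree with the list in the text up to rounding (for $c = 8$ the exact value is $3/7 \approx 0.4286$, which the text truncates to 0.428).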

3. The uniform case

We now consider the uniform case, that is $c_i = c$ for $i = 1, \ldots, p$. Consider the set of partitions of $V$ into clusters of size $c$ each. A random solution is obtained by randomly (uniformly) selecting such a partition. The following theorem states a bound on the expected value of a random solution. When the cluster sizes are not identical, Example 1 in Section 4 shows that the expected weight of a random solution is not a good approximation. Also note that in contrast to a similar bound for the related MAXIMUM CUT PROBLEM, our result requires that the edge weights satisfy the triangle inequality.

Theorem 2. The expected weight of a random solution is at least $\frac{1}{2}\,\mathrm{opt}$.

Proof. Consider an optimal partition $\mathrm{OPT} = (O_1, \ldots, O_p)$. Let $M$ be a matching consisting of the union of maximum perfect matchings in the subgraphs induced by $O_1, \ldots, O_p$.

Suppose first that $c$ is even. Then $M$ is a perfect matching in $G$. Let $S_v$ be the (random) set of edges that are incident to $v$ in the cluster that contains $v$ in a random solution. Since every edge inside a cluster is counted from both of its ends, the weight of a random solution is

$$\frac{1}{2}\sum_{v \in V} w(S_v) = \frac{1}{2}\sum_{(u,v) \in M} [w(S_u) + w(S_v)].$$

Consider an edge $(u, v) \in M$. A solution in which $u$ and $v$ are contained in the same cluster satisfies, by the triangle inequality, that the average weight of an edge in $S_u \cup S_v$ is at least $\frac{1}{2}w(u, v)$. As for the solutions in which $u$ and $v$ belong to distinct clusters, we pair these solutions so that each pair consists of a solution and another one obtained by swapping $u$ and $v$. Again it follows from the triangle inequality that the average weight of an edge in $S_u \cup S_v$ in such a pair of solutions is at least $\frac{1}{2}w(u, v)$. Hence the expected average weight of an edge in a random solution is at least $1/2$ the average edge weight in $M$. The claim follows by observing that the average edge weight in $M$ is at least as large as the average edge weight in OPT (see for example [4]).

Suppose now that $c$ is odd. Then, $M$ leaves out one vertex from each subset. We consider the stars incident to these vertices in the optimal and random solutions. Since $M$ is the union of maximum matchings, the sum of these stars in OPT is at most $2w(M)$. On the other hand, from the same pairing argument and the triangle inequality, the expected total weight of these stars in a random solution is at least $w(M)$. The rest of the proof of this case follows the same arguments as when $c$ was assumed to be even.

We use the method of conditional probabilities [5] to de-randomize the algorithm while preserving its performance guarantee. At a given stage we have already determined the contents of several clusters and we deal with one active cluster that may now be partially filled. Let $r$ be the number of clusters yet to be constructed excluding the active one. Let $A$ be the vertices already assigned to the active cluster, $a = |A|$, and let $B$ be the yet undecided vertices, $b = |B|$. We maintain and update for each $u \in B$ the value $\alpha_u = \sum_{v \in A} w(u, v)$, as well as $\alpha = \sum_{u \in B} \alpha_u$ and $\beta = \sum_{u, v \in B} w(u, v)$. Altogether, these updates take $O(n^2)$ time.

The expected weight of a random completion of the assignment of vertices to clusters is

$$\frac{c-a}{b}\,\alpha + \frac{\binom{c-a}{2} + r\binom{c}{2}}{\binom{b}{2}}\,\beta.$$

The first term is the expected weight of edges between vertices that will be added to the active cluster and the vertices already in it, the second is the expected weight of edges between vertices in $B$ that will be placed in the same cluster.

At each step we examine the insertion of each $u \in B$ to the active cluster (contributing $\alpha_u$) followed by a random completion, and select the one that gives maximum expected weight. There are $n$ insertions and each requires $O(n)$ examinations. Thus, altogether the complexity is $O(n^2)$.
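A minimal Python sketch of this de-randomization is given below. It rests on one stated assumption: in a uniformly random completion, an undecided vertex joins the active cluster with probability $(c-a)/b$, and two undecided vertices end up in the same cluster with probability $\left[\binom{c-a}{2} + r\binom{c}{2}\right] / \binom{b}{2}$. The function names, the extra bookkeeping array `s` (weight from each undecided vertex to the other undecided vertices), and the arbitrary tie-breaking are choices made here for illustration, not details taken from the paper.

```python
from itertools import combinations
from math import comb

def partition_weight(w, clusters):
    """Total weight of edges inside the clusters."""
    return sum(w[u][v] for cl in clusters for u, v in combinations(cl, 2))

def derandomized_uniform(w, c):
    """Build n/c clusters of size c by the method of conditional expectations."""
    n = len(w)
    assert n % c == 0
    B = set(range(n))                                   # undecided vertices
    s = [sum(w[u][v] for v in range(n) if v != u) for u in range(n)]
    beta = sum(s) / 2                                   # weight of edges inside B
    r = n // c                                          # clusters not yet started
    clusters = []
    while B:
        r -= 1                                          # one cluster becomes active
        A, alpha = [], 0.0
        alpha_u = dict.fromkeys(B, 0.0)                 # weight from u to A

        def score(u):
            # Expected weight of edges not yet fixed, if u joins A next.
            a, b = len(A) + 1, len(B) - 1
            if b == 0:
                return alpha_u[u]
            same = (comb(c - a, 2) + r * comb(c, 2)) / comb(b, 2) if b > 1 else 0.0
            new_alpha = alpha - alpha_u[u] + s[u]
            return alpha_u[u] + (c - a) / b * new_alpha + same * (beta - s[u])

        while len(A) < c:
            u = max(B, key=score)
            B.remove(u)
            A.append(u)
            alpha += s[u] - alpha_u[u]
            beta -= s[u]
            for v in B:                                 # O(n) update per insertion
                alpha_u[v] += w[u][v]
                s[v] -= w[u][v]
        clusters.append(A)
    return clusters

pos = [0, 1, 3, 7]
w = [[abs(x - y) for y in pos] for x in pos]
clusters = derandomized_uniform(w, 2)
print(clusters, partition_weight(w, clusters))
```

Each insertion evaluates every candidate in constant time and then updates the maintained sums in $O(n)$ time, which matches the $O(n^2)$ total stated in the text.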

4. Some bad examples

In this section, we propose some natural algorithms and provide for each of them an instance for which the algorithm performs badly.

In Section 3 we have shown that the expected weight of a random solution is at least $\frac{1}{2}\,\mathrm{opt}$ when the clusters have a common size. The following example shows that when the cluster sizes are not identical the expected weight of a random solution may be very small relative to opt.

Example 1. Consider an instance with weights $w(1, j) = 1$, $j = 2, \ldots, n$, and $w(i, j) = 0$ otherwise. Let $c_1 = 2$ and $c_i = 1$, $i = 2, \ldots, n-1$. Then, only solutions in which vertex 1 is placed in the large cluster have a positive weight. This happens with probability $2/n$. Thus, $\mathrm{opt} = 1$ whereas the expected weight of a random solution is at most $2/n$, and the ratio can be made arbitrarily small.

Next, we show that a natural local search approach does not guarantee a constant bound, even in the uniform case.

Example 2. We construct an instance with 0/1 weights. The vertex set $V$ is composed of disjoint subsets $V_0, V_1, \ldots, V_\ell$. $|V_0| = (\ell - k)\ell$, and for $i = 1, \ldots, \ell$, $V_i$ has a distinguished vertex $v_i \in V_i$, and $|V_i| = k$. Thus, $|V| = \ell^2$. The subgraph induced by the unit weight edges consists of the cliques induced by $V_i$, $i = 1, \ldots, \ell$, and the clique induced by the distinguished vertices $v_1, \ldots, v_\ell$. (Vertices in $V_0$ are not adjacent to any edge with unit weight.) Suppose that $V$ must be partitioned into $\ell$ clusters of size $\ell$ each, and that $1 \ll k^2 \ll \ell \ll |V|$. The optimal solution contains a cluster with the distinguished vertices and its weight is of order $\ell^2$. A solution in which each $V_i$, $i = 1, \ldots, \ell$, is contained in a different cluster cannot be improved by relocating fewer than $k$ vertices, and its weight is $O(\ell k^2)$.

A perfect matching $M$ that, for each $p = 1, \ldots, |M|$, contains $p$ edges whose total weight is at least $\alpha$ times the maximum weight of a $p$-matching is said to be $\alpha$-robust. Hassin and Rubinstein [3] proved that there always exists a $1/\sqrt{2}$-robust matching, and that there are instances where a higher robustness is not possible.

Consider the following extension of the algorithm of Feo and Khellaf [2] and Hassin et al. [4]:

1. Compute an $\alpha$-robust matching $S$. Let $S = \{(u_j, v_j) \mid j = 1, \ldots, m\}$, where $w(u_j, v_j) \ge w(u_{j+1}, v_{j+1})$, $j = 1, \ldots, m-1$.

2. For $j = 1, \ldots, p$, let $d_j = \lfloor c_j/2 \rfloor$ and $D_j = d_1 + \cdots + d_j$. Let $D_0 = 0$. Set $V_i = \{u_j, v_j \mid j = D_{i-1}+1, \ldots, D_i\}$, $i = 1, \ldots, p$.

3. For each $i$ such that $c_i$ is odd, add to $V_i$ an arbitrary yet unassigned vertex.
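Steps 2 and 3 above can be sketched as follows, taking the matching from step 1 as given. The function name `clusters_from_matching` and its argument layout are assumptions for illustration; how the robust matching itself is computed is described in [3] and is not attempted here.

```python
def clusters_from_matching(w, matching, sizes):
    """Assign matched pairs to clusters in order of non-increasing edge weight.

    matching is assumed to be a perfect matching (floor(n/2) disjoint edges),
    and sizes the cluster sizes c_1, ..., c_p summing to n = len(w).
    """
    n = len(w)
    S = sorted(matching, key=lambda e: w[e[0]][e[1]], reverse=True)
    clusters, start = [], 0
    for c in sizes:
        d = c // 2                      # d_j = floor(c_j / 2) edges for cluster j
        clusters.append([x for e in S[start:start + d] for x in e])
        start += d
    assigned = {x for cl in clusters for x in cl}
    free = [v for v in range(n) if v not in assigned]
    for cl, c in zip(clusters, sizes):
        if c % 2 == 1:                  # odd size: add an arbitrary unassigned vertex
            cl.append(free.pop())
    return clusters

# Five points on a line with a hand-picked matching.
pos = [0, 1, 3, 7, 8]
w = [[abs(a - b) for b in pos] for a in pos]
print(clusters_from_matching(w, [(1, 3), (0, 4)], [3, 2]))  # [[0, 4, 2], [1, 3]]
```

The heaviest edges go to the first cluster, so listing the sizes in non-increasing order gives the largest cluster the heaviest matched pairs.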

Hassin and Rubinstein [3] proved that the algorithm returns an $\alpha/2$-approximation for the maximum clustering problem with cluster sizes $c_1, \ldots, c_p$. Thus, the best bound derived this way is $1/(2\sqrt{2})$. Moreover, since both a maximum matching and a greedy perfect matching are $\frac{1}{2}$-robust, using such matchings in the algorithm guarantees at least a $\frac{1}{4}$-approximation. The following example demonstrates that the $\frac{1}{4}$ bound is the best possible when using a maximum matching:

Example 3. Let $V = A \cup B \cup C \cup D \cup F$ where $|A| = M$, $|B| = |C| = |D| = |F| = M/2$, $D = \{d_1, \ldots, d_{M/2}\}$, and $F = \{e_1, \ldots, e_{M/2}\}$. Let

$$w(i, j) = \begin{cases} 2, & i, j \in B \cup C; \\ 1, & i \in B \cup C,\ j \notin B \cup C; \\ 1, & (i, j) = (d_l, e_l),\ l = 1, \ldots, M/2; \\ 0, & i, j \in A; \\ \tfrac{1}{2}, & \text{otherwise.} \end{cases}$$

Let $c = (M, 1, \ldots, 1)$ where the number of clusters of size 1 is $2M$, so that $\sum_i c_i = 3M = |V|$. An optimal solution will choose for the big cluster the set $B \cup C$ and thus

$$\mathrm{opt} = 2\binom{M}{2}.$$

The edges $\{(d_1, e_1), \ldots, (d_{M/2}, e_{M/2})\}$ completed by arbitrary disjoint edges between $A$ and $B \cup C$ constitute a maximum matching. The algorithm may choose such a matching and then have $D \cup F$ as the large cluster, and then

$$\mathrm{apx} = \frac{1}{2}\left[\binom{M}{2} - \frac{M}{2}\right] + \frac{M}{2} = \frac{M^2}{4}.$$

This gives asymptotically a bound of $\frac{1}{4}$.

We observe that a greedy matching may have an advantage over a maximum matching since it chooses larger edges for the large clusters. The following example shows however that the choice of a greedy matching does not guarantee more than a $\frac{3}{8}$-approximation, even for the case of two uniform clusters.

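Example 4. Let $V = A \cup B \cup C$ with $|A| = |B| = M/2$, $|C| = M$, $A = \{a_1, \ldots, a_{M/2}\}$, and $B = \{b_1, \ldots, b_{M/2}\}$. Let $c = (M, M)$, and

$$w(i, j) = \begin{cases} 2, & i \in C,\ j \in A \cup B; \\ 2, & (i, j) = (a_l, b_l),\ l = 1, \ldots, M/2; \\ \tfrac{4}{3}, & i, j \in A \text{ or } i, j \in B; \\ \tfrac{2}{3}, & i \in A,\ j \in B \text{ or } i \in B,\ j \in A; \\ 0, & i, j \in C. \end{cases}$$

The algorithm may choose the edges $(a_i, b_i)$ for the greedy matching, resulting in clusters $A \cup B$ and $C$. The weight of this solution is

$$\mathrm{apx} \approx \frac{2}{3}\left(\frac{M}{2}\right)^2 + 2 \times \frac{4}{3} \times \binom{M/2}{2} \approx \frac{M^2}{2}.$$

Consider a solution with sets $A \cup C_1$ and $B \cup C_2$, where $C_1, C_2 \subset C$ are disjoint and of cardinality $M/2$ each. The value of this solution is

$$2\left[2\left(\frac{M}{2}\right)^2 + \frac{4}{3}\binom{M/2}{2}\right] \approx \frac{4}{3}M^2.$$

Therefore the algorithm achieves in this case asymptotically no more than $\frac{3}{8}$.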
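The two solution values of Example 4 can be checked numerically by building the instance for a concrete even $M$. The sketch below is illustrative only; the function name `example4_weights` and the tuple encoding of vertices are assumptions introduced here.

```python
from itertools import combinations

def example4_weights(M):
    """Return (weight of clusters A∪B and C, weight of clusters A∪C1 and B∪C2)."""
    h = M // 2
    A = [("a", l) for l in range(h)]
    B = [("b", l) for l in range(h)]
    C = [("c", l) for l in range(M)]

    def w(x, y):
        kinds = {x[0], y[0]}
        if kinds == {"c"}:
            return 0.0                               # both in C
        if "c" in kinds:
            return 2.0                               # between C and A ∪ B
        if kinds == {"a", "b"}:
            return 2.0 if x[1] == y[1] else 2 / 3    # matched pair (a_l, b_l) or not
        return 4 / 3                                 # both in A or both in B

    def weight(cluster):
        return sum(w(x, y) for x, y in combinations(cluster, 2))

    greedy_solution = weight(A + B) + weight(C)
    better_solution = weight(A + C[:h]) + weight(B + C[h:])
    return greedy_solution, better_solution

for M in (20, 100, 200):
    g, b = example4_weights(M)
    print(M, round(g / b, 4))
```

For these weights the first value works out to exactly $M^2/2$ and the second to $\frac{4}{3}M^2 - \frac{2}{3}M$, so the printed ratio equals $3M/(8M-4)$ and approaches $3/8$ from above as $M$ grows.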

References

[1] T. Feo, O. Goldschmidt, M. Khellaf, One half approximation algorithms for the k-partition problem, Oper. Res. 40 (1992) S170–S172.

[2] T. Feo, M. Khellaf, A class of bounded approximation algorithms for graph partitioning, Networks 20 (1990) 181–195.

[3] R. Hassin, S. Rubinstein, Robust matchings, SIAM J. Discrete Math. 15 (2002) 530–537.

[4] R. Hassin, S. Rubinstein, A. Tamir, Approximation algorithms for maximum dispersion, Oper. Res. Lett. 21 (1997) 133–137.

[5] P. Raghavan, Probabilistic construction of deterministic algorithms: approximating packing integer programs, J. Comput. System Sci. 37 (1988) 130–143.

[6] R.R. Weitz, S. Lakshminarayanan, An empirical comparison of heuristic methods for creating maximally diverse groups, J. Oper. Res. Soc. 49 (1998) 635–646.
