An Efficient Graph-based Framework for Finding the Optimal Set of Non-overlapping Clusters

Fatih AkdagComputer Science Department

University of HoustonHouston, USA

[email protected]

Christoph F. EickComputer Science Department

University of HoustonHouston, USA

[email protected]

Abstract—Many hotspot discovery algorithms and some clustering algorithms (e.g. multi-run clustering, ensemble clustering) generate a set of overlapping clusters. However, it is usually desirable to remove redundant or suboptimal clusters that do not add much value to the clustering result; e.g. clusters with low interestingness values, small sizes or low densities, which are usually not desirable if they highly overlap with other clusters. In this paper, we introduce a framework that finds an optimal set of clusters with respect to an overlap threshold and a cluster interestingness function in a set of overlapping clusters. The overlap threshold is used to limit the degree to which clusters may overlap, and a plugin cluster interestingness function is employed to assess the importance of a cluster for the task at hand. We propose a methodology which creates an overlap graph of the clusters and finds the optimal set of non-overlapping clusters in this graph, reformulating this problem as a maximum weight clique problem. As the maximum weight clique problem is NP-hard, we introduce two heuristics to speed up the proposed approach and assess their benefits experimentally. The proposed methodology is evaluated in a case study in which we find non-overlapping hotspots in a set of highly overlapping spatial hotspots discovered in an air pollution dataset, and we use our methodology to find the optimum set of clusters for an application of multi-run clustering to a dataset that contains Gaussian clusters.

Keywords— Overlapping clusters; Maximum weight clique; maximum weight independent set; data mining; clustering

I. INTRODUCTION

In general, the use of clustering algorithms faces several challenges. Firstly, almost all clustering algorithms require the setting of input parameters, which is a non-trivial task, and choosing proper values for those parameters is critical for obtaining high-quality clusters. Furthermore, many clustering algorithms are probabilistic, and different runs, even with the same parameters, lead to different results. Moreover, different data sets have unique implicit characteristics, and capturing these characteristics requires using different parameter settings of the employed clustering algorithms. Finally, domain experts frequently look for clusters which exhibit additional, unique characteristics that go far beyond the capabilities of traditional clustering algorithms. A second challenge in employing clustering algorithms is finding alternative clusters. For example, Ding et al. [3] apply a spatial clustering algorithm for finding hotspots in spatial datasets at different granularities, ranging from very local to regional. In general, it is not realistic to discover all significant characteristics of a dataset in a single run of a clustering algorithm; even for simple clustering tasks, clustering algorithms have to be run multiple times. This establishes the need to analyze the results of several runs of a clustering algorithm, and of multiple clustering algorithms as well. Recent research focuses on addressing this challenge: for example, alternative clustering [4] constructs a new clustering based on an already known clustering, and ensemble clustering aggregates multiple clusters into a single consolidated clustering [5,6].

Hotspot discovery algorithms [….] face a similar problem: in general, almost all approaches grow hotspots by enlarging a small seed hotspot. However, different seed hotspots are frequently grown into identical or highly overlapping hotspots. Consequently, similar to ensemble clustering, there is a need to remove overlapping hotspots and to find the optimal set of hotspots. Addressing this problem is the main focus of this paper.

In particular, this paper proposes a methodology that selects the best set of clusters/hotspots from a given set of clusters/hotspots. Our methodology relies on the use of a "plugin" interestingness function that assesses the quality of the available clusters/hotspots, formulates obtaining the best set of clusters/hotspots as an optimization problem, and provides a computational framework to solve this optimization problem. Moreover, our methodology is generic in the sense that it can be used in conjunction with hotspot discovery, multi-run clustering, ensemble clustering and meta clustering frameworks. Its main contributions include:

1) A generic framework that selects an optimal set of non-overlapping clusters from a set of overlapping clusters with respect to a plugin cluster interestingness function and an overlap threshold.

2) A reformulation of this selection problem as a maximum weight clique problem on a cluster overlap graph.

3) Two heuristics, graph simplification and graph partitioning, that significantly speed up the computation of the optimal solution.

4) An experimental evaluation involving hotspots discovered in an air pollution dataset and multi-run clustering of a dataset containing Gaussian clusters.



The rest of the paper is organized as follows. In Section 2, we describe the graph-based framework. Section 3 provides a detailed discussion of our methodology. We present the experimental evaluation in Section 4. We review the related work in Section 5 and Section 6 concludes the paper.

[3] W. Ding, C. F. Eick, J. Wang, and X. Yuan, “A framework for regional association rule mining in spatial datasets,” in Proc. IEEE Int. Conf. Data Mining, 2006.

[4] E. Bae, and J. Bailey, “COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity,” in Proc. 6th Int. Conf. Data Mining, 2006, pp. 56-62.

[5] Y. Zeng, J. Tang, J. Garcia-Frias, and R. G. Gao, “An adaptive meta-clustering approach: combining the information from different clustering results,” in Proc. IEEE Computer Society Conf. Bioinformatics, 2002.

[6] A. Gionis, H. Mannila, and P. Tsaparas, “Clustering aggregation,” in Proc. 21st Int. Conf. Data Engineering, 2005.

II. FRAMEWORK

Input: a set of clusters S, and an overlap threshold λ, where 0 ≤ λ < 1.

Problem: Find a subset S’ ⊆ S for which ∑C∈S’ i(C) is maximal, where i(C) is the interestingness of cluster C, subject to the following constraint: for all distinct C, C’ ∈ S’, overlap(C, C’) ≤ λ, where

overlap(C, C’) = |C ∩ C’| / min(|C|, |C’|)    (1)

We define the degree of overlap between two clusters as the ratio of the number of objects shared by both clusters to the number of objects in the smaller cluster. For example, if one cluster has 100 objects, another one has 80 objects, and they share 60 objects, then the degree of overlap is 60/80 = 0.75. In definition (1), the number of objects in the smaller cluster is used in the denominator to ensure that small clusters contained in larger clusters are eliminated. Alternatively, the total number of objects in both clusters could be used as the denominator; however, if cluster A with 1000 objects completely contains all objects of cluster B, which has 100 objects, then the overlap ratio would be 100/1100 = 0.09, which implies a very low degree of overlap. Definition (1) overcomes this problem. However, our framework is quite extensible and allows plugin overlap function definitions.
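Definition (1) can be sketched in a few lines of code; this is an illustrative implementation (the function name and cluster representation as object sets are assumptions, not the paper's actual code):

```python
def overlap(c1, c2):
    """Degree of overlap per definition (1): shared objects divided by
    the size of the smaller cluster."""
    c1, c2 = set(c1), set(c2)
    return len(c1 & c2) / min(len(c1), len(c2))
```

On the worked example from the text, two clusters of 100 and 80 objects sharing 60 objects yield an overlap of 0.75, and a small cluster fully contained in a larger one yields 1.0 rather than the misleadingly low ratio produced by the alternative denominator.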

The interestingness measure is also a plugin function, which is used to assess the importance or value of a cluster for the task. The interestingness function depends on the problem at hand. Some clustering algorithms assign a reward value to each cluster; in this case the reward value is directly used as the interestingness of the cluster. In other cases a cluster interestingness measure needs to be defined. Some sample interestingness functions are:

1) i1(C) = |C|: the number of objects in the cluster determines the value of the cluster.

2) i2(C) = Area(C): the area of the spatial cluster determines the value of the cluster.

3) i3(C) = Volume(C): the volume of the 3D cluster determines the value of the cluster.

4) i4(C) = |C| / Area(C): the density of the cluster determines the value of the cluster.

For example, when mining densely populated areas in a spatial dataset with respect to a variable, once a set of clusters has been obtained, it makes sense to use the density interestingness function i4 to calculate the interestingness of a cluster.

III. METHODOLOGY

In this paper, we formulate the optimization problem defined in Section 2 as a graph problem and present a methodology that finds the optimal solution in multiple steps.

A. Creating an overlap graph

In this subsection, we present the methodology for creating an overlap graph for a set of clusters.

Definition: A cluster overlap graph is a weighted undirected graph G(V,E) in which each vertex v corresponds to a cluster and the weight of each vertex corresponds to the interestingness of the cluster. There is an edge between two vertices v and u if and only if the degree of overlap between the clusters represented by v and u is larger than the overlap threshold λ.

We first calculate the interestingness of each cluster using the plugin interestingness function. Next, we create the cluster overlap graph G, in which there is a vertex for each cluster and the weight of each vertex is set to the interestingness of the cluster. We calculate the degree of overlap between each pair of clusters using the plugin overlap function and create an edge in the overlap graph between the vertices representing these clusters if the degree of overlap is more than the overlap threshold. Fig. 1a shows a sample set of overlapping clusters with their interestingness values. Fig. 1b shows the overlap graph in which there is an edge (solid lines) between vertices if the degree of overlap between the clusters is larger than 0.4 according to definition (1), which requires that one cluster contain 40% of another to be considered overlapping. Dashed lines in Fig. 1b represent overlapping clusters whose degree of overlap is less than the overlap threshold, so those edges are not created in the final graph (Fig. 1c).
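The construction just described can be sketched as follows; the function name, the representation of clusters as object sets, and the adjacency-map graph representation are illustrative assumptions:

```python
def build_overlap_graph(clusters, interestingness, threshold):
    """clusters: list of sets of objects; interestingness: plugin function;
    threshold: the overlap threshold λ. Returns per-vertex weights and an
    adjacency map with an edge whenever the degree of overlap exceeds λ."""
    n = len(clusters)
    weights = [interestingness(c) for c in clusters]
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            # Degree of overlap per definition (1)
            ratio = len(clusters[i] & clusters[j]) / min(len(clusters[i]),
                                                         len(clusters[j]))
            if ratio > threshold:
                adj[i].add(j)
                adj[j].add(i)
    return weights, adj
```

For instance, with `len` as the interestingness function, two clusters sharing half of the smaller cluster's objects are connected when λ = 0.4, while a disjoint third cluster remains an isolated vertex.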


a. A set of overlapping clusters

b. All overlaps shown

c. Final overlap graph G

Fig. 1. A sample set of overlapping clusters and the creation of their overlap graph G: solid lines between vertices represent edges, and dashed lines represent low degrees of overlap (not edges).

Since the goal is to find an optimal set of non-overlapping clusters that maximizes the total interestingness, and an edge between vertices indicates overlap, this optimization problem is now reduced to finding a set of vertices in G that maximizes the total weight such that there is no edge between any pair of vertices.

Definition: An independent set is a set of vertices in a graph in which there is no edge between any pair of vertices.

Considering the graph in Fig. 1c, no two of the vertices with weights {70, 50, 80} are adjacent, thus this set is an independent set. Some other independent sets are {125, 80}, {130, 80} and {70, 50}, and each vertex is also an independent set by itself.

Definition: Maximum weight independent set (MWIS) is the independent set in a graph with the maximum total weight of vertices.

We observe that the maximum weight independent set in the overlap graph G will represent the set of non-overlapping clusters with the maximum total interestingness. Thus, the problem is now reduced to finding the MWIS in the cluster overlap graph. We can also think of this problem in the following sense, considering the complement graph of G:

Definition: The complement of a graph G is a graph G’ on the same vertices such that there is an edge between two distinct vertices of G’ if and only if they are not adjacent in G.

If there is no edge between a pair of vertices in G, which means they are independent, then there will be an edge between those vertices in the complement graph G’. The problem can now be converted to finding the subset of vertices in G’ with the maximum weight in which there is an edge between all pairs of vertices, which is named a complete subset or clique in graph theory:

Definition: A clique is a subset of vertices in a graph in which there is an edge between all pairs of vertices in the set.

Considering the graph in Fig. 1c, the set of vertices with weights {125, 150, 50}, {70,150, 125}, {80, 150}, {130} are some of the cliques. Each vertex itself is also a clique.

The problem of finding the maximum weight clique (MWC), and its dual problem of finding the maximum weight independent set in a graph, are well-known NP-hard problems, and there has been exhaustive research on this topic. Bomze et al. [3] give a survey of exact and approximate solutions to this problem. The problem is also known to be hard to approximate, as shown in [10]; that is, the optimal solution cannot be efficiently approximated to a certain degree. In our methodology, we use the maximum weight clique algorithm proposed by Östergård [16], which finds the optimum solution; moreover, its implementation is available for public use and it is quite fast: it can often find maximum weight cliques in graphs with up to a hundred vertices in under a second. However, it takes minutes when the input graph is complex, with thousands of vertices; thus, we preprocess the overlap graph, significantly simplifying and partitioning it to improve the efficiency of the methodology, as presented in the next section.

B. Simplification of overlap graph

We simplify the graph by removing the vertices which are guaranteed to be eliminated by maximum weight independent set or maximum weight clique algorithms. We define the "overlap set" of a cluster as the set of clusters the cluster overlaps with, including the cluster itself. If two clusters overlap and they overlap with the same set of clusters, then their overlap sets are the same, and the one with the higher interestingness will always be preferred by the maximum weight independent set, in case either of these clusters is in this set at all. This also holds for a set of overlapping clusters: if they all have the same overlap set, only one of them can be selected into the maximum weight independent set, and this will be the cluster with the highest interestingness. Keeping only one cluster among such clusters reduces the graph size dramatically and improves the efficiency of the framework significantly.

Procedure SimplifyGraph(G)
    foreach vertex vi in G
        si = overlap set of vi
    end foreach
    Set RemovalSet = empty set of vertices
    foreach vertex v1 in G
        foreach vertex v2 adjacent to v1
            if |s1| = |s2| and |s1 ∪ s2| = |s1| then
                if v1.weight < v2.weight then RemovalSet.Add(v1)
                else RemovalSet.Add(v2)
            end if
        end foreach
    end foreach
    foreach vertex vj in RemovalSet
        G.Remove(vj)
    end foreach
End Procedure

Figure 2. Overlap graph simplification algorithm

Figure 2 depicts the algorithm we use for simplifying the graph. The following steps are applied when simplifying a graph:

1) For each vertex, create a Set data structure and put the vertex itself and all of its adjacent vertices into the set. This set will be called “overlap set” of a vertex.

2) Compare each vertex’s overlap set with the overlap set of other vertices with which this vertex is connected. Add a vertex into a “removal set” if its weight is lower than a vertex with the same overlap set.

3) Remove vertices in the removal set from the overlap graph G.
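The three steps above can be sketched on the adjacency-map representation as follows; function and variable names are illustrative, and ties in weight are broken by vertex id here (an assumption, since Figure 2 does not specify tie-breaking):

```python
def simplify_graph(weights, adj):
    """Remove every vertex whose overlap set (the vertex plus its
    neighbors) equals that of an adjacent, heavier vertex.
    Mutates adj in place and returns the set of removed vertices."""
    # Step 1: overlap set of each vertex = closed neighborhood
    overlap_set = {v: adj[v] | {v} for v in adj}
    # Step 2: collect dominated vertices
    removal = set()
    for v in adj:
        for u in adj[v]:
            if overlap_set[v] == overlap_set[u]:
                # keep the heavier vertex; tie broken by vertex id
                removal.add(v if (weights[v], v) < (weights[u], u) else u)
    # Step 3: remove them from the graph
    for v in removal:
        del adj[v]
    for v in adj:
        adj[v] -= removal
    return removal
```

On the graph of Fig. 1c (vertex weights 70, 50, 80, 125, 130, 150), the vertices with weights 125 and 130 have identical overlap sets, so the lighter one (125) is removed, matching the discussion of Fig. 3 below.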

Figure 3. Simplified overlap graph

Fig. 3 visualizes the simplified overlap graph for the graph in Fig. 1c. In the simplification process, an overlap set is first created for each vertex. The vertex with weight 125 in Fig. 1c has an overlap set of {50, 70, 125, 130, 150}, which is the same as the overlap set of the vertex with weight 130; thus it is impossible for the vertex with weight 125 to be in the maximum weight independent set, as its weight is smaller than 130. The overlap sets of all other pairs of vertices are different, so no further simplification is done on this graph. In graphs with a very large number of overlaps, the simplification step reduces the graph significantly.

The worst-case runtime complexity of the simplification algorithm is O(|V| × |E|²), where |V| is the number of vertices (clusters) and |E| is the number of edges in the graph (i.e. the number of pairs with a high degree of overlap). In our implementation of the set data structure we use a hash set, which assigns a hash value to each set element; thus checking for inclusion of an element in the set and adding/removing an element to/from the set are all achieved in O(1) expected time.

To the best of our knowledge, simplification of a graph while calculating the maximum weight clique or independent set using an overlap set is unique to our approach. Next, we do another optimization by partitioning the graph into sub-graphs.

C. Partitioning the overlap graph

Definition: A connected component of an undirected graph G is a subgraph C in which all pairs of vertices are connected to each other by paths, and which is not connected to any other vertices in the graph.

By definition, the vertices in each connected component of a graph are independent from the vertices in other connected components, which allows further optimizations for finding the maximum weight independent set in the overlap graph. Instead of running an NP-hard algorithm on the whole overlap graph, it makes sense to run the algorithm on each connected component of the already simplified graph. We identify the connected components in the overlap graph G and then find the maximum weight independent set of each connected component Ci by finding the maximum weight clique in Ci’ (the complement of Ci). The final optimal solution is the union of the maximum weight independent sets of all connected components.

Figure 4. Complement of the graph shown in Fig 3.

There is only one connected component in the graph in Fig. 3. So, we create the complement graph of this component (shown in Fig. 4) and find the maximum weight clique in it. In this example, there are many possible cliques. Each vertex itself is a clique. The vertices with weights 70, 50 and 80 form a clique of size 3 with a total weight of 200; they are not adjacent in Figure 3, and they are all pairwise adjacent in Figure 4. On the other hand, the vertices with weights 130 and 80 form a clique of size 2 with a total weight of 210, and this subset yields the maximum weight clique.
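This worked example can be checked with a brute-force search; the sketch below is exponential in the number of vertices and only stands in for the exact Östergård solver the methodology actually uses, so it is suitable for tiny graphs only:

```python
from itertools import combinations

def complement(adj):
    """Complement graph on the same vertices: edges exactly where G has none."""
    verts = list(adj)
    return {v: {u for u in verts if u != v and u not in adj[v]} for v in verts}

def max_weight_clique(weights, adj):
    """Exhaustive maximum weight clique search (illustration only)."""
    best, best_w = [], 0
    for r in range(1, len(adj) + 1):
        for sub in combinations(adj, r):
            # a clique requires all pairs to be adjacent
            if all(u in adj[v] for v, u in combinations(sub, 2)):
                w = sum(weights[v] for v in sub)
                if w > best_w:
                    best, best_w = list(sub), w
    return best, best_w
```

Running `max_weight_clique` on the complement of the simplified graph of Fig. 3 (weights 70, 50, 80, 130, 150) recovers the clique {130, 80} with total weight 210, confirming the equivalence between a maximum weight independent set in G and a maximum weight clique in G'.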

The steps of our methodology can be summarized as follows:

1) Calculate the interestingness of each cluster using the plugin interestingness function.

2) Create a weighted overlap graph of clusters in which weight of each vertex is the interestingness of the cluster and

there is an edge between two vertices if their degree of overlap is more than the overlap threshold λ.

3) Simplify the overlap graph by eliminating vertices which cannot be in the optimal solution (Fig. 2).

4) Find the connected components in the simplified overlap graph.

5) For each connected component c, create the complement graph c’.

6) Find the maximum weight clique (MWC) in each complement graph c’.

7) The union of all vertices in the MWCs is the optimal solution.
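Steps 4 to 7 can be sketched as follows on the adjacency-map representation; the exhaustive per-component search below again stands in for the exact MWC solver, and all names are illustrative:

```python
from collections import deque
from itertools import combinations

def connected_components(adj):
    """Step 4: find connected components via breadth-first search."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            comp.add(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    queue.append(u)
        comps.append(comp)
    return comps

def best_independent_set(weights, adj, comp):
    """Steps 5-6: maximum weight independent set of one component
    (equivalent to a maximum weight clique in its complement)."""
    verts = sorted(comp)
    best, best_w = [], 0
    for r in range(1, len(verts) + 1):
        for sub in combinations(verts, r):
            if all(u not in adj[v] for v, u in combinations(sub, 2)):
                w = sum(weights[v] for v in sub)
                if w > best_w:
                    best, best_w = list(sub), w
    return best

def optimal_vertices(weights, adj):
    """Step 7: union of the per-component solutions."""
    chosen = []
    for comp in connected_components(adj):
        chosen += best_independent_set(weights, adj, comp)
    return sorted(chosen)
```

Because the components share no edges, solving each one independently and taking the union of the solutions yields the same optimum as solving the whole graph at once.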

IV. EXPERIMENTAL EVALUATION

In this section, we first evaluate our methodology in a case study involving a gridded air pollution dataset. We use the proposed framework to find the optimum set of non-overlapping hotspots discovered in an air pollution dataset. Next, we find an optimum set of disjoint clusters in a set of clusterings found by a clustering algorithm using different parameters for each run.

A. Hotspot Discovery Algorithm

In this subsection, we present a case study in which we apply the proposed methodology to the output of the hotspot discovery algorithm presented in [1]. The dataset contains 27 air pollution hotspots discovered by the hotspot discovery algorithm defined in [1], and there are 138 pairs of overlapping hotspots. With the overlap threshold set to 0.6, 83 pairs of hotspots (edges) were found to have a degree of overlap larger than 0.6. After removing weak edges, 48 edges were eliminated and the number of edges in the graph decreased dramatically from 83 to 31. Four connected components exist in the overlap graph. We show the effects of the graph simplification steps on just one connected component, shown in Figure 5, as the whole graph is too large and hard to comprehend. As seen in Figure 5a, there are many overlapping hotspots in the connected component; however, about half of the overlapping pairs of hotspots had a low degree of overlap, so the overlap graph was created with 20 edges and 10 vertices, as seen in Figure 5b.


a. Connected Component 1: all overlaps

b. Highly overlapping clusters

c. Graph after simplification

d. Complement of connected component 1

Figure 5. Graph simplification algorithm applied to a connected component in the overlap graph of low-variance hotspots

The first number in each vertex shows the hotspot number and the number in parentheses shows the interestingness of the hotspot. After running the graph simplification algorithm, four of the hotspots (hotspots 25, 16, 8 and 4) and 14 of the edges were eliminated, and the complexity of the graph decreased dramatically, as seen in Figure 5c. Next, the complement graph of this connected component is created (Figure 5d), and the maximum weight clique algorithm is run on this graph. The clique consisting of vertices 3, 5 and 9 was detected as the maximum weight clique in this graph. It can be verified that the final set of hotspots 3, 5 and 9 are disjoint in Fig. 5c, and that they are all adjacent in Fig. 5d. We found the maximum weight clique for all connected components, and the union of the maximum weight cliques gives the optimal set of hotspots that maximizes the total reward value. Eight non-overlapping hotspots were identified as the optimum solution for the whole set of 27 clusters.

In experiments with a much larger number of clusters, the difference is more significant. For a graph with 356 vertices, the MWC algorithm took 48 minutes to find the maximum weight clique in the complement graph; however, after we applied the graph simplification, the graph size decreased to 125 vertices and the MWC algorithm took only 6 seconds to find the optimal solution.

B. Multi-run clustering

In this subsection, we use our methodology as part of multi-run clustering. The key hypothesis of multi-run clustering is that better clustering results can be obtained by combining clusters that originate from multiple runs of clustering algorithms. Each clustering result contains a set of disjoint clusters; however, clusters that originate from different runs of the same clustering algorithm usually overlap considerably. For example, with the well-known k-means algorithm, it is usually hard to find a k value that produces the optimal clusters when the structure of the dataset is not known in advance. Our hypothesis is that running k-means for different values of k (the number of clusters) and then combining the clusters from the different runs leads to a better result than running k-means for the single optimal k-value. In this approach, k-means is run with multiple k values, the result of each run is saved, and the optimal set of clusters is then found using an internal evaluation criterion. Internal cluster evaluation methods usually assign the best score to the clustering that produces clusters with high similarity within a cluster and low similarity between clusters. Some well-known internal cluster evaluation functions are the Davies-Bouldin index, the Dunn index and the silhouette coefficient. The silhouette coefficient contrasts the average distance to elements in the same cluster with the average distance to elements in other clusters. Objects with a high silhouette value are considered well clustered. This index works well with k-means clustering and is also widely used to determine the optimal number of clusters.

s(i) = (b(i) − a(i)) / max(a(i), b(i))    (2)

where a(i) is the average dissimilarity of i to all other data points within the same cluster, and b(i) is the lowest average dissimilarity of i to any other cluster (the closest cluster). Obviously −1 ≤ s(i) ≤ 1, and larger values of s(i) represent a better clustering.
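The silhouette coefficient as defined by a(i) and b(i) above can be computed directly; this is a minimal, self-contained sketch using Euclidean distance (function and variable names are illustrative, and singleton clusters are assumed absent):

```python
from math import dist

def silhouette(points, labels):
    """Per-point silhouette values s(i) = (b(i) - a(i)) / max(a(i), b(i))."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    values = []
    for p, label in zip(points, labels):
        # a(i): average distance to the other members of i's own cluster
        same = [dist(p, q) for q in clusters[label] if q != p]
        a = sum(same) / len(same)
        # b(i): lowest average distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in members) / len(members)
                for other, members in clusters.items() if other != label)
        values.append((b - a) / max(a, b))
    return values
```

On two tight, well-separated clusters, every point receives a silhouette value close to 1, consistent with the interpretation that high values indicate well-clustered objects.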

In the multi-run clustering experiment, we use k-means with different k values to cluster a 2-dimensional dataset called S1, which was used in [7] and [8]. We find the silhouette coefficients of all clusters in each clustering and use this value as the reward of a cluster. Next, we apply our methodology to find the set of non-overlapping clusters which maximizes the total reward.

Dataset: The S1 dataset contains 5000 2-dimensional data points and 15 Gaussian clusters. Figure X shows S1:


Step 1. Multi-run clustering: We run k-means with k values from 5 to 25 and, for each cluster in each clustering, we calculate the silhouette coefficient. Although there are 15 clusters in the dataset, the clustering with the maximum average silhouette width (0.83) was found when k was set to 18. For k=15, the average silhouette width was 0.81.

Step 2. Create overlap graph: In this step, we create an overlap graph of the clusters where each node in the graph corresponds to a cluster and the weight of each node represents the silhouette coefficient of the cluster. It should be noted that we only use clusters with a silhouette coefficient larger than a threshold value (0.6 in this case), as we are only interested in good clusters. Setting such a threshold when using our methodology for cluster aggregation both keeps bad clusters out of the final result and decreases the size of the graph. Out of the 315 clusters involved, 203 had a reward value larger than 0.6. Figure V shows the graph.


Figure V: Overlap graph

Step 3: Simplify overlap graph: We simplified the overlap graph using the graph simplification algorithm; out of 203 vertices, 171 were removed and we obtained a much smaller graph with only 32 vertices. The resulting graph is shown in Fig. X.

Figure X: Simplified Overlap graph

Step 4: Finding the maximum weight independent set: There are 4 connected components in this graph. We run the maximum weight clique algorithm on the complement of each connected component, and the union of the maximum weight cliques gives the optimal set of non-overlapping vertices in the graph. The optimal clustering was found to have 15 clusters (which is the correct number), as opposed to 18, which had the maximum average silhouette coefficient. Finding the maximum weight independent set takes under a second when the graph is simplified, which would take hours [will replace exact number] before simplification.

The final clustering is shown in Fig. Y:

Figure Y: Optimal Clusters

The clusters in Fig. Y contain a total of 4985 points. The remaining 15 points do not belong to any cluster, as they were part of clusters that were eliminated. Those points can be assigned to the cluster with the closest centroid. Moreover, because many of these clusters were created by different clusterings, once the correct clusters have been identified by our methodology, a trivial improvement to the final clustering is to reassign all points to the closest centroid. After re-assigning each point to the closest cluster, the silhouette plot of the final clustering improves mildly. Figure Z shows the clusters after reassignment.

Figure Z. Clusters after re-assigning points
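The reassignment step can be sketched as follows (a minimal nearest-centroid assignment; cluster labels and point representations are illustrative):

```python
from math import dist

def reassign_to_closest_centroid(points, clusters):
    """Reassign every point to the cluster with the nearest centroid.
    `clusters` maps a cluster label to its list of points (tuples)."""
    centroids = {
        label: tuple(sum(coord) / len(pts) for coord in zip(*pts))
        for label, pts in clusters.items()
    }
    return {p: min(centroids, key=lambda lbl: dist(p, centroids[lbl]))
            for p in points}

# toy example: the previously unassigned point (1, 1) lands in cluster "a"
clusters = {"a": [(0.0, 0.0), (0.0, 2.0)], "b": [(10.0, 0.0), (10.0, 2.0)]}
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0), (1.0, 1.0)]
assignment = reassign_to_closest_centroid(points, clusters)
```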

The silhouette plot of all clusters in the final clustering is shown in Fig Z1:

Fig Z1: Silhouette width of non-overlapping clusters
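The silhouette widths reported here follow the standard definition s = (b - a) / max(a, b), where a is a point's mean distance to its own cluster and b is its smallest mean distance to any other cluster; a minimal sketch:

```python
from math import dist

def silhouette_width(point, own_cluster, other_clusters):
    """Silhouette coefficient of a single point: (b - a) / max(a, b)."""
    rest = [q for q in own_cluster if q != point]
    # a: mean distance to the other members of the point's own cluster
    a = sum(dist(point, q) for q in rest) / len(rest) if rest else 0.0
    # b: smallest mean distance to the members of any other cluster
    b = min(sum(dist(point, q) for q in cl) / len(cl)
            for cl in other_clusters)
    return (b - a) / max(a, b)

# a well-separated point scores close to 1
s = silhouette_width((0.0, 0.0), [(0.0, 0.0), (0.0, 1.0)],
                     [[(10.0, 0.0), (10.0, 1.0)]])
```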

After re-assigning points to the detected clusters, the quality of the clusters improves. As shown in Fig. Z1, the average silhouette width is 0.87, which is higher than that of any single clustering alone (the best single clustering, obtained by setting k=18, reached 0.83). The reassignment phase increases the average silhouette width to 0.88, as shown in Fig. Z2:

Fig Z2: Silhouette width of non-overlapping clusters

Figure Z3 shows the same plot for k=15 before using our methodology:

Finally, Fig. Z4 shows the same plot for k=18 which was found to have the best average silhouette width without using our methodology:

We can conclude that our methodology was able to detect the correct cluster size and significantly improve the quality of the clustering result.


V. LITERATURE REVIEW

Clustering algorithms that create overlapping clusters usually deal with this problem by applying spatial constraints during the clustering process such that two clusters cannot overlap. Although these constraints are effective at obtaining disjoint clusters, they sacrifice the ability to identify better clusters.

Hotspot detection algorithms such as SatScan report only the most significant hotspot when hotspots overlap, although users can choose to report all hotspots even when they overlap.

Nan Li et al. present the algorithm that is most similar to our approach. They formulate the cluster aggregation problem as a Maximum-Weight Independent Set problem, as we do in this paper; however, their solution is quite different. They assume that the clusters in each clustering are disjoint and, based on the assumption that each clustering is therefore already an independent set of clusters, they grow each independent set by removing a weak cluster from the existing set and adding better clusters from other clusterings in each step, as long as the total interestingness of the independent set increases and the clusters do not overlap. Their algorithm is a form of replacement search and finds an approximate solution. Our methodology, on the other hand, makes no assumptions about the disjointness of the clusters in each clustering; consequently, their algorithm cannot be used for our first experiment, in which the input is a set of highly overlapping clusters. Moreover, we employ Östergård's algorithm, which finds the exact solution to the maximum weight clique problem.

VI. CONCLUSION

We proposed a novel graph-based algorithm that finds an optimal set of clusters that overlap to a degree less than a threshold value. The proposed algorithm significantly improves the total runtime of the employed maximum weight clique algorithm by preprocessing and simplifying the overlap graph.

Some algorithms in the literature create clusters that overlap, and they usually do not employ any post-processing to select a set of non-overlapping clusters. Our methodology can be used as a post-processing step for such algorithms.

Our methodology can also be used for cluster aggregation. Many clustering algorithms require parameters that are hard to estimate. By running the same clustering algorithm multiple times with different parameters each time, one can obtain multiple clustering results. Using a cluster evaluation criterion, we assign a reward value to each cluster and, among all clusters generated, find the set of non-overlapping clusters with the highest total reward. We showed that our methodology was able to find the exact number of clusters in a dataset without any prior knowledge of the dataset and significantly improved the clustering output.
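The candidate-generation phase of this multi-run aggregation can be sketched as follows (`cluster_fn` and `reward_fn` are hypothetical plug-ins standing in for the clustering algorithm and the cluster interestingness function):

```python
def multi_run_candidates(data, k_values, cluster_fn, reward_fn):
    """Run the clustering algorithm once per parameter value and pool every
    resulting cluster together with its reward (interestingness) value."""
    candidates = []
    for k in k_values:
        for cluster in cluster_fn(data, k):
            candidates.append((cluster, reward_fn(cluster)))
    return candidates

# toy stand-ins: split the data into k contiguous chunks, reward = cluster size
def chunk_clustering(data, k):
    n = len(data)
    return [data[i * n // k:(i + 1) * n // k] for i in range(k)]

pool = multi_run_candidates(list(range(6)), [1, 2], chunk_clustering, len)
```

The pooled candidates would then feed Steps 2-4 above (overlap graph, simplification, maximum weight independent set).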

REFERENCES

[1] Akdag, F., Davis, J. U., & Eick, C. F. (2014, November). A computational framework for finding interestingness hotspots in large spatio-temporal grids. In Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Analytics for Big Geospatial Data (pp. 21-29). ACM.

[2] Bomze, I. M., Budinich, M., Pardalos, P. M., & Pelillo, M. (1999). The maximum clique problem. In Handbook of combinatorial optimization (pp. 1-74). Springer US.

[3] EPA. (n.d) retrieved 6/30/2015, from Air Data Web Site: http://www.epa.gov/airquality/airdata/

[4] Garey, M. R., & Johnson, D. S. (1979). Computers and intractability: a guide to the theory of NP-completeness. 1979. San Francisco, LA: Freeman.

[5] Östergård, P. R. (2002). A fast algorithm for the maximum clique problem. Discrete Applied Mathematics, 120(1), 197-207.

[6] Wang, S., Cai, T., & Eick, C. F. (2013, December). New Spatiotemporal Hotspoting Algorithms and their Applications to Ozone Pollution. In: Proc. 8th International Workshop on Spatial and Spatio-Temporal Data Mining, IEEE, 2013.

[7] P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006

[8] W. Ding, C. F. Eick, J. Wang, and X. Yuan, “A framework for regional association rule mining in spatial datasets,” in Proc. IEEE Int. Conf. Data Mining, 2006.

[9] E. Bae, and J. Bailey, “COALA: A novel approach for the extraction of an alternate clustering of high quality and high dissimilarity,” in Proc. 6th Int. Conf. Data Mining, 2006, pp. 56-62.

[10] Y. Zeng, J. Tang, J. Garcia-Frias, and R. G. Gao, “An adaptive meta-clustering approach: combining the information from different clustering results,” in Proc. IEEE Computer Society Conf. Bioinformatics, 2002.

[11] A. Gionis, H. Mannila, and P. Tsaparas, “Clustering aggregation,” in Proc. 21st Int. Conf. Data Engineering, 2005.
