

Combining relations and text in scientific network clustering

David Combe∗, Christine Largeron∗, Elod Egyed-Zsigmond†, Mathias Gery∗

∗Universite de Lyon, F-42023, Saint-Etienne, France
CNRS, UMR 5516, Laboratoire Hubert Curien, F-42000, Saint-Etienne, France
Universite de Saint-Etienne, Jean-Monnet, F-42000, Saint-Etienne, France
Email: {david.combe, christine.largeron, mathias.gery}@univ-st-etienne.fr

†Universite de Lyon, UMR 5205 CNRS, LIRIS
7 av J. Capelle, F-69100 Villeurbanne, France
Email: [email protected]

Abstract—In this paper, we present different combined clustering methods and evaluate their performance and their results on a dataset with ground truth. This dataset, built from several sources, contains a scientific social network in which textual data is associated with each vertex and the classes are known. Indeed, while the clustering task is widely studied both in graph clustering and in unsupervised learning, combined clustering, which simultaneously exploits the relationships between the vertices and the attributes describing them, is quite new. We argue that, depending on the kind of data we have and the type of results we want, the choice of the clustering method is important, and we present some concrete examples to underline this.

I. INTRODUCTION

The goal of graph node clustering, related to community detection within social networks, is to create a partition of the vertices, taking into account the topological structure of the graph, such that the clusters are composed of strongly connected vertices [2], [4], [5], [6]. With the rise of social networks on the Internet, the number of graph clustering methods has grown rapidly. Among the core methods proposed in the literature, we can mention those that optimize a quality function evaluating the goodness of a given partition, such as the modularity, the ratio cut, the min-max cut or the normalized cut [7], [8], [9], [10], [11], hierarchical techniques like divisive algorithms based on the minimum cut [3], spectral methods [12], and the Markov Clustering algorithm and its extensions [13].

These graph clustering techniques are very useful for detecting strongly connected groups in a graph, but most of them focus on the topological structure and ignore the properties of the vertices. Nowadays, various data sources such as social networks, patent documents or biological data can be seen as graphs whose vertices have attributes. With such data, it can be useful to take the vertex properties into account in the clustering process in order to increase the accuracy of the partitions. This is generally not the case in graph clustering, where usually only the relationships of the network are used. On the other hand, there are also unsupervised methods that group objects according to their textual or numeric attributes, like hierarchical clustering or k-means [14], [15], [16]. More precisely, unsupervised learning assigns the objects, represented by attributes, to clusters so that the objects in the same cluster are more similar to each other than to those in other clusters, according to an attribute-based similarity measure.

A new challenge in graph clustering consists in combining structure data, corresponding to the network, and attribute data, describing the vertices. Recently, several works have attempted to tackle this problem of hybrid clustering; we detail the main ones in the next section. However, the combination of several data types raises the problem of the meaning of the clustering. Indeed, the different comparison and distance functions may not be compatible and, consequently, they may lead to contradictory results. Moreover, these results are difficult to evaluate since there is no real benchmark dataset, with structure data and attribute data, suitable for attributed graph clustering evaluation. For this reason, in this work, we have built a dataset with ground truth in order to compare the community of each vertex with its computed cluster. It is a network of scientists, mainly based on publications and participation in scientific events. It includes textual data (publication titles, abstracts, full text, etc.) and relationship data (co-authorship, co-participation in the same event). A clustering method that takes several criteria into account is therefore needed in order to identify, in the network, groups of people who are related and who share the same research field. In order to detect strongly connected clusters containing persons with similar research interests, we propose different methods to partition the graph using both the structural data and the attribute data. Our experiments show that, depending on the weight given to each type of data (textual or structural) and the way they are combined during the clustering, the results can be very different.

The rest of the article is organized as follows. The next section is dedicated to recently introduced graph clustering techniques that consider both attributes and structural information. We define the problem formally in section III, and we propose several approaches that consider simultaneously structure data and attribute data in section IV. Our experimental study to evaluate these approaches is detailed in section V and the results in section VI. Finally, section VII concludes the article.

II. STATE OF THE ART

Among the clustering methods, one can distinguish, on the one hand, the unsupervised learning techniques, also called vector-based clustering, which exploit the attributes describing the objects, like hierarchical clustering or k-means, and, on the other hand, those which consider the relationships between the objects, as is usually the case in graph clustering. Recently, methods which exploit both data types were introduced in order to detect communities in social networks where documents or features are associated with the vertices. For instance, in a pre-processing step, Steinhaeuser and Chawla compute a similarity metric between pairs of vertices, based on the attributes, which is used as a weight for the corresponding edge; afterwards, any graph clustering method can be applied to the weighted graph [17]. This method is similar to the first approach presented in the next section, but their metric requires setting a parameter, which is not the case in our work. Zhou et al. also exploit the attribute data in order to extend the original graph [18], [19]. They add attribute vertices and edges which connect original vertices sharing the same value. A K-Medoids clustering is then applied on a random walk distance computed on the attribute-augmented graph. One limit of this method lies in the fact that it is not suited to continuous attributes. The second method introduced in the next section exploits the same idea; however, the graph is extended in a simpler way, without restriction on the type of attributes considered. In the hierarchical clustering of Li et al., clusters are built under attribute-based constraints [20]. In a first step, cores are detected using only the structure data, and they are then merged according to their attribute similarity. Other works combine the two types of data during the clustering process [21], [22], [23]. Ester et al. treat the question in terms of the "Connected k-Center" (CkC) problem and propose NetScan, an extended version of the k-means algorithm with a constraint of internal connectedness [21], [24]. With this condition, two entities in a cluster are connected by an internal path. In NetScan, as in other partitioning algorithms, the number of clusters must be known, but this requirement has been relaxed in recent work [25]. Recently, other approaches have also been introduced in order to detect dense subgraphs which are also homogeneous with respect to the attributes [26], [27]. Dang et al. have extended Newman's modularity by adding a term that measures the attribute-based similarity between two nodes [23]. In this way, the two types of data are considered simultaneously during the clustering process; however, the clusters may contain unconnected nodes. In a similar way, in the third method proposed in the next section, the two types of data are also considered simultaneously, but they are merged into a global distance used during the clustering, with the guarantee that the vertices in a same cluster are connected. In section IV, we present several approaches which consider simultaneously structure data and attribute data and which offer the advantage of being easy to carry out, while in the next section the problem of attributed graph clustering is defined more formally.

III. PROBLEM STATEMENT

We consider a graph G = (V, E) where V = {v1, ..., vi, ..., v|V|} is the set of vertices and E ⊂ V × V is the set of unlabeled edges. Graph node clustering consists in grouping the vertices into clusters, taking into consideration the edge structure, in such a way that there should be many edges within each cluster and relatively few between the clusters [5], [28]. Even if the case of overlapping community detection, in which a vertex can be assigned to several clusters, has been studied recently [29], in this article we consider the general case where the clustering process consists in partitioning the set V of vertices into r disjoint clusters P = {C1, ..., Cr} such that:

• C1 ∪ C2 ∪ ... ∪ Cr = V
• Ck ∩ Cl = ∅, ∀ 1 ≤ k < l ≤ r
• Ck ≠ ∅, ∀ k ∈ {1, ..., r}

Moreover, we suppose that each vertex vi ∈ V is associated with a document represented by a vector di = (wi1, ..., wij, ..., wiT), where wij is the weight of the term tj in the document di. These documents can be seen as vertex attributes, and G can be defined as an attributed graph [18]. In an attributed graph clustering problem, the structural links and the attributes are both considered, in such a way that:

• firstly, there should be many edges within each cluster and relatively few between the clusters;
• secondly, two vertices belonging to the same cluster are more similar in terms of attributes than two vertices belonging to two different clusters.

Thus, the clusters should be well separated and the vertices belonging to the same cluster should be connected and homogeneous with respect to the attribute data.
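To make this data model concrete, the following minimal sketch (in Python, not taken from the paper) represents such an attributed graph with the networkx library: each vertex carries its document vector di as a node attribute and the edges of E are unlabeled. The author names and vector values are purely illustrative.

import networkx as nx
import numpy as np

G = nx.Graph()
# illustrative tf-idf vectors d_i = (w_i1, ..., w_iT) over a tiny vocabulary (T = 3)
docs = {
    "author_a": np.array([0.7, 0.0, 0.3]),
    "author_b": np.array([0.6, 0.1, 0.2]),
    "author_c": np.array([0.0, 0.9, 0.4]),
}
for author, d in docs.items():
    G.add_node(author, doc=d)          # the document vector is stored as a vertex attribute
G.add_edges_from([("author_a", "author_b"), ("author_b", "author_c")])   # unlabeled edges of E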

IV. ATTRIBUTED GRAPH CLUSTERING APPROACHES

We introduce different approaches to partition the graph using both structural and attribute data. The methods differ in the manner in which the relational and attribute data are combined:

• Structure-based clustering: the vertex attributes are used to weight the edges of the graph, which can then be processed by any weighted graph clustering algorithm (cf. section IV-A);
• Attribute-based clustering: the structural information is used, together with the vertex attribute similarity, to obtain a distance matrix (between each pair of vertices), which can then be processed by any unsupervised clustering algorithm (cf. section IV-B);
• Hybrid clustering: attributes and structure are considered separately in order to compute a distance on each type of data. These distances are then combined into a global distance that can be exploited by any unsupervised clustering algorithm, or used to obtain a weighted graph, which can then be processed by any weighted graph clustering algorithm (cf. section IV-C).



A. Structure-based clustering on attribute weighted graph

In this method, the attributes are used to obtain a weighted graph. We define a textual attribute-based distance disT, for instance the Euclidean distance or the cosine distance, well suited to textual attributes. The value disT(di, dj) is associated with each edge (vi, vj) of E. Then, a graph clustering method able to handle weighted graphs is used to partition the set of vertices V, for example hierarchical algorithms (agglomerative or divisive) or algorithms which optimize quality functions, like the Kernighan-Lin algorithm or algorithms based on the modularity [7].
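A minimal sketch of this first approach, reusing the attributed graph G of the previous sketch, is given below. The cosine similarity (rather than the distance) is attached to each edge here, since modularity optimizers interpret larger weights as stronger ties; the use of networkx's greedy modularity optimizer is an illustrative stand-in for any weighted graph clustering algorithm.

from networkx.algorithms.community import greedy_modularity_communities
from scipy.spatial.distance import cosine

# attribute-based value attached to each existing edge (v_i, v_j) of E
for u, v in G.edges():
    G[u][v]["weight"] = 1.0 - cosine(G.nodes[u]["doc"], G.nodes[v]["doc"])

# any weighted graph clustering method can now be applied to the weighted graph
communities = greedy_modularity_communities(G, weight="weight")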

B. Attribute-based clustering on structural distance

In this method, the structural information is used to define a structure-based distance disS(vi, vj) between each pair of vertices (vi, vj). In practice, the length of the shortest path between vi and vj can be used as disS(vi, vj), where the shortest path between vi and vj is the path with the smallest number of edges. More sophisticated distances, like the neighborhood random walk distance [18], can also be used. The attributes can also be taken into account by associating a value with each edge of E, as explained in the previous section; in this case, the shortest path between vi and vj is the one with the smallest sum of edge weights. Then, any unsupervised learning technique can be applied on the distance matrix.
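A minimal sketch of this second approach, still on the graph G of the previous sketches (edges without a weight attribute count as 1 in the shortest paths), computes the geodesic distance matrix and feeds it to an average-linkage agglomerative clustering; the number of clusters and the fallback value for disconnected pairs are illustrative choices.

import numpy as np
import networkx as nx
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

nodes = list(G.nodes())
sp = dict(nx.shortest_path_length(G, weight="weight"))               # dis_S(v_i, v_j): geodesic distance
D = np.array([[sp[u].get(v, 1e6) for v in nodes] for u in nodes])    # large value for unreachable pairs

Z = linkage(squareform(D, checks=False), method="average")           # agglomerative clustering on the matrix
labels = fcluster(Z, t=4, criterion="maxclust")                      # e.g. cut the dendrogram into 4 clusters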

C. Hybrid clustering

In this method, a global distance disTS(vi, vj) between two vertices vi and vj is defined as a linear combination of two distances corresponding respectively to each type of information:

disTS(vi, vj) = α disT(di, dj) + (1 − α) disS(vi, vj)     (1)

where disT(di, dj) is a distance defined on the attributes, disS(vi, vj) is defined directly on the graph and α is a parameter between 0 and 1.

As previously, the length of a shortest path between vi and vj can be used for disS(vi, vj), and the Euclidean distance or the cosine distance for disT(di, dj). Then, the partition can be built either with a graph clustering algorithm applied on the graph weighted with the global distance, or with an unsupervised learning technique using the global distance.
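A minimal sketch of the global distance of Eq. (1), reusing the text distance of section IV-A and the structural distance matrix and node list of the previous sketch; the value of α and the helper names are illustrative.

import numpy as np
from scipy.spatial.distance import cosine

def dis_TS(G, nodes, D_S, alpha=0.5):
    """Global distance matrix: dis_TS = alpha * dis_T + (1 - alpha) * dis_S."""
    n = len(nodes)
    D_global = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d_text = cosine(G.nodes[nodes[i]]["doc"], G.nodes[nodes[j]]["doc"])   # dis_T(d_i, d_j)
            d = alpha * d_text + (1.0 - alpha) * D_S[i, j]                        # Eq. (1)
            D_global[i, j] = D_global[j, i] = d
    return D_global

The resulting matrix can then be handed to the same average-linkage clustering as above, or used to re-weight the edges of the graph before applying a weighted graph clustering algorithm.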

V. EXPERIMENTAL STUDY

We performed experiments to evaluate the different methods presented previously. While there are some benchmark datasets suitable for community detection evaluation, based on networks with ground truth, as far as we know no such benchmark dataset, with structure data and attribute data, exists that is suitable for attributed graph clustering evaluation. In a context where the community of each actor is unknown, one can use the measures used either for community evaluation, like the modularity, or for cluster evaluation, like the sum of squared errors. However, it is clear that these evaluation measures are linked to the corresponding clustering strategy. To avoid this bias, we have built a dataset with a ground truth in order to compare the community of each vertex with its cluster. We used the accuracy as evaluation measure. This dataset is presented in the following paragraphs.

TABLE I
NUMBER OF AUTHORS PER SESSION AND CONFERENCE

Session             Conference   Size (authors)
A  Bioinformatics   SAC          24
B  Robotics         SAC          16
C  Robotics         IJCAI        38
D  Constraint       IJCAI        21
Total                            99

A. Network data

In order to build an attributed graph with a ground truth, we concentrated on two conferences: SAC 2009 and IJCAI 2009. A co-participation network was generated from the well-known DBLP¹ dataset, and the abstracts, titles and research areas were extracted from the websites of the selected conferences².

1) Authors and research areas: Three research areas, corresponding to conference sessions, were selected: Robotics, Bioinformatics and Constraint Programming. Both conferences have a Robotics session, while only SAC 2009 has a session on Bioinformatics and only IJCAI 2009 has one on Constraint Programming. As shown in Table I, there are 24 authors in the first research area (Bioinformatics), 16 + 38 = 54 in the second one (Robotics) and 21 in the last one (Constraint Programming). Each of these authors corresponds to one vertex of V, and its research area membership is used during the evaluation step.

The abstracts and the titles of the articles published by the authors at IJCAI 2009 and SAC 2009 are represented in the vector space model introduced by Salton et al. [30]. After preprocessing the text with stemming and stopword removal, an attribute vector di, whose components are computed with the tf-idf formula, is attached to each author of V.
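A minimal sketch of this representation step with scikit-learn's TfidfVectorizer is shown below; the author texts are illustrative, stopword removal is handled by the vectorizer, and stemming (used in the paper) would require an additional step, e.g. with NLTK, omitted here for brevity.

from sklearn.feature_extraction.text import TfidfVectorizer

# one concatenated string of titles and abstracts per author (illustrative content)
author_texts = {
    "author_a": "robot motion planning for autonomous navigation",
    "author_b": "constraint propagation for satisfaction problems",
}
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(author_texts.values())   # one tf-idf row vector d_i per author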

2) Social Network: We consider as an event e any journal or conference referenced in DBLP between 2007 and 2009. A co-participation network is built on the set V, using the DBLP database, as follows: let vi and vj be two authors belonging to V; if there exists at least one event e such that vi and vj are authors of articles published in e (even if they are not co-authors), then (vi, vj) ∈ E.
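A minimal sketch of this construction rule, starting from a hypothetical author-to-events mapping (the real one comes from the DBLP extraction):

from itertools import combinations
import networkx as nx

events_by_author = {                      # illustrative mapping, not the actual DBLP data
    "author_a": {"SAC 2009", "ICRA 2008"},
    "author_b": {"SAC 2009"},
    "author_c": {"IJCAI 2009"},
}
G = nx.Graph()
G.add_nodes_from(events_by_author)
for u, v in combinations(events_by_author, 2):
    if events_by_author[u] & events_by_author[v]:   # at least one common event
        G.add_edge(u, v)                            # then (v_i, v_j) ∈ E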

3) Graph: We obtain the attributed graph G = (V, E), whose vertices correspond to the authors and whose edges are given by the co-participation relations. Moreover, each vertex (i.e. author) is described by textual attributes corresponding to the tf-idf vector associated with his articles, and its true class is the research area (i.e. session A, B, C or D in SAC 2009 or IJCAI 2009) of this author.

¹ http://www.informatik.uni-trier.de/~ley/db/
² The dataset is available at http://labh-curien.univ-st-etienne.fr/~combe/datasets/asonam/datasetSessionsRecognition.zip



TABLE II
TEXT-BASED CLUSTERING USING AVERAGE LINKAGE (MODEL T)

(a) Model T evaluated against PT (3 research areas)

                                          1    2    3   Total
A  SAC 2009 - Bioinformatics             11        13      24
B  SAC 2009 - Intelligent robotic syst.            16      16
C  IJCAI 2009 - Robotics and Vision                38      38
D  IJCAI 2009 - Constraints                   21            21
Total                                    11   21   67      99

(b) Model T evaluated against PTS (4 sessions)

                                          1    2    3    4   Total
A  SAC 2009 - Bioinformatics             11             13      24
B  SAC 2009 - Intelligent robotic syst.             2   14      16
C  IJCAI 2009 - Robotics and Vision                 4   34      38
D  IJCAI 2009 - Constraints                   21                 21
Total                                    11   21    6   61      99

B. Hypotheses

We enumerate here our clustering scenarios and hypotheses and present the expected results. We consider 4 vertex subsets, given by the authors publishing in the 4 extracted sessions:

• A: Bioinformatics (SAC),
• B: Robotics (SAC),
• C: Robotics (IJCAI),
• D: Constraint Programming (IJCAI).

1) Text: 3 research areas / 3 clusters (PT): Considering only the textual vertex attributes, the hypothesis underlying our experiments is that this information should make it possible to retrieve the three research areas, Robotics, Bioinformatics and Constraint Programming, giving the partition into three clusters containing the authors of the three research areas: PT = {A, B ∪ C, D}.

2) Structure: 2 conferences / 2 clusters (PS): On the other hand, we suppose that taking into account only the structural data should allow us to identify two groups corresponding to the authors participating in each conference, SAC 2009 and IJCAI 2009, which define the partition into two clusters PS = {A ∪ B, C ∪ D}.

3) Text and structure: 4 sessions / 4 clusters (PTS): However, if we want to discover each session separately, both textual and structural information have to be used. In this case the partition will be into four clusters PTS = {A, B, C, D}.

C. Evaluation

The different strategies were evaluated using the accuracy of the obtained clusters, compared to the ground truth considered: research areas (PT), conferences (PS) or sessions (PTS).
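The paper does not spell out how clusters are matched to ground-truth classes; the sketch below assumes a one-to-one assignment that maximizes the number of correctly grouped authors (Hungarian algorithm), which is consistent with the percentages reported in the next section but remains an assumption.

import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(cluster_labels, true_labels):
    """Accuracy under the best one-to-one matching of clusters to ground-truth classes."""
    clusters = sorted(set(cluster_labels))
    classes = sorted(set(true_labels))
    C = np.zeros((len(clusters), len(classes)), dtype=int)   # contingency matrix
    for c, t in zip(cluster_labels, true_labels):
        C[clusters.index(c), classes.index(t)] += 1
    rows, cols = linear_sum_assignment(-C)                    # maximize matched authors
    return C[rows, cols].sum() / len(true_labels)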

VI. EVALUATED STRATEGIES AND RESULTS

In order to check these hypotheses, we evaluate several models combining text and structure (models TS1, TS2, TS3), corresponding to the different approaches detailed in Section IV. We compare our models against two baselines: clustering based on text only (model T) and clustering based on structure only (model S).

TABLE III
STRUCTURE-BASED CLUSTERING (MODEL S)

(a) Model S evaluated against PS (2 conferences)

                                          1    2   Total
A  SAC 2009 - Bioinformatics             24          24
B  SAC 2009 - Intelligent robotic syst.  16          16
C  IJCAI 2009 - Robotics and Vision           38     38
D  IJCAI 2009 - Constraints                   21     21
Total                                    40   59     99

(b) Model S evaluated against PTS (4 sessions)

                                          1    2    3    4   Total
A  SAC 2009 - Bioinformatics             24                     24
B  SAC 2009 - Intelligent robotic syst.  16                     16
C  IJCAI 2009 - Robotics and Vision           38                38
D  IJCAI 2009 - Constraints                   21                21
Total                                    40   59    0    0      99

A. Text-only based clustering: model T

Textual clustering considers only the attribute data, i.e. the documents {di, ∀ vi ∈ V}. This text-based clustering (model T) was first performed with the Euclidean distance as well as with the cosine distance, both computed on the tf-idf representation, together with the bisecting K-means algorithm [31]. Then model T was performed with the cosine distance, still computed on the tf-idf representation, and with the average linkage algorithm. As the latter strategy gives better results, it is the only one presented here, as a baseline for our experiments. Consequently, we have also used the average linkage algorithm in all the attribute-based models.
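A minimal sketch of this baseline, assuming the tf-idf matrix X from the earlier sketch: cosine distances between the author vectors, average-linkage hierarchical clustering, and a cut of the dendrogram into three clusters to be compared with PT.

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

D_text = pdist(X.toarray(), metric="cosine")         # condensed cosine distance matrix
Z = linkage(D_text, method="average")                # average linkage
labels_T = fcluster(Z, t=3, criterion="maxclust")    # 3 clusters, to be compared with P_T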

Results obtained using only textual information are presented in Table IIa for the partition into three clusters, which should be compared to PT. This is an accuracy matrix: the columns of the table contain the number of authors assigned to each cluster. Here the method clustered 11 authors in the first cluster, 21 in the second and 67 in the third. The rows contain the ground truth PT = {A, B ∪ C, D}, obtained by merging the second and third rows. Looking at the results, we can remark that 13 authors were assigned to the third cluster, which mainly contains people publishing in Robotics, while according to our ground truth they belong to the Bioinformatics community. Table IIb contains the results for four clusters, compared to PTS.

As expected, the accuracy is higher for the partition into three clusters PT ((11 + 16 + 38 + 21)/99 × 100 = 87%) than for the partition into four clusters PTS (69%). This result confirms our hypothesis according to which the textual data allows us to identify the different research areas but fails to correctly detect the four sessions.

B. Structure-only based clustering: model S

Model S only exploits the structural data (i.e. the graph G = (V, E)), using the algorithm of Blondel et al. [32]. This extension of Newman and Girvan's algorithm [33], well known for its capacity to handle large graphs, is a greedy method which optimizes the "modularity" of the partitions built on the network. Applied directly on the graph G = (V, E), this algorithm provides a bipartition which is exactly the ground truth PS = {A ∪ B, C ∪ D}, as shown in Table IIIa. Thus, the identification of the two conferences using structural data is perfectly achieved. However, the accuracy is only 63% if we consider the four sessions as the ground truth (PTS), see Table IIIb.
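As a sketch, model S amounts to running the Louvain method on the unweighted co-participation graph built earlier; networkx (version 2.8 or later) provides an implementation of Blondel et al.'s algorithm.

from networkx.algorithms.community import louvain_communities

partition_S = louvain_communities(G, seed=0)   # list of vertex sets, expected here to recover the 2 conferences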



TABLE IV
STRUCTURE-BASED CLUSTERING ON ATTRIBUTE WEIGHTED GRAPH: MODEL TS1

                                      1    2    3    4    5   Total
A  SAC - Bioinformatics                   13        11          24
B  SAC - Intelligent robotic syst.        11              5     16
C  IJCAI - Robotics and Vision                 38               38
D  IJCAI - Constraints               15         6               21
Total                                15   24   44   11    5     99

TABLE V
ATTRIBUTE-BASED CLUSTERING ON STRUCTURAL DISTANCE: MODEL TS2 (AVERAGE LINK)

                                          1    2    3    4   Total
A  SAC 2009 - Bioinformatics                  24                24
B  SAC 2009 - Intelligent robotic syst.   4   11    1           16
C  IJCAI 2009 - Robotics and Vision                 1   37      38
D  IJCAI 2009 - Constraints                         7   14      21
Total                                     4   35    9   51      99


C. Structure-based clustering on attribute weighted graph: model TS1

In this strategy, corresponding to the approach presented in Section IV-A, the cosine distance computed on the tf-idf vectors is associated with each edge in order to obtain a weighted graph. Then, this graph is partitioned with the method of Blondel et al. As we can note in Table IV, taking into account both structural and attribute data improves the accuracy, which reaches 76% for the partition into four clusters (PTS), whereas it is only 69% without attribute data. This result confirms our hypothesis according to which the two types of information are useful to identify the four sessions (PTS).

D. Attribute-based clustering on structural distance: model TS2

As previously, the cosine distance computed on the tf-idf vectors is associated with each edge in order to obtain a weighted graph. Then, the geodesic distance between two vertices is defined as the smallest sum of the weights of the edges on a path between these vertices. Finally, a hierarchical agglomerative clustering is applied on the geodesic distance matrix, using the usual distances between clusters: single link, complete link, average link and center of gravity.

Table V presents the results obtained with the average link. With a classification accuracy of 73% for the partition into four clusters (PTS), the results are similar to those obtained with the modularity-based algorithm and higher than those obtained using only one type of information (textual or structural). Besides the average link, we have also experimented with the single link, the complete link and the center of gravity; the results (not presented here) are the same for each method.

TABLE VI
HYBRID CLUSTERING: MODEL TS3

                                          1    2    3    4   Total
A  SAC 2009 - Bioinformatics             11             13      24
B  SAC 2009 - Intelligent robotic syst.             2   14      16
C  IJCAI 2009 - Robotics and Vision                 4   34      38
D  IJCAI 2009 - Constraints                   21                21
Total                                    11   21    6   61      99

TABLE VII
RESULTS SYNTHESIS: MODELS T, S, TS1, TS2 AND TS3

          Accuracy considering:
Model     PT      PS      PTS
T         87%     -       69%
S         -       100%    63%
TS1       -       -       76%
TS2       -       -       73%
TS3       -       -       47-69%


E. Hybrid clustering: model TS3

In this approach, a global distance is defined as a linear combination of two distances, each corresponding to one type of data: the cosine distance on the textual information and the geodesic distance on the graph G. Then a hierarchical agglomerative clustering is applied on the global distance matrix. This strategy corresponds to the hybrid clustering presented in Section IV-C.

Even if this method appears to be a simple solution for exploiting the two types of data simultaneously, it is not so easy to use since it requires setting the parameter α of the linear combination. Moreover, in our experiments, the accuracy for the partition into four clusters (PTS) varies with α between 47% (for α set to 0.85 or 0.96) and 69% (for α set to 1), as shown in Table VI. Thus, the best accuracy corresponds to the one obtained with a text-based clustering, and it is not as good as those obtained with the previous methods combining structural data and attribute data.

F. Results synthesis

The results obtained by the models T, S, TS1, TS2 and TS3 are summarized in Table VII.

VII. CONCLUSION AND FUTURE WORK

As presented in the previous sections, we obtain very different results depending on the clustering method combination and on the data taken into account when partitioning an attributed graph.

In this study we have sought to point out the difficulties of choosing the right clustering method. We have built a dataset from real-world data containing enough nodes for clustering algorithms to give fine-grained results, while still having precise, measurable clusters according to several different ways of creating partitions. Having this ground truth, we have been able to evaluate a series of clustering methods and compare their results. In our experiments, textual attribute-based clustering retrieves the research interests quite well and structure-based clustering taking into account the co-participation information recovers the conferences perfectly, but both the structural and the attribute information are needed to retrieve the four sessions, i.e. the participants in one conference who share a common interest.

We have carried out three different scenarios to combine this information in a common clustering. In our case, the linear combination corresponding to the hybrid clustering deteriorates the results. In addition, the linear combination method is difficult to apply: it needs a weight parameter to specify the relative importance of each type of information. The other two scenarios, starting from the structural and the textual data, give better results than the linear combination.

We have also shown that good clustering results can be obtained using simple methods, when the scenario is adapted to the data and precise criteria characterizing a good cluster are available.

We intend to study more deeply the usage, interest and characteristics of high-level multi-criteria clustering methods in order to provide precise criteria for choosing a clustering scenario. We are also working on more real-world examples and datasets that help to choose quickly the best clustering scenario for a given dataset.

ACKNOWLEDGMENT

This work was partially supported by St-Etienne Metropole (http://www.agglo-st-etienne.fr/) and the Region Rhone-Alpes.

REFERENCES

[1] P.-O. Fjallstrom, "Algorithms for graph partitioning: A survey," Science, vol. 3, no. 10, 1998.
[2] U. Brandes, M. Gaertler, and D. Wagner, "Experiments on graph clustering algorithms," in 11th Europ. Symp. Algorithms. Springer-Verlag, 2003, pp. 568-579.
[3] G. Flake, R. Tarjan, and K. Tsioutsiouliklis, "Graph clustering and minimum cut trees," Internet Mathematics, vol. 1, no. 4, pp. 385-408, 2003.
[4] M. Newman, "Detecting community structure in networks," The European Physical Journal B - Condensed Matter and Complex Systems, vol. 38, no. 2, pp. 321-330, 2004.
[5] S. Schaeffer, "Graph clustering," Computer Science Review, vol. 1, no. 1, pp. 27-64, 2007.
[6] A. Lancichinetti and S. Fortunato, "Community detection algorithms: a comparative analysis," Physical Review E, vol. 80, no. 5, p. 056117, 2009.
[7] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs," Bell System Technical Journal, vol. 49, no. 2, pp. 291-307, 1970.
[8] P. K. Chan, M. D. F. Schlag, and J. Y. Zien, "Spectral K-way ratio-cut partitioning and clustering," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 13, no. 9, pp. 1088-1096, 1994.
[9] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, 2000.
[10] C. Ding, X. He, H. Zha, and M. Gu, "A min-max cut algorithm for graph partitioning and data clustering," in Proceedings of the IEEE International Conference on Data Mining, 2001, pp. 107-114.
[11] M. Newman and M. Girvan, "Finding and evaluating community structure in networks," Physical Review E, vol. 69, no. 2, pp. 1-16, 2004.
[12] U. Von Luxburg, "A tutorial on spectral clustering," Statistics and Computing, vol. 17, no. 4, pp. 395-416, 2007.
[13] V. Satuluri and S. Parthasarathy, "Scalable graph clustering using stochastic flows: applications to community discovery," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 737-746.
[14] J. Ward, "Hierarchical grouping to optimize an objective function," Journal of the American Statistical Association, vol. 58, no. 301, pp. 236-244, 1963.
[15] A. K. Jain and R. C. Dubes, "Algorithms for clustering data," in Prentice Hall Advanced Reference Series, A. K. Jain and R. C. Dubes, Eds. Prentice Hall, Inc., 1988.
[16] A. Gordon, Classification, 2nd Edition. Chapman & Hall, 2000.
[17] K. Steinhaeuser and N. Chawla, "Community detection in a large real-world social network," Social Computing, Behavioral Modeling, and Prediction, pp. 168-175, 2008.
[18] Y. Zhou, H. Cheng, and J. Yu, "Graph clustering based on structural/attribute similarities," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 718-729, 2009.
[19] Y. Zhou, H. Cheng, and J. X. Yu, "Clustering large attributed graphs: An efficient incremental approach," in 2010 IEEE International Conference on Data Mining, Dec. 2010, pp. 689-698.
[20] H. Li, Z. Nie, W.-C. W. Lee, C. L. Giles, and J.-R. Wen, "Scalable community discovery on textual data with relations," in Proceedings of the 17th ACM Conference on Information and Knowledge Management, 2008, pp. 1203-1212.
[21] M. Ester, R. Ge, B. J. Gao, Z. Hu, and B. Ben-Moshe, "Joint cluster analysis of attribute data and relationship data: the Connected k-Center problem," in SIAM International Conference on Data Mining. ACM Press, 2006, pp. 25-46.
[22] F. Moser, "Data mining for feature vector networks," Ph.D. dissertation, Simon Fraser University, 2009.
[23] T. A. Dang and E. Viennet, "Community detection based on structural and attribute similarities," in International Conference on Digital Society (ICDS), 2012, pp. 7-12.
[24] R. Ge, M. Ester, B. J. Gao, Z. Hu, B. Bhattacharya, and B. Ben-Moshe, "Joint cluster analysis of attribute data and relationship data," ACM Transactions on Knowledge Discovery from Data, vol. 2, no. 2, pp. 1-35, 2008.
[25] F. Moser, R. Ge, and M. Ester, "Joint cluster analysis of attribute and relationship data without a-priori specification of the number of clusters," in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Aug. 2007, p. 510.
[26] S. Gunnemann, I. Farber, B. Boden, and T. Seidl, "Subspace clustering meets dense subgraph mining: A synthesis of two paradigms," in Proceedings of the IEEE International Conference on Data Mining, 2010, pp. 845-850.
[27] S. Gunnemann, B. Boden, and T. Seidl, "DB-CSC: a density-based approach for subspace clustering in graphs with feature vectors," Machine Learning and Knowledge Discovery in Databases, pp. 565-580, 2011.
[28] M. Girvan and M. Newman, "Community structure in social and biological networks," Proceedings of the National Academy of Sciences, vol. 99, no. 12, p. 7821, 2002.
[29] W. Qinna and E. Fleury, "Detecting overlapping communities in graphs," in European Conference on Complex Systems, 2009.
[30] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[31] M. Steinbach, G. Karypis, and V. Kumar, "A comparison of document clustering techniques," in KDD Workshop on Text Mining, vol. 400, no. X, 2000, pp. 525-526.
[32] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, "Fast unfolding of communities in large networks," Journal of Statistical Mechanics: Theory and Experiment, 2008.
[33] M. E. J. Newman, "Fast algorithm for detecting community structure in networks," Physics, vol. 69, no. 2, pp. 1-5, 2004.
