
Detection of Overlapping Communities in Social Tagging Systems

A thesis submitted for partial fulfilment of the requirements for the degree of

Master of Technology

by

Abhijnan Chakraborty

10CS60R03

Under the Guidance of

Prof. Niloy Ganguly

Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
India

April, 2012

“Man is by nature a social animal; an individual who is unsocial naturally and not accidentally is either beneath our notice or more than human. Society is something that precedes the individual. Anyone who either cannot lead the common life or is so self-sufficient as not to need to, and therefore does not partake of society, is either a beast or a god.”

– Aristotle

Dedicated to My Parents

Certificate

This is to certify that the thesis entitled ‘Detection of Overlapping Communities in Social Tagging Systems’ submitted by Abhijnan Chakraborty, Roll – 10CS60R03, Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, for partial fulfilment of the requirements for the degree of Master of Technology in Computer Science and Engineering, is a bona fide record of the work and investigations carried out by him under my supervision and guidance.

Prof. Niloy Ganguly
Dept. of Computer Science & Engg.
Indian Institute of Technology
Kharagpur – 721302, India


Acknowledgements

While the rest of the thesis is meant to convey the technical work done, this is the only place to take the liberty of expressing personal gratitude. Especially after working on online ‘social’ systems, I do not want to undermine the very basis of such studies – “Man is a social animal”. No one can even survive, let alone write a thesis, without countless instances of direct and indirect help from others.

First and foremost, I would like to thank my research advisor, Prof. Niloy Ganguly, for his advice and support during the work. He gave me the freedom to pursue my ideas and work at my own pace, and was always available to discuss various problems on the way. I enjoyed spending the last one year with him both at work and otherwise. His attitude towards students and the countless hours of discussions on different issues have changed me in many ways.

Special thanks to Saptarshi Ghosh, a research scholar in the department, who closely followed this work. I am highly indebted to him for clarifying my doubts and for providing suggestions and criticism on my work. All the members of CNeRG (Complex Networks Research Group) have extended personal and professional help in times of need.

I was lucky enough to have some outstanding teachers in my school days, especially Mr. Mukunda Lal Pal and Mr. Dipankar Sen, who were always behind me in every tough situation and never lost their belief and confidence in me. Words are not enough to express my gratitude to them. I also thank all my teachers at Jadavpur University, who introduced me to the exciting field of computer science.

I want to take this opportunity to thank all of my friends for reminding me that there are many other important things in life than studying. My university friends Sandip, Abhirup, Sourav, Utsab and folks from ‘Amar Bangla Mess’ – Sudipta, Ambarish, Apurba, Arijit, Debabrata, Souvik, Kaushik, Dhruba have already become a part of my life. My childhood buddies – Anwesha, Pinaki, Prithwiraj, Jyotirban, Soumya are kind enough not to expect explicit acknowledgements from me. My life wouldn’t have been complete without them.

Last but not the least, I would like to thank Maa, Baba, Dida, Didivai, Masimoni, Mesomoni, Valomasi, Valomeso, Papluda, Pappanda, Ashokmama, Benumama for their constant support, love and encouragement. Their selfless guidance has helped me to find my path in this beautiful journey called ‘life’.

Abhijnan Chakraborty


Abstract

Some of the most popular sites on the Web today are social tagging systems or folksonomies (e.g. Delicious, Flickr, LastFm etc.) where users share resources and collaboratively annotate them with tags, which help in the search, personalized recommendation and organization of the resources.

Folksonomies are modelled as tripartite (user-resource-tag) hypergraphs in order to study their network properties, and detecting communities of similar nodes from such networks is a challenging and well-studied problem. However, most existing algorithms for community detection in folksonomies assign unique communities to nodes, whereas in reality, nodes in folksonomies are associated with multiple overlapping communities – users have multiple topical interests, and the same resource is often tagged with semantically different tags. The few attempts to detect overlapping communities work on projections of the hypergraph, which results in significant loss of the information contained in the original tripartite structure.

In this work, we propose the first algorithm to detect overlapping communities in folksonomies using the complete hypergraph structure. Our algorithm converts a hypergraph into its corresponding weighted line graph, using measures of hyperedge similarity, whereby any community detection algorithm on unipartite graphs can be used to produce overlapping communities in the folksonomy. Through extensive experiments on synthetic as well as real folksonomy data, we demonstrate that the proposed algorithm can detect better community structures as compared to existing state-of-the-art algorithms for folksonomies.


Contents

Abstract

1 Introduction
    1.1 Folksonomy as Hypergraph
    1.2 Existence of Overlapping Communities
    1.3 Identifying Overlapping Communities
    1.4 Link Clustering
    1.5 Organization of the Thesis

2 Related Work
    2.1 Community Detection in Graphs
    2.2 Detecting Overlapping Communities in Graphs
    2.3 Community Detection in Hypergraphs
    2.4 Overlapping Community Detection in Folksonomies

3 Our Proposed Algorithm
    3.1 Basic Idea of the Algorithm
    3.2 Calculating Similarity Between Hyperedges
        3.2.1 Expressing Hyperedges as Vectors
        3.2.2 Considering Vertex Neighbourhoods
        3.2.3 Choosing the Best Similarity Metric
    3.3 Detecting Communities in Line Graph
        3.3.1 Hierarchical Clustering
        3.3.2 Fast Modularity Optimization
        3.3.3 Louvain Method
        3.3.4 Infomap
        3.3.5 Choosing the Best Community Detection Method
    3.4 Time Complexity of Our Proposed Algorithm

4 Experiments and Evaluation
    4.1 Generation of Synthetic Hypergraphs
    4.2 Normalized Mutual Information (NMI)
    4.3 Comparison between Different Choices of OHC
    4.4 Comparing OHC with Other Algorithms
        4.4.1 Performance w.r.t. Number of Hyperedges
        4.4.2 Performance in Presence of Scattered Hyperedges
        4.4.3 Performance w.r.t. Fraction of Nodes in Multiple Communities
        4.4.4 Performance w.r.t. Size of Real Community

5 Experiments on Real World Folksonomies
    5.1 Overlapping Communities in Folksonomies
    5.2 Evaluation of Communities Detected
        5.2.1 Comparison of Conductance Value
        5.2.2 Comparing Detected User Communities with Social Network

6 Conclusion

Bibliography

A Publications from the Thesis


List of Figures

1.1 A toy example of Tripartite Hypergraph. Three types of nodes are graphically represented as Blue Circles, Orange Rectangles and Black Diamonds respectively. Each triangle created by connecting these three types of nodes is a hyperedge.

1.2 Example of Overlapping Community Structure

1.3 Necessity of considering both resources as well as tags to identify users having similar interests

3.1 Neighbourhood of two adjacent hyperedges

4.1 An example synthetic hypergraph. There are two communities – blue and green. Violet nodes belong to both the communities.

4.2 Comparison of NMI values for different similarity metrics with varying hyperedge density

4.3 Comparison of NMI values for different community detection algorithms with varying hyperedge density

4.4 Variation of NMI values with varying hyperedge density when 10% nodes belong to multiple communities

4.5 Variation of NMI values with varying hyperedge density in presence of scattered hyperedges

4.6 Variation of NMI values with varying fraction of nodes in multiple communities keeping hyperedge density constant at 0.2

4.7 Comparison of NMI values with varying number of real communities

5.1 Cumulative distribution of the fraction of communities which overlap with a given number (x) of other communities; main figure – LastFm, inset – MovieLens

5.2 Cumulative distribution of conductance values of tag communities obtained from the real-world folksonomies: LastFm (main plot), Delicious and MovieLens (both inset) for OHC and HGC.

5.3 Community structure detected by OHC and CL algorithm with the social network in LastFm

5.4 Community structure detected by OHC and CL algorithm with the social network in Delicious


List of Tables

5.1 Statistics of Real Folksonomy Datasets

5.2 Examples of communities detected by proposed OHC algorithm. The algorithm successfully clusters nodes which are related to a common semantic theme (see Column 2). Nodes related to multiple themes (boldfaced and italicized) are placed in overlapping communities.


Chapter 1

Introduction

A number of the most popular sites on the Web today are online social systems where users form social relationships with one another and generate and share various forms of content. Among these social systems, some are specifically designed for content sharing. Websites of this type are known as Social Tagging Systems. Here, users share content or resources, and collaboratively annotate resources with descriptive keywords (known as ‘tags’) in order to facilitate search and retrieval of interesting resources. Examples of such websites include Delicious (http://www.delicious.com), Flickr (http://www.flickr.com), LastFm (http://www.last.fm), MovieLens (http://www.movielens.org), Bibsonomy (http://www.bibsonomy.org) etc.

Thomas Vander Wal coined the term Folksonomy¹ to describe social tagging systems. The word ‘Folksonomy’ is a combination of two words – ‘folk’ and ‘taxonomy’. In such systems, ways of classification and categorization evolve through the practice of collaboratively creating and managing tags. For this reason, folksonomies are also known as Collaborative Tagging Systems. In this work, we use the terms ‘Social Tagging System’ and ‘Folksonomy’ interchangeably.

With the growing popularity of social media sites in today’s Web, a tremendous amount of resources is being uploaded to the popular folksonomies; consequently, it has become practically impossible for users to discover, on their own, interesting resources and people having common interests. Hence it is important to develop algorithms for personalized search [1] and for recommendation of resources [2] and potential friends to the users. One approach to these tasks is to group the entities (resources, tags, users) into communities or clusters, which are typically thought of as groups of entities having more or better interactions among themselves than with entities outside the group.

For detecting communities as well as studying other network properties, folksonomies are modelled in the literature [3–5] as tripartite hypergraphs.

1.1 Folksonomy as Hypergraph

The hypergraph model of a folksonomy includes user, resource and tag nodes, where a hyperedge (u, t, r) indicates that user u has assigned tag t to resource r. Figure 1.1 shows a toy example of a tripartite hypergraph.
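The model above can be sketched in a few lines of Python. This is only an illustration, assuming the folksonomy is available as a list of (user, tag, resource) assignments; the toy data and the function name `partite_sets` are hypothetical and not taken from the datasets used in this thesis.

```python
from typing import Iterable, Set, Tuple

Hyperedge = Tuple[str, str, str]  # (user, tag, resource)


def partite_sets(hyperedges: Iterable[Hyperedge]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Split the nodes of a tripartite hypergraph into users, tags and resources."""
    users, tags, resources = set(), set(), set()
    for u, t, r in hyperedges:
        users.add(u)
        tags.add(t)
        resources.add(r)
    return users, tags, resources


# Hypothetical toy data: each triple is one tag assignment, i.e. one hyperedge.
toy_folksonomy = [
    ("alice", "flower", "photo1"),
    ("bob", "yellow", "photo1"),
    ("alice", "rock", "song7"),
]
users, tags, resources = partite_sets(toy_folksonomy)
```

Storing the hyperedges as plain triples keeps the full user-tag-resource association, which is exactly the information that a projection onto two of the three node types would discard.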

¹ http://vanderwal.net/folksonomy.html


Figure 1.1: A toy example of Tripartite Hypergraph. Three types of nodes are graphically represented as Blue Circles, Orange Rectangles and Black Diamonds respectively. Each triangle created by connecting these three types of nodes is a hyperedge.

Detecting communities in such hypergraphs is a challenging problem. Solving it helps not only in efficient search and recommendation of resources or friends to users, but also in the organization of the vast amount of resources present in folksonomies into semantic categories.

1.2 Existence of Overlapping Communities

Several algorithms have been proposed for detecting communities in hypergraphs [4,6–9] (details in Chapter 2). However, almost none of the prior approaches consider an important aspect of the problem: they assign a single community to each node, whereas in reality, nodes in folksonomies frequently belong to multiple overlapping communities. For instance, users have multiple topics of interest, and thus link to resources and tags of many different semantic categories. Similarly, the same resource is frequently associated with semantically different tags by users who appreciate different aspects of the resource.

As a motivating example, consider a popular photo of a daffodil in Flickr (Figure 1.2). Since many users are likely to tag the photo with ‘flower’ (or ‘daffodil’), as compared to few users using the tag ‘yellow’, algorithms assigning single communities to nodes would place this photo in the community related to flowers (or daffodils). Community-based recommendation schemes, which recommend resources to users based on common memberships in communities, would thus overlook the fact that this photo is an excellent candidate for recommendation to a user who favours tagging objects that are yellow-coloured (e.g. photos of yellow cars, sunsets etc.). On the other hand, an algorithm detecting multiple overlapping communities would place the photo in both communities, the one related to flowers and the one related to the colour ‘yellow’, and thus raise the chances that this popular photo is recommended to the above-mentioned user.


Figure 1.2: Example of Overlapping Community Structure

1.3 Identifying Overlapping Communities

To the best of our knowledge, only two studies have addressed the problem of identifying overlapping communities in folksonomies.

1. Wang et al. [10] proposed an algorithm to detect overlapping communities of users in folksonomies considering only the user-tag relationships (i.e. the user-tag bipartite projection of the hypergraph), and

2. Papadopoulos et al. [5] detected overlapping tag communities by taking a projection of the hypergraph onto the set of tags.

Taking projections (as used by both these approaches) results in loss of some of the information contained in the original tripartite network, and the quality of the communities obtained from projected networks is known to be worse than that of the communities obtained from the original network [11].

Further, neither of these algorithms considers the resource nodes in the hypergraph. However, it is necessary to detect overlapping communities of users, resources and tags simultaneously for personalized recommendation of resources to users. Additionally, it is better to consider common resources as well as common tags in order to identify users having similar interests (i.e. potential friends).

[Plot: fraction of friendship links (y-axis) against number of shared items (x-axis), with one curve each for ‘Only Resources’, ‘Only Tags’ and ‘Resources or Tags’]

Figure 1.3: Necessity of considering both resources as well as tags to identify users having similar interests

To demonstrate this, we give here a motivating statistic from the real data of the LastFm folksonomy², which also allows users to create a social network among themselves. Users who are linked in the social network (i.e. friends) usually have common tastes (a property known as homophily [12]), and hence can be expected to have similar tagging behaviour in the folksonomy as well.

Figure 1.3 plots the fraction of friends (i.e. user-pairs who are linked in the social network) who share k items in the folksonomy for different values of k, where the shared items are

1. only resources

2. only tags

3. resources or tags.

The curve for case 3 consistently has higher values than the curves for cases 1 and 2, which shows the necessity of considering both resources and tags while identifying communities in folksonomies; without this, some of the potential friendship relations cannot be identified.
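A statistic of this kind can be computed along the following lines. This is a rough sketch on hypothetical toy data, not the script used for Figure 1.3: it assumes that "share k items" means exactly k common items, that resource and tag identifiers do not collide, and the function name `shared_item_distribution` is illustrative.

```python
from collections import Counter


def shared_item_distribution(friend_pairs, items_of):
    """Fraction of friend pairs sharing exactly k items, for each observed k."""
    counts = Counter(len(items_of[u] & items_of[v]) for u, v in friend_pairs)
    total = len(friend_pairs)
    return {k: n / total for k, n in sorted(counts.items())}


# Hypothetical toy data (not from LastFm).
resources_of = {"u1": {"r1", "r2"}, "u2": {"r2"}, "u3": {"r1", "r2"}}
tags_of = {"u1": {"t1"}, "u2": {"t1"}, "u3": {"t2"}}
# Case 3 ("resources or tags"): pool both kinds of items per user.
either_of = {u: resources_of[u] | tags_of[u] for u in resources_of}
friends = [("u1", "u2"), ("u1", "u3")]

by_resources = shared_item_distribution(friends, resources_of)
by_tags = shared_item_distribution(friends, tags_of)
by_either = shared_item_distribution(friends, either_of)
```

In this toy case the pair (u1, u2) shares one resource and one tag, so it only reaches two shared items when both kinds of items are pooled.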

The goal of this work is to propose an algorithm that utilizes the complete tripartite structure to detect overlapping communities, using the concept of link clustering, which is explained next.

1.4 Link Clustering

Though a node in a network can be associated with multiple semantic topics, a link (or edge; the terms are used interchangeably) is usually associated with only one semantics [13]. For instance, a user can have multiple topical interests, but each link created by the user is likely to be associated with exactly one of his interests.

² The real folksonomy datasets are detailed in Chapter 5.

Link clustering algorithms utilize this notion to detect overlapping communities by clustering links instead of the more conventional approach of clustering nodes. Though each link is placed in exactly one link cluster, this automatically associates multiple overlapping communities with the nodes, since a node inherits membership of all the communities into which its links are placed.

Link clustering algorithms have recently been proposed for unipartite networks [13,14] and bipartite networks [10]. However, to our knowledge, this is the first attempt to propose a link-clustering algorithm for tripartite hypergraphs. Thus, the present work takes the first important step towards detecting overlapping communities in folksonomies considering the complete hypergraph structure.

1.5 Organization of the Thesis

Chapter 2 gives a summary of prior work on community detection in graphs as well as in hypergraphs. Our link-clustering based algorithm is detailed in Chapter 3. We compare the performance of the proposed algorithm with the existing algorithms by Papadopoulos et al. [5] and Wang et al. [10]. Extensive experiments on synthetically generated hypergraphs show that our proposed algorithm outperforms both these algorithms (Chapter 4). Further, using data from three popular real folksonomies – Delicious, MovieLens and LastFm – we also show that the proposed algorithm can identify better overlapping community structures in real folksonomies (Chapter 5). Chapter 6 concludes the thesis.


Chapter 2

Related Work

Large networks or graphs are increasingly being used to model various types of complex systems in the real world. These real-world networks are not random graphs, as they display large inhomogeneities, revealing a high level of order and organization. The degree distribution is broad, with a tail that often follows a power law. Therefore, many vertices with low degree coexist with some vertices with large degree.

Furthermore, the distribution of edges is not only globally, but also locally inhomogeneous, with high concentrations of edges within special groups of vertices, and low concentrations between these groups. This feature of real networks is called community structure or clustering. Communities are groups of vertices which probably share common properties and/or play similar roles within the graph. Several algorithms have been proposed for finding communities or groups of ‘similar’ nodes in graphs.

2.1 Community Detection in Graphs

Girvan and Newman proposed one of the initial algorithms for community detection [15]. Their algorithm removes network edges iteratively based on their betweenness centrality, which results in splitting the network into disconnected components. In a subsequent work, they introduced the notion of modularity as a measure of the quality of community structure in a network [16].
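To make the notion of modularity concrete, the sketch below evaluates the commonly used form of the Newman-Girvan modularity, Q = Σ_c [ l_c / m − (d_c / 2m)² ], where m is the number of edges, l_c the number of edges inside community c and d_c the total degree of its nodes. The sketch assumes an undirected, unweighted graph without self-loops; the function name and the toy graph are illustrative only.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Newman-Girvan modularity of a node partition (undirected, unweighted graph)."""
    m = len(edges)
    internal = defaultdict(int)    # l_c: edges with both ends in community c
    degree_sum = defaultdict(int)  # d_c: total degree of the nodes in community c
    for u, v in edges:
        degree_sum[community_of[u]] += 1
        degree_sum[community_of[v]] += 1
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)


# Toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {n: "A" for n in range(6)}

q_split = modularity(edges, two_groups)   # equals 5/14, about 0.357
q_single = modularity(edges, one_group)   # 0.0: a single community has no structure
```

Splitting the graph at the bridge scores higher than keeping everything in one community, which is the behaviour that modularity-maximizing algorithms exploit.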

Several algorithms were subsequently proposed which attempt to detect community structure in a network by maximizing the modularity score. For instance, Clauset et al. [17] proposed an agglomerative hierarchical clustering which successively joins pairs of communities (starting from single-node communities) such that each agglomeration results in the maximum possible modularity increase. Later, techniques like simulated annealing, extremal optimization and spectral optimization were presented to maximize the modularity score. Refer to [18] for a detailed survey of different community detection algorithms for graphs.

In social networks, every individual typically belongs to more than one community: there are communities of her family members, friends, classmates, co-workers etc. Hence, a community detection algorithm should address the issue of overlapping communities. Recently, many algorithms have been proposed which detect overlapping communities in graphs.


2.2 Detecting Overlapping Communities in Graphs

One of the initial methods to find overlapping communities was designed by Baumes et al. [19]. They defined a community as a subset of actors whose induced subgraph locally optimizes a given metric based on the edge density of the cluster. As different overlapping subsets may all be locally optimal, vertices can belong to multiple communities. Detecting the communities of a graph is then equivalent to finding the set of all locally optimal clusters.

The Clique Percolation Method (CPM) by Palla et al. [20] is the most widely used overlapping community detection technique. It is based on the concept that finding overlapping communities is equivalent to finding k-cliques in the social network. Their algorithm first finds all k-cliques for a fixed constant k. Two detected k-cliques are joined if they share k − 1 nodes. Each community is formed by joining a maximal set of such k-cliques. One node may belong to multiple k-cliques that are not joined to one another, and hence to multiple communities.
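The steps of CPM can be illustrated with a brute-force sketch. This is not the implementation of Palla et al.: it enumerates all k-subsets of nodes, which is feasible only for tiny graphs, it assumes node identifiers are mutually comparable (integers here), and the function name is hypothetical.

```python
from itertools import combinations


def clique_percolation(edges, k):
    """Communities as unions of k-cliques chained together by sharing k-1 nodes."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Brute-force enumeration of all k-cliques.
    cliques = [frozenset(c) for c in combinations(sorted(adj), k)
               if all(b in adj[a] for a, b in combinations(c, 2))]
    # Union-find over cliques: merge two cliques that share k-1 nodes.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    # Each group of merged cliques yields one community (the union of its nodes).
    groups = {}
    for i, c in enumerate(cliques):
        groups.setdefault(find(i), set()).update(c)
    return sorted(groups.values(), key=sorted)


# Two triangles that share only node 3: they are not joined, so node 3 overlaps.
bowtie = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (3, 5)]
communities = clique_percolation(bowtie, k=3)

# Two triangles that share the edge (2, 3): they percolate into one community.
diamond = [(1, 2), (2, 3), (1, 3), (2, 4), (3, 4)]
merged = clique_percolation(diamond, k=3)
```

Nodes that lie in no k-clique are left unassigned by this procedure, which is a known property of clique percolation rather than a defect of the sketch.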

The Clique Percolation scheme has been extended to different types of real-world networks. Farkas et al. [21] and Lehmann et al. [22] extended the method to weighted and bipartite graphs respectively. Adamcsek et al. [23] designed a software package, CFinder¹, which implements CPM. Kumpula et al. [24] proposed a faster sequential implementation of the CPM algorithm.

Lancichinetti et al. [25] proposed a local community detection algorithm. Their algorithm tries to optimize a fitness function, which is defined using the internal and external degrees of the computed cluster. By varying the parameters in the fitness function, both overlapping and hierarchical community structures can be obtained using the algorithm.

The well-known modularity metric can be extended to the overlapping community scenario. Nicosia et al. [26] introduced an overlapping modularity metric. In their scheme, a vector is assigned to each node in the graph. This vector stores, for each community, the probability that the node belongs to it. Their definition of overlapping modularity utilizes these vectors. With the notion of overlapping modularity, any modularity maximization algorithm can be applied to detect overlapping communities.

Gregory [27] proposed an algorithm which works in multiple stages. First, the vertices with the highest split betweenness are identified. These are the vertices which may potentially belong to multiple communities. Then, each of these vertices is split into multiple nodes connected by edges. The original graph is thus transformed into a larger graph which contains these sets of split vertices in place of the potentially overlapping nodes. After that, any state-of-the-art non-overlapping clustering technique can be applied to the resulting graph. Finally, the communities are mapped back onto the original graph.

Some of the recent algorithms proposed for detecting overlapping communities [13,14] adopt the methodology of link clustering, i.e. they find groups of ‘similar’ edges, unlike conventional attempts to group similar nodes. Link clustering strategies build on the idea that even though many actors may belong to multiple groups, their social ties can be classified into a single category. Evans et al. [14] considered a modified random walk on the line graph of a particular graph along with other diffusion processes. Ahn et al. [13] proposed to group edges with an agglomerative hierarchical clustering technique.

The advantage of these algorithms is that while overlapping communities of nodes are indeed discovered (since a given node inherits membership of all communities that contain the edges associated with the node), these algorithms are much simpler and more efficient than the ones which directly find overlapping groups of nodes. Hence in the present study, we adopt the link-clustering methodology to propose an algorithm for overlapping community detection in tripartite hypergraphs.

¹ Available at http://www.cfinder.org.

2.3 Community Detection in Hypergraphs

Several algorithms have been proposed for detecting communities in hypergraphs. Vazquez [7] proposed a Bayesian formulation of the problem of finding hypergraph communities. Starting from a statistical model on hypergraphs, the author uses a Mean Field (MF) approximation as the variational function, which resolves the population structure by determining the hypergraph communities and model parameters from the data. The final Variational Bayes (VB) algorithm is a self-consistent set of equations for determining the group assignments and the model parameters. The VB algorithm is based on recursive equations similar to those of the Expectation Maximization (EM) algorithm.

Bulo et al. [28] proposed a game-theoretic approach to hypergraph clustering. They showed that the hypergraph clustering problem can be converted into a non-cooperative multi-player clustering game, in which the notion of a cluster is equivalent to a classical game-theoretic equilibrium concept. Zhou et al. [29] generalized spectral clustering techniques to hypergraphs. Lin et al. [9] proposed an efficient multi-tensor factorization method for community extraction from hypergraphs.

Neubauer et al. [6] used the concept of modularity to extract communities from hypergraphs. The original k-partite hypergraph is decomposed into $\frac{k(k+1)}{2}$ bipartite graphs. The algorithm tries to optimize a joint modularity measure, which is based on the average bipartite modularity in the individual bipartite graphs, in a brute-force, greedy bottom-up fashion. Later, Murata defined tripartite modularity [30] and proposed an algorithm to detect communities from hypergraphs using the tripartite modularity maximization principle [4].

2.4 Overlapping Community Detection in Folksonomies

All the hypergraph community detection algorithms mentioned above assign a single community to each node. Only two studies have addressed the problem of overlapping community detection in folksonomies, but they do not consider the full tripartite hypergraph structure.

Wang et al. [10] proposed an edge clustering methodology to detect overlapping communities using only user-tag subscription information (in effect, they consider the projection of a tripartite folksonomy onto a user-tag bipartite graph). Their algorithm is a k-means variant which maximizes intra-cluster similarity. The network is considered in an edge-centric view, and each centroid is compared only to a small set of edges that are correlated to the centroid. Though this algorithm is computationally fast, it requires the number of communities as an input, which is difficult to predict in real-world folksonomies.

Papadopoulos et al. [5] proposed an algorithm to detect overlapping communities of tags. This algorithm extracts the resource-tag association graph from the tripartite hypergraph, transforms it into a tag co-occurrence network and then finds overlapping tag communities. The proposed scheme searches for core sets in the tag co-occurrence network. Cores are densely connected groups of tag nodes. Then, the algorithm successively expands the identified cores by maximizing a local subgraph quality measure.

Taking projections (as used by both these approaches) loses some of the information contained in the original tripartite network. Guimera et al. [11] have shown that the quality of the communities obtained from projected networks is worse than that of the communities obtained from the original network. To the best of our knowledge, the present work is the first algorithm for detecting overlapping communities in folksonomies considering the complete hypergraph structure.

The proposed algorithm is detailed in the next chapter.


Chapter 3

Our Proposed Algorithm

This chapter details the proposed link-clustering algorithm for detecting overlapping communities in tripartite hypergraphs. As discussed earlier, a folksonomy is modelled as a tripartite hypergraph (more specifically, a 3-uniform tripartite hypergraph). We first discuss the notation used to model a folksonomy as a tripartite hypergraph.

A tripartite hypergraph is denoted as $G = (V, E)$ where $V$ is the set of nodes and $E$ is the set of hyperedges. $V$ is composed of three partite sets (types of vertices) $V^X$, $V^Y$ and $V^Z$. Each hyperedge in $E$ connects a triple of nodes $(a, b, c)$ where $a \in V^X$, $b \in V^Y$, $c \in V^Z$.

3.1 Basic Idea of the Algorithm

For a given hypergraph $G$, we convert $G$ to the weighted line graph $G'$, which is a unipartite graph in which the hyperedges in $G$ are nodes, and two nodes $e_1$ and $e_2$ in $G'$ are connected by an edge if $e_1$ and $e_2$ are similar in $G$. The weight of the edge $(e_1, e_2)$ in $G'$ represents the similarity between the two hyperedges $e_1$ and $e_2$ in the hypergraph $G$. Similarity calculation is detailed in Section 3.2.
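The construction of the weighted line graph can be sketched as follows. The similarity function used here (the fraction of end nodes on which two hyperedges agree) is only a placeholder chosen for brevity and is not one of the metrics of Section 3.2; any of those metrics could be passed in its place. All names are illustrative.

```python
from itertools import combinations


def shared_node_similarity(e1, e2):
    """Placeholder similarity: fraction of the three positions on which e1 and e2 agree."""
    return sum(a == b for a, b in zip(e1, e2)) / 3


def weighted_line_graph(hyperedges, similarity=shared_node_similarity):
    """Return {(i, j): weight} for pairs of hyperedge indices with non-zero similarity."""
    weights = {}
    for (i, e1), (j, e2) in combinations(enumerate(hyperedges), 2):
        w = similarity(e1, e2)
        if w > 0:
            weights[(i, j)] = w
    return weights


# Hypothetical toy hypergraph: hyperedges 0 and 1 share a user and a resource.
toy_hyperedges = [("u1", "t1", "r1"), ("u1", "t2", "r1"), ("u2", "t3", "r2")]
line_graph = weighted_line_graph(toy_hyperedges)
```

The sketch compares all pairs of hyperedges and is therefore quadratic in the number of hyperedges; it is meant to show the data flow, not to be efficient.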

Once the weighted line graph $G'$ is constructed from the given tripartite hypergraph $G$, any community detection algorithm for unipartite graphs can be used to cluster the nodes in $G'$ (i.e. the hyperedges in $G$). Even overlapping community detection algorithms for graphs can be used here. However, as discussed earlier, a link is usually associated with one particular semantics. Hence, we have considered only algorithms which do not produce overlapping communities. The choice of a particular community detection algorithm among them is described in detail in Section 3.3.

As we get the node communities in G′, each hyperedge in G gets placed into a single link-community. This automatically assigns multiple overlapping communities to nodes in G, since a node inherits membership of all those communities into which the hyperedges connected with this node are placed.
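The inheritance step described above can be sketched in a few lines of Python. This is only an illustration: the function name `node_memberships`, the toy hyperedges and the community labels are hypothetical and not part of the thesis, and node labels are assumed to be unique across the three partite sets.

```python
from collections import defaultdict

def node_memberships(hyperedges, edge_community):
    """Map each node to the set of link-communities of its incident hyperedges."""
    memberships = defaultdict(set)
    for edge in hyperedges:
        for node in edge:
            # A node inherits the community of every hyperedge it belongs to.
            memberships[node].add(edge_community[edge])
    return dict(memberships)

# Hypothetical toy input: three (user, resource, tag) hyperedges,
# each already placed into exactly one link-community.
hyperedges = [("u1", "r1", "t1"), ("u1", "r2", "t2"), ("u2", "r2", "t2")]
edge_community = {hyperedges[0]: 0, hyperedges[1]: 1, hyperedges[2]: 1}
memberships = node_memberships(hyperedges, edge_community)
```

Here the node `u1` ends up in both communities because its two hyperedges were placed in different link-communities, while `u2` belongs to only one.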


3.2 Calculating Similarity Between Hyperedges

The similarity between a pair of hyperedges can be computed using different metrics. For example, hyperedges can be expressed as feature vectors which are then compared to find their similarity. Another way of measuring similarity is by considering the neighbourhoods of the end vertices of the hyperedges.

3.2.1 Expressing Hyperedges as Vectors

In a hypergraph, each hyperedge is associated with three nodes, one each from the sets V^X, V^Y and V^Z. We express each hyperedge as a vector of size |V^X| + |V^Y| + |V^Z|, where an element of the vector represents the amount of participation of a particular node in that hyperedge.

Let d_i denote the degree of node i. The i-th entry in the vector representation of a particular hyperedge is 0 if there is no path from i to any of the end nodes of the hyperedge. Otherwise, the i-th entry is the inverse of the product of the degrees of the vertices on the shortest path from i to any of the end vertices (including i and the end vertex itself).

For example, in the vector representation X of hyperedge e = (a, b, c), the a-th entry of X is 1/d_a. If the shortest path from a node j to the node b contains the nodes i and k, then the j-th entry of X is 1/(d_j · d_i · d_k · d_b). Note that, while calculating the j-th entry, we consider the shortest paths from j to each of a, b and c, and take the path having the minimum hop length among them.

Now, with the vector representation presented above, the following two well-known metrics can be used to find the similarity between two hyperedges.

1. Pearson Correlation:
If hyperedges e1 and e2 are expressed as vectors X and Y respectively, then the similarity between e1 and e2 can be measured by the following equation

$$\mathrm{sim}(e_1, e_2) = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{n}}{\sqrt{\left(\sum X^2 - \frac{(\sum X)^2}{n}\right)\left(\sum Y^2 - \frac{(\sum Y)^2}{n}\right)}} \tag{3.1}$$

2. Cosine Similarity:
It measures the similarity between two vectors as the cosine of the angle between them. The similarity between e1 and e2 can be expressed as

$$\mathrm{sim}(e_1, e_2) = \frac{X \cdot Y}{\|X\|\,\|Y\|} = \frac{\sum_{i=1}^{n} X_i \times Y_i}{\sqrt{\sum_{i=1}^{n} (X_i)^2} \times \sqrt{\sum_{i=1}^{n} (Y_i)^2}} \tag{3.2}$$

For both metrics, the similarity value ranges from −1 to +1, where −1 means exactly opposite, +1 means exactly the same, and in-between values indicate intermediate similarity or dissimilarity, with 0 usually indicating independence. Here, we have considered only positive similarity values. The hyperedges e1 and e2 are connected in the line graph G′ only if the similarity between them in the hypergraph G is more than 0, with the edge weight denoting the similarity value.
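As a minimal sketch of Equations 3.1 and 3.2, the two vector-based metrics can be written in plain Python as follows. The function names are illustrative only, and returning 0 when a denominator vanishes (e.g. for a constant vector) is an assumption made here; the text does not specify that case.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length vectors (Equation 3.1)."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    # Clamp at 0 to guard against tiny negative values from rounding.
    var_x = max(sum(a * a for a in x) - sum_x * sum_x / n, 0.0)
    var_y = max(sum(b * b for b in y) - sum_y * sum_y / n, 0.0)
    denom = math.sqrt(var_x * var_y)
    # Assumption: similarity is taken as 0 when the denominator is 0.
    return (sum_xy - sum_x * sum_y / n) / denom if denom > 0 else 0.0

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors (Equation 3.2)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm > 0 else 0.0

def line_graph_weight(similarity):
    """Keep an edge in the line graph only for positive similarity."""
    return similarity if similarity > 0 else None
```

With these definitions, two hyperedges whose vectors are negatively correlated get no edge in the line graph, since `line_graph_weight` returns `None` for non-positive values.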

3.2.2 Considering Vertex Neighbourhoods

Similarity between hyperedges can be measured by the relative overlap among the neighbours of their end vertices. We measure the similarity between only those hyperedges which are adjacent. Non-adjacent hyperedges are considered to have zero similarity.

Note that the adjacency of two hyperedges can be defined in the following ways:

1. Two hyperedges are adjacent if the hyperedges have at least one node in common.

2. Two hyperedges are adjacent if the hyperedges have exactly two nodes in common.

Although the second definition is a special case of the first, the choice has an impact on the overall performance of the algorithm. If we consider the second definition, the line graph G′ will be sparser than under the first definition, and G′ will contain many disconnected components. Detecting communities from this sparser G′ will be more difficult. Also, in real-world folksonomies, the condition of having two nodes in common is too rigid. So, in this work, we have considered the first definition of adjacency: two hyperedges are considered to be adjacent if they share at least one endpoint.

Figure 3.1: Neighbourhood of two adjacent hyperedges

The notations N_X(i), N_Y(i) and N_Z(i) denote the sets of neighbours of node i of type V^X, V^Y and V^Z respectively (if i ∈ V^X, then N_X(i) = ∅, since nodes in the same partite set are not linked). Figure 3.1 shows the neighbourhood of two adjacent hyperedges e1 = (a, b, c) and e2 = (p, q, r), where a, p ∈ V^X; b, q ∈ V^Y; c, r ∈ V^Z, assuming a = p.


With the notation discussed above, we have considered the following two popular similarity metrics which can be used to measure hyperedge similarity.

1. Matching Similarity:
It is defined as the size of the overlap between the neighbour sets of the end points. The matching similarity measure can be expressed as the following equation

$$\mathrm{sim}(e_1, e_2) = |N_1 \cap N_2| \tag{3.3}$$

where
$$N_1 = N_X(b) \cup N_Z(b) \cup N_Y(c) \cup N_X(c)$$
and
$$N_2 = N_X(q) \cup N_Z(q) \cup N_Y(r) \cup N_X(r)$$

2. Jaccard Similarity:
It is expressed as the size of the overlap normalized by the size of the union of the neighbour sets of the end vertices.

$$\mathrm{sim}(e_1, e_2) = \frac{|S \cap S'| + |N_Y(c) \cap N_Y(r)| + |N_Z(b) \cap N_Z(q)|}{|S \cup S'| + |N_Y(c) \cup N_Y(r)| + |N_Z(b) \cup N_Z(q)|} \tag{3.4}$$

where S = N_X(b) ∪ N_X(c) and S′ = N_X(q) ∪ N_X(r). The Jaccard Similarity value can range from 0 to 1.

3.2.3 Choosing the Best Similarity Metric

Vector-based similarity metrics are global metrics which require knowledge of the entire hypergraph. Moreover, calculating similarity using vectors requires a large amount of memory. The size of each vector is O(n), where n is the number of nodes in the hypergraph. If there are m hyperedges, the space complexity of vector-based similarity calculation is O(m · n).

On the other hand, neighbourhood-based metrics can be computed locally for a pair of hyperedges and can thus be computed efficiently for large real folksonomies. Also, experiments on synthetically generated hypergraphs (details in Section 4.3) show that Jaccard Similarity gives the best performance compared to the other similarity metrics. Further, a metric similar to it was found to perform well in detecting overlapping communities in unipartite graphs [13]. Hence, for our algorithm, we choose Jaccard Similarity as the similarity metric. The algorithm for Jaccard Similarity calculation is presented in Algorithm 1.


Algorithm 1 Compute Similarity between two Hyperedges
Input: hyperedges e1 = (a, b, c) and e2 = (p, q, r); a, p ∈ V^X; b, q ∈ V^Y; c, r ∈ V^Z
Output: sim, similarity between e1 and e2

if a ≠ p AND b ≠ q AND c ≠ r then
    /* Hyperedges are non-adjacent */
    sim ← 0
else
    /* Without loss of generality, let a = p; any of the other pairs may be common as well */
    S1 ← N_X(b) ∪ N_X(c), S2 ← N_Y(c), S3 ← N_Z(b)
    S′1 ← N_X(q) ∪ N_X(r), S′2 ← N_Y(r), S′3 ← N_Z(q)
    sim ← (|S1 ∩ S′1| + |S2 ∩ S′2| + |S3 ∩ S′3|) / (|S1 ∪ S′1| + |S2 ∪ S′2| + |S3 ∪ S′3|)
end if
return sim
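A runnable Python sketch of Algorithm 1 follows. It is one possible reading rather than the reference implementation: the names `build_neighbours` and `jaccard_similarity` are hypothetical, node labels are assumed to be unique across partite sets, and when two hyperedges share more than one node the first shared position is used as the common one (the pseudocode leaves this choice open through its "without loss of generality" remark).

```python
from collections import defaultdict

def build_neighbours(hyperedges):
    """nbrs[t][v] = set of type-t nodes sharing a hyperedge with node v.

    Types are indexed 0, 1, 2 for the partite sets X, Y, Z.
    """
    nbrs = [defaultdict(set) for _ in range(3)]
    for edge in hyperedges:
        for s, v in enumerate(edge):
            for t, w in enumerate(edge):
                if s != t:
                    nbrs[t][v].add(w)
    return nbrs

def jaccard_similarity(e1, e2, nbrs):
    """Jaccard similarity between two hyperedges, following Algorithm 1."""
    shared = [t for t in range(3) if e1[t] == e2[t]]
    if not shared:
        return 0.0  # non-adjacent hyperedges
    t = shared[0]  # assumption: use the first shared position
    u, w = [k for k in range(3) if k != t]
    # With t = X, u = Y, w = Z these are S1, S2, S3 of Algorithm 1.
    s1 = nbrs[t][e1[u]] | nbrs[t][e1[w]]
    s2, s3 = nbrs[u][e1[w]], nbrs[w][e1[u]]
    s1p = nbrs[t][e2[u]] | nbrs[t][e2[w]]
    s2p, s3p = nbrs[u][e2[w]], nbrs[w][e2[u]]
    inter = len(s1 & s1p) + len(s2 & s2p) + len(s3 & s3p)
    union = len(s1 | s1p) + len(s2 | s2p) + len(s3 | s3p)
    return inter / union if union else 0.0

# Hypothetical toy hypergraph of (user, resource, tag) triples.
edges = [("u1", "r1", "t1"), ("u1", "r2", "t2"), ("u2", "r3", "t3")]
nbrs = build_neighbours(edges)
sim_adjacent = jaccard_similarity(edges[0], edges[1], nbrs)
```

In this toy input the first two hyperedges share only the user `u1`, so the overlap comes from the X-type neighbour sets alone, and the third hyperedge is non-adjacent to both.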

3.3 Detecting Communities in Line Graph

With the similarity measure in Algorithm 1, we convert the hypergraph to its corresponding line graph, on which any community detection algorithm can be used. We have experimented with different community detection algorithms to find the best candidate to be used in our proposed algorithm. We present some of those algorithms below.

3.3.1 Hierarchical Clustering

In the line graph, we use single-linkage hierarchical clustering to construct a dendrogram. We start with each node in the line graph as an individual cluster; then, at each step, the two most similar clusters are merged. This procedure is continued until all nodes belong to a single cluster, and cutting this dendrogram at some suitable level gives the final clusters of nodes. The optimal level for the cut is decided based on the Partition Density metric [13], which is computed on the original hypergraph.

The partition density of a community Pi of hyperedges is the number of hyperedges in Pi, normalized by the minimum and maximum number of hyperedges possible among the induced nodes (which are touched by the hyperedges in Pi). The global partition density D for a given partitioning of the hyperedges is the average partition density of all hyperedge communities, weighted by the fraction of hyperedges present in each community. Algorithm 2 computes D for a given partitioning of the hyperedges at a certain level of the dendrogram. The dendrogram is cut at the level at which the global partition density D is maximum [13].


Algorithm 2 Compute Partition Density
Input: {P1, P2, . . . , PC}, a partitioning of the M hyperedges in E into C subsets
Output: Global Partition Density D

for all i, 1 ≤ i ≤ C do
    m_i ← |P_i|
    /* Count number of induced nodes of the three types in P_i */
    n^X_i ← |∪_{(a,b,c)∈P_i} {a}|, n^Y_i ← |∪_{(a,b,c)∈P_i} {b}|, n^Z_i ← |∪_{(a,b,c)∈P_i} {c}|
    /* Compute Partition Density D_i of subset P_i */
    D_i ← (m_i − max{n^X_i, n^Y_i, n^Z_i}) / ((n^X_i × n^Y_i × n^Z_i) − max{n^X_i, n^Y_i, n^Z_i})
end for
D ← (1/M) Σ_i (m_i × D_i) /* Global Partition Density */
return D
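Algorithm 2 translates almost directly into Python; the sketch below is illustrative and the function name `partition_density` is hypothetical. One detail is an assumption made here: when the denominator of D_i is zero (for instance a community consisting of a single hyperedge), D_i is taken as 0, since the pseudocode does not say how that case is handled.

```python
def partition_density(partition):
    """Global partition density D for a list of hyperedge communities.

    Each community is a list of hyperedges (a, b, c).
    """
    total_edges = sum(len(community) for community in partition)
    if total_edges == 0:
        return 0.0
    weighted_sum = 0.0
    for community in partition:
        m = len(community)
        # Number of induced nodes of each of the three types.
        n_x = len({a for a, _, _ in community})
        n_y = len({b for _, b, _ in community})
        n_z = len({c for _, _, c in community})
        largest = max(n_x, n_y, n_z)
        denom = n_x * n_y * n_z - largest
        # Assumption: a degenerate community (denom == 0) contributes D_i = 0.
        if denom > 0:
            weighted_sum += m * (m - largest) / denom
    return weighted_sum / total_edges

# Hypothetical example: a complete 2 x 2 x 2 community plus a single hyperedge.
complete = [(x, y, z) for x in "ab" for y in "cd" for z in "ef"]
single = [("g", "h", "i")]
density = partition_density([complete, single])
```

A community containing every possible hyperedge among its induced nodes gets D_i = 1, which is why `partition_density([complete])` evaluates to 1.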

3.3.2 Fast Modularity Optimization

Clauset et al. [17] proposed a fast, greedy approach¹ to implement the modularity maximization technique proposed by Newman [31]. Starting from a set of isolated nodes, the links present in the original graph are iteratively added so as to produce the largest possible increase in modularity at each step. The algorithm relies on efficient data structures: a sparse matrix stores the increase in modularity obtained by joining each pair of communities that have at least one edge between them, and a max-heap is used to reduce the time complexity to O(n · log² n), where n is the number of nodes in the graph.

3.3.3 Louvain Method

Blondel et al. [32] proposed a multistep technique². In the first step, communities are detected by local optimization of modularity in the neighbourhood of each node in the graph. In the next step, a weighted graph is formed whose nodes are the communities detected in the earlier step. These two steps are iterated until modularity (which is always computed on the original graph) does not increase any further. The computational complexity of this algorithm is O(m), where m is the number of edges in the original graph.

3.3.4 Infomap

This is a dynamic algorithm proposed by Rosvall and Bergstrom [33]. The authors have shown that the problem of finding the best cluster structure of a graph is equivalent to the problem of optimally compressing the information of a random walk taking place on the graph. The optimal compression is achieved by optimizing a quality function, the Minimum Description Length of the random walk. The Minimum Description Length expresses the best trade-off between maximal compression and minimal difference between the original and the compressed information. Optimizing the Minimum Description Length can be carried out with a combination of greedy search and simulated annealing³. The computational complexity of this algorithm is also O(m), where m is the number of edges in the graph.

¹ Can be found at http://www.cs.unm.edu/~aaron/research/fastmodularity.htm
² Downloadable from https://sites.google.com/site/findcommunities/

3.3.5 Choosing the Best Community Detection Method

We have compared the performance of all the above community detection algorithms using synthetic hypergraphs (Section 4.3). The Infomap algorithm is found to perform better than the other algorithms. Lancichinetti et al. [34] also showed that, for community detection in large graphs, Infomap can identify communities more accurately than several other algorithms, including Louvain and greedy modularity maximization. Further, as Infomap has low computational complexity, it can be used efficiently on line graphs of large real folksonomies. Therefore, we used Infomap as the community detection algorithm.

3.4 Time Complexity of Our Proposed Algorithm

Let the number of nodes in the hypergraph be n and the average node degree be d, which implies that the number of hyperedges is n·d/3. Each hyperedge is, on average, adjacent to 3·(d − 1) other hyperedges. So, the line graph has n·d/3 nodes and (n·d/3) × 3·(d − 1) = n·d·(d − 1) = O(n·d²) edges.

The time complexity of the Infomap algorithm is linear in the size of the graph. So, community detection in the line graph takes O(n·d²) time. Jaccard similarity calculation in the hypergraph also takes O(n·d²) time. Therefore, the time complexity of the proposed algorithm is O(n·d²).

Note that real-world folksonomies are known to be sparse, having a small average degree d. So, treating d as a small constant, the complexity of our algorithm essentially becomes O(n), which makes the algorithm scalable to large real-world folksonomies.
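The size estimate above can be checked with a few lines of arithmetic. The helper `line_graph_size` and the values n = 600, d = 4 are purely illustrative; the function simply evaluates the two expressions from this section, n·d/3 and (n·d/3) × 3·(d − 1), and is not a measurement on real data.

```python
def line_graph_size(n, d):
    """Estimated (nodes, edges) of the line graph for n nodes of average degree d."""
    hyperedges = n * d / 3               # each hyperedge touches three nodes
    edges = hyperedges * 3 * (d - 1)     # each hyperedge has about 3(d - 1) neighbours
    return hyperedges, edges

# Hypothetical hypergraph with 600 nodes and average degree 4.
nodes_in_line_graph, edges_in_line_graph = line_graph_size(600, 4)
```

For these values the edge estimate equals n·d·(d − 1), so doubling d roughly quadruples the number of line-graph edges while doubling n only doubles it.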

The performance of this algorithm is evaluated in the next chapter.

³ Available at http://www.tp.umu.se/~rosvall/code.html

Chapter 4

Experiments and Evaluation

In this chapter, we evaluate the performance of our proposed algorithm, which we name ‘Overlapping Hypergraph Clustering’ (abbreviated to ‘OHC’). We first compare different choices of similarity metrics as well as community detection algorithms for the line graph to be used in OHC. Then, we compare the OHC algorithm with the algorithms by Wang et al. [10] and Papadopoulos et al. [5], which are henceforth referred to as ‘CL’ (abbreviation of ‘Correlational Learning’) and ‘HGC’ (as referred to by the respective authors) respectively.

Since evaluation of clustering is difficult without knowledge of the ‘ground truth’ regarding the community memberships of nodes, we have used synthetically generated hypergraphs with a known community structure for evaluating the algorithms. We discuss the generation of synthetic hypergraphs and the metric used to evaluate the algorithms, followed by the results of experiments on synthetic hypergraphs.

4.1 Generation of Synthetic Hypergraphs

Synthetic hypergraphs are generated using a modified version of the method used in [10]. The generator algorithm takes the following as input:

1. Number of nodes in a partite set (all 3 partite sets V^X, V^Y and V^Z are assumed to contain an equal number of nodes)

2. Number of communities C

3. Fraction γ of nodes which belong to multiple communities

4. Hyperedge density β (i.e. fraction of total number of hyperedges possible in the hypergraph)

Initially, the nodes in each partite set are evenly distributed among the communities under consideration (e.g. |V^X|/C nodes in the partite set V^X are assigned to each of the C communities). Subsequently, a fraction γ of the nodes is selected at random from each of V^X, V^Y and V^Z. Each selected node is assigned to some randomly chosen communities apart from the one to which it has already been assigned. Nodes assigned to the same community are then randomly selected, one from each partite set, and interconnected with hyperedges. The number of hyperedges is decided based on the specified density β.
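The generation procedure can be sketched as below. Several details are assumptions made for this illustration and are not specified in the text: each selected overlapping node receives exactly one extra community, the density β is applied separately to the set of triples possible inside each community, nodes are represented as (type, index) pairs, and all names (`generate_hypergraph`, `members`) are hypothetical.

```python
import random

def generate_hypergraph(n, C, gamma, beta, seed=0):
    """Return (members, hyperedges) for a synthetic tripartite hypergraph.

    members[t][k] is the set of nodes of partite set t in community k;
    a node is the pair (t, index) with t = 0, 1, 2 for X, Y, Z.
    """
    rng = random.Random(seed)
    members = [[set() for _ in range(C)] for _ in range(3)]
    for t in range(3):
        for v in range(n):
            members[t][v % C].add((t, v))  # even initial distribution
        # Assumption: each overlapping node joins exactly one extra community.
        for v in rng.sample(range(n), int(gamma * n)):
            extra = rng.choice([k for k in range(C) if k != v % C])
            members[t][extra].add((t, v))
    hyperedges = set()
    for k in range(C):
        xs, ys, zs = (sorted(members[t][k]) for t in range(3))
        possible = [(x, y, z) for x in xs for y in ys for z in zs]
        # Assumption: beta is the fraction of within-community triples realised.
        hyperedges.update(rng.sample(possible, int(beta * len(possible))))
    return members, hyperedges

members, hyperedges = generate_hypergraph(n=8, C=2, gamma=0.25, beta=0.5, seed=1)
```

The returned `members` structure plays the role of the ground truth, and `hyperedges` is the input handed to a community detection algorithm. Enumerating all within-community triples is only practical for small communities such as those used in these experiments.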


Figure 4.1 shows an example of the synthetic hypergraphs generated. In this example, the 4 nodes in each partite set are divided into two communities (i.e. C = 2). The hyperedge density (β) is 20%, and 25% of the nodes belong to both communities (i.e. γ = 0.25).

Figure 4.1: An example synthetic hypergraph. There are two communities, blue and green. Violet nodes belong to both communities.

Users in real-world folksonomies often tag a few resources related to topics that are different from their topics of primary interest, according to their transient interests at different times. Though such taggings are typically much fewer than those related to the primary interests of users, they can adversely affect the performance of algorithms that assign a single community to each node. To test whether the proposed algorithm can identify both the primary and the transient interests of users, a second set of hypergraphs is generated, where 1% of the generated hyperedges interconnect randomly selected nodes from different communities; we call these ‘scattered’ hyperedges.

The above assignment of communities to nodes constitutes the ‘ground truth’. After a hypergraph is generated, information about the communities is hidden, and communities are then detected from the hypergraph by different community detection algorithms. The community structure detected by each algorithm is compared with the ground truth using the metric ‘Normalized Mutual Information (NMI)’, which is explained next.

4.2 Normalized Mutual Information (NMI)

Normalized Mutual Information is an information-theoretic measure of similarity between two partitionings of a set of elements, which can be used to compare two community structures for the same graph (as identified by different algorithms). It is based on defining a confusion matrix N, where the rows correspond to the ‘real’ communities and the columns correspond to the ‘found’ communities. The element N_ij of N is simply the number of nodes in the real community i that appear in the found community j. NMI is then defined in terms of the entries N_ij. Its value lies in the range [0, 1] and equals 1 only when the two partitions coincide exactly.

This ‘traditional’ definition of NMI does not consider the case of overlapping communities: it assumes that each node is placed in only one cluster. But a node may belong to more than one cluster. Therefore, the membership of node i is no longer a number x_i ∈ {1, 2, ..., |C|}; it must be considered as a binary array of |C| entries, one for each cluster of the partition C (say (x_i)_k = 1 if node i is present in cluster C_k, and (x_i)_k = 0 otherwise).

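Lancichinetti et al. [35] proposed an alternative definition of NMI that accounts for overlapping communities. According to [35], given two community structures / partitions X and Y, NMI is defined as

$$NMI(X, Y) = 1 - \frac{1}{2}\left(H(X|Y)_{norm} + H(Y|X)_{norm}\right) \tag{4.1}$$

where

$$H(X|Y)_{norm} = \frac{1}{N_X} \sum_{i} \frac{\min_{j \in \{1, 2, \dots, N_Y\}} H(X_i|Y_j)}{H(X_i)}$$

$$H(Y|X)_{norm} = \frac{1}{N_Y} \sum_{i} \frac{\min_{j \in \{1, 2, \dots, N_X\}} H(Y_i|X_j)}{H(Y_i)}$$

Here H(X_i) and H(Y_i) are the entropies of the clusters X_i and Y_i, H(X_i|Y_j) and H(Y_i|X_j) are conditional entropies, and N_X and N_Y are the numbers of clusters in X and Y respectively. In each minimum, the index j runs over the clusters of the other community structure.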

This NMI value is computed in two steps.

1. The pairs of clusters that are closest to each other are found from the two clusterings.

2. The mutual information between those pairs of clusters is then averaged.

The value is in the range [0, 1]. The higher the NMI value, the more similar the two community structures are (refer to [35] for details).
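The following Python sketch evaluates Equation 4.1 for two covers given as lists of node sets. It is a simplified illustration, not the reference implementation of [35]: each cluster is treated as a binary membership variable over the n nodes, the additional matching constraint used in [35] is omitted, and a cluster with zero entropy (empty or containing every node) is assumed to contribute 0. All function names are hypothetical.

```python
import math

def _entropy(probabilities):
    """Shannon entropy in bits of a list of probabilities."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def _conditional_entropy(x, y, n):
    """H(X_i | Y_j) = H(X_i, Y_j) - H(Y_j) for two clusters over n nodes."""
    both = len(x & y)
    joint = [both / n, (len(x) - both) / n, (len(y) - both) / n,
             (n - len(x) - len(y) + both) / n]
    h_y = _entropy([len(y) / n, 1 - len(y) / n])
    return max(0.0, _entropy(joint) - h_y)  # clamp rounding noise

def _normalized_conditional(cover_x, cover_y, n):
    """H(X|Y)_norm: average over X_i of min_j H(X_i|Y_j) / H(X_i)."""
    total = 0.0
    for x in cover_x:
        h_x = _entropy([len(x) / n, 1 - len(x) / n])
        if h_x == 0:
            continue  # assumption: degenerate cluster contributes 0
        total += min(_conditional_entropy(x, y, n) for y in cover_y) / h_x
    return total / len(cover_x)

def overlapping_nmi(cover_x, cover_y, n):
    """NMI of Equation 4.1 for two covers of the nodes 0 .. n-1."""
    return 1 - 0.5 * (_normalized_conditional(cover_x, cover_y, n)
                      + _normalized_conditional(cover_y, cover_x, n))

truth = [{0, 1, 2}, {3, 4, 5}]
unrelated = [{0, 3}, {1, 4}, {2, 5}]
nmi_same = overlapping_nmi(truth, truth, 6)
nmi_unrelated = overlapping_nmi(truth, unrelated, 6)
```

Comparing a cover with itself gives the maximum score, while the second cover above is statistically independent of the first, so every normalized conditional entropy equals 1 and the score drops to the minimum.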

4.3 Comparison between Different Choices of OHC

To find the best similarity metric and community detection method, we generated synthetic hypergraphs having various hyperedge densities β = 0.1, 0.2, . . ., 1.0. In each of these hypergraphs, 10% of the nodes in each partite set belonged to multiple communities (i.e. γ = 0.1).

First, we compare the performance of the different similarity metrics. Infomap is used as the community detection method on the line graph. NMI values between the original and the detected community structures are compared. The result of the comparison is shown in Figure 4.2. We can see that, for every value of hyperedge density, Jaccard Similarity gives the best result.


[Plot: NMI (y-axis) against hyperedge density (x-axis) for the Jaccard, Matching, Pearson and Cosine similarity metrics]

Figure 4.2: Comparison of NMI values for different similarity metrics with varying hyperedge density

Once Jaccard Similarity has been chosen as the similarity metric, we compare the different community detection methods which can be applied on the line graph. Figure 4.3 shows the comparison of NMI values. Across all hyperedge densities, the Infomap algorithm is found to perform better than the other algorithms.

[Plot: NMI (y-axis) against hyperedge density (x-axis) for the Infomap, Louvain, Hierarchical and Modularity community detection methods]

Figure 4.3: Comparison of NMI values for different community detection algorithms with varying hyperedge density


4.4 Comparing OHC with Other Algorithms

The CL and HGC algorithms produce only user and tag communities respectively. Hence, while calculating the NMI value for these algorithms, we have used the community memberships of only the user (respectively, tag) nodes according to the ground truth. In contrast, the proposed OHC algorithm gives composite communities containing all three types of nodes. Hence, to evaluate the performance of OHC, we have considered the community memberships of all three types of nodes.

For all the following experiments, |V^X| = |V^Y| = |V^Z| = 200 and the number of communities C = 20. For each result, random hypergraphs were generated 50 times using the same set of parameter values, and the average performance over all 50 runs is reported.

4.4.1 Performance w.r.t. Number of Hyperedges

To study how the number of hyperedges affects the performance of the clustering algorithms, we generated synthetic hypergraphs having various hyperedge densities β = 0.1, 0.2, . . ., 1.0. In each of these hypergraphs, 10% of the nodes in each partite set belonged to multiple communities (i.e. γ = 0.1). The NMI values for the three algorithms are shown in Figure 4.4.

[Plot: NMI (y-axis) against hyperedge density β (x-axis) for the OHC, HGC and CL algorithms]

Figure 4.4: Variation of NMI values with varying hyperedge density when 10% of the nodes belong to multiple communities

It can be clearly seen that, across all hyperedge densities, OHC performs significantly better than the HGC and CL algorithms. A possible explanation for this is that the proposed OHC algorithm utilizes the complete tripartite structure of the hypergraph, whereas both the CL and HGC algorithms work on unweighted projections.

Guimera et al. [11] have shown that taking projections results in the loss of some of the information contained in the original tripartite network. Moreover, unweighted projections lose more information than weighted projections. Even for weighted projections, calculating the weights is the most challenging step and the determining factor for the amount of information retained. For example, while taking the projection from a hypergraph to a user-tag bipartite network, one does not take into account the relative importance of resource nodes. A resource node having a higher degree should not be treated the same as another resource node having a lower degree. The weight calculation algorithm should take this and many other factors into consideration.

Note that even for very low hyperedge densities, when detecting community structure is difficult, the proposed OHC algorithm performs very well, resulting in NMI scores above 0.8. This makes OHC suitable for real-world folksonomies, where hyperedge density is typically low.

4.4.2 Performance in Presence of Scattered Hyperedges

We have also experimented with synthetic hypergraphs having 1% of the total hyperedges as ‘scattered’. Figure 4.5 shows the result. As the presence of scattered hyperedges disturbs the community structure in the hypergraph, the performance of all three algorithms degrades, as expected. However, the performance of OHC is still better than that of the HGC and CL algorithms. For the OHC algorithm, NMI scores remain above 0.7, which indicates its effectiveness in detecting community structure even in the presence of noisy or scattered hyperedges.

[Plot: NMI (y-axis) against hyperedge density β (x-axis) for the OHC, HGC and CL algorithms]

Figure 4.5: Variation of NMI values with varying hyperedge density in the presence of scattered hyperedges

4.4.3 Performance w.r.t. Fraction of Nodes in Multiple Communities

A node belonging to multiple communities creates hyperedges to nodes in all those communities; hence, from the perspective of a particular community, the hyperedges created by this member node to nodes in other communities reduce the exclusivity of this particular community. As the number of nodes in multiple overlapping communities increases, the fraction of such inter-community hyperedges increases, making the community structure more difficult to identify. We now study how this affects the performance of the algorithms.


[Plot: NMI (y-axis) against fraction of nodes in multiple communities γ (x-axis) for the OHC, HGC and CL algorithms]

Figure 4.6: Variation of NMI values with varying fraction of nodes in multiple communities, keeping hyperedge density constant at 0.2

We generated synthetic hypergraphs by varying the fraction of nodes in multiple communities (γ) while keeping the hyperedge density (β) constant at 0.2. This low value of hyperedge density was chosen to measure the effectiveness of the algorithms in a sparse environment (as in real-world folksonomies).

Figure 4.6 shows that OHC performs consistently better than the HGC and CL algorithms in this case as well. Further, as the community structure becomes more and more complex, the information loss resulting from projections becomes increasingly crucial; hence the performance of the HGC and CL algorithms degrades sharply with increasing γ. On the other hand, the performance of our OHC algorithm is considerably more stable.

4.4.4 Performance w.r.t. Size of Real Community

We also observed how the performance of the different algorithms is affected by the size of each real community. Hypergraphs having 200 nodes in each partite set were generated while changing the number of real communities. Here the hyperedge density is fixed at 0.2, and 10% of the total nodes belong to multiple communities. The results are shown in Figure 4.7.


[Plot: NMI (y-axis) against number of real communities (x-axis) for the OHC, HGC and CL algorithms]

Figure 4.7: Comparison of NMI values with varying number of real communities

When the number of nodes in one community is large, the random assignment of hyperedges during generation of synthetic hypergraphs may create smaller communities inside one large community. Community detection algorithms find these smaller communities rather than the large encompassing community. For this reason, as the number of real communities increases, the size of each community decreases, enabling better NMI performance. Here also, OHC performs better than the CL and HGC algorithms.

The above experiments clearly validate our motivation and show that considering the complete tripartite structure of hypergraphs can result in better identification of community structure, as compared to considering projections (as done in prior studies).

In the next chapter, we use OHC to study the community structure of real world folksonomies.


Chapter 5

Experiments on Real World Folksonomies

In this chapter, we apply the proposed OHC algorithm to gain insights into the community structures prevalent in real folksonomies. For this, we use publicly available datasets [36] containing snapshots of three folksonomies: Delicious, LastFm and MovieLens. The statistics of these datasets are summarized in Table 5.1.

Dataset     Users   Resources   Tags     Hyperedges
Delicious   1,867   69,226      53,388   437,593
LastFm      1,892   17,632      11,946   186,479
MovieLens   2,113   10,197      13,222   47,957

Table 5.1: Statistics of Real Folksonomy Datasets

5.1 Overlapping Communities in Folksonomies

For all three datasets, the OHC algorithm successfully groups semantically related resources and tags, and the users tagging these resources. As an illustration, Table 5.2 shows the resources and tags placed in some example communities for each of the three datasets. It is evident that the resources and tags that are placed in the same community are often related to a common semantic theme.


Community | Theme | Example of Member Nodes
LastFm Artists (resources) | Hard Rock | Van Halen*, Deep Purple*, Aerosmith*, Alice Cooper, Guns N’ Roses, Scorpions, Kiss, Living Colour, White Lion, Bad Company, Bon Jovi, Hardline, The Rolling Stones
LastFm Artists (resources) | Heavy Metal | Van Halen*, Deep Purple*, Aerosmith*, Iron Maiden, Motorhead, Black Sabbath, Metallica, Twisted Sister, Crazy Lixx, Blind Guardian
LastFm Tags | Metal | blues rock*, psychedelic rock*, rap metal*, nu metal*, metal, symphonic metal, doom metal, progressive metal, speed metal, folk metal, metalcore, viking metal, power metal
LastFm Tags | Rock | blues rock*, psychedelic rock*, rap metal*, nu metal*, progressive rock, polish rock, art rock, soft rock, gothic rock, polish, punk, punk rock, hard rock, glam rock, pop-rock
MovieLens Movies (resources) | Superhero | The Incredibles*, Shrek*, Shrek 2*, The Incredible Hulk*, Batman Begins, Batman Returns, Batman Forever, Spider-Man, Superman, Superman II, Superman III, X-Men
MovieLens Movies (resources) | Animation | The Incredibles*, Shrek*, Shrek 2*, The Incredible Hulk*, Shrek the Third, Beowulf, WALL-E, Ratatouille, Finding Nemo, Cars, Toy Story, Toy Story 2, Kung fu Panda
MovieLens Tags | Criticism | violent*, brutal*, too violent, waste of celluloid, disturbing, junk, tragically stupid, lousy script, pointless, waste of money, not very good, confusing plot, worst animated flick ever
MovieLens Tags | Violence | violent*, brutal*, violence, murder, fatality, civil war, great villain, dark, spanish civil war, serial killer, great war depiction, vietnam war, world war ii, best war film
Delicious Tags | Web 2.0 | socialnetworking, socialweb, socialmedia, web20, php, drupal, xml, cms, webdesign, css3, twitter, skype, ruby, facebook, snippets, wikipedia, blog

Table 5.2: Examples of communities detected by the proposed OHC algorithm. The algorithm successfully clusters nodes which are related to a common semantic theme (see Column 2). Nodes related to multiple themes (marked with an asterisk) are placed in overlapping communities.

A closer look at Table 5.2 reveals that the algorithm also correctly identifies nodes that are related to multiple overlapping communities (themes). For instance, the band Van Halen is placed in two different communities detected from LastFm. The Wikipedia article about Van Halen¹ justifies this placement, listing their genres as both ‘Hard Rock’ and ‘Heavy Metal’.

Any non-overlapping community detection algorithm would have placed this node in only one of the two communities (say, ‘Hard Rock’). Community-based recommendation schemes, which recommend resources to users based on common memberships in communities, would then have recommended this resource only to users who are interested in ‘Hard Rock’. But this resource can also be recommended to a user who likes to listen to ‘Heavy Metal’ songs. Our proposed OHC algorithm places the resource in both communities, thus raising the chance of proper recommendation to users of real-world folksonomies.

¹ http://en.wikipedia.org/wiki/Van_Halen

[Plot: CDF (y-axis) against number of overlapping communities (x-axis) for Tag, Resource and User nodes; main panel: LastFm, inset: MovieLens]

Figure 5.1: Cumulative distribution of the fraction of communities which overlap with a given number (x) of other communities; main figure: LastFm, inset: MovieLens

A substantial amount of overlap is detected by the OHC algorithm in all three datasets. Figure 5.1 shows the cumulative distribution of the fraction of communities which overlap with a given number of other communities, for LastFm and MovieLens. A similar pattern was also detected in Delicious.

5.2 Evaluation of Communities Detected

The principal difficulty in evaluating the communities detected in real folksonomies is the absence of ‘ground truth’ regarding the community memberships of nodes, since the huge size of folksonomies makes it impossible for human experts to evaluate the quality of the identified communities.

Hence, we use the following two methods for evaluation.

1. We use the graph-based metric Conductance, which has been shown to conform well with the intuitive notion of communities and is extensively used for evaluating the quality of communities in online social networks (see [37] for details). As conductance is defined only for unipartite networks, we compare the tag communities detected by HGC with the tag nodes in the communities identified by our OHC algorithm.

2. In the case of folksonomies which allow users to form a social network among themselves, we can assume that users having similar interests are likely to be linked in the social network, or at least to have a common social neighbourhood (a property known as homophily [12]). We utilize this notion to evaluate the user communities detected by the CL algorithm and the user nodes in the communities identified by the OHC algorithm.

5.2.1 Comparison of Conductance Value

Conductance φ(C) of a community C, which implies a cut (C, G − C) in a graph G, is defined as

$$\phi(C) = \frac{\sum_{i \in C,\; j \in (G - C)} A_{ij}}{\min\big(A(C),\, A(G - C)\big)} \qquad (5.1)$$

where A is the adjacency matrix for the network and

$$A(C) = \sum_{i \in C} \sum_{j \in G} A_{ij}$$

The Conductance [37] value ranges from 0 to 1, where a lower value signifies better community structure. Figure 5.2 shows the cumulative distribution of the conductance values of the tag communities detected by the two algorithms. Across all three datasets, OHC produces more communities having lower conductance values, which implies that OHC finds communities of better quality than those obtained by the HGC algorithm. The reason for this superior performance is that OHC groups semantically related nodes into relatively small, cohesive communities instead of creating a few large, generalized communities. For examples of semantically related communities, refer to Table 5.2.
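As a worked illustration of Equation 5.1, the following minimal sketch computes the conductance of a node set in an undirected, unweighted graph, so that A(C) reduces to the sum of the degrees of the nodes in C. The adjacency-dictionary representation, the function name `conductance` and the toy graph are assumptions made for this example only.

```python
def conductance(adj, community):
    """Conductance of `community` in an undirected, unweighted graph.

    adj: dict mapping each node to the set of its neighbours.
    community: iterable of nodes forming the set C.
    """
    inside = set(community)
    outside = set(adj) - inside
    # Numerator: number of edges crossing the cut (C, G - C).
    cut = sum(1 for i in inside for j in adj[i] if j not in inside)
    # A(C) and A(G - C): sums of degrees on each side of the cut.
    vol_in = sum(len(adj[i]) for i in inside)
    vol_out = sum(len(adj[i]) for i in outside)
    denom = min(vol_in, vol_out)
    if denom == 0:
        raise ValueError("conductance is undefined when one side has no edges")
    return cut / denom


# Toy graph: two triangles (a, b, c) and (d, e, f) joined by the edge c-d.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e", "f"},
    "e": {"d", "f"},
    "f": {"d", "e"},
}
phi_triangle = conductance(adj, {"a", "b", "c"})  # 1 cut edge, A(C) = 7
phi_pair = conductance(adj, {"a", "b"})           # 2 cut edges, A(C) = 4
print(phi_triangle, phi_pair)
```

The cohesive triangle obtains a much lower conductance than the arbitrary pair, which matches the reading that lower values indicate better community structure.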

[Figure 5.2 plot. x-axis: Conductance; y-axis: CDF; curves: OHC, HGC; main panel: LastFm, insets: MovieLens and Delicious]

Figure 5.2: Cumulative distribution of conductance values of tag communities obtained from the real-world folksonomies: LastFm (main plot), Delicious and MovieLens (both inset) for OHC and HGC.


5.2.2 Comparing Detected User Communities with Social Network

In the case of folksonomies which allow users to form a social network, there can be two types of relationships among users: explicit social connections (in the social network) and implicit connections through their tagging behaviour (e.g. tagging the same resource) in the hypergraph.²

A community detection algorithm for hypergraphs utilizes the implicit relationships to identify the community structure, and we propose to evaluate the detected community structure using the explicit connections that the users themselves create (in the social network). For instance, if a large fraction of the users who are socially linked (or share a common social neighbourhood in the social network) are placed in the same community (by the algorithm), the detected community structure can be said to group together users having common interests.

Hence, to compare the community structure identified by two algorithms, we consider the user-pairs who are within a certain distance from each other in the social network (where distance 1 implies friends, i.e. two users who are directly linked in the social network), and compute the fraction of such user-pairs who have been placed in a common community by the algorithm.
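The measure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the social network is an undirected adjacency dictionary, each user maps to the set of community identifiers assigned to it (so overlapping memberships are allowed), and user-pairs are grouped by their exact shortest-path distance. The thesis does not state whether distances are exact or cumulative, and the function names and toy data are hypothetical.

```python
from collections import defaultdict, deque


def bfs_distances(adj, source, max_dist):
    """Shortest-path distances from `source`, explored up to `max_dist` hops."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == max_dist:
            continue  # do not expand beyond the distance limit
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def same_community_fraction(adj, memberships, max_dist):
    """For each distance d, the fraction of user-pairs at distance d sharing a community."""
    total = defaultdict(int)
    shared = defaultdict(int)
    for u in adj:
        for v, d in bfs_distances(adj, u, max_dist).items():
            if v <= u:
                continue  # skip the source itself and count each unordered pair once
            total[d] += 1
            if memberships.get(u, set()) & memberships.get(v, set()):
                shared[d] += 1
    return {d: shared[d] / total[d] for d in sorted(total)}


# Toy social network: the path a - b - c - d (hypothetical data).
adj = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
memberships = {"a": {0}, "b": {0, 1}, "c": {1}, "d": {2}}
result = same_community_fraction(adj, memberships, max_dist=3)
print(result)
```

Running one breadth-first search per user keeps the sketch simple; for a large social network the searches could be limited to a sample of users.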

[Figure 5.3 plot. x-axis: Distance in Social Network; y-axis: Fraction of User Pairs in Same User Community; curves: OHC, CL]

Figure 5.3: Community structure detected by the OHC and CL algorithms compared with the social network in LastFm

Figure 5.3 shows the result for the proposed OHC algorithm and the CL algorithm, for the LastFm dataset. Across all distances, OHC places a larger fraction of the user-pairs who share a common social neighbourhood in a common community than the CL algorithm does. Also, as the distance between two users in the social network increases, both algorithms put a smaller fraction of such user-pairs in the same community.

We can also investigate the reverse question: among the users who are placed in a common community (by a community detection algorithm), what fraction are actually connected in the social network (or share a common social neighbourhood)? While investigating this question, it is to be noted that the ‘quality’ of large communities detected by community detection algorithms is known to be lower than that of smaller communities [37]. Hence it is meaningful to answer this question for detected communities taking their size into consideration.

² The social network in LastFm is undirected, while in Delicious, a user can be a ‘fan’ of another user, but this fan-relationship may or may not be reciprocated. We assume two users are linked if they are in a mutual fan relationship. In the LastFm and Delicious datasets analysed here, there are 12,717 and 7,668 bi-directional user-user links respectively.

[Figure 5.4 plot. x-axis: Distance in Social Network; y-axis: Fraction of User Pairs; curves: CommSize < 20 by OHC, CommSize < 20 by CL, CommSize > 20 by OHC, CommSize > 20 by CL]

Figure 5.4: Community structure detected by the OHC and CL algorithms compared with the social network in Delicious

Figure 5.4 shows the fraction of users who are placed in a common community by the OHC and CL algorithms, that are within a certain distance in the social network (where distance 1 implies friends), for the Delicious dataset.

For detected user-communities of size smaller than 20, more than 70% of the users who are placed in a common community by OHC are actually connected in the social network, whereas the corresponding value for the CL algorithm is much lower. However, for larger detected communities (having more than 20 users), the fraction of user-pairs who share a common social neighbourhood is much lower and almost identical for both algorithms.
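The size-dependent comparison can be sketched for the simplest case of distance 1 (direct friends). The sketch assumes that each detected community is a set of users and that each social link is stored as a `frozenset` of its two endpoints; the function name `friend_fraction_by_size` is hypothetical. The source does not say how communities of exactly 20 users are treated, so here they fall into the 'large' group, and a user-pair that co-occurs in several communities is counted once per community.

```python
from itertools import combinations


def friend_fraction_by_size(communities, friends, size_threshold=20):
    """Fraction of co-community user-pairs that are direct friends, by community size.

    communities: iterable of sets of users.
    friends: set of frozenset({u, v}) undirected social links.
    """
    stats = {"small": [0, 0], "large": [0, 0]}  # [linked pairs, all pairs]
    for members in communities:
        bucket = "small" if len(members) < size_threshold else "large"
        for u, v in combinations(sorted(members), 2):
            stats[bucket][1] += 1
            if frozenset((u, v)) in friends:
                stats[bucket][0] += 1
    return {
        name: (linked / total if total else None)
        for name, (linked, total) in stats.items()
    }


# Toy data with a threshold of 3 so that both groups are populated (hypothetical).
communities = [{"a", "b"}, {"c", "d", "e"}]
friends = {frozenset(("a", "b")), frozenset(("c", "d"))}
fractions = friend_fraction_by_size(communities, friends, size_threshold=3)
print(fractions)
```

Extending the sketch to larger distances would replace the friendship test with a shortest-path lookup in the social network.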

The above results show that even in the case of real folksonomies (as in the case of synthetically generated hypergraphs), the proposed OHC algorithm detects much better community structure than the existing CL and HGC algorithms. The fact that a very large fraction of the users who are placed in a common community by OHC are actually friends (i.e. directly linked in the social network) shows that OHC can be used to identify potential friends directly from the hypergraph structure.


Chapter 6

Conclusion

In this work, we proposed the first algorithm to detect overlapping communities considering the full tripartite hypergraph structure of folksonomies. Through extensive experiments on synthetic as well as real folksonomy networks, we showed that the proposed algorithm outperforms existing algorithms that consider projections of hypergraphs.

In large folksonomies, it is difficult for an individual user to find other like-minded users as well as resources of her interest. Our algorithm successfully groups nodes into multiple communities, where each community represents a topic of interest. Based on these interests, like-minded users as well as relevant resources can be found.

Thus the proposed algorithm can be effectively used in recommending interesting resources and friends to users in folksonomies. Building such a personalized recommendation system, taking advantage of the effectiveness of the proposed algorithm, is left as future work.


Bibliography

[1] Shengliang Xu, Shenghua Bao, Ben Fei, Zhong Su, and Yong Yu. Exploring folksonomy for personalized search. In ACM SIGIR, pages 155–162, 2008.

[2] Ioannis Konstas, Vassilios Stathopoulos, and Joemon M. Jose. On social networks and collaborative recommendation. In ACM SIGIR, pages 195–202, 2009.

[3] Ciro Cattuto, Christoph Schmitz, Andrea Baldassarri, Vito D P Servedio, Vittorio Loreto, Andreas Hotho, Miranda Grahl, and Gerd Stumme. Network properties of folksonomies. AI Communications, 20(4):245–262, 2007.

[4] Tsuyoshi Murata. Detecting communities from social tagging networks based on tripartite modularity. In Link Analysis in Heterogeneous Information Networks, July 2011.

[5] Symeon Papadopoulos, Yiannis Kompatsiaris, and Athena Vakali. A graph-based clustering scheme for identifying related tags in folksonomies. In Data Warehousing and Knowledge Discovery Conference, pages 65–76, 2010.

[6] Nicolas Neubauer and Klaus Obermayer. Towards Community Detection in k-Partite k-Uniform Hypergraphs, pages 1–9. 2009.

[7] Alexei Vazquez. Finding hypergraph communities: a Bayesian approach and variational solution. Journal of Statistical Mechanics: Theory and Experiment, 2009, Jul 2009.

[8] Michael Brinkmeier, Jeremias Werner, and Sven Recknagel. Communities in graphs and hypergraphs. In ACM CIKM, 2007.

[9] Yu-Ru Lin, Jimeng Sun, Paul Castro, Ravi Konuru, Hari Sundaram, and Aisling Kelliher. Metafac: community discovery via relational hypergraph factorization. In ACM SIGKDD, pages 527–536, 2009.

[10] Xufei Wang, Lei Tang, Huiji Gao, and Huan Liu. Discovering Overlapping Groups in Social Media. In IEEE ICDM, pages 569–578, 2010.

[11] Roger Guimera, Marta Sales-Pardo, and Luis A. Nunes Amaral. Module identification in bipartite and directed networks. Phys. Rev. E, 76:036102, Sep 2007.

[12] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27:415–444, 2001.

[13] Yong-Yeol Ahn, James P. Bagrow, and Sune Lehmann. Link communities reveal multiscale complexity in networks. Nature, 466(7307):761–764, August 2010.

[14] T. S. Evans and R. Lambiotte. Line graphs, link partitions, and overlapping communities. Phys. Rev. E, 80:016105, 2009.


[15] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.

[16] M. Girvan and M. E. J. Newman. Finding and evaluating community structure in networks. Physical Review E, page 69, 2004.

[17] Aaron Clauset, M. E. J. Newman, and Cristopher Moore. Finding community structure in very large networks. Phys. Rev. E, 70:066111, Dec 2004.

[18] Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.

[19] Jeffrey Baumes, Mark K. Goldberg, Mukkai S. Krishnamoorthy, Malik M. Ismail, and Nathan Preston. Finding communities by clustering a graph into overlapping subgraphs. In Nuno Guimaraes and Pedro T. Isaias, editors, IADIS AC, pages 97–104. IADIS, 2005.

[20] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek. Uncovering the overlapping community structure of complex networks in nature and society. Nature, 435:814–818, Jun 2005.

[21] Illes Farkas, Daniel Abel, Gergely Palla, and Tamas Vicsek. Weighted network modules. New Journal of Physics, 9(6):180, 2007.

[22] Sune Lehmann, Martin Schwartz, and Lars Kai Hansen. Biclique communities. Phys. Rev. E, 78:016108, Jul 2008.

[23] Balazs Adamcsek, Gergely Palla, Illes J. Farkas, Imre Derenyi, and Tamas Vicsek. Cfinder: locating cliques and overlapping modules in biological networks. Bioinformatics, 22(8):1021–1023, 2006.

[24] Jussi M. Kumpula, Mikko Kivelä, Kimmo Kaski, and Jari Saramäki. Sequential algorithm for fast clique percolation. Phys. Rev. E, 78:026109, Aug 2008.

[25] Andrea Lancichinetti and Santo Fortunato. Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Physical Review E, 80(1):9, 2009.

[26] V. Nicosia, G. Mangioni, V. Carchiolo, and M. Malgeri. Extending the definition of modularity to directed graphs with overlapping communities, 2008.

[27] Steve Gregory. Finding overlapping communities using disjoint community detection algorithms. In Santo Fortunato, Giuseppe Mangioni, Ronaldo Menezes, and Vincenzo Nicosia, editors, Complex Networks, volume 207 of Studies in Computational Intelligence, pages 47–61. Springer Berlin / Heidelberg, 2009.

[28] Samuel Rota Bulò and Marcello Pelillo. A game-theoretic approach to hypergraph clustering. Advances in Neural Information Processing Systems, pages 1–9, 2009.

[29] Dengyong Zhou, Jiayuan Huang, and Bernhard Scholkopf. Learning with hypergraphs: Clustering, classification, and embedding. In Advances in Neural Information Processing Systems (NIPS) 19, page 2006. MIT Press, 2006.

[30] Tsuyoshi Murata. Modularity for heterogeneous networks. In ACM Hypertext and Hypermedia, pages 129–134, 2010.

[31] M. E. J. Newman. Fast algorithm for detecting community structure in networks. Phys. Rev. E, 69:066133, Jun 2004.


[32] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10), Oct 2008.

[33] Martin Rosvall and Carl T. Bergstrom. Maps of random walks on complex networks reveal community structure. PNAS, 105:1118–1123, Jan 2008.

[34] Andrea Lancichinetti and Santo Fortunato. Community detection algorithms: a comparative analysis. Phys. Rev. E, 80:056117, Sep 2009.

[35] A. Lancichinetti, S. Fortunato, and J. Kertesz. Detecting the overlapping and hierarchical community structure in complex networks. New Journal of Physics, 11:033015, 2009.

[36] Iván Cantador, Peter Brusilovsky, and Tsvi Kuflik. Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011). In ACM RecSys, 2011.

[37] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and Michael W. Mahoney. Statistical properties of community structure in large social and information networks. In ACM WWW, 2008.


Appendix A

Publications from the Thesis

The work presented in the thesis resulted in the following publications:

[1] Abhijnan Chakraborty, Saptarshi Ghosh, Niloy Ganguly. Detecting Overlapping Communities in Folksonomies. In proceedings of the 23rd ACM Conference on Hypertext and Social Media (Hypertext 2012). Milwaukee, Wisconsin, USA. June, 2012.

[2] Abhijnan Chakraborty, Saptarshi Ghosh. Identifying Overlapping Communities in Folksonomies. In Dynamics on and of Complex Networks: Applications to Biology, Computer Science, Economics, and the Social Sciences, Volume 2, Ganguly, N., Deutsch, A., and Mukherjee, A. (eds.), Springer.

[3] Abhijnan Chakraborty, Saptarshi Ghosh, Niloy Ganguly. Detection of Overlapping Communities in Folksonomies. Poster in the International Workshop on Mathematical Physics of Complex Networks: From Graph Theory to Biological Physics (MAPCON12). Dresden, Germany. May, 2012.

