

Page 1: [IEEE 2013 1st International Conference on Emerging Trends and Applications in Computer Science (ICETACS) - Shillong, India (2013.09.13-2013.09.14)] 2013 1st International Conference

ICETACS 2013

978-1-4673-5250-5/13/$31.00 ©2013 IEEE

An Empirical Study of Community and Sub-Community Detection in Social Networks Applying Newman-Girvan Algorithm

Deepjyoti Choudhury
Department of Information Technology
Assam University
Silchar-788011, India
[email protected]

Saprativa Bhattacharjee
Department of Information Technology
Assam University
Silchar-788011, India
[email protected]

Anirban Das
Department of Information Technology
Assam University
Silchar-788011, India
[email protected]

Abstract— A social network can be represented by a set

of human beings in which each member is connected to one or more other members of the same set. By analysing a social network, we can obtain visual and mathematical models of human relationships. Social networks have several inherent properties, such as power-law degree distribution, centrality, the small-world property and modularity. Community structure is another important property of social networks, and it has attracted considerable attention in current research. As online social network services such as Facebook, Google+, MySpace and Twitter grow in popularity, the community structure within them is also becoming more complex. The Newman-Girvan algorithm is a widely used community detection algorithm for social networks. This paper examines the structure of communities, as well as sub-communities, occurring in a social network by applying the Newman-Girvan algorithm. We have implemented this community detection algorithm on real-world networks, and we introduce a new concept for detecting sub-communities in such networks. The paper is mainly an empirical study of the Newman-Girvan algorithm.

Keywords— node, graph, community, algorithms, clustering

I. INTRODUCTION

Community structure, also called clustering, is one of the most important features of networks that represent real-world systems. Communities in a social network can be defined as groups of nodes organized into clusters, with many edges among the nodes of the same cluster and comparatively few edges joining nodes of different clusters. Community detection in networks has been an important problem in sociology, biology, computer science and many other disciplines, and in all of these fields networks are commonly represented as graphs. Real networks are not random graphs: they display large inhomogeneities, which reveal a high level of order and organization. The degree distribution in social networks generally follows a power law. Furthermore, the distribution of edges is inhomogeneous not only globally but also locally, with high concentrations of edges within particular groups of nodes and low concentrations between these groups. This feature of real networks is called community structure, or clustering. The term “community” first appeared in the book “Gemeinschaft und Gesellschaft”, published in 1887. In the context of social networks, the term still has no single, widely accepted definition.

Figure 1: A simple graph with three communities having twelve nodes.

In the above figure, there are three communities. The nodes within each community are densely connected to one another and only sparsely connected to nodes belonging to other communities. In a social network community, nodes are connected to each other on the basis of human relationships such as friendship or being colleagues.

II. RELATED WORK

In computer science, communities can be regarded as sub-graphs of a network. The whole network can be modelled as a graph within which several sub-graphs reside. Connections among the nodes within a sub-graph are dense, whereas connections among nodes belonging to different sub-graphs are comparatively sparse. Newman termed these sub-graphs community structure [1]. This definition emphasizes the structural characteristics of a community, in which intra-community edges are denser than inter-community edges, and this can be measured by the modularity [2]. Several existing community detection algorithms are limited in that they cannot detect overlapping communities in a social network [3]. Overlapping community detection involves both the definition of a community and the evaluation metric. The evaluation metric is used especially for the analysis and comparison of existing overlapping community detection algorithms, including the basic ideas of those algorithms. M. Girvan and M. E. J. Newman [4] proposed a community structure detection algorithm for social and biological networks. Social groupings in a social network are represented as communities.
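As an illustration of how the density of intra-community edges relative to inter-community edges can be quantified, the following Java sketch computes the modularity Q of [2] for a given partition, Q = sum over communities c of (e_cc - a_c^2), where e_cc is the fraction of edges inside community c and a_c is the fraction of edge ends attached to c. The class name ModularitySketch, the adjacency-matrix representation and the toy graph are assumptions made for this example only.

```java
/** Modularity of a partition of a small undirected, unweighted graph (illustrative sketch). */
final class ModularitySketch {

    /**
     * @param adj   symmetric adjacency matrix without self-loops
     * @param label label[i] is the community index (0..k-1) of node i
     * @return Q = sum over communities c of (e_cc - a_c^2)
     */
    static double modularity(boolean[][] adj, int[] label) {
        int n = adj.length;
        int k = 0;
        for (int l : label) k = Math.max(k, l + 1);
        double[] inside = new double[k]; // edges with both ends in community c
        double[] ends = new double[k];   // edge ends attached to community c
        double m = 0;                    // total number of edges
        for (int u = 0; u < n; u++) {
            for (int v = u + 1; v < n; v++) {
                if (!adj[u][v]) continue;
                m++;
                ends[label[u]]++;
                ends[label[v]]++;
                if (label[u] == label[v]) inside[label[u]]++;
            }
        }
        if (m == 0) return 0;            // convention chosen here for a graph without edges
        double q = 0;
        for (int c = 0; c < k; c++) {
            double a = ends[c] / (2 * m); // a_c: fraction of edge ends in c
            q += inside[c] / m - a * a;   // e_cc - a_c^2
        }
        return q;
    }

    public static void main(String[] args) {
        // Hypothetical toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge 2-3.
        int[][] edges = {{0, 1}, {0, 2}, {1, 2}, {2, 3}, {3, 4}, {3, 5}, {4, 5}};
        boolean[][] adj = new boolean[6][6];
        for (int[] e : edges) {
            adj[e[0]][e[1]] = true;
            adj[e[1]][e[0]] = true;
        }
        System.out.println(modularity(adj, new int[] {0, 0, 0, 1, 1, 1})); // two triangles
        System.out.println(modularity(adj, new int[] {0, 0, 0, 0, 0, 0})); // one community
    }
}
```

For the toy graph, the partition into the two triangles gives Q = 5/14 (about 0.357), whereas placing all six nodes in a single community gives Q = 0.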

Dynamic graphs generally take the form of multigraphs, and community detection in dynamic networks [5] is a challenging task. In a dynamic graph, links between a pair of nodes can appear or disappear at different points in time. Mobility is used as a network transport mechanism for distributing data in many networks. GuoDong Kang et al. proposed two new mobility models in 2011 [6], known as the Social Community Partner Mobility Model (SCP) and the Social Community Leader Mobility Model (SCL).

The minimum-cut method is one of the oldest algorithms for dividing a network into parts. It is used in load balancing for parallel computing, where the goal is to minimize communication between processor nodes. It is, however, less than ideal for finding community structure in general networks [1].

In a simulation environment, the SCP model [6] treats the office, gymnasium and restaurant as small squares within the given simulation area. This leads to the concept of a community destination. When the community moves from the office to the gymnasium, the gymnasium is called the community destination; in the simulation, the community destination is the square chosen to represent the gymnasium. When the community then moves from the gymnasium to the restaurant, a new square corresponding to the restaurant is chosen as the new community destination. In the partner movement case, the members of a community also have their own individual destinations within the gymnasium or restaurant.

Jie Jin et al. [7] proposed a new center-based method designed especially for weighted networks. Because of its low computational complexity, the method is also suitable for large-scale networks. They demonstrated it on one synthetic network and two real-world networks. Most known techniques for community detection use only information about linkage behaviour [8] for community prediction and clustering. Some recent work has shown that node content can help improve the quality of the detected communities. Moreover, edge content [9] provides a number of unique distinguishing characteristics of communities that cannot be modelled by node content.

III. METHODOLOGY

A. Newman-Girvan algorithm

Hierarchical methods have several shortcomings when used to detect communities in a social network. To overcome them, Newman and Girvan presented their algorithm for detecting communities in social networks in 2002 [4]. They introduced a new concept, popularly known as “edge betweenness”, for detecting communities in large and complex networks. Instead of constructing a measure of which edges are most central to communities, the algorithm focuses on the edges that are least central, that is, the edges that are most “between” communities. The edge betweenness score of an edge is the number of shortest paths between pairs of nodes that run along that edge. We remove the edge with the highest edge betweenness score. If the two communities are connected by more than one edge, we remove the connecting edges one at a time, each time choosing the edge with the highest edge betweenness score, until the network separates into the first two communities. Continuing in this way until all edges have been removed leaves only single nodes. The procedure of the Newman-Girvan algorithm is stated below:

1) Calculate the betweenness score for all the edges in the network.

2) Remove the edge with the highest edge betweenness score.

3) Recalculate the betweenness score for all the remaining edges in the network.

4) Repeat steps 2 and 3 until all edges have been removed, leaving only single nodes.
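The steps listed above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the graph is undirected and unweighted and is stored as a symmetric boolean adjacency matrix, betweenness is computed with a breadth-first search from every node (shortest paths of equal length share their count equally), ties for the highest score are broken by taking the first such edge, and the names GirvanNewmanSketch, edgeBetweenness, components and run are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch of the edge-removal procedure on an undirected, unweighted graph. */
final class GirvanNewmanSketch {

    /** Edge betweenness: for each edge, the number of shortest paths that run along it. */
    static double[][] edgeBetweenness(boolean[][] adj) {
        int n = adj.length;
        double[][] score = new double[n][n];
        for (int s = 0; s < n; s++) {
            int[] dist = new int[n];
            double[] paths = new double[n]; // number of shortest paths from s to each node
            double[] flow = new double[n];  // share of paths from s passing below each node
            int[] order = new int[n];       // nodes in breadth-first visiting order
            int head = 0, tail = 0;
            Arrays.fill(dist, -1);
            dist[s] = 0;
            paths[s] = 1;
            order[tail++] = s;
            while (head < tail) {
                int v = order[head++];
                for (int w = 0; w < n; w++) {
                    if (!adj[v][w]) continue;
                    if (dist[w] < 0) { dist[w] = dist[v] + 1; order[tail++] = w; }
                    if (dist[w] == dist[v] + 1) paths[w] += paths[v];
                }
            }
            for (int i = tail - 1; i > 0; i--) { // farthest nodes first; order[0] is s itself
                int w = order[i];
                for (int v = 0; v < n; v++) {
                    if (!adj[v][w] || dist[v] != dist[w] - 1) continue;
                    double share = paths[v] / paths[w] * (1 + flow[w]);
                    score[v][w] += share;
                    score[w][v] += share;
                    flow[v] += share;
                }
            }
        }
        for (double[] row : score)
            for (int j = 0; j < n; j++) row[j] /= 2; // every node pair was counted from both ends
        return score;
    }

    /** Labels every node with the index (0..k-1) of its connected component. */
    static int[] components(boolean[][] adj) {
        int n = adj.length;
        int[] label = new int[n];
        Arrays.fill(label, -1);
        int[] stack = new int[n];
        int count = 0;
        for (int s = 0; s < n; s++) {
            if (label[s] >= 0) continue;
            int top = 0;
            stack[top++] = s;
            label[s] = count;
            while (top > 0) {
                int v = stack[--top];
                for (int w = 0; w < n; w++)
                    if (adj[v][w] && label[w] < 0) { label[w] = count; stack[top++] = w; }
            }
            count++;
        }
        return label;
    }

    /** Runs the four steps and returns the component labels recorded after each split. */
    static List<int[]> run(boolean[][] graph) {
        int n = graph.length;
        boolean[][] adj = new boolean[n][];
        for (int i = 0; i < n; i++) adj[i] = graph[i].clone(); // leave the input untouched
        List<int[]> levels = new ArrayList<>();
        int parts = Arrays.stream(components(adj)).max().orElse(-1) + 1;
        while (true) {
            double[][] score = edgeBetweenness(adj); // steps 1 and 3
            int bu = -1, bv = -1;
            for (int u = 0; u < n; u++)
                for (int v = u + 1; v < n; v++)
                    if (adj[u][v] && (bu < 0 || score[u][v] > score[bu][bv])) { bu = u; bv = v; }
            if (bu < 0) return levels;               // step 4: no edges left
            adj[bu][bv] = adj[bv][bu] = false;       // step 2: remove the highest-scoring edge
            int[] label = components(adj);
            int now = Arrays.stream(label).max().orElse(-1) + 1;
            if (now > parts) { levels.add(label); parts = now; }
        }
    }
}
```

On a toy graph made of two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge 2-3, that edge carries all nine shortest paths between the two triangles, so it is removed first and the first recorded level separates the two triangles. Recomputing every score after each removal keeps the sketch close to the listed steps at the cost of speed, which is acceptable only for small graphs.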

Newman and Girvan defined the community detection procedure for a network with this algorithm. However, the steps above yield only a dendrogram of the network, and the first split produces only two major communities. In this paper, we define a new concept of detecting “sub-communities” in a network by applying the Newman-Girvan algorithm. Sub-communities can be detected within the two main communities. A sub-community, as presented here, contains two or more nodes, and all the nodes in a sub-community are densely connected to one another. The number of nodes contained in a sub-community depends on a threshold value supplied by the user.
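The text above does not fix how the threshold is applied, so the following Java sketch shows only one possible reading, stated here as an assumption: the threshold is treated as a maximum sub-community size, and the sub-communities are taken from the first level of the dendrogram at which no group exceeds it. The class SubCommunityCut, its method names and the input format (one array of community labels per split, coarsest first) are hypothetical.

```java
import java.util.List;

/** Picks sub-communities from a sequence of successive splits (illustrative sketch). */
final class SubCommunityCut {

    /** Size of the largest community in a labelling whose labels are 0..k-1. */
    static int largestCommunity(int[] label) {
        int k = 0;
        for (int l : label) k = Math.max(k, l + 1);
        int[] size = new int[k];
        int largest = 0;
        for (int l : label) largest = Math.max(largest, ++size[l]);
        return largest;
    }

    /**
     * Returns the first (coarsest) level in which no community has more than
     * maxSize nodes, or null if no level satisfies the threshold.
     * The threshold semantics are an assumption of this sketch.
     */
    static int[] cut(List<int[]> levels, int maxSize) {
        for (int[] label : levels) {
            if (largestCommunity(label) <= maxSize) return label;
        }
        return null;
    }
}
```

With a smaller threshold the cut moves further down the dendrogram and yields more, smaller groups; a lower bound of two nodes per sub-community, as required above, would need an additional check that this sketch omits.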

B. Data Set

We have tested the Newman-Girvan algorithm on three real-world networks: the Zachary Karate Club, the American College Football Network and the Bottlenose Dolphin Network. A brief description of each dataset is given below:

1) Zachary Karate Club: This network was compiled by Zachary [10], who studied the friendships among the 34 members of a karate club over a period of two years. During that period, disagreements caused the club to split into two groups of almost the same size. The original division of the club into two communities is shown in the results below. We have also identified the sub-communities within these two main communities.

2) American College Football Network: The American College Football network [12] is a real-world network consisting of 115 teams. The edges represent regular-season football games between the two teams they connect. The teams are divided into conferences, and teams play more frequently within their own conference. Twelve conferences, or communities, are defined in the network.


3) Bottlenose Dolphin Network: Lusseau et al. [11] studied the behaviour of dolphins and compiled the Bottlenose Dolphin Network in 2003. It consists of 62 bottlenose dolphins living in Doubtful Sound, New Zealand. An edge between two dolphins represents a statistically frequent association between them. The network divides into two large groups and has 159 relations, or edges.

C. Experimental Set-up

All the programs are coded in Java. Execution and testing were done on a machine with a 3.10 GHz Intel® Core™ i5 processor and 4 GB of memory.

IV. RESULTS AND DISCUSSIONS

A. Zachary Karate Club Network

Figure 2: Zachary Karate Club is divided into two communities.

Discussion: Most papers report that the Zachary Karate Club divides into two communities: the administrators of the club and the instructors of the club. A disagreement arose between the two communities, and the instructors left the club and formed a new one.

Figure 3: Zachary Karate Club is divided into five sub-communities.

Discussion: Five sub-communities are detected in the Zachary Karate Club, all of them derived from the two major communities.

B. College Football Network

Figure 4: College Football Network is divided into twelve sub-communities.

Discussion: The College Football Network is a real-world network of games played in the USA. In this network, each node represents a team and each edge between two nodes represents a game. There are twelve conferences in the network, as shown in the above figure.

Figure 5: College Football Network is divided into two major communities.

Discussion: Most papers report that the College Football Network divides into twelve communities. Here, we also show that there are two major communities, and that all the others can be regarded as sub-communities derived from these two.

C. Bottlenose Dolphins Network

Figure 6: Bottlenose Dolphins Network is divided into two communities.


Discussion: This network consists of 62 bottlenose dolphins, each of which is related to some other dolphins, as shown in the above figure. There are two major communities in the Bottlenose Dolphins Network.

Figure 7: Bottlenose Dolphins Network is divided into five sub-communities.

Discussion: This new result for the Bottlenose Dolphins Network shows five sub-communities, all of them derived from the two major communities.

V. CONCLUSIONS & FUTURE WORK

In this paper, we have presented an empirical study of the Newman-Girvan algorithm on several data sets. Our results differ from those presented earlier in that we have defined a new concept of sub-communities. The main drawback of the Newman-Girvan algorithm is the absence of a clear specification of what constitutes a community.

Much is left to individual interpretation. The problem grows many-fold for unsupervised datasets. The user has to identify the major communities manually from the dendrogram structure.

As future work, we hope to apply the concept of a multi-objective function to detect stable communities in social networks.

REFERENCES

[1] M. E. J. Newman, “Detecting Community Structure in Networks”, Eur. Phys. J. B, 38, pp. 321-330, 2004.

[2] M. E. J. Newman and M. Girvan, “Finding and evaluating community structure in networks”, Physical Review E, 69(2), 2004.

[3] L. Zhubing, W. Jian, and Li Yuzhou, “An Overview on Overlapping Community Detection”, The 7th International Conference on Computer Science & Education (ICCSE 2012), Melbourne, Australia, July 14-17, 2012.

[4] M. Girvan and M. Newman, “Community Structure in Social and Biological Networks”, Proceedings of the National Academy of Sciences, vol. 99, no. 12, pp. 7821-7826, June 2002.

[5] L. C. Huang, T. J. Yen, and S. C. T. Chou, “Community Detection in Dynamic Social Networks”, International Conference on Advances in Social Networks Analysis and Mining (ASONAM), July 2011, pp. 110-117.

[6] G. Kang, M. Diaz, T. Perennou, P. Senac, and L. Xu, “Mobility Model Based on Social Community Detection Scheme”, Cross Strait Quad-Regional Radio Science and Wireless Technology Conference, 2011.

[7] J. Jin, L. Pan, C. Wang, and J. Xie, “A Center-based Community Detection Method in Weighted Networks”, 23rd IEEE International Conference on Tools with Artificial Intelligence, 2011.

[8] A. Clauset, M. E. J. Newman, and C. Moore, “Finding community structure in very large networks”, Phys. Rev. E, 70, 066111, 2004.

[9] G. J. Qi, C. C. Aggarwal, and T. Huang, “Community Detection with Edge Content in Social Media Networks”, IEEE 28th International Conference on Data Engineering, 2012.

[10] W. W. Zachary, “An information flow model for conflict and fission in small groups”, Journal of Anthropological Research, 33, pp. 452-473, 1977.

[11] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. Slooten, and S. M. Dawson, Behavioral Ecology and Sociobiology, 54, pp. 396-405, 2003.

[12] M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. USA, 99, pp. 7821-7826, 2002.