Transcript
Page 1: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review

Analysis and mining of onlinesocial networks: emerging trendsand challengesSajid Yousuf Bhat and Muhammad Abulaish∗

Social network analysis (SNA) is a multidisciplinary field dedicated to the analysisand modeling of relations and diffusion processes among various objects in natureand society, and other information/knowledge processing entities with an aim ofunderstanding how the behavior of individuals and their interactions translatesinto large-scale social phenomenon. Because of exploding popularity of onlinesocial networks and availability of huge amount of user-generated content, thereis a great opportunity to analyze social networks and their dynamics at resolutionsand levels not seen before. This has resulted in a significant increase in researchliterature at the intersection of the computing and social sciences leading to severaltechniques for social network modeling and analysis in the area of machine learningand data mining. Some of the current challenges in the analysis of large-scale socialnetwork data include social network modeling and representation, link mining,sentiment analysis, semantic SNA, information diffusion, viral marketing, andinfluential node mining. © 2013 John Wiley & Sons, Ltd.

How to cite this article:WIREs Data Mining Knowl Discov 2013, 3:408–444. doi: 10.1002/widm.1105

INTRODUCTION

The social world is a network of interactions andrelationships that facilitates the flow and exchange

of information and resources like norms, values,and ideas among individuals.1 Such a view of thesocial world can be treated as a social network, andit can be defined as a social structure representedby a set of nodes and their interrelationships,generally called ties. A node in a social network isusually called a social actor, and may represent aperson, group, document, organization, or nation.A relation between a pair of nodes represents theirties reflecting friendship, kinship, dislike, commoninterest, acquaintance, financial exchange, physicalconnection, hyperlink, or colocation. Social networkanalysis (SNA) is one of the important techniquesused in the field of sociology and also finds significantapplication in anthropology, biology, communication,

∗Correspondence to: [email protected]

Department of Computer Science, Jamia Millia Islamia (A CentralUniversity), New Delhi, IndiaConflict of interest: The authors have declared no conflicts ofinterest for this article.

economics, geography, and social computing.2,3 Theincreasing popularity of social networks is largelydue to their relevance to various processes takingplace in society, such as spread of cultural fadsor diseases, formation of groups and communities,and recommendations. The process of SNA andmodeling facilitates to understand how the behaviorof individuals and their interactions translates intolarge-scale social systems. Recently, the application ofSNA and social network concepts to a wide domainof research interests has gained huge popularity.For example, an application of SNA for transportplanning is demonstrated in Ref 4. Hulst5 highlightedthe importance and applications of SNA for dealingwith organized crime in adversary networks. Withthe variety of data such networks provide, the SNAtasks applicable (but are not limited to) includecommunity (gang) identification, sentiment analysisand opinion mining, node influence analysis, andlink prediction. The work of Lewis et al.6 signifiedthat the clusters or community analysis in proteininteraction networks using SNA techniques canhighlight functionally coherent groups of proteinsand predict the level of function homogeneity within

408 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 2: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

these protein groups or communities. Similarly,Swan et al.7 throw some light on the importanceof community-wise management of knowledge andhow managed intracommunity and intercommunity(across structural holes) interactions lead to significantinnovations. Using SNA for perceiving, controlling,and distributing strategic and domain knowledge inorganizations8 and collaborative distance learning9

is also promising, considering the improvement ofthroughput for such systems.

Bonchi et al.10 presented a state-of-the-artsurvey on the business application of SNA in anonline social network (OSN) environment. Theyhighlighted the exploitation of various SNA conceptslike social contagion, influence, communities, andranking for facilitating many business challengessuch as marketing, customer service, and managingresources (financial, human, and knowledge). Onthe other hand, this article aims to present someimportant challenges of SNA in a generic domain,and it highlights how OSNs facilitate the analysis andunderstanding of such challenges by providing a spec-trum of opportunities and data that are highly relatedto the real world. A huge amount of literature alongthe direction of SNA exists. Most of them concentrateon some specific aspect of the social networks (e.g.,community detection and information diffusion) inisolation or are mostly oriented toward sociologicalaspects relieving computer science. However, thepresent multidimensional OSNs provide a means ofstudying various data mining tasks related to SNA ina unified framework where each of them can benefitfrom the others. In this regard, we present a survey onthe current state-of-the-art and challenges related toSNA in the light of OSNs and also present the designof a possible unified framework for some major tasksrelated to SNA.

This article introduces some of the recent datamining tasks targeting social networks, taking intoconsideration data about the structure of socialnetworks. Different data mining tasks and techniquesrequire that the social networks be represented andmodeled according to the needs of the analysisbeing performed in the task. As a result, variousalternatives for social network modeling exist and canbe categorized into different groups depending uponthe techniques used and the level of analysis sup-ported. Some of the popular social network modelingtechniques have been reviewed in Social NetworkRepresentations section. For a detailed descriptionof the graph-theoretic properties and social networkmetrics that form a basis for almost all kind of anal-yses of social networks, readers should refer to Refs 2and 11–13.

SOCIAL NETWORK PROPERTIESAND DATA

Network systems have been traditionally consideredto be random structures and despite links beingconsidered to occur at random between nodes,most nodes were expected to have almost the samedegree. However, significant contributions made byresearchers14,15 revealed that the vertex connectivityof large-scale real-world networks actually follows ascale-free power-law distribution. That is, for largevalues of k, the fraction P(k) of nodes having kconnections to other nodes in the network follows therelation as shown in (1), where c is a normalizationconstant and γ is a parameter usually ranging between2 and 3.

P(k) ∼ ck−γ . (1)

The growth of such networks involves a rich-get-richer scheme [preferential attachment (PA)] whereinnew nodes have a higher probability to link to nodesof higher degree, or it can be stated that the likelihoodof a node acquiring a new link is in proportion to thenode’s degree.16

Some special properties of social networks thatdifferentiate social networks from other networks ashighlighted by Newman and Park17 are as follows:first, social networks show positive correlationsbetween the degrees of adjacent vertices (assorta-tivity). More specifically, vertices of similar degreetend to be connected more with each other thanwith others. Second, social networks have nontrivialclustering or network transitivity. This propertymakes the networks to exhibit community structures,i.e., clusters of vertices or nodes that are more similaror connected within the group than to the rest ofthe network. Communities in social networks oftenmap to important functional or interest groups ofthe underlying nodes and designing methods andtechniques to identify them is a challenging task.

Milgram18 showed that in a well-definedpopulation the average path length between twoindividuals so that they can meet each other wassix hops, demonstrating that social networks can beclassified as small-world, which led to the famousphrase ‘six degrees of separation’. Another famousconcept of sociology is the weak link hypothesisby Granovetter,19 according to which the degree ofoverlap between the friend neighborhoods of twoindividuals is observed to increase as a function of thestrength of the tie connecting these two individuals.This specifies that strong ties are tightly clustered,whereas the weak ties represent longer distance rela-tionships, thus playing an important role for the flow

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 409

Page 3: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

of information and innovation. This phenomenonis known as the strength of weak ties. On the basisof this concept and the existence of groups in socialnetworks, Burt20 defined structural holes as the topo-logical scarcity or the weakness of links between thegroups in a social network. In terms of the productivityperspective of organization control, structural holesappear to provide an opportunity of the brokerage ofinformation flow between different working groups.This in turn provides an opportunity to control theprojects on which various groups across a structuralhole work. Within groups, opinion and behavior tendto be more homogeneous than otherwise. Thus, indi-viduals who connect groups across structural holesoften tend to have alternative ways of thinking andapproaching a problem as they are possibly exposed tomultiple activities. Brokerage between groups acrossthe structural holes highlights alternative options thatotherwise remain unexplored and based on this prop-erty, brokerage across a structural hole becomes socialcapital.21 Burt22 signifies that an increase in the num-ber of individuals performing the same task decreasesthe value of social capital and thus peers tend tolose the value of social capital to individuals (oftenmanagers) who have a very less number of peers.

The theory proposed by Granovetter19 related tothe strength of weak ties highlights an important prop-erty of contagion, i.e., diffusion of diseases and inno-vations (belief, ideology, norm, technology, organiza-tional form, fad, or fashion) in social and informationnetworks through physical and/or virtual contacts.That is, the reach of a diffusion process (social distancecovered) is significantly higher if it is passed throughweak ties rather than strong. Moreover, small-worldnetworks composed of a few long intercommunityties between tightly clustered communities facilitatingrapid diffusion of information and disease.

Social Network Data SourcesFor analyzing social networks, the primary require-ment is the access to social network data, which canbe viewed as a social relational system characterizedby a set of actors and their social ties.13 Additionalinformation in the form of actor attributes or multiplerelations can be a part of the social relationalsystem. Traditional sources of social network dataincluded questionnaires, interviews, and observations.However, acquiring data through these sources islaborious and costly, and restricts the analysis to asmaller number of individuals, resulting in possiblysignificant individual biases. With the availability oflarge electronic datasets like the e-mail networks andtelephone call graphs, and efficient computational

resources, in the late 1990s, physicists entered the fieldof social networks (and complex networks in general)with a concern of analyzing topological properties ofnetworks, developing new concepts, algorithms, andmodels. The advantage of such electronic datasets isthat they are large, relatively easy to process, and accu-rate in the sense that subjective biases are absent.23

Wu et al.24 argued that besides proxy datasetslike e-mail networks, face-to-face (F2F) interactionsalso remain a powerful conduit for informationexchange, especially for complex or tacit information.They incorporated wearable sociometric badgesthat can collect and analyze behavioral data fromindividuals over time by detecting people in closeproximity, capturing F2F interaction time, andrecording tonal variation and prosody using amicrophone. Alternatively, Eagle et al.25 showed thatdata gathered from the usage of mobile phones can beused to produce a significant insight into the relationaldynamics of user behavior. Besides communicationinformation presented by communication networkslike e-mail networks and telephone call graphs, theinformation provided by mobile phone data alsospans to the behavior, location, and proximity ofmobile phone users using GPS, Bluetooth, cell towerIDs, and application usage. With an aim of comparingthe user behavior represented by the data collectedfrom mobile phones with user-reported data collectedfrom direct user survey, the analysis of Eagle et al.25

highlighted that the former, as a complement tothe latter, provides significant insights into purelycognitive constructs, such as friendship and individualsatisfaction besides observable behavior. Such datacan easily be mapped to the actual F2F interactionsand relations between individuals.

The Online Social Network BoomThe growth of the World Wide Web (WWW) hasled to the evolution of different types of informationsharing systems, which include OSNs like Facebook,MySpace, Flickr, Last.fm, Digg, Bebo, Orkut, hi5,LinkedIn, LiveJournal, and Twitter. In the recentyears, OSNs like Facebook have achieved significantpopularity and now represent the most popular websites. OSNs provide individuals a means of joininga network (become users), provide information thatcould define them or their preferences (profile), andenable them to publish any content that they like toshare with other users. One of the important andoutstanding features provided by these OSNs is toenable users to create links to other users with whomthey associate. These user-centric features of OSNsenables a user to define and maintain social relations,

410 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 4: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

find and link to other users with similar interestsand preferences, and share, find, and endorse contentand knowledge contributed by a user itself or byother users.26 Considering the extreme popularity,huge membership, and the enormous amounts ofsocial network data generated by these OSNs, thereexists a unique opportunity to study, understand, andleverage their properties. An in-depth analysis of OSNstructure and growth can not only aid in designing andevaluating current systems but can also lead to betterdesign of future OSN-based systems and to a deeperunderstanding of the impact of OSNs on society.

The size of social networks is growing every dayand the huge amount of data being produced by themis obviously leading to information explosion in thearea of analyzing social networks. This necessitates theapplication of computational techniques to analyzethe structure and nature of such complex networksmore efficiently and accurately. Along with thesociologists developing SNA methods to discover theproperties of social networks, computer scientists aredeveloping and applying data mining techniques todiscover hidden patterns from social network data.

SOCIAL NETWORKREPRESENTATIONS

This section presents some of the basic techniquesused for representing social networks. The amountof knowledge gained from a social network and thelevel of analysis that can be efficiently performed onit often depend upon the scheme used for representa-tion. Representation issues are some of the preliminaryissues faced. Certain representations naturally allowmathematical analysis, whereas some stress only ontheoretical exploration and reasoning. The usage ofa particular representation scheme depends upon theproblem at hand and the nature of tasks and opera-tions that need to be performed on the network, andthe level of characteristic details that need to be visu-alized about the network. Some of the common repre-sentations are presented in the following subsections.

Node-Link RepresentationGraph-based models have been extensively used toanalyze social networks by considering different typesof graph such as undirected,27–29 directed/weighted,30

and bipartite.31 Graph-based social network modelsconsist of nodes to represent actors, and links to rep-resent ties or relations. Sociologists refer to such graphrepresentations of social networks as sociograms.Rendering a sociogram along with a summary ofgraph-theoretical concepts for visualization provides

a basic description of social network data. However,this might suffice for small graphs, but usually thedata and/or research questions are too complex forthis relatively simple approach.32 Alternatively, manynode-link variants in two and three dimensions havebeen experimented in the information visualization(InfoVis) communitya however, for large graphs orgraphs with high link density all these visualizationtechniques show highly overlapping edges resultingin occlusion. This makes it very difficult for any usersto gain a visual picture of a graph, or to select or finda particular node or edge in a graph.33

Matrix RepresentationThe matrix representation as used by manyresearchers11,34,35 to analyze social networks involvesrepresenting numerous important social networkactor-based activities and concepts like interactions,friendship, citations, community subscriptions, infor-mation access, and interests in a matrix form. Themost basic form of the matrix representation isbinary and is called the adjacency matrix that uses1 for an existing conceptual link between two objectsand a 0 for a nonexisting link. Other matrix rep-resentations may involve weights or intensities torepresent the importance or priority of the linksor their corresponding relationships. The Laplacianmatrix L = (lij)nxn sometimes called admittancematrix or Kirchhoff matrix is a positive semidefinitematrix representation of a graph as shown in Eq. (2),where deg(vi) is the degree of the node vi.

li,j =

⎧⎪⎨⎪⎩

deg (vi) if i = j−1 if i �= j and vi is adjacent to vj

0 otherwise

.

(2)

It is clear from Eq. (2) that the Laplacian matrixof a graph is simply taken as its degree matrix minusits adjacency matrix. The Laplacian matrix has atleast one zero eigenvalue, and the number of sucheigenvalues is equal to the number of disjoint parts inthe graph. A normalized Laplacian matrix, as shownin Eq. (3), has also been considered by the researchers.

li,j =

⎧⎪⎪⎨⎪⎪⎩

1 if i = j and deg (vi) �= 0− 1√

deg(vi) deg(vj)if i �= j and vi is adjacent to vj

0 otherwise

.

(3)

The normalization factor means that the largesteigenvalue is less than or equal to 2, with equalityonly when the graph is bipartite. Eigenvalues ofthe graph are called graph spectra and they give

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 411

Page 5: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

information about some basic topological propertiesof the underlying graph. Laplacian spectra of networkshave been investigated enormously to understand thesynchronization of coupled dynamics on networks.36

Graph representations using matrices providean efficient alternative to the traditional node-linkdiagrams33 and the effectiveness of various matrixrepresentations is significantly highlighted in Ghoniemet al.37 Ghoniem et al.33 have argued that eventhough matrices form a quick representation andprovide higher resolution readability for variousnetwork analysis tasks involved, the matrix-basedrepresentation appears underexploited. An increasedfamiliarity of matrix representation, owing to itswider use, has resulted in an improvement in itsreadability. On the other hand, path-related tasks arestill challenging on matrix representations. To addressmatrix-based graph representations’ weaknesses onthe path-related tasks that are important for SNA,Henry and Fekete38 have proposed enhanced matrixvisualization for graphs, MatLink, wherein node linksare overlaid on a matrix representation at its edges.It supports a fast and effective visualization of pathrelationship between nodes using this representationby simply highlighting nodes using a mouse pointer.Node-link and matrix social network representationsare the most popular and basic ones being used.

Semantic Representation of Social NetworksTim Berners-Lee (the inventor of the WWW)envisioned the Semantic Web, wherein the basicconcept is to enable sharing of data across widercontextual communities by enriching the Web withmachine-lucid information for both its automaticand manual processing.39,40 The core concept ofSemantic representation of data is Ontology, which isdefined as a formal, explicit specification of a sharedconceptualization in the form of an explicitly specifiedvocabulary which describes the various aspects of thedomain being modeled.39,41

Gruber42 advocates for the possibilities andsignificance of combining Semantic Web frameworkwith social media domain, with an aim of developingcollective intelligence and knowledge systems. Simi-larly, Downes43 stresses on the need of semantic socialnetworks for effective information retrieval (IR), likepersonalized and efficient content search. Accordingto Breslin and Decker,44 the Semantic Web providesthe representation and navigation mechanisms bylinking people and objects to record and represent theheterogeneous ties that bind them to each other. Thisarticle considers the fact that OSN data can be seen asa twofold structure: data describing the social network

structure and data describing the content producedby network members. Some of the ontologies thatexist for representing this view of OSNs include:

• Friend-of-a-Friend (FOAF):b This ontology isused to describe people, their relationships, andtheir activity. It includes a large set of propertiesto describe a user profile (family name, nick,interest, and so on), web usages (online accounts,weblogs, memberships, and so on), relationshipsto connect people (knows), and so on.45

• Relationship:c This ontology specializes theknows property of the FOAF ontology byincluding properties to more precisely define thetype of relationship between people.45

• Semantically Interlinked Online Communities(SIOC):d This ontology is aimed at linking dis-cussion posts to other related discussions, people,and topics emerging on platforms such as blogs,message boards, and mailing lists44 by includ-ing properties that specialize OnlineAccount andHasOnlineAccount from FOAF.46 Besides con-ventional discussion platforms, SIOC is evolvingto describe new web-based communication andcontent-sharing mechanisms.44

• Simple Knowledge Organization Systems(SKOS):e This ontology involves organizingknowledge in a hierarchical manner (e.g.,narrower, broader, and related) and to link it toSIOC descriptions using isSubjectOf property.45

The most popularly used ontology to representthe network of people in case of OSNs is theFOAF. Initially, the FOAF data were produced byhand wherein the interested users would create afile that contained personal data, including e-mailaddresses, location, interests, and a list of friends,and make it available as a web-accessible resource.These early FOAF implementations suffered frominconsistent tag usage and occasional lack of clarityin the specifications and often lead to extending theworking schema of FOAF whenever new and efficientconceptualizations were encountered.

One of the challenges faced by Semantic Webtechnology is ontology learning or extraction whereinthe attempt is to automatically recreate a conceptualmodel/vocabulary from existing knowledge sources,in particular natural text. Mika47 signifies thatcommunities in social networks create their ownontologies that reflect their identities, languages, andcollective intelligence through interactions on someparticular interests. These ontologies can help toannotate the content created by the individuals in

412 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 6: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

their respective communities. Mika47 also visualizesa three-layered architecture of the Semantic Webmapping to communities, ontologies, and content.This concept of ontology learning is basically relatedto the structure of OSNs, which not only allow usersto publish resources but also allow them to annotatethese resources by assigning short descriptive tags tothem. Since a consensus of community users is definingthe meaning for a resource (content), these social tagsrepresent objects around which those users form moretightly connected social networks.44 This neologismfor collaborative categorization of objects on OSNsites, including web blogs and social bookmarkingservices like Delicious, using freely chosen keywordsis termed as folksonomy.48 A related problem toautomatic ontology extraction for Semantic Web isautomatic semantic annotation of web content. Inthis regard, Wu et al.49 show how folksonomy canbe exploited to infer a global semantic model forsemantically annotating web resources. Folksonomiescan be improved by adding semantics that structureand link tags together. The proposed semanticmodel also aims in disambiguating tags and groupsynonymous tags together in concepts and can be usedto efficiently search and discover semantically relatedweb resources. The main issue related to folksonomy,which the earlier mentioned works attempt to relateto, is that tags lack an explicit structure and mostly donot relate to each other semantically. Folksonomieshave generally been considered of consisting of twolevels of descriptive tags: one that tends to describeabstract concepts and can be understood by manyand the other level represents tags that have aspecific meaning within a particular context andcommunity of people. This issue has been mostlyfaced with approaches to enrich folksonomies bybridging them with formal ontologies. The mainproperty of folksonomies is that they consist of atriadic structure wherein people associate tags toresources. Thus, adding structure to the tags (e.g., linktags to domain ontology) implicitly adds structureto the set of users according to the structure oftags. For example, Mika48 applies SNA on differentprojections of the tripartite structure of folksonomiesgrouped similar communities of interest, i.e., groupsof people sharing common tags. This in turnyields subsumption properties between the tags. Thesemantic knowledge of folksonomies can also guidein supporting folksonomy-based social platforms.45

Ereteo et al.45 have proposed a Semantic WebSNA framework based on extending the existingsocial data representation ontologies to SemSNAontology. Their main aim is to enhance socialnetwork representations by combining the structure

and content of networks by annotating the semanticsocial network with important social network indicessuch as centrality and community membership. Onthe basis of SPARQL formal definitions of SNAoperators they compute semantically parameterizedSNA features to annotate the graph nodes andrecord the results. Their analysis suggests that thesemantic representations of social networks have asignificant effect on connecting and exchanging thesocial data and the knowledge embedded in the socialnetwork. The semantic representation provides a richdomain for representing the complete topologicaland metainformation describing real-world socialnetworks like OSNs along multiple dimensions.Designing novel semantic framework(s) that allow thedynamic representation and analysis of OSNs appearsto be a perfect future goal along this direction.

SOCIAL NETWORK MODELS

Statistically, social networks involve representing aset of n objects and their relationships using an n × nadjacency matrix X. Each entry xij in X representsa binary value (either 0 or 1) to record the presenceor absence of relation between objects i and j, e.g.,existence of a friend relation between two individualsrepresented by i and j. In a general representation, xij

can be a discrete value representing the intensity orstrength of a relation between objects i and entity jon a particular scale. Moreover, an object O can havea set of attributes or properties A = {a1, a2, . . . , am}that define the object, e.g., demographic propertiesof an individual. In this regard, a traditional goal isto learn a model that could explain the probabilityof the existence or absence of a relation betweenobjects based on the set of their attributes O and thetopological properties of individual objects (nodes) inthe underlying network.50

Considering the complex and irregular structureof social networks, naturally all models thatbear resemblance to real social networks involverandomness in network construction rules.23 The mostcommon categorization of social network models istopological models and distance-based models. Thedifference being that the former utilizes only thenetwork structure in its rules, whereas the latterassigns an intrinsic (random) coordinate for eachnode, and closer nodes are more likely to be linkedto each other than distant nodes. In distance-basedmodels the node coordinates can be interpreted as realgeographic coordinates or alternatively as coordinatesin an abstract social space, which may representhobbies, opinions, occupation, etc.51 Topologicalmodels, on the other hand, try to model real networks

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 413

Page 7: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

by basing the network construction rules solely on thenetwork topology and can be further classified basedon whether the representation model incorporatesdynamic behavior (evolving nature) of the socialnetworks or not, i.e., whether the model is staticor dynamic. Early researches in SNA have primarilyfocused on the static properties of these networks,neglecting the fact that most real-world interactionnetworks are dynamic in nature, but recent workshave considered the dynamic nature of social networksto find evolving trends and patterns in them.52

Spiliopoulou53 presents a comprehensive survey on thedynamic community-based models for social networkevolution. It is desirable to identifying the sectionsof a network that tend to mutate, characterizing thetype of mutation, and predicting future mutations orevents (e.g., link prediction and predicting split ormerger of a network or its parts). In other terms,developing generic models for evolving networks is achallenge, which needs to be addressed. For instance,the rapid growth of online communities has dictatedthe need for analyzing large amounts of temporaldata to reveal community structure, dynamics, andevolution.54 Jamali and Abolhassani32 have classifiedthe social network models by considering formalmethods55 for representing social networks. Theyclassify social network models based on (1) descriptivemethods and graphical representations, (2) analysisprocedures, often based on the decomposition ofadjacency matrix, and (3) statistical models basedon probability distributions. Here, we have used asimilar classification to make a distinction betweenvarious social network models. In addition, we alsopresent a review related to the models that havegained significant importance recently, which includeadaptive and game-theoretic network models. Someof the popular statistical models for social networksare classified and explained below.

Exponential Random Graph Modelor p* ModelExponential random graph models (ERGMs)56,57 orthe p* models are built on the idea of p1 and p2 modelsand are considered as a promising way to modelnetwork structure via a series of substructures forcross-sectional data, and provide a general frameworkfor descriptively modeling a static network. The p1model58 defining the first probability distribution forbinary dyadic data [i.e., data composed of two setsof objects A and B, in such a way that observationsare basically observations of couples (a, b), with a ∈A and b ∈ B] is a model for the four possible dyadicoutcomes, one mutual, one null, and two asymmetric,

where the data (adjacency) matrix, x, is a realizationof a random matrix X, in which each dyad, Dij = (Xij,Xji), is an independent bivariate random variable withpossible values as given in Eq. (4). The p1 modeldefines the nature of ties to be reciprocal and includesparameters that aim to define the tendency of anindividual to extend and accept ties in the network.However, p1 models are restrictive as they assume thedyads to be independent.

Dij =

⎧⎪⎨⎪⎩

(1, 1) i links with j and j links with i(1, 0) or (0, 1) i links with j or j links with i(0, 0) otherwise

.

(4)

The p2 model,59 in turn, is a random effectvariant of the p1 model in which the sender andreceiver parameters are modeled as correlated randomeffects, a formulation that makes it possible to includeactor and dyad-specific covariates as fixed sender,receiver, density, or reciprocity regression parameters.The p1 and p2 models for network structure focuson the dyads in the network. However, triangles (ortriads) are important for several reasons to analyzesocial networks as given in Ref 60. The idea ofMarkov Graphs,61 which is an extension to the p1model, allows for triads in the network through thenotion of conditional dependence.

Considering X as a random graph with N nodes,xij as a random variable representing the possibilityof a tie between nodes i and j, and χ as the set ofall such graphs, the ERGMs are defined based on thelikelihood of the occurrence of such graphs as givenas follows:

Pθ ,x (X = x) = exp{θ tu (x)

}c (θ , χ)

. (5)

Here, the parameters for the model are specifiedin the form of vector θ and the function c(θ , χ)normalizes the value in the range of [0, 1] such thatthe sum of all observed values yields 1. The nature ofthe ERGM model is defined with the statistics specifiedby the vector u(x) defining a particular realization xof a network.62 A detailed description of these modelscan be found in Ref 63.

Hanneke and Xing64 have extended EGRMs totemporal EGRMs for modeling networks evolvingover discrete timestamps. However, one of the majorproblems associated with these models is that ofinferential degeneracy (tendency to converge to eitherempty or complete graphs), as analyzed in Ref 65.New specifications for the ERGM proposed in Ref 66attempt to find a solution for the degeneracy via adifferent parameterization of the models. In Ref 67,

414 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 8: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

the authors have reviewed these new specificationsand their experiments suggest that the resulting graphstend to be more realistic and that the near degeneracyproblem is avoided particularly in networks that showhighly transitive relations.

Preferential Attachment ModelsIn 1999, Barabasi and Albert noticed that real-world social networks showed power law degreedistributions, i.e., networks that grow from a smallnucleus of nodes and follow a ‘rich-get-richer’ schemein terms of node degrees. They are also called scale-free random graphs. Barabasi and Albert14 describeda dynamic PA model specifically designed to generatescale-free networks and model the network growth asfollows: from any current state of a network with n0nodes, each subsequent time step adds a new nodewith m ≤ n0 edges. The probability Pi that the newnode will be connected to an existing node i dependson the degree deg(i) of i according to the multinomialdistribution of Eq. (6).

Pi = deg (i)∑j

deg(j) . (6)

The PA model of Barabasi and Albert resultsin a network with a power law degree distributionwhose exponent is empirically determined to beγ BA = 2.9 ± 0.1.68

Small-World ModelsThe Watts–Strogatz (WS) small-world model69 forlarge networks incorporates some of the importantcharacteristics of real-world networks includingtransitivity, limited degrees, and limited path lengths(geodesics). By taking a one-dimensional lattice ofL vertices with each vertex being connected withits 2k nearest neighbors and periodic boundaryconditions (the lattice is a ring), the model proceedswith ‘rewiring’ each bond independently with someprobability ϕ. The probability ϕ is varied such thatthe transition between order (ϕ = 0) and randomness(ϕ = 1) can be monitored closely. Rewiring here relatesto relocating one end of a bond to a differentvertex selected at random from the whole lattice,with conditions that there cannot be more than onebonds between two vertices and no loops can occur.This process generates ϕLk/2 long-range edges whichconnect nodes that would in otherwise belong todiscrete neighborhoods. The behavior of the networkthus depends on three independent parameters: L,k, and ϕ. In this model, the average coordination

number z remains constant (z = 2) during the rewiringprocess, but the coordination number of any particularvertex may change. The total number of rewiredbonds, which are referred to as ‘shortcuts’, is ϕLon average. Newman and Watts70 pointed out thatthe distribution of shortcuts in the WS small-worldmodel is not completely uniform, which makes anaverage over different realizations of the randomnesshard to perform. Furthermore, the average distancebetween pairs of vertices on the graph is poorlydefined as there is a finite probability of a portionof the lattice becoming detached from the rest inthe model. Newman and Watts70 propose a variantof the WS small-world model that overcomes theselimitations. In this model, a new edge is inserted fora pair of nodes with a probability P without breakingany bond between any two nearest neighbors andalso does not form isolated clusters. However, theconditions to form a bond (edge) are same as forthe previous model. When P = 0, the Newman andWatts (NW) model forms a nearest-neighbor couplednetwork, and for P = 1, it forms a globally couplednetwork. Another variant proposed by Kleinberg71

starts with an underlying finite-dimensional grid andadds shortcut edges where the probability that twonodes are connected by a long edge depends onthe distance between them in the grid. That is, theprobability that two nonadjacent nodes x and y areconnected is proportional to d(x; y)−α. With α setto the dimension of the lattice, the greedy routingalgorithm can find paths from one node to another ina polylogarithmic number of expected steps.

Game-Theoretic ModelsIn an economical perspective of social networks, socialactors tend to connect to each other with an aimof gaining incentives, payoffs, or utilities from theconnections they establish.72 Moreover, forming alink may incur a cost and thus social actors can be seento form links strategically in a social network.73 Incase a social actor gains incentives that result from theactions of all the social actors in a social network, thenetwork formation and growth can be appropriatelymodeled using the concept of game theory.72 Gametheory deals with modeling situations wherein twoor more decision makers (players) seek to makemutually influencing decisions. For an introductionto game theory, the reader is referred to Ref 74. Ina basic sense, a game consists of the players decidingupon the next step(s), rules or strategies accordingto which the steps/decisions are taken, outcomesresulting from each possible decision taken by a player,and preferences that a player expects or prefers out ofthe possible outcomes.

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 415

Page 9: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

A network formation game is analogouslydefined as a 3-tuple <N, (Si)i∈N, (ui)i∈N>, where Nis the set of social actors (players), Si is the set ofstrategies for each player i wherein a strategy si ∈ Si

of a player i is the set of other players with which iwants to form a link, and ui is the incentive/utilityof player i which depends on its neighborhoodand the structure of the network.75 Initial game-theoretic models for network formation include that ofAumann and Myerson,76 where node pairs are givenordered opportunities to establish permanent linkswith mutual consent. The procedure is repeated untilthe remaining pairs reject these opportunities.72 Abetter approach was later proposed by Myerson,77 inwhich all the nodes (players) announce their strategiessimultaneously such that a link I,j is established onlyif I ∈ sj and j ∈ si.

78 Some other models of networkformation strive to maintain the stable state of anetwork directly wherein stability is defined in terms ofstrategic equilibrium. For example, Nash equilibriumdefines a network to be in stable state if no node formsor deletes a link to another node unilaterally. Similarly,pairwise stability defines that no node gains anincentive on deleting a link and that no pair of nodesexit in a network that want to form a mutual linkresulting in a mutual incentive.72,78 Other alternativesinclude the strong stability, wherein the changesinduced by a set S of nodes without the consentof the nodes /∈ S are such that any new links are onlyadded within the set S and that at least one node ofany deleted link is in S.78 Similarly, certain economicgame models of network formation aim to maximizethe efficiency of the network, i.e., the function of theutilities of the nodes. For example, Pareto efficiencyrequires that for a network state g there should exist nonetwork state g′ where the utility of a node is greaterthan it has in g and the utilities of the remaining nodesare greater than or equal to that they have in g.75

Recently, many studies related to game-theoreticeconomic models were made and new related modelswere proposed.79,80 Vallam et al.81 propose agame-theoretic model for network formation, whichinvolves localized payoffs and results in efficient pair-wise stable networks. Recently, evolutionary gametheory has also found its applications in the dynamicanalysis and formation of social networks. Evolu-tionary game theory on the dynamics of populationsmainly involves the concept of (1) evolutionary stablestrategy (ESS) and (2) analysis of the frequency ofdifferent strategies.82 A population is in an evolution-ary stable state when all members adopt an ESS suchthat small occurrences of mutant strategies are notlong lasting and vanish soon. Some studies along thisdirection include Ref 83, which models the network

dynamics in terms of the strength of link changesresulting from the repeated games played by the socialactors. They also highlight that the nodes in dynamicnetworks tend to cluster and show cooperativestrategies within them. Bala and Goyal84 highlightthat the mutual or one-sided incentives for a node tomake and maintain links cause the network to quicklyreach equilibrium. Although there exist in literaturesome attempts to model network formation usinggame theory, we argue that they are more centeredon certain economic objectives. In terms of social net-work context, this line of work needs more attentionin the near future as the game-theoretic formulation ofa social network appears highly promising, naturally.

Adaptive Network ModelsThe network models discussed so far, including thePA models and the small-world models, are concernedwith the dynamics of networks that deal with thetopological dynamics of a network, i.e., the growthor change in the topology of a network over timeaccording to simple local evolution rules, whichmimic the natural process of network formation.In addition to this line of work, the issue ofdynamics on networks deals with the analysis ofthe state transition of nodes in a network (e.g., thetransition of a person from infected to susceptibleand vice versa in a contagion network, or increasein the volume of interaction between the connectedindividuals in a communication network). The basicassumption made here is that the network topologyis static. The key ideas of studies along this directioninclude the formation of opinions and diffusion ofinformation, which is discussed later in this article(section Diffusion and Influence in Social Networks).

It is, however, a notable fact that in real-worldnetworks including social networks, the networkstates and topology coevolve. We can say that theevolution of the network topology is linked to thenetwork state and vice versa wherein there exists afeedback loop such that the characteristics of onenetwork dimension define the changes in the other.For example, the topology of a road network definesthe flow of traffic, whereas the rate and nature ofthe flow of traffic defines the topological changesthat need to be made in the road network (buildingnew roads connecting key points and blocking someexisting roads). Networks that exhibit such feedbackloops in their nature are called coevolutionary oradaptive networks.85 The study of adaptive networksis a young field and has only recently gained attention.Here, we present a brief summary of some of thestudies that aim to model the formation and growthof adaptive networks in nature and society.

416 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 10: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

One of the preliminary adaptive networkmodels was proposed by Christensen et al.,86 whichdescribes the evolution of an ecological network wherenodes (populations) are linked through ecologicalinteraction and the state of each node is defined interms of its evolutionary fitness. The evolution of thenetwork is modeled as the replacement of a node witha population of a new species having random fitness,which in turn also affects the fitness of the neighboringnodes and the local topology. Similarly, Gross et al.87

model an adaptive network using a disease epidemic[susceptible-infected-susceptible (SIS)] model whereina susceptible node can isolate itself from an infectednode by rewiring its links to other susceptible nodes,which is shown to have a strong effect on the dynamicsof the disease which in turn defines the rewiringprocess itself. Recently, Tero et al.88 proposed anadaptive network model inspired from the behaviorof a single-celled amoeboid organism that forages forpatchily distributed food sources and constructs atubular network linking the discovered food sources.Their model emulates this network construction basedon the feedback loops between the thickness of eachtube and internal protoplasmic flow. For more insight,the reads is referred to the original article.

Adaptive network models appear to be thefuture hot topic for studying and analyzing theformation and growth of social networks, especiallythe OSNs that present a unique opportunity toanalyze both the dynamics of and on the networks.The adaptive OSN formation and growth modelscan exploit the feedback loops that exist between thestates, which include discussion topics, user opinions,application/game usage, group memberships, and thenetwork topology defined in terms of friendship links,interaction links, or both. Moreover, despite the manyadvances in network modeling over the last decade,many unresolved issues still remain with big break-throughs to be made in the areas of inference anddynamic modeling. Dynamic models are needed to testhypotheses related to network dynamics and modelthe significance of various constraints and metrics,which drive the dynamics, by defining and estimatingparameters. For providing useful statistical inferencesand insights, a network model needs to efficientlyrepresent the dependencies between the network ties,and also between the behaviors of the actors. With theavailability of the OSN datasets it is possible to createdynamic models that combine evolving topologicaland topic structures for a better understanding andmodeling of the network evolution. Such hybrid mod-els can also improve the predictability of links in boththe static and dynamic settings of social networks.Evaluating and comparing the predictive ability of

various models is also a consideration for future workalong this direction. Moreover, most of the currentnetwork models tend to generate only undirectednetworks, whereas modeling directed network hasreceived less attention. For more issues and challengesrelated to network modeling, see Refs 68 and 89.

SOCIAL LINK PREDICTION

Social networks are used to represent the varioussocial systems in nature and society consisting of indi-viduals, objects, and the relations, links that bind themtogether. An important property of most of the socialsystems and their resulting social networks is that theyare dynamic. New individuals and objects are added toa system and some existing ones removed or lost withtime. Similarly, new relations and links between indi-viduals and objects are created and some existing onesbroken with time. Social networks need to reflect thesechanges that occur in the underlying social system andin doing so they are often considered highly dynamicstructures. Studying and understanding the mecha-nisms that result in a particular change in a socialnetwork is often appealing as it can throw some lighton predicting the behavior of social network with time.

A related challenge motivated from theseconsiderations is the link prediction problem. Theaim of link prediction is to predict the edges (relationsand links) that can be induced in a social networkS at a future time t′ by a particular known stateof S at a given time t. The applications of linkprediction range from online recommendations tothe detection of links between objects in a criminalcase, from predicting possible interacting proteinsor proteins with similar functionality to predictingpossible future collaborations between researchers,from predicting friendship relations to predictinghyperlinks between web pages, and so on. A naturalapproach for predicting links would be to use node-wise similarity, i.e., to determine distance or similaritybetween two objects using the available node features.For example, social links between two individualsin a social network can be reliably predicted simplybased on whether they are alike, i.e., based on theconcept of homophily. Supervised learning techniquessuch as decision-tree classifiers and support vectormachine (SVM) can be trained on one subset of theedges and then used to predict links in its complement.Such a classifier can make good predictions for a set ofunseen users or for future connections between knownusers.90,91 However, such an approach completelyignores the rich source of information, that is,the graph structure of the link graph. In contrast,according to Liben-Nowell and Kleinberg,92 the link

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 417

Page 11: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

prediction problem asks to what extent can theevolution of a social network be modeled usingfeatures intrinsic to the network itself . As a result,most of the link prediction approaches are basedon learning topological patterns from node-centric(local) properties or the overall network propertiesand the decision on the existence of a link is madebased on the visible patterns in the given state of thenetwork. Liben-Nowell and Kleinberg92 conducted acomprehensive study on some related link predictionmethods that use the network topology to estimatea similarity score between two nodes in a networkand a list of resulting node pairs with decreasingsimilarity scores. This list is then used to make adecision on the existence of a link. They have classifiedthe link prediction methods into those based on nodeneighborhoods, ensemble of all paths, and other high-level approaches. Based on their comparative analysis,they determine that no method is better than any other,instead all the methods perform well but many of themethods outperform the random predictor, signifyingthat network topology itself contains importantindications regarding the prediction of missing orfuture links. However, an important limitation inthese techniques is that they are mainly based on thetopological features of networks.

Intuitively, it appears that better performancecan be achieved by using additional sources ofinformation besides the topological information, suchas the content or semantic attributes of the nodes.Along this direction, Hasan et al.90 and O’Madadhainet al.93 incorporate features from multiple sourcesand train a classifier to decide if a link establishesor not. Hasan et al.90 use heterogeneous featuresincluding shortest distance, common neighbors, andnumber of matching keywords of nodes to analyzea coauthor dataset. On one of the datasets theiranalysis reveals that semantic features, which includekeyword-match count, of the publications result inmore gains in estimating the similarity. Similar resultshave been reported by O’Madadhain et al.93 on theuse of content-based attributes, geographic proximitybetween authors, and similarity of journal publicationpatterns. Other approaches for link prediction includemethods based on probabilistic models that try toincorporate multiple data elements from the networkto learn a compact model whose output probabilityis then used for prediction. For example, Taskaret al.94 defined a joint probabilistic model over theentire graph which also includes the content attributesof the nodes besides overall link structure usingdiscriminatively trained relational Markov networksand the trained model to collectively classify thetest data. On the basis of the assumption that the

network structure is in a stationary state, Kashimaand Abe95 proposed a parameterized probabilisticmodel of network evolution whose parameters areestimated using an expectation-maximization (EM)algorithm and then used for link prediction. Althoughsuch model-based approaches are powerful, they areusually computationally expensive, which requiresappropriate approximation to guarantee efficiency.Recently, a comprehensive survey on such relatedmethods was presented by Hasan and Zaki96

wherein they categorize link prediction methodsas probabilistic relational models, linear algebraicmodels using rank-reduced similarity matrices, andBayesian probabilistic models.

Murata and Moriyasu97 use weighted proximitymeasures of social networks for the task of linkprediction based on the assumption that nodeproximity can be determined effectively by usingboth graph proximity measures and the link weights.Their approach was the first to take link weightsinto consideration for the task of link prediction.They proposed three weighted similarity indices,as variants of the common neighbors, adamic-adar, and preferential attachment indices, and theirexperimental results show that considering theseweighted indices increases the performance of linkprediction, especially in highly dense networks.However, when Lu and Zhou98 applied the weightedindices to the coauthorship network and to theUS air transportation network, they found thatthe weighted indices performed even worse thanthe unweighted ones, reminding of the weak-tiestheory,19 which claims that the links with smallweights yet play a more important role in socialnetworks. They suggest from their experiments thatthe weak ties play a significant role in the linkprediction and the contributions of weak ties canremarkably enhance the prediction accuracy forsome networks. Similarly, in Ref 99, consideringnode similarity for link prediction, the authors haveproposed a similarity index based on local randomwalk, which has lower computational complexitycompared with other random-walk-based similarityindices with good or even better prediction. Zhelevaet al.100 argue that for networks in which groupstructures exist and are known, the link predictiontask can yield efficient results. They show howpredictive models based on descriptive, structural,and community features perform surprisingly wellon challenging link-prediction tasks by overlayingfriendship and family networks and using the featuresof the overlaid networks to accurately predictfriendship relationships. Schifanella et al.101 showthat the tagging activity of users reflects their group

418 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 12: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

participation and degree centrality in the socialnetwork. Moreover, the activity rates of users reveala strong assortative mixing in the social network.Using the tagging activity of users from Flicker andLast.fm, they show that close neighbors tend to showmore lexically similar tagging activity around commontopics and this feature can be exploited to predictlinks. They test this hypothesis on a Last.fm dataset,and show that the links predicted between usersbased on high semantic similarity closely map to theunderlying friendship links of users more accuratelythan Last.fm’s suggestions based on listening patterns.Leroy et al.102 define cold start link prediction as alink prediction challenge wherein the primary sourceof links, i.e., the network structure is not known andonly a secondary source of node-related informationis known. Unlike traditional link prediction methodsthat require a known state of a network, their methoddoes not require an initial state of the networkto predict the possible missing or future. Morespecifically, they assume that either the social networkexplicitly exists, but is kept secret by its owner, orit does not exist at all. They propose a two-phasemethod based on a bootstrap probabilistic graph forcold start link prediction. In the first phase, basedon some limited information (potentially, weaklycorrelated with the link structure of the network),the method predicts the existence of links resulting ina probabilistic graph, i.e., a graph where each edge islabeled with a probability representing the confidenceof the prediction or, in other terms, the uncertaintyof the existence of a link. The second phase takes asinput the probabilistic graph and refines it by adoptinggraph-theoretic measures given in Ref 92 as used inthe classical link prediction settings.

Although link prediction is not a new problemin information science, traditional methods have notcaught up with the new development of networkscience especially the new perspectives and toolsresulted from the studies of complex networks. Forexample, the hierarchical and modular structure ofsocial networks (section Community Analysis) couldefficiently guide the link prediction task and in turnthrow light on the community-wise evolution of socialnetworks. A major challenge related to link predictionis the heterogeneity found in most of the real-worldsocial networks. For example, OSNs usually involvedifferent node types such as users, images, URLs,and tags. Similarly, links may also involve attributesrepresenting polarity (positive or negative), friend andfoe relations, and so on. Furthermore, link predictionin weighted and directed networks along with pre-dicting link weights also requires significant attention.The cold start link prediction problem, which deals

with predicting links that could be induced into socialnetwork when a new node is added, is another bigchallenge that has received less attention till now.

COMMUNITY ANALYSIS

The description of the structure of complex networksis often studied at different levels ranging fromthe microscopic characteristics of individual nodes(degree, centrality, and so on) to the macroscopicdescription in terms of statistical properties of thewhole network (degree distribution, total clusteringcoefficient, and so on). In between these two levelsa ‘mesoscopic’ description tries to explain the com-munity structure in complex networks. Communitiesare considered to be the sets of nodes in a networkthat have denser connectivity to each other thanto the rest of the network (see Figure 1) and areimportant because they can often be closely related tofunctional units of a system, e.g., groups of individualsinteracting with each other in a society,104,105 WWWpages related to similar topics,106 and compartmentsin food webs.107 The basic task in the analysis ofcommunities is community detection, which hasreceived a lot of attention in the recent past, and thefield is still rapidly evolving.108

Detecting communities in a network dependson various factors such as whether the definitionof community relies on global or local networkproperties, whether nodes can simultaneously belongto several communities, whether the link weightsare utilized, and whether the definition allows forhierarchical community structure. An open challengerelated to community detection is to cope up withoverlapping communities that occur when a particularnode in a network belongs simultaneously to severalcommunities. Some of the approaches to deal withoverlapping communities have been proposed in Refs109 and 110. Another challenge related to communitydetection is owing to the presence of networkscontaining hierarchical structures. In such networks,a community may be part of even larger communities.Newman and Girvan111 have worked in this directionand introduced the concept of modularity as ameasure for the goodness of a partitioning assometimes it is better to study community structureconsidering nested hierarchy rather than choosing asingle community partitioning.112 Furthermore, real-world social networks tend to change dynamically,for example, in OSNs, each day new users jointhe network and new connections occur betweenexisting members, while some existing ones leave orbecome dormant. For analyzing such communities it isdesirable to understand the evolution and dynamics of

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 419

Page 13: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

Witch

Satan

Bad

Cruel

Evil

Taboo

Wicked

MeanDevil

Demon Possessed

ObsessionCrave

Lust

CompulsionWant

Desire

Impulse

UrgeYearn

Need

Passion

Fond

Adore

Like

Admire SweetheartPersonal

BoyfriendGirlfriend

Intimate

RelationshipLover

Friend

HusbandMate

Buddy

Wife

PalSpouse

Companion

PartnerMarry

Love

SexAffair

Cheat

Adultery

GenderMale

Female Ecstasy

Happy

Pleasure

DelightEnjoy

Joy

Fun

WomanHer

His

Girl

She

FIGURE 1 | Community structure identified in a word association network by the method proposed in Ref 103.

the community structure. We provide a more detaileddescription on different aspects of community analysisin the following subsections.

Methods and Approaches for CommunityDetectionAt a higher level of abstraction, the problem ofcommunity detection can be divided into localcommunity mining and global community mining. Inlocal community mining the problem is to determinea local community C from a network for which thecomplete connectivity information of the nodes isunknown and we are given only a partial set of nodesforming C and their connectivity information. Thecommunity structure of C can be revealed by recur-sively crawling the neighboring nodes of the givenpartial set of nodes and then based on the connectivityof the traversed nodes, determine the core and theboundary of the community C. In global communitymining, which is based on the assumption that wholeinformation regarding the node connectivity of anetwork is known, the problem is to identify allthe communities represented by their correspondingnodes that are present in the network.113 In the fieldof community mining most of the works are orientedtoward identification of global communities.

xi,j =√∑

k�=i,j

(Aik − Ajk

)2. (7)

The principal and the most popular techniqueused by sociologists in their study of social networksfor finding communities is hierarchical clustering.2,114

Hierarchical clustering methods involve discoveringnatural partitions of social networks, using variousmetrics of similarity, e.g., Euclidean distance13 definedin (7) compares the neighbors that two vertices share.In this equation, x is the similarity measure andA is the adjacency matrix. Hierarchical clusteringtechniques do not provide a way to decide upona proper cut in the generated dendrogram, i.e.,to choose among the created partitions a moreappropriate community structure. It should also benoted that the accuracy of results of a communitydetection method also depends upon the particularsimilarity measure used. Based on a way in whichclusters are formed, hierarchical clustering methodscan be classified into two categories—agglomerativeand divisive.114 Agglomerative methods are based on abottom-up approach that involves iteratively mergingclusters if their similarity is sufficiently high. Afterdefining a measure for vertex similarity, each node isassigned to its own community forming n communitiesfor n nodes. Now, an edge is added one at a time for thepair of nodes which shows the highest similarity. Thisapproach often finds only the strongly connected coresof communities and fails to identify the less denselyconnected boundary nodes for the communities. Anexample of agglomerative method is Ref 115, which isbased on modularity optimization111 and starts with

420 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 14: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

a state in which each vertex is the sole member ofone of n communities. Communities are repeatedlyjoined together in pairs, choosing at each step thejoin that results in the greatest increase (or smallestdecrease) in modularity. This method can be appliedto very large networks. In contrast to agglomerativemethods, divisive methods follow top-down approachwherein all the nodes are initially assigned to asingle large community which is iteratively dividedinto smaller communities by removing edges withlow similarity values and thus causing the split in alarge community. Divisive methods usually classifyall of the vertices (including peripheral nodes andoutliers) into communities, which may lower theaccuracy of community detection. On the basis ofthe sociological notion of betweenness centrality,13

Girvan and Newman104 proposed a communityfinding algorithm based on divisive clusteringapproach, which progressively removes edges from thenetwork. The algorithm calculates the betweenness ofall edges in the network and removes the one with thehighest betweenness. This process is repeated until noedge remains or a stopping criterion is met.

Besides the above deterministic methods forcommunity detection, other methods for the detectionof groups from networks have also been based onthe concept of stochastic block modeling.116 In SNA,block modeling is commonly used to divide thevertices in a network into categories or classes whereinnodes assigned to the same category or class sharesome common properties or show some degree ofequivalence. The equivalence defined within a classis mainly based on the topological properties ofnodes such as structural equivalence117 and regularequivalence.118 Structural equivalence assigns twovertices to the same category or class if they haveall the neighbors common or at least show a higherdegree of overlap between the set of their respectiveneighbors. On the other hand, regular equivalencedefines two vertices to be similar if they show similarconnection properties with vertices of some othercategories. In stochastic block modeling, objects areassigned positions defined in terms of IID (independentand identically distributed) random variables, and aparticular type of link between two objects is in turndefined as another random variable that depends onlyon the positions of the object pair it links. Extendingthe general stochastic block-modeling approach ofNowicki and Snijders116 that uses Gibbs sampling toinfer the object positions, Wolfe and Jensen119 allowan object to attain multiple position categories so asto model the multiple roles that an object may possessin different contexts. Wang et al.120,121 propose ageneralized stochastic block-modeling approach that

allows detecting groups of individuals whose activityis centered around certain topics based on the relationsbetween individuals and their respective demographicproperties. There also exist numerous other methodsand techniques for community detection, whichinclude methods based on maximum likelihood,112

mathematical programming,122 inference,123 andlatent space clustering.124

Extending the DBSCAN algorithm125 to undi-rected and unweighted graph structures, Xu et al.126

proposed SCAN (structural clustering algorithm fornetworks) to find clusters, hubs, and outliers in largenetworks based on structural similarity, which usesthe neighborhood of vertices as clustering criteria.Similarly, considering only the weighted interactiongraph of the OSNs, Falkowski et al.127 extended theDBSCAN algorithm to weighted interaction graphstructures of OSNs. The basic idea of density-basedclustering methods is that if the neighborhood of agiven radius ε of a point p contains more than μ

objects, then a new cluster with p as a core objectis created. The process then iterates to find density-reachable objects from these core objects and definesa density-connected cluster using density-connectivityrelations between nodes.125 Some important featuresof density-based community detection methodsinclude less computation, detection of outliers, andnatural scalability to large networks. However, themain drawback of traditional density-based commu-nity detection methods is that they require the globalneighborhood threshold ε and the minimum clustersize μ to be specified by the users. The methods areparticularly sensitive to the parameter ε, which is dif-ficult to determine prior. Actually, how to determinethe optimal value for parameter ε automatically forthe density-based clustering methods (e.g., DBSCANand SCAN) is a longstanding and challenging task.128

The method proposed by Sun et al.128 reducesthe number of possible values to consider for ε

significantly by considering only the edge weights ofa core-connected maximal spanning tree (CCMST) ofthe underlying network. In order to find an optimalvalue for ε from the remaining domain, they use mod-ularity as a quality function to automatically selectthe value for ε, which yields the community structureresulting best modularity. Similarly, Huang et al.129

proposed a two-stage parameter free extension ofdensity-based clustering by first finding smaller com-munities using the highest local structural similarityvalue of ε for a pair of nodes and a constant valuefor μ, and then iteratively optimizing the modularitymeasure111 upon joining these smaller communities.In the first stage, it uses a density-based approach todetect microcommunities by considering dense pairs

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 421

Page 15: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

(i.e., pairs of nodes whose structural similarity islargest among their adjacent neighbor nodes). In thesecond stage, it iteratively joins the microcommunitiesby considering the gain in modularity.

The area of community detection has grownenormously in the recent years in terms of the num-ber of community detection algorithms proposed.However, it is difficult to find any consensus in the def-inition of a community used or the application areasproposed, among most of them. A main issue that stillneeds to be explored is that of the validation of thecommunity structure found by these algorithms. Thiscalls for defining benchmark graphs with known com-munity structure for testing the algorithms. Althoughvarious methods for creating synthetic graphs withknown community structures exist, this process doesnot guarantee the performance on real networks.

Community Evaluation and ModularityOptimizationWe have seen many approaches for community detec-tion but one of the main challenges that most of themhave to face is that in real world they need to identifycommunity structures from networks for which theactual underlying community structure is hidden andthere is no ground truth for the mining problem withwhich the quality of the detected community structurecould be compared. Even though a community miningalgorithm demarcates communities in a network, weneed to answer a challenging question as to how dowe determine if the identified community structuresignificantly maps to the actual hidden communitystructure of the underlying network? Communitymining algorithms are often seen to identify a com-munity structure even in random networks which areexpected to have no significant community structureat all, so how can we measure the structure that isfound for these structureless networks? How can wecompare between different community results to findthe best ones for a given network113?

As a basic solution to the above mentionedproblems, various objective functions have beenproposed with an aim of finding optimal solutionsbased on different criteria. Some of the objectivefunctions that have been used so far include ratio cutminimization and normalized cut minimization. Ratiocut minimization130 involves minimizing the fractionof all possible edges leaving a cluster, whereas thenormalized cut minimization131 seeks to minimize thecut relative to the number of edges in a cluster insteadof its size. Given an undirected graph G = (V, E), letS be a subgraph of G representing a cluster with thenumber of nodes nS = |S| and the number of edges

mS = |{(u, v): u ∈ S ∧ v ∈ S}|. Let d(S) and d(G/S) bethe sum of the degrees of all nodes of S and of therest of the graph G/S, respectively, cS = |{(u, v): u ∈S, v /∈ S}| is the number of edges on the boundaryof S (cut size of S). The ratio cut and normalized cutobjective function can be expressed by Eqs (8) and (9),respectively. Both the ratio cut and the normalized cutminimization find clusters of almost similar size, i.e.,the number of nodes and/or edges in the resultingcommunities is almost the same.

R (S) = cS

nS (n − nS)(8)

N (S) = cS

d (S). (9)

Finding the minimum ratio cut of a graphby checking every possible collection of clusters iscomputationally prohibitive. However, many methodshave been proposed to find an approximation to theminimum ratio cut over the whole graph. Anothersolution for evaluating community structure is definedas a measure called modularity represented as Qintroduced by Girvan and Newman132 originally todefine a stopping criterion for one of their algorithmsin order to choose the best community structure froma hierarchy of communities. Based on the intrinsiclink structure of a network, modularity is a measureof goodness of a given partition of the network intocommunities.111,132–135 The idea of modularity Q isto compare the number of links inside communities tothe expected number of links in a random referencenetwork which contains no community structure.More precisely, the modularity Q uses Eq. (10) inwhich lmm is the number of links inside community m,L is the number of links in the entire network, Km isthe sum of degrees of nodes comprising community m,and the sum is over all communities. The term K2

m/2Lcorresponds to the expected number of links insidethe community m for a randomized graph of the samesize and same degree sequence as the original network.The total number of communities is represented by q.

Q =q∑

m=1

(lmm − K2

m

2L

). (10)

Modularity exploits the differences in nodedegrees as it aims to find the difference betweenthe fraction of edges that exist between the nodeswithin a community, and the fraction of edges thatare expected to exist between these nodes if theedges are assigned at random based on the nodedegrees in the underlying network. In this way,higher values of modularity for a partition schemerepresent communities for which more edges of thenodes in the community occur within the community

422 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 16: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

than expected by a random assignment of edgesbetween the nodes. The maximum value of Q is1 and it can also obtain negative values, whichcorrespond to assigning nodes into communitiessuch that communities are sparser than the randomreference. It can be assumed that high values ofmodularity indicate good partitions. In particular,Girvan and Newman132 suggested that the partitionof a network that maximizes modularity Q is thebest representation of the community structure ofthe network. As Q is independent of the method ofobtaining the communities, for many later communitydetection methods it became the objective functionto be maximized leading to modularity optimization-based methods for community detection. It shouldbe noted that an exhaustive optimization of Q isimpossible as there exist a large number of ways inwhich a graph can be portioned (even if the graphis small). Furthermore, modularity optimization isan NP-complete problem,136 which indicates thata polynomial time solution with respect to the sizeof the graph is not feasible. In this regard, severalalgorithms have been proposed, which aim to reachapproximately close values of maximum modularityin the least time possible. One of the first algorithmsfor modularity maximization is the greedy methodproposed by Newman.132 Newman’s method followsan agglomerative approach wherein at each levelsmaller groups are merged to form larger groupsonly when there is a gain in modularity. A similarapproach is proposed by Blondel et al.,137 which alsotakes the link weights into consideration and involvescomputing the gain in weighted modularity on merg-ing. Other approaches for modularity optimizationinclude methods based on simulated annealing138 likeRef 139, extremal optimization140 like Ref 141, andspectral methods,133,134 which optimize modularityby considering two partitions at a time using spectralbisection. Instead of using the Laplacian matrix,they use a modularity matrix whose elements aredetermined by Eq. (11), in which ki and kj are degreesof the nodes i and j, respectively, m is the total numberof nodes in the graph, and A is the adjacency matrix.

Bij = Aij − kikj

2m. (11)

Although modularity optimization methodshave proved highly effective in practice for communityevaluation,142 there are three major problems for theQ measure. First, modularity requires informationabout the entire structure of the graph, which isunrealistic in case of large networks like the WWW.As a solution to this problem, Clauset143 has proposeda measure of local community structure, called local

modularity, for graphs that lack global knowledge.Similarly, Radicchi et al.144 proposed a divisive hier-archical method, where links are iteratively removedbased on the value of their edge clustering coefficient,which is a local measure. This approach involves lesscomputation than that of edge betweenness used in Ref104 and thus yields a significant improvement in thecomplexity of the algorithm. Moreover, the stoppingcriterion of the procedure depends on the propertiesof the communities themselves and not on thevalues of a quality function like modularity. Second,modularity-based methods have a resolution limitand may fail to identify smaller (possibly important)communities.145 Possible solutions include recursivealgorithms based on modularity optimization.146

Finally, as pointed out by Scripps et al.,147 modularityonly measures existing links on the network, butdoes not explicitly consider the absent links betweentwo nodes in the same community. In this regard,it appears better to optimize a quality function thattakes into consideration both the topological structureof the networks and some secondary informationrelated to node properties or their temporal.148

How and what secondary information related tonodes, edges, or the network as a whole can be usedalong with the network topology for the communitydetection task is among the challenging issues relatedto community analysis in social networks.149,150

A important issue related to community analysisis the interpretation of the detected communitystructure in networks, i.e., what does a detectedcommunity structure in a particular network representor what do we do with the detected communities.While most of the algorithms are based on thestructural information of the social networks, it isstill difficult to say that structural communities closelymap to the underlying functional communities inmost of the real-world social networks. Alternatively,it appears promising to explore new techniquesand methods that tend to incorporate any availablesecondary information related to the nodes, edges, orthe network itself, e.g., demographic information ofnodes, edge weights, or textual content interactions,along with the network’s primary topologicalcharacteristics. In case of OSNs it is possible touse the content information available in the formof comments, messages, tags, and so on to definetopic hierarchies and the orientation of group opinionsabout them to guide the clustering process.

Overlapping and Hierarchical CommunityDetectionOne of the challenges in community detection asobserved by Zhang et al.151 is that the communities

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 423

Page 17: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

26

25

32

3124

28 29

3330

9

14

334

27

19

23

2116 15

10

8

2

4

20

22

18

12

11

71

1765

13

FIGURE 2 | Overlapping community structure found in the Zachary’s network152 by the method proposed in Ref 103.

may show overlapping behavior, i.e., some nodesmay show affinity toward multiple communities. Themembership of an entity in many groups is very com-mon in real-world networks. For example, in a socialnetwork, a person may participate in many interestgroups, see Figure 2. Identifying such overlappingcommunities from social networks has gained signif-icant attention recently. In this section, we introducesome of the existing techniques to detect overlappingcommunities. As mentioned by Chen,113 a natural wayto discover overlapping communities is to first glob-ally partition the network and then locally expand thediscovered communities to locate overlapping com-ponents. For example, Wei et al.153 first partition thenetwork into seed groups of overlapping communitystructure using existing spectral clustering methods.A locally optimal expansion process is then applied togreedily optimize Newman’s modularity Q measure.Similarly, Baumes et al.154 initialize community coresusing link aggregate (LA) algorithm and then refinethe peripheries by an iterative scan (IS) procedure.

The most popular method for identifying over-lapping communities is the clique percolation method(CPM) proposed by Palla et al.,110 which is based onthe concept of a k-clique. A k-clique is a subgraph ofk nodes. The method relies on the observation thatcommunities seem to consist of several small cliquesthat share many nodes with other cliques in the same

community. A k-clique community is defined as thelargest connected subgraph obtained by taking theunion of a k-clique with all k-cliques that are adjacent(two k-cliques sharing k − 1 nodes, where k is agiven parameter representing the clique size) to it.Rewiring one end of some links in a k-clique to someother node of an adjacent k-clique (called rolling)can also be used to identify a k-clique community asshown in Ref 155. The choice of k has a significanteffect on the community structure found. Typicallyused values of k are between 3 and 6 and the valuesof 2 and 1 map to bond and node percolation,respectively. High values of k yield tight, internallycohesive communities, whereas small values of k yieldsparse and larger communities. The intermediatek-cliques are allowed to share nodes between themand thus the resulting communities can show overlapsat common nodes. Kumpula et al.156 have proposed anenhanced variation of the CPM called the sequentialclique percolation (SCP) algorithm. SCP involvesadding edges to an empty network in a sequentialmanner. On the addition of each new edge, the for-mation of any new k-cliques is checked by searchingfor (k − 2)-cliques in local neighborhoods of boththe nodes at the ends of the newly added edge. Thecore of the SCP algorithm consists of constructingk-clique communities continuously when newk-cliques appear. For each new k-clique there are two

424 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 18: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

possible cases—the k-clique can either form its owncommunity or it can overlap with one or more exist-ing communities. In the latter case, all overlappingcommunities merge to form a single community.

Zhang et al.151 propose a hybrid method byembedding the vertices of an arbitrary graph into a d-dimensional space using spectral mapping in order toutilize the fuzzy c-means algorithm on graphs throughoptimizing a quality function. However, the eigenvec-tor calculations involved in the algorithm render itcomputationally expensive to use on large networks.Wei et al.153 first partition the network into seedgroups of overlapping community structures usingexisting spectral clustering method. A locally optimalexpansion process is then applied to greedily opti-mize Newman’s modularity measure. McDaid andHurley157 presented an overlapping community detec-tion method MOSES by combining local optimizationwith overlapping stochastic block modeling158 using a

greedy maximization strategy. Here, communities arecreated and deleted, and nodes are added or removedfrom communities, in a manner that maximizes alikelihood objective function. For an in-depth intro-duction to other overlapped community detectionmethods, readers are referred to Refs 113 and 159.

Besides overlapping community structures, net-works often contain significantly different communitystructures at different levels of granularity whereinsmaller communities are embedded within some largercommunities, see Figure 3, i.e., there exists a commu-nity hierarchy within the network.161 In order toprovide appropriate information about the modu-lar structure of a network, it is desirable to detectoverlapping communities along with their hierarchicalorganization. In Ref 109, the authors have proposeda method for simultaneously uncovering both thehierarchical and the overlapping community struc-ture from networks. The method is based on locally

Trig

ger

Top

less

MN

60V

auP

atch

back

MN

83M

N10

5Jo

nah

Hae

ckse

lF

ive

Cro

ssZ

apD

oubl

eC

CL

SM

N5

Zip

fel

TS

N83

Str

ipes

TR

88T

R12

0

TR

99

TS

N10

3

SN

9S

N4

SN

96O

scar

Fis

hB

umpe

rB

eak

PL

Num

ber1

Not

chM

usB

eesc

ratc

hJe

tZ

igW

ebS

N90

Rip

plef

luke

Gal

latin

Fea

ther

DN

21D

N16

Wav

eT

R82

SN

89S

N10

0U

pban

gQ

uasi

MN

23K

nit

DN

63

TR

77

Krin

gel

For

k

Thu

mpe

r

SN

63S

cabs

Hoo

kG

rinW

hite

tip

Shm

udde

l

FIGURE 3 | Hierarchical community structure identified in a Dolphin network160 by the method proposed in Ref 103.

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 425

Page 19: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

optimizing a fitness function for each node to decideupon the assignment of a node to a particular com-munity. A node is allowed to show affinity to multiplecommunities and thus can be assigned to multiplecommunities resulting in a possible overlapping com-munity structure for the underlying network. More-over, the method uses a resolution parameter whosevalue defines the average size of the communities to bedetected and thus varying this parameter between opti-mal higher and lower allows exploring the hierarchicallevels of the community structure for the underlyingnetwork. Reichardt and Bornholdt162 show that thereexists an analogy between detecting communities innetworks and finding the ground state of a spin glassmodel in the sense that the energy of the spin system isequivalent to the quality function of the clustering withthe spin states being the group indices. In this regard,edges should exist between nodes that have the samespin state, and seize to exist between the ones with dif-ferent spin states. A single parameter relates the weightgiven to missing and existing links in the quality func-tion and allows for an assessment of overlapping andhierarchical community structures. In line with CPM,Kumar et al.163 propose a method hierarchical andoverlapping communities (HOC) to identify hierarchi-cal and overlapping communities by finding maximalcliques from the underlying network. However, unlikeCPM, HOC uses the overlapping neighborhood crite-ria to define the similarity between two arbitrary nodesin a network. For HOC, if two nodes have the over-lapping neighborhood ratio greater than a threshold,they belong to the same community. The communitydetection framework, Infomap, presented by Rosvalland Bergstrom,164 reformulates community detectionas minimizing the description length of a randomwalk across the network. The total description lengthconsists of the length for encoding community transi-tions and the length for encoding movements withincommunities. Infomap considers smaller descriptionfor the trajectory of random walk to be more rea-sonable for defining a community partition. Rosvalland Bergstrom165 extend Infomap to find hierarchi-cal community structures from networks. Recently,Lancichinetti et al.166 presented order statistics localoptimization method (OSLOM), which locally opti-mizes the statistical significance of clusters definedwith respect to a random graph generated by theconfiguration model during community expansion.OSLOM is able to detect a hierarchical communitystructure by reapplying the algorithm on interme-diate supernetworks of detected communities. Themethods proposed by Lancichinetti et al.,109 Kumaret al.,163 and Reichardt and Bornholdt162 provide atunable parameter (resolution parameter) whose value

determines the size of the detected communities. Thisallows visualizing the community structure at differ-ent resolutions and thus forms a community hierarchy.These methods are called multiresolution methods andsome other multiresolution methods are given in Refs167 and 168. See Ref 159 for a description of thesemethods.

A two-stage algorithm proposed in Ref 169 fordetecting overlapping and hierarchical communitystructures in a network involves identifying allmaximal cliques in the network, which along witheach subordinate vertex (single vertices that do notbelong to any clique) are taken as an initial set ofcommunities. A dendrogram is then created in an iter-ative way using an agglomerative approach. In secondphase, a proper cut point for the dendrogram is deter-mined by finding the maximal value of an extendedmodularity measure, which also considers the numberof communities to which a node belongs to. Instead ofassigning nodes to communities, Ahn et al.170 assumethat links, rather than nodes, are characterized by asingle attribute (such as the community assignment).They study hierarchical organization of overlappingcommunities following a link clustering approach.

Dynamic Community Detection andCommunity EvolutionAs mentioned earlier, one of the important propertiesof the real-world social networks is that they tendto change dynamically and this property has untilrecently been largely ignored in terms of communitydetection. Recently, several datasets and methodsof recording the network dynamics have becomeavailable, enabling to monitor and analyze theevolution of real-world networks,171,172 which alsomakes it possible to analyze the evolutionary changesrelated to communities. The main evolutionarycharacteristic events related to the lifetime of acommunity include birth (a new community emerges),growth (an existing community gains more nodes),shrinkage (an existing community loses some membernodes), merging (more than one existing nodes join toform a new community), split (an existing communitybreaks into more than one community), and death(an existing community seizes to exist by losing allmember nodes or at least all of its core nodes).

The analysis of the evolution of communities inlarge social networks, such as membership, growth,and disbandment, is becoming increasingly prominentespecially for social networking sites such as MySpaceand Facebook. By monitoring friendship links andcommunity membership on a social network site, andcoauthorship and conference publication in a bibliog-raphy database, Backstrom et al.173 study the relation

426 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 20: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

between the evolution of the community structure andthe topological structure of the underlying network.Kairam et al.174 investigate about the life and death ofonline groups based on various group-based networkfeatures. They find that communities possessingdensely connected hierarchical subgroups and a setof loosely connected nodes tend to show long lifeand higher rate of membership gain. Although groupsincrease the diffusion rate, groups formed through thediffusion process attain smaller sizes with the passageof time. Moreover, their analysis suggests that histor-ical growth features can help in closely estimating thegrowth in near future but the structural features ofthe network tend to perform better in estimating thegrowth to a distant future (at least for small groups).

In order to identify community structure fromnetworks, most of the approaches either consider asingle snapshot at any time or all the nodes andedges present within a particular time window. Butin case of dynamic networks such approaches tendto miss the important behavior of communities,i.e., their evolution with time, which represents oneof the most important properties of networks andcommunities to be observed.175 The typical dynamiccommunity detection problem as formulated in Refs173 and 175 is to observe the social interactions ofa subset of individuals of a network at each timestep along a discrete time scale. Consequently, severalsubgraphs are formed, which reveal the underlyingcommunity structure and the changes that occur toit over time. However, the bipartite mapping ofcommunities for two subgraphs in these methodsassumed a zero to one or one to one mapping betweenthe communities and hence do not identify merge orsplit events. Alternatively, Greene et al.176 propose aheuristic threshold-based method allowing many-to-many mapping between communities across differenttime steps, thus also enabling the detection of mergeand split events. This approach is independent ofthe choice of the underlying static community findingalgorithm applied to the individual step graphs. Toperform the mapping between the communities attwo time steps, they define the similarity between twocommunities based on the Jaccard coefficient given byEq. (12), wherein Ci and Cj are two communities attwo consecutive time steps, respectively.

sim(Ci, Cj

) =∣∣Ci ∩ Cj

∣∣∣∣Ci ∪ Cj∣∣ . (12)

If the similarity (or overlap) for two com-munities Ci and Cj is found greater than a certainthreshold (between 0 and 1), an evolutionary relationis established between them. Similar approach is

followed in CHRONICLE177 and its two-stage exten-sion of SCAN126 for identifying community evolutionin dynamic networks but it does not consider theedge weights (often a important supplement forcommunity detection) when available.

Wang et al.178 have presented a core-based algorithm for tracking community evolution,which depends on core nodes to establish theevolving relationships among communities at differentsnapshots. Instead of overlapping level of nodesor edges between two communities, their algorithmheavily relies on core nodes that are distinguished fromordinary nodes on the basis of both the communitytopology and the node weight as follows. Othermethods introduced for the analysis of communitiesand their temporal evolution in dynamic networks areprovided in Refs 54, 171, and 179–181. However, aspointed out by Lin et al.,182 a common weakness inthese studies is that first the communities are identifiedseparately and then a mapping across communitiesto identify their evolutionary characteristics isperformed. An alternative approach would requirethat the identified community structure itself providesinformation regarding its evolutionary events anda historical event log helps in deciding upon anappropriate community structure. Lin et al.182 analyzethe evolution of communities by first identifying aninitial community structure using a stochastic blockmodel and then adapting this initial communitystructure by following a probabilistic model forcapturing the community evolution. For each time-step graph of a dynamic network, the communitystructure is determined by considering both thecurrent network state and the historic communityevolution patterns. However, the main limitationsof their method include the need for specifyingnumber of communities a priori and low scalabilityowing to large number of matrix computationsrequired. Falkowski et al.29 propose a framework forstudying community dynamics where a preliminarycommunity structure adapts to dynamic changes ina social network. A similar approach is proposedby Bhat and Abulaish,103 but unlike Falkowskiet al.,29 their concern is on tracking the evolutionof overlapping communities and does not need anaging function to remove old interactions from thenetwork. Their distance function is based on averagereciprocated interactions in node neighborhoods.Moreover, the proposed framework is applicableto directed/undirected and weighted/unweightednetworks, whereas the method of Falkowski et al.29

applies only to undirected and weighted networks.For unweighted networks, the proposed frameworkby Bhat and Abulaish103 assigns a unit weight to each

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 427

Page 21: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

edge in the network without altering the meaning orrepresentation of the network. Other related dynamiccommunity models and detection methods fordynamic networks are studied in Ref 53. Studying theevolution of discussion topics can help in identifyingthe evolving community structures in dynamic socialnetworks, which is another challenge that needsmore work to be done. Studying overlapping andhierarchical community structures from a dynamicperspective in evolving social networks may alsothrow some light on how communities merge or split,and the role of nodes with multiple memberships(overlapping nodes) in this context. Moreover,analyzing communities in weighted and directednetworks also needs some more attention.

DIFFUSION AND INFLUENCE INSOCIAL NETWORKS

With the increasing popularity of OSNs that forman efficient means of analyzing information diffusionthere has been a good amount of research regard-ing the analysis and modeling of information flow inthese networks. Diffusion in networks has been stud-ied across a wide range of real-world scenarios with anaim to find trends, patterns, and models that representthe spread, propagation or diffusion of entities likefads, diseases, computer viruses, knowledge, products,etc. Early work in the area of diffusion in networkscomes from the field of epidemiology,183 where it dealswith the spread of disease among individuals con-nected as a network, and from computer science,184

involving the study of the spread of computer virusesover the individuals connected via e-mail networks.This early work is primarily based on the SIR (sus-ceptible, infected/infectious, and recovered/removed)and the SIRS (susceptible, infected/infectious, recov-ered/removed, and susceptible) models.

In the field of sociology, the role of the word-of-mouth (WOM) in the propagation of innovationin social networks has been extensively studied. Sim-ilarly, in marketing science literature many modelsfor studying the diffusion of new products have beenproposed.185 For example, the primitive Bass model186

predicts adoption based on relative populations ofinnovators that are not influenced by the decisions ofothers and imitators whose adoption depends on thetotal number of adoptions in the system. In the fieldof SNA, the flow of information in a social network ismodeled as the propagation of innovation in the socialnetwork and was introduced by Ryan and Gross187

and later consolidated by Rogers.188 Unlike thedisease spread models, information diffusion in socialnetworks is based on complex social contagion189,190

that has properties like thresholds to infection, i.e.,individuals wait for several of their friends to adoptthe behavior before involving themselves.

With an increased popularity of the newenhanced forms of communication (e.g., mobilephones, e-mail, and OSNs) in recent years, researchershave been able to measure social contagion efficientlyand exploit it to propose effective models for infor-mation diffusion in social networks. Katona et al.191

analyze an OSN dataset to identify individual WOMeffects with an aim to discover how the local com-munication network structure affects the diffusionprocess. They show that dense groups/communitiesfacilitate higher rate of WOM influence and that influ-encers who occupy structural holes in the networkhave, on average, higher influential power. Moreover,people with many friends have a lower averageinfluence than those with fewer friends. They alsosuggest that demographic data are useful to identifystrong influencers and together with global networkvariables are also useful to identify adopters as well.Along a similar direction, Bhat and Abulaish192 ana-lyze the influential significance of overlapping nodes,i.e., nodes that belong to multiple communities in asocial network. Their analysis highlights that highlyoverlapping nodes in a social network represent thebest influential nodes in the network in terms of theirbetweenness centrality and that outliers and singlemembership nodes can be easily discarded as leastinfluential. Considering the case of Facebook, Bakshyet al.193 empirically rehighlight the strength of weekties and signify that weak ties, defined directly in termsof interaction propensities, diffuse novel informationthat would not have otherwise spread and thus playan important role in facilitating information diffusion.Their analysis also indicates that most informationdiffusion on Facebook is surprisingly driven by simplecontagion and not by complex contagion. Lermanet al.194 conducted empirical analysis of user activityon the OSN sites Digg and Twitter, in the light ofinformation diffusion cascades in these networks.Their analysis reveals that as Digg networks aredense and contain community structures, many of thecascades appear to spread through an interconnectedcommunity. On the other hand, as Twitter does notcontain significant community structures, its cascadesare more tree-like. On the basis of a viral e-mailexperiment involving 31,183 individuals, Iribarrenand Moro195 relate the unexpected slow pace ofinformation diffusion in an e-mail network to theheterogeneity found in the response time of humanactivity patterns. Similar results related to the effectsof heterogeneity in both the number of recommenda-tions made by individuals and of the time they take to

428 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 22: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

transmit the information are reported in Ref 196. Aralet al.197 analyze a social network of an organizationrepresented by 10 months of e-mail traffic observedover two 5-month periods. Besides, they also use someinformation related to accounting data on projectcowork relationships between the workers in theorganization. Their analysis reveals some interestingobservations which include: in general, it is not onlythe social structure but also the information contentthat determines the movement of information; theproductivity of an information worker is directlyproportional to her access of more novel and timelyinformation. The analysis of Sadilek et al.198 andCorley et al.199 throws some light on how the contentand demographic information available from onlinesocial media can be helpful in detecting and trackingdisease spread in the underlying population. Somebasic mathematical models that have been proposed inliterature for analyzing information diffusion in socialnetworks are explained in the following subsections.

Threshold ModelsIn threshold models of information diffusion, a nodev in a network may adopt an action (become active)only if a certain number of nodes in v’s neighborhoodare active, i.e., when the number of active nodes in theneighborhood of a node v exceeds a certain thresholdthe node v becomes active. The most simple exampleis the linear threshold model200 in which a node vbelonging to the network has a nonnegative weightwv,u for every node u in v’s neighborhood N(v) witha property that

∑u ε N(v)

wv,u ≤ 1. Given a threshold

value θv (value can be a fixed value or chosen at thestart of the process) and an initial set A1 of activenodes, it follows a sequence of steps in such a waythat at any time t, every node that was active at time t− 1 remains active and each node v that was inactiveat time t − 1 becomes active at time t if and only if∑

u ε N(v)wv,uXu, t−1 ≥ θv , where Xu,t−1 is 1 if u was

active at time t − 1and 0, otherwise. Thus, the weightwv,u represents the extent to which node v is influencedby node u, and the threshold θv represents the personaltendency of v to adopt a new action of its neighbors.

Cascade ModelsIn cascade models of information diffusion,201,202 eachindividual in a social network who adopts an action(becomes active) has a single probabilistic chance toactivate each inactive node in its neighborhood. Forexample, in the independent cascade model,202 givenan initial set of active nodes in the network the processproceeds in a series of time steps where at each time

step a node u that has just become active may attemptto activate each inactive node v in its neighborhoodirrespective of the set of neighbors of u that haveattempted to activate v in the past. Whether or notu becomes active, v and u have no further contactthroughout the remainder of the process. The processterminates when no new activations can be made.

Kempe et al.203 give a generalized model for theindependent cascade and the linear threshold modelswhere the probability with which an active node uattempts to activate an inactive node v depends on theneighbors of u who have already attempted activate u.However, the probability that an individual v is activeafter the activation process does not depend on anysequence of the activation attempts of the neighborsand removes any ambiguity related to any simultane-ous activation attempts made by the active neighbors.

Recent Works on Information DiffusionJackson and Yariv204 have analyzed that for thespread of a behavior in a social network thereexists a threshold for the number of initial activenodes (initial adopters) where ‘tipping’ occurs. Itmeans that a large number of initial active noderesults in an increased adoption rate, which reachesto a large subpopulation. On the other hand, asmaller number of seed nodes lead to the collapseof adoption behavior leading to very few or no newactive nodes. Furthermore, once the tipping pointis surpassed, they observe that the initial adoptionrate is high, followed by a slower adoption ratetoward the end. Moreover, the network structuresuch as the degree distributions of the nodes affectsthe tipping point and the adoption rate. A similarobservation has been made by Iribarren and Moro,196

who argue that most of the initial diffusion processtakes place owing to superspreading events andbecause there exists heterogeneity in the schedulingof information transmission by individuals,205–207

the diffusion process slows down toward the endin logarithmic time. Garg et al.208 have considered therole of peers (that represent nonexplicit relationshipsof users in a network) on the diffusion of nicheinformation in a social network using a dataset fromLast.fm. The peers of a user they considered had verysmall life span of connection as it was dependenton the online user’s evolving taste in music. Theiranalysis shows that there is a positive influenceof online peers (nonexplicit relationships) on thediffusion of niche information at a more granularlevel and found that users are six times more likely todiscover a new track as a result of peer influenceand discover 2.7 niche tracks as a result of that

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 429

Page 23: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

influence. Using blog contents posted by OSN users,Gruhl et al.207 have studied information diffusion interms of topics propagation from one blog to another.They have characterized information diffusion alongtwo dimensions—topics and individuals. In topic-based characterization process they found that theblog world contains numerous topics identifiable bytheir respective set of user postings, which are mostlycomposed of a union of active discussions (calledchatter) mediated by the authors of the topic andshort-term, high-intensity discussions (called spikes).In individual-based characterization, they characterizeusers on the basis of their respective posts related to thebirth, activeness, and death of a particular topic. Theypropose an SIRS-based model to analyze the processof information diffusion so as to identify individualswho play an important role in the formation andspread of infectious topics.

Based on a blog world representation of OSNs,where a function called blogroll enables bloggers tospecify explicit relationships with other bloggers, andtrackback and scrap functions allow bloggers to maketheir posts linked to other bloggers’ posts and tocopy other bloggers’ posts to their blogs respectively,Kwon et al.209 analyzed the diffusion of informa-tion in these networks. In contrast to the assumptionbased on social network theory that information diffu-sion in social networks occurs through the establishedrelations between members,210 the analyses of Kwonet al.209 show that a majority of information diffusesbetween blogs that have no explicit relationships. Fur-thermore, some posts show an explosive increase inthe number of blogs that trackback or scrap theseposts. The reasons for such explosive information dif-fusion are identified as (1) listing blogs on the mainpage of a blog world service portal and (2) diffu-sion through search engines. On the basis of theseobservations, Kwon et al.211 proposed an informationdiffusion model using the existing independent cas-cade model202 as a basis and added to it the virtualspace (i.e., the main page of the blog service provider)that exposes posts to a lot of bloggers with no priorrelationship. Zhao et al.212 seek to determine the roleof weak ties in the process of spreading informa-tion within the blog world. They show that althoughmany important features of the network structureare defined by the weak ties, exclusively republish-ing blogs through weak ties (scrap function of theblog world) cannot facilitate diffusion of informationthrough the network. On the other hand, republishingblogs selecting at random cannot facilitate the diffu-sion process. Moreover, when the blogs are selected atrandom for republishing, removing weak ties leads toa sharp decrease in the rate and range of the diffusion.

This concludes that for blog networks, it is difficultto analyze the delicate role played by weak ties inthe information diffusion process. Studying the com-munication patterns of e-mail networks over time,Kossinets et al.213 formulated the notion of ‘distance’between two nodes in a social network based on theminimum time it takes for information to diffuse fromone node to the other. They identify the backboneof the network (a subgraph of a network on whichinformation has the potential to flow the quickest) asa sparse graph with a concentration of both highlyembedded edges and long-range bridges—reflectingthe relationship between tie strength and connectivityin social networks. By analyzing the content of the blogposts, Stewart et al.214 modeled the problem of discov-ering information diffusion paths (IDPs) from the blo-gosphere as a problem of frequent pattern mining byrepresenting a blog community collected in a certaintime period as a blog sequence database. They defineIDPs as sequences of blogs that frequently discuss simi-lar topics sequentially, and around similar time points.

Based on continuous-time Markov chains,215

the information diffusion model proposed by Songet al.216 deals with predicting the flow of informationwhere they measure the likelihood of the propagationof information from a specific sender to a specificreceiver during a certain time period and proposea recommendation algorithm that predicts the mostlikely node to receive the information during a limitedtime period. Using diffusion rate of information ina social network by estimating the expected timefor information to diffuse to a specific user, themodel also ranks nodes/users based on how quicklyinformation flows to them. By considering the picturepopularity distributed over the Flicker social network,Cha et al.217 have shown that social links are thedominant method of information propagation andthat information spreading is limited to individualswho are within close proximity of the pictureuploaders. Furthermore, information takes a longtime to reach from one node to other and hencethe popularity of Flicker pictures steadily increasesover many years. The role of social influence in thediffusion of information in social networks is studiedby Oh et al.218 using a dataset from YouTube. Theiranalysis reveals interesting relationships betweensocial influence resulting from a user’s networkposition and the initial and later stages of informationdiffusion in a social network. Habiba et al.219 definespread of a diffusion process in terms of the number ofnodes expected to be affected by a stochastic diffusionprocess over time. They consider a node to qualify asa best spread blocker if its removal leads to a decreasein the spread of a diffusion process. Their results

430 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 24: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

show that on both static and dynamic networks,local measures such as the node degree perform atpar with other measures in identifying the blockers.Instead of utilizing the topological information of anetwork and with an aim of predicting the magnitudeand the rate of information diffusion, Yang andLeskovec220 propose a linear influence model wherein(considering the network as implicit) the numberof newly infected/influenced nodes by a node p atany given time t is a function of the number ofnodes infected/influenced by p before time t. Theymodel the influence functions as a regression taskin a nonparametric way and show that they can beestimated using a simple least squares procedure.

Influential Node Mining and ViralMarketingThe social networks of customers have long beenexpected to have a potential for being exploited toincrease brand/product awareness as WOM flowsthrough these networks facilitating social contagion.Moreover, the adoptions and opinions of certaincustomers (opinion leaders and influential customers)often influence the adoption behavior of other cus-tomers making their identification desirable for prod-uct manufacturers and marketing agencies.221 In thisdirection, Trusov et al.222 quantify the effect of WOMreferrals based on an OSN dataset, which contains therecords of new members who join the site. Their anal-ysis reveals that the addition of new members to OSNsis strongly facilitated by WOM referrals whose impactis found to be almost 20 times higher than marketingevents and 30 times higher than media appearances.

Viral marketing refers to the marketingtechniques that are based on utilizing existing socialnetworks in order to increase brand awareness orachieve other marketing objectives like increasingproduct sales by incorporating a self-replicatingviral process that is analogous to the spread ofpathological or computer viruses. Viral marketingprograms involve identifying individuals with highsocial networking potential (size of an individual’ssocial network and their ability to influence thatnetwork) or network value, and creating viralmessages that motivate this set of population to forma countable customer base for the brand. Networkvalue of a customer is defined as the expected increasein sales to others that results from marketing to thatcustomer.223 Some of the influence factors identifiedby Domingos223 that collectively influence thenetwork value of a customer in a positive manner are:

• high connectivity in the network• interest in the product

• leadership or asymmetric influence over thenetwork

• higher level of cascading influence.

The problem of viral marketing was formalizedby Kempe et al.203 as selecting the optimal individualsto be seeded with a product in an arbitrary networkgiven a fixed marketing budget. This strategy involvesencouraging the WOM by distributing discounted orfree products to targeted consumers assuming thatthey will then discuss the product with their friendsand encourage them to buy the product. However,what customers to seed with these initial products inorder to maximize the amount and rate of productadoption are not obvious. Sun and Tang224 present asurvey on the some existing models for social influenceanalysis. They present the notion of influence in termsof some important social network primitives includingnode degree, edge betweenness, structural holes, andhomophily. They also review the notion of influence interms of the actions and interactions of individuals ina social network and how community structures helpin explaining influence in a social network. Moreover,they present some existing models for maximizing theinfluence spread in social networks, which besides thethreshold (section Threshold Models) and the cascade(section Cascade Models) models include high-degreeheuristic, low-distance heuristic, and degree discountheuristic. Here, the high-degree and low-distanceheuristics consider nodes with higher degree (thuspossibly can influence more nodes) and nodes withthe shortest paths to other nodes as seed nodes,respectively. The degree discount heuristic is similarto high-degree heuristic but involves discounting thedegree for a new potential seed node p by the numberof already selected seed nodes that participate indefining p’s degree.

OSNs such as Facebook, Twitter, and Orkutprovide efficient platforms to advertise and marketproducts to consumers as these platforms allowtheir users to create virtual networks that providea formalization of social interactions of individuals.However, despite a huge marketing scope in theseOSNs, it has been difficult to use this platformsuccessfully for marketing225 often owing to privacyconsiderations as the full network described by theseonline platforms is not known. As a solution tothis problem, Stonedahl et al.226 have introducedthe LVMP (local viral marketing problem) similarto the global viral marketing problem proposed byKempe et al.203 The only difference is that in LVMPthe structural knowledge of the global network is notavailable, rather only the characteristics of each vertexthat provide summary statistics about the vertex and

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 431

Page 25: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

its role in the network are used. Instead of selectingindividuals in order to maximize the influence,Hartline et al.227 and Arthur et al.228 have studiedthe problem of revenue maximization, i.e., the processof selecting an individual for offering an optimalprice such that maximum revenue can be extractedfrom each next buyer. Their methods are based onthe marketing strategies called influence-and-exploitstrategies that initially influence a population bygiving free or discounted products to set of influentialbuyers and then extract revenue from the remainingbuyers using a ‘greedy’ pricing strategy. However, incontrast to Hartline et al.227 approach in which aninfluenced node is allowed to market product to anyarbitrary target nodes and also determine the price ofthe item without considering the network structure,Arthur et al.228 incorporate structural informationabout the network and the timing of an offer isdetermined through cascading the recommendations.As a dual to the influence maximization problem,Kimura et al.229,230 have addressed a contaminationminimization problem, which involves minimizingthe diffusion of unwanted information (underindependent cascade model) by preventing the flowof information through a small subset of edges ina network. In particular, they have shown thatunlike the case of removing nodes, blocking theedges that connect nodes with higher number ofoutgoing edges does not always work effectively forminimizing the reach of a spreading process. They alsopropose an effective approach to find an approximatesolution by following a greedy strategy involvingbond percolation. Percolation theory proposed in Ref231 examines how connectivity is disrupted withinspatially structured systems232 and refers to a class ofmodels that describe the properties of a system giventhe networking among its constituents.

Besides the above research works, a numberof researchers have specifically concentrated onidentifying leaders and influential bloggers from OSNsto assist viral marketing of products and other alliedtasks over these platforms. In support of these issues,a survey of over 200 million bloggers by McCann233

reveals that 31.7% blog about opinions related toproducts and brands, which are viewed by 71% ofregular internet users. Similarly, a survey performedby Nelsenwire234 on a relatively smaller populationof internet users in some countries finds that 70%of people give significance to the online opinions ofother people on the products they intend to purchase.For the managerial task of market analysis, blogsrepresent vital sources of information in the form ofdirectly accessible insights on products (own and/orrival) and related costumer feedback.235,236 Agarwal

et al.237 have proposed a model to identify influentialbloggers by analyzing the influence of their blogposts based on four blog properties, recognition,activity generation, novelty, and eloquence, that formthe parameters for the model and can be tunedto obtain different breeds of influential bloggers.Furthermore, their experimental results also confirmthat the influential bloggers in a blogging communityare not necessarily active bloggers. Goyal et al.238

have proposed a frequent pattern mining approach todiscover leaders in social networks based on analyzingvarious user actions in the social network. Theirmethod requires preliminary knowledge about theunderlying social graph and an action log containingall user actions and their time stamps. By computingan influence graph and influence cube from availableaction log data and introducing various thresholds,they present frequent pattern mining-based algorithmsto identify leaders, tribe leaders, and confidenceleaders. Similarly, considering the log-in behavior ofusers of an OSN, Trusov et al.239 use a Bayesianapproach to determine the influential nodes from anego-centered influence network estimated from thelog-in sequences followed by friend relations.

Conflicting Issues in Viral MarketingAn important challenge in the direction of viral mar-keting is owing to the contradicting claims by Wattsand Dodds240 that the influentials hypothesis is incor-rect and by Goldenberg et al.241 that the influentialshypothesis is correct. More specifically, a computersimulation by Watts and Dodds240 suggested thatseeding well-connected people to maximize the spreadof information works only under certain conditionsand should be less preferred as seeds or early adoptersfor large referral cascades. On the other hand, thestudy by Goldenberg et al.241 on the impact of socialposition on information probability indicates thatopinion leaders (hubs) in the network may adoptearly not because they are innovative but because theyare better informed than others by early exposure toinnovations through their multiple social links. Theanalysis by Hinz et al.242 and Iyengar et al.243 alongthis issue highlights that seeding hubs (high-degreeseeding) for viral marketing results in higher numberof referrals because hubs are more actively involved inthe diffusion process owing to the higher number oflinks. Moreover, in contrast to Goldenberg et al.,241

the analysis by Iyengar et al.243 significantly high-lights that opinion leaders associate to early adoptioneven after controlling for contagion, and are equallysensitive to contagion as nonleaders. In this section,we present some more important and relevant workrelated to influential node mining and viral marketing

432 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 26: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

and how OSNs present these opportunities. In addi-tion to this, the analysis by Leskovec et al.244 high-lights some interesting and challenging issues relatedto the traditional diffusion models in the context ofrecommendation, which tend to change the perceptionof information diffusion and hence the approachestoward viral marketing. Unlike the traditional modelsof information diffusion, they observe that a repeatedexposure of an individual to some product or resourcedecreases the chance of its adoption by the individ-ual. It means that providing excessive incentives toselected seed customers could have negative effects.As discussed earlier, traditional diffusion models con-sider that each individual either has the same chanceof infecting each of its neighbors (cascade models)or that an individual does not get infected unless thenumber of its neighbors who are infected is greaterthan a threshold (threshold models). In either case, thechance of a node getting infected is directly propor-tional to number of its infected neighbors. However,Leskovec et al.244 argue that although the chance of aproduct p, to be purchased by an individual i, increaseswith the number of i related suggestions directed atp, the chance of adoption quickly decreases to a rel-atively low level. This indicates that even though aproduct may be suggested by many friends to an indi-vidual i, it is possible that the individual i does notbuy the suggested product as it may appear to beof no significant use. Earlier, we also mentioned thatoften highly connected nodes are considered goodcandidates for selecting influential nodes. However,Leskovec et al.244 argue that nodes with high degreescan give productive recommendations only up to acertain level because the success per recommendationdeclines as a high-degree node sends out a significantlylarge number of recommendations for a certain prod-uct. In this regard, they present a stochastic model thatsupports the diffusion of influence/recommendationthrough long paths, but also considers the possibilitythat the recommendation paths can get blocked atshort lengths as discussed earlier.

Marketing is one of the major applicationareas for the analysis of the diffusion of informationand influence in social networks. However, besidesmarketing, areas including behavioral science and thestudy of the spread of computer viruses and spam,etc., also find themselves striving for finding IDPs andinfluential nodes from OSNs in order to have a betterunderstanding for controlling the possible epidemicsof their outburst. Although many issues relating tothe social process of diffusion have been addressed,some young directions still need more attention. Forexample, studying diffusion trends within and acrossthe communities in a social network, and analyzing

the community structure of influential nodes forinformation diffusion appear promising. Similarly,addressing link prediction issues in the light of influ-ence and information diffusion tendency of the nodesin a social network might also highlight new factorsthat can possibly determine the direction and weightsof new links that can possibly appear in the social net-work in near future. The field also calls for identifyingnew practical application areas, and demonstrationof the working of various diffusion models and nodeinfluence measuring/prediction methods.

FUTURE DIRECTIONSAND CONCLUSION

Most of the existing current literature on SNA tasksis oriented toward network modeling and communitydetection, followed by link prediction, informationdiffusion, and influential node mining. However,currently all these tasks face some common challengesand issues. One of the most common issues is thedynamic nature of the real-world social networks thattend to change with time. Incorporating this featureof social networks into statistical models can help inefficiently dealing with the link prediction problem.Besides the existence of various dynamic communitydetection methods, analyzing overlapping and hierar-chical community structures in evolving networks stillneeds some more attention. Another issue related tomost of the tasks addressed in this article is that of theheterogeneity found in real-world social networks.Often social networks involve different types ofnodes representing users, resources (e.g., URLs andvideos), and so on. Similarly, social links can also beassociated with different attributes such as polarityand relationship types (e.g., friend, foe, and family).Community detection algorithms need to deal withthis heterogeneity by ensuring that the nodes and linksof similar types are grouped within same communitiesbesides considering the topological structure of thesocial network. It may also be required to analyze ashow/why nodes and links change their discriminatingattribute values with the evolution of social network.Link mining and information diffusion models mayneed to address such issues. Opinion and sentiment-aware community detection, link prediction, andinformation diffusion also appear to be promisingas very less work has been done along this direction.Another issue related to the tasks involved in SNAis that most of the works have been done consideringonly undirected and unweighted nature of socialnetworks. Incorporating weights and directions tolinks for social network mining tasks is still an openchallenge. For example, predicting both direction and

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 433

Page 27: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

weight of missing or future links in a social network.Moreover, a huge amount of literature along thedirection of SNA exists, and most of them concentrateon some specific aspect of the social networks (e.g.,community detection and information diffusion), inisolation or are mostly oriented only toward socio-logical aspects relieving computer science. However,the present multidimensional OSNs provide a meansof studying various data mining tasks related toSNA in a unified framework where each of themcan benefit from the others. For example, modelinginformation diffusion process in conjunction withcentrality analysis and community detection canhelp to achieve more cost-effective viral marketing.Establishing relationships between opinion orienta-tion, community structures, and diffusion pathwaysin a population can lead to significant advancesin the analysis of large real-world social networks.Similarly, using both topological network structureand the nonstructural information like user-generatedcontent for SNA tasks appears more promising thanusing only the topological information. OSNs area rich source of both structural and nonstructuraldata and it appears they form an efficient platformfor formulating frameworks that include multiplesocial network activities and data mining tasks beingperformed in collaboration with each other to yieldmore meaningful and realistic results.

In this regard, we propose a conceptualframework (see Figure 4) that allows a unifiedanalysis of OSN data in which various SNA taskscan be made to supplement each other for a betterinterpretation and analysis of the underlying processesof OSNs. Community detection module can help tounderstand which individuals tend to have higherintensities of relationships than the rest of the network.For interaction networks, identified communities canrepresent clusters of individuals who interact witheach other more frequently than the rest of thenetwork and for friendship networks communitiescan represent the clusters of friends who havehigher affinity with each other than the rest of thenetwork. The main advantage could be to understandthe relationship (similarity and difference) betweendifferent communities of interaction and friendshipnetworks. For example, we may tend to answerthe question: do the groups of close friends interactmore frequently or the higher intensities of interactionamong a group of individuals are independent of thelevel of friendship among the same individuals.

Another task related to the proposed frame-work is to analyze user-generated content generatedby the individuals of a social network. This can helpin analyzing sentiments of individuals who representtheir opinions about various objects such as prod-ucts, events, and government policies. Text mining

Social network

Social networkcrawler

Text preprocessing

POS tagging

Mining individualcontextual sentiments

Identifying informationdiffusion pathways

Identifying contextualopinion diffusion trends

Identifying opinionleaders

Ranking

Influence maximization

Mining community-wisecontextual sentiments

Community detection

Evolution analysis

Opinion mining

Domainontology

Information diffusion

Influential node mining

Community analysisNetwork

structures

User generated

textual data

FIGURE 4 | A unified conceptual framework for online social network analysis.

434 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 28: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

techniques along with IR and natural language pro-cessing (NLP) can be used for this purpose. Opinionmining techniques are usually based on the consider-ation that a document, sentence, or feature/aspect isopinionated about one single issue or object and theproblem is to classify the opinion as positive, nega-tive, or neutral or on a multipoint scale to determinethe strength of the expressed opinion; for example, todetermine whether a user specifies the picture qual-ity of a camera phone as neutral, low, medium, orhigh, where neutral is treated as the lack of opinion.However, the user-generated content can be orientedtoward numerous heterogeneous objects or contextsand it may be required to analyze the sentiments ofindividuals in different contexts depending upon thetemporal requirements. In order to efficiently representand interpret a particular context, the idea is to utilizedomain knowledge. Starting with a seed ontology, textmining technique along with IR and NLP can be usedto enrich it further through learning concepts and theirrelationships from textual data source in an automaticway. The identified communities from interaction andfriendship networks can be exploited to analyze thesentiments and opinions learned from textual data effi-ciently at both individual and community levels. Theopinion strength of a community can be computedas the average strength of the opinions expressedby the individuals in that community. Furthermore,comparing opinion strengths and sentiments of theidentified communities may reveal interesting rela-tionships between the levels of interaction, friendship,and strengths of the opinions and orientation of senti-ments. At this point, it can also be possible to identifycentral individuals and communities that tend to havea higher impact on contextual sentiment orientationsof the individuals and communities.

The framework also facilitates to analyze andmodel the information diffusion process in bothinteraction and friendship networks. This processinvolves studying how far and at what rate thediffusion/spread of information can take place ina network. Information diffusion modeling of theinteraction and friendship networks can help todetermine which among the two networks has a higherscope of spreading information among the individuals,or does there exist a significant difference in theinformation diffusion process of the two networks.Furthermore, considering a particular context, it isalso possible to understand how particular contextualopinions diffuse in the interaction, friendship, andcommunity networks (where a node represents

an identified community and the weighted edgesrepresent the degree of similarity between them).For such a modeling, the opinion similarity betweenany two nodes will contribute in determining theprobability of the spread between them.

Another SNA task supported by the proposedframework is to identify influential nodes (individ-uals) in the interaction and friendship networks.This has importance in domains like viral marketing,which involves identifying individuals with highsocial networking potential (size of an individual’ssocial network and their ability to influence thatnetwork). In order to identify influential nodes in theinteraction and friendship networks, the individualnodes can be ranked by estimating diffusion prob-abilities from observed information diffusion datausing independent cascade model discussed earlier.Furthermore, contextual opinion leaders (individualsand communities) can be identified by analyzing theinteraction, friendship, and community networksin terms of the diffusion of contextual opinions inthe networks. It can also help in determining therelationship between the influential nodes identifiedseparately in interaction and friendship networks. Animportant issue not significantly highlighted here isthat the amount of data made available by OSNsis huge and it is growing exponentially. It calls fordealing with the scalability issues for the existingmethods to face various challenges discussed in thisarticle, and requires improving existing methods ordesigning new fast and efficient methods to tackle thehuge amount of heterogeneous and dynamic data.

The field of SNA is not new, but still it is rapidlygrowing with new challenges being faced owing tothe growth of new multidimensional social networks.This article attempts to present a review of some of thelatest and important aspects of SNA, which includemethods for social network modeling, analysis, andmining. Existing techniques for many analysis tasksare discussed and presented in a summarized way,which could be a useful source for the researchers inSNA area to get insight about the related problemsand existing state-of-the art techniques.

NOTESa http://www.infovis-wiki.netb http://www.foaf-project.org/c http://vocab.org/relationship/d http://sioc-project.org/e http://www.w3.org/2004/02/skos/

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 435

Page 29: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

REFERENCES1. White HC, Boorman SA, Breiger RL. Social structure

from multiple networks. I. Blockmodels of roles andpositions. Am J Sociol 1976, 81:730.

2. Scott J. Social Network Analysis: A Handbook. 2nded. London: Sage; 2000.

3. Borgatti SP, Mehra A, Brass DJ, Labianca G.Network analysis in the social sciences. Science 2009,323:892–895. doi: 10.1126/science.1165821.

4. Kowald M, Frei A, Hackney JK, Illenberger J,Axhausen KW. Collecting data on leisure travel: thelink between leisure contacts and social interactions.Procedia - Soc Behav Sci 2010, 4:38–48.

5. Hulst RCVD. Introduction to social network analysis(SNA) as an investigative tool. Trends Org Crime2008, 12:101–121.

6. Lewis ACF, Jones NS, Porter MA, Deane CM.The function of communities in protein interactionnetworks at multiple scales. BMC Syst Biol 2010,4:100.

7. Swan J, Newell S, Scarborough H, Hislop D.Knowledge management and innovation; networksand networking. J Knowl Manage 1999, 3:262–275.

8. Patak N, Mane S, Srivastava J, Contractor N.Knowledge perception analysis in a social network. In:Proceedings of the 5th Workshop on Link Analysis,Counterterrorism and Security, SIAM InternationalData Mining Conference, Bethesda, USA; 2007.

9. Reffay C, Chanier, T. How Social Network Analy-sis Can Help to Measure Cohesion in CollaborativeDistance-Learning. Computer Supported Collabora-tive Learning. Bergen: Kluwer Academic Publishers;2003. Available at: http://archive-edutice.ccsd.cnrs.fr/edutice-00000422. (Accessed February 23, 2010).

10. Bonchi F, Castillo C, Gionis A, Jaimes A. Socialnetwork analysis and mining for business applications.ACM Trans. Intell. Syst. Technol 2011, 2 Article22:37.

11. Mislove AE. Online social networks: measurement,analysis, and applications to distributed informationsystems, Ph.D. thesis, Rice University, Houston, TX,2009.

12. Rupnik, J. Finding community structure in socialnetwork analysis—overview. In: IS 2006: Proceedingsof the International Multiconference on InformationSociety, Ljubljana, Slovenija, 2006.

13. Wasserman S, Faust K. Social Network Analysis:Methods and Applications. Cambridge UniversityPress, Cambridge; 1994.

14. Barabasi AL, Albert R. Emergence of scaling in randomnetworks. Science 1999, 286:509–512.

15. Albert R, Jeong H, Barabasi AL. The diameter ofthe WWW. Nature 1999, 401:130–131. doi: 10.1038/43601.

16. Barabasi AL, Bonabeau E. Scale-free networks. Sci Am2003, 288:50–59.

17. Newman MEJ, Park J. Why social networks aredifferent from other types of networks. Phys Rev E2003, 68:036122.

18. Milgram S. The small world problem. Psychol Today1967, 2:60.

19. Granovetter MS. The strength of weak ties. Am JSociol 1973, 78:1360–1380.

20. Burt RS. Structural Holes: The Social Structure ofCompetition. Cambridge, MA: Harvard UniversityPress; 1992.

21. Burt RS. Structural holes and good ideas. Am J Sociol2004, 110:349–399.

22. Burt RS. The contingent value of social capital. AdmSci Q 1997, 42:339–365.

23. Kumpula, J. Community structures in complexnetworks: detection and modeling, Ph.D. thesis,Helsinki University of Technology, Finland, 2008.

24. Wu L, Waber B, Aral S, Brynjolfsson E, PentlandA. Mining face-to-face interaction networks usingsociometric badges: predicting productivity in an ITconfiguration task. In: Proceedings of the InternationalConference on Information Systems, Paris, France,2008.

25. Eagle N, Pentland A, Lazer D. Inferring social networkstructure using mobile phone data. Proc Natl Acad Sci2009, 106:15274–15278.

26. Mislove AE, Marcon M, Gummadi KP, Druschel P,Bhattacharjee B. Measurement and analysis of onlinesocial networks. In: Proceedings of the 7th ACMSIGCOMM Conference on Internet Measurement(San Diego, California, USA, October 24–26, 2007).IMC ’07. New York: ACM; 2007, 29–42.

27. Lahiri M, Berger-Wolf TY. Mining periodic behaviorin dynamic social networks. In: ICDM ’08:Proceedings of the International Conference on DataMining, IEEE, IEEE Computer Society, Los Alamitos,CA, USA. Pisa; 2008, 373–382. 10.1109/ICDM.2008.104.

28. Bhattacharyya P, Garg A, Wu SF. Social net-work model based on keyword categorization. In:ASONAM ’09: Proceedings of International Confer-ence on Advances in Social Network Analysis and Min-ing, 20–22 July 2009. IEEE Computer Society, Athens,Greece, 2009, 170–175; 10.1109/ASONAM.2009.46.

29. Falkowski T, Barth A, Spiliopoulou M Studyingcommunity dynamics with an incremental graphmining algorithm. In: AMCIS’08: Proceedings of the14th Americas Conference on Information Systems29, Toronto, Canada, 2008.

30. Chun H, Kwak H, Eom Y, Ahn Y, Moon S, JeongH. Comparison of online social relations in volume vsinteraction a case study of cyworld. In: Proceedings

436 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 30: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

of the 8th ACM SIGCOMM Conference on InternetMeasurement (Vouliagmeni, Greece, October 20–22,2008). IMC ’08. New York: ACM; 2008, 57–70.

31. Tantipathananandh C., Berger-Wolf T. Constant-factor approximation algorithms for identifyingdynamic communities. In: KDD ’09: Proceedings ofthe 15th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining. New York:ACM; 2009, 827–836.

32. Jamali M, Abolhassani H. Different aspects ofsocial network analysis. In: Proceedings of the2006 IEEE/WIC/ACM International Conferenceon Web Intelligence (December 18–22, 2006).Web Intelligence. Washington, DC: IEEE ComputerSociety; 2006, 66–72.

33. Ghoniem M, Fekete JD, Castagliola P. On thereadability of graphs using node-link and matrix-basedrepresentations a controlled experiment and statisticalanalysis. Inf Vis 2005, 4:114–135.

34. Tang L, Liu H, Yu PS, Han J, Faloutsos C.Understanding group structures and properties insocial media. In: Yu PS et al, eds. Link Mining:Models, Algorithms, and Applications Part 2. Springer,New York; 2010, 163–185. doi: 10.1007/978-1-4419-6515-8_6.

35. Chen J, Zaïane OR, Goebel R. Detecting communitiesin social networks using max-min modularity. In:SDM 2009: Proceedings of the SIAM Data MiningConference. SIAM, PA, 2009, 978–989.

36. Jalan S, Bandyopadhyay JN. Random matrix analysisof network Laplacians. Phys A: Stat Mech Appl 2008,387:667–674. doi: 10.1016/j.physa.2007.09.026.

37. Ghoniem M, Jussien N, Fekete JD. VISEXP: visualizingconstraint solver dynamics using explanations.In: FLAIRS’04: Seventeenth International FloridaArtificial Intelligence Research Society Conference,Miami, FL, 2004.

38. Henry N, Fekete JD. MatLink: enhanced matrixvisualization for analyzing social networks. In:Baranauskas C, Palanque P, Abascal J, Barbosa SD,eds. Proceedings of the 11th IFIP TC 13 InternationalConference on Human-Computer Interaction -Volume Part II (Rio de Janeiro, Brazil, September10–14, 2007). Lecture Notes in Computer Science.Berlin/Heidelberg: Springer-Verlag; 2007, 288–302.

39. Horrocks I. Ontologies and the semantic web.Commun. ACM 2008, 51:58–67.

40. Stumme G, Hotho A, Berendt B. Semantic webmining—state of the art and future directions. J WebSemantics 2006, 4:124–143.

41. Gruber TR. A translation approach to portable ontol-ogy specifications. Knowl Acquis 1993, 5:199–220.

42. Gruber TR. Collective knowledge systems: where thesocial web meets the semantic web. Web Semant 2008,6:4–13.

43. Downes S. Semantic networks and social networks.Learn Org 2005, 12:411–417.

44. Breslin J, Decker S. The future of social networks onthe internet: the need for semantics. IEEE InternetComput 2007, 11:86–90.

45. Ereteo G, Limpens F, Gandon F, Corby O, BuffaM, Leitzelman M, Sander P. Semantic social networkanalysis: a concrete case. In: Daniel BK, eds Handbookof Research on Methods and Techniques for StudyingVirtual Communities: Paradigms and Phenomena.Hershey, PA: IGI Global; 2011, 122–156.

46. Breslin J, Harth A, Bojars U, Decker S Towardssemantically interlinked online communities. In:ESWC ’05 Proceedings of the 2nd European SemanticWeb Conference. Springer-Verlag: Berlin Heidelberg;2005, 71–83.

47. Mika P. Social networks and the semantic web: thenext challenge. IEEE Intell Syst 2005, 20:80–93.

48. Mika P. Ontologies are us: a unified model of socialnetworks and semantics. Web Semant: Sci Serv AgentsWorld Wide Web 2006, 5:5–15.

49. Wu X, Zhang L, Yu Y. Exploring social annotationsfor the semantic web. In: Proceedings of the 15thInternational Conference on World Wide Web (WWW’06). New York: ACM; 2006, 417–426.

50. Smyth P. Statistical modeling of graph and networkdata. In: Proceedings of IJCAI Workshop on LearningStatistical Models from Relational Data, Acapulco,Mexico, 2003.

51. Wong LH, Pattison P, Robins G. A spatial model forsocial networks. Physica A 2006, 360:99.

52. Snijders TAB, Van de Bunt GG, Steglich CEG.Introduction to stochastic actor-based models fornetwork dynamics. Soc Netw 2010, 32:44–60. doi:10.1016/j.socnet.2009.02.004.

53. Spiliopoulou M. Evolution in social networks: asurvey. In: Aggarwal CC, eds Social Network DataAnalytics. Springer, New York, USA; 2011, 149–175.

54. Asur S, Parthasarathy S, Ucar D. An event-based framework for characterizing the evolutionarybehavior of interaction graphs. ACM Trans KnowlDiscov Data 2009, 3:1–36.

55. Hanneman R, Riddle C. 2005. Introduction to SocialNetwork Methods (online textbook). Riverside, CA:University of California. Available at: http://faculty.ucr.edu/hanneman/. (Accessed December 20, 2010).

56. Anderson C, Wasserman S, Crouch B. A p* primer:logit models for social networks. Social Netw 1999,21:37–66. doi: 10.1016/S0378-8733(98)00012-4.

57. Wasserman S, Pattison P. Logit models and logisticregression for social networks: I. An introductionto Markov graphs and p. Psychometrika 1996,61:401–425.

58. Holland PW, Leinhardt S. An exponential family ofprobability distributions for directed graphs (withdiscussion). J Am Stat Assoc 1981, 76:33–65.

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 437

Page 31: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

59. Van-Duijn MAJ, Snijders TAB, Zijlstra BJH. p2: arandom effects model with covariates for directedgraphs. Stat Neerl 2004, 58:234–254.

60. Handcock MS. Statistical models for social networks:inference and degeneracy. In: Breiger R, CarleyK, Pattison PE, eds. Dynamic Social NetworkModeling and Analysis: Workshop Summary andPapers. Washington, DC: The National AcademiesPress (National Research Council of the NationalAcademies); 2002, 229–240.

61. Frank O, Strauss D. Markov graphs. J Am Stat Assoc1986, 81:832–842.

62. Toivonen R, Kovanen L, Kivela M, Onnela JP,Saramaki J, Kaski K. A comparative study of socialnetwork models: network evolution models and nodalattribute models. Soc Netw 2009, 31:240–254.

63. Wasserman SS, Robins GL. An introduction to randomgraphs, dependence graphs, and p*. In: CarringtonPJ, Scott J, Wasserman S, eds. Models and Methodsin Social Network Analysis. New York: CambridgeUniversity Press; 2005, 148–161.

64. Hanneke S, Xing EP. Discrete temporal models ofsocial networks. In: Airoldi E, Blei DM, FienbergSE, Goldenberg A, Xing EP, eds. Proceedings ofthe 2006 Conference on Statistical Network Analysis(Pittsburgh, PA, USA). Lecture Notes in ComputerScience. Berlin/Heidelberg: Springer-Verlag; 2007,115–125.

65. Handcock M. 2003. Assessing degeneracy in statisticalmodels of social networks. Technical Report 39,University of Washington.

66. Snijders T, Pattison P, Robins G, Handcock M. Newspecifications for exponential random graph models.Sociol Methodol 2006, 36:99–153.

67. Robins G, Snijders T, Wang P, Handcock M, PattisonP. Recent developments in exponential random graph(p*) models for social networks. Social Netw 2007,29:192–215.

68. Goldenberg A, Zheng AX, Fienberg SE, Airoldi EM.A survey of statistical network models. Found TrendsMach Learn 2010, 2:129–233.

69. Watts DJ, Strogatz S. Collective dynamics of ‘small-world’ networks. Nature 1998, 393:440–442.

70. Newman MEJ, Watts DJ. Renormalization groupanalysis of the small-world network model. Phys LettA 1999, 263:341–346.

71. Kleinberg JM. Small-world phenomena and thedynamics of information. In: Dietterich TG, Becker S,Ghahramani Z, eds Advances in Neural InformationProcessing Systems (NIPS) 14. Cambridge, MA: MITPress; 2001, 431–438.

72. Hellmann T, Staudigl M, Rogers BW. Evolution ofsocial networks. In: Working Paper No. 470, Instituteof Mathematical Economics, Bielefeld University,Germany, 2012.

73. Narayanam R, Narahari Y. Topologies of strategicallyformed social networks based on a generic valuefunction—allocation rule model. Soc Netw 2011,33:56–69.

74. Gintis H. Game Theory Evolving: A Problem-Centered Introduction to Modeling Strategic Behavior.Princeton University Press, Princeton, New Jersey,USA; 2000.

75. Yadati N, Narayanam R. Game theoretic models forsocial network analysis. In: Proceedings of the 20thInternational Conference Companion on World WideWeb (WWW ’11). New York: ACM; 2011, 291–292.

76. Aumann R, Myerson R. Endogenous formation oflinks between players and coalitions: an applicationof the Shapley value. In: Roth A, ed. The ShapleyValue. Cambridge University Press, Cambridge; 1988,175–191.

77. Myerson R. Game Theory: Analysis of Conflict.Cambridge, MA: Harvard University Press; 1991.

78. Jackson MO. A survey of network formation models:stability and efficiency. In: Demange G, Wooders M,eds. Group Formation in Economics: Networks, Clubsand Coalitions. Cambridge University Press; 2005,11–57.

79. Jackson MO. Social and Economic Networks.Princeton University Press, Princeton, New Jersey,USA; 2010.

80. Goyal S. Connections: An Introduction to theEconomics of Networks. Princeton University Press,Princeton, New Jersey, USA; 2012.

81. Vallam RD, Subramanian CA, Narayanam R,Narahari Y, Narasimha S. 2011. Topologies andprice of stability of complex strategic networks withlocalized payoffs: analytical and simulation studies.Preprint: arXiv:1201.0067.

82. Alexander JM. Evolutionary Game Theory. StanfordEncyclopedia of Philosophy. Stanford, CA: Meta-physics Research Lab, CSLI, Stanford University;2009.

83. Skyrms B, Pemantle R. A dynamic model of socialnetwork formation. Proc Natl Acad Sci 2000, 97:9340–9346.

84. Bala V, Goyal S. A noncooperative model of networkformation. Econometrica 2000, 68:1181–1229.

85. Gross T, Blasius B. Adaptive coevolutionary networks:a review. J R Soc Interface 2008, 5:259–271.

86. Christensen K, Donangelo R, Koiller B, Sneppen K.Evolution of random networks. Phys Rev Lett 1998,81:2380–2383.

87. Gross T, D’Lima CJD, Blasius B. Epidemic dynamicson an adaptive network. Phys Rev Lett 2006,96:208701.

88. Tero A, Takagi S, Saigusa T, Ito K, Bebber DP, FrickerMD, Yumiki K, Kobayashi R, Nakagaki T. Rulesfor biologically inspired adaptive network design. SciSignal 2010, 327:439.

438 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 32: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

89. Snijders TAB. Statistical models for social networks.Annu Rev Sociol 2011, 37:129–151.

90. Hasan MA, Chaoji V, Salem S, Zaki M. Linkprediction using supervised learning. In: SDM’06:Proceedings of the Workshop on Link Analysis,Counter-terrorism and Security (at SIAM Data MiningConference), Bethesda, MD, 2006.

91. Wang D, Pedreschi D, Song C, Giannotti F,Barabasi AL. Human mobility, social ties, and linkprediction. In: Proceedings of the 17th ACM SIGKDDInternational Conference on Knowledge Discoveryand Data Mining (KDD ’11). New York: ACM; 2011,1100–1108.

92. Liben-Nowell D, Kleinberg J. The link-predictionproblem for social networks. J Am Soc Inf Sci Technol2007, 58:1019–1031.

93. O’Madadhain J, Hutchins J, Smyth P. Prediction andranking algorithms for event-based network data.SIGKDD Explor Newsl 2005, 7:23–30.

94. Taskar B, Wong MF, Abbeel P, Koller D. Linkprediction in relational data. In: Thrun S, Saul L,Scholkopf B, eds. NIPS ’03 Advances in NeuralInformation Processing Systems 16. Cambridge, MA:MIT Press; 2003.

95. Kashima H, Abe N. A parameterized probabilisticmodel of network evolution for supervised linkprediction. In: Proceedings of the Sixth InternationalConference on Data Mining (December 18–22, 2006).ICDM. Washington, DC: IEEE Computer Society;2006, 340–349.

96. Hasan MA, Zaki MJ. A survey of link prediction insocial networks. In: Aggarwal CC, eds Social NetworkData Analytics. Springer, New York, USA; 2011,243–275.

97. Murata T, Moriyasu S. Link prediction based onstructural properties of online social networks. NewGeneration Comput 2008, 26:245–257.

98. Lu L, Zhou T. Link prediction in weighted networks:the role of weak ties. Europhys Lett 2010, 89:18001.

99. Liu W, Lu L. Link prediction based on local randomwalk. Europhys Lett 2010, 89:58007.

100. Zheleva E, Getoor L, Golbeck J, Kuter U. Usingfriendship ties and family circles for link prediction.In: Advances in Social Network Mining and Analysis.Lecture Notes in Computer Science, vol 5498.2010, 97–113. Springer; Berlin Heidelberg. doi:10.1007/978-3-642-14929-0_6.

101. Schifanella R, Barrat A, Cattuto C, Markines B,Menczer F. Folks in folksonomies: social linkprediction from shared metadata. In: Proceedingsof the 3rd ACM International Conference on WebSearch and Data Mining. New York: ACM; 2010,271–280.

102. Leroy V, Cambazoglu BB, Bonchi F. Cold start linkprediction. In: Proceedings of the 16th ACM SIGKDDInternational Conference on Knowledge Discovery

and Data Mining (Washington, DC, USA, July 25–28,2010). KDD ’10. New York: ACM; 2010, 393–402.

103. Bhat SY, Abulaish M. OCTracker: a density-basedframework for tracking the evolution of overlappingcommunities in OSNs. In: Proceedings of IEEE/ACMInternational Conference on Advances in SocialNetworks Analysis and Mining (ASONAM’2012).IEEE, IEEE Computer Society, Los Alamitos, CA,USA; 2012, 501–505.

104. Girvan M, Newman MEJ. Community structure insocial and biological networks. Proc Natl Acad Sci2002, 99:7821–7826.

105. Arenas A, Danon L, Díaz-Guilera A, Gleiser PM,Guimera R. Community analysis in social networks.Eur Phys J B 2004, 38:373.

106. Flake G, Lawrence S, Giles C, Coetzee F. Self-organization and identification of web communities.IEEE Comput Soc 2002, 35:66.

107. Krause AE, Frank KA, Mason DM, Ulanowicz RE,Taylor WW. Compartments revealed in food-webstructure. Nature 2003, 426:282.

108. Fortunato S, Castellano C. 2007. Communitystructure in graphs. arXiv:0712.2716 [physics.soc-ph].

109. Lancichinetti A, Fortunato S, Kertesz J. Detecting theoverlapping and hierarchical community structure incomplex networks. New J Phys 2009, 11:033015.

110. Palla G, Derenyi I, Farkas I, Vicsek T. Uncoveringthe overlapping community structure of complexnetworks in nature and society. Nature 2005,435:814–818.

111. Newman MEJ, Girvan M. Finding and evaluatingcommunity structure in networks. Phys Rev E 2004,69:026113.

112. Clauset A, Moore C, Newman MEJ. Hierarchicalstructure and the prediction of missing links innetworks. Nature 2008, 453:98–101.

113. Chen J. Community mining—discovering communitiesin social networks, Ph.D. thesis, University of Alberta,Edmonton, Alberta, 2010.

114. Hastie T, Tibshirani T, Friedman J. The elementsof statistical learning: data mining, inference, andprediction. Math Intell 2001, 27:83–85. doi: 10.1007/BF02985802.

115. Clauset A, Newman MEJ, Moore C. Findingcommunity structure in very large networks. PhysRev E 2004, 70:066111.

116. Nowicki K, Snijders TAB. Estimation and predictionfor stochastic block structures. J Am Stat Assoc 2001,96:1077–1087.

117. Lorrain F, White HC. The structural equivalence ofindividuals in social networks. J Math Sociol 1971,1:49–80.

118. Everett MG, Borgatti SP. Regular equivalence: generaltheory. J Math Sociol 1994, 19:29–52.

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 439

Page 33: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

119. Wolfe AP, Jensen D. Playing multiple roles: discoveringoverlapping roles in social networks. In: ICML’04:Workshop on Statistical Relational Learning and ItsConnections to Other Fields, Banff, Alberta, Canada,2004.

120. Wang X, Mohanty N, McCallum A. Group and topicdiscovery from relations and text. In: Proceedings ofthe 3rd International Workshop on Link Discovery(Chicago, Illinois, August 21–25, 2005). LinkKDD’05. New York: ACM; 2005, 28–35.

121. Wang X, Mohanty N, McCallum A.. Group andtopic discovery from relations and their attributes.In: NIPS’06: Proceedings of 12th Annual Conferenceon Neural Information Processing Systems. Whistler,BC: MIT Press, Cambridge, MA; 2006, 1449–1456.

122. Agarwal G, Kempe D. Modularity maximizing net-work communities using mathematical programming.Eur Phys J B 2008, 66:409–418.

123. Hastings MB. Community detection as an inferenceproblem. Phys Rev E 2006, 74:035102.

124. Handcock MS, Rafter AE, Tantrum JM. Model-basedclustering for social networks. J R Stat Soc A 2007,170:301–354.

125. Ester M, Kriegel H, Jorg S, Xu X. A density-basedalgorithm for discovering clusters in large spatialdatabases with noise. In: Proceedings of theInternational Conference on Knowledge Discoveryfrom Data. AAAI Press, Palo Alto, California; 1996,226–231.

126. Xu X, Yuruk N, Feng Z, Schweiger TAJ. 2007. SCAN:a structural clustering algorithm for networks. In:Proceedings of the 13th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining(KDD ’07). ACM, New York, USA; 2007, 824–833.

127. Falkowski T, Barth A, Spiliopoulou M. DENGRAPH:a density-based community detection algorithm. In:Proceedings of the IEEE/WIC/ACM InternationalConference on Web Intelligence. Washington, DC:IEEE Computer Society; 2007, 112–115.

128. Sun H, Huang J, Han J, Deng H, Zhao P, FengB. gskeletonclu: density-based network clustering viastructure-connected tree division or agglomeration.In: Proceedings of the 2010 IEEE InternationalConference on Data Mining, ser. ICDM ’10.Washington, DC: IEEE Computer Society; 2010,481–490.

129. Huang J, Sun H, Han J, Deng H, Sun Y, Liu Y.Shrink: astructural clustering algorithm for detectinghierarchical communities in networks. In: Proceedingsof the 19th ACM International Conference onInformation and Knowledge Management, ser. CIKM’10. New York: ACM; 2010, 219–228.

130. Wei YC, Cheng CK. Ratio cut partitioning forhierarchical designs. IEEE Trans. on Computer-AidedDesign of Integrated Circuits and Systems 1991,10:911–921.

131. Shi J, Malik J. Normalized cuts and imagesegmentation. IEEE Trans Pattern Anal Mach Intell2000, 22:888–905.

132. Newman MEJ. Fast algorithm for detecting com-munity structure in networks. Phys Rev E 2004,69:066133.

133. Newman MEJ. Finding community structure innetworks using the eigenvectors of matrices. Phys RevE 2006, 74:036104.

134. Newman MEJ. Modularity and community struc-ture in networks. Proc Natl Acad Sci 2006,103:8577–8582.

135. Newman MEJ, Girvan M. Mixing patterns andcommunity structure in networks. In: Pastor-SatorrasR, Rubi J, Diaz-Guilera A, eds. Statistical Mechanicsof Complex Networks. Lecture Notes in Physics,vol 625. Berlin: Springer-Verlag; 2003, 66–87. doi:10.1007/978-3-540-44943-0_5.

136. Brandes U, Delling D, Gaertler M, Gorke R, HoeferM, Nikoloski Z, Wagner D. On modularity: NP-completeness and beyond. Technical Report 2006-19, ITI Wagner, Faculty of Informatics, UniversitatKarlsruhe, TH, 2006.

137. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E.Fast unfolding of communities in large networks. JStat Mech 2008, 10:P10008.

138. Kirkpatrick S, Gelatt CD, Vecchi MP. Optimizing bysimulated annealing. Science 1983, 220:671–680.

139. GuimerA R, Amaral LAN. Functional cartographyof complex metabolic networks. Nature 2005,433:895–900.

140. Boettcher S, Percus A. Optimization with extremaldynamics. Phys Rev Lett 2001, 86:5211–5214.

141. Duch J, Arenas A. Community detection in complexnetworks using extremal optimization. Phys Rev E2005, 72:027104.

142. Danon L, Duch J, Díaz-Guilera A, Arenas A.Comparing community structure identification. J StatMech 2005, 9:P09008.

143. Clauset A. Finding local community structure innetworks. Phys Rev E 2005, 72:026132.

144. Radicchi F, Castellano C, Cecconi F, Loreto V, ParisiD. Defining and identifying communities in networks.Proc Natl Acad Sci 2004, 101:2658.

145. Fortunato S, Barthelemy M. Resolution limit incommunity detection. Proc Natl Acad Sci U S A 2007,104:36–41.

146. Ruan J, Zhang W. Identifying network communitieswith a high resolution. Phys Rev E 2008, 77:016104.

147. Scripps J, Tan P, Esfahanian A. Exploration oflink structure and community-based node roles innetwork analysis. In: Proceedings of the 2007 SeventhIEEE International Conference on Data Mining(October 28–31, 2007). ICDM. Washington, DC:IEEE Computer Society; 2007, 649–654.

440 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 34: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

148. Shalizi CR, Camperi MF, Klinkner KL. Discoveringfunctional communities in dynamical networks. In:Airoldi E, Blei DM, Fienberg SE, Goldenberg A,Xing EP, eds. Proceedings of the 2006 Conferenceon Statistical Network Analysis (Pittsburgh, PA, USA).Lecture Notes in Computer Science. Berlin/Heidelberg:Springer-Verlag; 2007, 140–157.

149. Newman MEJ. The physics of networks. Phys Today2008, 61:33–38.

150. Traud AL, Kelsic ED, Mucha PJ, Porter MA.2008. Community structure in online collegiate socialnetworks. arXiv:0809.0960.

151. Zhang S, Wang R, Zhang X. Identificationof overlapping community structure in complexnetworks using fuzzy c-means clustering. Physica A2007, 374:483–490.

152. Zachary WW. An information flow model for conflictand fission in small groups. J Anthropol Res 1977,33:452–473.

153. Wei F, Wang C, Ma L, Zhou A. Detectingoverlapping community structures in networks withglobal partition and local expansion. In: Zhang Y,Xu G, Yu G, Bertino E, eds. Proceedings of the 10thAsia-Pacific Web Conference on Progress in WWWResearch and Development (Shenyang, China, April26–28, 2008). Lecture Notes in Computer Science.Berlin/Heidelberg: Springer-Verlag; 2008, 43–55.

154. Baumes J, Goldberg MK, Krishnamoorthy MS,Magdon-Ismail M, Preston N. Finding communitiesby clustering a graph into overlapping subgraphs.In: Guimaraes N, Isaias PT, eds. AC’05: Proceedingsof the IADIS International Conference on AppliedComputing. Algarve: IADIS Press; 2005, 97–104.

155. Derenyi I, Palla G, Vicsek T. Clique percolation inrandom networks. Phys Rev Lett 2005, 94:160202.

156. Kumpula JM, Kivela M, Kaski K, Saramaki J.Sequential algorithm for fast clique percolation. PhysRev E 2008, 78:026109.

157. McDaid A, Hurley N. Detecting highly overlappingcommunities with model-based overlapping seedexpansion. In: Proceedings of the 2010 InternationalConference on Advances in Social Networks Analysisand Mining, ser. ASONAM ’10. Washington, DC:IEEE Computer Society; 2010, 112–119.

158. Latouche P, Birmele E, Ambroise C. 2009. Over-lapping stochastic block models. Technical ReportarXiv:0910.2098. Available at: http://arxiv.org/abs/0910.2098. (Accessed May 23, 2012).

159. Fortunato S. Community detection in graphs. PhysRep 2010, 486:75–174. doi: 10.1016/j.physrep.2009.11.002.

160. Lusseau D, Schneider K, Boisseau OJ, Haase P, SlootenE, Dawson SM. The bottlenose dolphin communityof doubtful sound features a large proportion oflong-lasting associations. Behav Ecol Sociobiol 2003,54:396–405.

161. Simon HA. The architecture of complexity. Proc AmPhilos Soc 1962, 106:467.

162. Reichardt J, Bornholdt S. Statistical mechanics ofcommunity detection. Phys Rev E 2006, 74:016110.

163. Kumar P, Wang L, Chauhan J, Zhang K. Discovery andvisualization of hierarchical overlapping communitiesfrom bibliography information. In: Proceedings ofthe 2009 Eighth IEEE International Conference onDependable, Autonomic and Secure Computing, ser.DASC ’09. Washington, DC: IEEE Computer Society;2009, 664–669.

164. Rosvall M, Bergstrom CT. Maps of random walks oncomplex networks reveal community structure. ProcNatl Acad Sci U S A 2008, 105:1118–1123.

165. Rosvall M, Bergstrom CT. Multilevel compressionof random walks on networks reveals hierarchicalorganization in large integrated systems. PLOS One2011, 6:e18209.

166. Lancichinetti A, Radicchi F, Ramasco JJ, FortunatoS. Finding statistically significant communities innetworks. PLoS One 2011, 6:5.

167. Pons P. 2006. Post-processing hierarchical communitystructures: quality improvements and multi-scale view.eprint, arXiv:cs/0608050v1.

168. Arenas A, Fernandez A, Gomez S. Analysis ofthe structure of complex networks at differentresolution levels. New J Phys 2008, 10:053039. doi:10.1088/1367-2630/10/5/053039.

169. Shen H, Cheng X, Cai K, Hu MB. Detect overlappingand hierarchical community structure in networks.Physica A 2009, 388:1706–1712.

170. Ahn YY, Bagrow JP, Lehmann S. Link communitiesreveal multi-scale complexity in networks. Nature2010, 466:761–764. doi: 10.1038/nature09182.

171. Kumar R, Novak J, Tomkins A. Structure andevolution of online social networks. In: Proceedings ofthe 12th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining (Philadelphia,PA, USA, August 20–23, 2006). KDD ’06. New York:ACM; 2006, 611–617.

172. Leskovec J, Backstrom L, Kumar R, TomkinsA. Microscopic evolution of social networks. In:Proceedings of the 14th ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining(Las Vegas, Nevada, USA, August 24–27, 2008). KDD’08. New York: ACM; 2008, 462–470.

173. Backstrom L, Huttenlocher D, Kleinberg J, LanX. Group formation in large social networks:membership, growth, and evolution. In: Proceedings ofthe 12th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining (Philadelphia,PA, USA, August 20–23, 2006). KDD ’06. New York:ACM; 2006, 44–54.

174. Kairam SR, Wang DJ. Leskovec J. The life and death ofonline groups: predicting group growth and longevity.In: Proceedings of the Fifth ACM International

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 441

Page 35: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

Conference on Web Search and Data Mining (WSDM’12). New York: ACM; 2012, 673–682.

175. Tantipathananandh C, Berger-Wolf T, Kempe D. Aframework for community identification in dynamicsocial networks. In: Proceedings of the 13th ACMSIGKDD International Conference on KnowledgeDiscovery and Data Mining (San Jose, California,USA, August 12–15, 2007). KDD ’07. New York:ACM; 2007, 717–726.

176. Greene D, Doyle D, Cunningham P. Trackingthe evolution of communities in dynamic socialnetworks. In: Proceedings of ASONAM ’10: The2010 International Conference on Advances in SocialNetworks Analysis and Mining., Washington, DC:IEEE Computer Society; 2010, 176–183.

177. Kim M.-S, Han J. Chronicle: a two-stage density-based clustering algorithm for dynamic networks.In: Proceedings of the 12th International Conferenceon Discovery Science, ser. DS ’09. Berlin/Heidelberg:Springer-Verlag; 2009, 152–167.

178. Wang Y, Wu B, Du N. 2008. Community evolutionof social network: feature, algorithm and model.arXiv:0804.4356.

179. Lin Y, Sundaram H, Chi Y, Tatemura J, TsengBL. Blog community discovery and evolution basedon mutual awareness expansion. In: Proceedingsof the IEEE/WIC/ACM International Conferenceon Web Intelligence (November 02–05, 2007).Web Intelligence. Washington, DC: IEEE ComputerSociety; 2007, 48-56.

180. Palla G, Barabasi AL, Vicsek T. Quantifying socialgroup evolution. Nature 2007, 446:664–667. doi:10.1038/nature05670.

181. Spiliopoulou M, Ntoutsi I, Theodoridis Y, SchultR. MONIC: modeling and monitoring clustertransitions. In: Proceedings of the 12th ACM SIGKDDInternational Conference on Knowledge Discoveryand Data Mining (Philadelphia, PA, USA, August20–23, 2006). KDD ’06. New York: ACM; 2006,706–711.

182. Lin Y, Chi Y, Zhu S, Sundaram H, Tseng BL.Analyzing communities and their evolutions indynamic social networks. ACM Trans Knowl DiscovData 2009, 3:1–31.

183. Bailey NTJ. The Mathematical Theory of InfectiousDiseases and Its Application. 2nd ed. London: Griffin;1975.

184. Newman MEJ, Forrest S, Balthrop J. Email networksand the spread of computer viruses. Phys Rev E 2002,66:035101–035104.

185. Mahajan V, Muller E, Bass FM. New product diffusionmodels in marketing: a review and directions forresearch. J Market 1990, 54:1–26.

186. Bass FM. A new product growth for model consumerdurables. Manage Sci 1969, 15:215–227.

187. Ryan B, Gross NC. The diffusion of hybrid seed cornin two Iowa communities. Rural Sociol 1943, 8:15–24.

188. Rogers EM. Diffusion of Innovations. 4th ed. NewYork: Simon & Schuster; 1995.

189. Morris S. Contagion. Rev Econ Stud 2000, 67:57–78.

190. Centola D, Macy M. Complex contagions andthe weakness of long ties1. Am J Sociol 2007,113:702–734.

191. Katona Z, Zubcsek P, Sarvary M. Network effects andpersonal influences: the diffusion of an online socialnetwork. J Market Res 2011, 48:425–443.

192. Bhat SY, Abulaish M. Overlapping social networkcommunities and viral marketing. In: InternationalSymposium on Computational and Business Intel-ligence, New Delhi, India, IEEE Computer Society;2013, 242–246.

193. Bakshy E, Rosenn I, Marlow C, Adamic L. The role ofsocial networks in information diffusion. In: WWW’12Proceedings of the International conference on WWW,Lyon, France, 2012.

194. Lerman K, Ghosh R, Surachawala T. 2012. Socialcontagion: an empirical study of information spread onDigg and Twitter follower graphs. arXiv:1202.3162[cs.SI].

195. Iribarren JL, Moro E. Impact of human activitypatterns on the dynamics of information diffusion.Phys Rev Lett 2009, 103:038702.

196. Iribarren JL, Moro E. 2008. Information dif-fusion epidemics in social networks. Preprint:arXiv:0706.0641v1 [physics.soc-ph].

197. Aral S, Brynjolfsson E, Alstyne MWV. Productivityeffects of information diffusion in networks. In:Proceedings of the 28th Annual InternationalConference on Information Systems, Montreal,Canada, 2007.

198. Sadilek A, Kautz H, Silenzio V. Modeling spreadof disease from social interactions. In: Proceedingsof Sixth AAAI International Conference on Weblogsand Social Media (ICWSM), Trinity College Dublin,Ireland 2012.

199. Corley CD, Mikler AR, Singh KP, Cook DJ.Monitoring influenza trends through mining socialmedia. In: Proceedings of the International Conferenceon Bioinformatics and Computational Biology(BIOCOMP09). Las Vegas, NV; 2009, 340–346.

200. Granovetter M. Threshold models of collectivebehavior. Am J Sociol 1987, 83:1420–1443.

201. Goldenberg J, Libai B, Muller E. Using complexsystems analysis to advance marketing theorydevelopment: modeling heterogeneity effects on newproduct growth through stochastic cellular automata.Acad Market Sci Rev 2001, 1:9.

202. Goldenberg J, Libai B, Muller E. Talk of the network:a complex systems look at the underlying process ofword-of-mouth. Market Lett 2001, 3:211–223.

442 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013

Page 36: Analysis and mining of online social networks: emerging trends and challenges

WIREs Data Mining and Knowledge Discovery Analysis and mining of online social networks

203. Kempe D, Kleinberg J, Tardos E. Maximizing thespread of influence through a social network. In:Proceedings of the Ninth ACM SIGKDD InternationalConference on Knowledge Discovery and Data Mining(Washington, D.C., August 24–27, 2003). KDD ’03.New York: ACM; 2003, 137–146.

204. Jackson MO, Yariv L. Diffusion on social networks.Econ Publ 2005, 16:3–16.

205. Barabasi AL. The origin of bursts and heavy tails inhuman dynamics. Nature 2005, 435:207.

206. Aiello W, Chung F, Lu L. A random graph model forpower law graphs. Exp Math 2000, 10:53–66.

207. Gruhl D, Guha R, Liben-Nowell D, TomkinsA. Information diffusion through Blogspace. In:Proceedings of the 13th International Conference onWorld Wide Web. New York: ACM Press; 2004,491–501.

208. Garg R, Telang R, Smith M. Peer influence andinformation diffusion in online networks: an empiricalanalysis. In: CIST’09: Conference on InformationSystems and Technology (October 10–11), San Diego,CA, 2009.

209. Kwon Y, Kim S, Park S. An analysis of informationdiffusion in the blog world. In: Proceedings of the 1stACM International Workshop on Complex NetworksMeet information & Knowledge Management (HongKong, China, November 02–06, 2009). CNIKM ’09.New York: ACM; 2009, 27–30.

210. Brown J, Reinegen P. Social ties and word-of-mouthreferral behavior. J Consum Res 1987, 1:350–362.

211. Kwon YS, Kim SW, Park S, Lim SH, Lee JB. Theinformation diffusion model in the blog world. In:Proceedings of the 3rd Workshop on Social NetworkMining and Analysis (Paris, France, June 28, 2009).SNA-KDD ’09. New York: ACM; 2009, 1–9.

212. Zhao J, Wu J, Xu K. Weak ties: a subtlerole in the information diffusion of online socialnetworks. Phys Rev E 2010, 82:016105. doi: 10.1103/PhysRevE.82.016105.

213. Kossinets G, Kleinberg J, Watts D. The structureof information pathways in a social communicationnetwork. In: KDD ’08: Proceedings of the 14th ACMSIGKDD International Conference on KnowledgeDiscovery and Data Mining. New York: ACM; 2008,435–443.

214. Stewart A, Chen L, Paiu R, Nejdl W. Discoveringinformation diffusion paths from blogosphere foronline advertising. In: ADKDD ’07: Proceedings ofthe 1st International Workshop on Data Mining andAudience Intelligence for Advertising. ACM, NewYork; 2007, 46–54.

215. Norris JR. Markov Chains. Cambridge: CambridgeUniversity Press; 1997.

216. Song XD, Chi Y, Hino K, Tseng BL. Informationflow modeling based on diffusion rate for predictionand ranking. In: Proceedings of the 16th International

Conference on World Wide Web. New York: ACM;2007, 191–200.

217. Cha M, Mislove A, Gummadi K. A measurement-driven analysis of information propagation in theFlickr social network. In: WWW’09: Proceedings ofthe Eighteenth International Conference on WorldWide Web. New York: ACM; 2009, 721–730.

218. Oh J, Susarla A, Tan Y. 2008. Examining the diffusionof user-generated content in online social networks.Social Science Research Network eLibrary.

219. Habiba H, Yu Y, Berger-Wolf TY, Saia J, Giles L,Smith M, Yen J, Zhang H. Finding spread blockers indynamic networks. In: Giles L et al., eds. Advances inSocial Network Mining and Analysis. Lecture Notesin Computer Science. vol 5498. Berlin/Heidelberg:Springer-Verlag; 2010, 55–76.

220. Yang J, Leskovec J. Modeling information diffusionin implicit networks. In: ICDM’10 Proceedings ofIEEE 10th International Conference on Data Mining.IEEE Computer Society, Los Alamitos, CA, USA 2010,599–608.

221. Wuyts S, Dekimpe MG, Gijsbrechts E, Pieters R.The Connected Customer: The Changing Natureof Consumer and Business Markets. New York:Routledge Academic; 2010.

222. Trusov M, Bucklin RE, Pauwels K. Effects of word-of-mouth versus traditional marketing: findings froman internet social networking site. J Mar Syst 2009,73:90–102.

223. Domingos P. Mining social networks for viralmarketing. IEEE Intell Syst 2005, 20:80–82.

224. Sun J, Tang J. A survey of models and algorithms forsocial influence analysis. Soc Netw Data Anal 2011,1:177–214.

225. Baruh L. Social media marketing: web X.0 ofopportunities. In: Dumova T, Fiordo R, eds.Handbook of Research on Social InteractionTechnologies and Collaboration Software: Conceptsand Trends. Idea Group, Inc. Hershey, New York;2009, 33–45.

226. Stonedahl F, Rand W, Wilensky U. Evolving viralmarketing strategies. In: Proceedings of the 12thAnnual Conference on Genetic and EvolutionaryComputation (Portland, Oregon, USA, July 07–11,2010). GECCO ’10. New York: ACM; 2010,1195–1202.

227. Hartline J, Mirrokni V, Sundararajan M. Optimalmarketing strategies over social networks. In:Proceedings of the 17th international Conference onWorld Wide Web (Beijing, China, April 21–25, 2008).WWW ’08. New York: ACM; 2008, 189–198.

228. Arthur D, Motwani R, Sharma A, Xu Y. Pricingstrategies for viral marketing on social networks. In:Leonardi S, ed. Proceedings of the 5th InternationalWorkshop on Internet and Network Economics(Rome, Italy, December 14–18, 2009). Lecture Notes

Volume 3, November/December 2013 © 2013 John Wiley & Sons, Ltd. 443

Page 37: Analysis and mining of online social networks: emerging trends and challenges

Advanced Review wires.wiley.com/widm

In Computer Science, vol 5929. Berlin/Heidelberg:Springer-Verlag; 2009, 101–112.

229. Kimura M, Saito K, Motoda H. Minimizing the spreadof contamination by blocking links in a network. In:Proceedings of the 23rd AAAI Conference on ArtificialIntelligence. AAAI Press, Palo Alto, California; 2008,1175–1180.

230. Kimura M, Saito K, Motoda H. Blocking links tominimize contamination spread in a social network.ACM Trans Knowl Discov Data 2009, 3:1–23.

231. Broadbent S, Hammersley J. Percolation processes:I. Crystals and mazes. Proc Camb Philos Soc 1957,53:629–641.

232. Stauffer D, Aharony A. Introduction to PercolationTheory. London: Taylor & Francis; 1991.

233. McCann U. 2009. Power to the people: socialmedia tracker wave 4. Available at: http://universalmccann.bitecp.com/wave4/Wave4.pdf. (Accessed Jan-uary 18,2011).

234. Nelsenwire. 2009. Global advertising: consumerstrust real friends and virtual strangers the most.Nielsen Global Online Consumer Survey. Available at:http://blog.nielsen.com/nielsenwire/consumer/global-advertising-consumers-trust-real-friends-and-virtual-strangers-the-most/. (Accessed June 30, 2010).

235. Richardson M, Domingos P. Mining knowledge-sharing sites for viral marketing. In: Proceedings of theEighth ACM SIGKDD International Conference onKnowledge Discovery and Data Mining (Edmonton,Alberta, Canada, July 23–26, 2002). KDD ’02. NewYork: ACM; 2002; 61–70.

236. Lawrence R, Melville P, Perlich C, Sindhwani V,Meliksetian S, Hsueh PY, Liu Y. Social Media

Analytics—The Next Generation of Analytics-BasedMarketing Seeks Insights From Blogs OR/MS Today,February, 2010, 26–30. Marietta, GA: LionheartPublishing Inc.; 2010.

237. Agarwal N, Liu H, Tang L, Yu PS. Identifying theinfluential bloggers in a community. In: Proceedings ofthe International Conference on Web Search and WebData Mining (Palo Alto, California, USA, February11–12, 2008). WSDM ’08. New York: ACM; 2008,207–218.

238. Goyal A, Bonchi F, Lakshmanan LV. Discoveringleaders from community actions. In: Proceedings of the17th ACM Conference on Information and KnowledgeManagement (Napa Valley, California, USA, October26–30, 2008). CIKM ’08. New York: ACM; 2008,499–508.

239. Trusov M, Bodapati AV, Bucklin RE. Determininginfluential users in internet social networks. J MarketRes 2010, 47:643–658.

240. Watts DJ, Dodds PS. Influentials, networks, and publicopinion formation. J Consum Res 2007, 34:441–458.

241. Goldenberg J, Han S, Lehmann DR, Hong JW. Therole of hubs in the adoption process. J Market 2009,73:1–13.

242. Hinz O, Skiera B, Barrot C, Becker JU. Seeding strate-gies for viral marketing: an empirical comparison. JMarket 2011, 75:55–71.

243. Iyengar R, den Bulte CV, Valente TW. Opinionleadership and social contagion in new productdiffusion. Market Sci 2011, 30:195–212.

244. Leskovec J, Adamic LA, Huberman BA. The dynamicsof viral marketing. ACM Trans Web 2007, 1:Article 5.

FURTHER READINGAgrawal R, Rantzau R, TerziE. Context-sensitive ranking. In: Proceedings of the 2006 ACM SIGMOD InternationalConference on Management of Data (Chicago, IL, USA, June 27–29, 2006). SIGMOD ’06. New York: ACM; 2006,383–394.

Chickering DM, Heckerman D. A decision-theoretic approach to targeted advertising. In: Proceedings of Sixteenth Conferenceon Uncertainty in Artificial Intelligence (UAI). Stanford, CA: Stanford University; 2000, 82–88.

Domingos P, Richardson M. Mining the network value of customers. In: Proceedings of the Seventh ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining (San Francisco, California, August 26–29, 2001). KDD’01. New York: ACM; 2001, 57–66.

Li P, Li Z, Liu H, He J, Du X. Using link-based content analysis to measure document similarity effectively. In: Advancesin Data and Web Management. Lecture Notes in Computer Science, vol 5446. 2009, 455–467. Springer, Berlin Heidelbergdoi: 10.1007/978-3-642-00672-2_40.

Nie Z, Zhang Y, Wen J, Ma W. Object-level ranking: bringing order to Web objects. In: Proceedings of the 14th InternationalConference on World Wide Web (Chiba, Japan, May 10–14, 2005). WWW ’05. New York: ACM; 2005, 567–574.

Snijders TAB. The statistical evaluation of social network dynamics. In: Sobel M, Becker M, eds. Sociological Methodology,vol. 31. Blackwell Publishers Inc, Oxford; 2001, 361–395.

444 © 2013 John Wiley & Sons, Ltd. Volume 3, November/December 2013


Top Related