# Analysis of misinformation containment in online social networks


Nam P. Nguyen a,b, Guanhua Yan b, My T. Thai a

a CISE Department, University of Florida, Gainesville, USA
b Information Sciences (CCS-3), Los Alamos National Laboratory, NM, USA

Computer Networks 57 (2013) 2133-2146. Published by Elsevier B.V.
http://dx.doi.org/10.1016/j.comnet.2013.04.002

Article history: Received 17 May 2012; Received in revised form 24 January 2013; Accepted 8 April 2013; Available online 20 April 2013.

Keywords: Misinformation containment; Community detection; Information propagation

Corresponding author at: CISE Department, University of Florida, Gainesville, USA. Tel.: +1 7402740545.
E-mail addresses: nanguyen@cise.ufl.edu (N.P. Nguyen), ghyan@lanl.gov (G. Yan), mythai@cise.ufl.edu (M.T. Thai).

ABSTRACT

With their blistering expansion in recent years, popular online social sites such as Twitter, Facebook and Bebo have become not only one of the most effective channels for viral marketing but also major news sources for many people nowadays. Alongside these promising features, however, comes the threat of misinformation propagation, which can lead to undesirable effects. Due to the sheer number of online social network (OSN) users and the highly clustered structures commonly shared by these kinds of networks, there is a substantial challenge to efficiently contain viral spread of misinformation in large-scale social networks. In this paper, we focus on how to limit viral propagation of misinformation in OSNs. Particularly, we study a set of problems, namely the β^I_T-Node Protector problems, which aim to find the smallest set of highly influential nodes from which disseminating good information helps to contain the viral spread of misinformation, initiated from a set of nodes I, within a desired fraction (1 - β) of the nodes in the entire network in T time steps. For this family of problems, we analyze and present solutions including their inapproximability results, greedy algorithms that provide better lower bounds on the number of selected nodes, and a community-based method for solving these problems. We further conduct a number of experiments on real-world traces, and the empirical results show that our proposed methods outperform existing alternative approaches in finding those important nodes that can help to contain the spread of misinformation effectively.

1. Introduction

The growing popularity of online social networks, together with their diversity, has drastically changed the landscape of communications and information sharing in cyber-space. Many people have integrated popular online social sites, such as Facebook and Twitter, into their everyday lives and rely on them as one of their major news sources. For example, the news of the hit on Bin Laden first broke out on Twitter long before the US president officially announced it on the public media [1], and the recent Occupy Wall Street movement spread quickly and widely to a larger population due to its Facebook page [2]. The popularity of OSNs, as a result, is obtained from the convenience as well as the efficiency of information dissemination and sharing based on the trust relationships built among their users. Unfortunately, such trust relationships on social networks can possibly be exploited for distributing misinformation or rumors that could cause potentially undesirable effects such as widespread panic in the general public. For instance, misinformation about swine flu was observed in Twitter tweets at the outset of the large outbreak in 2009 [3], and the false report that President Obama was killed spread widely from the hacked Fox News Twitter feed in July 2011 [4].

In order for online social networks to serve as a trustworthy channel for disseminating important information,

it is crucial to have an effective strategy to contain or limit the viral effect of such misinformation. In particular, we aim to find a tight set of users from whom disseminating good information minimizes the negative effects of misinformation; in other words, we want to make sure that most of the network users are aware of the good information by the time the bad information reaches them. Here, the good information can be an authorized announcement to correct the corresponding misinformation. In the above examples, good information could be something simple such as "The swine flu rumor is not correct" or "President Obama is still healthy". However, from whom should the good information be disseminated so that the viral effect of misinformation can be contained in a timely manner, when the infected sources are either known (e.g., the hacked Fox News Twitter feed) or unknown (e.g., tweets about swine flu)?

Due to the sheer number of online social network users and the highly clustered structures commonly shared by these kinds of networks, there is a substantial challenge to efficiently contain viral spread of misinformation in large OSNs. Conventional wisdom mainly focuses on immunization, which chooses a set of nodes in the network to immunize in order to disrupt the diffusion process from a graph-theoretic standpoint. In the setting of misinformation containment in OSNs, immunization of certain nodes requires inspecting every message traversing them and stopping those suspected of carrying misinformation. This process itself, however, can be computationally expensive due to the enormous number of messages that spread in a large online social network, e.g., Facebook or Twitter. For example, there were 177 million tweets sent out in a single day on March 11, 2011 [5], and inspecting a tiny URL embedded in a tweet for potential misinformation can be time-consuming and inaccurate [6].

Against this backdrop, in this study we consider a scheme that takes a more offensive approach to fight against viral spread of misinformation on social networks. Rather than classifying messages spreading in a network as misinformation or not, this method relies on the similar diffusion mechanism adopted by the misinformation propagation in order to contain it. The key difference, however, is that misinformation often starts from nodes that are less influential, and its propagation speed is thus constrained by the trust relationships inherent in the diffusion process from these origins. The containment methods we consider herein, by contrast, aim to find a smallest set of influential people to "decontaminate" so that the good diffusion process starting from them achieves the desirable effect on the spread of misinformation, i.e., the propagation of misinformation is contained within a fraction (1 - β) of the entire network. We call it the β^I_T-Node Protector problem, where β is the desired decontamination threshold, I is the initial infected set (either known or unknown), and T is the time window allowed for decontamination (either constrained or unconstrained). Here, the superscript I or subscript T is shown only if I is known or T is constrained, respectively.

Some attempts at limiting misinformation have been made in earlier works (see Related Work). The most relevant work to our effort is the one by Budak et al. [7], in which the authors formulated this as an optimization problem, proved its NP-hardness, and then provided approximation guarantees for a greedy solution based on the submodularity property. However, the key differences between our work and theirs are: (1) they impose a k-node budget, i.e., the size of the selected set of nodes is constrained by k, and (2) they assume highly effective propagation, i.e., the probability for good information spreading is either one or zero, whereas our decontamination model is more general since it allows arbitrary spreading probabilities. Moreover, we provide a far richer framework for studying the problem of containing viral spread on OSNs, where we consider not only whether the initial set of nodes contaminated by misinformation is known to the defender, but also take into account the time allowed for the defender to contain the misinformation. We thus believe the results from this work offer more insights into how to contain viral spread on OSNs under diverse constraints in practice.

In a nutshell, our main contributions are summarized as follows. First, we analyze an algorithm for the β-Node Protector problem, which greedily chooses nodes with the best marginal influence added to the current solution, and show that this algorithm selects only a small fraction of extra nodes beyond the optimal solution (Theorem 1). This result, indeed, provides better knowledge of the lower bound of the optimal solution in comparison with the (1 + ln(βN)) factor suggested in [8]. Second, we show that the β^I_T-Node Protector problem is hard to approximate within a logarithmic factor, i.e., there is no polynomial-time algorithm that is guaranteed to find a solution at most a logarithmic factor off the optimum solution. In normal graphs, we apply the greedy algorithm to the network restricted to the T-hop neighbors of the initial set I and achieve a slightly better bound for the β^I_T-Node Protector problem. Third, we propose a community-based algorithm which returns a good selection of nodes to decontaminate the nodes in the network in an efficient manner. Finally, we conduct experiments on real-world traces including the NetHEPT and Facebook networks [9], and empirical results show that both the greedy and community-based algorithms obtain the best results in comparison with other available methods.

2. Related work

The information and influence propagation problem on social networks was first studied by Domingos and Richardson in [10]. In this work, they designed viral marketing strategies and analyzed the diffusion processes using a data mining approach. Later, Kempe et al. [11] formulated the influence maximization problem on a social network as an optimization problem. In their seminal work, they focused on the linear threshold and independent cascade models and proposed a generalized framework for both of them, as well as proving that the problem of influence maximization with a k-node budget admits a (1 - 1/e)-approximation algorithm. Leskovec et al. [12] studied influence propagation in an outbreak detection setting. In particular, they aimed to find the set of nodes in networks to detect the outbreak, e.g., the spread of a virus, as soon as possible. Chen et al. [13] improved the efficiency of the greedy algorithm and proposed a new degree discount heuristic that is much faster and more scalable.

Some attempts have been made in the light of containing the spread of misinformation. For instance, the concept of using benign computer worms to fight against another species has been studied in [14,15]. Most of these works focus on analyzing the performance of active worm containment in the traditional arena of worm propagation, where infectious computers use scanning strategies to find new victims in the IPv4 address space. Dubey et al. [16] conducted a study in the form of a network game focusing on a quasi-linear model with various costs and benefits for competing firms. Bharathi et al. [17] modified the independent cascade model to better capture competing campaigns in the network. Kostka et al. [18], from a game theory point of view, show that spreading first is not always advantageous. Recently, Budak et al. [7] considered the strategy of using a "good" information dissemination campaign to fight against misinformation propagation in social networks. They formulated this as an optimization problem, proved that it is NP-hard, and then provided approximation guarantees for a greedy solution based on the submodularity property. This problem, indeed, can be considered as one variant of the family of problems we address in this work. In [8], Goyal et al. study information dissemination on social networks. They provide a greedy algorithm and give a proof of a (1 + ln(βN)) factor. However, that algorithm is a bicriteria one with an additive error on the number of nodes, whereas our analysis suggests a deterministic bound which does not depend on any error parameter.

From the security perspective of OSNs, information and influence propagation plays an important role in analyzing and designing strategies to counter misinformation such as rumors and computer malware. In [19], for example, Yan et al. conducted a comprehensive study of malware containment strategies, including both user-oriented and network-oriented ones, on a moderate-sized online social network. Their work, albeit offering insights into the nature of malware propagation in realistic OSNs, was performed fully from an empirical perspective, and considered only a simple model for malware propagation. Tackling containment of viral spread of misinformation in OSNs, however, demands solutions with a stronger theoretic footing such that they are applicable to a variety of online social network structures and information dissemination models. Moreover, Xu et al. considered the problem of detecting worm propagation in online social networks, and showed that finding a minimal set of nodes to monitor for the purpose of traffic correlation analysis is an NP-complete problem [20]. Our work differs from theirs as we focus on containment, rather than detection, of misinformation propagation in OSNs.

3. Diffusion models and problem definition

In this section, we first define two models of influence propagation in online social networks, as well as a mechanism modeling how good information is disseminated in the network in order to contain misinformation. Under these models, we further formulate the β^I_T-Node Protector problem, which aims to find the smallest set of highly influential nodes in the decontamination campaign.

3.1. Propagation models

We first describe two types of information diffusion, namely the Linear Threshold and Independent Cascade models. These two practical propagation models have received great attention since their introduction in [11], and in this subsection they are discussed with the same notations. For the sake of consistency, we call a node active if it is influenced by the misinformation, either initially or sequentially from one of its neighbors, and inactive otherwise.

3.1.1. Linear Threshold (LT) model

In this model, the chance for a node v to adopt the misinformation from a neighbor w is determined by the weight b_{v,w}, which satisfies Σ_{w∈N(v)} b_{v,w} ≤ 1 over the neighborhood N(v) of v. Initially, each node v ∈ V independently selects a threshold θ_v ∈ [0, 1] uniformly at random. This threshold represents the weighted fraction of v's neighbors that must adopt the misinformation (become active) in order for v to become active. Now, given the chosen thresholds for all nodes v in V, the propagation progresses from an initial set of infected nodes I as follows: in step t, all nodes active in step t - 1 remain active, and any inactive node v for which the total weight of its active neighbors is at least θ_v, i.e., Σ_{w∈N_active(v)} b_{v,w} ≥ θ_v, becomes activated.
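The LT dynamics above can be sketched in a few lines of Python. This is a minimal illustration under the definitions in this subsection; the graph encoding (in-neighbor lists plus a weight dictionary) and the function name are our own, not the paper's notation.

```python
import random

def linear_threshold(graph, weights, seeds, rng=random.Random(0)):
    """Simulate the Linear Threshold model.

    graph:   dict mapping each node v to its in-neighbors N(v)
    weights: dict mapping (w, v) to b_{v,w}, with sum_w b_{v,w} <= 1 per node v
    seeds:   initially active (misinformed) nodes I
    """
    # Each node independently draws its threshold theta_v uniformly at random.
    theta = {v: rng.random() for v in graph}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in graph:
            if v in active:
                continue  # active nodes stay active
            # v activates once the total weight of its active in-neighbors
            # reaches theta_v.
            total = sum(weights[(w, v)] for w in graph[v] if w in active)
            if total >= theta[v]:
                active.add(v)
                changed = True
    return active
```

For example, on a directed chain a → b → c with all weights equal to 1, seeding {a} activates every node, since each threshold is strictly below 1.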

3.1.2. Independent Cascade (IC) model

In this model, any node v that became active in step t has only one chance to activate each of its currently inactive neighbors w, and the activation from v to a neighbor node w succeeds with probability p_{v,w}. If node w has multiple newly activated neighbors, they will try to activate w sequentially in an arbitrary order. If any of these attempts succeeds, w is activated at time step t + 1 and the same procedure continues from w, i.e., w will try to activate its inactive neighbors. Again, any node is given only a single chance to influence its friends; thus, if it fails to do so at time t, it is not allowed to try to activate its friends again at time t + 1. Note that in both the IC and LT models, the information can be propagated to multiple-hop neighbors.
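A single IC run can likewise be sketched with a breadth-first frontier; each edge is tried at most once, matching the one-chance rule above. The encoding (out-neighbor lists and an edge-probability dictionary) is our own illustration.

```python
import random
from collections import deque

def independent_cascade(succ, prob, seeds, rng=random.Random(0)):
    """Simulate one run of the Independent Cascade model.

    succ:  dict mapping each node to its out-neighbors
    prob:  dict mapping (u, v) to the activation probability p_{u,v}
    seeds: initially active nodes
    """
    active = set(seeds)
    frontier = deque(seeds)  # nodes activated in the previous step
    while frontier:
        u = frontier.popleft()
        # u gets exactly one chance to activate each inactive neighbor.
        for v in succ[u]:
            if v not in active and rng.random() < prob[(u, v)]:
                active.add(v)
                frontier.append(v)
    return active
```

With all edge probabilities equal to 1 this degenerates to plain reachability, which is a convenient sanity check.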

3.1.3. Decontamination mechanism

The decontamination mechanism in our problem coincides with the misinformation spreading model. In particular, once the good information spreads out from a particular set of nodes A_I, each node u in A_I will try to spread the good information to its neighbor node v with the same influence probability p_{u,v} (as in the underlying IC model), or with the same influence threshold θ_v (as in the LT model). We also assume that once a node is decontaminated with good information, it will no longer be influenced by the misinformation. Moreover, if good information and misinformation reach a node at the same time, the good information overrules the bad. This assumption makes sense for OSNs in reality, as a good message can be announced to fix a specific piece of misinformation.
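The two competing cascades and the tie-breaking rule above can be sketched as one synchronous Monte Carlo run under the IC model. This is a hedged illustration with our own helper names, not the paper's implementation: nodes keep the first label that reaches them, decontaminated nodes never flip, and good information wins simultaneous arrivals.

```python
import random
from collections import deque

def competing_cascade(succ, prob, bad_seeds, good_seeds, rng=random.Random(0)):
    """One run of misinformation vs. good information under IC dynamics.

    A node keeps the first label that reaches it; when both labels arrive
    in the same step, the good information overrules the bad.
    """
    label = {u: "bad" for u in bad_seeds}
    label.update({u: "good" for u in good_seeds})  # good wins ties at t = 0
    frontier = deque(label)
    while frontier:
        pending = {}  # tentative labels for nodes reached this step
        for _ in range(len(frontier)):
            u = frontier.popleft()
            for v in succ[u]:
                if v in label or rng.random() >= prob[(u, v)]:
                    continue  # already labeled, or the single attempt failed
                # good overrules bad when both arrive in the same step
                if pending.get(v) != "good":
                    pending[v] = label[u]
        for v, tag in pending.items():
            label[v] = tag
            frontier.append(v)
    return label
```

In a small test where a bad seed and a good seed reach the same node simultaneously, that node and everything downstream of it end up decontaminated.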

3.2. Problem definition

We consider the following problem in this paper:

Definition 1 (β^I_T-Node Protector). Consider a social network represented by a directed graph G = (V, E) and an underlying diffusion model (either the LT or the IC model). In the presence of misinformation spreading on G from an either known or unknown initial node set I, our goal is to choose the smallest set S ⊆ V of nodes to decontaminate with good information so that the expected decontamination ratio, which is the fraction of uncontaminated nodes in the whole network after T time windows, is at least β. Here T ∈ N and β ∈ (0, 1] are input parameters.

Based on the settings of the initial set I and the time window T, we have the following four different variants of the Node Protector (NP) problems, whose NP-hardness properties can be certified but are omitted here due to the space limit.

1. β-NP: I is unknown and T is unconstrained (T = ∞).
2. β^I-NP: I is known and T is unconstrained (T = ∞).
3. β^I_T-NP: I is known and T is constrained (T < ∞).
4. β_T-NP: I is unknown and T is constrained (T < ∞).

Proof. Let us first describe the notations used in this proof. Let S_i (i = 1…K) be the set produced at the i-th step of Algorithm 1 (note that S = S_K by this definition). For any integer k, let Opt_σ(k) be the maximum value of σ(A) over all sets A ⊆ V of |A| = k nodes, i.e., Opt_σ(k) = max_{A⊆V, |A|=k} σ(A). Furthermore, let q = |Q|, where Q is the optimal solution set with the lowest σ(Q) that exceeds βN, i.e., Q = argmin_{A is an OPT, σ(A) ≥ βN} σ(A). It follows from the above definitions that:

(i) q = |OPT| for any optimal solution set OPT;
(ii) Opt_σ(·) and σ(·) are nondecreasing functions, and for all A ⊆ V, σ(A) ≤ Opt_σ(|A|);
(iii) no Opt_σ(k), for any k = 1…K, lies strictly between βN and σ(Q) (otherwise it would violate the definition of Q).

We are now ready to prove the theorem. Since Algorithm 1 terminates at the K-th step, it follows that σ(S_{K-1}) < βN. We consider the following cases:

Case 1: Opt_σ(K - 1) < βN. When Opt_σ(K - 1) < βN, it implies that any set with even K - 1 nodes is not sufficient to achieve the desired disinfection goal. Therefore, the optimal solution Q must contain q > K - 1 nodes. Moreover, q ≤ |S| = K since S is a regular solution. This means |Q| = K, which in turn implies that S is also an optimal solution.

Case 2: βN ≤ Opt_σ(K - 1) ≤ σ(Q). Due to (iii) and the definitions of Q and Opt_σ, this case can only be valid either when (a) Opt_σ(K - 1) = βN or (b) Opt_σ(K - 1) = σ(Q).

When (a) Opt_σ(K - 1) = βN, it follows that the optimal solution Q must span exactly βN nodes, since σ(Q) is the closest to βN. Note that this does not necessarily indicate that |OPT| = K - 1. Instead, what it implies is βN = Opt_σ(K - 1) = Opt_σ(K - 2) = … = Opt_σ(q) = σ(Q).

When (b) Opt_σ(K - 1) = σ(Q), it again infers that σ(Q) = Opt_σ(q) = … = Opt_σ(K - 1) due to (ii). Now, if q = K - 1, the greedy algorithm incurs just one more node than the optimal solution. Otherwise, the analysis of Case 3 can be applied in a very similar manner and, consequently, we obtain the same result K ≤ q + ⌈(βN/e - Δ)/δ⌉ + 1 for both situations.

Case 3: σ(Q) < Opt_σ(K - 1) ≤ N. This is our main case to handle. What we know in this case is |OPT| = q ≤ K - 1, implying Opt_σ(q) ≤ Opt_σ(K - 1). We need to bound the size difference between S_{K-1} and S_q, or in other words, to bound K - 1 - q. To do so, we use the (1 - 1/e)-approximation result in [22] (note that all K steps of Algorithm 1 follow the HC algorithm, and hence this guarantee follows naturally), giving

(1 - 1/e) Opt_σ(q) ≤ σ(S_q) ≤ σ(S_{K-1}) = βN - Δ.

Therefore, σ(S_{K-1}) - σ(S_q) ≤ βN - Δ - (1 - 1/e) Opt_σ(q). In addition, since Opt_σ(q) ≥ σ(Q) ≥ βN, the above inequality becomes

σ(S_{K-1}) - σ(S_q) ≤ βN - Δ - (1 - 1/e) βN = βN/e - Δ.

In order to lower bound σ(S_{K-1}) - σ(S_q), we observe that every time a node v is added into the solution set S_i, the expected number of decontaminated nodes increases at least by min_i {σ(S_{i+1}) - σ(S_i)} ≥ δ for i = q, …, K - 2. Hence (K - 1 - q) δ ≤ σ(S_{K-1}) - σ(S_q) ≤ βN/e - Δ, which implies |S| = K ≤ |OPT| + ⌈(βN/e - Δ)/δ⌉ + 1. This bound also concludes the proof. □

Theorem 1 implies that the solution returned by GVS is within a linear factor of βN extra from the optimal solution, given the desired decontamination ratio β. Intuitively, the arbitrary selection of any βN nodes in the network is always sufficient for our problem; however, this lower bound implies that GVS, indeed, selects at most 1/e ≈ 36% of this many nodes in addition to the optimal solution. Moreover, the bigger β, i.e., the smaller the number of misinformation nodes we allow, the more nodes we have to protect, and vice versa. The bound in Theorem 1 nicely reflects this intuition: when β is bigger, the range for K on the right hand side (RHS) gets larger, which allows K to get bigger as more nodes need to be decontaminated. Vice versa, when β gets smaller, the range for K reduces as fewer nodes need to be protected.
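The greedy selection analyzed above (repeatedly add the node with the largest marginal gain in expected influence until the βN target is met) can be sketched as follows. This is a minimal illustration in the spirit of GVS, not the paper's Algorithm 1: the function name is ours, and `sigma` is assumed to be a black-box (possibly Monte Carlo) estimator of the expected number of decontaminated nodes.

```python
def greedy_viral_stopper(nodes, sigma, beta, N):
    """Greedily build a solution S until sigma(S) >= beta * N.

    nodes: iterable of candidate nodes
    sigma: callable mapping a set of nodes to its expected influence
    beta:  desired decontamination ratio
    N:     number of nodes in the network
    """
    S = set()
    # Stop when the target is met, or when every node has been selected.
    while sigma(S) < beta * N and len(S) < len(nodes):
        # Pick the node with the best marginal influence gain.
        best = max((u for u in nodes if u not in S),
                   key=lambda u: sigma(S | {u}) - sigma(S))
        S.add(best)
    return S
```

On a toy set-cover-like instance where `sigma` counts covered elements, the sketch picks the large-gain node first and stops as soon as the coverage target is reached.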

4.2. Time complexity

It is easy to see that Algorithm 1 will terminate after at most βN steps in the worst case, and the most expensive task is to find max_{u∈V\S} σ(u) in Step 4 of the algorithm. In [23], the authors presented an efficient way to handle this task in (approximately) O(N·M′·D), where M′ is the number of samples and D is the maximum complexity for finding σ(v) for any node v. Therefore, the total time complexity of this bounded algorithm is O(N²·M′·D).

5. Algorithms for the β^I- and β^I_T-Node Protector problems

In this section, we study the family of β^I_T-Node Protector problems where the initial infected set I is known and the time step T is either constrained or unconstrained. Due to its NP-hardness, it seems unrealistic to find an optimal solution to the β^I_T-Node Protector problem in a tractable manner. We further show, by Theorem 2, that this problem in general is hard to approximate within a logarithmic factor, via a similar reduction from the Set Cover problem [24]. This result shows how difficult this problem is, as it implies the nonexistence of any logarithmic approximation algorithm for the β^I_T-Node Protector problem, under the assumption that P ≠ NP.

5.1. Inapproximability of β^I_T-Node Protector

The inapproximability of the β^I_T-Node Protector problem can be shown via a modified reduction from the Set Cover problem. Given an instance I_SC = {U, S} of the Set Cover problem, where U = {u1, u2, …, ul} is a set of l elements and S = {S1, S2, …, St} is a collection of t subsets of U, we (1) place S1, S2, …, St on the left hand side (LHS) and u1, u2, …, ul on the right hand side (RHS), and (2) connect a directed edge from Si to uj with probability 1 if uj ∈ Si. Note that the direction simply means the infection only propagates from one side to the other, but not vice versa. We additionally create T - 1 copies of S1, S2, …, St as well as T - 1 copies of u1, u2, …, ul, and then tie these newly created nodes accordingly to the Si's and uj's to make t directed chains on the LHS and l corresponding directed chains on the RHS, with probability 1 on each edge (Fig. 1). Now, the initially infected set I is set to be {ui | i = 1, …, l} (red nodes) and we want all nodes on the RHS to be inactive at the end of the process, i.e., β = 1. Let us, for short, name this instance βSC and denote nsc = t + l. Note that N = nsc·T in βSC. Using this reduction, one can show the following result:

Theorem 2. The β^I_T-Node Protector problem cannot be approximated in polynomial time to a factor of c ln N, where c is some constant and N is the number of vertices in the graph, under the assumption P ≠ NP.

Proof. First, it is easy to see that the optimal solution of a Set Cover instance and that of its corresponding βSC have the same size, i.e., |OPT(SC)| = |OPT(βSC)|. Moreover, there exists an optimal solution for βSC not containing any ui or its copies (since we can always exclude ui and include the node representing any of the sets Sj that ui belongs to, which leads to a better or equally good solution). In other words, there exists an optimal solution containing only subsets from S. Now, suppose there is a polynomial-time algorithm A that can approximate βSC to a factor c ln N for some constant c, i.e., it can find a solution S for βSC satisfying |S| ≤ c ln N · |OPT(βSC)|. Since N = nsc·T = (t + l)·T, we can rewrite this inequality as

|S| ≤ c ln N · |OPT(βSC)| = c ln(nsc·T) · |OPT(βSC)| ≤ d·c ln(nsc) · |OPT(SC)| = c′ ln(nsc) · |OPT(SC)|,

where d is any constant satisfying d ≥ 1 + ln T / ln(nsc), given the quantities T, t and l of the Set Cover and βSC instances, and c′ = d·c is some constant. This means algorithm A could approximate an instance of the Set Cover problem to a factor of c′ ln(nsc) in polynomial time, which cannot hold true under the assumption P ≠ NP [25]. Thus, the conclusion follows. □

Fig. 1. A modified reduction from the Set Cover problem. All edges have probability 1.

5.2. Algorithms for β^I- and β^I_T-Node Protector

We next present solutions for the β^I- and β^I_T-Node Protector problems on general networks. The spirit of our approaches for these cases is also based on the HC algorithm; however, the searching space is significantly reduced due to the knowledge of the initial set I. Our approach is based on the following crucial observation: once the infected set is known, the total set of nodes possibly influenced by I is reduced to N_T(I), while the nodes in V \ N_T(I) will never be active, where N_T(I) = ∪_{u∈I} N_T(u) and N_T(u) = {v ∈ V | u can reach v within T hops}. In particular, we apply GVS on the network restricted to the T-hop neighbors of the initial set I, which returns a solution S of K nodes that expectedly satisfies K ≤ |OPT| + max{0, ⌈(β′|N_T(I)|/e - Δ′)/δ′⌉} + 1, where δ′ = min_{i=1,…,K-2} {σ(S_{i+1}) - σ(S_i)}, Δ′ = β′N - σ(S_{K-1}), and OPT is an optimal solution set for the β^I_T-Node Protector problem. This provides a slightly better additive term over the optimal solution in comparison with the case of β-Node Protector.

Proof. The proof is similar to that of β-Node Protector and is omitted here. □
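The T-hop neighborhood N_T(I) of the initial set I, which defines the reduced search space, can be computed with a depth-truncated breadth-first search. A minimal sketch, with our own helper name and an adjacency-list encoding:

```python
from collections import deque

def t_hop_neighborhood(succ, sources, T):
    """Compute N_T(I): all nodes reachable from the set `sources`
    within at most T hops (BFS truncated at depth T)."""
    dist = {u: 0 for u in sources}
    queue = deque(sources)
    while queue:
        u = queue.popleft()
        if dist[u] == T:
            continue  # do not expand past T hops
        for v in succ[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)
```

On a directed chain a → b → c → d with T = 2, the reachable set from {a} is {a, b, c}; node d lies outside N_2({a}) and can never become active.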

6. Experimental results

In this section, we show the experimental results of the GVS algorithm on three real networks: the NetHEPT, NetHEPT_WC and Facebook social networks. We want to demonstrate the following: (1) how the GVS algorithm works on the β- and β^I_T-Node Protector problems in comparison to other available methods, and (2) the expected lower bounds of the optimal solutions between ours and those suggested by the (1 + ln(βN)) factor [8].

We also planned to compare our results against those of [7]. However, since the dissemination probabilities in our models are distributed in the range [0, 1], whereas [7] assumes a highly effective decontamination campaign (i.e., good information spreads out with an absolute probability: p_{u,v} = 1 if there is an edge from u to v, and zero otherwise), it does not seem appropriate to do so. In what follows we assume the IC information propagation model.

6.1. Datasets

6.1.1. NetHEPT and NetHEPT_WC

The NetHEPT network is a widely used dataset for testing information diffusion [21,26]. This dataset contains academic collaboration information from the "High Energy Physics Theory" section on arXiv, where nodes stand for authors and links represent coauthorship. In their deliverable, the NetHEPT network contains 15,233 nodes and 31,398 links, and the probabilities on edges are assigned either uniformly at random (for NetHEPT) or with a weighted cascade model (for NetHEPT_WC), where p_{u,v} = 1/d_in(v), with d_in(v) being the in-degree of node v. Note that the weighted cascade model is a special case of the IC model in which the probability for each edge is predetermined.

6.1.2. Facebook network

This dataset contains the friendship information among the Facebook user accounts in the New Orleans region, spanning from September 2006 to January 2009 [9]. To collect the information, the authors created several Facebook accounts, each of which joined the regional network, and crawled the network in a breadth-first-search fashion. The dataset contains more than 63 K nodes (users) connected by more than 1.5 million friendship links, with an average node degree of 23.5. In our experiments, the propagation probability for each link connecting users u and v is proportional to the communication frequency between u and v, normalized over the entire network.

6.1.3. Setup

We compare the GVS algorithm against the following alternative solutions: (1) the Random method, which adds randomly chosen nodes to the current solution until the stopping criterion is met; (2) the High Degree method, which adds nodes with the highest weighted degree to the solution until the stopping criterion is met; (3) the DiscountIC method, which is based on the weighted discount for the IC model suggested in [21]; and (4) the PageRank method, which adds nodes sequentially to the solution based on their PageRank until the stopping criterion is satisfied.

In all experiments, the Monte Carlo simulation for estimating expected influence is averaged over 1000 runs for consistency. Since the execution of the GVS method is expensive, as shown in the running time analysis, we only conduct test cases on small values of β, ranging from 0.01 to 0.37. At each value of β, we run all methods independently and report the number of selected nodes suggested by each of them.
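The Monte Carlo estimation step can be sketched as follows: average the reached-set size over repeated IC simulations. This is a hedged illustration with our own helper names; the paper averages over 1000 runs, while the default here is kept small for clarity.

```python
import random

def estimate_sigma(succ, prob, seed_set, runs=1000, rng=random.Random(42)):
    """Monte Carlo estimate of the expected influence sigma(seed_set):
    the average, over `runs` IC simulations, of the number of nodes
    reached from seed_set."""
    total = 0
    for _ in range(runs):
        active = set(seed_set)
        frontier = list(seed_set)
        while frontier:
            u = frontier.pop()
            for v in succ[u]:
                # Each edge gets one activation attempt per run.
                if v not in active and rng.random() < prob[(u, v)]:
                    active.add(v)
                    frontier.append(v)
        total += len(active)
    return total / runs
```

When every edge probability is 1, each run reaches the same set, so the estimate is exact regardless of the number of runs.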

6.2. Number of selected nodes

6.2.1. Results on the β-Node Protector problem

Recall that the ultimate goal of our problem is to choose the smallest set of nodes so that a dissemination ratio of at least β over the entire network is achieved. The left charts of Fig. 2(a)-(c) report the performances of the five methods on the β-Node Protector problem for the three different datasets. As depicted in those figures, the number of selected nodes returned by the GVS algorithm is always the smallest among all the methods considered. In particular, the solution returned by GVS is roughly 24% better than that from the second best method, DiscountIC, 75% better than that from Highest Degree, and more than 1.5 times better than the solution returned by Random.

We observe that the behaviors of all five methods are nearly the same on the NetHEPT and NetHEPT_WC datasets, although the number of selected nodes on the NetHEPT_WC dataset is slightly smaller. Moreover, the number of selected nodes tends to curve up as the target dissemination ratio β gets larger. On the Facebook dataset, GVS again outperforms the other methods in terms of the size of the selected node set and is much better than the Random and Highest Degree methods. While the Random method, not surprisingly, performs the worst in the pool, we had expected the Highest Degree method to have a better performance. Its poor performance can be explained by the sparseness of the real online social network: although the nodes with the highest degrees could influence more nodes directly, they are not necessarily the most influential ones, possibly because of the low chances they have to influence each of their neighbors. In fact, this is the case, since edges with high probabilities mostly connect low-degree nodes in the Facebook network. In addition, unlike what we have observed for the other datasets, the number of nodes returned by each of the methods on the Facebook network increases only linearly as β becomes bigger.

6.2.2. Results on the β^I- and β^I_T-Node Protector problems

We next look at the performances of GVS and the other methods on the β^I- and β^I_T-Node Protector problems. In particular, we randomly choose 15% of all nodes to be I, the initial source of misinformation propagation, and vary T, the time window, between 5 and ∞. We restrict the scope of the GVS algorithm to the reduced network G[N_T(I)], and provide I as part of the input for the other methods as well. The numbers of nodes and edges contained in these restricted networks G[N_T(I)] for the β^I_T-Node Protector problem are 6645 and 18,236 for NetHEPT, 8647 and 21,091 for NetHEPT_WC, and 21,816 and 865,930 for the Facebook dataset, respectively. For the β^I-Node Protector problem setting, these numbers are 7315 and 22,189 for NetHEPT, 9745 and 26,352 for NetHEPT_WC, and 26,816 and 1 M for the Facebook dataset, respectively.

The results are reported in the left charts of Fig. 3a–c. As expected, the numbers of selected nodes returned by the GVS algorithm on the reduced networks outperform the other alternative solutions, with bigger gaps among them. On average, the solution returned by GVS is 64% better than that of DiscountIC, 1.2 times better than that of the Highest Degree method, and up to 2.3 times better than the solution found by the Random method. These results are also consistent with what was observed in the previous test case. We also notice a reduction in the number of nodes selected by all five methods, particularly GVS. The number of nodes chosen by the GVS algorithm is reduced by at least one half in each of the three cases. Of course, this is what one should expect once the knowledge of I is provided, especially when the sizes of the restricted networks are much smaller. The empirical result charts for the βI-Node Protector problem are highly similar to what we have seen in Fig. 3 and are thus excluded here.

N.P. Nguyen et al. / Computer Networks 57 (2013) 2133–2146

There is room for improvement here in how to wisely choose nodes when I is a set of specific targets, such as high-degree nodes or nodes with low or average degrees. While these variants do sound interesting, they are not what we are aiming at here, and thus they will be treated in our future work.

Fig. 2. Results of the GVS algorithm for β-Node Protector on real social networks (left figures) and the expected lower bounds of the optimal solution (right figures).

6.3. Lower bounds of the optimal solutions

We further investigate how large the optimal solution set can be expected to be once we are provided with the greedy solution S from the GVS algorithm. Recall that our result (Section 4) bounds the size of the optimal solution OPT from below in terms of |S|, the target ratio β, the network size N, and the node degrees.


Goyal et al. [8] state that K ≤ (1 + ln(βN/ε))|OPT|, where ε ≥ 1 is the additive error on the number of nodes to be decontaminated. Given a solution S from the GVS algorithm, both this result and ours provide knowledge of the lower bound of the optimal solution OPT. Theoretically, neither of them dominates the other on all network instances.
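Rearranged, the bound of [8] yields a lower bound on the optimum from a greedy solution of size K: |OPT| ≥ K/(1 + ln(βN/ε)). A small sketch with illustrative numbers (not values from the experiments):

```python
import math

def opt_lower_bound(K, beta, N, eps=1.0):
    """|OPT| >= K / (1 + ln(beta*N/eps)), from K <= (1 + ln(beta*N/eps))|OPT|."""
    return K / (1.0 + math.log(beta * N / eps))

# Illustrative numbers only: a greedy solution of K = 400 nodes,
# beta = 0.3, on a network of N = 20,000 nodes, with eps = 1.
print(opt_lower_bound(400, 0.3, 20000))  # about 41.2
```

Because the denominator grows only logarithmically in βN, this lower bound stays well below the greedy solution size on large networks, which is why a tighter bound is valuable.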


Fig. 3. Results of the GVS algorithm for βI5-Node Protector on real social networks (left figures) and the expected lower bounds of the optimal solution (right figures).

For example, a clique with a probability close to 1 on all edges is more suitable for the (1 + ln(βN/ε)) factor, while real social networks are more suitable for ours, as we shall see below. Of course, a larger and tighter value for this lower bound is more desirable, as it tells us how large the size of the optimal solution must be. Thus, we also want to know how the two results look in real-world networks. Here, ε = 1, since in terms of nodes this error should be an integer greater than or equal to one.


As revealed in the right charts of Figs. 2 and 3, our lower bounds (indicated by red circles) are usually larger than that of [8] in all test cases, and consequently provide a more meaningful insight into the expected size of the optimal solution: for any given β, the size of the optimal solution should lie somewhere in the region bounded by the GVS solutions and our lower bounds. This also provides a new point of view on the problem, as it ensures that the optimal solution is not too far away from the one returned by the greedy algorithm. Note that our result does not necessarily imply the existence of any constant or logarithmic factor approximation algorithms.

7. A community-based heuristic algorithm

As described in the experiments, the GVS algorithm provides good solutions for both the β- and βIT-Node Protector problems in comparison with the other methods. One of its downsides, however, is its extremely slow execution, due to the expensive task of estimating the marginal influence when a node is added to the current solution. Even with the available speed-up techniques of [21,26,23] for estimating this marginal gain, GVS still takes a long time to finish its tasks, especially on the Facebook social network (e.g., more than 5 h on a commodity PC). This means that GVS, despite its good performance, might not be the best method for analyzing large-scale online social networks, particularly when the execution time is also a constraint. This drives the need for a more desirable approach which can return a good solution in a timely manner.

To this end, we take into account a notable phenomenon commonly exhibited by large-scale online social networks: the presence of community structure, i.e., these networks naturally consist of groups of vertices with denser connections inside each group and fewer connections crossing between groups, where vertices and connections represent network users and their social interactions, respectively. Roughly speaking, a community in a social network usually consists of people sharing common interests who tend to interact more frequently with other members of the same community than with the outside world. The knowledge of the network community structure, as a result, provides a much better understanding of its topology as well as its organizational principles. Community detection methods and algorithms can be found in the comprehensive survey by Fortunato [27].

Social-based and community-based algorithms utilizing network communities have been shown to be effective, especially when applied to online social networks [28–30]. Taking advantage of community structure, we propose a community-based heuristic method that can return a reasonable solution in a timely manner. More specifically, our method consists of two main phases: (1) the community detection phase, which quickly reveals the underlying network community structure, and (2) the high-influence node selection phase, which effectively selects nodes of high influence. The detailed algorithm is described in Algorithm 2.

Algorithm 2. A community-based algorithm for the β-Node Protector problem

Input: Network G = (V, E), threshold β ∈ (0, 1];
Output: A set S ⊆ V satisfying σ(S) ≥ β|V|;
1: Use Blondel's algorithm [31] to find the community structure of G;
2: Let C = {C1, C2, ..., Cp} be the disjoint network communities with |C1| ≥ |C2| ≥ ... ≥ |Cp|;
3: S ← ∅;
4: for i from 1 to p do
5:   Si ← ∅;
6:   while σ(Si) < β|Ci| do
7:     v ← argmax_{u ∈ V \ Si} {σ(Si ∪ {u}) − σ(Si)};
8:     Si ← Si ∪ {v};
9:   end while
10:  S ← S ∪ Si;
11:  if σ(S) ≥ β|V| then
12:    break;
13:  end if
14: end for
15: Return S.

7.1. Community detection

As described in Algorithm 2, detecting the network communities is the first phase and an important part of our method. A precise community structure that naturally reflects the network topology helps make the selection of influential nodes in each community more effective. Because the detection of network communities is not our focus in this paper, we utilize the community detection method proposed by Blondel et al. [31], whose performance has been verified thoroughly in the literature [32].

There are some features of this detection algorithm that nicely suit our purposes. First, it returns a community structure in which links within a single community usually have high probabilities, while those crossing communities are often of low probability. That is, information propagation within each community occurs with high probability, whereas the chance for misinformation to spread between communities is relatively small. Second, the size of each community is much smaller in comparison with the whole network, and nodes in each community are usually (weakly) connected to each other. Moreover, communities often have small diameters, i.e., nodes in the same community can reach each other within only a few hops. Finally, its execution is reasonably fast, which means it does not add too much time to the entire process.

7.2. High-influence node selection

The first phase provides us a community structure C as a partition of V into disjoint subsets C1, C2, ..., Cp, and we need to select nodes from these subsets so that a decontamination ratio of β is achieved. For simplicity, we assume that the communities are sorted in non-increasing order of their cardinalities.

Since edges crossing between communities usually have low probabilities, a node in one community often has only a low chance of spreading misinformation (or good information, if it is decontaminated) to a node in a different community. Therefore, our problem can be regarded as the selection of nodes to decontaminate in each community so that a decontamination ratio of β is achieved within each community, and hence a decontamination ratio of β over the entire network. Intuitively, one would think of applying the GVS algorithm to each community to find the set of influential nodes to decontaminate, and that is indeed the approach we adopt here. For each community Ci, we greedily select the node providing the maximal marginal gain and add it to Si, as well as to the final solution S, until the stopping criterion is met. The motivation behind our approach is the stronger influence within each community and the much smaller size of each community compared with the whole network; as a result, the selection of high-influence nodes is not as computationally prohibitive as before.
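The two phases can be sketched as follows, with σ estimated by Monte Carlo simulation of the Independent Cascade model (a standard choice; the trial count, the toy graph, and the restriction of the argmax to the current community are our illustrative assumptions, not the paper's exact setup):

```python
import random

def ic_spread(adj, seeds, trials=200):
    """Monte Carlo estimate of sigma(seeds) under the Independent Cascade model.

    adj maps node -> list of (neighbor, activation probability).
    """
    rng = random.Random(0)  # fixed seed keeps the estimate deterministic
    total = 0
    for _ in range(trials):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v, p in adj.get(u, ()):
                    if v not in active and rng.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        total += len(active)
    return total / trials

def community_node_protector(adj, communities, beta):
    """Greedy per-community selection in the spirit of Algorithm 2.

    communities: list of node sets, sorted by non-increasing size.
    """
    n = sum(len(c) for c in communities)
    selected = set()
    for comm in communities:
        si = set()
        while ic_spread(adj, si) < beta * len(comm):
            base = ic_spread(adj, si)
            # node of maximal marginal influence gain within this community
            best = max(comm - si, key=lambda u: ic_spread(adj, si | {u}) - base)
            si.add(best)
        selected |= si
        if ic_spread(adj, selected) >= beta * n:
            break  # target ratio reached over the whole network
    return selected

# Hypothetical example: two tightly knit communities with no cross edges.
adj = {
    'a': [('b', 0.9), ('c', 0.9)], 'b': [('a', 0.9), ('c', 0.9)],
    'c': [('a', 0.9), ('b', 0.9)], 'd': [('e', 0.9)], 'e': [('d', 0.9)],
}
communities = [{'a', 'b', 'c'}, {'d', 'e'}]
print(community_node_protector(adj, communities, beta=0.6))
```

On this toy input one influential node per community already reaches the target ratio, which mirrors the intuition above: each greedy step only has to search a small community instead of the whole network.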

7.3. Experimental results

In this subsection, we report the experimental results of the heuristic algorithm alongside those of the aforementioned methods. We examine (1) the number of selected nodes and (2) the execution time. The experimental setup is kept the same as in Section 6. The numbers of communities detected by Blondel's algorithm on NetHEPT, NetHEPT_WC and Facebook are 1841, 1839 and 260, respectively. We remove the results of the Random method from the charts due to its poor performance and to make the plots more readable.

Simulation results are reported in Fig. 4. As depicted in these subfigures, the numbers of nodes selected by the community-based method (Community, in red circles) are highly competitive with those of the other methods, and its relative performance improves as more nodes need to be decontaminated. In particular, the community-based approach is slightly outperformed by the others for small values of β ∈ (0, 0.18] on both the NetHEPT and NetHEPT_WC datasets and β ∈ (0, 0.09] on the Facebook network. However, it performs much better than the other methods as the targeted dissemination ratio β gets larger. On average, the solution provided by the community-based method is roughly 16%, 41% and 22% better than those of the GVS, High Degree and DiscountIC methods for β ∈ [0.2, 0.35] on NetHEPT and NetHEPT_WC, and is nearly 10% better than the greedy method on the Facebook dataset for β ∈ [0.09, 0.16]. Moreover, this quantity tends to increase only linearly in the long run on all of the test networks, unlike the expensive superlinear shapes of the others.

There are reasons behind both the advantages and disadvantages of the community-based method observed in our experiments. A careful look into the community structures of the three networks reveals that NetHEPT, NetHEPT_WC and Facebook all feature only a few large communities while containing many small ones, most of which have cardinalities 6 and 8 (this is not a surprising observation, as it agrees well with the findings of [27,33]). Moreover, the most influential nodes are usually scattered among different communities. Therefore, when the target decontamination ratio β is small, the GVS algorithm can quickly find these influential nodes, while the community-based algorithm has to obey the rule of selecting influential nodes in the larger communities first. This explains the disadvantage of the community-based method in the smaller range of β. On the other hand, when more and more nodes need to be decontaminated to achieve the target ratio, the community-based algorithm can simply pick the most influential nodes from communities of smaller sizes (where they can easily influence the whole community), while the GVS algorithm may have to select more nodes from other places to satisfy the criterion (see Table 1).

We next discuss the effect of network communities on our problems. The general belief holds that while the local stars are important for vaccination, decontamination should also target those nodes with edges across communities, since (1) they are typically only a small population and (2) decontaminating them will keep misinformation within small circles. This, however, is not what we observed in our experiments. The reason can be explained as follows: since the discovered community structure has high internal propagation probabilities within each community and low propagation probabilities across communities, the bridge nodes, generally speaking, have only low chances of spreading misinformation to their neighboring communities; by contrast, the local stars possess much higher influence in spreading information to their entire community. This is, perhaps, our most interesting finding, and it is somewhat in contradiction with the general belief.

7.4. Running time

We next compare the execution time of the community-based algorithm (the time spent on community detection included) with that of the other methods. As the price of its good performance, GVS consumes much more time than the other methods on each of the datasets, especially on the Facebook network, for which it takes more than 5 h to find the solution. While the other methods take fair amounts of time (from 1 to 21 min) to analyze these average-sized datasets, they could take much longer on larger social networks. In contrast, the community-based method, by leveraging the network community structure, is able to reduce the processing time significantly while maintaining competitive performance. Unlike GVS, however, this algorithm does not provide any guarantee on the quality of the solution relative to the optimal one.

In conclusion, we believe that GVS is one of the most informative methods for finding highly influential nodes in small social networks when the running time is not a concern, whereas the community-based method can be regarded as a good heuristic for detecting those important nodes in large-scale social networks.

8. Conclusion and discussion

We study the family of βIT-Node Protector problems, which aim to find the smallest set of nodes from which disseminating good information provides a decontamination ratio of at least β over the entire network. We analyze an algorithm for the β-Node Protector problem that greedily adds the nodes with the highest influence gains to the current solution, and show that this algorithm selects only a small fraction of nodes in addition to the

Fig. 4. Results of the community-based method on social networks.

Table 1
Running time of five methods.

             NetHEPT     NetHEPT_WC    Facebook
GVS          33.1 min    36.1 min      5 h
Community    10 s        11 s          2.4 min
High Deg     1.18 min    1.2 min       4.3 min
DiscountIC   4.3 min     4.3 min       20.3 min
PageRank     14.4 min    14.4 min      21 min

optimal solution. We adapt this algorithm to other scenarios, where the initial set I of infected nodes is known and the network under concern is restricted to at most T hops away from I. We further propose a community-based algorithm which efficiently returns a good selection of nodes to decontaminate. Finally, we demonstrate the performance of our approaches on three real-world traces.

There are a few open issues worth discussing here. First, we assume that once an influential user is presented with good information, he or she is willing to further spread it to all of his or her online friends. In reality, however, extra incentives may be needed for these people to combat the spread of misinformation in online social networks. Second, in this work we assume that both misinformation and good information spread under the same diffusion model. Such an assumption, however, is unnecessary, as in practice we may want to present the good information in such a way that it spreads faster than the misinformation. Third, our work is based on the assumption that the campaign of disseminating good information to combat misinformation is initiated in a centralized manner, say, by the service provider of the online social network. This may not be realistic for some misinformation decontamination campaigns, especially when there is no centralized entity to organize such campaigns, or when the service provider of the online social network holds a neutral position on such issues. Despite these realistic challenges, the methods proposed in this work, as well as our rigorous theoretical analysis, shed light on how to develop effective yet efficient techniques to contain misinformation spreading in large-scale online social networks.

Acknowledgment

This work is partially supported by the DTRA YIP Grant No. HDTRA1-09-1-0061, and the Los Alamos National Laboratory Directed Research and Development Program, project number 20110093DR.

References

[1] http://articles.cnn.com/2011-05-02/tech/osama.bin.laden.twitter1bin-tweet-twitter-user?s=PM:TECH.
[2] http://www.facebook.com/OccupyWallSt.
[3] www.pcworld.com/businesscenter/article/163920/swineufrenzydemonstratestwittersachillesheel.html.
[4] http://articles.cnn.com/2011-07-04/tech/fox.hack1tweets-twitter-feed-twitter-users?s=PM:TECH.
[5] http://blog.twitter.com/2011/03/numbers.html.
[6] C. Grier, K. Thomas, V. Paxson, M. Zhang, @spam: the underground on 140 characters or less, in: Proceedings of the 17th ACM Conference on Computer and Communications Security, CCS '10, ACM, New York, NY, USA, 2010, pp. 27–37.

[7] C. Budak, D. Agrawal, A. El Abbadi, Limiting the spread of misinformation in social networks, in: Proceedings of the 20th International Conference on World Wide Web, WWW '11, ACM, New York, NY, USA, 2011, pp. 665–674.
[8] A. Goyal, F. Bonchi, L. Lakshmanan, S. Venkatasubramanian, On minimizing budget and time in influence propagation over social networks, Social Network Analysis and Mining (2012) 1–14.
[9] B. Viswanath, A. Mislove, M. Cha, K.P. Gummadi, On the evolution of user interaction in Facebook, in: Proceedings of the 2nd ACM Workshop on Online Social Networks, WOSN '09, ACM, New York, NY, USA, 2009, pp. 37–42.
[10] P. Domingos, M. Richardson, Mining the network value of customers, in: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '01, ACM, New York, NY, USA, 2001, pp. 57–66.
[11] D. Kempe, J. Kleinberg, E. Tardos, Maximizing the spread of influence through a social network, in: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03, ACM, New York, NY, USA, 2003, pp. 137–146.
[12] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, N. Glance, Cost-effective outbreak detection in networks, in: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '07, ACM, New York, NY, USA, 2007, pp. 420–429.
[13] N. Chen, On the approximability of influence in social networks, in: Proceedings of the Nineteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '08, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2008, pp. 1029–1037.

[14] D.M. Nicol, M. Liljenstam, Models and analysis of active worm defense, in: Proceedings of the Third International Conference on Mathematical Methods, Models, and Architectures for Computer Network Security, MMM-ACNS '05, Springer-Verlag, Berlin, Heidelberg, 2005, pp. 38–53.
[15] S. Tanachaiwiwat, A. Helmy, Encounter-based worms: analysis and defense, Ad Hoc Networks 7 (7) (2009) 1414–1430.
[16] P. Dubey, R. Garg, B. De Meyer, Competing for customers in a social network: the quasi-linear case, in: Proceedings of the Second International Conference on Internet and Network Economics, WINE '06, Springer-Verlag, Berlin, Heidelberg, 2006, pp. 162–173.
[17] S. Bharathi, D. Kempe, M. Salek, Competitive influence maximization in social networks, in: Proceedings of the 3rd International Conference on Internet and Network Economics, WINE '07, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 306–311.
[18] J. Kostka, Y.A. Oswald, R. Wattenhofer, Word of mouth: rumor dissemination in social networks, in: Proceedings of the 15th International Colloquium on Structural Information and Communication Complexity, SIROCCO '08, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 185–196.
[19] G. Yan, G. Chen, S. Eidenbenz, N. Li, Malware propagation in online social networks: nature, dynamics, and defense implications, in: Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security, ASIACCS '11, ACM, New York, NY, USA, 2011, pp. 196–206.
[20] W. Xu, F. Zhang, S. Zhu, Toward worm detection in online social networks, in: Proceedings of the 26th Annual Computer Security Applications Conference, ACSAC '10, ACM, New York, NY, USA, 2010, pp. 11–20.

[21] W. Chen, C. Wang, Y. Wang, Scalable influence maximization for prevalent viral marketing in large-scale social networks, in: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '10, ACM, New York, NY, USA, 2010, pp. 1029–1038.
[22] G.L. Nemhauser, L.A. Wolsey, M.L. Fisher, An analysis of approximations for maximizing submodular set functions, Mathematical Programming 14 (1978) 265–294.
[23] M. Kimura, K. Saito, R. Nakano, Extracting influential nodes for information diffusion on a social network, in: Proceedings of the 22nd National Conference on Artificial Intelligence, AAAI '07, vol. 2, AAAI Press, 2007, pp. 1371–1376.
[24] T.N. Dinh, D.T. Nguyen, M.T. Thai, Cheap, easy, and massively effective viral marketing in social networks: truth or fiction?, in: Proceedings of the 23rd ACM Conference on Hypertext and Social Media, HT '12, ACM, New York, NY, USA, 2012, pp. 165–174.
[25] R. Raz, S. Safra, A sub-constant error-probability low-degree test, and a sub-constant error-probability PCP characterization of NP, STOC.
[26] W. Chen, Y. Yuan, L. Zhang, Scalable influence maximization in social networks under the linear threshold model, in: Proceedings of the 2010 IEEE International Conference on Data Mining, ICDM '10, IEEE Computer Society, Washington, DC, USA, 2010, pp. 88–97.
[27] S. Fortunato, Community detection in graphs, Physics Reports 486 (3–5) (2010) 75–174.
[28] T.N. Dinh, Y. Xuan, M.T. Thai, Towards social-aware routing in dynamic communication networks, in: Proceedings of the 28th International Performance Computing and Communications Conference, IPCCC '09, 2009, pp. 161–168.
[29] N.P. Nguyen, T.N. Dinh, Y. Xuan, M.T. Thai, Adaptive algorithms for detecting community structure in dynamic social networks, in: Proceedings of the 31st International Conference on Computer Communications, INFOCOM '11, 2011, pp. 2282–2290.
[30] N.P. Nguyen, T.N. Dinh, S. Tokala, M.T. Thai, Overlapping communities in dynamic networks: their detection and mobile applications, in: Proceedings of the 17th Annual International Conference on Mobile Computing and Networking, MobiCom '11, ACM, New York, NY, USA, 2011, pp. 85–96.
[31] V.D. Blondel, J.-L. Guillaume, R. Lambiotte, E. Lefebvre, Fast unfolding of communities in large networks, Journal of Statistical Mechanics: Theory and Experiment 2008 (10) (2008) P10008.
[32] S. Fortunato, A. Lancichinetti, Community detection algorithms: a comparative analysis, in: Proceedings of the Fourth International ICST Conference on Performance Evaluation Methodologies and Tools, VALUETOOLS '09, ICST, Brussels, Belgium, 2009, pp. 27:1–27:2.
[33] M.A. Porter, J.-P. Onnela, P.J. Mucha, Communities in networks, Notices of the AMS 56 (9).

Nam P. Nguyen received his B.S. degree from Vietnam National University, Vietnam (2007), and M.S. degree from Ohio University, USA (2009), both in Applied Mathematics. He is now a Ph.D. candidate in Computer Science at the University of Florida, USA. His interests include cyber-security, structural analysis and vulnerability assessment of complex networks, and effective approximation algorithms for practical mobile and social networking problems.

Guanhua Yan obtained his Ph.D. degree in Computer Science from Dartmouth College, USA, in 2005. From 2003 to 2005, he was a visiting graduate student at the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign. He is now a Technical Staff Member in the Information Sciences Group (CCS-3) at the Los Alamos National Laboratory. His research interests are cyber-security, networking, and large-scale modeling and simulation techniques. He has contributed about 50 articles in these fields.

My T. Thai received her Ph.D. degree in computer science from the University of Minnesota, Twin Cities, in 2006. She is an associate professor in the CISE Department, University of Florida. Her current research interests include algorithms and optimization on network science and engineering. She also serves as an associate editor for the Journal of Combinatorial Optimization (JOCO) and Optimization Letters, and was a conference chair of COCOON 2010 and several workshops in the area of network science. She is a recipient of a DoD Young Investigator Award and an NSF CAREER Award. She is a member of the IEEE.


Analysis of misinformation containment in online social networks
1 Introduction
2 Related work
3 Diffusion models and problem definition
  3.1 Propagation models
    3.1.1 Linear Threshold (LT) model
    3.1.2 Independent Cascade (IC) model
    3.1.3 Decontamination mechanism
  3.2 Problem definition
  3.3 Notations
4 A bounded method for β-Node Protector
  4.1 A greedy algorithm with a bounded solution
  4.2 Time complexity
5 Algorithms for the βI- and βIT-Node Protector problems
  5.1 Inapproximability of βIT-Node Protector
  5.2 Algorithms for βI- and βIT-Node Protector
6 Experimental results
  6.1 Datasets
    6.1.1 NetHEPT and NetHEPT_WC
    6.1.2 Facebook network
    6.1.3 Setup
  6.2 Number of selected nodes
    6.2.1 Results on the β-Node Protector problem
    6.2.2 Results on the βI- and βIT-Node Protector problems
  6.3 Lower bounds of the optimal solutions
7 A community-based heuristic algorithm
  7.1 Community detection
  7.2 High-influence node selection
  7.3 Experimental results
  7.4 Running time
8 Conclusion and discussion
Acknowledgment
References
