

arXiv:1404.0900v2 [cs.SI] 30 Apr 2014

Influence Maximization: Near-Optimal Time Complexity Meets Practical Efficiency

Youze Tang    Xiaokui Xiao    Yanchen Shi
School of Computer Engineering, Nanyang Technological University, Singapore

[email protected] [email protected] [email protected]

ABSTRACT

Given a social network G and a constant k, the influence maximization problem asks for k nodes in G that (directly and indirectly) influence the largest number of nodes under a pre-defined diffusion model. This problem finds important applications in viral marketing, and has been extensively studied in the literature. Existing algorithms for influence maximization, however, either trade approximation guarantees for practical efficiency, or vice versa. In particular, among the algorithms that achieve constant factor approximations under the prominent independent cascade (IC) model or linear threshold (LT) model, none can handle a million-node graph without incurring prohibitive overheads.

This paper presents TIM, an algorithm that aims to bridge the theory and practice in influence maximization. On the theory side, we show that TIM runs in O((k + ℓ)(n + m) log n/ε^2) expected time and returns a (1 − 1/e − ε)-approximate solution with at least 1 − n^{-ℓ} probability. The time complexity of TIM is near-optimal under the IC model, as it is only a log n factor larger than the Ω(m + n) lower-bound established in previous work (for fixed k, ℓ, and ε). Moreover, TIM supports the triggering model, which is a general diffusion model that includes both IC and LT as special cases. On the practice side, TIM incorporates novel heuristics that significantly improve its empirical efficiency without compromising its asymptotic performance. We experimentally evaluate TIM with the largest datasets ever tested in the literature, and show that it outperforms the state-of-the-art solutions (with approximation guarantees) by up to four orders of magnitude in terms of running time. In particular, when k = 50, ε = 0.2, and ℓ = 1, TIM requires less than one hour on a commodity machine to process a network with 41.6 million nodes and 1.4 billion edges. This demonstrates that influence maximization algorithms can be made practical while still offering strong theoretical guarantees.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data mining

General Terms
Algorithms, Theory, Experimentation

SIGMOD'14, June 22–27, 2014, Snowbird, UT, USA.
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-2376-5/14/06. http://dx.doi.org/10.1145/2588555.2593670

1. INTRODUCTION

Let G be a social network, and M be a probabilistic model that captures how the nodes in G may influence each other's behavior. Given G, M, and a small constant k, the influence maximization problem asks for the k nodes in G that can (directly and indirectly) influence the largest number of nodes. This problem finds important applications in viral marketing [8, 25], where a company selects a few influential individuals in a social network and provides them with incentives (e.g., free samples) to adopt a new product, hoping that the product will be recursively recommended by each individual to his/her friends to create a large cascade of further adoptions.

Kempe et al. [17] are the first to formulate influence maximization as a combinatorial optimization problem. They consider several probabilistic cascade models from the sociology and marketing literature [13-15, 27], and present a general greedy approach that yields (1 − 1/e − ε)-approximate solutions for all models considered, where ε is a constant. This seminal work has motivated a large body of research on influence maximization in the past decade [2-7, 10, 16-19, 21, 28, 30, 31].

Kempe et al.'s greedy approach is well accepted for its simplicity and effectiveness, but it is known to be computationally expensive. In particular, it has an Ω(kmn · poly(ε^{-1})) time complexity [3], where n and m are the numbers of nodes and edges in the social network, respectively. Empirically, it runs for days even when n and m are merely a few thousands [6]. Such inefficiency of Kempe et al.'s method has led to a plethora of algorithms [5-7, 10, 16, 19, 21, 30, 31] that aim to reduce the computation overhead of influence maximization. Those algorithms, however, either trade performance guarantees for practical efficiency, or vice versa. In particular, most algorithms rely on heuristics to efficiently identify nodes with large influence, but they fail to achieve any approximation ratio under Kempe et al.'s cascade models; there are a few exceptions [6, 11, 21] that retain the (1 − 1/e − ε)-approximation guarantee, but they have the same time complexity as Kempe et al.'s method and still cannot handle large networks.

Very recently, Borgs et al. [3] make a theoretical breakthrough and present an O(kℓ^2(m + n) log^2 n/ε^3) time algorithm^1 for influence maximization under the independent cascade (IC) model, i.e., one of the prominent models from Kempe et al. [17]. Borgs et al. show that their algorithm returns a (1 − 1/e − ε)-approximate solution with at least 1 − n^{-ℓ} probability, and prove that it is near-optimal since any other algorithm that provides the same approximation guarantee and succeeds with at least a constant probability must run in Ω(m + n) time [3]. Although Borgs et al.'s algorithm significantly improves upon previous methods in terms of asymptotic performance, its practical efficiency is rather unsatisfactory, due to a large hidden constant factor in its time complexity. In short, no existing influence maximization algorithm can scale


to million-node graphs while still providing non-trivial approximation guarantees (under Kempe et al.'s models [17]). Therefore, any practitioner who conducts influence maximization on sizable social networks can only resort to heuristics, even though the results thus obtained could be arbitrarily worse than the optimal ones.

Our Contributions. This paper presents Two-phase Influence Maximization (TIM), an algorithm that aims to bridge the theory and practice in influence maximization. On the theory side, we show that TIM returns a (1 − 1/e − ε)-approximate solution with at least 1 − n^{-ℓ} probability, and it runs in O((k + ℓ)(m + n) log n/ε^2) expected time. The time complexity of TIM is near-optimal under the IC model, as it is only a log n factor larger than the Ω(m + n) lower-bound established by Borgs et al. [3] (for fixed k, ℓ, and ε). Moreover, TIM supports the triggering model [17], which is a general cascade model that includes the IC model as a special case.

On the practice side, TIM incorporates novel heuristics that result in up to 100-fold improvements of its computation efficiency, without any compromise of theoretical assurances. We experimentally evaluate TIM with a variety of social networks, and show that it outperforms the state-of-the-art solutions (with approximation guarantees) by up to four orders of magnitude in terms of running time. In particular, when k = 50, ε ≥ 0.2, and ℓ = 1, TIM requires less than one hour to process a network with 41.6 million nodes and 1.4 billion edges. To our knowledge, this is the first result in the literature that demonstrates efficient influence maximization on a billion-edge graph.

In summary, our contributions are as follows:

1. We propose an influence maximization algorithm that runs in near-linear expected time and returns (1 − 1/e − ε)-approximate solutions (with a high probability) under the triggering model.

2. We devise several optimization techniques that improve the empirical performance of our algorithm by up to 100-fold.

3. We provide theoretical analysis on the state-of-the-art solutions with approximation guarantees, and establish the superiority of our algorithm in terms of asymptotic performance.

4. We experiment with the largest datasets ever used in the literature, and show that our algorithm can efficiently handle graphs with more than a billion edges. This demonstrates that influence maximization algorithms can be made practical while still offering strong theoretical guarantees.

2. PRELIMINARIES

In this section, we formally define the influence maximization problem, and present an overview of Kempe et al.'s and Borgs et al.'s solutions [3, 17]. For ease of exposition, we focus on the independent cascade (IC) model [17] considered by Borgs et al. [3]. In Section 4.2, we discuss how our solution can be extended to the more general triggering model.

2.1 Problem Definition

Let G be a social network with a node set V and a directed edge set E, with |V| = n and |E| = m. Assume that each directed edge e in G is associated with a propagation probability p(e) ∈ [0, 1]. Given G, the independent cascade (IC) model considers a time-stamped influence propagation process as follows:

^1 The time complexity of Borgs et al.'s algorithm is stated as O(ℓ^2(m + n) log^2 n/ε^3) in [3], but our correspondence with Borgs et al. shows that it should be revised to O(kℓ^2(m + n) log^2 n/ε^3), due to a gap in the proof of Lemma 3.6 in [3].

[Figure 1: Social network G. The number on each edge indicates its propagation probability; e.g., the edges from v2 to v1 and from v2 to v4 have probability 0.01, and the edge from v4 to v1 has probability 1.]

[Figure 2: Random graph g1.]

1. At timestamp 1, we activate a selected set S of nodes in G, while setting all other nodes inactive.

2. If a node u is first activated at timestamp i, then for each directed edge e that points from u to an inactive node v (i.e., v is an inactive outgoing neighbor of u), u has p(e) probability to activate v at timestamp i + 1. After timestamp i + 1, u cannot activate any node.

3. Once a node becomes activated, it remains activated in all subsequent timestamps.

Let I(S) be the number of nodes that are activated when the above process converges, i.e., when no more nodes can be activated. We refer to S as the seed set, and I(S) as the spread of S. Intuitively, the influence propagation process under the IC model mimics the spread of an infectious disease: the seed set S is conceptually similar to an initial set of infected individuals, while the activation of a node by its neighbors is analogous to the transmission of the disease from one individual to another.

For example, consider a propagation process on the social network G in Figure 1, with S = {v2} as the seed set. (The number on each edge indicates the propagation probability of the edge.) At timestamp 1, we activate v2, since it is the only node in S. Then, at timestamp 2, both v1 and v4 have 0.01 probability to be activated by v2, as (i) they are both v2's outgoing neighbors and (ii) the edges from v2 to v1 and v4 have a propagation probability of 0.01. Suppose that v2 activates v4 but not v1. After that, at timestamp 3, v4 will activate v1, since the edge from v4 to v1 has a propagation probability of 1. After that, the influence propagation process terminates, since no other node can be activated. The total number of nodes activated during the process is 3, and hence, I(S) = 3.

Given G and a constant k, the influence maximization problem under the IC model asks for a size-k seed set S with the maximum expected spread E[I(S)]. In other words, we seek a seed set that can (directly and indirectly) activate the largest number of nodes in expectation.
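To make the above definitions concrete, the following sketch simulates one IC-model diffusion from a seed set and averages the spread over repeated trials to estimate E[I(S)]. It is an illustrative implementation of the process defined above, not the authors' code; the graph representation (adjacency lists with per-edge probabilities) and function names are our own choices.

#include <queue>
#include <random>
#include <vector>

// One entry of the adjacency list: target node and propagation probability p(e).
struct Edge { int to; double p; };
using Graph = std::vector<std::vector<Edge>>;  // out-edges of each node

// Simulate a single IC-model diffusion from seed set S and return its spread I(S).
int simulateIC(const Graph& g, const std::vector<int>& S, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::vector<bool> active(g.size(), false);
    std::queue<int> frontier;
    for (int v : S) if (!active[v]) { active[v] = true; frontier.push(v); }
    int spread = 0;
    while (!frontier.empty()) {
        int u = frontier.front(); frontier.pop();
        ++spread;
        for (const Edge& e : g[u])            // each newly activated node gets one chance per out-edge
            if (!active[e.to] && coin(rng) < e.p) {
                active[e.to] = true;
                frontier.push(e.to);
            }
    }
    return spread;
}

// Monte Carlo estimate of the expected spread E[I(S)] using r independent diffusions.
double estimateSpread(const Graph& g, const std::vector<int>& S, int r, std::mt19937& rng) {
    long long total = 0;
    for (int i = 0; i < r; ++i) total += simulateIC(g, S, rng);
    return static_cast<double>(total) / r;
}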

2.2 Kempe et al.'s Greedy Approach

In a nutshell, Kempe et al.'s approach [17] (referred to as Greedy in the following) starts from an empty seed set S = ∅, and then iteratively adds into S the node u that leads to the largest increase in E[I(S)], until |S| = k. That is,

    u = argmax_{v ∈ V} ( E[I(S ∪ {v})] − E[I(S)] ).

Greedy is conceptually simple, but it is non-trivial to implement since the computation of E[I(S)] is #P-hard [5]. To address this issue, Kempe et al. propose to estimate E[I(S)] to a reasonable accuracy using a Monte Carlo method. To explain, suppose that we flip a coin for each edge e in G, and remove the edge with 1 − p(e) probability. Let g be the resulting graph, and R(S) be the set of nodes in g that are reachable from S. (We say that a node v in g is reachable from S, if there exists a directed path in g that starts from a node in S and ends at v.) Kempe et al. prove that the expected size of R(S) equals E[I(S)]. Therefore, to estimate E[I(S)], we can first generate multiple instances of g, then measure R(S) on each instance, and finally take the average measurement as an estimation of E[I(S)].
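A direct (and expensive) implementation of Greedy on top of such a Monte Carlo spread estimator might look as follows. This is a plain sketch without the lazy-evaluation optimizations discussed in Section 6, and it reuses the hypothetical Graph type and estimateSpread() helper from the previous sketch.

#include <algorithm>
#include <random>
#include <vector>

// Greedy seed selection (Kempe et al.'s framework): in each of the k iterations,
// pick the node with the largest estimated marginal gain in expected spread.
std::vector<int> greedySelect(const Graph& g, int k, int r, std::mt19937& rng) {
    std::vector<int> S;
    double currentSpread = 0.0;
    for (int it = 0; it < k; ++it) {
        int bestNode = -1;
        double bestSpread = currentSpread;
        for (int v = 0; v < static_cast<int>(g.size()); ++v) {
            if (std::find(S.begin(), S.end(), v) != S.end()) continue;
            S.push_back(v);                               // tentatively add v
            double est = estimateSpread(g, S, r, rng);    // costs O(m * r) per candidate
            S.pop_back();
            if (est > bestSpread) { bestSpread = est; bestNode = v; }
        }
        if (bestNode < 0) break;                          // no candidate improved the estimate
        S.push_back(bestNode);
        currentSpread = bestSpread;
    }
    return S;
}

The nested loops over k iterations, n candidates, and r simulations of cost O(m) each make the O(kmnr) running time discussed below easy to see.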


Assume that we take a large number r of measurements in the estimation of each E[I(S)]. Then, with a high probability, Greedy yields a (1 − 1/e − ε)-approximate solution under the IC model [17], where ε is a constant that depends on both G and r [3, 18]. In general, Greedy achieves the same approximation ratio under any cascade model where E[I(S)] is a submodular function of S [18]. To our knowledge, however, there is no formal analysis in the literature on how r should be set to achieve a given ε on G. Instead, Kempe et al. suggest setting r = 10000, and most follow-up work adopts similar choices of r. In Section 5, we provide a formal result on the relationship between ε and r.

Although Greedy is general and effective, it incurs significant computation overheads due to its O(kmnr) time complexity. Specifically, it runs in k iterations, each of which requires estimating the expected spread of O(n) node sets. In addition, each estimation of expected spread takes measurements on r graphs, and each measurement needs O(m) time. These lead to an O(kmnr) total running time.

2.3 Borgs et al.'s Method

The main reason for Greedy's inefficiency is that it requires estimating the expected spread of O(kn) node sets. Intuitively, most of those O(kn) estimations are wasted since, in each iteration of Greedy, we are only interested in the node set with the largest expected spread. Yet, such wastes of computation are difficult to avoid under the framework of Greedy. To explain, consider the first iteration of Greedy, where we are to identify a single node in G with the maximum expected spread. Without prior knowledge of the expected spread of each node, we would have to evaluate E[I({v})] for each node v in G. In that case, the overhead of the first iteration alone would be O(mnr).

Borgs et al. [3] avoid the limitation of Greedy and propose a drastically different method for influence maximization under the IC model. We refer to the method as Reverse Influence Sampling (RIS). To explain how RIS works, we first introduce two concepts:

DEFINITION 1 (REVERSE REACHABLE SET). Let v be a node in G, and g be a graph obtained by removing each edge e in G with 1 − p(e) probability. The reverse reachable (RR) set for v in g is the set of nodes in g that can reach v. (That is, for each node u in the RR set, there is a directed path from u to v in g.)

DEFINITION 2 (RANDOM RR SET). Let 𝒢 be the distribution of g induced by the randomness in edge removals from G. A random RR set is an RR set generated on an instance of g randomly sampled from 𝒢, for a node selected uniformly at random from g.

By definition, if a node u appears in an RR set generated for a node v, then u can reach v via a certain path in G. As such, u should have a chance to activate v if we run an influence propagation process on G using {u} as the seed set. Borgs et al. show a result that is consistent with the above observation: if an RR set generated for v has ρ probability to overlap with a node set S, then when we use S as the seed set to run an influence propagation process on G, we have ρ probability to activate v (see Lemma 2). Based on this result, Borgs et al.'s RIS algorithm runs in two steps:

1. Generate a certain number of random RR sets from G.

2. Consider the maximum coverage problem [29] of selecting k nodes to cover the maximum number of RR sets generated^2. Use the standard greedy algorithm [29] to derive a (1 − 1/e)-approximate solution S*_k for the problem. Return S*_k as the final result.

^2 We say that a node v covers a set of nodes S if and only if v ∈ S.

The rationale of RIS is as follows: if a node u appears in a large number of RR sets, then it should have a high probability to activate many nodes under the IC model; in that case, u's expected spread should be large. By the same reasoning, if a size-k node set S*_k covers most RR sets, then S*_k is likely to have the maximum expected spread among all size-k node sets in G. In that case, S*_k should be a good solution to influence maximization. We illustrate RIS with an example.

EXAMPLE 1. Consider that we invoke RIS on the social network G in Figure 1, setting k = 1. RIS first generates a number of random RR sets, each of which is pertinent to (i) a node sampled uniformly at random from G and (ii) a random graph obtained by removing each edge e in G with 1 − p(e) probability (see Definition 2). Assume that the first RR set R1 is pertinent to v1 and the random graph g1 in Figure 2. Then, we have R1 = {v1, v4}, since v1 and v4 are the only two nodes in g1 that can reach v1.

Suppose that, besides R1, RIS only constructs three other random RR sets R2, R3, and R4, which are pertinent to three random graphs g2, g3, and g4, respectively. For simplicity, assume that (i) g2, g3, and g4 are identical to g1, and (ii) the node that RIS samples from gi (i ∈ [2, 4]) is vi. Then, we have R2 = {v2}, R3 = {v3}, and R4 = {v4}. In that case, v4 is the node that covers the largest number of RR sets, since it appears in two RR sets (i.e., R1 and R4), whereas any other node only covers one RR set. Consequently, RIS returns S*_k = {v4} as the result.

Compared with Greedy, RIS can be more efficient as it avoids estimating the expected spreads of a large number of node sets. That said, we need to carefully control the number of random RR sets generated in Step 1 of RIS, so as to strike a balance between efficiency and accuracy. Towards this end, Borgs et al. propose a threshold-based approach: they allow RIS to keep generating RR sets, until the total number of nodes and edges examined during the generation process reaches a pre-defined threshold τ. They show that when τ is set to Θ(k(m + n) log n/ε^3), RIS runs in time linear to τ, and it returns a (1 − 1/e − ε)-approximate solution to the influence maximization problem with at least a constant probability. They then provide an algorithm that amplifies the success probability to at least 1 − n^{-ℓ}, by increasing τ by a factor of ℓ, and repeating RIS for Ω(ℓ log n) times.

Despite its near-linear time complexity, RIS still incurs significant computational overheads in practice, as we show in Section 7. The reason can be intuitively explained as follows. Given that RIS sets a threshold τ on the total cost of Step 1, the RR sets sampled in Step 1 are correlated, due to which some nodes in G may appear in RR sets more frequently than normal^3. In that case, even if we identify a node set that covers the largest number of RR sets, it still may not be a good solution to the influence maximization problem. Borgs et al. mitigate the effects of correlations by setting τ to a large number. However, this not only results in the ε^{-3} term in RIS's time complexity, but also renders RIS's practical efficiency less than satisfactory.

3. PROPOSED SOLUTION

This section presents TIM, an influence maximization method that borrows ideas from RIS but overcomes its limitations with a novel algorithm design.

^3 To demonstrate this phenomenon, imagine that we repeatedly sample from a Bernoulli distribution with p = 0.5, until the sum of the samples reaches 1. It can be verified that our sample set has 1/2 probability to contain more 1s than 0s, but only 1/4 probability to contain more 0s than 1s. In other words, 1 is the most frequent value in the sample set with an abnormally high probability.
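The probabilities stated in footnote 3 are easy to check by direct simulation; the following throwaway snippet is our own illustration (not part of the paper) and estimates them empirically.

#include <cstdio>
#include <random>

// Empirically verify footnote 3: draw Bernoulli(0.5) samples until the sum reaches 1,
// then check whether the sample set contains more 1s than 0s, or more 0s than 1s.
int main() {
    std::mt19937 rng(7);
    std::bernoulli_distribution coin(0.5);
    const int trials = 1000000;
    int moreOnes = 0, moreZeros = 0;
    for (int t = 0; t < trials; ++t) {
        int zeros = 0;
        while (!coin(rng)) ++zeros;   // count 0s until the first 1 appears
        if (zeros == 0) ++moreOnes;   // sample set {1}: more 1s than 0s
        if (zeros >= 2) ++moreZeros;  // at least two 0s accompany the single 1
    }
    // Expected output: roughly 0.5 and 0.25, matching the claim in the footnote.
    std::printf("Pr[more 1s] ~= %.3f, Pr[more 0s] ~= %.3f\n",
                double(moreOnes) / trials, double(moreZeros) / trials);
    return 0;
}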


Table 1: Frequently used notations.

    G, G^T : a social network G, and its transpose G^T, constructed by exchanging the starting and ending points of each edge in G
    n      : the number of nodes in G (resp. G^T)
    m      : the number of edges in G (resp. G^T)
    k      : the size of the seed set for influence maximization
    p(e)   : the propagation probability of an edge e
    I(S)   : the spread of a node set S in an influence propagation process on G (see Section 2.2)
    w(R)   : the number of edges in G^T that start from the nodes in an RR set R (see Equation 1)
    κ(R)   : see Equation 8
    R      : the set of all RR sets generated in Algorithm 1
    F_R(S) : the fraction of RR sets in R that are covered by a node set S
    EPT    : the expected width of a random RR set
    OPT    : the maximum expected spread E[I(S)] of any size-k node set S
    KPT    : a lower-bound of OPT established in Section 3.2
    λ      : see Equation 4

At a high level, TIM consists of two phases as follows:

1. Parameter Estimation. This phase computes a lower-bound of the maximum expected spread among all size-k node sets, and then uses the lower-bound to derive a parameter θ.

2. Node Selection. This phase samples θ random RR sets from G, and then derives a size-k node set S*_k that covers a large number of RR sets. After that, it returns S*_k as the final result.

The node selection phase of TIM is similar to RIS, except that it samples a pre-decided number (i.e., θ) of random RR sets, instead of using a threshold on computation cost to indirectly control the number. This ensures that the RR sets generated by TIM are independent (given θ), thus avoiding the correlation issue that plagues RIS. Meanwhile, the derivation of θ in the parameter estimation phase is non-trivial: as we show in Section 3.1, θ needs to be larger than a certain threshold to ensure the correctness of TIM, but the threshold depends on the optimal result of influence maximization, which is unknown. To address this challenge, we compute a θ that is above the threshold but still small enough to ensure the overall efficiency of TIM.

In what follows, we first elaborate the node selection phase of TIM, and then detail the parameter estimation phase. For ease of reference, Table 1 lists the notations frequently used. Unless otherwise specified, all logarithms in this paper are to the base e.

3.1 Node Selection

Algorithm 1 presents the pseudo-code of TIM's node selection phase. Given G, k, and a constant θ, the algorithm first generates θ random RR sets, and inserts them into a set R (Lines 1-2). The subsequent part of the algorithm consists of k iterations (Lines 3-7). In each iteration, the algorithm selects a node vj that covers the largest number of RR sets in R, and then removes all those covered RR sets from R. The k selected nodes are put into a set S*_k, which is returned as the final result.

Implementation. Lines 3-7 in Algorithm 1 correspond to a standard greedy approach for a maximum coverage problem [29], i.e.,

Algorithm 1 NodeSelection(G, k, θ)
 1: Initialize a set R = ∅.
 2: Generate θ random RR sets and insert them into R.
 3: Initialize a node set S*_k = ∅.
 4: for j = 1 to k do
 5:     Identify the node vj that covers the most RR sets in R.
 6:     Add vj into S*_k.
 7:     Remove from R all RR sets that are covered by vj.
 8: return S*_k

the problem of selecting k nodes to cover the largest number of node sets. It is known that this greedy approach returns (1 − 1/e)-approximate solutions, and has a linear-time implementation. For brevity, we omit the description of the implementation and refer interested readers to [3] for details.
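For concreteness, here is one way the greedy coverage step (Lines 3-7 of Algorithm 1) could be implemented. This is our own sketch using an inverted index and lazily decremented cover counts, not the linear-time implementation referenced in [3].

#include <vector>

// Greedy maximum coverage over RR sets: repeatedly pick the node covering the most
// not-yet-covered RR sets. rrSets[i] lists the nodes contained in the i-th RR set.
std::vector<int> maxCoverGreedy(const std::vector<std::vector<int>>& rrSets, int n, int k) {
    std::vector<std::vector<int>> setsOfNode(n);     // inverted index: node -> RR sets containing it
    for (int i = 0; i < static_cast<int>(rrSets.size()); ++i)
        for (int v : rrSets[i]) setsOfNode[v].push_back(i);

    std::vector<int> coverCount(n);
    for (int v = 0; v < n; ++v) coverCount[v] = static_cast<int>(setsOfNode[v].size());
    std::vector<bool> covered(rrSets.size(), false);
    std::vector<int> seeds;

    for (int j = 0; j < k; ++j) {
        int best = 0;
        for (int v = 1; v < n; ++v)
            if (coverCount[v] > coverCount[best]) best = v;
        seeds.push_back(best);
        for (int i : setsOfNode[best]) {             // mark newly covered RR sets and
            if (covered[i]) continue;                // decrement the counts of their members
            covered[i] = true;
            for (int v : rrSets[i]) --coverCount[v];
        }
    }
    return seeds;
}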

Meanwhile, the generation of each RR set in Algorithm 1 is implemented as a randomized breadth-first search (BFS) on G. Given a node v in G, we first create an empty queue, and then flip a coin for each incoming edge e of v; with p(e) probability, we retrieve the node u from which e starts, and we put u into the queue. Subsequently, we iteratively extract the node v′ at the top of the queue, and examine each incoming edge e′ of v′; if e′ starts from an unvisited node u′, we add u′ into the queue with p(e′) probability. This iterative process terminates when the queue becomes empty. Finally, we collect all nodes visited during the process (including v), and use them to form an RR set.
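The randomized BFS just described can be sketched as follows; it works on the transpose graph G^T (i.e., it follows the incoming edges of G), and the representation and helper names are our own.

#include <queue>
#include <random>
#include <vector>

struct REdge { int from; double p; };                  // an incoming edge of G, i.e., an edge of G^T
using ReverseGraph = std::vector<std::vector<REdge>>;  // in-edges of each node of G

// Generate one random RR set under the IC model: a randomized BFS from node v
// that traverses each incoming edge e with probability p(e).
std::vector<int> generateRR(const ReverseGraph& gT, int v, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::vector<bool> visited(gT.size(), false);
    std::vector<int> rr;
    std::queue<int> q;
    visited[v] = true; q.push(v);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        rr.push_back(u);
        for (const REdge& e : gT[u])
            if (!visited[e.from] && coin(rng) < e.p) {
                visited[e.from] = true;
                q.push(e.from);
            }
    }
    return rr;   // all nodes that can reach v in the sampled graph
}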

Performance Bounds. We define the width of an RR set R, denoted as w(R), as the number of directed edges in G that point to the nodes in R. That is,

    w(R) = Σ_{v ∈ R} (the indegree of v in G).    (1)

Observe that if an edge is examined in the generation of R, then it must point to a node in R. Let EPT be the expected width of a random RR set. It can be verified that Algorithm 1 runs in O(θ · EPT) time. In the following, we analyze how θ should be set to minimize the expected running time while ensuring solution quality. Our analysis frequently uses the Chernoff bounds [24]:

LEMMA 1. Let X be the sum of c i.i.d. random variables sampled from a distribution on [0, 1] with a mean µ. For any δ > 0,

    Pr[ X − cµ ≥ δ · cµ ] ≤ exp( −(δ^2/(2 + δ)) · cµ ),
    Pr[ X − cµ ≤ −δ · cµ ] ≤ exp( −(δ^2/2) · cµ ).

In addition, we utilize the following lemma from [3], which establishes the connection between RR sets and the influence propagation process on G:

LEMMA 2. Let S be a fixed set of nodes, and v be a fixed node. Suppose that we generate an RR set R for v on a graph g that is constructed from G by removing each edge e with 1 − p(e) probability. Let ρ1 be the probability that S overlaps with R, and ρ2 be the probability that S, when used as a seed set, can activate v in an influence propagation process on G. Then, ρ1 = ρ2.

The proofs of all theorems, lemmas, and corollaries in Section 3 are included in the appendix.

Let R be the set of all RR sets generated in Algorithm 1. For any node set S, let F_R(S) be the fraction of RR sets in R covered by S. Then, based on Lemma 2, we can prove that the expected value of n · F_R(S) equals the expected spread of S in G:


COROLLARY 1. E[n · F_R(S)] = E[I(S)].

Let OPT be the maximum expected spread of any size-k node set in G. Using the Chernoff bounds, we show that n · F_R(S) is an accurate estimator of any node set S's expected spread, when θ is sufficiently large:

LEMMA 3. Suppose that θ satisfies

    θ ≥ (8 + 2ε) n · (ℓ log n + log (n choose k) + log 2) / (OPT · ε^2).    (2)

Then, for any set S of at most k nodes, the following inequality holds with at least 1 − n^{-ℓ}/(n choose k) probability:

    | n · F_R(S) − E[I(S)] | < (ε/2) · OPT.    (3)

Based on Lemma 3, we prove that when Equation 2 holds, Algorithm 1 returns a (1 − 1/e − ε)-approximate solution with high probability:

THEOREM 1. Given a θ that satisfies Equation 2, Algorithm 1 returns a (1 − 1/e − ε)-approximate solution with at least 1 − n^{-ℓ} probability.

Notice that it is difficult to set θ directly based on Equation 2, since OPT is unknown. We address this issue in Section 3.2, by presenting an algorithm that returns a θ which not only satisfies Equation 2, but also leads to an O((k + ℓ)(m + n) log n/ε^2) expected time complexity for Algorithm 1. For simplicity, we define

    λ = (8 + 2ε) n · (ℓ log n + log (n choose k) + log 2) · ε^{-2},    (4)

and we rewrite Equation 2 as

    θ ≥ λ/OPT.    (5)
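As a small illustration of how Equation 4 might be evaluated in code (our own helper, not part of the paper's implementation), the binomial term log (n choose k) can be computed via the log-gamma function to avoid overflow:

#include <cmath>

// Compute lambda from Equation 4:
//   lambda = (8 + 2*eps) * n * (l*ln(n) + ln(C(n,k)) + ln(2)) / eps^2,
// using lgamma to evaluate ln(C(n,k)) = ln(n!) - ln(k!) - ln((n-k)!) without overflow.
double computeLambda(double n, double k, double l, double eps) {
    double logBinom = std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
    return (8.0 + 2.0 * eps) * n * (l * std::log(n) + logBinom + std::log(2.0)) / (eps * eps);
}

With θ = λ/KPT* (Section 3.2), this value directly determines the number of RR sets sampled in Algorithm 1.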

3.2 Parameter Estimation

Recall that the expected time complexity of Algorithm 1 is O(θ · EPT), where EPT is the expected number of coin tosses required to generate an RR set for a randomly selected node in G. Our objective is to identify a θ that makes θ · EPT reasonably small, while still ensuring θ ≥ λ/OPT. Towards this end, we first define a probability distribution V* over the nodes in G, such that the probability mass for each node is proportional to its in-degree in G. Let v* be a random variable following V*. We have the following lemma:

LEMMA 4. (n/m) · EPT = E[I({v*})], where the expectation of I({v*}) is taken over the randomness in v* and the influence propagation process.

In other words, if we randomly sample a node from V* and calculate its expected spread, then on average the result equals (n/m) · EPT. This implies that (n/m) · EPT ≤ OPT, since OPT equals the maximum expected spread of any size-k node set.

Suppose that we are able to identify a number t such that t = Ω((n/m) · EPT) and t ≤ OPT. Then, by setting θ = λ/t, we can guarantee that Algorithm 1 is correct and has an expected time complexity of

    O(θ · EPT) = O((m/n) · λ) = O((k + ℓ)(m + n) log n/ε^2).    (6)

Choices of t. An intuitive choice of t is t = (n/m) · EPT, since (i) both n and m are known and (ii) EPT can be estimated by measuring the average width of RR sets. However, we observe that when k ≫ 1, t = (n/m) · EPT renders θ = λ/t unnecessarily large, which in turn leads to inferior efficiency.

Algorithm 2 KptEstimation(G, k)
 1: for i = 1 to log2(n) − 1 do
 2:     Let ci = (6ℓ log n + 6 log(log2 n)) · 2^i.
 3:     Let sum = 0.
 4:     for j = 1 to ci do
 5:         Generate a random RR set R.
 6:         κ(R) = 1 − (1 − w(R)/m)^k
 7:         sum = sum + κ(R).
 8:     if sum/ci > 1/2^i then
 9:         return KPT* = n · sum/(2 · ci)
10: return KPT* = 1

To explain, recall that (n/m) · EPT equals the mean of the expected spread of a node v* sampled from V*, and hence, it is independent of k. In contrast, OPT increases monotonically with k. Therefore, the difference between (n/m) · EPT and OPT increases with k, which makes t = (n/m) · EPT an unfavorable choice of t when k is large. To tackle this problem, we replace (n/m) · EPT with a closer approximation of OPT that increases with k, as explained in the following.

Suppose that we take k samples from V*, and use them to form a node set S*. (Note that S* may contain fewer than k nodes due to the elimination of duplicate samples.) Let KPT be the mean of the expected spread of S* (over the randomness in S* and the influence propagation process). It can be verified that

    (n/m) · EPT ≤ KPT ≤ OPT,    (7)

and that KPT increases with k. We also have the following lemma:

LEMMA 5. Let R be a random RR set and w(R) be the width of R. Define

    κ(R) = 1 − (1 − w(R)/m)^k.    (8)

Then, KPT = n · E[κ(R)], where the expectation is taken over the random choices of R.

By Lemma 5, we can estimate KPT by first measuring n · κ(R) on a set of random RR sets, and then taking the average of the measurements. But how many measurements should we take? By the Chernoff bounds, if we are to obtain an estimate of KPT with δ ∈ (0, 1) relative error with at least 1 − n^{-ℓ} probability, then the number of measurements should be Ω(ℓ n log n · δ^{-2}/KPT). In other words, the number of measurements required depends on KPT, whereas KPT is exactly the subject being measured. We resolve this dilemma with an adaptive sampling approach that dynamically adjusts the number of measurements based on the observed samples of RR sets.

Estimation of KPT. Algorithm 2 presents our sampling approach for estimating KPT. The high-level idea of the algorithm is as follows. We first generate a relatively small number of RR sets, and use them to derive an estimation of KPT with a bounded absolute error. If the estimated value of KPT is much larger than the error bound, we infer that the estimation is accurate enough, and we terminate the algorithm. On the other hand, if the estimated value of KPT is not large compared with the error bound, then we generate more RR sets to obtain a new estimation of KPT with a reduced absolute error. After that, we re-evaluate the accuracy of our estimation, and if necessary, we further increase the number of RR sets, until a precise estimation of KPT is computed.

More specifically, Algorithm 2 runs in at most log2(n) − 1 iterations. In the i-th iteration, it samples ci RR sets from G (Lines 2-7), where

    ci = (6ℓ log n + 6 log(log2 n)) · 2^i.    (9)

Then, it measures κ(R) on each RR set R, and computes the average value of κ(R). Our choice of ci ensures that if this average value is larger than 2^{-i}, then with a high probability, E[κ(R)] is at least half of the average value; in that case, the algorithm terminates by returning a KPT* that equals the average value times n/2 (Lines 8-9). Meanwhile, if the average value is no more than 2^{-i}, then the algorithm proceeds to the (i + 1)-th iteration.

On the other hand, if the average value is no more than 2^{-i} in all log2(n) − 1 iterations, then the algorithm returns KPT* = 1, which equals the smallest possible KPT (since each node in the seed set can always activate itself). As we show shortly, E[1/KPT*] = O(1/KPT), and KPT* ∈ [KPT/4, OPT] holds with a high probability. Hence, setting θ = λ/KPT* ensures that Algorithm 1 is correct and achieves the expected time complexity in Equation 6.
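Combining the RR-set generator sketched earlier with Equation 8 gives a compact rendering of Algorithm 2. As before, this is an illustrative sketch, not the authors' implementation: the width w(R) and the stopping rule follow the text above, while the data layout and RNG handling are our own choices.

#include <cmath>
#include <random>
#include <vector>

// Sketch of Algorithm 2 (KptEstimation). Assumes generateRR() and ReverseGraph from the
// earlier sketch, and that inDeg[v] stores the in-degree of node v in G, with m total edges.
double kptEstimation(const ReverseGraph& gT, const std::vector<int>& inDeg,
                     long long m, int k, double ell, std::mt19937& rng) {
    int n = static_cast<int>(gT.size());
    std::uniform_int_distribution<int> pickNode(0, n - 1);
    int maxIter = static_cast<int>(std::log2(static_cast<double>(n))) - 1;
    for (int i = 1; i <= maxIter; ++i) {
        long long ci = static_cast<long long>(
            (6.0 * ell * std::log(static_cast<double>(n)) +
             6.0 * std::log(std::log2(static_cast<double>(n)))) * std::pow(2.0, i));
        double sum = 0.0;
        for (long long j = 0; j < ci; ++j) {
            std::vector<int> rr = generateRR(gT, pickNode(rng), rng);
            long long width = 0;                     // w(R): edges of G pointing into the RR set
            for (int v : rr) width += inDeg[v];
            sum += 1.0 - std::pow(1.0 - double(width) / double(m), double(k));   // kappa(R), Eq. 8
        }
        if (sum / ci > 1.0 / std::pow(2.0, i))       // estimate large enough to be reliable
            return n * sum / (2.0 * ci);
    }
    return 1.0;                                      // fall back to the smallest possible KPT
}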

Theoretical Analysis. Although Algorithm 2 is conceptually simple, proving its correctness and effectiveness is non-trivial, as it requires a careful analysis of the algorithm's behavior in each iteration. In what follows, we present a few supporting lemmas, and then use them to establish Algorithm 2's performance guarantees.

Let K be the distribution of κ(R) over random RR sets in G. Then, K has a domain [0, 1]. Let µ = KPT/n, and si be the sum of ci i.i.d. samples from K, where ci is as defined in Equation 9. By the Chernoff bounds, we have the following result:

LEMMA 6. If µ ≤ 2^{-j}, then for any i ∈ [1, j − 1],

    Pr[ si/ci > 2^{-i} ] < 1/(n^ℓ · log2 n).

By Lemma 6, if KPT/n ≤ 2^{-j}, then Algorithm 2 is very unlikely to terminate in any of the first j − 1 iterations. This prevents the algorithm from outputting a KPT* too much larger than KPT.

LEMMA 7. If µ ≥ 2^{-j}, then for any i ≥ j + 1,

    Pr[ si/ci > 2^{-i} ] > 1 − n^{-ℓ·2^{i-j-1}} / log2 n.

By Lemma 7, if KPT/n ≥ 2^{-j} and Algorithm 2 happens to reach an iteration i ≥ j + 1, then it will almost surely terminate in that iteration. This ensures that the algorithm does not output a KPT* that is considerably smaller than KPT.

Based on Lemmas 6 and 7, we prove the following theorem on the accuracy and expected time complexity of Algorithm 2:

THEOREM 2. When n ≥ 2 and ℓ ≥ 1/2, Algorithm 2 returns KPT* ∈ [KPT/4, OPT] with at least 1 − n^{-ℓ} probability, and runs in O(ℓ(m + n) log n) expected time. Furthermore, E[1/KPT*] < 12/KPT.

3.3 Putting It Together

In summary, our TIM algorithm works as follows. Given G, k, and two parameters ε and ℓ, TIM first feeds G and k as input to Algorithm 2, and obtains a number KPT* in return. After that, TIM computes θ = λ/KPT*, where λ is as defined in Equation 4 and is a function of k, ℓ, n, and ε. Finally, TIM gives G, k, and θ as input to Algorithm 1, whose output S*_k is the final result of influence maximization.

By Theorems 1 and 2, Equation 6, and the union bound, TIM runs in O((k + ℓ)(m + n) log n/ε^2) expected time, and returns a (1 − 1/e − ε)-approximate solution with at least 1 − 2·n^{-ℓ} probability.

Algorithm 3 RefineKPT(G, k, KPT*, ε′)
 1: Let R′ be the set of all RR sets generated in the last iteration of Algorithm 2.
 2: Initialize a node set S′_k = ∅.
 3: for j = 1 to k do
 4:     Identify the node vj that covers the most RR sets in R′.
 5:     Add vj into S′_k.
 6:     Remove from R′ all RR sets that are covered by vj.
 7: Let λ′ = (2 + ε′) ℓ n log n · (ε′)^{-2}.
 8: Let θ′ = λ′/KPT*.
 9: Generate θ′ random RR sets; put them into a set R″.
10: Let f be the fraction of the RR sets in R″ that are covered by S′_k.
11: Let KPT′ = f · n/(1 + ε′).
12: return KPT+ = max{KPT′, KPT*}

This success probability can easily be increased to 1 − n^{-ℓ} by scaling ℓ up by a factor of 1 + log 2/log n. Finally, we note that the time complexity of TIM is near-optimal under the IC model, as it is only a log n factor larger than the Ω(m + n) lower-bound proved by Borgs et al. [3] (for fixed k, ℓ, and ε).
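Putting the pieces together, the overall TIM pipeline can be sketched as follows; it simply chains the hypothetical helpers from the earlier sketches (kptEstimation, computeLambda, generateRR, maxCoverGreedy) in the order described above.

#include <cmath>
#include <random>
#include <vector>

// Sketch of the two-phase TIM pipeline described in Section 3.3.
// Phase 1: estimate KPT* (Algorithm 2) and derive theta = lambda / KPT*.
// Phase 2: sample theta random RR sets and run greedy maximum coverage (Algorithm 1).
std::vector<int> tim(const ReverseGraph& gT, const std::vector<int>& inDeg, long long m,
                     int k, double eps, double ell, std::mt19937& rng) {
    int n = static_cast<int>(gT.size());
    double kptStar = kptEstimation(gT, inDeg, m, k, ell, rng);
    double lambda = computeLambda(n, k, ell, eps);                  // Equation 4
    long long theta = static_cast<long long>(std::ceil(lambda / kptStar));

    std::uniform_int_distribution<int> pickNode(0, n - 1);
    std::vector<std::vector<int>> rrSets;
    rrSets.reserve(theta);
    for (long long i = 0; i < theta; ++i)                           // node selection phase
        rrSets.push_back(generateRR(gT, pickNode(rng), rng));
    return maxCoverGreedy(rrSets, n, k);
}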

4. EXTENSIONS

In this section, we present a heuristic method for improving the practical performance of TIM (without affecting its asymptotic guarantees), and extend TIM to an influence propagation model more general than the IC model.

4.1 Improved Parameter Estimation

The efficiency of TIM highly depends on the output KPT* of Algorithm 2. If KPT* is close to OPT, then θ = λ/KPT* is small; in that case, Algorithm 1 only needs to generate a relatively small number of RR sets, thus reducing computation overheads. However, we observe that KPT* is often much smaller than OPT on real datasets, which severely degrades the efficiency of Algorithm 1 and the overall performance of TIM.

Our solution to the above problem is to add an intermediate step between Algorithm 2 and Algorithm 1 to refine KPT* into a (potentially) much tighter lower-bound of OPT. Algorithm 3 shows the pseudo-code of the intermediate step. The algorithm first retrieves the set R′ of all RR sets created in the last iteration of Algorithm 2, i.e., the RR sets from which KPT* is computed. Then, it invokes the greedy approach (for the maximum coverage problem) on R′, and obtains a size-k node set S′_k that covers a large number of RR sets in R′ (Lines 2-6 in Algorithm 3).

Intuitively, S′_k should have a large expected spread, and thus, if we can estimate E[I(S′_k)] to a reasonable accuracy, then we may use the estimation to derive a good lower-bound for OPT. Towards this end, Algorithm 3 generates a number θ′ of random RR sets, and examines the fraction f of RR sets that are covered by S′_k (Lines 7-10). By Corollary 1, f · n is an unbiased estimation of E[I(S′_k)]. We set θ′ to a reasonably large number, to ensure that f · n < (1 + ε′) · E[I(S′_k)] holds with at least 1 − n^{-ℓ} probability. Based on this, Algorithm 3 computes KPT′ = f · n/(1 + ε′), which scales f · n down by a factor of 1 + ε′ to ensure that KPT′ ≤ E[I(S′_k)] ≤ OPT. The final output of Algorithm 3 is KPT+ = max{KPT′, KPT*}, i.e., we choose the larger of KPT′ and KPT* as the new lower-bound for OPT. The following lemma shows the theoretical guarantees of Algorithm 3:

LEMMA 8. Given that E[1/KPT*] < 12/KPT, Algorithm 3 runs in O(ℓ(m + n) log n/(ε′)^2) expected time. In addition, it returns KPT+ ∈ [KPT*, OPT] with at least 1 − n^{-ℓ} probability, if KPT* ∈ [KPT/4, OPT].

Note that the time complexity of Algorithm 3 is smaller than that of Algorithm 1 by a factor of k, since the former only needs to accurately estimate the expected spread of one node set (i.e., S′_k), whereas the latter needs to ensure accurate estimations for (n choose k) node sets simultaneously.

We integrate Algorithm 3 into TIM and obtain an improved solution (referred to as TIM+) as follows. Given G, k, ε, and ℓ, we first invoke Algorithm 2 to derive KPT*. After that, we feed G, k, KPT*, and a parameter ε′ to Algorithm 3, and obtain KPT+ in return. Then, we compute θ = λ/KPT+. Finally, we run Algorithm 1 with G, k, and θ as the input, and get the final result of influence maximization. It can be verified that when ε′ ≥ ε/√k, TIM+ has the same time complexity as TIM, and it returns a (1 − 1/e − ε)-approximate solution with at least 1 − 3·n^{-ℓ} probability. The success probability can be raised to 1 − n^{-ℓ} by increasing ℓ by a factor of 1 + log 3/log n.

Finally, we discuss the choice of ε′. Ideally, we should set ε′ to a value that minimizes the total number β of RR sets generated in Algorithms 1 and 3. However, β is difficult to estimate, as it depends on unknown variables such as KPT* and KPT+. In our implementation of TIM+, we set

    ε′ = 5 · (ℓ · ε^2/(k + ℓ))^{1/3}

for any ε ≤ 1. This is obtained by using a function of ε′ to roughly approximate β, and then taking the minimizer of the function.
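In code, the TIM+ pipeline only adds the refinement step between the two phases. The sketch below reuses the hypothetical helpers introduced earlier, with refineKpt standing in for Algorithm 3 (not implemented here), and also shows the ε′ heuristic just described.

#include <cmath>
#include <random>
#include <vector>

// Heuristic choice of epsilon' from Section 4.1: eps' = 5 * cbrt(l * eps^2 / (k + l)).
double chooseEpsPrime(int k, double eps, double ell) {
    return 5.0 * std::cbrt(ell * eps * eps / (k + ell));
}

// Sketch of TIM+: Algorithm 2, then Algorithm 3 (refineKpt, assumed to follow the
// pseudo-code above), then Algorithm 1 with theta = lambda / KPT+.
std::vector<int> timPlus(const ReverseGraph& gT, const std::vector<int>& inDeg, long long m,
                         int k, double eps, double ell, std::mt19937& rng) {
    int n = static_cast<int>(gT.size());
    double kptStar = kptEstimation(gT, inDeg, m, k, ell, rng);
    double kptPlus = refineKpt(gT, k, kptStar, chooseEpsPrime(k, eps, ell), rng);  // Algorithm 3
    long long theta = static_cast<long long>(std::ceil(computeLambda(n, k, ell, eps) / kptPlus));

    std::uniform_int_distribution<int> pickNode(0, n - 1);
    std::vector<std::vector<int>> rrSets;
    for (long long i = 0; i < theta; ++i)
        rrSets.push_back(generateRR(gT, pickNode(rng), rng));
    return maxCoverGreedy(rrSets, n, k);
}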

4.2 Generalization to the Triggering Model

The triggering model [17] is an influence propagation model that generalizes the IC model. It assumes that each node v is associated with a triggering distribution T(v) over the power set of v's incoming neighbors, i.e., each sample from T(v) is a subset of the nodes that have an outgoing edge to v.

Given a seed set S, an influence propagation process under the triggering model works as follows. First, for each node v, we take a sample from T(v), and define the sample as the triggering set of v. After that, at timestamp 1, we activate the nodes in S. Then, at each subsequent timestamp i, if an activated node appears in the triggering set of an inactive node v, then v becomes activated at timestamp i + 1. The propagation process terminates when no more nodes can be activated.

The influence maximization problem under the triggering model asks for a size-k seed set S that can activate the largest number of nodes in expectation. To understand why the triggering model captures the IC model as a special case, consider that we assign a triggering distribution to each node v, such that each of v's incoming neighbors independently appears in v's triggering set with p(e) probability, where e is the edge that goes from the neighbor to v. It can be verified that influence maximization under this distribution is equivalent to that under the IC model.

Interestingly, our solutions can be easily extended to support the triggering model. To explain, observe that Algorithms 1, 2, and 3 do not rely on anything specific to the IC model, except that they require a subroutine to generate random RR sets, whereas RR sets are defined under the IC model only. To address this issue, we revise the definition of RR sets to accommodate the triggering model, as explained in the following.

Suppose that we generate random graphs g from G, by first sampling a node set T for each node v from its triggering distribution T(v), and then removing any incoming edge of v that does not start from a node in T. Let 𝒢 be the distribution of g induced by the random choices of triggering sets. We refer to 𝒢 as the triggering graph distribution for G. For any given node u and a graph g sampled from 𝒢, we define the reverse reachable (RR) set for u in g as the set of nodes that can reach u in g. In addition, we define a random RR set as one that is generated on an instance of g randomly sampled from 𝒢, for a node selected from g uniformly at random.

To construct the random RR sets defined above, we employ a randomized BFS algorithm as follows. Let v be a randomly selected node. Given v, we first take a sample T from v's triggering distribution T(v), and then put all nodes in T into a queue. After that, we iteratively extract the node at the top of the queue; for each node u extracted, we sample a set T′ from u's triggering distribution, and we insert any unvisited node in T′ into the queue. When the queue becomes empty, we terminate the process, and form a random RR set with the nodes visited during the process. The expected cost of the whole process is O(EPT), where EPT denotes the expected number of edges in G that point to the nodes in a random RR set. This expected time complexity is the same as that of the algorithm for generating random RR sets under the IC model.
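A sketch of this revised RR-set generator is below; the triggering distribution is abstracted as a caller-supplied sampler (a hypothetical sampleTriggeringSet callback of our own design), which for the IC model would return each incoming neighbor independently with probability p(e), and for the LT model either a single incoming neighbor or the empty set.

#include <functional>
#include <queue>
#include <random>
#include <vector>

// Generate one random RR set under the triggering model via randomized BFS.
// sampleTriggeringSet(v, rng) draws one triggering set of node v, i.e., a subset
// of v's incoming neighbors, according to the triggering distribution T(v).
std::vector<int> generateRRTriggering(
        int n, int v,
        const std::function<std::vector<int>(int, std::mt19937&)>& sampleTriggeringSet,
        std::mt19937& rng) {
    std::vector<bool> visited(n, false);
    std::vector<int> rr;
    std::queue<int> q;
    visited[v] = true; q.push(v);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        rr.push_back(u);
        for (int w : sampleTriggeringSet(u, rng))   // only nodes in u's triggering set can reach u
            if (!visited[w]) {
                visited[w] = true;
                q.push(w);
            }
    }
    return rr;
}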

By incorporating the above BFS approach into Algorithms 1, 2, and 3, our solutions can readily support the triggering model. Our next step is to show that the revised solution retains the performance guarantees of TIM and TIM+. For this purpose, we first present an extended version of Lemma 2 for the triggering model. (The proof of the lemma is almost identical to that of Lemma 2.)

LEMMA 9. Let S be a fixed set of nodes, v be a fixed node, and 𝒢 be the triggering graph distribution for G. Suppose that we generate an RR set R for v on a graph g sampled from 𝒢. Let ρ1 be the probability that S overlaps with R, and ρ2 be the probability that S (as a seed set) can activate v in an influence propagation process on G under the triggering model. Then, ρ1 = ρ2.

Next, we note that all of our theoretical analysis of TIM and TIM+ is based on the Chernoff bounds and Lemma 2, without relying on any other results specific to the IC model. Therefore, once we establish Lemma 9, it is straightforward to combine it with the Chernoff bounds to show that, under the triggering model, both TIM and TIM+ provide the same performance guarantees as in the case of the IC model. Thus, we have the following theorem:

THEOREM 3. Under the triggering model, TIM (resp. TIM+) runs in O((k + ℓ)(m + n) log n/ε^2) expected time, and returns a (1 − 1/e − ε)-approximate solution with at least 1 − 2·n^{-ℓ} probability (resp. 1 − 3·n^{-ℓ} probability).

5. THEORETICAL COMPARISONS

Comparison with RIS. Borgs et al. [3] show that, under the IC model, RIS can derive a (1 − 1/e − ε)-approximate solution for the influence maximization problem, with O(kℓ^2(m + n) log^2 n/ε^3) running time and at least 1 − n^{-ℓ} success probability. The time complexity of RIS is larger than the expected time complexity of TIM and TIM+ by a factor of ℓ log n/ε. Therefore, both TIM and TIM+ are superior to RIS in terms of asymptotic performance.

Comparison with Greedy. As mentioned in Section 2.2, Greedy runs in O(kmnr) time, where r is the number of Monte Carlo samples used to estimate the expected spread of each node set. Kempe et al. do not provide a formal result on how r should be set to achieve a (1 − 1/e − ε)-approximation ratio; instead, they only point out that when each estimation of expected spread has ε relative error, Greedy returns a (1 − 1/e − ε′)-approximate solution for a certain ε′ [18].

We present a more detailed characterization of the relationship between r and Greedy's approximation ratio:


Table 2: Dataset characteristics.

    Name         n      m     Type        Average degree
    NetHEPT      15K    31K   undirected  4.1
    Epinions     76K    509K  directed    13.4
    DBLP         655K   2M    undirected  6.1
    LiveJournal  4.8M   69M   directed    28.5
    Twitter      41.6M  1.5G  directed    70.5

LEMMA 10. Greedy returns a (1 − 1/e − ε)-approximate solution with at least 1 − n^{-ℓ} probability, if

    r ≥ (8k^2 + 2kε) · n · ((ℓ + 1) log n + log k) / (ε^2 · OPT).    (10)

Assume that we know OPT in advance and set r to the smallest value satisfying the above inequality, in Greedy's favor. In that case, the time complexity of Greedy is O(k^3 ℓ m n^2 ε^{-2} log n/OPT). Given that OPT ≤ n, this complexity is much worse than the expected time complexity of TIM and TIM+.

6. ADDITIONAL RELATED WORK

There has been a large body of literature on influence maximization over the past decade (see [2-7, 10, 16-19, 21, 28, 30, 31] and the references therein). Besides Greedy [17] and RIS [3], the work most related to ours is by Leskovec et al. [21], Chen et al. [6], and Goyal et al. [11]. In particular, Leskovec et al. [21] propose an algorithmic optimization of Greedy that avoids evaluating the expected spreads of a large number of node sets. This optimization reduces the computation cost of Greedy by up to 700-fold, without affecting its approximation guarantees. Subsequently, Chen et al. [6] and Goyal et al. [11] further enhance Leskovec et al.'s approach, and achieve up to 50% additional improvements in terms of efficiency.

Meanwhile, there also exists a plethora of algorithms [5-7, 10, 16, 19, 30, 31] that rely on heuristics to efficiently derive solutions for influence maximization. For example, Chen et al. [5] propose to reduce computation costs by omitting the social network paths with low propagation probabilities; Wang et al. [31] propose to divide the social network into smaller communities, and then identify influential nodes from each community individually; Goyal et al. [12] propose to estimate the expected spread of each node set S based only on the nodes that are close to S. In general, existing heuristic solutions are shown to be much more efficient than Greedy (and its aforementioned variants [6, 11, 21]), but they fail to retain the (1 − 1/e − ε)-approximation ratio. As a consequence, they tend to produce less accurate results, as shown in the experiments in [5-7, 10, 16, 19, 30, 31].

Considerable research has also been done to extend Kempe et al.'s formulation of influence maximization [17] to various new settings, e.g., when the influence propagation process follows a different model [10, 22], when there are multiple parties that compete with each other for social influence [2, 23], or when the influence propagation process terminates at a predefined timestamp [4]. The solutions derived for those scenarios are inapplicable under our setting, due to the differences in problem formulations. Finally, there is recent research on learning the parameters of the influence propagation model (e.g., the propagation probability on each edge) from observed data [9, 20, 26]. This line of research complements (and is orthogonal to) the existing studies on influence maximization.

7. EXPERIMENTS

[Figure 3: Computation time vs. k on NetHEPT. (a) The IC model. (b) The LT model. Each plot shows the running time (seconds, from 10^0 to 10^5, log scale) of TIM, TIM+, RIS, and CELF++ as k varies from 1 to 50.]

[Figure 4: Breakdown of computation time on NetHEPT, showing the time (seconds) spent in Algorithms 1, 2, and 3 as k varies from 1 to 50. (a) TIM (the IC model). (b) TIM+ (the IC model).]

This section experimentally evaluates TIM and TIM+. Our experiments are conducted on a machine with an Intel Xeon 2.4GHz CPU and 48GB memory, running 64-bit Ubuntu 13.10. All algorithms tested are implemented in C++ and compiled with g++ 4.8.1.

7.1 Experimental Settings

Datasets. Table 2 shows the datasets used in our experiments. Among them, NetHEPT, Epinions, DBLP, and LiveJournal are benchmarks in the literature of influence maximization [19]. Meanwhile, Twitter contains a social network crawled from Twitter.com in July 2009, and it is publicly available from [1]. Note that Twitter is significantly larger than the other four datasets.

Propagation Models. We consider two influence propagation models, namely, the IC model (see Section 2.1) and the linear threshold (LT) model [17]. Specifically, the LT model is a special case of the triggering model, such that for each node v, any sample from v's triggering distribution T(v) is either ∅ or a singleton containing an incoming neighbor of v. Following previous work [7], we construct T(v) for each node v by first assigning a random probability in [0, 1] to each of v's incoming neighbors, and then normalizing the probabilities so that they sum up to 1. As for the IC model, we set the propagation probability of each edge e as follows: we first identify the node v that e points to, and then set p(e) = 1/i, where i denotes the in-degree of v. This setting of p(e) is widely adopted in prior work [5, 10, 16, 30].
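The two parameter settings just described are straightforward to reproduce; the sketch below is our own illustration (reusing the ReverseGraph layout from the earlier sketches) and assigns the IC probabilities p(e) = 1/indegree(v) and the normalized random LT weights.

#include <random>
#include <vector>

// Assign IC-model propagation probabilities p(e) = 1 / indegree(v) for every edge e
// pointing to node v, given the in-edge lists of G.
void assignICProbabilities(ReverseGraph& gT) {
    for (auto& inEdges : gT) {
        if (inEdges.empty()) continue;
        double p = 1.0 / inEdges.size();
        for (REdge& e : inEdges) e.p = p;
    }
}

// Assign LT-model weights: a random weight per incoming neighbor, normalized to sum to 1.
// The weight is stored in the same p field; under LT it is interpreted as the probability
// that this neighbor is chosen as v's (singleton) triggering set.
void assignLTWeights(ReverseGraph& gT, std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    for (auto& inEdges : gT) {
        double total = 0.0;
        for (REdge& e : inEdges) { e.p = uni(rng); total += e.p; }
        for (REdge& e : inEdges) if (total > 0.0) e.p /= total;
    }
}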

Algorithms. We compare our solutions with four methods, namely, RIS [3], CELF++ [11], IRIE [16], and SIMPATH [12]. In particular, CELF++ is a state-of-the-art variant of Greedy that considerably improves the efficiency of Greedy without affecting its theoretical guarantees, while IRIE and SIMPATH are the most advanced heuristic methods under the IC and LT models, respectively. We adopt the C++ implementations of CELF++, IRIE, and SIMPATH made available by their inventors, and we implement RIS and our solutions in C++. Note that RIS is designed under the IC model only, but we incorporate the techniques in Section 4.2 into RIS and extend it to the LT model.


[Figure 5: Expected spreads, KPT*, and KPT+ on NetHEPT. (a) The IC model; (b) the LT model. Each panel plots expected spread against k for TIM, TIM+, RIS, and CELF++, together with the KPT* and KPT+ values.]


Parameters. Unless otherwise specified, we set ε = 0.1 and k = 50 in our experiments. For RIS and our solutions, we set ℓ in a way that ensures a success probability of 1 − 1/n. For CELF++, we set the number of Monte Carlo steps to r = 10000, following the standard practice in the literature. Note that this choice of r is to the advantage of CELF++ because, by Lemma 10, the value of r required in our experiments is always larger than 10000. In each of our experiments, we repeat each method three times and report the average result.

7.2 Comparison with CELF++ and RIS

Our first set of experiments compares our solutions with CELF++ and RIS, i.e., the state of the art among the solutions that provide non-trivial approximation guarantees.

Results on NetHEPT. Figure 3 shows the computation cost of each method on the NetHEPT dataset, varying k from 1 to 50. Observe that TIM+ consistently outperforms TIM, while TIM is up to two orders of magnitude faster than CELF++ and RIS. In particular, when k = 50, CELF++ requires more than an hour to return a solution, whereas TIM+ terminates within ten seconds. These results are consistent with our theoretical analysis (in Section 5) that Greedy's time complexity is much higher than those of TIM and TIM+. On the other hand, RIS is the slowest method in all cases despite its near-linear time complexity, because of the ε^{−3} term and the large hidden constant factor in its performance bound. One may improve the empirical efficiency of RIS by reducing the threshold τ on its running time (see Section 2.3), but in that case, the worst-case quality guarantee of RIS is not necessarily retained.

The computation overheads of RIS and CELF++ increase with k, because (i) RIS's threshold τ on running time is linear in k, while (ii) a larger k requires CELF++ to evaluate the expected spreads of an increased number of node sets. Surprisingly, when k increases, the running time of TIM and TIM+ tends to decrease. To understand this, we show, in Figure 4, a breakdown of TIM and TIM+'s computation overheads under the IC model. Evidently, both algorithms' overheads are mainly incurred by Algorithm 1, i.e., the node selection phase. Meanwhile, the computation cost of Algorithm 1 is mostly decided by the number θ of RR sets that it needs to generate. For TIM, we have θ = λ/KPT*, where λ is as defined in Equation 4, and KPT* is a lower-bound of OPT produced by Algorithm 2. Both λ and KPT* increase with k, and it happens that, on NetHEPT, the increase of KPT* is more pronounced than that of λ, which leads to the decrease in TIM's running time. Similar observations can be made on TIM+ and on the case of the LT model.

From Figure 4, we can also observe that the computation cost of Algorithm 3 (i.e., the intermediate step) is negligible compared with the total cost of TIM+. Yet, Algorithm 3 is so effective that it reduces TIM+'s running time to at most 1/3 of TIM's. This indicates that Algorithm 3 returns a much tighter lower-bound of OPT than Algorithm 2 (i.e., the parameter estimation phase) does. To support this argument, Figure 5 illustrates the lower-bounds KPT* and KPT+ produced by Algorithms 2 and 3, respectively. Observe that KPT+ is at least three times KPT* in all cases, which is consistent with TIM+'s 3-fold efficiency improvement over TIM.

In addition, Figure 5 also shows the expected spreads of the node sets selected by each method on NetHEPT. (We estimate the expected spread of a node set by taking the average of 10^5 Monte Carlo measurements.) There is no significant difference among the expected spreads pertinent to different methods.
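As a reference point, a Monte Carlo spread estimate of this kind can be sketched as follows. This is our illustration rather than the authors' evaluation code; the Graph struct (forward adjacency lists out_neighbors with per-edge IC probabilities p) and the function name are hypothetical.

#include <queue>
#include <random>
#include <vector>

struct Graph {
    int n;
    std::vector<std::vector<int>> out_neighbors;   // out_neighbors[u][i] is the i-th target of u
    std::vector<std::vector<double>> p;            // p[u][i] is the IC probability of that edge
};

// Average spread of seed set S over `runs` independent IC cascades.
double estimate_spread_ic(const Graph& g, const std::vector<int>& S,
                          int runs, std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    long long total = 0;
    for (int t = 0; t < runs; ++t) {
        std::vector<char> active(g.n, 0);
        std::queue<int> q;
        int spread = 0;
        for (int s : S) if (!active[s]) { active[s] = 1; q.push(s); ++spread; }
        while (!q.empty()) {
            int u = q.front(); q.pop();
            int deg = static_cast<int>(g.out_neighbors[u].size());
            for (int i = 0; i < deg; ++i) {
                int v = g.out_neighbors[u][i];
                if (!active[v] && uni(rng) < g.p[u][i]) {  // each edge is tried once
                    active[v] = 1; ++spread; q.push(v);
                }
            }
        }
        total += spread;
    }
    return static_cast<double>(total) / runs;
}

With runs = 100000, this matches the 10^5-measurement protocol mentioned above, at the cost of one full cascade simulation per measurement.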

Results on Large Datasets. Next, we experiment with the four larger datasets, i.e., Epinions, DBLP, LiveJournal, and Twitter. As RIS and CELF++ incur prohibitive overheads on those four datasets, we omit them from the experiments. Figure 6 shows the running time of TIM and TIM+ on each dataset. Observe that TIM+ outperforms TIM in all cases, by up to two orders of magnitude in terms of running time. Furthermore, even in the most adversarial case when k = 1, TIM+ terminates within four hours under both the IC and LT models. (TIM is omitted from Figure 6d due to its excessive computation cost on Twitter.)

Interestingly, both TIM and TIM+ are more efficient under the LT model than the IC model. This is caused by the fact that we use different methods to generate RR sets under the two models. Specifically, under the IC model, we construct each RR set with a randomized BFS on G; for each incoming edge that we encounter during the BFS, we need to generate a random number to decide whether the edge should be ignored. In contrast, when we perform a randomized BFS on G to create an RR set under the LT model, we generate a random number x for each node v that we visit, and we use x to pick an incoming edge of v to traverse. In other words, the number of random numbers required under the IC (resp. LT) model is proportional to the number of edges (resp. nodes) examined. Given that each of our datasets contains many more edges than nodes, it is not surprising that our solutions perform better under the LT model.
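To make the contrast concrete, the two reverse randomized BFS procedures can be sketched as follows. This is not the authors' implementation; the Graph struct (reverse adjacency lists in_neighbors, IC probabilities p, and normalized LT weights w) and the function names are assumptions for illustration.

#include <queue>
#include <random>
#include <vector>

struct Graph {
    int n;
    std::vector<std::vector<int>> in_neighbors;    // reverse adjacency lists
    std::vector<std::vector<double>> p;            // IC: p[v][i] for the edge in_neighbors[v][i] -> v
    std::vector<std::vector<double>> w;            // LT: normalized weights, summing to 1 per node
};

// IC model: one random number per incoming edge encountered during the BFS.
std::vector<int> random_rr_set_ic(const Graph& g, std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, g.n - 1);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<char> seen(g.n, 0);
    std::vector<int> rr;
    std::queue<int> q;
    int start = pick(rng);
    seen[start] = 1; rr.push_back(start); q.push(start);
    while (!q.empty()) {
        int v = q.front(); q.pop();
        int deg = static_cast<int>(g.in_neighbors[v].size());
        for (int i = 0; i < deg; ++i) {
            int u = g.in_neighbors[v][i];
            if (!seen[u] && uni(rng) < g.p[v][i]) {      // keep the edge with probability p(e)
                seen[u] = 1; rr.push_back(u); q.push(u);
            }
        }
    }
    return rr;
}

// LT model: one random number per visited node, used to pick (at most) one
// incoming edge to traverse; the walk stops at a dead end or a repeated node.
std::vector<int> random_rr_set_lt(const Graph& g, std::mt19937& rng) {
    std::uniform_int_distribution<int> pick(0, g.n - 1);
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    std::vector<char> seen(g.n, 0);
    std::vector<int> rr;
    int v = pick(rng);
    seen[v] = 1; rr.push_back(v);
    while (!g.in_neighbors[v].empty()) {
        double x = uni(rng), acc = 0.0;
        int next = -1;
        int deg = static_cast<int>(g.in_neighbors[v].size());
        for (int i = 0; i < deg; ++i) {
            acc += g.w[v][i];
            if (x <= acc) { next = g.in_neighbors[v][i]; break; }
        }
        if (next < 0 || seen[next]) break;               // no edge picked, or a cycle closed
        seen[next] = 1; rr.push_back(next); v = next;
    }
    return rr;
}

The IC variant consumes one random number per incoming edge it examines, while the LT variant consumes one per node it visits, which is exactly the asymmetry that favors the LT model on edge-heavy graphs.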

Finally, Figure 7 shows the running time of TIM and TIM+ as a function of ε. The performance of both algorithms significantly improves with the increase of ε, since a larger ε leads to a less stringent requirement on the number of RR sets. In particular, when ε ≥ 0.2, TIM+ requires less than 1 hour to process Twitter under both the IC and LT models.

7.3 Comparison with IRIE and SIMPATH

Our second set of experiments compares TIM+ with IRIE [16] and SIMPATH [12], namely, the state-of-the-art heuristic methods under the IC and LT models, respectively. (We omit TIM as it performs consistently worse than TIM+.) Both IRIE and SIMPATH have two internal parameters that control the trade-off between computation cost and result accuracy. In our experiments, we set those parameters according to the recommendations in [12, 16]. Specifically, we set IRIE's parameters α and θ to 0.7 and 1/320, respectively, and SIMPATH's parameters η and ℓ to 10^{−3} and 4, respectively. For TIM+, we set ε = ℓ = 1, in which case TIM+ provides weak theoretical guarantees but high empirical efficiency. We evaluate the algorithms on all datasets except Twitter, as the memory consumptions of IRIE and SIMPATH exceed the size of the memory on our testing machine (i.e., 48GB).


[Figure 6: Running time vs. k on large datasets: (a) Epinions, (b) DBLP, (c) LiveJournal, (d) Twitter. Each panel plots running time (seconds, log scale) against k for TIM and TIM+ under the IC and LT models.]

[Figure 7: Running time vs. ε on large datasets: (a) Epinions, (b) DBLP, (c) LiveJournal, (d) Twitter. Each panel plots running time (seconds, log scale) against ε for TIM and TIM+ under the IC and LT models.]


Figure 8 shows the running time of TIM+ and IRIE under the IC model, varying k from 1 to 50. The computation cost of TIM+ tends to decrease with the increase of k, as a result of the subtle interplay among several variables (e.g., λ, KPT*, and KPT+) that decide the number of random RR sets required in TIM+. Meanwhile, IRIE's computation time increases with k, since (i) it adopts a greedy approach to iteratively select k nodes from the input graph G, and (ii) a larger k results in more iterations in IRIE, which leads to a higher processing cost. Overall, TIM+ is not as efficient as IRIE when k is small, but it clearly outperforms IRIE on all datasets when k > 20. In particular, when k = 50, TIM+'s computation time on LiveJournal is less than 5% of IRIE's.

Figure 9 illustrates the expected spreads of the node sets returned by TIM+ and IRIE. Compared with IRIE, TIM+ has (i) noticeably higher expected spreads on DBLP and LiveJournal, and (ii) similar expected spreads on NetHEPT and Epinions. This indicates that TIM+ generally provides more accurate results than IRIE does, even when we set ε = ℓ = 1 for TIM+.

Figure 10 compares the computation efficiency of TIM+ and SIMPATH under the LT model, when k varies. Observe that TIM+ consistently outperforms SIMPATH by large margins. In particular, when k = 50, the former's running time on LiveJournal is lower than the latter's by three orders of magnitude. Furthermore, as shown in Figure 11, TIM+'s expected spreads are significantly higher than SIMPATH's on LiveJournal, and are no worse on the other three datasets. Therefore, TIM+ is clearly preferable to SIMPATH for influence maximization under the LT model.

7.4 Memory Consumptions

Our last set of experiments evaluates TIM+'s memory consumptions, setting ε = 0.1 and ℓ = 1 + log 3/log n (i.e., we ensure a success probability of at least 1 − 1/n). Note that ε = 0.1 is adversarial to TIM+, due to the following reasons:

1. The memory cost of TIM+ is mainly incurred by the set R of random RR sets generated in Algorithm 1;

2. TIM+ sets the size of R to λ/KPT+, where λ is as defined in Equation 4 and KPT+ is a lower-bound of OPT generated by Algorithm 3;

3. λ is inversely proportional to ε², i.e., a smaller ε leads to a larger R, which results in a higher space overhead (see the worked example after this list).
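As a quick back-of-the-envelope check on item 3 (our own illustration, treating λ as exactly proportional to 1/ε² and KPT+ as unchanged), halving ε quadruples the number of RR sets that must be kept in memory:

\[
|R| = \frac{\lambda}{KPT^{+}}, \qquad \lambda \propto \frac{1}{\varepsilon^{2}}
\;\;\Longrightarrow\;\;
\frac{|R|\,\big|_{\varepsilon=0.1}}{|R|\,\big|_{\varepsilon=0.2}} = \frac{(0.2)^{2}}{(0.1)^{2}} = 4 .
\]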

Figure 12 shows the memory costs of TIM+ on each dataset under the IC and LT models. In all cases, TIM+ requires more memory under the IC model than under the LT model. The reason is that R's size is inversely proportional to KPT+, while KPT+ tends to be larger under the LT model (see Figure 5 for example). The memory consumption of TIM+ tends to be larger when the dataset size increases, since |R| = λ/KPT+, while λ increases with n, i.e., the number of nodes in the dataset. But interestingly, TIM+ incurs a higher space overhead on NetHEPT than on Epinions, even though the latter has a larger number of nodes. To explain, observe from Figures 9 and 11 that nodes in Epinions tend to have much higher expected spreads than those in NetHEPT. As a consequence, TIM+ obtains a considerably larger KPT+ from Epinions than from NetHEPT. This pronounced increase in KPT+ renders |R| = λ/KPT+ smaller on Epinions than on NetHEPT, despite the fact that λ is smaller on the latter.

8. CONCLUSION

This paper presents TIM, an influence maximization algorithm that supports the triggering model by Kempe et al. [17]. The algorithm runs in O((k + ℓ)(n + m) log n/ε²) expected time, and returns (1 − 1/e − ε)-approximate solutions with at least 1 − n^{−ℓ} probability. In addition, it incorporates heuristic optimizations that lead to up to 100-fold improvements in empirical efficiency. Our experiments show that, when k = 50, ε = 0.2, and ℓ = 1, the algorithm can process a billion-edge graph on a commodity machine within an hour. Such practical efficiency is unmatched by any existing solutions that provide non-trivial approximation guarantees for the influence maximization problem. For future work, we plan to investigate how we can turn TIM into a distributed algorithm, so as to handle massive graphs that do not fit in the main memory of a single machine. In addition, we plan to extend TIM to other formulations of the influence maximization problem, e.g., competitive influence maximization [2, 23].


[Figure 8: Running time vs. k under the IC model: (a) NetHEPT, (b) Epinions, (c) DBLP, (d) LiveJournal. Each panel plots running time (seconds, log scale) against k for TIM+ and IRIE.]

[Figure 9: Expected spreads vs. k under the IC model: (a) NetHEPT, (b) Epinions, (c) DBLP, (d) LiveJournal. Each panel plots expected spread against k for TIM+ and IRIE.]


9. REFERENCES

[1] http://an.kaist.ac.kr/traces/WWW2010.html.

[2] S. Bharathi, D. Kempe, and M. Salek. Competitive influence maximization in social networks. In WINE, pages 306–311, 2007.

[3] C. Borgs, M. Brautbar, J. T. Chayes, and B. Lucier. Maximizing social influence in nearly optimal time. In SODA, pages 946–957, 2014.

[4] W. Chen, W. Lu, and N. Zhang. Time-critical influence maximization in social networks with time-delayed diffusion process. In AAAI, 2012.

[5] W. Chen, C. Wang, and Y. Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In KDD, pages 1029–1038, 2010.

[6] W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In KDD, pages 199–208, 2009.

[7] W. Chen, Y. Yuan, and L. Zhang. Scalable influence maximization in social networks under the linear threshold model. In ICDM, pages 88–97, 2010.

[8] P. Domingos and M. Richardson. Mining the network value of customers. In KDD, pages 57–66, 2001.

[9] A. Goyal, F. Bonchi, and L. V. S. Lakshmanan. Learning influence probabilities in social networks. In WSDM, pages 241–250, 2010.

[10] A. Goyal, F. Bonchi, and L. V. S. Lakshmanan. A data-based approach to social influence maximization. PVLDB, 5(1):73–84, 2011.

[11] A. Goyal, W. Lu, and L. V. S. Lakshmanan. CELF++: Optimizing the greedy algorithm for influence maximization in social networks. In WWW, pages 47–48, 2011.

[12] A. Goyal, W. Lu, and L. V. S. Lakshmanan. SimPath: An efficient algorithm for influence maximization under the linear threshold model. In ICDM, pages 211–220, 2011.

[13] M. Granovetter. Threshold models of collective behavior. American Journal of Sociology, 83(6):1420–1443, 1978.

[14] J. Goldenberg, B. Libai, and E. Muller. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters, 12(3):211–223, 2001.

[15] J. Goldenberg, B. Libai, and E. Muller. Using complex systems analysis to advance marketing theory development. American Journal of Sociology, 9:1, 2001.

[16] K. Jung, W. Heo, and W. Chen. IRIE: Scalable and robust influence maximization in social networks. In ICDM, pages 918–923, 2012.

[17] D. Kempe, J. M. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In KDD, pages 137–146, 2003.


[Figure 10: Running time vs. k under the LT model: (a) NetHEPT, (b) Epinions, (c) DBLP, (d) LiveJournal. Each panel plots running time (seconds, log scale) against k for TIM+ and SIMPATH.]

[Figure 11: Expected spreads vs. k under the LT model: (a) NetHEPT, (b) Epinions, (c) DBLP, (d) LiveJournal. Each panel plots expected spread against k for TIM+ and SIMPATH.]

[Figure 12: Memory consumptions of TIM+ vs. k: (a) NetHEPT, (b) Epinions, (c) DBLP, (d) LiveJournal, (e) Twitter. Each panel plots memory (GB) against k for TIM+ under the IC and LT models.]

[18] D. Kempe, J. M. Kleinberg, and É. Tardos. Influential nodes in a diffusion model for social networks. In ICALP, pages 1127–1138, 2005.

[19] J. Kim, S.-K. Kim, and H. Yu. Scalable and parallelizable processing of influence maximization for large-scale social networks. In ICDE, pages 266–277, 2013.

[20] K. Kutzkov, A. Bifet, F. Bonchi, and A. Gionis. STRIP: Stream learning of influence probabilities. In KDD, pages 275–283, 2013.

[21] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In KDD, pages 420–429, 2007.


[22] Y. Li, W. Chen, Y. Wang, and Z.-L. Zhang. Influence diffusion dynamics and influence maximization in social networks with friend and foe relationships. In WSDM, pages 657–666, 2013.

[23] W. Lu, F. Bonchi, A. Goyal, and L. V. S. Lakshmanan. The bang for the buck: Fair competitive viral marketing from the host perspective. In KDD, pages 928–936, 2013.

[24] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

[25] M. Richardson and P. Domingos. Mining knowledge-sharing sites for viral marketing. In KDD, pages 61–70, 2002.

[26] K. Saito, N. Mutoh, T. Ikeda, T. Goda, and K. Mochizuki. Improving search efficiency of incremental variable selection by using second-order optimal criterion. In KES (3), pages 41–49, 2008.

[27] T. Schelling. Micromotives and Macrobehavior. W. W. Norton & Company, 2006.

[28] L. Seeman and Y. Singer. Adaptive seeding in social networks. In FOCS, pages 459–468, 2013.

[29] V. V. Vazirani. Approximation Algorithms. Springer, 2002.

[30] C. Wang, W. Chen, and Y. Wang. Scalable influence maximization for independent cascade model in large-scale social networks. Data Min. Knowl. Discov., 25(3):545–576, 2012.

[31] Y. Wang, G. Cong, G. Song, and K. Xie. Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. In KDD, pages 1039–1048, 2010.

APPENDIX

Proof of Lemma 2. Let g be a graph constructed from G by removing each edge e with 1 − p(e) probability. Then, ρ2 equals the probability that v is reachable from S in g. Meanwhile, by Definition 1, ρ1 equals the probability that g contains a directed path that ends at v and starts at a node in S. It follows that ρ1 = ρ2.

Proof of Corollary 1. Observe that E[F_R(S)] equals the probability that S intersects a random RR set, while E[I(S)]/n equals the probability that a randomly selected node can be activated by S in an influence propagation process on G. By Lemma 2, the two probabilities are equal, leading to E[n · F_R(S)] = E[I(S)].

Proof of Lemma 3. Let ρ be the probability that S overlaps with a random RR set. Then, θ · F_R(S) can be regarded as the sum of θ i.i.d. Bernoulli variables with a mean ρ. By Corollary 1, ρ = E[F_R(S)] = E[I(S)]/n. Then, we have

\[
\begin{aligned}
\Pr\Big[\,\big|\,n \cdot F_R(S) - \mathrm{E}[I(S)]\,\big| \ge \tfrac{\varepsilon}{2}\cdot OPT\Big]
&= \Pr\Big[\,\big|\,\theta \cdot F_R(S) - \rho\theta\,\big| \ge \tfrac{\varepsilon\theta}{2n}\cdot OPT\Big] \\
&= \Pr\Big[\,\big|\,\theta \cdot F_R(S) - \rho\theta\,\big| \ge \tfrac{\varepsilon\cdot OPT}{2n\rho}\cdot \rho\theta\Big].
\end{aligned}
\tag{11}
\]

Let δ = ε · OPT/(2nρ). By the Chernoff bounds, Equation 2, and the fact that ρ = E[I(S)]/n ≤ OPT/n, we have

\[
\begin{aligned}
\text{r.h.s.\ of Eqn.~11}
&< 2\exp\Big(-\frac{\delta^{2}}{2+\delta}\cdot \rho\theta\Big)
 = 2\exp\Big(-\frac{\varepsilon^{2}\cdot OPT^{2}}{8n^{2}\rho + 2\varepsilon n\cdot OPT}\cdot \theta\Big) \\
&\le 2\exp\Big(-\frac{\varepsilon^{2}\cdot OPT^{2}}{8n\cdot OPT + 2\varepsilon n\cdot OPT}\cdot \theta\Big)
 = 2\exp\Big(-\frac{\varepsilon^{2}\cdot OPT}{(8+2\varepsilon)\cdot n}\cdot \theta\Big)
 \le \frac{1}{\binom{n}{k}\cdot n^{\ell}}.
\end{aligned}
\]

Therefore, the lemma is proved.
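For readability, we restate the multiplicative Chernoff bounds that this and the subsequent proofs invoke; this restatement is our addition in a standard form (see, e.g., [24]) rather than one of the paper's numbered equations. For the sum X of N i.i.d. Bernoulli variables with mean µ and any δ > 0,

\[
\Pr\big[X - N\mu \ge \delta\cdot N\mu\big] \le \exp\Big(-\frac{\delta^{2}}{2+\delta}\cdot N\mu\Big),
\qquad
\Pr\big[X - N\mu \le -\delta\cdot N\mu\big] \le \exp\Big(-\frac{\delta^{2}}{2}\cdot N\mu\Big),
\]

and the two-sided form used above follows by summing the two tails, which yields the leading factor of 2.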

Proof of Theorem 1. Let S_k be the node set returned by Algorithm 1, and S_k^+ be the size-k node set that maximizes F_R(S_k^+) (i.e., S_k^+ covers the largest number of RR sets in R). As S_k is derived from R using a (1 − 1/e)-approximate algorithm for the maximum coverage problem, we have F_R(S_k) ≥ (1 − 1/e) · F_R(S_k^+). Let S_k^◦ be the optimal solution for the influence maximization problem on G, i.e., E[I(S_k^◦)] = OPT. We have F_R(S_k^+) ≥ F_R(S_k^◦), which leads to F_R(S_k) ≥ (1 − 1/e) · F_R(S_k^◦).

Assume that θ satisfies Equation 2. By Lemma 3, Equation 3 holds with at least 1 − n^{−ℓ}/(n choose k) probability for any given size-k node set S. Then, by the union bound, Equation 3 should hold simultaneously for all size-k node sets with at least 1 − n^{−ℓ} probability. In that case, we have

\[
\begin{aligned}
\mathrm{E}[I(S_k)] &> n \cdot F_R(S_k) - \tfrac{\varepsilon}{2}\cdot OPT \\
&\ge (1 - 1/e)\cdot n \cdot F_R(S_k^{+}) - \tfrac{\varepsilon}{2}\cdot OPT \\
&\ge (1 - 1/e)\cdot n \cdot F_R(S_k^{\circ}) - \tfrac{\varepsilon}{2}\cdot OPT \\
&\ge (1 - 1/e)\cdot\big(1 - \tfrac{\varepsilon}{2}\big)\cdot OPT - \tfrac{\varepsilon}{2}\cdot OPT \\
&> (1 - 1/e - \varepsilon)\cdot OPT.
\end{aligned}
\]

Thus, the theorem is proved.
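For intuition, the (1 − 1/e)-approximate maximum-coverage step that this proof relies on is the textbook greedy procedure: repeatedly pick the node contained in the largest number of not-yet-covered RR sets. The sketch below is our illustration of that procedure under an assumed data layout, not Algorithm 1 itself (which interleaves this selection with its own bookkeeping of R).

#include <vector>

// rr_sets[j] lists the node ids contained in the j-th RR set. Returns k nodes
// chosen by the standard greedy maximum-coverage heuristic, whose coverage is
// at least (1 - 1/e) times the best achievable by any k nodes.
std::vector<int> greedy_max_coverage(const std::vector<std::vector<int>>& rr_sets,
                                     int n, int k) {
    std::vector<std::vector<int>> containing(n);   // RR sets containing each node
    for (int j = 0; j < static_cast<int>(rr_sets.size()); ++j)
        for (int v : rr_sets[j]) containing[v].push_back(j);

    std::vector<int> gain(n);                      // # of uncovered RR sets containing v
    for (int v = 0; v < n; ++v) gain[v] = static_cast<int>(containing[v].size());
    std::vector<char> covered(rr_sets.size(), 0);
    std::vector<int> seeds;

    for (int i = 0; i < k; ++i) {
        int best = 0;
        for (int v = 1; v < n; ++v) if (gain[v] > gain[best]) best = v;
        seeds.push_back(best);
        for (int j : containing[best]) {           // mark newly covered RR sets
            if (covered[j]) continue;
            covered[j] = 1;
            for (int u : rr_sets[j]) --gain[u];    // update marginal gains
        }
    }
    return seeds;
}

(Ties, and the degenerate case where every RR set is already covered, are not handled specially in this sketch.)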

Proof of Lemma 4. Let R be a random RR set, and p_R be the probability that a randomly selected edge from G points to a node in R. Then, EPT = E[p_R · m], where the expectation is taken over the random choices of R.

Let v* be a sample from V*, and b(v*, R) be a boolean function that returns 1 if v* ∈ R, and 0 otherwise. Then, for any fixed R,

\[
p_R = \sum_{v^{*}} \Big(\Pr[v^{*}] \cdot b(v^{*}, R)\Big).
\]

Now consider that we fix v* and vary R. Define

\[
p_{v^{*}} = \sum_{R} \Big(\Pr[R] \cdot b(v^{*}, R)\Big).
\]

By Lemma 2, p_{v*} equals the probability that a randomly selected node can be activated in an influence propagation process when {v*} is used as the seed set. Therefore, E[p_{v*}] = E[I({v*})]/n. This leads to

\[
\begin{aligned}
EPT/m = \mathrm{E}[p_R]
&= \sum_{R}\Big(\Pr[R]\cdot p_R\Big)
 = \sum_{R}\Big(\Pr[R]\cdot \sum_{v^{*}}\big(\Pr[v^{*}]\cdot b(v^{*},R)\big)\Big) \\
&= \sum_{v^{*}}\Big(\Pr[v^{*}]\cdot \sum_{R}\big(\Pr[R]\cdot b(v^{*},R)\big)\Big)
 = \sum_{v^{*}}\Big(\Pr[v^{*}]\cdot p_{v^{*}}\Big)
 = \mathrm{E}[p_{v^{*}}] = \mathrm{E}[I(\{v^{*}\})]/n.
\end{aligned}
\]

Thus, the lemma is proved.

Proof of Lemma 5. Let S* be a node set formed by k samples from V*, with duplicates removed. Let R be a random RR set, and α_R be the probability that S* overlaps with R. Then, by Corollary 1,

\[
KPT = \mathrm{E}[I(S^{*})] = \mathrm{E}[n \cdot \alpha_R].
\]

Consider that we sample k times over a uniform distribution on the edges in G. Let E* be the set of edges sampled, with duplicates removed. Let α′_R be the probability that one of the edges in E* points to a node in R. It can be verified that α′_R = α_R. Furthermore, given that there are w(R) edges in G that point to nodes in R, α′_R = 1 − (1 − w(R)/m)^k = κ(R). Therefore,

\[
KPT = \mathrm{E}[n \cdot \alpha_R] = \mathrm{E}[n \cdot \alpha'_R] = \mathrm{E}\big[n \cdot \kappa(R)\big],
\]

which proves the lemma.
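This identity is what Algorithm 2's estimator builds on: averaging n · κ(R) over sampled RR sets yields an unbiased estimate of KPT (the algorithm then works with n/2 times the average of the κ(R) samples as its conservative lower bound, as used in the proof of Theorem 2 below). A minimal sketch of the unbiased estimator, under the assumption that the width w(R) — the number of edges in G pointing to nodes in R — is recorded while each RR set is generated:

#include <cmath>
#include <vector>

// Unbiased estimate of KPT from a batch of sampled RR sets, using
// kappa(R) = 1 - (1 - w(R)/m)^k and averaging n * kappa(R).
double estimate_kpt(const std::vector<long long>& widths,   // w(R) of each sampled RR set
                    long long n, long long m, int k) {
    if (widths.empty()) return 0.0;
    double sum = 0.0;
    for (long long w : widths) {
        double kappa = 1.0 - std::pow(1.0 - static_cast<double>(w) / static_cast<double>(m), k);
        sum += static_cast<double>(n) * kappa;
    }
    return sum / static_cast<double>(widths.size());
}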

Proof of Lemma 6. Let δ = (2^{−i} − µ)/µ. By the Chernoff bounds,

\[
\begin{aligned}
\Pr\Big[\frac{s_i}{c_i} > 2^{-i}\Big]
&\le \exp\Big(-\frac{\delta^{2}}{2+\delta}\cdot c_i\cdot \mu\Big)
 = \exp\Big(-\,c_i\cdot\frac{(2^{-i}-\mu)^{2}}{2^{-i}+\mu}\Big) \\
&\le \exp\big(-\,c_i\cdot 2^{-i-1}/3\big)
 = \frac{1}{n^{\ell}\cdot \log_2 n}.
\end{aligned}
\]

This completes the proof.

Proof of Lemma 7. Let δ = (µ − 2^{−i})/µ. By the Chernoff bounds,

\[
\begin{aligned}
\Pr\Big[\frac{s_i}{c_i} \le 2^{-i}\Big]
&\le \exp\Big(-\frac{\delta^{2}}{2}\cdot c_i\cdot \mu\Big)
 = \exp\Big(-\,c_i\cdot\frac{(\mu-2^{-i})^{2}}{2\mu}\Big) \\
&\le \exp\big(-\,c_i\cdot \mu/8\big)
 < n^{-\ell\cdot 2^{\,i-j-1}}\big/\log_2 n.
\end{aligned}
\]

This completes the proof.

Proof of Theorem 2. Assume that KPT/n ∈ [2^{−j}, 2^{−j+1}]. We first prove the accuracy of the KPT* returned by Algorithm 2.

By Lemma 6 and the union bound, Algorithm 2 terminates in or before the (j − 2)-th iteration with less than n^{−ℓ}(j − 2)/log₂ n probability. On the other hand, if Algorithm 2 reaches the (j + 1)-th iteration, then by Lemma 7, it terminates in the (j + 1)-th iteration with at least 1 − n^{−ℓ}/log₂ n probability. Given the union bound and the fact that Algorithm 2 has at most log₂ n − 1 iterations, Algorithm 2 should terminate in the (j − 1)-th, j-th, or (j + 1)-th iteration with a probability at least 1 − n^{−ℓ}(log₂ n − 2)/log₂ n. In that case, KPT* must be larger than (n/2) · 2^{−j−1}, which leads to KPT* > KPT/4. Furthermore, KPT* should be n/2 times the average of at least c_{j−1} i.i.d. samples from K. By the Chernoff bounds, it can be verified that

\[
\Pr\big[KPT^{*} \ge KPT\big] \;\le\; n^{-\ell}\big/\log_2 n.
\]

By the union bound, Algorithm 2 returns, with at least 1 − n^{−ℓ} probability, KPT* ∈ [KPT/4, KPT] ⊆ [KPT/4, OPT].

Next, we analyze the expected running time of Algorithm 2. Recall that the i-th iteration of the algorithm generates c_i RR sets, and each RR set takes O(EPT) expected time. Given that c_{i+1} = 2 · c_i for any i, the first j + 1 iterations generate less than 2 · c_{j+1} RR sets in total. Meanwhile, for any i′ ≥ j + 2, Lemma 7 shows that Algorithm 2 has at most n^{−ℓ·2^{i′−j−1}}/log₂ n probability to reach the i′-th iteration. Therefore, when n ≥ 2 and ℓ ≥ 1/2, the expected number of RR sets generated after the first j + 1 iterations is less than

\[
\sum_{i'=j+2}^{\log_2 n - 1}\Big(c_{i'}\cdot n^{-\ell\cdot 2^{\,i'-j-1}}\big/\log_2 n\Big) \;<\; c_{j+2}.
\]

Hence, the expected total number of RR sets generated by Algorithm 2 is less than 2c_{j+1} + c_{j+2} = 2c_{j+2}. Therefore, the expected time complexity of the algorithm is

\[
\begin{aligned}
O(c_{j+2}\cdot EPT) &= O\big(2^{j}\ell\log n\cdot EPT\big)
 = O\Big(2^{j}\ell\log n\cdot\big(1+\tfrac{m}{n}\big)\cdot KPT\Big) \\
&= O\big(2^{j}\ell\log n\cdot(m+n)\cdot 2^{-j}\big)
 = O\big(\ell(m+n)\log n\big).
\end{aligned}
\]

Finally, we show that E[1/KPT*] < 12/KPT. Observe that if Algorithm 2 terminates in the i-th iteration, it returns KPT* ≥ n · 2^{−i−1}. Let ζ_i denote the event that Algorithm 2 stops in the i-th iteration. By Lemma 7, when n ≥ 2 and ℓ ≥ 1/2, we have

\[
\begin{aligned}
\mathrm{E}[1/KPT^{*}]
&= \sum_{i=1}^{\log_2 n - 1}\Big(2^{\,i+1}/n\cdot\Pr[\zeta_i]\Big) \\
&< \sum_{i=j+2}^{\log_2 n - 1}\Big(2^{\,i+1}/n\cdot\big(n^{-\ell\cdot 2^{\,i-j-1}}\big/\log_2 n\big)\Big) + 2^{\,j+2}/n \\
&< \big(2^{\,j+3} + 2^{\,j+2}\big)/n \;\le\; 12/KPT.
\end{aligned}
\]

This completes the proof.

Proof of Lemma 8. We first analyze the expected time complexity of Algorithm 3. Observe that Lines 1-6 in Algorithm 3 run in time linear in the total size of the RR sets in R′, i.e., the set of all RR sets generated in the last iteration of Algorithm 2. Given that Algorithm 2 has an O(ℓ(m + n) log n) expected time complexity (see Theorem 2), the expected total size of the RR sets in R′ should be no more than O(ℓ(m + n) log n). Therefore, Lines 1-6 of Algorithm 3 have an expected time complexity of O(ℓ(m + n) log n).

On the other hand, the expected time complexity of Lines 7-12 of Algorithm 3 is O(E[λ′/KPT*] · EPT), since they generate λ′/KPT* random RR sets, each of which takes O(EPT) expected time. By Theorem 2, E[1/KPT*] < 12/KPT. In addition, by Equation 7, EPT ≤ (m/n) · KPT. Therefore,

\[
O\Big(\mathrm{E}\Big[\tfrac{\lambda'}{KPT^{*}}\Big]\cdot EPT\Big)
= O\Big(\tfrac{\lambda'}{KPT}\cdot EPT\Big)
= O\Big(\tfrac{\lambda'}{KPT}\cdot\big(1+\tfrac{m}{n}\big)\cdot KPT\Big)
= O\big(\ell(m+n)\log n/(\varepsilon')^{2}\big).
\]

Therefore, the expected time complexity of Algorithm 3 is O(ℓ(m + n) log n/(ε′)²).

Next, we prove that Algorithm 3 returns KPT+ ∈ [KPT*, OPT] with a high probability. First, observe that KPT+ ≥ KPT* trivially holds, as Algorithm 3 sets KPT+ = max{KPT′, KPT*}, where KPT′ is derived in Line 11 of Algorithm 3. To show that KPT+ ∈ [KPT*, OPT], it suffices to prove that KPT′ ≤ OPT.

By Line 11 of Algorithm 3, KPT′ = f · n/(1 + ε′), where f is the fraction of RR sets in R′′ that is covered by S′_k, while R′′ is a set of θ′ random RR sets, and S′_k is a size-k node set generated from Lines 1-6 in Algorithm 3. Therefore, KPT′ ≤ OPT if and only if f · n ≤ (1 + ε′) · OPT.

Let ρ′ be the probability that a random RR set is covered by S′_k. By Corollary 1, ρ′ = E[I(S′_k)]/n. In addition, f · θ′ can be regarded as the sum of θ′ i.i.d. Bernoulli variables with a mean ρ′. Therefore, we have

\[
\begin{aligned}
\Pr\big[f\cdot n > (1+\varepsilon')\cdot OPT\big]
&\le \Pr\big[\,n\cdot f - \mathrm{E}[I(S'_k)] > \varepsilon'\cdot OPT\,\big] \\
&= \Pr\Big[\theta'\cdot f - \theta'\cdot\rho' > \tfrac{\theta'}{n}\cdot\varepsilon'\cdot OPT\Big] \\
&= \Pr\Big[\theta'\cdot f - \theta'\cdot\rho' > \tfrac{\varepsilon'\cdot OPT}{n\cdot\rho'}\cdot\theta'\cdot\rho'\Big].
\end{aligned}
\tag{12}
\]

Let δ = ε′ · OPT/(nρ′). By the Chernoff bounds, we have

\[
\begin{aligned}
\text{r.h.s.\ of Eqn.~12}
&\le \exp\Big(-\frac{\delta^{2}}{2+\delta}\cdot\rho'\theta'\Big)
 = \exp\Big(-\frac{(\varepsilon')^{2}\cdot OPT^{2}}{2n^{2}\rho' + \varepsilon' n\cdot OPT}\cdot\theta'\Big) \\
&\le \exp\Big(-\frac{(\varepsilon')^{2}\cdot OPT^{2}}{2n\cdot OPT + \varepsilon' n\cdot OPT}\cdot\theta'\Big)
 = \exp\Big(-\frac{(\varepsilon')^{2}\cdot OPT}{(2+\varepsilon')\cdot n}\cdot\frac{\lambda'}{KPT^{*}}\Big) \\
&\le \exp\Big(-\frac{(\varepsilon')^{2}\cdot\lambda'}{(2+\varepsilon')\cdot n}\Big)
 \le \frac{1}{n^{\ell}}.
\end{aligned}
\]

Therefore, KPT′ = f · n/(1 + ε′) ≤ OPT holds with at least 1 − n^{−ℓ} probability. This completes the proof.

Proof of Lemma 9. Let g be a graph constructed from G by first sampling a node set T for each node v from its triggering distribution T(v), and then removing any outgoing edge of v that does not point to a node in T. Then, ρ2 equals the probability that v is reachable from S in g. Meanwhile, by the definition of RR sets under the triggering model, ρ1 equals the probability that g contains a directed path that ends at v and starts at a node in S. It follows that ρ1 = ρ2.

Proof of Lemma 10. Let S be any node set that contains no more than k nodes in G, and ξ(S) be an estimation of E[I(S)] using r Monte Carlo steps. We first prove that, if r satisfies Equation 10, then ξ(S) will be close to E[I(S)] with a high probability.

Let µ = E[I(S)]/n and δ = εOPT/(2knµ). By the Chernoff bounds, we have

\[
\begin{aligned}
\Pr\Big[\,\big|\xi(S) - \mathrm{E}[I(S)]\big| > \tfrac{\varepsilon}{2k}\cdot OPT\Big]
&= \Pr\Big[\,\Big|\tfrac{r\cdot \xi(S)}{n} - \tfrac{r\cdot \mathrm{E}[I(S)]}{n}\Big| > \tfrac{\varepsilon}{2kn}\cdot r\cdot OPT\Big] \\
&= \Pr\Big[\,\Big|\tfrac{r\cdot \xi(S)}{n} - \tfrac{r\cdot \mathrm{E}[I(S)]}{n}\Big| > \delta\cdot r\cdot\mu\Big] \\
&< 2\exp\Big(-\frac{\delta^{2}}{2+\delta}\cdot r\cdot\mu\Big)
 = 2\exp\Big(-\frac{\varepsilon^{2}\cdot OPT^{2}}{8k^{2}n^{2}\mu + 2k\varepsilon n\cdot OPT}\cdot r\Big) \\
&\le 2\exp\Big(-\frac{\varepsilon^{2}\cdot OPT}{(8k^{2} + 2k\varepsilon)\cdot n}\cdot r\Big)
 \le \frac{1}{k\cdot n^{\ell+1}},
\end{aligned}
\tag{13}
\]

where the last inequality follows from the choice of r in Equation 10.

Observe that, given G and k, Greedy runs in k iterations, each of which estimates the expected spreads of at most n node sets with sizes no more than k. Therefore, the total number of node sets inspected by Greedy is at most kn. By Equation 13 and the union bound, with at least 1 − n^{−ℓ} probability, we have

\[
\big|\xi(S') - \mathrm{E}[I(S')]\big| \;\le\; \tfrac{\varepsilon}{2k}\cdot OPT
\tag{14}
\]

for all those kn node sets S′ simultaneously. In what follows, we analyze the accuracy of Greedy's output, under the assumption that for any node set S′ considered by Greedy, it obtains a sample of ξ(S′) that satisfies Equation 14. For convenience, we abuse notation and use ξ(S′) to denote the aforementioned sample.

Let S_0 = ∅, and S_i (i ∈ [1, k]) be the node set selected by Greedy in the i-th iteration. We define x_i = OPT − I(S_i), and y_i(v) = I(S_{i−1} ∪ {v}) − I(S_{i−1}) for any node v. Let v_i be the node that maximizes y_i(v). Then, y_i(v_i) ≥ x_{i−1}/k must hold; otherwise, for any size-k node set S, we have

\[
\begin{aligned}
I(S) &\le I(S_{i-1}) + I(S \setminus S_{i-1}) \\
&\le I(S_{i-1}) + k\cdot y_i(v_i)
 \;<\; I(S_{i-1}) + x_{i-1} \;=\; OPT,
\end{aligned}
\]

which contradicts the definition of OPT. Recall that, in each iteration of Greedy, it adds into S_{i−1} the node v that leads to the largest ξ(S_{i−1} ∪ {v}). Therefore,

\[
\xi(S_i) - \xi(S_{i-1}) \;\ge\; \xi(S_{i-1}\cup\{v_i\}) - \xi(S_{i-1}).
\tag{15}
\]

Combining Equations 14 and 15, we have

\[
\begin{aligned}
x_{i-1} - x_i &= I(S_i) - I(S_{i-1}) \\
&\ge \xi(S_i) - \tfrac{\varepsilon}{2k}\cdot OPT - \xi(S_{i-1}) + \big(\xi(S_{i-1}) - I(S_{i-1})\big) \\
&\ge \xi(S_{i-1}\cup\{v_i\}) - \xi(S_{i-1}) - \tfrac{\varepsilon}{2k}\cdot OPT + \big(\xi(S_{i-1}) - I(S_{i-1})\big) \\
&\ge I(S_{i-1}\cup\{v_i\}) - I(S_{i-1}) - \tfrac{\varepsilon}{k}\cdot OPT \\
&\ge \tfrac{1}{k}\,x_{i-1} - \tfrac{\varepsilon}{k}\cdot OPT.
\end{aligned}
\tag{16}
\]

Equation 16 leads to

\[
\begin{aligned}
x_k &\le \Big(1-\tfrac{1}{k}\Big)\cdot x_{k-1} + \tfrac{\varepsilon}{k}\cdot OPT \\
&\le \Big(1-\tfrac{1}{k}\Big)^{2}\cdot x_{k-2} + \Big(1+\Big(1-\tfrac{1}{k}\Big)\Big)\cdot\tfrac{\varepsilon}{k}\cdot OPT \\
&\le \Big(1-\tfrac{1}{k}\Big)^{k}\cdot x_{0} + \sum_{i=0}^{k-1}\Big(\Big(1-\tfrac{1}{k}\Big)^{i}\cdot\tfrac{\varepsilon}{k}\cdot OPT\Big) \\
&= \Big(1-\tfrac{1}{k}\Big)^{k}\cdot OPT + \Big(1-\Big(1-\tfrac{1}{k}\Big)^{k}\Big)\cdot\varepsilon\cdot OPT \\
&\le \tfrac{1}{e}\cdot OPT + \Big(1-\tfrac{1}{e}\Big)\cdot\varepsilon\cdot OPT.
\end{aligned}
\]

Therefore,

\[
I(S_k) = OPT - x_k \;\ge\; (1-1/e)\cdot(1-\varepsilon)\cdot OPT \;\ge\; (1-1/e-\varepsilon)\cdot OPT.
\]

Thus, the lemma is proved.