Why approximate when you can get the exact? Optimal Targeted Viral Marketing at Scale

Xiang Li, J. David Smith (CISE Department, University of Florida, Gainesville, FL 32611; Email: {xixiang, jdsmith}@cise.ufl.edu), Thang N. Dinh (CS Department, Virginia Commonwealth University, Richmond, VA 23284; Email: [email protected]), My T. Thai (CISE Department, University of Florida, Gainesville, FL 32611; Email: mythai@ufl.edu)

Abstract—One of the most central problems in viral marketing is Influence Maximization (IM), which finds a set of k seed users who can influence the maximum number of users in online social networks. Unfortunately, all existing algorithms for IM, including the state-of-the-art SSA and IMM, have an approximation ratio of (1 − 1/e − ε). Recently, a generalization of IM, Cost-aware Targeted Viral Marketing (CTVM), which asks for the most cost-effective users to influence the most relevant users, has been introduced. The current best algorithm for CTVM has an approximation ratio of (1 − 1/√e − ε).

In this paper, we study the CTVM problem, aiming to solve it optimally. We first highlight that using traditional two-stage stochastic programming to solve CTVM exactly is not possible because of scalability. We then propose an almost exact algorithm, TIPTOP, which has an approximation ratio of (1 − ε). This result significantly improves the current best solutions to both IM and CTVM. At the heart of TIPTOP lies an innovative technique that reduces the number of samples as much as possible, allowing us to solve CTVM exactly on a much smaller space of generated samples using Integer Programming. While obtaining an almost exact solution, TIPTOP is very scalable, running on billion-scale networks such as Twitter in under three hours. Furthermore, TIPTOP lends researchers a tool to benchmark their solutions against the optimal one in large-scale networks, which is currently not available.

Keywords—Viral Marketing; Influence Maximization; Algorithms; Online Social Networks; Optimization

I. INTRODUCTION

With billions of active users, Online Social Networks (OSNs) have become one of the most effective platforms for marketing and advertising. Through "word-of-mouth" exchanges, so-called viral marketing, influence and product adoption can spread from a few key users to billions of users in the network. To identify those key users, a great amount of work has been devoted to the Influence Maximization (IM) problem, which asks for a set of k seed users that maximizes the expected number of influenced nodes. Taking into account both an arbitrary cost for selecting a node and an arbitrary benefit for influencing a node, a generalized problem, Cost-aware Targeted Viral Marketing (CTVM), has been introduced recently [1]. Given a budget B, the goal is to find a seed set S with total cost at most B that maximizes the expected total benefit over the influenced nodes.

Solutions to both IM and CTVM either suffer from scalability issues or offer limited performance guarantees. Since IM is NP-hard, Kempe et al. [2] introduced the first (1 − 1/e − ε)-approximation algorithm for any ε > 0; however, it is not scalable. This work has motivated a vast amount of work on IM in the past decade [3]–[10] (and references therein), focusing on solving the scalability issue. The most recent noticeable results are the IMM [5] and SSA [6] algorithms, both of which can run on large-scale networks with the same performance guarantee of (1 − 1/e − ε). Meanwhile, BCT was introduced to solve CTVM with a ratio of (1 − 1/√e − ε) [1].

Despite the great amount of aforementioned work, none of these attempts to solve IM or CTVM exactly. The lack of such a solution makes it impossible to evaluate the practical performance of existing algorithms on real-world datasets. That is: how far from optimal are the solutions given by existing algorithms on billion-scale OSNs, beyond their theoretical performance guarantees? This question has remained unanswered until now.

Unfortunately, obtaining exact solutions to CTVM (and thus to IM) is very challenging, as the running time is no longer polynomial. Due to the nature of the problem, stochastic programming (SP) is a viable approach for optimization under uncertainty when the probability distribution governing the data is given [11]. However, traditional SP-based solutions to problems in the NP-hard class only handle small networks with a few hundred nodes [11]. Thus, directly applying existing techniques to IM and CTVM is not suitable, as OSNs consist of millions of users and billions of edges. Furthermore, the theory developed to assess solution quality, such as that in [11], [12] and the references therein, only provides approximate confidence intervals.
Therefore, sufficiently large samples are needed to justify the quality assessment. This requires us to develop novel stochastic programming techniques to exactly solve CTVM on large-scale networks.

In this paper, we provide the very first (almost) exact solutions to CTVM. To tackle the above challenges, we develop two innovative techniques: 1) reduce the number of samples as much as possible so that the SP can be solved in a very short time, which requires us to tightly bound the number of samples needed to generate a candidate solution; and 2) develop a novel computational method to assess the solution quality with just enough samples, where the quality requirement is given a priori. These results break the tradition and cross the barriers in stochastic programming theory, where solving SP on large-scale networks had been thought impractical. Our contributions are summarized as follows:

• Design an (almost) exact solution to CTVM using two-stage stochastic programming (T-EXACT), which utilizes the sample average approximation method to reduce the number of realizations. Using T-EXACT, we illustrate that traditional stochastic programming techniques suffer badly from the scalability issue.

• Develop a scalable and exact algorithm for CTVM with an approximation ratio of (1 − ε): the Tiny Integer Program with Theoretically OPtimal results (TIPTOP). Being able to obtain a near-exact solution, TIPTOP is used as a benchmark to evaluate the absolute performance of existing solutions on billion-scale OSNs.

• Conduct extensive experiments confirming that the theoretical performance of TIPTOP is attained in practice. That said, it is feasible to compute 98%-optimal solutions to CTVM on billion-scale OSNs in under three hours. The experiments further confirm that our sampling reductions are significant in practice, with reductions by a factor of 10^3 on average.

Related works. Kempe et al. [2] were the first to formulate viral marketing as the IM optimization problem. They showed the problem to be NP-complete and devised a (1 − 1/e − ε)-approximation algorithm for two fundamental cascade models, Linear Threshold (LT) and Independent Cascade (IC). In addition, IM cannot be approximated within a factor (1 − 1/e + ε) [13] under a typical complexity assumption, and computing the exact influence is shown to be #P-hard [4]. Following [2], a series of works have been proposed, focused on improving the time complexity. A notable one is Leskovec et al. [3], in which the lazy-forward heuristic (CELF) was used to obtain about a 700-fold speedup. The major bottleneck in these works is the inefficiency of estimating the influence spread. The diffusion process has also been used to restrain propagation in social networks [14].

Recently, Borgs et al. [15] made a theoretical breakthrough and introduced a novel sampling approach, called Reverse Influence Sampling (RIS), which is the foundation for the later works. Briefly, RIS captures the influence landscape of G = (V, E, p) by generating a hypergraph H = (V, {E_1, E_2, ...}). Each hyperedge E_j ∈ H is a subset of nodes in V and is constructed as follows: 1) select a random node v ∈ V; 2) generate a sample graph g ⊑ G; and 3) return E_j as the set of nodes that can reach v in g. Observe that E_j contains the nodes that can influence its source v. If we generate multiple random hyperedges, influential nodes will likely appear more often in the hyperedges; thus a seed set S that covers most of the hyperedges will likely maximize the influence spread. Here a seed set S covers a hyperedge E_j if S ∩ E_j ≠ ∅. Therefore, IM can be solved using the following framework: 1) generate multiple random hyperedges from G; and 2) use the greedy algorithm for the Max-coverage problem [16] to find a seed set S that covers the maximum number of hyperedges, and return S as the solution. The core issue in applying the above framework is: how many hyperedges are sufficient to provide a good approximate solution?
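To make the construction concrete, the following minimal C++ sketch generates one such hyperedge (an RR set) under the IC model. Rather than materializing the whole sample graph g, it flips each incoming edge lazily during a reverse breadth-first search, which is equivalent to steps 1)–3) above; the data structures and function names are ours and are not tied to any particular implementation.

#include <vector>
#include <random>
#include <queue>

struct ReverseEdge { int src; double p; };                   // edge src -> v, stored at v
using ReverseGraph = std::vector<std::vector<ReverseEdge>>;  // in-neighbors per node

// Generate one hyperedge (RR set): the nodes that can reach a random source v
// in a live-edge sample g of G. Edges are sampled lazily during the reverse BFS.
std::vector<int> generateRRSet(const ReverseGraph& G, std::mt19937& rng) {
    int n = static_cast<int>(G.size());
    std::uniform_int_distribution<int> pickNode(0, n - 1);
    std::uniform_real_distribution<double> coin(0.0, 1.0);

    int source = pickNode(rng);                              // step 1: random source v
    std::vector<char> visited(n, 0);
    std::vector<int> rrSet;
    std::queue<int> frontier;
    visited[source] = 1;
    rrSet.push_back(source);
    frontier.push(source);

    while (!frontier.empty()) {                              // steps 2-3: reverse BFS over live edges
        int v = frontier.front();
        frontier.pop();
        for (const ReverseEdge& e : G[v]) {
            if (!visited[e.src] && coin(rng) < e.p) {        // edge (src, v) is live in g
                visited[e.src] = 1;
                rrSet.push_back(e.src);                      // src can influence the source v
                frontier.push(e.src);
            }
        }
    }
    return rrSet;
}

For CTVM, the BSA variant in [1] differs only in how the source is drawn: v is chosen with probability proportional to its benefit b(v) rather than uniformly.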
Based on RIS, Borgs et al. [15] presented an O(kℓ²(m + n) log²n / ε³)-time algorithm for IM under the IC model. It returns a (1 − 1/e − ε)-approximate solution with probability at least 1 − n^(−ℓ). In practice, the proposed algorithm is, however, less than satisfactory due to its rather large hidden constants. In a sequential work, Tang et al. [5] reduced the running time to O((k + ℓ)(m + n) log n / ε²) and showed that their algorithm is efficient on billion-scale networks. However, the number of generated samples can be arbitrarily large. The state-of-the-art SSA algorithm was introduced in [6]; it outperforms existing methods by several orders of magnitude with respect to running time. The algorithm keeps generating samples and stops at exponential check points to verify (stare) whether there is adequate statistical evidence on the solution quality for termination. However, SSA makes no effort to minimize the number of samples used at the stopping point to verify the candidate solution, which is what is needed to solve the Integer Programming (IP). All of these works have a (1 − 1/e − ε)-approximation ratio.

In generalizing IM, Nguyen and Zheng [17] investigated the BIM problem, in which each node can have an arbitrary selecting cost. They proposed a (1 − 1/√e − ε)-approximation algorithm (called BIM) and two heuristics. However, none of the proposed algorithms can handle billion-scale networks. Recently, Nguyen et al. introduced CTVM and presented a scalable (1 − 1/√e − ε) algorithm [1]. They also showed that a straightforward adaptation of the methods in [5], [15] for CTVM can incur an excessive number of samples and thus is not efficient enough for large networks.

Organization. The rest of the paper is organized as follows. In Section II, we present the network model, the propagation model, and the problem definition. Section III presents our T-EXACT algorithm for CTVM. Our main contribution, TIPTOP, is introduced in Section IV. We analyze the performance of TIPTOP in Section V. Experimental results on real social networks are shown in Section VI. Finally, Section VII concludes the paper.

II. MODELS AND PROBLEM DEFINITIONS

Let G = (V, E, c, b, p) be a social network with a node set V and a directed edge set E, with |V| = n and |E| = m. Each node u ∈ V has a selecting cost c(u) ≥ 0 and a benefit b(u) ≥ 0 if u is influenced. (The cost c(u) can be estimated proportionally to the centrality of u, i.e., how important the respective person is, e.g., the out-degree of u [17]. The benefit b(u) refers to the gain of influencing node u, e.g., 1 for each node in our targeted group and 0 outside.) Each directed edge (u, v) ∈ E is associated with an influence probability p_uv ∈ [0, 1].

Given G and a subset S ⊂ V, referred to as the seed set, the influence in the IC model cascades in G as follows. The influence propagation happens in rounds t = 1, 2, 3, .... At round 1, the nodes in the seed set S are activated and all other nodes are inactive. The cost of activating the seed set S is c(S) = Σ_{u ∈ S} c(u). At round t > 1, each newly activated node u independently activates each of its neighbors v with probability p_uv. Once a node becomes activated, it remains activated in all subsequent rounds. The propagation stops when no more nodes can be activated.

Denote by I(S) the expected number of activated nodes given the seed set S. We call I(S) the influence spread of S in G under the IC model.
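The influence spread I(S), and the expected benefit B(S) defined below in (2), can be estimated by repeatedly simulating this cascade and averaging, which is the classic Monte Carlo approach behind the early greedy algorithms [2], [3]. A minimal C++ sketch under a forward adjacency-list representation (the names are ours):

#include <vector>
#include <random>
#include <queue>

struct Edge { int dst; double p; };
using Graph = std::vector<std::vector<Edge>>;    // out-neighbors per node

// One IC cascade from seed set S; returns the total benefit of the activated
// nodes (use b(u) = 1 for all u to obtain the number of activated nodes).
double simulateCascade(const Graph& G, const std::vector<int>& S,
                       const std::vector<double>& benefit, std::mt19937& rng) {
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    std::vector<char> active(G.size(), 0);
    std::queue<int> frontier;
    double total = 0.0;
    for (int s : S) {
        if (!active[s]) { active[s] = 1; total += benefit[s]; frontier.push(s); }
    }
    while (!frontier.empty()) {
        int u = frontier.front(); frontier.pop();
        for (const Edge& e : G[u]) {                 // each newly activated node gets
            if (!active[e.dst] && coin(rng) < e.p) { // one chance per out-neighbor
                active[e.dst] = 1;
                total += benefit[e.dst];
                frontier.push(e.dst);
            }
        }
    }
    return total;
}

// Monte Carlo estimate of B(S) (or of I(S) with unit benefits).
double estimateBenefit(const Graph& G, const std::vector<int>& S,
                       const std::vector<double>& benefit, int numSamples) {
    std::mt19937 rng(12345);
    double sum = 0.0;
    for (int i = 0; i < numSamples; ++i) sum += simulateCascade(G, S, benefit, rng);
    return sum / numSamples;
}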
Mathematically, we view the probabilistic graph G as a generative model for deterministic graphs. A deterministic graph g = (V, E_s) is generated from G by selecting each edge (u, v) ∈ E independently with probability p_uv. We refer to g as a realization or a sample of G and write g ⊑ G. The probability that g is generated from G is

Pr[g] = \prod_{e \in E_s} p_e \prod_{e \in E \setminus E_s} (1 - p_e).

Let m = |E|; there are W = 2^m possible realizations of G. We number those realizations as G^1 = (V, E^1), G^2 = (V, E^2), ..., G^W = (V, E^W), where E^1, E^2, ..., E^W are all possible subsets of E. The influence spread of a seed set S equals the expected number of nodes reachable from S over all possible sample graphs, i.e.,

I(S) = \sum_{g \sqsubseteq G} Pr[g] \, |R(g,S)|,   (1)

where g ⊑ G denotes that the sample graph g is generated from G with probability Pr[g], and R(g, S) denotes the set of nodes reachable from S in g.

Similarly, the expected benefit of a seed set S is defined as the expected total benefit over all influenced nodes, i.e.,

B(S) = \sum_{g \sqsubseteq G} Pr[g] \sum_{u \in R(g,S)} b(u).   (2)

We are now ready to define the CTVM problem as follows.

Definition 1 (Cost-aware Targeted Viral Marketing - CTVM). Given a graph G = (V, E, c, b, p) and a budget BD > 0, find a seed set S ⊆ V with total cost c(S) ≤ BD that maximizes the expected benefit B(S).
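As a small worked instance of (1) and (2), included purely for illustration:

\text{Let } V = \{u, v\},\ E = \{(u,v)\},\ p_{uv} = 0.3,\ b(u) = 0,\ b(v) = 1,\ S = \{u\}.
\text{The realizations are } g_1 = (V, \{(u,v)\}) \text{ with } Pr[g_1] = 0.3 \text{ and } g_2 = (V, \emptyset) \text{ with } Pr[g_2] = 0.7.
I(S) = 0.3 \cdot |\{u,v\}| + 0.7 \cdot |\{u\}| = 1.3, \qquad B(S) = 0.3 \cdot (b(u) + b(v)) + 0.7 \cdot b(u) = 0.3.

Already for this one-edge graph the sums range over W = 2 realizations; with m edges they range over 2^m, which is why sampling is used in practice.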
III. EXACT ALGORITHM VIA STOCHASTIC PROGRAMMING

In this section, we first present our T-EXACT (traditional exact) algorithm based on two-stage SP. We then discuss the scalability issue of traditional SP, which is later verified in the experiments of Section VI.

A. Two-stage Stochastic Linear Program

Given an instance G = (V, E, c, b, p) of CTVM, we first use integer variables s_v to represent whether or not node v is selected as a seed node, i.e., s_v = 1 if node v is selected (that is, v ∈ S, where S denotes the seed set) and 0 otherwise. Here n = |V| is the number of nodes, and we assume the nodes are numbered from 1 to n. We impose on s the budget constraint \sum_{v \in V} s_v c_v \le BD. The variables s are known as first-stage variables: their values are decided before the actual realization of the uncertain parameters in G.

We associate with each node pair (u, v) a random Bernoulli variable ξ_uv satisfying Pr[ξ_uv = 1] = p_uv and Pr[ξ_uv = 0] = 1 − p_uv. For each realization of G, the values of ξ_uv are revealed to be either 0 or 1, and we can compute the benefit of the seed set S implied by s. To do so, we define the integer variable x_v to be the state of node v when the propagation stops, i.e., x_v = 1 if v is eventually activated and 0 otherwise.

The benefit obtained by the seed set can be computed by a second-stage integer program, denoted by B(s, x, ξ), as follows:

B(s, x, ξ) = \max \sum_{v \in V} b(v) x_v   (3)
\text{s.t. } \sum_{u \in IR(ξ, v)} s_u \ge x_v, \quad v \in V,   (4)
s_u \in \{0,1\}, \ x_v \in \{0,1\},   (5)

where IR(ξ, v) denotes the set of nodes u such that there exists a path from u to v under the realization ξ.

The two-stage stochastic formulation for the CTVM problem is then:

\max_{s \in \{0,1\}^n} E[B(s, x, ξ)]   (6)
\text{s.t. } \sum_{v \in V} s_v c_v \le BD,   (7)
\text{where } B(s, x, ξ) \text{ is given in (3)–(5).}   (8)

The objective is to maximize the expected benefit of the activated nodes, E[B(s, x, ξ)], where B(s, x, ξ) is the optimal value of the second-stage problem. This stochastic programming problem is, however, not yet ready to be solved with a linear algebra solver.

1) Discretization: To solve a two-stage stochastic problem, one often needs to discretize the problem into a single (very large) integer program. That is, we need to consider all possible realizations G^l ⊑ G and their probability masses Pr[G^l]. Denote by {ξ^l_ij} the adjacency matrix of the realization G^l = (V, E^l), i.e., ξ^l_ij = 1 if (i, j) ∈ E^l and 0 otherwise. Since the objective involves only the expected value of the second-stage variables x^l_v, the two-stage stochastic program can be discretized into a mixed integer program, denoted by MIP, as follows:

\max \sum_{l=1}^{W} Pr[G^l] \sum_{v} b(v) x^l_v   (9)
\text{s.t. } \sum_{v \in V} s_v c_v \le BD   (10)
\sum_{u \in IR(G^l, v)} s_u \ge x^l_v, \quad v \in V, \ G^l \sqsubseteq G   (11)
s_u \in \{0,1\}, \ x^l_v \in \{0,1\}   (12)

Lemma 1. The expected number of non-zeros in T-EXACT is T \sum_{u \in V} I(u), where T is the number of graph realizations and I(u) denotes the expected influence of node u.

Proof: For each node u ∈ V, the number of constraints (11) in which u appears on the left-hand side equals the number of nodes that u can reach. The expected number of constraints (11) in which u participates is hence I(u). Summing over all u ∈ V, the expected size of T-EXACT with T graph realizations is T × \sum_{u \in V} I(u).

2) Realization Reduction: An approach to reduce the number of realizations in T-EXACT is to apply the Sample Average Approximation (SAA) method. We independently generate T samples ξ^1, ξ^2, ..., ξ^T using Monte Carlo simulation (i.e., we generate each edge (u, v) ∈ E with probability p_uv). The expectation objective q(s) = E[B(s, x, ξ)] is then approximated by the sample average \hat{q}_T(x) = \frac{1}{T} \sum_{l=1}^{T} \sum_{v} b(v) x^l_v, and the new formulation is

\max \frac{1}{T} \sum_{l=1}^{T} \sum_{v} b(v) x^l_v \quad \text{s.t. Constraints (10)–(12).}

Under some regularity conditions, \hat{q}_T(x) converges pointwise with probability 1 to E[B(s, x, ξ)] as T → ∞. Moreover, an optimal solution of the sample average approximation provides an optimal solution of the stochastic program with probability approaching one exponentially fast in T. Formally, denote by s^* and \hat{s} the optimal solutions of the stochastic program and of the sample average approximation, respectively. For any ε > 0, it can be derived from Proposition 2.2 in [18] that

Pr\big[ E[B(s, \hat{x}, ξ)] - E[B(s, x^*, ξ)] > ε \big] \le \exp\Big( -T \frac{ε^2}{n^4} + k \log n \Big).   (13)

Equivalently, if T \ge \frac{n^4}{ε^2} (k \log n - \log α), then Pr[ E[B(s, \hat{x}, ξ)] - E[B(s, x^*, ξ)] < ε ] > 1 - α for any α ∈ (0, 1).
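This sample-size condition follows by requiring the right-hand side of (13) to be at most α:

\exp\Big( -T \frac{ε^2}{n^4} + k \log n \Big) \le α
\iff -T \frac{ε^2}{n^4} + k \log n \le \log α
\iff T \ge \frac{n^4}{ε^2} \big( k \log n - \log α \big).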
B. Scalability Issues of Traditional Stochastic Optimization

While T-EXACT is based on a standard method for stochastic programming, traditional methods can only be applied to small networks of up to a few hundred nodes [11]. There are three major scalability issues when applying SAA and T-EXACT to the influence maximization problem. First, the samples have a large size O(m), where m = |E| is the number of edges in the network. For large networks, m can be in the millions or billions; as a consequence, we can only afford a small number of samples, sacrificing solution quality, and for billion-scale networks even one sample leads to an extremely large ILP that exceeds the capability of the best solvers. Second, the theory developed to assess solution quality, such as that in [11], [12], only provides approximate confidence intervals: the quality assessment is justified only for sufficiently large samples and may not hold for small sample sizes. Third, most existing solution-quality assessment methods [11], [12] only provide the assessment for a given sample size. Thus, if the quality requirement is given a priori, e.g., an (ε, δ)-approximation, there is no efficient algorithmic framework to identify the number of necessary samples.

Algorithm 1 TIPTOP Algorithm
Input: Graph G = (V, E, b, c, w), budget k > 0, and ε, δ ∈ (0, 1).
Output: Seed set S_k.
1: Λ ← (1 + ε)(2 + (2/3)ε) (1/ε^2) ln(2/δ)
2: t ← 1; t_max ← ⌈2 ln n / ε⌉; v_max ← 6
3: Λ_max ← (1 + ε)(2 + (2/3)ε) (1/ε^2) ( ln(2/(δ/4)) + ln C(n, k) )
4: Generate random RR sets R_1, R_2, ... using BSA [1].
5: repeat
6:   N_t ← Λ × e^{εt}; R_t ← {R_1, R_2, ..., R_{N_t}}
7:   Ŝ_k^t ← ILP_MC(R_t, c, k)
8:   ⟨passed, ε_1, ε_2⟩ ← Verify(Ŝ_k, v_max, 2v_max N_t)
9:   if (not passed) and (Cov_R(Ŝ_k) ≤ Λ_max) then t ← IncreaseSamples(t, ε_1, ε_2)
10: until passed or Cov_R(Ŝ_k) > Λ_max
11: return Ŝ_k

IV. TIPTOP: (ALMOST) EXACT SOLUTION IN LARGE-SCALE NETWORKS

In this section, we introduce our main contribution, TIPTOP: a near-exact algorithm (not an approximation algorithm) for CTVM. TIPTOP is the first exact algorithm that returns a (1 − ε)-approximation ratio, overcomes the above-mentioned scalability issues, and can run on billion-scale networks.

Solving the Max-Coverage problem exactly rather than greedily involves a trade-off between the approximation factor of the algorithm used to solve the Max-Coverage problem and the number of samples, which decides the quality of approximation for the objective function. On one hand, the better approximation guarantee provided by the ILP implies that our approach can obtain a given guarantee ρ with fewer samples; on the other hand, the greedy algorithm is much faster than the ILP and can thus handle more samples in the same amount of time.

The adoption of the ILP also poses new challenges. Solving ILPs is now the bottleneck in terms of both time and memory, dominating the verifying counterpart (the Verify procedure). Thus it is critical that our approach minimizes the number of RR sets used in the ILP to find the candidate solution Ŝ_k. This is different from the previous approaches SSA and D-SSA, which attempt to minimize the total number of RR sets in both the searching and the verifying.

A. TIPTOP Algorithm Overview

For readability, we present our solution to CTVM in which all nodes have a uniform cost.
Let BD = k, we want to find Moreover, both the approaches in SSA and D-SSA have S with |S| ≤ k so as to maximize the benefit B(S). The designlimitationsthatmakethemunsuitableforcouplingwith generalizationtothemoregeneralsettingsofCTVMisstraight the ILP. SSA [6] has fixed parameter settings for searching forward and omitted due to space limit. and verifying and it is difficult to find a good settings for At a high level, TIPTOP first generates a collection R of different input ranges. D-SSA has a more flexible approach. random RR sets which serves as the searching space to find However, it is constrained to use the same amount of samples a candidate solution Sˆ to CTVM. It next calls the Verify for searching and verifying. Thus, it is also not good for k procedure which independently generates another collection unbalanced searching/verifying. of random RR sets to closely estimate the objective function’s This enforces us to take a new approach in balancing the valueofthecandidatesolution.Ifthisvalueisnotcloseenough samples in searching and verifying. As shown in line 1 of to the optimal solution, TIPTOP generates more samples by Alg. 1, we start with a smaller set of samples R of size calling the IncreaseSamples procedure to enlarge the search Λ. Also, we use IncreaseSamples to dynamically adjust the space,andthusfindinganotherbettercandidatesolution.When samplesize,ensurejustenoughsamplestofindanear-optimal theobjectivefunction’svalueofthecandidatesolutionisclose candidate solution to CTVM. We will discuss more details enoughtotheoptimalone,TIPTOPhaltsandreturnsthefound about IncreaseSamples later in subsection IV-C. solution. The pseudo-code of TIPTOP is presented in Alg. 1. Secondly, in an effort to keep the ILP size small, we As shown in Theorem 2, TIPTOP has an approximation need to avoid executing IncreaseSamples as much as possible of (1 − (cid:15)) with a probability of at least (1 − δ). The key and allow IncreaseSamples to add a large batch of samples, point t√o improve the current best ratio of (1−1/e−(cid:15)) (and if it saw the candidate solution is still far away from the (1−1/ e−(cid:15)))to(1−(cid:15))liesinsolvingtheMaximumCoverage desired. Additionally, we allocate more resource into proving (MC)problemaftergeneratingRrandomRRsetsofsamples. the quality of candidate solutions, which is handled by Verify, Instead of using greedy technique as in all existing algorithms described in subsection IV-C. intheliterature,wesolvetheMCexactlyusingIntegerLinear And finally, as CTVM considers arbitrary benefits, we Program (ILP) (detailed in subsection IV-B). utilize Benefit Sampling Algorithm – BSA in [1] to embed Trade-off:“Betterapproximationguarantee”vs.“More the benefit of each node into consideration. BSA performs a samples”. The adoption of the ILP in the place of the greedy reversed influence sampling (RIS) in which the probability of method results in a trade-off between “better approximation anodechosenasthesourceisproportionaltoitsbenefit.Ran- guarantee” and “More samples”. Given a desired approxima- dom hyperedges generated via BSA can capture the “benefit tion guarantee, say ρ = 1 − 1/e − (cid:15), the two factors that landscape”. That is they can be used to estimate the benefit of influence ρ are 1) the approximation factor by the algorithm any seed set S as stated in the following lemma. Lemma 2. 
[1] Given a fixed seed set S ⊆ V, and let Pr[(1−(cid:15)(cid:48))B(Sˆ )≤B (Sˆ )≤(1+(cid:15)(cid:48))B(Sˆ )]≥1−δ(cid:48) 2 k Rver k 2 k 2 R ,R ,...,R ,...berandomRRsetsgeneratedusingBenefit 1 2 j The value of δ(cid:48) is selected as in line 3 so that the probability Sampling Algorithm [1], define random variables 2 of the union of all the bad events is bounded by δ . 2 (cid:26) 1 if R ∩S (cid:54)=∅, If the stopping rule algorithm stops within T RR sets, Z = j (14) cap j 0 otherwise. the algorithm evaluates the relative difference (cid:15)1 between the estimations of B(Sˆ ) via R and R . It also estimates the k t ver then relative gap (cid:15) between B(S∗) and its estimation using R . If 3 k t B(S) the combined gap (1−(cid:15) )(1−(cid:15) )(1−(cid:15) ) < (1−(cid:15)), Verify E[Z ]=Pr[R ∩S (cid:54)=∅]= (15) 1 2 3 j j Γ returns ‘true’ and goes back to TIPTOP. In turn, TIPTOP will where Γ=(cid:80) b(v) is the total nodes’ benefit. return Sˆk as the solution and terminate. v∈V If Sˆk does not pass the check in Verify, TIPTOP uses For a collection of T random RR sets R = the sub-procedure IncreaseSamples (Alg. 3) to increase the {R1,R2,...,RT}, we denote CovR(S) = (cid:80)Tj=1Zj as the size of the RR sets. Having more samples will likely lead numberofRRsetsthatintersectS,andBR(S)= CovTR(S)×Γ to better candidate solution Sˆk, however, also increase the IP astheestimationofB(S)viaR.WediscusstherestofTIPTOP solving time. Instead of doubling the current set R as SSA in the following subsections. does, we carefully use the information in the values of (cid:15) and 1 (cid:15) from the previous round to determine the increase in the B. Integer Linear Programming ILP 2 MC samplesizes.Recallthatthesamplesizeiset(cid:15)Λforincreasing Given a collection of RR sets R, we formulate the follow- integert.Thus,weincreasethesamplesizeviaincreasingtby ing ILP (R,c,BD), to find the optimal solution over the generateMdCRR sets to the MC problem, assuming uniform cost (approximately)loge(cid:15) (cid:15)(cid:15)212.Weforcettoincreasebyatleastone for node selection. and at most ∆t = (cid:100)2/(cid:15)(cid:101). That is the number of samples max (cid:88) will increase by a multiplicative factor between e(cid:15) ≈ (1+(cid:15)) maximize (1−yj) (16) and e∆tmax ≈e2. Rj∈R (cid:88) Algorithm 2 Verify subject to s ≤k (17) v v∈V Input: Candidate solution Sˆk, vmax, and Tcap. (cid:88) Output: Passed/not passed, (cid:15) , and (cid:15) . s +y ≥1 ∀R ∈R (18) 1 2 v j j 1: R ←∅,δ = δ,cov=0,(cid:15) =(cid:15) =∞ v∈Rj 2: forvier←0 to v2 4−1 do 1 2 max yj,si ∈{0,1} (19) 3: (cid:15) =min{(cid:15),1}/2i,(cid:15)(cid:48) = (cid:15)2 ;δ(cid:48) =δ /(v ×t ) Here, sv = 1 iff node v is selected into the seed set, and 4: Λ22 =1+(2+2/3(cid:15)(cid:48)22)(1+1−(cid:15)(cid:15)(cid:48)22)ln2δ2(cid:48) ((cid:15)2(cid:48)1)2 max max s = 0, otherwise. The variable y = 1 indicates that the RR 5: while cov<Λ do 2 2 v j 2 sets Rj cannot be covered by the seed set (S = {v|sv = 1}) 6: Generate Rj with BSA [1] and add it to Rver and yj = 0, otherwise. The objective aims to cover as many 8: if Rj∩S (cid:54)=∅ then cov=cov+1 RRsetsaspossiblewhilekeepingthecostatmostk usingthe 9: if |Rver|>Tcap then return <false,(cid:15)1,2(cid:15)2 > constraint (17). 
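Since our implementation solves this program with CPLEX (Section VI), a minimal sketch of how ILP_MC can be assembled with the CPLEX Concert C++ API is given below. It is an illustration under our own data layout (RR sets as lists of node ids), assuming the CPLEX headers are available, and is not the actual production code.

#include <ilcplex/ilocplex.h>
#include <vector>

// Solve ILP_MC over the RR sets: maximize the number of covered RR sets
// subject to selecting at most k seed nodes (uniform cost), cf. (16)-(19).
std::vector<int> solveIlpMC(const std::vector<std::vector<int>>& rrSets, int n, int k) {
    IloEnv env;
    std::vector<int> seeds;
    try {
        IloModel model(env);
        IloBoolVarArray s(env, n);                      // s[v] = 1 iff v is a seed
        IloBoolVarArray y(env, rrSets.size());          // y[j] = 1 iff R_j is NOT covered

        IloExpr objective(env), budget(env);
        for (std::size_t j = 0; j < rrSets.size(); ++j)
            objective += 1 - y[j];                      // (16): number of covered RR sets
        model.add(IloMaximize(env, objective));

        for (int v = 0; v < n; ++v) budget += s[v];
        model.add(budget <= k);                         // (17): cardinality budget

        for (std::size_t j = 0; j < rrSets.size(); ++j) {
            IloExpr cover(env);
            for (int v : rrSets[j]) cover += s[v];
            model.add(cover + y[j] >= 1);               // (18): cover R_j or set y_j = 1
        }

        IloCplex cplex(model);
        cplex.solve();
        for (int v = 0; v < n; ++v)
            if (cplex.getValue(s[v]) > 0.5) seeds.push_back(v);
    } catch (IloException&) { /* handle solver errors */ }
    env.end();
    return seeds;
}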
10: end while 11: B (Sˆ )←Γ cov , (cid:15) ←1− Bver(Sˆk) We note that the benefit in selecting the node b(u) does ver k |Rver| 1 BR(Sˆk) 12: if ((cid:15) >(cid:15)) then return <false,(cid:15) ,(cid:15) > not appear in the above ILP as it is embedded in the Benefit 1(cid:114) 1 2 Sampling Algorithm (BSA) in [1]. 13: (cid:15) ← 3ln(tmax/δ1) . On one hand, the above ILP can be seen as an SAA of 3 (1−(cid:15)1)(1−(cid:15)2)CovRt(Sˆk) 14: if (1−(cid:15) )(1−(cid:15) )(1−(cid:15) )>(1−(cid:15)) then 1 2 3 the CTVM problem as it attempts to find optimal solutions return <true,(cid:15) ,(cid:15) > 1 2 over the randomly generated samples. On the other hand, it is 15: end for different from the traditional SAA discussed in Sec. III as it 16: return <false,(cid:15) ,(cid:15) > 1 2 doesnotrequiretherealizationofallrandomvariables,i.e.,the status of the edges. Instead, only a local portion of the graph Algorithm 3 IncreaseSamples surrounding the sources of RR sets need to be revealed. This Input: t,(cid:15) ,(cid:15) . critical difference from traditional SAA significantly reduces 1 2 Action: Return iteration t. the size of each sample, effectively, resulting in a much more 1: ∆t =(cid:100)2/(cid:15)(cid:101) compact ILP. 2: retumrnaxt+min{max{(cid:100)1/(cid:15)ln(cid:15)21(cid:101),1},∆t } (cid:15)2 max C. The Verify and IncreaseSamples Procedures As shown in Alg. 2, Verify takes a candidate solution V. OPTIMALITYOFTIPTOP Sˆk, precision limit vmax, and the maximum number of RR In this section, we prove that TIPTOP returns a solution sets Tcap as the input. It keeps generate RR sets to estimate Sˆ that is optimal up to a multiplicative error 1−(cid:15) with high B(Sˆk) until either the relative error reaches (cid:15)/2vmax−1 or the probability. Fig. 1 shows the proof map of our main Theorem. maximum number of generated samples Tcap is reached. Let R1,R2,R3,...,Rj,... be the random RR sets gener- Verify uses the stopping rule algorithm in [6] to estimate ated in TIPTOP. Given a seed set S, define random variables the influence. It generates a new pool of random RR sets, Z as in (14) and Y = Z − E[Z ]. Then Y satisfies the j j j j j denoted by R . For each pair (cid:15)(cid:48),δ(cid:48), derived from (cid:15) and conditions of a martingale [19], i.e., E[Y |Y ,Y ,...,Y ]= ver 2 2 2 i 1 2 i−1 δ , the stopping rule algorithm will stop when either the cap Y and E[Y ]<+∞. This martingale view is adopted from 2 i−1 i T is reached or there is enough evidence (i.e. cov ≥ Λ ) [5] to cope with the fact that random RR sets might not be max 2 to conclude that independent due to the stopping condition: the later RR sets are generated only when the previous ones do not satisfy the TIPTOP is (cid:18)(cid:18) (cid:18) (cid:19) (cid:19) (cid:19) stopping conditions. We obtain the same results in Corollaries n 2 n Ω ln +ln 1 and 2 in [5]. k δ (cid:15)2 Lemma 3 ( [5]). Given a set of nodes S and random RR sets Proof: The expected number of RR sets is Ω((1 + RZ =as {inRj(1}4)g.enLeertaµted =in BT(SIP)TaOnPd, µdˆefin=e r1an(cid:80)doTm Zvaribaeblaens (cid:15))θ((cid:15),δ)). 
Moreover, the expected size of each RR sets is j Z Γ Z T i=1 i upper-boundedby OPTk (Lemma4andEq.(7)in[20]).Thus, estimationofµ ,forfixedT >0.Forany0≤(cid:15),thefollowing k Z the size of the ILP which is equal the total sizes of all the RR inequalities hold sets plus n, the size of the cardinality constraint, is at most −Tµ(cid:15)2 Ω((1+(cid:15))θ((cid:15),δ)OPT )=Ω(cid:0)(cid:0)ln(cid:0)n(cid:1)+ln2(cid:1) n(cid:1) Pr[µˆ ≥(1+(cid:15))µ]≤e 2+23(cid:15) , and (20) k k δ (cid:15)2 In practice, the ILP size is much smaller than the worst- −Tµ(cid:15)2 Pr[µˆ ≥(1+(cid:15))µ]≤e 3 , and (21) case bound in the above theorem, as shown in Section VI. Pr[µˆ ≤(1−(cid:15))µ]≤e−T2µ(cid:15)2. (22) Lemma4. LetSk∗beaseedsetofsizekwithmaximumbenefit, 𝔹(𝑆 𝑘) ≥ 1−𝜖2 × 𝔹𝑣𝑒𝑟(𝑆 𝑘) Legend i(cid:15).∗e.=, B(cid:113)(S3k∗l)n(=tmaOx/Pδ1T)k..WDeehnaovtee:by µ∗k = OPΓTk,δ1 = δ/4 and 1 Prob. 1−𝛿2 2 𝑋 ≥𝛼× 𝑌 t Ntµ∗t .borP1 ≥1− ≥𝔹1ℛ−(𝑆 𝜖𝑘1)× 𝑖 Pro≡b. 𝑝 𝑗 (cid:15)∗t aPnPrd[rBNoRotft:s(aSAmk∗pp)pll≥eys.L(1Feom−r.(cid:15)e3∗ta)c(BhE(qtS.∈k∗2)1[1)∀.t.ftom=ras1xe.]e.,dtamfsteaetxr]Ss≥ok∗m,1me−eaalδgn1eµbr∗ka, −𝛿 𝜖× 3 𝐏𝐫𝑿≥𝜶𝒀 ≥𝒑 we obtainP: r[BRt(Sk∗)<(1−(cid:15)t)B(Sk∗)]<δ1/tmax. ≥1 × if not specified 𝔹 𝑆𝑘∗ =𝑂𝑃𝑇𝑘 × 1−𝜖3 ≤ 𝔹ℛ(𝑆𝑘∗) then 𝑝=1 Takingtheunionboundoverallt∈[1,tmax]yieldstheproof. 5 Prob. 1−𝛿1 4 Lemma 5. [ 1 → 2 ] The probability of the bad event that there exists some set of t ∈ [1,t ],i ∈ [0,v −1], and max max (cid:15) =(cid:15)/2i so that Verify returns some bad estimation of Sˆ at 2 k Fig. 1: Proof map of the main Theorem 2. line 10, i.e., 1 A common framework in [5], [6], [20] is to generate BRver(Sˆk)> (1−(cid:15) )B(Sˆk), random RR sets and use the greedy algorithm to select k seed 2 ninodLeesmtmhaat 3co[v2e0r]mthoastt(o1f−th1e/ege−ne(cid:15)r)ataepdprRoRximseattsi.onItailsgoshriothwmn is less than δ2. Here Bver(Sˆk) = Γ × Cov|RRvveerr(|Sˆk) be an withprobability1−δ isobtainedwhenthenumberofRRsets estimation of B(Sˆk) using random RR sets in Rver. reach a threshold Proof: After reaching line 10 in Verify, thegenerated RR (cid:18) (cid:18) (cid:19) (cid:19) n 2 n 1 sets within Verify, denoted by R , will satisfy the condition θ((cid:15),δ)=c×(8+(cid:15)) ln +ln , (23) ver k δ OPT (cid:15)2 that k 2 2 1 cov =Cov (Sˆ )≥Λ =1+(2+ (cid:15)(cid:48))(1+(cid:15)(cid:48))ln , for some constant c>0. Rver k 2 3 2 2 δ(cid:48) (cid:15)(cid:48)2 2 2 Theconstantcisboundedtobe8+(cid:15)in[20],broughtdown to 8(e−2)(1−1/(2e))2 ≈ 3.7 in [1] . And the current best where (cid:15)(cid:48)2 = 1−(cid:15)2(cid:15)2. is c=2+2/3(cid:15), inducted from Lemma 6 in [5]. According to the stopping rule theorem2 in [21], this Note that the threshold θ((cid:15),δ) cannot be used to decide stopping condition guarantees that how many RR sets we need to generate. The reason is that θ Pr[BRver(Sˆk)>(1+(cid:15)(cid:48)2)B(Sˆk)]<δ2(cid:48) depends on the unknown value OPT , of which computation k Substitute (cid:15)(cid:48) = (cid:15)2 and simplify, we obtain is #P-hard. 2 1−(cid:15)2 1 To overcome that hurdle, a simple stopping rule is devel- Pr[BRver(Sˆk)> 1−(cid:15) B(Sˆk)]<δ2(cid:48) oped in [1] to check whether we have sufficient RR sets to 2 ThenumberoftimesVerifyinvokedthestoppingcondition guarantee, w.h.p., a (1−1/e−(cid:15)) approximation. The rule is that we can stop when we can find any seed set Sk, of size k, to estimate Sˆk is at most tmax×vmax. 
Thus, we can use the that coverage Cov (S ) exceeds union bound of all possible bad events to get a lower-bound R k OPT δ(cid:48) ×t ×v = δ for the probability of existing a bad Λmax =(1+(cid:15))θ((cid:15),δ/4)× n k (24) es2timamtioanx of Sˆmka.x 2 (cid:18) (cid:19) 1 2 n 2 → 3 : This holds due to the definition of (cid:15) in Verify. =(1+(cid:15))(2+2/3(cid:15)) (ln +ln ) (25) 1 (cid:15)2 δ/4 k Since (cid:15) ←1−B (Sˆ )/B (Sˆ ), it follows that 1 ver k Rt k Our algorithm TIPTOP utilizes this stopping condition to B (Sˆ )=(1−(cid:15) )B (Sˆ ). guarantee that at most Ω(θ((cid:15),δ)) RR sets are used in the ILP ver k 1 Rt k in the worst-case. As a result, the size of the ILP is kept to be We note that it is possible that (cid:15)1 < 0, and the whole proof almost linear size, assuming k (cid:28)n still goes through even for that case. In the experiments, we, Theorem 1. The expected number of non-zeros in the ILP of 2replacingtheconstant4(e−2)with2+2/3(cid:15)duetothebetterChernoff- boundinLemma3 (cid:115) however, do not observe negative values of (cid:15)1. (cid:15)∗ = 3ln(tmax/δ1) 3 → 4 : Since Sˆk is an optimal solution of t Ntµ∗t ILP (R,c,B), it will be the k-size seed set that intersects (cid:115) MC 3ln(t /δ ) with the maximum number of RR sets in R. Let S∗ be ≤(cid:15) = max 1 an optimal k-size seed set, i.e., the one that results ink the 3 (1−(cid:15)1)(1−(cid:15)2)CovRt(Sˆk) maximum expectedBbene(fiSˆt3.)S≥inBce |S(Sk∗∗|)=, k, it follows that ⇔Ntµ∗t ≥(1−(cid:15)1)(1−(cid:15)2)CovRt(Sˆk) Rt k Rt k ⇔N B(S∗)/Γ≥(1−(cid:15) )(1−(cid:15) )N B (Sˆ )/Γ iwnhReretBhaRtti(nSˆtker)seacntdwBitRhtS(ˆSk∗a)nddeSno∗t,erethspeencutimveblye.r of RR sets ⇔B(tSk∗)k≥(1−(cid:15)1)(1−1(cid:15)2)BRt2(Sˆkt) Rt k t k k The last one holds due to the optimality of S∗ and Eq. 28. Theorem 2. [Main theorem 1 → 5 ] Let Sˆ be the solution k k returned by TIPTOP (Algorithm 1). We have: B(Sk∗)≥B(Sˆk)≥(1−(cid:15)1)(1−(cid:15)2)BRt(Sˆk) Pr[B(Sˆ )≥(1−(cid:15))OPT ]≥1−δ (26) k k Combine Eq. 29 and 4 → 5 , we have Proof: First we use the union bound to bound the prob- ability of the bad events. Then we show if none of the bad B(Sˆk)≥(1−(cid:15)2)(1−(cid:15)1)BRt(Sk∗) events happen, the algorithm will return a solution satisfying ≥(1−(cid:15) )(1−(cid:15) )(1−(cid:15) )B(S∗)≥(1−(cid:15))OPT B(Sˆ )≥(1−(cid:15))OPT . 2 1 3 k k k k Thelastoneholdsduetotheterminatingcondition(1−(cid:15) )(1− The bad events and the bounds on their probabilities are 2 (cid:15) )(1−(cid:15) )>(1−(cid:15)) in line 14 in Verify. 1 3 VI. EXPERIMENTS 1) Pr[∃t:B (S∗)<(1−(cid:15)∗)B(S∗) ]<δ (Lem. 4) Rt k t k 1 We conduct several experiments to illustrate the perfor- 2) Pr[∃t,i,(cid:15) = (cid:15) : B (Sˆ ) > 1 B(Sˆ )] < δ (Lem. 5) 2 2i Rver k 1−(cid:15)2 k 2 misacnocmeapnadreudtitloitythoefTTI-PETXOAPC.FTirIsLt,Pt.heOpuerrfroersmulatnscsehoofwTIthPTatOiPt 3) Parn[d(C(oBv(RSˆtk()Sˆ<k)(>1−Λm1/aex)−(cid:15))OPTk)]<δ/4 iVsI-mAa)g.nWituedethsefnasatperplwyhTilIePTmOaPintaasinainbgensoclhumtiaornkqfuoarlietyxis(tsiencg. 4) Pr[(t>t ) and (Cov (Sˆ )≤Λ )]<δ/4 methods, showing that they perform better than their theoreti- max Rt k max cal guarantee in certain cases but have significantly degraded performance in others. Finally, we conclude the section with Theboundsin1)and2)comedirectlyfromLems.4and5.The an in-depth analysis of TIPTOP’s performance characteristics. bound in 3) is a direct consequence of the stopping condition WeimplementedTIPTOPinC++usingCPLEXtosolvethe algorithm in [1]. The bound in 4) can be shown by noticing IP. 
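For reference, the estimator B_R(S) = Γ · Cov_R(S) / |R| from Lemma 2, on which the Verify procedure relies, amounts to a simple coverage count over the RR-set pool. A minimal C++ sketch (the helper names are ours, and the RR sets are assumed to be generated with BSA as above):

#include <vector>
#include <unordered_set>

// Estimate B(S) from a pool of RR sets (Lemma 2): B_R(S) = gamma * Cov_R(S) / |R|,
// where gamma is the total benefit of all nodes.
double estimateBenefitFromRRSets(const std::vector<std::vector<int>>& rrSets,
                                 const std::vector<int>& seedSet,
                                 double gamma) {
    std::unordered_set<int> seeds(seedSet.begin(), seedSet.end());
    std::size_t covered = 0;
    for (const auto& rr : rrSets) {
        for (int v : rr) {
            if (seeds.count(v)) { ++covered; break; }   // S covers R_j if they intersect
        }
    }
    return gamma * static_cast<double>(covered) / static_cast<double>(rrSets.size());
}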
Unless noted otherwise, each experiment is run 10 times that when t>t then N (cid:29)θ((cid:15),δ). max t and the results averaged. Settings for (cid:15) are listed with each Apply the union bound; the probability that none of the experiment, but δ is fixed at 1/n. All influence values are above bad events happens is at most δ1+δ2+δ/4+δ/4=δ. obtained by running a separate estimation program with (cid:15) = Assume that none of the above bad events happen, we 0.02 using the seed sets produced by each algorithm as input. shallshowthatthealgorithmreturnsa(1−(cid:15))optimalsolution. Dataset Network Type Nodes Edges IBf(STˆIP)T≥OP(s1to−ps(cid:15)w)OithPTCo,vRsitn(cSˆek)th>e bΛamdaxev,eitntisinob3v)ioduos tnhoatt US Pol. Books [22] Recommendation 105 442 k k happen. Otherwise, algorithm Verify returns ‘true’ at line 14 GR-QC [23] Collaboration 5242 14496 for some t ∈ [1,t ] and i ∈ [0,v ). Follow the path max max Wiki-Vote [23] Voting 7115 103689 1 → 2 → 3 → 4 . No bad event in 2) implies NetPHY [24] Collaboration 37149 180826 B(Sˆ )≥(1−(cid:15) )B (Sˆ ) (27) k 2 ver k Twitter [25] Social 41M 1.5B ≥(1−(cid:15) )(1−(cid:15) )B (Sˆ ) (28) 2 1 Rt k TABLE I: Networks used in our experiments. ≥(1−(cid:15) )(1−(cid:15) )B (S∗) (29) 2 1 Rt k No bad event in 1) implies B (S∗)≥(1−(cid:15)∗)B(S∗) A. Comparison to the T-EXACT IP Rt k t k Since T-EXACT produces exact results for IM, we use it 4 → 5 : To show show the optimality of TIPTOP. As T-EXACT suffers from B (S∗)≥(1−(cid:15) )B(S∗), Rt k 3 k scalability issues, we run on the 105-node US Pol. Books net- We prove that work only. As shown in Fig. 2, TIPTOP consistently performs as well as T-EXACT while performing more than 100 times faster.Weexploitthisbehaviortoplaceanupperboundonthe performance of approximation methods in the next section. B. Benchmarking Greedy Methods As shown above, TIPTOP is capable of producing almost 3If there are multiple optimal solutions, we break the tie by using alpha- exact solutions significantly faster than other IP-based solu- beticalorderonnodes’ids. tions. This property allows us to bound the performance of the cost), if a user with 2K followers requires $800 to be influenced, then a user with 2M followers would require $800K. However, due to cascades increasing the effective influence of users with follower counts between the two, the high-follower users may opt to reduce their prices. The log- scaled function models an extreme case, with the two-million- follower user requiring a mere $1,527. We show that the (a) Solution Quality (b) Running Time performance of all methods remains similar when scaling is changed, though the performance of IMM and SSA drops significantly when costs are randomized. Fig. 2: The performance of TIPTOP ((cid:15) = 0.02,δ = 1/n) and T-EXACT (T = 5000) on the US Political Books [22] Our final setting is the CTVM problem described above. network under the cost-aware setting with normalized costs as We target 5% of each network at random, assigning each the budget b is varied. targeted user a benefit on [0.1,1) and each non-targeted user a benefit of 0. Fig. 4 shows the performance on the GR-QC, Wiki-Vote, and NetPHY networks. Note that neither IMM nor SSAcanbeextendedtotheCTVMproblemwithoutsignificant approximation methods, which in turn allows us to compare modifications, and therefore each ignores the benefit values. their absolute performance rather than merely their relative FromFig.4,wecanseethatthetopologyhasasignificant performance. 
We evaluate the performance of IMM, BCT, impact on the performance of each algorithm. Note that on and SSA under three problem settings on three networks. All GR-QC and NetPHY, the performance of BCT relative to the evaluations are under the IC model with edge weights set to optimal tends downward while on Wiki-Vote it tends upward. w(u,v)=1/d (v), where d (v) is the in-degree of v. in in Further, on NetPHY both IMM and SSA do astonishingly The first setting we consider is the traditional IM problem well despite being ignorant of the targeted nodes. Figure 5 (whichwetermUnweightedtodifferentiateitfromsubsequent showsthatBCTperformssimilarlyregardlessofcostfunction, settings). We directly apply each author’s implementation to while the modified IMM and SSA face a significant drop in this problem with no modifications. Fig. 3a shows that the performance when the costs are divorced from topology. decade-or-so of work on this problem has resulted in greedy solutions that are much better than their approximation ratio Coverage Unweighted Cost-Aware CTVM of(1−1/e−(cid:15)).Somewhatsurprisingly,wefindsimilarresults IMM 6.94×106 3.96×106 8.59×106 on the Cost-Aware problem setting (fig. 3b). SSA 2.01×106 2.81×106 3.42×106 TheCost-AwaresettinggeneralizestheUnweightedsetting BCT 9.34×106 4.15×106 6.23×106 by adding a cost to each node on the network. In a social network setting, the costs can be understood as the relative TIPTOP 2.08×103 1.30×104 6.04×103 amount of payment each user requires to become a part of an Verification Unweighted Cost-Aware CTVM advertisingcampaign(i.e.aseednode).NotethatneitherIMM norSSAsupportcostsnatively.Weextendbothtothisproblem SSA 4.02×107 8.44×107 4.02×107 by scaling their objective functions by cost and limiting the TIPTOP 6.25×106 1.86×109 2.80×107 number of selected nodes by the sum of costs instead of the number of nodes, which gives each an approximation TABLE II: Mean required samples for each algorithm. √ guarantee of 1−1/ e−(cid:15) [16]. We consider three ways users Approximation guarantees are 0.61 for greedy methods in may decide on their relative value. First, users may define the the unweighted case and 0.37 for the cost-aware case (and value of their influence independently from network topology. the CTVM case for BCT). The guarantee for TIPTOP is 0.6 We model this scenario by assigning costs according to a in all cases. Coverage: Samples input into the MC solver. uniform random distribution on [0,1). Verification:Samplesusedtoverifysolutionquality.IMMand BCT do not incorporate a verification stage. C. Analyzing TIPTOP’s performance We now turn our attention to the details of TIPTOP’s performance.Fig.6comparestherunningtimeofeachmethod as the approximation guarantee and budget are varied. From this, we can see that the runtime performance of TIPTOP is notfarfrom–andinsomecasesbetterthan–greedymethods. (a) Unweighted NetPHY (b) Cost-Aware NetPHY With a comparable guarantee, TIPTOP is as fast as SSA and an order of magnitude faster than IMM. However, as the Fig. 3: Mean performance of each approximation algorithm guarantee tends to 1, the advantage is lost. Fig. 6b shows that as the budget is varied under the Unweighted and Cost-Aware while the runtime of both IMM and BCT is smoothly linear problem settings with (cid:15) = 0.02. OPT is estimated assuming and for SSA is near-constant, the running time of TIPTOP that TIPTOP achieves exactly (1−(cid:15))OPT. is less predictable. 
This is the cost of solving an IP instead of greedily solving MC: the IP solver may quickly find the Alternately,usersmayusethemostreadily-availablemetric optimal solution, or it may take exponentially long to do to determine their influence: follower count. We consider both so. However, on average the running time of TIPTOP is low linear- and log-scaled functions of this metric as rough upper enough that it is practical to run even on very large networks. and lower bounds. In the linear case (cost(u)=n/|E|dout(u); Tofurtherestablishthescalabilityof TIPTOP,werunitonthe dout(u) is the out-degree of u, and the n/|E| term normalizes Twitter dataset (fig. 7). While the running time of TIPTOP is (a) GR-QC (b) Wiki-Vote (c) NetPHY (a) Uniform Random (b) Log-Outdegree Fig. 4: Mean performance of each approximation algorithm as the Fig.5: PerformanceontheNetPHYnetworkunder budget is varied under the CTVM problem setting with linear costs. the CTVM setting with alternate costs. worse, it still produces 98% optimal solutions in under three the number of samples to meet the requirement of ILP solver hours. Previously, solving an IP on a network with millions is able to run on billion-scale OSNs such as Twitter under of nodes and billions of edges had been thought impractical. three hours. As TIPTOP is an exact solution to the ratio of TIPTOP is able to accomplish this by dramatically reducing (1−(cid:15)), it significantly improves from the current best ratio √ the number of samples used to solve MC. Table II shows that (1−1/e−(cid:15)) for IM and (1−1/ e−(cid:15)) for CTVM. TIPTOP on average, TIPTOP uses on the order of 103 fewer samples also lends a tool to benchmark the absolute performance of than greedy methods to solve MC. As SSA has a Stare phase existing algorithms on large-scale networks. in which it verifies the quality of a generating solution, we also compare the number of samples used in our method and ACKNOWLEDGMENT SSA. The lower section of table II shows that even then, the This work is supported in part by the NSF grant #CCF- numberofsamplesusedtoverifytheapproximationguarantee 1422116. is similar. The total number of samples used in both phases of REFERENCES TIPTOP, in general, in smaller that that of SSA. Further, the approximationratioof TIPTOP affordsgreaterprecisionifone [1] H.T.Nguyen,M.T.Thai,andT.N.Dinh,“Cost-awaretargetedviral needstoguaranteeacertainlevelofperformance,asshownin marketinginbillion-scalenetworks,”inINFOCOM. IEEE,2016. Fig. 7. Where prior work produced results of similar quality [2] D. Kempe, J. Kleinberg, and E´. Tardos, “Maximizing the spread of regardless of (cid:15), TIPTOP enables setting guarantees to exactly influence through a social network,” in KDD’03. ACM New York, NY,USA,2003,pp.137–146. the required value. If, for instance, an application requires an [3] J.Leskovec,A.Krause,C.Guestrin,C.Faloutsos,J.VanBriesen,and 80% approximation ratio, (cid:15) can be set to exactly 0.2, at which N. Glance, “Cost-effective outbreak detection in networks,” in ACM point TIPTOP is as fast as state-of-the-art greedy solutions. KDD’07. NewYork,NY,USA:ACM,2007,pp.420–429. [4] W.Chen,C.Wang,andY.Wang,“Scalableinfluencemaximizationfor prevalentviralmarketinginlarge-scalesocialnetworks,”inACMKDD ’10. NewYork,NY,USA:ACM,2010,pp.1029–1038. [5] Y. Tang, Y. Shi, and X. Xiao, “Influence Maximization in Near- Linear Time: A Martingale Approach,” in Proceedings of the 2015 ACMSIGMODInternationalConferenceonManagementofData,ser. SIGMOD’15. ACM,pp.1539–1554. [6] H. T. Nguyen, M. 
T. Thai, and T. N. Dinh, “Stop-and-stare: Optimal sampling algorithms for viral marketing in billion-scale networks,” in (a) Runtime as (cid:15) is increased. (b) Runtime as b is increased. SIGMOD. ACM,2016. b is held constant at 50. (cid:15) is held constant at 0.02. [7] T.N.Dinh,H.Zhang,D.T.Nguyen,andM.T.Thai,“Cost-effective viral marketing for time-critical campaigns in large-scale social net- works,”IEEE/ACMTransactionsonNetworking(TON),vol.22,no.6, Fig. 6: Mean running time of TIPTOP, IMM, SSA, and BCT pp.2001–2011,2014. on the NetPHY network under the CTVM problem setting. [8] H. Zhang, T. N. Dinh, and M. T. Thai, “Maximizing the spread of positiveinfluenceinonlinesocialnetworks,”inDistributedComputing Systems(ICDCS),2013IEEE33rdInternationalConferenceon. IEEE, 2013,pp.317–326. [9] T.N.Dinh,D.T.Nguyen,andM.T.Thai,“Cheap,easy,andmassively effective viral marketing in social networks: truth or fiction?” in Proceedings of the 23rd ACM conference on Hypertext and social media. ACM,2012,pp.165–174. [10] H. Zhang, D. T. Nguyen, H. Zhang, and M. T. Thai, “Least cost influence maximization across multiple social networks,” IEEE/ACM (a) Running Time (b) Solution Quality TransactionsonNetworking,vol.24,no.2,pp.929–939,2016. [11] A.Shapiro,D.Dentcheva,andA.Ruszczynski,LecturesonStochastic Programming: Modeling and Theory, Second Edition. Philadelphia, Fig. 7: Mean results on the Twitter network under the PA,USA:SocietyforIndustrialandAppliedMathematics,2014. Unweightedsetting.Onlyonerepetitionisusedonthisdataset. [12] G.BayraksanandP.D.Morton,“Assessingsolutionqualityinstochastic VII. CONCLUSION programs,”MathematicalProgramming,vol.108,no.2,pp.495–514, 2006. In this paper, we propose the first (almost) exact solutions [13] U.Feige,“Athresholdoflnnforapproximatingsetcover,”Journalof to the CTVM (and thus IM) problem, namely T-EXACT ACM,vol.45,no.4,pp.634–652,1998. and TIPTOP. T-EXACT uses the traditional stochastic pro- [14] N. P. Nguyen, Y. Xuan, and M. T. Thai, “A novel method for worm gramming approach and thus suffers the scalability issue. In containmentondynamicsocialnetworks,”inMilitaryCommunications contrast, our TIPTOP with innovative techniques in reducing Conference,2010,pp.2180–2185. [15] C. Borgs, M. Brautbar, J. Chayes, and B. Lucier, “Maximizing social influence in nearly optimal time,” in Proceedings of the Twenty-Fifth AnnualACM-SIAMSymposiumonDiscreteAlgorithms,ser.SODA’14. SIAM,2014,pp.946–957. [16] S.Khuller,A.Moss,andJ.S.Naor,“Thebudgetedmaximumcoverage problem,” Information Processing Letters, vol. 70, no. 1, pp. 39–45, 1999. [17] H. Nguyen and R. Zheng, “On budgeted influence maximization in socialnetworks,”SelectedAreasinCommunications,IEEEJournalon, vol.31,no.6,pp.1084–1094,2013. [18] A. Kleywegt, A. Shapiro, and T. Homem-de Mello, “The sample average approximation method for stochastic discrete optimization,” SIAMJournalonOptimization,vol.12,no.2,pp.479–502,2002. [19] F. Chung and L. Lu, “Concentration inequalities and martingale in- equalities: a survey,” Internet Mathematics, vol. 3, no. 1, pp. 79–127, 2006. [20] Y. Tang, X. Xiao, and Y. Shi, “Influence maximization: Near-optimal time complexity meets practical efficiency,” in Proceedings of the 2014ACMSIGMODinternationalconferenceonManagementofdata. ACM,2014,pp.75–86. [21] P.Dagum,R.Karp,M.Luby,andS.Ross,“Anoptimalalgorithmfor monte carlo estimation,” SIAM J. Comput., vol. 29, no. 5, pp. 1484– 1496,Mar.2000. 
[22] V.Krebs,“Booksaboutuspolitics,”unpublished,compiledbyM.New- man.Retrievedfromhttp://www-personal.umich.edu/∼mejn/netdata/. [23] J. Leskovec and A. Krevl, “SNAP Datasets: Stanford large network datasetcollection,”http://snap.stanford.edu/data,Jun.2014. [24] W.Chen,Y.Wang,andS.Yang,“Efficientinfluencemaximizationin socialnetworks,”inKDD’09. NewYork,NY,USA:ACM,2009,pp. 199–208. [25] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a social network or a news media?” in WWW ’10: Proceedings of the 19th international conference on World wide web. New York, NY, USA: ACM,2010,pp.591–600.