ebook img

Back to the Past: Source Identification in Diffusion Networks from Partially Observed Cascades PDF

1.2 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Back to the Past: Source Identification in Diffusion Networks from Partially Observed Cascades

Back to the Past: Source Identification in Diffusion Networks from Partially Observed Cascades MehrdadFarajtabar∗ ManuelGomez-Rodriguez† NanDu∗ MohammadZamani(cid:5) HongyuanZha∗ LeSong∗ ∗GeorgiaInstituteofTechnology †MPIforSoftwareSystems (cid:5)StonyBrookUniversity 5 1 Abstract paths,investigatorsfoundthattheimageboard4chan4 was 0 theculpritsitewherethephotoswereoriginallypostedon 2 August 31, even though the photos had been taken down Whenapieceofmaliciousinformationbecomes n rampantinaninformationdiffusionnetwork,can fromthesitesoonaftertheirpost. Thisleakageofprivate a pictureshastouchedoffalargerworld-widediscussionand weidentifythesourcenodethatoriginallyintro- J debateonthestateofprivacyandcivillibertiesontheIn- duced the piece into the network and infer the 6 ternet[Isaac,2014]. time when it initiated this? Being able to do so 2 is critical for curtailing the spread of malicious Can we automatically pinpoint the identity of such mali- ] information,andreducingthepotentiallossesin- cious information sources, as well as the time when they I S curred. Thisisaverychallengingproblemsince first posted the malicious information, given historical in- s. typicallyonlyincompletetracesareobservedand complete diffusion traces? Solving this source identifi- c we need to unroll the incomplete traces into the cation problem is of outstanding interest in many scenar- [ past in order to pinpoint the source. In this pa- ios[Lappasetal.,2010]. Forexample,findingpeoplethat 1 per,wetacklethisproblembydevelopingatwo- originate rumors may reduce disinformation, identifying v stageframework,whichfirstlearnsacontinuous- patientzerosindiseasespreadsmayhelptounderstandand 2 timediffusionnetworkmodelbasedonhistorical controlepidemics,orinferringwhereatrojanorcomputer 8 diffusiontracesandthenidentifiesthesourceof wormisinitiallyreleasedmayincreasereliabilityofcom- 5 6 anincompletediffusiontracebymaximizingthe puternetworks. 0 likelihood of the trace under the learned model. Related Work. The problem of finding the source of a . Experiments on both large synthetic and real- 1 diffusion trace, also called cascade, has not been studied world data show that our framework can effec- 0 until very recently [Lappas et al., 2010, Shah and Zaman, 5 tively “go back to the past”, and pinpoint the 2010,AdityaPrakashetal.,2012,Pintoetal.,2012].How- 1 source node and its initiation time significantly : moreaccuratelythanpreviousstate-of-the-arts. ever, mostpreviousworkassumesthatacompletesteady- v state snapshot of the cascade is observed, in other words, i X we know which nodes got infected but not when they did r 1 INTRODUCTION so. Moreover, previous work uses discrete-time sequen- a tial propagation models such as the independent cascade model [Kempe et al., 2003] or the discrete version of the OnSeptember2014,acollectionofhundredsofprivatepic- SIR model [Bailey, 1975], which are difficult to estimate turesfromvariouscelebrities,mostlyconsistingofwomen accurately from real world data [Gomez-Rodriguez et al., and often containing nudity, were posted online, and later 2011,Duetal.,2012,2013b,Zhouetal.,2013a,b]. disseminated by users on websites and social networks suchasImgur1,Reddit2andTumblr3[Kedmey,2014]. Af- Onlyveryrecently,Pintoetal.[2012]considerafairlygen- ter quite some efforts in manual tracing of the diffusion eral continuous-time model and assume that only a small fraction of sparsely-placed nodes are observed and, if in- 1http://imgur.com/ fected,theirinfectiontimeisobserved.Unfortunately,their 2http://www.reddit.com/ approach requires the distance between observed nodes 3https://www.tumblr.com/ to be large because they approximate the infection times AppearinginProceedingsofthe18thInternationalConferenceon by Gaussian distributions using the central limit theorem. ArtificialIntelligenceandStatistics(AISTATS)2015,SanDiego, Since this is easily violated in real social and information CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the authors. 4http://www.4chan.org/ BacktothePast:SourceIdentificationinDiffusionNetworksfromPartiallyObservedCascades 3 — Uncertain transmission delay. The spread of informa- 2 2 tion over social and information networks is a stochastic Ol5ivia Ma3son 3 2 Sop2hia 5 6 pmrioscseiossn.mThoedreelfsortoe,cwaeptnuereedtthoecuonncseidrtearinptryo.baFboilrisetixcatmrapnlse-, 8 ourtoyexampleillustratesthespreadofaparticularrumor 1 Liam andthereforeconsidersasetoffixededgedelays(e.g.,the 7 5 4 Jacob 1 rumor took 5 time units to spread from Sophia to Ethan). Ethan However, the edge delays are stochastic and possibly dif- Emma 0 9 7 ferentforeveryparticularrumor(e.g.,adifferentrumorcan 4 takemoreorlessthan5timeunitstospreadfromSophiato Ethan).Theedgedelaydensities,ortransmissiondensities, Figure 1: Spread of a rumor in a social network. Each maydependonparameterslikethecontentoftherumoror edgeweightisthetimeittookforarumortopassalongthe theusers’influence. edge. Solidmagnetedgesindicatetheactualpaththrough whichtherumorspreads. Greendashededgesarealterna- — Unknown infection path. In large real world networks, tivewaysinwhichtherumorcouldhavespread.Theinfec- we will often encounter a large number of potential paths tiontimesofSophiaandLiamareobserved(redsquares); thatmayexplainthespreadofarumorfromasourcenode theinfectiontimesoftheremainingnodesarehidden(yel- and any other node in the network. In fact, the set of po- lowsquares).HowcananalgorithmfindthatJacobwasthe tentialpathsincreasesexponentiallywithnetworksizeand personwhoinitiatedtherumor? network density and even simply counting the number of pathsrequiresnon-trivialmethods[Gomez-Rodriguezand Scho¨lkopf,2012a,b,Duetal.,2013a]. Forexample,inour networks[Backstrometal.,2012],wefinditsperformance toyexample,Liamcanbecomeawareoftherumorthrough onthistypeofnetworksunderwhelming,asshowninSec- eitherOliviaorEmma. tion5. OurApproach. Totacklethesechallenges, weproposea Challenges. Previous approaches failed successfully two-stagescalableframework: wefirstlearnacontinuous- addressseveralchallengesofthesourceidentificationprob- timediffusionnetworkmodelbasedonhistoricaldiffusion lem, which we illustrate next using a toy example, shown tracesandthenidentifythesourceofanincompletediffu- inFigure1. sion trace by maximizing its likelihood under the learned model. Thekeyideaofourframeworkistoviewtheprob- — Partially observed infections. It has become difficult, lemfromtheperspectiveofgraphicalmodels,andcastthe ifnotimpossible,tocollectcompletediffusiontraces,and problemasamaximumlikelihoodestimationproblem,for track each individual infection in online social and infor- which we find optimal solutions very efficiently using an mationnetworks. Thisproblemisexacerbatedbytheneed importance sampling approximation to the objective and to develop methods that can provide outputs in (almost) anoptimizationprocedurethatexploitsthestructureofthe real-time. For example, Spinn3r5 crawls only a subset of problem. Additionally, for networks with exponentially theblogsperiodically; Twitter’sstreamingAPIprovidesa distributed edge transmission densities, used previously smallpercentage(1%)ofthefullstreamoftweets[Morstat- for modeling information propagation [Gomez-Rodriguez teretal.,2013];Facebookuserstypicallykeeptheiractiv- etal.,2011],weshowthattheobjectiveisapiece-wiseuni- ity and profiles private [Sadikov et al., 2011]. It is thus modalfunctionwithrespecttothesource’sinfectiontime necessary to develop methods that are robust to missing anddevelopamoreefficientsearchprocedure. data [Chierichetti et al., 2011, Kim and Leskovec, 2011, Sadikov et al., 2011]. Our toy example illustrates this For both synthetic and real-world data, we show that the challenge by considering the infection times of Liam and framework can effectively “travel back to the past”, and Sophia to be observed and all other infection times to be pinpointthesourcenodeanditsinfectiontimesignificantly missing(hiddenorunobserved). moreaccuratelythanothermethods. —Unknowninfectionstarttime.Inmostreal-worldscenar- ios, the exact time when a piece of malicious information 2 OURFRAMEWORK starts spreading is unknown, and thus the observed infec- tiontimeshaveonlyrelativemeaning. Inourtoyexample, Our framework for solving the source identification prob- we knowLiam gotinfected 5 time units laterthan Sophia lemconsistsoftwomainstages:itfirstlearnsacontinuous- butwedonotactuallyobservehowmuchtimehaspassed timediffusionnetworkmodelbasedonhistoricaldiffusion betweenJacob’sinfection,whichtriggeredthespread,and traces,andthenidentifiesthesourceofanincompletedif- Sophia’sinfection. fusion trace and its initiation time by maximizing the its likelihood under the learned model. We start our expo- 5http://spinn3r.com/ sition by revisiting the continuous-time generative model M.Farajtabar,M.Gomez-Rodriguez,N.Du,M.Zamani,H.Zha,L.Song for cascade data in social networks introduced in Gomez- etal.,2013b]. Inthiscase, Rodriguezetal.[2011],Duetal.[2013a]. kτk−1 −(cid:16) τ (cid:17)k −(cid:16) τ (cid:17)k fji(τ;αji)= αk e αji , Sji(τ;αji)=e αji , ji 2.1 Continuous-TimeModelforCascades where k is a hyperparameter controlling the shape of the density. This family includes many well-known special Given a directed contact network, G = (V,E) with N cases, such as the exponential or Rayleigh distributions, nodes, a diffusion process begins with an infected source whichhavealsobeenusedtomodelinformationpropaga- nodesinitiallyadoptingcertaincontagion(idea,rumoror tion over information networks [Gomez-Rodriguez et al., maliciouspieceofinformation)attimet . Thecontagion s 2010,Duetal.,2013b]. istransmittedfromthesourcealongherout-goingedgesto theirdirectneighbors. Eachtransmissionthroughanedge Unfortunately,touseEq.1,allinfectednodesinacascade entails a random spreading time, τ, drawn from a density needtobefullyobserved. IfweonlyobserveasubsetOof overtime,f (τ;α ),parametrizedbyatransmissionrate theinfectednodes,thelikelihoodoftheincompletecascade ji ji α . Then,theinfectedneighborstransmitthecontagionto iscomputedasfollows, ji their respective neighbors, and the process continues. We (cid:90) (cid:89) p({t } |t )= p(t|t ) dt assume transmission times are independent and nonnega- i i∈O s s j tive, in other words, a node cannot be infected by a node Ω j∈H (cid:90) iannfeicntfeedctleadtenroindetsimreem;fajiin(τin;fαejcit)ed=fo0rifthτe<ent0i.reMdoifrfeuosvioern, = Ωi∈(cid:89)O∪Hp(ti|{tj}j∈πi)j(cid:89)∈H dtj, (2) process. Thus, if a node i is infected by multiple neigh- which essentially marginalize out the time for all hidden bors, onlytheneighborthatfirstinfectsnodeiwillbethe nodes H over a product space Ω := [t ,∞)|H|. For sim- s trueparent. plicityofnotation,wewillomitthedomainoftheintegra- The temporal traces left by diffusion processes are often tionintheremainderofthepaper. called cascades. A cascade t is an N-dimensional vector Thecomputationoftheincompletelikelihoodisadifficult t := (t ,...,t ) recording the times when nodes are in- 1 N highdimensionalintegrationproblemforcontinuousvari- fected, if so, i.e., t ∈ [0,∞], where T is the observation i ables. We will address this technical challenge using im- window cut-off and ∞ denotes nodes that did not get in- portancesamplinginSection3.2. fected during the observation window. However, as noted above,inmanyscenarios,weonlyobserveasubsetofthe infectednodes,O,whilethestateofallothernodes,H,is 2.3 LearningDiffusionNetworks hidden (we assume the source node s ∈ H). Our aim is then to find the source of a cascade s from the infection Ourframeworkreliesontheassumptionthatitispossible times{t } ofthesubsetofinfectednodesO. Figure1 torecordasufficientlylargenumberofhistoricalcascades, j j∈O illustratestheobserveddata. C,inordertodiscovertheexistenceofallnodesinthenet- work,toinferthenetworkstructureaswellasthemodelpa- rameters,{α }. Wenotethatitisnotnecessarytorecord 2.2 CascadeLikelihood ji cascades that cover all nodes and edges, but each cascade hastobefullyobservedtoasufficientlylargetimeperiod. According to the conditional independence relation pro- Furthermore,allcascadescollectivelyneedtocovertheen- posedinthecontinuous-timemodelforcascades,thecom- tirediffusionnetwork. Underthepreciseconditionsstated pletelikelihoodofacascadet(forbothobservedandhid- in [Daneshmand et al., 2014], one can infer the parame- dennodes)factorizesas (cid:89) ters of the continuous time model using an (cid:96)1-regularized p(t|t )= p(t |{t } ) (1) s i j j∈πi maximumlikelihoodestimationprocedure. i∈O∪H where π is the set of parents of i defined by the directed i graph G. For the infected nodes, Gomez-Rodriguez et al. 2.4 CascadeSourceIdentificationProblem [2011]showedthatthelikelihoodcanbefurtherwrittenas (cid:89) (cid:88) Given a learned diffusion model, our aim is to find the p(t |{t } )= S(t −t ;α ) H(t −t ;α ), i j j∈πi i j ji i l li sourcenodesofanincompletecascade,suchthatthelog- j∈πi l∈πi likelihoodoftheincompletecascadeismaximized. Thus, where S (τ;α ) = 1−F (τ;α ) is the survival func- ji ji ji ji weaimtosolve (cid:82)τ tion, F (τ;α ) = f (t;α )dt is the cumulative dis- ji ji 0 ji ji s∗ =argmax max p({t } |t ), (3) tributionfunction,andHji(τ;αji)= Sfjjii((ττ;;ααjjii)) isthehaz- s∈H ts∈(−∞,mini∈Oti) i i∈O s ardfunction,orinstantaneousinfectionrate. Wewillfocus where p({t } |t ) is defined in Eq. 2, and we assume i i∈O s ontheWeibullfamilyofdistributionsf (τ;α )sincethey that t < min t . If we observe several independent ji ji s i∈O i have been shown to fit well real world diffusion data [Du incomplete cascades D, all triggered by the same source BacktothePast:SourceIdentificationinDiffusionNetworksfromPartiallyObservedCascades node,wewillmaximizetheirjointlikelihood where we draw L samples from q({η } ,{t } ) to i i∈O i i∈H (cid:32) (cid:33) approximatetheintegral. Now,wehaveanapproximation (cid:89) s∗ =argmax max L , (4) to L . Next, we explain how to choose the proposal and c c s∈H c∈D tcs∈(−∞,minj∈Otcj) theauxiliarydistributions. where L := p((cid:8)tc(cid:9) |t ). In the following sections, c j j∈O s wewilldesignalgorithmstoefficientlyoptimizetheabove 3.2 ChoiceofProposalDistributions objectiveandpresentexperimentalevaluations. We define our proposal distribution using the forward- 3 APPROXIMATEOBJECTIVE generative process of the cascades. Our proposal distri- FUNCTION butionq({η } ,{t } )willsamplecascadesfromthe i i∈O i i∈H There remain two technical challenges to solve to make learned continuous diffusion network model with s as the our framework useful in practice. First, the likelihood of source set. One of the interesting properties of this pro- incompletecascades,givenbyEq.2,isadifficulthighdi- posal distribution is that many terms involving the latent mensional integration problem over a continuous domain. variables in Eq. 6 will be canceled out and hence the for- We overcome this difficulty by an approximation algo- mulawillbecomesimpler. rithm, based on importance sampling, which will greatly Weremindthereaderthattheindependentcascademodel simplifytheintegration. Second,theinner-loopmaximiza- hasausefulshortest-pathproperty[Duetal.,2013a],which tion over the source timing in Eq. 4 is non-convex. We allowsustosampletheparents’infectiontimes,{t } , solvethisbydesigninganefficientalgorithm,whichfinds i i∈πj for each node j efficiently for different source infection theglobalmaximumbyexploitingthepiece-wisestructure timest . Morespecifically, wefirstsampleasetoftrans- s oftheproblem. mission times {τ } , one per edge, independent of uv (u,v)∈E 3.1 ImportanceSampling each other. Then, the time t taken to infect a node i is i SinceananalyticalevaluationoftheintegralinEq.2isin- simplythelengthoftheshortestpathinG fromthesource tractable, weturntoaMonteCarloapproximation. Todo s to node i, where the edge weights correspond to the as- so, in principle, we need to draw samples from the poste- sociated transmission times. Let Qi(s) be the collection riordistributionoflatentvariables,p({t } |t ,{t } ), of directed paths in G from the source s to node i, where i i∈H s i i∈O given the source time ts and the times of the observed eachpathq ∈Qi(s)containsasequenceofdirectededges nodes,{ti}i∈O. However,itisverychallengingtosample (j,m), and assume the source node is infected at time ts, fromthisposteriordistribution,andwewillinsteadaddress thenweobtainvariabletivia theproblembydesigninganefficientimportancesampling t =g (cid:0){τ } |s(cid:1):= min (cid:88) τ , (7) i i jm (j,m)∈E jl approach. q∈Qi(s) (j,m)∈q whereg (·)isthevalueoftheshortest-path. Morespecifically,wefirstintroduceasetofauxiliaryran- i dom variables {η } , where each variable corresponds Thisaboverelationiskeytospeeduptheevaluationofthe i i∈O tooneobservedinfectednode,withanarbitraryjointprob- sampled likelihood in Eq. 6 for different t values. First, s abilitydistributionq˜({η } ). Inthenextsection,wewill the sampled transmission times τ are independent and i i∈O uv brieflydiscusshowq˜ischosen. Then, giventheauxiliary thus can be sampled in parallel. Second, we can reuse distributionwehave the sampled transmission times τ for different t values uv s (cid:90) (cid:89) and sources s, since the transmission times are indepen- p({t } |t )= p({t } |t ) dt i i∈O s i i∈O∪H s i dent of t and s. We only need to compute the infection s i∈H (cid:90) (cid:89) (cid:89) timeti foreachnodeusingts =0,andthenforadifferent = p({t } |t )q˜({η } ) dt dη . (5) valueoft ,theinfectiontimeisjustanoffsetbyt . Third, i i∈O∪H s i i∈O i∈H ii∈O i the likelihsood of a sampled cascade ((cid:8)ηil(cid:9)i∈O,(cid:8)stli(cid:9)i∈H) for l = 1,...,L can be simply computed using Eq. 1 as Second, we introduce the proposal distribution for im- p((cid:8)ηl(cid:9) ,(cid:8)tl(cid:9) |t ), which is independent of the ac- portance sampling on the auxiliary and hidden variables, i i∈O i i∈H s tual value of t and depends only on the identity of the q({η } ,{t } ). Then,theintegralbecomes s i i∈O i i∈H sourcenodes. (cid:90) p({t } |t )q˜({η } ) p({t } |t )= i i∈O∪H s i i∈O i i∈O s q({η } ,{t } ) i i∈O i i∈H 3.3 ChoiceofAuxiliaryDistribution (cid:89) (cid:89) q({η } ,{t } ) dt dη i i∈O i i∈H i i i∈H i∈O The auxiliary distribution q˜({ηi}i∈O) is chosen to be ≈1 (cid:88)L p({ti}i∈O(cid:8),(cid:8)(cid:9)tli(cid:9)i∈H|ts)q˜((cid:8)ηil(cid:9)i∈O) eaquuxailliartyo dpi(s{trηibi}uit∈ioOn|w{tiil}li∈siHm)p.ly sIanmpoltehecraswcaodredss,frooumr L q( ηl ,{t } ) l=1 i i∈O i i∈H the learned continuous diffusion network model with (cid:44)φ (t ), (6) H as the source set. Here, it is is easy to see that L s M.Farajtabar,M.Gomez-Rodriguez,N.Du,M.Zamani,H.Zha,L.Song Algorithm1Oursourcedetectionalgorithm Require: C,D,L InfertransmissionratesAfromC using[Daneshmandetal.,2014,Algorithm1]. SampleLsetsoftransmissiontimes{τ } . ij (i,j)∈E Computeinfectiontimestˆl , l=1,...,Lassumingt =0usingEq.7. i∈V s Computechangepoints: t −tˆl,i∈D,j ∈π ,l=1,...,Landt −tˆl,j ∈M,i∈π ∩O,l=1,...,L. i j i i j j fori∈Mdo t∗ =argmax φ (t )(usinglinesearchmethodorLemma2ineachpiece) i ts L s endfor s∗ =argmax φ (t∗) i∈M L i t =max φ (t∗) s∗ i∈M L i (cid:82) (cid:81) p({η } |{t } ) dη = 1. With the above piecesincreasesasO(L∆N),i.e.,linearlyinthenumberof i i∈O i i∈H i∈O i choicesfortheproposalandauxiliarydistribution,wecan Monte Carlo samples, the number of observed nodes, and greatlysimplifytheapproximatelikelihoodinEq.6into the maximum in-degree, ∆, of the observed nodes. Sec- L ond, within each piece, the maximum of the function can φ (t )= 1 (cid:88)(cid:89)p(t |(cid:8)tl(cid:9) ,{t } ) (8) befoundefficiently. L s L i j j∈πi\O j j∈πi∩O l=1i∈O p(tl|(cid:8)tl(cid:9) ,{t } ) (cid:89) i j j∈πi\O j j∈πi∩O , 4.1 FindingEachContinuousPiece (cid:8) (cid:9) (cid:8) (cid:9) p(tl| tl , ηl ) i∈M i j j∈πi\O j j∈πi∩O In this section, we aim to efficiently find all the change whereMisthesetofhiddennodeswithobservedvariables pointst intheapproximatedlikelihoodφ (t ),givenby asparents, si L s Eq. 8. In other words, we will efficiently find the left and M:={i∈H|π ∩O =(cid:54) ∅}, (9) i rightendpointsofeachofitscontinuouspieces. Here,we whichistypicallymuchsmallerthantheoverallsetofhid- assume there is a directed path in G from the source s to dennodes. Itisnoteworthythat,undermildregularitycon- eachoftheobservedinfectednodesO,otherwise,itcannot ditions,theMonteCarloapproximationoftheintegralwill beasourceforthosenodes,trivially. converge to the true value with sufficient number of sam- The key idea to finding all change points is realiz- ples. However,acleverchoiceoftheproposaldistribution ing that each piece in Eq. 8 corresponds to a differ- makestheconvergencefasterandthecomputationmoreef- ent feasible parents-child configuration. Here, by fea- ficient. sible parents, we mean parents that get infected ear- lier than the child and thus are temporally plausible. 4 MAXIMIZEOBJECTIVEFUNCTION More specifically, given a source s, Eq. 8 is composed of three types of terms: p(t |(cid:8)tl(cid:9) ,{t } ) Ourobjectivefunction,givenbyEq.4,consistsofaninner and p(tli|(cid:8)tlj(cid:9)j∈πi\O,{tj}j∈πii∩O)j, wj∈hπiic\hOdepejndj∈oπni∩tOhe andanoutermaximization. Intheinnermaximization,we source time value t , as we will realize shortly, and s leveragetheMonteCarlosampleapproximationandsolve thus are responsible for the change point values t , and max φ (t ). (10) p(tl|(cid:8)tl(cid:9) ,(cid:8)ηl(cid:9) ),whichdoesnotdepseindon ts L s t ,ibecajusje∈πbio\tOh (cid:8)ηlj(cid:9)j∈πia∩nOd (cid:8)tl(cid:9) are sampled, t In the outer maximization, we rank all possible source s i i∈O i i∈M s equally shifts all sampled times and its likelihood is time nodes, s, in terms of their best starting time t , which is s shiftinvariant. Basedonthestructureofthefirsttwotype thesolutiontotheinnermaximization,andthenselectthe of terms, it is easy to show that at each change point t , topsourcenodeintherankingasouroptimalsource,s∗. si there is a node j ∈ O ∪ M, observed or hidden, that Theoutermaximizationisstraightforward,however,thein- changesitssetoffeasibletemporallyplausibleparents,i.e., nermaximization,whichconsistsoffindingtheoptimalts aparentofoneobservedorhiddennodebecomes(stopsbe- thatmaximizesφL(ts),definedinEq.8,mayseemdifficult ing)afeasibleparentattimetsi. Therefore,itisclearthat atfirst. Althoughitisa1-dimensionalproblem,theobjec- there are O(L∆N) change points, where ∆ is the maxi- tivefunctionispiece-wisecontinuousandnon-convexwith mumin-degreeofnodes. Next,wedescribeaprocedureto respecttots. Thisisbecausebyincreasing(ordecreasing) findallchangepointsefficiently. t , the parent-child relation between nodes may change. s However, there are two key properties of φ (t ), which EfficientChangePointEnumeration. Westartbysetting L s allow us to carry out the optimization efficiently. First, t = 0 and computing the infection time, denoted as tˆl, s j φ (t ) is piece-wise continuous and the number of such for each hidden node j ∈ M and realization l using the L s BacktothePast:SourceIdentificationinDiffusionNetworksfromPartiallyObservedCascades 1 found within (cid:15)-neighborhood of t∗ with only s 0.8 2log(tsi+1−tsi)/log(3/2)evaluationsofφ (.). 0.6 (cid:15) L 0.4 0.2 Furthermore, by utilizing golden section search [Kiefer, 0 1953],onecanfurtherreducethecomplexityoffindingthe C=1 C=2 C=4 C=8 optimumpointtolog(tsi+1−tsi)/log(1.618)evaluations. (cid:15) Figure2:Evolutionoftheproposedmethodwithrespectto WesummarizetheoverallalgorithminAlgorithm1. thenumberofcascades. shortest path property described in Section 3.2. Then, we 5 EXPERIMENTS find the change points in which an observed node i ∈ O looses feasible parents by computing the time difference We evaluate the performance of our method on: (i) syn- t −tˆl,j ∈ π \O,l = 1,...,L, andthechangepointsin i j i thetic networks that mimic the structure of social net- whichahiddennodej ∈Mearnsfeasibleparentsbycom- worksand(ii)realnetworksinferredfromalargecascade putingthetimedifferencest −tˆl,i∈π ∩O,l=1,...,L. i j j dataset,usingawell-knownstate-of-the-artnetworkinfer- Ifatimedifferenceisnegative,weskipit,sincetheassoci- ence method [Gomez-Rodriguez et al., 2011]. We show ated parent will never (always) be feasible, independently thatourapproachdiscoversthetruesourceofacascadeor ofthet value. s setofcascadeswithsurprisinglyhighaccuracyinsynthetic Additionally, we can compute φ (.) efficiently for each networks and quite often in real networks, given the diffi- L change point t , since at each change point t , we will cultyoftheproblem,andsignificantlyoutperformsseveral si si onlyneedtorevaluatethecorrespondingtermstothenode baselinesandtwostateoftheartmethods[AdityaPrakash i ∈ O∪Mthatchangesitssetoffeasibleparents. Inthe etal.,2012,Pintoetal.,2012]. AppendixCprovidesaddi- caseofexponentialtransmissionlikelihoods,oncewehave tionalexperimentalresults. computed the likelihood at each change point t , we can si re-evaluate it at any time t ∈ [tsi,tsi+1), by multiplying 5.1 ExperimentsonSyntheticData thecorrespondingtermsintheapproximatedlikelihoodby et−tsi. Experimental Setup. We generate three types of Kro- neckernetworks[Leskovecetal.,2010]:(i)core-periphery 4.2 MaximizingwithinEachPiece networks (parameter matrix: [0.9 0.5; 0.5 0.3]), which mimic the information diffusion traces in real world net- Oncewehavedelimitedeachpieceoftheapproximatelike- works [Gomez-Rodriguez et al., 2010], (ii) random net- lihood given by Eq. 8, we can find the times t that max- s works ([0.5 0.5; 0.5 0.5]), typically used in physics and imize the likelihood in each piece efficiently, using well- graph theory [Easley and Kleinberg, 2010], and (iii) hier- knownline-searchproceduresforone-dimensionalcontin- archicalnetworks([0.90.1;0.10.9])[Clausetetal.,2008]. uous function, such as the forward-backward method, the We then set the pairwise transmission rates of the edges golden section method or the Fibonacci method [Luen- of the networks by drawing samples from α ∼ U(10,5). berger, 1973]. However, in the case of exponential trans- For each type of Kronecker network, we generate 10 net- missionlikelihoods,wecanperformthemaximizationstep workswith256nodesand512edges. Finally,foreachnet- evenmoreefficiently. work,wegenerateasetofcascadesfromtendifferentran- Exponential Transmission. We start by realizing that, in domsourcess∗.Sinceweareinterestedindetectingsource thecaseofexponentialtransmissionfunctions,theapprox- nodesoflargecascades,weonlyconsidersourcenodesthat imatelikelihoodgivenbyEq.8canbeexpressedas triggered at least ten large cascades out of 100 simulated L cascades. Given the size of the networks we experiment (cid:88) φL(ts)= γleβlts, (11) with,weconsideracascadetobelargeifitcontainsmore l=1 than40nodes. Ouraimisthentofindthesourceofalarge whereγl > 0andβl areindependentofts. Then, wecan cascade or small set of large cascades from the infection provethateachpieceoftheapproximatelikelihoodisuni- times of a small (unknown) fraction of all infected nodes. modal(proveninAppendixA): Inallthefollowingexperimentsthesamplesizeis400and 10% of the infected nodes are observed except when it is Lemma1 φL(ts)isuni-modalintsi <ts <tsi+1. explicitlymentioned. Now, we can find the maximum of φ (.) by only evalu- A Toy Example. We first consider a small 64-node hier- L ation of the function on a sequence of points (proven in archicalKroneckernetworkandvisualizetheapproximate AppendixB): likelihood given by Eq. 8 against the number of observed cascadesforeachnodeinthenetwork. Weuse150Monte Lemma2 The maximum point of φ (.) can be Carlo samples. Figure 2 summarizes the results, where L M.Farajtabar,M.Gomez-Rodriguez,N.Du,M.Zamani,H.Zha,L.Song Probability00..681Our NaiveMC OutDeg NetSleuth Pin to’sProbability00..681Our NaiveMC OutDeg NetSleuth Pin to’sess Probability00..681Our NaiveMC OutDeg NetSleuth Pin to’sess Probability00..681Our NaiveMC OutDeg NetSleuth Pin to’s Success 00..24 Success 00..24 (cid:239)10 Succ00..24 (cid:239)10 Succ00..24 p p 01 2 3 4 5 6 7 8 9 10 01 2 3 4 5 6 7 8 9 10 To 01 2 3 4 5 6 7 8 9 10 To 01 2 3 4 5 6 7 8 9 10 Number of Cascades Number of Cascades Number of Cascades Number of Cascades (a)SP,Random (b)SP,Core-Periphery (c)Top-10SP,Random (d)Top-10SP,Core-Periphery Figure3: SuccessProbability(SP)andTop-10SuccessProbability(Top-10SP)fortwotypesofKroneckernetworks. each square represents a node, the true source is marked 0.5 0.5 with a star and the heat map represents normalized like- 0.4 0.4 lihoods in [0,1]. In this toy example, a single cascade is 0.3 0.3 insufficient to detect the true source, since it has a rela- E E S S M M tively low likelihood. However, once more cascades are 0.2 0.2 observed, the likelihood of the true source increases and 0.1 0.1 ultimately become higher than all other nodes for 8 cas- 0 0 1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 cades. Number of Cascades Number of Cascades (a) Random (b) Core-Periphery Accuracy. Next, we evaluate the accuracy of our method in comparison with two state of the art methods, Figure4: Mean-squarederror(MSE)ontheestimationof NETSLEUTH [Aditya Prakash et al., 2012] and Pinto’s t fortwotypesofKroneckernetworks. method[Pintoetal.,2012],andtwobaselinesinlargersyn- s thetic networks. The first baseline runs Montecarlo from andthencomputethetop-1andtop-10successprobability each potential source and ranks them by counting the av- basedonalloutputsforallcascades. Figure3summarizes erage maximum number of observed infected nodes that theresultsfortwotypesofKroneckernetworksagainstthe get infected in a time window equal to the length of ob- number of observed cascades. Our method outperforms servationwindow. Then, itranksthepotentialsourcesac- dramatically all others, achieving a success probability as cording to the average value of this quantity, where the highas0.6andtop-10successprobabilityofalmost1. The node with the highest value is the top node. The second low performance that state-of-the-art methods exhibit, in baseline first finds all potential sources that can reach all comparison with the validation within the corresponding observed infected nodes and then ranks them by decreas- papers, may be explained as follows: in both cases, the ingout-degree,wherethenodewiththehighestout-degree authors validated their algorithms with synthetic and real is the top node. NETSLEUTH assumes the same infection networkswithlargediameters,withoutlong-rangeconnec- probability β over all the edges, which we set to 0.1, fol- tions, such as 2-D grids [Aditya Prakash et al., 2012] and lowing Aditya Prakash et al. [2012]. Pinto’s method sim- spatial(geographical)networks[Pintoetal.,2012],where ilarly assumes that all pairwise transmission times come thesourceidentificationproblemismucheasier. from the same Gaussian distribution and they require its meantobemuchlargerthanitsstandarddeviationinorder Source Infection Time Estimation. We also evaluate toguaranteenonnegativetransmissiontimes.Intheirwork, howaccuratelyourmethodinferstheinfectiontimeofthe theysetµ/σ =4,whereµandσarethemeanandstandard true source by computing the mean square error (MSE), deviation, respectively. Since our fitted diffusion network E (cid:2)(t −tˆ )2(cid:3),estimatedbyrunningourmethodon10 s∗ s∗ s∗ contains edges with different transmission rates and thus different random sources. Here, we do not compare with differentexpectedtransmissiontime,wesettheparameter othercompetitivemethodssincetheydonotprovideanes- µtobetheminimumexpectedvalueoveralltheedges. timate of the infection time of the true source. Figure 4 showstheMSEoftheestimatedinfectiontimesofthetrue Weusedtwomeasuresofaccuracy:successprobabilityand sourceforthesamenetworksasaboveagainstthenumber top-10 success probability. We define success probability asP(sˆ= s∗)andtop-10successprobabilityastheproba- ofcascades. bilitythatthetruesources∗isamongthetop-10intermsof maximumlikelihoodorranking.Foreachnetworktype,we 5.2 ExperimentsonRealData estimatedbothmeasuresbyrunningourmethodon10dif- ferentrandomsourcesets. SinceNETSLEUTHandPinto’s Experimental Setup. We focus on the spread of memes, methodcanonlyacceptoneobservedcascadeatatime,we whichareashorttextualphrases(like,“lipstickonapig”) runthemethodsindependentlyforeachindividualcascade thattravelalmostintactthroughtheWeb[Leskovecetal., BacktothePast:SourceIdentificationinDiffusionNetworksfromPartiallyObservedCascades 0.1 y 0.1 4000 ONauirveMC bilit ONauirveMC Success Probability0.05 ONeuttSDleeguth (cid:239)10 Success Proba0.05 ONeuttSDleeguth 2MSE (days)123000000000 p o 01 2 3 4 5 6 7 8 9 10 T 01 2 3 4 5 6 7 8 9 10 01 2 3 4 5 6 7 8 9 10 Number of Cascades Number of Cascades Number of Cascades (a) SP (b) Top-10SP (c) MSE Figure5:SuccessProbability(SP),Top-10SuccessProbability(Top-10SP)andmean-squarederror(MSE)ontheestima- tionoft forrealcascadedata. s 2009]. We experiment with a large meme dataset6, which randomfromallnodesinthenetworkwouldsucceedwith traces the spread of memes across 1,700 popular main- probability 1/1700 = 5.8 × 10−4, almost 20 times less stream media sites and blogs [Gomez-Rodriguez et al., accurate than our method. A second random guesser that 2013]. The dataset classifies memes per topic, and asso- chooses the source uniformly at random among the nodes ciateseachmememtoaninformationcascadet ,which from whom the observed nodes are reachable would suc- m issimplyarecordoftimeswhensitesfirstmentionedmeme ceedwithprobability1/425=2.4×10−3,almost5times m. We proceed as follows. We first infer an underlying lessaccurate. Thesameargumentfortop-10successprob- diffusionnetworkpertopicusingNETRATE,awell-known abilityshows12timesimprovementinaccuracycompared networkinferencemethod[Gomez-Rodriguezetal.,2011], to naive guesser and 3 times improvement in comparison usingallobservedinformationcascades.Wethenusethese tothemorecleverone. Finally,ourmethod’sMSEvalues inferrednetworksalongwithapercentageoftheinfections indicate that our method is able to find the source infec- √ oflargecascadestoinferthesourceofthesecascades. We tiontimewithinanaccuracyof 2000≈45days. Wefind select15sources,eachofthemhavingatleast10longcas- thisquiteremarkablegiventhatthecascadesweconsidered cades. Here, by long cascade we mean possessing more typicallyunfoldduringa1-yearperiod. than 27 nodes. The results are averaged over 5 runs, ran- domizingtheselectionoftheobservednodes,weconsider 6 CONCLUSIONS that10%oftheinfectednodesareobservedandutilize500 samplestoapproximatethelikelihood. We propose a two-stage framework for detecting the Accuracy.Weevaluatetheaccuracyofourmethodincom- sourceofacascadeincontinuous-timediffusionnetworks, which improves dramatically over previous state-of-the- parisonwithNETSLEUTHandthesamebaselinesasinthe arts in terms of detection accuracy. Our framework cast synthetic experiments, using success probability and top- the problem as a maximum likelihood estimation prob- 10successprobability. Unfortunately,wecannotcompare lem and then find optimal solutions very efficiently us- to Pinto’s method because it requires the identity of the ing an importance sampling approximation to the objec- true parent for each observed node in each cascade, and tive and an optimization procedure that exploits the struc- this is not available in real cascade data. Figure 5 sum- ture of the problem. Our work opens many interesting marizestheresults. Surprisingly,neitherNETSLEUTHnor venues for future work. For example, it would be useful the baselines succeed at detecting cascade sources in real data, even with 10 observed cascades; they output solu- to extend our method to support cascades with multiple sources and other continuous-time models different than tions with an (almost) zero (top-10) success probabilities. the continuous-time independent cascade model [Gomez- In contrast, our method achieves a non-zero (top-10) suc- cess probability as long as we observe more than 8 and 6 Rodriguezetal.,2011]. Also,atheoreticalanalysisofour importance sampling scheme is also interesting. Finally, cascades respectively, a fairly low number of cascades in it would be interesting to apply the current framework to thisscenario. Eventhen,theperformanceofourmethodin otherreal-worlddatasets. termsofsuccessandtop-10successprobabilitymayseem low at first, however, we would like to highlight how dif- ficult the problem we are trying to solve is, by consider- Acknowledgements ing the performance of two simple random guessers. A first random guesser who chooses the source uniformly at This work was supported in part by NSF/NIH BIGDATA 1R01GM108341, NSF IIS-1116886, NSF CAREER IIS- 6Dataisavailableathttp://snap.stanford.edu/infopath/ 1350983andaRaytheonFacultyFellowshiptoL.S. M.Farajtabar,M.Gomez-Rodriguez,N.Du,M.Zamani,H.Zha,L.Song References D. Kedmey. Hackers Leak Explicit Photos of More Than 100Celebrities. Time,2014. B.AdityaPrakash,J.Vrekeen,andC.Faloutsos. Spotting D.Kempe,J.M.Kleinberg,andE.Tardos.Maximizingthe culprits in epidemics: How many and which ones? In SpreadofInfluenceThroughaSocialNetwork.InKDD, ICDM,2012. 2003. L.Backstrom,P.Boldi,M.Rosa,J.Ugander,andS.Vigna. J.Kiefer. Sequentialminimaxsearchforamaximum. Pro- Fourdegreesofseparation. InWebSci,2012. ceedings of the American Mathematical Society, 4(3): N.T.J.Bailey.TheMathematicalTheoryofInfectiousDis- 502–506,1953. eases and its Applications. Hafner Press, 2nd edition, M. Kim and J. Leskovec. The network completion prob- 1975. lem: Inferringmissingnodesandedgesinnetworks. In F.Chierichetti,J.Kleinberg,andD.Liben-Nowell. Recon- SDM,2011. structingPatternsofInformationDiffusionfromIncom- T.Lappas,E.Terzi,D.Gunopulos,andH.Mannila. Find- pleteObservations. InNIPS,2011. ingEffectorsinSocialNetworks. InKDD,2010. A.Clauset,C.Moore,andM.E.J.Newman. Hierarchical J. Leskovec, L. Backstrom, and J. Kleinberg. Meme- Structure and the Prediction of Missing Links in Net- tracking and the dynamics of the news cycle. In KDD, works. Nature,453(7191):98–101,2008. 2009. H. Daneshmand, M. Gomez-Rodriguez, L. Song, and J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, B. Scho¨lkopf. Estimating diffusion network struc- and Z. Ghahramani. Kronecker Graphs: An Approach tures: Recovery conditions, sample complexity & soft- toModelingNetworks. JMLR,2010. thresholdingalgorithm. InICML,2014. D.G.Luenberger.Introductiontolinearandnonlinearpro- N. Du, L. Song, A. Smola, and M. Yuan. Learning Net- gramming, volume 28. Addison-Wesley Reading, MA, worksofHeterogeneousInfluence. InNIPS,2012. 1973. N.Du,L.Song,M.Gomez-Rodriguez,andH.Zha. Scal- F.Morstatter,J.Pfeffer,H.Liu,andK.Carley. Isthesam- able influence estimation in continuous-time diffusion plegoodenough? comparingdatafromtwittersstream- networks. InNIPS,2013a. ingapiwithtwittersfirehose. ICWSM,2013. N. Du, L. Song, H. Woo, and H. Zha. Uncover topic- P.C.Pinto,P.Thiran,andM.Vetterli. Locatingthesource sensitive information diffusion networks. In Artificial ofdiffusioninlarge-scalenetworks. Physicalreviewlet- IntelligenceandStatistics(AISTATS),2013b. ters,109(6):068702,2012. D. Easley and J. Kleinberg. Networks, Crowds, and Mar- S.Sadikov,M.Medina,J.Leskovec,andH.Garcia-Molina. kets:ReasoningAboutaHighlyConnectedWorld.Cam- CorrectingforMissingDatainInformationCascades.In bridgeUniversityPress,2010. WSDM,2011. M. Gomez-Rodriguez and B. Scho¨lkopf. Influence Max- D. Shah and T. Zaman. Detecting sources of com- imization in Continuous Time Diffusion Networks. In puter viruses in networks: theory and experiment. In Proceedings of the 29th International Conference on ACM SIGMETRICS Performance Evaluation Review, MachineLearning,2012a. volume38,pages203–214,2010. M. Gomez-Rodriguez and B. Scho¨lkopf. Submodular In- K. Zhou, L. Song, and H. Zha. Learning social infectiv- ference of Diffusion Networks from Multiple Trees. In ityinsparselow-ranknetworksusingmulti-dimensional Proceedings of the 29th International Conference on hawkesprocesses. InAISTATS,2013a. MachineLearning,2012b. K. Zhou, H. Zha, and L. Song. Learning triggering ker- nelsformulti-dimensionalhawkesprocesses. InICML, M. Gomez-Rodriguez, J. Leskovec, and A. Krause. In- 2013b. ferring Networks of Diffusion and Influence. In KDD, 2010. M.Gomez-Rodriguez,D.Balduzzi,andB.Scho¨lkopf. Un- coveringtheTemporalDynamicsofDiffusionNetworks. InICML,2011. M. Gomez-Rodriguez, J. Leskovec, and B. Scho¨lkopf. StructureandDynamicsofInformationPathwaysinOn- lineMedia. InWSDM,2013. M. Isaac. Nude Photos of Jennifer Lawrence Are Latest FrontinOnlinePrivacyDebate. NewYorkTimes,2014. BacktothePast:SourceIdentificationinDiffusionNetworksfromPartiallyObservedCascades A ProofofLemma1 Supposetherearetwostationarypoints,i.e.,φ(cid:48) (x)=φ(cid:48) (y)=0,thus,bycontinuityofφ (.)in(t ,t )theremustbe L L L si si+1 az ∈(x,y)suchthatφ(cid:48)(cid:48)(z)=0. Weshowitisacontradictionas L L (cid:88) φ(cid:48)L(cid:48)(ts)= γlβl2eβlts >0 (12) l forall1≤l≤L. B ProofofLemma2 Assumewewouldliketofindthemaximizerofφ (.)ininterval(a,b)andconsidertwopointsatone-thirdandtwo-thirdof L theinterval,i.e.,c=a+b−a andd=a+2b−a. Itcanbeeasilyshownthat,ifφ (c)<φ (d),thenthemaximizerwillbe 3 3 L L oninterval(c,b)and,ifφ (c)<φ (d),thenthemaximizermustlieoninterval(a,d). Therefore,bytwoevaluations,we L L canshrinktheintervalcontainingthemaximizerbyafactorof 2. Then,toreachthe(cid:15)-neighborhoodoftherealmaximizer, 3 weneedevaluatethefunction2∗rtimes,where (t −t )(2/3)r <(cid:15). (13) si+1 si Thiswillproveourclaim. C AdditionalExperimentalResults Inthissection,weprovideadditionalexperimentalresultsonsyntheticdata,includinganevaluationoftheperformanceof ourmethodagainstthepercentageofobservedinfectionsandthenumberofMontecarlosamples,aswellasascalability analysis. Performancevs. percentageofobservedinfections. Intuitively,thegreaterthenumberofobservedinfections,themore accurately our method can infer the true source and its infection time. Figure 6 confirms this intuition by showing the successprobabilityagainstpercentageofobservedinfections. However,wealsofindthatthegreateristhepercentageof observedinfections,thesmalleristheeffectofobservingadditionalinfections;adiminishingreturnproperty. 1 1 1 0.8 0.8 0.8 y y y c0.6 c0.6 c0.6 a a a ur ur ur cc0.4 cc0.4 cc0.4 A A A 0.2 0.2 0.2 SP SP SP TOP(cid:239)10 SP TOP(cid:239)10 SP TOP(cid:239)10 SP 00 4 8 12 16 20 00 4 8 12 16 20 00 4 8 12 16 20 Observed Data (%) Observed Data (%) Observed Data (%) (a) Random (b) Hierarchical (c) Core-periphery Figure6: Accuracyvs. %observedinfections. Performancevs.numberofMontecarlosamples.Drawingmoretransmissiontimesamples{τ } leadstoabetter ji (j,i)∈E estimate of Eq. 6, and thus a greater accuracy of our method. Figure 7 shows the success probability against number of samples. Importantly,weobservethataslongasthenumberofsamplesislargeenough,theperformanceofourmethod quicklyflattensanddoesnotdependonthenumberofsamplesanymore. Running time vs. percentage of observed infections. Figure 8 plots the average running time to infer the source of a singlecascadeagainstthepercentageofobservedinfections. Perhapssurprisingly,therunningtimebarelyincreaseswith thepercentageofobservedinfections. Runningtimevs.numberofsamples.Figure9plotstheaveragerunningtimeagainstthenumberofMontecarlosamples usedtoapproximatethelikelihood,Eq.6.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.