Table Of ContentBack to the Past: Source Identification in Diffusion Networks from Partially
Observed Cascades
MehrdadFarajtabar∗ ManuelGomez-Rodriguez† NanDu∗
MohammadZamani(cid:5) HongyuanZha∗ LeSong∗
∗GeorgiaInstituteofTechnology †MPIforSoftwareSystems (cid:5)StonyBrookUniversity
5
1 Abstract paths,investigatorsfoundthattheimageboard4chan4 was
0 theculpritsitewherethephotoswereoriginallypostedon
2 August 31, even though the photos had been taken down
Whenapieceofmaliciousinformationbecomes
n rampantinaninformationdiffusionnetwork,can fromthesitesoonaftertheirpost. Thisleakageofprivate
a pictureshastouchedoffalargerworld-widediscussionand
weidentifythesourcenodethatoriginallyintro-
J
debateonthestateofprivacyandcivillibertiesontheIn-
duced the piece into the network and infer the
6 ternet[Isaac,2014].
time when it initiated this? Being able to do so
2
is critical for curtailing the spread of malicious Can we automatically pinpoint the identity of such mali-
] information,andreducingthepotentiallossesin- cious information sources, as well as the time when they
I
S curred. Thisisaverychallengingproblemsince first posted the malicious information, given historical in-
s. typicallyonlyincompletetracesareobservedand complete diffusion traces? Solving this source identifi-
c we need to unroll the incomplete traces into the cation problem is of outstanding interest in many scenar-
[ past in order to pinpoint the source. In this pa- ios[Lappasetal.,2010]. Forexample,findingpeoplethat
1 per,wetacklethisproblembydevelopingatwo- originate rumors may reduce disinformation, identifying
v stageframework,whichfirstlearnsacontinuous- patientzerosindiseasespreadsmayhelptounderstandand
2 timediffusionnetworkmodelbasedonhistorical controlepidemics,orinferringwhereatrojanorcomputer
8
diffusiontracesandthenidentifiesthesourceof wormisinitiallyreleasedmayincreasereliabilityofcom-
5
6 anincompletediffusiontracebymaximizingthe puternetworks.
0 likelihood of the trace under the learned model.
Related Work. The problem of finding the source of a
. Experiments on both large synthetic and real-
1 diffusion trace, also called cascade, has not been studied
world data show that our framework can effec-
0
until very recently [Lappas et al., 2010, Shah and Zaman,
5 tively “go back to the past”, and pinpoint the
2010,AdityaPrakashetal.,2012,Pintoetal.,2012].How-
1 source node and its initiation time significantly
: moreaccuratelythanpreviousstate-of-the-arts. ever, mostpreviousworkassumesthatacompletesteady-
v
state snapshot of the cascade is observed, in other words,
i
X we know which nodes got infected but not when they did
r 1 INTRODUCTION so. Moreover, previous work uses discrete-time sequen-
a
tial propagation models such as the independent cascade
model [Kempe et al., 2003] or the discrete version of the
OnSeptember2014,acollectionofhundredsofprivatepic-
SIR model [Bailey, 1975], which are difficult to estimate
turesfromvariouscelebrities,mostlyconsistingofwomen
accurately from real world data [Gomez-Rodriguez et al.,
and often containing nudity, were posted online, and later
2011,Duetal.,2012,2013b,Zhouetal.,2013a,b].
disseminated by users on websites and social networks
suchasImgur1,Reddit2andTumblr3[Kedmey,2014]. Af- Onlyveryrecently,Pintoetal.[2012]considerafairlygen-
ter quite some efforts in manual tracing of the diffusion eral continuous-time model and assume that only a small
fraction of sparsely-placed nodes are observed and, if in-
1http://imgur.com/
fected,theirinfectiontimeisobserved.Unfortunately,their
2http://www.reddit.com/
approach requires the distance between observed nodes
3https://www.tumblr.com/
to be large because they approximate the infection times
AppearinginProceedingsofthe18thInternationalConferenceon by Gaussian distributions using the central limit theorem.
ArtificialIntelligenceandStatistics(AISTATS)2015,SanDiego, Since this is easily violated in real social and information
CA, USA. JMLR: W&CP volume 38. Copyright 2015 by the
authors.
4http://www.4chan.org/
BacktothePast:SourceIdentificationinDiffusionNetworksfromPartiallyObservedCascades
3 — Uncertain transmission delay. The spread of informa-
2
2 tion over social and information networks is a stochastic
Ol5ivia Ma3son 3 2 Sop2hia 5 6 pmrioscseiossn.mThoedreelfsortoe,cwaeptnuereedtthoecuonncseidrtearinptryo.baFboilrisetixcatmrapnlse-,
8 ourtoyexampleillustratesthespreadofaparticularrumor
1
Liam andthereforeconsidersasetoffixededgedelays(e.g.,the
7 5 4 Jacob 1 rumor took 5 time units to spread from Sophia to Ethan).
Ethan However, the edge delays are stochastic and possibly dif-
Emma 0 9
7 ferentforeveryparticularrumor(e.g.,adifferentrumorcan
4
takemoreorlessthan5timeunitstospreadfromSophiato
Ethan).Theedgedelaydensities,ortransmissiondensities,
Figure 1: Spread of a rumor in a social network. Each
maydependonparameterslikethecontentoftherumoror
edgeweightisthetimeittookforarumortopassalongthe
theusers’influence.
edge. Solidmagnetedgesindicatetheactualpaththrough
whichtherumorspreads. Greendashededgesarealterna- — Unknown infection path. In large real world networks,
tivewaysinwhichtherumorcouldhavespread.Theinfec- we will often encounter a large number of potential paths
tiontimesofSophiaandLiamareobserved(redsquares); thatmayexplainthespreadofarumorfromasourcenode
theinfectiontimesoftheremainingnodesarehidden(yel- and any other node in the network. In fact, the set of po-
lowsquares).HowcananalgorithmfindthatJacobwasthe tentialpathsincreasesexponentiallywithnetworksizeand
personwhoinitiatedtherumor? network density and even simply counting the number of
pathsrequiresnon-trivialmethods[Gomez-Rodriguezand
Scho¨lkopf,2012a,b,Duetal.,2013a]. Forexample,inour
networks[Backstrometal.,2012],wefinditsperformance toyexample,Liamcanbecomeawareoftherumorthrough
onthistypeofnetworksunderwhelming,asshowninSec- eitherOliviaorEmma.
tion5.
OurApproach. Totacklethesechallenges, weproposea
Challenges. Previous approaches failed successfully two-stagescalableframework: wefirstlearnacontinuous-
addressseveralchallengesofthesourceidentificationprob- timediffusionnetworkmodelbasedonhistoricaldiffusion
lem, which we illustrate next using a toy example, shown tracesandthenidentifythesourceofanincompletediffu-
inFigure1. sion trace by maximizing its likelihood under the learned
model. Thekeyideaofourframeworkistoviewtheprob-
— Partially observed infections. It has become difficult,
lemfromtheperspectiveofgraphicalmodels,andcastthe
ifnotimpossible,tocollectcompletediffusiontraces,and
problemasamaximumlikelihoodestimationproblem,for
track each individual infection in online social and infor-
which we find optimal solutions very efficiently using an
mationnetworks. Thisproblemisexacerbatedbytheneed
importance sampling approximation to the objective and
to develop methods that can provide outputs in (almost)
anoptimizationprocedurethatexploitsthestructureofthe
real-time. For example, Spinn3r5 crawls only a subset of
problem. Additionally, for networks with exponentially
theblogsperiodically; Twitter’sstreamingAPIprovidesa
distributed edge transmission densities, used previously
smallpercentage(1%)ofthefullstreamoftweets[Morstat-
for modeling information propagation [Gomez-Rodriguez
teretal.,2013];Facebookuserstypicallykeeptheiractiv-
etal.,2011],weshowthattheobjectiveisapiece-wiseuni-
ity and profiles private [Sadikov et al., 2011]. It is thus
modalfunctionwithrespecttothesource’sinfectiontime
necessary to develop methods that are robust to missing
anddevelopamoreefficientsearchprocedure.
data [Chierichetti et al., 2011, Kim and Leskovec, 2011,
Sadikov et al., 2011]. Our toy example illustrates this For both synthetic and real-world data, we show that the
challenge by considering the infection times of Liam and framework can effectively “travel back to the past”, and
Sophia to be observed and all other infection times to be pinpointthesourcenodeanditsinfectiontimesignificantly
missing(hiddenorunobserved). moreaccuratelythanothermethods.
—Unknowninfectionstarttime.Inmostreal-worldscenar-
ios, the exact time when a piece of malicious information 2 OURFRAMEWORK
starts spreading is unknown, and thus the observed infec-
tiontimeshaveonlyrelativemeaning. Inourtoyexample, Our framework for solving the source identification prob-
we knowLiam gotinfected 5 time units laterthan Sophia lemconsistsoftwomainstages:itfirstlearnsacontinuous-
butwedonotactuallyobservehowmuchtimehaspassed timediffusionnetworkmodelbasedonhistoricaldiffusion
betweenJacob’sinfection,whichtriggeredthespread,and traces,andthenidentifiesthesourceofanincompletedif-
Sophia’sinfection. fusion trace and its initiation time by maximizing the its
likelihood under the learned model. We start our expo-
5http://spinn3r.com/ sition by revisiting the continuous-time generative model
M.Farajtabar,M.Gomez-Rodriguez,N.Du,M.Zamani,H.Zha,L.Song
for cascade data in social networks introduced in Gomez- etal.,2013b]. Inthiscase,
Rodriguezetal.[2011],Duetal.[2013a]. kτk−1 −(cid:16) τ (cid:17)k −(cid:16) τ (cid:17)k
fji(τ;αji)= αk e αji , Sji(τ;αji)=e αji ,
ji
2.1 Continuous-TimeModelforCascades
where k is a hyperparameter controlling the shape of the
density. This family includes many well-known special
Given a directed contact network, G = (V,E) with N
cases, such as the exponential or Rayleigh distributions,
nodes, a diffusion process begins with an infected source
whichhavealsobeenusedtomodelinformationpropaga-
nodesinitiallyadoptingcertaincontagion(idea,rumoror
tion over information networks [Gomez-Rodriguez et al.,
maliciouspieceofinformation)attimet . Thecontagion
s 2010,Duetal.,2013b].
istransmittedfromthesourcealongherout-goingedgesto
theirdirectneighbors. Eachtransmissionthroughanedge Unfortunately,touseEq.1,allinfectednodesinacascade
entails a random spreading time, τ, drawn from a density needtobefullyobserved. IfweonlyobserveasubsetOof
overtime,f (τ;α ),parametrizedbyatransmissionrate theinfectednodes,thelikelihoodoftheincompletecascade
ji ji
α . Then,theinfectedneighborstransmitthecontagionto iscomputedasfollows,
ji
their respective neighbors, and the process continues. We (cid:90) (cid:89)
p({t } |t )= p(t|t ) dt
assume transmission times are independent and nonnega- i i∈O s s j
tive, in other words, a node cannot be infected by a node Ω j∈H
(cid:90)
iannfeicntfeedctleadtenroindetsimreem;fajiin(τin;fαejcit)ed=fo0rifthτe<ent0i.reMdoifrfeuosvioern, = Ωi∈(cid:89)O∪Hp(ti|{tj}j∈πi)j(cid:89)∈H dtj, (2)
process. Thus, if a node i is infected by multiple neigh-
which essentially marginalize out the time for all hidden
bors, onlytheneighborthatfirstinfectsnodeiwillbethe nodes H over a product space Ω := [t ,∞)|H|. For sim-
s
trueparent.
plicityofnotation,wewillomitthedomainoftheintegra-
The temporal traces left by diffusion processes are often tionintheremainderofthepaper.
called cascades. A cascade t is an N-dimensional vector
Thecomputationoftheincompletelikelihoodisadifficult
t := (t ,...,t ) recording the times when nodes are in-
1 N highdimensionalintegrationproblemforcontinuousvari-
fected, if so, i.e., t ∈ [0,∞], where T is the observation
i ables. We will address this technical challenge using im-
window cut-off and ∞ denotes nodes that did not get in-
portancesamplinginSection3.2.
fected during the observation window. However, as noted
above,inmanyscenarios,weonlyobserveasubsetofthe
infectednodes,O,whilethestateofallothernodes,H,is 2.3 LearningDiffusionNetworks
hidden (we assume the source node s ∈ H). Our aim is
then to find the source of a cascade s from the infection Ourframeworkreliesontheassumptionthatitispossible
times{t } ofthesubsetofinfectednodesO. Figure1 torecordasufficientlylargenumberofhistoricalcascades,
j j∈O
illustratestheobserveddata. C,inordertodiscovertheexistenceofallnodesinthenet-
work,toinferthenetworkstructureaswellasthemodelpa-
rameters,{α }. Wenotethatitisnotnecessarytorecord
2.2 CascadeLikelihood ji
cascades that cover all nodes and edges, but each cascade
hastobefullyobservedtoasufficientlylargetimeperiod.
According to the conditional independence relation pro-
Furthermore,allcascadescollectivelyneedtocovertheen-
posedinthecontinuous-timemodelforcascades,thecom-
tirediffusionnetwork. Underthepreciseconditionsstated
pletelikelihoodofacascadet(forbothobservedandhid-
in [Daneshmand et al., 2014], one can infer the parame-
dennodes)factorizesas
(cid:89) ters of the continuous time model using an (cid:96)1-regularized
p(t|t )= p(t |{t } ) (1)
s i j j∈πi maximumlikelihoodestimationprocedure.
i∈O∪H
where π is the set of parents of i defined by the directed
i
graph G. For the infected nodes, Gomez-Rodriguez et al. 2.4 CascadeSourceIdentificationProblem
[2011]showedthatthelikelihoodcanbefurtherwrittenas
(cid:89) (cid:88) Given a learned diffusion model, our aim is to find the
p(t |{t } )= S(t −t ;α ) H(t −t ;α ),
i j j∈πi i j ji i l li sourcenodesofanincompletecascade,suchthatthelog-
j∈πi l∈πi likelihoodoftheincompletecascadeismaximized. Thus,
where S (τ;α ) = 1−F (τ;α ) is the survival func-
ji ji ji ji weaimtosolve
(cid:82)τ
tion, F (τ;α ) = f (t;α )dt is the cumulative dis-
ji ji 0 ji ji s∗ =argmax max p({t } |t ), (3)
tributionfunction,andHji(τ;αji)= Sfjjii((ττ;;ααjjii)) isthehaz- s∈H ts∈(−∞,mini∈Oti) i i∈O s
ardfunction,orinstantaneousinfectionrate. Wewillfocus where p({t } |t ) is defined in Eq. 2, and we assume
i i∈O s
ontheWeibullfamilyofdistributionsf (τ;α )sincethey that t < min t . If we observe several independent
ji ji s i∈O i
have been shown to fit well real world diffusion data [Du incomplete cascades D, all triggered by the same source
BacktothePast:SourceIdentificationinDiffusionNetworksfromPartiallyObservedCascades
node,wewillmaximizetheirjointlikelihood where we draw L samples from q({η } ,{t } ) to
i i∈O i i∈H
(cid:32) (cid:33) approximatetheintegral. Now,wehaveanapproximation
(cid:89)
s∗ =argmax max L , (4) to L . Next, we explain how to choose the proposal and
c c
s∈H c∈D tcs∈(−∞,minj∈Otcj) theauxiliarydistributions.
where L := p((cid:8)tc(cid:9) |t ). In the following sections,
c j j∈O s
wewilldesignalgorithmstoefficientlyoptimizetheabove 3.2 ChoiceofProposalDistributions
objectiveandpresentexperimentalevaluations.
We define our proposal distribution using the forward-
3 APPROXIMATEOBJECTIVE
generative process of the cascades. Our proposal distri-
FUNCTION
butionq({η } ,{t } )willsamplecascadesfromthe
i i∈O i i∈H
There remain two technical challenges to solve to make learned continuous diffusion network model with s as the
our framework useful in practice. First, the likelihood of source set. One of the interesting properties of this pro-
incompletecascades,givenbyEq.2,isadifficulthighdi- posal distribution is that many terms involving the latent
mensional integration problem over a continuous domain. variables in Eq. 6 will be canceled out and hence the for-
We overcome this difficulty by an approximation algo- mulawillbecomesimpler.
rithm, based on importance sampling, which will greatly
Weremindthereaderthattheindependentcascademodel
simplifytheintegration. Second,theinner-loopmaximiza-
hasausefulshortest-pathproperty[Duetal.,2013a],which
tion over the source timing in Eq. 4 is non-convex. We
allowsustosampletheparents’infectiontimes,{t } ,
solvethisbydesigninganefficientalgorithm,whichfinds i i∈πj
for each node j efficiently for different source infection
theglobalmaximumbyexploitingthepiece-wisestructure
timest . Morespecifically, wefirstsampleasetoftrans-
s
oftheproblem.
mission times {τ } , one per edge, independent of
uv (u,v)∈E
3.1 ImportanceSampling each other. Then, the time t taken to infect a node i is
i
SinceananalyticalevaluationoftheintegralinEq.2isin- simplythelengthoftheshortestpathinG fromthesource
tractable, weturntoaMonteCarloapproximation. Todo s to node i, where the edge weights correspond to the as-
so, in principle, we need to draw samples from the poste- sociated transmission times. Let Qi(s) be the collection
riordistributionoflatentvariables,p({t } |t ,{t } ), of directed paths in G from the source s to node i, where
i i∈H s i i∈O
given the source time ts and the times of the observed eachpathq ∈Qi(s)containsasequenceofdirectededges
nodes,{ti}i∈O. However,itisverychallengingtosample (j,m), and assume the source node is infected at time ts,
fromthisposteriordistribution,andwewillinsteadaddress thenweobtainvariabletivia
theproblembydesigninganefficientimportancesampling t =g (cid:0){τ } |s(cid:1):= min (cid:88) τ , (7)
i i jm (j,m)∈E jl
approach. q∈Qi(s) (j,m)∈q
whereg (·)isthevalueoftheshortest-path.
Morespecifically,wefirstintroduceasetofauxiliaryran- i
dom variables {η } , where each variable corresponds Thisaboverelationiskeytospeeduptheevaluationofthe
i i∈O
tooneobservedinfectednode,withanarbitraryjointprob- sampled likelihood in Eq. 6 for different t values. First,
s
abilitydistributionq˜({η } ). Inthenextsection,wewill the sampled transmission times τ are independent and
i i∈O uv
brieflydiscusshowq˜ischosen. Then, giventheauxiliary thus can be sampled in parallel. Second, we can reuse
distributionwehave the sampled transmission times τ for different t values
uv s
(cid:90)
(cid:89) and sources s, since the transmission times are indepen-
p({t } |t )= p({t } |t ) dt
i i∈O s i i∈O∪H s i dent of t and s. We only need to compute the infection
s
i∈H
(cid:90) (cid:89) (cid:89) timeti foreachnodeusingts =0,andthenforadifferent
= p({t } |t )q˜({η } ) dt dη . (5) valueoft ,theinfectiontimeisjustanoffsetbyt . Third,
i i∈O∪H s i i∈O i∈H ii∈O i the likelihsood of a sampled cascade ((cid:8)ηil(cid:9)i∈O,(cid:8)stli(cid:9)i∈H)
for l = 1,...,L can be simply computed using Eq. 1 as
Second, we introduce the proposal distribution for im- p((cid:8)ηl(cid:9) ,(cid:8)tl(cid:9) |t ), which is independent of the ac-
portance sampling on the auxiliary and hidden variables, i i∈O i i∈H s
tual value of t and depends only on the identity of the
q({η } ,{t } ). Then,theintegralbecomes s
i i∈O i i∈H sourcenodes.
(cid:90) p({t } |t )q˜({η } )
p({t } |t )= i i∈O∪H s i i∈O
i i∈O s q({η } ,{t } )
i i∈O i i∈H 3.3 ChoiceofAuxiliaryDistribution
(cid:89) (cid:89)
q({η } ,{t } ) dt dη
i i∈O i i∈H i i
i∈H i∈O The auxiliary distribution q˜({ηi}i∈O) is chosen to be
≈1 (cid:88)L p({ti}i∈O(cid:8),(cid:8)(cid:9)tli(cid:9)i∈H|ts)q˜((cid:8)ηil(cid:9)i∈O) eaquuxailliartyo dpi(s{trηibi}uit∈ioOn|w{tiil}li∈siHm)p.ly sIanmpoltehecraswcaodredss,frooumr
L q( ηl ,{t } )
l=1 i i∈O i i∈H the learned continuous diffusion network model with
(cid:44)φ (t ), (6) H as the source set. Here, it is is easy to see that
L s
M.Farajtabar,M.Gomez-Rodriguez,N.Du,M.Zamani,H.Zha,L.Song
Algorithm1Oursourcedetectionalgorithm
Require: C,D,L
InfertransmissionratesAfromC using[Daneshmandetal.,2014,Algorithm1].
SampleLsetsoftransmissiontimes{τ } .
ij (i,j)∈E
Computeinfectiontimestˆl , l=1,...,Lassumingt =0usingEq.7.
i∈V s
Computechangepoints: t −tˆl,i∈D,j ∈π ,l=1,...,Landt −tˆl,j ∈M,i∈π ∩O,l=1,...,L.
i j i i j j
fori∈Mdo
t∗ =argmax φ (t )(usinglinesearchmethodorLemma2ineachpiece)
i ts L s
endfor
s∗ =argmax φ (t∗)
i∈M L i
t =max φ (t∗)
s∗ i∈M L i
(cid:82) (cid:81)
p({η } |{t } ) dη = 1. With the above piecesincreasesasO(L∆N),i.e.,linearlyinthenumberof
i i∈O i i∈H i∈O i
choicesfortheproposalandauxiliarydistribution,wecan Monte Carlo samples, the number of observed nodes, and
greatlysimplifytheapproximatelikelihoodinEq.6into the maximum in-degree, ∆, of the observed nodes. Sec-
L ond, within each piece, the maximum of the function can
φ (t )= 1 (cid:88)(cid:89)p(t |(cid:8)tl(cid:9) ,{t } ) (8) befoundefficiently.
L s L i j j∈πi\O j j∈πi∩O
l=1i∈O
p(tl|(cid:8)tl(cid:9) ,{t } )
(cid:89) i j j∈πi\O j j∈πi∩O , 4.1 FindingEachContinuousPiece
(cid:8) (cid:9) (cid:8) (cid:9)
p(tl| tl , ηl )
i∈M i j j∈πi\O j j∈πi∩O In this section, we aim to efficiently find all the change
whereMisthesetofhiddennodeswithobservedvariables
pointst intheapproximatedlikelihoodφ (t ),givenby
asparents, si L s
Eq. 8. In other words, we will efficiently find the left and
M:={i∈H|π ∩O =(cid:54) ∅}, (9)
i rightendpointsofeachofitscontinuouspieces. Here,we
whichistypicallymuchsmallerthantheoverallsetofhid- assume there is a directed path in G from the source s to
dennodes. Itisnoteworthythat,undermildregularitycon- eachoftheobservedinfectednodesO,otherwise,itcannot
ditions,theMonteCarloapproximationoftheintegralwill beasourceforthosenodes,trivially.
converge to the true value with sufficient number of sam-
The key idea to finding all change points is realiz-
ples. However,acleverchoiceoftheproposaldistribution
ing that each piece in Eq. 8 corresponds to a differ-
makestheconvergencefasterandthecomputationmoreef-
ent feasible parents-child configuration. Here, by fea-
ficient.
sible parents, we mean parents that get infected ear-
lier than the child and thus are temporally plausible.
4 MAXIMIZEOBJECTIVEFUNCTION More specifically, given a source s, Eq. 8 is composed
of three types of terms: p(t |(cid:8)tl(cid:9) ,{t } )
Ourobjectivefunction,givenbyEq.4,consistsofaninner and p(tli|(cid:8)tlj(cid:9)j∈πi\O,{tj}j∈πii∩O)j, wj∈hπiic\hOdepejndj∈oπni∩tOhe
andanoutermaximization. Intheinnermaximization,we source time value t , as we will realize shortly, and
s
leveragetheMonteCarlosampleapproximationandsolve thus are responsible for the change point values t , and
max φ (t ). (10) p(tl|(cid:8)tl(cid:9) ,(cid:8)ηl(cid:9) ),whichdoesnotdepseindon
ts L s t ,ibecajusje∈πbio\tOh (cid:8)ηlj(cid:9)j∈πia∩nOd (cid:8)tl(cid:9) are sampled, t
In the outer maximization, we rank all possible source s i i∈O i i∈M s
equally shifts all sampled times and its likelihood is time
nodes, s, in terms of their best starting time t , which is
s
shiftinvariant. Basedonthestructureofthefirsttwotype
thesolutiontotheinnermaximization,andthenselectthe
of terms, it is easy to show that at each change point t ,
topsourcenodeintherankingasouroptimalsource,s∗. si
there is a node j ∈ O ∪ M, observed or hidden, that
Theoutermaximizationisstraightforward,however,thein- changesitssetoffeasibletemporallyplausibleparents,i.e.,
nermaximization,whichconsistsoffindingtheoptimalts aparentofoneobservedorhiddennodebecomes(stopsbe-
thatmaximizesφL(ts),definedinEq.8,mayseemdifficult ing)afeasibleparentattimetsi. Therefore,itisclearthat
atfirst. Althoughitisa1-dimensionalproblem,theobjec- there are O(L∆N) change points, where ∆ is the maxi-
tivefunctionispiece-wisecontinuousandnon-convexwith mumin-degreeofnodes. Next,wedescribeaprocedureto
respecttots. Thisisbecausebyincreasing(ordecreasing) findallchangepointsefficiently.
t , the parent-child relation between nodes may change.
s
However, there are two key properties of φ (t ), which EfficientChangePointEnumeration. Westartbysetting
L s
allow us to carry out the optimization efficiently. First, t = 0 and computing the infection time, denoted as tˆl,
s j
φ (t ) is piece-wise continuous and the number of such for each hidden node j ∈ M and realization l using the
L s
BacktothePast:SourceIdentificationinDiffusionNetworksfromPartiallyObservedCascades
1 found within (cid:15)-neighborhood of t∗ with only
s
0.8 2log(tsi+1−tsi)/log(3/2)evaluationsofφ (.).
0.6 (cid:15) L
0.4
0.2 Furthermore, by utilizing golden section search [Kiefer,
0 1953],onecanfurtherreducethecomplexityoffindingthe
C=1 C=2 C=4 C=8 optimumpointtolog(tsi+1−tsi)/log(1.618)evaluations.
(cid:15)
Figure2:Evolutionoftheproposedmethodwithrespectto
WesummarizetheoverallalgorithminAlgorithm1.
thenumberofcascades.
shortest path property described in Section 3.2. Then, we 5 EXPERIMENTS
find the change points in which an observed node i ∈ O
looses feasible parents by computing the time difference
We evaluate the performance of our method on: (i) syn-
t −tˆl,j ∈ π \O,l = 1,...,L, andthechangepointsin
i j i thetic networks that mimic the structure of social net-
whichahiddennodej ∈Mearnsfeasibleparentsbycom-
worksand(ii)realnetworksinferredfromalargecascade
putingthetimedifferencest −tˆl,i∈π ∩O,l=1,...,L.
i j j dataset,usingawell-knownstate-of-the-artnetworkinfer-
Ifatimedifferenceisnegative,weskipit,sincetheassoci-
ence method [Gomez-Rodriguez et al., 2011]. We show
ated parent will never (always) be feasible, independently
thatourapproachdiscoversthetruesourceofacascadeor
ofthet value.
s setofcascadeswithsurprisinglyhighaccuracyinsynthetic
Additionally, we can compute φ (.) efficiently for each networks and quite often in real networks, given the diffi-
L
change point t , since at each change point t , we will cultyoftheproblem,andsignificantlyoutperformsseveral
si si
onlyneedtorevaluatethecorrespondingtermstothenode baselinesandtwostateoftheartmethods[AdityaPrakash
i ∈ O∪Mthatchangesitssetoffeasibleparents. Inthe etal.,2012,Pintoetal.,2012]. AppendixCprovidesaddi-
caseofexponentialtransmissionlikelihoods,oncewehave tionalexperimentalresults.
computed the likelihood at each change point t , we can
si
re-evaluate it at any time t ∈ [tsi,tsi+1), by multiplying 5.1 ExperimentsonSyntheticData
thecorrespondingtermsintheapproximatedlikelihoodby
et−tsi. Experimental Setup. We generate three types of Kro-
neckernetworks[Leskovecetal.,2010]:(i)core-periphery
4.2 MaximizingwithinEachPiece
networks (parameter matrix: [0.9 0.5; 0.5 0.3]), which
mimic the information diffusion traces in real world net-
Oncewehavedelimitedeachpieceoftheapproximatelike-
works [Gomez-Rodriguez et al., 2010], (ii) random net-
lihood given by Eq. 8, we can find the times t that max-
s
works ([0.5 0.5; 0.5 0.5]), typically used in physics and
imize the likelihood in each piece efficiently, using well-
graph theory [Easley and Kleinberg, 2010], and (iii) hier-
knownline-searchproceduresforone-dimensionalcontin-
archicalnetworks([0.90.1;0.10.9])[Clausetetal.,2008].
uous function, such as the forward-backward method, the
We then set the pairwise transmission rates of the edges
golden section method or the Fibonacci method [Luen-
of the networks by drawing samples from α ∼ U(10,5).
berger, 1973]. However, in the case of exponential trans-
For each type of Kronecker network, we generate 10 net-
missionlikelihoods,wecanperformthemaximizationstep
workswith256nodesand512edges. Finally,foreachnet-
evenmoreefficiently.
work,wegenerateasetofcascadesfromtendifferentran-
Exponential Transmission. We start by realizing that, in domsourcess∗.Sinceweareinterestedindetectingsource
thecaseofexponentialtransmissionfunctions,theapprox- nodesoflargecascades,weonlyconsidersourcenodesthat
imatelikelihoodgivenbyEq.8canbeexpressedas triggered at least ten large cascades out of 100 simulated
L cascades. Given the size of the networks we experiment
(cid:88)
φL(ts)= γleβlts, (11) with,weconsideracascadetobelargeifitcontainsmore
l=1 than40nodes. Ouraimisthentofindthesourceofalarge
whereγl > 0andβl areindependentofts. Then, wecan cascade or small set of large cascades from the infection
provethateachpieceoftheapproximatelikelihoodisuni- times of a small (unknown) fraction of all infected nodes.
modal(proveninAppendixA): Inallthefollowingexperimentsthesamplesizeis400and
10% of the infected nodes are observed except when it is
Lemma1 φL(ts)isuni-modalintsi <ts <tsi+1. explicitlymentioned.
Now, we can find the maximum of φ (.) by only evalu- A Toy Example. We first consider a small 64-node hier-
L
ation of the function on a sequence of points (proven in archicalKroneckernetworkandvisualizetheapproximate
AppendixB): likelihood given by Eq. 8 against the number of observed
cascadesforeachnodeinthenetwork. Weuse150Monte
Lemma2 The maximum point of φ (.) can be Carlo samples. Figure 2 summarizes the results, where
L
M.Farajtabar,M.Gomez-Rodriguez,N.Du,M.Zamani,H.Zha,L.Song
Probability00..681Our NaiveMC OutDeg NetSleuth Pin to’sProbability00..681Our NaiveMC OutDeg NetSleuth Pin to’sess Probability00..681Our NaiveMC OutDeg NetSleuth Pin to’sess Probability00..681Our NaiveMC OutDeg NetSleuth Pin to’s
Success 00..24 Success 00..24 (cid:239)10 Succ00..24 (cid:239)10 Succ00..24
p p
01 2 3 4 5 6 7 8 9 10 01 2 3 4 5 6 7 8 9 10 To 01 2 3 4 5 6 7 8 9 10 To 01 2 3 4 5 6 7 8 9 10
Number of Cascades Number of Cascades Number of Cascades Number of Cascades
(a)SP,Random (b)SP,Core-Periphery (c)Top-10SP,Random (d)Top-10SP,Core-Periphery
Figure3: SuccessProbability(SP)andTop-10SuccessProbability(Top-10SP)fortwotypesofKroneckernetworks.
each square represents a node, the true source is marked 0.5 0.5
with a star and the heat map represents normalized like-
0.4 0.4
lihoods in [0,1]. In this toy example, a single cascade is
0.3 0.3
insufficient to detect the true source, since it has a rela- E E
S S
M M
tively low likelihood. However, once more cascades are 0.2 0.2
observed, the likelihood of the true source increases and 0.1 0.1
ultimately become higher than all other nodes for 8 cas-
0 0
1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10
cades. Number of Cascades Number of Cascades
(a) Random (b) Core-Periphery
Accuracy. Next, we evaluate the accuracy of our
method in comparison with two state of the art methods,
Figure4: Mean-squarederror(MSE)ontheestimationof
NETSLEUTH [Aditya Prakash et al., 2012] and Pinto’s
t fortwotypesofKroneckernetworks.
method[Pintoetal.,2012],andtwobaselinesinlargersyn- s
thetic networks. The first baseline runs Montecarlo from
andthencomputethetop-1andtop-10successprobability
each potential source and ranks them by counting the av-
basedonalloutputsforallcascades. Figure3summarizes
erage maximum number of observed infected nodes that
theresultsfortwotypesofKroneckernetworksagainstthe
get infected in a time window equal to the length of ob-
number of observed cascades. Our method outperforms
servationwindow. Then, itranksthepotentialsourcesac-
dramatically all others, achieving a success probability as
cording to the average value of this quantity, where the
highas0.6andtop-10successprobabilityofalmost1. The
node with the highest value is the top node. The second
low performance that state-of-the-art methods exhibit, in
baseline first finds all potential sources that can reach all
comparison with the validation within the corresponding
observed infected nodes and then ranks them by decreas-
papers, may be explained as follows: in both cases, the
ingout-degree,wherethenodewiththehighestout-degree
authors validated their algorithms with synthetic and real
is the top node. NETSLEUTH assumes the same infection networkswithlargediameters,withoutlong-rangeconnec-
probability β over all the edges, which we set to 0.1, fol-
tions, such as 2-D grids [Aditya Prakash et al., 2012] and
lowing Aditya Prakash et al. [2012]. Pinto’s method sim-
spatial(geographical)networks[Pintoetal.,2012],where
ilarly assumes that all pairwise transmission times come
thesourceidentificationproblemismucheasier.
from the same Gaussian distribution and they require its
meantobemuchlargerthanitsstandarddeviationinorder Source Infection Time Estimation. We also evaluate
toguaranteenonnegativetransmissiontimes.Intheirwork, howaccuratelyourmethodinferstheinfectiontimeofthe
theysetµ/σ =4,whereµandσarethemeanandstandard true source by computing the mean square error (MSE),
deviation, respectively. Since our fitted diffusion network E (cid:2)(t −tˆ )2(cid:3),estimatedbyrunningourmethodon10
s∗ s∗ s∗
contains edges with different transmission rates and thus different random sources. Here, we do not compare with
differentexpectedtransmissiontime,wesettheparameter othercompetitivemethodssincetheydonotprovideanes-
µtobetheminimumexpectedvalueoveralltheedges. timate of the infection time of the true source. Figure 4
showstheMSEoftheestimatedinfectiontimesofthetrue
Weusedtwomeasuresofaccuracy:successprobabilityand
sourceforthesamenetworksasaboveagainstthenumber
top-10 success probability. We define success probability
asP(sˆ= s∗)andtop-10successprobabilityastheproba- ofcascades.
bilitythatthetruesources∗isamongthetop-10intermsof
maximumlikelihoodorranking.Foreachnetworktype,we
5.2 ExperimentsonRealData
estimatedbothmeasuresbyrunningourmethodon10dif-
ferentrandomsourcesets. SinceNETSLEUTHandPinto’s
Experimental Setup. We focus on the spread of memes,
methodcanonlyacceptoneobservedcascadeatatime,we
whichareashorttextualphrases(like,“lipstickonapig”)
runthemethodsindependentlyforeachindividualcascade
thattravelalmostintactthroughtheWeb[Leskovecetal.,
BacktothePast:SourceIdentificationinDiffusionNetworksfromPartiallyObservedCascades
0.1 y 0.1 4000
ONauirveMC bilit ONauirveMC
Success Probability0.05 ONeuttSDleeguth (cid:239)10 Success Proba0.05 ONeuttSDleeguth 2MSE (days)123000000000
p
o
01 2 3 4 5 6 7 8 9 10 T 01 2 3 4 5 6 7 8 9 10 01 2 3 4 5 6 7 8 9 10
Number of Cascades Number of Cascades Number of Cascades
(a) SP (b) Top-10SP (c) MSE
Figure5:SuccessProbability(SP),Top-10SuccessProbability(Top-10SP)andmean-squarederror(MSE)ontheestima-
tionoft forrealcascadedata.
s
2009]. We experiment with a large meme dataset6, which randomfromallnodesinthenetworkwouldsucceedwith
traces the spread of memes across 1,700 popular main- probability 1/1700 = 5.8 × 10−4, almost 20 times less
stream media sites and blogs [Gomez-Rodriguez et al., accurate than our method. A second random guesser that
2013]. The dataset classifies memes per topic, and asso- chooses the source uniformly at random among the nodes
ciateseachmememtoaninformationcascadet ,which from whom the observed nodes are reachable would suc-
m
issimplyarecordoftimeswhensitesfirstmentionedmeme ceedwithprobability1/425=2.4×10−3,almost5times
m. We proceed as follows. We first infer an underlying lessaccurate. Thesameargumentfortop-10successprob-
diffusionnetworkpertopicusingNETRATE,awell-known abilityshows12timesimprovementinaccuracycompared
networkinferencemethod[Gomez-Rodriguezetal.,2011], to naive guesser and 3 times improvement in comparison
usingallobservedinformationcascades.Wethenusethese tothemorecleverone. Finally,ourmethod’sMSEvalues
inferrednetworksalongwithapercentageoftheinfections indicate that our method is able to find the source infec-
√
oflargecascadestoinferthesourceofthesecascades. We tiontimewithinanaccuracyof 2000≈45days. Wefind
select15sources,eachofthemhavingatleast10longcas- thisquiteremarkablegiventhatthecascadesweconsidered
cades. Here, by long cascade we mean possessing more typicallyunfoldduringa1-yearperiod.
than 27 nodes. The results are averaged over 5 runs, ran-
domizingtheselectionoftheobservednodes,weconsider
6 CONCLUSIONS
that10%oftheinfectednodesareobservedandutilize500
samplestoapproximatethelikelihood.
We propose a two-stage framework for detecting the
Accuracy.Weevaluatetheaccuracyofourmethodincom- sourceofacascadeincontinuous-timediffusionnetworks,
which improves dramatically over previous state-of-the-
parisonwithNETSLEUTHandthesamebaselinesasinthe
arts in terms of detection accuracy. Our framework cast
synthetic experiments, using success probability and top-
the problem as a maximum likelihood estimation prob-
10successprobability. Unfortunately,wecannotcompare
lem and then find optimal solutions very efficiently us-
to Pinto’s method because it requires the identity of the
ing an importance sampling approximation to the objec-
true parent for each observed node in each cascade, and
tive and an optimization procedure that exploits the struc-
this is not available in real cascade data. Figure 5 sum-
ture of the problem. Our work opens many interesting
marizestheresults. Surprisingly,neitherNETSLEUTHnor
venues for future work. For example, it would be useful
the baselines succeed at detecting cascade sources in real
data, even with 10 observed cascades; they output solu- to extend our method to support cascades with multiple
sources and other continuous-time models different than
tions with an (almost) zero (top-10) success probabilities.
the continuous-time independent cascade model [Gomez-
In contrast, our method achieves a non-zero (top-10) suc-
cess probability as long as we observe more than 8 and 6 Rodriguezetal.,2011]. Also,atheoreticalanalysisofour
importance sampling scheme is also interesting. Finally,
cascades respectively, a fairly low number of cascades in
it would be interesting to apply the current framework to
thisscenario. Eventhen,theperformanceofourmethodin
otherreal-worlddatasets.
termsofsuccessandtop-10successprobabilitymayseem
low at first, however, we would like to highlight how dif-
ficult the problem we are trying to solve is, by consider- Acknowledgements
ing the performance of two simple random guessers. A
first random guesser who chooses the source uniformly at This work was supported in part by NSF/NIH BIGDATA
1R01GM108341, NSF IIS-1116886, NSF CAREER IIS-
6Dataisavailableathttp://snap.stanford.edu/infopath/ 1350983andaRaytheonFacultyFellowshiptoL.S.
M.Farajtabar,M.Gomez-Rodriguez,N.Du,M.Zamani,H.Zha,L.Song
References D. Kedmey. Hackers Leak Explicit Photos of More Than
100Celebrities. Time,2014.
B.AdityaPrakash,J.Vrekeen,andC.Faloutsos. Spotting
D.Kempe,J.M.Kleinberg,andE.Tardos.Maximizingthe
culprits in epidemics: How many and which ones? In
SpreadofInfluenceThroughaSocialNetwork.InKDD,
ICDM,2012.
2003.
L.Backstrom,P.Boldi,M.Rosa,J.Ugander,andS.Vigna.
J.Kiefer. Sequentialminimaxsearchforamaximum. Pro-
Fourdegreesofseparation. InWebSci,2012.
ceedings of the American Mathematical Society, 4(3):
N.T.J.Bailey.TheMathematicalTheoryofInfectiousDis- 502–506,1953.
eases and its Applications. Hafner Press, 2nd edition,
M. Kim and J. Leskovec. The network completion prob-
1975.
lem: Inferringmissingnodesandedgesinnetworks. In
F.Chierichetti,J.Kleinberg,andD.Liben-Nowell. Recon- SDM,2011.
structingPatternsofInformationDiffusionfromIncom-
T.Lappas,E.Terzi,D.Gunopulos,andH.Mannila. Find-
pleteObservations. InNIPS,2011.
ingEffectorsinSocialNetworks. InKDD,2010.
A.Clauset,C.Moore,andM.E.J.Newman. Hierarchical
J. Leskovec, L. Backstrom, and J. Kleinberg. Meme-
Structure and the Prediction of Missing Links in Net-
tracking and the dynamics of the news cycle. In KDD,
works. Nature,453(7191):98–101,2008.
2009.
H. Daneshmand, M. Gomez-Rodriguez, L. Song, and
J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos,
B. Scho¨lkopf. Estimating diffusion network struc-
and Z. Ghahramani. Kronecker Graphs: An Approach
tures: Recovery conditions, sample complexity & soft-
toModelingNetworks. JMLR,2010.
thresholdingalgorithm. InICML,2014.
D.G.Luenberger.Introductiontolinearandnonlinearpro-
N. Du, L. Song, A. Smola, and M. Yuan. Learning Net- gramming, volume 28. Addison-Wesley Reading, MA,
worksofHeterogeneousInfluence. InNIPS,2012. 1973.
N.Du,L.Song,M.Gomez-Rodriguez,andH.Zha. Scal- F.Morstatter,J.Pfeffer,H.Liu,andK.Carley. Isthesam-
able influence estimation in continuous-time diffusion plegoodenough? comparingdatafromtwittersstream-
networks. InNIPS,2013a. ingapiwithtwittersfirehose. ICWSM,2013.
N. Du, L. Song, H. Woo, and H. Zha. Uncover topic- P.C.Pinto,P.Thiran,andM.Vetterli. Locatingthesource
sensitive information diffusion networks. In Artificial ofdiffusioninlarge-scalenetworks. Physicalreviewlet-
IntelligenceandStatistics(AISTATS),2013b. ters,109(6):068702,2012.
D. Easley and J. Kleinberg. Networks, Crowds, and Mar- S.Sadikov,M.Medina,J.Leskovec,andH.Garcia-Molina.
kets:ReasoningAboutaHighlyConnectedWorld.Cam- CorrectingforMissingDatainInformationCascades.In
bridgeUniversityPress,2010. WSDM,2011.
M. Gomez-Rodriguez and B. Scho¨lkopf. Influence Max- D. Shah and T. Zaman. Detecting sources of com-
imization in Continuous Time Diffusion Networks. In puter viruses in networks: theory and experiment. In
Proceedings of the 29th International Conference on ACM SIGMETRICS Performance Evaluation Review,
MachineLearning,2012a. volume38,pages203–214,2010.
M. Gomez-Rodriguez and B. Scho¨lkopf. Submodular In- K. Zhou, L. Song, and H. Zha. Learning social infectiv-
ference of Diffusion Networks from Multiple Trees. In ityinsparselow-ranknetworksusingmulti-dimensional
Proceedings of the 29th International Conference on hawkesprocesses. InAISTATS,2013a.
MachineLearning,2012b. K. Zhou, H. Zha, and L. Song. Learning triggering ker-
nelsformulti-dimensionalhawkesprocesses. InICML,
M. Gomez-Rodriguez, J. Leskovec, and A. Krause. In-
2013b.
ferring Networks of Diffusion and Influence. In KDD,
2010.
M.Gomez-Rodriguez,D.Balduzzi,andB.Scho¨lkopf. Un-
coveringtheTemporalDynamicsofDiffusionNetworks.
InICML,2011.
M. Gomez-Rodriguez, J. Leskovec, and B. Scho¨lkopf.
StructureandDynamicsofInformationPathwaysinOn-
lineMedia. InWSDM,2013.
M. Isaac. Nude Photos of Jennifer Lawrence Are Latest
FrontinOnlinePrivacyDebate. NewYorkTimes,2014.
BacktothePast:SourceIdentificationinDiffusionNetworksfromPartiallyObservedCascades
A ProofofLemma1
Supposetherearetwostationarypoints,i.e.,φ(cid:48) (x)=φ(cid:48) (y)=0,thus,bycontinuityofφ (.)in(t ,t )theremustbe
L L L si si+1
az ∈(x,y)suchthatφ(cid:48)(cid:48)(z)=0. Weshowitisacontradictionas
L
L
(cid:88)
φ(cid:48)L(cid:48)(ts)= γlβl2eβlts >0 (12)
l
forall1≤l≤L.
B ProofofLemma2
Assumewewouldliketofindthemaximizerofφ (.)ininterval(a,b)andconsidertwopointsatone-thirdandtwo-thirdof
L
theinterval,i.e.,c=a+b−a andd=a+2b−a. Itcanbeeasilyshownthat,ifφ (c)<φ (d),thenthemaximizerwillbe
3 3 L L
oninterval(c,b)and,ifφ (c)<φ (d),thenthemaximizermustlieoninterval(a,d). Therefore,bytwoevaluations,we
L L
canshrinktheintervalcontainingthemaximizerbyafactorof 2. Then,toreachthe(cid:15)-neighborhoodoftherealmaximizer,
3
weneedevaluatethefunction2∗rtimes,where
(t −t )(2/3)r <(cid:15). (13)
si+1 si
Thiswillproveourclaim.
C AdditionalExperimentalResults
Inthissection,weprovideadditionalexperimentalresultsonsyntheticdata,includinganevaluationoftheperformanceof
ourmethodagainstthepercentageofobservedinfectionsandthenumberofMontecarlosamples,aswellasascalability
analysis.
Performancevs. percentageofobservedinfections. Intuitively,thegreaterthenumberofobservedinfections,themore
accurately our method can infer the true source and its infection time. Figure 6 confirms this intuition by showing the
successprobabilityagainstpercentageofobservedinfections. However,wealsofindthatthegreateristhepercentageof
observedinfections,thesmalleristheeffectofobservingadditionalinfections;adiminishingreturnproperty.
1 1 1
0.8 0.8 0.8
y y y
c0.6 c0.6 c0.6
a a a
ur ur ur
cc0.4 cc0.4 cc0.4
A A A
0.2 0.2 0.2
SP SP SP
TOP(cid:239)10 SP TOP(cid:239)10 SP TOP(cid:239)10 SP
00 4 8 12 16 20 00 4 8 12 16 20 00 4 8 12 16 20
Observed Data (%) Observed Data (%) Observed Data (%)
(a) Random (b) Hierarchical (c) Core-periphery
Figure6: Accuracyvs. %observedinfections.
Performancevs.numberofMontecarlosamples.Drawingmoretransmissiontimesamples{τ } leadstoabetter
ji (j,i)∈E
estimate of Eq. 6, and thus a greater accuracy of our method. Figure 7 shows the success probability against number of
samples. Importantly,weobservethataslongasthenumberofsamplesislargeenough,theperformanceofourmethod
quicklyflattensanddoesnotdependonthenumberofsamplesanymore.
Running time vs. percentage of observed infections. Figure 8 plots the average running time to infer the source of a
singlecascadeagainstthepercentageofobservedinfections. Perhapssurprisingly,therunningtimebarelyincreaseswith
thepercentageofobservedinfections.
Runningtimevs.numberofsamples.Figure9plotstheaveragerunningtimeagainstthenumberofMontecarlosamples
usedtoapproximatethelikelihood,Eq.6.