Efficient Information Flow Maximization in Probabilistic Graphs

Christian Frey¹, Andreas Züfle², Tobias Emrich¹, Matthias Renz³
¹ Institute for Informatics, Ludwig-Maximilians-Universität München
² Dept. of Geography and Geoinformation Science, George Mason University
³ Dept. of Computational and Data Sciences, George Mason University

arXiv:1701.05395v2 [cs.DS] 6 Feb 2017

ABSTRACT

Reliable propagation of information through large networks, e.g. communication networks, social networks or sensor networks, is very important in many applications concerning marketing, social networks, and wireless sensor networks. However, social ties of friendship may be obsolete, and communication links may fail, inducing the notion of uncertainty in such networks. In this paper, we address the problem of optimizing information propagation in uncertain networks given a constrained budget of edges. We show that this problem requires solving two NP-hard subproblems: the computation of expected information flow, and the optimal choice of edges. To compute the expected information flow to a source vertex, we propose the F-tree as a specialized data structure that identifies independent components of the graph for which the information flow can either be computed analytically and efficiently, or for which traditional Monte-Carlo sampling can be applied independently of the remaining network. For the problem of finding the optimal edges, we propose a series of heuristics that exploit properties of this data structure. Our evaluation shows that these heuristics lead to high-quality solutions, thus yielding high information flow, while maintaining low run-time.

1. INTRODUCTION

Nowadays, social and communication networks have become ubiquitous in our daily life to receive and share information. Whenever we are navigating the World Wide Web, updating our social network profiles, or sending a text message on our cell-phone, we participate in an information network as a node. In such networks, network nodes exchange some sort of information: In social networks, users share their opinions and ideas, aiming to convince others. In wireless sensor networks, nodes collect data and aim to ensure that this data is propagated through the network: either to a destination, such as a server node, or simply to as many other nodes as possible. Abstractly speaking, in all of these networks, nodes aim at propagating their information, or their belief, throughout the network. The event of a successful propagation of information between nodes is subject to inherent uncertainty. In a wireless sensor, telecommunication or electrical network, a link can be unreliable and may fail with a certain probability [10, 32]. In a social network, trust and influence issues may impact the likelihood of social interactions or the likelihood of convincing another of an individual's idea [11, 18, 1]. The probabilistic graph model is commonly used to address such scenarios in a unified way (e.g. [25, 28, 15, 29, 41, 40]). In this model, each edge is associated with an existential probability to quantify the likelihood that this edge exists in the graph. Traditionally, to maximize the likelihood of a successful communication between two nodes, information is propagated by flooding it through the network. Thus, every node that receives a bit of information will proceed to share this information with all its neighbors. Clearly, such a flooding approach is not applicable for large communication networks and for social networks, as the communication between two network nodes incurs a cost: Sensor network nodes, e.g. in micro-sensor networks, have limited computing capability, memory resources and power supply, require battery power to send, receive and forward messages, and are also limited by their bandwidth; individuals of a social network require time and sometimes even additional monetary resources to convince others of their ideas.

In this work, we address the following problem: We are given a probabilistic network graph $\mathcal{G}$ whose edges can either be activated for communication, i.e. enabled to transfer information, or stay inactive. The problem is to send/receive information from a single node $Q$ in $\mathcal{G}$ to/from as many nodes in $\mathcal{G}$ as possible, assuming a limited budget of edges that can be activated. To solve this problem, the main focus is on the selection of edges to be activated.

[Figure 1: Running example. Panels: (a) original graph, (b) Dijkstra MST, (c) optimal five-edge flow, (d) possible world $g_1$.]

Example 1. To illustrate our problem setting, consider the network depicted in Figure 1(a). The task is to maximize the information flow to node $Q$ from other nodes given a limited budget of edges to be used. In contrast to the general problem defined later, this example assumes equal weights of all nodes. Each edge of the network is labeled with a probability value denoting the probability of a successful communication. A straightforward solution to this problem is to activate all edges. Assuming each node to have one unit of information, the expected information flow of this solution can be shown to be $\simeq 2.51$. While maximizing the information flow, this solution incurs the maximum possible communication cost. A traditional trade-off between these single-objective solutions is using a probability-maximizing Dijkstra spanning tree, as depicted in Figure 1(b). The expected information flow in this setting can be shown to aggregate to 1.59 units, while requiring six edges to be activated. Yet, it can be shown that the solution depicted in Figure 1(c) dominates this solution: Only five edges are used, thus further reducing the communication cost, while achieving a higher expected information flow of $\simeq 2.02$ units of information to $Q$.

The aim of this work is to efficiently find a near-optimal sub-network, which maximizes the expected flow of information at a constrained budget of edges. In Example 1, we computed the information flow for various example graphs. But in fact, this computation has been shown to be exponentially hard in the number of edges of the graph, and thus impractical to solve analytically. Furthermore, the optimal selection of edges to maximize the information flow is shown to be NP-hard. These two subproblems define the main computational challenges addressed in this work.

To tackle these challenges, the remainder of this work is organized as follows. After a survey of related work in Section 2, we recapitulate common definitions for stochastic networks and formally define our problem setting in Section 3. After a more detailed technical overview in Section 4, the theoretical heart of this work is presented in Section 5. We show how to identify independent subgraphs, for which the information flow can be computed independently. This allows us to divide the main problem into much smaller subproblems. To conquer these subproblems, we identify cases for which the expected information flow can be computed analytically, and we propose to employ Monte-Carlo sampling to approximate the information flow of the remaining cases. Section 5.3 is the algorithmic core of our work, showing how the aforementioned independent components can be organized hierarchically in an F-tree, which is inspired by the block-cut tree [36, 14, 38]. This structure allows us to aggregate results of individual components efficiently, and we show how previous Monte-Carlo sampling results can be re-used as more edges are selected and activated. Our experimental evaluation in Section 7 shows that our algorithms significantly outperform traditional solutions, in terms of combined communication cost and information flow, on synthetic and real stochastic networks. In summary, the main contributions of this work are:

• Theoretical complexity study of the flow maximization problem in probabilistic graphs.
• Efficient estimation of the expected information flow based on network graph decomposition and Monte-Carlo sampling.
• Our F-tree structure enabling efficient organization of independent graph components and (local) intermediate results for efficient expected flow computation.
• An algorithm for iterative selection of edges to be activated to maximize the expected information flow.
• Thorough experimental evaluation of the proposed methods and algorithms.

2. RELATED WORK

Mining probabilistic graphs (a.k.a. uncertain graphs) has recently attracted much attention in the data mining and database research communities [19, 39, 20, 29]. We summarize state-of-the-art publications and relate our work in this context.

Subgraph Reliability. A related and fundamental problem in uncertain graph mining is the so-called subgraph reliability problem, which asks to estimate the probability that two given (sets of) nodes are reachable. This problem, well studied in the context of communication networks, has seen a recent revival in the database community due to the need for scalable solutions for big networks. Specific problem formulations in this class ask to measure the probability that two specific nodes are connected (two-terminal reliability [2]), all nodes in the network are pairwise connected (all-terminal reliability [34]), or all nodes in a given subset are pairwise connected (k-terminal reliability [13, 12]). Extending these reliability queries, where source and sink node(s) are specified, the corresponding graph mining problem is to find, for a given probabilistic graph, the set of most reliable k-terminal subgraphs [16]. All these problem definitions have in common that the set of nodes to be reached is predefined, and that there is no degree of freedom in the number of activated edges - thus all nodes are assumed to attempt to communicate with all their neighbors, which we argue can be overly expensive in many applications.

Reliability Bounds. Several lower bounds on (two-terminal) reliability have been defined in the context of communication networks [3, 4, 9, 30]. Such bounds could be used in place of our sampling approach, to estimate the information gain obtained by adding a network edge to the current active set. However, for all these bounds, the computational complexity to obtain them is at least quadratic in the number of network nodes, making these bounds infeasible for large networks. Very simple but efficient bounds have been presented in [19], such as using the most probable path between two nodes as a lower bound of their two-terminal reliability. However, the number of possible (non-circular) paths is exponentially large in the number of edges of a graph, such that, in practice, even the most probable path will have a negligible probability, thus yielding a useless lower bound. Thus, since none of these probability bounds are sufficiently effective and efficient for practical use, we decided to directly use a sampling approach for parts of the graph where no exact inference is possible.

Influential Nodes. Existing work motivated by applications in marketing provides methods to detect influential members within a social network. This can help to promote a new product. The task is to detect nodes, i.e. persons, where the chance that the product is recommended to a broad range of connected people is maximized. In [6], [31] a framework is provided which considers the interactions between the persons in a probabilistic model. As the problem of finding the most influential vertices is NP-hard, approximation algorithms are used in [18], outperforming basic heuristics based on degree centrality and distance centrality, which are traditionally applied in social networks. This branch of research has in common that the task is to activate a constrained number of nodes to maximize the information flow, whereas our problem definition constrains the number of activated edges for a single specified query/sink node.

Reliable Paths. In mobile ad-hoc networks, the uncertainty of an edge can be interpreted as the connectivity between two nodes. Thus, an important problem in this field is to maximize the probability that two nodes are connected for a constrained budget of edges [10]. The main difference to our work is that [10] maximizes the information flow to a single destination, rather than the information flow in general. The heuristics of [10] cannot be applied directly to our problem, since clearly, maximizing the flow to one node may be detrimental to the flow to another node.

Bi-connected components. The F-tree that we propose in this work is inspired by the block-cut tree [36, 14, 38]. The main difference is that our approach aims at finding cyclic subgraphs, where nodes are bi-connected. For subgraphs having a size of at least three vertices, this problem is equivalent to finding bi-connected subgraphs, which is solved in [36, 14, 38]. Thus, our proposed data structure treats bi-connected subgraphs of size less than three separately, grouping them together as mono-connected components. More importantly, this existing work does not show how to compute, estimate and propagate probabilistic information through the structure, which is the main contribution of this work.
3. PROBLEM DEFINITION

A probabilistic undirected graph is given by $\mathcal{G} = (V, E, W, P)$, where $V$ is a set of vertices, $E \subseteq V \times V$ is a set of edges, $W : V \mapsto \mathbb{R}^+$ is a function that maps each vertex to a positive value representing the information weight of the corresponding vertex, and $P : E \mapsto (0, 1]$ is a function that maps each edge to its corresponding probability of existing in $\mathcal{G}$. In the following, it is assumed that the existence of different edges is independent of one another. Let us note that our approach also applies to other models such as the conditional probability model [29], as long as a computational method for an unbiased drawing of samples of the probabilistic graph is available.

In a probabilistic graph $\mathcal{G}$, the existence of each edge is a random variable. Thus, the topology of $\mathcal{G}$ is a random variable, too. The sample space of this random variable is the set of all possible graphs. A possible graph $g = (V_g, E_g)$ of a probabilistic graph $\mathcal{G}$ is a deterministic graph which is a possible outcome of the random variables representing the edges of $\mathcal{G}$. The graph $g$ contains a subset of the edges of $\mathcal{G}$, i.e., $E_g \subseteq E$. The total number of such possible graphs is $2^{|E_{<1}|}$, where $|E_{<1}|$ represents the number of edges $e \in E$ having $P(e) < 1$, because for each such edge, we have two cases as to whether or not that edge is present in the graph. We let $\mathcal{W}$ denote the set of all possible graphs. The probability of sampling the graph $g$ from the random variables representing the probabilistic graph $\mathcal{G}$ is given by the following sampling or realization probability $\Pr(g)$:

$$\Pr(g) = \prod_{e \in E_g} P(e) \cdot \prod_{e \in E \setminus E_g} (1 - P(e)). \qquad (1)$$

Figure 1(a) shows an example of a probabilistic graph $\mathcal{G}$ and its possible realization $g_1$ in Figure 1(d). This probabilistic graph has $2^{10} = 1024$ possible worlds. Using Equation 1, the probability of world $g_1$ is given by:

$$\Pr(g_1) = 0.6 \cdot 0.5 \cdot 0.8 \cdot 0.4 \cdot 0.4 \cdot 0.5 \cdot (1-0.1) \cdot (1-0.3) \cdot (1-0.4) \cdot (1-0.1) = 0.00653184$$

Definition 1 (Path). Let $\mathcal{G} = (V, E, W, P)$ be a probabilistic graph and let $v_a, v_b \in V$ be two nodes such that $v_a \neq v_b$. An (acyclic) path $path(v_a, v_b) = v_a, v_1, v_2, \ldots, v_b$ is a sequence of vertices, such that $\forall v_i \in path(v_a, v_b): v_i \in V$ and $\forall v_i, v_j \in path(v_a, v_b): v_i \neq v_j$.

Definition 2 (Reachability). The network reachability problem as defined in [15, 5] computes the likelihood of the binomial random variable $\updownarrow(i, j, \mathcal{G})$ of two nodes $i, j \in V$ being connected in $\mathcal{G}$, formally:

$$P(\updownarrow(i, j, \mathcal{G})) := \sum_{g \in \mathcal{W}} \prod_{e \in E_g} P(e) \cdot \prod_{e \in E \setminus E_g} (1 - P(e)) \cdot \updownarrow(i, j, g),$$

where $\updownarrow(i, j, g)$ is an indicator function that returns one if there exists a path between nodes $i$ and $j$ in the (deterministic) possible graph $g$, and zero otherwise.

For a given query node $Q$, our aim is to optimize the information gain, which is defined as the total weight of nodes reachable from $Q$.

Definition 3 (Expected Information Flow). Let $Q \in V$ be a node and let $\mathcal{G} = (V, E, W, P)$ be a probabilistic graph, then $flow(Q, \mathcal{G})$ denotes the random variable of the sum of vertex weights of all nodes in $V$ reachable from $Q$, formally:

$$flow(Q, \mathcal{G}) := \sum_{v \in V} P(\updownarrow(Q, v, \mathcal{G})) \cdot W(v).$$

Due to linearity of expectations, and exploiting that $W(v)$ is deterministic, we can compute the expectation $E(flow(Q, \mathcal{G}))$ of this random variable as

$$E(flow(Q, \mathcal{G})) = E\Big(\sum_{v \in V} P(\updownarrow(Q, v, \mathcal{G})) \cdot W(v)\Big) = \sum_{v \in V} E\big(P(\updownarrow(Q, v, \mathcal{G}))\big) \cdot W(v) \qquad (2)$$

Given our definition of Expected Information Flow in Equation 2, we can now state our formal problem definition of optimizing the expected information flow of a probabilistic graph $\mathcal{G}$ for a constrained budget of edges.

Definition 4 (Maximum Expected Information Flow). Let $\mathcal{G} = (V, E, W, P)$ be a probabilistic graph, let $Q \in V$ be a query node and let $k$ be a non-negative integer. The Maximum Expected Information Flow

$$MaxFlow(\mathcal{G}, Q, k) = \operatorname*{argmax}_{G = (V, E' \subseteq E, W, P),\ |E'| \leq k} E(flow(Q, G)), \qquad (3)$$

is the subgraph of $\mathcal{G}$ maximizing the information flow to $Q$, constrained to having at most $k$ edges.

Computing $MaxFlow(\mathcal{G}, Q, k)$ efficiently requires overcoming two NP-hard subproblems. First, the computation of the expected information flow $E(flow(Q, G))$ to vertex $Q$ for a given probabilistic graph $G$ is NP-hard as shown in [5]. In addition, the problem of selecting the optimal set of $k$ edges to maximize the information flow $MaxFlow(\mathcal{G}, Q, k)$ is an NP-hard problem in itself, as shown in the following.

Theorem 1. Even if the Expected Information Flow $flow(Q, G)$ to a vertex $Q$ can be computed in $O(1)$ for any probabilistic graph $G$, the problem of finding $MaxFlow(\mathcal{G}, Q, k)$ is still NP-hard.

Proof. A formal proof can be found in the appendix in Section 10.

4. ROADMAP

To compute $MaxFlow(\mathcal{G}, Q, k)$, we first need an efficient solution to approximate the reachability probability $E(\updownarrow(Q, v, \mathcal{G}))$ from $Q$ to/from a single node $v$. Since this problem can be shown to be #P-hard, Section 5.3 presents an approximation technique which exploits stochastic independencies between branches of a spanning tree of subgraph $G$ rooted at $Q$. This technique allows us to aggregate independent subgraphs of $G$ efficiently, while exploiting a sampling solution for components of the graph $MaxFlow(\mathcal{G}, Q, k)$ that contain cycles.

Once we can efficiently approximate the flow $E(\updownarrow(Q, v, \mathcal{G}))$ from $Q$ to each node $v \in V$, we next tackle the problem of efficiently finding a subgraph $MaxFlow(\mathcal{G}, Q, k)$ that yields a near-optimal expected information flow given a budget of $k$ edges in Section 6. Due to the theoretical result of Theorem 1, we propose heuristics to choose $k$ edges from $\mathcal{G}$. Finally, our experiments in Section 7 support our theoretical intuition that our solutions for the two aforementioned subproblems synergize: An optimal subgraph will choose a budget of $k$ edges in a tree-like fashion, to reach large parts of the probabilistic graph. At the same time, our solutions exploit tree-like subgraphs for efficient probability computation.
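As a quick check of Equation 1 and Definition 2, the following minimal sketch recomputes the probability of world $g_1$ from the ten edge probabilities quoted in Section 3, and evaluates a reachability probability by brute-force enumeration of possible worlds. The helper names (`world_probability`, `connected`, `reachability_probability`) and the three-vertex triangle graph are illustrative assumptions and do not come from the paper. The enumeration is exponential in the number of edges, so it is only meant for tiny graphs.

```python
from itertools import product


def world_probability(present, absent):
    """Equation 1: product of P(e) over existing edges times (1 - P(e)) over missing ones."""
    pr = 1.0
    for p in present:
        pr *= p
    for p in absent:
        pr *= 1.0 - p
    return pr


def connected(edges, a, b):
    """Return True if a and b are linked in the deterministic graph given by `edges`."""
    seen, stack = {a}, [a]
    while stack:
        u = stack.pop()
        for x, y in edges:
            for s, t in ((x, y), (y, x)):  # edges are undirected
                if s == u and t not in seen:
                    seen.add(t)
                    stack.append(t)
    return b in seen


def reachability_probability(prob, a, b):
    """Definition 2 by enumeration: sum Pr(g) over all worlds g in which a and b are connected.

    `prob` maps an undirected edge (u, v) to its existence probability.
    """
    edges = list(prob)
    total = 0.0
    for mask in product([True, False], repeat=len(edges)):
        world = [e for e, keep in zip(edges, mask) if keep]
        pr = world_probability(
            [prob[e] for e, keep in zip(edges, mask) if keep],
            [prob[e] for e, keep in zip(edges, mask) if not keep],
        )
        if connected(world, a, b):
            total += pr
    return total


# Edge probabilities of world g1 exactly as listed in the text (six present, four absent).
pr_g1 = world_probability([0.6, 0.5, 0.8, 0.4, 0.4, 0.5], [0.1, 0.3, 0.4, 0.1])

# Hypothetical triangle Q-a-b with all edge probabilities 0.5 (not taken from the paper).
triangle = {("Q", "a"): 0.5, ("Q", "b"): 0.5, ("a", "b"): 0.5}
p_qa = reachability_probability(triangle, "Q", "a")
```

In the hypothetical triangle, `a` reaches `Q` either over the direct edge or over the detour via `b`, so the enumeration should agree with $0.5 + 0.5 \cdot 0.5 \cdot 0.5$.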
5. EXPECTED FLOW ESTIMATION

In this section we estimate the expected information flow of a given subgraph $G \subseteq \mathcal{G}$. Following Equation 2, the reachability probability $P(\updownarrow(Q, v, G))$ between $Q$ and a node $v$ can be used to compute the total expected information flow $E(flow(Q, G))$. This problem of computing the reachability probability between two nodes has been shown to be #P-hard [10, 5], and sampling solutions have been proposed to approximate it [22, 7]. In this section, we will present our solution to identify subgraphs of $G$ for which we can compute the information flow analytically and efficiently, such that expensive numeric sampling only has to be applied to small subgraphs. We first introduce the concept of Monte-Carlo sampling of a subgraph.

5.1 Traditional Monte-Carlo Sampling

Lemma 1. Let $G = (V, E, W, P)$ be an uncertain graph and let $S$ be a set of sample worlds drawn randomly and unbiased from the set $\mathcal{W}$ of possible graphs of $G$. Then the average information flow in the samples in $S$,

$$\mu := \frac{1}{|S|} \sum_{g \in S} flow(Q, g) = \frac{1}{|S|} \cdot \sum_{g \in S} \sum_{v} \updownarrow(Q, v, g) \cdot W(v), \qquad (4)$$

is an unbiased estimator of the expected information flow $E(flow(Q, G))$, where $\updownarrow(Q, v, g)$ is an indicator function that returns one if there exists a path between nodes $Q$ and $v$ in the (deterministic) sample graph $g$, and zero otherwise.

Proof. For $\mu$ to be an unbiased estimator of $E(flow(Q, G))$, we have to show that $E(\mu) = E(flow(Q, G))$. Substituting $\mu$ yields $E(\mu) = E\big(\frac{1}{|S|} \sum_{g \in S} flow(Q, g)\big)$. Due to linearity of expectations, this is equal to $\frac{1}{|S|} \sum_{g \in S} E(flow(Q, g))$. The sum over $|S|$ identical values can be replaced by a factor of $|S|$. Reducing this factor yields $E(flow(Q, g \in S))$. Following the assumption of unbiased sampling of $S$ from the set $\mathcal{W}$ of possible worlds, the expected information flow $E(flow(Q, g))$ of a sample possible world $g \in S$ is equal to the expected information flow $E(flow(Q, G))$.

Naive sampling of the whole graph $G$ has disadvantages: First, this approach requires computing reachability queries on a set of possibly large sampled graphs. Second, a rather large approximation error is incurred. We will approach these drawbacks by first describing how non-cyclic subgraphs, i.e. trees, can be processed in order to compute the information flow exactly and efficiently without sampling. For cyclic subgraphs, we show how sampled information flows can be used to compute the information flow in the full graph.

5.2 Mono-Connected vs. Bi-Connected graphs

The main observation that will be exploited in our work is the following: if there exists only one possible path between two vertices, then we can compute their reachability probability efficiently.

Definition 5 (Mono-Connected Nodes). Let $G = (V, E, W, P)$ be a probabilistic graph and let $A, B \in V$. If $path(A, B) = (A = v_1, v_2, \ldots, v_{k-1}, v_k = B)$ is the only path between $A$ and $B$, i.e., there exists no other path $p \in V \times V \times V^*$ that satisfies Definition 1, then we denote $A$ and $B$ as mono-connected.

In the following, when the query vertex $Q$ is clear from the context, we call a vertex $A$ mono-connected if it is mono-connected to the query vertex $Q$.

Lemma 2. If two vertices $A$ and $B$ are mono-connected in a probabilistic graph $G$, then the reachability probability between $A$ and $B$ is equal to the product of the edge probabilities included in $path(A, B)$, i.e.,

$$P(\updownarrow(A, B, G)) = \prod_{i=1}^{k-1} P((v_i, v_{i+1})) \quad \text{with } v_i \in path(A, B).$$

Proof. Following possible world semantics as defined in Definition 2, the reachability probability $P(\updownarrow(A, B, G))$ is the sum of probabilities of all possible worlds where $B$ is connected to $A$. We show that $A$ and $B$ are connected in a possible graph $g$ if and only if all $k-1$ edges $e_i = (v_i, v_{i+1}) \in path(A, B)$ exist.
$\Rightarrow$: By contradiction: Let $A$ and $B$ be connected in $g$, and let any edge on $path(A, B)$ be missing. Then there must exist a path $path'(A, B) \neq path(A, B)$, which contradicts the assumption that $A$ and $B$ are mono-connected.
$\Leftarrow$: If all edges on $path(A, B)$ exist, then $B$ is connected to $A$ following the assumption that $path(A, B)$ is a path from $A$ to $B$.
Due to our assumption of independent edges, the probability that all edges in $path(A, B)$ exist is given by $\prod_{i=1}^{k-1} P((v_i, v_{i+1}))$.

Definition 6 (Mono-Connected Graph). A probabilistic graph $G = (V, E, W, P)$ is called mono-connected iff all pairs of vertices in $V$ are mono-connected.

Next, we generalize Lemma 2 to whole subgraphs, such that a specified vertex $Q$ in that subgraph has a unique path to all other vertices in the subgraph. Using Lemma 2, we state the following theorem that will be exploited in the remainder of this work.

Theorem 2. Let $G = (V, E, W, P)$ be a probabilistic graph and let $Q \in V$ be a node. If $G$ is mono-connected, then $E(flow(Q, G))$ can be computed efficiently.

Proof. $E(flow(Q, G))$ is the weighted sum of the reachability probabilities of all nodes, according to Equation 2. If $G$ is connected and non-cyclic, we can guarantee that each node has exactly one path to $Q$, and thus is mono-connected. Thus, Lemma 2 is applicable to compute the reachability probability between $Q$ and each node $v \in V$. Due to linearity of expectations, i.e., $E(X + Y) = E(X) + E(Y)$ for random variables $X$ and $Y$, we can aggregate individual reachability expectations, yielding $E(flow(Q, G))$.

Analogously to Definition 5, we define bi-connected nodes.

Definition 7 (Bi-Connected Nodes). Let $G = (V, E, W, P)$ be a probabilistic graph and let $A, B \in V$. If there exist (at least) two paths $path_1(A, B)$ and $path_2(A, B)$ such that $path_1(A, B) \neq path_2(A, B)$, then we denote $A$ and $B$ as bi-connected.

Definition 8 (Bi-Connected Graph). A bi-connected graph [36, 14] is a connected probabilistic graph $G = (V, E, W, P)$ such that removal of any one vertex $A \in V$ will still yield a connected probabilistic graph.

Lemma 3. In a bi-connected graph $G$ of size $|V| \geq 3$, all pairs of vertices are bi-connected following Definition 7.

Proof. A formal proof of Lemma 3 can be found in the appendix in Section 10.

The information flow within a bi-connected graph cannot be computed efficiently using Theorem 2, as the flow between any two nodes $A$ and $B$ is shared by more than one path. In the next section, we propose techniques to substitute bi-connected subgraphs by super-nodes, for which we can estimate the information flow using Monte-Carlo sampling exploiting Lemma 1. By substituting the bi-connected subgraphs by super-nodes, for which we apply sampling and memorize the sampling information, we obtain a mono-connected graph that uses the substituted super-nodes. This approach maximizes the partitions of the graph for which expensive Monte-Carlo estimation can be replaced using Theorem 2.

The next section will show how to achieve this goal by employing an F-tree of the graph. This data structure, borrowed from graph theory, partitions the graph into bi-connected components (a.k.a. "blocks") generated by bi-connected subgraphs, and identifies vertices of the graph as articulation vertices to connect two bi-connected components. We exploit these articulation vertices by having them represent all the information flow that is estimated to flow to them from their corresponding bi-connected component. This approach is described in detail in the following subsection.
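To make the contrast between Lemma 1 and Theorem 2 concrete, the sketch below computes the expected flow of a small tree exactly, by multiplying edge probabilities along the unique path to $Q$, and compares it with a plain Monte-Carlo estimate over sampled worlds. The function names, the four-vertex tree and its weights are hypothetical and chosen only for illustration. The sketch also counts only vertices other than $Q$, which is an assumption about the convention rather than something the definitions fix.

```python
import random
from collections import defaultdict, deque


def adjacency(prob):
    """Build an undirected adjacency list from a dict {(u, v): probability}."""
    adj = defaultdict(list)
    for (u, v), p in prob.items():
        adj[u].append((v, p))
        adj[v].append((u, p))
    return adj


def tree_flow(prob, weight, q):
    """Exact expected flow to q in a mono-connected (tree) graph, per Lemma 2 and Theorem 2."""
    adj = adjacency(prob)
    reach = {q: 1.0}
    queue = deque([q])
    while queue:
        u = queue.popleft()
        for v, p in adj[u]:
            if v not in reach:  # the path to q is unique, so the first visit fixes the product
                reach[v] = reach[u] * p
                queue.append(v)
    return sum(reach[v] * weight[v] for v in reach if v != q)


def sampled_flow(prob, weight, q, n_samples, rng):
    """Monte-Carlo estimate in the spirit of Lemma 1: average flow over sampled worlds."""
    total = 0.0
    for _ in range(n_samples):
        # Draw one possible world: keep each edge independently with its probability.
        world = {e: 1.0 for e, p in prob.items() if rng.random() < p}
        adj = adjacency(world)
        seen, queue = {q}, deque([q])
        while queue:
            u = queue.popleft()
            for v, _p in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        total += sum(weight[v] for v in seen if v != q)
    return total / n_samples


# Hypothetical tree rooted at Q: Q-a, a-b, Q-c (values are illustrative only).
prob = {("Q", "a"): 0.5, ("a", "b"): 0.5, ("Q", "c"): 0.8}
weight = {"a": 1.0, "b": 2.0, "c": 3.0}

exact = tree_flow(prob, weight, "Q")
estimate = sampled_flow(prob, weight, "Q", 20000, random.Random(0))
```

On a tree the exact value costs one traversal, whereas the sampled value needs one traversal per sample and still carries sampling noise. This is the motivation for restricting sampling to the cyclic parts of the graph.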
[Figure 2: Running example graph with corresponding F-tree. Panels: (a) Example Graph, (b) F-tree representation.]

5.3 Flow tree

In this section we propose to adapt the block-cut tree [36, 14, 38] to partition a graph into independent bi-connected components. Instead of sampling the whole uncertain graph, the purpose of this index structure is to exploit Theorem 2 for mono-connected components, and to apply local Monte-Carlo sampling within bi-connected components only. Our employed Flow tree (F-tree) memorizes the information flow at each node. Before we show how to utilize the F-tree for efficient information flow computation, we first give a formal definition.

Definition 9 (Flow tree). Let $\mathcal{G} = (V, E, W, P)$ be a probabilistic graph and let $Q \in V$ be a vertex for which the expected information flow is to be computed. A Flow tree (F-tree) is a tree structure defined as follows.
1) Each component of the F-tree is a connected subgraph of $\mathcal{G}$. A component can be mono-connected or bi-connected.
2) A mono-connected component $MC = (MC.V \subseteq V, MC.AV \in V)$ is a set of vertices $MC.V \cup MC.AV$ that form a mono-connected subgraph (cf. Definition 6) in $\mathcal{G}$. The vertex $MC.AV$ is called articulation vertex. Intuitively, a mono-connected component represents a tree-like structure rooted at $MC.AV$. Using Theorem 2, we can efficiently compute the information flow from all vertices $MC.V$ to $MC.AV$.
3) A bi-connected component $BC = (BC.V, BC.P(v), BC.AV)$ is a set of vertices $BC.V \cup BC.AV$ of size greater than two that form a bi-connected subgraph $G'$ in $\mathcal{G}$ according to Definition 8. Intuitively, a bi-connected component represents a subgraph describing a cycle. In this case, we can estimate the likelihood of being connected to the articulation vertex $BC.AV$ using Monte-Carlo sampling as in Lemma 1. The function $BC.P(v) : BC.V \mapsto [0, 1]$ maps each vertex $v \in BC.V$ to the estimated reachability probability $reach(v, BC.AV)$ of $v$ being connected to $BC.AV$ in $\mathcal{G}$.
4) For each pair of (mono- or bi-connected) components $(C_1, C_2)$, it holds that the intersection $C_1.V \cap C_2.V = \emptyset$ of vertices is empty. Thus, each vertex in $V$ is mapped to at most one component's vertex set.
5) Two different components may share the same articulation vertex, and the articulation vertex of one component may be in the vertex set of another component.
6) The articulation vertex of the root of an F-tree is $Q \in V$.

Intuitively speaking, a component is a set of vertices together with an articulation vertex that all information must flow through in order to reach $Q$. By our iterative construction algorithm presented in Section 5.4, each component is guaranteed to have such an articulation vertex, guiding the direction to vertex $Q$. The idea of the F-tree is to use components as virtual nodes, such that all actual vertices of a component send their information to their articulation vertex. Then the articulation vertex forwards all information to the next component, until the root of the tree is reached, where all information is sent to articulation vertex $Q$.

Example 2. As an example for an F-tree, consider Figure 2(a), showing a probabilistic graph. For brevity, assume that each edge $e \in E$ has an existential probability of $p(e) = 0.5$ and that all vertices $v \in V$ have an information weight corresponding to their id, e.g. vertex 6 has a weight of six. A corresponding F-tree is shown in Figure 2(b). A mono-connected component is given by $A = (\{1, 2, 3, 6\}, Q)$. For this component, we can exploit Theorem 2 to analytically compute the flow of information from any vertex in $\{1, 2, 3, 6\}$ to articulation vertex $Q$: vertices 3 and 6 are connected to $Q$ with probability 0.5. Thus, these nodes contribute an expected information flow of $3 \cdot 0.5 = 1.5$ and $6 \cdot 0.5 = 3$, respectively. Vertices 2 and 3 are connected to $Q$ with a probability of $0.5 \cdot 0.5 = 0.25$, respectively, following Lemma 2. Thus, these nodes contribute an expected information flow of $2 \cdot 0.25 = 0.5$ and $3 \cdot 0.25 = 0.75$. Following Theorem 2, we can aggregate these values to obtain the expected information flow from vertices $\{1, 2, 3, 6\}$ to articulation vertex $Q$ as 5.75.

A bi-connected component is defined by $B = (\{4, 5\}, 3)$, representing a subgraph having a cycle. Because of the cycle, we cannot exploit Theorem 2 to compute the flow of a vertex in $\{4, 5\}$ to vertex 3. But we can sample the subgraph spanned by vertices in $\{3, 4, 5\}$ to estimate the probabilities that vertices $\{4, 5\}$ are connected to articulation vertex 3 using Lemma 1. With sufficient samples, this will yield a probability of around 0.375 for both vertices. Again using Theorem 2, we compute an information flow of $0.375 \cdot 4 + 0.375 \cdot 5 = 3.375$ to articulation vertex 3. Given this expected flow, we can use the mono-connected component $A$ to analytically compute the expected information that is further propagated from the articulation vertex 3 of component $B$ to the articulation vertex $Q$ of $A$. Since the articulation vertex of component $B$ is in the vertex set of component $A$, component $B$ is a child of component $A$ in Figure 2(b), as $B$ propagates its information to $A$. As we have already computed above, the probability of vertex 3 being connected to its articulation vertex $Q$ is 0.25, yielding an information flow worth $3.375 \cdot 0.25 = 0.84375$ units flowing from vertices $\{4, 5\}$ to $Q$. Again exploiting Theorem 2, we can aggregate this to a total flow of $5.75 + 0.84375 = 6.59375$ from vertices $\{1, 2, 3, 4, 5, 6\}$ to $Q$.

Another bi-connected component is $C = (\{7, 8, 9\}, 6)$, for which we can estimate the information flow from vertices 7, 8, and 9 to articulation vertex 6 numerically using Monte-Carlo sampling. Since vertex 6 is in $A$, component $C$ is a child of $A$. We find another bi-connected component $D = (\{10, 11\}, 9)$, and two more mono-connected components $E = (\{13, \ldots, 16\}, 9)$ and $F = (\{12\}, 11)$.
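The bottom-up aggregation used in Example 2 can be sketched as a small recursive structure: each component stores, for every vertex, the probability of reaching the component's articulation vertex, and any flow collected by a child component is forwarded through the vertex that serves as the child's articulation vertex. The `Component` class and `flow_into_av` function are hypothetical names introduced here for illustration. The reachability values 0.375 and 0.25 are simply the figures quoted in Example 2, taken as given rather than recomputed.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """One F-tree component: an articulation vertex plus per-vertex reachability to it."""
    av: object                      # articulation vertex of this component
    reach: dict                     # vertex -> probability of being connected to av
    children: list = field(default_factory=list)


def flow_into_av(comp, weight):
    """Expected flow arriving at comp.av from comp's own vertices and all descendants."""
    total = 0.0
    for v, p in comp.reach.items():
        # Flow collected by child components whose articulation vertex is v.
        forwarded = sum(flow_into_av(c, weight) for c in comp.children if c.av == v)
        total += p * (weight[v] + forwarded)
    return total


weight = {3: 3, 4: 4, 5: 5}

# Component B = ({4, 5}, 3) with the sampled estimates quoted in Example 2.
B = Component(av=3, reach={4: 0.375, 5: 0.375})

# Only the part of component A that matters here: vertex 3 reaches Q with probability 0.25.
A_part = Component(av="Q", reach={3: 0.25}, children=[B])

flow_B = flow_into_av(B, weight)                  # 0.375*4 + 0.375*5
flow_via_3 = flow_into_av(A_part, weight)         # 0.25 * (3 + flow_B)
forwarded_from_B = flow_via_3 - 0.25 * weight[3]  # share that originates in {4, 5}
```

Under these assumptions `flow_B` matches the 3.375 units arriving at vertex 3, and `forwarded_from_B` matches the 0.84375 units that Example 2 attributes to vertices $\{4, 5\}$ at $Q$.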
In this example, the structure of the F-tree allows us to compute or approximate the expected information flow to $Q$ from each vertex. For this purpose, only the three small components $B$, $C$ and $D$ need to be sampled. This is a vast reduction of the sampling space compared to a naive Monte-Carlo approach that samples the full graph: rather than sampling a single random variable having $2^{|E|} = 2^{19} = 524288$ possible worlds, we only need to sample three random variables corresponding to the bi-connected components $B$, $C$ and $D$, having $2^3 = 8$, $2^4 = 16$, and $2^3 = 8$ possible worlds, respectively. Clearly, this approach reduces the number of edges (marked in red in Figure 2(a)) that need to be sampled in each sampling iteration. More importantly, our experiments show that this approach of sampling components independently vastly decreases the variance of the total information flow, thus yielding a more precise estimation at the same number of samples.

Having defined syntax and semantics of the F-tree, the next section shows how to maintain the structure of an F-tree when additional edges are selected. It is important to note that we do not intend to insert all edges of a probabilistic graph $\mathcal{G}$ into the F-tree. Rather, we only add the edges that are selected to compute the maximum flow $MaxFlow(\mathcal{G}, Q, k)$ given a constrained budget of $k$ edges. Thus, even in a case where all vertices are bi-connected, such as in the initial example in Figure 1(a), we note, supported by our experimental evaluation, that an optimal selection of edges prefers a spanning-tree-like topology, which synergizes well with our F-tree. The next section shows how to build the structure of the F-tree iteratively by adding edges to an initially empty graph.

The next subsection proposes an algorithm to update an F-tree when a new edge is selected, starting at a trivial F-tree that contains only one component $(\emptyset, Q)$. Using this edge-insertion algorithm, we will show how to choose promising edges to be inserted to maximize the expected information flow. The selection of the edges of the F-tree will be shown in Section 6.

5.4 Insertion of Edges into an F-tree

Following Definition 9 of an F-tree, each vertex $v \in \mathcal{G}$ is assigned either to a single mono-connected component (noted by a flag $v.isMC$ in the algorithm below), to a single bi-connected component (noted by $v.isBC$), or to no component, and is thus disconnected from $Q$ (noted by $v.isNew$). To insert a new edge $(v_{src}, v_{dest})$, our edge-insertion algorithm derived in this section distinguishes between the following cases:

Case I) $v_{src}.isNew$ and $v_{dest}.isNew$: We omit this case, as our edge selection algorithms presented in Section 6 always ensure a single connected component, and initially the F-tree contains only vertex $Q$.

Case II) $v_{src}.isNew$ exclusive-or $v_{dest}.isNew$: Since we consider undirected edges, we assume without loss of generality that $v_{dest}.isNew$. Thus $v_{src}$ is already connected to the F-tree.

Case IIa) $v_{src}.isMC$: In this case, a new dead end is added to the mono-connected structure $MC_{src}$, which is guaranteed to remain mono-connected. We add $v_{dest}$ to $MC_{src}.V$.

Case IIb) $v_{src}.isBC$: In this case, a new dead end is added to the bi-connected structure $BC_{src}$. This dead end becomes a new mono-connected component $MC = (\{v_{dest}\}, v_{src})$. Intuitively speaking, we know that vertex $v_{dest}$ has no other choice but to propagate its information to $v_{src}$.

Case III) $v_{src}$ and $v_{dest}$ belong to the same component, i.e. $C_{src} = C_{dest}$.

Case IIIa) This component is a bi-connected component $BC$: Adding a new edge between $v_{src}$ and $v_{dest}$ within component $BC$ may change the reachability $BC.P(v)$ of each vertex $v \in BC.V$ to reach its articulation vertex $BC.AV$. Therefore, $BC$ needs to be re-sampled to numerically estimate the reachability probability function $P(v)$ for each $v \in BC.V$.

Case IIIb) This component is a mono-connected component $MC$: In this case, a new cycle is created within a mono-connected component, thus some vertices within $MC$ may become bi-connected. We need to (i) identify the set of vertices affected by this cycle, (ii) split these vertices into a new bi-connected component, and (iii) handle the set of vertices that have been disconnected from $MC$ by the new cycle. These three steps are performed by the $splitTree(MC, v_{src}, v_{dest})$ function as follows:
(i) We start by identifying the new cycle: Compare the (unique) paths of $v_{src}$ and $v_{dest}$ to $MC.AV$, and find the first vertex $v_\wedge$ that appears in both paths. Now we know that the new cycle is described by $path(v_\wedge, v_{src})$, $path(v_{dest}, v_\wedge)$ and the new edge between $v_{src}$ and $v_{dest}$.
(ii) All of these vertices are added to a bi-connected component $BC = \big((path(v_\wedge, v_{src}) \cup path(v_{dest}, v_\wedge)) \setminus \{v_\wedge\},\ P(v),\ v_\wedge\big)$ using $v_\wedge$ as their articulation vertex. All vertices in $MC$ having $v_\wedge$ (except $v_\wedge$ itself) on their path are removed from $MC$. The probability mass function $P(v)$ is estimated by sampling the subgraph of vertices in $BC.V$.
(iii) Finally, orphans of $MC$ that have been split off from $MC$ due to the creation of $BC$ need to be collected into new mono-connected components. Such orphans having a vertex of the cycle $BC$ on their path to $MC.AV$ will be grouped by these vertices: For each $v_i \in BC.V$, let $orphan_i$ denote the set of orphans separated by $v_i$ (separated means $v_i$ being the first vertex in $BC.V$ on the path to $MC.AV$). For each such group, we create a new mono-connected component $MC_i = (orphan_i, v_i)$. All these new mono-connected components with $v_i \in BC.V$ become children of $BC$. If $MC.V$ is now empty, i.e. all vertices of $MC$ have been reassigned to other components, then $MC$ is deleted and $BC$ is appended to the list of children of the component $C$ where $BC.AV = v_\wedge \in C.V$. If $MC.V$ is not empty, we are left with a mono-connected component $MC$ with $v_\wedge \in MC.V$. The new bi-connected component $BC$ becomes a child of $MC$.

Case IV) $v_{src}$ and $v_{dest}$ belong to different components $C_{src} \neq C_{dest}$. Since the F-tree is a tree structure itself, we can identify the lowest common ancestor $C_{anc}$ of $C_{src}$ and $C_{dest}$. The insertion of edge $(v_{src}, v_{dest})$ has incurred a new cycle $\bigcirc$ going from $C_{anc}$ to $C_{src}$, then to $C_{dest}$ via the new edge, and then back to $C_{anc}$. This cycle may cross mono-connected and bi-connected components, which all have to be adjusted to account for the new cycle. We need to identify all vertices involved to create a new cyclic, thus bi-connected, component for $\bigcirc$, and we need to identify which parts remain mono-connected. In the following cases, we adjust all components involved in $\bigcirc$ iteratively. First, we initialize $\bigcirc = (\emptyset, P, v_{anc})$, where $v_{anc}$ is the vertex within $C_{anc}$ where the cycle meets if $C_{anc}$ is a mono-connected component, and $C_{anc}.AV$ otherwise. Let $C$ denote the component that is currently adjusted:

Case IVa) $C = C_{anc}$: In this case, the new cycle may enter $C_{anc}$ from two different articulation vertices. We then apply Case III, treating these two vertices as $v_{src}$ and $v_{dest}$, as these two vertices have become connected transitively via the big cycle $\bigcirc$.

Case IVb) $C$ is a bi-connected component: In this case $C$ be-
Thus,v becomesthearticulation src src comes absorbed by the new cyclic component (cid:13), thus (cid:13).V = vertexofMC. Thebi-connectedcomponentBC addsthenew src (cid:13).V ∪C.v, and(cid:13)inheritsallchildrenfromC. Therationalis mono-connectedcomponentMCtoitslistofchildren. 6 thatallverticeswithinCareabletoaccessthenewcycle. For the last case, CaseIV, consider Figure 3(d), where a new Case IVc) C is a mono-connected component: In this case, one edge d = (11,15) connected two vertices belonging to two dif- pathinC fromonevertexv toC.AV isnowinvolvedinacycle. ferentcomponentsDandE.Westartbyidentifyingthecyclethat Allverticesinvolvedinpath(v,C.AV)areaddedto(cid:13).V andre- hasbeencreatedwithintheF-tree,involvingcomponentsDandE, movedfromC. TheoperationsplitTree(C,v,C.AV)iscalledto andmeetingatthefirstcommonancestorcomponentC. Foreach create new mono-connected components that have been split off ofthesecomponentsinthecycle(D,C,E),oneofthesub-cases fromC andbecomeconnectedto(cid:13)viatheirindividualarticula- of Case IV is used. For component C, we have that C = C anc tionvertices. isthecommonancestorcomponent,thustriggeringCaseIVa. We Inthefollowing,weusethegraphofFigure2(a)anditscorre- find that both components D and E used vertex 9 as their artic- sponding F-tree representation of Figure 2(a) to insert additional ulation vertex v . Thus, the only cycle incurred in component anc edges and to illustrate the interesting cases of the insertion algo- C isthe(trivial)cycle(9)fromvertex9toitself,whichdoesnot rithmofSection5.4. requireanyaction. Weinitializethenewbi-connectedcomponent (cid:13)=(∅,⊥,9),whichinitiallyholdsnovertices,andhasnoprob- 5.5 InsertionExamples abilitymassfunctioncomputedyet(theoperator⊥canbereadas nullornot-defined)andusesv = 9asarticulationvertex. 
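The first step of this example, finding the common ancestor component of D and E, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `parent` dictionary and both function names are hypothetical, and the parent links are assumed from the articulation vertices of the running example (C below A; D and E below C via vertex 9; F below D via vertex 11).

```python
def ancestors(parent, comp):
    """Return comp followed by all of its ancestors up to the root."""
    chain = [comp]
    while comp in parent:
        comp = parent[comp]
        chain.append(comp)
    return chain


def lowest_common_ancestor(parent, c_src, c_dest):
    """First component on c_dest's chain that also lies on c_src's chain."""
    on_src_chain = set(ancestors(parent, c_src))
    for comp in ancestors(parent, c_dest):
        if comp in on_src_chain:
            return comp
    return None  # the two components are not part of the same tree


# Hypothetical encoding of the running example: child component -> parent component.
# Component A has no entry because its own parent is not needed here.
parent = {"C": "A", "D": "C", "E": "C", "F": "D"}

print(lowest_common_ancestor(parent, "D", "E"))  # C
```

Walking both ancestor chains is linear in the depth of the F-tree, which is sufficient for a sketch; the components between each endpoint and the returned ancestor are exactly the ones that the sub-cases of Case IV then have to adjust.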
For component D, we apply Case IVb: as D is a bi-connected component, it becomes absorbed by the new bi-connected component ◯, now having ◯ = ({10,11},⊥,9). For the mono-connected component E, Case IVc is used. We identify the path within E that is now involved in a cycle, by using the path (15,13,9) between the involved vertex 15 and articulation vertex 9. All nodes on this path are added to ◯, now having ◯ = ({10,11,15,13},⊥,9). Using the splitTree operation similar to Case III, we collect orphans into new mono-connected components, creating G = ({14},13) and H = ({16},15) as children of ◯. Finally, Monte-Carlo sampling is used to approximate the probability mass function ◯.P(v) for each v ∈ ◯.V.

In the following, we use the graph of Figure 2(a) and its corresponding F-tree (FT) representation of Figure 2(a) to insert additional edges and to illustrate the interesting cases of the insertion algorithm of Section 5.4.

We start with an example for Case II in Figure 3(a). Here, we insert a new edge a = (7,17), thus connecting a new vertex 17 to the FT. Since vertex 7 belongs to the bi-connected component BC, we apply Case IIb. A new mono-connected component G = ({17},7) is created, and added to the children of BC.

In Figure 3(b), we insert a new edge b = (6,8) instead. In this case, the two connected vertices are already part of the FT, thus Case II does not apply. We find that both vertices belong to the same component C. Thus, Case III is used and more specifically, since component C is a bi-connected component BC, Case IIIa is applied. In this case, no components need to be changed, but the probability function BC.P(v) has to be re-approximated, as nodes 7, 8 and 9 will have an increased probability of being connected to articulation vertex 6, due to the existence of new paths leading via edge b.

Next, in Figure 3(c), an edge is inserted between vertices 14 and 15. Both vertices belong to the mono-connected component E, thus Case IIIb is applied here. After insertion of edge c, the previously mono-connected component E = ({13,14,15,16},9) now contains a cycle involving vertices 13, 14 and 15. (i) We identify this cycle by considering the previous paths from vertices 14 and 15 to their articulation vertex 9. These paths are (14,13,9) and (15,13,9), respectively. The first common vertex on these paths is 13, thus identifying the new cycle. (ii) We create a new bi-connected component G = ({14,15},13), containing all vertices of this cycle using the first common vertex 13 as articulation vertex. We further remove these vertices except the articulation vertex 13 from the mono-connected component E; the probability function G.P(v) is initialized by sampling the reachability probabilities within G; and component G is added to the list of children of E. (iii) Finally, orphans need to be collected. These are vertices in E, which have now become bi-connected to Q, because their (previously unique) path to their former articulation vertex 9 crosses a new cycle. We find that one vertex, vertex 16, had 15 as the first removed vertex on its path to 9. Thus, vertex 16 is moved from component E into a new mono-connected component H = ({16},15), terminating this case. Summarizing, vertex 16 in component H now reports its information flow to vertex 15 in component G, for which the information flow to articulation vertex 13 in component G is approximated using Monte-Carlo sampling. This information is then propagated analytically to vertex 9 in component E. Subsequently, the remaining flow that has been propagated all this way is approximately propagated to articulation vertex 6 in component C, which allows us to analytically compute the flow to articulation vertex Q.

Figure 3: Examples of edge insertions and F-tree update cases using the running example of Figure 2(a). (a) Case IIb: Insertion of edge a. (b) Case IIIa: Insertion of edge b. (c) Case IIIb: Insertion of edge c. (d) Case IVa-c: Insertion of edge d.

6. OPTIMAL EDGE SELECTION

The previous section presented the F-tree, a data structure to compute the expected information flow in a probabilistic graph. Based on this structure, heuristics to find a near-optimal set of k edges maximizing the information flow MaxEFlow(G,Q,k) to a vertex Q (see Definition 4) are presented in this section. Therefore, we first present a Greedy heuristic to iteratively add the locally most promising edges to the current result. Based on this Greedy approach, we present improvements, aiming at minimizing the processing cost while maximizing the expected information flow.

6.1 Greedy Algorithm

Aiming to select edges incrementally, the Greedy algorithm initially uses the probabilistic graph G_0 = (V, E_0 = ∅, P), which contains no edges. In each iteration i, a set of candidate edges candList is maintained, which contains all edges that are connected to Q in the current graph G_i, but which are not already selected in E_i. Then, each iteration selects an additional edge e which maximizes the information flow to Q, such that G_{i+1} = (V, E_i ∪ {e}, P), where

$$ e = \operatorname{argmax}_{e \in candList} E(flow(Q, (V, E_i \cup \{e\}, P))). \qquad (5) $$

For this purpose, each edge e ∈ candList is probed, by inserting it into the current F-tree using the insertion method presented in Section 5.3. Then, the gain in information flow incurred by this insertion is estimated by Equation 1. After k iterations, the graph G_k = (V, E_k, P) is returned.

6.2 Component Memorization

We introduce an optimization reducing the number of bi-connected components for which the reachability probabilities have to be estimated using Monte-Carlo sampling, by exploiting stochastic independence between different components in the F-tree. During each Greedy iteration, a whole set of edges candList is probed for insertion. Some of these insertions may yield new cycles in the F-tree, resulting from Cases III and IV. Using component memorization, the algorithm memorizes, for each edge e in candList, the probability mass function of any bi-connected component BC that had to be sampled during the last probing of e. Should e again be inserted in a later iteration, the algorithm checks if the component has changed, in terms of vertices within that component or in terms of other edges that have been inserted into that component. If the component has remained unchanged, the sampling step is skipped, using the memorized estimated probability mass function instead.

6.3 Sampling Confidence Intervals

A Monte-Carlo sampling is controlled by a parameter samplesize which corresponds to the number of samples taken to approximate the information flow of a bi-connected component to its articulation vertex. In each iteration, we can reduce the amount of samples by introducing confidence intervals for the information flow for each edge e ∈ candList that is probed. The idea is to prune the sampling of any probed edge e for which we can conclude that, at a sufficiently large level of significance α, there must exist another edge e′ ≠ e in candList such that e′ is guaranteed to have a higher information flow than e, based on the current number of samples only. To generate these confidence intervals, we recall that, following Equation 4, the expected information flow to Q is the sample-average of the sum of information flow of each individual vertex.

To obtain a lower bound of the expected information flow to Q in a graph G, we use the sum of lower bound flows of each vertex using Equation 4 to obtain

$$ E_{lb}(flow(Q,G)) = \sum_{v \in V} E_{lb}(\updownarrow(Q,v,G)) \cdot W(v) $$

as well as the upper bound

$$ E_{ub}(flow(Q,G)) = \sum_{v \in V} E_{ub}(\updownarrow(Q,v,G)) \cdot W(v). $$

Now, at any iteration i of the Greedy algorithm, for any candidate edge e ∈ candList having an information flow lower bounded by lb := E_lb(flow(Q, G_i ∪ e)), we prune any other candidate edge e′ ∈ candList having an upper bound ub := E_ub(flow(Q, G_i ∪ e′)) if lb > ub. The rationale of this pruning is that, with a confidence of 1−α, we can guarantee that inserting e′ yields less information gain than inserting e. To ensure that the Central Limit Theorem is applicable, we only apply this pruning step if at least 30 sample worlds have been drawn for both probabilistic graphs.

6.4 Delayed Sampling

For the last heuristic, we reduce the number of Monte-Carlo samplings that need to be performed in each iteration of the Greedy Algorithm in Section 6.1.
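As a reference point for what such a heuristic saves, the plain Greedy loop of Section 6.1 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it replaces the F-tree and Monte-Carlo estimation by exact enumeration of all possible worlds of the selected edges (only feasible for a handful of edges), and the names `expected_flow`, `greedy`, `prob` and `weight` as well as the toy graph are hypothetical.

```python
from itertools import product


def expected_flow(q, edges, prob, weight):
    """Exact expected flow to q over all possible worlds of the given edges."""
    edges = list(edges)
    total = 0.0
    for world in product([False, True], repeat=len(edges)):
        p_world = 1.0
        adj = {}
        for present, e in zip(world, edges):
            p_world *= prob[e] if present else 1.0 - prob[e]
            if present:
                u, v = e
                adj.setdefault(u, []).append(v)
                adj.setdefault(v, []).append(u)
        # Collect all vertices connected to q in this possible world.
        seen, stack = {q}, [q]
        while stack:
            u = stack.pop()
            for v in adj.get(u, []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        total += p_world * sum(weight[v] for v in seen)
    return total


def greedy(q, prob, weight, k):
    """Select up to k edges, each time adding the candidate with the highest flow."""
    selected, connected = [], {q}
    for _ in range(k):
        # Candidates: unselected edges touching the part already connected to q.
        cand = [e for e in prob
                if e not in selected and (e[0] in connected or e[1] in connected)]
        if not cand:
            break
        best = max(cand, key=lambda e: expected_flow(q, selected + [e], prob, weight))
        selected.append(best)
        connected.update(best)
    return selected


# Toy probabilistic graph (hypothetical): edge -> existence probability.
prob = {("Q", "a"): 0.9, ("Q", "b"): 0.5, ("a", "b"): 0.8, ("b", "c"): 0.7}
weight = {"Q": 0, "a": 1, "b": 1, "c": 5}

print(greedy("Q", prob, weight, k=3))  # [('Q', 'a'), ('a', 'b'), ('b', 'c')]
```

Every iteration probes every candidate, so the cost of one probe is multiplied by the size of the candidate list; this is the per-iteration work that skipping candidates is meant to reduce.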
In a nutshell, the idea is that an edge which yields a much lower information gain than the chosen edge is unlikely to become the edge having the highest information gain in the next iteration. For this purpose, we introduce a delayed sampling heuristic. In any iteration i of the Greedy Algorithm, let e denote the best selected edge, as defined in Equation 5. For any other edge e′ ∈ candList, we define its potential

$$ pot(e') := \frac{E(flow(Q, (V, E_i \cup \{e'\}, P)))}{E(flow(Q, (V, E_i \cup \{e\}, P)))} $$

as the fraction of information gained by adding edge e′ compared to the best edge e which has been selected in an iteration. Furthermore, we define the cost cost(e′) as the number of edges that need to be sampled to estimate the information gain incurred by adding edge e′. If the insertion of e′ does not incur any new cycles, then cost(e′) is zero. Now, after iteration i where edge e′ has been probed but not selected, we define a sampling delay

$$ d(e') = \left\lfloor \log_c \frac{cost(e')}{pot(e')} \right\rfloor, $$

which implies that e′ will not be considered as a candidate in the next d iterations of the Greedy algorithm of Section 6.1. This definition of delay makes the (false) assumption that the information gain of an edge can only increase by a factor of c > 1 in each iteration, where the parameter c is used to control the penalty of having high sampling cost and having low information gain. As an example, assume an edge e′ having an information gain of only 1% of the selected best edge e, and requiring to sample a new bi-connected component involving 10 edges upon probing. Also, we assume that the information gain per iteration (and thus by insertion of other edges in the graph) may only increase by a factor of at most c = 2. We get d(e′) = ⌊log_2 (10/0.01)⌋ = ⌊log_2 1000⌋ = 9. Thus, using delayed sampling and having c = 2, edge e′ would not be considered in the next nine iterations of the edge selection algorithm. It must be noted that this delayed sampling strategy is a heuristic only, and that no correct upper bound c for the change in information gain can be given. Consequently, the delayed sampling heuristic may cause the edge having the highest information gain not to be selected, as it might still be suspended. Our experiments show that even for low values of c (i.e., close to 1), where edges are suspended for a large number of iterations, the loss in information gain is fairly low.

For each vertex v, the random event of being connected to Q in a random possible world follows a binomial distribution, with an unknown success probability p. To estimate p, given a number S of samples and a number 0 ≤ s ≤ S of 'successful' samples in which Q is reachable from v, we borrow techniques from statistics to obtain a two-sided 1−α confidence interval of the true probability p. A simple way of obtaining such a confidence interval is by applying the Central Limit Theorem of Statistics to approximate a binomial distribution by a normal distribution.

Definition 10 (α-Significant Confidence Interval). Let S be a set of possible graphs drawn from the probabilistic graph G, and let p̂ := s/S be the fraction of possible graphs in S in which Q is reachable from v. With a likelihood of 1−α, the true probability E(↕(Q,v,G)) that Q is reachable from v in the probabilistic graph G is in the interval

$$ \hat{p} \pm z \cdot \sqrt{\hat{p}(1-\hat{p})}, \qquad (6) $$

where z is the 100·(1−0.5·α) percentile of the standard normal distribution. We denote the lower bound as E_lb(↕(Q,v,G)) and the upper bound as E_ub(↕(Q,v,G)). We use α = 0.05.

7. EVALUATION

This section evaluates efficiency and effectiveness of our proposed solutions to compute a near-optimal subgraph of an uncertain graph to maximize the information flow to a source node Q, given a constrained number of edges, according to Definition 4. As motivated in the introductory Section 1, two main application fields of information propagation on uncertain graphs are: i) information/data propagation in spatial networks, such as wireless networks or road networks, and ii) information/belief propagation in social networks. These two types of uncertain graphs have extremely different characteristics, which require separate evaluation. A spatial network follows a locality assumption, constraining the set of pairwise reachable nodes to a spatial distance. Thus, the average shortest path between a pair of two randomly selected nodes can be very large, depending on the spatial distance. In contrast, a social network has no locality assumption, thus allowing to move through the network with very few hops. As a result, without any locality assumption, the set of nodes reachable in k hops from a query node may grow exponentially large in the number of hops. In networks following a locality assumption, this number grows polynomially, usually quadratically (in sensor and road networks on the plane) in the range k, as the area covered by a circle is quadratic in its radius. Our experiments have shown that the locality assumption, which clearly exists in some applications but not in others, has tremendous impact on the performance of our algorithms, including the baseline. Consequently, we evaluate both cases separately. Besides these two cases we also evaluate the following parameters, with default values specified as follows: size of the graph |V| = 10,000, average vertex degree d = 2, and the budget of edges k = 100.

nodes are integers selected uniformly from [0,10]. It is known that this model is not able to capture real human social networks [21], due to the lack of modeling of long tail distributions produced by "social animals". Thus, we use this data generation only in our first set of experiments, using real social network data later.

Synthetic Datasets: Locality assumption. We use two synthetic data generating schemes to generate spatial networks. For the first data generating scheme, denoted by partitioned, each vertex has the same degree d. The dataset is partitioned into n = 2·|V|/d partitions P_0, ..., P_{n−1} of size d. Each vertex in partition P_i is connected to all and only vertices in the previous and next partition P_{(i−1) mod n} and P_{(i+1) mod n}. This data generation allows to control the diameter of a resulting network, which is guaranteed to be equal to n−1.

For a more realistic synthetic dataset denoted as WSN, we simulate a wireless sensor network. Here, vertices have two spatial coordinates selected uniformly in [0,1]. Using a global parameter ε, any vertex v is connected to all and only vertices located in the ε distance of v using Euclidean distance.

For both settings the probabilities of edges are chosen uniformly in [0,1].

Real Datasets: No locality assumption. We use the social circles of Facebook dataset published in [24]. This dataset is a snapshot of the social network of Facebook, containing a subgroup of 535 users which form a social circle, i.e., a highly connected subgraph, having 10k edges. These users have an excessive number of 'friends'. Yet, it has been discussed in [37] that the number of real friends that influence, affect and interact with an individual is limited. According to this result, and due to the lack of better knowledge which people of this social network are real friends, we applied higher edge probabilities uniformly selected in [0.5;1.0] to 10 random adjacent nodes of each user. Due to symmetry, an average user has 20 such high-probability 'close friends'. All other edges are assigned edge probabilities uniformly selected in ]0;0.5].

For our experiments on collaboration network data, we used the scientific collaborations between authors who submitted papers to the General Relativity and Quantum Cosmology category. The structure of this dataset is that if an author v_i co-authored a paper with author v_j, where i ≠ j, the graph contains an undirected edge e from v_i to v_j. If a paper is co-authored by k authors this generates a completely connected (sub)graph on k nodes. This dataset has been published in [23]. The data covers papers in the period from January 1993 to April 2003 (124 months). Probabilities on edges are uniformly distributed in [0;1]. The graph consists of |V| = 5242 vertices and |E| = 14496 edges.

Finally, we evaluated our methods also on the Youtube social network, first published in [26]. In this network, edges represent friendship of the users with each other. The graph consists of |V| = 1134890 vertices and |E| = 2987624 edges. Again, the probabilities on edges are uniformly distributed in [0;1].
All experiments were evaluated on a system with Windows 10, 64 bit, 16.0 GB RAM and an Intel(R) Xeon(R) CPU E3-1220 at 3.10 GHz. All algorithms were implemented in Java (version 1.8.0_91).

7.1 Dataset Descriptions

This section describes our employed uncertain graph datasets. For both the case of locality assumption and no-locality assumption, we use synthetic and real datasets.

Synthetic Datasets: No locality assumption. Our first model denoted as Erdös is based on the idea of the Erdös-Rényi model [8], distributing edges independently and uniformly between nodes. Probabilities of edges are chosen uniformly in [0,1] and weights of

Real Datasets: Locality assumption. For our experiments on spatial networks we used the road network of San Joaquin County¹, having |V| = 18263 vertices and |E| = 23874 edges. The vertices of the graph are road intersections and edges correspond to connections between them. In order to simulate real sensor nodes located at road intersections, connected vertices that are spatially distant from each other have a lower chance to successfully communicate. Specifically, for two vertices having a distance of a in meters, we set the communication probability to e^{−0.001a}. Thus, a 10 m, 100 m and 1 km distance will yield a probability of e^{−0.01} ≈ 99%, e^{−0.1} ≈ 90%, and e^{−1} ≈ 37%, respectively.

¹ https://www.cs.utah.edu/lifeifei/SpatialDataset.htm

Figure 4: Experiments with changing graph size. (a) Changing graph size with locality assumption. (b) Changing graph size without locality assumption.

Figure 5: Experiments with changing graph density. (a) Changing graph density with locality assumption. (b) Changing graph density without locality assumption.

7.2 Evaluated Algorithms

The algorithms that we evaluate in this section are denoted and described as follows:

Naive: As proposed in [22, 7], the first competitor Naive does not utilize the strategy of component decomposition of Section 5 and utilizes a pure sampling approach to estimate reachability probabilities. To select edges, the naive approach chooses the locally best edge as shown in Section 6 but does not use the F-tree representation presented in Section 5.3. We use a constant Monte-Carlo sampling size of 5000 samples.

Dijkstra: Shortest-path spanning trees [35] are used to interconnect a wireless sensor network to a sink node. To obtain a maximum probability spanning tree, we proceed as follows: the cost w(e) of each edge e ∈ E is set to w(e) = −log(P(e)). Running the traditional Dijkstra algorithm on the transformed graph starting at node Q yields, in each iteration, a spanning tree which maximizes the connectivity probability between Q and any node connected to Q [33]. Since, in each iteration, the resulting graph has a tree structure, this approach can fully exploit the concept of Section 5, requiring no sampling step at all.

FT: Employs the F-tree proposed in Section 5.3 to estimate reachability probabilities. To sample bi-connected components, we draw 5000 samples for a fair comparison to Naive. All following FT algorithms build on top of FT.

FT+M: Additionally maintains for each candidate edge e the pdf of the corresponding bi-connected component from the last iteration (cf. Section 6.2).

FT+M+CI: Ensures that probing of an edge is stopped whenever another edge has a higher information flow with a certain degree of confidence as explained in Section 6.3.

FT+M+DS: Tries to minimize the candidate edges in an iteration by leaving out edges that had a small information gain/cost ratio in the last iteration (cf. Section 6.4). Per default, we set the penalization parameter to c = 2.

FT+M+CI+DS: Combines all of the above concepts.

7.3 Experiments on Synthetic Data

In this section, we perform experiments on randomly generated uncertain graphs. We generate graphs having no locality assumption using Erdös graphs and having locality assumption using the partitioned generation. Both generation approaches are described in Section 7.1. This data generation allows us to scale the topology of the uncertain graph G in terms of size and density. Per default, we use |V| = 1000, |E| = 4·|V| and k = 100.

Graph Size. We first scale the size |V| of the synthetic graphs. Figure 4(a) shows the information flow (left-hand side) and run-time (right-hand side) for our synthetic data set following the locality assumption. First, we note that the Dijkstra-based shortest-path spanning tree yields an extremely low information flow, far inferior to all other approaches. The reason is that such a tree structure allows no room for failure of edges: whenever any edge fails, the whole subtree becomes disconnected from Q. We further note that all other algorithms, including the naive one, are oblivious to the size of the network, in terms of information flow and run-time. The reason is that, due to the locality assumption, only a local neighborhood of vertices and edges is relevant, regardless of the global size of the graph. Additionally, we see that the confidence interval heuristic (CI) yields a significant run-time performance gain, but at the cost of a severe loss of information flow. Figure 4(b) shows the performance in terms of information gain and run-time for the Erdös graphs having no locality assumption. We first observe that Dijkstra and Naive yield a significantly lower information flow than our proposed approaches. For Dijkstra, this result is again attributed to the constraint of constructing a spanning tree, and thus not allowing any edges to connect the flow between branches. For the Naive approach, the loss in information flow requires a closer look. This approach samples the whole graph 5000 times to estimate the information flow. In contrast, our F-tree approach samples each individual bi-connected component 5000 times. Why is the latter approach more accurate? A first, informal explanation is that, for a constant sampling size, the information flow of a small component can be estimated more accurately than for a large component. Intuitively, sampling two independent components n times each yields a total of n² combinations of samples of their joint distribution. More formally, this effect is attributed to the fact that the variance of the sum of two random variables increases as their correlation increases, since

$$ Var\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} Var(X_i) + 2 \sum_{1 \le i < j \le n} Cov(X_i, X_j) $$

[27]. Furthermore, the naive approach also incurs an approximation error for mono-connected components, for which all F-tree (FT) approaches compute the exact flow analytically. We further see that the Naive approach, which has to sample the whole
