The Filter-Placement Problem and its Application to Minimizing Information Multiplicity Do´ra Erdo¨s Vatche Ishakian Andrei Lapets Evimaria Terzi Azer Bestavros edori@bu.edu visahak@bu.edu lapets@bu.edu evimaria@bu.edu best@bu.edu Computer Science Department, Boston University Boston, MA 02215, USA ABSTRACT of newly-generated content of interest. 2 1 Inmanyinformationnetworks,dataitems–suchasupdates Motivation: A common characteristic of many informa- 0 insocialnetworks,newsflowingthroughinterconnectedRSS tionnetworksisthatcontentpropagationisnotcoordinated: 2 feeds and blogs, measurements in sensor networks, route nodes relay information they receive to their neighbors, in- updates in ad-hoc networks – propagate in an uncoordi- dependent of whether these neighbors have received such n nated manner: nodes often relay information they receive information from other sources. This lack of coordination a to neighbors, independent of whether or not these neigh- may be a result of node autonomy (as in social networks), J borsreceivedthesameinformationfromothersources. This limited capabilities and lack of local resources (as in sensor 1 uncoordinated data dissemination may result in significant, networks),orabsenceoftopological/routinginformation(as 3 yet unnecessary communication and processing overheads, inad-hocnetworks). Asanexampleconsidertheusers’feed ultimately reducing the utility of information networks. To in Facebook. The feed displays all content (e.g. videos) ] B alleviatethenegativeimpactsofthisinformation multiplic- sharedbythefriendsofaparticularuserU. Incaseswhere ityphenomenon,weproposethatasubsetofnodes(selected twoormorefriendsofU sharethesamecontent,thiscontent D at key positions in the network) carry out additional infor- appears multiple times in U’s feed. Such uncoordinated in- s. mation filtering functionality. 
Thus, nodes are responsible formationpropagationresultsinreceivingthesameorsimi- c for the removal (or significant reduction) of the redundant larcontentrepeatedly,wecallthisphenomenoninformation [ dataitemsrelayedthroughthem. Werefertosuchnodesas multiplicity. Theredundancyunderlyinginformationmulti- filters. We formally define the Filter Placement prob- plicityresultsinsignificant,yetunnecessary,communication 1 lemasacombinatorialoptimizationproblem, andstudyits andprocessingoverheads,ultimatelyreducingtheutilityof v 5 computational complexity for different types of graphs. We information networks. 6 alsopresentpolynomial-timeapproximationalgorithmsand Asanillustrationofinformationmultiplicity,considerthe 5 scalable heuristics for the problem. Our experimental re- toynewsnetworkshowninFigure1,inwhichbrandedsyn- 6 sults, which we obtained through extensive simulations on dicated content propagates over directed edges. In this ex- . syntheticandreal-worldinformationflownetworks,suggest ample, node s is the originator of new information – the 1 that in many settings a relatively small number of filters syndicatedcontent–whereasnodesxandyaredistributors 0 arefairlyeffectiveinremovingalargefractionofredundant (e.g.,newspapers)thatmayaddbrandingoradvertisement 2 information. to the syndicated content received from s. All nodes other 1 thansdonotgeneratenewcontent;rather,theyutilizetheir : v 1. INTRODUCTION connections to relay content to other nodes along directed i links. X Information networks arise in many applications, includ- r ingsocialnetworks,RSS-feedandblognetworks,sensornet- x z a works and ad-hoc networks. In information networks, con- 1 tent propagates from content creators, i.e., sources, to con- tent consumers through directed links connecting the vari- s z2 w ousnodesinthenetwork. Theutilityofaninformationnet- y z work has been long associated with its ability to facilitate 3 effective information propagation. 
A network is considered Figure 1: Illustration of information multiplicity. highly functional, if all its nodes are up-to-date and aware Nowassumethatasinglenewsitemireachesxandy after Permissiontomakedigitalorhardcopiesofallorpartofthisworkfor its generation by s. Nodes x and y forward a copy of i to personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare notmadeordistributedforprofitorcommercialadvantageandthatcopies their neighbors, and as a result, z1, z2 and z3 receive i as bearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,to well. Infact,z2 (unnecessarily)receivestwocopiesofi;one republish,topostonserversortoredistributetolists,requirespriorspecific fromxandonefromy. Evenworse,ifz ,z andz forward 1 2 3 permissionand/orafee. Articlesfromthisvolumewereinvitedtopresent whatever they get to w, then w receives (1+2+1) copies theirresultsatThe38thInternationalConferenceonVeryLargeDataBases, of i. Clearly, to inform w, one copy of i is enough. August27th-31st2012,Istanbul,Turkey. Inmanynetworksettingsweobservepropagationofmul- ProceedingsoftheVLDBEndowment,Vol.5,No.5 Copyright2012VLDBEndowment2150-8097/12/01...$10.00. tiple instances of the same underlying piece of information. 418 For example in online media networks information multi- des [2] gives an effective algorithm to compute betweenness plicityarisesbydifferentnewssitespostingdifferentarticles for nodes and sets of nodes in a graph. Potamias et al. [28] aboutthesametopic. Insensornetworksinformationabout show how nodes with high centrality can be used to give the same measurement is propagated by different sensors. aneffectiveapproximationofthedistance betweenall node Toalleviatethenegativeimplicationsofinformationmul- pairs in a network. The main point in these works is to tiplicity, we propose that a subset of nodes be strategically find nodes which lie on as many shortest paths as possi- selectedandequippedwithadditional“filtering”functional- ble. 
The Filter Placement problem is not related to ity, namely the removal (or significant reduction) of similar betweenness centrality. Content does not only propagate data items relayed through them. In some cases, filtering along shortest paths. The goal of Filter Placement is could be an inexpensive process when only exact matching rather to cover as many paths as possible. In the example techniquesareusedsuchasinbroadcastbyfloodingscenar- given in Figure 1, nodes with the highest betweenness cen- ios [31]. In this case, all nodes can be equipped with this trality are x and y. However, the only node where we can functionality. In other cases, filtering may cause significant apply meaningful filtering functionality in this graph, is z2. overheadstoidentifysimilarbutnotidenticalcontent. (e.g. Thisexampledemonstratesthatfindingasetofnodeswith image[25],videoprocessing[5],timeseriesanalysis[3],con- high betweenness centrality does not necessarily solve the tent with different branding or presentation). Due to this Filter Placement problem. overhead, the deployment of such a filtering functionality SocialNetworks: Theproblemofidentifyingkkey nodes cannot be justified, except at a small number of nodes. oragentshasbeenaddressedinthecontextofmanydifferent Werefertonodesthatcarryoutsuchspecialfunctionali- social-network studies and applications. tiesasfilters. Werefertotheproblemofidentifyingtheset For example, a number of studies focused on the identi- offilternodesinagivennetworkastheFilterPlacement fication of k influential nodes such that information seeded problem. (e.g., advertisements) at these nodes would be maximally Notice that the placement of filters does not impact the spread out throughout the network [7, 11, 13, 30]. All ex- utilityofthenetworkinanyotherway;filterssimplyremove isting variants of this influence-maximization problem are similarcontent. 
Intheexampleabove,placingtwofiltersat concerned with improving the extent of information spread z2 and w completely alleviates redundancy. in the network, even if such maximization results in signif- Paper Contributions: To the best of our knowledge, we icant information redundancy. Our work is complementary arethefirsttoaddresstheissueofnetworkinformationmul- in that it does not aim to change or improve information tiplicity, and the first to propose a solution based on the spread. Rather, our work aims to identify the placement of deployment of filtering functionality at strategic locations k filter nodes so as to minimize redundant communication in these information networks. We formally define the Fil- andprocessingofinformationwithoutchangingtheoriginal ter Placement problem as a combinatorial optimization extent of information spread. problem,andstudyitscomputationalcomplexityfordiffer- Another line of work focuses on the identification of k ent types of graphs. We present polynomial-time constant- nodesthatneedtobemonitored(and/orimmunized)inor- factor approximation algorithms for solving the problem, der to detect contamination (prevent epidemics) [1, 14, 18, which is NP-hard for arbitrary graph topologies. We also 27]. Here, the goal from selecting these key nodes is to in- present a set of computational speedups that significantly hibitasmuchaspossiblethespreadofharmfulinformation improve the running time of our approximation algorithm content (e.g., viruses) by insuring that the selected nodes without sacrificing its practical utility. Our experimental actasbarriersthatstop/terminatetheflowofsuchcontent. results,whichweobtainedthroughextensivesimulationson In our model, filters do not terminate or inhibit the prop- syntheticandreal-worldinformationflownetworks,suggest agation of information. 
Rather, they remove redundancy thatinmanysettingsarelativelysmallnumberoffiltersare in order to propagate a streamlined/sanitized (“cleaner”) fairly effective in removing a large fraction of duplicative version of the information content. As a result, the under- information. lyingcombinatorialproblemweencounterwhensolvingthe Filter Placement problem is different from the combi- natorial problems addressed before. 2. RELATEDWORK Sensor Networks: At a high level, our work is related to While our work is the first to motivate, formalize, and mechanismsforinformationflowmanagementinsensornet- study the Filter Placement problem, the literature is works. In sensor networks, nodes are impoverished devices full of other works in various application domains that ad- – with limited battery life, storage capacity, and process- dress two related problems: the use of coordinated content ing capabilities – necessitating the use of in-network data dissemination, and/or the selection of nodes to carry out aggregation to preserve resources, and/or the use of coor- special functionalities. In this section, we provide a review dination of node operations for effective routing and query of these works. processing. In-network aggregationisnotaimedatremoval Centrality in Networks: Shortest-pathdistancebetween ofredundantdata,butattheextractionofaggregatestatis- nodesinanetworkplayanimportantroleinmanyapplica- tics from data (e.g., sum, average, etc). Therefore, studies tions including transportation, protein interaction and so- along these lines focus on the design of data-aggregation cial networks. The importance of a node in such a network strategies [6, 10, 12, 19] as well as associated routing of is defined by its betweenness centrality. This measure com- queries and query results [15] in order to allow for reduced putesforeverynodethetotalnumberofshortestpathsthat communicationcosts. Herewenotethatthereisanimplicit node lies on. 
This measure can be generalized to a subset tradeoff between resource consumption and the quality of ofthenodesinthegraph; Group betweennesscomputesthe information flow. An aggregate is effectively a caricature number of shortest paths, that a set of nodes cover. Bran- (an approximation) of the information; aggressive aggrega- 419 tionimpliesareductioninthequality(andhenceutility)of munication graph (c-graph). Participants in the network information flow. correspond to the nodeset V. A directed edge (u,v) E in ∈ Coordinatedcommunicationofsensorydataisexemplified the graph represents a link, along which node v can prop- byrecentworkthatwentbeyondstandardaggregationand agate items to u. Some nodes of G generate new items by focused on minimizing the communication cost required for virtue of access to some information origin; we call these the evaluation of multi-predicate queries scheduled across nodes sources. Sources generate distinct items – i.e., any network nodes [4]. In that work, a dynamic-programming two items generated by the same source are distinct. Once algorithm is used to determine an optimal execution order anitemisgenerated,itispropagatedthroughGasfollows: of queries. Examples of other studies that proposed coor- everynodethatreceivesanitemblindlypropagatescopiesof dination strategies include techniques that ensure efficient ittoitsoutgoingneighbors. Sincethepropagationisblind, storage, caching, and exchange of spatio-temporal data [22, a node might receive many copies of a single item, leading 23] or aggregates thereof [21]. to information multiplicity. Inadditiontotheirfocusonbalancinginformationfidelity Ourinformationpropagationmodelisdeterministicinthe and resource consumption, there is an implicit assumption sensethateveryitemreachinganodeisrelayedtoallneigh- in all of these studies that all nodes in the network are un- borsofthatnode. 
Inreality,linksareassociatedwithprob- der the same administrative domain, and hence could be abilities that capture the tendency of a node to propagate expected to coordinate their actions (e.g., as it relates to messages to its neighbors. Although our results (both the- routing or wake-sleep cycles). In our work, we consider set- oretical and experimental) continue to hold under a proba- tingsinwhichcoordinationofnodeoperationisnotjustified bilistic information propagation mode, for ease of presenta- (since nodes are autonomous). Instead, we focus on reduc- tionandwithoutlossofgeneralityweadoptadeterministic ing overheads by adding functionality to a subset of nodes, propagationmodelinthispaper. Moreover,eventhoughthe without concern to resource or energy constraints. propagationmodeldefinedbyGcouldbeusedtocommuni- Finally, broadcast by flooding in ad-hoc networks is an cate multiple items, in this paper we focus on a single item active topic of research. A good comparison of suggested i. The technical results are identical for the multiple-item methods is done by Williams and Camp [31]. In its basic version of the problem. model,anodepropagatesthebroadcastmessageitreceives Filters: Considertheproductionofanewitemibysource to all of its neighbors. The goal is to devise propagation s. In order for node v to receive this item, i has to travel heuristics, such that all nodes in the network receive the along a directed path from s to v. Since there are several content,butatthesametimethenumberofexactmessages paths from s to v, node v will receive multiple copies of i. propagated in the network is minimized to avoid network Moreover if v has children, then v propagates every copy congestion. In this setup, the propagated items are exactly of i it receives to each one of its children. 
To reduce the thesame,thuscomparisonofitems,bystoringafingerprint amount of redundancy (underlying information multiplic- of the received content, is a relatively cheap operation for ity), our goal is to add a filtering functionality to some of nodes. Theemphasisofexistingresearchinthisareaismore the nodes in G. We use filters to refer to nodes augmented on effective ways of information spreading. Our work is with such filtering functionality. applicable in a different domain of problems, namely where A filter can be viewed as a function that takes as input a spread of information is given, but comparison of content multiset of items I and outputs a set of items I′, such that is expensive and not every node can be equipped with this the cardinality of I′ is less than the cardinality of I. The functionality. specific filtering function depends on the application. We Content Networks: A common challenge in large-scale emphasize,thatformanyapplicationssuchfilteringmaybe access and distribution networks is the optimal placement costly to implement (for example resource-intensive). For ofserverstooptimizesomeobjective(e.g.,minimizeaverage ease of exposition, we will fix the filter function to be the distanceordelaybetweenend-pointsandtheclosestcontent function that eliminates all duplicate content:1 for every server)–namely,theclassicalfacilitylocationandk-median item the filter node receives, it will perform a check to de- problems[20],forwhichalargenumberofcentralized[20,8] termineifithasrelayedanitemwithsimilarcontentbefore. anddistributed[24,9]solutionshavebeendeveloped. Con- Ifnot,itpropagatestheitemtoallofitsneighbors. Afilter ceptually, the Filter Placement problem could be seen node never propagates already propagated content. as a facility location problem, wherein filters constitute the Objective Function: Let v V be an arbitrary node facilities to be acquired. However, in terms of its objective, ∈ in the graph. 
Define Φ( ,v) as the number of (not neces- theFilterPlacement problemisfundamentallydifferent ∅ sarily distinct) items that node v receives, when no filters since there is no notion of local measures of “distance” or are placed. Let A V be a subset of V. Then Φ(A,v) “cost”betweeninstalledfacilitiesandend-points. Rather,in ⊆ denotes the number of items node v receives, when filters oursetting,thesubjectoftheoptimizationisaglobalmea- are placed in the nodes in A. For a subset X V, let sure of the impact of all facilities (and not just the closest) ⊆ Φ(A,X)= Φ(A,x). on the utility that end-points derive from the network. ForagivePnxit∈eXmofinformation,thetotalnumberofmes- sages that nodes in V receive in the absence of filters is Φ( ,V). For filter set A, the total number of items that 3. THEFILTERPLACEMENTPROBLEM ∅ nodes in V receive is Φ(A,V). Thus, our goal is to find the Propagation model: In this paper, we consider networks set A of k nodes where filters should be placed, such that consisting of an interconnected set of entities (e.g., users, the difference between Φ( ,V) and Φ(A,V) is maximized. software agents) who relay information items (e.g., links, ∅ ideas, articles, news) to one another. We represent such a 1Generalizationsthatallowforapercentageofduplicatesto networkasadirectedgraphG(V,E),whichwecallthecom- make it through a filter are straightforward. 420 Problem 1 (Filter Placement–FP). Givendirected dynamic programing on c-trees. We call a graph G(V,E) c-graph G(V,E) and an integer k, find a subset of nodes a communication tree (c-tree), if in addition to being a c- A V of size A k, which maximizes the function graph, G is a tree when removing the source node. The ⊆ | |≤ recursionofthedynamicprogrammingalgorithmisdoneon F(A)=Φ( ,V) Φ(A,V). ∅ − thechildrenofeverynode. Transformingtheinputc-treeG into a binary tree makes it computationally more feasible. 
Another choice for an objective function in Problem 1 This transformation can be done in the following way: for would be to minimize Φ(A,V), and by that maximize the everynodev V,ifd (v) 2thendonothing. Otherwise, number of items covered. However this function has short- out ∈ ≤ fix an arbitrary ordering v ,v ,...v of the children of v, comings which make it undesirable: as we observed, every 1 2 r where d (v) = r. Let v be the left child of v. Create a copyofanitemcorrespondstoadirectedpathsinthegraph. out 1 new node u and let that be the right child of v. Let the The total number of directed paths in a graph is typically 1 remaining children of v be the children of u . Repeat these exponential in the number of nodes. Hence, even covering 1 steps until u has only two children: v and v . The an exponential amount of paths may result in a relatively r 1 r 1 r − − edges adjacent to the source will continue to be connected low level of redundancy elimination. Also, as shown in the example in Figure 1, not all redundancy can be eliminated. tothenodesinV. TheresultingbinarytreeisG′. Observe Asaconsequence,thenumberofpathscoveredisnotagood thatthenumberofnodesinG′ isatmosttwiceasmuchas thenumberofnodesinG. Alsonoticethatifthemaximum indicator of the quality of filtering. TheobjectivefunctionF(A)describedinproblem1over- out-degreeintreeGis∆thentheheightofG′ isatmosta factor of ∆ 1 larger than G. comestheseshortcomingsbymeasuringtheimprovementin redundancy reduction. In addition, it has some nice prop- Weapply−dynamicprogrammingonG′: LetOPT(v,i,A) be the function that finds an optimal filter set A of size erties. It is always positive and monotone, since placing an A i k in the subtree rooted in v. Then for every additional filter can only reduce the number of items. This i| =| 0≤...k≤we can compute OPT(v,i,A) by also implies that F is bounded: F( ) = 0 F() F(V). 
∅ ≤ ≤ Furthermore,functionF issubmodular,sinceforeveryX OPT(v,i,A)=max ⊂ { FY(Y⊂ V van)d vF∈(/YY)., it holds that F(X ∪{v})−F(x) ≥ jm=0a.x..i{OPT(vl,j,A)+OPT(vr,i−j,A)}, In∪th{e d}efi−nition of Problem 1, there is a bound k on the j=m0..a.xi 1{OPT(vl,j,A∪{v})+OPT(vr,i−1−j,A∪{v})}}. size of the filter set A. In the next proposition we show − that when the number of filters is not bounded, finding the In the equation above, vl and vr denote the left and right minimal size filter placement that maximizes FP is trivial. childofv. Thefirsttermofthisrecursioncorrespondstothe Throughoutthepaperweusen= V todenotethenumber casewherewedonotplaceafilterinv,henceatotalnumber of nodes in G. | | of i filters can be placed in the subtrees. The second term corresponds to the case when we do place a filter in v and Proposition 1. LetG(V,E)beadirectedc-graph. Find- onlyi 1canbeplacedinthesubtrees. Theoptimalsolution − ingtheminimalsizedsetoffiltersA V,suchthatF(A)= for the whole tree is then OPT(r,k,A) where r V is the ⊆ ∈ F(V) takes O(E ) time. root of the tree. The above recursion does not guarantee | | that we do not choose any dump nodes of the binary tree. Proof. Let the filter set A consist of the nodes v V ∈ For this reason, we omit the second term of the recursion thatarenotsinksandd (v)>1,i.e., A= v V d (v)> in { ∈ | in when v is a dump node. Building the binary tree takes 1 and d (v) > 0 . With this choice of A, either the node out } O(n∆) time. We have to evaluate the recursion O(k) times is a sink, is in A or it has indegree 1. Hence, every node for every node. One evaluation takes O(k) computations. propagatesatmostonecopyofanitemtoitschildren. This shows the optimality of A. It is easy to see that A is a There are O(nlog(∆)) nodes in G′, which makes the total running time O(n∆+k2nlog(∆)). minimal optimal set; omitting any node from set A would result in unnecessary content duplication. 
Finding set A 4.2 FilterPlacementinDAGs needsonetraversalofthegraph,todeterminethenodeswith Considerc-graphsG(V,E),whicharedirectedandacyclic indegreegreaterthan1. Thishasrunningtimeproportional (DAGs). Although DAGs seem to have simpler structures to the number of edges (O(E )) in the graph. | | thanarbitrarygraphs,theFPproblemisNP-completeeven Despite this result, FP for a filterset of fixed k-size on on DAGs. The proof can be found in the appendix. an arbitrary graph is NP-complete. (For the proof see the Theorem 2. The FP problem is NP-complete when the appendix.) c-graph G(V,E) is a DAG. Theorem 1. The FP problem on an arbitrary c-graph In the remainder of this section, we propose polynomial- G(V,E) is NP-complete. timealgorithms,thatresultinsolutionsfortheFPproblem, that we prove to be effective in our experiments. 4. FILTERPLACEMENTALGORITHMS First,westartwithanaiveapproach,Greedy 1(G 1)(this In this section, we present algorithms for the FP prob- name will be meaningful later, when we propose heuris- lemondifferenttypesofc-graphs, namelytrees, DAGsand tics that can be viewed as extensions of Greedy 1). Con- arbitrary directed graphs. sider node v V; v will receive items on its incoming ∈ edges and will forward a copy of every item on every one 4.1 FilterPlacementinaTree of its outgoing edges. We can now compute a lower bound WhileFPishardonarbitrarygraphs,andaswewillshow on the number of copies of an item that v is propagating: also on DAGs, it can be solved in polynomial time with m(v)=d (v) d (v). Greedy 1computesm(v)forevery in out × 421 v V andchoosesthek nodeswiththehighestm()values. ThisistheconceptunderlyingourGreedy Allalgorithm: ∈ Computingm()dependsonthewaythegraphisstored. In The algorithm first chooses the node with the highest im- general it takes O(E ) time, since we need to compute the pact. Placingafilteratthatnodemightchangetheimpact | | inandoutdegreeofeverynode. Findingtheklargestvalues ofothernodes. 
Hence,anupdateoftheimpactofeverynode ofm()requiresO(kn)computations,whichmakesthetotal is required. Then Greedy All chooses again the node with running time of Greedy 1 O(kn+ E ). the highest impact. The algorithm repeats this for k steps. | | Observe that Greedy 1 only looks at the immediate neigh- bors of every node, whereas Greedy All is more exhaustive and considers all the nodes in the graph. B Algorithm 1 Greedy All algorithm Input: DAG G(V,E) and integer k. A Output: set of filters A V and A k. ⊆ | |≤ 1: find topological order σ of nodes 2: for i=1...k do 3: for j =1...n do Figure2: Fork=1Greedy 1placesafilterinB while 4: compute I(v ) j the optimal solution would be to place a filter in A. 5: A argmax I(v) AlthoughGreedy 1issimpleandefficient,formanygraphs 6: retur←n A v∈V it does not yield an optimal solution as exemplified in Fig- ure 2. Without any filters, the total number of received Because of the properties of F(), we can use the well- items in the graph is 14. Greedy 1 would place a filter in knownresultbyNemhauseretal.[26],whichstatesthatfor node B; this is because m(B) = 1 4 is the largest m() anymaximizationproblem,wheretheobjectivefunctionisa × valueinthisgraph. Howevertheoptimalsolutionwouldbe positive,monotoneandsubmodularset-function,thegreedy to place a filter in A, for which m(A) = 3 1. Placing a approach yields a (1 1)-approximation. filter in B leaves the number of received ite×ms unchanged. − e Theorem 3. The Greedy All algorithm for problem 1 is Making node A a filter instead, would yield a total number an (1 1)-approximation. of 12 received items. − e Inlightofthisillustrativeexample,weproposeGreedy All Implementation of Greedy All. In order to compute the (G All), which is an improved extension of Greedy 1. The impact, we need to compute the Prefix and Suffix of mainideabehind Greedy Allis, tocomputeforeverynode every node. 
We can think of Prefix(v) as the number of v V, how many copies of a propagating item i are gener- ∈ copies of an item i that v receives. Since the copies of i are ated because of v. The algorithm then greedily chooses the propagatedthroughtheparentsofv,itiseasytoseethatthe node with the highest such number to put a filter. Prefix ofanodeisthesumoftheprefixesofitsparents. We First, we look at the propagation of an item i that is can compute the Prefix of every node recursively, by first created by the source s. In order for i to be received by computingthePrefix ofitsancestors. Wefixatopological node v, i has to propagate along at least one path from order σ of the nodes. (A topological order of nodes is such s to v. In fact, v will receive as many copies of i as the an order, in which every edge is directed from a smaller to numberofpathsfromstov. Wedenotethenumberofdis- a larger ranked node in the ordering.) This order naturally tinct directed paths from any node x to y by #paths(x,y). implies,thattheparentsofanodeprecedeitintheordering. Then, the number of copies of i that arrive in v can be ex- TraversingthenodesofGintheorderofσ thePrefix can pressed as #paths(s,v). For later clarity we need to make becomputedwiththerecursiveformula(1)(Π denotesthe v the distinction between the number of paths between two set of parents of v). nodes and the number of copies of i that v receives. We denote the latter value by Prefix(v). Observe, that for Prefix(v)= X Prefix(x) (1) now #paths(s,v) = Prefix(v). We denote by Suffix(v) x∈Πv the number of copies of i, that are generated across the Remember, that the Prefix of a node can also be ex- wholegraph,afterihaspropagatedthroughv. Similarlyto pressed as #paths(s,v), thus formula (1) is equivalent to Prefix(v), the Suffix of v is also related to directed paths in the graph. 
In this simple case the Suffix is equal to Prefix(v)=#paths(s,v)= X #paths(s,x) (2) the total number of distinct directed paths, that start in v: Suffix(v) = #paths(v,x). Observe now, that the x∈Πv Px V As we established before, Suffix(v) is equivalent to the total number of c∈opies of i that propagate through v is the total number of directed paths starting in v. This can product Prefix(v) Suffix(v). × be computed effectively by doing some bookkeeping dur- To study the effects of placing a filter in v, observe that ing the recursive computation of (1): For every node v we even if node v were a filter, it would still propagate one will maintain a list, plist that contains for every ances- copy of i. In other words, placing a filter in v has the same v tor x of v the number of paths that go from x to v. Thus effect on the number of copies of i, as if Prefix(v) = 1. plist [x] = #paths(x,v). Observe now, that for an an- Hence, the amount of redundant copies of i generated be- v cestor x of v, plist [x] can be computed as the sum of the cause of the propagation through v, can be expressed by v plist of the parents. I(v) = (Prefix(v) 1) Suffix(v). We call I(v) the im- − × opabcjetcotfivve.fTunhcetiiomnp:aFct(vc)an=aΦls(o,bVe)expΦre(ssved,iVn)t=ermI(svo).fthe ∀x∈V :plistv[x]= X plistp[x] (3) ∅ − { } p∈Πv 422 S1 S2 Observe,thatforeverynode,plist canbecomputeddur- v ing the same recursion as (1). To compute the Suffix of a node v, we need to sum the number of paths that start in v. This is simply the sum of A the plist entries, that correspond to v. B C Suffix(v)=#paths(v,.)= Xplistx[v] (4) x V ∈ As a technical detail, in order to use this recursive for- Figure 3: For k=2 Greedy All chooses filter set mula, every node’s plist contains itself with value one: A,C , while the optimal solution is B,C . plist [v]=1. Asaspecialcase,asourceslistwouldcontain { } { } v only the entry corresponding to itself. 
Thus far, we described how to compute the impact of a node by computing its Prefix and Suffix when there are no filters in the network. Recall our earlier observation that placing a filter in a node $v^*$ has the same effect on the number of copies of an item as if there were only one path leading from the source to $v^*$. We can capture this effect by setting $\mathrm{plist}_{v^*}[v^*] = 1$ and all other values $\mathrm{plist}_{v^*}[x] = 0$ before using this list in the recursion. Observe that the change in $\mathrm{plist}_{v^*}$ changes the Suffix of all nodes preceding $v^*$ and the Prefix of all nodes succeeding it. For this reason, we need to make a pass over the whole graph when updating the impact.

Running time of Greedy All. The topological order σ of the nodes can be computed in linear time, which is negligible in the total running time of the algorithm. Formulas (1) and (3) are updated along every edge of the graph. Formula (1) can be updated in constant time along an edge, while formula (3) requires $O(\Delta)$ lookups and additions (where $\Delta$ is the maximal degree in the graph). To compute $\mathrm{Suffix}(v)$ we keep a counter for every node $v$ and, according to formula (4), update it online when the plist entries are computed. This yields a total running time of $O(|E| \cdot \Delta)$ for one iteration of the algorithm. Greedy All has $k$ iterations, which yields a total running time of $O(k \cdot \Delta \cdot |E|)$. This can be $O(k \cdot n^3)$ in the worst case, but in practice, for most c-graphs $|E| < O(n \cdot \log n)$, which results in an $O(k \cdot n \cdot \log n)$ running time.

Observe that Greedy All is optimal for k = 1. For larger values of k it also yields very good results. In our experiments we found that there are real-life graphs where Greedy All is capable of finding an FP that yields perfect filtering. However, in some cases it does not find the optimal solution. Consider the toy example in Figure 3. When no filters are placed, the total number of received items is $\Phi(\emptyset, V) = 26$. Since for k = 1 the impact values of the nodes are I(A) = 7, I(B) = 6, I(C) = 6, Greedy All will choose A as its first filter. Observe now that the total number of items has been reduced by the amount of A's impact: $\Phi(\{A\}, V) = 19$. Then, for k = 2, nodes B and C have updated impacts I(B|A) = 3 and I(C|A) = 4. The algorithm will choose C. This yields a total of $\Phi(\{A,C\}, V) = 15$ received items in this system. The optimal solution would be to place filters in nodes B and C, which would result in $\Phi(\{B,C\}, V) = 14$ items received.

Computational speedups. Our experimental evaluation (Section 5) shows that although Greedy All yields good results with respect to our objective function, it is rather inefficient to apply to large datasets. For this reason, we propose two new heuristics that, along with Greedy 1, yield much faster and yet effective solutions to FP. Both heuristics are inspired by the principles behind Greedy All.

The first algorithm, Greedy Max (G Max), computes the impact of all nodes in the graph, similar to Greedy All. Once the impacts are calculated, Greedy Max selects the k nodes with the highest impact as filters, without recomputation of the impact. As our experiments show, Greedy Max finds solutions very similar to those found by Greedy All. The running time of this algorithm is $O(|E|)$, since the impact of nodes needs to be computed only once.

Our other heuristic is Greedy L (G L). This algorithm computes a simplified impact for every node: $I'(v) = \mathrm{Prefix}(v) \times d_{out}(v)$. This is the number of items $v$ propagates to its immediate children. Then Greedy L picks the top k nodes with respect to $I'$ as filters. To compute $I'()$, we need to compute Prefix() (Equation (1)) and we need to know the degree of every node. Both tasks can be accomplished by traversing the edges once in Greedy L. $I'$ is updated in every iteration, which yields a total running time of $O(k|E|)$.

The three algorithms proposed above are all significantly faster than Greedy All. This superior running time is due to the fact that these algorithms choose the filters leveraging less information about the graph. All three heuristics capture different characteristics of a good filter set, and hence their performance is not the same on different datasets.

First observe that a well connected node in the network has high in- and out-degrees and thus would be ranked high by Greedy 1. On the other hand, G 1 does not take into account the location of the node in the network. Greedy Max computes the full impact of every node, thus it can give a more accurate estimate of the influence of individual nodes than Greedy 1. However, it fails to capture the correlation between filters placed on the same path and thus might choose filters that diminish each other's impact. Greedy L overcomes these shortcomings by combining these two methods. However, this algorithm tends to pick nodes further away from the source, since the Prefix() of nodes grows exponentially with the distance from the source. The differences in performance for various datasets are shown in our experimental evaluation (Section 5).

Algorithm 2 The Greedy L algorithm on DAGs.
Input: DAG G(V,E), integer k
Output: set of filters A ⊆ V of size k.
1: A = ∅
2: for i = 1...k do
3:   compute Prefix()
4:   A ← A ∪ argmax_{v∈V} Prefix(v)
5: return A

[...] and sign(v). We find w with the largest σ(w) such that $(w, w_{u1}) \in \mathrm{sign}(u)$ and $(w, w_{v1}) \in \mathrm{sign}(v)$. u and v are in different branches (thus edge (u,v) can be added) only if $\sigma(v) < \sigma(w_{u1}) \le \sigma(u)$.

The DFS traversal in the first phase of Acyclic takes $O(n \cdot \log n)$ time. To create the signatures we need to traverse T once.

4.3 Filter Placement in General Graphs

Solving FP on general graphs is NP-hard. In this section we propose a heuristic to choose an acyclic subgraph from any graph. This allows us to apply the algorithms designed for DAGs on this subgraph.

Let c-graph G′(V,E′) be a general directed graph. Fix an arbitrary ordering σ of the nodes in V. We call an edge
Every node w passes on its signature list to (v,u) E′ a forward edge if σ(v) < σ(u); otherwise it is its children. Every child uw of w adds w to the list, if w a backw∈ard edge. A well-known 2-approximation for choos- is a junction, otherwise it uses sign(w) unchanged. This ing an acyclic subgraph is the following greedy algorithm: introduces an additional O(n) steps to the algorithm. For fix an arbitrary order σ of the nodes. Let F be the set of an edge (u,v) the comparison of sign(u) and sign(v) takes forward edges with respect to σ, and let B be the set of O(logn) time. This needs to be repeated for every edge backwardedges. If F > B thenchoosetheDAGG(V,F), in E′, yielding a total running time of O(n2 logn) for the else choose the edge|s|in B|:|G(V,B). The drawback of this Acyclic algorithm. · algorithm is that it does not guarantee the resulting DAG to be connected. For this reason we develop our own algo- Algorithm 3 Acyclic algorithm to find maximal acyclic rithm,Acyclic(Algorithm3)tochooseaconnectedacyclic subgraph. subgraph. Input: directed graph G′(V,E′) with source s The Acyclic algorithm finds an acyclic subgraph in two Output: acyclic directed graph G(V,E) steps. We can assume that there is only one source s in 1: DFS traversal starting in s G′,otherwisewecreateanewsuper-sources,anddirectan 2: E T edgefromstoeverysource. First,AcyclicperformsaDFS 3: com←pute signatures traversalofG′ startingins. Everyedgethatisusedduring 4: for (u,v) E′ do this traversal is added to G. Second every remaining edge 5: if σ(v)∈<σ(w ) σ(u) then u1 inE′ isconsideredforaddition. Anedgee E′ isaddedto 6: E (u,v) ≤ E if it does not create a cycle. Observe, th∈at the resulting ← acyclic subgraph is maximal, since no edge can be added without creating a cycle. Acyclic is built on the observation made in the previous 5. 
EXPERIMENTALEVALUATION section; an item i that is generated by s reaches a node v We present here a comparative experimental evaluation if there is at least one directed path from s to v. For this of the various Filter Placement algorithms, using syn- reason, it is clear that every node that receives a copy of i thetic data and real datasets. These datasets capture the is visited by the DFS traversal in the first part of Acyclic. propagation of information in real-life scenarios. We report Nodes that are not visited, do not receive copies of i, thus the performance of the various algorithms with regard to uninterestingwithregardtoinformationpropagationinG′. the objective function as well as the running time. TheedgesusedduringtheDFStraversalresultinaspanning tree T of G, thus making G connected. In the second part of Acyclic edges are added to G in a 1 1 greedyfashion: everyedgee E′isconsideredandisadded 0.8 0.8 ptoroEachiffiotrddooeinsgntohtisrewsuoultldinb∈eatdoiraedcdtetdhecyecdlgee. eAinnqauiveestaiopn- Probability 00..46 Probability 00..46 to G, and then run a DFS to determine whether G e 0.2 0.2 ∪{ } is still acyclic. If not, then remove e. However, this would 0 0 0 10 20 30 40 50 0 20 40 60 80 100 120 require too many computations. # incoming edges # incoming edges Ourapproachusesinsteadadecisionmechanismbasedon (a) x= 1 (b) x= 3 4 4 thelocationofnodesinT. Wecalltheorderinwhichnodes arefirstvisitedduringtheDFStraversalthenodesdiscovery Figure 4: CDF of indegrees for synthetic graphs time, and denote it by σ( ). We call a node a junction if it has more than one child in T. Due to the DFS traversal, therecanbenoforwardedgewithregardtoσ()inE′,that Performance Metric: Tomeasureandcomparetheeffec- isnotanedgeinT. Abackwardedge(u,v)canbeaddedto tiveness of various algorithms, we define the Filter Ratio E ifthereisnodirectedpathfromvtou. Thisisthecaseif (FR) as the ratio between the objective function using fil- vanduareindifferentbranchesofthetree. 
Thus,thereisa ters deployed in a subset of nodes (A) and the maximum junction w, for which paths (w,w ),(w ,w )...(w ,u) valueoftheobjectivefunction–i.e.,thelevelofredundancy u1 u1 u2 ur and(w,w ),(w ,w )...(w ,v)areinT andw andw reductiondeliveredbyinstallingfiltersatthenodesinA. A v1 v1 v2 vl u1 v1 aredifferent. Inordertodecidetheexistenceofsuchawwe FR of 1 underscores complete elimination of redundancy. need to keep for every node u a signature: sign(u) contains F(A) alistofpairs (w,w ) . Wherethefirstelementsw ofthe FR(A)= { u1 } F(V) pairs are the junctions on the path (s u). Observe that → σ(wu1)isalwayslessorequaltoσ(u). Also,foranybranch Baseline Algorithms: We compare our algorithms to a startinginweitherallnodes’discoverytimesinthatbranch set of random heuristics that serve as baseline. aresmallerorallarelargerorequalthanσ(wu1). Whenan Random k (Rand K): chooses k filters from V uniformly at edge (u,v) is considered for addition now, we scan sign(u) random. 424 0.6 0.6 0.5 0.5 0.4 0.4 R 0.3 R 0.3 F F 0.2 0.2 0.1 0.1 0 0 0 10 20 30 40 50 0 10 20 30 40 50 # of Filters # of Filters G_ALL G_L Rand_K G_ALL G_L Rand_K G_Max Rand_W G_Max Rand_W G_1 Rand_I G_1 Rand_I (a) x= 1 (b) x= 3 4 4 Figure 5: FR for synthetic graphs Random Independent (Rand I):Everynodebecomesafilter subgraph. The subgraph we chose contains the nodes and independently of other nodes, with probability k. adjacent edges, corresponding to sites that use the phrase n Random Weighted(Rand W):Everynodevisassignedaweight “lipstickonapig”. Sitesmayfreelylinktoeachother,which w(v)= 1 ,whereC = u V (v,u) E isthe might result in cycles. We run Acyclic to find a maximal set of chPildur∈eCnvodfinv(.u)Then, evervy no{de∈bec|omes a∈filt}er with acyclicsubgraphinthisgraph. Thereisnoclearinitiatorof probabilityw(v) k. Theintuitionbehindthisis,thatthe thephraseintheblogosphere, sinceitwasusedbyacandi- influence of node×vnon the number of items that its child u dateduringthe2008presidentialcampaign. 
Forthisreason, receives, is inversely proportional to the indegree of u. werunAcyclicinitiatedfromeverynodeinthegraph,and Note that while we cannot guarantee that the number of thenchoosethelargestresultingDAG.ThisDAGhasasin- filters for randomized algorithms is k, they are designed so gle source: the node Acyclic was started from. It contains that the expected number of filters is k. We run the ran- 932 nodes and 2,703 edges. domized algorithms 25 times and then average the results. Results using synthetic datasets: To test some basic 1 properties of our algorithms we generate synthetic graphs. 0.8 First,weassignnodesto10levelsrandomly,sothattheex- pdiercetcetdednuedmgbeesrfroofmnoedveersypneordleevvelinisle1v0e0l.iNtoexetv,erwyengoedneeruatine Probability 00..46 level j >i with probability p(v,u)= x . The choice of x yj−i 0.2 andyinfluencesthedensityofthegraph. Theexponentofy isdesignedinsuchaway,thatnodesinnearbyclasseshave 0 0 20 40 60 80 100 higher probability of being connected, than nodes that are # incoming edges far apart. We choose to experiment with the combinations Figure 6: CDF of node indegree for G Phrase (x,y) = (1,4) and (x,y) = (3,4). For x = 1 we generate y 4 a graph with 1026 nodes and 32427 edges. For x = 3 we y 4 generate 1069 nodes and 101226 edges. The CDFs of the Figure6showstheCDFofthenodes’in-degreeinG Phrase. indegreeareshowninFigure4. TheCDFsoftheoutdegree Wefoundthatalmost70%ofthenodesaresinksandalmost arequitesimilar,andthusomittedduetospacelimitations. 50% of the nodes have in-degree one. There are a number Observe that nodes on the same level have similar proper- of nodes, which have both high in- and out-degrees. These ties;theexpectednumberandlengthofpathsgoingthrough arepotentiallygoodcandidatestobecomefilters. Thesteep them is the same. curveofFRforG PhraseinFigure7confirmsourintuition: Figures 5(a) and 5(b) reveal a gradual increase in FR as asfewasfournodesachieveperfectredundancyelimination afunctionofthenumberoffilters. 
Thisshowsthatthecho- forthisdataset. Asexpected,Greedy Allperformsthebest senfiltersarenodesthatcoverroughlyequal-sized,distinct with regard to the FR. Greedy Max picks a different node portions of all the paths in the graphs. for k = 2, but for k 3 it picks the same filter set and ≥ Results using the Quote dataset: The Quote dataset by hence performs as good as Greedy All. The two heuristics, Leskovecetal.[17]containsthelinknetworkofmainstream Greedy 1 and Greedy L are just a little bit less effective. mediasitesrangingfromlargenewsdistributorstopersonal Wetrackedtheconnectionbetweenthenodechosenfirstby blogs, along with timestamped data, indicating the adop- Greedy All(nodeA),andthenodechosenfirstbyGreedy L tion of topics or different phrases by the sites. We utilize (node B). These nodes were connected by a path of length monthlytracesfromAugust2008toApril2009togenerate 2. The impact of A is larger than the impact of B, since A a c-graph G Phrase. Since the Quote graph, which has over alsoinfluencesallofB’schildren. HoweverB ischosenover 400K edges, is very large, in our experiments we select a A by Greedy L, since the prefix of B is much larger than 425 that of A. The four central nodes chosen by Greedy All Physical Review as the source node, and take the subgraph also explain why Random Weighted performs so well: nodes of nodes that can be reached from this node through di- with large weights (and thus high probability of becoming rected paths. In this case only the node corresponding to filters)arethosenodeswithalargenumberofchildren. The paper [29] is connected to the source. The resulting sub- randomized algorithms Random k and Random Independent graph is intended to portray the propagation of an original perform significantly worse than all others because of the concept or finding in this paper through the physics com- high fraction of sink nodes in the graphs. 
munity: a filter in this setting can be seen as an opportune pointintheknowledgetransferprocesstopurgepotentially redundant citations of the primary source (e.g., derivative 1 work).4 The resulting citation graph G Citation is acyclic andcontains9,982nodesand36,070edges. AsforG Phrase 0.8 and the Twitter graph, G Citation has a power-law distri- R 0.6 bution of in and out degrees. (Plot omitted due to space F constraints.) 0.4 0.2 1 0 0 2 4 6 8 10 0.8 # of Filters 0.6 G_ALL G_L Rand_K R G_Max Rand_W F 0.4 G_1 Rand_I 0.2 Figure 7: FR for G Phrase on the Quote dataset; x- 0 axiscorrespondstothenumberoffilters, y-axiscor- 0 2 4 6 8 10 responds to the FR for different algorithms # of Filters G_ALL G_L Rand_K ResultsusingtheTwitterdataset: TheTwitterdataset G_Max Rand_W wascollectedbyKwaketal.[16]2. Thedatasetcontainsuser G_1 Rand_I idsandlinksbetweenusers,directedfromtheusertohisfol- lowers. The complete dataset contains over 41 million user Figure 8: FR for the Twitter graph. x-axis corre- profiles. Inordertoselectasubgraphoffeasiblesizeforour sponds to the number of filters, y-axis corresponds experiments, we first ran a breadth-first search up until six to the FR for different algorithms levels,startingfromtheuser“sigcomm09”. Ourgoalwasto find a subnetwork of users related to the computer science community. Forthis,wecreatedalistofkeywordsrelatedto 1 computerscience,technologyandacademiaandfilteredthe userprofilesofthefollowersaccordingtothat. Theresulting 0.8 networkisanacyclicgraphwithasingleroot“sigcomm09”, R 0.6 which we consider the source of information in this sub- F network. The graph contains about 90K nodes and 120K 0.4 edges. The number of out-going edges from the different 0.2 levels of the graph show an exponential growth: 2, 16, 194, 0 43993 and 80639 for levels 1,2,..., 5. We had to remove 0 2 4 6 8 10 a small number of edges, in order to maintain an acyclic # of Filters graph. Observe thatthis graphisquitesparse comparedto the other datasets. 
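The subgraph selection described for the Twitter dataset (a level-limited breadth-first search from a root, combined with a profile filter) could look roughly like the following sketch. The function name `bfs_subgraph`, the `followers` dictionary format, and the `keep` predicate are hypothetical. The rule for dropping edges, keeping only edges that point to a strictly deeper BFS level, is one simple assumption that guarantees an acyclic result; the text does not say which edges were actually removed.

```python
from collections import deque


def bfs_subgraph(followers, root, max_depth, keep=lambda user: True):
    """Level-limited BFS from `root` over user -> followers links.

    followers -- dict mapping a user to an iterable of followers (assumed format)
    keep      -- predicate standing in for the keyword filter on profiles
    Returns (level, edges): BFS level per selected user and the kept edges.
    """
    level = {root: 0}
    edges = []
    queue = deque([root])
    while queue:
        u = queue.popleft()
        if level[u] == max_depth:
            continue  # do not expand beyond the depth limit
        for v in followers.get(u, ()):
            if not keep(v):
                continue  # profile does not match the keyword list
            if v not in level:
                level[v] = level[u] + 1
                queue.append(v)
            # Assumption: keep only edges to a strictly deeper level,
            # which makes the selected subgraph acyclic.
            if level[v] > level[u]:
                edges.append((u, v))
    return level, edges
```

Every kept edge increases the BFS level, so no directed cycle can survive, and the root remains the single source of the selected subgraph.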
Figure 8 shows that Greedy All can remove all redundancy by placing as few as six filters. Our other heuristics also perform well. Greedy Max, Greedy 1 and Greedy L all achieve complete filtering with at most ten filters. The convergence of FR to one for Greedy L is slower than for the other algorithms, because of its tendency to choose nodes further away from the source.

Results using APS research dataset: The APS research dataset³ contains the citation network of over 450,000 articles from Physical Review Letters, Physical Review, and Reviews of Modern Physics, dating back to 1893. The dataset consists of pairs of APS articles, one citing the other. This can be viewed as a graph with a directed edge from node A to B if B cites A. We select article [29], published in [...]

Figure 9: FR for G Citation in the APS dataset; x-axis corresponds to the number of filters, y-axis corresponds to the FR for different algorithms.

While Greedy 1, Greedy L and Greedy Max all converge to a high level of removing redundant items with fewer than fifteen filters, it is evident from Figure 9 that Greedy All performs better here than the alternatives. The G Citation graph is a good illustration of the potential shortcomings of our heuristics. As sketched in Figure 10, the graph has a set of nine nodes, interconnected by a path, that all have indegree one. All paths from the upper to the lower half of the graph traverse these nodes, which makes them all high-impact. However, placing a filter in the first node greatly diminishes the impact of the remaining nodes. This remains unobserved by Greedy Max, resulting in the long range over which G Max is constant.

Figure 10: Sketch of APS graph.

² Available at http://an.kaist.ac.kr/traces/WWW2010.html
³ Available at https://publish.aps.org/datasets
⁴ In a live corpus of interconnected documents and citations, filters can be seen as the key documents in which to consolidate references to a specific source.

Summary of comparative performance: Comparing the results for the synthetic data sets (Figures 5(a) and 5(b)) to the results for the real data sets (Figures 7, 8 and 9) reveals a significant difference in the marginal utility from added filters (the steepness of the performance curve). For the synthetic data sets, there is a slow gradual improvement in performance for all algorithms as the number of filters increases, suggesting a fairly constant marginal utility of additional filters throughout. In contrast, for the Quote and Twitter data sets, almost all redundancy can be filtered out with at most ten filters, implying no marginal utility from added filters beyond that point. For the APS data set, there is more of a spread in the performance of the algorithms, but the best algorithms have very steep performance curves as well.

This disparity between the results on synthetic and real data sets can be explained by the structure of the underlying c-graph. The synthetic graphs are densely connected, and as a result paths cannot be covered with a small number of filters. On the other hand, the real data sets have a small set of "key" nodes that cover all paths in the graph. In conclusion, while our methods can be used effectively on all types of graphs, placing filters is more effective when operating over sparse graphs (which are more prevalent in many information networks).

Running times: Although all our algorithms have running time polynomial in the number of nodes in the dataset, we also investigate their efficiency in practice. For that, we report here their actual running times on one of our datasets. Note that the conclusions we draw are identical for other datasets and thus we omit the results due to lack of space. We implemented our algorithms in Python, and although our implementation uses clever data structures and other necessary optimizations, our code is not focused on optimizing performance. Therefore, our efficiency study should be seen as an evaluation of the relative efficiency of the different algorithms. For our experiments we used a machine with a 4GHz AMD Opteron with 256GB of RAM, running a 64-bit Linux CentOS distribution.

Figure 11: Execution times for the placement of ten filters in the case of the Twitter dataset. y-axis: seconds (0 to 5000); bars: G_1, G_Max, G_L, G_All.

Figure 11 reports the running times of the different algorithms for the Twitter dataset in seconds for k = 10 filters. Greedy 1 (with worst-case running time of $O(|E|)$) is our fastest algorithm, with a running time of less than a minute. Greedy All is the most computationally intensive method, with a running time of 83 minutes, since it requires the recomputation of the impact of every node in every iteration. Finally, Greedy Max and Greedy L appear to have similar running times, approximately 60 minutes, despite the fact that Greedy Max computes the impact only once. As we have seen, the tendency of Greedy L is to pick nodes at the end of the topological order (away from the source). After selecting such a node v as filter, the modified impact I′ of most of the nodes remains the same; the only nodes whose value of I′ changes are those that are after v in the topological order. Since there is a small number of such nodes, clever bookkeeping allows us to make these updates in practically constant time.

Overall, our algorithms Greedy 1, Greedy Max and Greedy L are much more efficient than Greedy All and can be applied to larger datasets. This observation, combined with the fact that these algorithms produce high-quality results, makes them appropriate to solve the FP problem in practice.

6. CONCLUSIONS

Networks in which nodes indiscriminately disseminate information to their neighbors are susceptible to information multiplicity, i.e., a prevalence of duplicate content communicated over a multitude of paths. In this paper, we proposed a particular strategy to mitigate the negative implications of information multiplicity. Our approach relies on the installation of filters at key positions in the information network, and we accordingly defined the Filter Placement problem. In addition to characterizing the computational complexity of Filter Placement (polynomial for trees, but NP-hard for DAGs and arbitrary graphs), we devised a bounded-factor approximation as well as other scalable heuristic algorithms. We evaluated our methods experimentally using synthetic and real data sets. Our results suggest that in practice, our Filter Placement algorithms scale well on fairly large graphs, and that typically a small number of filters is sufficient for removing redundant content in real-world (typically sparse) information networks.

Our current and future work is focusing on extensions to the information propagation model adopted in this paper to take into consideration multirate information sources. In
