A Second Look at Counting Triangles in Graph Streams

Graham Cormode, Hossein Jowhari

arXiv:1401.2175v1 [cs.DS] 9 Jan 2014

Abstract

In this paper we present improved results on the problem of counting triangles in edge streamed graphs. For graphs with $m$ edges and at least $T$ triangles, we show that an extra look over the stream yields a two-pass streaming algorithm that uses $O(\frac{m}{\epsilon^{4.5}\sqrt{T}})$ space and outputs a $(1+\epsilon)$ approximation of the number of triangles in the graph. This improves upon the two-pass streaming tester of Braverman, Ostrovsky and Vilenchik, ICALP 2013, which distinguishes between triangle-free graphs and graphs with at least $T$ triangles using $O(\frac{m}{T^{1/3}})$ space. Also, in terms of dependence on $T$, we show that more passes would not lead to a better space bound. In other words, we prove there is no constant pass streaming algorithm that distinguishes between triangle-free graphs and graphs with at least $T$ triangles using $O(\frac{m}{T^{1/2+\rho}})$ space for any constant $\rho > 0$.

1. Introduction

Many applications produce output in the form of graphs, defined one edge at a time. These include social networks that produce edges corresponding to new friendships or other connections between entities in the network; communication networks, where each edge represents a communication (phone call, email, text message) between a pair of participants; and web graphs, where each edge represents a link between pages. Over such graphs, we wish to answer questions about the induced graph, relating to its structure and properties.

One of the most basic structures that can be present in a graph is a triangle: an embedded clique on three nodes. Questions around counting the number of triangles in a graph have been widely studied, due to the inherent interest in the problem, and because it is a necessary stepping stone to answering questions around more complex structures in graphs.
Triangles are of interest within social networks, as they indicate common friendships: two friends of an individual are themselves friends. Counting the number of such common friendships within a graph is therefore a measure of the closeness of friendship activities. Another application for triangle counting is as a parameter for large graph models [LBKT08].

Preprint submitted to Theoretical Computer Science, January 13, 2014

For these reasons, and for the fundamental nature of the problem, there have been numerous studies of the problem of counting or enumerating triangles in various models of data access: external memory [LWZW10, HTC13]; map-reduce [SV11, PT12, TKMF09]; and RAM model [SW05, Tso08]. Indeed, it seems that triangle counting and enumeration is becoming a de facto benchmark for testing "big data" systems and their ability to process complex queries. The reason is that the problem captures an essentially hard problem within big data: accurately measuring the degree of correlation.

In this paper, we study the problem of triangle counting over (massive) streams of edges. In this case, lower bounds from communication complexity can be applied to show that exactly counting the number of triangles essentially requires storing the full input, so instead we look for methods which can approximate the number of triangles. In this direction, there has been a series of works that have attempted to capture the right space complexity for algorithms that approximate the number of triangles. However, most of these works have focused on one-pass algorithms and thus, due to the hard nature of the problem, their space bounds have become complicated, suffering from dependencies on multiple graph parameters such as maximum degree, number of paths of length 2, number of cycles of length 4, etc.

In a recent work by Braverman et al. [BOV13], it has been shown that at the expense of an extra pass over the stream, a straightforward sampling strategy gives a sublinear bound that depends only on $m$ (number of edges) and $T$ (a lower bound on the number of triangles).
More precisely, [BOV13] have shown that one extra pass yields an algorithm that distinguishes between triangle-free graphs and graphs with at least $T$ triangles using $O(\frac{m}{T^{1/3}})$ words of space. Although their algorithm does not give an estimate of the number of triangles and, more importantly, is not clearly superior to the $O(\frac{m\Delta}{T})$ one-pass algorithm of [PT12, PTTW13] (especially for graphs with small maximum degree $\Delta$), it creates some hope that perhaps with the expense of extra passes one could get improved and cleaner space bounds that beat the one-pass bound for a wider range of graphs. In particular, one might ask: is there an $O(\frac{m}{T})$ multi-pass algorithm? In this paper, while we refute such a possibility, we show that a more modest bound is possible. Specifically, we show that modifications to the sampling strategy of [BOV13], along with a different analysis, result in a 2-pass $(1+\epsilon)$ approximation algorithm that uses only $O(\frac{m}{\epsilon^{4.5}\sqrt{T}})$ space. We also observe that this bound is attainable in one pass if we can make the strong assumption that the order of edge arrivals is random.

Additionally, via a reduction to a hard communication complexity problem, we demonstrate that this bound is optimal in terms of its dependence on $T$. In other words, there is no constant pass algorithm that distinguishes between triangle-free graphs and graphs with at least $T$ triangles using $O(\frac{m}{T^{1/2+\rho}})$ space for any constant $\rho > 0$. We also give a similar two-pass algorithm that has better dependence on $\epsilon$ but sacrifices the optimal dependence on $T$. Our results are summarized in Figure 2 in terms of the problem addressed, bound provided, and number of passes.

Algorithms for Triangle Counting in Graph Streams. The triangle counting problem has attracted particular attention in the model of graph streams: there is now a substantial body of study in this setting. Algorithms are evaluated on the amount of space that they require, the number of passes over the input stream that they take, and the time taken to process each update.
Different variations arise depending on whether deletions of edges are permitted, or the stream is 'insert-only'; and whether arrivals are ordered in a particular way, so that all edges incident on one node arrive together, or the edges are randomly ordered or adversarially ordered.

The work of Jowhari and Ghodsi [JG05] first studies the most popular of these combinations: insert-only, adversarial ordering. The general approach, common to many streaming algorithms, is to build a randomized estimator for the desired quantity, and then repeat this sufficiently many times to provide a guaranteed accuracy. Their approach begins by sampling an edge uniformly from the stream of $m$ arriving edges on $n$ vertices. Their estimator then counts the number of triangles incident on a sampled edge. Since the ordering is adversarial, the estimator has to keep track of all edges incident on the sampled edge, which in the worst case is bounded by $\Delta$, the maximum degree. The sampling process is repeated $O(\frac{1}{\epsilon^2}\frac{m\Delta}{T})$ times (using the assumed bound on the number of triangles, $T$), leading to a total space requirement proportional to $O(\frac{1}{\epsilon^2}\frac{m\Delta^2}{T})$ to give an $\epsilon$ relative error estimation of $t$, the number of triangles in the graph. The parameter $\epsilon$ ensures that the error in the count is at most $\epsilon t$ (with constant probability, since the algorithm is randomized). The process can be completed with a single pass over the input. Jowhari and Ghodsi also consider the case where edges may be deleted, in which case a randomized estimator using "sketch" techniques is introduced, improving over a previous sketch algorithm due to Bar-Yossef et al. [BYKS02].

The work of Buriol et al. [BFL+06] also adopted a sampling approach, and built a one-pass estimator with smaller working space. An algorithm is proposed which samples an edge uniformly from the stream, then picks a third node, and scans the remainder of the stream to see if the triangle on these three nodes is present.
Recall that $n$ is the number of nodes in the graph, $m$ is the number of edges, and $T \le t$ is a lower bound on the (true) number of triangles. To obtain an accurate estimate of the number of triangles in the graph, this procedure is repeated independently $O(\frac{mn}{\epsilon^2 T})$ times to achieve $\epsilon$ relative error.

Recent work by Pavan et al. [PTTW13] extends the sampling approach of Buriol et al.: instead of picking a random node to complete the triangle with a sampled edge, their estimator samples a second edge that is incident on the first sampled edge. This estimator is repeated $O(\frac{m\Delta}{\epsilon^2 T})$ times, where $\Delta$ represents the maximum degree of any node. That is, this improves the bound of Buriol et al. by a factor of $n/\Delta$. In the worst case, $\Delta = n$, but in general we expect $\Delta$ to be substantially smaller than $n$.

Braverman et al. [BOV13] take a different approach to sampling. Instead of building a single estimator and repeating, their algorithms sample a set of edges, and then look for triangles induced by the sampled edges. Specifically, an algorithm which takes two passes over the input stream distinguishes triangle-free graphs from those with $T$ triangles in space $O(m/T^{1/3})$.

For graphs with $W \ge m$, where $W$ is the number of wedges (paths of length 2), Jha et al. [JSP13] have shown a single-pass $O(\frac{1}{\epsilon^2}\, m/\sqrt{T})$ space algorithm that returns an additive error estimation of the number of triangles.

Pagh and Tsourakakis [PT12] propose an algorithm in the MapReduce model of computation. However, it can naturally be adapted to the streaming setting. We conceptually "color" each vertex randomly from $C$ colors (this can be accomplished, for example, with a suitable hash function). We then store each monochromatic edge, i.e. each edge from the input such that both vertices have the same color. Counting the number of triangles in this induced graph, and scaling up by a factor of $C^2$, gives an estimator for $t$. The space used is $O(m/C)$ in expectation.
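The coloring scheme just described can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names `count_triangles` and `colorful_estimate` are hypothetical, a seeded random generator stands in for the "suitable hash function", and the stream is assumed to be a simple graph with no repeated edges or self-loops.

```python
import random


def count_triangles(edges):
    """Exactly count triangles in a small simple undirected graph."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Every triangle is seen once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in edges) // 3


def colorful_estimate(stream, num_colors, seed=0):
    """One pass: keep monochromatic edges, then scale the count by C^2.

    Returns (estimate of t, number of edges stored).
    """
    rng = random.Random(seed)  # assumption: stands in for a hash function
    color = {}
    kept = []
    for u, v in stream:
        for w in (u, v):
            if w not in color:
                color[w] = rng.randrange(num_colors)
        if color[u] == color[v]:
            kept.append((u, v))
    return num_colors ** 2 * count_triangles(kept), len(kept)
```

A fixed triangle survives the filter exactly when its three vertices receive the same color, which happens with probability $1/C^2$; this is why the count is scaled by $C^2$. With a single color every edge is kept and the estimate equals the exact count.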
Setting $C$ appropriately yields a one-pass algorithm with space $\tilde{O}(\frac{mJ}{T} + \frac{m}{\sqrt{T}})$, where $J$ denotes the maximum number of triangles incident on a single edge.

Lower bounds for triangle counting. A lower bound in the streaming model is presented by Bar-Yossef et al. [BYKS02]. They argue that there are (dense) families of graphs over $n$ nodes such that any algorithm to approximate the number of triangles must use $\Omega(n^2)$ space. The construction essentially encodes $\Omega(n^2)$ bits of information, and uses the presence or absence of a single triangle to recover a single bit. Braverman et al. [BOV13] show a lower bound of $\Omega(m)$ by demonstrating a family of graphs with $m$ chosen between $n$ and $n^2$. Their construction encodes $m$ bits in a graph, then adds $T$ edges such that there are either $T$ triangles or 0 triangles, which reveals the value of an encoded bit.

  $n$                      number of vertices
  $m$                      number of edges
  $t(G)$                   number of triangles in graph $G$
  $T$                      lower bound on $t(G)$
  $\epsilon$               relative error
  $\delta$                 probability of error
  $\Delta$                 maximum degree
  $J$                      maximum number of triangles incident on an edge
  $K$                      maximum number of triangles incident on a vertex
  Dist($T$)                Distinguish graphs with $T$ triangles from triangle-free graphs
  Estimate($T$, $c$)       $c$ approximate the number of triangles when there are at least $T$
  $\mathrm{Disj}^r_p$      Determine if two length $p$ bit strings of weight $r$ intersect

Figure 1: Table of notation

For algorithms which take a constant number of passes over the input stream, Jowhari and Ghodsi [JG05] show that still $\Omega(n/t)$ space is needed to approximate the number of triangles up to a constant factor, based on a similar encoding and testing argument. Specifically, they create a graph that encodes two binary strings, so that the resulting graph has $T$ triangles if the strings are disjoint, and $2T$ if they have an intersection. In a similar way, Braverman et al. [BOV13] encode binary strings into a graph, so that it either has no triangles (disjoint strings) or at least $T$ triangles (intersecting strings). This implies that $\Omega(m/T)$ space is required to distinguish the two cases.
In both cases, the hardness follows from the communication complexity of determining the disjointness of binary strings.

2. Preliminaries and Results

In this section, we define additional notation and define the problems that we study. As mentioned above, we use $t(G)$ to denote the number of triangles in graph $G = (V,E)$. Let $J(G)$ denote the maximum number of triangles that share an edge in $G$, and $K(G)$ the maximum number incident on any vertex. We use $t$, $J$ and $K$ when $G$ is clear from the context.

Problems Studied. We define some problems related to counting the number of triangles in a graph stream. These all depend on a parameter $T$ that gives a promise on the number of triangles in the graph.

  Problem                    Passes   Bound                                                          Reference
  Dist($T$)                  1        $\Omega(m)$                                                    [BOV13]
  Dist($T$)                  $O(1)$   $\Omega(m/T)$                                                  [BOV13]
  Dist($T$)                  2        $O(\frac{m}{T^{1/3}})$                                         [BOV13]
  Estimate($T$, $\epsilon$)  1        $O(\frac{1}{\epsilon^2}\frac{m\Delta}{T})$                     [PTTW13]
  Estimate($T$, $\epsilon$)  1        $O(\frac{1}{\epsilon^2}(\frac{mJ}{T} + \frac{m}{T^{1/2}}))$    [PT12]
  Estimate($T$, $\epsilon$)  2        $O(\frac{m}{\epsilon^{4/3}T^{1/3}}\sqrt{\log n})$              Theorem 1
  Estimate($T$, $\epsilon$)  2        $O(\frac{m}{\epsilon^{4.5}T^{1/2}})$                           Theorem 3
  Dist($T$)                  $O(1)$   $\Omega(\frac{m}{T^{2/3}})$                                    Theorem 6
  Dist($T$)                  $O(1)$   $\Omega(\frac{m}{T^{1/2}})$ for $m = \Theta(n\sqrt{T})$        Theorem 7

Figure 2: Summary of results

Dist($T$): Given a stream of edges, distinguish graphs with at least $T$ triangles from triangle-free graphs.

Estimate($T$, $\epsilon$): Given the edge stream of a graph with at least $T$ triangles, output $s$ where $(1-\epsilon)\,t(G) \le s \le (1+\epsilon)\cdot t(G)$.

Observe that any algorithm which promises to approximate the number of triangles for $\epsilon < 1$ must at least be able to distinguish the case of 0 triangles or $T$ triangles. Consequently, we provide lower bounds for the Dist($T$) problem, and upper bounds for the Estimate($T$, $\epsilon$) problem. Our lower bounds rely on the hardness of well-known problems from communication complexity. In particular, we make use of the hardness of $\mathrm{Disj}^r_p$:

Problem 1 The $\mathrm{Disj}^r_p$ problem involves two players, Alice and Bob, who each have binary vectors of length $p$. Each vector has Hamming weight $r$, i.e. $r$ entries set to one. The players want to distinguish non-intersecting inputs from inputs that do intersect.
This problem is "hard" in the (randomized) communication complexity setting: it requires a large amount of communication between the players in order to provide a correct answer with sufficient probability [KN97]. Specifically, $\mathrm{Disj}^r_p$ requires $\Omega(r)$ bits of communication for any $r \le p/2$, over multiple rounds of interaction between Alice and Bob.

Our Results. We summarize the results for this problem discussed in Section 1, and include our new results, in Figure 2. We observe that, in terms of dependence on $T$, we achieve tight bounds for 2 passes: Theorem 3 shows that we can obtain a dependence on $T^{-1/2}$, and Theorem 7 shows that no improvement for constant passes as a function of $T$ can be obtained. It is useful to contrast to the results of [PT12], where a one-pass algorithm achieves a dependence of $m/T^{1/2}$, but has an additional term of $mJ/T$. This extra term can be large: as big as $m$ in the case that all triangles are incident on the same edge; here, we show that this term can be avoided at the cost of an additional pass.

Our results improve over the 2-pass bounds given in [BOV13]. We show that Estimate($T$, $\epsilon$) can be solved with dependence on $T^{-1/3}$ (not just the decision problem Dist($T$)), and that the dependence on $T$ can be improved to $T^{-1/2}$, at the expense of higher dependence on $\epsilon$.

Comparing with the additive estimator of [JSP13], while our sampling strategy is somewhat similar, using an extra pass over the stream we return a relative error estimation of the number of triangles. Moreover, our biased estimator (Algorithm II) has enabled us to obtain an unconditional result, although this is achieved at the expense of higher dependence on $\epsilon$.

3. Upper bounds

In this section, we provide our two upper bounds. The first provides a simple sampling-based unbiased estimator, which has a low dependence on $\epsilon$, but scales with $T^{-1/3}$. The second uses a similar sampling procedure, and provides a biased estimator, whose dependence is improved to $T^{-1/2}$, but with higher cost based on $\epsilon$.

Algorithm I (unbiased estimator). Let $p \in (0,1]$. The value of $p$ will be determined later.
In the first pass, the algorithm stores each edge independently at random with probability $p$. Let $G' = (V,E')$ be the sampled subgraph.

In the second pass, the algorithm, upon reading the edge $e_i \notin E'$, counts the number of new triangles in $(V, E' \cup \{e_i\})$ and adds it to a global counter $s$. At the end of the second pass, the algorithm outputs $Y = \frac{s}{3p^2(1-p)}$ as the estimate for $t(G)$.

Theorem 1 Algorithm I is a 2-pass randomized streaming algorithm for Estimate($T$, $\epsilon$) that uses $O(\frac{m}{\epsilon^{4/3}T^{1/3}}\sqrt{\log n})$ space.

PROOF: Let $\mathcal{T}$ represent the set of triangles in the graph. For the analysis, we partition $\mathcal{T}$ into several groups through the following process. Fix an $L \in [1,t]$ (determined below). Pick an arbitrary edge $e \in E$ with at least $L$ triangles on it. We notionally assign the triangles on $e$ to the edge $e$. Let this be the set $\mathcal{T}_e \subseteq \mathcal{T}$. Continue this process until all the remaining edges participate in fewer than $L$ unassigned triangles. Let $\mathcal{T}'$ be the unassigned triangles.

Let $X_i$ be the indicator random variable associated with the $i$-th triangle in $\mathcal{T}$. We have $X_i = 1$ with probability $3p^2(1-p)$. For each edge $e$, let $s_e = \sum_{i \in \mathcal{T}_e} X_i$ and define $s_{\mathcal{T}'} = \sum_{i \in \mathcal{T}'} X_i$. We have $s = s_{\mathcal{T}'} + \sum_{e \in E} s_e$ and the expectation of $s$ is $E(s) = 3p^2(1-p)t$.

First we analyse the concentration of $s_{\mathcal{T}'}$. We have $E(s_{\mathcal{T}'}) = 3p^2(1-p)|\mathcal{T}'|$. We also compute

$$\begin{aligned}
\mathrm{Var}(s_{\mathcal{T}'}) &= E(s_{\mathcal{T}'}^2) - E^2(s_{\mathcal{T}'}) \\
&\le \sum_{i \in \mathcal{T}'} E(X_i^2) + \sum_{i \in \mathcal{T}',\, j \in \mathcal{T}',\, i \ne j} E(X_i X_j) - E^2(s_{\mathcal{T}'}) \\
&\le 3p^2(1-p)|\mathcal{T}'| + \left(4p^3(1-p)^2 + p^4(1-p)\right)|\mathcal{T}'|\,L.
\end{aligned}$$

The final term derives from considering pairs of triangles $i, j$. We break these into those which share an edge, and those which are disjoint. For those sharing an edge, both are sampled if either (a) the shared edge and exactly one other edge in each triangle is sampled, with total probability $4p^3(1-p)^2$, or (b) all edges except the shared edge are sampled, which occurs with probability $p^4(1-p)$. There are at most $|\mathcal{T}'|L$ such triangle pairs. For pairs of triangles which do not share any edge, their contribution to the sum is outweighed by the term $-E(s_{\mathcal{T}'})^2$.
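The pair probability used in this variance bound can be checked by brute force. The following Python sketch is a sanity check under stated assumptions, not part of the proof: the names `counted` and `pair_probability` are hypothetical, edges are labelled 0 to 4 with edge 0 shared by the two triangles, and exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction
from itertools import product


def counted(sampled, triangle):
    """Algorithm I counts a triangle iff exactly two of its edges are in E'."""
    return sum(sampled[e] for e in triangle) == 2


def pair_probability(p):
    """Exact Pr[X_i = X_j = 1] for two triangles that share edge 0."""
    tri_i, tri_j = (0, 1, 2), (0, 3, 4)  # hypothetical edge labels
    total = Fraction(0)
    # Enumerate all 2^5 sampling outcomes of the five distinct edges.
    for sampled in product((0, 1), repeat=5):
        if counted(sampled, tri_i) and counted(sampled, tri_j):
            k = sum(sampled)  # number of sampled edges in this outcome
            total += p ** k * (1 - p) ** (5 - k)
    return total
```

For any rational `p`, for instance `Fraction(1, 3)`, the value returned agrees with $4p^3(1-p)^2 + p^4(1-p)$: four outcomes have the shared edge and one further edge of each triangle sampled, and one outcome has every edge except the shared one sampled.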
Since $(1-p) < 1$ and $p < 1$ we simplify this expression to $\mathrm{Var}(s_{\mathcal{T}'}) < 3p^2|\mathcal{T}'| + 5p^3|\mathcal{T}'|L$. By the Chebyshev inequality,

$$\Pr\left[\,|s_{\mathcal{T}'} - E(s_{\mathcal{T}'})| \ge \epsilon p^2 t\,\right] \le \frac{\mathrm{Var}(s_{\mathcal{T}'})}{\epsilon^2 p^4 t^2} \le \frac{3|\mathcal{T}'|}{\epsilon^2 p^2 t^2} + \frac{5|\mathcal{T}'|L}{\epsilon^2 p t^2} \tag{1}$$

To bound the deviation of each $s_e$, we use the Chernoff bound. Let $Z_e$ be the event corresponding to $e \notin E'$. Since the edges are sampled independently, conditioned on $Z_e$, the random variables $\{X_i\}_{i \in \mathcal{T}_e}$ are independent. Moreover we have $E(X_i \mid Z_e) = p^2$. From the Chernoff bound, we get

$$\Pr\left[\,|s_e - E(s_e)| \ge \epsilon E(s_e) \mid Z_e\,\right] \le e^{-\frac{p^2 |\mathcal{T}_e| \epsilon^2}{2}} \le e^{-\frac{p^2 L \epsilon^2}{2}} \tag{2}$$

Similarly, conditioned on $\bar{Z}_e$, the random variables $\{X_i\}_{i \in \mathcal{T}_e}$ are independent and $E(X_i \mid \bar{Z}_e) = 2p(1-p)$.

$$\Pr\left[\,|s_e - E(s_e)| \ge \epsilon E(s_e) \mid \bar{Z}_e\,\right] \le e^{-p(1-p)|\mathcal{T}_e|\epsilon^2} \le e^{-p(1-p)L\epsilon^2} \tag{3}$$

From (2) and (3), for each $e \in E$, we get

$$\Pr\left[\,|s_e - E(s_e)| \ge \epsilon E(s_e)\,\right] \le e^{-\frac{p^2 L \epsilon^2}{2}} \tag{4}$$

Therefore using the union bound and the fact that the number of edges with non-empty $\mathcal{T}_e$ is bounded by $t/L$, we get

$$\Pr\left[\,\Big|\sum_e s_e - \sum_e E(s_e)\Big| \ge \epsilon \sum_e E(s_e)\,\right] \le \frac{t}{L}\, e^{-\frac{p^2 L \epsilon^2}{2}} \tag{5}$$

Since $t \ge T$ and setting $L = (\epsilon t)^{2/3}$ and $p = \Omega(\frac{1}{\epsilon^{4/3}T^{1/3}}\sqrt{\log n})$ with large enough constants, the probabilities in (1) and (5) will be bounded by a small constant. The expected number of edges in the sampled graph $G'$ is $pm$, and can be shown to be tightly concentrated around its expectation, so the space usage is as stated above. This proves our theorem. □

We now modify this algorithm to work in the random order streaming model, where all permutations of the input are equally likely [GM09].

Corollary 2 Assuming the data arrives in random order, there is a one-pass randomized streaming algorithm for Estimate($T$, $\epsilon$) that uses $O(\frac{m}{\epsilon^{4/3}T^{1/3}}\sqrt{\log n})$ space.

PROOF: The one-pass algorithm collapses the two passes of Algorithm I into one. That is, the algorithm stores each edge into graph $G'$ with probability $p$, and also counts the number of triangles completed in $G'$ by each edge from the stream $G$.
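The difference between the two-pass counter of Algorithm I and its one-pass collapse can be illustrated with a small Python sketch. Several choices here are assumptions made for illustration only: the function names are hypothetical, the set `sampled` fixes in advance which edges fall in $E'$ (replacing the independent coin flips so that the example is deterministic), and, mirroring the second pass of Algorithm I, only edges outside $E'$ contribute to the counter.

```python
def two_pass_count(stream, sampled):
    """Counter s of Algorithm I: G' is complete before the second pass."""
    adj = {}
    for u, v in stream:  # first pass: build G' = (V, E')
        if (u, v) in sampled:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    s = 0
    for u, v in stream:  # second pass: each edge outside E' closes wedges of G'
        if (u, v) not in sampled:
            s += len(adj.get(u, set()) & adj.get(v, set()))
    return s


def one_pass_count(stream, sampled):
    """Single pass: an unsampled edge only sees the part of G' stored so far."""
    adj = {}
    s = 0
    for u, v in stream:
        if (u, v) in sampled:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
        else:
            s += len(adj.get(u, set()) & adj.get(v, set()))
    return s
```

On the stream (1,2), (2,3), (1,3) with the first two edges sampled, both functions return 1. If the unsampled edge (1,3) arrives first instead, the two-pass counter still returns 1 while the one-pass counter returns 0, which is the order dependence that the random order assumption is meant to average out.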
The analysis follows the same outline as the main theorem, with some modification. First, we now have $\Pr[X_i = 1] = p^2(1-p)$, since the unsampled edge must be the last in the stream order, and $E(s)$ is correspondingly lower by a factor of 3. Then $E(X_i \mid Z_e) = p^2/3$, since to count triangle $i$, we must have that the first two edges are seen before edge $e$ in the stream. Likewise, $E(X_i \mid \bar{Z}_e) = 2p(1-p)/3$, since we must have the unsampled edge appear after the two sampled edges. This causes us to rescale $p$ by a constant, which does not change the asymptotic cost of the algorithm. □

Note that the requirement of random order is important for the one-pass result. Because we split the analysis based on the particular edges, the order in which these edges appear can affect the outcome. If the edge $e$ were to always appear after the two other edges in triangle $i$, then $E(X_i \mid Z_e)$ would be 0. Hence, we need the edges to appear in random order to ensure this one-pass analysis holds.

Our next algorithm builds a similar estimator, but differs in some important ways.

Algorithm II (biased estimator). Repeat the following $l \ge 16/\epsilon$ times independently in parallel and output the minimum of the outcomes.

In the first pass, pick every edge with probability $p$ (the value of $p$ will be determined later).

In the second pass, count the number of triangles detected: either those where all three edges were sampled in the first pass, or two edges were sampled in the first pass, and the completing edge observed in the second pass. Let $r$ be the total number of triangles detected. Output $\frac{r}{3p^2(1-p)+p^3}$.

Theorem 3 Algorithm II is a 2-pass randomized streaming algorithm for Estimate($T$, $\epsilon$) that uses $O(\frac{m}{\epsilon^{4.5}\sqrt{T}})$ space.

PROOF: Let $R$ be the output of Algorithm II. As in the previous proof, let $\mathcal{T}$ represent the set of triangles in the graph. Consider one instance of the basic estimator, and let $X$ be the outcome of this instance. Let $X_i$ denote the indicator random variable associated with the $i$-th triangle in $\mathcal{T}$ being detected. By simple calculation, we have $\Pr[X_i = 1] = 3p^2(1-p) + p^3$ and $E(X) = \frac{1}{3p^2(1-p)+p^3}\sum_{i \in \mathcal{T}} E(X_i) = t$.
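The "simple calculation" behind the detection probability can be reproduced by enumerating the eight sampling outcomes of a single triangle. The Python sketch below is a sanity check only; the function name `detection_probability` is hypothetical and exact rational arithmetic is assumed so that the identity can be tested without rounding error.

```python
from fractions import Fraction
from itertools import product


def detection_probability(p):
    """Exact probability that a fixed triangle is detected by Algorithm II."""
    total = Fraction(0)
    for sampled in product((0, 1), repeat=3):  # one coin per triangle edge
        k = sum(sampled)
        # Detected when two edges are sampled (the third is seen in the
        # second pass) or when all three edges are sampled.
        if k >= 2:
            total += p ** k * (1 - p) ** (3 - k)
    return total
```

For example, with `p = Fraction(1, 4)` the function returns `Fraction(5, 32)`, which equals $3p^2(1-p) + p^3$ at that value of $p$.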
By the Markov inequality, $\Pr[X \le (1+\epsilon)E(X)] \ge \epsilon$. Therefore, since $R$ is the minimum of $l \ge 16/\epsilon$ independent instances, we can conclude

$$\Pr[R \le (1+\epsilon)t] \ge \Pr[X \le (1+\epsilon)t] \ge 7/8.$$

However, proving a lower bound on $R$ is more complex, and requires a more involved analysis. First, we show that most triangles share an edge with a limited number of triangles. More precisely, let $L \subseteq E$ denote the set of edges where each $e \in L$ belongs to at most $3\sqrt{t/\epsilon}$ triangles. We call $L$ the set of light edges and $H = E \setminus L$ the heavy edges. We claim there exists $S \subseteq \mathcal{T}$ such that $|S| \ge (1-\epsilon)t$ and every triangle in $S$ has at least two light edges. This is true because there can be at most $\frac{3t}{3\sqrt{t/\epsilon}} = \sqrt{\epsilon t}$ heavy edges, and moreover every two distinct edges belong to at most one triangle.

For each triangle $i \in S$, fix two of its light edges. Let $Y_i$ denote the indicator random variable for the event where the algorithm picks the light edges of $i \in S$ in