Optimal Gossip-Based Aggregate Computation

Jen-Yeu Chen∗    Gopal Pandurangan†

arXiv:1001.3242v1 [cs.DS] 19 Jan 2010

Abstract

Motivated by applications to modern networking technologies, there has been interest in designing efficient gossip-based protocols for computing aggregate functions. While gossip-based protocols provide robustness due to their randomized nature, reducing the message and time complexity of these protocols is also of paramount importance in the context of resource-constrained networks such as sensor and peer-to-peer networks.

We present the first provably almost-optimal gossip-based algorithms for aggregate computation that are both time-optimal and message-optimal. Given an $n$-node network, our algorithms guarantee that all the nodes can compute the common aggregates (such as Min, Max, Count, Sum, Average, Rank, etc.) of their values in optimal $O(\log n)$ time and using $O(n \log\log n)$ messages. Our result improves on the algorithm of Kempe et al. [9], which is time-optimal but uses $O(n \log n)$ messages, as well as on the algorithm of Kashyap et al. [8], which uses $O(n \log\log n)$ messages but is not time-optimal (it takes $O(\log n \log\log n)$ time). Furthermore, we show that our algorithms can be used to improve gossip-based aggregate computation in sparse communication networks, such as peer-to-peer networks.

The main technical ingredient of our algorithm is a technique called distributed random ranking (DRR) that can be useful in other applications as well. DRR gives an efficient distributed procedure to partition the network into a forest of (disjoint) trees of small size. Since the size of each tree is small, aggregates within each tree can be efficiently obtained at their respective roots. All the roots then perform a uniform gossip algorithm on their local aggregates to reach a distributed consensus on the global aggregates.

Our algorithms are non-address-oblivious. In contrast, we show a lower bound of $\Omega(n \log n)$ on the message complexity of any address-oblivious algorithm for computing aggregates.
This shows that non-address-oblivious algorithms are needed to obtain significantly better message complexity. Our lower bound holds regardless of the number of rounds taken or the size of the messages used. Our lower bound is the first non-trivial lower bound for gossip-based aggregate computation and also gives the first formal proof that computing aggregates is strictly harder than rumor spreading in the address-oblivious model.

Keywords: Gossip-based protocols, aggregate computation, distributed randomized protocols, probabilistic analysis, lower bounds.

∗Department of Electrical Engineering, National DongHwa University, ShouFeng, Hualien 97401, Taiwan, ROC. E-mail: [email protected].
†Division of Mathematical Sciences, Nanyang Technological University, Singapore 648477, and Department of Computer Science, Brown University, Providence, RI 02912. E-mail: [email protected]. Supported in part by NSF Award CCF-0830476.

1 Introduction

1.1 Background and Previous Work

Aggregate statistics (e.g., Average, Max/Min, Sum, and Count) are useful for many applications in networks [2, 5, 6, 9, 11, 13, 24]. These statistics have to be computed over data stored at individual nodes. For example, in a peer-to-peer network, the average number of files stored at each node or the maximum size of files exchanged between nodes is an important statistic needed by system designers for optimizing overall performance [22, 25]. Similarly, in sensor networks, knowing the average or maximum remaining battery power among the sensor nodes is a critical statistic. Many research efforts have been dedicated to developing scalable and distributed algorithms for aggregate computation. Among them, gossip-based algorithms [1, 2, 4, 8, 9, 12, 16, 17, 20, 23] have recently received significant attention because of their simplicity of implementation, scalability to large network size, and robustness to frequent network topology changes. In a gossip-based algorithm, each node exchanges information with a randomly chosen communication partner in each round.
The randomness inherent in gossip-based protocols naturally provides robustness, simplicity, and scalability [7, 8]. We refer to [7, 8, 9] for a detailed discussion of the advantages of gossip-based computation over centralized and deterministic approaches and of their attractiveness to emerging networking technologies such as peer-to-peer, wireless, and sensor networks. This paper focuses on designing efficient gossip-based protocols for aggregate computation that have low message and time complexity. This is especially useful in the context of resource-constrained networks such as sensor and wireless networks, where reducing message and time complexity can yield significant benefits in terms of lowering congestion and lengthening node lifetimes.

Much of the early work on gossip focused on using randomized communication for rumor propagation [3, 7, 21]. In particular, Karp et al. [7] gave a rumor spreading algorithm (for spreading a single message throughout a network of $n$ nodes) that takes $O(\log n)$ communication rounds and $O(n \log\log n)$ messages. It is easy to establish that $\Omega(\log n)$ rounds are needed by any gossip-based rumor spreading algorithm (this bound also holds for gossip-based aggregate computation). They also showed that any rumor spreading algorithm needs at least $\Omega(n \log\log n)$ messages for a class of randomized gossip-based algorithms referred to as address-oblivious algorithms [7]. Informally, an algorithm is called address-oblivious if the decision to send a message to its communication partner in a round does not depend on the partner's address. Karp et al.'s algorithm is address-oblivious. For non-address-oblivious algorithms, they show a lower bound of $\omega(n)$ messages, if the algorithm is allowed only $O(\log n)$ rounds.

Kempe et al. [9] were the first to present randomized gossip-based algorithms for computing aggregates. They analyzed a gossip-based protocol for computing sums, averages, quantiles, and other aggregate functions.
In their scheme for estimating the average, each node selects another random node to which it sends half of its value; a node, on receiving a set of values, just adds them to its own halved value. Their protocol takes $O(\log n)$ rounds and uses $O(n \log n)$ messages to converge to the true average in an $n$-node network. Their protocol is address-oblivious. The work of Kashyap et al. [8] was the first to address the issue of reducing the message complexity of gossip-based aggregate protocols, even at the cost of increasing the time complexity. They presented an algorithm that significantly improves over the message complexity of the protocol of Kempe et al. Their algorithm uses only $O(n \log\log n)$ messages, but is not time-optimal: it runs in $O(\log n \log\log n)$ time. Their algorithm achieves this $O(\log n / \log\log n)$ factor reduction in the number of messages by randomly clustering nodes into groups of size $O(\log n)$, selecting a representative for each group, and then having the group representatives gossip among themselves. Their algorithm is not address-oblivious. For other related work on gossip-based protocols, we refer to [8, 2] and the references therein.

Table 1: DRR-gossip vs. other gossip-based algorithms.

Algorithm                  Time complexity           Message complexity    Address-oblivious?
Efficient gossip [8]       $O(\log n \log\log n)$    $O(n \log\log n)$     no
Uniform gossip [9]         $O(\log n)$               $O(n \log n)$         yes
DRR-gossip [this paper]    $O(\log n)$               $O(n \log\log n)$     no

1.2 Our Contributions

In this paper, we present the first provably almost-optimal gossip-based algorithms for computing various aggregate functions that improve upon previous results. Given an $n$-node network, our algorithms guarantee that all the nodes can compute the common aggregates (such as Min, Max, Count, Sum, Average, Rank, etc.) of their values in optimal $O(\log n)$ time and using $O(n \log\log n)$ messages. Our result (cf. Table 1) improves on the algorithm of Kempe et al. [9] that is time-optimal, but uses $O(n \log n)$ messages as well as on the algorithm of Kashyap et al.
[8] that uses $O(n \log\log n)$ messages, but is not time-optimal (takes $O(\log n \log\log n)$ time).

Our algorithms use a simple scheme called distributed random ranking (DRR) that gives an efficient distributed protocol to partition the network into a forest of disjoint trees of $O(\log n)$ size. Since the size of each tree is small, aggregates within each tree can be efficiently obtained at their respective roots. All the roots then perform a uniform gossip algorithm on their local (tree) aggregates to reach a distributed consensus on the global aggregates. Our idea of forming trees and then doing gossip among the roots of the trees is similar to the idea of Kashyap et al. The main novelty is that our DRR technique gives a simple and efficient distributed way of decomposing the network into disjoint trees (groups), which takes only $O(\log n)$ rounds and $O(n \log\log n)$ messages. This leads to a simpler and faster algorithm than that of [8]. The paper [20] proposes the following heuristic: divide the network into clusters (called the "bootstrap phase"); aggregate the data within each cluster at a small subset of its nodes called clusterheads; have the clusterheads use the gossip algorithm of Kempe et al. to do inter-cluster aggregation; and, finally, have the clusterheads disseminate the information to all the nodes in their respective clusters. It is not clear in [20] how to efficiently implement the bootstrap phase of dividing the network into clusters. Also, only numerical simulation results are presented in [20] to show that their approach gives better complexity than the algorithm of Kempe et al. It is mentioned without proof that their approach can take $O(n \log\log n)$ messages and $O(\log n)$ time. Hence, to the best of our knowledge, our work presents the first rigorous protocol that provably achieves these bounds.

Our second contribution is analyzing gossip-based aggregate computation in sparse networks. In sparse topologies such as P2P networks, point-to-point communication between all pairs of nodes (as assumed in gossip-based protocols) may not be a reasonable assumption.
On the other hand, the small number of neighbors in such networks makes it feasible to send one message simultaneously to all neighbors in one round: in fact, this is a standard assumption in the distributed message passing model [19]. We show how our DRR technique leads to improved gossip-based aggregate computation in such (arbitrary) sparse networks, e.g., P2P network topologies such as Chord [25]. The improvement relies on a key property of the DRR scheme that we prove: the height of each tree produced by DRR in any arbitrary graph is bounded by $O(\log n)$ whp. In Chord, for example, we show that DRR-gossip takes $O(\log^2 n)$ time whp and $O(n \log n)$ messages. In contrast, uniform gossip gives $O(\log^2 n)$ rounds and $O(n \log^2 n)$ messages.

Our algorithm is non-address-oblivious, i.e., some steps use addresses to decide which partner to communicate with in a round. The time complexity of our algorithm is optimal and the message complexity is within a factor $o(\log\log n)$ of the optimal. This is because Karp et al. [7] showed a lower bound of $\omega(n)$ for any non-address-oblivious rumor spreading algorithm that operates in $O(\log n)$ rounds. (Computing aggregates is at least as hard as rumor spreading.)

Our third contribution is a non-trivial lower bound of $\Omega(n \log n)$ on the message complexity of any address-oblivious algorithm for computing aggregates. This lower bound holds regardless of the number of rounds taken or the size of the messages (i.e., even assuming that nodes can send arbitrarily long messages). Our result shows that non-address-oblivious algorithms (such as ours) are needed to obtain a significant improvement in message complexity. We note that this bound is significantly larger than the $\Omega(n \log\log n)$ messages shown by Karp et al. for rumor spreading. Thus our result also gives the first formal proof that computing aggregates is strictly harder than rumor spreading in the address-oblivious model. Another implication of our result is that the algorithm of Kempe et al. [9] is asymptotically message-optimal for the address-oblivious model.
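
As a rough numeric illustration of the gap between these bounds, the sketch below evaluates the leading terms $n \log n$ and $n \log\log n$ for a few network sizes. It assumes base-2 logarithms and sets every hidden constant to 1, neither of which is fixed by the asymptotic statements, and the function name `message_bound_terms` is hypothetical.

```python
import math


def message_bound_terms(n: int) -> tuple[float, float]:
    """Return (n log n, n log log n) using base-2 logs and unit constants.

    These are only the leading terms of the asymptotic bounds (an assumption
    made for illustration), not message counts of any actual protocol.
    """
    log_n = math.log2(n)
    return n * log_n, n * math.log2(log_n)


if __name__ == "__main__":
    for exponent in (10, 16, 20):
        n = 2 ** exponent
        oblivious_term, drr_term = message_bound_terms(n)
        print(
            f"n = 2^{exponent}: n log n = {oblivious_term:.0f}, "
            f"n log log n = {drr_term:.0f}, "
            f"ratio = {oblivious_term / drr_term:.2f}"
        )
```

Under these assumptions, for $n = 2^{16}$ the two terms differ by a factor of $\log n / \log\log n = 16/4 = 4$; the ratio grows slowly with $n$.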
Our algorithm, henceforth called DRR-gossip, proceeds in phases. In phase one, every node runs the DRR scheme to construct a forest of (disjoint) trees. In phase two, each tree computes its local aggregate (e.g., sum or maximum) by a convergecast process; the local aggregate is obtained at the root. In phase three, all the roots use a suitably modified version of the uniform gossip algorithm of Kempe et al. [9] to obtain the global aggregate. Finally, if necessary, the roots forward the global aggregate to the other nodes in their trees.

1.3 Organization

The rest of this paper is organized as follows. Section 1.4 lists the main probabilistic tools used in our analysis: the Doob martingale and Azuma's inequality. The network model is described in Section 2, followed by sections where each phase of the DRR-gossip algorithm is introduced and analyzed separately. The whole DRR-gossip algorithm is summarized in Section 3.4. Section 4 applies DRR-gossip to sparse networks. A lower bound on the message complexity of any address-oblivious algorithm for computing aggregates is presented and proved in Section 5. Section 6 concludes with some open questions.

1.4 Probabilistic Preliminaries

We use Doob martingales extensively in our analysis [14]. Let $X_0, \ldots, X_n$ be any sequence of random variables and let $Y$ be any random variable with $E[|Y|] < \infty$. Define the random variable $Z_i = E[Y \mid X_0, \ldots, X_i]$, $i = 0, 1, \ldots, n$. Then $Z_0, Z_1, \ldots, Z_n$ form a Doob martingale sequence.

We use the martingale inequality known as Azuma's inequality, stated as follows [14]. Let $X_0, X_1, \ldots$ be a martingale sequence such that for each $k$,

$$|X_k - X_{k-1}| \le c_k,$$

where $c_k$ may depend on $k$. Then for all $t \ge 0$ and any $\lambda > 0$,

$$\Pr(|X_t - X_0| \ge \lambda) \le 2 \exp\left(-\frac{\lambda^2}{2\sum_{k=1}^{t} c_k^2}\right). \tag{1}$$

We also need the following variant of the Chernoff bound from [18], which works in the case of dependent indicator random variables that are correlated as defined below.
Lemma 1 ([18]) Let $Z_1, Z_2, \ldots, Z_s \in \{0, 1\}$ be random variables such that for all $l$, and for any $S_{l-1} \subseteq \{1, \ldots, l-1\}$, $\Pr(Z_l = 1 \mid \bigwedge_{j \in S_{l-1}} Z_j = 1) \le \Pr(Z_l = 1)$. Then for any $\delta > 0$,

$$\Pr\left(\sum_{l=1}^{s} Z_l \ge \mu(1+\delta)\right) \le \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu}, \quad \text{where } \mu = \sum_{l=1}^{s} E[Z_l].$$

Algorithm 1: $F$ = DRR($G$)
  foreach node $i \in V$ do
    choose rank($i$) independently and uniformly at random from $[0, 1]$;
    set found = FALSE  // higher ranked node not yet found
    set parent($i$) = NULL  // initially every node is a root node
    set $k = 0$  // number of random nodes probed
    repeat
      sample a node $u$ independently and uniformly at random from $V$ and get its rank;
      if rank($u$) > rank($i$) then
        set parent($i$) = $u$;
        set found = TRUE;
      end
      set $k = k + 1$;
    until found == TRUE or $k \ge \log n - 1$;
    if found == TRUE then
      send a connection message including its identifier, $i$, to its parent node parent($i$);
    end
    collect the connection messages and accordingly construct the set of its children nodes, Child($i$);
    if Child($i$) = $\emptyset$ then
      become a leaf node;
    else
      become an intermediate node;
    end
  end

2 Model

The network consists of a set $V$ of $n$ nodes; each node $i \in V$ has a data value denoted by $v_i$. The goal is to compute aggregate functions such as Min, Max, Sum, Average, etc., of the node values.

The nodes communicate in discrete time-steps referred to as rounds. As in prior works on this problem [7, 8], we assume that communication rounds are synchronized, and all nodes can communicate simultaneously in a given round. Each node can communicate with every other node. In a round, each node can choose a communication partner independently and uniformly at random. A node $i$ is said to call a node $j$ if $i$ chooses $j$ as a communication partner. (This is known as the random phone call model [7].) Once a call is established, we assume that information can be exchanged in both directions along the link. In one round, a node can call only one other node. We assume that nodes have unique addresses. The length of a message is limited to $O(\log n + \log s)$, where $s$ is the range of values. It is important to limit the size of messages used in aggregate computation, as communication bandwidth is often a costly resource in distributed settings.
All the above assumptions are also used in prior works [8, 9]. Similar to the algorithms of [8, 9], our algorithm can tolerate the following two types of failures: (i) some fraction of nodes may crash initially, and (ii) links are lossy and messages can get lost. Thus, while nodes cannot fail once the algorithm has started, communication can fail with a certain probability $\delta$. Without loss of generality, $1/\log n < \delta < 1/8$: larger values of $\delta$ require only $O(1/\log(1/\delta))$ repeated calls to bring the failure probability below $1/8$, and smaller values only make it easier to prove our claims.

Throughout the paper, "with high probability (whp)" means "with probability at least $1 - 1/n^{\alpha}$, for some $\alpha > 0$".

3 DRR-Gossip Algorithms

3.1 Phase I: Distributed Random Ranking (DRR)

The DRR algorithm is as follows (cf. Algorithm 1). Every node $i \in V$ chooses a rank independently and uniformly at random from $[0, 1]$. (Equivalently, each node can choose a rank uniformly at random from $[1, n^3]$, which leads to the same asymptotic bounds; however, choosing from $[0, 1]$ leads to a smoother analysis, e.g., it allows the use of integrals.) Each node $i$ then samples up to $\log n - 1$ random nodes sequentially (one in each round) until it finds a node of higher rank to connect to. If none of the $\log n - 1$ sampled nodes has a higher rank, then node $i$ becomes a "root". Since every node except the root nodes connects to a node with higher rank, there is no cycle in the graph. Thus this process results in a collection of disjoint trees which together constitute a forest $F$.

In the following two theorems, we show upper bounds on the number of trees and the size of each tree produced by the DRR algorithm; these are critical in bounding the time complexity of DRR-gossip.

Theorem 2 (Number of Trees) The number of trees produced by the DRR algorithm is $O(n/\log n)$ whp.

Proof: Assume that ranks have already been assigned to the nodes. All ranks are distinct with probability 1. Number the nodes according to the order statistic of their ranks: the $i$th node is the node with the $i$th smallest rank. Let the indicator random variable $X_i$ take the value 1 if the $i$th smallest node is a root and 0 otherwise.
Let $X = \sum_{i=1}^{n} X_i$ be the total number of roots. The $i$th smallest node becomes a root if all the nodes that it samples have rank smaller than or equal to its own, i.e., $\Pr(X_i = 1) = (i/n)^{\log n - 1}$. Hence, by linearity of expectation, the expected number of roots (and thus, trees) is:

$$E[X] = \sum_{i=1}^{n} \Pr(X_i = 1) = \sum_{i=1}^{n} \left(\frac{i}{n}\right)^{\log n - 1} = \Theta\left(\int_{1}^{n} \left(\frac{i}{n}\right)^{\log n - 1} di\right) = \Theta\left(\frac{n}{\log n}\right).$$

Note that the $X_i$ are independent (but not identically distributed) random variables, since the probability that the $i$th smallest ranked node becomes a root depends only on the $\log n - 1$ random nodes that it samples and is independent of the samples of the rest of the nodes. Thus, applying a Chernoff bound [14], we have:

$$\Pr(X > 6E[X]) \le 2^{-6E[X]} = o(1/n).$$

Theorem 3 (Size of a tree) The number of nodes in every tree produced by the DRR algorithm is $O(\log n)$ whp.

Proof: We bound the probability that a tree of size $\Omega(\log n)$ is produced by the DRR algorithm. Fix a set $S$ of $k = c \log n$ nodes, for some sufficiently large positive constant $c$. We first compute the probability that this set of $k$ nodes forms a tree. For the sake of analysis, we direct tree edges as follows: a tree edge $(i, j)$ is directed from node $i$ to node $j$ if rank($i$) < rank($j$), i.e., $i$ connects to $j$. Without loss of generality, fix a permutation of $S$: $(s_1, \ldots, s_\alpha, \ldots, s_\beta, \ldots, s_k)$, where rank($s_\alpha$) > rank($s_\beta$) for $1 \le \alpha < \beta \le k$. This permutation induces a directed spanning tree on $S$ in the following sense: $s_1$ is the root and any other node $s_\alpha$ ($1 < \alpha \le k$) connects to a node in the totally (strictly) ordered set $\{s_1, \ldots, s_{\alpha-1}\}$ (as fixed by the above permutation). For convenience, we denote the event that a node $s$ connects to any node on a directed tree $T$ as $s \to T$. Note that $s \to T$ implies that $s$'s rank is less than that of any node on the tree $T$. Also, we denote the event of a directed spanning tree being induced on the totally (strictly) ordered set $\{s_1, s_2, \ldots, s_\alpha, \ldots, s_h\}$ as $T_h$, where a node $s_\alpha$ can only connect to its preceding nodes in the ordered set.
As a special case, $T_1$ is the event of the induced directed tree containing only the root node $s_1$. We are interested in the event $T_k$, i.e., the set $S$ of $k$ nodes forming a directed spanning tree in the above fashion. In the following, we bound the probability of the event $T_k$ happening:

$$\Pr(T_k) = \Pr(T_1 \cap (s_2 \to T_1) \cap (s_3 \to T_2) \cap \cdots \cap (s_k \to T_{k-1}))$$
$$= \Pr(T_1)\,\Pr(s_2 \to T_1 \mid T_1)\,\Pr(s_3 \to T_2 \mid T_2) \cdots \Pr(s_k \to T_{k-1} \mid T_{k-1}). \tag{2}$$

To bound each of the terms in the product, we use the principle of deferred decisions: when a new node is sampled (i.e., for the first time) we assign it a random rank. For simplicity, we assume that each node sampled is a new node. This does not change the asymptotic bound, since there are now only $k = O(\log n)$ nodes under consideration and each node samples at most $O(\log n)$ nodes. This assumption allows us to use the principle of deferred decisions to assign random ranks without worrying about sampling an already sampled node. Below we bound the conditional probability $\Pr(s_\alpha \to T_{\alpha-1} \mid T_{\alpha-1})$, for any $2 \le \alpha \le k$, as follows. Let $r_q$ = rank($s_q$) be the rank of node $s_q$, $1 \le q \le \alpha$; then

$$\Pr(s_\alpha \to T_{\alpha-1} \mid T_{\alpha-1}) \le \int_0^1 \int_0^{r_1} \int_0^{r_2} \cdots \int_0^{r_{\alpha-1}} \sum_{h=0}^{\log n - 1} \left(\frac{\alpha-1}{n}\right) r_\alpha^{h} \, dr_\alpha \cdots dr_1.$$

The explanation for the above bound is as follows. Since $T_{\alpha-1}$ is a directed spanning tree on the first $\alpha-1$ nodes, and $s_\alpha$ connects to $T_{\alpha-1}$, we have $r_1 > r_2 > \cdots > r_{\alpha-1} > r_\alpha$. Hence $r_1$ can take any value between 0 and 1, $r_2$ can take any value between 0 and $r_1$, and so on. This is captured by the respective ranges of the integrals. The term inside the integrals is explained as follows. There are at most $\log n - 1$ attempts for node $s_\alpha$ to connect to any one of the first $\alpha-1$ nodes. Suppose it connects in the $h$th attempt. Then the first $h-1$ attempts should connect to nodes whose rank is less than $r_\alpha$, hence the term $r_\alpha^{h}$ (as mentioned earlier, we assume that we do not sample an already sampled node; this does not change the bound asymptotically).
The term $(\alpha-1)/n$ is the probability that $s_\alpha$ connects to any one of the first $\alpha-1$ nodes in the $h$th attempt.

Simplifying the right hand side, we have

$$\Pr(s_\alpha \to T_{\alpha-1} \mid T_{\alpha-1}) \le \frac{\alpha-1}{n} \int_0^1 \int_0^{r_1} \int_0^{r_2} \cdots \int_0^{r_{\alpha-1}} \left[1 + r_\alpha + r_\alpha^2 + \cdots + r_\alpha^{\log n - 1}\right] dr_\alpha \cdots dr_1$$
$$= \frac{\alpha-1}{n} \left(\frac{0!}{\alpha!} + \frac{1!}{(\alpha+1)!} + \frac{2!}{(\alpha+2)!} + \cdots + \frac{(\log n)!}{(\log n + \alpha)!}\right).$$

The above expression is bounded by $b/n$, where $0 < b < 1$ if $\alpha > 2$ and $0 < b \le 1 - \frac{1}{\log n + 2}$ if $\alpha = 2$. Besides, $\Pr(T_1) \le \frac{1}{\log n}$ (cf. Theorem 2); hence, equation (2) is bounded by $\left(\frac{b}{n}\right)^{k-1} \frac{1}{\log n}$.

Using the above, the probability that a tree of size $k = c \log n$ is produced by the DRR algorithm is bounded by

$$\binom{n}{k}\, k! \left(\frac{b}{n}\right)^{k-1} \frac{1}{\log n} \le \frac{(ne)^k}{k^k}\, O(\sqrt{k})\, \frac{k^k}{e^k} \left(\frac{b}{n}\right)^{k-1} \frac{1}{\log n} \le \frac{c' \cdot n}{\log^{1/2} n} \cdot b^{k-1} = o(1/n),$$

if $c$ is sufficiently large.

Complexity of Phase I: the DRR algorithm

Theorem 4 The message complexity of the DRR algorithm is $O(n \log\log n)$ whp. The time complexity is $O(\log n)$ rounds.

Proof: Let $d = \log n - 1$. Fix a node $i$. Its rank is chosen uniformly at random from $[0, 1]$. The expected number of nodes sampled before node $i$ finds a higher ranked node (or else, all $d$ nodes will be sampled) is computed as follows. The probability that exactly $k$ nodes will be sampled is $\Theta\left(\frac{1}{k+1} \cdot \frac{1}{k}\right)$, since the last node sampled should be the highest ranked node and $i$ should be the second highest ranked node (whp, all the nodes sampled will be unique). Hence the expected number of nodes probed is $\sum_{k=1}^{d} \Theta\left(k \cdot \frac{1}{k+1} \cdot \frac{1}{k}\right) = O(\log d)$. Hence the number of messages exchanged by node $i$ is $O(\log d)$. By linearity of expectation, the total number of messages exchanged by all nodes is $O(n \log d) = O(n \log\log n)$.

To show concentration, we set up a Doob martingale as follows. Let $X$ denote the random variable that counts the total number of nodes sampled by all nodes. $E[X] = O(n \log d)$.
Assume that ranks have already been assigned to the nodes. Number the nodes according to the order statistic of their ranks: the $i$th node is the node with the $i$th smallest rank. Let the indicator r.v. $Z_{ik}$ ($1 \le i \le n$, $1 \le k \le d$) indicate whether the $k$th sample by the $i$th smallest ranked node succeeded or not (i.e., it found a higher ranked node). If it succeeded then $Z_{ij} = 1$ for all $j \le k$ and $Z_{ij} = 0$ for all $j > k$. Thus $X = \sum_{i=1}^{n} \sum_{k=1}^{d} Z_{ik}$. Then the sequence $X_0 = E[X], X_1 = E[X \mid Z_{11}], \ldots, X_{nd} = E[X \mid Z_{11}, \ldots, Z_{nd}]$ is a Doob martingale. Note that $|X_\ell - X_{\ell-1}| \le d$ ($1 \le \ell \le nd$) because fixing the outcome of a sample of one node affects only the outcomes of other samples made by the same node and not the samples made by other nodes. Applying Azuma's inequality, for a positive constant $\epsilon$ we have:

$$\Pr(|X - E[X]| \ge \epsilon n) \le 2 \exp\left(-\frac{\epsilon^2 n^2}{2n(\log n)^3}\right) = o(1/n).$$

The time complexity is immediate since each node probes at most $O(\log n)$ nodes in as many rounds.

3.2 Phase II: Convergecast and Broadcast

Algorithm 2: $cov_{max}$ = convergecast-max($F$, $v$)
  Input: the ranking forest $F$, and the value vector $v$ over all nodes in $F$
  Output: the local Max aggregate vector $cov_{max}$ over roots
  foreach leaf node do send its value to its parent;
  foreach intermediate node do
    - collect values from its children;
    - compare collected values with its own value;
    - update its value to the maximum among all and send the maximum to its parent.
  end
  foreach root node $z$ do
    - collect values from its children;
    - compare collected values with its own value;
    - update its value to the local maximum value $cov_{max}(z)$.
  end

In the second phase of our algorithm, the local aggregate of each tree is obtained at the root by the Convergecast algorithm, an aggregation process starting from the leaf nodes and proceeding upward along the tree to the root node. For example, to compute the local max/min, all leaf nodes simply send their values to their parent nodes.
An intermediate node collects the values from its children, compares them with its own value, and sends its parent node the max/min value among all received values and its own. A root node can then obtain the local max/min value of its tree. Algorithm 2 and Algorithm 3 give the pseudocode of the Convergecast-max algorithm and the Convergecast-sum algorithm, respectively.

After the Convergecast process, each root broadcasts its address to all other nodes in its tree. This process proceeds from the root down to the leaves via the tree links (these two-way links were already established during Phase I). At the end of this process, all non-root nodes know the identity (address) of their respective roots.

Complexity of Phase II

Every node except the root nodes needs to send a message to its parent in the upward aggregation process of the Convergecast algorithms. So the message complexity is $O(n)$. Since each node can communicate with at most one node in one round, the time complexity is bounded by the size of the tree. (This is the reason for bounding the size and not just the height.) Since the tree size (hence, the tree height also) is bounded by $O(\log n)$ (cf. Theorem 3), the time complexity of Convergecast and Broadcast is $O(\log n)$. Moreover, as the number of roots is at most $O(n/\log n)$ by Theorem 2, the message complexity for broadcast is also $O(n)$.

Algorithm 3: $cov_{sum}$ = convergecast-sum($F$, $v$)
  Input: the ranking forest $F$ and the value vector $v$ over all nodes in $F$
  Output: the local aggregate vector $cov_{sum}$ over roots (local sum and tree size, from which the local Ave follows)
  Initialization: every node $i$ stores a row vector $(v_i, w_i = 1)$ including its value $v_i$ and a size count $w_i$;
  foreach leaf node $i \in F$ do
    - send its parent a message containing the vector $(v_i, w_i = 1)$;
    - reset $(v_i, w_i) = (0, 0)$.
  end
  foreach intermediate node $j \in F$ do
    - collect messages (vectors) from its children;
    - compute and update $v_j = v_j + \sum_{k \in Child(j)} v_k$ and $w_j = w_j + \sum_{k \in Child(j)} w_k$, where $Child(j) = \{j\text{'s children nodes}\}$;
    - send the computed $(v_j, w_j)$ to its parent;
    - reset its vector $(v_j, w_j) = (0, 0)$ when its parent successfully receives its message.
  end
  foreach root node $z \in \tilde{V}$ do
    - collect messages (vectors) from its children;
    - compute the local sum aggregate $cov_{sum}(z, 1) = v_z + \sum_{k \in Child(z)} v_k$ and the size count of the tree $cov_{sum}(z, 2) = w_z + \sum_{k \in Child(z)} w_k$, where $Child(z) = \{z\text{'s children nodes}\}$.
  end

3.3 Phase III: Gossip

In the third phase, all roots of the trees compute the global aggregate by performing the uniform gossip algorithm on the graph $\tilde{G} = clique(\tilde{V})$, where $\tilde{V} \subseteq V$ is the set of roots and $|\tilde{V}| = m = O(n/\log n)$.

The idea of uniform gossip is as follows. Every root independently and uniformly at random selects a node to which it sends its message. If the selected node is another root, then the task is completed. If not, the selected node needs to forward the received message to its root (all nodes in a tree know the root's address at the end of Phase II; this is where we use non-address-oblivious communication). Thus, to traverse an edge of $\tilde{G}$, a message needs at most two hops of $G$.

Algorithm 4, Gossip-max, and Algorithm 6, Gossip-ave (which is a modification of the Push-Sum algorithm of [8, 9]), compute the Max and Ave aggregates respectively (other aggregates such as Min, Sum, etc., can be calculated by a suitable modification). Note that, unlike Gossip-max, the Gossip-ave algorithm does not need a sampling procedure.

Algorithm 5, Data-spread, a modification of Gossip-max, can be used by a root node to spread its value. If a root needs to spread a particular value over the network, it sets this value as its initial value and all other roots set their initial value to minus infinity.

3.3.1 Performance of Gossip-max and Data-spread Algorithms

Let $m$ denote the number of root nodes. By Theorem 2, we have $m = |\tilde{V}| = O(n/\log n)$, where $n = |V|$. Karp et al. [7] show that all $m$ nodes of a complete graph can know a particular rumor (e.g., the Max in our application) in $O(\log m) = O(\log n)$ rounds with high probability by using their Push algorithm (a prototype of our Gossip-max algorithm) with uniform selection probability. Similar to the Push algorithm, Gossip-max needs $O(m \log m) = O(n)$ messages for all roots to obtain Max if the selection probability is uniform, i.e., $1/m$.
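
The uniform-selection case just described can be explored with a small simulation. The sketch below is not Algorithm 4: it assumes a complete graph on the $m$ roots only, reliable links, synchronous rounds, and that every root pushes in every round until all of them hold the maximum. The function name `push_max_rounds` and the `max_rounds` safety cap are hypothetical.

```python
import random


def push_max_rounds(values, rng, max_rounds=10_000):
    """Simulate uniform push gossip for Max on a complete graph.

    Each round, every node pushes the value it held at the start of the
    round to one uniformly chosen node (possibly itself); receivers keep
    the largest value seen. Returns (rounds, messages) once all nodes hold
    the global maximum, or when the max_rounds safety cap is reached.
    """
    m = len(values)
    current = list(values)
    target = max(current)
    rounds = 0
    messages = 0
    while any(v != target for v in current) and rounds < max_rounds:
        snapshot = list(current)  # values at the start of this round
        for i in range(m):
            j = rng.randrange(m)  # uniform selection, probability 1/m each
            messages += 1
            if snapshot[i] > current[j]:
                current[j] = snapshot[i]
        rounds += 1
    return rounds, messages


if __name__ == "__main__":
    rng = random.Random(2010)
    for m in (64, 256, 1024):
        rounds, messages = push_max_rounds(list(range(m)), rng)
        print(f"m = {m}: rounds = {rounds}, messages = {messages}")
```

In this sketch the message count is exactly $m$ per round, so the total is $m$ times the number of rounds, which mirrors the $O(m \log m)$ count stated above when the number of rounds is $O(\log m)$.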
However, in the implementation of the Gossip-max algorithm on the forest, the root of a tree is selected with a probability proportional to its size (the number of nodes in the tree). Hence, the selection probability is not uniform. In this case, we can only guarantee that after the gossip procedure of the Gossip-max algorithm, a portion of the roots including the root of the largest tree will possess the Max. After the gossip procedure, roots can sample $O(\log n)$ other roots to confirm and update, if necessary, their values and reach consensus on the global maximum, Max.

Algorithm 4: $\hat{x}_{max}$ = Gossip-max($G$, $F$, $\tilde{V}$, $y$)
  Initialization: every root $i \in \tilde{V}$ holds the initial value $x_{0,i} = y(i)$ from the input $y$.
  /* To compute Max, $x_{0,i} = y(i) = cov_{max}(i)$; to compute Ave, $x_{0,i} = y(i) = cov_{sum}(i, 2)$. */
  Gossip procedure:
  for $t = 1 : O(\log n)$ rounds do
    Every root $i \in \tilde{V}$ independently and uniformly at random selects a node in $V$ and sends the selected node a message containing its current value $x_{t-1,i}$.
    Every node $j \in V - \tilde{V}$ forwards any received messages to its root.
    Every root $i \in \tilde{V}$:
      - collects messages and compares the received values with its own value;
      - updates its current value $x_{t,i}$, which is also $\hat{x}_{max,t}(i)$, node $i$'s current estimate of Max, to the maximum among all received values and its own.
  end
  Sampling procedure:
  for $t = 1 : \frac{1}{c} \log n$ rounds do
    Every root $i \in \tilde{V}$ independently and uniformly at random selects a node in $V$ and sends the selected node an inquiry message.
    Every node $j \in V - \tilde{V}$ forwards any received inquiry messages to its root.
    Every root $i \in \tilde{V}$, upon receiving inquiry messages, sends the inquiring roots its value.
    Every root $i \in \tilde{V}$ updates $x_{t,i}$, i.e., $\hat{x}_{max,t}(i)$, to the maximum of its own value and the values it obtains from its inquiries.
  end

Algorithm 5: $\hat{x}_{ru}$ = Data-spread($G$, $F$, $\tilde{V}$, $x_{ru}$)
  Initialization: a root node $i \in \tilde{V}$ which intends to spread its value $x_{ru}$, $|x_{ru}| < \infty$, sets $x_{0,i} = x_{ru}$. All the other nodes $j$ set $x_{0,j} = -\infty$.
  Run Gossip-max($G$, $F$, $\tilde{V}$, $x_0$) on the initialized values.

Algorithm 6: $\hat{x}_{ave}$ = Gossip-ave($G$, $F$, $\tilde{V}$, $cov_{sum}$)
  Initialization: every root $i \in \tilde{V}$ sets a vector $(s_{0,i}, g_{0,i}) = cov_{sum}(i)$, where $s_{0,i}$ and $g_{0,i}$ are the local sum of values and the size of the tree rooted at $i$, respectively.
  for $t = 1 : O(\log m + \log(1/\epsilon))$ rounds do
    Every root node $i \in \tilde{V}$ independently and uniformly at random selects a node in $V$ and sends the selected node a message containing a row vector $(s_{t-1,i}/2, g_{t-1,i}/2)$.
    Every node $j \in V - \tilde{V}$ forwards any received messages to the root of its ranking tree.
    Let $A_{t,i} \subseteq \tilde{V}$ be the set of roots whose messages reach root node $i$ at round $t$. Every root node $i \in \tilde{V}$ updates its row vector by
      $s_{t,i} = s_{t-1,i}/2 + \sum_{j \in A_{t,i}} s_{t-1,j}/2$,
      $g_{t,i} = g_{t-1,i}/2 + \sum_{j \in A_{t,i}} g_{t-1,j}/2$.
    Every root node $i \in \tilde{V}$ updates its estimate of the global average by $\hat{x}_{ave,t}(i) = \hat{x}_{ave,t,i} = s_{t,i}/g_{t,i}$.
  end

We show the following theorem for Gossip-max: