ebook img

Apocrypha: Making P2P Overlays Network-aware - Stanford University PDF

14 Pages·2012·0.28 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Apocrypha: Making P2P Overlays Network-aware - Stanford University

1 Apocrypha: Making P2P Overlays Network-aware PrasannaGanesan QixiangSun HectorGarcia-Molina StanfordUniversity,Stanford,CA94305,USA fprasannag,qsun,[email protected] (cid:1) (cid:3) (cid:14) (cid:16) (cid:22) (cid:24) cmciopamAlyembshsuatorvnraegiccatna—notiitzoMheninanbngoeydttowdesideseotinnrwibnaiuotnhtdeoetdvhs.eseryIllsoantcyegametnisneoetbnwruaooillfr,tknot,ohndiisnpeesooevorrend-rtelotarh-ypeteonpeehretynwpsaroibcirnalke-l AD(cid:8)(cid:9)(cid:10)(cid:9)(cid:0)(cid:0)(cid:1)(cid:4)(cid:4)(cid:5)(cid:5) (cid:8)(cid:9)(cid:10)(cid:9)(cid:8)(cid:9)(cid:10)(cid:9)(cid:8)(cid:9)(cid:10)(cid:9)(cid:8)(cid:9)(cid:10)(cid:9)(cid:11)(cid:11)(cid:11)(cid:11)(cid:11)(cid:11)(cid:11)(cid:11)(cid:11)(cid:11)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:8)(cid:10)(cid:2)(cid:2)(cid:3)(cid:6)(cid:6)(cid:7)(cid:7) CB (cid:13)(cid:9)(cid:13)(cid:9)(cid:17)(cid:9)(cid:17)(cid:9)AC(cid:29)(cid:29)(cid:29)(cid:29)(cid:29)(cid:29)(cid:29)(cid:29)(cid:29)(cid:29)(cid:29)(cid:30)(cid:30)(cid:30)(cid:30)(cid:30)(cid:30)(cid:30)(cid:30)(cid:30)(cid:30)(cid:30) (cid:13)(cid:13)(cid:14)(cid:17)(cid:17)(cid:18)(cid:18) (cid:31)(cid:9) (cid:9)(cid:31)(cid:9) (cid:9)(cid:31)(cid:9) (cid:9)(cid:31)(cid:9) (cid:9)(cid:31) (cid:15)(cid:15)(cid:16)(cid:19)(cid:19)(cid:20)(cid:20) BD AD(cid:21)(cid:21)(cid:22)(cid:25)(cid:25)(cid:26)(cid:26) !(cid:9)"(cid:9)!(cid:9)"(cid:9)!(cid:9)"(cid:9)!(cid:9)"(cid:9)!"(cid:23)(cid:23)(cid:24)(cid:27)(cid:27)(cid:28)(cid:28) BC network. Weproposea genericmechanism called Apocryphato makeanyP2Poverlay“network-aware”,andthusoptimizeinter- (a) (b) (c) nodecommunication. WeshowhowApocryphaadaptivelytrades Fig.1. Allvalidoverlaynetworksforthe4-nodehypercubetopology offthequalityofoptimizationagainstthecostofoptimizationun- der dynamic conditions. We demonstrate the applicability and utilityofApocryphaontwodifferentP2Psystems:theChordsys- An overlaynetwork thus generated may, however, have lit- temwhichusesadeterministicoverlaytopology,andtheGnutella tle to do with the location of nodes on the physical network. system which operates on an ad hoc topology. We also describe new,improvedroutingprotocolsfortheChordsystem,andintro- Forexample,twonodesonthesameLANmaynotnecessarily ducethelong-circuitphenomenoninGnutella. beanywhereneareachotherontheoverlaynetworkand,con- versely,nodesthatarefarapartonthephysicalnetworkmaybe neighborsontheoverlay.IthasbeenobservedthatP2Psystems I. INTRODUCTION wouldbenefitiftheoverlaynetworkweresomehowconstructed to“resemble” thephysicalnetwork, i.e., bymakingneighbors Manydistributedsystems,rangingfromfile-sharinganddis- on the overlay be nearby on the physical network. Such an tributedfilesystemsto cooperativecaching andmulticastsys- overlaynetworkisexpectedtoimproveonanetwork-ignorant tems, are being developed using peer-to-peer (P2P) princi- ples[gnu],[kaz],[lim],[RKCD01],[DKK+01]. InaP2Pdis- construction,bothbyreducingthelatencyontheoverlaylinks, and by reducing bandwidth usage in the internet core (which tributed system, components are autonomous, and there is no translatesintomoneyforISPs). central control. Component failures are viewed as the norm Thereareclearlyavastnumberofoverlaynetworksthatmay rather than as an exception. Consequently, systems designed be constructed on a given set of nodes, but each P2P system foraP2Penvironmentprovideahighdegreeofrobustnessand imposesitsownrulestodeterminewhichoftheseoverlaynet- “self-healing”. works are valid; we callthis set of valid overlaynetworksthe ApricethatP2Psystemspayforthisrobustnessandauton- overlaytopology. Toillustratewithasimpleexample,imagine omyisalossinefficiencyascomparedtoatightlycoordinated a P2P systemona static setoffour nodesthatrequires a “hy- distributedsystem.Alotofrecentefforthasthereforegoneinto percubetopology”,whichissimplyalloverlaynetworksshaped improvingtheefficiencyofP2Psystemswithoutsacrificingro- likeasquare,asshowninFigure1.LetussaythatnodesAand bustness. One importantareafor optimizationinP2P systems B are very close to each other on the physical network, while is communication efficiency. This paper offers a solution for CandDarefarawayfrombothAandB,aswellasfromeach improvingthecommunicationefficiencyinanyP2Psystem,by other.Insuchacase,overlaynetworks(a)or(b)wouldbemore adaptingthecommunicationstructuretotheunderlyingphysi- desirablethan(c),sinceboth(a)and(b)haveafastoverlaylink calnetwork. betweenAandB. To explain the problem, we first discuss communication in ThispaperstudiesagenericmechanismcalledApocrypha1 P2Psystems. AllP2Psystemsorganizeparticipantnodesinan for identifying and constructing the “best” valid overlay net- overlay network, where each node maintains a transport-layer work for any P2P system. Since P2P networks may be ex- linkwith onlya smallsetofother nodes. Communicationbe- tremelydynamic,withnodesjoiningandleavingallthetime,a tween nodes then takes place on top of this overlay network. critical feature of Apocrypha is its ability to effectivelymain- Theuseofanoverlaynetworkisdictatedbyrobustnessconsid- tain such a “network-aware” overlay network as the system erations;itisconsideredtooexpensiveforeachnodetoknow, evolvesdynamically. and directly communicate with, all other nodes in the system. Returningtooursimple four-nodeexample, Apocryphaop- Different P2P systems use different mechanisms to construct eratesbyexploitingthefactthattheP2Psystemmustconstruct andutilizesuchoverlaynetworks.Forexample,Gnutella[gnu] one of the three overlay networks in Figure 1. Say the P2P uses flooding on this overlay network to propagate messages, systeminitiallyconstructsoverlaynetwork(c). Apocryphacan whileChord[SMK+01]usestheoverlaynetworktoroutemes- sagesfromonenodetoanother. 1AP2POverlayConstRuctiontoYieldProximalneigHbors(Allegedly) 2 thentransform(c)into(a)byasimpleoperation: swappingthe Evaluating the utility of Apocrypha on large overlay net- “letters” B andC in the picture of overlaynetwork (c), which worksrequiresarealisticmodelofphysical-networklatencies. corresponds to node B breaking its overlay link with node D We evaluate Apocrypha both on well-known internet models, andconnecting to nodeA instead, andnode C performingthe as well as using real data gathered on PlanetLab. We demon- reverseoperation. Ingeneral,Apocryphamayhavetoperform stratethatouralgorithmsare(surprisingly)robustandperform anumberofthese“swaps”toendupwitha goodoverlaynet- wellonavarietyofphysical-networkmodels,aswellasonthe work,andouralgorithmisbasedonthissimpleidea. realdata. Of course, such an algorithm would require that we some- WeapplyandevaluateApocryphaontwodifferentP2Psys- howknowthataparticularswapis“desirable”.Inourexample, tems: (a)Chord,whichusesahypercube-likestructuredtopol- weknewthatitwasdesirabletoformanoverlaylinkbetween ogy,and(b)Gnutella,whichoperatesonanadhoctopology.In AandBandthereforeperformedaswaptocreateit. Oursolu- theprocess,weintroducenewroutingalgorithmsforChordthat tionassumestheexistenceofa“blackbox”thatcancomputea considerably improvequery latency in the Chord system. On distancebetweentwoarbitrarynodesonthephysicalnetwork. theGnutellasystem,wedemonstrateasurprisingphenomenon This black box may be implemented in many different ways, thatwecalllongcircuits.Wenotethattheseroutingalgorithms, e.g., [FJP+99], [RHKS02], [XTZ03], [CDK+03], and is used and the long-circuit phenomenon, are interesting in their own byoursolution. rightevenoutsidethecontextofApocrypha. Itmaynotbeimmediatelyobviousthattheswappingideais Ourpaperisorganizedasfollows. InSectionII,weclassify applicabletoanyP2Psystem;onemightwonderwhetherper- P2P systems based on their topology and explain how Apoc- formingsuchswappingcanresultinaninvalidoverlaynetwork. rypha is applicable to all of them. In Section III, we define In Section II, we will explain what swapping really translates the problem of creating a network-aware overlay for a static into,whenappliedtoaP2Psystem,andwhyitisa“safe”op- P2Psystem,describeasimplecentralizedalgorithmforit,and erationonallP2Psystems. evaluateitusinggraph-theoreticmetrics. SectionIVdescribes Extending the intuition presented so far into a practical, howtotranslatethecentralizedalgorithmintoadistributedone. generic solution for constructing a network-aware overlay SectionVdiscussesthedynamicversionoftheproblemwhere posesmanychallenges,including: nodesmayjoinandleavetheP2Psystem. SectionVIdiscusses Identifying nodes to involve in swaps: We need Apocrypha theapplicationofApocryphatoChord,andevaluatestheutility tooperateinadistributedfashionandidentifypairsofnodesto ofApocryphaintermsofquerylatencyonChord. SectionVII swapwhileminimizingthenumberofinvocationsoftheblack discusses an evaluation on Gnutella, and describes the long- boxtocomputeproximity. circuit phenomenon. We discuss related work in Section VIII andconcludeinSectionIX. Minimizing swap overhead: The swapping of two nodes may be necessitate swapping additional application state in some P2P systems. For example, with distributed hash tables II. CLASSIFYING P2PSYSTEMSANDUNDERSTANDING (DHTs),swappingtwonodesmayalsorequireswappingofthe THESWAP contenttheystore,andweneedoptimizationsinordertoreduce suchoverhead. Inthissection,weclassifyP2Psystemsbasedontheirtopol- Adaptingtodynamism: Finally,amajorchallengeinthede- ogy, explain what the swap operation translates into in each sign of Apocrypha is to enable it to adapt to dynamic node class,andjustifythesafetyofthisoperation. arrivals and departures. When nodes join and leave an “opti- Beforewedoso,wemaketwoobservations. First,theswap mized”overlaynetwork,thequalityofthenetworkmaybead- operation merely transforms an overlay network into another verselyaffectedand,consequently,workneedstobeperformed isomorphic overlay network, i.e., the “shape” of the overlay toreoptimizethenetwork.Thehighertherateofdynamism,the network remains unchanged by the swap. Second, the justifi- more expensive it can be to maintain an efficient overlay net- cationofthesafetyoftheswapliesinanimportantunderlying work. Consequently,Apocryphaadaptivelytradesoffthecost propertyofP2Ptopologiesthatwewillproveinthissection: ofmaintainingandoptimizingtheoverlaynetworkagainstthe ValidityClosure: ForanyP2Psystem,ifagivenoverlaynet- qualityoftheresultingnetwork. work is valid, then all overlay networks isomorphic to it are To the best of our knowledge, Apocrypha is the first solu- valid. tionforconstructingnetwork-awareoverlaysformanykindsof Ifthis propertyholds for a P2P topology,itis clear thatthe P2Psystems, specificallysystemswith deterministic, random- swapisa safeoperation. We nowclassifyP2Ptopologiesand ized or ad hoc topologies. (We explain what these topologies demonstratethattheyallpossessthevalidity-closureproperty. areinSectionII.) Apocryphaalsoappearstobethefirstsolu- Deterministic Topologies: Adeterministictopology is one tionthatexplicitlytradesoffoptimizationcostagainstnetwork inwhicheveryvalidoverlaynetworkisisomorphictoaspeci- quality in the face of varying rates of dynamism. In addition, fiedgraph,thatwecallthebasisgraphB ofthetopology. Our webelievethatagenericmechanismsuchasApocryphahelps “hypercube topology on 4 nodes” is a deterministic topology theapplicationdeveloperbydetachingtheproblemofchoosing whosebasisgraphisthesquare. Validityclosureclearlyholds anoverlaytopologyfromthatofmakingthestructureadaptto fordeterministictopologies:IfN isisomorphictoBandN is 1 2 theunderlyingphysicalnetwork. isomorphictoN ,then,bytransitivity,N isisomorphictoB. 1 2 3 In fact, the converse is also true, i.e., if two overlay networks protocol,itiscriticalto“preserve”thisbasisgraphwhentrans- N andN arevalid,theymustbeisomorphictoeachother. formingtheoverlaynetwork,whichisexactlywhatApocrypha 1 2 Many P2P systems, e.g., CAN [RFH+01] and does. Chord [SMK+01], use deterministic topologies. In these systems, each node is assigned a unique id, or label, chosen III. APOCRYPHA: THEFIRST CUT at random from some space of labels. The basis graph B of Webeginbydefiningtheproblemofconstructinganetwork- thetopologyisa functionofthechosen setoflabels. Inother awareoverlayforanarbitrarystaticP2Psystem: words, the P2P system dictates what pairs of labels have an Problem: Given (a) an initial valid overlay network with edge.Eachvalidoverlaynetworkcorrespondstoanassignment each node assigned a unique label, and (b) a black box d that oflabelstonodes,withanoverlaylinkbetweenapairofnodes computesthe “distance”betweenanytwo nodes,createa new if and only if their corresponding labels have an edge in B. overlaynetworkbyre-assigningtheoriginalsetoflabelsamong Oursolutionstartswithaninitialassignmentoflabelstonodes thenodes(andconstructingappropriateoverlaylinks),suchthat (and a corresponding overlay network), and then let nodes the averagedistance, computed overall overlaylinks using d, swap labels, while always maintaining an overlay network isminimized. consistent with the labeling of nodes. Performing a “correct” Note thatour objectiveofminimizing averagelink distance series of such swaps will lead us to the best possible valid maynotcorresponddirectlytoanapplicationmetric.Weevalu- overlaynetwork. ateApocryphaintermsofapplicationmetricsinlatersections. NondeterministicTopologies:Anondeterministictopology In this section, we devise a centralized algorithm for solving isonewhereeveryvalidoverlaynetworkisisomorphictoany thisproblem. Inlatersections,weshowhowtoadaptitintoa oneofasetofbasisgraphs. Thevalidity-closurepropertycon- practical,distributedmechanismforsolvingthesameproblem, tinuestobetruebyexactlythesameargumentasearlier. How- andmodificationstodealwithdynamicoverlaynetworks. ever,theconverseisnolongertrue;evenifanoverlaynetwork N isnotisomorphictoa validnetworkN , N couldstill be A. ACentralizedAlgorithm 2 1 2 valid.Anextremeformofnondeterministictopologiesisanad We now discuss how to solve the above problem assuming hoctopology,whereeveryoverlaynetworkisconsideredvalid. thatalltheinputisavailableinasinglelocation. Theproblem P2P systems such as Pastry [RD01] and Tapestry [ZKJ01] can be shown to be NP-hard by a simple reduction from the usenondeterministictopologies. Eachnodeagainhasaunique graph-isomorphismproblem,andweadoptahill-climbingap- label but, now, the overlay topology corresponds to any one proachtosolveit. Wesketchthisbasichill-climbingalgorithm of multiple basis graphs on these labels. Again, Apocrypha in Algorithm 1. The term nbrs(i) is shorthand for the set of startswithsomeassignmentoflabelstonodes,withanoverlay neighborsofnodeiontheoverlaynetwork. networkcorrespondingtooneofthemultiplebasisgraphs,and reassignslabelstonodeswhilepreservingthesamebasisgraph. Algorithm 1 Apocrypha(OverlayNetwork N, DistanceFunc- Apocrypha can thus no longer explore the space of all valid tiond) overlay networks, since it confines itself to overlay networks 1: whilenotconvergeddo correspondingtojustonebasisgraph. 2: for eachnodei do 3: Pickarandomnodej. Gnutella [gnu] and its super-peer variants like KaZaa [kaz] and LimeWire [lim] are examples of P2P systems built on ad 4: InitialCost= X d(i;m)+ X d(j;n); hoc topologies. Again, Apocrypha works the same way and m2nbrs(i) n2nbrs(j) doesnotexplorethefullspaceofvalidoverlaynetworks.How- 5: NewCost= X d(j;m)+ X d(i;n); ever, we note that none of these systems desire a truly ad hoc m2nbrs(i)) n2nbrs(j) 6: if NewCost<InitialCostthen topology. Forexample,allofthemwouldensurethattheover- 7: SwapLabel(i)andLabel(j); laynetworkformsaconnectedgraph,andrelyonrandomized 8: Breakedgesbetweenianditsneighbors topologiestoprovidesuchproperties,aswediscussnext. 9: Breakedgesbetweenjanditsneighbors RandomizedTopologies:Randomizedtopologiesareaspe- 10: Createedgesbetweeniandj’soldneighbors 11: Createedgesbetweenjandi’soldneighbors cial form of nondeterministic and ad hoc topologies. While 12: endif there are still multiple basis graphs available in such topolo- 13: endfor gies, the P2P system can no longer choose arbitrarily among 14: endwhile these differentbasisgraphs. Instead, thereis a specificproba- bilitywithwhicheachbasisgraphoughttobechoseninorder At each step, the algorithm selects each node i in turn and to guarantee important properties of the system such as con- considerswhetherit isbeneficialto swap the labels ofianda nectedness, low diameter, fault tolerance, degree balance and randomlychosennodej. Notethatperformingthischeck can efficientrouting. beexpensive,asitrequirescomputingthedistancedbetweeni P2PsystemssuchasSymphony[MMP03]andlow-diameter andeveryneighborofj,andbetweenj andeveryneighborof constructions on Gnutella-like systems [PRU01] use random- i. Ifitisdeterminedthattheswapisbeneficial,iandjproceed izedtopologies. Sincethe“importantproperties”ofthesystem to exchange their labels. This step corresponds to i and j ex- allrelyontheinitialbasisgraphbeingchosenbyarandomized changingtheirsetofneighborsontheoverlaynetwork. (Inthe 4 special case when the two nodes are already neighbors, they also continue to remain neighbors after the swap.) The algo- 350 Chord Gnutella rithmterminateswhenthereisnomore,orverylittle,progress s) 320 m made. y( 290 c n ate 260 L B. Evaluation nk 230 Li Wenowevaluatetheeffectiveness,andthecost,oftheabove ge 200 a hill-climbing algorithm as it operates on different kinds and er v 170 sizesof overlaynetworks. Ourevaluationrequiresa model of A the physicalnetwork that captures the distance between every 140 pair of overlay nodes. We begin by using a physical-network 0 500 1000 1500 2000 2500 (internet)modelgeneratedfromthestandardGT-ITMtopology Number of Steps generator,thatenablesustosimulatethealgorithmonoverlay Fig.2. ConvergenceofApocryphaonChordandGnutella networksoflargesizes. Later,wedescriberesultsonasmaller- scale overlay network, where the overlay nodes are machines 9 onPlanetLab, with pair-wisedistancesbetweenoverlaynodes Chord 8 Gnutella beingmeasureddirectlybasedonpingtimes. 1) EvaluationusingGT-ITM: WeusetheGT-ITMtopology 7 generator[ZCB96]togeneratetransit-stubmodelsofthephysi- de 6 o N calnetwork. Overlaynodesintheoverlaynetworkareattached er 5 p at random to stub nodes in the GT-ITM network model. The s 4 p distance between a pair of overlay nodes is set as the latency wa 3 S between them in the internet model. (We will henceforth use 2 thewords“distance”and“latency”interchangeably.) 1 We usedGT-ITM to generate multipleinternetmodelswith 0 thesameaveragegraphdegreebutdifferentproportionsoftran- 1 10 100 1000 10000 sitandstubdomains;weassignedlatenciesof5,20and100ms Number of Steps to stub-stub, stub-transit and transit-transit links respectively, Fig.3. NumberofSwapsperformedbyApocrypha with the link from an overlay node to its corresponding stub node having a latency of 1ms. We also experimented with addingdifferentlevelsofGaussiannoisetotheselatencies. In constructedon25000nodesinthestandardfashionasdescribed general, we found that the results obtained were fairly robust in[SMK+01]. Westartoutmeasuringtheaveragelinklatency, andinsensitivetotheexactparametersusedinourinternetmod- aswellasthenumberofswapsmadebyApocryphainitsopti- els. Theresultspresentedherewereallobtainedononerepre- mization. sentativemodel. Average Link Latency: Figure 2 plots the quality of the It has been observed that the distribution of latencies as overlaynetwork,i.e.,theaverageoftheoverlay-linklatencies, obtained from GT-ITM is somewhat different in its char- as a function of the number of steps that Apocrypha runs for. acteristics from latency distributions observed in the real The initial average link latency is about 341ms, correspond- world [GGG+03]. Interestingly, we experimented with sim- ing to the average latency between a pair of nodes in our in- plevariantsoftheGT-ITMmodel,usingnon-uniformdistribu- ternet model. On the Chord network, the link latency drops tionsofoverlaynodestodifferenttransitdomains. Theresult- to 172ms after 500 steps, and to 158ms after 2500 steps, re- inglatencydistributionsappeartobemuchclosertoobserved, sulting in an improvementby roughly a factor of 2 compared real-world latencies than with the plain GT-ITM model. Due to the initial state. On the Gnutella network, the average la- toitsdigressivenature,weconsignourdescriptionofthesein- tency drops to just 161ms after 250 steps and eventually falls ternetmodels,andtheresultsobtainedonthem,toanextended to 150msafter 2500steps, thus demonstrating slightly greater technical report [GSGM03]. Our evaluation on these internet improvementthanChord. This isnot toosurprising, sincethe modelssuggeststhatApocryphaperformsmuchbetterinthese Gnutella network has a lower average degree, and greater de- modifiedmodelsthanonplainGT-ITM.Therefore, theresults greeskew,whichmakesitless“constrained”andeasiertoopti- presentedhereofferaconservativeestimateofthelikelybene- mize.Wenotethatalthoughouralgorithmisonlyguaranteedto fitsofApocrypha. findalocalminimum,theobservednetworkqualityisactually We choose two different overlay networks on which to veryclose to optimal, as we discuss in our extendedtechnical evaluate the application of Apocrypha: Gnutella and Chord. report[GSGM03]. The detailed structure of these networks do not concern us Number of Swaps: A more surprising result is that Apoc- immediately, and will be described in Sections VI and VII. rypha uses very few swaps to achieve the rapid convergence Our Gnutella network consists of a 24702-node snapshot of observedin Figure2. Figure3 plotsthe numberofswaps ini- Gnutella obtained in 2001 [Cli], while our Chord network is tiatedpernodeasafunctionofthenumberofsteps. Notethat 5 230 7.5 1000 s) 220 7 de s) 750 Swaps 4 de m o m Latency o cy ( 210 Latency 6.5 er N cy ( er N aten Swaps ps p aten 500 3 ps p k L 200 6 wa k L wa n S n S erage Li 190 5.5 mber of erage Li 2 mber of Av 180 5 Nu Av 1 Nu 170 4.5 200 256 1024 4096 16384 65536 0 200 400 600 800 1000 Number of Nodes Number of Steps Fig.4. QualityandNumberofSwapsasafunctionofNetworkSize Fig. 5. Quality and Number of Swaps for a 223-node Chord network on PlanetLab the x-axis is plotted on a logarithmic scale. For both Chord quently, we see that the number of swaps per node is propor- andGnutella,weseethatthenumberofswapsincreasesloga- tionaltotheaveragedegreeofnodesinChord,suggestingthat rithmicallywiththenumberofsteps,withabouthalftheswaps each swap “optimizes” a constant number of links. We have occurringinthefirst100steps. Moreover,thetotalnumberof observed the same “swaps proportional to degree” behaviour swapsperformedisextremelysmall,beingslightlygreaterthan onotherkindsofoverlaynetworksbesidesChord,too. 8forChord,andunder5forGnutella,evenafter2500steps. Practical Implications: The above two graphs suggest that 2) Evaluation on PlanetLab: We consider a Chord over- thisbasichill-climbingapproachcouldbethebasisforaprac- lay nework implemented on PlanetLab computers. Planet- tical solution. First, we need veryfewsteps to convergefrom Lab [plaa] consists of machines distributed at a few hundred acompletelyarbitraryoverlaynetworktoagoodone. Inprac- sites across the world. We assigned 223 nodes, distributed tice, Apocrypha would run incrementally and will never need across128sites,toaChordnetworkandgatheredthemeasured tostartfromanarbitraryconfiguration. Second,thetotalnum- ping times [plab] between every pair of nodes. We then sim- ber of swaps necessary per node is extremely low, suggesting ulated Apocrypha on this Chord network, using the measured thateachstepcouldbeperformedpretty“fast”inadistributed ping time as the distance between a pair of nodes. Pings be- implementation. tween some pairs of nodes do not succeed, and time out on a Varying the Number of Nodes: We now study how the consistentbasis;wesetthesepingtimestobetenseconds. quality of the overlay network, and the number of swaps per- Figure 5 plots both the average link latency and the total formed, evolves as a function of the size of the overlay net- number of swaps performed, as a function of the number of work.WepresentexperimentsontheChordoverlaynetwork,in stepsof the hill-climbing algorithm. We notethat the average whichthedegreeofeachnodeisroughly2lognwherenisthe linklatencydropsverysteeply(notice thatthey-axisisin log numberofnodesinthenetwork.Figure4depictsboththequal- scale)initially,evenmoresothanintheearlierexperimentson ity(avg. linklatency)oftheoverlaynetwork,andthenumber theGT-ITMtopology.Thissteepdropisduetothequickadap- of swaps performed, for different-sized Chord networks, after tationofthenetworkinavoidingthe extremelybadlinkswith 1000stepsofthehill-climbingalgorithm. Notethatthex-axis very large latencies. We also observe that convergenceto the isinlogarithmicscale. final value is relatively fast, and the final average latency is a Weseethattheaveragelinklatencydecreasesasthenetwork factor3:5smallerthantheinitialaverage.Thenumberofswaps getslarger. Therearetworeasonsforthisphenomenon. First, performedispracticallyidenticaltothenumberofswapsinthe a smaller Chord network is much more “dense” than a larger GT-ITM topology for comparable Chord networks (Figure 4), network,meaningthatalargefractionofallpossibleedgesare andeachnodeinitiatesslightlymorethan4swapsintotal. presentinthenetwork. Consequently,asmallnetworkismore Overall,weseethattherelativequalityimprovementweob- constrainedandharder tooptimize. Second, asthe numberof tainonPlanetLabishigherthanontheGT-ITMtopology.Since nodesincreases,therearemorepairsofnodesthatarecloseto a large fraction of the quality improvement could potentially each other on the physical network, enabling better optimiza- havebeenobtainedbysimplyeliminatingthe“extremelybad” tionofoverlaylinks. linksbetweennodesthatdonotrespondtoeachother’spings, Interestingly, the number of swaps performed per node in- we repeat the aboveexperimenton a smaller 182-nodeChord creases roughly as a logarithmic function of the number of network, where the nodes which do not respond to more than nodes.(Wenotethatthenumberofswapsbecomesevencloser 10%ofpingsareeliminated.Thissmallernetworkhaslessthan to a logarithmic function if the hill-climbing algorithm is run 0:2%ofallpairwisepingtimesbeinglargerthan5seconds. to “convergence within (cid:15)” for each network size, rather than Figure6plotsthequalityandthenumberofswapsasafunc- for1000stepsinallcases.) Notethatthedegreeofeachnode tionofthenumberofhill-climbingstepsforthis182-nodenet- isalso a logarithmicfunctionofthe numberofnodes. Conse- work.Weobservethattheinitialaveragelatencyisonly133ms, 6 140 4 to that obtained by the centralized algorithm, and for higher Swaps valuesof W, there is practicallyno marginalimprovementon 130 Latency s) de mostpracticalP2Ptopologies. cy (m 120 3 er No Notethattheseprobescanmademoreefficientinmanysys- aten 110 ps p temsbypiggybackingthemonothertraffic. InDHTs,thepath nk L 100 2 Swa takenbyaqueryforakeytranslatesintoarandomwalkonthe Average Li 8900 1 Number of oflthvoeeorsdlaainmygen-beraatwsnedodorkmsy.swItneamlskysscltiaeknmebGseunusuistneegldlart,aotnhfidenosdmoauwrrcaaelnkodsfofmaoqrnuqoeudreyer.cieaIsnn, 70 betreatedasarandompointbyanodereceivingthequery(so 60 longastheTTLonthereceivedmessageislow).Thus,finding 0 200 400 600 800 1000 randomnodesinthenetworkcanoftenbemadeinexpensive. Number of Steps Biasing probes: Instead of using a purely random walk to Fig. 6. Quality and Number of Swaps for a 182-node Chord network on findacandidate,wecould“bias”therandomwalktoseekcan- PlanetLab didatesmorelikelytoswapplaceswiththeprobeinitiator. The policyweconsideristoassigneverynodeadissatisfactionin- whichisconsiderablysmallerthantheinitialaverageof739ms dexwhichisequaltotheaverageofitsoverlay-linklatencies.A in the earlier 223-nodecase. The final latency after optimiza- probethenreturnsthemostdissatisfiednodeencounteredonthe tion drops to 66ms, still a factor of 2 improvement over the random walk, instead of returning the termination point. The initiallatency.Thisimprovementisactuallylargerthanthatfor dangerinusingsuchapolicyisthatthelikelihoodofconverg- acorrespondinglysizedChordnetworkoperatingonGT-ITM, ingtoalocalminimumishigherthanwithrandomhillclimb- whichmayhavetwoexplanations:(a)theGT-ITMmodelover- ing. Ourexperimentslater onin thissection evaluatewhether estimates the distance between nodes, or (b) nodes on Planet- suchbiasingisagoodidea. Labarenearer eachother thanarbitrarynodesonthe internet. Weobservethatthenumberofswapsperformedisonceagain verysimilartothatintheearliercases,beingslightlylargerthan B. MinimizingOverhead 4pernode. The second challenge we face is to lower the overhead of each node running the algorithm on a continuous basis. Each IV. ADISTRIBUTEDSOLUTION timeanodeexecutessteps3-12ofthealgorithm,itneedsto(a) initiateaprobe,(b)checkwhetheraswapisbeneficial,(c)pos- Wenowdesignadistributedsolutionfortheproblemofcre- sibly perform a swap by breaking and forming overlay links, ating network-awareoverlays, in which each participant node and (d) possibly swap application state if necessary. We have runsanalgorithminordertoevolvethe(static)overlaynetwork arguedthat the overheadofstep (a)can bemadepretty small. into an optimal state. One simple way to design a distributed The overhead of step (c) is not too big a concern either since algorithm is to just let each node execute steps 3-12 of Algo- wehaveseen thatthe numberofswaps performedpernodeis rithm 1 in a periodic fashion. There are two difficulties that extremely small and decreases with time. Step (d) can be ex- needtobeovercomeforthisideatowork: (a)Anodeneedsto pensiveonsomeP2Psystemsandwediscusshowitsoverhead beabletofindanotherrandomnodejinstep3ofthealgorithm. may be minimized in Section V-A. We are left with the over- (b)Weneedtolowertheoverheadofrunningthealgorithmon headofstep(b)whichwenowdiscuss. acontinuousbasis.Wepresentlydiscusshoweachofthesetwo Everytime a node finds a candidate by probing, it needs to issuesaretackled. checkwhetheraswapisbeneficial,whichrequiresinvokingthe blackboxtocomputelatencybetweenthenodeandeachofthe A. FindingaRandomNode candidate’s neighbors. For some black box implementations, To find another random node in a distributed manner as in liketheEuclideanvectorssuggestedin[RHKS02],[CDK+03], step3ofAlgorithm1, eachnodeiinitiatesa probe,arandom the invocation is cheap and there may actually be little over- walk originating at itself and proceeding for a pre-determined head associated with this step. However, other implementa- numberofsteps,sayW.Theterminationpointjoftherandom tions of the black box, such as active ICMP pings, might be walkischosenbyiasthecandidateitconsidersswappinglabels much more expensive to invoke. We would therefore like to with. Nodes i and j can then collaborate to decide whether somehowreducethenumberofprobesinitiated,bothwhileac- exchanging labels and neighbors is beneficial to the overlay, tively improving the overlay network, and, more importantly, just asin the original algorithm. Of course, the quality ofour afterconvergingtoapointwhereverylittleimprovementisbe- algorithmdependsonthechoiceofW,thelengthoftherandom ingachieved. Ofcourse,detectingthisglobalconvergenceina walk. TheidealchoiceofW dependsonthemixingproperties distributedwayisnon-trivial. oftheoverlaynetwork,butourexperimentsindicatethatsetting We can think of the following simple solution: each node W tobeabouthalfthediameteroftheoverlaygraphisagood continuouslymonitorsthe“improvement”initsdissatisfaction idea; using smaller values of W may lead to results inferior indexas a result of each of its probes overthe last (cid:28) minutes. 7 Ifthisimprovementisverylow,thenoderealizesthattheover- laynetworkhasconvergedtoanear-optimalstateandcanstop 320 Quenching(Unbiased) Quenching(Biased) initiating probes. This idea does not quite work because each ms) 290 NoN Qou Qenucehnicnhgi(nUgn(Bbiiaasseedd)) successful swap is only guaranteed to improvethe overallob- y( Centralized c 260 jectivefunctionfortheentirenetwork,notanindividualnode’s en at dissatisfactionindex. L 230 k n Abettersolutionisforeachnodetomonitoritsstatefluctu- e Li 200 ation,thedifferencebetweenits bestandworstdissatisfaction g a indexoverthelast(cid:28) minutes. Anodethencontinuesto“wake ver 170 A up” periodically, but initiates a probe only if its state fluctua- 140 tion is greater than some threshold (cid:15). If the fluctuation is less 1 10 100 1000 10000 than(cid:15),thenodedecidestoavoidprobing,astepwecallprobe Time quenching. For robustness, a node is also given a small non- zeroprobabilitytoinitiateaprobeevenifthefluctuationisless Fig.7. ConvergenceofdistributedApocryphaontheChordnetwork than (cid:15). So, once the network evolves into a “good” state, the fluctuationislikelytobeverylow,andtheprobeinitiationrate Quenching(Unbiased) essentially corresponds to the small probability of initiating a 25000 Quenching(Biased) probeforrobustness. Forreasonsthatwilllaterbecomeappar- ent,wewillcallthisprobe-quenchingmechanismthehot-stove es 20000 b spoomliceyo.thNeortenothdaet.nodesalwaysrespondtoaprobeinitiatedby of Pro 15000 er b m 10000 u C. Evaluation N 5000 To see how the use of probes, biasing probes and probe quenchingaffectthequalityoftheoverlaygeneratedbyApoc- 0 rypha,weranthedistributedversionofApocryphaona25000- 0 500 1000 1500 2000 2500 Time node Chord network, using the GT-ITM internet model. The results on other physical-network models, on PlanetLab data, Fig.8. Numberofprobesinitiatedateachtimestep andusingtheGnutellanetwork,arealongsimilarlines. Nodesconsiderinitiatingaprobeonceaminute,andsendout Figure8showsthenumberofprobesinitiatedinthenetwork theprobe ifitis notquenched. (Ourchoiceoftime unitis ar- everyminute. Notethatwealwayshave25000probesaminute bitrary.Inpractice,thetimeunitmayitselfbeafunctionofthe withoutquenching. We seethatthe useofprobequenchingis stateofthenetworkaswediscussinoursectionondynamism.) effectiveinreducingthenumberofprobesafterverylittletime. Forthequenchingcriterion,weuse(cid:28) = 20and(cid:15) = 1ms,i.e., With quenching, each node initiates a probe about once every we quench a probe if the dissatisfaction index measured over 25minutes,oncetheoverlaynetworkconvergestoareasonable the last 20 minutes fluctuates by less than 1ms2 Again, recall state. Wealsoseethatbiasingdoesnotappeartoeitherhelpor thatwearestartingfromacompletelyarbitraryinitialnetwork hurt in reducing the probes. The real benefit of using biased whereas, in practice, we would maintain the overlay network walkswillbeseeninourexperimentsondynamicoverlaynet- incrementallyandalwayshoveraroundanear-optimalstate. works. ExperimentsontheGnutellanetworkproducedsimilar Figure 7 depicts the convergence of distributed Apocrypha resultsandareomitted. (usingwalkswithW = 10)asafunctionoftime; thecentral- izedalgorithmisusedasabaselineforcomparison.Noticethat itishardtomakeoutthedifferentlinesbecausetheyallmoreor V. DEALINGWITHDYNAMISM lesscoincide,whichillustratesexactlywhatwe wanttoshow: So far, we have ignored the fact that the overlay network DistributedApocrypha,withorwithouttheuseofbiasingand changescontinuouslyasnodesjoinandleavethe P2Psystem. quenching, still performs almost identically to the centralized HowwellApocryphaadaptstodynamismisdependentonhow algorithm. We see on the bottom right of the graph that the muchtheoverlaytopologyitselfchangeswhennodesjoinand useofquenchingmakesthenetworkconvergetoaslightlyin- leave. Forexample,thereisn’tmuchthatthealgorithmcando feriorstatethanwithoutquenching.Thisisonlytobeexpected, iftheentireoverlaystructurechangeseverytimeanodejoinsor since quenching avoids investingtoo much effortin obtaining leaves.However,mostP2Psystemsdonotbehavethiswayand marginalimprovementsinoverlayquality. Wehavenotshown allowtheoverlaystructuretochangesmoothly. FormanyP2P the number of swaps used by each of the algorithms but they, systems building distributed hash tables, the number of edges too,arealmostidentical. that need to be added or removeddue to a node arrivalor de- partureisusuallyasmalllogarithmicfractionofthetotaledges 2Apointofconcernmightbethatthedistancedbetweenthesametwonodes inthe system. ForGnutella, thedeparture orarrivalof anode mightvaryovertimeandthatourchoiceof(cid:15)istoolow.Suchcasesarehandled identicallytothewaywehandledynamisminthenextsection. affectsonlyaconstantnumberofedges. 8 Apocryphaadoptsthefollowingapproachtodealingwithdy- (cid:15) If none of a node’s neighbors are young, and the node namic node arrivals and departures: The P2P-system protocol wasindynamicmodeearlier,itswitchestotentative-static ispermittedtoexecutejustasitnormallydoeswhennodesjoin mode. It slowly increases the probability of initiating a and leave the system. As a result, new overlay links may be probe until either (a) p probes are sent out in tentative- created and some old overlaylinks may be deleted. In conse- static mode or (b) one of its neighbors is young. In case quence,theaverageoverlaylinklatencymaybecomeworse(or (a),thenodeswitchestostaticmode.Incase(b),thenode better,ifwegotlucky). However,sinceamajorityoftheover- switchesbacktodynamicmode. laylinksdonotchange,westillexpecttheoverlaynetworkto (cid:15) In static mode, a node uses our originally suggested hot- beinreasonablygoodshape. Apocryphacanthencontinueto stovepolicyforquenching. initiateprobesandperformswapsonthisnewoverlaynetwork Someintuitionintotheabovepolicycanbeobtainedasfol- toimproveit,andrestoreittoitsformerglory. lows:Itisclearthattheaggressionofanodeininitiatingprobes Evenatextremelyhighratesofdynamism(withmeannode should be a function of the dynamism it observes among its lifetimeslessthanonehour),ourexperimentsshowthatApoc- neighbors,sincenodesarerequiredtomakelocaldecisions. If rypha (with each node considering initiating one probe per veryfewofanode’sneighborsarenewlyjoinednodes,thenit minute) successfully improves the overlay network to a qual- oughttobe lessaggressiveandletthe newlyarrivednodesdo ity as good or better than in the static case. The problem is theworktoimprovethesystem. Butifalotofitsneighborsare that ourhot-stovepolicyfor probe quenchingdoesnothelpat new,thenthenodeitselfneedstostepupandinitiateprobesso all in reducing overhead; with a high rate of node dynamism, thatthenetworkdoesnotdeterioratetoomuch. itis verylikelythat a node receivesatleast onenewneighbor The pretty-girl policy helps us deal, not only with differ- intheintervaloftime(cid:28) weusedasourwindow. Thisonenew ent rates of dynamism, but also with bursts of dynamism in- neighborislikelytothrowthestatefluctuationoverthethresh- terspersed with periods of calm. When a burst of dynamism oldof(cid:15)thatwesetearlier. Consequently,theprobe-quenching ends,nodescautiouslysendoutprobesandslowlystep upthe mechanismhardlyeverdecidesnottoperformaprobe. Apoc- frequency of probing until they are sure the system is indeed ryphacontinuouslyperformsa lot ofwork to keepupwith all static. Once they realize that it is static, they go back to their the changes happening in the system in order to improve the usualquenchingpolicy. system, perhapsmore thanwewant itto. Athigh ratesofdy- Figure 9 depicts the effects of the pretty-girl policy on the namism, we would like the option of performing probes at a convergenceofthenetworkinthefaceofdifferentratesofdy- lowerratewhileperhapssettlingforalower-than-optimalqual- namism. Weagainstartwithanarbitrary,initialChordoverlay ityoverlaynetwork. network of 25000and have(cid:14) nodes joining and(cid:14) nodes leav- Oursolutiontodealingwithdynamismisbasedonaslightly ing the system everyminute (for a total of 2(cid:14) nodes changing counter-intuitive observation. If the overlay network is dy- in everyminute). Each node considersinitiatinga probeonce namic,nodesneedtobelessaggressiveininitiatingprobes.The aminuteand,ofcourse,maynotdosoifthepretty-girlpolicy reasonisthatwedesiretopermitsome“slack”forthenetwork sodictates. Weshowtheevolutionofthenetworkfordifferent qualityto getslightlyworse beforewe fixit. Ifnoslackis al- valuesofthedynamism(cid:14). Thetopmostcurvedepictsrunning lowed,wewillonlybewastingalotofeffortmakingsmallim- Apocrypha(at (cid:14) = 100)without the use of biased walks, and provementsthatwillsoondisappearifthereismoredynamism. can be seen to be vastly inferior to using biased walks at the Moreover,byallowingthenetworktodeteriorateslightly,itbe- same rate of dynamism. The remaining curves all use biased comes possibleto makelargerimprovementsmore efficiently. walks, and we see that, at low dynamism, the system evolves Thisleadsustotheideathatweshould“slowdowntime”when more or less similarly to the static case. When the dynamism thenetworkisdynamic,andletnodesreactasifinslowmotion. ishigher,thenetworktendstostabilizeatahigheraveragela- Ofcourse,sincewehaveadistributedsystemwhereanode tency rather than climbing all the way down to a quality near can only observe the dynamism among its neighbors, nodes that of the static case. The reason is that the pretty-girl pol- need to make local decisions without any knowledge of the icyconcludesthatitistooexpensivetomaintainahigh-quality globalrateofdynamism. Despitenodesmakingsuchlocalde- networkatthatcontinuousrateofdynamism. cisions, werequirethatthecumulativebehaviourofthenodes Figure10showshowmanyprobesarebeinginitiatedateach beafunctionofthegloballyobservedrateofdynamism. rate of dynamism, again as a function of time. The spikes on We devise a new probe-initiation policy, called the Pretty- the left end of the curve may be ignored, as they are artifacts Girl Policy3, which provides the above behaviour by using a ofthe system starting with allnodes joining atthe same time. three-state automaton. Let us define a node to be young if it We see that the static case performs the least probes, the case joinedthenetworkwithinthelast(cid:28) minutes,andoldotherwise. (cid:14) = 2performs more probes, and there are even more probes Ournewprobe-initiationpolicynowbecomes: at(cid:14) =25.However,whatisinterestingisthatfewerprobesare (cid:15) Ifkofanode’sneighborsareyoungwithk >0,thenode performed at (cid:14) = 100 than (cid:14) = 25, because Apocrypha real- isindynamicmodeandinitiatesaprobewithaprobability izes that it is not worth performing more probes at that high a+b(cid:1)kforsomeconstantsaandb. rate of dynamism. We note that the hot-stove policy would quenchpracticallynoprobesatallunderanyrateofdynamism 3ThebasisofthisnameliesinAlbertEinstein’sworkonunderstandingex- ternalsensoryinputontimedilation[Ein38]. (at(cid:14) =100forexample,itwouldperformabout24900probes 9 360 Dynamism=100(No Bias) 400 Avg Latency 30 ds) 340 Dynamism=100 350 Probes an ms) 320 DDynyanmamisimsm==252 ms) 300 25 hous ency( 238000 Static ency( 250 20 s (in t nk Lat 260 nk Lat 200 15 Probe Li 240 Li e age 220 age 150 10 Activ Aver 128000 Aver 10500 5 mber of u 160 0 0 N 0 500 1000 1500 2000 2500 0 500 1000 1500 2000 2500 3000 Time Time Fig.9. Convergenceasafunctionoftime Fig.11. Effectsofburstiness A. UsingProxies 25000 Finally,weshowhowtominimizethecostofmakingnodes Dynamism=25 Dynamism=100 swap labels. We have already seen that Apocrypha requires Dynamism=2 20000 Static very few swaps per node, but even this many swaps may be s e expensive since nodes may have to swap application state as- b Pro 15000 sociatedwiththelabels. InGnutella,orsystemsusingoverlay er of networksforefficientbroadcast,swappingisfreesincethereis mb 10000 noassociatedapplicationstate.Similarly,inP2Pmulticastsys- u N temsbuiltonstructuredoverlaynetworks[RKCD01],swapping 5000 is practically free. On the other hand, in P2P distributed hash tables (DHTs), a node’s label determines the content it stores 0 and,therefore,swappinglabelsentailshavingtoswaptheasso- 0 500 1000 1500 2000 2500 Time ciatedcontenttoo. One simple technique to reduce the overhead in DHTs is Fig.10. Numberofprobesquenched as follows: When two nodes swap labels, they do not swap theirassociatedcontent. Insteadtheysetupproxylinkstoeach other,sothat aquerydirectedtoonenodefortheother’scon- tentmaybeansweredbyforwardingthequeryalongtheproxy everyminute). Atallratesofdynamism,thenumberofprobes link. (Notethatnewcontentinsertedintothesystemshouldbe initiated per minute using the pretty-girl policyis much lower storedatthecorrectnode,andcanbereachedwithoutusingthe thanthatwiththehot-stovepolicy. proxylink.) Thisapproachhasmanyadvantages. Figure 11 shows what happens when the dynamism in the When a node n joins the system, the P2P system gives it a network is bursty. We again start with an initial unoptimized labelandrequiresthenodeto“takeover”aportionofthehash network, and there is no dynamism except for three bursts in spacethatsomeothernodemiscurrentlyresponsiblefor.Node theintervals[500;700],[1500;1700]and[2500;2700]. During ncantentativelyestablishaproxylinktominsteadofactually these intervals, there are (cid:14) = 100 nodesjoining and (cid:14) = 100 movingcontentover. Sincenhasnotbeeninthesystem very nodesleavingthenetworkeachminute. Thefigureshowsevo- long,itislikelythatithasahighdissatisfactionindexand,con- lution of both the quality of the network and the number of sequently, Apocrypha will make n swap labels multiple times probes initiated. We see that the number of probes stays rel- beforen“settlesdown”withsomelabel.Eachtimenswapsla- atively low during the bursts of dynamism, while the quality belswithsomeothernoden0,itcanhandoveritscurrentproxy startsdeteriorating. Butthedeteriorationdoesnotgetarbitrar- link to n0, and create a proxy link to n0 for itself. When n is ily bad(as evidenced by the “flat top” of the humps in the la- no longer involvedin swaps for a while, it can movethe con- tencycurve). Whenthedynamismends,thenumberofprobes tent overfrom whichevernode its proxy link points to. Thus, drops initially (just before each big spike) while nodes are in theamountofcontentthatneedstobeswappedcanbereduced tentative-staticmode andare sending out “feelers”to check if considerably. thenetworkisindeedstatic. Oncenodesdetecttheabsenceof Second,inmanyDHTsystems,indexdatastoredbyanode dynamism, theystartprobingaggressivelytoimprovethe net- isdesignedto“expire”afteracertainamountoftime. Thisex- work quickly. When the improvement flattens out, they once pirationtimeisoftenverylow,andtheproxylinkwillquickly again reduce the number of probes initiated. Over the long become unnecessary as all the data that it points to will soon term, we see that the average latency of the network declines expire. Inthemeanwhile,anodecanlazilycacheallresultsto steadily. queriesthataresentovertheproxylink,thusimprovingperfor- 10 2500 ogy.Thehorizontallineatthebottomrepresentstheaveragela- Greedy routing tencybetweenpairsofnodesinourinternetmodel,andisthus Shortest Path 2000 Internet alowerboundforanyoverlay-routingprotocol4Thetopcurve s) m depicts the average of the path latency between every pair of y( nc 1500 nodesusinggreedyrouting; notethat theaveragepathlatency e at drops from about 2500ms on the initial unoptimized network L ath 1000 to about 1160msbythe use of Apocrypha. The middle curve P g. plots the averagelatencyof the shortestoverlay path between v A 500 pairs of nodes. We see that this shortest-overlay-pathlatency betweentwonodesisveryclosetotheinternetlatency(380ms 0 after 2500 steps, compared to an internet latency of 341ms). 0 500 1000 1500 2000 2500 The trouble is that greedy routing does not appear to use this Number of Steps shortest path and instead uses a path of much higher latency Fig.12. PerformanceofqueriesonChordusinggreedyrouting forrouting. Luckilyforus,greedyroutingisnottheonlyway to route on the Chord network. We explore three newrouting protocolsforChordwhichallprogressivelyimproveongreedy manceandavoidingcopyingdatathatisnotqueriedfor. routingandcomeclosertofindingtheshortestoverlaypathbe- tweennodes.Wethenevaluatetheperformanceoftheserouting VI. APPLYINGAPOCRYPHATOCHORD protocolsboth on the GT-ITM topology andon the PlanetLab data. Sofar,wehavedescribedandevaluatedthequalityofover- lay networks optimized by Apocrypha, using a generic graph metric: the average link latency. The real question, however, A. ChoosingpathswiselywithSolomon isto understandthe impactofan “improved”overlaynetwork The Chord structure offers multiple alternative paths from on the actual applications that use the overlay network. To anynodetoanyothernode. Forsimplicity,imaginethatthere answer this question, we study two P2P search applications, areexactly2N nodeswithlabels0::2N (cid:0)1. Greedyroutingis Chord (this section) andGnutella (the nextsection), andeval- thenequivalenttoexpressingthedistancetothedestinationasa uate Apocrypha using application metrics such as the average binarystring,and“fixing”the1sleft-to-right. Forexample,we querylatency,andthefractionofqueryresultsreceived. wouldcoveradistanceoflength56(111000inbinary)bythree WefirstdescribeChord,andthehash-tablefunctionalitythat successivehopsoflength 32, 16and8, whichisequivalentto itprovides. Chordisadistributedhashtablestoringkey-value left-to-rightbitfixingonthebinarystring111000toconvertthe pairs. Keysarein thecircular range[0;2N), andeachnode is 1s to 0s. We observe that any routing protocol that fixes all assigned a unique “id” (or label, in our terminology)which is thesebitswillalsorequirethesamenumberofhops,irrespec- simplyarandomlychosenintegerintherange. Anodeisthen tiveoftheorderinwhichthebitsarefixed. ThePRS scheme designatedresponsibleforstoringallkeyslessthanorequalto of[GGG+03] usesthisinsightandsuggeststhatallneighbors itsid,andgreaterthanthenextsmallerexistingnodeid.Anode correspondingtofixinga“1”inthedistancebeconsideredcan- withidxmaintainslinkstothenodesresponsibleforx+2kfor didates,andthatanodeshouldforwardamessagetothecandi- allintegers0(cid:20)k <N,witharithmeticmodulo2N. datewiththelowestlatencyfromitself. Answeringaqueryforagivenkeyisequivalenttoforward- The above idea does not quite work when the number of ing the query to the node responsible for that key, using a se- nodes is not exactly 2N. For example, let us say we wanted quence of overlay links. This forwarding is achieved using togetfromnode0tonode54(distance111000)justasearlier; Chord’s greedy routing protocol. Greedy routing is simple: a say we want to take a hop of length 8 to fix the right 1 in the query is forwarded continuously in the clockwise direction to binary string. If there is no node with id exactly 8, the Chord theneighborwhoseidisclosesttothatofthequery.(Notethat systemrequiresnode0to maintainalinktothenodewiththe we are not permitted to “overshoot” the query key.) If nodes smallest id that is greater than 8. Say, node 0 has a link to choose ids randomly,the size of the key space that each node node9. Takingthishopoflength9wouldleadtoaremaining isresponsibleforisaboutthesame,and,assumingthatqueries distanceis56(cid:0)9 = 47,withabinaryrepresentation101111, areinitiatedequallyateverynodeandforeverykey,theaver- whichleavesusfivemore1stofixinthedistance. Ingeneral, age query latency is equal to the average latency in routing a thenumberofhopsnecessarytoreachthedestinationmaynot message betweeneverytwo nodes usingChord’sgreedy rout- belogarithmic. ingprotocol. We refer tothe latencyalongsome specificpath One simple way to fix the problem is to consider a neigh- fromonenodetoanotherasthepathlatency.Thus,theaverage borasacandidateforforwardingonlyiftheremainingdistance querylatencyissimplytheaveragepathlatencywhenpathsare fromthatneighborhasfewer1sthanthecurrentdistance.Ifno chosenusinggreedyrouting. Figure12showstheaveragepathlatency(fordifferentkinds 4OurGT-ITMmodelproducespairwiselatenciesthatobeythetrianglein- equality. Whenlatenciesdonotobeythetriangleinequality,likeinourexper- of paths) on Chord as a function of time as Apocrypha opti- imentsonPlanetLab,theaveragepairwiseinternetlatencyisnotnecessarilya mizesastatic25000-nodeChordnetworkonaGT-ITMtopol- lowerbound.

Description:
We propose a generic mechanism called Apocrypha to make any P2P overlay “ network-aware”, and thus optimize inter- node communication. We show how
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.