The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Keeping up with Storage: Decentralized, Write-Enabled Dynamic Geo-Replication

Pierre Matri^a, María S. Pérez^a, Alexandru Costan^c, Luc Bougé^1, Gabriel Antoniu^d

a Ontology Engineering Group, Universidad Politécnica de Madrid, Spain
b IRISA / INSA Rennes, France
c IRISA / ENS Rennes, France
d INRIA Rennes, France

Email addresses: [email protected] (Pierre Matri), [email protected] (María S. Pérez), [email protected] (Alexandru Costan), [email protected] (Luc Bougé), [email protected] (Gabriel Antoniu)

Abstract

Large-scale applications are increasingly geo-distributed. Maintaining the highest possible data locality is crucial to ensure high performance of such applications. Dynamic replication addresses this problem by dynamically creating replicas of frequently accessed data close to the clients. This data is often stored in decentralized storage systems such as Dynamo or Voldemort, which offer support for mutable data. However, existing approaches to dynamic replication for such mutable data remain centralized, and are thus incompatible with these systems. In this paper we introduce a write-enabled dynamic replication scheme that leverages the decentralized architecture of such storage systems. We propose an algorithm enabling clients to tentatively locate the closest data replica without prior request to any metadata node. Large-scale experiments on various workloads show a read latency decrease of up to 42% compared to other state-of-the-art, caching-based solutions.

Keywords: cloud, replication, geo-replication, storage, fault-tolerance, consistency, database, key-value store

1. Introduction

Large-scale applications such as social networks are being increasingly deployed over multiple, geographically distributed datacenters (or sites). Such geo-distribution provides fast data access for end-users worldwide while improving fault tolerance and disaster recovery and minimizing bandwidth costs. Today's cloud computing services [1, 2] allow a wider range of applications to benefit from these advantages as well. However, designing geo-distributed applications is difficult due to the high and often unpredictable latency between sites [3].

Such geo-distributed applications span a large range of specific use-cases. One example is scientific applications such as the MonALISA [4] monitoring system for the CERN LHC ALICE experiment [5]. This application collects and aggregates monitoring data from 300+ sites distributed across the world, which must be delivered to scientists worldwide in real time. The users of commercial applications, such as Facebook, generate ever-increasing amounts of data that need to be accessible worldwide. Ensuring the lowest possible access time for users is crucial for the user experience.

A key factor impacting the performance of such applications is data locality, i.e. the location of the data relative to the application. Accessing remote data is orders of magnitude slower than using local data. Although such remote accesses may be acceptable for rarely-accessed data (cold data), they hinder application performance for frequently-used data (hot data). For instance, in a social network application, popular profiles should be replicated at all sites whereas others can remain located at fewer locations. Finding the right balance between replication and storage is critical: replicating too many profiles wastes costly memory and bandwidth, while failing to replicate popular ones results in degraded application performance.

Dynamic replication [6] proposes to solve this issue by dynamically replicating hot data as close as possible to the applications that access it. This technique is leveraged in Content Delivery Networks (CDN) to cache immutable data close to the final user [7, 8].
Similarly, it is used in storage systems such as GFS [9] or HDFS [10] to replicate mutable data, by relying on the centralized metadata management of these systems [11, 12]. Yet, such an approach contradicts the design principles of decentralized storage systems such as Dynamo [13] or Voldemort [14], which aim to enable clients to locate data without exchanges with dedicated metadata nodes.

Furthermore, handling mutable objects in this context is difficult. Indeed, the dynamic replicas have to be kept synchronized with the origin data, without impacting the consistency guarantees of the underlying system. To the best of our knowledge, no decentralized, write-enabled dynamic replica location and management method exists in the literature today. Reaching this goal while providing predictable overhead and guaranteed accuracy is not trivial. In this paper we demonstrate that this objective can be reached by combining the architecture of these systems with deceptively simple algorithms from the literature. We make these contributions, which substantially revise and extend the early principles we previously introduced in [15]:

• After briefly introducing the applications we target (Section 2), the related work (Section 3) and the storage systems we target (Section 4), we characterize the challenges of decentralizing write-enabled dynamic data replication (Section 5).

• We address these challenges with a decentralized data popularity measurement scheme (Section 6), which leverages existing state-of-the-art storage system architecture to identify hot data cluster-wide dynamically.

• Based on these popularity measurements, we describe a dynamic data replication algorithm which dynamically creates and manages replicas of hot data as close as possible to the applications (Section 7).

• We enable clients to locate the closest of such data replicas using an approximate object location method (Section 8), which minimizes storage latency by avoiding communication with any dedicated metadata node.

• We develop a prototype implementation leveraging the above contributions, integrated with the Voldemort distributed key-value store (Section 9), and prove the effectiveness of our approach with a large-scale experimental study on the Amazon Cloud (Section 10). We observe a read latency decrease of up to 42% compared to other state-of-the-art, caching-based algorithms.

We discuss the effectiveness and applicability of our contribution (Section 11), and conclude on future work that further enhances our proposal (Section 12).

2. Large-scale, data-intensive applications

In this paper we target large-scale applications serving large amounts of data to users around the world, while seeking to enable low-latency access for these users to potentially-mutable data. Examples of such applications span multiple use-cases, among which:

Scientific system monitoring. Monitoring a geo-distributed cluster requires collecting a potentially large number of metrics from computers around the world. This is for example the case for MonALISA [4], which monitors thousands of servers distributed in more than 300 sites around the world. The collected data is aggregated live, and is used to provide live insights about the system performance and availability around the world. To deliver the real-time promise of MonALISA, it is crucial to ensure that the monitoring data needed by scientists around the world is located as close as possible to them.

Social networks. Business applications such as social networks ingest overwhelming amounts of data. Facebook, for example, is expected to reach 2 billion active profiles in the next few weeks [16]. Every single day, it processes 350 million photo uploads [17] or 6 billion posts [18]. The strong 500:1 read to write ratio [19] calls for large-scale caching. However, some of this data is mutable by nature. This is for example the case of user profiles, which are hard to cache while keeping this cache synchronized with the source data, calling for heavyweight, custom cache invalidation pipelines [20].

The real-time promise of such applications requires keeping up-to-date data as close as possible to the end-user. The global scale of these applications makes this difficult while keeping costly bandwidth and storage usage low. We will detail these challenges in Section 5. These challenges are independent of the type of platform, such as compute grids for MonALISA or clouds for Facebook.

3. Related work

In the literature, dynamic replication stands as a topic of interest for all applications requiring access to shared data from many geo-distributed locations. Most of these contributions can be classified in two categories:

Immutable data, decentralized management. A range of applications need to provide their users with fast and timely access to static resources such as images or videos. This is the case of most global internet applications, in which Content Distribution Networks (CDNs) help provide a good user experience by creating replicas of static, immutable data as close as possible to the clients that access it.

Yet, CDNs are targeted at serving content directly to the final user. In this paper, we focus on allowing a geo-distributed application to access a geo-distributed data source with the lowest possible latency. Kingsy Grace et al. [7] provide an extensive survey of replica placement and selection algorithms available in the literature. Among these, Chen et al. [8] propose a dissemination-tree based replication algorithm leveraging a peer-to-peer location service. Dong et al. [21] transform the multiple-location problem into several classical mathematical problems with different parameter settings, for which efficient approximation algorithms exist. However, they do not consider the impact of replication granularity on performance and scalability. Wei et al. [22] address this issue by developing a model to express availability as a function of replica number. This approach, however, only works within a single site, as it assumes uniform bandwidth and latency, which is not the case with the geo-distributed workloads that we target. Inspired by P2P systems, [23] proposes an adaptive decentralized file replication algorithm that achieves high query efficiency and high replica utilization at a significantly low cost. In [24], MacCormick et al. enable storage systems to achieve balanced utilization of storage and network resources in the presence of failures and of skewed distributions of data size and popularity. Madi et al. [25] consider a wider usage of parameters in the context of data grids, such as read cost or file transfer time.

Mutable data, centralized management. However, a range of applications rely on mutable data. This is for example the case in MonALISA monitoring of the CERN LHC experiment [4, 5]. In web applications, this is observed with social network profile pages, status pages, comments on a news thread, or more generally services publicly displaying user-generated content. In all these applications, we also observe that the data objects are mutable (changing aggregates from new monitoring events, users updating their profile, posting new statuses or comments). Available geo-replication solutions typically either require the application to explicitly clear modified objects from distant caches, or leverage a centralized replication manager that contradicts the decentralized design of most state-of-the-art, geo-distributed data stores. Dynamic replication enables the geo-distributed storage system to replicate in near real-time the most requested objects as close as possible to the application instances accessing them.

Efficiently creating and placing replicas of hot data is not enough. Indeed, one also needs to ensure that those replicas are kept synchronized with the original data. This is usually the case of applications relying on a globally-distributed file system. Overall, the proposed solutions in the literature leverage the centralized metadata management of certain storage systems such as HDFS [10] or GFS [9] to allow the clients to locate the closest available replica of the data they want to access. In that context, Ananthanarayanan et al. [11] propose a popularity-based dynamic replication strategy for HDFS aimed at improving the performance of geo-distributed Map-Reduce clusters. Jayalakshmi et al. [12] model a system designed to direct clients to the best replica available.

Our proposal enables writes to any given object in a decentralized, large-scale storage system to be transparently forwarded to the existing dynamic replicas of that object, without requiring explicit cache eviction requests from the application. In contrast, replication strategies adopted in CDNs such as Dynamic Page Caching [26] have a substantially different target; they focus on offering fine-grained caching based on configured user request characteristics (cookies, request origin, ...), while still accessing the origin data replica for dynamic, mutable objects.

Replica selection algorithm. Targeted work on replica selection proves that adopting a relevant data location algorithm can lead to significant performance improvements. Mansouri et al. [27] propose a distributed replication algorithm named Dynamic Hierarchical Replication Algorithm (DHR), which selects replica location based on multiple criteria such as data transfer time and request waiting time. Kumar et al. [28] address the problem of minimizing average query span, i.e. the average number of machines that are involved in the processing of a query, through co-location of related data items. C3 [29] goes even further by dynamically adapting replica selection based on real-time metrics, in an adaptive replica selection mechanism that reduces request tail latency in presence of service-time fluctuations in the storage system.

However, none of these contributions considers the case of mutable data stored in decentralized data stores, such as Cassandra [30] or Voldemort [14]. Facebook, for example, circumvents the issue by directing all write requests to a single data center and using a dedicated protocol to keep the cache consistent across other regions [19]. In this paper, we fill this gap by enabling efficient data replication of mutable data in geo-distributed, decentralized data stores.

4. Background: The systems we target

Let us first briefly describe the key architectural principles that drive the design of a number of decentralized systems. Dynamo [13] has inspired the design of many such systems, such as Voldemort [14], Cassandra [30] or Riak [31]. In this paper we target this family of systems, which are widely used in the industry today.

Data model. Dynamo is a key-value store, otherwise called distributed associative array. A key-value store keeps a collection of values, or data objects. Each object is stored and retrieved using a key that uniquely identifies it.

DHT-based data distribution. Objects are distributed across the cluster using consistent hashing [32] based on a distributed hash table (DHT), as in Chord [33]. Given a hash function $h(x)$, the output range $[h_{min}, h_{max}]$ of the function is treated as a circular space ($h_{max}$ wrapping around to $h_{min}$), or ring. Each node is assigned a different random value within this range, which represents its position on the ring. For any given key $k$, a position on the ring is determined by the result of $h(k)$. The primary node holding the primary static replica of the object is the first one encountered while walking the ring past this position. To ensure fault-tolerance, additional static replicas are created at the time the object is stored. These are placed on the next $r$ nodes following the primary node on the ring, $r$ being the configured replication factor of the system (usually 2).

P2P cluster state dissemination. The position of each node on the ring is advertised in the cluster using a family of peer-to-peer (P2P) protocols known as Gossip [34]. Each node periodically disseminates its status information to a number of randomly-selected nodes and relays status information received from other nodes. This method is also used to detect and advertise node failures across the cluster [35].

Client request routing. By placing objects deterministically in the cluster, Dynamo obviates the need for dedicated metadata servers. Clients are able to perform single-hop reads, i.e. address their requests directly to the nodes holding the data. This enables minimal storage operation latency and higher throughput. Should a client address the request to a node not holding the requested data, this node will forward the request directly to the correct one. This correct node is determined using the ring state information disseminated throughout the cluster.

Deterministically placing data objects and disseminating ring status in the cluster enables each node to route incoming client requests directly to a node holding the data. Operation latency is further reduced by opening cluster state information to the clients so they can address their requests straight to the correct node, without any metadata server involved.

We base our strategy on these design principles, which allow us to guarantee the correctness of the proposal we describe in this paper. We choose not to modify the original static replication mechanism, offering the same data durability as the underlying system. We also do not change the server-side client routing mechanism, consequently guaranteeing that static replicas are always reachable. This allows us to focus on developing an efficient heuristic that maximizes the accuracy of popular object identification, optimizes the creation and placement of dynamic replicas of such popular objects, and helps clients efficiently locate the closest of these replicas.

5. Our proposal in brief: outline and challenges

In this paper, we demonstrate that it is possible to integrate dynamic replication with the existing architecture of these storage systems, which enables us to leverage their existing, built-in algorithms to efficiently handle reads and writes in geo-distributed environments.

Such dynamic replication seeks to place new copies of the hot data in sites as close as possible to the application clients that access it. To that end, we permit the clients to vote for dynamic object replicas to be created at a specific site. These votes are collected at each node and disseminated across the cluster so that the objects which received the most votes (or popular objects) are identified by the storage system. Dynamic replicas of such popular objects are created at sites where they are popular, and deleted when their popularity drops. When trying to access an object, clients tentatively determine the location of its closest replica (either static or dynamic) and address requests directly to the node holding it. Such an approach however raises a number of challenges.

We acknowledge that using client votes has been proposed before in the context of replicated relational databases, specifically to ensure data consistency [36]. Transposing this idea to decentralized storage systems poses a number of significant challenges that we address in this paper. Specifically, collecting client votes efficiently without using a centralized process requires us to propose a novel, fully-decentralized, loosely-coupled vote collection algorithm. While existing dynamic replication techniques leverage a centralized repository to direct client requests to the nearest available replica, we propose a technique allowing clients to tentatively locate the closest available replica without any prior request to any such repository.

5.1. Collecting and counting votes

The goal of dynamic replication is to improve storage operation latency for the clients. Therefore, we need to design an efficient way to let the clients cast their votes for objects.

Collecting votes also raises a major challenge. While determining the most voted-for objects at each node is straightforward and can be done efficiently, inferring from this the most popular objects cluster-wide is not an easy task. This problem is named distributed top-k monitoring. Sadly, most implementations in the literature [37, 38, 39, 40] are centralized.

We address both these issues by mixing an approximate frequency estimation algorithm with the existing, lightweight Gossip protocol provided by the storage system (Section 6).

5.2. Tunable replication

Determining when to create a new dynamic replica or delete an existing one based on the previous information is also challenging in a distributed setup. Indeed, it is necessary to bound the number of replicas to be created at each site without any node being responsible for coordinating the replicated items. Consequently, nodes must synchronize with each other before creating a new replica of any object. One could consider using a consensus protocol such as Paxos [41]. However, we argue it would be overkill for such a simple task, as Paxos is by no means a light protocol [42].

It turns out that the technique we use to solve the vote collection and counting challenge also provides all the information we need to solve this issue (Section 7).

5.3. Dynamic replica location

We need to enable clients to locate dynamic replicas as they do for static replicas. Obviously, such replicas should also be placed at a predictable, deterministically chosen node. We achieve this using the DHT-based data distribution of the storage system (Section 7.1). To access the closest available replica, a client also needs to know whether or not a dynamic replica exists at a given site. Systematically probing nearby sites for available replicas would contradict the single-hop read feature of Dynamo. Also, this would significantly increase storage operation latency, consequently missing the point of dynamic replication, which is precisely to reduce this latency.

We demonstrate that this issue can be solved using probabilistic algorithms (Section 8). Our proposal builds on the solutions we adopt for the two afore-described challenges.

6. Identifying hot objects with client votes

In this section, we describe how to identify the most popular objects at each site. This is achieved in three steps:

1. We describe an efficient way to allow the client to vote for an object to be replicated dynamically at a specific location (Section 6.1).
2. We maintain a local count of these votes at each node to identify the most voted-for objects (Section 6.2).
3. We disseminate and merge these votes throughout the cluster to provide each node with a vision of the most popular objects for each site, cluster-wide (Section 6.3).

The identification method we describe in this section addresses the vote collection and counting challenge above.

6.1. Client vote casting

Clients vote for objects to be replicated dynamically at sites close to them. We name these sites preferred sites. This proximity can for instance express network latency, but other metrics such as bandwidth cost or available computational power may also be considered. To this purpose, each client maintains a list of such preferred sites, ordered by preference.

We argue that existing read queries to the storage system provide an ideal base for vote casting, as clients intuitively vote only for objects they need to read. In contrast, objects being written to are not good candidates for dynamic replication because of the synchronization needed to keep dynamic replicas in sync with static replicas; we discuss write handling in Section 7.3. For every storage operation on an object, the client indicates in the request message its preferred sites for this object, i.e. the sites where the client would have preferred a dynamic replica of the object to exist. Let us assume a client wants to read the object associated with the key key. The client sends the request to the closest node n holding a replica of that object. Say this node belongs to a site s. We detail the location of this closest node in Section 8. The client piggybacks on the request message the list of the sites having a higher preference than s in its list of preferred sites. Such a request is interpreted by n as a vote for this object to be replicated on these sites.

6.2. Node-local vote collection and hot object identification

At each node, we want to know for each site the most voted-for objects. These are considered as candidates for dynamic replication. Each time a node receives a read request for an object identified by key, it records the vote for this object to be replicated on all sites indicated as preferred by the client. Let us first assume that we keep one counter per key and per site, which is incremented by 1 for each vote. We name site counters the set of key counters for a single site, and vote summary the set of site counters for all sites. In addition, if the object replica identified by key is a dynamic replica, we consider that the client implicitly votes for this replica to be maintained. As such, we also record the vote for key on the local site of the node receiving the request, i.e. the site the node belongs to. Algorithm 1 details these actions by a node receiving a read request from a client.

Algorithm 1 Node-local object vote counting
Input: key: key of an object to read, prefs: list of preferred sites provided by the client.
procedure CountClientVotes(key, prefs)
    ▷ Interpret reading a dynamic replica as an implicit vote
    let local be the local site of the current node
    let replica be the local replica of the object with key key
    if replica is a dynamic replica then
        add local to prefs
    end if
    ▷ Add client votes to the local vote summary
    for each preferred site site in prefs do
        let vs[site] be the site counter structure for site
        count one vote for key in vs[site]
    end for
end procedure

However, the goal of this scheme is to adapt to fluctuating object popularity by dynamically replicating the objects having the highest popularity over a recent period of time. Consequently, we use successive voting rounds. We extract at the end of each round the most voted-for objects for each site, and create a new, empty vote summary for the subsequent round. The length of a round is a cluster setting: we discuss its value in Section 11.2. We synchronize these rounds across the cluster by using the local clock of each node.

Keeping an exact vote summary for any given round is memory-intensive. It has a memory complexity of $O(M \cdot S)$, $M$ being the number of objects voted for in this round and $S$ being the number of sites in the cluster. Such complexity is not tolerable as billions of objects may exist and be queried in the cluster. Luckily, we do not need to keep the vote count for all objects: we are only interested in knowing which are the most voted-for objects for each site. For each site, finding the k most frequent occurrences of a key in a stream of data (client votes) is a problem known as top-k counting. Multiple approximate, memory-efficient solutions to this problem exist in the literature. In the context of our system, such approximate approaches are tolerable: it is not critical to collect the exact vote count for each object, as long as the estimation of their vote count is precise enough and the set of objects identified as popular accurately captures the votes expressed by the clients. As such, we use as vote summaries a set of approximate top-k estimators, k being a configuration setting whose value is discussed in Section 11.1. We choose to use the Space-Saving algorithm [43] as top-k estimator. It guarantees strict error bounds for approximate counts of votes, and only uses limited, configurable memory space. Its memory complexity is $O(k)$. For any given site, the output of Space-Saving is the approximate list of the k most voted-for keys, along with an estimation of the number of votes for each.

Any given node simultaneously maintains $S$ active structures, one for each site in the cluster. Each time this node receives a request for an object v, for each preferred site indicated in the request, the key of v is added to the corresponding active structure. Consequently, at any time, a node is able to know which are the most frequent replication preferences indicated by clients for any site over the previous time window.

6.3. Cluster-wide vote summary dissemination

In this section we explain how to obtain the most voted-for objects across all nodes, starting from local vote summaries built from user votes (1). We periodically share the local vote summaries of each node with its peers (2). Merging these peer vote summaries (3) gives each node a view of the most popular items across the cluster. Figure 1 illustrates this process.

We organize the process at any given node n in successive phases. During a voting round r of duration t, the local vote summary capturing client votes is named active summary. When a round ends, the summary transitions to a merging state: the node sends this summary to its peers, i.e. every other node in the cluster. Rounds being synchronized across the cluster, the node also receives summaries from its peers for the same voting round, which are merged with the local summary; merging this local summary with another one received from a peer n' gives a summary of the votes received by both n and n'. When vote summaries for every peer have been received, the summary is complete, at which point all votes received by all nodes in the cluster for the round r are summarized. After a period 2t since the round started, this cluster-wide summary transitions to a serving state which we detail in Section 7. We illustrate these successive vote summary phases in Figure 2.

In presence of faults, a summary can reach the serving state without having received all peer vote summaries in time. This may occur in case of delayed or lost packets. We qualify such a summary as incomplete.

Such an approach is consistent with the class of systems we target: Dynamo provides an efficient algorithm for disseminating information across the cluster, namely Gossip. We use it to share a vote summary with every other node when it reaches the merging state. This approach is also compatible with our design choices: the Space-Saving structure we use is proven to be mergeable in [44], with a commutative merge operation.

Formally, we name MergeCounters(a, b) the function outlined in [44] that merges two Space-Saving structures a and b. MergeSummaries(v, v') is the function merging two vote summaries $v$ and $v'$. These summaries contain site counters, respectively $v_1, \ldots, v_S$ and $v'_1, \ldots, v'_S$. $S$ is the total number of sites in the cluster. This function returns a merged summary $v''$ containing $S$ site counters $v''_1, \ldots, v''_S$, such that:

\[ \forall a \in [1, S], \quad v''_a = \mathrm{MergeCounters}(v_a, v'_a) \tag{1} \]

Considering that MergeCounters is commutative, it is trivial that MergeSummaries has the same property.

Let us assume a reliable network at this point, with all peer summaries being received before the local summary reaches a serving state. Because each node sends to all its peers the same local vote summary, and because the MergeSummaries function is commutative, the resulting complete summary after all peer summaries are merged is identical at each node. When all nodes reach the complete summary state, they share the same view of the most voted-for objects for each site. We use it to perform dynamic object replication in Section 7. We discuss the memory complexity of the popular object identification process in Section 11.4.

Figure 1: Cluster-wide popular object identification overview. S is the total number of sites in the cluster and N the total number of nodes. [Diagram labels: Client; nodes n1, n2, n3, ..., nN; steps (1) Vote, (2) Share, (3) Merge; local vote summary v1, v2, ..., vS; global vote summary v'1, v'2, ..., v'S.]

Figure 2: Timeline of vote summary states for a voting round length t. [Diagram labels: states Active, Merging, Complete, Serving over successive periods t; inputs from clients (1) and peer nodes (2), (3).]

7. Lifecycle of a dynamic replica

We detail in this section how to create and delete dynamic replicas using the cluster-wide vote summaries while handling writes to dynamically-replicated objects.

1. We first explain when to create a new dynamic replica of an object identified as popular on a site (Section 7.1).
2. We describe the process of removing those data replicas when their popularity drops (Section 7.2).
3. We finally explain the process of forwarding writes to these dynamic replicas while retaining the consistency and guarantees of the underlying storage system (Section 7.3).

7.1. When and where to create a dynamic replica?

With access to a shared vote summary, deciding when to create a replica is straightforward. Nodes in the cluster create remote dynamic replicas of the popular objects they are primary node for. As soon as a vote summary reaches the complete state, thus summarizing the votes of all clients across the cluster, the top-k most popular objects are replicated to sites at which they are popular. To replicate an object obj identified by a key key on a remote site s, a node n first informs all nodes holding static replicas of obj, which store this in their local state. Upon acknowledgement from these static replica nodes, the primary node updates its local state as well, and copies obj to a node at site s. The node on which this replica is placed is selected deterministically. In the case of Dynamo, we can use the existing DHT to place objects at a site in such a deterministic fashion. Starting on the cluster ring from the position h(key), we walk the ring until we find a node at s, on which the replica is placed. Such a method is used today by Cassandra [30] for rack-aware data placement. Thus, assuming that a client knows a dynamic replica of an object exists at a site, it can easily infer which node holds this replica and address its request directly to it.

Fault-tolerance: No replicas are created based on incomplete vote summaries. In the presence of failures, this may result in dynamic replicas of popular objects not being created at the initiative of their primary node. We handle this case with the replica read process we outline in Section 8.

7.2. When to delete a dynamic replica?

Each node is responsible for the deletion of any dynamic replicas it holds. A dynamic replica at a site s can be deleted if it is not among the top-k items for this site s in the serving summary for the current time period. We also want to avoid replica bounces, i.e. object replicas being repeatedly created and deleted at the same site. This may happen for objects whose popularity ranking is around the top-k threshold, and fluctuates above and under this threshold. We define a grace period g, which represents the minimum number of consecutive serving summaries a previously-popular object must be absent from before its dynamic replica is deleted.

The deletion process is simple. To delete a locally-held dynamic replica of an object obj, a node flags it as inactive: subsequent client requests for that replica are handled as if it did not exist, which we discuss in Section 8. The node then informs the static replica nodes of obj of the deletion of this replica, which they remove from their local state.

Fault-tolerance: No replicas are deleted based on incomplete vote summaries. Replicas of objects whose popularity dropped may not be deleted as they should in the presence of failures. Yet, this principle ensures that no replicas of still popular objects are ever deleted, and that they remain accessible.

7.3. Handling writes to dynamically replicated objects

Our dynamic replication scheme enables the storage system to handle writes to dynamically-replicated objects. The afore-detailed dynamic replica creation makes this process straightforward. Clients address write requests for any object to one of the static replica nodes of this object. Based on their local state, these nodes determine all existing dynamic replicas of that object and propagate the write to all other replicas, static and dynamic, using the write protocol of the storage system.

Correctness: because our replication system does not modify the write propagation algorithm of the underlying system, writes are propagated to dynamic replicas the same way they are propagated to static replicas. [...] origin nodes before creating and after deleting any dynamic replica. This guarantees that writes to dynamically replicated objects are always forwarded to all of their dynamic replicas.

We prove that this write protocol is correct, i.e. does not cause dynamic replicas to be out of sync with static replicas, even in the case of system failures. A dynamic replica is created only after successful acknowledgement from all other static replica nodes. These nodes are only informed of the replica deletion after it is flagged as inactive. This ensures that writes to an object are always propagated to the dynamic replicas of that object, even in the presence of faults.

8. Accessing the closest replica

In this section, we explain how to let clients access a close replica of objects they want to read, either static or dynamic, without any communication with any dedicated metadata node. In a nutshell, we let the user request information about dynamic replicas created on its preferred sites at any given time, and later use this data to infer the location of the closest replica of any object and access it directly.

8.1. Locating the closest replica: dynamic replica summaries

We assume a client can know at any time the list of all active dynamic replicas in its preferred sites. This assumption greatly eases locating the closest replica: it is the static or dynamic one located on the site with its highest preference, or the closest static replica if no replica exists at any preferred site. We name this list of dynamic replicas for a site s the dynamic replica summary of s, and detail the closest replica location in Algorithm 2. The client addresses its request to the node holding this replica on the site indicated by this algorithm using its knowledge of the cluster DHT. The node receiving this request returns the dynamic replica, if available. If it is not available locally, it forwards the request to the closest static replica node, which is guaranteed to hold the object.

We near these assumptions by enabling any client to request from any node such dynamic replica summary
In Voldemort, writes are eventually-consistent by atitspreferredsitesatthetimeoftherequest.Anynode default, and can optionally be made strongly consis- isabletoanswerthisrequestbasedonthecluster-wide tent. Our proposal shows the same consistency char- votesummarydisseminationprocess. Usingitsknowl- acteristics. As such, we only need to ensure that the edge of the client votes across the cluster given by the creationanddeletionofdynamicdoesnotviolatethese currentcomplete,servingsummary,thenodeknowsthe consistency characteristics. We ensure this by persist- list of all dynamic replicas currently active at any site. ing the new dynamic replica list for each object on its Thislistissenttotheclientforitspreferredsites. 9
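To make the lookup described in this section more concrete, the following Python sketch combines the deterministic ring walk of Section 7.1 with the site selection rule stated above. It is only an illustration under several assumptions and is not the paper's Algorithm 2: the names `Ring`, `node_for`, `ring_position` and `closest_replica_site` are hypothetical, the ring hash is an arbitrary MD5-based stand-in for h, and the client is assumed to already know the sites holding static replicas, ordered from closest to farthest.

```python
import bisect
import hashlib


def ring_position(value):
    """Hypothetical ring hash h(.): maps a string to a position in [0, 2**32)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class Ring:
    """Toy cluster ring; node names and sites are illustrative assumptions."""

    def __init__(self, node_sites):
        # node_sites maps a node name to the site hosting it.
        self.node_sites = dict(node_sites)
        self.ring = sorted((ring_position(n), n) for n in self.node_sites)

    def node_for(self, key, site):
        """Walk the ring from h(key) until a node located at `site` is found."""
        start = bisect.bisect_left(self.ring, (ring_position(key), ""))
        for step in range(len(self.ring)):
            _, node = self.ring[(start + step) % len(self.ring)]
            if self.node_sites[node] == site:
                return node
        return None  # no node of the cluster is located at this site


def closest_replica_site(key, preferred_sites, static_sites, summaries):
    """Return the site a client should contact to read `key`.

    preferred_sites: client sites, ordered by decreasing preference.
    static_sites: sites holding a static replica, closest first (assumed known).
    summaries: dynamic replica summaries, site -> set of replicated keys.
    """
    for site in preferred_sites:
        if site in static_sites or key in summaries.get(site, set()):
            return site
    return static_sites[0]  # fall back to the closest static replica


if __name__ == "__main__":
    ring = Ring({"n1": "paris", "n2": "paris", "n3": "tokyo", "n4": "madrid"})
    summaries = {"paris": {"profile:42"}, "tokyo": set()}
    site = closest_replica_site("profile:42", ["tokyo", "paris"], ["madrid"], summaries)
    print(site, ring.node_for("profile:42", site))
```

Because both the client and the storage nodes can evaluate `node_for` from the same ring description, no metadata lookup is needed once the client holds the dynamic replica summaries of its preferred sites. A stale summary is tolerated in this sketch the same way as in the text: the contacted node would forward the request to a static replica if it no longer holds the object.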