Keeping up with storage: Decentralized, write-enabled dynamic geo-replication
Pierre Matri, María S. Pérez, Alexandru Costan, Luc Bougé, Gabriel Antoniu

To cite this version:
Pierre Matri, María S. Pérez, Alexandru Costan, Luc Bougé, Gabriel Antoniu. Keeping up with storage: Decentralized, write-enabled dynamic geo-replication. Future Generation Computer Systems, 2018, 86, pp. 1093-1105. 10.1016/j.future.2017.06.009. hal-01617658

HAL Id: hal-01617658
https://hal.inria.fr/hal-01617658
Submitted on 16 Oct 2017
Keeping up with Storage:
Decentralized, Write-Enabled Dynamic Geo-Replication

Pierre Matri (a), María S. Pérez (a), Alexandru Costan (b), Luc Bougé (c), Gabriel Antoniu (d)

(a) Ontology Engineering Group, Universidad Politécnica de Madrid, Spain
(b) IRISA / INSA Rennes, France
(c) IRISA / ENS Rennes, France
(d) Inria Rennes, France
Abstract

Large-scale applications are ever-increasingly geo-distributed. Maintaining the highest possible data locality is crucial to ensure high performance of such applications. Dynamic replication addresses this problem by dynamically creating replicas of frequently accessed data close to the clients. This data is often stored in decentralized storage systems such as Dynamo or Voldemort, which offer support for mutable data. However, existing approaches to dynamic replication for such mutable data remain centralized, and are thus incompatible with these systems. In this paper we introduce a write-enabled dynamic replication scheme that leverages the decentralized architecture of such storage systems. We propose an algorithm enabling clients to tentatively locate the closest data replica without any prior request to a metadata node. Large-scale experiments on various workloads show a read latency decrease of up to 42% compared to other state-of-the-art, caching-based solutions.

Keywords: cloud, replication, geo-replication, storage, fault-tolerance, consistency, database, key-value store
1. Introduction

Large-scale applications such as social networks are being increasingly deployed over multiple, geographically distributed datacenters (or sites). Such geo-distribution provides fast data access for end-users worldwide while improving fault-tolerance and disaster-recovery, and minimizing bandwidth costs. Today's cloud computing services [1, 2] allow a wider range of applications to benefit from these advantages as well. However, designing geo-distributed applications is difficult due to the high and often unpredictable latency between sites [3].

Such geo-distributed applications span a large range of specific use-cases. For instance, scientific applications such as the MonALISA [4] monitoring system for the CERN LHC Alice experiment [5]. This application collects and aggregates monitoring data from 300+ sites distributed across the world, that must be delivered to scientists worldwide in real-time. The users of commercial applications, such as Facebook, generate ever-increasing amounts of data that needs to be accessible worldwide. Ensuring the lowest possible access time for users is crucial for the user experience.

A key factor impacting the performance of such applications is data locality, i.e. the location of the data relative to the application. Accessing remote data is orders of magnitude slower than using local data. Although such remote accesses may be acceptable for rarely-accessed data (cold data), they hinder application performance for frequently-used data (hot data). For instance, in a social network application, popular profiles should be replicated at all sites whereas others can remain located at fewer locations. Finding the right balance between replication and storage is critical: replicating too many profiles wastes costly memory and bandwidth, while failing to replicate popular ones results in degraded application performance.

Email addresses: pmatri@fi.upm.es (Pierre Matri), mperez@fi.upm.es (María S. Pérez), alexandru.costan@irisa.fr (Alexandru Costan), luc.bouge@ens.ens-rennes.fr (Luc Bougé), gabriel.antoniu@inria.fr (Gabriel Antoniu)
Dynamic replication [6] proposes to solve this issue by dynamically replicating hot data as close as possible to the applications that access it. This technique is leveraged in Content Delivery Networks (CDN) to cache immutable data close to the final user [7, 8]. Similarly, it is used in storage systems such as GFS [9] or HDFS [10] to replicate mutable data, by relying on the centralized metadata management of these systems [11, 12]. Yet, such an approach contradicts the design principles of decentralized storage systems such as Dynamo [13] or Voldemort [14], which aim to enable clients to locate data without exchanges with dedicated metadata nodes.

Furthermore, handling mutable objects in this context is difficult. Indeed, the dynamic replicas have to be kept synchronized with the origin data, without impacting the consistency guarantees of the underlying system. To the best of our knowledge, no decentralized, write-enabled dynamic replica location and management method exists in the literature today. Reaching this goal while providing predictable overhead and guaranteed accuracy is not trivial. In this paper we demonstrate that this objective can be reached by combining the architecture of these systems with deceptively simple algorithms from the literature. We make these contributions, which substantially revise and extend the early principles we previously introduced in [15]:

• After briefly introducing the applications we target (Section 2), the related work (Section 3) and the storage systems we target (Section 4), we characterize the challenges of decentralizing write-enabled dynamic data replication (Section 5).

• We address these challenges with a decentralized data popularity measurement scheme (Section 6), which leverages the existing state-of-the-art storage system architecture to dynamically identify hot data cluster-wide.

• Based on these popularity measurements, we describe a dynamic data replication algorithm which dynamically creates and manages replicas of hot data as close as possible to the applications (Section 7).

• We enable clients to locate the closest of such data replicas using an approximate object location method (Section 8), which minimizes storage latency by avoiding communication with any dedicated metadata node.

• We develop a prototype implementation leveraging the above contributions, integrated with the Voldemort distributed key-value store (Section 9), and prove the effectiveness of our approach with a large-scale experimental study on the Amazon Cloud (Section 10). We observe a read latency decrease of up to 42% compared to other state-of-the-art, caching-based algorithms.

We discuss the effectiveness and applicability of our contribution (Section 11), and conclude on future work that further enhances our proposal (Section 12).

2. Large-scale, data-intensive applications

In this paper we target large-scale applications serving large amounts of data to users around the world, while seeking to enable low-latency access for these users to potentially-mutable data. Examples of such applications span multiple use-cases, among which:

Scientific system monitoring. Monitoring a geo-distributed cluster requires collecting a potentially large number of metrics from computers around the world. This is for example the case for MonALISA [4], monitoring thousands of servers distributed in more than 300 sites around the world. The collected data is aggregated live, and is used to provide live insights about the system performance and availability around the world. To deliver the real-time promise of MonALISA, ensuring that the monitoring data needed by scientists around the world is located as close as possible to them is crucial.

Social networks. Business applications such as social networks ingest overwhelming amounts of data. Facebook, for example, is expected to reach 2 billion active profiles in the next few weeks [16]. Every single day, it processes 350 million photo uploads [17] or 6 billion posts [18]. The strong 500:1 read to write ratio [19] calls for large-scale caching. However, some of this data is mutable by nature. This is for example the case of user profiles, which are hard to cache while keeping this cache synchronized with the source data, calling for heavyweight, custom cache invalidation pipelines [20].

The real-time promise of such applications requires keeping up-to-date data as close as possible to the end-user. The global scale of these applications makes this difficult while keeping costly bandwidth and storage usage low. We will detail these challenges in Section 5. These challenges are independent of the type of platform, such as compute grids for MonALISA or clouds for Facebook.
3. Related work

In the literature, dynamic replication stands as a topic of interest for all applications requiring access to shared data from many geo-distributed locations. Most of these contributions can be classified into two categories:

Immutable data, decentralized management. A range of applications need to provide their users with fast and timely access to static resources such as images or videos. This is the case of most global internet applications, in which Content Distribution Networks (CDNs) help provide a good user experience by creating replicas of static, immutable data as close as possible to the clients that access it.
Yet, CDNs are targeted at serving content directly to the final user. In this paper, we focus on allowing a geo-distributed application to access a geo-distributed data source with the lowest possible latency.
Kingsy Grace et al. [7] provide an extensive survey of replica placement and selection algorithms available in the literature. Among these, Chen et al. [8] propose a dissemination-tree-based replication algorithm leveraging a peer-to-peer location service. Dong et al. [21] transform the multiple-location problem into several classical mathematical problems with different parameter settings, for which efficient approximation algorithms exist. However, they do not consider the impact of replication granularity on performance and scalability. Wei et al. [22] address this issue by developing a model to express availability as a function of replica number. This approach, however, only works within a single site, as it assumes uniform bandwidth and latency, which is not the case with the geo-distributed workloads that we target. Inspired by P2P systems, [23] proposes an adaptive decentralized file replication algorithm that achieves high query efficiency and high replica utilization at a significantly low cost. In [24], MacCormick et al. enable storage systems to achieve balanced utilization of storage and network resources in the presence of failures and skewed distributions of data size and popularity. Madi et al. [25] consider a wider range of parameters in the context of data grids, such as read cost or file transfer time.

Mutable data, centralized management. However, a range of applications rely on mutable data. This is for example the case in the MonALISA monitoring of the CERN LHC experiment [4, 5]. In web applications, this is observed with social network profile pages, status pages, comments on a news thread, or more generally services displaying publicly user-generated content. In all these applications, we also observe that the data objects are mutable (changing aggregates from new monitoring events, users updating their profile, posting new statuses or comments). Available geo-replication solutions typically either require the application to explicitly clear modified objects from distant caches, or leverage a centralized replication manager that contradicts the decentralized design of most state-of-the-art, geo-distributed data stores. Dynamic replication enables the geo-distributed storage system to replicate in near real-time the most requested objects as close as possible to the application instances accessing them.
Efficiently creating and placing replicas of hot data is not enough. Indeed, one needs to ensure as well that those replicas are kept in synchronization with the original data. This is usually the case of applications relying on a globally-distributed file system. Overall, the solutions proposed in the literature leverage the centralized metadata management of certain storage systems such as HDFS [10] or GFS [9] to allow the clients to locate the closest available replica of the data they want to access. In that context, Ananthanarayanan et al. [11] propose a popularity-based dynamic replication strategy for HDFS aimed at improving the performance of geo-distributed Map-Reduce clusters. Jayalakshmi et al. [12] model a system designed to direct clients to the most optimal replica available.
Our proposal enables writes to any given object in a decentralized, large-scale storage system to be transparently forwarded to the existing dynamic replicas of that object, without requiring explicit cache eviction requests from the application. In contrast, replication strategies adopted in CDNs such as Dynamic Page Caching [26] have a substantially different target; they focus on offering fine-grained caching based on configured user request characteristics (cookies, request origin, ...), while still accessing the origin data replica for dynamic, mutable objects.

Replica selection algorithms. Targeted work on replica selection proves that adopting a relevant data location algorithm can lead to significant performance improvements. Mansouri et al. [27] propose a distributed replication algorithm named Dynamic Hierarchical Replication Algorithm (DHR), which selects replica location based on multiple criteria such as data transfer time and request waiting time.
Kumar et al. [28] address the problem of minimizing average query span, i.e. the average number of machines that are involved in the processing of a query, through co-location of related data items. C3 [29] goes even further by dynamically adapting replica selection based on real-time metrics, in an adaptive replica selection mechanism that reduces request tail latency in presence of service-time fluctuations in the storage system.

However, none of these contributions considers the case of mutable data stored in decentralized data stores, such as Cassandra [30] or Voldemort [14]. Facebook, for example, circumvents the issue by directing all write requests to a single data center and using a dedicated protocol to keep the cache consistent across other regions [19]. In this paper, we fill this gap by enabling efficient data replication of mutable data in geo-distributed, decentralized data stores.

4. Background: The systems we target

Let us first briefly describe the key architectural principles that drive the design of a number of decentralized systems. Dynamo [13] has inspired the design of many such systems, such as Voldemort [14], Cassandra [30] or Riak [31]. In this paper we target this family of systems, which are widely used in the industry today.

Data model. Dynamo is a key-value store, otherwise called a distributed associative array. A key-value store keeps a collection of values, or data objects. Each object is stored and retrieved using a key that uniquely identifies it.

DHT-based data distribution. Objects are distributed across the cluster using consistent hashing [32] based on a distributed hash table (DHT), as in Chord [33]. Given a hash function h(x), the output range [h_min, h_max] of the function is treated as a circular space (h_min wrapping around to h_max), or ring. Each node is assigned a different random value within this range, which represents its position on the ring. For any given key k, a position on the ring is determined by the result of h(k). The primary node holding the primary static replica of the object is the first one encountered while walking the ring past this position. To ensure fault-tolerance, additional static replicas are created at the time the object is stored. These are placed on the next r nodes following the primary node on the ring, r being the configured replication factor of the system (usually 2).

P2P cluster state dissemination. The position of each node on the ring is advertised in the cluster using a family of peer-to-peer (P2P) protocols: Gossip [34]. Each node periodically disseminates its status information to a number of randomly-selected nodes and relays status information received from other nodes. This method is also used to detect and advertise node failures across the cluster [35].

Client request routing. By placing objects deterministically in the cluster, Dynamo obviates the need for dedicated metadata servers. Clients are able to perform single-hop reads, i.e. address their requests directly to the nodes holding the data. This enables minimal storage operation latency and higher throughput. Should a client address the request to a node not holding the requested data, this node will forward the request directly to the correct one. This correct node is determined using the ring state information disseminated throughout the cluster.
Deterministically placing data objects and disseminating ring status in the cluster enables each node to route incoming client requests directly to a node holding the data. Operation latency is further reduced by opening cluster state information to the clients so they can address their requests straight to the correct node, without any metadata server involved.

We base our strategy on these design principles, which allow us to guarantee the correctness of the proposal we describe in this paper. We choose not to modify the original static replication mechanism, offering the same data durability as the underlying system. We also do not change the server-side client routing mechanism, consequently guaranteeing that static replicas are always reachable. This allows us to focus on developing an efficient heuristic that maximizes the accuracy of popular object identification, optimizes the creation and placement of dynamic replicas of such popular objects, and helps clients efficiently locate the closest of these replicas.
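To make this placement rule concrete, here is a small Python sketch (ours, not code from the paper or from Dynamo/Voldemort) of a ring that maps a key to its primary node and the next r static replicas; the Node class, the MD5-based hash and the example cluster are illustrative assumptions.

import hashlib
from bisect import bisect_right

RING_SIZE = 2 ** 32

def ring_hash(value: str) -> int:
    """h(x): map a string to a position on the circular hash space."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % RING_SIZE

class Node:
    def __init__(self, name: str, site: str):
        self.name = name
        self.site = site                  # datacenter (site) the node belongs to
        self.position = ring_hash(name)   # position assigned on the ring

def static_replicas(ring, key, r=2):
    """Primary node = first node past h(key); static replicas = next r nodes clockwise."""
    nodes = sorted(ring, key=lambda n: n.position)
    positions = [n.position for n in nodes]
    start = bisect_right(positions, ring_hash(key)) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(r + 1)]

# Example: six nodes spread over three sites.
cluster = [Node(f"n{i}", site) for i, site in enumerate(["eu", "us", "asia"] * 2)]
print([(n.name, n.site) for n in static_replicas(cluster, "user:42")])

A client holding the same ring view can run this lookup locally, which is what makes single-hop reads possible without contacting a metadata server.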
5. Our proposal in brief: outline and challenges

In this paper, we demonstrate that it is possible to integrate dynamic replication with the existing architecture of these storage systems, which enables us to leverage their existing, built-in algorithms to efficiently handle reads and writes in geo-distributed environments.

Such dynamic replication seeks to place new copies of the hot data in sites as close as possible to the application clients that access it. To that end, we permit the clients to vote for dynamic object replicas to be created at a specific site. These votes are collected at each node and disseminated across the cluster so that the objects which received the most votes (or popular objects) are identified by the storage system. Dynamic replicas of such popular objects are created at sites where they are popular, and deleted when their popularity drops. When trying to access an object, clients tentatively determine the location of its closest replica (either static or dynamic) and address requests directly to the node holding it. Such an approach however raises a number of challenges.

We acknowledge that using client votes has been proposed before in the context of replicated relational databases, specifically to ensure data consistency [36]. Transposing this idea to decentralized storage systems poses a number of significant challenges that we address in this paper. Specifically, collecting client votes efficiently without using a centralized process requires us to propose a novel, fully-decentralized, loosely-coupled vote collection algorithm. While existing dynamic replication techniques leverage a centralized repository to direct client requests to the nearest available replica, we propose a technique allowing clients to tentatively locate the closest available replica without any prior request to such a repository.

5.1. Collecting and counting votes

The goal of dynamic replication is to improve storage operation latency for the clients. Therefore, we need to design an efficient way to let the clients cast their votes for objects.
Collecting votes also raises a major challenge. While determining the most voted-for objects at each node is straightforward and can be done efficiently, inferring from this the most popular objects cluster-wide is not an easy task. This problem is named distributed top-k monitoring. Sadly, most implementations in the literature [37, 38, 39, 40] are centralized.
We address both these issues by combining an approximate frequency estimation algorithm with the existing, lightweight Gossip protocol provided by the storage system (Section 6).

5.2. Tunable replication

Determining when to create a new dynamic replica or delete an existing one based on the previous information is also challenging in a distributed setup. Indeed, it is necessary to bound the number of replicas to be created at each site without any node being responsible for coordinating the replicated items. Consequently, nodes must synchronize with each other before creating a new replica of any object. One could consider using a consensus protocol such as Paxos [41]. However, we argue it would be overkill for such a simple task, as Paxos is by no means a light protocol [42].
It turns out that the technique we use to solve the vote collection and counting challenge also provides all the information we need to solve this issue (Section 7).

5.3. Dynamic replica location

We need to enable clients to locate dynamic replicas as they do for static replicas. Obviously, such replicas should also be placed at a predictable, deterministically chosen node. We achieve this using the DHT-based data distribution of the storage system (Section 7.1). To access the closest available replica, a client also needs to know whether or not a dynamic replica exists at a given site. Systematically probing nearby sites for available replicas would contradict the single-hop read feature of Dynamo. Also, this would significantly increase storage operation latency, consequently missing the point of dynamic replication, which is precisely to reduce this latency.
We demonstrate that this issue can be solved using probabilistic algorithms (Section 8). Our proposal builds on the solutions we adopt for the two afore-described challenges.

6. Identifying hot objects with client votes

In this section, we describe how to identify the most popular objects at each site. This is achieved in three steps:

1. We describe an efficient way to allow the client to vote for an object to be replicated dynamically at a specific location (Section 6.1).
2. We maintain a local count of these votes at each node to identify the most voted-for objects (Section 6.2).
3. We disseminate and merge these votes throughout the cluster to provide each node with a vision of the most popular objects for each site, cluster-wide (Section 6.3).
The identification method we describe in this section addresses the vote collection and counting challenge above.

6.1. Client vote casting

Clients vote for objects to be replicated dynamically at sites close to them. We name these sites preferred sites. This proximity can for instance express network latency, but metrics such as bandwidth cost or available computational power may also be considered. To this purpose, each client maintains a list of such preferred sites, ordered by preference.

We argue that existing read queries to the storage system provide an ideal base for vote casting, as clients intuitively vote only for objects they need to read. In contrast, objects being written to are not good candidates for dynamic replication because of the synchronization needed to keep dynamic replicas in sync with static replicas; we discuss write handling in Section 7.3. For every storage operation on an object, the client indicates in the request message its preferred sites for this object, i.e. the sites where the client would have preferred a dynamic replica of the object to exist. Let us assume a client wants to read the object associated with the key key. The client sends the request to the closest node n holding a replica of that object. Say this node belongs to a site s. We detail the location of this closest node in Section 8. The client piggybacks the request message with the list of the subset of sites having a higher preference than s in its list of preferred sites. Such a request is interpreted by n as a vote for this object to be replicated on these sites.
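As an illustration of this vote-casting rule, the short Python sketch below (our own rendering, with hypothetical names) builds the list a client would piggyback on a read request: the subset of its preferred sites ranked strictly above the site s of the contacted node.

def sites_to_piggyback(preferred_sites, contacted_site):
    """Return the preferred sites ranked strictly above the contacted node's site.

    preferred_sites is ordered from most to least preferred; each returned site
    is interpreted by the receiving node as one vote to replicate the object there.
    """
    if contacted_site in preferred_sites:
        cutoff = preferred_sites.index(contacted_site)
        return preferred_sites[:cutoff]
    return list(preferred_sites)  # no replica at a preferred site yet: vote for all of them

# Example: the client prefers eu-west, then us-east, then asia; it reads from us-east.
print(sites_to_piggyback(["eu-west", "us-east", "asia"], "us-east"))  # ['eu-west']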
6.2. Node-local vote collection and hot object identification

At each node, we want to know for each site the most voted-for objects. These are considered as candidates for dynamic replication. Each time a node receives a read request for an object identified by key, it records the vote for this object to be replicated on all sites indicated as preferred by the client. Let us first assume that we keep one counter per key and per site, which is incremented by 1 for each vote. We call the set of key counters for a single site a site counter, and the set of site counters for all sites a vote summary. In addition, if the object replica identified by key is a dynamic replica, we consider that the client implicitly votes for this replica to be maintained. As such, we also record the vote for key on the local site of the node receiving the request, i.e. the site the node belongs to. Algorithm 1 details these actions by a node receiving a read request from a client.
Algorithm 1 Node-local object vote counting
Input: key: key of an object to read; prefs: list of preferred sites provided by the client.

procedure CountClientVotes(key, prefs)
    ▷ Interpret reading a dynamic replica as an implicit vote
    let local be the local site of the current node
    let replica be the local replica of the object with key key
    if replica is a dynamic replica then
        add local to prefs
    end if
    ▷ Add client votes to the local vote summary
    for each preferred site site in prefs do
        let vs[site] be the site counter structure for site
        count one vote for key in vs[site]
    end for
end procedure
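The following Python sketch is a minimal executable rendering of Algorithm 1, using exact per-site dictionaries in place of the bounded Space-Saving estimators introduced below; the class and method names are ours, not those of the prototype.

from collections import defaultdict

class VoteSummary:
    """One voting round: a site counter (key -> votes) per site."""

    def __init__(self):
        self.site_counters = defaultdict(lambda: defaultdict(int))

    def count_client_votes(self, key, prefs, local_site, is_dynamic_replica):
        # Reading a dynamic replica is an implicit vote to keep it at the local site.
        sites = list(prefs) + ([local_site] if is_dynamic_replica else [])
        for site in sites:
            self.site_counters[site][key] += 1

    def top_k(self, site, k):
        counter = self.site_counters[site]
        return sorted(counter, key=counter.get, reverse=True)[:k]

# Example: a node at site "eu" receives two reads of the same object.
summary = VoteSummary()
summary.count_client_votes("profile:42", ["us"], "eu", is_dynamic_replica=True)
summary.count_client_votes("profile:42", ["us"], "eu", is_dynamic_replica=False)
print(summary.top_k("us", k=1))  # ['profile:42']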
However, the goal of this scheme is to adapt to fluctuating object popularity by dynamically replicating the objects having the highest popularity over a recent period of time. Consequently, we use successive voting rounds. We extract at the end of each round the most voted-for objects for each site, and create a new, empty vote summary for the subsequent round. The length of a round is a cluster setting: we discuss its value in Section 11.2. We synchronize these rounds across the cluster by using the local clock of each node.

Keeping an exact vote summary for any given round is memory-intensive. It has a memory complexity of O(M * S), M being the number of objects voted for in this round and S being the number of sites in the cluster. Such complexity is not tolerable as billions of objects may exist and be queried in the cluster. Luckily, we do not need to keep the vote count for all objects: we are only interested in knowing which are the most voted-for objects for each site. For each site, finding the k most frequent occurrences of a key in a stream of data (client votes) is a problem known as top-k counting. Multiple approximate, memory-efficient solutions to this problem exist in the literature. In the context of our system, such approximate approaches are tolerable as it is not critical to collect the exact vote count for each object, as long as the estimation of their vote count is precise enough and the set of objects identified as popular accurately captures the votes expressed by the clients. As such, we use as vote summaries a set of approximate top-k estimators, k being a configuration setting whose value is discussed in Section 11.1. We choose to use the Space-Saving algorithm [43] as top-k estimator. It guarantees strict error bounds for approximate counts of votes, and only uses limited, configurable memory space. Its memory complexity is O(k). For any given site, the output of Space-Saving is the approximate list of the k most voted-for keys, along with an estimation of the number of votes for each.

Any given node simultaneously maintains |S| active structures, one for each site in the cluster. Each time this node receives a request for an object v, for each preferred site indicated in the request, the key of v is added to the corresponding active structure. Consequently, at any time, a node is able to know which are the most frequent replication preferences indicated by clients for any site over the previous time window.

6.3. Cluster-wide vote summary dissemination

In this section we explain how to obtain the most voted-for objects across all nodes, starting from the local vote summaries built from user votes (1). We periodically share the local vote summaries of each node with its peers (2). Merging these peer vote summaries (3) gives each node a view of the most popular items across the cluster. Figure 1 illustrates this process.

We organize the process at any given node n in successive phases. During a voting round r of duration t, the local vote summary capturing client votes is named the active summary. When a round ends, the summary transitions to a merging state: the node sends this summary to its peers, i.e. every other node in the cluster. Rounds being synchronized across the cluster, the node also receives summaries from its peers for the same voting round, which are merged with the local summary; merging this local summary with another one received from a peer n' gives a summary of the votes received by both n and n'. When vote summaries from every peer have been received, the summary is complete, at which point all votes received by all nodes in the cluster for the round r are summarized. After a period 2*t since the round started, this cluster-wide summary transitions to a serving state, which we detail in Section 7. We illustrate these successive vote summary phases in Figure 2.

In presence of faults, a summary can reach the serving state without having received all peer vote summaries in time. This may occur in case of delayed or lost packets. We qualify such a summary as incomplete.

Such an approach is consistent with the class of systems we target: Dynamo provides an efficient algorithm for disseminating information across the cluster, Gossip. We use it to share a vote summary with every other node when it reaches the merging state. This approach is also compatible with our design choices: the Space-Saving structure we use is proven to be mergeable in [44], with a commutative merge operation.

Formally, we name MergeCounters(a, b) the function outlined in [44] that merges two Space-Saving structures a and b. MergeSummaries(v, v') is the function merging two vote summaries v and v'. These summaries contain site counters, respectively v_1, ..., v_S and v'_1, ..., v'_S, S being the total number of sites in the cluster. This function returns a merged summary v'' containing S site counters v''_1, ..., v''_S, such that:

    ∀a ∈ [1, S],  v''_a = MergeCounters(v_a, v'_a)    (1)

Considering that MergeCounters is commutative, it is trivial that MergeSummaries has the same property.

Let us assume a reliable network at this point, with all peer summaries being received before the local summary reaches the serving state. Because each node sends the same local vote summary to all its peers, and because the MergeSummaries function is commutative, the resulting complete summary after all peer summaries are merged is identical at each node. When all nodes reach the complete summary state, they share the same view of the most voted-for objects for each site. We use it to perform dynamic object replication in Section 7. We discuss the memory complexity of the popular object identification process in Section 11.4.
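To illustrate the counting and merging steps, the sketch below (a deliberate simplification of the Space-Saving algorithm [43] and of the merge procedure of [44], with our own names) keeps at most k counters per site and implements the site-by-site merge of Equation (1); the actual mergeable-summary procedure additionally tracks error bounds.

class SpaceSavingCounter:
    """Bounded approximate top-k counter (simplified Space-Saving sketch)."""

    def __init__(self, k):
        self.k = k
        self.counts = {}          # key -> estimated vote count (at most k entries)

    def add(self, key):
        if key in self.counts or len(self.counts) < self.k:
            self.counts[key] = self.counts.get(key, 0) + 1
        else:
            # Evict the current minimum and inherit its count as the error floor.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.counts[key] = floor + 1

    def merged_with(self, other):
        """Simplified commutative merge: sum estimates key-wise, keep the k largest."""
        merged = SpaceSavingCounter(self.k)
        union = dict(self.counts)
        for key, count in other.counts.items():
            union[key] = union.get(key, 0) + count
        for key in sorted(union, key=union.get, reverse=True)[: self.k]:
            merged.counts[key] = union[key]
        return merged

def merge_summaries(summary_a, summary_b):
    """Equation (1): merge two vote summaries site counter by site counter.
    Both summaries are assumed to carry one counter per site."""
    return {site: summary_a[site].merged_with(summary_b[site]) for site in summary_a}

# Example: two nodes counted votes for site "eu" during the same round.
a, b = SpaceSavingCounter(k=2), SpaceSavingCounter(k=2)
for key in ["x", "x", "y"]:
    a.add(key)
for key in ["x", "z"]:
    b.add(key)
print(merge_summaries({"eu": a}, {"eu": b})["eu"].counts)  # {'x': 3, 'y': 1}

Because the key-wise sum is order-independent, merging summaries received from peers in any order yields the same result, which is the property the cluster-wide dissemination relies on.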
[Figure 1 diagram: a client votes (1) at node n_1; n_1 shares (2) its local vote summary (site counters v_1 ... v_S) with peer nodes n_2 ... n_N and merges (3) the received summaries into a global vote summary.]
Figure 1: Cluster-wide popular object identification overview. S is the total number of sites in the cluster and N the total number of nodes.
[Figure 2 diagram: a vote summary progresses through the active (1), merging (2), complete and serving (3) states over successive periods of length t, driven by client votes and peer-node summaries.]
Figure 2: Timeline of vote summary states for a voting round of length t.
7. Lifecycle of a dynamic replica

We detail in this section how to create and delete dynamic replicas using the cluster-wide vote summaries, while handling writes to dynamically-replicated objects.

1. We first explain when to create a new dynamic replica of an object identified as popular on a site (Section 7.1).
2. We describe the process of removing those data replicas when their popularity drops (Section 7.2).
3. We finally explain the process of forwarding writes to these dynamic replicas while retaining the consistency and guarantees of the underlying storage system (Section 7.3).

7.1. When and where to create a dynamic replica?

With access to a shared vote summary, deciding when to create a replica is straightforward. Nodes in the cluster create remote dynamic replicas of the popular objects they are the primary node for. As soon as a vote summary reaches the complete state, thus summarizing the votes of all clients across the cluster, the top-k most popular objects are replicated to the sites at which they are popular. To replicate an object obj identified by a key key on a remote site s, a node n first informs all nodes holding static replicas of obj, which store this information in their local state. Upon acknowledgement from these static replica nodes, the primary node updates its local state as well, and copies obj to a node at site s. The node on which this replica is placed is selected deterministically. In the case of Dynamo, we can use the existing DHT to place objects at a site in such a deterministic fashion. Starting on the cluster ring from the position h(key), we walk the ring until we find a node at s, on which the replica is placed. Such a method is used today by Cassandra [30] for rack-aware data placement. Thus, assuming that a client knows a dynamic replica of an object exists at a site, it can easily infer which node holds this replica and address its request directly to it.

Fault-tolerance: No replicas are created based on incomplete vote summaries. In the presence of failures, this may result in dynamic replicas of yet-popular objects not being created at the initiative of their primary node. We handle this case with the replica read process we outline in Section 8.
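Reusing the Node, ring_hash and cluster definitions from the Section 4 sketch, the helper below (ours, not the prototype's code) shows the deterministic placement just described: walk the ring clockwise from h(key) and stop at the first node located at the target site.

from bisect import bisect_right

def dynamic_replica_node(ring, key, target_site):
    """First node at target_site encountered while walking the ring from h(key)."""
    nodes = sorted(ring, key=lambda n: n.position)
    positions = [n.position for n in nodes]
    start = bisect_right(positions, ring_hash(key)) % len(nodes)
    for offset in range(len(nodes)):
        candidate = nodes[(start + offset) % len(nodes)]
        if candidate.site == target_site:
            return candidate
    return None  # no node at that site

# Any client that knows a dynamic replica exists at "asia" can recompute its node locally.
print(dynamic_replica_node(cluster, "user:42", "asia").name)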
7.2. When to delete a dynamic replica?

Each node is responsible for the deletion of any dynamic replicas it holds. A dynamic replica at a site s can be deleted if it is not among the top-k items for this site s in the serving summary for the current time period. We also want to avoid replica bounces, i.e. object replicas being repeatedly created and deleted at the same site. This may happen for objects whose popularity ranking is around the top-k threshold and fluctuates above and under this threshold. We define a grace period g, which represents the minimum number of consecutive serving summaries a previously-popular object must be absent from before its dynamic replica is deleted.

The deletion process is simple. To delete a locally-held dynamic replica of an object obj, a node flags it as inactive: subsequent client requests for that replica are handled as if it did not exist, which we discuss in Section 8. The node then informs the static replica nodes of obj of the deletion of this replica, which they remove from their local state.

Fault-tolerance: No replicas are deleted based on incomplete vote summaries. Replicas of objects whose popularity dropped may not be deleted as they should be in the presence of failures. Yet, this principle ensures that no replicas of still-popular objects are ever deleted, and that they remain accessible.
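One possible way to implement the grace period g, sketched here with the VoteSummary class from the Section 6.2 example (names are ours): a locally-held dynamic replica is deleted only after the object has been absent from the site's top-k in g consecutive serving summaries.

def should_delete(key, site, recent_serving_summaries, k, g):
    """Delete a dynamic replica only after g consecutive serving summaries
    in which the object is no longer in the site's top-k."""
    if len(recent_serving_summaries) < g:
        return False  # not enough history yet
    last_g = recent_serving_summaries[-g:]
    return all(key not in summary.top_k(site, k) for summary in last_g)

# Example: the object was popular two rounds ago, but absent from the last two summaries.
history = [VoteSummary() for _ in range(3)]
history[0].count_client_votes("profile:42", ["eu"], "eu", is_dynamic_replica=False)
print(should_delete("profile:42", "eu", history, k=1, g=2))  # True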
7.3. Handling writes to dynamically replicated objects

Our dynamic replication scheme enables the storage system to handle writes to dynamically-replicated objects. The afore-detailed dynamic replica creation makes this process straightforward. Clients address write requests for any object to one of the static replica nodes of this object. Based on their local state, these nodes determine all existing dynamic replicas of that object and propagate the write to all other replicas, static and dynamic, using the write protocol of the storage system.

Correctness: because our replication system does not modify the write propagation algorithm of the underlying system, writes are propagated to dynamic replicas the same way they are propagated to static replicas. In Voldemort, writes are eventually-consistent by default, and can optionally be made strongly consistent. Our proposal shows the same consistency characteristics. As such, we only need to ensure that the creation and deletion of dynamic replicas does not violate these consistency characteristics. We ensure this by persisting the new dynamic replica list for each object on its origin nodes before creating and after deleting any dynamic replica. This guarantees that writes to dynamically replicated objects are always forwarded to all of their dynamic replicas.

We prove that this write protocol is correct, i.e. it does not cause dynamic replicas to be out of sync with static replicas, even in the case of system failures. A dynamic replica is created only after successful acknowledgement from all other static replica nodes. These nodes are only informed of the replica deletion after it is flagged as inactive. This ensures that writes to an object are always propagated to the dynamic replicas of that object, even in the presence of faults.
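The write path just described can be modeled as follows (a sketch with hypothetical names, not Voldemort's API): a static replica node persists the list of dynamic replicas for each object and forwards every write to the other static replicas and to all recorded dynamic replicas.

class StaticReplicaNode:
    """Minimal model of the write path: static peers plus the locally persisted
    list of dynamic replicas for each object."""

    def __init__(self, static_peers):
        self.static_peers = static_peers   # other static replica nodes
        self.dynamic_replicas = {}         # key -> set of nodes holding dynamic replicas
        self.store = {}

    def register_dynamic_replica(self, key, node):
        # Persisted before the dynamic replica is created (and removed only after
        # deletion), so no write can miss an in-flight replica.
        self.dynamic_replicas.setdefault(key, set()).add(node)

    def put(self, key, value):
        self.store[key] = value
        targets = list(self.static_peers) + sorted(self.dynamic_replicas.get(key, ()))
        for node in targets:
            forward_write(node, key, value)   # underlying system's write protocol

def forward_write(node, key, value):
    print(f"forwarding write of {key!r} to {node}")

n = StaticReplicaNode(static_peers=["n2", "n3"])
n.register_dynamic_replica("profile:42", "n7@asia")
n.put("profile:42", {"name": "Ada"})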
8. Accessing the closest replica

In this section, we explain how to let clients access a close replica of the objects they want to read, either static or dynamic, without any communication with a dedicated metadata node. In a nutshell, we let the client request information about the dynamic replicas created on its preferred sites at any given time, and later use this data to infer the location of the closest replica of any object and access it directly.

8.1. Locating the closest replica: dynamic replica summaries

We assume a client can know at any time the list of all active dynamic replicas in its preferred sites. This assumption greatly eases locating the closest replica: it is the static or dynamic one located on the preferred site with the highest preference, or the closest static replica if no replica exists at any preferred site. We name this list of dynamic replicas for a site s the dynamic replica summary of s, and detail the closest replica location in Algorithm 2. The client addresses its request to the node holding this replica on the site indicated by this algorithm, using its knowledge of the cluster DHT. The node receiving this request returns the dynamic replica, if available. If it is not available locally, it forwards the request to the closest static replica node, which is guaranteed to hold the object.

We approach these assumptions by enabling any client to request from any node such a dynamic replica summary of its preferred sites at the time of the request. Any node is able to answer this request based on the cluster-wide vote summary dissemination process. Using its knowledge of the client votes across the cluster given by the current complete, serving summary, the node knows the list of all dynamic replicas currently active at any site. This list is sent to the client for its preferred sites.
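Algorithm 2 itself is not reproduced in this excerpt, but its effect on the client side can be sketched as follows (our rendering, under the stated assumption that the client holds fresh dynamic replica summaries for its preferred sites):

def closest_replica_site(key, preferred_sites, replica_summaries, static_sites):
    """Return the site the client should contact first for `key`.

    replica_summaries maps a preferred site to the set of keys with an active
    dynamic replica there; static_sites lists the sites holding static replicas
    of the key, closest first.
    """
    for site in preferred_sites:
        has_dynamic = key in replica_summaries.get(site, set())
        if has_dynamic or site in static_sites:
            return site                # a replica exists at a preferred site
    return static_sites[0]             # otherwise: the closest static replica

# Example: a dynamic replica of profile:42 exists at the client's second-choice site.
summaries = {"eu-west": set(), "us-east": {"profile:42"}}
print(closest_replica_site("profile:42", ["eu-west", "us-east"], summaries, ["asia"]))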