Tailwind: Fast and Atomic RDMA-based Replication

Yacine Taleb*, Ryan Stutsman†, Gabriel Antoniu*, Toni Cortes‡
*Univ Rennes, Inria, CNRS, IRISA  †University of Utah  ‡BSC and UPC

To cite this version: Yacine Taleb, Ryan Stutsman, Gabriel Antoniu, Toni Cortes. Tailwind: Fast and Atomic RDMA-based Replication. ATC '18 - USENIX Annual Technical Conference, Jul 2018, Boston, United States. pp. 850-863. hal-01676502v2

HAL Id: hal-01676502
https://hal.inria.fr/hal-01676502v2
Submitted on 26 May 2018

Abstract

Replication is essential for fault-tolerance. However, in in-memory systems, it is a source of high overhead. Remote direct memory access (RDMA) is attractive to create redundant copies of data, since it is low-latency and has no CPU overhead at the target. However, existing approaches still result in redundant data copying and active receivers. To ensure atomic data transfers, receivers check and apply only fully received messages. Tailwind is a zero-copy recovery-log replication protocol for scale-out in-memory databases. Tailwind is the first replication protocol that eliminates all CPU-driven data copying and fully bypasses target server CPUs, thus leaving backups idle. Tailwind ensures all writes are atomic by leveraging a protocol that detects incomplete RDMA transfers. Tailwind substantially improves replication throughput and response latency compared with conventional RPC-based replication. In symmetric systems where servers both serve requests and act as replicas, Tailwind also improves normal-case throughput by freeing server CPU resources for request processing. We implemented and evaluated Tailwind on RAMCloud, a low-latency in-memory storage system. Experiments show Tailwind improves RAMCloud's normal-case request processing throughput by 1.7×. It also cuts median and 99th-percentile write latencies by 2× and 3×, respectively.

1 Introduction

In-memory key-value stores are an essential building block for large-scale data-intensive applications [3, 19]. Recent research has led to in-memory key-value stores that can perform millions of operations per second per machine with remote access times of a few microseconds. Harvesting CPU power and eliminating conventional network overheads has been key to these gains. However, like many other systems, they must replicate data in order to survive failures.

As core frequency scaling and multi-core architecture scaling are both slowing down, it becomes critical to reduce replication overheads to keep up with shifting application workloads in key-value stores [13]. We show that replication can consume up to 80% of the CPU cycles for write-intensive workloads (§4.4) in strongly-consistent in-memory key-value stores. Techniques like remote direct memory access (RDMA) are promising to improve the overall CPU efficiency of replication and keep tail latencies predictable.

Existing RDMA-based approaches use message-passing interfaces: a sender remotely places a message into a receiver's DRAM; a receiver must actively poll and handle new RDMA messages. This approach guarantees the atomicity of RDMA transfers, since only fully received messages are applied by the receiver [4, 10, 30]. However, this approach defeats RDMA efficiency goals since it forces receivers to use their CPU to handle incoming RDMA messages and it incurs additional memory copies.

The main challenge of efficiently using RDMA for replication is that failures could result in partially applied writes. The reason is that receivers are not aware of data being written to their DRAM. Leaving receivers idle is challenging because there is no protocol to guarantee data consistency in the event of failures.

A second key limitation with RDMA is its low scalability. This limitation comes from the connection-oriented nature of RDMA transfers. Senders and receivers have to set up queue pairs (QP) to perform RDMA. Lots of recent work has observed the high cost of NIC connection cache misses [4, 11, 32]. Scalability is limited because the number of connections typically depends on the cluster size.

To address the above challenges, we developed Tailwind, a zero-copy primary-backup log replication protocol that completely bypasses CPUs on all target backup servers. In Tailwind, log records are transferred directly from the source server's DRAM to I/O buffers at target servers via RDMA writes. Backup servers are completely passive during replication, saving their CPUs for other purposes; they flush these buffers to solid-state drives (SSD) periodically when the source triggers it via remote procedure call (RPC) or when power is interrupted. Even though backups are idle during replication, Tailwind is strongly consistent: it has a protocol that allows backups to detect incomplete RDMA transfers.

Tailwind uses RDMA write operations for all data movement, but all control operations, such as buffer allocation and setup, server failure notifications, and buffer flushing and freeing, are handled through conventional RPCs. This simplifies such complex operations without slowing down data movement. In our implementation, RPCs only account for 10^-5 of the replication requests. This also makes Tailwind easier to use in systems that use log replication over distributed blocks even if they were not designed to exploit RDMA.
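As a rough check on that order of magnitude, the sketch below estimates the share of control RPCs per buffer. It uses the default 8 MB buffer size and the 100 B object size quoted in §3.3.1, and assumes one RDMA write per object and exactly one open and one close RPC per buffer. The function name `rpc_fraction` is hypothetical and not part of Tailwind, and per-entry header and checksum bytes are ignored.

```python
# Back-of-the-envelope estimate, not Tailwind code.
# Assumptions: one RDMA write per object, one open RPC and one close RPC
# per buffer, and no space spent on headers or checksums.

def rpc_fraction(buffer_bytes: int, object_bytes: int, rpcs_per_buffer: int = 2) -> float:
    """Return the ratio of control RPCs to RDMA writes for one buffer."""
    writes_per_buffer = buffer_bytes // object_bytes
    return rpcs_per_buffer / writes_per_buffer


# 8 MB buffers and 100 B objects, the figures used in Section 3.3.1.
fraction = rpc_fraction(8 * 1024 * 1024, 100)
print(f"{fraction:.2e}")
```

Under these assumptions a buffer holds about 84,000 objects, so the two RPCs per buffer are a vanishing share of the traffic; larger objects or smaller buffers would raise the ratio proportionally.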
Since Tailwind needs only to maintain connections between a primary server and its backups, the number of connections scales with the size of a replica group, not with the cluster size, making Tailwind a scalable approach. We implemented and evaluated Tailwind on RAMCloud, a scale-out in-memory key-value store that exploits fast kernel-bypass networking. Tailwind is suited to RAMCloud's focus on strong consistency and low latency. Tailwind significantly improves RAMCloud's throughput since each PUT operation in the cluster results in three remote replication operations that would otherwise consume server CPU resources.

Tailwind improves RAMCloud's throughput by 1.7× on the YCSB benchmark, and it reduces durable PUT median latency from 32 µs to 16 µs and 99th percentile latency from 78 µs to 28 µs. These results stem from the fact that Tailwind significantly reduces the CPU cycles used by the replication operations: Tailwind only needs 1/3 of the cores RAMCloud uses to achieve the same throughput.

This paper makes four key contributions:

• it analyzes and quantifies CPU-related limitations in modern in-memory key-value stores;
• it presents Tailwind's design, it describes its implementation in the RAMCloud distributed in-memory key-value store, and it evaluates its impact on RAMCloud's normal-case and recovery performance;
• to our knowledge, Tailwind is the first log replication protocol that eliminates all superfluous data copying between the primary replica and its backups, and it is the first log replication protocol that leaves server CPUs idle while serving as replication targets; this allows servers to focus more resources on normal-case request processing;
• Tailwind separates the replication data path and control path and optimizes them individually; it uses RDMA for heavy transfer, but it retains the simplicity of RPC for rare operations that must deal with complex semantics like failure handling and resource exhaustion.

2 Motivation and Background

Replication and redundancy are fundamental to fault tolerance, but at the same time they are costly. Primary-backup replication (PBR) is popular in fault-tolerant storage systems like file systems and key-value stores, since it tolerates f stop-failures with f+1 replicas. Note that we refer to a primary replica server as primary, and to a secondary replica server as secondary or backup. In some systems, backup servers don't process user-facing requests, but in many systems each node acts as both a primary for some data items and as a backup for other data items. In some systems this is implicit: for example, a key-value store may store its state on HDFS [28], and a single physical machine might run both a key-value store frontend and an HDFS chunkserver.

Figure 1: Flow of primary-backup replication. (Diagram labels: clients; machines M1, M2, M3, each with a NIC, a dispatch core, a worker core, primary DRAM storage, and a non-volatile buffer; requests GET(C), PUT(B), Replicate(B).)

Replication is expensive for three reasons. First, it is inherently redundant and, hence, brings overhead: the act of replication itself requires moving data over the network. Second, replication in strongly consistent systems is usually synchronous, so a primary must stall, holding resources, while waiting for acknowledgements from backups (often spinning a CPU core in low-latency stores). Third, in systems where servers (either explicitly or implicitly) serve both client-facing requests and replication operations, those operations contend.

Figure 1 shows this in more detail. Low-latency, high-throughput stores use kernel-bypass to directly poll NIC control rings (with a dispatch core) to avoid kernel code paths and the latency and throughput costs of interrupts. Even so, a CPU on a primary node processing an update operation must receive the request, hand the request off to a core (worker core) to be processed, send remote messages, and then wait for multiple nodes acting as backups to process these requests. Batching can reduce the number of backup request messages each server must receive, but at the cost of increased latency. Inherently, though, replication can double, triple, or quadruple the number of messages and the amount of data generated by client-issued write requests. It also causes expensive stalls at the primary while it waits for responses. In these systems, responses take a few microseconds, which is too short a time for the primary to context switch to another thread, yet it is long enough that the worker core spends a large fraction of its time waiting.

2.1 The Promise of RDMA

Recently, remote direct memory access (RDMA) has been used in several systems to avoid kernel overhead and to reduce CPU load. Though the above kernel-bypass-based request processing is sometimes called (two-sided) RDMA, it still incurs request dispatching and processing overhead because a CPU on the destination node must poll for the message and process it. RDMA-capable NICs also support so-called one-sided RDMA operations that directly access the remote host's memory, bypassing its CPU altogether. Remote NICs directly service RDMA operations without interrupting the CPU (neither via explicit interrupt nor by enqueuing an operation that the remote CPU must process). One-sided operations are only possible through reliable-connected queue pairs (QP) that ensure in-order and reliable message delivery, similar to the guarantees TCP provides.

Figure 2: Dispatch and worker cores utilization percentage of a single RAMCloud server. Requests consist of 95/5 GET/PUT ratio. (Plot: utilization in percent, 0 to 100, against number of clients, 0 to 30; series: Dispatch, Worker.)

2.1.1 Opportunities

One-sided RDMA operations are attractive for replication; replication inherently requires expensive, redundant data movement. Backups are (mostly) passive; they often act as dumb storage, so they may not need CPU involvement. Figure 2 shows that RAMCloud, an in-memory low-latency kernel-bypass-based key-value store, is often bottlenecked on CPU (see §4 for experimental settings). For read-heavy workloads, the cost of polling the network and dispatching requests to idle worker cores dominates. Only 8 clients are enough to saturate a single server's dispatch core. Because of that, worker cores cannot be fully utilized. One-sided operations for replicating PUT operations would reduce the number of requests each server handles in RAMCloud, which would indirectly but significantly improve read throughput. For workloads with a significant fraction of writes or where a large amount of data is transferred, write throughput can be improved, since remote CPUs needn't copy data between NIC receive buffers and I/O or non-volatile storage buffers.

2.1.2 Challenges

The key challenge in using one-sided RDMA operations is that they have simple semantics which offer little control on the remote side. This is by design; the remote NIC executes RDMA operations directly, so they lack the generality that a conventional CPU-based RPC handler would have. A host can issue a remote read of a single, sequential region of the remote process's virtual address space (the region to read must be registered first, but a process could register its whole virtual address space). Or, a host can issue a remote write of a single, sequential region of the remote process's virtual address space (again, the region must be registered with the NIC). NICs support a few more complex operations (compare-and-swap, atomic add), but these operations are currently much slower than issuing an equivalent two-sided operation that is serviced by the remote CPU [11, 30]. These simple, restricted semantics make RDMA operations efficient, but they also make them hard to use safely and correctly. Some existing systems use one-sided RDMA operations for replication (and some even use them for normal-case operations [4, 5]).

However, no existing primary-backup replication scheme reaps the full benefits of one-sided operations. In existing approaches, source nodes send replication operations using RDMA writes to push data into ring buffers. CPUs at backups poll for these operations and apply them to replicas. In practice, this is effectively emulating two-sided operations [4]. RDMA reads don't work well for replication, because they would require backup CPUs to schedule operations and "pull" data, and primaries wouldn't immediately know when data was safely replicated.

Two key, interrelated issues make it hard to use RDMA writes for replication that fully avoids the remote CPUs at backups. First, a primary can crash when replicating data to a backup. Because RDMA writes (inherently) don't buffer all of the data to be written to remote memory, it is possible that an RDMA write could be partially applied when the primary crashes. If a primary crashes while updating state on the backup, the backup's replica wouldn't be in the "before" or "after" state, which could result in a corrupted replica. Worse, since the primary was likely mutating all replicas concurrently, it is possible for all replicas to be corrupted. Interestingly, backup crashes during RDMA writes don't create new challenges for replication, since protocols must deal with that case with conventional two-sided operations too. Well-known techniques like log-structured backups [18, 23, 25] or shadow paging [35] can be used to prevent update-in-place and loss of atomicity. Traditional log implementations enforce a total ordering of log entries [9]. In database systems, for instance, the order is used to recreate a consistent state during recovery.

Unfortunately, a second key issue with RDMA operations makes this hard: each operation can only affect a single, contiguous region of remote memory. To be efficient, one-sided writes must replicate data in its final, stable form; otherwise the backup CPU must be involved, which defeats the purpose. For stable storage, this generally requires some metadata. For example, when a backup uses data found in memory or storage it must know which portions of memory contain valid objects, and it must be able to verify that the objects and the markers that delineate them haven't been corrupted. As a result, backups need some metadata about the objects that they host in addition to the data items themselves. However, RDMA writes make this hard. Metadata must inherently be intermixed with data objects, since RDMA writes are contiguous. Otherwise, multiple round trips would be needed, again defeating the efficiency gains.

Tailwind solves these challenges through a form of low-overhead redundancy in log metadata. Primaries incrementally log data items and metadata updates to remote memory on backups via RDMA writes. Backups remain unaware of the contents of the buffers and blindly flush them to storage. In the rare event that a primary fails, all backups work in parallel, scanning log data to reconstruct metadata so that data integrity can be checked. The next section describes Tailwind's design in detail.

3 Design

Tailwind is a strongly-consistent RDMA-based replication protocol. It was designed to meet four requirements:

Zero-copy, Zero-CPU on Backups for Data Path. In order to relieve backup CPUs from processing replication requests, Tailwind relies on one-sided RDMA writes for all data movement. In addition, it is zero-copy at primary and secondary replicas; the sender uses kernel-bypass and scatter/gather DMA for data transfer; on the backup side, data is placed directly in its final storage location via DMA transfer without CPU involvement.

Strong Consistency. For every object write, Tailwind synchronously waits for its replication on all backups before notifying the client. Although RDMA writes are one-sided, reliable-connected QPs generate a work completion to notify the sender once a message has been correctly sent and acknowledged by the receiver NIC (i.e. written to remote memory) [8]. One-sided operations raise many issues; Tailwind is designed to cover all corner cases that may challenge correctness (§3.4).

Overhead-free Fault-Tolerance. Backups are unaware of replication as it happens, which can be unsafe in case of failures. To address this, Tailwind appends a piece of metadata in the log after every object update. Backups use this metadata to check integrity and locate valid objects during recovery. Although a few backups have to do a little extra work during crash recovery, that work has no impact on recovery performance (§4.6).

Preserves Client-facing RPC Interface. Tailwind has no requirement on the client side; all logic is implemented between primaries and backups. Clients observe the same consistency guarantees. However, for write operations, Tailwind substantially improves end-to-end latency and throughput from the client perspective (§4.2).

3.1 The Metadata Challenge

Metadata is crucial for backups to be able to use replicated data. For instance, a backup needs to know which portions of the log contain valid data. In RPC-based systems, metadata is usually piggybacked within a replication request [11, 21]. However, it is challenging to update both data and metadata with a single RDMA write since it can only affect a contiguous memory region. In this case, updating both data and metadata would require sending two messages, which would nullify one-sided RDMA benefits. Moreover, this is risky: in case of failures a primary may start updating the metadata and fail before finishing, thereby invalidating all replicated objects.

Figure 3: The three replication steps in Tailwind. During the first (open) and third (close) steps, the communication is done through RPCs, while the second step involves one-sided writes only. (Diagram: Step 1, put(A), Replicate(A) and request to open an RDMA buffer, ACK; Step 2, put(B), Replicate(B) as a one-sided RDMA write, DMA into the buffer, ACK; Step 3, put(C), Replicate(C), close and free the RDMA buffer, ACK.)

For log-structured data, backups need two pieces of information: (1) The offset through which data in the buffer is valid. This is needed to guarantee the atomicity of each update. An outdated offset may lead the backup to use old and inconsistent data during crash recovery. (2) A checksum used to check the integrity of the length fields of each log record during recovery. Checksums are critical for ensuring log entry headers are not corrupted while in buffers or on storage. These checksums ensure iterating over the buffer is safe; that is, a corrupted length field does not "point" into the middle of another object or out of the buffer, or indicate an early end to the buffer.

The protocol assumes that each object has a header next to it [4, 12, 26]. Implementation-agnostic information in headers should include: (1) the size of the object next to it, to allow log traversal; (2) an integrity check that ensures the integrity of the contents of the log entry.

Tailwind checksums are 32-bit CRCs computed over log entry headers. The last checksum in the buffer covers all previous headers in the buffer. For maximum protection, checksums are end-to-end: they should cover the data while it is in transit over the network and while it occupies storage.

To be able to perform atomic updates with one-sided RDMAs in backups, the last checksum and the current offset in the buffer must be present and consistent in the backup after every update. A simple solution is to append the checksum and the offset before or after every object update. A single RDMA write would then suffice for both data and metadata. The checksum must necessarily be sent to the backup. Interestingly, this is not the case for the offset.
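The idea of carrying a running header checksum next to every object, so that one contiguous write covers both data and metadata, can be sketched as follows. This is a minimal illustration and not RAMCloud's or Tailwind's actual entry format: the header layout (`HEADER`), the type codes, and the `PrimaryLog` class are assumptions made for the example, and the RDMA write itself is not shown. The rule that a zero checksum is stored as 1 is the one stated in §3.4.1.

```python
import struct
import zlib

# Hypothetical on-log layout, chosen only for this sketch.
HEADER = struct.Struct("<BI")    # entry type (u8), size of what follows (u32)
CHECKSUM = struct.Struct("<I")   # 32-bit CRC value
TYPE_OBJECT = 1                  # assumption: 0 stays unused, so zeroed memory is invalid
TYPE_CHECKSUM = 2


class PrimaryLog:
    """Builds the bytes a primary would push to a backup in one write."""

    def __init__(self) -> None:
        self.running_crc = 0  # CRC32 chained over every object header so far

    def append(self, payload: bytes) -> bytes:
        header = HEADER.pack(TYPE_OBJECT, len(payload))
        # Chain the CRC so the newest checksum covers all previous headers.
        self.running_crc = zlib.crc32(header, self.running_crc)
        # A stored checksum is never zero; zero is replaced by 1 (Section 3.4.1).
        stored = self.running_crc or 1
        trailer = HEADER.pack(TYPE_CHECKSUM, CHECKSUM.size) + CHECKSUM.pack(stored)
        # Object and checksum are adjacent, so one contiguous write carries both.
        return header + payload + trailer


log = PrimaryLog()
buffer = bytearray()
for obj in (b"A", b"BB"):
    buffer += log.append(obj)
```

Only the checksum travels with the data here; the offset is deliberately absent, matching the observation that the backup does not need to be sent it.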
The nature of log-structured data and the properties of one-sided RDMA make it possible, with careful design, for the backup to compute the offset at recovery time without hurting consistency. This is possible because RDMA writes are performed (at the receiver side) in increasing address order [8]. In addition, reliable-connected QPs ensure that updates are applied in the order they were sent.

Based on these observations, Tailwind appends a checksum in the log after every object update; at any point in time a checksum guarantees, with high probability, the integrity of all headers preceding it in the buffer. During failure-free operation, a backup is guaranteed to always have the latest checksum at the end of the log. On the other hand, backups have to compute the offset themselves during crash recovery.

3.2 Non-volatile Buffers

In Tailwind, at startup, each backup machine allocates a pool of in-memory I/O buffers (8 MB each, by default) and registers them with the NIC. To guarantee safety, each backup limits the number of outstanding unflushed buffers it allows. This limit is a function of its local, durable storage speed. A backup denies opening a new replication buffer for a primary if it would exceed the amount of data it could flush safely to storage on backup power. Buffers are pre-zeroed. Servers require power supplies that allow buffers to be flushed to stable storage in the case of a power failure [4, 5, 20]. This avoids the cost of a synchronous disk write on the fast path of PUT operations.

Initiatives such as the OpenCompute Project propose standards where racks are backed by battery backups that could provide a minimum of 45 seconds of power supply [1] at full load, including network switches. Battery-backed DIMMs could have been another option, but they require more careful attention. Because we use RDMA, batteries need to back the CPU, the PCIe controller, and the memory itself. Moreover, there exists no clear way to flush data that could still be residing in the NIC cache or in the PCIe controller, which would lead to firmware modifications and to a non-portable solution.

3.3 Replication Protocol

3.3.1 Write Path

To be able to perform replication through RDMA, a primary has to reserve an RDMA-registered memory buffer from a secondary replica. The first step in Figure 3 depicts this operation: a primary sends an open RPC to a backup (1). Tailwind does not enforce any replica placement policy; instead it leaves backup selection up to the storage system. Once the open is processed (2)+(3), the backup sends an acknowledgement to the primary and piggybacks the information necessary to perform RDMA writes (4) (i.e. remote_key and remote_address [8]). The open call fails if there are no available buffers. The primary then has to retry.

At the second step in Figure 3, the primary is able to perform all subsequent replication requests with RDMA writes (1). The backup NIC directly puts objects into memory buffers via DMA transfer (2) without involving the CPU. The primary gets a work completion notification from its corresponding QP (3).

The primary keeps track of the last written offset in the backup buffer. When the next object would exceed the buffer capacity, the primary proceeds to the third step in Figure 3. The last replication request is called close and is performed through an RPC (1). The close notifies the backup (2)+(3) that the buffer is full and thus can be flushed to durable storage. This eventually allows Tailwind to reclaim buffers and make them available to other primaries. Buffers are zeroed again when flushed.

We use RPCs for open and close operations because it simplifies the design of the protocol without hurting latency. As an example of a complication, a primary may choose a secondary that has no buffers left. This can be challenging to handle with RDMA. Moreover, these operations are not frequent. If we consider 8 MB buffers and objects of 100 B, which corresponds to real workloads' object size [19], open and close RPCs would account for 2.38 × 10^-5 of the replication requests. Larger buffers imply fewer RPCs but longer times to flush backup data to secondary storage.

Thanks to all these steps, Tailwind can effectively replicate data using one-sided RDMA. However, without taking care of failure scenarios the protocol would not be correct. Next, we define essential notions Tailwind relies on for its correctness.

3.3.2 Primary Memory Storage

The primary DRAM log-based storage is logically sliced into equal segments (Figure 4). For every open and close RPC the primary sends metadata about its current state: logID, latest checksum, segmentID, and current offset in the last segment. In case of failures, this information helps the backup find the backup data it stores for the crashed server.

Figure 4: Primary DRAM storage consists of a monotonically growing log. It is logically split into fixed-size segments. (Diagram labels: primary DRAM storage holding objects A to G; open and close requests to Backup 1, Backup 2, Backup 3; metadata fields logID, checksum, segmentID, offset.)

At any given time, a primary can only replicate a single segment to its corresponding backups. This means a backup has to do very little work during recovery; if a primary replicates to r replicas then only r segments would be open in case the primary fails.

3.4 Failure Scenarios

When a primary or secondary replica fails, the protocol must recover from the failure and rebuild lost data. Primary and secondary replica failure scenarios require different actions to recover.

3.4.1 Primary-replica Failure

Primary replica crashes are one of the major concerns in the design. In case of such failures, Tailwind has to: (1) locate backup data (of the crashed primary) on secondary replicas; (2) rebuild up-to-date metadata information on secondary replicas; (3) ensure backup data integrity and consistency; (4) start recovery.

Locating Secondary Replicas. After detecting a primary replica crash, Tailwind sends a query to all secondary replicas to identify the ones storing data belonging to the crashed primary. Since every primary has a unique logID, it is easy for backups to identify which buffers belong to that logID.

Building Up-to-date Metadata. Backup buffers can be either in the open or the closed state. Buffers that are closed do not pose any issue; they already have up-to-date metadata. If they are on disk or SSD they will be loaded into memory to get ready for recovery. However, for open buffers, the backup has to compute the offset and find the last checksum. Secondary replicas have to scan their open buffers to update their respective checksum and offset. To do so, they iterate over all entries as depicted in Algorithm 1.

Algorithm 1: Updating RDMA buffer metadata
input : Pointer to a memory buffer rdmaBuf
output: Size of durably replicated data offset

    currPosition = rdmaBuf;
    offset = currPosition;
    while currPosition < MAX_BUFFER_SIZE do
        /* Create a header in the current position */
        header = (Header)currPosition;
        currPosition += sizeof(header);
        /* Not corrupted or incomplete header */
        if header→type != INVALID then
            if header→type == checksumType then
                checksum = (Checksum)currPosition;
                if checksum != currChecksum then
                    /* Primaries never append a zero checksum, check if it is 1. */
                    if currChecksum == 0 and checksum == 1 then
                        offset = currPosition + sizeOf(checksum);
                    else
                        return offset;
                else
                    /* Move the offset at the end of current checksum */
                    offset = currPosition + sizeOf(checksum);
            else
                currChecksum = crc32(currChecksum, header);
        else
            return offset;
        /* Move forward to next entry */
        currPosition += header→objectSize;
    /* We should only reach this line if a primary crashed before sending close */
    return offset;

Basically, the algorithm takes an open buffer and tries to iterate over its entries. It moves forward using the size of the entry, which should be in the header. For every entry the backup computes a checksum over the header. When the scan reaches a checksum in the buffer, it is compared with the most recently computed checksum: the algorithm stops in case of a checksum mismatch. There are three stop conditions: (1) integrity check failure; (2) invalid object found; (3) end of buffer reached.

A combination of three factors guarantees the correctness of the algorithm: (1) The last entry is always a checksum; Tailwind implicitly uses this condition as an end-of-transmission marker. (2) Checksums are not allowed to be zero; the primary replica always verifies the checksum before appending it. If it is zero, the primary sets it to 1 and then appends it to the log. Otherwise, an incomplete object could be interpreted as a valid zero checksum. (3) Buffers are pre-zeroed; combined with condition (2), a backup has a means to correctly detect the last valid offset in a buffer by using Algorithm 1.

3.4.2 Corrupted and Incomplete Objects

Figure 5: From top to bottom are three scenarios that can happen when a primary replica crashes while writing an object (B in this case) then synchronizing with backups. In the first scenario the primary replica fully writes the message to the backup, leaving the backup in a correct state. B can be recovered in this case. In the second scenario, the object B is written but the checksum is partially written. Therefore, B is discarded. Similarly for the third scenario, where B is partially written.

Figure 5 shows the three states of a backup RDMA buffer in case of a primary replica failure. The first scenario shows a successful transmission of an object B and the checksum that follows it. If the primary crashes, the backup is able to safely recover all data (i.e. A and B).

In the second scenario, the backup received B, but the checksum was not completely received. In this case the integrity check will fail. Object A will be recovered and B will be ignored. This is safe, since the client's PUT of B could not have been acknowledged.

The third scenario is similar: B was not completely transmitted to the backup. However, this creates two possibilities. If B's header was fully transmitted, then the algorithm will look past the entry and find a 0-byte at the end of the log entry. This way it can tell that the RDMA operation didn't complete in full, so it will discard the entry and treat the prefix of the buffer up through A as valid. If the checksum is partially written, it will still be discarded, since it will necessarily end in a 0-byte: something that is disallowed for all valid checksums that the primary creates. If B's header was only partially written, some of the bytes of the length field may be left zero. Let o be the offset of the start of B, let l be the length the primary intended B to have, and let l′ be the length actually written into the backup buffer. It is necessarily the case that l′ < l, since lengths are unsigned and l′ is a subset of the bits in l. As a result, o + l′ falls within the range where the original object data should have been replicated in the buffer. Furthermore, the data there consists entirely of zeroes, since an unsuccessful RDMA write halts replication to the buffer, and replication already halted before o + sizeof(Header). As a result, this case is handled identically to the incomplete checksum case, leaving the (unacknowledged) B off of the valid prefix of the buffer.

A key property maintained by Tailwind is that detecting torn writes never depends on checksum checks for correctness. Torn writes can also be detected through the careful construction of the log entry headers and checksums and the ordering guarantees that RDMA NICs provide.

Bit-flips. The checksums, covering both the log entry headers and the individual objects themselves, ensure that recovery is robust against bit-flips. The checksums ensure with high probability that a bit-flip anywhere in any replica will be detected. In closed segments, whenever data corruption is detected, Tailwind declares the replica corrupted. The higher-level system will still successfully recover a failed primary, but it must rely on replicas from other backups to do so. In open segments, data corruption is treated like a partially transmitted buffer: as soon as corruption is detected, Tailwind stops iterating over the buffer and returns the last valid offset.

3.4.3 Secondary-replica Failure

When a server crashes, the replicas it contained become unavailable. Tailwind must re-create new replicas on other backups in order to keep the same level of durability. Luckily, secondary-replica crashes are dealt with naturally in storage systems and do not suffer from one-sided RDMA complications. Tailwind takes several steps to allocate a new replica: (1) it suspends all operations on the corresponding primary replica; (2) it atomically creates a new secondary replica; (3) it resumes normal operations on the primary replica. Step (1) ensures that data will always have the same level of durability. Step (2) is crucial to avoid inconsistencies if a primary crashes while re-creating a secondary replica. In this case the newly created secondary replica would not have all data and cannot be used.

However, it can happen that a secondary replica stops and restarts after some time, which could lead to inconsistent states between secondary replicas. To cope with this, Tailwind keeps, at any point in time, a version number for replicas. If a secondary replica crashes, Tailwind updates the version number on the primary and secondaries. Since secondaries need to be notified, Tailwind uses an RPC instead of an RDMA for this operation. Tailwind updates the version number right after step (2) when re-creating a secondary replica. This ensures that the primary and backups are synchronized. Replication can start again from a consistent state. Note that this RPC is rare and only occurs after the crash of a backup.

3.4.4 Network Partitioning

It can happen that a primary is considered crashed by a subset of servers. Tailwind would start locating its backups, then rebuilding metadata on those backups. While metadata is rebuilt, the primary could still perform one-sided RDMA to its backups, since they are always unaware of this type of operation. To remedy this, as soon as a primary or secondary failure is detected, all machines close their respective QPs with the crashed server. This allows Tailwind to ensure that servers that are alive but considered crashed by the environment do not interfere with work done during recovery.

4 Evaluation

We implemented Tailwind on RAMCloud, a low-latency in-memory key-value store. Tailwind's design suits RAMCloud in many aspects:

Low latency. RAMCloud's main objective is to provide low-latency access to data. It relies on fast networking and kernel-bypass to provide a fast RPC layer. Tailwind can further improve RAMCloud (PUT) latency (§4.2) by employing one-sided RDMA without any additional complexity or resource usage.

Replication and Threading in RAMCloud. To achieve low latency, RAMCloud dedicates one core solely to polling network requests and dispatching them to worker cores (Figure 1). Worker cores execute all client and system tasks. They are never preempted, to avoid context switches that may hurt latency. To provide strong consistency, RAMCloud always requests acknowledgements from all backups for every update. With the above threading model, replication considerably slows down the overall performance of RAMCloud [31]. Hence Tailwind can greatly improve RAMCloud's CPU efficiency and remove replication overheads.

Log-structured Memory. RAMCloud organizes its memory as an append-only log. Memory is sliced into smaller chunks called segments that also act as the unit of replication, i.e., for every segment a primary has to choose a backup. Such an abstraction makes it easy to replace RAMCloud's replication system with Tailwind. Tailwind checksums can be appended in the log storage, with data, and replicated with minimal changes to the code. In addition, RAMCloud provides a log-cleaning mechanism which can efficiently clean old checksums and reclaim their storage space.

CPU  Xeon E5-2450 2.1 GHz 8 cores, 16 hw threads
the throughput.
For instance with 50% PUTs Tailwind sustains 340 KOp/s against 200 KOp/s for RAMCloud, RAM 16GB1600MHzDDR3 which is a 70% improvement. With update-only work- NIC MellanoxMX354ACX3@56Gbps load, improvement is not further increased: In this case Switch 36portMellanoxSX6036G Tailwindimprovesthethroughputby65%. Tailwind can improve the number of read operations OS Ubuntu15.04,Linux3.19.0-16, serviced by accelerating updates. CPU cycles saved al- MLX43.4.0,libibverbs1.2.1 low servers (that are backups as well) to service more requestsingeneral. Table1: Experimentalclusterconfiguration. Figure 7 shows that update latency is also consider- ably improved by Tailwind. Under light load Tailwind reducesmedianand99th percentilelatencyofanupdate We compared Tailwind with RAMCloud replication from 16 µs to 11.8 µs and from 27 µs to 14 µs respec- protocol,focusingouranalysisonthreekeyquestions: tively. Under heavy load, i.e. 500 KOp/s Tailwind re- Does Tailwind improve performance? Measure- duces median latency from 32 µs to 16 µs compared to mentsshowTailwindreducesRAMCloud’smedianwrite RAMCloud. Under the same load tail latency is even latency by 2× and 99th percentile latency by 3× (Fig- furtherreducedfrom78µsto28µs. ure7). Tailwindimprovesthroughputby70%forwrite- Tailwind can effectively reduce end-to-end client la- heavyworkloadsandby27%forworkloadsthatinclude tency. With reduced acknowledgements waiting time, justasmallfractionofwrites. andmoreCPUpowertoprocessrequestsfaster, servers cansustainaverylowlatencyevenunderheavyconcur- WhydoesTailwindimproveperformance? Tailwind rentaccesses. improves per-server throughput by eliminating backup request processing (Figure 9), which allows servers to 4.3 GainsasBackupLoadVaries focuseffortonprocessinguser-facingrequests. SinceallserversinRAMCloudactasbothbackupsand What is the Overhead of Tailwind? We show that primaries,Tailwindacceleratesnormal-caserequestpro- Tailwind’s performance improvement comes at no cost. 
cessingindirectlybyeliminatingtheneedforserversto Specifically, we measure and find no overhead during actively process replication operations. Figure 8 shows crashrecoverycomparedtoRAMCloud. the impact of this effect. In each trial load is directed at a subset of four RAMCloud storage nodes; “Active 4.1 ExperimentalSetup PrimaryServers”indicatesthenumberofstoragenodes Experimentsweredoneona35serverDellr320cluster thatprocessclientrequests. Nodesdonotreplicatedata (Table1)ontheCloudLab[24]testbed. tothemselves,sowhenonlyoneprimaryisactiveitisre- We used three YCSB [2] workloads to evaluate ceivingnobackupoperations.Alloftheactiveprimary’s Tailwind: update-heavy (50% PUTs, 50% GETs), backup operations are directed to the other three other- read-heavy (5% PUTs, 95% GETs), and update-only wise idle nodes. Note that, in this figure, throughput is (100% PUTs). We intitially inserted 20 million objects per-active-primaries. So, as more primaries are added, of100Bplus30Bforthekey. Afterwards,weranupto theaggregateclusterthroughputincreases. 30clientmachines.Clientsgeneratedrequestsaccording As client GET/PUT operations are directed to more toaZipfiandistribution(θ =0.99). Notethatbydefault nodes (more active primaries), each node slows down RAMCloud object headers Objects were uniformly in- because it must serve a mix of client operations inter- serted in active servers. The replication factor was set mixed with an increasing number of incoming backup to3andRDMAbufferssizewassetto8MB.Everydata operations. Enoughclientloadisoffered(30clients)so pointintheexperimentsisaveragedover3runs. that storage nodes are the bottleneck at all points in the RAMCloud’s RPC-based replication protocol served graphs. Withfouractiveprimaries,everyservernodeis as a baseline for comparison. Note that, in the compar- saturated processing client requests and backup opera- isonwithTailwind,werefertoRAMCloud’sreplication tionsforallclient-issuedwrites. protocolasRAMCloudforsimplicity. 
Even when only 5% of client-issued operations are writes (Figure 8a), Tailwind increasingly improves per- 4.2 PerformanceImprovement formance as backup load on nodes increases. When a The primary goal of Tailwind is to accelerate basic op- primarydoesn’tperformbackupoperationsTailwindim- erations throughput and latency. To demonstrate how proves throughput 3%, but that increases to 27% when Tailwind improves performance we show Figure 6, i.e. theprimaryservicesitsfairshareofbackupoperations. throughput per server as we increase the number of The situation is similar when client operations are a clients. When client operations consist of 5% PUTs 50/50 mix of reads and writes (Figure 8b) and when and 95% GETs, RAMCloud achieves 500 KOp/s per clientsonlyissuewrites(Figure8c). serverwhileTailwindreachesupto635KOp/s. Increas- As expected, Tailwind enables increasing gains over ingtheupdateloadenablesTailwindtofurtherimprove RAMCloud with increasing load, since RDMA elimi- er Tailwind v r RAMCloud e S / 200 S) 600 300 / Op 400 200 K 100 t( 200 100 u p gh 0 u 0 5 10 15 20 25 30 0 5 10 15 20 25 30 0 5 10 15 20 25 30 o r h T Clients(YCSB-B) (YCSB-A) (WRITE-ONLY) Figure6: Throughputperserverina4servercluster. Tailwind µs) 80 coreutilizationby50%. Thisisstemsfromthefactthat µMedianLatency(s) 123000 RAMCloud PercentileLatency( 246000 aqu5ut0lie%Ialnisrzgttawseetriihefnoesrnatntihcnittishgisoelscynlw,aigosdoehfir.stkdlpyWliaostpiaicmtadhhtpccu5rhoot0inv/ll5osiezi0adsadttrstieoioosanfd2dsi05us%%aenntodwowtrwhrrieeitrlepdiestlueiiccostae,nrtdlwieyoaw.oncrhhkreeeens-r 00 100 200 300 400 500 600 th99 00 100 200 300 400 500 600 reducing the proportion of writes. With 5% writes Throughput(Kops) Throughput(Kops) Tailwind utilizes even more dispatch than RAMCloud. This is actually a good sign, since read workloads are (a) (b) dispatch-bound. Therefore,TailwindallowsRAMCloud toprocessevenmorereadsbyacceleratingwriteopera- Figure7: (a)Medianlatencyand(b)99thpercentilelatencyofPUT tions. 
ThisisimplicitlyshowninFigure10with”Repli- operationswhenvaryingtheload. cation” graphs that represent worker utilization due to waiting for replication requests. For update-only work- loads, RAMCloud spends 80% of the worker cycles in nates three RPCs that each server must handle for each replication. With5%writesRAMCloudspends62%of client-issued write, which, in turn, eliminates worker worker cycles waiting for replication requests to com- corestallsonthenodehandlingthewrite. plete against 49% with Tailwind. The worker load dif- In short, the ability of Tailwind to eliminate replica- ferenceisspentonservicingreadrequests. tionworkonbackupstranslatesintomoreavailabilityfor normalrequestprocessing,and,hence,betterGET/PUT 4.5 ScalingwithAvailableResources performance. We also investigated how Tailwind improves internal 4.4 ResourceUtilization server parallelism (i.e. more cores). Figure 11 shows TheimprovementsabovehaveshownhowTailwindcan throughput and worker utilization with respect to avail- improveRAMCloud’sbaselinereplicationnormal-case. ableworkercores.Clients(fixedto30)issue50/50reads Themainreasonisthatoperationscontendwithbackup and writes to 4 servers. Note that we do not count the operations for worker cores to process them. Figure 9a dispatch core with available cores, as it is always avail- illustrates this: we vary the offered load (updates-only) able.Withonlyasingleworkercorepermachine,RAM- toa4-serverclusterandreportaggregatedactiveworker Cloudcanserve430KOp/scomparedto660KOp/sfor cores. Forexample,toservice450KOp/s,Tailwinduses Tailwindwithrespectively4.5and3.5workercoresuti- 5.7 worker cores while RAMCloud requires 17.6 active lization.RAMCloudcanover-allocateresourcestoavoid cores,thatis3×moreresources. Forthesamescenario, deadlocks,whichexplainswhyitcangoabovethelimit wealsoshowFigure9bthatshowstheaggregateactive of available cores. Interestingly, when increasing the dispatch cores. 
Interestingly, gains are higher for dis- available worker cores, Tailwind enables better scaling. patch, e.g., to achieve 450 KOp/s, Tailwind needs only RAMClouddoesnotachievemorethroughputwithmore 1/4ofdispatchcoresusedbyRAMCloud. than 5 available cores. Tailwind continues to improve Both observations confirm that, for updates, most of throughputuptoall7availablecorespermachine. theresourcesarespentinprocessingreplicationrequests. EventhoughbothRAMCloudandTailwindexhibita TogetabetterviewontheimpactwhenGET/PUToper- plateau, this is actually due to the dispatch thread limit ationsaremixed,weshowFigure10. Itrepresentsactive thatcannottakemorerequestsin.ThissuggeststhatTail- worker and dispatch cores, respectively, when varying windallowsRAMCloudtobettertakeadvantageofper- clients. Whenrequestsconsistofupdatesonly,Tailwind machineparallelism. Infact,byeliminatingthereplica- reduces worker cores utilization by 15% and dispatch tionrequestsfromdispatch,Tailwindallowsmoreclient-
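The torn-write cases discussed in Section 3.4.2 (incomplete header, incomplete payload, incomplete checksum) can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the paper's Algorithm 1 and not RAMCloud's log format: the entry layout (a 4-byte length header, the payload, and a per-entry CRC32 trailer), the helper names, and the rule that the checksum's final byte is forced to be non-zero are all hypothetical choices made for illustration. The sketch also assumes that an interrupted RDMA write leaves a contiguous prefix of the entry in a pre-zeroed buffer.

```python
import struct
import zlib

HEADER = struct.Struct("<I")    # hypothetical header: payload length only
CHECKSUM = struct.Struct("<I")  # hypothetical trailer: CRC32 of header + payload


def tail_safe_crc(data: bytes) -> int:
    """CRC32 whose last stored byte (little-endian MSB) is never zero."""
    crc = zlib.crc32(data) & 0xFFFFFFFF
    if crc >> 24 == 0:
        crc |= 0x01000000  # assumption: force a non-zero final byte
    return crc


def encode_entry(payload: bytes) -> bytes:
    """Primary side: header, payload, then the checksum trailer."""
    header = HEADER.pack(len(payload))
    return header + payload + CHECKSUM.pack(tail_safe_crc(header + payload))


def last_valid_offset(buf: bytes) -> int:
    """Backup side: scan a pre-zeroed buffer and return the end of its valid prefix."""
    offset = 0
    while offset + HEADER.size <= len(buf):
        (length,) = HEADER.unpack_from(buf, offset)
        if length == 0:
            break  # still-zeroed space (this sketch cannot represent empty payloads)
        end = offset + HEADER.size + length + CHECKSUM.size
        if end > len(buf):
            break  # length field points past the buffer
        (stored,) = CHECKSUM.unpack_from(buf, end - CHECKSUM.size)
        if stored >> 24 == 0:
            break  # final checksum byte is zero: the write did not complete
        if stored != tail_safe_crc(buf[offset:end - CHECKSUM.size]):
            break  # checksum mismatch: corrupted entry
        offset = end
    return offset


def torn_write(size: int, complete: bytes, torn: bytes, written: int) -> bytes:
    """Simulate a buffer holding `complete` plus the first `written` bytes of `torn`."""
    buf = bytearray(size)  # pre-zeroed, as the protocol requires
    buf[:len(complete)] = complete
    buf[len(complete):len(complete) + written] = torn[:written]
    return bytes(buf)
```

With an entry A followed by an entry B that is cut off after any number of bytes short of its full length, `last_valid_offset` returns the end of A, so the unacknowledged B is left off the valid prefix. Choosing a payload longer than 255 bytes for B also exercises the partially written length field: the truncated length l′ is smaller than l, so the position where the scan expects the checksum still holds zeroes.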