To Vote Before Decide: A Logless One-Phase Commit Protocol for Highly-Available Datastores

Yuqing Zhu #1, Philip S. Yu △2, Guolei Yi +3, Wenlong Ma #4, Mengying Guo #4, Jianxun Liu #4
# ICT, Chinese Academy of Sciences, Beijing, China
△ University of Illinois at Chicago, USA
+ Baidu, Beijing, China
[email protected], [email protected], [email protected], {mawenlong,guomengying,liujianxun}@ict.ac.cn

arXiv:1701.02408v2 [cs.DC] 11 Jan 2017

Abstract—Highly-available datastores are widely deployed for online applications. However, many online applications are not content with the simple data access interface currently provided by highly-available datastores. Distributed transaction support is demanded by applications such as the large-scale online payment systems of Alipay or PayPal. Current solutions to distributed transactions can spend more than half of the whole transaction processing time in distributed commit. An efficient atomic commit protocol is highly desirable. This paper presents the HACommit protocol, a logless one-phase commit protocol for highly-available systems. HACommit has transaction participants vote for a commit before the client decides to commit or abort the transaction; in comparison, the state-of-the-art practice for distributed commit is to have the client decide before participants vote. The change enables the removal of both the participant logging and the coordinator logging steps in the distributed commit process; it also makes it possible that, after the client initiates the transaction commit, the transaction data is visible to other transactions within one communication roundtrip time (i.e., one phase). In an evaluation with extensive experiments, HACommit outperforms recent atomic commit solutions for highly-available datastores under different workloads. In the best case, HACommit can commit in one fifth of the time 2PC does.

Keywords—atomic commit, high availability, transaction, 2PC, consensus

I. INTRODUCTION

Online applications have strong requirements on availability; their data storage widely exploits highly-available datastores [1]–[3]. For highly-available datastores, distributed transaction support is highly desirable. It can simplify application development and facilitate large-scale online transacting business like PayPal [4], Alipay [5] or Baidu Wallet [6]. Besides, it can enable quick responses to big data queries through materialized views and incremental processing [7], [8]. The benefits of transactions come from the ACID (atomicity, consistency, isolation and durability) guarantees [9]. The atomic commit process is key to the guarantee of the ACID properties. Current solutions to atomic commit incur a high cost, inhibiting online applications from using distributed transactions. A fast atomic commit process is highly desirable.

The state-of-the-art practice for distributed commit in highly-available datastores is to have the transaction client decide before the transaction participants vote [10]–[15], denoted as the vote-after-decide approach. On deciding, the client initiates a distributed commit process, which typically incurs two phases of processing. Participants vote on the decision in the first phase of commit, with the votes recorded in logs or through replication. The second phase is for notifying the commit outcome and applying transaction changes. Even if the transaction client can be notified of the commit outcome at the end of the first phase [10], [11], [14], [15], the commit is not completed and the transaction result is not visible to other transactions until the end of the second phase. The two processing phases involve at least two communication roundtrips, as well as the step of logging to write-ahead logs [16] or of replicating among servers. The communication roundtrips and the logging or replicating step are costly procedures in distributed processing. They lead to a long distributed commit process, which in turn reduces transaction throughput.

A different approach to distributed commit is to have participants vote for a commit before the client decides to commit or abort the transaction, denoted as the vote-before-decide approach. Having the participants vote first, the voting step can overlap with the processing of the last transaction operation, saving one communication roundtrip; and the votes can be replicated at the same time, instead of in a separate processing step. This makes the removal of one processing phase possible. On receiving the client's commit decision, the participants can directly commit the transaction locally; thus, the transaction data can be made visible to other transactions within one communication roundtrip time, i.e., one phase. Though previous one-phase commit protocols also have participants vote early, they make several impractical assumptions, e.g., log externalization [17]; besides, they rely heavily on coordinator logs to guarantee atomicity and durability.

In this paper, we present the HACommit protocol, a logless one-phase commit protocol for highly-available systems. HACommit takes the vote-before-decide approach. In order to remove logging and enable one-phase commit, HACommit tackles two key challenges: the first is how to commit (abort) a transaction correctly in a one-phase process; and the second is how to guarantee correct transaction recovery on participant or coordinator failures without using logs.

For the first challenge, we observe that, with the vote-before-decide approach, the commit process becomes a problem in which the client proposes a decision to be accepted by participants. This problem is widely known as the consensus problem [18]. Consensus algorithms are solutions to the consensus problem. The widely used consensus algorithm Paxos [19] can reach a consensus among participants (acceptors) in a one-phase process, if the proposer is the initial proposer in a run of the algorithm. HACommit runs the Paxos algorithm once for each transaction commit (abort). It uses the unique client as the initial proposer of the algorithm, and the participants as the acceptors and learners. Thus, the client can propose any value, either commit or abort, to be accepted by participants as the consensus. HACommit proposes a new procedure for processing the last transaction operation such that consensus algorithms can be exploited in the commit process. To exploit Paxos, HACommit designs a transaction context structure to keep the Paxos configuration information for the commit process.

For the second challenge, we notice that consensus algorithms can reach an agreement among a set of participants safely even on proposer failures. As HACommit exploits Paxos and uses the client as the proposer/coordinator, a client failure will not block the commit process. On a client failure, HACommit runs the classic Paxos algorithm to reach the same transaction outcome among the participants, which act as would-be proposers replacing the failed client. Furthermore, we observe that, in practice, the high availability of data in highly-available datastores has an effect equivalent to fail-free participants during commit. Instead of using logs for participant failure recovery, HACommit has participants replicate their votes and the transaction metadata to their replicas when processing the last transaction operation. For participant replica failures, HACommit proposes a recovery process that exploits the replicated votes and metadata.

With HACommit, a highly-available datastore can not only respond to the client commit request within one phase, as in other state-of-the-art commit solutions [10], [14], [15], but also make the transaction changes visible to other transactions within one phase, increasing transaction concurrency. Without client failures, HACommit can commit a transaction within two message delays. Based on Paxos, HACommit is non-blocking on client failures; and it can also tolerate participant replica failures. HACommit can be used along with various concurrency control schemes [9], [17], e.g., optimistic, multi-version or lock-based concurrency control. We implemented HACommit and evaluated its performance using a YCSB-based transaction benchmark [20]. As the number of participants and data items involved in a transaction is the key factor affecting the performance of commit protocols, we evaluated HACommit and several recent protocols [12], [14], [15] by varying the number of operations per transaction. In the evaluation with extensive experiments, HACommit can commit in less than a millisecond. In the best case, HACommit can commit in one fifth of the time that the widely-used 2PC commits.

Roadmap. Section II discusses related work. Section III overviews the design of HACommit. Section IV details the last operation processing in HACommit and Section V describes the commit process. Section VI presents the recovery processes on client and participant failures. We report our performance evaluations in Section VII. The paper is brought to a close with conclusions in Section VIII.
II. RELATED WORK

Atomic commit protocols (ACPs). A large body of work has studied the atomic commit problem in distributed environments, both in the database community [16], [17], [21] and in the distributed computing community [22], [23]. The most widely used atomic commit protocol is two-phase commit (2PC) [9]. It was proposed decades ago, but remains widely exploited in recent years [12], [15], [24]–[26]. 2PC involves at least two communication roundtrips between the transaction coordinator and the participants. Relying on both coordinator and participant logs for fault tolerance, it is blocking on coordinator failures.

Non-blocking atomic commit protocols were proposed to avoid blocking on coordinator failures during commit. But some assume the impractical model of synchronous communication and incur high costs, so they are rarely implemented in real systems [27]. Those assuming the asynchronous system model generally exploit coordinator replication and a fault-tolerant consensus protocol [21], [22]. These non-blocking ACPs generally incur an even higher cost than 2PC. Besides, they are all designed taking the same vote-after-decide approach as 2PC, i.e., participants vote after the client decides.

One-phase commit (1PC) protocols were proposed to reduce the communication costs of 2PC. Compared to 2PC, they reduce both the number of forced log writes and the number of communication roundtrips. The price is to send all participants' logs to the coordinator [28] or to make impractical assumptions on systems, e.g., consistency checking on each update [17]. Non-blocking 1PC protocols also exist; they have the same problems as blocking 1PC protocols. Though 1PC protocols have participants vote for commit before the client decides, as HACommit does, they do not allow the client to abort the transaction if all transaction operations are successfully executed [9]. In comparison, HACommit gives the client full freedom to abort a transaction.

All the above atomic commit protocols do not consider the high availability of data as a condition, thus involving unnecessary logging steps for failure recovery at the participants or the coordinator. Exploiting the high availability of data, the participant logging step can be easily implemented as a process of data replication, which is executed for each operation in highly-available datastores, no matter whether the operation belongs to a transaction or not.

ACPs for highly-available datastores. In recent years, quite a few solutions have been proposed for atomic commit in highly-available datastores. Spanner [12] layers two-phase locking and 2PC over the non-blocking replica synchronization protocol of Paxos [19]. Spanner is non-blocking due to the replication of the coordinator's and participants' logs by Paxos, but it incurs a high cost in commit. Message futures [29] proposes a transaction manager that utilizes a replication log to check transaction conflicts and exchange transaction information across datacenters. The concurrency server for conflict checking is the bottleneck for scalability and performance. Besides, the assumption of shared logs is impractical in real systems [17]. Helios [11] also exploits a log-based commit process. It can guarantee the minimum transaction conflict detection time across datacenters. However, it relies on a conflict detection protocol for optimistic concurrency control using replicated logs, which makes the strong assumption that one replica knows all transactions of any other replica within a critical time interval; this is impossible for asynchronous systems with disorderly messages [30]. The safety property of Helios in guaranteeing serializability can thus be threatened by fluctuations in cross-datacenter communication latencies. These commit proposals heavily exploit transaction logs, while logging is costly for transaction processing [31].
MDCC [14] proposes a commit protocol based on Paxos variants for optimistic concurrency control [9]. MDCC exploits the application server as the proposer in Paxos, while the application server is in fact the transaction client. Though its application server can find out the transaction outcome within one processing phase, the commit process of MDCC is inherently two-phase, i.e., a voting phase followed by a decision-sending phase, and no concurrent accesses are permitted over outstanding options during the commit process. TAPIR [10] has a Paxos-based commit process similar to that of MDCC, but TAPIR can be used with pessimistic concurrency control mechanisms. It also uses the client as the proposer in Paxos. It layers transaction processing over the inconsistent replication of highly-available datastores, and exploits the high availability of data for participant replica recovery. TAPIR also returns the transaction outcome to the client within one processing phase of commit, but the transaction outcome is only visible to other transactions after two phases. It places strong requirements on applications, e.g., pairwise invariant checks and consensus operation result reversal. Replicated commit [15] layers Paxos over 2PC. In essence, it replicates two-phase commit operations among datacenters and uses Paxos to reach consensus on the commit decision. It requires a full replica in each datacenter, which processes transactions independently and in a blocking manner.

All the above ACPs for highly-available datastores take the vote-after-decide approach. In comparison, HACommit exploits the vote-before-decide approach to enable the removal of one processing phase and the removal of logging in commit. HACommit overlaps the participant voting with the processing of the last operation. Using the unique client as the transaction coordinator and the initial Paxos proposer, HACommit commits the transaction in one phase, at the end of which the transaction data is made visible to other transactions. HACommit exploits the high availability of data for failure recovery, instead of using the classic approach of logging.

III. OVERVIEW OF HACOMMIT

HACommit is designed to be used in highly-available datastores, which guarantee high availability of data. Generally, highly-available datastores partition data into shards and distribute them to networked servers to achieve high scalability. To guarantee high availability of data, each shard is replicated across a set of servers. Clients are front-end application servers or any proxy service acting for applications. Clients can communicate with servers of the highly-available datastore. A transaction is initiated by a client. A transaction participant is a server holding any shard operated on by the transaction, while servers holding replicas of a shard are called participant replicas.

The implementation of HACommit involves both the client and the server side. On the client side, it provides an atomic commit interface via a client-side library for transaction processing. On the server side, it specifies the processing of the last operation and the normal commit process, as well as the recovery process on client or participant failures. Except for the last operation, all transaction operations can be processed following either the inconsistent replication solutions [10], [14] or the consistent replication solutions [12], [15]. Different concurrency control schemes and isolation levels [9] can be used with HACommit, e.g., optimistic, multi-version or lock-based concurrency control, and read-committed or serializable isolation levels. On processing the last transaction operation, participants vote for a transaction commit based on the results of local concurrency control, integrity and consistency checks.

A HACommit application begins a transaction, starting the transaction execution phase. It can then execute reads and writes in the transaction. On the last operation, the client indicates to all participants that it is the last operation of the transaction. All participants check locally whether to vote YES or NO for a commit. They replicate their votes and the transaction context information to their replicas respectively before responding to the client. The client will receive votes for a commit from all participants, as well as the processing result for the last operation. This is the end of the execution phase. Then, the client can either commit or abort the transaction, though the client can only commit the transaction if all participants vote YES [32].

The atomic commit process starts when the client proposes the transaction decision to the participants and their replicas. Once the client's decision is received by more than a replica quorum of any participant, HACommit guarantees that the transaction is committed or aborted according to the client's decision despite failures of the client or the participant replicas. Therefore, the client can safely end the transaction once it has received acknowledgements from a replica quorum of any participant. The transaction will be committed at all participant replicas once they receive the client's commit decision. An example of transaction processing using HACommit is illustrated in Figure 1.

Figure 1. An example commit process using HACommit: during execution, the client (coordinator) sends the last operation; participants replicate their votes and the transaction context to the participant replicas and reply with the result and a YES vote; the client then sends the decided outcome, and participants apply the transaction and acknowledge.
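To make this end-to-end flow concrete, the following is a minimal single-process sketch of the behavior described above. The class and method names are our own illustration, not the paper's published API; networking, the replication layer and failures are all elided.

```python
"""Illustrative, runnable sketch of the HACommit flow (Section III)."""

class Participant:
    def __init__(self, shard_id):
        self.shard_id = shard_id
        self.vote = None          # set while processing the last operation
        self.outcome = None       # set by the client's phase-2 message

    def process_last_op(self, op, txn_context):
        # Run local concurrency-control / integrity / consistency checks,
        # then replicate the vote and transaction context to replicas
        # (replication elided here) before answering the client.
        self.txn_context = txn_context
        self.vote = "YES"         # "NO" if any local check fails
        result = None if op is None else ("ok", op)
        return result, self.vote  # vote piggybacked on the last-op response

    def phase2(self, bid, decision):
        # One-phase commit: accept the proposal carried by the phase-2
        # message (ballot 0 in the failure-free case), apply or roll back,
        # and acknowledge. Data becomes visible to other transactions here.
        self.outcome = decision
        return "ack"


def run_transaction(ops, participants):
    txn_context = {"tid": "txn-1",
                   "shard_ids": [p.shard_id for p in participants]}
    # Execution phase: the last operation is flagged so participants vote.
    votes = [p.process_last_op(ops[-1], txn_context)[1] for p in participants]
    # The client may only decide COMMIT if every participant voted YES.
    decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
    # Commit phase: one phase-2 roundtrip; no coordinator or participant log.
    assert all(p.phase2(bid=0, decision=decision) == "ack"
               for p in participants)
    return decision


if __name__ == "__main__":
    shards = [Participant("shard-1"), Participant("shard-2")]
    print(run_transaction([("w", "x", 1), ("w", "y", 2)], shards))  # COMMIT
```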
IV. PROCESSING THE LAST OPERATION

On processing the last operation of the transaction, the client sends the last operation to participants holding relevant data, indicating that it is the last operation. To the other participants, the client sends an empty operation as the last operation. All participants process the last operation; those receiving an empty operation do no processing. They check locally whether a commit of the transaction would violate any ACID property and vote accordingly. They replicate their votes and the transaction context to their replicas respectively before responding to the client. The replication of participant votes and the transaction context is required for the votes to survive failures and to guarantee voting consistency in case of participant failures. The participants piggyback their votes on their responses to the client's last-operation request after the replication. The client makes its decision on receiving responses from all participants.

Transaction context. The transaction context must include the transaction ID and the shard IDs. The transaction ID uniquely identifies the transaction and distinguishes the Paxos instance for the commit process. The shard IDs are necessary to compute the set of participant IDs, which constitutes the configuration information of the Paxos instance for commit. This configuration information must be known to all acceptors of Paxos.

In case inconsistent replication [10] is used in operation processing, the transaction context must also include the relevant writes. Relevant writes are writes operating on data held by a participant and its replicas. The relevant writes are necessary in case of participant failures. With inconsistent replication, participant replicas might not process the same writes for a transaction as the participant. Consider a case where a set of relevant writes is known to the participant but not to its replicas. The client might fail after sending the Commit decision to participants. In the meantime, a participant fails and one of its replicas acts as the new participant. Then, the recovery proposers propose the same Commit decision. In such a case, the new participant would not know what writes to apply when committing the transaction. To reduce the data kept in the transaction context, the relevant writes can be recorded as commands [33].
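As an illustration of the transaction context just described, a minimal sketch might look as follows. The field names and types are our assumptions: the paper specifies the contents (transaction ID, shard IDs and, under inconsistent replication, relevant writes recorded as commands) but not a concrete representation.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class TxnContext:
    # Uniquely identifies the transaction and its commit Paxos instance;
    # the paper suggests distributed generation, e.g., UUID [37].
    tid: str = field(default_factory=lambda: str(uuid.uuid4()))
    # IDs of all shards the transaction operated on; any server can derive
    # the contemporary participant set from these, even after leader changes.
    shard_ids: set = field(default_factory=set)
    # Only needed with inconsistent replication: relevant writes, recorded
    # as commands so a recovered replica knows what to apply on commit.
    relevant_writes: list = field(default_factory=list)

# Example: the context a participant replicates alongside its YES vote.
ctx = TxnContext(shard_ids={"shard-3", "shard-7"},
                 relevant_writes=[("set", "accounts/alice", 90)])
```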
V. THE COMMIT PROCESS

In HACommit, the client commits or aborts a transaction by initiating a Paxos instance.

A. Background: the Paxos Algorithm

A run of the Paxos algorithm is called an instance. A Paxos instance reaches a single consensus among the participants. An instance proceeds in rounds. Each round has a ballot with a unique number bid. Any would-be proposer can start a new round on any (apparent) failure. Each round generally consists of two phases [34] (phase-1 and phase-2), and each phase involves one communication roundtrip. The consensus is reached when one active proposer successfully finishes one round. Participants in the consensus problem are generally called acceptors in Paxos. However, in an instance of Paxos, if a proposer is the only and the initial proposer, it can propose any value to be accepted by participants as the consensus, incurring only one communication roundtrip between the proposer and the participants [35].

Paxos is commonly used for reaching consensus among a set of replicas. Each Paxos instance has a configuration, which includes the set of acceptors and learners. Widely used in reaching replica consensus, Paxos is generally run with its configuration staying the same across instances [36]. The configuration information must be known to all proposers, acceptors and learners. Take data replication for example. The set of data replicas are the acceptors and learners. The leader replica is the initial proposer and all other replicas are would-be proposers. Clients send their write requests to the leader replica, which picks one write or a write sequence as its proposal. Then the leader replica starts a Paxos instance to propose its proposal to the acceptors. In practice, the configuration can stay the same across different Paxos instances, e.g., for writes to the same data at different times.

B. The One-Phase Commit Process

In HACommit, the client is the only and the initial proposer of the Paxos instance, as each transaction has a unique client. As a result, the client can commit the transaction in one communication roundtrip to the participants.

The commit process starts from the second phase (phase-2) of the Paxos algorithm. That is, the client first sends a phase-2 message to all participants. To guarantee correctness, the exploitation of the Paxos algorithm must strictly comply with the algorithm specification. Complying with the Paxos algorithm, the phase-2 message includes a ballot number bid, which is equal to zero, and the proposal for commit, which can be commit or abort. On receiving the phase-2 message, a participant records the ballot number and the outcome for the transaction locally. Then it commits the transaction by applying the writes and releasing all data items; or, it aborts the transaction by rolling it back and releasing all data items. In the meantime, the participant invokes the replication layer to replicate the result to its replicas. Afterwards, each participant acknowledges the client. Alternatively, the client can send the phase-2 message to all participants and their replicas. Each participant replica then follows the same processing procedure as its participant. In that case, the client waits for responses from all participants and their replicas.
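The participant-side handling of the phase-2 message can be sketched as follows, assuming the standard Paxos ballot bookkeeping. The storage and replication calls are placeholders for the datastore's own machinery, not the paper's implementation.

```python
class CommitAcceptor:
    """One participant's Paxos-acceptor state for a single commit instance."""

    def __init__(self):
        self.promised_bid = -1    # highest ballot promised in any phase-1
        self.accepted_bid = None  # ballot of the accepted outcome, if any
        self.accepted_outcome = None

    def on_phase2(self, bid, outcome, ctx):
        # Reject if a recovery proposer already holds a higher-ballot promise.
        if bid < self.promised_bid:
            return ("nack", self.promised_bid)
        self.accepted_bid = bid
        self.accepted_outcome = outcome       # "COMMIT" or "ABORT"
        if outcome == "COMMIT":
            self.apply_writes(ctx)            # apply writes, release items
        else:
            self.rollback(ctx)                # roll back, release items
        self.replicate_result(ctx)            # hand off to replication layer
        return ("ack", bid)

    # Placeholders standing in for the storage and replication hooks.
    def apply_writes(self, ctx): pass
    def rollback(self, ctx): pass
    def replicate_result(self, ctx): pass


# In the failure-free case the client is the initial proposer, so its
# phase-2 message carries bid == 0 and is accepted in one roundtrip.
acceptor = CommitAcceptor()
print(acceptor.on_phase2(0, "COMMIT", ctx=None))  # ('ack', 0)
```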
C. Participant Acknowledgements

For any participant, if the acknowledgements from a quorum of its replicas are received by the client, the client can safely end the transaction. In fact, the commit process is not finished until all participants acknowledge the client. But any participant failing to acknowledge can go through the failure recovery process (Section VI) to successfully finish the commit process. In HACommit, all participants must finally acknowledge the acceptance of the client's proposal so that the transaction is committed at all data operated on by the transaction.

The requirement for participants' acknowledgements is different from that for quorum acceptance in the original Paxos algorithm. In Paxos, the consensus is reached once a proposal is accepted by more than a quorum of participants. The original Paxos algorithm can tolerate the failures of both participants (acceptors) and proposers. HACommit uses the client as the initial proposer and the participants as acceptors and would-be proposers when exploiting Paxos for the commit process. In its Paxos exploitation, HACommit only tolerates the failures of the initial proposer and the would-be proposers. However, the failure of participants (i.e., acceptors) can be tolerated by the participant replication, which can itself exploit consensus algorithms like Paxos.

D. Distinguishing Concurrent Commits

Each Paxos instance corresponds to the commit of one transaction, but one participant can engage in multiple Paxos instances for commit, as a participant can be involved in multiple concurrent transactions. To distinguish different transactions, we include a transaction ID in the phase-2 message, as well as in all messages sent between clients and participants. A transaction T is uniquely identified in the system by its ID tid, which can be generated using distributed methods, e.g., UUID [37].

E. Paxos Configuration Information

Different from Paxos exploitations where the configuration stays the same across multiple instances, HACommit has a different configuration in the Paxos instance for each transaction commit. The set of participants is the configuration of a Paxos instance. Each transaction has different participants, leading to different configurations of the Paxos instances for commit. As required by the algorithm, the configuration must be known to all proposers and to all nodes within the configuration. A replacing proposer (i.e., a recovery node) needs the configuration information to continue the algorithm after the failure of a previous proposer. The first proposer of the commit instance is the transaction client, which is the only node with complete information about the configuration. If the client fails, the configuration information might get lost. In fact, a client might fail before the transaction even comes to the commit step. Then a replacing proposer would hardly have enough configuration information to abort the dangling transaction.

To guarantee the availability of the configuration information, we include the configuration information in the phase-2 message. Besides, as the configuration expands and updates after each new operation is processed, the client must send the up-to-date configuration to all participants contacted so far on processing each operation. In case a participant fails and one of its replicas takes its place, the configuration must be updated and sent to all replicas of all participants. The exact configuration of the Paxos instance for commit is formed right on the processing of the last transactional operation. In this way, each participant replica keeps locally an up-to-date copy of the configuration information. As a participant can fail and be replaced by its replicas, HACommit does not rely on participant IDs for the configuration reference. Instead, it records the IDs of all shards operated on by the transaction. With the set of shard IDs, any server in the system can easily find out the contemporary set of participants.
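Because the configuration is referenced through shard IDs rather than participant IDs, resolving it reduces to a directory lookup. A minimal sketch, where the directory mapping is an assumed stand-in for the datastore's own membership metadata:

```python
def participants_for(shard_ids, shard_directory):
    """Map the shard IDs recorded in the transaction context to the
    contemporary set of participant (leader) servers; the directory
    reflects any leader changes since the transaction started."""
    return {shard_directory[s] for s in shard_ids}

shard_directory = {"shard-3": "s1:9000", "shard-7": "s2:9000"}  # assumed
print(participants_for({"shard-3", "shard-7"}, shard_directory))
```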
VI. FAILURE RECOVERY

In the design of HACommit, we assume that, if a client or a participant replica fails, it can only fail by crashing. In the following, we describe the recovery mechanisms for client failures and participant replica failures respectively.

A. On Client Failure

In HACommit, all participants are candidates for recovering from a client failure. We call the recovering nodes recovery proposers; they act as would-be proposers of the commit process. The recovery proposers are activated on client failure. In an asynchronous system, there is no way to be sure whether a client has actually failed. In practical implementations, a participant can keep a timer on the duration since it last received a message from the current proposer. If the duration exceeds a threshold, the participant considers the current proposer failed. Then it considers itself the recovery proposer.

A recovery proposer must run the complete Paxos algorithm to reach the consensus safely among the participants. As any would-be proposer can start a new round on any (apparent) failure, multiple rounds, phases and communication roundtrips can be involved on client failures.

Although complicated situations can happen, the participants of a transaction will reach the same outcome eventually, if they ever reach a consensus and the transaction ends. For example, as delayed messages cannot be distinguished from failures in an asynchronous system, the current proposer might in fact not have failed; instead, its last message has not reached a participant, which therefore considers the proposer failed. Or, multiple participants consider the current proposer failed and start new rounds of Paxos simultaneously. None of these situations impairs the safety of the Paxos algorithm [19].

1) The Recovery Process: A recovery proposer starts the recovery process by starting a new round of the Paxos instance from the first phase. In the first phase, the new proposer updates the ballot number bid to be larger than any it has seen. It sends a phase-1 message with the new ballot number to all participants. On receiving the phase-1 message with bid, if a participant has never received any phase-1 message with a ballot number greater than bid, it responds to the proposer. The response includes the accepted transaction decision and the ballot number at which the acceptance was made, if the participant has ever accepted any transaction decision.

If the proposer has received responses to its phase-1 message from all participants, it sends a phase-2 message to all participants. The phase-2 message has the same ballot number as the proposer's last phase-1 message. Besides, the transaction outcome with the highest ballot number in the responses is proposed as the final transaction outcome; or, if no accepted transaction outcome is included in the responses to the phase-1 message, the proposer proposes ABORT to satisfy the assumptions of the CAC problem. Unless a participant has already responded to a phase-1 message with a ballot number greater than bid, it accepts the transaction outcome and ends the transaction after receiving the phase-2 message. The participant acknowledges the proposer accordingly. After receiving acknowledgements from all participants, the new proposer can safely end the transaction.

2) Liveness: To guarantee liveness, HACommit adopts the assumption commonly made for Paxos, namely that one proposer will finally succeed in finishing one round of the algorithm. In HACommit, if all participants consider the current proposer failed and start new rounds of Paxos simultaneously, a racing condition among the new proposers could form in the first phase of Paxos. No proposer might be able to finish the second phase of Paxos, leaving the liveness of commit unguaranteed. Though it rarely happens, the racing condition among would-be proposers must be avoided in Paxos [19] for the liveness consideration. In actual implementations, random back-off of the candidates is enough to resolve the racing situation [34], [36]; alternatively, leader election [34] or failure detection [38] services outside the algorithm implementation can be used.

B. On Participant Replica Failures

HACommit can tolerate not only client failures, but also participant replica failures. It can guarantee continuous data availability if more than a quorum of replicas are accessible for each participant in a transaction. In case quorum replica availability cannot be guaranteed, HACommit can be blocked, but the correctness of atomic commit is guaranteed anyhow [19]. The high availability of data enables a recovery process based on replicas instead of logging, though logging and other mechanisms like checkpointing [26] and asynchronous logging [33] can speed up the recovery process.

Failed participant replicas can be recovered by copying data from the correct replicas of the same participant. Alternatively, recovery techniques used in consensus and replication services [39], [40] can be employed for the replica recovery of participants. Although one replica is selected as the leader (i.e., the participant), the leader replica can easily be replaced by other replicas of the same participant [39]. If a participant failed before sending its vote to its replicas, the new leader will make a new decision for the vote. Otherwise, as the vote of a participant is replicated before being sent to the coordinator, the vote can be kept consistent during the change of leaders. Besides, the client has sent the transaction outcome to all participants and their replicas in the commit process. Thus, failed participant replicas can be recovered correctly as long as the number of failed replicas for a participant is tolerable by the consensus algorithm in use.

We assume there are fewer failed replicas for each participant than is tolerable by the highly-available datastore. This is generally satisfied, as the number of replicas can be increased to tolerate more failures. If, unfortunately, the assumption is not met, the participant without enough replicas will not respond to the client, so as to guarantee replica consistency and correctness. The commit process will have to pause until all participants have enough active replicas. Though not meeting the assumption can impair the liveness of the protocol, HACommit guarantees the correctness of commit and the consistency of data anyhow.
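The recovery process just described can be sketched as follows, reusing the CommitAcceptor bookkeeping from the Section V sketch. This is classic Paxos phase-1/phase-2 under our assumed data structures (local calls stand in for messages), not the paper's code.

```python
def recover(acceptors, highest_seen_bid, ctx):
    """A recovery proposer pushing a dangling transaction to an end."""
    bid = highest_seen_bid + 1                 # strictly larger ballot

    # Phase 1: ask every participant to promise the new ballot and report
    # any transaction outcome it has already accepted.
    promises = []
    for a in acceptors:
        if bid > a.promised_bid:
            a.promised_bid = bid
            promises.append((a.accepted_bid, a.accepted_outcome))
    if len(promises) < len(acceptors):         # HACommit waits on all
        raise RuntimeError("ballot rejected; back off and retry")

    # Phase 2: a previously accepted outcome must be preserved (pick the
    # one with the highest ballot); if none exists, no participant can have
    # learned COMMIT, so proposing ABORT is safe.
    accepted = [(b, o) for b, o in promises if o is not None]
    outcome = max(accepted)[1] if accepted else "ABORT"
    for a in acceptors:
        a.on_phase2(bid, outcome, ctx)
    return outcome
```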
VII. EVALUATION

Our evaluation explores three aspects: (1) the commit performance of HACommit: it has smaller commit latency than the other protocols, and this advantage increases as the number of participants per transaction increases; (2) the fault tolerance of HACommit: it can tolerate client failures as well as server failures; and (3) the transaction processing performance of HACommit: it has higher throughputs and lower average latencies than the other protocols.

A. Experimental Setup

We compare HACommit with two-phase commit (2PC), replicated commit (RCommit) [15] and MDCC [14]. Two-phase commit is still considered the standard protocol for committing distributed transactions. It assumes no replication and is not resilient to single-node failures. RCommit and MDCC are state-of-the-art commit protocols for distributed transactions over replicated data, as HACommit is. RCommit has better performance than the approach that layers 2PC over the Paxos algorithm [12], [36]. MDCC guarantees only isolation levels weaker than serializability. The same concurrency control scheme and the same storage management component are used for HACommit, 2PC and RCommit. These three implementations use the consistency level of serializability. Compared to the implementations for 2PC and RCommit, the HACommit implementation also supports the weak isolation level of read committed [41]. The evaluation of MDCC is based on its open-source code [42].

We evaluate all implementations on the Amazon EC2 cloud, using a YCSB-based benchmark [20]. As our database completely resides in memory and the network communication plays an important role, we deploy the systems over memory-optimized instances of r3.2xlarge (with 8 cores, 60 GB memory and high-speed networking). Unless noted otherwise, all implementations are deployed over eight nodes. The cross-node communication roundtrip is about 0.1 milliseconds. For HACommit, RCommit and MDCC, the database is deployed with three replicas. For 2PC, no replication is used. Generally, 2PC requires buffer management for durability. We do not include one for 2PC; an in-memory database is used instead, and durability is guaranteed through operation logging. As buffer management takes up about one fifth of the local processing time of transactions, our 2PC implementation without buffer management should perform faster than a typical 2PC implementation.

In all experiments, each server runs a server program and a test client program. By default, each client runs with 10 threads. Each data point in the graphs represents the median of at least five trials. Each trial is run for over 120 seconds, with the first and last quarter of each trial elided to avoid start-up and cool-down artifacts. For all experimental runs, clients recorded throughput and response times. We report the average of three sixty-second trials.

In all experiments, we preload a database containing a single table with 10 million records. Each record has a single primary-key column and one additional column with 10 bytes of randomly generated string data. We use small records to focus our attention on the key performance factors. Accesses to records are uniformly distributed over the whole database. In all workloads, transactions are committed if no data conflicts exist; that is, all transaction aborts in the experiments are due to concurrency control requirements.
B. Commit Performance

As we are targeting transaction commit protocols, we first examine the actual costs of the commit process by studying its duration. We do not compare the commit process of HACommit with that of MDCC, because the latter integrates a concurrency control process; comparing only the commit processes of the two protocols would be unfair to MDCC.

Figure 2. Commit latencies (average latency in ms; HACommit vs. 2PC vs. RCommit) when increasing the number of operations per transaction from 1 to 64.

HACommit outperforms 2PC and RCommit in varied workloads. Figure 2 shows the latencies of commit. We vary the number of operations per transaction from 1 to 64. The advantage of HACommit increases as the number of operations per transaction increases. When a transaction has 64 operations, HACommit can commit in one fifth of the time 2PC does. This result is more significant than it seems, as HACommit uses replication and 2PC does not. That means HACommit has n−1 times more participants than 2PC in the commit, where n is the number of replicas.

HACommit's commit latency increases slightly as the number of operations increases to 20. On committing a transaction, the system must apply all writes and release all locks. When the number of operations is small, applying writes and releasing locks in the in-memory database costs little time compared to the network communication roundtrip time (RTT). As the number of operations increases, the time needed to apply all writes in the in-memory database increases slightly. Accordingly, the commit latency of HACommit increases.

2PC and RCommit have increased commit latencies when the number of operations per transaction increases. They need to log new writes for commit and the old values of data items for rollback, so the time needed for the prepare phase increases as the number of writes goes up, leading to a longer commit process. 2PC has a higher commit latency than RCommit because, in our implementations, 2PC must log in-memory data while RCommit relies on replication for fault tolerance.
C. Fault Tolerance

In the fault-tolerance tests, we examine the behavior of HACommit under both client failures and server failures. The evaluation demonstrates that no transaction is blocked under server failures or a client failure, as long as a quorum of participant replicas is accessible.

We use five replicas and initiate one client in the fault-tolerance tests. To simulate failures, we actively kill a process in the experiments. The network module of our implementations instantly returns an error in such a case. Our implementation processes the error as if a connection timeout due to a node failure had happened.

Figure 3. Transaction latency variations during server failures. Figure 4. Transaction throughput variations during server failures. Figure 5. HACommit's behavior on a client failure (circled numbers are transactions; timeline in ms, with per-replica repairing and timeout events).

Figure 3 shows the evolution of the average transaction latency in a five-replica setup that experiences the failure of one replica at 50, 100 and 180 seconds respectively. The corresponding throughputs are shown in Figure 4. The latencies and throughputs are captured every second. At 50 and 100 seconds, the average transaction latency decreases and the throughput increases. With PCC (pessimistic concurrency control), reads in the HACommit implementation take up a great portion of time. The failure of one replica means that the system can process fewer reads. Hence, this leads to lower average latencies and higher throughputs for read transactions, as well as for all transactions. At 180 seconds, we fail one more replica, violating the quorum availability assumption of HACommit. The throughput drops to zero immediately because no operation or commit process can succeed at all. The HACommit implementation uses timeouts to detect failures, and quorum reads/writes. As long as a quorum of replicas is available for every data item, HACommit can process transactions normally.

We also examine how HACommit behaves under transaction client failures. We deliberately kill the client in an experiment. Each server program periodically checks its local transaction contexts to see if any last contact time exceeds a timeout period. We set the timeout period to 15 seconds. Figure 5 visualizes the logs on client failures and demonstrates how participants recover from the client failure.

In Figure 5, replicas represent participants. The cross at the client line represents the failure of the client. The circled numbers represent unended transactions. The time axis at the bottom stretches from left to right. The moment when the first transaction is detected to be unended is taken as the beginning of the time axis. A transaction is numbered by the time it is discovered to be unended, i.e., transaction 1 is the first transaction detected to be unended. A replica can detect that a transaction is unended because a timeout fires on the last time a processing message was received at the replica. Timeouts are indicated by arrows.

Replica 1 has the smallest node ID. It detects unended transactions 1 to 9 and starts pushing them to an end through a repairing process. We have synchronized the clocks of the nodes. For simplicity, we use the moment when replica 1 detects that transaction 1 has a last contact time exceeding the timeout period as the beginning of the time axis. It takes about 100 milliseconds for replica 1 to repair each transaction. Replica 1 aborts the nine transactions in the repairing process because no transaction outcome has ever been accepted by any replica. Transaction 10 is later detected by replica 4. However, replica 4 waits for four timeout periods before it actually initiates a repairing process for the transaction. The reason is that transaction 10 has been committed at replicas 1, 2 and 3. Replica 4 finally commits the transaction. Replica 5 also detects transaction 10, but it does not initiate any repairing process: before replica 5 starts repairing transaction 10, the transaction is already committed in the repairing process initiated by replica 4.
D. Transaction Throughput and Latency

We evaluate the transaction throughput and latency when using different commit protocols. In the experiments, on a failure of lock acquisition, we retry the same transaction until it successfully commits. Each retry is made after a random amount of time.

Figure 6. Transaction throughput: HACommit vs. RCommit. Figure 7. Average transaction latency: HACommit vs. RCommit. Figure 8. Latency of update transactions: HACommit vs. RCommit. Figure 9. Transaction throughput under read-committed CC: HACommit vs. MDCC. Figure 10. Latency of UPDATE transactions under read-committed CC: HACommit vs. MDCC. Figure 11. Latency of READ transactions under read-committed CC: HACommit vs. MDCC.

Figure 6 shows the transaction throughputs when using HACommit and RCommit, and Figure 7 shows the average transaction latencies. The HACommit implementation has larger transaction throughputs than the RCommit implementation in all workloads. Besides, HACommit has lower transaction latencies than RCommit in all workloads. HACommit's advantage on transaction latency increases as the number of operations per transaction increases in the workloads. As both implementations use the same concurrency control and isolation level, the factors leading to HACommit's advantage over RCommit are two-fold. First, no costly logging is involved during the commit. Second, no persistence of data is needed.

We compare the update transaction latencies of HACommit and RCommit in Figure 8. Both implementations use the same concurrency control scheme and consistency level. We can see that HACommit outperforms RCommit. As the number of operations increases in the workloads, HACommit's advantage also increases. The advantage of HACommit is again due to a commit without logging and data persistence.

We also examine the transaction throughput and latency when using weaker isolation levels with HACommit. In this case, we compare HACommit against MDCC. We implemented HACommit-RC with the read-committed isolation level [41]. This is an isolation level comparable to that guaranteed by MDCC. HACommit-RC differs from HACommit in that it acquires no locks on reads. Figure 9 shows the transaction throughputs for HACommit-RC and MDCC. The latencies of update transactions and read transactions are shown in Figure 10 and Figure 11. HACommit-RC has larger transaction throughputs than MDCC in all workloads. The latencies of update transactions are lower in the HACommit-RC implementation than in the MDCC implementation, although the two have similar performance on read transactions. Both HACommit-RC and MDCC implement read transactions similarly and guarantee the read-committed consistency level. The reason that HACommit-RC has better performance in transaction throughput and update transaction latency is as follows. MDCC uses optimistic concurrency control, which can cause high abort rates under high contention, leading to lower performance than HACommit-RC, which uses pessimistic concurrency control. Besides, MDCC holds data with outstanding options, leading to the same effect as locking in committed transactions.

VIII. CONCLUSION

We have proposed HACommit, a logless one-phase commit protocol for highly-available datastores. In contrast to the classic vote-after-decide approach to distributed commit, HACommit adopts the vote-before-decide approach. In HACommit, the procedure for processing the last transaction operation is redesigned to overlap the last operation processing and the voting process. To commit a transaction in one phase, HACommit exploits Paxos and uses the unique client as the initial proposer. To exploit Paxos, HACommit designs a transaction context structure to keep the Paxos configuration information. Although client failures can be tolerated by the Paxos exploitation, HACommit designs a recovery process for client failures such that the transaction can actually end with the transaction data visible to other transactions. For participant replica failures, HACommit has participants replicate their votes and the transaction metadata to their replicas, and a failure recovery process is proposed to exploit the replicated votes and metadata. Our evaluation demonstrates that HACommit outperforms recent atomic commit solutions for highly-available datastores.
ACKNOWLEDGMENT

This work is also supported in part by the State Key Development Program for Basic Research of China (Grant No. 2014CB340402) and the National Natural Science Foundation of China (Grant No. 61303054).

REFERENCES

[1] J. Shute, R. Vingralek, B. Samwel, B. Handy, C. Whipkey, E. Rollins, M. Oancea, K. Littlefield, D. Menestrina, S. Ellner, J. Cieslewicz, I. Rae, T. Stancescu, and H. Apte, "F1: A distributed SQL database that scales," Proc. VLDB Endow., vol. 6, no. 11, pp. 1068–1079, Aug. 2013.
[2] "Amazon cloud goes down Friday night, taking Netflix, Instagram and Pinterest with it," October 2012, http://www.forbes.com/sites/anthonykosner/2012/06/30/amazon-cloud-goes-down-friday-night-taking-netflix-instagram-and-pinterest-with-it/.
[3] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab et al., "Scaling memcache at Facebook," in NSDI, vol. 13, 2013, pp. 385–398.
[4] "PayPal," https://www.paypal.com/.
[5] "Alipay," https://www.alipay.com/.
[6] "Baidu Wallet," https://www.baifubao.com/.
[7] D. Peng and F. Dabek, "Large-scale incremental processing using distributed transactions and notifications," in OSDI, vol. 10, 2010, pp. 1–15.
[8] J. Goldstein and P.-Å. Larson, "Optimizing queries using materialized views: A practical, scalable solution," in ACM SIGMOD Record, vol. 30, no. 2. ACM, 2001, pp. 331–342.
[9] P. A. Bernstein, V. Hadzilacos, and N. Goodman, Concurrency Control and Recovery in Database Systems. Addison-Wesley, New York, 1987, vol. 370.
[10] I. Zhang, N. K. Sharma, A. Szekeres, A. Krishnamurthy, and D. R. Ports, "Building consistent transactions with inconsistent replication," in Proceedings of SOSP '15. New York, NY, USA: ACM, 2015.
[11] F. Nawab, V. Arora, D. Agrawal, and A. El Abbadi, "Minimizing commit latency of transactions in geo-replicated data stores," in Proceedings of SIGMOD '15. ACM, 2015, pp. 1279–1294.
[12] J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild et al., "Spanner: Google's globally-distributed database," in Proceedings of OSDI, 2012, p. 1.
[13] L. Glendenning, I. Beschastnikh, A. Krishnamurthy, and T. Anderson, "Scalable consistency in Scatter," in Proceedings of SOSP. ACM, 2011, pp. 15–28.
[14] T. Kraska, G. Pang, M. J. Franklin, and S. Madden, "MDCC: Multi-data center consistency," in EuroSys, 2013.
[15] H. A. Mahmoud, A. Pucher, F. Nawab, D. Agrawal, and A. E. Abbadi, "Low-latency multi-datacenter databases using replicated commits," in Proc. of the VLDB Endowment, 2013.
[16] C. Mohan, B. Lindsay, and R. Obermarck, "Transaction management in the R* distributed database management system," ACM Trans. Database Syst., vol. 11, no. 4, pp. 378–396, Dec. 1986.
[17] M. Abdallah, R. Guerraoui, and P. Pucheral, "One-phase commit: Does it make sense?" in Proc. of the International Conference on Parallel and Distributed Systems. IEEE, 1998, pp. 182–192.
[18] M. K. Aguilera, "Stumbling over consensus research: Misunderstandings and issues," in Replication. Springer, 2010, pp. 59–72.
[19] L. Lamport, "The part-time parliament," ACM Transactions on Computer Systems, vol. 16, no. 2, pp. 133–169, 1998.
[20] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, "Benchmarking cloud serving systems with YCSB," in Proceedings of the 1st SoCC. ACM, 2010.
[21] J. Gray and L. Lamport, "Consensus on transaction commit," ACM Trans. Database Syst., vol. 31, no. 1, pp. 133–160, Mar. 2006.
[22] R. Guerraoui, M. Larrea, and A. Schiper, "Reducing the cost for non-blocking in atomic commitment," in Proc. of ICDCS. IEEE, 1996, pp. 692–697.
[23] R. Guerraoui and A. Schiper, "The decentralized non-blocking atomic commitment protocol," in Proc. of the IEEE Symposium on Parallel and Distributed Processing. IEEE, 1995, pp. 2–9.
[24] Y. Sovran, R. Power, M. K. Aguilera, and J. Li, "Transactional storage for geo-replicated systems," in Proc. of SOSP '11, pp. 385–400.
[25] S. Mu, Y. Cui, Y. Zhang, W. Lloyd, and J. Li, "Extracting more concurrency from distributed transactions," in Proc. of OSDI, 2014.
[26] E. P. Jones, D. J. Abadi, and S. Madden, "Low overhead concurrency control for partitioned main memory databases," in Proc. of SIGMOD. ACM, 2010, pp. 603–614.
