Archie: A Speculative Replicated Transactional System Sachin Hirve Roberto Palmieri Binoy Ravindran VirginiaTech VirginiaTech VirginiaTech [email protected] [email protected] [email protected] ABSTRACT SMA-based transactional systems, hereafter) so that trans- WepresentArchie,ahighperformancefault-toleranttrans- actionscanbeprocessedinawaysuchthattheoverallsys- actional system. Archie complies with the State Machine tem is always available even in presence of faults. In such systems, each node replicates the entire shared state (full Approach, where the transactional state is fully replicated replication). Ashigh-levelcharacterization,SMA-basedtrans- and total ordered transactions are executed on the repli- cas. Archie avoids the serial execution after transactions actionalsystemscanbeclassifiedaccordingtothetimewhen transactions are ordered globally. On the one hand, trans- get ordered, which is the typical bottleneck of those pro- actions can be executed by clients before a global certifica- tocols, by anticipating the work and using speculation to tionisinvokedtoresolveconflictsagainstothertransactions process transactions in parallel, enforcing a predefined or- der. The key feature of Archie is to avoid any non-trivial running on remote nodes. This approach is known as De- ferred Update Replication (DUR) [26, 22, 3, 14]. On the operation to perform post total order’s notification, in case other hand, clients can postpone the transaction execution thesequencernoderemainsstable(onlyasingletimestamp until the agreement on a common order is reached. This incrementisneededforcommittingatransaction). Thisap- way, they do not process transactions but simply broadcast proach significantly shortens the transaction’s critical path. transaction requests to all nodes and wait until the fastest The contention of speculative execution is always kept lim- replicareplies. ThismethodisknownasDeferredExecution ited by activating a fixed number of transactions at a time. Replication (DER) [16, 10]. A comprehensive evaluation, using three competitors and three well known benchmarks, shows that Archie outper- In both of these cases, an additional layer for ordering transactions is needed. This layer implements a total order formscompetitorsinallmedium/highcontentionscenarios. protocol, such as Multi-Paxos [15], for agreeing on a set of previously proposed values in the presence of faults. En- CategoriesandSubjectDescriptors forcing this order while transactions are certified/executed H.2.4 [Database Management]: Transaction processing; is usually done serially [26, 13, 10] (serial phase hereafter), D.4.5 [Software]: Fault-tolerance withasignificantimpactonthetotaltransactionexecution time because any task done after the establishment of the GeneralTerms global order increases latency perceived by clients. In ad- dition, in such systems, the throughput is always bound by Algorithms, Performance the performance of the (usually single) committer thread. One approach that overcomes this limitation consists of Keywords relaxing the total order while providing a set of partial or- ders where only conflicting transactions are commonly seri- Fault-Tolerance, Replication, Speculation alized[16]. Thisallowstheparallelizationoftheserialphase andalsooftheorderingprocessbutrequiresadditionalinfor- 1. INTRODUCTION mation from the application to classify submitted transac- TheStateMachineApproach(SMA)[23]isawell-known tions. However, in the presence of write-intensive workload coordination technique for building fault-tolerant services withmedium/highconflicts,establishingthetotalorderstill wherealldistributednodesreceiveandprocessthesamese- represents the most effective design choice. quence of requests. Many replicated transactional schemes InthispaperwepresentArchie,anSMA-basedtransac- provide SMA with a transaction semantics (we name them tional scheme that incorporates a set of protocol and sys- tem innovations that extensively use speculation for remov- ing any non-trivial task after the delivery of the transac- tion’sorder. Themaingoalof Archieistoavoidthetime- consuming operations (e.g., the entire transaction’s execu- tion or iterations over transaction’s read and written ob- jects)performedafterthisnotification,suchthatatransac- tion can be immediately committed. In order to accomplish the above goal, we designed Mi- MoX,anoptimizedsequencer-basedtotalorderlayerwhich inherits the advantages of two well-known mechanisms: the First,itmakesapplicationbehaviorindependentoffailures. optimistic notification [12, 17], issued to nodes prior to the Whenanodeinthesystemcrashesorstopsservingincoming establishment of the total order; and batching [21, 6], as a requests, other nodes are able to transparently service the means for improving the throughput of the global ordering same request, process the transaction, and respond back to process. MiMoX proposes an architecture that mixes these theapplication. Second,itdoesnotsufferfromabortsdueto two mechanisms, thus allowing the anticipation (thanks to contentiononsharedremoteobjectsbecauseacommonseri- theoptimisticnotification)ofabigamountofwork(thanks alizationorderisdefinedpriortostartingtransaction(local) to batching) before the total order is finalized. Nodes take execution, thus yielding high performance and better scala- advantageofthetimeneededforassemblingabatchtocom- bilityinmedium/highcontentionscenarios[13]. Third,with puteasignificantamountofworkbeforethedeliveryofthe DER,thesizeofnetworkmessagesexchangedforestablish- order is issued. This anticipation is mandatory in order to ingthecommonorderdoesnotdependonthetransaction’s minimize (and possibly eliminate the need for) the trans- logic(i.e.,thenumberofobjectsaccessed). Rather,itislim- action’s serial phase. As originally proposed in [17], Mi- ited to the name of the transaction and, possibly, its input MoX guarantees that if the sequencer node (leader) is not parameters, which reduces network usage and increases the replaced during the ordering process (i.e., either suspected ordering protocol’s performance. orcrashed),thesequenceofoptimisticnotificationsmatches AscommonlyadoptedinseveralSMA-basedtransactional thesequenceoffinalnotifications. Asadistinguishingpoint, systems[13,10,18]andthankstothefullreplicationmodel, the solution in [17] relies on a ring topology as a means for Archiedoesnotbroadcastread-onlyworkloadsthroughMi- deliveringtransactionsoptimistically,whereasMiMoXdoes MoX;read-onlyrequestsarehandledlocally,inparallelwith not assume any specific topology. the speculative execution. Processing write transactions Atthecoreof Archiethereisanovelspeculativeparallel (bothconflictingandnotconflicting)inthesameorderonall concurrency control, named ParSpec, that processes/certi- nodes allows Archie to guarantee 1-copy-serializability [2]. fies transactions upon their optimistic notification and en- WeimplementedArchieinJavaandweconductedacom- forcesthesameorderasthesequenceofoptimisticnotifica- prehensive experimental study using benchmarks including tions. Thekeyenablingpointforguaranteeingtheeffective- TPC-C [5], Bank and a distributed version of Vacation [4]. nessofParSpecisthatthemajorityoftransactionsspecula- Ascompetitors,weselectedoneDUR-based: PaxosSTM[26] tivelycommitbeforethetotalorderisdelivered. Thisgoalis –ahigh-performanceopensourcetransactionalsystem;and reached by minimizing the overhead caused by the enforce- two DER-based: one non-speculative (SM-DER [23]) and ment of a predefined order on the speculative executions. one speculative (HiperTM [10]) transactional system. ParSpec achieves this goal by the following steps: Our experiments on PRObE [7], a state-of-the-art public - Executing speculative transactions in parallel, but allow- cluster, reveal Archie’s high-performance and scalability. ingthemtospeculativecommitonlyin-order,thusreduc- On up to 19 nodes, Archie outperforms all competitors in ing the cost of detecting possible out-of-order executions; most of the tested scenarios. As expected, when the con- - Dividing the speculative transaction execution into two tentionisverylow,PaxosSTMbehavesbetterthanArchie. stages: thefirst,wherethetransactionisentirelyspecula- The paper makes the following contributions: tively executed and its modifications are made visible to - Archieisthefirstfully-implementedDER-basedtransac- the following speculative transactions; the second, where tionalsystemthateliminatescostlyoperationsduringthe aready-to-commitsnapshotofthetransaction’smodifica- serial phase by anticipating the work through speculative tions is pre-installed into the shared data-set, but not yet parallel execution. made available to non-speculative transactions. - MiMoX is the first total order layer that guarantees a A transaction starts its speculative commit phase only reliable optimistic delivery order (i.e., the optimistic or- when its previous transaction, according to the optimistic der matches the total order) without any assumption on order, becomes speculatively-committed and its modifica- thenetworktopology,andmaximizestheoverlappingtime tionsarevisibletoothersuccessivespeculativetransactions. (i.e.,thetimebetweentheoptimisticandrelativetotalor- The purpose of the second stage concerns only the non- dernotifications)whenthesequencernodeisnotreplaced speculative commit, thus it can be removed from the spec- (e.g,. due to a crash). ulative transaction’s critical path and executed in paral- - ParSpec is the first parallel speculative concurrency con- lel. Thisapproachincreasestheprobabilityofspeculatively trol that removes from the transaction’s critical path the committing a transaction before the total order is notified. tasktoinstallwrittenobjectsandimplementsalightweight Thefinalcommitofanalreadyspeculatively-committedtrans- commit procedure to make them visible. action consists of making the pre-installed snapshot avail- able to all. In case the MiMoX’s leader is stable during 2. ARCHIE the execution, ParSpec realizes this task without iterating Assumptions. We assume a distributed system, where over all transaction’s written objects but, rather, it just in- asetofprocessesΠ={p ,...,p }communicateusingmes- creases one local timestamp. Clients are informed about 1 n sagepassinglinks. Toeventuallyreachanagreementonthe theirtransactions’outcomewhileotherspeculativetransac- orderoftransactionswhennodesarefaulty,weassumethat tions execute. As a result, transaction latency is minimized the system can be enhanced with the weakest type of unre- and ParSpec’s high throughput allows more clients to sub- liable failure detector [8] that is necessary to implement a mit requests. leader election. The principles at the base of Archie can be applied in Nodesmayfailaccordingtothefail-stop(crash)model[2]. bothDUR-andDER-basedsystems. Forthepurposeofthis Weassume2f+1nodeswhereatmostf nodesaresimulta- paper, we optimized Archie to cope with the DER model. neouslyfaulty. Inanycommunicationstep,anodecontacts This is because DER has three main benefits over DUR. allothernodesandwaitsforaquorumQofreplies. Weas- sumetheclassicalquorum[15]Q=f+1suchthataquorum nodeinthesystem,calledtheleader,isresponsiblefordefin- canalwaysbeformedbecauseN−f ≥Q. Thiswayanytwo ing the order of the messages. quorums always intersect, thus ensuring that, even though MiMoX provides the APIs of Optimistic Atomic Broad- f failureshappen,thereisalwaysatleastonenodewiththe cast [12]: broadcast(m), which is used by clients to broad- lastupdatedinformationthatwecanuseforrecoveringthe castamessagemtoallnodes;final-delivery(m),whichis system. Further, we consider only non-byzantine faults. used for notifying each replica on the delivery of a message For the sake of generality and following the trend of [13, m(orabatchofthem);andopt-delivery(m),whichisused 10,16],weadopttheprogrammingmodelofsoftwaretrans- for early-delivering a previously broadcast message m (or a actional memory (STM) [24] and its natural extension to batch of them) before the final-delivery(m) is issued. distributed systems (i.e., DTM). DTM allows the program- Each MiMoX message that is delivered is a container of mer to simply mark a set of operations with transactional either a single transaction request or a batch of transaction requirements as an“atomic block”. The DTM framework requests (when batching is used). The sequence of final- transparently ensures the block’s transactional properties delivery(m) events, called final order, defines the transac- whileexecutingitconcurrentlywithotherblocks. Thereex- tion serialization order, which is the same for all the nodes istsmanyframeworksthatintegrateSTM-likeatomicblock in the system. The sequence of opt-delivery(m) events, abstraction with different types of transactional systems, calledoptimisticorder,definestheoptimistictransactionse- such as Key-Value store [19, 22]. The adoption of this rializationorder. Sinceonlythefinalorderistheresultofa model does not restrict the applicability of our approach. distributed agreement, the optimistic order may differ from Archieonlyassumestheexistenceoftransactionalreadand the final order and may also differ among nodes (i.e., each write operations to instrument. As all DER-based systems, node may have its own optimistic order). As we will show snapshot-deterministic [18, 16] transactions are assumed. later, MiMox guarantees the match between the optimistic This excludes any forms of non-determinism. and final order when the leader is not replaced (i.e., stable) TransactionProcessingModel. Archiedefinesappli- during the ordering phase. cationthreadsthatinvoketransactions(alsocalledclients), and service threads that process transactions. These two 3.1 OrderingProcess groups of threads do not necessarily run on the same phys- MiMoX defines two types of batches: opt-batch, which ical machine. Our transaction processing model is similar groups messages from the clients, and final-batch, which tothemulti-tieredarchitecturethatispopularinrelational storestheidentificationofmultipleopt-batches. Eachfinal- databases and other modern storage systems, where dedi- batch is identified by an unique instance_ID. Each opt- cated threads (different from threads that invoke transac- batch is identified by a pair <instance_ID, #Seq>. tions) process transactions. When a client broadcasts a request using MiMoX, this Archieprovidesanapplication-levellibrarythatcontains request is delivered to the leader which aggregates it into a interfacesforinteractingwiththetransactionalsystemsuch batch(theopt-batch). Inordertopreservetheorderofthese astheinvokeprocedure. Weadoptthestore-procedureab- steps,andforavoidingsynchronizationpointsthatmayde- straction,commoninDBMS,wheretheclientdoesnotsend grade performance, we rely on single-thread processing for allthetransactionaloperationstotheservicethreads. When thefollowingtasks. Foreachopt-batch,MiMoXcreatesthe aclientperformsatransactionT ,itiswrappedintoatrans- x pair<instance_ID,#Seq>,whereinstance_IDistheiden- actionrequestREQ(T )andmarkedasaread-only orwrite x tifierofthecurrentfinal-batchthatwillwraptheopt-batch, transaction depending on T ’s operations. If all operations x and#Seqis the positionof the opt-batchinthe final-batch. are reads, T is marked as read-only (T ); otherwise as a x r When the pair is defined, it is appended to the final-batch. write (T ). For a write transaction, REQ(T ) is passed to w w At this stage, instead of waiting for the completion of the MiMoX,whichisresponsiblefororderingREQ(T )among w final-batch and before creating the next opt-batch, MiMoX all the other requests submitted concurrently. sendsthecurrentopt-batchtoallthenodes,waitingforthe MiMoX delivers each transaction request (or a batch of relativeacknowledgments. Usingthismechanism,theleader them)twice,onceoptimisticallyandoncefinally. Thesetwo informs nodes about the existence of a new batch while the events are handled at each node by ParSpec, the local con- final-batchisstillaccumulatingrequests. Thisway,MiMoX currencycontrolprotocol. WhenREQ(T )isoptimistically w maximizestheoverlapbetweenthetimeneededforcreating delivered, ParSpec extracts T from REQ(T ), retrieves w w thefinal-batchwiththelocalprocessingofopt-batches;and T ’s business logic, and starts executing it speculatively. w enablesnodestoeffectivelyprocessmessages,thankstothe When REQ(T ) is finally delivered, ParSpec commits T w w reliable optimistic order. if T executed in an order compliant with the final deliv- w Each node, upon receiving the opt-batch, immediately eryorder. Otherwise,T isabortedandrestarted. ParSpec w triggers the optimistic delivery for it. As in [17], we be- works completely locally; no network interaction is needed. lievethatwithinadata-centerthescenarioswheretheleader When a client issues a read-only transaction T , the li- r crashesorbecomessuspectedarerare. Iftheleaderisstable brarytargetsonenodeinthesystemanddeliversREQ(T ) r foratleastthedurationofthefinal-batch’sagreement,then directlytothatnode,withoutorderingtherequestglobally. eveniftheopt-batchisreceivedout-of-orderwithrespectto other opt-batches sent by the leader, this possible reorder- 3. MIMOX ing is still nullified by the ordering information (i.e., #Seq) MiMoX is a network system that ensures total order of stored within each opt-batch. messagesacrossremotenodes. ItreliesonMulti-Paxos[15], After sending the opt-batch, MiMoX loops again serving an algorithm of the Paxos family, which guarantees agree- the next opt-batch, until the completion of the final-batch. ment on a sequence of values in the presence of faults (i.e., When ready, MiMoX uses the Multi-Paxos algorithm for total order). MiMoX is sequencer-based – i.e., one elected establishing an agreement among nodes on the final-batch. Theleaderproposes anorderforthefinal-batches,towhich 64-cores). Thememoryavailableis128GB,andthenetwork the other replicas reply with their agreement – i.e., accept connection is a high performance 40 Gigabit Ethernet. messages. When a majority of agreement for a proposed For the purpose of the study, we decided to finalize an order is reached, each replica considers it as decided. opt-batch when it reaches the maximum size of 12K bytes Themessagesizeofthefinal-batchisverylimitedbecause andafinal-batchwhenitreaches5opt-batches,orwhenthe it contains only the identifiers of opt-batches that have al- time needed for building them exceeds 10 msec, whichever ready been delivered to nodes. This makes the agreement occurs first. All data points reported are the average of six process fast and includes a high number of client messages. repeated measurements. 3.2 HandlingFaultsandRe-transmissions 130 MiMoXensuresthat,oneachnode,anaccept istriggered 10 bytes 125 20 bytes for a proposed message (or batch) m only if all the opt- c se 120 50 bytes pbraotcpheerstybperloenvegnintgs ltoossmofhmaveessbaegeens breecloenivgeidn.gEtonfaolrrceiandgytdheis- s per 115 cided messages (or batches). sg 110 m As an example, consider three nodes {N1,N2,N3}, where x 105 N1 istheleader. Thefinal-batch(FB)iscomposedofthree 000 100 opt-batches: OB1,OB2,OB3. N1sendsOB1toN2andN3. 1 95 ThenitdoesthesameforOB2andOB3. ButN2andN3do 90 not receive both messages. After sending OB , the FB is 3 5 7 9 11 13 15 17 19 3 complete,andN sendsthepropose messageforFB. Nodes Nodes 1 N andN sendtheaccept messagetotheothernodes,rec- 2 3 ognizingthatthereareunknownopt-batches(i.e.,OB and Figure 1: MiMoX’s message throughput. 2 OB ). The only node having all the batches is N . There- 3 1 fore, N2 and N3 request N1 for the re-transmission of the Figure 1 shows MiMoX’s throughput in requests ordered missing batches. In the meanwhile, each node receives the per second. For this experiment, we varied the number of majority of accept messages from other nodes and triggers nodes participating in the agreement and the size of each thedecide forFB. Atthisstage,ifN1 crashes,eventhough request. Clearly, the maximum throughput (122K requests FB has been agreed, OB2 and OB3 are lost, and both N2 ordered per second) is reached when the node count is low and N3 cannot retrieve their content anymore. (3 nodes). However, the percentage of degradation in per- We solve this problem using a dedicated service at each formance is limited when the system size is increased: with node, which is responsible for re-transmitting lost messages 19 nodes and request size of 10 bytes, the performance de- (or batches). Each node, before sending the accept for an creases by only 11%. FB,mustreceivealltheopt-batches. TheFB iscomposed Figure 1 shows also the results for request sizes of 20 of the identification of all the expected opt-batches. Thus, and 50 bytes. Recall that Archie’s transaction execution each node is easily able to recognize the missing batches. process leverages the ordering layer only for broadcasting Assuming that the majority of nodes are non-faulty, the the transaction ID (e.g., method or store-procedure name), re-transmission request for one or multiple opt-batches is alongwithitsparameters(ifany),andnottheentiretrans- broadcast to all the nodes such that, eventually the entire action business logic. Other solutions, such as the DUR sequence of opt-batches belonging to FB is rebuilt and the scheme,usethetotalorderlayerforbroadcastingthetrans- accept message is sent. action read- and write-set after a transaction’s completion, Nodescandetectamissingbatchbeforethepropose mes- resulting in larger request size than Archie’s. In fact, our sage for the FB is issued. Exploiting the sequence number evaluationswithBankandTPC-Cbenchmarksrevealedthat andtheFB’sIDusedforidentifyingopt-batches,eachnode almost all the transaction requests can be compacted be- can easily find a gap in the sequence of the opt-batches re- tween 8 and 14 bytes. MiMoX’s performance for a request ceived, that belong to the same FB (e.g., if OB1 and OB3 sizeof20bytesisquiteclosetothatfor10byterequestsize. are received, then, clearly, OB2 is missing). Thus, the re- We observe a slightly larger gap with 19 nodes and 50 byte transmission can be executed in parallel with the ordering, requestsize,wherethethroughputobtainedis104K.Thisis withoutadditionaldelay. Theworstcasehappenswhenthe a performance degradation lesser than 15% with respect to missing opt-batch is the last in the sequence. In this case, themaximumthroughput. Thisisbecause,withsmallerre- the propose message of FB is needed to detect the gap. quests (10 or 20 bytes), opt-batches do not get filled to the maximum size allowed, resulting in smaller network mes- 3.3 Evaluation sages. Ontheotherhand,largerrequests(50bytes)tendto We evaluated MiMoX’s performance by an experimen- fill batches sooner, but these bigger network messages take tal study. We focused on MiMoX’s scalability in terms of more time to traverse. the system size, the average time between optimistic and Figure2showsMiMoX’sdelaybetweentheoptimisticand finaldelivery,thenumberofrequestsinopt-batchandfinal- the relative final delivery, named overlapping time. This batch, and the size of client requests. We used the PRObE experimentisthesameasthatreportedinFigure1. MiMoX testbed [7], a public cluster that is available for evaluating achieves a stable overlapping time, especially for a request systems research. Our experiments were conducted using size of 10 bytes, of ≈8 msec. This delay is non-negligible 19 nodes (tolerating up to 9 failures) in the cluster. Each if we consider that Archie processes transactions locally. nodeisequippedwithaquadsocket,whereeachsockethosts Usingbiggerrequests,thefinal-batchbecomesreadysooner anAMDOpteron6272,64-bit,16-core,2.1GHzCPU(total because less requests fit in one final-batch. As a result, the arestillexecutingoperationsorthatarenotallowedtospec- 14 10 bytes ulativelycommityet. Moreover,eachtransactionT records 20 bytes ec) 12 50 bytes its optimistic order in a field called T.OO. T’s optimistic ms 10 order is the position of T within its opt-batch, along with ds ( 8 the position of the opt-batch in the (possible) final-batch. n ParSpec’smaingoalistoactivateinparallelasetofspec- co 6 e ulative transactions, as soon as they are optimistically de- millis 4 livered,andtoentirelycompletetheirexecutionbeforetheir 2 final order is notified. 0 As a support for the speculative execution, the following 3 5 7 9 11 13 15 17 19 meta-dataareused: abort-array,whichisabit-arraythat Nodes signalswhenatransactionmustabort;LastX-committedTx, whichstorestheIDofthelastx-committedtransaction;and Figure 2: Time between optimistic/final delivery. SCTS, the speculative commit timestamp, which is a mono- tonically increasing integer that is incremented each time a transactionx-commits. Also,eachnodeisequippedwithan timebetweenoptimisticandfinaldeliverydecreases. Thisis additional timestamp, called CTS, which is an integer incre- particularlyevidentwitharequestsizeof50bytes,wherewe mented each time a non-speculative transaction commits. observe an overlapping time that is, on average, 4.6 msec. For each shared object, a set of additional information is ThelastresultsmotivateourdesignchoicetoadoptDER alsomaintainedforsupportingParSpec’soperations: (1)the as a replication scheme instead of DUR. listofcommittedandx-committedversions;(2)theversion written by the last x-committed transaction, called spec- Request Final-batch Opt-batch %ReNF %ReF version; (3) the boolean flag called wait-flag, which in- size(bytes) size size 10 4.91 230.59 0% 1.7% dicates that a speculative active transaction wrote a new 20 4.98 166.45 0% 3.4% versionoftheobject,andwait-flag.OO,theoptimisticor- 50 5.12 90.11 0% 4.8% der of that transaction; and (4) a bit-array called readers- array,whichtracksactivetransactionsthatalreadyreadthe Table 1: Size of requests, batches, and % reorders. objectduringtheirexecution. Committed(orx-committed) versionscontainVCTS, whichis the CTS(or the SCTS)of the Table 1 shows other information collected from the pre- transaction that committed (or x-committed) that version. vious experiments. It is interesting to observe the number Thesizeoftheabort-arrayandreaders-arrayisbounded of opt-batches that makes up a final-batch (5 on average) byMaxSpec,whichisanintegerdefiningthemaximumnum- and the number of client requests in each opt-batch (varies ber of speculative transactions that can run concurrently. from 90 to 230 depending on the request size). This last MaxSpec is fixed and set a priori at system start-up. It can information confirms the reason for the slight performance be tuned according to the underlying hardware. drop using requests of 50 bytes. In fact, in this case each When an opt-batch is optimistically delivered, ParSpec opt-batch transfers ≈ 4500 bytes in payload, as compared extracts the transactions from the opt-batch and processes to ≈ 2300 bytes for request size of 10 bytes. them, activating MaxSpec transactions at a time. Once all Intheseexperiments,weusedTCPconnectionsforsend- thesespeculativetransactionsfinishtheirexecution,thenext ingopt-batches. SinceMiMoXusesasinglethreadforsend- set of MaxSpec transactions is activated. As it will be clear ing opt-batches and for managing the final-batch, reorders later, this approach allows a quick identification of those betweenoptimisticandfinaldeliveriescannothappenexcept transactions whose history is not compliant anymore with when the leader crashes or is suspected. Table 1 supports theoptimisticorder,thustheymustbeabortedandrestarted. this. It reports the maximum reordering percentages ob- Intheabort-arrayandreaders-array,eachtransaction servedwhenleaderisstable(columnRe NF)andwhenthe hasitsinformationstoredinaspecificlocationsuchthat,if leader is intentionally terminated after a period of stable twotransactionsT andT areoptimisticallyordered,sayin a b execution (column Re F), using 19 nodes. the order T > T , then they will be stored in these arrays a b respecting the invariant T >T . a b 4. PARSPEC Sincetheoptimisticorderisamonotonicallyincreasingin- teger,foratransactionT,thepositioni=T.OOmodMaxSpec ParSpec is the concurrency control protocol that runs lo- stores T’s information. When abort-array[i]=1, T must cally at each node. MiMoX delivers each message or a abort because its execution order is not compliant anymore batch of messages twice: once optimistically and once fi- with the optimistic order. Similarly, when an object obj nally. These two events are the triggers for activating Par- has readers-array[i]=1, it means that the transaction T Spec’s activities. Without loss of generality, hereafter, we performed a read operation on obj during its execution. will refer to a message of MiMoX as a batch of messages. Speculative active transactions make available new ver- Transactions are classified as speculative: i.e., those that sionsofwrittenobjectsonlywhentheyx-commit. Thisway, are only optimistically delivered, but their final order has other speculative transactions cannot access intermediate not been defined yet; and non-speculative: i.e., those whose snapshots of active transactions. However, when MaxSpec final order has been established. Among speculative trans- transactions are activated in parallel, multiple concurrent actions,wecandistinguishbetweenspeculatively-committed writesonthesameobjectcouldhappen. Whenthosetrans- (or x-committed hereafter): i.e., those that have completely actionsreachtheirx-commitphase,differentspeculativever- executedalltheiroperationsandcannotbeabortedanymore sions of the same object could be available for readers. As byotherspeculativetransactions;andactive: i.e.,thosethat an example, consider four transactions {T ,T ,T ,T } that theyprogressivelyx-commit,accordingtotheoptimisticor- 1 2 3 4 are optimistically delivered in this order. T and T write der. InParSpec,transactionalwriteoperationsarebuffered 1 3 tothesameobjectO ,andT andT readfromO . When locally in a transaction’s write-set. Therefore, they are not a 2 4 a T and T reach the speculative commit phase, they make availableforconcurrentreadsbeforethewritingtransaction 1 3 twospeculativeversionsofOa available: OaT1 andOaT3. Ac- x-commits. Thewriteprocedurehasthemaingoalofabort- cordingtotheoptimisticorder,T2’sreadshouldreturnOaT1 ing those speculative active transactions that are serialized andT4’sreadshouldreturnOaT3. Eventhoughthisapproach after (in the optimistic order) and a) wrote the same ob- maximizesconcurrency,itsimplementationrequirestravers- ject,and/orb) previouslyreadthesameobject(butclearly ing the shared lists of transactional meta-data, resulting in a different version). high transaction execution time and low performance [1]. When a transaction T performs a write operation on an i ParSpec finds an effective trade-off between performance object X and finds that X.wait-flag = 1, ParSpec checks and overhead for managing meta-data. In order to avoid the optimistic order of the speculative transaction T that j maintainingalistofspeculativeversions,ParSpecallowsan wroteX. IfX.wait-flag.OO>T .OO,thenitmeansthatT i j active transaction to x-commit only when the speculative isserializedafterT . So,anabortforT istriggeredbecause i j transaction optimistically ordered just before it is already T isaconcurrentwriteronXandonlyoneX.spec-versionis j x-committed. Formally, given two speculative transactions allowedforX. Onthecontrary,ifX.wait-flag.OO<T .OO i T and T such that T .OO = {T .OO}+1, T is allowed (i.e., T is serialized before T according to the optimistic x y y x y j i to x-commit only when T is x-committed. Otherwise, T order)thenT ,beforeproceeding,loopsuntilT x-commits. x y i j keeps spinning even when it has executed all of its opera- SinceanewversionofX writtenbyT willeventuallybe- i tions. T easily recognizes T ’s status change by reading comeavailable,allspeculativeactivetransactionsoptimisti- y x the shared field LastX-committedTx. We refer to this prop- cally delivered after T that read X must be aborted and i erty as rule-comp. By rule-comp, read and write operations restarted so that they can obtain X’s new version. Identi- becomeefficient. Infact,whenatransactionT readsanob- fying those speculative transactions that must be aborted ject, only one speculative version of the object is available. is a lightweight operation in ParSpec. When a speculative Therefore,T’sexecutiontimeisnotsignificantlyaffectedby transactionx-commits,itshistoryisfixedandcannotchange the overhead of selecting the appropriate version according because all the speculative transactions serialized before it to T’s history. In addition, due to rule-comp, even though have already x-committed. Thus, only active transactions two transactions may write to the same object, they can x- can be aborted. Object X keeps track of readers using the commitandmakeavailabletheirnewversionsonlyin-order, readers-arrayandParSpecusesitfortriggeringanabort: one after another. This policy prevents any x-committed allactivetransactionsthatappearinthereaders-arrayaf- transaction to abort due to other speculative transactions. terT ’sindexandhavinganentryof1areaborted. Finally, i In the following, ParSpec’s operations are detailed. before including the new object version in T ’s write-set, i ParSpecsetsX.wait-flag=1andX.wait-flag.OO=T .OO. i 4.1 TransactionalReadOperation Finally, if a write operation is executed on an object al- When a write transaction T performs a read operation readywrittenbythetransaction,itsvalueissimplyupdated. i on an object X, it checks whether another active transac- tion T is writing a new version of X and T ’s optimistic 4.3 X-Commit j j order is prior to Ti’s. In this case, it is useless for Ti to Aspeculativeactivetransactionthatfinishesallofitsop- access the spec-version of X because, eventually, Tj will erations enters the speculative commit (x-commit) phase. x-commit, and Ti will be aborted and restarted in order Thisphasehasthreepurposes: thefirst(A) istoallownext to access Tj’s version of X. Aborting Ti ensures that its speculativeactivetransactionstoaccessthenewspeculative serialization order is compliant with the optimistic order. versions of the written objects; the second (B) is to allow Ti is made aware about the existence of another transac- subsequent speculative transactions to x-commit; the third tion that is currently writing X through X.wait-flag, and (C) is to prepare“future”committed versions (not yet visi- about its order through X.wait-flag.OO. If X.wait-flag=1 ble) of the written objects such that, when the transaction and X.wait-flag.OO < Ti.OO, then Ti waits until the pre- is eventually final delivered, those versions will be already vious condition is no longer satisfied. For the other cases, available and its commit will be straightforward. However, namelywhenX.wait-flag=0orX.wait-flag.OO>Ti.OO,Ti in order to accomplish (B), only (A) must be completed proceedswiththereadoperationwithoutwaiting,accessing while (C) can be executed later. This way, ParSpec antici- thespec-version. Specifically,ifX.wait-flag.OO>Ti.OO, patestheeventthattriggersthex-commitofthenextspec- then it means that another active transaction Tk is writing ulative active transactions, while executing (C) in parallel toX. But,accordingtotheoptimisticorder,Tk isserialized with that. after Ti. Thus, Ti can simply ignore Tk’s concurrent write. Step (A). All the versions written by transaction Ti are After successfully retrieving X’s value, Ti stores it in its moved from Ti’s write-set to the spec-version field of the read-set, signals that a read operation on X has been com- respectiveobjectsandtherespectivewait-flagsarecleared. pleted,andsetstheflagcorrespondingtoitsentryinX.readers- Thisway,thenewspeculativeversionscanbeaccessedfrom array. This notification is used by writing transactions to other speculative active transactions. At this time, a trans- abort inconsistent read operations that are performed be- actionT thataccessedanyobjectconflictingwithT ’swrite- j i fore a previous write takes place. set objects and is waiting on wait-flags can proceed. Inaddition,duetorule-comp,anx-committedtransaction 4.2 TransactionalWriteOperation cannot be aborted by any speculative active transaction. The rule-comp prevents two or more speculative transac- Therefore,allthemeta-dataassignedtoT mustbecleaned i tionsfromx-committinginparallelandinanyorder. Rather, upforallowingthenextMaxSpecspeculativetransactionsto execute from a clean state. without being validated because the optimistic order is not Step(B).Thisstepisstraightforwardbecauseitonlycon- reliable anymore. For this reason, the commit is executed sistsofincreasingSCTS,thespeculativecommittimestamp, using a single thread. Transaction validation consists of whichisincrementedeachtimeatransactionx-commits,as checking if all the versions of the read objects during the well as increasing LastX-committedTx. speculativeexecutioncorrespondtothelastcommittedver- Step (C). This step is critical for avoiding the iteration sions of the respective objects. If the validation succeeds, on the transaction’s write-set to install the new commit- then the commit phase is equivalent to the one in scenario ted versions during the serial phase. However, this step (A). When the validation fails, the transaction is aborted does not need to be in the critical path of subsequent ac- and restarted for at most once. The re-execution happens tive transactions. For this reason (C) is executed in paral- on the same committing thread and accesses all the last lel to subsequent active transactions after updating LastX- committed versions of the read objects. committedTx,suchthatthechainofspeculativetransactions In both the above scenarios, clients must be informed waiting for x-commit can evolve. about transaction’s outcome. ParSpec accomplishes this For each written object, a new committed, but not yet task asynchronously and in parallel, rather than burdening visible, version is added to the object’s version list. The the commit phase with expensive remote communications. visibility of this version is implemented leveraging SCTS. 4.5 Abort Specifically, SCTS is assigned to the VCTS of the version. SCTS is always greater than CTS because the speculative Onlyspeculativeactivetransactionsandx-committedtrans- execution always precedes the final commit. This way (as actions whose total order has already been notified can be we will show in Section 4.6) no non-speculative transaction aborted. In the first case, ParSpec uses the abort mecha- can access that version until CTS is equal to SCTS. If the nism for restarting speculative transactions with an execu- MiMox’s leader is stable in the system (i.e., the optimistic tionhistorythatisnon-compliantwiththeoptimisticorder. orderisreliable),thenwhenCTS reachesthevalueofSCTS, Forcing a transaction T to abort means simply to set the thenthespeculativetransactionhasalreadybeenexecuted, T’s index of the abort-array. However, the real work for validatedandallofitsversionsarealreadyavailabletonon- annulling the transaction context and restarting from the speculative transactions. beginning is executed by T itself by checking the abort- array. Thischeckismadeafterexecutinganyreadorwrite 4.4 Commit operation and when T is waiting to enter the x-commit i phase. The abort of a speculative active transaction con- The commit event is invoked when a final-batch is deliv- sists of clearing all of its meta-data before restarting. ered. Atthisstage,twoscenarioscanhappen: (A)thefinal- In the second case, the abort is needed because the spec- batch contains the same set of opt-batches already received ulative transaction x-committed with a serialization order inthesameorder,or(B)theoptimisticorderiscontradicted differentfromthetotalorder. Inthiscase,beforerestarting by the content of the final-batch. the transaction as non-speculative, all the versions written Scenario (A) is the case when the MiMoX’s leader is not bythex-committedtransactionmustbedeletedfromtheob- replaced while the ordering process is running. This rep- jects’versionlists. Infact,duetothesnapshot-deterministic resents the normal case within a data-center, and the best execution,thenewsetofwrittenversionscandifferfromthe case for Archie because the speculative execution can be x-committedset,thussomeversioncouldbecomeincorrectly actuallyleveragedforcommittingtransactionswithoutper- available after the increment of CTS. forming additional validation or re-execution. In fact, Par- Spec’s rule-comp guarantees that the speculative order al- 4.6 Read-OnlyTransactions ways matches the optimistic order, thus if the latter is also When a read-only transaction is delivered to a node, it is confirmed by the total order, it means that the speculative immediately processed, accessing only the committed ver- execution does not need to be validated anymore. sions of the read objects. This way, read-only workloads In this scenario, the only duty of the commit phase is do not interfere with the write workloads, thus limiting the to increase CTS. Given that, when CTS=Y, it means that synchronization points between them. A pool of threads thex-committedtransactionwithSCTS=Y hasbeenfinally is reserved for executing read-only transactions. Before a committed. Non-speculative transactions that start after read-onlytransactionT performsitsfirstreadoperationon this increment of CTS will be able to observe the new ver- i an object, it retrieves the CTS of the local node and as- sions written during the step (C) of the x-commit of the signs this value to its own timestamp (T .TS). After that, transaction with SCTS=Y. i the set of versions available to T is fixed and composed of Usingthisapproach,ParSpeceliminatesanycomplexop- i allversionswithVCTS ≤T .TS–i.e.,T cannotaccessnew eration during the commit phase and, if most of the trans- i i versionscommittedbyanyT orderedafterT . Someobject actionsx-commitbeforetheirnotificationofthetotalorder, j i could have, inside its version list, versions with a VCTS > then they are committed right away, paying only the delay T .TS. These versions are added from x-committed trans- of the total order. If the transaction does not contain mas- i actions, but not yet finally committed, thus their access is sive non-transactional computation, then the iteration on prohibited to any non-speculative transaction. thewrite-setforinstallingthenewcommittedversions,and the iteration on the read-set for validating the transaction, 5. CONSISTENCYGUARANTEES have almost the same cost as running the transaction from scratch after the final delivery. This is because, once the Archieensures1-CopySerializability[2]asaglobalprop- totalorderisdefined,transactionscanexecutewithoutany erty, and it ensures also that any speculative transaction overhead, such as logging in the read-set or write-set. (active,x-committedandaborted)alwaysobservesaserial- In scenarios like (B), transactions cannot be committed izable history, as a local property. 1-Copy Serializability. Archie ensures 1-Copy Seri- reason, it is true also for the other cases). In other words, alizability. The main argument that supports this claim is duetothetotalorderofwritetransactions,therearenotwo that transactions are validated and committed serially. We read-only transactions, running on the same node or differ- can distinguish two cases according to the reliability of the entnodes,thatcanobservethesametwowritetransactions optimistic delivery order with respect to the final delivery serialized differently. order: i) when the two orders match, and the final commit Serializable history. In ParSpec, all speculative trans- proceduredoesnotaccomplishanyvalidationprocedure;ii) actions (including those that will abort) always observe a when the two orders do not match, thus the validation and history that is serializable. This is because new specula- a possible re-execution are performed. tiveversionsareexposedonlyattheendofthetransaction, The case ii) is straightforward to prove because, even whenitcannotabortanymore;andbecauseeachspeculative thoughtransactionsareactivatedandexecutedspeculatively, transactionchecksitsabortbitafteranyoperation. Assume they are validated before being committed. The validation, three transactions T , T and T , optimistically ordered in 1 2 3 aswellasthecommit,processissequential. Thisruleholds this way. T x-commits a new version of object A, called 1 even for non-conflicting transactions. Combining serial val- A andT overwritesAproducingA . Italsowritesobject 1 2 2 idation with the total order of transactions guarantees that B, creating B . Both T and T run in parallel while T 2 2 3 1 allnodeseventuallyvalidateandcommitthesamesequence already x-committed. Now T reads A from T (i.e., A ) 3 1 1 of write transactions. The ordering layer ensures the same and subsequently T starts to x-commit. T publishes the 2 2 sequence of delivery even in the presence of failures, there- A ’s speculative version and flags T to abort because its 2 3 fore, all nodes eventually reach the same state. execution is not compliant with the optimistic order any- The case i) is more complicated because transactions are more. Then T continues its x-commit phase exposing B ’s 2 2 notvalidatedafterthenotificationofthefinalorder;rather, speculative version. In the meanwhile, T starts a read op- 3 they directly commit after increasing the commit times- eration on B before being flagged by T , and it finds B . 2 2 tamp. For this case we rely on MiMoX, which ensures that Even though T is marked as aborted, it already started 3 all final delivered transactions are always optimistically de- the read operation on B before checking the abort-bit. For livered before. Given that, we can consider the speculative this reason, this check is done after the read operation. In commitasthefinalcommitbecause,afterthat,thetransac- theexample,whenT finishesthereadoperationonB,but 3 tionisensuredtonotabortanymoreandeventuallycommit. before returning B to the executing thread, it checks the 2 Theexecutionofaspeculativetransactionisnecessarilyseri- abort-bitanditabortsduetothepreviousreadonA. Asa alizablebecauseallofitsreadoperationsaredoneaccording result,thehistoryofaspeculativetransactionisalways(and toapredefinedorder. Incaseareadoperationaccessesaver- at any point in time) compliant with the optimistic order, sionsuchthatitsexecutionbecomesnotcompliantwiththe thus preventing the violation of serializability. optimistic order anymore, the reader transaction is aborted and restart. In addition, transactions cannot speculatively 6. IMPLEMENTATIONANDEVALUATION commit in any order or concurrently. They are allowed to We implemented Archie in Java: MiMoX’s implementa- do so only serially, thus reproducing the same behavior as tion inherited JPaxos’s [21, 14] software architecture, while the commit phase in case ii). ParSpechasbeenbuiltfromscratch. Asatestbed,weused Read-only transactions are processed locally without a PRObE [7] as presented in Section 3. ParSpec does not global order. They access only committed versions of ob- commitversionsonanystablestorage. Thetransactionpro- jects,andtheirserializationpointisdefinedwhentheystart. cessingisentirelyexecutedin-memorywhilefault-tolerance At this stage, if we consider the history composed of all is ensured through replication. the committed transactions, when a read-only transaction WeselectedthreecompetitorstocompareagainstArchie. starts, it defines a prefix of that history such that it can- Twoarestate-of-the-art,open-source,transactionalsystems not change over time. Versions committed by transactions based on state-machine replication. One, PaxosSTM [26] serialized after this prefix are not visible by the read-only implementstheDURmodel,whiletheother,HiperTM[10], transaction. Considerahistoryofcommittedwritetransac- complieswiththeDERmodel. Asthethirdcompetitor,we tions, H={T , ...,T , ..., T }. Without loss of generality, 1 i n implementedtheclassicalDERscheme(calledSM-DER)[23], assumethatT committedwithtimestamp1;T committed 1 i wheretransactionsareorderedthroughJPaxos[14]andpro- withtimestampi;andT committedwithtimestampn. All n cessedinasinglethreadafterthetotalorderisestablished. nodes in the system eventually commit H. Different com- PaxosSTM [26] processes transactions locally, and relies mitordersforthesetransactionsarenotallowedduetothe on JPaxos [14] as a total order layer for their global certifi- totalorderenforcedbyMiMoX.Supposethattworead-only cation across all nodes. On the other hand, HiperTM [10], transactions T and T , executing on node N and N , re- a b a b as Archie, exploits the optimistic delivery for anticipating spectively, access the same shared objects. Let T perform a theworkbeforethenotificationofthefinalorder,butitpro- itsfirstreadoperationonX accessingthelastversionofX cessestransactionsonsinglethread. Inaddition,HiperTM’s committed at timestamp k, and T at timestamp j. Let P b a ordering layer is not optimized for maximizing the time be- and P be the prefixes of H defined by T and T , respec- b a b tween optimistic and final delivery. tively. P (H)={T , ...,T } such that k ≤ i and P (H)= a 1 k b Eachcompetitorprovidesitsbestperformanceunderdif- {T ,...,T }suchthatj ≤i. P andP canbeeithercoinci- 1 j a b ferentworkloads,thustheyrepresentacomprehensiveselec- dent,oroneisaprefixoftheotherbecausebothareprefixes tion to evaluate Archie. Summarizing, PaxosSTM ensures of H: i.e., if k <j, then P is a prefix of P ; if k >j, then a b high-performance in workloads with very low contention, P is a prefix of P ; if k=j, then P and P coincide. b a a b such that remote aborts do not kick-in. HiperTM, as well Let P be a prefix of P . Now, ∀ T , T ∈ P , T and T a b u v a a b as SM-DER, are independent from the contention because will observe T and T in the same order (and for the same u v they process transactions using a single thread but their SMR ALVIN-FD SMR PaxosPSaTxMosSTM ALVIANLVIN SMR PaxosSTM ALVIN HiperTM ALVIN-FD HiperTM ALVIN-FD 140000 240000 240000 120000 220000 220000 c SMR ARCHIE-FD e sec 200000 r s 100000 PaHxoipseSrTTMM secHipAeSRrMTC MR2HI0E0000ASSRMMCARRHRICEH-FIDE AARRCCSHHMIIREEa--FFDD ARCAHRICEH-FIDE per 180000 n pe 25000 SMR Pper axAoRsCSHT MIE1-8FD0000 abab AARRCCbHHIIEE Transaction 1111 0246800000000000000000000 Transactio 3 24685000000000000 00007 1000x Transaction per secTransactions per sec Transactions per sec3 111 (92468024a0000000 )R112 11220505T3 e050550000h000000000 pr 0000000000551 o000000lui1cg 7 hP3a 3pa Hs9xu RoitTransactions per secp71e sTransactions per sec p5e1- Sl3i15rcTTaHsMM 11122i 31122g 05055 7 05055h000007 00000100000195 -00000000000 00000059 R 910 79R%e PTransaction 3eTransactions per sec Transactions per sec.1a pp91 H1x l11i1oilcpTransactions per sec1i1sa 7 ceAS 51122Ss1122 r R05055MaTT105055 000000x Transactions per secC131111MM00000 R1000x Transaction per secs112200000 300000 H 0505500000002468( 00000071Ib00000 1 1111E 000001468024600000)1122 30000000 195000000505500000 T5 3 39 00000h3 RP1r 00000Transactions per seco17 ea S 1H7uA5 p x515MoglRiTransactions per seci1p5 csh1-C Se a 7ADS1122p19 rMHs05055RTT E9 u 71700000I RMM3CRab Et11229300000 05055H-R000000-1Fe00000I pD1ME00000 l 7i1 91c9000000a5eRs R3 d1 ee3i5 p pu1 1l1 l i1mi11c1c7A 5Aa5aRR9ss- C1S1C 1A79A1H3M3H R09R 7I7 R%I1CEabCE9 H- .H-11FFI55IDEDE 9R 11Latency (ms)e977 p1lR i1111112c A2468024680 a00000000000R11es(99 Cc 1Ap 31)H3RlILi1CEc 5aH- 1FatI5eDE sn7 c1 y1 937-ReH p1li1 ci1ags9 h113-5 1950% 17 .1 179 19 0 260 SMR AR 3eLpVlic aI5sN10- F7 24D 0 00 9 11 13 15 3 17 5 19 7 9 90 11 13 15 17 19 240000 Hip eS11rM24T00MR000000 P1000x Transaction per secax 11111222 6802468024o0000000000 sPASaLTxVMoIsNS-TFMD A L21000x Transaction per secV4 1111122I00246802AN00000000L0V R0 33eINplic 5a 5sH i7p 7eSr M9T R9MReR p1eli1c p1ali1sc a1s3 13 15Latency (ms) 1R5 112345678eP000000007pAla ic1L a7x1sV9oI s1N9S-TFMD ALVIN 220000 220000 ec 40 3 5 7 9 S 1M1R 13 15 17 A 19RCHIE-FD 80 3 5 7 9 11 13 15 17 19 0 3 5 7 9 11 13 15 17 19 sec 200000 r s 100000 (d) ThrougPhaHpxoiupRsteeSpr-lTTicaHMMsigh - 50%sec.HipAeSRrMTC MR2HI0(Ee0)0Th0r0oASuSRMgMChARRpHRuICEtH-FR-IeDEpMlicaesdAAiuRRmCCSHHM-IIREEa5--0FF%DD. ARCAHR(IfCE)H-FLIDEatencyR-epHliciagsh - 50%. per 180000 n pe Figure 3 :25P00e0rformanSceMRof BankPper abxAoeRsnCScHTh MIE1m-8FaD0rk0v0a0ryinababg nodes, coAnARtReCCbnHHtIIiEEon and percentage of write transactions. Transaction 1111 0246800000000000000000000 Transactio 3 24685000000000000 00007 1000x Transactions per secTransactions per sec Transactions per sec3 1122 0505059 R112 311220505 e050550000000000000( 5pa000000000051 )000000li1 #7c P31a 3a 9HwsxR oiaeTransactions per secp71 sTransactions per secpr15el Si1ec35rahTTs MM o11122 3 u1122 05055 7 05055s000007 e00000100000195. 00000000000 0000005 R 91 79Re PTransaction 3eTransactions per sec Transactions per sec1a pp91 H1x l11i1oilcpTransactions per sec1i1sa 7 ceAS 51122Ss1122 r R05055MaTT105055 000000x Transactions per secC131111MM00000 R1000x Transactions per secs112200000 300000 H 0505500000002468 00000071I 000001 11111222E 000001246802468024000001122 3 195000000505500000 5 3 33900000 R(P1 00000bTransactions per sec17 ea5S) 1H7A p x515MolRi#Transactions per seci1p5 cs 1-7C Sea ADS1122119 rMHs050559RTT E9 71700000I RMM39CRabE1122w300000 R 05055H-0000001eaF00000 pI1rDlE00000 i1 7ce 919a000000hs5R R 31o e3eu5 p p1 1sl1 li11ei1c1c7A5 Asa5a.RR9ss 1 CS1C 1A7A1H3M3H R9R 7I7 RI1CEabCE9 H- H-11FFI55IDEDE 9R 111000x Transactions per sece977 p1lR i111223 cA505050 aR11es99 C 31Ap1H(3RlcIi1CE )5cH- #1FaI5DE 17s0 10 193w7Rea p1lri1c ea1sh9 1o13u5s 1e5s. 17 1 179 19 0 R 3eplic a5s10 7 00 9 11 13 15 3 17 5 19 7 9 11 13 15 17 19 Figure 4: Performance of TPC-C be R 33nepclhic m5a 5sar k7 7va r9y 9in g11 n11o 1d3e 1s3 a15n d1R5 1ne7pul imc1 a71bs9er 19of warehouses. ReRpelicpaliscas performance is significantly affected by the length of trans- testbed. However, these parameters can be tuned for ex- actions(anyoperationisonthetransaction’scriticalpath). ploringdifferenttrade-offsforthehardwareandapplication This way, workloads composed of short transactions repre- workload at hand. senttheirsweetspot. Inaddition,SM-DERexcelsforwork- ThebenchmarksadoptedinthisevaluationincludeBank, loads where contention is very high. Here the intuition is acommonbenchmarkthatemulatesbankoperations,TPC- that,ifonlyfewobjectsareshared,thenexecutingtransac- C [5], a popular on-line transaction processing benchmark, tions serially without any overhead is the best solution. and Vacation, a distributed version of the famous applica- WeprovidedtwoversionsofArchie: onethatexploitsthe tionincludedintheSTAMPsuite[4]. Wescaledthesizeof optimistic delivery and one that postpones the parallel ex- the system in the range of 3-19 nodes and we also changed ecution until the transactions are final delivered. This way, thesystem’scontentionlevelbyvaryingthetotalnumberof wecanshowtheimpactoftheanticipationofthework,with shared objects available. All the competitors benefit from respect to the parallel execution. The version of Archie the execution of local read-only transactions. For this rea- thatdoesnotusetheoptimisticdelivery,calledArchie-FD, son we scope out read-only intensive workloads. Each node replacesthex-commitwiththenormalcommit. Incontrast hasanumberofclientsrunningonit. Whenweincreasethe with Archie, Archie-FD installs the written objects dur- nodesinthesystem,wealsoslightlyincreasethenumberof ingthecommit. Forthepurposeofthestudy,weconfigured clients accordingly. This also means that the concurrency MaxSpecandthesizeofthethreadpoolthatservesread-only and (possibly) the contention in the system moderately in- transactions as 12. This configuration resulted in an effec- crease. This is also why the throughput tends to increase tive trade-off between performance and scalability on our for those competitors that scale along with the size of the system. In practice, we used on average the following total petitors. In this case, the parallel execution of Archie is number of application threads balanced on all nodes: 1000 not exploited because transactions are always waiting for for TPC-C, 3000 for Bank, and 550 for Vacation. the previous speculative transaction to x-commit and then start almost the entire speculative execution from scratch. 6.1 BankBenchmark Increasing the number of nodes, HiperTM behaves better We configured the Bank benchmark for executing 10% than Archie because of minimal synchronization required and 50% of read-only transactions, and we identified the due to the single thread processing. However, when the high, medium and low contention scenarios by setting 500, contention decreases (Figure 4(b)), Archie becomes better 2000,and5000totalbankaccounts,respectively. Wereport than SM-DER by as much as 44%. Further improvements only the results for high and medium contention (Figure 3) can be seen in Figure 4(c) where contention is much lower because the trend in low contention scenario is very similar (96% of gain). to the medium contention though with higher throughput. Archie is able to outperform SM-DER when 19 ware- Figure 3(a) plots the results of the high contention sce- housesaredeployed,becauseitboundsthemaximumnum- nario. PaxosSTM suffers from a massive amount of remote ber of speculative transactions that can conflict each other aborts (≈85%), thus its performance is worse than others (i.e., MaxSpec). We used 12 as MaxSpec, thus the number anditisnotabletoscalealongwiththesizeofthesystem. of possible transactions that can conflict with each other is Interestingly, SM-DER behaves better than HiperTM be- less than the total number of shared objects, thus reducing cause HiperTM’s transaction execution time is higher than the abort percentage from 98% (1 warehouse) to 36% (19 SM-DER’s due to the overhead of operations’ instrumenta- warehouses) (see also Figure 6). Performance of SM-DER tion. This is particularly evident in Bank, where transac- worsensfromFigure4(a)toFigure4(b). Althoughitseems tions are short and SM-DER’s execution without any over- counterintuitive, it is because, with more objects, the cost headprovidesbetterperformance. Infact,evenifHiperTM of looking up a single object is less than with 19 objects. anticipates the execution leveraging the optimistic delivery, its validation and commit after the total order nullify any 100 ns 3 nodes previous gain. We observed also the time between the opti- o 90 11 nodes misticandfinaldeliveryinHiperTMtobelessthan1msec, acti 80 19 nodes s which limits the effectiveness of its optimistic execution. an 70 The two versions of Archie perform better than oth- d tr 60 perasybsuatpsetnilallAtyricnhpiee-rFfoDrm,wanitcheoaurtotuhneds1p4e%culaagtaivinesetxAecructhioine,. mmitte 4500 Thisisduetotheeffectiveexploitationoftheoptimisticde- co livery. Consistently with the results reported in Section 3, % X- 2300 we observed an average time between optimistic and final 1 19 100 deliveryof8.6msec,almost9×longerthanHiperTM.How- Number of warehouses ever,asshowedinFigure3(c),Archie’saveragetransaction latency is still much lower than others. The peak through- Figure 5: % of x-committed transactions before the put improvement over the best competitor (i.e., SM-DER) notification of the total order. is 54% for Archie and 41% for Archie-FD. Figure 3(b) shows the results with an increased number Figure5showsaninterestingparameterthathelpstoun- of shared objects in the system. In these experiments the derstand the source of Archie’s gain: the percentage of contention is lower than before, thus PaxosSTM performs speculative transactions x-committed before their total or- better. With 3 nodes, its performance is comparable with der is notified. It is clear from the plot that, due to the Archie but, by increasing the nodes and thus the con- high contention with only one warehouse, Archie cannot tention,itdegrades. HereArchie’sparallelexecutionhasa exploititsparallelismthusalmostalltransactionsx-commit significant benefit, reaching a speed-up by as much as 95% after their final delivery is issued. The trend changes by over SM-DER. Due to the lower contention, also the gap increasing the number of warehouses. In the configuration between Archie and Archie-FD increased up to 25%. with 100 warehouses, the percentage of x-committed trans- Figures 3(d), 3(e), 3(f) show the results with higher per- actionsbeforetheirfinaldeliveryisintherangeof75%-95%. centageofread-onlytransactions(50%). Recallthatallpro- Theperformancerelatedtothisdata-pointisshowninFig- tocolsexploittheadvantageoflocalprocessingofread-only ure4(c)whereArchieisindeedthebest,andthegapwith transactions but absolute numbers are higher than before, respect to Archie-FD increased up to 41%. aswellaslatencyisreduced,butthetrendsarestillsimilar. Figure 6 reports the percentage of aborted transactions of the only two competitors that can abort: PaxosSTM 6.2 TPC-CBenchmark and Archie. PaxosSTM invokes an abort when a trans- TPC-C is characterized by transactions accessing several action does not pass the certification phase, while Archie objectsandtheworkloadhasacontentionlevelusuallyhigher aborts a transaction during the speculative execution. Re- than other benchmarks (e.g., Bank). The mix of TPC-C callthat,inPaxosSTM,theabortnotificationisdeliveredto profiles is the same as the default configuration, thus gen- theclient,whichhastore-executethetransactionandstart erating 92% of write transactions. We evaluated three sce- again a new global certification phase. On the other hand, narios, varying the total number of shared warehouses (the Archie’s abort is locally managed and the re-execution of mostcontentedobjectinTPC-C)intherangeof{1,19,100}. the speculative transaction does not involve any client op- Withonlyonewarehouse,alltransactionsconflicteachother eration,thussavingtimeandnetworkload. Inthisplot,we (Figure4(a))thusSM-DERbehavesbetterthanothercom- varythenumberofnodesinthesystemand,foreachnode,
Description: