L-Store: A Real-time OLTP and OLAP System ∗

Mohammad Sadoghi†, Souvik Bhattacherjee‡, Bishwaranjan Bhattacharjee†, Mustafa Canim†
†IBM T.J. Watson Research Center  ‡University of Maryland, College Park

arXiv:1601.04084v1 [cs.DB] 15 Jan 2016

∗ Work was performed as part of a summer internship at IBM T.J. Watson Research Center under Mohammad Sadoghi's mentorship.

ABSTRACT

Arguably data is a new natural resource in the enterprise world with an unprecedented degree of proliferation. But to derive real-time actionable insights from the data, it is important to bridge the gap between managing the data that is being updated at a high velocity (i.e., OLTP) and analyzing a large volume of data (i.e., OLAP). However, there has been a divide where specialized solutions were often deployed to support either OLTP or OLAP workloads but not both; thus, limiting the analysis to stale and possibly irrelevant data. In this paper, we present Lineage-based Data Store (L-Store) that combines the real-time processing of transactional and analytical workloads within a single unified engine by introducing a novel lineage-based storage architecture. We develop a contention-free and lazy staging of columnar data from a write-optimized form (suitable for OLTP) into a read-optimized form (suitable for OLAP) in a transactionally consistent approach that also supports querying and retaining the current and historic data. Our working prototype of L-Store demonstrates its superiority compared to state-of-the-art approaches under a comprehensive experimental evaluation.

1. INTRODUCTION

We are now witnessing an architectural shift and divide in the database community. The first school of thought emerged from an academic conjecture that "one size does not fit all" [35] (i.e., advocating specialized solutions), which has led to a multitude of innovations over the last decade in creating specialized database engines geared toward various niche workloads and application scenarios (e.g., [35, 5, 11, 9, 28, 8, 21, 36]). This school has successfully influenced major database vendors such as Microsoft to focus on building new specialized engines offered as loosely integrated engines (e.g., Hekaton in-memory engine [8] and Apollo column store engine [19]) within a single umbrella of database portfolio (notably, recent efforts are now focused on a tighter real-time integration of Hekaton and Apollo engines [18]). It has also influenced Oracle to partially accept the basic premise that "one size does not fit all" as far as data representation is concerned and has led Oracle to develop a dual-format technique [16] that maintains two tightly integrated representations of data (i.e., two copies of the data) in a transactionally consistent manner.

However, the second school of thought, supported by both academia (e.g., [13, 6, 7]) and industry (e.g., SAP [10]), rejects the aforementioned fundamental premise and advocates a generalized solution. Proponents of this idea, rightly in our view, make the following arguments. First, there is a tremendous cost in building and maintaining multiple engines from both the perspective of database vendors and users of the systems (e.g., application development and deployment costs). Second, there is a compelling case to support real-time decision making on the latest version of the data [27] (likewise supported by [16, 18]), which may not be feasible across loosely integrated engines that are connected through the extract-transform-load (ETL) process. Closing this gap may be possible, but its elimination may not be feasible without solving the original problem of unifying OLTP and OLAP capabilities or without being forced to rely on ad-hoc approaches to bridge the gap in hindsight. We argue that the separation of OLTP and OLAP capabilities is a step backward that defers solving the actual challenge of real-time analytics. Third, combining real-time OLTP and OLAP functionalities remains an important basic research question, which demands deeper investigation even if it is purely from the theoretical standpoint.

In this dilemma, we support the latter school of thought (i.e., advocating a generalized solution) with the goal of undertaking an important step to study the entire landscape of single engine architectures and to support both transactional and analytical workloads holistically (i.e., "one size fits all"). In this paper, we present Lineage-based Data Store (L-Store) with a novel lineage-based storage architecture to address the conflicts between row- and column-major representation by developing a contention-free and lazy staging of columnar data from write optimized into read optimized form in a transactionally consistent manner without the need to replicate data, to maintain multiple representations of data, or to develop multiple loosely integrated engines that sacrifice real-time capabilities.

To further disambiguate our notion of "one size fits all", in this paper, we restrict our focus to real-time relational OLTP and OLAP capabilities. We define a set of architectural characteristics for distinguishing the differences between existing techniques. First, there could be a single product consisting of multiple loosely integrated engines that can be deployed and configured to support either OLTP or OLAP. Second, there could be a single engine as opposed to having multiple specialized engines packaged in a single product. Third, even if we have a single engine, then we could have multiple instances running over a single engine, where one instance is dedicated and configured for OLTP workloads while another instance is optimized for OLAP workloads, in which the different instances are assumed to be connected using an ETL process. Finally, even when using the same engine running a single instance, there could be multiple copies or representations (e.g., row vs. columnar layout) of the data, where one copy (or representation) of the data is read optimized while the second copy (or representation) of the data is write optimized. The architectural comparison of various existing techniques based on our rigorous definition of
"one size fits all" is outlined in Table 1.^1

Characteristic        | L-Store | HANA [27]  | ES2 [6, 7]                         | HyPer [13]                                                 | Oracle Dual-format [16] | Microsoft SQL Server [18] | H-Store + Hadoop [11]
Single Product        | ✓       | ✓          | ✓                                  | ✓                                                          | ✓                       | ✓                         | –
Single Engine         | ✓       | ✓          | ✓                                  | ✓                                                          | ✓                       | –                         | –
Single Instance       | ✓       | ✓          | ✓                                  | ✓                                                          | ✓                       | ✓                         | –
Single Copy           | ✓       | ✓          | ✓                                  | data page replication (OS forking)                         | –                       | –                         | –
Single Representation | ✓       | main+delta | ✓                                  | ✓                                                          | –                       | –                         | –
OLTP-optimized        | ✓       | ✓          | limited to get/put operations      | limited to partitionable workloads (i.e., serial execution) | ✓                       | ✓                         | ✓
OLAP-optimized        | ✓       | ✓          | inconsistent snapshot is possible  | ✓                                                          | ✓                       | ✓                         | ✓
Unified OLTP+OLAP     | ✓       | ✓          | ✓                                  | ✓                                                          | ✓                       | ✓                         | –

Table 1: Architectural characterization of selected database engines.

In short, we develop L-Store, as an important first step towards supporting real-time OLTP and OLAP processing that faithfully satisfies our definition of generalized solution, and, in particular, we make the following contributions:

• Introducing a contention-free update mechanism over a native columnar storage model in order to lazily and independently stage stable data from a write-optimized columnar layout (i.e., OLTP) into a read-optimized columnar layout (i.e., OLAP)

• Achieving (at most) 2-hop away access to the latest version of any record (preventing read performance deterioration for point queries)

• Contention-free merging of only stable data, namely, merging of the read-only base data with recently committed updates (both in columnar representation) without the need to block ongoing or new transactions

• Contention-free page de-allocation (upon the completion of the merge process) using an epoch-based approach without the need to drain the ongoing transactions

• A first of its kind comprehensive evaluation to study the leading architectural design for concurrently supporting short update transactions and analytical queries (e.g., an in-place update with a history table architecture and the commonly employed main and delta stores architecture)

1.1 Motivating Real-time OLTP and OLAP

Before describing our proposed approach in-depth, we briefly present two important scenarios that benefit greatly from a real-time OLTP and OLAP solution.

Consider the mobile e-commerce market, in which the revenue for the location-based mobile advertising alone is expected to reach $18 billion by 2019 [25]. A potential buyer with a mobile device may roam around physically while shopping. In the meantime, the shopper's mobile device generates location information. Alternatively, as the shopper browses the web, again the location information is either exchanged explicitly or detected automatically based on the shopper's IP address or by its connection to the nearby WiFi routers. Now the task of any real-time targeted advertising auction is to determine and present a set of relevant ads to the shopper by running analytics over the location information, shopping patterns, past purchases, and browsing history of the shopper. Furthermore, if these advertisements result in a purchase, then the resulting transactions need to become available immediately to subsequent analytics in order to improve the effectiveness of future advertisements. Moreover, the actual ad bidding in the auction also requires transactional semantics support in real-time. Finally, all these steps must be completed typically within 150 milliseconds [2]. Therefore, we argue that in order to sustain high-velocity transactional data (e.g., Google AdWords served almost 30 billion ads per day in 2012 [14]) while executing complex analytics on the latest and historic (transactional) data, there is a compelling need to develop a solution that exhibits true real-time OLTP and OLAP capabilities.

Another prominent scenario is fraud detection, especially at a time when the cost of cybercrime continues to increase at a staggering rate and has already surpassed $400 billion annually [1]. For instance, a credit card company will need to approve a transaction in a small time window (i.e., subsecond ranges). During this short time span, it is forced to determine if a transaction is fraudulent or not. Thus, there is a crucial need to run complex analytics in real-time as part of the transaction that is being processed. Without such a proactive fraud detection capability, fraudulent transactions may remain undetectable, which may result in irreversible financial losses, as has clearly been witnessed with billions of dollars being lost to fraud activities every year [1]. Furthermore, there are indirect financial losses involving stakeholders such as credit card companies and merchants. The indirect losses are attributed to the decline of legitimate transactions that disrupts merchants' daily operation, lost payment volume as consumers opt for alternative payment types that are perceived to be safer, and lost customers due to card cancellation and reissue [26].

2. UNIFIED ARCHITECTURE

The divide in the database community is partly attributed to the storage conflict pertaining to the representation of transactional and analytical data. In particular, transactional data requires write-optimized storage, namely the row-based layout, in which all columns are co-located (and preferably uncompressed for in-place updates). This layout improves point update mechanisms, since accessing all columns of a record can be achieved by a single I/O (or few cache misses for memory-resident data). In contrast, to optimize the analytical workloads (i.e., reading many records), it is important to have read-optimized storage, i.e., columnar layout in highly compressed form. The intuition behind having columnar layout is due to the observation that most analytical queries tend to access only a small subset of all columns [3]. Thus, by storing data column-wise, we can avoid reading irrelevant columns (i.e., reducing the raw amount of data read) and avoid polluting the processor's cache with irrelevant data, which substantially improves both disk and memory bandwidth, respectively. Furthermore, storing data in columnar form improves the data homogeneity within each page, which results in an overall better compression ratio. This storage conflict is depicted in Figure 1.

Figure 1: Overview of storage layout conflict.

2.1 L-Store Storage Overview

To address the dilemma between write- and read-optimized layouts, we develop L-Store. As demonstrated in Figure 2, the high-level architecture of L-Store is based on native columnar layout (i.e., data across columns are aligned to allow implicit re-construction), where records are (virtually) partitioned into disjoint ranges (also referred to as update range). Records within each range span a set of read-only, compressed pages, which we refer to as the base pages. More importantly, for every range of records, and for each updated column within the range, we maintain a set of append-only pages to store the latest updates, which we refer to as the tail pages. Anytime a record is updated in base pages, a new record is appended to its corresponding tail pages, where there are explicit values only for the updated columns (non-updated columns are preassigned a special null value when a page is first allocated). We refer to the records in base pages as the base records and the records in tail pages as the tail records. Each record (whether it falls in base or tail pages) spans over a set of aligned columns (i.e., no join is necessary to pull together all columns of the same record).^2

Figure 2: Overview of the lineage-based storage architecture.

A unique feature of our lineage-based architecture is that tail pages are strictly append-only and follow a write-once policy. In other words, once a value is written to tail pages, it will not be over-written even if the writing transaction aborts. The append-only design substantially simplifies the concurrency and recovery protocol as described in Section 4.1. Another important property of our lineage-based storage is that all data are represented in a common holistic form; there are no ad-hoc corner cases. Records in both base and tail pages are assigned record-identifiers (RIDs) from the same key space. Therefore, both base and tail pages are referenced through the database page directory using RIDs and persisted identically. Consequently, at the lower-level of the database stack, there is absolutely no difference between base vs. tail pages or base vs. tail records; they are presented and maintained identically.

To speed query processing, there is also an explicit linkage (forward and backward pointers) among records. From a base record, there is a forward pointer to the latest version of the record in tail pages. The different versions of the same record in tail pages are chained together to enable fast access to an earlier version of the record. The linkage is established by introducing a table-embedded indirection column that stores forward pointers (i.e., RIDs) for base records and backward pointers (i.e., RIDs) for tail records.

The final aspect of our lineage-based architecture is a periodic, contention-free merging of a set of base pages with its corresponding tail pages. This is performed to consolidate base pages with the recent updates and to bring base pages forward in time (i.e., creating a set of merged pages). Tail pages that are already merged and fall outside the snapshot boundaries of all active queries are called historic tail-pages. These pages are re-organized, so that different versions of a record are stored contiguously inlined. Delta-compression is applied across different versions of tail records, and tail records are ordered based on the RIDs of their corresponding base records. Below, we describe the unique design and algorithmic features of L-Store that enable efficient transactional processing without performance deterioration of analytical processing; thereby, achieving a real-time OLTP and OLAP.

2.2 Lineage-based Storage Architecture

In L-Store, the storage layout is natively columnar, which applies equally to both base and tail pages. A detailed view of our lineage-based storage architecture is presented in Figure 3. In general, one can perceive tail pages as directly mirroring the structure and the schema of base pages. As we pointed out earlier, conceptually for every record, we distinguish between base vs. tail records, where each record is assigned a unique RID. But it is important to note that the RID assigned to a base record is stable and remains constant throughout the entire life-cycle of a record, and all indexes only reference base records (base RIDs); consequently, eliminating the index maintenance problem that arises when an update operation results in the creation of a new version of the record [33, 32]. A reader performing an index lookup always lands at a base record, and from the base record it can reach any desired version of the record by following the table-embedded indirection to access the latest (if the base record is out-of-date) or an earlier version of the record. However, when a record is updated, a new version is created. Thus, a new tail record is created to hold the new version, and the new

^1 However, it is crucial to note that the presented comparison is solely focused on the overall architectural choices, and it does not make any claims about the relative system performance and/or functionalities. For example, if HANA contains more checkmarks than Microsoft SQL Server, it does not imply that HANA is a better product; instead it simply assesses HANA architecturally with respect to our definition of "one size fits all".

^2 Fundamentally, there is no difference between base vs. tail record; the distinction is made only to ease the exposition.
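The record linkage of Sections 2.1 and 2.2 can be illustrated with a minimal Python sketch. It is not the prototype's code: the names `Record`, `versions`, and `latest_value` are hypothetical, the page directory is reduced to a dictionary keyed by RID, and the chain of tail records is assumed to end with a backward pointer to the base RID. Meta-data other than the indirection pointer is left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional

NULL = None  # stands in for the special null value of a non-updated column


@dataclass
class Record:
    rid: str                      # base and tail RIDs come from one key space
    indirection: Optional[str]    # base: forward pointer; tail: backward pointer
    columns: List[Optional[str]]  # aligned data columns, NULL where not updated


def versions(directory: Dict[str, Record], base_rid: str) -> Iterator[Record]:
    """Yield the tail records of one base record, newest first."""
    rid = directory[base_rid].indirection
    while rid is not None and rid != base_rid:
        record = directory[rid]
        yield record
        rid = record.indirection  # follow the backward pointer


def latest_value(directory: Dict[str, Record], base_rid: str, col: int) -> Optional[str]:
    """Return the newest explicit value of a column, else the base value."""
    for tail in versions(directory, base_rid):
        if tail.columns[col] is not NULL:
            return tail.columns[col]
    return directory[base_rid].columns[col]
```

An index would store only `base_rid`; every lookup starts at the base record and reaches newer or older versions through the indirection pointers.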
anewtailrecordiscreatedtoholdthenewversion, andthenew 3 Start Time Column Read Optimized (implicit end time of the previous version) Updated Columns (compressed, read-only pages) Schema Encoding Column Corresponding Columnar Storage (keeping track of changed column) Columns (page-based) Indirection Column (back pointer to the previous version) Base Record Write Optimized (uncompressed, append-only updates) Forward Pointer to the Base Pages Tail Latest Version of the Record (read-only) Record Pre-allocated Space (on-demand allocated append-only region) Range Partitioning Tail Pages Last Updated Time (append-only) Indirection Column Schema Encoding Start Time Column (uncompressed, in-place update) Column Column (populated after merge) Figure3:Detailed,unfoldedviewoflineage-basedstoragearchitecture. tailrecordisassignedanewtailRIDthatisreferencedbythebase wayspreserved(evenafterthemergeprocess)forfasterpruningof record(asdemonstratedinFigure3). thoserecordsthatarenotvisibletoreadersbecausetheyfalloutside Eachtableinadditiontohavingthestandarddatacolumnshas thereader’ssnapshot.Lastly,wemayaddtheBaseRIDcolumnop- severalmeta-datacolumns. Thesemeta-datacolumnsincludethe tionallytotailrecordstostoretheRIDsoftheircorrespondingbase Indirectioncolumn,theSchemaEncodingcolumn,theStartTime records;thisisutilizedtoimprovethemergeprocess. BaseRIDis column,andtheLastUpdatedTimecolumn. Anexampleoftable ahighlycompressiblecolumnthatwouldrequireatmosttwobytes schemaisshowninTable2. whenrestrictingtherangepartitioningofrecordsto216records. TheIndirectioncolumnexistsinboththebaseandtailrecords. 2.3 Fine-grainedStorageManipulation Forbaserecords,theIndirectioncolumnisinterpretedasaforward pointertothelatestversionofarecordresidingintailpages,essen- The transaction processing can be viewed as two major chal- tiallystoringtheRIDofthelatestversionofarecord. 
Ifarecord lenges: (1)howdataisphysicallymanipulatedatthestoragelayer has never been updated, then the Indirection column will hold a andhowchangesarepropagatedtoindexesand(2)howmultiple nullvalue. Incontrast, fortailrecords, theIndirectioncolumnis transactions(whereeachtransactionconsistsofmanystatements) used to store a backward pointer to the last updated version of a canconcurrentlycoordinatereadingandwritingoftheshareddata. recordintailpages.Ifnoearlierversionexists,thentheIndirection The focus of this paper is on the former challenge, and we defer columnwillpointtotheRIDofthebaserecord. thelattertoourdiscussionontheemployedconcurrencymodelin TheSchemaEncodingcolumnstoresthebitmaprepresentation Section4.1. ofthestateofthedatacolumnsforeachrecord,wherethereisone bitassignedforeverycolumnintheschema(excludingthemeta- 2.3.1 UpdateandDeleteProcedures datacolumns),andifacolumnisupdated,itscorrespondingbitin Withoutthelossofgenerality,wefocusonhowtohandleasingle theSchemaEncodingcolumnissetto1,otherwiseissetto0. The pointupdateordeleteinL-Store(butnotethatwesupportmulti- schemaencodingenablestoquicklydetermineifacolumnhasev- statement transactions as demonstrated by our evaluation). Each erybeenupdatedornot(forbaserecords)ortodetermineforeach updatemayaffectasingleormultiplerecords. Sincerecordsare tail record, which columns have been updated and have explicit (virtually)partitionedintoasetofdisjointranges(asshowninTa- valuesasopposedtothosecolumnsthathavenotbeenupdatedand ble2), eachupdatedrecordnaturallyfallswithinonlyonerange. haveanimplicitspecialnullvalues(denotedby∅).Anexampleof Nowforeachrangeofrecords,uponthefirstupdatetothatrange,a SchemaEncodingcolumnisprovidedinTable2. setoftailpagesarecreated(andpersistedondiskoptionally)forthe The Start Time column stores the time at which a base record updatedcolumnsandareaddedtothepagedirectory,i.e.,lazytail- was first installed in base pages (the original insertion time), and pageallocation. 
Consequently, updatesforeachrecordrangeare foratailrecord,theStartTimecolumnholdsthetimeatwhichthe appendedtotheircorrespondingtailpagesoftheupdatedcolumns recordwasupdated,whichisalsotheimplicitendtimeoftheprevi- only; thereby,avoidingin-placeupdatesforthedatacolumnsand ousversionoftherecord.Inaddition,totheStartTimecolumn,for clusteringupdatesforarangeofrecordswithintheircorresponding baserecords,wemaintainanoptionalLastUpdatedTimecolumn, tailpages. whichisonlypopulatedafterthemergeprocessistakenplaceand TodescribetheupdateprocedureinL-Store,werelyonourrun- holdstheStartTimeofthosetailrecordsincludedinmergedpages. ningexampleshowninTable2. Whenatransactionupdatesany AlsonotethattheinitialStartTimecolumnforbaserecordsisal- column of a record for the first time, two new tail records (each 4 RID Indirection SchemaEncoding StartTime Key A B C RIDs), and they are never directly point to any tail records (i.e., Partitionedbaserecordsforthekeyrangeofk1tok3 tailRIDs)inordertoavoidtheindexmaintenancecostthatarise bb12 tt85 00010001 1103::0024 kk12 aa12 bb12 cc12 intheabsenceofin-placeupdatemechanism[33,32]. Therefore, b3 t7 0001 15:05 k3 a3 b3 c3 whenanewversionofarecordiscreated(i.e.,anewtailrecord), Partitionedbaserecordsforthekeyrangeofk4tok6 first,allindexesdefinedonunaffectedcolumnsdonothavetobe b4 ⊥ 0000 16:20 k4 a4 b4 c4 modifiedand,second,onlytheaffectedindexesaremodifiedwith b5 ⊥ 0000 17:21 k5 a5 b5 c5 b6 ⊥ 0000 18:02 k6 a6 b6 c6 theupdatedvalues,buttheycontinuetopointtobaserecordsand Partitionedtailrecordsforthekeyrangeofk1tok3 notthenewlycreatedtailrecords[33,32]. Supposethereisanin- t1 b2 0100* 13:04 ∅ a2 ∅ ∅ dexdefinedonthecolumnC (cf. Table2). 
Nowaftermodifying t2 t1 0100 19:21 ∅ a21 ∅ ∅ t3 t2 0100 19:24 ∅ a22 ∅ ∅ therecordb2fromc2toc21,weaddthenewentry(c21,b2)tothe t4 t3 0001* 13:04 ∅ ∅ ∅ c2 indexonthecolumnC.3Subsequently,whenareaderlooksupthe tt56 bt43 00100011* 1195::2055 ∅∅ a∅22 ∅∅ cc231 valuec21fromtheindex,italwaysarrivesatthebaserecordb2ini- tt78 bt61 00000010 1290::4155 ∅∅ ∅∅ ∅∅ c∅31 ftioalllloyw,tihnegnththeeinrdeiardeecrtimonuisftndeecteersmsairnye)athnedvmisuisbtlechveecrksiiofnthoefvbi2si(bblye versionhasthevaluec forthecolumnC,essentiallyre-evaluating Table2:Anexampleoftheupdateanddeleteprocedures(concep- 21 thequerypredicates. tualtabularrepresentation). Therearetwoothermeta-datacolumnsthatareaffectedbythe updateprocedure. TheStartTimecolumnfortailrecordssimply holdsthetimeatwhichtherecordwasupdated(animplicitendof thepreviousversion). Forexample,therecordt hasastarttime tailrecordisassignedauniqueRID)arecreatedandappendedto 7 of19:45,whichalsoimpliesthattheendtimeofthefirstversion thecorrespondingtailpages. Forexample, considerupdatingthe oftherecordb . TheSchemaEncodingcolumnisaconciserep- columnAoftherecordwiththekeyk (referencedbytheRIDb ) 3 2 2 resentationthatshowswhichdatacolumnshavebeenupdatedthus in Table 2. The first tail record, referenced by the RID t , con- 1 far. Forexample,theSchemaEncodingofthetailrecordt isset tains the original value of the updated column, i.e., a , whereas 7 2 to“0100”,whichimpliesthatonlythecolumnAhasbeenchanged. implicitnullvalues(∅)arepreassignedforremainingunchanged Todistinguishbetweenwhetheratailrecordisholdingnewvalues columns. Takingsnapshotoftheoriginalchangedvaluesbecomes oritisthesnapshotofoldvalues,weaddaflagtotheSchemaEn- essential in order to ensure contention-free merging as discussed coding column, which is shown as an asterisk. For example, the inSection3.1. 
Thesecondtailrecordcontainsthenewlyupdated tailrecordt storestheoldvalueofthecolumnA,whichiswhy value for column A, namely, a , and again implicit special null 6 21 itsSchemaEncodingissetto“0100*”. TheSchemaEncodingcan values for the rest of the columns; a column that has never been alsobemaintainedoptionallyforbaserecordsaspartoftheupdate updated does not even have to be materialized with special null processoritcouldbepopulatedonlyduringthemergeprocess. values. However,foranysubsequentupdates,onlyonetailrecord Notably,whentherearemultipleindividualupdatestothesame iscreated,e.g.,thetailrecordt isappendedasaresultofupdating 3 recordbythesametransaction,theneachupdateiswrittenasasep- thecolumnAfroma toa fortherecordb . 21 22 2 arateentrytotailpages. Eachupdateresultsinacreationofanew Ingeneral,updatescouldeitherbecumulativeornon-cumulative. tailrecordandonlythefinalupdatebecomesvisibletoothertrans- Thecumulativepropertyimpliesthatwhencreatinganewtailrecord, actions. The prior entries are implicitly invalidated and skipped thenewrecordwillcontainthelatestvaluesforalloftheupdated byreaders. Alsodeleteoperationissimplytranslatedintoanup- columnsthusfar.Forexample,considerupdatingthecolumnCfor dateoperation, inwhichalldatacolumnsareimplicitlysetto∅, therecordb .SincethecolumnCoftherecordb isbeingupdated 2 2 e.g.,deletingtherecordb resultsincreatingthetailrecordt .An forthefirsttime, wefirsttakeasnapshotofitsoldvalueascap- 1 8 alternative design for delete is to create a tail record that holds a turedbythetailrecordt . Nowforthecumulativeupdate,anew 4 completesnapshotofthelatestversionofthedeletedrecord. tailrecordisappendedthatrepeatsthepreviouslyupdatedcolumn A,asdemonstratedbythetailrecordt . Ifnon-cumulativeupdate 5 2.3.2 InsertProcedure approachwasemployed,thenthetailrecordwouldconsistsofonly thechangedvalueforcolumnCandnotA.Itisimportanttonote Thefinalkeyoperationistheinsertionofnewrecords. Concep- thatcumulationofupdatescanberesetatanytime. 
Intheabsence tually,thetablenaturallygrowsbyinsertingnewrecordstotheend ofcumulation,readersaresimplyforcedtowalkbackthechainof ofthetable(append-onlymechanism).Werelyonasimplermani- recentversionstoretrievethelatestvaluesofalldesiredcolumns. festationofournotionoftailpagesfollowedbythetransformation Thus,cumulativeupdateisanoptimizationthatisintendedtoim- oftailpagesintocompressed,read-onlybasepagesthroughasim- provethereadperformance. plified merge process. In fact, one can even view our previously Aspartoftheupdateroutine,theembeddedIndirectioncolumn describedupdatemechanismasaformofsparseinsertion. (forwardpointers)forbaserecordsisalsoupdatedtopointtothe Inourproposedinsertdesign,wedesignatetheendofthetableas newlycreatedtailrecord. Inourrunningexample,theIndirection theinsertrange. Aninsertrangeisbasicallyapre-allocatedrange column of the record b2 points to the tail record t5. Also after ofbaseRIDsforaccommodatingfutureinsertions.Inpractice,the updating the column C of the record b3, the Indirection column insertrangesize(atleastamillionRIDs)ismuchlargerthanour pointstothelatestversionofb3,whichisgivenbyt7. Likewise, rangepartitioningthatisemployedforupdateprocessing(i.e.,up- the Indirection column in the tail records are updated to point to thepreviousversionoftherecord. Itisimportanttonotethatthe Indirectioncolumnofbaserecordsistheonlycolumnthatrequires 3Optionallytheoldvalue(c2,b2)couldberemovedfromthein- dex; however, itsremovalmayaffectthosequeriesthatareusing anin-placeupdateinourarchitecture.However,asdiscussedinour indexestocomputeanswersundersnapshotsemantics. Therefore, concurrencymodel(cf. Section4.1),thisisaspecialcolumnthat weadvocatedeferringtheremovalofchangedvaluesfromindexes lendsitselftolatch-freeconcurrencyprotocol. until the changed entries fall outside the snapshot of all relevant Furthermore, indexes always point to base records (i.e., base activequeries. 
5 Read Optimized (compressed, read-only pages) Tail Pages (append-only) Range Partitioning (update range) Write Optimized Table-level Tail-pages (uncompressed, (append-only) append-only updates, i.e., sparse insertion) Write Optimized snI (uncompressed, e R tr append-only inserts) a n g e Indirection Column (uncompressed, in-place update) Figure4:Append-onlyinsertionofnewrecordswithconcurrentupdates(byemployingtailpages). RID Indirection SchemaEncoding StartTime Key A B C ing up a record in the insert range. When a new record is about Partitionedbaserecordsforthekeyrangeofk4tok6 tobeinsertedtothetable,thenewrecordreceivesareservedbase bb45 ⊥⊥ 00000000 1167::2201 kk45 aa45 bb45 cc45 RIDintheinsertrangeandthecorrespondingtailRIDinthetable- b6 ⊥ 0000 18:02 k6 a6 b6 c6 level tail-range. If insert range is full, then a new insert range is InsertrangeforthebaserecordwiththebaseRIDrangeofb7tob9 created.Butthekeyguidingprincipleforinsertionistosatisfythe b7 ⊥ stabilitypropertyofthebasepages(i.e.,read-only)withtheexcep- b8 t14 b9 t16 tionoftheIndirectioncolumnthatisupdatedin-place. Therefore, Table-leveltail-pagesforthebaserecordwiththebaseRIDrangeofb7tob9 theinsertionproceduresimplyconsistsofacquiringbaseandtail tt7 t7 0000 18:30 k7 a7 b7 c7 tt8 t8 0000 18:45 k8 a8 b8 c8 RIDs,inserttheactualrecordtotable-leveltail-pages,andsetting tt9 t9 0000 19:05 k9 a9 b9 c9 theIndirectioncolumninthebaserecordtonull.Alternatively,the Pt1ar3titionedtabil8recordsforthe0k0ey01r*angeofk7to1k89:45 ∅ ∅ ∅ c8 Indirectioncolumncouldbesettonullwhenallocatingpagesfor t14 t13 0001 22:25 ∅ ∅ ∅ c81 theinsertrange. t15 b3 0100* 19:05 ∅ a9 ∅ ∅ AnexampleofinsertionisillustratedinTable3.Theinsertrange t16 t15 0100 22:45 a91 ∅ ∅ ∅ isshownasb tob ,andthetable-leveltail-rangeisshownastt 7 9 7 Table3:Anexampleofinsertionwithconcurrentupdates(concep- tott9. Thefirstinsertedrecordis(k7,a7,b7,c7)withthekeyk7 tualtabularrepresentation). 
thatisassignedb7asitsbaseRIDandtt7asitstailRID.Theonly columnallocatedforbaserecordsistheIndirectioncolumn,which isinitiallysettonull(⊥). Theactualvaluesforthemeta-dataand datacolumnsareappendedtothetable-leveltail-pagesattheposi- daterange).4. Fortheinsertrange,weallocatedasetoftailpages tiongivenbythetailRIDtt7. Inthesamespirit,therecordswith forappendingnewrecords,whichwerefertothemas“table-level b8 andb9 arealsoappendedtotheinsertrange. Nowifarecently tail-pages”eventhoughstructurallythereisnodifferencebetween insertedrecordisupdated,thentheupdatefollowsthesamepathas explainedearlier(cf. Section2.3.1). Supposetherecordb isup- table-level tail-pages vs. regular tail pages. Figure 4 pictorially 8 datedbymodifyingthevalueofitsCcolumnfromc toc . The capturesourinsertdesign.Intable-leveltail-pages,weallocatetail 8 81 updatesimplyresultsinacquiringanewtailRIDintheregulartail pagesforallcolumns(unlikeforupdatesthatwaslimitedtoonly pages(asbefore)andappendingonlytheupdatedcolumnfollowed theupdatedcolumns)becausetheinsertstatementalwaysprovide byupdatingtheIndirectioncolumnin-place. Thisisdemonstrated avalueforeverycolumn(evenifitisanimplicitnullvaluefora byappendingthetailrecordt tothecorrespondingtailpagesand nullablecolumn). 14 settingtheIndirectioncolumnoftherecordb tot . Adding a new insert range consists of reserving a set of base 8 14 RIDs (e.g., in the order of millions) and a set of tail RIDs; these two sets of RIDs are equal in size and aligned. Thus, the 10th baseRIDintheinsertrangecorrespondstothe10thtailRIDinthe 3. REAL-TIMESTORAGEADAPTION table-leveltail-range(i.e.,bothrangesfollowingthesameinsertion order).ThealignmentofRIDsallowsimplicitaddressingforlook- Toensureanearoptimalstoragelayout,outdatedbasepagesare mergedlazilywiththeircorrespondingtailpagesinordertopre- 4Each table may have more than one insert range to support a servetheefficiencyofanalyticalqueryprocessing. Recallthatthe higherdegreeofconcurrencyiftheworkloadisinsertintensive. 
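The update procedure of Section 2.3.1 can be made concrete with a small Python sketch that replays the updates to record b2 from Table 2. It is only an illustration under simplifying assumptions: the class and method names (`Rec`, `Table`, `update`) are hypothetical, updates are always cumulative, and the Start Time column, transactions, and concurrency control are omitted.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

NUM_COLS = 4  # Key, A, B, C, as in the running example of Table 2


@dataclass
class Rec:
    rid: str
    indirection: Optional[str]   # base: newest tail RID; tail: previous version
    encoding: str                # one bit per column; "*" marks an old-value snapshot
    values: List[Optional[str]]  # None stands in for the implicit null value


class Table:
    def __init__(self) -> None:
        self.directory: Dict[str, Rec] = {}  # base and tail RIDs share one key space
        self._next_tail = 1

    def insert_base(self, rid: str, values: List[str]) -> None:
        self.directory[rid] = Rec(rid, None, "0" * NUM_COLS, list(values))

    def _append_tail(self, back: str, bits: List[bool],
                     values: List[Optional[str]], snapshot: bool = False) -> str:
        rid = f"t{self._next_tail}"
        self._next_tail += 1
        enc = "".join("1" if b else "0" for b in bits) + ("*" if snapshot else "")
        self.directory[rid] = Rec(rid, back, enc, values)  # append-only, write-once
        return rid

    def update(self, base_rid: str, col: int, new_value: str) -> str:
        base = self.directory[base_rid]
        latest = self.directory[base.indirection] if base.indirection else None
        bits = ([c == "1" for c in latest.encoding[:NUM_COLS]]
                if latest else [False] * NUM_COLS)
        values = list(latest.values) if latest else [None] * NUM_COLS
        back = base.indirection or base_rid
        if not bits[col]:
            # First update of this column: snapshot the original value first.
            old: List[Optional[str]] = [None] * NUM_COLS
            old[col] = base.values[col]
            only_col = [i == col for i in range(NUM_COLS)]
            back = self._append_tail(back, only_col, old, snapshot=True)
        # Cumulative update: repeat the latest values of all columns updated so far.
        bits[col] = True
        values[col] = new_value
        new_rid = self._append_tail(back, bits, values)
        base.indirection = new_rid  # the only in-place write to a base record
        return new_rid
```

Updating column A of b2 twice and then column C produces the tail records t1 to t5 with the same Indirection, Schema Encoding, and data values as in Table 2, and leaves the Indirection column of b2 pointing to t5.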
base pages are read-only and compressed (read optimized) while the tail pages are uncompressed^5 and grow using a strictly append-only technique (write optimized). Therefore, it is necessary to transform the recent committed updates (accumulated in tail pages) that are write optimized into read optimized form. A distinguishing feature of our lineage-based architecture is to introduce a contention-free merging process that is carried out completely in the background without interfering with foreground transactions. Furthermore, the contention-free merging procedure is applied only to the updated columns of the affected update ranges. There is not even a dependency among columns during the merge; thus, the different columns of the same record can be merged completely independently of each other at different points in time. The merge process is conceptually depicted in Figure 5, in which writer threads (i.e., update transactions) place candidate tail pages to be merged into the merge queue while the merge thread continuously takes pages from the queue and processes them.

Figure 5: Lazily, independently merging of tail & base pages.

Figure 6: Epoch-based, contention-free page de-allocation.

3.1 Contention-free, Relaxed Merge

In L-Store, we abide by one main design principle for ensuring contention-free processing, that is, "always operating on stable data". The inputs to the merge process are (1) a set of base pages (committed base records) that are read-only,^6 thus, stable data and (2) a set of consecutive committed tail records in tail pages,^7 thus, also stable data. The output of the merge process (that is also relaxed) is a set of newly consolidated base pages (also referred to as merged pages) that are read-only, compressed, and almost up-to-date, thus, stable data. To decouple users' transactions (writers) from the merge process, we also ensure that the write path of the ongoing transactions does not overlap with the write path of the merge process. Writers append new uncommitted tail records to tail pages (but as stated before uncommitted records do not participate in the merge), and writers perform in-place update of the Indirection column within base records to point to the latest version of the updated records in tail pages (but the Indirection column is not modified by the merge process), whereas the write path of the merge process consists of creating only a new set of read-only base pages.

3.1.1 Merge Algorithm

The merge algorithm, conceptually resembling the standard left-outer join, consists of (1) identifying a set of committed tail records in tail pages; (2) loading the corresponding outdated base pages; (3) consolidating the base and tail pages; (4) updating the page directory; and (5) de-allocating the outdated base pages.

Step 1: Identify committed tail records in tail pages: Select a set of consecutive fully committed tail records (or pages) since the last merge within each update range.

Step 2: Load the corresponding outdated base pages: For a selected set of committed tail records, load the corresponding outdated base pages for the given update range (limit the load to only outdated columns). This step can further be optimized by not loading sub-ranges of records that have not changed since the last merge. No latching is required when loading the base pages.

Step 3: Consolidate the base and tail pages: For every updated column, the merge process will read n outdated base pages, apply a set of recent committed updates from the tail pages, and write out m new pages.^8 First, the Base RID column of the committed tail pages (from Step 1) is scanned in reverse order to find the list of the latest version of every updated record since the last merge (a temporary hashtable may be used to keep track of whether the latest version of a record is seen or not). Subsequently, the latest tail records are applied in reverse order to the base records until an update to every record in the base range is seen or the list is exhausted (skip any intermediate versions for which a newer update exists in the selected tail records). If a latest tail record indicates the deletion of the record, then the deleted record will be included in the consolidated records.^9 Any compression algorithm (e.g., dictionary encoding) can be applied on the consolidated pages (on column basis) followed by writing the compressed pages into newly

^5 Even though compression techniques such as local and global dictionaries can be employed in tail pages, these directions are outside the scope of the current work.

^6 The Indirection column is the only column that undergoes in-place update, and it never participates in the merge process.

^7 Note that not every committed update has to be applied as the merge process is relaxed, and the merge eventually processes all committed tail records.

^8 At most one merged page per column could be left underutilized for a range of records after the merge process. To further reduce the underutilized merged pages, one may define finer range partitioning for updates (e.g., 2^12 records), but operate merges at coarser granularity (e.g., 2^16 records). This will provide the benefit of locality of access for readers given the smaller range size of 2^12, yet it provides a better space utilization and compression for newly created merge pages when larger ranges are chosen.

^9 Alternatively, if all the deleted values are also stored in tail records, then it is sufficient to fill all data columns with the special null value ∅ for deleted records in the final merged pages. However, we would still need to preserve the Indirection column of deleted records in order to provide access to the earlier versions of deleted records.
created pages. Moreover, the old Start Time column remains intact during the merge process because this column is needed to hold the original insertion time of the record.¹⁰ Therefore, to keep track of the time for the consolidated records, the Last Updated Time column is populated to store the Start Time of the applied tail records. The Schema Encoding column may also be populated during the merge to reflect all the columns that have been changed for each record.

¹⁰ The Start Time column is also a highly compressible column with a negligible space overhead to maintain it.

Step 4: Update the page directory: The pointers in the page directory are updated to point to the newly created merged pages. Essentially this is the only foreground action taken by the merge process, which is simply to swap and update pointers in the page directory, an index structure that is updated only rarely, when new pages are allocated.

Step 5: De-allocate the outdated base pages: The outdated base pages are de-allocated once the current readers are drained naturally via an epoch-based approach. The epoch is defined as a time window in which the outdated base pages must be kept around as long as there is an active query that started before the merge process. Pointers to the outdated base pages are kept in a queue to be re-claimed at the end of the query-driven epoch-window. The pointer swapping and the page de-allocation are illustrated in Figure 6.

An example of our merge process is shown in Table 4 based on our earlier update example, in which we apply the first seven tail records (denoted by t1 to t7) to their corresponding base pages. The resulting merged pages are shown, where the affected records are highlighted. Note that only the updated columns are affected by the merge process (and the Indirection column is not affected). Furthermore, not all updates need to be applied; only the latest version of every updated record needs to be consolidated while the other entries are simply discarded. In our example, only the tail records t5 and t7 participated in the merge, and the rest were discarded.

| RID | Indirection | Schema Encoding | Start Time | Last Updated Time | Key | A | B | C |
Partitioned base records for the key range of k1 to k3; Tail-page Sequence Number (TPS) = 0
| b1 | t8 | 0000 | 10:02 | | k1 | a1 | b1 | c1 |
| b2 | t5 | 0101 | 13:04 | | k2 | a2 | b2 | c2 |
| b3 | t7 | 0001 | 15:05 | | k3 | a3 | b3 | c3 |
Relevant tail records (below TPS ≤ t7 high-watermark) for the key range of k1 to k3
| t5 | t4 | 0101 | 19:25 | | ∅ | a22 | ∅ | c21 |
| t7 | t6 | 0001 | 19:45 | | ∅ | ∅ | ∅ | c31 |
Resulting merged records for the key range of k1 to k3; TPS = t7
| b1 | t8 | 0000 | 10:02 | 10:02 | k1 | a1 | b1 | c1 |
| b2 | t5 | 0101 | 13:04 | 19:25 | k2 | a22 | b2 | c21 |
| b3 | t7 | 0001 | 15:05 | 19:45 | k3 | a3 | b3 | c31 |

Table 4: An example of the relaxed and almost up-to-date merge procedure (conceptual tabular representation).

It is important to note that merging table-level tail-pages with base pages in the insert range follows a similar process as above with a few simplifications. First, the consolidation process is rather trivial because tail records in the table-level tail-range follow the exact same insertion order as in the insert range (a trivial join-like operation). Also, the insert range does not actually have any value except for the Indirection column (which does not even participate in the merge itself). Thus, the merge process is essentially reading a set of consecutive committed tail records and compressing them to create a set of newly merged pages. Another simplification is that after the merged pages are created and the page directory is updated, the old table-level tail-pages can be discarded permanently once all the active queries that started prior to the merge process have terminated. In contrast, the regular tail pages survive after the merge in order to enable answering historic queries and to avoid interfering with update transactions. All base records that have been merged with their table-level tail-pages are considered to be outside the insert range.

We further strengthen our data stability condition for bringing base pages up-to-date. Earlier we stated that the merge only operates on a set of committed consecutive tail records, but no condition was imposed on the base records. Now we strengthen this condition by requiring that the base records must also fall outside the insert range before becoming a candidate for merging the recent updates.

3.1.2 Merge Analysis

A key distinguishing feature of our lineage-based storage architecture is to allow contention-free merging of tail and base pages without interfering with concurrent transactions. To formalize our merge process, we prove that merge operates only on stable data without any information loss and that the merge does not limit users' transactions from accessing and/or modifying the data that is being merged.

Lemma 1 Merge operates strictly on stable data.

PROOF. Recall that by construction, we enforced that merge "always operates on stable data". The inputs to the merge process are (1) a set of base pages consisting of committed base records that are read-only and outside the insert range, thus, stable data and (2) a set of consecutive committed tail records in tail pages, thus, also stable data. The output of the merge process is a set of newly merged pages that are read-only, thus, stable data as well. Hence, the merge process strictly takes stable data as input and produces stable data as well.

Lemma 2 Merge safely discards outdated base pages without violating any query's snapshot.

PROOF. In order to support snapshot isolation semantics and time travel queries, we need to ensure that earlier versions of records that participate in the merge process are retained. Since we never perform in-place updates and each update is transformed into appending a new version of the record to tail pages, then as long as tail pages are not removed, we can ensure that we have access to every updated version. But recall that outdated base pages are de-allocated using our proposed epoch-based approach after being merged. Also note that base pages contain the original values from when a record was first created. Therefore, any original values that were later updated must be stored before discarding outdated base pages after a merge has taken place. In other words, we must ensure that outdated base pages are discarded safely.

As a result, the two fundamental criteria, namely, relaxing the merge (i.e., constructing an almost up-to-date snapshot) and operating on stable data, are not sufficient to ensure the safety property of the merge. The last missing piece that enables safety of the merge is accomplished by taking a snapshot of the original values when a column is being updated for the first time (as described in Section 2.3). In other words, we have further strengthened our data stability criterion by ensuring stability even in the committed history. Hence, outdated base pages can be safely discarded without any information loss, namely, the merge process is safe.

Theorem 1 The merge process and users' transactions do not contend for base and tail pages or the resulting merged pages, namely, the merge process is contention-free.

PROOF. As part of ensuring contention-free merge, we have already shown that merge operates on stable data (proven by Lemma 1) and that there is no information loss as a result of the merge process (proven by Lemma 2). Next we prove that the write path of the merge process does not overlap with the write path of users' transactions (i.e., writers). Recall that writers append new uncommitted tail records to tail pages (but, as stated before, uncommitted records do not participate in the merge), and writers perform in-place update of the Indirection column within base records to point to the latest version of the updated records in tail pages (but the Indirection column is not modified by the merge process), whereas the write path of the merge process consists of creating only a new set of read-only merged pages and eventually discarding the outdated base pages safely.

Therefore, we must show that safely discarding base pages does not interfere with users' transactions. In particular, as explained in Lemma 2, if the original values were not written to tail records at the time of the update, then during the merge process we would be forced to store them somewhere or incur information loss. It is not even clear where the optimal location for storing the original values would be. A simple-minded approach of just adding them to tail pages would have broken the linear order of changes to records such that the older values would have appeared after the newer values, and it would have interfered with the ongoing update transactions. But, more importantly, the need to store the old values at any location would have implied that during the merge process multiple coordinated actions were required to ensure consistency across modifications to isolated locations; hence, breaking the contention-free property of the merge. Therefore, by storing the original updated values at the time of update, we trivially eliminate all the potential contention during the merge process in order to safely discard outdated base pages.

As a result, users' transactions are completely decoupled from the merge process, and users' transactions and the merge process do not contend over base, tail, or merged pages.

When analyzing the performance of our merge algorithm, we observe that in the worst case, data from all updated columns for a given update range is read and written back, but it is a cost that is amortized over many updates, a key strength of our lineage-based storage architecture. In general, if updates are spread over a range of records (even if skewed), then the data for the entire range has to be read and written; however, when updates are strictly localized, then additional optimization can be applied to further prune the set of records read.

However, it is important to note that L-Store's objective has been to introduce a contention-free merge procedure that does not interfere with the concurrent transactions, because contention is the most important deciding factor in the overall performance of the system, especially as the size of the main memory continues to increase (arguably the entire transactional data can fit in the main memory today) and the storage-class memories (such as SSDs) replace the mechanical disks [8, 30]. Nevertheless, we point out that there are potential opportunities to further improve the merge process execution time by employing and studying more complex join algorithms for implementing the merge by operating directly on the compressed data without the need to decompress and compress the data (a complementary direction that is outside the scope of the current work). Still, in our evaluation (Section 5), with a single asynchronous merge thread, we were able to cope with tens of concurrent writer threads, and we were able to process millions of updates per second when updating 40% of the columns on average.

Lastly, we would like to point out that there are also a number of potential opportunities to guide the merge process in order to further accelerate relaxed analytical queries (those queries that can tolerate a slightly outdated snapshot) by implicitly constructing a slightly outdated but consistent snapshot of the data across the entire table during the merge. Currently, our proposed merge is already relaxed and brings base pages almost up-to-date in time. Now we suggest to further coordinate the merge such that every merge not only takes a set of consecutive committed tail records, but also takes only those consecutive committed records before an agreed upon time t_i. Thus, after merging a range of records, we are ensured that only committed records before the time t_i are processed. Furthermore, we propose that every page also maintains its temporal lineage to remember the timestamp of the earliest committed records that have not been merged yet, where ideally its timestamp is after t_i. Any range of records that is yet to be merged or has failed to bring base pages forward in time up to t_i can be manually brought forward to t_i as part of the normal query processing by consolidating tail pages. Periodically, the agreed upon merge time is advanced from t_i to t_{i+1}, and all subsequent merges are adjusted accordingly. However, exploiting the temporal lineage to speed up relaxed analytical queries (almost for free) is outside the scope of the current work.¹¹

¹¹ A similar optimization for constructing almost up-to-date and consistent snapshots was first introduced in [23], but it required draining all active queries before the out-of-date snapshot could be advanced in time, or it required maintaining multiple almost up-to-date snapshots simultaneously. Unlike the approach in [23], our proposed merge algorithm combined with the temporal lineage eliminates all contention with the ongoing queries such that the relaxed snapshot can be brought forward in time lazily and asynchronously.

3.2 Maintaining Lineage

The lineage of each base page (and consequently merged pages) is maintained independently as a result of the merge process. The lineage information is instrumental to decouple the merge and update processing and to allow independent merging of the different columns of the same record at different points in time. The lineage information is captured using a rather simple and elegant concept, which we refer to as the tail-page sequence number (TPS), in order to keep track of how many updated entries (i.e., tail records) from tail pages have been applied to their corresponding base pages after the completion of a merge. Original base pages always start with TPS set to 0, a value that is monotonically increasing after every merge. Again, to ensure this monotonicity property, we stressed earlier that a consecutive set of committed tail records is always used in the merge process.

TPS is also used by readers to interpret the indirection pointer (also a monotonically increasing value) after the merge has taken place. Consider our running example in Table 4. After the first merge process, the newly merged pages have TPS set to 7, which implies that the first seven updates (tail records t1 to t7) in the tail pages have been applied to the merged pages. Consider the record b2 in the base pages that has an indirection value pointing to t5 (cf. Table 4); there are two possible interpretations. If the transaction is reading the base pages with TPS set to 0, then the 5th update has not yet been reflected on the base page. Otherwise, if the transaction is reading the base pages with TPS 7, then the update referenced by indirection value t5 has already been applied to the base pages as seen in Table 4. Notably, the Indirection column is updated only in-place (also a monotonically increasing value) by writers, while merging tail pages does not affect the indirection value.

More importantly, we can leverage the TPS concept to ensure read consistency of users' transactions when the merge is performed lazily and independently for the different columns of the same records. When the merge of columns is decoupled, each merge occurs independently and at a different point in time. Consequently, not all base pages are brought forward in time simultaneously. Additionally, even if the merge occurs for all columns simultaneously, it is still possible that a reader reads base pages for the column A before the merge (or during the merge before the page directory is updated) while the same reader reads the column C after the merge; thus, reading a set of inconsistent base and merged pages.

Lemma 3 An inconsistent read with concurrent merge is always detectable.

PROOF. Since each base page independently tracks its lineage (i.e., its TPS counter), TPS can be used to verify read consistency. In particular, for a range of records, all read base pages must have an identical TPS counter; otherwise, the read will be inconsistent. Hence, an inconsistent read across different columns of the same record is always detectable.

Theorem 2 Constructing consistent snapshots with concurrent merge is always possible.

PROOF. As proved in Lemma 3, a read inconsistency is always detectable. Furthermore, once a read inconsistency is encountered, each page is simply brought to the desired query snapshot independently by examining its TPS and the indirection value and consulting the corresponding tail pages using the logic outlined earlier. Hence, consistent reads by constructing consistent snapshots across different columns of the same record are always possible.

TPS, or an alternative but conceptually similar counter, could be used as a high-water mark for resetting the cumulative updates as well. We continue with our running scenario, in which we have the original base pages with the TPS 0 (as shown in Table 4) and the merged pages with the TPS 7 (as shown in Table 5). For simplicity, we assume the cumulation was also reset after the 7th tail record. For the record b2, we see that the indirection pointer is t12, for which we know that the cumulative update has been reset after the 7th update. This means that the tail record t12 does not carry updates that were accumulated between tail records 1 to 7. Suppose that the record was updated four times, where the update entries in the tail pages are the 3rd, 5th, 10th, and 12th tail records. The tail record t5 is cumulative and carries the updated values from the tail record t3. However, the tail record t10 is not cumulative (the reset occurred at the 8th update), whereas the tail record t12 is cumulative, but carries updates only from the tail record t10 and not from t5 and t3. Suppose that a transaction is reading the base pages with the TPS 0; then to reconstruct the full version of the record b2, it must read both the tail records t5 and t12 (while skipping the 3rd and 10th). But if a transaction is reading from the merged pages with the TPS 7, then it is sufficient to only read the tail record t12 to fully reconstruct the record because the 3rd and 5th updates have already been applied to the merged pages.

| RID | Indirection | Schema Encoding | Start Time | Last Updated Time | Key | A | B | C |
Recently merged records for the key range of k1 to k3; TPS = t7
| b1 | t8 | 0000 | 10:02 | 10:02 | k1 | a1 | b1 | c1 |
| b2 | t12 | 0101 | 13:04 | 19:25 | k2 | a22 | b2 | c21 |
| b3 | t11 | 0001 | 15:05 | 19:45 | k3 | a3 | b3 | c31 |
Partitioned tail records for the key range of k1 to k3
| t1 | b2 | 0100* | 13:04 | | ∅ | a2 | ∅ | ∅ |
| t2 | t1 | 0100 | 19:21 | | ∅ | a21 | ∅ | ∅ |
| t3 | t2 | 0100 | 19:24 | | ∅ | a22 | ∅ | ∅ |
| t4 | t3 | 0001* | 13:04 | | ∅ | ∅ | ∅ | c2 |
| t5 | t4 | 0101 | 19:25 | | ∅ | a22 | ∅ | c21 |
| t6 | b3 | 0001* | 15:05 | | ∅ | ∅ | ∅ | c3 |
| t7 | t6 | 0001 | 19:45 | | ∅ | ∅ | ∅ | c31 |
| t8 | b1 | 0000 | 20:15 | | ∅ | ∅ | ∅ | ∅ |
| t9 | t5 | 0010* | 13:04 | | ∅ | ∅ | b2 | ∅ |
| t10 | t9 | 0010 | 21:25 | | ∅ | ∅ | b21 | ∅ |
| t11 | t7 | 0001 | 21:30 | | ∅ | ∅ | ∅ | c32 |
| t12 | t10 | 0110 | 21:55 | | ∅ | a23 | b21 | ∅ |

Table 5: An example of the indirection interpretation and lineage tracking (conceptual tabular representation).

3.3 Compressing Historic Data

For historic tail pages, namely, the committed and subsequently merged tail pages, we introduce a contention-free compression scheme to substantially reduce the storage footprint and improve access patterns for historic queries (or time travel queries). Periodically, for a range of records, we compress only a set of merged tail pages (across all updated columns) that fall outside the oldest query snapshot in order to avoid clashing with the readers/writers of non-historic data.

Our compression scheme takes as input a set of tail pages for all updated columns for a given range of records (stable data) and outputs a set of newly compressed tail pages (in columnar form). Its key benefits are the following properties (as demonstrated in Table 6). First, the compressed tail records are re-ordered according to the base RID order, hence improving the locality of access. Second, for each record, and within each column, the different versions are stored inline and contiguously. The version inlining avoids the need to repeatedly store unchanged values due to cumulative updates, but, more importantly, it enables delta compression among the different versions of the records to further reduce the space overhead. Also, collapsing the different versions of the same record into a single tail record eliminates the need for back pointers that are needed for referencing the previous versions. Thus, the inline versions are tightly packed and ordered temporally as shown in Table 6.

| RID | Indirection | Schema Encoding | Start Time | A | C |
Merged, committed tail records for the key range of k1 to k3
| t1 | b2 | 0100* | 13:04 | a2 | ∅ |
| t2 | t1 | 0100 | 19:21 | a21 | ∅ |
| t3 | t2 | 0100 | 19:24 | a22 | ∅ |
| t4 | t3 | 0001* | 13:04 | ∅ | c2 |
| t5 | t4 | 0101 | 19:25 | a22 | c21 |
| t6 | b3 | 0001* | 15:05 | ∅ | c3 |
| t7 | t6 | 0001 | 19:45 | ∅ | c31 |
Ordered, inlined, compressed committed tail records for the key range of k1 to k3
| c1 | b2 | 0101 | 13:04, 19:21, 19:24, 19:25 | a2, a21, a22, - | c2, -, -, c21 |
| c2 | b3 | 0001 | 15:05, 19:45 | ∅ | c3, c31 |

Table 6: An example of compressing merged tail pages (conceptual tabular representation).

The compressed tail pages are read-only and are used exclusively for historic queries. These new pages can be isolated and pushed down lower in the storage hierarchy since they are by definition colder and not accessed as frequently. Lastly, the page directory is updated by swapping the pointers for the old tail pages to point to the newly created compressed tail pages. Notably, considering that the access frequency is much lower for historic tail pages compared to the access frequency of non-historic tail pages, any locking- or non-locking-based approaches (such as the epoch-based approach discussed in Section 3.1) can be employed without noticeably affecting the overall system performance.
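The re-ordering and version inlining described in Section 3.3 can be illustrated with a minimal Python sketch. This is not the L-Store implementation; the record layout (dictionaries with the keys `rid`, `prev`, `time`, and `values`), the function names, and the use of `None` for the "-" (unchanged) marker are all assumptions made for illustration. The sketch also assumes that base RIDs start with "b" and that timestamps are zero-padded "HH:MM" strings within one day, so that plain string sorting orders them temporally. The input mirrors the merged tail records t1 to t7 of Table 6.

```python
from collections import defaultdict

# Merged, committed tail records t1..t7 from Table 6 (hypothetical layout).
TAIL = [
    {"rid": "t1", "prev": "b2", "time": "13:04", "values": {"A": "a2"}},
    {"rid": "t2", "prev": "t1", "time": "19:21", "values": {"A": "a21"}},
    {"rid": "t3", "prev": "t2", "time": "19:24", "values": {"A": "a22"}},
    {"rid": "t4", "prev": "t3", "time": "13:04", "values": {"C": "c2"}},
    {"rid": "t5", "prev": "t4", "time": "19:25", "values": {"A": "a22", "C": "c21"}},
    {"rid": "t6", "prev": "b3", "time": "15:05", "values": {"C": "c3"}},
    {"rid": "t7", "prev": "t6", "time": "19:45", "values": {"C": "c31"}},
]


def base_rid_of(rid, back_pointer):
    """Follow back pointers from a tail RID until a base RID is reached."""
    while not rid.startswith("b"):
        rid = back_pointer[rid]
    return rid


def compress_historic(tail_records, columns):
    """Group tail records by base RID and inline the versions per column."""
    back_pointer = {r["rid"]: r["prev"] for r in tail_records}
    by_base = defaultdict(list)
    for r in tail_records:
        by_base[base_rid_of(r["rid"], back_pointer)].append(r)

    compressed = {}
    # Re-order by base RID (numeric part) to improve locality of access.
    for base in sorted(by_base, key=lambda b: int(b[1:])):
        # Collapse records that share a timestamp into one version slot.
        slots = defaultdict(dict)
        for r in by_base[base]:
            slots[r["time"]].update(r["values"])
        times = sorted(slots)
        inlined = {c: [] for c in columns}
        last_seen = {c: None for c in columns}
        for t in times:
            for c in columns:
                v = slots[t].get(c)
                if v is None or v == last_seen[c]:
                    inlined[c].append(None)  # unchanged: the "-" marker
                else:
                    inlined[c].append(v)
                    last_seen[c] = v
        compressed[base] = {"times": times, **inlined}
    return compressed


result = compress_historic(TAIL, ["A", "C"])
```

For the record b2, `result["b2"]` holds four temporally ordered version slots in which the repeated value a22 carried by the cumulative record t5 is stored only once, which matches the row c1 of Table 6. No back pointers remain in the output because all versions of a record sit in one entry.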