L-Store: A Real-time OLTP and OLAP System ∗

Mohammad Sadoghi†, Souvik Bhattacherjee‡, Bishwaranjan Bhattacharjee†, Mustafa Canim†
†IBM T.J. Watson Research Center ‡University of Maryland, College Park

arXiv:1601.04084v1 [cs.DB] 15 Jan 2016

∗ Work was performed as part of a summer internship at IBM T.J. Watson Research Center under Mohammad Sadoghi's mentorship.

ABSTRACT

Arguably data is a new natural resource in the enterprise world with an unprecedented degree of proliferation. But to derive real-time actionable insights from the data, it is important to bridge the gap between managing the data that is being updated at a high velocity (i.e., OLTP) and analyzing a large volume of data (i.e., OLAP). However, there has been a divide where specialized solutions were often deployed to support either OLTP or OLAP workloads but not both; thus, limiting the analysis to stale and possibly irrelevant data. In this paper, we present Lineage-based Data Store (L-Store) that combines the real-time processing of transactional and analytical workloads within a single unified engine by introducing a novel lineage-based storage architecture. We develop a contention-free and lazy staging of columnar data from a write-optimized form (suitable for OLTP) into a read-optimized form (suitable for OLAP) in a transactionally consistent approach that also supports querying and retaining the current and historic data. Our working prototype of L-Store demonstrates its superiority compared to state-of-the-art approaches under a comprehensive experimental evaluation.

1. INTRODUCTION

We are now witnessing an architectural shift and divide in the database community. The first school of thought emerged from an academic conjecture that “one size does not fit all” [35] (i.e., advocating specialized solutions), which has led to manifold innovations over the last decade in creating specialized database engines geared toward various niche workloads and application scenarios (e.g., [35, 5, 11, 9, 28, 8, 21, 36]). This school has successfully influenced major database vendors such as Microsoft to focus on building new specialized engines offered as loosely integrated engines (e.g., Hekaton in-memory engine [8] and Apollo column store engine [19]) within a single umbrella of database portfolio (notably, recent efforts are now focused on a tighter real-time integration of Hekaton and Apollo engines [18]). It has also influenced Oracle to partially accept the basic premise that “one size does not fit all” as far as data representation is concerned and has led Oracle to develop a dual-format technique [16] that maintains two tightly integrated representations of data (i.e., two copies of the data) in a transactionally consistent manner.

However, the second school of thought, supported by both academia (e.g., [13, 6, 7]) and industry (e.g., SAP [10]), rejects the aforementioned fundamental premise and advocates a generalized solution. Proponents of this idea, rightly in our view, make the following arguments. First, there is a tremendous cost in building and maintaining multiple engines from both the perspective of database vendors and users of the systems (e.g., application development and deployment costs). Second, there is a compelling case to support real-time decision making on the latest version of the data [27] (likewise supported by [16, 18]), which may not be feasible across loosely integrated engines that are connected through the extract-transform-load (ETL) process. Closing this gap may be possible, but its elimination may not be feasible without solving the original problem of unifying OLTP and OLAP capabilities or without being forced to rely on ad-hoc approaches to bridge the gap in hindsight. We argue that the separation of OLTP and OLAP capabilities is a step backward that defers solving the actual challenge of real-time analytics. Third, combining real-time OLTP and OLAP functionalities remains an important basic research question, which demands deeper investigation even if it is purely from the theoretical standpoint.

In this dilemma, we support the latter school of thought (i.e., advocating a generalized solution) with the goal of undertaking an important step to study the entire landscape of single engine architectures and to support both transactional and analytical workloads holistically (i.e., “one size fits all”). In this paper, we present Lineage-based Data Store (L-Store) with a novel lineage-based storage architecture to address the conflicts between row- and column-major representation by developing a contention-free and lazy staging of columnar data from write optimized into read optimized form in a transactionally consistent manner without the need to replicate data, to maintain multiple representations of data, or to develop multiple loosely integrated engines that sacrifice real-time capabilities.

To further disambiguate our notion of “one size fits all”, in this paper, we restrict our focus to real-time relational OLTP and OLAP capabilities. We define a set of architectural characteristics for distinguishing the differences between existing techniques. First, there could be a single product consisting of multiple loosely integrated engines that can be deployed and configured to support either OLTP or OLAP. Second, there could be a single engine as opposed to having multiple specialized engines packaged in a single product. Third, even if we have a single engine, then we could have multiple instances running over a single engine, where one instance is dedicated and configured for OLTP workloads while another instance is optimized for OLAP workloads, in which the different instances are assumed to be connected using an ETL process. Finally, even when using the same engine running a single instance, there could be multiple copies or representations (e.g., row vs. columnar layout) of the data, where one copy (or representation) of the data is read optimized while the second copy (or representation) is write optimized. The architectural comparison of various existing techniques based on our rigorous definition of
“one size fits all” is outlined in Table 1.¹

Characteristic | L-Store | HANA [27] | ES2 [6, 7] | HyPer [13] | Oracle Dual-format [16] | Microsoft SQL Server [18] | H-Store + Hadoop [11]
Single Product | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | –
Single Engine | ✓ | ✓ | ✓ | ✓ | ✓ | – | –
Single Instance | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | –
Single Copy | ✓ | ✓ | ✓ | data page replication (OS forking) | – | – | –
Single Representation | ✓ | main + delta | ✓ | ✓ | – | – | –
OLTP-optimized | ✓ | ✓ | limited to get/put operations | limited to partitionable workloads (i.e., serial execution) | ✓ | ✓ | ✓
OLAP-optimized | ✓ | ✓ | inconsistent snapshot is possible | ✓ | ✓ | ✓ | ✓
Unified OLTP + OLAP | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | –

Table 1: Architectural characterization of selected database engines.

¹ However, it is crucial to note that the presented comparison is solely focused on the overall architectural choices, and it does not make any claims about the relative system performance and/or functionalities. For example, if HANA contains more checkmarks than Microsoft SQL Server, it does not imply that HANA is a better product; instead, it simply assesses HANA architecturally with respect to our definition of “one size fits all”.

In short, we develop L-Store as an important first step towards supporting real-time OLTP and OLAP processing that faithfully satisfies our definition of generalized solution, and, in particular, we make the following contributions:

• Introducing a contention-free update mechanism over a native columnar storage model in order to lazily and independently stage stable data from a write-optimized columnar layout (i.e., OLTP) into a read-optimized columnar layout (i.e., OLAP)

• Achieving (at most) 2-hop away access to the latest version of any record (preventing read performance deterioration for point queries)

• Contention-free merging of only stable data, namely, merging of the read-only base data with recently committed updates (both in columnar representation) without the need to block ongoing or new transactions

• Contention-free page de-allocation (upon the completion of the merge process) using an epoch-based approach without the need to drain the ongoing transactions

• A first of its kind comprehensive evaluation to study the leading architectural design for concurrently supporting short update transactions and analytical queries (e.g., an in-place update with a history table architecture and the commonly employed main and delta stores architecture)

1.1 Motivating Real-time OLTP and OLAP

Before describing our proposed approach in-depth, we briefly present two important scenarios that benefit greatly from a real-time OLTP and OLAP solution.

Consider the mobile e-commerce market, in which the revenue for the location-based mobile advertising alone is expected to reach $18 billion by 2019 [25]. A potential buyer with a mobile device may roam around physically while shopping. In the meantime, the shopper's mobile device generates location information. Alternatively, as the shopper browses the web, again the location information is either exchanged explicitly or detected automatically based on the shopper's IP address or by its connection to the nearby WiFi routers. Now the task of any real-time targeted advertising auction is to determine and present a set of relevant ads to the shopper by running analytics over the location information, shopping patterns, past purchases, and browsing history of the shopper. Furthermore, if these advertisements result in a purchase, then the resulting transactions need to become available immediately to subsequent analytics in order to improve the effectiveness of future advertisements. Moreover, the actual ad bidding in the auction also requires transactional semantics support in real-time. Finally, all these steps must typically be completed within 150 milliseconds [2]. Therefore, we argue that in order to sustain high velocity transactional data (e.g., Google AdWords served almost 30 billion ads per day in 2012 [14]) while executing complex analytics on the latest and historic (transactional) data, there is a compelling need to develop a solution that exhibits true real-time OLTP and OLAP capabilities.

Another prominent scenario is fraud detection, especially at a time when the cost of cybercrime continues to increase at a staggering rate and has already surpassed $400 billion annually [1]. For instance, a credit card company will need to approve a transaction in a small time window (i.e., subsecond ranges). During this short time span, it is forced to determine if a transaction is fraudulent or not. Thus, there is a crucial need to run complex analytics in real-time as part of the transaction that is being processed. Without such a proactive fraud detection capability, fraudulent transactions may remain undetectable, which may result in irreversible financial losses, as has clearly been witnessed when billions of dollars are being lost due to fraud activities every year [1]. Furthermore, there are indirect financial losses involving stakeholders such as credit card companies and merchants. The indirect losses are attributed to the decline of legitimate transactions that disrupts merchants' daily operation, lost payment volume as consumers opt for alternative payment types that are perceived to be safer, and lost customers due to card cancellation and reissue [26].

2. UNIFIED ARCHITECTURE

The divide in the database community is partly attributed to the storage conflict pertaining to the representation of transactional and analytical data. In particular, transactional data requires write-optimized storage, namely the row-based layout, in which all columns are co-located (and preferably uncompressed for in-place updates). This layout improves point update mechanisms, since accessing all columns of a record can be achieved by a single I/O (or few cache misses for memory-resident data). In contrast, to optimize the analytical workloads (i.e., reading many records), it is important to have read-optimized storage, i.e., columnar layout in highly compressed form. The intuition behind having columnar layout is due to the observation that most analytical queries tend to access only a small subset of all columns [3]. Thus, by storing data column-wise, we can avoid reading irrelevant columns (i.e., reducing the raw amount of data read) and avoid polluting the processor's cache with irrelevant data, which substantially improve both disk and memory bandwidth, respectively. Furthermore, storing data in columnar form improves the data homogeneity within each page, which results in an overall better compression ratio. This storage conflict is depicted in Figure 1.

Figure 1: Overview of storage layout conflict.

Figure 2: Overview of the lineage-based storage architecture.

2.1 L-Store Storage Overview

To address the dilemma between write- and read-optimized layouts, we develop L-Store. As demonstrated in Figure 2, the high-level architecture of L-Store is based on native columnar layout (i.e., data across columns are aligned to allow implicit re-construction), where records are (virtually) partitioned into disjoint ranges (also referred to as update range). Records within each range span a set of read-only, compressed pages, which we refer to as the base pages. More importantly, for every range of records, and for each updated column within the range, we maintain a set of append-only pages to store the latest updates, which we refer to as the tail pages. Anytime a record is updated in base pages, a new record is appended to its corresponding tail pages, where there are explicit values only for the updated columns (non-updated columns are preassigned a special null value when a page is first allocated). We refer to the records in base pages as the base records and the records in tail pages as the tail records. Each record (whether it falls in base or tail pages) spans over a set of aligned columns (i.e., no join is necessary to pull together all columns of the same record).²

² Fundamentally, there is no difference between base vs. tail record; the distinction is made only to ease the exposition.

A unique feature of our lineage-based architecture is that tail pages are strictly append-only and follow a write-once policy. In other words, once a value is written to tail pages, it will not be over-written even if the writing transaction aborts. The append-only design substantially simplifies the concurrency and recovery protocol as described in Section 4.1. Another important property of our lineage-based storage is that all data are represented in a common holistic form; there are no ad-hoc corner cases. Records in both base and tail pages are assigned record-identifiers (RIDs) from the same key space. Therefore, both base and tail pages are referenced through the database page directory using RIDs and persisted identically. Consequently, at the lower-level of the database stack, there is absolutely no difference between base vs. tail pages or base vs. tail records; they are presented and maintained identically.

To speed up query processing, there is also an explicit linkage (forward and backward pointers) among records. From a base record, there is a forward pointer to the latest version of the record in tail pages. The different versions of the same record in tail pages are chained together to enable fast access to an earlier version of the record. The linkage is established by introducing a table-embedded indirection column that stores forward pointers (i.e., RIDs) for base records and backward pointers (i.e., RIDs) for tail records.

The final aspect of our lineage-based architecture is a periodic, contention-free merging of a set of base pages with its corresponding tail pages. This is performed to consolidate base pages with the recent updates and to bring base pages forward in time (i.e., creating a set of merged pages). Tail pages that are already merged and fall outside the snapshot boundaries of all active queries are called historic tail-pages. These pages are re-organized, so that different versions of a record are stored contiguously inlined. Delta-compression is applied across different versions of tail records, and tail records are ordered based on the RIDs of their corresponding base records. Below, we describe the unique design and algorithmic features of L-Store that enable efficient transactional processing without performance deterioration of analytical processing; thereby, achieving real-time OLTP and OLAP.

2.2 Lineage-based Storage Architecture

In L-Store, the storage layout is natively columnar and applies equally to both base and tail pages. A detailed view of our lineage-based storage architecture is presented in Figure 3. In general, one can perceive tail pages as directly mirroring the structure and the schema of base pages. As we pointed out earlier, conceptually for every record, we distinguish between base vs. tail records, where each record is assigned a unique RID. But it is important to note that the RID assigned to a base record is stable and remains constant throughout the entire life-cycle of a record, and all indexes only reference base records (base RIDs); consequently, eliminating the index maintenance problem that arises when an update operation results in the creation of a new version of the record [33, 32]. A reader performing an index lookup always lands at a base record, and from the base record it can reach any desired version of the record by following the table-embedded indirection to access the latest (if the base record is out-of-date) or an earlier version of the record.
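The read path described above (an index lookup lands at a base record, and the indirection column leads to the wanted version) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Record`, `PageDirectory`, `add`, and `read_latest` are hypothetical, records are held in an in-memory dictionary rather than in columnar pages, and the sample rows are taken from the running example of Table 2.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Record:
    rid: str
    # Base record: forward pointer to the latest tail record (None if never updated).
    # Tail record: backward pointer to the previous version (or to the base RID).
    indirection: Optional[str]
    # Explicit values only; a missing column stands for the special null value.
    values: Dict[str, str] = field(default_factory=dict)


class PageDirectory:
    """Hypothetical single directory: base and tail RIDs share one key space."""

    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}

    def add(self, rid: str, indirection: Optional[str], **values: str) -> None:
        self.records[rid] = Record(rid, indirection, values)

    def read_latest(self, base_rid: str, column: str) -> str:
        base = self.records[base_rid]
        rid = base.indirection              # base record -> latest tail record
        while rid is not None and rid != base_rid:
            tail = self.records[rid]
            if column in tail.values:       # first explicit value is the newest one
                return tail.values[column]
            rid = tail.indirection          # walk back to the previous version
        return base.values[column]          # column never updated


directory = PageDirectory()
directory.add("b2", "t5", Key="k2", A="a2", B="b2", C="c2")
directory.add("t1", "b2", A="a2")           # snapshot of the old value of A
directory.add("t2", "t1", A="a21")
directory.add("t3", "t2", A="a22")
directory.add("t4", "t3", C="c2")           # snapshot of the old value of C
directory.add("t5", "t4", A="a22", C="c21") # cumulative update
directory.add("b4", None, Key="k4", A="a4", B="b4", C="c4")

print(directory.read_latest("b2", "A"))
print(directory.read_latest("b2", "B"))
```

With cumulative tail records, as in `t5`, an updated column is found at the first tail record reached from the base record, which is the intuition behind the paper's 2-hop access claim; the backward walk only matters when cumulation is absent or has been reset.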
However, when a record is updated, a new version is created. Thus, a new tail record is created to hold the new version, and the new tail record is assigned a new tail RID that is referenced by the base record (as demonstrated in Figure 3).

Figure 3: Detailed, unfolded view of lineage-based storage architecture.

Each table, in addition to having the standard data columns, has several meta-data columns. These meta-data columns include the Indirection column, the Schema Encoding column, the Start Time column, and the Last Updated Time column. An example of table schema is shown in Table 2.

The Indirection column exists in both the base and tail records. For base records, the Indirection column is interpreted as a forward pointer to the latest version of a record residing in tail pages, essentially storing the RID of the latest version of a record. If a record has never been updated, then the Indirection column will hold a null value. In contrast, for tail records, the Indirection column is used to store a backward pointer to the last updated version of a record in tail pages. If no earlier version exists, then the Indirection column will point to the RID of the base record.

The Schema Encoding column stores the bitmap representation of the state of the data columns for each record, where there is one bit assigned for every column in the schema (excluding the meta-data columns), and if a column is updated, its corresponding bit in the Schema Encoding column is set to 1, otherwise it is set to 0. The schema encoding makes it possible to quickly determine if a column has ever been updated or not (for base records) or to determine, for each tail record, which columns have been updated and have explicit values as opposed to those columns that have not been updated and have implicit special null values (denoted by ∅). An example of the Schema Encoding column is provided in Table 2.

The Start Time column stores the time at which a base record was first installed in base pages (the original insertion time), and for a tail record, the Start Time column holds the time at which the record was updated, which is also the implicit end time of the previous version of the record. In addition to the Start Time column, for base records, we maintain an optional Last Updated Time column, which is only populated after the merge process has taken place and holds the Start Time of those tail records included in merged pages. Also note that the initial Start Time column for base records is always preserved (even after the merge process) for faster pruning of those records that are not visible to readers because they fall outside the reader's snapshot. Lastly, we may optionally add the Base RID column to tail records to store the RIDs of their corresponding base records; this is utilized to improve the merge process. Base RID is a highly compressible column that would require at most two bytes when restricting the range partitioning of records to 2^16 records.

2.3 Fine-grained Storage Manipulation

Transaction processing can be viewed as two major challenges: (1) how data is physically manipulated at the storage layer and how changes are propagated to indexes and (2) how multiple transactions (where each transaction consists of many statements) can concurrently coordinate reading and writing of the shared data. The focus of this paper is on the former challenge, and we defer the latter to our discussion on the employed concurrency model in Section 4.1.

2.3.1 Update and Delete Procedures

Without loss of generality, we focus on how to handle a single point update or delete in L-Store (but note that we support multi-statement transactions as demonstrated by our evaluation). Each update may affect a single or multiple records. Since records are (virtually) partitioned into a set of disjoint ranges (as shown in Table 2), each updated record naturally falls within only one range. Now for each range of records, upon the first update to that range, a set of tail pages is created (and optionally persisted on disk) for the updated columns and is added to the page directory, i.e., lazy tail-page allocation. Consequently, updates for each record range are appended to their corresponding tail pages of the updated columns only; thereby, avoiding in-place updates for the data columns and clustering updates for a range of records within their corresponding tail pages.

RID | Indirection | Schema Encoding | Start Time | Key | A | B | C
Partitioned base records for the key range of k1 to k3
b1 | t8 | 0000 | 10:02 | k1 | a1 | b1 | c1
b2 | t5 | 0101 | 13:04 | k2 | a2 | b2 | c2
b3 | t7 | 0001 | 15:05 | k3 | a3 | b3 | c3
Partitioned base records for the key range of k4 to k6
b4 | ⊥ | 0000 | 16:20 | k4 | a4 | b4 | c4
b5 | ⊥ | 0000 | 17:21 | k5 | a5 | b5 | c5
b6 | ⊥ | 0000 | 18:02 | k6 | a6 | b6 | c6
Partitioned tail records for the key range of k1 to k3
t1 | b2 | 0100* | 13:04 | ∅ | a2 | ∅ | ∅
t2 | t1 | 0100 | 19:21 | ∅ | a21 | ∅ | ∅
t3 | t2 | 0100 | 19:24 | ∅ | a22 | ∅ | ∅
t4 | t3 | 0001* | 13:04 | ∅ | ∅ | ∅ | c2
t5 | t4 | 0101 | 19:25 | ∅ | a22 | ∅ | c21
t6 | b3 | 0001* | 15:05 | ∅ | ∅ | ∅ | c3
t7 | t6 | 0001 | 19:45 | ∅ | ∅ | ∅ | c31
t8 | b1 | 0000 | 20:15 | ∅ | ∅ | ∅ | ∅

Table 2: An example of the update and delete procedures (conceptual tabular representation).

To describe the update procedure in L-Store, we rely on our running example shown in Table 2. When a transaction updates any column of a record for the first time, two new tail records (each tail record is assigned a unique RID) are created and appended to the corresponding tail pages. For example, consider updating the column A of the record with the key k2 (referenced by the RID b2) in Table 2. The first tail record, referenced by the RID t1, contains the original value of the updated column, i.e., a2, whereas implicit null values (∅) are preassigned for the remaining unchanged columns. Taking a snapshot of the original changed values becomes essential in order to ensure contention-free merging as discussed in Section 3.1. The second tail record contains the newly updated value for column A, namely, a21, and again implicit special null values for the rest of the columns; a column that has never been updated does not even have to be materialized with special null values. However, for any subsequent updates, only one tail record is created, e.g., the tail record t3 is appended as a result of updating the column A from a21 to a22 for the record b2.

In general, updates could either be cumulative or non-cumulative. The cumulative property implies that when creating a new tail record, the new record will contain the latest values for all of the updated columns thus far. For example, consider updating the column C for the record b2. Since the column C of the record b2 is being updated for the first time, we first take a snapshot of its old value as captured by the tail record t4. Now for the cumulative update, a new tail record is appended that repeats the previously updated column A, as demonstrated by the tail record t5. If the non-cumulative update approach was employed, then the tail record would consist of only the changed value for column C and not A. It is important to note that cumulation of updates can be reset at any time. In the absence of cumulation, readers are simply forced to walk back the chain of recent versions to retrieve the latest values of all desired columns. Thus, cumulative update is an optimization that is intended to improve the read performance.

As part of the update routine, the embedded Indirection column (forward pointers) for base records is also updated to point to the newly created tail record. In our running example, the Indirection column of the record b2 points to the tail record t5. Also after updating the column C of the record b3, the Indirection column points to the latest version of b3, which is given by t7. Likewise, the Indirection column in the tail records is set to point to the previous version of the record. It is important to note that the Indirection column of base records is the only column that requires an in-place update in our architecture. However, as discussed in our concurrency model (cf. Section 4.1), this is a special column that lends itself to a latch-free concurrency protocol.

Furthermore, indexes always point to base records (i.e., base RIDs), and they never directly point to any tail records (i.e., tail RIDs) in order to avoid the index maintenance cost that arises in the absence of an in-place update mechanism [33, 32]. Therefore, when a new version of a record is created (i.e., a new tail record), first, all indexes defined on unaffected columns do not have to be modified and, second, only the affected indexes are modified with the updated values, but they continue to point to base records and not the newly created tail records [33, 32]. Suppose there is an index defined on the column C (cf. Table 2). Now after modifying the record b2 from c2 to c21, we add the new entry (c21, b2) to the index on the column C.³ Subsequently, when a reader looks up the value c21 from the index, it always arrives at the base record b2 initially, then the reader must determine the visible version of b2 (by following the indirection if necessary) and must check if the visible version has the value c21 for the column C, essentially re-evaluating the query predicates.

³ Optionally the old value (c2, b2) could be removed from the index; however, its removal may affect those queries that are using indexes to compute answers under snapshot semantics. Therefore, we advocate deferring the removal of changed values from indexes until the changed entries fall outside the snapshot of all relevant active queries.

There are two other meta-data columns that are affected by the update procedure. The Start Time column for tail records simply holds the time at which the record was updated (an implicit end of the previous version). For example, the record t7 has a start time of 19:45, which is also the implicit end time of the first version of the record b3. The Schema Encoding column is a concise representation that shows which data columns have been updated thus far. For example, the Schema Encoding of the tail record t7 is set to “0001”, which implies that only the column C has been changed. To distinguish between whether a tail record is holding new values or is the snapshot of old values, we add a flag to the Schema Encoding column, which is shown as an asterisk. For example, the tail record t6 stores the old value of the column C, which is why its Schema Encoding is set to “0001*”. The Schema Encoding can also be maintained optionally for base records as part of the update process, or it could be populated only during the merge process.

Notably, when there are multiple individual updates to the same record by the same transaction, then each update is written as a separate entry to tail pages. Each update results in the creation of a new tail record and only the final update becomes visible to other transactions. The prior entries are implicitly invalidated and skipped by readers. Also, a delete operation is simply translated into an update operation, in which all data columns are implicitly set to ∅, e.g., deleting the record b1 results in creating the tail record t8. An alternative design for delete is to create a tail record that holds a complete snapshot of the latest version of the deleted record.

2.3.2 Insert Procedure

The final key operation is the insertion of new records. Conceptually, the table naturally grows by inserting new records to the end of the table (append-only mechanism). We rely on a simpler manifestation of our notion of tail pages followed by the transformation of tail pages into compressed, read-only base pages through a simplified merge process. In fact, one can even view our previously described update mechanism as a form of sparse insertion.

In our proposed insert design, we designate the end of the table as the insert range. An insert range is basically a pre-allocated range of base RIDs for accommodating future insertions. In practice, the insert range size (at least a million RIDs) is much larger than our range partitioning that is employed for update processing (i.e., update range).⁴
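The update procedure of Section 2.3.1 can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not the authors' code: `Table`, `Record`, `load_base`, `update`, `latest`, and the `_updated` bookkeeping set are hypothetical names, tail RIDs come from a simple counter, the Start Time column and delete handling are omitted, and concurrency control is ignored. Replaying the updates of the running example reproduces rows t1 to t7 of Table 2.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, Optional, Set

COLUMNS = ("Key", "A", "B", "C")  # data columns of the running example


@dataclass
class Record:
    rid: str
    indirection: Optional[str]  # base: latest tail RID; tail: previous version
    encoding: str               # Schema Encoding bitmap; "*" marks a snapshot
    values: Dict[str, str]      # explicit values only; absent means null


def bitmap(columns) -> str:
    """One bit per data column, set to 1 for every column in `columns`."""
    return "".join("1" if c in columns else "0" for c in COLUMNS)


class Table:
    def __init__(self, cumulative: bool = True) -> None:
        self.base: Dict[str, Record] = {}
        self.tail: Dict[str, Record] = {}
        self.cumulative = cumulative
        self._tail_rids = count(1)
        self._updated: Dict[str, Set[str]] = {}  # base RID -> columns updated so far

    def load_base(self, rid: str, **values: str) -> None:
        self.base[rid] = Record(rid, None, bitmap(()), values)

    def _append(self, prev: str, values: Dict[str, str], snapshot: bool = False) -> str:
        # Tail pages are append-only: every call creates a new tail record.
        rid = f"t{next(self._tail_rids)}"
        flag = "*" if snapshot else ""
        self.tail[rid] = Record(rid, prev, bitmap(values) + flag, dict(values))
        return rid

    def latest(self, base_rid: str, column: str) -> str:
        rid = self.base[base_rid].indirection
        while rid is not None and rid != base_rid:
            if column in self.tail[rid].values:
                return self.tail[rid].values[column]
            rid = self.tail[rid].indirection
        return self.base[base_rid].values[column]

    def update(self, base_rid: str, **changes: str) -> str:
        base = self.base[base_rid]
        updated = self._updated.setdefault(base_rid, set())
        prev = base.indirection or base_rid
        # First update of a column: snapshot its original value in its own tail record.
        first_time = {c: base.values[c] for c in changes if c not in updated}
        if first_time:
            prev = self._append(prev, first_time, snapshot=True)
        new_values = dict(changes)
        if self.cumulative:
            # Repeat the latest values of the columns updated earlier.
            for c in updated - set(changes):
                new_values[c] = self.latest(base_rid, c)
        updated.update(changes)
        new_rid = self._append(prev, new_values)
        base.indirection = new_rid       # the only in-place write
        base.encoding = bitmap(updated)  # optional maintenance for base records
        return new_rid


table = Table()
table.load_base("b2", Key="k2", A="a2", B="b2", C="c2")
table.load_base("b3", Key="k3", A="a3", B="b3", C="c3")
table.update("b2", A="a21")   # creates t1 (snapshot) and t2
table.update("b2", A="a22")   # creates t3
table.update("b2", C="c21")   # creates t4 (snapshot) and t5 (cumulative)
table.update("b3", C="c31")   # creates t6 (snapshot) and t7
```

Passing `cumulative=False` to `Table` makes `t5` carry only the new value of C, in which case `latest` has to walk further back along the chain to find the current value of A, matching the trade-off described in the text.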
5 Read Optimized (compressed, read-only pages) Tail Pages (append-only) Range Partitioning (update range) Write Optimized Table-level Tail-pages (uncompressed, (append-only) append-only updates, i.e., sparse insertion) Write Optimized snI (uncompressed, e R tr append-only inserts) a n g e Indirection Column (uncompressed, in-place update) Figure4:Append-onlyinsertionofnewrecordswithconcurrentupdates(byemployingtailpages). RID Indirection SchemaEncoding StartTime Key A B C ing up a record in the insert range. When a new record is about Partitionedbaserecordsforthekeyrangeofk4tok6 tobeinsertedtothetable,thenewrecordreceivesareservedbase bb45 ⊥⊥ 00000000 1167::2201 kk45 aa45 bb45 cc45 RIDintheinsertrangeandthecorrespondingtailRIDinthetable- b6 ⊥ 0000 18:02 k6 a6 b6 c6 level tail-range. If insert range is full, then a new insert range is InsertrangeforthebaserecordwiththebaseRIDrangeofb7tob9 created.Butthekeyguidingprincipleforinsertionistosatisfythe b7 ⊥ stabilitypropertyofthebasepages(i.e.,read-only)withtheexcep- b8 t14 b9 t16 tionoftheIndirectioncolumnthatisupdatedin-place. Therefore, Table-leveltail-pagesforthebaserecordwiththebaseRIDrangeofb7tob9 theinsertionproceduresimplyconsistsofacquiringbaseandtail tt7 t7 0000 18:30 k7 a7 b7 c7 tt8 t8 0000 18:45 k8 a8 b8 c8 RIDs,inserttheactualrecordtotable-leveltail-pages,andsetting tt9 t9 0000 19:05 k9 a9 b9 c9 theIndirectioncolumninthebaserecordtonull.Alternatively,the Pt1ar3titionedtabil8recordsforthe0k0ey01r*angeofk7to1k89:45 ∅ ∅ ∅ c8 Indirectioncolumncouldbesettonullwhenallocatingpagesfor t14 t13 0001 22:25 ∅ ∅ ∅ c81 theinsertrange. t15 b3 0100* 19:05 ∅ a9 ∅ ∅ AnexampleofinsertionisillustratedinTable3.Theinsertrange t16 t15 0100 22:45 a91 ∅ ∅ ∅ isshownasb tob ,andthetable-leveltail-rangeisshownastt 7 9 7 Table3:Anexampleofinsertionwithconcurrentupdates(concep- tott9. Thefirstinsertedrecordis(k7,a7,b7,c7)withthekeyk7 tualtabularrepresentation). 
thatisassignedb7asitsbaseRIDandtt7asitstailRID.Theonly columnallocatedforbaserecordsistheIndirectioncolumn,which isinitiallysettonull(⊥). Theactualvaluesforthemeta-dataand datacolumnsareappendedtothetable-leveltail-pagesattheposi- daterange).4. Fortheinsertrange,weallocatedasetoftailpages tiongivenbythetailRIDtt7. Inthesamespirit,therecordswith forappendingnewrecords,whichwerefertothemas“table-level b8 andb9 arealsoappendedtotheinsertrange. Nowifarecently tail-pages”eventhoughstructurallythereisnodifferencebetween insertedrecordisupdated,thentheupdatefollowsthesamepathas explainedearlier(cf. Section2.3.1). Supposetherecordb isup- table-level tail-pages vs. regular tail pages. Figure 4 pictorially 8 datedbymodifyingthevalueofitsCcolumnfromc toc . The capturesourinsertdesign.Intable-leveltail-pages,weallocatetail 8 81 updatesimplyresultsinacquiringanewtailRIDintheregulartail pagesforallcolumns(unlikeforupdatesthatwaslimitedtoonly pages(asbefore)andappendingonlytheupdatedcolumnfollowed theupdatedcolumns)becausetheinsertstatementalwaysprovide byupdatingtheIndirectioncolumnin-place. Thisisdemonstrated avalueforeverycolumn(evenifitisanimplicitnullvaluefora byappendingthetailrecordt tothecorrespondingtailpagesand nullablecolumn). 14 settingtheIndirectioncolumnoftherecordb tot . Adding a new insert range consists of reserving a set of base 8 14 RIDs (e.g., in the order of millions) and a set of tail RIDs; these two sets of RIDs are equal in size and aligned. Thus, the 10th baseRIDintheinsertrangecorrespondstothe10thtailRIDinthe 3. REAL-TIMESTORAGEADAPTION table-leveltail-range(i.e.,bothrangesfollowingthesameinsertion order).ThealignmentofRIDsallowsimplicitaddressingforlook- Toensureanearoptimalstoragelayout,outdatedbasepagesare mergedlazilywiththeircorrespondingtailpagesinordertopre- 4Each table may have more than one insert range to support a servetheefficiencyofanalyticalqueryprocessing. Recallthatthe higherdegreeofconcurrencyiftheworkloadisinsertintensive. 
base pages are read-only and compressed (read optimized) while the tail pages are uncompressed⁵ and grow using a strictly append-only technique (write optimized). Therefore, it is necessary to transform the recent committed updates (accumulated in tail pages) from their write-optimized form into a read-optimized form. A distinguishing feature of our lineage-based architecture is a contention-free merging process that is carried out completely in the background without interfering with foreground transactions. Furthermore, the contention-free merging procedure is applied only to the updated columns of the affected update ranges. There is not even a dependency among columns during the merge; thus, the different columns of the same record can be merged completely independently of each other at different points in time. The merge process is conceptually depicted in Figure 5, in which writer threads (i.e., update transactions) place candidate tail pages to be merged into the merge queue while the merge thread continuously takes pages from the queue and processes them.

⁵ Even though compression techniques such as local and global dictionaries can be employed in tail pages, these directions are outside the scope of the current work.

Figure 5: Lazily, independently merging of tail & base pages.

Figure 6: Epoch-based, contention-free page de-allocation.

3.1 Contention-free, Relaxed Merge

In L-Store, we abide by one main design principle for ensuring contention-free processing, that is, “always operating on stable data”. The inputs to the merge process are (1) a set of base pages (committed base records) that are read-only,⁶ thus, stable data and (2) a set of consecutive committed tail records in tail pages,⁷ thus, also stable data. The output of the merge process (that is also relaxed) is a set of newly consolidated base pages (also referred to as merged pages) that are read-only, compressed, and almost up-to-date, thus, stable data. To decouple users’ transactions (writers) from the merge process, we also ensure that the write path of the ongoing transactions does not overlap with the write path of the merge process. Writers append new uncommitted tail records to tail pages (but, as stated before, uncommitted records do not participate in the merge), and writers perform in-place updates of the Indirection column within base records to point to the latest version of the updated records in tail pages (but the Indirection column is not modified by the merge process), whereas the write path of the merge process consists of creating only a new set of read-only base pages.

⁶ The Indirection column is the only column that undergoes in-place update, and it never participates in the merge process.

⁷ Note that not every committed update has to be applied, as the merge process is relaxed, and the merge eventually processes all committed tail records.

3.1.1 Merge Algorithm

The merge algorithm, conceptually resembling the standard left-outer join, consists of (1) identifying a set of committed tail records in tail pages; (2) loading the corresponding outdated base pages; (3) consolidating the base and tail pages; (4) updating the page directory; and (5) de-allocating the outdated base pages.

Step 1: Identify committed tail records in tail pages: Select a set of consecutive fully committed tail records (or pages) since the last merge within each update range.

Step 2: Load the corresponding outdated base pages: For a selected set of committed tail records, load the corresponding outdated base pages for the given update range (limit the load to only outdated columns). This step can be further optimized by not loading sub-ranges of records that have not changed since the last merge. No latching is required when loading the base pages.

Step 3: Consolidate the base and tail pages: For every updated column, the merge process reads n outdated base pages, applies a set of recent committed updates from the tail pages, and writes out m new pages.⁸ First, the Base RID column of the committed tail pages (from Step 1) is scanned in reverse order to find the list of the latest version of every updated record since the last merge (a temporary hashtable may be used to keep track of whether the latest version of a record has been seen or not). Subsequently, the latest tail records are applied in reverse order to the base records until an update to every record in the base range is seen or the list is exhausted (skipping any intermediate versions for which a newer update exists in the selected tail records). If a latest tail record indicates the deletion of the record, then the deleted record will be included in the consolidated records.⁹

⁸ At most one merged page per column could be left underutilized for a range of records after the merge process. To further reduce the underutilized merged pages, one may define finer range partitioning for updates (e.g., 2^12 records), but operate merges at coarser granularity (e.g., 2^16 records). This provides the benefit of locality of access for readers given the smaller range size of 2^12, yet it provides better space utilization and compression for newly created merged pages when larger ranges are chosen.

⁹ Alternatively, if all the deleted values are also stored in tail records, then it is sufficient to fill all data columns with the special null value ∅ for deleted records in the final merged pages. However, we would still need to preserve the Indirection column of deleted records in order to provide access to the earlier versions of deleted records.

Any compression algorithm (e.g., dictionary encoding) can be applied on the consolidated pages (on a column basis) followed by writing the compressed pages into newly
created pages. Moreover, the old Start Time column remains intact during the merge process because this column is needed to hold the original insertion time of the record.¹⁰ Therefore, to keep track of the time for the consolidated records, the Last Updated Time column is populated to store the Start Time of the applied tail records. The Schema Encoding column may also be populated during the merge to reflect all the columns that have been changed for each record.

¹⁰ The Start Time column is also a highly compressible column with a negligible space overhead to maintain it.

Step 4: Update the page directory: The pointers in the page directory are updated to point to the newly created merged pages. Essentially this is the only foreground action taken by the merge process, which is simply to swap and update pointers in the page directory, an index structure that is updated rarely, only when new pages are allocated.

Step 5: De-allocate the outdated base pages: The outdated base pages are de-allocated once the current readers are drained naturally via an epoch-based approach. The epoch is defined as a time window in which the outdated base pages must be kept around as long as there is an active query that started before the merge process. Pointers to the outdated base pages are kept in a queue to be re-claimed at the end of the query-driven epoch-window. The pointer swapping and the page de-allocation are illustrated in Figure 6.

An example of our merge process is shown in Table 4 based on our earlier update example, in which we apply the first seven tail records (denoted by t1 to t7) to their corresponding base pages. The resulting merged pages are shown, where the affected records are highlighted. Note that only the updated columns are affected by the merge process (and the Indirection column is not affected). Furthermore, not all updates need to be applied; only the latest version of every updated record needs to be consolidated while the other entries are simply discarded. In our example, only the tail records t5 and t7 participated in the merge, and the rest were discarded.

RID  Indirection  Schema Encoding  Start Time  Last Updated Time  Key  A    B   C
Partitioned base records for the key range of k1 to k3; Tail-page Sequence Number (TPS) = 0
b1   t8           0000             10:02                          k1   a1   b1  c1
b2   t5           0101             13:04                          k2   a2   b2  c2
b3   t7           0001             15:05                          k3   a3   b3  c3
Relevant tail records (below TPS ≤ t7 high-watermark) for the key range of k1 to k3
t5   t4           0101             19:25                          ∅    a22  ∅   c21
t7   t6           0001             19:45                          ∅    ∅    ∅   c31
Resulting merged records for the key range of k1 to k3; TPS = t7
b1   t8           0000             10:02       10:02              k1   a1   b1  c1
b2   t5           0101             13:04       19:25              k2   a22  b2  c21
b3   t7           0001             15:05       19:45              k3   a3   b3  c31

Table 4: An example of the relaxed and almost up-to-date merge procedure (conceptual tabular representation).

It is important to note that merging table-level tail-pages with base pages in the insert range follows a similar process as above with a few simplifications. First, the consolidation process is rather trivial because tail records in the table-level tail-range follow the exact same insertion order as in the insert range (a trivial join-like operation). Also, the insert range does not actually have any value except for the Indirection column (which does not even participate in the merge itself). Thus, the merge process is essentially reading a set of consecutive committed tail records and compressing them to create a set of newly merged pages. Another simplification is that after the merged pages are created and the page directory is updated, the old table-level tail-pages can be discarded permanently after all the active queries that started prior to the merge process are terminated. In contrast, the regular tail pages survive after the merge in order to enable answering historic queries and to avoid interfering with update transactions. All base records that have been merged with their table-level tail-pages are considered to be outside the insert range.

We further strengthen our data stability condition for bringing base pages up-to-date. Earlier we stated that the merge only operates on a set of committed consecutive tail records, but no condition was imposed on the base records. Now we strengthen this condition by requiring that the base records must also fall outside the insert range before becoming a candidate for merging the recent updates.

3.1.2 Merge Analysis

A key distinguishing feature of our lineage-based storage architecture is to allow contention-free merging of tail and base pages without interfering with concurrent transactions. To formalize our merge process, we prove that merge operates only on stable data without any information loss and that the merge does not limit users’ transactions from accessing and/or modifying the data that is being merged.

Lemma 1. Merge operates strictly on stable data.

PROOF. Recall that by construction, we enforced that merge “always operates on stable data”. The inputs to the merge process are (1) a set of base pages consisting of committed base records that are read-only and outside the insert range, thus, stable data and (2) a set of consecutive committed tail records in tail pages, thus, also stable data. The output of the merge process is a set of newly merged pages that are read-only, thus, stable data as well. Hence, the merge process strictly takes as inputs stable data and produces stable data as well.

Lemma 2. Merge safely discards outdated base pages without violating any query’s snapshot.

PROOF. In order to support snapshot isolation semantics and time travel queries, we need to ensure that earlier versions of records that participate in the merge process are retained. Since we never perform in-place updates and each update is transformed into appending a new version of the record to tail pages, then as long as tail pages are not removed, we can ensure that we have access to every updated version. But recall that outdated base pages are de-allocated using our proposed epoch-based approach after being merged. Also note that base pages contain the original values of when a record was first created. Therefore, any original values that were later updated must be stored before discarding outdated base pages after a merge has taken place. In other words, we must ensure that outdated base pages are discarded safely.

As a result, the two fundamental criteria, namely, relaxing the merge (i.e., constructing an almost up-to-date snapshot) and operating on stable data, are not sufficient to ensure the safety property of the merge. The last missing piece that enables safety of the merge is accomplished by taking a snapshot of the original values when a column is being updated for the first time (as described in Section 2.3). In other words, we have further strengthened our data stability criterion by ensuring stability even in the committed history. Hence, outdated base pages can be safely discarded without any information loss, namely, the merge process is safe.

Theorem 1. The merge process and users’ transactions do not contend for base and tail pages or the resulting merged pages, namely, the merge process is contention-free.

PROOF. As part of ensuring contention-free merge, we have already shown that merge operates on stable data (proven by Lemma 1) and that there is no information loss as a result of the merge process (proven by Lemma 2). Next we prove that the write path of the merge process does not overlap with the write path of users’ transactions (i.e., writers). Recall that writers append new uncommitted tail records to tail pages (but, as stated before, uncommitted records do not participate in the merge), and writers perform in-place updates of the Indirection column within base records to point to the latest version of the updated records in tail pages (but the Indirection column is not modified by the merge process), whereas the write path of the merge process consists of creating only a new set of read-only merged pages and eventually discarding the outdated base pages safely.

Therefore, we must show that safely discarding base pages does not interfere with users’ transactions. In particular, as explained in Lemma 2, if the original values were not written to tail records at the time of the update, then during the merge process we would be forced to store them somewhere or encounter information loss. It is not even clear where the optimal location for storing the original values would be. A simple-minded approach of just adding them to tail pages would have broken the linear order of changes to records such that the older values would have appeared after the newer values, and it would have interfered with the ongoing update transactions. But, more importantly, the need to store the old values at any location would have implied that during the merge process multiple coordinated actions were required to ensure consistency across modifications to isolated locations; hence, breaking the contention-free property of the merge. Therefore, by storing the original updated values at the time of update, we trivially eliminate all the potential contention during the merge process in order to safely discard outdated base pages.

As a result, users’ transactions are completely decoupled from the merge process, and users’ transactions and the merge process do not contend over base, tail, or merged pages.

When analyzing the performance of our merge algorithm, we observe that in the worst case, data from all updated columns for a given update range is read and written back, but it is a cost that is amortized over many updates, a key strength of our lineage-based storage architecture. In general, if updates are spread over a range of records (even if skewed), then the data for the entire range has to be read and written; however, when updates are strictly localized, additional optimizations can be applied to further prune the set of records read.

However, it is important to note that L-Store’s objective has been to introduce a contention-free merge procedure without interference with the concurrent transactions, because contention is the most important deciding factor in the overall performance of the system, especially as the size of the main memory continues to increase (arguably the entire transactional data can fit in the main memory today) and the storage-class memories (such as SSDs) replace the mechanical disks [8, 30]. Nevertheless, we point out that there are potential opportunities to further improve the merge process execution time by employing and studying more complex join algorithms for implementing the merge by operating directly on the compressed data without the need to decompress and compress the data (a complementary direction that is outside the scope of the current work). Nevertheless, in our evaluation (Section 5), with a single asynchronous merge thread, we were able to cope with tens of concurrent writer threads, and we were able to process millions of updates per second when updating 40% of the columns on average.

Lastly, we would like to point out that there are also a number of potential opportunities to guide the merge process in order to further accelerate relaxed analytical queries (those queries that can tolerate a slightly outdated snapshot) by implicitly constructing a slightly outdated but consistent snapshot of the data across the entire table during the merge. Currently, our proposed merge is already relaxed and brings base pages almost up-to-date in time. Now we suggest further coordinating the merge such that every merge not only takes a set of consecutive committed tail records, but also takes only those consecutive committed records before an agreed upon time t_i. Thus, after merging a range of records, we are ensured that only committed records before the time t_i are processed. Furthermore, we propose that every page also maintains its temporal lineage to remember the timestamp of the earliest committed records that have not been merged yet, where ideally its timestamp is after t_i. Any range of records that is yet to be merged or has failed to bring base pages forward in time up to t_i can be manually brought forward to t_i as part of the normal query processing by consolidating tail pages. Periodically, the agreed upon merge time is advanced from t_i to t_{i+1}, and all subsequent merges are adjusted accordingly. However, exploiting the temporal lineage to speed up relaxed analytical queries (almost for free) is outside the scope of the current work.¹¹

¹¹ A similar optimization for constructing almost up-to-date and consistent snapshots was first introduced in [23], but it required draining all active queries before the out-of-date snapshot could be advanced in time, or it required maintaining multiple almost up-to-date snapshots simultaneously. Unlike the approach in [23], our proposed merge algorithm combined with the temporal lineage eliminates all contention with the ongoing queries such that the relaxed snapshot can be brought forward in time lazily and asynchronously.

3.2 Maintaining Lineage

The lineage of each base page (and consequently merged pages) is maintained independently as a result of the merge process. The lineage information is instrumental to decouple the merge and update processing and to allow independent merging of the different columns of the same record at different points in time. The lineage information is captured using a rather simple and elegant concept, which we refer to as the tail-page sequence number (TPS), in order to keep track of how many updated entries (i.e., tail records) from tail pages have been applied to their corresponding base pages after the completion of a merge. Original base pages always start with TPS set to 0, a value that is monotonically increasing after every merge. Again, to ensure this monotonicity property, we stressed earlier that a consecutive set of committed tail records is always used in the merge process.

TPS is also used by readers to interpret the indirection pointer (also a monotonically increasing value) after the merge has taken place. Consider our running example in Table 4. After the first merge process, the newly merged pages have TPS set to 7, which implies that the first seven updates (tail records t1 to t7) in the tail pages have been applied to the merged pages. Consider the record b2 in the base pages that has an indirection value pointing to t5 (cf. Table 4); there are two possible interpretations. If the transaction is reading the base pages with TPS set to 0, then the 5th update has not yet been reflected on the base page. Otherwise, if the transaction is reading the base pages with TPS 7, then the update referenced by indirection value t5 has already been applied to the base pages as seen in Table 4. Notably, the Indirection column is updated only in-place (also a monotonically increasing value) by writers, while merging tail pages does not affect the indirection value.

More importantly, we can leverage the TPS concept to ensure read consistency of users’ transactions when the merge is performed lazily and independently for the different columns of the same records. When the merge of columns is decoupled, each merge occurs independently and at a different point in time. Consequently, not all base pages are brought forward in time simultaneously. Additionally, even if the merge occurs for all columns simultaneously, it is still possible that a reader reads base pages for the column A before the merge (or during the merge before the page directory is updated) while the same reader reads the column C after the merge; thus, reading a set of inconsistent base and merged pages.

Lemma 3. An inconsistent read with concurrent merge is always detectable.

PROOF. Since each base page independently tracks its lineage, i.e., its TPS counter, TPS can be used to verify the read consistency. In particular, for a range of records, all read base pages must have an identical TPS counter; otherwise, the read will be inconsistent. Hence, an inconsistent read across different columns of the same record is always detectable.

Theorem 2. Constructing consistent snapshots with concurrent merge is always possible.

PROOF. As proved in Lemma 3, the read inconsistency is always detectable. Furthermore, once a read inconsistency is encountered, each page is simply brought to the desired query snapshot independently by examining its TPS and the indirection value and consulting the corresponding tail pages using the logic outlined earlier. Hence, consistent reads by constructing consistent snapshots across different columns of the same record are always possible.

TPS, or an alternative but conceptually similar counter, could be used as a high-water mark for resetting the cumulative updates as well. Continuing with our running scenario, we have the original base pages with the TPS 0 (as shown in Table 4) and the merged pages with the TPS 7 (as shown in Table 5). For simplicity, we assume the cumulation was also reset after the 7th tail record. For the record b2, we see that the indirection pointer is t12, for which we know that the cumulative update has been reset after the 7th update. This means that the tail record t12 does not carry updates that were accumulated between tail records 1 to 7. Suppose that the record was updated four times, where the update entries in the tail pages are the 3rd, 5th, 10th, and 12th tail records. The tail record t5 is cumulative and carries the updated values from the tail record t3. However, the tail record t10 is not cumulative (the reset occurred at the 8th update), whereas the tail record t12 is cumulative, but carries updates only from the tail record t10 and not from t5 and t3. Suppose that a transaction is reading the base pages with the TPS 0; then to reconstruct the full version of the record b2, it must read both the tail records t5 and t12 (while skipping the 3rd and 10th). But if a transaction is reading from the merged pages with the TPS 7, then it is sufficient to only read the tail record t12 to fully reconstruct the record because the 3rd and 5th updates have already been applied to the merged pages.

RID  Indirection  Schema Encoding  Start Time  Last Updated Time  Key  A    B    C
Recently merged records for the key range of k1 to k3; TPS = t7
b1   t8           0000             10:02       10:02              k1   a1   b1   c1
b2   t12          0101             13:04       19:25              k2   a22  b2   c21
b3   t11          0001             15:05       19:45              k3   a3   b3   c31
Partitioned tail records for the key range of k1 to k3
t1   b2           0100*            13:04                          ∅    a2   ∅    ∅
t2   t1           0100             19:21                          ∅    a21  ∅    ∅
t3   t2           0100             19:24                          ∅    a22  ∅    ∅
t4   t3           0001*            13:04                          ∅    ∅    ∅    c2
t5   t4           0101             19:25                          ∅    a22  ∅    c21
t6   b3           0001*            15:05                          ∅    ∅    ∅    c3
t7   t6           0001             19:45                          ∅    ∅    ∅    c31
t8   b1           0000             20:15                          ∅    ∅    ∅    ∅
t9   t5           0010*            13:04                          ∅    ∅    b2   ∅
t10  t9           0010             21:25                          ∅    ∅    b21  ∅
t11  t7           0001             21:30                          ∅    ∅    ∅    c32
t12  t10          0110             21:55                          ∅    a23  b21  ∅

Table 5: An example of the indirection interpretation and lineage tracking (conceptual tabular representation).

3.3 Compressing Historic Data

For historic tail pages, namely, the committed and subsequently merged tail pages, we introduce a contention-free compression scheme to substantially reduce the storage footprint and improve access patterns for historic queries (or time travel queries). Periodically, for a range of records, we compress only a set of merged tail pages (across all updated columns) that fall outside the oldest query snapshot in order to avoid clashing with the readers/writers of non-historic data.

Our compression scheme takes as input a set of tail pages for all updated columns for a given range of records (stable data) and outputs a set of newly compressed tail pages (in columnar form). Its key benefits are the following properties (as demonstrated in Table 6). First, the compressed tail records are re-ordered according to the base RID order; hence improving the locality of access. Second, for each record, and within each column, the different versions are stored inline and contiguously. The version inlining avoids the need to repeatedly store unchanged values due to cumulative updates, but, more importantly, it enables delta compression among the different versions of the records to further reduce the space overhead. Also, collapsing the different versions of the same record into a single tail record eliminates the need for the back pointers that are used for referencing the previous versions. Thus, the inline versions are tightly packed and ordered temporally as shown in Table 6.

RID  Indirection  Schema Encoding  Start Time               A             C
Merged, committed tail records for the key range of k1 to k3
t1   b2           0100*            13:04                    a2            ∅
t2   t1           0100             19:21                    a21           ∅
t3   t2           0100             19:24                    a22           ∅
t4   t3           0001*            13:04                    ∅             c2
t5   t4           0101             19:25                    a22           c21
t6   b3           0001*            15:05                    ∅             c3
t7   t6           0001             19:45                    ∅             c31
Ordered, inlined, compressed committed tail records for the key range of k1 to k3
c1   b2           0101             13:04,19:21,19:24,19:25  a2,a21,a22,-  c2,-,-,c21
c2   b3           0001             15:05,19:45              ∅             c3,c31

Table 6: An example of compressing merged tail pages (conceptual tabular representation).

The compressed tail pages are read-only and are used exclusively for historic queries. These new pages can be isolated and pushed down lower in the storage hierarchy since they are by definition colder and not accessed as frequently. Lastly, the page directory is updated by swapping the pointers for the old tail pages to point to the newly created compressed tail pages. Notably, considering that the access frequency is much lower for historic tail pages compared to the access frequency of non-historic tail pages, any locking- or non-locking-based approach (such as the epoch-based approach discussed in Section 3.1) can be employed without noticeably affecting the overall system performance.
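To make the consolidation of Section 3.1.1 (Steps 1 and 3) concrete, the following is a minimal sketch and not the authors' implementation. It assumes cumulative tail records, so that the newest tail record per base RID carries every column changed since the last merge, and it models pages as plain dictionaries. The names `TailRecord` and `merge_range` are hypothetical, and compression, the page directory swap, and epoch-based de-allocation are omitted. The sample data follows tail records t1 to t7 of Table 5.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Page = Dict[str, Dict[str, str]]  # base RID -> {column name -> value}


@dataclass(frozen=True)
class TailRecord:
    seq: int                          # t1 -> 1, t2 -> 2, ... (append order)
    base_rid: str                     # Base RID column of the tail record
    values: Dict[str, Optional[str]]  # updated columns only; None stands for the null marker


def merge_range(base: Page, tail: List[TailRecord], tps: int) -> Tuple[Page, int]:
    """Consolidate committed tail records newer than `tps` into new merged pages."""
    # Step 1: consecutive committed tail records since the last merge
    # (assumption: every record passed in is already committed).
    candidates = sorted((r for r in tail if r.seq > tps), key=lambda r: r.seq)
    # The merge writes only new pages; the input base pages stay untouched.
    merged = {rid: dict(cols) for rid, cols in base.items()}
    seen = set()  # temporary hashtable: base RIDs whose latest version was applied
    # Step 3: reverse scan, so the first hit per base RID is its latest version.
    for rec in reversed(candidates):
        if rec.base_rid in seen:
            continue  # older intermediate version, skipped
        seen.add(rec.base_rid)
        for column, value in rec.values.items():
            if value is not None:
                merged[rec.base_rid][column] = value
        if len(seen) == len(base):
            break  # an update to every record in the range has been seen
    new_tps = max((r.seq for r in candidates), default=tps)
    return merged, new_tps


base_pages = {
    "b1": {"A": "a1", "B": "b1", "C": "c1"},
    "b2": {"A": "a2", "B": "b2", "C": "c2"},
    "b3": {"A": "a3", "B": "b3", "C": "c3"},
}
tail_pages = [
    TailRecord(1, "b2", {"A": "a2"}),
    TailRecord(2, "b2", {"A": "a21"}),
    TailRecord(3, "b2", {"A": "a22"}),
    TailRecord(4, "b2", {"C": "c2"}),
    TailRecord(5, "b2", {"A": "a22", "C": "c21"}),
    TailRecord(6, "b3", {"C": "c3"}),
    TailRecord(7, "b3", {"C": "c31"}),
]
merged_pages, new_tps = merge_range(base_pages, tail_pages, tps=0)
```

Only t5 and t7 end up being applied, which matches the example in Table 4. Returning fresh pages instead of mutating the inputs mirrors the "always operating on stable data" principle.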

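The TPS-based interpretation of the indirection pointer and the consistency check of Lemma 3 (Section 3.2) can be illustrated with a short sketch. It assumes that TPS and tail RIDs are comparable integers (t5 is represented as 5) and that a null Indirection value is represented by `None`. The function names are hypothetical.

```python
from typing import Dict, Optional


def update_already_applied(page_tps: int, indirection_seq: Optional[int]) -> bool:
    """True if the tail record referenced by Indirection is reflected in the page."""
    if indirection_seq is None:  # null indirection: the record was never updated
        return True
    return indirection_seq <= page_tps


def read_is_consistent(tps_by_column: Dict[str, int]) -> bool:
    """Lemma 3 check: all base pages read for a range must share one TPS."""
    return len(set(tps_by_column.values())) <= 1
```

For record b2 with indirection t5, `update_already_applied(0, 5)` is false for the original base pages (TPS 0) and `update_already_applied(7, 5)` is true for the merged pages (TPS 7). Once the indirection advances to t12, a reader of the TPS 7 pages must consult the tail pages again. A reader that saw column A at TPS 0 and column C at TPS 7 fails `read_is_consistent` and can then bring each page to the desired snapshot independently, as argued in Theorem 2.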

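The re-ordering and version inlining of Section 3.3 can be sketched as follows. This is an illustrative reconstruction of how the lower half of Table 6 could be derived from its upper half, not the authors' algorithm. It assumes tail records arrive in append order, that start times are HH:MM strings that sort chronologically, and that records sharing a start time belong to one version. The name `inline_versions` is hypothetical and delta compression is not modeled.

```python
from typing import Dict, List, Optional, Tuple

# (tail RID, Indirection, Start Time, {column -> value}) with only updated columns present
Tail = Tuple[str, str, str, Dict[str, str]]


def inline_versions(tail_records: List[Tail], columns: List[str]) -> Dict[str, dict]:
    base_of: Dict[str, str] = {}                       # tail RID -> base RID
    grouped: Dict[str, Dict[str, Dict[str, str]]] = {}  # base RID -> time -> values
    for rid, indirection, start, values in tail_records:
        # Indirection is a back pointer to an earlier tail record or a base RID.
        base = base_of.get(indirection, indirection)
        base_of[rid] = base
        grouped.setdefault(base, {}).setdefault(start, {}).update(values)

    out: Dict[str, dict] = {}
    for base in sorted(grouped):                       # re-order by base RID
        times = sorted(grouped[base])                  # temporal order
        row: Dict[str, Optional[List[str]]] = {"times": times}
        for col in columns:
            last: Optional[str] = None
            cells: List[str] = []
            for t in times:
                val = grouped[base][t].get(col)
                if val is None or val == last:
                    cells.append("-")                  # unchanged: not stored again
                else:
                    cells.append(val)
                    last = val
            row[col] = cells if last is not None else None  # None: column never updated
        out[base] = row
    return out


merged_tail = [
    ("t1", "b2", "13:04", {"A": "a2"}),
    ("t2", "t1", "19:21", {"A": "a21"}),
    ("t3", "t2", "19:24", {"A": "a22"}),
    ("t4", "t3", "13:04", {"C": "c2"}),
    ("t5", "t4", "19:25", {"A": "a22", "C": "c21"}),
    ("t6", "b3", "15:05", {"C": "c3"}),
    ("t7", "t6", "19:45", {"C": "c31"}),
]
inlined = inline_versions(merged_tail, ["A", "C"])
```

Each base record collapses into a single entry whose versions are contiguous, so the back pointers between t1 to t7 are no longer needed, which is the property the text attributes to the compressed tail pages.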
