Bridging the Archipelago between Row-Stores and Column-Stores for Hybrid Workloads

Joy Arulraj, Andrew Pavlo, Prashanth Menon
Carnegie Mellon University
[email protected], [email protected], [email protected]

ABSTRACT

Data-intensive applications seek to obtain insights in real-time by analyzing a combination of historical data sets alongside recently collected data. This means that to support such hybrid workloads, database management systems (DBMSs) need to handle both fast ACID transactions and complex analytical queries on the same database. But the current trend is to use specialized systems that are optimized for only one of these workloads, and thus require an organization to maintain separate copies of the database. This adds additional cost to deploying a database application in terms of both storage and administration overhead.

To overcome this barrier, we present a hybrid DBMS architecture that efficiently supports varied workloads on the same database. Our approach differs from previous methods in that we use a single execution engine that is oblivious to the storage layout of data without sacrificing the performance benefits of the specialized systems. This obviates the need to maintain separate copies of the database in multiple independent systems. We also present a technique to continuously evolve the database's physical storage layout by analyzing the queries' access patterns and choosing the optimal layout for different segments of data within the same table. To evaluate this work, we implemented our architecture in an in-memory DBMS. Our results show that our approach delivers up to 3x higher throughput compared to static storage layouts across different workloads. We also demonstrate that our continuous adaptation mechanism allows the DBMS to achieve a near-optimal layout for an arbitrary workload without requiring any manual tuning.

1. INTRODUCTION

Organizations need to quickly transform freshly obtained data into critical insights. Such workloads, colloquially known as hybrid transaction-analytical processing (HTAP), seek to extrapolate insights and knowledge by analyzing a combination of historical data sets with real-time data [40, 48]. Data has immense value as soon as it is created, but that value diminishes over time. For many application domains, such as high-frequency trading and Internet advertising, it is imperative that this analysis uses the newest data for the results to have the most impact. Many organizations implement HTAP pipelines using separate DBMSs. The most common practice is to use one DBMS for transactions and another for analytical queries. With this model, the new information from transactions goes into an on-line transactional processing (OLTP) DBMS. Then, in the background, the system uses an extract-transform-load utility to migrate data from the OLTP DBMS to a data warehouse for on-line analytical processing (OLAP).

There are several problems inherent in such a bifurcated environment. Foremost is that the time it takes to propagate changes between the separate systems is often measured in minutes or even hours. This data transfer inhibits an application's ability to act on data immediately when it is entered in the database. Second, the administrative overhead of deploying and maintaining two different DBMSs is non-trivial, as personnel is estimated to be almost 50% of the total ownership cost of a large-scale database system [45]. It also requires the application developer to write a query for multiple systems if they want to combine data from different databases.

A better approach is to use a single HTAP DBMS that can support the high throughput and low latency demands of modern OLTP workloads, while also allowing for complex, longer running OLAP queries to operate on both hot (transactional) and cold (historical) data. What makes these newer HTAP systems different from legacy general-purpose DBMSs is that they incorporate many of the advancements over the last decade from the specialized OLTP and OLAP systems. The key challenge with HTAP DBMSs, however, is executing OLAP workloads that access old data along with new data in the database while simultaneously executing transactions that update the database. This is a difficult problem, and existing HTAP DBMSs resort to using separate query processing and storage engines for data that is stored in different layouts. That is, they employ a separate OLTP execution engine for row-oriented data that is better for transactions and a separate OLAP execution engine for column-oriented data that is better for analytical queries. They then have to use a synchronization method (e.g., two-phase commit) to combine the results from the two different parts of the system [5, 29, 47]. Cobbling systems together in this manner increases the complexity of the DBMS and degrades performance due to the additional overhead needed to maintain the state of the database across different runtimes. This in turn limits the types of questions that can be asked about data as soon as it enters the database, which is the main selling point of an HTAP DBMS.

To overcome this problem, we present a method to bridge the architectural gap between the OLTP and OLAP systems using a unified architecture. Our approach is to store tables using hybrid layouts based on how the DBMS expects tuples will be accessed in the future.
This means that a table's "hot" tuples are stored in a format that is optimized for OLTP operations, while the other "cold" tuples in that same table are stored in a format that is more amenable to OLAP queries. We then propose a logical abstraction over this data that allows the DBMS to execute query plans that span these different layouts without using separate engines and with minimal overhead. Our other contribution is a novel on-line reorganization technique that continuously enhances each table's physical design in response to an evolving query workload. This enables the DBMS to migrate the database to the near-optimal storage layout for an arbitrary application, in a transactionally safe manner, without requiring a human administrator to configure the DBMS for the particular application.

To evaluate our approach, we implemented our storage and execution architecture in the Peloton HTAP DBMS [1]. We compare our methods with other state-of-the-art storage models and show that they enable the DBMS to achieve up to 3x higher throughput across different hybrid workloads. We also demonstrate that our reorganization method allows the DBMS to continuously modify the layout of tables without manual tuning and with minimal overhead.

The remainder of this paper is organized as follows. We first discuss the performance impact of a storage model, and the benefits of a flexible storage model for HTAP workloads, in Section 2. Next, in Section 3, we describe the design of our system architecture based on this model. We then present the concurrency control mechanism that enables the DBMS to support hybrid workloads in Section 4, followed by our on-line layout reorganization technique in Section 5. We then present our experimental evaluation in Section 6. We conclude with a discussion of related work in Section 7.

SIGMOD'16, June 26-July 01, 2016, San Francisco, CA, USA. (c) 2016 ACM. ISBN 978-1-4503-3531-7/16/06. DOI: http://dx.doi.org/10.1145/2882903.2915231

[Figure 1 (not reproduced): Storage Models – Different table storage layouts work well for OLTP, OLAP, and HTAP workloads: (a) OLTP-oriented N-ary Storage Model (NSM), (b) OLAP-oriented Decomposition Storage Model (DSM), (c) HTAP-oriented Flexible Storage Model (FSM). The different colored regions indicate the data that the DBMS stores contiguously.]

2. MOTIVATION

We begin with an overview of the impact of the table's storage layout on the performance of the DBMS for different types of workloads. We then make the case for why a "flexible" storage model is the best choice for HTAP workloads.

2.1 Storage Models

There are essentially two ways that DBMSs store data: (1) the n-ary storage model and (2) the decomposition storage model. These models prescribe whether the DBMS stores the data in a tuple-centric or in an attribute-centric manner. Note that the choice of the storage model is independent of whether the primary storage location of the database is on disk or in-memory.

All DBMSs that are based on the architecture of the original DBMSs from the 1970s (i.e., IBM System R, INGRES) employ the n-ary storage model (NSM). With this approach, the DBMS stores all of the attributes for a single tuple contiguously. In the example shown in Figure 1a, all the attributes belonging to the first tuple (ID #101) are stored one after another, followed by all the attributes of the second tuple (ID #102). NSM works well for OLTP workloads because the queries in transactions tend to operate only on an individual entity in the database at a time (e.g., a single customer record), and thus they need to access most (if not all) of the attributes for that entity. It is also ideal for insert-heavy workloads, because the DBMS can add tuples to the table using a single write.

NSM is not a good choice, however, for analytical queries in OLAP workloads. This is because these queries tend to access multiple entities in a table at the same time. These queries also typically only access a subset of the attributes for each entity. For example, a query might only analyze the location attribute for all of the customer records within a particular geographic region. But NSM-based DBMSs can only read tables tuple-at-a-time, and therefore they process the data in a tuple-centric manner within their query operators [21]. Prior research has shown that this execution strategy has lower CPU efficiency due to high interpretation overhead [15]. NSM-based DBMSs squander I/O and memory bandwidth when executing OLAP queries because they unnecessarily access and process attributes that are not even needed for the final result.

An alternate approach, known as the decomposition storage model (DSM), stores data attribute-at-a-time [7, 17]. That is, the DBMS stores the tuples' values for a single attribute in a table contiguously. This is the storage model employed in modern OLAP DBMSs, including Vertica [49] and MonetDB [15]. Figure 1b shows an example of the storage layout of a table using DSM: the DBMS allocates space for all the values belonging to the first attribute (ID), followed by those that belong to the second attribute (IMAGE-ID). DSM works well for OLAP workloads because the DBMS only retrieves the values belonging to those attributes that are needed by the query. Such systems also process the data attribute-at-a-time, thereby improving the CPU efficiency due to lower interpretation overhead and skipping unnecessary attributes [6, 7].

But just like how NSM systems are inefficient for read-only OLAP workloads, DSM systems are not ideal for write-heavy OLTP workloads. This is because these workloads are comprised of queries that insert and update tuples in the table. Unlike in NSM, this is an expensive operation in DSM, as the storage manager needs to copy over the tuples' attributes to separate storage locations. Hence, the NSM and DSM architectures pose a problem for a DBMS supporting HTAP workloads that include transactions that update the state of the database while also executing complex analytical queries on this data. We contend that it is better for the DBMS to use a flexible storage model that provides the benefits of both the NSM and DSM for their respective workloads, while avoiding the problems that they encounter with the other workloads.

2.2 The Case for a Flexible Storage Model

We refer to a storage model that generalizes the NSM and DSM as the flexible storage model (FSM). It supports a wide variety of hybrid storage layouts where the DBMS co-locates the attributes that are frequently accessed together. We defer a detailed description of the FSM to Section 3. In the example shown in Figure 1c, all the values belonging to the first three attributes (ID, IMAGE-ID, NAME) are stored contiguously, followed by those belonging to the last two attributes (PRICE, DATA).

[Figure 2 (not reproduced): Performance Impact of the Storage Models – Time taken by the execution engine to run (1) a hybrid, and (2) a read-only workload on top of the NSM, DSM, and FSM storage managers, with the fraction of tuples selected varied from 0.2 to 1.0 at projectivities of 0.01, 0.1, and 1.0.]

FSM-based DBMSs can exploit a fundamental property of HTAP workloads in modern database applications.
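The difference between these three storage models is easy to see in miniature. The following toy sketch is our own illustration in Python (the paper's system is an in-memory DBMS; nothing here is taken from its code): it lays out the same tuples as flat arrays under NSM, DSM, and one FSM vertical partitioning.

```python
# Toy illustration of tuple-centric (NSM), attribute-centric (DSM),
# and hybrid (FSM) layouts as flat in-memory arrays.

def nsm_layout(tuples):
    # Tuple-at-a-time: all attributes of one tuple are contiguous.
    return [value for t in tuples for value in t]

def dsm_layout(tuples):
    # Attribute-at-a-time: all values of one attribute are contiguous.
    n_attrs = len(tuples[0])
    return [t[i] for i in range(n_attrs) for t in tuples]

def fsm_layout(tuples, groups):
    # Hybrid: each vertical partition (a group of attribute indexes)
    # is stored contiguously, co-locating attributes accessed together.
    return [t[i] for group in groups for t in tuples for i in group]

tuples = [(101, 201, "ITEM-101"), (102, 202, "ITEM-102")]

print(nsm_layout(tuples))
# [101, 201, 'ITEM-101', 102, 202, 'ITEM-102']
print(dsm_layout(tuples))
# [101, 102, 201, 202, 'ITEM-101', 'ITEM-102']
print(fsm_layout(tuples, [(0, 1), (2,)]))
# [101, 201, 102, 202, 'ITEM-101', 'ITEM-102']
```

Under the FSM partitioning {{attributes 1-2}, {attribute 3}}, a scan that touches only the first two attributes reads one contiguous region, while an insert still writes most of a tuple's attributes together; that is the trade-off the hot/cold discussion that follows exploits.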
In these workloads, the access patterns of transactions are such that tuples are more likely to be updated in an OLTP transaction when they are first added to the database. Then, over time they become colder and thus are less likely to be updated again. For instance, more than half of the content that Facebook users access and interact with is shared by their friends in the past two days, and then there is a rapid decline in content popularity over the following days [13]. For these workloads, it is advantageous to store the hot data in a tuple-centric layout (NSM) since it is likely to be modified during this period. But then once a particular data item passes some threshold, the DBMS can reorganize the cold data to an attribute-centric layout (DSM) that works well for analytical queries [47]. An FSM-based DBMS can support these HTAP workloads efficiently by storing different parts of the same table under different hybrid storage layouts.

To better understand these issues, we now compare the performance impact of these different storage models in a motivating experiment. Consider a database with a single table R(a0, a1, ..., a500) that has 10m tuples. Each attribute ak is a random integer value in the range [-100, 100]. The application for this database contains the following two queries:

Q1: INSERT INTO R VALUES (a0, a1, ..., a500)
Q2: SELECT a1, a2, ..., ak FROM R WHERE a0 < d

The transactional query Q1 inserts a tuple into the table R, while the analytical query Q2 projects a subset of attributes a1, a2, ..., ak from all the tuples in the table R that satisfy the predicate a0 < d. Note that different values for k and d affect its projectivity and selectivity, respectively. We consider two workloads: (1) a hybrid workload of 1000 scan (Q2) queries followed by 100m insert (Q1) queries, and (2) a read-only workload of 1000 scan (Q2) queries.

We measure the total time that an in-memory DBMS takes to execute these two workloads when its storage manager uses the NSM, DSM, and FSM storage layouts. For the FSM layout, the DBMS co-locates the attributes accessed in Q2 together in the following layout: {{a0}, {a1, ..., ak}, {ak+1, ..., a500}}. We consider three scenarios where we increase the projectivity of the scan query (Q2) progressively from 1% to 100%. For each of these projectivity settings, we also vary the selectivity of the query from 10% to 100%. We note that the time taken to execute a scan query is higher than the time taken to execute an insert query. We defer the description of our experimental setup to Section 6.

Figures 2a to 2c present the results for the hybrid workload. We observe that the NSM and FSM storage managers outperform the DSM storage manager by up to 1.3x across all scenarios. This is because DSM needs to split a tuple's attributes on every insert and store them in separate memory locations. FSM works well on this workload because it adopts wider vertical partitions similar to the NSM, and therefore executes insert queries faster than DSM. On the scan queries, FSM outperforms NSM because it stores the predicate attribute (a0) and the projected attributes (a1, ..., ak) separate from the other attributes in the table. During both predicate evaluation and subsequent projection, the FSM storage manager retrieves only the attributes required for query processing.

The results for the read-only workload are shown in Figures 2d to 2f. We see that on this workload the DSM and FSM layouts outperform the NSM on low projectivity analytical queries. They execute the workload up to 1.8x faster than NSM due to better utilization of memory bandwidth. This experiment highlights the performance impact of the storage model on HTAP workloads. It is clear from this that neither an NSM nor a DSM storage layout is a good solution for HTAP workloads. A better approach is to use an FSM DBMS that can process data stored in a wide variety of hybrid layouts, including the canonical tuple-at-a-time and attribute-at-a-time layouts. Given this, we now present our FSM-oriented DBMS architecture.

3. TILE-BASED ARCHITECTURE

We now describe a manifestation of the FSM approach that uses a storage abstraction based on tiles. A tile is akin to a vertical/horizontal segment of a table. We begin with a description of a storage architecture based on physical tiles. We then describe how to hide the layout of these physical tiles from the DBMS's query processing components through logical tiles. We will present our concurrency control protocol for this architecture in Section 4 and the dynamic layout reorganization method in Section 5.

3.1 Physical Tile

The fundamental physical storage unit of the DBMS is a tile tuple. Informally, a tile tuple is a subset of attribute values that belong to a tuple. A set of tile tuples form a physical tile. We refer to a collection of physical tiles as a tile group. The storage manager physically stores a table as a set of tile groups. All the physical tiles belonging to a tile group contain the same number of tile tuples.

[Figure 3 (not reproduced): Physical Tile – An example storage layout of a table composed of physical tiles. This table comprises three tile groups (A, B, C).]

Consider the table layout shown in Figure 3. The table comprises three tile groups (A, B, C) that each contain a disparate number of physical tiles. The tile group A consists of two tiles (A-1, A-2). The tile A-1 contains the first three attributes (ID, IMAGE-ID, NAME) of the table, while the tile A-2 contains the remaining two attributes (PRICE, DATA). These two tiles form the tile group A.

A tuple can be stored in an FSM database using different layouts over time. For instance, the default approach is for the DBMS to store all the new tuples in a tuple-centric layout, and then as they become colder, it reorganizes them into a layout with narrower OLAP-friendly vertical partitions. This is done by copying over the tuples to a tile group with the new layout, and swapping the original tile group in the table with the newly constructed one. As we describe in Section 5, this reorganization occurs in the background and in a transactionally safe manner to avoid false negatives/positives.

We note that the NSM and DSM layouts are special cases of the FSM layout when using tiles. If each tile group consists of one tile that contains all the attributes of the table, then it is the same as the NSM's tuple-centric layout. On the other hand, if each tile consists of exactly one attribute, then it is equivalent to the DSM's attribute-centric layout. Besides supporting flexible vertical partitioning, this tile-based architecture also supports horizontal partitioning of a table. This design choice allows the DBMS to configure the number of tuples stored in a tile group, such that the tiles fit within the CPU cache. For two tables with different schemas, the DBMS can therefore choose to store different numbers of tuples per tile group.

3.2 Logical Tile

An architecture based on physical tiles enables the DBMS to organize data in any possible layout that works well for the HTAP workload. The problem with storing data in different layouts, however, is that it is difficult to efficiently execute queries over this data. To overcome this problem, we propose an abstraction layer in our architecture based on logical tiles. A logical tile succinctly represents values spread across a collection of physical tiles from one or more tables. The DBMS uses this abstraction to hide the specifics of the layout of the table from its execution engine, without sacrificing the performance benefits of a workload-optimized storage layout. We now illustrate the abstraction with an example.

[Figure 4 (not reproduced): Logical Tile – An example of a logical tile representing data spread across a couple of physical tiles (A-1, A-2).]

For the example in Figure 4, the logical tile X points to data in the two physical tiles A-1 and A-2. Each column of a logical tile contains a list of offsets corresponding to the tuples in the underlying physical tiles. A column in a logical tile can represent the data stored in one or more attributes spread across a collection of physical tiles. The DBMS stores this mapping in the logical tile's metadata region. It records this information only once for a logical tile's column. For instance, the first column in X maps to the first and second attributes of A-1. When the DBMS materializes X, it uses the first column to construct the first two attributes of the materialized tile Y. The value stored in the first column of the first row of X represents the values present in the first two attributes of the first tuple of A-1. During materialization, the DBMS transforms it into {101, 201}.
ofexactlyoneattribute,thenitisequivalenttotheDSM’sattribute- Similarly,thevaluestoredinthethesecondcolumnofthefirstrow centriclayout.Besidessupportingflexibleverticalpartitioning,this ofXmapstothethirdattributeofA-1andthefirstattributeofA-2. tile-based architecture also supports horizontal partitioning of a Onmaterialization,itbecomes{ITEM-101,10}. table.ThisdesignchoiceallowstheDBMStoconfigurethenumber Foralltheattributesthatacolumninalogicaltilemapsto,the of tuples stored in a tile group, such that the tiles fit within the DBMSusesthesamelistoftupleoffsetsduringmaterialization.It CPUcache.Fortwotableswithdifferentschemas,theDBMScan ispossiblefortwocolumnsinalogicaltile(withuniqueoffsetlists) thereforechoosetostoredifferentnumberoftuplespertilegroup. tomaptothesameattributeinaphysicaltile.Thisisexemplifiedby thesecondandthethirdcolumnsofX.Tosimplifytheabstraction, 3.2 LogicalTile werestrictlogicaltilesfromreferringtootherlogicaltiles,meaning An architecture based on physical tiles enables the DBMS to thateverylogicaltilecolumnmapstoanattributeinaphysicaltile. organizedatainanypossiblelayoutthatworkswellfortheHTAP Wecontendthatthislogicaltileabstractionprovidesexpressive workload.Theproblemwithstoringdataindifferentlayouts,how- power even after this simplification. We next demonstrate how ever,isthatitisdifficulttoefficientlyexecutequeriesoverthisdata. itenablesaDBMStoprocessdataorganizedindifferentlayouts. 
ThisisbecausethedesignoftheDBMS’squeryprocessingcompo- During query execution, the DBMS can dynamically choose to nentsisnotoptimizedforaparticulardatalayout.Onesolutionis materializealogicaltile,thatisanintermediatequeryresult,intoa totransformeachtupleintoastandardformat,independentofthe physicaltile[6].Inthiscase,theoperatorconstructsapassthrough layoutofitstilegroupwhenaqueryreadsthedata.Butthisrequires logicaltilewithonlyonecolumnthatdirectlymapstotheattributes extraprocessingoverheadforeachquery,andthustheDBMSloses in the materialized physical tile, and propagates that logical tile thebenefitsoftheOLTP-orOLAP-optimizedstoragelayouts. upwardstoitsparentintheplantree. Another approach is for the DBMS to use multiple execution enginesforqueryprocessingthatareoptimizedforthedifferent 3.3 LogicalTileAlgebra hybridlayouts. Butthisrequiresexpensivemergingoftheresults producedbytheoperatorsinthedifferentexecutionengines[47]. InorderfortheDBMStobenefitfromstoringdatainphysical Inthiscase,theDBMSneedstoemployanadditionalsynchroniza- tiles,itmustbeabletoexecutequeriesefficientlywhilestillremain- tionmethod(beyonditsinternalconcurrencycontrolprotocol)to ingagnostictotheunderlyinglayoutsoftuples. Werefertothis enforceACIDguarantees.Beyondthesetwoissues,justhavingto propertyofabstractingawaythephysicallayoutofthedatafromthe maintainmultipleexecutionpathsintheDBMS’ssourcecodethat DBMS’squeryprocessingcomponentsaslayouttransparency.This arespecializedforthedifferentlayoutsisnotoriouslydifficult[50]. 
reducesthecouplingbetweentheDBMS’sstoragemanagerand 586 SELECT R.c, SUM(S.z) metadataofthelogicaltileandnotthedatathatitrepresents.The FROM R JOIN S ON R.b = S.y projectionoperatormodifiesthelistofattributesintheschemaof WHERE R.a = 1 AND S.x = 2 theinputlogicaltiletoremovetheattributesthatarenotneededin GROUP BY R.c; theupper-levelsofthequeryplanorinitsfinalresult.Forthequery inFigure5,theprojectionoperator(π)ontopofthesequentialscan Ω Materialize,{LT},{PT} operator(σ)associatedwithRoutputslogicaltilesthatcontains theattributesbandc.Anothermetadataoperatorisselection;this Γc;sum(z) Aggregate,{LT,C},{LT} operatormodifiesthemetadataoftheinputlogicaltiletomarkany rowcorrespondingtoatuplethatdoesnotsatisfythepredicateas R.b=S.y Join,{LT,LT},{LT} notbeingpartofthelogicaltile. (cid:49) πb,c πy,z Projection,{LT},{LT} Mutators: Theseoperatorsmodifythedatastoredinthetable. Theinsertoperatortakesinalogicaltile,andappendstheassociated tuplestothespecifiedtable. Inthiscase,theoperatorfirstrecon- σR.a=1 σS.x=2 SequentialScan,{T,P},{LT} structsthetuplesrepresentedbythelogicaltile,andthenaddsthem intothetable. Itcanalsodirectlytaketuplesfromtheclientand R S Table appendthemtothetable. Figure5: SampleSQLqueryandtheassociatedplantreeforillustrating Thedeleteoperatortakesinalogicaltile,andremovesthetuples theoperatorsofthelogicaltilealgebra.Wedescribethename,inputs,and presentintheunderlyingtable.Itusesthetupleoffsetsinthefirst outputofeachoperatorinthetree. WedenotethelogicaltilebyLT,the columnofthelogicaltiletoidentifythelocationsofthetuplesthat physicaltilebyPT,thetablebyT,theattributesbyC,andthepredicatebyP. shouldbedeleted, andthenproceedswiththeirremoval. Italso supportsatruncatemodetoquicklyeraseallthetuplesinthetable. executionengine.Wenowpresentanalgebradefinedoverlogical Theupdateoperatorfirstremovesthetuplespresentinthelogicaltile tilesthatallowstheDBMStoachievelayouttransparency. 
similartothedeleteoperator.Itthenconstructsthenewerversion ConsidertheSQLquerythataccessesthetablesR(a,b,c)and ofthetuplesbycopyingovertheirolderversion,andperforming S(x,y,z)anditscorrespondingplantreeshowninFigure5. It therequestedmodifications.Finally,itappendsthenewerversion selectssometuplesfromRbasedona,andsometuplesfromS ofthetuplesintothetable.AswediscussinSection4,themutators based on x. It then joins the resulting tuples based on b and y. alsocontrolthevisibilityoftuplesfortransactions. Finally,itgroupstheresultingtuplesbyc,andforeachgroup,it PipelineBreakers:Thelastcategoryofoperatorsconsumethe computesthesumoverz.Usingthisqueryasanexample,wenow logicaltilesproducedbytheirchildrenintheplantree.Theyblock describetheoperatorsemanticsofourlogicaltilealgebra.1 theexecutionoftheupper-leveloperatorswhiletheywaitfortheir Bridge Operators: Most of the logical tile algebra operators children’soutput.Thisisnecessarybecausetheseoperatorsneedto produceandconsumelogicaltiles,whichmeansthattheyareoblivi- haveallofthelogicaltilesfromtheirchildrenbeforetheycanpro- oustothestoragelayoutoftheunderlyingphysicaltiles.Theonly ducetheirownoutput.Thisessentiallybreaksthestreamlinedflow operatorsthatinteractwiththestoragemanageraretheonesatthe oflogicaltilesbetweenoperatorsduringthequeryexecution[37]. bottomandthetopofthetree.Werefertotheseoperatorsasbridge Thejoinoperatortakesinapairoflogicaltiles,andthenevaluates operatorsbecausetheyconnectlogicaltilesandphysicaltiles. thejoinpredicateoverthem. Itfirstconstructsanoutputlogical Thebridgeoperatorsincludethetableaccessmethods,suchas tile,whoseschemaisobtainedbyconcatenatingtheschemasofthe thesequentialscanandtheindexscanoperators. Thesequential twoinputlogicaltiles. Whenititeratesovereachpairoftuples, scanoperatorgeneratesalogicaltileforeverytilegroupinthetable. ifitfindsapairsatisfyingthepredicate,thenitconcatenatesthem Eachlogicaltileonlycontainsonecolumnthatisalistofoffsets andappendsthemtotheoutputlogicaltile. 
Forthequeryshown correspondingtoallthetuplesinthetilegroupsatisfyingthepredi- inFigure5,thejoinoperator( )examineseverypairoflogical cate.InFigure5,thesequentialscan(σ)operatorassociatedwithR tilesemittedbytheprojection((cid:49)π)operators,andthenproducesa emitslogicaltilesthatrepresentthetuplesthatsatisfythepredicate concatenatedlogicaltilecontainingeverypairoflogicaltiletuples a=1. The index scan operator identifies the tuples matching the thatsatisfythejoinpredicateR.b=S.y. predicateusingtheindex,andthenconstructsoneormorelogical Set operators, such as union and intersect, are also pipeline- tilesthatcontainthematchingtuples. breakers.Whilegoingoverthelogicaltilesproducedbytheirchil- The DBMS uses the materialize operator to transform a logi- dren,theykeeptrackofthetuplesobserved.Finally,theyemitthe cal tile to a physical tile. It also uses this operator to perform logicaltilesaftermarkingthetuplesthatshouldbeskippedbythe early-materialization[6]. InFigure5,theaggregateoperator(Γ) setoperationasnotbeingpartoftheassociatedlogicaltiles. constructsaphysicaltilewiththeaggregatetuples,andthenwraps Similarly,aggregationoperators(e.g.,count,sum),examineall arounditwithapassthroughlogicaltile.Inordertosendthequery thelogicaltilesfromtheirchildtoconstructtheaggregatetuples. resulttotheclient,theDBMSrunsthematerializeoperator(Ω)for Unlikethesetoperators,theseoperatorsbuildnewphysicaltilesto transformingthepassthroughlogicaltilereturnedbytheaggregate storetheaggregatetuples.Forinstance,inFigure5,theaggregate operatortoaphysicaltile.Inthiscase,thematerializeoperatordoes operator(Γ)constructsaphysicaltilethatcontainthesumoverthe notneedtoconstructanewphysicaltile.Instead,itdirectlyreturns attribute z for every group oftuples witha unique value forthe thephysicaltileunderlyingthepassthroughlogicaltile. attributec.Itthenconstructsasetofpassthroughlogicaltiles,and MetadataOperators: Themetadataofalogicaltileincludes propagatesthemupwardsintheplantreeonelogicaltileatatime. 
…information about the underlying physical tiles and a bitmap that represents the rows that must be examined by the operator processing the logical tile. This category of operators only modifies this metadata.

3.4 Discussion

We now discuss the benefits of the logical tile abstraction for an HTAP DBMS. These include the following (a formal algebraic definition of all the operators in our logical tile algebra is presented in Appendix A):

Layout Transparency: The operators need not be specialized for all possible storage layouts because the logical tile algebra hides this information from them. This design obviates the need for expensive merge operations over results produced using different execution engines. It also reduces the DBMS's code complexity, thereby improving maintainability and testability.

Vectorized Processing: The iterator paradigm in many query-processing systems is implemented using executors that process data one tuple at a time [21]. This approach suffers from a high interpretation overhead and prevents key compiler optimizations such as loop pipelining [15]. In contrast, the operators in a tile-based DBMS process data one logical tile at a time. As shown in prior research on columnar DBMSs, vectorized processing improves CPU efficiency by reducing the interpretation overhead [7].

Flexible Materialization: Since neither early materialization nor late materialization is a universally good strategy [6], a tile-based DBMS can dynamically choose to materialize at any operator in the plan tree during query execution. It can then propagate the passthrough logical tiles upwards in the plan tree. This flexible materialization strategy works well for HTAP workloads.

Caching Behavior: The DBMS can optimize several dimensions in how it creates tile groups, such as the number of tuples per tile group, to ensure that the tiles fit well within the cache hierarchy. Furthermore, the logical tiles' succinct representation enables the DBMS to more easily manage complex intermediate query execution results within the cache.

For the query shown in Figure 5, processing one logical tile at a time reduces the number of cache misses and function calls. The DBMS also copies less data and chases fewer pointers because it only materializes the logical tile in the aggregation operator. Lastly, our logical-tile algebra bridges the theoretical gap between row-stores and column-stores within a single DBMS architecture.

4. CONCURRENCY CONTROL

HTAP workloads are comprised of short-duration transactions that are executed alongside long-running analytical queries. The DBMS must ensure that these OLAP queries do not see the effects of transactions that start after they begin, and that readers do not block on writers [41]. It is for this reason that most HTAP DBMSs employ multi-version concurrency control (MVCC) [34, 38, 46].

We now discuss how to use MVCC with our tile-based architecture. We adopt an approach similar to that taken in previous systems, where the DBMS records the versioning information directly alongside the tuple data [30, 46]. When a new transaction starts, the DBMS assigns it a unique transaction identifier from a monotonically increasing global counter. When a transaction is ready to commit, the DBMS assigns it a unique commit timestamp that it obtains by incrementing the timestamp of the last committed transaction. Each transaction maintains a metadata context that includes: (1) the timestamp of the last committed transaction that should be visible to this transaction, and (2) references to the set of tuple versions that the transaction either inserts or deletes during its lifetime. A reference contains only the location of the tuple and not its actual data. In our tile-based architecture, it includes the tile group identifier and the offset of the tuple within that tile group. Every tile group contains the following versioning metadata for each tuple:

• TxnId: A placeholder for the identifier of the transaction that currently holds a latch on the tuple.
• BeginCTS: The commit timestamp from which the tuple becomes visible.
• EndCTS: The commit timestamp after which the tuple ceases to be visible.
• PreV: Reference to the previous version, if any, of the tuple.

Figure 6 shows an example of this versioning metadata.

[Figure 6: Concurrency Control – Versioning information that the DBMS records in the tile groups for MVCC. The figure shows the TxnId, BeginCTS, EndCTS, and PreV columns recorded alongside two tile groups (tiles A-1, A-2, and B-1) holding tuples with ID and SALARY attributes.]

The DBMS stores this information separate from the physical tiles, so that it can handle all the physical tiles in a tile group in the same manner. The system uses the previous version field to traverse the version chain and access the earlier versions, if any, of that tuple. This version chain can span multiple tile groups, and thus versions may be stored under different physical layouts in memory.

4.1 Protocol

We now discuss the operators in the execution engine that work with the storage manager to ensure transactional isolation. By default, the DBMS provides snapshot isolation.

Mutators: The insert operator starts by requesting an empty tuple slot from the DBMS. It then claims that slot by storing its transaction identifier in that tuple's TxnId field with an atomic compare-and-swap operation. The BeginCTS field is initially set to infinity so that the tuple is not visible to other concurrent transactions. When the transaction commits, the DBMS sets the tuple's BeginCTS field to the transaction's commit timestamp, and resets its TxnId field. In Figure 6, the first two tuples in the first tile group are inserted by the transaction with commit timestamp 1001. The DBMS appends all new tuples into a tile group with the default NSM layout.

The delete operator first acquires a latch on the target tuple by storing its transaction identifier in the tuple's TxnId field. This prevents another transaction from concurrently deleting the same tuple. At the time of committing the transaction, the DBMS sets the tuple's EndCTS to the transaction's commit timestamp, thereby ceasing its existence. In Figure 6, the tuple with ID 105 is deleted by the transaction with commit timestamp 1003.

The update operator begins by marking the older version of the tuple as invisible. It then constructs the newer tuple version by copying over the older one and performing the requested modifications. Finally, it appends the newer tuple version into the table. In Figure 6, the transaction with commit timestamp 1004 updates the SALARY of the tuple with ID 101. Note that the PreV field of the third tuple in the second tile group refers to the older version of the tuple in the first tile group. At this point in time, the transaction with identifier 305 holds a latch on the tuple with ID 103.

Bridge Operators: The only other category of operators that are concerned with the visibility of tuples are the table access methods. The sequential scan and index scan operators emit one or more logical tiles for the matching tuples. Before predicate evaluation, they first determine if the tuple is visible to their transaction by verifying whether the transaction's last visible commit timestamp falls within the BeginCTS and EndCTS fields of the tuple. The tuple versions that are inserted by the transaction are also visible to these operators. Due to this design, all the other operators of the logical tile algebra are decoupled from the tuple visibility problem.

At any point in time, the DBMS can roll back an uncommitted transaction by resetting the TxnId field of the tuples that it has latched, and releasing any unused tuple slots. Over time, the older versions of a tuple become invisible to any current or future transactions. The DBMS periodically garbage collects these invisible tuple versions in an asynchronous, incremental manner. The garbage collection process not only helps in recovering the storage space, but also refreshes the statistics maintained by the query planner.
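The visibility rule that the bridge operators apply can be sketched in a few lines. The following is a minimal Python sketch of the BeginCTS/EndCTS check under snapshot isolation; it is our own illustration (the class and function names are hypothetical, not Peloton's actual code, and the half-open interval is one common convention for interpreting the two timestamps):

```python
INFINITY = float("inf")

class TupleMeta:
    """Per-tuple versioning metadata, as described above."""
    def __init__(self, txn_id=None, begin_cts=INFINITY, end_cts=INFINITY, prev=None):
        self.txn_id = txn_id        # transaction currently holding a latch, if any
        self.begin_cts = begin_cts  # commit timestamp from which the tuple is visible
        self.end_cts = end_cts      # commit timestamp after which it ceases to be visible
        self.prev = prev            # reference to the previous version, if any

def is_visible(meta, txn_id, last_visible_cts):
    # Versions inserted by this transaction are visible to it even though
    # their BeginCTS is still infinity (they are not yet committed).
    if meta.txn_id == txn_id and meta.begin_cts == INFINITY:
        return True
    # Otherwise the transaction's last visible commit timestamp must fall
    # within the tuple's [BeginCTS, EndCTS) lifetime.
    return meta.begin_cts <= last_visible_cts < meta.end_cts
```

A scan operator would walk the PreV chain, applying this predicate to each version, until it finds the newest version visible to its transaction.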
4.2 Indexes

A tile-based DBMS can use any order-preserving in-memory index (e.g., a B+tree) for primary and secondary indexes. While the key stored in the index comprises a set of tuple attributes, the value is the logical location of the latest version of a tuple. We do not store raw tuple pointers because then the DBMS would have to update them when it reorganizes the layout of the tile groups.

In Figure 6, the index entry for the tuple with ID 101 refers to the third tuple in the second tile group. An operator might encounter a version of the tuple that is not visible to its current transaction. When this occurs, it uses the PreV field to traverse the version chain to find the newest version of the tuple that is visible to the transaction. In this way, the DBMS does not need to redundantly maintain the versioning information in the table as well as in the indexes [30].

4.3 Recovery

Our DBMS architecture includes a recovery module that is responsible for the logging and recovery mechanisms. It employs a variant of the canonical ARIES recovery protocol that is adapted for in-memory DBMSs [12, 32, 35]. During regular transaction execution, the DBMS records the changes performed by the transaction in the write-ahead log before committing the transaction. It periodically takes snapshots that are stored on the filesystem to bound the recovery time after a crash.

At the start of recovery, the DBMS first loads the latest snapshot. It then replays the log to ensure that the changes made by the transactions committed since the snapshot are present in the database. Changes made by uncommitted transactions at the time of failure are not propagated to the database. As we do not record the physical changes to the indexes in this log, the DBMS rebuilds all of the tables' indexes during recovery to ensure that they are consistent with the database [32]. We leave a detailed discussion of Peloton's recovery module as future work.

5. LAYOUT REORGANIZATION

All of the above optimizations in the design of our FSM-based DBMS are moot unless the system is able to efficiently and effectively reorganize the layout of the database to adapt to changes in the HTAP workload.

5.1 On-line Query Monitoring

The DBMS uses a lightweight monitor that tracks the attributes that are accessed by each query. The goal is to determine which attributes should be stored in the same physical tile in the new tile group layout. The monitor collects information about the attributes present in the query's SELECT clause, as well as those that are present in the WHERE clause [11, 36]. It distinguishes between these two sets of attributes because by co-locating only the attributes in the WHERE clause, rather than those in both clauses together, it needs to retrieve less data for predicate evaluation. It then stores this information as a time series graph for each individual table.

To reduce the monitoring overhead, the monitor only gathers statistics from a random subset of queries. In this case, it needs to ensure that the storage layout is not biased towards the more frequently observed transactions. To improve the overall throughput on HTAP workloads, it needs to optimize the layout for both the transactions and the data-intensive analytical queries. The DBMS achieves this by recording the query plan cost computed by the optimizer [3]. It uses this plan cost information to derive a storage layout that also benefits the analytical queries. We next describe how the DBMS uses these query statistics to compute a workload-optimized vertical partitioning for each table.

5.2 Partitioning Algorithm

Computing the optimal storage layout for a workload based on its query statistics requires exponential space and time in the number of attributes. Consider a query workload Q with m queries, and a table with n attributes. For every query q ∈ Q, the naïve partitioning algorithm needs to examine 2^n groups of attributes. It then analyzes all the possible attribute partitions of the table. The number of such partitions is referred to as a Bell number. Based on the asymptotic limit of the Bell number, the time complexity of this algorithm is O(e^(n ln(n)) + m·n·2^n) and its space complexity is Θ(2^n) [19, 23].

Algorithm 1: Vertical Partitioning Algorithm
  Require: recent queries Q, table T, number of representative queries k
  function UPDATE-LAYOUT(Q, T, k)
    # Stage I: Clustering algorithm
    for all queries q appearing in Q do
      for all representative queries r_j associated with T do
        if r_j is closest to q then
          r_j ← r_j + w × (q − r_j)
        end if
      end for
    end for
    # Stage II: Greedy algorithm
    Generate layout for T using r
  end function
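To make the two stages of Algorithm 1 concrete, here is a minimal, runnable Python sketch. The representation of queries as 0/1 attribute-access vectors, the fixed weight w, the 0.5 inclusion threshold, and all names are our own illustrative assumptions, not Peloton's API; the distance below drops the paper's division by the number of attributes in T, which does not change the nearest-cluster choice:

```python
def update_layout(queries, reps, weights, w=0.2):
    """Two-stage layout computation in the spirit of Algorithm 1.

    queries: list of 0/1 lists (1 = attribute accessed by the query)
    reps:    list of representative query vectors (floats), updated in place
    weights: per-cluster weights, incremented on each assignment
    """
    # Stage I: on-line clustering of query attribute-access vectors.
    for q in queries:
        # Distance = number of attributes accessed by exactly one of the
        # two queries (the paper normalizes by the attribute count of T).
        j = min(range(len(reps)),
                key=lambda i: sum(abs(a - r) for a, r in zip(q, reps[i])))
        # The update rule from Stage I: r_j <- r_j + w * (q - r_j).
        reps[j] = [r + w * (a - r) for r, a in zip(reps[j], q)]
        weights[j] += 1

    # Stage II: greedy tile generation from the representatives,
    # visiting clusters in descending order of weight.
    n = len(reps[0])
    assigned, layout = set(), []
    for j in sorted(range(len(reps)), key=lambda i: -weights[i]):
        tile = [a for a in range(n) if reps[j][a] > 0.5 and a not in assigned]
        if tile:
            layout.append(tile)
            assigned.update(tile)
    rest = [a for a in range(n) if a not in assigned]
    if rest:
        layout.append(rest)  # attributes no representative query touches
    return layout
```

With two stable clusters of queries (one touching attributes 0–1, one touching 2–3), the greedy stage groups each cluster's attributes into its own tile and places the untouched attribute in a final tile.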
As this is infeasible in practice, we adopt a more resource-efficient partitioning algorithm that converges towards the optimal storage layout over time. Our approach leverages the query statistics to compute the layout in two stages. As shown in Algorithm 1, it first employs a clustering algorithm to determine which set of attributes should be stored together in the same physical tile. It then uses a greedy algorithm that takes the output of the clustering algorithm to generate a workload-aware tile group storage layout.

For each table T in the database, the DBMS maintains statistics about the recent queries Q that accessed it. For each q ∈ Q, the DBMS extracts its metadata, such as the attributes it accessed. If the DBMS is able to identify the "important" queries, then it can optimize the storage layout of the table for these queries. It performs this identification using an on-line k-means clustering algorithm and dynamically updates the layout based on incoming queries [27]. For each query q, the clustering algorithm observes the referenced attributes, and assigns it to the j-th cluster, whose mean representative query r_j is the most similar to q. The distance metric between two queries is defined as the number of attributes that are accessed by one and exactly one of the two queries, divided by the number of attributes in T. Queries that access a lot of common attributes of T, therefore, tend to belong to the same cluster. After assigning the query to a cluster, it updates r_j to reflect the inclusion of the query q.

The algorithm prioritizes each query based on its plan cost. If it treats all queries equally, the OLTP queries that tend to outnumber the long-running analytical queries will force the algorithm to quickly converge to a tuple-centric layout. The low-projectivity analytical queries do not benefit from the tuple-centric layout. By prioritizing the queries based on their plan cost, the queries with higher I/O cost have a stronger influence on the layout of the table.

An attractive feature of the on-line clustering algorithm is that, over time, the means of the clusters drift towards the recent samples. To better understand this aspect, we now discuss the algorithm more formally. We define c_j to be the mean representative query of the j-th cluster, and c_0 to represent that mean's initial value. Each cluster's mean is a vector of numbers whose length equals the number of attributes in T. If the algorithm adds s query samples to the cluster over time, and prioritizes the older samples with a weight w at the time of addition, then the current mean of the cluster is given by:

  c_j = (1 − w)^s · c_0 + w · Σ_{i=1}^{s} (1 − w)^{s−i} · Q_i

The weight w determines the rate at which the older query samples are forgotten by the algorithm. The time complexity of each iteration of this algorithm is O(mnk), with a space complexity of O(n(m+k)). Therefore, it is more efficient than the naïve partitioning algorithm. By using the clustering algorithm, the DBMS maintains the top k representative queries associated with each table in the database.

Some applications could have periodic cycles in their workload that oscillate between different sets of queries. In this scenario, it would be better if the DBMS does not quickly reorganize tables to a new layout, because then the workload would shift back after an expensive reorganization. This is particularly important in exploratory workloads with ad-hoc OLAP queries. To avoid this problem, the DBMS prioritizes the older query samples in the clustering algorithm with a larger weight, thereby dampening the adaptation mechanism. On the other hand, if the workload shift is not ephemeral, then it proceeds with the reorganization. We examine strategies for controlling data reorganization in Section 6.5.

The next stage in Algorithm 1 is to use a greedy algorithm to derive a partitioning layout for the table using these representative queries. This algorithm iterates over the representative queries in descending order based on the weight of their associated clusters. For each cluster, the algorithm groups the attributes accessed by that cluster's representative query together into a tile. It continues this process until it assigns each attribute in the table to some tile. In this manner, the DBMS periodically computes a layout for the table using the recent query statistics. We next describe how it reorganizes the table to the new layout.

5.3 Data Layout Reorganization

We use an incremental approach for data layout reorganization. The crux of our approach is to track the recent query workload at runtime, and then periodically compute a workload-optimized storage layout for each table in the background. The DBMS then reorganizes the table to match the computed layout. The monitoring infrastructure described in Section 5.1 provides the information that guides this process.

There are two approaches to perform data reorganization in a DBMS. The first is to combine query processing with data reorganization. In this case, the DBMS copies existing data to new tile groups with a query-optimized layout before processing the data [11]. This approach can be prohibitively expensive for some queries due to the additional I/O overhead. An alternative is to decouple this reorganization from query execution and use a separate background process to reorganize the data in an incremental manner, one tile group at a time. Over time, this process optimizes the storage layout for the workload and amortizes the cost across multiple queries. It is because this approach is incremental [25] that we adopt this second strategy.

For a given tile group, the DBMS first copies over the data to the new layout, and then atomically swaps the newly constructed tile group into the table. Any concurrent delete or update operation only modifies the versioning metadata that is stored separate from the physical tiles. The newly constructed tile group refers to the versioning metadata of the older tile group. The storage space consumed by the physical tiles in the older tile group is reclaimed by the DBMS only when they are no longer referenced by any logical tiles. Due to its incremental nature, the overhead of data layout reorganization is amortized over multiple queries.

The reorganization process does not target hot tile groups that are still being heavily accessed by OLTP transactions. Rather, it transforms the cold data to new layouts. This approach works well with the MVCC protocol described in Section 4. The updated versions of the tuples start off in a tuple-centric layout, and then are gradually transformed to an OLAP-optimized layout. Hence, the reorganization process does not impact latency-sensitive transactions while benefiting the OLAP queries.

6. EXPERIMENTAL EVALUATION

We now present an analysis of the tile-based FSM architecture. We implemented our execution engine and FSM storage manager in the Peloton DBMS [1]. Peloton is a multi-threaded, in-memory DBMS that is designed for HTAP workloads. Its execution engine supports all the common relational operators, which are implemented using the logical tile abstraction. We also integrated the runtime components for on-line query monitoring and data reorganization described in Section 5.

We deployed Peloton for these experiments on a dual-socket Intel Xeon E5-4620 server running Ubuntu 14.04 (64-bit). Each socket contains eight 2.6 GHz cores. It has 128 GB of DRAM and its L3 cache size is 20 MB.

For each experiment, we execute the workload five times and report the average execution time. All transactions execute with the default snapshot isolation level. We disable the DBMS's garbage collection and logging components to ensure that our measurements are only for the storage and query processing components.

In this section, we begin with an analysis of the impact of query projectivity and selectivity settings on the performance of different storage models. We then demonstrate that an FSM DBMS can converge to an optimal layout for an arbitrary workload without any manual tuning. We then examine the impact of the table's horizontal fragmentation on the performance of the DBMS. We next perform a sensitivity analysis on the parameters of the data reorganization process. Lastly, we compare some of our design choices against another state-of-the-art adaptive storage manager [11].

6.1 ADAPT Benchmark

We first describe the benchmarks we use in our evaluation. Since there currently is a dearth of HTAP workloads for testing, we developed our own that we call the ADAPT benchmark, which is composed of queries that are common in enterprise workloads [41]. This benchmark is inspired by the one that Alagiannis et al. developed for evaluating the H2O adaptive storage manager [11].

The ADAPT database has two tables – a narrow table and a wide table. Each table contains tuples with a primary key (a_0) and p integer attributes (a_1, ..., a_p), each 4 bytes in size. The narrow table has p=50 attributes, and the wide table has p=500 attributes. The size of the tuples in these two tables is approximately 200 B and 2 KB, respectively. In all of our experiments, we first load 10m tuples into each table in the database.

The benchmark's workload consists of (Q1) an insert query that adds a single tuple into the table, (Q2) a scan query that projects a subset of attributes of the tuples that satisfy a predicate, (Q3) an aggregate query that computes the maximum value for a set of attributes over the selected tuples, (Q4) an arithmetic query that sums up a subset of attributes of the selected tuples, and (Q5) a join query that combines the tuples from two tables based on a predicate defined over the attributes in the tables. The corresponding SQL for these queries is as follows:

  Q1: INSERT INTO R VALUES (a_0, a_1, ..., a_p)
  Q2: SELECT a_1, a_2, ..., a_k FROM R WHERE a_0 < δ
  Q3: SELECT MAX(a_1), ..., MAX(a_k) FROM R WHERE a_0 < δ
  Q4: SELECT a_1 + a_2 + ... + a_k FROM R WHERE a_0 < δ
  Q5: SELECT X.a_1, ..., X.a_k, Y.a_1, ..., Y.a_k FROM R AS X, R AS Y WHERE X.a_i < Y.a_j

Note that different values for k and δ alter the projectivity and the selectivity of the queries, respectively. We use different workloads comprised of these query types to evaluate the impact of the storage models on the performance of the DBMS.

[Figure 7: Projectivity Measurements – The impact of the storage layout on the query processing time under different projectivity settings. The execution engine runs the workload with different underlying storage managers on both the narrow and the wide table. Panels 7a–7h cover the scan and aggregate queries on the narrow and wide tables, under read-only and hybrid workloads.]

6.2 Performance Impact of Storage Models

We begin with an analysis of the storage models when executing the scan (Q2), aggregate (Q3), and insert (Q1) queries under different projectivity and selectivity settings. We consider two workloads in this experiment: (1) a read-only workload with one scan or aggregate query, and (2) a hybrid workload comprised of one scan or aggregate query followed by 1m insert queries. For each workload, we first load the database and execute the queries five times until the DBMS reorganizes the tables' layout. This is the ideal scenario for the FSM storage manager. We then execute the workload again using different storage managers and measure their completion time.

Figures 7a and 7b show the results for these two workloads with the scan query under different projectivity settings. For these experiments, we configure the scan query to select all the tuples. We observe that when the projectivity of the scan query is low, the DSM and FSM storage managers execute the read-only workload 2.3× faster than their NSM counterpart. This is because they make better use of memory bandwidth by only fetching the required attributes. As the query's projectivity increases, the performance gap decreases; when half of the attributes are included in the query's output, the DSM storage manager is 21% slower than the other storage managers. We attribute this to the increase in the tuple reconstruction cost. On the hybrid workload, under low projectivity settings, the FSM storage manager outperforms both NSM and DSM by 24% and 33%, respectively. This is because it processes the scan query faster than NSM and the insert queries faster than DSM.

The results in Figures 7c and 7d show that the differences between the storage managers are larger for the wide table. When half of the attributes are projected, DSM is 43% slower than FSM on the read-only workload in Figure 7c, but it is only 19% slower than FSM for the narrow table on the same workload and projectivity setting in Figure 7a. This is because the tuple reconstruction cost increases with wider tuples. This experiment shows that the benefits of FSM are more pronounced for the wide table.

These trends also occur for the aggregate query workloads shown in Figures 7e to 7h. We configure this query to compute the maximum value over all the tuples for the attributes of interest. The magnitude of the performance gaps between the different storage managers is smaller under the low projectivity settings. FSM executes the read-only workload up to 1.9× faster than NSM. This is because the execution engine needs to materialize the logical tiles consisting of the aggregate tuples for all the storage managers.

Figure 8 shows the results for the read-only and hybrid workloads under different selectivity settings. For these experiments, we fix the projectivity of the scan and aggregate queries to 0.1 (i.e., the query projects five attributes for the narrow table, and 50 attributes for the wide table). We vary the selectivity of these queries from 10–100%. Across all these settings, the execution time with FSM is either better than or comparable to the other storage managers. This highlights the benefits of hybrid storage layouts for HTAP workloads.

We see that DSM outperforms NSM on the read-only workload, but this order is flipped for the hybrid workload. This is because DSM executes the scan query faster than NSM on both tables (Figures 8a and 8c), but NSM is faster than DSM on the insert queries (Figures 8d and 8h). FSM outperforms NSM on the read-only workload and DSM on the hybrid workload. This illustrates that the FSM storage manager can adapt to any HTAP workload, unlike the DSM and NSM storage managers. In the case of the workloads that contain the aggregate query, as shown in Figures 8e to 8h, the difference between the storage managers shrinks. We attribute this to the low projectivity setting of the aggregate query, which corroborates the findings in Figure 7.

[Figure 8: Selectivity Measurements – The impact of the storage layout on the query processing time under different selectivity settings. The execution engine runs the workload with different underlying storage managers on both the narrow and the wide table. Panels 8a–8h mirror the layout of Figure 7.]

6.3 Workload-Aware Adaptation

In the previous experiment, we examined a hybrid storage layout after the DBMS optimizes the layout for the workload. In the real world, this is not feasible due to the dynamic nature of workloads. Thus, we now examine the ability of FSM to adapt the storage layout at runtime, and compare it against static NSM and DSM layouts.

In this experiment, we execute a sequence of queries with changing properties. To mimic the temporal locality in HTAP workloads, and to clearly delineate the impact of the DBMS's storage layout on the different queries, the sequence is divided into segments of 25 queries that each correspond to a particular query type. This means that the DBMS executes the same type of query (with different input parameters) in one segment, and then switches to another query type in the next segment. We measure the execution time of each query in the sequence on the wide table for the NSM, DSM, and FSM storage managers. We configured Peloton's reorganization process to speed up the adaptation process for the sake of demonstration.

[Figure 9: Workload-Aware Adaptation – The impact of tile group layout adaptation on the query processing performance in an evolving workload mixture from the ADAPT benchmark. This experiment also examines the behavior of different storage managers while serving different query types in the workload.]
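The segmented query sequence described above can be mimicked with a small driver. This is our own illustrative harness, not the paper's benchmark code; the SQL templates come from the ADAPT queries in Section 6.1, and the parameter defaults are arbitrary choices:

```python
import random

def scan_sql(k, delta):
    # Q2 from the ADAPT benchmark: k sets the projectivity, delta the selectivity.
    cols = ", ".join(f"a_{i}" for i in range(1, k + 1))
    return f"SELECT {cols} FROM R WHERE a_0 < {delta}"

def insert_sql(p):
    # Q1: a single-tuple insert with p attributes plus the primary key.
    vals = ", ".join(str(random.randint(0, 100)) for _ in range(p + 1))
    return f"INSERT INTO R VALUES ({vals})"

def adapt_sequence(n_segments, segment_len=25, p=50, k=5, delta=1000):
    """Alternate scan and insert segments of 25 queries each, as in Section 6.3."""
    seq = []
    for s in range(n_segments):
        for _ in range(segment_len):
            seq.append(scan_sql(k, delta) if s % 2 == 0 else insert_sql(p))
    return seq
```

Feeding such a sequence to the DBMS exercises exactly the oscillation between scan-friendly and insert-friendly layouts that the adaptation experiment measures.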
The key observation from the time series graph in Figure 9 is that the FSM storage manager converges over time to a layout that works well for the particular segment. For instance, consider its behavior on the first query segment – a low-projectivity scan query (Q2). When the tables are loaded into the database, FSM stores all the tuples in the default NSM layout because it has not yet seen any queries. The query execution time is, therefore, comparable to NSM. Over the next few queries, however, it starts reorganizing the data to the hybrid layout {{a_0}, {a_1, ..., a_k}, {a_{k+1}, ..., a_500}}, which is ideal for Q2. After this happens, the query execution time drops and matches that of the DSM-based storage manager.

Next, when the workload shifts to the insert query segment (Q1), we observe that both NSM and FSM fare better than DSM. This is because they perform fewer writes to different memory locations. After the insert queries, there is another shift in the workload back to the scan queries. FSM immediately outperforms NSM because it already reorganized most of the tile groups initially loaded into the table to an OLAP-optimized layout during the first scan query segment. The DBMS takes more time to process the scan queries in the third segment compared to those in the first segment because the table grows in size due to the interleaving insert query segment.

To better understand how the FSM is performing data reorganization, we analyze the distribution of the different tile group storage layouts over time. At the end of every query segment, we compute the number of tile groups per layout. The results shown in Figure 10 indicate that there are only two different layouts – the FSM layout mentioned above and the NSM layout. After every insert query segment, there is an increase in the number of tile groups with the NSM layout. This is because the newly added tuples are stored in that layout. Over time, the number of tile groups with the FSM layout increases, as it works well for the scan query segment. This explains the better execution times on those segments in Figure 9.

[Figure 10: Layout Distribution – The variation in the distribution of tile group layouts of the table over time due to the data reorganization process.]

6.4 Horizontal Fragmentation

We next measure the impact of horizontal fragmentation on the DBMS's performance. The goal is to compare query processing through logical tiles versus the canonical tuple-at-a-time iterator model [21]. We vary the number of tuples that the FSM storage manager stores in every tile group between 10–10000. We then measure the time it takes to execute the workloads comprising scan queries (Q2) from Section 6.2.

Figure 11a presents the results for the read-only workload on the narrow table. We observe that the execution time drops by 24% as we increase the number of tuples per tile group from 10 to 1000. We attribute this to the reduction in interpretation overhead. On the hybrid workload shown in Figure 11b, the performance gap reduces to 17%. We attribute this to the writes performed by the insert queries (Q1), which are not affected by the horizontal fragmentation.

The results for the wide table are shown in Figures 11c and 11d. On the read-only workload, we see that the coarse-grained fragmentation setting outperforms the fine-grained one by 38%. This is because the logical tiles succinctly represent the data stored in these wide tables, and therefore the benefits of vectorized processing are more pronounced. We observe that the performance difference
