INGESTBASE: A Declarative Data Ingestion System

Alekh Jindal (Microsoft), Jorge-Arnulfo Quiané-Ruiz (QCRI), Samuel Madden (MIT)
[email protected] [email protected] [email protected]

Abstract—Big data applications have fast-arriving data that must be quickly ingested. At the same time, they have specific needs to preprocess and transform the data before it could be put to use. The current practice is to do these preparatory transformations once the data is already ingested; however, this is expensive to run and cumbersome to manage. As a result, there is a need to push data preprocessing down to the ingestion itself. In this paper, we present a declarative data ingestion system, called INGESTBASE, to allow application developers to plan and specify their data ingestion logic in a more systematic manner. We introduce the notion of ingestion plans, analogous to query plans, and present a declarative ingestion language to help developers easily build sophisticated ingestion plans. INGESTBASE provides an extensible ingestion optimizer to rewrite and optimize ingestion plans by applying rules such as operator reordering and pipelining. Finally, the INGESTBASE runtime engine runs the optimized ingestion plan in a distributed and fault-tolerant manner. Later, at query processing time, INGESTBASE supports ingestion-aware data access and interfaces with upstream query processors, such as Hadoop MapReduce and Spark, to post-process the ingested data. We demonstrate through a number of experiments that INGESTBASE: (i) is flexible enough to express a variety of ingestion techniques, (ii) incurs a low ingestion overhead, (iii) provides efficient access to the ingested data, and (iv) has much better performance, up to 6 times, than preparing data as an afterthought, via a query processor.

I. INTRODUCTION

Modern big data applications witness massive amounts of continuously and quickly arriving data.
At the same time, this data needs to be preprocessed and optimized in a specific manner before it could be put to use. Examples of such applications include: (i) data exploration over periodically arriving scientific datasets, e.g., astronomy; (ii) analyzing service logs, e.g., from cloud services, which need to be quickly ingested for real-time debugging; (iii) approximate query processing over fast-arriving social network data, e.g., tweets, to get the latest trends; (iv) quality checking and cleaning commodity data, e.g., news content, before selling it on the data market; and (v) archiving high-velocity telecom data, e.g., phone calls, for security purposes. In all these scenarios, data needs to be consumed as soon as it arrives, and so once it gets ingested there is little room for further preprocessing, which is anyway prohibitively expensive due to the massive data volumes.

As a result, application developers need to carefully design their data ingestion pipelines and push the application-specific data preparation logic down to data ingestion itself. For instance, applying multi-dimensional partitioning to slice and dice the data in different ways for data exploration, or detecting and fixing data quality violations for data market applications, or considering different erasure codes to reduce the storage footprint during data archiving. This is in contrast to the static and hard-coded data ingestion pipelines in traditional databases as well as in big data systems like Hadoop. The Hadoop Distributed File System (HDFS), for instance, chunks input data into fixed-size blocks, replicates each block (three times by default), and stores them on different machines for fault tolerance. Some efforts have tried to add additional steps to this static pipeline, such as indexing [1], co-partitioning [2], and erasure coding [3]. However, each of these forks out a new storage system with one additional feature at a time and does not offer full flexibility to specify arbitrary application-specific ingestion logic.

The current practice to deal with application-specific ingestion needs is to additionally deploy so-called cooking jobs to prepare the data, i.e., use a query processor to run preprocessing jobs once the data is already ingested. However, this forces the users to spend additional time and money, and introduces another dependency before the data could be put to use. Cooking jobs are also hard to share because they contain custom ad-hoc logic not necessarily understood by others, i.e., they lack a formal data ingestion language. In addition, we then have both the ingested and the cooked data at the same time, i.e., data duplication, each stored in a fault-tolerant manner (e.g., replicated). Finally, the cooking jobs end up overloading the compute clusters, even though data ingestion typically runs on separate capacity which is often underutilized. Thus, with cooking jobs, the users end up creating additional data pipelines, which are tedious to build, expensive to run, and cumbersome to manage.

In this paper, we identify data ingestion as an explicit step that needs to be specified and planned in a more systematic manner. We present INGESTBASE, a flexible and declarative data ingestion system to quickly prepare the incoming data for application-specific requirements. At the same time, INGESTBASE hides the ingestion processing complexity from the users, similar to databases hiding the query processing complexity. To do this, INGESTBASE exposes a declarative language interface, the ingestion language, to easily express arbitrary ingestion logic, essentially an operator DAG connecting raw data sources to application-ready data in the storage system. INGESTBASE uses an optimizer to rewrite and compile declarative ingestion statements into an efficient ingestion plan. INGESTBASE has a runtime engine that runs the optimized ingestion plan in a distributed and fault-tolerant manner over a cluster of machines. Finally, INGESTBASE provides a data access kernel to support ingestion-aware query processing via higher-level substrates, such as MapReduce and Spark.

In summary, our key contributions are as follows:
(1) We introduce the notion of ingestion plans, analogous to query plans in relational databases, to specify a sequence of transformations that should be applied to raw data as it is ingested into a storage system. To easily build ingestion plans, we describe a declarative data ingestion language to express complex ingestion logic by composing a variety of ingestion operators and their data flow (Section IV).
(2) We present an extensible, rule-based ingestion optimizer to rewrite and optimize ingestion plans, via techniques such as operator reordering and pipelining. We further describe the INGESTBASE runtime engine, which efficiently runs the optimized ingestion plan in a distributed and fault-tolerant manner. In particular, we show how the system allows users to control the fault-tolerance mechanism for their data based on their ingestion plans (Sections V and VI).
(3) We describe the INGESTBASE support for ingestion-aware data access, i.e., leveraging the ingest processing for efficient data access via upstream query processors. Specifically, we show how our prototype implementation works with two storage-compute combinations, namely HDFS-MapReduce and HDFS-Spark (Sections VII and VIII).
(4) Finally, we present an experimental evaluation over the TPC-H dataset to: (i) show the overhead of INGESTBASE and compare it with plain HDFS upload times, (ii) evaluate the effectiveness of ingest-aware query processing compared to both MapReduce and Hive, (iii) contrast INGESTBASE with ingestion via cooking jobs (using Hive), and (iv) show the fault-tolerance behavior of INGESTBASE (Section IX).

In the remainder of the paper, we first discuss four different case study scenarios to understand the data ingestion pain in modern applications (Section II). Then, we describe an overview of the INGESTBASE architecture (Section III) before presenting our core contributions (Sections IV–IX).

II. THE DATA INGESTION PAIN

Let us first see the data ingestion pain in modern applications and understand the need for a declarative ingestion system. Below we describe four case study scenarios, namely data cleaning, data sampling, data analytics, and data storage, to highlight the preprocessing needs in modern big data applications and motivate a more systematic approach to it.

A. Data Cleaning

Data cleaning is traditionally done as an afterthought, i.e., the data cleaning process starts once a dataset has already been uploaded [4], [5], [6]. This means that users have to apply tedious and time-consuming data cleaning transformations before the data could be put to use. In contrast, cleaning data while ingesting the datasets would speed up the entire cleaning process. Users want to detect the portions of the data that violate their business rules, often an expensive step in the data cleaning process [5], and apply simple repairs. Though data repair may be an iterative process, users could apply one-pass repairs on their datasets. Below, we discuss examples of a few data cleaning operations.

Functional Dependency Checks. Consider the TPC-H lineitem table, which includes one entry per item per order in a business analytics application. This table includes a shipdate field (the date the item shipped) and a linestatus field (whether the item has shipped or not). We may want to enforce a functional dependency (FD) that shipdate determines linestatus, i.e., products shipped on the same date have the same linestatus. This would require partitioning the data on shipdate, iterating over every pair of tuples in each partition, and checking whether or not there is a functional dependency violation. Subsequently, the violating records could be output to a violations file (for further correction) or even discarded.

Denial Constraint Checks. Consider again the TPC-H lineitem dataset. Suppose the user wants to check the following denial constraint (DC): each item sold in quantity less than 3 does not have a discount of more than 9%. This requires scanning the lineitem table, checking each tuple for this denial constraint, and then storing both the violating tuples as well as the original data. A sketch of such a check appears below.
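To make this concrete, here is a minimal Java sketch of the per-tuple DC check; the Lineitem class and its field names are hypothetical stand-ins for the parsed TPC-H schema, not INGESTBASE code.

  // Hypothetical parsed lineitem tuple; only the fields the DC needs.
  class Lineitem {
      final int quantity;
      final double discount; // e.g., 0.09 means 9%
      Lineitem(int quantity, double discount) {
          this.quantity = quantity;
          this.discount = discount;
      }
  }

  class DenialConstraintCheck {
      // DC: an item sold in quantity less than 3 must not have a discount
      // of more than 9%. A tuple violates the DC when both conditions hold.
      static boolean violates(Lineitem t) {
          return t.quantity < 3 && t.discount > 0.09;
      }

      public static void main(String[] args) {
          System.out.println(violates(new Lineitem(2, 0.15))); // true: violation
          System.out.println(violates(new Lineitem(5, 0.15))); // false
      }
  }

Such a predicate can be evaluated once per tuple as the data streams in, which is why the detection cost can be piggy-backed onto the upload itself.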
Single-pass Repair. Besides detecting violations of data quality rules, users may also want to perform single-pass repairs. For example, consider a tax dataset having a country code attribute. In case the country code is not valid, users may want to correct the code using a dictionary (e.g., changing a value "mexico" to its corresponding code "MX"). This would require parsing the country code attribute in the dataset, checking if the code is valid, and looking it up in the dictionary in case the code is invalid. Only the corrected values are finally stored.

B. Data Sampling

Sampling is a common technique to gather quick insights from very large datasets [7]. Samples can be used to quickly evaluate statistical properties of data (e.g., approximate averages or counts in certain subgroups), or to get a representative subset of the data. A key problem in using samples is the process of generating the samples themselves: producing a sample requires an entire pass over the data. Rather, the users want to collect samples as the data is being ingested, with minimal overhead. We discuss a few scenarios below.

Random Sampling. Users may want to create Bernoulli samples by probabilistically replicating some of the tuples in the dataset and collecting them into a separate physical file, i.e., in addition to collecting all tuples into a base file anyway. Likewise, users may also want to create reservoir samples by adding each tuple into a reservoir, removing tuples from it with a given probability, and then finally emitting the reservoir as samples in the end. A sketch of the Bernoulli variant is shown below.
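The following is a minimal Java sketch of Bernoulli sampling piggy-backed on a single ingestion pass; the class and method names are illustrative, not part of the INGESTBASE API.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Random;

  class BernoulliSampler {
      private final double p;                                 // sampling probability
      private final Random rng = new Random(42);              // fixed seed for repeatability
      private final List<String> sample = new ArrayList<>();  // side file for samples

      BernoulliSampler(double p) { this.p = p; }

      // Called once per tuple as it streams through the ingestion plan:
      // every tuple goes to the base file; a p-fraction is also copied
      // into the sample file, so no second pass over the data is needed.
      void ingest(String tuple) {
          if (rng.nextDouble() < p) {
              sample.add(tuple);
          }
          // ... write tuple to the base file as usual ...
      }

      List<String> sample() { return sample; }

      public static void main(String[] args) {
          BernoulliSampler s = new BernoulliSampler(0.01);    // ~1% sample
          for (int i = 0; i < 100_000; i++) s.ingest("tuple-" + i);
          System.out.println("sample size: " + s.sample().size());
      }
  }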
Stratified Sampling. Besides pure random sampling, users may also generate stratified samples, where rare subgroups are over-represented versus common subgroups. For example, in a dataset about people by state, a larger fraction of records from North Dakota might be included than from California, to ensure that enough records about North Dakota are present to achieve a target level of statistical confidence. Such samples are commonly used in databases to produce statistical approximations [7], [8]. This requires partitioning the data on the stratification attribute and randomly picking records from each stratum (partition). The number of records picked from each partition is proportional to the partition size.

C. Data Analytics

Data analytics often requires special data formats for good performance. The typical practice is to either create these formats once the data has already been uploaded to a storage system [9], or to modify the storage system with the application-specific logic [1], [2]. In contrast, the developers would want to simply specify their formats (declaratively) and let the ingestion system take care of creating the appropriate files in the storage system.

Co-partitioning. Users may want to apply custom data partitioning when ingesting data. For example, users can mimic the co-partitioning in Hadoop [2], where two datasets with a common join attribute are chunked together so that they can later be joined efficiently, without data shuffling. Users could further, e.g., first sample the data to evaluate the skew in the join attribute, creating more balanced co-partitions.

Layouts & Indexes. Users may want to plug in alternate data layouts and indexes, e.g., RCFile [10] (the default Hive [11] layout in HDFS), Trojan Layouts [12] (i.e., a different data layout for each data replica), and HAIL [1] (i.e., a different index for each data replica). Creating Trojan Layouts, for instance, would require creating data blocks, replicating each block three times, and serializing each block replica differently (e.g., row, column, and RCFile). Users may also want to create different layouts for different parts of data replicas, i.e., sub-divide the data blocks within a data replica and create a different layout for each of them. Such hybrid replicas improve query robustness, as more queries are likely to see at least some of the data blocks in favorable layouts.

Data Placement. With large data centers, users may want to control how the data is placed in them, e.g., placing hot and cold data blocks differently. This requires looking at the contents of each data block when placing it into a cluster. Such content-based data placement could be further useful for: (i) improving data locality, (ii) isolating concurrent queries to different nodes, and (iii) utilizing a portion of the cluster to save energy or to multiplex resources.

D. Data Storage

Despite the plummeting price of disks, storage space still remains a concern in replicated storage systems with large datasets. Below we describe two scenarios on how users may want to optimize the storage space.

Replicated Storage. Users may want to control both what parts of the data are replicated and how many times. This control becomes crucial when different parts of the data have different relative importance. For example, a user storing weblogs might replicate the most recent logs (hot data) more frequently for higher availability, compared to the massive older logs (cold data). This would require partitioning the data on date (which could be trivial in the case of time series), and then applying the replication selectively.
Erasure-coded Storage. Erasure coding is an alternative to replication for handling failures. The advantage of erasure coding over replication is that it provides the same degree of redundancy as replication at a lower storage overhead (but with a higher access cost in the event of failure). Creating an arbitrary erasure code would require dividing the input blocks into stripes and applying erasure coding to each stripe. As with data layouts and indexes, users may want to use different erasure codes, or a mix of replication and erasure codes, for different portions of the data (e.g., erasure codes for cold data and replication for hot data). The recovery mechanisms, however, should work with both erasure codes and replication.

E. Remarks

We see that the above applications require custom data ingestion logic, i.e., users would like to provide different cleaning rules, sampling techniques, data layouts, and erasure codes. Employing cooking jobs for such applications is tedious, time-consuming, and inefficient. Rather, the users would want to specify how the data should be transformed as it gets ingested, without incurring additional cooking jobs or worrying about the low-level details. Thus, we need a systematic and declarative data ingestion system.

III. INGESTBASE OVERVIEW

The goal of INGESTBASE is to allow developers to easily express and efficiently run arbitrary data ingestion logic. In our earlier works, we demonstrated flexible data upload to HDFS [13], and we showed how it could be used for scalable violation detection [6] and robust data partitioning [14], [15]. This paper describes a full-fledged, declarative data ingestion system that could work with arbitrary storage and query processor substrates. Figure 1 illustrates the architecture of our system.

[Figure 1: IngestBase system architecture: a declarative ingestion language, ingestion plan optimizer, and ingestion runtime engine sit between the input data and query processors (MapReduce, Hive, Pig, Spark), with ingestion-aware data access, predicate pushdown, and fault tolerance.]

The input to be ingested using INGESTBASE is represented as data items, referred to as ingest data items. At the very beginning, the ingest data items are simply the raw input files. However, these could later be broken into smaller ingest data items, such as file chunks or records, for fine-grained ingestion logic, e.g., applying chunk-level replication or detecting null values in each record. Each ingest data item is further associated with a list of labels denoting its lineage during ingestion. Finally, ingestion operators specify the logic to transform the ingest data items, i.e., IngestOp: LID → LID', where LID and LID' are the input and output sets of labelled ingest data items. For example, the single-pass repair from Section II-A would output only a repaired tuple, i.e., SinglePassRepair: t → t_repaired | φ. An ingestion operator follows the iterator model with the following API (a Java rendering is sketched after the list):
• initialize: initialize an operator for the first time.
• setInput: assign the set of input ingest data items.
• hasNext: check whether the next output is available.
• next: get the next output labelled ingest data item.
• finalize: clean up the ingestion operator in the end.
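Rendered in Java, the operator API might look as follows; the paper fixes only the five method names and the labelled-item abstraction, so the exact types here are an assumption (and finalize is renamed finalizeOp to avoid colliding with Object.finalize).

  import java.util.List;

  // A data item plus the labels recording its lineage so far.
  interface LabelledDataItem {
      Object payload();
      List<String> labels();
  }

  // Iterator-model ingestion operator: LID -> LID', per Section III.
  interface IngestOp {
      void initialize();                           // set up the operator for the first time
      void setInput(List<LabelledDataItem> input); // assign the input ingest data items
      boolean hasNext();                           // is another output item available?
      LabelledDataItem next();                     // produce the next labelled output item
      void finalizeOp();                           // cleanup at the end of ingestion
  }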
With ingest data items and ingest operators as the building blocks, INGESTBASE allows users to create arbitrary operator DAGs, called ingestion plans. An ingestion plan can further control the data flow by selectively choosing which ingest data items go to which portions of the DAG. INGESTBASE makes it easier for the users to build sophisticated ingestion plans by providing a declarative ingestion language. The ingestion plan is then optimized via the ingestion optimizer, which chooses to push down or push up the ingestion operators, pipelines the data flow across several ingest operators, and blocks the data flow wherever needed. Finally, the INGESTBASE runtime engine runs the resulting optimized ingestion plan in a distributed and fault-tolerant manner.

IV. DATA INGESTION LANGUAGE

In this section, we describe the declarative ingestion language in INGESTBASE. In contrast to the current practice of using a query processing language to cook the data, to the best of our knowledge, this is the first work to propose primitives for an ingestion language. There are two parts to our ingestion language: (i) the declarative ingestion operators to specify the application-specific data transformation during ingestion, and (ii) the declarative ingestion data flow to control (via the use of labels) which data items flow through different parts of the ingestion plan. We describe these two below.

A. Ingestion Operators

The ingestion operators help address three ingestion needs: what to ingest, where to ingest, and how to ingest, similar to the what, where, and how of data storage proposed in [16]. For a given application, these ingestion needs could be derived using storage optimizer tools [16], [17]. Users can then define what to ingest using a SELECT statement as follows:

s1 = SELECT projection
     FROM LID USING parser
     WHERE filter
     REPLICATE BY replicator;

While the above syntax is very similar to a standard SQL SELECT statement (except for replication and result assignment), the projection, parser, filter, and replicator can also be provided as custom ingest operators. For instance, we may project machine learning features for each tuple or replicate the ingest data items probabilistically. On compilation, the ingestion operators in the SELECT statement are chained as follows:

LID → parser → filter → projection → replicator

For a SELECT statement to be valid, the output and input ingest data items of consecutive ingest operators must match, i.e., they should have the same granularity and the same schema.
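As a sketch of what this compilation could produce, the following Java snippet drives a list of simplified operators in SELECT order; the Op type is a hypothetical, batch-at-a-time stand-in for the full IngestOp interface sketched above.

  import java.util.ArrayList;
  import java.util.List;
  import java.util.function.Function;

  class SelectChain {
      // Simplified operator: a batch-to-batch transformation over string tuples.
      interface Op extends Function<List<String>, List<String>> {}

      // Chain operators in SELECT order: parser, filter, projection, replicator.
      static List<String> run(List<String> rawItems, List<Op> chain) {
          List<String> current = rawItems;
          for (Op op : chain) {
              current = op.apply(current); // output of one operator feeds the next
          }
          return current;
      }

      public static void main(String[] args) {
          Op parser = in -> in;                                        // identity parse
          Op filter = in -> in.stream().filter(t -> !t.isEmpty()).toList();
          Op projection = in -> in.stream().map(t -> t.split(",")[0]).toList();
          Op replicator = in -> {                                      // 2x replication
              List<String> out = new ArrayList<>(in);
              out.addAll(in);
              return out;
          };
          System.out.println(run(List.of("a,1", "b,2", ""),
                                 List.of(parser, filter, projection, replicator)));
          // -> [a, b, a, b]
      }
  }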
Next, we show the FORMAT statement to describe how to ingest the data:

s2 = FORMAT s1
     PARTITION BY partition
     CHUNK BY chunk
     ORDER BY order
     SERIALIZE AS serializer;

Operators in the above FORMAT statement are chained in the order in which they appear, i.e., the chaining in s2 is partition → chunk → order → serialize. Users can create alternate chains by changing the order of these operators; e.g., ordering before chunking will create a global sort order as opposed to the per-chunk sort order in s2. The operators in a FORMAT statement could also appear multiple times; e.g., users could apply multi-level partitioning (across and within chunks) as follows:

s3 = FORMAT s1
     PARTITION BY top-level-partition
     CHUNK BY chunk
     PARTITION BY intra-chunk-partition
     ORDER BY order
     SERIALIZE AS serializer;

Finally, where to ingest is specified using the following STORE statement:

s4 = STORE s3
     LOCATE USING locator
     UPLOAD TO target;

The locator operator specifies which ingest data items must be co-located (or anti-located), while the target operator specifies the final storage substrate. Note that target only points to the registered storage location; the actual binding of INGESTBASE with the storage system is a bit more involved, as described in Section VIII.

B. Ingestion Dataflow

In the previous section, we saw the declarative statements for specifying and chaining ingestion operators. Our ingestion language further allows the users to control the ingestion data flow, i.e., to selectively feed different portions of the ingest data items to different ingest operators. To do so, we define a data flow stage as a set of ingest operators operating on a set of ingest data items. Recall that the ingest data items have an associated set of labels denoting the transformations applied to them so far. We use these labels to filter the relevant data items for each stage:

CREATE STAGE a USING s1,s2,..,sm
WHERE l_op1=v1,l_op2=v2,..,l_opn=vn

In the above, we define a stage a with the ingest operators in s1–sm (which could be any of the SELECT, FORMAT, and STORE statements), operating on ingest data items that have labels l_opi = vi. For example, consider ingesting hourly data, and assume that the parser operator assigns the file creation timestamp as the label for each ingest data item; the following stage ingests only the last hour of data each time:

s1 = SELECT * FROM input;
CREATE STAGE a USING s1
WHERE l_parser > now-1;

Multiple stages could be chained to each other using the CHAIN STAGE statement as shown below:

CHAIN STAGE b TO a1,a2,..,ak
USING s1,s2,..,sm
WHERE l_op1=v1,l_op2=v2,..,l_opn=vn

Note that the above statement performs a union all on the outputs from stages a1,a2,..,ak before feeding them to stage b.

By defining stages on top of the ingestion operators, users can selectively process different ingest data items in different parts of the ingestion plan. Such selective ingestion capability is useful for: (i) handling heterogeneous data, where different portions of the data have different characteristics and hence need different data ingestion logic; (ii) supporting multiple workload types, e.g., graph and relational analytics, each requiring the data to be shoehorned differently; and (iii) reducing the risk of picking the wrong ingestion logic, e.g., due to changes in the workload, by applying multiple logics in the first place.

C. Example: Log Analytics

Let us now illustrate our language via an example. Consider a log analytics scenario where large volumes of logs are collected from a cloud service. These logs need to be ingested quickly with a low overhead. Later, in case of any problems with the cloud service, e.g., disruption or slow performance, the service administrators need to quickly search the relevant log lines. Each log line contains a combination of structured (e.g., timestamp, machine name) and unstructured (e.g., the error stack, manual user commands) data items.

For such an application, developers may create the following ingestion logic: create three data replicas and apply a different set of operators to each of them; the first two of the three replicas differ only in their layout (sorted row and RCFile), while the third replica uses logical partitioning in addition to the physical partitioning. As a result, the first two replicas are suitable for selection and projection queries, while the third replica is suitable for join and aggregation queries.
The ingestion statements for these are as follows:

s1 = SELECT * FROM input USING parser REPLICATE BY 2;
s2 = SELECT * FROM s1 REPLICATE BY 2;
s3 = FORMAT s2 CHUNK BY 100mbBlocks;
s4 = FORMAT s3 SERIALIZE AS sortedRow;
s5 = FORMAT s3 SERIALIZE AS rcFile;
s6 = FORMAT s1 PARTITION BY hash CHUNK BY 100mbBlocks
     SERIALIZE AS pax;
s7 = STORE s4,s5 LOCATE USING disjointLocator;
s8 = STORE s6 LOCATE USING randomLocator;
s9 = STORE s7,s8 UPLOAD TO hdfsStorage;

The corresponding ingestion data flow is described as follows:

CREATE STAGE a USING s1;
CHAIN STAGE b TO a USING s2,s3 WHERE l_replicate1=1;
CHAIN STAGE c TO a USING s6,s8 WHERE l_replicate1=2;
CHAIN STAGE d TO b USING s4 WHERE l_replicate2=1;
CHAIN STAGE e TO b USING s5 WHERE l_replicate2=2;
CHAIN STAGE f TO d,e USING s7;
CHAIN STAGE g TO c,f USING s9;

Finally, Figure 2(a) depicts the resulting log ingestion logic, as described above.

[Figure 2: Illustrating an ingestion plan and its optimization: (a) example log ingestion plan, (b) operator reordering, (c) operator pipelining.]

As also noted in [16], we see that thinking in terms of what, where, and how in INGESTBASE makes it more intuitive to reason about arbitrary data ingestion operations. Also, the data flow primitives in INGESTBASE allow users to easily control and selectively process the ingest data items. Together, the declarative ingestion operators and the data flow primitives allow users to quickly stitch together sophisticated ingestion plans for their applications.

V. INGESTION OPTIMIZER

INGESTBASE takes the declarative ingestion statements and compiles them into a DAG, the ingestion plan, as shown in Figure 2(a). The ingestion optimizer takes this DAG and emits an optimized ingestion plan. To do so, the optimizer supports rule-based tree transformations to identify subtree patterns and transform them into alternate subtrees. An ingestion plan subtree is represented as an ingestion operator expression, which consists of the root operator and its descendants (recursively). An optimizer rule (for tree transformations) operates on the ingestion operator expression via the following two methods:

check: IngestOpExpr → true/false
apply: IngestOpExpr → IngestOpExpr'

The check method verifies whether a rule is applicable to an ingestion operator expression, and the apply method produces the modified ingestion operator expression.
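The following Java sketch shows the shape of such a rule, using a replicate push-up as the example; the IngestOpExpr representation is hypothetical, as the paper only fixes the check/apply signatures.

  import java.util.List;

  class OptimizerRuleSketch {
      // An ingestion operator expression: a root operator name plus children.
      record IngestOpExpr(String op, List<IngestOpExpr> children) {}

      interface Rule {
          boolean check(IngestOpExpr expr);      // is the rule applicable here?
          IngestOpExpr apply(IngestOpExpr expr); // the rewritten expression
      }

      // Example rule: if a filter consumes the output of a replicate, swap them
      // so that filtering happens once, before the data is replicated.
      static final Rule PUSH_UP_REPLICATE = new Rule() {
          public boolean check(IngestOpExpr e) {
              return e.op().equals("filter")
                  && e.children().size() == 1
                  && e.children().get(0).op().equals("replicate");
          }
          public IngestOpExpr apply(IngestOpExpr e) {
              IngestOpExpr replicate = e.children().get(0);
              IngestOpExpr filter = new IngestOpExpr("filter", replicate.children());
              return new IngestOpExpr("replicate", List.of(filter));
          }
      };

      public static void main(String[] args) {
          // filter(replicate(parse(input)))  =>  replicate(filter(parse(input)))
          IngestOpExpr plan = new IngestOpExpr("filter",
              List.of(new IngestOpExpr("replicate",
                  List.of(new IngestOpExpr("parse", List.of())))));
          if (PUSH_UP_REPLICATE.check(plan)) plan = PUSH_UP_REPLICATE.apply(plan);
          System.out.println(plan);
      }
  }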
The optimizer performs a preorder traversal over the ingestion DAG and fires matching rules wherever applicable, i.e., larger subtrees are matched for relevant rules first. The rules are matched in the same sequence as provided in the ordered rule set, and they are applied iteratively until none of the rules match any of the ingestion operator expressions in the tree. We now describe two rules, namely operator reordering and pipelining, to reduce the data volume and the materialization cost, respectively.

Operator Reordering. This rule rearranges ingestion operators in order to reduce the data volume in flight, i.e., it pushes down data-reducing operators, e.g., filter, while pushing up data-expanding operators, e.g., replicate. In order to preserve the semantics of the ingestion plan, we only rearrange the ingestion operators within the same data flow stage, i.e., where there is no conditional processing of the data items involved. One instance of this rule could push the replicate operator to the very end of the stage, i.e., replicate data as late as possible, as shown in stage b of Figure 2(b). Another instance could swap the filter and projection operators depending on which provides more data reduction, i.e., whether we reduce the data volume more by filtering the rows or by filtering the columns. Thus, operator reordering rules could be useful in reducing the data traffic while executing the ingestion plans.

Operator Pipelining. By default, all output ingest data items are collected (i.e., materialized) from an ingest operator before being fed to the next one. Internally, this is done by adding a materialize operator after each ingest operator. An obvious optimization is to pipeline the data items between operators as much as possible and materialize only when really needed. The operator pipelining rule removes materialization between operators that process ingest data items of the same granularity (detected by looking at the data types). We materialize only when the granularity of the ingest data item changes, e.g., from tuples to blocks. To illustrate, Figure 2(c) shows the log ingestion plan from Section IV-C with operator pipelining. We can see that stages a–g of the plan, as shown in Figure 2(a), have been transformed into five pipelined blocks 1–5 in Figure 2(c). Other instances of the operator pipelining rule could consider materializing long pipelines in between, for fault tolerance or for early access to the incoming data.

Thus, we see that the ingestion optimizer provides an extensible way to transform and optimize the ingestion DAGs.

VI. INGESTBASE RUNTIME ENGINE

Recall that modern big data applications need to ingest the incoming data quickly and with low overhead. As a result, it is critical to have an efficient runtime engine for these applications. In this section, we describe the INGESTBASE runtime engine, which (i) runs an ingestion plan in parallel on a cluster of machines, (ii) efficiently handles distributed data I/O during ingestion, and (iii) handles fault tolerance both during and after ingestion. We describe each of these below.

A. Parallel Ingestion

Given an ingestion plan and a cluster of machines, the INGESTBASE runtime engine exploits two kinds of parallelism: inter-node and intra-node parallelism. We describe these below.

Inter-node Parallelism. When a user submits an ingestion plan on one of the nodes (the client) for execution, the INGESTBASE runtime engine copies the resulting optimized plan to all nodes (specified via a slaves configuration file) in the cluster and executes it over the local data on each node. This makes sense because the raw data is typically generated on multiple nodes in the first place, e.g., log data, and it is cumbersome to bring all of this data to a single node. Therefore, instead of bringing the data to the ingestion plan, we ship the plan to the data itself. This is similar to shipping query plans in distributed query processing. The INGESTBASE runtime engine launches remote shells to start the ingestion plans on all nodes in parallel and waits for them to finish before it terminates.

Intra-node Parallelism. Besides parallelizing the ingestion process across different nodes, the INGESTBASE runtime engine also parallelizes parts of the ingestion plan across different threads on the same node. For example, the serialize operator is CPU-bound, and so the INGESTBASE runtime engine forks several operator instances (as many as the number of cores by default) at the same time, each serializing a different subset of ingest data items. Likewise, the INGESTBASE runtime engine transforms different replicas of ingest data items in different threads. To support such multi-threaded parallelism, the ingest operator implementation has a parallel mode, in addition to the default serial mode, to process input ingest data items using a thread pool. These threads are later synchronized in the finalize method of the ingestion operator, as sketched below. The parallel mode is turned on by default for CPU-heavy operators such as serialize. However, users could provide additional optimizer rules to control the serial/parallel modes.
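A minimal Java sketch of this parallel mode follows, fanning block serialization across a thread pool sized to the number of cores and synchronizing at the end; all names are illustrative.

  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  class ParallelSerialize {
      public static void main(String[] args) throws InterruptedException {
          List<String> blocks = List.of("block-1", "block-2", "block-3", "block-4");
          // As many workers as cores, mirroring the default described above.
          ExecutorService pool =
              Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
          for (String block : blocks) {
              pool.submit(() -> serialize(block)); // each block serialized in parallel
          }
          pool.shutdown();                            // no more tasks
          pool.awaitTermination(1, TimeUnit.MINUTES); // the "finalize" barrier
          System.out.println("all blocks serialized");
      }

      static void serialize(String block) {
          // A CPU-heavy layout conversion (e.g., row to PAX) would happen here.
          System.out.println(Thread.currentThread().getName() + " serialized " + block);
      }
  }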
Parallel ingestion allows INGESTBASE to significantly reduce the overhead of transforming the data. We demonstrate this experimentally in Section IX.

B. Efficient Distributed I/O

In the previous section, we described how we can parallelize the ingestion plan and process data locally on each node. However, several ingestion plans require moving data around. In this section, we describe how the INGESTBASE runtime engine handles distributed I/O efficiently. Below we describe the three major data movement scenarios, namely shuffling, placement, and replication.

Shuffling. An ingestion plan may require shuffling intermediate ingest data items in order to produce the final data items. For example, to gather stratified samples, we need to group the entire dataset across all nodes and then pick samples from each group. (Users could also do per-node stratified sampling to compute the samples from the local stratum on each node.) The INGESTBASE runtime engine handles this using a distributed file system, by first creating local groups on each node and then copying them to the distributed file system in parallel. While copying, the data is organized into directories, one for each group, such that data belonging to the same group is in the same directory. Finally, each node reads back and processes the group directories, one at a time, from the distributed file system. Essentially, the INGESTBASE runtime engine leverages the remote data access mechanism of the distributed file system to shuffle data across nodes.

Placement. INGESTBASE allows users to reason about data placement at a logical level, i.e., using the locator operator to map each ingest data item to a location ID, without getting into the low-level data placement policies. As a result, users can easily make data placement decisions, such as which portions of the data should reside on which nodes, or which data items should be co-located and which data items should not be co-located. To enforce these decisions, the INGESTBASE runtime engine simply looks at the location ID of each ingest data item, e.g., a data block, and copies items with the same location ID to the same node in the cluster. The mapping from location IDs to nodes can either be provided by the user, or the runtime self-assigns the location IDs to nodes in the same order (in a round-robin manner) as they appear in the slaves file, as sketched below.
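A minimal Java sketch of this round-robin self-assignment, under the assumption that location IDs are integers and slaves are host names:

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  class RoundRobinLocator {
      // Self-assign location IDs to nodes in the order they appear in the
      // slaves file, wrapping around (round robin) when IDs outnumber nodes.
      static Map<Integer, String> assign(List<Integer> locationIds, List<String> slaves) {
          Map<Integer, String> placement = new HashMap<>();
          int next = 0;
          for (int id : locationIds) {
              placement.put(id, slaves.get(next % slaves.size()));
              next++;
          }
          return placement;
      }

      public static void main(String[] args) {
          System.out.println(assign(List.of(0, 1, 2, 3, 4),
                                    List.of("node-a", "node-b", "node-c")));
          // {0=node-a, 1=node-b, 2=node-c, 3=node-a, 4=node-b}
      }
  }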
Replication. Replication is usually done for fault tolerance, and it typically involves moving each replica to a different node. In contrast, INGESTBASE completely decouples data replication and placement, and allows users to make independent decisions about the two. As a result, users can choose to replicate data at different granularities and/or may not place the replicas on different nodes. For example, users may choose to replicate some rows (which could be seen as samples) in each data block and store them along with the data block on the same node, i.e., no additional data movement is needed.

C. Fault Tolerance

In this section, we describe the fault-tolerance mechanisms in the INGESTBASE runtime engine to handle failures both during and after the data ingestion.

1) Handling In-flight Failures: The INGESTBASE runtime engine can handle two types of failures while running the ingestion plan.

Ingestion Operator Failure. In case an ingestion operator fails, we need to re-run all pipelined operators that appear before the failed one. However, instead of restarting the ingestion plan from scratch, we can resume ingestion from the previous block of pipelined operators. This is because ingest data items are fully materialized after every pipelined operator block, and therefore each such block serves as a checkpoint. In case of repeated failures (3 times by default) of the same operator (the INGESTBASE runtime engine detects recurring failures by keeping track of the execution status of each ingestion operator, i.e., whether or not it passed the finalize method successfully), the runtime engine replaces the failing operator with a dummy pass-through operator. The dummy operator simply returns the input ingest data items and assigns each item a label of "−1" to denote the failure. Application developers can further control the recovery time (e.g., in case they expect more failures) by adding custom operator pipelining rules to force more frequent materialization, as discussed in Section V.

Node Failure. In case one of the nodes in the cluster fails, we simply reschedule the ingestion plan from the failing node to other nodes. However, this still requires the data on the failed node to be available remotely (in case that node is used as a data node as well). For node failures during data shuffling, we check which of the group directories in the distributed file system are corrupt and we copy them again, assuming that the distributed file system still works with one less node. To handle data placement, we reassign the location ID of the failed node to the next node (in the slaves file) in the round-robin sequence.

2) Handling Post-ingestion Failures: Given that data is ingested with custom application-specific logic, it may also need custom fault-tolerance logic. INGESTBASE allows users to control the fault-tolerance mechanism for their data based on their ingestion plans. To do so, INGESTBASE provides two fault-tolerance UDFs to define how to detect and recover failing data items (typically data blocks):

detect: f → {r1, r2, .., rn}
recover: {B_r1, B_r2, .., B_rn} → B_f

Here f is the failing block id, {r1, r2, .., rn} are the recovery block ids, and B_i is the corresponding block. The above UDFs essentially address two key questions: (1) Which data blocks are needed to recover a failed data block? (2) How to reconstruct a failed data block from the recovery data blocks?

As soon as it finishes executing the ingestion plan, the INGESTBASE runtime engine launches a fault-tolerance daemon that polls the storage system for failing data blocks (the detect) and invokes the user-supplied recovery UDFs for every failed block detected (the recover). INGESTBASE maintains a catalog of detect and recover UDFs for each ingestion plan. Figure 3 depicts this post-ingestion fault-tolerance mechanism in INGESTBASE.

[Figure 3: Post-ingestion fault tolerance in INGESTBASE: a detector UDF finds failed or corrupted blocks, and a recovery UDF rebuilds them from the recovery block paths kept in the implementation catalog.]
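In Java, the two UDFs might be rendered as the following sketch; the block and identifier types are assumptions, and the replication-based variant is shown only in outline.

  import java.util.List;

  // Sketch of the post-ingestion fault-tolerance UDFs from Section VI-C.
  interface FaultToleranceUdf {
      // detect: given a failed block id f, name the recovery blocks r1..rn.
      List<String> detect(String failedBlockId);

      // recover: given the recovery blocks Br1..Brn, rebuild block Bf.
      byte[] recover(List<byte[]> recoveryBlocks);
  }

  // Replication-based recovery: the recovery "set" is just one healthy replica.
  class ReplicationRecovery implements FaultToleranceUdf {
      public List<String> detect(String failedBlockId) {
          // Look up a surviving replica of the failed block (a catalog lookup).
          return List.of(failedBlockId + ".replica");
      }
      public byte[] recover(List<byte[]> recoveryBlocks) {
          return recoveryBlocks.get(0); // copy the replica bit-for-bit
      }
  }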
We built three implementations of the detect and recover UDFs using the above architecture:

Replication based. This fault-tolerance mechanism looks for a replica of the failed data block and increases the replication factor of that replica by 1. The block placement policy takes care of storing the new replica on a different node.

Transformation based. This recovery mechanism is for data block replicas that are not bitwise identical, i.e., they are serialized differently. This mechanism copies and transforms a data block replica so that it has the same serialization as the failed data block.

Erasure coding based. This recovery mechanism is for erasure-coded, instead of replicated, data blocks. It first fetches all data blocks in the same stripe and then reconstructs the missing data block. The reconstructed data block is stored back to HDFS.

Thus, we see that INGESTBASE users can: (i) inject custom fault-tolerance logic for their application-specific needs, e.g., heterogeneous replication [1], [12]; (ii) change the fault tolerance over time as the application needs evolve, e.g., migrating from replication to erasure coding [3]; and (iii) have different fault-tolerance mechanisms for different ingestion plans, i.e., the fault-tolerance mechanism is not tied to the storage system anymore.

VII. INGESTION-AWARE DATA ACCESS

INGESTBASE allows users to apply ad-hoc data transformations while ingesting their datasets. However, the system also needs to keep track of these transformations in order to leverage them later for query processing. Essentially, we need to track three pieces of information: (i) which ingestion operators were used to preprocess the dataset; (ii) how the ingestion operators were composed; and (iii) the operator lineage and the transformation applied to each output data item. For (i) and (ii), we simply serialize the ingestion plan in the storage system. Note that we do not serialize the operator instances; rather, we store the instance parameters and re-instantiate the operators whenever needed. For (iii), we make use of the labels assigned to the ingest data items, as described below.

Recall that each ingestion operator assigns a label to every data item that it processes. One could imagine storing all such labels for every ingest data item. However, this would result in a huge amount of metadata. Instead, the INGESTBASE runtime engine collects the labels common to all data items that are materialized together and preserves them as the name of the physical file. Thus, each physical file in INGESTBASE is named as follows: label1_label2_label3_label4_..._labeln. The labels in the filename have the same relative sequence as the corresponding operators in the ingestion plan. Thus, the filename of a physical file in INGESTBASE acts as a signature, or the lineage, of the preprocessing applied to it. For example, the name of a physical file produced by the ingestion plan for log analytics (Figure 2(a)) might be: parseID_replicaID_hashID_fileID_paxID_locationID_uploadID. As a result of these label-encoded filenames, INGESTBASE does not need to maintain any additional metadata files.
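A small Java sketch of composing and reading back such label-encoded filenames; the underscore separator is an assumption about the naming scheme:

  import java.util.Arrays;
  import java.util.List;

  class LabelledFilenames {
      // Physical files are named label1_label2_..._labeln, with labels in
      // the same order as the corresponding operators in the ingestion plan.
      static String fileName(List<String> labels) {
          return String.join("_", labels);
      }

      static List<String> labelsOf(String fileName) {
          return Arrays.asList(fileName.split("_"));
      }

      public static void main(String[] args) {
          String name = fileName(List.of("parseID", "replicaID", "hashID",
                                         "fileID", "paxID", "locationID", "uploadID"));
          System.out.println(name);
          // A filter can now select files by the label at a given position,
          // e.g., the replica label of this file:
          System.out.println(labelsOf(name).get(1));
      }
  }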
Once the data is ingested using INGESTBASE, users want to access it from their applications. INGESTBASE provides ingestion-aware access methods to query the data from arbitrary query processors. Again, the INGESTBASE access methods address three key questions: (i) what to access, (ii) where to access, and (iii) how to access. We describe these three below.

What to access? INGESTBASE allows developers to retrieve a subset of a dataset, based on the labels applied to the ingested data items. To do so, INGESTBASE provides two filter operators: one that filters data replicas and one that filters data blocks in a particular replica:

filterReplica (IngestOp filterOperator, Label operatorLabel)
filterBlock (IngestOp filterOperator, Label operatorLabel)

As an example, consider a sampling ingest operator that labels every data item as either 1 (denoting a sample) or 0 (denoting an original data item). Also assume that the ingestion plan physically partitions the sampled and original ingest data items into different physical files. To access only the samples, we can use filterReplica to filter the files that have label 1. This narrows down data access to only the relevant portions of the data.

Where to access? In addition to filtering, INGESTBASE allows developers to define: (i) the data access parallelism, by setting the number of tasks to run in parallel; and (ii) the amount of data each computation task has to read. This is done by assigning data blocks to computation tasks. The INGESTBASE API allows key-based splitting as well as co-splitting two or more datasets on their respective keys:

splitByKey (Key key [, Int maxSplitSize])
coSplitByKey (Key key1, Dataset d2, Key key2, ..)

For example, if the ingestion plan partitioned the data on an attribute into ranges, developers can distribute different range partitions to different machines to increase data access parallelism, or to the same machines to improve data locality.

How to access? Finally, INGESTBASE allows developers to deserialize the retrieved blocks and apply further selection/projection predicates while reading them:

deserialize(Projection p, Selection s)

Note that the actual deserialization depends on the serialization operator in the ingestion plan. The built-in INGESTBASE library provides deserialize operators for all of the serialize operators it provides (PAX, RCFile, SortedFile, ColumnGroup, etc.). These implementations take into account the selection/projection predicates while deserializing the data. For example, they may deserialize only the projected attributes (in case of a column layout), or perform index access (in case the data is sorted).

Ingestion-aware data access pushes down one or more query predicates before producing the input for the upstream query processor. The following section describes how one could use these access methods with two popular query processing engines, namely Hadoop MapReduce and Spark.
VIII. INTEGRATING INGESTBASE

We now describe how INGESTBASE works with two different combinations of storage and compute substrates.

A. HDFS & MapReduce

Let us first look at how INGESTBASE interacts with HDFS. First of all, INGESTBASE needs to map the ingest data items to physical HDFS files, i.e., collect the output ingest data items from an ingestion plan and store them in HDFS. To do so, the last transformation in an ingestion plan must be upload. If the ingestion plan contains a physical partitioner, the upload operator maps each physical partition to an HDFS file. Otherwise, it collects all data items into a single HDFS file. INGESTBASE further controls several storage decisions for the HDFS files it creates. For instance, it can replicate each physical file if the ingestion plan already contains a replication operator, split files into subfiles and choose a different replication for each subfile, or let HDFS do the standard 3x replication. Likewise, INGESTBASE controls data placement by assigning location IDs to physical partitions and mapping each location ID to a particular data node, via a custom data placement policy. Similarly, the upload operator could pipeline the data items produced by an ingestion plan directly to HDFS files, without first collecting them on local disk; it can also bulk load the data items to HDFS files. Finally, INGESTBASE can manipulate the fault-tolerance mechanism, e.g., transform data layouts when recovering failed blocks (Section VI-C). Thus, even though INGESTBASE sits on top of HDFS, it could be tightly integrated with the storage decisions in HDFS.

For MapReduce query processing, we bake the INGESTBASE access methods into Hadoop InputFormats. The InputFormat allows users to specify a path filter to filter the input based on the HDFS file path, a splitter to split the data logically, and a record reader to actually read the data. We implemented custom functionality for these three methods in order to realize the INGESTBASE access methods in Hadoop. For example, to implement filterReplica, we created a path filter which retrieves all physical files having a particular label in their filename (recall that we persist the labels in the filenames). We also implemented additional helper methods, e.g., filterReplicaById, filterReplicaByPartitioning, and filterReplicaByLayout, for ease of programming.
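A sketch of such a path filter against the real Hadoop PathFilter interface; the label-matching logic (and passing the label via the constructor) is illustrative.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.fs.PathFilter;

  // Sketch of a filterReplica-style path filter: accept only the physical
  // files that carry a given label in their label-encoded filename.
  public class LabelPathFilter implements PathFilter {
      private final String label;

      // In a real job the label would likely come from the job Configuration,
      // since Hadoop instantiates input path filters reflectively.
      public LabelPathFilter(String label) {
          this.label = label;
      }

      @Override
      public boolean accept(Path path) {
          // Filenames are label1_label2_..._labeln (Section VII), so a token
          // match on the name selects the files of one replica or sample.
          for (String token : path.getName().split("_")) {
              if (token.equals(label)) return true;
          }
          return false;
      }
  }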
To illustrate, Figure 4 shows how the INGESTBASE access methods can be used to run TPC-H Q3 (which consists of two joins and a GROUP BY) in a single MapReduce job, in contrast to two jobs in standard Hadoop. This is possible because INGESTBASE co-groups all three TPC-H relations. Note that the output of the INGESTBASE access methods is fed to standard map/reduce data flows. Thus, in addition to allowing users to easily preprocess and transform their datasets, the INGESTBASE access methods also allow developers to quickly build efficient query processors.

[Figure 4: Ingestion-aware data access for TPC-H Q3 using Hadoop MapReduce: path and replica-ID filters plus a co-group input format over lineitem, orders, and customer feed index access, record readers, and post-filtering into hash-join mappers and an aggregation reducer, answering the query in a single job.]

B. HDFS & Spark

Spark runs over HDFS, and it uses the same Hadoop InputFormats to read data from HDFS. As a result, we can easily run Spark jobs on top of INGESTBASE-ingested data. To illustrate, Listing 1 shows the data access for group-wise analytics over sampled data in Spark:

public JavaRDDLike<?,?> groupwiseAnalytics(String ingestFilepath) {
  JavaSparkContext ctx = new JavaSparkContext(SPARK_MASTER, "myJob",
      SPARK_HOME, SPARK_JAR);
  // IngestBase data access
  IngestBaseDataset d = new IngestBaseDataset(ctx, ingestFilepath);
  d.filterBlock(SamplingOperator, SAMPLE_ID);
  d.filterReplicaByPartitioning(PARTKEY);
  d.splitByPartitionKey(PARTKEY);
  d.deserializeProject(PARTKEY, SUPPKEY);
  // standard Spark transformations
  return d.RDD()
      .map(new GroupbyKeyMap())
      .reduceByKey(new GroupbyKeyReduce());
}

Listing 1. Ingestion-aware data access for group-wise analytics using Spark.

This data access plan selects the replica of the sampled data which is partitioned on PARTKEY, co-locates (splits) the values of PARTKEY, and projects the PARTKEY and SUPPKEY attributes. Finally, we get an RDD from the ingested dataset and can apply standard Spark transformations over it. We see that using the INGESTBASE data access plans, developers can easily narrow down their analysis to the most relevant portions of the data, without dealing with the actual physical data representation used to store the data.

IX. EXPERIMENTS

We ran a number of experiments to evaluate INGESTBASE. Our goal was to answer two key questions: (i) how efficiently does INGESTBASE allow users to perform data transformations? and (ii) is transform-as-you-upload in INGESTBASE better than other possible transformation approaches? To evaluate these questions, we ran INGESTBASE ingestion plans for the four different data ingestion scenarios described in Section II. For all experiments, we measure unmodified HDFS data upload times as the default baseline, and Hive (a widely used SQL-based database that runs on Hadoop MapReduce and HDFS) as an additional baseline wherever possible. Additionally, we evaluate the data access times of INGESTBASE for several common relational operations. All experiments were done on a cluster of 10 nodes. Each node has a 32-core 1.07 GHz Xeon running Ubuntu 12.04, 256 GB of main memory, and 11 TB of disk storage. We experimented with TPC-H data at scale factor 1000 (1 TB in total), and generated the data on all 10 nodes in parallel. We ran INGESTBASE on top of Hadoop 2.0.6-alpha and used Hive 0.13.1.
A. Data Ingestion Scenarios

In this section, we describe the performance of INGESTBASE on the four ingestion scenarios described in Section II. Figure 5 reports the resulting ingestion overheads.

[Figure 5: Ingestion runtime engine overhead of INGESTBASE for different applications (ingestion time in seconds, INGESTBASE plans vs. standard HDFS upload without any preprocessing): (a) data cleaning, (b) data sampling, (c) data analytics, (d) data storage.]

1) Data Cleaning: We start by evaluating INGESTBASE on data cleaning operations, when uploading the TPC-H lineitem table into HDFS.

Setup. We consider the data quality rules described in Section II-A: (i) a functional dependency (FD) stating that any two tuples having the same ship_date must have the same line_status; and (ii) a denial constraint (DC) stating that any tuple having quantity smaller than 3 must have a discount smaller than 9%. We measure the runtime of three different transformations: (i) detecting FD violations, (ii) detecting DC violations, and (iii) repairing (in addition to detecting) DC violations.

Discussion. Figure 5(a) shows the results. We observe that detecting violations of the DC rule while ingesting data incurs only a 25% overhead over standard HDFS. This is because the detection process is simply piggy-backed onto the process of uploading the data into main memory. This is also why DC repair incurs almost no extra overhead. However, this overhead increases when the data quality rules require more complex data transformations. For instance, the ingestion plan to detect FD violations takes double the standard HDFS upload time. This is because the FD requires grouping the entire dataset on ship_date, which results in shuffling the data across all nodes. Still, detecting violations when ingesting datasets (using INGESTBASE) is much better than detecting violations after a dataset is ingested (using, e.g., Hive), as we shall see in Section IX-B.

2) Data Sampling: We now look at the performance of INGESTBASE for computing samples during data ingest.

Setup. We consider five different sampling techniques: Bernoulli, simple random, systematic random, local stratified, and global stratified. Here, local stratified sampling collects samples from the local strata on each node, whereas global stratified sampling collects samples from the global strata.

Discussion. Figure 5(b) shows the results of these experiments. We observe that INGESTBASE has a very small overhead (less than 10%) for all methods except global stratified sampling. This small overhead reflects the time to write the data samples to disk. In the case of global stratified sampling, the INGESTBASE upload time is nearly twice the HDFS upload time. The reason is the same as for the DC rule in the previous experiments: global stratified sampling requires shuffling the entire dataset across all nodes to collect samples from each subgroup. However, these experiments show that most types of sampling can be done efficiently, as the data is being ingested, with little additional overhead (no additional passes over the entire dataset).
3) Data Analytics: In this section, we analyze the INGESTBASE ingest times when preparing datasets for different data analytics tasks.

Setup. We create a different layout for each replica (using the scheme from the Trojan Layouts [12] paper). We denote this scheme as Per-replica Layouts. It works by creating binary row, PAX, and compressed PAX layouts for the three replicas, respectively. In addition, we implement three new data storage schemes, as described in Section II-C: (i) Hybrid Replicas, which store subsets of the blocks of a replica in different layouts; (ii) Content-based Partitioning, which chunks the data based on its content instead of physical size; and (iii) Content-based Placement, which places data blocks based on their content. For Hybrid Replicas, we create the same three layouts as in Per-replica Layouts and we let HDFS handle the replication. For the other two schemes, we use a logical partitioner to generate 10 range partitions (for Content-based Partitioning) and place all data blocks having the same range on the same data node (for Content-based Placement).

Discussion. Figure 5(c) illustrates the results, which overall confirm the trend we observed in the previous section: the overhead is directly proportional to the time spent by INGESTBASE in transferring or processing data. We observe that when creating a different data layout per data replica (Per-replica Layouts), the INGESTBASE ingest time is nearly double the HDFS upload time. This is mainly because INGESTBASE has to deal with data replication outside HDFS. When INGESTBASE pushes data replication to HDFS itself (as in Hybrid Replicas, Content-based Partitioning, and Content-based Placement), we see a decrease in the INGESTBASE overhead. In particular, we observe that for the Content-based Partitioning and Content-based Placement schemes, which are less CPU-demanding than the other two schemes, the overhead decreases even more, to just over 20%. These results show the efficiency of INGESTBASE at applying arbitrary data transformations at ingest time, without any modifications to HDFS.

4) Storage Space Optimization: Finally, we evaluate INGESTBASE in scenarios where optimizing the storage space is the primary goal of the user.

Setup. We consider four different scenarios. First, the case where users over-replicate the hot data, for better availability, and under-replicate the cold data, to preserve storage space (Flexible Replication). To do so, we create 10 range partitions and consider the first partition to be hot (replicated 10 times) and the remaining partitions to be cold (replicated 2 times). Second, we consider using erasure coding instead of replication; we erasure-code the data with 3 parity blocks for every 10 data blocks. Third, we consider the case where users apply different erasure codes to different portions of their data (Flexible Erasure Coding): we create 3 parity blocks for every 5 data blocks of the first range partition, while the remaining range partitions are encoded as before, i.e., 3 parity blocks for every 10 data blocks. Finally, we consider the case where users apply both replication and erasure coding, on different portions of the data: we replicate the first range partition 10 times and apply erasure coding to the remaining partitions (3 parity blocks for every 10 data blocks).

Discussion. Figure 5(d) shows the results. Interestingly, we observe that INGESTBASE outperforms HDFS in the Flexible Replication case. This is because INGESTBASE creates fewer data replicas than HDFS and hence stores less data. Erasure coding, on the other hand, stores 30% more data as well as incurs more CPU cost. As a result, it has an almost 40% higher runtime than standard HDFS. However, INGESTBASE allows developers to flexibly choose the erasure codes as well as freely combine erasure coding with replication. Hence, INGESTBASE can be effectively used to optimize storage space by transforming the physical data representation in a variety of ways.

B. Comparison with Hive Cooking Jobs

The typical practice is to prepare a dataset once it is already ingested into HDFS, using query processing tools or MapReduce over HDFS. Let us now compare INGESTBASE with such an approach.

Setup. We preload the data into HDFS and create an external Hive table to contain the data.
We then run HiveQL queries to perform three data transformations, namely functional dependency checking, denial constraint checking, and random sampling (these are the only transformations that are easy to represent in HiveQL).

TABLE I. INGESTION OVERHEAD IN HIVE AND INGESTBASE.

Transformation         | Hive (s) | INGESTBASE (s) | Improvement
Functional Dependency  | 8,100    | 4,516          | 1.8x
Denial Constraint      | 2,616    | 1,011          | 2.6x
Random Sampling        | 2,274    | 371            | 6.1x

Discussion. Table I shows the ingestion overhead, above the standard HDFS upload time, in Hive and INGESTBASE. We can see that INGESTBASE has 1.8x less overhead than Hive for checking the functional dependency and almost 6.1x less overhead for random sampling. The reason is that INGESTBASE piggy-backs these operations onto the ingestion process. For example, to generate random samples, INGESTBASE incurs a single data read from disk while ingesting the data into HDFS. Hive, on the other hand, needs to re-read the entire dataset twice. Furthermore, it is tedious to run more complex transformations, such as stratified sampling, in Hive, and physical transformations, such as erasure coding, are not possible at all. Thus, INGESTBASE has utility both in terms of performance and in terms of flexibility in ingesting datasets in an ad-hoc manner.
