QFix: Diagnosing errors through query histories Xiaolan Wang Alexandra Meliou Eugene Wu SchoolofComputerScience SchoolofComputerScience ComputerScience UniversityofMassachusetts UniversityofMassachusetts ColumbiaUniversity [email protected] [email protected] [email protected] 6 ABSTRACT often obscures the root of the problem and other data that 1 Data-drivenapplicationsrelyonthecorrectnessoftheirdata may have been affected. Therefore, traditional data clean- 0 to function properly and effectively. Errors in data can be ing approaches are not well-suited for this setting: While 2 they provide general-purpose tools to identify and rectify incredibly costly and disruptive, leading to loss of revenue, anomaliesinthedata, theyarenotdesignedtodiagnosethe b incorrect conclusions, and misguided policy decisions. While causes of errors that are rooted in erroneous updates. Some e data cleaning tools can purge datasets of many errors before data cleaning systems try to identify structural sources of F the data is used, applications and users interacting with the mistakes [35], but they are unable to trace the source of the datacanintroducenewerrors. Subsequentvalidupdatescan 1 mistakes to particular faulty queries. obscuretheseerrorsandpropagatethemthroughthedataset 1 While improving data quality and correcting data errors causingmorediscrepancies. Evenwhensomeofthesediscrep- ancies are discovered, they are often corrected superficially, has been an important focus for data management research, ] handling new errors, introduced during regular database B onacase-by-casebasis,furtherobscuringthetrueunderlying interactions, has received little attention. Most work in D cause, and making detection of the remaining errors harder. thisdirectionfocusesonguarding against erroneousupdates. In this paper, we propose QFix, a framework that derives . For example, integrity constraints [27] reject some improper s explanations and repairs for discrepancies in relational data, c by analyzing the effect of queries that operated on the data updates, but only if the data falls outside rigid, predefined [ ranges. Certificate-based verification [7] is less rigid, but it and identifying potential mistakes in those queries. QFix is flexible, handling scenarios where only a subset of the isimpracticalandnon-scalableasitrequiresuserstoanswer 2 challengequestionsbeforeallowingtheupdates,anditisnot v true discrepancies is known, and robust to different types of applicable to updates initiated by applications. 9 update workloads. We make four important contributions: In this paper, we present QFix, a diagnosis and repair 3 (a)weformalizetheproblemofdiagnosingthecausesofdata framework for data errors caused by erroneous updates. In 5 errors based on the queries that operated on and introduced contrast to existing approaches in data cleaning that aim to 7 errorstoadataset;(b)wedevelopexactmethodsforderiving detectandcorrecterrorsinthedatadirectly,thegoalofQFix 0 diagnosesandfixesforidentifiederrorsusingstate-of-the-art istoidentifyerrorsinthequeriesthatintroducederrorsinthe . tools; (c) we present several optimization techniques that 1 data. Thesediagnosesexplain howerrorswereintroducedto improve our basic approach without compromising accu- 0 a dataset, and allow an administrator to easily identify and racy, and (d) we leverage a tradeoff between accuracy and 6 further validate the most likely query-based sources of these performance to scale diagnosis to large datasets and query 1 errors. In addition, once the erroneous queries have been logs, while achieving near-optimal results. We demonstrate : v the effectiveness of QFix through extensive evaluation over confirmed, repairing the source of the errors can potentially i benchmark and synthetic data. lead to the identification of additional discrepancies in the X datathatwouldhaveotherwiseremainedundetected. Wede- r scribetwomotivatingexamples;thefirstisareal-lifescenario, a 1. INTRODUCTION provided to us by a large US-based wireless provider. Poor data quality is a hard and persistent problem. It is estimatedtocosttheUSeconomymorethan$600billionper year [17] and erroneous price data in retail databases alone Example 1 (Wireless discount policies). A wire- cost the US consumers $2.5 billion each year [18]. Although less provider offers company discounts as incentives for cor- data cleaning tools can purge many errors from a dataset porate customers. There are different types of discounts before downstream applications use the data, datasets can (flat, percentage, fee-based), and their details are specific frequently change as applications and users execute queries to corporate agreements. The large number of policies and that modify the data. Mistakes in these queries can intro- complexities in their rules frequently cause policies to be set duce errors to the data, and these errors can propagate to incorrectly, leading to errors in the application of discounts moredatabysubsequentupdatequeries. Bythetimeerrors to customers’ accounts. are detected, their origin has often been obscured and it is Customers who notice billing errors contact the provider, difficulttoidentifytheoffendingqueryandcorrectit. Identi- butthecallcentersdonothavethecapacityorabilitytoinves- fyingandcorrectingerrorsinthedatadirectlyissuboptimal, tigatethecomplaintsdeeply. Thestandardcourseofactionis as it targets the symptom, rather than the underlying cause. tocorrectmistakesonacase-by-casebasisforeachcomplaint. Fixing the manifested data errors on a case-by-case basis As a result, unreported errors remain in the database for a 1 long time, or they never get corrected, and their cause be- • Weprovideanexacterror-diagnosissolutionthroughanon- comes harder to trace as further queries modify the database. trivial transformation of the problem to a mixed integer linear program (MILP) that encodes the data provenance Example 2 (Tax bracket adjustment). Taxbrack- of the erroneous tuples. Our approach employs state-of- ets determine tax rates for different income levels and are the-artMILPsolverstoidentifyoptimaldiagnosesthatare often adjusted. Accounting firms implement these changes guaranteed to resolve all complaints without introducing to their databases by appropriately updating the tax rates new errors to the data (Section 4). of their customers. Mistakes in these update queries (e.g., • We present several optimizations to our basic diagnostic Figure2)resultinerrorsinthetaxratesandcomputedtaxes. method, which reduce the problem size without affecting thequalityoftheproducedsolutions. Further,wepropose In these application scenarios, data errors are typically re- anincrementalrepairmethodthattargetsthecaseswhere ported to a customer service department, which does not the log contains a single corrupted query (or the search havetheresourcesnorthecapabilitytoinvestigatetheerrors focusesonasinglerepair). Thisincrementalanalysisofthe more broadly. Instead, errors are resolved on a case-by-case log allows us to scale to large datasets (100k records) and basis. ThegoalofQFixistoidentifythequeryorqueriesthat largequerylogs(hundredstothousandsofupdatequeries). caused the errors and propose corrections to those queries. Further,weshowthatouroptimizationtechniqueshavethe Once these repairs have been validated (say, by an expert), additional advantage of tolerating incomplete information, theycanbeusedtoidentifyunreportederrorsandtoprevent such as unreported errors (Section 5). the introduction of more errors. This problem has the fol- lowing important characteristics that render it very difficult, • Weperformathoroughevaluationofthetrade-offsbetween and unsuitable for traditional techniques: speedandaccuracyofourbaselineandoptimizedmethods under a controlled, synthetic setting. In particular, we Obscurity. Handlingdataerrorsdirectlyoftenleadstopar- demonstrate that the QFix optimizations achieve signif- tialfixesthatfurthercomplicatetheeventualdiagnosisand icant speedup compared to the baseline algorithm (40× resolution of the problem. For example, a transaction im- in some of our experimental settings). We also evaluate plementing a change in the state tax law updated tax rates QFix on common OLTP benchmarks and show how QFix usingthewrongrate,affectingalargenumberofconsumers. can propose fully accurate repairs within milliseconds on Thiscausesalargenumberofcomplaintstoacallcenter,but a TPC-C workload with 1500 queries (Section 7). eachcustomeragentusuallyfixeseachproblemindividually, To the best of our knowledge, QFix is the first system which ends up obscuring the source of the problem. that diagnoses and repairs errors through query histories. Large impact. Erroneous queries cause errors at a large We show that it is extremely effective and efficient with the scale. The potential impact of the errors is high, as mani- updateworkloadsfoundinmostcommonbenchmarks. While festedinseveralreal-worldcases[24,32,39]. Further,errors QFix trusts its input to be correct, it can handle incomplete that remain undetected for a significant amount of time information, and it can be resilient to some inaccuracies in can instigate additional errors, even through valid updates. the reported data errors (Section 6). QFix does not handle This increases both their impact, and their obscurity. some complex query types that are less common in update Systemic errors. The errors created by bad queries are workloads, such as nested queries, joins, and aggregation. It systemic: they have common characteristics, as they share also does not currently deal with large amounts of incorrect the same cause. The link between the resulting data errors information, such as fake data errors reported by malicious is the query that created them; cleaning techniques should users. These challenges present exciting future extensions to leverage this connection to diagnose and fix the problem. the system presented in this work. Diagnosing the cause of the errors, will achieve systematic fixes that will correct all relevant errors, even if they have 2. QFixSYSTEMARCHITECTURE not been explicitly identified. Figure 1 shows the QFix architecture. The system takes QFix does not replace traditional data cleaning methods, twoinputs: alogofupdatequeries(includingUPDATE,INSERT, but rather, complements them. Instead of identifying errors and DELETE statements) and a set of identified data errors in the data directly, QFix targets the diagnosis, explanation, (complaints). QFix analyzes the data errors and the query and repair of errors at the root by leveraging example errors logs to trace the causes of the errors in queries in the log acquired from users, traditional data cleaning, or detection (diagnoses), and to automatically derive query repairs. The techniques. query repairs represent corrections to the queries in the log, Diagnosing data errors stemming from incorrect updates andcanbeusedtoidentifyadditionalerrorsinthedatathat is fundamentally challenging: the search space of possible were not reported. mistakes and fixes is large, and the amount of information ThecoreofQFixistheMILPEncoder,whichexpressesthe (number of known errors) may be limited. QFix addresses query diagnosis problem as a Mixed Integer Linear Program these challenges by analyzing the queries that operated on a (MILP), and the constraint problem is solved by the Solver. datasetinanefficientandscalablemanner. Moreconcretely, The Optimizer uses slicing and incremental techniques that we make the following contributions: help the system scale to large datasets and query logs effi- • Weformalizetheproblemofdiagnosingasetoferrorsusing ciently, while maintaining high accuracy. For completeness, log histories of updates that operated on the data. Given an optional Denoiser can be applied to the inputs to detect asetofcomplaints asrepresentationsofdatadiscrepancies andremoveincorrectcomplaints—weregardthiscomponent in the current state of a dataset, QFix determines how to as orthogonal to this paper. resolve all of the complaints with the minimal amount of changes to the queries in the log (Section 3). 2 Users complaint tuple Notation Description Database Query Log Query diagnoses UCACCCUWNP____PHDBDBPWDAAAEA ALCLY_RTTAAMIEE_DENN WE WC WCCN=U_ _EET?AI IS D _ARD+=T CNE = O??DH= ,NMW? C O?UTC EH_ P=_URYDED S T?SRAD,_E EETCI _DST E_CPD E A=C_CTYA _IU? DMT WBASAA =EN_T L ON=DY? A ?TAMT CDNWN CE=_ D HI=ER?D ,W E=CS R=_ _E DE?YT T_ IDD += ?? & repairs Q TQh=es{eqq1u,e.n..c,eqonf}executedupdatequeries(log) D0 Initialdatabasestateatbeginningoflog QFix Denoiser Solver Dn Enddatabasestate(current)Dn=Q(D0) Di Databasestateafterqueryqi: Di=qi(...q1(D0)) Optimizer c:t(cid:55)→t∗ Complaint: Tc(D)=(Dn\{t})∪{t∗} MILP Encoder OptSimliciiznagt ion IncRreempaeinrtal µCq(t) MCoomdipfilearinftunsecttiCon=o{fcq1,(e...g..,,cSkE}Tclause) σq(t) Conditionalfunctionofq (e.g.,WHEREclause) tnew TuplevaluesintroducedinanINSERTquery Figure 1: QFix processes data anomalies in the form of Q∗ Logrepair complaints and analyzes logged query histories to identify d(Q,Q∗) Distancefunctionsbetweentwoquerylogs the causes of error in the form of repaired queries. The Figure 3: Summary of notations used in the paper. key component is the MILP Encoder, which expresses the diagnosis problem as a mixed integer linear program. databaseD. Intherestofthepaper,weusethetermupdate 3. MODELINGABSTRACTIONS queries, or just queries, to refer to any of the queries in (Q), In this section, we introduce a running example inspired including insertion and deletion queries. from the use-case of Example 2, and describe the model Query (q ): We model each query as a function over a i abstractions that we use to formalize the diagnosis problem. database D, resulting in a new database D(cid:48). For INSERT queries, D(cid:48) = q(D) = D∪{t }. We model UPDATE and Example 3. Figure2demonstratesanexampletaxbracket new DELETE queries as follows: adjustment in the spirit of Example 2. The adjustment sets the tax rate to 30% for income levels above $87,500, and D(cid:48) =q(D)={µ (t)|t∈D,σ (t)}∪{t|t∈D,¬σ (t)} q q q is implemented by query q . A digit transposition mistake 1 in the query, results in an incorrect owed amount for tu- In this definition, the modifier function µ (t) represents the q ples t3 and t4. Query q2, which inserts a tuple with slightly query’supdateequations,andittransformsatuplebyeither higher income than t3 and t4 and the correct information, deleting it (µq(t) = ⊥) or changing the values of some of obscures this mistake. This mistake is further propagated by its attributes. The conditional function σ (t) is a boolean q query q3, which calculates the pay check amount based on the function that represents the query’s condition predicates. In corresponding income and owed. the example of Figure 2: While traditional data cleaning techniques seek to identify µ (t)=(t.income,t.income∗0.3,t.pay) and correct the erroneous values in the table Taxes directly, q1 our goal is to diagnose the problem, and understand the σq1(t)=(t.income≥85700) reasons for these errors. In this case, the reason for the data µ (t)=(t.income,t.owed,t.income−t.owed) q3 errors is the incorrect predicate value in query q . 1 σ (t)=true In this paper, we assume that we know some errors in q2 the dataset, and that these errors were caused by erroneous Note that in this paper, we only consider query without updates. The errors may be obtained in different ways: sub-query or aggregation. traditional data cleaning tools may identify discrepancies in Database state (D ): We use D to represent the state of thedata(e.g.,atuplewithlowerincomehashigherowedtax i i a database D after the application of queries q through q amount), or errors can be reported directly from users (e.g., 1 i from the log Q. D represents the original database state, customers reporting discrepancies to customer service). Our 0 and D the final, or current, database state. Out of all the goal is not to correct the errors directly in the data, but to n states, the system only maintains D and D . In practice, analyze them as a “symptom” and provide a diagnosis. The 0 n D can be a checkpoint: a state of the database that we diagnosis can produce a targeted treatment: knowing how 0 assumeiscorrect;wecannotdiagnoseerrorsbeforethisstate. the errors were introduced guides the proper way to trace The intermediate states can be derived by executing the log: and resolve them. D = q (q (...q (D ))). We also write D = Q(D ) to i i i−1 1 0 n 0 3.1 ErrorModeling denote that the final database state D can be derived by n applying the sequence of queries in the log to the original In our setting, the diagnoses are associated with errors database state D . in the queries that operated on the data. In Example 3, 0 the errors in the dataset are due to the digit transposition True database state (D∗): Queries in Q are possibly erro- i mistake in the WHERE clause predicate of query q1. Our neous,introducingerrorsinthedata. Thereexistsasequence goal is to infer the errors in a log of queries automatically, oftrue databasestates{D∗,D∗...,D∗},withD∗ =D ,rep- 0 1 n 0 0 given a set of incorrect values in the data. We proceed to resenting the database states that would have occurred if describe our modeling abstractions for data, queries, and there had been no errors in the queries. The true database errors,andhowweusethemtodefinethediagnosisproblem. statesareunknown;ourgoalistofindandcorrecttheerrors in Q and retrieve the correct database state D∗. Dataandquerymodels n For ease of exposition, in the remainder of the paper we Query log (Q): We define a query log that update the assume that the database contains a single relation with at- database as an ordered sequence of UPDATE, INSERT, and tributesA ,...,A ,butthesingletableisnotarequirement 1 m DELETE queries Q = {q ,...,q }, that have operated on a in our framework. 1 n 3 Taxes: D Query log: Q Taxes: D 0 4 ID income owed pay q1: UPDATE Taxes SET owed=income*0.3 ID income owed pay WHERE income>=85700 t $9500 $950 $8550 t $9500 $950 $8550 1 q2: INSERT INTO Taxes 1 t $90000 $22500 $67500 t $90000 $27000 $63000 2 VALUES (25, 85800, 21450) 2 t $86000 $21500 $64500 t $86000 $25800 $60200 3 q3: UPDATE Taxes SET pay=income-owed 3 t $86500 $21625 $64875 t $86500 $25950 $60550 4 4 t $87000 $21750 $65250 5 Figure 2: A recent change in tax rate brackets calls for a tax rate of 30% for those with income above $87500. The accounting department issues query q to implement the new policy, but the predicate of the WHERE clause condition transposed two 1 digits of the income value. Errormodels Problemdefinition Following the terminology in Examples 1 and 2, we model a Wenowformalizetheproblemdefinitionfordiagnosingdata set of identified or user-reported data errors as complaints. errors using query logs. A diagnosis is a log repair Q∗ that A complaint corresponds to a particular tuple in the final resolves all complaints in the set C and leads to a correct database state D∗, and identifies that tuple’s correct value database state D∗. n n assignment. We formally define complaints below: Definition 5 (Optimal diagnosis). Given database Definition 4 (Complaint). A complaint c is a map- states D and D , a query log Q such that Q(D )=D , a 0 n 0 n ping between two tuples: c : t (cid:55)→ t∗, such that t and t∗ set of complaints C on D , and a distance function d, the n have the same schema, t∈D ∪{⊥}, and t(cid:54)=t∗. A com- optimal diagnosis is a log repair Q∗, such that: n plaint defines a transformation Tc on a database state D: • Q∗(D0)=Dn∗, where Dn∗ has no errors Tc(D)=(D\{t})∪{t∗}. • d(Q,Q∗) is minimized In the example of Figure 2, two complaints are reported More informally, we seek the minimum changes to the log on the final database state D3: c1 :t3 (cid:55)→t∗3 and c2 :t4 (cid:55)→t∗4, QthatwouldresultinacleandatabasestateDn∗. Obviously, where t∗3 = (86000,21500,64500) and t∗4 = (86500,21625, a challenge is that Dn∗ is unknown, unless we know that the 64875). For both these cases, each complaint denotes a complaint set is complete. value correction for a tuple in D . Complaints can also 3 modeltheadditionorremovaloftuples: c:⊥(cid:55)→t∗ means Problemscopeandsolutionoutline that t∗ should be added to the database, whereas c:t(cid:55)→⊥ In this work, we assume basic data manipulation queries means that t should be removed from the database. with no subqueries, aggregations, or joins; these operations Complaint set (C): We use C to denote the set of all known are not as common in update workloads. QFix supports complaintsC ={c1,...,ck},andwecallitthecomplaint set. queries with WHERE clauses containing conjunctions and Each complaint in C represents a transformation (addition, disjunctions of predicates. Predicates and SET expressions deletion, or modification) of a tuple in Dn. We assume that can be over linear combinations of constants and attributes. the complaint set is consistent, i.e., there are no two com- We study the impact of the number of predicates in the plaintsthatproposedifferenttransformationstothesametu- WHERE clause in Section 7.3. plet∈Dn. ApplyingallthesetransformationstoDn results In Section 4, we describe our basic method, which uses in a new database instance Dn(cid:48) =Tc1(Tc2(...Tck(Dn))).1 C a constraint programming formulation that expresses this iscomplete ifitcontainsacomplaintforeacherrorinDn. In diagnosisproblemasamixedintegerlinearprogram(MILP). thatcase,D(cid:48) =D∗. Inourwork,wedonotassumethatthe Section 5 presents several optimization techniques that ex- n n complaintsetiscomplete,but,asismorecommoninpractice, tend the basic method, allowing QFix to (1) handle cases we only know a subset of the errors (incomplete complaint of incomplete information (incomplete complaint set), and set). Further, we focus our analysis on valid complaints; we (2) scale to large data and log sizes. Specifically, the fully briefly discuss dealing with invalid complaints (complaints optimized, incremental algorithm (Section 5.4), can handle identifyingacorrectvalueasanerror)inSection6,butthese querylogswithhundredsofquerieswithinminutes,whilethe techniques are beyond the scope of this paper. performance of the basic approach collapses by 50 queries. Log repair (Q∗): The goal of our framework is to derive Due to space considerations, we omit discussion of alter- a diagnosis as a log repair Q∗ = {q∗,...,q∗}, such that native approaches that use classification tools and linear 1 n Q∗(D ) = D∗. In this work, we focus on errors produced systems of equations. These approaches are limited to a 0 n by incorrect parameters in queries, so our repairs focus on query log containing a single query, and are discussed and altering query constants rather than query structure. There- evaluated in more detail in our technical report [36]. fore, for each query q∗ ∈ Q∗, q∗ has the same structure i i as qi (e.g., the same number of predicates and the same 4. AMILP-BASEDSOLUTION variables in the WHERE clause), but possibly different param- Inthissection,weintroduceabasic solver-basedapproach eters. For example, a good log repair for the example of to resolve the errors reflected in the complaint set. This Figure 2 is Q∗ ={q∗,q ,q }, where q∗=UPDATE Taxes SET 1 2 3 1 approach constructs a mixed-integer linear programming owed=income*0.3 WHERE income >= 87500. (MILP) problem by linearizing and parameterizing the cor- 1Since the complaint set is consistent, it is easy to see that rupted query log over the tuples in the database. Briefly, the order of transformations is inconsequential. an MILP is a linear program where only a subset of the 4 undeterminedvariablesarerequiredtobeintegers,whilethe u.A =x ⊗µ (t).A and v.A =(1−x )⊗t.A . As- j qi,t qi j j qi,t j rest are real-valued. suming a number M as the upper bound of the domain of Our general strategy is to model each query as a linear t.A , we get the following constraints: j equation that computes the output tuple values from the u.A ≤µ (t).A v.A ≤t.A inputs and to transform the equation into a set of of linear j qi j j j constraints. Inaddition,theconstantvaluesinthequeriesare u.Aj≤xqi,tM v.Aj≤(1−xqi,t)M (3) parameterizedintoasetofundeterminedvariables,whilethe u.A ≥µ (t).A −(1−x )M v.A ≥t.A −x M database state is encoded as constraints on the initial and fi- j qi j qi,t j j qi,t naltuplevalues. Finally,theundeterminedvariablesareused The set of conditions on u.Aj ensure that u.Aj =µqi(t).Aj to construct an objective function that prefers value assign- if xqi,t = 1, and 0 otherwise. Similarly, the conditions on mentsthatminimizeboththeamountthatthequerieschange v.Aj ensure that v.Aj =t.Aj if xqi,t =0, and 0 otherwise. and the number of non-complaint tuples that are affected. Now, Equation (2) becomes linear: The rest of this section will first describe the process of t.A(cid:48) =u.A +v.A (4) linearizing a single query and translating it into a set of j j j constraints. We then extend the process to the entire query INSERT:Aninsertqueryaddsanewtuplet tothedatabase. new log and finally define the objective function. Subsequent If the query were corrupted, then the inserted values need sectionsintroduceoptimizationsthatbothimprovethespeed repair. We use a binary variable x to model whether the and quality of the results, as well as harness the trade-off query is correct. Each attribute of the newly inserted tuple between the two. (t(cid:48).A ) may take one of two values: the value specified by j the insertion query (t .A ) if the query is correct (x=1), 4.1 EncodingaSingleQuery new j or an undetermined value (u.A ) if the query is incorrect j MILP problems express constraints as a set of linear in- (x=0). Thus, similar with Equation (2), we write: equalities. Our task is to derive such a mathematical repre- t(cid:48).A =x⊗t .A +(1−x)⊗v.A (5) sentation for each query in Q. Starting with the functional j new j j representationofaquery(Section3.1),wedescribehoweach DELETE: A delete query removes a set of tuples from the querytype,UPDATE,INSERT,andDELETE,canbetransformed database. Since the MILP problem doesn’t have a way to intoasetoflinearconstraintsoveratupletandanattribute express a non-existent value, we encode a deleted tuple by value A . j setting its attributes to a value outside of the attribute UPDATE: Recall from Section 3.1 that query qi can be mod- domain M+. In this way, subsequent conditional functions eled as the combination of a modifier function µ (t) and qi ontheattributewillreturnfalse,soitwillnothaveaneffect conditional function σ (t). First, we introduce a binary qi on subsequent queries encoded in the MILP problem: variable x to indicate whether t satisfies the conditional function oqfi,tqi: xqi,t = 1 if σqi(t) = true and xqi,t = 0 t(cid:48).Aj = xqi,t⊗M++(1−xqi,t)⊗t.Aj (6) otherwise. In a slight abuse of notation: x = σ (t) qi,t qi x =σ (t) (1) qi,t qi This expression is further linearized using the same method Next, we introduce real-valued variables for the attributes as Equation (3). of t. We express the updated value of an attribute using Putting it all together. The constraints defined in Equa- semi-modules, borrowing from the models of provenance for tions (1)–(6) form the main structure of the MILP problem aggregateoperations[2]. Asemi-moduleconsistsofacommu- for a single attribute A of a single tuple t. To linearize a j tative semi-ring, whose elements are scalars, a commutative query q one needs to apply this procedure to all attributes i monoid whose elements are vectors, and a multiplication-by- and tuples. This process is denoted as Linearize(q,t) in scalars operation that takes a scalar x and a vector u and Algorithm 1. Our MILP formulation includes three types of returns a vector x⊗u. A similar formalism has been used variables: thebinaryvariablesx ,thereal-valuedattribute qi,t in literature to model hypothetical data updates [29]. values (e.g., u.A ), and the real-valued constants in µ and j qi attGriibvuenteaAjquinertyheqiupanddatetduptlueplte, tw(cid:48)easexfoplrloewsss:the value of σasqsii.gnAeldl tvhaeluseesvbayriaabMleIsLaPresoulnvedre.termined and need to be Next,weextendthisencodingtotheentirequerylog,and t(cid:48).A =x ⊗µ (t).A +(1−x )⊗t.A (2) j qi,t qi j qi,t j incorporate an objective function encouraging solutions that In this expression, the ⊗ operation corresponds to regular minimize the overall changes to the query log. multiplication, but we maintain the ⊗ notation to indicate 4.2 EncodingandRepairingtheQueryLog that it is a semi-module multiplication by scalars. This expression models the action of the update: If t satisfies We proceed to describe the procedure (Algorithm 1) that the conditional function (x = 1), then t(cid:48).A takes the encodes the full query log into a MILP problem, and solves qi,t j value µ (t).A ; if t does not satisfy the conditional function the MILP problem to derive Q∗. The algorithm takes as qi j (x =0), then t(cid:48).A takes the value t.A . In our running input the query log Q, the initial and final (dirty) database qi,t j j example, the rate value of a tuple t after query q would states D , and the complaint set C, and outputs a fixed 1 0,n be expressed as: t(cid:48).owed = x ⊗(t.income∗0.3)+(1− query log Q∗. q1,t x )⊗t.owed. Equation (2) does not yet provide a linear WefirstcallLinearize oneachtupleinD andeachquery q1,t 0 representation of the corresponding constraint, as it con- in Q, and add the result to a set of constraints milp_cons. tainsmultiplicationofvariables. Tolinearizethisexpression, ThefunctionAssignVals addsconstraintstosetthevaluesof we adapt a method from [29]: We introduce two variables theinputstoq andtheoutputsofq totheirrespectiveval- 0 n u.A and v.A to represent the two terms of Equation (2): uesinD andT (D ). Additionalconstraintsaccountforthe j j 0 C n 5 Algorithm 1: Basic: The MILP-based approach. Require: Q,D0,Dn,C 1: milp_cons←∅ 2: foreachtinRdo 3: foreachqinQdo 4: milp_cons←milp_cons∪Linearize(q,t) 5: endfor 6: milp_cons←milp_cons∪AssignVals(D0.t,Dn.t,C) 7: foreachiin{0,...,N−1}do 8: milp_cons←milp_cons∪ConnectQueries(qi,qi+1) 9: endfor Figure 4: Log size vs. execution time over 1000 records. 10: endfor 11: milp_obj←EncodeObjective(milp_cons,Q) 12: solved_vals←MILPSolver(milp_cons,milp_obj) complaints non-complaints 13: Q∗←ConvertQLog(Q,solved_vals) 14: ReturnQ∗ tuples dirty truth repair factthattheoutputofqueryq istheinputofq (Connect- i i+1 (a) (b) Queries). Thisfunctionsimplyequatest(cid:48) fromthelinearized result for q to the t input for the linearized result of q . i i+1 Figure 5: Graphical depiction of correct (a) and over- Finally, EncodeObjective augments the program with an generalized (b) repairs. Solid and empty circles represent objectivefunctionthatmodelsthedistancefunctionbetween complaint and non-complaint tuples. Each thick line rep- the original query log and the log repair (d(Q,Q∗)). In the resents the interval of query q’s range predicate. Dirty: followingsectionwedescribeourmodelforthedistancefunc- incorrect interval in corrupted query; truth: correct interval tion, though other models are also possible. Once the MILP in true query; repair: interval returned by the solver. solver returns a variable assignment, ConvertQLog updates theconstantsinthequerylogbasedonthisassignment,and theproblemusingthebasic algorithmthatparameterizesall constructs the fixed query log Q∗. queries, while the blue bars show the potential gain of only 4.3 TheObjectiveFunction parameterising the oldest query that we assume is incorrect. At a log size of 80, the solver for basic failed to produce an The optimal diagnosis problem (Definition 5) seeks a log answer within 1000 seconds. Although MILP solvers exhibit repair Q∗, such that the distance d(Q,Q∗) is minimized. empirical performance variation, this experiment illustrates In this section, we describe our model for the objective the performance limitation of the basic approach. function,whichassumesnumericalparametersandattributes. Asecondlimitationofbasic isitsinabilitytohandleerrors This assumption is not a restriction of the QFix framework. in the complaint set. This is because the basic MILP for- Handling other data types, such as categorical values comes mulation generates hard constraints for all of the database down to defining an appropriate distance function, which records, thus any error, whether a false negative missing can then be directly incorporated into QFix. complaint or a false positive incorrect complaint, must be In our experiments, we use the normalized Manhattan correct. It may be impossible to find a repair that satisfies distance(inlinearizedformatintheMILPproblem)between this condition and will lead to solver infeasibility errors. the parameters in Q and Q∗. We use q.param to denote i The rest of this section describes three classes of slicing the ith parameter of query q, and |q.param| to denote the optimizationsthatreducethenumberoftuples,queries,and total number of parameters in q: attributesthatareencodedintheMILPproblem. Thetuple- d(Q,Q∗)=(cid:88)n |qi.(cid:88)param||q .param −q .param∗| wslihceinngthteechconmiqpuleaiandtdsietitoinsailnlycoimmpprleotvee.sWtheearlespoapirroapccouseraacny i j i j i=1 j=1 incrementalalgorithmthatavoidstheexponentialincreasein solvertimebyonlyparameterizingasmallnumberofqueries Different choices for the objective function are also possi- at a time—thus limiting the cost to the left side of Figure 4. ble. For example, one may prioritize the total number of changes incurred in the log, rather than the magnitude of 5.1 TupleSlicing: ReducingTuples thesechanges. However,athoroughinvestigationofdifferent Our first optimization, tuple-slicing, applies a two step possible distance metrics is beyond the scope of our work. process to reduce the problem size without sacrificing accu- racy: it first aggressively reduces the problem size by only 5. OPTIMIZINGTHEBASICAPPROACH encodingtuplesinthecomplaintsetandthenrefinesthelog Amajordrawbackofourbasic MILPtransformation(Sec- repair through a second but much smaller MILP problem. tion 4)is that itexhaustively encodes thecombination ofall Step 1 (Initial Repair Step): The first step of tuple slic- tuples in the database and all queries in the query log. In ing aggressively reduces the problem size by only encoding this approach, the number of constraints (as well as unde- those tuples in the complaint set C (Algorithm 1 line 2 is termined variables) grows quadratically with respect to the replaced with for each t in C). Each tuple necessitates databaseandthequerylog. Thisincreasehasalargeimpact the linearization of the entire query log, thus, only encoding on the running time of the solver, since it needs to find the complaint tuples minimizes the size of the problem with a (near)-optimal assignment of all undetermined variables respect to the relevant tuples. This optimization is guaran- (exponential with the number of undetermined variables). teed to resolve C, thus depending on the properties of the This is depicted in Figure 4, which increases the query log non-complaint records, it can generate correct repairs an size over a database of 1000 tuples. The red bars encode order of magnitude faster without hurting the accuracy. In 6 Figure 5(a), the solver will guarantee a repair interval that Definition 6 (Complaint Attributes A(C)). Theset excludesthetwoleft-mostcomplaints,includesthetworight- of attributes identified as incorrect in the complaint set. most complaints, and minimizes the difference between the A(C)={A |t.A (cid:54)=t∗.A ,c(t,t∗)∈C} dirty and repaired intervals (due to the objective function). i i i Thiseffectivelypushestherepair’slower-boundtowardsthat Definition 7 (Query dependency & impact). Query of the dirty interval. This is a case where such a solution is q has direct-impact, I(q ), which is the set of attributes correct,becausethedirtyandtruthintervalsoverlap. Recall i i updated in its modifier function µ (e.g., SET clause). Its that we do not have access to the truth interval, and our qi dependency, P(q ), is the set of attributes involved in its goal is to reproduce the truth interval given C (solid circles) i conditionfunctionσ . Wederivethefull-impact,F(q ),of and the corrupted query. qi i aqueryq bypropagatingitsdirectimpactthroughsubsequent However, this approach can also cause the repair to be a i queries in the log (Algorithm 2): superset of the truth interval, and affect tuples not part of the complaint set. Figure 5(b) highlights such a case where n (cid:91) the dirty and truth intervals are non-overlapping, and the F(qi)=I(qi) F(qj) non-complaint record between them has been incorrectly j=i+1 included in the repair interval—because the MILP problem F(qi)∩P(qj)(cid:54)=∅ did not include the non-complaint. By computing the full-impact of q, we can determine the In both of these cases, the objective function will ensure extentthatitaffectsCbasedonitsoverlapwiththecomplaint that the repair does not over-generalize the upper bound attributes. Specifically, when |F(q)∩A(C)| = |A(C)|, q towardstherightbecausethatstrictlyincreasestheobjective may affect all complaint attributes and is a candidate for function. Therefore, our main concern is to refine the repair repair; when 0<|F(q)∩A(C)|<|A(C)|, q contributed to interval to exclude those non-complaint tuples in case (b). a subset of the complaint attributes and is a candidate for Note that in the case of incomplete complaint sets, the user repair; when |F(q)∩A(C)|=0, q is irrelevant and can be may opt to no execute the refinement step if she believes ignored during the repair process. We distinguish between that the non-complaint records are indeed in error. the first and second conditions in the special case where we are repairing a single corrupted query in the query log. In Step2(RefinementStep): Althoughtherearemanypos- this case, only queries in the first conditions are candidates sible mechanisms to refine the initial repair (e.g., incremen- for repair because the single query must have caused errors tally shrinking the repaired interval until the non-complaint in all of the complaint attributes. This enables QFix to tuples are all excluded), the straightforward approaches are scale significantly better for this important problem setting. noteffectivewhenmultiplecorruptquerieshavebeenrepaired Finally, we use Rel(Q) to denote the set of relevant queries because they don’t take the query interactions into account. thatarecandidatesforrepair. Ourqueryslicing optimization Instead, we solve this with a second, significantly smaller, linearizes only the queries in Rel(Q), rather than the entire MILPproblem. LetQ∗ bethesetofrepairedqueriesfrom log, resulting in smaller problems than the basic approach rep the initial MILP formulation with tuple slicing; NC be the without any loss of accuracy. setofnon-complainttuplesnowmatchingtherepairedWHERE clauses, as in Figure 5(b); and C+ =C∪NC. We create a 5.3 AttributeSlicing: ReducingAttributes newMILPusingC+ asthecomplaintset. Thekeyistoonly In addition to removing irrelevant queries, we additionally parameterize the repaired clauses from Step 1 as constraints avoid encoding irrelevant attributes. Given Rel(Q), the rele- with undetermined variables. The variables for all other vantattributescanbedefinedas: Rel(A)=∪ (F(q )∪ qi∈RelQ i tuples and queries are fixed to their assigned values from P(q )) We propose attribute slicing optimization that only i Step1. Thisrefinesthesolutionsfromthepreviousstepwhile encodes constraints for attributes in Rel(A). We find that incorporatingknowledgeaboutcomplaintsinNC,Finally,we thistypeofslicingcanbeeffectiveforwidetablesalongwith useanewobjectivefunctiontominimizethenumberofnon- queries that focus on a small subset of attributes. complaint tuples t∈NC that are matched by the solution. 5.4 IncrementalRepairs Inourexperiments,wefindthatthissecondMILPiteration addsminimaloverhead(0.1−0.5%)withrespecttotheinitial Even with the slicing optimizations, the number of unde- MILP problem. In summary, tuple-slicing is an effective termined variables can remain high, resulting in slow solver method to improve the performance of the basic approach, runtime. The red bars in Figure 4 showed the exponential without compromising, and often improving, repair quality. cost of parameterizing the entire query log as compared to onlysolvingforasinglequery(bluebars). Theseresultssug- gestthatitisfaster torunmanysmallMILPproblemsthan a single large one, and motivates our incremental algorithm. 5.2 QuerySlicing: ReducingQueries Our Inc approach (Algorithm 3) focuses on the case k Inpractice,manyofthequeriesinthequerylogcouldnot where there is a single corrupted query to repair. It does so have affected the complaint attributes (defined below). For by linearizing the full query log, including any slicing opti- example, if q and q only read and wrote attribute A , mizations,butonlyparameterizingandrepairingabatchofk N−1 N 1 thentheycouldnothavecontributedtoanerrorinA . How- consecutive queries at a time. This procedure first attempts 2 ever, if q wrote A , then either or both queries may have to repair the k most recent queries, and continues to the N 2 causedtheerror. Inshort,ifwemodelaqueryasasetofat- next k queries if a repair was not generated. The algorithm tributereadandwriteoperations,thosenotpartofthecausal internallycallsamodifiedversionofthebasic approachthat read-write chain to the complaint attributes can be ignored. takes extra parameters {q ,q }, only parameterizes those i i+k This is the foundation of our query-slicing optimization. queries, and fixes the values of all other variables. 7 Algorithm 2: FullImpact: Algorithm forfindingF(q). In this section, we carefully study the performance and accuracycharacteristicsofthebasicMILP-basedrepairalgo- Require: Q,qi rithm, slicing-based optimizations that improve the latency F(qi)←I(qi) 2: foreachqj inqi+1,...,qn∈Qdo ofthesystem,andtheincrementalalgorithmforsinglequery if F(qi)∩P(qj)(cid:54)=∅then corruptions. Due to the difficulty of collecting corrupt query 4: F(qi)←F(qi)∪F(qj) logs from active deployments, our goal instead is to under- endif 6: endfor stand these trade-offs in controlled synthetic scenarios, as ReturnF(qi) well as study the effectiveness in typical database query workloads based on widely used benchmarks. To this end, our experiments are organized as follows: Algorithm 3: Inc : The incremental algorithm. k First, we compare the basic and incremental MILP algo- Require: Q,Dj,Dn,C,k rithm against the different optimizations to highlight the SortQfrommosttoleastrecent 2: foreachqi...qi+k∈Qdo value of different optimizations and the limitations of the Qsuffix ={qj|j≥i} basicapproach. WethenshowthatUPDATEqueriesarepartic- 4: Q∗ ←Basicparams(Qsuffix,Dj,Dn,C,{qi,qi+k}) ularlydifficulttorepairandfocussolelyondifferenttypesof if Q∗(cid:54)=∅then 6: ReturnQ∗ UPDATE-only workloads to understand how QFix responds to endif differentparametersettings. Weendwithanevaluationusing 8: endfor established database transaction benchmarks from OLTP- bench [15]: TPC-C [11] and TATP [37]. All experiments wererunon12x2.66GHzmachineswith16GBRAMrunning Theincrementalapproachprioritizesrepairsforcomplaints IBM CPLEX [12] as the MILP solver on CentOS release 6.6. thatareduetomorerecentcorruptions. Giventhatthebasic algorithmsimplyfailsbeyondasmalllogsize,webelievethis 7.1 ExperimentalSetup is a natural and pragmatic assumption to use, and results in For each of our experiments we generate and corrupt a a10×scalabilityimprovement. Ourexperimentsfurthereval- querylog. Weexecutetheoriginalandcorruptquerylogson uate different batching level k in the incremental algorithm an initial (possibly empty) database, perform a tuple-wise andshowthatitisimpracticalfrombothaperformanceand comparisonbetweentheresultingdatabasestatestogenerate accuracy to have k>1. atruecomplaintset,andsimulateincompletecomplaintsets by removing a subset of the true complaints. Finally, we 6. NOISYCOMPLAINTSETS execute the algorithms and compare the repaired query log Asdescribedintheproblemsetup(Section3.1),complaint with the true query log, as well as the repaired and true sets may be imperfect. First, complaint sets are typically final database states, to measure performance and accuracy incomplete, missing errors that occur in Dn, but are not metrics. Performanceismeasuredaswallclocktimebetween reported. In this case, the naive encoding of the query log submittingacomplaintsetandthesystemterminatingafter and database (basic) will likely produce an infeasible MILP. retrievingavalidrepair. Accuracyismeasuredastherepair’s In the running example of Figure 2, if the complaint set is precision (percentage of repaired tuples that were correctly incomplete and only contains a complaint on t4, basic will fixed), the recall (the percentage of the full complaint set interpret t3 as a correct state and repairing the condition of thatwasrepaired),andtheF1measure(theharmonicmean q1 to a value greater than $86500 will appear to introduce ofprecisionandrecall). Wereporttheaverageacross20runs. a new error. The solver will declare the problem infeasible We describe the experimental parameters in the context of and will not return a solution. the datasets and workloads below. However, the tuple slicing optimization (Section 5.1) im- Synthetic: We generate an initial database of N random plicitly corrects this problem: By only encoding the tuples D tuples. TheschemacontainsaprimarykeyidalongwithN in the incomplete complaint set, the encoded problem does a attributes a ...a , whose values are integers picked from not enforce constraints on the query’s effect on other tuples 1 Na [0,V ] uniformly at random. We then generate a sequence in the database. This allows the result to generalize to tu- d of N queries. The default setting for these parameters are: ples not in the complaint set. The second iteration of the q N =1000,N =10,V =200,N =300. MILP execution then uses a soft constraint on the number D a d q UPDATE queries are defined by a SET clause that assigns of non-complaint tuples that are affected by the repair in an attribute a Constant or Relative value, and a WHERE order to address the possibility of over-generalization. clause can either be a Point predicate on a key, or a Range Another possible inaccuracy in the complaint set is the predicate on non-key attributes: presence of false positives: some complaints may be incor- rectly reporting errors, or the target tuple t∗ of a complaint SET Clause: WHERE Clause: may be incorrect. This type of noise in the complaint set Constant: SET (a_i=?), .. Point: WHERE a_j=? & .. canalsoleadtoinfeasibility. Onecanremovesucherroneous Relative: SET (a_i=a_i+?) Range: WHERE a_j in [?,?+r] & .. complaints as a pre-processing step, using one of numerous where ?∈ [0,V ] is random and r is the size of the range outlier detection algorithms. While this is an interesting d predicate. Query selectivity is by default 2% (r=4). Note problem, it is orthogonal to the query repair problem that that a range predicate where r = 0 is distinct from a Point we are investigating in this work. Thus, in our experiments, predicateduetothenon-keyattribute. TheWHEREclauses we focus on incomplete complaints sets and assume that inDELETEqueriesaregeneratedinanidenticalfashion,while there are not erroneous complaints. INSERTqueriesinsertvaluespickeduniformlyatrandomfrom V . By default, we generate UPDATE queries with non-key 7. EXPERIMENTS d range predicates and constant set clauses. 8 (a)Performanceformultiplecorruptions. (b)Performanceforsinglecorruption. (c)Performancefordifferentquerytypes. (d)Accuracyformultiplecorruptions. (e)Accuracyforsinglecorruption. (f)Accuracyfordifferentquerytypes. Figure 6: Our analysis highlights limitations of basic, the value of tuple-slicing, and the high cost of UPDATE queries. In addition, the skew parameter s determines the distri- N =30,wecorrupt3queries: q . Wefindthatthenum- q 1,11,21 bution attributes referenced in the WHERE and SET clauses. berofqueriesgreatlyaffectsboththescalabality(Figure6a) Each attribute in a query is picked from either a uniform and the accuracy (Figure 6d) of the algorithms. Specifically, distribution when s = 0 or a zipfian distribution with ex- asthenumberincreases,thenumberofpossibleassignments ponent s. This allows our experiments to vary between a of the MILP parameters increases exponentially and the uniform distribution, where each attribute is equally likely solveroftentakeslongerthanourexperimentaltimelimit of to be picked, and a skewed distribution where nearly all 1000secondsandreturnsaninfeasibilityerror. Thisisapre- attributes are the same. dominant reason why the accuracy degrades past 30 queries. For example, when 40 queries are involved (with 4 corrup- Benchmarks: We use the TPC-C [11] and TATP [37] tions) and we ignore the infeasible executions, the average benchmarks. TheformergeneratestheORDERtableatscale executiontimeis300secondsandtheprecisionandrecallare 1 with one warehouse, and uses the queries that modify the greater than 0.94. Unfortunately, with 50 queries (5 corrup- ORDERtable. Weexecutealogof2000queriesoveraninitial tions), all runs exceed the time limit and return infeasibility. table containing 6000 tuples. 1837 queries are INSERTs and the rest are UPDATEs. The latter TATP workload simulates Single Corrupt Query: In this experiment, we evaluate thecallerlocationsystem. WegenerateadatabasefromSUB- theefficacyQFixwithtupleslicingandincrementaloptimiza- SCRIBER table with 5000 tuples and 2000 UPDATE queries. tion in the special case when one query has been corrupted Both setups were generated using the OLTP-bench [15]. We in a much larger query log. We compare QFix-inc without introduce a single corruption, and vary corrupted query’s tupleslicing(inc )againsttupleslicingatdifferentbatching 1 index from the most recent query q to q . levelsof1,2,8(inc −tuple;inc −tuple,inc −tuple). Re- N N−1500 1 2 8 call from Section 5.4 that inc parameterizes k consecutive Corrupting Queries: We corrupt query q by replacing it k i queries in each batch until a repair is found. Figure 6b high- with a randomly generated query of the same type based lights the scalability limitation of the incremental algorithm on the procedures described above. To standardize our without tuple-slicing: with 50 queries inc easily exceeds procedures, we selected a fixed set of indexes idx that are 1 the 1000s limit. The tuple-slicing scales significantly better used in all experiments. (nearly200×faster),howevertheaccuracyseverelydegrades 7.2 PreliminaryAnalysis when k>1. The primary reason is because of infeasibility errors—the MILP problem is much harder, and fails to find The following set of experiments are designed to establish a repair. This is highlighted by the symmetry between the the rationale for the settings in the subsequent experiments. precision and recall curves. A secondary reason is beause Specifically,wecomparedifferentslicing-basedoptimizations the refinement step of tuple slicing may not generate a fully ofbasic asthenumberofqueriesincreases. Wethenevaluate correct repair and generalize incorrectly. This is why the the scalability of the different slicing-based optimizations of precisioncurveislowerthantherecallcurveforinc −tuple. theincrementalapproachinthecontextofasinglecorrupted 8 query. We then establish the difficulty of repairing UPDATE Query Type: Our final preliminary experiment evaluates workloads as compared to other query types. the incremental algorithm with tuple slicing optimization (inc −tuple)onINSERT,DELETE,orUPDATE-onlyworkloads. Multiple Corrupt Queries: In this experiment, we com- 1 Weincreasethenumberofqueriesfrom1to200andcorrupt parethebasicapproach(basic)againsteachslicingoptimiza- the oldest query in the log. Figure 6c shows that while the tion individually(basic−tuple,basic−attr,basic−query) cost of repairing INSERT workloads remains relatively con- and all of them together (basic − all). We use the de- stant, the cost for DELETE-only and UPDATE-only workloads fault settings with N = 1000 tuples and a sequence of D increase as the corruption happens earlier in the query log— UPDATE queries. We generate query logs in 5 different sizes and a much faster rate for UPDATE queries. The F1 score for N ∈{10,20,30,40,50} and corrupt every tenth query start- q all settings is nearly 1 (Figure 6f). Takeaways: We find that ing from oldest query q , up to q . For example, when the 1 41 9 (a)#ofattributesvstime. (b)Databasesizevstime(Na=100) Figure 7: For datasets with many attributes, the optimizations result in significant improvements. basic, even with slicing optimizations, has severe scalability order to tease apart the algorithm performance. These in- limitations due to the large number of undetermined values— cludefactorssuchasquerycomplexity,logsizeN ,database q thisisunsurprisingasMILPconstraintsolvingisanNP-hard size N and the skew of the query predicates. We focus on D problem. In contrast, the incremental algorithms can scale anarrowtablesettingthatcontainsN =10attributes,and a to larger query log sizes, however only a batch size of k=1 a single corrupt query in the query log. can retain high repair quality. UPDATE queries translate into Database Size: Figure 8a varies the database size (N ∈ D more undetermined variables than other query types, and are [100,100k]),andfixesqueryoutputcardinalityandcomplaint significantlymoreexpensivetorepair. Basedontheseresults, setsizeinthesamewayasthepreviousscalabilityexperiment. we focus on the incremental algorithm inc and the more Incontrasttothepreviousexperiment,thescalabalitycurve 1 difficult UPDATE-only workloads. is nearly flat for both corruption query indices. The reason is because the solver’s implicit pruning optimization is less effective when there are only 10 attributes: Every query is 7.3 SyntheticIncrementalExperiments likelytotouchanattributethataffectsthecomplaintset. We We divide these experiments into two groups: the first verifiedthisbyapplyngquery-slicingtothesamesetting,and evaluates the different slicing optimizations under different foundfarfewerquerieswereprunedcomparedtotheprevious settings;thelatteralgorithmandvariesworkloadanddataset experiment. It takes less than a minute to perfectly repair parameters. Note that omit accuracy figures when the F1- therecentcorruptionq ,andaround4minutesfortheolder 200 score is ≥0.99. corruptionq ,evenasthedatabasesizeincreases. Atsmaller 50 Comparing optimizations. database sizes, the randomness in the workload generator Varying # Attributes: We first evaluate QFix and its op- leads to variability in the size of the complaint set, and timizations by increasing the number of attribute (N ∈ ultimately a larger and more difficult MILP problem. The a [10,500]) with N =100. As shown in Figure 7a, when the exponential relationship between solver time and problem D numberofattributeinatableissmall(e.g.,Na =10)allalgo- difficulty results in the higher average latency for q50. rithms appear identical. However, increasing the number of Query Clause Type: So far, we have focused on UPDATE attributeexhibitsalargerbenefitforqueryandattributeslic- queries with constant set clauses and range predicates. Fig- ing (up to 6.8× reduction compared to tuple-slicing). When ure 8b individually varies each clause and compares against the table is wide (N = 500), applying all optimizations Constant/Point and Relative/Range queries. The x-axis a (inc1−all) is 40× faster than tuple-slicing alone. varies the index of the corrupted query between q1 and q249. DatabaseSize: Wevarythedatabasesize(N ∈[100,5000]) We find that point predicates and constant set clauses are D with a large number of attributes (N = 100). We fix the easier to solve than range predicates and relative set clauses, a number of complaints by decreasing the query selectivity respectively. Thereasonfortheformerpairisbecauserange in proportion to N ’s increase—the specific mechanism to predicates double the number of undetermined variables as D do so does not affect the findings. Figure 7b shows that compared to point queries. In addition, point queries are the costs are relatively flat until the corruption occurs in an on key attributes, thus further reduces the possible search old query (q ). In addition, we find that the cost is highly space. We believe the latter pair is because the constant 50 correlated with the number of candidate queries that are set clauses break the causal relationship between input and encoded in the MILP problem. The increase in cost despite output records for the overwritten values. This both simpli- tuple-slicing is due to the increasing number of candidate fies the difficulty of the constraint problem, and reduces the queries inthe system; we believethis increasingtrendis due total number of constraints. to the solver’s ability to prune constraints that correspond Predicate Dimensionality: Figure 8e varies the dimen- to queries that clearly will not affect the complaint set—an sionality of the update queries by increasing the number of implicit form of query slicing. Applying attribute-slicing predicates in the WHERE clause, while keeping the query car- supercedes this implicit optimization and results in a flat dinalityconstant. Thecostincreaseswiththedimensionality curve, while query-slicing explicitly reduces the number of because each additional predicate is translated into a new candidate queries in proportion with the database size, and setofconstraintsandundeterminedvariables,increasingthe leadstotheincreasingtrend. Ultimately,combiningallthree problem complexity. optimizations improves the latency over tuple-slicing by 2× Skew: We now study the effects of attribute skew on in the worst case, and up to 4× in the best case. the algorithms. We increase the skew parameter from 0 (uniform) to 1 (nearly every attribute is A ) and find a 0 Sensitivity to data and workload factors. reduction in latency (Figure 8d). We believe the reason is Thefollowingsetofsyntheticexperimentsfocusonasingle becauseincreasingtheskewfocusesthequerypredicatesover QFix setting—incremental with tuple-slicing—and individ- asmallersetoflogicalattributes,andincreasesthenumberof ually vary numerous database and workload parameters in 10