ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models

Sanjay Krishnan, Jiannan Wang, Eugene Wu†, Michael J. Franklin, Ken Goldberg
UC Berkeley, †Columbia University
{sanjaykrishnan, jnwang, franklin, goldberg}@berkeley.edu, [email protected]

ABSTRACT

Data cleaning is often an important step to ensure that predictive models, such as regression and classification, are not affected by systematic errors such as inconsistent, out-of-date, or outlier data. Identifying dirty data is often a manual and iterative process, and can be challenging on large datasets. However, many data cleaning workflows can introduce subtle biases into the training process due to violation of independence assumptions. We propose ActiveClean, a progressive cleaning approach where the model is updated incrementally instead of re-trained, and which can guarantee accuracy on partially cleaned data. ActiveClean supports a popular class of models called convex loss models (e.g., linear regression and SVMs). ActiveClean also leverages the structure of a user's model to prioritize cleaning those records likely to affect the results. We evaluate ActiveClean on five real-world datasets, UCI Adult, UCI EEG, MNIST, Dollars For Docs, and WorldBank, with both real and synthetic errors. Our results suggest that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.

1. INTRODUCTION

Machine Learning on large and growing datasets is a key data management challenge with significant interest in both industry and academia [1, 5, 10, 20]. Despite a number of breakthroughs in reducing training time, predictive modeling can still be a tedious and time-consuming task for an analyst. Data often arrive dirty, including missing, incorrect, or inconsistent attributes, and analysts widely report that data cleaning and other forms of pre-processing account for up to 80% of their effort [3, 22]. While data cleaning is an extensively studied problem, the predictive modeling setting poses a number of new challenges: (1) high dimensionality can amplify even a small amount of erroneous records [36], (2) the complexity can make it difficult to trace the consequences of an error, and (3) there are often subtle technical conditions (e.g., independent and identically distributed data) that can be violated by data cleaning. Consequently, techniques that have been designed for traditional SQL analytics may be inefficient or even unreliable. In this paper, we study the relationship between data cleaning and model training workflows and explore how to apply existing data cleaning approaches with provable guarantees.

One of the main bottlenecks in data cleaning is the human effort in determining which data are dirty and then developing rules or software to correct the problems. For some types of dirty data, such as inconsistent values, model training may seemingly succeed, albeit with potentially subtle inaccuracies in the model. For example, battery-powered sensors can transmit unreliable measurements when battery levels are low [21]. Similarly, data entered by humans can be susceptible to a variety of inconsistencies (e.g., typos) and unintentional cognitive biases [23]. Such problems are often addressed in a time-consuming loop where the analyst trains a model, inspects the model and its predictions, cleans some data, and re-trains.

This iterative process is the de facto standard, but without appropriate care it can lead to several serious statistical issues. Due to the well-known Simpson's paradox, models trained on a mix of dirty and clean data can have very misleading results even in simple scenarios (Figure 1). Furthermore, if the candidate dirty records are not identified with a known sampling distribution, the statistical independence assumptions of most training methods are violated, and the violations of these assumptions can introduce confounding biases. To this end, we designed ActiveClean, which trains predictive models while allowing for iterative data cleaning and has accuracy guarantees. ActiveClean automates the dirty data identification process and the model update process, thereby abstracting these two error-prone steps away from the analyst.

ActiveClean is inspired by the recent success of progressive data cleaning, where a user can gradually clean more data until the desired accuracy is reached [6, 34, 30, 17, 26, 38, 37].
We focus on a popular class of models called convex loss models (which includes linear regression and SVMs) and show that the Simpson's paradox problem can be avoided using iterative maintenance of a model rather than re-training. This process leverages the convex structure of the model rather than treating it like a black box, and we apply convergence arguments from convex optimization theory. We propose several novel optimizations that leverage information from the model to guide data cleaning towards the records most likely to be dirty and most likely to affect the results.

To summarize the contributions:

• Correctness (Section 5). We show how to update a dirty model given newly cleaned data. This update converges monotonically in expectation. For a batch size b and iterations T, it converges with rate O(1/√(bT)).
• Efficiency (Section 6). We derive a theoretically optimal sampling distribution that minimizes the update error, and an approximation to estimate the theoretical optimum.
• Detection and Estimation (Section 7). We show how ActiveClean can be integrated with data detection to guide data cleaning towards records expected to be dirty.
• The experiments evaluate these components on four datasets with real and synthetic corruption (Section 8). Results suggest that for a fixed cleaning budget, ActiveClean returns more accurate models than uniform sampling and Active Learning when systematic corruption is sparse.

2. BACKGROUND AND PROBLEM SETUP

This section formalizes the iterative data cleaning and training process and highlights an example application.

2.1 Predictive Modeling

The user provides a relation R and wishes to train a model using the data in R. This work focuses on a class of well-analyzed predictive analytics problems: ones that can be expressed as the minimization of convex loss functions. Convex loss minimization problems are amenable to a variety of incremental optimization methodologies with provable guarantees (see Friedman, Hastie, and Tibshirani [15] for an introduction). Examples include generalized linear models (including linear and logistic regression) and support vector machines; in fact, means and medians are also special cases.

We assume that the user provides a featurizer F(·) that maps every record r ∈ R to a feature vector x and label y. For labeled training examples {(x_i, y_i)}_{i=1}^N, the problem is to find a vector of model parameters θ by minimizing a loss function φ over all training examples:

  θ* = argmin_θ Σ_{i=1}^N φ(x_i, y_i, θ)

where φ is a convex function in θ. For example, in a linear regression φ is:

  φ(x_i, y_i, θ) = ‖θᵀx_i − y_i‖²₂

Typically, a regularization term r(θ) is added to this problem. r(θ) penalizes high or low values of feature weights in θ to avoid overfitting to noise in the training examples:

  θ* = argmin_θ Σ_{i=1}^N φ(x_i, y_i, θ) + r(θ)    (1)

In this work, without loss of generality, we include the regularization as part of the loss function, i.e., φ(x_i, y_i, θ) includes r(θ).
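As a concrete illustration of the convex loss setup above, the following is a minimal sketch (not from the paper) of a regularized squared loss and its gradient for linear regression in NumPy. The function names and the particular ridge penalty are illustrative assumptions.

```python
import numpy as np

def squared_loss(theta, x, y, reg=0.1):
    """phi(x, y, theta) = (theta^T x - y)^2 plus an L2 penalty r(theta)."""
    residual = theta @ x - y
    return residual ** 2 + reg * np.dot(theta, theta)

def squared_loss_gradient(theta, x, y, reg=0.1):
    """Gradient of the regularized squared loss with respect to theta."""
    return 2.0 * (theta @ x - y) * x + 2.0 * reg * theta

def total_loss(theta, X, Y, reg=0.1):
    """Sum of per-example losses, the objective minimized in Equation (1)."""
    return sum(squared_loss(theta, x, y, reg) for x, y in zip(X, Y))
```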
2.2 Data Cleaning

We consider corruption that affects the attribute values of records. This does not cover errors that simultaneously affect multiple records, such as record duplication, or structural errors, such as schema transformation. Examples of supported cleaning operations include batch resolving common inconsistencies (e.g., merging "U.S.A" and "United States"), filtering outliers (e.g., removing records with values > 1e6), and standardizing attribute semantics (e.g., "1.2 miles" and "1.93 km").

We are particularly interested in those errors that are difficult or time-consuming to clean, and that require the analyst to examine an erroneous record and determine the appropriate action, possibly leveraging knowledge of the current best model. We represent this operation as Clean(·), which can be applied to a record r (or a set of records) to recover the clean record r′ = Clean(r). Formally, we treat Clean(·) as an expensive user-defined function composed of deterministic schema-preserving map and filter operations applied to a subset of rows in the relation. A relation is defined as clean if R_clean = Clean(R_clean). Therefore, for every r ∈ R_clean there exists a unique r′ ∈ R in the dirty data. The map-and-filter cleaning model is not a fundamental restriction of ActiveClean, and Appendix A discusses a compatible "set of records" cleaning model.

2.3 Iteration

As an example of how Clean(·) fits into an iterative analysis process, consider an analyst training a regression and identifying outliers. When she examines one of the outliers, she realizes that the base data (prior to featurization) has a formatting inconsistency that leads to incorrect parsing of the numerical values. She applies a batch fix (i.e., Clean(·)) to all of the outliers with the same error, and re-trains the model. This iterative process can be described as the following pseudocode loop (a Python rendering is sketched after the loop):

1. Init(iter)
2. current_model = Train(R)
3. For each t in {1, ..., iter}
   (a) dirty_sample = Identify(R, current_model)
   (b) clean_sample = Clean(dirty_sample)
   (c) current_model = Update(clean_sample, R)
4. Output: current_model

While we have already discussed Train(·) and Clean(·), the analyst still has to define the primitives Identify(·) and Update(·). For Identify(·), given the current best model, the analyst must specify some criteria to select a set of records to examine. And in Update(·), the analyst must decide how to update the model given newly cleaned data. It turns out that these primitives are not trivial to implement, since the straightforward solutions can actually lead to divergence of the trained models.
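The loop above might look as follows in Python. This is only a sketch: train, identify, clean, and update are placeholders for the analyst-supplied primitives discussed in this section, not functions defined by the paper.

```python
def iterative_clean_and_train(R, train, identify, clean, update, iters=10):
    """One possible rendering of the analyst's loop: train, pick suspect
    records, clean them, and fold the cleaned records back into the model."""
    current_model = train(R)                       # 2. current_model = Train(R)
    for _ in range(iters):                         # 3. for each t in {1,...,iter}
        dirty_sample = identify(R, current_model)  #   (a) Identify(R, current_model)
        clean_sample = [clean(r) for r in dirty_sample]  # (b) Clean(dirty_sample)
        current_model = update(clean_sample, R)    #   (c) Update(clean_sample, R)
    return current_model                           # 4. Output: current_model
```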
2.4 Challenges

Correctness: Let us assume that the analyst has implemented an Identify(·) function that returns k candidate dirty records. The straightforward application of data cleaning is to repair the corruption in place and re-train the model after each repair. Suppose k ≪ N records are cleaned, but all of the remaining dirty records are retained in the dataset. Figure 1 highlights the dangers of this approach on a very simple dirty dataset and a linear regression model, i.e., the best-fit line for two variables. One of the variables is systematically corrupted with a translation in the x-axis (Figure 1a). The dirty data is marked in red and the clean data in blue, and they are shown with their respective best-fit lines. After cleaning only two of the data points (Figure 1b), the resulting best-fit line is in the opposite direction of the true model.

Aggregates over mixtures of different populations of data can result in spurious relationships due to the well-known phenomenon called Simpson's paradox [32]. Simpson's paradox is by no means a corner case, and it has affected the validity of a number of high-profile studies [35], even in the simple case of taking an average over a dataset. Predictive models are high-dimensional generalizations of these aggregates without closed-form techniques to compensate for these biases. Thus, training models on a mixture of dirty and clean data can lead to unreliable results, where artificial trends introduced by the mixture can be confused for the effects of data cleaning.

Figure 1: (a) Systematic corruption in one variable can lead to a shifted model. (b) Mixed dirty and clean data results in a less accurate model than no cleaning. (c) Small samples of only clean data can result in similarly inaccurate models.

An alternative is to avoid the dirty data altogether instead of mixing the two populations, and to restrict model re-training to only data that are known to be clean. This approach is similar to SampleClean [33], which was proposed to approximate the results of aggregate queries by applying them to a clean sample of data. However, high-dimensional models are highly sensitive to sample size. Figure 1c illustrates that, even in two dimensions, models trained from small samples can be as incorrect as the mixing solution described before.

Efficiency: Conversely, hypothetically assume that the analyst has implemented a correct Update(·) primitive and implements Identify(·) with a technique such as Active Learning to select records to clean [37, 38, 16]. Active learning is a technique to carefully select the set of examples to learn the most accurate model. However, these selection criteria are designed for stationary data distributions, an assumption which is not true in this setting. As more data are cleaned, the data distribution changes. Data which may look unimportant in the dirty data might be very valuable to clean in reality, and thus any prioritization has to predict a record's value with respect to an anticipated clean model.

2.5 The Need For Automation

ActiveClean is a framework that implements the Identify(·) and Update(·) primitives for the analyst. By automating the iterative process, ActiveClean ensures reliable models with convergence guarantees. The analyst first initializes ActiveClean with a dirty model. ActiveClean carefully selects small batches of data to clean based on data that are likely to be dirty and likely to affect the model. The analyst applies data cleaning to these batches, and ActiveClean updates the model with an incremental optimization technique.

Machine learning has been applied in prior work to improve the efficiency of data cleaning [37, 38, 16]. Human input, either for cleaning or for validation of automated cleaning, is often expensive and impractical for large datasets. A model can learn rules from a small set of examples cleaned (or validated) by a human, and active learning is a technique to carefully select the set of examples to learn the most accurate model. This model can be used to extrapolate repairs to not-yet-cleaned data, and the goal of these approaches is to provide the cleanest possible dataset, independent of the subsequent analytics or query processing. These approaches, while very effective, suffer from composability problems when placed inside cleaning and training loops. To summarize, ActiveClean considers data cleaning during model training, while these techniques consider model training for data cleaning. One of the primary contributions of this work is an incremental model update algorithm with correctness guarantees for mixtures of data.
2.6 Use Case: Dollars for Docs [2]

ProPublica collected a dataset of corporate donations to doctors to analyze conflicts of interest. They reported that some doctors received over $500,000 in travel, meals, and consultation expenses [4]. ProPublica laboriously curated and cleaned a dataset from the Centers for Medicare and Medicaid Services that listed nearly 250,000 research donations, and aggregated these donations by physician, drug, and pharmaceutical company. We collected the raw unaggregated data and explored whether suspect donations could be predicted with a model. This problem is typical of analysis scenarios based on observational data seen in finance, insurance, medicine, and investigative journalism. The dataset has the following schema:

Contribution(pi_specialty, drug_name, device_name, corporation, amount, dispute, status)

pi_specialty is a textual attribute describing the specialty of the doctor receiving the donation.
drug_name is the branded name of the drug in the research study (null if not a drug).
device_name is the branded name of the device in the study (null if not a device).
corporation is the name of the pharmaceutical company providing the donation.
amount is a numerical attribute representing the donation amount.
dispute is a Boolean attribute describing whether the research was disputed.
status is a string label describing whether the donation was allowed under the declared research protocol. The goal is to predict disallowed donations.

However, this dataset is very dirty, and the systematic nature of the data corruption can result in an inaccurate model. On the ProPublica website [2], they list numerous types of data problems that had to be cleaned before publishing the data (see Appendix I). For example, the most significant donations were made by large companies whose names were also more often inconsistently represented in the data, e.g., "Pfizer Inc.", "Pfizer Incorporated", "Pfizer". In such scenarios, the effect of systematic error can be serious. Duplicate representations could artificially reduce the correlation between these entities and suspected contributions. Nearly 40,000 of the 250,000 records had either naming inconsistencies or other inconsistencies in labeling the allowed or disallowed status. Without data cleaning, the detection rate using a Support Vector Machine was 66%. Applying the data cleaning to the entire dataset improved this rate to 97% on the clean data (Section 8.6.1), and the experiments describe how ActiveClean can achieve an 80% detection rate with less than 1.6% of the records cleaned.

3. PROBLEM FORMALIZATION

This section formalizes the problems addressed in the paper.

3.1 Notation and Setup

The user provides a relation R, a cleaner C(·), a featurizer F(·), and a convex loss problem defined by the loss φ(·). A total of k records will be cleaned in batches of size b, so there will be k/b iterations. We use the following notation to represent relevant intermediate states:

• Dirty Model: θ^(d) is the model trained on R (without cleaning). This serves as an initialization to ActiveClean.
• Dirty Records: R_dirty ⊆ R is the subset of records that are still dirty. As more data are cleaned, R_dirty → {}.
• Clean Records: R_clean ⊆ R is the subset of records that are clean, i.e., the complement of R_dirty.
• Samples: S is a sample (possibly non-uniform but with known probabilities) of the records R_dirty. The cleaned sample is denoted by S_clean = C(S).
• Clean Model: θ^(c) is the optimal clean model, i.e., the model trained on a fully cleaned relation.
• Current Model: θ^(t) is the current best model at iteration t ∈ {1, ..., k/b}, and θ^(0) = θ^(d).

There are two metrics that we will use to measure the performance of ActiveClean:

Model Error. The model error is defined as ‖θ^(t) − θ^(c)‖.

Testing Error. Let T(θ^(t)) be the out-of-sample testing error when the current best model is applied to the clean data, and T(θ^(c)) be the testing error when the clean model is applied to the clean data. The testing error is defined as T(θ^(t)) − T(θ^(c)).
3.2 Problem 1. Correct Update Problem

Given newly cleaned data S_clean and the current best model θ^(t), the model update problem is to calculate θ^(t+1). θ^(t+1) will have some error with respect to the true model θ^(c), which we denote as:

  error(θ^(t+1)) = ‖θ^(t+1) − θ^(c)‖

Since only a sample of the data are cleaned, it is only meaningful to talk about expected errors. We call the update algorithm "reliable" if the expected error is upper-bounded by a monotonically decreasing function µ of the amount of cleaned data:

  E(error(θ^new)) = O(µ(|S_clean|))

Intuitively, "reliable" means that more cleaning should imply more accuracy.

The Correct Update Problem is to reliably update the model θ^(t) with a sample of cleaned data.

3.3 Problem 2. Efficiency Problem

The efficiency problem is to select S_clean such that the expected error E(error(θ^(t))) is minimized. ActiveClean uses previously cleaned data to estimate the value of data cleaning on new records. Then it draws a sample of records S ⊆ R_dirty. This is a non-uniform sample where each record r has a sampling probability p(r) based on the estimates. We derive the optimal sampling distribution for the SGD updates, and show how the theoretical optimum can be approximated.

The Efficiency Problem is to select a sampling distribution p(·) over all records such that the expected error with respect to the model trained on fully clean data is minimized.
4. ARCHITECTURE

This section presents the ActiveClean architecture.

4.1 Overview

Figure 2 illustrates the ActiveClean architecture. The dotted boxes describe optional components that the user can provide to improve the efficiency of the system.

Figure 2: ActiveClean allows users to train predictive models while progressively cleaning data. The framework adaptively selects the best data to clean and can optionally (denoted with dotted lines) integrate with pre-defined detection rules and estimation algorithms for improved convergence.

4.1.1 Required User Input

Model: The user provides a predictive model (e.g., SVM) specified as a convex loss optimization problem φ(·) and a featurizer F(·) that maps a record to its feature vector x and label y.

Cleaning Function: The user provides a function C(·) (implemented via software or crowdsourcing) that maps dirty records to clean records as per our definition in Section 2.2.

Batches: Data are cleaned in batches of size b, and the user can change this setting if she desires more or less frequent model updates. The choice of b does affect the convergence rate; Section 5 discusses the efficiency and convergence trade-offs of different values of b. We empirically find that a batch size of 50 performs well across different datasets and use that as a default. A cleaning budget k can be used as a stopping criterion once C(·) has been called k times, so the number of iterations of ActiveClean is T = k/b. Alternatively, the user can clean data until the model is of sufficient accuracy to make a decision.

4.1.2 Basic Data Flow

The system first trains the model φ(·) on the dirty dataset to find an initial model θ^(d) that the system will subsequently improve. The sampler selects a sample of b records from the dataset and passes the sample to the cleaner, which executes C(·) for each sampled record and outputs the cleaned versions. The updater uses the cleaned sample to update the weights of the model, thus moving the model closer to the true cleaned model (in expectation). Finally, the system either terminates due to a stopping condition (e.g., C(·) has been called a maximum number of times k, or training error convergence), or passes control to the sampler for the next iteration.

4.1.3 Optimizations

In many cases, such as missing values, errors can be efficiently detected. A user-provided Detector can be used to identify records that are more likely to be dirty, and thus improves the likelihood that the next sample will contain true dirty records. Furthermore, the Estimator uses previously cleaned data to estimate the effect that cleaning a given record will have on the model. These components can be used separately (if only one is supplied) or together to focus the system's cleaning efforts on records that will most improve the model. Section 7 describes several instantiations of these components for different data cleaning problems. Our experiments show that these optimizations can improve model accuracy by up to 2.5x (Section 8.3.2).

4.2 Example

The following example illustrates how a user would apply ActiveClean to address the use case in Section 2.6:

EXAMPLE 1. The analyst chooses to use an SVM model, and manually cleans records by hand (the C(·)). ActiveClean initially selects a sample of 50 records (the default) to show the analyst. She identifies a subset of 15 records that are dirty, fixes them by normalizing the drug and corporation names with the help of a search engine, and corrects the labels with typographical or incorrect values. The system then uses the cleaned records to update the current best model and selects the next sample of 50. The analyst can stop at any time and use the improved model to predict donation likelihoods.
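A minimal sketch of the basic data flow in Section 4.1.2 follows. The component names (sample, clean_fn, update) are hypothetical stand-ins for ActiveClean's sampler, cleaner, and updater, and the stopping condition is simplified to the cleaning budget.

```python
def activeclean(R, train_dirty, sample, clean_fn, update, budget_k, batch_b):
    """Sketch of the ActiveClean loop: initialize with the dirty model, then
    repeatedly sample a batch, clean it, and update the model."""
    theta = train_dirty(R)                  # dirty model theta^(d)
    r_dirty, r_clean = list(R), []
    cleaned = 0
    while cleaned < budget_k and r_dirty:
        batch, probs = sample(r_dirty, batch_b, theta)    # non-uniform sample
        cleaned_batch = [clean_fn(r) for r in batch]      # C(.) on each record
        theta = update(theta, cleaned_batch, probs, r_clean, r_dirty)
        for r, c in zip(batch, cleaned_batch):            # bookkeeping
            r_dirty.remove(r)
            r_clean.append(c)
        cleaned += len(batch)
    return theta
```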
5. UPDATES WITH CORRECTNESS

This section describes an algorithm for reliable model updates. The updater assumes that it is given a sample of data S_dirty from R_dirty, where each i ∈ S_dirty has a known sampling probability p(i). Sections 6 and 7 show how to optimize p(·), and the analysis in this section applies for any sampling distribution p(·) > 0.

5.1 Geometric Derivation

The update algorithm intuitively follows from the convex geometry of the problem. Consider the problem in one dimension (i.e., the parameter θ is a scalar value), so the goal is to find the minimum point θ of a curve l(θ). The consequence of dirty data is that the wrong loss function is optimized. Figure 3A illustrates the consequence of this optimization. The red dotted line shows the loss function on the dirty data. Optimizing this loss function finds θ^(d) at its minimum point (red star). However, the true loss function (w.r.t. the clean data) is in blue, so the optimal value on the dirty data is in fact a suboptimal point on the clean curve (red circle).

Figure 3: (A) A model trained on dirty data can be thought of as a sub-optimal point w.r.t. the clean data. (B) The gradient gives us the direction to move the suboptimal model to approach the true optimum.

The optimal clean model θ^(c) is visualized as a yellow star. The first question is which direction to update θ^(d) (i.e., left or right). For this class of models, given a suboptimal point, the direction to the global optimum is the gradient of the loss function. The gradient is a d-dimensional vector function of the current model θ^(d) and the clean data. Therefore, ActiveClean needs to update θ^(d) some distance γ (Figure 3B):

  θ^new ← θ^(d) − γ·∇φ(θ^(d))

At the optimal point, the magnitude of the gradient will be zero. So intuitively, this approach iteratively moves the model downhill (transparent red circle), correcting the dirty model until the desired accuracy is reached. However, the gradient depends on all of the clean data, which is not available, and ActiveClean will have to approximate the gradient from a sample of newly cleaned data. The main intuition is that if the gradient steps are on average correct, the model still moves downhill, albeit with a reduced convergence rate proportional to the inaccuracy of the sample-based estimate.

To derive a sample-based update rule, the most important property is that sums commute with derivatives and gradients. The convex loss class of models are sums of losses, so given the current best model θ, the true gradient g*(θ) is:

  g*(θ) = ∇φ(θ) = (1/N) Σ_{i=1}^N ∇φ(x_i^(c), y_i^(c), θ)

ActiveClean needs to estimate g*(θ) from a sample S, which is drawn from the dirty data R_dirty. Therefore, the sum has two components: the gradient on the already clean data g_C, which can be computed without cleaning, and g_S, the gradient estimate from a sample of dirty data to be cleaned:

  g(θ) = (|R_clean|/|R|)·g_C(θ) + (|R_dirty|/|R|)·g_S(θ)

g_C can be calculated by applying the gradient to all of the already cleaned records:

  g_C(θ) = (1/|R_clean|) Σ_{i∈R_clean} ∇φ(x_i^(c), y_i^(c), θ)

g_S can be estimated from a sample by taking the gradient w.r.t. each record and re-weighting the average by the respective sampling probabilities. Before taking the gradient, the cleaning function C(·) is applied to each sampled record. Therefore, let S be a sample of data, where each i ∈ S is drawn with probability p(i):

  g_S(θ) = (1/|S|) Σ_{i∈S} (1/p(i)) ∇φ(x_i^(c), y_i^(c), θ)

Then, at each iteration t, the update becomes:

  θ^(t+1) ← θ^(t) − γ·g(θ^(t))

5.2 Model Update Algorithm

To summarize, the algorithm is initialized with θ^(0) = θ^(d), which is the dirty model. There are three user-set parameters: the budget k, the batch size b, and the step size γ. In the following section, we provide references from the convex optimization literature that allow the user to appropriately select these values. At each iteration t = {1, ..., T}, cleaning is applied to a batch of b records selected from the set of candidate dirty records R_dirty. Then, an average gradient is estimated from the cleaned batch and the model is updated. Iterations continue until k = T·b records are cleaned.

1. Calculate the gradient over the sample of newly cleaned data and call the result g_S(θ^(t)).
2. Calculate the average gradient over all of the already clean data in R_clean = R − R_dirty, and call the result g_C(θ^(t)).
3. Apply the following update rule:

  θ^(t+1) ← θ^(t) − γ·( (|R_dirty|/|R|)·g_S(θ^(t)) + (|R_clean|/|R|)·g_C(θ^(t)) )
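The three-step update above might be realized as in the following sketch. Here grad is assumed to be the per-example gradient ∇φ and probs the sampling probabilities p(i); both are assumptions about the caller's interface rather than part of the paper's pseudocode.

```python
import numpy as np

def model_update(theta, cleaned_sample, probs, R_clean, n_dirty, n_total,
                 grad, step_size):
    """theta^(t+1) = theta^(t) - gamma * (|Rdirty|/|R| * gS + |Rclean|/|R| * gC)."""
    # gS: importance-weighted average gradient on the newly cleaned sample.
    gS = np.mean([grad(theta, x, y) / p
                  for (x, y), p in zip(cleaned_sample, probs)], axis=0)
    # gC: average gradient over all previously cleaned records (full step).
    if R_clean:
        gC = np.mean([grad(theta, x, y) for (x, y) in R_clean], axis=0)
    else:
        gC = np.zeros_like(theta)
    n_clean = n_total - n_dirty
    g = (n_dirty / n_total) * gS + (n_clean / n_total) * gC
    return theta - step_size * g
```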
5.3 Analysis with Stochastic Gradient Descent

The update algorithm can be formalized as a member of a class of very well studied algorithms called Stochastic Gradient Descent (SGD). SGD provides a theoretical framework to understand and analyze the update rule and bound the error. Mini-batch stochastic gradient descent is an algorithm for finding the optimal value given the convex loss and data. In mini-batch SGD, random subsets of data are selected at each iteration and the average gradient is computed for every batch.

One key difference with traditional SGD models is that ActiveClean applies a full gradient step on the already clean data and averages it with a stochastic gradient step (i.e., calculated from a sample) on the dirty data. Therefore, ActiveClean iterations can take multiple passes over the clean data but at most a single cleaning pass over the dirty data. The update algorithm can be thought of as a variant of SGD that lazily materializes the clean value. As data is sampled at each iteration, data is cleaned when needed by the optimization. It is well known that even from an arbitrary initialization SGD makes significant progress in less than one epoch (a pass through the entire dataset) [9]. In practice, the dirty model can be much more accurate than an arbitrary initialization, as corruption may only affect a few features, and combined with the full gradient step on the clean data the updates converge very quickly.

Setting the step size γ: There is extensive literature in machine learning on choosing the step size γ appropriately. γ can either be set to a constant or decayed over time. Many machine learning frameworks (e.g., MLLib, Scikit-Learn, Vowpal Wabbit) automatically set learning rates or provide different learning-rate scheduling frameworks. In the experiments, we use a technique called inverse scaling, where there is a parameter γ_0 = 0.1, and at each iteration it decays to γ_t = γ_0/t.

Setting the batch size b: The batch size should be set by the user to have the desired properties. Larger batches will take longer to clean and will make more progress towards the clean model, but will have less frequent model updates. On the other hand, smaller batches are cleaned faster and have more frequent model updates. There are diminishing returns to increasing the batch size, O(1/√b). In the experiments, we use a batch size of 50, which converges fast but allows for frequent model updates. If a data cleaning technique requires a larger batch size than 50, i.e., data cleaning is fast enough that the iteration overhead is significant compared to cleaning 50 records, ActiveClean can apply the updates in smaller batches. For example, the batch size set by the user might be b = 1000, but the model updates after every 50 records are cleaned. We can disassociate the batching requirements of SGD from the batching requirements of the data cleaning technique.

5.3.1 Convergence Conditions and Properties

Convergence properties of batch SGD formulations have been well studied [11]. Essentially, if the gradient estimate is unbiased and the step size is appropriately chosen, the algorithm is guaranteed to converge. In Appendix B, we show that the gradient estimate from ActiveClean is indeed unbiased and that our choice of step size is one that is established to converge. The convergence rates of SGD are also well analyzed [11, 8, 39]. The analysis gives a bound on the error of intermediate models and the expected number of steps before achieving a model within a certain error. For a general convex loss, a batch size b, and T iterations, the convergence rate is bounded by O(σ²/√(bT)), where σ² is the variance in the estimate of the gradient at each iteration:

  σ² = E(‖g − g*‖²)

where g* is the gradient computed over the full data if it were fully cleaned. This property of SGD allows us to bound the model error with a monotonically decreasing function of the number of records cleaned, thus satisfying the reliability condition in the problem statement. If the loss is non-convex, the update procedure will converge towards a local minimum rather than the global minimum (see Appendix C).
5.4 Example

This example describes an application of the update algorithm.

EXAMPLE 2. Recall that the analyst has a dirty SVM model θ^(d) trained on the dirty data. She decides that she has a budget of cleaning 100 records, and decides to clean the 100 records in batches of 10 (set based on how fast she can clean the data and how often she wants to see an updated result). All of the data is initially treated as dirty, with R_dirty = R and R_clean = ∅. The gradient of a basic SVM is given by the following function:

  ∇φ(x, y, θ) = −y·x  if y·(x·θ) ≤ 1
  ∇φ(x, y, θ) = 0     otherwise

For each iteration t, a sample S of 10 records is drawn from R_dirty. ActiveClean then applies the cleaning function to the sample:

  {(x_i^(c), y_i^(c))} = {C(i) : ∀i ∈ S}

Using these values, ActiveClean estimates the gradient on the newly cleaned data:

  g_S(θ) = (1/10) Σ_{i∈S} (1/p(i)) ∇φ(x_i^(c), y_i^(c), θ)

ActiveClean also applies the gradient to the already clean data (initially non-existent):

  g_C(θ) = (1/|R_clean|) Σ_{i∈R_clean} ∇φ(x_i^(c), y_i^(c), θ)

Then, it calculates the update rule:

  θ^(t+1) ← θ^(t) − γ·( (|R_dirty|/|R|)·g_S(θ^(t)) + (|R_clean|/|R|)·g_C(θ^(t)) )

Finally, R_dirty ← R_dirty − S, R_clean ← R_clean + S, and the algorithm continues to the next iteration.
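A sketch of the hinge-loss gradient used in Example 2, in NumPy. The subgradient convention exactly at the margin boundary is an implementation choice, not specified by the paper.

```python
import numpy as np

def svm_gradient(theta, x, y):
    """Subgradient of the hinge loss: -y*x inside the margin, 0 outside."""
    if y * np.dot(x, theta) <= 1.0:
        return -y * x
    return np.zeros_like(x)
```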
6. EFFICIENCY WITH SAMPLING

The updater receives a sample with probabilities p(·). For any distribution where p(·) > 0, we can preserve correctness. ActiveClean uses a sampling algorithm that selects the most valuable records to clean with higher probability.

6.1 Oracle Sampling Problem

Recall that the convergence rate of an SGD algorithm is bounded by σ², which is the variance of the gradient. Intuitively, the variance measures how accurately the gradient is estimated from a uniform sample. Other sampling distributions, while preserving the sample expected value, may have a lower variance. Thus, the oracle sampling problem is defined as a search over sampling distributions to find the minimum-variance sampling distribution.

DEFINITION 1 (ORACLE SAMPLING PROBLEM). Given a set of candidate dirty data R_dirty, for all r ∈ R_dirty find sampling probabilities p(r) such that, over all samples S of size k, they minimize E(‖g_S − g*‖²).

It can be shown that the optimal distribution over records in R_dirty assigns probabilities proportional to:

  p_i ∝ ‖∇φ(x_i^(c), y_i^(c), θ^(t))‖

This is an established result; for thoroughness, we provide a proof in the appendix (Section D). Intuitively, records with higher gradients should be sampled with higher probability as they affect the update more significantly. However, ActiveClean cannot exclude records with lower gradients, as that would induce a bias hurting convergence. The problem is that the optimal distribution leads to a chicken-and-egg problem: the optimal sampling distribution requires knowing (x_i^(c), y_i^(c)); however, cleaning is required to know those values.

6.2 Dirty Gradient Solution

Such an oracle does not exist, and one solution is to use the gradient w.r.t. the dirty data:

  p_i ∝ ‖∇φ(x_i^(d), y_i^(d), θ^(t))‖

It turns out that this solution works reasonably well in practice on our experimental datasets and has been studied in Machine Learning as the Expected Gradient Length heuristic [31]. The contribution of this work is integrating this heuristic with statistically correct updates. Intuitively, however, approximating the oracle as closely as possible can result in improved prioritization. The subsequent section describes two components, the detector and the estimator, that can be used to improve the convergence rate. Our experiments suggest up to a 2x improvement in convergence when using these optional optimizations (Section 8.3.2).
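The dirty-gradient heuristic can be sketched as follows: sampling probabilities proportional to the gradient norm on the still-dirty records, normalized over the candidate set. The grad argument is assumed to be the per-example gradient ∇φ, and the small floor keeps every probability strictly positive as required for the unbiased update.

```python
import numpy as np

def dirty_gradient_probabilities(theta, dirty_records, grad):
    """p(i) proportional to ||grad phi(x_d, y_d, theta)||, normalized to sum to 1."""
    norms = np.array([np.linalg.norm(grad(theta, x, y)) for x, y in dirty_records])
    norms = np.maximum(norms, 1e-12)   # keep p(i) > 0 so the estimate stays unbiased
    return norms / norms.sum()
```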
7. OPTIMIZATIONS

In this section, we describe two approaches to optimization, the Detector and the Estimator, that improve the efficiency of the cleaning process. Both approaches are designed to increase the likelihood that the Sampler will pick dirty records that, once cleaned, most move the model towards the true clean model. The Detector is intended to learn the characteristics that distinguish dirty records from clean records, while the Estimator is designed to estimate the amount that cleaning a given dirty record will move the model towards the true optimal model.

7.1 The Detector

The detector returns two important aspects of a record: (1) whether the record is dirty, and (2) if it is dirty, what is wrong with the record. The sampler can use (1) to select a subset of dirty records to sample at each batch, and the estimator can use (2) to estimate the value of data cleaning based on other records with the same corruption. ActiveClean supports two types of detectors: a priori and adaptive. The former assumes that we know the set of dirty records and how they are dirty prior to running ActiveClean, while the latter adaptively learns characteristics of the dirty data as part of running ActiveClean.

7.1.1 A Priori Detector

For many types of dirtiness, such as missing attribute values and constraint violations, it is possible to efficiently enumerate a set of corrupted records and determine how the records are corrupted.

DEFINITION 2 (A PRIORI DETECTION). Let r be a record in R. An a priori detector is a detector that returns a Boolean indicating whether the record is dirty and a set of columns e_r that are dirty:

  D(r) = ({0,1}, e_r)

From the set of columns that are dirty, find the corresponding features that are dirty f_r and labels that are dirty l_r.

Here is an example of this definition using a data cleaning methodology proposed in the literature (a code sketch follows at the end of this subsection).

Constraint-based Repair: One model for detecting errors involves declaring constraints on the database.

Detection. Let Σ be a set of constraints on the relation R. In the detection step, the detector selects a subset of records R_dirty ⊆ R that violate at least one constraint. The set e_r is the set of columns of each record which have a constraint violation.

EXAMPLE 3. An example of a constraint on the running example dataset is that the status of a contribution can only be "allowed" or "disallowed". Any other value for status is an error.

7.2 Adaptive Detection

A priori detection is not possible in all cases. The detector also supports adaptive detection, where detection is learned from previously cleaned data. Note that this "learning" is distinct from the "learning" at the end of the pipeline. The challenge in formulating this problem is that the detector needs to describe how the data is dirty (e.g., e_r in the a priori case). The detector achieves this by categorizing the corruption into u classes. These classes are corruption categories that do not necessarily align with features, but every record is classified with at most one category.

When using adaptive detection, the repair step has to clean the data and report to which of the u classes the corrupted record belongs. When an example (x, y) is cleaned, the repair step labels it with one of the classes 1, 2, ..., u+1 (including one for "not dirty"). It is possible that u increases with each iteration as more types of dirtiness are discovered. In many real-world datasets, data errors have locality, where similar records tend to be similarly corrupted. There are usually a small number of error classes even if a large number of records are corrupted.

One approach for adaptive detection is to use a statistical classifier. This approach is particularly suited for a small number of data error classes, each of which contains many erroneous records. The problem can be addressed by any classifier, and we use an all-versus-one SVM in our experiments. Another approach could be to adaptively learn predicates that define each of the error classes. For example, if records with certain attributes are corrupted, a pattern tableau can be assigned to each class to select a set of possibly corrupted records. This approach is better suited than a statistical approach for a large number of error classes or for scarcity of errors. However, it relies on errors being well aligned with certain attribute values.

DEFINITION 3 (ADAPTIVE CASE). Select the set of records for which the classifier κ gives a positive error classification (i.e., one of the u error classes). After each sample of data is cleaned, the classifier κ is retrained. So the result is:

  D(r) = ({1,0}, {1, ..., u+1})

Adaptive Detection With OpenRefine:

EXAMPLE 4. OpenRefine is a spreadsheet-based tool that allows users to explore and transform data. However, it is limited to cleaning data that can fit in memory on a single computer. Since the cleaning operations are coupled with data exploration, ActiveClean does not know what is dirty in advance (the analyst may discover new errors as she cleans). Suppose the analyst wants to use OpenRefine to clean the running example dataset with ActiveClean. She takes a sample of data from the entire dataset and uses the tool to discover errors. For example, she finds that some drugs are incorrectly classified as both drugs and devices. She then removes the device attribute for all records that have the drug name in question. As she fixes the records, she tags each one with a category tag indicating which corruption it belongs to.
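A minimal sketch of an a priori, constraint-based detector in the sense of Definition 2. It assumes records are dict-like and that each constraint is a (column, predicate) pair; both are illustrative choices, not an interface defined by the paper.

```python
def a_priori_detector(record, constraints):
    """D(r) = (is_dirty, e_r): e_r is the set of columns violating a constraint.

    `constraints` is assumed to be a list of (column, predicate) pairs, e.g.
    [("status", lambda v: v in {"allowed", "disallowed"})].
    """
    dirty_columns = {col for col, predicate in constraints
                     if not predicate(record.get(col))}
    return (len(dirty_columns) > 0, dirty_columns)
```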
7.3 The Estimator

To get around the problem with oracle sampling, the estimator estimates the cleaned value using previously cleaned data. The estimator also takes advantage of the detector from the previous section. There are a number of different approaches, such as regression, that could be used to estimate the cleaned value given the dirty values. However, there is a problem of scarcity, where errors may affect only a small number of records. As a result, the regression approach would have to learn a multivariate function with only a few examples; thus, high-dimensional regression is ill-suited for the estimator. Conversely, one could try a very simple estimator that just calculates an average change and adds this change to all of the gradients. This estimator can be highly inaccurate, as it also applies the change to records that are known to be clean.

ActiveClean leverages the detector for an estimator between these two extremes. The estimator calculates average changes feature-by-feature and selectively corrects the gradient when a feature is known to be corrupted based on the detector. It also applies a linearization that leads to improved estimates when the sample size is small. We evaluate the linearization in Section 8.5 against alternatives, and find that it provides more accurate estimates for a small number of samples cleaned. The result is a biased estimator, and when the number of cleaned samples is large the alternative techniques are comparable or even slightly better due to the bias.

Estimation For A Priori Detection.
If most of the features are correct, it would seem like the gradient is only incorrect in one or two of its components. The problem is that the gradient ∇φ(·) can be a very non-linear function of the features that couples features together. For example, the gradient for linear regression is:

  ∇φ(x, y, θ) = (θᵀx − y)x

It is not possible to isolate the effect of a change of one feature on the gradient. Even if only one of the features is corrupted, all of the gradient components will be incorrect.

To address this problem, the gradient can be approximated in a way that decouples the effects of dirty features on the gradient. Recall, in the a priori detection problem, that associated with each r ∈ R_dirty is a set of errors f_r, l_r that identifies the corrupted features and labels. This property can be used to construct a coarse estimate of the clean value. The main idea is to calculate average changes for each feature, and then, given an uncleaned (but dirty) record, add these average changes to correct the gradient.

To formalize the intuition, instead of computing the actual gradient with respect to the true clean values, compute the conditional expectation given that the set of features and labels f_r, l_r are corrupted:

  p_i ∝ E(∇φ(x_i^(c), y_i^(c), θ^(t)) | f_r, l_r)

Corrupted features are defined such that:

  i ∉ f_r ⟹ x^(c)[i] − x^(d)[i] = 0
  i ∉ l_r ⟹ y^(c)[i] − y^(d)[i] = 0

The needed approximation represents a linearization of the errors, and the resulting approximation will be of the form:

  p(r) ∝ ‖∇φ(x, y, θ^(t)) + M_x·Δ_rx + M_y·Δ_ry‖

where M_x, M_y are matrices and Δ_rx and Δ_ry are vectors with one component for each feature and label, where each value is the average change for those features that are corrupted and 0 otherwise. Essentially, it is the gradient with respect to the dirty data plus a linear correction factor. In the appendix, we present a derivation using a Taylor series expansion and the M_x and M_y matrices for common convex losses (Appendices E and F). The appendix also describes how to maintain Δ_rx and Δ_ry as cleaning progresses.

Estimation For Adaptive Case.
A similar procedure holds in the adaptive setting; however, it requires reformulation. Here, ActiveClean uses the u corruption classes provided by the detector. Instead of conditioning on the features that are corrupted, the estimator conditions on the classes. So for each error class, it computes a Δ_ux and Δ_uy: the average change in the features given that class and the average change in the labels given that class.

  p(r_u) ∝ ‖∇φ(x, y, θ^(t)) + M_x·Δ_ux + M_y·Δ_uy‖

7.3.1 Example

Here is an example of using the optimization to select a sample of data for cleaning.

EXAMPLE 5. Consider using ActiveClean with an a priori detector. Let us assume that there are no errors in the labels and only errors in the features. Then, each training example will have a set of corrupted features (e.g., {1,2,6}, {1,2,15}). Suppose that the cleaner has just cleaned the records r1 and r2, represented as tuples with their corrupted feature sets: (r1, {1,2,3}), (r2, {1,2,6}). For each feature i, ActiveClean maintains the average change between the dirty and clean values in a vector Δ_x[i] over those records corrupted on that feature. Then, given a new record (r3, {1,2,3,6}), Δ_r3x is the vector Δ_x where component i is set to 0 if the feature is not corrupted. Suppose the data analyst is using an SVM; then the M_x matrix is as follows:

  M_x[i,i] = −y[i]  if y·(x·θ) ≤ 1
  M_x[i,i] = 0      otherwise

Thus, we calculate a sampling weight for record r3:

  p(r3) ∝ ‖∇φ(x, y, θ^(t)) + M_x·Δ_r3x‖

To turn the result into a probability distribution, ActiveClean normalizes over all dirty records.
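One way Example 5 might be realized for an SVM: maintain the average per-feature change Δ_x from previously cleaned records, zero it outside a record's corrupted feature set, and score the record by the corrected gradient norm. The function and variable names are illustrative assumptions, and the diagonal M_x follows the SVM case sketched above.

```python
import numpy as np

def update_average_change(delta_x, counts, x_dirty, x_clean, corrupted):
    """Running mean of (clean - dirty) per corrupted feature, i.e. Delta_x[i]."""
    for i in corrupted:
        counts[i] += 1
        delta_x[i] += (x_clean[i] - x_dirty[i] - delta_x[i]) / counts[i]

def estimated_sampling_weight(theta, x, y, delta_x, corrupted):
    """p(r) proportional to ||grad phi(x, y, theta) + M_x . Delta_rx|| for an SVM."""
    inside_margin = y * np.dot(x, theta) <= 1.0
    grad = -y * x if inside_margin else np.zeros_like(x)
    mask = np.isin(np.arange(len(x)), list(corrupted))
    delta_r = np.where(mask, delta_x, 0.0)           # zero outside f_r
    m_diag = -y * np.ones_like(x) if inside_margin else np.zeros_like(x)
    return np.linalg.norm(grad + m_diag * delta_r)
```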
8. EXPERIMENTS

First, the experiments evaluate how various types of corrupted data benefit from data cleaning. Next, the experiments explore different prioritization and model update schemes for progressive data cleaning. Finally, ActiveClean is evaluated end-to-end in a number of real-world data cleaning scenarios.

8.1 Experimental Setup and Notation

The main metric for evaluation is a relative measure of the trained model and the model if all of the data were cleaned.

Relative Model Error. Let θ be the model trained on the dirty data, and let θ* be the model trained on the same data if it were cleaned. Then the model error is defined as ‖θ − θ*‖/‖θ*‖.

8.1.1 Scenarios

Income Classification (Adult): In this dataset of 45,552 records, the task is to predict the income bracket (binary) from 12 numerical and categorical covariates with an SVM classifier.

Seizure Classification (EEG): In this dataset, the task is to predict the onset of a seizure (binary) from 15 numerical covariates with a thresholded Linear Regression. There are 14,980 data points in this dataset. This classification task is inherently hard, with an accuracy on completely clean data of only 65%.

Handwriting Recognition (MNIST) [http://ufldl.stanford.edu/wiki/index.php/Using_the_MNIST_Dataset]: In this dataset, the task is to classify 60,000 images of handwritten digits into 10 categories with a one-to-all multiclass SVM classifier. The unique part of this dataset is that the featurized data consists of a 784-dimensional vector which includes edge detectors and raw image patches.

Dollars For Docs: The dataset has 240,089 records with 5 textual attributes and one numerical attribute. The dataset is featurized with a bag-of-words featurization model for the textual attributes, which resulted in a 2021-dimensional feature vector, and a binary SVM is used to classify the status of the medical donations.

WorldBank: The dataset has 193 records of country name, population, and various macro-economic statistics. The values are listed with the date at which they were acquired. This allowed us to determine that records from smaller and less populous countries were more likely to be out-of-date.

8.1.2 Compared Algorithms

Here are the alternative methodologies evaluated in the experiments:

Robust Logistic Regression [14]. Feng et al. proposed a variant of logistic regression that is robust to outliers. We chose this algorithm because it is a robust extension of the convex regularized loss model, leading to a better apples-to-apples comparison between the techniques. (See details in Appendix H.1.)

Discarding Dirty Data. As a baseline, dirty data are discarded.

SampleClean (SC) [33]. SampleClean takes a sample of data, applies data cleaning, and then trains a model to completion on the sample.

Active Learning (AL) [18]. To fairly evaluate Active Learning, we first apply our gradient update to ensure correctness. Within each iteration, examples are prioritized by distance to the decision boundary (called Uncertainty Sampling in [31]). However, we do not include our optimizations such as detection and estimation.

ActiveClean Oracle (AC+O). In ActiveClean Oracle, instead of an estimation and detection step, the true clean value is used, to evaluate the theoretical ideal performance of ActiveClean.

8.2 Does Data Cleaning Matter?

The first experiment evaluates the benefits of data cleaning on two of the example datasets (EEG and Adult). Our goal is to understand which types of data corruption are amenable to data cleaning and which are better suited for robust statistical techniques. The experiment compares four schemes: (1) full data cleaning, (2) a baseline of no cleaning, (3) discarding the dirty data, and (4) robust logistic regression. We corrupted 5% of the training examples in each dataset in two different ways:

Random Corruption: Simulated high-magnitude random outliers. 5% of the examples are selected at random and a random feature is replaced with 3 times the highest feature value.

Systematic Corruption: Simulated innocuous-looking (but still incorrect) systematic corruption. The model is trained on the clean data, and the three most important features (highest weighted) are identified. The examples are sorted by each of these features and the top examples are corrupted with the mean value for that feature (5% corruption in all). It is important to note that examples can have multiple corrupted features.

Figure 4 shows the test accuracy for models trained on both types of data with the different techniques. The robust method performs well on the random high-magnitude outliers, with only a 2.0% reduction in clean test accuracy for EEG and a 2.5% reduction for Adult. In the random setting, discarding dirty data also performs relatively well. However, the robust method falters on the systematic corruption, with a 9.1% reduction in clean test accuracy for EEG and a 10.5% reduction for Adult. The problem is that without cleaning, there is no way to know whether the corruption is random or systematic and when to trust a robust method. While data cleaning requires more effort, it provides benefits in both settings. In the remaining experiments, unless otherwise noted, the experiments use systematic corruption.

Figure 4: (a) Robust techniques and discarding data work when corrupted data are random and look atypical. (b) Data cleaning can provide reliable performance in both the systematically corrupted setting and the randomly corrupted setting.

Summary: A 5% systematic corruption can introduce a 10% reduction in test accuracy even when using a robust method.

8.3 ActiveClean: A Priori Detection

The next set of experiments evaluates different approaches to cleaning a sample of data compared to ActiveClean using a priori detection. A priori detection assumes that all of the corrupted records are known in advance but their clean values are unknown.

8.3.1 Active Learning and SampleClean

The next experiment evaluates the samples-to-error trade-off between four alternative algorithms: ActiveClean (AC), SampleClean, Active Learning, and ActiveClean+Oracle (AC+O).

Figure 5: The relative model error as a function of the number of examples cleaned. ActiveClean converges with a smaller sample size to the true result in comparison to Active Learning and SampleClean.

Figure 5 shows the model error and test accuracy as a function of the number of cleaned records. In terms of model error, ActiveClean gives its largest benefits for small sample sizes. For 500 cleaned records of the Adult dataset, ActiveClean has 6.1x less error than SampleClean and 2.1x less error than Active Learning. For 500 cleaned records of the EEG dataset, ActiveClean has 9.6x less error than SampleClean and 2.4x less error than Active Learning. Both Active Learning and ActiveClean benefit from the initialization with the dirty model, as they do not retrain their models from scratch, and ActiveClean improves on this performance with detection and error estimation. Active Learning has no notion of dirty and clean data, and therefore prioritizes with respect to the dirty data.
These gains in model error also correlate well with improvements in test error (defined as the test accuracy difference w.r.t. cleaning all data). The test error converges more quickly than the model error, emphasizing the benefits of progressive data cleaning, since it is not necessary to clean all the data to get a model with essentially the same performance as the clean model. For example, to achieve a test error of 1% on the Adult dataset, ActiveClean cleans 500 fewer records than Active Learning.

Summary: ActiveClean with a priori detection returns results that are more than 6x more accurate than SampleClean and 2x more accurate than Active Learning for cleaning 500 records.

8.3.2 Source of Improvements

The next experiment compares the performance of ActiveClean with and without various optimizations at the point of 500 records cleaned. ActiveClean without detection is denoted as AC-D (that is, at each iteration we sample from the entire dirty data), and ActiveClean without detection and without importance sampling is denoted as AC-D-I. Figure 6 plots the relative error of the alternatives and of ActiveClean with and without the optimizations. Without detection (AC-D), ActiveClean is still more accurate than Active Learning. Removing the importance sampling as well, ActiveClean is slightly worse than Active Learning on the Adult dataset but is comparable on the EEG dataset.

Figure 6: -D denotes no detection, and -D-I denotes no detection and no importance sampling. Both optimizations significantly help ActiveClean outperform SampleClean and Active Learning.

Summary: Both a priori detection and non-uniform sampling significantly contribute to the gains over Active Learning.

8.3.3 Mixing Dirty and Clean Data

Training a model on mixed data is an unreliable methodology lacking the same guarantees as Active Learning or SampleClean, even in the simplest of cases. For thoroughness, the next experiments include the model error as a function of records cleaned in comparison to ActiveClean. Figure 7 plots the same curves as the previous experiment comparing ActiveClean, Active Learning, and two mixed-data algorithms. PC randomly samples data, cleans, and writes back the cleaned data. PC+D randomly samples data using the dirty data detector, cleans, and writes back the cleaned data. For these errors, PC and PC+D give reasonable results.
