ActiveClean: Interactive Data Cleaning While Learning Convex Loss Models

Sanjay Krishnan, Jiannan Wang, Eugene Wu†, Michael J. Franklin, Ken Goldberg
UC Berkeley, †Columbia University
{sanjaykrishnan, jnwang, franklin, goldberg}@berkeley.edu
ewu@cs.columbia.edu
ABSTRACT
Data cleaning is often an important step to ensure that predictive models, such as regression and classification, are not affected by systematic errors such as inconsistent, out-of-date, or outlier data. Identifying dirty data is often a manual and iterative process, and can be challenging on large datasets. However, many data cleaning workflows can introduce subtle biases into the training processes due to violation of independence assumptions. We propose ActiveClean, a progressive cleaning approach where the model is updated incrementally instead of re-training and can guarantee accuracy on partially cleaned data. ActiveClean supports a popular class of models called convex loss models (e.g., linear regression and SVMs). ActiveClean also leverages the structure of a user's model to prioritize cleaning those records likely to affect the results. We evaluate ActiveClean on five real-world datasets (UCI Adult, UCI EEG, MNIST, Dollars For Docs, and WorldBank) with both real and synthetic errors. Our results suggest that our proposed optimizations can improve model accuracy by up to 2.5x for the same amount of data cleaned. Furthermore, for a fixed cleaning budget and on all real dirty datasets, ActiveClean returns more accurate models than uniform sampling and Active Learning.

1. INTRODUCTION
Machine Learning on large and growing datasets is a key data management challenge with significant interest in both industry and academia [1, 5, 10, 20]. Despite a number of breakthroughs in reducing training time, predictive modeling can still be a tedious and time-consuming task for an analyst. Data often arrive dirty, including missing, incorrect, or inconsistent attributes, and analysts widely report that data cleaning and other forms of pre-processing account for up to 80% of their effort [3, 22]. While data cleaning is an extensively studied problem, the predictive modeling setting poses a number of new challenges: (1) high dimensionality can amplify even a small amount of erroneous records [36], (2) the complexity can make it difficult to trace the consequences of an error, and (3) there are often subtle technical conditions (e.g., independent and identically distributed) that can be violated by data cleaning. Consequently, techniques that have been designed for traditional SQL analytics may be inefficient or even unreliable. In this paper, we study the relationship between data cleaning and model training workflows and explore how to apply existing data cleaning approaches with provable guarantees.

One of the main bottlenecks in data cleaning is the human effort in determining which data are dirty and then developing rules or software to correct the problems. For some types of dirty data, such as inconsistent values, model training may seemingly succeed, albeit with potentially subtle inaccuracies in the model. For example, battery-powered sensors can transmit unreliable measurements when battery levels are low [21]. Similarly, data entered by humans can be susceptible to a variety of inconsistencies (e.g., typos) and unintentional cognitive biases [23]. Such problems are often addressed in a time-consuming loop where the analyst trains a model, inspects the model and its predictions, cleans some data, and re-trains.

This iterative process is the de facto standard, but without appropriate care it can lead to several serious statistical issues. Due to the well-known Simpson's paradox, models trained on a mix of dirty and clean data can have very misleading results even in simple scenarios (Figure 1). Furthermore, if the candidate dirty records are not identified with a known sampling distribution, the statistical independence assumptions for most training methods are violated. The violations of these assumptions can introduce confounding biases. To this end, we designed ActiveClean, which trains predictive models while allowing for iterative data cleaning and has accuracy guarantees. ActiveClean automates the dirty data identification process and the model update process, thereby abstracting these two error-prone steps away from the analyst.

ActiveClean is inspired by the recent success of progressive data cleaning, where a user can gradually clean more data until the desired accuracy is reached [6, 34, 30, 17, 26, 38, 37]. We focus on a popular class of models called convex loss models (e.g., linear regression and SVMs) and show that the Simpson's paradox problem can be avoided using iterative maintenance of a model rather than re-training. This process leverages the convex structure of the model rather than treating it like a black-box, and we apply convergence arguments from convex optimization theory. We propose several novel optimizations that leverage information from the model to guide data cleaning towards the records most likely to be dirty and most likely to affect the results.

To summarize the contributions:
• Correctness (Section 5). We show how to update a dirty model given newly cleaned data. This update converges monotonically in expectation. For a batch size b and iterations T, it converges with rate O(1/√(bT)).
• Efficiency (Section 6). We derive a theoretically optimal sampling distribution that minimizes the update error and an approximation to estimate the theoretical optimum.
• Detection and Estimation (Section 7). We show how ActiveClean can be integrated with data detection to guide data cleaning towards records expected to be dirty.
• The experiments evaluate these components on five datasets with real and synthetic corruption (Section 8). Results suggest that for a fixed cleaning budget, ActiveClean returns more accurate models than uniform sampling and Active Learning when systematic corruption is sparse.
2. BACKGROUND AND PROBLEM SETUP
This section formalizes the iterative data cleaning and training process and highlights an example application.

2.1 Predictive Modeling
The user provides a relation R and wishes to train a model using the data in R. This work focuses on a class of well-analyzed predictive analytics problems: ones that can be expressed as the minimization of convex loss functions. Convex loss minimization problems are amenable to a variety of incremental optimization methodologies with provable guarantees (see Friedman, Hastie, and Tibshirani [15] for an introduction). Examples include generalized linear models (including linear and logistic regression) and support vector machines, and, in fact, means and medians are also special cases.

We assume that the user provides a featurizer F(·) that maps every record r ∈ R to a feature vector x and label y. For labeled training examples {(x_i, y_i)}_{i=1}^N, the problem is to find a vector of model parameters θ by minimizing a loss function φ over all training examples:

θ* = argmin_θ Σ_{i=1}^N φ(x_i, y_i, θ)

where φ is a convex function in θ. For example, in a linear regression φ is:

φ(x_i, y_i, θ) = ‖θᵀx_i − y_i‖₂²

Typically, a regularization term r(θ) is added to this problem. r(θ) penalizes high or low values of feature weights in θ to avoid overfitting to noise in the training examples.

θ* = argmin_θ Σ_{i=1}^N φ(x_i, y_i, θ) + r(θ)    (1)

In this work, without loss of generality, we will include the regularization as part of the loss function, i.e., φ(x_i, y_i, θ) includes r(θ).

2.2 Data Cleaning
We consider corruption that affects the attribute values of records. This does not cover errors that simultaneously affect multiple records, such as record duplication, or structure, such as schema transformation. Examples of supported cleaning operations include batch resolving common inconsistencies (e.g., merging "U.S.A" and "United States"), filtering outliers (e.g., removing records with values > 1e6), and standardizing attribute semantics (e.g., "1.2 miles" and "1.93 km").

We are particularly interested in those errors that are difficult or time-consuming to clean, and require the analyst to examine an erroneous record and determine the appropriate action – possibly leveraging knowledge of the current best model. We represent this operation as Clean(·), which can be applied to a record r (or a set of records) to recover the clean record r′ = Clean(r). Formally, we treat Clean(·) as an expensive user-defined function composed of deterministic schema-preserving map and filter operations applied to a subset of rows in the relation. A relation is defined as clean if R_clean = Clean(R_clean). Therefore, for every r_clean ∈ R_clean there exists a unique r′ ∈ R in the dirty data. The map and filter cleaning model is not a fundamental restriction of ActiveClean, and Appendix A discusses a compatible "set of records" cleaning model.

2.3 Iteration
As an example of how Clean(·) fits into an iterative analysis process, consider an analyst training a regression and identifying outliers. When she examines one of the outliers, she realizes that the base data (prior to featurization) has a formatting inconsistency that leads to incorrect parsing of the numerical values. She applies a batch fix (i.e., Clean(·)) to all of the outliers with the same error, and re-trains the model. This iterative process can be described as the following pseudocode loop:

1. Init(iter)
2. current_model = Train(R)
3. For each t in {1,...,iter}
   (a) dirty_sample = Identify(R, current_model)
   (b) clean_sample = Clean(dirty_sample)
   (c) current_model = Update(clean_sample, R)
4. Output: current_model

While we have already discussed Train(·) and Clean(·), the analyst still has to define the primitives Identify(·) and Update(·). For Identify(·), given the current best model, the analyst must specify some criteria to select a set of records to examine. And in Update(·), the analyst must decide how to update the model given newly cleaned data. It turns out that these primitives are not trivial to implement since the straightforward solutions can actually lead to divergence of the trained models.

2.4 Challenges
Correctness: Let us assume that the analyst has implemented an Identify(·) function that returns k candidate dirty records. The straightforward application of data cleaning is to repair the corruption in place and re-train the model after each repair. Suppose k ≪ N records are cleaned, but all of the remaining dirty records are retained in the dataset. Figure 1 highlights the dangers of this approach on a very simple dirty dataset and a linear regression model, i.e., the best fit line for two variables. One of the variables is systematically corrupted with a translation in the x-axis (Figure 1a). The dirty data is marked in red and the clean data in blue, and they are shown with their respective best fit lines. After cleaning only two of the data points (Figure 1b), the resulting best fit line is in the opposite direction of the true model.

Aggregates over mixtures of different populations of data can result in spurious relationships due to the well-known phenomenon called Simpson's paradox [32]. Simpson's paradox is by no means a corner case, and it has affected the validity of a number of high-profile studies [35], even in the simple case of taking an average over a dataset. Predictive models are high-dimensional generalizations of these aggregates without closed form techniques to compensate for these biases. Thus, training models on a mixture of dirty and clean data can lead to unreliable results, where artificial trends introduced by the mixture can be confused for the effects of data cleaning.
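The flipped best-fit line of Figure 1b can be reproduced with a small numerical sketch. The data below is our own toy illustration (not the paper's figure data): a pure x-translation leaves the dirty-data slope intact, but fitting on a mixture of mostly-dirty and two cleaned records reverses the sign of the slope.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of y on x (pure Python)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    var = sum((a - mx) ** 2 for a in xs)
    return cov / var

# True model: y = x (slope +1).
x_true = list(range(10))
y = list(range(10))

# Systematic corruption: every x is translated by +20.
x_dirty = [a + 20 for a in x_true]

# Clean only the two highest-leverage records; the rest stay dirty.
x_mixed = x_dirty[:8] + x_true[8:]

print(slope(x_true, y))   # +1.0: the true model
print(slope(x_dirty, y))  # +1.0: a translation alone preserves the slope
print(slope(x_mixed, y))  # negative: the mixed fit points the wrong way
```

Retraining on the mixture is strictly worse here than doing no cleaning at all, which is exactly the danger the section describes.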
Figure 1: (a) Systematic corruption in one variable can lead to a shifted model. (b) Mixed dirty and clean data results in a less accurate model than no cleaning. (c) Small samples of only clean data can result in similarly inaccurate models.

An alternative is to avoid the dirty data altogether instead of mixing the two populations, and the model re-training is restricted to only data that are known to be clean. This approach is similar to SampleClean [33], which was proposed to approximate the results of aggregate queries by applying them to a clean sample of data. However, high-dimensional models are highly sensitive to sample size. Figure 1c illustrates that, even in two dimensions, models trained from small samples can be as incorrect as the mixing solution described before.

Efficiency: Conversely, hypothetically assume that the analyst has implemented a correct Update(·) primitive and implements Identify(·) with a technique such as Active Learning to select records to clean [37, 38, 16]. Active learning is a technique to carefully select the set of examples to learn the most accurate model. However, these selection criteria are designed for stationary data distributions, an assumption which is not true in this setting. As more data are cleaned, the data distribution changes. Data which may look unimportant in the dirty data might be very valuable to clean in reality, and thus any prioritization has to predict a record's value with respect to an anticipated clean model.

2.5 The Need For Automation
ActiveClean is a framework that implements the Identify(·) and Update(·) primitives for the analyst. By automating the iterative process, ActiveClean ensures reliable models with convergence guarantees. The analyst first initializes ActiveClean with a dirty model. ActiveClean carefully selects small batches of data to clean based on data that are likely to be dirty and likely to affect the model. The analyst applies data cleaning to these batches, and ActiveClean updates the model with an incremental optimization technique.

Machine learning has been applied in prior work to improve the efficiency of data cleaning [37, 38, 16]. Human input, either for cleaning or validation of automated cleaning, is often expensive and impractical for large datasets. A model can learn rules from a small set of examples cleaned (or validated) by a human, and active learning is a technique to carefully select the set of examples to learn the most accurate model. This model can be used to extrapolate repairs to not-yet-cleaned data, and the goal of these approaches is to provide the cleanest possible dataset – independent of the subsequent analytics or query processing. These approaches, while very effective, suffer from composability problems when placed inside cleaning and training loops. To summarize, ActiveClean considers data cleaning during model training, while these techniques consider model training for data cleaning. One of the primary contributions of this work is an incremental model update algorithm with correctness guarantees for mixtures of data.

2.6 Use Case: Dollars for Docs [2]
ProPublica collected a dataset of corporate donations to doctors to analyze conflicts of interest. They reported that some doctors received over $500,000 in travel, meals, and consultation expenses [4]. ProPublica laboriously curated and cleaned a dataset from the Centers for Medicare and Medicaid Services that listed nearly 250,000 research donations, and aggregated these donations by physician, drug, and pharmaceutical company. We collected the raw unaggregated data and explored whether suspect donations could be predicted with a model. This problem is typical of analysis scenarios based on observational data seen in finance, insurance, medicine, and investigative journalism. The dataset has the following schema:

Contribution(pi_specialty, drug_name, device_name, corporation, amount, dispute, status)

pi_specialty is a textual attribute describing the specialty of the doctor receiving the donation.
drug_name is the branded name of the drug in the research study (null if not a drug).
device_name is the branded name of the device in the study (null if not a device).
corporation is the name of the pharmaceutical providing the donation.
amount is a numerical attribute representing the donation amount.
dispute is a Boolean attribute describing whether the research was disputed.
status is a string label describing whether the donation was allowed under the declared research protocol. The goal is to predict disallowed donations.

However, this dataset is very dirty, and the systematic nature of the data corruption can result in an inaccurate model. On the ProPublica website [2], they list numerous types of data problems that had to be cleaned before publishing the data (see Appendix I). For example, the most significant donations were made by large companies whose names were also more often inconsistently represented in the data, e.g., "Pfizer Inc.", "Pfizer Incorporated", "Pfizer". In such scenarios, the effect of systematic error can be serious. Duplicate representations could artificially reduce the correlation between these entities and suspected contributions. There were nearly 40,000 of the 250,000 records that had either naming inconsistencies or other inconsistencies in labeling the allowed or disallowed status. Without data cleaning, the detection rate using a Support Vector Machine was 66%. Applying the data cleaning to the entire dataset improved this rate to 97% in the clean data (Section 8.6.1), and the experiments describe how ActiveClean can achieve an 80% detection rate for less than 1.6% of the records cleaned.

3. PROBLEM FORMALIZATION
This section formalizes the problems addressed in the paper.

3.1 Notation and Setup
The user provides a relation R, a cleaner C(·), a featurizer F(·), and a convex loss problem defined by the loss φ(·). A total of k records will be cleaned in batches of size b, so there will be k/b iterations. We use the following notation to represent relevant intermediate states:
• Dirty Model: θ(d) is the model trained on R (without cleaning) with the featurizer F(·) and loss φ(·). This serves as an initialization to ActiveClean.
• Dirty Records: R_dirty ⊆ R is the subset of records that are still dirty. As more data are cleaned, R_dirty → {}.
• Clean Records: R_clean ⊆ R is the subset of records that are clean, i.e., the complement of R_dirty.
• Samples: S is a sample (possibly non-uniform but with known probabilities) of the records R_dirty. The clean sample is denoted by S_clean = C(S).
• Clean Model: θ(c) is the optimal clean model, i.e., the model trained on a fully cleaned relation.
• Current Model: θ(t) is the current best model at iteration t ∈ {1, ..., k/b}, and θ(0) = θ(d).

There are two metrics that we will use to measure the performance of ActiveClean:
Model Error. The model error is defined as ‖θ(t) − θ(c)‖.
Testing Error. Let T(θ(t)) be the out-of-sample testing error when the current best model is applied to the clean data, and T(θ(c)) be the test error when the clean model is applied to the clean data. The testing error is defined as T(θ(t)) − T(θ(c)).

3.2 Problem 1. Correct Update Problem
Given newly cleaned data S_clean and the current best model θ(t), the model update problem is to calculate θ(t+1). θ(t+1) will have some error with respect to the true model θ(c), which we denote as:

error(θ(t+1)) = ‖θ(t+1) − θ(c)‖

Since a sample of data are cleaned, it is only meaningful to talk about expected errors. We call the update algorithm "reliable" if the expected error is upper bounded by a monotonically decreasing function µ of the amount of cleaned data:

E(error(θ_new)) = O(µ(|S_clean|))

Intuitively, "reliable" means that more cleaning should imply more accuracy.
The Correct Update Problem is to reliably update the model θ(t) with a sample of cleaned data.

3.3 Problem 2. Efficiency Problem
The efficiency problem is to select S_clean such that the expected error E(error(θ(t))) is minimized. ActiveClean uses previously cleaned data to estimate the value of data cleaning on new records. Then it draws a sample of records S ⊆ R_dirty. This is a non-uniform sample where each record r has a sampling probability p(r) based on the estimates. We derive the optimal sampling distribution for the SGD updates, and show how the theoretical optimum can be approximated.
The Efficiency Problem is to select a sampling distribution p(·) over all records such that the expected error w.r.t. the model trained on fully clean data is minimized.

4. ARCHITECTURE
This section presents the ActiveClean architecture.

4.1 Overview
Figure 2 illustrates the ActiveClean architecture. The dotted boxes describe optional components that the user can provide to improve the efficiency of the system.

4.1.1 Required User Input
Model: The user provides a predictive model (e.g., SVM) specified as a convex loss optimization problem φ(·) and a featurizer F(·) that maps a record to its feature vector x and label y.
Cleaning Function: The user provides a function C(·) (implemented via software or crowdsourcing) that maps dirty records to clean records as per our definition in Section 2.2.
Batches: Data are cleaned in batches of size b, and the user can change these settings if she desires more or less frequent model updates. The choice of b does affect the convergence rate. Section 5 discusses the efficiency and convergence trade-offs of different values of b. We empirically find that a batch size of 50 performs well across different datasets and use that as a default. A cleaning budget k can be used as a stopping criterion once C(·) has been called k times, and so the number of iterations of ActiveClean is T = k/b. Alternatively, the user can clean data until the model is of sufficient accuracy to make a decision.

4.1.2 Basic Data Flow
The system first trains the model φ(·) on the dirty dataset to find an initial model θ(d) that the system will subsequently improve. The sampler selects a sample of size b records from the dataset and passes the sample to the cleaner, which executes C(·) for each sample record and outputs their cleaned versions. The updater uses the cleaned sample to update the weights of the model, thus moving the model closer to the true cleaned model (in expectation). Finally, the system either terminates due to a stopping condition (e.g., C(·) has been called a maximum number of times k, or training error convergence), or passes control to the sampler for the next iteration.

4.1.3 Optimizations
In many cases, such as missing values, errors can be efficiently detected. A user-provided Detector can be used to identify such records that are more likely to be dirty, and thus improves the likelihood that the next sample will contain true dirty records. Furthermore, the Estimator uses previously cleaned data to estimate the effect that cleaning a given record will have on the model. These components can be used separately (if only one is supplied) or together to focus the system's cleaning efforts on records that will most improve the model. Section 7 describes several instantiations of these components for different data cleaning problems. Our experiments show that these optimizations can improve model accuracy by up to 2.5x (Section 8.3.2).

4.2 Example
The following example illustrates how a user would apply ActiveClean to address the use case in Section 2.6:

EXAMPLE 1. The analyst chooses to use an SVM model, and manually cleans records by hand (the C(·)). ActiveClean initially selects a sample of 50 records (the default) to show the analyst. She identifies a subset of 15 records that are dirty, fixes them by normalizing the drug and corporation names with the help of a search engine, and corrects the labels with typographical or incorrect values. The system then uses the cleaned records to update the current best model and select the next sample of 50. The analyst can stop at any time and use the improved model to predict donation likelihoods.
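The sample → clean → update data flow of Section 4.1.2 can be sketched as a driver loop. This is our own minimal skeleton, not ActiveClean's actual API: the function names and the plug-in callbacks (`clean_fn` for C(·), `update_fn` for the updater) are assumptions for illustration.

```python
import random

def activeclean(R_dirty, clean_fn, update_fn, batch_size=50, budget=100):
    """Hypothetical sketch of the basic data flow: sample a batch of
    dirty records, clean them with the user's C(.), and let the
    updater move the model, until the cleaning budget k is spent."""
    R_clean = []
    model = None
    calls = 0
    while R_dirty and calls < budget:
        # Sampler: draw a batch of (up to) b records from the dirty set.
        batch = random.sample(R_dirty, min(batch_size, len(R_dirty)))
        # Cleaner: apply the user-supplied C(.) to each sampled record.
        cleaned = [clean_fn(r) for r in batch]
        calls += len(batch)
        # Bookkeeping: move the batch from R_dirty to R_clean.
        for r in batch:
            R_dirty.remove(r)
        R_clean.extend(cleaned)
        # Updater: move the model toward the true clean model.
        model = update_fn(model, cleaned, R_clean)
    return model
```

With the default settings this performs k = 100 calls to C(·) in T = k/b = 2 iterations, matching the stopping criterion described in Section 4.1.1.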
Figure 2: ActiveClean allows users to train predictive models while progressively cleaning data. The framework adaptively selects the best data to clean and can optionally (denoted with dotted lines) integrate with predefined detection rules and estimation algorithms for improved convergence.

5. UPDATES WITH CORRECTNESS
This section describes an algorithm for reliable model updates. The updater assumes that it is given a sample of data S_dirty from R_dirty where i ∈ S_dirty has a known sampling probability p(i). Sections 6 and 7 show how to optimize p(·), and the analysis in this section applies for any sampling distribution p(·) > 0.

5.1 Geometric Derivation
The update algorithm intuitively follows from the convex geometry of the problem. Consider the problem in one dimension (i.e., the parameter θ is a scalar value), so then the goal is to find the minimum point θ of a curve l(θ). The consequence of dirty data is that the wrong loss function is optimized. Figure 3A illustrates the consequence of the optimization. The red dotted line shows the loss function on the dirty data. Optimizing the loss function finds θ(d) that is at the minimum point (red star). However, the true loss function (w.r.t. the clean data) is in blue, thus the optimal value on the dirty data is in fact a suboptimal point on the clean curve (red circle).

Figure 3: (A) A model trained on dirty data can be thought of as a sub-optimal point w.r.t. the clean data. (B) The gradient gives us the direction to move the suboptimal model to approach the true optimum.

The optimal clean model θ(c) is visualized as a yellow star. The first question is which direction to update θ(d) (i.e., left or right). For this class of models, given a suboptimal point, the direction to the global optimum is the gradient of the loss function. The gradient is a d-dimensional vector function of the current model θ(d) and the clean data. Therefore, ActiveClean needs to update θ(d) some distance γ (Figure 3B):

θ_new ← θ(d) − γ·∇φ(θ(d))

At the optimal point, the magnitude of the gradient will be zero. So intuitively, this approach iteratively moves the model downhill (transparent red circle) – correcting the dirty model until the desired accuracy is reached. However, the gradient depends on all of the clean data, which is not available, and ActiveClean will have to approximate the gradient from a sample of newly cleaned data. The main intuition is that if the gradient steps are on average correct, the model still moves downhill, albeit with a reduced convergence rate proportional to the inaccuracy of the sample-based estimate.

To derive a sample-based update rule, the most important property is that sums commute with derivatives and gradients. The convex loss class of models are sums of losses, so given the current best model θ, the true gradient g*(θ) is:

g*(θ) = ∇φ(θ) = (1/N) Σ_{i=1}^N ∇φ(x_i(c), y_i(c), θ)

ActiveClean needs to estimate g*(θ) from a sample S, which is drawn from the dirty data R_dirty. Therefore, the sum has two components: the gradient on the already clean data g_C, which can be computed without cleaning, and g_S, the gradient estimate from a sample of dirty data to be cleaned:

g(θ) = (|R_clean|/|R|)·g_C(θ) + (|R_dirty|/|R|)·g_S(θ)

g_C can be calculated by applying the gradient to all of the already cleaned records:

g_C(θ) = (1/|R_clean|) Σ_{i∈R_clean} ∇φ(x_i(c), y_i(c), θ)

g_S can be estimated from a sample by taking the gradient w.r.t. each record, and re-weighting the average by their respective sampling probabilities. Before taking the gradient, the cleaning function C(·) is applied to each sampled record. Therefore, let S be a sample of data, where each i ∈ S is drawn with probability p(i):

g_S(θ) = (1/|S|) Σ_{i∈S} (1/p(i)) ∇φ(x_i(c), y_i(c), θ)

Then, at each iteration t, the update becomes:

θ(t+1) ← θ(t) − γ·g(θ(t))

5.2 Model Update Algorithm
To summarize, the algorithm is initialized with θ(0) = θ(d), which is the dirty model. There are three user-set parameters: the budget k, batch size b, and the step size γ. In the following section, we will provide references from the convex optimization literature that allow the user to appropriately select these values. At each iteration t = {1,...,T}, the cleaning is applied to a batch of data b selected from the set of candidate dirty records R_dirty. Then, an average gradient is estimated from the cleaned batch and the model is updated. Iterations continue until k = T·b records are cleaned.
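The two gradient components and their combination can be transcribed almost literally. The sketch below is our own one-dimensional instantiation for the squared loss φ = (θx − y)², with toy data and hypothetical helper names; the 1/p(i) importance weights assume p(·) is chosen so that the estimate is unbiased, as discussed in Section 6.

```python
def grad(x, y, theta):
    """Gradient of the squared loss phi = (theta*x - y)**2 (1-D case)."""
    return 2.0 * x * (theta * x - y)

def g_clean(R_clean, theta):
    """Full gradient over the already-cleaned records (g_C above)."""
    return sum(grad(x, y, theta) for x, y in R_clean) / len(R_clean)

def g_sample(S, p, theta, clean_fn):
    """Importance-weighted gradient estimate from a dirty sample (g_S
    above). Each record i in S was drawn with probability p[i]."""
    total = 0.0
    for i, (x, y) in S.items():
        xc, yc = clean_fn(x, y)  # apply C(.) before taking the gradient
        total += grad(xc, yc, theta) / p[i]
    return total / len(S)

def g_combined(R_clean, n_dirty, S, p, theta, clean_fn):
    """g = |R_clean|/|R| * g_C + |R_dirty|/|R| * g_S."""
    n = len(R_clean) + n_dirty
    gc = g_clean(R_clean, theta) if R_clean else 0.0
    gs = g_sample(S, p, theta, clean_fn) if S else 0.0
    return len(R_clean) / n * gc + n_dirty / n * gs
```

When everything is already clean (R_dirty empty), `g_combined` reduces to the exact full gradient, which is the boundary case the correctness argument relies on.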
1. Calculate the gradient over the sample of newly clean data and call the result g_S(θ(t)).
2. Calculate the average gradient over all of the already clean data in R_clean = R − R_dirty, and call the result g_C(θ(t)).
3. Apply the following update rule:

θ(t+1) ← θ(t) − γ·((|R_dirty|/|R|)·g_S(θ(t)) + (|R_clean|/|R|)·g_C(θ(t)))

5.3 Analysis with Stochastic Gradient Descent
The update algorithm can be formalized as a class of very well studied algorithms called Stochastic Gradient Descent. SGD provides a theoretical framework to understand and analyze the update rule and bound the error. Mini-batch stochastic gradient descent (SGD) is an algorithm for finding the optimal value given the convex loss and data. In mini-batch SGD, random subsets of data are selected at each iteration and the average gradient is computed for every batch.

One key difference with traditional SGD models is that ActiveClean applies a full gradient step on the already clean data and averages it with a stochastic gradient step (i.e., calculated from a sample) on the dirty data. Therefore, ActiveClean iterations can take multiple passes over the clean data but at most a single cleaning pass of the dirty data. The update algorithm can be thought of as a variant of SGD that lazily materializes the clean value. As data is sampled at each iteration, data is cleaned when needed by the optimization. It is well known that even for an arbitrary initialization SGD makes significant progress in less than one epoch (a pass through the entire dataset) [9]. In practice, the dirty model can be much more accurate than an arbitrary initialization, as corruption may only affect a few features, and combined with the full gradient step on the clean data the updates converge very quickly.

Setting the step size γ: There is extensive literature in machine learning for choosing the step size γ appropriately. γ can be set either to be a constant or decayed over time. Many machine learning frameworks (e.g., MLlib, scikit-learn, Vowpal Wabbit) automatically set learning rates or provide different learning scheduling frameworks. In the experiments, we use a technique called inverse scaling where there is a parameter γ_0 = 0.1, and at each iteration it decays to γ_t = γ_0/(|S|·t).

Setting the batch size b: The batch size should be set by the user to have the desired properties. Larger batches will take longer to clean and will make more progress towards the clean model but will have less frequent model updates. On the other hand, smaller batches are cleaned faster and have more frequent model updates. There are diminishing returns to increasing the batch size, O(1/√b). In the experiments, we use a batch size of 50, which converges fast but allows for frequent model updates. If a data cleaning technique requires a larger batch size than 50, i.e., data cleaning is fast enough that the iteration overhead is significant compared to cleaning 50 records, ActiveClean can apply the updates in smaller batches. For example, the batch size set by the user might be b = 1000, but the model updates after every 50 records are cleaned. We can disassociate the batching requirements of SGD and the batching requirements of the data cleaning technique.

5.3.1 Convergence Conditions and Properties
Convergence properties of batch SGD formulations have been well studied [11]. Essentially, if the gradient estimate is unbiased and the step size is appropriately chosen, the algorithm is guaranteed to converge. In Appendix B, we show that the gradient estimate from ActiveClean is indeed unbiased and our choice of step size is one that is established to converge. The convergence rates of SGD are also well analyzed [11, 8, 39]. The analysis gives a bound on the error of intermediate models and the expected number of steps before achieving a model within a certain error. For a general convex loss, a batch size b, and T iterations, the convergence rate is bounded by O(σ²/√(bT)). σ² is the variance in the estimate of the gradient at each iteration:

E(‖g − g*‖²)

where g* is the gradient computed over the full data if it were fully cleaned. This property of SGD allows us to bound the model error with a monotonically decreasing function of the number of records cleaned, thus satisfying the reliability condition in the problem statement. If the loss is non-convex, the update procedure will converge towards a local minimum rather than the global minimum (see Appendix C).

5.4 Example
This example describes an application of the update algorithm.

EXAMPLE 2. Recall that the analyst has a dirty SVM model on the dirty data θ(d). She decides that she has a budget of cleaning 100 records, and decides to clean the 100 records in batches of 10 (set based on how fast she can clean the data, and how often she wants to see an updated result). All of the data is initially treated as dirty with R_dirty = R and R_clean = ∅. The gradient of a basic SVM is given by the following function:

∇φ(x, y, θ) = −y·x if y·x·θ ≤ 1, and 0 if y·x·θ ≥ 1

For each iteration t, a sample of 10 records S is drawn from R_dirty. ActiveClean then applies the cleaning function to the sample:

{(x_i(c), y_i(c))} = {C(i) : ∀i ∈ S}

Using these values, ActiveClean estimates the gradient on the newly cleaned data:

g_S(θ) = (1/10) Σ_{i∈S} (1/p(i)) ∇φ(x_i(c), y_i(c), θ)

ActiveClean also applies the gradient to the already clean data (initially non-existent):

g_C(θ) = (1/|R_clean|) Σ_{i∈R_clean} ∇φ(x_i(c), y_i(c), θ)

Then, it calculates the update rule:

θ(t+1) ← θ(t) − γ·((|R_dirty|/|R|)·g_S(θ(t)) + (|R_clean|/|R|)·g_C(θ(t)))

Finally, R_dirty ← R_dirty − S, R_clean ← R_clean + S, and continue to the next iteration.
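One step of Example 2 can be run concretely. The sketch below uses our own one-dimensional toy batch (two margin-violating cleaned records, uniform sampling probabilities); it is an illustration of the hinge-loss subgradient and a single update, not the paper's experiment.

```python
def svm_grad(x, y, theta):
    """Hinge-loss subgradient for a linear SVM, as in Example 2
    (1-D features for simplicity)."""
    return -y * x if y * x * theta <= 1 else 0.0

# One ActiveClean-style iteration on toy data: theta starts at the
# dirty model and a cleaned batch pulls it toward the clean optimum.
theta = 0.0                          # dirty model theta^(d)
batch = [(2.0, 1.0), (-2.0, -1.0)]   # cleaned (x, y) pairs
p = 1.0 / len(batch)                 # uniform sampling probability
gamma = 0.1

g_s = sum(svm_grad(x, y, theta) / p for x, y in batch) / len(batch)
theta = theta - gamma * g_s          # R_clean is empty on iteration 1
print(theta)
```

Both toy records violate the margin at θ = 0, so the subgradients are nonzero and the step moves θ in the direction that enlarges y·x·θ, as the geometric argument in Section 5.1 predicts.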
6. EFFICIENCY WITH SAMPLING

The updater receives a sample with probabilities p(·). For any distribution where p(·) > 0, we can preserve correctness. ActiveClean uses a sampling algorithm that selects the most valuable records to clean with higher probability.

6.1 Oracle Sampling Problem

Recall that the convergence rate of an SGD algorithm is bounded by σ², which is the variance of the gradient. Intuitively, the variance measures how accurately the gradient is estimated from a uniform sample. Other sampling distributions, while preserving the sample expected value, may have a lower variance. Thus, the oracle sampling problem is defined as a search over sampling distributions to find the minimum variance sampling distribution.

DEFINITION 1 (ORACLE SAMPLING PROBLEM). Given a set of candidate dirty data R_dirty, ∀r ∈ R_dirty find sampling probabilities p(r) such that over all samples S of size k it minimizes:

E(‖g_S − g*‖²)

It can be shown that the optimal distribution over records in R_dirty is probabilities proportional to:

p_i ∝ ‖∇φ(x_i^(c), y_i^(c), θ^(t))‖

This is an established result; for thoroughness, we provide a proof in the appendix (Section D). Intuitively, records with higher gradients should be sampled with higher probability, as they affect the update more significantly. However, ActiveClean cannot exclude records with lower gradients, as that would induce a bias hurting convergence. The problem is that the optimal distribution leads to a chicken-and-egg problem: the optimal sampling distribution requires knowing (x_i^(c), y_i^(c)); however, cleaning is required to know those values.

6.2 Dirty Gradient Solution

Such an oracle does not exist, and one solution is to use the gradient w.r.t. the dirty data:

p_i ∝ ‖∇φ(x_i^(d), y_i^(d), θ^(t))‖

It turns out that this solution works reasonably well in practice on our experimental datasets and has been studied in Machine Learning as the Expected Gradient Length heuristic [31]. The contribution in this work is integrating this heuristic with statistically correct updates. However, intuitively, approximating the oracle as closely as possible can result in improved prioritization. The subsequent section describes two components, the detector and estimator, that can be used to improve the convergence rate. Our experiments suggest up to a 2x improvement in convergence when using these optional optimizations (Section 8.3.2).

7. OPTIMIZATIONS

In this section, we describe two approaches to optimization, the Detector and the Estimator, that improve the efficiency of the cleaning process. Both approaches are designed to increase the likelihood that the Sampler will pick dirty records that, once cleaned, most move the model towards the true clean model. The Detector is intended to learn the characteristics that distinguish dirty records from clean records, while the Estimator is designed to estimate the amount that cleaning a given dirty record will move the model towards the true optimal model.

7.1 The Detector

The detector returns two important aspects of a record: (1) whether the record is dirty, and (2) if it is dirty, what is wrong with the record. The sampler can use (1) to select a subset of dirty records to sample at each batch, and the estimator can use (2) to estimate the value of data cleaning based on other records with the same corruption. ActiveClean supports two types of detectors: a priori and adaptive. The former assumes that we know the set of dirty records and how they are dirty a priori to ActiveClean, while the latter adaptively learns characteristics of the dirty data as part of running ActiveClean.

7.1.1 A Priori Detector

For many types of dirtiness, such as missing attribute values and constraint violations, it is possible to efficiently enumerate a set of corrupted records and determine how the records are corrupted.

DEFINITION 2 (A PRIORI DETECTION). Let r be a record in R. An a priori detector is a detector that returns a Boolean of whether the record is dirty and a set of columns e_r that are dirty:

D(r) = ({0,1}, e_r)

From the set of columns that are dirty, find the corresponding features that are dirty f_r and labels that are dirty l_r.

Here is an example of this definition using a data cleaning methodology proposed in the literature.

Constraint-based Repair: One model for detecting errors involves declaring constraints on the database.

Detection. Let Σ be a set of constraints on the relation R. In the detection step, the detector selects a subset of records R_dirty ⊆ R that violate at least one constraint. The set e_r is the set of columns for each record which have a constraint violation.

EXAMPLE 3. An example of a constraint on the running example dataset is that the status of a contribution can only be "allowed" or "disallowed". Any other value for status is an error.

7.2 Adaptive Detection

A priori detection is not possible in all cases. The detector also supports adaptive detection, where detection is learned from previously cleaned data. Note that this "learning" is distinct from the "learning" at the end of the pipeline. The challenge in formulating this problem is that the detector needs to describe how the data is dirty (e.g., e_r in the a priori case). The detector achieves this by categorizing the corruption into u classes. These classes are corruption categories that do not necessarily align with features, but every record is classified with at most one category. When using adaptive detection, the repair step has to clean the data and report to which of the u classes the corrupted record belongs. When an example (x, y) is cleaned, the repair step labels it with one of the clean, 1, 2, ..., u+1 classes (including one for "not dirty"). It is possible that u increases each iteration as more types of dirtiness are discovered. In
many real-world datasets, data errors have locality, where similar records tend to be similarly corrupted. There are usually a small number of error classes even if a large number of records are corrupted.

One approach for adaptive detection is using a statistical classifier. This approach is particularly suited for a small number of data error classes, each of which contains many erroneous records. This problem can be addressed by any classifier, and we use an all-versus-one SVM in our experiments. Another approach could be to adaptively learn predicates that define each of the error classes. For example, if records with certain attributes are corrupted, a pattern tableau can be assigned to each class to select a set of possibly corrupted records. This approach is better suited than a statistical approach for a large number of error classes or scarcity of errors. However, it relies on errors being well aligned with certain attribute values.

DEFINITION 3 (ADAPTIVE CASE). Select the set of records for which κ gives a positive error classification (i.e., one of the u error classes). After each sample of data is cleaned, the classifier κ is retrained. So the result is:

D(r) = ({0,1}, {1, ..., u+1})

Adaptive Detection With OpenRefine:

EXAMPLE 4. OpenRefine is a spreadsheet-based tool that allows users to explore and transform data. However, it is limited to cleaning data that can fit in memory on a single computer. Since the cleaning operations are coupled with data exploration, ActiveClean does not know what is dirty in advance (the analyst may discover new errors as she cleans). Suppose the analyst wants to use OpenRefine to clean the running example dataset with ActiveClean. She takes a sample of data from the entire dataset and uses the tool to discover errors. For example, she finds that some drugs are incorrectly classified as both drugs and devices. She then removes the device attribute for all records that have the drug name in question. As she fixes the records, she tags each one with a category tag of which corruption it belongs to.

7.3 The Estimator

To get around the problem with oracle sampling, the estimator will estimate the cleaned value with previously cleaned data. The estimator will also take advantage of the detector from the previous section. There are a number of different approaches, such as regression, that could be used to estimate the cleaned value given the dirty values. However, there is a problem of scarcity, where errors may affect a small number of records. As a result, the regression approach would have to learn a multivariate function with only a few examples. Thus, high-dimensional regression is ill-suited for the estimator. Conversely, it could try a very simple estimator that just calculates an average change and adds this change to all of the gradients. This estimator can be highly inaccurate, as it also applies the change to records that are known to be clean.

ActiveClean leverages the detector for an estimator between these two extremes. The estimator calculates average changes feature-by-feature and selectively corrects the gradient when a feature is known to be corrupted based on the detector. It also applies a linearization that leads to improved estimates when the sample size is small. We evaluate the linearization in Section 8.5 against alternatives, and find that it provides more accurate estimates for a small number of samples cleaned. The result is a biased estimator, and when the number of cleaned samples is large, the alternative techniques are comparable or even slightly better due to the bias.

Estimation For A Priori Detection.

If most of the features are correct, it would seem like the gradient is only incorrect in one or two of its components. The problem is that the gradient ∇φ(·) can be a very nonlinear function of the features that couples features together. For example, the gradient for linear regression is:

∇φ(x, y, θ) = (θᵀx − y)x

It is not possible to isolate the effect of a change of one feature on the gradient. Even if only one of the features is corrupted, all of the gradient components will be incorrect.

To address this problem, the gradient can be approximated in a way that the effects of dirty features on the gradient are decoupled. Recall, in the a priori detection problem, that associated with each r ∈ R_dirty is a set of errors f_r, l_r, which identifies the corrupted features and labels. This property can be used to construct a coarse estimate of the clean value. The main idea is to calculate average changes for each feature, then, given an uncleaned (but dirty) record, add these average changes to correct the gradient.

To formalize the intuition, instead of computing the actual gradient with respect to the true clean values, compute the conditional expectation given that a set of features and labels f_r, l_r are corrupted:

p_i ∝ E(∇φ(x_i^(c), y_i^(c), θ^(t)) | f_r, l_r)

Corrupted features are defined such that:

i ∉ f_r ⟹ x^(c)[i] − x^(d)[i] = 0

i ∉ l_r ⟹ y^(c)[i] − y^(d)[i] = 0

The needed approximation represents a linearization of the errors, and the resulting approximation will be of the form:

p(r) ∝ ‖∇φ(x, y, θ^(t)) + M_x · Δ_rx + M_y · Δ_ry‖

where M_x, M_y are matrices and Δ_rx and Δ_ry are vectors with one component for each feature and label, where each value is the average change for those features that are corrupted and 0 otherwise. Essentially, it is the gradient with respect to the dirty data plus some linear correction factor. In the appendix, we present a derivation using a Taylor series expansion and a number of M_x and M_y matrices for common convex losses (Appendix E and F). The appendix also describes how to maintain Δ_rx and Δ_ry as cleaning progresses.

Estimation For Adaptive Case.

A similar procedure holds in the adaptive setting; however, it requires reformulation. Here, ActiveClean uses the u corruption classes provided by the detector. Instead of conditioning on the features that are corrupted, the estimator conditions on the classes. So for each error class, it computes a Δ_ux and Δ_uy. These are the average change in the features given that class and the average change in the labels given that class.

p(r_u) ∝ ‖∇φ(x, y, θ^(t)) + M_x · Δ_ux + M_y · Δ_uy‖

7.3.1 Example

Here is an example of using the optimization to select a sample of data for cleaning.
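Alongside the worked example that follows, the estimator's bookkeeping can be sketched in code. This is a hypothetical sketch specialized to the hinge (SVM) loss, with invented class and method names, not ActiveClean's actual implementation: it maintains the per-feature average change Δ_x from cleaned records and scores an uncleaned record by the corrected gradient norm ‖∇φ + M_x · Δ_rx‖.

```python
import numpy as np

class AverageChangeEstimator:
    """Tracks the average dirty-to-clean change per feature (Delta_x)
    and scores a dirty record by ||grad + M_x @ Delta_rx||, specialized
    to the hinge (SVM) loss. Names are hypothetical."""

    def __init__(self, d):
        self.sum_change = np.zeros(d)
        self.count = np.zeros(d)

    def observe(self, x_dirty, x_clean, corrupted):
        # Update Delta_x with a newly cleaned record's changes.
        for i in corrupted:
            self.sum_change[i] += x_clean[i] - x_dirty[i]
            self.count[i] += 1

    def delta_for(self, corrupted):
        # Delta_rx: average change on r's corrupted features, 0 elsewhere.
        delta = np.zeros_like(self.sum_change)
        for i in corrupted:
            if self.count[i] > 0:
                delta[i] = self.sum_change[i] / self.count[i]
        return delta

    def sampling_weight(self, x, y, theta, corrupted):
        # Hinge loss: the gradient is -y*x on margin violators and 0
        # otherwise, so M_x is -y*I on violators and 0 otherwise.
        if y * (x @ theta) <= 1:
            grad, M_x = -y * x, -y * np.eye(len(x))
        else:
            grad, M_x = np.zeros_like(x), np.zeros((len(x), len(x)))
        return np.linalg.norm(grad + M_x @ self.delta_for(corrupted))
```

Normalizing these weights over all dirty records yields the sampling distribution.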
EXAMPLE 5. Consider using ActiveClean with an a priori detector. Let us assume that there are no errors in the labels and only errors in the features. Then, each training example will have a set of corrupted features (e.g., {1,2,6}, {1,2,15}). Suppose that the cleaner has just cleaned the records r1 and r2, represented as tuples with their corrupted feature sets: (r1, {1,2,3}), (r2, {1,2,6}). For each feature i, ActiveClean maintains the average change between the dirty and clean values in a vector Δ_x[i] for those records corrupted on that feature.

Then, given a new record (r3, {1,2,3,6}), Δ_r3x is the vector Δ_x where component i is set to 0 if the feature is not corrupted. Suppose the data analyst is using an SVM; then the M_x matrix is as follows:

M_x[i,i] = −y[i] if y · x · θ ≤ 1
M_x[i,i] = 0 if y · x · θ > 1

Thus, we calculate a sampling weight for record r3:

p(r3) ∝ ‖∇φ(x, y, θ^(t)) + M_x · Δ_r3x‖

To turn the result into a probability distribution, ActiveClean normalizes over all dirty records.

8. EXPERIMENTS

First, the experiments evaluate how various types of corrupted data benefit from data cleaning. Next, the experiments explore different prioritization and model update schemes for progressive data cleaning. Finally, ActiveClean is evaluated end-to-end in a number of real-world data cleaning scenarios.

8.1 Experimental Setup and Notation

The main metric for evaluation is a relative measure of the trained model and the model if all of the data is cleaned.

Relative Model Error. Let θ be the model trained on the dirty data, and let θ* be the model trained on the same data if it was cleaned. Then the model error is defined as ‖θ − θ*‖ / ‖θ*‖.

8.1.1 Scenarios

Income Classification (Adult): In this dataset of 45,552 records, the task is to predict the income bracket (binary) from 12 numerical and categorical covariates with an SVM classifier.

Seizure Classification (EEG): In this dataset, the task is to predict the onset of a seizure (binary) from 15 numerical covariates with a thresholded Linear Regression. There are 14,980 data points in this dataset. This classification task is inherently hard, with an accuracy on completely clean data of only 65%.

Handwriting Recognition (MNIST)¹: In this dataset, the task is to classify 60,000 images of handwritten digits into 10 categories with a one-to-all multiclass SVM classifier. The unique part of this dataset is that the featurized data consists of a 784-dimensional vector which includes edge detectors and raw image patches.

¹http://ufldl.stanford.edu/wiki/index.php/Using_the_MNIST_Dataset

Dollars For Docs: The dataset has 240,089 records with 5 textual attributes and one numerical attribute. The dataset is featurized with a bag-of-words featurization model for the textual attributes, which resulted in a 2021-dimensional feature vector, and a binary SVM is used to classify the status of the medical donations.

World Bank: The dataset has 193 records of country name, population, and various macro-economic statistics. The values are listed with the date at which they were acquired. This allowed us to determine that records from smaller and less populous countries were more likely to be out-of-date.

8.1.2 Compared Algorithms

Here are the alternative methodologies evaluated in the experiments:

Robust Logistic Regression [14]. Feng et al. proposed a variant of logistic regression that is robust to outliers. We chose this algorithm because it is a robust extension of the convex regularized loss model, leading to a better apples-to-apples comparison between the techniques. (See details in Appendix H.1.)

Discarding Dirty Data. As a baseline, dirty data are discarded.

SampleClean (SC) [33]. SampleClean takes a sample of data, applies data cleaning, and then trains a model to completion on the sample.

Active Learning (AL) [18]. To fairly evaluate Active Learning, we first apply our gradient update to ensure correctness. Within each iteration, examples are prioritized by distance to the decision boundary (called Uncertainty Sampling in [31]). However, we do not include our optimizations such as detection and estimation.

ActiveClean Oracle (AC+O): In ActiveClean Oracle, instead of an estimation and detection step, the true clean value is used to evaluate the theoretical ideal performance of ActiveClean.

8.2 Does Data Cleaning Matter?

The first experiment evaluates the benefits of data cleaning on two of the example datasets (EEG and Adult). Our goal is to understand which types of data corruption are amenable to data cleaning and which are better suited for robust statistical techniques. The experiment compares four schemes: (1) full data cleaning, (2) a baseline of no cleaning, (3) discarding the dirty data, and (4) robust logistic regression. We corrupted 5% of the training examples in each dataset in two different ways:

Random Corruption: Simulated high-magnitude random outliers. 5% of the examples are selected at random, and a random feature is replaced with 3 times the highest feature value.

Systematic Corruption: Simulated innocuous-looking (but still incorrect) systematic corruption. The model is trained on the clean data, and the three most important features (highest weighted) are identified. The examples are sorted by each of these features, and the top examples are corrupted with the mean value for that feature (5% corruption in all). It is important to note that examples can have multiple corrupted features.

Figure 4 shows the test accuracy for models trained on both types of data with the different techniques. The robust method performs well on the random high-magnitude outliers, with only a 2.0% reduction in clean test accuracy for EEG and a 2.5% reduction for Adult. In the random setting, discarding dirty data also performs relatively well. However,
the robust method falters on the systematic corruption, with a 9.1% reduction in clean test accuracy for EEG and a 10.5% reduction for Adult. The problem is that without cleaning, there is no way to know if the corruption is random or systematic and when to trust a robust method. While data cleaning requires more effort, it provides benefits in both settings. In the remaining experiments, unless otherwise noted, the experiments use systematic corruption.

Summary: A 5% systematic corruption can introduce a 10% reduction in test accuracy even when using a robust method.

[Figure 4: (a) Robust techniques and discarding data work when corrupted data are random and look atypical. (b) Data cleaning can provide reliable performance in both the systematically corrupted setting and randomly corrupted setting. Plots of test accuracy on (a) randomly and (b) systematically corrupted Adult and EEG data omitted.]

[Figure 5: The relative model error as a function of the number of examples cleaned. ActiveClean converges with a smaller sample size to the true result in comparison to Active Learning and SampleClean. Plots of model error and test error vs. number of records cleaned omitted.]

8.3 ActiveClean: A Priori Detection

The next set of experiments evaluates different approaches to cleaning a sample of data compared to ActiveClean using a priori detection. A priori detection assumes that all of the corrupted records are known in advance, but their clean values are unknown.

8.3.1 Active Learning and SampleClean

The next experiment evaluates the samples-to-error tradeoff between four alternative algorithms: ActiveClean (AC), SampleClean, Active Learning, and ActiveClean+Oracle (AC+O). Figure 5 shows the model error and test accuracy as a function of the number of cleaned records. In terms of model error, ActiveClean gives its largest benefits for small sample sizes. For 500 cleaned records of the Adult dataset, ActiveClean has 6.1x less error than SampleClean and 2.1x less error than Active Learning. For 500 cleaned records of the EEG dataset, ActiveClean has 9.6x less error than SampleClean and 2.4x less error than Active Learning. Both Active Learning and ActiveClean benefit from the initialization with the dirty model, as they do not retrain their models from scratch, and ActiveClean improves on this performance with detection and error estimation. Active Learning has no notion of dirty and clean data, and therefore prioritizes with respect to the dirty data. These gains in model error also correlate well to improvements in test error (defined as the test accuracy difference w.r.t. cleaning all data). The test error converges more quickly than model error, emphasizing the benefits of progressive data cleaning, since it is not necessary to clean all the data to get a model with essentially the same performance as the clean model. For example, to achieve a test error of 1% on the Adult dataset, ActiveClean cleans 500 fewer records than Active Learning.

Summary: ActiveClean with a priori detection returns results that are more than 6x more accurate than SampleClean and 2x more accurate than Active Learning for cleaning 500 records.

[Figure 6: -D denotes no detection, and -D-I denotes no detection and no importance sampling. Both optimizations significantly help ActiveClean outperform SampleClean and Active Learning.]

8.3.2 Source of Improvements

The next experiment compares the performance of ActiveClean with and without various optimizations at the 500-records-cleaned point. ActiveClean without detection is denoted as (AC-D) (that is, at each iteration we sample from the entire dirty data), and ActiveClean without detection and importance sampling is denoted as (AC-D-I). Figure 6 plots the relative error of the alternatives and of ActiveClean with and without the optimizations. Without detection (AC-D), ActiveClean is still more accurate than Active Learning. Removing the importance sampling, ActiveClean is slightly worse than Active Learning on the Adult dataset but is comparable on the EEG dataset.

Summary: Both a priori detection and non-uniform sampling significantly contribute to the gains over Active Learning.

8.3.3 Mixing Dirty and Clean Data

Training a model on mixed data is an unreliable methodology lacking the same guarantees as Active Learning or SampleClean even in the simplest of cases. For thoroughness, the next experiments include the model error as a function of records cleaned in comparison to ActiveClean. Figure 7 plots the same curves as the previous experiment, comparing ActiveClean, Active Learning, and two mixed-data algorithms. PC randomly samples data, cleans, and writes back the cleaned data. PC+D randomly samples data using the dirty-data detector, cleans, and writes back the cleaned data. For these errors, PC and PC+D give reasonable results