Data Mining and Knowledge Discovery, 1, 317–328 (1997)
© 1997 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.

Methodological Note

On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach

STEVEN L. SALZBERG    [email protected]
Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA

Editor: Usama Fayyad

Received December 3, 1996; Revised July 3, 1997; Accepted July 15, 1997

Abstract. An important component of many data mining projects is finding a good classification algorithm, a process that requires very careful thought about experimental design. If not done very carefully, comparative studies of classification and other types of algorithms can easily result in statistically invalid conclusions. This is especially true when one is using data mining techniques to analyze very large databases, which inevitably contain some statistically unlikely data. This paper describes several phenomena that can, if ignored, invalidate an experimental comparison. These phenomena and the conclusions that follow apply not only to classification, but to computational experiments in almost any aspect of data mining. The paper also discusses why comparative analysis is more important in evaluating some types of algorithms than for others, and provides some suggestions about how to avoid the pitfalls suffered by many experimental studies.

Keywords: classification, comparative studies, statistical methods

1. Introduction

Data mining researchers often use classifiers to identify important classes of objects within a data repository. Classification is particularly useful when a database contains examples that can be used as the basis for future decision making; e.g., for assessing credit risks, for medical diagnosis, or for scientific data analysis. Researchers have a range of different types of classification algorithms at their disposal, including nearest neighbor methods, decision tree induction, error back propagation, reinforcement learning, and rule learning. Over the years, many variations of these algorithms have been developed, and many studies have been produced comparing their effectiveness on different data sets, both real and artificial. The productiveness of classification research in the past means that researchers today confront a problem in using those algorithms, namely: how does one choose which algorithm to use for a new problem?

This paper addresses the methodology that one can use to answer this question, and discusses how it has been addressed in the classification community. It also discusses some of the pitfalls that confront anyone trying to answer this question, and demonstrates how misleading results can easily follow from a lack of attention to methodology. Below, I will use examples from the machine learning community which illustrate how careful one must be when using fast computational methods to mine a large database. These examples show that when one repeatedly searches a large database with powerful algorithms, it is all too easy to "find" a phenomenon or pattern that looks impressive, even when there is nothing to discover.

It is natural for experimental researchers to want to use real data to validate their systems. A culture has evolved in the machine learning community that now insists on a convincing evaluation of new ideas, which very often takes the form of experimental testing. This is a healthy development, and it represents an important step in the maturation of the field. One indication of this maturation is the creation and maintenance of the UC Irvine repository of machine learning databases (Murphy, 1995), which now contains over 100 data sets that have appeared in published work. This repository makes it very easy for machine learning researchers to compare new algorithms to previous work.
The data mining field, although a newer area of research, is already evolving a methodology of its own to compare the effectiveness of different algorithms on large databases. Large public databases are becoming increasingly popular in many areas of science and technology, bringing with them great opportunities, but also technical dangers. As we will see below, however, one must be very careful in the design of an experimental study using publicly available databases.

Although the development and maintenance of data repositories has in general been positive, some research on classification algorithms has relied too heavily on the UCI repository and other shared data sets, and has consequently produced comparative studies whose results are at best confusing. To be more precise: it has become commonplace to take two or more classifiers and compare them on a random selection of data sets from the UCI repository. Any differences in classification accuracy that reach statistical significance (more on that below) are provided as supporting evidence of important differences between the algorithms. As argued below, many such comparisons are statistically invalid. The message to the data mining community is that one must be exceedingly careful when using powerful algorithms to extract information from large databases, because traditional statistical methods were not designed for this process. Below I give some examples of how to modify traditional statistics before using them in computational evaluations.

2. Comparing algorithms

Empirical validation is clearly essential to the process of designing and implementing new algorithms, and the criticisms below are not intended to discourage empirical work. Classification research, which is a component of data mining as well as a subfield of machine learning, has always had a need for very specific, focused studies that compare algorithms carefully. The evidence to date is that good evaluations are not done nearly enough; for example, Prechelt (1996) recently surveyed nearly 200 experimental papers on neural network learning algorithms and found most of them to have serious experimental deficiencies. His survey found that a strikingly high percentage of new algorithms (29%) were not evaluated on any real problem at all, and that very few (only 8%) were compared to more than one alternative on real data. In a survey by Flexer (1996) of experimental neural network papers, only 3 out of 43 studies in leading journals used a separate data set for parameter tuning, which leaves open the possibility that many of the reported results were overly optimistic.

Classification research comes in a variety of forms: it lays out new algorithms and demonstrates their feasibility, or it describes creative new algorithms which may not (at first) require rigorous experimental validation. It is important that work designed to be primarily comparative does not undertake to criticize work that was intended to introduce creative new ideas or to demonstrate feasibility on an important domain. This only serves to suppress creative work (if the new algorithm does not perform well) and encourages people instead to focus on narrow studies that make incremental changes to previous work. On the other hand, if a new method outperforms an established one on some important tasks, then this result would be worth reporting, because it might yield important insights about both algorithms. Perhaps most important, comparative work should be done in a statistically acceptable framework. Work intended to demonstrate feasibility, in contrast to purely comparative work, might not always need statistical comparison measures to be convincing.
2.1. Data repositories

In conducting comparative studies, classification researchers and other data miners must be careful not to rely too heavily on stored repositories of data (such as the UCI repository) as their source of problems, because it is difficult to produce major new results using well-studied and widely shared data. For example, Fisher's iris data has been around for 60 years and has been used in hundreds (maybe thousands) of studies. The NetTalk data set of English pronunciation data (introduced by Sejnowski and Rosenberg, 1987) has been used in numerous experiments, as has the protein secondary structure data (introduced by Qian and Sejnowski, 1988), to cite just two examples. Holte (1993) collected results on 16 different data sets, and found as many as 75 different published accuracy figures for some of them. Any new experiments on these and other UCI data sets run the risk of finding "significant" results that are no more than statistical accidents, as explained in Section 3.2. Note that the repository still serves many useful functions; among other things, it allows someone with a new algorithmic idea to test its plausibility on known problems. However, it is a mistake to conclude, if some differences do show up, that a new method is "significantly" better on these data sets. It is very hard (sometimes impossible) to make such an argument in a convincing and statistically correct way.

2.2. Comparative studies and proper methodology

The comparative study, whether it involves classification algorithms or other data extraction techniques, does not usually propose an entirely new method; most often it proposes changes to one or more known algorithms, and uses comparisons to show where and how the changes will improve performance. Although these studies may appear superficially to be quite easy to do, in fact it requires considerable skill to be successful at both improving known algorithms and designing the experiments. Here I focus on the design of experiments, which has been the subject of little concern in the machine learning community until recently (with some exceptions, such as Kibler and Langley (1988) and Cohen and Jensen (1997)). Included in the comparative study category are papers that neither introduce a new algorithm nor improve an old one; instead, they consider one or more known algorithms and conduct experiments on known data sets. They may also include variations on known algorithms. The goal of these papers is ostensibly to highlight the strengths and weaknesses of the algorithms being compared. Although the goal is worthwhile, the approach taken by such papers is sometimes not valid, for reasons to be explained below.

3. Statistical validity: a tutorial

Statistics offers many tests that are designed to measure the significance of any difference between two or more "treatments." These tests can be adapted for use in comparisons of classifiers, but the adaptation must be done carefully, because the statistics were not designed with computational experiments in mind. For example, in one recent machine learning study, fourteen different variations on several classification algorithms were compared on eleven data sets. This is not unusual; many other recent studies report comparisons of similar numbers of algorithms and data sets.¹ All 154 of the variations in this study were compared to a default classifier, and differences were reported as significant if a two-tailed, paired t-test produced a p-value less than 0.05.² This particular significance level was not nearly stringent enough, however: if you do 154 experiments, then you have 154 chances to be significant, so the expected number of "significant" results at the 0.05 level is 154 × 0.05 = 7.7. Obviously this is not what one wants. In order to get results that are truly significant at the 0.05 level, you need to set a much more stringent requirement.
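The arithmetic above is easy to confirm empirically. The short simulation below is a minimal sketch in Python using NumPy and SciPy; the setup of ten paired accuracy estimates per comparison is an illustrative assumption, not a detail of the study cited. It runs 154 comparisons in which the two "algorithms" have identical true accuracy and counts how many paired t-tests nevertheless reach p < 0.05.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_comparisons = 154   # number of algorithm-versus-default comparisons
    n_folds = 10          # paired accuracy estimates per comparison (illustrative)

    false_positives = 0
    for _ in range(n_comparisons):
        # Both "algorithms" have the same true accuracy, so any observed
        # difference is pure noise.
        acc_a = 0.80 + rng.normal(0.0, 0.02, n_folds)
        acc_b = 0.80 + rng.normal(0.0, 0.02, n_folds)
        _, p = stats.ttest_rel(acc_a, acc_b)   # two-tailed paired t-test
        if p < 0.05:
            false_positives += 1

    print(false_positives)   # typically around 154 * 0.05 = 7.7 spurious "wins"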
Statisticians have been aware of this problem for a very long time; it is known as the multiplicity effect. At least two recent papers have focused their attention nicely on how classification researchers might address this effect (Gascuel and Caraux, 1992; Feelders and Verkooijen, 1995).

In particular, let α be the probability that, if no differences exist among our algorithms, we will make at least one mistake; i.e., we will find at least one significant difference. Thus α is the percent of the time in which we (the experimenters) make an error. For each of our tests (i.e., each experiment), let the nominal significance level be α*. Then the chance of making the right conclusion for one experiment is 1 − α*.

If we conduct n independent experiments, the chance of getting them all right is then (1 − α*)^n. (Note that this is true only if all the experiments are independent; when they are not, tighter bounds can be computed. If a set of different algorithms are compared on the same test data, then the tests are clearly not independent. In fact, a comparison that draws the training and test sets from the same sample will not be independent either.) Suppose that in fact no real differences exist among the algorithms being tested; then the chance that we will not get all the conclusions right (in other words, the chance that we will make at least one mistake) is

    α = 1 − (1 − α*)^n

Suppose for example we set our nominal significance level α* for each experiment to 0.05. Then the odds of making at least one mistake in our 154 experiments are α = 1 − (1 − 0.05)^154 = 0.9996. Clearly, a 99.96% chance of drawing an incorrect conclusion is not what we want! Again, we are assuming that no true differences exist; i.e., this is a conditional probability. (More precisely, there is a 99.96% chance that at least one of the results will incorrectly reach "significance" at the 0.05 level.)

In order to obtain results significant at the 0.05 level with 154 tests, we need to set 1 − (1 − α*)^154 ≤ 0.05, which gives α* ≤ 0.0003. This criterion is over 150 times more stringent than the original 0.05 criterion.

The above argument is still very rough, because it assumes that all the experiments are independent of one another. When this assumption is correct, one can make the adjustment of significance described above, which is well known in the statistics community as the Bonferroni adjustment. When experiments use identical training and/or test sets, the tests are clearly not independent. The use of the wrong p-value makes it even more likely that some experiments will find significance where none exists. Nonetheless, many researchers proceed with using a simple t-test to compare multiple algorithms on multiple data sets from the UCI repository; see, e.g., Wettschereck and Dietterich (1995). Although easy to conduct, the t-test is simply the wrong test for such an experimental design. The t-test assumes that the test sets for each "treatment" (each algorithm) are independent. When two algorithms are compared on the same data set, then obviously the test sets are not independent, since they will share some of the same examples, assuming the training and test sets are created by random partitioning, which is the standard practice. This problem is widespread in comparative machine learning studies. (One of the authors of the study cited above has written recently that the paired t-test has "a high probability of Type I error ... and should never be used" (Dietterich, 1996).) It is worth noting here that even statisticians have difficulty agreeing on the correct framework for hypothesis testing in complex experimental designs. For example, the whole framework of using alpha levels and p-values has been questioned when more than two hypotheses are under consideration (Raftery, 1995).
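Before turning to alternative tests, here is a quick numerical check of the adjustment derived above (plain Python; n = 154 as in the running example, and the simpler α* = α/n form is the usual statement of the Bonferroni rule, which gives an almost identical value):

    n = 154
    target = 0.05   # desired family-wise significance level

    # Per-test level from the formula alpha = 1 - (1 - alpha_star)^n, solved exactly.
    alpha_star_exact = 1 - (1 - target) ** (1 / n)
    # The simpler, slightly more conservative Bonferroni form.
    alpha_star_bonferroni = target / n

    print(round(alpha_star_exact, 6))        # ~0.000333, i.e. the 0.0003 cited above
    print(round(alpha_star_bonferroni, 6))   # ~0.000325
    print(1 - (1 - 0.05) ** n)               # ~0.9996: chance of at least one false "significant" result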
3.1. Alternative statistical tests

One obvious problem with the experimental design cited above is that it only considers overall accuracy on a test set. But when using a common test set to compare two algorithms, a comparison must consider four numbers: the number of examples that Algorithm A got right and B got wrong (A > B), the number that B got right and A got wrong (B > A), the number that both algorithms got right, and the number that both got wrong. If one has just two algorithms to compare, then a simple but much improved (over the t-test) way to compare them is to compare the percentage of times A > B versus B > A, and throw out the ties. One can then use a simple binomial test for the comparison, with the Bonferroni adjustment for multiple tests. An alternative is to use random, distinct samples of the data to test each algorithm, and to use an analysis of variance (ANOVA) to compare the results.

A simple example shows how a binomial test can be used to compare two algorithms. As above, measure each algorithm's answer on a series of test examples. Let n be the number of examples for which the algorithms produce different output. We must assume this series of tests is independent; i.e., we are observing a set of Bernoulli trials. (This assumption is valid if the test data is a randomly drawn sample from the population.) Let s (successes) be the number of times A > B, and f (failures) be the number of times B > A. If the two algorithms perform equally well, then the expected value E(s) = 0.5n = E(f). Suppose that s > f, so it looks like A is better than B. We would like to calculate

    P(s ≥ observed value | p(success) = 0.5)

which is the probability that A "wins" over B at least as many times as observed in the experiment. Typically, the reported p-value is double this value because a 2-sided test is used. This can be easily computed using the binomial distribution, which gives the probability of s successes in n trials as

    [n! / (s! (n − s)!)] p^s q^(n−s)

If we expect no differences between the algorithms, then p = q = 0.5. Suppose that we had n = 50 examples for which the algorithms differed, and in s = 35 cases algorithm A was correct while B was wrong. Then we can compute the probability of this result as

    Σ_{s=35}^{50} [n! / (s! (n − s)!)] (0.5)^n = 0.0032

Thus it is highly unlikely that the algorithms have the same accuracy; we can reject the null hypothesis with high confidence. (Note: the computation here uses the binomial distribution, which is exact. Another, nearly identical form of this test is known as McNemar's test (Everitt, 1977), which uses the χ² distribution. The statistic used for the McNemar test is (|s − f| − 1)² / (s + f), which is simpler to compute.) If we make this into a 2-sided test, we must double the probability to 0.0064, but we can still reject the null hypothesis. If we had observed s = 30, then the probability would rise to 0.1012 (for the one-sided test), or just over 10%. In this case we might say that we cannot reject the null hypothesis; in other words, the algorithms may in fact be equally accurate for this population.

The above is just an example, and is not meant to cover all comparative studies. The method applies as well to classifiers as to other data mining methods that attempt to extract patterns from a database. However, the binomial test is a relatively weak test that does not handle quantitative differences between algorithms, nor does it handle more than two algorithms, nor does it consider the frequency of agreement between two algorithms. If N is the number of agreements and N >> n, then it can be argued that our belief that the algorithms are doing the same thing should increase regardless of the pattern of disagreement. As pointed out by Feelders and Verkooijen (1995), finding the proper statistical procedure to compare two or more classification algorithms can be quite difficult, and requires more than an introductory level knowledge of statistics.
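The worked example above can be reproduced in a few lines of Python (a sketch using SciPy; the counts n = 50 and s = 35 are the ones in the text, and any discrepancy in the last decimal place is only rounding):

    from scipy import stats

    n = 50        # test examples on which the two classifiers disagree
    s = 35        # cases where A was right and B was wrong
    f = n - s     # cases where B was right and A was wrong

    # Exact one-sided binomial probability P(S >= s) under the null p = 0.5.
    p_one_sided = stats.binom.sf(s - 1, n, 0.5)
    print(round(p_one_sided, 4))       # ~0.0033 (the text reports 0.0032)
    print(round(2 * p_one_sided, 4))   # two-sided version: double the one-sided value

    # McNemar's statistic with continuity correction, as given in the text.
    chi2 = (abs(s - f) - 1) ** 2 / (s + f)
    print(round(chi2, 2), round(stats.chi2.sf(chi2, df=1), 4))   # statistic and its chi-squared p-value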
A good general reference for experimental design is Cochran and Cox (1957), and descriptions of ANOVA and experimental design can be found in introductory texts such as Hildebrand (1986).

Jensen (1991, 1995) discusses a framework for experimental comparison of classifiers and addresses significance testing, and Cohen and Jensen (1997) discuss some specific ways to remove optimistic statistical bias from such experiments. One important innovation they discuss is randomization testing, in which one derives a reference distribution as follows. For each trial, the data set is copied and class labels are replaced with random class labels. Then an algorithm is used to find the most accurate classifier it can, using the same methodology that is used with the original data. Any estimate of accuracy greater than random for the copied data reflects the bias in the methodology, and this reference distribution can then be used to adjust the estimates on the real data.
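A minimal sketch of this idea follows, assuming scikit-learn's DecisionTreeClassifier and cross_val_score as stand-ins for "an algorithm" and "the same methodology"; it is not the exact protocol of Cohen and Jensen (1997). Labels are randomly permuted here, one common way of assigning random class labels that also preserves the class proportions.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    def reference_distribution(X, y, n_trials=100, seed=0):
        """Accuracy estimates obtained after destroying any real class structure.

        For each trial, the class labels are randomly permuted and the same
        cross-validation methodology used on the real data is applied to the
        relabeled copy. Accuracy above chance on these copies reflects bias
        in the methodology itself.
        """
        rng = np.random.default_rng(seed)
        scores = []
        for _ in range(n_trials):
            y_random = rng.permutation(y)                 # random class labels
            clf = DecisionTreeClassifier(random_state=0)
            scores.append(cross_val_score(clf, X, y_random, cv=10).mean())
        return np.array(scores)

    # The estimate on the real data can then be compared against this reference
    # distribution, e.g. by asking how often the randomized trials matched or
    # exceeded it.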
3.2. Community experiments: cautionary notes

In fact, the problem in the machine learning community is worse than stated above, because many people are sharing a small repository of data sets and repeatedly using those same data sets for experiments. Thus there is a substantial danger that published results, even when using strict significance criteria and the appropriate significance tests, will be mere accidents of chance. The problem is as follows. Suppose that 100 different people are studying the effects of algorithms A and B, trying to determine which one is better. Suppose that in fact both have the same mean accuracy (on some very large population of data sets), although the algorithms vary randomly in their performance on specific data sets. Now, if 100 people are studying the effect of algorithms A and B, we would expect that five of them will get results that are statistically significant at the p ≤ 0.05 level, and one will get significance at the 0.01 level! (Actually, the picture is a bit more complicated, since this assumes the 100 experiments are independent.) Clearly in this case these results are due to chance, but if the 100 people are working separately, the ones who get significant results will publish, while the others will simply move on to other experiments. Denton (1985) made similar observations about how the reviewing and publication process can skew results. Although the data mining community is much broader than the classification community, it is likely that benchmark databases will emerge, and that different researchers will test their mining techniques on them. The experience of the machine learning community should serve as a cautionary tale.

In other communities (e.g., testing new drugs), experimenters try to deal with this phenomenon of "community experiments" by duplicating results. Proper duplication requires drawing a new random sample from the population and repeating the study. However, this is not what happens with benchmark databases, which normally are static. If someone wants to duplicate results, they can only re-run the algorithms with the same parameters on the same data, and of course the results will be the same. Using a different partitioning of the data into training and test sets does not help; duplication can only work if new data becomes available.

3.3. Repeated tuning

Another very substantial problem with reporting significance results based on computational experiments is something that is often left unstated: many researchers "tune" their algorithms repeatedly in order to make them perform optimally on at least some data sets. At the very least, when a system is being developed, the developer spends a great deal of time determining what parameters it should have and what the optimal values should be. For example, the back propagation algorithm has a learning rate and a momentum term that greatly affect learning, and the architecture of a neural net (which has many degrees of freedom) has an enormous effect on learning. Equally important for many problems is the representation of the data, which may vary from one study to the next even when the same basic data set is used. For example, numeric values are sometimes converted to a discrete set of intervals, especially when using decision tree algorithms (Fayyad and Irani, 1993).

Whenever tuning takes place, every adjustment should really be considered a separate experiment. For example, if one attempts 10 different combinations of parameters, then significance levels (p-values) would have to be, e.g., 0.005 in order to obtain levels comparable to a single experiment using a level of 0.05. (This assumes, unrealistically, that the experiments are independent.) But few if any experimenters keep careful count of how many adjustments they consider. (Kibler and Langley (1988) and Aha (1992) suggest, as an alternative, that the parameter settings themselves be studied as independent variables, and that their effects be measured on artificial data.) A greater problem occurs when one uses an algorithm that has been used before: that algorithm may already have been tuned on public databases. Therefore one cannot even know how much tuning took place, and any attempt to claim significant results on known data is simply not valid.

Fortunately, one can perform virtually unlimited tuning without corrupting the validity of the results. The solution is to use cross validation³ entirely within the training set itself. The recommended procedure is to reserve a portion of the training set as a "tuning" set, and to repeatedly test the algorithm and adjust its parameters on the tuning set. When the parameters appear to be at their optimal settings, accuracy can finally be measured on the test data.

Thus, when doing comparative evaluations, everything that is done to modify or prepare the algorithms must be done in advance of seeing the test data. This point will seem obvious to many experimental researchers, but the fact is that papers are still appearing in which this methodology is not followed. In the survey by Flexer (1996), only 3 out of 43 experimental papers in leading neural network journals used a separate data set for parameter tuning; the remaining 40 papers either did not explain how they adjusted parameters or else did their adjustments after using the test set. Thus this point is worth stating explicitly: any tweaking or modifying of code, any decisions about experimental design, and any other parameters that may affect performance must be fixed before the experimenter has the test data. Otherwise the conclusions might not even be valid for the data being used, much less any larger population of problems.
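A minimal sketch of the tuning discipline described above, assuming scikit-learn; the classifier, the split proportions, the candidate parameter values, and the synthetic stand-in data are all illustrative assumptions rather than recommendations:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    # Toy stand-in data; in practice this would be the real data set.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 5))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # The test set is put aside first and never touched until the final evaluation.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Reserve a portion of the training set as a "tuning" set.
    X_fit, X_tune, y_fit, y_tune = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

    # Adjust parameters using the tuning set only.
    best_k, best_score = None, -1.0
    for k in (1, 3, 5, 7, 9):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
        score = accuracy_score(y_tune, clf.predict(X_tune))
        if score > best_score:
            best_k, best_score = k, score

    # Freeze the chosen parameter, retrain on the full training set, and only now
    # measure accuracy on the test data, exactly once.
    final = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
    print(best_k, accuracy_score(y_test, final.predict(X_test)))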
3.4. Generalizing results

A common methodological approach in recent comparative studies is to pick several data sets from the UCI repository (or some other public source of data) and perform a series of experiments, measuring classification accuracy, learning rates, and perhaps other criteria. Whether or not the choice of data sets is random, the use of such experiments to make more general statements about classification algorithms is not necessarily valid.

In fact, if one draws samples from a population to conduct an experiment, any inferences one makes can only be applied to the original population, which in this case means the UCI repository itself. It is not valid to make general statements about other data sets. The only way this might be valid would be if the UCI repository were known to represent a larger population of classification problems. In fact, though, as argued persuasively by Holte (1993) and others, the UCI repository is a very limited sample of problems, many of which are quite easy for a classifier. (Many of them may represent the same concept class; for example, many might be almost linearly separable, as suggested by the strong performance of the perceptron algorithm in one well-known comparative study (Shavlik, Mooney and Towell, 1991).) Thus the evidence is strong that results on the UCI data sets do not apply to all classification problems, and the repository is not an unbiased "sample" of classification problems.

This is not by any means to say that the UCI repository should not exist. The repository serves several important functions. Having a public repository keeps the community honest, in the sense that any published results can be checked. A second function is that any new idea can be "vetted" against the repository as a way of checking it for plausibility. If it works well, it might be worth further investigation. If not, of course it may still be a good idea, but the creator will have to think up other means to demonstrate its feasibility. The dangers of repeated re-use of the same data should spur the research community to continually enlarge the repository, so that over time it will become more representative of classification problems in general. In addition, the use of artificial data should always be considered as a way to test more precisely the strengths and weaknesses of a new algorithm. The data mining community must likewise be vigilant not to come to rely too heavily on any standard set of databases.

There is a second, related problem with making broad statements based on results from a common repository of data. The problem is that someone can write an algorithm that works well on some of these data sets specifically; i.e., the algorithm is designed with the data sets in mind. If researchers become familiar with the data sets, they are likely to be biased when developing new algorithms, even if they do not consciously attempt to tailor their algorithms to the data. In other words, people will be inclined to consider problems such as missing data or outliers because they know these are problems represented in the public data sets. However, these problems may not be important for different data sets. It is an unavoidable fact that anyone familiar with the data may be biased.

3.5. A recommended approach

Although it is probably impossible to describe a methodology that will serve as a "recipe" for all comparisons of computational methods, here is an approach that will allow the designer of a new algorithm to establish the new algorithm's comparative merits. (A code sketch of the whole procedure appears at the end of this section.)

1. Choose other algorithms to include in the comparison. Make sure to include the algorithm that is most similar to the new algorithm.

2. Choose a benchmark data set that illustrates the strengths of the new algorithm. For example, if the algorithm is supposed to handle large attribute spaces, choose a data set with a large number of attributes.

3. Divide the data set into k subsets for cross validation. A typical experiment uses k = 10, though other values may be chosen depending on the data set size. For a small data set, it may be better to choose a larger k, since this leaves more examples in the training set.

4. Run a cross-validation as follows.

   (A) For each of the k subsets of the data set D, create a training set T = D − k; that is, remove the kth subset from D.
   (B) Divide each training set into two smaller subsets, T1 and T2. T1 will be used for training, and T2 for tuning. The purpose of T2 is to allow the experimenter to adjust all the parameters of his algorithm. This methodology also forces the experimenter to be more explicit about what those parameters are.
   (C) Once the parameters are optimized, re-run training on the larger set T.
   (D) Finally, measure accuracy on the held-out subset k.
   (E) Overall accuracy is averaged across all k partitions. These k values also give an estimate of the variance of the algorithms.

5. To compare algorithms, use the binomial test described in Section 3.1, or the McNemar variant on that test.

(It may be tempting, because it is easy to do with powerful computers, to run many cross validations on the same data set and report each cross validation as a single trial. However, this would not produce valid statistics, because the trials in such a design are highly interdependent.) The above procedure applies to a single data set. If an experimenter wishes to compare multiple data sets, then the significance levels used should be adjusted (made more stringent) using the Bonferroni adjustment. This is a conservative adjustment that will tend to miss significance in some cases, but it enforces a high standard for reporting that one algorithm is better than another.
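The following fragment is one concrete reading of the procedure above, not the paper's own code: it assumes scikit-learn and SciPy, uses a decision tree and a nearest-neighbor classifier as the two algorithms being compared, tunes each one only on a split of its training set (step 4B), and, since every example appears in a test set exactly once, applies the binomial test of Section 3.1 to the pooled per-example disagreements (step 5). The classifiers, parameter grids, and synthetic data are illustrative assumptions.

    import numpy as np
    from scipy import stats
    from sklearn.model_selection import StratifiedKFold, train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import accuracy_score

    def tune_and_fit(make_clf, candidates, X_tr, y_tr):
        """Steps 4(B)-(C): pick a parameter value on a tuning split T2, then refit on all of T."""
        X_fit, X_tune, y_fit, y_tune = train_test_split(X_tr, y_tr, test_size=0.25, random_state=0)
        best = max(candidates, key=lambda c: accuracy_score(
            y_tune, make_clf(c).fit(X_fit, y_fit).predict(X_tune)))
        return make_clf(best).fit(X_tr, y_tr)

    # Toy stand-in for the benchmark data set chosen in step 2.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 8))
    y = (X[:, :3].sum(axis=1) > 0).astype(int)

    correct_a = np.zeros(len(y), dtype=bool)   # per-example outcomes for algorithm A
    correct_b = np.zeros(len(y), dtype=bool)   # per-example outcomes for algorithm B

    # Steps 3-4: k-fold cross-validation, with all tuning done inside each training set.
    for train_idx, test_idx in StratifiedKFold(n_splits=10, shuffle=True, random_state=0).split(X, y):
        clf_a = tune_and_fit(lambda d: DecisionTreeClassifier(max_depth=d, random_state=0),
                             (2, 4, 8, None), X[train_idx], y[train_idx])
        clf_b = tune_and_fit(lambda k: KNeighborsClassifier(n_neighbors=k),
                             (1, 3, 5, 9), X[train_idx], y[train_idx])
        correct_a[test_idx] = clf_a.predict(X[test_idx]) == y[test_idx]
        correct_b[test_idx] = clf_b.predict(X[test_idx]) == y[test_idx]

    # Step 5: binomial test on the examples where the two algorithms disagree.
    s = int(np.sum(correct_a & ~correct_b))   # A right, B wrong
    f = int(np.sum(correct_b & ~correct_a))   # B right, A wrong
    p = min(1.0, 2 * stats.binom.sf(max(s, f) - 1, s + f, 0.5)) if s + f else 1.0
    print(s, f, p)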
4. Conclusion

As some researchers have recently observed, no single classification algorithm is the best for all problems. The same observation can be made for data mining: no single technique is likely to work best on all databases. Recent theoretical work has shown that, with certain assumptions, no classifier is always better than another one (Wolpert, 1992). However, experimental science is concerned with data that occurs in the real world, and it is not clear that these theoretical limitations are relevant. Comparative studies typically include at least one new algorithm and several known methods; these studies must be very careful about their methods and their claims. The point of this paper is not to discourage empirical comparisons; clearly, comparisons and benchmarks are an important, central component of experimental computer science, and can be very powerful if done correctly. Rather, this paper's goal is to help experimental researchers steer clear of problems in designing a comparative study.

Acknowledgments

Thanks to Simon Kasif and Eric Brill for helpful comments and discussions. Thanks to Alan Salzberg for help with the statistics and for other comments. This work was supported in part by the National Science Foundation under Grants IRI-9223591 and IRI-9530462, and by the National Institutes of Health under Grant K01-HG00022-01. The views expressed here are the author's own, and do not necessarily represent the views of those acknowledged above, the National Science Foundation, or the National Institutes of Health.

Notes

1. This study was never published; a similar study is (Kohavi and Sommerfield, 1995), which used 22 data sets and 4 algorithms.

2. A p-value is simply the probability that a result occurred by chance. Thus a p-value of 0.05 indicates that there was a 5% probability that the observed results were merely random variation of some kind, and do not indicate a true difference between the treatments. For a description of how to perform a paired t-test, see Hildebrand (1986) or another introductory statistics text.

3. Cross validation refers to a widely used experimental testing procedure. The idea is to break a data set up into k disjoint subsets of approximately equal size. Then one performs k experiments, where for each experiment the kth subset is removed, the system is trained on the remaining data, and then the trained system is tested on the held-out subset. At the end of this k-fold cross validation, every example has been used in a test set exactly once. This procedure has the advantage that all the test sets are independent. However, the training sets are clearly not independent, since they overlap each other substantially. The limiting case is to set k = n, where n is the size of the entire data set, so that each example is held out by itself; this form of cross validation is sometimes called "leave one out."