ebook img

The Dynamics of AdaBoost Weights Tells You What's Hard to Classify PDF

7 Pages·0.2 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview The Dynamics of AdaBoost Weights Tells You What's Hard to Classify

The Dynamics of AdaBoost Weights Tells You What’s Hard to Classify B.Caprile C.Furlanello S.Merler ITC-irst ITC-irst ITC-irst 2 I-38050Povo,Trento I-38050Povo,Trento I-38050Povo,Trento 0 Italy Italy Italy 0 [email protected] [email protected] [email protected] 2 n a J 7 Abstract 1 ThedynamicalevolutionofweightsintheAdaBoostalgorithmcontains ] G usefulinformationabouttheroˆlethatthe associateddatapointsplay in L the built of the AdaBoost model. In particular, the dynamics induces a bipartition of the data set into two (easy/hard) classes. Easy points . s areininfluentialinthemakingofthemodel,whilethevaryingrelevance c of hard points can be gauged in terms of an entropy value associated [ to theirevolution. Smoothapproximationsofentropyhighlightregions 1 where classification is most uncertain. Promising results are obtained v whenmethodsproposedareappliedintheOptimalSamplingframework. 4 1 0 1 1 Introduction 0 2 Inthispaperweinvestigatetheboostingweightdynamicsinducedbyclassificationproce- 0 duresoftheAdaBoostfamily[2,8],andshowhowitcanbeexploitedtoforhighlighting / s points and regions of uncertain classification. Friedman et al. [3] proposed to analyze c and trim the distribution of weights over a training sample in order to reduce computa- : v tionwithoutsacrificingaccuracy. Here,wefocusinsteadontrackingthedynamicsofthe i boostingweightof individualpoints. By introducingthe notionof entropyof the weight X evolution, we can clarifythe notionsof “easy” and the “hard”points as the two types of r weight dynamics being observed: in particular, in different classification tasks and with a differentbase modelsit is foundthata groupof pointsmay be selected which have very low (ideally, zero) entropy of weight evolution: the easy points. In this framework, we can answer questions as: do easy point play any role in building the AdaBoost model? Forhardpoints,candifferentdegreesof“hardness”beidentifiedwhichaccountfordiffer- entdegreesofclassificationuncertainty? Doeasy/hardpointsshowanypreferenceabout where to concentrate? The first two questionsare clearly connectedto equivalentresults in the framework of Support Vector Machines: in a number of experiments, hard points arefoundindeedmostlynearbytheclassificationboundary. Inthesecondpartofthispa- per,thesmoothapproximation(bykernelregression)oftheweightentropyattrainingdata is proposed as an indicator function of classification uncertainty, thereby obtaining a re- gionhighlightingmethodology. Asa naturalapplication,a strategyforoptimalsampling in classification tasks was implemented: compared with uniform random sampling, the entropy-basedstrategyisclearlymoreeffective. Moreover,itcomparesfavorablywithan alternativemargin-basedsamplingstrategy. 2 The Dynamics ofWeights In the present section, the dynamics that the AdaBoost algorithm sets over the weights is singled out for study. In particular, the intuition is substantiated that the evolution of weights yields information about the varying relevance that differentdata points have in thebuiltoftheAdaBoostmodel. LetD ≡ {xi,yi}Ni=1 beatwo-classsetofdatapoints,wherethexisbelongtoasuitable region, X, of some (metric) feature space, and yi takes values in {1,−1}, for 1 ≤ i ≤ N. The AdaBoostalgorithmiterativelybuildsa classmembershipestimatoroverX asa thresholdedlinear superpositionof differentrealizations, Mk, of a same base model, M. Anymodelinstance,Mk, resultingfromtrainingatstepk dependsonthevaluestakenat the same step by a set of N numbers(in the following, the weights), w = w1,...wN – one for each data point. After training, weights are updated: those associated to points misclassified by the current model instance are increased, while decreased are those for whichtheassociatedpointisclassifiedcorrectly.Aninterestingvariantofthisbasicscheme consists in training the different realizations of the base model, not on the whole data set, but on Bootstrap replicates of it [6]. In this second scheme, samplings are extracted accordingtothediscreteprobabilitydistributiondefinedbytheweightsassociatedtodata points,normalizedtosumone. InFig. 1atheplotsarereportedoftheevolutionoftheweightsassociatedto3datapoints whentheAdaBoostalgorithmisappliedtoasimplebinaryclassificationtaskonsynthetic two-dimensionaldata (experimentA-Gaussiansasdescribedin Sec. A.1). Exceptfor occasional bursts, the weight associated to the first point goes rapidly to zero, while the weightsassociatedtothesecondandthirdpointkeepongoingupanddowninaseemingly chaotic fashion. Our experience is that these two types of behaviour are not specific of thecaseunderconsideration,butcanbeobservedinanyAdaBoostexperiment.Moreover, tertium non datur, i.e., no other qualitative behaviour is observed (as, for example, that someweighttendstoastrictlypositivevalue). 2.1 EasyVs. HardDataPoints ThehypothesisthereforeemergesthattheAdaBoostalgorithmsetapartitionofdatapoints intotwoclasses: ononesidethepointswhoseweighttendsrapidlytozero;ontheother, the pointswhoseweightshowan apparentlychaoticbehaviour. Infact, the hypothesisis perfectlyconsistentwith the rationaleunderlyingthe AdaBoostalgorithm: weightsasso- ciated tothose data pointsthatseveralmodelinstancesclassify correctlyevenwhenthey arenotcontainedinthetrainingsamplefollowthefirstkindofbehaviour.Inpracticeinde- pendentlyofwhichbootstrapsampleisextracted,thesepointsareclassifiedcorrectly,and theirweightisconsequentlydecreasedanddecreased.Wecallthemthe“easy”points.The second type of behaviouris followedby the pointsthat, when notcontainedin the train- ingset, happentobeoftenmisclassified. A seriesofmisclassificationsmakestheweight associatedwithanysuchpointincrease,therebyincreasingtheprobabilityforthepointto becontainedinthefollowingbootstrapsample. Astheprobabilityincreasesandthepoint isfinallyextracted(andclassifiedcorrectly),itsweightisdecreased;thisinturnmakesthe pointlesslikelytobeextracted–andsoforth.Wecallthiskindofpoints“hard”. In Fig. 1b, histogramsare reportedof the values thatthe weights associated to the same 3 data points of Fig. 1a take over the same 5000 iterations of the AdaBoost algorithm. As expected, the histogram of (easy) point 1 is very much squeezed towards zero (more 0.20 0.20 0.20 0.15 0.15 0.15 W W W 0.10 0.10 0.10 0.05 0.05 0.05 0.0 0.0 0.0 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000 0 1000 2000 3000 4000 5000(a) Number of models Number of models Number of models 30 80 40 25 Total60 Total30 Total20 Percent of 40 Percent of 20 Percent of 1105 20 10 5 0 0 0 0.0 0.02 0.04 0.06 0.08 0.0 0.02 0.04 0.06 0.08 0.0 0.02 0.04 0.06 0.08(b) W W W Figure 1: Evolutionof weights in the AdaBoostalgorithm. (a) The evolutionsover 5000 steps of the AdaBoostalgorithm are reported for the weights associatedto 3 data points ofexperimentA-Gaussians. Fromlefttoright: an“easy”datapoint(theweighttends tozero),andtwo“hard”datapoints(theweightfollowsaseeminglyrandompattern). (b) Thecorrespondingfrequencyhistograms. than 80% of weights lies below 10−6). Histograms of (hard) points 2 and 3 exhibit the sameGamma-likeshape,butdifferremarkablyforwhatconcernsaverageanddispersion. Naturally,thefirstquestioniswhetheranylimitexistsforthesedistributions.Foreachdata point, two unbinned cumulative distributions were therefore built by taking the weights generated by the first 3000 steps of the AdaBoost algorithm, and those generated over the whole 5000steps. The same-distributionhypothesiswas then tested by meansof the Kolmogorov-Smirnov(KS) test [5]. Results are reported in Fig. 2a, where p-values are plottedagainstthemeanvalueofall5000values. Itisinterestingtonoticethatformean values close to 0 (easy points) the same-distributionhypothesisis always rejected, while it is typically not-rejectedfor higher values (hard points). It seems that easy points may beconfidentlyidentifiedbysimplyconsideringtheaverageoftheirweightdistribution. A binary LDA classifier was therefore trained on the data of Fig. 2a. By setting a p-value thresholdequalto0.05,theresultingprecision(thecomplementto1ofthefractionoffalse negative)wasequalto0.79andrecall(thecomplementto1ofthefractionoffalsepositive) wasequalto0.96. 2.2 Entropy Canwedoanybetteratseparatingeasypointsfromhardones?Forhardpoints,candiffer- entdegreesof“hardness”beidentifiedwhichaccountfordifferentdegreesofclassification uncertainty? Whatwe are goingto showis thatbyassociatinga notionofentropy to the evolutionsofweightsbothquestionscanbeansweredinthepositive.Tothisend,theinter- val[0,1]ispartitionedintoLsubintervalsoflength1/L,andtheentropyvalueiscomputed L as i=1fi log2 fi, wherefi isthe relativefrequencyofweightvaluesfallinginthe i-th subPinterval(0log20issetto0).Forourcases,Lwassetto1000. 1.0 1.0 40 0.8 0.8 p value00..46 p value00..46 Percent of Total2300 0.2 0.2 10 0.0 0.0 0 0.0 0.002 0.004 0.006 0.008 0.01(a) 0 2 4 6 8(b) 0 2 4 6 8(c) Mean Entropy Entropy Figure2: Separatingeasyformhardpoints. (a)p-valuesoftheKStestVs. meanvaluesof frequencyhistograms. (b)p-valuesoftheKStestVs. entropyoffrequencyhistograms. As in(a),thehorizontallinemarksthethresholdvaluefortheLDAclassifier. (c)Histogram ofentropyvaluesforthe400datapointsofexperimentA-Gaussians. Qualitatively, the relationship between entropy and p-values of the KS test is similar to theoneholdingforthemean(Fig.2a-b).Quantitatively,however,adifferenceisobserved, sincetheLDAclassifiertrainedonthesedataperformsmuchbetterinprecisionandslightly worse inrecall(respectively,0.99and0.90,as comparedto 0.79and0.96). Thisimplies thattheclassofeasypointscanbeidentifiedwithhigherconfidencebyusingtheentropy in place of the mean value of the distribution. Further supportto the hypothesisof a bi- partite (easy/hard) nature of data points is gained by observing the frequency histogram of entropiesforthe 400 pointsof experimentA-Gaussians(Fig. 2c), fromwhichtwo groupsof data pointsemerge as clearly separated. The first is the zero entropygroup of easypoints,andthesecondisthegroupofhardpoints. Doeasy/hardpointsshowanypreferenceaboutwheretoconcentrate? InFig. 3ahardand easypointsareshownasdeterminedfortheexperimentA-Sin(seeSec. A.1fordetails). Hard points are mostly found nearby the two-class boundary; yet, their density is much lower along the straightsegmentof the boundary(where the boundaryis smoother), and appearthereforetoconcentratewheretheclassificationuncertaintyishighest. Easypoints to the opposite. Considering that easy points stay well clear of the boundary (i.e., hard pointstypicallyinterposebetweenthemandtheboundary),whatonemaythenquestionis whethertheyplayanyroˆleinthebuiltoftheAdaBoostmodel. Theanswerisno. Infact, the modelsbuilt disregardingthe easy pointsare practicallythe same as the modelsbuilt onthecompletedataset. IntheexperimentofFig. 3onlythe0.55%of10000testpoints wereclassifieddifferentlybythetwomodels,ascontrastedtoreductionofthetrainingset from400toonly111(hard)points. 2.3 SmoothingtheEntropy Intheprevioussection, theentropyofthe weightfrequencyhistogramwasintroducedas anindicatoroftheuncertaintyofclassifyingtheassociateddatapointasbelongingtoclass −1or1. Bydefiningasmoothapproximationtothepunctualentropyvaluesassociatedto datapoints,wenowextendthenotionofclassificationuncertaintytothewholedomainof ourbinaryclassifier. Forsimplicitysake,kernelregressionwasemployed–i.e.,theentropy valuesat data pointsare convolvedwith a Gaussian kernelof fixed bandwidth[9]. In so doing,ascalarentropyfunction,H = H(x), isdefinedonA. InFig. 3b,thegreylevels encodethevaluesofH (increasingfromblacktowhite)fortheexperimentA-Sin. Themethodappearscapableofhighlightingregionswhereclassificationturnsoutuncertain –duetothedistributionofdatapoints,themorphologyoftheclassboundaryorboth. Of course,functionHdependsonthegeometricpropertiesspecificofthebasemodeladopted, 8 4 4 7 6 2 2 5 0 0 4 3 -2 -2 2 -4 -4 1 0 (a) (b) -10 -5 0 5 10 -10 -5 0 5 10 Figure3: (a)Easy(white)andhard(black)datapointsofexperimentA-Sinobtainedby thresholdingthehistogramofentropy.Squaresandcircletsexpresstheclass. (b)Level-plot oftheH function.GreylevelsencodeH values(seescaleontheright). and its degreeof smoothnessdependsonthe size of the convolutionkernel. It shouldbe noticed, however, that the bias/variance balance can be controlled by suitably tuning the convolution parameters. Finally, more sophisticated local smoothing techniques may be employedaswell(e.g.,RadialBasisFunctions)whichmayadapttodirectionality,known morphologyoftheboundaryorlocaldensityofsamplepoints. 3 An Applicationto OptimalSampling Toillustratetheapplicabilityofnotionsdevelopedabovetopracticalcases,werefertothe framework of optimal sampling [1]. In general, an optimal sampling problem is one in whichacostisassociatedtotheacquisitionofdatapoints,insuchawaythatsolvingthe problemconsistsnotonlyinminimizingtheclassification(orregression)errorbutalsoin keepingthesamplingcostaslowaspossible. A typicalsettingforthisclassofproblems is the one in which we start from an assigned set of (sparse) data points, and we then incrementally add points to the training set on the basis of certain information extracted fromintermediateresults. For the experiments reported below, which are based on the same settings as Sin and SpiralofSec. A.1(seealsoSec. A.2fordetails),westartedfromasmallsetofsparse two-dimensionalbinaryclassificationdata.High-uncertaintyareasareidentifiedbymeans of the method described in Sec. 2.3, and additional training points are chosen in these areas. Assumingaunitarycostforeachnewpoint,performanceoftheprocedureisfinally evaluatedbyanalyzingthesamplingcostagainsttheclassificationerror. In Fig. 4, two plots are reported of the classification error as function of the number of trainingpoints. Comparisonis madewith a blind(randomlyuniform)samplingstrategy, andwithaspecializationofuncertaintysamplingstrategyasrecentlyproposedin[4]. The latterconsistsinaddingtrainingpointswheretheclassifierislesscertainofclassmember- ship.Inparticular,theclassifierwastheAdaBoostmodelandtheuncertaintyindicatorwas themarginoftheprediction. ResultsreportedinFig.4showthatinbothexperimentstheentropysamplingmethodholds a definite advantage on the random sampling strategy. In the first experiment, an initial advantage of entropy over the margin based sampling is also observed, but the margin strategy takes over as the number of samplings goes beyond 400. It should be noticed, 0.25 0.20 Entropy Margin Uniform random 0.20 0.15 Error0.10 Error0.15 0.05 0.10 Entropy Margin Uniform random 0.0 0.05 0 200 400 600 800 1000 0 200 400 600 800 1000 (a) (b) Points number Points number Figure4: Misclassificationerrorasafunctionofthenumberoftrainingpointsfortheen- tropybasedschemeiscomparedtotheuniformrandomsamplingandthemarginsampling strategy. (a)ExperimentB-Sin. (b)ExperimentB-Spiral. however, that the margin sampling automatically adapts its spatial scale to the increased densityofsamplingpoints,whileourentropymethoddoesnot(thesizeoftheconvolution kernelisfixed). Infact,intheexperimentB-Spiral(Fig. 4b)wheretheboundaryhasa morecomplexstructure,(andthesizeofconvolutionkernelsmaller),1000samplingsare notsufficientforthemarginbasedmethodtoexhibitanadvantageontheentropymethod (butthelatterloosestheinitialadvantageexhibitedinthefirstexperiment). 4 Final Comments Withinthemanypossibleinterpretationsoflearningbyboosting,itispromisingtocreate diagnosticindicatorfunctionsalternativetomargins[8]bytracingthedynamicsofboost- ingweightsforindividualpoints.Wehaveusedentropy(inthepunctualandthensmoothed versions)asadescriptorofclassificationuncertainty,identifyingeasyandhardpoints,and designingaspecificoptimalsamplingstrategy.Thestrategyneedstobefurtherautomated, e.g. consideringadaptiveselectionofsmoothingparametersasa functionofspatialvari- ability. A directnumericalrelationshipwith theweightsof SupportVectorexpansionsis alsoclearlyneeded. Ontheotherhand,itwouldbealso interestingtoassociatethemain typesofweightdynamics(orpointhardness)totheregularityoftheboundarysurfaceand ofthenoisestructure. References [1] V.Fedorov. TheoryofOptimalExperiments. AcademicPress,NewYork,1972. [2] Y.FreundandR.E.Schapire. ADecision-theoreticGeneralizationofOnlineLearning andanApplicationtoBoosting.JournalofComputerandSystemSciences,55(1):119– 139,August1997. [3] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical viewofboosting. TheAnnalsofStatistics,2000. [4] D.D.LewisandJ.Catlett.HeterogeneousUncertaintySamplingforSupervisedLearn- ing. In Cohen and Hirsh, editors, Eleventh International Conference on Machine Learning,pages148–156,SanFrancisco,1994.MorganKaufmann. [5] W.H.Press,S.A.Teukolsky,W.T.Vetterling,andB.P.Flannery. NumericalRecipes in C–The ArtofScientificComputing. CambridgeUniversityPress, secondedition, 1992. [6] J.R. Quinlan. Bagging, Boosting, and C4.5. In Thirteenth National Conference on ArtificialIntelligence,pages163–175,Cambridge,1996.AAAIPress/MITPress. [7] Y.RavivandN.Intrator.VarianceReductionviaNoiseandBiasConstraints. InA.J.C. Sharkey,editor,CombiningArtificialNeuralNets: EnsembleandModularMulti-Net Systems,pages163–175,London,1999.Springer-Verlag. [8] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the Margin: A New Explanation for the Effectivenessof Voting Methods. The Annals of Statistics, 26(5):1651–1686,1998. [9] W. Ha¨rdle. Applied Nonparametric Regression, volume 19 of Econometric Society Monographs. CambridgeUniversityPress,1990. A Data Details are given on the data employed in experimentsof Sec. 2 and 3. Full details and dataareaccessibleathttp://www.mpa.itc.it/nips-2001/data/. A.1 ExperimentA Gaussians: 4 sets of points (100 points each) were generated by sampling 4 two-dimensional Gaussian distributions, respectively centered in (−1.0,0.5), (0.0,−0.5), (0.0,0.5) and (1.0,−0.5). Covariance matrices were diagonal for allthe4distributions;variancewasconstantandequalto0.4.Pointscomingfrom the sampling of the first two Gaussians were labelled with class −1; the others withclass1. Sin: TheboxinR2,R ≡ [−10,10]×[−5,5],waspartitionedintotwoclassregionsR1 (upper)andR−1(lower)bymeansofthecurve,C ofparametricequations: x(t) = t C ≡ (cid:26) y(t) = 2sin(3t)if −10≤t≤0;0if0≤t≤10. 400 two-dimensionaldata were generated by randomlysampling region R, and labelledwitheither−1or1accordingtowhethertheybelongedtoR−1orR1. Spiral: Asinthepreviouscase,theideawastohaveabipartitionofarectangularsubset, S, of R2 presentingfairly complexboundaries(S ≡ [−5,5]×[−5,5]). Taking inspirationfrom[7],aspiralshapedboundarywasdefined. 400two-dimensional datawerethengeneratedbyrandomlysamplingregionS,andwerelabelledwith either−1or1accordingtowhethertheybelongedtooneortheotherofthetwo classregions. A.2 ExperimentB Thisgroupofdatawasgeneratedinsupporttotheoptimalsamplingexperimentsdescribed inSec. 3. Morespecifically,two initialdata sets, eachcontaining40points,were gener- atedforboththeSinandSpiralsettingsbyemployingthesameproceduresasabove. At each roundof the optimalsamplingprocedure,10 new data pointswere generatedby uniformly sampling a suitable, high entropy subregion of the domain. Data points were thenlabelledaccordingtotheirbelongingtooneortheotherofthetwoclassregions.

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.