Domain based classification

Robert P.W. Duin ([email protected])
Elzbieta Pekalska ([email protected])
ICT group, Faculty of Electr. Eng., Mathematics and Computer Science,
Delft University of Technology, The Netherlands

Unpublished paper, March 2005. Copyright by the authors.

Abstract

The majority of traditional classification rules minimizing the expected probability of error (0-1 loss) are inappropriate if the class probability distributions are ill-defined or impossible to estimate. We argue that in such cases class domains should be used instead of class distributions or densities to construct a reliable decision function. Proposals are presented for some evaluation criteria and classifier learning schemes, illustrated by an example.

1. Introduction

A probabilistic framework is often employed to solve learning problems. One conveniently assumes that real-world objects or phenomena are represented as (or, in fact, reduced to) vectors x in a suitable vector space X. The learning task relies on finding an unknown functional dependency between x and some outputs y ∈ Y. The vectors x are assumed to be iid, i.e. drawn independently from a fixed, but unknown probability distribution p(x). The function f is given as a fixed conditional density p(y|x), which is also unknown. To search for the ideal function f*, a general space of hypothesis functions F = {f : X → Y} is considered. f* is considered optimal according to some loss function L : X×Y → [0,M], M > 0, measuring the discrepancy between the true and estimated values. The learning problem is then formulated as minimizing the true error

    E(f) = ∫_{X×Y} L(y, f(x)) p(x,y) dx dy,

given a finite iid sample, i.e. the training set {(x_i, y_i)}, i = 1, 2, ..., n. As the joint probability p(x,y) = p(x) p(y|x) is unknown, one, therefore, minimizes the empirical error

    E_emp(f) = (1/n) Σ_{i=1}^n L(y_i, f(x_i)).

(In classification problems, y_i is a class label and L is the 0-1 loss, L(y_i, f(x_i)) = I(y_i ≠ f(x_i)), where I is the indicator function. Classifiers minimize the expected classification error.) Additionally, a trade-off between the function complexity and the fit to the data has to be kept, as a small empirical error does not yet guarantee a small true error. This is achieved by adding a suitable penalty or regularization function, as proposed in the structural risk minimization or regularization principles.
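As a concrete reading of the empirical error above, a minimal NumPy sketch; the interface of the classifier f (an array of row vectors in, predicted labels out) is our assumption, not part of the original text:

    import numpy as np

    def empirical_error(f, X, y):
        # E_emp(f) = (1/n) * sum_i I(y_i != f(x_i)): the 0-1 loss averaged
        # over the training set. f is assumed to map an (n, m) array of
        # row vectors to n predicted labels.
        return np.mean(f(X) != y)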
Although these principles are mathematically well-founded, they rely on very strong, though general, assumptions. They impose a fixed (stationary) distribution from which the vectors representing objects are drawn. Moreover, the training set is believed to be representative for the task. Usually, it is a random subset of some large target set, such as the set of all objects in an application. Such assumptions are often violated in practice, not only due to differences in measurements caused by variability between sensors or a difference in calibration of measuring devices, but, more importantly, due to the lack of information on class distributions or the impossibility of gathering a representative sample. Some examples are:

• In the application of face detection, the distribution of non-faces cannot be determined, as it may be unknown for which type of images and in which environments such a detector is going to be used.

• In machine diagnostics and industrial inspection, some of the classes have to be artificially induced in order to obtain sufficient examples for training. Whether they reflect the true distribution may be unknown.

• In geological exploration for mining purposes, a large set of examples may be easily obtained in one area on earth, but its distribution might be entirely different than in another area, whose sample will not be provided due to the related costs.

In human learning, a random sampling of the distribution of the target set does not seem to be a plausible approach, as it is usually not very helpful to encounter multiple copies of the same object among training examples. For instance, in studying the difference between the paintings of Rembrandt and Velasquez it makes no sense to consider copies of the same painting. Even very similar ones may be superfluous, in spite of the fact that they may represent some mode in the distribution of paintings. On the contrary, it may be better to emphasize the tails or the borders of the distribution, especially in situations where the classes seem to be hard to distinguish.

Although the probabilistic framework is applied to many learning problems, there are many practical situations where alternative paradigms are necessary due to the nature of ill-sampled data or ill-defined distributions. What may be an appropriate model for the relation between a training set of examples and the target set of objects to be classified if we cannot or do not want to assume that the distribution of the training set is an approximation of the distribution of the target set? This paper focuses on this aspect. Our basic assumption is that the training sample is representative for the domain of the target set (all examples in the given application) instead of being drawn from a fixed probability distribution.

Consider a representation space, also called input space, X, endowed with some metric d. This is the space in which objects are represented as vectors and the learning takes place. A domain is a bounded set A in X, i.e. ∃r>0 ∀x,z∈A: d(x,z) ≤ r. (We do not assume that the domain is totally bounded.) This is not new, as one usually expects that classes are represented by a set of vectors in (possibly convex and) bounded subsets of some space. Here, we will focus on vector space representations constructed by features, dissimilarities or kernels. As the class domains are bounded in this representation, for each class ω_j there exists some indicator function G_j(x) of the object x (by an object we mean its representation x in the considered vector space) such that G_j(x) = 1 if x is accepted as a member of ω_j and G_j(x) = 0 otherwise. Given a training set of labeled examples {(x_i, y_i)}, G_j(x_i) = 1 if x_i belongs to the class ω_j. We will assume that each object belongs to a single class; however, identical objects with different labels are permitted. This allows classes to overlap.

Given the above model, several questions arise. How to design learning procedures and how to evaluate them? Can classifiers output confidences? How to judge whether a given training set is representative for the domain? Are any further assumptions needed or advantageous? Can cluster analysis or feature selection be applied? The goal of this paper is to raise interest in domain learning. As the first step, we introduce the problem, discuss a few issues and propose some approaches.

2. Performance criteria

Suppose a classifier f(x) is designed that assigns objects to one of the given classes. A labeled evaluation set or a test set S is usually used to estimate the performance of f(x) by counting the number of incorrect assignments. This, however, demands that the set S is representative for the distribution of the target set, which conflicts with our assumption.

For a set of objects to be representative for the class domains, it may be assumed that the objects are well spread over these domains. For the test set S, this means that there is no object x in any of the classes that has a large distance d(x, x_s) to its nearest object x_s ∈ S. Therefore, for a domain representative test set S it holds that

    d_max = max_x min_{x_s ∈ S} d(x, x_s)                                (1)

is small. The usefulness of this approach relies on the fact that the distances as given in the input space are meaningful for the application. Consequently, for a well-performing classifier, none of the erroneously classified objects is far away (at the wrong side) from the decision boundary. If the classes are separable, the test objects should also be as far away from the decision boundary as possible. Therefore, our proposal is to follow the worst-case scenario and to judge a classifier by the object that is the most misleading. This will be judged by its distance to the decision boundary.

Consider a two-class problem with the labels y ∈ {−1,+1}, where y(x) denotes the true label of x. (This notation is the consequence of our assumption that different objects with different labels may be represented in the same point x.) Let f(x) yield the signed distance of x to the decision boundary induced by the classifier. Note that the unsigned distance of x to the decision boundary is related to the functional form of f. Then

    η(S|f) = min_{x_s ∈ S} y(x_s) f(x_s)                                 (2)

is the signed distance to the decision boundary of the 'worst' classified object from the test set S. Having introduced this, a classifier f_1(x) is judged to be better than a classifier f_2(x) if η(S|f_1) > η(S|f_2). The main argument supporting this criterion follows from the fact that if the vector space representation and the distance measure are appropriate for the learning problem, then for small values of η(S|f) the test set S contains objects that are similar to the objects in a wrong class. As the data and the learning procedure are not based on probabilities, it is difficult to make a statement about the probability of errors instead of the seriousness of their contributions.

As a consequence, outliers should be avoided, since they cannot be detected by statistical means. Still, objects that have large distances to all other objects (in comparison to their nearest neighbor distances) indicate that the domain is not well sampled. If the sampling is proper, all objects have to be considered as equally important, as they are examples of valid representations. Copies of the same object do not influence the learning procedures and may, therefore, be removed.

If classes overlap such that the overlapping domain can be estimated and a class of possible density functions is provided, then it might be possible to determine generalization bounds for the classification error or to estimate the expected error over the class of density functions. Both tasks are, however, not straightforward: neither estimating the domain of the class overlap, nor defining an appropriate class of density functions. As we only sketch the problem, we will restrict ourselves to classifiers that maximize criterion (2).
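Both criteria are direct to compute once distances and signed classifier outputs are available. A minimal sketch, assuming Euclidean distances and a classifier f that returns signed distances to its decision boundary (positive on the side of class +1):

    import numpy as np
    from scipy.spatial.distance import cdist

    def d_max(X_domain, X_test):
        # Criterion (1): the largest distance from any domain object to its
        # nearest test object; small values indicate a domain
        # representative test set S.
        return cdist(X_domain, X_test).min(axis=1).max()

    def eta(f, X_test, y_test):
        # Criterion (2): the signed distance to the decision boundary of
        # the 'worst' classified test object.
        return np.min(y_test * f(X_test))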
3. Classifier proposals

A number of possible domain based decision functions will be introduced in this section. We will start by presenting the domain versions of some well-known probabilistic classifiers. It should be emphasized once again that in the probabilistic framework, any averaging over objects or their functional dependencies relies on their distribution. So, averaging cannot be used in domain based learning procedures. It has to be replaced by appropriate operators such as the minimum, the maximum or the domain center.

Consider a vector space R^m, in which objects x are represented e.g. by features. Let X = {x_1, x_2, ..., x_n} be a training set with the labels Y = {y_1, y_2, ..., y_n}. Assume k classes ω_1, ..., ω_k. If k = 2, then y_i ∈ {−1,+1} is assumed. Let X_j be the subset of X containing all members of ω_j. Then X = ∪_j X_j.

3.1. Discriminants

Consider a two-class problem. If the classes are separable by a polynomial or when a kernel transformation is applied, a discriminant function can be found by solving a set of linear inequalities over the training set X, e.g.

    y_i (α^T K(X, x_i) + α_0) > 0,  ∀x_i ∈ X,                            (3)

where K(X, x_i) is the column vector of all kernel values K(x, x_i), ∀x ∈ X. The resulting weights α ∈ R^n define the classifier f in the following way:

    Assign x to ω_1 if f(x) = α^T K(X,x) + α_0 ≥ 0,                      (4)
    Assign x to ω_2 if f(x) = α^T K(X,x) + α_0 < 0.

This decision function finds a solution if the classes are separable in the Hilbert space induced by the kernel K and fails if they are not. Since no model is used to optimize the decision boundary, this decision function is independent of the use of domains or densities.

In the traditional probabilistic approach to pattern recognition, the nearest mean classifier (NMC) and Fisher's linear discriminant (FLD) are two frequently used classifiers. Given class means estimated over the training set, the NMC assigns each object to the class of its nearest mean. In a domain approach, class means should be replaced by the class centers. These are vectors μ_j in the vector space R^m that yield the minimum distance to the most remote object in X_j:

    μ̂_j = arg min_{x* ∈ R^m} max_{x ∈ X_j} ||x − x*||                   (5)

Class centers may be found by a procedure like the Support Vector Data Description (Tax & Duin, 1999; Tax, 2001), in agreement with criterion (5). Such a center is determined by at most m+1 training objects, and usually much fewer. An approximation can also be based on a feature-by-feature computation. Additionally, for sufficiently large data, a single training object may be a sufficiently good approximation of the center:

    μ̂_j = arg min_{x* ∈ X} max_{x ∈ X_j} ||x − x*||.                    (6)

This can be determined fast from the pairwise distance matrix computed between the training examples (Hochbaum & Shmoys, 1985). Given the class centers, the Nearest Center Classifier (NCC) is now defined as:

    Assign x to ω_i if i = arg min_j ||x − μ̂_j||.                       (7)

This classifier is optimal (it maximizes criterion (2)) if the class domains are hyperspheres with identical radii.
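A minimal sketch of the approximate center (6) and the NCC (7), assuming Euclidean distances; the function names are ours:

    import numpy as np
    from scipy.spatial.distance import cdist

    def class_center(Xj):
        # Approximation (6): the training object that minimizes the
        # distance to the most remote object of its class, read off the
        # pairwise distance matrix.
        D = cdist(Xj, Xj)
        return Xj[np.argmin(D.max(axis=1))]

    def ncc_predict(X, centers):
        # Nearest Center Classifier (7): assign each object to the class
        # with the nearest (approximate) center.
        return np.argmin(cdist(X, np.asarray(centers)), axis=1)

    # usage: centers = [class_center(X[y == j]) for j in np.unique(y)]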
A traditional criterion for judging the goodness of a single feature is the Fisher criterion

    J_F = (μ_1 − μ_2)² / (σ_1² + σ_2²)                                   (8)

in which μ_j and σ_j² are the class means and variances, respectively, as computed for the single feature. A domain based version is defined by substituting the mean with the class center and the variance with the squared class range. For the k-th feature, σ_j² can then be estimated as:

    σ̂_j² = (max_i(x_ik) − min_i(x_ik))²                                 (9)

Herewith, a Fisher Linear Domain Discriminant (FLDD) can be defined by a weight vector in the feature space for which the domain version of (8) is maximum. We expect that this direction will be determined by the minimum-volume ellipsoid enclosing X^c, the pooled data shifted by the class centers, X^c = {x − μ_j : x ∈ X_j}. It is defined by the positive semi-definite matrix G, such that x^T G x < 1, ∀x ∈ X^c. Consequently, one has:

    Assign x to ω_1 if (x−μ_2)^T G^{-1} (x−μ_2) ≥ (x−μ_1)^T G^{-1} (x−μ_1),
    Assign x to ω_2 if (x−μ_2)^T G^{-1} (x−μ_2) < (x−μ_1)^T G^{-1} (x−μ_1).   (10)

The FLDD can then be written as:

    f(x) = (μ_2 − μ_1)^T G^{-1} x.                                       (11)

This classifier is optimal according to criterion (2) if the two classes are described by identical ellipsoids except for the position of their centers. The estimation of G in the problem of finding the minimum volume ellipsoid enclosing the data X is a convex optimization problem which is only tractable in special cases (Vandenberghe & Boyd, 1996; Boyd & Vandenberghe, 2004). An approximation is possible when the joint covariance matrix is used for pre-whitening the data (which, however, conflicts with the concept of a domain classifier) and then deriving a hypersphere instead of an ellipsoid.

As a third possibility in this section we mention the binary decision tree classifier based on the purity criterion (Breiman et al., 1984), capturing aspects of partitioning of examples relevant to good classification. In each node of the tree, the feature and a threshold are determined that distinguish the largest pure part (i.e. a range belonging to just one of the classes) of the training set. Other, more advanced ways of finding a domain based learner will be discussed below.
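A minimal sketch of the pre-whitening approximation of the FLDD just described; the covariance of the center-shifted data stands in for the minimum-volume ellipsoid matrix G, which is an approximation, not the optimization itself:

    import numpy as np
    from scipy.spatial.distance import cdist

    def fldd_fit(X1, X2):
        # Approximate centers via (6); G is approximated by the covariance
        # of the pooled, center-shifted data (pre-whitening). The
        # minimum-volume ellipsoid is not computed.
        mu1 = X1[np.argmin(cdist(X1, X1).max(axis=1))]
        mu2 = X2[np.argmin(cdist(X2, X2).max(axis=1))]
        Xc = np.vstack([X1 - mu1, X2 - mu2])
        G = np.cov(Xc, rowvar=False)
        return mu1, mu2, G

    def fldd_predict(X, mu1, mu2, G):
        # Decision rule (10): assign x to the class whose center is nearer
        # in the metric induced by G.
        Gi = np.linalg.inv(G)
        d1 = np.einsum('ij,jk,ik->i', X - mu1, Gi, X - mu1)
        d2 = np.einsum('ij,jk,ik->i', X - mu2, Gi, X - mu2)
        return np.where(d2 >= d1, 1, 2)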
3.2. Model based, parametric decision functions

Two of the methods described in the previous section aim at finding discriminants by some separability criterion, such as the difference in class centers or the Fisher distance. They appear to be optimal for identically shaped class domains: hyperspheres and ellipsoids, respectively. Here, instead of considering a functional form of a classifier, we will start from some class domain models and then determine the classifier.

Class domains are defined by their boundaries. If during a training process some objects are placed outside the domain, the boundaries have to be adjusted. This is permitted only if the nearest objects inside the domain are close to the boundaries or their parts (if distinguishable). 'Unreasonably far away' objects should not play a role in positioning the domain boundaries. They have to be determined with respect to the demand that objects should sample the domain well. So, the distance from the domain boundary to the nearest objects should be comparable to the nearest neighbor distances between the objects. In fact, this is the basic learning problem (Valiant, 1984). A significant difference to many later studies (Kulkarni & Zeitouni, 1993), however, is that in domain learning probabilities or densities cannot be used.

Formally, the problem may be stated as follows. Let D_j(x,θ) = 0 be some parametric domain description (with the parameters θ) for the class ω_j and let X_j be a set of examples from ω_j. Then, θ should be chosen such that the maximum distance from the domain boundary to its nearest neighbor in the training set is minimized, under the condition that all training objects are inside the domain at some suitable distance δ to the border:

    min_θ max_{x*} min_{x ∈ X_j} ||x* − x||,
    s.t. (a) D_j(x*, θ) = 0,
         (b) D_j(x, θ) < 0,  ∀x ∈ X_j,                                   (12)
         (c) ||x* − x|| > δ, ∀x ∈ X_j.

This is a nonlinear optimization. As indicated above, such problems are intractable already for simple domains like arbitrary ellipsoids (Vandenberghe & Boyd, 1996). The challenge, therefore, is to find approximate and feasible solutions. Examples can be found in the area of one-class classifiers (Schölkopf et al., 2001; Tax, 2001). A very problematic issue, however, is the constraint (c) in (12), indicating that the domain border should fit loosely, but in a restricted way, around the training examples in the feature space. The difficulty arises as ||x* − x|| > δ is a non-convex constraint, hence the entire formulation is non-convex. (Convex optimization deals with a well-behaved set of problems that have advantageous theoretical properties such as the duality theory and for which efficient algorithms exist. This is not true for non-convex problems.) In domain learning, new algorithms have to be designed to solve the formulated problems.

Once class domains have been found, the problem of a proper class assignment arises if objects get multiple memberships or if they are rejected by all classes. If a unique decision is demanded in such cases, a discriminant has to be determined, as discussed in section 3.1. Alternatively, during classification, the distances to all domain boundaries have to be found and the smallest, in the case of reject, or the largest, in the case of multiple acceptance, has to be used for the final decision. Again, the criterion (2) is used.
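For the simplest domain model, a hypersphere D_j(x) = ||x − μ|| − r, a crude feasible construction of (12) can be written down directly; this is only a sketch satisfying constraints (b) and (c), not the minimizer of the objective:

    import numpy as np
    from scipy.spatial.distance import cdist

    def fit_sphere_domain(Xj, delta):
        # Take the approximate minimax center (6) and a radius that leaves
        # every training object at least delta inside the boundary, so
        # that constraints (b) and (c) of (12) hold.
        D = cdist(Xj, Xj)
        mu = Xj[np.argmin(D.max(axis=1))]
        r = np.linalg.norm(Xj - mu, axis=1).max() + delta
        return mu, r

    def D_sphere(X, mu, r):
        # D_j(x) < 0 inside the domain, = 0 on the boundary.
        return np.linalg.norm(X - mu, axis=1) - r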
3.3. Model based, non-parametric decision functions

Instead of estimating the parameters of some postulated model, such a model might also be directly constructed from the training set, in analogy to the kernel density (Parzen) estimators (Parzen, 1962) in statistical learning. For a domain description, the sum of kernel functions, however, may be replaced by a maximum or, equivalently, by the union of the kernel domains. In order to restrict the class domains, the kernel domain should be bounded. Let Φ(x, x_i, h) define the domain for a kernel associated with x_i, e.g. all points within a hypersphere with the radius h; then the domain estimate for the class ω_j is:

    D_j(x, h) = ∪_{x_i ∈ X_j} {Φ(x, x_i, h)}.                            (13)

The value of the kernel width h can be estimated by the leave-one-out procedure: h is found as the smallest value for which all training objects belong to the domain that is estimated by all training objects except the one to be classified. This width is equal to the largest nearest neighbor distance found in the training set:

    ĥ = max_i min_{l ≠ i} ||x_i − x_l||.                                 (14)

Also in this case it is not straightforward how the distance to the domain boundary should be computed.
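A minimal sketch of (13) and (14) for hyperspherical kernels and Euclidean distances; the function names are ours:

    import numpy as np
    from scipy.spatial.distance import cdist

    def kernel_width(X):
        # Leave-one-out estimate (14): the largest nearest neighbor
        # distance in the training set.
        D = cdist(X, X)
        np.fill_diagonal(D, np.inf)
        return D.min(axis=1).max()

    def in_kernel_domain(X, Xj, h):
        # Domain estimate (13): x lies in the class domain if it falls
        # inside at least one sphere of radius h centered at a training
        # object of the class.
        return cdist(X, Xj).min(axis=1) <= h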
3.4. Neural networks

The iterative way neural networks are trained makes them suitable for domain learning. Traditionally, the weights of a neural network are chosen to minimize the mean square error over the training set (Bishop, 1995):

    θ̂ = arg min_θ (1/n) Σ_{x ∈ X} (net(x,θ) − t(x))²,                    (15)

where net(x,θ) is the network output for x and t(x) is the target, which is y(x) here. As the network function is nonlinear, training is done in small steps following a gradient descent approach. The summation over the training examples, however, conflicts with the domain learning idea. If it is replaced by the maximum operator, the network will be updated such that the 'worst' object, i.e. the object closest to the domain of the other class, makes as small an error as possible (it is as close as possible to the decision border):

    θ̂ = arg min_θ max_{x ∈ X} (net(x,θ) − t(x))².                        (16)

A severe drawback, however, is that instead of optimizing the distance to the decision boundary in the input space, the largest deviation in the network output space is optimized. Unless the network is linear, such as a traditional perceptron, this will yield a significantly different neural net.

3.5. Support vector machines

The key principle behind the support vector machine (SVM), the structural risk minimization leading to the maximum margin classifier, makes it an ideal candidate for domain learning. Thanks to the reproducing property of kernels, in the case of non-overlapping classes, the SVM is a maximum margin hyperplane in a Hilbert space induced by the specified kernel (Vapnik, 1998). The margin is determined only by support vectors. These are the boundary objects, i.e. the objects closest to the decision boundary f(x,θ) (Vapnik, 1998; Cristianini & Shawe-Taylor, 2000). As such, the SVM is independent of class density models:

    θ̂ = arg max_θ min_{x ∈ X} y(x) f(x, θ).                              (17)

Multiple copies of the same object added to the training set do not contribute to the construction of the SVM, as they do for classifiers based on some probabilistic model. Moreover, the SVM is also not affected if objects that are further away from the decision boundary are disregarded or if objects of the same class are added there. This decision function is, thereby, truly domain based.

For nonlinear classifiers f(x,θ) defined on nonlinear kernels, the SVM has, however, a similar drawback as the nonlinear neural network. The distances to the decision boundary are computed in the output Hilbert space defined by the kernel and not in the input space. A second problem is that the soft-margin formulation (Vapnik, 1998), the traditional solution to overlapping classes, is not domain based. The optimization problem for a linear classifier f(x) = w^T x + w_0 is rewritten into:

    min_w ||w||² + Σ_{x_i ∈ X} ξ(x_i),
    s.t. y_i f(x_i) ≥ 1 − ξ(x_i),                                        (18)
         ξ(x_i) ≥ 0,

in which the term Σ_{x_i ∈ X} ξ(x_i) is an upper bound of the misclassification error on the training set; hence it is responsible for minimizing a sum of error contributions. Adding a copy of an erroneously assigned object will affect the sum and, thereby, will influence the sought optimum w. The result is, thereby, dependent on the distribution of objects, not just on their domain.

For a proper domain based solution, formulation (17) should be solved as it is for the case of overlapping domains, resulting in the negative margin support vector machine. This means that the distance of the furthest away misclassified object should be minimized. As the signed distance is negative, the negative margin is obtained. In the probabilistic approach this classifier is unpopular, as it will be sensitive to outliers. As explained in the introduction, in domain learning the existence of outliers should be neglected. This implies that, if they exist, they should be removed beforehand, as they can only be detected on distribution information.
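A minimal sketch of a negative margin linear classifier in the spirit of (17), by projected subgradient ascent on the worst-case margin; this is only an illustration under our own parameter choices, not the boosting-based implementation used in section 5:

    import numpy as np

    def negative_margin_fit(X, y, steps=5000, lr=0.001):
        # Ascent on min_i y_i (w^T x_i + b) with ||w|| = 1, cf. (17); with
        # overlapping classes the attained margin is negative.
        rng = np.random.default_rng(0)
        w = rng.standard_normal(X.shape[1])
        w /= np.linalg.norm(w)
        b = 0.0
        for _ in range(steps):
            i = np.argmin(y * (X @ w + b))   # the 'worst' object
            w += lr * y[i] * X[i]            # push its margin up
            b += lr * y[i]
            w /= np.linalg.norm(w)           # keep f(x) a signed distance
        return w, b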
4. Evaluation procedure

In the previous section a number of possible domain based classifiers has been discussed, inspired by well-known probabilistic procedures. This is just an attempt to illustrate the key points of domain learning approaches. Some of them are feasible, like the nearest center rule and the maximum error neural network. Others seem to be almost intractable, as the question of determining multidimensional domains that fit around a given set of points leads to hard optimization problems. Dropping the assumption that the probability distribution of the objects is representative for the distribution of the target objects to be classified is apparently very significant. The consequence is that the statistical approach has to be replaced by an estimate of the shape of the class domains. The problem of defining consistent classification procedures is not the only one in domain learning. As was already noticed, for a proper optimization, the distance from the objects to the decision boundary or to the domain boundary should be determined in the input space. Here, the original object representation is defined for the application, so the distances measured in this space are related in a meaningful way to the differences between objects. This relation does not hold for the output space of nonlinear decision functions. Still, well-performing classifiers may be obtained. The question, however, arises how the evaluation and comparison of classifiers that establish different nonlinearities, e.g. a linear classifier, a neural network and a support vector machine, should be done.

The only way various classification functions can be compared is in their common input space, as their output spaces may differ. In section 2, criterion (2) was adopted, stating that the performance of a domain based classifier is determined by the classification of the most difficult example. It is determined by the distance in the input space from that object to the decision boundary. For linear classifiers the computation of this distance is straightforward. For analytical nonlinear classifiers the computation of this distance is not trivial, but might be defined based on some optimization procedure over the decision boundary. For an arbitrary decision function, there is no way to derive this distance directly. In order to compare classifiers of various nature we propose the following heuristic procedure, based on a stochastic approximation of the distance of an object to the decision boundary:

1. Let f be a classifier found in the input space R^m. Given an independent test set S, generate a large set of objects R ⊂ R^m that lie in the neighborhood of the test examples.

2. Label the objects in S and R by the classifier f.

3. For each object x_s in S find the k nearest objects x_r^i in R that are assigned different labels.

4. Enrich this set {x_r^i, i = 1, 2, ..., k} by interpolation.

5. Use successive bisection to find the points x_c^i that are on the lines between x_s and all x_r^i such that they are almost on the decision boundary induced by f.

6. Find the point x_c in {x_c^i} that is nearest to x_s.

7. Use the distance d(x_s, x_c) between x_s and x_c as a measure for the confidence in the classification of x_s. If the true label of x_s is known, the distance to x_c may be given a sign: positive for a correct label, negative for an incorrect one.

8. Use

    e_S = min_{x_s ∈ S} d(x_s, x_c)                                      (19)

as a performance measure for the evaluated classifier f given the test set S.

This proposed procedure has to be further evaluated. An example of the result of the projection of a small test set on a given classifier is shown in Fig. 1.

[Figure 1: scatter plot, Feature 1 versus Feature 2. Caption: Example of the projection of a small set of objects on a nonlinear decision boundary.]
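A minimal sketch of this procedure for a label-valued classifier f; the interpolation enrichment of step 4 is omitted for brevity, and all names are ours:

    import numpy as np

    def boundary_distance(f, x_s, R, k=10, n_bisect=30):
        # Steps 1-6 for one test object x_s: bisect towards the k nearest
        # differently labeled points of the random cloud R and keep the
        # closest boundary point found. f returns class labels for an
        # array of row vectors.
        lab = f(x_s[None, :])[0]
        opp = R[f(R) != lab]
        order = np.argsort(np.linalg.norm(opp - x_s, axis=1))[:k]
        best = np.inf
        for x_r in opp[order]:
            lo, hi = x_s, x_r
            for _ in range(n_bisect):
                mid = 0.5 * (lo + hi)
                if f(mid[None, :])[0] == lab:
                    lo = mid
                else:
                    hi = mid
            best = min(best, np.linalg.norm(0.5 * (lo + hi) - x_s))
        return best

    def performance(f, S, y_S, R):
        # Steps 7-8, criterion (19): sign each distance by correctness,
        # then take the worst over the test set.
        d = np.array([boundary_distance(f, x, R) for x in S])
        return np.min(np.where(f(S) == y_S, 1.0, -1.0) * d)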
5. Example

We implemented the following domain based classifiers:

- Nearest Center (NCC), based on (5).
- Domain Fisher, based on (11), using a heuristic estimate of G by pre-whitening the data, followed by the NCC to determine the class centers.
- Decision Tree, using the purity criterion.
- Negative Margin SVM using a linear kernel. As the optimization problem is not quadratic, we implemented this classifier using boosting (Schapire, 2002).
- Negative Margin SVM using a 3rd order polynomial kernel.

Two slightly overlapping artificial banana-shaped classes are generated in two dimensions. Fig. 2 shows an example for 50 objects per class. The decision boundaries for the above mentioned classifiers are also presented there.

[Figure 2: scatter plot, Feature 1 versus Feature 2, with the decision boundaries of the five classifiers (Nearest Center, Domain Fisher, Decision Tree, NegMarginSVC-1, NegMarginSVC-3). Caption: Five domain based classifiers on artificial data. The three support objects of the linear Negative Margin SVM are indicated by circles.]

The following experiment is performed using a fixed test set of 200 examples per class. Training sets of cardinalities up to 50 objects per class are generated, such that smaller sets are contained in the larger ones. For each training set the above classifiers are determined and evaluated using the procedure discussed in section 4. This is repeated 10 times and the performances are averaged.

Fig. 3 presents the results as a function of the cardinality of the training set. These are the learning curves of the five classifiers, showing an increasing performance as a function of the training size. As the classes slightly overlap, the performance (19) is negative. This is caused by the fact that the 'worst' classified object in the test set is erroneously labeled and is, thereby, on the wrong side of the decision boundary.

[Figure 3: performance (19) versus training set size per class for the five classifiers. Caption: Learning curves for the five domain based classifiers. As classes overlap, the performance (19) is negative. Higher performance indicates better results.]

The curves indicate that our implementation of the Domain Fisher Discriminant performs badly, at least for these data. This might be understood from its sensitivity to all class boundary points, on all sides. Enlarging the data set may yield more disturbances. The simpler Nearest Center classifier performs much better and is about similar to the linear SVM. The nonlinear SVM as well as the Decision Tree yield very good results. Our evaluation procedure may be expected to perform badly for overlapping training sets classified by the Decision Tree, as small regions separated out in different classes disturb the procedure. They are, however, not detected if their size is really small. The probability that a point is generated inside such a region (compare the procedure discussed in section 4) may be too small.
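The exact generator of the banana-shaped classes is not specified in the text; a purely illustrative stand-in that produces two slightly overlapping, interleaving classes could look as follows:

    import numpy as np

    def banana_classes(n, r=5.0, s=1.0, seed=None):
        # Two interleaving half-circle arcs of radius r with Gaussian
        # noise of standard deviation s, one arc per class. All parameter
        # values are our own assumptions.
        rng = np.random.default_rng(seed)
        t1 = rng.uniform(0.0, np.pi, n)
        t2 = rng.uniform(np.pi, 2.0 * np.pi, n)
        X1 = np.c_[r * np.cos(t1), r * np.sin(t1)]
        X2 = np.c_[r * np.cos(t2) + r, r * np.sin(t2) + 0.5 * r]
        X = np.vstack([X1, X2]) + s * rng.standard_normal((2 * n, 2))
        y = np.r_[np.ones(n), -np.ones(n)]
        return X, y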
6. Conclusions

Traditional ways of learning are inappropriate or inaccurate if training sets are only representative for the domain, but not for the distribution of the target objects. In this paper, a number of domain based classifiers have been discussed. Instead of minimizing the expected number of classification errors, the minimum distance to the decision boundary is proposed as a criterion. This is difficult to compute for arbitrary nonlinear classifiers. A heuristic procedure based on generating points close to the decision boundary is proposed for classifier evaluation.

This paper is restricted to an introduction to domain learning. It formulates the problem, points towards possible solutions and gives some examples. A first series of domain based classifiers has been implemented. Much research has to be done to make the domain based classification approach ready for applications. As there is a large need for novel approaches in this area, we believe that an important new theoretical direction for further investigation has been identified.

Acknowledgments

This work is supported by the Dutch Organization for Scientific Research (NWO).

References

Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.

Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge University Press.

Breiman, L., et al. (1984). Classification and regression trees. Wadsworth.

Cristianini, N., & Shawe-Taylor, J. (2000). Support vector machines and other kernel-based learning methods. UK: Cambridge University Press.

Hochbaum, D., & Shmoys, D. (1985). A best possible heuristic for the k-center problem. Mathematics of Operations Research, 10, 180–184.

Kulkarni, S. R., & Zeitouni, O. (1993). On probably correct classification of concepts. COLT (pp. 111–116).

Parzen, E. (1962). On the estimation of a probability density function and mode. Annals of Math. Statistics, 33, 1065–1076.

Schapire, R. (2002). The boosting approach to machine learning: An overview. MSRI Workshop on Nonlinear Estimation and Classification.

Schölkopf, B., Platt, J., Smola, A., & Williamson, R. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 1443–1471.

Tax, D. (2001). One-class classification. Doctoral dissertation, Delft University of Technology, The Netherlands.

Tax, D., & Duin, R. (1999). Support vector domain description. Pattern Recognition Letters, 20, 1191–1199.

Valiant, L. G. (1984). A theory of the learnable. Commun. ACM, 27, 1134–1142.

Vandenberghe, L., & Boyd, S. (1996). Semidefinite programming. SIAM Review, 38, 49–95.

Vapnik, V. (1998). Statistical learning theory. John Wiley & Sons, Inc.