Journal of Machine Learning Research 9 (2008) 1-21    Submitted 3/06; Revised 8/06; Published 1/08

Max-margin Classification of Data with Absent Features

Gal Chechik*                                     [email protected]
Department of Computer Science
Stanford University
Stanford CA, 94305

Geremy Heitz                                     [email protected]
Department of Electrical Engineering
Stanford University
Stanford CA, 94305

Gal Elidan                                       [email protected]
Pieter Abbeel                                    [email protected]
Daphne Koller                                    [email protected]
Department of Computer Science
Stanford University
Stanford CA, 94305

Editor: Nello Cristianini

Abstract

We consider the problem of learning classifiers in structured domains, where some objects have a subset of features that are inherently absent due to complex relationships between the features. Unlike the case where a feature exists but its value is not observed, here we focus on the case where a feature may not even exist (structurally absent) for some of the samples. The common approach for handling missing features in discriminative models is to first complete their unknown values, and then use a standard classification procedure over the completed data. This paper focuses on features that are known to be non-existing, rather than have an unknown value. We show how incomplete data can be classified directly, without any completion of the missing features, using a max-margin learning framework. We formulate an objective function, based on the geometric interpretation of the margin, that aims to maximize the margin of each sample in its own relevant subspace. In this formulation, the linearly separable case can be transformed into a binary search over a series of second order cone programs (SOCP), a convex problem that can be solved efficiently. We also describe two approaches for optimizing the general case: an approximation that can be solved as a standard quadratic program (QP) and an iterative approach for solving the exact problem. By avoiding the pre-processing phase in which the data is completed, both of these approaches could offer considerable computational savings. More importantly, we show that the elegant handling of missing values by our approach allows it to both outperform other methods when the missing values have non-trivial structure, and be competitive with other methods when the values are missing at random. We demonstrate our results on several standard benchmarks and two real-world problems: edge prediction in metabolic pathways, and automobile detection in natural images.

Keywords: max margin, missing features, network reconstruction, metabolic pathways

*. Gal Chechik's current address is Google Research, 1600 Amphitheater Parkway, Mountain View CA, 94043.

© 2008 Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel and Daphne Koller.

1. Introduction

In the traditional formulation of supervised learning, data instances are viewed as vectors of features in some high-dimensional space. However, in many real-world tasks, data instances have a complex pattern of missing features. Sometimes features are missing due to measurement noise or corruption, but in other cases samples have varying subsets of observable features due to the inherent properties of the instances.

Examples for such settings can be found in many domains. In visual object recognition, an object is often classified using a set of image patches corresponding to parts of the object (e.g., the license plate for cars); but some images may not contain all parts, simply because a specific instance does not have the part in the first place. In other scenarios, some features may not even be defined for some instances. Such cases arise when we are learning to predict attributes of interacting objects that are organized based on a known graph structure.
For example, we might wish to classify the attributes of a web-page given the attributes of neighboring web-pages (Jensen and Neville, 2002), or characteristics of a person given properties of their family members. In analyzing genomic data, we may wish to predict the edge connectivity in networks of interacting proteins (Jaimovich et al., 2005) or predict networks of chemical reactions involving enzymes and other metabolic compounds (Kharchenko et al., 2003; Vert and Yamanishi, 2004). In these cases, the local neighborhood of an instance in the graph can vary drastically, and it has already been observed that this could introduce bias (Jensen and Neville, 2002). In the web-page task, for instance, one useful feature of a given page may be the most common topic of other sites that point to it. If this particular page has no such parents, however, the feature is meaningless, and should be considered structurally absent.

Common methods for classification with missing features assume that the features exist but their values are unknown. The approach usually taken is a two-phase approach known as imputation: first the values of the missing attributes are filled in during a preprocessing phase, followed by the application of a standard classifier to the completed data (Little and Rubin, 1987; Roth, 1994). Imputation makes most sense when the features are known to exist, but their values are missing due to noise, especially in the setting of missing at random (when the missingness pattern is conditionally independent of the unobserved features given the observations) or missing completely at random (when it is independent of both observed and unobserved measurements).

In the common practice of imputation, missing attributes in continuous data are often filled with zeros, with the average over all of the data instances, or using the k nearest neighbors (kNN) of each instance to find a plausible value for its missing features (in the limit of large k this becomes equivalent to using the mean of all the data).

A second family of imputation methods builds probabilistic generative models of the features (with or without the label), using raw maximum likelihood or algorithms such as expectation maximization (EM) to find the most probable completion (Ghahramani and Jordan, 1994). Such model-based methods allow the designer to introduce prior knowledge about the distribution of features, and are extremely useful when such knowledge can be explicitly modeled (Kapoor, 2006). These methods have been shown to work very well for missing at random (MAR) data settings, because they assume that the missing features are generated by the same model that generates the observed features. The distribution over missing features can also be approximated using multiple imputations, where several missing values are filled in. This can be achieved efficiently using Markov chain Monte Carlo (MCMC) methods (Schafer, 1997). The imputed data sets can then be combined for classification using a Bayesian approach or ensemble methods (Schafer, 1997). Recently, Pannagadatta et al. (2006) took a different approach and described a method for incorporating second order uncertainties about the samples while keeping the classification problem convex.

However, model-based approaches (using EM or MCMC, for instance) can be computationally expensive, and require significant prior knowledge about the domain. More importantly, they may produce meaningless completions for features that are structurally absent, which will likely decrease classification performance.
As an extreme example, consider the case of two subpopulations of instances (e.g., animals and buildings) with few or no overlapping features (e.g., body parts and architectural aspects). In such a case, filling in missing values (e.g., the body parts of buildings) is clearly meaningless and will not improve classification performance.

The core observation made in this paper is that, in some cases, features are known to be non-existing, rather than have an unknown value. In these cases it would be useful to avoid guessing values for hypothetical undefined features. This distinction motivates the development of different approaches for handling non-existing features.

The approach we propose here aims to classify data instances directly, without filling in hypothetical missing values, and is therefore different from imputation. Our interpretation is that each data instance resides in a lower dimensional subspace of the feature space, determined by its own existing features. To formulate the classification problem we go back to the geometric intuitions underlying max-margin classification: we try to maximize the worst-case margin of the separating hyperplane, while measuring the margin of each data instance in its own lower-dimensional subspace.

We show how the linearly separable case can be solved using a binary search over a series of feasibility problems, each formalized as a second order cone programming (SOCP) problem, which is convex and can be solved efficiently. For the non-separable case, the objective is non-convex, and we propose an iterative procedure that in practice converges to a fixed point of this objective.

We evaluate the performance of our approach both on two real-world applications with non-existing features and on a set of standard benchmarks with randomly removed features. Although the intuitions underlying our method arise from cases with non-existing features, we show that the performance of our methods is not inferior to other simple imputation methods, even in the case of missing at random. We then evaluate our approach on the task of predicting missing enzymes in genetic pathways and detecting automobiles in natural images. In both of these domains, features may be inherently absent due to some or all of the mechanisms described above. Our approach significantly outperforms simple imputation techniques in these settings.

2. Max-Margin Formulation for Missing Features

We start by formalizing the problem of classification with non-existing features, and then explain why standard max-margin classification must be modified to handle missingness.

We are given a set of n labeled samples $x_1,\ldots,x_n$, each characterized by a subset of features from a full set F of size d. Let $F_i$ denote the set of features of the i-th sample. A sample that has all features, $F_i = F$, can be viewed as a vector in $\mathbb{R}^d$, where the j-th coordinate corresponds to the j-th feature. Each sample $x_i$ can therefore be viewed as embedded in the relevant subspace $\mathbb{R}^{|F_i|} \subseteq \mathbb{R}^d$. Each sample $x_i$ also has a binary class label $y_i \in \{-1, 1\}$. Importantly, since instances share features, the classifier we learn should have parameters that are consistent across instances, even if those instances do not lie in the same subspace.

We address the problem of finding an optimal classifier within the max-margin framework.

Figure 1: The margin is incorrectly scaled when a sample that has missing features is treated as if the missing feature has a value of zero. In this example, the margin of a sample that has only one feature (the x dimension) is measured both in the higher dimensional space ($\rho_2$) and the lower one ($\rho_1$).
The lower dimensional margin $\rho_1$ is larger (unless the classifier is orthogonal to the sample's second feature); measuring the margin in the higher dimensional space therefore underestimates it, giving the sample too much weight.

In the classical SVM approach (Vapnik, 1995; Schölkopf and Smola, 2002), we learn a linear classifier w by maximizing the margin, defined as $\rho \equiv \min_i y_i(w x_i + b)/\|w\|$. In the standard setting, where each sample has all features (i.e., $F_i = F$ for each instance), the learning of such a classifier reduces to the constrained optimization problem

\[
\min_{w,\xi,b}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i
\qquad \text{s.t.}\quad y_i(w x_i + b) \ge 1 - \xi_i,\ \ i=1,\ldots,n
\tag{1}
\]

where b is a threshold, the $\xi_i$'s are slack variables necessary for the case when the training instances are not linearly separable, and C is the trade-off between accuracy and model complexity. Equation (1) defines a quadratic programming problem, which can be solved very efficiently, and can be extended to nonlinear classifiers using kernels (Schölkopf and Smola, 2002).

Consider now learning such a classifier in the presence of missing data. At first glance, it may appear that since the x's only affect the optimization through inner products with w, missing features can merely be skipped (or equivalently, replaced with zeros), thereby preserving the values of the inner product. However, this does not properly normalize the different entries in w, and damages classification accuracy. The reason is illustrated in Figure 1, where a single sample in $\mathbb{R}^2$ has one existing and one missing feature. Due to the missing feature, measuring the margin in the full space, $\rho_2$, underestimates the correct geometric margin of the sample in the subspace, $\rho_1$. In the next sections, we explore how Equation (1) can be solved while properly taking this effect into account.

2.1 Geometric Margins

The original derivation of the SVM classifier (Vapnik, 1995) is motivated by the goal of finding a hyperplane that maximally separates the positive examples from the negative, as measured by the geometric margin $\rho^{\mathrm{full}}(w) = \min_i \frac{y_i w x_i}{\|w\|}$. In the case of missing features, each instance resides in a subspace of the joint feature space, therefore the margin $\frac{y_i w x_i}{\|w\|}$, which measures the distance to a hyperplane in the full joint space, is not well defined.

To address this problem we treat the margin of each instance in its own relevant subspace. We define the instance margin $\rho_i(w)$ for the i-th instance as

\[
\rho_i(w) = \frac{y_i\, w^{(i)} x_i}{\|w^{(i)}\|}
\tag{2}
\]

where $w^{(i)}$ is a vector obtained by taking the entries of w that are relevant for $x_i$, namely those for which the sample $x_i$ has valid features. Its norm is $\|w^{(i)}\| = \sqrt{\sum_{k:\, f_k \in F_i} w_k^2}$. We now consider our new geometric margin to be the minimum over all instance margins, $\rho(w) = \min_i \rho_i(w)$, and arrive at a new optimization problem for the missing-features case

\[
\max_w\ \left(\min_i\ \frac{y_i\, w^{(i)} x_i}{\|w^{(i)}\|}\right).
\tag{3}
\]

To solve this optimization problem, consider first the case where no features are missing, as in standard SVM. There, maximizing the margin $\rho^{\mathrm{full}}(w)$,

\[
\max_w\ \rho^{\mathrm{full}}(w) = \max_w\ \left(\min_i\ \frac{y_i\, w\, x_i}{\|w\|}\right)
\tag{4}
\]

is transformed into the quadratic programming problem of Equation (1) in two steps. First, $\|w\|$, which does not depend on the instance, can be taken out of the minimization, to produce $\max_w \frac{1}{\|w\|}\left(\min_i y_i w x_i\right)$. Then, the following invariance is used: for every solution w, there exists a solution that achieves the same target function value, but with a margin that equals 1. This allows us to rewrite the above problem as a constrained optimization problem over w

\[
\max_w\ \frac{1}{\|w\|}
\qquad \text{s.t.}\quad y_i(w x_i) \ge 1,
\tag{5}
\]

which is equivalent to the problem of minimizing $\|w\|^2$ with the same constraints, resulting in the slack-free version of the standard SVM learning optimization problem of Equation (1).
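To make the geometry concrete, the following is a minimal sketch of how the instance margins of Equation (2) and the worst-case objective of Equation (3) can be evaluated for a given w. This is illustrative code only, not the paper's implementation (which used Matlab and CPLEX); encoding absent features as NaN in a NumPy array is an assumption of this sketch.

    import numpy as np

    def instance_margins(w, X, y):
        # Instance margins of Equation (2): each sample's margin is measured
        # only in the subspace spanned by its valid (non-NaN) features.
        # w: (d,) weight vector; X: (n, d) data, NaN marks absent features;
        # y: (n,) labels in {-1, +1}.
        margins = np.empty(len(X))
        for i in range(len(X)):
            valid = np.where(~np.isnan(X[i]))[0]   # F_i, the existing features
            w_i = w[valid]                         # w^(i), the relevant entries of w
            margins[i] = y[i] * (w_i @ X[i, valid]) / np.linalg.norm(w_i)
        return margins

    # The objective of Equation (3) is the worst-case instance margin min_i rho_i(w).
    X = np.array([[1.0, np.nan], [0.5, 2.0], [-1.0, 0.3]])
    y = np.array([1, 1, -1])
    w = np.array([1.0, 0.5])
    print(instance_margins(w, X, y).min())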
Unfortunately, in the case of missing features, since different margin terms are normalized by different norms $\|w^{(i)}\|$ in Equation (3), the denominator can no longer be taken out of the minimization. In addition, each of the terms $y_i w^{(i)} x_i / \|w^{(i)}\|$ is non-convex in w, which makes it difficult to solve exactly in a direct way. The next section discusses how the geometric margin of Equation (3) can be optimized.

3. Algorithms

We investigate three approaches for maximizing the geometric margin of Equation (3). First, we show how the linearly separable case can be optimized using a series of convex optimization problems, and discuss the difficulties in extending this formulation to the non-separable case. Then, we describe two methods for solving the general case. The first uses a convex formulation to approximate the problem; the second uses a non-convex optimization formulation that we optimize iteratively.

3.1 A Convex Formulation for the Linearly Separable Case

In the linearly separable case, we now show how to transform the optimization problem of Equation (3) into a series of convex optimization problems, in three steps. First, instead of maximizing $\min_i \frac{y_i w^{(i)} x_i}{\|w^{(i)}\|}$ over w, we maximize a lower bound

\[
\max_{w,\gamma}\ \gamma
\qquad \text{s.t.}\quad \min_i\ \frac{y_i\, w^{(i)} x_i}{\|w^{(i)}\|} > \gamma.
\]

This takes the minimization term out of the objective function into the constraints. The resulting optimization problem is equivalent to Equation (3), since the bound $\gamma$ can always be maximized until it is perfectly tight.

As a second step, we replace the single constraint that requires that the minimum be larger than $\gamma$ by multiple constraints, one for each term in the minimization

\[
\max_{w,\gamma}\ \gamma
\qquad \text{s.t.}\quad \frac{y_i\, w^{(i)} x_i}{\|w^{(i)}\|} \ge \gamma,\ \ i=1,\ldots,n.
\]

Finally, we write

\[
\max_{w,\gamma}\ \gamma
\qquad \text{s.t.}\quad y_i\, w^{(i)} x_i \ \ge\ \gamma\,\|w^{(i)}\|,\ \ i=1,\ldots,n.
\tag{6}
\]

To see how this could be solved efficiently, assume first that $\gamma$ is given. For any fixed value of $\gamma$, this problem obeys the general structure of a problem known as a second order cone program (SOCP) (Lobo et al., 1998). An SOCP problem is convex, and can be solved efficiently even for large scale problems.

To find the optimal value of $\gamma$, note that Equation (6) is a feasibility problem: we need to find the maximal $\gamma$ value that still allows for a feasible solution. We can therefore solve Equation (6) efficiently by performing a bisection search over $\gamma \in \mathbb{R}^+$, where in each iteration we solve an SOCP feasibility problem, for which standard and efficient solvers exist.

One problem with this formulation is that any rescaled version of a solution is also a solution. This is because each constraint is invariant to a rescaling of w. As a result, the null case $w \equiv 0$ is always a solution, and is usually the solution that is found by the standard solvers. Note that adding a constraint that enforces a non-vanishing norm, $\|w\| > \mathrm{const}$, is no longer convex. However, adding a constraint on a single entry in w, say $w_k > \mathrm{const}$, preserves the convexity of the problem. To ensure that the solution is not missed by the additional constraint, we solve the SOCP twice for each entry of w, once while adding the constraint $w_k > 1$, and once with $w_k < -1$ (a total of 2d problems).

Unfortunately, extending this convex formulation to the non-separable case is difficult. When adding slack variables $\xi_i$, we obtain

\[
\max_{w,\gamma,\xi}\ \gamma - C\sum_{i=1}^{n}\xi_i
\qquad \text{s.t.}\quad y_i\, w^{(i)} x_i \ \ge\ (\gamma-\xi_i)\,\|w^{(i)}\|,\ \ i=1,\ldots,n
\]

which is not jointly convex in w and $\xi$.
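Before continuing with the non-separable case, the separable-case procedure just described, a bisection search over $\gamma$ in which each step is an SOCP feasibility problem of the form of Equation (6), can be sketched as follows. This is a sketch under stated assumptions, not the paper's implementation: it uses the cvxpy modeling package as a stand-in for a generic SOCP solver, encodes absent features as NaN, and, for brevity, rules out the trivial solution $w \equiv 0$ with a single constraint $w_k \ge 1$ rather than scanning all 2d signed variants described above.

    import numpy as np
    import cvxpy as cp

    def feasible(X, y, gamma, k=0):
        # SOCP feasibility check for Equation (6) at a fixed gamma:
        # find w with y_i <w^(i), x_i> >= gamma * ||w^(i)|| for all i.
        # The constraint w_k >= 1 excludes the trivial solution w = 0
        # (the paper scans all 2d signed constraints; one is fixed here for brevity).
        n, d = X.shape
        w = cp.Variable(d)
        constraints = [w[k] >= 1]
        for i in range(n):
            valid = np.where(~np.isnan(X[i]))[0]   # indices of existing features
            constraints.append(
                gamma * cp.norm(w[valid], 2) <= y[i] * (w[valid] @ X[i, valid]))
        prob = cp.Problem(cp.Minimize(0), constraints)   # pure feasibility problem
        prob.solve()
        return prob.status in (cp.OPTIMAL, cp.OPTIMAL_INACCURATE), w.value

    def max_margin_bisection(X, y, lo=0.0, hi=10.0, tol=1e-3):
        # Bisection search over gamma; each step solves one SOCP feasibility problem.
        w_best = None
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            ok, w_val = feasible(X, y, mid)
            if ok:
                lo, w_best = mid, w_val
            else:
                hi = mid
        return lo, w_best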
We could, however, solve a related problem, where the slack variables are not normalized by $\|w^{(i)}\|$

\[
\max_{w,\gamma,\xi}\ \gamma - C\sum_{i=1}^{n}\xi_i
\qquad \text{s.t.}\quad y_i\, w^{(i)} x_i \ \ge\ \gamma\,\|w^{(i)}\| - \xi_i,\ \ i=1,\ldots,n.
\]

Unfortunately, the problem of vanishing solutions ($w \equiv 0$), discussed in the separable case, is also encountered here. Unlike in the separable case, when using constraints of the form $w_k > 1$ we can no longer guarantee that the solutions of the different modified problems will coincide. Indeed, we found empirically that different solutions are obtained for each modified problem. As a result, the non-separable formulation above is not likely to be of practical use.

3.2 Average Norm

We now consider an alternative solution to the classification problem, based on an approximation of the margin in Equation (3). While applying standard SVM with zero-filled values can be viewed as an approximation that ignores any instance-specific components of the margin, we provide a finer approximation. We keep the instance-specific numerators of the instance margins and approximate only the denominators $\|w^{(i)}\|$ by a common term that does not depend on the instance. This allows us to take the denominator out of the minimization part of the objective. While this is only an approximation, it also has an interesting interpretation as a parameter-sharing approach.

In particular, we explore the effects of replacing each of the low-dimensional norms by the root-mean-square norm over all instances, $\overline{\|w\|} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\|w^{(i)}\|^2}$, yielding

\[
\max_w\ \min_i\ \frac{y_i\, w^{(i)} x_i}{\overline{\|w\|}}.
\]

When all samples have all features, this reduces to the original problem of Equation (4). Moreover, even with many missing features, this common denominator will be a good approximation of the instance-specific denominators if all the norm terms are equal; in this case, the approximated problem differs from the exact problem by a constant factor. We therefore expect that in cases where the norms are nearly equal, we will find nearly optimal solutions of the original problem. As we show below, the advantage of this approximation is that it yields a tractable optimization problem.

We now follow similar steps to those of Equations (4)-(5), and obtain the minimization problem:

\[
\min_w\ \frac{1}{n}\sum_{i=1}^{n}\tfrac{1}{2}\|w^{(i)}\|^2
\qquad \text{s.t.}\quad y_i\,(w^{(i)} x_i) \ge 1,\ \ i=1,\ldots,n.
\]

Adding the threshold b and slack variables $\xi_i$ to handle the non-separable cases, we obtain

\[
\min_{w,b,\xi}\ \frac{1}{n}\sum_{i=1}^{n}\left(\tfrac{1}{2}\|w^{(i)}\|^2 + C\,\xi_i\right)
\qquad \text{s.t.}\quad y_i\,(w^{(i)} x_i + b) \ge 1-\xi_i,\ \ i=1,\ldots,n
\tag{7}
\]

which is, as expected, a quadratic programming problem that can be solved using the same techniques as standard SVM.

Interestingly, this optimization problem can also be interpreted as an approach for sharing a classifier across multiple tasks. In this interpretation, we seek to solve multiple classification problems, one for each sample, and the parameters of the classifiers are shared across the relevant samples.

3.3 Instance-specific Margins

The global approach described in the last section is not expected to perform well if the individual norms $\|w^{(i)}\|$ vary considerably. A second approach for solving Equation (3) is to treat each instance individually, representing each of the norms $\|w^{(i)}\|$ as a scaling of the full norm by defining scaling coefficients $s_i = \|w^{(i)}\|/\|w\|$, and rewriting Equation (3) as

\[
\max_w\ \min_i\left(\frac{y_i\, w^{(i)} x_i}{s_i\,\|w\|}\right)
= \max_w\ \frac{1}{\|w\|}\,\min_i\left(\frac{y_i\, w^{(i)} x_i}{s_i}\right),
\qquad s_i = \frac{\|w^{(i)}\|}{\|w\|}.
\tag{8}
\]

This allows us to again use the invariance to $\|w\|$ and rewrite the optimization as a constrained optimization problem over $s_i$ and w.
Equation (8) therefore translates into

\[
\min_{w,s}\ \tfrac{1}{2}\|w\|^2
\qquad \text{s.t.}\quad \frac{1}{s_i}\, y_i\, w^{(i)} x_i \ge 1\ \ \forall i,
\qquad s_i = \|w^{(i)}\|/\|w\|\ \ \forall i
\]

and in the non-separable case

\[
\min_{w,b,\xi,s}\ \tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i
\qquad \text{s.t.}\quad \frac{1}{s_i}\, y_i\left(w^{(i)} x_i + b\right) \ge 1-\xi_i,\ \ i=1,\ldots,n;
\qquad s_i = \|w^{(i)}\|/\|w\|,\ \ i=1,\ldots,n.
\tag{9}
\]

This constrained optimization problem is no longer a QP. In fact, due to the normalization constraints it is not even convex in w. One approach for solving it is a projected gradient approach, in which one iterates between steps in the direction of the gradient of the Lagrangian $\tfrac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_{i=1}^{n}\alpha_i \frac{y_i w^{(i)} x_i}{s_i}$ and projections onto the constrained space, by calculating $s_i = \|w^{(i)}\|/\|w\|$. With the right choices of step sizes, such approaches are guaranteed to converge to local minima.

Fortunately, we can use the fact that the problem is a QP for any given set of $s_i$'s, and devise a faster iterative algorithm. For a given tuple of $s_i$'s, we solve a QP for w, and then use the resulting w to calculate new $s_i$'s. To solve the QP, we derive the dual for given $s_i$'s

\[
\max_{\alpha\in\mathbb{R}^n}\ \sum_{i=1}^{n}\alpha_i \;-\; \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i\, \frac{y_i}{s_i}\,\langle x_i, x_j\rangle\,\frac{y_j}{s_j}\,\alpha_j
\qquad \text{s.t.}\quad 0\le \alpha_i \le C,\ \ i=1,\ldots,n;
\qquad \sum_{i=1}^{n}\alpha_i y_i = 0
\tag{10}
\]

where the inner product $\langle\cdot,\cdot\rangle$ is taken only over features that are valid for both $x_i$ and $x_j$ (kernels are discussed in the next section). Figure 2 provides pseudo-code of this iterative algorithm.

Figure 2: Pseudo-code of the iterative projections algorithm.

    Iterative optimization/projection
    Initialization: Set s_i^0 = 1 for all i = 1...n.
    Iterations:
      repeat
        Find optimal w^t(s^{t-1}): solve Equation (10) for the current s.
        Update s^t(w^t), using s_i = ||w^(i)|| / ||w||.
      until (stopping criterion)

The dual solution is used to find the optimal classifier as in standard SVM, by setting $w = \sum_{i=1}^{n}\alpha_i \frac{y_i}{s_i} x_i$. This separating hyperplane can be used to predict the label of any new sample $x_i$ by calculating the sign of the thresholded dot product $\mathrm{sign}(w^{(i)} x_i - b)$, where b is set as in standard SVM (see, for example, p. 203 in Schölkopf and Smola, 2002).

The primary difference between this algorithm and the projected gradient approach is that instead of a series of small gradient steps, we take bigger leaps and follow each by a similar projection back into the constrained space. It is easy to see that a fixed point of this iterative procedure achieves an optimal solution for Equation (9), since it achieves a minimal $\|w\|$ while obeying the $s_i$ constraints. As a result, when this algorithm converges, it is guaranteed that we reach a (local) optimum of the original problem of Equation (9). Unfortunately, the convergence of this iterative algorithm is not always guaranteed, and therefore it must be stopped when some stopping criterion is reached. In practice, we used cross validation to choose an early stopping point, and found that the best solutions were obtained within very few steps (2-5). This means that, in practice, the running time of this algorithm is expected to be at most five times that of running a single SVM. Our specific implementation used Matlab scripts for data manipulation and CPLEX for optimization, yielding a total runtime of 8 seconds for a set of 200 samples and 500 features on a 2.4 GHz Intel Xeon CPU. It is possible to extend this approach by adding multiple restarts with randomized starting points and choosing the best one using cross validation.
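As a concrete illustration, the following is a minimal sketch of the iterative scheme of Figure 2. For simplicity it solves the primal problem of Equation (9) for fixed $s_i$ directly, rather than the dual of Equation (10) used in the paper, with cvxpy as an assumed QP solver and NaN marking absent features; the fixed iteration count stands in for the early-stopping criterion chosen by cross validation.

    import numpy as np
    import cvxpy as cp

    def absent_feature_svm(X, y, C=1.0, n_iters=5):
        # Iterative scheme of Figure 2, sketched on the primal of Equation (9):
        # for fixed scaling factors s_i the problem is a QP in (w, b, xi); after
        # each solve, the s_i are re-projected to ||w^(i)|| / ||w||.
        n, d = X.shape
        idx = [np.where(~np.isnan(X[i]))[0] for i in range(n)]   # valid-feature sets F_i
        s = np.ones(n)                                           # initialization: s_i^0 = 1
        for _ in range(n_iters):                                 # early stopping after a few steps
            w = cp.Variable(d)
            b = cp.Variable()
            xi = cp.Variable(n, nonneg=True)
            cons = [(y[i] / s[i]) * (w[idx[i]] @ X[i, idx[i]] + b) >= 1 - xi[i]
                    for i in range(n)]
            cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi)), cons).solve()
            w_val = w.value
            s = np.array([np.linalg.norm(w_val[m]) / np.linalg.norm(w_val) for m in idx])
        return w_val, b.value

Each pass solves one QP, which is consistent with the observation above that the total cost is only a small multiple of running a single SVM.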
One implementation advantage of this algorithm is that the effect of the $s_i$ factors can be absorbed into the $\frac{y_i}{s_i}$ term (see Equation 10), rendering the $\langle x_i, x_j\rangle$ term (and later the kernel matrix, see next section) independent of s and allowing it to be calculated in advance.

We also tested two other approaches for optimizing Equation (9). In the first, we replaced the above update of s with an optimization step: in this step we minimized $(s_i - \|w^{(i)}\|/\|w\|)^2$ subject to the constraints $y_i w^{(i)} x_i \ge (1-\xi_i)\, s_i$. The solution to this problem can be derived analytically. This approach performed slightly worse than the iterative approach of Figure 2. Second, we took a hybrid approach, combining gradient ascent over s and QP for w. Specifically, we iterated between solving Equation (10) and making small steps in the direction of the gradient of the dual w.r.t. s. This approach also did not perform as well as the iterative approach of Figure 2.

4. Kernels for Missing Features

The power of the SVM approach can be largely attributed to the flexibility and efficiency of nonlinear classification allowed through the use of kernels. Because the dual objective of Equation (10) depends only on the inner products of the data instances, we can consider the features to lie in a space of much higher dimension than is feasible in the primal formulation of Equation (9). As a result, these objectives can be kernelized in the same manner as in standard SVMs. For example, for Equation (10) we write

\[
\max_{\alpha\in\mathbb{R}^n}\ \sum_{i=1}^{n}\alpha_i \;-\; \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i\, \frac{y_i}{s_i}\,K(x_i, x_j)\,\frac{y_j}{s_j}\,\alpha_j
\qquad \text{s.t.}\quad 0\le \alpha_i \le C,\ \ i=1,\ldots,n;
\qquad \sum_{i=1}^{n}\alpha_i y_i = 0
\tag{11}
\]

where $K(x_i, x_j)$ is the kernel function that simulates an inner product in the higher dimensional feature space. Classification of a new sample $x_{\mathrm{new}}$ is obtained as in standard SVM by calculating the margin $\rho(x_{\mathrm{new}}) = \sum_j y_j\,\alpha_j\,\frac{1}{s_j}\,K(x_j, x_{\mathrm{new}})\,\frac{1}{s_{\mathrm{new}}}$.

Importantly, kernels in this formulation operate over vectors with missing features, and we therefore have to develop kernels that handle this case correctly. Fortunately, for some standard kernels, there is an easy procedure to construct a modified kernel that takes such missing values into account.

We focus on kernels whose dependence on their inputs is through the inner product $\langle x_i, x_j\rangle$. Two examples of such kernels are polynomial and sigmoidal kernels. A kernel in this class can be easily transformed into a kernel that handles missing features, by considering only features that are valid for both samples and skipping the rest. This is not equivalent to ignoring missing features (as in filling with zeros), since the algorithm applied to Equation (11) using this kernel already corrects for the missing information by taking the $s_i$ terms into account.

For instance, for a polynomial kernel $K(x_i, x_j) = (\langle x_i, x_j\rangle + 1)^d$, define the modified kernel

\[
K'(x_i, x_j) = \left(\langle x_i, x_j\rangle_F + 1\right)^d,
\]

with the inner product calculated over valid features, $\langle x_i, x_j\rangle_F = \sum_{k:\, f_k \in F_j \cap F_i} x_{ik}\, x_{jk}$. It is straightforward to see that $K'$ is a kernel, since instances with missing values can be filled with zeros: formally, define $x' = h(x)$, where $h:\{\mathbb{R},\mathrm{NaN}\}^d \to \mathbb{R}^d$ replaces invalid entries (missing features) with zeros. Then we have $K'(x_i, x_j) = K(x_i', x_j')$, simply since multiplying by zero is equivalent to skipping the missing values. Hence $K'$ is a kernel. (A short code sketch of this construction is given below, following the experimental overview.)

5. Experiments

We evaluate the performance of our approaches in three different missingness scenarios. First, as a sanity check, we explore performance when features are missing at random.
In this setting our methods are not expected to provide considerable gain over other methods, but we hope they will perform equally well. Second, we study a visual object recognition application where some features are missing because they cannot be located in the image. Finally, we apply our methods to a problem of biological network completion, where the missingness pattern of the features is determined by the known structure of the network.
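As a small aside before the individual experiments, the masked polynomial kernel $K'$ of Section 4 can be sketched as follows. Encoding absent features as NaN is an assumption of this sketch; as argued in Section 4, zero-filling the absent entries and applying the standard polynomial kernel yields exactly the restricted inner product over shared valid features.

    import numpy as np

    def masked_polynomial_kernel(X, degree=2):
        # Gram matrix of the modified kernel K'(x_i, x_j) = (<x_i, x_j>_F + 1)^degree,
        # where the inner product runs only over features valid for BOTH samples.
        # Zero-filling absent (NaN) entries and applying the standard polynomial
        # kernel is equivalent, which is exactly why K' is itself a valid kernel.
        X0 = np.nan_to_num(X)                  # absent entries contribute nothing
        return (X0 @ X0.T + 1.0) ** degree

    # Example: two samples sharing only their first feature.
    X = np.array([[1.0, np.nan, 2.0],
                  [3.0, 4.0,    np.nan]])
    print(masked_polynomial_kernel(X, degree=3))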