Learning Kernel Classifiers

Adaptive Computation and Machine Learning
Thomas G. Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J. Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour, and Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Bernhard Schölkopf and Alexander J. Smola

Learning Kernel Classifiers
Theory and Algorithms

Ralf Herbrich

The MIT Press
Cambridge, Massachusetts
London, England

© 2002 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

This book was set in Times Roman by the author using the LaTeX document preparation system and was printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data

Herbrich, Ralf.
Learning kernel classifiers: theory and algorithms / Ralf Herbrich.
p. cm. — (Adaptive computation and machine learning)
Includes bibliographical references and index.
ISBN 0-262-08306-X (hc.: alk. paper)
1. Machine learning. 2. Algorithms. I. Title. II. Series.
Q325.5.H48 2001
006.31—dc21
2001044445

To my wife, Jeannette

There are many branches of learning theory that have not yet been analyzed and that are important both for understanding the phenomenon of learning and for practical applications. They are waiting for their researchers.
—Vladimir Vapnik

Geometry is illuminating; probability theory is powerful.
—Pál Ruján

Contents

Series Foreword xv
Preface xvii

1 Introduction 1
  1.1 The Learning Problem and (Statistical) Inference 1
    1.1.1 Supervised Learning 3
    1.1.2 Unsupervised Learning 6
    1.1.3 Reinforcement Learning 7
  1.2 Learning Kernel Classifiers 8
  1.3 The Purposes of Learning Theory 11

I LEARNING ALGORITHMS

2 Kernel Classifiers from a Machine Learning Perspective 17
  2.1 The Basic Setting 17
  2.2 Learning by Risk Minimization 24
    2.2.1 The (Primal) Perceptron Algorithm 26
    2.2.2 Regularized Risk Functionals 27
  2.3 Kernels and Linear Classifiers 30
    2.3.1 The Kernel Technique 33
    2.3.2 Kernel Families 36
    2.3.3 The Representer Theorem 47
  2.4 Support Vector Classification Learning 49
    2.4.1 Maximizing the Margin 49
    2.4.2 Soft Margins—Learning with Training Error 53
    2.4.3 Geometrical Viewpoints on Margin Maximization 56
    2.4.4 The ν-Trick and Other Variants 58
  2.5 Adaptive Margin Machines 61
    2.5.1 Assessment of Learning Algorithms 61
    2.5.2 Leave-One-Out Machines 63
    2.5.3 Pitfalls of Minimizing a Leave-One-Out Bound 64
    2.5.4 Adaptive Margin Machines 66
  2.6 Bibliographical Remarks 68

3 Kernel Classifiers from a Bayesian Perspective 73
  3.1 The Bayesian Framework 73
    3.1.1 The Power of Conditioning on Data 79
  3.2 Gaussian Processes 81
    3.2.1 Bayesian Linear Regression 82
    3.2.2 From Regression to Classification 87
  3.3 The Relevance Vector Machine 92
  3.4 Bayes Point Machines 97
    3.4.1 Estimating the Bayes Point 100
  3.5 Fisher Discriminants 103
  3.6 Bibliographical Remarks 110

II LEARNING THEORY

4 Mathematical Models of Learning 115
  4.1 Generative vs. Discriminative Models 116
  4.2 PAC and VC Frameworks 121
    4.2.1 Classical PAC and VC Analysis 123
    4.2.2 Growth Function and VC Dimension 127
    4.2.3 Structural Risk Minimization 131
  4.3 The Luckiness Framework 134
  4.4 PAC and VC Frameworks for Real-Valued Classifiers 140
    4.4.1 VC Dimensions for Real-Valued Function Classes 146
    4.4.2 The PAC Margin Bound 150
    4.4.3 Robust Margin Bounds 151
  4.5 Bibliographical Remarks 158

5 Bounds for Specific Algorithms 163
  5.1 The PAC-Bayesian Framework 164
    5.1.1 PAC-Bayesian Bounds for Bayesian Algorithms 164
    5.1.2 A PAC-Bayesian Margin Bound 172
  5.2 Compression Bounds 175
    5.2.1 Compression Schemes and Generalization Error 176
    5.2.2 On-line Learning and Compression Schemes 182
  5.3 Algorithmic Stability Bounds 185
    5.3.1 Algorithmic Stability for Regression 185
    5.3.2 Algorithmic Stability for Classification 190
  5.4 Bibliographical Remarks 193

III APPENDICES

A Theoretical Background and Basic Inequalities 199
  A.1 Notation 199
  A.2 Probability Theory 200
    A.2.1 Some Results for Random Variables 203
    A.2.2 Families of Probability Measures 207
  A.3 Functional Analysis and Linear Algebra 215
    A.3.1 Covering, Packing and Entropy Numbers 220
    A.3.2 Matrix Algebra 222
  A.4 Ill-Posed Problems 239
  A.5 Basic Inequalities 240
    A.5.1 General (In)equalities 240
    A.5.2 Large Deviation Bounds 243

B Proofs and Derivations—Part I 253
  B.1 Functions of Kernels 253
  B.2 Efficient Computation of String Kernels 254
    B.2.1 Efficient Computation of the Substring Kernel 255
    B.2.2 Efficient Computation of the Subsequence Kernel 255
  B.3 Representer Theorem 257
  B.4 Convergence of the Perceptron 258