WILEY SERIES IN PROBABILITY AND STATISTICS
Established by WALTER A. SHEWHART and SAMUEL S. WILKS

Editors: David J. Balding, Noel A. C. Cressie, Garrett M. Fitzmaurice, Harvey Goldstein, Iain M. Johnstone, Geert Molenberghs, David W. Scott, Adrian F. M. Smith, Ruey S. Tsay, Sanford Weisberg
Editors Emeriti: Vic Barnett, J. Stuart Hunter, Joseph B. Kadane, Jozef L. Teugels

A complete list of the titles in this series appears at the end of this volume.

An Elementary Introduction to Statistical Learning Theory

SANJEEV KULKARNI
Department of Electrical Engineering
School of Engineering and Applied Science
Princeton University
Princeton, New Jersey

GILBERT HARMAN
Department of Philosophy
Princeton University
Princeton, New Jersey

A JOHN WILEY & SONS, INC., PUBLICATION

Copyright © 2011 John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470, or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011, fax (201) 748-6008, or online at http://www.wiley.com/go/permission.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

For general information on our other products and services or for technical support, please contact our Customer Care Department within the United States at (800) 762-2974, outside the United States at (317) 572-3993 or fax (317) 572-4002.

Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic formats. For more information about Wiley products, visit our web site at www.wiley.com.

Library of Congress Cataloging-in-Publication Data:

Kulkarni, Sanjeev.
  An elementary introduction to statistical learning theory / Sanjeev Kulkarni, Gilbert Harman.
    p. cm. (Wiley series in probability and statistics)
  Includes index.
  ISBN 978-0-470-64183-5 (cloth)
  1. Machine learning–Statistical methods. 2. Pattern recognition systems. I. Harman, Gilbert. II. Title.
  Q325.5.K85 2011
  006.3'1–dc22
  2010045223

Printed in Singapore

oBook ISBN: 978-1-118-02347-1
ePDF ISBN: 978-1-118-02343-3
ePub ISBN: 978-1-118-02346-4

10 9 8 7 6 5 4 3 2 1

Contents

Preface xiii

1 Introduction: Classification, Learning, Features, and Applications 1
  1.1 Scope, 1
  1.2 Why Machine Learning? 2
  1.3 Some Applications, 3
    1.3.1 Image Recognition, 3
    1.3.2 Speech Recognition, 3
    1.3.3 Medical Diagnosis, 4
    1.3.4 Statistical Arbitrage, 4
  1.4 Measurements, Features, and Feature Vectors, 4
  1.5 The Need for Probability, 5
  1.6 Supervised Learning, 5
  1.7 Summary, 6
  1.8 Appendix: Induction, 6
  1.9 Questions, 7
  1.10 References, 8

2 Probability 10
  2.1 Probability of Some Basic Events, 10
  2.2 Probabilities of Compound Events, 12
  2.3 Conditional Probability, 13
  2.4 Drawing Without Replacement, 14
  2.5 A Classic Birthday Problem, 15
  2.6 Random Variables, 15
  2.7 Expected Value, 16
  2.8 Variance, 17
  2.9 Summary, 19
  2.10 Appendix: Interpretations of Probability, 19
  2.11 Questions, 20
  2.12 References, 21

3 Probability Densities 23
  3.1 An Example in Two Dimensions, 23
  3.2 Random Numbers in [0,1], 23
  3.3 Density Functions, 24
  3.4 Probability Densities in Higher Dimensions, 27
  3.5 Joint and Conditional Densities, 27
  3.6 Expected Value and Variance, 28
  3.7 Laws of Large Numbers, 29
  3.8 Summary, 30
  3.9 Appendix: Measurability, 30
  3.10 Questions, 32
  3.11 References, 32

4 The Pattern Recognition Problem 34
  4.1 A Simple Example, 34
  4.2 Decision Rules, 35
  4.3 Success Criterion, 37
  4.4 The Best Classifier: Bayes Decision Rule, 37
  4.5 Continuous Features and Densities, 38
  4.6 Summary, 39
  4.7 Appendix: Uncountably Many, 39
  4.8 Questions, 40
  4.9 References, 41

5 The Optimal Bayes Decision Rule 43
  5.1 Bayes Theorem, 43
  5.2 Bayes Decision Rule, 44
  5.3 Optimality and Some Comments, 45
  5.4 An Example, 47
  5.5 Bayes Theorem and Decision Rule with Densities, 48
  5.6 Summary, 49
  5.7 Appendix: Defining Conditional Probability, 50
  5.8 Questions, 50
  5.9 References, 53

6 Learning from Examples 55
  6.1 Lack of Knowledge of Distributions, 55
  6.2 Training Data, 56
  6.3 Assumptions on the Training Data, 57
  6.4 A Brute Force Approach to Learning, 59
  6.5 Curse of Dimensionality, Inductive Bias, and No Free Lunch, 60
  6.6 Summary, 61
  6.7 Appendix: What Sort of Learning? 62
  6.8 Questions, 63
  6.9 References, 64

7 The Nearest Neighbor Rule 65
  7.1 The Nearest Neighbor Rule, 65
  7.2 Performance of the Nearest Neighbor Rule, 66
  7.3 Intuition and Proof Sketch of Performance, 67
  7.4 Using More Neighbors, 69
  7.5 Summary, 70
  7.6 Appendix: When People Use Nearest Neighbor Reasoning, 70
    7.6.1 Who Is a Bachelor? 70
    7.6.2 Legal Reasoning, 71
    7.6.3 Moral Reasoning, 71
  7.7 Questions, 72
  7.8 References, 73

8 Kernel Rules 74
  8.1 Motivation, 74
  8.2 A Variation on Nearest Neighbor Rules, 75
  8.3 Kernel Rules, 76
  8.4 Universal Consistency of Kernel Rules, 79
  8.5 Potential Functions, 80
  8.6 More General Kernels, 81
  8.7 Summary, 82
  8.8 Appendix: Kernels, Similarity, and Features, 82
  8.9 Questions, 83
  8.10 References, 84

9 Neural Networks: Perceptrons 86
  9.1 Multilayer Feedforward Networks, 86
  9.2 Neural Networks for Learning and Classification, 87
  9.3 Perceptrons, 89
    9.3.1 Threshold, 90
  9.4 Learning Rule for Perceptrons, 90
  9.5 Representational Capabilities of Perceptrons, 92
  9.6 Summary, 94
  9.7 Appendix: Models of Mind, 95
  9.8 Questions, 96
  9.9 References, 97

10 Multilayer Networks 99
  10.1 Representation Capabilities of Multilayer Networks, 99
  10.2 Learning and Sigmoidal Outputs, 101
  10.3 Training Error and Weight Space, 104
  10.4 Error Minimization by Gradient Descent, 105
  10.5 Backpropagation, 106
  10.6 Derivation of Backpropagation Equations, 109
    10.6.1 Derivation for a Single Unit, 110
    10.6.2 Derivation for a Network, 111
  10.7 Summary, 113
  10.8 Appendix: Gradient Descent and Reasoning toward Reflective Equilibrium, 113
  10.9 Questions, 114
  10.10 References, 115

11 PAC Learning 116
  11.1 Class of Decision Rules, 117
  11.2 Best Rule from a Class, 118
  11.3 Probably Approximately Correct Criterion, 119
  11.4 PAC Learning, 120
  11.5 Summary, 122
  11.6 Appendix: Identifying Indiscernibles, 122
  11.7 Questions, 123
  11.8 References, 123

12 VC Dimension 125
  12.1 Approximation and Estimation Errors, 125
  12.2 Shattering, 126
  12.3 VC Dimension, 127
  12.4 Learning Result, 128
  12.5 Some Examples, 129
  12.6 Application to Neural Nets, 132
  12.7 Summary, 133
  12.8 Appendix: VC Dimension and Popper Dimension, 133
  12.9 Questions, 134
  12.10 References, 135

13 Infinite VC Dimension 137
  13.1 A Hierarchy of Classes and Modified PAC Criterion, 138
  13.2 Misfit Versus Complexity Trade-Off, 138
  13.3 Learning Results, 139
  13.4 Inductive Bias and Simplicity, 140
  13.5 Summary, 141
  13.6 Appendix: Uniform Convergence and Universal Consistency, 141
  13.7 Questions, 142
  13.8 References, 143

14 The Function Estimation Problem 144
  14.1 Estimation, 144
  14.2 Success Criterion, 145
  14.3 Best Estimator: Regression Function, 146
  14.4 Learning in Function Estimation, 146
  14.5 Summary, 147
  14.6 Appendix: Regression Toward the Mean, 147
  14.7 Questions, 148
  14.8 References, 149

15 Learning Function Estimation 150
  15.1 Review of the Function Estimation/Regression Problem, 150
  15.2 Nearest Neighbor Rules, 151
  15.3 Kernel Methods, 151
  15.4 Neural Network Learning, 152
  15.5 Estimation with a Fixed Class of Functions, 153
  15.6 Shattering, Pseudo-Dimension, and Learning, 154
  15.7 Conclusion, 156
  15.8 Appendix: Accuracy, Precision, Bias, and Variance in Estimation, 156
  15.9 Questions, 157
  15.10 References, 158

16 Simplicity 160
  16.1 Simplicity in Science, 160
    16.1.1 Explicit Appeals to Simplicity, 160
    16.1.2 Is the World Simple? 161
    16.1.3 Mistaken Appeals to Simplicity, 161
    16.1.4 Implicit Appeals to Simplicity, 161
  16.2 Ordering Hypotheses, 162
    16.2.1 Two Kinds of Simplicity Orderings, 162
  16.3 Two Examples, 163
    16.3.1 Curve Fitting, 163
    16.3.2 Enumerative Induction, 164
  16.4 Simplicity as Simplicity of Representation, 165
    16.4.1 Fix on a Particular System of Representation? 166
    16.4.2 Are Fewer Parameters Simpler? 167
  16.5 Pragmatic Theory of Simplicity, 167
  16.6 Simplicity and Global Indeterminacy, 168
  16.7 Summary, 169
  16.8 Appendix: Basic Science and Statistical Learning Theory, 169
  16.9 Questions, 170
  16.10 References, 170

17 Support Vector Machines 172
  17.1 Mapping the Feature Vectors, 173
  17.2 Maximizing the Margin, 175
  17.3 Optimization and Support Vectors, 177
  17.4 Implementation and Connection to Kernel Methods, 179
  17.5 Details of the Optimization Problem, 180
    17.5.1 Rewriting Separation Conditions, 180
    17.5.2 Equation for Margin, 181
    17.5.3 Slack Variables for Nonseparable Examples, 181
    17.5.4 Reformulation and Solution of Optimization, 182
  17.6 Summary, 183