Learning with Kernels

submitted by
Diplom-Physiker
Alexander Johannes Smola

to Department 13 (Computer Science) of the Technische Universität Berlin
for the award of the academic degree
Doktor der Naturwissenschaften (Dr. rer. nat.)

Doctoral committee:
Chair: Prof. Dr. A. Biedl
Reviewer: Prof. Dr. S. Jähnichen
Reviewer: Prof. Dr. J. Shawe-Taylor

Date of the oral defense: 1[?] November 1998, 1[?]:00

Berlin 1998
D 83

Foreword

The present thesis can take its place among the numerous doctoral theses and other publications that are currently revolutionizing the area of machine learning. The author's basic concern is with kernel-based methods and in particular Support Vector algorithms for regression estimation for the solution of inverse, often ill-posed problems. However, Alexander Smola's thesis stands out from many of the other publications in this field. This is due in part to the author's profound theoretical penetration of his subject matter, but also and in particular to the wealth of detailed results he has included.

Especially neat and of particular relevance are the algorithmic extensions of Support Vector Machines, which can be combined as building blocks, thus markedly improving the Support Vectors. Of substantial interest is also the very elegant unsupervised method for nonlinear feature extraction, which applies the kernel-based method to classical Principal Component Analysis (kernel PCA). And although only designed to illustrate the theoretical results, the practical applications the author gives us from the area of
high-energy physics and time-series analysis are highly convincing.

In many respects the thesis is groundbreaking, and it is likely to soon become a frequently cited work for numerous innovative applications from the field of statistical machine learning and for improving our theoretical understanding of Support Vector Machines.

Stefan Jähnichen, Professor, Technische Universität Berlin
Director, GMD Berlin

Alex Smola's thesis has branched out in at least five novel directions broadly based around kernel learning machines: analysis of cost functions, relations to regularization networks, optimization algorithms, extensions to unsupervised learning including regularized principal manifolds, entropy numbers for linear operators and applications to bounding covering numbers. I will highlight some of the significant contributions made in each of these areas.

Cost Functions

This section presents a very neat, coherent view of cost functions and their effect on the overall algorithmics of kernel regression. In particular, it is shown how using a general convex cost function still allows the problem to be cast as a convex programming problem solvable via the dual. Experiments show that choosing the right cost function can improve performance. The section goes on to describe a very useful approach to choosing the ε for the ε-insensitive loss measure, based on traditional statistical methods. Further refinements arising from this approach give a new algorithm termed ν-SV regression.

Kernels and Regularization

The chapter covers the relation between kernels used in Support Vector Machines and Regularization Networks. This connection is a very valuable contribution to understanding the operation of SV machines and in particular their generalization properties.
The analysis of particular kernels and experiments showing the effects of their regularization properties are very illuminating. Consideration of higher dimensional input spaces is made and the case of dot product kernels studied in some detail. This leads to the introduction of Conditionally Positive Definite Kernels and semiparametric estimation, both of which are new in the context of SV machines.

Optimization Algorithms

This section takes the interior point methods and implements them for SV regression and classification. By taking into account the specifics of the problem, efficiency savings have been made. The consideration then turns to subset selection to handle large data sets. This introduces, among other techniques, SMO or sequential minimal optimization. This approach is generalized to the regression case and proves an extremely efficient method.

Unsupervised Learning

The extension to Kernel PCA is a nice idea which appears to work well in practice. The further work on Regularized Principal Manifolds is very novel and opens up a number of interesting problems and techniques.

Entropy numbers, kernels and operators

The estimation of covering numbers via techniques from operator theory is another major contribution to the state-of-the-art. Many new results are presented; among others, the generalization bounds for Regularized Principal Manifolds are given.

John Shawe-Taylor, Professor, Royal Holloway, University of London

To my parents

Abstract

Support Vector (SV) Machines combine several techniques from statistics, machine learning and neural networks. Among the most important ingredients are kernels, i.e. the concept of transforming linear algorithms into nonlinear ones via a map into feature spaces. The present work focuses on the following issues:

- Extensions of
Support Vector Machines.
- Extensions of kernel methods to other algorithms such as unsupervised learning.
- Capacity bounds which are particularly well suited for kernel methods.

After a brief introduction to SV regression it is shown how the classical ε-insensitive loss function can be replaced by other cost functions while keeping the original advantages or adding other features such as automatic parameter adaptation.

Moreover the connection between kernels and regularization is pointed out. A theoretical analysis of several common kernels follows and criteria to check Mercer's condition more easily are presented. Further modifications lead to semiparametric models and greedy approximation schemes. Next three different types of optimization algorithms, namely interior point codes, subset selection algorithms, and sequential minimal optimization (including pseudocode) are presented. The primal-dual framework is used as an analytic tool in this context.

Unsupervised learning is an extension of kernel methods to new problems. Besides Kernel PCA one can use the regularization to obtain more general feature extractors. A second approach leads to regularized quantization functionals which allow a smooth transition between the Generative Topographic Map and Principal Curves.

The second part deals with uniform convergence bounds for the algorithms and concepts presented so far. It starts with a brief self-contained overview and an introduction to functional analytic tools which play a crucial role in this problem. By viewing the class of kernel expansions as an image of a linear operator one may give bounds on the generalization ability of kernel expansions even when standard concepts like the VC dimension fail or give too conservative estimates. In particular it is shown that it is possible to compute the covering numbers of the given hypothesis classes
directly instead of taking the detour via the VC dimension. Applications of the new tools to SV machines, convex combinations of hypotheses (i.e. boosting and sparse coding), greedy approximation schemes, and principal curves conclude the presentation.

Keywords: Support Vectors, Regression, Kernel Expansions, Regularization, Statistical Learning Theory, Uniform Convergence

Abstract (translated from the German)

Support Vector (SV) Machines combine various techniques from statistics, machine learning and neural networks. A key role falls to kernels, i.e. the concept of making linear algorithms nonlinear by means of a map into feature spaces. The dissertation addresses the following aspects:

- Extensions of the Support Vector algorithm
- Extensions and applications of kernel-based methods to other algorithms such as unsupervised learning
- Bounds on the generalization ability that are specifically tailored to kernel-based methods

After a brief introduction to SV regression it is shown how the ε-insensitive cost function can be replaced by other functions while the advantages of the original algorithm are retained, or new properties such as automatic parameter adaptation are added.

Furthermore, the connection between kernels and regularization is pointed out. A theoretical analysis of several frequently used kernels follows, together with criteria for easily checking Mercer's condition. Further modifications lead to semiparametric models as well as "greedy" approximation schemes. Finally, three optimization algorithms are presented, namely the interior point method, subset selection algorithms, and sequential minimal optimization. The primal-dual concept of optimization serves as the analytic tool here. Pseudocode is also provided in this context.

Unsupervised learning is an application of kernel-based methods to new problems. Besides kernel PCA, one can use the regularization concept to obtain more general feature extractors. A second approach leads to a continuous transition between the Generative Topographic Map (GTM) and principal curves.

The second part of the dissertation deals with uniform convergence bounds for the algorithms and concepts presented so far. To this end, a brief overview of existing techniques for capacity control and of functional analysis is given first. The latter plays a decisive role, since the class of kernel expansions can be viewed as the image under a linear operator, which allows bounds on the generalization ability even in cases where classical approaches such as the VC dimension fail or give overly conservative estimates.

In particular it is shown that it is possible to compute the covering numbers of a given hypothesis class directly, without taking the detour via computing the VC dimension. The new methods find applications in Support Vector Machines, convex combinations of hypotheses (e.g. boosting and sparse coding), "greedy" approximation schemes, and principal curves.

Keywords: Support Vectors, Regression, Kernel Expansions, Regularization, Statistical Learning Theory, Uniform Convergence

Preface

The goal of this thesis is to give a self-contained overview of Support Vector Machines and similar kernel-based methods, mainly for Regression Estimation. It is, in this sense, complementary to Bernhard Schölkopf's work on Pattern Recognition. Yet it also contains new insights into capacity control which can be applied to classification problems as well.

It is probably best to view this work as a technical description of a toolset, namely the building blocks of a Support Vector Machine. The first part describes its basic machinery and the possible add-ons that can be used for modifying it, just like the leaflet one would get from a car dealer with a choice of all the 'extras' available.

In this respect the second part could be regarded as a list of operating instructions, namely how to effectively carry out capacity control for a class of systems of the SV type.

How to read this Thesis

I tried to organize this work in both a self-contained and a modular manner. Where necessary, proofs have been moved into the appendix of the corresponding chapters and can be omitted if the reader is willing to accept some results on faith. Some fundamental results, however, if needed to understand the further way of reasoning, are derived in the main body.

How not to read this Thesis

This is the work of a physicist who decided to do applied statistics, ended up in a computer science department, and sometimes had engineering applications or functional analysis in mind. Hence it provides a mixture of techniques and concepts from several domains, sufficient to annoy many readers, due to the lack of mathematical rigor (from a mathematician's point of view), the sometimes rather theoretical reasoning and some
technical proofs (from a practitioner's point of view), the near-total lack of connection with physics, or some algorithms that work, but have not (yet) been proven to be optimal or to terminate in a finite number of steps, etc. However, I tried to split the nuisance equally among the disciplines.

Acknowledgements

I would like to thank the researchers at the Neuro group of GMD FIRST with whom I had the pleasure of carrying out research in an excellent environment. In particular the discussions with Peter Behr, Thilo Frieß, Jens Kohlmorgen, Steven Lemm, Sebastian Mika, Takashi Onoda, Petra Philips, Gunnar Rätsch, Andras Ziehe and many others were often inspiring. Besides that, the system administrators Roger Holst and Andreas Schulz helped in many ways.

The second lab to be thanked is the machine learning group at ANU, Canberra. Peter Bartlett, Jon Baxter, Shai Ben-David, and Lew Mason helped me in getting a deeper understanding of Statistical Learning Theory. The research visit at ANU was also one of the most productive periods of the past two years.

Next to mention are two departments at AT&T and Bell Laboratories, with Leon Bottou, Chris Burges, Yann LeCun, Patrick Haffner, Craig Nohl, Patrice Simard, and Charles Stenard. Much of the work would not have been possible if I had not had the chance to learn about Support Vectors in their, then, joint research facility in Holmdel, USA. I am particularly grateful to Vladimir Vapnik in this respect. His "hands on" approach to statistics was a reliable guide to interesting problems.

Moreover I had the fortune to discuss with and get help from people like Bernd Carl, Nello Cristianini, Leo van Hemmen, Adam Krzyzak, David MacKay, Heidrun
Mündlein, Noboru Murata, Manfred Opper, John Shawe-Taylor, Mark Stitson, Sara Solla, Grace Wahba, and Jason Weston. Chris Burges, Andre Eliseeff, Ralf Herbrich, Olvi Mangasarian, Klaus-Robert Müller, Bernhard Schölkopf, John Shawe-Taylor, Anja Westerhoff and Robert Williamson gave helpful comments on the thesis and found many errors.

Special thanks go to the three people with whom most of the results of this thesis were obtained, namely Klaus-Robert Müller, Bernhard Schölkopf, and Robert Williamson: thanks to Klaus in particular for always reminding me of the necessity that theoretical advances have to be backed by experimental evidence, for advice and discussions about learning theory and neural networks, and for allowing me to focus on research, quite undisturbed from administrative chores; thanks to Bernhard for many discussions which were a major source of ideas and for ensuring that the proofs stayed theoretically sound; thanks to Bob for teaching me many things about statistical learning theory and functional analysis, and many valuable discussions.

Close collaboration is only possible if complemented by friendship. Many of these researchers, in particular Bernhard, became good friends, far beyond the level of scientific cooperation.

Finally I would like to thank Stefan Jähnichen. He provided as head of GMD and dean of the TU Berlin computer science department the research environment in which new ideas could be developed. His wise advice and guidance in scientific matters and academic issues was very helpful.

This work was made possible through funding by ARPA, grants of the DFG JA
379/71 and JA 379/51, funding from the Australian Research Council, travel grants from the NIPS foundation and NEuroNet, and support of NeuroCOLT 2.