
Learning with Kernels, submitted by Diplom-Physiker Alexander Johannes Smola ...

236 Pages·2003·1.71 MB·English


Learning with Kernels

Submitted by Diplom-Physiker Alexander Johannes Smola to Department 13 (Computer Science) of the Technische Universität Berlin in fulfilment of the requirements for the academic degree of Doktor der Naturwissenschaften (Dr. rer. nat.).

Doctoral committee:
Chair: Prof. Dr. A. Biedl
Reviewer: Prof. Dr. S. Jähnichen
Reviewer: Prof. Dr. J. Shawe-Taylor

Date of the oral defense: 12 November 1998, 14:00
Berlin 1998, D 83

Foreword

The present thesis can take its place among the numerous doctoral theses and other publications that are currently revolutionizing the area of machine learning. The author's basic concern is with kernel-based methods and in particular Support Vector algorithms for regression estimation for the solution of inverse, often ill-posed problems. However, Alexander Smola's thesis stands out from many of the other publications in this field. This is due in part to the author's profound theoretical penetration of his subject-matter, but also and in particular to the wealth of detailed results he has included.

Especially neat and of particular relevance are the algorithmic extensions of Support Vector Machines, which can be combined as building blocks, thus markedly improving the Support Vectors. Of substantial interest is also the very elegant unsupervised method for nonlinear feature extraction, which applies the kernel-based method to classical Principal Component Analysis (kernel PCA). And although only designed to illustrate the theoretical results, the practical applications the author gives us from the area of
high-energy physics and time-series analysis are highly convincing.

In many respects the thesis is groundbreaking, but it is likely to soon become a frequently cited work for numerous innovative applications from the field of statistical machine learning and for improving our theoretical understanding of Support Vector Machines.

Stefan Jähnichen, Professor, Technische Universität Berlin
Director, GMD Berlin

Foreword

Alex Smola's thesis has branched out in at least five novel directions broadly based around kernel learning machines: analysis of cost functions, relations to regularization networks, optimization algorithms, extensions to unsupervised learning including regularized principal manifolds, entropy numbers for linear operators and applications to bounding covering numbers. I will highlight some of the significant contributions made in each of these areas.

Cost Functions

This section presents a very neat coherent view of cost functions and their effect on the overall algorithmics of kernel regression. In particular, it is shown how using a general convex cost function still allows the problem to be cast as a convex programming problem solvable via the dual. Experiments show that choosing the right cost function can improve performance. The section goes on to describe a very useful approach to choosing the ε for the ε-insensitive loss measure, based on traditional statistical methods. Further refinements arising from this approach give a new algorithm termed ν-SV regression.

Kernels and Regularization

The chapter covers the relation between kernels used in Support Vector Machines and Regularization Networks. This connection is a very valuable contribution to understanding the operation of SV machines and in particular their generalization properties.
The analysis of particular kernels and experiments showing the effects of their regularization properties are very illuminating. Consideration of higher dimensional input spaces is made and the case of dot product kernels studied in some detail. This leads to the introduction of Conditionally Positive Definite Kernels and semiparametric estimation, both of which are new in the context of SV machines.

Optimization Algorithms

This section takes the interior point methods and implements them for SV regression and classification. By taking into account the specifics of the problem efficiency savings have been made. The consideration then turns to subset selection to handle large data sets. This introduces among other techniques, SMO or sequential minimal optimization. This approach is generalized to the regression case and proves an extremely efficient method.

Unsupervised Learning

The extension to Kernel PCA is a nice idea which appears to work well in practice. The further work on Regularized Principal Manifolds is very novel and opens up a number of interesting problems and techniques.

Entropy numbers, kernels and operators

The estimation of covering numbers via techniques from operator theory is another major contribution to the state-of-the-art. Many new results are presented, among others the generalization bounds for Regularized Principal Manifolds are given.

John Shawe-Taylor, Professor, Royal Holloway, University of London

To my parents

Abstract

Support Vector (SV) Machines combine several techniques from statistics, machine learning and neural networks. One of the most important ingredients are kernels, i.e. the concept of transforming linear algorithms into nonlinear ones via a map into feature spaces. The present work focuses on the following issues:

Extensions of
Support Vector Machines.
Extensions of kernel methods to other algorithms such as unsupervised learning.
Capacity bounds which are particularly well suited for kernel methods.

After a brief introduction to SV regression it is shown how the classical ε-insensitive loss function can be replaced by other cost functions while keeping the original advantages or adding other features such as automatic parameter adaptation. Moreover the connection between kernels and regularization is pointed out. A theoretical analysis of several common kernels follows and criteria to check Mercer's condition more easily are presented. Further modifications lead to semiparametric models and greedy approximation schemes. Next three different types of optimization algorithms, namely interior point codes, subset selection algorithms, and sequential minimal optimization (including pseudocode) are presented. The primal-dual framework is used as an analytic tool in this context.

Unsupervised learning is an extension of kernel methods to new problems. Besides Kernel PCA one can use the regularization to obtain more general feature extractors. A second approach leads to regularized quantization functionals which allow a smooth transition between the Generative Topographic Map and Principal Curves.

The second part deals with uniform convergence bounds for the algorithms and concepts presented so far. It starts with a brief self contained overview and an introduction to functional analytic tools which play a crucial role in this problem. By viewing the class of kernel expansions as an image of a linear operator one may give bounds on the generalization ability of kernel expansions even when standard concepts like the VC dimension fail or give too conservative estimates. In particular it is shown that it is possible to compute the covering numbers of the given hypothesis classes
directly instead of taking the detour via the VC dimension. Applications of the new tools to SV machines, convex combinations of hypotheses (i.e. boosting and sparse coding), greedy approximation schemes, and principal curves conclude the presentation.

Keywords: Support Vectors, Regression, Kernel Expansions, Regularization, Statistical Learning Theory, Uniform Convergence

Abstract (translated from the German)

Support Vector (SV) Machines combine various techniques from statistics, machine learning and neural networks. Kernels play a key role, i.e. the concept of making linear algorithms nonlinear via a map into feature spaces. The thesis addresses the following aspects:

Extensions of the Support Vector algorithm.
Extensions and applications of kernel-based methods to other algorithms such as unsupervised learning.
Bounds on the generalization ability that are tailored to kernel-based methods.

After a brief introduction to SV regression it is shown how the ε-insensitive cost function can be replaced by other functions while retaining the advantages of the original algorithm, or adding new properties such as automatic parameter adaptation.

Furthermore, the connection between kernels and regularization is pointed out. A theoretical analysis of several commonly used kernels follows, together with criteria for easily checking Mercer's condition. Further modifications lead to semiparametric models and "greedy" approximation schemes. Finally, three optimization algorithms are presented, namely interior point methods, subset selection algorithms and sequential minimal optimization. The primal-dual concept of optimization serves as the analytic tool here. Pseudocode is also provided in this context.

Unsupervised learning is an application of kernel-based methods to new problems. Besides kernel PCA, the regularization concept can be used to obtain more general feature extractors. A second approach leads to a smooth transition between the Generative Topographic Map (GTM) and principal curves.

The second part of the thesis deals with uniform convergence bounds for the algorithms and concepts presented so far. To this end, a brief overview of existing techniques for capacity control and of functional analysis is given first. The latter plays a decisive role, since the class of kernel expansions can be viewed as the image of a linear operator, which makes bounds on the generalization ability possible even in cases where classical approaches such as the VC dimension fail or give overly conservative estimates. In particular it is shown that the covering numbers of a given hypothesis class can be computed directly, without taking the detour via computing the VC dimension.

The new methods find applications in Support Vector Machines, convex combinations of hypotheses (e.g. boosting and sparse coding), "greedy" approximation schemes and principal curves.

Keywords: Support Vectors, Regression, Kernel Expansions, Regularization, Statistical Learning Theory, Uniform Convergence

Preface

The goal of this thesis is to give a self contained overview of Support Vector Machines and similar kernel based methods, mainly for Regression Estimation. It is, in this sense, complementary to Bernhard Schölkopf's work on Pattern Recognition. Yet it also contains new insights in capacity control which can be applied to classification problems as well.

It is probably best to view this work as a technical description of a toolset, namely the building blocks of a Support Vector Machine. The first part describes its basic machinery and the possible add-ons that can be used for modifying it, just like the leaflet one would get from a car dealer with a choice of all the 'extras' available. In this respect the second part could be regarded as a list of operating instructions, namely how to effectively carry out capacity control for a class of systems of the SV type.

How to read this Thesis

I tried to organize this work both in a self contained, and modular manner. Where necessary, proofs have been moved into the appendix of the corresponding chapters and can be omitted if the reader is willing to accept some results on faith. Some fundamental results, however, if needed to understand the further way of reasoning, are derived in the main body.

How not to read this Thesis

This is the work of a physicist who decided to do applied statistics, ended up in a computer science department, and sometimes had engineering applications or functional analysis in mind. Hence it provides a mixture of techniques and concepts from several domains, sufficient to annoy many readers, due to the lack of mathematical rigor (from a mathematician's point), the sometimes rather theoretical reasoning and some
technical proofs (from a practitioner's point), the almost complete lack of connection with physics, or some algorithms that work, but have not (yet) been proven to be optimal or to terminate in a finite number of steps, etc. However, I tried to split the nuisance equally among the disciplines.

Acknowledgements

I would like to thank the researchers at the Neuro group of GMD FIRST with whom I had the pleasure of carrying out research in an excellent environment. In particular the discussions with Peter Behr, Thilo Frieß, Jens Kohlmorgen, Steven Lemm, Sebastian Mika, Takashi Onoda, Petra Philips, Gunnar Rätsch, Andras Ziehe and many others were often inspiring. Besides that, the system administrators Roger Holst and Andreas Schulz helped in many ways.

The second lab to be thanked is the machine learning group at ANU, Canberra. Peter Bartlett, Jon Baxter, Shai Ben-David, and Lew Mason helped me in getting a deeper understanding of Statistical Learning Theory. The research visit at ANU was also one of the most productive periods of the past two years.

Next to mention are two departments at AT&T and Bell Laboratories, with Leon Bottou, Chris Burges, Yann LeCun, Patrick Haffner, Craig Nohl, Patrice Simard, and Charles Stenard. Much of the work would not have been possible, if I had not had the chance to learn about Support Vectors in their, then, joint research facility in Holmdel, USA. I am particularly grateful to Vladimir Vapnik in this respect. His "hands on" approach to statistics was a reliable guide to interesting problems.

Moreover I had the fortune to discuss with and get help from people like Bernd Carl, Nello Cristianini, Leo van Hemmen, Adam Krzyzak, David MacKay, Heidrun
Mündlein, Noboru Murata, Manfred Opper, John Shawe-Taylor, Mark Stitson, Sara Solla, Grace Wahba, and Jason Weston. Chris Burges, Andre Eliseeff, Ralf Herbrich, Olvi Mangasarian, Klaus-Robert Müller, Bernhard Schölkopf, John Shawe-Taylor, Anja Westerhoff and Robert Williamson gave helpful comments on the thesis and found many errors.

Special thanks go to the three people with whom most of the results of this thesis were obtained, namely Klaus-Robert Müller, Bernhard Schölkopf, and Robert Williamson: thanks to Klaus in particular for always reminding me of the necessity that theoretical advances have to be backed by experimental evidence, for advice and discussions about learning theory and neural networks, and for allowing me to focus on research, quite undisturbed from administrative chores; thanks to Bernhard for many discussions which were a major source of ideas and for ensuring that the proofs stayed theoretically sound; thanks to Bob for teaching me many things about statistical learning theory and functional analysis, and many valuable discussions.

Close collaboration is only possible if complemented by friendship. Many of these researchers, in particular Bernhard, became good friends, far beyond the level of scientific cooperation.

Finally I would like to thank Stefan Jähnichen. He provided as head of GMD and dean of the TU Berlin computer science department the research environment in which new ideas could be developed. His wise advice and guidance in scientific matters and academic issues was very helpful.

This work was made possible through funding by ARPA, grants of the DFG JA
379/71 and JA 379/51, funding from the Australian Research Council, travel grants from the NIPS foundation and NEuroNet, and support of NeuroCOLT 2.

