A (short) Introduction to Support Vector Machines and Kernelbased Learning

Johan Suykens
K.U. Leuven, ESAT-SCD-SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
Email: [email protected]
http://www.esat.kuleuven.ac.be/sista/members/suykens.html

ESANN 2003, Bruges, April 2003

Overview

• Disadvantages of classical neural nets
• SVM properties and standard SVM classifier
• Related kernelbased learning methods
• Use of the “kernel trick” (Mercer Theorem)
• LS-SVMs: extending the SVM framework
• Towards a next generation of universally applicable models?
• The problem of learning and generalization

Classical MLPs

[Figure: multilayer perceptron diagram with inputs $x_1, \ldots, x_n$, weights $w_1, \ldots, w_n$, hidden units $h(\cdot)$, bias $b$ and output $y$]

Multilayer Perceptron (MLP) properties:

• Universal approximation of continuous nonlinear functions
• Learning from input-output patterns; either off-line or on-line learning
• Parallel network architecture, multiple inputs and outputs
• Use in feedforward and recurrent networks
• Use in supervised and unsupervised learning applications

Problems: existence of many local minima! How many neurons are needed for a given task?

Support Vector Machines (SVM)

[Figure: cost function versus weights for an MLP (many local minima) and for an SVM (convex)]

• Nonlinear classification and function estimation by convex optimization with a unique solution and primal-dual interpretations.
• Number of neurons automatically follows from a convex program.
• Learning and generalization in huge dimensional input spaces (able to avoid the curse of dimensionality!).
• Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, ...). Application-specific kernels possible (e.g. textmining, bioinformatics).

SVM: support vectors

[Figure: two example data sets in the $(x_1, x_2)$ plane with nonlinear decision boundaries and highlighted support vectors]

• Decision boundary can be expressed in terms of a limited number of support vectors (subset of given training data); sparseness property.
• Classifier follows from the solution to a convex QP problem.

SVMs: living in two worlds ...

[Figure: primal network parametrized by the feature map $\varphi(x) = [\varphi_1(x), \ldots, \varphi_{n_h}(x)]$ and weights $w$; dual network parametrized by kernels $K(x, x_1), \ldots, K(x, x_{\#sv})$ and coefficients $\alpha$; mapping from input space to feature space]

→ Primal space (large data sets), parametric: estimate $w \in \mathbb{R}^{n_h}$ in
  $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$

“Kernel trick”: $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

→ Dual space (high dimensional inputs), non-parametric: estimate $\alpha \in \mathbb{R}^N$ in
  $y(x) = \mathrm{sign}[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b]$

Standard SVM classifier (1)

• Training set $\{x_i, y_i\}_{i=1}^{N}$: inputs $x_i \in \mathbb{R}^n$; class labels $y_i \in \{-1, +1\}$
• Classifier: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$
  with $\varphi(\cdot): \mathbb{R}^n \rightarrow \mathbb{R}^{n_h}$ a mapping to a high dimensional feature space (which can be infinite dimensional!)
• For separable data, assume
  $w^T \varphi(x_i) + b \geq +1$ if $y_i = +1$ and $w^T \varphi(x_i) + b \leq -1$ if $y_i = -1$,
  i.e. $y_i [w^T \varphi(x_i) + b] \geq 1, \; \forall i$
• Optimization problem (non-separable case):
  $\min_{w,b,\xi} \; \mathcal{J}(w, \xi) = \frac{1}{2} w^T w + c \sum_{i=1}^{N} \xi_i$
  subject to $y_i [w^T \varphi(x_i) + b] \geq 1 - \xi_i$ and $\xi_i \geq 0$, $i = 1, \ldots, N$
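To make the primal problem concrete, the following is a minimal numerical sketch for the special case $\varphi(x) = x$ (linear SVM): it solves the soft-margin problem above with a general-purpose solver. The toy data set, the choice $c = 1$ and all variable names are illustrative assumptions, not part of the original tutorial.

```python
# Minimal sketch (illustrative): primal soft-margin SVM for phi(x) = x
#   min_{w,b,xi}  (1/2) w'w + c * sum_i xi_i
#   s.t.          y_i (w'x_i + b) >= 1 - xi_i,   xi_i >= 0
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# assumed toy data: two Gaussian clouds with labels in {-1, +1}
X = np.vstack([rng.normal(-1.0, 0.6, size=(20, 2)),
               rng.normal(+1.0, 0.6, size=(20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
N, n = X.shape
c = 1.0                                  # assumed regularization constant

def unpack(z):                           # decision variables z = [w, b, xi]
    return z[:n], z[n], z[n + 1:]

def objective(z):
    w, _, xi = unpack(z)
    return 0.5 * w @ w + c * xi.sum()

def margin(z):                           # y_i (w'x_i + b) - 1 + xi_i >= 0
    w, b, xi = unpack(z)
    return y * (X @ w + b) - 1.0 + xi

res = minimize(objective, np.zeros(n + 1 + N), method="SLSQP",
               bounds=[(None, None)] * (n + 1) + [(0.0, None)] * N,
               constraints=[{"type": "ineq", "fun": margin}])

w_opt, b_opt, _ = unpack(res.x)
y_hat = np.sign(X @ w_opt + b_opt)       # classifier y(x) = sign(w'x + b)
print("training accuracy:", (y_hat == y).mean())
```

In practice one does not solve this primal problem directly for a nonlinear $\varphi(\cdot)$; instead one passes to the dual, as derived on the next slides.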
Standard SVM classifier (2)

• Lagrangian:
  $\mathcal{L}(w, b, \xi; \alpha, \nu) = \mathcal{J}(w, \xi) - \sum_{i=1}^{N} \alpha_i \{ y_i [w^T \varphi(x_i) + b] - 1 + \xi_i \} - \sum_{i=1}^{N} \nu_i \xi_i$
• Find saddle point: $\max_{\alpha, \nu} \min_{w, b, \xi} \mathcal{L}(w, b, \xi; \alpha, \nu)$
• One obtains
  $\frac{\partial \mathcal{L}}{\partial w} = 0 \;\rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i \varphi(x_i)$
  $\frac{\partial \mathcal{L}}{\partial b} = 0 \;\rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0$
  $\frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \;\rightarrow\; 0 \leq \alpha_i \leq c, \; i = 1, \ldots, N$

Standard SVM classifier (3)

• Dual problem: QP problem
  $\max_{\alpha} \; Q(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{N} y_i y_j K(x_i, x_j) \, \alpha_i \alpha_j + \sum_{j=1}^{N} \alpha_j$
  subject to $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $0 \leq \alpha_i \leq c$, $\forall i$,
  with kernel trick (Mercer Theorem): $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
• Obtained classifier: $y(x) = \mathrm{sign}[\sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b]$

Some possible kernels $K(\cdot, \cdot)$:
  $K(x, x_i) = x_i^T x$ (linear SVM)
  $K(x, x_i) = (x_i^T x + \tau)^d$ (polynomial SVM of degree $d$)
  $K(x, x_i) = \exp\{ -\| x - x_i \|_2^2 / \sigma^2 \}$ (RBF kernel)
  $K(x, x_i) = \tanh(\kappa \, x_i^T x + \theta)$ (MLP kernel)

Kernelbased learning: many related methods and fields

[Figure: diagram relating SVMs, regularization networks, LS-SVMs, Gaussian processes, kernel ridge regression and kriging, all connected to RKHS (“= ?”)]

Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba

SVMs are closely related to learning in Reproducing Kernel Hilbert Spaces.
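Returning to the dual problem on the “Standard SVM classifier (3)” slide: it translates almost literally into code. The sketch below (again an illustrative assumption, not the author's implementation or any specific toolbox) builds an RBF kernel matrix, solves the dual QP with a generic solver, and evaluates the resulting kernel-expansion classifier; the toy data, $c = 1$ and $\sigma^2 = 1$ are arbitrary choices.

```python
# Minimal sketch (illustrative): dual soft-margin SVM with an RBF kernel
#   max_alpha  -(1/2) sum_ij y_i y_j K(x_i,x_j) alpha_i alpha_j + sum_j alpha_j
#   s.t.       sum_i alpha_i y_i = 0,   0 <= alpha_i <= c
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1.0, 0.7, size=(20, 2)),
               rng.normal(+1.0, 0.7, size=(20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
N, c, sigma2 = len(y), 1.0, 1.0          # assumed hyperparameters

def rbf(A, B):                           # K(x, z) = exp(-||x - z||^2 / sigma^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma2)

K = rbf(X, X)
H = (y[:, None] * y[None, :]) * K        # H_ij = y_i y_j K(x_i, x_j)

res = minimize(lambda a: 0.5 * a @ H @ a - a.sum(),    # minimize -Q(alpha)
               np.zeros(N), method="SLSQP",
               bounds=[(0.0, c)] * N,                  # 0 <= alpha_i <= c
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])
alpha = res.x

support = alpha > 1e-6                   # sparseness: only a few alpha_i > 0
margin_sv = support & (alpha < c - 1e-6) # margin SVs give the bias b through
b = np.mean(y[margin_sv] - K[margin_sv] @ (alpha * y))  # y_k [sum_i a_i y_i K + b] = 1

def predict(Xnew):                       # y(x) = sign(sum_i alpha_i y_i K(x, x_i) + b)
    return np.sign(rbf(Xnew, X) @ (alpha * y) + b)

print("support vectors:", int(support.sum()), "of", N)
print("training accuracy:", (predict(X) == y).mean())
```

Swapping `rbf` for the linear, polynomial or MLP kernel from the list above changes only the kernel function; the optimization problem and the classifier stay the same, which is precisely the point of the kernel trick.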