Bachelor's Thesis (Trabajo Fin de Grado)

MATHEMATICAL OPTIMIZATION AND FEATURE SELECTION

Presented by: Alejandro Casado Reinaldos
Supervisors: Dr. Rafael Blanquero Bravo, Universidad de Sevilla
             Dr. Emilio Carrizosa Priego, Universidad de Sevilla

June 21, 2015

Contents

1 Introduction
2 Support Vector Machine
  2.1 Linear Support Vector Machine
    2.1.1 The Linearly Separable Case
    2.1.2 The Linearly Nonseparable Case
  2.2 Nonlinear Support Vector Machine
    2.2.1 The "Kernel Trick"
    2.2.2 Kernels and Their Properties
    2.2.3 Examples of Kernels
    2.2.4 Building the Optimal Path
  2.3 SVM in R
  2.4 Ramp Loss SVM
3 Feature Selection in SVM

Chapter 1

Introduction

Supervised classification is a common task in big data. It seeks procedures for classifying the objects of a set Ω into a set C of classes, [7]. Supervised classification has been successfully applied in many different fields. Examples are found in text categorization, such as document indexing, webpage classification and spam filtering; in biology and medicine, such as classification of gene expression data, homology detection, protein–protein interaction prediction, abnormal brain activity classification and cancer diagnosis; in machine vision; in agriculture; and in chemistry, to cite a few.

Mathematical optimization has played a crucial role in supervised classification. Techniques from very diverse fields within mathematical optimization have been shown to be useful. The Support Vector Machine (SVM) is one of the main exponents of the application of mathematical optimization to supervised classification. SVM is a state-of-the-art method for supervised learning. For the two-class case, SVM aims at separating the two classes by means of a hyperplane which maximizes the margin, i.e., the width of the band separating the two sets. This geometrical optimization problem can be written as a convex quadratic optimization problem with linear constraints, in principle solvable by any nonlinear optimization procedure.

In some applications the number of features is huge, and training the SVM on the entire feature set would be computationally very expensive, while its outcome would lack insight. This is, for instance, the case in gene expression analysis and text categorization. In this setting one faces the combinatorial problem of selecting a best-possible subset of features and discarding the remaining ones. This is called the feature selection problem.

In this work we analyze SVMs and how relevant features can be identified. In Chapter 2 we describe the SVM, first in the case of a linear kernel and then for the more interesting case of nonlinear kernels. We show how the SVM can be handled in the statistical software R. Then, some issues related to feature selection are described in Chapter 3.
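As a preview of the material in Section 2.3, the following short R example fits a linear SVM on artificial two-class data. The choice of the e1071 package, as well as the data themselves, are illustrative assumptions and not taken from the thesis; the sketch merely shows the kind of computation described above.

    # Fit a linear SVM on made-up two-class data (illustrative sketch).
    library(e1071)

    set.seed(3)
    X <- rbind(matrix(rnorm(40, mean = 2), ncol = 2),    # class +1
               matrix(rnorm(40, mean = -2), ncol = 2))   # class -1
    y <- factor(c(rep(1, 20), rep(-1, 20)))

    # A large cost approximates the hard-margin (maximum-margin) hyperplane
    fit  <- svm(X, y, kernel = "linear", cost = 100, scale = FALSE)
    pred <- predict(fit, X)
    table(pred, y)   # confusion matrix on the training data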
Chapter 2

Support Vector Machine

2.1 Linear Support Vector Machine

Assume we have available a non-empty set of data Ω, where each u_i ∈ Ω has two components:

    Ω = {u_i = (x_i, y_i) : i = 1, 2, ..., n},

with x_i ∈ R^r the vector of predictor variables and y_i ∈ {1, −1} the class label of u_i. We also have a non-empty set I, which will be called the learning set. The learning set is composed of pairs u_i = (x_i, y_i) in which y_i is given for every i. The binary classification problem consists of predicting, from the data in I, the class y of a given u ∈ Ω. We use β ∈ R^r and β_0 ∈ R to construct a function f : R^r → R of the form

    f(x) = β^t x + β_0.

This function is called the separation function, [16]. It classifies as class 1 those x ∈ R^r with f(x) > 0 and as class −1 those x ∈ R^r with f(x) < 0. The goal is to have a function f such that all positive points in I, i.e. those with y_i = 1, are assigned to class 1, and all negative points in I, i.e. those with y_i = −1, are assigned to class −1. Points x with f(x) = 0 must be classified according to a predetermined rule. This requirement is expressed by the system

    y_i (β^t x_i + β_0) > 0,  ∀ i ∈ I.

2.1.1 The Linearly Separable Case

First, consider the simplest case: suppose the positive points (y_i = 1) and the negative points (y_i = −1) of the learning set I can be separated by a hyperplane

    {x : f(x) = β^t x + β_0 = 0},

where β is the weight vector, with norm ||β||, and β_0 is the bias. If this hyperplane separates the learning set into the two given classes without error, it is called a separating hyperplane. If the positive and negative data points can be separated by the hyperplane H_0 := {x : β_0 + x^t β = 0}, then

    H+ : β_0 + β^t x_i > 0,  if y_i = 1,
    H− : β_0 + β^t x_i < 0,  if y_i = −1.

For separable sets, there are infinitely many such hyperplanes. Consider any separating hyperplane. Let d_− be the shortest distance from the separating hyperplane to the nearest negative data point, and let d_+ be the shortest distance from the separating hyperplane to the nearest positive data point. We say that the hyperplane is an optimal separating hyperplane if it maximizes the distance between the hyperplane and the closest observation.

In order to find the best separating hyperplane, we fix a norm ||·|| in R^r and derive the distances between the two given classes and the separating hyperplane. First, let us consider the Euclidean case, with the Euclidean norm given by ||x||^2 = x^t x.

Property. Let ||·|| be the Euclidean norm. Then, given x, we have

    d_−(x) = d(x, {y : β_0 + β^t y ≤ 0}) = max{β_0 + β^t x, 0} / ||β||,      (2.1)
    d_+(x) = d(x, {y : β_0 + β^t y ≥ 0}) = max{−(β_0 + β^t x), 0} / ||β||.   (2.2)
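Before proving the property, the distance formula can be checked numerically. The following short R sketch compares the closed form |β_0 + β^t x| / ||β|| with the distance to the explicit orthogonal projection of x onto the hyperplane; the values of β, β_0 and x are made up for illustration.

    # Numerical check of the point-to-hyperplane distance formula (illustrative values).
    beta  <- c(2, -1)
    beta0 <- 0.5
    x     <- c(1, 3)

    # Closed-form distance |beta0 + beta' x| / ||beta||
    d_closed <- abs(beta0 + sum(beta * x)) / sqrt(sum(beta^2))

    # Orthogonal projection of x onto {y : beta0 + beta' y = 0}, then its distance to x
    y_proj <- x - (beta0 + sum(beta * x)) / sum(beta^2) * beta
    d_proj <- sqrt(sum((x - y_proj)^2))

    all.equal(d_closed, d_proj)   # TRUE: both computations agree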
Proof. Let x be a fixed point. The distance from x to the separating hyperplane is given by the problem

    min ||x − y||
    subject to: β_0 + β^t y = 0,                                             (2.3)

which is equivalent to

    min ||x − y||^2
    subject to: β_0 + β^t y = 0.                                             (2.4)

With the Euclidean norm we are in a position to use the Karush-Kuhn-Tucker (KKT) conditions. Let L(y, λ) be the Lagrangian function

    L(y, λ) = ||y − x||^2 − λ (β_0 + β^t y).

Proceeding as the KKT conditions prescribe,

    ∂L/∂y:  2(y − x) − λβ = 0.

Multiplying by β^t and using β^t y = −β_0,

    2 β^t (y − x) − λ β^t β = 0,
    −2β_0 − 2 β^t x = λ β^t β,
    λ = (−2β_0 − 2 β^t x) / (β^t β).

Substituting λ back into the stationarity condition and taking norms,

    2(y − x) = λβ,
    ||y − x|| = |λ| ||β|| / 2 = |−2β_0 − 2β^t x| ||β|| / (2 β^t β) = |β_0 + β^t x| ||β|| / ||β||^2 = |β_0 + β^t x| / ||β||.

Summarizing, in the Euclidean case

    d(x, {y : β_0 + β^t y = 0}) = |β_0 + β^t x| / ||β||.

If β_0 + β^t x ≥ 0,

    d(x, {y : β_0 + β^t y ≥ 0}) = 0,
    d(x, {y : β_0 + β^t y ≤ 0}) = |β_0 + β^t x| / ||β|| = (β_0 + β^t x) / ||β||.

If β_0 + β^t x ≤ 0,

    d(x, {y : β_0 + β^t y ≤ 0}) = 0,
    d(x, {y : β_0 + β^t y ≥ 0}) = |β_0 + β^t x| / ||β|| = −(β_0 + β^t x) / ||β||.

In general,

    d(x, {y : β_0 + β^t y ≥ 0}) = max{0, −(β_0 + β^t x) / ||β||},
    d(x, {y : β_0 + β^t y ≤ 0}) = max{0, (β_0 + β^t x) / ||β||}.  ∎

For an arbitrary norm, we can use the following result, which extends the previous property, [24]:

Theorem 1.1. For any norm ||·|| and any hyperplane H(β, β_0) we have

    d_{||·||}(x, H(β, β_0)) = (β_0 − ⟨β, x⟩) / ||β||°,  when ⟨β, x⟩ ≤ β_0,
    d_{||·||}(x, H(β, β_0)) = (⟨β, x⟩ − β_0) / ||β||°,  when ⟨β, x⟩ > β_0.

Here ||β||° denotes the dual norm of ||β||, defined as

    ||β||° = max u^t β
    subject to: ||u|| = 1.                                                   (2.5)

The previous theorem gives the formula for the distance from a point x to a halfspace. Now, given (x_1, ..., x_n) with labels (y_1, ..., y_n), the distance of x_i to the halfspace of misclassification is given by

    d_i = max{y_i (β_0 + β^t x_i), 0} / ||β||°,  ∀ i ∈ I.

The minimum of these values, d_I = min_{u_i ∈ I} d_i, is called the margin.

Figure 2.1: Linear SVM with the margin

The goal is to maximize the margin. This is done by solving the following optimization problem:

    max_{β, β_0} min_i max{y_i (β_0 + β^t x_i), 0} / ||β||°.

For a separating hyperplane all quantities y_i (β_0 + β^t x_i) are positive, so this problem is equivalent to

    max_{β, β_0} min_i y_i (β_0 + β^t x_i) / ||β||°,

which is equivalent to

    min_{β, β_0} max_i ||β||° / (y_i (β_0 + β^t x_i)),

or

    min_{β, β_0} ||β||° / min_i y_i (β_0 + β^t x_i).

The function (β_0, β) ↦ ||β||° / min_i y_i (β_0 + β^t x_i) is invariant under positive scalings of (β_0, β), hence we can assume without loss of generality that the denominator equals 1. We then have the following equivalent representation:

    min_{β_0, β} ||β||°
    subject to: min_i y_i (β_0 + β^t x_i) = 1                                (2.6)
                β ∈ R^r, β_0 ∈ R.

It is easily seen that this is equivalent to

    min_{β_0, β} ||β||°
    subject to: min_i y_i (β_0 + β^t x_i) ≥ 1                                (2.7)
                β ∈ R^r, β_0 ∈ R,

i.e.,

    min_{β_0, β} ||β||°
    subject to: y_i (β_0 + β^t x_i) ≥ 1,  ∀ i ∈ I                            (2.8)
                β ∈ R^r, β_0 ∈ R.

In the Euclidean case, where the norm is self-dual and minimizing ||β|| is equivalent to minimizing β^t β, we have

    min_{β_0, β} β^t β
    subject to: y_i (β_0 + β^t x_i) ≥ 1,  ∀ i ∈ I                            (2.9)
                β ∈ R^r, β_0 ∈ R,

which is an optimization problem with a convex objective function and linear constraints. Problem (2.9) is feasible exactly when the system of strict inequalities

    y_i (β_0 + β^t x_i) > 0,  ∀ i ∈ I                                        (2.10)
    β ∈ R^r, β_0 ∈ R

has a solution, since any solution of (2.10) can be rescaled so that the constraints of (2.9) hold.

For polyhedral norms, problem (2.8) can be written as a linear problem. Let us consider the particularly important cases ||·|| = ||·||_1 and ||·|| = ||·||_∞. To obtain the duals of these norms we use the following property:

Property. Let ||·||_p be a p-norm. Then its dual norm is ||·||_p° = ||·||_q, where p and q satisfy

    1/p + 1/q = 1.

If ||·|| = ||·||_1, then its dual ||·||° is the infinity norm ||·||_∞, and problem (2.8) can be expressed as follows:

    min_{β_0, β} ||β||_∞
    subject to: y_i (β_0 + β^t x_i) ≥ 1,  ∀ i ∈ I                            (2.11)
                β ∈ R^r, β_0 ∈ R.
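As a concrete illustration of problem (2.9), the following R sketch solves it as a quadratic program with the quadprog package on made-up, linearly separable data. This is only one possible way to solve (2.9) numerically and is not the procedure used later in the thesis; the tiny ridge added to the β_0 entry is a workaround because solve.QP requires a positive definite quadratic term.

    # Hard-margin SVM (2.9) as a quadratic program (illustrative sketch).
    library(quadprog)

    set.seed(1)
    X <- rbind(matrix(rnorm(20, mean = 3), ncol = 2),    # class +1
               matrix(rnorm(20, mean = -3), ncol = 2))   # class -1
    y <- c(rep(1, 10), rep(-1, 10))
    n <- nrow(X); r <- ncol(X)

    # Decision vector b = (beta_0, beta); objective (1/2) b' D b, no linear term.
    D <- diag(c(1e-8, rep(1, r)))   # tiny ridge on beta_0 so that D is positive definite
    d <- rep(0, r + 1)

    # Constraints y_i (beta_0 + beta' x_i) >= 1, i.e. M b >= 1 with rows y_i * (1, x_i)
    M <- cbind(y, y * X)

    sol   <- solve.QP(Dmat = D, dvec = d, Amat = t(M), bvec = rep(1, n))
    beta0 <- sol$solution[1]
    beta  <- sol$solution[-1]

    min(y * (beta0 + X %*% beta))   # approximately 1: the margin constraints are active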
This problem can be reformulated as a linear problem:

    min z
    subject to: y_i (β_0 + β^t x_i) ≥ 1,  ∀ i ∈ I                            (2.12)
                z ≥ β_j ≥ −z,  j = 1, ..., r
                β ∈ R^r, β_0 ∈ R, z ≥ 0.

On the other hand, if ||·|| = ||·||_∞, then its dual ||·||° is the 1-norm ||·||_1, and our problem can be expressed as follows:

    min_{β_0, β} ||β||_1
    subject to: y_i (β_0 + β^t x_i) ≥ 1,  ∀ i ∈ I                            (2.13)
                β ∈ R^r, β_0 ∈ R,

which can also be converted into a linear problem:

    min Σ_j z_j
    subject to: y_i (β_0 + β^t x_i) ≥ 1,  ∀ i ∈ I                            (2.14)
                z_j ≥ β_j ≥ −z_j,  j = 1, ..., r
                β ∈ R^r, β_0 ∈ R, z_j ≥ 0.

A small computational illustration of this linear program is given below, after the discussion of the nonseparable case.

2.1.2 The Linearly Nonseparable Case

In real applications it is unlikely that there will be such a clear linear separation between data drawn from two classes. More likely, there will be some overlap. The overlap causes problems for any classification rule and, depending upon its extent, some of the overlapping points cannot be correctly classified. The nonseparable case occurs either when the two classes are separable, but not linearly so, or when no clear separability exists between the two classes, linearly or nonlinearly. In the previous section we assumed that I was linearly separable; if this is not the case, the above problem is infeasible. Therefore, we must find another method.
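The following R sketch illustrates both points: the linear program (2.13)–(2.14) on separable data, and the infeasibility that appears once the learning set is no longer linearly separable. The lpSolve package and the data are illustrative assumptions; since lp() only handles nonnegative variables, β_0 and β are split into positive and negative parts, which is equivalent to (2.14) after eliminating the variables z_j.

    # l1-norm SVM (2.13)-(2.14) as a linear program, and its infeasibility on
    # nonseparable data (illustrative sketch).
    library(lpSolve)

    solve_l1_svm <- function(X, y) {
      n <- nrow(X); r <- ncol(X)
      # Variables: (b0p, b0m, bp_1..bp_r, bm_1..bm_r), all nonnegative
      obj <- c(0, 0, rep(1, 2 * r))            # minimizes sum |beta_j| = ||beta||_1
      A   <- cbind(y, -y, y * X, -(y * X))     # y_i (beta_0 + beta' x_i) >= 1
      lp("min", obj, A, rep(">=", n), rep(1, n))
    }

    set.seed(4)
    X <- rbind(matrix(rnorm(20, mean = 3), ncol = 2),    # class +1
               matrix(rnorm(20, mean = -3), ncol = 2))   # class -1
    y <- c(rep(1, 10), rep(-1, 10))
    r <- ncol(X)

    res  <- solve_l1_svm(X, y)
    res$status                                           # 0: solved on separable data
    beta <- res$solution[2 + (1:r)] - res$solution[2 + r + (1:r)]
    sum(abs(beta))                                       # the minimized ||beta||_1

    # Duplicate a point with the opposite label: no hyperplane can satisfy
    # y_i (beta_0 + beta' x_i) >= 1 for both copies, so the problem is infeasible.
    X2 <- rbind(X, X[1, ])
    y2 <- c(y, -1)
    solve_l1_svm(X2, y2)$status                          # nonzero status: no feasible solution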