ebook img

Kernel Partial Least Squares is Universally Consistent PDF

0.19 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Kernel Partial Least Squares is Universally Consistent

Kernel Partial Least Squares is Universally Consistent Gilles Blanchard Nicole Kra¨mer 0 1 Weierstrass Institute for Berlin Institute of Technology 0 2 Applied Analysis and Stochastics Machine Learning Group n Mohrenstr. 39 Franklinstr. 28/29 a J 10117 Berlin, Germany 10587 Berlin, Germany 4 1 January 14, 2010 ] E M Abstract . t a Weprovethestatistical consistencyofkernelPartialLeastSquares t s Regression applied to a bounded regression learning problem on a re- [ producing kernel Hilbert space. Partial Least Squares stands out of 2 well-known classical approaches as e.g. Ridge Regression or Principal v 0 Components Regression, as it is not defined as the solution of a global 8 cost minimization procedure over a fixed model nor is it a linear esti- 3 4 mator. Instead, approximate solutions are constructed by projections . onto a nested set of data-dependent subspaces. To prove consistency, 2 0 we exploit the known fact that Partial Least Squares is equivalent to 9 the conjugate gradient algorithm in combination with early stopping. 0 The choice of the stopping rule (number of iterations) is a crucial : v point. We study two empirical stopping rules. The first one monitors i X the estimation error in each iteration step of Partial Least Squares, r and the second one estimates the empirical complexity in terms of a a condition number. Both stopping rules lead to universally consistent estimators provided the kernel is universal. 1 INTRODUCTION Partial Least Squares (PLS) (Wold, 1975; Wold et al., 1984) is a supervised dimensionality reduction technique. It iteratively constructs an orthogonal set of m latent components from the predictor variables which have maximal 1 covariance with the response variable. This low-dimensional representation of the data is then used for prediction by fitting a linear regression model to theresponseandthelatentcomponents. Thenumbermoflatentcomponents acts as a regularizer. In contrast to Principal Components Regression, the latent components are response-dependent. In combination with the kernel trick (Sch¨olkopf et al., 1998), kernel PLS performs nonlinear dimensionality reduction and regression (Rosipal and Trejo, 2001). While PLS has proven to be successful in a wide range of applications, theoretical studies of PLS – such as its consistency – are less widespread. This is perhaps due to the fact that in contrast to many standard methods (as e.g. Ridge Regression or Principal Components Regression), PLS is not defined as the solution of a global cost function nor is it a linear estimator in the sense that the fitted values depend linearly on the response variable. Instead, PLS minimizes theleast squares criterionona nested subset of data- dependent subspaces (i.e., the subspaces defined by the latent components). Therefore, results obtained for linear estimators are not straightforward to extend to PLS. Recent work (Naik and Tsai, 2000; Chun and Keles, 2009) study the model consistency of PLS in the linear case. Their results assume that the target function depends on a finite known number ℓ of orthogonal latent components and that PLS is run at least for ℓ steps (without early stopping). Inthis configuration, Chun and Keles (2009) obtaininconsistency results in scenarios where the dimensionality can grow with the number of data. This underscores that the choice of the regularization (or early stop- ping) term m is important and that it has to be selected in a data-dependent manner. Here, we prove the universal prediction consistency of kernel PLS in the infinite dimensional case. In particular, we define suitable data-dependent stopping criteria for the number of PLS components to ensure consistency. For the derivation of our results, we capitalize on the close connection of PLS and the conjugate gradient algorithm (Hestenes and Stiefel, 1952) for the solution of linear equations: The PLS solution with m latent compo- nents is equivalent to the conjugate algorithm applied to the set of normal equations in combination with early stopping after m iterations. We use this equivalence to define the population version of kernel PLS. We then pro- ceed in three steps: (i) We show that population kernel PLS converges to the true regression function. (ii) We bound the difference between empirical and population PLS, which is low as long as the number of iterations does not grow too fast. We ensure this via two different stopping criteria. The first one monitors the error in each iteration stop of PLS, and the second one estimates the empirical complexity in terms of a condition number. (iii) Combining the results from the two previous steps, our stopping rules lead 2 to universally consistent estimators provided the kernel is universal. We em- phasize that either stopping rule does not depend on any prior knowledge of the target function and only depends on observable quantities. 2 BACKGROUND We study a regression problem based on a joint probability distribution P(X,Y) on . The task is to estimate the true regression function X ×Y f¯(x) = E[Y X = x] (1) | basedonafinitenumbernofobservations(x ,y ),...,(x ,y ) . Asa 1 1 n n ∈ X×Y general convention, population quantities defined fromthe perfect knowledge ofthedistributionP will bedenotedwithabar, empirical quantities without. ¯ We assume that f belongs to the space of P -square-integrable functions X (P ), where P denotes the X-marginal distribution. The vector y Rn 2 X X L ∈ represents the n centered response observations y ,...,y . 1 n As we consider kernel techniques to estimate the true regression function (1), we map the data to a reproducing kernel Hilbert space with bounded k H kernel k, via the canonical kernel mapping φ(x) : , x φ(x) = k(x, ). k X → H 7→ · In the remainder, we make the following assumptions: (B) boundedness of the data and kernel: Y [ 1,1] almost surely and ∈ − sup k(x,x) 1. x∈X ≤ (U) universality of the kernel: for any distribution P on , is dense X k X H in (P ). 2 X L 2.1 LEARNING AS AN INVERSE PROBLEM We very briefly review the interpretation of Kernel based regression as a statistical inverse problem, as introduced in De Vito et al. (2006). Let us denote the inclusion operator of the kernel space into (P ) by 2 X L T¯ : ֒ (P ) g g. k 2 X H → L 7→ This operatormapsafunctiontoitself, but between two Hilbertspaces which differ with respect to their geometry – the inner product of being defined k H by the kernel function k, while the inner product of (P ) depends on the 2 X L 3 data generating distribution. The adjoint operator of T¯ is defined as usual as the unique operator satisfying the condition f,T¯g = T¯∗f,g L2(PX) Hk (cid:10) (cid:11) (cid:10) (cid:11) for all f (P ),g . It can be checked from this definition and the 2 X k ∈ L ∈ H reproducing property of the kernel that T¯∗ coincides with the kernel integral operator from (P ) to , 2 X k L H T¯∗g = k(.,x′)g(x′)dP(x′) = E[k(X, )g(X)] . · Z Finally, the operator S¯ = T¯∗T¯ : is the covariance operator for the k k H → H random variable φ(X): Sg = E[k(X, )g(X)] = E[φ(X) φ(X),g ] . · h i Learning in the kernel space can becast (formally) as the inverse problem k T¯g¯ = f¯, which yields (after rHight multiplication by T¯∗) the so-called normal equation S¯g¯ = T¯∗f¯. (2) ¯ The above equation has a solution if and only if f can be represented as a function of , i.e. f¯ T¯ ; however, even if f¯ T¯ , the above formal k k k H ∈ H 6∈ H equation canbe used asa motivation to use regularization algorithms coming ¯ from the inverse problems literature in order to find an approximation g¯ of f belonging to the space . In a learning problem, neither the left-hand nor k H the right-hand side of (2) is known, and we only observe empirical quantities, which can be interpreted as ”perturbed” versions of thepopulation equations wherein P is replaced by its empirical counterpart P and f¯by y. Note X X,n that the space (P ) is isometric to Rn with the usual Euclidean product, 2 X,n L wherein afunctiong (P )ismappedtothen-vector (g(x ),...,g(x )). 2 X,n 1 n ∈ L The empirical integral operator T∗ : (P ) is then given by 2 X,n k L → H n 1 T∗g = g(x )k(x , ). (3) i i n · i=1 X The empirical covariance operator S = T∗T is defined similarly, but on the input space . Note that if the Hilbert space is finite-dimensional, the k k H H operator S corresponds to left multiplication with the empirical covariance matrix. The perturbed, empirical version of the normal equation (2) is then defined as Sg = T∗y. (4) 4 Again, if is finite-dimensional, the right-hand side corresponds to the k H covariance between predictors and response. In general, equation (4) is ill- posed, and regularization techniques are needed. Popular examples are Ker- nel Ridge Regression (which corresponds to Tikonov regularization in inverse problems) or ℓ -Boosting (corresponding to Landweber iterations). 2 2.2 PLS AND CONJUGATE GRADIENTS PLS is generally described as a greedy iterative method that produces a se- quence of latent components on which the data is projected. In contrast with PCA components, which maximize the variance, in PLS, the components are defined to have maximal covariance with the response y. In particular, the latent components depend on the response. For prediction, the response y is projected on these latent components. It is however a known fact that the output of the m-th step of PLS is actually equivalent to a conjugate gradient (CG) algorithm applied to the normal equation (4), stopped early at step m (see e.g. Phatak and de Hoog, 2002 for a detailed overview). This has been established for traditional PLS, i.e. = Rd and the linear kernel is used, but for the kernel PLS (KPLS) X algorithm introduced by Rosipal and Trejo (2001) the exact same analysis is valid as well. Here, for reasons of clarity with the remainder of our analysis we therefore directly present KPLS as a CG method. For the self-adjoint operator S and for T∗y, we define the associated Krylov space of order m as (T∗y,S) = span T∗y,ST∗y,...,Sm−1T∗y . m k K ⊂ H In other words, (T∗y,S)(cid:8)is the linear subspace of (cid:9) of all elements of m k K H the form q(S)T∗y, where q ranges over the real polynomials of degree m 1. − The m-th step of the CG method as applied to the normal equation (4) is simply defined (see e.g. Engl et al., 1996, chap. 7) as the element g m k ∈ H that minimizes the least squares criterion over the Krylov space, i.e. g = arg min y Tg 2 . (5) m g∈Km(T∗y,S)k − k Here, we recall that for any function g , the mapping Tg of g into k ∈ H (P ) can be equivalently represented as the n-vector (g(x ),...,g(x )). 2 X,n 1 n L Observe that since the Krylov space depends itself on the data (and in par- ticular on the response variable y), (5) is not a linear estimator. An extremely important property of CG is that the above optimization problem can be exactly computed by a simple iterative algorithm which only 5 Algorithm 1 Empirical KPLS (in ) k H Initialize: g = 0;u = T∗y;d = u 0 0 0 0 for m = 0,...,(m 1) do max − α = u 2/ d ,Sd m m m m k k h i g = g +α d (update) m+1 m m m u = u α Sd (residuals) m+1 m m m − β = u 2/ u 2 m m+1 m k k k k d = u +β d (new basis vector) m+1 m+1 m m end for Return: approximate solution g mmax requires to use forward applications of the operator S, sums of elements in and scalar multiplications and divisions (algorithm 1). k H In fact, the CG algorithm iteratively constructs a basis d ,...,d of 0 m−1 the Krylov space (T∗y,S). The sequence of g is constructed in m m k K ∈ H such a way that the residuals u = T∗y Sg are pairwise orthogonal in m m − , i.e. u ,u = 0 for j = k, while the constructed basis is S-orthogonal k j k H h i 6 (or equivalently, uncorrelated), i.e. d ,Sd = 0 for j = k. j k h i 6 Note that the above algorithm is written entirely in , this form being k H convenient for the theoretical analysis to come. In practice, since all involved elements belongtospan (K(x ,.)) , aweighted kernel expansionisused i 1≤i≤n { } torepresent theseelements, andcorrespondingweightupdateequationsusing the kernel matrix can be derived (see Rosipal and Trejo, 2001). 3 POPULATION VERSION OF KPLS Using the CG interpretation of KPLS, we can define its population version as follows: Definition 1. Denote by g¯ the output of algorithm 1 after m iter- m k ∈ H ations, if we replace the empirical operator S and the vector T∗y by their population versions S¯ and T¯∗f¯, respectively. We define population KPLS with m components as f¯ = T¯g¯ (P ). m m 2 X ∈ L ¯ We emphasize again that g¯ and f (P ) are identical as m k m 2 X R ∈ H ∈ L functions from to , but seen as elements of Hilbert spaces with a different X geometry (norm). The first step in our consistency proof is to show that ¯ ¯ population KPLS f converges to f (with respect to the (P ) norm) if m 2 X m tends to . Note that even if f¯ T¯ , we can stillLshow that T¯g¯ k m ∞ ¯ 6∈ H converges to the projection of f onto the closure of in (P). If the k 2 H L 6 ¯ kernel is universal (U), this projection is f itself and this implies asymptotic consistency of the population version. We will assume for simplicity that ¯ (I) the true regression function f has infinitely many components in its decomposition over the eigenfunctions of S¯, which implies that the population version of the algorithm can theoreti- cally be run indefinitely without exiting. If this condition is not satisfied the population algorithm stops after a finite number of steps κ, at which points ¯ ¯ it holds that f = f so that the rest of our analysis also holds in that case κ with only minor modifications. Proposition 1. The kernel operator of k is defined as K¯ = T¯T¯∗ : (P ) 2 X L → (P ). We denote by the orthogonal projection onto the closure of the 2 X L ¯ P ¯ ¯ range of K in (P). Then, recalling f = Tg¯ where g¯ is the output of the 2 m m m L m-th iteration of the conjugate gradient algorithm applied to the population normal equation (2), it holds that f¯ = K¯q (K¯)f¯, where q is a polynomial m m m of degree m 1 fulfilling ≤ − q = arg min f¯ K¯q(K¯)f¯ 2 . m degq≤m−1kP − kL2(PX) Proof. The minimization property (5) when written in the population case yields q = arg min f¯ T¯q(S¯)T¯∗f¯ 2. m degq≤m−1k − k Furthermore, for all polynomials q f¯ T¯q(S¯)T¯∗f¯ 2 = f¯ K¯q(K¯)f¯ 2 − − = f K¯q(K¯)f¯ 2 + (I ) f¯ K¯q(K¯)f¯ 2 (cid:13) (cid:13) (cid:13) (cid:13) kP − k k −P − k (cid:13) (cid:13) = (cid:13) f¯ K¯q(K¯)(cid:13)f¯ 2 + (I )f¯ 2. (cid:0) (cid:1) (cid:0) (cid:1) kP − k k −P k As the second term above does not depend on the polynomial q, this yields the announced result. This leads to the following convergence result. ¯ Theorem 1. Letus denote byf the projectionof f onto the firstmprincipal m ¯ components of the operator K. We have e ¯ ¯ ¯ f f f f . kP − mkL2(PX) ≤ kP − mkL2(PX) In particular, e ¯ m→∞ ¯ f f in (P ). m 2 X −→ P L 7 Thistheoremisanextensionofthefinite-dimensionalresultsbyPhatak and de Hoog (2002). Proof. We construct a sequence of polynomials q of degree m 1 such m ≤ − that f¯ K¯q (K¯)f¯ f¯e f m m kP − k ≤ kP − k and then exploit the minimization property of Proposition 1. Let us con- e sider the first m eigenvalues λe,...,λ of the operator K¯ with corresponding 1 m eigenfunctions φ ,...,φ . Then, by definition 1 m m ∞ ¯ ¯ ¯ f = f,φ φ , f = f,φ φ , m i i i i P i=1 i=1 X(cid:10) (cid:11) X(cid:10) (cid:11) The polynomial e m λ λ i p (λ) = − (6) m λ i i=1 Y fulfills p (0) = 1, hence itedefines a polynomial q of degree m 1 via m m p (λ) = 1 λq (λ). As the zeroes of p are the first m eigen≤value−s of K¯, m m m − the polynomial q has the convenient property that it ”cancels out” the first e m e m eigenfunctions, i.e. e e e e ∞ f¯ K¯q (K¯)f¯ 2 = p (λ )2 f¯,φ 2 m m i i kP − k i=m+1 X (cid:10) (cid:11) By construction, p (λ )2e 1 for i > m, and heence m i ≤ ∞ f¯ eK¯qm(K¯)f¯ 2 f¯,φi 2 = f¯ fm 2. kP − k ≤ kP − k i=m+1 X (cid:10) (cid:11) As the principal compeonents approximations f convergeeto f¯, this con- m P cludes the proof. e As the rate of convergence of the population version is at least as good as the rate of the principal components approximations, this theorem shows in particular that the conjugate gradient method is less biased than Principal Components Analysis. This factisknown for linear PLSintheempirical case (De Jong, 1993; Phatak and de Hoog, 2002). By the same token, KPLS is lessbiasedthanℓ -Boosting, asthelattercorrespondstothefixedpolynomial 2 q (t) = m−1(1 t)i, . However, empirical findings suggest that for KPLS, m i=0 − thedecreaseinbiasisbalancedbyanincreasedcomplexityintermsofdegrees P of freedom (Kr¨amer and Braun, 2007). The goal of the next section is to introduce a suitable control of this complexity. 8 4 CONSISTENT STOPPING RULES 4.1 ERROR MONITORING We control the error between the population case and the empirical case by iteratively monitoring upper bounds on this error. Since this bound only involves known empirical quantities, we design a stopping criterion based on the bound. This then leads to a globally consistent procedure. The key ingredient of the stopping rule is to bound the differences for u,x,d (defined in algorithm 1) if we replace the empirical quantities S and T∗y by their population versions. Note that algorithm 1 involves products and quotients of the perturbed quantities. The error control based on these expressions can hence be represented in terms of the following three functions. Definition 2. For any positive reals x > δ 0 define x ≥ δ x ζ(x,δ ) = x x(x δ ) x − and for any positive reals (x,y,δ ,δ ) define x y ξ(x,y,δ ,δ ) = xδ +yδ +δ δ ; x y y x x y ξ′(x,y,δ ,δ ) = xδ +yδ . x y y x The usefulness of these definitions is justified by the following standard lemma for bounding the approximation error of inverse and products, based only on the knowledge of the approximant: Lemma 1. Let α,α¯ be two invertible elements of a Banach algebra , with α α¯ δ and α−1 −1 > δ. Then B k − k ≤ k k α−1 α¯−1 ζ( α−1 −1,δ). − ≤ (cid:13) (cid:13) (cid:13) (cid:13) Let 1, 2 be two Bana(cid:13)ch spaces an(cid:13)d assu(cid:13)me a(cid:13)n assocative product operation B B exists from to a Banach space , satisfying for any (x ,x ) 1 2 3 1 2 1 2 B ×B B ∈ B ×B the product compatibility condition x x x x . Let α,α¯ in and 1 2 1 2 1 ¯ k k ≤ ¯k kk k B β,β , such that α α¯ δ and β β δ . Then 2 α β ∈ B k − k ≤ k − k ≤ ¯ αβ α¯β ξ( α , β ,δ ,δ ). α β k − k ≤ k k k k In the same situation as above, if it is known that α¯ C, then k k ≤ αβ α¯β¯ ξ′(C, β ,δ ,δ ). α β k − k ≤ k k 9 Furthermore, we can bound the deviation of the ’starting values’ S and T∗y: Lemma 2. Set ε = 4 (logn)/n. If the kernel is bounded (B), with prob- n ability at least 1 n−2, − p T∗y T¯f¯ ε and S S¯ ε . n n k − k ≤ − ≤ (cid:13) (cid:13) The second bound is well-known, see e(cid:13).g. Sha(cid:13)we-Taylor and Cristianini (2003); Zwald and Blanchard (2006), and the first one is based on the same argument. The error monitoring updates corresponding to algorithm 1 are displayed in algorithm 2. Note that the error monitoring initialization and update only depend on observable quantities. Algorithm 2 Error Control for Algorithm 1 Initialize: δg = 0;ε = 4 (logn)/n;δu = ε ; 0 n 0 n δd = ε 0 n p Initialize: ε = ξ( u , u ,δu,δu) 0,4 k 0k k 0k 0 0 for m = 0,...,(m 1) do max − ε = ξ′( d ,1,δd ,ε ) m,1 k mk m n ε = ξ( d , d ,δd ,ε ) m,2 k mk k mk m m,1 ε = ζ( d ,Sd ,ε ) (if defined, else exit) m,3 m m m,2 h i δα = ξ( u 2, d ,Sd −1,ε ,ε ) δmg =kδgm+kξ(αh m, d mi,δα,δmd,4) m,3 m+1 m m k mk m m δu = δu +ξ(α , d ,δα,ε ) m+1 m m k mk m m,1 ε = ζ( u 2,ε ) (if defined, else exit) m,5 m m,4 k k ε = ξ( u , u ,δu ,δu ) m+1,4 k m+1k k m+1k m+1 m+1 δβ = ξ( u 2, u−1 2,ε ,ε ) m k m+1k k m k m+1,4 m,5 δd = δd +ξ(β , d ,δβ,δd ) m+1 m m k mk m m end for Definition 3 (First stopping rule). Fix 0 < γ < 1 and run the KPLS 2 algorithm 1 along with the error monitoring algorithm 2. Let m +1 denote (n) the first time where either the procedure exits, or δg > n−γ. Here, the m(n)+1 subscript (n) indicates that the step is data-dependent. Output the estimate at the previous step, that is, the estimate f(n) = Tg . m(n) The next theorem states that this stopping rule leads to a universally consistent learning rule for bounded regression. 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.