Sparse Kernel Canonical Correlation Analysis via $\ell_1$-regularization*

Xiaowei Zhang†1, Delin Chu1, Li-Zhi Liao2 and Michael K. Ng2

1 Department of Mathematics, National University of Singapore.
2 Department of Mathematics, Hong Kong Baptist University.

* Part of the material in this paper was presented in [10] and [56].
† Corresponding author: [email protected]

Abstract

Canonical correlation analysis (CCA) is a multivariate statistical technique for finding the linear relationship between two sets of variables. The kernel generalization of CCA, named kernel CCA, has been proposed to find nonlinear relations between datasets. Despite their wide usage, both share a common limitation: the lack of sparsity in their solutions. In this paper, we consider sparse kernel CCA and propose a novel sparse kernel CCA algorithm (SKCCA). Our algorithm is based on a relationship between kernel CCA and least squares. Sparsity of the dual transformations is introduced by penalizing the $\ell_1$-norm of the dual vectors. Experiments demonstrate that our algorithm not only performs well in computing sparse dual transformations but can also alleviate the over-fitting problem of kernel CCA.

1 Introduction

The description of the relationship between two sets of variables has long been an interesting topic to many researchers. Canonical correlation analysis (CCA), which was originally introduced in [26], is a multivariate statistical technique for finding the linear relationship between two sets of variables. The two sets of variables can be considered as different views of the same object, or as views of different objects, and are assumed to contain some joint information in the correlations between them. CCA seeks a linear transformation for each of the two sets of variables such that the projected variables in the transformed space are maximally correlated.

Let $\{x_i\}_{i=1}^n \subset \mathbb{R}^{d_1}$ and $\{y_i\}_{i=1}^n \subset \mathbb{R}^{d_2}$ be $n$ samples of variables $x$ and $y$, respectively. Denote

$X = [x_1 \ \cdots \ x_n] \in \mathbb{R}^{d_1 \times n}, \qquad Y = [y_1 \ \cdots \ y_n] \in \mathbb{R}^{d_2 \times n}$,

and assume both $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$ have zero mean, i.e., $\sum_{i=1}^n x_i = 0$ and $\sum_{i=1}^n y_i = 0$. Then CCA solves the following optimization problem

$\max_{w_x, w_y} \ w_x^T X Y^T w_y$
s.t. $w_x^T X X^T w_x = 1$,
     $w_y^T Y Y^T w_y = 1$,   (1.1)

to get the first pair of weight vectors $w_x$ and $w_y$, which are further utilized to obtain the first pair of canonical variables $w_x^T X$ and $w_y^T Y$, respectively. For the remaining pairs of weight vectors and canonical variables, CCA sequentially solves the same problem as (1.1) with additional orthogonality constraints among the canonical variables. Suppose we have obtained a pair of linear transformations $W_x \in \mathbb{R}^{d_1 \times l}$ and $W_y \in \mathbb{R}^{d_2 \times l}$; then for a pair of new data $(x, y)$, its projection onto the new coordinate system determined by $(W_x, W_y)$ is

$(W_x^T x, \ W_y^T y)$.   (1.2)

Since CCA only considers linear transformations of the original variables, it cannot capture nonlinear relations among variables. However, in a wide range of practical problems linear relations may not be adequate for studying the relations among variables. Detecting nonlinear relations among data is important and useful in modern data analysis, especially when dealing with data that are not in the form of vectors, such as text documents, images, micro-array data and so on. A natural extension, therefore, is to explore and exploit nonlinear relations among data.
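As a concrete illustration of (1.1), the following is a minimal NumPy sketch that computes the first pair of weight vectors by whitening each view and taking the leading singular vectors of the whitened cross-covariance matrix; the helper name and the eigenvalue floor `eps` are illustrative choices, not notation from the paper.

```python
# Minimal sketch of the first CCA pair in (1.1), assuming the columns of
# X (d1 x n) and Y (d2 x n) are already centered; `eps` guards against
# rank deficiency and is an illustrative choice.
import numpy as np

def cca_first_pair(X, Y, eps=1e-10):
    Cxx, Cyy, Cxy = X @ X.T, Y @ Y.T, X @ Y.T

    def inv_sqrt(C):
        # Symmetric inverse square root via an eigendecomposition.
        w, V = np.linalg.eigh(C)
        w = np.clip(w, eps, None)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Rx, Ry = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Leading singular triplet of the whitened cross-covariance matrix.
    U, s, Vt = np.linalg.svd(Rx @ Cxy @ Ry)
    wx = Rx @ U[:, 0]    # satisfies wx^T X X^T wx ~= 1
    wy = Ry @ Vt[0, :]   # satisfies wy^T Y Y^T wy ~= 1
    return wx, wy, s[0]  # s[0] is the first canonical correlation
```

Kept in matrix form, the same construction produces several weight vectors at once, which is essentially the multi-direction criterion (2.3) recalled in Section 2.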
There has been wide interest in nonlinear CCA [11, 30], among which the most frequently used approach is the kernel generalization of CCA, named kernel canonical correlation analysis (kernel CCA). Motivated by the development and successful applications of kernel learning methods [37, 39], such as support vector machines (SVM) [7, 37], kernel principal component analysis (KPCA) [38], kernel Fisher discriminant analysis [33], kernel partial least squares [36] and so on, a large body of research on kernel CCA has emerged [1, 32, 2, 16, 17, 25, 24, 29, 30, 39]. Kernel methods have attracted a great deal of attention in the field of nonlinear data analysis. In kernel methods, we first implicitly represent data as elements of reproducing kernel Hilbert spaces associated with positive definite kernels, then apply linear algorithms to the data and substitute the linear inner product by kernel functions, which results in nonlinear variants.

The main idea of kernel CCA is to first virtually map the data $X$ into a high-dimensional feature space $\mathcal{H}_x$ via a mapping $\phi_x$, so that the data in the feature space become

$\Phi_x = [\phi_x(x_1) \ \cdots \ \phi_x(x_n)] \in \mathbb{R}^{N_x \times n}$,

where $N_x$ is the dimension of the feature space $\mathcal{H}_x$, which can be very high or even infinite. The mapping $\phi_x$ from the input data to the feature space $\mathcal{H}_x$ is performed implicitly by considering a positive definite kernel function $\kappa_x$ satisfying

$\kappa_x(x_1, x_2) = \langle \phi_x(x_1), \phi_x(x_2) \rangle$,   (1.3)

where $\langle \cdot, \cdot \rangle$ is an inner product in $\mathcal{H}_x$, rather than by giving the coordinates of $\phi_x(x)$ explicitly. The feature space $\mathcal{H}_x$ is known as the reproducing kernel Hilbert space (RKHS) [49] associated with the kernel function $\kappa_x$. In the same way, we can map $Y$ into a feature space $\mathcal{H}_y$ associated with a kernel $\kappa_y$ through a mapping $\phi_y$ such that

$\Phi_y = [\phi_y(y_1) \ \cdots \ \phi_y(y_n)] \in \mathbb{R}^{N_y \times n}$.

After mapping $X$ to $\Phi_x$ and $Y$ to $\Phi_y$, we then apply ordinary linear CCA to the data pair $(\Phi_x, \Phi_y)$. Let

$K_x = \langle \Phi_x, \Phi_x \rangle = [\kappa_x(x_i, x_j)]_{i,j=1}^n \in \mathbb{R}^{n \times n}, \qquad K_y = \langle \Phi_y, \Phi_y \rangle = [\kappa_y(y_i, y_j)]_{i,j=1}^n \in \mathbb{R}^{n \times n}$   (1.4)

be the matrices consisting of inner products of the datasets $X$ and $Y$, respectively; $K_x$ and $K_y$ are called kernel matrices or Gram matrices. Then kernel CCA seeks linear transformations in the feature space by expressing the weight vectors as linear combinations of the training data, that is,

$w_x = \Phi_x \alpha = \sum_{i=1}^n \alpha_i \phi_x(x_i), \qquad w_y = \Phi_y \beta = \sum_{i=1}^n \beta_i \phi_y(y_i)$,

where $\alpha, \beta \in \mathbb{R}^n$ are called dual vectors. The first pair of dual vectors can be determined by solving the following optimization problem

$\max_{\alpha, \beta} \ \alpha^T K_x K_y \beta$
s.t. $\alpha^T K_x^2 \alpha = 1$,
     $\beta^T K_y^2 \beta = 1$.   (1.5)

The remaining pairs of dual vectors are obtained by sequentially solving the same problem as (1.5) with extra orthogonality constraints. More details on the derivation of kernel CCA are presented in Section 2.

Suppose we have obtained dual transformations $\mathsf{W}_x, \mathsf{W}_y \in \mathbb{R}^{n \times l}$ and the corresponding CCA transformations $W_x \in \mathbb{R}^{N_x \times l}$ and $W_y \in \mathbb{R}^{N_y \times l}$ in the feature spaces; then the projection of a data pair $(x, y)$ onto the kernel CCA directions can be computed by first mapping $x$ and $y$ into the feature spaces $\mathcal{H}_x$ and $\mathcal{H}_y$ and then evaluating their inner products with $W_x$ and $W_y$. More specifically, the projections can be carried out as

$\langle W_x, \phi_x(x) \rangle = \langle \Phi_x \mathsf{W}_x, \phi_x(x) \rangle = \mathsf{W}_x^T K_x(X, x)$,   (1.6)

with $K_x(X, x) = [\kappa_x(x_1, x) \ \cdots \ \kappa_x(x_n, x)]^T$, and

$\langle W_y, \phi_y(y) \rangle = \langle \Phi_y \mathsf{W}_y, \phi_y(y) \rangle = \mathsf{W}_y^T K_y(Y, y)$,   (1.7)

with $K_y(Y, y) = [\kappa_y(y_1, y) \ \cdots \ \kappa_y(y_n, y)]^T$. Here and below, a sans-serif $\mathsf{W}$ denotes a dual (coefficient) matrix, while $W_x$, $W_y$ denote the transformations themselves.
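To make (1.4), (1.6) and (1.7) concrete, the sketch below builds Gram matrices with a Gaussian (RBF) kernel and projects a new point through a given dual transformation matrix; the kernel choice, the bandwidth `sigma` and the placeholder `Wx_dual` are assumptions of this example rather than choices made by the paper.

```python
# Sketch of the Gram matrices in (1.4) and the projection (1.6) with a
# Gaussian (RBF) kernel; `sigma` and `Wx_dual` are placeholders.
import numpy as np

def rbf_gram(A, B, sigma=1.0):
    """Gram matrix [kappa(a_i, b_j)] for the columns of A (d x nA) and B (d x nB)."""
    sq = (A * A).sum(0)[:, None] + (B * B).sum(0)[None, :] - 2.0 * A.T @ B
    return np.exp(-sq / (2.0 * sigma ** 2))

# Kx = rbf_gram(X, X)                    # K_x in (1.4), an n x n matrix
# kx_new = rbf_gram(X, x_new[:, None])   # K_x(X, x) in (1.6), an n x 1 vector
# projection = Wx_dual.T @ kx_new        # W_x^T K_x(X, x), an l x 1 vector
```

Note that only kernel evaluations are needed: the feature maps $\phi_x$ and $\phi_y$ never appear explicitly.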
Both optimization problems (1.1) and (1.5) can be solved by considering generalized eigenvalue problems [4] of the form

$A x = \lambda B x$,   (1.8)

where $A$ and $B$ are symmetric positive semi-definite. This generalized eigenvalue problem can be solved efficiently using approaches from numerical linear algebra [19]. CCA and kernel CCA have been successfully applied in many fields, including cross-language document retrieval [47], content-based image retrieval [25], bioinformatics [46, 53], independent component analysis [2, 17], and the computation of principal angles between linear subspaces [6, 20].

Despite the wide usage of CCA and kernel CCA, they have one common limitation: the lack of sparseness in the transformation matrices $W_x$ and $W_y$ and in the dual transformation matrices $\mathsf{W}_x$ and $\mathsf{W}_y$. Equation (1.2) shows that the projections of the data pair $x$ and $y$ are linear combinations of the data themselves, which makes interpretation of the extracted features difficult when the transformation matrices $W_x$ and $W_y$ are dense. Similarly, from (1.6) and (1.7) we can see that the kernel functions $\kappa_x(x_i, x)$ and $\kappa_y(y_i, y)$ must be evaluated at all $\{x_i\}_{i=1}^n$ and $\{y_i\}_{i=1}^n$ when the dual transformation matrices $\mathsf{W}_x$ and $\mathsf{W}_y$ are dense, which can lead to excessive computational time for projecting new data. To handle this limitation of CCA, researchers have suggested incorporating sparsity into the weight vectors, and many papers have studied sparse CCA [9, 23, 35, 40, 41, 48, 50, 51, 52]. Similarly, we wish to find sparse solutions for kernel CCA so that projections of new data can be computed by evaluating the kernel function at only a subset of the training data. Although there are many sparse kernel approaches [5], such as support vector machines [37], the relevance vector machine [45] and sparse kernel partial least squares [14, 34], little work can be found in the area of sparse kernel CCA [13, 43].

In this paper we first consider a new sparse CCA approach and then generalize it to incorporate sparsity into kernel CCA. A relationship between CCA and least squares is established so that CCA solutions can be obtained by solving a least squares problem. We attempt to introduce sparsity by penalizing the $\ell_1$-norm of the solutions, which eventually leads to an $\ell_1$-norm penalized least squares optimization problem of the form

$\min_{x \in \mathbb{R}^d} \ \frac{1}{2}\|Ax - b\|_2^2 + \lambda \|x\|_1$,

where $\lambda > 0$ is a regularization parameter controlling the sparsity of $x$. We adopt a fixed-point continuation (FPC) method [21, 22] to solve the $\ell_1$-norm regularized least squares problem above, which results in a new sparse CCA algorithm (SCCA_LS). Since the optimization criteria of CCA and kernel CCA are of the same form, the same idea can be extended to kernel CCA to obtain a sparse kernel CCA algorithm (SKCCA).

The remainder of the paper is organized as follows. In Section 2, we present background results on both CCA and kernel CCA, including a full parameterization of the general solutions of CCA and a detailed derivation of kernel CCA. In Section 3, we first establish a relationship between CCA and least squares problems; then, based on this relationship, we propose to incorporate sparsity into CCA by penalizing the least squares with the $\ell_1$-norm. Solving the penalized least squares problems by FPC leads to a new sparse CCA algorithm, SCCA_LS. In Section 4, we extend the idea behind SCCA_LS to its kernel counterpart, which results in a novel sparse kernel CCA algorithm, SKCCA.
Numerical results of applying the newly proposed algorithms to various applications, together with comparative empirical results for other algorithms, are presented in Section 5. Finally, we draw some concluding remarks in Section 6.

2 Background

In this section we provide enough background results on CCA and kernel CCA to make the paper self-contained. In the first subsection, we present the full parameterization of the general solutions of CCA and related results; in the second subsection, based on the parameterization of the previous subsection, we give a detailed derivation of kernel CCA.

2.1 Canonical correlation analysis

As stated in the Introduction, by solving (1.1), or equivalently

$\min_{w_x, w_y} \ \|X^T w_x - Y^T w_y\|_2^2$
s.t. $w_x^T X X^T w_x = 1$,
     $w_y^T Y Y^T w_y = 1$,   (2.1)

we can get a pair of weight vectors $w_x$ and $w_y$ for CCA. One pair of weight vectors is not enough for most practical problems, however. To obtain multiple projections of CCA, we recursively solve the following optimization problem

$(w_x^k, w_y^k) = \arg\max_{w_x, w_y} \ w_x^T X Y^T w_y$
s.t. $w_x^T X X^T w_x = 1$, $\quad X^T w_x \perp \{X^T w_x^1, \cdots, X^T w_x^{k-1}\}$,
     $w_y^T Y Y^T w_y = 1$, $\quad Y^T w_y \perp \{Y^T w_y^1, \cdots, Y^T w_y^{k-1}\}$,   for $k = 2, \cdots, l$,   (2.2)

where $l$ is the number of projections we need. The unit vectors $X^T w_x^k$ and $Y^T w_y^k$ in (2.2) are called the $k$th pair of canonical variables. If we denote

$W_x = [w_x^1 \ \cdots \ w_x^l] \in \mathbb{R}^{d_1 \times l}, \qquad W_y = [w_y^1 \ \cdots \ w_y^l] \in \mathbb{R}^{d_2 \times l}$,

then we can show [9] that the optimization problem above is equivalent to

$\max_{W_x, W_y} \ \mathrm{Trace}(W_x^T X Y^T W_y)$
s.t. $W_x^T X X^T W_x = I$, $W_x \in \mathbb{R}^{d_1 \times l}$,
     $W_y^T Y Y^T W_y = I$, $W_y \in \mathbb{R}^{d_2 \times l}$.   (2.3)

Hence, optimization problem (2.3) will be used as the criterion of CCA. A solution of (2.3) can be obtained by solving a generalized eigenvalue problem of the form (1.8). Furthermore, we can fully characterize all solutions of the optimization problem (2.3). Define

$r = \mathrm{rank}(X)$, $\quad s = \mathrm{rank}(Y)$, $\quad m = \mathrm{rank}(XY^T)$, $\quad t = \min\{r, s\}$.

Let the (reduced) SVD factorizations of $X$ and $Y$ be, respectively,

$X = U \begin{bmatrix} \Sigma_1 \\ 0 \end{bmatrix} Q_1^T = [U_1 \ U_2] \begin{bmatrix} \Sigma_1 \\ 0 \end{bmatrix} Q_1^T = U_1 \Sigma_1 Q_1^T$,   (2.4)

and

$Y = V \begin{bmatrix} \Sigma_2 \\ 0 \end{bmatrix} Q_2^T = [V_1 \ V_2] \begin{bmatrix} \Sigma_2 \\ 0 \end{bmatrix} Q_2^T = V_1 \Sigma_2 Q_2^T$,   (2.5)

where $U \in \mathbb{R}^{d_1 \times d_1}$, $U_1 \in \mathbb{R}^{d_1 \times r}$, $U_2 \in \mathbb{R}^{d_1 \times (d_1 - r)}$, $\Sigma_1 \in \mathbb{R}^{r \times r}$, $Q_1 \in \mathbb{R}^{n \times r}$, $V \in \mathbb{R}^{d_2 \times d_2}$, $V_1 \in \mathbb{R}^{d_2 \times s}$, $V_2 \in \mathbb{R}^{d_2 \times (d_2 - s)}$, $\Sigma_2 \in \mathbb{R}^{s \times s}$, $Q_2 \in \mathbb{R}^{n \times s}$; $U$ and $V$ are orthogonal, $\Sigma_1$ and $\Sigma_2$ are nonsingular and diagonal, and $Q_1$ and $Q_2$ are column orthogonal. It follows from the two orthogonality constraints in (2.3) that

$l \le \min\{\mathrm{rank}(X), \mathrm{rank}(Y)\} = \min\{r, s\} = t$.   (2.6)

Next, let

$Q_1^T Q_2 = P_1 \Sigma P_2^T$   (2.7)

be the singular value decomposition of $Q_1^T Q_2$, where $P_1 \in \mathbb{R}^{r \times r}$ and $P_2 \in \mathbb{R}^{s \times s}$ are orthogonal and $\Sigma \in \mathbb{R}^{r \times s}$, and assume there are $q$ distinct nonzero singular values with multiplicities $m_1, m_2, \cdots, m_q$, respectively; then

$m = \sum_{i=1}^q m_i = \mathrm{rank}(Q_1^T Q_2) \le \min\{r, s\} = t$.

The full characterization of $W_x$ and $W_y$ is given in the following theorem [9].

Theorem 2.1. i). If $l = \sum_{i=1}^k m_i$ for some $k$ satisfying $1 \le k \le q$, then $(W_x, W_y)$ with $W_x \in \mathbb{R}^{d_1 \times l}$ and $W_y \in \mathbb{R}^{d_2 \times l}$ is a solution of optimization problem (2.3) if and only if

$W_x = U_1 \Sigma_1^{-1} P_1(:,1:l) W + U_2 E$,
$W_y = V_1 \Sigma_2^{-1} P_2(:,1:l) W + V_2 F$,   (2.8)

where $W \in \mathbb{R}^{l \times l}$ is orthogonal, and $E \in \mathbb{R}^{(d_1 - r) \times l}$ and $F \in \mathbb{R}^{(d_2 - s) \times l}$ are arbitrary.
ii). If $\sum_{i=1}^{k} m_i < l < \sum_{i=1}^{k+1} m_i$ for some $k$ satisfying $0 \le k < q$, then $(W_x, W_y)$ with $W_x \in \mathbb{R}^{d_1 \times l}$ and $W_y \in \mathbb{R}^{d_2 \times l}$ is a solution of optimization problem (2.3) if and only if

$W_x = U_1 \Sigma_1^{-1} [P_1(:,1:\alpha_k) \ \ P_1(:,1+\alpha_k:\alpha_{k+1}) G] W + U_2 E$,
$W_y = V_1 \Sigma_2^{-1} [P_2(:,1:\alpha_k) \ \ P_2(:,1+\alpha_k:\alpha_{k+1}) G] W + V_2 F$,   (2.9)

where $\alpha_k = \sum_{i=1}^k m_i$ for $k = 1, \cdots, q$, $W \in \mathbb{R}^{l \times l}$ is orthogonal, $G \in \mathbb{R}^{m_{k+1} \times (l - \alpha_k)}$ is column orthogonal, and $E \in \mathbb{R}^{(d_1 - r) \times l}$ and $F \in \mathbb{R}^{(d_2 - s) \times l}$ are arbitrary.

iii). If $m < l \le \min\{r, s\}$, then $(W_x, W_y)$ with $W_x \in \mathbb{R}^{d_1 \times l}$ and $W_y \in \mathbb{R}^{d_2 \times l}$ is a solution of optimization problem (2.3) if and only if

$W_x = U_1 \Sigma_1^{-1} [P_1(:,1:m) \ \ P_1(:,m+1:r) G_1] W + U_2 E$,
$W_y = V_1 \Sigma_2^{-1} [P_2(:,1:m) \ \ P_2(:,m+1:s) G_2] W + V_2 F$,   (2.10)

where $W \in \mathbb{R}^{l \times l}$ is orthogonal, $G_1 \in \mathbb{R}^{(r-m) \times (l-m)}$ and $G_2 \in \mathbb{R}^{(s-m) \times (l-m)}$ are column orthogonal, and $E \in \mathbb{R}^{(d_1 - r) \times l}$ and $F \in \mathbb{R}^{(d_2 - s) \times l}$ are arbitrary.

An immediate application of Theorem 2.1 is that we can prove that Uncorrelated Linear Discriminant Analysis (ULDA) [8, 27, 55] is a special case of CCA when one set of variables is derived from the data matrix and the other set of variables is constructed from class information. This theorem has also been utilized in [9] to design a sparse CCA algorithm.

2.2 Kernel canonical correlation analysis

Now, we look at some details of the derivation of kernel CCA. Note from Theorem 2.1 that each solution $(W_x, W_y)$ of CCA can be expressed as

$W_x = X \mathsf{W}_x + W_x^{\perp}$, $\qquad W_y = Y \mathsf{W}_y + W_y^{\perp}$,

where $W_x^{\perp}$ and $W_y^{\perp}$ are orthogonal to the range space of $X$ and $Y$, respectively. Since, intrinsically, kernel CCA performs ordinary CCA on $\Phi_x$ and $\Phi_y$, it follows that the solutions of kernel CCA should be obtained by virtually solving

$\max_{W_x, W_y} \ \mathrm{Trace}(W_x^T \Phi_x \Phi_y^T W_y)$
s.t. $W_x^T \Phi_x \Phi_x^T W_x = I$, $W_x \in \mathbb{R}^{N_x \times l}$,
     $W_y^T \Phi_y \Phi_y^T W_y = I$, $W_y \in \mathbb{R}^{N_y \times l}$.   (2.11)

Similar to ordinary CCA, each solution $(W_x, W_y)$ of (2.11) can be represented as

$W_x = \Phi_x \mathsf{W}_x + W_x^{\perp}$, $\qquad W_y = \Phi_y \mathsf{W}_y + W_y^{\perp}$,   (2.12)

where $\mathsf{W}_x, \mathsf{W}_y \in \mathbb{R}^{n \times l}$ are usually called dual transformation matrices, and $W_x^{\perp}$ and $W_y^{\perp}$ are orthogonal to the range space of $\Phi_x$ and $\Phi_y$, respectively.

Substituting (2.12) into (2.11), we have

$W_x^T \Phi_x \Phi_y^T W_y = \mathsf{W}_x^T K_x K_y \mathsf{W}_y$, $\quad W_x^T \Phi_x \Phi_x^T W_x = \mathsf{W}_x^T K_x^2 \mathsf{W}_x$, $\quad W_y^T \Phi_y \Phi_y^T W_y = \mathsf{W}_y^T K_y^2 \mathsf{W}_y$.

Thus, the computation of the transformations of kernel CCA can be converted into the computation of the dual transformation matrices $\mathsf{W}_x$ and $\mathsf{W}_y$ by solving the following optimization problem

$\max_{\mathsf{W}_x, \mathsf{W}_y} \ \mathrm{Trace}(\mathsf{W}_x^T K_x K_y \mathsf{W}_y)$
s.t. $\mathsf{W}_x^T K_x^2 \mathsf{W}_x = I$, $\mathsf{W}_x \in \mathbb{R}^{n \times l}$,
     $\mathsf{W}_y^T K_y^2 \mathsf{W}_y = I$, $\mathsf{W}_y \in \mathbb{R}^{n \times l}$,   (2.13)

which is used as the criterion of kernel CCA in this paper.

As can be seen from the analysis above, the terms $W_x^{\perp}$ and $W_y^{\perp}$ in (2.12) do not contribute to the canonical correlations between $\Phi_x$ and $\Phi_y$, and thus are usually neglected in practice. Therefore, when we are given a set of test data $X_t = [x_t^1 \ \cdots \ x_t^N]$ consisting of $N$ points, the projection of $X_t$ onto the kernel CCA direction $W_x$ can be performed by first mapping $X_t$ into the feature space $\mathcal{H}_x$ and then computing its inner product with $W_x$. More specifically, suppose $\Phi_{x,t} = [\phi_x(x_t^1) \ \cdots \ \phi_x(x_t^N)]$ is the image of $X_t$ in the feature space $\mathcal{H}_x$; then the projection of $X_t$ onto the kernel CCA direction $W_x$ is given by

$W_x^T \Phi_{x,t} = \mathsf{W}_x^T K_{x,t}$,

where $K_{x,t} = \langle \Phi_x, \Phi_{x,t} \rangle = [\kappa_x(x_i, x_t^j)]_{i=1:n}^{j=1:N} \in \mathbb{R}^{n \times N}$ is the matrix consisting of the kernel evaluations of $X_t$ with all training data $X$. A similar process can be adopted to compute projections of new data drawn from variable $y$.
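As an aside, one standard (non-sparse) way to compute dual transformation matrices for a criterion like (2.13) is to stack its stationarity conditions into a symmetric generalized eigenvalue problem of the form (1.8). The sketch below does exactly that; the ridge `reg`, added so that the right-hand side is positive definite, and the optional post-hoc rescaling are assumptions of this illustration, not part of the paper's algorithm, which is developed in Section 4.

```python
# Sketch: dual kernel CCA directions for (2.13) via a generalized symmetric
# eigenvalue problem A z = lambda B z, cf. (1.8); `reg` is a stabilizing
# ridge that is not part of criterion (2.13) itself.
import numpy as np
from scipy.linalg import eigh

def kernel_cca_dual(Kx, Ky, n_dirs=2, reg=1e-6):
    n = Kx.shape[0]
    zero = np.zeros((n, n))
    A = np.block([[zero, Kx @ Ky],            # [[0,     Kx Ky],
                  [Ky @ Kx, zero]])           #  [Ky Kx, 0    ]]
    B = np.block([[Kx @ Kx + reg * np.eye(n), zero],
                  [zero, Ky @ Ky + reg * np.eye(n)]])
    vals, vecs = eigh(A, B)                   # eigenvalues in ascending order
    top = vecs[:, ::-1][:, :n_dirs]           # eigenvectors of the largest eigenvalues
    Wx_dual, Wy_dual = top[:n], top[n:]       # dual matrices, each n x n_dirs
    # Columns may be rescaled afterwards so that Wx_dual^T Kx^2 Wx_dual ~ I,
    # matching the normalization in (2.13).
    return Wx_dual, Wy_dual, vals[::-1][:n_dirs]
```

Like the dense solutions discussed above, the resulting dual matrices generally have no zero entries, which is precisely the limitation addressed in the remainder of the paper.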
In the process of deriving (2.13), we assumed that the data $\Phi_x$ and $\Phi_y$ have been centered (that is, the column means of both $\Phi_x$ and $\Phi_y$ are zero); otherwise, we need to perform data centering before applying kernel CCA. Unlike data centering of $X$ and $Y$, we cannot perform data centering directly on $\Phi_x$ and $\Phi_y$, since we do not know their explicit coordinates. However, as shown in [38, 37], data centering in an RKHS can be accomplished via operations on the kernel matrices. To center $\Phi_x$, a natural idea is to compute $\Phi_{x,c} = \Phi_x\big(I - \frac{e_n e_n^T}{n}\big)$, where $e_n$ denotes the column vector in $\mathbb{R}^n$ with all entries equal to 1. However, since kernel CCA makes use of the data only through the kernel matrix $K_x$, the centering process can be performed on $K_x$ as

$K_{x,c} = \langle \Phi_{x,c}, \Phi_{x,c} \rangle = \big(I - \frac{e_n e_n^T}{n}\big)\langle \Phi_x, \Phi_x \rangle\big(I - \frac{e_n e_n^T}{n}\big) = \big(I - \frac{e_n e_n^T}{n}\big) K_x \big(I - \frac{e_n e_n^T}{n}\big)$.   (2.14)

Similarly, we can center the test data as

$K_{x,t,c} = \big\langle \Phi_{x,c}, \ \Phi_{x,t} - \Phi_x \frac{e_n e_N^T}{n} \big\rangle = \big(I - \frac{e_n e_n^T}{n}\big) K_{x,t} - \big(I - \frac{e_n e_n^T}{n}\big) K_x \frac{e_n e_N^T}{n}$,   (2.15)

where $e_N \in \mathbb{R}^N$ is defined analogously. More details about data centering in an RKHS can be found in [38, 37]. In the sequel of this paper, we assume the given data have been centered.

Several papers have studied properties of kernel CCA, including the geometry of kernel CCA in [29] and the statistical consistency of kernel CCA in [16]. In the remainder of this paper, we consider sparse kernel CCA. Before that, we explore a relation between CCA and least squares in the next section.

3 Sparse CCA based on least squares formulation

Note from (2.1) that when one of $X$ and $Y$ is one-dimensional, CCA is equivalent to the least squares estimation of a linear regression problem. For more general cases, a relation between CCA and linear regression has been established in [42] under the condition that $\mathrm{rank}(X) = n - 1$ and $\mathrm{rank}(Y) = d_2$. In this section, we establish a relation between CCA and linear regression without any additional constraint on $X$ and $Y$. Moreover, based on this relation we design a new sparse CCA algorithm.

We focus on a solution subset of optimization problem (2.3) presented in the following lemma, whose proof is trivial and omitted.

Lemma 3.1. Any $(W_x, W_y)$ of the form

$W_x = U_1 \Sigma_1^{-1} P_1(:,1:l) + U_2 E$,
$W_y = V_1 \Sigma_2^{-1} P_2(:,1:l) + V_2 F$,   (3.1)

is a solution of optimization problem (2.3), where $E \in \mathbb{R}^{(d_1 - r) \times l}$ and $F \in \mathbb{R}^{(d_2 - s) \times l}$ are arbitrary.

Suppose the matrix factorizations (2.4)-(2.7) have been computed, and let

$T_x = Y^T \big[(Y Y^T)^{\frac{1}{2}}\big]^{\dagger} V_1 P_2(:,1:l)\, \Sigma(1:l,1:l)^{-1} = Q_2 P_2(:,1:l)\, \Sigma(1:l,1:l)^{-1}$,   (3.2)
$T_y = X^T \big[(X X^T)^{\frac{1}{2}}\big]^{\dagger} U_1 P_1(:,1:l)\, \Sigma(1:l,1:l)^{-1} = Q_1 P_1(:,1:l)\, \Sigma(1:l,1:l)^{-1}$,   (3.3)

where $A^{\dagger}$ denotes the Moore-Penrose inverse of a general matrix $A$ and $1 \le l \le m$; then we have the following theorem.

Theorem 3.2. For any $l$ satisfying $1 \le l \le m$, suppose $W_x \in \mathbb{R}^{d_1 \times l}$ and $W_y \in \mathbb{R}^{d_2 \times l}$ satisfy

$W_x = \arg\min\{\|X^T W_x - T_x\|_F^2 : W_x \in \mathbb{R}^{d_1 \times l}\}$,   (3.4)

and

$W_y = \arg\min\{\|Y^T W_y - T_y\|_F^2 : W_y \in \mathbb{R}^{d_2 \times l}\}$,   (3.5)

where $T_x$ and $T_y$ are defined in (3.2) and (3.3), respectively. Then $W_x$ and $W_y$ form a solution of optimization problem (2.3).

Proof. Since (3.4) and (3.5) have the same form, we only prove the result for $W_x$; the same idea applies to $W_y$.

We know that $W_x$ is a solution of (3.4) if and only if it satisfies the normal equation

$X X^T W_x = X T_x$.   (3.6)

Substituting the factorizations (2.4), (2.5) and (2.7) into the equation above, we get

$X X^T = U_1 \Sigma_1^2 U_1^T$

and

$X T_x = U_1 \Sigma_1 Q_1^T Q_2 P_2(:,1:l)\, \Sigma(1:l,1:l)^{-1} = U_1 \Sigma_1 P_1(:,1:l)$,

which yield an equivalent reformulation of (3.6):

$U_1 \Sigma_1^2 U_1^T W_x = U_1 \Sigma_1 P_1(:,1:l)$.   (3.7)
It is easy to check that $W_x$ is a solution of (3.7) if and only if

$W_x = U_1 \Sigma_1^{-1} P_1(:,1:l) + U_2 E$,   (3.8)

where $E \in \mathbb{R}^{(d_1 - r) \times l}$ is an arbitrary matrix. Therefore, $W_x$ is a solution of (3.4) if and only if $W_x$ can be formulated as (3.8). Similarly, $W_y$ is a solution of (3.5) if and only if $W_y$ can be written as

$W_y = V_1 \Sigma_2^{-1} P_2(:,1:l) + V_2 F$,   (3.9)

where $F \in \mathbb{R}^{(d_2 - s) \times l}$ is an arbitrary matrix.

Now, comparing equations (3.8) and (3.9) with equation (3.1) in Lemma 3.1, we conclude that for any solution $W_x$ of the least squares problem (3.4) and any solution $W_y$ of the least squares problem (3.5), $W_x$ and $W_y$ form a solution of optimization problem (2.3), hence a solution of CCA.

Remark 3.1. In Theorem 3.2 we only consider $l$ satisfying $1 \le l \le m$. This is reasonable, since there are $m$ nonzero canonical correlations between $X$ and $Y$, and weight vectors corresponding to zero canonical correlations do not contribute to the correlation between the data $X$ and $Y$.

Consider the usual regression situation: we have a set of observations $(x_1, b_1), \cdots, (x_n, b_n)$, where $x_i \in \mathbb{R}^{d_1}$ and $b_i$ are the regressor and response of the $i$th observation. Suppose $\{x_i\}$ has been centered; then the linear regression model has the form

$f(x) = x^T \beta$,

and aims to estimate $\beta = [\beta_1 \ \cdots \ \beta_{d_1}]^T$ so as to predict an output for each input $x$. The well-known least squares estimation minimizes the residual sum of squares

$\mathrm{Res}(\beta) = \|X^T \beta - b\|_2^2$.

Therefore, (3.4) and (3.5) can be interpreted as least squares estimations of linear regression problems with the columns of $X$ and $Y$ being the regressors and the rows of $T_x$ and $T_y$ being the corresponding responses.

Recent research on the lasso [44] shows that simultaneous sparsity and regression can be achieved by penalizing the $\ell_1$-norm of the variables. Motivated by this, we incorporate sparsity into CCA via the established relationship between CCA and least squares by considering the following $\ell_1$-norm penalized least squares problems

$\min_{W_x} \Big\{ \frac{1}{2}\|X^T W_x - T_x\|_F^2 + \sum_{i=1}^l \lambda_{x,i}\|W_{x,i}\|_1 : W_x \in \mathbb{R}^{d_1 \times l} \Big\}$,   (3.10)

and

$\min_{W_y} \Big\{ \frac{1}{2}\|Y^T W_y - T_y\|_F^2 + \sum_{i=1}^l \lambda_{y,i}\|W_{y,i}\|_1 : W_y \in \mathbb{R}^{d_2 \times l} \Big\}$,   (3.11)

where $\lambda_{x,i}, \lambda_{y,i}$ are positive regularization parameters and $W_{x,i}, W_{y,i}$ are the $i$th columns of $W_x$ and $W_y$, respectively. When we set $\lambda_{x,1} = \cdots = \lambda_{x,l} = \lambda_x > 0$ and $\lambda_{y,1} = \cdots = \lambda_{y,l} = \lambda_y > 0$, problems (3.10) and (3.11) become

$\min_{W_x} \Big\{ \frac{1}{2}\|X^T W_x - T_x\|_F^2 + \lambda_x\|W_x\|_1 : W_x \in \mathbb{R}^{d_1 \times l} \Big\}$,   (3.12)

and

$\min_{W_y} \Big\{ \frac{1}{2}\|Y^T W_y - T_y\|_F^2 + \lambda_y\|W_y\|_1 : W_y \in \mathbb{R}^{d_2 \times l} \Big\}$,   (3.13)

where

$\|W_x\|_1 = \sum_{i=1}^{d_1}\sum_{j=1}^{l} |W_x(i,j)|, \qquad \|W_y\|_1 = \sum_{i=1}^{d_2}\sum_{j=1}^{l} |W_y(i,j)|$.

Since (3.10) and (3.11) (and likewise (3.12) and (3.13)) have the same form, all results holding for one problem extend naturally to the other, so we concentrate on (3.10). Optimization problem (3.10) reduces to an $\ell_1$-regularized minimization problem of the form

$\min_{x \in \mathbb{R}^d} \ \frac{1}{2}\|Ax - b\|_2^2 + \lambda\|x\|_1$   (3.14)

when $l = 1$. In the field of compressed sensing, (3.14) has been intensively studied as the denoising basis pursuit problem, and many efficient approaches have been proposed to solve it; see [3, 15, 21, 54]. In this paper we adopt the fixed-point continuation (FPC) method [21, 22], due to its simple implementation and nice convergence properties.
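To make the update rule concrete before it is stated formally in (3.15)-(3.16) below, here is a minimal sketch of the basic fixed-point (soft-thresholding) iteration for (3.14) with a single fixed $\lambda$; the continuation strategy over a decreasing sequence of thresholds that gives FPC its name is omitted, and the stopping rule is an illustrative choice.

```python
# Minimal sketch of the fixed-point soft-thresholding iteration for (3.14);
# a simplified stand-in for FPC (no continuation), with an illustrative
# stopping rule.
import numpy as np

def soft_threshold(x, nu):
    return np.sign(x) * np.maximum(np.abs(x) - nu, 0.0)

def l1_least_squares(A, b, lam, max_iter=5000, tol=1e-8):
    # Step size chosen to satisfy 0 < tau < 2 / lambda_max(A^T A), cf. Theorem 3.3.
    tau = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(max_iter):
        x_new = soft_threshold(x - tau * A.T @ (A @ x - b), tau * lam)
        if np.linalg.norm(x_new - x) <= tol * max(1.0, np.linalg.norm(x)):
            x = x_new
            break
        x = x_new
    return x
```

Applied column by column with $A = X^T$ and $b = T_{x,i}$ (respectively $Y^T$ and $T_{y,i}$), the same update is what Algorithm 1 below iterates.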
The fixed-point algorithm for (3.14) is an iterative method which updates the iterates as

$x^{k+1} = S_{\nu}\big(x^k - \tau A^T(A x^k - b)\big)$, with $\nu = \tau\lambda$,   (3.15)

where $\tau > 0$ denotes the step size and $S_{\nu}$ is the soft-thresholding operator defined as

$S_{\nu}(x) = [S_{\nu}(x_1) \ \cdots \ S_{\nu}(x_d)]^T$ with $S_{\nu}(\omega) = \mathrm{sign}(\omega)\max\{|\omega| - \nu, 0\}$, $\omega \in \mathbb{R}$.   (3.16)

$S_{\nu}(\omega)$ reduces any $\omega$ with magnitude less than $\nu$ to zero, thus reducing the $\ell_1$-norm and introducing sparsity. The fixed-point algorithm can be naturally extended to solve (3.10), which yields

$W_{x,i}^{k+1} = S_{\nu_{x,i}}\big(W_{x,i}^k - \tau_x X(X^T W_{x,i}^k - T_{x,i})\big)$, $\quad i = 1, \cdots, l$,   (3.17)

where $\nu_{x,i} = \tau_x \lambda_{x,i}$ with $\tau_x > 0$ denoting the step size. We can prove that the fixed-point iterations have some nice convergence properties, which are presented in the following theorem.

Theorem 3.3. [21] Let $\Omega$ be the solution set of (3.10). Then there exists $M^* \in \mathbb{R}^{d_1 \times l}$ such that

$X(X^T W_x - T_x) \equiv M^*, \quad \forall \ W_x \in \Omega$.   (3.18)

In addition, define the index set

$L := \{(i,j) : |M^*_{i,j}| < \lambda_x\}$,   (3.19)

let $\lambda_{\max}(X X^T)$ be the maximum eigenvalue of $X X^T$, and choose $\tau_x$ such that

$0 < \tau_x < \frac{2}{\lambda_{\max}(X X^T)}$.

Then the sequence $\{W_x^k\}$ generated by the fixed-point iterations (3.17), starting from any initial point $W_x^0$, converges to some $W_x^* \in \Omega$. Moreover, there exists an integer $K > 0$ such that

$(W_x^k)_{i,j} = (W_x^*)_{i,j} = 0, \quad \forall \ (i,j) \in L$,   (3.20)

whenever $k > K$.

Remark 3.2. 1. Equation (3.18) shows that for any two optimal solutions of (3.10) the gradient of the squared Frobenius norm term in (3.10) must be equal.
2. Equation (3.20) means that the entries of $W_x^k$ with indices in $L$ converge to zero in finitely many steps. The positive integer $K$ is a function of $W_x^0$ and $W_x^*$, and is determined by the distance between them.

Similarly, we can design a fixed-point algorithm to solve (3.11) as follows:

$W_{y,i}^{k+1} = S_{\nu_{y,i}}\big(W_{y,i}^k - \tau_y Y(Y^T W_{y,i}^k - T_{y,i})\big)$, with $\nu_{y,i} = \tau_y \lambda_{y,i}$, $\quad i = 1, \cdots, l$,   (3.21)

where $\tau_y > 0$ denotes the step size.

Now, we are ready to present our sparse CCA algorithm.

Algorithm 1 (SCCA_LS: Sparse CCA based on least squares)
Input: training data $X \in \mathbb{R}^{d_1 \times n}$, $Y \in \mathbb{R}^{d_2 \times n}$.
Output: sparse transformation matrices $W_x \in \mathbb{R}^{d_1 \times l}$ and $W_y \in \mathbb{R}^{d_2 \times l}$.
1: Compute the matrix factorizations (2.4)-(2.7);
2: Compute $T_x$ and $T_y$ according to (3.2) and (3.3);
3: repeat
4:   $W_{x,i}^{k+1} = S_{\nu_{x,i}}\big(W_{x,i}^k - \tau_x X(X^T W_{x,i}^k - T_{x,i})\big)$, $\nu_{x,i} = \tau_x \lambda_{x,i}$, $i = 1, \cdots, l$,
5: until convergence
6: repeat
7:   $W_{y,i}^{k+1} = S_{\nu_{y,i}}\big(W_{y,i}^k - \tau_y Y(Y^T W_{y,i}^k - T_{y,i})\big)$, $\nu_{y,i} = \tau_y \lambda_{y,i}$, $i = 1, \cdots, l$,
8: until convergence
9: return $W_x = W_x^k$ and $W_y = W_y^k$.

Although different solutions may be returned by Algorithm 1 when started from different initial points, we can conclude from (3.18) that

$X X^T W_x^* = X X^T \widehat{W}_x^*, \quad \forall \ W_x^*, \widehat{W}_x^* \in \Omega$,

which results in $U_1^T W_x^* = U_1^T \widehat{W}_x^*$. Similarly, we have $V_1^T W_y^* = V_1^T \widehat{W}_y^*$ for two different solutions of (3.11). Hence,

$(W_x^*)^T X X^T W_x^* = (\widehat{W}_x^*)^T X X^T \widehat{W}_x^*$,
$(W_y^*)^T Y Y^T W_y^* = (\widehat{W}_y^*)^T Y Y^T \widehat{W}_y^*$,
$(W_x^*)^T X Y^T W_y^* = (\widehat{W}_x^*)^T X Y^T \widehat{W}_y^*$.

The above equations show that any two optimal solutions of (3.10) approximate the solution of CCA equally well.

Due to the effect of the $\ell_1$-norm regularization, a solution $(W_x^*, W_y^*)$ no longer satisfies the orthogonality constraints of CCA, but we can derive a bound on the deviation. Since (3.10) is a convex optimization problem, we have

$X(X^T W_x^* - T_x) + \lambda_x G = 0$, for some $G \in \partial\|W_x^*\|_1$,   (3.22)

where $\partial\|W_x^*\|_1$ denotes the sub-differential of $\|\cdot\|_1$ at $W_x^*$.
Simplifying (3.22), we can get

$U_1^T W_x^* = \Sigma_1^{-1} P_1(:,1:l) - \lambda_x \Sigma_1^{-2} U_1^T G$,

which implies

$(W_x^*)^T X X^T W_x^* = I_l - \lambda_x P_1(:,1:l)^T \Sigma_1^{-1} U_1^T G - \lambda_x G^T U_1 \Sigma_1^{-1} P_1(:,1:l) + \lambda_x^2 G^T U_1 \Sigma_1^{-2} U_1^T G$.

Since $G \in \mathbb{R}^{d_1 \times l}$ satisfies $|G_{i,j}| \le 1$ for $i = 1, \cdots, d_1$ and $j = 1, \cdots, l$, if we further assume there are $N_x$ non-zeros in $G$, it follows that

$\frac{1}{\sqrt{l}}\|(W_x^*)^T X X^T W_x^* - I_l\|_F \ \le \ \frac{\lambda_x}{\sqrt{l}\,\sigma_r(X)}\Big(2\sqrt{N_x} + \frac{\lambda_x}{\sigma_r(X)} N_x\Big) \ \le \ \frac{\lambda_x \sqrt{d_1}}{\sigma_r(X)}\Big(2 + \frac{\lambda_x}{\sigma_r(X)}\sqrt{l d_1}\Big)$,   (3.23)

where $\sigma_r(X)$ denotes the smallest nonzero singular value of $X$. So the bound is affected by the regularization parameter $\lambda_x$, the smallest nonzero singular value of $X$, and the number of non-zeros in $G$. A similar result can be obtained for the optimal solutions of (3.11).

4 Extension to kernel canonical correlation analysis

Since the kernel CCA criterion (2.13) and the CCA criterion (2.3) have the same form, we can expect a characterization of the solutions of (2.13) similar to Theorem 2.1. Define

$\hat{r} = \mathrm{rank}(K_x)$, $\quad \hat{s} = \mathrm{rank}(K_y)$, $\quad \hat{m} = \mathrm{rank}(K_x K_y^T)$,

and let the eigenvalue decompositions of $K_x$ and $K_y$ be, respectively,

$K_x = U \begin{bmatrix} \Pi_1 & 0 \\ 0 & 0 \end{bmatrix} U^T = [U_1 \ U_2] \begin{bmatrix} \Pi_1 & 0 \\ 0 & 0 \end{bmatrix} [U_1 \ U_2]^T = U_1 \Pi_1 U_1^T$,   (4.1)

and

$K_y = V \begin{bmatrix} \Pi_2 & 0 \\ 0 & 0 \end{bmatrix} V^T = [V_1 \ V_2] \begin{bmatrix} \Pi_2 & 0 \\ 0 & 0 \end{bmatrix} [V_1 \ V_2]^T = V_1 \Pi_2 V_1^T$.   (4.2)
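For completeness, here is a small sketch of how the truncated eigendecompositions in (4.1) and (4.2) might be computed numerically; the relative tolerance `rtol` used to decide which eigenvalues are treated as nonzero is an illustrative choice, not a value taken from the paper.

```python
# Sketch of the truncated eigendecomposition K = U1 Pi1 U1^T used in (4.1)
# and (4.2); `rtol` decides which eigenvalues count as nonzero.
import numpy as np

def truncated_eig(K, rtol=1e-10):
    vals, vecs = np.linalg.eigh(K)            # K symmetric PSD, eigenvalues ascending
    keep = vals > rtol * np.abs(vals).max()   # indices of "nonzero" eigenvalues
    U1 = vecs[:, keep]                        # plays the role of U_1 (or V_1)
    Pi1 = np.diag(vals[keep])                 # plays the role of Pi_1 (or Pi_2)
    return U1, Pi1                            # int(keep.sum()) estimates rank(K)
```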
