Supervised Dimensionality Reduction via Distance Correlation Maximization

Praneeth Vepakomma* (Department of Statistics, Rutgers University, Piscataway, NJ 08854, USA; [email protected])
Chetan Tonde*, Ahmed Elgammal (Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA; {cjtonde,elgammal}@cs.rutgers.edu)

Tuesday 5th January, 2016 (arXiv:1601.00236v1 [cs.LG] 3 Jan 2016)

*Authors contributed equally.

In our work, we propose a novel formulation for supervised dimensionality reduction based on a nonlinear dependency criterion called Statistical Distance Correlation [Székely et al., 2007]. We propose an objective which is free of distributional assumptions on the regression variables and of regression model assumptions. Our formulation is based on learning a low-dimensional feature representation z which maximizes the sum of squared Distance Correlations between the low-dimensional features z and the response y, and between the features z and the covariates x. We propose a novel algorithm to optimize this objective using the Generalized Minorization-Maximization method of Parizi et al. [2015]. We show superior empirical results on multiple datasets, demonstrating the effectiveness of our proposed approach over several relevant state-of-the-art supervised dimensionality reduction methods.

1. Introduction

Rapid developments in imaging technology, microarray data analysis, computer vision, neuroimaging, hyperspectral data analysis and many other applications call for the analysis of high-dimensional data. The problem of supervised dimensionality reduction is concerned with finding a low-dimensional representation of the data such that this representation can be used effectively in a supervised learning task. Such representations help in providing a meaningful interpretation and visualization of the data, and also help to prevent overfitting when the number of dimensions greatly exceeds the number of samples, thus acting as a form of regularization. In this paper we focus on supervised dimensionality reduction in the regression setting, where we consider the problem of predicting a univariate response y_i ∈ R from a vector of continuous covariates x_i ∈ R^p, for i = 1 to n.

Sliced Inverse Regression (SIR) of Li [1991], Lue [2009], Szretter and Yohai [2009] is one of the earliest supervised dimensionality reduction techniques, and is a seminal work that introduced the concept of a central subspace, which we now describe. This technique aims to find a subspace given by the column space of a p×d matrix B, with d << p, such that y ⊥⊥ X | B^T X, where ⊥⊥ denotes statistical independence. Under mild conditions the intersection of all such dimension-reducing subspaces is itself a dimension-reducing subspace, called the central subspace [Cook, 1996]. SIR aims to estimate this central subspace. Sliced Average Variance Estimation (SAVE) of Shao et al. [2009] and Shao et al. [2007] is another early method that can be used to estimate the central subspace. SIR uses a sample version of the first conditional moment E[X | Y] to construct an estimator of this subspace, whereas SAVE uses the sample first and second conditional moments to estimate it. Likelihood Acquired Directions (LAD) of Cook and Forzani [2009] is a technique that obtains the maximum likelihood estimator of the central subspace under assumptions of conditional normality of the predictors given the response. Like LAD, the methods SIR and SAVE rely on elliptical distributional assumptions such as Gaussianity of the data.
More recent methods do not require any distributional assumptions on the marginal distribution of x or on the conditional distribution of y. The authors of Gradient-Based Kernel Dimension Reduction (gKDR), Fukumizu and Leng [2014], use an equivalent formulation of the conditional independence relation y ⊥⊥ X | B^T X based on conditional cross-covariance operators and aim to find a B that maximizes the mutual information I(B^T X, y). In this work the authors estimate the conditional cross-covariance operators using Gaussian kernels; gKDR uses kernels only to provide equivalent characterizations of conditional independence through sample estimators of cross-covariance operators.

Sufficient Component Analysis (SCA) of Yamada et al. [2011] is another technique where B is also learnt using a dependency criterion. SCA aims to maximize the least-squares mutual information between the projected features Z = B^T X and the response, given by

SMI(Z,Y) = (1/2) ∫∫ ( p_zy(z,y) / (p_z(z) p_y(y)) − 1 )² dz dy.

This is done under orthonormal constraints on B, and the optimal solution is found by approximating the ratio p_zy(z,y)/(p_z(z) p_y(y)) using the method of density-ratio estimation [Sugiyama et al., 2012, Vapnik et al., 2015]; an analytical closed-form solution for the minima is also obtained. In Suzuki and Sugiyama [2013] (LSDR), the authors optimize this objective using a natural-gradient based iterative solution on the Stiefel manifold S^m_d(R), via a line search along the geodesic in the direction of the natural gradient [Amari, 1998, Nishimori and Akaho, 2005].

Our contribution in this paper is as follows. We propose a new formulation for supervised dimensionality reduction that is based on a dependency criterion called Distance Correlation [Székely et al., 2007]. This setup is free of distributional as well as regression model assumptions. The novelty of our formulation is that we do not restrict the transformation from x to z to be linear, as is the case in many of the above techniques. Furthermore, other works of Li et al. [2012], Kong et al. [2015], Berrendero et al. [2014] have used Distance Correlation as a criterion for feature selection in a regression setting. In our work, we show the benefits of Distance Correlation as a criterion for supervised low-dimensional feature learning.

In our work we use the following notation. The spectral radius of a matrix M is denoted by λ_max(M), its i-th eigenvalue by λ_i(M), and the i-th generalized eigenvalue of Ax = λBx by λ_i(A,B). Moreover, λ_max(M) (λ_max(A,B)) and λ_min(M) (λ_min(A,B)) denote, respectively, the maximum and minimum eigenvalues (generalized eigenvalues) of the matrix M (of the pair A, B). We use the usual partial ordering for symmetric matrices: A ⪰ B means A − B is positive semidefinite, and similarly for the relations ⪯, ≺, ≻. The norm ‖·‖ is either the Euclidean norm for vectors or the norm it induces for matrices, unless otherwise specified.

2. Distance Correlation

Distance Correlation, introduced by Székely et al. [2007], Székely et al. [2009] and Székely and Rizzo [2012, 2013], is a measure of nonlinear dependence between random vectors of arbitrary dimensions. Below we describe the α-distance covariance, an extended version of the standard distance covariance, which is recovered at α = 1.
Definition 2.1 (Distance Covariance [Székely et al., 2007], α-dCov). The distance covariance between random variables x ∈ R^d and y ∈ R^m with finite first moments is the nonnegative number given by

ν²(x,y) = ∫_{R^{d+m}} |f_{x,y}(t,s) − f_x(t) f_y(s)|² w(t,s) dt ds,

where f_x, f_y are the characteristic functions of x and y, f_{x,y} is their joint characteristic function, and w(t,s) is a weight function defined as w(t,s) = (C(d,α) C(m,α) |t|_d^{α+d} |s|_m^{α+m})^{-1} with

C(d,α) = 2π^{d/2} Γ(1 − α/2) / (α 2^α Γ((α+d)/2)).

The distance covariance is zero if and only if the random variables x and y are independent. From the above definition of distance covariance, we have the following expression for Distance Correlation.

Definition 2.2 (Distance Correlation [Székely et al., 2007], α-dCorr). The squared Distance Correlation between random variables x ∈ R^d and y ∈ R^m with finite first moments is the nonnegative number defined as

ρ²(x,y) = ν²(x,y) / √(ν²(x,x) ν²(y,y))  if ν²(x,x) ν²(y,y) > 0,  and ρ²(x,y) = 0 if ν²(x,x) ν²(y,y) = 0.

The Distance Correlation defined above has the following interesting properties: 1) ρ²(x,y) is defined for arbitrary dimensions of x and y; 2) ρ²(x,y) = 0 if and only if x and y are independent; and 3) ρ²(x,y) satisfies 0 ≤ ρ²(x,y) ≤ 1. In our work we use the α-Distance Covariance with α = 2 and, for simplicity, in the rest of the paper we simply refer to it as Distance Correlation.

We now define the sample version of distance covariance, given samples {(x_k, y_k) | k = 1, 2, ..., n} drawn i.i.d. from the joint distribution of random vectors x ∈ R^d and y ∈ R^m. To do so, we define two squared Euclidean distance matrices E_X and E_Y, with entries [E_X]_{k,l} = ‖x_k − x_l‖² and [E_Y]_{k,l} = ‖y_k − y_l‖², for k, l ∈ {1, 2, ..., n}. These squared distance matrices are then double-centered, by making their row and column sums zero, and are denoted Ê_X and Ê_Y respectively. That is, given the double-centering matrix J = I − (1/n) 1 1^T, we have Ê_X = J E_X J and Ê_Y = J E_Y J. The sample distance correlation (for α = 2) is then defined as follows.

Definition 2.3 (Sample Distance Correlation [Székely et al., 2007]). Given i.i.d. samples X × Y = {(x_k, y_k) | k = 1, 2, ..., n} and the corresponding double-centered squared Euclidean distance matrices Ê_X and Ê_Y, the squared sample distance covariance is defined as

ν̂²(X,Y) = (1/n²) Σ_{k,l=1}^{n} [Ê_X]_{k,l} [Ê_Y]_{k,l},

and, equivalently to Definition 2.2, the sample distance correlation is given by

ρ̂²(X,Y) = ν̂²(X,Y) / √(ν̂²(X,X) ν̂²(Y,Y))  if ν̂²(X,X) ν̂²(Y,Y) > 0,  and ρ̂²(X,Y) = 0 otherwise.
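To make Definition 2.3 concrete, the following minimal sketch (ours, not the authors' code; it assumes NumPy and uses the α = 2 convention above, i.e., squared Euclidean distances, and the function names are hypothetical) computes the squared sample distance correlation directly from data:

```python
import numpy as np

def double_center(E):
    """Double-center a matrix: J E J with J = I - (1/n) 11^T."""
    n = E.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    return J @ E @ J

def sq_dist_matrix(X):
    """Matrix of squared Euclidean distances between rows of X."""
    sq = np.sum(X**2, axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)

def sample_dcov2(X, Y):
    """Squared sample distance covariance (alpha = 2), Definition 2.3."""
    EX_hat = double_center(sq_dist_matrix(X))
    EY_hat = double_center(sq_dist_matrix(Y))
    n = X.shape[0]
    return np.sum(EX_hat * EY_hat) / n**2

def sample_dcor2(X, Y):
    """Squared sample distance correlation (alpha = 2), Definition 2.3."""
    vxy = sample_dcov2(X, Y)
    vxx = sample_dcov2(X, X)
    vyy = sample_dcov2(Y, Y)
    return vxy / np.sqrt(vxx * vyy) if vxx * vyy > 0 else 0.0

# A toy check: a nonlinear dependence gives a clearly positive value,
# while independent noise gives a value near zero.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = np.sin(X[:, :1]) + 0.1 * rng.standard_normal((100, 1))
print(sample_dcor2(X, y))                               # noticeably > 0
print(sample_dcor2(X, rng.standard_normal((100, 1))))   # close to 0
```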
3. Laplacian Formulation of Sample Distance Correlation

In this section, we propose a Laplacian formulation of the sample distance covariance and the sample distance correlation, which we later use to state our objective for supervised dimensionality reduction (SDR). A graph Laplacian version of the sample distance correlation can be obtained as follows.

Lemma 3.1. Given the matrices of squared Euclidean distances E_X and E_Y, and the Laplacians L_X and L_Y formed over the adjacency matrices Ê_X and Ê_Y, the squared sample distance correlation ρ̂²(X,Y) is given by

ρ̂²(X,Y) = Tr(X^T L_Y X) / √( Tr(Y^T L_Y Y) Tr(X^T L_X X) ).   (1)

Proof. Given the matrices Ê_X, Ê_Y and the column-centered matrices X̃, Ỹ, from the result of Torgerson [1952] we have Ê_X = −2 X̃ X̃^T and Ê_Y = −2 Ỹ Ỹ^T. From the problem of multidimensional scaling (MDS) [Borg and Groenen, 2005], we know that for a given adjacency matrix W and Laplacian matrix L,

Tr(X^T L X) = (1/2) Σ_{i,j} [W]_{i,j} [E_X]_{i,j}.   (2)

Now, for the Laplacian L = L_Y and adjacency matrix W = Ê_Y, we can represent Tr(X^T L_Y X) in terms of Ê_Y as

Tr(X^T L_Y X) = (1/2) Σ_{i,j=1}^{n} [Ê_Y]_{i,j} [E_X]_{i,j}.

From the fact that [E_X]_{i,j} = ⟨x̃_i, x̃_i⟩ + ⟨x̃_j, x̃_j⟩ − 2⟨x̃_i, x̃_j⟩, and also Ê_X = −2 X̃ X̃^T, we get

Tr(X^T L_Y X) = −(1/4) Σ_{i,j=1}^{n} [Ê_Y]_{i,j} ( [Ê_X]_{i,i} + [Ê_X]_{j,j} − 2 [Ê_X]_{i,j} )
             = (1/2) Σ_{i,j} [Ê_X]_{i,j} [Ê_Y]_{i,j} − (1/4) Σ_j [Ê_X]_{j,j} Σ_i [Ê_Y]_{i,j} − (1/4) Σ_i [Ê_X]_{i,i} Σ_j [Ê_Y]_{i,j}.

Since Ê_X and Ê_Y are double-centered matrices, Σ_{i=1}^{n} [Ê_Y]_{i,j} = Σ_{j=1}^{n} [Ê_Y]_{i,j} = 0, and it follows that

Tr(X^T L_Y X) = (1/2) Σ_{i,j} [Ê_X]_{i,j} [Ê_Y]_{i,j}.

It also follows that

ν̂²(X,Y) = (1/n²) Σ_{i,j=1}^{n} [Ê_X]_{i,j} [Ê_Y]_{i,j} = (2/n²) Tr(X^T L_Y X).

Similarly, we can express the sample distance covariance using the Laplacians L_X and L_Y as

ν̂²(X,Y) = (2/n²) Tr(X^T L_Y X) = (2/n²) Tr(Y^T L_X Y).

The sample distance variances can be expressed as ν̂²(X,X) = (2/n²) Tr(X^T L_X X) and ν̂²(Y,Y) = (2/n²) Tr(Y^T L_Y Y). Substituting back into the expression for the sample distance correlation above, we get Equation (1). □

4. Framework

4.1. Problem Statement

The goal in supervised dimensionality reduction (SDR) is to learn a low-dimensional representation Z ∈ R^{n×p} of the input features X ∈ R^{n×d}, so as to predict the response vector y ∈ R^n from Z. The intuition is that Z captures all the information relevant to predicting y. During testing, for out-of-sample prediction on a new data point x*, we estimate z* assuming that it is predictable from x*. In our proposed formulation, we use the aforementioned Laplacian-based sample distance correlation to measure dependencies between variables. We propose to maximize the dependence between the low-dimensional features Z and the response vector y, and also between the low-dimensional features Z and the input features X. Our objective is to maximize the sum of squares of these two sample distance correlations, given by

f(Z) = ρ̂²(X,Z) + ρ̂²(Z,y)   (3)
     = Tr(Z^T L_X Z) / √( Tr(X^T L_X X) Tr(Z^T L_Z Z) ) + Tr(Z^T L_y Z) / √( Tr(y^T L_y y) Tr(Z^T L_Z Z) ).   (4)

On simplification we get the following optimization problem, which we refer to as Problem (P):

max_Z f(Z) = Tr(Z^T S_{X,y} Z) / √( Tr(Z^T L_Z Z) )   Problem (P)

where k_X = 1/√(Tr(X^T L_X X)) and k_Y = 1/√(Tr(y^T L_y y)) are constants, and S_{X,y} = k_X L_X + k_Y L_y.
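Before turning to the optimization, the following self-contained sketch (ours, not the authors' code; assuming NumPy, with hypothetical function names) evaluates f(Z) in the Laplacian form of Problem (P) and checks it numerically against the definition in Equation (3):

```python
import numpy as np

def centered_gram(A):
    """Laplacian L_A = 2 * A_tilde A_tilde^T of the adjacency matrix
    E_hat_A = -2 A_tilde A_tilde^T (Torgerson); A_tilde is column-centered A."""
    A_t = A - A.mean(axis=0)
    return 2.0 * (A_t @ A_t.T)

def dcor2(A, B):
    """Squared sample distance correlation via the Laplacian form, Lemma 3.1."""
    LA, LB = centered_gram(A), centered_gram(B)
    num = np.trace(A.T @ LB @ A)
    return num / np.sqrt(np.trace(A.T @ LA @ A) * np.trace(B.T @ LB @ B))

def f_objective(X, y, Z):
    """f(Z) = Tr(Z^T S_{X,y} Z) / sqrt(Tr(Z^T L_Z Z)), Problem (P)."""
    LX, Ly, LZ = centered_gram(X), centered_gram(y), centered_gram(Z)
    kX = 1.0 / np.sqrt(np.trace(X.T @ LX @ X))
    kY = 1.0 / np.sqrt(np.trace(y.T @ Ly @ y))
    S = kX * LX + kY * Ly
    return np.trace(Z.T @ S @ Z) / np.sqrt(np.trace(Z.T @ LZ @ Z))

rng = np.random.default_rng(1)
X = rng.standard_normal((80, 6))
y = rng.standard_normal((80, 1))
Z = rng.standard_normal((80, 3))
print(f_objective(X, y, Z))          # Laplacian form, Problem (P)
print(dcor2(X, Z) + dcor2(Z, y))     # Equation (3); the two values should agree
```

The agreement of the two printed values is exactly the simplification that turns Equation (3) into Problem (P): the common factor 1/√(Tr(Z^T L_Z Z)) is pulled out and the constants k_X, k_Y absorb the remaining normalizations into S_{X,y}.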
4.2. Algorithm

In Problem (P), we observe that the numerator of our objective is convex while the denominator is non-convex, due to the presence of a square root and the nonlinear Laplacian term L_Z in Z. This makes direct optimization of the objective practically infeasible. To optimize Problem (P), we therefore present a surrogate objective, Problem (Q), which lower bounds our original objective. We maximize this lower bound with respect to Z and show that optimizing the surrogate Problem (Q) also maximizes the proposed objective in Problem (P). We do so by utilizing the Generalized Minorization-Maximization (G-MM) framework of Parizi et al. [2015].

The G-MM framework of Parizi et al. [2015] is an extension of the well-known MM framework of Lange et al. [2000]. It removes the equality constraint between the two objectives at every iterate Z_k, except at the initialization Z_0. This allows the use of a broader class of surrogates, which helps the maximization iterations avoid being trapped at sharp local maxima and also makes the problem less sensitive to initialization. The surrogate lower-bound objective is as follows:

max_Z g(Z,M) = Tr(Z^T S_{X,y} Z) / Tr(Z^T L_M Z)   Problem (Q)

where M ∈ R^{n×d} belongs to the set of column-centered matrices.

The surrogate Problem (Q) is convex in both its numerator and its denominator for a fixed auxiliary variable M. Theorem 4.1 provides the required justification that, under certain conditions, maximizing the surrogate Problem (Q) also maximizes the proposed objective Problem (P). An outline of the optimization strategy is as follows:

a) Initialize: Set Z_0 = [c J_d, 0_{(n−d)×d}^T]^T, a column-centered matrix, where c = 1/(2(d−1))^{1/4} and J_d ∈ R^{d×d} is a centering matrix. This is motivated by statement 1) in the proof of Theorem 4.1.

b) Optimize: Maximize the surrogate lower bound, Z_{k+1} = argmax_Z g(Z, Z_k) (see Section 5).

c) Rescale: Rescale Z_{k+1} ← κ Z_{k+1} such that Tr(Z_{k+1}^T L_{Z_{k+1}} Z_{k+1}) is greater than one. This is motivated by the proof of statement 3) of Theorem 4.1, and by the fact that g(Z,M) = g(κZ,M) and f(Z) = f(κZ) for any scalar κ.

d) Repeat steps b) and c) until convergence.

Theorem 4.1. Under the above strategy, maximizing the surrogate Problem (Q) also maximizes Problem (P).

Proof. For convergence it is enough for us to show the following [Parizi et al., 2015]:

1. f(Z_0) = g(Z_0, Z_0) for Z_0 = [c J_d, 0_{(n−d)×d}^T]^T and c = 1/(2(d−1))^{1/4},
2. g(Z_{k+1}, Z_k) ≥ g(Z_k, Z_k), and
3. f(Z_{k+1}) ≥ g(Z_{k+1}, Z_k).

To prove statement 1, for Z_0 = [c J_d, 0_{(n−d)×d}^T]^T we observe that Z_0 is column-centered, L_{Z_0} = 2 Z_0 Z_0^T and Z_0^T Z_0 = c² J_d. Hence Tr(Z_0^T L_{Z_0} Z_0) = 2 Tr( (Z_0^T Z_0)² ) = 2 c⁴ Tr(J_d) = 2 c⁴ (d−1) = 1. This proves the required statement f(Z_0) = g(Z_0, Z_0) = Tr(Z_0^T S_{X,y} Z_0).

Statement 2 follows from the optimization Z_{k+1} = argmax_Z g(Z, Z_k). To prove statement 3 we have to show that

Tr(Z_{k+1}^T S_{X,y} Z_{k+1}) / √( Tr(Z_{k+1}^T L_{Z_{k+1}} Z_{k+1}) ) ≥ Tr(Z_{k+1}^T S_{X,y} Z_{k+1}) / Tr(Z_{k+1}^T L_{Z_k} Z_{k+1}).

Since the numerators on both sides are equal, it is enough to show that

√( Tr(Z_{k+1}^T L_{Z_{k+1}} Z_{k+1}) ) ≤ Tr(Z_{k+1}^T L_{Z_k} Z_{k+1}).

From Lemma A.4 we have Tr(Z_{k+1}^T L_{Z_{k+1}} Z_{k+1}) ≤ Tr(Z_{k+1}^T L_{Z_k} Z_{k+1}). It follows from the rescaling step (step c) of the optimization strategy that the left-hand side Tr(Z_{k+1}^T L_{Z_{k+1}} Z_{k+1}) is always greater than one, so taking its square root implies

√( Tr(Z_{k+1}^T L_{Z_{k+1}} Z_{k+1}) ) ≤ Tr(Z_{k+1}^T L_{Z_k} Z_{k+1}). □
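As a quick numerical check of statement 1 (our sketch, assuming NumPy; the variable names are ours), the initialization Z_0 = [c J_d, 0]^T indeed satisfies Tr(Z_0^T L_{Z_0} Z_0) = 1, so f(Z_0) = g(Z_0, Z_0):

```python
import numpy as np

n, d = 50, 4
c = (2.0 * (d - 1)) ** -0.25                      # c = 1 / (2(d-1))^{1/4}
J_d = np.eye(d) - np.ones((d, d)) / d             # d x d centering matrix
Z0 = np.vstack([c * J_d, np.zeros((n - d, d))])   # n x d, column-centered

# L_{Z_0} = 2 Z_0 Z_0^T, since Z_0 is already column-centered
L_Z0 = 2.0 * (Z0 @ Z0.T)
print(np.trace(Z0.T @ L_Z0 @ Z0))                 # = 2 c^4 (d - 1) = 1
```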
We summarize all of the above steps in Algorithm 4.1 below; Section 5 describes the optimization algorithm required to solve Problem (Q).

Algorithm 4.1 DISCOMAX
Require: Z_0 = [c J_d, 0_{(n−d)×d}^T]^T, a column-centered matrix with c = 1/(2(d−1))^{1/4}; k ← 0
Ensure: Z* = argmax_Z f(Z)
1: repeat
2:   Solve Z_{k+1} = argmax_Z g(Z, Z_k)   (Problem (Q))
3:   Rescale Z_{k+1} ← κ Z_{k+1} such that Tr(Z_{k+1}^T L_{Z_{k+1}} Z_{k+1}) ≥ 1
4:   k ← k + 1
5: until ‖Z_{k+1} − Z_k‖² < ε
6: Z* = Z_{k+1}
7: return Z*

5. Optimization

In this section, we propose a framework for optimizing the surrogate objective g(Z,M), referred to as Problem (Q), for a fixed M = Z_k. We observe that for a given value of M, g(Z,M) is a ratio of two convex functions. To solve this, we convert the maximization problem into an equivalent minimization problem h(Z,M) by taking its reciprocal [Schaible, 1976]. This allows us to utilize the Quadratic Fractional Programming Problem (QFPP) framework of Dinkelbach [1967] and Zhang [2008] to minimize h(Z,M). We refer to this new minimization problem as Problem (R), stated below:

min_Z h(Z,M) = Tr(Z^T L_M Z) / Tr(Z^T S_{X,y} Z)   Problem (R)   (5)

where M = Z_k.

In his seminal work, Dinkelbach [1967], and later Zhang [2008], proposed a framework to solve constrained QFP problems by converting them into an equivalent parametric optimization problem through the introduction of a scalar parameter α ∈ R. We utilize this equivalence to define a new parametric problem, Problem (S). The solution involves a search over the scalar parameter α while repeatedly solving Problem (S) to obtain the required solution Z_{k+1}. This search continues until the values of α converge.

In a nutshell, the frameworks of Dinkelbach [1967] and Zhang [2008] establish that the following optimizations are equivalent:

Problem (R): minimize_{z ∈ R^d} h(z) = f_1(z) / f_2(z)
⇐⇒
Problem (S): minimize_{z ∈ R^d} H(z; α*) = f_1(z) − α* f_2(z), for some α* ∈ R,

where f_i(z) := z^T A_i z − 2 b_i^T z + c_i, with symmetric matrices A_1, A_2, vectors b_1, b_2 and scalars c_1, c_2, and f_2(z) > 0 over some z ∈ Z.

To see the equivalence of h(Z,M) in Problem (R) to h(z) above, we observe that A_1 = I ⊗ L_M, A_2 = I ⊗ S_{X,y}, c_1 = c_2 = 0 and b_1 = b_2 = 0, where ⊗ denotes the Kronecker product and vec(Z) denotes the column-wise vectorization of the matrix Z. Also, due to the positive definiteness of A_i, each f_i(z) is positive and in particular f_2(z) > 0 (if A_i is only positive semidefinite, we regularize by adding εI so that A_i ≻ 0). Using this setup for h(Z,M) we get

min_Z h(Z,M) = vec(Z)^T (I ⊗ L_M) vec(Z) / vec(Z)^T (I ⊗ S_{X,y}) vec(Z).   (6)
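A small check of the vectorized form in Equation (6) (our sketch, assuming NumPy, under our reading that the identity Tr(Z^T A Z) = vec(Z)^T (I ⊗ A) vec(Z) uses one identity block per column of Z with column-major vectorization):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 3
Z = rng.standard_normal((n, d))
A = rng.standard_normal((n, n))
A = A + A.T                          # any symmetric n x n matrix (e.g. L_M or S_{X,y})

z = Z.flatten(order="F")             # column-wise vectorization vec(Z)
lhs = np.trace(Z.T @ A @ Z)
rhs = z @ np.kron(np.eye(d), A) @ z  # vec(Z)^T (I (x) A) vec(Z), a block-diagonal of A's
print(np.isclose(lhs, rhs))          # True
```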
In Subsection 5.1 we propose a Golden Section Search [Kiefer, 1953] based algorithm (Algorithm 5.1) which utilizes the concavity of H(Z;α) with respect to α to locate the best α*. During this search we repeatedly solve Problem (S), starting with an initial interval 0 = α_l ≤ α ≤ α_u = λ_min(L_M, S_{X,y}) for a fixed M, and at each step we shorten the search interval by moving the upper and lower limits closer to each other. We continue until convergence to α*. The choice of the upper limit of α is motivated by the proof of Lemma A.2.

To solve Problem (S) for a given α, we propose an iterative algorithm in Subsection 5.2 (Algorithm 5.2). It uses the classical Majorization-Minimization framework of Lange [2013].

5.1. Golden Section Search

Dinkelbach [1967] and Zhang [2008] showed the following properties of the objective H(α) with respect to α, for a fixed Z (for a fixed Z and variable argument α we write H(Z;α) as H(α)).

Theorem 5.1. Let G: R → R be defined as

G(α) = min_Z H(Z;α) = min_Z { Tr(Z^T L_M Z) − α Tr(Z^T S_{X,y} Z) },

as derived from Problem (S). Then the following statements hold true:
1. G is continuous at any α ∈ R.
2. G is concave over α ∈ R.
3. G(α) = 0 has a unique solution α*.

Algorithm 5.1 exploits the concavity of G(α) to perform a Golden Section Search over α. Subsection 5.2 provides an iterative Majorization-Minimization algorithm (Algorithm 5.2) to solve the inner minimization problem, Problem (S).

Algorithm 5.1 Golden Section Search for α ∈ [α_l, α_u] for a fixed M = Z_k
Require: ε, η = (√5 − 1)/2, α_l = 0, S_{X,y}, L_X, L_y, M = Z_k
Ensure: Z_{k+1} = argmin_Z h(Z, Z_k)
1: D_X ← diag(L_X)
2: L_M ← 2 M M^T
3: α_u ← λ_max(L_M, S_{X,y})   (Lemma A.1)
4: β ← α_u + η (α_l − α_u)
5: δ ← α_l + η (α_u − α_l)
6: repeat
7:   H(β) ← min_Z { Tr(Z^T L_M Z) − β Tr(Z^T S_{X,y} Z) }   (Problem (S))
8:   H(δ) ← min_Z { Tr(Z^T L_M Z) − δ Tr(Z^T S_{X,y} Z) }   (Problem (S))
9:   if H(β) > H(δ) then
10:     α_u ← δ, δ ← β
11:     β ← α_u + η (α_l − α_u)
12:   else
13:     α_l ← β, β ← δ
14:     δ ← α_l + η (α_u − α_l)
15:   end if
16: until |α_u − α_l| < ε
17: α* ← (α_l + α_u)/2
18: Z_{k+1} ← argmin_Z { Tr(Z^T L_M Z) − α* Tr(Z^T S_{X,y} Z) }   (Problem (S))
19: return α*, Z_{k+1}

5.2. Distance Correlation Maximization Algorithm

Algorithm 5.2 gives an iterative fixed-point algorithm which solves Problem (S). Theorem 5.2 provides the fixed-point iterate used to minimize H(Z;α) with respect to Z for a given α. The fixed-point iterate Z_{t+1} = H Z_t (we use the subscript t to indicate the fixed-point iterations of Z) minimizes Problem (S), and monotonic convergence is assured by the Majorization-Minimization result of Lange [2013]. Theorem 5.2 below derives the fixed-point iterate used in Algorithm 5.2.

Theorem 5.2. For a fixed γ² (Lemma A.1), some α (Lemma A.2) and

H = (γ² D_X − α S_{X,y})† (γ² D_X − L_M),

the iterate Z_t = H Z_{t−1} monotonically minimizes the objective

F(Z;α) = Tr(Z^T L_M Z) − α Tr(Z^T S_{X,y} Z).   (7)
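Below is a sketch of the fixed-point update in Theorem 5.2 (ours, not the authors' released code; assuming NumPy). Here D_X is the diagonal of L_X, as set in Algorithm 5.1, and γ² and α are taken as given inputs, since Lemmas A.1 and A.2 that prescribe them are not part of this excerpt; the function names are hypothetical.

```python
import numpy as np

def F_objective(Z, L_M, S_Xy, alpha):
    """F(Z; alpha) = Tr(Z^T L_M Z) - alpha Tr(Z^T S_{X,y} Z), Equation (7)."""
    return np.trace(Z.T @ L_M @ Z) - alpha * np.trace(Z.T @ S_Xy @ Z)

def fixed_point_minimize(L_M, S_Xy, D_X, gamma2, alpha, Z0, max_iter=500, tol=1e-8):
    """Iterate Z_t = H Z_{t-1} with H = (gamma^2 D_X - alpha S_{X,y})^+ (gamma^2 D_X - L_M),
    the update stated in Theorem 5.2 for minimizing F(Z; alpha)."""
    H = np.linalg.pinv(gamma2 * D_X - alpha * S_Xy) @ (gamma2 * D_X - L_M)
    Z = Z0
    for _ in range(max_iter):
        Z_next = H @ Z
        if np.linalg.norm(Z_next - Z) < tol:   # stop once the iterates stabilize
            return Z_next
        Z = Z_next
    return Z
```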
