ebook img

Joint rank and variable selection for parsimonious estimation in a high-dimensional finite mixture regression model PDF

0.24 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Joint rank and variable selection for parsimonious estimation in a high-dimensional finite mixture regression model

JOINT RANK AND VARIABLE SELECTION FOR PARSIMONIOUS ESTIMATION IN HIGH-DIMENSION FINITE MIXTURE REGRESSION MODEL. EMILIEDEVIJVER Abstract. We study a dimension reduction method for finite mixture of multivariate response re- gression models in high dimension. Both the number of responses and of predictors may exceed the samplesize. Weconsiderjointlypredictorselectionandrankreductionforobtaininglower-dimensional approximations of parameter matrices. Thismethodology was alreadydeveloped in[8]. Inthis paper, 5 weprovethattheseestimatorsareadaptivetotheunknownmatrixsparsity. Moreprecisely,weexhibit 1 a penalty for which the model selected by the penalized likelihood satisfies an oracle inequality. We 0 supportourtheoretical resultwithsimulationstudyanddataanalysis. 2 n 1. Introduction a J The multivariate response regressionmodel 2 Y =βX +E ] postulates a linear relationship between Y, the q n matrix containing q responses for n subjects, and T × X, the p n matrix on p predictor variables. The termE is anq n matrix with independent columns, S × × E N (0,Σ) for all i 1,...,n . The unknown q p coefficient matrix β needs to be estimate. In . i ∼ q ∈ { } × h a more general way, we could use finite mixture of linear models, which model the relationship between t response and predictors, arising from different subpopulations: if the variable Y, conditionally to X, a m belongs to the cluster k, there exists β and Σ such that Y =β X+E, with E (0,Σ ). k k k q k ∼N Ifwe usethis model todealwithhigh-dimensionaldata,the numberofvariablescanbe quicklymuch [ larger than the sample size, because and predictors and response variables could be high dimensional. 1 To solve this problem, we will have to reduce the parameter dimension. v One wayto solve the dimensionproblemis to selectrelevantvariables, inorder to reduce the number 2 4 of unknowns. Indeed, all the information should not be interesting for the clustering. In a density 4 estimation way, we could cite Pan and Shen, in [16], who focus on mean variable selection, Zhou and 0 Pan, in [20], who use the Lasso estimator to regularize Gaussian mixture model with general covariance 0 matrices, Sun and Wang, in [18], who propose to regularize the k-means algorithm to deal with high- . 1 dimensional data, Guo et al, in [10], who propose a pairwise variable selection method. 0 In a regressionframework, we could use the Lasso estimator, introduced by Tibshirani in [19], which 5 is a sparse estimator, by penalizing the maximum likelihood estimator by the ℓ -norm, which achieves 1 1 the sparsity, as the ℓ penalty, but leads also to a convex optimization. Because we work with the : 0 v multivariate linear model, to deal with the matrix structure, we could prefer the group Lasso, variables i X grouped by columns, which selects columns rather than coefficients. This estimator was introduced by Zhou in [21] in the generalcase. If we select J columns among the p possible, we have to estimate J q r a coefficients rather than pq for A, which could| b|e smaller than n if J is smaller enough. | | | | Anotherestimatorwhichreducesthedimension,knowninthelinearmodel,isthelowrankestimator: introduced by Izenman in [11], and more used the last decades, with among others Bunea et al. in [4] and Giraud in [9], the regression matrix could be estimated by matrix of rank r, r < p q. Then, we ∧ have to estimate r(p+q r) coefficients, which could be smaller than n. − Inthis paper,we havechosento mix these twoestimators to providea sparseandlow rank estimator in mixture models. This method was introduced by Bunea et al. in [4], in the case of linear model and known noise covariance matrix. They present different ways, more or less computational, with more or less good results in theory. They get an oracle inequality, which say that, among a model collection, they are able to choose an estimator with good rank and good active variables. For this model, Ma et al. in [12] get a minimax lower bound, which precise that they attain nearly optimal rates of convergence adaptively for square Schatten norm losses. Date:January5,2015. 1 In this paper, we considerfinite mixture of K linear models inhigh-dimension. This model is studied in details by St¨adler et al. article for real response variable in [17], and by Devijver for multivariate response variable in [8]. We will estimate β for all k 1,...,K by a column sparse and low rank k ∈ { } estimator. The Lassoestimator is usedto selectvariables,whereaswe refitthe estimationby a low rank estimator, restricted on relevant variables. We propose a procedure which is based on a modeling that recasts variable selection, rank selection, and clustering problems into a model selection problem. This procedureisdevelopedin[8],withmethodology,computationalissues,simulationsanddataanalysis. In thispaper,wefocusontheoreticalpointofview,anddevelopedsimulationsanddataanalysisforthelow rankissue. Ourprocedureconstructsamodelcollection,withmodelsmoreorlesssparse,andwith rank vector more or less small. Among this collection, we have to select a model. We use the slope heuristic, which is a non-asymptotic criterion. In a theoretical way, in this paper, we get an oracle inequality for the collectionconstructedby our procedure,whichmakesa comparisonperformancebetweenour model and the oracle for a specified penalty. This result is an extension of the work of Bunea et al in [4], to mixture models and with covariance matrices (Σ ) unknown. They ensure that mixing sparse estimator and low rank matrix could be k 1 k K ≤ ≤ interesting. Indeed,whereaswehavetoestimatep qcoefficientsineachclusterfortheregressionmatrix, × we get only r(J +q r) unknown variables, which could be smaller than the number of observations | | − n if J and r are small. Even if the oracle inequality we get in this paper is an extension of Bunea | | et al. result, we use a really different way to prove it. Considering the model collection constructed, we want to select a model as good as possible. For that, we use the slope heuristic, which constructs a penalty, proportional to the dimension, and select the model minimizing the penalized log-likelihood. Theoretically, we construct also a penalty, proportional to the dimension (up to a logarithm term). We provide an oracle inequality which compare, up to a constant, the Jensen-Kullback-Leibler divergence of our model and the true model to the Kullback-Leibler divergence between the oracle and the true model. In estimation term, we do as well as possible. This oracle inequality is deduced from a general modelselectiontheoremformaximumlikelihoodestimatorofMassartin[13]. Controllingthebracketing entropy of models, we could prove the result. Remark that we work in a regression framework, then we rather use an extension of this theorem proved in Cohen and Le Pennec article [5], and that our model collection is random, constructed by the Lasso, then we rather use an extension of this theorem proved in [7]. To illustrate this procedure, in a computational way, we validate it on simulated dataset, and benchmark dataset. If the data have a low rank structure, we could easily find it with our methodology. This paper is organized as follows. In the Section 2, we describe the finite mixture regression model used in this procedure,andthe main step ofthe procedure. In the Section3, we presentthe main result ofthispaper,whichisanoracleinequalityfortheprocedureproposed. Finally,inSection4,weillustrate the procedure on simulated and benchmark dataset. Proof details of the oracle inequality are given in Appendix. 2. The model and the model collection We introduce our procedure of estimation by sparse and low rank matrix in the linear model, in Section 2.1, and extend it in Section 2.2 in mixture models. 2.1. Linear model. We consider the observations ((x ,y ) ) which realized random variables i i i=1,...,n (X,Y), satisfying the linear model Y =βX +ǫ where Y Rq are the responses,X Rp arethe regressors,β Rp q is anunknownmatrix,andǫ Rq × are rando∈m errors, ǫ N (0,Σ), wi∈th Σ S++ a symmetric∈positive definite matrix. We will wor∈k in ∼ q ∈ q high-dimension, then p q could be larger than the number of observations n. × We will construct an estimator which is sparse and low rank for β to cope with the high-dimension issue. Moreover,toreducethe covariancematrixdimension,wecompute adiagonalestimatorofΣ. The procedure we will propose could be explained into two steps. First, we estimate the active columns of β thanks to the Lasso estimator, for λ>0, using the estimator βˆLasso =argmin Y βX 2+λ B ; λ β Rp×q || − ||2 || ||1 ∈ (cid:8) (cid:9) where β = p q β . We assume that the variance is unknown. From this approach, we || ||1 j=1 z=1| j,z| could rescale the parameters by Σ 1 = PtP the Cholesky decomposition, and φ = P 1β, and then get P P − − 2 the estimates by, T denoting the set of triangular matrices of size q, for λ>0, q (1) (φˆLasso,PˆLasso)= argmin PY φX 2+λ φ . λ λ (φ,P)∈Rp×q×Tq(cid:8)|| − ||2 || ||1(cid:9) ThisapproachwasconsideredinSt¨adleretal. in[17],inscalarresponsecase. Thisreparametrization is done in order to get a scale-invariantestimator and a convex minimization problem. For λ>0, from the Lasso estimator of φˆLasso, we could deduce the relevant columns. λ Restricted to these relevant columns, in the second step of the procedure, we compute a low rank estimator of β, saying of rank at most r. Indeed, as explained in Giraud in [9], we restrict the maxi- mum likelihood estimator to have a rank at most r, keeping only the r biggest singular values in the corresponding decomposition. We get an explicit formula. This two steps procedure leads to an estimator of β which is sparse and has a low rank. We have also reduced the dimension into two ways. We refit the covariance matrix estimator by the maximum likelihood estimator. This estimator is studied in Bunea et al. in [4], in method 3. Let extend it in mixture models. 2.2. Mixture model. We observe n independent couples (x,y) = ((x ,y ) ) of random variables i i 1 i n (X,Y), with Y Rq and X Rp. The conditional density is assumed to b≤e≤a multivariate Gaussian ∈ ∈ mixture regression model. If the observation i belongs to the cluster k, we assume that there exists β Rp q, and Σ S++ such that y =β x +ǫ where ǫ N (0,Σ ). kT∈hus,×therandokm∈resqponsevariableiY rRqidepeindsonais∼etofqexplarnatoryvariables,writtenX Rp, ∈ ∈ through a mixture of linear regression-type model. Give more precisions on the assumptions. The variables Y are independent conditionally to X , for all i=1,...,n ; i i • we let Y X =x s (y x )dy, with i i i ξ i • | ∼ | (2) s (y x)= K πk exp (y−βkx)tΣ−k1(y−βkx) ξ | k=1(2π)q2det(Σk)1/2 (cid:18)− 2 (cid:19) X ξ =(π1,...,πk,β1,...,βk,Σ1,...,ΣK)∈ ΠK ×(Rq×p)K ×(S+q+)K (cid:0) K (cid:1) Π = (π ,...,π );π >0 for k 1,...,K and π =1 K 1 K k k ( ∈{ } ) k=1 X S++ is the set of symmetric positive definite matrices on Rq. q Wewanttoestimatetheconditionaldensityfunctions fromtheobservations. Forallk 1,...,K ,β ξ k ∈{ } is the matrix of regression coefficients, and Σ is the covariance matrix in the mixture component k k. The π s are the mixture proportions. In fact, for all k 1,...,K , for all z 1,...,q , k [βtx] = p [β ] x is the zth component of the mean of t∈he{mixture c}omponent k∈for{ the con}- k z j=1 k j,z j ditional density s (.x). ξ P | We could introduce, for all k 1,...,K , ∈{ } Σ−k1 =PktPk φk =Pk−1βk. We could define an extension of the Lasso estimator, K (3) (φˆLasso,PˆLasso)= argmin PY φX 2+λ π φ . λ λ (φ,P)∈Rp×q×Tq(|| − ||2 kX=1 k|| k||1) Remark that the penalty take into account the mixture weight. We could also define a low rank estima- tor (βˆLR,PˆLR) restricted to relevant variables detected by the Lasso estimator, for the regularization λ λ parameter λ. In a computational way, we will use two generalized EM algorithms, in order to deal with high dimensional data and get a sparse and low rank estimator. Give some details about those algorithms. Initially,theEMalgorithmwasintroducedbyDempsterin[6]. Italternatestwosteps,anexpectation step to cluster data, and a maximization step to update estimation. In our procedure, we want first to know whichcolumns arerelevant, then we wantto compute (3). This algorithmwasused andexplained in [8]. It is a generalization of the EM algorithm, for the Lasso estimator, and in a regression context. From the estimate φˆLasso, we could deduce which columns are relevant. The second algorithm we use λ lead to determine φ on relevant columns, for all k 1,...,K , with rank R . We alternate two steps, k k ∈{ } 3 E-step and M-step, until relative convergence of the parameters and of the likelihood. We restrict the dataset to relevant columns, and construct an estimator of size J q rather than p q. | |× × E-step: compute for k 1,...,K , i 1,...,n , the expected value of the log likelihood • ∈ { } ∈ { } function, γˆ =E (Z Y) i,k θ(ite) i,k| ϕ k = K ϕ l=1 l where P ϕl =πl(ite)detPl(ite)exp−12(cid:16)Pl(ite)Yi−XiΦ(lite)(cid:17)t(cid:16)Pl(ite)Yi−XiΦ(lite)(cid:17) M-step: • – Togetestimationinlinearmodel,weassigneachobservationinitsestimatedcluster,bythe MAP principle. We could apply this thanks to the E-step, which compute the a posteriori probability. Therefore, we say that y comes from component number argmax γˆ . i i,k k 1,...,K ∈{ } – Then, we can define β˜(ite) =(xt x ) 1xt y , in which x and y are the sample restric- k k k − k k k k | | | | | | tion to the cluster k. We decompose β˜(ite) in singular values such that β˜(ite) =USVt with k k S =diag(s ,...,s ) and s s ... s the singular values. Then, the estimator βˆ(ite) 1 q 1 ≥ 2 ≥ ≥ q k is defined by βˆ(ite) =US Vt with S =diag(s ,...,s ,0,...,0). k r r 1 r To select the regularization parameter for the Lasso estimator and the ranks, we could use criterion as BICorcross-validation. Inpractice,weprefertoconstructamodelcollectionwithvariousactivecolumns andvariousranks,andselectoneasfinalstep. Togetvariousactivecolumns,weconstructadata-driven grid of regularization parameters, coming from EM algorithm formula. See [8] for more details. To get various ranks, we estimate parameters for different values of ranks, belonging to r ,...,r . min max { } From this procedure, we construct a model with K clusters, J relevant columns and matrix of regressioncoefficients of ranks R NK, as described by the next m|od|el S . (K,J,R) ∈ (4) S = y Rq x Rp s (y x) (K,J,R) { ∈ | ∈ 7→ ξ(K,J,R) | } where K π det(P ) 1 sξ(K,J,R)(y|x)= k(2π)q/2k exp −2(y−βkR(k)x[iJ])tΣ−k1(y−βkR(k)x[iJ]) ; k=1 (cid:18) (cid:19) X ξ(K,J,R) =(π ,...,π ,βR(1),...,βR(K),Σ ,...,Σ ) Ξ 1 K 1 K 1 K ∈ K Ξ =Π Ψ (S++)K; K K × (K,J,R)× q Ψ = (βR(1),...,βR(K)) (Rq J )K Rank(β )=R(k) . (K,J,R) { 1 k ∈ ×| | | k } Varying K N ,J ([1,p]), and R 1,...,p q K, we get a model collection with ∗ ∈ K ⊂ ∈ J ⊂ P ∈ R ⊂ { ∧ } various number of components, active columns and matrix of regressioncoefficients. Among this model collection, during the last step, a model has to be selected. As in Maugis and Michel in [14], and in Maugis and Meynet in [15], among others, a non asymptotic penalized criterion is used. The slope heuristic was introduced by Birg´e and Massart in [3], and developed in practice by Maugis and Michel in [2] with the Capushe package. To use it in our context, we have to extend the theoreticalresultto determine the penalty shape in the high-dimensionalcontext,with a randommodel collection, in a regression framework. The main result is described in the next section, whereas proof details are given in appendix. 3. Oracle inequality In a theoretical point of view, we want to ensure that the slope heuristic which construct a penalty willselectagoodmodel. WefollowtheapproachdevelopedbyBirg´eandMassartin[3]whichconsistsof defininganonasymptoticpenalizedcriterion,leadingtoanoracleinequality. Inthecontextofregression, CohenandLePennec,in[5],andDevijverin[7],proposeageneralmodelselectiontheoremformaximum likelihood estimation. The result we get is a theoretic penalty, for which the model selected is as good as the best one, according to the Kullback-Leibler loss. 4 3.1. Framework and model collection. Among the model collection constructed by the procedure developed in section 2.2, with various rank and various sparsity, we want to select an estimator which is close to the truth. The oracle is by definition the model belonging to the collection which minimizes the contrast with the true model. In practice, we do not have access to the true model, then we can not know the oracle. Nevertheless,the goalof the model selection step of our procedure is to be nearest to the oracle. In this section, we present an oracle inequality, which mean that if we have penalized the log-likelihood in a good way, we will select a model which is as good as the oracle, according to the Kullback-Leibler loss. We consider the model collection defined by (4). Becauseweworkinhighdimension, pcouldbebig,anditwillbetime-consumingtotestalltheparts of 1,...,p . We construct a sub-collectiondenoted by L, which is constructed by the Lasso,which is { } J also random. This step is explained in more details in [8]. Moreover,to get the oracle inequality, we assume that the parameters are bounded: S = s S for all k 1,...,K , (BK,J,R) (K,J,R) ∈ (K,J,R)| ∈{ } Σ(cid:8)k =diag([Σk]1,1,...,[Σk]q,q), for all z 1,...,q ,a [Σ ] A , Σ k z,z Σ ∈{ } ≤ ≤ R(k) for all k [1,K],βR(k) = [σ ] [ut] [v ] , ∈ k k l k .,l k l,. l=1 X for all l 1,...,R(k) ,[σ ] <A . k l σ ∈{ } Remarkthatthisdecompositionofβ isthesingularvaluedecomposit(cid:9)ion,with(σ ) thesingular k l 1 l R(k) ≤≤ values, and u and v unit vectors, for k 1,...,K . k k ∈{ } We also assume that covariates belong to an hypercube: without restrictions, we could assume that X [0,1]p. ∈ Fixing the possible number of components, L the active column set constructed by the Lasso, K J and the possible vector of ranks, we get a model collection R (5) S(BK,J,R). K J R [∈K [∈J [∈R 3.2. Notations. Before enunciate the main theorem which leads to the oracle inequality, for the model collection (5) we need to define some metrics used to compare the conditional densities. First, the Kullback-Leibler divergence is defined by s KL (s,t)= log sdλ λ t Z (cid:16) (cid:17) for s and t two densities, sdλ absolutely continuous with respect to tdλ. To deal with regression data, for fixed covariates (x ,...,x ), we define 1 n n 1 KL⊗λn(s,t)=E n KLλ(s(.|xi),t(.|xi))! i=1 X for s and t two densities, sdλ absolutely continuous with respect to tdλ. We also define the Jensen-Kullback-Leibler divergence, first introduced in Cohen and Le Pennec [5], by 1 JKL (s,t)= KL (s,(1 ρ)s+ρt) λ,ρ λ ρ − for ρ ]0,1[, s and t two densities, sdλ absolutely continuous with respect to tdλ. The tensorized one is ∈ defined by n 1 JKL⊗λ,nρ(s,t)=E n JKLλ,ρ(s(.|xi),t(.|xi)) . ! i=1 X Notethatthosedivergencesarenotmetrics,notsatisfyingthetriangularinequalityandnosymmetric, but are also wild used in statistics to compare two densities. 5 3.3. Oracle inequality. Let state the main theorem. Theorem 3.1. Assume that we observe ((x ,y ) ) with unknown conditional density s . Let = i i 1 i n 0 and L = L ,where L iscons≤tr≤uctedbytheLassoestimator. Lets¯ S M K×J×R M K×J ×R J (K,J,R) ∈ (BK,J,R) such that, for δ >0, KL δ KL KL⊗λn(s0,s¯(K,J,R))≤t SiBnf KL⊗λn(s0,t)+ n ∈ (K,J,R) and such that there exists τ >0 such that (6) s¯(K,J,R) e−τs0. ≥ Consider the collection sˆ of rank constrained log-likelihood minimizer in S , (K,J,R) (K,J,R) (K,J,R) satisfying { } ∈M n 1 sˆ = argmin log s (y x ) . (K,J,R) s(K,J,R)∈S(BK,J,R)(−nXi=1 (cid:0) (K,J,R) i| i (cid:1)) Denote by D the dimension of the model S . Let pen : R defined by, for all (K,J,R) (BK,J,R) M → + (K,J,R) , ∈M D pen(K,J,R) κ (K,J,R) [2B2(A ,A ,a ,q) β Σ σ ≥ n log D(K,J(cid:8),R)B2(A ,A ,a ,q) 1 β Σ σ − n ∧ (cid:18) (cid:19) 4epq + log +R D pq (cid:18) (K,J,R)−q2 ∧ (cid:19)(cid:27) for κ>0 an absolute constant. Then, the estimator sˆ , with (Kˆ,Jˆ,Rˆ) n 1 (Kˆ,Jˆ,Rˆ)= argmin log(sˆ (y x ))+pen(K,J,R) , (K,J,R)∈ML(−nXi=1 (K,J,R) i| i ) n log(sˆ (y x ))+pen(Kˆ,Jˆ,rˆ) − (Kˆ,Jˆ,Rˆ) i| i i=1 X n ′ inf log(sˆ (y x ))+pen(K,J,R) +η (K,J,R) i i ≤(K,J,R) − | ! ∈M i=1 X satisfies E(JKL⊗ρ,nλ(s0,sˆ(Kˆ,Jˆ,Rˆ))) pen(K,J,R) Σ2 (7) ≤CE(cid:18)(K,J,iRn)f∈ML(cid:18)t∈Si(nKf,J,R)KL⊗λn(s0,t)+ n (cid:19)+ n (cid:19) for C >0. Theproofofthetheorem3.1isgiveninSection5. Notethatcondition(6)leadstocontroltherandom modelcollection. The mixture parametersarebounded inorderto constructbracketsover , and (K,J,R) S thustoupperboundtheentropynumber. Theinequality(7)weobtainisnotexactlyanoracleinequality, since the Jensen-Kullback-Leibler risk is upper bounded by the Kullback-Leibler bias. Becausewearelookingonasmallrandomsub-collectionofmodels,ourestimatorsˆ isattainable (K,J,R) in practice. Moreover,it is a non-asymptotic result,which allowsus to study casesfor which p increases with n. NotethatweusetheJensen-Kullback-LeiblerdivergenceratherthantheKullback-Leiblerdivergence, becauseitisbounded. Thisboundednessturnsouttobecrucialtocontrolthelossofthepenalizedmax- imum likelihood estimator under mild assumptions on the complexity of the model and their collection. WecouldcompareourboundwiththeoneofBuneaetal,in[4],whocomputedaproceduresimilarto ours,in a linear model. According to consistentgroupselectionfor the groupLasso,they getadaptivity of the estimator to an optimal rate, and their estimators perform the bias variance trade-off among 6 Table 1. Performances of our procedure. Mean number K,Rˆ,M,FA,ARI of the { } Kullback-Leibler divergence between the model selected and the true model, the esti- matedrankofthemodelselectedineachcluster,themissedvariables,the falserelevant variables, and the ARI, over 20 simulations. KL Rˆ M FA ARI p>n 19.03 [2.8,3] 0 20 0.95 p<n 3.28 [3,3] 0 0.6 0.99 all reduced rank estimators. Nevertheless, their results are obtained according to some assumption, for instance the mutual coherence on XtX, which postulates that the off-diagonal elements have to be small. Some assumptions on the design are required, whereas our result just need to deal with bounded parameters and bounded covariates. 4. Numerical studies We will illustrate our procedure with simulations and real datasets, to highlight advantages of our method. We adaptthe simulationspartof Bunea et al. article,[4]. Indeed, we workin the same way,to getsparseandlowrankestimator. Nevertheless,weaddthemixturetobeconsistentwithourclustering method, and to have more flexibility. 4.1. Simulations. To illustrate our procedure, we use simulations adapted from the article of Bunea [4], extended for mixture models. The design matrix X has iid rows X from a multivariate normal distribution N(0,Σ) with Σ = ρI, i ρ > 0. We consider a mixture with 2 components. Depending on the cluster, the coefficient matrix β k has the form b B0 b B1 β = k k k 0 0 (cid:20) (cid:21) for k 1,2 , with B0 a J R(k) matrix and B1 a R(k) q matrix. All entries in B0 and B1 are iid ∈ { } × × N(0,1). The noise matrix E has independent N(0,1) entries. Let E denote its ith row. i The proportion vector π is defined by π =[1,1], all clusters having the same probability. 2 2 Each row Y in Y is then generated as, if the observation i belongs to the cluster k, Y =β X +E , i i k i i for 1 i n. This setup contains many noise features, but the relevant one lie in a low-dimensional ≤ ≤ subspace. We report two settings: p>n: n=50, J =6,p=100,q=10,R=[3,3],ρ=0.1,b=[3, 3]. • | | − p<n: n=200, J =6,p=10,q=10,R=[3,3],ρ=0.01,b=[3, 3]. • | | − The currentsetups show thatvariableselection,without takingthe rankinformationinto consideration, may be suboptimal, even if the correlations between predictors are low. Each model was simulated 20 times, andtable 1 summarizes our findings. We evaluate the prediction accuracyof eachestimator βˆby themeansquarederror(MSE)usingatestsampleateachrun. Wealsoreportthemedianrankestimate (denoted by Rˆ) over all runs, rates of non included true variables (denoted by M for misses) and the rates of incorrectly included variables (FA for false actives). Ideally, we are looking for a model with low MSE, low M and low FA. Wecandrawthefollowingconclusionsfromtable1. Whenweworkinlow-dimensionalframework,we get very good results. Even if we could use any estimator, because we do not have dimension problem, with our procedure we get the matrix structure involved by the model. We get almost exact clustering, and the Kullback-Leibler divergence between the model we construct and the true model is really low. In case of high-dimensional data, when p is larger than n, we get also good results. We find the good structure, selecting the relevant variables (our model will have false relevant variables, but no missed variables), and selecting the true ranks. We could remark that false relevant variables have low values. Comparingtoanotherprocedurewhichwillnotreducetherank,wewillperformthedimensionreduction. 4.2. Application. In this section, we apply our procedure to real data set. The Norwegian paper quality data were obtained from a controlled experiment that was carried out at a paper factory in Norway to uncover the effect of three control variables X ,X ,X on the quality of the paper which 1 2 3 was measured by 13 response variables. Each of the control variables X takes values in 1,0,1 . i {− } To account for possible interactions and nonlinear effects, second order terms were added to the set of 7 predictors, yielding X ,X ,X ,X2,X2,X2,X X ,X X ,X X , and the intercept term. There were 29 1 2 3 1 2 3 1 2 1 3 2 3 observations with no missing values made on all response and predictor variables. The Box Behnken design of the experiment and the resulting data are described in Aldrin [1] and Izenman [11]. Moreover, Bunea et al in [4] also study this dataset. We always center the responses and the predictors. The datasetclearlyindicatesthatdimensionreductionispossible,makingitatypicalapplicationforreduced rank regressionmethods. Moreover,our method will exhibit different classes among this sample. We construct a model collection varying the number of clusters in = 2,...,5 . We select a model K { } with 2 classes. We select all variables except X X and X X , which is consistent with comments of 1 2 2 3 Bunea et al. In the two classes, we get two mean matrices, with ranks equal to 2 and 4. One cluster describes the mean comportment(with rank equal to 2), whereasthe other cluster contains values more different. 5. Appendices Inthoseappendices,wepresentthedetailsoftheproofofthetheorem3.1. Thisderivesfromageneral model selection theorem, enunciated in section 5.1, and proved in the paper [7]. Then, the proof of the theorem 3.1 could be summarized by satisfying assumptions 1, 2 and 3 described in section 5.1. 5.1. A general oracle inequality for model selection. Model selectionappears with the AIC crite- rion and BIC criterion. In a non-asymptotic way, a theory was developed by Birg´e and Massart in [3]. With some assumptions thatwe will detailhere,we get anoracleinequality for the maximumlikelihood estimator among a model collection. Le Pennec and Cohen, in [5], generalize this theorem in regression framework. We have to use a generalizationof this theorem because we consider a random collection of models. Let us consider a model collection (S ) . m m ∈M Before enunciate the general theorem, begin by talking about the assumptions. First, we impose a structuralassumption. ItisabracketingentropyconditiononthemodelS withrespecttotheHellinger m divergence n 1 d2H⊗n(s,t)=E"n d2H(s(.|xi),t(.|xi))#. i=1 X A bracket [t ,t+] is a pair of functions such that for all (x,y) ,t (y,x) s(y x) t+(y,x). − − ∈ X ×Y ≤ | ≤ The bracketing entropy H[.](δ,S,d⊗Hn) of a set S is defined as the logarithm of the minimum number of brackets [t−,t+] of width d⊗Hn(t−,t+) smaller than δ such that every functions of S belong to one of these brackets. Assumption 1 (H ). There is a non-decreasing function φ such that δ 1φ (δ) is non-increasing on (0,+ ) and formevery σ R+ and every s S , m 7→ δ m m m ∞ ∈ ∈ σ H[.](δ,Sm(sm,σ),d⊗Hn)dδ ≤φm(σ); Z0 q where Sm(sm,σ)={t∈Sm,d⊗Hn(t,sm)≤σ}. The model complexity Dm is then defined as nσm2 with σm the unique root of 1 (8) φ (σ)=√nσ. m σ Denote that the model complexity depends on the bracketing entropies not of the global models S m but of the ones of smaller localized sets. This is a weaker assumption. For technical reason, a separability assumption is also required. ′ ′ ′ Assumption 2 (Sep ). There exists a countable subset S of S and a set with λ( ) = 0, m m m Ym Y \Ym for λ the Lebesgue measure, such that for every t S , there exists a sequence (t ) of elements of m k k 1 ′ ′ ∈ ≥ S such that for every x and every y , log(t (y x)) goes to log(t(y x)) as k goes to infinity. m ∈Ym k | | Wealsoneedaninformationtheorytypeassumptiononourmodelcollection. Weassumetheexistence of a Kraft-type inequality for the collection. Assumption 3 (K). There is a family (x ) of non-negative numbers such that m m ∈M e−xm Σ<+ . ≤ ∞ m X∈M 8 Then, we could write our main global theorem to get an oracle inequality in regression framework, with a random collection of models. Theorem 5.1. Assume we observe ((x ,y ) ) with unknown conditional density s . Let = i i 1 i n 0 ≤≤ S (S ) be at most countable collection of conditional density sets. Let assumption (K) holds while m m ∈M assumptions (H ) and (Sep ) hold for every models S . Let δ >0, and s¯ S such that m m m ∈S KL m ∈ m δ KL KL⊗λn(s0,s¯m)≤t∈inSfmKL⊗λn(s0,t)+ n ; and let τ >0 such that (9) s¯ e τs . m − 0 ≥ Introduce (S ) some random sub-collection of (S ) . Consider the collection (sˆ ) of m m ˆ m m m m ˆ η-log-likelihood mini∈mMizer in S , satisfying ∈M ∈M m n n log(sˆ (y x )) inf log(s (y x )) +η. m i i m i i i=1− | ≤sm∈Sm i=1− | ! X X Then, for any ρ (0,1) and any C >1, there are two constants κ and C depending only on ρ and 1 0 2 ∈ C such that, as soon as for every index m , 1 ∈M (10) pen(m) κ( +(1 τ)x ) m m ≥ D ∨ with κ > κ , and where the model complexity is defined in (8), the penalized likelihood estimate 0 m D sˆ with mˆ ˆ such that mˆ ∈M n n ′ log(sˆ (y x ))+pen(mˆ) inf log(sˆ (y x ))+pen(m) +η mˆ i i m i i − | ≤m ˆ − | ! Xi=1 ∈M Xi=1 satisfies pen(m) (11) E(JKL⊗ρ,nλ(s0,sˆmˆ))≤C1E(cid:18)mi∈nMfˆ t∈inSfmKL⊗λn(s0,t)+2 n (cid:19) Σ2 η +η ′ (12) +C (1 τ) + . 2 ∨ n n Remark 5.2. We get that, among a random model collection, we are able to choose one which is as good as the oracle, up to a constant C , and some additive terms being around 1. This result is non- 1 n asymptotic, and gives a theoretic penalty to select this model. Remark 5.3. The proof of this theorem is detailed in [7]. Nevertheless, we could give the main ideas to understand the assumptions. From assumptions 1 and 2, we could use maximal inequalities which lead to, except on a set of probability less than e−xm′−x, for all x, a control of the ratio of the centered empirical process of log(sˆm′) over the Hellinger distance between s0 and sˆm′, this control being around n1. Thanks to Bernstein inequality, satisfied according to the inequality (9), and thanks to the assumption 3, we get the oracle inequality. Now, to prove theorem (3.1), we have to satisfy assumptions (1), (3), and (2). 5.2. Assumption (H ). m 5.2.1. Decomposition. As done in Cohen and Le Pennec [5], we could decompose the entropy by H[.](ǫ,S(BK,J,R),d⊗Hn)≤H[.](ǫ,ΠK,d⊗Hn)+kH[.](ǫ,F(J,R),d⊗Hn) 9 where y Rq x Rp s (y x)= K π Φ(y (βR(k)x) ,Σ ) ∈ | ∈ 7→ θ | k=1 r | k |J k S(BK,J,R) = θ = π1,...,πK,β1R(1),...,βPKR(K),Σ1,...,ΣK ∈ΘK   Θ =nΠ Ψ˜ ([a ,A ]q )K o  k K × (K,J,R)× Σ Σ +∗ Ψ(K,J,R) ={(β1R(1),...,βkR(K))∈[−Aβ,Aβ]K|Rank(βk)=R(k)}  K Π = (π ,...,π ) (0,1)K; π =1 k 1 K k ( ∈ ) k=1 X = Φ(.(βRX) ,Σ);β [a ,A ]J ,rank(βR)=R, (J,R) J β β | | F | | ∈ nΣ=diag(Σ2,...,Σ2) [a2,A2]q 1 q ∈ Σ Σ Φ the Gaussian density. (cid:9) For the proportions, we get that K 1 3 − H[.](ǫ,ΠK,d⊗Hn)≤log K(2πe)K/2(cid:18)ǫ(cid:19) !. Looking after the Gaussian entropy. 5.2.2. For the Gaussian. We want to bound the integrated entropy. For that, first we have to construct somebracketstorecoverS . Fixf S . Wearelookingforfunctionst andt+ suchthatt f t+. m m − − ∈ ≤ ≤ Because f is a Gaussian, t and t+ are dilatations of Gaussians. We then have to determine the mean, − thevarianceandthe dilatationcoefficientoft andt+. We needthebothfollowinglemmastoconstruct − these brackets. Lemme 5.4. Let Φ(.µ ,Σ ) and Φ(.µ ,Σ ) be two Gaussian densities. If their variance matrices are 1 1 2 2 | | assumed to be diagonal, with Σ = diag([S2] ,...,[S ]2) for a 1,2 , such that [S ]2 > [S ]2 > 0 for all z 1,...,q , then for all xa Rq, a 1 a q ∈ { } 2 z 1 z ∈{ } ∈ q Φ(x|µ1,Σ1) [Σ2]z exp12(µ1−µ2)tdiag(cid:16)[Σ2]1−1[Σ1]1,...,[Σ2]q−1[Σ1]q(cid:17)(µ1−µ2). Φ(x|µ2,Σ2) ≤z=1p[Σ1]z Y Lemme 5.5. The Hellinger distapnce of two Gaussian densities with diagonal variance matrices is given by the following expression: q 2[Σ2] [Σ2] 1/2 d2 (Φ(.µ ,Σ ),Φ(.µ ,Σ ))=2 2 1 q1 2 q1 H | 1 1 | 2 2 − qY1=1[Σ21]q1 +[Σ22]q1! 1 1 exp (µ µ )tdiag (µ µ ) × (−4 1− 2 (cid:18)[Σ21]q1 +[Σ22]q1(cid:19)q1∈{1,...,q}! 1− 2 ) To get an ǫ-bracket for the densities, we have to construct a δ-net for the variance and the mean, δ to be precised later. Step 1: construction of a net for the variance • Let ǫ ∈]0,1], and δ = √ǫ2q. Let b2j = (1+δ)1−2jA2Σ. For 2 ≤ j ≤ N, we have [aΣ,AΣ] = [b ,b ] ... [b ,b ], where N is chosen to recover everything. We want that N N 1 3 2 − S S a2Σ =(1+δ)1−N/2A2Σ a N Σ 2log = 1 log(1+δ) ⇔ A − 2 Σ (cid:18) (cid:19) 4log(AΣ√1+δ) N = aΣ . ⇔ log(1+δ) 4log(AΣ√1+δ) WewantN tobeaninteger,thenN = aΣ . Wegetaregularnetforthevariance. log(1+δ) (cid:24) (cid:25) We could let B = diag(b2 ,...,b2 ), close to Σ (and deterministic, independent of the values i(1) i(q) of Σ), where i is a permutation such that b Σ b for all z 1,...,q . i(z)+1 z,z i(z) ≤ ≤ ∈{ } 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.