Efficient Sparse Group Feature Selection via Nonconvex Optimization

Shuo Xiang^{1,2}, Xiaotong Shen^{3}, Jieping Ye^{1,2}
1 Department of Computer Science and Engineering, Arizona State University, AZ 85287
2 Center for Evolutionary Medicine and Informatics, The Biodesign Institute, Arizona State University, AZ 85287
3 School of Statistics, University of Minnesota, Minneapolis, MN 55455

January 22, 2013 (arXiv:1205.5075v2 [cs.LG])

Abstract

Sparse feature selection has been demonstrated to be effective in handling high-dimensional data. While promising, most existing works use convex methods, which may be suboptimal in terms of the accuracy of feature selection and parameter estimation. In this paper, we expand a nonconvex paradigm to sparse group feature selection, which is motivated by applications that require identifying the underlying group structure and performing feature selection simultaneously. The main contributions of this article are twofold: (1) statistically, we introduce a nonconvex sparse group feature selection model that can reconstruct the oracle estimator, so that consistent feature selection and parameter estimation can be achieved; (2) computationally, we propose an efficient algorithm that is applicable to large-scale problems. Numerical results suggest that the proposed nonconvex method compares favorably against its competitors on synthetic data and real-world applications, thus achieving the desired goal of delivering high performance.

1 Introduction

During the past decade, sparse feature selection has been extensively investigated, both in terms of optimization algorithms [1] and statistical properties [28, 20, 3]. When the data possess a certain group structure, sparse modeling has been explored in [24, 16, 13] for group feature selection. The group lasso [24] applies an $L_2$-regularization to each group, which ultimately yields a group-wise sparse model. The utility of such a method has been demonstrated in detecting splice sites [23], an important step in gene finding, and theoretically justified in [13]. The sparse group lasso [11] encourages sparsity at the level of both features and groups simultaneously.

In the literature, most approaches use convex methods due to the globality of the solution and tractable computation. However, this may lead to suboptimal results. Recent studies demonstrate that nonconvex methods, for instance the truncated $L_1$-penalty [19, 15, 27], may deliver superior performance to the standard $L_1$-formulation. In addition, [19] suggests that a constrained version of nonconvex regularization is slightly preferable to its regularized counterpart due to its theoretical merits.

In this article, we investigate sparse group feature selection (SGFS) through a constrained nonconvex formulation. Ideally, we wish to optimize the following $L_0$-model:

$$
\begin{aligned}
\min_{x}\quad & \tfrac{1}{2}\|Ax-y\|_2^2 \\
\text{subject to}\quad & \sum_{j=1}^{p} I(|x_j|\neq 0) \le s_1,\\
& \sum_{j=1}^{|G|} I(\|x_{G_j}\|_2 \neq 0) \le s_2,
\end{aligned}
\tag{1}
$$

where $A$ is an $n$ by $p$ data matrix whose columns represent different features, $x=(x_1,\cdots,x_p)$ is partitioned into $|G|$ non-overlapping groups $\{x_{G_i}\}$, and $I(\cdot)$ is the indicator function. The advantage of the $L_0$-model (1) lies in its complete control of the two levels of sparsity $(s_1,s_2)$, which are the numbers of features and groups respectively. However, a problem like (1) is known to be NP-hard [17].

This paper develops an efficient nonconvex method, which is a computational surrogate of the $L_0$-method described above and has theoretically guaranteed performance. We contribute in two aspects: (i) statistically, the proposed method retains the merits of the $L_0$ approach (1) in the sense that the oracle estimator can be reconstructed, which leads to consistent feature selection and parameter estimation; (ii) computationally, our efficient optimization tool makes large-scale problems tractable.
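To make the two sparsity budgets in (1) concrete, here is a small Python sketch (ours, purely illustrative; the function name and group encoding are not from the paper) that checks whether a candidate $x$ satisfies the feature-level and group-level $L_0$ constraints.

```python
import numpy as np

def l0_feasible(x, groups, s1, s2, tol=1e-12):
    """Check the two L0 constraints of problem (1).

    x      : (p,) candidate coefficient vector
    groups : list of index arrays forming a non-overlapping partition {G_j}
    s1, s2 : feature-level and group-level sparsity budgets
    """
    n_features = int(np.sum(np.abs(x) > tol))                    # sum_j I(|x_j| != 0)
    n_groups = sum(np.linalg.norm(x[g]) > tol for g in groups)   # sum_j I(||x_{G_j}||_2 != 0)
    return n_features <= s1 and n_groups <= s2

# Example: 3 nonzero features spread over 2 of 3 groups.
x = np.array([1.0, 0.0, 0.5, 0.0, 0.0, -2.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
print(l0_feasible(x, groups, s1=3, s2=2))  # True
```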
2 Nonconvex Formulation and Computation

One major difficulty in solving (1) comes from the nonconvex and discrete constraints, which would require enumerating all possible combinations of features and groups to find the optimal solution. We therefore approximate these constraints by their continuous computational surrogates:

$$
\begin{aligned}
\min_{x}\quad & \tfrac{1}{2}\|Ax-y\|_2^2 \\
\text{subject to}\quad & \sum_{j=1}^{p} J_\tau(|x_j|) \le s_1,\\
& \sum_{i=1}^{|G|} J_\tau(\|x_{G_i}\|_2) \le s_2,
\end{aligned}
\tag{2}
$$

where $J_\tau(z)=\min(|z|/\tau,1)$ is a truncated $L_1$-function approximating the $L_0$-function [19, 26], and $\tau>0$ is a tuning parameter such that $J_\tau(z)$ approximates the indicator function $I(|z|\neq 0)$ as $\tau$ approaches zero.

To solve the nonconvex problem (2), we develop a Difference of Convex (DC) algorithm based on a decomposition of each nonconvex constraint function into a difference of two convex functions; for instance,

$$
\sum_{j=1}^{p} J_\tau(|x_j|) = S_1(x) - S_2(x),
$$

where

$$
S_1(x)=\frac{1}{\tau}\sum_{j=1}^{p}|x_j| \qquad\text{and}\qquad S_2(x)=\frac{1}{\tau}\sum_{j=1}^{p}\max\{|x_j|-\tau,\,0\}
$$

are convex in $x$. Each trailing convex function, say $S_2(x)$, is then replaced by its affine minorant at the previous iterate,

$$
S_1(x) - S_2(\hat{x}^{(m-1)}) - \nabla S_2(\hat{x}^{(m-1)})^T (x-\hat{x}^{(m-1)}),
\tag{3}
$$

which yields an upper approximation of the constraint function $\sum_{j=1}^p J_\tau(|x_j|)$ as follows:

$$
\frac{1}{\tau}\sum_{j=1}^{p}|x_j|\cdot I\bigl(|\hat{x}^{(m-1)}_j|\le\tau\bigr) + \sum_{j=1}^{p} I\bigl(|\hat{x}^{(m-1)}_j|>\tau\bigr) \le s_1.
\tag{4}
$$

Similarly, the second nonconvex constraint in (2) can be approximated by

$$
\frac{1}{\tau}\sum_{j=1}^{|G|}\|x_{G_j}\|_2\cdot I\bigl(\|\hat{x}^{(m-1)}_{G_j}\|_2\le\tau\bigr) + \sum_{j=1}^{|G|} I\bigl(\|\hat{x}^{(m-1)}_{G_j}\|_2>\tau\bigr) \le s_2.
\tag{5}
$$

Note that both (4) and (5) are convex constraints, which results in the following convex subproblem:

$$
\begin{aligned}
\min_{x}\quad & \tfrac{1}{2}\|Ax-y\|_2^2 \\
\text{subject to}\quad & \frac{1}{\tau}\|x_{T_1(\hat{x}^{(m-1)})}\|_1 \le s_1 - \bigl(p-|T_1(\hat{x}^{(m-1)})|\bigr),\\
& \frac{1}{\tau}\|x_{T_3(\hat{x}^{(m-1)})}\|_G \le s_2 - \bigl(|G|-|T_2(\hat{x}^{(m-1)})|\bigr),
\end{aligned}
\tag{6}
$$

where $T_1$, $T_2$ and $T_3$ are the support sets defined as

$$
T_1(x)=\{i: |x_i|\le\tau\},\qquad
T_2(x)=\{i: \|x_{G_i}\|_2\le\tau\},\qquad
T_3(x)=\{i: x_i\in x_{G_j},\ j\in T_2(x)\},
$$

$\|x_{T_1}\|_1$ and $\|x_{T_3}\|_G$ denote the corresponding norms restricted to $T_1$ and $T_3$ respectively, and $\|x\|_G=\sum_{i=1}^{|G|}\|x_{G_i}\|_2$. (Support sets indicate that the elements outside these sets have no effect on the corresponding terms in the constraints of (6).) Solving (6) provides an updated solution, denoted $\hat{x}^{(m)}$. This procedure is iterated until the objective value no longer decreases, indicating that a local minimizer has been reached. The DC algorithm is summarized in Algorithm 1, from which we can see that efficient computation of (6) is critical to the overall DC routine. We defer a detailed discussion of this part to Section 4.

Algorithm 1 DC programming for solving (2)
Input: $A$, $y$, $s_1$, $s_2$
Output: solution $x$ to (2)
1: (Initialization) Initialize $\hat{x}^{(0)}$.
2: (Iteration) At iteration $m$, compute $\hat{x}^{(m)}$ by optimizing (6).
3: (Stopping Criterion) Terminate when the objective function stops decreasing.
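As a rough illustration of Algorithm 1, the outer DC loop can be written as below (a minimal sketch under our own naming; the convex subproblem solver is left abstract and is assumed to implement the AGM-plus-projection machinery of Section 4).

```python
import numpy as np

def J_tau(z, tau):
    """Truncated L1 surrogate J_tau(z) = min(|z| / tau, 1) from (2)."""
    return np.minimum(np.abs(z) / tau, 1.0)

def dc_sgfs(A, y, groups, s1, s2, tau, solve_subproblem, max_iter=50, tol=1e-6):
    """Outer DC loop of Algorithm 1.

    `solve_subproblem(A, y, groups, s1, s2, tau, x_prev)` is assumed to return a
    minimizer of the convex subproblem (6) built around x_prev (e.g. accelerated
    gradient descent with the SGLP projection of Section 4).
    """
    x = np.zeros(A.shape[1])
    prev_obj = np.inf
    for _ in range(max_iter):
        x = solve_subproblem(A, y, groups, s1, s2, tau, x)
        obj = 0.5 * np.linalg.norm(A @ x - y) ** 2
        if prev_obj - obj <= tol:      # stop when the objective no longer decreases
            break
        prev_obj = obj
    return x
```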
3 Theoretical Results

This section investigates theoretical aspects of the proposed method. More specifically, we demonstrate that the oracle estimator $\hat{x}^o$, the least squares estimator based on the true model, can be reconstructed. As a result, consistent selection as well as optimal parameter estimation can be achieved.

For ease of presentation, we introduce some notation used only in this section. Let $C=(G_{i_1},\cdots,G_{i_k})$ be the collection of groups that contain nonzero elements. Let $A_{G_j}=A_{G_j}(x)$ and $A=A(x)$ denote the indices of the nonzero elements of $x$ in group $G_j$ and in the entire $x$, respectively. Define

$$
S_{j,i}=\bigl\{x\in S:\ (A_C,C)\neq(A_{C^0},C^0),\ |A|=j,\ |C|=i\bigr\},
$$

where $S$ is the feasible region of (2) and $C^0$ represents the true nonzero groups.

The following assumptions are needed to obtain consistent reconstruction of the oracle estimator.

Assumption 1 (Separation condition). Define

$$
C_{\min}(x^0)=\inf_{x\in S}\ \frac{-\log\bigl(1-h^2(x,x^0)\bigr)}{\max(|C^0\setminus C|,\,1)};
$$

then for some constant $c_1>0$,

$$
C_{\min}(x^0)\ \ge\ c_1\,\frac{\log|G|+\log s_1^0}{n},
$$

where

$$
h(x,x^0)=\Bigl(\tfrac{1}{2}\int\bigl(g^{1/2}(x,y)-g^{1/2}(x^0,y)\bigr)^2\,d\mu(y)\Bigr)^{1/2}
$$

is the Hellinger distance for densities with respect to a dominating measure $\mu$.

Assumption 2 (Complexity of the parameter space). For some constant $c_0>0$ and any $0<t<\varepsilon\le 1$,

$$
H(t,\mathcal{F}_{j,i})\ \le\ c_0\,\max\bigl((\log(|G|+s_1^0))^2,\,1\bigr)\,|B_{j,i}|\,\log(2\varepsilon/t),
$$

where $B_{j,i}=S_{j,i}\cap\{x: h(x,x^0)\le 2\varepsilon\}$ is a local parameter space and $\mathcal{F}_{j,i}=\{g^{1/2}(x,y): x\in B_{j,i}\}$ is a collection of square-root densities. $H(\cdot,\mathcal{F})$ is the bracketing Hellinger metric entropy of the space $\mathcal{F}$ [14].

Assumption 3. For some positive constants $d_1,d_2,d_3$ with $d_1>10$,

$$
-\log\bigl(1-h^2(x,x^0)\bigr)\ \ge\ -d_1\log\bigl(1-h^2(x^\tau,x^0)\bigr)-d_3\tau^{d_2}p,
$$

where $x^\tau=(x_1 I(|x_1|\ge\tau),\cdots,x_p I(|x_p|\ge\tau))$.

Under the above assumptions, we can establish the following non-asymptotic probability error bound on the reconstruction of the oracle estimator $\hat{x}^o$.

Theorem 1. Suppose that Assumptions 2 and 3 hold. For a global minimizer $\hat{x}$ of (2) with $(s_1,s_2)=(s_1^0,s_2^0)$ and $\tau\le\bigl(\frac{(d_1-10)\,C_{\min}(x^0)}{d_3 d}\bigr)^{1/d_2}$, the following result holds:

$$
P\bigl(\hat{x}\neq\hat{x}^o\bigr)\ \le\ \exp\Bigl(-c_2\,n\,C_{\min}(x^0)+2\bigl(\log|G|+\log s_1^0\bigr)\Bigr).
$$

Moreover, if Assumption 1 also holds, then $P(\hat{x}=\hat{x}^o)\to 1$ and

$$
E\,h^2(\hat{x},x^0)=(1+o(1))\,\max\Bigl(E\,h^2(\hat{x}^o,x^0),\ \frac{s_1^0}{n}\Bigr)
$$

as $n\to\infty$, $|G|\to\infty$.

Theorem 1 states that the oracle estimator $\hat{x}^o$ can be accurately reconstructed, which in turn yields feature selection consistency as well as recovery of the oracle estimator's performance in parameter estimation. Moreover, according to Assumption 1, this conclusion still holds when $s_1^0|G|$ grows at the order of $\exp(c_1^{-1}nC_{\min})$. This is in contrast to existing results on consistent feature selection, where the number of candidate features should be no larger than $\exp(c_* n)$ for some $c_*$ [28]. In this sense, the number of candidate features is allowed to be much larger when an additional group structure is incorporated, particularly when each group contains a considerable number of redundant features.

To our knowledge, our theory for grouped selection is the first of its kind. However, it has a root in feature selection. The large deviation approach used here can also be applied to derive bounds for feature selection consistency. In such a situation, the result agrees with the necessary condition for feature selection consistency of any method, up to constants independent of the sample size [19]. In other words, the required conditions are weaker than those for $L_1$-regularization [21]. The use of the Hellinger distance is mainly to avoid specifying a sub-Gaussian tail for the random error; the result therefore continues to hold even when the error does not have a sub-Gaussian tail.
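As a concrete, entirely illustrative instance of the distance used in Assumption 1 (this example is ours and is not taken from the paper): if the working model has Gaussian errors, the squared Hellinger distance between two Gaussian densities with a common variance has a closed form, which shows how the separation quantity scales with the mean gap.

```latex
% Squared Hellinger distance between two Gaussian densities with means
% \mu_1, \mu_2 and common variance \sigma^2 (standard identity):
h^2\bigl(\mathcal{N}(\mu_1,\sigma^2),\,\mathcal{N}(\mu_2,\sigma^2)\bigr)
  \;=\; 1-\exp\!\Bigl(-\tfrac{(\mu_1-\mu_2)^2}{8\sigma^2}\Bigr),
\qquad\text{so}\qquad
-\log\bigl(1-h^2\bigr)=\tfrac{(\mu_1-\mu_2)^2}{8\sigma^2}.
% Example: a mean gap of one standard deviation gives
% -\log(1-h^2) = 1/8 and h^2 = 1 - e^{-1/8} \approx 0.1175.
```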
4 Optimization Procedures

As mentioned in Section 2, efficient computation of the convex subproblem (6) is of critical importance for the proposed DC algorithm. Note that (6) has the same form as the constrained sparse group lasso problem

$$
\begin{aligned}
\min_{x}\quad & \tfrac{1}{2}\|Ax-y\|_2^2 \\
\text{subject to}\quad & \|x\|_1 \le s_1,\\
& \|x\|_G \le s_2,
\end{aligned}
\tag{7}
$$

except that $x$ is restricted to the two support sets. As will be shown in Section 4.3, an algorithm for solving (6) can be obtained through only a few modifications of that for (7). We therefore first focus on solving (7).

4.1 Accelerated Gradient Method

For large-scale problems the dimensionality of the data can be very high, so first-order optimization is often preferred. We adapt the well-known accelerated gradient method (AGM) [18, 2], which is commonly used due to its fast convergence rate. To apply AGM to our formulation (7), the crucial step is to solve the following Sparse Group Lasso Projection (SGLP):

$$
\begin{aligned}
\min_{x}\quad & \tfrac{1}{2}\|x-v\|_2^2 \\
\text{subject to}\quad & \|x\|_1 \le s_1 \quad (C_1)\\
& \|x\|_G \le s_2 \quad (C_2),
\end{aligned}
\tag{8}
$$

which is a Euclidean projection onto a convex set and a special case of (7) when $A$ is the identity. For convenience, let $C_1$ and $C_2$ denote the above two constraints in what follows.

Since AGM is a standard framework whose efficiency mainly depends on that of the projection step, we leave the detailed description of AGM to the supplement and introduce here the efficient algorithm for the projection step (8).

4.2 Efficient Projection

We begin with some special cases of (8). If only $C_1$ is present, (8) becomes the well-known $L_1$-ball projection [9], whose optimal solution is denoted $\mathcal{P}_1^{s_1}(v)$, the projection of $v$ onto the $L_1$-ball of radius $s_1$. If only $C_2$ is involved, it becomes the group lasso projection, denoted $\mathcal{P}_G^{s_2}$. Moreover, we say a constraint is active if and only if equality holds at the optimal solution $x^*$; otherwise it is inactive. Preliminary results are summarized in Lemma 1.

Lemma 1. Denote a global minimizer of (8) by $x^*$. Then the following results hold:
1. If both $C_1$ and $C_2$ are inactive, then $x^*=v$.
2. If $C_1$ is the only active constraint, i.e., $\|x^*\|_1=s_1$, $\|x^*\|_G<s_2$, then $x^*=\mathcal{P}_1^{s_1}(v)$.
3. If $C_2$ is the only active constraint, i.e., $\|x^*\|_1<s_1$, $\|x^*\|_G=s_2$, then $x^*=\mathcal{P}_G^{s_2}(v)$.

4.2.1 Computing $x^*$ from the optimal dual variables

Lemma 1 describes a global minimizer when either constraint is inactive. Next we consider the case in which both $C_1$ and $C_2$ are active. By convex duality [6], there exist unique non-negative dual variables $\lambda^*$ and $\eta^*$ such that $x^*$ is also the global minimizer of the regularized problem

$$
\min_{x}\ \tfrac{1}{2}\|x-v\|_2^2+\lambda^*\|x\|_1+\eta^*\|x\|_G,
\tag{9}
$$

whose solution is given by the following theorem.

Theorem 2 ([11]). The optimal solution $x^*$ of (9) is given by

$$
x^*_{G_i}=\max\{\|v^{\lambda^*}_{G_i}\|_2-\eta^*,\,0\}\,\frac{v^{\lambda^*}_{G_i}}{\|v^{\lambda^*}_{G_i}\|_2},
\qquad i=1,2,\cdots,|G|,
\tag{10}
$$

where $v^{\lambda^*}_{G_i}$ is computed by soft-thresholding [8] $v_{G_i}$ with threshold $\lambda^*$:

$$
v^{\lambda^*}_{G_i}=\mathrm{SGN}(v_{G_i})\cdot\max\{|v_{G_i}|-\lambda^*,\,0\},
$$

where $\mathrm{SGN}(\cdot)$ is the sign function and all operations are taken element-wise.

Theorem 2 gives an analytical solution for $x^*$ in the ideal situation in which the values of $\lambda^*$ and $\eta^*$ are given. Unfortunately this is not the case, and the values of $\lambda^*$ and $\eta^*$ need to be computed directly from (8). Based on Theorem 2, we have the following conclusion characterizing the relation between the dual variables.

Corollary 1. The following equations hold:

$$
\|x^*\|_1=\sum_{i=1}^{|G|}\max\{\|v^{\lambda^*}_{G_i}\|_2-\eta^*,\,0\}\,\frac{\|v^{\lambda^*}_{G_i}\|_1}{\|v^{\lambda^*}_{G_i}\|_2}=s_1,
\tag{11}
$$

$$
\|x^*\|_G=\sum_{i=1}^{|G|}\max\{\|v^{\lambda^*}_{G_i}\|_2-\eta^*,\,0\}=s_2.
\tag{12}
$$

Suppose $\lambda^*$ is given; then computing $\eta^*$ from (12) amounts to solving a median-finding problem, which can be done in linear time [9].
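The two-stage thresholding in (10) is straightforward to implement; the following Python sketch (ours, not the authors' code) applies element-wise soft-thresholding with a given $\lambda$ and then group-wise shrinkage with a given $\eta$.

```python
import numpy as np

def prox_sparse_group(v, groups, lam, eta):
    """Evaluate the closed form (10): soft-threshold v by lam, then shrink each
    group's norm by eta. `groups` is a list of index arrays partitioning v."""
    x = np.zeros_like(v, dtype=float)
    v_lam = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)   # v^lambda (soft-thresholding)
    for g in groups:
        norm_g = np.linalg.norm(v_lam[g])
        if norm_g > eta:                                     # max{||v^lam_Gi||_2 - eta, 0} > 0
            x[g] = (norm_g - eta) / norm_g * v_lam[g]
    return x

# Toy usage with made-up dual values lam and eta:
v = np.array([3.0, -0.5, 0.2, 2.0, -1.5, 0.1])
groups = [np.array([0, 1, 2]), np.array([3, 4, 5])]
print(prox_sparse_group(v, groups, lam=0.3, eta=1.0))
```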
Finally, we treat the case of unknown $\lambda^*$ (and hence unknown $\eta^*$), for which we propose an efficient bisection approach.

4.2.2 Computing $\lambda^*$: bisection

Given an initial guess (estimate) of $\lambda^*$, say $\hat{\lambda}$, one may perform bisection to locate the optimal $\lambda^*$, provided there exists an oracle procedure indicating whether the optimal value is greater than $\hat{\lambda}$. (An upper bound and a lower bound on $\lambda^*$ must be provided in order to perform the bisection; these bounds can be easily derived from the assumption that both $C_1$ and $C_2$ are active.) This bisection method estimates $\lambda^*$ in logarithmic time. Next, we design such an oracle procedure.

Let the triple

$$
(x^*,\lambda^*,\eta^*)=\mathrm{SGLP}(v,s_1,s_2)
$$

be the optimal solution of (8) with both constraints active, i.e., $\|x^*\|_1=s_1$, $\|x^*\|_G=s_2$, with $(\lambda^*,\eta^*)$ the optimal dual variables. Consider the following two sparse group lasso projections:

$$
(x,\lambda,\eta)=\mathrm{SGLP}(v,s_1,s_2),\qquad
(x',\lambda',\eta')=\mathrm{SGLP}(v,s_1',s_2').
$$

The following key result holds.

Theorem 3. If $\lambda\le\lambda'$ and $s_2=s_2'$, then $s_1\ge s_1'$.

Theorem 3 gives the oracle procedure; its proof is presented in the supplement. For a given estimate $\hat{\lambda}$, we compute the corresponding $\hat{\eta}$ from (12) and then $\hat{s}_1$ from (11), satisfying $(\hat{x},\hat{\lambda},\hat{\eta})=\mathrm{SGLP}(v,\hat{s}_1,s_2)$. Then $\hat{s}_1$ is compared with $s_1$. Clearly, by Theorem 3, if $\hat{s}_1\le s_1$, the estimate $\hat{\lambda}$ is no less than $\lambda^*$; otherwise, $\hat{s}_1>s_1$ means $\hat{\lambda}<\lambda^*$. In addition, from (11) we know that $\hat{s}_1$ is a continuous function of $\hat{\lambda}$. Together with the monotonicity given in Theorem 3, a bisection approach can be employed to calculate $\lambda^*$. Algorithm 2 gives a detailed description.

4.3 Solving the Restricted Version of (7)

Finally, we modify the above procedures to compute the optimal solution of the restricted problem (6). To apply the accelerated gradient method, we consider the following projection step:

$$
\begin{aligned}
\min_{x}\quad & \tfrac{1}{2}\|x-v\|_2^2 \\
\text{subject to}\quad & \|x_{T_1}\|_1 \le s_1 \quad (C_1)\\
& \|x_{T_3}\|_G \le s_2 \quad (C_2).
\end{aligned}
\tag{13}
$$

Our first observation is that $T_3(x)\subset T_1(x)$: if an element of $x$ lies in a group whose $L_2$-norm is less than $\tau$, then the absolute value of that element must also be less than $\tau$. Secondly, from the decomposable nature of the objective function, we conclude that

$$
x^*_j=\begin{cases} v_j & \text{if } j\in(T_1)^c,\\ v^{\lambda^*}_j & \text{if } j\in T_1\setminus T_3,\end{cases}
$$

since there is no constraint on $x_j$ if it lies outside $T_1$, and $x_j$ is involved only in the $L_1$-norm constraint if $j\in T_1\setminus T_3$. Following routine calculations as in [9], we obtain the following analogues of (11) and (12):

$$
s_1=\sum_{i\in T_2}\max\{\|v^{\lambda^*}_{G_i}\|_2-\eta^*,\,0\}\,\frac{\|v^{\lambda^*}_{G_i}\|_1}{\|v^{\lambda^*}_{G_i}\|_2}
+\sum_{j\in T_1\setminus T_3}|v^{\lambda^*}_j|,
\tag{14}
$$

$$
s_2=\sum_{i\in T_2}\max\{\|v^{\lambda^*}_{G_i}\|_2-\eta^*,\,0\}.
\tag{15}
$$

Based on (14) and (15), we design a similar bisection approach to compute $\lambda^*$ and thus $(x^*)_{T_3}$, as in Algorithm 2. Details are deferred to the supplement.

Algorithm 2 Sparse Group Lasso Projection Algorithm
Input: $v$, $s_1$, $s_2$
Output: an optimal solution $x$ to the sparse group lasso projection problem (8)

Function SGLP($v$, $s_1$, $s_2$)
1: if $\|v\|_1\le s_1$ and $\|v\|_G\le s_2$ then
2:   return $v$
3: end if
4: $x^{C_1}=\mathcal{P}_1^{s_1}(v)$
5: $x^{C_2}=\mathcal{P}_G^{s_2}(v)$
6: $x^{C_{12}}=$ bisec($v$, $s_1$, $s_2$)
7: if $\|x^{C_1}\|_G\le s_2$ then
8:   return $x^{C_1}$
9: else if $\|x^{C_2}\|_1\le s_1$ then
10:  return $x^{C_2}$
11: else
12:  return $x^{C_{12}}$
13: end if

Function bisec($v$, $s_1$, $s_2$)
1: Initialize $up$, $low$ and $tol$
2: while $up-low>tol$ do
3:   $\hat{\lambda}=(low+up)/2$
4:   if (12) has a solution $\hat{\eta}$ given $v^{\hat{\lambda}}$ then
5:     calculate $\hat{s}_1$ from (11) using $\hat{\eta}$ and $\hat{\lambda}$
6:     if $\hat{s}_1\le s_1$ then
7:       $up=\hat{\lambda}$
8:     else
9:       $low=\hat{\lambda}$
10:    end if
11:  else
12:    $up=\hat{\lambda}$
13:  end if
14: end while
15: $\lambda^*=up$
16: Solve (12) to get $\eta^*$
17: Calculate $x^*$ from $\lambda^*$ and $\eta^*$ via (10)
18: return $x^*$
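The following Python sketch is our reading of the bisec routine in Algorithm 2 (not the authors' released code): for each trial $\hat{\lambda}$ it soft-thresholds $v$, solves (12) for $\hat{\eta}$ by the standard sorting-based method, evaluates $\hat{s}_1$ via (11), and narrows the interval using the monotonicity of Theorem 3.

```python
import numpy as np

def solve_eta(group_norms, s2):
    """Solve sum_i max(g_i - eta, 0) = s2 for eta > 0 (equation (12)).
    Returns None when sum(g_i) <= s2, i.e. (12) has no positive solution."""
    g = np.sort(group_norms)[::-1]                 # descending
    if g.sum() <= s2:
        return None
    csum = np.cumsum(g)
    for k in range(1, len(g) + 1):                 # try k active groups
        eta = (csum[k - 1] - s2) / k
        if eta < g[k - 1] and (k == len(g) or eta >= g[k]):
            return eta
    return None

def sglp_bisect(v, groups, s1, s2, tol=1e-8):
    """Bisection on lambda (function `bisec` of Algorithm 2), assuming both
    constraints of (8) are active."""
    low, up = 0.0, float(np.abs(v).max())          # simple bounds on lambda*
    while up - low > tol:
        lam = 0.5 * (low + up)
        v_lam = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
        norms = np.array([np.linalg.norm(v_lam[g]) for g in groups])
        eta = solve_eta(norms, s2)
        if eta is None:
            up = lam                               # lambda too large: (12) unsolvable
            continue
        # s1_hat from (11)
        s1_hat = sum(max(norms[i] - eta, 0.0) * np.abs(v_lam[g]).sum() / norms[i]
                     for i, g in enumerate(groups) if norms[i] > 0)
        if s1_hat <= s1:
            up = lam                               # Theorem 3: lambda_hat >= lambda*
        else:
            low = lam
    lam = up
    v_lam = np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)
    norms = np.array([np.linalg.norm(v_lam[g]) for g in groups])
    eta = solve_eta(norms, s2) or 0.0
    x = np.zeros_like(v, dtype=float)
    for i, g in enumerate(groups):                 # closed form (10)
        if norms[i] > eta:
            x[g] = (norms[i] - eta) / norms[i] * v_lam[g]
    return x
```

In practice one would first handle the special cases of Lemma 1 (only one constraint active) before falling back to this bisection, as the SGLP wrapper in Algorithm 2 does.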
5 Significance

This section briefly discusses the advantages of our work, statistically and computationally, and explains why the proposed method is useful for performing efficient and interpretable feature selection with a given natural group structure.

Interpretability. The parameters in (2) are highly interpretable in that $s_1$ and $s_2$ are upper bounds on the number of nonzero elements and the number of nonzero groups, respectively. This is advantageous, especially in the presence of prior knowledge about the number of features and/or groups. Such an interpretation vanishes with convex methods such as the lasso or the sparse group lasso, in which incorporating such prior knowledge often requires repeated trials with different parameters.

Parameter tuning. Tuning parameters for good generalization usually requires a considerable amount of work due to the large number of parameter choices. In contrast, tuning in (1) may search through integer values in a bounded range, and can be further simplified when prior knowledge is available. This permits more efficient tuning than in the regularized counterpart. Based on our limited experience, $\tau$ does not need to be tuned precisely and may be fixed at some small value.

Performance and Computation. Although our model (2) is proposed as a computational surrogate of the ideal $L_0$-method, its performance can also be theoretically guaranteed, i.e., consistent feature selection can be achieved. Moreover, its computation is much more efficient and applicable to large-scale applications.

6 Empirical Evaluation

This section presents numerical experiments that evaluate the proposed methods in terms of the efficiency and accuracy of sparse group feature selection. Evaluations are conducted on a PC with an i7-2600 CPU, 8.0 GB of memory and a 64-bit Windows operating system.

6.1 Evaluation of Projection Algorithms

Since the DC programming and the accelerated gradient methods are both standard, the efficiency of the proposed nonconvex formulation (2) depends on the projection step (8). We therefore focus on evaluating the projection algorithm, comparing it with two popular projection algorithms: the Alternating Direction Method of Multipliers (ADMM) [5] and Dykstra's projection algorithm [7]. A detailed derivation of how these two algorithms are adapted to our formulation is given in the supplement.

To evaluate efficiency, we first generate the vector $v$ with entries uniformly distributed in $[-50,50]$ and dimension $p$ chosen from the set $\{10^2,10^3,10^4,10^5,10^6\}$. Next we partition the vector into 10 groups of equal size. Finally, $s_2$ is set to $5\log(p)$ and $s_1$, the radius of the $L_1$-ball, is set to $\sqrt{10}\,s_2$ (motivated by the fact that $s_1\le\sqrt{10}\,s_2$).

For a fair comparison, we run our projection algorithm until convergence and record the minimal objective value as $f^*$. We then run ADMM and Dykstra's algorithm until their objective values become close to ours; more specifically, we terminate their iterations as soon as $f_{\mathrm{ADMM}}-f^*\le 10^{-3}$ and $f_{\mathrm{Dykstra}}-f^*\le 10^{-3}$, where $f_{\mathrm{ADMM}}$ and $f_{\mathrm{Dykstra}}$ denote the objective values of ADMM and Dykstra's algorithm, respectively. Table 1 summarizes the average running time of all three algorithms over 100 replications.

Table 1: Running time (in seconds) of Dykstra's algorithm, ADMM and our projection algorithm, averaged over 100 replications.

Method    p = 10^2   10^3     10^4     10^5     10^6
Dykstra   0.1944     0.5894   4.8702   51.756   642.60
ADMM      0.0519     0.1098   1.2000   26.240   633.00
ours      < 10^-7    0.0002   0.0051   0.0440   0.5827
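For reference, a minimal timing harness in the spirit of the setup above (ours; it reuses the `sglp_bisect` sketch from Section 4.2.2 rather than the authors' implementation, so absolute numbers will not match Table 1):

```python
import time
import numpy as np

def benchmark_projection(p, n_groups=10, reps=5, seed=0):
    """Time the SGLP projection on v ~ Uniform[-50, 50]^p with 10 equal groups,
    s2 = 5*log(p) and s1 = sqrt(10)*s2, as described in Section 6.1."""
    rng = np.random.default_rng(seed)
    groups = np.array_split(np.arange(p), n_groups)
    s2 = 5.0 * np.log(p)
    s1 = np.sqrt(10.0) * s2
    times = []
    for _ in range(reps):
        v = rng.uniform(-50.0, 50.0, size=p)
        t0 = time.perf_counter()
        sglp_bisect(v, groups, s1, s2)   # projection sketch from Section 4.2.2
        times.append(time.perf_counter() - t0)
    return float(np.mean(times))

# e.g. benchmark_projection(10**4)
```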
Next we demonstrate the accuracy of our projection algorithm. Toward this end, the general convex optimization toolbox CVX [12] is chosen as the baseline. Following the same data-generating strategy, we report the distance (in the Euclidean norm $\|\cdot\|_2$) between the optimal solution of each of the three projection algorithms and that of CVX. Note that the projection problem is strictly convex with a unique global optimal solution.

For ADMM and Dykstra's algorithm, the termination criterion is that the relative difference of the objective values between consecutive iterations is below a threshold; specifically, we terminate the iteration when $|f(x_{k-1})-f(x_k)|\le 10^{-7}f(x_{k-1})$. For our projection algorithm, we set $tol$ in Algorithm 2 to $10^{-7}$. The results are summarized in Table 2. Powered by second-order optimization algorithms, CVX provides fast and accurate solutions for problems of moderate size but suffers from a large computational burden on large-scale ones. We therefore only report results up to 5,000 dimensions.

Table 2: Distance between the optimal solution of each projection algorithm and that of CVX, averaged over 100 replications.

Method    p = 50    100      500      1000     5000
Dykstra   9.00      9.81     11.40    11.90    12.42
ADMM      0.64      0.08     3.6e-3   6.3e-3   1.3e-2
ours      1.4e-3    1.1e-3   1.2e-3   1.7e-3   7.3e-3

From Tables 1 and 2, we note that both ADMM and our algorithm yield more accurate solutions than Dykstra's. For projections of moderate size, all three algorithms perform well; for large-scale ones, our advantage in efficiency is evident.

6.2 Performance on Synthetic Data

6.2.1 Experimental Setup

We generate a $60\times 100$ matrix $A$ whose entries follow an i.i.d. standard normal distribution. The 100 features (columns) are partitioned into 10 groups of equal size. The ground-truth vector $x_0$ has nonzero elements in only 4 of the 10 groups. To further enhance sparsity, in each nonzero group of $x_0$ only $t$ ($t\le 10$) elements are nonzero, where $t$ is uniformly distributed over $[1,5]$. Finally, $y$ is generated as $Ax_0+z$ with $z\sim\mathcal{N}(0,0.5^2)$, and $A$ and $y$ are divided into training and testing sets of equal size.

We fit our method to the training set and compare it with the lasso, the group lasso and the sparse group lasso. The tuning parameters of the convex methods are selected from $\{0.01,0.1,1,10\}$, whereas for our method the number of nonzero groups is selected from $\{2,4,6,8\}$ and the number of features from $\{2s_2,4s_2,6s_2,8s_2\}$. Leave-one-out cross-validation is conducted on the training set to choose the best tuning parameter for each method.

6.2.2 Results and Discussions

We use the following metrics for evaluation:
• Estimation error: $\|\hat{x}-x_0\|_2^2$
• Prediction error: $\|A\hat{x}-\tilde{y}\|_2^2$
• Group precision: $|T_2(\hat{x})\cap T_2(x_0)|/|T_2(\hat{x})|$
• Group recall: $|T_2(\hat{x})\cap T_2(x_0)|/|T_2(x_0)|$

where $\hat{x}$ is the estimator obtained from (2) and $\tilde{y}$ is an independent vector following the same distribution as $y$. The group precision and recall measure the capability of recovering the group structure from the data. We report the results in Table 3 and observe that our model generally exhibits better performance. Although our model does not give the best result on group recall, its group precision is significantly better than that of the others, illustrating that the three convex methods recover more redundant groups.
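To make the synthetic setup of Section 6.2.1 and the group-level metrics of Section 6.2.2 concrete, here is a short Python sketch (ours; the distribution of the nonzero values and the convention that a "selected" group is one with nonzero Euclidean norm are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data as in Section 6.2.1: A is 60 x 100, 10 groups of 10 features,
# 4 groups carry signal with t ~ Uniform{1,...,5} nonzero entries each.
n, p, n_groups = 60, 100, 10
groups = np.array_split(np.arange(p), n_groups)
A = rng.standard_normal((n, p))
x0 = np.zeros(p)
for g in rng.choice(n_groups, size=4, replace=False):
    t = rng.integers(1, 6)
    idx = rng.choice(groups[g], size=t, replace=False)
    x0[idx] = rng.standard_normal(t)          # assumed distribution for nonzero values
y = A @ x0 + 0.5 * rng.standard_normal(n)     # noise std 0.5 as in the text

def group_support(x, groups, tol=1e-8):
    """Indices of groups with nonzero Euclidean norm (treated as 'selected')."""
    return {i for i, g in enumerate(groups) if np.linalg.norm(x[g]) > tol}

def group_precision_recall(x_hat, x_true, groups):
    sel, true = group_support(x_hat, groups), group_support(x_true, groups)
    precision = len(sel & true) / max(len(sel), 1)
    recall = len(sel & true) / max(len(true), 1)
    return precision, recall
```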
