ebook img

Regularization, sparse recovery, and median-of-means tournaments PDF

0.29 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Regularization, sparse recovery, and median-of-means tournaments

Regularization, sparse recovery, and median-of-means tournaments ∗ 7 1 0 Ga´bor Lugosi†‡§ Shahar Mendelson ¶ 2 n January 17, 2017 a J 5 1 Abstract ] T A regularized risk minimization procedure for regression function estimation is intro- S duced that achieves near optimal accuracy and confidence under general conditions, in- . h cludingheavy-tailedpredictorandresponsevariables. Theprocedureisbasedonmedian- at of-means tournaments, introduced by the authors in [8]. It is shown that the new pro- m cedure outperforms standard regularized empirical risk minimization procedures such as lasso or slope in heavy-tailed problems. [ 1 2010 Mathematics Subject Classification: 62J02, 62G08, 60G25. v 2 1 1 Introduction 1 4 0 1.1 Empirical risk minimization, regularization . 1 Regression function estimation is a fundamental problem in statistics and machine learning. 0 7 In the most standard formulation of the problem, (X,Y) is a pair of random variables in 1 which X, taking values in some general measurable space , represents the observation (or : X v feature vector) and one would like to approximate the unknown real value Y by a function i X of X. In other words, one is interested in finding a function f : R such that f(X) is X → r “close” to Y. As the vast majority of the literature, we measure the quality of f by the risk a R(f)= E(f(X) Y)2 , − which is well defined whenever f(X) and Y are square integrable, assumed throughout the paper. Clearly, the best possible function is the regression function m(X) = E(Y X). | However, in statistical problems, the joint distribution of (X,Y) is unknown and the re- gression function is impossible to compute. Instead, a sample = ((X ,Y ),...,(X ,Y )) N 1 1 N N D of independent copies of the pair (X,Y) is available. ∗Ga´borLugosiwassupportedbytheSpanishMinistryofEconomyandCompetitiveness,GrantMTM2015- 67304-P and FEDER, EU.Shahar Mendelson was supported in part by theIsrael Science Foundation. †DepartmentofEconomicsandBusiness,PompeuFabraUniversity,Barcelona,Spain,[email protected] ‡ICREA, Pg. Llus Companys 23, 08010 Barcelona, Spain §Barcelona Graduate School of Economics ¶Department of Mathematics, Technion, I.I.T, and Mathematical Sciences Institute, The Australian Na- tional University,[email protected] 1 A popular and thoroughly studied approach is to select a function f from a fixed class N of functions. Formally, a learning procedure is a map Φ : ( R)N that assigns to F X × → F each sample = (X ,Y )N a (random) function Φ( ) = f . b DN i i i=1 DN N If the class is sufficiently “large,” then it is reasonable to expect that the best function F in the class b f∗ = argminE(f(X) Y)2 − f∈F has an acceptable performance. We assume throughout that the minimum is attained and f∗ is unique. In fact, we assume that is a closed and convex subset of L (µ)—where 2 ∈ F F µ denotes the distribution of X—, guaranteeing the existence and uniqueness of f∗. The quality of a learning procedure is typically measured by the mean squared error f f∗ 2 = E (f (X) f∗(X))2 , k N − kL2 N − |DN where, for q 1, we use the notation (cid:0) (cid:1) b b ≥ f g = (E f(X) g(X) q)1/q and also f Y = (E f(X) Y q)1/q . k − kLq | − | k − kLq | − | A closely related, though not equivalent, measure of performance is the excess risk, defined by the conditional expectation R(f ) R(f∗) = E (f (X) Y)2 E(f∗(X) Y)2 . N N N − − |D − − The goal of the statistical learning p(cid:0)roblem is to find a(cid:1) learning procedure that achieves a b b good accuracy with a high confidence. In particular, for r > 0 and δ (0,1), we say that a ∈ procedure achieves accuracy r with confidence 1 δ (e.g., for the mean squared error) if − P f f∗ r 1 δ . k N − kL2 ≤ ≥ − (cid:16) (cid:17) High accuracy andhighconfidence(bi.e., smallr andsmallδ)areobviously conflicting require- ments. The achievable tradeoff has been thoroughly studied and it is fairly well understood. WereferthereadertoLecu´eandMendelson[4], LugosiandMendelson[8]forrecentaccounts. The most standard approach for a learning procedure is empirical risk minimization (erm), also known as least squares regression in which N f argmin (f(X ) Y )2 N i i ∈ − f∈F i=1 X b (where we assume that the minimum is achieved). It is now well understood (see, e.g., Lecu´e and Mendelson [4]) that if the distribution of Y, f(X) and (f h)(X) for f,h all have − ∈ F well-behaved tails (sub-Gaussian inacertain, strong, sense), then empiricalrisk minimization achieves nearly optimal accuracy and the best possible confidence for that level of accuracy. On the other hand, when either Y or (f h)(X) for f,h 0 may have heavier tails, − ∈ F ∪{ } empirical risk minimization suffers, as atypical values in the sample distort the empirical means. Indeed, in such cases significantly better learning procedures exist, as it was recently pointed out by Lugosi and Mendelson [8]. Even for well-behaved distributions, the accuracy achievable by empirical risk minimiza- tion is only acceptable if the class is small, otherwise overfitting becomes inevitable. A F common way of avoiding overfitting is by regularization. 2 In regularized risk minimization one gives priority to functions in according to some F prior belief of “simplicity”. More precisely, let Ψ be a norm defined on a vector space E containing . A small value of Ψ(f) is interpreted as simplicity and simple functions are F given priority by way of adding a penalty term to the empirical risk that is proportional to Ψ(f). In particular, for some regularization parameter λ > 0, a regularized risk minimizer selects N 1 f argmin (f(X ) Y )2+λΨ(f) . N i i ∈ N − f∈F ! i=1 X The term Ψ(f) is somebtimes called the penalty. In some applications, notably in ridge re- gression, the penalty is not a norm but rather a squared norm. However, in this paper we focus on norm penalties that encompass various notions of sparsity, for example, the lasso and slope, discussed in detail below. 1.2 Why run tournaments? The problem that motivated this work is the better understanding of the tradeoff between accuracy and confidence in regularized learning procedures. Since regularized procedures require the minimization of a functional that has the empirical mean as a component, they suffer from the same disadvantages as empirical risk minimization. These are highly visible when either class members f(X) or the target Y are heavy tailed in some sense. Thus, the application of regularized erm in heavy-tailed problems results in a suboptimal tradeoff between the accuracy with which the procedure performs and the confidence with which that accuracy can be guaranteed. A typical sample contains a large subset of atypical points and misleads the empirical minimization procedure. The idea of a median-of-means tournament was introduced in [8] to address the issue of samples that may contain many atypical points. It turns out that this learning procedure attains the optimal accuracy/confidence tradeoff under rather minimal conditions and, in particular, in heavy-tailed problems. In what follows we study the optimal tradeoff for problems usually addressed using regu- larization procedures. Example: sparse recovery It is instructive to keep in mind the important example of sparse-recovery in Rd. Consider the following standard setup: Let X be an isotropic random vector in Rd (that is, for every t Rd, E t,X 2 = t 2, where denotes the Euclidean norm in Rn). Let Y be the ∈ h i k k2 k · k2 unknown target random variable and set t∗ to be the minimizer in Rd of the risk functional t E( X,t Y)2. Thus, contains all linear functions f(x) = x,t for t Rd. In sparse → h i− F h i ∈ recovery problems one believes that t∗ is supported on at most s coordinates with respect to the standard basis in Rd—or at least it is well approximated by an s-sparse vector—, but one does not know that for certain. The lasso procedure (introduced in [15]) selects t Rd that ∈ minimizes the regularized empirical squared-loss functional b N 1 t ( t,X Y )2+λ t i i 1 → N h i− k k i=1 X for a well chosen regularization parameter λ, and t = d t is the ℓ -norm of t. k k1 i=1| i| 1 P 3 It turns out (see Lecu´e and Mendelson [5]) that if X is sufficiently sub-Gaussian then one may obtain nontrivial estimates on the performance of the lasso. The quantities c () denote i · positive constants that depend on the argument only. Theorem 1.1. (Lecu´e and Mendelson [5].) Let (X,Y) be as described above and set 0< δ < 1. Assume that there is some v Rd supported on at most s coordinates for which ∈ log(ed) t∗ v c (δ) ξ s . k − k1 ≤ 1 k kLq N r If λ = c (L,δ) ξ log(ed)/N and N c (L)slog(ed/s) then, with probability at least 1 δ, 2 k kLq ≥ 3 − the lasso estimator with regularization parameter λ satisfies that p log(ed) t t∗ c (L,δ) ξ √s k − k2 ≤ 4 k kLq · N r and b log(ed) t t∗ c (L,δ) ξ s . k − k1 ≤ 4 k kLq N r Note that the results of Theorem 1.1 hold even when t∗ is not s-sparse, but rather, when b t∗ is only approximated by an s-sparse vector. Thus, the lasso can identify the degree of sparsity of t∗ and perform (almost) as if it had been given the information of the degree of sparsity of t∗. However, if t∗,X Y happens to h i− be heavy-tailed then the confidence with which the above accuracy can be attained is rather weak: c(δ) and c(δ,L) depend polynomially on 1/δ. The purpose of this paper is to introduce a learning procedure — a regularized median of means tournament — that performs as well as regularized empirical risk minimization but undermuch more general conditions on the distribution. In fact, its performancecorresponds to that of regularized empirical risk minimization in sub-Gaussian problems, even though the problem at hand may be heavy tailed. For example, it is not surprising that the accu- racy/confidence tradeoff of the lasso deteriorates when the problem is heavy-tailed, because the lasso is based on minimization of the empirical risk. In contrast, the regularized proce- dure we introduce performs well even it we drop the sub-Gaussian assumptions and replace them by considerably weaker ones. Our benchmark is a general estimate due to Lecu´e and Mendelson [5] on the performance of regularized empirical risk minimization, and the main result of this paper (Theorem 4.3 below) parallels the main finding of [5]. The proof of Theorem 4.3 combines the techniques of [5] with those of [8]. We then use the lasso as a proof of concept and show that the procedure we propose matches the performance of the lasso in the well-behaved case even when the problem is heavy tailed. We also work out a procedure analogous to slope and present similar findings. 1.3 Two examples Before we turn to a presentation of the regularized tournament procedure and the technical machinerywerequire,letusdescribeinmoredetailthetwoapplicationsmentionedpreviously, both of which originating in sparse recovery: lasso, and slope (see, [1, 3, 14, 15]). The two are regularized empirical risk minimization procedures in Rd, and we would like to compare their performance with that of the regularized tournament procedure we introduce below. 4 As we mentioned, the lasso is a regularized empirical risk minimization procedure that uses the regularization function Ψ(t) = t , while the regularization function of slope is 1 k k defined using a set of non-increasing weights (β )d . The corresponding norm is i i=1 d Ψ(t)= β t∗ , i i i=1 X where (t∗)d denotes the non-increasing rearrangement of (t )d . Clearly, slope is a gen- i i=1 | i| i=1 eralized version of lasso, as the latter is given by the choice β = 1 for 1 i d. lasso and i ≤ ≤ slope exhibit an almost miraculous feature: although the regularization functions have little to do with sparsity, regularized empirical risk minimization preforms extremely well when the true minimizer t∗ is sparse, or if it is at least well-approximated by a sparse vector. Many results are known on the performance of lasso and slope, and almost all of them hold only when both the random vector X and the target Y have well behaved tails. One exceptionis[5]thatstudiesthecaseofapotentiallyheavy-tailedY,withtheresultingestimate outlined in Theorem 1.1. It shows that while a high degree of accuracy is possible—the same as if the problem were purely Gaussian, the confidence is rather weak. That weak confidence iscausedbytheheavytailofξ,theinsufficientsub-Gaussianmomentsoflinearforms,andthe fact that empirical minimization procedures are highly sensitive to atypical sample points. It turns out that a similar performance bound holds for slope with the choice of weights β C log(ed/i). i ≤ Theorpem 1.2. (Lecu´e and Mendelson [5].) There exist constants c ,c and c that depend 1 2 3 only on L, δ and C for which the following holds. Let Ψ(t) denote the slope norm for the weights (β )d . If there is v Rd that satisfies i i=1 ∈ supp(v) s and | | ≤ s ed Ψ(t∗ v) c ξ log , − ≤ 1k kLq√N s (cid:18) (cid:19) then for N c slog(ed/s) and with the choice of λ = c ξ /√N, one has ≥ 2 2k kLq s ed s ed Ψ(tˆ t∗) c ξ log and tˆ t∗ c ξ log − ≤ 3k kLq√N s k − k2 ≤ 3k kLqsN s (cid:18) (cid:19) (cid:18) (cid:19) with probability at least 1 δ. − We show that median-of-means versions tournament versions of lasso and slope per- form better than their regulatized empirical risk minimization counterparts, simply because median-of-means is far more robust to atypical sample points than empirical minimization. This leads to the ‘Gaussian’ accuracy, but with the optimal confidence, under the same as- sumptions. The tournament lasso Thetournament lassois justa regularized tournament procedure(definedaccurately below) for Ψ(t) = t . As the next result shows, it significantly outperforms the lasso in heavy- 1 k k tailed problems. 5 Theorem 1.3. Assume that X is an isotopic random vector and that for every t Rd and ∈ any 1 ≤ p ≤ clogd, kht,XikLp ≤ L√pkht,XikL2. Let t∗ = argmint∈RdE(Y −ht,Xi)2 and assume that Y t∗,X σ. k −h ikL4 ≤ If there is v that is s-sparse such that log(ed/s) t∗ v c (L)σ s , 1 1 k − k ≤ · N r N c (L)slog(ed/s), and 2 ≥ s ed r c (L)σ log , 3 ≥ sN s (cid:18) (cid:19) then, with probability at least b 2 r 1 2exp c (L)N min 1, , 4 − − σ ( )! (cid:18) (cid:19) b the tournament lasso produces t that satisfies log(ed/s) t t∗ c (Lb)r and t t∗ c (L)σs . 2 5 1 5 k − k ≤ k − k ≤ N r It follows that the tournament lasso yields the same accuracy as the standard lasso, b b b but with a much better confidence. Moreover, the confidence improves when we require a weaker accuracy. The crucial point is that the tournament lasso attains the optimal accu- racy/confidencetradeoff eventhoughtheproblemisveryfarfrombeingsub-Gaussian. Linear functionals X,t exhibit a sub-Gaussian moment growth only up to p logd rather than for h i ∼ any p 1, and the noise ξ = t∗,X Y is only assumed to belong to L . There is no hope 4 ≥ h i− that regulatized empirical risk minimization would come close to this accuracy/confidence tradeoff under such assumptions, but the tournament lasso does just that. The tournament slope The tournament slope is simply a regularized tournament procedure in Rd for the norm Ψ(t) = d β t∗, where β C log(ed/i). We obtain the following performance bound, i=1 i i i ≤ proved in Section 6. P p Theorem 1.4. Assume that X is an isotopic random vector and that for every t Rd and ∈ any 1 ≤ p ≤ clogd, kht,XikLp ≤ L√pkht,XikL2. Let t∗ = argmint∈RdE(Y −ht,Xi)2 and assume that Y t∗,X σ. If there is v that is s-sparse such that k −h ikL4 ≤ slog(ed/s) t∗ v c (L)σ , 1 1 k − k ≤ · √N N c (L)slog(ed/s), and 2 ≥ s ed r c (L)σ log , 3 ≥ sN s (cid:18) (cid:19) then, with probability at least b 2 r 1 2exp c (L)N min 1, , 4 − − σ ( )! (cid:18) (cid:19) b 6 the tournament slope produces t that satisfies b s ed t t∗ c (L)σ log 2 5 k − k ≤ sN s (cid:18) (cid:19) b and d s ed Ψ(t t∗)= (t t∗)∗ log(ed/i) c (L)σ log . − − i ≤ 5 √N s i=1 (cid:18) (cid:19) X p Therefore, onebcan obtain thbe same accuracy as the standard slope, but with a much better confidence. The confidence improves for a weaker accuracy, and just like the tourna- ment lasso, the tournament slope displays the optimal accuracy/confidence tradeoff, but allowing heavy-tailed distributions. 2 Assumptions, background In the general setup we study, we merely assume a rather weak fourth-moment assumption. More precisely, we work under the following conditions. Assumption 2.1. Let L (µ) be a locally compact, convex class of functions. Let Y L 2 2 F ⊂ ∈ and assume that, for some constant L > 0, for every f,h , f h L f h ; • ∈ F k − kL4 ≤ k − kL2 f∗ Y σ for a known value σ . • k − kL4 ≤ 4 4 Remark 2.1. The second condition may easily be replaced by a combination of two assump- tions: that for every f , f Y L f Y ; and that f∗ Y σ for ∈ F k − kL4 ≤ k − kL2 k − kL2 ≤ some know constant σ > 0. Also, in the case of independent additive noise, that is, when Y = f (X)+W, the assumption that f∗ Y σ may be replaced by the minimal one, 0 k − kL4 ≤ 4 that W σ. The necessary modifications to the proofs are straightforward and we do not k kL2 ≤ explore this observation further. 2.1 Complexity parameters of a class Beforedescribingtheregularizedmedianofmeantournament,weintroducefourparametersof “complexity” dependingbothontheclass andthedistributionof(X,Y). Thesecomplexity F parameters are essential in describing the optimal performance of learning procedures. For detailed discussion on themeaning androle of these parameters, we refer toMendelson [9, 10] and Lugosi and Mendelson [8]. As explained in those papers, the relevant complexity of a class has four components, reflected in four parameters λ ,λ ,r ,r that we define next. Q M E M F First we need some notation. Denote the unit ball in L (µ) by D = f : f 1 2 { k kL2 ≤ } and let S = {f : kfkL2 = 1} be the unit sphere. For h ∈ L2(µ)eand r > 0, we write D (r) = f : f h r . In a similar fashion for the norm Ψ used as a regularization h { k − kL2 ≤ } function, let = f : Ψ(f) 1 , set ρ = f :Ψ(f) ρ and (ρ) = f : Ψ(f h) ρ . h B { ≤ } B { ≤ } B { − ≤ } Define the star-shaped hull of the class centred in h by F star( ,h) = λf +(1 λ)h : 0 λ 1, f . F { − ≤ ≤ ∈ F} 7 In what follows we make two important modifications to the definitions of the complexity parameters used in [9, 10, 8]. First, just like in the above-mentioned articles, we are inter- ested in “localized” classes. However, because regularized procedures are affected by two norms, Ψ and L (µ), the localization has to be with respect to both of them. Therefore, the 2 “localization” of , centred in h and of radii ρ,r > 0 is defined by F = star( h,0) (ρ rD) . h,ρ,r F F − ∩ B∩ Clearly, if is convex then for any h , star( h,0) = h and F ∈F F − F − = f h: f , Ψ(f h) ρ, f h r = ( h) (ρ rD) . Fh,ρ,r { − ∈ F − ≤ k − kL2 ≤ } F − ∩ B∩ Thesecondminormodificationisthateachcomplexityparameterisassociatedwiththe‘worse case’ centre h ′ for some fixed ′ , and not necessarily with the whole of . ∈ F F ⊂ F F Two of the four parameters are defined using the notion of packing numbers. Definition 2.2. Given a set H L (µ) and ε > 0, denote the ε-packing number of H by 2 ⊂ (H,εD). In other words, (H,εD) is the maximal cardinality of a subset h ,...,h 1 m M M { }⊂ H, for which h h ε for every i = j. k i − jkL2 ≥ 6 Thefirstrelevant parameter λ is definedas follows with appropriate numerical constants Q κ and η: Definition 2.3. Fix ρ > 0 and h . For κ,η > 0, set ∈F λ (κ,η,h,ρ) = inf r : log ( ,ηrD) κ2N . (2.1) Q h,ρ,r { M F ≤ } For ′ let F ⊂ F λ (κ,η,ρ) = sup λ (κ,η,h,ρ) . Q Q h∈F′ While κ and η are adjustable parameters, we are mainly interested in the behaviour of λ Q as a function of ρ. The way one selects ρ is clarified later. The next parameter, denoted by λ , is also defined in terms of the packing numbers of M the localization , though at a different scaling than λ . h,ρ,r Q F Definition 2.4. Fix h and ρ > 0. Let κ > 0, 0 < η < 1, and define ∈ F λ (κ,η,h,ρ) = inf r : log ( ,ηrD) κ2Nr2 . (2.2) M h,ρ,r { M F ≤ } Also, for ′ let F ⊂ F λ (κ,η,ρ) = sup λ (κ,η,h,ρ) . M M h∈F′ For the remaining two complexity parameters, let (ε )N be independent, symmetric i i=1 1,1 -valued random variables that are independent of (X ,Y )N . {− } i i i=1 Definition 2.5. Fix h and ρ > 0. For κ > 0 let ∈ F N 1 r (κ,h,ρ) = inf r : E sup ε u(X ) κ√Nr , (2.3) E i i ( u∈Fh,ρ,r(cid:12)(cid:12)√N Xi=1 (cid:12)(cid:12) ≤ ) (cid:12) (cid:12) and for ′ set r (κ,ρ) = sup r (κ,h,(cid:12)ρ). (cid:12) F ⊂ F E h∈F′ E (cid:12) (cid:12) 8 Definition 2.6. Fix h and ρ > 0. For κ > 0, set r (κ,h,ρ) to be M ∈ F N 1 r (κ,h,ρ) = inf r :E sup ε u(X ) (h(X ) Y ) κ√Nr2 . (2.4) M i i i i ( u∈Fh,ρ,r(cid:12)(cid:12)√N Xi=1 · − (cid:12)(cid:12) ≤ ) (cid:12) (cid:12) (cid:12) (cid:12) For σ >0 put (σ) = f ′ : f(X)(cid:12) Y σ and letr (κ,σ,ρ)(cid:12) = sup r (κ,h,ρ). FY { ∈F k − kL2 ≤ } M h∈F(σ) M Y Finally, suppose that the distribution of (X,Y) is such that Y f∗(X) σ for a e k − kL4 ≤ 4 known constant σ > 0. The “complexity” of relative to centres in ′ and radius ρ is F F r∗( , ′,ρ) = max λ (c ,c ,ρ),λ (c /σ ,c ,ρ),r (c ,ρ),r (c ,σ ,ρ) . (2.5) Q 1 2 M 1 4 2 E 1 M 1 4 F F { } Herec ,c areappropriatepositivenumericalconstants. (“Appropriate”meansthatr∗( , ′,ρ) 1 2 e F F satisfies Propositions 5.1, 5.5 and 5.7 below). The existence of such constants is proved in [8] when ′ = , i.e., when any function in is a ‘legal choice’ of a centre, and under F F F Assumption 2.1. In that case, the constants depend only on the value of L. When and ′ are clear from the context, we simply write r∗(ρ) for r∗( , ′,ρ). F F F F Example: linear regression with ℓ regularization 1 To give the reader some feeling of the nature of r∗(ρ) and the type of estimates that are needed for its identification, let be a vector space. Hence, for any h , h = and F ∈ F F − F thus = ρ rD. h,ρ,r F F ∩ B∩ In particular, is independent of the choice of centre h and there is no ‘diversity’ in the h,ρ,r F localized sets one encounters. Also, if is convex and centrally symmetric (i.e., if f then F ∈F f ), the richest localized set is essentially when the centre is h = 0. Indeed, in such a − ∈ F case, = ( h) (ρ rD) 2 (ρ rD) , h,ρ,r F F − ∩ B∩ ⊂ F ∩ B∩ which is a localization of 2 for h = 0. Thus, when is centrally symmetric, it suffices to F F study its localizations associated with a single centre – h = 0. In the case of the lasso = t, :t Rd and therefore is a vector space consisting F {h ·i ∈ } F of linear functionals. From here on we identify the linear functional t, with t Rd, and h ·i ∈ thus identify with Rd. The regularization function used in the lasso is Ψ(t) = t and 1 F k k t∗ =argmin E( t,X Y)2. Using out∈rRndotahtion,iD− = t Rd : E X,t 2 1 , and if X is isotropic, then D = Bd, { ∈ h i ≤ } 2 the Euclidean unit ball in Rd (though, in general, D is an ellipsoid in Rd). Hence, all the localizations associated with the lasso are = ρBd rD , Ft,ρ,r 1 ∩ where Bd = t Rd : t 1 . Moreover, if X is isotropic then 1 { ∈ k k1 ≤ } = ρBd rBd . Ft,ρ,r 1 ∩ 2 It follows that if one is to identify λ and λ , it suffices to obtain estimates on the Euclidean Q M packing numbers log (ρBd rBd,ηrBd) M 1 ∩ 2 2 9 and, in order to bound r and r it suffices to control the expectations of the empirical E M processes N 1 E sup ε u,X i i u∈ρB1d∩rB2d(cid:12)(cid:12)√N Xi=1 h i(cid:12)(cid:12) (cid:12) (cid:12) and (cid:12) (cid:12) (cid:12) N (cid:12) 1 E sup ε ξ u,X i i i u∈ρB1d∩rB2d(cid:12)(cid:12)√N Xi=1 h i(cid:12)(cid:12) (cid:12) (cid:12) respectively, where in the latter, the multi(cid:12)pliers are ξ = t∗,X(cid:12) Y . (cid:12) i h (cid:12)ii− i The question of obtaining sharp bounds on these complexity parameters, even for the lasso, is nontrivial, and a rich mathematical theory has been developed just for that goal (see, e.g., the books [7, 16, 2], and for more recent results, [13]). We study the parameters in question for ℓ -regularized linear regression (lasso) and also for the more general well- 1 studied procedure slope. However, since it is not the focus of this article we do not explore theissue of boundingthe complexity parameters involved beyond those two cases. Let us just mention that the question of finding data-dependent bounds on these quantities, or better still, computationally feasible data dependent bounds, is wide open. 3 Regularized risk minimization Beforewecanproperlydescribetheregularizedtournamentprocedure,letusexplainthepath one may take when studying the standard regularized empirical risk minimization (rerm), which is the basis for the analysis of the regularized tournament procedure we introduce in what follows. rerm selects a minimizer f of the regularized empirical risk functional N b 1 argmin (f(X ) Y )2+λΨ(f) . f∈F N i − i ! i=1 X A typical bound on the performance of rerm for a well-chosen regularization parameter λ involves two estimates: first on the L distance f f∗ and second, on the Ψ-distance 2 k − kL2 Ψ(f f∗). The two estimates one can guarantee depend on structure of , the choice of Ψ − F and the tail behaviour of the functions involved. b Clearly, a minimizer of the regularized empirical risk functional also minimizes N 1 (f(X ) Y )2 (f∗(X ) Y )2+λ(Ψ(f) Ψ(f∗)) i i i i N − − − − i=1 X N N 1 2 = (f f∗)2(X )+ (f∗(X ) Y )(f f∗)(X )+λ(Ψ(f) Ψ(f∗)) i i i i N − N − − − i=1 i=1 X X = Q +M +λ(Ψ(f) Ψ(f∗)), (3.1) f f − where N 1 Q = (f f∗)2(X ) f i N − i=1 X 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.