Stein COnsistent Risk Estimator (SCORE) for hard thresholding

Charles-Alban Deledalle, Institut de Mathématiques de Bordeaux, CNRS-Université Bordeaux 1, Bordeaux, France. Email: [email protected]
Gabriel Peyré, CEREMADE, CNRS-Université Paris Dauphine, Paris, France. Email: [email protected]
Jalal Fadili, GREYC, CNRS-ENSICAEN, Caen, France. Email: [email protected]

Abstract—In this work, we construct a risk estimator for hard thresholding that can serve as a basis for solving the difficult task of automatically selecting the threshold. As hard thresholding is not even continuous, Stein's lemma cannot be used to obtain an unbiased estimator of the degrees of freedom, and hence of the risk. We prove that, under a mild condition, our estimator of the degrees of freedom, although biased, is consistent. Numerical evidence shows that our estimator outperforms another biased risk estimator proposed in [1].

I. INTRODUCTION

We observe a realisation $y \in \mathbb{R}^P$ of the normal random vector $Y = x_0 + W$, $W \sim \mathcal{N}(0, \sigma^2 \mathrm{Id}_P)$. Given an estimator $y \mapsto x(y,\lambda)$ of $x_0$ evaluated at $y$ and parameterized by $\lambda$, the associated Degrees Of Freedom (DOF) is defined as [2]

$$\mathrm{df}_x(x_0,\lambda) \triangleq \sum_{i=1}^{P} \frac{\mathrm{cov}(Y_i, x(Y,\lambda)_i)}{\sigma^2}. \qquad (1)$$

The DOF plays an important role in model/parameter selection. For instance, define the criterion

$$\|Y - x(Y,\lambda)\|^2 - P\sigma^2 + 2\sigma^2\, \widehat{\mathrm{df}}_x(Y,\lambda). \qquad (2)$$

In the rest, we denote by $\mathrm{div}$ the divergence operator. If $x(\cdot,\lambda)$ is weakly differentiable w.r.t. its first argument with an essentially bounded gradient, Stein's lemma [3] implies that $\widehat{\mathrm{df}}_x(Y,\lambda) = \mathrm{div}(x(Y,\lambda))$ and (2) (the SURE in this case) are respectively unbiased estimates of $\mathrm{df}_x(x_0,\lambda)$ and of the risk $\mathbb{E}_W \|x(Y,\lambda) - x_0\|^2$. In practice, (2) relies solely on the realisation $y$, which makes it useful for selecting the $\lambda$ minimizing (2).

In this paper, we focus on Hard Thresholding (HT)

$$y \mapsto \mathrm{HT}(y,\lambda)_i = \begin{cases} 0 & \text{if } |y_i| < \lambda, \\ y_i & \text{otherwise}. \end{cases} \qquad (3)$$

HT is not even continuous, so Stein's lemma does not apply, and neither $\mathrm{df}_{\mathrm{HT}}(x_0,\lambda)$ nor the risk can be unbiasedly estimated [1]. To overcome this difficulty, we build an estimator that, although biased, turns out to enjoy good asymptotic properties. In turn, this allows efficient selection of the threshold $\lambda$.

II. STEIN CONSISTENT RISK ESTIMATOR (SCORE)

Remark that HT can be written as $\mathrm{HT}(y,\lambda) = \mathrm{ST}(y,\lambda) + D(y,\lambda)$, where

$$\mathrm{ST}(y,\lambda)_i = \begin{cases} y_i + \lambda & \text{if } y_i < -\lambda, \\ 0 & \text{if } -\lambda \le y_i < +\lambda, \\ y_i - \lambda & \text{otherwise}, \end{cases} \qquad D(y,\lambda)_i = \begin{cases} -\lambda & \text{if } y_i < -\lambda, \\ 0 & \text{if } -\lambda \le y_i < +\lambda, \\ +\lambda & \text{otherwise}, \end{cases}$$

and $y \mapsto \mathrm{ST}(y,\lambda)$ is the soft thresholding operator. Soft thresholding is a Lipschitz continuous function of $y$ with an essentially bounded gradient; therefore, appealing to Stein's lemma, an unbiased estimator of its DOF is given by $\widehat{\mathrm{df}}_{\mathrm{ST}}(Y,\lambda) = \mathrm{div}\,\mathrm{ST}(Y,\lambda)$. This DOF estimate at a realization $y$ is known to be equal to $\#\{|y| > \lambda\}$, i.e., the number of entries of $y$ whose magnitude exceeds $\lambda$ (see [4], [5]). The mapping $y \mapsto D(y,\lambda)$ is piecewise constant with discontinuities at $\pm\lambda$, so Stein's lemma does not apply to estimate the DOF of hard thresholding.
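For concreteness, here is a minimal NumPy sketch of the decomposition and of the soft-thresholding DOF estimate. The function names (hard_threshold, soft_threshold, jump_part) are ours, not from the paper, and we place the jump of $D$ at $|y_i| \ge \lambda$ so that the identity $\mathrm{HT} = \mathrm{ST} + D$ holds exactly, including at the (probability-zero) boundary.

import numpy as np

def hard_threshold(y, lam):
    # (3): zero out entries with |y_i| < lam, keep the rest
    return np.where(np.abs(y) < lam, 0.0, y)

def soft_threshold(y, lam):
    # ST: shrink every entry toward zero by lam
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

def jump_part(y, lam):
    # D: piecewise-constant jump of height +/- lam outside (-lam, lam);
    # using |y_i| >= lam makes HT = ST + D hold exactly everywhere
    return lam * np.sign(y) * (np.abs(y) >= lam)

rng = np.random.default_rng(0)
y = rng.normal(size=1000)
lam = 0.5
assert np.allclose(hard_threshold(y, lam),
                   soft_threshold(y, lam) + jump_part(y, lam))
# Unbiased DOF estimate of soft thresholding (Stein's lemma): #{|y| > lam}
df_st = int(np.sum(np.abs(y) > lam))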
To circumvent this difficulty, we instead propose an estimator of the DOF of a smoothed version, replacing $D(\cdot,\lambda)$ by $G_h \star D(\cdot,\lambda)$, where $G_h$ is a Gaussian kernel of bandwidth $h > 0$ and $\star$ is the convolution operator. In this case, $G_h \star D(\cdot,\lambda)$ is obviously $C^\infty$, and its DOF can be unbiasedly estimated as $\mathrm{div}(G_h \star D(\cdot,\lambda)(Y))$. To reduce bias (this will be made clear from the proof), we have furthermore introduced a multiplicative constant, $\sqrt{\sigma^2+h^2}/\sigma$, leading to the following DOF formula:

$$\widehat{\mathrm{df}}_{\mathrm{HT}}(y,\lambda,h) = \#\{|y| > \lambda\} + \frac{\lambda\sqrt{\sigma^2+h^2}}{\sqrt{2\pi}\,\sigma h} \sum_{i=1}^{P} \left[ \exp\left(-\frac{(y_i+\lambda)^2}{2h^2}\right) + \exp\left(-\frac{(y_i-\lambda)^2}{2h^2}\right) \right]. \qquad (4)$$

We now give our two main results, proved in Section IV.

Algorithm: Risk estimation for Hard Thresholding
  Inputs: observation $y \in \mathbb{R}^P$, tested range of thresholds $\lambda > 0$
  Parameters: noise variance $\sigma^2 > 0$
  Output: solution $x^\star$
  Initialize $h \leftarrow h(P)$
  for all $\lambda$ in the tested range do
    Compute $\hat{x} \leftarrow \mathrm{HT}(y,\lambda)$ using (3)
    Compute $\widehat{\mathrm{df}}_{\mathrm{HT}}(y,\lambda,h)$ using (4)
    Compute SCORE at $y$ using (2)
  end for
  return $x^\star$, the $\hat{x}$ that provides the smallest SCORE

Fig. 1. Pseudo-algorithm for HT with SCORE-based threshold optimization.

Fig. 2. Risk and its SCORE estimate with respect to the threshold $\lambda$. [Plot: quadratic cost versus threshold $\lambda$ for the true risk, SCORE, and Jansen's estimator [1].]

Theorem 1: Let $Y = x_0 + W$ for $W \sim \mathcal{N}(0, \sigma^2 \mathrm{Id}_P)$. Take $h(P)$ such that $\lim_{P\to\infty} h(P) = 0$ and $\lim_{P\to\infty} P^{-1} h(P)^{-1} = 0$. Then

$$\operatorname*{plim}_{P\to\infty} \frac{1}{P}\left( \widehat{\mathrm{df}}_{\mathrm{HT}}(Y,\lambda,h(P)) - \mathrm{df}_{\mathrm{HT}}(x_0,\lambda) \right) = 0.$$

In particular,
1. $\lim_{P\to\infty} \mathbb{E}_W\big[\frac{1}{P}\widehat{\mathrm{df}}_{\mathrm{HT}}(Y,\lambda,h(P))\big] = \lim_{P\to\infty} \frac{1}{P}\mathrm{df}_{\mathrm{HT}}(x_0,\lambda)$, and
2. $\lim_{P\to\infty} \mathbb{V}_W\big[\frac{1}{P}\widehat{\mathrm{df}}_{\mathrm{HT}}(Y,\lambda,h(P))\big] = 0$,
where $\mathbb{V}_W$ is the variance w.r.t. $W$.

We now turn to a straightforward corollary of this theorem.

Corollary 1: Let $Y = x_0 + W$ for $W \sim \mathcal{N}(0, \sigma^2 \mathrm{Id}_P)$, and assume that $\|x_0\|_4 = o(P^{1/2})$. Take $h(P)$ such that $\lim_{P\to\infty} h(P) = 0$ and $\lim_{P\to\infty} P^{-1} h(P)^{-1} = 0$. Then the Stein COnsistent Risk Estimator (SCORE) evaluated at a realization $y$ of $Y$,

$$\mathrm{SCORE}_{\mathrm{HT}}(y,\lambda,h(P)) = \sum_{i=1}^{P} \left( (y_i^2 - \sigma^2) + \mathbb{I}(|y_i| > \lambda)(2\sigma^2 - y_i^2) + \frac{2\sigma\lambda\sqrt{\sigma^2+h(P)^2}}{\sqrt{2\pi}\,h(P)} \left[ \exp\left(-\frac{(y_i+\lambda)^2}{2h(P)^2}\right) + \exp\left(-\frac{(y_i-\lambda)^2}{2h(P)^2}\right) \right] \right),$$

is such that

$$\operatorname*{plim}_{P\to\infty} \frac{1}{P}\left( \mathrm{SCORE}_{\mathrm{HT}}(Y,\lambda,h(P)) - \mathbb{E}_W\|\mathrm{HT}(Y,\lambda) - x_0\|^2 \right) = 0.$$

Fig. 1 summarizes the pseudo-code for applying SCORE to automatically find the optimal threshold $\lambda$, i.e., the one minimizing SCORE over a predefined (non-empty) range.

III. EXPERIMENTS AND CONCLUSIONS

Fig. 2 shows the evolution of the true risk, the SCORE, and the risk estimator of [1] as a function of $\lambda$, where $x_0$ is a compressible vector of length $P = 2 \times 10^5$ whose sorted magnitudes decay as $|x_0|_{(i)} = 1/i^\gamma$ for $\gamma > 0$. We have chosen $\sigma$ such that the SNR of $y$ is about 5.65 dB, and $h(P) = 6\sigma/P^{1/3} \approx \sigma/10$. The threshold selected by SCORE is found around the minimizer of the true risk.

Future work will concern a deeper investigation of the choice of $h(P)$, comparison with other biased risk estimators, and extensions to other non-continuous estimators and inverse problems.

IV. PROOF

We first derive a closed-form expression for the DOF of HT.

Lemma 1: Let $Y = x_0 + W$ where $W \sim \mathcal{N}(0, \sigma^2 \mathrm{Id}_P)$. The DOF of HT is given by

$$\mathrm{df}_{\mathrm{HT}}(x_0,\lambda) = \left( P - \frac{1}{2}\sum_{i=1}^{P} \left[ \mathrm{erf}\left(\frac{(x_0)_i+\lambda}{\sqrt{2}\sigma}\right) - \mathrm{erf}\left(\frac{(x_0)_i-\lambda}{\sqrt{2}\sigma}\right) \right] \right) + \frac{\lambda}{\sqrt{2\pi}\,\sigma} \sum_{i=1}^{P} \left[ \exp\left(-\frac{((x_0)_i+\lambda)^2}{2\sigma^2}\right) + \exp\left(-\frac{((x_0)_i-\lambda)^2}{2\sigma^2}\right) \right]. \qquad (5)$$

Proof: According to [1], we have

$$\mathrm{df}_{\mathrm{HT}}(x_0,\lambda) = \mathbb{E}_W[\#\{|Y| > \lambda\}] + \frac{\lambda}{\sigma^2}\, \mathbb{E}_W\left[ \sum_{i=1}^{P} \mathrm{sign}(Y_i)\, W_i\, \mathbb{I}(|Y_i| > \lambda) \right],$$

where $\mathrm{sign}(\cdot)$ is the sign function and $\mathbb{I}(\omega)$ is the indicator of an event $\omega$. Integrating w.r.t. the zero-mean Gaussian density of variance $\sigma^2$ yields the closed form of the expectation terms.
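To make the procedure of Fig. 1 concrete, the following NumPy sketch implements the DOF estimate (4) and the SCORE criterion (2) specialized to HT, and selects the threshold by grid search. The helper names (df_ht, score) and the toy signal are ours; the setup only loosely mimics the experiment of Section III.

import numpy as np

def df_ht(y, lam, h, sigma):
    # DOF estimate (4) of hard thresholding
    coef = lam * np.sqrt(sigma**2 + h**2) / (np.sqrt(2 * np.pi) * sigma * h)
    bumps = np.exp(-(y + lam)**2 / (2 * h**2)) + np.exp(-(y - lam)**2 / (2 * h**2))
    return np.sum(np.abs(y) > lam) + coef * np.sum(bumps)

def score(y, lam, h, sigma):
    # (2) for HT: ||y - HT(y, lam)||^2 - P*sigma^2 + 2*sigma^2*df_ht
    residual = np.sum(y[np.abs(y) < lam] ** 2)
    return residual - y.size * sigma**2 + 2 * sigma**2 * df_ht(y, lam, h, sigma)

# Toy example with h(P) = 6*sigma/P^(1/3), as in Section III
rng = np.random.default_rng(0)
P, sigma, gamma = 10_000, 1.0, 1.0
x0 = 10.0 / np.arange(1, P + 1) ** gamma   # compressible: magnitudes decay as 1/i^gamma
y = x0 + sigma * rng.normal(size=P)
h = 6 * sigma / P ** (1 / 3)
lams = np.linspace(0.5, 5.0, 50)
lam_star = lams[np.argmin([score(y, lam, h, sigma) for lam in lams])]
x_star = np.where(np.abs(y) < lam_star, 0.0, y)   # HT at the SCORE-selected threshold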
We now turn to the proof of our theorem.

Proof: The first part of (5) corresponds to $\mathbb{E}_W[\#\{|Y| > \lambda\}]$ and can thus be unbiasedly estimated from an observation $y$ by $\#\{|y| > \lambda\}$. Let $A$ be the function defined, for $(t,a) \in \mathbb{R}^2$, by

$$A(t,a) = \frac{\sqrt{\sigma^2+h^2}}{h} \exp\left(-\frac{(t-a)^2}{2h^2}\right).$$

By classical convolution properties of Gaussians, we have

$$\mathbb{E}_{W_i}[A(Y_i,a)] = \exp\left(-\frac{((x_0)_i - a)^2}{2(\sigma^2+h^2)}\right), \quad \text{and}$$

$$\mathbb{V}_{W_i}[A(Y_i,a)] = \frac{\sigma^2+h^2}{h\sqrt{2\sigma^2+h^2}} \exp\left(-\frac{((x_0)_i - a)^2}{2\sigma^2+h^2}\right) - \exp\left(-\frac{((x_0)_i - a)^2}{\sigma^2+h^2}\right).$$

Taking $h = h(P)$ and assuming $\lim_{P\to\infty} h(P) = 0$ shows that

$$\lim_{P\to\infty} \mathbb{E}_W\left[\frac{1}{P}\sum_{i=1}^{P} A(Y_i,a)\right] = \lim_{P\to\infty} \frac{1}{P}\sum_{i=1}^{P} \mathbb{E}_W[A(Y_i,a)] = \lim_{P\to\infty} \frac{1}{P}\sum_{i=1}^{P} \exp\left(-\frac{((x_0)_i - a)^2}{2\sigma^2}\right).$$

Since from (4) we have

$$\widehat{\mathrm{df}}_{\mathrm{HT}}(Y,\lambda,h) = \#\{|Y| > \lambda\} + \frac{\lambda}{\sqrt{2\pi}\,\sigma} \sum_{i=1}^{P} \left[ A(Y_i,\lambda) + A(Y_i,-\lambda) \right],$$

statement 1 follows using Lemma 1. For statement 2, the Cauchy-Schwarz inequality implies that

$$\mathbb{V}_W\left[\frac{1}{P}\widehat{\mathrm{df}}_{\mathrm{HT}}(Y,\lambda,h)\right]^{1/2} \le \frac{\mathbb{V}_W[\#\{|Y| > \lambda\}]^{1/2}}{P} + \frac{\lambda}{\sqrt{2\pi}\,\sigma P} \sum_{i=1}^{P} \left[ \mathbb{V}_W[A(Y_i,\lambda)]^{1/2} + \mathbb{V}_W[A(Y_i,-\lambda)]^{1/2} \right].$$

Now $\#\{|Y| > \lambda\} = \sum_{i=1}^{P} \mathbb{I}(|Y_i| > \lambda)$ is a sum of independent Bernoulli variables with parameters $1 - p_i$, where

$$p_i = \frac{1}{2}\left[ \mathrm{erf}\left(\frac{(x_0)_i+\lambda}{\sqrt{2}\sigma}\right) - \mathrm{erf}\left(\frac{(x_0)_i-\lambda}{\sqrt{2}\sigma}\right) \right],$$

so its variance is $\sum_{i=1}^{P} p_i(1-p_i) \le P/4$. It follows that

$$\lim_{P\to\infty} \mathbb{V}_W\left[\frac{1}{P}\#\{|Y| > \lambda\}\right] = 0.$$

Taking again $h = h(P)$ with $\lim_{P\to\infty} h(P) = 0$ and $\lim_{P\to\infty} P^{-1}h(P)^{-1} = 0$ yields

$$\lim_{P\to\infty} \mathbb{V}_W\left[\frac{1}{P}\sum_{i=1}^{P} A(Y_i,a)\right] = \lim_{P\to\infty} \frac{1}{P^2}\sum_{i=1}^{P} \mathbb{V}_W[A(Y_i,a)] = 0,$$

where we used the fact that the random variables $Y_i$ are uncorrelated. This establishes statement 2. Consistency (i.e., convergence in probability) follows from traditional arguments by invoking Chebyshev's inequality and using the asymptotic unbiasedness and vanishing variance established in statements 1 and 2.

Let us now prove the corollary.

Proof: By assumption, $\lim_{P\to\infty} h(P) = 0$. Thus, by virtue of statement 1 of Theorem 1, specializing (2) to the case of HT gives

$$\begin{aligned}
\lim_{P\to\infty} \mathbb{E}_W\left[\frac{1}{P}\mathrm{SCORE}_{\mathrm{HT}}(Y,\lambda,h(P))\right]
&= \lim_{P\to\infty} \frac{1}{P}\mathbb{E}_W\left[ \|Y - \mathrm{HT}(Y,\lambda)\|^2 - P\sigma^2 + 2\sigma^2\, \widehat{\mathrm{df}}_{\mathrm{HT}}(Y,\lambda,h(P)) \right] \\
&= \lim_{P\to\infty} \frac{1}{P}\mathbb{E}_W\|Y - \mathrm{HT}(Y,\lambda)\|^2 - \sigma^2 + 2\sigma^2 \lim_{P\to\infty} \frac{1}{P}\mathbb{E}_W\left[\widehat{\mathrm{df}}_{\mathrm{HT}}(Y,\lambda,h(P))\right] \\
&= \lim_{P\to\infty} \frac{1}{P}\mathbb{E}_W\|Y - \mathrm{HT}(Y,\lambda)\|^2 - \sigma^2 + 2\sigma^2 \lim_{P\to\infty} \frac{1}{P}\mathrm{df}_{\mathrm{HT}}(x_0,\lambda) \\
&= \lim_{P\to\infty} \frac{1}{P}\mathbb{E}_W\|\mathrm{HT}(Y,\lambda) - x_0\|^2,
\end{aligned}$$

where we used the fact that all the limits of the expectations are finite. The Cauchy-Schwarz inequality again yields

$$\begin{aligned}
\mathbb{V}_W\left[\frac{1}{P}\mathrm{SCORE}_{\mathrm{HT}}(Y,\lambda,h(P))\right]^{1/2}
&\le \mathbb{V}_W\left[\frac{1}{P}\|Y - \mathrm{HT}(Y,\lambda)\|^2\right]^{1/2} + 2\sigma^2\, \mathbb{V}_W\left[\frac{1}{P}\widehat{\mathrm{df}}_{\mathrm{HT}}(Y,\lambda,h(P))\right]^{1/2} \\
&= \frac{1}{P}\left( \sum_{i=1}^{P} \mathbb{V}_{W_i}\left[ |Y_i|^2\, \mathbb{I}(|Y_i| < \lambda) \right] \right)^{1/2} + 2\sigma^2\, \mathbb{V}_W\left[\frac{1}{P}\widehat{\mathrm{df}}_{\mathrm{HT}}(Y,\lambda,h(P))\right]^{1/2} \\
&\le \frac{1}{P}\left( \sum_{i=1}^{P} \mathbb{E}_{W_i}|Y_i|^4 \right)^{1/2} + 2\sigma^2\, \mathbb{V}_W\left[\frac{1}{P}\widehat{\mathrm{df}}_{\mathrm{HT}}(Y,\lambda,h(P))\right]^{1/2} \\
&= \left( \frac{\|x_0\|_4^4}{P^2} + \frac{6\sigma^2\|x_0\|_2^2}{P^2} + \frac{3\sigma^4}{P} \right)^{1/2} + 2\sigma^2\, \mathbb{V}_W\left[\frac{1}{P}\widehat{\mathrm{df}}_{\mathrm{HT}}(Y,\lambda,h(P))\right]^{1/2} \\
&\le \left( \left(\frac{\|x_0\|_4^2}{P}\right)^2 + \frac{6\sigma^2\|x_0\|_4^2}{P^{3/2}} + \frac{3\sigma^4}{P} \right)^{1/2} + 2\sigma^2\, \mathbb{V}_W\left[\frac{1}{P}\widehat{\mathrm{df}}_{\mathrm{HT}}(Y,\lambda,h(P))\right]^{1/2},
\end{aligned}$$

where the last inequality uses $\|x_0\|_2^2 \le \sqrt{P}\,\|x_0\|_4^2$ (Cauchy-Schwarz). As by assumption $\lim_{P\to\infty} P^{-1}h(P)^{-1} = 0$ and $\|x_0\|_4 = o(P^{1/2})$, the variance of SCORE vanishes as $P \to \infty$. We conclude using the same convergence in probability arguments as at the end of the proof of Theorem 1.

REFERENCES

[1] M. Jansen, "Information criteria for variable selection under sparsity," Technical report, ULB, 2011.
[2] B. Efron, "How biased is the apparent error rate of a prediction rule?" Journal of the American Statistical Association, vol. 81, no. 394, pp. 461–470, 1986.
[3] C. Stein, "Estimation of the mean of a multivariate normal distribution," The Annals of Statistics, vol. 9, no. 6, pp. 1135–1151, 1981.
[4] D. Donoho and I. Johnstone, "Adapting to unknown smoothness via wavelet shrinkage," Journal of the American Statistical Association, vol. 90, no. 432, pp. 1200–1224, 1995.
[5] H. Zou, T. Hastie, and R. Tibshirani, "On the 'degrees of freedom' of the lasso," The Annals of Statistics, vol. 35, no. 5, pp. 2173–2192, 2007.