The Annals of Statistics 2010, Vol. 38, No. 1, 482–511
DOI: 10.1214/09-AOS726
© Institute of Mathematical Statistics, 2010
arXiv:1001.2089v1 [math.ST] 13 Jan 2010

EMPIRICAL RISK MINIMIZATION IN INVERSE PROBLEMS¹

By Jussi Klemelä and Enno Mammen
University of Oulu and University of Mannheim

We study estimation of a multivariate function $f:\mathbf{R}^d\to\mathbf{R}$ when the observations are available from the function $Af$, where $A$ is a known linear operator. Both the Gaussian white noise model and density estimation are studied. We define an $L_2$-empirical risk functional which is used to define a δ-net minimizer and a dense empirical risk minimizer. Upper bounds for the mean integrated squared error of the estimators are given. The upper bounds show how the difficulty of the estimation depends on the operator through the norm of the adjoint of the inverse of the operator and on the underlying function class through the entropy of the class. Corresponding lower bounds are also derived. As examples, we consider convolution operators and the Radon transform. In these examples, the estimators achieve the optimal rates of convergence. Furthermore, a new type of oracle inequality is given for inverse problems in additive models.

Received June 2008; revised June 2009.
¹Supported by Deutsche Forschungsgemeinschaft under Project MA1026/8-1.
AMS 2000 subject classification. 62G07.
Key words and phrases. Deconvolution, empirical risk minimization, multivariate density estimation, nonparametric function estimation, Radon transform, tomography.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2010, Vol. 38, No. 1, 482–511. This reprint differs from the original in pagination and typographic detail.

1. Introduction. We consider estimation of a function $f:\mathbf{R}^d\to\mathbf{R}$ when a linear transform $Af$ of the function is observed under stochastic noise. We consider both the Gaussian white noise model and density estimation with i.i.d. observations. We study two estimators: a δ-net estimator which minimizes the $L_2$-empirical risk over a minimal δ-net of a function class and a dense empirical risk minimizer which minimizes the empirical risk over the whole function class without restricting the minimization over a δ-net. We call this estimator a "dense minimizer" because it is defined as a minimizer over a possibly uncountable function class. The δ-net estimator is more universal: it may also be applied for nonsmooth functions and for severely ill-posed operators. On the other hand, the dense empirical minimizer is expected to work only for relatively smooth cases (the entropy integral has to converge). However, because the minimization in the calculation of this estimator is not restricted to a δ-net, we have available a larger toolbox of algorithms for finding (an approximation of) the minimizer of the empirical risk.

Let $(\mathbf{Y},\mathcal{Y},\nu)$ be a Borel space and let $A:L_2(\mathbf{R}^d)\to L_2(\mathbf{Y})$ be a linear operator, where $L_2(\mathbf{R}^d)$ is the space of square integrable functions $f:\mathbf{R}^d\to\mathbf{R}$ (with respect to the Lebesgue measure) and $L_2(\mathbf{Y})$ is the space of square integrable functions $g:\mathbf{Y}\to\mathbf{R}$ (with respect to measure $\nu$). In the density estimation model, we have i.i.d. observations

$$Y_1,\ldots,Y_n\in\mathbf{Y} \tag{1}$$

with common density function $Af:\mathbf{Y}\to\mathbf{R}$, where $f:\mathbf{R}^d\to\mathbf{R}$ is a density function which we want to estimate. In the Gaussian white noise model, the observation is a realization of the process

$$dY_n(y)=(Af)(y)\,dy+n^{-1/2}\,dW(y),\qquad y\in\mathbf{Y}, \tag{2}$$

where $W(y)$ is the Brownian process on $\mathbf{Y}$, that is, for $h_1,h_2\in L_2(\mathbf{Y})$, the random vector $(\int_{\mathbf{Y}}h_1\,dW,\int_{\mathbf{Y}}h_2\,dW)$ is a two-dimensional Gaussian random vector with zero mean, marginal variances $\|h_1\|_2^2$, $\|h_2\|_2^2$ and covariance $\int_{\mathbf{Y}}h_1h_2\,d\nu$. (In our examples, $\mathbf{Y}$ is either the Euclidean space or the product of the real half-line with the unit sphere so that the existence of the Brownian process is guaranteed.)
We want to estimate the signal function $f:\mathbf{R}^d\to\mathbf{R}$. The Gaussian white noise model is very useful for presenting the basic mathematical ideas in a transparent way. For the δ-net estimator, the treatment is almost identical for the Gaussian white noise model and for the density estimation, but when we consider the dense empirical risk minimization, then, in the density estimation model, we need to use bracketing numbers and empirical entropies with bracketing, instead of the usual $L_2$-entropies. Our results for the Gaussian white noise model can also serve as a first step for obtaining analogous results for inverse problems in regression or in other statistical models.

The $L_2$-empirical risk is defined by

$$\gamma_n(g)=\begin{cases}-2\displaystyle\int_{\mathbf{Y}}(Qg)\,dY_n+\|g\|_2^2, & \text{Gaussian white noise},\\[2mm] -2n^{-1}\displaystyle\sum_{i=1}^n(Qg)(Y_i)+\|g\|_2^2, & \text{density estimation},\end{cases} \tag{3}$$

where $Q$ is the adjoint of the inverse of $A$:

$$\int_{\mathbf{R}^d}(A^{-1}h)\,g=\int_{\mathbf{Y}}h\,(Qg)\,d\nu \tag{4}$$

for $h\in L_2(\mathbf{Y})$, $g\in L_2(\mathbf{R}^d)$. The operator $Q=(A^{-1})^*$ has the domain $L_2(\mathbf{R}^d)$, similarly as $A$. Minimizing $\|\hat f-f\|_2^2$ with respect to estimators $\hat f$ is equivalent to minimizing $\|\hat f-f\|_2^2-\|f\|_2^2$ and we have, in the Gaussian white noise model,

$$\begin{aligned}\|\hat f-f\|_2^2-\|f\|_2^2 &= -2\int_{\mathbf{R}^d}f\hat f+\|\hat f\|_2^2\\ &= -2\int_{\mathbf{Y}}(Af)(Q\hat f)\,d\nu+\|\hat f\|_2^2\\ &\approx -2\int_{\mathbf{Y}}(Q\hat f)\,dY_n+\|\hat f\|_2^2=\gamma_n(\hat f).\end{aligned} \tag{5}$$

The usual least squares estimator is defined as a minimizer of the criterion

$$\|A\hat f-Af\|_2^2-\|Af\|_2^2\approx-2\int_{\mathbf{Y}}(Ag)\,dY_n+\|Ag\|_2^2\stackrel{\mathrm{def}}{=}\tilde\gamma_n(g); \tag{6}$$

see, for example, O'Sullivan (1986). In density estimation, the log-likelihood empirical risk has been more common than the $L_2$-empirical risk and in the setting of inverse problems, the log-likelihood is defined as $\bar\gamma_n(g)=-n^{-1}\sum_{i=1}^n\log(Ag)(Y_i)$, analogously to (6). These alternative definitions of the empirical risk do not seem to lead to such an elegant theory as does the empirical risk in (3).
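As an illustration of the density-estimation branch of (3), the following minimal sketch evaluates the $L_2$-empirical risk in the simplest setting, assuming a direct problem ($A$ is the identity, so $Qg=g$) and a tiny hypothetical candidate set of uniform densities on $[0,w]$. The function names, the candidate widths and the simulated sample are assumptions made for the example, not part of the paper.

```python
import random

def l2_empirical_risk(g, g_sq_norm, sample):
    """gamma_n(g) = -2/n * sum_i g(Y_i) + ||g||_2^2.

    Density-estimation case of (3), assuming A = identity so that Q g = g.
    """
    n = len(sample)
    return -2.0 * sum(g(y) for y in sample) / n + g_sq_norm

def uniform_density(width):
    """Hypothetical candidate: uniform density on [0, width] and its squared L2 norm."""
    def g(y):
        return 1.0 / width if 0.0 <= y <= width else 0.0
    # ||g||_2^2 = integral of (1/width)^2 over [0, width] = 1/width
    return g, 1.0 / width

random.seed(0)
sample = [random.random() for _ in range(500)]  # i.i.d. draws from Uniform[0, 1)

widths = [0.5, 1.0, 2.0]                        # a small finite candidate set
risks = {w: l2_empirical_risk(*uniform_density(w), sample) for w in widths}
best_width = min(risks, key=risks.get)          # minimizer of the empirical risk
```

In this toy setting the risk of the true density ($w=1$) is $-2+1=-1$ whatever the sample, while the candidate with $w=2$ gives $-1+1/2=-1/2$, so the minimizer over the candidate set picks the true width as long as the sample is not heavily concentrated on $[0,0.5]$.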
The empirical risk in (3) has been used in deconvolution problems for projection estimators by Comte, Taupin and Rozenholc (2006).

We give upper bounds for the mean integrated squared error (MISE) of the estimators. The upper bounds characterize how the rates of convergence depend on the entropy of the underlying function class $\mathcal{F}$ and on smoothness properties of the operator $A$. Previously, such characterizations have been given (to our knowledge) in inverse problems only for the case of estimating real-valued linear functionals $L$. In these cases, the rates of convergence are determined by the modulus of continuity of the functional $\omega(\epsilon)=\sup\{L(f):f\in\mathcal{F},\|Af\|_2\le\epsilon\}$; see Donoho and Low (1992). For the case of estimating the whole function with a global loss function, the rates of convergence depend on the size of the underlying function class in terms of the entropy and capacity; see Cencov (1972), Le Cam (1973), Ibragimov and Hasminskii (1980, 1981), Birgé (1983), Hasminskii and Ibragimov (1990), Yang and Barron (1999), Ibragimov (2004). δ-net estimators were considered by, for example, van der Laan, Dudoit and van der Vaart (2004). These papers consider direct statistical problems. We show that for inverse statistical problems, the rate of convergence depends on the operator through the operator norm $\varrho(Q,\mathcal{F}_\delta)$ of $Q$, over a minimal δ-net $\mathcal{F}_\delta$; see (8) for the definition of $\varrho(Q,\mathcal{F}_\delta)$. More precisely, the convergence rate $\psi_n$ of the δ-net estimator is the solution to the equation

$$n\psi_n^2=\varrho^2(Q,\mathcal{F}_{\psi_n})\log(\#\mathcal{F}_{\psi_n}),$$

where $\#\mathcal{F}_{\psi_n}$ is the cardinality of a minimal δ-net. For direct problems, when $A$ is the identity operator, $\varrho(Q,\mathcal{F}_\delta)\asymp1$. (We write $a_n\asymp b_n$ to mean that $0<\liminf_{n\to\infty}a_n/b_n\le\limsup_{n\to\infty}a_n/b_n<\infty$.) As examples of operators $A$, we consider the convolution operator and the Radon transform. For these operators, the estimators achieve the minimax rates of convergence over Sobolev classes.
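The rate equation $n\psi_n^2=\varrho^2(Q,\mathcal{F}_{\psi_n})\log(\#\mathcal{F}_{\psi_n})$ can be explored numerically. The sketch below is only an illustration under the polynomial growth assumptions that the paper uses in Remark 1, namely $\varrho(Q,\mathcal{F}_\delta)=C'\delta^{-a}$ and $\log(\#\mathcal{F}_\delta)=C\delta^{-b}$; the solver `solve_rate`, the constants set to 1 and the chosen values of $n$, $s$, $d$, $a$ are hypothetical. It solves the equation by bisection and compares the result with the closed form $n^{-1/[2(a+1)+b]}$.

```python
import math

def solve_rate(n, a, b, c_rho=1.0, c_ent=1.0, lo=1e-12, hi=1e6, iters=200):
    """Solve n * psi^2 = (c_rho * psi^-a)^2 * (c_ent * psi^-b) for psi by bisection.

    Assumes polynomial growth of the operator norm and of the log-cardinality.
    The left side minus the right side is increasing in psi, so bisection applies.
    """
    def gap(psi):
        return n * psi ** 2 - (c_rho * psi ** (-a)) ** 2 * c_ent * psi ** (-b)

    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint, since psi spans many scales
        if gap(mid) > 0:
            hi = mid
        else:
            lo = mid
    return math.sqrt(lo * hi)

n, s, d = 10_000, 2.0, 1.0
b = d / s                       # entropy exponent of an s-smooth, d-dimensional class

a = 1.0                         # hypothetical degree of ill-posedness
psi_inverse = solve_rate(n, a, b)
psi_inverse_closed = n ** (-1.0 / (2 * (a + 1) + b))

psi_direct = solve_rate(n, 0.0, b)            # direct problem: a = 0
psi_direct_closed = n ** (-s / (2 * s + d))   # classical rate n^{-s/(2s+d)}
```

With these values the inverse-problem rate is slower than the direct one, which reflects the extra factor $\varrho^2$ on the right-hand side of the equation.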
The general framework for empirical risk minimization and the use of the empirical process machinery, including entropy bounds, for deriving optimal bounds seems to be new. Convolution and Radon transforms are discussed for illustrative purposes. These examples show that our results lead to optimal rates of convergence. As a new application, we introduce the estimation of additive models in inverse problems. A new type of oracle inequality is presented, which also gives the optimal rates of convergence in "anisotropic" inverse problems. For an extended version of this paper that also contains additional material, see Klemelä and Mammen (2009).

The paper is organized as follows. Section 2 gives an upper bound for the MISE of the δ-net estimator. Section 3 gives a lower bound for the MISE of any estimator. Section 4 gives an upper bound for the MISE of the dense empirical risk minimizer. Section 5 proves that the δ-net estimator achieves the optimal rate of convergence in the ellipsoidal framework and discusses this result for the case where $A$ is a convolution operator or the Radon transform. Furthermore, it contains an oracle inequality for additive models. Section 6 contains the proofs of the main results.

2. δ-net minimizer. Let $\mathcal{F}$ be a set of densities or signal functions $f:\mathbf{R}^d\to\mathbf{R}$. Let $\mathcal{F}_\delta$ be a finite δ-net of $\mathcal{F}$ in the $L_2$-metric, where $\delta>0$. That is, for each $f\in\mathcal{F}$, there is a $\phi\in\mathcal{F}_\delta$ such that $\|f-\phi\|_2\le\delta$. Define the estimator $\hat f$ by

$$\hat f=\operatorname*{argmin}_{\phi\in\mathcal{F}_\delta}\gamma_n(\phi),$$

where $\gamma_n(\phi)$ is defined in (3). Typically, we would like to choose a δ-net of minimal cardinality. We assume that $\mathcal{F}$ is bounded in the $L_2$-metric:

$$\sup_{g\in\mathcal{F}}\|g\|_2\le B_2, \tag{7}$$

where $0<B_2<\infty$.

Theorem 1 gives a bound for the mean integrated squared error of the estimator. We may identify the first term in the bound as a bias term and the second term as a variance term. The variance term depends on the operator norm of $Q$ over the δ-net $\mathcal{F}_\delta$.
We define this operator norm as

$$\varrho(Q,\mathcal{F}_\delta)=\max_{\phi,\phi'\in\mathcal{F}_\delta,\ \phi\ne\phi'}\frac{\|Q(\phi-\phi')\|_2}{\|\phi-\phi'\|_2},\qquad\delta>0, \tag{8}$$

where $Q$ is defined by (4). In the case of density estimation, we need the additional assumptions that $\varrho(Q,\mathcal{F}_\delta)\ge1$ and that $A\mathcal{F}$ and $Q\mathcal{F}$ are bounded in the $L_\infty$ metric:

$$\varrho(Q,\mathcal{F}_\delta)\ge1,\qquad\sup_{f\in\mathcal{F}}\|Af\|_\infty\le B_\infty,\qquad\sup_{f\in\mathcal{F}}\|Qf\|_\infty\le B'_\infty, \tag{9}$$

where $0<B_\infty,B'_\infty<\infty$.

Theorem 1. For the density estimation, we assume that (9) is satisfied. For $f\in\mathcal{F}$, we have that

$$E\|\hat f-f\|_2^2\le C_1\delta^2+C_2\,\frac{\varrho^2(Q,\mathcal{F}_\delta)\cdot(\log_e(\#\mathcal{F}_\delta)+1)}{n},$$

where

$$C_1=(1-2\xi)^{-1}(1+2\xi), \tag{10}$$

$$C_2=(1-2\xi)^{-1}\xi C_\tau, \tag{11}$$

$$C_\tau>0 \tag{12}$$

and $\xi$ is such that

$$\begin{cases}C_\tau^{-1}\Bigl(4B'_\infty/3+\sqrt{2[8(B'_\infty)^2/9+C_\tau B_\infty]}\Bigr)\le\xi<1/2, & \text{density estimation},\\[2mm] \sqrt{2/C_\tau}\le\xi<1/2, & \text{white noise}.\end{cases} \tag{13}$$

A proof of Theorem 1 is given in Section 6.2.

Remark 1. Theorem 1 shows that the δ-net estimator achieves the rate of convergence $\psi_n$ when $\psi_n$ is the solution of the equation

$$\psi_n^2\asymp n^{-1}\varrho^2(Q,\mathcal{F}_{\psi_n})\log(\#\mathcal{F}_{\psi_n}). \tag{14}$$

We calculate the rate under the assumptions that $\log(\#\mathcal{F}_\delta)$ and $\varrho(Q,\mathcal{F}_\delta)$ increase polynomially as $\delta$ decreases: we assume that one can find a δ-net whose cardinality satisfies

$$\log(\#\mathcal{F}_\delta)=C\delta^{-b}$$

for some constants $b,C>0$ and we assume that

$$\varrho(Q,\mathcal{F}_\delta)=C'\delta^{-a}$$

for some $a,C'>0$ (in the direct case $a=0$ and $C'=1$). Then (14) can be written as $\psi_n^2\asymp n^{-1}\psi_n^{-2a-b}$ and the rate of the δ-net estimator is

$$\psi_n\asymp n^{-1/[2(a+1)+b]}. \tag{15}$$

Let $\mathcal{F}$ be a set of $s$-smooth, $d$-dimensional functions such that $b=d/s$. Then the rate is $\psi_n\asymp n^{-s/[2(a+1)s+d]}$, which, for the direct case $a=0$, gives the classical rate $\psi_n\asymp n^{-s/(2s+d)}$.

3. A lower bound for MISE. Theorem 2 gives a lower bound for the mean integrated squared error of any estimator when estimating densities or signal functions $f:\mathbf{R}^d\to\mathbf{R}$ in the function class $\mathcal{F}$. Theorem 2 also holds for nonlinear operators.

Theorem 2. Let $A$ be a possibly nonlinear operator.
Assume that for each sufficiently small $\delta>0$, we find a finite set $\mathcal{D}_\delta\subset\mathcal{F}$ for which

$$\min\{\|f-g\|_2:f,g\in\mathcal{D}_\delta,\ f\ne g\}\ge C_0\delta \tag{16}$$

and

$$\begin{cases}\max\{\|f-g\|_2:f,g\in\mathcal{D}_\delta\}\le C_1\delta, & \text{white noise},\\ \max\{D_K(f,g):f,g\in\mathcal{D}_\delta\}\le C_1\delta, & \text{density estimation},\end{cases} \tag{17}$$

where $D_K^2(f,g)=\int\log_e(f/g)\,f$ is the Kullback–Leibler distance and $C_0$, $C_1$ are positive constants. Let

$$\varrho_K(A,\mathcal{D}_\delta)=\begin{cases}\dfrac{1}{\sqrt2}\displaystyle\max_{f,g\in\mathcal{D}_\delta,\ f\ne g}\frac{\|A(f-g)\|_2}{\|f-g\|_2}, & \text{white noise},\\[3mm] \displaystyle\max_{f,g\in\mathcal{D}_\delta,\ f\ne g}\frac{D_K(Af,Ag)}{\|f-g\|_2}, & \text{density estimation}.\end{cases}$$

Let $\psi_n$ be such that

$$\log_e(\#\mathcal{D}_{\psi_n})\gtrsim n\psi_n^2\varrho_K^2(A,\mathcal{D}_{\psi_n}), \tag{18}$$

where $a_n\gtrsim b_n$ means that $\liminf_{n\to\infty}a_n/b_n>0$. Assume that

$$\lim_{n\to\infty}n\psi_n^2\varrho_K^2(A,\mathcal{D}_{\psi_n})=\infty. \tag{19}$$

Then

$$\liminf_{n\to\infty}\psi_n^{-2}\inf_{\hat f}\sup_{f\in\mathcal{F}}E\|f-\hat f\|_2^2>0,$$

where the infimum is taken over all estimators. That is, $\psi_n$ is a lower bound for the minimax rate of convergence.

A proof of Theorem 2 is given in Section 6.3.

Remark 2. Theorem 2 shows that one can get a lower bound $\psi_n$ for the rate of convergence by solving the equation

$$\psi_n^2\varrho_K^2(A,\mathcal{D}_{\psi_n})\asymp n^{-1}\log_e(\#\mathcal{D}_{\psi_n}). \tag{20}$$

The upper bound in Theorem 1 depends on the operator norm of $Q$, defined in (8), whereas the lower bound depends on the operator norm of $A$. Note, also, that the operator norm $\varrho(Q,\mathcal{F}_{\psi_n})$ is on the other side of the equation in (14) compared to the operator norm $\varrho_K(A,\mathcal{D}_{\psi_n})$ in the equation (20).

Remark 3. In the density estimation case, one can easily check assumptions (17) and (19) if one assumes that the functions in $A\mathcal{D}_\delta$ are bounded and bounded away from 0. Then

$$C'\cdot\|A(f-g)\|_2\le D_K(Af,Ag)\le C\cdot\|A(f-g)\|_2 \tag{21}$$

and (17) and (19) follow by the corresponding conditions with Hilbert norms instead of Kullback–Leibler distances.

4. Dense minimizer. The dense minimizer minimizes the empirical risk over the whole function class $\mathcal{F}$. In contrast to the δ-net estimator, the minimization is not restricted to a δ-net. We call this estimator a "dense minimizer" because it is defined as a minimizer over a possibly uncountable function class.
The δ-net estimator is more widely applicable: it may also be applied to estimate nonsmooth functions and it may be applied when the operator is severely ill-posed. The dense minimizer may only be applied for relatively smooth cases (the entropy integral has to converge). Because it works without a restriction to a δ-net, we have available a larger toolbox of numerical algorithms that can be applied.

For a collection $\mathcal{F}$ of functions $f:\mathbf{R}^d\to\mathbf{R}$, the dense minimizer $\hat f$ is defined as a minimizer of the empirical risk over $\mathcal{F}$, up to $\epsilon_n>0$:

$$\gamma_n(\hat f)\le\inf_{g\in\mathcal{F}}\gamma_n(g)+\epsilon_n,$$

where $\gamma_n(\phi)$ is defined in (3). For clarity, we present separate theorems for the Gaussian white noise model and for the density estimation model. In both models, we make the assumption that the functions in $\mathcal{F}$ are bounded in the $L_2$-metric as in (7).

4.1. Gaussian white noise. Let $\mathcal{F}_\delta$, $\delta>0$, be a δ-net of $\mathcal{F}$, with respect to the $L_2$-norm. Define

$$\varrho(Q,\mathcal{F}_\delta)=\max\left\{\frac{\|Q(f-g)\|_2}{\|f-g\|_2}:f\in\mathcal{F}_\delta,\ g\in\mathcal{F}_{2\delta},\ f\ne g\right\},\qquad\delta>0, \tag{22}$$

where $Q$ is the adjoint of the inverse of $A$, defined by (4). Define the entropy integral

$$G(\delta)\stackrel{\mathrm{def}}{=}\int_0^\delta\varrho(Q,\mathcal{F}_u)\sqrt{\log_e(\#\mathcal{F}_u)}\,du,\qquad\delta\in(0,B_2], \tag{23}$$

where $B_2$ is the $L_2$-bound defined by (7).

Theorem 3. Assume that:
1. the entropy integral in (23) converges;
2. $G(\delta)/\delta^2$ is decreasing on the interval $(0,B_2]$;
3. $\varrho(Q,\mathcal{F}_\delta)=c\delta^{-a}$, where $0\le a<1$ and $c>0$;
4. $\lim_{\delta\to0}G(\delta)\delta^{a-1}=\infty$;
5. $\delta\mapsto\varrho(Q,\mathcal{F}_\delta)\sqrt{\log_e(\#\mathcal{F}_\delta)}$ is decreasing on $(0,B_2]$.

Let $\psi_n$ be such that

$$\psi_n^2\ge Cn^{-1/2}G(\psi_n), \tag{24}$$

where $C$ is a positive constant, and assume that $\lim_{n\to\infty}n\psi_n^{2(1+a)}=\infty$. Then, for $f\in\mathcal{F}$,

$$E\|\hat f-f\|_2^2\le C'(\psi_n^2+\epsilon_n)$$

for a positive constant $C'$, for sufficiently large $n$.

A proof of Theorem 3 is given in Section 6.4.

Remark 4. Assumption 5 is a technical assumption which is used to replace a Riemann sum by an entropy integral. We prefer to write the assumptions in terms of the entropy integral in order to make them more readable.

Remark 5.
We may write $\varrho(Q,\mathcal{F}_\delta)$ in a simpler way when there exist minimal δ-nets $\mathcal{F}_\delta$ which are nested: $\mathcal{F}_{2\delta}\subset\mathcal{F}_\delta$. We may then define, alternatively,

$$\varrho(Q,\mathcal{F}_\delta)=\max_{f,g\in\mathcal{F}_\delta,\ f\ne g}\frac{\|Q(f-g)\|_2}{\|f-g\|_2}.$$

Remark 6. Theorems 3 and 4 show that the rate of convergence of the dense minimizer is the solution of the equation

$$\psi_n^2=n^{-1/2}G(\psi_n). \tag{25}$$

To get the optimal rate, the net $\mathcal{F}_\delta$ is chosen so that its cardinality is minimal. In the polynomial case, one can find a δ-net whose cardinality satisfies

$$\log(\#\mathcal{F}_\delta)=C\delta^{-b}$$

for some constants $b,C>0$ and the operator norm satisfies

$$\varrho(Q,\mathcal{F}_\delta)=C'\delta^{-a}$$

for some $a,C'>0$. (In the direct case, $a=0$ and $C'=1$.) Thus, the entropy integral $G(\delta)$ is finite when $\int_0^\delta u^{-a-b/2}\,du<\infty$, which holds when

$$a+b/2<1. \tag{26}$$

Then (25) leads to $\psi_n^2\asymp n^{-1/2}\psi_n^{-a-b/2+1}$ and the rate of the dense minimization estimator is

$$\psi_n\asymp n^{-1/[2(a+1)+b]}. \tag{27}$$

This is the same rate as the rate of the δ-net estimator given in (15). We have the following example. Let $\mathcal{F}$ be a set of $s$-smooth, $d$-dimensional functions such that $b=d/s$. Condition (26) may then be written as a condition for the smoothness index $s$: $s>d/[2(1-a)]$. When the problem is direct, then $a=0$ and we have the classical condition $s>d/2$. The rate is $\psi_n\asymp n^{-s/[2(a+1)s+d]}$, which gives, for the direct case $a=0$, the classical rate $\psi_n\asymp n^{-s/(2s+d)}$.

4.2. Density estimation. A δ-bracketing net of $\mathcal{F}$ with respect to the $L_2$-norm is a set $\mathcal{F}_\delta=\{(g_j^L,g_j^U):j=1,\ldots,N_\delta\}$ of pairs of functions such that:
1. $\|g_j^L-g_j^U\|_2\le\delta$, $j=1,\ldots,N_\delta$;
2. for each $g\in\mathcal{F}$, there exists $j=j(g)\in\{1,\ldots,N_\delta\}$ such that $g_j^L\le g\le g_j^U$.

Let us define $\mathcal{F}_\delta^L=\{g_j^L:j=1,\ldots,N_\delta\}$ and $\mathcal{F}_\delta^U=\{g_j^U:j=1,\ldots,N_\delta\}$. Further, define

$$\varrho_{\mathrm{den}}(Q,\mathcal{F}_\delta)=\max\{\varrho(Q,\mathcal{F}_\delta^L,\mathcal{F}_\delta^U),\ \varrho(Q,\mathcal{F}_\delta^L,\mathcal{F}_{2\delta}^L)\}, \tag{28}$$

where

$$\varrho(Q,\mathcal{F}_\delta^L,\mathcal{F}_\delta^U)=\max\left\{\frac{\|Q(g^U-g^L)\|_2}{\|g^U-g^L\|_2}:g^L\in\mathcal{F}_\delta^L,\ g^U\in\mathcal{F}_\delta^U\right\}$$

and

$$\varrho(Q,\mathcal{F}_\delta^L,\mathcal{F}_{2\delta}^L)=\max\left\{\frac{\|Q(f-g)\|_2}{\|f-g\|_2}:f\in\mathcal{F}_\delta^L,\ g\in\mathcal{F}_{2\delta}^L,\ f\ne g\right\}$$

for $\delta>0$.
Define the entropy integral

$$G(\delta)\stackrel{\mathrm{def}}{=}\int_0^\delta\varrho_{\mathrm{den}}(Q,\mathcal{F}_u)\sqrt{\log_e(\#\mathcal{F}_u)}\,du,\qquad\delta\in(0,B_2], \tag{29}$$

where $B_2=\sup_{f\in\mathcal{F}}\|f\|_2$.

Theorem 4. We make assumptions 1–5 of Theorem 3 [with operator norm $\varrho_{\mathrm{den}}(Q,\mathcal{F}_\delta)$ in place of $\varrho(Q,\mathcal{F}_\delta)$] and, in addition, we assume that $\sup_{f\in\mathcal{F}}\|Af\|_\infty<\infty$, $\sup_{g\in\mathcal{F}_{B_2}^L\cup\mathcal{F}_{B_2}^U}\|Qg\|_\infty<\infty$ and that the operator $Q$ preserves positivity ($g\ge0$ implies that $Qg\ge0$). Let $\psi_n$ be such that

$$\psi_n^2\ge Cn^{-1/2}G(\psi_n) \tag{30}$$

for a positive constant $C$ and assume that $\lim_{n\to\infty}n\psi_n^{2(1+a)}=\infty$. Then, for $f\in\mathcal{F}$,

$$E\|\hat f-f\|_2^2\le C'(\psi_n^2+\epsilon_n)$$

for a positive constant $C'$, for sufficiently large $n$.

A proof of Theorem 4 is given in Section 6.5. An analogous discussion of optimal rates as in Remark 6 for the Gaussian white noise model also applies for dense density estimators.

5. Examples of function spaces. In Section 5.1, we consider ellipsoidal function spaces and in Section 5.2 we consider additive models and their generalizations.

5.1. Ellipsoidal function spaces. Since we are in the $L_2$-setting, it is natural to work in the sequence space; we define the function classes as ellipsoids. We shall apply singular value decompositions of the operators and wavelet-vaguelette systems in the calculation of the rates of convergence. In Section 5.1.1, we calculate the operator norms in the framework of singular value decompositions. In Section 5.1.2, we calculate the operator norms in the wavelet-vaguelette framework. Section 5.1.3 derives the rate of convergence of the δ-net estimator for the case of a convolution operator and the Radon transform, and the lower bound for the rate of convergence of any estimator.

5.1.1. Singular value decomposition. We assume that the underlying function space $\mathcal{F}$ consists of $d$-variate functions that are linear combinations of orthonormal basis functions $\phi_j$ with multi-index $j=(j_1,\ldots,j_d)\in\{0,1,\ldots\}^d$. Define the ellipsoid and the corresponding collection of functions by

$$\Theta=\Biggl\{\theta:\sum_{j_1=0,\ldots,j_d=0}^{\infty}a_j^2\theta_j^2\le L^2\Biggr\},\qquad\mathcal{F}=\Biggl\{\sum_{j_1=0,\ldots,j_d=0}^{\infty}\theta_j\phi_j:\theta\in\Theta\Biggr\}. \tag{31}$$

δ-net and δ-packing set for polynomial ellipsoids. We assume that there exist positive constants $C_1,C_2$ such that for all $j\in\{0,1,\ldots\}^d$,

$$C_1\cdot|j|^s\le a_j\le C_2\cdot|j|^s, \tag{32}$$

where $|j|=j_1+\cdots+j_d$. In Klemelä and Mammen (2009), we construct a δ-net $\Theta_\delta$ and a δ-packing set $\Theta_\delta^*$ using the techniques of Kolmogorov and Tikhomirov (1961); see also Birman and Solomyak (1967). Since the construction is in the sequence space, we define the δ-net and δ-packing set of $\mathcal{F}$ by

$$\mathcal{F}_\delta=\Biggl\{\sum_{j_1=0,\ldots,j_d=0}^{\infty}\theta_j\phi_j:\theta\in\Theta_\delta\Biggr\},\qquad\mathcal{D}_\delta=\Biggl\{\sum_{j_1=0,\ldots,j_d=0}^{\infty}\theta_j\phi_j:\theta\in\Theta_\delta^*\Biggr\}. \tag{33}$$
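To make the ellipsoid in (31) concrete, the following minimal sketch checks whether a finitely supported coefficient array $\theta$ satisfies $\sum_j a_j^2\theta_j^2\le L^2$. It assumes the simplest weights allowed by (32), $a_j=|j|^s$ (that is, $C_1=C_2=1$); the helper `in_ellipsoid` and the example coefficients are hypothetical and are not the construction of the δ-net $\Theta_\delta$ from the paper.

```python
import itertools

def in_ellipsoid(theta, s, radius):
    """Check sum_j a_j^2 * theta_j^2 <= radius^2 with assumed weights a_j = |j|^s.

    theta maps multi-index tuples j = (j_1, ..., j_d) to coefficients theta_j;
    indices missing from the dict are treated as zero coefficients.
    """
    total = sum((sum(j) ** s) ** 2 * t ** 2 for j, t in theta.items())
    return total <= radius ** 2

# Hypothetical example with d = 2: coefficients decaying like (1 + |j|)^(-3)
# on the index grid {0, ..., 3}^2.
theta = {j: (1.0 + sum(j)) ** (-3) for j in itertools.product(range(4), repeat=2)}

inside_large = in_ellipsoid(theta, s=1.0, radius=1.0)   # weighted sum is about 0.061
inside_small = in_ellipsoid(theta, s=1.0, radius=0.1)   # radius^2 = 0.01 is too small
```

Faster coefficient decay (or a smaller smoothness index $s$) makes the weighted sum smaller, so the same sequence fits into ellipsoids of smaller radius $L$.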
