Risk Estimators for Choosing Regularization Parameters in Ill-Posed Problems - Properties and Limitations

Felix Lucka∗, Katharina Proksch†, Christoph Brune‡, Nicolai Bissantz§, Martin Burger¶, Holger Dette‖, Frank Wübbeling∗∗

January 19, 2017

arXiv:1701.04970v1 [math.ST] 18 Jan 2017

Abstract

This paper discusses the properties of certain risk estimators recently proposed to choose regularization parameters in ill-posed problems. A simple approach is Stein's unbiased risk estimator (SURE), which estimates the risk in the data space, while a recent modification (GSURE) estimates the risk in the space of the unknown variable. It seems intuitive that the latter is more appropriate for ill-posed problems, since the properties in the data space do not tell much about the quality of the reconstruction. We provide theoretical studies of both estimators for linear Tikhonov regularization in a finite dimensional setting and estimate the quality of the risk estimators, which also leads to asymptotic convergence results as the dimension of the problem tends to infinity. Unlike previous papers, which studied image processing problems with a very low degree of ill-posedness, we are interested in the behavior of the risk estimators for increasing ill-posedness. Interestingly, our theoretical results indicate that the quality of the GSURE risk can deteriorate asymptotically for ill-posed problems, which is confirmed by a detailed numerical study. The latter shows that in many cases the GSURE estimator leads to extremely small regularization parameters, which obviously cannot stabilize the reconstruction. Similar but less severe issues with respect to robustness also appear for the SURE estimator, which in comparison to the rather conservative discrepancy principle leads to the conclusion that regularization parameter choice based on unbiased risk estimation is not a reliable procedure for ill-posed problems. A similar numerical study for sparsity regularization demonstrates that the same issue appears in nonlinear variational regularization approaches.

Keywords: Ill-posed problems, regularization parameter choice, risk estimators, Stein's method, discrepancy principle.

∗ Centre for Medical Image Computing, University College London, WC1E 6BT London, UK, e-mail: [email protected]
† Institut für Mathematische Stochastik, Georg-August-Universität Göttingen, Goldschmidtstrasse 7, 37077 Göttingen, Germany, e-mail: [email protected]
‡ Department of Applied Mathematics, University of Twente, P.O. Box 217, 7500 AE Enschede, The Netherlands, e-mail: [email protected]
§ Fakultät für Mathematik, Ruhr-Universität Bochum, 44780 Bochum, Germany, e-mail: [email protected]
¶ Institut für Numerische und Angewandte Mathematik, Westfälische Wilhelms-Universität (WWU) Münster, Einsteinstr. 62, D-48149 Münster, Germany, e-mail: [email protected]
‖ Fakultät für Mathematik, Ruhr-Universität Bochum, 44780 Bochum, Germany, e-mail: [email protected]
∗∗ Institut für Numerische und Angewandte Mathematik, Westfälische Wilhelms-Universität (WWU) Münster, Einsteinstr. 62, D-48149 Münster, Germany, e-mail: [email protected]

1 Introduction

Choosing suitable regularization parameters is a problem as old as regularization theory, which has seen a variety of approaches from deterministic perspectives (e.g., L-curve criteria, [18, 17]), from statistical perspectives (e.g., Lepskij principles, [2, 20]), and from approaches in between (e.g.,
discrepancy principles motivated by deterministic bounds or by the noise variance, cf. [28, 3]). Recently, another class of statistical parameter choice rules based on risk estimation, more precisely on Stein's unbiased risk estimation [27], was introduced in problems related to image processing ([9, 29, 30, 12, 33, 10, 22, 32, 31, 25]). In addition to the classical Stein unbiased risk estimator (SURE), several authors have considered a generalized version (GSURE, [30, 11, 15]), which measures the risk in the space of the unknown rather than in the data space and hence seems more appropriate for ill-posed problems. Previous investigations show that the performance of such parameter choice rules is reasonable in many different settings (cf. [16, 34, 8, 1, 26, 23, 13]). However, the problems considered in these works are only very mildly ill-posed. A first motivation of this paper is therefore to study the properties of parameter choice by SURE and GSURE in Tikhonov-type regularization methods more systematically, in dependence on the ill-posedness of the problem and the degree of smoothness of the unknown exact solution. For this purpose we provide a theoretical analysis of the quality of unbiased risk estimators in the case of linear Tikhonov regularization. Additionally, we carry out extensive numerical investigations on appropriate model problems. While in very mildly ill-posed settings the performances of the parameter choice rules under consideration are reasonable and comparable, our investigations yield various interesting results and insights in ill-posed settings. For instance, we demonstrate that GSURE shows a rather erratic behaviour as the degree of ill-posedness increases. The observed effects are so strong that the meaning of a parameter chosen according to this particular criterion is unclear. A second motivation of this paper is to study the discrepancy principle as a reference method; as we shall see, it can indeed be put in a very similar context and analyzed by the same techniques. Although the popularity of the discrepancy principle has recently been decreasing in favor of choices using more statistical details, our findings show that it is still more robust for ill-posed problems than risk-based parameter choices. The conservative choice made by the discrepancy principle is well known to rather overestimate the optimal parameter, but on the other hand it avoids choosing too small a regularization parameter, as risk-based methods often do. In the latter case the reconstruction results deteriorate completely, while the discrepancy principle yields a reliable, though not optimal, reconstruction.

Throughout the paper we consider a (discrete) inverse problem of the form

    y = A x^* + \varepsilon,    (1)

where y ∈ R^m is a vector of observations, A ∈ R^{m×n} is some known but possibly ill-conditioned matrix, and ε ∈ R^m is a noise vector. We assume that ε consists of independent and identically distributed (i.i.d.) Gaussian errors, i.e., ε ∼ N(0, σ²I_m). The vector x^* ∈ R^n denotes the (unknown) exact solution to be reconstructed from the observations. In order to find an estimate x̂(y) of x^*, we apply a variational regularization method:

    \hat{x}_\alpha(y) = \operatorname{argmin}_{x \in \mathbb{R}^n} \tfrac{1}{2}\|Ax - y\|_2^2 + \alpha R(x),    (2)

where R is assumed to be convex and such that the minimizer is unique for every positive regularization parameter α. In what follows, the dependence of x̂_α(y) on α and on the data y may be dropped where it is clear without ambiguity that x̂ = x̂_α(y).
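As a minimal illustration of the setting (1)-(2) with a quadratic R, the following numpy sketch contrasts the naive pseudoinverse solution with a regularized one; the synthetic operator, seed, and parameter values are our own assumptions and not the paper's example:

```python
import numpy as np

# Sketch of model (1): synthetic ill-conditioned operator A (a stand-in for
# the convolution operator used later), noisy data, and a comparison of the
# naive pseudoinverse solution with the variational estimate (2).
rng = np.random.default_rng(0)
m = n = 64
U, _ = np.linalg.qr(rng.standard_normal((m, m)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
gamma = 0.8 ** np.arange(n)                      # assumed singular-value decay
A = U @ np.diag(gamma) @ V.T

x_true = rng.standard_normal(n)                  # placeholder for x*
sigma = 0.1
y = A @ x_true + sigma * rng.standard_normal(m)  # y = A x* + eps

x_naive = np.linalg.pinv(A) @ y                  # amplifies noise by 1/gamma_i
alpha = 1e-2                                     # quadratic R(x) = 0.5*||x||^2
x_reg = np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T @ y)
print(np.linalg.norm(x_naive - x_true), np.linalg.norm(x_reg - x_true))
```

The naive inversion illustrates why a data-dependent choice of α is needed at all: the reconstruction error depends strongly on α, which motivates the parameter choice rules discussed next.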
In practice there are two choices to be made: first, a regularization functional R needs to be specified in order to appropriately represent a priori knowledge about solutions, and second, a regularization parameter α needs to be chosen in dependence on the data y. The ideal parameter choice would minimize a difference between x̂_α(y) and x^* over all α, which obviously cannot be computed and is hence replaced by a parameter choice rule that tries to minimize a worst-case or average error to the unknown solution, which can be referred to as risk minimization. In the practical case of having only a single observation, the risk based on the average error needs to be replaced by an estimate as well, and the unbiased risk estimators detailed in the following are a natural choice.

For the sake of a clearer presentation of methods and results we first focus on linear Tikhonov regularization, i.e.,

    R(x) = \tfrac{1}{2}\|x\|_2^2,

leading to the explicit Tikhonov estimator

    \hat{x}_\alpha(y) = T_\alpha y := (A^*A + \alpha I)^{-1} A^* y.    (3)

In this setting, a natural distance for measuring the error of x̂_α(y) is given by its ℓ_2-distance to x^*. Thus, we define

    \alpha^* := \operatorname{argmin}_{\alpha \geq 0} \|\hat{x}_\alpha(y) - x^*\|_2^2

as the optimal, but inaccessible, regularization parameter. Many different rules for the choice of the regularization parameter α are discussed in the literature. Here, we focus on strategies that rely on an accurate estimate of the noise variance σ². A classical example of such a rule is the discrepancy principle: the regularization parameter α̂_DP is given as the solution of the equation

    \|A\hat{x}_\alpha(y) - y\|_2^2 = m\sigma^2.    (4)

The discrepancy principle is robust and easy to implement for many applications (cf. [4, 19, 24]) and is based on the heuristic argument that x̂_α(y) should only explain the data y up to the noise level. Several other parameter choice rules are based on Stein's famous unbiased risk estimator (SURE). The basic idea is to choose the α that minimizes the quadratic risk function

    \hat{\alpha}^*_{SURE} \in \operatorname{argmin}_{\alpha \geq 0} R_{SURE}(\alpha) := \operatorname{argmin}_{\alpha \geq 0} \mathbb{E}\big[\|Ax^* - A\hat{x}_\alpha(y)\|_2^2\big].    (5)

Since R_SURE depends on the unknown vector x^*, we replace it by an unbiased estimate:

    \hat{\alpha}_{SURE} \in \operatorname{argmin}_{\alpha \geq 0} SURE(\alpha, y) := \operatorname{argmin}_{\alpha \geq 0} \|y - A\hat{x}_\alpha(y)\|_2^2 - m\sigma^2 + 2\sigma^2 \mathrm{df}_\alpha(y)    (6)

with df_α(y) = tr(∇_y · A x̂_α(y)).

As an analogue of the SURE criterion, a generalized version (GSURE) is often considered. In contrast to SURE, which aims at optimizing the MSE in the image of the operator, GSURE operates in the domain instead and considers the MSE of the reconstruction of x:

    \hat{\alpha}^*_{GSURE} \in \operatorname{argmin}_{\alpha \geq 0} R_{GSURE}(\alpha) := \operatorname{argmin}_{\alpha \geq 0} \mathbb{E}\big[\|\Pi(x^* - \hat{x}_\alpha(y))\|_2^2\big],

where Π := A^+ A denotes the orthogonal projector onto the range of A^*. Again, we replace R_GSURE by an unbiased estimator to obtain

    \hat{\alpha}_{GSURE} \in \operatorname{argmin}_{\alpha \geq 0} GSURE(\alpha, y) := \operatorname{argmin}_{\alpha \geq 0} \|x_{ML}(y) - \hat{x}_\alpha(y)\|_2^2 - \sigma^2 \operatorname{tr}\big((AA^*)^+\big) + 2\sigma^2 \mathrm{gdf}_\alpha(y)    (7)

with gdf_α(y) = tr((AA^*)^+ ∇_y A x̂_α(y)) and x_ML = A^+ y = A^*(AA^*)^+ y, where M^+ denotes the pseudoinverse of M.
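For concreteness, here is a small numpy/scipy sketch (not the authors' code) of how the discrepancy principle (4) and the SURE criterion (6) can be evaluated directly from these definitions for the Tikhonov estimator (3); the toy data and all names are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import brentq

def tikhonov(A, y, alpha):
    # Tikhonov estimator (3): (A^T A + alpha I)^{-1} A^T y
    return np.linalg.solve(A.T @ A + alpha * np.eye(A.shape[1]), A.T @ y)

def dp_value(A, y, alpha, sigma):
    # left-hand side of the discrepancy equation (4) minus m*sigma^2
    r = A @ tikhonov(A, y, alpha) - y
    return r @ r - A.shape[0] * sigma**2

def sure_value(A, y, alpha, sigma):
    # SURE criterion (6) with df_alpha = tr(A (A^T A + alpha I)^{-1} A^T)
    m, n = A.shape
    r = A @ tikhonov(A, y, alpha) - y
    df = np.trace(A @ np.linalg.solve(A.T @ A + alpha * np.eye(n), A.T))
    return r @ r - m * sigma**2 + 2 * sigma**2 * df

# Toy data, only to make the sketch self-contained (not the paper's example).
rng = np.random.default_rng(1)
gamma = 0.8 ** np.arange(32)
A, sigma = np.diag(gamma), 0.1
y = A @ rng.standard_normal(32) + sigma * rng.standard_normal(32)

# alpha_DP solves DP = 0 (assumes the bracket below contains the root);
# alpha_SURE minimizes SURE over a logarithmic grid.
alpha_dp = brentq(lambda a: dp_value(A, y, a, sigma), 1e-12, 1e3)
grid = np.logspace(-12, 3, 301)
alpha_sure = grid[np.argmin([sure_value(A, y, a, sigma) for a in grid])]
print(alpha_dp, alpha_sure)
```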
Notice that all parameter choice rules depend on the data y and hence on the random errors ε_1, ..., ε_m. Therefore, α̂_DP, α̂_SURE and α̂_GSURE are random variables, described in terms of their probability distributions. We first investigate these distributions by numerical simulation studies. The results point to several problems of the presented parameter choice rules, in particular of GSURE, and motivate our further theoretical investigation.

In the next section, we describe a simple inverse problem scenario in terms of quadratic Tikhonov regularization and fix the setting and notation both for further numerical simulation and for the analysis of the risk-based estimators. The latter will be carried out in Section 3 and supplemented by an exhaustive numerical study in Section 4. Finally, we extend the numerical investigation in Section 5 to a sparsity-promoting LASSO-type regularization, for which we find similar behaviour. Conclusions are given in Section 6.

2 Risk Estimators for Quadratic Regularization

In the following we discuss the setup in the case of a quadratic regularization functional R(x) = ½‖x‖²_2, i.e., we recover the well-known linear Tikhonov regularization scheme. The linearity can be used to simplify arguments and gain analytical insight in the next section.

2.1 Singular System and Risk Representations

Considering a quadratic regularization allows us to analyze x̂_α in a singular system of A in a convenient way. Let r = rank(A), l = min(n, m), and let

    A = U\Sigma V^*, \quad \Sigma = \mathrm{diag}(\gamma_1, \ldots, \gamma_l) \in \mathbb{R}^{m \times n}, \quad \gamma_1 \geq \ldots \geq \gamma_r > 0, \quad \gamma_{r+1} = \ldots = \gamma_m := 0

denote a singular value decomposition of A with U = (u_1, ..., u_m) ∈ R^{m×m}, V = (v_1, ..., v_n) ∈ R^{n×n} unitary. Defining

    y_i = \langle u_i, y \rangle, \qquad x^*_i = \langle v_i, x^* \rangle, \qquad \tilde{\varepsilon}_i = \langle u_i, \varepsilon \rangle,    (8)

we can rewrite model (1) in its spectral form

    y_i = \gamma_i x^*_i + \tilde{\varepsilon}_i, \quad i = 1, \ldots, l; \qquad y_i = \tilde{\varepsilon}_i, \quad i = l+1, \ldots, m,

where ε̃_1, ..., ε̃_m are still i.i.d. ∼ N(0, σ²). We will express some more terms in the singular system that are frequently used throughout this paper. In particular, we have for x_ML, the regularized solution and its norm

    x_{ML} = A^+ y = V\Sigma^+ U^* y, \quad \text{with } \Sigma^+ = \mathrm{diag}\big(\tfrac{1}{\gamma_1}, \ldots, \tfrac{1}{\gamma_r}, 0, \ldots, 0\big) \in \mathbb{R}^{n \times m},

    \hat{x}_\alpha(y) = (A^*A + \alpha I)^{-1} A^* y =: V\Sigma^+_\alpha U^* y, \quad \text{with } \Sigma^+_\alpha = \mathrm{diag}\big(\tfrac{\gamma_i}{\gamma_i^2 + \alpha}\big) \in \mathbb{R}^{n \times m},

    \|\hat{x}_\alpha\|_2^2 = \sum_{i=1}^m \frac{\gamma_i^2}{(\gamma_i^2 + \alpha)^2}\, y_i^2,

as well as the residual and the distance to the maximum likelihood estimate

    \|A\hat{x}_\alpha - y\|_2^2 = \sum_{i=1}^m \frac{\alpha^2}{(\gamma_i^2 + \alpha)^2}\, y_i^2,    (9)

    \|x_{ML} - \hat{x}_\alpha\|_2^2 = \|A^*(AA^*)^+ y - (A^*A + \alpha I)^{-1}A^* y\|_2^2 = \|V(\Sigma^+ - \Sigma^+_\alpha)U^* y\|_2^2 = \sum_{i=1}^r \left(\frac{1}{\gamma_i} - \frac{\gamma_i}{\gamma_i^2 + \alpha}\right)^2 y_i^2.

Based on the generalized inverse we compute

    (AA^*)^+ = U(\Sigma\Sigma^*)^+ U^* = U\,\mathrm{diag}\big(\tfrac{1}{\gamma_1^2}, \ldots, \tfrac{1}{\gamma_r^2}, 0, \ldots, 0\big)\, U^*,

    A^*(AA^*)^+ A = V\,\mathrm{diag}(\underbrace{1, \ldots, 1}_{r}, \underbrace{0, \ldots, 0}_{n-r})\, V^*,

which yields the degrees of freedom and the generalized degrees of freedom

    \mathrm{df}_\alpha := \nabla_y \cdot A\hat{x}_\alpha = \mathrm{tr}\big(A(A^*A + \alpha I)^{-1}A^*\big) = \sum_{i=1}^r \frac{\gamma_i^2}{\gamma_i^2 + \alpha},

    \mathrm{gdf}_\alpha := \mathrm{tr}\big((AA^*)^+ \nabla_y \cdot A\hat{x}_\alpha\big) = \mathrm{tr}\big((AA^*)^+ A(A^*A + \alpha I)^{-1}A^*\big) = \sum_{i=1}^r \frac{1}{\gamma_i^2}\,\gamma_i\,\frac{\gamma_i}{\gamma_i^2 + \alpha} = \sum_{i=1}^r \frac{1}{\gamma_i^2 + \alpha}.

Next, we derive the spectral representations of the parameter choice rules. For the discrepancy principle, we use (9) to define

    DP(\alpha, y) := \sum_{i=1}^m \frac{\alpha^2}{(\gamma_i^2 + \alpha)^2}\, y_i^2 - m\sigma^2,    (10)

and now (4) can be restated as DP(α̂_DP, y) = 0. For (6) and (7), we find

    SURE(\alpha, y) = \sum_{i=1}^m \frac{\alpha^2}{(\gamma_i^2 + \alpha)^2}\, y_i^2 - m\sigma^2 + 2\sigma^2 \sum_{i=1}^m \frac{\gamma_i^2}{\gamma_i^2 + \alpha},    (11)

    GSURE(\alpha, y) = \sum_{i=1}^r \left(\frac{1}{\gamma_i} - \frac{\gamma_i}{\gamma_i^2 + \alpha}\right)^2 y_i^2 - \sigma^2 \sum_{i=1}^r \frac{1}{\gamma_i^2} + 2\sigma^2 \sum_{i=1}^r \frac{1}{\gamma_i^2 + \alpha}.    (12)
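These spectral forms make all three criteria cheap to evaluate on a whole grid of α once a single SVD of A is available. A compact numpy sketch (our own helper, with an assumed tolerance for the numerical rank):

```python
import numpy as np

def risk_criteria(A, y, alphas, sigma):
    # Spectral evaluation of DP (10), SURE (11) and GSURE (12): one SVD of A
    # suffices to evaluate all three criteria on an arbitrary grid of alphas.
    U, gamma, Vt = np.linalg.svd(A)
    m = A.shape[0]
    r = int(np.sum(gamma > 1e-12 * gamma[0]))     # numerical rank
    yi = U.T @ y                                  # spectral data y_i = <u_i, y>
    g2 = np.concatenate([gamma**2, np.zeros(m - len(gamma))])

    out = []
    for a in np.atleast_1d(alphas):
        res = np.sum(a**2 / (g2 + a)**2 * yi**2)  # ||A x_alpha - y||_2^2, cf. (9)
        dp = res - m * sigma**2
        sure = dp + 2 * sigma**2 * np.sum(g2 / (g2 + a))
        gi = gamma[:r]
        gsure = (np.sum((1/gi - gi/(gi**2 + a))**2 * yi[:r]**2)
                 - sigma**2 * np.sum(1/gi**2)
                 + 2 * sigma**2 * np.sum(1/(gi**2 + a)))
        out.append((dp, sure, gsure))
    return np.array(out)

# Typical use (A, y, sigma as in model (1)): minimize the SURE/GSURE columns
# over a log-grid, and take the root (sign change) of the DP column for alpha_DP.
# grid = np.logspace(-10, 2, 500); vals = risk_criteria(A, y, grid, sigma)
```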
2.2 An Illustrative Example

We consider a simple imaging scenario which exhibits typical properties of inverse problems. The unknown function x^*_∞ : [−1/2, 1/2] → R is mapped to a function y_∞ : [−1/2, 1/2] → R by a periodic convolution with a compactly supported kernel of width l ≤ 1/2:

    y_\infty(s) = (A_{\infty,l}\, x^*_\infty)(s) := \int_{-1/2}^{1/2} k_l(s - t)\, x^*_\infty(t)\, dt, \qquad s \in [-1/2, 1/2],

where the 1-periodic C_0^∞(R) function k_l(t) is defined for |t| ≤ 1/2 by

    k_l(t) := \frac{1}{N_l} \begin{cases} \exp\left(-\frac{1}{1 - t^2/l^2}\right) & \text{if } |t| < l, \\ 0 & \text{if } l \leq |t| \leq 1/2, \end{cases} \qquad N_l = \int_{-l}^{l} \exp\left(-\frac{1}{1 - t^2/l^2}\right) dt,

and continued periodically for |t| > 1/2. Examples of k_l(t) are plotted in Figure 1(a). The normalization ensures that A_{∞,l} and suitable discretizations thereof have spectral radius γ_1 = 1, which simplifies our derivations and the corresponding illustrations. The x^*_∞ used in the numerical examples is the sum of four delta distributions:

    x^*_\infty(t) := \sum_{i=1}^4 a_i\, \delta\!\left(t - b_i + \tfrac{1}{2}\right), \quad \text{with } a = [0.5, 1, 0.8, 0.5], \quad b = \left[\tfrac{1}{\sqrt{26}}, \tfrac{1}{\sqrt{11}}, \tfrac{1}{\sqrt{3}}, \tfrac{1}{\sqrt{3/2}}\right].

The locations of the delta distributions approximate [−0.3, −0.2, 0.1, 0.3] by irrational numbers, which simplifies the discretization.

Discretization

For a given number of degrees of freedom n, let

    E_i^n := \left[\tfrac{i-1}{n} - \tfrac{1}{2},\ \tfrac{i}{n} - \tfrac{1}{2}\right], \quad i = 1, \ldots, n,

denote the equidistant partition of [−1/2, 1/2] and ψ_i^n(t) = √n · 1_{E_i^n}(t) an ONB of piecewise constant functions over that partition. If we use m and n degrees of freedom to discretize the range and domain of A_{∞,l}, respectively, we arrive at the discrete inverse problem (1) with

    (A_l)_{i,j} = \langle \psi_i^m, A_{\infty,l}\, \psi_j^n \rangle = \sqrt{mn} \int_{E_i^m} \int_{E_j^n} k_l(s - t)\, dt\, ds,    (13)

    x^*_j = \langle \psi_j^n, x^*_\infty \rangle = \sqrt{n} \int_{E_j^n} x^*_\infty(t)\, dt = \sqrt{n} \sum_{i=1}^4 a_i\, 1_{E_j^n}\!\left(b_i - \tfrac{1}{2}\right).

The two-dimensional integration in (13) is computed by the trapezoidal rule with equidistant spacing, employing 100×100 points to partition E_i^m × E_j^n. Note that we drop the subscript l from A_l whenever the dependence on this parameter is not of importance for the argument being carried out.

As the convolution kernel k_l has mass 1 and the discretization was designed to be mass-preserving, we have γ_1 = 1, and the condition of A is given by cond(A) = 1/γ_r, where r = rank(A). Figure 2 shows the decay of the singular values for various parameter settings and Table 1 lists the corresponding condition numbers.

Figure 1: (a) The convolution kernel k_l(t) for different values of l. (b) True solution x^*, clean data A_l x^*, and noisy data A_l x^* + ε for m = n = 64, l = 0.06, σ = 0.1.

Table 1: Condition number of A_l computed for different values of m = n and l.

            l = 0.02   l = 0.04   l = 0.06   l = 0.08   l = 0.1
  m = 16    1.27e+0    1.75e+0    2.79e+0    6.77e+0    2.31e+2
  m = 32    1.75e+0    6.77e+0    6.94e+1    6.88e+2    2.30e+2
  m = 64    6.77e+0    6.88e+2    6.42e+2    1.51e+3    4.22e+3
  m = 128   6.88e+2    1.51e+3    1.51e+4    4.29e+3    4.29e+4
  m = 256   1.70e+3    4.70e+4    1.87e+6    4.07e+6    1.79e+6
  m = 512   4.70e+4    1.11e+7    1.22e+7    2.12e+7    3.70e+7
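A possible numpy implementation of the discretization (13) is sketched below; the quadrature helper, the periodization of k_l, and all function names are our own choices under the stated assumptions, not the authors' code:

```python
import numpy as np

def trapz(f, x):
    # simple trapezoidal rule along the last axis
    return np.sum((f[..., 1:] + f[..., :-1]) * np.diff(x) / 2.0, axis=-1)

def kernel(t, l):
    # 1-periodic bump kernel k_l(t), not yet normalized by N_l
    t = (t + 0.5) % 1.0 - 0.5                    # wrap into [-1/2, 1/2)
    out = np.zeros_like(t)
    inside = np.abs(t) < l
    out[inside] = np.exp(-1.0 / (1.0 - (t[inside] / l) ** 2))
    return out

def convolution_matrix(m, n, l, quad=100):
    # Entries (A_l)_{ij} from (13); the double integral over E_i^m x E_j^n is
    # approximated by the trapezoidal rule on a quad x quad grid per cell.
    s_edges = np.linspace(-0.5, 0.5, m + 1)
    t_edges = np.linspace(-0.5, 0.5, n + 1)
    tt = np.linspace(-l, l, 2001)[1:-1]          # normalization constant N_l
    N_l = trapz(np.exp(-1.0 / (1.0 - (tt / l) ** 2)), tt)
    A = np.zeros((m, n))
    for i in range(m):
        s = np.linspace(s_edges[i], s_edges[i + 1], quad)
        for j in range(n):
            t = np.linspace(t_edges[j], t_edges[j + 1], quad)
            K = kernel(s[:, None] - t[None, :], l) / N_l
            A[i, j] = np.sqrt(m * n) * trapz(trapz(K, t), s)
    return A

# A = convolution_matrix(64, 64, l=0.06)   # operator of the illustrative example
```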
Empirical Distributions

Using the above formulas and m = n = 64, l = 0.06, σ = 0.1, we computed the empirical distributions of the different parameter choice rules by evaluating (10), (11) and (12) on a fine logarithmic α-grid, i.e., log_10(α_i) was increased linearly between −40 and 40 with a step size of 0.01. We draw N_ε = 10^6 samples of ε. The results are displayed in Figures 3 and 4. In both figures, we use a logarithmic scaling of the empirical probabilities, wherein empirical probabilities of 0 have been set to 1/(2N_ε). While this presentation complicates the comparison of the distributions, as the probability mass is deformed, it facilitates the examination of small values and tails.

First, we observe in Figure 3(a) that α̂_DP typically overestimates the optimal α^*. However, it performs robustly and does not cause large ℓ_2-errors, as can be seen in Figure 3(b). For α̂_SURE and α̂_GSURE, the latter is not true: while being closer to α^* than α̂_DP most often and, as can be seen from the joint error histograms in Figure 4, producing smaller ℓ_2-errors most often, both distributions show outliers, i.e., occasionally, very small values of α̂ are estimated that cause large ℓ_2-errors. In the case of α̂_GSURE, we even observe two clearly separated modes in the distributions. These findings motivate the theoretical examinations carried out in the following section.

Figure 2: Decay of the singular values γ_i of A_l for different choices of m and l. As expected, increasing the width l of the convolution kernel leads to a faster decay. For a fixed l, increasing m corresponds to using a finer discretization and γ_i converges to the corresponding singular value of A_{∞,l}, as can be seen for the largest γ_i, e.g., for l = 0.02.

Figure 3: Empirical probabilities of (a) α̂ and (b) the corresponding ℓ_2-error for different parameter choice rules using m = n = 64, l = 0.06, σ = 0.1 and N_ε = 10^6 samples of ε.

Figure 4: Joint empirical probabilities of log_10 ‖x^* − x_α̂‖_2 using m = n = 64, l = 0.06, σ = 0.1 and N_ε = 10^6 samples of ε (the histograms in Figure 3(b) are the marginal distributions thereof). As in Figure 3(b), the logarithms of the probabilities are displayed (here in the form of a color-coding) to facilitate the identification of smaller modes and tails. The red line at x = y divides the areas where one method performs better than the other: in (a), all samples falling into the area to the right of the red line correspond to a noise realization where the discrepancy principle leads to a smaller error than SURE. (a) Discrepancy principle vs SURE. (b) Discrepancy principle vs GSURE.

Figure 5: True risk functions (black dotted line), their estimates for six different realizations y^k, k = 1, ..., 6 (solid lines), and their corresponding minima/roots (dots on the lines) in the setting described in Figure 1 using ℓ_2-regularization: (a) DP(α, Ax^*) and DP(α, y^k). (b) R_SURE(α) and SURE(α, y^k). (c) R_GSURE(α) and GSURE(α, y^k).
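A small, self-contained Monte Carlo sketch in the spirit of this empirical study is given below; it uses a synthetic spectrum, a crude root search for DP, and far fewer samples than the paper's 10^6, so all concrete values are illustrative assumptions:

```python
import numpy as np

# For many noise realizations, select alpha by DP / SURE / GSURE on a log-grid
# via the spectral formulas (10)-(12) and record the resulting l2-error.
rng = np.random.default_rng(2)
m, sigma, n_samples = 64, 0.1, 2000
gamma = np.exp(-0.15 * np.arange(m))             # assumed singular-value decay
x_star = rng.standard_normal(m)
grid = np.logspace(-10, 2, 400)

g2 = gamma[:, None] ** 2                         # shape (m, 1)
D = g2 + grid[None, :]                           # gamma_i^2 + alpha, shape (m, n_grid)
errors = {"DP": [], "SURE": [], "GSURE": []}
for _ in range(n_samples):
    y = gamma * x_star + sigma * rng.standard_normal(m)       # spectral model
    dp = np.sum(grid ** 2 / D ** 2 * y[:, None] ** 2, axis=0) - m * sigma ** 2
    sure = dp + 2 * sigma ** 2 * np.sum(g2 / D, axis=0)
    gsure = (np.sum((1 / gamma[:, None] - gamma[:, None] / D) ** 2 * y[:, None] ** 2, axis=0)
             - sigma ** 2 * np.sum(1 / gamma ** 2)
             + 2 * sigma ** 2 * np.sum(1 / D, axis=0))
    picks = {"DP": grid[np.argmin(np.abs(dp))],  # crude root of DP: smallest |DP|
             "SURE": grid[np.argmin(sure)],
             "GSURE": grid[np.argmin(gsure)]}
    for name, a in picks.items():
        x_a = gamma / (gamma ** 2 + a) * y       # Tikhonov estimator, spectral form
        errors[name].append(np.linalg.norm(x_a - x_star))

for name, e in errors.items():
    print(name, "median:", np.median(e), "max:", np.max(e))
```

Comparing the median and maximum errors already hints at the qualitative picture above: the risk-based rules do well on average but occasionally pick very small α, while DP is conservative but stable.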
3 Properties of the Parameter Choice Rules for Quadratic Regularization

In this section we consider the theoretical (risk) properties of SURE, GSURE and the discrepancy principle.

Assumption 1. For the sake of simplicity we only consider m = n in this first analysis. Furthermore, we assume

    1 = \gamma_1 \geq \ldots \geq \gamma_m > 0    (14)

and that ‖x^*‖²_2 = O(m).

Note that all assumptions are fulfilled in the numerical example we described in the previous section. We mention that we consider here a rather moderate size of the noise, which remains bounded in variance as m → ∞. A scaling corresponding to white noise in the infinite-dimensional limit would rather be σ² ∼ m, and an inspection of the estimates below shows that in such cases the risk estimate is potentially even farther from its expected value.

3.1 SURE-Risk

We start with an investigation of the well-known SURE risk estimate. Based on (11) and Stein's result, the representation of the risk is given as

    R_{SURE}(\alpha) = \mathbb{E}[SURE(\alpha, y)]
                     = \sum_i \frac{\alpha^2}{(\gamma_i^2 + \alpha)^2}\, \mathbb{E}[y_i^2] - \sigma^2 m + 2\sigma^2 \sum_i \frac{\gamma_i^2}{\gamma_i^2 + \alpha}
                     = \sum_i \frac{\alpha^2}{(\gamma_i^2 + \alpha)^2}\, \big(\gamma_i^2 (x^*_i)^2 + \sigma^2\big) - \sigma^2 m + 2\sigma^2 \sum_i \frac{\gamma_i^2}{\gamma_i^2 + \alpha}.    (15)
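As a sanity check of (15), the closed form can be compared against a Monte Carlo average of SURE(α, y) from (11); the toy spectrum, x^* and parameter values below are our own illustrative assumptions:

```python
import numpy as np

# The closed-form R_SURE(alpha) from (15) should agree with the Monte Carlo
# average of the SURE criterion (11) up to sampling error.
rng = np.random.default_rng(3)
m, sigma, alpha = 64, 0.1, 1e-2
gamma = np.exp(-0.15 * np.arange(m))
x_star = rng.standard_normal(m)
g2 = gamma ** 2

# closed form (15), using E[y_i^2] = gamma_i^2 (x_i*)^2 + sigma^2
risk = (np.sum(alpha**2 / (g2 + alpha)**2 * (g2 * x_star**2 + sigma**2))
        - sigma**2 * m + 2 * sigma**2 * np.sum(g2 / (g2 + alpha)))

# Monte Carlo average of SURE(alpha, y) over many noise realizations
n_mc = 50_000
y = gamma * x_star + sigma * rng.standard_normal((n_mc, m))
sure = (np.sum(alpha**2 / (g2 + alpha)**2 * y**2, axis=1) - m * sigma**2
        + 2 * sigma**2 * np.sum(g2 / (g2 + alpha)))
print(risk, sure.mean())   # the two values should agree up to Monte Carlo error
```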
