Ideal Spatial Adaptation by Wavelet Shrinkage

David L. Donoho and Iain M. Johnstone
Department of Statistics, Stanford University, Stanford, CA 94305-4065, U.S.A.

June 1992. Revised April 1993.

Abstract

With ideal spatial adaptation, an oracle furnishes information about how best to adapt a spatially variable estimator, whether piecewise constant, piecewise polynomial, variable-knot spline, or variable-bandwidth kernel, to the unknown function. Estimation with the aid of an oracle offers dramatic advantages over traditional linear estimation by nonadaptive kernels; however, it is a priori unclear whether such performance can be obtained by a procedure relying on the data alone. We describe a new principle for spatially-adaptive estimation: selective wavelet reconstruction. We show that variable-knot spline fits and piecewise-polynomial fits, when equipped with an oracle to select the knots, are not dramatically more powerful than selective wavelet reconstruction with an oracle. We develop a practical spatially adaptive method, RiskShrink, which works by shrinkage of empirical wavelet coefficients. RiskShrink mimics the performance of an oracle for selective wavelet reconstruction as well as it is possible to do so. A new inequality in multivariate normal decision theory, which we call the oracle inequality, shows that attained performance differs from ideal performance by at most a factor $\sim 2\log n$, where $n$ is the sample size. Moreover, no estimator can give a better guarantee than this. Within the class of spatially adaptive procedures, RiskShrink is essentially optimal. Relying only on the data, it comes within a factor $\log^2 n$ of the performance of piecewise polynomial and variable-knot spline methods equipped with an oracle. In contrast, it is unknown how, or if, piecewise polynomial methods could be made to function this well when denied access to an oracle and forced to rely on data alone.

Keywords: Minimax estimation subject to doing well at a point; orthogonal wavelet bases of compact support; piecewise-polynomial fitting; variable-knot spline.

1 Introduction

Suppose we are given data

$$y_i = f(t_i) + e_i, \qquad i = 1, \dots, n, \qquad (1)$$

$t_i = i/n$, where the $e_i$ are independently distributed as $N(0, \sigma^2)$, and $f(\cdot)$ is an unknown function which we would like to recover. We measure the performance of an estimate $\hat f(\cdot)$ in terms of quadratic loss at the sample points. In detail, let $f = (f(t_i))_{i=1}^n$ and $\hat f = (\hat f(t_i))_{i=1}^n$ denote the vectors of true and estimated sample values, respectively. Let $\|v\|_{2,n}^2 = \sum_{i=1}^n v_i^2$ denote the usual squared $\ell_n^2$ norm; we measure performance by the risk

$$R(\hat f, f) = n^{-1} E\,\|\hat f - f\|_{2,n}^2,$$

which we would like to make as small as possible. Although the notation $f$ suggests a function of a real variable $t$, in this paper we work only with the equally spaced sample points $t_i$.

1.1 Spatially Adaptive Methods

We are particularly interested in a variety of spatially adaptive methods which have been proposed in the statistical literature, such as CART (Breiman, Friedman, Olshen and Stone, 1983), Turbo (Friedman and Silverman, 1989), MARS (Friedman, 1991), and variable-bandwidth kernel methods (Müller and Stadtmüller, 1987). Such methods have presumably been introduced because they were expected to do a better job in recovery of the functions actually occurring with real data than do traditional methods based on a fixed spatial scale, such as Fourier series methods, fixed-bandwidth kernel methods, and linear spline smoothers. Informal conversations with Leo Breiman and Jerome Friedman have confirmed this assumption.
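To fix ideas, here is a minimal sketch of sampling from model (1) and estimating the risk $R(\hat f, f)$ by Monte Carlo. This is our illustration rather than code from the paper; the step function and the trivial estimator $\hat f = y$ (whose risk is exactly $\sigma^2$) are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_data(f, n, sigma):
    """Draw one realization of model (1): y_i = f(t_i) + e_i, t_i = i/n."""
    t = np.arange(1, n + 1) / n
    return t, f(t) + sigma * rng.normal(size=n)

def empirical_risk(estimator, f, n, sigma, reps=200):
    """Monte Carlo estimate of R(fhat, f) = n^{-1} E ||fhat - f||^2_{2,n}."""
    losses = []
    for _ in range(reps):
        t, y = sample_data(f, n, sigma)
        losses.append(np.mean((estimator(y) - f(t)) ** 2))
    return np.mean(losses)

# Example: a discontinuous step function and the trivial estimator fhat = y,
# which does no smoothing at all and so has risk exactly sigma^2.
f = lambda t: np.where(t < 0.5, 0.0, 4.0)
print(empirical_risk(lambda y: y, f, n=1024, sigma=1.0))  # ~1.0
```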
We now describe a simple framework which encompasses the most important spatially adaptive methods and allows us to develop our main theme efficiently. We consider estimates $\hat f$ defined as

$$\hat f(\cdot) = T(y, d(y))(\cdot) \qquad (2)$$

where $T(y, \delta)$ is a reconstruction formula with "spatial smoothing" parameter $\delta$, and $d(y)$ is a data-adaptive choice of the spatial smoothing parameter $\delta$. A clearer picture of what we intend emerges from five examples.

[1]. Piecewise Constant Reconstruction $T_{PC}(y,\delta)$. Here $\delta$ is a finite list of, say, $L$ real numbers defining a partition $(I_1, \dots, I_L)$ of $[0,1]$ via $I_1 = [0, \delta_1)$, $I_2 = [\delta_1, \delta_1+\delta_2)$, ..., $I_L = [\delta_1 + \cdots + \delta_{L-1},\ \delta_1 + \cdots + \delta_L]$, so that $\sum_{1}^{L} \delta_i = 1$. Note that $L$ is a variable. The reconstruction formula is

$$T_{PC}(y,\delta)(t) = \sum_{\ell=1}^{L} \mathrm{Ave}(y_i : t_i \in I_\ell)\, 1_{I_\ell}(t),$$

piecewise constant reconstruction using the mean of the data within each piece to estimate the pieces.

[2]. Piecewise Polynomials $T_{PP(D)}(y,\delta)$. Here the interpretation of $\delta$ is the same as in [1], only the reconstruction uses polynomials of degree $D$:

$$T_{PP(D)}(y,\delta)(t) = \sum_{\ell=1}^{L} \hat p_\ell(t)\, 1_{I_\ell}(t),$$

where $\hat p_\ell(t) = \sum_{k=0}^{D} a_k t^k$ is determined by applying the least-squares principle to the data arising from interval $I_\ell$:

$$\sum_{t_i \in I_\ell} (\hat p_\ell(t_i) - y_i)^2 = \min!$$

[3]. Variable-Knot Splines $T_{spl,D}(y,\delta)$. Here $\delta$ defines a partition as above, and on each interval of the partition the reconstruction formula is a polynomial of degree $D$, but now the reconstruction must be continuous and have continuous derivatives through order $D-1$. In detail, let $\tau_\ell$ be the left endpoint of $I_\ell$, $\ell = 1, \dots, L$. The reconstruction is chosen from among those piecewise polynomials $s(t)$ satisfying

$$\left(\frac{d}{dt}\right)^{\!k} s(\tau_\ell-) = \left(\frac{d}{dt}\right)^{\!k} s(\tau_\ell+)$$

for $k = 0, \dots, D-1$, $\ell = 2, \dots, L$; subject to this constraint, one solves

$$\sum_{i=1}^{n} (s(t_i) - y_i)^2 = \min!$$

[4]. Variable-Bandwidth Kernel Methods $T_{VK,2}(y,\delta)$. Now $\delta$ is a function on $[0,1]$; $\delta(t)$ represents the "bandwidth of the kernel at $t$"; the smoothing kernel $K$ is a $C^2$ function of compact support which is also a probability density, and if $\hat f = T_{VK,2}(y,\delta)$ then

$$\hat f(t) = \frac{1}{n} \sum_{i=1}^{n} y_i\, K\!\left(\frac{t - t_i}{\delta(t)}\right) \Big/ \delta(t). \qquad (3)$$

More refined versions of this formula would adjust $K$ for boundary effects near $t = 0$ and $t = 1$.

[5]. Variable-Bandwidth High-Order Kernels $T_{VK,D}(y,\delta)$, $D > 2$. Here $\delta$ is again the local bandwidth, and the reconstruction formula is as in (3), only $K(\cdot)$ is a $C^D$ function integrating to 1, with vanishing intermediate moments

$$\int t^j K(t)\, dt = 0, \qquad j = 1, \dots, D-1.$$

As $D > 2$, $K(\cdot)$ cannot be nonnegative.

These reconstruction techniques, when equipped with appropriate selectors of the spatial smoothing parameter $\delta$, duplicate essential features of certain well-known methods.

[1] The piecewise constant reconstruction formula $T_{PC}$, equipped with choice of partition $\delta$ by recursive partitioning and cross-validatory choice of "pruning constant" as described by Breiman, Friedman, Olshen and Stone (1983), results in the method CART applied to 1-dimensional data.
[2] The spline reconstruction formula $T_{spl,D}$, equipped with a backwards deletion scheme, models the methods of Friedman and Silverman (1989) and Friedman (1991) applied to 1-dimensional data.

[3] The kernel method $T_{VK,2}$, equipped with the variable bandwidth selector described in Brockmann, Gasser and Herrmann (1992), results in the "Heidelberg" variable bandwidth smoothing method. Compare also Terrell and Scott (1992).

These schemes are computationally feasible and intuitively appealing. However, very little is known about the theoretical performance of these adaptive schemes, at the level of uniformity in $f$ and $n$ that we would like.

1.2 Ideal Adaptation with Oracles

To avoid messy questions, we abandon the study of specific $\delta$-selectors and instead study ideal adaptation.

For us, ideal adaptation is the performance which can be achieved from smoothing with the aid of an oracle. Such an oracle will not tell us $f$, but will tell us, for our method $T(y,\delta)$, the "best" choice of $\delta$ for the true underlying $f$. The oracle's response is conceptually a selection $\Delta(f)$ which satisfies

$$R(T(y, \Delta(f)), f) = R_{n,\sigma}(T, f),$$

where $R_{n,\sigma}$ denotes the ideal risk

$$R_{n,\sigma}(T, f) = \inf_{\delta} R(T(y,\delta), f).$$

As $R_{n,\sigma}$ measures performance with a selection $\Delta(f)$ based on full knowledge of $f$ rather than a data-dependent selection $d(y)$, it represents an ideal we cannot expect to attain. Nevertheless it is the target we shall consider.

Ideal adaptation offers, in principle, considerable advantages over traditional nonadaptive linear smoothers. Consider the case of a function $f$ which is a piecewise polynomial of degree $D$, with a finite number of pieces $I_1, \dots, I_L$, say:

$$f = \sum_{\ell=1}^{L} p_\ell(t)\, 1_{I_\ell}(t). \qquad (4)$$

Assume that $f$ has discontinuities at some of the break-points $\tau_2, \dots, \tau_L$.

The risk of ideally adaptive piecewise polynomial fits is essentially $\sigma^2 L(D+1)/n$. Indeed, an oracle could supply the information that one should use $I_1, \dots, I_L$ rather than some other partition. Traditional least-squares theory says that, for data from the traditional linear model $Y = X\beta + E$, with noise $E_i$ independently distributed as $N(0,\sigma^2)$, the traditional least-squares estimator $\hat\beta$ satisfies

$$E\|X\beta - X\hat\beta\|_2^2 = (\text{number of parameters in } \beta) \cdot (\text{variance of noise}).$$

Applying this to our setting, fitting a function of the form (4) requires fitting $(\#\text{ pieces}) \cdot (\text{degree}+1)$ parameters, so for the risk $R(\hat f, f) = n^{-1} E\|\hat f - f\|_{2,n}^2$ we get $L(D+1)\sigma^2/n$, as advertised.

On the other hand, the risk of a spatially non-adaptive procedure is far worse. Consider kernel smoothing. Because $f$ has discontinuities, no kernel smoother with fixed, non-spatially-varying bandwidth attains a risk $R(\hat f, f)$ tending to zero faster than $C n^{-1/2}$, $C = C(f, \text{kernel})$. The same result holds for estimates in orthogonal series of polynomials or sinusoids, for smoothing splines with knots at the sample points, and for least-squares smoothing splines with knots equispaced.

Most strikingly, even for piecewise polynomial fits with equal-width pieces, we have that $R(\hat f, f)$ is of size $\asymp n^{-1/2}$ unless the breakpoints of $f$ form a subset of the breakpoints of $\hat f$. But this can happen only for very special $n$, so in any event

$$\limsup_{n \to \infty} R(\hat f, f)\, n^{1/2} \ge C > 0.$$

In short, oracles offer an improvement, ideally, from risk of order $n^{-1/2}$ to order $n^{-1}$. No better performance than this can be expected, since $n^{-1}$ is the usual "parametric rate" for estimating finite-dimensional parameters.
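To make the ideal-risk calculation concrete, here is a small Monte Carlo sketch (ours, not the paper's) of the piecewise-constant case $D = 0$: the oracle supplies the true partition, $T_{PC}$ averages within each piece, and the risk comes out at $L(D+1)\sigma^2/n = L\sigma^2/n$, the parametric rate.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 1024, 1.0
t = np.arange(1, n + 1) / n
breaks = np.array([0.0, 0.2, 0.5, 0.8, 1.0])   # oracle-supplied partition, L = 4 pieces
levels = np.array([0.0, 3.0, -2.0, 1.0])       # f is piecewise constant (D = 0)
f = levels[np.searchsorted(breaks, t) - 1]

def T_PC(y):
    """Piecewise-constant reconstruction on the oracle's partition."""
    fhat = np.empty_like(y)
    for l in range(len(levels)):
        piece = (t > breaks[l]) & (t <= breaks[l + 1])
        fhat[piece] = y[piece].mean()
    return fhat

risks = [np.mean((T_PC(f + sigma * rng.normal(size=n)) - f) ** 2) for _ in range(500)]
print(np.mean(risks), len(levels) * sigma**2 / n)  # both ~ L * sigma^2 / n = 0.0039
```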
Can we approach this ideal performance with estimators using the data alone?

1.3 Selective Wavelet Reconstruction as a Spatially Adaptive Method

A new principle for spatially adaptive estimation can be based on recently developed "wavelets" ideas. Introductions, historical accounts and references to much recent work may be found in the books by Daubechies (1992), Meyer (1990), Chui (1992) and Frazier, Jawerth and Weiss (1991). Orthonormal bases of compactly supported wavelets provide a powerful complement to traditional Fourier methods: they permit an analysis of a signal or image into localised oscillating components. In a statistical regression context, this spatially varying decomposition can be used to build algorithms that adapt their effective "window width" to the amount of local oscillation in the data. Since the decomposition is in terms of an orthogonal basis, analytic study in closed form is possible.

For the purposes of this paper, we discuss a finite, discrete wavelet transform. This transform, along with a careful treatment of boundary correction, has been described by Cohen, Daubechies, Jawerth and Vial (1993), with related work in Meyer (1991) and Malgouyres (1991). To focus attention on our main themes, we employ a simpler periodised version of the finite discrete wavelet transform in the main exposition. This version yields an exactly orthogonal transformation between data and wavelet coefficient domains. Brief comments on the minor changes needed for the boundary-corrected version are made in Section 4.6.

Suppose we have data $y = (y_i)_{i=1}^n$, with $n = 2^{J+1}$. For various combinations of parameters $M$ (number of vanishing moments), $S$ (support width), and $j_0$ (low-resolution cutoff), one may construct an $n$-by-$n$ orthogonal matrix $W$, the finite wavelet transform matrix. Actually there are many such matrices, depending on special filters: in addition to the original Daubechies wavelets there are the Coiflets and Symmlets of Daubechies (1993). For the figures in this paper we use the Symmlet with parameter $N = 8$. This has $M = 7$ vanishing moments and support length $S = 15$. This matrix yields a vector $w$ of the wavelet coefficients of $y$ via

$$w = Wy,$$

and because the matrix is orthogonal we have the inversion formula $y = W^T w$.

The vector $w$ has $n = 2^{J+1}$ elements. It is convenient to index dyadically $n - 1 = 2^{J+1} - 1$ of the elements following the scheme

$$w_{j,k}: \quad j = 0, \dots, J; \ k = 0, \dots, 2^j - 1,$$

and the remaining element we label $w_{-1,0}$. To interpret these coefficients, let $W_{jk}$ denote the $(j,k)$-th row of $W$. The inversion formula $y = W^T w$ becomes

$$y_i = \sum_{j,k} w_{j,k} W_{jk}(i),$$

expressing $y$ as a sum of basis elements $W_{jk}$ with coefficients $w_{j,k}$. We call the $W_{jk}$ wavelets. The vector $W_{jk}$, plotted as a function of $i$, looks like a localized wiggle, hence the name "wavelet". For $j$ and $k$ bounded away from extreme cases by the conditions

$$j_0 \le j < J - j_1, \qquad S < k < 2^j - S,$$

we have the approximation

$$W_{jk}(i) \approx n^{-1/2}\, 2^{j/2}\, \psi(2^j t - k), \qquad t = i/n,$$

where $\psi$ is a fixed "wavelet" in the sense of the usual wavelet transform on $\mathbb{R}$ (Meyer, 1990; Daubechies, 1988). This approximation improves with increasing $n$ and increasing $j_1$. Here $\psi$ is an oscillating function of compact support, usually called the mother wavelet. We therefore speak of $W_{jk}$ as being localized to spatial positions near $t = k 2^{-j}$ and frequencies near $2^j$.
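The paper's figures use the Symmlet-8 filter, whose matrix requires the Daubechies filter machinery. As a self-contained illustration of an orthogonal finite wavelet transform matrix $W$ and the inversion $y = W^T w$, here is the simplest case, the Haar wavelet ($M = 0$); this sketch is ours, and the row indexing follows the scheme above.

```python
import numpy as np

def haar_matrix(n):
    """Finite Haar wavelet transform matrix W for n = 2^(J+1).

    Row 0 is W_{-1,0} (the constant); then rows W_{j,k}, j = 0..J, k = 0..2^j - 1,
    sampled as 2^(j/2) n^(-1/2) psi(2^j t - k), psi being the Haar mother wavelet.
    """
    J = int(np.log2(n)) - 1
    t = np.arange(1, n + 1) / n
    rows = [np.full(n, n ** -0.5)]
    for j in range(J + 1):
        for k in range(2 ** j):
            u = 2 ** j * t - k                      # rescaled time
            psi = np.where((0 < u) & (u <= 0.5), 1.0,
                           np.where((0.5 < u) & (u <= 1.0), -1.0, 0.0))
            rows.append(2 ** (j / 2) * n ** -0.5 * psi)
    return np.vstack(rows)

W = haar_matrix(8)
print(np.allclose(W @ W.T, np.eye(8)))  # True: W is orthogonal, so w = W y and y = W.T w
```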
The wavelet $\psi$ can have a smooth visual appearance if the parameters $M$ and $S$ are chosen sufficiently large and favorable choices of so-called quadrature mirror filters are made in the construction of the matrix $W$. Daubechies (1988) described a particular construction with $S = 2M + 1$ for which the smoothness (number of derivatives) of $\psi$ is proportional to $M$. For our purposes, the only details we need are:

[W1] $W_{jk}$ has vanishing moments through order $M$, as long as $j \ge j_0$:

$$\sum_{i=0}^{n-1} i^\ell\, W_{jk}(i) = 0, \qquad \ell = 0, \dots, M; \ j \ge j_0; \ k = 0, \dots, 2^j - 1.$$

[W2] $W_{jk}$ is supported in $[2^{J-j}(k-S),\ 2^{J-j}(k+S)]$, provided $j \ge j_0$.

Because of the spatial localization of wavelet bases, the wavelet coefficients allow one to easily answer the question "is there a significant change in the function near $t$?" by looking at the wavelet coefficients at levels $j = j_0, \dots, J$ at spatial indices $k$ with $k 2^{-j} \approx t$. If these coefficients are large, the answer is "yes."

Figure 1 displays four functions (Bumps, Blocks, HeaviSine and Doppler) which have been chosen because they caricature spatially variable functions arising in imaging, spectroscopy and other scientific signal processing. For all figures in this article, $n = 2048$. Figure 2 depicts the wavelet transforms of the four functions. The large coefficients occur exclusively near the areas of major spatial activity. This property suggests that a spatially adaptive algorithm could be based on the principle of selective wavelet reconstruction. Given a finite list $\delta$ of $(j,k)$ pairs, define $T_{SW}(y,\delta)$ by

$$T_{SW}(y,\delta) = \hat f = \sum_{(j,k) \in \delta} w_{j,k} W_{jk}. \qquad (5)$$

This provides reconstructions by selecting only a subset of the empirical wavelet coefficients.

Our motivation in proposing this principle is twofold. First, for a spatially inhomogeneous function, "most of the action" is concentrated in a small subset of $(j,k)$-space. Second, under the noise model underlying (1), noise contaminates all wavelet coefficients equally. Indeed, the noise vector $e = (e_i)$ is assumed to be a white noise; so its orthogonal transform $z = We$ is also a white noise. Consequently, the empirical wavelet coefficient

$$w_{j,k} = \theta_{j,k} + z_{j,k},$$

where $\theta = Wf$ is the wavelet transform of the noiseless data $f = (f(t_i))_{i=1}^n$. Every empirical wavelet coefficient therefore contributes noise of variance $\sigma^2$, but only a very few wavelet coefficients contribute signal. This is the heuristic of our method.

Ideal spatial adaptation can be defined for selective wavelet reconstruction in the obvious way. For the risk measure introduced at (1), the ideal risk is

$$R_{n,\sigma}(SW, f) = \inf_{\delta} R(T_{SW}(y,\delta), f),$$

with optimal spatial parameter $\Delta(f)$ a list of $(j,k)$ indices attaining

$$R(T_{SW}(y, \Delta(f)), f) = R_{n,\sigma}(SW, f).$$

Figures 3-6 depict the results of ideal wavelet adaptation for the four functions displayed in Figure 2. Figure 3 shows noisy versions of the four functions of interest; the signal-to-noise ratio $\|\mathrm{signal}\|_{2,n} / \|\mathrm{noise}\|_{2,n}$ is 7. Figure 4 shows the noisy data in the wavelet domain. Figure 5 shows the reconstruction by selective wavelet reconstruction using an oracle; Figure 6 shows the situation in the wavelet domain. Because the oracle helps us to select the important wavelet coefficients, the reconstructions are of high quality.
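A minimal sketch of oracle-assisted selective wavelet reconstruction, using PyWavelets; this is our illustration, with pywt's "sym8" filter standing in for the paper's Symmlet-8 and the periodised mode making the transform exactly orthogonal. The oracle's list $\Delta(f)$ keeps $(j,k)$ exactly when the true coefficient exceeds the noise level, anticipating the keep-or-kill analysis of Section 2.1.

```python
import numpy as np
import pywt  # PyWavelets

rng = np.random.default_rng(2)
n, sigma = 2048, 1.0
t = np.arange(1, n + 1) / n
f = 4 * np.sin(4 * np.pi * t) - np.sign(t - 0.3) - np.sign(0.72 - t)  # HeaviSine
y = f + sigma * rng.normal(size=n)

# Periodised transform: the transform is orthogonal, so z = W e is again white noise.
theta = pywt.wavedec(f, "sym8", mode="periodization")  # true coefficients (oracle's knowledge)
w = pywt.wavedec(y, "sym8", mode="periodization")      # empirical coefficients

# Oracle selection Delta(f): keep (j,k) exactly when |theta_{j,k}| > sigma;
# the coarse (low-resolution) coefficients w[0] are kept outright.
selected = [w[0]] + [wj * (np.abs(tj) > sigma) for wj, tj in zip(w[1:], theta[1:])]
fhat = pywt.waverec(selected, "sym8", mode="periodization")

print(np.mean((fhat - f) ** 2), np.mean((y - f) ** 2))  # oracle risk is far below sigma^2
```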
The theoretical benefits of ideal wavelet selection can again be seen in the case (4) where $f$ is a piecewise polynomial of degree $D$. Suppose we use a wavelet basis with parameter $M \ge D$. Then properties [W1] and [W2] imply that the wavelet coefficients $\theta_{j,k}$ of $f$ all vanish except for

(i) coefficients at the coarse levels $0 \le j < j_0$;

(ii) coefficients at $j_0 \le j \le J$ whose associated interval $[2^{-j}(k-S),\ 2^{-j}(k+S)]$ contains a breakpoint of $f$.

There are a fixed number $2^{j_0}$ of coefficients satisfying (i), and, in each resolution level $j$, $(\theta_{j,k},\ k = 0, \dots, 2^j - 1)$, at most $(\#\text{ breakpoints}) \cdot (2S+1)$ satisfying (ii). Consequently, with $L$ denoting again the number of pieces in (4), we have

$$\#\{(j,k) : \theta_{j,k} \ne 0\} \le 2^{j_0} + (J + 1 - j_0)(2S+1) L.$$

Let $\delta^* = \{(j,k) : \theta_{j,k} \ne 0\}$. Then, because of the orthogonality of the $(W_{jk})$, $\sum_{(j,k) \in \delta^*} w_{j,k} W_{jk}$ is the least-squares estimate of $f$, and

$$R(T(y, \delta^*), f) = n^{-1}\, \#(\delta^*)\, \sigma^2 \le (C_1 + C_2 J) L \sigma^2 / n \quad \text{for all } n = 2^{J+1}, \qquad (6)$$

with certain constants $C_1$, $C_2$, depending linearly on $S$, but not on $f$. Hence

$$R_{n,\sigma}(SW, f) = O\!\left(\frac{\sigma^2 \log n}{n}\right) \qquad (7)$$

for every piecewise polynomial of degree $D \le M$. This is nearly as good as the bound $\sigma^2 L(D+1) n^{-1}$ of ideal piecewise polynomial adaptation, and considerably better than the rate $n^{-1/2}$ of usual nonadaptive linear methods.

1.4 Near-Ideal Spatial Adaptation by Wavelets

Of course, calculations of ideal risk which point to the benefit of ideal spatial adaptation prompt the question: how nearly can one approach ideal performance when no oracle is available and we must rely on the data alone, with no side information about $f$? The benefit of the wavelet framework is that we can answer such questions precisely. In Section 2 of this paper we develop new inequalities in multivariate decision theory which furnish an estimate $\hat f^*$ which, when presented with data $y$ and knowledge of the noise level $\sigma^2$, obeys

$$R(\hat f^*, f) \le (2 \log n + 1)\left\{ R_{n,\sigma}(SW, f) + \frac{\sigma^2}{n} \right\} \qquad (8)$$

for every $f$, every $n = 2^{J+1}$, and every $\sigma$.

Thus, in complete generality, it is possible to come within a $2 \log n$ factor of the performance of ideal wavelet adaptation. In small samples $n$, the factor $(2 \log n + 1)$ can be replaced by a constant which is much smaller: e.g., 5 will do if $n \le 256$; 10 will do if $n \le 16384$. On the other hand, no radically better performance is possible: to get an inequality valid for all $f$, all $\sigma$, and all $n$, we cannot even change the constant 2 to $2 - \epsilon$ and still have (8) hold, neither by $\hat f^*$ nor by any other measurable estimator sequence.

To illustrate the implications, Figures 7 and 8 show the situation for the four basic examples, with an estimator $\tilde f_n^*$ which has been implemented on the computer, as described in Section 2.3 below. The result, while slightly noisier than the ideal estimate, is still of good quality, and requires no oracle.

The theoretical properties are also interesting. Our method has the property that for every piecewise polynomial (4) of degree $D \le M$ with at most $L$ pieces,

$$R(\hat f^*, f) \le (C_1 + C_2 \log n)(2 \log n + 1) L \sigma^2 / n,$$

where $C_1$ and $C_2$ are as in (6); this result is merely a combination of (7) and (8). Hence in this special case we have an actual estimator coming within $C \log^2 n$ of ideal piecewise polynomial fits.
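A sketch of an oracle-free estimator in this spirit, again using PyWavelets (our illustration, not the authors' RiskShrink code): soft-threshold the empirical wavelet coefficients at the universal level $\sigma (2 \log n)^{1/2}$ of Theorem 1 below; RiskShrink proper uses the slightly smaller minimax threshold $\lambda_n^*$ of Theorem 2.

```python
import numpy as np
import pywt

def soft(w, lam):
    """Soft threshold: eta_S(w, lambda) = sgn(w) * (|w| - lambda)_+."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def wavelet_shrink(y, sigma, wavelet="sym8"):
    """Transform, soft-threshold at sigma * sqrt(2 log n), invert."""
    lam = sigma * np.sqrt(2 * np.log(len(y)))
    w = pywt.wavedec(y, wavelet, mode="periodization")
    w = [w[0]] + [soft(wj, lam) for wj in w[1:]]      # leave coarse level alone
    return pywt.waverec(w, wavelet, mode="periodization")

# Example: denoise a noisy Doppler-like signal without any oracle.
rng = np.random.default_rng(5)
t = np.arange(1, 2049) / 2048
f = np.sqrt(t * (1 - t)) * np.sin(2.1 * np.pi / (t + 0.05))
y = f + 0.1 * rng.normal(size=t.size)
print(np.mean((wavelet_shrink(y, 0.1) - f) ** 2))     # typically well below the noise variance 0.01
```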
1.5 Universality of Wavelets as a Spatially Adaptive Procedure

This last calculation is not essentially limited to piecewise polynomials; something like it holds for all $f$. In Section 3 we show that, for constants $C_i$ not depending on $f$, $n$, or $\sigma$,

$$R_{n,\sigma}(SW, f) \le (C_1 + C_2 J)\, R_{n,\sigma}(PP(D), f)$$

for every $f$, every $n = 2^{J+1}$ and every $\sigma > 0$. We interpret this result as saying that selective wavelet reconstruction is essentially as powerful as variable-partition piecewise constant fits, variable-knot least-squares splines, or piecewise polynomial fits. Suppose that the function $f$ is such that, furnished with an oracle, piecewise polynomials, piecewise constants, or variable-knot splines would improve the rate of convergence over traditional fixed-bandwidth kernel methods, say from rate of convergence $n^{-r_1}$ (with fixed bandwidth) to $n^{-r_2}$, $r_2 > r_1$. Then, furnished with an oracle, selective wavelet adaptation offers an improvement to $\log^2 n \cdot n^{-r_2}$; this is essentially the same benefit at the level of rates.

We know of no proof that existing procedures for fitting piecewise polynomials and variable-knot splines, such as those current in the statistical literature, can attain anything like the performance of ideal methods. In contrast, for selective wavelet reconstruction, it is easy to offer performance comparable to that with an oracle, using the estimator $\hat f^*$. And wavelet selection with an oracle offers the advantages of other spatially-variable methods.

The main assertion of this paper is therefore that, from this (theoretical) perspective, it is cleaner and more elegant to abandon the ideal of fitting piecewise polynomials with optimal partitions, and turn instead to RiskShrink, about which we have theoretical results, and an order $O(n)$ algorithm.

1.6 Contents

Section 2 discusses the problem of mimicking ideal wavelet selection; Section 3 shows why wavelet selection offers the same advantages as piecewise polynomial fits; Section 4 discusses variations and relations to other work. Section 5 contains certain proofs. Related manuscripts by the authors, currently under publication review and available as LaTeX files by anonymous ftp from playfair.stanford.edu, are cited in the text by [filename.tex].

2 Decision Theory and Spatial Adaptation

In this section we solve a new problem in multivariate normal decision theory and apply it to function estimation.

2.1 Oracles for Diagonal Linear Projection

Consider the following problem from multivariate normal decision theory. We are given $n$ observations $w = (w_i)_{i=1}^n$ according to

$$w_i = \theta_i + \epsilon z_i, \qquad i = 1, \dots, n, \qquad (9)$$

where the $z_i$ are independent and identically distributed as $N(0,1)$, $\epsilon > 0$ is the (known) noise level, and $\theta = (\theta_i)$ is the object of interest. We wish to estimate with $\ell^2$-loss and so define the risk measure

$$R(\hat\theta, \theta) = E\|\hat\theta - \theta\|_{2,n}^2. \qquad (10)$$

We consider a family of diagonal linear projections:

$$T_{DP}(w, \delta) = (\delta_i w_i)_{i=1}^n, \qquad \delta_i \in \{0, 1\}.$$

Such estimators "keep" or "kill" each coordinate. Suppose we had available an oracle which would supply for us the coefficients $\Delta_{DP}(\theta)$ optimal for use in the diagonal projection scheme. These ideal coefficients are $\delta_i = 1_{\{|\theta_i| > \epsilon\}}$, meaning that ideal diagonal projection consists in estimating only those $\theta_i$ larger than the noise level.
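To see why these coefficients are ideal: keeping coordinate $i$ costs $E(w_i - \theta_i)^2 = \epsilon^2$ in pure noise, while killing it costs $\theta_i^2$ in pure bias, so the oracle keeps exactly when $\theta_i^2 > \epsilon^2$ and pays $\min(\theta_i^2, \epsilon^2)$ per coordinate, the ideal-risk formula displayed next. A quick numerical check of this reasoning (our sketch, with an arbitrary $\theta$):

```python
import numpy as np

def oracle_coefficients(theta, eps):
    """Oracle rule Delta_DP(theta): keep coordinate i exactly when |theta_i| > eps."""
    return (np.abs(theta) > eps).astype(float)

rng = np.random.default_rng(3)
theta, eps = np.array([3.0, 0.1, 1.5, 0.0]), 1.0
w = theta + eps * rng.normal(size=(100_000, theta.size))
delta = oracle_coefficients(theta, eps)

# Per-coordinate risk of the keep/kill rule: eps^2 where kept, theta^2 where killed.
print(np.mean((delta * w - theta) ** 2, axis=0))  # ~ [1.0, 0.01, 1.0, 0.0]
print(np.minimum(theta ** 2, eps ** 2))           # coordinatewise min(theta^2, eps^2)
```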
Supplied with such coefficients, we would attain the ideal risk

$$R_\epsilon(DP, \theta) = \sum_{i=1}^{n} \rho_T(|\theta_i|, \epsilon)$$

with $\rho_T(\tau, \sigma) = \min(\tau^2, \sigma^2)$. In general the ideal risk $R_\epsilon(DP, \theta)$ cannot be attained for all $\theta$ by any estimator, linear or nonlinear. However, surprisingly simple estimates do come remarkably close.

Motivated by the idea that only very few wavelet coefficients contribute signal, we consider threshold rules, which retain only observed data that exceed a multiple of the noise level. Define 'hard' and 'soft' threshold nonlinearities by

$$\eta_H(w, \lambda) = w\, 1_{\{|w| > \lambda\}} \qquad (11)$$

$$\eta_S(w, \lambda) = \mathrm{sgn}(w)\, (|w| - \lambda)_+. \qquad (12)$$

The hard threshold rule is reminiscent of subset selection rules used in model selection, and we return to it later. For now, we focus on soft thresholding.

Theorem 1. Assume model (9)-(10). The estimator

$$\hat\theta_i^u = \eta_S(w_i,\ \epsilon (2 \log n)^{1/2}), \qquad i = 1, \dots, n,$$

satisfies

$$E\|\hat\theta^u - \theta\|_{2,n}^2 \le (2 \log n + 1)\left\{ \epsilon^2 + \sum_{i=1}^{n} \min(\theta_i^2, \epsilon^2) \right\} \quad \text{for all } \theta \in \mathbb{R}^n. \qquad (13)$$

In "oracular" notation, we have

$$R(\hat\theta^u, \theta) \le (2 \log n + 1)(\epsilon^2 + R_\epsilon(DP, \theta)), \qquad \theta \in \mathbb{R}^n.$$

Now $\epsilon^2$ denotes the mean-squared loss for estimating one parameter unbiasedly, so the inequality says that we can mimic the performance of an oracle, plus one extra parameter, to within a factor of essentially $2 \log n$.

A short proof appears in the Appendix. However, it is natural and more revealing to look for 'optimal' thresholds $\lambda_n^*$ which yield the smallest possible constant $\Lambda_n^*$ in place of $2 \log n + 1$ among soft threshold estimators. We give the result here and outline the approach in Section 2.4.

Theorem 2. Assume model (9)-(10). The minimax threshold $\lambda_n^*$, defined at (20) and solving (22) below, yields an estimator

$$\hat\theta_i^* = \eta_S(w_i, \lambda_n^* \epsilon), \qquad i = 1, \dots, n, \qquad (14)$$

which satisfies

$$E\|\hat\theta^* - \theta\|_{2,n}^2 \le \Lambda_n^* \left\{ \epsilon^2 + \sum_{i=1}^{n} \min(\theta_i^2, \epsilon^2) \right\} \quad \text{for all } \theta \in \mathbb{R}^n. \qquad (15)$$

The coefficient $\Lambda_n^*$, defined at (19), satisfies $\Lambda_n^* \le 2 \log n + 1$, and the threshold satisfies $\lambda_n^* \le (2 \log n)^{1/2}$. Asymptotically,

$$\Lambda_n^* \sim 2 \log n, \qquad \lambda_n^* \sim (2 \log n)^{1/2}, \qquad n \to \infty.$$

Table 1 shows that this constant $\Lambda_n^*$ is much smaller than $2 \log n + 1$ when $n$ is on the order of a few hundred. For $n = 256$, we get $\Lambda_n^* \approx 4.44$. For large $n$, however, the $\sim 2 \log n$ upper bound is sharp. This holds even if we extend from soft coordinatewise thresholds to allow completely arbitrary estimator sequences into contention.
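The two nonlinearities (11)-(12) and an empirical check of the oracle inequality (13) are easy to code; the sparse $\theta$ below is an arbitrary choice of ours, meant to mimic the "few coefficients carry signal" regime:

```python
import numpy as np

def eta_hard(w, lam):
    """Hard threshold (11): keep w where |w| > lambda, else 0."""
    return np.where(np.abs(w) > lam, w, 0.0)

def eta_soft(w, lam):
    """Soft threshold (12): shrink |w| by lambda, kill everything within [-lambda, lambda]."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Empirical check of (13) for one sparse theta.
rng = np.random.default_rng(4)
n, eps = 1024, 1.0
theta = np.zeros(n)
theta[:16] = 5.0                                # only a few coordinates carry signal
w = theta + eps * rng.normal(size=(2000, n))
thetahat = eta_soft(w, eps * np.sqrt(2 * np.log(n)))
lhs = np.mean(np.sum((thetahat - theta) ** 2, axis=1))
rhs = (2 * np.log(n) + 1) * (eps ** 2 + np.sum(np.minimum(theta ** 2, eps ** 2)))
print(lhs, "<=", rhs)   # the risk sits below the (2 log n + 1) oracle bound
```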
