Asymptotic Equivalence for Nonparametric Regression with Non-Regular Errors

Alexander Meister
Institut für Mathematik, Universität Rostock
Ulmenstraße 69, 18051 Rostock, Germany
e-mail: [email protected]

Markus Reiß
Institut für Mathematik, Humboldt-Universität zu Berlin
Unter den Linden 6, 10099 Berlin, Germany
e-mail: [email protected]

arXiv:1101.5248v1 [math.ST] 27 Jan 2011

Abstract

Asymptotic equivalence in Le Cam’s sense for nonparametric regression experiments is extended to the case of non-regular error densities, which have jump discontinuities at their endpoints. We prove asymptotic equivalence of such regression models and the observation of two independent Poisson point processes which contain the target curve as the support boundary of its intensity function. The intensity of the point processes is of order of the sample size $n$ and involves the jump sizes as well as the design density. The statistical model significantly differs from regression problems with Gaussian or regular errors, which are known to be asymptotically equivalent to Gaussian white noise models.

2010 Mathematics Subject Classification: 62B15; 62G08; 62M30.
Keywords: Extreme value statistics; frontier estimation; Le Cam distance; Le Cam equivalence; Poisson point processes.

1. Introduction

The goal of transforming nonparametric regression models into asymptotically equivalent statistical experiments, which describe continuous observations of a stochastic process, has stimulated considerable research activity in mathematical statistics. The continuous design in these limiting models simplifies the asymptotic analysis and makes statistical procedures more transparent because in the regression case the discrete design points generate distracting approximation errors. Most papers so far establish asymptotic equivalence of certain nonparametric regression models with nonparametric Gaussian shift experiments.
In that Gaussian white noise experiment, a process is observed which contains the target function in its drift and a blurring Wiener process which is scaled with a factor of order $n^{-1/2}$, where $n$ denotes the original sample size. The basic equivalence result for standard Gaussian regression with deterministic design has been established by Brown and Low (1996). Afterwards, many important extensions have been achieved. The case of univariate random design has been treated by Brown et al. (2002). Carter (2007) considers the case of unknown error variance and design density, and Reiß (2008) extends the results to the multivariate setting. Recently, the model with dependent regression errors has been investigated in Carter (2009). The work by Grama and Nussbaum (1998) is the first to consider the important case of non-Gaussian errors which are, however, supposed to be included in an exponential family. Such classes of error distributions are also studied in Brown et al. (2010) where the regression error is supposed to be non-additive. General regular distributions for the additive error variables are covered in Grama and Nussbaum (2002) where only slightly more than standard Hellinger differentiability is required for the error density.

On the other hand, when allowing for jump discontinuities of the error density, the situation changes completely. Standard examples include uniform or exponential error densities. These types of error distributions are non-regular and we know from parametric theory that better rates of convergence and non-Gaussian limit distributions can be expected. The faster convergence rates are attained only by specific estimators, e.g. those employing extreme value statistics in their construction instead of local averaging statistics. The Nadaraya-Watson estimator and the local polynomial estimators are procedures of the latter type, which can be improved significantly under non-regular errors.
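The parametric phenomenon mentioned above can be illustrated by a minimal Python sketch for the simplest location problem with U[-1, 1] errors. It compares the sample mean (a local averaging statistic) with the midrange (an extreme value statistic). The function name `location_errors`, the number of repetitions and the seed are assumptions made for this illustration only; they do not appear in the paper.

```python
import random

def location_errors(n, reps=200, seed=0):
    """Monte Carlo mean absolute error of the sample mean and of the
    midrange for n i.i.d. U[-1, 1] errors around the true location 0."""
    rng = random.Random(seed)
    mean_err = 0.0
    midrange_err = 0.0
    for _ in range(reps):
        sample = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        # Averaging statistic: error of order n^(-1/2).
        mean_err += abs(sum(sample) / n)
        # Extreme value statistic: error of order n^(-1).
        midrange_err += abs((min(sample) + max(sample)) / 2.0)
    return mean_err / reps, midrange_err / reps

if __name__ == "__main__":
    for n in (100, 1000):
        print(n, location_errors(n))
```

In such a simulation the error of the midrange should shrink roughly by a factor of ten when $n$ grows by a factor of ten, whereas the error of the mean should shrink only by a factor of about three.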
Müller and Wefelmeyer (2010) establish improved minimax rates for regression functions which satisfy some Hölder condition. Hall and van Keilegom (2009) derive a rigorous theory for the optimal convergence rates for nonparametric regression under non-regular errors and smoothness constraints up to regularity one on the target regression function. Their nonparametric minimax rates in dimension one are of the form $n^{-s/(s+1)}$ for Hölder regularity $s$, which is faster than the usual $n^{-s/(2s+1)}$-rate for regular regression, but slower than $n^{-2s/(2s+1)}$, the squared regular rate in analogy with the parametric rates. At first sight, this is counter-intuitive, but it may be explained by a Poisson instead of a Gaussian limiting law.

Many applications of non-regular regression models occur in the field of econometrics, see Chernozhukov and Hong (2004) for an overview and a precise asymptotic investigation of the parametric likelihood ratio process. Irregular regression problems are also closely related to nonparametric boundary estimation in image reconstruction, see the monograph of Korostelev and Tsybakov (1993). The problem of frontier estimation has also attracted considerable interest, see Gijbels et al. (1999) and the references therein.

In Janssen and Marohn (1994) weak asymptotic equivalence of the extreme order statistics of a one-dimensional localization problem with non-regular errors and a Poisson point process model is derived in a parametric setup. Also for the precise asymptotic analysis of regression experiments with non-regular errors the use of Poisson point processes and random measures turns out to be useful, see e.g. Knight (2001) for parametric linear models and Chernozhukov and Hong (2004) for general parametric regression, yet a precise nonparametric statement is lacking. We intend to fill this gap by rigorously proving asymptotic equivalence of nonparametric regression experiments with non-regular errors with a Poisson point process (PPP) model.
Therein the target parameter occurs as the boundary curve of the intensity function. Hence, the Gaussian structure of the process experiment is not kept; nor is the scaling factor $n^{-1/2}$, which will be changed into $n^{-1}$ in agreement with the parametric rate.

For a comprehensive review of PPP and their statistical inference we refer to Karr (1991) and Kutoyants (1998). They discuss image reconstruction from laser radar as a practical application of support estimation of the intensity function of a PPP, which corresponds to identifying the target parameter in our PPP experiment. The asymptotic equivalence result therefore links interesting inference questions in both models, which might prove useful in both directions.

For the basic concept of asymptotic equivalence of statistical experiments we refer to Le Cam (1964) and Le Cam and Yang (2000). To grasp the impact let us just mention that asymptotic equivalence between two sequences of statistical models transfers asymptotic risk bounds for any inference problem from one model to the other, at least for bounded loss functions. Moreover, asymptotic equivalence remains valid for the sub-experiments obtained by restricting the parameter class so that we shall also cover smoother nonparametric or just parametric regression problems.

The paper is organized as follows. In Section 2 we introduce our models, state our main result in Theorem 2.1 and give a constructive description of the equivalence maps. In Section 3 we construct pilot estimators of the target functions which will be employed to localize the model in Sections 4 and 6. The findings of Section 5 yield asymptotic equivalence of the PPP experiment and the regression model when the target functions are changed into approximating step functions. In Section 7 all the results are combined to complete the proof of Theorem 2.1.
Section 8 discusses limitations and extensions of the results and gives a geometric explanation of the unexpected nonparametric minimax rate for Hölder classes.

2. Model and main result

In this section we specify the statistical experiments under consideration. First we define the joint parameter space $\Theta$ of both the regression and the PPP experiment, imposing standard smoothness constraints on the target function.

Definition 2.1. For some constants $C_\Theta > 0$ and $\alpha \in (0,1]$ the parameter set $\Theta$ consists of all functions $\vartheta : [0,1] \to \mathbb{R}$ which are twice continuously differentiable on $[0,1]$ with $\|\vartheta\|_\infty \le C_\Theta$ and $\|\vartheta''\|_\infty \le C_\Theta$ and where the second derivative satisfies the Hölder condition
\[
|\vartheta''(x) - \vartheta''(y)| \le C_\Theta |x-y|^{\alpha}, \qquad \forall x,y \in [0,1].
\]

In the regression model $\Theta$ represents the collection of all admitted regression functions. This parameter space will remain unchanged for all experiments considered here.

Definition 2.2. We define the statistical experiment $A_n$ in which the data $Y_{j,n}$, $j = 1,\dots,n$, with
\[
Y_{j,n} = \vartheta(x_{j,n}) + \varepsilon_{j,n} \tag{2.1}
\]
are observed. The deterministic design points $x_{1,n},\dots,x_{n,n} \in [0,1]$ are assumed to satisfy
\[
x_{j,n} = F_D^{-1}\big((j-1)/(n-1)\big), \tag{2.2}
\]
where the distribution function $F_D : [0,1] \to [0,1]$ possesses a Lipschitz continuous Lebesgue density $f_D$ which is uniformly bounded away from zero. The regression errors $\varepsilon_{j,n}$ are assumed to be i.i.d. with error density $f_\varepsilon : [-1,1] \to \mathbb{R}^+$, which is Lipschitz continuous and strictly positive.

The conditions on the design are adopted from Brown and Low (1996). They imply that
\[
d^{-1}/n \le x_{j+1,n} - x_{j,n} \le d/n, \tag{2.3}
\]
for all $n \in \mathbb{N}$, $j = 1,\dots,n-1$ and a finite positive constant $d$. The error model describes the class of densities which are supported on $[-1,1]$, regular within $(-1,1)$ and which have jumps at their left and right endpoints.
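Experiment $A_n$ of Definition 2.2 can be simulated along the following lines. This is a minimal sketch under simplifying assumptions: the errors are taken to be uniform on $[-1,1]$ (one admissible choice of $f_\varepsilon$), and the default design is equidistant, i.e. $F_D^{-1}(u) = u$. The names `simulate_regression` and `inv_design_cdf` are hypothetical and introduced only for this illustration; the test function is the one the paper uses in its Figure 1.

```python
import math
import random

def simulate_regression(theta, n, inv_design_cdf=lambda u: u, seed=0):
    """Draw (x_{j,n}, Y_{j,n}), j = 1..n, from experiment A_n.

    theta          : regression function on [0, 1]
    inv_design_cdf : F_D^{-1}; the identity gives an equidistant design
    Errors are U[-1, 1] (an assumption of this sketch).
    """
    rng = random.Random(seed)
    # Design points x_{j,n} = F_D^{-1}((j - 1) / (n - 1)), see (2.2).
    xs = [inv_design_cdf((j - 1) / (n - 1)) for j in range(1, n + 1)]
    # Observations Y_{j,n} = theta(x_{j,n}) + eps_{j,n}, see (2.1).
    ys = [theta(x) + rng.uniform(-1.0, 1.0) for x in xs]
    return xs, ys

# Example: the regression function used for Figure 1, with n = 100.
theta = lambda x: 0.3 * x * math.cos(10.0 * x)
xs, ys = simulate_regression(theta, n=100)
```

For the equidistant design the spacings equal $1/(n-1)$, so that (2.3) holds with, for instance, $d = 2$.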
Note that by constant extrapolation the density $f_\varepsilon$ on $[-1,1]$ can always be written as
\[
f_\varepsilon(x) = 1_{[-1,1]}(x)\cdot\varphi(x),
\]
with a strictly positive Lipschitz continuous function $\varphi : \mathbb{R} \to \mathbb{R}$ satisfying for some constant $C_\varepsilon > 0$
\[
\sup_{t\neq s}\frac{|\varphi(t)-\varphi(s)|}{|t-s|} + \sup_{t}|\varphi(t)| \le C_\varepsilon. \tag{2.4}
\]
Instead of constant extrapolation, $\varphi$ may alternatively be continued such that $\varphi \in L_1(\mathbb{R})$ holds in addition.

Hence, experiment $A_n$ describes a non-regular nonparametric regression model. We believe that the regularity condition on $f_\varepsilon$ in the interior $(-1,1)$ can be substantially relaxed, but at the cost of more involved estimation techniques. We have restricted our consideration to the specific interval $[-1,1]$ for convenience.

In the PPP model the target function $\vartheta$ occurs as upper and lower boundary curves of the intensity functions of two independent Poisson point processes $X_1$ and $X_2$.

Definition 2.3. For functions $\vartheta \in \Theta$, the design density $f_D$ and the noise density $f_\varepsilon$ from above we define the experiment $B_n$ in which we observe two independent Poisson point processes $X_j$, $j = 1,2$, on the rectangle $S = [0,1] \times [-C_\Theta - 1, C_\Theta + 1] \subset \mathbb{R}^2$ with respective intensity functions
\[
\begin{aligned}
\lambda_1(x,y) &= f_D(x)\cdot 1_{[-C_\Theta-1,\,\vartheta(x)]}(y)\cdot n f_\varepsilon(1),\\
\lambda_2(x,y) &= f_D(x)\cdot 1_{[\vartheta(x),\,C_\Theta+1]}(y)\cdot n f_\varepsilon(-1),
\end{aligned} \tag{2.5}
\]
for all $(x,y) \in S$.

Each realisation $X_j$ represents a measure mapping from the Borel subsets of $S$ to $\mathbb{N} \cup \{0\}$. Equivalently, $X_j(\cdot)/X_j(S)$ may be characterized by a two-dimensional discrete probability distribution, see Karr (1991) or Kutoyants (1998) for more details on PPP. Thus, the underlying action space can be taken as a Polish space (e.g. the separable Banach space $L_1(S)$) such that asymptotic equivalence can be established by Markov kernels.

Figure 1 shows on the left the regression function $\vartheta(x) = \frac{3}{10}\,x\cos(10x)$ and corresponding $n = 100$ equidistant observations of $A_n$ corrupted by uniform noise on $[-1,1]$.

[Figure 1. Left: Regression model $A_n$ with uniform U[−1,1] errors. Right: Equivalent Poisson point process model $B_n$.]
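A realisation of experiment $B_n$ from Definition 2.3, of the kind displayed in Figure 1, can be generated by the following minimal Python sketch. It assumes a uniform design density $f_D = 1$ on $[0,1]$ and relies on the fact that the vertical distances of the points from the boundary curve, taken in increasing order, are cumulative sums of independent exponential variables with mean $(n f_\varepsilon(\pm 1))^{-1}$, paired with independent horizontal positions drawn from $f_D$; restricting the resulting process to $S$ again gives a PPP. The names `simulate_ppp`, `jump_right`, `jump_left` and the value of `c_theta` in the example are assumptions of this illustration; in particular `c_theta = 0.5` is chosen for plotting convenience and is not claimed to be a constant $C_\Theta$ for which this $\vartheta$ lies in $\Theta$.

```python
import math
import random

def simulate_ppp(theta, n, c_theta, jump_right=0.5, jump_left=0.5, seed=0):
    """Sample (X_1, X_2) of experiment B_n for a uniform design density.

    jump_right = f_eps(1), jump_left = f_eps(-1); both equal 1/2 for
    U[-1, 1] errors. Returns two lists of points (x, y) in S.
    """
    rng = random.Random(seed)
    lower, upper = -c_theta - 1.0, c_theta + 1.0

    def one_sided(rate, sign):
        points, depth = [], 0.0
        while True:
            # Next distance from the boundary: Exp increments of mean 1/rate.
            depth += rng.expovariate(rate)
            if depth > upper - lower:   # all further points fall outside S
                break
            x = rng.random()            # horizontal position ~ f_D = U[0, 1)
            y = theta(x) + sign * depth
            if lower <= y <= upper:     # restriction to the rectangle S
                points.append((x, y))
        return points

    x1 = one_sided(n * jump_right, -1.0)  # points below the curve, intensity lambda_1
    x2 = one_sided(n * jump_left, +1.0)   # points above the curve, intensity lambda_2
    return x1, x2

theta = lambda x: 0.3 * x * math.cos(10.0 * x)
x1, x2 = simulate_ppp(theta, n=100, c_theta=0.5)
```

A non-uniform design could be handled by replacing `rng.random()` with a draw from $f_D$, for instance through $F_D^{-1}$ applied to a uniform variable.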
A realisation of the equivalent PPP model $B_n$ is shown on the right, with ’+’, ’-’ indicating point masses of $X_2$ and $X_1$, respectively.

We may conceive $X_j$ as the random point measure $\sum_{k=1}^{N_j} \delta_{(x_k^j, y_k^j)}$ where $N_j$ is drawn from a Poisson distribution with intensity $\|\lambda_j\|_{L_1(S)}$ and the $(x_k^j, y_k^j)$ are drawn according to the bivariate density $\lambda_j/\|\lambda_j\|_{L_1(S)}$. The vertical bounds $\pm(C_\Theta + 1)$ for the domain $S$ are non-informative for $\vartheta \in \Theta$, but the boundedness avoids technicalities. The equivalent unbounded PPP can be described by infinite random point measures $\sum_{k=1}^{\infty} \delta_{(x_k^j, y_k^j)}$ where the $x_k^j$ are drawn according to the density $f_D$ and
\[
y_k^1 = \vartheta(x_k^1) - (n f_\varepsilon(1))^{-1}\sum_{l=1}^{k} z_l^1, \qquad
y_k^2 = \vartheta(x_k^2) + (n f_\varepsilon(-1))^{-1}\sum_{l=1}^{k} z_l^2
\]
holds with exponentially distributed $(z_k^j)$ of mean one (all independent). In this form, the PPP already appears in Knight (2001), yielding the limiting law for parametric estimators in the non-regular linear model.

We present the main result of this work in the following theorem.

Theorem 2.1. The statistical experiments $A_n$ and $B_n$ are asymptotically equivalent in Le Cam’s sense as $n \to \infty$.

This asymptotic equivalence is achieved constructively by consecutive invertible (in law) and parameter-independent mappings of the data, which generate new experiments where the observation laws are shown to be asymptotically close (uniformly over $\vartheta$ in total variation norm). In order to highlight the main ideas in the subsequent proof and to indicate how to use our theoretical result in practice, let us give an algorithmic description of these equivalence mappings leading from experiment $A_n$ to experiment $B_n$ (in the version with unbounded domain).

(1) Take the data $Y_{j,n}$, $j = 1,\dots,n$, from experiment $A_n$.

(2) Split the data and bin one part: consider the odd indices $J_n := \{1,3,\dots,2\lceil n/2\rceil - 1\}$ and intervals $I_k = [k/m,(k+1)/m)$ with some appropriate $m$. Put $X_1 = (Y_{j+1,n})_{j\in J_n\setminus\{n\}}$ and $\bar Z = (\bar Z_j)_{j\in J_n}$ with
\[
\bar Z_j = Y_{j,n} - \hat\vartheta_1(\xi_j) - \hat\vartheta_1'(\xi_j)(x_{j,n} - \xi_j), \qquad j \in J_n,
\]
where $\xi_j$ is the centre of that interval $I_k$ with $x_{j,n} \in I_k$ and where $\hat\vartheta_1$ is a (good) estimator of $\vartheta$ based on the data $X_1$.

(3) Consider the local extremes in $\bar Z$, i.e. the block-wise minima and maxima $s_k = \min\{\bar Z_j : x_{j,n} \in I_k\}$, $S_k = \max\{\bar Z_j : x_{j,n} \in I_k\}$, $k = 0,\dots,m-1$.

(4) Use $\hat\vartheta_1$ on the data $X_1$ again to transform $s_k'' = s_k + \hat\vartheta_1(\xi_k) + 1$, $S_k'' = S_k + \hat\vartheta_1(\xi_k) - 1$, where $\xi_k$ denotes the centre of $I_k$.

(5) Randomization to build PPP $X_l$, $X_u$: on each interval $I_k$ generate $(x_k^l, y_k^l)$ with $x_k^l$ having the density $f_k = f_D 1_{I_k} / \int_{I_k} f_D$ independent of everything else and $y_k^l = S_k'' - \hat\vartheta_1'(x_k^l)(\xi_k - x_k^l)$; define the PPP $X_l$ where independently on each $I_k$ we observe a point measure in $(x_k^l, y_k^l)$ plus independently (conditionally on $S_k''$, $\hat\vartheta_1'$) a PPP with intensity
\[
\tfrac{n}{2} f_\varepsilon(1)\Big(m\int_{I_k} f_D\Big)\,1\{x \in I_k,\ y \le S_k'' - \hat\vartheta_1'(x)(\xi_k - x)\};
\]
analogously generate $x_k^u$ with the density $f_k$ independently, $y_k^u = s_k'' - \hat\vartheta_1'(x_k^u)(\xi_k - x_k^u)$ and use the intensity
\[
\tfrac{n}{2} f_\varepsilon(-1)\Big(m\int_{I_k} f_D\Big)\,1\{x \in I_k,\ y \ge s_k'' - \hat\vartheta_1'(x)(\xi_k - x)\}
\]
to build $X_u$ independently conditionally on $s_k''$, $\hat\vartheta_1'$.

(6) Use a (good) estimator $\hat\vartheta_2$ based on the PPP data $X_2 = (X_l, X_u)$ and redo steps (2)-(5) to transform $X_1$ via $\bar Z_{j+1} = Y_{j+1,n} - \hat\vartheta_2(\xi_{j+1}) - \hat\vartheta_2'(\xi_{j+1})(x_{j+1,n} - \xi_{j+1})$, $j \in J_n$, to another couple $(X_l', X_u')$ of PPP; the final PPP are obtained by $X_1 = X_l + X_l'$, $X_2 = X_u + X_u'$.

In this algorithmic description we could do without subtracting and adding the pilot estimator itself (i.e., only use the derivative) in steps (2) and (4), but in the proof this localization permits an easy sufficiency argument for the local extremes.
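The deterministic core of this construction, namely the localization, the block-wise extremes and the shift back in steps (2) to (4), can be sketched in Python as follows. This is a deliberately simplified illustration and not the full equivalence map: the pilot estimator and its derivative are passed in as given callables (`pilot`, `pilot_deriv`, both hypothetical names), there is no sample splitting, the randomization step (5) is omitted, and the point $x = 1$ is assigned to the last block by convention.

```python
def blockwise_extremes(xs, ys, pilot, pilot_deriv, m):
    """Steps (2)-(4): subtract the local linear pilot fit, take block-wise
    minima and maxima, and shift them back.

    Returns a list of (xi_k, s_k'', S_k'') for the blocks I_k = [k/m, (k+1)/m).
    Blocks without data yield infinite values.
    """
    lows = [float("inf")] * m     # s_k: block-wise minima of Z-bar
    highs = [float("-inf")] * m   # S_k: block-wise maxima of Z-bar
    for x, y in zip(xs, ys):
        k = min(int(x * m), m - 1)   # block index; x = 1 goes to the last block
        xi = (k + 0.5) / m           # centre of I_k
        z = y - pilot(xi) - pilot_deriv(xi) * (x - xi)   # localized datum
        lows[k] = min(lows[k], z)
        highs[k] = max(highs[k], z)
    result = []
    for k in range(m):
        xi = (k + 0.5) / m
        s_shift = lows[k] + pilot(xi) + 1.0    # s_k'' lies (approximately) above theta(xi_k)
        S_shift = highs[k] + pilot(xi) - 1.0   # S_k'' lies (approximately) below theta(xi_k)
        result.append((xi, s_shift, S_shift))
    return result
```

If the regression function is exactly linear and the pilot reproduces it, the localized data coincide with the errors, so that $S_k'' \le \vartheta(\xi_k) \le s_k''$ holds in every non-empty block.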
Put in a nutshell, the asymptotic equivalence is achieved by considering block-wise extreme values in the regression experiment, in conjunction with a pre- and post-processing procedure (localization step) performing a linear correction on each block. The easier block-wise constant approximation approach by Brown and Low (1996) does not work here since we need a much higher approximation order.

Throughout we shall write const. for a generic positive constant which may change its value from line to line and does not depend on the parameter $\vartheta$ nor on the sample size $n$. Similarly, the Landau symbols $O$, $o$ and the asymptotic order symbol $\asymp$ will denote uniform bounds with respect to $\vartheta$ and $n$.

3. Pilot estimators

In order to prove Theorem 2.1 a localization strategy is required as in Nussbaum (1996) for the density estimation problem. To that end we construct pilot estimators of the target function $\vartheta$ and its derivative in both experiments $A_n$ and $B_n$.

Let us fix the estimation point $x_0 \in [0,1]$ and apply a local polynomial estimation approach. We introduce the neighbourhood $U_h = [x_0 - h, x_0 + h]$ for $x_0 \in [h, 1-h]$ and the one-sided analogue $U_h = [0,2h]$ for $x_0 \in [0,h)$, $U_h = [1-2h,1]$ for $x_0 \in (1-h,1]$. We introduce the set $\Pi := \Pi_2(U_h)$ of quadratic polynomials on $U_h$. Standard approximation theory (by a Taylor series argument) gives for $h \downarrow 0$
\[
\gamma_h := \sup_{\vartheta\in\Theta}\ \min_{p\in\Pi}\ \max_{x\in U_h}\Big(h^{-(2+\alpha)}|\vartheta(x)-p(x)| + h^{-(1+\alpha)}|\vartheta'(x)-p'(x)|\Big) \le \text{const.} < \infty,
\]
where the constant does not depend on $h$.

Definition 3.1. We call $\hat\vartheta \in \Pi$ in experiment $A_n$ locally admissible at $x_0$ if
\[
\max_{j:\,x_{j,n}\in U_h} |Y_{j,n} - \hat\vartheta(x_{j,n})| \le 1 + \gamma_h h^{2+\alpha}
\]
holds. Similarly, in experiment $B_n$ we call $\hat\vartheta \in \Pi$ locally admissible at $x_0$ if
\[
X_1\big(\{x \in U_h,\ y > \hat\vartheta(x) + \gamma_h h^{2+\alpha}\}\big) = 0 \quad\text{and}\quad X_2\big(\{x \in U_h,\ y < \hat\vartheta(x) - \gamma_h h^{2+\alpha}\}\big) = 0
\]
hold. Our estimator $\hat\vartheta_{n,h}(x_0)$ is just any locally admissible $\hat\vartheta_{n,h} \in \Pi$, evaluated at $x_0$ and selected as a measurable function of the data (by the measurable selection theorem).
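The admissibility condition of Definition 3.1 in experiment $A_n$ is straightforward to check for a given quadratic polynomial, as the following minimal Python sketch indicates. The function names `neighbourhood` and `is_locally_admissible` and the coefficient convention $p(x) = a + bx + cx^2$ are assumptions of this illustration. Note that $\gamma_h$ is defined through a supremum over $\Theta$ and is not available from the data; here it is simply treated as a user-supplied constant.

```python
def neighbourhood(x0, h):
    """The interval U_h around x0, with one-sided versions near 0 and 1."""
    if x0 < h:
        return (0.0, 2 * h)
    if x0 > 1 - h:
        return (1 - 2 * h, 1.0)
    return (x0 - h, x0 + h)

def is_locally_admissible(xs, ys, coeffs, x0, h, gamma_h, alpha):
    """Check Definition 3.1 in experiment A_n for p(x) = a + b*x + c*x^2.

    gamma_h is an assumed, user-supplied value for the constant gamma_h.
    """
    lo, hi = neighbourhood(x0, h)
    a, b, c = coeffs
    tol = 1.0 + gamma_h * h ** (2 + alpha)   # enlarged band 1 + gamma_h * h^(2+alpha)
    return all(
        abs(y - (a + b * x + c * x * x)) <= tol
        for x, y in zip(xs, ys)
        if lo <= x <= hi                     # only design points in U_h count
    )
```

Since the admissibility condition consists of finitely many linear inequalities in the three coefficients, an admissible polynomial could in principle be found by a linear feasibility solver; that step is not shown here.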
Note that enlarging the band size by $\gamma_h h^{2+\alpha}$ guarantees that $\hat\vartheta_{n,h}$ exists since the minimizer $\vartheta_h \in \Pi$ in the definition of $\gamma_h$ is eligible. The following result gives the pointwise risk bounds for the regression function and its derivative with orders $O(n^{-s/(s+1)})$ and $O(n^{-(s-1)/(s+1)})$, respectively, where $s = 2+\alpha$ denotes the regularity in a Hölder class. As an application of our asymptotic equivalence we shall show in Section 8.2 below the optimality of these rates in a minimax sense. The upper bound proof relies on entropy arguments and norm equivalences for polynomials and could be easily extended to more general local polynomial estimation and $L^p$-loss functions.

Proposition 3.1. Select the bandwidth $h$ such that $h \asymp n^{-1/(3+\alpha)}$. Then we have in experiment $A_n$ as well as in experiment $B_n$
\[
\sup_{\vartheta\in\Theta}\ \sup_{x_0\in[0,1]} E_\vartheta\Big(n^{2(2+\alpha)/(3+\alpha)}\big|\hat\vartheta_{n,h}(x_0)-\vartheta(x_0)\big|^2 + n^{2(1+\alpha)/(3+\alpha)}\big|\hat\vartheta'_{n,h}(x_0)-\vartheta'(x_0)\big|^2\Big) \le \text{const.}
\]

Proof of Proposition 3.1: We shall need the following bounds in $\Pi = \Pi_2(U_h)$ from DeVore and Lorentz (1993): $\|p\|_{L_\infty(U_h)} \le 8h^{-1}\|p\|_{L_1(U_h)}$ (their Theorem IV.2.6); $\|p'\|_{L_\infty(U_h)} \le c_0 h^{-1}\|p\|_{L_\infty(U_h)}$ (their Thm. IV.2.7); their proof of Thm. IV.2.6 establishes $|p(x)| \ge (1 - 4(x - x_M)/h)\|p\|_\infty$ for $x_M := \operatorname{argmax}_{x\in U_h}|p(x)|$ and $x_M \le x < x_M + h/4$, assuming without loss of generality that $x_M$ lies in the left half of $U_h$, such that uniformly over $x_0$
\[
\|p\|_{n,h,1} := \frac{1}{nh}\sum_{x_{j,n}\in U_h} |p(x_{j,n})| \ge \text{const.}\cdot|p(x_M)| = \text{const.}\cdot\|p\|_{L_\infty(U_h)}
\]
is derived.

Let us start with considering the regression experiment $A_n$. We apply a standard chaining argument in the finite-dimensional space $\Pi$ together with an approximation argument.
From above we have $\|p\|_{L_\infty(U_h)}/\|p\|_{n,h,1} \asymp 1$ as well as $\|p\|_{n,h,1} \ge c_1|p(x_0)|$ with some $c_1 > 0$ uniformly in $p \in \Pi$. Fix $R > 2$. For every $\delta > 0$ we can find elements $(p_l)_{l\ge 1}$ that form a $\delta$-net in $\Pi \cap \{\|p\|_{n,h,1} \ge c_1\max(1,c_0)(R-1)\gamma_h h^{2+\alpha}\}$ with respect to the $L_\infty(U_h)$-norm satisfying $\|p_l\|_{n,h,1} \asymp \delta l^{1/3}$ as $l \to \infty$; for this note that, by the above norm equivalences, $\Pi \cap \{\|p\|_{n,h,1} \ge c_1\max(1,c_0)(R-1)\gamma_h h^{2+\alpha}\}$ with maximum norm is isometric to $\mathbb{R}^3 \cap \{|x| \ge c_1\max(1,c_0)(R-1)\gamma_h h^{2+\alpha}\}$ with the Euclidean metric uniformly for $h \to 0$ and $nh \to \infty$ and use standard coverings of Euclidean balls, e.g. Lemma 2.5 in van de Geer (2006). We obtain
\begin{align*}
&P_\vartheta\Big(\exists p \in \Pi:\ \max_{j:\,x_{j,n}\in U_h}|Y_{j,n} - p(x_{j,n})| \le 1 + \gamma_h h^{2+\alpha},\\
&\qquad\quad \max\big(h^{-(2+\alpha)}|p(x_0)-\vartheta(x_0)|,\ h^{-(1+\alpha)}|p'(x_0)-\vartheta'(x_0)|\big) \ge R\gamma_h\Big)\\
&= P_\vartheta\Big(\exists p \in \Pi:\ \max_{j:\,x_{j,n}\in U_h}|\varepsilon_{j,n} - (p(x_{j,n})-\vartheta(x_{j,n}))| \le 1 + \gamma_h h^{2+\alpha},\\
&\qquad\quad \max\big(h^{-(2+\alpha)}|p(x_0)-\vartheta(x_0)|,\ h^{-(1+\alpha)}|p'(x_0)-\vartheta'(x_0)|\big) \ge R\gamma_h\Big)\\
&\le P_\vartheta\Big(\exists p \in \Pi:\ \max_{j:\,x_{j,n}\in U_h}|\varepsilon_{j,n} - (p(x_{j,n})-\vartheta_h(x_{j,n}))| \le 1 + 2\gamma_h h^{2+\alpha},\\
&\qquad\quad \|p-\vartheta_h\|_{n,h,1} \ge \max(1,c_0)\,c_1(R-1)\gamma_h h^{2+\alpha}\Big)\\
&\le P_\vartheta\Big(\exists l \ge 1:\ \max_{j:\,x_{j,n}\in U_h}|\varepsilon_{j,n} - p_l(x_{j,n})| \le 1 + 2\gamma_h h^{2+\alpha} + \delta\Big)\\
&\le \sum_{l\ge 1} P_\vartheta\Big(\max_{j:\,x_{j,n}\in U_h}|\varepsilon_{j,n} - p_l(x_{j,n})| \le 1 + 2\gamma_h h^{2+\alpha} + \delta\Big).
\end{align*}
From $f_\varepsilon(-1) > 0$, $f_\varepsilon(+1) > 0$ and the Lipschitz continuity of $f_\varepsilon$ within $[-1,1]$ we infer that any $\varepsilon_{j,n}$ satisfies
\[
\min\big(P(\varepsilon_{j,n} \ge 1-\kappa),\ P(\varepsilon_{j,n} \le -1+\kappa)\big) \ge c\kappa
\]
for some constant $c > 0$ and all $\kappa \in (0,1)$.
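For orientation, this bound can be made explicit as follows. The display is a short worked illustration rather than part of the original argument: it uses only that $f_\varepsilon$ is continuous and strictly positive on the compact interval $[-1,1]$ (Definition 2.2), so that its minimum is positive, and it adds the uniform density as a special case, for which one may take $c = 1/2$.

```latex
\[
P(\varepsilon_{j,n} \ge 1-\kappa) = \int_{1-\kappa}^{1} f_\varepsilon(t)\,dt \;\ge\; \kappa \min_{t\in[-1,1]} f_\varepsilon(t),
\qquad
P(\varepsilon_{j,n} \le -1+\kappa) = \int_{-1}^{-1+\kappa} f_\varepsilon(t)\,dt \;\ge\; \kappa \min_{t\in[-1,1]} f_\varepsilon(t),
\]
\[
\text{and for } f_\varepsilon = \tfrac12\, 1_{[-1,1]}:\qquad
P(\varepsilon_{j,n} \ge 1-\kappa) = P(\varepsilon_{j,n} \le -1+\kappa) = \tfrac{\kappa}{2}, \qquad \kappa \in (0,1).
\]
```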
We derive an exponential inequality for any $f : U_h \to \mathbb{R}$ and $\Delta > 0$:
\begin{align*}
&P\Big(\max_{j:\,x_{j,n}\in U_h}|\varepsilon_{j,n} - f(x_{j,n})| \le 1+\Delta\Big)\\
&\le \prod_{j:\,x_{j,n}\in U_h}\Big(1 - \min\Big(P\big(\varepsilon_{j,n} > 1+\Delta-|f(x_{j,n})|\big),\ P\big(\varepsilon_{j,n} < -1-\Delta+|f(x_{j,n})|\big)\Big)\Big)\\
&\le \exp\Big(\sum_{j:\,x_{j,n}\in U_h}\log\big(1 - c\,(|f(x_{j,n})|-\Delta)_+\big)\Big)\\
&\le \exp\Big(-c\sum_{j:\,x_{j,n}\in U_h}(|f(x_{j,n})|-\Delta)_+\Big)\\
&\le \exp\big(-c\,nh\,(\|f\|_{n,h,1}-\Delta)\big),
\end{align*}
using $\log(1+u) \le u$. We therefore choose $\delta = R\gamma_h h^{2+\alpha}$ and arrive at
\begin{align*}
&P_\vartheta\Big(\exists p \in \Pi:\ p \text{ is locally admissible},\\
&\qquad\quad \max\big(h^{-(2+\alpha)}|p(x_0)-\vartheta(x_0)|,\ h^{-(1+\alpha)}|p'(x_0)-\vartheta'(x_0)|\big) \ge R\gamma_h\Big)\\
&\le \sum_{l\ge 1}\exp\big(-\text{const.}\cdot nh(\delta + \gamma_h h^{2+\alpha})\,l^{1/3}\big) = O\big(\exp(-\text{const.}\cdot R\,n h^{3+\alpha})\big).
\end{align*}
We conclude, substituting $h \asymp n^{-1/(3+\alpha)}$, that uniformly over $R \ge 2$
\begin{align*}
P_\vartheta\big(h^{-(2+\alpha)}|\hat\vartheta_{n,h}(x_0)-\vartheta(x_0)| \ge R\gamma_h\big) &= O(\exp(-\text{const.}\cdot R)),\\
P_\vartheta\big(h^{-(1+\alpha)}|\hat\vartheta'_{n,h}(x_0)-\vartheta'(x_0)| \ge R\gamma_h\big) &= O(\exp(-\text{const.}\cdot R)).
\end{align*}
Integrating out these exponential tail bounds yields the desired moment bound in experiment $A_n$.

All the results obtained so far remain valid for the PPP experiment $B_n$ when the empirical norm $\|\cdot\|_{n,h,1}$ is replaced by the rescaled $L_1(U_h)$-norm $\|g\|_{1,U_h} := \frac{1}{h}\int_{U_h}|g|$, the admissibility conditions are exchanged and the following (easier) exponential