Uniform in bandwidth exact rates for a class of kernel estimators

Davit Varron ∗, Ingrid Van Keilegom †

January 27, 2012

arXiv:1201.5507v1 [math.ST] 26 Jan 2012

Abstract

Given an i.i.d. sample $(Y_i, Z_i)$, taking values in $\mathbb{R}^{d'} \times \mathbb{R}^d$, we consider a collection of Nadaraya-Watson kernel estimators of the conditional expectations $E(\langle c_g(z), g(Y)\rangle + d_g(z) \mid Z = z)$, where $z$ belongs to a compact set $H \subset \mathbb{R}^d$, $g$ is a Borel function on $\mathbb{R}^{d'}$ and $c_g(\cdot), d_g(\cdot)$ are continuous functions on $\mathbb{R}^d$. Given two bandwidth sequences $h_n < \mathfrak{h}_n$ fulfilling mild conditions, we obtain an exact and explicit almost sure limit bound for the deviations of these estimators around their expectations, uniformly in $g \in \mathcal{G}$, $z \in H$ and $h_n \le h \le \mathfrak{h}_n$, under mild conditions on the density $f_Z$, the class $\mathcal{G}$, the kernel $K$ and the functions $c_g(\cdot), d_g(\cdot)$. We apply this result to prove that smoothed empirical likelihood can be used to build confidence intervals for conditional probabilities $P(Y \in C \mid Z = z)$ that hold uniformly in $z \in H$, $C \in \mathcal{C}$, $h \in [h_n, \mathfrak{h}_n]$. Here $\mathcal{C}$ is a Vapnik-Chervonenkis class of sets.

Key Words: Local empirical processes, empirical likelihood, kernel smoothing, uniform in bandwidth consistency.

∗ Laboratoire de Mathématiques Pures et Appliquées, Université de Franche-Comté, 16 route de Gray, 25000 Besançon, France, E-mail: [email protected]. Financial support from IAP research network P6/03 of the Belgian Government (Belgian Science Policy).

† Institute of Statistics, Université catholique de Louvain, Voie du Roman Pays 20, 1348 Louvain-la-Neuve, Belgium, E-mail: [email protected]. Financial support from IAP research network P6/03 of the Belgian Government (Belgian Science Policy), and from the European Research Council under the European Community's Seventh Framework Programme (FP7/2007-2013) / ERC Grant agreement No. 203650 is gratefully acknowledged.
1 Introduction and statement of the main results

Consider an i.i.d. sample $(Y_i, Z_i)_{i=1,\ldots,n}$ taking values in $\mathbb{R}^{d'} \times \mathbb{R}^d$, with the same distribution as a vector $(Y,Z)$, and write $\langle \cdot, \cdot \rangle$ for the usual inner product. In this paper, we investigate the limit behaviour of quantities of the following form (assuming that this expression is meaningful):

$$W_n(g,h,z) := f_Z(z)^{-1/2} \sum_{i=1}^n \Big[ \big(\langle c_g(z), g(Y_i)\rangle + d_g(z)\big) K\Big(\frac{Z_i - z}{h}\Big) - E\Big( \big(\langle c_g(z), g(Y_i)\rangle + d_g(z)\big) K\Big(\frac{Z_i - z}{h}\Big) \Big) \Big]. \tag{1}$$

Here, $K$ denotes a kernel, $h > 0$ is a smoothing parameter, $g$ is a Borel function from $\mathbb{R}^{d'}$ to $\mathbb{R}^k$ and $f_Z$ is (a version of) the density of $Z$. Given a class of functions $\mathcal{G}$ satisfying some Vapnik-Chervonenkis type conditions (see condition (HG) below), and given a compact set $H$, Einmahl and Mason (2000) showed that somewhat recent tools in empirical processes theory could be used efficiently to provide exact rates of convergence of

$$\sup\big\{ |W_n(g,h_n,z)|,\ g \in \mathcal{G},\ z \in H \big\},$$

along a bandwidth sequence $h_n$ fulfilling some mild conditions (see condition (HV) in the sequel). The exact content of their result is written in Theorem 1 below.

The contribution of the present paper is twofold. As a first contribution, we provide an extension of the result of Einmahl and Mason, by enriching Theorem 1 with a uniformity in the bandwidth $h$, when $h$ is allowed to vary in an interval $[h_n, \mathfrak{h}_n]$, with $h_n$ and $\mathfrak{h}_n$ both fulfilling the conditions of Theorem 1. This extension is stated in Section 1.2 (Theorem 2), and is proved in Section 3. As a second contribution (Theorem 3), we apply our Theorem 2 to establish confidence intervals for quantities of the form

$$P\big(Y \in C \mid Z = z\big), \quad C \in \mathcal{C},\ z \in H,$$

by empirical likelihood techniques. Indeed, we prove that these confidence intervals can be built to hold uniformly in $z \in H$, $C \in \mathcal{C}$ and $h \in [h_n, \mathfrak{h}_n]$, under conditions that are very similar to those of Theorem 2.
This result is stated in Section 1.4 and is proved in Section 4.

1.1 A result of Einmahl and Mason

As our first result is an extension of Theorem 1 in Einmahl and Mason (2000), we first have to introduce the notation and assumptions they made in their article. Consider a compact set $H \subset \mathbb{R}^d$ with nonempty interior. We shall make the following assumption on the law of $(Y,Z)$.

(Hf) $(Y,Z)$ has a density $f_{Y,Z}$ that is continuous in $x$ on $\mathbb{R}^{d'} \times O'$, where $O' \subset \mathbb{R}^d$ is open and where $H \subset O'$. Moreover $f_Z$ is continuous and bounded away from zero and infinity on $O'$.

From now on, $O$ will denote an open set fulfilling $H \subsetneq O \subsetneq O'$. Now consider a class $\mathcal{G}$ of functions from $\mathbb{R}^{d'}$ to $\mathbb{R}^k$. For $l = 1,\ldots,k$, write $\mathcal{G}_l := \Pi_l(\mathcal{G})$, where $\Pi_l(x_1,\ldots,x_l,\ldots,x_k) := x_l$ for $(x_1,\ldots,x_k) \in \mathbb{R}^k$.

(HG) Each class $\mathcal{G}_l$ is a pointwise separable VC subgraph class and has a finite valued measurable envelope function $G_l$ satisfying, for some $p \in (2, \infty]$:

$$\alpha := \max_{l=1,\ldots,k}\ \sup_{z \in O}\ \| G_l(\cdot) \|_{L_{Y|Z=z},\,p} < \infty,$$

where $\| G_l(\cdot) \|_{L_{Y|Z=z},\,p}$ is the $L^p$-norm of $G_l$ under the distribution of $Y \mid Z = z$. For a definition of a pointwise separable VC subgraph class we refer to Van der Vaart and Wellner (1996, p. 110 and 141).

Now, for any $g \in \mathcal{G}$, consider a pair of functions $(c_g(\cdot), d_g(\cdot))$, where $c_g$ maps $\mathbb{R}^d$ to $\mathbb{R}^k$ and $d_g$ maps $\mathbb{R}^d$ to $\mathbb{R}$, and assume that

(H$\mathcal{C}$) The classes of functions $\mathcal{D}_1 := \{c_g,\ g \in \mathcal{G}\}$ and $\mathcal{D}_2 := \{d_g,\ g \in \mathcal{G}\}$ are uniformly bounded and uniformly equicontinuous on $O$.

We now formulate our assumptions on the kernel $K$, with the following definition.

$$\mathcal{K} := \big\{ K(\lambda \cdot - z),\ \lambda > 0,\ z \in \mathbb{R}^d \big\}. \tag{2}$$

(HK1) $K$ has bounded variation and the class $\mathcal{K}$ is VC subgraph.

(HK2) $K(s) = 0$ when $s \notin [-1/2, 1/2]^d$.

(HK3) $\int_{\mathbb{R}^d} K(s)\,ds = 1$.

Note that (HK1) is fulfilled for a quite large class of kernels (see, e.g., Mason (2004), Example F.1).
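To make conditions (HK2) and (HK3) concrete, the following minimal Python sketch checks them numerically for one candidate kernel in dimension $d = 1$. The kernel used here (an Epanechnikov-type kernel rescaled to the support $[-1/2, 1/2]$), the helper names and the midpoint integration rule are assumptions made for illustration only; they are not taken from the paper, and the VC subgraph part of (HK1) is not checked by this code. The sketch also evaluates the constant $\|K\|_{\lambda,2}$ that enters the limits of Theorems 1 and 2.

```python
import math


def kernel(s):
    """Illustrative Epanechnikov-type kernel rescaled to [-1/2, 1/2] (d = 1)."""
    return 6.0 * (0.25 - s * s) if abs(s) <= 0.5 else 0.0


def integrate(f, a, b, m=20000):
    """Midpoint rule with m cells; adequate for a smooth integrand."""
    step = (b - a) / m
    return step * sum(f(a + (j + 0.5) * step) for j in range(m))


# (HK3): the kernel should integrate to one.
mass = integrate(kernel, -0.5, 0.5)

# L2 norm of K with respect to Lebesgue measure, i.e. ||K||_{lambda,2}.
l2_norm = math.sqrt(integrate(lambda s: kernel(s) ** 2, -0.5, 0.5))

# (HK2): the kernel should vanish outside [-1/2, 1/2].
outside = max(abs(kernel(s)) for s in (-2.0, -0.51, 0.51, 2.0))

print(mass, l2_norm, outside)
```

For this particular kernel the integrals can also be computed by hand: $\int K = 1$ and $\|K\|_{\lambda,2}^2 = 6/5$, which gives a quick sanity check of the numerical values.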
In Einmahl and Mason (2000), the authors have studied the almost sure asymptotic behaviour of

$$\sup\big\{ |W_n(g,h_n,z)|,\ g \in \mathcal{G},\ z \in H \big\}$$

(recall (1)), along a bandwidth sequence $(h_n)_{n \ge 1}$ that satisfies the following conditions (here we write $\log_2 n := \log\log(n \vee 3)$):

(HV) $h_n \downarrow 0$, $\ n h_n^d \uparrow \infty$, $\ \log(1/h_n)/\log_2 n \to \infty$, $\ h_n^d \big( n/\log(1/h_n) \big)^{1-2/p} \to \infty$,

where $p$ is as in condition (HG). We also set

$$\Delta^2(g,z) := E\Big( \big(\langle c_g(z), g(Y)\rangle + d_g(z)\big)^2 \,\Big|\, Z = z \Big), \quad z \in \mathbb{R}^d,\ g \in \mathcal{G}, \tag{3}$$

$$\Delta^2(g) := \sup_{z \in H} \Delta^2(g,z), \quad g \in \mathcal{G}, \tag{4}$$

$$\Delta^2(\mathcal{G}) := \sup_{g \in \mathcal{G}} \Delta^2(g). \tag{5}$$

Given a measurable space $(\chi, \mathcal{T})$, a measure $Q$ and a Borel function $\psi : \chi \mapsto \mathbb{R}$, we write

$$\| \psi \|_{Q,p}^p = \int_\chi |\psi|^p \, dQ. \tag{6}$$

Under the above mentioned assumptions, Einmahl and Mason have proved the following theorem, $\lambda$ denoting the Lebesgue measure.

Theorem 1 (Einmahl, Mason, 2000) Under assumptions (HG), (H$\mathcal{C}$), (Hf), (HK1)-(HK3) and (HV), we have almost surely

$$\lim_{n \to \infty}\ \sup_{z \in H,\, g \in \mathcal{G}} \frac{|W_n(g,h_n,z)|}{\sqrt{2 n h_n^d \log(h_n^{-d})}} = \Delta(\mathcal{G})\, \| K \|_{\lambda,2}. \tag{7}$$

We point out that (7) is slightly stronger than Theorem 1 of Einmahl and Mason (2000), as $f_Z(z)^{-1/2}$ appears in our definition of $W_n(g,h,z)$, which is not the case in their paper. However, (7) is a consequence of their Theorem 1, as $f_Z^{-1/2}$ is uniformly continuous on $H$, by (Hf).

1.2 An extension of Theorem 1

Our first result states that Theorem 1 can be enriched by an additional uniformity in $h_n \le h \le \mathfrak{h}_n$ in the supremum appearing in (7), provided that $(h_n)_{n \ge 1}$ and $(\mathfrak{h}_n)_{n \ge 1}$ both fulfill assumption (HV). We also refer to Einmahl and Mason (2005), where the authors provided some consistency results for kernel type function estimators that hold uniformly in the bandwidth (see also Varron (2008) for an improvement in the case of kernel density estimation).

Theorem 2 Assume that (HG), (Hf), (H$\mathcal{C}$) and (HK1)-(HK3) are satisfied. Let $(h_n)_{n \ge 1}$ and $(\mathfrak{h}_n)_{n \ge 1}$ be two sequences of constants fulfilling (HV) as well as $h_n = o(\mathfrak{h}_n)$. Then we have almost surely

$$\lim_{n \to \infty}\ \sup_{z \in H,\, g \in \mathcal{G},\, h_n \le h \le \mathfrak{h}_n} \frac{|W_n(g,h,z)|}{\sqrt{2 n h^d \log(h^{-d})}} = \Delta(\mathcal{G})\, \| K \|_{\lambda,2}. \tag{8}$$

The proof of Theorem 2 is provided in Section 3.

Remark 1 Einmahl and Mason (2005) have proved a result strong enough to derive that, under weaker conditions than those of Theorem 2, we have almost surely

$$\limsup_{n \to \infty}\ \sup_{z \in H,\, g \in \mathcal{G},\, h \in [c \log n / n,\, 1]} \frac{f_Z(z)^{1/2}\, W_n(g,h,z)}{\sqrt{n h^d \log(1/h) + \log\log n}} < \infty. \tag{9}$$

However, the finite constant bounding the left hand side of (9) is not explicit in their result. The main contribution of Theorem 2 is to make the limiting constant explicit, at the price of stronger assumptions.

Remark 2 As Theorem 2 is an extension of Theorem 1 of Einmahl and Mason, all the corollaries of Theorem 1 (see Einmahl and Mason (2000)) can be enriched with a uniformity in the bandwidth.

1.3 Some applications of Theorem 2 to data-driven bandwidth selection

The main statistical interest of Theorem 2 is that we can derive the limit behavior of kernel regression estimators with data-driven bandwidth. Let us consider such a random bandwidth $h_n(z) = h(z, Y_1,\ldots,Y_n, Z_1,\ldots,Z_n)$ that depends on the sample as well as on the point $z \in \mathbb{R}^d$. In the sequel, $\mathrm{Id}$ shall denote the identity function. Our next corollary gives the a.s. limit behavior of the Nadaraya-Watson estimator

$$r_n(z) = \sum_{i=1}^n Y_i\, \frac{K\big(\frac{z - Z_i}{h_n(z)}\big)}{\sum_{j=1}^n K\big(\frac{z - Z_j}{h_n(z)}\big)}$$

of the regression function $r(z) := E(Y \mid Z = z)$, when $h_n(\cdot)$ satisfies some mild conditions. Note that the asymptotics are given for $r_n(\cdot) - r(h_n, \cdot)$, with

$$r(h_n, z) := h_n^{-1} \int_{\mathbb{R}^d \times \mathbb{R}^k} y\, K\Big(\frac{u - z}{h_n}\Big) f_{Y,Z}(y,u)\, du\, dy.$$

The random differences $r(h_n, z) - r(z)$ can be controlled by analytic arguments as soon as the a.s. limit behavior is known.

Corollary 1 Assume that $h_n(\cdot)$ satisfies almost surely (resp. in probability)

$$0 < \liminf_{n \to \infty} \frac{\log(1/h_n)}{\log n} \le \limsup_{n \to \infty} \frac{\log(1/h_n)}{\log n} < 1.$$

Then, we have

$$\limsup_{n \to \infty}\ \sup_{z \in H}\ \pm \frac{\sqrt{f_Z(z)}\, \big(r_n(z) - r(h_n, z)\big) \sqrt{n h_n(z)^d}}{\sqrt{2 \Delta(\mathrm{Id}, z) \log(h_n(z)^{-d})}} = \| K \|_{\lambda,2},$$

almost surely (resp. in probability).

Proof: The proof involves continuity arguments for $\Delta(\mathrm{Id}, \cdot)$ and the fact that the numerator and denominator of $r_n(z)$ are specific forms of the general object $W_n$ appearing in Theorem 2. We also consider the countable collection of events

$$\big\{ n^{-r} \le h_n \le n^{-r'} \text{ for all large } n \big\}, \quad r, r' \in \mathbb{Q} \cap (0,1).$$

On each of these countably many events, the random sequence $h_n(\cdot)$ can be bounded from below and above by two deterministic sequences fulfilling condition (HV). We omit technical details. □

Example 1: Tsybakov's plug-in selection rule:
Tsybakov (1987) considered a plug-in bandwidth selection rule when $d = k = 1$. In that case, he suggested that, for a given point $z \in \mathbb{R}$, the bandwidth should be chosen of the form

$$h_n(z) := \hat\beta_n(z)\, n^{-1/5},$$

where $\hat\beta_n(z)$ is a consistent estimate of the theoretical quantity $\beta(z)$ that minimizes the asymptotic square error of $r_n(z)$. Under the conditions stated in Tsybakov (1987), most of which are consequences of the assumptions of Theorem 2, the plug-in bandwidth satisfies the assumptions of Corollary 1.

Example 2: cross validation:
We again consider the case $d = k = 1$. An important example is the bandwidth $h_n$ that minimizes the sample-based quantity

$$CV(h) := \frac{1}{n} \sum_{i=1}^n \big[ Y_i - r_{n,-i}(Z_i) \big]^2 w(Z_i), \quad h \in [n^{-1+\delta}, n^{-\delta}],$$

where $w$ is a weight function on $\mathbb{R}$ and $\delta > 0$ is a fixed (small) value. We refer to Clark (1975) and Priestley and Chao (1972) for more details on that technique. By construction the random sequence $h_n$ satisfies the assumptions of Corollary 1. Moreover, it is shown in Härdle et al. (1988) that, under mild conditions, we have $h_n \sim C_0\, n^{-1/5}$ in probability, for a theoretical constant $C_0$.

1.4 Asymptotic confidence bands by empirical likelihood

Empirical likelihood methods in statistical inference have been introduced by Owen (2001).
This nonparametric technique has attracted much interest for several practical reasons, the most important one being that it directly provides confidence intervals without requiring further approximation methods, such as the estimation of dispersion parameters. Moreover, empirical likelihood is a very versatile tool which can be adapted to many different fields, for instance to the estimation of densities or conditional expectations by kernel smoothing methods. The idea can be summarised as follows: consider an independent, identically distributed sample $(Y_i, Z_i)_{1 \le i \le n}$ taking values in $\mathbb{R}^{d'} \times \mathbb{R}^d$. Given $h > 0$, $z \in H$, a function $g$ from $\mathbb{R}^{d'}$ to $\mathbb{R}^k$ and a (kernel) real function $K$, define the following centring parameter, which plays the role of a deterministic approximation of $E\big(g(Y) \mid Z = z\big)$:

$$m(g,h,z) := \frac{E\Big( g(Y)\, K\big(\frac{Z - z}{h}\big) \Big)}{E\Big( K\big(\frac{Z - z}{h}\big) \Big)}. \tag{10}$$

This quantity is the root of the following equation in $\theta$:

$$E\Big( K\Big(\frac{Z - z}{h}\Big) \big( g(Y) - \theta \big) \Big) = 0, \tag{11}$$

which naturally leads to the following formula for a confidence interval (around $m(g,h,z)$) by empirical likelihood methods (for more details see, e.g., Owen (2001), chapter 5):

$$I_n(g,h,z,c) := \big\{ \theta \in \mathbb{R},\ \mathcal{R}_n(\theta,g,h,z) \ge c \big\}, \tag{12}$$

where $c \in (0,1)$ is a given critical value that has to be chosen in practice, and where

$$\mathcal{R}_n(\theta,g,h,z) := \max\Big\{ \prod_{i=1}^n n p_i,\ \sum_{i=1}^n p_i K\Big(\frac{Z_i - z}{h}\Big) \big( g(Y_i) - \theta \big) = 0,\ p_i \ge 0,\ \sum_{i=1}^n p_i = 1 \Big\}. \tag{13}$$

It is known (see, e.g., Owen (2001), chapter 5) that, for fixed $z \in \mathbb{R}^d$ and fixed $g$, we can expect

$$m(g,h,z) \in I_n(g,h,z,c) \tag{14}$$

to hold with probability equal to $P(\chi^2 \le -2 \log c)$, ultimately as $n \to \infty$, $h \to 0$, $n h^d \to \infty$. Two questions arise naturally:

• Can we expect (14) to hold uniformly in $z$, $g$ and $h$?

• In that case, how much uniformity can we get?
Uniformity in $g$ and $z$ would allow us to construct asymptotic confidence bands (instead of simple confidence intervals), while uniformity in $h$ would allow more flexibility in the practical choice of that smoothing parameter. Our Theorem 3 provides a tool strong enough to give some positive answers to these questions. We shall focus on the case where $\mathcal{G} = \{1_C,\ C \in \mathcal{C}\}$ for a class of sets $\mathcal{C}$. We will also abuse notation by identifying $\mathcal{C}$ and $\mathcal{G}$, and hence we shall write $m(C,h,z)$ for $m(1_C,h,z)$ and so on. Write the conditional variance of $1_C(Y)$ given $Z = z$ as follows:

$$\sigma^2(C,z) := P\big(Y \in C \mid Z = z\big) - P^2\big(Y \in C \mid Z = z\big), \quad C \in \mathcal{C},\ z \in H. \tag{15}$$

The next theorem shows that we can construct, by empirical likelihood methods (recall (12)), confidence bands around the centring parameters $m(C,h,z)$ with lengths tending to zero at rate $\sqrt{2 \sigma^2(C,z) \log(h^{-d}) / n h^d}$ when $n \to \infty$ and $h_n \le h \le \mathfrak{h}_n$. We make the following assumptions on $h_n$, $\mathfrak{h}_n$ and $\mathcal{C}$:

(HG′) $\mathcal{C}$ is a VC class satisfying $\inf_{z \in H} \inf_{C \in \mathcal{C}} \sigma^2(C,z) =: \beta > 0$.

(HV′) $h_n \downarrow 0$, $\ n h_n^d \uparrow \infty$, $\ \log(1/h_n)/\log_2 n \to \infty$, $\ n h_n^d / \log(1/h_n) \to \infty$.

Note that (HV′) is equivalent to (HV) in the specific case where $p = \infty$.

Theorem 3 Under assumptions (Hf), (HK1)-(HK3), (HG′) and (HV′), as well as $h_n = o(\mathfrak{h}_n)$, we have almost surely:

$$\lim_{n \to \infty}\ \sup_{z \in H,\, C \in \mathcal{C},\, h_n \le h \le \mathfrak{h}_n} \frac{-\log \mathcal{R}_n\big( m(C,h,z), C, h, z \big)}{\log(h^{-d})} = 1. \tag{16}$$

The proof of Theorem 3 is provided in Section 4.

Remark 3 Theorem 3 implies that, for an arbitrary $\epsilon > 0$, taking $c = h^{d+\epsilon}$ when constructing confidence regions as in (12) ensures that each $m(C,h,z)$ belongs to its associated confidence interval $I_n(C,h,z,c)$. Moreover, this claim turns out to be false when taking $c = h^{d-\epsilon}$ with $\epsilon > 0$. This shows that one cannot go below the theoretical limit $c = h^d$ without losing uniformity in $C$, $h$ and $z$.
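As an illustration of how the quantity $\mathcal{R}_n(\theta,g,h,z)$ of (13) can be evaluated in practice, here is a minimal Python sketch for $d = 1$ and $g = 1_C$ with $C = [0,t]$. It relies on the classical Lagrange-multiplier form of the maximizer (see, e.g., Owen (2001)): writing $w_i := K((Z_i - z)/h)(g(Y_i) - \theta)$, the optimal weights are $p_i = 1/(n(1 + \lambda w_i))$, where $\lambda$ solves $\sum_i w_i/(1 + \lambda w_i) = 0$, so that $-\log \mathcal{R}_n = \sum_i \log(1 + \lambda w_i)$. The function names, the bisection solver, the Epanechnikov-type kernel and the toy data are all assumptions made for this illustration; they are not taken from the paper.

```python
import math


def kernel(s):
    """Illustrative Epanechnikov-type kernel supported on [-1/2, 1/2]."""
    return 6.0 * (0.25 - s * s) if abs(s) <= 0.5 else 0.0


def neg_log_el_ratio(w, max_iter=200):
    """Return -log R, where R = max prod(n * p_i) subject to
    sum(p_i * w_i) = 0, p_i >= 0 and sum(p_i) = 1."""
    w_min, w_max = min(w), max(w)
    if w_min == 0.0 and w_max == 0.0:
        return 0.0  # the constraint is void, so p_i = 1/n and R = 1
    if not (w_min < 0.0 < w_max):
        return math.inf  # constraint forces some p_i = 0 (or is infeasible)

    def score(lam):
        # Derivative in lam of sum(log(1 + lam * w_i)); strictly decreasing.
        return sum(wi / (1.0 + lam * wi) for wi in w)

    # Every 1 + lam * w_i must stay positive, hence lam lies in (lo, hi).
    lo, hi = -1.0 / w_max, -1.0 / w_min
    lam = 0.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:
            break  # the interval can no longer be split in floating point
        lam = mid
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return sum(math.log1p(lam * wi) for wi in w)


def el_statistic(theta, ys, zs, z, h, t):
    """-log R_n(theta, C, h, z) for the set C = [0, t] (d = 1)."""
    w = [kernel((zi - z) / h) * ((1.0 if 0.0 <= yi <= t else 0.0) - theta)
         for yi, zi in zip(ys, zs)]
    return neg_log_el_ratio(w)


# Toy data, chosen by hand for illustration only.
ys = [0.2, 1.7, 0.9, 3.1, 0.4]
zs = [0.45, 0.50, 0.55, 0.48, 0.52]
print(el_statistic(0.5, ys, zs, z=0.5, h=0.4, t=1.0))
```

With such a function, membership of $\theta$ in $I_n(C,h,z,c)$ amounts to checking that the returned value is at most $-\log c$. When $\theta$ equals the kernel-weighted sample proportion, the constraint holds with $p_i = 1/n$ and the returned value is zero.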
Remark 4 In order to obtain a confidence band for $m(C,h,z)$ uniformly in $C$, $h$ and $z$, we need the limiting distribution of

$$\sup_{z,C,h} \Big[ -\log \mathcal{R}_n\big( m(C,h,z), C, h, z \big) / \log(h^{-d}) \Big], \tag{17}$$

so Theorem 3 is not sufficient for this. Obtaining such a limit law is a real challenge in itself, and is beyond the scope of this paper. We leave it as an open problem. In the case of univariate kernel density estimation, Bickel and Rosenblatt (1973) showed that the supremum over the transformed kernel density estimator, obtained after a proper rescaling and a proper translation, converges to an extreme value distribution. The simulations in Section 2 suggest that a proper linear transformation of $[-\log \mathcal{R}_n(m(C,h,z),C,h,z)]/\log(h^{-d})$ (depending on $z$, $C$ and $h$) might also lead to a nondegenerate limiting distribution.

2 Simulation results

A simulation study is carried out to illustrate the convergence stated in (16). We estimate the density of (17) for four different sample sizes: $n = 50, 100, 500, 1000$. We specified the following parameters:

1. $Z$ is uniformly distributed on $[0,1]$. Given $Z = z$, $Y$ has an exponential distribution with expectation $1/z$.

2. $\mathcal{C}$ is the class of intervals $[0,t]$, $t \in [1,2]$.

3. $H = [0.25, 0.75]$.

4. $h_n = n^{-1/5-\delta}$ and $\mathfrak{h}_n = n^{-1/5+\delta}$, with $\delta = 1/20$.

For each sample size, the density is estimated as follows:

• 100 independent samples are simulated (which is enough since the density is univariate).

• For each sample, the supremum in (17) is approximated by a maximum over a finite grid of size 50.

• Finally, the density of (17) is estimated by using a Parzen-Rosenblatt density estimator, applied to the 100 obtained values. We used an Epanechnikov kernel and the bandwidth was obtained from cross validation.

Figure 1 shows the density estimates for $n = 50, 100, 500, 1000$. Figure 2 has been obtained from a second simulation study, where the interval $[h_n, \mathfrak{h}_n]$ has been widened ($\delta = 1/10$). As already mentioned in Remark 4, Figures 1 and 2 suggest that after a proper linear transformation, the distribution of (17) might converge to a non-degenerate limiting distribution.

Figure 1: Estimated densities of the supremum in (17) for $\delta = 1/20$. The black curve corresponds to $n = 50$, the light gray curve to $n = 100$, the white curve to $n = 500$ and the dark gray curve to $n = 1000$.

Figure 2: Estimated densities of the supremum in (17) for $\delta = 1/10$. The light gray curve corresponds to $n = 50$, the black curve to $n = 100$, the white curve to $n = 500$ and the dark gray curve to $n = 1000$.
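The data-generating design of this section can be sketched in a few lines of Python. The code below draws one sample according to item 1, computes the bandwidth range of item 4, and evaluates the centring parameter $m(C,h,z)$ of (10) for $C = [0,t]$ by numerical integration, using the fact that here $f_Z = 1$ on $[0,1]$ and $P(Y \le t \mid Z = u) = 1 - e^{-ut}$. All function names, the Epanechnikov-type kernel used as $K$ and the midpoint integration rule are assumptions made for this illustration; the source does not specify which kernel $K$ or which grid was used in the actual study.

```python
import math
import random


def kernel(s):
    """Illustrative Epanechnikov-type kernel supported on [-1/2, 1/2]."""
    return 6.0 * (0.25 - s * s) if abs(s) <= 0.5 else 0.0


def simulate_sample(n, rng):
    """Z ~ U(0, 1]; given Z = z, Y is exponential with mean 1/z (rate z)."""
    zs = [1.0 - rng.random() for _ in range(n)]  # in (0, 1], avoids z = 0
    ys = [rng.expovariate(z) for z in zs]        # expovariate takes the rate
    return ys, zs


def bandwidth_range(n, delta=1.0 / 20.0):
    """Lower and upper bandwidths n^(-1/5 - delta) and n^(-1/5 + delta)."""
    return n ** (-0.2 - delta), n ** (-0.2 + delta)


def centring(t, h, z, m=2000):
    """m(C, h, z) for C = [0, t]: ratio of two integrals over the part of
    the kernel window [z - h/2, z + h/2] that lies inside [0, 1]."""
    a, b = max(0.0, z - 0.5 * h), min(1.0, z + 0.5 * h)
    step = (b - a) / m
    num = den = 0.0
    for j in range(m):
        u = a + (j + 0.5) * step          # midpoint of the j-th cell
        k = kernel((u - z) / h)
        num += (1.0 - math.exp(-u * t)) * k
        den += k
    return num / den                      # the common factor `step` cancels


rng = random.Random(2012)
ys, zs = simulate_sample(500, rng)
h_low, h_up = bandwidth_range(500)
print(h_low, h_up, centring(t=1.5, h=h_low, z=0.5))
```

For small $h$, the value returned by `centring` is close to the conditional probability $1 - e^{-zt}$, which is the sense in which $m(C,h,z)$ acts as a deterministic approximation of $P(Y \in C \mid Z = z)$.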