
KERNEL ESTIMATION OF DENSITY LEVEL SETS

Benoît CADRE¹

Laboratoire de Mathématiques, Université Montpellier II, CC 051, Place E. Bataillon, 34095 Montpellier cedex 5, FRANCE

arXiv:math/0501221v1 [math.ST], Jan 2005

Abstract. Let $f$ be a multivariate density and $f_n$ be a kernel estimate of $f$ drawn from the $n$-sample $X_1, \cdots, X_n$ of i.i.d. random variables with density $f$. We compute the asymptotic rate of convergence towards 0 of the volume of the symmetric difference between the $t$-level set $\{f \geq t\}$ and its plug-in estimator $\{f_n \geq t\}$. As a corollary, we obtain the exact rate of convergence of a plug-in type estimate of the density level set corresponding to a fixed probability for the law induced by $f$.

Key-words: Kernel estimate, Density level sets, Hausdorff measure.

2000 Mathematics Subject Classification: 62H12, 62H30.

1. Introduction. Recent years have witnessed an increasing interest in estimation of density level sets and in related multivariate mappings problems. The main reason is the recent advent of powerful mathematical tools and computational machinery that render these problems much more tractable. One of the most powerful applications of density level sets estimation is in unsupervised cluster analysis (see Hartigan [1]), where one tries to break a complex data set into a series of piecewise similar groups or structures, each of which may then be regarded as a separate class of data, thus reducing overall data complexity. But there are many other fields where the knowledge of density level sets is of great interest. For example, Devroye and Wise [2], Grenander [3], Cuevas [4] and Cuevas and Fraiman [5] used density support estimation for pattern recognition and for detection of the abnormal behavior of a system.
¹ cadre@math.univ-montp2.fr

In this paper, we consider the problem of estimating the $t$-level set $\mathcal{L}(t)$ of a multivariate probability density $f$ with support in $\mathbb{R}^k$ from independent random variables $X_1, \cdots, X_n$ with density $f$. Recall that for $t \geq 0$, the $t$-level set of the density $f$ is defined as follows:

$$\mathcal{L}(t) = \{x \in \mathbb{R}^k : f(x) \geq t\}.$$

The question now is how to define the estimates of $\mathcal{L}(t)$ from the $n$-sample $X_1, \cdots, X_n$? Even in a nonparametric framework, there are many possible answers to this question, depending on the restrictions one can impose on the level set and the density under study. Mainly, there are two families of such estimators: the plug-in estimators and the estimators constructed by an excess mass approach. Assume that an estimator $f_n$ of the density $f$ is available. Then a straightforward estimator of the level set $\mathcal{L}(t)$ is $\{f_n \geq t\}$, the plug-in estimator. Molchanov [6, 7] and Cuevas and Fraiman [5] proved consistency of these estimators and obtained some rates of convergence. The excess mass approach suggests to first consider the empirical mapping $M_n$ defined for every borel set $L \subset \mathbb{R}^k$ by

$$M_n(L) = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{\{X_i \in L\}} - t\lambda(L),$$

where $\lambda$ denotes the Lebesgue measure on $\mathbb{R}^k$. A natural estimator of $\mathcal{L}(t)$ is a maximizer of $M_n(L)$ over a given class of borel sets $L$. For different classes of level sets (mainly star-shaped or convex level sets), estimators based on the excess mass approach were studied by Hartigan [8], Müller [9], Müller and Sawitzki [10], Nolan [11] and Polonik [12], who proved consistency and found certain rates of convergence. When the level set is star-shaped, Tsybakov [13] recently proved that the excess mass approach gives estimators with optimal rates of convergence in an asymptotically minimax sense, within the studied classes of densities. Though this result has a great theoretical interest, assuming the level set to be convex or star-shaped appears to be somewhat unsatisfactory for the statistical applications.
Indeed, such an assumption does not permit to consider the important case where the density under study is multimodal with a finite number of modes, and hence the results can not be applied to cluster analysis in particular. In comparison, the plug-in estimators do not care about the specific shape of the level set. Moreover, another advantage of the plug-in approach is that it leads to easily computable estimators. We emphasize that, if the excess mass approach often gives estimators with optimal rates of convergence, the complexity of the computational algorithm of such an estimator is high, due to the presence of the maximizing step (see the computational algorithm proposed by Hartigan, [8]).

In this paper, we study a plug-in type estimator of the density level set $\mathcal{L}(t)$, using a kernel density estimate of $f$ (Rosenblatt, [14]). Given a kernel $K$ on $\mathbb{R}^k$ (i.e., a probability density on $\mathbb{R}^k$) and a bandwidth $h = h(n) > 0$ such that $h \to 0$ as $n$ grows to infinity, the kernel estimate of $f$ is given by

$$f_n(x) = \frac{1}{nh^k} \sum_{i=1}^{n} K\Big(\frac{x - X_i}{h}\Big), \quad x \in \mathbb{R}^k.$$

We let the plug-in estimate $\mathcal{L}_n(t)$ of $\mathcal{L}(t)$ be defined as

$$\mathcal{L}_n(t) = \{x \in \mathbb{R}^k : f_n(x) \geq t\}.$$

In the whole paper, the distance between two borel sets in $\mathbb{R}^k$ is a measure - in particular the volume or Lebesgue measure $\lambda$ on $\mathbb{R}^k$ - of the symmetric difference denoted $\Delta$ (i.e., $A \Delta B = (A \cap B^c) \cup (A^c \cap B)$ for all sets $A, B$). Our main result (Theorem 2.1) deals with the limit law of

$$\sqrt{nh^k}\, \lambda\big(\mathcal{L}_n(t) \Delta \mathcal{L}(t)\big),$$

which is proved to be degenerate.

Consider now the following statistical problem. In cluster analysis for instance, it is of interest to estimate the density level set corresponding to a fixed probability $p \in [0,1]$ for the law induced by $f$. The data contained in this level set can then be regarded as the most important data if $p$ is far enough from 0. Since $f$ is unknown, the level $t$ of this density level set $\mathcal{L}(t)$ is unknown as well.
The natural estimate of the target density level set $\mathcal{L}(t)$ becomes $\mathcal{L}_n(t_n)$, where $t_n$ is such that

$$\int_{\mathcal{L}_n(t_n)} f_n \, d\lambda = p.$$

As a consequence of our main result, we obtain in Corollary 2.1 the exact asymptotic rate of convergence of $\mathcal{L}_n(t_n)$ to $\mathcal{L}(t)$. More precisely, we prove that for some $\beta_n$ which only depends on the data, one has:

$$\beta_n \sqrt{nh^k}\, \lambda\big(\mathcal{L}_n(t_n) \Delta \mathcal{L}(t)\big) \to \sqrt{\frac{2}{\pi} \int K^2 \, d\lambda}$$

in probability.

The precise formulations of Theorem 2.1 and Corollary 2.1 are given in Section 2. Section 3 is devoted to the proof of Theorem 2.1 while the proof of Corollary 2.1 is given in Section 4. The appendix is dedicated to a change of variables formula involving the $(k-1)$-dimensional Hausdorff measure (Proposition A).

2. The main results.

2.1 Estimation of $t$-level sets. In the following, $\Theta \subset (0, \infty)$ denotes an open interval and $\|.\|$ stands for the euclidean norm over any finite dimensional space. Let us introduce the hypotheses on the density $f$:

H1. $f$ is twice continuously differentiable and $f(x) \to 0$ as $\|x\| \to \infty$;

H2. For all $t \in \Theta$,

$$\inf_{f^{-1}(\{t\})} \|\nabla f\| > 0,$$

where, here and in the following, $\nabla \psi(x)$ denotes the gradient at $x \in \mathbb{R}^k$ of the differentiable function $\psi : \mathbb{R}^k \to \mathbb{R}$. Next, we introduce the assumptions on the kernel $K$:

H3. $K$ is a continuously differentiable and compactly supported function. Moreover, there exists a monotone nonincreasing function $\mu : \mathbb{R}_+ \to \mathbb{R}$ such that $K(x) = \mu(\|x\|)$ for all $x \in \mathbb{R}^k$.

The assumption on the support of $K$ is only provided for simplicity of the proofs. As a matter of fact, one could consider a more general class of kernels, including the gaussian kernel for instance. Moreover, as we will use Pollard's results [15], $K$ is assumed to be of the form $\mu(\|.\|)$.

Throughout the paper, $\mathcal{H}$ denotes the $(k-1)$-dimensional Hausdorff measure on $\mathbb{R}^k$ (cf. Evans and Gariepy, [16]). Recall that $\mathcal{H}$ agrees with ordinary "$(k-1)$-dimensional surface area" on nice sets.
Moreover, $\partial A$ is the boundary of the set $A \subset \mathbb{R}^k$,

$$\alpha(k) = \begin{cases} 3 & \text{if } k = 1; \\ k+4 & \text{if } k \geq 2, \end{cases}$$

and for any bounded borel function $g : \mathbb{R}^k \to \mathbb{R}_+$, $\lambda_g$ stands for the measure defined for each borel set $A \subset \mathbb{R}^k$ by

$$\lambda_g(A) = \int_A g \, d\lambda.$$

Finally, the notation $\stackrel{P}{\to}$ denotes the convergence in probability.

It can be proved that if H1, H3 hold and if $\lambda(\partial \mathcal{L}(t)) = 0$, one has:

$$\lambda\big(\mathcal{L}_n(t) \Delta \mathcal{L}(t)\big) \stackrel{P}{\to} 0.$$

The aim of Theorem 2.1 below is to obtain the exact rate of convergence.

Theorem 2.1. Let $g : \mathbb{R}^k \to \mathbb{R}_+$ be a bounded borel function and assume that H1-H3 hold. If $nh^k/(\log n)^{16} \to \infty$ and $nh^{\alpha(k)}(\log n)^2 \to 0$, then for almost every (a.e.) $t \in \Theta$:

$$\sqrt{nh^k}\, \lambda_g\big(\mathcal{L}_n(t) \Delta \mathcal{L}(t)\big) \stackrel{P}{\to} \sqrt{\frac{2t}{\pi} \int K^2 \, d\lambda} \int_{\partial \mathcal{L}(t)} \frac{g}{\|\nabla f\|} \, d\mathcal{H}.$$

Remarks 2.1.
• Notice that the rightmost integral is defined because $g$ is bounded and $\mathcal{L}(t)$ is a compact set for all $t > 0$ according to H1.
• In practice, this result is mainly interesting when $g \equiv 1$, since we then have the asymptotic behavior of the volume of the symmetric difference between the two level sets. The general case is provided for the proof of Corollary 2.1 below.
• If we only assume $f$ to be Lipschitz instead of H1, then $f$ is an almost everywhere continuously differentiable function by Rademacher's theorem and Theorem 2.1 holds under the additional assumption on the bandwidth: $nh^{k+2}(\log n)^2 \to 0$.

2.2 Estimation of level sets with fixed probability. In order to derive the corollary, we need an additional condition on $f$.

H4. For all $t \in (0, \sup_{\mathbb{R}^k} f]$, $\lambda(f^{-1}[t-\varepsilon, t+\varepsilon]) \to 0$ as $\varepsilon \to 0$. Moreover, $\lambda(f^{-1}(0, \varepsilon]) \to 0$ as $\varepsilon \to 0$.

Roughly speaking, H4 means that the sets where $f$ is constant do not charge the Lebesgue measure on $\mathbb{R}^k$. Many densities with a finite number of local extrema satisfy H4. However, notice that if $f$ is a continuous density such that $\lambda(f^{-1}(0, \varepsilon]) \to 0$ as $\varepsilon \to 0$, then it is compactly supported. Let us now denote by $\mathcal{P}$ the application:

$$\mathcal{P} : [0, \sup_{\mathbb{R}^k} f] \to [0,1], \quad t \mapsto \lambda_f(\mathcal{L}(t)).$$

Observe that $\mathcal{P}$ is one-to-one if $f$ satisfies H1, H4.
Then, for all $p \in [0,1]$, let $t^{(p)} \in [0, \sup_{\mathbb{R}^k} f]$ be the unique real number such that $\lambda_f(\mathcal{L}(t^{(p)})) = p$. Moreover, let $t_n^{(p)} \in [0, \sup_{\mathbb{R}^k} f_n]$ be such that $\lambda_{f_n}(\mathcal{L}_n(t_n^{(p)})) = p$. Notice that $t_n^{(p)}$ does exist since $f_n$ is a density on $\mathbb{R}^k$.

The aim of Corollary 2.1 below is to obtain the exact rate of convergence of $\mathcal{L}_n(t_n^{(p)})$ to $\mathcal{L}(t^{(p)})$. We also introduce an estimator of the unknown integral in Theorem 2.1.

Corollary 2.1. Let $k \geq 2$, $(\alpha_n)_n$ be a sequence of positive real numbers such that $\alpha_n \to 0$ and assume that H1-H4 hold. If $nh^{k+2}/\log n \to \infty$, $nh^{k+4}(\log n)^2 \to 0$ and $\alpha_n^2 nh^k/(\log n)^2 \to \infty$ then, for a.e. $p \in \mathcal{P}(\Theta)$:

$$\sqrt{nh^k}\, \frac{\beta_n}{\sqrt{t_n^{(p)}}}\, \lambda\big(\mathcal{L}_n(t_n^{(p)}) \Delta \mathcal{L}(t^{(p)})\big) \stackrel{P}{\to} \sqrt{\frac{2}{\pi} \int K^2 \, d\lambda},$$

where $\beta_n = \alpha_n / \lambda\big(\mathcal{L}_n(t_n^{(p)}) - \mathcal{L}_n(t_n^{(p)} + \alpha_n)\big)$.

Remarks 2.2.
• It is of statistical interest to mention the fact that under the assumptions of the corollary, we have for all $p \in [0,1]$: $t_n^{(p)} \to t^{(p)}$ with probability 1 (see Lemma 4.3).
• When $k = 1$, the conditions of Theorem 2.1 on the bandwidth $h$ do not permit to derive Corollary 2.1. In practice, estimations of density level sets and their applications to cluster analysis for instance are mainly interesting in high-dimensional problems.

3. Proof of Theorem 2.1.

3.1. Auxiliary results and proof of Theorem 2.1. For all $t > 0$, let

$$\mathcal{V}_n^t = f^{-1}\Big[t - \frac{(\log n)^\beta}{\sqrt{nh^k}},\, t\Big] \quad \text{and} \quad \overline{\mathcal{V}}_n^t = f^{-1}\Big[t,\, t + \frac{(\log n)^\beta}{\sqrt{nh^k}}\Big],$$

where $\beta > 1/2$ is fixed. Moreover, $\tilde{K}$ stands for the real number:

$$\tilde{K} = \int K^2 \, d\lambda.$$

Proposition 3.1. Let $g : \mathbb{R}^k \to \mathbb{R}_+$ be a bounded borel function and assume that H1-H3 hold. If $nh^k/(\log n)^{31\beta} \to \infty$ and $nh^{\alpha(k)}(\log n)^{2\beta} \to 0$, then for a.e. $t \in \Theta$:

$$\lim_n \sqrt{nh^k} \int_{\mathcal{V}_n^t} P(f_n(x) \geq t) \, d\lambda_g(x) = \lim_n \sqrt{nh^k} \int_{\overline{\mathcal{V}}_n^t} P(f_n(x) < t) \, d\lambda_g(x) = \sqrt{\frac{t\tilde{K}}{2\pi}} \int_{\partial \mathcal{L}(t)} \frac{g}{\|\nabla f\|} \, d\mathcal{H}.$$

Proposition 3.2. Let $g : \mathbb{R}^k \to \mathbb{R}_+$ be a bounded borel function and assume that H1-H3 hold. If $nh^k/(\log n)^{5\beta} \to \infty$ and $nh^{\alpha(k)}(\log n)^{2\beta} \to 0$, then for a.e. $t \in \Theta$:

$$\lim_n nh^k\, \mathrm{var}\Big[\lambda_g\big(\mathcal{V}_n^t \cap \mathcal{L}_n(t)\big)\Big] = 0 = \lim_n nh^k\, \mathrm{var}\Big[\lambda_g\big(\overline{\mathcal{V}}_n^t \cap \mathcal{L}_n(t)^c\big)\Big].$$

Proof of Theorem 2.1.
Let $t \in \Theta$ be such that both conclusions of Propositions 3.1 and 3.2 hold. According to H3 and Pollard ([15], Theorem 37 and Problem 28, Chapter II), we have almost surely (a.s.):

$$\sup_{\mathbb{R}^k} |f_n - Ef_n| \to 0.$$

Moreover, since both $\sup_n Ef_n(x)$ and $f(x)$ vanish as $\|x\| \to \infty$ by H1, H3, we have:

$$\sup_{\mathbb{R}^k} |Ef_n - f| \to 0.$$

Thus, a.s. and for $n$ large enough:

$$\sup_{\mathbb{R}^k} |f_n - f| \leq \frac{t}{2}.$$

Consequently, $\mathcal{L}_n(t) \subset \mathcal{L}(t/2)$ and since $\mathcal{L}(t) \subset \mathcal{L}(t/2)$, we get:

$$\lambda_g\big(\mathcal{L}_n(t) \Delta \mathcal{L}(t)\big) = \int_{\mathcal{L}(t/2)} \mathbf{1}_{\{f_n < t,\, f \geq t\}} \, d\lambda_g + \int_{\mathcal{L}(t/2)} \mathbf{1}_{\{f_n \geq t,\, f < t\}} \, d\lambda_g. \quad (3.1)$$

Let

$$A_n = \Big\{ \sqrt{nh^k} \sup_{\mathcal{L}(t/2)} |f_n - f| \leq (\log n)^\beta \Big\}.$$

Since $\mathcal{L}(t/2)$ is a compact set by H1, it is a classical exercise to prove that $P(A_n) \to 1$ under the assumptions of the theorem. Hence, one only needs to prove that the result of Theorem 2.1 holds on the event $A_n$. But on $A_n$, one has according to (3.1): $\lambda_g(\mathcal{L}_n(t) \Delta \mathcal{L}(t)) = J_n^1 + J_n^2$, where:

$$J_n^1 = \lambda_g\big(\overline{\mathcal{V}}_n^t \cap \mathcal{L}_n(t)^c\big) \quad \text{and} \quad J_n^2 = \lambda_g\big(\mathcal{V}_n^t \cap \mathcal{L}_n(t)\big).$$

By Propositions 3.1 and 3.2, if $j = 1$ or $j = 2$:

$$\sqrt{nh^k}\, J_n^j \stackrel{P}{\to} \sqrt{\frac{t\tilde{K}}{2\pi}} \int_{\partial \mathcal{L}(t)} \frac{g}{\|\nabla f\|} \, d\mathcal{H}, \quad (3.2)$$

if the bandwidth $h$ satisfies $nh^{\alpha(k)}(\log n)^{2\beta} \to 0$ and $nh^k/(\log n)^{31\beta} \to \infty$. Letting $\beta = 16/31$, the theorem is proved •

3.2. Proof of Proposition 3.1. Let $X$ be a random variable with density $f$,

$$V_n(x) = \mathrm{var}\, K\Big(\frac{x - X}{h}\Big) \quad \text{and} \quad Z_n(x) = \frac{h^k \sqrt{n}}{\sqrt{V_n(x)}} \big(f_n(x) - Ef_n(x)\big),$$

for all $x \in \mathbb{R}^k$ such that $V_n(x) \neq 0$. Moreover, $\Phi$ denotes the distribution function of the $\mathcal{N}(0,1)$ law.

In the proofs, $c$ denotes a positive constant whose value may vary from line to line.

Lemma 3.1. Assume that H1, H3 hold and let $\mathcal{C} \subset \mathbb{R}^k$ be a compact set such that $\inf_{\mathcal{C}} f > 0$. Then, there exists $c > 0$ such that for all $n \geq 1$, $x \in \mathcal{C}$ and $u \in \mathbb{R}$:

$$|P(Z_n(x) \leq u) - \Phi(u)| \leq \frac{c}{\sqrt{nh^k}}.$$

Proof. By the Berry-Esseen inequality (cf. Feller, [17]), one has for all $n \geq 1$, $u \in \mathbb{R}$ and $x \in \mathbb{R}^k$ such that $V_n(x) \neq 0$:

$$|P(Z_n(x) \leq u) - \Phi(u)| \leq \frac{3}{\sqrt{nV_n(x)^3}}\, E\Big|K\Big(\frac{x - X}{h}\Big) - EK\Big(\frac{x - X}{h}\Big)\Big|^3.$$

It is a classical exercise to deduce from H1, H3 that

$$\sup_{x \in \mathcal{C}} E\Big|K\Big(\frac{x - X}{h}\Big) - EK\Big(\frac{x - X}{h}\Big)\Big|^3 \leq ch^k \quad \text{and} \quad \inf_{x \in \mathcal{C}} V_n(x) \geq ch^k,$$

hence the lemma •

For all borel bounded function $g : \mathbb{R}^k \to \mathbb{R}_+$, we let $\Theta_0(g)$ be the set of $t \in \Theta$ such that:

$$\lim_{\varepsilon \searrow 0} \frac{1}{\varepsilon} \lambda_g\big(f^{-1}[t - \varepsilon, t]\big) = \lim_{\varepsilon \searrow 0} \frac{1}{\varepsilon} \lambda_g\big(f^{-1}[t, t + \varepsilon]\big) = \int_{\partial \mathcal{L}(t)} \frac{g}{\|\nabla f\|} \, d\mathcal{H}.$$

Lemma 3.2. Let $g : \mathbb{R}^k \to \mathbb{R}_+$ be a borel bounded function and assume that H1, H2 hold. Then we have: $\Theta_0(g) = \Theta$ a.e.

Proof. According to H1, H2, for all $t \in \Theta$, there exists $\eta > 0$ such that:

$$\inf_{f^{-1}[t - \eta, t + \eta]} \|\nabla f\| > 0.$$

We deduce from Proposition A that for all $t \in \Theta$ and $\varepsilon > 0$ small enough:

$$\frac{1}{\varepsilon} \lambda_g\big(f^{-1}[t - \varepsilon, t]\big) = \frac{1}{\varepsilon} \int_{t - \varepsilon}^{t} \int_{\partial \mathcal{L}(s)} \frac{g}{\|\nabla f\|} \, d\mathcal{H} \, ds.$$

Using the Lebesgue-Besicovitch theorem (cf. Evans and Gariepy, [16], Theorem 1, Chapter I), we then have for a.e. $t \in \Theta$:

$$\lim_{\varepsilon \searrow 0} \frac{1}{\varepsilon} \lambda_g\big(f^{-1}[t - \varepsilon, t]\big) = \int_{\partial \mathcal{L}(t)} \frac{g}{\|\nabla f\|} \, d\mathcal{H},$$

and the same result holds for $\lambda_g(f^{-1}[t, t + \varepsilon])$ instead of $\lambda_g(f^{-1}[t - \varepsilon, t])$, hence the lemma •

It is a straightforward consequence of Lemma 3.2 above that $\lambda(\partial \mathcal{L}(t)) = 0$ for a.e. $t \in \Theta$. For simplicity, we shall assume throughout that this is true for all $t \in \Theta$. Since $\Theta$ is an open interval, we have in particular

$$\lambda\big(f^{-1}[t - \varepsilon, t + \varepsilon]\big) = \lambda\big(f^{-1}(t - \varepsilon, t + \varepsilon)\big),$$

for all $t \in \Theta$ and $\varepsilon > 0$ small enough.

We now let for $t \in \Theta$ and $x \in \mathbb{R}^k$ such that $f(x)V_n(x) \neq 0$:

$$t_n(x) = \sqrt{\frac{nh^k}{\tilde{K} f(x)}}\, (t - f(x)) \quad \text{and} \quad \overline{t}_n(x) = \frac{h^k \sqrt{n}}{\sqrt{V_n(x)}}\, (t - Ef_n(x)),$$

and finally, $\overline{\Phi}(u) = 1 - \Phi(u)$ for all $u \in \mathbb{R}$.

Lemma 3.3. Let $g : \mathbb{R}^k \to \mathbb{R}_+$ be a bounded borel function and assume that H1, H2 hold. If $nh^k/(\log n)^{2\beta} \to \infty$ and $nh^{k+4}(\log n)^{2\beta} \to 0$, then for all $t \in \Theta_0(g)$:

$$\lim_n \sqrt{nh^k} \Big[ \int_{\mathcal{V}_n^t} P(f_n(x) \geq t) \, d\lambda_g(x) - \int_{\mathcal{V}_n^t} \overline{\Phi}(t_n(x)) \, d\lambda_g(x) \Big] = 0$$

and

$$\lim_n \sqrt{nh^k} \Big[ \int_{\overline{\mathcal{V}}_n^t} P(f_n(x) < t) \, d\lambda_g(x) - \int_{\overline{\mathcal{V}}_n^t} \Phi(t_n(x)) \, d\lambda_g(x) \Big] = 0.$$

Proof. We only prove the first equality. Let $t \in \Theta_0(g)$. First note that for all $x \in \mathbb{R}^k$ such that $V_n(x) \neq 0$:

$$P(f_n(x) \geq t) = P(Z_n(x) \geq \overline{t}_n(x)).$$

There exists a compact set $\mathcal{C} \subset \mathbb{R}^k$ such that $\inf_{\mathcal{C}} f > 0$ and $\mathcal{V}_n^t \subset \mathcal{C}$ for all $n$. Observe that by Lemma 3.1 and the above remarks,

$$\sqrt{nh^k} \Big[ \int_{\mathcal{V}_n^t} P(f_n(x) \geq t) \, d\lambda_g(x) - \int_{\mathcal{V}_n^t} \overline{\Phi}(\overline{t}_n(x)) \, d\lambda_g(x) \Big] \leq c\lambda_g(\mathcal{V}_n^t).$$

Since $\lambda_g(\mathcal{V}_n^t) \to 0$ by Lemma 3.2, one only needs now to prove that:

$$E_n := \sqrt{nh^k} \int_{\mathcal{V}_n^t} |\overline{\Phi}(\overline{t}_n(x)) - \overline{\Phi}(t_n(x))| \, d\lambda_g(x) \to 0.$$

One deduces from the Lipschitz property of $\Phi$ that

$$E_n \leq c\sqrt{nh^k}\, \lambda_g(\mathcal{V}_n^t) \sup_{x \in \mathcal{V}_n^t} |\overline{t}_n(x) - t_n(x)|. \quad (3.3)$$

But, by definitions of $t_n(x)$ and $\overline{t}_n(x)$, we have for all $x \in \mathcal{V}_n^t$:

$$\frac{1}{\sqrt{nh^k}} |t_n(x) - \overline{t}_n(x)| \leq |t - f(x)| \, \bigg| \frac{1}{\sqrt{\tilde{K} f(x)}} - \frac{1}{\sqrt{V_n(x) h^{-k}}} \bigg| + \sqrt{\frac{h^k}{V_n(x)}}\, |Ef_n(x) - f(x)|$$

$$\leq \frac{(\log n)^\beta}{\sqrt{nh^k}} \sqrt{\frac{|\tilde{K} f(x) - V_n(x) h^{-k}|}{\tilde{K} f(x) V_n(x) h^{-k}}} + \sqrt{\frac{h^k}{V_n(x)}}\, |Ef_n(x) - f(x)|. \quad (3.4)$$

It is a classical exercise to deduce from H1, H3 that, since $\mathcal{V}_n^t$ is contained in $\mathcal{C}$,

$$\sup_{x \in \mathcal{V}_n^t} |Ef_n(x) - f(x)| \leq ch^2,$$

and similarly, that

$$\sup_{x \in \mathcal{V}_n^t} |\tilde{K} f(x) - V_n(x) h^{-k}| \leq ch.$$

One deduces from (3.4) and above that

$$\sup_{x \in \mathcal{V}_n^t} |t_n(x) - \overline{t}_n(x)| \leq c\big(\sqrt{h}\, (\log n)^\beta + \sqrt{nh^{k+4}}\big).$$
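
As a numerical illustration of the objects studied in this paper, the following minimal Python sketch computes a kernel estimate $f_n$, the volume of the symmetric difference between the plug-in level set and the true level set, and a data-driven level attached to a fixed probability $p$. Everything specific in it is an assumption made for the example and is not taken from the paper: the dimension $k = 1$ (Corollary 2.1 itself requires $k \geq 2$), the target density $f(x) = \frac{35}{32}(1 - x^2)^3$ on $[-1, 1]$, the biweight kernel (continuously differentiable, compactly supported and radially nonincreasing, in the spirit of H3), the values of $n$, $h$, $t$ and $p$, the function names, and the use of Riemann sums on a regular grid in place of exact Lebesgue integrals. It does not attempt to verify the rates of Theorem 2.1.

```python
import numpy as np


def biweight(u):
    """Biweight kernel on [-1, 1]: C^1, compactly supported, radially nonincreasing."""
    return np.where(np.abs(u) <= 1.0, (15.0 / 16.0) * (1.0 - u**2) ** 2, 0.0)


def kernel_estimate(x, sample, h):
    """Kernel estimate f_n(x) = (1 / (n h)) * sum_i K((x - X_i) / h), for k = 1."""
    u = (x[:, None] - sample[None, :]) / h
    return biweight(u).sum(axis=1) / (sample.size * h)


def symmetric_difference_volume(fn_vals, f_vals, t, dx):
    """Riemann-sum approximation of lambda({f_n >= t} symmetric-difference {f >= t})."""
    return np.count_nonzero((fn_vals >= t) != (f_vals >= t)) * dx


def level_for_probability(fn_vals, p, dx):
    """Grid approximation of a level t with integral of f_n over {f_n >= t} close to p."""
    sorted_vals = np.sort(fn_vals)[::-1]          # largest density values first
    mass = np.cumsum(sorted_vals) * dx            # mass of the growing upper level set
    idx = min(int(np.searchsorted(mass, p)), sorted_vals.size - 1)
    return sorted_vals[idx]


# Hypothetical experiment settings (not from the paper).
rng = np.random.default_rng(0)
n, h, t, p = 1000, 0.2, 0.5, 0.5

# If B ~ Beta(4, 4), then 2B - 1 has density (35/32) * (1 - x^2)^3 on [-1, 1].
sample = 2.0 * rng.beta(4.0, 4.0, size=n) - 1.0

grid = np.linspace(-1.5, 1.5, 2001)
dx = grid[1] - grid[0]
f_vals = np.where(np.abs(grid) <= 1.0, (35.0 / 32.0) * (1.0 - grid**2) ** 3, 0.0)
fn_vals = kernel_estimate(grid, sample, h)

vol = symmetric_difference_volume(fn_vals, f_vals, t, dx)
t_p = level_for_probability(fn_vals, p, dx)
mass_p = fn_vals[fn_vals >= t_p].sum() * dx

print("volume of symmetric difference:", vol)
print("rescaled by sqrt(n h):", np.sqrt(n * h) * vol)
print("estimated level for p:", t_p, "with mass", mass_p)
```

Sorting the grid values of $f_n$ in decreasing order turns the search for the level into a cumulative sum, which avoids a root-finding step; this is a design choice of the sketch, and a finer grid or a bisection on $t$ would serve equally well.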
