Improved Bounds on the Sample Complexity of Learning∗

Yi Li
Department of Computer Science
National University of Singapore
Singapore 117543, Republic of Singapore
[email protected]

Philip M. Long
Department of Computer Science
National University of Singapore
Singapore 117543, Republic of Singapore
[email protected]

Aravind Srinivasan†
Bell Laboratories, Lucent Technologies
600-700 Mountain Avenue
Murray Hill, NJ 07974-0636, USA
[email protected]

Abstract

We present a new general upper bound on the number of examples required to estimate all of the expectations of a set of random variables uniformly well. The quality of the estimates is measured using a variant of the relative error proposed by Haussler and Pollard. We also show that our bound is within a constant factor of the best possible. Our upper bound implies improved bounds on the sample complexity of learning according to Haussler's decision theoretic model.

∗A preliminary version of this work appeared in the Proceedings of the Eleventh Annual ACM-SIAM Symposium on Discrete Algorithms, 2000.
†Part of this work was done while this author was at the School of Computing of the National University of Singapore.

Keywords: Sample complexity, machine learning, empirical process theory, PAC learning, agnostic learning.

1 Introduction

Haussler [3], building on the work of Valiant [13], Vapnik [14] and others, introduced an abstract model of learning that unified the treatment of a variety of problems. In Haussler's model, "examples" are drawn independently at random according to some probability distribution and given to the learning algorithm, whose goal is to output a "hypothesis" that performs nearly as well as the best hypothesis in some "comparison class". The number of examples sufficient to ensure that, with high probability, a relatively accurate hypothesis can be determined has become known as the sample complexity in this context.
Haussler reduced the study of sample complexity to a more basic problem. For a real-valued function f and a probability distribution P over the domain of f, a natural estimate of the expectation of f(x) when x is drawn according to P can be obtained as follows: obtain several samples x_1, ..., x_m independently from P, and use the sample average (1/m) Σ_{i=1}^m f(x_i) as the estimated expectation. Chernoff-Hoeffding bounds can generally be used to show that accurate estimates are likely to be obtained here if m is large enough. To get good sample complexity bounds in Haussler's model, we need a generalization of this setting: for a domain X, a probability distribution P over X, and a possibly infinite set F of functions defined on X, one wants to use one collection of independent draws from P to simultaneously estimate the expectations of all the functions in F (with respect to P). Let ν > 0 be an adjustable parameter. Haussler proposed using the following measure of distance between two non-negative reals r and s to determine how far the estimates are from the true expectations:

    d_ν(r, s) = |r − s| / (r + s + ν).

This can be thought of as a modification of the usual notion of "relative error" to make it well-behaved around 0 (i.e., when both r and s are non-negative reals that are close to 0) and symmetric in its arguments r and s. It can be verified that d_ν is a metric on the set of non-negative reals R^+, and has some good metric properties, such as being compatible with the ordering on the reals (if 0 ≤ r < s < t, then d_ν(r, s) < d_ν(r, t) and d_ν(s, t) < d_ν(r, t)) [3]. Also, as seen below, upper bounds on this metric yield upper bounds for other familiar distance metrics.

The pseudo-dimension [10] (defined in Section 2) of a class F of [0,1]-valued functions is a generalization of the Vapnik-Chervonenkis dimension [15], and is a measure of the "richness" of F.
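The distance d_ν is simple enough to state directly in code. The following minimal sketch (the function name `d_nu` is our own, not from the paper) implements it and spot-checks the ordering-compatibility property quoted above on one numeric example.

```python
def d_nu(r, s, nu):
    """Haussler's nu-adjusted relative distance between non-negative reals."""
    return abs(r - s) / (r + s + nu)

# Well-behaved at 0: identical arguments give distance 0, even at the origin.
assert d_nu(0.0, 0.0, nu=0.1) == 0.0

# Compatibility with the ordering on the reals: for 0 <= r < s < t,
# d_nu(r, s) < d_nu(r, t) and d_nu(s, t) < d_nu(r, t).
r, s, t, nu = 0.1, 0.2, 0.4, 0.05
assert d_nu(r, s, nu) < d_nu(r, t, nu)
assert d_nu(s, t, nu) < d_nu(r, t, nu)
```

Note that the ν in the denominator keeps the distance small when both arguments are small, which is exactly what lets learning bounds stated in d_ν be insensitive to hypotheses whose error is already near the best achievable.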
Haussler [3] and Pollard [11] showed that, for any class F whose pseudo-dimension is d, if we sample

    O((1/(α²ν)) (d log(1/α) + d log(1/ν) + log(1/δ)))        (1)

times, then with probability 1 − δ, the d_ν distance between the sample average and the true expectation will be at most α, for all the functions in F. In this paper, we prove a bound of

    O((1/(α²ν)) (d log(1/ν) + log(1/δ)))

examples, which improves on (1) by a logarithmic factor when α is relatively small. Furthermore, we show that our bound is optimal to within a constant factor.

A line of research culminating in the work of Talagrand [12] studied the analogous problem in which the absolute value of the difference between the sample average and the true expectation was used instead of the d_ν metric: O((1/α²)(d + log(1/δ))) examples have been shown to suffice here. A disadvantage of this type of analysis is that, informally, the bottleneck occurs with random variables whose expectation is close to 1/2. In a learning context, these correspond to hypotheses whose error is close to that obtained through random guessing. If good hypotheses are available, then accurate estimates of the quality of poor hypotheses are unnecessary. The d_ν metric enables one to take advantage of this observation to prove stronger bounds for learning when good hypotheses are available, which is often the case in practice. (See [3] for a more detailed discussion of the advantages of the d_ν metric.) In any case, our upper bound yields a bound within a constant factor of Talagrand's by setting α = ε, ν = 1/2, and the upper bound for PAC learning by setting ν = ε, α = 1/2.

Our upper bound proof makes use of chaining, a proof technique due to Kolmogorov, which was first applied to empirical process theory by Dudley.¹ Our analysis is the first application we know of chaining to bound the sample complexity of obtaining small relative error.
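To see the size of the improvement concretely, one can evaluate the two expressions inside the O(·) notation. The sketch below uses a placeholder constant C = 1 (the paper does not specify the hidden constants, and the function names are ours); it only illustrates that the ratio of bound (1) to the new bound grows like log(1/α) as α shrinks.

```python
from math import log

def haussler_pollard_bound(d, alpha, nu, delta, C=1.0):
    # Expression inside the O(.) of bound (1); C is a placeholder constant.
    return C / (alpha**2 * nu) * (d * log(1/alpha) + d * log(1/nu) + log(1/delta))

def new_bound(d, alpha, nu, delta, C=1.0):
    # The improved bound: the d*log(1/alpha) term is gone.
    return C / (alpha**2 * nu) * (d * log(1/nu) + log(1/delta))

# As alpha shrinks with d, nu, delta fixed, the gap widens logarithmically:
for alpha in (0.1, 0.01, 0.001):
    ratio = haussler_pollard_bound(10, alpha, 0.1, 0.05) / new_bound(10, alpha, 0.1, 0.05)
    print(f"alpha={alpha}: old/new = {ratio:.2f}")
```

For α much smaller than ν this ratio is dominated by d log(1/α) against d log(1/ν), which is the "logarithmic factor" referred to above.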
The proof of our lower bound generalizes an argument of [5] to the case in which estimates are potentially nonzero.

¹See [10] for a discussion of the history of chaining, and [8] for a simple proof of a bound within a constant factor of Talagrand's using chaining.

2 Preliminaries

Fix a countably infinite domain X. (We assume X is countable for convenience, but weaker assumptions suffice: see [3].) The pseudo-dimension of a set F of functions from X to [0,1], denoted by Pdim(F), is the largest d such that there is a sequence x_1, ..., x_d of domain elements from X and a sequence r_1, ..., r_d of real thresholds such that for each b_1, ..., b_d ∈ {above, below}, there is an f ∈ F such that for all i = 1, ..., d, we have f(x_i) ≥ r_i ⇔ b_i = above. For k ∈ N, the pseudo-dimension of a subset F of [0,1]^k is defined using the above by viewing the elements of F as functions from {1, ..., k} to [0,1]. The VC-dimension is the restriction of the pseudo-dimension to sets of functions from X to {0,1}.

We will make use of the usual Hoeffding bound; let exp(x) denote e^x.

Lemma 1 ([6]) Let Y_1, ..., Y_m be independent random variables for which each Y_i takes values in [a_i, b_i]. Then for any η > 0, we have

    Pr(|(Σ_{i=1}^m Y_i) − (Σ_{i=1}^m E(Y_i))| > η) ≤ 2 exp(−2η² / Σ_{i=1}^m (b_i − a_i)²).

The following lower bound is a slight modification of Theorem 5 on page 12 of [1], and is proved similarly.

Lemma 2 Suppose that Y_1, ..., Y_m is a sequence of independent random variables taking only the values 0 and 1, and that for all i, Pr(Y_i = 1) = p. Suppose q = 1 − p and mq ≥ 1. For any integer k such that k = mp − h ≥ 1, h > 0, if we define β = 1/(12k) + 1/(12(m − k)), then

    Pr(Σ_{i=1}^m Y_i = k) ≥ (1/√(2πpqm)) exp(−h²/(2pqm) − h³/(2p²m²) − h⁴/(3q³m³) − h/(2qm) − β).

Proof: Robbins' formula (m! = (m/e)^m √(2πm) · e^{α_m}, where 1/(12m + 1) ≤ α_m ≤ 1/(12m)) gives

    Pr(Σ_{i=1}^m Y_i = k) ≥ (pm/k)^k (qm/(m−k))^{m−k} (m/(2πk(m−k)))^{1/2} e^{−β}
        = (2πpqm)^{−1/2} e^{−β} (pm/k)^{k+1/2} (qm/(m−k))^{m−k+1/2}
        = (2πpqm)^{−1/2} e^{−β} (1 − h/(pm))^{−k−1/2} (1 + h/(qm))^{−m+k−1/2}.

Since, for t > 0, ln(1 + t) < t − t²/2 + t³/3 and ln(1 − t) < −t − t²/2 − t³/3, we have

    Pr(Σ_{i=1}^m Y_i = k) ≥ (2πpqm)^{−1/2} e^{−β} exp((pm − h + 1/2)·(h/(pm) + h²/(2p²m²) + h³/(3p³m³))
        − (qm + h + 1/2)·(h/(qm) − h²/(2q²m²) + h³/(3q³m³))).

Expanding this expression and noting that h < mp and mq ≥ 1 completes the proof.

The following correlational result involving a "balls and bins" experiment will be useful.

Lemma 3 ([9, 2]) Suppose we throw m balls independently at random into n bins, each ball having an arbitrary distribution. Let B_i be the random variable denoting the number of balls in the ith bin. Then for any t_1, ..., t_n, Pr(∧_{i=1}^n B_i ≥ t_i) ≤ ∏_{i=1}^n Pr(B_i ≥ t_i).

We will also use the following, which has been previously used (see [7]).

Lemma 4 For all x ∈ [0,1] and all real a ≤ 1, (1 − a)^x ≤ 1 − ax.

Proof: The LHS is convex in x, and the RHS is linear in x; the LHS and RHS are equal when x is 0 or 1.

For x⃗ = (x_1, ..., x_m) ∈ X^m and f : X → [0,1], define Ê_x⃗(f) = (1/m) Σ_{i=1}^m f(x_i) to be the sample average of f with respect to x⃗. For a probability distribution P over X and a function f defined on X, let E_P(f) denote the expectation of f(x) when x is drawn according to P.

Recall from the introduction that for ν > 0 and r, s ≥ 0, d_ν(r, s) = |r − s|/(ν + r + s). We will find it useful in our analysis to extend the domain of d_ν to pairs r, s for which r + s > −ν.
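The shattering condition in the pseudo-dimension definition above can be checked mechanically for a finite class viewed as a subset of [0,1]^k. The brute-force sketch below (our own code, exponential time, intended only for tiny inputs) relies on the observation that it suffices to try thresholds among the coordinate values actually taken by members of F, since the dichotomy induced by f(x_i) ≥ r_i changes only when r_i crosses one of those values.

```python
from itertools import combinations, product

def shatters(F, idx, r):
    """Does the threshold vector r witness shattering of coordinates idx by F?"""
    patterns = {tuple(f[i] >= ri for i, ri in zip(idx, r)) for f in F}
    return len(patterns) == 2 ** len(idx)

def pdim(F, k):
    """Brute-force pseudo-dimension of a finite F, viewed as functions
    {1,...,k} -> [0,1]. Only thresholds among observed values are tried."""
    values = sorted({f[i] for f in F for i in range(k)})
    best = 0
    for d in range(1, k + 1):
        if any(shatters(F, idx, r)
               for idx in combinations(range(k), d)
               for r in product(values, repeat=d)):
            best = d
        else:
            break
    return best

# All four {0,1}-valued functions on two points shatter both points:
assert pdim([(0, 0), (0, 1), (1, 0), (1, 1)], 2) == 2
# Functions with at most one 1 (as in the lower-bound class of Section 4,
# with a single type of size 3) cannot realize the pattern (above, above):
assert pdim([(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0)], 3) == 1
```

For {0,1}-valued classes, as noted above, this quantity coincides with the VC-dimension.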
For a family F of [0,1]-valued functions defined on X, define opt(F, ν, α, δ) to be the least M such that for all m ≥ M, for any probability distribution P over X, if m examples x⃗ = (x_1, ..., x_m) are drawn independently at random according to P, then with probability at least 1 − δ, for all f ∈ F, d_ν(Ê_x⃗(f), E_P(f)) ≤ α. Let opt(d, ν, α, δ) be the maximum of opt(F, ν, α, δ) over all choices of F for which Pdim(F) = d. In other words, opt(d, ν, α, δ) is the best possible bound on opt(F, ν, α, δ) in terms of Pdim(F), ν, α, and δ.

The following is the main result of this paper.

Theorem 5 opt(d, ν, α, δ) = Θ((1/(α²ν))(d log(1/ν) + log(1/δ))).

3 Upper bound

For each positive integer m, let Γ_m denote the set of all permutations of {1, ..., 2m} that, for each i ≤ m, either swap i and m + i, or leave both i and m + i fixed. For any g ∈ R^{2m} and σ ∈ Γ_m, let µ_1(g, σ) = (1/m) Σ_{i=1}^m g_{σ(i)} and µ_2(g, σ) = (1/m) Σ_{i=1}^m g_{σ(m+i)}.

We will make use of the following known lemma, which is proved by first bounding the probability that a sample gives rise to an inaccurate estimate in terms of the probability that two samples give rise to dissimilar estimates, and then applying the fact that any permutation that swaps corresponding elements of the two samples is equally likely.

Lemma 6 ([15, 10, 3]) Choose a set F of functions from X to [0,1], a probability distribution P over X, and ν > 0, 0 < α < 1, and m ≥ 2/(α²ν). Suppose U is the uniform distribution over Γ_m. Then,

    P^m{x⃗ : ∃f ∈ F, d_ν(Ê_x⃗(f), E_P(f)) > α}
        ≤ 2 · sup_{x⃗ ∈ X^{2m}} U{σ : ∃f ∈ F, d_ν((1/m) Σ_{i=1}^m f(x_{σ(i)}), (1/m) Σ_{i=1}^m f(x_{σ(m+i)})) > α/2}.

We will use the following lemma due to Haussler.

Lemma 7 ([3]) Choose m ∈ N. Let g ∈ [0,1]^{2m}, ν > 0, and 0 < α < 1, and let U be the uniform distribution over Γ_m. Then U{σ : d_ν(µ_1(g, σ), µ_2(g, σ)) > α} ≤ 2e^{−2α²νm}.
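A member of Γ_m can be sampled by flipping an independent fair coin for each index i ≤ m to decide whether i and m + i are swapped. The Monte Carlo sketch below (function names are ours, and the fixed vector g is an arbitrary illustration, not from the paper) estimates the tail probability U{σ : d_ν(µ_1, µ_2) > α} and checks it against Lemma 7's bound 2e^{−2α²νm}.

```python
import math
import random

def d_nu(r, s, nu):
    return abs(r - s) / (r + s + nu)

def mu_pair(g, swap):
    """mu_1 and mu_2 of g in R^{2m} under the swap permutation encoded by
    `swap`: swap[i] == True means sigma exchanges i and m + i."""
    m = len(g) // 2
    top = [g[m + i] if swap[i] else g[i] for i in range(m)]
    bot = [g[i] if swap[i] else g[m + i] for i in range(m)]
    return sum(top) / m, sum(bot) / m

def estimate_tail(g, nu, alpha, trials=5000, seed=0):
    """Monte Carlo estimate of U{sigma : d_nu(mu_1, mu_2) > alpha}."""
    rng = random.Random(seed)
    m = len(g) // 2
    hits = 0
    for _ in range(trials):
        swap = [rng.random() < 0.5 for _ in range(m)]
        m1, m2 = mu_pair(g, swap)
        hits += d_nu(m1, m2, nu) > alpha
    return hits / trials

m, nu, alpha = 50, 0.1, 0.3
g = [((i * 37) % 100) / 100.0 for i in range(2 * m)]  # arbitrary fixed g in [0,1]^{2m}
emp = estimate_tail(g, nu, alpha)
bound = 2 * math.exp(-2 * alpha**2 * nu * m)
assert emp <= bound  # Lemma 7 holds, typically with a lot of room to spare
```

Note that µ_1 + µ_2 does not depend on σ, so only the numerator |µ_1 − µ_2| of d_ν fluctuates; this is what makes the Hoeffding-style analysis of the next lemma possible.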
Lemma 8 shows that when the ℓ_1 norm of g is relatively small, one can get something stronger.

Lemma 8 Choose ν, α > 0. Choose m ∈ N and g ∈ [−1,1]^{2m} for which Σ_{i=1}^{2m} |g_i| ≤ cνm for some c ≤ 2/3. Then if U is the uniform distribution over Γ_m, U{σ : d_ν(µ_1(g, σ), µ_2(g, σ)) > α} ≤ 2e^{−α²νm/(36c)}.

Proof: Expanding the definition of d_ν and simplifying, we get

    d_ν(µ_1(g, σ), µ_2(g, σ)) = |Σ_{i=1}^m (g_{σ(i)} − g_{σ(m+i)})| / (νm + Σ_{i=1}^{2m} g_i).

Also, note that Σ_{i=1}^m (g_i − g_{m+i})² ≤ Σ_{i=1}^m 2|g_i − g_{m+i}| ≤ 2cνm. One can sample uniformly from Γ_m by independently deciding whether σ swaps i and m + i, for i = 1, ..., m. Thus, applying the Hoeffding bound (Lemma 1) with the fact that −|g_{σ(i)} − g_{σ(m+i)}| ≤ g_{σ(i)} − g_{σ(m+i)} ≤ |g_{σ(i)} − g_{σ(m+i)}|, we get

    U{σ : d_ν(µ_1(g, σ), µ_2(g, σ)) > α} = U{σ : |Σ_{i=1}^m (g_{σ(i)} − g_{σ(m+i)})| > α(νm + Σ_{i=1}^{2m} g_i)}
        ≤ 2 exp(−α²(νm + Σ_{i=1}^{2m} g_i)² / (4cνm)).

Since −cνm ≤ Σ_{i=1}^{2m} g_i ≤ cνm and c ≤ 2/3, the term (νm + Σ_{i=1}^{2m} g_i)² takes its minimal value at Σ_{i=1}^{2m} g_i = −cνm. Therefore

    U{σ : d_ν(µ_1(g, σ), µ_2(g, σ)) > α} ≤ 2 exp(−α²(νm − cνm)²/(4cνm)).

Since c ≤ 2/3, the lemma follows.

For v⃗, w⃗ ∈ R^k, let ℓ_1(v⃗, w⃗) = (1/k) Σ_{i=1}^k |v_i − w_i|. For F ⊆ R^k and v⃗ ∈ R^k, define ℓ_1(v⃗, F) = min{ℓ_1(v⃗, f⃗) : f⃗ ∈ F}; if F = ∅, then ℓ_1(v⃗, F) = ∞. The following result of [4] bounds the size of a "well-separated" set of a certain pseudo-dimension:

Lemma 9 ([4]) For all k ∈ N, for all 0 < ε ≤ 1, if each pair f, g of distinct elements of some F ⊆ [0,1]^k has ℓ_1(f, g) > ε, then |F| ≤ (41/ε)^{Pdim(F)}.
The following is the key lemma in our analysis, and is a new application of chaining.

Lemma 10 Choose d ∈ N. Choose an integer m ≥ 125(2d + 1)/(α²ν) and F ⊆ [0,1]^{2m} for which Pdim(F) = d. Then if U is the uniform distribution over Γ_m, for any α > 0, ν > 0,

    U{σ : ∃f ∈ F, d_ν(µ_1(f, σ), µ_2(f, σ)) > α} ≤ 6 · (2624/ν)^d e^{−α²νm/90}.

Proof: Let F_{−1} = ∅. For each nonnegative integer j, construct F_j by initializing it to F_{j−1}, and, as long as there is an f ∈ F for which ℓ_1(f, F_j) > ν/2^{2j+4}, choosing such an f and adding it to F_j. For each f ∈ F and each j ≥ 0, choose an element ψ_j(f) of F_j such that ℓ_1(f, ψ_j(f)) is minimized. (Since ℓ_1(v⃗, w⃗) > ν/2^{2j+4} for distinct v⃗, w⃗ ∈ F_j, F_j is finite by Lemma 9, so this minimum is well-defined.) We have ℓ_1(f, ψ_j(f)) ≤ ν/2^{2j+4}, as otherwise f would have been added to F_j. Let G_0 = F_0, and for each j > 0, define G_j to be {f − ψ_{j−1}(f) : f ∈ F_j}. Since ℓ_1(f, ψ_{j−1}(f)) ≤ ν/2^{2j+2}, we have for all g ∈ G_j that Σ_{i=1}^{2m} |g_i| ≤ νm/2^{2j+1}. By induction, for each k, each f ∈ F_k has g_{f,0} ∈ G_0, ..., g_{f,k} ∈ G_k such that f = Σ_{j=0}^k g_{f,j}. Let F* = ∪_k F_k. Since for all f ∈ F, for all k, we have ℓ_1(f, ψ_k(f)) ≤ ν/2^{2k+4}, F* is dense in F with respect to ℓ_1.

Define p = U{σ : ∃f ∈ F, d_ν(µ_1(f, σ), µ_2(f, σ)) > α}. Since F* is dense in F, p = U{σ : ∃f ∈ F*, d_ν(µ_1(f, σ), µ_2(f, σ)) > α}, and thus

    p = U{σ : ∃f ∈ F*, |Σ_{i=1}^m (f_{σ(i)} − f_{σ(m+i)})| > α(νm + Σ_{i=1}^{2m} f_i)}.

For each f ∈ F*, there are g_{f,0} ∈ G_0, g_{f,1} ∈ G_1, ... such that f = Σ_{j=0}^∞ g_{f,j} (only a finite number of the g_{f,j}'s are nonzero). Applying the triangle inequality, we see that

    p ≤ U{σ : ∃f ∈ F*, Σ_{j=0}^∞ |Σ_{i=1}^m ((g_{f,j})_{σ(i)} − (g_{f,j})_{σ(m+i)})| > α(νm + Σ_{j=0}^∞ Σ_{i=1}^{2m} (g_{f,j})_i)}.
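The nets F_j in the proof above are built greedily: repeatedly pick any function farther than the current scale from the net and add it. One level of this construction can be sketched as follows (our own helper names; ℓ_1 is the normalized distance defined before Lemma 9).

```python
def l1(u, v):
    """Normalized l1 distance on R^k, as defined before Lemma 9."""
    return sum(abs(a - b) for a, b in zip(u, v)) / len(u)

def greedy_net(F, eps, seed_net=()):
    """One level of the Lemma 10 construction: extend seed_net (F_{j-1}) to an
    eps-separated set F_j by greedily adding far-away members of F."""
    net = list(seed_net)
    for f in F:
        if all(l1(f, g) > eps for g in net):  # vacuously true when net is empty
            net.append(f)
    return net

def nearest(f, net):
    """The projection psi_j(f): a closest net element to f."""
    return min(net, key=lambda g: l1(f, g))

F = [(i / 10.0, (i * 3 % 10) / 10.0) for i in range(11)]
net = greedy_net(F, eps=0.25)
# Every f is within eps of the net (in the proof, eps = nu / 2**(2*j + 4)) ...
assert all(l1(f, nearest(f, net)) <= 0.25 for f in F)
# ... and the net is eps-separated, so Lemma 9 bounds its size.
assert all(l1(u, v) > 0.25 for u in net for v in net if u is not v)
```

The chaining argument then decomposes each f ∈ F* as a telescoping sum of differences between consecutive projections, and applies Lemma 8 at each scale.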
Let ν_0 = ν/3, and for each j ∈ N, let ν_j = ν√(j+1)/(3·2^j). Then Σ_{j=0}^∞ ν_j ≤ ν, and hence

    p ≤ U{σ : ∃f ∈ F*, Σ_{j=0}^∞ |Σ_{i=1}^m ((g_{f,j})_{σ(i)} − (g_{f,j})_{σ(m+i)})| > Σ_{j=0}^∞ α(ν_j m + Σ_{i=1}^{2m} (g_{f,j})_i)}
      ≤ Σ_{j=0}^∞ U{σ : ∃f ∈ F*, |Σ_{i=1}^m ((g_{f,j})_{σ(i)} − (g_{f,j})_{σ(m+i)})| > α(ν_j m + Σ_{i=1}^{2m} (g_{f,j})_i)}
      ≤ Σ_{j=0}^∞ U{σ : ∃g ∈ G_j, |Σ_{i=1}^m (g_{σ(i)} − g_{σ(m+i)})| > α(ν_j m + Σ_{i=1}^{2m} g_i)},        (2)

since each g_{f,j} ∈ G_j.

Choose j > 0. For each g ∈ G_j, Σ_{i=1}^{2m} |g_i| ≤ νm/2^{2j+1}, and ν_j m = νm√(j+1)/(3·2^j). By applying Lemma 8 with c_j = 3/(√(j+1)·2^{j+1}), we see that

    U{σ : ∃g ∈ G_j, |Σ_{i=1}^m (g_{σ(i)} − g_{σ(m+i)})| > α(ν_j m + Σ_{i=1}^{2m} g_i)} ≤ 2|G_j| exp(−α²ν_j m/(36 c_j)).

Plugging in the values of ν_j and c_j, we get

    U{σ : ∃g ∈ G_j, |Σ_{i=1}^m (g_{σ(i)} − g_{σ(m+i)})| > α(ν_j m + Σ_{i=1}^{2m} g_i)} ≤ 2|G_j| exp(−α²νm(j+1)/180).

Distinct elements v⃗ and w⃗ of F_j have ℓ_1(v⃗, w⃗) > ν/2^{2j+4}; so by Lemma 9, |G_j| ≤ (164·4^{j+1}/ν)^d.
Thus

    Σ_{j=1}^∞ U{σ : ∃g ∈ G_j, |Σ_{i=1}^m (g_{σ(i)} − g_{σ(m+i)})| > α(ν_j m + Σ_{i=1}^{2m} g_i)}
        ≤ Σ_{j=1}^∞ 2(164·4^{j+1}/ν)^d e^{−α²νm(j+1)/180}
        ≤ 4(164·16/ν)^d e^{−α²νm/90}.        (3)

Note that G_0 = F_0, and therefore the elements of G_0 are in [0,1]^{2m}. Thus, we can apply Lemma 7 to get

    U{σ : ∃g ∈ G_0, |Σ_{i=1}^m (g_{σ(i)} − g_{σ(m+i)})| > α(ν_0 m + Σ_{i=1}^{2m} g_i)} ≤ 2|G_0| exp(−2α²ν_0 m).

Substituting the value of ν_0, upper bounding |G_0| using Lemma 9, and combining with (2) and (3) completes the proof.

Combining Lemma 10 with Lemma 6 and solving for m proves the upper bound of Theorem 5.

4 Lower bound

In this section, we establish the lower bound side of Theorem 5. For positive integers d and n, we define X_{d,n} to be an arbitrary set of nd elements of X. We view X_{d,n} as the union of d disjoint subsets, which we will call types; there will be n elements of each type. We refer to the jth element of type i as a_{i,j}. Let P_{d,n} be the uniform distribution on X_{d,n}. The function class F_d consists of all functions mapping X to {0,1} that take the value 1 on at most one point in each type, and take the value 0 outside of X_{d,n}. It is easy to check that the pseudo-dimension of F_d is d.

We begin by establishing the first term of the lower bound.

Theorem 11 For any real 0 < ν ≤ 1/100, 0 < α ≤ 1/100, 0 < δ ≤ 1/5 and any integer d ≥ 1,

    opt(F_d, ν, α, δ) > (d/(30α²ν)) ln(1/(3ν)).

Proof: Suppose d, α and ν are given. Set n = ⌊1/ν⌋ and P = P_{d,n}. We will show that if m = ⌊d ln(1/(3ν))/(30α²ν)⌋, then

    P^m{x⃗ : ∃f ∈ F_d, d_ν(Ê_x⃗(f), E_P(f)) > α} > 1/5,

proving the theorem.
For a sample x⃗, for each type i, and each j ∈ {1, ..., n}, let B_{i,j}(x⃗) denote the number of times that a_{i,j} appears in x⃗. If

    p = P^m{x⃗ : ∃f ∈ F_d, d_ν(Ê_x⃗(f), E_P(f)) > α},

we have

    p ≥ P^m{x⃗ : ∃f ∈ F_d, d_ν(Ê_x⃗(f), E_P(f)) > α, Ê_x⃗(f) < E_P(f)}
      = P^m{x⃗ : ∃f ∈ F_d, (E_P(f) − Ê_x⃗(f))/(E_P(f) + Ê_x⃗(f) + ν) > α}
      = P^m{x⃗ : ∃f ∈ F_d, Ê_x⃗(f) < ((1−α)/(1+α)) E_P(f) − αν/(1+α)}
      ≥ P^m{x⃗ : ∃f ∈ F_d, Ê_x⃗(f) < ((1−α)/(1+α)) E_P(f) − αν/(1+α), E_P(f) ≥ 1/(2n)}
      ≥ P^m{x⃗ : ∃f ∈ F_d, Ê_x⃗(f) < ((1−3α)/(1+α)) E_P(f), E_P(f) ≥ 1/(2n)}
      ≥ P^m{x⃗ : ∃f ∈ F_d, Ê_x⃗(f) < E_P(f)(1 − 4α), E_P(f) ≥ 1/(2n)}
      ≥ P^m{x⃗ : ∃I ⊆ {1,...,d}, |I| = ⌈d/2⌉, ∀i ∈ I, ∃j, B_{i,j}(x⃗) < (1 − 4α) m/(nd)}.

Let φ_i be the indicator function for the event that there exists j ∈ {1, ..., n} for which B_{i,j}(x⃗) < (1 − 4α) m/(nd). Then,

    p ≥ P^m{x⃗ : Σ_{i=1}^d φ_i(x⃗) ≥ ⌈d/2⌉}.        (4)

Fix i, j, and let r = 1/(nd). Since E_{P^m}[B_{i,j}(x⃗)] = mr, a simple calculation with binomial coefficients shows that

    P^m{x⃗ : B_{i,j}(x⃗) = y} ≤ P^m{x⃗ : B_{i,j}(x⃗) = y + 1}

for y = 0, 1, ..., ⌊mr⌋ − 1. Thus,

    P^m{x⃗ : B_{i,j}(x⃗) < mr(1 − 4α)} ≥ ⌈√(mr)⌉ · P^m{x⃗ : B_{i,j}(x⃗) = ⌈mr(1 − 4α)⌉ − ⌈√(mr)⌉}.

If we put h = mr − (⌈mr(1 − 4α)⌉ − ⌈√(mr)⌉), p = r, q = 1 − r and apply Lemma 2, we obtain

    P^m{x⃗ : B_{i,j}(x⃗) < mr(1 − 4α)} ≥ √(mr) · (1/√(2πmrq)) exp(−h²/(2rqm) − h³/(2r²m²) − h⁴/(3q³m³) − h/(2qm))
        × exp(−(1/12)(1/(mr − h) + 1/(mq + h))).
Because r ≤ 1/100, q = 1 − r ≥ 1 − 1/100, α ≤ 1/100, m ≥ ⌊ln(100/3)/(30α²r)⌋ and mr − h ≥ 1, a simple calculation shows that

    P^m{x⃗ : B_{i,j}(x⃗) < mr(1 − 4α)} > (1/3) e^{−29α²rm}.

Crucially, by Lemma 3,

    P^m{x⃗ : φ_i(x⃗) = 1} = 1 − P^m{x⃗ : ∀j, B_{i,j}(x⃗) ≥ mr(1 − 4α)}
        ≥ 1 − (1 − P^m{x⃗ : B_{i,j}(x⃗) < mr(1 − 4α)})^n
        > 1 − (1 − (1/3) e^{−29α²rm})^n
        > 1 − e^{−0.99}.

Let z = P^m{x⃗ : Σ_{i=1}^d φ_i(x⃗) ≤ ⌈d/2⌉ − 1}. Then

    (1 − e^{−0.99}) d < E(Σ_{i=1}^d φ_i(x⃗)) ≤ dz/2 + (1 − z)d.

Solving the above inequality, we have z < 2/e^{0.99} < 4/5. This implies that P^m{x⃗ : Σ_{i=1}^d φ_i(x⃗) ≥ ⌈d/2⌉} > 1/5, which, since p ≥ P^m{x⃗ : Σ_{i=1}^d φ_i(x⃗) ≥ ⌈d/2⌉} by (4), completes the proof.

A similar proof establishes the second term in the lower bound:
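The event driving the proof above, that at least ⌈d/2⌉ types contain an undersampled element, is easy to probe empirically. The Monte Carlo sketch below (our own code; the parameter values are illustrative, not the paper's m) draws m samples from the uniform distribution on X_{d,n} and estimates the probability of that event.

```python
import math
import random

def undersampled_fraction(d, n, m, alpha, trials=500, seed=1):
    """Estimate P^m{ at least ceil(d/2) types have some element with
    count < (1 - 4*alpha) * m / (n*d) } under the uniform distribution
    on the n*d points of X_{d,n}."""
    rng = random.Random(seed)
    threshold = (1 - 4 * alpha) * m / (n * d)
    hits = 0
    for _ in range(trials):
        counts = [[0] * n for _ in range(d)]
        for _ in range(m):
            counts[rng.randrange(d)][rng.randrange(n)] += 1
        bad_types = sum(1 for row in counts if min(row) < threshold)
        hits += bad_types >= math.ceil(d / 2)
    return hits / trials

# With the paper's choice m = floor(d*ln(1/(3*nu)) / (30*alpha^2*nu)), the
# probability exceeds 1/5; a small instance already shows it is large.
print(undersampled_fraction(d=4, n=10, m=200, alpha=0.05))
```

Intuitively, when m/(nd) is modest, some element of a type almost always falls noticeably below its expected count, and a function in F_d placing its single 1 on that element has its expectation underestimated; this is what forces the d ln(1/(3ν)) term.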