On multiplier processes under weak moment assumptions

Shahar Mendelson¹,²

January 26, 2016

¹ Department of Mathematics, Technion, I.I.T., Haifa, Israel, and Mathematical Sciences Institute, The Australian National University, Canberra, Australia. Email: [email protected]
² Supported in part by the Israel Science Foundation.

Abstract

We show that if $V \subset \mathbb{R}^n$ satisfies a certain symmetry condition (closely related to unconditionality) and if $X$ is an isotropic random vector for which $\|\langle X,t\rangle\|_{L_p} \le L\sqrt{p}$ for every $t \in S^{n-1}$ and $p \lesssim \log n$, then the corresponding empirical and multiplier processes indexed by $V$ behave as if $X$ were $L$-subgaussian.

1 Introduction

The motivation for this work comes from various problems in Learning Theory, in which one encounters the following random process.

Let $X=(x_1,\dots,x_n)$ be a random vector on $\mathbb{R}^n$ (whose coordinates $(x_i)_{i=1}^n$ need not be independent) and let $\xi$ be a random variable that need not be independent of $X$. Set $(X_i,\xi_i)_{i=1}^N$ to be $N$ independent copies of $(X,\xi)$, and for $V \subset \mathbb{R}^n$ define the centred multiplier process

$$\sup_{v \in V}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N \big(\xi_i\langle X_i,v\rangle - E\xi\langle X,v\rangle\big)\Big|. \tag{1.1}$$

Multiplier processes are often studied in a more general context, in which the indexing class need not be a class of linear functionals on $\mathbb{R}^n$. Instead, one may consider an arbitrary probability space $(\Omega,\mu)$, in which case $F$ is a class of functions on $\Omega$. Let $X_1,\dots,X_N$ be independent, distributed according to $\mu$; the multiplier process indexed by $F$ is

$$\sup_{f \in F}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N \big(\xi_i f(X_i) - E\xi f(X_i)\big)\Big|. \tag{1.2}$$

Naturally, the simplest multiplier process occurs when $\xi \equiv 1$, in which case (1.2) is the standard empirical process.

Controlling a multiplier process is relatively straightforward when $\xi \in L_2$ and is independent of $X$. For example, one may show (see, e.g., [20], Chapter 2.9) that if $\xi$ is a mean-zero random variable that is independent of $X_1,\dots,X_N$, then

$$E\sup_{f \in F}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N \big(\xi_i f(X_i) - E\xi f(X_i)\big)\Big| \le C\|\xi\|_{L_2}\, E\sup_{f \in F}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N \varepsilon_i f(X_i)\Big|,$$

where here and throughout the article, $(\varepsilon_i)_{i=1}^N$ are independent, symmetric, $\{-1,1\}$-valued random variables that are independent of $(X_i,\xi_i)_{i=1}^N$, and $C$ is an absolute constant.

This estimate and others of its kind show that multiplier processes are as 'complex' as their seemingly simpler empirical counterparts. However, the results we are looking for are of a different nature: estimates on multiplier processes that are based on some natural complexity parameter of the underlying class $F$, and that exhibit the class' geometry.

It turns out that chaining methods lead to such estimates, and the structure of $F$ may be captured using the following parameter, which is a close relative of Talagrand's $\gamma$-functionals [19].

Definition 1.1 For a random variable $Z$ and $p \ge 1$, set

$$\|Z\|_{(p)} = \sup_{1 \le q \le p} \frac{\|Z\|_{L_q}}{\sqrt q}.$$

Given a class of functions $F$, $u \ge 1$ and $s_0 \ge 0$, put

$$\Lambda_{s_0,u}(F) = \inf \sup_{f \in F} \sum_{s \ge s_0} 2^{s/2}\|f-\pi_s f\|_{(u^2 2^s)}, \tag{1.3}$$

where the infimum is taken with respect to all sequences $(F_s)_{s \ge 0}$ of subsets of $F$ of cardinality $|F_s| \le 2^{2^s}$, and $\pi_s f$ is the nearest point in $F_s$ to $f$ with respect to the $\|\cdot\|_{(u^2 2^s)}$ norm. Let

$$\tilde\Lambda_{s_0,u}(F) = \Lambda_{s_0,u}(F) + 2^{s_0/2}\sup_{f \in F}\|\pi_{s_0}f\|_{(u^2 2^{s_0})}.$$
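As a quick worked example (added for illustration; it is not part of the original text), compare a gaussian and an exponential random variable. If $Z$ is a standard gaussian then $\|Z\|_{L_q} \le c\sqrt q$ for every $q \ge 1$, so

$$\|Z\|_{(p)} = \sup_{1 \le q \le p}\frac{\|Z\|_{L_q}}{\sqrt q} \le c \quad \text{for every } p,$$

whereas if $Z$ is standard exponential then $\|Z\|_{L_q} = (\Gamma(q+1))^{1/q} \sim q$, and therefore

$$\|Z\|_{(p)} \sim \sup_{1 \le q \le p}\frac{q}{\sqrt q} = \sqrt p.$$

The gaussian is subgaussian at every moment scale, while the exponential variable degrades as the inspection level $p$ grows; $\|\cdot\|_{(p)}$ quantifies exactly how far this 'local' subgaussian behaviour extends.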
To put these definitions in some perspective, $\|Z\|_{(p)}$ measures the local-subgaussian behaviour of $Z$, and the meaning of 'local' is that $\|\cdot\|_{(p)}$ takes into account the growth of $Z$'s moments only up to the fixed level $p$. In comparison,

$$\|Z\|_{\psi_2} \sim \sup_{q \ge 2}\frac{\|Z\|_{L_q}}{\sqrt q},$$

implying that for $2 \le p < \infty$, $\|Z\|_{(p)} \lesssim \|Z\|_{\psi_2}$; hence, for every $u \ge 1$ and $s_0 \ge 0$,

$$\Lambda_{s_0,u}(F) \lesssim \inf\sup_{f \in F}\sum_{s \ge s_0} 2^{s/2}\|f-\pi_s f\|_{\psi_2},$$

and $\tilde\Lambda_{0,u}(F) \le c\gamma_2(F,\psi_2)$ (see [19] for a detailed study of generic chaining and the $\gamma$-functionals).

Recall that the canonical gaussian process indexed by $F$ consists of centred gaussian random variables $G_f$, and that the covariance structure of the process is endowed by the inner product in $L_2(\mu)$. Let

$$E\sup_{f \in F} G_f = \sup\Big\{E\sup_{f \in F'} G_f : F' \subset F,\ F' \text{ is finite}\Big\},$$

and note that if the class $F \subset L_2(\mu)$ is $L$-subgaussian, that is, if for every $f,h \in F \cup \{0\}$,

$$\|f-h\|_{\psi_2(\mu)} \le L\|f-h\|_{L_2(\mu)},$$

then $\tilde\Lambda_{s_0,u}(F)$ may be bounded using the canonical gaussian process indexed by $F$. Indeed, by Talagrand's Majorizing Measures Theorem [18, 19], for every $s_0 \ge 0$,

$$\tilde\Lambda_{s_0,u}(F) \lesssim L\Big(E\sup_{f \in F} G_f + 2^{s_0/2}\sup_{f \in F}\|f\|_{L_2(\mu)}\Big).$$

As an example, let $V \subset \mathbb{R}^n$ and set $F = \{\langle v,\cdot\rangle : v \in V\}$, the class of linear functionals endowed by $V$. If $X$ is an isotropic, $L$-subgaussian vector, it follows that for every $t \in \mathbb{R}^n$,

$$\|\langle X,t\rangle\|_{\psi_2} \le L\|\langle X,t\rangle\|_{L_2} = L\|t\|_{\ell_2^n}.$$

Therefore, if $G=(g_1,\dots,g_n)$ is the standard gaussian vector in $\mathbb{R}^n$, $\ell_*(V) = E\sup_{v \in V}|\langle G,v\rangle|$ and $d_2(V) = \sup_{v \in V}\|v\|_{\ell_2^n}$, one has

$$\tilde\Lambda_{s_0,u}(F) \lesssim L\Big(E\sup_{v \in V}\langle G,v\rangle + 2^{s_0/2}\sup_{v \in V}\|\langle X,v\rangle\|_{L_2}\Big) \lesssim L\big(\ell_*(V) + 2^{s_0/2} d_2(V)\big).$$

As the following estimate from [9] shows, $\tilde\Lambda$ can be used to control a multiplier process in a relatively general situation.

Theorem 1.2 For $q>2$, there are constants $c_0$, $c_1$, $c_2$, $c_3$ and $c_4$ that depend only on $q$ for which the following holds. Let $\xi \in L_q$ and set $\xi_1,\dots,\xi_N$ to be independent copies of $\xi$. Fix an integer $s_0 \ge 0$ and $w,u > c_0$. Then, with probability at least

$$1 - c_1 w^{-q} N^{-((q/2)-1)}\log^q N - 2\exp(-c_2 u^2 2^{s_0}),$$

$$\sup_{f \in F}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N \big(\xi_i f(X_i) - E\xi f\big)\Big| \le c_3 wu\|\xi\|_{L_q}\tilde\Lambda_{s_0,c_4 u}(F).$$

It follows from Theorem 1.2 that if

$$D(V) = \Big(\frac{\ell_*(V)}{d_2(V)}\Big)^2,$$

then with probability at least $1 - c_2 w^{-q}N^{-((q/2)-1)}\log^q N - 2\exp(-c_3 u^2 D(V))$,

$$\sup_{v \in V}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N \big(\xi_i\langle v,X_i\rangle - E\xi\langle v,X\rangle\big)\Big| \lesssim Lwu\|\xi\|_{L_q}\ell_*(V). \tag{1.4}$$

There are other generic situations in which $\tilde\Lambda_{s_0,u}(F)$ may be controlled using the geometry of $F$ (for example, [13, 9] treat the case in which $F$ is a class of linear functionals on $\mathbb{R}^n$ and $X$ is an unconditional, log-concave random vector). However, there is no satisfactory theory that describes $\tilde\Lambda_{s_0,u}(F)$ for an arbitrary class $F$; such results are highly nontrivial.

Moreover, because the definition of $\Lambda_{s_0,u}(F)$ involves $\|\cdot\|_{(p)}$ for every $p$, class members must have arbitrarily high moments for $\Lambda_{s_0,u}$ to be well defined. In the context of classes of linear functionals on $\mathbb{R}^n$, one expects an analogous result to Theorem 1.2 to be true even if the functionals $\langle X,t\rangle$ do not have arbitrarily high moments. A realistic conjecture is that if for each $t \in S^{n-1}$,

$$\|\langle X,t\rangle\|_{L_q} \le L\sqrt q\,\|\langle X,t\rangle\|_{L_2} \quad \text{for every } 2 \le q \lesssim n,$$

then a subgaussian-type estimate like (1.4) should still be true.
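To see the moment scales at play, here is a worked computation (added for illustration) for $V = B_1^n$, the unit ball of $\ell_1^n$. Since the supremum of a linear functional on $B_1^n$ is attained at an extreme point $\pm e_j$,

$$\ell_*(B_1^n) = E\max_{j \le n}|g_j| \sim \sqrt{\log n}, \qquad d_2(B_1^n) = 1, \qquad D(B_1^n) = \Big(\frac{\ell_*(B_1^n)}{d_2(B_1^n)}\Big)^2 \sim \log n.$$

Choosing $2^{s_0} \sim D(V)$ in Theorem 1.2 makes the term $2\exp(-c u^2 D(V))$ in (1.4) small, and the moment level appearing at the first chaining stage is then $u^2 2^{s_0} \sim \log n$. Heuristically, this is why one may hope that only $\sim \log n$ well-behaved moments are needed for sets like $B_1^n$ — which is precisely the situation studied below.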
In what follows we will not focus on such a general result that is likely to hold for every $V \subset \mathbb{R}^n$. Rather, we will concentrate our attention on situations where a subgaussian estimate like (1.4) is true, but linear functionals only satisfy

$$\|\langle X,t\rangle\|_{L_q} \le L\sqrt q\,\|\langle X,t\rangle\|_{L_2} \quad \text{for every } 2 \le q \lesssim \log n.$$

The obvious example in which only $\sim \log n$ moments should suffice is $V = B_1^n$ (or similar sets that have $\sim n$ extreme points). Having said that, the applications that motivated this work require a broader spectrum of sets that only need that number of moments to exhibit a subgaussian behaviour as in (1.4).

Question 1.3 Let $X=(x_1,\dots,x_n)$ be an isotropic random vector and assume that $\|x_i\|_{L_q} \le L\sqrt q$ for every $2 \le q \le p$. If $\xi \in L_{q_0}$ for some $q_0>2$, how small can $p$ be while still having that

$$E\sup_{v \in V}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N \big(\xi_i\langle X_i,v\rangle - E\xi\langle X,v\rangle\big)\Big| \le C(L,q_0)\|\xi\|_{L_{q_0}}\ell_*(V)?$$

We will show that $p \sim \log n$ suffices for a positive answer to Question 1.3 if the norm $\|z\|_{V^\circ} = \sup_{v \in V}|\langle v,z\rangle|$ satisfies the following unconditionality property.

Definition 1.4 Given a vector $x=(x_i)_{i=1}^n$, let $(x_i^*)_{i=1}^n$ be the non-increasing rearrangement of $(|x_i|)_{i=1}^n$. The normed space $(\mathbb{R}^n,\|\cdot\|)$ is $K$-unconditional with respect to the basis $\{e_1,\dots,e_n\}$ if for every $x \in \mathbb{R}^n$ and every permutation $\pi$ of $\{1,\dots,n\}$,

$$\Big\|\sum_{i=1}^n x_i e_i\Big\| \le K\Big\|\sum_{i=1}^n x_{\pi(i)} e_i\Big\|,$$

and if $y \in \mathbb{R}^n$ satisfies $x_i^* \le y_i^*$ for $1 \le i \le n$, then

$$\Big\|\sum_{i=1}^n x_i e_i\Big\| \le K\Big\|\sum_{i=1}^n y_i e_i\Big\|.$$

Remark 1.5 This is not the standard definition of an unconditional basis, though every unconditional basis (in the classical sense) on an infinite dimensional space satisfies Definition 1.4 for some constant $K$ (see, e.g., [1]).

There are many natural examples of $K$-unconditional norms, including all the $\ell_p$ norms. Moreover, the norm $\sup_{v \in V}\sum_{i=1}^n v_i^* z_i^*$ is $1$-unconditional. In fact, if $V \subset \mathbb{R}^n$ is closed under permutations and reflections (sign-changes), then $\|\cdot\|_{V^\circ}$ is $1$-unconditional. Finally, since the maximum of two $K$-unconditional norms is $K$-unconditional, it follows that if $\|\cdot\|_{V^\circ}$ is $K$-unconditional, so is the norm $\sup_{v \in V \cap rB_2^n}\langle\cdot,v\rangle$.

We will show the following:

Theorem 1.6 There exists an absolute constant $c_1$, and for $K \ge 1$, $L \ge 1$ and $q_0 > 2$ there exists a constant $c_2$ that depends only on $K$, $L$ and $q_0$, for which the following holds. Consider:

• $V \subset \mathbb{R}^n$ for which the norm $\|\cdot\|_{V^\circ} = \sup_{v \in V}|\langle v,\cdot\rangle|$ is $K$-unconditional with respect to the basis $\{e_1,\dots,e_n\}$;

• $\xi \in L_{q_0}$ for some $q_0 > 2$;

• an isotropic random vector $X \in \mathbb{R}^n$ which satisfies

$$\max_{1 \le j \le n}\|\langle X,e_j\rangle\|_{(p)} \le L \quad \text{for } p = c_1\log n.$$

If $(X_i,\xi_i)_{i=1}^N$ are independent copies of $(X,\xi)$, then

$$E\sup_{v \in V}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N \big(\xi_i\langle X_i,v\rangle - E\xi\langle X,v\rangle\big)\Big| \le c_2\|\xi\|_{L_{q_0}}\ell_*(V).$$
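Theorem 1.6 can be probed numerically. The following sketch (added for illustration; nothing in it comes from the paper, and the Student-$t$ distributions are merely convenient stand-ins for the moment assumptions) compares the expected supremum of the multiplier process over $V = B_1^n$ with $\ell_*(B_1^n)$:

```python
import numpy as np

# A rough numerical illustration of Theorem 1.6 for V = B_1^n (not from the
# paper; the distributions below are ad-hoc stand-ins for its assumptions).
# For V = B_1^n the supremum over V of <Z, v> is simply max_j |Z_j|, and
# ell_*(B_1^n) = E max_j |g_j|.
rng = np.random.default_rng(0)
n, N, trials = 400, 2000, 100

# Monte Carlo estimate of ell_*(B_1^n).
ell_star = np.abs(rng.standard_normal((2000, n))).max(axis=1).mean()

# Coordinates with only ~log n well-behaved moments, modelled by a
# Student-t distribution normalized to unit variance.
df_x = max(int(np.log(n)), 3)
scale_x = np.sqrt(df_x / (df_x - 2.0))
xi_df = 5  # xi has finite moments of order q_0 for any q_0 < 5, so q_0 > 2 holds
scale_xi = np.sqrt(xi_df / (xi_df - 2.0))

sups = np.empty(trials)
for t in range(trials):
    X = rng.standard_t(df_x, size=(N, n)) / scale_x
    xi = rng.standard_t(xi_df, size=N) / scale_xi
    # xi is symmetric and independent of X, so E xi <X, v> = 0 and no
    # explicit centering is needed.
    Z = (xi[:, None] * X).sum(axis=0) / np.sqrt(N)
    sups[t] = np.abs(Z).max()  # sup over B_1^n

print(f"E sup ~ {sups.mean():.2f} vs ell_*(B_1^n) ~ {ell_star:.2f}")
```

Up to constant factors, both quantities should be of order $\sqrt{\log n}$, which is the subgaussian prediction the theorem makes despite the heavy-tailed coordinates.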
The proof of Theorem 1.6 is based on the study of a conditioned Bernoulli process. Indeed, a standard symmetrization argument (see, e.g., [8, 20]) shows that if $(\varepsilon_i)_{i=1}^N$ are independent, symmetric, $\{-1,1\}$-valued random variables that are independent of $(X_i,\xi_i)_{i=1}^N$, then

$$E\sup_{v \in V}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N \big(\xi_i\langle X_i,v\rangle - E\xi\langle X,v\rangle\big)\Big| \le C\, E\sup_{v \in V}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N \varepsilon_i\xi_i\langle X_i,v\rangle\Big|$$

for an absolute constant $C$; a similar bound holds with high probability, showing that it suffices to study the supremum of the conditioned Bernoulli process

$$\sup_{v \in V}\Big|\frac{1}{\sqrt N}\sum_{i=1}^N \varepsilon_i\xi_i\langle X_i,v\rangle\Big| = (*).$$

Put $x_i(j) = \langle X_i,e_j\rangle$ and set $Z_j = N^{-1/2}\sum_{i=1}^N \varepsilon_i\xi_i x_i(j)$, which is a sum of iid random variables. Therefore, if $Z = (Z_1,\dots,Z_n)$, then

$$(*) = \sup_{v \in V}\langle Z,v\rangle.$$

The proof of Theorem 1.6 follows by showing that for a well-chosen constant $C(L,q)$ the event

$$\big\{Z_j^* \le C E g_j^* \ \text{ for every } 1 \le j \le n\big\}$$

is of high probability, and that if the norm $\|\cdot\|_{V^\circ} = \sup_{v \in V}|\langle\cdot,v\rangle|$ is $K$-unconditional, then on that event

$$\sup_{v \in V}\langle Z,v\rangle \le C_1(K,L,q)\, E\sup_{v \in V}\langle G,v\rangle.$$

Before presenting the proof of Theorem 1.6, let us turn to one of its outcomes: estimates on the random Gelfand widths of a convex body. We will present another application, motivated by a question in the rapidly developing area of Sparse Recovery, in Section 3.

Let $V \subset \mathbb{R}^n$ be a convex, centrally symmetric set. A well-known question in Asymptotic Geometric Analysis has to do with the diameter of a random $m$-codimensional section of $V$ (see, e.g., [14, 15, 16, 2]). In the past, the focus was on obtaining such estimates for subspaces selected uniformly according to the Haar measure, or alternatively, according to the measure endowed via the kernel of an $m \times n$ gaussian matrix (see, e.g., [17]). More recently, there has been growing interest in other notions of randomness, most notably those generated by kernels of other random matrix ensembles. For example, the following was established in [12]:

Theorem 1.7 Let $X_1,\dots,X_m$ be distributed according to an isotropic, $L$-subgaussian random vector on $\mathbb{R}^n$, set $\Gamma = \sum_{i=1}^m \langle X_i,\cdot\rangle e_i$ and put

$$r_G(V,\gamma) = \inf\big\{r>0 : \ell_*(V \cap rB_2^n) \le \gamma r\sqrt m\big\}.$$

Then, with probability at least $1 - 2\exp(-c_1(L)m)$,

$${\rm diam}(\ker(\Gamma) \cap V) \le r_G(V,c_2(L)),$$

for constants $c_1$ and $c_2$ that depend only on $L$.

A version of Theorem 1.7 was obtained under a much weaker assumption: the random vector need not be $L$-subgaussian; rather, it suffices that it satisfies a weak small-ball condition.

Definition 1.8 The isotropic random vector $X$ satisfies a small-ball condition with constants $\kappa > 0$ and $0 < \varepsilon \le 1$ if for every $t \in S^{n-1}$,

$$\Pr\big(|\langle X,t\rangle| \ge \kappa\big) \ge \varepsilon.$$

The analog of the gaussian parameter $r_G$ for a general random vector $X$ is

$$r_X(V,\gamma) = \inf\Big\{r>0 : E\sup_{v \in V \cap rB_2^n}\Big|\frac{1}{\sqrt m}\sum_{i=1}^m \langle X_i,v\rangle\Big| \le \gamma r\sqrt m\Big\}.$$

Clearly, if $X$ is $L$-subgaussian then $r_X(V,\gamma) \le r_G(V,cL\gamma)$ for a suitable absolute constant $c$.

Theorem 1.9 [11, 10] Let $X$ be an isotropic random vector that satisfies the small-ball condition with constants $\kappa$ and $\varepsilon$. If $X_1,\dots,X_m$ are independent copies of $X$ and $\Gamma = \sum_{i=1}^m \langle X_i,\cdot\rangle e_i$, then with probability at least $1 - 2\exp(-c_0(\varepsilon)m)$,

$${\rm diam}(\ker(\Gamma) \cap V) \le r_X\big(V,c_1(\kappa,\varepsilon)\big).$$
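It is worth noting how a small-ball condition follows from moment assumptions of the kind used in this article; the following is a standard Hölder-type computation, sketched here for the reader's convenience (it is not spelled out in the text). Let $Z = \langle X,t\rangle$ with $EZ^2 = 1$ and suppose $\|Z\|_{L_q} \le L$ for some $q > 2$. Then

$$1 = EZ^2 \le \tfrac14 + EZ^2\mathbf{1}_{\{|Z| \ge 1/2\}} \le \tfrac14 + \|Z\|_{L_q}^2\,\Pr\big(|Z| \ge 1/2\big)^{1-2/q},$$

and therefore

$$\Pr\big(|Z| \ge 1/2\big) \ge \Big(\frac{3}{4L^2}\Big)^{q/(q-2)};$$

that is, $X$ satisfies the small-ball condition with $\kappa = 1/2$ and a constant $\varepsilon$ that depends only on $L$ and $q$.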
Theorem 1.6 implies that if the norm $\|z\|_{V^\circ}$ is $K$-unconditional, and the growth of moments of the coordinate linear functionals $\langle X,e_i\rangle$ for $1 \le i \le n$ is $L$-'subgaussian' up to the level $\sim \log n$, then the small-ball condition depends only on $L$ and $r_X(V,c_1(L)) \le r_G(V,c_2(L,K))$. Therefore, with probability at least $1 - 2\exp(-c_0(L)m)$ one has the gaussian estimate

$${\rm diam}(\ker(\Gamma) \cap V) \le r_G\big(V,c_2(L,K)\big),$$

even though the choice of a subspace has been made according to an ensemble that could be very far from a subgaussian one.

We end this introduction with a word about notation. Throughout, absolute constants are denoted by $c,c_1,\dots$, etc. Their value may change from line to line or even within the same line. When a constant depends on a parameter $\alpha$ it will be denoted by $c(\alpha)$. $A \lesssim B$ means that $A \le cB$ for an absolute constant $c$, and the analogous two-sided inequality is denoted by $A \sim B$. In a similar fashion, $A \lesssim_\alpha B$ implies that $A \le c(\alpha)B$, etc.

2 Proof of Theorem 1.6

There are two substantial difficulties in the proof of Theorem 1.6. First, $Z_1,\dots,Z_n$ are not independent random variables, not only because of the Bernoulli random variables $(\varepsilon_i)_{i=1}^N$ that appear in all the $Z_j$'s, but also because the coordinates of $X=(x_1,\dots,x_n)$ need not be independent. Second, while there is some flexibility in the moment assumptions on the coordinates of $X$, there is no flexibility in the moment assumption on $\xi$, which is only 'slightly better' than square-integrable.

As a starting point, let us address the fact that the coordinates of $Z$ need not be independent.

Lemma 2.1 There exist absolute constants $c_1$ and $c_2$ for which the following holds. Let $\beta \ge 1$ and set $p = 2\beta\log(en)$. If $(W_j)_{j=1}^n$ are random variables that satisfy $\|W_j\|_{(p)} \le L$, then for every $t \ge 1$, with probability at least $1 - c_1 t^{-2\beta}$,

$$W_j^* \le c_2 tL\sqrt{\beta\log(en/j)} \quad \text{for every } 1 \le j \le n.$$

Proof. Let $a_1,\dots,a_k \in \mathbb{R}$; by the convexity of $t \mapsto t^q$,

$$\Big(\frac1k\sum_{j=1}^k a_j^2\Big)^q \le \frac1k\sum_{j=1}^k a_j^{2q}.$$

Thus, given $(a_i)_{i=1}^n$ and taking the maximum over subsets $J \subset \{1,\dots,n\}$ of cardinality $k$,

$$\max_{|J|=k}\Big(\frac1k\sum_{j \in J} a_j^2\Big)^q \le \max_{|J|=k}\frac1k\sum_{j \in J} a_j^{2q} \le \frac1k\sum_{j=1}^n a_j^{2q}.$$

When applied to $a_j = W_j$, it follows that point-wise,

$$\Big(\frac1k\sum_{j=1}^k (W_j^*)^2\Big)^q \le \frac1k\sum_{j=1}^n W_j^{2q}. \tag{2.1}$$

Since $\|W_j\|_{(p)} \le L$, it is evident that $EW_j^{2q} \le L^{2q}(2q)^q$ for $2q \le p$. Hence, taking the expectation in (2.1),

$$\Big(E\Big(\frac1k\sum_{j=1}^k (W_j^*)^2\Big)^q\Big)^{1/q} \le 2qL^2\cdot\Big(\frac nk\Big)^{1/q} \le c_1 qL^2$$

for $q = \beta\log(en/k)$ (which does satisfy $2q \le p$). Hence, by Chebyshev's inequality, for $t \ge 1$,

$$\Pr\Big(\frac1k\sum_{j \le k}(W_j^*)^2 \ge (et)^2 c_1^2 L^2 q\Big) \le \frac{1}{(et)^{2q}} = \frac{1}{t^{2q}}\Big(\frac{k}{en}\Big)^{2\beta}. \tag{2.2}$$

Using (2.2) for $k = 2^j$ and applying the union bound, it is evident that with probability at least $1 - 2t^{-2\beta}$, for every $1 \le k \le n$,

$$(W_k^*)^2 \le \frac1k\sum_{j \le k}(W_j^*)^2 \lesssim t^2 L^2\beta\log(en/k).$$

Recall that $q_0 > 2$ and set $\eta = (q_0-2)/4$. Let $u \ge 2$ and consider the event

$$\mathcal{A}_u = \big\{\xi_i^* \le u\|\xi\|_{L_{q_0}}(eN/i)^{1/q_0} \ \text{ for every } 1 \le i \le N\big\}.$$

A standard binomial estimate combined with Chebyshev's inequality for $|\xi|^{q_0}$ shows that $\mathcal{A}_u$ is a nontrivial event. Indeed,

$$\Pr\Big(\xi_i^* \ge u\|\xi\|_{L_{q_0}}(eN/i)^{1/q_0}\Big) \le \binom{N}{i}\Pr^i\Big(|\xi| \ge u\|\xi\|_{L_{q_0}}(eN/i)^{1/q_0}\Big) \le \binom{N}{i}\frac{1}{u^{iq_0}}\Big(\frac{i}{eN}\Big)^i \le \frac{1}{u^{iq_0}},$$

and by the union bound over $1 \le i \le N$, $\Pr(\mathcal{A}_u^c) \le 2/u^{q_0}$.

The random variables we shall use in Lemma 2.1 are

$$W_j = Z_j\mathbf{1}_{\mathcal{A}_u},$$

for $u \ge 2$ and $1 \le j \le n$.
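As a quick sanity check of Lemma 2.1 (again an added illustration, here with gaussian $W_j$, which satisfy $\|W_j\|_{(p)} \le c$ for every $p$), one can verify numerically that the ratio $W_j^*/\sqrt{\log(en/j)}$ stays bounded uniformly in $j$:

```python
import numpy as np

# Numerical sanity check of Lemma 2.1 (illustrative; W_j taken gaussian,
# which satisfies ||W_j||_(p) <= c for every p).
rng = np.random.default_rng(1)
n, trials = 10_000, 200
j = np.arange(1, n + 1)
envelope = np.sqrt(np.log(np.e * n / j))  # the sqrt(log(en/j)) profile

ratios = np.empty(trials)
for t in range(trials):
    W_sorted = np.sort(np.abs(rng.standard_normal(n)))[::-1]  # W_1* >= ... >= W_n*
    ratios[t] = (W_sorted / envelope).max()

# Lemma 2.1 predicts this max-ratio is bounded by a constant with high
# probability; its empirical quantiles should look stable as n grows.
print(f"median max-ratio: {np.median(ratios):.2f}, "
      f"95% quantile: {np.quantile(ratios, 0.95):.2f}")
```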
The following lemma is the crucial step in the proof of Theorem 1.6.

Lemma 2.2 There exists an absolute constant $c$ for which the following holds. Let $X$ be a random variable that satisfies $\|X\|_{(p)} \le L$ for some $p > 2$ and set $X_1,\dots,X_N$ to be independent copies of $X$. If

$$W = \Big|\frac{1}{\sqrt N}\sum_{i=1}^N \varepsilon_i\xi_i X_i\Big|\mathbf{1}_{\mathcal{A}_u},$$

then $\|W\|_{(p)} \le cuL$.

The proof of Lemma 2.2 requires two preliminary estimates on the 'gaussian' behaviour of the monotone rearrangement of $N$ copies of a random variable.

Lemma 2.3 There exists an absolute constant $c$ for which the following holds. Assume that $\|X\|_{(2p)} \le L$. If $X_1,\dots,X_N$ are independent copies of $X$, then for every $1 \le k \le N$ and $2 \le q \le p$,

$$\Big\|\Big(\sum_{i \le k}(X_i^*)^2\Big)^{1/2}\Big\|_{L_q} \le cL\big(\sqrt{k\log(eN/k)} + \sqrt q\big).$$

Proof. The proof follows from a comparison argument, showing that up to the $p$-th moment, the 'worst case' is when $X$ is a gaussian variable.
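As a sanity check for the gaussian benchmark (an added remark, not part of the proof): if $g_1,\dots,g_N$ are independent standard gaussians then $E(g_k^*)^2 \le c\log(eN/k)$, so

$$E\Big(\sum_{i \le k}(g_i^*)^2\Big)^{1/2} \le \Big(\sum_{i \le k}E(g_i^*)^2\Big)^{1/2} \le c\Big(\sum_{i \le k}\log(eN/i)\Big)^{1/2} \le c'\sqrt{k\log(eN/k)},$$

and since $(\sum_{i \le k}(g_i^*)^2)^{1/2} = \sup\{\langle g,v\rangle : |{\rm supp}(v)| \le k,\ \|v\|_2 \le 1\}$ is a $1$-Lipschitz convex function of $g$, gaussian concentration contributes the additional $\sqrt q$ term to its $L_q$ norm, matching the form of the bound in Lemma 2.3.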