A note on the sample complexity of the Er-SpUD algorithm by Spielman, Wang and Wright for exact recovery of sparsely used dictionaries

Radosław Adamczak∗

January 14, 2016

arXiv:1601.02049v2 [math.PR] 12 Jan 2016

∗Institute of Mathematics, University of Warsaw, Banacha 2, 02-097 Warszawa, Poland. E-mail: [email protected]

Abstract

We consider the problem of recovering an invertible n × n matrix A and a sparse n × p random matrix X based on the observation of Y = AX (up to a scaling and permutation of columns of A and rows of X). Using only elementary tools from the theory of empirical processes we show that a version of the Er-SpUD algorithm by Spielman, Wang and Wright with high probability recovers A and X exactly, provided that p ≥ Cn log n, which is optimal up to the constant C.

1 Introduction

Learning sparsely-used dictionaries has recently attracted considerable attention in connection to applications in machine learning, signal processing or computational neuroscience. In particular, two important fields of applications are dictionary learning [9, 5, 2, 10, 15] and blind source separation [16, 4]. We do not discuss these applications and refer the Reader to the aforesaid articles for details.

Among many approaches to this problem a particularly successful one has been presented by Spielman, Wang and Wright [11, 12], who considered the noiseless-invertible case:

The main problem:
Consider an invertible n × n matrix A and a random n × p sparse matrix X. Denote Y = AX. The objective is to reconstruct A and X (up to scaling and permutation of columns of A and rows of X) based on the observable data Y.

The Authors of [12] provide an algorithm which with high probability successfully recovers the matrices A and X up to rescaling and permutation of the columns of A and rows of X, provided that X is a sparse random matrix satisfying the following probabilistic assumptions.

Probabilistic model specification

X_ij = χ_ij R_ij, where

• χ_ij, R_ij are independent random variables,

• χ_ij are Bernoulli distributed: P(χ_ij = 1) = 1 − P(χ_ij = 0) = θ,

• R_ij are i.i.d., with mean zero, and satisfy

  μ := E|R_ij| ≥ 1/10,   ∀_{t>0} P(|R_ij| ≥ t) ≤ 2e^{−t²/2}.

Following [12] we will say that matrices satisfying the above assumptions follow the Bernoulli-Subgaussian model with parameter θ. We remark that the constant 1/10 above is of no importance and has been chosen following [12] and [7].

The approach of Spielman, Wang and Wright consists of two steps. At the first step (given by the Er-SpUD algorithm we describe below) one gathers p/2 candidates for the rows of X. The second, greedy step (Greedy algorithm, also described below) selects from the candidates a set of n sparsest vectors which form a matrix of rank n. The algorithms work as follows:

ER-SpUD(DC): Exact Recovery of Sparsely-Used Dictionaries using the sum of two columns of Y as constraint vectors.
1. Randomly pair the columns of Y into p/2 groups g_j = {Y e_{j1}, Y e_{j2}}.
2. For j = 1, ..., p/2
   Let r_j = Y e_{j1} + Y e_{j2}, where g_j = {Y e_{j1}, Y e_{j2}}.
   Solve min_w ‖w^T Y‖_1 subject to r_j^T w = 1, and set s_j = w^T Y.

Above we use the convention that if r_j = 0 (which happens with nonzero probability), and as a consequence the minimization problem has no solution, then we skip the corresponding step of the algorithm.
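For concreteness, the first stage can be implemented by casting each ℓ1 minimization as a linear program. The sketch below is only an illustration and not part of the specification in [12]: the function names are ours, and the use of NumPy and scipy.optimize.linprog is an implementation choice.

import numpy as np
from scipy.optimize import linprog

def l1_min_row(Y, r):
    """Solve min_w ||w^T Y||_1 subject to r^T w = 1 as a linear program.

    Variables are (w, t) with t >= |Y^T w| entrywise and objective sum(t).
    Returns the candidate row s = w^T Y, or None if the LP is infeasible
    (in particular when r = 0, in which case the step is skipped).
    """
    n, p = Y.shape
    c = np.concatenate([np.zeros(n), np.ones(p)])            # minimize sum of t
    # |Y^T w| <= t  is encoded as  Y^T w - t <= 0  and  -Y^T w - t <= 0
    A_ub = np.block([[Y.T, -np.eye(p)], [-Y.T, -np.eye(p)]])
    b_ub = np.zeros(2 * p)
    A_eq = np.concatenate([r, np.zeros(p)]).reshape(1, -1)    # r^T w = 1
    b_eq = np.array([1.0])
    bounds = [(None, None)] * n + [(0, None)] * p             # w free, t >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    if not res.success:
        return None
    return res.x[:n] @ Y

def er_spud_dc_first_stage(Y, seed=None):
    """First stage of ER-SpUD(DC): random pairing of columns, one LP per pair."""
    rng = np.random.default_rng(seed)
    p = Y.shape[1]
    perm = rng.permutation(p)
    candidates = []
    for j in range(p // 2):
        r = Y[:, perm[2 * j]] + Y[:, perm[2 * j + 1]]
        s = l1_min_row(Y, r)
        if s is not None:
            candidates.append(s)
    return candidates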
The second stage, described below, is run on the set S of vectors s_j returned at the first stage (for notational simplicity we relabel them if r_j = 0 for some j). We use the standard notation ‖x‖_0 for the number of nonzero coordinates of a vector x.

Greedy: A Greedy Algorithm to Reconstruct X and A.
1. REQUIRE: S = {s_1, ..., s_T} ⊆ R^p.
2. For i = 1, ..., n
   REPEAT
     l ← argmin_{s_l ∈ S} ‖s_l‖_0, breaking ties arbitrarily
     x_i = s_l, S = S \ {s_l}
   UNTIL rank([x_1, ..., x_i]) = i
3. Set X = [x_1, ..., x_n]^T and A = Y Y^T (X Y^T)^{−1}.

In [12] it was proved that there exist positive constants C, α such that if

  2/n ≤ θ ≤ α/√n

and p ≥ Cn² log² n, then the ER-SpUD algorithm successfully recovers the matrices A, X with probability at least 1 − 1/(Cp^{10}). Note that the equation Y = A′X′ still holds if we set A′ = AΠΛ and X′ = Λ^{−1}Π^T X for some permutation matrix Π and a nonsingular diagonal matrix Λ. Therefore, by recovery we mean that nonzero multiples of all the rows of X are among the set {s_1, ..., s_{p/2}} produced by the ER-SpUD(DC) algorithm. In [12] it is also proved that if P(R_ij = 0) = 0, then for p > Cn log n, with probability 1 − C′n exp(−cθp), for any matrices A′, X′ such that Y = A′X′ and max_i ‖e_i^T X′‖_0 ≤ max_i ‖e_i^T X‖_0 there exist a permutation matrix Π and a nonsingular diagonal matrix Λ such that A′ = AΠΛ, X′ = Λ^{−1}Π^T X. In fact, the Authors of [12] prove that with the above probability any row of X is nonzero and has at most (10/9)θp nonzero entries, whereas any linear combination of two or more rows of X has at least (11/9)θp nonzero entries. In particular it follows that the Greedy algorithm will extract from the set {s_1, ..., s_T} multiples of all n rows of X (note that all s_j's are in the row space of Y and thus also in the row space of X). Since, as one can prove, X is with high probability of rank n, one easily sees that A can be recovered by the formula used in the 3rd step of the algorithm.

We remark that in [7] Luh and Vu obtained the same results concerning sparsity of linear combinations of rows of X without the assumptions about the symmetry of the variables R_ij.

Note also that for θ of the order n^{−1}, p = Cn log n is necessary for uniqueness of the solution in the sense described above; otherwise with significant probability some of the rows of X may be zero, which means that some columns of A do not influence the matrix Y. In [12] it was also proved that if p > Cn log n and θ > C′√(log n / n), then with high probability the ER-SpUD algorithm does not recover any of the rows of X.

Spielman, Wang and Wright have conjectured that their algorithm works with high probability provided that p > Cn log n (which, as mentioned above, is required for well-posedness of the problem). Recently, Luh and Vu [7] have proved that the algorithm works for p > Cn log⁴ n, which differs from the conjectured number of samples just by a polylogarithmic factor.

In this note we will consider a modified version of the algorithm with a slightly different first stage. Namely, instead of using only p/2 pairs of columns of Y, we will use all p(p − 1)/2 pairs of distinct columns. For fixed p this clearly increases the time complexity of the algorithm (which however remains polynomial), but the advantage of this modification is the possibility of proving that it requires only p = Cn log n to recover X and A with high probability, which as explained above is optimal. More specifically, we will consider the following algorithm.

Modified ER-SpUD(DC): Exact Recovery of Sparsely-Used Dictionaries using the sum of two columns of Y as constraint vectors.
For i = 1, ..., p − 1
  For j = i + 1, ..., p
    Let r_ij = Y e_i + Y e_j.
    Solve min_w ‖w^T Y‖_1 subject to r_ij^T w = 1, and set s_ij = w^T Y.

The final step of the recovery algorithm is again a greedy selection of the sparsest vectors among the candidates collected at the first step.
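As an illustration, a minimal sketch of the greedy second stage and of the final reconstruction of A is given below (assuming NumPy); the numerical tolerance used as a stand-in for exact sparsity and rank computations is our implementation choice and plays no role in the analysis. The modified first stage differs from the sketch given earlier only in that it loops over all pairs 1 ≤ i < j ≤ p instead of using a random pairing.

import numpy as np

def greedy_reconstruct(candidates, Y, tol=1e-8):
    """Greedy second stage: pick n sparsest candidates forming a rank-n matrix,
    then recover A via A = Y Y^T (X Y^T)^{-1} as in step 3 of the algorithm."""
    n = Y.shape[0]
    # Process candidates in order of increasing (numerical) sparsity and keep a
    # candidate only if it increases the rank of the rows collected so far.
    S = sorted(candidates, key=lambda s: int(np.sum(np.abs(s) > tol)))
    rows = []
    for s in S:
        trial = np.array(rows + [s])
        if np.linalg.matrix_rank(trial, tol=tol) == len(trial):
            rows.append(s)
        if len(rows) == n:
            break
    X_hat = np.array(rows)                            # rows are multiples of rows of X
    A_hat = Y @ Y.T @ np.linalg.inv(X_hat @ Y.T)
    return A_hat, X_hat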
As before, under the assumption P(R_ij = 0) = 0, the greedy procedure successfully recovers X and A, provided that multiples of all the rows of X are present among the input set S.

The main result of this note is

Theorem 1.1. There exist absolute constants C, α ∈ (0, ∞) such that if

  2/n ≤ θ ≤ α/√n

and X follows the Bernoulli-Subgaussian model with parameter θ, then for p ≥ Cn log n, with probability at least 1 − 1/p the modified ER-SpUD algorithm successfully recovers all the rows of X, i.e. multiples of all the rows of X are present among the vectors s_ij returned by the algorithm.

Remark. Very recently in [13], Sun, Qu and Wright proposed an algorithm with polynomial sample complexity, which recovers well conditioned dictionaries under the assumption that the variables R_ij are i.i.d. standard Gaussian and θ ≤ 1/2, thus allowing for the first time for a linear number of nonzero entries per column of the matrix X. Their novel approach is based on non-convex optimization. The sample complexity of the algorithms in [13] is however higher than for the Er-SpUD algorithm; as mentioned by the Authors, numerical simulations suggest that it is at least p = Ω(n² log n) even in the case of an orthogonal matrix A. The Authors of [13] conjecture that algorithms with sample complexity p = O(n log n) should be possible also for large θ.

2 Proof of Theorem 1.1

We will follow the general approach presented in [12] and [7]. The main new part of the argument is an improved bound on the sample complexity for empirical approximation of first moments of arbitrary marginals of the columns of the matrix X, given in Proposition 2.1 below. So as not to reproduce technical and lengthy parts of the original proof, we organize this section as follows. First, we present the crucial Proposition 2.1 and provide a brief discussion of its mathematical content. Next, we present an overview of the main steps in the proof scheme of [12]. For parts of the proof not related to Proposition 2.1 or to the modification of the algorithm considered here, we only indicate the relevant statements from [12], while for the part involving the use of Proposition 2.1 and for the conclusion of the proof we provide the full argument. Proposition 2.1 is proved in Section 3.

Below by e_1, ..., e_N we will denote the standard basis in R^N for various choices of N (in particular for N = n and N = p). The value of N will be clear from the context and so this should not lead to ambiguity. By B_1^n we will denote the unit ball in the space ℓ_1^n, i.e. B_1^n = {x ∈ R^n : ‖x‖_1 ≤ 1}, where for x = (x(1), ..., x(n)), ‖x‖_1 = ∑_{i=1}^n |x(i)|. The coordinates of a vector x will be denoted by x(i) or, if it does not interfere with other notation (e.g. for indexed families of vectors), simply by x_i. Again, the meaning of the notation will be clear from the context.

Proposition 2.1. Let U_1, U_2, ..., U_p, χ_1, ..., χ_p be independent random vectors in R^n. Assume that for some constant M and all 1 ≤ i ≤ p, 1 ≤ j ≤ n,

  E e^{|U_i(j)|/M} ≤ 2   (1)

and P(χ_i(j) = 1) = 1 − P(χ_i(j) = 0) = θ. Define the random vectors Z_1, ..., Z_p by the equality Z_i(j) = U_i(j)χ_i(j) for 1 ≤ i ≤ p, 1 ≤ j ≤ n, and consider the random variable

  W := sup_{x ∈ B_1^n} | (1/p) ∑_{i=1}^p ( |x^T Z_i| − E|x^T Z_i| ) |.   (2)

Then, for some universal constant C and every q ≥ max(2, log n),

  ‖W‖_q ≤ (C/p)( √(pθq) + q )M   (3)

and as a consequence

  P( W ≥ (Ce/p)( √(pθq) + q )M ) ≤ e^{−q}.   (4)
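A small numerical illustration (our own, not used anywhere in the proof) may help fix ideas. Assuming standard Gaussian U_i, the expectation in (2) at a basis direction is E|e_j^T Z_i| = θ√(2/π), and evaluating the deviation at the basis directions only gives a lower bound on W; its size should shrink roughly like √(θ log n / p), in line with (3) for q = log n.

import numpy as np

def basis_deviation(n, p, theta, seed=None):
    """Lower bound on W from (2): evaluate the empirical deviation at x = e_j only.

    Assumes the U_i(j) are i.i.d. standard Gaussian, so E|e_j^T Z_i| = theta*sqrt(2/pi).
    """
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n, p))
    chi = rng.random((n, p)) < theta
    Z = U * chi                                   # columns are the vectors Z_1, ..., Z_p
    emp = np.abs(Z).mean(axis=1)                  # (1/p) * sum_i |e_j^T Z_i| for each j
    return np.max(np.abs(emp - theta * np.sqrt(2 / np.pi)))

# The maximal deviation over the basis directions should decay roughly like sqrt(theta*log(n)/p).
for p in [2000, 8000, 32000]:
    print(p, basis_deviation(n=500, p=p, theta=2 / 500, seed=0))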
The above proposition can be considered a quantitative version of the uniform law of large numbers for linear functionals x^T Z indexed by the unit sphere of the space ℓ_1^n. As such it is a classical object of study in the theory of empirical processes. The proof we give uses only Bernstein's inequality (see e.g. [14]) and Talagrand's contraction principle [6], which in a somewhat similar context was applied e.g. in [8, 1].

Let us also remark that in the above proposition we do not require independence between the components of the random vectors U_i or χ_i for fixed i, but just independence between the random vectors U_i, χ_i, i = 1, ..., p.

2.1 Main steps of the proof of Theorem 1.1

As announced, we will now present an outline of the proof of Theorem 1.1, indicating which steps differ from the original argument in [12].

Step 1. A change of variables.

Recall that r_ij are sums of two columns of the matrix Y. At the first step of the proof, instead of looking at the original optimization problem

  minimize ‖w^T Y‖_1 subject to r_ij^T w = 1,   (5)

one performs a change of variables z = A^T w, b_ij = A^{−1} r_ij, arriving at the optimization problem

  minimize ‖z^T X‖_1 subject to b_ij^T z = 1.   (6)

Note that one cannot solve (6) since it involves the unknown matrices X and A. The goal of the subsequent steps is to prove that with probability separated from zero the solution z_∗ of (6) is a multiple of one of the basis vectors e_1, ..., e_n, say z_∗ = λ e_k. This means that w_∗^T Y = z_∗^T X = λ e_k^T X, i.e. (5) recovers the k-th row of X up to scaling.

Step 2. The solution z_∗ satisfies supp(z_∗) ⊆ supp(b_ij).

At this step we prove the following lemma, which is a counterpart of Lemma 11 in [12]. It is weaker in that we do not consider arbitrary vectors b_ij, but only sums of two distinct columns of X (which is enough for the application in the proof of Theorem 1.1). On the other hand it works already for p > Cn log n and not just for p > Cn² log n as the original lemma from [12].

Lemma 2.2. For 1 ≤ i < j ≤ p, define b_ij = X e_i + X e_j, I_ij = (supp X e_i) ∪ (supp X e_j). There exist numerical constants C, α > 0 such that if 2/n ≤ θ ≤ α/√n and p > Cn log n, then with probability at least 1 − p^{−2} the random matrix X has the following property:

(P1) For every 1 ≤ i < j ≤ p either |I_ij| ∈ {0} ∪ (1/(8θ), n] or every solution z_∗ to the optimization problem (6) satisfies supp z_∗ ⊆ I_ij.

To prove the above lemma, one first shows a counterpart of Lemma 16 in [12].

Lemma 2.3. For any 1 ≤ j ≤ p, if Z = (χ_{1j}R_{1j}, ..., χ_{nj}R_{nj}), then for all v ∈ R^n,

  E|v^T Z| ≥ (μ/8)√(θ/n) ‖v‖_1.

Proof. Let ε_1, ..., ε_n be a sequence of i.i.d. Rademacher variables, independent of Z. By standard symmetrization inequalities (see e.g. Lemma 6.3 in [6]),

  E|v^T Z| = E| ∑_{i=1}^n v_i χ_{ij} R_{ij} | ≥ (1/2) E| ∑_{i=1}^n v_i ε_i χ_{ij} R_{ij} |.

The random variables ε_i R_{ij} are symmetric and E|ε_i R_{ij}| = μ, so by Lemma 16 from [12] the right-hand side above is bounded from below by (μ/8)√(θ/n) ‖v‖_1.

The next lemma is an improvement of Lemma 17 in [12], which is crucial for obtaining Lemma 2.2.

Lemma 2.4. There exists an absolute constant C such that the following holds for p > Cn log n. Let S ⊆ {1, ..., p} be a fixed subset of size |S| < p/4. Let X_S be the submatrix of X obtained by a restriction of X to the columns indexed by S. With probability at least 1 − p^{−8}, for any v ∈ R^n,

  ‖v^T X‖_1 − 2‖v^T X_S‖_1 > (pμ/32)√(θ/n) ‖v‖_1.

Proof.
Note first that by increasing the set S we increase ‖v^T X_S‖_1, so without loss of generality we can assume that |S| = ⌊p/4⌋. Apply Proposition 2.1 with the vectors U_j = (R_{1j}, ..., R_{nj}), χ_j = (χ_{1j}, ..., χ_{nj}) and q = 8 log p. Note that our integrability assumptions on R_ij imply (1) with M being a universal constant. Therefore, for some absolute constant C and p ≥ Cn log n, with probability at least 1 − p^{−8} we have

  sup_{v ∈ B_1^n} | ‖v^T X‖_1 − E‖v^T X‖_1 | ≤ C( √(pθ log p) + log p ) ≤ 2C√(pθ log p),

  sup_{v ∈ B_1^n} | ‖v^T X_S‖_1 − E‖v^T X_S‖_1 | ≤ 2C√(pθ log p),

where we used that for C sufficiently large, p/log p ≥ n ≥ 1/θ. Thus, by homogeneity, for all v ∈ R^n,

  | ‖v^T X‖_1 − E‖v^T X‖_1 | ≤ 2C√(θp log p) ‖v‖_1,

  | ‖v^T X_S‖_1 − E‖v^T X_S‖_1 | ≤ 2C√(θp log p) ‖v‖_1.

In particular this means that (using the notation of Proposition 2.1)

  ‖v^T X‖_1 ≥ E‖v^T X‖_1 − 2C√(θp log p) ‖v‖_1 = p E|v^T Z_1| − 2C√(θp log p) ‖v‖_1,

  2‖v^T X_S‖_1 ≤ 2E‖v^T X_S‖_1 + 4C√(θp log p) ‖v‖_1 = 2|S| E|v^T Z_1| + 4C√(θp log p) ‖v‖_1,

and so

  ‖v^T X‖_1 − 2‖v^T X_S‖_1 ≥ (p − 2|S|) E|v^T Z_1| − 6C√(θp log p) ‖v‖_1.

Now, by Lemma 2.3 and the assumed bound on the cardinality of S, we get

  ‖v^T X‖_1 − 2‖v^T X_S‖_1 ≥ ( (pμ/16)√(θ/n) − 6C√(θp log p) ) ‖v‖_1 > (pμ/32)√(θ/n) ‖v‖_1

for p > C′n log n, where C′ is another absolute constant.

We are now in a position to prove Lemma 2.2.

Proof of Lemma 2.2. We will show that for each 1 ≤ i < j ≤ p the probability that 0 < |I_ij| ≤ 1/(8θ) and there exists a solution to (6) not supported on I_ij is bounded from above by 1/p⁴. By the union bound over all i < j, this implies the lemma.

Fix i, j and let S = {l ∈ [p] : ∃_{k ∈ I_ij} X_{kl} ≠ 0}. Denote by F_1 the σ-field generated by X e_i and X e_j. Then A = {0 < |I_ij| ≤ 1/(8θ)} ∈ F_1. By independence, for each k ∉ {i, j}, on the event A,

  P(k ∈ S | F_1) ≤ 1 − (1 − θ)^{|I_ij|} ≤ 1 − e^{−2θ|I_ij|} ≤ 1 − e^{−1/4} < 1/4,

where the second inequality holds if α is sufficiently small. Thus, by independence of columns of X and Hoeffding's inequality,

  P( |S \ {i, j}| ≤ p/4 | F_1 ) ≥ 1 − e^{−cp}   (7)

for some universal constant c > 0. Let z_∗ be any solution of (6) and denote by z_0 its orthogonal projection onto R^{I_ij} = {x ∈ R^n : x_k = 0 for k ∉ I_ij}. Set also z_1 = z_∗ − z_0 and let X_S, X_{S^c} be the matrices obtained from X by selecting the columns labeled by S and S^c = [p] \ S respectively. By the triangle inequality and the fact that z_0^T X_{S^c} = 0, we get

  ‖z_∗^T X‖_1 = ‖(z_0^T + z_1^T) X_S‖_1 + ‖(z_0^T + z_1^T) X_{S^c}‖_1
             ≥ ‖z_0^T X_S‖_1 − ‖z_1^T X_S‖_1 + ‖z_1^T X‖_1 − ‖z_1^T X_S‖_1
             = ‖z_0^T X‖_1 + ( ‖z_1^T X‖_1 − 2‖z_1^T X_S‖_1 ).

Denote now by X′ the |I_ij^c| × (p − 2) matrix obtained by restricting X to the rows from I_ij^c and the columns from [p] \ {i, j}. Set also S′ = S \ {i, j}. If, slightly abusing the notation, we identify z_1 with a vector from R^{|I_ij^c|}, we have

  ‖z_1^T X‖_1 − 2‖z_1^T X_S‖_1 = ‖z_1^T X′‖_1 − 2‖z_1^T X′_{S′}‖_1,

where we used the fact that z_1^T X e_i = z_1^T X e_j = 0.

Denote by F_2 the σ-field generated by X e_i, X e_j and the rows of X labeled by I_ij (note that I_ij is itself random, but this will not be a problem in what follows). The random set S is measurable with respect to F_2. Moreover, due to independence and identical distribution of the entries of X, conditionally on F_2 the matrix X′ still follows the Bernoulli-Subgaussian model with parameter θ.
Therefore, by Lemma 2.4, if C is large enough, then on {|S′| ≤ p/4} we have

  P( for all v ∈ R^{|I_ij^c|} : ‖v^T X′‖_1 − 2‖v^T X′_{S′}‖_1 ≥ (pμ/32)√(θ/|I_ij^c|) ‖v‖_1 | F_2 ) ≥ 1 − p^{−8}.

Note that by the definition of z_0 we have b_ij^T z_0 = b_ij^T z_∗ = 1, therefore z_0 is a feasible candidate for the solution of the optimization problem (6). Thus, since z_∗ is a minimizer of (6), ‖z_∗^T X‖_1 ≤ ‖z_0^T X‖_1, which together with the lower bound obtained above yields ‖z_1^T X′‖_1 − 2‖z_1^T X′_{S′}‖_1 ≤ 0, and as a consequence, on the event {|S′| ≤ p/4},

  P( for some solution z_∗ to (6), z_1 ≠ 0 | F_2 ) ≤ p^{−8}.   (8)

Thus, denoting B = { for some solution z_∗ to (6), z_1 ≠ 0, and 0 < |I_ij| ≤ 1/(8θ) }, we get by (7) and (8),

  P(B) ≤ P( B ∩ {|S′| > p/4} ) + E[ P(B | F_2) 1_{{|S′| ≤ p/4}} ]
       ≤ E[ P(|S′| > p/4 | F_1) 1_A ] + p^{−8}
       ≤ e^{−cp} + p^{−8} ≤ p^{−4}

for p > Cn log n with a sufficiently large absolute constant C.

Step 3. With high probability z_∗ = λ e_k for k = argmax_{1 ≤ l ≤ n} |b_ij(l)|.

At this step one proves the following lemma (Lemma 12 in [12]). Since no changes with respect to the original argument are required (we do not use Proposition 2.1 here), we do not reproduce the proof and refer the Reader to [12] for details. We remark that although the lemma is formulated in [12] for symmetric variables, the symmetry assumption is not used in its proof.

Below, by |b|↓_1 ≥ |b|↓_2 ≥ ... ≥ |b|↓_n we denote the nonincreasing rearrangement of the sequence |b_1|, ..., |b_n|, while for J ⊆ [n], X^J denotes the matrix obtained from X by selecting the rows indexed by the set J.

Lemma 2.5. There exist two positive constants c_1, c_2 such that the following holds. For any γ > 0 and s ∈ Z_+ such that θs < γ/8, and any p such that

  p ≥ c_1 max{ s log n / (θγ²), n }  and  p/log p ≥ c_2/(θγ²),

with probability at least 1 − 4p^{−10} the random matrix X has the following property.

(P2) For every J ⊆ [n] with |J| = s and every b ∈ R^s satisfying |b|↓_2 / |b|↓_1 ≤ 1 − γ, the solution to the restricted problem

  minimize ‖z^T X^J‖_1 subject to b^T z = 1,   (9)

is unique, 1-sparse, and is supported on the index of the largest entry of b.

Step 4. Conclusion of the proof.

Set s = 12θn + 1. Our first goal is to prove that with probability at least 1 − 1/p², for all k ∈ [n] there exist i, j ∈ [p], i ≠ j, such that the vector b = X e_i + X e_j satisfies the assumptions of Lemma 2.5, |b|↓_1 = |b_k|, and I_ij := (supp X e_i) ∪ (supp X e_j) satisfies 0 < |I_ij| ≤ 1/(8θ), which will allow us to take advantage of Lemma 2.2.

Note that we have

  E R_ij² ≤ 4 ∫_0^∞ t e^{−t²/2} dt = 4.

Since E|R_ij| = μ ≥ 1/10, by the Paley-Zygmund inequality (see e.g. Corollary 3.3.2 in [3]) we have

  P( |R_ij| ≥ 1/20 ) ≥ (E|R_ij|)² / (4 E R_ij²) ≥ c_0

for some universal constant c_0 > 0. In particular P(R_ij = 0) < 1 − c_0/2. Let q be any (1 − c_0/(2s))-quantile of |R_ij|, i.e. P(|R_ij| ≤ q) ≥ 1 − c_0/(2s) and P(|R_ij| ≥ q) ≥ c_0/(2s). In particular, since s ≥ 1, we get q > 0. We have P(R_ij ≥ q) ≥ c_0/(4s) or P(R_ij ≤ −q) ≥ c_0/(4s). Let us assume that P(R_ij ≥ q) ≥ c_0/(4s); the other case is analogous.

Define the event E_ki as

  E_ki = { χ_ki = 1, |{r ∈ [n] \ {k} : χ_ri = 1}| ≤ (s − 1)/2, R_ki ≥ q, ∀_{r ≠ k} (χ_ri = 1 ⟹ |R_ri| ≤ q) }.

We will assume that p ≥ 2Cn log n for some numerical constant C to be fixed later on. For k ∈ [n], consider the events

  A_k = ⋃_{1 ≤ i ≤ ⌊p/2⌋} E_ki

and

  B_k = ⋃_{1 ≤ i ≤ ⌊p/2⌋} ⋃_{⌊p/2⌋ < j ≤ p} ( E_ki ∩ E_kj ∩ { {l ∈ [n] : χ_li = χ_lj = 1} = {k} } ).

We will first show that for all k ∈ [n],

  P(A_k) ≥ 1 − 1/p⁴,   (10)

which we will use to prove that

  P(B_k) ≥ 1 − 1/p³.   (11)

Let us start with the proof of (10). Set B_ki = { |{r ∈ [n] \ {k} : χ_ri = 1}| ≤ (s − 1)/2 }.
By independence we have

  P(E_ki) = P(χ_ki = 1) P(R_ki ≥ q) P(B_ki) P( ∀_{r ≠ k} (χ_ri = 1 ⟹ |R_ri| ≤ q) | B_ki )
          ≥ θ (c_0/(4s)) ( 1 − 2θ(n − 1)/(s − 1) ) ( 1 − c_0/(2s) )^{(s−1)/2},

where to estimate P(B_ki) we used Markov's inequality. The right-hand side above is bounded from below by c_1/n for some universal constant c_1. Therefore if the constant C is large enough, we obtain

  P( ⋂_{1 ≤ i ≤ ⌊p/2⌋} E_ki^c ) ≤ ( 1 − c_1/n )^{⌊p/2⌋} ≤ exp(−c_1 p/(4n)) ≤ exp(−4 log p) = 1/p⁴,

where we used the inequality p/log p ≥ 16 c_1^{−1} n for p ≥ Cn log n. We have thus established (10).

Let us now pass to (11). Denote by F_1 the σ-field generated by χ_ki, R_ki, k ∈ [n], 1 ≤ i ≤ ⌊p/2⌋. For ω ∈ A_k define i_min(ω) = min{1 ≤ i ≤ ⌊p/2⌋ : ω ∈ E_ki}. Note that on A_k,

  P(B_k | F_1) ≥ P( ⋃_{⌊p/2⌋ < j ≤ p} ( E_kj ∩ { {l ∈ [n] : χ_{l i_min} = χ_lj = 1} = {k} } ) | F_1 ).

Define

  C_kj = { |{r ∈ [n] \ {k} : χ_rj = 1}| ≤ (s − 1)/2 } ∩ { {l ∈ [n] : χ_{l i_min} = χ_lj = 1} = {k} }.

Similarly as in the argument leading to (10), for fixed j, using the independence of the variables χ_lm, R_lm we obtain

  P( E_kj ∩ { {l ∈ [n] : χ_{l i_min} = χ_lj = 1} = {k} } | F_1 )
    = P(R_kj ≥ q) E[ 1_{C_kj} P( ∀_{r ≠ k} (χ_rj = 1 ⟹ |R_rj| ≤ q) | C_kj, F_1 ) | F_1 ]
    ≥ P(R_kj ≥ q) E[ 1_{C_kj} ( 1 − c_0/(2s) )^{(s−1)/2} | F_1 ]
    = P(R_kj ≥ q) ( 1 − c_0/(2s) )^{(s−1)/2} P( C_kj | F_1 )
    ≥ (c_0/(4s)) ( 1 − c_0/(2s) )^{(s−1)/2}
      × ( P( {l ∈ [n] : χ_{l i_min} = χ_lj = 1} = {k} | F_1 ) − P( χ_kj = 1, |{r ∈ [n] \ {k} : χ_rj = 1}| > (s − 1)/2 | F_1 ) )
    ≥ (c_0/(4s)) ( 1 − c_0/(2s) )^{(s−1)/2} ( θ (1 − θ)^{(s−1)/2} − θ · 2θ(n − 1)/(s − 1) ).

Now recall that θ ≤ α/√n for some universal constant α. If α is small enough then 1 − θ ≥ e^{−2θ} and

  (1 − θ)^{(s−1)/2} ≥ e^{−θ(s−1)} = e^{−12θ²n} ≥ e^{−12α²} ≥ 1/3.

Since 2θ(n − 1)/(s − 1) ≤ 1/6, this implies that

  P( E_kj ∩ { {l ∈ [n] : χ_{l i_min} = χ_lj = 1} = {k} } | F_1 ) ≥ c_2/n

for some positive universal constant c_2. Since the events E_kj ∩ { {l ∈ [n] : χ_{l i_min} = χ_lj = 1} = {k} }, ⌊p/2⌋ < j ≤ p, are conditionally independent given F_1, we obtain that on A_k,

  P(B_k^c | F_1) ≤ ( 1 − c_2/n )^{⌊p/2⌋} ≤ 1/p⁴,

provided C is a sufficiently large universal constant. Now, using (10), we get

  P(B_k) ≥ E[ 1_{A_k} P(B_k | F_1) ] ≥ P(A_k) ( 1 − 1/p⁴ ) ≥ ( 1 − 1/p⁴ )² ≥ 1 − 1/p³,

proving (11). Taking the union bound over k ∈ [n], we get

  P( ⋂_{1 ≤ k ≤ n} B_k ) ≥ 1 − 1/p².

Set γ = 1/2 and observe that if C is large enough and α small enough, then the assumptions of Lemma 2.2 and Lemma 2.5 are satisfied. Moreover s = 12θn + 1 ≤ 1/(8θ). Recall the properties P1 and P2 considered in the said lemmas. Consider the event

  A = ⋂_{1 ≤ k ≤ n} B_k ∩ { properties P1 and P2 hold }

and note that P(A) ≥ 1 − 1/p. On the event A, for every k there exist 1 ≤ i < j ≤ p such that

• 1 ≤ |I_ij| ≤ s ≤ 1/(8θ),

• the largest entry of b_ij (in absolute value) is the k-th one, which satisfies b_ij(k) ≥ 2q > 0, whereas the remaining entries do not exceed q in absolute value.

In particular, by property P1 we obtain that any solution z_∗ to the problem (6) satisfies supp z_∗ ⊆ I_ij.
Therefore for some (any) J ⊇ I_ij with |J| = s we obtain (identifying vectors supported on J with their restrictions to J) that z_∗ is in fact a solution to the restricted problem (9) with b = b_ij, which by property P2 implies that z_∗ = λ e_k for some λ ≠ 0.

According to the discussion at the beginning of Step 1, this means that the solution w_∗ to (5) satisfies w_∗^T Y = λ e_k^T X, i.e. when solving the problem (5) corresponding to the pair (i, j), the algorithm will add a multiple of the k-th row of X to the collection S. This ends the proof of Theorem 1.1.

3 Proof of Proposition 2.1

The first tool we will need is the classical Bernstein's inequality (see e.g. Lemma 2.2.11 in [14]).

Lemma 3.1 (Bernstein's inequality). Let Y_1, ..., Y_p be independent mean zero random variables such that for some constants M, v and every integer k ≥ 2, E|Y_i|^k ≤ k! M^{k−2} v/2. Then, for every t > 0,

  P( | ∑_{i=1}^p Y_i | ≥ t ) ≤ 2 exp( − t² / (2(pv + Mt)) ).

As a consequence, for every q ≥ 2,

  ‖ ∑_{i=1}^p Y_i ‖_q ≤ C( √(qpv) + qM ),   (12)

where C is a universal constant.
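Let us briefly indicate one standard way of passing from the tail bound to the moment estimate (12); the constants below are not optimized and the argument is only sketched. Writing S = ∑_{i=1}^p Y_i and noting that for every t > 0 either pv ≥ Mt or pv < Mt, the Bernstein tail splits into a subgaussian and a subexponential part,

  P(|S| ≥ t) ≤ 2 exp( −t²/(2(pv + Mt)) ) ≤ 2 exp( −t²/(4pv) ) + 2 exp( −t/(4M) ).

Integrating the tails via ‖S‖_q^q = ∫_0^∞ q t^{q−1} P(|S| ≥ t) dt and using the standard moment estimates for random variables with subgaussian and subexponential tails (of order √(q·pv) and qM respectively, for q ≥ 2) yields (12).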