Multiprocessor Approximate Message Passing with Column-Wise Partitioning∗

Yanting Ma, North Carolina State University, [email protected]
Yue M. Lu, Harvard University, [email protected]
Dror Baron, North Carolina State University, [email protected]

∗This document serves as a supporting document for [1].

Abstract

Solving a large-scale regularized linear inverse problem using multiple processors is important in various real-world applications due to the limitations of individual processors and constraints on data sharing policies. This paper focuses on the setting where the matrix is partitioned column-wise. We extend the algorithmic framework and the theoretical analysis of approximate message passing (AMP), an iterative algorithm for solving linear inverse problems, whose asymptotic dynamics are characterized by state evolution (SE). In particular, we show that column-wise multiprocessor AMP (C-MP-AMP) obeys an SE under the same assumptions under which the SE for AMP holds. The SE results imply that (i) the SE of C-MP-AMP converges to a state that is no worse than that of AMP and (ii) the asymptotic dynamics of C-MP-AMP and AMP can be identical. Moreover, for a setting that is not covered by SE, numerical results show that damping can improve the convergence performance of C-MP-AMP.

1 Introduction

Many scientific and engineering problems can be modeled as solving a regularized linear inverse problem of the form

    y = Ax + w,    (1)

where the goal is to estimate the unknown x ∈ R^N given the matrix A ∈ R^{n×N} and statistical information about the signal x and the noise w ∈ R^n.

In some scenarios, it might be desirable to partition the matrix A either column-wise or row-wise and store the sub-matrices at different processors. The partitioning style depends on data availability, computational considerations, and privacy concerns. For example, in high-dimensional settings where N ≫ n, or in situations where the columns of A, which represent features in feature selection problems [2], cannot be shared among processors for privacy preservation, column-wise partitioning might be preferable.

In this paper, we consider multiprocessor computing for the (non-overlapping) column-wise partitioned linear inverse problem

    y = Σ_{p=1}^{P} A_p x_p + w,    (2)

where P is the number of processors, A_p ∈ R^{n×N_p} is the sub-matrix that is stored at Processor p, and Σ_{p=1}^{P} N_p = N.

Many studies on solving the column-wise partitioned linear inverse problem (2) have been in the context of distributed feature selection. Zhou et al. [3] modeled feature selection as a parallel group testing problem. Wang et al. [4] proposed to de-correlate the data matrix before partitioning; each processor then works independently with the de-correlated matrix, without communication with other processors. Peng et al. [5] studied problem (2) in the context of optimization, where they proposed a greedy coordinate-block descent algorithm and a parallel implementation of the fast iterative shrinkage-thresholding algorithm (FISTA) [6].

Our work is based on the approximate message passing (AMP) framework [7]. AMP is an efficient iterative algorithm for solving linear inverse problems of the form (1). In the large-scale random setting, its average asymptotic dynamics are characterized by a state evolution (SE) formalism [8], which allows one to accurately predict the average estimation error at every iteration. Recently, a finite-sample analysis of AMP [9] showed that when the prior distribution of the input signal x has i.i.d. sub-Gaussian entries,¹ the average performance of AMP concentrates on the SE prediction at an exponential rate in the signal dimension N.

¹A random variable X is sub-Gaussian if there exist positive constants c and κ such that P(|X − EX| > ε) ≤ c e^{−κε²} for all ε > 0.
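To make the partitioned model (2) concrete, the following minimal NumPy sketch generates a synthetic instance of (2). The Bernoulli–Gaussian signal, the i.i.d. N(0, 1/n) matrix entries, and all dimensions are illustrative assumptions made for this example, not requirements of the model.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N, P = 500, 2000, 4                      # measurements, signal length, processors
Np = N // P                                 # equal-sized column blocks (assumption)
sigma_w = 0.1                               # noise standard deviation (assumption)

# Bernoulli-Gaussian signal: sparse with i.i.d. sub-Gaussian entries (assumption)
x = rng.normal(size=N) * (rng.random(N) < 0.1)

# Column-wise partition: Processor p stores A_p (n x Np) and estimates x_p.
A_blocks = [rng.normal(scale=1/np.sqrt(n), size=(n, Np)) for _ in range(P)]
x_blocks = [x[p*Np:(p+1)*Np] for p in range(P)]

# Measurement model (2): y = sum_p A_p x_p + w
w = sigma_w * rng.normal(size=n)
y = sum(A @ xb for A, xb in zip(A_blocks, x_blocks)) + w
```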
Our goal is to extend the AMP algorithmic framework and the SE analysis in [9] to the column-wise partitioned linear inverse problem (2). We show that column-wise multiprocessor AMP (C-MP-AMP) obeys a new SE under the same model assumptions under which the SE for AMP holds. With the new SE, we can predict the average estimation error in each processor at every iteration. Moreover, the comparison between the SE of AMP and that of C-MP-AMP implies that (i) the estimation error of C-MP-AMP is no worse than that of AMP and (ii) with a specific communication schedule between the processors and the fusion center that coordinates the processors, the asymptotic dynamics of C-MP-AMP are identical to those of AMP. This result implies a speedup that is linear in the number of processors.

It is worth mentioning that row-wise multiprocessor AMP [10–12] obeys the same SE as AMP, because it distributes the computation of the matrix-vector multiplication among multiple processors and aggregates the results before any other operations. Some existing work on row-wise multiprocessor AMP [12–14] introduces lossy compression into the communication between processors and the fusion center, whereas we assume perfect communication and focus on the theoretical justification and implications of the new SE of C-MP-AMP.

The remainder of the paper is organized as follows. Section 2 introduces the C-MP-AMP algorithm (Algorithm 1), the state evolution sequences, and our main performance guarantee (Theorem 1), which is a concentration result stating that PL loss functions acting on the outputs generated by Algorithm 1 concentrate on the state evolution prediction. Section 3 proves Theorem 1. The proof is mainly based on Lemmas 3 and 4. The proof of Lemma 3 is the same as in [9], using the result that we prove in Lemma 2. Section 4 proves Lemma 4.

2 Column-Wise Multiprocessor AMP and State Evolution

2.1 Review of AMP

Approximate message passing (AMP) [7] is a fast iterative algorithm for solving linear inverse problems (1). Starting with an all-zero vector x^0 as its initial estimate, at the tth iteration AMP proceeds according to

    z^t = y − Ax^t + (z^{t−1}/n) Σ_{i=1}^{N} η'_{t−1}([x^{t−1} + A*z^{t−1}]_i),    (3)
    x^{t+1} = η_t(x^t + A*z^t),    (4)

where vectors with negative iteration indices are all-zero vectors, A* denotes the transpose of a matrix A, η_t : R → R is a Lipschitz function with weak derivative η'_t, and for any u ∈ R^N, [u]_i denotes its ith entry. The function η_t acts coordinate-wise when applied to vectors; that is, η_t(u) denotes the vector (η_t(u_1), η_t(u_2), ..., η_t(u_N)).

Under the assumptions on the measurement matrix A, the signal x, the measurement noise w, and the denoising function η_t(·) listed in [9, Section 1.1], the sequence of estimates {x^t} generated by AMP (3), (4) has the following property [9]: for all ε ∈ (0,1), there exist constants K_t, κ_t > 0 independent of n and ε such that

    P( |(1/N) Σ_{i=1}^{N} φ(x_i^{t+1}, x_i) − E[φ(η_t(X + τ^t Z), X)]| ≥ ε ) ≤ K_t e^{−κ_t n ε²},    (5)

where φ : R² → R is a pseudo-Lipschitz function of order 2 (PL(2)),² X ∼ p_X, Z is a standard normal random variable that is independent of X, and τ^t is defined via the following recursion, with (σ^0)² = δ^{−1} E[X²] and δ = n/N:

    (τ^t)² = σ_W² + (σ^t)²,
    (σ^{t+1})² = δ^{−1} E[(η_t(X + τ^t Z) − X)²].    (6)

²Recall the definition of PL(2) from [8]: a function f : R^m → R is said to be PL(2) if there is L > 0 such that |f(x) − f(y)| ≤ L(1 + ‖x‖ + ‖y‖)‖x − y‖ for all x, y ∈ R^m, where ‖·‖ denotes the Euclidean norm.

Notice that (5) implies, by the Borel–Cantelli lemma, the almost sure convergence result proved in [8]:

    lim_{N→∞} (1/N) Σ_{i=1}^{N} φ(x_i^{t+1}, x_i)  a.s.=  E[φ(η_t(X + τ^t Z), X)].    (7)

If we choose φ(x, y) = (x − y)², then (7) characterizes the mean squared error (MSE) achieved by AMP at each iteration.
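For concreteness, the sketch below implements the AMP iteration (3)–(4) with a soft-thresholding denoiser η_t, thresholded at an empirical estimate of τ^t. Both the choice of denoiser and the use of ‖z^t‖/√n as a surrogate for τ^t are common practical choices assumed here for illustration; they are not dictated by the general framework reviewed above.

```python
import numpy as np

def soft_threshold(u, theta):
    # One admissible Lipschitz denoiser eta_t; its weak derivative is 1{|u| > theta}.
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def amp(y, A, num_iters=30):
    """AMP iteration (3)-(4) with eta_t = soft thresholding (an illustrative choice)."""
    n, N = A.shape
    x = np.zeros(N)                                    # x^0 = 0
    z = y.copy()                                       # z^0 = y - A x^0 (Onsager term is zero)
    for _ in range(num_iters):
        tau = np.linalg.norm(z) / np.sqrt(n)           # empirical surrogate for tau^t
        pseudo = x + A.T @ z                           # x^t + A* z^t
        x_new = soft_threshold(pseudo, tau)            # (4)
        onsager = (z / n) * np.sum(np.abs(pseudo) > tau)   # (z^t/n) sum_i eta'_t([x^t + A* z^t]_i)
        z = y - A @ x_new + onsager                    # (3), written for the next iteration
        x = x_new
    return x
```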
2.2 Column-Wise Multiprocessor AMP

In our proposed column-wise multiprocessor AMP (C-MP-AMP) algorithm, the fusion center collects vectors that represent the estimates of the portion of the measurement vector y contributed by the data at the individual processors, according to a pre-defined communication schedule. The sum of these vectors is computed at the fusion center and transmitted to all processors. Each processor then performs standard AMP iterations with a new equivalent measurement vector, which is computed using the vector received from the fusion center. The pseudocode for C-MP-AMP is presented in Algorithm 1.

Algorithm 1 C-MP-AMP

Inputs to Processor p: y, A_p, {k̂_s}_{s=0,...,ŝ} (maximum number of inner iterations at each outer iteration).
Initialization: x_p^{0,k̂_0} = 0, z_p^{0,k̂_0−1} = 0, r_p^{0,k̂_0} = 0, ∀p.
for s = 1 : ŝ do (loop over outer iterations)
    At the fusion center: g^s = Σ_{u=1}^{P} r_u^{s−1,k̂_{s−1}}
    At Processor p:
        x_p^{s,0} = x_p^{s−1,k̂_{s−1}},  r_p^{s,0} = r_p^{s−1,k̂_{s−1}}
        for k = 0 : k̂_s − 1 do (loop over inner iterations)
            z_p^{s,k} = y − (g^s − r_p^{s,0}) − r_p^{s,k}
            x_p^{s,k+1} = η_{s,k}(x_p^{s,k} + A_p* z_p^{s,k})
            r_p^{s,k+1} = A_p x_p^{s,k+1} − (z_p^{s,k}/n) Σ_{i=1}^{N_p} η'_{s,k}([x_p^{s,k} + A_p* z_p^{s,k}]_i)
Output from Processor p: x_p^{ŝ,k̂_ŝ}.

2.3 Performance Guarantee

Similar to AMP, the dynamics of the C-MP-AMP algorithm can be characterized by an SE formula. Let (σ_p^{0,k̂_0})² = δ_p^{−1} E[X²], where δ_p = n/N_p, for p = 1, ..., P. For outer iterations 1 ≤ s ≤ ŝ and inner iterations 0 ≤ k ≤ k̂_s, we define the sequences {(σ_p^{s,k})²} and {(τ_p^{s,k})²} as

    (σ_p^{s,0})² = (σ_p^{s−1,k̂_{s−1}})²,    (8)
    (τ_p^{s,k})² = σ_W² + Σ_{u≠p} (σ_u^{s,0})² + (σ_p^{s,k})²,    (9)
    (σ_p^{s,k+1})² = δ_p^{−1} E[(η_{s,k}(X + τ_p^{s,k} Z) − X)²],    (10)

where Z is a standard normal random variable that is independent of X. With these definitions, we have the following performance guarantee for C-MP-AMP.

Theorem 1. Under the assumptions listed in [9, Section 1.1], let P be a fixed integer and, for p = 1, ..., P, let n/N_p = δ_p ∈ (0, ∞) be a constant. Define N = Σ_{p=1}^{P} N_p. Then for any PL(2) function φ : R² → R and any ε ∈ (0,1), there exist constants K_{s,k}, κ_{s,k} > 0 independent of n and ε such that

    P( |(1/N_p) Σ_{i=1}^{N_p} φ(x_{p,i}^{s,k+1}, x_{p,i}) − E[φ(η_{s,k}(X + τ_p^{s,k} Z), X)]| ≥ ε ) ≤ K_{s,k} e^{−κ_{s,k} n ε²},    ∀p,

where x_p^{s,k+1} is generated by Algorithm 1, τ_p^{s,k} is defined in (8)–(10), X ∼ p_X, and Z is a standard normal random variable that is independent of X.
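The quantities τ_p^{s,k} appearing in Theorem 1 can be computed directly from the recursion (8)–(10). The sketch below evaluates that recursion by Monte Carlo, assuming a soft-thresholding denoiser η_{s,k} thresholded at τ_p^{s,k} and a user-supplied sampler for p_X; these modeling choices are assumptions made for the example.

```python
import numpy as np

def soft_threshold(u, theta):
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def cmp_amp_se(deltas, sigma_w, prior_sampler, s_max=10, k_max=2, mc=200_000, seed=0):
    """Monte Carlo evaluation of the C-MP-AMP state evolution (8)-(10).

    deltas[p] = n / N_p. Returns tau2[s][k][p] = (tau_p^{s,k})^2.
    Assumes eta_{s,k} is soft thresholding at level tau_p^{s,k} (illustrative choice).
    """
    rng = np.random.default_rng(seed)
    X = prior_sampler(mc, rng)
    P = len(deltas)
    sigma2 = [np.mean(X**2) / d for d in deltas]   # (sigma_p^{0,k_0})^2 = delta_p^{-1} E[X^2]
    tau2_hist = []
    for s in range(s_max):
        sigma2_s0 = list(sigma2)                   # (8): carry over the end of the previous outer iteration
        tau2_s = []
        for k in range(k_max):
            tau2_k = []
            for p in range(P):
                # (9): sigma_W^2 + sum_{u != p} (sigma_u^{s,0})^2 + (sigma_p^{s,k})^2
                t2 = sigma_w**2 + sum(sigma2_s0[u] for u in range(P) if u != p) + sigma2[p]
                tau2_k.append(t2)
                # (10): (sigma_p^{s,k+1})^2 = delta_p^{-1} E[(eta_{s,k}(X + tau Z) - X)^2]
                Z = rng.normal(size=mc)
                err = soft_threshold(X + np.sqrt(t2) * Z, np.sqrt(t2)) - X
                sigma2[p] = np.mean(err**2) / deltas[p]
            tau2_s.append(tau2_k)
        tau2_hist.append(tau2_s)
    return tau2_hist

# Example usage: four processors, Bernoulli-Gaussian prior with sparsity 0.1 (assumptions)
bg = lambda m, rng: rng.normal(size=m) * (rng.random(m) < 0.1)
tau2 = cmp_amp_se(deltas=[1.0, 1.0, 1.0, 1.0], sigma_w=0.1, prior_sampler=bg)
```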
Remark 1: C-MP-AMP converges to a fixed point that is no worse than that of AMP. This statement can be demonstrated as follows. When C-MP-AMP converges, the quantities in (8)–(10) no longer change across iterations, hence we can drop all the iteration indices for the fixed point analysis. The last term on the right hand side (RHS) of (9) then equals (σ_p)² and can be absorbed into the sum, so that the RHS becomes σ_W² + Σ_{u=1}^{P} (σ_u)², which is independent of p. That is, the (τ_p^{s,k})² are equal for all p, hence we can further drop the processor index for (τ_p^{s,k})².

Denote (τ_p^{s,k})² by τ² for all s, k, p, and plug (10) into (9):

    τ² = σ_W² + Σ_{p=1}^{P} δ_p^{−1} E[(η(X + τZ) − X)²]
       (a)= σ_W² + δ^{−1} E[(η(X + τZ) − X)²],

which is identical to the fixed point equation obtained from (6). In the above, step (a) holds because Σ_{p=1}^{P} δ_p^{−1} = Σ_{p=1}^{P} N_p/n = N/n = δ^{−1}. Because AMP always converges to the worst fixed point of the fixed point equation (6) [15], the average asymptotic performance of C-MP-AMP is identical to that of AMP when there is only one solution to the fixed point equation, and is at least as good as that of AMP in the case of multiple fixed points.

Remark 2: The asymptotic dynamics of C-MP-AMP can be identical to those of AMP with a specific communication schedule. This can be achieved by letting k̂_s = 1 for all s. In this case, the quantity (τ_p^{s,k})² is involved only for k = 0. At k = 0 the last term in (9) equals (σ_p^{s,0})², so the RHS of (9) is σ_W² + Σ_{u=1}^{P} (σ_u^{s,0})², and the computation of (τ_p^{s,0})² is independent of p. Therefore, the τ_p^{s,0} are again equal for all p. Dropping the processor index for (τ_p^{s,k})², the recursion (8)–(10) simplifies to

    (τ^{s,0})² = σ_W² + Σ_{p=1}^{P} δ_p^{−1} E[(η_{s−1,0}(X + τ^{s−1,0} Z) − X)²]
               = σ_W² + δ^{−1} E[(η_{s−1,0}(X + τ^{s−1,0} Z) − X)²],

where the iteration evolves over s, which is identical to (6) evolving over t.

Remark 3: Theorem 1 implies almost sure convergence. Similar to the performance guarantee for AMP [9], the concentration result in Theorem 1 implies

    lim_{N→∞} (1/N_p) Σ_{i=1}^{N_p} φ(x_{p,i}^{s,k+1}, x_{p,i})  a.s.=  E[φ(η_{s,k}(X + τ_p^{s,k} Z), X)],    ∀p,

by the Borel–Cantelli lemma.
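The common fixed point discussed in Remark 1 can be located numerically by iterating the scalar equation τ² = σ_W² + δ^{−1} E[(η(X + τZ) − X)²] from the SE initialization; starting from above, this iteration reaches the worst (largest) fixed point. The sketch below does so by Monte Carlo under an assumed Bernoulli–Gaussian prior and soft-thresholding denoiser.

```python
import numpy as np

def soft_threshold(u, theta):
    return np.sign(u) * np.maximum(np.abs(u) - theta, 0.0)

def se_fixed_point(delta, sigma_w, prior_sampler, iters=200, mc=500_000, seed=0):
    """Iterate tau^2 <- sigma_W^2 + delta^{-1} E[(eta(X + tau Z) - X)^2], the fixed point
    equation shared by (6) and by the converged recursion (8)-(10) (Remark 1)."""
    rng = np.random.default_rng(seed)
    X = prior_sampler(mc, rng)
    Z = rng.normal(size=mc)
    tau2 = sigma_w**2 + np.mean(X**2) / delta      # SE initialization: (sigma^0)^2 = delta^{-1} E[X^2]
    for _ in range(iters):
        tau = np.sqrt(tau2)
        mse = np.mean((soft_threshold(X + tau * Z, tau) - X)**2)
        tau2 = sigma_w**2 + mse / delta
    return tau2

# Example: delta = n/N = 0.25, noise level 0.1, Bernoulli-Gaussian prior (assumptions)
bg = lambda m, rng: rng.normal(size=m) * (rng.random(m) < 0.1)
print(se_fixed_point(delta=0.25, sigma_w=0.1, prior_sampler=bg))
```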
3 Proof of Theorem 1

Our proof closely follows the proof for AMP in [9], with additional dependence structure that must be addressed because vectors are transmitted among processors.

3.1 Proof Notations

Without loss of generality, we assume the sequence {k̂_s}_{s≥0} in Algorithm 1 to be a constant value k̂. Let t = sk̂ + k and θ(t) = ⌊t/k̂⌋ k̂. Given w ∈ R^n and x_p ∈ R^{N_p} for p = 1, ..., P, define the column vectors h_p^{t+1}, q_p^t ∈ R^{N_p} and b_p^t, m_p^t ∈ R^n for t ≥ 0 recursively as follows, starting with the initial condition q_p^0 ∈ R^{N_p}:

    h_p^{t+1} = A_p* m_p^t − q_p^t,    q_p^t = f_t(h_p^t, x_p),
    b_p^t = A_p q_p^t − λ_p^t m_p^{t−1},    m_p^t = b_p^t + Σ_{u≠p} b_u^{θ(t)} − w,    (11)

where

    f_t(h_p^t, x_p) = η_{t−1}(x_p − h_p^t) − x_p,    and    λ_p^t := (1/(δ_p N_p)) Σ_{i=1}^{N_p} f'_t(h_{p,i}^t, x_{p,i}).    (12)

In (12), the derivative of f_t : R² → R is with respect to the first argument. We assume that η_t is Lipschitz for all t ≥ 0; it then follows that f_t is Lipschitz for all t ≥ 0. Consequently, the weak derivative f'_t exists. Further, f'_t is assumed to be differentiable, except possibly at a finite number of points, with bounded derivative wherever it exists. In (11), quantities with negative indices or with index θ(t) = 0 (i.e., t < k̂) are defined to be zero.

To see the equivalence between Algorithm 1 and the recursion defined in (11) and (12), we let x_p^0 = 0, r_p^0 = 0, z_p^0 = 0, and

    h_p^{t+1} = x_p − (A_p* z_p^t + x_p^t),    q_p^t = x_p^t − x_p,
    b_p^t = r_p^t − A_p x_p,    m_p^t = −z_p^t.

Let (σ_p^0)² = δ_p^{−1} E[X²]. We assume that (σ_p^0)² is strictly positive for all p = 1, ..., P, and that for all ε ∈ (0,1) there exist K, κ > 0 such that

    P( |‖q_p^0‖²/n − (σ_p^0)²| ≥ ε ) ≤ K e^{−κnε²},    ∀ p = 1, ..., P.    (13)

Define the state evolution scalars {τ_p^t}_{t≥0} and {σ_p^t}_{t≥1} for the recursion defined in (11) as follows:

    (τ_p^t)² = (σ_p^t)² + Σ_{u≠p} (σ_u^{θ(t)})² + σ_W²,    (σ_p^t)² = δ_p^{−1} E[(f_t(τ_p^{t−1} Z, X))²],    (14)

where Z ∼ N(0,1) and X ∼ p_X are independent. Notice that, with the equivalence between Algorithm 1 and the recursion (11), the state evolution scalars defined in (14) match (8)–(10).

Writing the updating equations for b_p^t, h_p^{t+1} defined in (11) in matrix form, we have

    X_p^t = A_p* M_p^t,    Y_p^t = A_p Q_p^t,    (15)

where

    X_p^t = [h_p^1 + q_p^0 | h_p^2 + q_p^1 | ⋯ | h_p^t + q_p^{t−1}],    Y_p^t = [b_p^0 | b_p^1 + λ_p^1 m_p^0 | ⋯ | b_p^{t−1} + λ_p^{t−1} m_p^{t−2}],
    M_p^t = [m_p^0 | m_p^1 | ⋯ | m_p^{t−1}],    Q_p^t = [q_p^0 | q_p^1 | ⋯ | q_p^{t−1}].

Let (m_p^t)_∥ and (q_p^t)_∥ denote the projections of m_p^t and q_p^t onto the column spaces of M_p^t and Q_p^t, respectively. That is,

    (m_p^t)_∥ = M_p^t ((M_p^t)* M_p^t)^{−1} (M_p^t)* m_p^t,
    (q_p^t)_∥ = Q_p^t ((Q_p^t)* Q_p^t)^{−1} (Q_p^t)* q_p^t.

Let

    α_p^t = (α_{p,0}^t, α_{p,1}^t, ..., α_{p,t−1}^t)*,    γ_p^t = (γ_{p,0}^t, γ_{p,1}^t, ..., γ_{p,t−1}^t)*    (16)

be the coefficient vectors of these projections. That is,

    α_p^t = ((M_p^t)* M_p^t)^{−1} (M_p^t)* m_p^t,    γ_p^t = ((Q_p^t)* Q_p^t)^{−1} (Q_p^t)* q_p^t,    (17)

and

    (m_p^t)_∥ = Σ_{i=0}^{t−1} α_{p,i}^t m_p^i,    (q_p^t)_∥ = Σ_{i=0}^{t−1} γ_{p,i}^t q_p^i.    (18)

Define

    (m_p^t)_⊥ = m_p^t − (m_p^t)_∥,    (q_p^t)_⊥ = q_p^t − (q_p^t)_∥.    (19)

The main lemma will show that α_p^t and γ_p^t concentrate around constants α̂_p^t and γ̂_p^t, respectively. We define these constants in the following subsection.
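The coefficient vectors in (16)–(18) are ordinary least-squares coefficients, and (19) is the associated orthogonal decomposition. The small sketch below illustrates this with arbitrary synthetic vectors standing in for m_p^0, ..., m_p^t; the same computation applies verbatim to γ_p^t and q_p^t.

```python
import numpy as np

rng = np.random.default_rng(0)
n, t = 1000, 4
M = rng.normal(size=(n, t))      # columns stand in for m_p^0, ..., m_p^{t-1}
m = rng.normal(size=n)           # stands in for m_p^t

alpha = np.linalg.solve(M.T @ M, M.T @ m)    # alpha_p^t = ((M_p^t)* M_p^t)^{-1} (M_p^t)* m_p^t, as in (17)
m_par = M @ alpha                            # (m_p^t)_parallel, as in (18)
m_perp = m - m_par                           # (m_p^t)_perp, as in (19)

# The orthogonal component is orthogonal to every column of M_p^t.
assert np.allclose(M.T @ m_perp, 0.0, atol=1e-8)
```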
3.2 Concentrating Constants

Let {Z̃_p^t}_{t≥0} and {Z̆_p^t}_{t≥0} each be a sequence of zero-mean jointly Gaussian random variables whose covariance is defined recursively as follows. For t, r ≥ 0,

    E[Z̆_p^r Z̆_p^t] = Ẽ_p^{r,t} / (σ_p^r σ_p^t),    E[Z̃_p^r Z̃_p^t] = Ĕ_p^{r,t} / (τ_p^r τ_p^t),    (20)

where

    Ĕ_p^{r,t} = Ẽ_p^{r,t} + Σ_{u≠p} Ẽ_u^{θ(r),θ(t)} + σ_W²,
    Ẽ_p^{r,t} = δ_p^{−1} E[f_r(τ_p^{r−1} Z̃_p^{r−1}, X) f_t(τ_p^{t−1} Z̃_p^{t−1}, X)].    (21)

Moreover, Z̃_p^r is independent of Z̃_q^t, and Z̆_p^r is independent of Z̆_q^t, for all r, t ≥ 0 whenever p ≠ q. Note that, according to the definition of σ_p^t and τ_p^t in (14), we have Ĕ_p^{t,t} = (τ_p^t)², Ẽ_p^{t,t} = (σ_p^t)², and E[(Z̃_p^t)²] = E[(Z̆_p^t)²] = 1. In (21), quantities with negative indices or with either θ(t) = 0 or θ(r) = 0 are zero.

Define matrices C̃_p^t, C̆_p^t ∈ R^{t×t}, p = 1, ..., P, such that

    [C̃_p^t]_{r+1,s+1} = Ẽ_p^{r,s},    [C̆_p^t]_{r+1,s+1} = Ĕ_p^{r,s},    ∀ r, s = 0, ..., t−1.

Define vectors Ẽ_p^t, Ĕ_p^t ∈ R^t, p = 1, ..., P, such that

    Ẽ_p^t = (Ẽ_p^{0,t}, Ẽ_p^{1,t}, ..., Ẽ_p^{t−1,t}),    Ĕ_p^t = (Ĕ_p^{0,t}, Ĕ_p^{1,t}, ..., Ĕ_p^{t−1,t}).

Define the concentrating values α̂_p^t and γ̂_p^t as

    γ̂_p^t = (C̃_p^t)^{−1} Ẽ_p^t,    α̂_p^t = (C̆_p^t)^{−1} Ĕ_p^t.    (22)

Let (σ_p^0)_⊥² = (σ_p^0)² and (τ_p^0)_⊥² = (τ_p^0)², and for t > 0 define

    (σ_p^t)_⊥² = (σ_p^t)² − (γ̂_p^t)* Ẽ_p^t = (σ_p^t)² − (Ẽ_p^t)* (C̃_p^t)^{−1} Ẽ_p^t,
    (τ_p^t)_⊥² = (τ_p^t)² − (α̂_p^t)* Ĕ_p^t = (τ_p^t)² − (Ĕ_p^t)* (C̆_p^t)^{−1} Ĕ_p^t.    (23)

Lemma 1. The matrices C̃_p^t and C̆_p^t, ∀ t ≥ 0, defined above are invertible, and the scalars (σ_p^t)_⊥² and (τ_p^t)_⊥², ∀ t ≥ 0, defined above are strictly positive.

Proof. The proof that C̃_p^t is invertible and (σ_p^t)_⊥² is strictly positive is the same as in [9]. Now consider C̆_p^{t+1}. Notice that C̆_p^{t+1} is the sum of a positive definite matrix (C̃_p^{t+1}) and P positive semi-definite matrices; hence, C̆_p^{t+1} is positive definite. Consequently,

    det(C̆_p^{t+1}) = det(C̆_p^t) det((τ_p^t)² − (Ĕ_p^t)* (C̆_p^t)^{−1} Ĕ_p^t) > 0,    (24)

which implies (τ_p^t)² − (Ĕ_p^t)* (C̆_p^t)^{−1} Ĕ_p^t = (τ_p^t)_⊥² > 0. □

3.3 Conditional Distribution Lemma

Let the sigma algebra S^{t₁,t} be generated by x, w, b_p^0, ..., b_p^{t₁−1}, m_p^0, ..., m_p^{t₁−1}, h_p^1, ..., h_p^t, q_p^0, ..., q_p^t, ∀p. We now compute the conditional distribution of A_p given S^{t₁,t} for 1 ≤ p ≤ P, where t₁ is either t or t+1. Notice that conditioning on S^{t₁,t} is equivalent to conditioning on the linear constraints

    A_p Q_p^{t₁} = Y_p^{t₁},    A_p* M_p^t = X_p^t,    1 ≤ p ≤ P,    (25)

where in (25) only A_p, 1 ≤ p ≤ P, are treated as random.

Let P^∥_{Q_p^{t₁}} = Q_p^{t₁}((Q_p^{t₁})* Q_p^{t₁})^{−1}(Q_p^{t₁})* and P^∥_{M_p^t} = M_p^t((M_p^t)* M_p^t)^{−1}(M_p^t)*, which are the projectors onto the column spaces of Q_p^{t₁} and M_p^t, respectively. The following lemma provides the conditional distribution of the matrices A_p, p = 1, ..., P, given S^{t₁,t}.

Lemma 2. For t₁ = t or t+1, the conditional distribution of the random matrices A_p, p = 1, ..., P, given S^{t₁,t} satisfies

    (A_1, ..., A_P) | S^{t₁,t} =_d (E_1^{t₁,t} + P^⊥_{M_1^t} Ã_1 P^⊥_{Q_1^{t₁}}, ..., E_P^{t₁,t} + P^⊥_{M_P^t} Ã_P P^⊥_{Q_P^{t₁}}),

where =_d denotes equality in distribution, P^⊥_{Q_p^{t₁}} = I − P^∥_{Q_p^{t₁}}, and P^⊥_{M_p^t} = I − P^∥_{M_p^t}. Here Ã_p =_d A_p and Ã_p is independent of S^{t₁,t}; moreover, Ã_p is independent of Ã_q for p ≠ q. E_p^{t₁,t} is defined as

    E_p^{t₁,t} = Y_p^{t₁}((Q_p^{t₁})* Q_p^{t₁})^{−1}(Q_p^{t₁})* + M_p^t((M_p^t)* M_p^t)^{−1}(X_p^t)*
                 − M_p^t((M_p^t)* M_p^t)^{−1}(M_p^t)* Y_p^{t₁}((Q_p^{t₁})* Q_p^{t₁})^{−1}(Q_p^{t₁})*.

Proof. To simplify the notation, we drop the superscripts t₁ and t in the following proof; it should be understood that Q_p represents Q_p^{t₁}, Y_p represents Y_p^{t₁}, M_p represents M_p^t, and X_p represents X_p^t.

First, let us consider projections of a deterministic matrix. Let Â_p be a deterministic matrix that satisfies the linear constraints Y_p = Â_p Q_p and X_p = Â_p* M_p. Then we have

    Â_p = Â_p Q_p(Q_p* Q_p)^{−1} Q_p* + Â_p (I − Q_p(Q_p* Q_p)^{−1} Q_p*),
    Â_p = M_p(M_p* M_p)^{−1} M_p* Â_p + (I − M_p(M_p* M_p)^{−1} M_p*) Â_p.

Combining the two equations above, as well as the two linear constraints, we can write

    Â_p = Y_p(Q_p* Q_p)^{−1} Q_p* + M_p(M_p* M_p)^{−1} X_p* − M_p(M_p* M_p)^{−1} M_p* Y_p(Q_p* Q_p)^{−1} Q_p* + P^⊥_{M_p} Â_p P^⊥_{Q_p}.    (26)

We now demonstrate the conditional distribution of A_1, ..., A_P. Let S_1, ..., S_P be arbitrary Borel sets on R^{n×N_1}, ..., R^{n×N_P}, respectively. Then

    P(A_1 ∈ S_1, ..., A_P ∈ S_P | A_p Q_p = Y_p, A_p* M_p = X_p, ∀p)
    (a)= P(E_1^{t₁,t} + P^⊥_{M_1} A_1 P^⊥_{Q_1} ∈ S_1, ..., E_P^{t₁,t} + P^⊥_{M_P} A_P P^⊥_{Q_P} ∈ S_P | A_p Q_p = Y_p, A_p* M_p = X_p, ∀p)
    (b)= P(E_1^{t₁,t} + P^⊥_{M_1} A_1 P^⊥_{Q_1} ∈ S_1, ..., E_P^{t₁,t} + P^⊥_{M_P} A_P P^⊥_{Q_P} ∈ S_P)
    = P(E_1^{t₁,t} + P^⊥_{M_1} A_1 P^⊥_{Q_1} ∈ S_1) ⋯ P(E_P^{t₁,t} + P^⊥_{M_P} A_P P^⊥_{Q_P} ∈ S_P),    (27)

which implies the desired result. In step (a),

    E_p^{t₁,t} = Y_p(Q_p* Q_p)^{−1} Q_p* + M_p(M_p* M_p)^{−1} X_p* − M_p(M_p* M_p)^{−1} M_p* Y_p(Q_p* Q_p)^{−1} Q_p*,    p = 1, ..., P,

which follows from (26). Step (b) holds since P^⊥_{M_p} A_p P^⊥_{Q_p} is independent of the conditioning. The independence is demonstrated as follows. Notice that A_p Q_p = A_p P^∥_{Q_p} Q_p. In what follows, we show that A_p^∥ := A_p P^∥_{Q_p} is independent of A_r^⊥ := A_r P^⊥_{Q_r} for p, r = 1, ..., P. A similar argument shows that P^⊥_{M_p} A_p is independent of P^∥_{M_r} A_r. Together, these provide the justification for step (b).

Note that A_p^∥ and A_r^⊥ are jointly normal, hence it is enough to show that they are uncorrelated:

    E{[A_p^∥]_{i,j} [A_r^⊥]_{m,l}} = E{ (Σ_{k=1}^{N_p} [A_p]_{i,k} [P^∥_{Q_p}]_{k,j}) (Σ_{k=1}^{N_r} [A_r]_{m,k} (I_{k,l} − [P^∥_{Q_r}]_{k,l})) }
    (a)= (1/n) δ_0(i,m) δ_0(p,r) ( Σ_{k=1}^{N_p} [P^∥_{Q_p}]_{k,j} I_{k,l} − Σ_{k=1}^{N_p} [P^∥_{Q_p}]_{k,j} [P^∥_{Q_p}]_{k,l} )
    (b)= (1/n) δ_0(i,m) δ_0(p,r) ( [P^∥_{Q_p}]_{l,j} − Σ_{k=1}^{N_p} [P^∥_{Q_p}]_{k,j} [P^∥_{Q_p}]_{l,k} )
    (c)= 0,

where δ_0(i,j) is the Kronecker delta function. In the above, step (a) holds since the original matrix A has i.i.d. N(0, 1/n) entries, step (b) holds since projectors are symmetric matrices, and step (c) follows from the projector property P² = P. □
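A quick numerical sanity check of Lemma 2 is to verify that the deterministic part E_p^{t₁,t} satisfies the same linear constraints (25) as A_p, so that all remaining randomness is confined to the orthogonal complements. The sketch below performs this check for a single processor with small synthetic matrices; all dimensions and random draws are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, Np, t1, t = 60, 40, 3, 2
A = rng.normal(scale=1/np.sqrt(n), size=(n, Np))   # plays the role of A_p
Q = rng.normal(size=(Np, t1))                      # plays the role of Q_p^{t1}
M = rng.normal(size=(n, t))                        # plays the role of M_p^t
Y = A @ Q                                          # constraint A_p Q_p^{t1} = Y_p^{t1}
X = A.T @ M                                        # constraint A_p^* M_p^t = X_p^t

QQ_inv = np.linalg.inv(Q.T @ Q)
MM_inv = np.linalg.inv(M.T @ M)

# E_p^{t1,t} as defined in Lemma 2
E = Y @ QQ_inv @ Q.T + M @ MM_inv @ X.T - M @ MM_inv @ M.T @ Y @ QQ_inv @ Q.T

# E satisfies the same constraints as A, so the conditional randomness lives in
# the orthogonal complement term P_perp_M  A~  P_perp_Q.
assert np.allclose(E @ Q, Y, atol=1e-8)
assert np.allclose(E.T @ M, X, atol=1e-8)
```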
Combining the results in Lemma 2 and [9, Lemma 4], we have the following conditional distribution lemma.

Lemma 3. For the vectors h_p^{t+1} and b_p^t defined in (11), the following holds for t ≥ 1, p = 1, ..., P:

    b_p^0 | S^{0,0} =_d (σ_p^0)_⊥ Z'_p^0 + Δ_p^{0,0},    h_p^1 | S^{1,0} =_d (τ_p^0)_⊥ Z_p^0 + Δ_p^{1,0},    (28)
    b_p^t | S^{t,t} =_d Σ_{i=0}^{t−1} γ̂_{p,i}^t b_p^i + (σ_p^t)_⊥ Z'_p^t + Δ_p^{t,t},    h_p^{t+1} | S^{t+1,t} =_d Σ_{i=0}^{t−1} α̂_{p,i}^t h_p^{i+1} + (τ_p^t)_⊥ Z_p^t + Δ_p^{t+1,t},    (29)

where

    Δ_p^{0,0} = ( ‖(q_p^0)_⊥‖/√n − (σ_p^0)_⊥ ) Z'_p^0,    (30)

    Δ_p^{1,0} = [ ( ‖(m_p^0)_⊥‖/√n − (τ_p^0)_⊥ ) I − (‖(m_p^0)_⊥‖/√n) P^∥_{q_p^0} ] Z_p^0
                + q_p^0 (‖q_p^0‖²/n)^{−1} ( (b_p^0)*(m_p^0)_⊥/n − ‖q_p^0‖²/n ),    (31)

    Δ_p^{t,t} = Σ_{i=0}^{t−1} (γ_{p,i}^t − γ̂_{p,i}^t) b_p^i + [ ( ‖(q_p^t)_⊥‖/√n − (σ_p^t)_⊥ ) I − (‖(q_p^t)_⊥‖/√n) P^∥_{M_p^t} ] Z'_p^t
                + M_p^t ((M_p^t)* M_p^t/n)^{−1} ( (H_p^t)*(q_p^t)_⊥/n − ((M_p^t)*/n)( λ_p^t m_p^{t−1} − Σ_{i=1}^{t−1} λ_p^i γ_{p,i}^t m_p^{i−1} ) ),    (32)

    Δ_p^{t+1,t} = Σ_{i=0}^{t−1} (α_{p,i}^t − α̂_{p,i}^t) h_p^{i+1} + [ ( ‖(m_p^t)_⊥‖/√n − (τ_p^t)_⊥ ) I − (‖(m_p^t)_⊥‖/√n) P^∥_{Q_p^{t+1}} ] Z_p^t
                + Q_p^{t+1} ((Q_p^{t+1})* Q_p^{t+1}/n)^{−1} ( (B_p^{t+1})*(m_p^t)_⊥/n − ((Q_p^{t+1})*/n)( q_p^t − Σ_{i=0}^{t−1} α_{p,i}^t q_p^i ) ),    (33)

where H_p^t = [h_p^1 | ⋯ | h_p^t] and B_p^{t+1} = [b_p^0 | ⋯ | b_p^t] (following the notation of [9]), and Z'_p^t ∈ R^n and Z_p^t ∈ R^{N_p} are random vectors with independent standard normal entries that are independent of the corresponding sigma algebras. Moreover, Z'_p^t is independent of Z'_q^t and Z_p^t is independent of Z_q^t when p ≠ q.

Proof. The proof for each individual p ∈ [P] is similar to the proof of [9, Lemma 4]. The claim that Z'_p^t is independent of Z'_q^t and that Z_p^t is independent of Z_q^t when p ≠ q follows from Lemma 2, where we showed that Ã_p is independent of Ã_q for p ≠ q. □

3.4 Main Concentration Lemma

We use the shorthand X_n ≐ c to denote the concentration inequality P(|X_n − c| ≥ ε) ≤ K_t e^{−κ_t n ε}. As specified in the theorem statement, the lemma holds for all ε ∈ (0,1), with K_t, κ_t denoting generic constants that depend on t but not on n or ε.

Lemma 4. With the ≐ notation defined above, the following holds for all t ≥ 0, p = 1, ..., P.

(a)

    P( ‖Δ_p^{t,t}‖²/n ≥ ε ) ≤ K_t e^{−κ_t n ε},    (34)
    P( ‖Δ_p^{t+1,t}‖²/n ≥ ε ) ≤ K_t e^{−κ_t n ε}.    (35)

(b) (i) For pseudo-Lipschitz functions φ_h : R^{t+2} → R,

    (1/N_p) Σ_{i=1}^{N_p} φ_h(h_{p,i}^1, ..., h_{p,i}^{t+1}, x_{p,i}) ≐ E[φ_h(τ_p^0 Z̃_p^0, ..., τ_p^t Z̃_p^t, X)].    (36)

(ii) Let ψ_h : R² → R be a bounded function that is differentiable in the first argument, except possibly at a finite number of points, with bounded derivative where it exists. Then

    (1/N_p) Σ_{i=1}^{N_p} ψ_h(h_{p,i}^{t+1}, x_{p,i}) ≐ E[ψ_h(τ_p^t Z̃_p^t, X)],    (37)

where {Z̃_p^t} is defined in (20), and X ∼ p_X is independent of {Z̃_p^t}.

(iii) For pseudo-Lipschitz functions φ_b : R^{P(t+1)+1} → R,

    (1/n) Σ_{i=1}^{n} φ_b(b_{1,i}^0, ..., b_{P,i}^0, ..., b_{1,i}^t, ..., b_{P,i}^t, w_i) ≐ E[φ_b(σ_1^0 Z̆_1^0, ..., σ_P^0 Z̆_P^0, ..., σ_1^t Z̆_1^t, ..., σ_P^t Z̆_P^t, W)],    (38)

where {Z̆_p^t} is defined in (20), and W ∼ p_W is independent of {Z̆_p^t}.

(c)

    ((h_p^{t+1})* q_p^0)/n ≐ 0,    ((h_p^{t+1})* x_p)/n ≐ 0,    (39)
    ((b_p^t)* w)/n ≐ 0.    (40)
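The concentration statements in Lemma 4(b) all have the same flavor: an empirical average of a pseudo-Lipschitz function over coordinates that behave like i.i.d. Gaussians concentrates on a Gaussian expectation. The toy sketch below illustrates this flavor for the PL(2) function φ(a, x) = (a − x)², using directly simulated Gaussian perturbations in place of the AMP iterates; it is an illustration of the statement, not a reproduction of the proof, and the prior and the value of τ are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
tau = 0.7                                   # plays the role of tau_p^t (arbitrary)
phi = lambda a, x: (a - x)**2               # a PL(2) function

# For this phi, E[phi(X + tau Z, X)] = tau^2, since Z ~ N(0,1) is independent of X.
expected = tau**2

for N in [10**3, 10**4, 10**5, 10**6]:
    X = rng.normal(size=N) * (rng.random(N) < 0.1)   # i.i.d. sub-Gaussian (Bernoulli-Gaussian) signal
    Z = rng.normal(size=N)
    empirical = np.mean(phi(X + tau * Z, X))
    print(N, abs(empirical - expected))              # the deviation shrinks as N grows
```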