Multiprocessor Approximate Message Passing with Column-Wise Partitioning*

Yanting Ma (North Carolina State University, yma7@ncsu.edu)
Yue M. Lu (Harvard University, yuelu@seas.harvard.edu)
Dror Baron (North Carolina State University, barondror@ncsu.edu)

*This document serves as a supporting document for [1].

Abstract
Solving a large-scale regularized linear inverse problem using multiple processors is important in various real-world applications due to the limitations of individual processors and constraints on data sharing policies. This paper focuses on the setting where the matrix is partitioned column-wise. We extend the algorithmic framework and the theoretical analysis of approximate message passing (AMP), an iterative algorithm for solving linear inverse problems, whose asymptotic dynamics are characterized by state evolution (SE). In particular, we show that column-wise multiprocessor AMP (C-MP-AMP) obeys an SE under the same assumptions under which the SE for AMP holds. The SE results imply that (i) the SE of C-MP-AMP converges to a state that is no worse than that of AMP and (ii) the asymptotic dynamics of C-MP-AMP and AMP can be identical. Moreover, for a setting that is not covered by SE, numerical results show that damping can improve the convergence performance of C-MP-AMP.
1 Introduction

Many scientific and engineering problems can be modeled as solving a regularized linear inverse problem of the form

$$y = Ax + w, \qquad (1)$$

where the goal is to estimate the unknown $x \in \mathbb{R}^N$ given the matrix $A \in \mathbb{R}^{n \times N}$ and statistical information about the signal $x$ and the noise $w \in \mathbb{R}^n$.
v In some scenarios, it might be desirable to partition the matrix A either column-wise or row-
Xi wise and store the sub-matrices at different processors. The partitioning style depends on data
availability, computational considerations, and privacy concerns. For example, in high-dimensional
r
a
settings where N n, or in situations where the columns of A, which represent features in feature
≫
selection problems [2], cannot be shared among processors for privacy preservation, column-wise
partitioning might be preferable. In this paper, we consider multiprocessor computing for the
(non-overlapping) column-wise partitioned linear inverse problem:
P
y = A x +w, (2)
p p
p=1
X
where P is the number of processors, A Rn Np is the sub-matrix that is stored in Processor p,
p ×
∈
and P N = N.
p=1 p
Many studies on solving the column-wise partitioned linear inverse problem (2) have been in the context of distributed feature selection. Zhou et al. [3] modeled feature selection as a parallel group testing problem. Wang et al. [4] proposed to de-correlate the data matrix before partitioning, and each processor then works independently using the de-correlated matrix without communication with other processors. Peng et al. [5] studied problem (2) in the context of optimization, where they proposed a greedy coordinate-block descent algorithm and a parallel implementation of the fast iterative shrinkage-thresholding algorithm (FISTA) [6].
Our work is based on the approximate message passing (AMP) framework [7]. AMP is an
efficient iterative algorithm for solving linear inverse problems (1). In the large scale random
setting, its average asymptotic dynamics are characterized by a state evolution (SE) formalism [8],
which allows one to accurately predict the average estimation error at every iteration. Recently, a
finite-sample analysis of AMP [9] showed that when the prior distribution of the input signal $x$ has i.i.d. sub-Gaussian entries,¹ the average performance of AMP concentrates to the SE prediction at an exponential rate in the signal dimension $N$.

¹A random variable $X$ is sub-Gaussian if there exist positive constants $c$ and $\kappa$ such that $P(|X - \mathbb{E}X| > \epsilon) \le c e^{-\kappa\epsilon^2}$ for all $\epsilon > 0$.
Our goal is to extend the AMP algorithmic framework and the SE analysis in [9] to the column-
wise partitioned linear inverse problem (2). We show that column-wise multiprocessor AMP (C-
MP-AMP) obeys a new SE under the same model assumptions under which the SE for AMP holds.
With the new SE, we can predict the average estimation error in each processor at every iteration.
Moreover, the comparison between the SE of AMP and that of C-MP-AMP implies that (i) the
estimation error of C-MP-AMP is no worse than that of AMP and (ii) with a specific communi-
cation schedule between the processors and the fusion center that coordinates the processors, the
asymptotic dynamics of C-MP-AMP are identical to that of AMP. This result implies a speedup
linear in the number of processors.
It is worth mentioning that row-wise multiprocessor AMP [10–12] obeys the same SE as AMP,
because it distributes the computation of matrix-vector multiplication among multiple processors
and aggregates the results before any other operations. Some existing work on row-wise multiprocessor AMP [12–14] introduces lossy compression to the communication between processors and the fusion center, whereas we assume perfect communication and focus on the theoretical justifications and implications of the new SE of C-MP-AMP.
The remainder of the paper is organized as follows. Section 2 introduces the C-MP-AMP algorithm (Algorithm 1), the state evolution sequences, and our main performance guarantee (Theorem 1), which is a concentration result showing that PL loss functions acting on the outputs generated by Algorithm 1 concentrate to the state evolution prediction. Section 3 proves Theorem 1. The proof is mainly based on Lemmas 3 and 4. The proof of Lemma 3 is the same as in [9], using the result that we prove in Lemma 2. Section 4 proves Lemma 4.
2 Column-Wise Multiprocessor AMP and State Evolution
2.1 Review of AMP
Approximate message passing (AMP) [7] is a fast iterative algorithm for solving linear inverse problems (1). Starting with an all-zero vector $x^0$ as its initial estimate, at the $t$th iteration AMP proceeds according to

$$z^t = y - Ax^t + \frac{z^{t-1}}{n}\sum_{i=1}^{N}\eta'_{t-1}\big([x^{t-1} + A^* z^{t-1}]_i\big), \qquad (3)$$
$$x^{t+1} = \eta_t(x^t + A^* z^t), \qquad (4)$$

where vectors with negative iteration indices are all-zero vectors, $A^*$ denotes the transpose of a matrix $A$, $\eta_t: \mathbb{R}\to\mathbb{R}$ is a Lipschitz function with weak derivative $\eta'_t$, and for any $u \in \mathbb{R}^N$, $[u]_i$ denotes its $i$th entry. The function $\eta_t$ acts coordinate-wise when applied to vectors; that is, $\eta_t(u)$ denotes the vector $(\eta_t(u_1), \eta_t(u_2), \dots, \eta_t(u_N))$.
Under the assumptions on the measurement matrix $A$, the signal $x$, the measurement noise $w$, and the denoising function $\eta_t(\cdot)$, as listed in [9, Section 1.1], the sequence of estimates $\{x^t\}$ generated by AMP (3)–(4) has the following property [9]. For all $\epsilon \in (0,1)$, there exist constants $K_t, \kappa_t > 0$ independent of $n$ and $\epsilon$, such that

$$P\left(\left|\frac{1}{N}\sum_{i=1}^{N}\phi(x_i^{t+1}, x_i) - \mathbb{E}\big[\phi(\eta_t(X + \tau^t Z), X)\big]\right| \ge \epsilon\right) \le K_t e^{-\kappa_t n\epsilon^2}, \qquad (5)$$

where $\phi: \mathbb{R}^2 \to \mathbb{R}$ is a pseudo-Lipschitz function of order 2 (PL(2)),² $X \sim p_X$, $Z$ is a standard normal random variable that is independent of $X$, and $\tau^t$ is defined via the following recursion (with $(\sigma^0)^2 = \delta^{-1}\mathbb{E}[X^2]$ and $\delta = n/N$):

$$(\tau^t)^2 = \sigma_W^2 + (\sigma^t)^2, \qquad (\sigma^{t+1})^2 = \delta^{-1}\mathbb{E}\left[\big(\eta_t(X + \tau^t Z) - X\big)^2\right]. \qquad (6)$$

²Recall the definition of PL(2) from [8]: a function $f: \mathbb{R}^m \to \mathbb{R}$ is said to be PL(2) if there is $L > 0$ such that $|f(x) - f(y)| \le L(1 + \|x\| + \|y\|)\|x - y\|$, $\forall x, y \in \mathbb{R}^m$, where $\|\cdot\|$ denotes the Euclidean norm.
Notice that (5) implies, by applying the Borel-Cantelli Lemma, the almost sure convergence result proved in [8]:

$$\lim_{N\to\infty}\frac{1}{N}\sum_{i=1}^{N}\phi(x_i^{t+1}, x_i) \overset{a.s.}{=} \mathbb{E}\big[\phi(\eta_t(X + \tau^t Z), X)\big]. \qquad (7)$$

If we choose $\phi(x,y) = (x-y)^2$, then (7) characterizes the mean square error (MSE) achieved by AMP at each iteration.
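The recursion (6) is easy to evaluate numerically by replacing the expectation with a Monte Carlo average. The sketch below reuses the soft-thresholding denoiser from the previous snippet and takes the prior as a user-supplied sampler `sample_X`; both are illustrative assumptions.

```python
import numpy as np

def amp_state_evolution(sample_X, theta, sigma_w2, delta, num_iter=30, mc=200_000, seed=0):
    """Evaluate the SE recursion (6) by Monte Carlo, returning the sequence of (tau^t)^2."""
    soft = lambda v, th: np.sign(v) * np.maximum(np.abs(v) - th, 0.0)  # same denoiser as above
    rng = np.random.default_rng(seed)
    X = sample_X(mc, rng)                         # i.i.d. samples from the prior p_X
    Z = rng.standard_normal(mc)
    sigma2 = np.mean(X ** 2) / delta              # (sigma^0)^2 = delta^{-1} E[X^2]
    tau2_seq = []
    for t in range(num_iter):
        tau2 = sigma_w2 + sigma2                  # (tau^t)^2
        tau2_seq.append(tau2)
        err = soft(X + np.sqrt(tau2) * Z, theta) - X
        sigma2 = np.mean(err ** 2) / delta        # (sigma^{t+1})^2, eq. (6)
    return tau2_seq

# Example (illustrative): Bernoulli-Gaussian prior with sparsity 0.1
# tau2 = amp_state_evolution(lambda m, rng: rng.standard_normal(m) * (rng.random(m) < 0.1),
#                            theta=1.0, sigma_w2=0.01, delta=0.3)
```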
2.2 Column-Wise Multiprocessor AMP
In our proposed column-wise multiprocessor AMP (C-MP-AMP) algorithm, the fusion center collects, according to a pre-defined communication schedule, vectors that represent each processor's estimate of the portion of the measurement vector $y$ contributed by its own data. The fusion center computes the sum of these vectors and transmits it to all processors. Each processor then performs standard AMP iterations with a new equivalent measurement vector, which is computed from the vector received from the fusion center. The pseudocode for C-MP-AMP is presented in Algorithm 1.
2.3 Performance Guarantee
SimilartoAMP,thedynamicsoftheC-MP-AMPalgorithm canbecharacterized byanSEformula.
Let (σ0,kˆ0)2 = δ 1E[X2], where δ = n/N , p = 1,...,P. For outer iterations 1 s sˆand inner
p p− p p ∀ ≤ ≤
2Recall the definition of PL(2) from [8]: a function f : Rm → R is said to be PL(2) if there is L > 0 such that
|f(x)−f(y)|≤L(1+kxk+kyk)kx−yk, ∀x,y∈Rm, where k·k denotes theEuclidean norm.
3
Algorithm 1 C-MP-AMP
Inputs to Processor $p$: $y$, $A_p$, $\{\hat{k}_s\}_{s=0,\dots,\hat{s}}$ (maximum number of inner iterations at each outer iteration).
Initialization: $x_p^{0,\hat{k}_0} = 0$, $z_p^{0,\hat{k}_0-1} = 0$, $r_p^{0,\hat{k}_0} = 0$, $\forall p$.
for $s = 1 : \hat{s}$ do (loop over outer iterations)
    At the fusion center: $g^s = \sum_{u=1}^{P} r_u^{s-1,\hat{k}_{s-1}}$
    At Processor $p$:
        $x_p^{s,0} = x_p^{s-1,\hat{k}_{s-1}}$, $\quad r_p^{s,0} = r_p^{s-1,\hat{k}_{s-1}}$
        for $k = 0 : \hat{k}_s - 1$ do (loop over inner iterations)
            $z_p^{s,k} = y - g^s - \big(r_p^{s,k} - r_p^{s,0}\big)$
            $x_p^{s,k+1} = \eta_{s,k}\big(x_p^{s,k} + A_p^* z_p^{s,k}\big)$
            $r_p^{s,k+1} = A_p x_p^{s,k+1} - \frac{z_p^{s,k}}{n}\sum_{i=1}^{N_p}\eta'_{s,k}\big([x_p^{s,k} + A_p^* z_p^{s,k}]_i\big)$
Output from Processor $p$: $x_p^{\hat{s},\hat{k}_{\hat{s}}}$.
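A minimal single-machine NumPy transcription of Algorithm 1 may help make the data flow concrete. Here the per-processor work is written as a sequential loop; in an actual deployment each processor $p$ would execute its own block and only $g^s$ and $r_p$ would be communicated. The soft-thresholding denoiser is again an illustrative assumption.

```python
import numpy as np

def soft_threshold(v, theta):
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def c_mp_amp(y, A_parts, theta, outer_iter=10, inner_iter=3):
    """Single-machine sketch of Algorithm 1. A_parts = [A_1, ..., A_P] are the column blocks;
    inner_iter plays the role of k_hat_s (kept constant over outer iterations here)."""
    n = y.shape[0]
    P = len(A_parts)
    x = [np.zeros(A.shape[1]) for A in A_parts]       # x_p^{0,k_hat_0} = 0
    r = [np.zeros(n) for _ in range(P)]               # r_p^{0,k_hat_0} = 0
    for s in range(outer_iter):
        g = sum(r)                                    # fusion center: g^s = sum_u r_u^{s-1,k_hat_{s-1}}
        for p in range(P):                            # each p would run on its own processor
            r_start = r[p].copy()                     # r_p^{s,0}
            for k in range(inner_iter):
                z = y - g - (r[p] - r_start)          # z_p^{s,k}
                pseudo = x[p] + A_parts[p].T @ z      # x_p^{s,k} + A_p^* z_p^{s,k}
                x[p] = soft_threshold(pseudo, theta)  # x_p^{s,k+1}
                onsager = (z / n) * np.sum(np.abs(pseudo) > theta)
                r[p] = A_parts[p] @ x[p] - onsager    # r_p^{s,k+1}
    return np.concatenate(x)                          # concatenated estimate of x
```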
Let $(\sigma_p^{0,\hat{k}_0})^2 = \delta_p^{-1}\mathbb{E}[X^2]$, where $\delta_p = n/N_p$, $\forall p = 1,\dots,P$. For outer iterations $1 \le s \le \hat{s}$ and inner iterations $0 \le k \le \hat{k}_s$, we define the sequences $\{(\sigma_p^{s,k})^2\}$ and $\{(\tau_p^{s,k})^2\}$ as

$$(\sigma_p^{s,0})^2 = (\sigma_p^{s-1,\hat{k}_{s-1}})^2, \qquad (8)$$
$$(\tau_p^{s,k})^2 = \sigma_W^2 + \sum_{u\ne p}(\sigma_u^{s,0})^2 + (\sigma_p^{s,k})^2, \qquad (9)$$
$$(\sigma_p^{s,k+1})^2 = \delta_p^{-1}\mathbb{E}\left[\big(\eta_{s,k}(X + \tau_p^{s,k} Z) - X\big)^2\right], \qquad (10)$$

where $Z$ is a standard normal random variable that is independent of $X$.
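The coupled recursion (8)–(10) can be evaluated in the same Monte Carlo fashion as (6). The sketch below, with an illustrative soft-thresholding denoiser and a constant number of inner iterations assumed, tracks one $(\sigma_p)^2$ per processor.

```python
import numpy as np

def c_mp_amp_state_evolution(sample_X, theta, sigma_w2, deltas, outer_iter=10,
                             inner_iter=3, mc=200_000, seed=0):
    """Evaluate (8)-(10) by Monte Carlo. deltas[p] = n / N_p; returns the per-processor
    (sigma_p)^2 after each outer iteration."""
    soft = lambda v, th: np.sign(v) * np.maximum(np.abs(v) - th, 0.0)  # illustrative denoiser
    rng = np.random.default_rng(seed)
    X = sample_X(mc, rng)
    Z = rng.standard_normal(mc)
    deltas = np.asarray(deltas, dtype=float)
    sigma2 = np.mean(X ** 2) / deltas                 # (sigma_p^{0,k_hat_0})^2
    history = [sigma2.copy()]
    for s in range(outer_iter):
        sigma2_start = sigma2.copy()                  # (sigma_p^{s,0})^2, eq. (8)
        for k in range(inner_iter):
            for p in range(len(deltas)):
                # eq. (9): the other processors' contribution is frozen at their k = 0 value
                tau2 = sigma_w2 + (sigma2_start.sum() - sigma2_start[p]) + sigma2[p]
                err = soft(X + np.sqrt(tau2) * Z, theta) - X
                sigma2[p] = np.mean(err ** 2) / deltas[p]   # eq. (10)
        history.append(sigma2.copy())
    return history
```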
With these definitions, we have the following performance guarantee for C-MP-AMP.

Theorem 1. Under the assumptions listed in [9, Section 1.1], let $P$ be a fixed integer and, for $p = 1,\dots,P$, let $n/N_p = \delta_p \in (0,\infty)$ be a constant. Define $N = \sum_{p=1}^{P} N_p$. Then for any PL(2) function $\phi: \mathbb{R}^2 \to \mathbb{R}$ and any $\epsilon \in (0,1)$, there exist constants $K_{s,k}, \kappa_{s,k} > 0$ independent of $n$ and $\epsilon$, such that

$$P\left(\left|\frac{1}{N_p}\sum_{i=1}^{N_p}\phi(x_{p,i}^{s,k+1}, x_{p,i}) - \mathbb{E}\big[\phi(\eta_{s,k}(X + \tau_p^{s,k} Z), X)\big]\right| \ge \epsilon\right) \le K_{s,k}\, e^{-\kappa_{s,k} n\epsilon^2}, \quad \forall p,$$

where $x_p^{s,k+1}$ is generated by Algorithm 1, $\tau_p^{s,k}$ is defined in (8)–(10), $X \sim p_X$, and $Z$ is a standard normal random variable that is independent of $X$.
Remark 1: C-MP-AMP converges to a fixed point that is no worse than that of AMP. This statement can be demonstrated as follows. When C-MP-AMP converges, the quantities in (8)–(10) stop changing, hence we can drop all the iteration indices for the fixed point analysis. At the fixed point, $(\sigma_p^{s,k})^2 = (\sigma_p^{s,0})^2$, so the right hand side (RHS) of (9) equals $\sigma_W^2 + \sum_{u=1}^{P}\sigma_u^2$, which is independent of $p$. That is, the $(\tau_p^{s,k})^2$ are equal for all $p$, hence we can further drop the processor index for $(\tau_p^{s,k})^2$. Denoting $(\tau_p^{s,k})^2$ by $\tau^2$ for all $s,k,p$ and plugging (10) into (9), we obtain

$$\tau^2 = \sigma_W^2 + \sum_{p=1}^{P}\delta_p^{-1}\mathbb{E}\big[(\eta(X + \tau Z) - X)^2\big] \overset{(a)}{=} \sigma_W^2 + \delta^{-1}\mathbb{E}\big[(\eta(X + \tau Z) - X)^2\big],$$

which is identical to the fixed point equation obtained from (6). In the above, step (a) holds because $\sum_{p=1}^{P}\delta_p^{-1} = \sum_{p=1}^{P}\frac{N_p}{n} = \frac{N}{n} = \delta^{-1}$. Because AMP always converges to the worst fixed point of the fixed point equation (6) [15], the average asymptotic performance of C-MP-AMP is identical to that of AMP when there is only one solution to the fixed point equation, and is at least as good as that of AMP in the case of multiple fixed points.
Remark 2: The asymptotic dynamics of C-MP-AMP can be identical to those of AMP under a specific communication schedule. This can be achieved by letting $\hat{k}_s = 1$, $\forall s$. In this case, the quantity $(\tau_p^{s,k})^2$ is involved only for $k = 0$, and by (9) with $k = 0$ the RHS equals $\sigma_W^2 + \sum_{u=1}^{P}(\sigma_u^{s,0})^2$, which is independent of $p$. Therefore, the $\tau_p^{s,0}$ are again equal for all $p$. Dropping the processor index for $(\tau_p^{s,0})^2$, the recursion (8)–(10) simplifies to

$$(\tau^{s,0})^2 = \sigma_W^2 + \sum_{p=1}^{P}\delta_p^{-1}\mathbb{E}\left[\big(\eta_{s-1,0}(X + \tau^{s-1,0} Z) - X\big)^2\right] = \sigma_W^2 + \delta^{-1}\mathbb{E}\left[\big(\eta_{s-1,0}(X + \tau^{s-1,0} Z) - X\big)^2\right],$$

where the iteration evolves over $s$, which is identical to (6) evolving over $t$.
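As a quick sanity check of Remark 2, both recursions can be run with a prior for which the expectations in (6) and (10) are available in closed form. Assuming a zero-mean Gaussian prior $X \sim \mathcal{N}(0,\sigma_x^2)$ and the conditional-mean denoiser $\eta(v) = \frac{\sigma_x^2}{\sigma_x^2+\tau^2} v$ (an illustrative choice, not the paper's setting), $\mathbb{E}[(\eta(X+\tau Z)-X)^2] = \frac{\sigma_x^2\tau^2}{\sigma_x^2+\tau^2}$, and the two $\tau$ sequences coincide:

```python
import numpy as np

def mmse(sigma_x2, tau2):
    # E[(eta(X + tau Z) - X)^2] for X ~ N(0, sigma_x2) and eta(v) = sigma_x2 / (sigma_x2 + tau2) * v
    return sigma_x2 * tau2 / (sigma_x2 + tau2)

sigma_x2, sigma_w2 = 1.0, 0.01
deltas = np.array([0.15, 0.15])            # two processors, delta_p = n / N_p
delta = 1.0 / np.sum(1.0 / deltas)         # overall delta = n / N

# AMP state evolution, eq. (6)
sigma2, amp_tau2 = sigma_x2 / delta, []
for t in range(20):
    tau2 = sigma_w2 + sigma2
    amp_tau2.append(tau2)
    sigma2 = mmse(sigma_x2, tau2) / delta

# C-MP-AMP state evolution, eqs. (8)-(10), with k_hat_s = 1 for every s
sig2, cmp_tau2 = sigma_x2 / deltas, []
for s in range(20):
    tau2 = sigma_w2 + sig2.sum()           # same value for every p when k = 0
    cmp_tau2.append(tau2)
    sig2 = mmse(sigma_x2, tau2) / deltas

print(np.allclose(amp_tau2, cmp_tau2))     # True: the two sequences coincide, as stated in Remark 2
```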
Remark 3: Theorem 1 implies almost sure convergence. Similar to the performance guarantee for AMP [9], the concentration result in Theorem 1 implies

$$\lim_{N\to\infty}\frac{1}{N_p}\sum_{i=1}^{N_p}\phi(x_{p,i}^{s,k+1}, x_{p,i}) \overset{a.s.}{=} \mathbb{E}\big[\phi(\eta_{s,k}(X + \tau_p^{s,k} Z), X)\big], \quad \forall p,$$

by the Borel-Cantelli Lemma.
3 Proof of Theorem 1

Our proof closely follows the proof for AMP in [9], with additional dependence structure to be addressed due to vectors being transmitted among processors.
3.1 Proof Notations
Without loss of generality, we assume the sequence $\{\hat{k}_s\}_{s\ge 0}$ in Algorithm 1 to be a constant value $\hat{k}$. Let $t = s\hat{k} + k$ and $\theta(t) = \lfloor t/\hat{k}\rfloor\hat{k}$. Given $w \in \mathbb{R}^n$ and $x_p \in \mathbb{R}^{N_p}$ for $p = 1,\dots,P$, define the column vectors $h_p^{t+1}, q_p^t \in \mathbb{R}^{N_p}$ and $b_p^t, m_p^t \in \mathbb{R}^n$ for $t \ge 0$ recursively as follows. Starting with initial condition $q_p^0 \in \mathbb{R}^{N_p}$:

$$h_p^{t+1} = A_p^* m_p^t - q_p^t, \qquad q_p^t = f_t(h_p^t, x_p),$$
$$b_p^t = A_p q_p^t - \lambda_p^t m_p^{t-1}, \qquad m_p^t = b_p^t + \sum_{u\ne p} b_u^{\theta(t)} - w, \qquad (11)$$

where

$$f_t(h_p^t, x_p) = \eta_{t-1}(x_p - h_p^t) - x_p, \qquad \text{and} \qquad \lambda_p^t := \frac{1}{\delta_p N_p}\sum_{i=1}^{N_p} f_t'(h_{p,i}^t, x_{p,i}). \qquad (12)$$
In (12), the derivative of $f_t: \mathbb{R}^2 \to \mathbb{R}$ is with respect to the first argument. We assume that $\eta_t$ is Lipschitz for all $t \ge 0$; it then follows that $f_t$ is Lipschitz for all $t \ge 0$. Consequently, the weak derivatives $\eta_t'$ and $f_t'$ exist. Further, $f_t'$ is assumed to be differentiable, except possibly at a finite number of points, with bounded derivative whenever it exists. In (11), quantities with negative indices or with index $\theta(t) = 0$ (i.e., $t < \hat{k}$) are defined to be zeros.
To see the equivalence between Algorithm 1 and the recursion defined in (11) and (12), we let $x_p^0 = 0$, $r_p^0 = 0$, $z_p^0 = 0$, and

$$h_p^{t+1} = x_p - (A_p^* z_p^t + x_p^t), \qquad q_p^t = x_p^t - x_p,$$
$$b_p^t = r_p^t - A_p x_p, \qquad m_p^t = -z_p^t.$$

Let $(\sigma_p^0)^2 = \delta_p^{-1}\mathbb{E}[X^2]$. We assume that $(\sigma_p^0)^2$ is strictly positive for all $p = 1,\dots,P$ and that for all $\epsilon \in (0,1)$, there exist $K, \kappa > 0$ such that

$$P\left(\left|\frac{\|q_p^0\|^2}{n} - (\sigma_p^0)^2\right| \ge \epsilon\right) \le K e^{-\kappa n\epsilon^2}, \quad \forall p = 1,\dots,P. \qquad (13)$$
Define the state evolution scalars $\{(\tau_p^t)^2\}_{t\ge 0}$ and $\{(\sigma_p^t)^2\}_{t\ge 1}$ for the recursion defined in (11) as follows:

$$(\tau_p^t)^2 = (\sigma_p^t)^2 + \sum_{u\ne p}(\sigma_u^{\theta(t)})^2 + \sigma_W^2, \qquad (\sigma_p^t)^2 = \frac{1}{\delta_p}\mathbb{E}\left[\big(f_t(\tau_p^{t-1} Z, X)\big)^2\right], \qquad (14)$$

where $Z \sim \mathcal{N}(0,1)$ and $X \sim p_X$ are independent. Notice that, with the equivalence between Algorithm 1 and the recursion (11), the state evolution scalars defined in (14) match (8)–(10).
Writing the updating equations for $b_p^t, h_p^{t+1}$ defined in (11) in matrix form, we have

$$X_p^t = A_p^* M_p^t, \qquad Y_p^t = A_p Q_p^t, \qquad (15)$$

where

$$X_p^t = [h_p^1 + q_p^0 \,|\, h_p^2 + q_p^1 \,|\,\cdots\,|\, h_p^t + q_p^{t-1}], \qquad Y_p^t = [b_p^0 \,|\, b_p^1 + \lambda_p^1 m_p^0 \,|\,\cdots\,|\, b_p^{t-1} + \lambda_p^{t-1} m_p^{t-2}],$$
$$M_p^t = [m_p^0 \,|\, m_p^1 \,|\,\cdots\,|\, m_p^{t-1}], \qquad Q_p^t = [q_p^0 \,|\, q_p^1 \,|\,\cdots\,|\, q_p^{t-1}].$$
Let $(m_p^t)_\parallel$ and $(q_p^t)_\parallel$ denote the projections of $m_p^t$ and $q_p^t$ onto the column spaces of $M_p^t$ and $Q_p^t$, respectively. That is,

$$(m_p^t)_\parallel = M_p^t\big((M_p^t)^* M_p^t\big)^{-1}(M_p^t)^* m_p^t, \qquad (q_p^t)_\parallel = Q_p^t\big((Q_p^t)^* Q_p^t\big)^{-1}(Q_p^t)^* q_p^t.$$

Let

$$\alpha_p^t = (\alpha_{p,0}^t, \alpha_{p,1}^t, \dots, \alpha_{p,t-1}^t)^*, \qquad \gamma_p^t = (\gamma_{p,0}^t, \gamma_{p,1}^t, \dots, \gamma_{p,t-1}^t)^* \qquad (16)$$

be the coefficient vectors of these projections. That is,

$$\alpha_p^t = \big((M_p^t)^* M_p^t\big)^{-1}(M_p^t)^* m_p^t, \qquad \gamma_p^t = \big((Q_p^t)^* Q_p^t\big)^{-1}(Q_p^t)^* q_p^t, \qquad (17)$$

and

$$(m_p^t)_\parallel = \sum_{i=0}^{t-1}\alpha_{p,i}^t m_p^i, \qquad (q_p^t)_\parallel = \sum_{i=0}^{t-1}\gamma_{p,i}^t q_p^i. \qquad (18)$$

Define

$$(m_p^t)_\perp = m_p^t - (m_p^t)_\parallel, \qquad (q_p^t)_\perp = q_p^t - (q_p^t)_\parallel. \qquad (19)$$

The main lemma will show that $\alpha_p^t$ and $\gamma_p^t$ concentrate around some constants $\hat{\alpha}_p^t$ and $\hat{\gamma}_p^t$, respectively. We define these constants in the following subsection.
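In finite dimensions, the quantities in (17)–(19) are ordinary least-squares objects; a small sketch (a hypothetical helper for illustration, not used in the proof) is:

```python
import numpy as np

def projection_split(M, m):
    """Split m into its projection onto the column space of M and the orthogonal part,
    returning the coefficient vector as in (17)-(19)."""
    coef, *_ = np.linalg.lstsq(M, m, rcond=None)   # (M^* M)^{-1} M^* m
    m_parallel = M @ coef                          # (m)_parallel
    m_perp = m - m_parallel                        # (m)_perp
    return coef, m_parallel, m_perp
```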
3.2 Concentrating Constants
Let $\{\tilde{Z}_p^t\}_{t\ge 0}$ and $\{\breve{Z}_p^t\}_{t\ge 0}$ each be a sequence of zero-mean jointly Gaussian random variables whose covariance is defined recursively as follows. For $t, r \ge 0$,

$$\mathbb{E}[\breve{Z}_p^r \breve{Z}_p^t] = \frac{\tilde{E}_p^{r,t}}{\sigma_p^r \sigma_p^t}, \qquad \mathbb{E}[\tilde{Z}_p^r \tilde{Z}_p^t] = \frac{\breve{E}_p^{r,t}}{\tau_p^r \tau_p^t}, \qquad (20)$$

where

$$\breve{E}_p^{r,t} = \tilde{E}_p^{r,t} + \sum_{u\ne p}\tilde{E}_u^{\theta(r),\theta(t)} + \sigma_W^2, \qquad \tilde{E}_p^{r,t} = \delta_p^{-1}\mathbb{E}\left[f_r(\tau_p^{r-1}\tilde{Z}_p^{r-1}, X)\, f_t(\tau_p^{t-1}\tilde{Z}_p^{t-1}, X)\right]. \qquad (21)$$

Moreover, $\tilde{Z}_p^r$ is independent of $\tilde{Z}_q^t$ and $\breve{Z}_p^r$ is independent of $\breve{Z}_q^t$ for all $r, t \ge 0$ whenever $p \ne q$. Note that, according to the definitions of $\sigma_p^t$ and $\tau_p^t$ in (14), we have $\breve{E}_p^{t,t} = (\tau_p^t)^2$, $\tilde{E}_p^{t,t} = (\sigma_p^t)^2$, and $\mathbb{E}[(\tilde{Z}_p^t)^2] = \mathbb{E}[(\breve{Z}_p^t)^2] = 1$. In (21), quantities with negative indices or with either $\theta(t) = 0$ or $\theta(r) = 0$ are zeros.
Define matrices $\tilde{C}_p^t, \breve{C}_p^t \in \mathbb{R}^{t\times t}$, $p = 1,\dots,P$, such that

$$[\tilde{C}_p^t]_{r+1,s+1} = \tilde{E}_p^{r,s}, \qquad [\breve{C}_p^t]_{r+1,s+1} = \breve{E}_p^{r,s}, \qquad \forall r, s = 0,\dots,t-1.$$

Define vectors $\tilde{E}_p^t, \breve{E}_p^t \in \mathbb{R}^t$, $p = 1,\dots,P$, such that

$$\tilde{E}_p^t = (\tilde{E}_p^{0,t}, \tilde{E}_p^{1,t}, \dots, \tilde{E}_p^{t-1,t}), \qquad \breve{E}_p^t = (\breve{E}_p^{0,t}, \breve{E}_p^{1,t}, \dots, \breve{E}_p^{t-1,t}).$$

Define the concentrating values $\hat{\alpha}_p^t$ and $\hat{\gamma}_p^t$ as

$$\hat{\gamma}_p^t = (\tilde{C}_p^t)^{-1}\tilde{E}_p^t, \qquad \hat{\alpha}_p^t = (\breve{C}_p^t)^{-1}\breve{E}_p^t. \qquad (22)$$
Let $(\sigma_p^0)_\perp^2 = (\sigma_p^0)^2$ and $(\tau_p^0)_\perp^2 = (\tau_p^0)^2$, and for $t > 0$, define

$$(\sigma_p^t)_\perp^2 = (\sigma_p^t)^2 - (\hat{\gamma}_p^t)^*\tilde{E}_p^t = (\sigma_p^t)^2 - (\tilde{E}_p^t)^*(\tilde{C}_p^t)^{-1}\tilde{E}_p^t,$$
$$(\tau_p^t)_\perp^2 = (\tau_p^t)^2 - (\hat{\alpha}_p^t)^*\breve{E}_p^t = (\tau_p^t)^2 - (\breve{E}_p^t)^*(\breve{C}_p^t)^{-1}\breve{E}_p^t. \qquad (23)$$
Lemma 1. The matrices $\tilde{C}_p^t$ and $\breve{C}_p^t$, $\forall t \ge 0$, defined above are invertible, and the scalars $(\sigma_p^t)_\perp^2$ and $(\tau_p^t)_\perp^2$, $\forall t \ge 0$, defined above are strictly positive.

Proof. The proof that $\tilde{C}_p^t$ is invertible and $(\sigma_p^t)_\perp^2$ is strictly positive is the same as in [9]. Now consider $\breve{C}_p^{t+1}$. Notice that $\breve{C}_p^{t+1}$ is the sum of a positive definite matrix ($\tilde{C}_p^{t+1}$) and $P$ positive semi-definite matrices; hence, $\breve{C}_p^{t+1}$ is positive definite. Consequently,

$$\det(\breve{C}_p^{t+1}) = \det(\breve{C}_p^t)\det\big((\tau_p^t)^2 - (\breve{E}_p^t)^*(\breve{C}_p^t)^{-1}\breve{E}_p^t\big) > 0, \qquad (24)$$

which implies $(\tau_p^t)^2 - (\breve{E}_p^t)^*(\breve{C}_p^t)^{-1}\breve{E}_p^t = (\tau_p^t)_\perp^2 > 0$.
3.3 Conditional Distribution Lemma

Let the sigma algebra $S^{t_1,t}$ be generated by $x, w, b_p^0, \dots, b_p^{t_1-1}, m_p^0, \dots, m_p^{t_1-1}, h_p^1, \dots, h_p^t, q_p^0, \dots, q_p^t$, $\forall p$. We now compute the conditional distribution of $A_p$ given $S^{t_1,t}$ for $1 \le p \le P$, where $t_1$ is either $t$ or $t+1$.

Notice that conditioning on $S^{t_1,t}$ is equivalent to conditioning on the linear constraints

$$A_p Q_p^{t_1} = Y_p^{t_1}, \qquad A_p^* M_p^t = X_p^t, \qquad 1 \le p \le P, \qquad (25)$$

where in (25) only $A_p$, $1 \le p \le P$, are treated as random.

Let $P_{\parallel Q_p^{t_1}} = Q_p^{t_1}\big((Q_p^{t_1})^* Q_p^{t_1}\big)^{-1}(Q_p^{t_1})^*$ and $P_{\parallel M_p^t} = M_p^t\big((M_p^t)^* M_p^t\big)^{-1}(M_p^t)^*$, which are the projectors onto the column spaces of $Q_p^{t_1}$ and $M_p^t$, respectively. The following lemma provides the conditional distribution of the matrices $A_p$, $p = 1,\dots,P$, given $S^{t_1,t}$.
Lemma 2. For $t_1 = t$ or $t+1$, the conditional distribution of the random matrices $A_p$, $p = 1,\dots,P$, given $S^{t_1,t}$ satisfies

$$(A_1, \dots, A_P)\big|_{S^{t_1,t}} \overset{d}{=} \big(E_1^{t_1,t} + P_{\perp M_1^t}\tilde{A}_1 P_{\perp Q_1^{t_1}},\; \dots,\; E_P^{t_1,t} + P_{\perp M_P^t}\tilde{A}_P P_{\perp Q_P^{t_1}}\big),$$

where $P_{\perp Q_p^{t_1}} = I - P_{\parallel Q_p^{t_1}}$ and $P_{\perp M_p^t} = I - P_{\parallel M_p^t}$, $\tilde{A}_p \overset{d}{=} A_p$, and $\tilde{A}_p$ is independent of $S^{t_1,t}$. Moreover, $\tilde{A}_p$ is independent of $\tilde{A}_q$ for $p \ne q$. The matrix $E_p^{t_1,t}$ is defined as

$$E_p^{t_1,t} = Y_p^{t_1}\big((Q_p^{t_1})^* Q_p^{t_1}\big)^{-1}(Q_p^{t_1})^* + M_p^t\big((M_p^t)^* M_p^t\big)^{-1}(X_p^t)^* - M_p^t\big((M_p^t)^* M_p^t\big)^{-1}(M_p^t)^*\, Y_p^{t_1}\big((Q_p^{t_1})^* Q_p^{t_1}\big)^{-1}(Q_p^{t_1})^*.$$
Proof. To simplify the notation, we drop the superscripts $t_1$ and $t$ in the following proof. It should be understood that $Q_p$ represents $Q_p^{t_1}$, $Y_p$ represents $Y_p^{t_1}$, $M_p$ represents $M_p^t$, and $X_p$ represents $X_p^t$.

First let us consider projections of a deterministic matrix. Let $\hat{A}_p$ be a deterministic matrix that satisfies the linear constraints $Y_p = \hat{A}_p Q_p$ and $X_p = \hat{A}_p^* M_p$. Then we have

$$\hat{A}_p = \hat{A}_p Q_p(Q_p^* Q_p)^{-1}Q_p^* + \hat{A}_p\big(I - Q_p(Q_p^* Q_p)^{-1}Q_p^*\big),$$
$$\hat{A}_p = M_p(M_p^* M_p)^{-1}M_p^*\hat{A}_p + \big(I - M_p(M_p^* M_p)^{-1}M_p^*\big)\hat{A}_p.$$

Combining the two equations above, as well as the two linear constraints, we can write

$$\hat{A}_p = Y_p(Q_p^* Q_p)^{-1}Q_p^* + M_p(M_p^* M_p)^{-1}X_p^* - M_p(M_p^* M_p)^{-1}M_p^* Y_p(Q_p^* Q_p)^{-1}Q_p^* + P_{\perp M_p}\hat{A}_p P_{\perp Q_p}. \qquad (26)$$
We now derive the conditional distribution of $A_1,\dots,A_P$. Let $S_1,\dots,S_P$ be arbitrary Borel sets on $\mathbb{R}^{n\times N_1},\dots,\mathbb{R}^{n\times N_P}$, respectively. Then

$$P\big(A_1 \in S_1, \dots, A_P \in S_P \,\big|\, A_p Q_p = Y_p, A_p^* M_p = X_p, \forall p\big)$$
$$\overset{(a)}{=} P\big(E_1^{t_1,t} + P_{\perp M_1}A_1 P_{\perp Q_1} \in S_1, \dots, E_P^{t_1,t} + P_{\perp M_P}A_P P_{\perp Q_P} \in S_P \,\big|\, A_p Q_p = Y_p, A_p^* M_p = X_p, \forall p\big)$$
$$\overset{(b)}{=} P\big(E_1^{t_1,t} + P_{\perp M_1}A_1 P_{\perp Q_1} \in S_1, \dots, E_P^{t_1,t} + P_{\perp M_P}A_P P_{\perp Q_P} \in S_P\big)$$
$$= P\big(E_1^{t_1,t} + P_{\perp M_1}A_1 P_{\perp Q_1} \in S_1\big)\cdots P\big(E_P^{t_1,t} + P_{\perp M_P}A_P P_{\perp Q_P} \in S_P\big), \qquad (27)$$

which implies the desired result. In step (a),

$$E_p^{t_1,t} = Y_p(Q_p^* Q_p)^{-1}Q_p^* + M_p(M_p^* M_p)^{-1}X_p^* - M_p(M_p^* M_p)^{-1}M_p^* Y_p(Q_p^* Q_p)^{-1}Q_p^*, \qquad p = 1,\dots,P,$$

which follows from (26). Step (b) holds since $P_{\perp M_p}A_p P_{\perp Q_p}$ is independent of the conditioning. The independence is demonstrated as follows. Notice that $A_p Q_p = A_p P_{\parallel Q_p} Q_p$. In what follows, we show that $A_p^{\parallel} := A_p P_{\parallel Q_p}$ is independent of $A_r^{\perp} := A_r P_{\perp Q_r}$, for $p, r = 1,\dots,P$. A similar approach can then be used to demonstrate that $P_{\perp M_p}A_p$ is independent of $P_{\parallel M_r}A_r$. Together they provide the justification for step (b). Note that $A_p^{\parallel}$ and $A_r^{\perp}$ are jointly normal; hence it is enough to show that they are uncorrelated. We compute

$$\mathbb{E}\big\{[A_p^{\parallel}]_{i,j}[A_r^{\perp}]_{m,l}\big\} = \mathbb{E}\Big\{\Big(\sum_{k=1}^{N_p}[A_p]_{i,k}[P_{\parallel Q_p}]_{k,j}\Big)\Big(\sum_{k=1}^{N_r}[A_r]_{m,k}\big(I_{k,l} - [P_{\parallel Q_r}]_{k,l}\big)\Big)\Big\}$$
$$\overset{(a)}{=} \frac{1}{n}\,\delta_0(i,m)\,\delta_0(p,r)\Big(\sum_{k=1}^{N_p}[P_{\parallel Q_p}]_{k,j} I_{k,l} - \sum_{k=1}^{N_p}[P_{\parallel Q_p}]_{k,j}[P_{\parallel Q_r}]_{k,l}\Big)$$
$$\overset{(b)}{=} \frac{1}{n}\,\delta_0(i,m)\,\delta_0(p,r)\Big([P_{\parallel Q_p}]_{l,j} - \sum_{k=1}^{N_p}[P_{\parallel Q_p}]_{k,j}[P_{\parallel Q_r}]_{l,k}\Big) \overset{(c)}{=} 0,$$

where $\delta_0(i,j)$ is the Kronecker delta function. In the above, step (a) holds since the original matrix $A$ has i.i.d. $\mathcal{N}(0, 1/n)$ entries, step (b) holds since projectors are symmetric matrices, and step (c) follows from the projector property $P^2 = P$.
Combining the results in Lemma 2 and [9, Lemma 4], we have the following conditional distri-
bution lemma.
Lemma 3. For the vectors $h_p^{t+1}$ and $b_p^t$ defined in (11), the following holds for $t \ge 1$, $p = 1,\dots,P$:

$$b_p^0\big|_{S^{0,0}} \overset{d}{=} (\sigma_p^0)_\perp Z'^{0}_p + \Delta_p^{0,0}, \qquad h_p^1\big|_{S^{1,0}} \overset{d}{=} (\tau_p^0)_\perp Z_p^0 + \Delta_p^{1,0}, \qquad (28)$$

$$b_p^t\big|_{S^{t,t}} \overset{d}{=} \sum_{i=0}^{t-1}\hat{\gamma}_{p,i}^t b_p^i + (\sigma_p^t)_\perp Z'^{t}_p + \Delta_p^{t,t}, \qquad h_p^{t+1}\big|_{S^{t+1,t}} \overset{d}{=} \sum_{i=0}^{t-1}\hat{\alpha}_{p,i}^t h_p^{i+1} + (\tau_p^t)_\perp Z_p^t + \Delta_p^{t+1,t}, \qquad (29)$$

where

$$\Delta_p^{0,0} = \left(\frac{\|(q_p^0)_\perp\|}{\sqrt{n}} - (\sigma_p^0)_\perp\right) Z'^{0}_p, \qquad (30)$$

$$\Delta_p^{1,0} = \left[\left(\frac{\|(m_p^0)_\perp\|}{\sqrt{n}} - (\tau_p^0)_\perp\right) I - \frac{\|(m_p^0)_\perp\|}{\sqrt{n}}\, P_{\parallel q_p^0}\right] Z_p^0 + q_p^0\left(\frac{\|q_p^0\|^2}{n}\right)^{-1}\left(\frac{(b_p^0)^*(m_p^0)_\perp}{n} - \frac{\|q_p^0\|^2}{n}\right), \qquad (31)$$

$$\Delta_p^{t,t} = \sum_{i=0}^{t-1}\big(\gamma_{p,i}^t - \hat{\gamma}_{p,i}^t\big) b_p^i + \left[\left(\frac{\|(q_p^t)_\perp\|}{\sqrt{n}} - (\sigma_p^t)_\perp\right) I - \frac{\|(q_p^t)_\perp\|}{\sqrt{n}}\, P_{\parallel M_p^t}\right] Z'^{t}_p + M_p^t\left(\frac{(M_p^t)^* M_p^t}{n}\right)^{-1}\left[\frac{(H_p^t)^*(q_p^t)_\perp}{n} - \frac{(M_p^t)^*}{n}\left(\lambda_p^t m_p^{t-1} - \sum_{i=1}^{t-1}\lambda_p^i\gamma_{p,i}^t m_p^{i-1}\right)\right], \qquad (32)$$

$$\Delta_p^{t+1,t} = \sum_{i=0}^{t-1}\big(\alpha_{p,i}^t - \hat{\alpha}_{p,i}^t\big) h_p^{i+1} + \left[\left(\frac{\|(m_p^t)_\perp\|}{\sqrt{n}} - (\tau_p^t)_\perp\right) I - \frac{\|(m_p^t)_\perp\|}{\sqrt{n}}\, P_{\parallel Q_p^{t+1}}\right] Z_p^t + Q_p^{t+1}\left(\frac{(Q_p^{t+1})^* Q_p^{t+1}}{n}\right)^{-1}\left[\frac{(B_p^{t+1})^*(m_p^t)_\perp}{n} - \frac{(Q_p^{t+1})^*}{n}\left(q_p^t - \sum_{i=0}^{t-1}\alpha_{p,i}^t q_p^i\right)\right], \qquad (33)$$

where $Z'^{t}_p \in \mathbb{R}^n$ and $Z_p^t \in \mathbb{R}^{N_p}$ are random vectors with independent standard normal elements, and are independent of the corresponding sigma algebras. Moreover, $Z'^{t}_p$ is independent of $Z'^{t}_q$ and $Z_p^t$ is independent of $Z_q^t$ when $p \ne q$.

Proof. The proof for each individual $p \in [P]$ is similar to the proof of [9, Lemma 4]. The claim that $Z'^{t}_p$ is independent of $Z'^{t}_q$ and $Z_p^t$ is independent of $Z_q^t$ when $p \ne q$ follows from Lemma 2, where we have that $\tilde{A}_p$ is independent of $\tilde{A}_q$ for $p \ne q$.
3.4 Main Concentration Lemma
We use the shorthand $X_n \doteq c$ to denote the concentration inequality $P(|X_n - c| \ge \epsilon) \le K_t e^{-\kappa_t n\epsilon}$. As specified in the theorem statement, the lemma holds for all $\epsilon \in (0,1)$, with $K_t, \kappa_t$ denoting generic constants depending on $t$, but not on $n$ or $\epsilon$.

Lemma 4. With the $\doteq$ notation defined above, the following holds for all $t \ge 0$, $p = 1,\dots,P$.

(a)

$$P\left(\frac{\|\Delta_p^{t,t}\|^2}{n} \ge \epsilon\right) \le K_t e^{-\kappa_t n\epsilon}, \qquad (34)$$
$$P\left(\frac{\|\Delta_p^{t+1,t}\|^2}{n} \ge \epsilon\right) \le K_t e^{-\kappa_t n\epsilon}. \qquad (35)$$
(b) (i) For pseudo-Lipschitz functions $\phi_h: \mathbb{R}^{t+2} \to \mathbb{R}$,

$$\frac{1}{N_p}\sum_{i=1}^{N_p}\phi_h(h_{p,i}^1, \dots, h_{p,i}^{t+1}, x_{p,i}) \doteq \mathbb{E}\big[\phi_h(\tau_p^0\tilde{Z}_p^0, \dots, \tau_p^t\tilde{Z}_p^t, X)\big]. \qquad (36)$$

(ii) Let $\psi_h: \mathbb{R}^2 \to \mathbb{R}$ be a bounded function that is differentiable in the first argument except possibly at a finite number of points, with bounded derivative when it exists. Then,

$$\frac{1}{N_p}\sum_{i=1}^{N_p}\psi_h(h_{p,i}^{t+1}, x_{p,i}) \doteq \mathbb{E}\big[\psi_h(\tau_p^t\tilde{Z}_p^t, X)\big], \qquad (37)$$

where $\{\tilde{Z}_p^t\}$ is defined in (20), and $X \sim p_X$ is independent of $\{\tilde{Z}_p^t\}$.

(iii) For pseudo-Lipschitz functions $\phi_b: \mathbb{R}^{P(t+1)+1} \to \mathbb{R}$,

$$\frac{1}{n}\sum_{i=1}^{n}\phi_b(b_{1,i}^0, \dots, b_{P,i}^0, \dots, b_{1,i}^t, \dots, b_{P,i}^t, w_i) \doteq \mathbb{E}\big[\phi_b(\sigma_1^0\breve{Z}_1^0, \dots, \sigma_P^0\breve{Z}_P^0, \dots, \sigma_1^t\breve{Z}_1^t, \dots, \sigma_P^t\breve{Z}_P^t, W)\big], \qquad (38)$$

where $\{\breve{Z}_p^t\}$ is defined in (20), and $W \sim p_W$ is independent of $\{\breve{Z}_p^t\}$.

(c)

$$\frac{(h_p^{t+1})^* q_p^0}{n} \doteq 0, \qquad \frac{(h_p^{t+1})^* x_p}{n} \doteq 0, \qquad (39)$$
$$\frac{(b_p^t)^* w}{n} \doteq 0. \qquad (40)$$