Poisson Matrix Completion

Yang Cao and Yao Xie
H. Milton Stewart School of Industrial and Systems Engineering
Georgia Institute of Technology
[email protected], [email protected]

Abstract—We extend the theory of matrix completion to the case where we make Poisson observations for a subset of entries of a low-rank matrix. We consider the (now) usual matrix recovery formulation through maximum likelihood with proper constraints on the matrix $M$, and establish theoretical upper and lower bounds on the recovery error. Our bounds are nearly optimal up to a factor on the order of $O(\log(d_1 d_2))$. These bounds are obtained by adapting the arguments used for one-bit matrix completion [1] (although these two problems are different in nature), and the adaptation requires new techniques exploiting properties of the Poisson likelihood function and tackling the difficulties posed by the locally sub-Gaussian characteristic of the Poisson distribution. Our results highlight a few important distinctions of Poisson matrix completion compared to prior work on matrix completion, including having to impose a minimum signal-to-noise requirement on each observed entry. We also develop an efficient iterative algorithm and demonstrate its good performance in recovering solar flare images.

Keywords—matrix completion, Poisson noise, high-dimensional statistics, information theory

I. INTRODUCTION

Matrix completion, with the goal of recovering a low-rank matrix $M \in \mathbb{R}^{d_1 \times d_2}$ from observations of a subset of its entries, has attracted much interest recently due to its important real-world applications, including the famous Netflix problem [2]. Poisson matrix completion, where the observations are Poisson counts of a subset of the entries, is an important instance in its own right, as it arises in a myriad of applications including optical imaging, nuclear medicine, low-dose x-ray imaging [3], and network traffic analysis [4].

Recently, much success has been achieved in solving the matrix completion problem using nuclear norm minimization, partly inspired by the theory of compressed sensing [5]. It has been shown that when $M$ is low rank, it can be recovered from only a few observations of its entries (see, e.g., [6]–[14]). Earlier work on matrix completion typically assumes that the observations are noiseless, i.e., we may directly observe a subset of the entries of $M$. In the real world, however, the observations are noisy, which is the focus of the subsequent work [15]–[20], most of which considers a scenario where $M$ is the sum of a low-rank matrix and a Gaussian random matrix, i.e., the observations are a subset of entries of $M$ contaminated with Gaussian noise. Recently there has also been work that considers more general noise models, including noisy one-bit observations [1], which may be viewed as a case where the observations are Bernoulli random variables whose parameters depend on an underlying low-rank matrix. The method of [21] is developed for Poisson matrix completion, but it does not establish a lower bound. Another related work [22] (not in the matrix completion setting) considers the case where all entries of the low-rank matrix are observed and the observations are Poisson counts of the entries of the underlying matrix. In the compressed sensing literature, there is a line of research on sparse signal recovery in the presence of Poisson noise [23]–[25] and the corresponding performance bounds. The recently developed SCOPT algorithm [26], [27] can also be used to solve Poisson compressed sensing problems.

In this paper, we extend the theory of matrix completion to the case of Poisson observations. We study recovery based on maximum likelihood with proper constraints on a matrix $M$ with rank less than or equal to $r$ (a nuclear norm bound $\|M\|_* \le \alpha\sqrt{r d_1 d_2}$ for some constant $\alpha$, and bounded entries $\beta \le M_{ij} \le \alpha$). Note that the formulation differs from the one-bit matrix completion case in that we also require a lower bound on each entry of the matrix. This is consistent with the intuition that the value of each entry can be viewed as the signal-to-noise ratio (SNR) of a Poisson observation, and hence this essentially poses a requirement on the minimum SNR. We also establish upper and lower bounds on the recovery error, by adapting the arguments used for one-bit matrix completion [1]. The upper and lower bounds nearly match up to a factor on the order of $O(\log(d_1 d_2))$, which shows that the convex relaxation formulation for Poisson matrix completion is nearly optimal. (We conjecture that such a gap is inherent to the Poisson problem.)
Moreover, we also highlight a few important distinctions of Poisson matrix completion compared to the prior work on matrix completion in the absence of noise and with Gaussian noise: (1) Although our arguments are adapted from one-bit matrix completion (where the upper and lower bounds nearly match), in the Poisson case there will be a gap between the upper and lower bounds, possibly due to the fact that the Poisson distribution is only locally sub-Gaussian. In our proof, we notice that arguments based on bounding all moments of the observations, which usually generate tight bounds for prior results with sub-Gaussian observations, do not generate tight bounds here. (2) We will need a lower bound on each matrix entry in the maximum likelihood formulation, which can be viewed as a requirement on the lowest signal-to-noise ratio (since the SNR of a Poisson observation with intensity $I$ is $\sqrt{I}$). Compared with the more general framework of $M$-estimators [28], our results are specific to the Poisson case, which may possibly be stronger but do not apply generally. We also develop several simple yet efficient algorithms, including proximal and accelerated proximal gradient descent algorithms, and an algorithm based on singular value thresholding that we examine in detail. This algorithm can be viewed as a consequence of approximating the log-likelihood function by its second-order Taylor expansion and invoking a theorem for the exact solution of a nuclear norm regularized problem [12]. Our algorithm is related to [29]–[31] and can be viewed as a special case in which a simple closed-form solution for the algorithm exists. We further demonstrate the good performance of the algorithm in recovering solar flare images.

Our formulation and results are inspired by the seminal work on one-bit matrix completion [1], yet with several important distinctions. In one-bit matrix completion, the value of each observation $Y_{ij}$ is binary-valued and hence bounded, whereas in our problem each observation is a Poisson random variable, which is unbounded; hence, the arguments involving bounding measurements have to be changed. In particular, we need to bound $\max_{ij} Y_{ij}$ when $Y_{ij}$ is a Poisson random variable with intensity $M_{ij}$. Moreover, the Poisson likelihood function is non-Lipschitz (due to a bad point when $M_{ij}$ tends to zero), and hence we need to introduce a lower bound on each entry $M_{ij}$ of the matrix, which can be interpreted as the lowest required SNR. Other distinctions include an analysis taking into account the properties of the Poisson likelihood function, and the use of the Kullback-Leibler (KL) divergence and the Hellinger distance, which differ from those for the Bernoulli random variables used in [1].

While working on this paper we learned of a parallel work [32] which also studies performance bounds for low-rank matrix completion with exponential family noise, under more general assumptions and using a different approach for the proof (Poisson noise is a special case of theirs). Their upper bound for the MSE per entry is on the order of $O(\log(d_1+d_2)\, r\max\{d_1,d_2\}/m)$ (our upper bound is $O(\log(d_1 d_2)\sqrt{r(d_1+d_2)/m})$), and their lower bound is on the order of $O(r\max\{d_1,d_2\}/m)$ (versus our lower bound of $O(\sqrt{r(d_1+d_2)/m})$).

The rest of the paper is organized as follows. Section II sets up the formalism for Poisson matrix completion. Section III presents the matrix recovery based on constrained maximum likelihood and establishes the upper and lower bounds for the recovery accuracy. Section IV presents efficient iterative algorithms that solve the maximum likelihood problem approximately and demonstrates their performance in recovering solar flare images. All proofs are delegated to the Appendix.

The notation in this paper is standard. In particular, $\mathbb{R}_+$ denotes the set of positive real numbers; $[d] = \{1,2,\ldots,d\}$; $\mathbb{I}_{[\varepsilon]}$ is the indicator function of an event $\varepsilon$; $|A|$ denotes the number of elements in a set $A$; $\mathrm{diag}\{\lambda_i\}$ denotes a diagonal matrix with the set of numbers $\{\lambda_i\}$ on its diagonal; $\mathbf{1}_{n\times m}$ denotes an $n$-by-$m$ matrix of all ones. Let the entries of a matrix $M$ be denoted by $M_{ij}$. Let $\|M\|$ be the spectral norm (the largest singular value), $\|M\|_F = \sqrt{\sum_{i,j} M_{ij}^2}$ the Frobenius norm, $\|M\|_*$ the nuclear norm (the sum of the singular values), and finally $\|M\|_\infty = \max_{ij} |M_{ij}|$ the infinity norm. Let $\mathrm{rank}(M)$ denote the rank of a matrix $M$. We say that a random variable $X$ follows the Poisson distribution with parameter $\lambda$ (written $X \sim \mathrm{Poisson}(\lambda)$) if its probability mass function is $\mathbb{P}(X=k) = e^{-\lambda}\lambda^k/k!$. We also define the KL divergence and the Hellinger distance for Poisson distributions as follows: the KL divergence of two Poisson distributions with parameters $p, q \in \mathbb{R}_+$ is given by
$$D(p\|q) \triangleq p\log(p/q) - (p - q);$$
the Hellinger distance between two Poisson distributions with parameters $p, q \in \mathbb{R}_+$ is given by
$$d_H^2(p,q) \triangleq 2 - 2\exp\left(-\tfrac{1}{2}\big(\sqrt{p}-\sqrt{q}\big)^2\right).$$
We further define the average KL divergence and Hellinger distance for the entries of two matrices $P, Q \in \mathbb{R}_+^{d_1\times d_2}$, where each entry corresponds to the parameter of a Poisson random variable:
$$D(P\|Q) = \frac{1}{d_1 d_2}\sum_{i,j} D(P_{ij}\|Q_{ij}), \qquad d_H^2(P,Q) = \frac{1}{d_1 d_2}\sum_{i,j} d_H^2(P_{ij}, Q_{ij}).$$
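To make the two matrix divergences above concrete, here is a minimal NumPy sketch (our illustration, not part of the original paper; the function names `avg_kl` and `avg_hellinger_sq` are ours) that evaluates the average KL divergence and the average squared Hellinger distance between two matrices of Poisson intensities.

```python
import numpy as np

def avg_kl(P, Q):
    # Average KL divergence between entrywise Poisson(P_ij) and Poisson(Q_ij):
    # D(P||Q) = (1/(d1*d2)) * sum_ij [ P_ij*log(P_ij/Q_ij) - (P_ij - Q_ij) ]
    return np.mean(P * np.log(P / Q) - (P - Q))

def avg_hellinger_sq(P, Q):
    # Average squared Hellinger distance:
    # d_H^2(P,Q) = (1/(d1*d2)) * sum_ij [ 2 - 2*exp(-0.5*(sqrt(P_ij)-sqrt(Q_ij))^2) ]
    return np.mean(2.0 - 2.0 * np.exp(-0.5 * (np.sqrt(P) - np.sqrt(Q)) ** 2))

# Tiny example with intensities in a box [beta, alpha], as assumed later in the paper.
rng = np.random.default_rng(0)
beta, alpha = 1.0, 10.0
P = rng.uniform(beta, alpha, size=(4, 5))
Q = rng.uniform(beta, alpha, size=(4, 5))
print(avg_kl(P, Q), avg_hellinger_sq(P, Q))  # entrywise, d_H^2 <= KL, so the averages satisfy it too
```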
II. FORMULATION

Suppose we observe a subset of the entries of a matrix $M \in \mathbb{R}_+^{d_1\times d_2}$ on an index set $\Omega \subset [d_1]\times[d_2]$. The indices are randomly selected with $\mathbb{E}|\Omega| = m$; in other words, the indicators $\mathbb{I}_{\{(i,j)\in\Omega\}}$ are i.i.d. Bernoulli random variables with parameter $m/(d_1 d_2)$. The observations are Poisson counts of the observed matrix entries,
$$Y_{ij} \sim \mathrm{Poisson}(M_{ij}), \qquad \forall (i,j)\in\Omega. \tag{1}$$
Our goal is to recover the matrix $M$ from the Poisson observations $\{Y_{ij}\}_{(i,j)\in\Omega}$.

We make the following assumptions. First, we set an upper bound $\alpha > 0$ on the entries of $M$ to ensure that the recovery problem is well posed [18]. This assumption is also reasonable in practice; for instance, $M$ may represent an image, which is usually not too spiky. Second, we assume the rank of $M$ is less than or equal to a positive integer $r \le \min\{d_1,d_2\}$ (this assumption is not restrictive in that we only assume an upper bound on the rank). The third assumption is characteristic of Poisson matrix completion: we set a lower bound $\beta > 0$ on each entry $M_{ij}$. This entrywise lower bound is required for our later analysis, and it also has the interpretation of a minimum required signal-to-noise ratio (SNR), as the SNR of a Poisson observation with intensity $I$ is $\sqrt{I}$.

We recover the matrix $M$ using a regularized maximum likelihood formulation. Note that the log-likelihood function for the Poisson observation model (1) is proportional to
$$F_{\Omega,Y}(X) = \sum_{(i,j)\in\Omega} \big( Y_{ij}\log X_{ij} - X_{ij} \big), \tag{2}$$
where the subscripts $\Omega$ and $Y$ indicate the random quantities involved in the maximum likelihood function $F$. Based on our assumptions, we may define a set of candidate estimators
$$S \triangleq \left\{ X \in \mathbb{R}_+^{d_1\times d_2} : \|X\|_* \le \alpha\sqrt{r d_1 d_2},\; \beta \le X_{ij} \le \alpha,\; \forall (i,j)\in[d_1]\times[d_2] \right\}. \tag{3}$$
Here the upper bound on the nuclear norm $\|M\|_*$ comes from combining the assumptions $\|M\|_\infty \le \alpha$ and $\mathrm{rank}(M)\le r$, since $\|M\|_* \le \sqrt{\mathrm{rank}(M)}\,\|M\|_F$ and $\|M\|_F \le \sqrt{d_1 d_2}\,\|M\|_\infty$ lead to $\|M\|_* \le \alpha\sqrt{r d_1 d_2}$. An estimator $\widehat{M}$ for $M$ can then be obtained by solving the following convex optimization problem:
$$\widehat{M} = \arg\max_{X\in S} F_{\Omega,Y}(X). \tag{4}$$
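As an illustration of the observation model (1) and the objective (2), the following NumPy sketch (added here for concreteness; it is not from the original paper, and the helper `loglik` and the specific sizes are our own choices) draws the index set $\Omega$ with $\mathbb{E}|\Omega| = m$, generates Poisson counts, and evaluates $F_{\Omega,Y}(X)$ for a candidate $X$.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, r, m = 60, 40, 5, 1200
beta, alpha = 1.0, 10.0

# An approximately low-rank ground truth, clipped so its entries stay in [beta, alpha].
M = np.clip(rng.uniform(0, 1, (d1, r)) @ rng.uniform(0, 1, (r, d2)) * alpha, beta, alpha)

# Sampling model: each entry observed independently with probability m/(d1*d2).
mask = rng.random((d1, d2)) < m / (d1 * d2)      # indicator of (i,j) in Omega
Y = rng.poisson(M) * mask                        # Poisson counts, kept only on Omega

def loglik(X, Y, mask):
    # F_{Omega,Y}(X) = sum_{(i,j) in Omega} ( Y_ij * log X_ij - X_ij ), cf. (2)
    return np.sum(mask * (Y * np.log(X) - X))

X0 = np.full((d1, d2), (alpha + beta) / 2.0)     # a feasible candidate inside the box
print(loglik(X0, Y, mask), loglik(M, Y, mask))   # the true M typically scores higher
```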
III. PERFORMANCE BOUNDS

In the following, we establish an upper bound and an information-theoretic lower bound on the mean square error (MSE) per entry, $\|\widehat{M}-M\|_F^2/(d_1 d_2)$, for the estimator in (4).

Theorem 1 (Upper bound). Assume $M\in S$, $\mathrm{rank}(M)=r$, $\Omega$ is chosen at random following our sampling model with $\mathbb{E}|\Omega|=m$, and $\widehat{M}$ is the solution to (4). Then with probability exceeding $1 - C/(d_1 d_2)$, we have
$$\frac{1}{d_1 d_2}\|M-\widehat{M}\|_F^2 \le C'\,\frac{8\alpha T}{1-e^{-T}}\cdot\frac{\alpha\sqrt{r}}{\beta}\cdot\big(\alpha(e^2-2)+3\log(d_1 d_2)\big)\sqrt{\frac{d_1+d_2}{m}}\sqrt{1+\frac{(d_1+d_2)\log(d_1 d_2)}{m}}. \tag{5}$$
If $m \ge (d_1+d_2)\log(d_1 d_2)$, then (5) simplifies to
$$\frac{1}{d_1 d_2}\|M-\widehat{M}\|_F^2 \le \sqrt{2}\,C'\,\frac{8\alpha T}{1-e^{-T}}\cdot\frac{\alpha\sqrt{r}}{\beta}\cdot\big(\alpha(e^2-2)+3\log(d_1 d_2)\big)\sqrt{\frac{d_1+d_2}{m}}. \tag{6}$$
Above, $T$, $C'$, and $C$ are absolute constants.

The proof of Theorem 1 is an extension of the ingenious arguments for one-bit matrix completion [1]. The extension to the Poisson case is nontrivial for the various reasons mentioned above (notably the non-sub-Gaussian, only locally sub-Gaussian, nature of the Poisson observations). An outline of our proof is as follows. First, we establish an upper bound on the KL divergence $D(M\|X)$ for any $M, X \in S$ by applying Lemma 2 given in the appendix. Second, we find an upper bound on the Hellinger distance $d_H^2(M,\widehat{M})$ using the fact that the KL divergence can be bounded from below by the Hellinger distance. Finally, we bound the mean squared error via the Hellinger distance in Lemma 3.

Remark 1. Fixing $d_1$, $d_2$, $m$, $\alpha$ and $\beta$, the upper bound in Theorem 1 increases as $r$ increases. This is consistent with the intuition that our method is better at dealing with approximately low-rank matrices (than with nearly full-rank matrices). On the other hand, fixing $d_1$, $d_2$, $\alpha$, $\beta$ and $r$, the upper bound decreases as $m$ increases, which is also consistent with the intuition that $M$ should be recovered more accurately with more observations.

Remark 2. In the upper bound (5), the mean square error per entry can be arbitrarily small, in the sense that the upper bound goes to zero as $d_1$ and $d_2$ go to infinity, when the number of measurements $m = O((d_1+d_2)\log^\delta(d_1 d_2))$ (with $m \le d_1 d_2$) for $\delta > 2$ when $r$ is fixed, or for $\delta > 3$ when $r$ grows sublinearly, on the order of $o(\log(d_1 d_2))$.

The following theorem establishes an information-theoretic lower bound and demonstrates that there exists an $M\in S$ such that any recovery method cannot achieve a mean square error per entry less than the order of $O(\sqrt{r\max\{d_1,d_2\}/m})$.

Theorem 2 (Lower bound). Fix $\alpha$, $r$, $d_1$, and $d_2$ such that $\alpha, d_1, d_2 \ge 1$, $r\ge 4$, $\alpha\ge 2\beta$, and $\alpha^2 r\max\{d_1,d_2\}\ge C_0$. Let $\Omega$ be any subset of $[d_1]\times[d_2]$ with cardinality $m$. Consider any algorithm which, for any $M\in S$, returns an estimator $\widehat{M}$. Then there exists $M\in S$ such that with probability at least $3/4$,
$$\frac{1}{d_1 d_2}\|M-\widehat{M}\|_F^2 \ge \min\left\{ C_1,\; C_2\,\alpha^{3/2}\sqrt{\frac{r\max\{d_1,d_2\}}{m}} \right\}, \tag{7}$$
as long as the right-hand side of (7) exceeds $r\alpha^2/\min\{d_1,d_2\}$, where $C_0$, $C_1$, $C_2$ are absolute constants.

Similar to [1], [33], the proof of Theorem 2 relies on information-theoretic arguments outlined as follows. First we find a set of matrices $\chi\subset S$ so that the distance $\|X^{(i)}-X^{(j)}\|_F$ between any $X^{(i)}, X^{(j)}\in\chi$ is sufficiently large. Then, for any $X\in\chi$ and the recovered $\widehat{X}$, if we assume that they are sufficiently close to each other with high probability, we can claim that $X$ is the element of $\chi$ that is closest to $\widehat{X}$. Finally, by applying a generalized Fano inequality involving the KL divergence, we show that the probability of the event that $X$ is the matrix in $\chi$ closest to $\widehat{X}$ must be small, which leads to a contradiction and hence proves our lower bound.

Remark 3. The assumptions in Theorem 2 can be satisfied, for example, by the following construction. First, choose $\alpha$ such that $\alpha\ge\max\{1,2\beta\}$, and then an $r\ge 4$. Then, for $d_1$ (or $d_2$) sufficiently large, the conditions that $\alpha^2 r\max\{d_1,d_2\}\ge C_0$ and that the right-hand side of (7) exceeds $r\alpha^2/\min\{d_1,d_2\}$ are met. Since $r\le O(\min\{d_1,d_2\}/\alpha^2)$ and $M\in S$, what has been chosen is approximately low rank. In other words, no matter how large $r$ is, we can always find $d_1$ (or $d_2$) large enough so that the assumptions of Theorem 2 are satisfied, and thus there exists an $M$ which cannot be recovered with arbitrarily small error by any method.

Remark 4. When $m\ge(d_1+d_2)\log(d_1 d_2)$ and $m=O(r(d_1+d_2)\log^\delta(d_1 d_2))$ with $\delta>2$, the ratio between the upper bound in (6) and the lower bound in (7) is on the order of $O(\log(d_1 d_2))$. Hence, the lower bound matches the upper bound up to a logarithmic factor.

IV. ALGORITHMS

The matrix completion problem formulated in (4) is a semidefinite program (SDP), since it is a nuclear norm minimization problem with a convex feasible domain. Hence, we may solve it, for example, via the interior-point method [34]. Although the interior-point method returns an exact solution to (4), it does not scale well with the dimensions $d_1$ and $d_2$ of the matrix.

In the following, we develop a set of iterative algorithms that solve the problem approximately and are more efficient than solving the problem as an SDP. In doing so, we use the framework of proximal algorithms to solve (4). First, we rewrite the search space $S$ as the intersection of two closed and convex sets in $\mathbb{R}^{d_1\times d_2}$:
$$\Gamma_1 \triangleq \{M\in\mathbb{R}^{d_1\times d_2} : \|M\|_* \le \alpha\sqrt{r d_1 d_2}\} \quad\text{and}\quad \Gamma_2 \triangleq \{M\in\mathbb{R}^{d_1\times d_2} : \beta\le M_{ij}\le\alpha,\ \forall (i,j)\in[d_1]\times[d_2]\},$$
where the first set is a nuclear norm ball and the second set is a high-dimensional box. Let $f(M)\triangleq -F_{\Omega,Y}(M)$ be the negative log-likelihood function; then the optimization problem (4) is equivalent to
$$\widehat{M} = \arg\min_{M\in\Gamma_1\cap\Gamma_2} f(M). \tag{8}$$
Noticing that the search space $S=\Gamma_1\cap\Gamma_2$ is closed and convex and $f(M)$ is a convex function, we can use proximal gradient methods to solve (8). Let $I_\Gamma(M)$ be the indicator function that takes value zero if $M\in\Gamma$ and $\infty$ if $M\in\Gamma^c$. Then problem (8) is also equivalent to
$$\widehat{M} = \arg\min_{M\in\mathbb{R}^{d_1\times d_2}} f(M) + I_{\Gamma_1\cap\Gamma_2}(M). \tag{9}$$
To guarantee the convergence of the proximal gradient method, we need a Lipschitz constant $L>0$. In our case, the Lipschitz constant $L$ is a positive number satisfying
$$\|\nabla f(X)-\nabla f(Y)\|_F \le L\|X-Y\|_F, \quad \forall X,Y\in S, \tag{10}$$
and hence $L=\alpha/\beta^2$ by the definition of our problem. Define the projection of $Y$ onto $\Gamma$ as
$$\Pi_\Gamma(Y) = \arg\min_{X\in\Gamma}\|X-Y\|_F^2.$$

Algorithm 1: Proximal Gradient for Poisson Matrix Completion
1: Initialize: $[M_0]_{ij}=Y_{ij}$ for $(i,j)\in\Omega$ and $[M_0]_{ij}=(\alpha+\beta)/2$ otherwise; the maximum number of iterations $K$.
2: for $k=1,2,\ldots,K$ do
3:   $M_k = \Pi_S\!\big(M_{k-1} - (1/L)\nabla f(M_{k-1})\big)$
4: end for

Algorithm 1 has a linear convergence rate, which is established in the following theorem.

Theorem 3. Let $\{M_k\}$ be the sequence generated by Algorithm 1. Then for any $k>1$, we have
$$f(M_k)-f(\widehat{M}) \le \frac{L\|M_0-\widehat{M}\|_F^2}{2k}.$$
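A minimal sketch of one iteration of Algorithm 1, assuming NumPy (this snippet is ours, not from the paper; for simplicity it replaces the exact projection $\Pi_S$ with the closed-form projection onto the box $\Gamma_2$ only — the nuclear-norm part of the projection is discussed with Algorithm 3 below):

```python
import numpy as np

def grad_f(X, Y, mask):
    # Gradient of the negative log-likelihood f(X) = sum_Omega (X_ij - Y_ij log X_ij):
    # df/dX_ij = 1 - Y_ij / X_ij on Omega, and 0 off Omega.
    return mask * (1.0 - Y / X)

def project_box(X, beta, alpha):
    # Closed-form projection onto Gamma_2 = {beta <= X_ij <= alpha}.
    return np.clip(X, beta, alpha)

def proximal_gradient_step(M_prev, Y, mask, beta, alpha):
    # One step of Algorithm 1 with step size 1/L, L = alpha / beta**2,
    # using the box projection as a stand-in for Pi_S.
    L = alpha / beta ** 2
    return project_box(M_prev - grad_f(M_prev, Y, mask) / L, beta, alpha)
```

In a full implementation, the box projection above would be combined with the projection onto the nuclear norm ball $\Gamma_1$, e.g., via the alternating projections of Algorithm 3.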
Although Algorithm 1 can be implemented easily, its linear convergence rate is not sufficient if the Lipschitz constant $L$ is large. In such scenarios, we prefer Nesterov's accelerated method for solving this problem, which is our Algorithm 2.

Algorithm 2: Accelerated Proximal Gradient for Poisson Matrix Completion
1: Initialize: $[M_0]_{ij}=Y_{ij}$ for $(i,j)\in\Omega$ and $[M_0]_{ij}=(\alpha+\beta)/2$ otherwise; $Z_0=M_0$; the maximum number of iterations $K$.
2: for $k=1,2,\ldots,K$ do
3:   $M_k = \Pi_S\!\big(Z_{k-1}-(1/L)\nabla f(Z_{k-1})\big)$
4:   $Z_k = M_k + \frac{k-1}{k+2}(M_k - M_{k-1})$
5: end for

Algorithm 2 has faster convergence, as stated in the following Theorem 4.

Theorem 4. Let $\{M_k\}$ be the sequence generated by Algorithm 2. Then for any $k>1$, we have
$$f(M_k)-f(\widehat{M}) \le \frac{2L\|M_0-\widehat{M}\|_F^2}{(k+1)^2}.$$

The remaining part of the problem is then to deal with the projection onto the search space $S$. Since $S$ is the intersection of two convex sets, we may use an alternating projection algorithm to compute a sequence that converges to this intersection of $\Gamma_1$ and $\Gamma_2$, which is stated in Algorithm 3.

Algorithm 3: Alternating Projection Algorithm
1: Initialize: $U_0$ is the matrix to be projected onto $S$.
2: for $j=1,2,\ldots$ do
3:   $V_j = \Pi_{\Gamma_1}(U_{j-1})$
4:   $U_j = \Pi_{\Gamma_2}(V_j)$
5:   If $\|V_j-U_j\|_F\le 10^{-6}$ then return $U_j$.
6: end for

Algorithm 3 is efficient if closed forms of the projections onto the convex sets can be obtained. Fortunately, the computation of the projection onto $\Gamma_2$ in our case is quite simple. Based on the definition of the Frobenius norm, $[\Pi_{\Gamma_2}(Y)]_{ij}$ is: $\beta$ if $Y_{ij}<\beta$; $\alpha$ if $Y_{ij}>\alpha$; and $Y_{ij}$ if $\beta\le Y_{ij}\le\alpha$. Even though there is no closed-form expression for the projection onto $\Gamma_1$, we can use TFOCS, a MATLAB package, to implement this step.

Similar to the construction in [30], we may rewrite (4) as
$$\widehat{M} = \arg\min_{M\in\Gamma_2} f(M) + \lambda\|M\|_*, \tag{11}$$
where $\lambda$ is a regularization parameter that balances goodness of data fit against regularization.

The PMLSV algorithm can be derived as follows (in the same spirit as [29], [19]). Let $f(M)\triangleq -F_{\Omega,Y}(M)$ be the negative log-likelihood function. In the $k$th iteration, we may form a Taylor expansion of $f(M)$ around $M_{k-1}$, keep terms up to second order, and then solve
$$M_k = \arg\min_{M\in\Gamma_2}\big[ Q_{t_k}(M,M_{k-1}) + \lambda\|M\|_* \big], \tag{12}$$
with
$$Q_{t_k}(M,M_{k-1}) \triangleq f(M_{k-1}) + \langle M-M_{k-1},\nabla f(M_{k-1})\rangle + \frac{t_k}{2}\|M-M_{k-1}\|_F^2, \tag{13}$$
where $\nabla f$ is the gradient of $f$ and $t_k$ is the reciprocal of the step size in the $k$th iteration, which we will specify later. By dropping and introducing terms independent of $M$ whenever needed (more details can be found in [35]), (12) is equivalent to
$$M_k = \arg\min_{M}\left[ \frac{1}{2}\left\| M - \Big(M_{k-1}-\frac{1}{t_k}\nabla f(M_{k-1})\Big)\right\|_F^2 + \frac{\lambda}{t_k}\|M\|_* \right]. \tag{14}$$
Using a theorem proved in [12], we may show (in Appendix A) that the exact solution to (14) is given by a form of singular value thresholding (SVT):
$$M_k = \mathcal{D}_{\lambda/t_k}\!\left(M_{k-1}-\frac{1}{t_k}\nabla f(M_{k-1})\right), \tag{15}$$
where $\mathcal{D}_\tau(\Sigma)\triangleq\mathrm{diag}\{(\sigma_i-\tau)_+\}$ and $(x)_+=\max\{x,0\}$.

The PMLSV algorithm is summarized in Algorithm 4. In the algorithm description, $L$ is the reciprocal of the step size, $\eta>1$ is a scale parameter used to change the step size, and $K$ is the maximum number of iterations, which is user specified: a larger $K$ leads to a more accurate solution, while a small $K$ obtains a coarse solution quickly. If the cost function value does not decrease, the step size is shortened so as to change the singular values more conservatively. The algorithm terminates when the absolute difference in the cost function values between two consecutive iterations is less than $0.5/K$.

Algorithm 4: PMLSV for Poisson Matrix Completion
1: Initialize: $[M_0]_{ij}=Y_{ij}$ for $(i,j)\in\Omega$ and $(\alpha+\beta)/2$ otherwise; the maximum number of iterations $K$; parameters $\eta$ and $L$.
2: for $k=1,2,\ldots,K$ do
3:   $C = M_{k-1}-(1/L)\nabla f(M_{k-1})$
4:   $C = U D V^T$ {singular value decomposition}
5:   $D_{\mathrm{new}} = \mathrm{diag}\big((\mathrm{diag}(D)-\lambda/L)_+\big)$
6:   $M_k = \Pi_{\Gamma_2}\!\big(U D_{\mathrm{new}} V^T\big)$
7:   If $f(M_k) > Q_L(M_k,M_{k-1})$ then $L=\eta L$, go to 4.
8:   If $|f(M_k)-Q_L(M_k,M_{k-1})|<0.5/K$ then exit.
9: end for

Under the assumption that the box constraint is not binding, Algorithm 4 is the same as that in [29], and the convergence analysis can be found there.
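For concreteness, here is a minimal NumPy sketch (ours, not from the paper; the helper name `pmlsv_step` is our own) of steps 3–6 of Algorithm 4: a gradient step on the negative log-likelihood, soft-thresholding of the singular values by $\lambda/L$, and the box projection onto $\Gamma_2$.

```python
import numpy as np

def pmlsv_step(M_prev, Y, mask, lam, L, beta, alpha):
    # Step 3: gradient step on f(X) = sum_Omega (X_ij - Y_ij log X_ij).
    C = M_prev - mask * (1.0 - Y / M_prev) / L
    # Steps 4-5: SVD of C and soft-thresholding of the singular values by lam / L.
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    s_new = np.maximum(s - lam / L, 0.0)
    # Step 6: box projection onto Gamma_2.
    return np.clip(U @ np.diag(s_new) @ Vt, beta, alpha)

# Initialization of Algorithm 4 (observed entries, (alpha+beta)/2 elsewhere), e.g.:
# M = mask * Y + (1 - mask) * (alpha + beta) / 2
# A full implementation would add the backtracking of step 7 (L <- eta*L when the cost
# increases) and the stopping rule of step 8.
```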
Despite its simplicity, the PMLSV algorithm has surprisingly good performance. With the simple initialization for $M_0$, the magnitude of the gradient is typically small at each iteration. Hence, we can ensure that $M_k$ belongs to, or stays close to, $\Gamma_2$ by choosing an appropriate step size in the $k$th iteration. The complexity of PMLSV is on the order of $O(d_1^2 d_2 + d_2^3)$ (which comes from the most expensive step of performing the singular value decomposition). This is much lower than the complexity of solving an SDP, which is on the order of $O(d_1^3 + d_1 d_2^3 + d_1^2 d_2^2)$. In particular, for a $d$-by-$d$ matrix, the PMLSV algorithm has complexity $O(d^3)$ versus $O(d^4)$ for solving the SDP.

V. NUMERICAL EXAMPLE

We demonstrate the good performance of our estimator in recovering a solar flare image. The solar flare image is of size 48-by-48. We break the image into 8-by-8 patches, then collect the vectorized patches into a 64-by-36 matrix: such a matrix is well approximated by a low-rank matrix, as demonstrated in Fig. 1.

Suppose entries are observed using our sampling model with $\mathbb{E}|\Omega|=m$. Let $p\triangleq m/(d_1 d_2)$; then we observe $(100p)\%$ of the entries. We use $L=10^{-4}$ and $\eta=1.1$ in the PMLSV algorithm. Fig. 2 to Fig. 4 show the recovery results when 80%, 50%, and 30% of the image is observed. The results show that our algorithm recovers the original image accurately when 50% or more of the image entries are observed. When only 30% of the image entries are observed, our algorithm still captures the main features of the image. The PMLSV algorithm is very efficient: the running time on a laptop with a 2.40 GHz two-core CPU and 8 GB RAM is less than 1.2 seconds for all three examples (much faster than solving the SDP).

Fig. 1: Solar flare image of size 48-by-48 with rank 10.
Fig. 2: (a) Observed image with 80% of entries known (dark spots represent missing entries). (b) Recovered image with $\lambda=0.1$ and no more than 2000 iterations; the elapsed time is 1.176595 seconds.
Fig. 3: (a) Observed image with 50% of entries known (dark spots represent missing entries). (b) Recovered image with $\lambda=0.1$ and no more than 2000 iterations; the elapsed time is 1.110226 seconds.
Fig. 4: (a) Observed image with 30% of entries known (dark spots represent missing entries). (b) Recovered image with $\lambda=0.1$ and no more than 2000 iterations; the elapsed time is 1.097281 seconds.

ACKNOWLEDGEMENT

The authors would like to thank Prof. Yuejie Chi, Prof. Mark Davenport, and Prof. Yaniv Plan for stimulating discussions and inspiring comments. This work is partially supported by NSF grant CCF-1442635.

REFERENCES

[1] M. A. Davenport, Y. Plan, E. van den Berg, and M. Wootters, "1-bit matrix completion," Information and Inference, 2014.
[2] ACM SIGKDD, "Netflix," in Proceedings of KDD Cup and Workshop, 2007.
[3] D. J. Brady, Optical Imaging and Spectroscopy. John Wiley & Sons, 2009.
[4] J. A. Bazerque, G. Mateos, and G. B. Giannakis, "Inference of Poisson count processes using low-rank tensor data," in Acoustics, Speech and Signal Processing (ICASSP), IEEE International Conference on, pp. 5989–5993, 2013.
[5] D. L. Donoho, "Compressed sensing," Information Theory, IEEE Transactions on, vol. 52, no. 4, pp. 1289–1306, 2006.
[6] E. J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717–772, 2009.
[7] R. H. Keshavan, A. Montanari, and S. Oh, "Matrix completion from a few entries," Information Theory, IEEE Transactions on, vol. 56, no. 6, pp. 2980–2998, 2010.
[8] E. J. Candès and T. Tao, "The power of convex relaxation: Near-optimal matrix completion," Information Theory, IEEE Transactions on, vol. 56, no. 5, pp. 2053–2080, 2010.
[9] W. Dai and O. Milenkovic, "SET: an algorithm for consistent matrix completion," in Acoustics, Speech and Signal Processing (ICASSP), 2010 IEEE International Conference on, pp. 3646–3649, IEEE, 2010.
[10] B. Recht, M. Fazel, and P. A. Parrilo, "Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization," SIAM Review, vol. 52, no. 3, pp. 471–501, 2010.
[11] B. Recht, "A simpler approach to matrix completion," The Journal of Machine Learning Research, vol. 12, pp. 3413–3430, 2011.
[12] J.-F. Cai, E. J. Candès, and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956–1982, 2010.
[13] Z. Lin, A. Ganesh, J. Wright, L. Wu, M. Chen, and Y. Ma, "Fast convex optimization algorithms for exact recovery of a corrupted low-rank matrix," Computational Advances in Multi-Sensor Adaptive Processing (CAMSAP), vol. 61, 2009.
[14] R. Mazumder, T. Hastie, and R. Tibshirani, "Spectral regularization algorithms for learning large incomplete matrices," The Journal of Machine Learning Research, vol. 11, pp. 2287–2322, 2010.
[15] R. Keshavan, A. Montanari, and S. Oh, "Matrix completion from noisy entries," in Advances in Neural Information Processing Systems, pp. 952–960, 2009.
[16] E. J. Candès and Y. Plan, "Matrix completion with noise," Proceedings of the IEEE, vol. 98, no. 6, pp. 925–936, 2010.
[17] S. Negahban, M. J. Wainwright, et al., "Estimation of (near) low-rank matrices with noise and high-dimensional scaling," The Annals of Statistics, vol. 39, no. 2, pp. 1069–1097, 2011.
[18] S. Negahban and M. J. Wainwright, "Restricted strong convexity and weighted matrix completion: Optimal bounds with noise," The Journal of Machine Learning Research, vol. 13, no. 1, pp. 1665–1697, 2012.
[19] A. Rohde, A. B. Tsybakov, et al., "Estimation of high-dimensional low-rank matrices," The Annals of Statistics, vol. 39, no. 2, pp. 887–930, 2011.
[20] A. Soni, S. Jain, J. Haupt, and S. Gonella, "Error bounds for maximum likelihood matrix completion under sparse factor models."
[21] A. Soni, S. Jain, J. Haupt, and S. Gonella, "Noisy matrix completion under sparse factor models," arXiv preprint arXiv:1411.0282, 2014.
[22] A. Soni and J. Haupt, "Estimation error guarantees for Poisson denoising with sparse and structured dictionary models."
[23] M. Raginsky, R. M. Willett, Z. T. Harmany, and R. F. Marcia, "Compressed sensing performance bounds under Poisson noise," Signal Processing, IEEE Transactions on, vol. 58, no. 8, pp. 3990–4002, 2010.
[24] M. Raginsky, S. Jafarpour, Z. T. Harmany, R. F. Marcia, R. M. Willett, and R. Calderbank, "Performance bounds for expander-based compressed sensing in Poisson noise," Signal Processing, IEEE Transactions on, vol. 59, no. 9, pp. 4139–4153, 2011.
[25] X. Jiang, G. Raskutti, and R. Willett, "Minimax optimal rates for Poisson inverse problems with physical constraints," arXiv preprint arXiv:1403.6532, 2014.
[26] Q. Tran-Dinh, A. Kyrillidis, and V. Cevher, "A proximal Newton framework for composite minimization: Graph learning without Cholesky decomposition and matrix inversions," Proc. 30th Int. Conf. Machine Learning (ICML), 2013.
[27] Q. Tran-Dinh, A. Kyrillidis, and V. Cevher, "Composite self-concordant minimization," J. Machine Learning Research (JMLR), 2014.
[28] M. J. Wainwright, "Structured regularizers for high-dimensional problems: Statistical and computational issues," Annual Review of Statistics and Its Application, vol. 1, pp. 233–253, 2014.
[29] S. Ji and J. Ye, "An accelerated gradient method for trace norm minimization," in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 457–464, ACM, 2009.
[30] M. J. Wainwright, "Structured regularizers for high-dimensional problems: Statistical and computational issues," Annual Review of Statistics and Its Application, vol. 1, pp. 233–253, 2014.
[31] A. Agarwal, S. Negahban, and M. J. Wainwright, "Fast global convergence rates of gradient methods for high-dimensional statistical recovery," in Advances in Neural Information Processing Systems, pp. 37–45, 2010.
[32] J. Lafond, "Low rank matrix completion with exponential family noise," arXiv preprint arXiv:1502.06919, 2015.
[33] E. J. Candès and M. A. Davenport, "How well can we estimate a sparse vector?," Applied and Computational Harmonic Analysis, vol. 34, no. 2, pp. 317–323, 2013.
[34] Z. Liu and L. Vandenberghe, "Interior-point method for nuclear norm approximation with application to system identification," SIAM Journal on Matrix Analysis and Applications, vol. 31, no. 3, pp. 1235–1256, 2009.
[35] Y. Cao and Y. Xie, "Low-rank matrix recovery in Poisson noise," arXiv preprint arXiv:1407.0726, 2014.
[36] M. Ledoux and M. Talagrand, Probability in Banach Spaces: Isoperimetry and Processes, vol. 23. Springer, 1991.
[37] D. Pollard, A User's Guide to Measure Theoretic Probability, vol. 8. Cambridge University Press, 2002.
[38] B. Yu, "Assouad, Fano, and Le Cam," in Festschrift for Lucien Le Cam, pp. 423–435, Springer, 1997.

APPENDIX A
SINGULAR VALUE THRESHOLDING

Consider the following problem:
$$\min_{Y\in\mathbb{R}^{d_1\times d_2}} \left\{ \frac{1}{2}\|Y-X\|_F^2 + \tau\|Y\|_* \right\}, \tag{16}$$
where $X\in\mathbb{R}^{d_1\times d_2}$ is given and $\tau$ is the regularization parameter. For a matrix $X\in\mathbb{R}^{d_1\times d_2}$ with rank $r$, let its singular value decomposition be $X=U\Sigma V^T$, where $U\in\mathbb{R}^{d_1\times r}$, $V\in\mathbb{R}^{d_2\times r}$, $\Sigma=\mathrm{diag}(\{\sigma_i\},\ i=1,2,\ldots,r)$, and $\sigma_i$ is a singular value of the matrix $X$. For each $\tau\ge 0$, define the singular value thresholding operator as
$$\mathcal{D}_\tau(X) \triangleq U\mathcal{D}_\tau(\Sigma)V^T. \tag{17}$$
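The following NumPy sketch (our illustration, not part of the paper; the helpers `svt` and `objective` are ours) implements the operator in (17) and numerically checks the optimality property stated in Theorem 5 below, by comparing the objective of (16) at $\mathcal{D}_\tau(X)$ against randomly perturbed points.

```python
import numpy as np

def svt(X, tau):
    # Singular value thresholding D_tau(X) = U diag((sigma_i - tau)_+) V^T, cf. (17).
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def objective(Y, X, tau):
    # Objective of problem (16): 0.5*||Y - X||_F^2 + tau*||Y||_*.
    return 0.5 * np.linalg.norm(Y - X, "fro") ** 2 + tau * np.linalg.norm(Y, ord="nuc")

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 6))
tau = 0.7
Y_star = svt(X, tau)
# Every perturbation of the SVT output should have an objective value no smaller than D_tau(X).
assert all(objective(Y_star + 0.1 * rng.normal(size=X.shape), X, tau) >= objective(Y_star, X, tau)
           for _ in range(100))
```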
(17) τ τ [22] A.SoniandJ.Haupt,“Estimationerrorguaranteesforpoissondenoising withsparseandstructureddictionarymodels,” The solution to (16) is given by singular value thresholding [23] M. Raginsky, R. M. Willett, Z. T. Harmany, and R. F. Marcia, according to the following theorem “Compressedsensingperformanceboundsunderpoissonnoise,”Signal Processing,IEEETransactionson,vol.58,no.8,pp.3990–4002,2010. Theorem 5 (Theorem 2.1 in [12]). For each τ ≥ 0, and [24] M.Raginsky,S.Jafarpour,Z.T.Harmany,R.F.Marcia,R.M.Willett,and X ∈Rd1×d2: R. Calderbank, “Performance bounds for expander-based compressed (cid:26) (cid:27) 1 sensing in poisson noise,” Signal Processing, IEEE Transactions on, D (X)=arg min (cid:107)Y −X(cid:107)2 +τ(cid:107)Y(cid:107) . (18) vol.59,no.9,pp.4139–4153,2011. τ Y∈Rd1×d2 2 F ∗ [25] X. Jiang, G. Raskutti, and R. Willett, “Minimax optimal rates for 6  (cid:12) (cid:12) APPPREONODFIXS B =2hEsup (cid:12)(cid:12)(cid:12)(cid:88)(cid:15)ijI[(i,j)∈Ω](Yij(−logXij)) X∈S(cid:12)i,j (cid:12)h In the following, Lemma 1 is used in proving Lemma 2, (cid:12) and Lemma 4 corresponds to Lemma 3 in [1]. +(cid:88)(cid:15)ijI[(i,j)∈Ω]Xij(cid:12)(cid:12)(cid:12)  Lemma 1. Assuming Y ∼ Poisson (λ) is a Poisson random i,j (cid:12) variable with λ ≤ α. Then P(Y −λ ≥ t) ≤ e−t,∀t ≥ t0 for  (cid:12) (cid:12)h t (cid:44)α(e2−3). (cid:12) (cid:12) 0 ≤22h−1Esup (cid:12)(cid:12)(cid:12)(cid:88)(cid:15)ijI[(i,j)∈Ω](Yij(−logXij))(cid:12)(cid:12)(cid:12)  X∈S(cid:12)i,j (cid:12) Proof: (20)  (cid:12) (cid:12)h (cid:12) (cid:12) We introduce θ ≥0, +22h−1Esup (cid:12)(cid:12)(cid:12)(cid:88)(cid:15)ijI[(i,j)∈Ω]Xij(cid:12)(cid:12)(cid:12) , X∈S(cid:12)i,j (cid:12) P(Y −λ≥t)=P(Y ≥t+λ) =P(θY ≥θ(t+λ))=P(exp(θY)≥exp(θ(t+λ))). where the expectation are over both Ω and Y. Using Markov inequality, we can have P(Y −λ≥t)≤exp(−θ(t+λ))E(cid:0)eθY(cid:1). =exp(−θ(λ+t))·exp(cid:0)λ(eθ−1)(cid:1) Letting θ =2, P(Y −λ≥t) =exp(cid:0)−t+λ(e2−3)(cid:1). exp(−t) Inthefollowing,wewill usecontractionprincipletofurther Define that bound the first term of (20). We let φ(t)=−βlog(t+1). We t (cid:44)α(e2−3), know φ(0) = 0 and |φ(cid:48)(t)| = |β/(t+1)|, so |φ(cid:48)(t)| ≤ 1 if 0 t ≥ β −1. Setting Z = X −1 , then we have Z ≥ to make P(Y−λ≥t) ≤ 1, we derive that P(Y −λ≥t) ≤ e−t d1×d2 √ √ ij exp(−t) β−1,∀(i,j)∈[d1]×[d2] and (cid:107)Z(cid:107)∗ ≤α rd1d2+ d1d2 by when triangle inequality. Therefore, φ(Z ) is a contraction and it ij vanishes at 0. 
By Theorem 4.12 in [36] and using the fact that t≥t ≥λ(e2−3) 0 |(cid:104)A,B(cid:105)|≤(cid:107)A(cid:107)(cid:107)B(cid:107) , we have ∗  (cid:12) (cid:12)h (cid:12) (cid:12) (L2e)mamndaS2.beLettheFΩse,Yt d(Xefi)nebdeitnhe(3li)k,etlhihenood function defined in 22h−1EXsu∈pS(cid:12)(cid:12)(cid:12)(cid:12)(cid:88)i,j (cid:15)ijI[(i,j)∈Ω](Yij(−logXij))(cid:12)(cid:12)(cid:12)(cid:12)  (cid:26) P sup |FΩ,Y(X)−EFΩ,Y(X)| (cid:20) (cid:21) (cid:12)(cid:12) (cid:12)(cid:12)h ≥X∈CS(cid:48)(cid:0)α√r/β(cid:1)(cid:0)α(e2−2)+3log(d1d2)(cid:1)· ≤22h−1E mi,ajxYihj Xsu∈pS(cid:12)(cid:12)(cid:12)(cid:12)(cid:88)i,j (cid:15)ijI[(i,j)∈Ω]((−logXij))(cid:12)(cid:12)(cid:12)(cid:12)  (cid:16)(cid:112) (cid:17)(cid:111) (19) m(d1+d2)+d1d2log(d1d2)  (cid:12) (cid:12)h (cid:20) (cid:21) (cid:12) (cid:18) (cid:19)(cid:12) ≤ d1Cd2, =22h−1E mi,ajxYihj EXsu∈pS(cid:12)(cid:12)(cid:12)(cid:12)(cid:88)i,j (cid:15)ijI[(i,j)∈Ω] β1φ(Zij) (cid:12)(cid:12)(cid:12)(cid:12)  where C(cid:48) and C are absolute positive constants and the  (cid:12) (cid:12)h probability and the expectation are both over Ω and Y. ≤22h−1(cid:18)β2(cid:19)hE(cid:20)mi,ajxYihj(cid:21)EXsu∈pS(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88)i,j (cid:15)ijI[(i,j)∈Ω]Zij)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)  Proof: In order to prove the lemma, we let (cid:15)ij are i.i.d. (cid:18)2(cid:19)h (cid:20) (cid:21) (cid:20) (cid:21) Rademacher random variables. In the following derivation, the =22h−1 E maxYh E sup |(cid:104)∆ ◦E,Z(cid:105)|h firstinequalityisduetheRadamachersymmetrizationargument β i,j ij X∈S Ω (Lemma 6.3 in [36]) and the second inequality is due to the (cid:18)2(cid:19)h (cid:20) (cid:21) (cid:20) (cid:21) panodwehr≥me1a.nTihneenquwaelithya:v(ea+b)h ≤2h−1(ah+bh) if a,b>0 ≤22h−1 β E mi,ajxYihj E Xsu∈pS(cid:107)E◦∆Ω(cid:107)h(cid:107)Z(cid:107)h∗ E(cid:20)sup |FΩ,Y(X)−EFΩ,Y(X)|h(cid:21) ≤22h−1(cid:18)β2(cid:19)h(cid:0)α√r+1(cid:1)h(cid:16)(cid:112)d1d2(cid:17)hE(cid:20)mi,ajxYihj(cid:21)E(cid:2)(cid:107)E◦∆Ω(cid:107)h(cid:3), X∈S (21)  (cid:12) (cid:12)h ≤2hEsup (cid:12)(cid:12)(cid:12)(cid:12)(cid:88)(cid:15)ijI[(i,j)∈Ω](YijlogXij −Xij)(cid:12)(cid:12)(cid:12)(cid:12)  wdehneorteesEthedeinndoitceastotrhemamtraitxrifxorwΩithanedn◦trideesnogtievsenthebyHa(cid:15)dija,m∆arΩd X∈S(cid:12)i,j (cid:12) product. 7 Similarly,thesecondtermof(20)canbeboundedasfollows: Above, firstly we use triangle inequality and power mean  (cid:12) (cid:12)h inequality, then along with independence, we use (24) in the (cid:12) (cid:12) third inequality. By standard computations for exponential 22h−1Esup (cid:12)(cid:12)(cid:12)(cid:88)(cid:15)ijI[(i,j)∈Ω]Xij(cid:12)(cid:12)(cid:12)  random variables, X∈S(cid:12)i,j (cid:12) (cid:20) (cid:21) (cid:20) (cid:21) (22) E maxWihj ≤2h!+logh(d1d2). (26) ≤22h−1E sup (cid:107)E◦∆ (cid:107)h(cid:107)X(cid:107)h i,j Ω ∗ X∈S Thus, we have ≤22h−1(cid:0)α√r(cid:1)h(cid:16)(cid:112)d d (cid:17)hE(cid:2)(cid:107)E◦∆ (cid:107)h(cid:3). (cid:20) (cid:21) (cid:16) (cid:17) 1 2 Ω E maxYh ≤22h−1 αh+(t )h+2h!+logh(d d ) . ij 0 1 2 i,j (27) Plugging (21) and (22) into (20), we have (cid:20) (cid:21) E sup |F (X)−EF (X)|h Ω,Y Ω,Y X∈S ≤22h−1(cid:0)α√r+1(cid:1)h(cid:16)(cid:112)d1d2(cid:17)hE(cid:2)(cid:107)E◦∆Ω(cid:107)h(cid:3)· (23) (cid:32)(cid:18)2(cid:19)h (cid:20) (cid:21) (cid:33) E maxYh +1 . 
β i,j ij Therefore, combining (27) and (23), we have To bound E(cid:2)(cid:107)E◦∆ (cid:107)h(cid:3), we can use the result from [1] (cid:20) (cid:21) Ω E sup |F (X)−EF (X)|h if we take h=log(d d )≥1: Ω,Y Ω,Y 1 2 X∈S E(cid:2)(cid:107)E◦∆Ω(cid:107)h(cid:3) ≤24h−1(cid:0)α√r+1(cid:1)h(cid:16)(cid:112)d d (cid:17)hE(cid:2)(cid:107)E◦∆ (cid:107)h(cid:3)· (28) 1 2 Ω (cid:115) h ≤C0(cid:16)2(1+√6)(cid:17)h m(d1+d2)+d1dd12d2log(d1d2) (cid:18)β2(cid:19)h(cid:16)αh+(t0)h+2h!+logh(d1d2)(cid:17). Then, for some constant C . Therefore, the only term we need to (cid:18) (cid:20) (cid:21)(cid:19)1 bound is E(cid:2)max Y0h(cid:3). E sup |F (X)−EF (X)|h h i,j ij Ω,Y Ω,Y X∈S From Lemma 1, if t≥t0, then for any (i,j)∈[d1]×[d2], ≤16(cid:0)α√r+1(cid:1)(cid:16)(cid:112)d d (cid:17)E(cid:2)(cid:107)E◦∆ (cid:107)h(cid:3)h1 · the following inequality holds since t >α: 1 2 Ω 0 (cid:18) (cid:19) 2 P(|Yij −Mij|≥t)=P(Yij ≥Mij +t)+P(Yij ≤Mij −t) β (α+t0+2h+log(d1d2)) ≤exp(−t)+0 =P(W ≥t), ≤16(cid:18)2(cid:19)(cid:0)α√r+1(cid:1)(cid:16)(cid:112)d d (cid:17)E(cid:2)(cid:107)E◦∆ (cid:107)h(cid:3)h1 · ij β 1 2 Ω (24) (cid:0)α(e2−2)+3log(d d )(cid:1) wvahreiarbeleWs.ij are independent standard exponential random ≤128(cid:16)1+√6(cid:17)Ch1 (cid:18)α1√2r(cid:19)(cid:0)α(e2−2)+3log(d d )(cid:1)· 0 β 1 2 Below we use the fact that for any positive random variable (cid:16)(cid:112) (cid:17) X, we can write EX =(cid:82)∞P(X ≥t)dt, allowing us to bound m(d1+d2)+d1d2log(d1d2) . 0 (cid:20) (cid:21) (29) E maxYh i,j ij whereweusethefactthat(ah+bh+ch+dh)1/h ≤a+b+c+dif (cid:18) (cid:20) (cid:21)(cid:19) a,b,c,d>0inthefirstinequalityandwetakeh=log(d d )≥ 1 2 ≤22h−1 αh+E max|Yij −Mij|h 1 in the second inequality. i,j (cid:18) (cid:90) ∞ (cid:18) (cid:19) (cid:19) =22h−1 αh+ P max|Y −M |h ≥t dt ij ij 0 i,j (cid:32) (cid:90) ∞ (cid:18) (cid:19) (cid:33) ≤22h−1 αh+(t )h+ P max|Y −M |h ≥t dt 0 ij ij (t0)h i,j (cid:32) (cid:90) ∞ (cid:18) (cid:19) (cid:33) ≤22h−1 αh+(t )h+ P maxWh ≥t dt √ 0 (t0)h i,j ij Moreover when C(cid:48) ≥128(cid:0)1+ 6(cid:1)e, ≤22h−1(cid:18)αh+(t0)h+E(cid:20)mi,ajxWihj(cid:21)(cid:19) C (cid:32)128(1+√6)(cid:33)log(d1d2) ≤ C0 (25) 0 C(cid:48) d1d2 8 Therefore we can use Markov inequality to see that 1. For all X ∈χ, each entry has |X |=αγ. ij (cid:26) P sup |F (X)−EF (X)| Ω,Y Ω,Y ≥X∈CS(cid:48)(cid:0)α√r/β(cid:1)(cid:0)α(e2−2)+3log(d1d2)(cid:1)· α2γ22.d1Fdo2r/2a.ll X(i),X(j) ∈ χ, i (cid:54)= j, (cid:107)X(i) − X(j)(cid:107)2F > (cid:16)(cid:112)m(d +d )+d d log(d d )(cid:17)(cid:111) Lemma 5. For x,y >0, D(x(cid:107)y)≤(y−x)2/y. 1 2 1 2 1 2 (cid:26) =P sup |F (X)−EF (X)|h Ω,Y Ω,Y X∈S √ Proof: First assume x ≤ y. Let z = y−x. Then z ≥ 0 ≥C(cid:48)(cid:0)α r/β(cid:1)(cid:0)α(e2−2)+3log(d1d2)(cid:1)· and D(x(cid:107)x+z) = xlogx+xz +z. Taking the first derivative (cid:16)(cid:112) (cid:17)(cid:111) of this with respect to z, we have ∂ D(x(cid:107)x + z) = z . m(d1+d2)+d1d2log(d1d2) ∂z x+z Thus, by Taylor’s theorem, there is some ξ ∈ [0,z] so that (cid:20) (cid:21) ≤E sup |FΩ,Y(X)−EFΩ,Y(X)|h / D(x(cid:107)y)=D(x(cid:107)x)+z·x+ξξ.Sincetheright-hand-sideincreases X∈S√ in ξ, we may replace ξ with z and obtain D(x(cid:107)y)≤ (y−x)2. {(cid:0)C(cid:48)(cid:0)α r/β(cid:1)(cid:0)α(e2−2)+3log(d1d2)(cid:1)· For x > y, with the similar argument we may concludeythat (cid:16)(cid:112) (cid:17)(cid:17)h for z =y−x<0 there is some ξ ∈[z,0] so that D(x(cid:107)y)= m(d1+d2)+d1d2log(d1d2) } D(x(cid:107)x)+z· ξ . 
Since z <0 and ξ/(x+ξ) increases in ξ, x+ξ C thentheright-hand-sideisdecreasinginξ.Wemayalsoreplace ≤ , d d ξ with z and this proves the lemma. 1 2 √ where C(cid:48) ≥128(1+ 6)e and C are absolute constants. Proof of Theorem 1: Lemma 1, Lemma 2, and Lemma 3 are used in the proof. In the following, the expectation are taken with respect to both Ω and {Y }. First, note that Lemma 3. Let β ≤Mij,M(cid:99)ij ≤α,∀(i,j)∈[d1]×[d2], then ij (cid:20) (cid:18) (cid:19) (cid:21) d2H(M,M(cid:99))≥ 1−4αeT−T (cid:107)Md−dM(cid:99)(cid:107)2F, FΩ,Y(X)−FΩ,Y(M)=(i(cid:88),j)∈Ω Yijlog MXiijj −(Xij −Mij) . 1 2 where T = 1 (α−β)2. Then for any X ∈S, 8β E[F (X)−F (M)] Ω,Y Ω,Y (cid:20) (cid:18) (cid:19) (cid:21) Proof: Assuming x is any entry in M and y is any entry = m (cid:88) M log Xij −(X −M ) in M(cid:99), then β ≤ x,y ≤ α and 0 ≤ |x−y| ≤ α−β. By the d d ij M ij ij 1 2 ij mean value theorem there exists an ξ(x,y)∈[β,α] such that i,j (cid:20) (cid:18) (cid:19) (cid:21) 1 √ √ 1(cid:32) 1 (cid:33)2 1 =−dmd (cid:88) Mijlog MXij −(Mij −Xij) (32) ( x− y)2 = (x−y) = (x−y)2 ≤T. 1 2 i,j ij (cid:112) 2 2 2 ξ(x,y) 8ξ(x,y) m (cid:88) =− D(M (cid:107)X )=−mD(M(cid:107)X). d d ij ij The function f(z) = 1−e−z is concave in [0,+∞], so if 1 2 i,j z ∈[0,T], we may bound it from below with a linear function For M ∈ S, we know M(cid:99) ∈ S and FΩ,Y(M(cid:99)) ≥ FΩ,Y(M). 1−e−T Thus we write 1−e−z ≥ z. (30) √ √ T 0≤FΩ,Y(M(cid:99))−FΩ,Y(M) Pluggingz = 12( x− y)2 = 8ξ(1x,y)(x−y)2 in(30),wehave =FΩ,Y(M(cid:99))+EFΩ,Y(M(cid:99))−EFΩ,Y(M(cid:99)) (cid:18) 1 √ √ (cid:19) 1−e−T 1 +EFΩ,Y(M)−EFΩ,Y(M)−FΩ,Y(M) 2−2exp − ( x− y)2 ≥ (x−y)2 (cid:104) (cid:105) 2 T 4ξ(x,y) ≤E FΩ,Y(M(cid:99))−FΩ,Y(M) + 1−e−T 1 (cid:12) (cid:12) ≥ T 4α(x−y)2. (cid:12)(cid:12)FΩ,Y(M(cid:99))−EFΩ,Y(M(cid:99))(cid:12)(cid:12)+|FΩ,Y(M)−EFΩ,Y(M)| (31) ≤−mD(M(cid:107)M(cid:99))+2 sup |FΩ,Y(X)−EFΩ,Y(X)|. Note that (31) holds for any x and y. This concludes the proof. X∈S Applying Lemma 2, we obtain that with probability at least √ Lemma 4. Let H (cid:44) (cid:8)M :(cid:107)M(cid:107) ≤α rd d ,(cid:107)M(cid:107) ≤α(cid:9) (1−C/(d1d2)), ∗ 1 2 ∞ and γ ≤1 be such that r is an integer. Suppose r/γ2 ≤d , γ2 1 0≤−mD(M(cid:107)M(cid:99)) then we may construct a set χ∈H of size +2C(cid:48)(cid:0)α√r/β(cid:1)(cid:0)α(e2−2)+3log(d d )(cid:1)· (cid:18) (cid:19) 1 2 |χ|≥exp rd2 (cid:16)(cid:112)m(d +d )+d d log(d d )(cid:17). 16γ2 1 2 1 2 1 2 √ with the following properties: After rearranging terms and applying the fact that d d ≤ 1 2 9 d1+d2, we obtain Now consider an algorithm that for any X ∈S returns X(cid:98) √ such that D(M(cid:107)M(cid:99))≤2(cid:16)C(cid:112)(cid:48)(cid:0)α r/β(cid:1)(cid:0)α(e2−2)+3log(d1d2)(cid:1)(cid:17)· (33) d1d (cid:107)X−X(cid:98)(cid:107)2F <(cid:15)2 (36) m(d +d )+(d +d )2log(d d ) . 1 2 1 2 1 2 1 2 with probability at least 1/4. Next, we will show this leas to Note that the KL divergence can be bounded below by the an contradiction. Let Hellinger distance (Chapter 3 in [37]): X∗ =arg min (cid:107)X(i)−X(cid:98)(cid:107)2, d2 (x,y)≤D(x(cid:107)y). X(i)∈χ F H by the same argument as that in [1], we have X∗ = X as Thus from (33), we obtain long as (36) holds. Using the assumption that (36) holds with √ d2H(M,M(cid:99))≤2C(cid:48)(cid:0)α r/β(cid:1)(cid:0)α(e2−2)+3log(d1d2)(cid:1)· probability at least 1/4, we have (cid:16)(cid:112)m(d1+d2)+(d1+d2)2log(d1d2)(cid:17). P(X∗ (cid:54)=X)≤ 3. (37) (34) 4 Using a generalized Fano’s inequality for the KL divergence in Finally, Theorem 1 is proved by applying Lemma 3. 
[38], we have max D(X(k)(cid:107)X(l))+1 P(X∗ (cid:54)=X)≥1− X(k)(cid:54)=X(l) . (38) Proof of Theorem 2: log|χ| We will prove by contradiction. Lemma 4 and Lemma 5 are Define D (cid:44) D(X(k)(cid:107)X(l)) = (cid:80) D(X(k)(cid:107)X(l)). We used in the proof. Without loss of generality, assume d ≥d . (i,j)∈Ω ij ij 2 1 know that each term in the sum is either 0, D(α(cid:107)α(cid:48)), or Choose (cid:15)>0 such that D(α(cid:48)(cid:107)α). From Lemma 5, since α(cid:48) <α, we have (cid:40) (cid:114) (cid:41) 1 rd (cid:15)2 =min ,C α3/2 2 , m(γα)2 64m(cid:15)2 256 2 m D ≤ ≤ . α(cid:48) α(cid:48) where C2 is an absolute constant that will be be specified later. Combining (37) and (38), we have that First, choose γ such that r is an integer and γ2 1 D+1 √ ≤1−P(X (cid:54)=X∗)≤ 4 2(cid:15) 8(cid:15) 1 4 log|χ| α ≤γ ≤ α ≤ 2. (cid:32)64m(cid:15)2 +1(cid:33) (cid:32)64m(cid:15)2 +1(cid:33) (39) ≤16γ2 α(cid:48) ≤1024(cid:15)2 α(cid:48) . We may make such a choice because rd α2rd 2 2 α2r r α2r ≤ ≤ Suppose 64m(cid:15)2 ≤α(cid:48), then with (39), we have 64(cid:15)2 γ2 32(cid:15)2 1 2 and ≤1024(cid:15)2 , α2r α2r α2r 4 α2rd2 − = >4α2r >1. 32(cid:15)2 64(cid:15)2 64(cid:15)2 which implies that α2rd ≤32. Then if we set C >32, this 2 0 leads to a contradiction. Next, suppose 64m(cid:15)2 >α(cid:48), then with Furthermore, since we have assumed that (cid:15)2 is larger than (39), we have Crα2/d , r/γ2 ≤ d for an appropriate choice of C. Let 1 1 χ(cid:48) bethesetdefinedinLemma4,byreplacingαwithα/2 1 (cid:18) 128m(cid:15)2 (cid:19) α/2,γ <1024(cid:15)2 . and with this choice of γ. Then we can construct a packing set 4 (1−γ)α3rd 2 χ of the same size as χ(cid:48) by defining α/2,γ Since 1−γ >1/2, we have (cid:110) (cid:16) γ(cid:17) (cid:111) χ(cid:44) X(cid:48)+α 1− 1 :X(cid:48) ∈χ(cid:48) . (cid:114) 2 d1×d2 α/2,γ (cid:15)2 > α3/2 rd2. The distance between pairs of elements in χ is bounded since 1024 m α2γ2d d Setting C2 ≤1/4096, this leads to a contradiction. Therefore, (cid:107)X(i)−X(j)(cid:107)2 ≥ 1 2 ≥4d d (cid:15)2. (35) (36) must be incorrect with probability at least 3/4. This F 4 2 1 2 concludes our proof. Define α(cid:48) (cid:44)(1−γ)α, then every entry of X ∈χ has X ∈ ij {α,α(cid:48)}. Since we have assumed r ≥4, for every X ∈χ, we have (cid:16) γ(cid:17) γ (cid:112) (cid:107)X(cid:107) =(cid:107)X(cid:48)+α 1− 1 (cid:107) ≤(cid:107)X(cid:48)(cid:107) +α(1− ) d d ∗ 2 d1×d2 ∗ ∗ 2 1 2Lemma 6. If f is a closed convex function satisfying Lipschitz α(cid:112) (cid:112) (cid:112) condition (10), then for any X,Y ∈S, the following inequality ≤ rd d +α d d ≤α rd d , 2 1 2 1 2 1 2 holds: for some X(cid:48) ∈χ(cid:48) . Since the γ we choose is less than 1/2, L α/2,γ f(Y)≤f(X)+(cid:104)∇f(X),Y −X(cid:105)+ (cid:107)Y −X(cid:107)2. α(cid:48) is greater than α/2. Therefore, from the assumption that 2 F β ≤α/2, we conclude that χ⊂S. 10
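As a closing illustration (ours, not part of the paper), the following NumPy sketch numerically checks the quadratic upper bound of Lemma 6 for the Poisson negative log-likelihood on the box $\beta\le X_{ij}\le\alpha$; for this sketch we use $\max_{ij} Y_{ij}/\beta^2$, a conservative Lipschitz constant for $\nabla f$ that is valid on the box.

```python
import numpy as np

rng = np.random.default_rng(3)
d1, d2, beta, alpha = 6, 5, 1.0, 10.0
mask = rng.random((d1, d2)) < 0.5
Y = rng.poisson(5.0, size=(d1, d2)) * mask
Lip = Y.max() / beta ** 2      # conservative Lipschitz constant for grad f on the box

def f(X):
    # Negative log-likelihood f(X) = sum over observed (i,j) of X_ij - Y_ij * log X_ij.
    return np.sum(mask * (X - Y * np.log(X)))

def grad_f(X):
    return mask * (1.0 - Y / X)

# Check f(V) <= f(X) + <grad f(X), V - X> + (Lip/2) * ||V - X||_F^2 on random box points.
for _ in range(200):
    X = rng.uniform(beta, alpha, size=(d1, d2))
    V = rng.uniform(beta, alpha, size=(d1, d2))
    rhs = f(X) + np.sum(grad_f(X) * (V - X)) + 0.5 * Lip * np.linalg.norm(V - X, "fro") ** 2
    assert f(V) <= rhs + 1e-9
```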
