BAYESIAN LEARNING FOR LOW-RANK MATRIX RECONSTRUCTION

Martin Sundin, Cristian R. Rojas, Magnus Jansson and Saikat Chatterjee

ACCESS Linnaeus Center, School of Electrical Engineering
KTH Royal Institute of Technology, Stockholm, Sweden
[email protected], [email protected], [email protected], [email protected]

This work was supported in part by the Swedish Research Council under contract 621-2011-5847. Parts of this paper were presented at the International Conference on Signal Processing and Communications (SPCOM), July 2014, Bangalore, India [35].

ABSTRACT

We develop latent variable models for Bayesian learning based low-rank matrix completion and reconstruction from linear measurements. For under-determined systems, the developed methods are shown to reconstruct low-rank matrices when neither the rank nor the noise power is known a-priori. We derive relations between the latent variable models and several low-rank promoting penalty functions. The relations justify the use of Kronecker structured covariance matrices in a Gaussian based prior. In the methods, we use evidence approximation and expectation-maximization to learn the model parameters. The performance of the methods is evaluated through extensive numerical simulations.

1. INTRODUCTION

Reconstruction of a high dimensional low-rank matrix from a low dimensional measurement vector is a challenging problem. The low-rank matrix reconstruction (LRMR) problem is inherently under-determined and has been receiving considerable attention [1,2] due to its generality over popular sparse reconstruction problems along with many application scopes [1-6]. Here we consider the LRMR system model

    y = A vec(X) + n,    (1)

where y ∈ R^m is the measurement vector, A ∈ R^{m×pq} is the linear measurement matrix, X ∈ R^{p×q} is the low-rank matrix, n ∈ R^m is additive noise (typically assumed to be zero-mean Gaussian with covariance Cov(n) = β^{-1} I_m) and vec(·) is the vectorization operator. With m < pq, the setup is under-determined and the task is the reconstruction (or estimation) of X from y. To deal with the under-determined setup, a typical and much used strategy is to use a regularization in the reconstruction cost function. Regularization brings in the information about low-rank priors. A typical type I estimator is

    X̂ = argmin_X  β ||y - A vec(X)||_2^2 + g(X),    (2)

where β > 0 is a regularization parameter and g(·) is a fixed penalty function that promotes low rank in X̂. Common low-rank penalties in the literature [1,3,7] are

    g(X) = ||X||_* = tr((X X^T)^{1/2}),    (nuclear norm)
    g(X) = tr((X X^T)^{s/2}),              (Schatten s-norm)
    g(X) = log |X X^T + ε I_p|,            (log-determinant penalty)

where tr(·) denotes the matrix trace, |·| denotes the determinant, 0 < s ≤ 1 and ε > 0. We mention that the nuclear norm penalty is a convex function.
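For concreteness, all three penalties above can be computed directly from the singular values of X. The short numpy sketch below is not part of the original paper; the matrix sizes and the values of s and ε are arbitrary choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q, r = 15, 30, 3          # illustrative sizes; X has rank r
s, eps = 0.5, 1e-3           # Schatten exponent and regularization parameter

# Random rank-r matrix X = L R (the same construction is used in the simulations)
X = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
sigma = np.linalg.svd(X, compute_uv=False)    # singular values of X

nuclear_norm = sigma.sum()                    # ||X||_* = sum_i sigma_i
schatten_s = (sigma ** s).sum()               # tr((X X^T)^{s/2}) = sum_i sigma_i^s
log_det = np.sum(np.log(sigma ** 2 + eps))    # log|X X^T + eps I_p| = sum_i log(sigma_i^2 + eps),
                                              # valid here since p <= q so X X^T has p eigenvalues sigma_i^2

print(nuclear_norm, schatten_s, log_det)
```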
In the literature, LRMR algorithms can be categorized in three types: convex optimization [1-3, 8-10], greedy solutions [4, 7, 11-13] and Bayesian learning [14]. Most of these existing algorithms are highly motivated from analogous algorithms used for standard sparse reconstruction problems, such as compressed sensing, where vec(X) in (1) is replaced by a sparse vector. Using convex optimization we can solve (2) when g(X) is the nuclear norm, which is an analogue of using the ℓ_1-norm in sparse reconstruction problems. Further, greedy algorithms, such as iteratively reweighted least squares [7], solve (2) by using algebraic approximations. While convex optimization and greedy solutions are popular, they often need more a-priori information than knowledge about the structure of the signal under reconstruction; for example, convex optimization algorithms need information about the strength of the measurement noise to fix the parameter β, and greedy algorithms need information about the rank. In the absence of such a-priori information, Bayesian learning is a preferred strategy. Bayesian learning is capable of estimating the necessary information from the measurement data.

In Bayesian learning we evaluate the posterior p(X|y) with the knowledge of the prior p(X). If X has a prior distribution p(X) ∝ e^{-g(X)/2} and the noise is distributed as n ~ N(0, β^{-1} I_m), then the maximum-a-posteriori (MAP) estimate can be interpreted as the type I estimate in (2). As type I estimation requires more information (such as β), type II estimators are often more useful. Type II estimation techniques use hyper-parameters in the form of latent variables with prior distributions. While for sparse reconstruction problems Bayesian learning via type II estimation, in the form of the relevance vector machine [15,16] and sparse Bayesian learning [17,18], has gained significant popularity, the endeavor to design type II estimation algorithms for LRMR is found to be limited. In [14], direct use of sparse Bayesian learning was used to realize an LRMR reconstruction algorithm. Bayesian approaches were used in [19,20] for a problem setup with a combination of low-rank and sparse priors, called principal component pursuit [21]. In [20], Gaussian and Bernoulli variables were used and the parameters were estimated using Markov chain Monte Carlo, while in [24] an empirical Bayesian approach was used. Type II estimation methods are typically iterative, where the latent variables are usually treated via variational techniques [14,19], evidence approximation [15,16], expectation-maximization [17,18] and Markov chain Monte Carlo [5,20].

Our objective in this paper is to develop new type II estimation methods for LRMR. Borrowing ideas from type II estimation techniques for sparse reconstruction, such as the relevance vector machine and sparse Bayesian learning algorithms, we model a low-rank matrix by a multiplication of precision matrices and an i.i.d. Gaussian matrix. The use of precision matrices helps to realize low-rank structures. The precision matrices are characterized by hyper-parameters which are treated as latent variables. The main contributions of this paper are as follows.

1. We introduce one-sided and two-sided precision matrix based models.
2. We show how the Schatten s-norm and log-determinant penalty functions are related to latent variable models in the sense of MAP estimation via the type I estimator (2).
3. For all new type II estimation methods, we derive update equations for all parameters in iterations. The methods are based on evidence approximation and expectation-maximization.
4. The methods are compared numerically to existing methods, such as the Bayesian learning method of [14] and the nuclear norm based convex optimization method of [1].

We are aware that evidence approximation and expectation-maximization are unable to provide globally optimal solutions. Hence we are unable to provide performance guarantees for our methods. This paper is organized as follows. We discuss the preliminaries of sparse Bayesian learning in Section 1.1. In Section 2 we introduce one-sided precisions for matrices and derive the relations to type I estimators. Two-sided precisions are introduced in Section 3, and in Section 4 we derive the update equations for the parameters. In Section 5 we numerically compare the performance of the algorithms for matrix reconstruction and matrix completion.
1.1. Preliminaries

In this section, we explain the relevance vector machine (RVM) [15,16,22] and sparse Bayesian learning (SBL) methods [17,18,24] for a standard sparse reconstruction problem. The setup is

    y = A x + n,    (3)

where x ∈ R^n is the sparse vector to be reconstructed from the measurement vector y ∈ R^m and n ~ N(0, β^{-1} I_m) is the additive measurement noise. The approach is to model the sparse signal x = [x_1 x_2 ... x_n]^T as

    x_i = γ_i^{-1/2} u_i,    (4)

where u_i ~ N(0,1) and γ_i > 0 is the precision of x_i. This is equivalent to setting

    p(x_i | γ_i) = sqrt(γ_i / (2π)) exp(-γ_i x_i^2 / 2) = N(x_i | 0, γ_i^{-1}).

The main idea is to use a learning algorithm for which several precisions go to infinity, leading to a sparse reconstruction. Alternatively said, the use of precisions allows to inculcate dominance of a few components over the other components in a sparse vector. Note that γ = [γ_1 γ_2 ... γ_n]^T and β are latent variables that also need to be estimated. We find the posterior

    p(x|y) = ∫ p(x|y,γ,β) p(γ,β|y) dγ dβ ≈ p(x|y, γ̂, β̂),

if p(γ,β|y) is assumed sharply peaked around γ̂, β̂ (this is a version of the so-called Laplace approximation described in Appendix .1). Assuming the knowledge of γ = γ̂ and β = β̂, the MAP estimate is

    x̂ ≜ [x̂_1 x̂_2 ... x̂_n]^T ← argmax_x p(x|y,γ,β) = β Σ A^T y,    (5)

where Σ = (diag(γ) + β A^T A)^{-1}. In the notion of iterative updates, we use ← to denote the assignment operator. The precisions γ_i and β are estimated by

    (γ_i^new, β^new) ← argmax_{γ_i, β} p(y,γ,β) = argmax_{γ_i, β} p(y|γ,β) p(γ) p(β),

where p(γ) = Π_i p(γ_i) and p(y|γ,β) = N(y | 0, A diag(γ)^{-1} A^T + β^{-1} I_m). Gamma distributions are typically chosen as hyper-priors for p(γ_i) and p(β), with the form

    p(β) = Gamma(β | a+1, b) = (b^{a+1} / Γ(a+1)) β^a e^{-bβ},

with a > -1, b > 0 and β ≥ 0. The evaluation of (γ_i^new, β^new) leads to coupled equations, which are therefore solved approximately as

    γ_i ← (1 - γ_i Σ_ii + 2a) / (x̂_i^2 + 2b)   and   β ← (Σ_{i=1}^n γ_i Σ_ii + 2a) / (||y - A x̂||_2^2 + 2b),    (6)

where Σ_ii is the i'th diagonal element of Σ. The parameters of the Gamma distributions for p(γ) and p(β) are typically chosen to be non-informative, i.e. a, b → 0. The update solutions of (5) and (6) are repeated iteratively until convergence. In the sparse Bayesian learning algorithm, a standard expectation-maximization framework is used to estimate γ and β.

Finally we mention that the RVM and SBL methods have a connection with type I estimation [24,25]. If the precisions have arbitrary prior distributions p(γ_i), then the marginal distribution of x_i becomes

    p(x_i) = ∫ p(x_i|γ_i) p(γ_i) dγ_i ∝ e^{-h(x_i)/2},

for some function h(·). Given (3) and for a known β, the MAP estimate is

    x̂ = argmin_x ( β ||y - A x||_2^2 + Σ_{i=1}^n h(x_i) ).

If p(γ_i) is a gamma prior then p(x_i) is a Student-t distribution with h(x_i) = constant × log(x_i^2 + constant). One rule of thumb is that a "more" concave h(x_i) gives a more sparsity promoting model [24]; see some example functions in Figure 1. In the figure, h(x_i) = |x_i| corresponds to a Laplace distributed variable x_i, log(x_i^2 + 1) to a Student-t, and |x_i|^{1/2} to a generalized normal distribution [23]. The relation between the sparsity promoting penalty function h(x_i) and the corresponding prior p(γ_i) of the latent variable γ_i was discussed in [24], see also [26] and [27].

Fig. 1. Illustration of different sparsity promoting penalty functions (|x_i|, log(x_i^2 + 1) and |x_i|^{1/2} plotted against x_i).
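As a concrete illustration of the update rules (5) and (6), a minimal numpy sketch of the resulting iterative algorithm follows. It is not part of the original paper: the iteration count, the cap on very large precisions and the small numerical guards are arbitrary illustrative choices, and the hyper-parameters are set to the non-informative limit a = b = 0.

```python
import numpy as np

def rvm_sbl(A, y, n_iter=200, gamma_max=1e8):
    """Sparse estimate of x in y = A x + n via the updates (5)-(6), a = b = 0."""
    m, n = A.shape
    gamma = np.ones(n)            # precisions of x_i
    beta = 1.0                    # noise precision
    for _ in range(n_iter):
        # (5): posterior covariance and mean for fixed (gamma, beta)
        Sigma = np.linalg.inv(np.diag(gamma) + beta * A.T @ A)
        x_hat = beta * Sigma @ A.T @ y
        # (6): approximate updates of the precisions
        gamma = (1.0 - gamma * np.diag(Sigma)) / (x_hat ** 2 + 1e-12)
        beta = np.sum(gamma * np.diag(Sigma)) / (np.sum((y - A @ x_hat) ** 2) + 1e-12)
        gamma = np.minimum(gamma, gamma_max)   # cap very large precisions (x_i -> 0)
    return x_hat

# small usage example
rng = np.random.default_rng(1)
m, n, k = 40, 100, 5
A = rng.standard_normal((m, n)) / np.sqrt(m)
x = np.zeros(n); x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
y = A @ x + 0.01 * rng.standard_normal(m)
x_hat = rvm_sbl(A, y)
```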
2. ONE-SIDED PRECISION BASED MODEL

The structure of a low-rank matrix X is characterized by the dominant singular vectors and singular values. Like the use of precisions in (4) for the standard sparse reconstruction problem via inculcating dominance, we propose to model the low-rank matrix X as

    X = α^{-1/2} U,    (7)

where the components of U ∈ R^{p×q} are i.i.d. N(0,1) and α ∈ R^{p×p} is a positive definite random matrix (whose distribution will be described later). This is equivalent to

    p(X|α) = (|α|^{q/2} / (2π)^{pq/2}) exp(-(1/2) tr(X^T α X)).    (8)

Denoting Z = X X^T and using tr(X^T α X) = tr(α X X^T) = tr(α Z), we evaluate

    p(X) = ∫_{α ≻ 0} p(X|α) p(α) dα
         = ∫_{α ≻ 0} (|α|^{q/2} / (2π)^{pq/2}) e^{-tr(αZ)/2} p(α) dα    (9)
         ∝ e^{-g̃(Z)/2} ∝ e^{-g(X)/2}.

We note that g(X) must have the special form g(X) = g̃(X X^T) for p(X) in (9) to hold (as α is integrated out). As p(X) ∝ e^{-g(X)/2}, the resulting MAP estimator can be interpreted as the type I estimator (2).

2.1. Relation between priors

Next we investigate the relation between the priors p(X) and p(α). The motivation is that the relations are necessary for designing practical learning algorithms. From (9), we note that p(X) is the Laplace transform of |α|^{q/2} p(α) / (2π)^{pq/2} [28], which establishes the relation. Naturally, we can find p(α) by the inverse Laplace transform [28] as follows:

    p(α) ∝ |α|^{-q/2} ∫_{Re Z = α*} e^{tr(αZ)/2} e^{-g̃(Z)/2} dZ,    (10)

where the integral is taken over all symmetric matrices Z ∈ C^{p×p} such that Re Z = α*, where α* is a real matrix chosen so that the contour path of integration lies in the region of convergence of the integrand. While the Laplace transform characterizes the exact relation between the priors, the computation is non-trivial and often analytically intractable. In practice, a standard approach is to use the Laplace approximation [29], where typically the mode of the distribution under approximation is found first and then a Gaussian distribution is modeled around that mode. Let p(α) have the form p(α) ∝ e^{-K(α)/2}; then the Laplace approximation becomes

    g̃(Z) = min_{α ≻ 0} { tr(αZ) - q log|α| + K(α) } - log|H| + constant,

where H is the Hessian of tr(αZ) - q log|α| + K(α) evaluated at the minimum (which is assumed to exist). The derivation of the Laplace approximation is shown in Appendix .1.

Denoting K̃(α) = q log|α| - K(α) and assuming that the Hessian is constant (independent of Z), we get that

    g̃(Z) = min_{α ≻ 0} { tr(αZ) - K̃(α) },

where we absorbed the constant terms into the normalization factor of p(X). We find that g̃(Z) is the concave conjugate of K̃(α) [10]. Hence, for a given g̃(Z) we can recover K̃(α) as

    K̃(α) = min_{Z ≻ 0} { tr(αZ) - g̃(Z) }    (11)

if K̃(α) is concave (which holds under the assumption that K(α) is convex). Further, we can find K(α) from K̃(α), followed by solving for the prior p(α) ∝ e^{-K(α)/2}.
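As a small numerical sanity check of the conjugate-type relation behind (11), the sketch below (not from the paper) uses the standard variational identity for the regularized nuclear norm, tr((Z + εI)^{1/2}) = (1/2) min_{α ≻ 0} [tr(α(Z + εI)) + tr(α^{-1})], whose minimizer is α* = (Z + εI)^{-1/2}; this is the s = 1 instance of the penalties treated next. The matrix sizes are arbitrary and scipy is assumed to be available.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(2)
p, q, r, eps = 4, 6, 2, 1e-2
X = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
Z = X @ X.T

# Regularized nuclear norm, i.e. (12) with s = 1: sum_i sqrt(sigma_i(X)^2 + eps)
g = np.trace(sqrtm(Z + eps * np.eye(p))).real

# Variational form, minimized by alpha* = (Z + eps I)^{-1/2}
alpha_star = np.linalg.inv(sqrtm(Z + eps * np.eye(p))).real
val = 0.5 * (np.trace(alpha_star @ (Z + eps * np.eye(p)))
             + np.trace(np.linalg.inv(alpha_star)))

print(g, val)   # the two values agree up to numerical precision
```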
Using the concave conjugate relation (11), we now deal with the task of finding an appropriate K(α) for two example low-rank promoting penalty functions, as follows.

1. For Schatten s-norm: The Schatten s-norm based penalty function is g(X) = tr((X X^T)^{s/2}). We here use a regularized Schatten s-norm based penalty function

    g(X) = tr((X X^T + ε I_p)^{s/2}),    (12)

where the use of ε > 0 helps to bring numerical stability to the algorithms in Section 4. For the penalty function (12), we find the appropriate K(α) as

    K(α) = C_s tr(α^{-s/(2-s)}) + q log|α| + ε tr(α),    (13)

where C_s = ((2-s)/s) (2/s)^{-s/(2-s)}. The derivation of (13) is given in Appendix .3. Note that, for s = 1, g(X) becomes the regularized nuclear norm based penalty function

    g(X) = tr((X X^T + ε I_p)^{1/2}) = Σ_{i=1}^{min(p,q)} (σ_i(X)^2 + ε)^{1/2}.

2. Log-determinant penalty: For the log-determinant based penalty function

    g(X) = ν log |X X^T + ε I_p|,    (14)

where ν > q - 2 is a real number, we find K(α) as

    K(α) = ε tr(α) + (q - ν) log|α|.    (15)

As p(α) ∝ e^{-K(α)/2}, we find that the prior on α is Wishart distributed (the Wishart is a conjugate prior for the distribution (8)). For a scalar instead of a matrix, the prior distribution becomes a Gamma distribution, as used in the standard RVM and SBL.

We have discussed a left-sided precision based model (7) in this section, but the same strategy can be easily extended to form a right-sided precision based model. Then a natural question arises: which model to use? Our hypothesis is that the user choice stems from minimizing the number of variables to estimate. If the low-rank matrix is fat then the left-sided model should be used, otherwise the right-sided model. A further question arises on the prospect of developing a two-sided precision based model, which is described in the next section.

3. TWO-SIDED PRECISION BASED MODEL

In this section, we propose to use precision matrices on both sides to model a random low-rank matrix. We call this the two-sided precision based model. Our hypothesis is that the two-sided precision helps to enhance dominance of a few singular vectors. For low-rank modeling, we make the following ansatz

    X = α_L^{-1/2} U α_R^{-1/2},    (16)

where α_L ∈ R^{p×p} and α_R ∈ R^{q×q} are positive definite random matrices. Using the relation vec(X) = (α_R^{-1/2} ⊗ α_L^{-1/2}) vec(U) = (α_R ⊗ α_L)^{-1/2} vec(U), we find

    p(X | α_L, α_R) = (|α_R ⊗ α_L|^{1/2} / (2π)^{pq/2}) exp(-vec(X)^T (α_R ⊗ α_L) vec(X) / 2)
                    = (|α_L|^{q/2} |α_R|^{p/2} / (2π)^{pq/2}) exp(-tr(X^T α_L X α_R) / 2).    (17)

To promote low rank, we use a prior distribution p(α_L, α_R) = p(α_L) p(α_R). The marginal distribution of X is

    p(X) = ∫_{α_L ≻ 0} ∫_{α_R ≻ 0} p(X | α_L, α_R) p(α_L) p(α_R) dα_R dα_L.    (18)

We have noticed that the use of (17) in evaluating (18) does not bring out suitable connections between the resulting p(X) functions and the usual low-rank promoting g(X) functions (such as the nuclear norm, Schatten s-norm and log-determinant). Thus it is non-trivial to establish a direct connection between p(X | α_L, α_R) of (17) and the type I estimator of (2).

Instead of a direct connection we can establish an indirect connection by an approximation. For a given (α_R, β) and by marginalizing over α_L, we have p(X | α_R) ∝ e^{-g̃(X α_R X^T)/2} and hence the corresponding type I estimator cost function is

    min_X  β ||y - A vec(X)||_2^2 + g̃(X α_R X^T).    (19)

A similar cost function can be found for a given (α_L, β) by marginalizing over α_R. We discuss the roles of α_L and α_R in the next section.

3.1. Interpretation of the precisions

From (16), we can see that the column and row vectors of X are in the range spaces of α_L^{-1} and α_R^{-1}, respectively.

Further, let us interpret this in a statistical sense with the note that a skewed (strongly non-diagonal) precision matrix comprises correlated components. Let us denote the (i,j)'th component of α_R^{-1} by [α_R^{-1}]_{ij}. If α_R^{-1} is highly skewed then [α_R^{-1}]_{ij} and [α_R^{-1}]_{ii} are comparable in magnitude. Suppose x_i ∈ R^p denotes the i'th column vector of X. Then, following (17), we can write

    [x_i^T, x_j^T]^T ~ N( 0, [ [α_R^{-1}]_{ii} α_L^{-1}   [α_R^{-1}]_{ij} α_L^{-1} ;
                               [α_R^{-1}]_{ji} α_L^{-1}   [α_R^{-1}]_{jj} α_L^{-1} ] ).

The above relation shows that the presence of a highly skewed α_R^{-1} leads to a cross-correlation [α_R^{-1}]_{ij} α_L^{-1} between x_i and x_j that is comparably strong to the auto-correlations [α_R^{-1}]_{ii} α_L^{-1}. We mention that a low-rank property can be established in a qualitative statistical sense by the presence of columns having strong cross-correlation. The one-sided precision based model can be seen as the two-sided model where α_R^{-1} = I_q. Hence the one-sided precision based model is unable to capture information about cross-correlation between the columns of X. A similar argument can be made for the right-sided precision based model, where α_L^{-1} = I_p.
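To make the Kronecker covariance structure in (17) concrete, the following sketch (not from the paper; the sizes and the construction of α_L, α_R are arbitrary illustrative choices) draws samples from the two-sided model (16) and compares the empirical covariance of vec(X) with (α_R ⊗ α_L)^{-1}.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(3)
p, q, n_samp = 4, 5, 200000

def random_precision(d):
    # arbitrary positive definite precision matrix for the illustration
    B = rng.standard_normal((d, 2 * d))
    return B @ B.T / (2 * d) + 0.1 * np.eye(d)

alpha_L, alpha_R = random_precision(p), random_precision(q)
L_half = np.linalg.inv(sqrtm(alpha_L)).real    # alpha_L^{-1/2}
R_half = np.linalg.inv(sqrtm(alpha_R)).real    # alpha_R^{-1/2}

# Draw X = alpha_L^{-1/2} U alpha_R^{-1/2} with U having i.i.d. N(0,1) entries, as in (16)
U = rng.standard_normal((n_samp, p, q))
X = L_half @ U @ R_half                        # batched matrix products
V = X.transpose(0, 2, 1).reshape(n_samp, p * q)  # vec(X) (column-stacked) for each sample

emp_cov = V.T @ V / n_samp                             # empirical Cov(vec(X))
model_cov = np.linalg.inv(np.kron(alpha_R, alpha_L))   # (alpha_R ⊗ alpha_L)^{-1}, cf. (17)
print(np.max(np.abs(emp_cov - model_cov)))             # small, up to Monte Carlo error
```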
4. PRACTICAL ALGORITHMS

Considering the potential of two-sided precision matrices, the optimal inference problem is

    max p(X, y, α_L, α_R, β),

which is the MAP estimator for amenable priors and is often connected with the type I estimator in (2). Direct handling of the optimal inference problem is limited due to the lack of analytical tractability. Therefore various approximations are used to design practical algorithms, which are also type II estimators. This section is dedicated to the design of new type II estimators via evidence approximation (as used by the RVM) and expectation-maximization (as used in SBL) approaches.

4.1. Evidence approximation

In the evidence approximation, we iteratively update the parameters as

    X̂ ← argmax_X p(X | y, α_L, α_R, β),    (20)
    β ← argmax_β p(y, α_L, α_R, β),    (21)
    α_L ← argmax_{α_L} p(y, α_L, α_R, β),   α_R ← argmax_{α_R} p(y, α_L, α_R, β).    (22)

The solution of (20) is the standard linear minimum mean square error (LMMSE) estimator

    vec(X̂) ← β Σ A^T y,   where   Σ = ((α_R ⊗ α_L) + β A^T A)^{-1}.

Using a standard approach (see equations (45) and (46) of [15] or (7.88) of [31]), the solution of (21) can be found as

    β ← (m + 2a) / (||y - A vec(X̂)||_2^2 + tr(A Σ A^T) + 2b).    (23)

The standard RVM in [15] uses the different update rule [30]

    β ← (tr((α_R ⊗ α_L) Σ) + 2a) / (||y - A vec(X̂)||_2^2 + 2b),    (24)

which often improves convergence [30]. The update rule (23) has the benefit over (24) of having established convergence properties. In the simulations we used the update rule (23) since it improved the estimation accuracy.

Finally we deal with (22) as follows.

1. For Schatten s-norm: Using the Schatten s-norm prior (13) gives us the update equations

    α_L ← c_s (X̂ α_R X̂^T + Σ̃_L + ε I_p)^{(s-2)/2},
    α_R ← c_s (X̂^T α_L X̂ + Σ̃_R + ε I_q)^{(s-2)/2},    (25)

where c_s = (s/2)^{s/2} and the matrices Σ̃_L and Σ̃_R have elements

    [Σ̃_L]_{ij} = tr(Σ (α_R ⊗ E_{ij}^{(L)})),   [Σ̃_R]_{ij} = tr(Σ (E_{ij}^{(R)} ⊗ α_L)),

and where E_{ij}^{(L)} ∈ R^{p×p} and E_{ij}^{(R)} ∈ R^{q×q} are matrices with a one in position (i,j) and zeros otherwise.

2. Log-determinant penalty: For the log-determinant prior (15) the update equations become

    α_L ← ν (X̂ α_R X̂^T + Σ̃_L + ε I_p)^{-1},
    α_R ← ν (X̂^T α_L X̂ + Σ̃_R + ε I_q)^{-1}.    (26)

We see that the update rule for the log-determinant penalty can be interpreted as (25) in the limit s → 0. The derivations of (25) and (26) are shown in Appendix .3 and Appendix .4. The corresponding update equations for the one-sided precision based model (7) are obtained by fixing the other precision matrix to be the identity matrix. In the spirit of the evidence approximation based relevance vector machine, we call the algorithms developed in this section the relevance singular vector machine (RSVM). For the Schatten s-norm and log-determinant priors, the methods are named RSVM-SN and RSVM-LD, respectively.
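For readers who prefer code, a minimal numpy sketch of one possible implementation of the evidence-approximation sweep (20), (23) and (25) for RSVM-SN is given below. It is not the authors' implementation: the initialization, iteration count and the non-informative choice a = b = 0 are illustrative assumptions, the precision rescaling of Section 4.3 is omitted, and the log-determinant variant (26) is obtained by replacing the last two updates accordingly.

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def rsvm_sn(A, y, p, q, s=0.5, eps=1e-3, n_iter=100):
    """RSVM-SN sketch: estimate the p x q matrix X from y = A vec(X) + n."""
    alpha_L, alpha_R, beta = np.eye(p), np.eye(q), 1.0
    c_s = (s / 2.0) ** (s / 2.0)
    for _ in range(n_iter):
        # (20): LMMSE step with Sigma = ((alpha_R ⊗ alpha_L) + beta A^T A)^{-1}
        Sigma = np.linalg.inv(np.kron(alpha_R, alpha_L) + beta * A.T @ A)
        x_vec = beta * Sigma @ A.T @ y
        X_hat = x_vec.reshape(p, q, order='F')
        # (23): noise precision update (a = b = 0)
        resid = y - A @ x_vec
        beta = len(y) / (resid @ resid + np.trace(A @ Sigma @ A.T))
        # Sigma-tilde matrices: [S_L]_{ij} = tr(Sigma (alpha_R ⊗ E_ij)), similarly S_R;
        # T[a,b,c,d] = Sigma[a*p+b, c*p+d] with the column-stacked vec ordering
        T = Sigma.reshape(q, p, q, p)
        S_L = np.einsum('ca,abcd->bd', alpha_R, T)
        S_R = np.einsum('db,abcd->ac', alpha_L, T)
        # (25): precision updates
        alpha_L = c_s * fractional_matrix_power(
            X_hat @ alpha_R @ X_hat.T + S_L + eps * np.eye(p), (s - 2) / 2.0).real
        alpha_R = c_s * fractional_matrix_power(
            X_hat.T @ alpha_L @ X_hat + S_R + eps * np.eye(q), (s - 2) / 2.0).real
    return X_hat
```

With the experiment sizes of Section 5 (p = 15, q = 30, s = 0.5), a call such as X_hat = rsvm_sn(A, y, 15, 30) mirrors how RSVM-SN is used there.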
4.2. EM

In expectation-maximization [31], the values of the precisions θ ≜ {α_L, α_R, β} are updated in each iteration by maximizing the cost (the EM help function in MAP estimation)

    Q(θ, θ') + log p(θ),    (27)

where θ' are the parameter values from the previous iteration. The function Q(θ, θ') is defined as

    Q(θ, θ') = E_{X|y,θ'}[log p(y, X | θ)]
             = constant - (β/2) ||y - A vec(X̂)||_2^2 - (1/2) tr(α_L X̂ α_R X̂^T) - (1/2) tr(Σ^{-1} Σ')
               + (q/2) log|α_L| + (p/2) log|α_R| + (m/2) log β,    (28)

where Σ' = ((α'_R ⊗ α'_L) + β' A^T A)^{-1} and E denotes the expectation operator. The maximization of Q(θ, θ') + log p(θ) leads to update equations which are identical to the update equations of the evidence approximation. That means that for the Schatten s-norm, the maximization leads to (20), (23) and (25), and for the log-determinant penalty, the maximization leads to (20), (23) and (26). For the noise precision, EM reproduces the update equation (23). The derivation of (28) and the update equations are shown in Appendix .2. Unlike evidence approximation, EM has monotonic convergence properties and hence the derived update equations are bound to improve the estimation performance over iterations.

4.3. Balancing the precisions

We have found that in practical algorithms there is a chance that one of the two precisions becomes large and the other small over iterations. A small precision results in numerical instability in the Kronecker covariance structure (17). To prevent the imbalance we rescale the matrix precisions in each iteration such that 1) the a-priori and a-posteriori squared Frobenius norms of X are equal,

    E[||X||_F^2 | α_L, α_R] = tr(α_L^{-1}) tr(α_R^{-1}) = E[||X||_F^2 | α_L, α_R, β, y] = ||X̂||_F^2 + tr(Σ),

and 2) the contributions of the precisions to the norm are equal,

    tr(α_L^{-1}) = tr(α_R^{-1}).

The rescaling makes the algorithm more stable and often improves the estimation performance.

5. SIMULATION EXPERIMENTS

In this section we numerically verify our two hypotheses, and compare the new algorithms with relevant existing algorithms. Our objectives are to verify:

• the hypothesis that the left-sided precision is better than the right-sided precision for a fat low-rank matrix,
• the hypothesis that the two-sided precision based model performs better than the one-sided precision based model,
• that the proposed methods perform better than a nuclear-norm minimization based convex algorithm and a variational Bayes algorithm.

In the simulations we considered low-rank matrix reconstruction and also matrix completion as a special case, due to its popularity.

5.1. Performance measure, experimental setup and competing algorithms

To compare the algorithms, the performance measure is the normalized mean square error

    NMSE ≜ E[||X̂ - X||_F^2] / E[||X||_F^2].

In the experiments we varied the value of one parameter while keeping the other parameters fixed. For given parameter values, we evaluated the NMSE as follows (steps 1-4 are also illustrated in a code sketch given after the list).

1. For LRMR, the random measurement matrix A ∈ R^{m×pq} was generated by independently drawing the elements from N(0,1) and normalizing the column vectors to unit norm. For low-rank matrix completion, each row of A contains a one in a random position and zeros otherwise, with the constraint that the rows are linearly independent.
2. Matrices L ∈ R^{p×r} and R ∈ R^{r×q} with elements drawn from N(0,1) were randomly generated and the matrix X was formed as X = LR. Note that X is of rank r (with probability 1).
3. Generate the measurement y = A vec(X) + n, where n ~ N(0, σ_n^2 I_m) and σ_n^2 is chosen such that the signal-to-measurement-noise ratio is

       SMNR ≜ E[||A vec(X)||_2^2] / E[||n||_2^2] = rpq / (m σ_n^2).

4. Estimate X̂ using the competing algorithms and calculate the error ||X̂ - X||_F^2.
5. Repeat steps 2-4 for each measurement matrix T_1 number of times.
6. Repeat steps 1-5 for the same parameter values T_2 number of times.
7. Then compute the NMSE by averaging.

In the simulations we chose T_1 = T_2 = 25, which means that the averaging was done over 625 realizations. We normalized the column vectors of A to make the SMNR expression realization independent.
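The following numpy sketch (not part of the paper) illustrates steps 1-4 above for the LRMR case; the concrete parameter values follow the first experiment in Section 5.2, and the naive pseudo-inverse estimate in step 4 is only a stand-in for the competing algorithms (for instance the RSVM-SN sketch given earlier).

```python
import numpy as np

rng = np.random.default_rng(4)
p, q, r = 15, 30, 3
m = int(0.7 * p * q)
smnr_db = 20.0

# Step 1: Gaussian measurement matrix with unit-norm columns
A = rng.standard_normal((m, p * q))
A /= np.linalg.norm(A, axis=0)

# Step 2: rank-r matrix X = L R
X = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))

# Step 3: noisy measurements at the prescribed SMNR = r p q / (m sigma_n^2)
smnr = 10.0 ** (smnr_db / 10.0)
sigma2_n = r * p * q / (m * smnr)
y = A @ X.flatten(order='F') + np.sqrt(sigma2_n) * rng.standard_normal(m)

# Step 4: estimate X with a chosen algorithm; here a naive least-squares
# (pseudo-inverse) estimate stands in for the competing estimators
X_hat = (np.linalg.pinv(A) @ y).reshape(p, q, order='F')
err = np.sum((X_hat - X) ** 2)   # ||X_hat - X||_F^2, accumulated for the NMSE average
```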
Finally we describe the competing algorithms. For comparison, we used the nuclear norm based estimator

    X̂ = argmin_X ||X||_*   s.t.   ||y - A vec(X)||_2 ≤ ε,

where we used ε = σ_n sqrt(m + sqrt(8m)) as proposed in [32]. The cvx toolbox [33] was used to implement the estimator. For matrix completion we also compared with the variational Bayesian (VB) method developed by Babacan et al. [14]. In VB, the matrix X is factorized as X = LR, and (block) sparsity inducing priors are used for the column vectors of L ∈ R^{p×min(p,q)} and R^T ∈ R^{q×min(p,q)}. The VB algorithm was developed for matrix completion (and robust PCA), but not for matrix reconstruction. We note that, unlike RSVM and VB, the nuclear norm estimator requires a-priori knowledge of the noise power. We also compared the algorithms to the Cramér-Rao bound (CRB) from [12,34] (as we know the rank a-priori in our experimental setup). We mention that the CRB is not always a valid lower bound in this experimental setup, because all technical conditions for computing a valid CRB are not always fulfilled and the estimators are not always unbiased. The choice of the CRB is due to the absence of any other relevant theoretical bound.

Fig. 2. NMSE vs. m/(pq) for low-rank matrix reconstruction.

Fig. 3. NMSE vs. m/(pq) and SMNR for low-rank matrix reconstruction. (a) SMNR = 20 dB and m/(pq) is varied. (b) m/(pq) = 0.7 and SMNR is varied.

5.2. Simulation results

Our first experiment is for verification of the first two hypotheses. For the experiment, we considered LRMR and fixed rank(X) = r = 3, p = 15, q = 30, SMNR = 20 dB, and varied m. The results are shown in Figure 2, where NMSE is plotted against the normalized number of measurements m/(pq). We note that RSVM-SN with the left precision is better than with the right precision. The same result also holds for RSVM-LD. This verifies the first hypothesis. Further, we see that RSVM-SN and RSVM-LD with two-sided precisions are better than the respective one-sided precisions. This result verifies the second hypothesis. In the experiments we used s = 0.5 for RSVM-SN as it was found to be the best (empirically). Henceforth we fix s = 0.5 for RSVM-SN.

The second experiment considers a comparison with the nuclear-norm based algorithm and the CRB for LRMR. The objective is a robustness study by varying the number of measurements and the measurement noise power.
We used r = 3, p = 15 and q = 30. In Figure 3 (a) we show the performance against varying m/(pq); the SMNR = 20 dB was fixed. The performance improvement of RSVM-SN over the nuclear-norm based algorithm is more pronounced in the low measurement region. Now we fix m/(pq) = 0.7 and vary the SMNR. The results are shown in Figure 3 (b), which confirms robustness against measurement noise.

Next we deal with matrix completion, where the measurement matrix A has a special structure and is considered to carry less information about X than a random measurement matrix of the same dimension used in LRMR. Therefore matrix completion requires more measurements and higher SMNR. We performed similar experiments as in our second experiment and the results are shown in Figure 4. In these experiments the performance of the VB algorithm is included. It can be seen that RSVM-SN is typically better than the other algorithms. We find that the VB algorithm is pessimistic.

Finally, in our last experiment we investigated the VB algorithm to find conditions for its improvement and compared it with RSVM-SN. For this experiment, we fixed r = 3, p = 15, m/(pq) = 0.7 and SMNR = 20 dB, and varied q. The results are shown in Figure 5 and we see that VB provides good performance when p = q. The result may be attributed to the aspect that VB is highly prone to a large number of model parameters, which arises in case X is away from a square matrix.

Fig. 4. NMSE vs. m/(pq) and SMNR for low-rank matrix completion. (a) SMNR = 20 dB and m/(pq) is varied. (b) m/(pq) = 0.7 and SMNR is varied.

Fig. 5. NMSE vs. q for low-rank matrix completion.

6. CONCLUSION

In this paper we developed Bayesian learning algorithms for low-rank matrix reconstruction. The framework relates low-rank penalty functions (type I estimators) to latent variable models (type II estimators) with either left- or right-sided precisions, through the matrix Laplace transform and the concave conjugate formula. The model was further extended to the two-sided precision based model. Using evidence approximation and expectation-maximization, we derived the update equations for the parameters. The resulting algorithm was named the relevance singular vector machine (RSVM) due to its similarity with the relevance vector machine for sparse vectors. In particular, we derived the update equations for the estimators corresponding to the log-determinant penalty and the Schatten s-norm penalty; we named these algorithms RSVM-LD and RSVM-SN, respectively.

Through simulations, we showed that the two-sided precision based model performs better than the one-sided model for matrix reconstruction. The algorithm also outperformed a nuclear-norm based estimator, even though the nuclear-norm based estimator knew the noise power. The proposed methods also outperformed a variational Bayes method for matrix completion when the matrix is not square.

APPENDIX: DERIVATIONS

.1. Derivation of the Laplace Approximation

The Laplace approximation is an approximation of the integral

    I = ∫ e^{-f(a)/2} da,

where the integral is over a ∈ R^n. The function f(a) is approximated by a second order polynomial around its minimum a_0 as

    f(a) ≈ f(a_0) + (1/2)(a - a_0)^T H (a - a_0),

where H = ∇^2 f(a)|_{a=a_0} is the Hessian of f(a) at a_0. The term linear in a vanishes and H ≻ 0 at a_0 since we expand around a minimum. With this approximation, the integral becomes

    I ≈ ∫ e^{-f(a_0)/2 - (1/4)(a - a_0)^T H (a - a_0)} da = sqrt((4π)^n / |H|) e^{-f(a_0)/2}.

In (9), the integral is given by

    I = (1/(2π)^{pq/2}) ∫_{α ≻ 0} e^{-(1/2)[tr(αZ) - q log|α| + K(α)]} dα.

Set f(a) = tr(αZ) - q log|α| + K(α), where a = vec(α). Let α_0 ≻ 0 denote the minimum of f(a) and H the Hessian at α_0. Assuming that α_0 and H are "large" in the sense that the integral over α ≻ 0 can be approximated by the integral over α ∈ R^{p×p}, we find that

    I ≈ (1/(2π)^{pq/2}) ∫ e^{-f(a_0)/2 - (1/4)(a - a_0)^T H (a - a_0)} da = ((4π)^{p^2/2} / ((2π)^{pq/2} |H|^{1/2})) e^{-f(a_0)/2},

where a_0 = vec(α_0).
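As a quick illustration of the approximation above (not part of the paper, and with an arbitrarily chosen nearly quadratic test function), the sketch below compares the Laplace estimate sqrt((4π)^n/|H|) e^{-f(a_0)/2} with a numerical quadrature of ∫ e^{-f(a)/2} da in one dimension.

```python
import numpy as np
from scipy.integrate import quad

# test function with minimum at a0 = 1: f(a) = (a - 1)^2 + 0.02 (a - 1)^4
f = lambda a: (a - 1.0) ** 2 + 0.02 * (a - 1.0) ** 4
a0, H = 1.0, 2.0                        # minimizer and second derivative f''(a0)

exact, _ = quad(lambda a: np.exp(-0.5 * f(a)), -np.inf, np.inf)
laplace = np.sqrt(4.0 * np.pi / H) * np.exp(-0.5 * f(a0))
print(exact, laplace)                   # agree to within a few percent for this f
```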
.2. The EM help function

The EM help function Q(θ, θ') is given by

    Q(θ, θ') = E_{X|y,θ'}[log p(y, X | θ)]
             = c + (m/2) log β - (β/2) E[||y - A vec(X)||_2^2] - (1/2) E[tr(α_L X α_R X^T)]
               + (q/2) log|α_L| + (p/2) log|α_R|,

where c is a constant. Using that

    E[||y - A vec(X)||_2^2] = ||y||_2^2 - 2 y^T A vec(X̂) + tr(A^T A (vec(X̂) vec(X̂)^T + Σ'))
                            = ||y - A vec(X̂)||_2^2 + tr(A^T A Σ'),

and

    E[tr(α_L X α_R X^T)] = tr((α_R ⊗ α_L)(vec(X̂) vec(X̂)^T + Σ'))
                         = tr(α_L X̂ α_R X̂^T) + tr((α_R ⊗ α_L) Σ'),

we recover the expression (28) for the EM help function.

.3. Details for the RSVM with the Schatten s-norm penalty

We here write S in place of εI to keep the derivation more general. The regularized Schatten s-norm penalty is given by

    g̃(Z) = tr((Z + S)^{s/2}).

For the concave conjugate formula (11) we find that the minimum over Z occurs when

    α - (s/2)(Z + S)^{(s-2)/2} = 0.

Solving for Z gives us that

    K̃(α) = -tr(αS) - C_s tr(α^{-s/(2-s)}),

with C_s as in (13), which results in (13).

Using (28), we find that the optimum of (27) for the Schatten s-norm occurs when

    X̂ α_R X̂^T + Σ̃_L + S - (2/s)^{-s/(2-s)} α_L^{-2/(2-s)} = 0.

Solving for α_L gives (25) for α_L. The update equation for α_R is derived in a similar manner.

.4. Details for the RSVM with the log-determinant penalty

The log-determinant penalty is given by

    g̃(Z) = ν log|Z + S|.

For the concave conjugate formula (11) we find that the minimum over Z occurs when

    α - ν(Z + S)^{-1} = 0.

Solving for Z gives

    K̃(α) = -tr(αS) + ν log|α| + constant.

By removing the constants we recover (15).

Using (28), we find that the optimum of (27) with respect to α_L for the log-determinant penalty occurs when

    X̂ α_R X̂^T + Σ̃_L + S - ν α_L^{-1} = 0.

Solving for α_L gives us (26) for α_L. The derivation of the update equation for α_R is found in a similar way.
REFERENCES

[1] E.J. Candès and Y. Plan, "Matrix Completion With Noise," Proceedings of the IEEE, vol. 98, no. 6, pp. 925-936, June 2010.
[2] E.J. Candès and B. Recht, "Exact matrix completion via convex optimization," Foundations of Computational Mathematics, vol. 9, no. 6, pp. 717-772, December 2009.
[3] M. Fazel, "Matrix rank minimization with applications," PhD thesis, Stanford University, 2002.
[4] D. Zachariah, M. Sundin, M. Jansson and S. Chatterjee, "Alternating Least-Squares for Low-Rank Matrix Reconstruction," IEEE Signal Processing Letters, vol. 19, no. 4, pp. 231-234, April 2012.
[5] K. Yu, J. Lafferty, S. Zhu and Y. Gong, "Large-scale collaborative prediction using a nonparametric random effects model," Proceedings of the 26th Annual International Conference on Machine Learning, ACM, 2009.
[6] J. Fan, Y. Fan and J. Lv, "High dimensional covariance matrix estimation using a factor model," Journal of Econometrics, vol. 147, issue 1, pp. 186-197, 2008.
[7] M. Fornasier, H. Rauhut and R. Ward, "Low-rank matrix recovery via iteratively reweighted least squares minimization," SIAM Journal on Optimization, vol. 21, no. 4, pp. 1614-1640, 2011.
[8] E.J. Candès and T. Tao, "The Power of Convex Relaxation: Near-Optimal Matrix Completion," IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2053-2080, May 2010.
[9] J.-F. Cai, E.J. Candès and Z. Shen, "A singular value thresholding algorithm for matrix completion," SIAM Journal on Optimization, vol. 20, no. 4, pp. 1956-1982, March 2010.
[10] S. Boyd and L. Vandenberghe, "Convex Optimization," Cambridge, Cambridge University Press, 2009.
[11] K. Lee and Y. Bresler, "ADMiRA: Atomic decomposition for minimum rank approximation," IEEE Transactions on Information Theory, vol. 56, no. 9, pp. 4402-4416, September 2010.
[12] G. Tang and A. Nehorai, "Lower Bounds on the Mean-Squared Error of Low-Rank Matrix Reconstruction," IEEE Transactions on Signal Processing, vol. 59, no. 10, pp. 4559-4571, October 2011.
[13] I. Johnstone and A.Y. Lu, "On consistency and sparsity for principal components analysis in high dimensions," Journal of the American Statistical Association, vol. 104, no. 486, 2009.
[14] S.D. Babacan, M. Luessi, R. Molina and A.K. Katsaggelos, "Sparse Bayesian Methods for Low-Rank Matrix Estimation," IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 3964-3977, August 2012.
[15] M.E. Tipping, "Sparse Bayesian learning and the relevance vector machine," The Journal of Machine Learning Research, vol. 1, pp. 211-244, January 2001.
[16] M. Tipping and A. Faul, "Fast marginal likelihood maximisation for sparse Bayesian models," Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, vol. 1, no. 3, 2003.
[17] Z. Zhang and B.D. Rao, "Extension of SBL algorithms for the recovery of block sparse signals with intra-block correlation," IEEE Transactions on Signal Processing, vol. 61, no. 8, pp. 2009-2015, April 2013.
[18] Z. Zhang and B.D. Rao, "Sparse Signal Recovery With Temporally Correlated Source Vectors Using Sparse Bayesian Learning," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 5, pp. 912-926, September 2011.
[19] D. Wipf, "Non-Convex Rank Minimization via an Empirical Bayesian Approach," Uncertainty in Artificial Intelligence (UAI), 2012.
[20] X. Ding, L. He and L. Carin, "Bayesian robust principal component analysis," IEEE Transactions on Image Processing, vol. 20, no. 12, pp. 3419-3430, 2011.
[21] E.J. Candès, X. Li, Y. Ma and J. Wright, "Robust principal component analysis?," Journal of the ACM, vol. 58, no. 3, 2011.
[22] S. Ji, Y. Xue and L. Carin, "Bayesian Compressive Sensing," IEEE Transactions on Signal Processing, vol. 56, no. 6, pp. 2346-2356, June 2008.
[23] A.K. Gupta and V.L. Girko, "Multidimensional Statistical Analysis and Theory of Random Matrices," Proceedings of the Sixth Eugene Lukacs Symposium, Bowling Green, OH, USA, 29-30 March 1996.
[24] D.P. Wipf, B.D. Rao and S. Nagarajan, "Latent Variable Bayesian Models for Promoting Sparsity," IEEE Transactions on Information Theory, vol. 57, no. 9, pp. 6236-6255, September 2011.
[25] S.D. Babacan, R. Molina and A.K. Katsaggelos, "Bayesian Compressive Sensing Using Laplace Priors," IEEE Transactions on Image Processing, vol. 19, no. 1, pp. 53-63, January 2010.
[26] C.R. Rojas, D. Katselis and H. Hjalmarsson, "A Note on the SPICE Method," IEEE Transactions on Signal Processing, vol. 61, no. 18, pp. 4545-4551, September 2013.
[27] P. Babu and P. Stoica, "Connection between SPICE and Square-Root LASSO for sparse parameter estimation," Signal Processing, vol. 95, pp. 10-14, February 2014.
[28] A. Terras, "Harmonic analysis on symmetric spaces and applications," Berlin, Springer, 1985.
[29] D.J.C. MacKay, "Information theory, inference, and learning algorithms," Cambridge, Cambridge University Press, 2003.
[30] D.J.C. MacKay, "Bayesian interpolation," Neural Computation, vol. 4, no. 3, pp. 415-447, May 1992.
[31] C.M. Bishop, "Pattern recognition and machine learning," New York, Springer, 2006.
[32] E.J. Candès, J.K. Romberg and T. Tao, "Stable signal recovery from incomplete and inaccurate measurements," Communications on Pure and Applied Mathematics, vol. 59, no. 8, pp. 1207-1223, March 2006.
[33] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming," version 2.0 beta, http://cvxr.com/cvx, September 2013.
[34] K. Werner and M. Jansson, "Reduced rank linear regression and weighted low rank approximations," IEEE Transactions on Signal Processing, vol. 54, no. 6, pp. 2063-2075, June 2006.
[35] M. Sundin, S. Chatterjee, M. Jansson and C.R. Rojas, "Relevance Singular Vector Machine for low-rank matrix sensing," International Conference on Signal Processing and Communications (SPCOM), Indian Institute of Science, Bangalore, India, July 2014. Available online from http://arxiv.org/abs/1407.0013.