Rank Regularization and Bayesian Inference for Tensor Completion and Extrapolation

Juan Andrés Bazerque, Gonzalo Mateos, and Georgios B. Giannakis (contact author)

This work was supported by MURI (AFOSR FA9550-10-1-0567) grant. Part of the paper appeared in the Proc. of IEEE Workshop on Statistical Signal Processing, Ann Arbor, USA, August 5-8, 2012. The authors are with the Dept. of ECE and the Digital Technology Center, University of Minnesota, 200 Union Street SE, Minneapolis, MN 55455. Tel/fax: (612) 626-7781/625-2002; Emails: {bazer002,mate0058,georgios}@umn.edu

Abstract—A novel regularizer of the PARAFAC decomposition factors capturing the tensor's rank is proposed in this paper, as the key enabler for completion of three-way data arrays with missing entries. Set in a Bayesian framework, the tensor completion method incorporates prior information to enhance its smoothing and prediction capabilities. This probabilistic approach can naturally accommodate general models for the data distribution, lending itself to various fitting criteria that yield optimum estimates in the maximum-a-posteriori sense. In particular, two algorithms are devised for Gaussian- and Poisson-distributed data, which minimize the rank-regularized least-squares error and Kullback-Leibler divergence, respectively. The proposed technique is able to recover the "ground-truth" tensor rank when tested on synthetic data, and to complete brain imaging and yeast gene expression datasets with 50% and 15% of missing entries respectively, resulting in recovery errors at -10 dB and -15 dB.

Index Terms—Tensor, low-rank, missing data, Bayesian inference, Poisson process.

I. INTRODUCTION

Imputation of missing data is a basic task arising in Big Data applications as diverse as medical imaging [12], bioinformatics [3], as well as social and computer networking [10], [17]. The key idea rendering recovery feasible is the "regularity" present among missing and available data. Low rank is an attribute capturing this regularity, and can be readily exploited when data are organized in a matrix. A natural approach to the low-rank matrix completion problem is to minimize the rank of a target matrix, subject to a constraint on the error in fitting the observed entries [4].
Since rank minimization is generally NP-hard [26], the nuclear norm has recently been advocated as a convex surrogate to the rank [11]. Beyond tractability, nuclear-norm minimization enjoys good performance both in theory and in practice [4].

The goal of this paper is imputation of missing entries of tensors (also known as multi-way arrays), which are higher-order generalizations of matrices frequently encountered in chemometrics, medical imaging, and networking [8], [16]. Leveraging the low-rank structure for tensor completion is challenging, since even computing the tensor rank is NP-hard [14]. Defining a nuclear-norm surrogate is not obvious either, since singular values as defined by the Tucker decomposition are not generally related to the rank. Traditional approaches to finding low-dimensional representations of tensors include unfolding the multi-way data and applying matrix factorizations such as the singular-value decomposition (SVD) [3], [7], [25], or employing the parallel factor (PARAFAC) decomposition [9], [24]. In the context of tensor completion, an approach falling under the first category can be found in [12], while imputation using PARAFAC was dealt with in [2].

The imputation approach presented in this paper builds on a novel regularizer accounting for the tensor rank, which relies on redefining the matrix nuclear norm in terms of its low-rank factors. The contribution is two-fold. First, it is established that the low-rank inducing property of the regularizer carries over to tensors by promoting sparsity in the factors of the tensor's PARAFAC decomposition. In passing, this analysis allows for drawing a neat connection with the atomic norm in [5]. The second contribution is the incorporation of prior information, with a Bayesian approach that endows tensor completion with extra smoothing and prediction capabilities. A parallel analysis in the context of reproducing kernel Hilbert spaces (RKHS) further explains these acquired capabilities, provides an alternative means of obtaining the prior information, and establishes a useful connection with collaborative filtering approaches [1] when reduced to the matrix case.

While least-squares (LS) is typically utilized as the fitting criterion for matrix and tensor completion, implicitly assuming Gaussian data, the adopted probabilistic framework supports the incorporation of alternative data models. Targeting count processes available in the form of network traffic data, genome sequencing, and social media interactions, which are modeled as Poisson distributed, the maximum a posteriori (MAP) estimator is expressed in terms of the Kullback-Leibler (K-L) divergence [10].

The remainder of the paper is organized as follows. Section II offers the necessary background on nuclear-norm regularization for matrices, the PARAFAC decomposition, and the definition of tensor rank. Section III presents the tensor completion problem, establishing the low-rank inducing property of the proposed regularization. Prior information is incorporated in Section IV, with Bayesian and RKHS formulations of the tensor imputation method, leading to the low-rank tensor-imputation (LRTI) algorithm. Section V develops the method for Poisson tensor data, and redesigns the algorithm to minimize the rank-regularized K-L divergence. Finally, Section VI presents numerical tests carried out on synthetic and real data, including expression levels in yeast, and brain magnetic resonance images (MRI). Conclusions are drawn in Section VII, while most technical details are deferred to the Appendix.

The notation adopted throughout includes bold lowercase and capital letters for vectors a and matrices A, respectively, with superscript T denoting transposition. Tensors are underlined, e.g., $\underline{X}$, and their slices carry a subscript as in $X_p$; see also Fig. 1. Both the matrix and tensor Frobenius norms are represented by $\|\cdot\|_F$. Symbols $\otimes$, $\odot$, $\circledast$, and $\circ$ denote the Kronecker, Khatri-Rao, Hadamard (entry-wise), and outer products, respectively.

II. PRELIMINARIES

A. Nuclear-norm minimization for matrix completion

Fig. 1. Tensor slices along the row, column, and tube dimensions.

Low-rank approximation is a popular method for estimating missing values of a matrix $Z\in\mathbb{R}^{N\times M}$, which capitalizes on "regularities" across the data [11]. For the imputation to be feasible, a binding assumption that relates the available entries with the missing ones is required. An alternative is to postulate that $Z$ has low rank $R\ll\min(N,M)$. The problem of finding a matrix $\hat{Z}$ with rank not exceeding $R$, which approximates $Z$ in the given entries specified by a binary matrix $\Delta\in\{0,1\}^{N\times M}$, can be formulated as
$$\hat{Z} = \arg\min_{X} \|(Z-X)\circledast\Delta\|_F^2 \quad \text{s. to } \operatorname{rank}(X)\le R. \qquad (1)$$

The low-rank property of matrix X implies that the vector s(X) of its singular values is sparse. Hence, the rank constraint is equivalent to $\|s(X)\|_0\le R$, where the $\ell_0$-(pseudo)norm $\|\cdot\|_0$ equals the number of nonzero entries of its vector argument. Aiming at a convex relaxation of the NP-hard problem (1), one can leverage recent advances in compressive sampling [11] and surrogate the $\ell_0$-norm with the $\ell_1$-norm, which here equals the nuclear norm of X, defined as $\|X\|_* := \|s(X)\|_1$. With this relaxation, the Lagrangian counterpart of (1) is

$$\hat{Z} = \arg\min_{X} \tfrac{1}{2}\|(Z-X)\circledast\Delta\|_F^2 + \mu\|X\|_* \qquad (2)$$

where $\mu\ge 0$ is a rank-controlling parameter. Problem (2) can be further transformed by considering the following characterization of the nuclear norm [23]

$$\|X\|_* = \min_{\{B,C\}} \tfrac{1}{2}\left(\|B\|_F^2+\|C\|_F^2\right) \quad \text{s. to } X=BC^T. \qquad (3)$$

For an arbitrary matrix X with SVD $X=U\Sigma V^T$, the minimum in (3) is attained for $B=U\Sigma^{1/2}$ and $C=V\Sigma^{1/2}$. The optimization in (3) is over all possible bilinear factorizations of X, so that the number of columns of B and C is also a variable. Building on (3), one can arrive at the following equivalent reformulation of (2) [17]

$$\hat{Z}' = \arg\min_{\{X,B,C\}} \tfrac{1}{2}\|(Z-X)\circledast\Delta\|_F^2 + \tfrac{\mu}{2}\left(\|B\|_F^2+\|C\|_F^2\right) \quad \text{s. to } X=BC^T. \qquad (4)$$

The equivalence implies that by finding the global minimum of (4), one can recover the optimal solution of (2). However, since (4) is nonconvex, it may have multiple stationary points. Interestingly, the next result provides conditions for these stationary points to be globally optimal (parts a) and b) are proved in the Appendix, while the proof for c) can be found in [17]).

Proposition 1: Problems (2) and (4) are equivalent, in the sense that:
a) the global minima coincide: $\hat{Z}=\hat{Z}'$;
b) all local minima of (4) are globally optimal; and,
c) stationary points X of (4) satisfying $\|(X-Z)\circledast\Delta\|_2\le\mu$ are globally optimal.

This result plays a critical role in this paper, as the Frobenius-norm regularization for controlling the rank in (4) will be useful to obtain its tensor counterparts in Section III.

B. PARAFAC decomposition

The PARAFAC decomposition of a tensor $\underline{X}\in\mathbb{R}^{M\times N\times P}$ is at the heart of the proposed imputation method, since it offers a means to define its rank [9], [24]. Given $R\in\mathbb{N}$, consider matrices $A\in\mathbb{R}^{M\times R}$, $B\in\mathbb{R}^{N\times R}$, and $C\in\mathbb{R}^{P\times R}$, such that

$$\underline{X}(m,n,p)=\sum_{r=1}^{R} A(m,r)B(n,r)C(p,r). \qquad (5)$$

The rank of $\underline{X}$ is the minimum value of R for which this decomposition is possible. For $R^*:=\operatorname{rank}(\underline{X})$, the PARAFAC decomposition is given by the corresponding factor matrices {A, B, C} (all with $R^*$ columns), so that (5) holds with $R=R^*$.

To appreciate why the aforementioned rank definition is natural, rewrite (5) as $\underline{X}=\sum_{r=1}^R a_r\circ b_r\circ c_r$, where $a_r$, $b_r$, and $c_r$ represent the r-th columns of A, B, and C, respectively; and the outer products $\underline{O}_r:=a_r\circ b_r\circ c_r\in\mathbb{R}^{M\times N\times P}$ have entries $\underline{O}_r(m,n,p):=A(m,r)B(n,r)C(p,r)$. The rank of a tensor is thus the minimum number of outer products (rank-one factors) required to represent the tensor. It is not uncommon to adopt the equivalent normalized representation

$$\underline{X}=\sum_{r=1}^R a_r\circ b_r\circ c_r=\sum_{r=1}^R \gamma_r\,(u_r\circ v_r\circ w_r) \qquad (6)$$

by defining unit-norm vectors $u_r:=a_r/\|a_r\|$, $v_r:=b_r/\|b_r\|$, $w_r:=c_r/\|c_r\|$, and weights $\gamma_r:=\|a_r\|\|b_r\|\|c_r\|$, $r=1,\dots,R$.

Let $X_p$, $p=1,\dots,P$, denote the p-th slice of $\underline{X}$ along its third (tube) dimension, such that $X_p(m,n):=\underline{X}(m,n,p)$; see Fig. 1. The following compact form of the PARAFAC decomposition in terms of slice factorizations will be used in the sequel

$$X_p=A\,\mathrm{diag}\!\left[e_p^T C\right]B^T,\quad p=1,\dots,P \qquad (7)$$

where the diagonal matrix diag[u] has the vector u on its diagonal, and $e_p^T$ is the p-th row of the $P\times P$ identity matrix. The PARAFAC decomposition is symmetric [cf. (5)], and one can also write $X_m=B\,\mathrm{diag}[e_m^T A]\,C^T$, or $X_n=C\,\mathrm{diag}[e_n^T B]\,A^T$, in terms of slices along the first (row), or second (column), dimensions.
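To make the factor notation concrete, the sketch below (NumPy; the dimensions and seed are hypothetical) builds a rank-R tensor from its factors and checks that the slice factorization (7) matches the entry-wise definition (5).

```python
import numpy as np

M, N, P, R = 6, 5, 4, 3
rng = np.random.default_rng(0)
A, B, C = rng.standard_normal((M, R)), rng.standard_normal((N, R)), rng.standard_normal((P, R))

# Entry-wise definition (5): X(m,n,p) = sum_r A(m,r) B(n,r) C(p,r)
X_sum = np.einsum('mr,nr,pr->mnp', A, B, C)

# Slice form (7): X_p = A diag(e_p^T C) B^T, stacked along the tube dimension
X_slices = np.stack([A @ np.diag(C[p]) @ B.T for p in range(P)], axis=2)

assert np.allclose(X_sum, X_slices)
```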
III. RANK REGULARIZATION FOR TENSORS

Generalizing the nuclear-norm regularization technique (2) from low-rank matrix to tensor completion is not straightforward, since singular values of a tensor (given by the Tucker decomposition) are not related to the rank [16]. Fortunately, the Frobenius-norm regularization outlined in Section II-A offers a viable option for low-rank tensor completion under the PARAFAC model, by solving

$$\hat{\underline{Z}}:=\arg\min_{\{\underline{X},A,B,C\}}\tfrac{1}{2}\|(\underline{Z}-\underline{X})\circledast\underline{\Delta}\|_F^2+\tfrac{\mu}{2}\left(\|A\|_F^2+\|B\|_F^2+\|C\|_F^2\right)\quad\text{s. to }X_p=A\,\mathrm{diag}[e_p^T C]\,B^T,\ p=1,\dots,P \qquad (8)$$

where the Frobenius norm of a tensor is defined as $\|\underline{X}\|_F^2:=\sum_m\sum_n\sum_p\underline{X}^2(m,n,p)$, and the Hadamard product as $(\underline{X}\circledast\underline{\Delta})(m,n,p):=\underline{X}(m,n,p)\underline{\Delta}(m,n,p)$.

Different from the matrix case, it is unclear whether the regularization in (8) bears any relation with the tensor rank. Interestingly, the following analysis corroborates the capability of (8) to produce a low-rank tensor $\hat{\underline{Z}}$ for sufficiently large µ. In this direction, consider an alternative completion problem stated in terms of the normalized tensor representation (6)

$$\hat{\underline{Z}}':=\arg\min_{\{\underline{X},\gamma,\{u_r\},\{v_r\},\{w_r\}\}}\tfrac{1}{2}\|(\underline{Z}-\underline{X})\circledast\underline{\Delta}\|_F^2+\tfrac{\mu}{2}\|\gamma\|_{2/3}^{2/3}\quad\text{s. to }\underline{X}=\sum_{r=1}^R\gamma_r(u_r\circ v_r\circ w_r) \qquad (9)$$

where $\gamma:=[\gamma_1,\dots,\gamma_R]^T$; the nonconvex $\ell_{2/3}$ (pseudo)norm is given by $\|\gamma\|_{2/3}:=\left(\sum_{r=1}^R|\gamma_r|^{2/3}\right)^{3/2}$; and the unit-norm constraint on the factors' columns is left implicit. Problems (8) and (9) are equivalent, as established by the following proposition (its proof is provided in the Appendix).

Proposition 2: The solutions of (8) and (9) coincide, i.e., $\hat{\underline{Z}}'=\hat{\underline{Z}}$, with optimal factors related by $\hat{a}_r=\sqrt[3]{\hat{\gamma}_r}\,\hat{u}_r$, $\hat{b}_r=\sqrt[3]{\hat{\gamma}_r}\,\hat{v}_r$, and $\hat{c}_r=\sqrt[3]{\hat{\gamma}_r}\,\hat{w}_r$, $r=1,\dots,R$.

To further stress the capability of (8) to produce a low-rank approximant tensor $\underline{X}$, consider transforming (9) once more by rewriting it in the constrained-error form

$$\hat{\underline{Z}}'':=\arg\min_{\{\underline{X},\gamma,\{u_r\},\{v_r\},\{w_r\}\}}\|\gamma\|_{2/3}\quad\text{s. to }\|(\underline{Z}-\underline{X})\circledast\underline{\Delta}\|_F^2\le\sigma^2,\ \ \underline{X}=\sum_{r=1}^R\gamma_r(u_r\circ v_r\circ w_r). \qquad (10)$$

For any value of $\sigma^2$ there exists a corresponding Lagrange multiplier λ such that (9) and (10) yield the same solution, under the identity $\mu=2/\lambda$. [Since $f(x)=x^{2/3}$ is an increasing function, the exponent of $\|\gamma\|_{2/3}$ can be safely eliminated without affecting the minimizer of (10).] The $\ell_{2/3}$-norm $\|\gamma\|_{2/3}$ in (10) produces a sparse vector γ when minimized [6], sharing this well-documented property of the $\ell_1$-norm; their norm-one balls, depicted in Fig. 2, share the "pointy geometry" which is responsible for inducing sparsity.

Fig. 2. The $\ell_{2/3}$-norm ball compared to the $\ell_0$- and $\ell_1$-norm balls.

With (8) equivalently rewritten as in (10), its low-rank inducing property is now revealed. As γ in (10) becomes sparse, some of its entries $\gamma_r$ are zeroed, and the corresponding outer products $\gamma_r(u_r\circ v_r\circ w_r)$ drop from the sum in (6), thus lowering the rank of $\underline{X}$.

The next property is a direct consequence of the low-rank promoting property of (8) as established in Proposition 2.

Corollary 1: If $\hat{\underline{Z}}$ denotes the solution to problem (8), and $\mu\ge\mu_{\max}:=\|\underline{\Delta}\circledast\underline{Z}\|_F^{4/3}$, then $\hat{\underline{Z}}=0_{M\times N\times P}$.

Corollary 1 asserts that if the penalty parameter is chosen large enough, the rank is reduced to the extreme case $\operatorname{rank}(\hat{\underline{Z}})=0$. To see why this is a non-trivial property, it is prudent to think of ridge-regression estimates where similar quadratic regularizers are adopted, but an analogous property does not hold. In ridge regression one needs to let $\mu\to\infty$ in order to obtain an all-zero solution. Characterization of $\mu_{\max}$ is also of practical relevance as it provides a frame of reference for tuning the regularization parameter.

Using (10), it is also possible to relate (8) with the atomic norm in [5]. Indeed, the infimum $\ell_1$-norm of γ is a proper norm for $\underline{X}$, named the atomic norm and denoted by $\|\underline{X}\|_A:=\|\gamma\|_1$ [5]. Thus, by replacing $\|\gamma\|_{2/3}$ with $\|\underline{X}\|_A$, (10) becomes convex in $\underline{X}$. Still, the complexity of solving such a variant of (10) resides in that $\|\underline{X}\|_A$ is generally intractable to compute [5]. In this regard, it is remarkable that arriving at (10) had the sole purpose of demonstrating the low-rank inducing property, and that (8) is to be solved by the algorithm developed in the ensuing section. Such an algorithm will require neither computing the atomic norm or PARAFAC decomposition of $\underline{X}$, nor knowing its rank. The number of columns in A, B, and C can be set to an overestimate of the rank of $\underline{Z}$, such as the upper bound $\bar{R}:=\min\{MN,NP,PM\}\ge\operatorname{rank}(\underline{Z})$, and the low rank of $\underline{X}$ will be induced by regularization as argued earlier. To carry out a fair comparison, only convergence to a stationary point of (8) will be guaranteed in this paper.
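In code, the objective of (8) and the weights $\gamma_r$ of (6) look as follows (NumPy sketch; the helper names are illustrative, not from the paper). The number of surviving weights $\gamma_r$ is the rank that the Frobenius-norm penalty effectively selects.

```python
import numpy as np

def parafac_tensor(A, B, C):
    """X(m,n,p) = sum_r A(m,r) B(n,r) C(p,r), cf. (5)."""
    return np.einsum('mr,nr,pr->mnp', A, B, C)

def cost_8(Z, Delta, A, B, C, mu):
    """Regularized LS cost of (8) evaluated on the observed entries marked by Delta."""
    X = parafac_tensor(A, B, C)
    fit = 0.5 * np.sum(Delta * (Z - X) ** 2)
    reg = 0.5 * mu * (np.sum(A ** 2) + np.sum(B ** 2) + np.sum(C ** 2))
    return fit + reg

def rank_weights(A, B, C, tol=1e-8):
    """Weights gamma_r = ||a_r|| ||b_r|| ||c_r||; their sparsity reveals the induced rank."""
    gamma = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0) * np.linalg.norm(C, axis=0)
    return gamma, int(np.sum(gamma > tol * max(gamma.max(), 1.0)))
```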
Remark 1: These insights foster future research directions for the design of a convex regularizer of the tensor rank. Specifically, substituting $\rho(A,B,C):=\sum_{r=1}^R\left(\|a_r\|^3+\|b_r\|^3+\|c_r\|^3\right)$ for the regularization term in (8) turns $\|\gamma\|_{2/3}$ into $\|\gamma\|_1=\|\underline{X}\|_A$ in the equivalent (10). It is envisioned that with such a modification in place, the acquired convexity of (10) would enable a reformulation of Proposition 1, providing conditions for global optimality of the stationary points of (8).

Still, a limitation of (8) is that it does not allow for incorporating side information that could be available in addition to the given entries $\underline{\Delta}\circledast\underline{Z}$.

Remark 2: In the context of recommender systems, a description of the users and/or products through attributes (e.g., gender, age) or measures of similarity is typically available. It is thus meaningful to exploit both known preferences and descriptions to model the preferences of users [1]. In three-way (samples, genes, conditions) microarray data analysis, the relative position of single-nucleotide polymorphisms in the DNA molecule implies degrees of correlation among genotypes [22]. These correlations could be available either through a prescribed model, or through estimates obtained using a reference tensor $\check{\underline{Z}}$. A probabilistic approach to tensor completion capable of incorporating such types of extra information is the subject of the ensuing section.
IV. BAYESIAN LOW-RANK TENSOR APPROXIMATION

A. Bayesian PARAFAC model

A probabilistic approach is developed in this section in order to integrate the available statistical information into the tensor imputation setup. To this end, suppose that the observation noise is zero-mean, white, and Gaussian; that is,

$$Z_{mnp}=X_{mnp}+e_{mnp},\qquad e_{mnp}\sim\mathcal{N}(0,\sigma^2),\ \text{i.i.d.} \qquad (11)$$

Since the vectors $a_r$ in (6) are interchangeable, identical distributions are assigned across $r=1,\dots,R$, and they are modeled as independent from each other, zero-mean Gaussian distributed with covariance matrix $R_A\in\mathbb{R}^{M\times M}$. Similarly, vectors $b_r$ and $c_r$ are uncorrelated, zero-mean, and Gaussian, with covariance matrices $R_B$ and $R_C$, respectively. In addition, $a_r$, $b_r$, and $c_r$ are assumed mutually uncorrelated. And since scale ambiguity is inherently present in the PARAFAC model, vectors $a_r$, $b_r$, and $c_r$ are set to have equal power; that is,

$$\theta:=\mathrm{Tr}(R_A)=\mathrm{Tr}(R_B)=\mathrm{Tr}(R_C). \qquad (12)$$

Under these assumptions, the posterior distribution can be readily written as proportional to $\exp(-L(\underline{X}))$, with

$$L(\underline{X})=\tfrac{1}{2\sigma^2}\|(\underline{Z}-\underline{X})\circledast\underline{\Delta}\|_F^2+\tfrac{1}{2}\sum_{r=1}^R\left(a_r^T R_A^{-1}a_r+b_r^T R_B^{-1}b_r+c_r^T R_C^{-1}c_r\right)=\tfrac{1}{2\sigma^2}\|(\underline{Z}-\underline{X})\circledast\underline{\Delta}\|_F^2+\tfrac{1}{2}\left[\mathrm{Tr}\!\left(A^T R_A^{-1}A\right)+\mathrm{Tr}\!\left(B^T R_B^{-1}B\right)+\mathrm{Tr}\!\left(C^T R_C^{-1}C\right)\right].$$

Correspondingly, the MAP estimator of $\underline{X}$ is

$$\hat{\underline{Z}}:=\arg\min_{\{\underline{X},A,B,C\}}\tfrac{1}{2\sigma^2}\|(\underline{Z}-\underline{X})\circledast\underline{\Delta}\|_F^2+\tfrac{1}{2}\left[\mathrm{Tr}\!\left(A^T R_A^{-1}A\right)+\mathrm{Tr}\!\left(B^T R_B^{-1}B\right)+\mathrm{Tr}\!\left(C^T R_C^{-1}C\right)\right]\quad\text{s. to }X_p=A\,\mathrm{diag}[e_p^T C]\,B^T,\ p=1,\dots,P \qquad (13)$$

reducing to (8) when $R_A=I_M$, $R_B=I_N$, and $R_C=I_P$. This Bayesian approach interprets the regularization parameter µ [cf. (8)] as the noise variance, which is useful in practice to select µ.
The ensuing section explores the advantages of incorporating prior information into the imputation method.

B. Nonparametric tensor decomposition

Incorporating the information conveyed by $R_A$, $R_B$, and $R_C$, together with a practical means of finding these matrices, can be facilitated by interpreting (13) in the context of RKHS [27]. In particular, the analysis presented next will use the Representer Theorem, interpreted as an instrument for finding the best interpolating function in a Hilbert space spanned by kernels, just as interpolation with sinc-kernels is carried out in the space of bandlimited functions for the purpose of reconstructing a signal from its samples [19].

In this context, it is instructive to look at a tensor $f:\mathcal{M}\times\mathcal{N}\times\mathcal{P}\to\mathbb{R}$ as a function of three variables m, n, and p, living in measurable spaces $\mathcal{M}$, $\mathcal{N}$, and $\mathcal{P}$, respectively. Generalizing (8) to this nonparametric framework, low-rank functions f are formally defined to belong to the following family

$$\mathcal{F}_R:=\Big\{f:\mathcal{M}\times\mathcal{N}\times\mathcal{P}\to\mathbb{R}:\ f(m,n,p)=\sum_{r=1}^R a_r(m)b_r(n)c_r(p),\ \text{with } a_r\in\mathcal{H}_M,\ b_r\in\mathcal{H}_N,\ c_r\in\mathcal{H}_P\Big\}$$

where $\mathcal{H}_M$, $\mathcal{H}_N$, and $\mathcal{H}_P$ are Hilbert spaces constructed from specified kernels $k_M$, $k_N$, and $k_P$, defined over $\mathcal{M}$, $\mathcal{N}$, and $\mathcal{P}$, while R is an initial overestimate of the rank of f.

The following nonparametric fitting criterion is adopted for finding the best $\hat{f}_R$ interpolating the data $\{z_{mnp}:\delta_{mnp}=1\}$:

$$\hat{f}_R:=\arg\min_{f\in\mathcal{F}_R}\sum_{m=1}^M\sum_{n=1}^N\sum_{p=1}^P\delta_{mnp}\left(z_{mnp}-f(m,n,p)\right)^2+\tfrac{\mu}{2}\sum_{r=1}^R\left(\|a_r\|_{\mathcal{H}_M}^2+\|b_r\|_{\mathcal{H}_N}^2+\|c_r\|_{\mathcal{H}_P}^2\right). \qquad (14)$$

It is shown in the Appendix that, leveraging the Representer Theorem, the minimizer of (14) admits a finite-dimensional representation in terms of $k_M$, $k_N$, and $k_P$,

$$\hat{f}_R(m,n,p)=k_M^T(m)K_M^{-1}A\,\mathrm{diag}\!\left[k_P^T(p)K_P^{-1}C\right]B^T K_N^{-1}k_N(n) \qquad (15)$$

where vector $k_M(m)$ and matrix $K_M$ have entries $k_M(m,m')$, $m,m'=1,\dots,M$; and where $k_N(n)$, $K_N$, $k_P(p)$, and $K_P$ are correspondingly defined in terms of $k_N$ and $k_P$. It is also shown in the Appendix that the coefficient matrices A, B, and C can be found by solving

$$\min_{A,B,C}\sum_{p=1}^P\left\|\left(Z_p-A\,\mathrm{diag}[e_p^T C]\,B^T\right)\circledast\Delta_p\right\|_F^2+\tfrac{\mu}{2}\left(\mathrm{Tr}(A^T K_M^{-1}A)+\mathrm{Tr}(B^T K_N^{-1}B)+\mathrm{Tr}(C^T K_P^{-1}C)\right)\quad\text{s. to }A\in\mathbb{R}^{M\times R},\ B\in\mathbb{R}^{N\times R},\ C\in\mathbb{R}^{P\times R}. \qquad (16)$$

Problem (16) reduces to (8) when the side information is discarded by selecting $k_M$, $k_N$, and $k_P$ as Kronecker deltas, in which case $K_M$, $K_N$, and $K_P$ are identity matrices. In the general case, (16) yields the sought nonlinear low-rank approximation method for f(m,n,p) when combined with (15), evidencing the equivalence between (14) and (13).

Interpreting (14) as an interpolator renders (13) a natural choice for tensor imputation, where, in general, missing entries are to be imputed by connecting them to surrounding points on the three-dimensional arrangement. Relative to (8), this RKHS perspective also highlights (13)'s extra smoothing and extrapolation capabilities. Indeed, by capitalizing on the similarities captured by $K_M$, $K_N$, and $K_P$, (16) can recover completely missing slices. This feature is not shared by imputation methods that leverage low rank only, since these require at least one point in the slice to build on collinearities. Extrapolation is also possible in this sense. If for instance $K_M$ can be expanded to capture a further point M+1 not in the original set, then a new slice of data can be predicted by (15) based on its correlation $k_M(M+1)$ with the available entries. These extra capabilities will be exploited in Section VI, where correlations are leveraged for the imputation of MRI data. The method described by (13) and (16) can be applied to matrix completion by just setting the entries of C to one, and can be extended to higher-order dimensions with a straightforward alteration of the algorithms and theorems throughout this paper.
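To see (15) at work, the following sketch (NumPy) evaluates the interpolator at arbitrary kernel vectors; the kernel matrices $K_M$, $K_N$, $K_P$ and the coefficient factors A, B, C are assumed available, e.g., from solving (16). Passing the correlation vector of a new index, such as $k_M(M+1)$, gives the extrapolated values discussed above; for an in-sample index m, $k_M(m)$ is simply the m-th column of $K_M$.

```python
import numpy as np

def rkhs_predict(kM, kN, kP, KM, KN, KP, A, B, C):
    """Evaluate f_hat(m,n,p) = kM^T KM^{-1} A diag(kP^T KP^{-1} C) B^T KN^{-1} kN, as in (15)."""
    left = np.linalg.solve(KM, kM) @ A        # row weights, length R
    mid = np.linalg.solve(KP, kP) @ C         # diagonal weights, length R
    right = B.T @ np.linalg.solve(KN, kN)     # column weights, length R
    return float(left @ (mid * right))
```

Sweeping kN over the columns of $K_N$ (and kM, kP likewise) reconstructs the full tensor, which is how a completely missing slice is filled in the MRI experiment of Section VI.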
Identification of the covariance matrices $R_A$, $R_B$, and $R_C$ with the kernel matrices $K_M$, $K_N$, and $K_P$ is the remaining aspect to clarify in the connection between (13) and (16). It is apparent from (13) and (16) that correlations between columns of the factors are reflected in similarities between the tensor slices, giving rise to the opportunity of obtaining one from the other. This aspect is explored next.

C. Covariance estimation

To implement (13), matrices $R_A$, $R_B$, and $R_C$ must be postulated a priori, or alternatively replaced by their sample estimates. Such estimates need a training set of vectors a, b, and c abiding by the Bayesian model just described, and this requires a PARAFAC decomposition of training data. In order to abridge this procedure, it is convenient to inspect how $R_A$, $R_B$, and $R_C$ are related to their kernel counterparts.

Based on the equivalence between the standard RKHS interpolator and the linear minimum mean-square error estimator [21], it is useful to revisit the probabilistic framework and identify kernel similarities between slices of $\underline{X}$ with their mutual covariances. Focusing on the tube dimension of $\underline{X}$, one can define $K_P(p',p):=E\left(\mathrm{Tr}(X_{p'}^T X_p)\right)$, that is, the covariance between slices $X_{p'}$ and $X_p$, taking $\langle X,Y\rangle:=\mathrm{Tr}(X^T Y)$ as the standard inner product in the matrix space. Under this alternative definition for $K_P$, and corresponding definitions for $K_N$ and $K_M$, it is shown in the Appendix that

$$K_M=\theta^2 R_A,\qquad K_N=\theta^2 R_B,\qquad K_P=\theta^2 R_C \qquad (17)$$

and that θ is related to the second-order moment of $\underline{X}$ by

$$E\|\underline{X}\|_F^2=R\,\theta^3. \qquad (18)$$

Since sample estimates for $K_M$, $K_N$, $K_P$, and $E\|\underline{X}\|_F$ can be readily obtained from the tensor data, (17) and (18) provide an agile means of estimating $R_A$, $R_B$, and $R_C$ without requiring PARAFAC decompositions over the set of training tensors.

This strategy remains valid when kernels are not estimated from data. One such case emerges in collaborative filtering of user preferences [1], where the similarity of two users is modeled as a function of attributes, such as age or income.

D. Block successive upper-bound minimization algorithm

An iterative algorithm is developed here for solving (13), by cyclically minimizing the cost over A, B, and C. In the first step of the cycle, the cost in (13) is minimized with respect to (w.r.t.) A, considering B and C as parameters. Accordingly, the partial cost to minimize reduces to

$$f(A):=\tfrac{1}{2}\|(\underline{Z}-\underline{X})\circledast\underline{\Delta}\|_F^2+\tfrac{\mu}{2}\mathrm{Tr}\!\left(A^T R_A^{-1}A\right) \qquad (19)$$

where µ was identified with and substituted for $\sigma^2$. Function (19) is quadratic in A and can be readily minimized after rewriting it in terms of a := vec(A) [see (47) in the Appendix]. However, such an approach becomes computationally infeasible for all but small datasets, since it involves storing P matrices of dimensions NM × MR, and solving a linear system of MR × MR equations. The alternative pursued here to overcome this obstacle relies on the so-called block successive upper-bound minimization (BSUM) algorithm [20].

In BSUM one minimizes a judiciously chosen upper bound $g(A,\bar{A})$ of f(A), which: i) depends on the current iterate $\bar{A}$; ii) should be simpler to optimize; and iii) satisfies certain local-tightness conditions; see also [20] and properties i)-iii) below. For $\bar{A}$ given, consider the function

$$g(A,\bar{A}):=\tfrac{1}{2}\|(\underline{Z}-\underline{X})\circledast\underline{\Delta}\|_F^2+\mu\left(\tfrac{\lambda}{2}\mathrm{Tr}(A^T A)-\mathrm{Tr}(\Theta^T A)+\tfrac{1}{2}\mathrm{Tr}(\Theta^T\bar{A})\right) \qquad (20)$$

where $\lambda:=\lambda_{\max}(R_A^{-1})$ is the maximum eigenvalue of $R_A^{-1}$, and $\Theta:=(\lambda I-R_A^{-1})\bar{A}$. The following properties of $g(A,\bar{A})$ imply that it majorizes f(A) at $\bar{A}$, satisfying the technical conditions required for the convergence of BSUM (properties i)-iii) are established in the proof of Lemma 1 in the Appendix):

i) $f(\bar{A})=g(\bar{A},\bar{A})$;
ii) $f(A)\le g(A,\bar{A})$ for all A; and,
iii) $\frac{d}{dA}f(A)\big|_{A=\bar{A}}=\frac{d}{dA}g(A,\bar{A})\big|_{A=\bar{A}}$.

The computational advantage of minimizing $g(A,\bar{A})$ in place of f(A) comes from $g(A,\bar{A})$ being separable across the rows of A. To see this, consider the Khatri-Rao product $\Pi:=C\odot B:=[c_1\otimes b_1,\dots,c_R\otimes b_R]$, defined by the column-wise Kronecker products $c_r\otimes b_r$. Let also matrix $Z:=[Z_1,\dots,Z_P]\in\mathbb{R}^{M\times NP}$ denote the unfolding of $\underline{Z}$ along its tube dimension, and likewise for $\Delta:=[\Delta_1,\dots,\Delta_P]\in\{0,1\}^{M\times NP}$ and $X:=[X_1,\dots,X_P]\in\mathbb{R}^{M\times NP}$. Then, using the identity [10]

$$X:=[X_1,\dots,X_P]=A\Pi^T \qquad (21)$$

it is possible to rewrite (20) as

$$g(A,\bar{A})=\tfrac{1}{2}\|(Z-A\Pi^T)\circledast\Delta\|_F^2+\mu\left(\tfrac{\lambda}{2}\mathrm{Tr}(A^T A)-\mathrm{Tr}(\Theta^T A)+\tfrac{1}{2}\mathrm{Tr}(\Theta^T\bar{A})\right)$$

which can be decomposed as

$$g(A,\bar{A})=\sum_{m=1}^M\left[\tfrac{1}{2}\|\delta_m\circledast z_m-\mathrm{diag}(\delta_m)\Pi a_m\|_2^2+\mu\left(\tfrac{\lambda}{2}\|a_m\|^2-\theta_m^T a_m+\tfrac{1}{2}\theta_m^T\bar{a}_m\right)\right] \qquad (22)$$

where $z_m^T$, $a_m^T$, $\delta_m^T$, $\theta_m^T$, and $\bar{a}_m^T$ represent the m-th rows of matrices Z, A, ∆, Θ, and $\bar{A}$, respectively. Not only does (22) evidence the separability of (20) across the rows of A, but it also presents each of its summands in a standardized quadratic form that can be readily minimized by equating its gradient to zero. Accordingly, the majorization strategy reduces the computational load to M systems of R equations each, which can be solved in parallel. Collecting the solutions of these quadratic programs into the rows of a matrix $A^*$ yields the minimizer of (20), and the update $A\leftarrow A^*$ for the BSUM cycle. This procedure is presented in Algorithm 1, where analogous updates for B and C are carried out cyclically.

Algorithm 1: Low-rank tensor imputation (LRTI)
1: function UPDATE_FACTOR(A, R, Π, ∆, Z, µ)
2:   Set λ = λ_max(R⁻¹)
3:   Unfold ∆ and Z over the dimension of A into ∆ and Z
4:   Set Θ = (λI − R⁻¹)A
5:   for m = 1,…,M do
6:     Select rows z_mᵀ, δ_mᵀ, and θ_mᵀ, and set D_m = diag(δ_m)
7:     Compute a_m = (ΠᵀD_mΠ + λµI)⁻¹(ΠᵀD_m z_m + µθ_m)
8:     Update A with row a_mᵀ
9:   end for
10:  return A
11: end function
12: Initialize A, B and C randomly.
13: while |cost − cost_old| > ε do
14:   A = UPDATE_FACTOR(A, R_A, (C⊙B), ∆, Z, µ)
15:   B = UPDATE_FACTOR(B, R_B, (A⊙C), ∆, Z, µ)
16:   C = UPDATE_FACTOR(C, R_C, (B⊙A), ∆, Z, µ)
17:   Recalculate the cost in (13)
18: end while
19: return X with slices X̂_p = A diag(e_pᵀC) Bᵀ
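The per-row solve in step 7 of Algorithm 1 is easy to prototype. The sketch below (NumPy) is a minimal, illustrative implementation under assumed conventions: the factor prior is passed through its inverse R_inv, the mask and data are already unfolded along the updated dimension, and Pi is the Khatri-Rao product of the two fixed factors as in (21).

```python
import numpy as np

def khatri_rao(C, B):
    """Column-wise Kronecker product Pi = C (Khatri-Rao) B, of shape (P*N, R)."""
    P, R = C.shape
    N = B.shape[0]
    return np.einsum('pr,nr->pnr', C, B).reshape(P * N, R)

def update_factor(A_bar, R_inv, Pi, Delta_unf, Z_unf, mu):
    """One LRTI factor update: a_m = (Pi^T D_m Pi + lam*mu*I)^{-1}(Pi^T D_m z_m + mu*theta_m)."""
    lam = np.linalg.eigvalsh(R_inv).max()            # lambda_max(R^{-1})
    Theta = (lam * np.eye(R_inv.shape[0]) - R_inv) @ A_bar
    R = A_bar.shape[1]
    A_new = np.empty_like(A_bar)
    for m in range(A_bar.shape[0]):                  # one small R x R solve per row
        d = Delta_unf[m]                             # binary row of the unfolded mask
        G = Pi.T @ (d[:, None] * Pi) + lam * mu * np.eye(R)
        rhs = Pi.T @ (d * Z_unf[m]) + mu * Theta[m]
        A_new[m] = np.linalg.solve(G, rhs)
    return A_new
```

Cycling this update over A, B, and C, with the corresponding unfoldings and Khatri-Rao products, reproduces the outer loop of Algorithm 1.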
By virtue of properties i)-iii) in Lemma 1, convergence of Algorithm 1 follows readily from that of the BSUM algorithm [20].

Proposition 3: The iterates for A, B, and C generated by Algorithm 1 converge to a stationary point of (13).

V. INFERENCE FOR LOW-RANK POISSON TENSORS

Adoption of the LS criterion in (8) assumes in a Bayesian setting that the random $\underline{Z}$ is Gaussian distributed. This section deals with a Poisson-distributed tensor $\underline{Z}$, a natural alternative to the Gaussian model when integer-valued data are obtained by counting independent events [10]. Suppose that the entries $z_{mnp}$ of $\underline{Z}$ are Poisson distributed, with probability mass function

$$P(z_{mnp}=k)=\frac{x_{mnp}^k e^{-x_{mnp}}}{k!} \qquad (23)$$

and means given by the corresponding entries of tensor $\underline{X}$. For mutually independent $\{z_{mnp}\}$, the log-likelihood $l_\Delta(\underline{Z};\underline{X})$ of $\underline{X}$ given data $\underline{Z}$ only on the entries specified by $\underline{\Delta}$ takes the form

$$l_\Delta(\underline{Z};\underline{X})=\sum_{m=1}^M\sum_{n=1}^N\sum_{p=1}^P\delta_{mnp}\left[z_{mnp}\log(x_{mnp})-x_{mnp}\right] \qquad (24)$$

after dropping the terms $\log(z_{mnp}!)$ that do not depend on $\underline{X}$.

The choice of the Poisson distribution in (23) over a Gaussian one for count data prompts minimization of the K-L divergence (24) instead of LS as a more suitable criterion [10]. Still, the entries of $\underline{X}$ are not coupled in (24), and a binding assumption is missing for feasibility of the tensor approximation task under missing data. Mimicking the method for Gaussian data, (nonnegative) Gaussian priors are assumed for the factors of the PARAFAC decomposition. Accordingly, the MAP estimator of $\underline{X}$ given Poisson-distributed data (entries of $\underline{Z}$ indexed by $\underline{\Delta}$) becomes

$$\hat{\underline{Z}}:=\arg\min_{\{\underline{X},A,B,C\}\in\mathcal{T}}\sum_{m=1}^M\sum_{n=1}^N\sum_{p=1}^P\delta_{mnp}\left(x_{mnp}-z_{mnp}\log(x_{mnp})\right)+\tfrac{\mu}{2}\left[\mathrm{Tr}\!\left(A^T R_A^{-1}A\right)+\mathrm{Tr}\!\left(B^T R_B^{-1}B\right)+\mathrm{Tr}\!\left(C^T R_C^{-1}C\right)\right] \qquad (25)$$

over the feasible set $\mathcal{T}:=\{\underline{X},A,B,C:\ A\ge 0,\ B\ge 0,\ C\ge 0,\ X_p=A\,\mathrm{diag}[e_p^T C]\,B^T,\ p=1,\dots,P\}$, where the symbol $\ge$ should be understood to imply entry-wise nonnegativity.

With the aid of the Representer Theorem, it is also possible to interpret (25) as a variational estimator in RKHS, with K-L analogues to (14)-(16), so that the conclusions therein regarding smoothing, prediction, and prior covariance estimation carry over to the low-rank Poisson imputation method (25).

A. Block successive upper-bound minimization algorithm

A K-L counterpart of the LRTI algorithm is developed in this section, which provably converges to a stationary point of (25), via an alternating-minimization iteration that optimizes (25) sequentially w.r.t. one factor matrix, while holding the others fixed. In the sequel, the goal is to arrive at a suitable expression for the cost in (25), when viewed only as a function of, e.g., A. To this end, let matrix $Z:=[Z_1,\dots,Z_P]\in\mathbb{N}^{M\times NP}$ denote the unfolding of $\underline{Z}$ along its tube dimension, and likewise for $\Delta:=[\Delta_1,\dots,\Delta_P]\in\{0,1\}^{M\times NP}$ and $X:=[X_1,\dots,X_P]\in\mathbb{R}_+^{M\times NP}$. Based on these definitions, the K-L fitting term in (25) [the negative of (24)] can be written as

$$l_\Delta(\underline{Z};\underline{X})=\mathbf{1}_M^T\left(\Delta\circledast\left[X-Z\circledast\log(X)\right]\right)\mathbf{1}_{NP} \qquad (26)$$

where $\mathbf{1}_M$ and $\mathbf{1}_{NP}$ are all-one vectors of dimensions M and NP respectively, and log(·) should be understood entry-wise. The cost in (26) can be expressed in terms of A and the Khatri-Rao product $\Pi:=C\odot B$ by resorting again to (21). Substituting (21) into (26), one arrives at the desired expression for the cost in (25) as a function of A, namely

$$f(A):=\mathbf{1}_M^T\left(\Delta\circledast\left[A\Pi^T-Z\circledast\log(A\Pi^T)\right]\right)\mathbf{1}_{NP}+\tfrac{\mu}{2}\mathrm{Tr}\!\left(A^T R_A^{-1}A\right).$$
A closed-form minimizer $A^\star$ of f(A) is not available, but since f(A) is convex one could in principle resort to an iterative procedure to obtain $A^\star$. To avoid extra inner iterations, the approach here relies again on the BSUM algorithm [20]. For $\bar{A}$ given, consider the separable function

$$g(A,\bar{A}):=\mu\lambda\sum_{m,r=1}^{M,R}\left(\frac{a_{mr}^2}{2}-2t_{mr}a_{mr}-s_{mr}\log(a_{mr})+u_{mr}\right) \qquad (27)$$

where $\lambda:=\lambda_{\max}(R_A^{-1})$ is the largest eigenvalue of $R_A^{-1}$, and the parameters $s_{mr}$, $t_{mr}$, and $u_{mr}$ are defined in terms of $\bar{A}$, Z, ∆, Π, and $\Theta:=(\lambda I-R_A^{-1})\bar{A}$ by

$$s_{mr}:=\frac{1}{\lambda\mu}\sum_{k=1}^{NP}\frac{\delta_{mk}z_{mk}\bar{a}_{mr}\pi_{kr}}{\sum_{r'=1}^{R}\bar{a}_{mr'}\pi_{kr'}},\qquad t_{mr}:=\frac{1}{2\lambda\mu}\left(\mu\theta_{mr}-\sum_{k=1}^{NP}\pi_{kr}\delta_{mk}\right)$$

and $u_{mr}:=\frac{1}{\lambda\mu}\left(\theta_{mr}\bar{a}_{mr}+\sum_{k=1}^{NP}\delta_{mk}z_{mk}\bar{a}_{mr}\pi_{kr}\upsilon_{mrk}\right)$, with $\upsilon_{mrk}:=\log\!\left(\bar{a}_{mr}\pi_{kr}\big/\textstyle\sum_{r'=1}^{R}\bar{a}_{mr'}\pi_{kr'}\right)\big/\textstyle\sum_{r'=1}^{R}\bar{a}_{mr'}\pi_{kr'}$.

As asserted in the following lemma, $g(A,\bar{A})$ majorizes f(A) at $\bar{A}$ and satisfies the technical conditions required for the convergence of BSUM (see the Appendix for a proof).

Lemma 1: Function $g(A,\bar{A})$ satisfies the following properties:
i) $f(\bar{A})=g(\bar{A},\bar{A})$;
ii) $f(A)\le g(A,\bar{A})$ for all A; and,
iii) $\frac{d}{dA}f(A)\big|_{A=\bar{A}}=\frac{d}{dA}g(A,\bar{A})\big|_{A=\bar{A}}$.

Moreover, $g(A,\bar{A})$ is minimized at $A=A_g^\star$ with entries $a^\star_{g,mr}:=t_{mr}+\sqrt{t_{mr}^2+s_{mr}}$.

Lemma 1 highlights the reason behind adopting $g(A,\bar{A})$ in the proposed block-coordinate descent algorithm: it is separable across the entries of its matrix argument [cf. (27)], and hence it admits a closed-form minimizer given by the MR scalars $a^\star_{g,mr}$. The updates $A\leftarrow A_g^\star$ are tabulated under Algorithm 2 for solving (25), where analogous updates for B and C are carried out cyclically.

Algorithm 2: Low-rank Poisson-tensor imputation (LRPTI)
1: function UPDATE_FACTOR(A, R, Π, ∆, Z, µ)
2:   Set λ = λ_max(R⁻¹)
3:   Unfold ∆ and Z over the dimension of A into ∆ and Z
4:   Compute T = (1/(2λµ))(µΘ − ∆Π), with Θ = (λI − R⁻¹)A
5:   Compute S = (1/(λµ)) A ⊛ (((∆⊛Z) ⊘ (AΠᵀ)) Π)   (⊘ denotes element-wise division)
6:   Update A with entries a_mr = t_mr + √(t_mr² + s_mr)
7:   return A
8: end function
9: Initialize A, B and C randomly.
10: while |cost − cost_old| > ε do
11:   A = UPDATE_FACTOR(A, R_A, (C⊙B), ∆, Z, µ)
12:   B = UPDATE_FACTOR(B, R_B, (A⊙C), ∆, Z, µ)
13:   C = UPDATE_FACTOR(C, R_C, (B⊙A), ∆, Z, µ)
14:   Recalculate the cost in (25)
15: end while
16: return X with slices X̂_p = A diag(e_pᵀC) Bᵀ

By virtue of properties i)-iii) in Lemma 1, convergence of Algorithm 2 follows readily from the general convergence theory available for the BSUM algorithm [20].

Proposition 4: The iterates for A, B, and C generated by Algorithm 2 converge to a stationary point of (25).

A related algorithm, abbreviated CP-APR, can be found in [10], where the objective is to find the tensor's low-rank factors per se. The LRPTI algorithm here generalizes CP-APR by focusing on recovering missing data, and by incorporating prior information through rank regularization. In terms of convergence to a stationary point, the added regularization allows for lifting the assumption on the linear independence of the rows of Π, as required by CP-APR [10] - an assumption without a straightforward validation, since the iterates Π are not accessible beforehand.
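The multiplicative-looking update of Algorithm 2 is equally compact. The sketch below (NumPy; unfolding conventions and names are assumptions made for illustration) forms S and T from the current iterate as in steps 4-5 of the listing and applies the entry-wise minimizer $a_{mr}=t_{mr}+\sqrt{t_{mr}^2+s_{mr}}$ of Lemma 1; a small eps guards the element-wise division.

```python
import numpy as np

def update_factor_poisson(A_bar, R_inv, Pi, Delta_unf, Z_unf, mu, eps=1e-12):
    """One LRPTI factor update via the BSUM minimizer a_mr = t_mr + sqrt(t_mr^2 + s_mr)."""
    lam = np.linalg.eigvalsh(R_inv).max()                      # lambda_max(R^{-1})
    Theta = (lam * np.eye(R_inv.shape[0]) - R_inv) @ A_bar
    model = A_bar @ Pi.T                                       # current Poisson means, M x NP
    S = A_bar * (((Delta_unf * Z_unf) / (model + eps)) @ Pi) / (lam * mu)
    T = (mu * Theta - Delta_unf @ Pi) / (2.0 * lam * mu)
    return T + np.sqrt(T ** 2 + S)                             # entry-wise, stays nonnegative
```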
VI. NUMERICAL TESTS

A. Simulated Gaussian data

Synthetic tensor data of dimensions M×N×P = 16×4×4 were generated according to the Bayesian tensor model described in Section IV. Specifically, entries of $\underline{Z}$ consist of realizations of Gaussian random variables generated according to (11), with means specified by the entries of $\underline{X}$ and variance scaled to yield an SNR of 20 dB. Tensor $\underline{X}$ is constructed from factors A, B, and C, as in (7). Matrices A, B, and C have R = 6 columns containing realizations of independent zero-mean, unit-variance, Gaussian random variables. A quarter of the entries of $\underline{Z}$ were removed at random and reserved to evaluate performance. The remaining seventy-five percent of the data were used to recover $\underline{Z}$, considering the removed data as missing entries. Method (8) was employed for recovery, as implemented by the LRTI algorithm, with regularization $\frac{\mu}{2}\left(\|A\|_F^2+\|B\|_F^2+\|C\|_F^2\right)$ resulting from setting $R_A=I_M$, $R_B=I_N$, and $R_C=I_P$.

The relative recovery error between $\hat{\underline{Z}}$ and the data $\underline{Z}$ was computed, along with the rank of the recovered tensor, as a measure of performance. Fig. 3 depicts these figures of merit averaged over 100 repetitions of the experiment, across values of µ varying on the interval $10^{-5}\mu_{\max}$ to $\mu_{\max}$, which is computed as in Corollary 1. Fig. 3 (bottom) shows that the LRTI algorithm is successful in recovering the missing entries of $\underline{Z}$ up to −10 dB for a wide range of values of µ, presenting a minimum at $\mu=10^{-2}\mu_{\max}$. This result is consistent with Fig. 3 (top), which shows that the rank $R^*=6$ is approximately recovered at the minimum error. Fig. 3 (top) also corroborates the low-rank inducing effect of (8), with the recovered rank varying from the upper bound $\bar{R}=NP=16$ to R = 0 as µ is increased, and confirms that the recovered tensor is null at $\mu_{\max}$, as asserted by Corollary 1.

Fig. 3. Performance of the low-rank tensor imputation method as a function of the regularizing parameter µ; (top) rank of the tensor as recovered by (8), averaged over 100 test repetitions; (bottom) relative recovery error.
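The synthetic setup of this subsection is easy to emulate; the snippet below is a sketch under the stated assumptions (rank-6 Gaussian factors, noise scaled to the reported SNR, 25% of entries hidden), with illustrative variable names.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, P, R, snr_db = 16, 4, 4, 6, 20.0

A, B, C = (rng.standard_normal((d, R)) for d in (M, N, P))
X = np.einsum('mr,nr,pr->mnp', A, B, C)

noise_var = np.mean(X ** 2) / 10 ** (snr_db / 10)          # scale noise to the target SNR
Z = X + np.sqrt(noise_var) * rng.standard_normal(X.shape)

Delta = rng.random(X.shape) > 0.25                          # True on the ~75% observed entries

def holdout_error_db(Z_hat):
    """Relative recovery error (in dB) on the hidden entries."""
    miss = ~Delta
    return 10 * np.log10(np.sum(miss * (Z_hat - Z) ** 2) / np.sum(miss * Z ** 2))
```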
B. Simulated Poisson data

The synthetic example just described was repeated for the low-rank Poisson-tensor model of Section V. Specifically, tensor data of dimensions M×N×P = 16×4×4 were generated with entries of $\underline{Z}$ consisting of realizations of Poisson random variables drawn according to (23), with means specified by the entries of $\underline{X}$. Tensor $\underline{X}$ is again constructed as in (7) from factors A, B, and C having R = 6 columns, containing the absolute values of realizations of independent Gaussian random variables scaled to yield $E[x_{mnp}]=100$. Half of the entries of $\underline{Z}$ were considered missing, to be recovered from the remaining half. Method (25) was employed for recovery, as implemented by the LRPTI algorithm, with regularization $\frac{\mu}{2}\left(\|A\|_F^2+\|B\|_F^2+\|C\|_F^2\right)$.

Fig. 4 shows the estimated rank and recovery error over 100 realizations of the experiment, for µ in the interval 0.01 to 100. The recovery error in Fig. 4 (bottom) exhibits a minimum of −15 dB at µ = 1, where the rank $R^*=6$ is recovered [cf. Fig. 4 (top)]. The low-rank inducing effect of (8) is again corroborated by the decreasing trend in Fig. 4 (top), but in this case the rank is lower bounded by R = 1, because the K-L fitting criterion prevents (25) from yielding a null estimate $\hat{\underline{Z}}$.

Fig. 4. Performance of the low-rank Poisson imputation method as a function of the regularizing parameter µ; (top) rank of the recovered tensor averaged over 100 test repetitions; (bottom) relative recovery error.

C. MRI data

Estimator (14) was tested against a corrupted version of the MRI brain data set 657 from the Internet brain segmentation repository [15]. The tensor $\underline{Z}$ to be estimated corresponds to a three-dimensional MRI scan of the brain comprising a set of P = 18 images, each of M×N = 256×196 pixels. Fifty percent of the data is removed uniformly at random, together with the whole slice $Z_n$, n = 50. Fig. 5 depicts the results of applying estimator (14) to the remaining data, which yields a reconstruction error of −10.54 dB. The original slice $Z_p$, p = 5, its corrupted counterpart, and the resulting estimate are shown on the top and center left. Covariance matrices $K_M$, $K_N$, and $K_P$ are estimated from six additional tensor samples containing complementary scans of the brain, also available at [15]. Fig. 5 (center right) represents the covariance matrix $K_N$ for slices perpendicular to the columns, whose structure reflects symmetries of the brain. This correlation structure is the key enabler for the method to recover the missing slice up to −9.60 dB (see Fig. 5 (bottom)) by interpolating its a priori similar parallel counterparts.

All in all, the experiment evidences the merits of the low-rank PARAFAC decomposition for modeling a tensor, the ability of the Bayesian estimator (13) to recover missing data, and the usefulness of incorporating correlations as side information.

On account of the comprehensive analysis of three-way MRI data arrays in [8], and the nonnegative PARAFAC decomposition advanced thereby, inference of tensors with nonnegative continuous entries will be pursued as future research, combining the methods and algorithms in Sections IV and V of this paper.

Fig. 5. Results of applying (14) to the MRI brain data set 657; (top) original and recovered fibers Z_p and Ẑ_p for p = 5; (center) input fiber Z_p, p = 5, with missing data, and covariance matrix K_N; (bottom) original and recovered column slices Z_n and Ẑ_n for the position n = 50 (in which the whole input slice is missing).

D. RNA sequencing data

The RNA-Seq method described in [18] exhaustively counts the number of RNA transcripts from yeast cells. The reverse transcription of RNA molecules into cDNA is achieved by P = 2 alternative methods, differentiated by the use of oligo-dT or random hexanucleotide primers. These cDNA molecules are sequenced to obtain counts of RNA molecules across M = 6,604 genes on the yeast genome. The experiment was repeated in [18] for a biological and a technological replicate of the original sample, totalling N = 3 instances per primer selection. The data are thus organized in a tensor of dimensions 6,604×3×2, as shown in Fig. 6 (top), with integer data that are modeled as Poisson counts. Fifteen percent of the data is removed and reserved for assessing performance. The missing data are represented in white in Fig. 6 (center).

The LRPTI algorithm is run with the data available in Fig. 6 (center), producing the recovered tensor depicted in Fig. 6 (bottom). The recovery error for this experiment was −15 dB.

Fig. 6. Imputation of sequencing count data via LRPTI; (top) original data; (center) data with missing entries; (bottom) recovered tensor.

VII. CONCLUDING SUMMARY

It was shown in this paper that regularizing with the Frobenius-norm square of the PARAFAC decomposition factors controls the tensor's rank by inducing sparsity in the vector of amplitudes of its rank-one components. A Bayesian method for tensor completion was developed based on this property, introducing priors on the tensor factors. It was argued, and corroborated numerically, that this prior information endows the completion method with extra capabilities in terms of smoothing and extrapolation. It was also suggested, through a parallelism between Bayesian and RKHS inference, that the prior covariance matrices can be obtained from (sample) correlations among the tensor's slices. In such a probabilistic context, generic distribution models for the data lead to multiple fitting criteria. Gaussian and Poisson processes were especially considered, by developing algorithms that minimize the regularized LS error and K-L divergence, respectively.

Numerical tests on synthetic data corroborated the low-rank inducing property and the ability of the completion method to recover the "ground-truth" rank, while experiments with brain images and gene expression levels in yeast served to evaluate the method's performance on real datasets. Although the results and algorithms in this paper were presented for three-way arrays, they are readily extendible to higher-order tensors, or reducible to the matrix case.

APPENDIX

I. Proof of Proposition 1
Proof: a) The equivalence of (2) and (4) results immediately from (3). Indeed, if (4) is minimized in two steps

$$\min_X\ \min_{B,C}\ \tfrac{1}{2}\|(Z-X)\circledast\Delta\|_F^2+\tfrac{\mu}{2}\left(\|C\|_F^2+\|B\|_F^2\right)\quad\text{s. to }BC^T=X \qquad (28)$$

it is apparent that the LS part of the cost does not depend on the inner minimization variables. Hence, (28) can be rewritten as

$$\min_X\ \tfrac{1}{2}\|(Z-X)\circledast\Delta\|_F^2+\min_{B,C}\ \tfrac{\mu}{2}\left(\|C\|_F^2+\|B\|_F^2\right)\quad\text{s. to }BC^T=X \qquad (29)$$

and by recognizing (3) as the inner problem in (29), the equivalence follows.

b) Consider the cost in (4) at the local minimum $(\bar{B},\bar{C})$

$$U(\bar{B},\bar{C}):=\tfrac{1}{2}\|(Z-\bar{X})\circledast\Delta\|_F^2+\tfrac{\mu}{2}\left(\|\bar{C}\|_F^2+\|\bar{B}\|_F^2\right)$$

where $\bar{X}:=\bar{B}\bar{C}^T$. Arguing by contradiction, suppose that there is a different local minimum (B,C) such that $U(B,C)\ne U(\bar{B},\bar{C})$, and without loss of generality set $U(B,C)<U(\bar{B},\bar{C})$, so that $dU:=U(B,C)-U(\bar{B},\bar{C})<0$, which can be expanded to

$$dU=\mathrm{Tr}\!\left[\left(\Delta\circledast(Z-\bar{X})\right)^T\!\left(\Delta\circledast(\bar{X}-X)\right)\right]+\|\Delta\circledast(\bar{X}-X)\|_F^2+\tfrac{\mu}{2}\left(\|C\|_F^2-\|\bar{C}\|_F^2+\|B\|_F^2-\|\bar{B}\|_F^2\right)<0. \qquad (30)$$

Setting this inequality aside for now, consider the augmented matrix Q in terms of generic B and C matrices:

$$Q:=\begin{bmatrix}B\\ C\end{bmatrix}\begin{bmatrix}B^T & C^T\end{bmatrix}=\begin{pmatrix}BB^T & X\\ X^T & CC^T\end{pmatrix} \qquad (31)$$

and the corresponding $\bar{Q}$ defined in terms of $\bar{B}$ and $\bar{C}$. For each value of $\theta\in(0,1)$ consider the convex combination

$$Q_\theta:=\bar{Q}+\theta(Q-\bar{Q}). \qquad (32)$$

As both Q and $\bar{Q}$ are positive semi-definite, so is $Q_\theta$, and by means of the Cholesky factorization one obtains

$$Q_\theta:=\begin{bmatrix}B_\theta\\ C_\theta\end{bmatrix}\begin{bmatrix}B_\theta^T & C_\theta^T\end{bmatrix}=\begin{pmatrix}B_\theta B_\theta^T & X_\theta\\ X_\theta^T & C_\theta C_\theta^T\end{pmatrix} \qquad (33)$$

which defines $B_\theta$, $C_\theta$, and $X_\theta$. Expanding the cost difference $dU_\theta$ as in (30) results in

$$dU_\theta:=U(B_\theta,C_\theta)-U(\bar{B},\bar{C})=\mathrm{Tr}\!\left[\left(\Delta\circledast(Z-\bar{X})\right)^T\!\left(\Delta\circledast(\bar{X}-X_\theta)\right)\right]+\tfrac{\mu}{2}\left(\|C_\theta\|_F^2-\|\bar{C}\|_F^2+\|B_\theta\|_F^2-\|\bar{B}\|_F^2\right)+\|\Delta\circledast(\bar{X}-X_\theta)\|_F^2.$$
From the definitions (31)-(33) it follows that $\bar{X}-X_\theta=\theta(\bar{X}-X)$, $\|B_\theta\|_F^2-\|\bar{B}\|_F^2=\theta\left(\|B\|_F^2-\|\bar{B}\|_F^2\right)$, and $\|C_\theta\|_F^2-\|\bar{C}\|_F^2=\theta\left(\|C\|_F^2-\|\bar{C}\|_F^2\right)$, so that

$$dU_\theta=\theta\,\mathrm{Tr}\!\left[\left(\Delta\circledast(Z-\bar{X})\right)^T\!\left(\Delta\circledast(\bar{X}-X)\right)\right]+\tfrac{\mu}{2}\theta\left(\|C\|_F^2-\|\bar{C}\|_F^2+\|B\|_F^2-\|\bar{B}\|_F^2\right)+\theta^2\|\Delta\circledast(\bar{X}-X)\|_F^2$$

and thus it can be put in terms of (30) as

$$dU_\theta=\theta\left(dU-\|\Delta\circledast(\bar{X}-X)\|_F^2\right)+\theta^2\|\Delta\circledast(\bar{X}-X)\|_F^2.$$

If dU were strictly negative, so would be $dU-\|\Delta\circledast(\bar{X}-X)\|_F^2$, and hence

$$\lim_{\theta\to 0}\tfrac{1}{\theta}dU_\theta=dU-\|\Delta\circledast(\bar{X}-X)\|_F^2<0$$

but then there is in every neighborhood of $(\bar{B},\bar{C})$ a point $(B_\theta,C_\theta)$ such that $U(B_\theta,C_\theta)<U(\bar{B},\bar{C})$, so $(\bar{B},\bar{C})$ cannot be a local minimum. This contradiction implies that $U(B,C)=U(\bar{B},\bar{C})$ for any pair of local minima, which proves the statement in part b) of Proposition 1.

II. Equivalence of tensor completion problems

Proof: The Frobenius square-norms of A, B, and C are separable across columns; hence, the penalty in (8) can be rewritten as

$$\|A\|_F^2+\|B\|_F^2+\|C\|_F^2=\sum_{r=1}^R\|a_r\|^2+\|b_r\|^2+\|c_r\|^2=\sum_{r=1}^R a_r^2+b_r^2+c_r^2 \qquad (34)$$

by defining $a_r:=\|a_r\|$, $b_r:=\|b_r\|$, $c_r:=\|c_r\|$, $r=1,\dots,R$.

On the other hand, $\underline{X}$ can be expressed w.l.o.g. in terms of the normalized outer products (6) with $\gamma_r:=a_rb_rc_r$. Substituting (6) and (34) for the tensor and the penalty, respectively, (8) reduces to

$$\min_{\{u_r\},\{v_r\},\{w_r\}}\ \min_{\{a_r\},\{b_r\},\{c_r\}}\ \tfrac{1}{2}\|(\underline{Z}-\underline{X})\circledast\underline{\Delta}\|_F^2+\tfrac{\mu}{2}\sum_{r=1}^R\left(a_r^2+b_r^2+c_r^2\right)\quad\text{s. to }\underline{X}=\sum_{r=1}^R\gamma_r(u_r\circ v_r\circ w_r),\ \ \gamma_r=a_rb_rc_r. \qquad (35)$$

Focus on the inner minimization w.r.t. the norms $a_r$, $b_r$, and $c_r$, for arbitrary fixed directions $\{u_r\}$, $\{v_r\}$, $\{w_r\}$ and fixed products $\gamma_r:=a_rb_rc_r$. The constraints and the LS part of the cost depend on the $\gamma_r$ only, and not on their particular factorizations $a_rb_rc_r$. Thus, the penalty is the only term that varies when $\gamma_r$ is constant, rendering the inner-most minimization in (35) equivalent to

$$\min_{a_r,b_r,c_r}\ a_r^2+b_r^2+c_r^2\quad\text{s. to }\gamma_r=a_rb_rc_r. \qquad (36)$$

The arithmetic-geometric mean inequality gives the solution to (36), as it states that for scalars $a_r^2$, $b_r^2$, and $c_r^2$ it holds that

$$\sqrt[3]{a_r^2b_r^2c_r^2}\le\tfrac{1}{3}\left(a_r^2+b_r^2+c_r^2\right)$$

with equality when $a_r^2=b_r^2=c_r^2$, so that the minimum of (36) is attained at $a_r^2=b_r^2=c_r^2=\gamma_r^{2/3}$. Substituting the corresponding $\sum_{r=1}^R\left(a_r^2+b_r^2+c_r^2\right)=3\sum_{r=1}^R\gamma_r^{2/3}=3\|\gamma\|_{2/3}^{2/3}$ into (35) yields (9). Equivalence of optimization problems is transitive; hence, by showing that both (9) and (8) are equivalent to (35), they are proved equivalent to each other, as desired.

III. Proof of Corollary 1

Proof: The following result on the norm of the matrix inverse will be used in the proof of the corollary.

Lemma 2 [13, p. 58]: If $E\in\mathbb{R}^{m\times m}$ satisfies $\|E\|_F\le 1$, then $I+E$ is invertible, and $\|(I+E)^{-1}\|_F\le(1-\|E\|_F)^{-1}$.

Another useful inequality holds for any value of µ, and for A, B, and C being the minimizers of (8):

$$\mu\left(\|A\|_F^2+\|B\|_F^2+\|C\|_F^2\right)\le\|\underline{\Delta}\circledast\underline{Z}\|_F^2 \qquad (37)$$

as it follows from comparing the cost at such a minimum and at the feasible point (A,B,C) = (0,0,0).

A second characterization of the minimum of (8) will be obtained by equating the gradient to zero. By vectorizing matrix A, the cost in (8) can be rewritten as

$$\sum_{p=1}^P\tfrac{1}{2}\left\|\mathrm{diag}[\delta_p]\left(z_p-\left(B\,\mathrm{diag}[e_p^T C]\otimes I\right)a\right)\right\|_2^2+\tfrac{\mu}{2}\|a\|^2 \qquad (38)$$

where $z_p$, $\delta_p$, and a denote the vector rearrangements of matrices $Z_p$, $D_p$, and A, respectively. Additional regularization terms that vanish when taking derivatives w.r.t. A were removed from (38). Setting the gradient of (38) w.r.t. a to zero yields $a=(I+E)^{-1}\zeta$ with

$$E:=\frac{1}{\mu}\sum_{p=1}^P\left(B\,\mathrm{diag}[e_p^T C]\otimes I\right)^T\mathrm{diag}[\delta_p]\left(B\,\mathrm{diag}[e_p^T C]\otimes I\right),\qquad \zeta:=\frac{1}{\mu}\sum_{p=1}^P\left(B\,\mathrm{diag}[e_p^T C]\otimes I\right)^T\mathrm{diag}[\delta_p]\,z_p.$$

The norms of E and ζ can be bounded by using the sub-multiplicative property of the norm and the Cauchy-Schwarz inequality, which results in

$$\|E\|_F\le\tfrac{1}{\mu}\|B\|_F^2\|C\|_F^2,\qquad \|\zeta\|\le\tfrac{1}{\mu}\|\underline{\Delta}\circledast\underline{Z}\|_F\|B\|_F\|C\|_F.$$
