ebook img

Low-Rank Modeling and Its Applications in Image Analysis PDF

1.6 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Low-Rank Modeling and Its Applications in Image Analysis

Low-Rank Modeling and Its Applications in Image Analysis XiaoweiZhou,TheHongKongUniversityofScienceandTechnology CanYang,HongKongBaptistUniversity HongyuZhao,YaleUniversity WeichuanYu,TheHongKongUniversityofScienceandTechnology Low-rank modeling generally refers to a class of methods that solve problems by representing variables ofinterestaslow-rankmatrices.Ithasachievedgreatsuccessinvariousfieldsincludingcomputervision, 4 data mining, signal processing and bioinformatics. Recently, much progress has been made in theories, 1 algorithmsandapplicationsoflow-rankmodeling,suchasexactlow-rankmatrixrecoveryviaconvexpro- 0 grammingandmatrixcompletionappliedtocollaborativefiltering.Theseadvanceshavebroughtmoreand 2 moreattentionstothistopic.Inthispaper,wereviewtherecentadvanceoflow-rankmodeling,thestate- t of-the-artalgorithms,andrelatedapplicationsinimageanalysis.Wefirstgiveanoverviewtotheconceptof c low-rankmodelingandchallengingproblemsinthisarea.Then,wesummarizethemodelsandalgorithms O forlow-rankmatrixrecoveryandillustratetheiradvantagesandlimitationswithnumericalexperiments. Next,weintroduce a few applications oflow-rank modeling in thecontext of imageanalysis.Finally,we 3 concludethispaperwithsomediscussions. 2 ] V 1. INTRODUCTION C Inmanyresearchfieldsthedatatobeanalyzedoftenhavehighdimensionality,which . brings great challenges to data analysis. Some examples include images in computer s vision,documentsinnaturallanguageprocessing,customers’recordsinrecommender c [ systems,andgenomicsdatainbioinformatics. Fortunately, the high-dimensional data often lie in a low-dimensional subspace. 3 Mathematically, if we represent each data point by a vector d Rm and denote the v i ∈ entire dataset by a big matrix D = [d , ,d ], the low-dimensionality assumption 9 1 ··· n can be translated into the following low-rank assumption: rank(D) min(m,n). A 0 ≪ typicalexampleincomputervisionisLambertianreflectance,whered , ,d corre- 4 1 n ··· spond to a set of images of a convex Lambertian surface under various lighting con- 3 . ditions[BasriandJacobs 2003]. Anotherexampleis from signal processing,wheredi 1 representsa vector of signal intensities receivedby an antennaarray at time pointi. 0 Interested readers are referred to [Markovsky 2012, Chapter 1.3] for more low-rank 4 examples. 1 Intherealworld,therawdatacanhardlybeperfectlylow-rankduetotheexistence : v ofnoise.Therefore,thefollowingmodelismorefaithfultorealsituations i X r a D=X+E, (1) where X corresponds to a low-rank component and E corresponds to the noise or er- rorinmeasurements.Recoveringthelow-rankstructurefromnoisydatabecomesthe centrictaskinmanyproblems. A conventional approach to finding the low-rank approximation is to solve the fol- lowingoptimizationproblem: min D X 2, X k − kF s.t.rank(X) r, (2) ≤ 2 X.Zhouetal. where Y = Y2 denotes the Frobenious norm of a matrix.1 Solving such k kF ij ij a minimization qproblem can be interpreted as seeking the optimal rank-r esti- P mate of D in a least-squares sense. According to the matrix approximation theorem [EckartandYoung1936],thesolutionto(2)isgivenanalyticallybythesingularvalue decomposition(SVD): r X∗ = σ u vT, (3) i i i i=1 X where u , v ,and σ fori=1, ,r correspondtotheleftsingularvectors,right i i i { } { } { } ··· singular vectors, and singular values of D, respectively. The vectors u , ,u also 1 r ··· form an orthogonal basis to represent a r-dimensional subspace that can best embed thedata.ThisprocedurecorrespondstoPrincipalComponentAnalysis(PCA)[Jolliffe 2002]instatistics. WhilePCAisoneofthemostpopulartoolsfordataanalysisbecauseoftheanalyti- calsolutionincomputationandtheprovableoptimalityundercertainassumptions,it cannot handle some difficulties in real applications. Consider the following two com- monexamples: Recovery from a few entries. In many applications, we would like to recover a matrixfromonlyasmallnumberofobservedentries.Atypicalexampleisthat,when buildingrecommendersystems,wehopetomakepredictionstocustomers’preferences basedontheinformationcollectedsofar.TheNetFlixproblem[Korenetal.2009]isa famousinstance.ThedataisabigmatrixDwitheachentryD 1, ,5 recording ij ∈{ ··· } theratingofcustomeriformoviej.Therearearound480Kcustomersand18Kmovies inthedataset,butonly1.2%entrieshavevaluessinceeachcustomeronlyratedabout 200 movieson average.The problem is how to predict the ratings that have not been made yet based on the current observation. A popular solution is to assume that the ratingmatrixshouldbelow-rank.Thisassumptionisbasedonthefactthatasubgroup of customers are likely to share a similar taste and their ratings to the movies will be highly correlated. Consequently, the rank of the rating matrix will be bounded by the number of subgroupsformedby the customers.Therefore,the problem turns into recoveringa low-rankmatrix from afewentries.This problemis often calledMatrix Completion(MC). Recovery from gross errors.In some other applications, we have to recovera low- dimensional subspace from corrupted data. For example, the face images of a person may include glasses or shadows that occlude the true appearance. The classical PCA assumes independently and identically distributed (i.i.d.) Gaussian noise and adopts the sum of squared differences as the loss function, as shown in (2). Since the least- squares fitting is sensitive to outliers, the classical PCA can be easily corrupted by these gross errors. For example, the reconstructed face images would include arti- facts caused by the glasses or shadows in the input images [DeLaTorreandBlack 2003]. Recovering a subspace or low-rank matrix robustly in the presence of outliers hasbecomeapopularly-studiedissue.ThisproblemisoftencalledRobustPrincipal ComponentAnalysis(RPCA). In recentyears, many new techniques for low-rankmatrix recoveryhave been pro- posed. In the following, we will introduce some representative works. Basically, they 1Inthispaper,amatrixisdenotedbyacapitalletter,e.g.Y.AnelementandacolumnofYaredenotedby Yij andyi,respectively. Low-RankModelingandItsApplicationsinImageAnalysis 3 canbedividedintotwocategoriesbasedontheirapproachestomodelingthelow-rank prior. The first approach is to minimize the rank of the unknown matrix subject to some constraints. The rank minimization is often achieved by convex relaxation. We callthesemethodsrankminimizationmethods.Thesecondapproachistofactorizethe unknownmatrixasaproductoftwofactormatrices.Therankoftheunknownmatrix is upper bounded by the ranks of the factor matrices. We call these methods matrix factorizationmethods. Therestofthispaperisorganizedasfollows.2 InSection2,wewillreviewtherank minimization methods for low-rank matrix recovery.We shall introduce some typical modelsaswellasthecorrespondingoptimizationalgorithmstosolvethesemodels.In Section3,wewillintroducematrixfactorizationmethodsforlow-rankmatrixrecovery. InSection4,wewillusesynthesizedexperimentstoillustratetheperformancesofthe discussedmethods.InSection5,wewillgiveabriefreviewoftheapplicationsoflow- rankmodelinginimageanalysis.Finally,wewillconcludethepaperwithdiscussions inSection6. 2. RANKMINIMIZATION A directapproach to recoveringa low-rankmatrix is to minimize the rank ofthe ma- trixwith certain constraintsthat make theestimated matrix consistentwith original data.However,the rankminimizationproblemiscombinatorialandknowntobeNP- hard[Fazel2002].Therefore,convexrelaxationisoftenusedtomaketheminimization tractable. The most popular choice is to replace rank with the nuclear norm which is definedas r X ∗ = σi, (4) k k i=1 X whereσ ,σ , ,σ arethesingularvaluesofXandristherankofX.Theadvantages 1 2 r ··· of using the nuclear norm relaxation are mainly two-folds. Firstly, the nuclear norm is convex. Hence, it is feasible to compute the global optima of the relaxed problem efficiently. Secondly, the nuclear norm is proven to be the tightest convex surrogate of rank [Fazel 2002]. It means that the nuclear norm is the best approximation to the rank operator in all convex functions. Moreover, the analogy between using the nuclear norm for low-rank matrix recovery and using the ℓ -norm for sparse signal 1 recoveryhasbeenwellestablished[Rechtetal.2010],andtheexactrecoveryproperty hasbeenprovenforsomelow-rankmodelsusingthe nuclearnorm[Rechtetal.2010; Cande`sandRecht 2009; Cande`setal. 2011]. In the following, we will first introduce theconvexmodels,summarizetheoptimizationalgorithms,andfinallyintroducesome nonconvexrelaxationmethodsbriefly. 2.1. MatrixCompletion Inmatrixcompletion,missingvaluesinamatrixareestimatedgivenobservedvalues D ij Ω ,whereΩdenotesthesetofobservedentries.Asdiscussedpreviously,the ij { | ∈ } commonassumption is that the matrix should be low-rank.To solve the problem,the followingoptimizationproblemisoftenconsidered: min rank(X), X s.t. (X)= (D), (5) Ω Ω P P 2AconferenceversionofthispaperappearedinProceedingsofSPIEDefense,Security,andSensing2013 [ZhouandYu2013]. 4 X.Zhouetal. where (X)denotestheoperationofprojectingmatrixXtothespaceofallmatrices Ω P with nonzero elements restricted in Ω, i.e. (X) has the same values as X for the Ω P entries in Ω and zero values for the entries outside Ω. The equality constraint in (5) saysthattheestimatedvaluesshouldcoincidewiththeexistingdata. Aswediscussedearlier,replacingrankwiththenuclearnormcanmaketheproblem tractable.Insomerecentworks[Cande`sandRecht2009;Caietal.2010],thefollowing convexproblemissolved: min X ∗, X k k s.t. (X)= (D), (6) Ω Ω P P Cande`sandRecht [2009] theoretically proved that the solution of (6) can exactly re- cover the low-rank matrix with a high probability, if the underlying low-rank matrix satisfies the incoherence condition and the locations of observed entries Ω are uni- formlydistributedwith Ω Cn1.2rlogn,where Ω isthenumberofobservedentries, | |≥ | | C is a positive constant, n is the matrix size, and r is the rank. Here the incoherence conditionisusedtomathematicallycharacterizethedifficultyofrecoveringtheunder- lyinglow-rankmatrixfromasmallnumberofsampledentries.Informally,itsaysthat thesingularvectorsoftheunderlyinglow-rankmatrixshouldsufficiently“spreadout” andbeuncorrelatedwiththestandardbasis.Anextremeexampleisthattheunderly- ing low-rank matrix takes 1 in its (i,j)-th entry and 0 elsewhere.This matrix can be recovered only if the (i,j)-th entry is actually sampled. This result has been further strengthened to Ω Cnr poly(logn) by imposing the strong incoherence condition | | ≥ [Cande`sandTao2010;Gross2011]. In real applications, the observed entriesmay be noisy,and the equality constraint in (6) will be too strict, resulting in over-fitting [Mazumderetal. 2010]. Therefore, the following relaxed form of (6) is often considered for matrix completion with noise [CandesandPlan2010;Mazumderetal.2010]: 1 mXin 2 kPΩ(D)−PΩ(X)k2F +λkXk∗, (7) wheretheparameterλcontrolstherankofXandtheselectionofλshoulddependon thenoiselevel[CandesandPlan2010]. 2.2. RobustPrincipalComponentAnalysis Convexprogramminghas also been used to solve RPCA. A popular method is named sparse and low-rankdecomposition[Cande`setal. 2011], and involvesthe decomposi- tion of a matrix D as a sum of a low-rank component X and a sparse component E by minimizing the rank of X and the cardinality of E simultaneously. The surprising messageisthat,undersomemildassumptions,thelow-rankmatrixcanbeexactlyre- covered by the following convex program named Principal ComponentPursuit (PCP) [Cande`setal.2011]: min X ∗+λ E 1, X,E k k k k s.t. X+E=D. (8) Here,thenuclearnorm X ∗ andtheℓ1-norm E 1 aretheconvexsurrogatesofrank k k k k and cardinality, respectively. Cande`setal. [2011] and Chandrasekaranetal. [2011] analyzed the conditions for exact recovery. Briefly speaking, it has been proven that the underlying low-rank matrix X∗ and the underlying sparse matrix E∗ can be ex- actlyrecoveredwithhighprobabilityifX∗ satisfiestheincoherenceconditionandthe nonzeroentriesofE∗ aresufficientlysparsewitharandomspatialdistribution.More- Low-RankModelingandItsApplicationsinImageAnalysis 5 over,a theoretical choice of parameter λ is providedto make the exact recoverymost likely[Cande`setal.2011]. The basic model in (8) was extended to handle additional scenarios such as Stable PCP that considers Gaussian noise [Zhouetal. 2010], the outlier pursuit that incor- porates group sparsity [Xuetal. 2012a], and the matrix recovery from compressive measurements[Wrightetal.2012].InStablePCP[Zhouetal.2010],theequalitycon- straint in (8) is relaxed to be X+E D σ to allow the existence of Gaussian F k − k ≤ noise.Inimplementation,thefollowingproblemissolved µ mX,iEn kXk∗+λkEk1+ 2kX+E−Dk2F, (9) whereµisaparameterdeterminedbythenoiselevel. 2.3. OptimizationAlgorithms Thefollowingtheorem([Caietal.2010,Theorem2.1])servesasanimportantbuilding blockinnuclearnormminimizationalgorithms: THEOREM 2.1. Thesolutiontothefollowingproblem 1 mXin 2 kZ−Xk2F +λkXk∗ (10) isgivenbyX∗ = (Z),where λ D r (Z)= (σ λ) u vT, (11) Dλ i− + i i i=1 X ristherankofZ,(x) =max(x,0),and u , v and σ aretheleftsingularvectors, + i i i { } { } { } rightsingularvectorsandsingularvaluesofZ,respectively. isnamedthesingularvaluethresholding(SVT)operator[Caietal.2010]. λ D Based on Theorem 2.1, various algorithms have been developed for specific prob- lems. Two of the most popular techniques are the Proximal Gradient (PG) method [Moreau1965]andtheAugmentedLagrangianMethod(ALM)[Bertsekas1999],both of which are applicable to a variety of convex problems. The PG method is very use- ful to solve the norm-regularized maximum-likelihood problems such as the model in (7), whose energy function comprises a differentiable loss and a nonsmooth reg- ularizer. Moreover, the PG method is often combined with the Nesterov method to accelerateconvergence[Nesterov2007;BeckandTeboulle 2009].Examplesusing the PG method include [JiandYe 2009; Mazumderetal. 2010; TohandYun 2010], etc. The ALM method is closely related to the Alternating Direction Method of Multipli- ers(ADMM) [Boyd2010].It providesa powerfulframeworkto solve convexproblems with equality constraints such as MC in (6) and PCP in (8). The algorithms used in [Linetal. 2010; Cande`setal. 2011] belong to this class. The details of PG and ALM willbegiveninthefollowingsubsections. 2.3.1. ProximalGradient. Insparselearningproblems,thefollowingoptimizationprob- lemisoftenconsidered: minf(X)+λ (X), (12) X R where f(X) usually denotes a differentiable loss function and (X) corresponds to a R convex regularizer which might be nonsmooth. For example, the matrix completion withnoisein(7)usesf(X)=kPΩ(X−D)k2F andR(X)=kXk∗. Iff(X)issimplythesumofsquareddifferencesbetweenXandagivenmatrix,the problem in (12) is named the proximal problem [Moreau 1965], which can be solved 6 X.Zhouetal. analytically for many types of (X). For example, if (X) is the nuclear norm, the R R problemcanbesolvedanalyticallybasedonTheorem2.1. When(12)isnotastandardproximalproblem,theProximalGradient(PG)method [Moreau 1965] is usually used. In PG, a quadratic approximation to f(X) is made aroundthepreviousestimateX′ ineachiteration.Define µ Q (X,X′)=f(X′)+ f(X′),X X′ + X X′ 2 +λ (X), µ h∇ − i 2k − kF R µ 1 = X [X′ f(X′)] 2 +λ (X)+const., (13) 2k − − µ∇ kF R where <> means the inner product and µ is a constant. It can be proven that [BeckandTeboulle 2009], if f(X) is differentiable and convex with a Lipschitz con- tinuousgradientsatisfying f(X ) f(X ) µ X X , (14) 1 2 F 1 2 F k∇ −∇ k ≤ k − k (12)canbesolvedbyrepeatedlyupdatingXvia Xk+1 =argmin Q (X,Xk), µ X 1 1 λ =argmin X [Xk f(Xk)] 2 + (X) (15) X 2k − − µ∇ kF µR with a convergence rate of (1/k). It is easy to see that (15) is simply the proximal O problem,whichisoftenconvenienttosolve. The Accelerated Proximal Gradient (APG) method uses the Nesterov method [Nesterov1983]to acceleratethe convergenceofPG.Instead ofmaking quadratic ap- proximationaroundXk,APGmakestheapproximationatanotherpointYk,whichis a linear combination of Xk and Xk−1. This modification will give a convergence rate of ( 1 ).Pleasereferto[Nesterov2007;BeckandTeboulle2009]fordetails.TheAPG O k2 methodissummarizedinAlgorithm1. Algorithm1AcceleratedProximalGradient(APG) 1. Initialize:X0 =X−1 Rm×n,t0 =t−1 =1 ∈ 2. repeat 3. Yk =Xk+ tk−t1k−1(Xk−Xk−1) 4. Xk+1 =argminXQµ(X,Yk) 5. tk+1 = 1+√1+4(tk)2 2 6. untilconvergence ThePGandAPGmethodshavebeenintensivelyusedtosolvethematrixcompletion problem in (7), where the updating in (15) is solved via SVT. For example,the SOFT- IMPUTEalgorithmin[Mazumderetal.2010]solves(7)byiterativelyupdatingXby: Xk+1 =Dλ(PΩ(D)+PΩ⊥(Xk))=Dλ(Xk−[PΩ(Xk)−PΩ(D)]), (16) wherePΩ⊥ denotesthecomplementaryprojectionsuchthatPΩ(Xk)+PΩ⊥(Xk)=Xk. Itisstraightforwardtofindthat(16)isequivalentto(15)withµ=1.Hence,theSOFT- IMPUTE algorithm can be interpreted as PG with fixed step length. The FPCA algo- rithmintroducedin[Maetal.2011]isalsobasedonPGwithacontinuationtechnique to accelerateconvergence.JiandYe [2009] andTohandYun [2010]also proposeddif- ferentimplementationsofAPGformatrixcompletion.Tomiokaetal.[2010]proposeda Low-RankModelingandItsApplicationsinImageAnalysis 7 DualAugmentedLagrangian algorithm for matrix completion,which achievessuper- linear convergence. It can be interpreted as a proximal method with the descending directionscomputedfromtheaugmentedLagrangianofthedualproblem. 2.3.2. Augmented Lagrangian Method. The Augmented Lagrangian Method (ALM) [Bertsekas 1999] is a classical tool to minimize a convex function with equality con- straints.WewillusePCPin(8)asanexampletointroducethismethod. ToremovetheconstraintX+E=D,amultiplierYisintroducedandtheaugmented Lagrangianof(8)reads µ Lµ(X,E,Y)=kXk∗+λkEk1+<Y,D−X−E>+2kD−X−Ek2F. (17) ALMalternatesbetweenthefollowingtwosteps: (Xk+1,Ek+1)=argminL (X,E,Yk), (18) µ X,E Yk+1 =Yk+µ(D Xk+1 Ek+1), (19) − − whicharenamedprimalminimizationanddualascent,respectively.ForPCP,thepri- mal minimization in (18) is difficult over X and E simultaneously. But if we fix one variable and minimize over the other one, the marginal optimization over X (or E) turnsinto the nuclear norm (orℓ -norm)regularizedproximalproblem,which can be 1 efficiently solved by SVT (or soft-thresholding). Then, the X-step and E-step are re- peateduntilconvergencetosolve(18). A more efficient way is to update the primal variables X and E for only one iter- ation, instead of exactly solving (18) before updating the dual variable Y [Linetal. 2010; Cande`setal. 2011].This is named the InexactAugmentedLagrangian Method (IALM), a special case of the Alternating Direction Method of Multipliers (ADMM) [Boyd 2010]. The method is summarized in Algorithm 2. It can be proven that the sequences Xk and Ek will converge to an optimal solution of (8) [Linetal. 2010; { } { } Boyd2010]. Algorithm2InexactAugmentedLagrangianMethod(IALM)forPCP 1. Initialize:E0 =Y0 =0 2. repeat 3. Xk+1 =argminXLµ(X,Ek,Yk) 4. Ek+1 =argminELµ(Xk+1,E,Yk) 5. Yk+1 =Yk+µ(D Xk+1 Ek+1) − − 6. untilconvergence ALMcanalsobeappliedtomatrixcompletionin(6).In[Linetal.2010],theequality constraint (X) = (D) is replaced by X = D+E and (E) = 0. The new con- Ω Ω Ω P P P straintisequivalenttotheoriginalone,buttheprojectionoperatoronXhasbeenre- moved.Then,theALMisapplied.Inthisway,minimizingtheaugmentedLagrangian over X turns into a proximal problem, which could be solved by SVT. ALM was also applied to solve the nonnegative matrix factorization problem for matrix completion [Xuetal.2012b]. 2.4. NonconvexRankMinimization Recently, a few works used nonconvex functions instead of the nuclear norm as the surrogates of rank for rank minimization. A typical choice is the Schatten-p norm of 8 X.Zhouetal. singularvalues: r 1/p f (X)= σp , (20) p i ! i=1 X where σ , ,σ are singular values of X. When p 0, f (X) rank(X), and 1 r p ··· → → the minimization is intractable. When p = 1, f (X) turns out to be the nuclear 1 norm, which is the tightest convex surrogate.To bridge the gap,the nonconvexcases with 0 < p < 1 were considered in recent literature [MajumdarandWard 2011; MohanandFazel2012;MarjanovicandSolo2012;Nieetal.2012;Lietal.2014].The theoretical analysis on the recovery properties can be found in [Zhangetal. 2013a]. Besides the Schatten-p, other nonconvex surrogate functions for rank minimization werealsostudiedin[Luetal.2014]. 3. MATRIXFACTORIZATION Instead of minimizing rank, another approach to modeling the low-rank property is matrixfactorization.MatrixfactorizationintendstodecomposeX Rm×nasaproduct of two factor matrices X = ABT, where A Rm×r and B ∈Rn×r. Using matrix ∈ ∈ factorizationtomodelalow-rankmatrixisbasedonthefactthat rank(ABT) min(rank(A),rank(B)). (21) ≤ Therefore,if r is small, X has a small rank. Finally, the problem of recoveringa low- rank matrix can be converted into estimating two factor matrices A and B. In this paper, we will discuss representative matrix-factorization methods in the context of low-rank matrix recovery.Notice that not all matrix factorization methods aim to re- coveralow-rankmatrix.Forexample,theoutputsofnonnegativematrixfactorization [LeeandSeung 1999] or dictionary learning [TosicandFrossard 2011] are not neces- sarily low-rank. We will not discuss these methods here. For a summary of matrix factorizationmethods,pleasereferto[SinghandGordon2008]. 3.1. MatrixFactorizationwithMissingValues Basically, the factorization-based methods for matrix completion aim to solve the fol- lowingoptimizationproblem: 1 min (D) (ABT) 2. (22) A,B 2 kPΩ −PΩ kF A straightforward approach to solving (22) is to minimize the function over A or B alternately by fixing the other one. Each subproblem of estimating A or B turns into a least-squares problem which admits a closed-form solution. Algorithms of this type have been extensively studied in many works such as the early computer vision literature [Shumetal. 1995; VidalandHartley 2004] and the recent matrix recovery literature[HaldarandHernando2009;TangandNehorai2011;Jainetal.2013]. ThematrixcompletionsolverLMaFit[Wenetal.2012]alsoadoptedthealternating strategytosolvethefollowingequivalentformof(22): 1 min Z ABT 2, A,B,Z2 k − kF s.t. (Z)= (D), (23) Ω Ω P P where Z is an auxiliary variable. Each step of updating A, B or Z can be solved veryefficiently.Additionally,LMaFitintegratesanonlinearsuccessiveover-relaxation schemetoacceleratetheconvergenceofalternation. Low-RankModelingandItsApplicationsinImageAnalysis 9 While the formulation in (22) is nonconvex, the empirical results in many works demonstrated that the alternating minimization performed both accurately and ef- ficiently compared to convex methods [HaldarandHernando 2009; Keshavanetal. 2009; TangandNehorai 2011]. Meanwhile, the theoretical analysis in [Jainetal. 2013] showed that the alternating minimization can succeed under the conditions similar to the existing conditions given in [Cande`sandRecht 2009], which has been introducedinSection2.1.Thelowerboundsfortherecoveryerrorofusingalternating minimizationformatrixcompletionwereanalyzedin[TangandNehorai2011]. Incomputervisionliterature,manyworksadoptedhigherorderalgorithmsinstead of alternating least squares to solve (22) for faster convergence and better preci- sion. For example, BuchananandFitzgibbon [2005] developed a Damped Newton al- gorithm to solve the problem.The variables A and B are updated based on the New- ton algorithm with a damping factor. However, they cannot handle large-scale prob- lems due to the infeasibility of computing the Hessian matrix over a large number of variables. To interpolate between the alternating least squares and the Newton algorithm, some works proposed to use hybrid algorithms. In the Wiberg algorithm [OkataniandDeguchi 2007], A is updated via least squares while B is updated by a Gauss-Newton step in each iteration. Later, the Wiberg algorithm was extended to a damped version to achieve better convergence [Okatanietal. 2011]. Chen [2008] proposed an algorithm similar to the Wiberg algorithm. The difference is that B is updatedviatheLevenberg-Marquadtalgorithmandconstrainedin BBTB = I .In- { | } terested readers can refer to [Okatanietal. 2011] for a more detailed introduction to thefactorizationmethodsincomputervision. When observation is highly incomplete,the problem in (22) is likely to be ill-posed, which is a common case in collaborative filtering. A popular approach to addressing thisissueistopenalizethesquaredFrobeniousnormsoffactormatrices: 1 λ min (D) (ABT) 2 + ( A 2 + B 2). (24) A,B 2 kPΩ −PΩ kF 2 k kF k kF ThismethodisnamedMaximumMarginMatrixFactorization(MMMF) [Srebroetal. 2005].Theideaissimilartousingthesquaredℓ -norminridgeregressiontoimprove 2 the stability of parameter estimation. Moreover,the following equality is established in[Srebroetal.2005] 1 kXk∗ =A,B:mX=inABT 2 (kAk2F +kBk2F), (25) which indicates the equivalence between the MMMF in (24) and the nuclear norm minimization in (7). This equivalence was also studied in [Mazumderetal. 2010; Wangetal.2012;Cabraletal.2013].Tosolvetheoptimizationin(24),eithergradient- based algorithms [RennieandSrebro 2005] or alternating minimization can be used [Wangetal.2012;Cabraletal.2013]. 3.2. RiemannianOptimization Anotherwidely-usedregularizationstrategyinlow-rankmatrixfactorizationistocon- strainthesearchspaceandoptimizeovermanifolds. Keshavanetal.[2010a]proposedtosolvethefollowingmatrixcompletionproblem 1 min (D) (AΣBT) 2, A,Σ,B2 kPΩ −PΩ kF s.t. A Gr(r,m), B Gr(r,n), Σ Rr×r, (26) ∈ ∈ ∈ where Gr(r,n) denotes the set of r-dimensional subspaces in Rn, which forms a Rie- mannian manifold named Grassmannian. Keshavanetal. [2010a] proposed an algo- 10 X.Zhouetal. rithm named OptSpace to iteratively estimate the factor matrices, where A and B are updatedby gradientdescent overthe Grassmannian, while Σ is updated by least squares. Similar to the theoretical results for the convex method [CandesandPlan 2010],Keshavanetal.[2010b]providedtheperformanceguaranteeofOptSpaceunder an appropriate incoherence condition. Building upon the same model, NgoandSaad [2012] proposed a scaled-gradient procedure to accelerate the convergence of the al- gorithm. Daietal. [2011] proposed an algorithm named SET to solve the following two-factormodel: 1 min (D) (ABT) 2, A,B2 kPΩ −PΩ kF s.t. A Gr(r,m), B Rn×r. (27) ∈ ∈ SETupdatesAovertheGrassmannianandestimatesBbyleastsquares.Basedonthe samemodel,BoumalandAbsil[2011]developedanalgorithmnamedRTRMCthatop- timizesthecostfunctionbyasecond-orderRiemanniantrust-regionmethodtoachieve fasterconvergence. Mishraetal. [2013b] proposed a framework of optimization over Riemannian quo- tientmanifoldsforlow-rankmatrixfactorization.Theyinvestigatedthreetypesofma- trixfactorization:thefull-rankfactorization,thepolarfactorizationandthesubspace- projectionfactorization,which are relatedto the modelsin (22),(26)and (27),respec- tively. To take into account the invariance over a class of equivalent solutions, they explored the underlying quotient nature of the search spaces and designed a class of gradient-basedandtrust-regionalgorithmsoverthequotientsearchspaces.Theycon- cludedthroughexperimentsthatthethreefactorizationmodelswithdifferentRiemm- nanian structures were almost equivalent in terms of computational complexity and performedfavorablycomparedtothepreviousmethodssuchasLMaFitandOptSpace. More related works include [Meyeretal. 2011; Mishraetal. 2012, 2013a; Absiletal. 2013],etc. Instead of exploring the geometries of search spaces of factor matrices, Vandereycken [2013] and Shalitetal. [2012] proposed to directly optimize a function overthesetoffixed-rankmatrices: 1 min f(X)= (D) (X) 2, X 2 kPΩ −PΩ k2 s.t. X . (28) r ∈M Here denotes the set of rank-r matrices in Rm×n, which forms a smooth mani- r M fold. Vandereycken [2013] developed a conjugate gradient descent algorithm named LRGeomCG to efficiently optimize any smooth function over , while Shalitetal. r M [2012] designed an online algorithm to solve large-scale problems. Numerical experi- ments [Vandereycken 2013; Mishraetal. 2013b] showed that LRGeomCG performed comparablywiththequotient-spacemethodsformatrixcompletion. Recently, a manifold-optimization toolbox named Manopt has been developed [Boumaletal. 2014], providing a lot of ready-to-use algorithms to solve optimization problems over various manifolds, such as Grassmannians and the fixed-rank mani- folds. 3.3. RobustMatrixFactorization Robust matrix factorization is a method for handling outliers in data, and can be re- garded as the factorization approach towards RPCA. As mentioned before,the sensi- tivitytooutliersfortraditionalmethodsisduetothesquaredlossusedin(22),which penalizes large errors too much, resulting in biased fitting. To address this issue, a

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.