Matrix Approximation under Local Low-Rank Assumption

Joonseok Lee^a, Seungyeon Kim^a, Guy Lebanon^{a,b}, Yoram Singer^b
^a College of Computing, Georgia Institute of Technology, Atlanta, GA 30332
^b Google Research, Mountain View, CA 94043
jlee716@, seungyeon.kim@, lebanon@cc.gatech.edu, [email protected]

Abstract

Matrix approximation is a common tool in machine learning for building accurate prediction models for recommendation systems, text mining, and computer vision. A prevalent assumption in constructing matrix approximations is that the partially observed matrix is of low rank. We propose a new matrix approximation model where we assume instead that the matrix is only locally of low rank, leading to a representation of the observed matrix as a weighted sum of low-rank matrices. We analyze the accuracy of the proposed local low-rank modeling. Our experiments show improvements in prediction accuracy on recommendation tasks.

1 Introduction

Matrix approximation is a common task in machine learning. Given a few observed matrix entries $\{M_{a_1,b_1}, \ldots, M_{a_m,b_m}\}$, matrix approximation constructs a matrix $\hat{M}$ that approximates $M$ at its unobserved entries. In general, the problem of completing a matrix $M$ based on a few observed entries is ill-posed, as there are infinitely many matrices that perfectly agree with the observed entries of $M$. We therefore need an additional assumption, for example that $M$ is a low-rank matrix. More formally, we approximate a matrix $M \in \mathbb{R}^{n_1 \times n_2}$ by a rank-$r$ matrix $\hat{M} = UV^T$, where $U \in \mathbb{R}^{n_1 \times r}$, $V \in \mathbb{R}^{n_2 \times r}$, and $r \ll \min(n_1, n_2)$. In this note, we assume instead that $M$ behaves as a low-rank matrix in the vicinity of certain row-column combinations, rather than assuming that the entire matrix $M$ is of low rank. We therefore construct several low-rank approximations of $M$, each accurate in a particular region of the matrix. Smoothing the local low-rank approximations, we express $\hat{M}$ as a linear combination of low-rank matrices that approximate the unobserved matrix $M$. This mirrors the theory of non-parametric kernel smoothing, which was developed primarily for continuous spaces, and generalizes well-known compressed sensing results to our setting.

2 Global and Local Low-Rank Matrix Approximation

We describe in this section two standard approaches for low-rank matrix approximation (LRMA). The original (partially observed) matrix is denoted by $M \in \mathbb{R}^{n_1 \times n_2}$, and its low-rank approximation by $\hat{M} = UV^T$, where $U \in \mathbb{R}^{n_1 \times r}$, $V \in \mathbb{R}^{n_2 \times r}$, $r \ll \min(n_1, n_2)$.

Global LRMA. Incomplete SVD is a popular approach for constructing a low-rank approximation $\hat{M}$ by minimizing the Frobenius norm over the set $A$ of observed entries of $M$:

$$(U, V) = \operatorname*{argmin}_{U,V} \sum_{(a,b) \in A} \left( [UV^T]_{a,b} - M_{a,b} \right)^2. \qquad (1)$$

Another popular approach is to minimize the nuclear norm of a matrix (defined as the sum of its singular values) subject to constraints constructed from the training set:

$$\hat{M} = \operatorname*{argmin}_{X} \|X\|_* \quad \text{s.t.} \quad \|\Pi_A(X - M)\|_F < \alpha, \qquad (2)$$

where $\Pi_A : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^{n_1 \times n_2}$ is the projection defined by $[\Pi_A(M)]_{a,b} = M_{a,b}$ if $(a,b) \in A$ and $0$ otherwise, and $\|\cdot\|_F$ is the Frobenius norm.
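To make the global formulation concrete, the following is a minimal sketch of one standard way to minimize the incomplete-SVD objective (1): stochastic gradient descent over the observed entries, with an optional $L_2$ penalty on the factors. The function name, step size, and other hyperparameters are illustrative choices, not taken from the paper.

```python
import numpy as np

def incomplete_svd(M, observed, rank, lr=0.01, reg=0.0, n_iters=200, seed=0):
    """Fit M ~ U V^T on the observed entries only, minimizing the squared
    error of Eq. (1) by stochastic gradient descent. `reg` controls an
    optional L2 penalty on the factor rows."""
    rng = np.random.default_rng(seed)
    n1, n2 = M.shape
    U = 0.1 * rng.standard_normal((n1, rank))
    V = 0.1 * rng.standard_normal((n2, rank))
    for _ in range(n_iters):
        for a, b in observed:
            err = U[a] @ V[b] - M[a, b]      # residual at one observed entry
            grad_u = err * V[b] + reg * U[a]  # gradients use the pre-update factors
            grad_v = err * U[a] + reg * V[b]
            U[a] -= lr * grad_u
            V[b] -= lr * grad_v
    return U, V
```

Given the training pairs in $A$, $\hat{M} = UV^T$ is read off from the returned factors. The nuclear-norm problem (2) calls for a different solver (e.g., a proximal method such as singular value thresholding) and is not sketched here.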
Figure 1: For illustrative purposes, we assume a distance function $d$ whose neighborhood structure coincides with the natural order on indices. That is, $s = (a, b)$ is similar to $s' = (a', b')$ if $|a - a'|$ and $|b - b'|$ are small. (Left) For all $s \in [n_1] \times [n_2]$, the neighborhood $\{s' : d(s, s') < h\}$ in the original matrix $M$ is approximately described by the corresponding entries of the low-rank matrix $\mathcal{T}(s)$ (shaded regions of $M$ are matched by lines to the corresponding regions in $\mathcal{T}(s)$ that approximate them). If $d(s, u)$ is small, $\mathcal{T}(s)$ is similar to $\mathcal{T}(u)$, as shown by their spatial closeness in the embedding space $\mathbb{R}^{n_1 \times n_2}$. (Right) The original matrix $M$ (bottom) is described locally by the low-rank matrices $\mathcal{T}(t)$ (near $t$) and $\mathcal{T}(u)$ (near $u$). The lines connecting the three matrices identify identical entries: $M_t = \mathcal{T}_t(t)$ and $M_u = \mathcal{T}_u(u)$. The equation at the top right, $\mathcal{T}_u(t) = \mathcal{T}_u(u) + \epsilon = M_u + \epsilon$, shows a relation tying the three patterned entries. Assuming the distance $d(t, u)$ is small, $\epsilon = \mathcal{T}_u(t) - \mathcal{T}_u(u) = \mathcal{T}_u(t) - M_u$ is small as well.

Minimizing the nuclear norm $\|X\|_*$ is an effective surrogate for minimizing the rank of $X$. One advantage of (2) over (1) is that we do not need to fix the rank of $\hat{M}$ in advance. However, problem (1) is substantially easier to solve than problem (2).

Local LRMA. In order to facilitate a local low-rank matrix approximation, we assume that there exists a metric structure over $[n_1] \times [n_2]$, where $[n]$ denotes the set of integers $\{1, \ldots, n\}$. Formally, $d((a, b), (a', b'))$ reflects the similarity between the rows $a$ and $a'$ and the columns $b$ and $b'$. In the global matrix factorization setting above, we assume that the matrix $M \in \mathbb{R}^{n_1 \times n_2}$ has a low-rank structure. In the local setting, however, we assume that the model is characterized by multiple low-rank $n_1 \times n_2$ matrices. Specifically, we assume a mapping $\mathcal{T} : [n_1] \times [n_2] \to \mathbb{R}^{n_1 \times n_2}$ that associates with each row-column combination $(a, b) \in [n_1] \times [n_2]$ a low-rank matrix that describes the entries of $M$ in its neighborhood (in particular this applies to the observed entries $A$), with $\mathcal{T}_{a,b}(a, b) = M_{a,b}$. Note that in contrast to the global estimate in Global LRMA, our model now consists of multiple low-rank matrices, each describing the original matrix $M$ in a particular neighborhood. Figure 1 illustrates this model.

Without additional assumptions, it is impossible to estimate the mapping $\mathcal{T}$ from a set of $m < n_1 n_2$ observations. Our additional assumption is that the mapping $\mathcal{T}$ is slowly varying. Since the domain of $\mathcal{T}$ is discrete, we assume that $\mathcal{T}$ is Hölder continuous. Following common approaches in non-parametric statistics, we define a smoothing kernel $K_h(s_1, s_2)$, where $s_1, s_2 \in [n_1] \times [n_2]$, as a non-negative symmetric unimodal function parameterized by a bandwidth parameter $h > 0$. A large value of $h$ implies that $K_h(s, \cdot)$ has a wide spread, while a small $h$ corresponds to a narrow spread of $K_h(s, \cdot)$. We use, for example, the Epanechnikov kernel, defined as $K_h(s_1, s_2) = \frac{3}{4}\left(1 - d(s_1, s_2)^2\right) \mathbf{1}_{\{d(s_1, s_2) < h\}}$. We denote by $K_h^{(a,b)}$ the matrix whose $(i, j)$-entry is $K_h((a, b), (i, j))$.

Incomplete SVD (1) and compressed sensing (2) can be extended to a local version as follows:

Incomplete SVD: $\quad \hat{\mathcal{T}}(a, b) = \operatorname*{argmin}_{X} \|K_h^{(a,b)} \odot \Pi_A(X - M)\|_F \quad \text{s.t.} \quad \operatorname{rank}(X) = r \qquad (3)$

Compressed sensing: $\quad \hat{\mathcal{T}}(a, b) = \operatorname*{argmin}_{X} \|X\|_* \quad \text{s.t.} \quad \|K_h^{(a,b)} \odot \Pi_A(X - M)\|_F < \alpha, \qquad (4)$

where $\odot$ denotes the component-wise product of two matrices, $[A \odot B]_{i,j} = A_{i,j} B_{i,j}$.

Figure 2: RMSE of global-LRMA, local-LRMA, and other baselines on the MovieLens 10M (left) and Netflix (right) datasets. Local-LRMA models are indicated by thick solid lines, while global-LRMA models are indicated by dotted lines. Models with the same rank are colored identically. [Both panels plot RMSE against the number of anchor points for global and local models with ranks r = 5, 7, 10, 15, with DFC and, on Netflix, the Netflix winner as additional baselines.]
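The rank-constrained local problem (3) can be approached in the same way as (1), with each observed entry's squared residual weighted by its kernel value relative to the anchor point (a mild relaxation of the exact $\|K_h^{(a,b)} \odot \cdot\|_F$ objective, which would effectively square the kernel weights). The sketch below assumes the product-form kernel used later in the experiments and treats the row and column distance functions as given; all names and defaults are illustrative.

```python
import numpy as np

def epanechnikov(dist, h):
    """Epanechnikov kernel (3/4)(1 - d^2) on {d < h}, as defined above."""
    return 0.75 * (1.0 - dist ** 2) * (dist < h)

def local_incomplete_svd(M, observed, anchor, row_dist, col_dist, rank,
                         h=0.8, lr=0.01, reg=0.0, n_iters=200, seed=0):
    """Estimate T_hat(anchor) by kernel-weighted incomplete SVD (cf. Eq. 3).

    anchor             : (a, b) row-column pair the local model is centered at.
    row_dist, col_dist : functions returning d(a, a') and d(b, b'); a product-form
                         kernel K((a,b),(c,d)) = K(a,c) K(b,d) is assumed.
    """
    rng = np.random.default_rng(seed)
    n1, n2 = M.shape
    U = 0.1 * rng.standard_normal((n1, rank))
    V = 0.1 * rng.standard_normal((n2, rank))
    a0, b0 = anchor
    # Kernel weight of each observed entry relative to the anchor point.
    w = {(a, b): epanechnikov(row_dist(a0, a), h) * epanechnikov(col_dist(b0, b), h)
         for a, b in observed}
    for _ in range(n_iters):
        for a, b in observed:
            if w[a, b] == 0.0:           # limited support: far-away entries drop out
                continue
            err = w[a, b] * (U[a] @ V[b] - M[a, b])
            grad_u = err * V[b] + reg * U[a]
            grad_v = err * U[a] + reg * V[b]
            U[a] -= lr * grad_u
            V[b] -= lr * grad_v
    return U @ V.T                        # low-rank estimate of T(anchor)
```

Because entries outside the kernel's support contribute nothing, each local fit touches only a subset of the observed entries, which is the source of the sparsity-based speedup discussed below.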
The two optimization problems above describe how to estimate $\hat{\mathcal{T}}(a, b)$ for a particular choice of $(a, b) \in [n_1] \times [n_2]$. Conceptually, this technique can be applied to each test entry $(a, b)$, resulting in the matrix approximation $\hat{M}_{a,b} = \hat{\mathcal{T}}_{a,b}(a, b)$ for $(a, b) \in [n_1] \times [n_2]$. However, this requires solving a non-linear optimization problem for each test index $(a, b)$ and is thus computationally prohibitive. Instead, we use Nadaraya-Watson local regression with a set of $q$ local estimates $\hat{\mathcal{T}}(s_1), \ldots, \hat{\mathcal{T}}(s_q)$ in order to obtain a computationally efficient estimate $\hat{\hat{\mathcal{T}}}(s)$ for all $s \in [n_1] \times [n_2]$:

$$\hat{\hat{\mathcal{T}}}(s) = \sum_{i=1}^{q} \frac{K_h(s_i, s)}{\sum_{j=1}^{q} K_h(s_j, s)} \, \hat{\mathcal{T}}(s_i). \qquad (5)$$

Equation (5) is simply a weighted average of $\hat{\mathcal{T}}(s_1), \ldots, \hat{\mathcal{T}}(s_q)$, where the weights ensure that values of $\hat{\mathcal{T}}$ at indices close to $s$ contribute more than those at indices further away from $s$.

Note that the local version can be faster than global SVD since (a) the local low-rank approximations are independent of one another and can therefore be computed in parallel, and (b) the rank used in the local SVD models can be significantly lower than the rank used in a global one. If the kernel $K_h$ has limited support ($K_h(s, s')$ is often zero), the regularized SVD problems are sparser than the global SVD problem, resulting in additional speedup.

3 Experiments

We compare local-LRMA to global-LRMA and other state-of-the-art techniques on popular recommendation system datasets: MovieLens 10M and Netflix. We split the data into training and test sets with a 9:1 ratio. A default prediction value of 3.0 was used whenever we encountered a test user or item without training observations. We use the Epanechnikov kernel with $h_1 = h_2 = 0.8$, assuming a product form $K_h((a, b), (c, d)) = K'_{h_1}(a, c) \, K''_{h_2}(b, d)$. For the distance function $d$, we use the arccos distance, defined as $d(x, y) = \arccos\left( \langle x, y \rangle / (\|x\| \, \|y\|) \right)$. Anchor points were chosen randomly among observed training entries. $L_2$ regularization is used for the local low-rank approximations.

Figure 2 graphs the RMSE of local-LRMA and global-LRMA, as well as the recently proposed DFC (Divide-and-Conquer Matrix Factorization), as a function of the number of anchor points. Both local-LRMA and global-LRMA improve as $r$ increases, but local-LRMA with rank $r \geq 5$ outperforms global-LRMA with any rank. Moreover, local-LRMA outperforms global-LRMA on average even with a few anchor points (though the performance of local-LRMA improves further as the number of anchor points $q$ increases).
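Finally, a minimal sketch of the smoothing step (5), assuming the $q$ local estimates $\hat{\mathcal{T}}(s_1), \ldots, \hat{\mathcal{T}}(s_q)$ have already been computed (for instance with the local solver sketched earlier); the helper names and the fallback for empty kernel support are hypothetical.

```python
import numpy as np

def smooth_estimate(s, anchors, local_estimates, kernel):
    """Nadaraya-Watson combination of Eq. (5): predict entry s = (a, b) as a
    kernel-weighted average of the q local estimates at the anchor points."""
    a, b = s
    weights = np.array([kernel(si, s) for si in anchors], dtype=float)
    total = weights.sum()
    if total == 0.0:
        # No anchor within kernel support: fall back to an unweighted average.
        weights = np.full(len(anchors), 1.0 / len(anchors))
    else:
        weights = weights / total       # normalize as in Eq. (5)
    return float(sum(w * T[a, b] for w, T in zip(weights, local_estimates)))
```

Here `local_estimates[i]` would be the matrix produced at anchor point `anchors[i]`, and `kernel` the same product-form Epanechnikov kernel used to fit the local models; each test entry is predicted by evaluating `smooth_estimate` at its index.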