Random Projections for Large-Scale Regression

Gian-Andrea Thanei, Christina Heinze and Nicolai Meinshausen

arXiv:1701.05325v1 [math.ST] 19 Jan 2017

Gian-Andrea Thanei: ETH Zürich, Rämistrasse 101, 8092 Zürich, Switzerland, e-mail: [email protected]
Christina Heinze: ETH Zürich, Rämistrasse 101, 8092 Zürich, Switzerland, e-mail: [email protected]
Nicolai Meinshausen: ETH Zürich, Rämistrasse 101, 8092 Zürich, Switzerland, e-mail: [email protected]

Abstract. Fitting linear regression models can be computationally very expensive in large-scale data analysis tasks if the sample size and the number of variables are very large. Random projections are extensively used as a dimension reduction tool in machine learning and statistics. We discuss the applications of random projections in linear regression problems, developed to decrease computational costs, and give an overview of the theoretical guarantees on the generalization error. It can be shown that the combination of random projections with least squares regression leads to similar recovery as ridge regression and principal component regression. We also discuss possible improvements when averaging over multiple random projections, an approach that lends itself easily to parallel implementation.

1 Introduction

Assume we are given a data matrix $X \in \mathbb{R}^{n\times p}$ ($n$ samples of a $p$-dimensional random variable) and a response vector $Y \in \mathbb{R}^n$. We assume a linear model for the data, where $Y = X\beta + \varepsilon$ for some regression coefficient $\beta \in \mathbb{R}^p$ and $\varepsilon$ i.i.d. mean-zero noise. Fitting a regression model by standard least squares or ridge regression requires $O(np^2)$ or $O(p^3)$ flops. In the situation of large-scale ($n, p$ very large) or high-dimensional ($p \gg n$) data these algorithms are not applicable without having to pay a huge computational price.

Using a random projection, the data can be "compressed" either row- or column-wise. Row-wise compression was proposed and discussed in Zhou et al. (2007); Dhillon et al. (2013a); McWilliams et al. (2014b). These approaches replace the least-squares estimator
\[
\operatorname*{argmin}_{\gamma \in \mathbb{R}^p} \|Y - X\gamma\|_2^2 \quad\text{with the estimator}\quad \operatorname*{argmin}_{\gamma \in \mathbb{R}^p} \|\psi Y - \psi X\gamma\|_2^2, \tag{1}
\]
where the matrix $\psi \in \mathbb{R}^{m\times n}$ ($m \ll n$) is a random projection matrix and has, for example, i.i.d. $\mathcal{N}(0,1)$ entries. Other possibilities for the choice of $\psi$ are discussed below. The high-dimensional setting and $\ell_1$-penalized regression is considered in Zhou et al. (2007), where it is shown that a sparse linear model can be recovered from the projected data under certain conditions. The optimization problem is still $p$-dimensional, however, and computationally expensive if the number of variables is very large.

Column-wise compression addresses this latter issue by reducing the problem to a $d$-dimensional optimization with $d \ll p$, replacing the least-squares estimator
\[
\operatorname*{argmin}_{\gamma \in \mathbb{R}^p} \|Y - X\gamma\|_2^2 \quad\text{with the estimator}\quad \phi \operatorname*{argmin}_{\gamma \in \mathbb{R}^d} \|Y - X\phi\gamma\|_2^2, \tag{2}
\]
where the random projection matrix is now $\phi \in \mathbb{R}^{p\times d}$ (with $d \ll p$). By right multiplication to the data matrix $X$ we transform the data matrix to $X\phi$ and thereby reduce the number of variables from $p$ to $d$, thus reducing computational complexity. The Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984; Dasgupta and Gupta, 2003; Indyk and Motwani, 1998) guarantees that the distance between two transformed sample points is approximately preserved in the column-wise compression.

Random projections have also been considered under the aspect of preserving privacy (Blocki et al., 2012). By pre-multiplication with a random projection matrix as in (1) no observation in the resulting matrix can be identified with one of the original data points. Similarly, post-multiplication as in (2) produces new variables that do not reveal the realized values of the original variables.
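To make the column-wise construction in (2) concrete, here is a minimal Python sketch (not taken from the paper; the function name and the synthetic data are illustrative only). It draws a Gaussian projection $\phi$, solves the $d$-dimensional least-squares problem on $X\phi$ and maps the solution back to $\mathbb{R}^p$; the $1/\sqrt{d}$ scaling anticipates the convention used in Section 2 and does not change the resulting estimator.

```python
import numpy as np

def compressed_least_squares(X, Y, d, rng=None):
    """Column-wise compressed least squares as in (2): project the p columns
    of X to d dimensions with a Gaussian random projection, solve the
    d-dimensional least-squares problem on X @ phi and map the solution
    back to R^p."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    phi = rng.standard_normal((p, d)) / np.sqrt(d)   # i.i.d. N(0, 1/d) entries
    gamma_hat, *_ = np.linalg.lstsq(X @ phi, Y, rcond=None)
    return phi @ gamma_hat                           # estimator lives in R^p

# Small synthetic illustration (dimensions chosen arbitrarily).
rng = np.random.default_rng(0)
n, p, d = 500, 200, 20
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p)
Y = X @ beta + rng.standard_normal(n)
beta_hat = compressed_least_squares(X, Y, d, rng=1)
print(beta_hat.shape)    # (200,)
```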
In many applications the random projection used in practice falls under the class of Fast Johnson-Lindenstrauss Transforms (FJLT) (Ailon and Chazelle, 2006). One instance of such a fast projection is the Subsampled Randomized Hadamard Transform (SRHT) (Tropp, 2010). Due to its recursive definition, the matrix-vector product has a complexity of $O(p\log(p))$, reducing the cost of the projection to $O(np\log(p))$. Other proposals that lead to speedups compared to a Gaussian random projection matrix include random sign or sparse random projection matrices (Achlioptas, 2003). Notably, if the data matrix is sparse, using a sparse random projection can exploit sparse matrix operations. Depending on the number of non-zero elements in $X$, one might prefer using a sparse random projection over a FJLT that cannot exploit sparsity in the data (a sketch of such a sparse projection is given below). Importantly, using $X\phi$ instead of $X$ in our regression algorithm of choice can be disadvantageous if $X$ is extremely sparse and $d$ cannot be chosen to be much smaller than $p$. (The projection dimension $d$ can be chosen by cross-validation.) As the multiplication by $\phi$ "densifies" the design matrix used in the learning algorithm, the potential computational benefit of sparse data is not preserved.

For OLS and row-wise compression as in (1), where $n$ is very large and $p < m < n$, the SRHT (and similar FJLTs) can be understood as a subsampling algorithm. It preconditions the design matrix by rotating the observations to a basis where all points have approximately uniform leverage (Dhillon et al., 2013a). This justifies uniform subsampling in the projected space, which is applied subsequent to the rotation in order to reduce the computational costs of the OLS estimation. Related ideas can be found in the way columns and rows of $X$ are sampled in a CUR matrix decomposition (Mahoney and Drineas, 2009). While the approach in Dhillon et al. (2013a) focuses on the concept of leverage, McWilliams et al. (2014b) propose an alternative scheme that allows for outliers in the data and makes use of the concept of influence (Cook, 1977). Here, random projections are used to approximate the influence of each observation, which is then used in the subsampling scheme to determine which observations to include in the subsample.

Using random projections column-wise as in (2) as a dimensionality reduction technique in conjunction with ($\ell_2$-penalized) regression has been considered in Lu et al. (2013), Kabán (2014) and Maillard and Munos (2009). The main advantage of these algorithms is the computational speedup while preserving predictive accuracy. Typically, a variance reduction is traded off against an increase in bias. In general, one disadvantage of reducing the dimensionality of the data is that the coefficients in the projected space are not interpretable in terms of the original variables. Naively, one could reverse the random projection operation by projecting the coefficients estimated in the projected space back into the original space as in (2). For prediction purposes this operation is irrelevant, but it can be shown that this estimator does not approximate the optimal solution in the original $p$-dimensional coefficient space well (Zhang et al., 2012). As a remedy, Zhang et al. (2012) propose to find the dual solution in the projected space to recover the optimal solution in the original space. The proposed algorithm approximates the solution to the original problem accurately if the design matrix is low-rank or can be sufficiently well approximated by a low-rank matrix.
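As an illustration of the sparse alternative mentioned above, the following sketch builds a random-sign projection with a prescribed density and applies it to a sparse design matrix. The construction follows the spirit of Achlioptas (2003); the particular density, scaling and function names are choices made here for illustration, not specifications from the text.

```python
import numpy as np
import scipy.sparse as sp

def sparse_sign_projection(p, d, density=1/3, rng=None):
    """Sparse random-sign projection: entries are 0 with probability
    1 - density and +/- sqrt(1/(d * density)) otherwise, so each entry
    has mean zero and variance 1/d, mirroring the Gaussian N(0, 1/d) case."""
    rng = np.random.default_rng(rng)
    val = 1.0 / np.sqrt(d * density)
    mask = rng.random((p, d)) < density              # positions of the non-zeros
    rows, cols = np.nonzero(mask)
    signs = rng.choice([-val, val], size=rows.size)
    return sp.csr_matrix((signs, (rows, cols)), shape=(p, d))

# If X is stored sparsely, X @ phi uses sparse arithmetic throughout,
# although the reduced matrix X @ phi is typically much denser than X.
X = sp.random(20_000, 5_000, density=0.001, format="csr", random_state=0)
phi = sparse_sign_projection(5_000, 50, rng=1)
Z = X @ phi
print(Z.shape)    # (20000, 50)
```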
Lastly, random projections have been used as an auxiliary tool. As an example, the goal of McWilliams et al. (2014a) is to distribute ridge regression across variables with an algorithm called LOCO. The design matrix is split across variables and the variables are distributed over processing units (workers). Random projections are used to preserve the dependencies between all variables in that each worker uses a randomly projected version of the variables residing on the other workers in addition to the set of variables assigned to itself. It then solves a ridge regression using this local design matrix. The solution is the concatenation of the coefficients found by each worker, and the solution vector lies in the original space so that the coefficients are interpretable. Empirically, this scheme achieves large speedups while retaining good predictive accuracy. Using some of the ideas and results outlined in the current manuscript, one can show that the difference between the full solution and the coefficients returned by LOCO is bounded.

Clearly, row- and column-wise compression can also be applied simultaneously, or column-wise compression can be used together with subsampling of the data instead of row-wise compression. In the remaining sections we will focus on column-wise compression, as it poses more difficult challenges in terms of statistical performance guarantees. While row-wise compression just reduces the effective sample size and can be expected to work in general settings as long as the compressed dimension $m < n$ is not too small (Zhou et al., 2007), column-wise compression can only work well if certain conditions on the data are satisfied, and we will give an overview of these results. If not mentioned otherwise, we will refer with compressed regression and random projections to the column-wise compression.

The structure of the manuscript is as follows: we will give an overview of bounds on the estimation accuracy in the following Section 2, including both known results and new contributions in the form of tighter bounds. In Section 3 we will discuss the possibility and properties of variance-reducing averaging schemes, where estimators based on different realized random projections are aggregated. Finally, Section 4 concludes the manuscript with a short discussion.

2 Theoretical Results

We will discuss in the following the properties of the column-wise compressed estimator as in (2), which is defined as
\[
\hat{\beta}^\phi_d = \phi \operatorname*{argmin}_{\gamma \in \mathbb{R}^d} \|Y - X\phi\gamma\|_2^2, \tag{3}
\]
where we assume that $\phi$ has i.i.d. $\mathcal{N}(0, 1/d)$ entries. This estimator will be referred to as the compressed least squares estimator (CLSE) in the following. We will focus on the unpenalized form as in (3), but note that similar results also apply to estimators that put an additional penalty on the coefficients $\beta$ or $\gamma$. Due to the isotropy of the random projection, a ridge-type penalty as in Lu et al. (2013) and McWilliams et al. (2014a) is perhaps a natural choice. An interesting summary of the bounds on random projections is, on the other hand, that the random projection as in (3) already acts as a regularization, and the theoretical properties of (3) are very much related to the properties of a ridge-type estimator of the coefficient vector in the absence of random projections.

We will restrict discussion of the properties mostly to the mean squared error (MSE)
\[
E_\phi\big[E_\varepsilon\big(\|X\beta - X\hat{\beta}^\phi_d\|_2^2\big)\big]. \tag{4}
\]
First results on compressed least squares have been given in Maillard and Munos (2009) in a random design setting. It was shown that the bias of the estimator (3) is of order $O(\log(n)/d)$. This proof used a modified version of the Johnson-Lindenstrauss Lemma. A recent result (Kabán, 2014) shows that the $\log(n)$-term is not necessary for fixed design settings where $Y = X\beta + \varepsilon$ for some $\beta \in \mathbb{R}^p$ and $\varepsilon$ is i.i.d. noise, centered $E_\varepsilon[\varepsilon] = 0$ and with variance $E_\varepsilon[\varepsilon\varepsilon'] = \sigma^2 I_{n\times n}$. We will work with this setting in the following.
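Written out computationally, the target quantity (4) can be approximated by a simple Monte Carlo average over draws of both $\phi$ and $\varepsilon$. The sketch below is only meant to make the definition concrete (it is not part of the authors' analysis); dimensions, signal and noise level are arbitrary.

```python
import numpy as np

def mse_clse_mc(X, beta, sigma, d, n_rep=200, rng=None):
    """Monte Carlo approximation of the fixed-design MSE (4),
    E_phi E_eps ||X beta - X beta_hat_phi||_2^2, for the compressed
    least-squares estimator with a Gaussian projection phi."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    signal = X @ beta
    errors = np.empty(n_rep)
    for r in range(n_rep):
        phi = rng.standard_normal((p, d)) / np.sqrt(d)
        Y = signal + sigma * rng.standard_normal(n)   # fresh noise draw
        gamma_hat, *_ = np.linalg.lstsq(X @ phi, Y, rcond=None)
        errors[r] = np.sum((signal - X @ (phi @ gamma_hat)) ** 2)
    return errors.mean()

# The bias part shrinks as d grows, while the variance part grows like sigma^2 * d.
rng = np.random.default_rng(0)
n, p = 200, 100
X = rng.standard_normal((n, p))
beta = rng.standard_normal(p) / np.sqrt(p)
for d in (5, 20, 50):
    print(d, round(mse_clse_mc(X, beta, sigma=1.0, d=d, rng=d), 2))
```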
The following result of Kabán (2014) gives a bound on the MSE for fixed design.

Theorem 1 (Kabán, 2014). Assume fixed design and $\mathrm{Rank}(X) \ge d$. Then
\[
E_\phi\big[E_\varepsilon\big(\|X\beta - X\hat{\beta}^\phi_d\|_2^2\big)\big] \;\le\; \sigma^2 d + \frac{\|X\beta\|_2^2}{d} + \mathrm{trace}(X'X)\,\frac{\|\beta\|_2^2}{d}. \tag{5}
\]

Proof. See Appendix.

Compared with Maillard and Munos (2009), the result removes an unnecessary $O(\log(n))$ term and demonstrates the $O(1/d)$ behaviour of the bias. The result also illustrates the tradeoffs when choosing a suitable dimension $d$ for the projection. Increasing $d$ will lead to a $1/d$ reduction in the bias terms but to a linear increase in the estimation error (which is proportional to the dimension in which the least-squares estimation is performed). An optimal bound can only be achieved with a value of $d$ that depends on the unknown signal, and in practice one would typically use cross-validation to make the choice of the dimension of the projection.

One issue with the bound in Theorem 1 is that the bound on the bias term in the noiseless case ($Y = X\beta$),
\[
E_\phi\big[E_\varepsilon\big(\|X\beta - X\hat{\beta}^\phi_d\|_2^2\big)\big] \;\le\; \frac{\|X\beta\|_2^2}{d} + \mathrm{trace}(X'X)\,\frac{\|\beta\|_2^2}{d}, \tag{6}
\]
is usually weaker than the trivial bound (obtained by setting $\hat{\beta}^\phi_d = 0$) of
\[
E_\phi\big[E_\varepsilon\big(\|X\beta - X\hat{\beta}^\phi_d\|_2^2\big)\big] \;\le\; \|X\beta\|_2^2 \tag{7}
\]
for most values of $d < p$. By improving the bound, it is also possible to point out the similarities between ridge regression and compressed least squares.

The improvement in the bound rests on a small modification of the original proof in Kabán (2014). The idea is to bound the bias term of (4) by optimizing over the upper bound given in the foregoing theorem. Specifically, one can use the inequality
\[
E_\phi\big[E_\varepsilon\big[\|X\beta - X\phi(\phi'X'X\phi)^{-1}\phi'X'X\beta\|_2^2\big]\big] \;\le\; \min_{\hat{\beta}\in\mathbb{R}^p} E_\phi\big[E_\varepsilon\big[\|X\beta - X\phi\phi'\hat{\beta}\|_2^2\big]\big]
\]
instead of
\[
E_\phi\big[E_\varepsilon\big[\|X\beta - X\phi(\phi'X'X\phi)^{-1}\phi'X'X\beta\|_2^2\big]\big] \;\le\; E_\phi\big[E_\varepsilon\big[\|X\beta - X\phi\phi'\beta\|_2^2\big]\big].
\]
To simplify the exposition we will from now on always assume that we have rotated the design matrix to an orthogonal design so that the Gram matrix is diagonal:
\[
\Sigma = X'X = \mathrm{diag}(\lambda_1, \ldots, \lambda_p). \tag{8}
\]
This can always be achieved for any design matrix and is thus not a restriction. It implies, however, that the optimal regression coefficients $\beta$ are expressed in the basis in which the Gram matrix is orthogonal, that is, the basis of principal components. This will turn out to be the natural choice for random projections and allows for easier interpretation of the results.

Furthermore, note that in Theorem 1 we have the assumption $\mathrm{Rank}(X) \ge d$, which tells us that we can apply the CLSE in the high-dimensional setting $p \gg n$ as long as we choose $d$ small enough (smaller than $\mathrm{Rank}(X)$, which is usually equal to $n$) in order to have uniqueness.

With the foregoing discussion on how to improve the bound in Theorem 1 we get the following theorem.

Theorem 2. Assume $\mathrm{Rank}(X) \ge d$. Then the MSE (4) can be bounded above by
\[
E_\phi\big[E_\varepsilon\big[\|X\beta - X\hat{\beta}^\phi_d\|_2^2\big]\big] \;\le\; \sigma^2 d + \sum_{i=1}^{p} \beta_i^2 \lambda_i w_i, \tag{9}
\]
where
\[
w_i = \frac{(1+1/d)\lambda_i^2 + (1+2/d)\lambda_i\,\mathrm{trace}(\Sigma) + \mathrm{trace}(\Sigma)^2/d}{(d+2+1/d)\lambda_i^2 + 2(1+1/d)\lambda_i\,\mathrm{trace}(\Sigma) + \mathrm{trace}(\Sigma)^2/d}. \tag{10}
\]

Proof. See Appendix.

The $w_i$ are shrinkage factors. By defining the proportion of the total variance observed in the direction of the $i$-th principal component as
\[
\alpha_i = \frac{\lambda_i}{\mathrm{trace}(\Sigma)}, \tag{11}
\]
we can rewrite the shrinkage factors in the foregoing theorem as
\[
w_i = \frac{(1+1/d)\alpha_i^2 + (1+2/d)\alpha_i + 1/d}{(d+2+1/d)\alpha_i^2 + 2(1+1/d)\alpha_i + 1/d}. \tag{12}
\]
Analyzing this term shows that the shrinkage is stronger in directions of high variance compared to directions of low variance.
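The shrinkage factors are easy to evaluate numerically. Below is a small sketch of the representation (12); as an assumption for illustration it uses the spectrum $\lambda_i = 1/i$, which also appears in the simulations later in the text. The $w_i$ come out of order $1/d$ for directions carrying a large share of the total variance and close to one in the low-variance tail.

```python
import numpy as np

def shrinkage_factors(eigvals, d):
    """Bias factors w_i of Theorem 2, written via the variance
    proportions alpha_i = lambda_i / trace(Sigma) as in (12)."""
    alpha = np.asarray(eigvals, dtype=float)
    alpha = alpha / alpha.sum()
    num = (1 + 1/d) * alpha**2 + (1 + 2/d) * alpha + 1/d
    den = (d + 2 + 1/d) * alpha**2 + 2 * (1 + 1/d) * alpha + 1/d
    return num / den

lam = 1.0 / np.arange(1, 41)          # example spectrum lambda_i = 1/i
w = shrinkage_factors(lam, d=10)
print(np.round(w[:3], 3))             # small w_i for the leading directions
print(np.round(w[-3:], 3))            # w_i close to 1 in the tail
```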
To explain this relation in a bit more detail we compare it to ridge regression. The MSE of ridge regression with penalty term $\lambda\|\beta\|_2^2$ is given by
\[
E_\varepsilon\big[\|X\beta - X\hat{\beta}^{\mathrm{Ridge}}\|_2^2\big] = \sigma^2 \sum_{i=1}^{p} \Big(\frac{\lambda_i}{\lambda_i + \lambda}\Big)^2 + \sum_{i=1}^{p} \beta_i^2 \lambda_i \Big(\frac{\lambda}{\lambda_i + \lambda}\Big)^2. \tag{13}
\]
Imagine that the signal lives on the space spanned by the first $q$ principal directions, that is, $\beta_i = 0$ for $i > q$. The best MSE we could then achieve is $\sigma^2 q$ by running a regression on the first $q$ principal directions. For random projections, we can see that we can indeed reduce the bias term to nearly zero by forcing $w_i \approx 0$ for $i = 1, \ldots, q$. This requires $d \gg q$, as the bias factors will then vanish like $1/d$. Ridge regression on the other hand requires that the penalty $\lambda$ is smaller than the $q$-th largest eigenvalue $\lambda_q$ (to reduce the bias on the first $q$ directions) but large enough to render the variance factor $\lambda_i/(\lambda_i + \lambda)$ very small for $i > q$. The tradeoff in choosing the penalty $\lambda$ in ridge regression and choosing the dimension $d$ for random projections is thus very similar. The number of directions for which the eigenvalue $\lambda_i$ is larger than the penalty $\lambda$ in ridge regression corresponds to the effective dimension and will yield the same variance bound as in random projections. The analogy between the MSE bounds (9) for random projections and (13) for ridge regression thus illustrates a close relationship between compressed least squares and ridge regression or principal component regression, similar to Dhillon et al. (2013b).

Instead of an upper bound for the MSE of the CLSE as in Maillard and Munos (2009) and Kabán (2014), we will in the following try to derive explicit expressions for the MSE, following the ideas in Kabán (2014) and Marzetta et al. (2011), and we give a closed-form MSE in the case of orthonormal predictors. The derivation will make use of the following notation.

Definition 1. Let $\phi \in \mathbb{R}^{p\times d}$ be a random projection. We define the following matrices:
\[
\phi^X_d = \phi(\phi'X'X\phi)^{-1}\phi' \in \mathbb{R}^{p\times p} \quad\text{and}\quad T^\phi_d = E_\phi[\phi^X_d] = E_\phi\big[\phi(\phi'X'X\phi)^{-1}\phi'\big] \in \mathbb{R}^{p\times p}.
\]

The next lemma (Marzetta et al., 2011) summarizes the main properties of $\phi^X_d$ and $T^\phi_d$.

Lemma 1. Let $\phi \in \mathbb{R}^{p\times d}$ be a random projection. Then
i) $(\phi^X_d)' = \phi^X_d$ (symmetric),
ii) $\phi^X_d\, X'X\, \phi^X_d = \phi^X_d$ (projection),
iii) if $\Sigma = X'X$ is diagonal, then $T^\phi_d$ is diagonal.

Proof. See Marzetta et al. (2011).

The important point of this lemma is that when we assume orthogonal design, $T^\phi_d$ is diagonal. We will denote this by
\[
T^\phi_d = \mathrm{diag}(1/\eta_1, \ldots, 1/\eta_p),
\]
where the terms $\eta_i$ are well defined but without an explicit representation.

A quick calculation reveals the following theorem.

Theorem 3. Assume $\mathrm{Rank}(X) \ge d$. Then the MSE (4) equals
\[
E_\phi\big[E_\varepsilon\big[\|X\beta - X\hat{\beta}^\phi_d\|_2^2\big]\big] = \sigma^2 d + \sum_{i=1}^{p} \beta_i^2 \lambda_i \Big(1 - \frac{\lambda_i}{\eta_i}\Big). \tag{14}
\]
Furthermore we have
\[
\sum_{i=1}^{p} \frac{\lambda_i}{\eta_i} = d. \tag{15}
\]

Proof. See Appendix.

Fig. 1. Numerical simulations of the bounds in Theorems 2 and 3. Left: the exact factor $(1-\lambda_1/\eta_1)$ in the MSE plotted versus the bound $w_1$ as a function of the projection dimension $d$. Right: the exact factor $(1-\lambda_p/\eta_p)$ in the MSE and the upper bound $w_p$. Note that the upper bound works especially well for small values of $d$ and for the larger eigenvalues and is always below the trivial bound 1.

By comparing coefficients in Theorems 2 and 3, we obtain the following corollary.

Corollary 1. Assume $\mathrm{Rank}(X) \ge d$. Then
\[
1 - \frac{\lambda_i}{\eta_i} \;\le\; w_i \quad \forall\, i \in \{1, \ldots, p\}. \tag{16}
\]

As already mentioned, in general we cannot give a closed-form expression for the terms $\eta_i$. However, for some special cases (15) can help us to get to an exact form of the MSE of the CLSE. If we assume orthonormal design ($\Sigma = C\, I_{p\times p}$), then $\lambda_i/\eta_i$ is a constant for all $i$ and thus, by (15), we have $\eta_i = Cp/d$. This gives
\[
E_\phi\big[E_\varepsilon\big[\|X\beta - X\hat{\beta}^\phi_d\|_2^2\big]\big] = \sigma^2 d + C \sum_{i=1}^{p} \beta_i^2 \Big(1 - \frac{d}{p}\Big), \tag{17}
\]
and thus we end up with a closed-form MSE for this special case.
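Although the $\eta_i$ have no explicit representation, they can be approximated numerically, since for diagonal $\Sigma$ the diagonal of $T^\phi_d$ equals $(1/\eta_1, \ldots, 1/\eta_p)$. The following Monte Carlo sketch (an illustration in the spirit of Figure 1, not the code used for it; all names are hypothetical) estimates the exact bias factors $1-\lambda_i/\eta_i$ of Theorem 3, which can then be compared with the upper bounds $w_i$ from the previous sketch.

```python
import numpy as np

def exact_bias_factors(eigvals, d, n_rep=2000, rng=None):
    """Monte Carlo estimate of the exact bias factors 1 - lambda_i/eta_i of
    Theorem 3: for diagonal Sigma, diag(T_d^phi) = (1/eta_1, ..., 1/eta_p)
    with T_d^phi = E_phi[ phi (phi' Sigma phi)^{-1} phi' ]."""
    rng = np.random.default_rng(rng)
    lam = np.asarray(eigvals, dtype=float)
    p = lam.size
    t_diag = np.zeros(p)
    for _ in range(n_rep):
        phi = rng.standard_normal((p, d))
        middle = np.linalg.inv(phi.T @ (lam[:, None] * phi))   # (phi' Sigma phi)^{-1}
        t_diag += np.einsum('ij,jk,ik->i', phi, middle, phi)   # diag of phi M phi'
    t_diag /= n_rep                                            # estimate of 1/eta_i
    return 1.0 - lam * t_diag

lam = 1.0 / np.arange(1, 41)
exact = exact_bias_factors(lam, d=10, rng=0)
print(np.round(exact[:3], 3))   # compare with shrinkage_factors(lam, d=10)
```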
Providing the exact mean-squared errors allows us to quantify the conservativeness of the upper bounds. The upper bound has been shown to give a good approximation for small dimensions $d$ of the projection and for the signal contained in the larger eigenvalues.

3 Averaged Compressed Least Squares

We have so far looked only into the compressed least squares estimator with one single random projection. An issue in practice of the compressed least squares estimator is its variance due to the random projection as an additional source of randomness. This variance can be reduced by averaging multiple compressed least squares estimates coming from different random projections. In this section we will show some properties of the averaged compressed least squares (ACLSE) estimator and discuss its advantage over the CLSE.

Definition 2 (ACLSE). Let $\phi_1, \ldots, \phi_K \in \mathbb{R}^{p\times d}$ be independent random projections, and let $\hat{\beta}^{\phi_i}_d$ for all $i \in \{1, \ldots, K\}$ be the respective compressed least squares estimators. We define the averaged compressed least squares estimator (ACLSE) as
\[
\hat{\beta}^K_d := \frac{1}{K} \sum_{i=1}^{K} \hat{\beta}^{\phi_i}_d. \tag{18}
\]

One major advantage of this estimator is that it can be calculated in parallel with the minimal number of two communications, one to send the data and one to receive the result. This means that the asymptotic computational cost of $\hat{\beta}^K_d$ is equal to the cost of $\hat{\beta}^\phi_d$ if calculations are done on $K$ different processors. To investigate the MSE of $\hat{\beta}^K_d$, we restrict ourselves for simplicity to the limit case
\[
\hat{\beta}_d = \lim_{K\to\infty} \hat{\beta}^K_d \tag{19}
\]
and instead only investigate $\hat{\beta}_d$, the reasoning being that for large enough values of $K$ (say $K > 100$) the behaviour of $\hat{\beta}_d$ is very similar to that of $\hat{\beta}^K_d$. The exact form of the MSE in terms of the $\eta_i$'s is given in Kabán (2014). Here we build on these results and give an explicit upper bound for the MSE.

Theorem 4. Assume $\mathrm{Rank}(X) \ge d$. Define
\[
\tau = \sum_{i=1}^{p} \Big(\frac{\lambda_i}{\eta_i}\Big)^2.
\]
The MSE of $\hat{\beta}_d$ can be bounded from above by
\[
E_\phi\big[E_\varepsilon\big[\|X\beta - X\hat{\beta}_d\|_2^2\big]\big] \;\le\; \sigma^2 \tau + \sum_{i=1}^{p} \beta_i^2 \lambda_i w_i^2,
\]
where the $w_i$'s are given (as in Theorem 2) by
\[
w_i = \frac{(1+1/d)\lambda_i^2 + (1+2/d)\lambda_i\,\mathrm{trace}(\Sigma) + \mathrm{trace}(\Sigma)^2/d}{(d+2+1/d)\lambda_i^2 + 2(1+1/d)\lambda_i\,\mathrm{trace}(\Sigma) + \mathrm{trace}(\Sigma)^2/d}
\]
and $\tau \in [d^2/p,\, d]$.

Proof. See Appendix.

Comparing averaging to the case where we only have one single estimator, we see that there are two differences. First, the variance due to the model noise $\varepsilon$ turns into $\sigma^2\tau$ with $\tau \in [d^2/p, d]$, thus $\tau \le d$. Secondly, the shrinkage factors $w_i$ in the bias are now squared, which in total means that the MSE of $\hat{\beta}_d$ is always smaller than or equal to the MSE of a single estimator $\hat{\beta}^\phi_d$.

Fig. 2. MSE of averaged compressed least squares (circle) versus the MSE of the single estimator (cross) with covariance matrix $\Sigma_{i,i} = 1/i$. On the left with $\sigma^2 = 0$ (only bias), in the middle $\sigma^2 = 1/40$ and on the right $\sigma^2 = 1/20$. One can clearly see the quadratic improvement in terms of MSE as predicted by Theorem 4.
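A direct way to read Definition 2 as code is the sketch below: it averages $K$ independent compressed least-squares fits, each depending only on the data and its own seed, so the loop could equally be distributed over $K$ workers and the results combined with a single round of communication. Names and defaults are illustrative; for large $K$ the average approximates the limit estimator (19).

```python
import numpy as np

def averaged_clse(X, Y, d, K=100, seed=0):
    """Averaged compressed least squares (18): the mean of K independent
    compressed least-squares estimates, one per random projection."""
    n, p = X.shape
    beta_bar = np.zeros(p)
    for k in range(K):
        rng = np.random.default_rng(seed + k)        # one independent projection per "worker"
        phi = rng.standard_normal((p, d)) / np.sqrt(d)
        gamma_hat, *_ = np.linalg.lstsq(X @ phi, Y, rcond=None)
        beta_bar += phi @ gamma_hat
    return beta_bar / K
```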
We investigate the behavior of $\tau$ as a function of $d$ in three different situations (Figure 3). We first look at two extreme cases of covariance matrices for which the respective lower and upper bounds $[d^2/p, d]$ for $\tau$ are achieved. For the lower bound, let $\Sigma = I_{p\times p}$ be orthonormal. Then $\lambda_i/\eta_i = c$ for all $i$, as above. From
\[
\sum_{i=1}^{p} \lambda_i/\eta_i = d
\]
we get $\lambda_i/\eta_i = d/p$. This leads to
\[
\tau = \sum_{i=1}^{p} (\lambda_i/\eta_i)^2 = p\,\frac{d^2}{p^2} = \frac{d^2}{p},
\]
which reproduces the lower bound.

We will not be able to reproduce the upper bound exactly for all $d \le p$. But we can show that for any $d$ there exists a covariance matrix $\Sigma$ such that the upper bound is reached. The idea is to consider a covariance matrix that has equal variance in the first $d$ directions and almost zero variance in the remaining $p - d$. Define the diagonal covariance matrix
\[
\Sigma_{i,j} = \begin{cases} 1, & \text{if } i = j \text{ and } i \le d, \\ \varepsilon, & \text{if } i = j \text{ and } i > d, \\ 0, & \text{if } i \ne j. \end{cases} \tag{20}
\]
We show $\lim_{\varepsilon\to 0} \tau = d$. For this, decompose $\phi$ into two matrices $\phi_d \in \mathbb{R}^{d\times d}$ and $\phi_r \in \mathbb{R}^{(p-d)\times d}$:
\[
\phi = \begin{pmatrix} \phi_d \\ \phi_r \end{pmatrix}.
\]
In the same way we define $\beta_d$, $\beta_r$, $X_d$ and $X_r$. Now we bound the approximation error of $\hat{\beta}^\phi_d$ to extract information about the $\lambda_i/\eta_i$. Assume a squared data matrix ($n = p$), $X = \sqrt{\Sigma}$. Then
\[
E_\phi\Big[\min_{\gamma\in\mathbb{R}^d} \|X\beta - X\phi\gamma\|_2^2\Big]
\le E_\phi\big[\|X\beta - X\phi\,\phi_d^{-1}\beta_d\|_2^2\big]
= E_\phi\big[\|X_r\beta_r - X_r\phi_r\phi_d^{-1}\beta_d\|_2^2\big]
= \varepsilon\, E_\phi\big[\|\beta_r - \phi_r\phi_d^{-1}\beta_d\|_2^2\big]
\le \varepsilon\big(2\|\beta_r\|_2^2 + 2\|\beta_d\|_2^2\, E_\phi[\|\phi_r\|_2^2]\, E_\phi[\|\phi_d^{-1}\|_2^2]\big)
\le \varepsilon\, C,
\]
where $C$ is independent of $\varepsilon$ and bounded, since the expectations of the smallest and largest singular values of a random projection are bounded. This means that the approximation error decreases to zero as we let $\varepsilon \to 0$. Applying this to the closed form for the MSE of $\hat{\beta}^\phi_d$, we have that
\[
\sum_{i=1}^{p} \beta_i^2 \lambda_i \Big(1 - \frac{\lambda_i}{\eta_i}\Big) \;\le\; \sum_{i=1}^{d} \beta_i^2 \Big(1 - \frac{\lambda_i}{\eta_i}\Big) + \varepsilon \sum_{i=d+1}^{p} \beta_i^2 \Big(1 - \frac{\lambda_i}{\eta_i}\Big)
\]
has to go to zero as $\varepsilon \to 0$, which in turn implies
\[
\lim_{\varepsilon\to 0} \sum_{i=1}^{d} \beta_i^2 \Big(1 - \frac{\lambda_i}{\eta_i}\Big) = 0,
\]
and thus $\lim_{\varepsilon\to 0} \lambda_i/\eta_i = 1$ for all $i \in \{1, \ldots, d\}$. This finally yields the limit
\[
\lim_{\varepsilon\to 0} \sum_{i=1}^{p} \frac{\lambda_i^2}{\eta_i^2} = d.
\]

Fig. 3. Simulations of the variance factor $\tau$ (line) as a function of $d$ for three different covariance matrices, together with the lower bound ($d^2/p$) and upper bound ($d$) (square, triangle). On the left ($\Sigma = I_{p\times p}$) $\tau$, as proven, reaches the lower bound. In the middle ($\Sigma_{i,i} = 1/i$) $\tau$ reaches almost the lower bound, indicating that in most practical examples $\tau$ will be very close to the lower bound and thus averaging improves the MSE substantially compared to the single estimator. On the right the extreme-case example from (20) with $d = 5$, where $\tau$ reaches the upper bound for $d = 5$.

This illustrates that the lower bound $d^2/p$ and the upper bound $d$ for the variance factor $\tau$ can both be attained. Simulations suggest that $\tau$ is usually close to the lower bound, where the variance of the estimator is reduced by a factor $d/p$ compared to a single iteration of a compressed least-squares estimator, which is on top of the reduction in the bias error term. This shows, perhaps unsurprisingly, that averaging over random projection estimators improves the mean-squared error in a Rao-Blackwellization sense. We have quantified the improvement. In practice, one would have to decide whether to run multiple versions of a compressed least-squares regression in parallel or to run a single random projection with a perhaps larger embedding dimension. The computational effort and statistical error tradeoffs will depend on the implementation, but the bounds above give a good basis for a decision.

4 Discussion

We discussed some known results about the properties of compressed least-squares estimation and proposed possible tighter bounds and exact results for the mean-squared error. While the exact results do not have an explicit representation, they nevertheless allow us to quantify the conservative nature of the upper bounds on the error. Moreover, the shown results establish a strong similarity between the error of compressed least squares, ridge regression and principal component regression. We also discussed the advantages of a form of Rao-Blackwellization, where multiple compressed least-squares estimators are averaged over multiple random projections. The latter averaging procedure also allows the estimator to be computed trivially in a distributed way and is thus often better suited for large-scale regression analysis.
The averaging methodology also motivates the use of compressed least squares in the high-dimensional setting, where it performs similarly to ridge regression, and the use of multiple random projections will reduce the variance and result in a non-random estimator in the limit, which presents a computationally attractive alternative to ridge regression.

References

Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences.
Ailon, N. and Chazelle, B. (2006). Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. Proceedings of the 38th Annual ACM Symposium on Theory of Computing.
Blocki, J., Blum, A., Datta, A., and Sheffet, O. (2012). The Johnson-Lindenstrauss transform itself preserves differential privacy. In Foundations of Computer Science (FOCS), 2012 IEEE 53rd Annual Symposium on, pages 410-419. IEEE.
Cook, R. D. (1977). Detection of influential observation in linear regression. Technometrics, 19:15-18.
Dasgupta, S. and Gupta, A. (2003). An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms.
Dhillon, P., Lu, Y., Foster, D. P., and Ungar, L. (2013a). New subsampling algorithms for fast least squares regression. In Advances in Neural Information Processing Systems.
Dhillon, P. S., Foster, D. P., and Kakade, S. (2013b). A risk comparison of ordinary least squares vs ridge regression. The Journal of Machine Learning Research, 14.
Indyk, P. and Motwani, R. (1998). Approximate nearest neighbors: towards removing the curse of dimensionality. Proceedings of the 30th Annual ACM Symposium on Theory of Computing.
Johnson, W. and Lindenstrauss, J. (1984). Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics: Conference on Modern Analysis and Probability.
Kabán, A. (2014). New bounds on compressive linear least squares regression. In Artificial Intelligence and Statistics.
Lu, Y., Dhillon, P., Foster, D. P., and Ungar, L. (2013). Faster ridge regression via the subsampled randomized Hadamard transform. In Advances in Neural Information Processing Systems 26, pages 369-377.
Mahoney, M. W. and Drineas, P. (2009). CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697-702.
Maillard, O.-A. and Munos, R. (2009). Compressed least-squares regression. NIPS.
Marzetta, T., Tucci, G., and Simon, S. (2011). A random matrix-theoretic approach to handling singular covariance estimates. IEEE Trans. Information Theory.
McWilliams, B., Heinze, C., Meinshausen, N., Krummenacher, G., and Vanchinathan, H. P. (2014a). LOCO: Distributing ridge regression with random projections. arXiv preprint arXiv:1406.3469.
McWilliams, B., Krummenacher, G., Lučić, M., and Buhmann, J. M. (2014b). Fast and robust least squares estimation in corrupted linear models. In NIPS.
Tropp, J. A. (2010). Improved analysis of the subsampled randomized Hadamard transform. arXiv:1011.1595v4 [math.NA].
Zhang, L., Mahdavi, M., Jin, R., Yang, T., and Zhu, S. (2012). Recovering optimal solution by dual random projection. arXiv preprint arXiv:1211.3046.
Zhou, S., Lafferty, J. D., and Wasserman, L. A. (2007). Compressed regression. NIPS.

5 Appendix

In this section we give proofs of the statements from the section on theoretical results.

Theorem 1 (Kabán, 2014). Assume fixed design and $\mathrm{Rank}(X) \ge d$. Then the MSE (4) can be bounded above by
\[
E_\phi\big[E_\varepsilon\big[\|X\beta - X\hat{\beta}^\phi_d\|_2^2\big]\big] \;\le\; \sigma^2 d + \frac{\|X\beta\|_2^2}{d} + \mathrm{trace}(X'X)\,\frac{\|\beta\|_2^2}{d}. \tag{21}
\]

Proof (sketch).
