Compressed Regression

Shuheng Zhou∗, John Lafferty∗†, Larry Wasserman‡†

∗Computer Science Department, †Machine Learning Department, ‡Department of Statistics
Carnegie Mellon University, Pittsburgh, PA 15213

February 1, 2008

arXiv:0706.0534v2 [stat.ML] 11 Jan 2008

Abstract

Recent research has studied the role of sparsity in high dimensional regression and signal reconstruction, establishing theoretical limits for recovering sparse models from sparse data. This line of work shows that $\ell_1$-regularized least squares regression can accurately estimate a sparse linear model from $n$ noisy examples in $p$ dimensions, even if $p$ is much larger than $n$. In this paper we study a variant of this problem where the original $n$ input variables are compressed by a random linear transformation to $m \ll n$ examples in $p$ dimensions, and establish conditions under which a sparse linear model can be successfully recovered from the compressed data. A primary motivation for this compression procedure is to anonymize the data and preserve privacy by revealing little information about the original data. We characterize the number of random projections that are required for $\ell_1$-regularized compressed regression to identify the nonzero coefficients in the true model with probability approaching one, a property called “sparsistence.” In addition, we show that $\ell_1$-regularized compressed regression asymptotically predicts as well as an oracle linear model, a property called “persistence.” Finally, we characterize the privacy properties of the compression procedure in information-theoretic terms, establishing upper bounds on the mutual information between the compressed and uncompressed data that decay to zero.

Keywords: Sparsity, $\ell_1$ regularization, lasso, high dimensional regression, privacy, capacity of multi-antenna channels, compressed sensing.

CONTENTS

1 Introduction
2 Background and Related Work
   2.A Sparse Regression
   2.B Compressed Sensing
   2.C Privacy
3 Compressed Regression is Sparsistent
   3.A Outline of Proof for Theorem 3.4
   3.B Incoherence and Concentration Under Random Projection
   3.C Proof of Theorem 3.4
4 Compressed Regression is Persistent
   4.A Uncompressed Persistence
   4.B Compressed Persistence
5 Information Theoretic Analysis of Privacy
   5.A Privacy Under the Multiple Antenna Channel Model
   5.B Privacy Under Multiplicative Noise
6 Experiments
   6.A Sparsistency
   6.B Persistence
7 Proofs of Technical Results
   7.A Connection to the Gaussian Ensemble Result
   7.B S-Incoherence
   7.C Proof of Lemma 3.5
   7.D Proof of Proposition 3.6
   7.E Proof of Theorem 3.7
   7.F Proof of Lemma 3.8
   7.G Proof of Lemma 3.10
   7.H Proof of Claim 3.11
8 Discussion
9 Acknowledgments

I. INTRODUCTION

Two issues facing the use of statistical learning methods in applications are scale and privacy. Scale is an issue in storing, manipulating and analyzing extremely large, high dimensional data. Privacy is, increasingly, a concern whenever large amounts of confidential data are manipulated within an organization. It is often important to allow researchers to analyze data without compromising the privacy of customers or leaking confidential information outside the organization. In this paper we show that sparse regression for high dimensional data can be carried out directly on a compressed form of the data, in a manner that can be shown to guard privacy in an information theoretic sense.

The approach we develop here compresses the data by a random linear or affine transformation, reducing the number of data records exponentially, while preserving the number of original input variables. These compressed data can then be made available for statistical analyses; we focus on the problem of sparse linear regression for high dimensional data. Informally, our theory ensures that the relevant predictors can be learned from the compressed data as well as they could be from the original uncompressed data. Moreover, the actual predictions based on new examples are as accurate as they would be had the original data been made available. However, the original data are not recoverable from the compressed data, and the compressed data effectively reveal no more information than would be revealed by a completely new sample. At the same time, the inference algorithms run faster and require fewer resources than the much larger uncompressed data would require. In fact, the original data need never be stored; they can be transformed “on the fly” as they come in.

In more detail, the data are represented as an $n \times p$ matrix $X$. Each of the $p$ columns is an attribute, and each of the $n$ rows is the vector of attributes for an individual record.
The data are compressed by a random linear transformation

$$X \mapsto \widetilde{X} \equiv \Phi X \tag{1.1}$$

where $\Phi$ is a random $m \times n$ matrix with $m \ll n$. It is also natural to consider a random affine transformation

$$X \mapsto \widetilde{X} \equiv \Phi X + \Delta \tag{1.2}$$

where $\Delta$ is a random $m \times p$ matrix. Such transformations have been called “matrix masking” in the privacy literature (Duncan and Pearson, 1991). The entries of $\Phi$ and $\Delta$ are taken to be independent Gaussian random variables, but other distributions are possible. We think of $\widetilde{X}$ as “public,” while $\Phi$ and $\Delta$ are private and only needed at the time of compression. However, even with $\Delta = 0$ and $\Phi$ known, recovering $X$ from $\widetilde{X}$ requires solving a highly under-determined linear system and comes with information theoretic privacy guarantees, as we demonstrate.

In standard regression, a response $Y = X\beta + \epsilon \in \mathbb{R}^n$ is associated with the input variables, where $\epsilon_i$ are independent, mean zero additive noise variables. In compressed regression, we assume that the response is also compressed, resulting in the transformed response $\widetilde{Y} \in \mathbb{R}^m$ given by

\begin{align}
Y \mapsto \widetilde{Y} &\equiv \Phi Y \tag{1.3} \\
&= \Phi X\beta + \Phi\epsilon \tag{1.4} \\
&= \widetilde{X}\beta + \widetilde{\epsilon}. \tag{1.5}
\end{align}

Note that under compression, the transformed noise $\widetilde{\epsilon} = \Phi\epsilon$ is not independent across examples.

In the sparse setting, the parameter vector $\beta \in \mathbb{R}^p$ is sparse, with a relatively small number $s$ of nonzero coefficients $\mathrm{supp}(\beta) = \{j : \beta_j \neq 0\}$. Two key tasks are to identify the relevant variables, and to predict the response $x^T\beta$ for a new input vector $x \in \mathbb{R}^p$. The method we focus on is $\ell_1$-regularized least squares, also known as the lasso (Tibshirani, 1996). The main contributions of this paper are two technical results on the performance of this estimator, and an information-theoretic analysis of the privacy properties of the procedure. Our first result shows that the lasso is sparsistent under compression, meaning that the correct sparse set of relevant variables is identified asymptotically. Omitting details and technical assumptions for clarity, our result is the following.
Sparsistence (Theorem 3.4): If the number of compressed examples $m$ satisfies

$$C_1 s^2 \log nps \;\le\; m \;\le\; \sqrt{\frac{C_2\, n}{\log n}}, \tag{1.6}$$

and the regularization parameter $\lambda_m$ satisfies

$$\lambda_m \to 0 \quad \text{and} \quad \frac{m\lambda_m^2}{\log p} \to \infty, \tag{1.7}$$

then the compressed lasso solution

$$\widetilde{\beta}_m = \arg\min_{\beta} \; \frac{1}{2m}\|\widetilde{Y} - \widetilde{X}\beta\|_2^2 + \lambda_m\|\beta\|_1 \tag{1.8}$$

includes the correct variables, asymptotically:

$$\mathbb{P}\big(\mathrm{supp}(\widetilde{\beta}_m) = \mathrm{supp}(\beta)\big) \to 1. \tag{1.9}$$

Our second result shows that the lasso is persistent under compression. Roughly speaking, persistence (Greenshtein and Ritov, 2004) means that the procedure predicts well, as measured by the predictive risk

$$R(\beta) = \mathbb{E}(Y - X\beta)^2, \tag{1.10}$$

where now $X \in \mathbb{R}^p$ is a new input vector and $Y$ is the associated response. Persistence is a weaker condition than sparsistency, and in particular does not assume that the true model is linear.

Persistence (Theorem 4.1): Given a sequence of sets of estimators $B_{n,m}$, the sequence of compressed lasso estimators

$$\widetilde{\beta}_{n,m} = \arg\min_{\|\beta\|_1 \le L_{n,m}} \|\widetilde{Y} - \widetilde{X}\beta\|_2^2 \tag{1.11}$$

is persistent with the oracle risk over uncompressed data with respect to $B_{n,m}$, meaning that

$$R(\widetilde{\beta}_{n,m}) - \inf_{\|\beta\|_1 \le L_{n,m}} R(\beta) \;\xrightarrow{P}\; 0, \quad \text{as } n \to \infty, \tag{1.12}$$

in case $\log^2(np) \le m \le n$ and the radius of the $\ell_1$ ball satisfies $L_{n,m} = o\left(m/\log(np)\right)^{1/4}$.

Our third result analyzes the privacy properties of compressed regression. We consider the problem of recovering the uncompressed data $X$ from the compressed data $\widetilde{X} = \Phi X + \Delta$. To preserve privacy, the random matrices $\Phi$ and $\Delta$ should remain private. However, even in the case where $\Delta = 0$ and $\Phi$ is known, if $m \ll \min(n, p)$ the linear system $\widetilde{X} = \Phi X$ is highly under-determined. We evaluate privacy in information theoretic terms by bounding the average mutual information $I(\widetilde{X}; X)/np$ per matrix entry in the original data matrix $X$, which can be viewed as a communication rate. Bounding this mutual information is intimately connected with the problem of computing the channel capacity of certain multiple-antenna wireless communication systems (Marzetta and Hochwald, 1999; Telatar, 1999).
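The under-determination remark above can be illustrated numerically. The following is a minimal sketch, not part of the paper: it assumes a Gaussian compression matrix (written `Phi` here) with full row rank and no additive term, and the helper name `indistinguishable_alternative` and the matrix sizes are hypothetical. It builds a second data matrix that differs from the original but compresses to exactly the same public matrix, by adding columns drawn from the null space of `Phi`.

```python
import numpy as np

def indistinguishable_alternative(Phi, X, rng):
    """Return X_alt != X such that Phi @ X_alt equals Phi @ X up to rounding.

    Assumes Phi is m x n with m < n and full row rank (true with
    probability one for a Gaussian matrix).
    """
    m, n = Phi.shape
    # With full_matrices=True, Vt is n x n; its last n - m rows form an
    # orthonormal basis of the null space of Phi.
    _, _, Vt = np.linalg.svd(Phi, full_matrices=True)
    null_basis = Vt[m:].T                      # shape (n, n - m)
    coeffs = rng.standard_normal((n - m, X.shape[1]))
    return X + null_basis @ coeffs

rng = np.random.default_rng(0)
n, p, m = 200, 5, 20                           # illustrative sizes (assumption)
X = rng.standard_normal((n, p))
Phi = rng.standard_normal((m, n))
X_alt = indistinguishable_alternative(Phi, X, rng)
print(np.allclose(Phi @ X, Phi @ X_alt), np.allclose(X, X_alt))
```

This only shows non-uniqueness of the linear inverse problem; it says nothing about the mutual information bound, which is the stronger statement the paper goes on to establish.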
Information Resistance (Propositions 5.1 and 5.2): The rate at which information about $X$ is revealed by the compressed data $\widetilde{X}$ satisfies

$$r_{n,m} = \sup \frac{I(X; \widetilde{X})}{np} = O\!\left(\frac{m}{n}\right) \to 0, \tag{1.13}$$

where the supremum is over distributions on the original data $X$.

As summarized by these results, compressed regression is a practical procedure for sparse learning in high dimensional data that has provably good properties. This basic technique has connections in the privacy literature with matrix masking and other methods, yet most of the existing work in this direction has been heuristic and without theoretical guarantees; connections with this literature are briefly reviewed in Section 2.C. Compressed regression builds on the ideas underlying compressed sensing and sparse inference in high dimensional data, topics which have attracted a great deal of recent interest in the statistics and signal processing communities; the connections with this literature are reviewed in Sections 2.B and 2.A.

The remainder of the paper is organized as follows. In Section 2 we review relevant work from high dimensional statistical inference, compressed sensing and privacy. Section 3 presents our analysis of the sparsistency properties of the compressed lasso. Our approach follows the methods introduced by Wainwright (2006) in the uncompressed case. Section 4 proves that compressed regression is persistent. Section 5 derives upper bounds on the mutual information between the compressed data $\widetilde{X}$ and the uncompressed data $X$, after identifying a correspondence with the problem of computing channel capacity for a certain model of a multiple-antenna mobile communication channel. Section 6 includes the results of experimental simulations, showing that the empirical performance of the compressed lasso is consistent with our theoretical analysis. We evaluate the ability of the procedure to recover the relevant variables (sparsistency) and to predict well (persistence). The technical details of the proof of sparsistency are collected at the end of the paper, in Section 7.B.
The paper concludes with a discussion of the results and directions for future work in Section 8.

II. BACKGROUND AND RELATED WORK

In this section we briefly review relevant related work in high dimensional statistical inference, compressed sensing, and privacy, to place our work in context.

A. Sparse Regression

We adopt standard notation where a data matrix $X$ has $p$ variables and $n$ records; in a linear model the response $Y = X\beta + \epsilon \in \mathbb{R}^n$ is thus an $n$-vector, and the noise $\epsilon_i$ is independent and mean zero, $\mathbb{E}(\epsilon) = 0$. The usual estimator of $\beta$ is the least squares estimator

$$\widehat{\beta} = (X^T X)^{-1} X^T Y. \tag{2.1}$$

However, this estimator has very large variance when $p$ is large, and is not even defined when $p > n$. An estimator that has received much attention in the recent literature is the lasso $\widehat{\beta}_n$ (Tibshirani, 1996), defined as

\begin{align}
\widehat{\beta}_n &= \arg\min \; \frac{1}{2n}\sum_{i=1}^{n} (Y_i - X_i^T\beta)^2 + \lambda_n \sum_{j=1}^{p} |\beta_j| \tag{2.2} \\
&= \arg\min \; \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_n\|\beta\|_1, \tag{2.3}
\end{align}

where $\lambda_n$ is a regularization parameter. The practical success and importance of the lasso can be attributed to the fact that in many cases $\beta$ is sparse, that is, it has few large components. For example, data are often collected with many variables in the hope that at least a few will be useful for prediction. The result is that many covariates contribute little to the prediction of $Y$, although it is not known in advance which variables are important. Recent work has greatly clarified the properties of the lasso estimator in the high dimensional setting.

One of the most basic desirable properties of an estimator is consistency; an estimator $\widehat{\beta}_n$ is consistent in case

$$\|\widehat{\beta}_n - \beta\|_2 \;\xrightarrow{P}\; 0. \tag{2.4}$$

Meinshausen and Yu (2006) have recently shown that the lasso is consistent in the high dimensional setting. If the underlying model is sparse, a natural yet more demanding criterion is to ask that the estimator correctly identify the relevant variables. This may be useful for interpretation, dimension reduction and prediction.
For example, if an effective procedure for high-dimensional data can be used to identify the relevant variables in the model, then these variables can be isolated and their coefficients estimated by a separate procedure that works well for low-dimensional data. An estimator is sparsistent¹ if

$$\mathbb{P}\big(\mathrm{supp}(\widehat{\beta}_n) = \mathrm{supp}(\beta)\big) \to 1, \tag{2.5}$$

where $\mathrm{supp}(\beta) = \{j : \beta_j \neq 0\}$. Asymptotically, a sparsistent estimator has nonzero coefficients only for the true relevant variables. Sparsistency proofs for high dimensional problems have appeared recently in a number of settings. Meinshausen and Bühlmann (2006) consider the problem of estimating the graph underlying a sparse Gaussian graphical model by showing sparsistency of the lasso with exponential rates of convergence on the probability of error. Zhao and Yu (2007) show sparsistency of the lasso under more general noise distributions. Wainwright (2006) characterizes the sparsistency properties of the lasso by showing that there is a threshold sample size $n(p, s)$ above which the relevant variables are identified, and below which the relevant variables fail to be identified, where $s = \|\beta\|_0$ is the number of relevant variables. More precisely, Wainwright (2006) shows that when $X$ comes from a Gaussian ensemble, there exist fixed constants $0 < \theta_\ell \le 1$ and $1 \le \theta_u < +\infty$ (with $\theta_\ell = \theta_u = 1$ when each row of $X$ is chosen as an independent Gaussian random vector $\sim N(0, I_{p \times p})$) such that for any $\nu > 0$, if

$$n > 2(\theta_u + \nu)\, s \log(p - s) + s + 1, \tag{2.6}$$

then the lasso identifies the true variables with probability approaching one. Conversely, if

$$n < 2(\theta_\ell - \nu)\, s \log(p - s) + s + 1, \tag{2.7}$$

then the probability of recovering the true variables using the lasso approaches zero. These results require certain incoherence assumptions on the data $X$; intuitively, it is required that an irrelevant variable cannot be too strongly correlated with the set of relevant variables.
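To get a feel for the scale of the two sample-size thresholds quoted from Wainwright (2006), the bounds can be evaluated directly. This is a small illustrative sketch rather than anything from the paper: the function names are hypothetical, the logarithm is assumed to be natural, and the defaults take both constants equal to 1, the value stated for rows drawn from a standard Gaussian.

```python
import math

def success_threshold(p, s, theta_u=1.0, nu=0.0):
    """Right-hand side of the sufficient condition:
    2 (theta_u + nu) s log(p - s) + s + 1."""
    if not 0 < s < p:
        raise ValueError("need 0 < s < p")
    return 2.0 * (theta_u + nu) * s * math.log(p - s) + s + 1

def failure_threshold(p, s, theta_l=1.0, nu=0.0):
    """Right-hand side of the converse condition:
    2 (theta_l - nu) s log(p - s) + s + 1."""
    if not 0 < s < p:
        raise ValueError("need 0 < s < p")
    return 2.0 * (theta_l - nu) * s * math.log(p - s) + s + 1

# Example: p = 1024 variables, s = 8 relevant ones, small slack nu.
print(success_threshold(1024, 8, nu=0.05))
print(failure_threshold(1024, 8, nu=0.05))
```

Sample sizes above the first printed value fall in the regime where recovery succeeds asymptotically, and sizes below the second in the regime where it fails; the gap between them is controlled by the slack parameter.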
This result and Wainwright’s method of analysis are particularly relevant to the current paper; the details will be described in the following section. In particular, we refer to this result as the Gaussian Ensemble result. However, it is important to point out that under compression, the noise $\widetilde{\epsilon} = \Phi\epsilon$ is not independent. This prevents one from simply applying the Gaussian Ensemble results to the compressed case. Related work that studies information theoretic limits of sparsity recovery, where the particular estimator is not specified, includes (Wainwright, 2007; Donoho and Tanner, 2006). Sparsistency in the classification setting, with exponential rates of convergence for $\ell_1$-regularized logistic regression, is studied by Wainwright et al. (2007).

¹ This terminology is due to Pradeep Ravikumar.

An alternative goal is accurate prediction. In high dimensions it is essential to regularize the model in some fashion in order to control the variance of the estimator and attain good predictive risk. Persistence for the lasso was first defined and studied by Greenshtein and Ritov (2004). Given a sequence of sets of estimators $B_n$, the sequence of estimators $\widehat{\beta}_n \in B_n$ is called persistent in case

$$R(\widehat{\beta}_n) - \inf_{\beta \in B_n} R(\beta) \;\xrightarrow{P}\; 0, \tag{2.8}$$

where $R(\beta) = \mathbb{E}(Y - X^T\beta)^2$ is the prediction risk of a new pair $(X, Y)$. Thus, a sequence of estimators is persistent if it asymptotically predicts as well as the oracle within the class, which minimizes the population risk; it can be achieved under weaker assumptions than are required for sparsistence. In particular, persistence does not assume the true model is linear, and it does not require strong incoherence assumptions on the data. The results of the current paper show that sparsistence and persistence are preserved under compression.

B. Compressed Sensing

Compressed regression has close connections to, and draws motivation from, compressed sensing (Donoho, 2006; Candès et al., 2006; Candès and Tao, 2006; Rauhut et al., 2007).
However, in a sense, our motivation here is the opposite to that of compressed sensing. While compressed sensing of $X$ allows a sparse $X$ to be reconstructed from a small number of random measurements, our goal is to reconstruct a sparse function of $X$. Indeed, from the point of view of privacy, approximately reconstructing $X$, which compressed sensing shows is possible if $X$ is sparse, should be viewed as undesirable; we return to this point in Section 5.

Several authors have considered variations on compressed sensing for statistical signal processing tasks (Duarte et al., 2006; Davenport et al., 2006; Haupt et al., 2006; Davenport et al., 2007). The focus of this work is to consider certain hypothesis testing problems under sparse random measurements, and a generalization to classification of a signal into two or more classes. Here one observes $y = \Phi x$, where $y \in \mathbb{R}^m$, $x \in \mathbb{R}^n$ and $\Phi$ is a known random measurement matrix. The problem is to select between the hypotheses

$$\widetilde{H}_i : \; y = \Phi(s_i + \epsilon), \tag{2.9}$$

where $\epsilon \in \mathbb{R}^n$ is additive Gaussian noise. Importantly, the setup exploits the “universality” of the matrix $\Phi$, which is not selected with knowledge of $s_i$. The proof techniques use concentration properties of random projection, which underlie the celebrated lemma of Johnson and Lindenstrauss (1984). The compressed regression problem we introduce can be considered as a more challenging statistical inference task, where the problem is to select from an exponentially large set of linear models, each with a certain set of relevant variables with unknown parameters, or to predict as well as the best linear model in some class. Moreover, a key motivation for compressed regression is privacy; if privacy is not a concern, simple subsampling of the data matrix could be an effective compression procedure.

C. Privacy

Research on privacy in statistical data analysis has a long history, going back at least to Dalenius (1977a); we refer to Duncan and Pearson (1991) for discussion and further pointers into this literature.
The compression method we employ has been called matrix masking in the privacy literature. In the general method, the $n \times p$ data matrix $X$ is transformed by pre-multiplication, post-multiplication, and addition into a new $m \times q$ matrix

$$\widetilde{X} = AXB + C. \tag{2.10}$$

The transformation $A$ operates on data records for fixed covariates, and the transformation $B$ operates on covariates for a fixed record. The method encapsulated in this transformation is quite general, and allows the possibility of deleting records, suppressing subsets of variables, data swapping, and including simulated data. In our use of matrix masking, we transform the data by replacing each variable with a relatively small number of random averages of the instances of that variable in the data. In other work, Sanil et al. (2004) consider the problem of privacy preserving regression analysis in distributed data, where different variables appear in different databases but it is of interest to integrate data across databases. The recent work of Ting et al. (2007) considers random orthogonal mappings $X \mapsto RX = \widetilde{X}$ where $R$ is a random rotation (rank $n$), designed to preserve the sufficient statistics of a multivariate Gaussian and therefore allow regression estimation, for instance. This use of matrix masking does not share the information theoretic guarantees we present in Section 5. We are not aware of previous work that analyzes the asymptotic properties of a statistical estimator under matrix masking in the high dimensional setting.

The work of Liu et al. (2006) is closely related to the current paper at a high level, in that it considers low rank random linear transformations of either the row space or column space of the data $X$. Liu et al. (2006) note the Johnson-Lindenstrauss lemma, which implies that $\ell_2$ norms are approximately preserved under random projection, and argue heuristically that data mining procedures that exploit correlations or pairwise distances in the data, such as principal components analysis and clustering, are just as effective under random projection.
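The norm-preservation property attributed to the Johnson-Lindenstrauss lemma can be checked empirically. The sketch below is an illustration under stated assumptions, not the construction of Liu et al. (2006): the projection is a Gaussian matrix with entries of variance 1/m (a scaling chosen here so that squared norms are preserved in expectation), and the function name `norm_distortion` and all sizes are hypothetical.

```python
import numpy as np

def norm_distortion(points, m, rng):
    """Return ||Phi x||^2 / ||x||^2 for each row x of `points`.

    Phi is an m x n Gaussian matrix with entries N(0, 1/m); this scaling
    is an assumption made so that the expected ratio equals one.
    """
    n = points.shape[1]
    Phi = rng.standard_normal((m, n)) / np.sqrt(m)
    projected = points @ Phi.T                 # each row is Phi @ x
    return (projected ** 2).sum(axis=1) / (points ** 2).sum(axis=1)

rng = np.random.default_rng(1)
points = rng.standard_normal((20, 2000))       # 20 vectors in R^2000
ratios = norm_distortion(points, m=1000, rng=rng)
print(ratios.min(), ratios.max())
```

Each ratio is distributed as a chi-square variable with m degrees of freedom divided by m, so its spread around one shrinks like the square root of 2/m as the projection dimension grows.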
The privacy analysis is restricted to observing that recovering $X$ from $\widetilde{X}$ requires solving an under-determined linear system, and arguing that this prevents the exact values from being recovered.

An information-theoretic quantification of privacy was formulated by Agrawal and Aggarwal (2001). Given a random variable $X$ and a transformed variable $\widetilde{X}$, Agrawal and Aggarwal (2001) define the conditional privacy loss of $X$ given $\widetilde{X}$ as

$$\mathcal{P}(X \mid \widetilde{X}) = 1 - 2^{-I(X; \widetilde{X})}, \tag{2.11}$$

which is simply a transformed measure of the mutual information between the two random variables. In our work we identify privacy with the rate of information communicated about $X$ through $\widetilde{X}$ under matrix masking, maximizing over all distributions on $X$. We furthermore identify this with the problem of computing, or bounding, the Shannon capacity of a multi-antenna wireless communication channel, as modeled by Telatar (1999) and Marzetta and Hochwald (1999).

Finally, it is important to mention the extensive and currently active line of work on cryptographic approaches to privacy, which have come mainly from the theoretical computer science community. For instance, Feigenbaum et al. (2006) develop a framework for secure computation of approximations; intuitively, a private approximation of a function $f$ is an approximation $\widehat{f}$ that does not reveal information about $x$ other than what can be deduced from $f(x)$. Indyk and Woodruff (2006) consider the problem of computing private approximate nearest neighbors in this setting. Dwork (2006) revisits the notion of privacy formulated by Dalenius (1977b), which intuitively demands that nothing can be learned about an individual record in a database that cannot be learned without access to the database. An impossibility result is given which shows that, appropriately formalized, this strong notion of privacy cannot be achieved.
An alternative notion of differential privacy is proposed, which allows the probability of a disclosure of private information to change by only a small multiplicative factor, depending on whether or not an individual participates in the database. This line of work has recently been built upon by Dwork et al. (2007), with connections to compressed sensing, showing that any method that gives accurate answers to a large fraction of randomly generated subset sum queries must violate privacy.

III. COMPRESSED REGRESSION IS SPARSISTENT

In the standard setting, $X$ is an $n \times p$ matrix, $Y = X\beta + \epsilon$ is a vector of noisy observations under a linear model, and $p$ is considered to be a constant. In the high-dimensional setting we allow $p$ to grow with $n$. The lasso refers to the following quadratic program:

$$(P_1) \quad \text{minimize } \|Y - X\beta\|_2^2 \quad \text{such that } \|\beta\|_1 \le L. \tag{3.1}$$

In Lagrangian form, this becomes the optimization problem

$$(P_2) \quad \text{minimize } \frac{1}{2n}\|Y - X\beta\|_2^2 + \lambda_n\|\beta\|_1, \tag{3.2}$$

where the scaling factor $1/2n$ is chosen by convention and convenience. For an appropriate choice of the regularization parameter $\lambda = \lambda(Y, L)$, the solutions of these two problems coincide.
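As a concrete companion to the Lagrangian form of the lasso, the sketch below solves that objective by cyclic coordinate descent and applies it to randomly compressed data. It is a minimal illustration under stated assumptions, not the authors' implementation: coordinate descent is a swapped-in generic solver, the compression matrix is taken to be Gaussian with entries of variance 1/n (a scaling assumed here), and the names `soft_threshold`, `lasso_cd`, `compress` and `demo`, along with every numeric setting, are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding operator S(z, t)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/(2n)) ||y - X b||_2^2 + lam ||b||_1 by cyclic
    coordinate descent, where n is the number of rows of X."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n          # (1/n) ||X_j||^2
    resid = np.asarray(y, dtype=float).copy()  # y - X @ beta
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the partial residual.
            rho = X[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            resid += X[:, j] * (beta[j] - new)
            beta[j] = new
    return beta

def compress(X, y, m, rng):
    """Return (Phi @ X, Phi @ y) for a fresh m x n Gaussian Phi with
    entries N(0, 1/n). The scaling is an assumption of this sketch."""
    n = X.shape[0]
    Phi = rng.standard_normal((m, n)) / np.sqrt(n)
    return Phi @ X, Phi @ y

def demo(seed=0):
    """Fit the lasso on compressed data; return (true, estimated) supports."""
    rng = np.random.default_rng(seed)
    n, p, m = 2000, 50, 200                    # illustrative sizes
    beta = np.zeros(p)
    beta[:3] = [2.0, -1.5, 1.0]
    X = rng.standard_normal((n, p))
    y = X @ beta + 0.5 * rng.standard_normal(n)
    X_c, y_c = compress(X, y, m, rng)
    beta_c = lasso_cd(X_c, y_c, lam=0.2)       # objective uses 1/(2m) here
    return np.flatnonzero(beta), np.flatnonzero(np.abs(beta_c) > 1e-8)

if __name__ == "__main__":
    print(demo())
```

Only the compressed pair is passed to the solver, so the scaling factor in the objective becomes one over twice the number of compressed rows. Whether the estimated support matches the true one in a given run depends on the sizes and the regularization level chosen.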
