On Computationally Tractable Selection of Experiments in Regression Models

Yining Wang¹, Adams Wei Yu¹, and Aarti Singh¹

¹Machine Learning Department, Carnegie Mellon University

arXiv:1601.02068v4 [stat.ML] 2 Dec 2016

December 5, 2016

Abstract

We derive computationally tractable methods to select a small subset of experiment settings from a large pool of given design points. The primary focus is on linear regression models, while the technique extends to generalized linear models and Delta's method (estimating functions of linear regression models) as well. The algorithms are based on a continuous relaxation of an otherwise intractable combinatorial optimization problem, with sampling or greedy procedures as post-processing steps. Formal approximation guarantees are established for both algorithms, and simulations on synthetic data confirm the effectiveness of the proposed methods.

Keywords: optimal selection of experiments, A-optimality, computationally tractable methods

1 Introduction

Consider the linear regression model

    $y = X\beta_0 + \varepsilon$,

where $X \in \mathbb{R}^{n\times p}$ is the design matrix, $y \in \mathbb{R}^n$ is the response and $\varepsilon \sim \mathcal{N}_n(0, \sigma^2 I_n)$ is homoscedastic Gaussian noise with variance $\sigma^2$. $\beta_0$ is the $p$-dimensional vector of regression coefficients that one wishes to estimate. We consider the "large-scale, low-dimensional" setting where both $n, p \to \infty$ with $p < n$, and $X$ has full column rank. A common estimator is the Ordinary Least Squares (OLS) estimator:

    $\hat{\beta}^{\mathrm{ols}} = \arg\min_{\beta\in\mathbb{R}^p} \|y - X\beta\|_2^2 = (X^\top X)^{-1} X^\top y.$

Despite the simplicity and wide applicability of OLS, in practice it may not be desirable to obtain the full $n$-dimensional response vector $y$ due to measurement constraints. It is then a common practice to select a small subset of rows (e.g., $k \ll n$ rows) in $X$ so that the statistical efficiency of regression on the selected subset of design points is maximized. Compared to the classical experimental design problem (Pukelsheim, 1993), where $X$ can be freely designed, in this work we consider the setting where the selected design points must come from an existing (finite) design pool $X$. Below we list three example applications where such constraints are relevant:

Example 1 (Wind speed prediction). In Chen et al. (2015) a data set is created to record wind speed across a year at established measurement locations on highways in Minnesota, US. Due to instrumentation constraints, wind speed can only be measured at intersections of highways, and a small subset of such intersections is selected for wind speed measurement in order to reduce data gathering costs.

Example 2 (CPU benchmarking). Central processing units (CPUs) are vital to the performance of a computer system. It is an interesting statistical question to understand how known manufacturing parameters (clock period, cache size, etc.) of a CPU affect its execution time (performance) on benchmark computing tasks (Ein-Dor & Feldmesser, 1987). As the evaluation of real benchmark execution time is time-consuming and costly, it is desirable to select a subset of CPUs available in the market with a diverse range of manufacturing parameters, so that statistical efficiency is maximized by benchmarking only the selected CPUs.

Example 3 (Neural visual perception). Understanding how humans perceive visual objects is an important question in psychological and neural science. In an experimental study (Leeds et al., 2014), a group of human subjects are shown pictures of visual objects such as cars, toys and dogs, and neuro-biological responses are recorded. As experimental studies are expensive, it is important to select a subset of "representative" visual stimuli in order to reduce experimental cost. Again, visual stimuli cannot be arbitrarily designed and must be selected from an existing pool of natural pictures.
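To make the measurement-constrained setting concrete, the following minimal numpy sketch simulates regression under a measurement budget: only $k$ of the $n$ responses are observed, and the model is fit by OLS on the selected rows. The synthetic data and the uniformly random subset are illustrative assumptions, not part of the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, k = 500, 5, 30                       # design pool size, dimension, measurement budget
    X = rng.normal(size=(n, p))                # existing (finite) design pool
    beta0 = rng.normal(size=p)                 # unknown regression coefficients

    S = rng.choice(n, size=k, replace=False)   # naive selection: k rows chosen uniformly at random
    y_S = X[S] @ beta0 + rng.normal(scale=0.5, size=k)   # only the selected responses are measured

    # OLS restricted to the selected design points
    beta_hat = np.linalg.solve(X[S].T @ X[S], X[S].T @ y_S)
    print("estimation error:", np.linalg.norm(beta_hat - beta0))

How much better one can do than a uniformly random subset, while remaining computationally tractable, is exactly the question studied below.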
In this work, we focus on computationally tractable methods for experiment selection that achieve near-optimal statistical efficiency in linear regression models. We consider two experiment selection models: the with replacement model, where each design point (row of $X$) can be selected more than once and independent noise is involved for each selection, and the without replacement model, where distinct row subsets are required. We propose two computationally tractable algorithms: a sampling based algorithm that achieves $1+o(1)$ approximation of the statistically optimal solution for the with replacement model and $O(\log k)$ approximation for the without replacement model, assuming that the design pool $X$ is well-conditioned. In the case of an ill-conditioned design, we propose a greedy method that achieves $1+O(p^2/k)$ approximation for the without replacement model.

2 Problem formulation and backgrounds

We first give a formal definition of the experiment selection problem in linear regression models:

Definition 2.1 (experiment selection models). Let $X$ be a known $n\times p$ design matrix with full column rank and let $k$ be the subset budget, with $p \le k \le n$. An experiment selection method finds a subset $S \subseteq [n]$ of size $k$, and then $y_S = X_S\beta_0 + \tilde\varepsilon$ is observed, where each coordinate of $\tilde\varepsilon$ is an i.i.d. Gaussian random variable with zero mean and equal variance. Two scenarios are considered:

1. With replacement: $S$ is a multi-set, which allows duplicate indices. Note that fresh (independent) noise is imposed after experiment selection, and hence duplicate design points have independent noise.

2. Without replacement: $S$ is a standard set that may not have duplicate indices.

Table 1: Summary of approximation results. $\Sigma_* = X^\top\mathrm{diag}(\pi^*)X$, where $\pi^*$ is the optimal solution of Eq. (3).

    Algorithm | Model         | $C(n,p,k)$                                                                | Assumptions
    sampling  | with rep.     | $1+o(1)$                                                                  | $\|\Sigma_*^{-1}\|_2\|X\|_\infty^2\log k \to 0$
    sampling  | without rep.  | $O(\log k)$                                                               | $\|\Sigma_*^{-1}\|_2\|X\|_\infty^2\log k \to 0$
    greedy    | without rep.  | Rigorous: $1+\frac{p(p+1)}{2(k-p+1)}$; Conjecture: $1+O(\frac{p}{k-p})$   | $k > p$

As evaluation criterion, we consider the mean square error $\mathbb{E}\|\hat\beta_k - \beta_0\|_2^2$, where $\hat\beta_k = (X_S^\top X_S)^{-1}X_S^\top y_S$ is the OLS estimate on the selected design $X_S$ and its corresponding response $y_S$. In the standard setting of $n > p$ and $\mathrm{rank}(X) = p$, it is well known that the statistically optimal subset $S^*$ is given by the A-optimality criterion (Pukelsheim, 1993):

    $S^* = \arg\min_{S\subseteq[n],\,|S|\le k} \mathrm{tr}\big[(X_S^\top X_S)^{-1}\big].$    (1)

Here $S^* \subseteq [n]$ is a multi-set if the with-replacement model is considered and a standard set if the without-replacement model is used.

Despite the statistical optimality of Eq. (1), the optimization problem is combinatorial in nature and the optimal subset $S^*$ is difficult to compute. A brute-force search over all possible subsets of size $k$ requires $O(n^k k^3)$ operations, which is infeasible for even moderately sized designs $X$. In this paper we consider computationally feasible methods that approximately compute the optimal design in Eq. (1). By "computationally feasible" or "computationally tractable", we refer to algorithms whose running time scales polynomially with the input dimensions $n$, $p$ and $k$. For example, the OLS estimate $\hat\beta_k = (X_S^\top X_S)^{-1}X_S^\top y_S$ takes $O(kp^2 + p^3)$ operations to compute and is computationally feasible. In contrast, the brute-force search method for optimizing Eq. (1) requires $O(n^k k^3)$ running time, which is exponential in $k$ when both $p$ and $k$ increase with $n$, and is thus computationally intractable.
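The sketch below evaluates the A-optimality objective $F(S;X) = \mathrm{tr}[(X_S^\top X_S)^{-1}]$ and carries out the brute-force search of Eq. (1) over all size-$k$ subsets in the without replacement model; the helper name a_optimality is illustrative, and the deliberately tiny $n$ and $k$ reflect the combinatorial blow-up of the number of candidate subsets.

    import itertools
    import numpy as np

    def a_optimality(X, S):
        """F(S; X) = tr[(X_S^T X_S)^{-1}] for a subset S of row indices."""
        XS = X[list(S)]
        return np.trace(np.linalg.inv(XS.T @ XS))

    rng = np.random.default_rng(0)
    n, p, k = 12, 3, 5                      # kept tiny on purpose: the loop visits C(n, k) subsets
    X = rng.normal(size=(n, p))

    S_star = min(itertools.combinations(range(n), k), key=lambda S: a_optimality(X, S))
    print("optimal subset:", S_star, " F(S*;X) =", a_optimality(X, S_star))

The tractable algorithms described next avoid this enumeration entirely.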
Let $\hat S$ be the subset produced by such a computationally efficient algorithm. Due to computational constraints, $\hat S$ may not achieve the optimal statistical efficiency of $S^*$ defined in Eq. (1). Nevertheless, we wish to guarantee the statistical property of $\hat S$ in the following sense:

    $F(\hat S;X) \le C(n,p,k)\cdot F(S^*;X)$, where $F(S;X) = \mathrm{tr}\big[(X_S^\top X_S)^{-1}\big].$    (2)

Here $C(n,p,k) \ge 1$ is the approximation ratio of $\hat S$. In Table 1 we summarize the approximation ratios of the computationally tractable experiment selection algorithms presented in this paper.

2.1 Related work

There has been an increasing amount of work on fast solvers for the general least-squares problem $\min_\beta \|y-X\beta\|_2^2$. Most existing work along this direction (Woodruff, 2014; Dhillon et al., 2013; Drineas et al., 2011; Raskutti & Mahoney, 2015) focuses solely on the computational aspects and does not consider statistical constraints such as limited measurements of $y$. A convex optimization formulation was proposed in Davenport et al. (2015) for a constrained adaptive sensing problem, without finite sample guarantees on the combinatorial problem. In Horel et al. (2014) a computationally tractable approximation algorithm was proposed for the D-optimality criterion of the experimental design problem. However, the core idea in Horel et al. (2014) of pipage rounding an SDP solution (Ageev & Sviridenko, 2004) is not applicable to A-optimality type criteria as defined in Eq. (1).

Popular subsampling techniques such as leverage score sampling (Drineas et al., 2008) were studied in least squares and linear regression problems (Zhu et al., 2015; Ma et al., 2015; Chen et al., 2015). While computation remains the primary subject, measurement constraints and statistical properties were also analyzed within the linear regression model (Zhu et al., 2015). However, few existing (computationally efficient) methods achieve optimal statistical efficiency in terms of estimating the underlying linear model $\beta_0$.

Another related area is active learning (Chaudhuri et al., 2015; Hazan & Karnin, 2015; Sabato & Munos, 2014), which is a stronger setting where feedback from prior measurements can be used to guide subsequent data selection. Chaudhuri et al. (2015) analyzes an SDP relaxation in the context of active maximum likelihood estimation. However, the analysis in Chaudhuri et al. (2015) only works for the with replacement model, and the two-stage feedback-driven strategy proposed in Chaudhuri et al. (2015) is not available under the experiment selection model defined in Definition 2.1.

2.2 Notations

For a matrix $A \in \mathbb{R}^{n\times m}$, we use $\|A\|_p = \sup_{x\ne 0} \|Ax\|_p/\|x\|_p$ to denote the induced $p$-norm of $A$. In particular, $\|A\|_1 = \max_{1\le j\le m}\sum_{i=1}^n |A_{ij}|$ and $\|A\|_\infty = \max_{1\le i\le n}\sum_{j=1}^m |A_{ij}|$. $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$ denotes the Frobenius norm of $A$. Let $\sigma_1(A) \ge \sigma_2(A) \ge \cdots \ge \sigma_{\min(n,m)}(A) \ge 0$ be the singular values of $A$, sorted in descending order. The condition number is defined as $\kappa(A) = \sigma_1(A)/\sigma_{\min(n,m)}(A)$. For sequences of random variables $X_n$ and $Y_n$, we write $X_n \to^p Y_n$ to denote that $X_n$ converges in probability to $Y_n$. We say $a_n \lesssim b_n$ if $\lim_{n\to\infty} |a_n/b_n| \le 1$, and $a_n \gtrsim b_n$ if $\lim_{n\to\infty} |b_n/a_n| \le 1$.

3 Methods and main results

We describe two computationally feasible algorithms for the experiment selection problem, both based on a continuous relaxation of the combinatorial A-optimality problem. Statistical efficiency bounds are presented for both algorithms, with detailed proofs given in Sec. 6.

3.1 Continuous relaxation of Eq. (1)

Consider the following continuous optimization problem:

    $\pi^* = \arg\min_{\pi\in\mathbb{R}^n} f(\pi;X) = \arg\min_{\pi\in\mathbb{R}^n} \mathrm{tr}\big[(X^\top\mathrm{diag}(\pi)X)^{-1}\big]$    (3)
    s.t. $\pi \ge 0$, $\|\pi\|_1 \le k$,
         $\|\pi\|_\infty \le 1$ (only for the without replacement model).

Note that the $\|\pi\|_\infty \le 1$ constraint is only relevant for the without replacement model; for the with replacement model we drop this constraint in the optimization problem. It is easy to verify that both the objective function $f(\pi;X)$ and the feasible set in Eq. (3) are convex, and hence the global optimal solution $\pi^*$ of Eq. (3) can be obtained using computationally tractable algorithms. In particular, we describe an SDP formulation and a practical projected gradient descent algorithm in Appendix B and the supplementary material; both provably converge to the global optimal solution of Eq. (3) with running time scaling polynomially in $n$, $p$ and $k$.
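Appendix B is not reproduced here; the following is a minimal projected gradient descent sketch for Eq. (3) under the without replacement constraints. The gradient uses $\partial f/\partial \pi_i = -x_i^\top M^{-2} x_i$ with $M = X^\top\mathrm{diag}(\pi)X$; the step size, iteration count and the bisection-based projection onto $\{\pi : 0\le\pi\le 1,\ \|\pi\|_1\le k\}$ are implementation assumptions of this sketch, not necessarily the paper's exact algorithm.

    import numpy as np

    def project(pi, k):
        """Euclidean projection onto {pi : 0 <= pi <= 1, sum(pi) <= k} via bisection on a shift tau."""
        clipped = np.clip(pi, 0.0, 1.0)
        if clipped.sum() <= k:
            return clipped
        lo, hi = 0.0, float(pi.max())
        for _ in range(60):                    # find tau with sum(clip(pi - tau, 0, 1)) = k
            tau = (lo + hi) / 2
            if np.clip(pi - tau, 0.0, 1.0).sum() > k:
                lo = tau
            else:
                hi = tau
        return np.clip(pi - hi, 0.0, 1.0)

    def solve_relaxation(X, k, iters=2000, lr=1.0):
        """Projected gradient descent on f(pi; X) = tr[(X^T diag(pi) X)^{-1}]."""
        n, p = X.shape
        pi = np.full(n, k / n)                 # feasible, uniform starting point
        for _ in range(iters):
            M = X.T @ (pi[:, None] * X)
            A = X @ np.linalg.inv(M)           # row i equals x_i^T M^{-1}
            grad = -np.sum(A * A, axis=1)      # d f / d pi_i = -x_i^T M^{-2} x_i
            pi = project(pi - lr * grad, k)    # lr is a tuning knob; no line search in this sketch
        return pi

A with replacement variant is obtained by simply dropping the upper clip at 1 inside the projection.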
The following Gaussian design example gives an intuitive interpretation of the optimization problem in Eq. (3) and shows in what way it improves over the trivial experiment selection method of choosing rows of $X$ uniformly at random.

Example: Gaussian design. We use a simple Gaussian example to show that non-uniform weights $\pi_i \ne k/n$ can improve the objective function $f(\pi;X)$. Suppose $x_1,\cdots,x_n \sim_{\mathrm{i.i.d.}} \mathcal{N}_p(0,\Sigma_0)$. Let $\pi^{\mathrm{unif}}$ be the uniformly weighted solution with $\pi_i^{\mathrm{unif}} = k/n$, corresponding to selecting each row of $X$ uniformly at random. We then have

    $f(\pi^{\mathrm{unif}};X) = \frac{1}{k}\,\mathrm{tr}\Big[\big(\tfrac{1}{n}X^\top X\big)^{-1}\Big] \to^p \frac{1}{k}\,\mathrm{tr}(\Sigma_0^{-1}).$

On the other hand, let $B^2 = \gamma\,\mathrm{tr}(\Sigma_0)$ for some universal constant $\gamma > 1$, and set $B = \{x\in\mathbb{R}^p : \|x\|_2^2 \le B^2\}$, $\mathcal{X} = \{x_1,\cdots,x_n\}$. By Markov's inequality, $|\mathcal{X}\cap B| \gtrsim (1-1/\gamma)\,n$. Define the weighted solution $\pi^w$ by $\pi_i^w \propto 1/p(x_i|B)\cdot\mathbb{I}[x_i\in B]$, normalized such that $\|\pi^w\|_1 = k$.¹ Then

    $f(\pi^w;X) = \frac{1}{k}\Big(\tfrac{1}{n}\textstyle\sum_{x_i\in\mathcal{X}\cap B} 1/p(x_i|B)\Big)\,\mathrm{tr}\Big[\Big(\tfrac{1}{n}\textstyle\sum_{x_i\in\mathcal{X}\cap B} x_i x_i^\top/p(x_i|B)\Big)^{-1}\Big] \to^p \frac{1}{k}\Big(\textstyle\int_B 1\,\mathrm{d}x\Big)\,\mathrm{tr}\Big[\Big(\textstyle\int_B x x^\top\,\mathrm{d}x\Big)^{-1}\Big] \lesssim \frac{p^2}{(\gamma-1)\,k\,\mathrm{tr}(\Sigma_0)}.$

Here in the last inequality we apply Lemma A.1. Because $\frac{\mathrm{tr}(\Sigma_0^{-1})}{p} = \frac{1}{p}\sum_{i=1}^p \frac{1}{\sigma_i(\Sigma_0)} \ge \big(\frac{1}{p}\sum_{i=1}^p \sigma_i(\Sigma_0)\big)^{-1} = \frac{p}{\mathrm{tr}(\Sigma_0)}$ by Jensen's inequality, we conclude that in general $f(\pi^w;X) < f(\pi^{\mathrm{unif}};X)$, and the gap is larger for ill-conditioned covariance $\Sigma_0$. This example shows that uneven weights in $\pi$ help reduce the trace of the inverse of the weighted covariance $X^\top\mathrm{diag}(\pi)X$.

¹ Note that $\pi^w$ may not be the optimal solution of Eq. (3); however, it suffices for the purpose of demonstrating the improvement resulting from non-uniform weights.

We also present several facts about the continuous optimization problem. Proofs are trivial and are given in Sec. 6.1.

Fact 3.1. $f(\pi^*;X) \le F(S^*;X)$.

Fact 3.2. Let $\pi$ and $\pi'$ be feasible solutions of Eq. (3) such that $\pi_i \le \pi'_i$ for all $i = 1,\cdots,n$. Then $f(\pi;X) \ge f(\pi';X)$, with equality if and only if $\pi = \pi'$.

Fact 3.3. $\|\pi^*\|_1 = k$.

From Fact 3.1, it is clear that the optimal value of Eq. (3) lower bounds the combinatorial optimal solution in Eq. (1). However, the $n$-dimensional vector $\pi^*$ is not a valid experiment selection for $X$ without further processing: $\pi^*$ may have uneven weights and more than $k$ non-zero coordinates. It is thus important to develop post-processing procedures that transform $\pi^*$ into a valid subset solution $\hat S$ without losing much of the statistical efficiency in $\pi^*$. This is the topic of the next two sections.

3.2 Sampling based experiment selection

A natural idea for obtaining a valid subset $S$ of size $k$ is to sample from a weighted row distribution specified by $\pi^*$. More specifically, let $\hat S = \{i_1,\cdots,i_k\}$, where $i_1,\cdots,i_k$ are sampled, either with or without replacement depending on the model, from the following distribution:

    $\Pr[i_t = j] = \frac{\pi_j^*}{k}, \qquad t = 1,\cdots,k;\; j = 1,\cdots,n.$
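A minimal sketch of this sampling step, assuming a solution pi_star of Eq. (3) is already available (for example from the relaxation sketch above); the function name is illustrative. By Fact 3.3, $\|\pi^*\|_1 = k$, so normalizing $\pi^*$ gives exactly the distribution $\pi_j^*/k$, and numpy's weighted choice covers both the with and without replacement variants.

    import numpy as np

    def sample_subset(pi_star, k, with_replacement=True, rng=None):
        """Draw S_hat = (i_1, ..., i_k) with Pr[i_t = j] = pi_star[j] / k (Section 3.2)."""
        rng = rng or np.random.default_rng()
        prob = pi_star / pi_star.sum()          # equals pi_star / k by Fact 3.3
        return rng.choice(len(pi_star), size=k, replace=with_replacement, p=prob)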
The sampling procedure is easy to understand in an asymptotic sense. For example, consider the with replacement model, where $i_1,\cdots,i_k$ are i.i.d. sampled from the distribution $\Pr[i_t = j] = \pi_j^*/k$. It is easy to verify that $\mathbb{E}[X_{\hat S}^\top X_{\hat S}] = X^\top\mathrm{diag}(\pi^*)X$. By the weak law of large numbers, $X_{\hat S}^\top X_{\hat S} \to^p X^\top\mathrm{diag}(\pi^*)X$ as $k\to\infty$, and hence $F(\hat S;X) \to^p f(\pi^*;X)$ by the continuous mapping theorem. A more refined analysis is presented in Theorem 3.1, which provides explicit conditions under which the asymptotic approximations are valid and bounds on the statistical efficiency of $\hat S$, as well as an analysis of the more restrictive without replacement regime.

Theorem 3.1. Suppose $\|\Sigma_*^{-1}\|_2\|X\|_\infty^2\log k \to 0$ as $n,k\to\infty$, where $\Sigma_* = X^\top\mathrm{diag}(\pi^*)X$. Then with probability $1-o(1)$ the subset $\hat S$ satisfies $F(\hat S;X) \le (1+o(1))\,F(S^*;X)$ for the with replacement model and $F(\hat S;X) \le O(\log k)\cdot F(S^*;X)$ for the without replacement model.

Though it is tempting to establish the proximity between $X_{\hat S}^\top X_{\hat S}$ and $\Sigma_*$ directly using matrix concentration results, such an approach leads to an undesired dependency on $\kappa(\Sigma_*)$ in the conditions on $n$ and $k$. In order to prove Theorem 3.1 under weaker conditions, we adapt the proof technique of Spielman and Srivastava in their seminal work on spectral sparsification of graphs (Spielman & Srivastava, 2011). More specifically, we prove the following stronger result, which shows that $X_{\hat S}^\top X_{\hat S}$ is a spectral approximation of $\Sigma_*$ with high probability, under suitable conditions.

Lemma 3.1. Let $\hat S$ be constructed by sampling with replacement and define $\Sigma_{\hat S} = X_{\hat S}^\top X_{\hat S}$. Suppose $\|\Sigma_*^{-1}\|_2\|X\|_\infty^2\log k \to 0$. Then with probability $1-o(1)$ the following holds uniformly over all $z\in\mathbb{R}^p$: $z^\top\Sigma_{\hat S}\, z = (1+o(1))\,z^\top\Sigma_*\, z$.

Lemma 3.1 implies that $\sigma_j(\Sigma_{\hat S}) = (1+o(1))\,\sigma_j(\Sigma_*)$ for all $j = 1,\cdots,p$. Recall that $F(\hat S;X) = \sum_{j=1}^p \sigma_j(\Sigma_{\hat S})^{-1}$ and $f(\pi^*;X) = \sum_{j=1}^p \sigma_j(\Sigma_*)^{-1}$. The with replacement result of Theorem 3.1 is then a simple consequence of Lemma 3.1, because $F(\hat S;X) \le (1+o(1))\,f(\pi^*;X) \le (1+o(1))\,F(S^*;X)$.

To establish the without replacement result in Theorem 3.1, we prove the following lemma that exploits the additional infinity norm constraint in Eq. (3):

Lemma 3.2. Let $\hat S$ be constructed by sampling with replacement and suppose $\|\pi^*\|_\infty \le 1$. Then $\max_{1\le i\le n} \#\{i\in\hat S\} \le O_P(\log k)$.

Let $\tilde S \subseteq \hat S$ be the set of distinct elements in $\hat S$. Lemma 3.2 shows that, with probability $1-o(1)$, $|\tilde S| \ge |\hat S|/O(\log k)$, which further implies that $F(\tilde S;X) \le O(\log k)\cdot F(\hat S;X)$. Complete proofs of Lemmas 3.1 and 3.2 and Theorem 3.1 are placed in Sec. 6.

To provide further insight into the condition $\|\Sigma_*^{-1}\|_2\|X\|_\infty^2\log k \to 0$ in Theorem 3.1, we consider a Gaussian design example and derive interpretable conditions.

Example (Gaussian design). Suppose $x_1,\cdots,x_n \sim_{\mathrm{i.i.d.}} \mathcal{N}_p(0,\Sigma_0)$. Then $\|X\|_\infty^2 \le O_P(\|\Sigma_0\|_2\, p\log n)$. In addition, by simple algebra, $\|\Sigma_*^{-1}\|_2 \le \min\{1, p^{-1}\kappa(\Sigma_*)\}\,\mathrm{tr}(\Sigma_*^{-1}) \le \min\{1, p^{-1}\kappa(\Sigma_*)\}\,F(S^*;X)$. Using a very conservative upper bound on $F(S^*;X)$, obtained by sampling rows of $X$ uniformly at random and applying the weak law of large numbers and the continuous mapping theorem, we have that $F(S^*;X) \lesssim \frac{1}{k}\mathrm{tr}(\Sigma_0^{-1})$. In addition, $\mathrm{tr}(\Sigma_0^{-1})\|\Sigma_0\|_2 \le p\,\|\Sigma_0^{-1}\|_2\|\Sigma_0\|_2 = p\,\kappa(\Sigma_0)$. Subsequently, the condition $\|\Sigma_*^{-1}\|_2\|X\|_\infty^2\log k \to 0$ is implied by

    $\frac{p\,\kappa(\Sigma_*)\,\kappa(\Sigma_0)\,\log n\,\log k}{k} \to 0.$    (4)

Essentially, the condition reduces to $k \gtrsim \kappa(\Sigma_*)\kappa(\Sigma_0)\cdot p\log n\log k$. The linear dependency on $p$ is necessary, as we consider the low-dimensional linear regression problem and $k < p$ would imply an infinite mean-square error in the estimation of $\beta_0$. We also remark that the condition is scale-invariant, as $X' = \xi X$ and $X$ share the same quantity $\|\Sigma_*^{-1}\|_2\|X\|_\infty^2\log k$.
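A quick numerical illustration of the spectral approximation behind Lemma 3.1, using a uniform feasible weight vector as a stand-in for the true optimum $\pi^*$ (an assumption made purely for demonstration): the eigenvalues of the sampled Gram matrix $X_{\hat S}^\top X_{\hat S}$ should be close to those of $\Sigma_* = X^\top\mathrm{diag}(\pi)X$ once $k$ is moderately large.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, k = 2000, 5, 400
    X = rng.normal(size=(n, p))
    pi = np.full(n, k / n)                     # stand-in for pi_star: uniform, feasible weights
    Sigma_star = X.T @ (pi[:, None] * X)

    S_hat = rng.choice(n, size=k, replace=True, p=pi / pi.sum())
    Sigma_hat = X[S_hat].T @ X[S_hat]

    ratios = np.linalg.eigvalsh(Sigma_hat) / np.linalg.eigvalsh(Sigma_star)
    print("eigenvalue ratios (should all be close to 1):", np.round(ratios, 3))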
3.3 Greedy experiment selection

In Avron & Boutsidis (2013) an interesting upper bound on $F(S^*;X)$ for the without replacement model (i.e., $S^*$ does not have duplicate indices) is proved, which we summarize below:

Lemma 3.3 (Avron & Boutsidis (2013), Lemma 3.9, simplified). $F(S^*;X) \le \frac{n-p+1}{k-p+1}\,\mathrm{tr}\big[(X^\top X)^{-1}\big]$ for $p < k \le n$, under the without replacement model.

One consequence of Lemma 3.3 is a computationally tractable greedy removal procedure, which is summarized in Figure 1:

    Figure 1: Greedy based experiment selection.
    input:  $X\in\mathbb{R}^{n\times p}$, initial subset $S_0\subseteq[n]$, target size $k \le |S_0|$.
    output: $\hat S\subseteq[n]$, a selected subset of size $k$.
    Initialization: $t = 0$.
    1. Find $j^*\in S_t$ such that $\mathrm{tr}\big[(X_{S_t\setminus\{j^*\}}^\top X_{S_t\setminus\{j^*\}})^{-1}\big]$ is minimized.
    2. Remove $j^*$ from $S_t$: $S_{t+1} = S_t\setminus\{j^*\}$.
    3. Repeat steps 1 and 2 until $|S_t| = k$. Output $\hat S = S_t$.

The following corollary immediately follows from Lemma 3.3:

Corollary 3.1. $F(\hat S;X) \le \frac{|S_0|-p+1}{k-p+1}\,F(S_0;X)$, for $p < k \le |S_0|$.

In Avron & Boutsidis (2013) the greedy removal procedure in Figure 1 is applied to the entire design set $S_0 = [n]$, which gives the approximation guarantee $F(\hat S;X) \le \frac{n-p+1}{k-p+1}\,\mathrm{tr}\big[(X^\top X)^{-1}\big]$. This results in an approximation ratio of $\frac{n-p+1}{k-p+1}$, as defined in Eq. (2), by applying the trivial bound $\mathrm{tr}[(X^\top X)^{-1}] \le F(S^*;X)$, which is tight for a design that has exactly $k$ non-zero rows.

To further improve the approximation ratio, we consider applying the greedy removal procedure with $S_0$ equal to the support of $\pi^*$; that is, $S_0 = \{j\in[n] : \pi_j^* > 0\}$. Because $\|\pi^*\|_\infty \le 1$ under the without replacement setting, we have the following corollary:

Corollary 3.2. Let $S_0$ be the support of $\pi^*$ and suppose $\|\pi^*\|_\infty \le 1$. Then $F(\hat S;X) \le \frac{\|\pi^*\|_0-p+1}{k-p+1}\,f(\pi^*;X) \le \frac{\|\pi^*\|_0-p+1}{k-p+1}\,F(S^*;X)$.

It is thus important to upper bound the support size $\|\pi^*\|_0$. With the trivial bound $\|\pi^*\|_0 = n$ we recover the $\frac{n-p+1}{k-p+1}$ approximation ratio obtained by applying Figure 1 to $S_0 = [n]$. In order to bound $\|\pi^*\|_0$ away from $n$, we consider the following assumption imposed on $X$:

Assumption 3.1. Define the mapping $\phi: \mathbb{R}^p\to\mathbb{R}^{p(p+1)/2}$ as $\phi(x) = (\xi_{ij}\,x(i)\,x(j))_{1\le i\le j\le p}$, where $x(i)$ denotes the $i$th coordinate of a $p$-dimensional vector $x$, and $\xi_{ij} = 1$ if $i = j$ and $\xi_{ij} = 2$ otherwise. Denote by $\tilde\phi(x) = (\phi(x), 1)\in\mathbb{R}^{p(p+1)/2+1}$ the affine version of $\phi(x)$. For any $\frac{p(p+1)}{2}+1$ distinct rows of $X$, their mappings under $\tilde\phi$ are linearly independent.

Assumption 3.1 is essentially a general-position assumption, which assumes that no $\frac{p(p+1)}{2}+1$ design points in $X$ lie on a degenerate affine subspace after a specific quadratic mapping. Like other similar assumptions in the literature (Tibshirani, 2013), Assumption 3.1 is very mild and almost always satisfied in practice, for example, if each row of $X$ is independently sampled from an absolutely continuous distribution.

We are now ready to state the main lemma bounding the support size of $\pi^*$.

Lemma 3.4. $\|\pi^*\|_0 \le k + \frac{p(p+1)}{2}$ if Assumption 3.1 holds.

Lemma 3.4 is established by an interesting observation on the properties of the Karush-Kuhn-Tucker (KKT) conditions of the optimization problem Eq. (3), which involve a linear system with $\frac{p(p+1)}{2}+1$ variables. We give the complete proof of Lemma 3.4 in Sec. 6.5.

Combining results from both Lemma 3.4 and Corollary 3.2, we arrive at the following theorem, which upper bounds the approximation ratio of the greedy removal procedure in Figure 1 initialized with the support of $\pi^*$.

Theorem 3.2. Let $\pi^*$ be the optimal solution of the without replacement version of Eq. (3) and let $\hat S$ be the output of the greedy removal procedure in Figure 1 initialized with $S_0 = \{i\in[n] : \pi_i^* > 0\}$. If $k > p$ and Assumption 3.1 holds, then

    $F(\hat S;X) \le \Big(1 + \frac{p(p+1)}{2(k-p+1)}\Big)\,F(S^*;X).$

Under the slightly stronger condition that $k > 2p$, the approximation result in Theorem 3.2 can be simplified to $F(\hat S;X) \le (1+O(p^2/k))\,F(S^*;X)$. In addition, $F(\hat S;X) \le (1+o(1))\,F(S^*;X)$ if $p^2/k\to 0$, meaning that near-optimal experiment selection is achievable with computationally tractable methods if $\omega(p^2)$ design points are allowed in the selected subset.
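A direct sketch of the greedy removal procedure of Figure 1, initialized with the support of the relaxed solution as in Theorem 3.2. The per-removal recomputation of the trace is the naive implementation (rank-one downdating would be faster but is omitted for clarity), and the support threshold tol is an implementation assumption.

    import numpy as np

    def a_optimality(X, S):
        """F(S; X) = tr[(X_S^T X_S)^{-1}]."""
        XS = X[list(S)]
        return np.trace(np.linalg.inv(XS.T @ XS))

    def greedy_removal(X, S0, k):
        """Figure 1: repeatedly drop the point whose removal keeps tr[(X_S^T X_S)^{-1}] smallest."""
        S = list(S0)
        while len(S) > k:
            j_star = min(S, key=lambda j: a_optimality(X, [i for i in S if i != j]))
            S.remove(j_star)
        return S

    def support_of(pi_star, tol=1e-8):
        """S_0 = {j : pi_star[j] > 0}, the support of the relaxed solution (Corollary 3.2)."""
        return np.flatnonzero(pi_star > tol).tolist()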
3.4 Extensions

We discuss possible extensions of our results beyond estimation of $\beta_0$ in the linear regression model.

3.4.1 Generalized linear models

In a generalized linear model, $\mu(x) = \mathbb{E}[Y|x]$ satisfies $g(\mu(x)) = \eta = x^\top\beta_0$ for some known link function $g: \mathbb{R}\to\mathbb{R}$. Under regularity conditions (Van der Vaart, 2000), the maximum-likelihood estimator $\hat\beta_n \in \arg\max_\beta\{\sum_{i=1}^n \log p(y_i|x_i;\beta)\}$ satisfies $\mathbb{E}\|\hat\beta_n - \beta_0\|_2^2 = (1+o(1))\,\mathrm{tr}\big(I(X,\beta_0)^{-1}\big)$, where $I(X,\beta_0)$ is the Fisher information matrix:

    $I(X,\beta_0) = -\sum_{i=1}^n \mathbb{E}\,\frac{\partial^2\log p(y_i|x_i;\beta_0)}{\partial\beta\,\partial\beta^\top} = -\sum_{i=1}^n \Big(\mathbb{E}\,\frac{\partial^2\log p(y_i;\eta_i)}{\partial\eta_i^2}\Big)\,x_i x_i^\top.$    (5)

Here both expectations are taken over $y$ conditioned on $X$, and the last equality is due to the sufficiency of $\eta_i = x_i^\top\beta_0$. The experiment selection problem is then formulated as selecting a subset $S\subseteq[n]$ of size $k$, either with or without duplicates, that minimizes $\mathrm{tr}\big(I(X_S,\beta_0)^{-1}\big)$.

It is clear from Eq. (5) that the optimal subset $S^*$ depends on the unknown parameter $\beta_0$, which is itself the quantity to be estimated. This issue is known as the design dependence problem for generalized linear models (Khuri et al., 2006). One approach is to consider locally optimal designs (Khuri et al., 2006; Chernoff, 1953), where a consistent estimate $\check\beta$ of $\beta_0$ is first obtained on an initial design subset, and then $\check\eta_i = x_i^\top\check\beta$ is supplied to compute a more refined design subset that yields the final estimate $\hat\beta$. With the initial estimate $\check\beta$ available, the optimal experiment selection rule becomes

    $S^* = \arg\min_{|S|\le k} F_{\check\beta}(S;X), \qquad F_{\check\beta}(S;X) = \mathrm{tr}\big[(\tilde X_S^\top\tilde X_S)^{-1}\big] = F(S;\tilde X),$    (6)

where $\tilde X = (\tilde x_1,\cdots,\tilde x_n)^\top$ is defined by

    $\tilde x_i = \sqrt{-\mathbb{E}\,\frac{\partial^2\log p(y_i;\check\eta_i)}{\partial\eta_i^2}}\;\, x_i.$

Note that under regularity conditions $-\mathbb{E}\,\partial^2\log p(y_i;\check\eta_i)/\partial\eta_i^2$ is non-negative and hence the square root is well-defined. Both results in Theorems 3.1 and 3.2 remain valid with $X$ replaced by $\tilde X$ for generalized linear models. Below we consider two generalized linear model examples and derive explicit forms of $\tilde X$.

Example 1: Logistic regression. In a logistic regression model, responses $y_i\in\{0,1\}$ are binary and the likelihood model is

    $p(y_i;\eta_i) = \psi(\eta_i)^{y_i}\,(1-\psi(\eta_i))^{1-y_i}$, where $\psi(\eta_i) = \frac{e^{\eta_i}}{1+e^{\eta_i}}$.

Simple algebra yields

    $\tilde x_i = \sqrt{\frac{e^{\check\eta_i}}{(1+e^{\check\eta_i})^2}}\; x_i$, where $\check\eta_i = x_i^\top\check\beta$.

Example 2: Poisson count model. In a Poisson count model, the response variable $y_i$ takes non-negative integer values and follows a Poisson distribution with parameter $\lambda = e^{\eta_i} = e^{x_i^\top\beta_0}$. The likelihood model is formally defined as

    $p(y_i = r;\eta_i) = \frac{e^{\eta_i r}\,e^{-e^{\eta_i}}}{r!}, \qquad r = 0,1,2,\cdots.$

Simple algebra yields

    $\tilde x_i = \sqrt{e^{\check\eta_i}}\; x_i$, where $\check\eta_i = x_i^\top\check\beta$.
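Both examples amount to rescaling each row of $X$ before running the selection algorithms. A small sketch of this reweighting (the function name and the family argument are illustrative, not from the paper):

    import numpy as np

    def reweight_rows(X, beta_check, family="logistic"):
        """Return X_tilde with rows sqrt(-E[d^2 log p / d eta^2]) * x_i at eta_i = x_i^T beta_check."""
        eta = X @ beta_check
        if family == "logistic":
            w = np.exp(eta) / (1.0 + np.exp(eta)) ** 2   # Bernoulli variance psi(eta)(1 - psi(eta))
        elif family == "poisson":
            w = np.exp(eta)                              # Poisson variance exp(eta)
        else:
            raise ValueError("unknown family")
        return np.sqrt(w)[:, None] * X

The sampling or greedy selection is then run on the returned matrix in place of $X$.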
3.4.2 Delta's method

Suppose $g(\beta_0)$ is the quantity of interest, where $\beta_0\in\mathbb{R}^p$ is the parameter of a linear regression model and $g: \mathbb{R}^p\to\mathbb{R}^m$ is some known function. Examples include the in-sample prediction $g(\beta) = X\beta$ and the out-of-sample prediction $g(\beta) = Z\beta$. Let $\hat\beta_n = (X^\top X)^{-1}X^\top y$ be the OLS estimate of $\beta_0$. If $\nabla g$ is continuously differentiable and $\hat\beta_n$ is consistent, then by the classical delta's method (Van der Vaart, 2000), $\mathbb{E}\|g(\hat\beta_n) - g(\beta_0)\|_2^2 = (1+o(1))\,\sigma^2\,\mathrm{tr}\big(\nabla g(\beta_0)(X^\top X)^{-1}\nabla g(\beta_0)^\top\big)$. The experiment selection problem for estimating $g(\beta_0)$ can then be formalized as

    $S^* = \arg\min_{|S|\le k} F_{G_0}(S;X) = \arg\min_{|S|\le k} \mathrm{tr}\big[G_0\,(X_S^\top X_S)^{-1}\big],$

where $G_0 = \nabla g(\beta_0)^\top\nabla g(\beta_0)$. If $G_0$ depends on the unknown parameter $\beta_0$, then the design dependence problem again arises, and a locally optimal solution can be obtained by replacing $G_0$ in the objective function with $\check G = \nabla g(\check\beta)^\top\nabla g(\check\beta)$ for some initial estimate $\check\beta$ of $\beta_0$.

If $\check G$ is invertible, then there exists an invertible $p\times p$ matrix $\check P$ such that $\check G = \check P\check P^\top$, because $\check G$ is positive definite. The objective function can then be re-organized as

    $F_{\check G}(S;X) = \mathrm{tr}\big[\check P\check P^\top(X_S^\top X_S)^{-1}\big] = \mathrm{tr}\big[\big(\check P^{-1}X_S^\top X_S\check P^{-\top}\big)^{-1}\big] = F(S;\, X\check P^{-\top}),$

where $F(S;\, X\check P^{-\top})$ is the ordinary A-optimality criterion defined in Eq. (1). Our results in Theorems 3.1 and 3.2 remain valid by operating on the transformed matrix $X\check P^{-\top}$. In the case of $\check G$ being singular, it suffices to consider the regularized objective

    $F_{\check G_\lambda}(S;X) = \mathrm{tr}\big[\check G_\lambda\,(X_S^\top X_S)^{-1}\big]$, where $\check G_\lambda = \check G + \lambda I_{p\times p}$.

Note that $\check G_\lambda$ is always invertible if $\lambda > 0$, because $\check G = \nabla g(\check\beta)^\top\nabla g(\check\beta)$ is positive semi-definite. The quality of such an approximation can be bounded as

    $F_{\check G}(S;X) \le F_{\check G_\lambda}(S;X) \le F_{\check G}(S;X) + \lambda\,F(S;X).$
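A sketch of this transformation: factor $\check G$ (or its regularized version $\check G_\lambda$) as $\check P\check P^\top$ with a Cholesky decomposition and hand $X\check P^{-\top}$ to the selection algorithms. The function name and the lam argument are illustrative assumptions.

    import numpy as np

    def transform_for_delta(X, G_check, lam=0.0):
        """Return X P^{-T}, where P P^T = G_check + lam * I, so that F_G(S; X) = F(S; X P^{-T})."""
        p = G_check.shape[0]
        P = np.linalg.cholesky(G_check + lam * np.eye(p))   # lower-triangular P with P P^T = G_lambda
        return np.linalg.solve(P, X.T).T                    # rows x_i^T P^{-T}, i.e. the matrix X P^{-T}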
