Additive Approximations in High Dimensional Nonparametric Regression via the SALSA KirthevasanKandasamy [email protected] YaoliangYu [email protected] MachineLearningDepartment,CarnegieMellonUniversity,Pittsburgh,PA,USA 6 1 0 Abstract x, i.e. f(x) = β(cid:62)x for some β ∈ RD. Linear Regres- 2 Highdimensionalnonparametricregressionisan sionistypicallysolvedbyminimisingthesumofsquared y inherently difficult problem with known lower errors on the training set subject to a complexity penalty a M boundsdependingexponentiallyindimension.A on β. Such parametric methods are conceptually simple popularstrategytoalleviatethiscurseofdimen- andhavedesirablestatisticalpropertieswhentheproblem 4 sionalityhasbeentouseadditivemodelsoffirst meetstheassumption.However,theparametricassumption 2 order, which model the regression function as isgenerallytoorestrictiveformanyrealproblems. a sum of independent functions on each dimen- ] Nonparametricregressionreferstoasuiteofmethodsthat L sion.Thoughusefulincontrollingthevarianceof typically only assume smoothness on f . They present a ∗ M theestimate,suchmodelsareoftentoorestrictive more compelling framework for regression since they en- inpracticalsettings. Betweennon-additivemod- . compass a richer class of functions than parametric mod- t els which often have large variance and first or- a els do. However they suffer from severe drawbacks in t deradditivemodelswhichhavelargebias,there high dimensional settings. The excess risk of nonpara- s has been little work to exploit the trade-off in [ metric methods has exponential dependence on dimen- the middle via additive models of intermediate sion. Current lower bounds (Gyo¨rfi et al., 2002) suggest 3 order. In this work, we propose SALSA, which v that this dependence is unavoidable. Therefore, to make bridgesthisgapbyallowinginteractionsbetween 7 progress stronger assumptions on f beyond just smooth- ∗ 8 variables, but controls model capacity by lim- ness are necessary. In this light, a common simplification 2 iting the order of interactions. SALSA min- has been to assume that f decomposes into the additive 00 iRmKisHeSs tnhoerrmesipdeunaalltsiuesm. oAflsgqouraitrhems wiciatlhlys,qiutacreadn formf∗(x)=f∗(1)(x1)+f∗(∗2)(x2)+···+f∗(D)(xD)(Hastie . &Tibshirani,1990;Lafferty&Wasserman,2005;Raviku- 2 be viewed as Kernel Ridge Regression with an maretal.,2009). Inthisexposition,werefertosuchmod- 0 additivekernel. Whentheregressionfunctionis 6 els as first order additive models. Under this assumption, additive,theexcessriskisonlypolynomialindi- 1 theexcessriskimprovessignificantly. mension.UsingtheGirard-Newtonformulae,we : v efficiently sum over a combinatorial number of Thatsaid, thefirstorderassumptionisoftentoobiasedin i X terms in the additive expansion. Via a compari- practice since it ignores interactions between variables. It sonon15realdatasets,weshowthatourmethod isnaturaltoaskifwecouldconsideradditivemodelswhich r a iscompetitiveagainst21otheralternatives. permitinteractions. Forinstance,asecondordermodelhas theexpansionf (x)=f(1)(x ,x )+f(2)(x ,x )+.... ∗ ∗ 1 2 ∗ 1 3 Ingeneral,wemayconsiderdordersofinteractionwhich 1.Introduction have(cid:0)D(cid:1)termsintheexpansion. Ifd(cid:28)D,wemayallow d Given i.i.d samples (X ,Y )n from some distribution for a richer class of functions than first order models, and i i i=1 P , on X × Y ⊂ RD × R, the goal of least squares hopefullystillbeabletocontroltheexcessrisk. XY regression is to estimate the regression function f (x) = ∗ Evenwhenf isnotadditive,usinganadditiveapproxima- E[Y|X = x]. A popular approach is linear regression ∗ tion has its advantages. It is a well understood statistical which models f as a linear combination of the variables ∗ conceptthatwhenweonlyhavefewsamples,usingasim- plermodeltofitourdatagivesusabettertrade-offforvari- Proceedings of the 33rd International Conference on Machine ance against bias. Since additive models are statistically Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48.Copyright2016bytheauthor(s). simpler they may give us better estimates due to reduced ShrunkAdditiveLeastSquaresApproximation variance. In most nonparametric regression methods, the mensionalfunctions. SomevariantssuchasRODEO(Laf- bias-variancetrade-offismanagedviaaparametersuchas ferty & Wasserman, 2005) and SpAM (Ravikumar et al., thebandwidthofakerneloracomplexitypenalty. Inthis 2009)studyfirstordermodelsinvariableselection/sparsity work, we demonstrate that this trade-off can also be con- settings. MARS (Friedman, 1991) uses a sum of splines trolledviaadditivemodelswithdifferentordersofinterac- on individual dimensions but allows interactions between tion. Intuitively, wemightuseloworderinteractionswith variables via products of hinge functions at selected knot fewdatapointsbutwithmoredatawecanincreasemodel points. Lou et al. (2013) model f as a first order model ∗ capacity via higher order interactions. Indeed, our exper- plus a sparse collection of pairwise interactions. How- imentssubstantiatethisintuition: additivemodelsdowell ever, restricting ourselvesto onlyto asparsecollection of onseveraldatasetsinwhichf isnotnecessarilyadditive. second order interactions might be too biased in practice. ∗ COSSO (Lin & Zhang, 2006) study higher order models Therearetwokeymessagesinthispaper. Thefirstisthat but when you need only a sparse collection of them. In weshoulduseadditivemodelsinhighdimensionalregres- Section4welistseveralotherparametricandnonparamet- siontoreducethevarianceoftheestimate. Thesecondis ricmethodsusedinregression. thatitisnecessarytomodelbeyondjustfirstordermodels toreducethebias. Ourcontributionsinthispaperare: Our approach is based on additive kernels and builds on Kernel Ridge Regression (Steinwart & Christmann, 2008; 1. Weformulateadditivemodelsfornonparametricregres- Zhang,2005). Usingadditivekernelstoencodeandiden- sion beyond first order models. Our method SALSA – tifystructureintheproblemisfairlycommoninMachine forShrunkAdditiveLeastSquaresApproximation–esti- matesadthorderadditivefunctioncontaining(cid:0)D(cid:1)terms Learning literature. A large line of work, in what has to d come to be known as Multiple Kernel Learning (MKL), in its expansion. Despite this, the computational com- focuses on precisely this problem (Bach, 2008; Go¨nen & plexityofSALSA isO(Dd2). Alpaydin, 2011; Xu et al., 2010). Additive models have 2. Our theoretical analysis bounds the excess risk for also been studied in Gaussian process literature via addi- SALSA for (i) additive f under reproducing kernel ∗ tivekernels(Duvenaudetal.,2011;Plate,1999).However, Hilbertspaceassumptionsand(ii)non-additivef inthe ∗ theytreattheadditivemodeljustasaheuristicwhereaswe agnosticsetting. In(i),theexcessriskhasonlypolyno- alsoprovideatheoreticalanalysisofourmethods. mialdependenceonD. 3. We compare our method against 21 alternatives on 2.Preliminaries synthetic and 15 real datasets. SALSA is more con- sistent and in many cases outperforms other meth- Webeginwithabriefreviewofsomebackgroundmaterial. ods. Our software and datasets are available at Wearegiveni.i.ddata(X ,Y )n sampledfromsomedis- i i i=1 github.com/kirthevasank/salsa. Our imple- tributionP onacompactspaceX ×Y ⊂RD×R. Let XY mentation of locally polynomial regression is also re- themarginaldistributionofXonX beP andtheL (P ) X 2 X leased as part of this paper and is made available at normbe(cid:107)f(cid:107)2 =(cid:82) f2dP . Wewishtousethedatatofind 2 X github.com/kirthevasank/local-poly-reg. afunctionf :X →Rwithsmallrisk Before we proceed we make an essential observation. (cid:90) R(f)= (y−f(x))2dP (x,y)=E[(Y−f(X))2]. Whenparametricassumptionsaretrue, parametricregres- XY X×Y sionmethodscanscalebothstatisticallyandcomputation- allytopossiblyseveralthousandsofdimensions.However, ItiswellknownthatRisminimisedbytheregressionfunc- it is common knowledge in the statistics community that tionf (·)=E [Y|X =·]andtheexcessriskforanyf is ∗ XY nonparametric regression can be reliably applied only in R(f)−R(f )=(cid:107)f−f (cid:107)2(Gyo¨rfietal.,2002). Ourgoal ∗ ∗ 2 very low dimensions with reasonable data set sizes. Even istodevelopanestimatethathaslowexpectedexcessrisk D = 10 is considered “high” for nonparametric methods. ER(fˆ)−R(f )=E[(cid:107)fˆ−f (cid:107)2],wheretheexpectationis ∗ ∗ 2 Inthisworkweaimtostatisticallyscalenonparametricre- takenwithrespecttorealisationsofthedata(X ,Y )n . i i i=1 gressiontodimensionsontheorder10–100whileaddress- Some smoothness conditions on f are required to make ingthecomputationalchallengesindoingso. ∗ regression tractable. A common assumption is that f ∗ hasboundednorminthereproducingkernelHilbertspace RelatedWork (RKHS) H of a continuous positive definite kernel κ : κ Apluralityofworkinhighdimensionalregressionfocuses X ×X → R. ByMercer’stheorem(Scho¨lkopf&Smola, on first order additive models. One of the most popular 2001),κpermitsaneigenexpansionoftheformκ(x,x(cid:48))= techniquesistheback-fittingalgorithm(Hastieetal.,2001) (cid:80)∞j=1µjφj(x)φj(x(cid:48)) where µ1 ≥ µ2 ≥ ··· ≥ 0 are which iteratively approximates f∗ via a sum of D one di- theeigenvaluesoftheexpansionandφ1,φ2,... areanor- thonormalbasisforL2(P ). X ShrunkAdditiveLeastSquaresApproximation KernelRidgeRegression(KRR)isapopulartechniquefor lowingobjectivejointlyoverfˆ(j) ∈H ,j =1...,M . k(j) d nonparametricregression.Itischaracterisedasthesolution osofmtheekfeorlnleolwκin.g optimisation problem over the RKHS of {fˆ(j)}Mj=d1 =f(j)∈Hakr(jg)m,j=in1,...,Md λ (cid:88)jM=d1(cid:107)f(j)(cid:107)2Hk(j) + fˆ=argminλ(cid:107)f(cid:107)2Hκ + n1 (cid:88)n (Yi−f(Xi))2. (1) n1 (cid:88)n (cid:16)Yi−(cid:88)Md f(j)(Xi(j))(cid:17)2. (3) f∈Hκ i=1 i=1 j=1 Ourestimateforf isthenfˆ(·) = (cid:80) fˆ(j)(·). Atfirst,this Hereλistheregularisationcoefficienttocontrolthevari- j appearstroublesomesinceitrequresoptimisingovernM ance of the estimate and is decreasing with more data. d Via the representer theorem (Scho¨lkopf & Smola, 2001; parameters(αi(j)),j = 1,...,Md,i = 1,...,n. However, Steinwart & Christmann, 2008), we know that the solu- fromtheworkofAronszajn(1950),weknowthatthesolu- tion lies in the linear span of the canonical maps of the tionof(3)liesintheRKHSofthesumkernelk training points Xn – i.e. fˆ(·) = (cid:80) α κ(·,X ). This re- ducestheaboveo1bjectivetoαˆ = argimiinα∈Rniλα(cid:62)Kα+ k(x,x(cid:48))=(cid:88)Md k(j)(x(j),x(j)(cid:48)) (4) 1(cid:107)Y −Kα(cid:107)2 whereK ∈ Rn×n isthekernelmatrixwith n 2 j=1 K = κ(X ,X ). The problem has the closed form so- ij i j (cid:88) lution αˆ = (K +λnI)−1Y. KRR has been analysed ex- = k(j)([xi1,...,xid],[x(cid:48)i1,...,x(cid:48)id]). tensivelyunderdifferentassumptionsonf∗;see(Steinwart 1≤i1<···<id≤D &Christmann,2008;Steinwartetal.,2009;Zhang,2005) See Remark 6 in Appendix A for a proof. Hence, the so- and references therein. Unfortunately, as is the case with lution fˆcan be written in the form fˆ(·) = (cid:80) α k(·,X ) manynonparametricmethods,KRR suffersfromthecurse i i i This is convenient since we only need to optimise over n ofdimensionalityasitsexcessriskisexponentialinD. parameters despite the combinatorial number of kernels. Additive assumption: To make progress in high dimen- Moreover, it is straightforward to see that the solution is sions, we assume that f∗ decomposes into the following obtained by solving (1) by plugging in the sum kernel additiveformthatcontainsinteractionsofdordersamong k for κ. Consequently fˆ(j) = (cid:80) αˆ k(j)(·,X(j)) and i i i thevariables. (Lateron,wewillanalysenon-additivef∗.) fˆ=(cid:80) αˆ k(·,X )whereαˆisthesolutionof(1). Whileat i i i firstsightthedifferenceswithKRRmightseemsuperficial, f (x)= (cid:88) f(j)(x ,x ,...,x ), (2) wewillseethatthestrongeradditiveassumptionwillhelp ∗ ∗ i1 i2 id us reduce the excess risk for high dimensional regression. 1≤i1<i2<···<id≤D Ourtheoreticalresultswillbecharacteriseddirectlyviathe optimisationobjective(3). We will write, f (x) = (cid:80)Md f(j)(x(j)) where M = ∗ j=1 ∗ d (cid:0)D(cid:1), and x(j) denotes the subset (x ,x ,...,x ). We 3.1.TheESPKernel d i1 i2 id areprimarilyinterestedinthesettingd (cid:28) D. Whilethere While the above formulation reduces the number of opti- arealargenumberoff(j)’s,eachofthemonlypermitsin- ∗ misation parameters, the kernel still has a combinatorial teractions of at most d variables. We will show that this number of terms which can be expensive to compute. assumptiondoesinfactreducethestatisticalcomplexityof While this is true for arbitrary choices for k(j)’s, under the function to be estimated. The first order additive as- some restrictions we can efficiently compute k. For sumptionisequivalenttosettingd = 1above. Apotential this, we use the same trick used by Shawe-Taylor & difficulty with the above assumption is the combinatorial Cristianini (2004) and Duvenaud et al. (2011). First computationalcostinestimatingallf(j)’swhend>1.We ∗ consider a set of base kernels acting on each dimension circumventthisbottleneckusingtwostrategems: aclassi- k ,k ,...,...,k . Definek(j)tobetheproductkernelof 1 2 D calresultfromRKHStheory,andacomputationaltrickus- all kernels acting on each coordinate – k(j)(x(j),x(j)(cid:48)) = ingelementarysymmetricpolynomialsusedbeforebyDu- k (x ,x(cid:48) )k (x ,x(cid:48) )···k (x ,x(cid:48) ). Then, the venaudetal.(2011);Shawe-Taylor&Cristianini(2004)in i1 i1 i1 i2 i2 i2 id id id additive kernel k(x,x(cid:48)) becomes the dth elemen- thekernelliteratureforadditivekernels. tary symmetric polynomial (ESP) of the D variables k (x ,x(cid:48)),...,k (x ,x(cid:48) ). Concretely, 1 1 1 D D D 3.SALSA (cid:32) d (cid:33) To extend KRR to additive models we first define kernels k(x,x(cid:48))= (cid:88) (cid:89)k (x ,x(cid:48) ) . (5) k(j) thatactoneachsubsetx(j). Wethenoptimisethefol- i(cid:96) i(cid:96) i(cid:96) 1≤i1<i2<···<id≤D (cid:96)=1 ShrunkAdditiveLeastSquaresApproximation We refer to (5) as the ESP kernel. Using the Girard- {(φ(j))∞ } is an orthonormal basis for L2(P ) and (cid:96) (cid:96)=1 X(j) Newton identities (Macdonald, 1995) for ESPs, we can {(µ(j))∞ } are its eigenvalues. P is the marginal (cid:96) (cid:96)=1 X(j) compute this summation efficiently. For the D variables distribution of the coordinates X(j). We also need the sD1 = s1,...,sD and 1 ≤ m ≤ D, define the mth power following regularity condition on the tail behaviour of the sumpmandthemthelementarysymmetricpolynomialem: basisfunctions{φ(j)}forallk(j). Similarassumptionsare (cid:96) made in (Zhang et al., 2013) and are satisfied for a large D (cid:88) p (sD)= sm , rangeofkernelsincludingthoseinTheorem4. m 1 i i=1 (cid:88) Assumption2. Forsomeq ≥ 2, ∃ ρ < ∞suchthatfor e (sD)= s ×s ×···×s . m 1 i1 i2 im allj =1,...,M and(cid:96)∈N,E[φ(j)(X)2q]≤ρ2q. 1≤i1<i2<···<im≤D d (cid:96) In addition define e (sn) = 1. Then, the Girard-Newton Wealsodefinethefollowing, 0 1 formulaestate, (cid:88)∞ 1 (cid:88)Md γ(j)(λ)= , γ (λ)= γ(j)(λ). (6) em(sD1 )= m1 (cid:88)m (−1)i−1em−i(sD1 )pi(sD1 ). (cid:96)=1 1+λ/µ((cid:96)j) k j=1 i=1 Thefirsttermisknownastheeffectivedatadimensionality Startingwithm = 1andproceedinguptom = d,e can ofk(j) (Zhang,2005;Zhangetal.,2013)andcapturesthe d be computed iteratively in just O(Dd2) time. By treating statisticaldifficultyofestimatingafunctioninHk(j). γk is s = k ,thekernelmatrixcanbecomputedinO(n2d2D) thesumoftheγ(j)’s. Ourfirsttheorembelowboundsthe i i time. WhiletheESPtrickrestrictstheclassofkernelswe excessriskoffˆinterms(cid:107)f∗(cid:107)2F andγk. canuseinSALSA,itappliesforimportantkernelchoices. Forexample,ifeachk(j) isaGaussiankernel,thenitisan Theorem 3. Let Assumptions 1 and 2 hold. and Y have ESPkernelifwesetthebandwidthsappropriately. boundedconditionalvariance: E[(Y −f∗(X))2|X]≤σ2. Thenthesolutionfˆof (3)satisfies, In what follows, we refer to a kernel such as k (5) which permitsonlydordersofinteractionasadthorderkernel. A E[R(fˆ)]−R(f )≤M (cid:18)20λ(cid:107)f (cid:107)2 + 12σ2γk(λ) +χ(k)(cid:19). kernel which permits interactions of all D variables is of ∗ d ∗ F n Dth order. Note that unlike in MKL, here we do not wish to learn the kernel. We use additive kernels to explicitly Here χ(k) are kernel dependent low order terms and are reducethecomplexityofthefunctionclassoverwhichwe given in (11) in Appendix A. Our proof technique gener- optimiseforfˆ. Next,wepresentourtheoreticalresults. alises the analysis of Zhang et al. (2013) for KRR to the additivecase. WeuseideasfromAronszajn(1950)tohan- 3.2.TheoreticalAnalysis dlesumRKHSs. WeconsideraspaceF containingthetu- pleoffunctionsf(j) ∈H andusefirstorderoptimality We first consider the setting when f(j) is in H over k(j) ∗ k(j) conditionsof(3)inF. TheproofisgiveninAppendixA. which we optimise for fˆ(j). Theorem 3 generally bounds the excess risk of fˆ (3) in terms of RKHS parameters. The term γk(λ), which typically has exponential de- Then,wespecialiseittospecificRKHSsinTheorem4and pendence on d, arises through the variance calculation. show that in many cases, the dependence on D reduces Therefore, by using small d we may reduce the variance fromexponentialtopolynomialforadditivef . Webegin of our estimate. However, this will also mean that we are ∗ withsomeassumptions. onlyconsideringasmallerfunctionclassandhencesuffer large bias if f is not additive. In naive KRR, using a ∗ Assumption 1. f∗ has a decomposition f∗(x) = Dth order kernel (equivalent to setting Md = MD = 1) (cid:80)Md g(j)(x(j))whereeachg(j) ∈H . the excess risk depends exponentially in D. In contrast, j=1 k(j) for an additive dth order kernel, γk(λ) has polynomial We point out that the decomposition {g(j)} need not be dependenceonD if f isadditive. Wemakethisconcrete ∗ unique. To enforce definiteness (by abusing notation) we viathefollowingtheorem. definef(j) ∈ H , j = 1,...,M tobethesetoffunc- ∗ k(j) d tions which minimise (cid:80) (cid:107)g(j)(cid:107)2 . Denote the mini- Theorem 4. Assume the same conditions as Theorem 3. mumvalueby(cid:107)f (cid:107)2. WejdenoteHitk(bj)yanormforreasons Then,suppressinglog(n)terms, ∗ F madeclearinourproofs. • if each k(j) has eigendecay µ(j) ∈ O((cid:96)−2s/d), then by (cid:96) Let k(j) have an eigenexpansion k(j)(x(j),x(j)(cid:48)) = choosing λ (cid:16) n2−s+2sd, we have E[R(fˆ)] − R(f∗) ∈ (cid:80)∞(cid:96)=1µ((cid:96)j)φ((cid:96)j)(x(j))φ((cid:96)j)(x(j)(cid:48)) in L2(PX(j)). Here, O(D2dn2−s+2sd), ShrunkAdditiveLeastSquaresApproximation • if each k(j) has eigendecay µ(j) ∈ O(π˜dexp(−α(cid:96)2)) Theorem 5. Let f be an arbitrary measurable function (cid:96) ∗ forsomeconstantsπ˜,α,thenbychoosingλ (cid:16) 1/n,we andY haveboundedfourthmomentE[Y4] ≤ ν4. Further haveE[R(fˆ)]−R(f )∈O(D2dπ˜d). eachk(j)satisfiesAssumption2. Then∀η >0, ∗ n We bound γ via bounds for γ(j) and use it to derive the E[R(fˆ)]−R(f∗)≤(1+η)AE + (1+1/η)EE, k optimalratesfortheproblem. TheproofisinAppendixB. (cid:16)M γ (λ)(cid:17) where,AE= inf (cid:107)f −f (cid:107)2, EE∈O d k . It is instructive to compare the rates for the cases above f∈Hd,λ ∗ 2 n whenweuseaDthorderkernelκinKRRtoestimateanon- additivefunction. Thefirsteigendecayisobtainedifeach Theproof,giveninAppendixC,alsofollowsthetemplate k(j) isaMate´rnkernel. Thenf(j) belongstotheSobolev in Zhang et al. (2013). Loosely, we may interpret AE class of smoothness s (Berlinet & Thomas-Agnan, 2004; and EE as the approximation and estimation errors1. We Tsybakov,2008). Byfollowingasimilaranalysis,wecan mayuseTheorem5tounderstandthetrade-offsinapprox- showthatifκisinaSobolevclass,thentheexcessriskof imainganon-additivefunctionviaanadditivemodel. We KRR isO(n2−s+2sD)whichissignificantlyslowerthanours. provideanintuitive“not-very-rigorous”explanation. Hd,λ In our setting, the rates are only exponential in d but we is typically increasing with d since higher order additive haveanadditionalD2dtermasweneedtoestimateseveral functionscontainlowerorderfunctions. Hence,AEisde- suchfunctions.Anexampleofthesecondeigendecayisthe creasing with d as the infimum is taken over a larger set. √ Gaussiankernelwithπ˜ = 2π (Williamsonetal.,2001). On the other hand, EE is increasing with d. With more In the nonadditve case, the excess risk is in the Gaussian data EE decreases due to the 1/n term. Hence, we can RKHSisO(cid:0)(2π)D/2(cid:1)whichisslowerthanSALSA whose affordtouselargerdtoreduceAEandbalancewithEE. n Thisresultsinanoverallreductionintheexcessrisk. dependence on D is just polynomial. D,d do not appear intheexponentofnbecausetheGaussianRKHScontains TheactualanalysiswouldbemorecomplicatedsinceH d,λ very smooth functions. KRR is slower since we are opti- is a bounded class depending intricately on λ. It also de- mising over the very large class of non-additive functions pends on the kernels k(j), which differ with d. To make and consequently it is a difficult statistical problem. The the above intuition concrete and more interpretable, it is faster rates for SALSA should not be surprising since the necessary to have a good handle on AE. However, if class of additive functions is smaller. The advantage of wearetoovercometheexponentialdependenceindimen- SALSA isitsabilitytorecoverthefunctionatafasterrate sion, usualnonparametricassumptionssuchasHo¨lderian/ when f∗ is additive. Finally we note that by taking each Sobolev conditions alone will not suffice. Current lower basekernelki intheESPkerneltobea1DGaussian,each bounds suggest that the exponential dependence is un- k(j) isaGaussian. However,atthispointitisnotclearto avoidable(Gyo¨rfietal.,2002;Tsybakov,2008).Additional usifitispossibletorecoveras-smoothSobolevclassvia assumptions will be necessary to demonstrate faster con- thetensorproductofs-smoothonedimensionalkernels. vergence. Once we control AE, the optimal rates can be obtained by optimising the bound over η,λ. We wish to Finally, we analyse SALSA under more agnostic assump- pursuethisinfuturework. tions. We will neither assume that f is additive nor ∗ that it lies in any RKHS. First, define the functions f(j), λ 3.3.PracticalConsiderations j =1,...,M whichminimisethepopulationobjective. Choice of Kernels: The development of our algorithm {f(j)}Md = argmin λ (cid:88)Md (cid:107)f(j)(cid:107)2 + and our analysis assume that the ki’s are known. This is λ j=1 f(j)∈Hk(j),j=1,...,M j=1 Hk(j) heralrydlfyorthgeocoadseeminprieriaclaitlypaenrdfotrhmeyanhcaev.eCtorobsescvhaolisdeantipornopis- (cid:34)(cid:16) (cid:88)Md (cid:17)2(cid:35) not feasible here as there are too many hyper-parameters. E Y − f(j)(X(j)) . (7) Inourexperimentsweseteachk tobeaGaussiankernel i j=1 k (x ,x(cid:48)) = σ exp(−(x − x(cid:48))2/2h2) with bandwidth i i i Y i i i Let f = (cid:80) f(j), R(j) = (cid:107)f(j)(cid:107) and R2 = hi = cσin−1/5. Hereσi isthestandarddeviationoftheith λ j λ λ λ Hk(j) d,λ covariateandσY isthestandarddeviationofY.Thechoice (cid:80) R(j)2. Toboundtheexcessriskintheagnosticsetting ofbandwidthwasinspiredbyseveralotherkernelmethods wejalsλodefinetheclass, which use bandwidths on the order σin−1/5 (Ravikumar et al., 2009; Tsybakov, 2008). The constant c was hand H =(cid:8)f :X →R; f(x)=(cid:88)f(j)(x(j)), (8) tuned – we found that performance was robust to choices d,λ between 5 and 60. In our experiments we use c = 20. c j ∀j, f(j) ∈H ,(cid:107)f(j)(cid:107) ≤R(j)(cid:9). k(j) Hk(j) λ 1Loosely(andnotstrictly)sincefˆneednotbeinHd,λ. ShrunkAdditiveLeastSquaresApproximation was chosen by experimenting on a collection of synthetic 2000),RBFInterpolation(RBFI),M5’ModelTrees(M5’) datasets and then used in all our experiments. Both syn- (Wang & Witten, 1997) and Shepard Interpolation (SI). theticandrealdatasetsusedinexperimentsareindependent Nonparametric additive models: Back-fitting with cu- ofthedatausedtotunec. bicsplines(BF)(Hastie&Tibshirani,1990), Multivariate Adaptive Regression Splines (MARS) (Friedman, 1991), Choice of d,λ: If the additive order of f is known and ∗ Component Selection and Smoothing (COSSO) (Lin & we have sufficient data then we can use that for d in (5). Zhang, 2006), Sparse Additive Models (SpAM) (Raviku- However, this is usually not the case in practice. Further, mar et al., 2009) and Additive Gaussian Processes (Add- even in non-additive settings, we may wish to use an ad- GP)(Duvenaudetal.,2011). Parametricmodels: Ridge ditive model to improve the variance of our estimate. In Regression (RR), Least Absolute Shrinkage and Selection these instances, our approach to choose d uses cross vali- (LASSO) (Tibshirani, 1994) and Least Angle Regression dation. Foragivendwesolve(1)fordifferentλandpick (LAR)(Efronetal.,2004). Weusedsoftwarefrom(Chang thebestoneviacrossvalidation. Tochoosetheoptimald & Lin, 2011; Hara & Chellappa, 2013; Jakabsons, 2015; wecrossvalidateond.Inourexperimentsweobservedthat Lin&Zhang,2006;Rasmussen&Williams,2006)orfrom the cross validation error had bi-monotone like behaviour Matlab. Insomecasesweusedourownimplementation. withauniquelocaloptimumond. Sincetheoptimaldwas typicallysmallwesearchbystartingatd=1andkeepin- 4.1.SyntheticExperiments creasinguntiltheerrorbeginstoincreaseagain. Ifdcould We begin with a series of synthetic examples. We com- belargeandlinearsearchbecomestooexpensive,abinary pareSALSA tosomenon-additivemethodstoconveyintu- searchlikeprocedureon{1,...,D}canbeused. itionaboutouradditivemodel. Firstwecreateasynthetic We conclude this section with a couple of remarks. First, loworderfunctionoforderd = 3inD = 15dimensions. we could have considered an alternative additive model Wedosobycreatingaddimensionalfunctionf andadd d which sums all interactions up to dth order instead of just thatfunctionoverall(cid:0)D(cid:1)combinationsofcoordinates. We d the dth order. The excess risk of this model differs from compareSALSAusingorder3andcompareagainstothers. Theorems3,4and5onlyinsubdominanttermsand/orcon- TheresultsaregiveninFigure1(a). Thissettingistailored stantfactors. Thekernelcanbecomputedefficientlyusing to the assumptions of our method and, not surprisingly, it thesametrickbysummingallpolynomialsuptod. Inour outperformsallalternatives. experimentswefoundthatbothouroriginalmodel(2)and Next we demonstrate the bias variance trade-offs in us- summingoverallinteractionsperformedequallywell. For ingadditiveapproximationsonnon-additivefunctions. We simplicity,resultsarepresentedonlyfortheformer. created a 15 dimensional (non-additive) function and fit- Secondly,asisthecasewithmostkernelmethods,SALSA ted a SALSA model with d = 1,2,4,8,15 for difference requiresO(n2)spacetostorethekernelmatrixandO(n3) choices of n. The results are given in Figure 1(b). The efforttoinvertit. Somerecentadvancesinscalablekernel interestingobservationhereisthatforsmallsamplessizes methodssuchasrandomfeatures,divideandconquertech- smalldperformsbest. However,asweincreasethesample niques,stochasticgradientsetc.(Daietal.,2014;Leetal., sizewecanalsoincreasethecapacityofthemodelbyac- 2013; Rahimi & Recht, 2007; 2009; Zhang et al., 2013) commodating higher orders of interaction. In this regime, can be explored to scale SALSA with n. However, this is large d produces the best results. This illustrates our pre- beyond the scope of this paper and is left to future work. vious point that the order of the additive model gives us Forthisreason,wealsolimitourexperimentstomoderate another way to control the bias and variance in a regres- dataset sizes. The goal of this paper is primarily to intro- siontask. Wepositthatwhennisextremelylarge,d=15 duceadditivemodelsofhigherorder,addressthecombina- willeventuallybeatallothermodels. Finally,weconstruct torialcostinsuchmodelsandtheoreticallydemonstratethe syntheticfunctionsinD = 20to50dimensionsandcom- improvementsintheexcessrisk. pareagainstothermethodsinFigures1(c)to1(f).Here,we chosedviacrossvalidation. Ourmethodoutperformsoris 4.Experiments competitivewithothermethods. We compare SALSA to the following. Nonparamet- 4.2.RealDatasets ric models: Kernel Ridge Regression (KRR), k-Nearest Finally we compare SALSA against the other methods Neighbors (kNN), Nadaraya Watson (NW), Locally Lin- listedaboveon16datasets. Thedatasetsweretakenfrom ear/Quadraticinterpolation(LL,LQ),(cid:15)-SupportVectorRe- the UCI repository, Bristol Multilevel Modeling and the gression((cid:15)−SVR),ν-SupportVectorRegression(ν−SVR), followingsources: (Guillame-Bertetal.,2014;Justetal., GaussianProcessRegression(GP),RegressionTrees(RT), 2010;Paschou,2007;Tegmarketal,2006;Tu,2012;We- Gradient Boosted Regression Trees (GBRT) (Friedman, hbe et al., 2014). Table 1 gives the average squared error ShrunkAdditiveLeastSquaresApproximation (a) (b) (c) (d) (e) (f) Figure1.(a)ComparisonofSALSA whichknowstheadditiveorderoff againstothermethods. (b)Comparisonofdifferentchoices ∗ ofdinSALSA.Thebestdvarieswithn. (c)-(f)ComparisonofSALSA (dchosenviacrossvalidation)withalternativesonsynthetic datasets. Inallcases,weplotthemeansquaredpredictionerroronatestsetof2000points. Allcurvesareproducedbyaveragingover 10trials.Theerrorbarsarenotvisibleinsomecurvesastheyareverysmall. on a test set. For the Speech dataset we have indicated polynomially on D when f is additive, significantly bet- ∗ the training time (including cross validation for selecting terthantheusualexponentialdependenceofnonparametric hyper-parameters) for each method. For SALSA we have methods, albeit under stronger assumptions. Our analysis alsoindicatedtheorderdchosenbycrossvalidation. See of the agnostic setting provides intuitions on the tradeoffs thecaptionunderthetableformoredetails. invovledwithchangingd. Wedemonstratetheefficacyof SALSA via a comprehensive empirical evaluation. Going SALSA performsbest(orisveryclosetothebest)in5of forward, we wish to use techniques from scalable kernel the datasets. Moreover it falls within the top 5 in all but methodstohandlelargedatasets. twodatasets,comingsixthinbothinstances. Observethat inmanycasesdchosenbySALSAismuchsmallerthanD, Theorems3,4showpolynomialdependenceonDwhenf ∗ butimportantlyalsolargerthan1. Thisobservation(along isadditive. However,thesetheoremsareunsatisfyingsince withFig1(b))corroboratesakeythemeofthispaper:while in practice regression functions need not be additive. We itistruethatadditivemodelsimprovethevarianceinhigh believeourmethoddidwellevenonnon-additivesettings dimensional regression, it is often insufficient to confine sincewecouldcontrolmodelcapacityviad. Inthislight, ourselvestojustfirstordermodels. weposethefollowingopenproblem: identifysuitableas- sumptions to beat existing lower bounds and prove faster InAppendixDwehavegiventhespecificsonthedatasets convergenceofadditivemodelswhoseadditiveorderdin- suchaspreprocessing,thepredictors,featuresetc.Wehave creaseswithsamplesizen.OurTheorem5mightbeuseful alsodiscussedsomedetailsonthealternativesused. inthisendeavour. 5.Conclusion Acknowledgements SALSA finds additive approximations to the regression We thank Calvin McCarter, Ryan Tibshirani and Larry functioninhighdimensions. Ithaslessbiasthanfirstorder Wassermanfortheinsightfuldiscussionsandfeedbackon models and less variance than non-additive methods. Al- thepaper. WealsothankMadalinaFiterauforprovidingus gorithmically, itrequiresplugginginanadditivekernelto withdatasets. ThisworkwaspartlyfundedbyDOEgrant KRR.Incomputingthekernel, weusetheGirard-Newton DESC0011114. formulaetoefficientlysumoveracombinatorialnumberof terms.Ourtheoremsshowthattheexcessriskdependsonly ShrunkAdditiveLeastSquaresApproximation Dataset(D,n) SALSA (d) KRR kNN NW LL LQ (cid:15)−SVR ν−SVR GP RT GBRT Housing(12,256) 0.26241(9) 0.37690 0.43620 0.38431 0.31219 0.35061 1.15272 0.38600 0.67563 1.06015 0.42951 Galaxy(20,2000) 0.00014(4) 0.01317 0.25854 0.30615 0.01676 0.00175 0.65280 0.15798 0.00221 0.02293 0.01405 fMRI(100,700) 0.80730(2) 0.86495 0.85645 0.84989 0.91098 1.14079 0.81080 0.81376 0.84766 1.52834 0.87326 Insulin(50,256) 1.02062(3) 1.09023 1.15578 1.18070 1.06457 1.35747 1.10725 1.09140 1.22404 1.58009 1.06624 Skillcraft(18,1700) 0.54695(1) 0.54803 0.67155 0.73258 0.60581 1.29690 0.71261 0.66311 0.54816 1.08047 0.57273 School(36,90) 1.32008(2) 1.64315 1.37910 1.14866 2.13390 4.79447 1.38173 1.48746 1.64215 1.99244 2.09863 CCPP*(59,2000) 0.06782(2) 0.08038 0.32017 0.33863 0.07568 0.06779 0.33707 0.094493 0.11128 1.04527 0.06181 Bleeding(100,200) 0.00123(5) 0.10633 0.16727 0.19284 2.48e−6 0.00614 0.36505 0.03168 6.68e−6 0.18292 0.19076 Speech(21,520) 0.02246(2) 0.03036 0.09348 0.11207 0.03373 0.02404 0.22431 0.06994 0.02531 0.05430 0.03515 Trainingtime 4.71s 0.8s 0.18s 1.81s 4.53s 6.80s 0.24s 27.43s 6.34s 0.21s 5.30s Music(90,1000) 0.62512(3) 0.61244 0.71141 0.75225 0.67271 1.31957 0.75420 0.59399 0.62429 1.45983 0.66652 Telemonit(19,1000) 0.03473(9) 0.05640 0.09262 0.21198 0.08253 0.18399 0.33902 0.05246 0.03948 0.01375 0.04371 Propulsion(15,200) 0.00881(8) 0.05010 0.14614 0.11237 1.12712 1.12801 0.74511 0.00910 0.00355 0.02341 0.00061 Airfoil*(40,750) 0.51756(5) 0.53111 0.85879 0.86752 0.51877 0.51105 0.64853 0.55118 0.54494 0.45249 0.34461 Forestfires(10,211) 0.35301(3) 0.28771 0.36571 0.37199 0.35462 12.5727 0.70154 0.43142 0.29038 0.41531 0.26162 Brain(29,300) 0.00036(2) 0.01239 0.24429 0.22929 1.23412 2.45781 0.27204 0.05556 5e−14 0.00796 0.00693 RBFI M5’ SI BF MARS COSSO SpAM Add-GP RR LASSO LAR Housing(12,256) 0.64871 0.38256 0.50445 0.64218 0.42379 1.30965 0.81653 0.45656 95.60708 0.44515 0.84410 Galaxy(20,2000) 0.01532 0.00116 0.92189 0.94165 0.00163 0.00153 0.95415 − 0.13902 0.02392 1.02315 fMRI(100,700) 1.38585 1.42795 0.90595 0.86197 0.90850 0.82448 0.88014 − 0.81005 0.81390 0.88351 Insulin(50,256) 1.22404 1.78252 1.20771 1.16524 1.10359 1.13791 1.20345 − 1.02051 1.11034 1.22404 Skillcraft(18,1700) 0.81966 1.07195 0.87677 0.83733 0.54595 0.55514 0.905445 − 0.70910 0.66496 1.00048 School(36,90) 1.61927 1.72657 1.52374 1.48866 1.44453 1.48046 1.59328 − 1.53416 1.34467 1.64330 CCPP*(59,2000) 1.04257 0.10513 0.97223 0.88084 0.08189 0.96844 0.06469 − 0.07641 0.07395 1.04527 Bleeding(100,200) 0.13872 0.00210 0.24918 0.37840 0.00497 0.31362 0.41735 − 0.00001 0.00191 0.43488 Speech(21,520) 0.03339 0.02843 0.39883 0.36793 0.01647 0.34863 0.66009 0.02310 0.07392 0.07303 0.73916 Trainingtime 0.12s 5.70s 0.66s 27.93s 8.11s 9.40s 76.39s 79mins 0.03s 15.94s 0.06s Music(90,1000) 0.78482 1.28709 0.77347 0.75646 0.88779 0.79816 0.76830 − 0.67777 0.63486 0.78533 Telemonit(19,1000) 0.02872 0.01491 0.65386 0.84412 0.02400 5.71918 0.86425 − 0.08053 0.08629 0.94943 Propulsion(15,200) 0.05832 0.02341 0.27768 0.56418 0.0129 0.00094 1.11210 0.01435 0.01490 0.02481 10.2341 Airfoil*(40,750) 1.00909 0.46714 0.99413 0.96668 0.54552 0.51782 0.96231 − 0.53187 0.51986 1.00910 Forestfires(10,211) 0.45773 0.47749 0.78057 0.88979 0.33891 0.37534 0.96944 0.17024 0.47892 0.51934 1.05415 Brain(29,300) 0.97815 2e−37 0.81711 0.63700 5e−31 2e−6 0.89533 − 4e−13 0.00089 1.04216 Table1. Theaveragesquarederroronthetestsetforallmethodson16datasets.Thedimensionalityandsamplesizenareindicatednexttothedataset.Thebestmethod(s)foreach datasetareinbold.Thesecondtofifthmethodsareunderlined.ForSALSAwehavealsoindicatedtheorderdchosenbyourcrossvalidationprocedureinparantheses.Thedatasets witha*areactuallylowerdimensionaldatasetsfromtheUCIrepository. Butweartificiallyincreasethedimensionalitybyinsertingrandomvaluesfortheremainingcoordinates. Eventhough,thisdoesn’tchangethefunctionvalueitmakestheregressionproblemharder. ShrunkAdditiveLeastSquaresApproximation References Jakabsons, Gints. Open source regression software for Matlab/Octave,2015. Aronszajn, N. Theory of Reproducing Kernels. Trans. Amer.Math.Soc.,1950. Just, Marcel Adam, Cherkassky, Vladimir L., Aryal, S, Mitchell, Tom M., Just, Marcel Adam, Cherkassky, Bach,FrancisR.ConsistencyoftheGroupLassoandMul- VladimirL., Aryal, Esh, andMitchell, TomM. Aneu- tipleKernelLearning. JMLR,2008. rosemantictheoryofconcretenounrepresentationbased ontheunderlyingbraincodes. PLoSONE,2010. Berlinet, Alain and Thomas-Agnan, Christine. Reproduc- ing kernel Hilbert spaces in Probability and Statistics. Lafferty,JohnD.andWasserman,LarryA. Rodeo: Sparse KluwerAcademic,2004. Nonparametric Regression in High Dimensions. In NIPS,2005. Chang, Chih-Chung and Lin, Chih-Jen. LIBSVM: A li- brary for support vector machines. ACM Transactions Le, Quoc, Sarlos, Tamas, and Smola, Alex. Fastfood - onIntelligentSystemsandTechnology,2(3),2011. Approximating Kernel Expansions in Loglinear Time. In 30th International Conference on Machine Learning Dai, Bo, Xie, Bo, He, Niao, Liang, Yingyu, Raj, Anant, (ICML),2013. Balcan, Maria-Florina F, and Song, Le. Scalable Ker- nelMethodsviaDoublyStochasticGradients. InNIPS, Lin, Yi and Zhang, Hao Helen. Component selection and 2014. smoothinginsmoothingsplineanalysisofvariancemod- els. AnnalsofStatistics,2006. Duvenaud, David K., Nickisch, Hannes, and Rasmussen, Carl Edward. Additive Gaussian Processes. In NIPS, Lou, Yin, Caruana, Rich, Gehrke, Johannes, and Hooker, 2011. Giles. AccurateIntelligibleModelswithPairwiseInter- actions. InKDD,2013. Efron,Bradley,Hastie,Trevor,Johnstone,Iain,andTibshi- rani,Robert. LeastAngleRegression. AnnalsofStatis- Macdonald,IanGrant. SymmetricfunctionsandHallpoly- tics,32(2):407–499,2004. nomials. ClarendonPress,1995. Friedman, Jerome H. Multivariate Adaptive Regression Paschou,P. PCA-correlatedSNPsforStructureIdentifica- Splines. Ann.Statist.,19(1):1–67,1991. tion. PLoSGenetics,2007. Friedman,JeromeH. GreedyFunctionApproximation: A Plate, Tony A. Accuracy versus Interpretability in flex- GradientBoostingMachine. AnnalsofStatistics,2000. ible modeling:implementing a tradeoff using Gaussian processmodels. Behaviourmetrika,InterpretingNeural Go¨nen, Mehmet and Alpaydin, Ethem. Multiple Kernel NetworkModels”,1999. LearningAlgorithms. JournalofMachineLearningRe- search,12:2211–2268,2011. Rahimi, Ali and Recht, Benjamin. Random Features for Large-ScaleKernelMachines. InNIPS,2007. Guillame-Bert,M,Dubrawski,A,Chen,L,Holder,A,Pin- sky,MR,andClermont,G. UtilityofEmpiricalModels Rahimi,AliandRecht,Benjamin. WeightedSumsofRan- of Hemorrhage in Detecting and Quantifying Bleeding. dom Kitchen Sinks: Replacing minimization with ran- InIntensiveCareMedicine,2014. domizationinlearning. InNIPS,2009. Gyo¨rfi,La´szlo´,Kohler,Micael,Krzyzak,Adam,andWalk, Rasmussen,C.E.andWilliams,C.K.I. GaussianProcesses Harro.ADistributionFreeTheoryofNonparametricRe- for Machine Learning. Cambridge University Press, gression. SpringerSeriesinStatistics,2002. 2006. Hara,KentaroandChellappa,Rama.Computationallyeffi- Ravikumar,Pradeep,Lafferty,John,Liu,Han,andWasser- cientregressiononadependencygraphforhumanpose man, Larry. Sparse Additive Models. Journal of the estimation. InComputerVisionandPatternRecognition RoyalStatisticalSociety:SeriesB,71:1009–1030,2009. (CVPR),2013IEEEConferenceon,2013. Scho¨lkopf, Bernhard and Smola, Alexander J. Learning Hastie, T. J. and Tibshirani, R. J. Generalized Additive withKernels: SupportVectorMachines,Regularization, Models. London: Chapman&Hall,1990. Optimization,andBeyond. MITPress,2001. Hastie, T. J., Tibshirani, R. J., and Friedman, J. The Ele- Shawe-Taylor,JohnandCristianini,Nello.KernelMethods mentsofStatisticalLearning. Springer.,2001. forPatternAnalysis. Cambridgeuniversitypress,2004. ShrunkAdditiveLeastSquaresApproximation Steinwart,IngoandChristmann,Andreas. SupportVector Machines. Springer,2008. Steinwart, Ingo, Hush, Don R., and Scovel, Clint. Opti- malRatesforRegularizedLeastSquaresRegression. In COLT,2009. Tegmark et al, M. Cosmological Constraints from the SDSSLuminousRedGalaxies.PhysicalReview,74(12), 2006. Tibshirani, Robert. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society, SeriesB,1994. Tsybakov, Alexandre B. Introduction to Nonparametric Estimation. Springer,2008. Tu, Zhidong. IntegrativeAnalysisofacross-locciregula- tionNetworkidentifiesAppasaGeneregulatingInsulin SecretionfromPancreaticIslets. PLoSGenetics,2012. Wang,YongandWitten,IanH. InducingModelTreesfor ContinuousClasses. InECML,1997. Wehbe, L., Murphy, B., Talukdar, P., Fyshe, A., Ramdas, A.,andMitchell,T. Simultaneouslyuncoveringthepat- ternsofbrainregionsinvolvedindifferentstoryreading. PLoSONE,2014. Williamson, Robert C., Smola, Alex J., and Scho¨lkopf, Bernhard. Generalization Performance of Regulariza- tionNetworksandSupportVectorMachinesviaEntropy NumbersofCompactOperators. IEEETransactionson InformationTheory,47(6):2516–2532,2001. Xu, Zenglin, Jin, Rong, Yang, Haiqin, King, Irwin, and Lyu, Michael R. Simple and Efficient Multiple Kernel LearningbyGroupLasso. InICML,2010. Zhang,Tong. LearningBoundsforKernelRegressionUs- ingEffectiveDataDimensionality.NeuralComputation, 17(9):2077–2098,2005. Zhang,Yuchen,Duchi,JohnC.,andWainwright,MartinJ. DivideandConquerKernelRidgeRegression.InCOLT, 2013.