EigenGP: Gaussian Process Models with Adaptive Eigenfunctions

Hao Peng
Department of Computer Science
Purdue University
West Lafayette, IN 47906
[email protected]

Yuan Qi
Departments of CS and Statistics
Purdue University
West Lafayette, IN 47906
[email protected]

Abstract

Gaussian processes (GPs) provide a nonparametric representation of functions. However, classical GP inference suffers from high computational cost for big data. In this paper, we propose a new Bayesian approach, EigenGP, that learns both basis dictionary elements—eigenfunctions of a GP prior—and prior precisions in a sparse finite model. It is well known that, among all orthogonal basis functions, eigenfunctions can provide the most compact representation. Unlike other sparse Bayesian finite models where the basis function has a fixed form, our eigenfunctions live in a reproducing kernel Hilbert space as a finite linear combination of kernel functions. We learn the dictionary elements—eigenfunctions—and the prior precisions over these elements, as well as all the other hyperparameters, from data by maximizing the model marginal likelihood. We exploit computational linear algebra to simplify the gradient computation significantly. Our experimental results demonstrate improved predictive performance of EigenGP over alternative sparse GP methods as well as relevance vector machines.

1 Introduction

Gaussian processes (GPs) are powerful nonparametric Bayesian models with numerous applications in machine learning and statistics. GP inference, however, is costly: training the exact GP regression model with N samples takes O(N^2) space and O(N^3) time. To address this issue, a variety of approximate sparse GP inference approaches have been developed [Williams and Seeger, 2001; Csató and Opper, 2002; Snelson and Ghahramani, 2006; Lázaro-Gredilla et al., 2010; Williams and Barber, 1998; Titsias, 2009; Qi et al., 2010; Higdon, 2002; Cressie and Johannesson, 2008]—for example, using the Nyström method to approximate covariance matrices [Williams and Seeger, 2001], optimizing a variational bound on the marginal likelihood [Titsias, 2009], or grounding the GP on a small set of (blurred) basis points [Snelson and Ghahramani, 2006; Qi et al., 2010]. An elegant unifying view of various sparse GP regression models is given by Quiñonero-Candela and Rasmussen [2005].

Among all sparse GP regression methods, a state-of-the-art approach is to represent a function as a sparse finite linear combination of pairs of trigonometric basis functions, a sine and a cosine for each spectral point; this approach is therefore called the sparse spectrum Gaussian process (SSGP) [Lázaro-Gredilla et al., 2010]. SSGP integrates out both the weights and the phases of the trigonometric functions and learns all hyperparameters of the model (frequencies and amplitudes) by maximizing the marginal likelihood. Using global trigonometric basis functions, SSGP can approximate any stationary Gaussian process model and has been shown to outperform alternative sparse GP methods—including the fully independent training conditional (FITC) approximation [Snelson and Ghahramani, 2006]—on benchmark datasets. Another popular sparse Bayesian finite linear model is the relevance vector machine (RVM) [Tipping, 2000; Faul and Tipping, 2001]. It uses kernel expansions over training samples as basis functions and selects the basis functions by automatic relevance determination [MacKay, 1992; Faul and Tipping, 2001].

In this paper, we propose a new sparse Bayesian approach, EigenGP, that learns both functional dictionary elements—eigenfunctions—and prior precisions in a finite linear model representation of a GP. It is well known that, among all orthogonal basis functions, eigenfunctions provide the most compact representation. Unlike SSGP or RVMs, where the basis functions have a fixed form, our eigenfunctions live in a reproducing kernel Hilbert space (RKHS) as a finite linear combination of kernel functions with their weights learned from data.
We further marginalize out the weights over the eigenfunctions and estimate all hyperparameters—including the basis points for the eigenfunctions, the lengthscales, and the precision of the weight prior—by maximizing the model marginal likelihood (also known as the evidence). To do so, we exploit computational linear algebra to greatly simplify the gradient computation (our optimization method is therefore entirely different from RVM optimization methods). As a result of this optimization, our eigenfunctions are data dependent and make EigenGP capable of accurately modeling nonstationary data. Furthermore, by adding an extra kernel term to our model, we can turn the finite model into an infinite model that captures the prediction uncertainty better—that is, it can give nonzero prediction variance when a test sample is far from the training samples.

EigenGP is computationally efficient. It takes O(NM) space and O(NM^2) time for training with M basis functions, the same as SSGP and more efficient than RVMs (RVMs learn weights over N, not M, basis functions). Similar to FITC and SSGP, EigenGP focuses on predictive accuracy at low computational cost, rather than on faithfully converging towards the full GP as the number of basis functions grows. (For the latter goal, please see the approach of Yan and Qi [2010], which explicitly minimizes the KL divergence between the exact and approximate GP posterior processes.)

The rest of the paper is organized as follows. Section 2 describes the background of GPs. Section 3 presents the EigenGP model and an illustrative example. Section 4 outlines the marginal likelihood maximization for learning the dictionary elements and the other hyperparameters. In Section 5, we discuss related work. Section 6 reports regression results on multiple benchmark datasets, demonstrating improved performance of EigenGP over the Nyström method [Williams and Seeger, 2001], RVMs, FITC, and SSGP.

2 Background of Gaussian Processes

We denote N independent and identically distributed samples as D = {(x_1, y_1), ..., (x_N, y_N)}, where x_i is a D-dimensional input (i.e., the explanatory variables) and y_i is a scalar output (i.e., a response), which we assume is a noisy realization of a latent function f at x_i.

A Gaussian process places a prior distribution over the latent function f. Its projection f_X at {x_i}_{i=1}^N defines a joint Gaussian distribution p(f_X) = N(f_X | m_0, K), where, without any prior preference, the mean m_0 is set to 0, and the covariance function k(x_i, x_j) ≡ K(x_i, x_j) encodes the prior notion of smoothness. A popular choice is the anisotropic squared exponential covariance function:

k(x, x') = a_0 exp(-(x - x')^T diag(η) (x - x')),

where the hyperparameters include the signal variance a_0 and the lengthscales η = {η_d}_{d=1}^D, which control how fast the covariance decays with the distance between inputs. Using this covariance function, we can prune input dimensions by shrinking the corresponding lengthscales based on the data (when η_d = 0, the d-th dimension becomes totally irrelevant to the covariance value). This pruning is known as automatic relevance determination (ARD), and this covariance is therefore also called the ARD squared exponential. Note that the covariance value stays the same whenever the difference (x' - x) is the same, regardless of where x' and x lie; this leads to a stationary GP model. For nonstationary data, however, a stationary GP model is a misfit. Although nonstationary GP models have been developed and applied to real-world problems, they are often limited to low-dimensional settings, such as applications in spatial statistics [Paciorek and Schervish, 2004]. Constructing general nonstationary GP models remains a challenging task.
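For concreteness, the ARD squared exponential covariance above translates directly into a few lines of NumPy. This is a minimal sketch under our own naming conventions (the released EigenGP code is in MATLAB, and this helper is not part of it):

```python
import numpy as np

def ard_se_kernel(X1, X2, a0, eta):
    """ARD squared exponential: k(x, x') = a0 * exp(-(x - x')^T diag(eta) (x - x')).

    X1: (N1, D) and X2: (N2, D) inputs; a0: signal variance;
    eta: (D,) per-dimension scales (larger eta_d means faster decay along dimension d).
    """
    X1s = X1 * np.sqrt(eta)   # scale each dimension by sqrt(eta_d) ...
    X2s = X2 * np.sqrt(eta)   # ... so squared Euclidean distance gives sum_d eta_d (x_d - x'_d)^2
    d2 = (np.sum(X1s ** 2, axis=1)[:, None]
          + np.sum(X2s ** 2, axis=1)[None, :]
          - 2.0 * X1s @ X2s.T)
    return a0 * np.exp(-np.maximum(d2, 0.0))   # clamp tiny negative round-off
```

Setting eta_d = 0 removes dimension d from the weighted distance, which is exactly the ARD pruning effect described above.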
For regression, we use a Gaussian likelihood function p(y_i | f) = N(y_i | f(x_i), σ^2), where σ^2 is the variance of the observation noise. Given the Gaussian process prior over f and the data likelihood, the exact posterior process is p(f | D, y) ∝ GP(f | 0, k) ∏_{i=1}^N p(y_i | f). Although the posterior process for GP regression has an analytical form, we need to store and invert an N × N matrix, which has computational complexity O(N^3), rendering GPs infeasible for big data analytics.

3 Model of EigenGP

To enable fast inference and obtain a nonstationary covariance function, our new model EigenGP projects the GP prior onto an eigensubspace. Specifically, we set the latent function to

f(x) = Σ_{j=1}^M α_j φ_j(x),    (1)

where M ≪ N and {φ_j(x)} are eigenfunctions of the GP prior. We assign a Gaussian prior over α = [α_1, ..., α_M], α ∼ N(0, diag(w)), so that f follows a GP prior with zero mean and the covariance function

k̃(x, x') = Σ_{j=1}^M w_j φ_j(x) φ_j(x').    (2)

To compute the eigenfunctions {φ_j(x)}, we could use the Galerkin projection to approximate them by Hermite polynomials [Marzouk and Najm, 2009]. For high-dimensional problems, however, this approach requires a tensor product of univariate Hermite polynomials that dramatically increases the number of parameters.

To avoid this problem, we apply the Nyström method [Williams and Seeger, 2001], which allows us to obtain an approximation to the eigenfunctions in a high-dimensional space efficiently. Specifically, given inducing inputs (i.e., basis points) B = [b_1, ..., b_M], we replace

∫ k(x, x') φ_j(x') p(x') dx' = λ_j φ_j(x)    (3)

by its Monte Carlo approximation

(1/M) Σ_{i=1}^M k(x, b_i) φ_j(b_i) ≈ λ_j φ_j(x).    (4)

Then, by evaluating this equation at B, we can estimate the values of φ_j(b_i) and λ_j, and we obtain the j-th eigenfunction φ_j(x) as

φ_j(x) = (√M / λ_j^(M)) k(x) u_j^(M) = k(x) u_j,    (5)

where k(x) ≜ [k(x, b_1), ..., k(x, b_M)], λ_j^(M) and u_j^(M) are the j-th eigenvalue and eigenvector of the covariance function evaluated at B, and u_j = √M u_j^(M) / λ_j^(M). Note that, for simplicity, we have chosen the number of inducing inputs to be the same as the number of eigenfunctions; in practice, we can use more inducing inputs while computing only the top M eigenvectors. As shown in (5), each eigenfunction lives in an RKHS as a linear combination of the kernel functions evaluated at B with weights u_j.
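As an illustration of the Nyström construction in (3)-(5), the following sketch (our own Python version with hypothetical names, not the reference implementation) eigendecomposes the kernel matrix on B and returns a function that evaluates the resulting eigenfunctions at arbitrary inputs:

```python
import numpy as np

def nystrom_eigenfunctions(B, kernel, M=None):
    """Return a function phi(X) that evaluates the top-M Nystrom eigenfunctions at X.

    B: (P, D) inducing inputs; kernel(A, C) -> (|A|, |C|) kernel matrix;
    M: number of eigenfunctions to keep (defaults to all P).
    Assumes the top-M eigenvalues of K(B, B) are well away from zero.
    """
    P = B.shape[0]
    M = P if M is None else M
    KBB = kernel(B, B)
    lam, U = np.linalg.eigh(KBB)               # ascending eigenvalues of K(B, B)
    lam, U = lam[::-1][:M], U[:, ::-1][:, :M]  # keep the top M
    # u_j = sqrt(P) * u_j^(P) / lambda_j^(P), as in Eq. (5)
    Uw = np.sqrt(P) * U / lam[None, :]
    def phi(X):
        # phi_j(x) = k(x, B) u_j  -> an (N, M) matrix of eigenfunction values
        return kernel(X, B) @ Uw
    return phi
```

Evaluating phi at the N training inputs yields the N × M matrix of basis function values used in the following sections.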
Inserting (5) into (1), we obtain

f(x) = Σ_{j=1}^M α_j Σ_{i=1}^M u_{ij} k(x, b_i).    (6)

This equation reveals a two-layer structure in EigenGP. The first layer linearly combines multiple kernel functions to generate each eigenfunction φ_j. The second layer takes these eigenfunctions as the basis functions to generate the function value f. Note that f is a Bayesian linear combination of {φ_j} in which the weights α are integrated out to avoid overfitting.

All the model hyperparameters are learned from data. Specifically, for the first layer, to learn the eigenfunctions {φ_j}, we estimate the inducing inputs B and the kernel hyperparameters (such as the lengthscales η for the ARD kernel) by maximizing the model marginal likelihood. For the second layer, we marginalize out α to avoid overfitting and maximize the model marginal likelihood to learn the hyperparameters w of the prior.

With the estimated hyperparameters, the prior over f is nonstationary because its covariance function in (2) varies across different regions of x. This comes as no surprise, since the eigenfunctions are tied to p(x) in (3). This nonstationarity reflects the fact that our model adapts to the distribution of the explanatory variables x.

Note that, to recover the full uncertainty captured by the kernel function k, we can add the following term to the kernel function of EigenGP:

δ(x - x')(k(x, x') - k̃(x, x')),    (7)

where δ(a) = 1 if and only if a = 0. Compared to the original EigenGP model, which has a finite number of degrees of freedom, this modified model has an infinite number of basis functions (assuming k has an infinite number of basis functions, as the ARD kernel does). Thus, the modified model can accurately quantify the uncertainty of a test point even when it is far from the training set. We derive the optimization updates of all the hyperparameters for both the original and the modified EigenGP models. According to our experiments, however, the modified model does not improve the prediction accuracy over the original EigenGP (it sometimes even reduces it).
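To make the model's covariance concrete, the sketch below assembles k̃ from (2) using the phi(·) mapping from the previous sketch, with the optional correction term (7) applied to pairs of identical inputs. Again, this is an illustrative NumPy sketch with our own function names, not the authors' implementation:

```python
import numpy as np

def eigengp_cov(X1, X2, w, phi, kernel=None, add_delta_term=False):
    """k_tilde(x, x') = sum_j w_j phi_j(x) phi_j(x'), cf. Eq. (2).

    Optionally adds delta(x - x') * (k(x, x') - k_tilde(x, x')) from Eq. (7),
    which only changes entries whose two inputs are identical (e.g. the
    diagonal of the covariance matrix on a single input set).
    """
    P1, P2 = phi(X1), phi(X2)              # eigenfunction values: (N1, M), (N2, M)
    K_tilde = (P1 * w[None, :]) @ P2.T     # Phi_1 diag(w) Phi_2^T
    if add_delta_term:
        same = np.all(X1[:, None, :] == X2[None, :, :], axis=2)  # delta(x - x')
        K_tilde = np.where(same, kernel(X1, X2), K_tilde)
    return K_tilde
```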
Therefore, we focus on the original EigenGP model in our presentation for its simplicity. Before we present the details of hyperparameter optimization, let us first look at an illustrative example of its effect, in particular the effect of optimizing the inducing inputs B.

3.1 Illustrative Example

For this example, we consider a toy dataset used by the FITC algorithm [Snelson and Ghahramani, 2006]. It contains 200 one-dimensional training samples. We use 5 basis points (M = 5) and choose the ARD kernel. We then compare the basis functions {φ_j} and the corresponding predictive distributions in two cases. In the first case, we use the kernel width η learned from the full GP model as the kernel width for EigenGP and apply K-means to set the basis points B as the cluster centers; the idea of using K-means to set the basis points has been suggested by Zhang et al. [2008] to minimize an error bound for the Nyström approximation. In the second case, we optimize η, B, and w by maximizing the marginal likelihood of EigenGP.

Figure 1: Illustration of the effect of optimizing the basis points (i.e., inducing inputs). Left column: basis points learned by K-means; right column: basis points estimated by evidence maximization. In the first row, the pink dots represent data points, the blue and red solid curves correspond to the predictive mean of EigenGP and ± two standard deviations around the mean, the black curve corresponds to the predictive mean of the full GP, and the black crosses denote the basis points. In the second row, the curves in various colors represent the five eigenfunctions of EigenGP.

The results are shown in Figure 1. The first row demonstrates that, by optimizing the hyperparameters, EigenGP achieves predictions very close to those of the full GP—while using only 5 basis functions. In contrast, when the basis points are set to the cluster centers found by K-means, EigenGP produces predictions significantly different from those of the full GP and fails to capture the data trend, in particular for x ∈ (3, 5). The second row of Figure 1 shows that K-means places the basis points almost evenly spaced, and accordingly the five eigenfunctions are smooth global basis functions whose shapes are not directly linked to the function they fit. Evidence maximization, by contrast, places the basis points unevenly to generate basis functions whose shapes are more localized and adaptive to the function they fit; for example, the eigenfunction represented by the red curve matches the data on the right well.

4 Learning Hyperparameters

In this section we describe how we optimize all the hyperparameters, denoted by θ, which include B in the covariance function (2) and all the kernel hyperparameters (e.g., a_0 and η). To optimize θ, we maximize the marginal likelihood (i.e., the evidence) with a conjugate Newton method (we use the code from http://www.gaussianprocess.org/gpml/code/matlab/doc). We explore two strategies for evidence maximization. The first is sequential optimization, which first fixes w while updating all the other hyperparameters, and then optimizes w while fixing the other hyperparameters. The second strategy is to optimize all the hyperparameters jointly. Here we skip the details of the more involved joint optimization and describe the key gradient formulas for sequential optimization. The key computations in our optimization are the log marginal likelihood and its gradient:

ln p(y | θ) = -1/2 ln|C_N| - 1/2 y^T C_N^{-1} y - N/2 ln(2π),    (8)

d ln p(y | θ) = -1/2 [ tr(C_N^{-1} dC_N) - tr(C_N^{-1} y y^T C_N^{-1} dC_N) ],    (9)

where C_N = K̃ + σ^2 I, K̃ = Φ diag(w) Φ^T, and Φ = {φ_m(x_n)} is an N × M matrix. Because the rank of K̃ is M ≪ N, we can compute ln|C_N| and C_N^{-1} efficiently at a cost of O(M^2 N) via the matrix inversion and determinant lemmas. Even with the matrix inversion lemma for the low-rank computation, a naive gradient calculation would still be very costly. We therefore apply identities from computational linear algebra [Minka, 2001; de Leeuw, 2007] to simplify the needed computation dramatically.
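A low-rank evaluation of (8) along these lines can be written with the Woodbury identity and the matrix determinant lemma so that the cost is O(NM^2). The sketch below is our own NumPy/SciPy illustration (the interface and names are assumptions, and w is assumed strictly positive):

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def eigengp_log_marginal(y, Phi, w, sigma2):
    """ln p(y | theta) for C_N = Phi diag(w) Phi^T + sigma2 * I, cf. Eq. (8).

    Phi: (N, M) matrix of eigenfunction values, w: (M,) positive weights,
    sigma2: noise variance.  Cost is O(N M^2) instead of O(N^3).
    """
    N, M = Phi.shape
    A = np.diag(1.0 / w) + (Phi.T @ Phi) / sigma2          # M x M inner matrix
    cA = cho_factor(A, lower=True)
    Pty = Phi.T @ y
    # y^T C_N^{-1} y via the Woodbury identity
    quad = (y @ y) / sigma2 - (Pty @ cho_solve(cA, Pty)) / sigma2 ** 2
    # ln|C_N| via the matrix determinant lemma
    logdet = (N * np.log(sigma2) + np.sum(np.log(w))
              + 2.0 * np.sum(np.log(np.diag(cA[0]))))
    return -0.5 * (logdet + quad + N * np.log(2.0 * np.pi))
```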
To compute the derivative with respect to B, we first notice that, when w is fixed, we have

C_N = K̃ + σ^2 I = K_XB K_BB^{-1} K_BX + σ^2 I,    (10)

where K_XB is the cross-covariance matrix between the training data X and the inducing inputs B, and K_BB is the covariance matrix on B.

For the ARD squared exponential kernel, utilizing the identities tr(P^T Q) = vec(P)^T vec(Q) and vec(P ∘ Q) = diag(vec(P)) vec(Q), where vec(·) vectorizes a matrix into a column vector and ∘ denotes the Hadamard product, we can derive the derivative of the first trace term in (9):

d tr(C_N^{-1} dC_N) / dB = 4 R X^T diag(η) - 4 (R 1 1^T) ∘ (B^T diag(η)) - 4 S B^T diag(η) + 4 (S 1 1^T) ∘ (B^T diag(η)),    (11)

where 1 is a column vector of all ones, and

R = (K_BB^{-1} K_BX C_N^{-1}) ∘ K_BX,    (12)
S = (K_BB^{-1} K_BX C_N^{-1} K_XB K_BB^{-1}) ∘ K_BB.    (13)

Note that we can compute K_BX C_N^{-1} efficiently via low-rank updates. Also, R 1 1^T (and S 1 1^T) can be implemented efficiently by first summing over the columns of R (or S) and then copying the resulting vector, without any multiplication operations. To obtain tr(C_N^{-1} y y^T C_N^{-1} dC_N), we simply replace C_N^{-1} in (12) and (13) by C_N^{-1} y y^T C_N^{-1}. Using similar derivations, we can obtain the derivatives with respect to the lengthscales η, a_0, and σ^2.

To compute the derivative with respect to w, we can use the formula tr(P diag(w) Q) = 1^T (Q^T ∘ P) w to obtain the two trace terms in (9) as follows:

d tr(C_N^{-1} dC_N) / dw = 1^T (Φ ∘ (C_N^{-1} Φ)),    (14)
d tr(C_N^{-1} y y^T C_N^{-1} dC_N) / dw = 1^T (Φ ∘ (C_N^{-1} y y^T C_N^{-1} Φ)).    (15)

For either sequential or joint optimization, the overall computational complexity is O(max(M^2, D) N), where D is the data dimension.

5 Related Work

Our work is closely related to the seminal work of Williams and Seeger [2001], but the two differ in multiple aspects. First, we define a valid probabilistic model based on an eigendecomposition of the GP prior. By contrast, the previous approach [Williams and Seeger, 2001] aims at a low-rank approximation to the finite covariance/kernel matrix used in GP training, from a numerical approximation perspective, and its predictive distribution is not well-formed in a probabilistic framework (e.g., it may give a negative predictive variance). Second, while the Nyström method simply uses the first few eigenvectors, we maximize the model marginal likelihood to adjust their weights in the covariance function. Third, exploiting the clustering property of the eigenfunctions of the Gaussian kernel, our approach can conduct semi-supervised learning, while the previous one cannot; the semi-supervised learning capability of EigenGP is investigated in a separate paper of ours. Fourth, the Nyström method lacks a principled way to learn model hyperparameters, including the kernel width and the basis points, while EigenGP does not.

Our work is also related to methods that use kernel principal component analysis (PCA) to speed up kernel machines [Hoegaerts et al., 2005]. However, for these methods it can be difficult—if not impossible—to learn important hyperparameters, including a kernel width for each dimension and inducing inputs that are not a subset of the training samples. By contrast, EigenGP learns all these hyperparameters from data based on gradients of the model marginal likelihood.

6 Experimental Results

In this section, we compare EigenGP (our implementation is available at https://github.com/hao-peng/EigenGP) and alternative methods on synthetic and real benchmark datasets. The alternative methods include the sparse GP methods—FITC, SSGP, and the Nyström method—as well as RVMs. We implemented the Nyström method ourselves and downloaded the software implementations of the other methods from their authors' websites. For RVMs, we used the fast fixed-point algorithm [Faul and Tipping, 2001]. We used the ARD kernel for all the methods except RVMs (since they do not estimate the lengthscales in this kernel) and optimized all the hyperparameters via evidence maximization. For RVMs, we chose the squared exponential kernel with the same lengthscale for all the dimensions and applied 10-fold cross-validation on the training data to select the lengthscale. On the large real data, we used the values of η, a_0, and σ^2 learned from the full GP on a subset that was 1/10 of the training data to initialize all the methods except RVMs. For the remaining configurations, we used the default settings of the downloaded software packages. For our own model, we denote the versions with sequential and joint optimization as EigenGP and EigenGP*, respectively. To evaluate the test performance of each method, we measure the normalized mean square error (NMSE) and the mean negative log probability (MNLP), defined as

NMSE = Σ_i (y_i - μ_i)^2 / Σ_i (y_i - ȳ)^2,    (16)
MNLP = 1/(2N) Σ_i [ (y_i - μ_i)^2 / σ_i^2 + ln σ_i^2 + ln 2π ],    (17)

where y_i, μ_i, and σ_i^2 are the response value, the predictive mean, and the predictive variance for the i-th test point, respectively, and ȳ is the average response value of the training data.
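Both metrics translate directly into code; a short sketch with our own helper names:

```python
import numpy as np

def nmse(y, mu, y_train_mean):
    """Normalized mean square error, cf. Eq. (16)."""
    return np.sum((y - mu) ** 2) / np.sum((y - y_train_mean) ** 2)

def mnlp(y, mu, var):
    """Mean negative log probability, cf. Eq. (17)."""
    return 0.5 * np.mean((y - mu) ** 2 / var + np.log(var) + np.log(2.0 * np.pi))
```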
6.1 Approximation Quality on Synthetic Data

As in Section 3.1, we use the synthetic data from the FITC paper for the comparative study. To give all the methods the same computational complexity, we set the number of inducing inputs to M = 7. The results are summarized in Figure 2. For the Nyström method, we used the kernel width learned from the full GP and applied K-means to choose the basis locations [Zhang et al., 2008]. Figure 2(a) shows that it does not fit the data well. Figure 2(b) demonstrates that the prediction of SSGP oscillates outside the range of the training samples, probably because the sinusoidal components are global and span the whole data range (increasing the number of basis functions would improve SSGP's predictive performance, but at a higher computational cost). As shown in Figure 2(c), FITC fails to capture the turns of the data accurately for x near 4, while EigenGP can.

Figure 2: Predictions of four sparse GP methods: (a) Nyström, (b) SSGP, (c) FITC, (d) EigenGP. Pink dots represent training data; blue curves are predictive means; red curves are two standard deviations above and below the mean curves; and the black crosses indicate the inducing inputs.

Using the full GP predictive mean as the label for x ∈ [-1, 7] (we do not have the true y values in the test data), we compute the NMSE and MNLP of all the methods. The average results from 10 runs are reported in Table 1 (dataset 1). We report results from two versions of the Nyström method. In the first version, the kernel width is learned from the full GP and the basis locations are chosen by K-means as before; in the second version, denoted Nyström*, the hyperparameters are learned by evidence maximization. Note that the evidence maximization algorithm for the Nyström approximation is itself novel; we developed it for this comparative analysis. Table 1 shows that both EigenGP and EigenGP* approximate the mean of the full GP model more accurately than the other methods, in particular several orders of magnitude better than the Nyström method.

Table 1: NMSE and MNLP on synthetic data

                    NMSE                            MNLP
Method       dataset 1      dataset 2        dataset 1      dataset 2
Nyström      39±18          1526±769         645±56         2561±1617
Nyström*     2.41±0.53      2721±370         7.39±1.66      40±5
FITC         0.02±0.005     0.50±0.04        -0.07±0.01     0.88±0.05
SSGP         0.54±0.01      0.22±0.05        1.22±0.03      0.87±0.09
EigenGP      0.006±0.001    0.06±0.02        -0.33±0.00     0.40±0.07
EigenGP*     0.009±0.002    0.06±0.02        -0.31±0.01     0.44±0.07

Furthermore, we add the difference term (7) to the kernel function and denote this version of our algorithm EigenGP+. It gives better predictive variance far from the training data, but its predictive mean is slightly worse than that of the version without the term; the NMSE and MNLP of EigenGP+ are 0.014±0.001 and -0.081±0.004. Thus, on the other datasets, we only use the versions without this term (EigenGP and EigenGP*) for their simplicity and effectiveness. We also examine the performance of all these methods at a higher computational complexity by setting M = 10. Again, both versions of the Nyström method give poor predictive distributions, and SSGP still produces extra wavy patterns outside the training data. FITC, EigenGP, and EigenGP+ give good predictions. As before, EigenGP+ gives better predictive variance far from the training data, but with a predictive mean similar to EigenGP's.

Finally, we compare the RVM with EigenGP* on this dataset. While the RVM gives NMSE = 0.048 in 2.0 seconds, EigenGP* achieves NMSE = 0.039±0.017 in 0.33±0.04 seconds with M = 30 (EigenGP performs similarly), making it both faster and more accurate.
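For reference, the K-means initialization of the basis points used for the Nyström baseline above (and for the first case in Section 3.1) amounts to taking cluster centers of the training inputs, as suggested by Zhang et al. [2008]. A scikit-learn sketch (our own wrapper, not code from any of the compared packages):

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_basis_points(X, M, seed=0):
    """Initialize inducing inputs B as the M K-means cluster centers of the training inputs X."""
    km = KMeans(n_clusters=M, n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_
```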
6.2 Prediction Quality on Nonstationary Data

We then compare all the sparse GP methods on a one-dimensional nonstationary synthetic dataset with 200 training and 500 test samples. The underlying function is f(x) = x sin(x^3), where x ∈ (0, 3), and the standard deviation of the white noise is 0.5. This function is nonstationary in the sense that its frequency and amplitude increase as x increases from 0. We randomly generate the data 10 times and set the number of basis points (functions) to 10 for all the competing methods. Using the true function value as the label, we compute the means and standard errors of NMSE and MNLP, reported in Table 1 (dataset 2). For the Nyström method, marginal likelihood optimization leads to much smaller error than the K-means-based approach, but both fare poorly compared with the alternative methods. Table 1 also shows that EigenGP and EigenGP* achieve a striking ~25,000-fold error reduction compared with Nyström*, and a ~10-fold error reduction compared with the second best method, SSGP. RVMs gave NMSE 0.0111±0.0004 with 1.4±0.05 seconds, averaged over 10 runs, while EigenGP* with M = 50 obtains NMSE 0.0110±0.0006 with 0.89±0.1042 seconds (EigenGP gives similar results).
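The synthetic setup just described can be reproduced along the following lines; this is a sketch in which the random seed and the uniform sampling of the training inputs are our own assumptions:

```python
import numpy as np

def make_nonstationary_data(n_train=200, n_test=500, noise_std=0.5, seed=0):
    """Noisy samples of f(x) = x * sin(x**3) on (0, 3), as described in Section 6.2."""
    rng = np.random.default_rng(seed)
    f = lambda x: x * np.sin(x ** 3)
    x_train = rng.uniform(0.0, 3.0, n_train)
    x_test = np.linspace(0.0, 3.0, n_test)
    y_train = f(x_train) + noise_std * rng.normal(size=n_train)
    return x_train, y_train, x_test, f(x_test)
```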
We further illustrate the predictive mean and standard deviation on a typical run in Figure 3. As shown in Figure 3(a), the predictive mean of SSGP contains reasonable high-frequency components for x ∈ (2, 3), but, coming from a stationary GP model, these high-frequency components introduce extra wavy patterns in the left region of x. In addition, the predictive mean on the right is smaller than the true function, probably affected by the small dynamic range of the data on the left. Figure 3(b) shows that the predictive mean of FITC for x ∈ (2, 3) has lower frequency and smaller amplitude than the true function—perhaps influenced by the low-frequency part on the left, x ∈ (0, 2). Indeed, because of the low-frequency part, FITC learns a large kernel width η; the average kernel width learned over the 10 runs is 207.75. This large value affects the quality of the learned basis points (e.g., there is a lack of basis points for the high-frequency region on the right). By contrast, starting from the same initial kernel width as FITC, EigenGP learns a suitable kernel width—on average, η = 0.07—and provides good predictions, as shown in Figure 3(c).

Figure 3: Predictions on nonstationary data: (a) SSGP, (b) FITC, (c) EigenGP. The pink dots correspond to noisy data around the true function f(x) = x sin(x^3), represented by the green curves. The blue and red solid curves correspond to the predictive means and ± two standard deviations around the means. The black crosses near the bottom represent the estimated basis points for FITC and EigenGP.

6.3 Accuracy vs. Time on Real Data

To evaluate the trade-off between prediction accuracy and computational cost, we use three large real datasets. The first dataset is California Housing [Pace and Barry, 1997]; we randomly split the 8-dimensional data into 10,000 training and 10,640 test points. The second dataset is Physicochemical Properties of Protein Tertiary Structures (PPPTS), which can be obtained from Lichman [2013]; we randomly split the 9-dimensional data into 20,000 training and 25,730 test points. The third dataset is Pole Telecomm, which was used in Lázaro-Gredilla et al. [2010]; it contains 10,000 training and 5,000 test samples, each with 26 features. We set M = 25, 50, 100, 200, 400 and the maximum number of iterations in optimization to 100 for all methods.

The NMSE, MNLP, and training time of these methods are shown in Figure 4. In addition, we ran the Nyström method based on marginal likelihood maximization, which is better than using K-means to set the basis points. Again, the Nyström method performed orders of magnitude worse than the other methods: with M = 25, 50, 100, 200, 400, on California Housing, the Nyström method uses 146, 183, 230, 359, and 751 seconds for training, respectively, and gives NMSEs of 917, 120, 103, 317, and 95; on PPPTS, the training times are 258, 304, 415, 853, and 2001 seconds and the NMSEs are 1.7×10^5, 6.4×10^4, 1.3×10^4, 8.1×10^3, and 8.1×10^3; and on Pole Telecomm, the training times are 179, 205, 267, 478, and 959 seconds and the NMSEs are 2.3×10^4, 5.8×10^3, 3.8×10^3, 4.5×10^2, and 84. The MNLPs are consistently large and are omitted for brevity.

For RVMs, we include cross-validation in the training time because choosing an appropriate kernel width is crucial for the RVM. Since the RVM learns the number of basis functions automatically from the data, it appears as a single point per dataset in Figure 4. EigenGP achieves the lowest prediction error using less time. Compared with EigenGP based on sequential optimization, EigenGP* achieves similar errors but takes longer because the joint optimization is more expensive.

Figure 4: NMSE and MNLP vs. training time on (a) California Housing, (b) PPPTS, and (c) Pole Telecomm. Each method (except RVMs) has five results, associated with M = 25, 50, 100, 200, 400, respectively. In (c), the fifth result of FITC is out of the plotted range; the actual training time is 3485 seconds, the NMSE 0.033, and the MNLP -1.27. The MNLP values for RVMs are 1.30, 1.25, and 0.95, respectively.

7 Conclusions

In this paper we have presented a simple yet effective sparse Gaussian process method, EigenGP, and applied it to regression. Despite its similarity to the Nyström method, EigenGP improves on its prediction quality by several orders of magnitude. EigenGP can be easily extended to conduct online learning, either by using stochastic gradient descent to update the weights of the eigenfunctions or by applying the online variational Bayes ideas for GPs [Hensman et al., 2013].

Acknowledgments

This work was supported by NSF ECCS-0941043, NSF CAREER award IIS-1054903, and the Center for Science of Information, an NSF Science and Technology Center, under grant agreement CCF-0939370.
References

Noel Cressie and Gardar Johannesson. Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209-226, February 2008.

Lehel Csató and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14:641-668, March 2002.

Jan de Leeuw. Derivatives of generalized eigensystems with applications. In Department of Statistics Papers. Department of Statistics, UCLA, 2007.

Anita C. Faul and Michael E. Tipping. Analysis of sparse Bayesian learning. In Advances in Neural Information Processing Systems 14. MIT Press, 2001.

James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In The 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), pages 282-290, 2013.

Dave Higdon. Space and space-time modeling using process convolutions. In Clive W. Anderson, Vic Barnett, Philip C. Chatwin, and Abdel H. El-Shaarawi, editors, Quantitative Methods for Current Environmental Issues, pages 37-56. Springer Verlag, 2002.

Luc Hoegaerts, Johan A. K. Suykens, Joos Vandewalle, and Bart De Moor. Subset based least squares subspace regression in RKHS. Neurocomputing, 63:293-323, January 2005.

Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl E. Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865-1881, 2010.

Moshe Lichman. UCI machine learning repository, 2013.

David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.

Youssef M. Marzouk and Habib N. Najm. Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics, 228(6):1862-1902, 2009.

Thomas P. Minka. Old and new matrix algebra useful for statistics. Technical report, MIT Media Lab, 2001.

R. Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics and Probability Letters, 33(3):291-297, 1997.

Christopher J. Paciorek and Mark J. Schervish. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.

Yuan Qi, Ahmed H. Abdel-Gawad, and Thomas P. Minka. Sparse-posterior Gaussian processes for general likelihoods. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.

Joaquin Quiñonero-Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1935-1959, December 2005.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.

Michael E. Tipping. The relevance vector machine. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In The 12th International Conference on Artificial Intelligence and Statistics, pages 567-574, 2009.

Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.

Feng Yan and Yuan Qi. Sparse Gaussian process regression via ℓ1 penalization. In Proceedings of the 27th International Conference on Machine Learning, pages 1183-1190, 2010.

Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In Proceedings of the 25th International Conference on Machine Learning, pages 1232-1239. ACM, 2008.