EigenGP: Gaussian Process Models with Adaptive Eigenfunctions

Hao Peng (Department of Computer Science, Purdue University, West Lafayette, IN 47906) and Yuan Qi (Departments of CS and Statistics, Purdue University, West Lafayette, IN 47906)

arXiv:1401.0362v3 [cs.LG] 13 Jul 2015

Abstract

Gaussian processes (GPs) provide a nonparametric representation of functions. However, classical GP inference suffers from high computational cost for big data. In this paper, we propose a new Bayesian approach, EigenGP, that learns both basis dictionary elements (eigenfunctions of a GP prior) and prior precisions in a sparse finite model. It is well known that, among all orthogonal basis functions, eigenfunctions can provide the most compact representation. Unlike other sparse Bayesian finite models where the basis function has a fixed form, our eigenfunctions live in a reproducing kernel Hilbert space as a finite linear combination of kernel functions. We learn the dictionary elements (eigenfunctions) and the prior precisions over these elements, as well as all the other hyperparameters, from data by maximizing the model marginal likelihood. We use computational linear algebra to simplify the gradient computation significantly. Our experimental results demonstrate improved predictive performance of EigenGP over alternative sparse GP methods as well as relevance vector machines.

1 Introduction

Gaussian processes (GPs) are powerful nonparametric Bayesian models with numerous applications in machine learning and statistics. GP inference, however, is costly: training the exact GP regression model with N samples takes O(N^2) space and O(N^3) time. To address this issue, a variety of approximate sparse GP inference approaches have been developed [Williams and Seeger, 2001; Csató and Opper, 2002; Snelson and Ghahramani, 2006; Lázaro-Gredilla et al., 2010; Williams and Barber, 1998; Titsias, 2009; Qi et al., 2010; Higdon, 2002; Cressie and Johannesson, 2008], for example, using the Nyström method to approximate covariance matrices [Williams and Seeger, 2001], optimizing a variational bound on the marginal likelihood [Titsias, 2009], or grounding the GP on a small set of (blurred) basis points [Snelson and Ghahramani, 2006; Qi et al., 2010]. An elegant unifying view of various sparse GP regression models is given by Quiñonero-Candela and Rasmussen [2005].

Among all sparse GP regression methods, a state-of-the-art approach is to represent a function as a sparse finite linear combination of pairs of trigonometric basis functions, a sine and a cosine for each spectral point; this approach is therefore called the sparse spectrum Gaussian process (SSGP) [Lázaro-Gredilla et al., 2010]. SSGP integrates out both the weights and the phases of the trigonometric functions and learns all hyperparameters of the model (frequencies and amplitudes) by maximizing the marginal likelihood. Using global trigonometric basis functions, SSGP can approximate any stationary Gaussian process model and has been shown to outperform alternative sparse GP methods, including the fully independent training conditional (FITC) approximation [Snelson and Ghahramani, 2006], on benchmark datasets. Another popular sparse Bayesian finite linear model is the relevance vector machine (RVM) [Tipping, 2000; Faul and Tipping, 2001]. It uses kernel expansions over the training samples as basis functions and selects the basis functions by automatic relevance determination [MacKay, 1992; Faul and Tipping, 2001].

In this paper, we propose a new sparse Bayesian approach, EigenGP, that learns both functional dictionary elements (eigenfunctions) and prior precisions in a finite linear model representation of a GP. It is well known that, among all orthogonal basis functions, eigenfunctions provide the most compact representation. Unlike SSGP or RVMs, where the basis function has a fixed form, our eigenfunctions live in a reproducing kernel Hilbert space (RKHS) as a finite linear combination of kernel functions with their weights learned from data. We further marginalize out the weights over the eigenfunctions and estimate all hyperparameters, including the basis points for the eigenfunctions, the lengthscales, and the precision of the weight prior, by maximizing the model marginal likelihood (also known as the evidence). To do so, we use computational linear algebra to greatly simplify the gradient computation for optimization (our optimization method is therefore entirely different from RVM optimization methods). As a result of this optimization, our eigenfunctions are data dependent and make EigenGP capable of accurately modeling nonstationary data. Furthermore, by adding an extra kernel term to our model, we can turn the finite model into an infinite model that captures the prediction uncertainty better; that is, it can give nonzero predictive variance when a test sample is far from the training samples.

EigenGP is computationally efficient. It takes O(NM) space and O(NM^2) time for training with M basis functions, the same as SSGP and more efficient than RVMs (RVMs learn weights over N, not M, basis functions). Similar to FITC and SSGP, EigenGP focuses on predictive accuracy at low computational cost rather than on faithfully converging towards the full GP as the number of basis functions grows. (For the latter goal, see the approach of Yan and Qi [2010], which explicitly minimizes the KL divergence between the exact and approximate GP posterior processes.)

The rest of the paper is organized as follows. Section 2 reviews the background of GPs. Section 3 presents the EigenGP model and an illustrative example. Section 4 outlines the marginal likelihood maximization for learning the dictionary elements and the other hyperparameters. In Section 5, we discuss related work. Section 6 shows regression results on multiple benchmark datasets, demonstrating improved performance of EigenGP over the Nyström method [Williams and Seeger, 2001], RVM, FITC, and SSGP.
2 Background of Gaussian Processes

We denote N independent and identically distributed samples as D = {(x_1, y_1), ..., (x_N, y_N)}, where x_i is a D-dimensional input (i.e., the explanatory variables) and y_i is a scalar output (i.e., a response), which we assume is a noisy realization of a latent function f at x_i.

A Gaussian process places a prior distribution over the latent function f. Its projection f_x at {x_i}_{i=1}^N defines a joint Gaussian distribution p(f_x) = N(f | m_0, K), where, without any prior preference, the mean m_0 is set to 0, and the covariance function k(x_i, x_j) ≡ K(x_i, x_j) encodes the prior notion of smoothness. A popular choice is the anisotropic squared exponential covariance function:

k(x, x') = a_0 \exp\big( -(x - x')^\top \mathrm{diag}(\eta) (x - x') \big)

where the hyperparameters include the signal variance a_0 and the lengthscales η = {η_d}_{d=1}^D, which control how fast the covariance decays with the distance between inputs. Using this covariance function, we can prune input dimensions by shrinking the corresponding lengthscales based on the data (when η_d = 0, the d-th dimension becomes totally irrelevant to the covariance function value). This pruning is known as automatic relevance determination (ARD), and this covariance is therefore also called the ARD squared exponential. Note that the covariance function value remains the same whenever (x' − x) is the same, regardless of where x' and x are located; this leads to a stationary GP model. For nonstationary data, however, a stationary GP model is a misfit. Although nonstationary GP models have been developed and applied to real-world applications, they are often limited to low-dimensional problems, such as applications in spatial statistics [Paciorek and Schervish, 2004]. Constructing general nonstationary GP models remains a challenging task.
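The ARD squared exponential covariance above is simple to evaluate directly. The following is a minimal NumPy sketch (the paper itself provides no code; the function name and array conventions are our own), using the paper's parameterization k(x, x') = a_0 exp(−(x − x')^T diag(η) (x − x')), in which each η_d plays the role of an inverse squared lengthscale for dimension d.

```python
import numpy as np

def ard_se_kernel(X1, X2, a0, eta):
    """ARD squared exponential covariance of Section 2:
    k(x, x') = a0 * exp(-(x - x')^T diag(eta) (x - x')).

    X1  : (N1, D) array of inputs
    X2  : (N2, D) array of inputs
    a0  : signal variance
    eta : (D,) per-dimension weights (inverse squared lengthscales);
          eta[d] = 0 switches dimension d off (ARD pruning)
    """
    diff = X1[:, None, :] - X2[None, :, :]                     # (N1, N2, D) pairwise differences
    sq = np.einsum('nmd,d->nm', diff ** 2, np.asarray(eta))    # sum_d eta_d (x_d - x'_d)^2
    return a0 * np.exp(-sq)
```

Setting a particular eta[d] to zero removes the d-th input dimension from the covariance, which is exactly the ARD pruning effect described above.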
For regression, we use a Gaussian likelihood function p(y_i | f) = N(y_i | f(x_i), σ^2), where σ^2 is the variance of the observation noise. Given the Gaussian process prior over f and the data likelihood, the exact posterior process is p(f | D, y) ∝ GP(f | 0, k) ∏_{i=1}^N p(y_i | f). Although the posterior process for GP regression has an analytical form, we need to store and invert an N by N matrix, which has computational complexity O(N^3), rendering GPs infeasible for big data analytics.

3 Model of EigenGP

To enable fast inference and obtain a nonstationary covariance function, our new model EigenGP projects the GP prior onto an eigensubspace. Specifically, we set the latent function to

f(x) = \sum_{j=1}^{M} \alpha_j \phi_j(x)    (1)

where M ≪ N and {φ_j(x)} are eigenfunctions of the GP prior. We assign a Gaussian prior over α = [α_1, ..., α_M], α ∼ N(0, diag(w)), so that f follows a GP prior with zero mean and the covariance function

\tilde{k}(x, x') = \sum_{j=1}^{M} w_j \phi_j(x) \phi_j(x')    (2)

To compute the eigenfunctions {φ_j(x)}, we could use a Galerkin projection to approximate them by Hermite polynomials [Marzouk and Najm, 2009]. For high-dimensional problems, however, this approach requires a tensor product of univariate Hermite polynomials that dramatically increases the number of parameters.

To avoid this problem, we apply the Nyström method [Williams and Seeger, 2001], which allows us to obtain an approximation to the eigenfunctions in a high-dimensional space efficiently. Specifically, given inducing inputs (i.e., basis points) B = [b_1, ..., b_M], we replace the eigenfunction equation

\int k(x, x') \phi_j(x) p(x) \, dx = \lambda_j \phi_j(x')    (3)

by its Monte Carlo approximation

\frac{1}{M} \sum_{i=1}^{M} k(x, b_i) \phi_j(b_i) \approx \lambda_j \phi_j(x)    (4)

Then, by evaluating this equation at B to estimate the values of φ_j(b_i) and λ_j, we obtain the j-th eigenfunction φ_j(x) as

\phi_j(x) = \frac{\sqrt{M}}{\lambda_j^{(M)}} \, k(x) \, u_j^{(M)} = k(x) \, u_j    (5)

where k(x) ≜ [k(x, b_1), ..., k(x, b_M)], λ_j^{(M)} and u_j^{(M)} are the j-th eigenvalue and eigenvector of the covariance function evaluated at B, and u_j = \sqrt{M} u_j^{(M)} / \lambda_j^{(M)}. Note that, for simplicity, we have chosen the number of inducing inputs to be the same as the number of eigenfunctions; in practice, we can use more inducing inputs while computing only the top M eigenvectors. As shown in (5), each of our eigenfunctions lives in an RKHS as a linear combination of the kernel functions evaluated at B with weights u_j.

Inserting (5) into (1), we obtain

f(x) = \sum_{j=1}^{M} \alpha_j \sum_{i=1}^{M} u_{ij} \, k(x, b_i)    (6)

This equation reveals a two-layer structure of EigenGP. The first layer linearly combines multiple kernel functions to generate each eigenfunction φ_j. The second layer takes these eigenfunctions as the basis functions to generate the function value f. Note that f is a Bayesian linear combination of {φ_j} in which the weights α are integrated out to avoid overfitting.

All the model hyperparameters are learned from data. Specifically, for the first layer, to learn the eigenfunctions {φ_j}, we estimate the inducing inputs B and the kernel hyperparameters (such as the lengthscales η of the ARD kernel) by maximizing the model marginal likelihood. For the second layer, we marginalize out α to avoid overfitting and maximize the model marginal likelihood to learn the hyperparameters w of the prior.

With the estimated hyperparameters, the prior over f is nonstationary because its covariance function in (2) varies across different regions of x. This comes as no surprise, since the eigenfunctions are tied to p(x) in (3). This nonstationarity reflects the fact that our model adapts to the distribution of the explanatory variables x.

Note that, to recover the full uncertainty captured by the kernel function k, we can add the following term to the kernel function of EigenGP:

\delta(x - x') \big( k(x, x') - \tilde{k}(x, x') \big)    (7)

where δ(a) = 1 if and only if a = 0. Compared to the original EigenGP model, which has a finite number of degrees of freedom, this modified model has an infinite number of basis functions (assuming k has an infinite number of basis functions, as the ARD kernel does). Thus, this model can accurately capture the uncertainty of a test point even when it is far from the training set. We derive the optimization updates of all the hyperparameters for both the original and the modified EigenGP models. According to our experiments, however, the modified model does not improve the prediction accuracy over the original EigenGP (it sometimes even reduces it). Therefore, we focus on the original EigenGP model in our presentation for its simplicity. Before we present the details of hyperparameter optimization, let us first look at an illustrative example of the effect of hyperparameter optimization, in particular the optimization of the inducing inputs B.
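To make the construction in Eqs. (2)-(6) concrete, here is a minimal NumPy sketch of the Nyström eigenfunctions and the resulting EigenGP covariance. This is not the authors' implementation (which is available at https://github.com/hao-peng/EigenGP); the function names are ours, and any kernel callable, such as the ard_se_kernel sketched in Section 2, can be plugged in.

```python
import numpy as np

def nystrom_eigenfunctions(X, B, kernel, top_m=None):
    """Approximate eigenfunctions of the GP prior via the Nystrom method,
    following Eqs. (4)-(5): phi_j(x) = sqrt(M) / lambda_j * k(x, B) u_j,
    where (lambda_j, u_j) are eigenpairs of the kernel matrix on the
    inducing inputs B.

    X      : (N, D) points at which to evaluate the eigenfunctions
    B      : (M, D) inducing inputs (basis points)
    kernel : callable, kernel(X1, X2) -> covariance matrix
    top_m  : optionally keep only the leading top_m eigenpairs
    Returns Phi with Phi[n, j] = phi_j(x_n).
    """
    M = B.shape[0]
    K_BB = kernel(B, B)
    lam, U = np.linalg.eigh(K_BB)          # eigenvalues in ascending order
    order = np.argsort(lam)[::-1]          # reorder: largest eigenvalue first
    lam, U = lam[order], U[:, order]
    if top_m is not None:
        lam, U = lam[:top_m], U[:, :top_m]
    # In practice a small jitter on K_BB guards against near-zero eigenvalues.
    K_XB = kernel(X, B)
    return np.sqrt(M) * K_XB @ (U / lam)   # column j scaled by 1 / lambda_j

def eigengp_covariance(X1, X2, B, kernel, w):
    """EigenGP covariance of Eq. (2): k~(x, x') = sum_j w_j phi_j(x) phi_j(x')."""
    Phi1 = nystrom_eigenfunctions(X1, B, kernel)
    Phi2 = nystrom_eigenfunctions(X2, B, kernel)
    return (Phi1 * np.asarray(w)) @ Phi2.T
```

The optional correction of Eq. (7) would simply add k(x, x) − k~(x, x) to the diagonal of this matrix when X1 and X2 coincide.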
3.1 Illustrative Example

For this example, we consider a toy dataset used by the FITC algorithm [Snelson and Ghahramani, 2006]. It contains 200 one-dimensional training samples. We use 5 basis points (M = 5) and choose the ARD kernel. We then compare the basis functions {φ_j} and the corresponding predictive distributions in two cases. For the first case, we use the kernel width η learned from the full GP model as the kernel width for EigenGP and apply K-means to set the basis points B as the cluster centers (see the initialization sketch at the end of this subsection). The idea of using K-means to set the basis points has been suggested by Zhang et al. [2008] to minimize an error bound for the Nyström approximation. For the second case, we optimize η, B, and w by maximizing the marginal likelihood of EigenGP.

The results are shown in Figure 1. The first row demonstrates that, by optimizing the hyperparameters, EigenGP achieves predictions very close to those of the full GP, using only 5 basis functions. In contrast, when the basis points are set to the cluster centers by K-means, EigenGP gives predictions significantly different from those of the full GP and fails to capture the data trend, in particular for x ∈ (3, 5). The second row of Figure 1 shows that K-means sets the basis points almost evenly spaced, and accordingly the five eigenfunctions are smooth global basis functions whose shapes are not directly linked to the function they fit. Evidence maximization, by contrast, sets the basis points unevenly spaced and generates basis functions whose shapes are more localized and adaptive to the function they fit; for example, the eigenfunction represented by the red curve closely matches the data on the right.

[Figure 1: Illustration of the effect of optimizing the basis points (i.e., inducing inputs). Left column: basis points learned by K-means; right column: basis points estimated by evidence maximization. Top row: predictions; bottom row: eigenfunctions. In the first row, the pink dots represent data points, the blue and red solid curves correspond to the predictive mean of EigenGP and ± two standard deviations around the mean, the black curve corresponds to the predictive mean of the full GP, and the black crosses denote the basis points. In the second row, the curves in various colors represent the five eigenfunctions of EigenGP.]
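The K-means initialization of the basis points used in the first case above can be sketched with scikit-learn as follows; this is our own illustration, not part of the paper's code.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_basis_points_kmeans(X, M, seed=0):
    """Initialize the inducing inputs B as the K-means cluster centers of the
    training inputs [Zhang et al., 2008].

    X : (N, D) training inputs (reshape 1-D data to (N, 1) first)
    M : number of basis points
    Returns B : (M, D) array of cluster centers.
    """
    km = KMeans(n_clusters=M, n_init=10, random_state=seed).fit(np.asarray(X))
    return km.cluster_centers_
```

As the example in Figure 1 shows, these centers are only a reasonable starting point; EigenGP subsequently moves B by evidence maximization.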
4 Learning Hyperparameters

In this section we describe how we optimize all the hyperparameters, denoted by θ, which include B in the covariance function (2) and all the kernel hyperparameters (e.g., a_0 and η). To optimize θ, we maximize the marginal likelihood (i.e., the evidence) with a conjugate Newton method (we use the code from http://www.gaussianprocess.org/gpml/code/matlab/doc). We explore two strategies for evidence maximization. The first is sequential optimization, which first fixes w while updating all the other hyperparameters, and then optimizes w while fixing the other hyperparameters. The second strategy is to optimize all the hyperparameters jointly. Here we skip the details of the more complicated joint optimization and describe the key gradient formulas for sequential optimization. The key computation in our optimization is the log marginal likelihood and its gradient:

\ln p(y | \theta) = -\tfrac{1}{2} \ln |C_N| - \tfrac{1}{2} y^\top C_N^{-1} y - \tfrac{N}{2} \ln(2\pi)    (8)

d \ln p(y | \theta) = -\tfrac{1}{2} \left[ \mathrm{tr}(C_N^{-1} \, dC_N) - \mathrm{tr}(C_N^{-1} y y^\top C_N^{-1} \, dC_N) \right]    (9)

where C_N = \tilde{K} + σ^2 I, \tilde{K} = Φ diag(w) Φ^T, and Φ = {φ_m(x_n)} is an N by M matrix. Because the rank of \tilde{K} is M ≪ N, we can compute ln|C_N| and C_N^{-1} efficiently at a cost of O(M^2 N) via the matrix inversion and determinant lemmas. Even with the matrix inversion lemma for the low-rank computation, a naive gradient calculation would be very costly. We apply identities from computational linear algebra [Minka, 2001; de Leeuw, 2007] to simplify the needed computation dramatically.

To compute the derivative with respect to B, we first note that, when w is fixed, we have

C_N = \tilde{K} + \sigma^2 I = K_{XB} K_{BB}^{-1} K_{BX} + \sigma^2 I    (10)

where K_XB is the cross-covariance matrix between the training data X and the inducing inputs B, and K_BB is the covariance matrix on B.

For the ARD squared exponential kernel, using the identities tr(P^T Q) = vec(P)^T vec(Q) and vec(P ∘ Q) = diag(vec(P)) vec(Q), where vec(·) vectorizes a matrix into a column vector and ∘ denotes the Hadamard product, we can derive the derivative of the first trace term in (9):

\frac{\mathrm{tr}(C_N^{-1} \, dC_N)}{dB} = 4 R X^\top \mathrm{diag}(\eta) - 4 (R \mathbf{1}\mathbf{1}^\top) \circ (B^\top \mathrm{diag}(\eta)) - 4 S B^\top \mathrm{diag}(\eta) + 4 (S \mathbf{1}\mathbf{1}^\top) \circ (B^\top \mathrm{diag}(\eta))    (11)

where 1 is a column vector of all ones, and

R = (K_{BB}^{-1} K_{BX} C_N^{-1}) \circ K_{BX}    (12)

S = (K_{BB}^{-1} K_{BX} C_N^{-1} K_{XB} K_{BB}^{-1}) \circ K_{BB}    (13)

Note that we can compute K_BX C_N^{-1} efficiently via low-rank updates. Also, R 1 1^T (and similarly S 1 1^T) can be implemented efficiently by first summing over the columns of R (S) and then copying the resulting vector multiple times, without any multiplication operation. To obtain the derivative of tr(C_N^{-1} y y^T C_N^{-1} dC_N) with respect to B, we simply replace C_N^{-1} in (12) and (13) by C_N^{-1} y y^T C_N^{-1}. Using similar derivations, we can obtain the derivatives with respect to the lengthscales η, a_0, and σ^2.

To compute the derivative with respect to w, we can use the formula tr(P diag(w) Q) = 1^T (Q^T ∘ P) w to obtain the two trace terms in (9) as follows:

\frac{\mathrm{tr}(C_N^{-1} \, dC_N)}{dw} = \mathbf{1}^\top \big( \Phi \circ (C_N^{-1} \Phi) \big)    (14)

\frac{\mathrm{tr}(C_N^{-1} y y^\top C_N^{-1} \, dC_N)}{dw} = \mathbf{1}^\top \big( \Phi \circ (C_N^{-1} y y^\top C_N^{-1} \Phi) \big)    (15)

For either sequential or joint optimization, the overall computational complexity is O(max(M^2, D) N), where D is the data dimension.
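As a concrete illustration of the O(M^2 N) evaluation of Eq. (8) through the matrix inversion and determinant lemmas, here is a minimal NumPy sketch. The authors' implementation builds on the GPML MATLAB toolbox; the names and structure below are our own.

```python
import numpy as np

def eigengp_log_marginal_likelihood(Phi, y, w, sigma2):
    """Log marginal likelihood ln p(y | theta) of Eq. (8), evaluated in
    O(N M^2) time for C_N = Phi diag(w) Phi^T + sigma2 * I.

    Phi    : (N, M) matrix of eigenfunction values, Phi[n, m] = phi_m(x_n)
    y      : (N,) vector of responses
    w      : (M,) prior variances of the weights alpha
    sigma2 : observation-noise variance
    """
    N, M = Phi.shape
    # M x M "capacitance" matrix of the Woodbury identity:
    # A = diag(w)^{-1} + Phi^T Phi / sigma2
    A = np.diag(1.0 / np.asarray(w)) + Phi.T @ Phi / sigma2
    L = np.linalg.cholesky(A)                      # A = L L^T

    # C_N^{-1} y  =  y / sigma2  -  Phi A^{-1} Phi^T y / sigma2^2
    Phi_t_y = Phi.T @ y
    A_inv_Phi_t_y = np.linalg.solve(L.T, np.linalg.solve(L, Phi_t_y))
    Cinv_y = y / sigma2 - (Phi @ A_inv_Phi_t_y) / sigma2 ** 2

    # ln |C_N| = ln |A| + sum_j ln w_j + N ln sigma2  (determinant lemma)
    logdet = 2.0 * np.sum(np.log(np.diag(L))) + np.sum(np.log(w)) + N * np.log(sigma2)

    return -0.5 * logdet - 0.5 * y @ Cinv_y - 0.5 * N * np.log(2.0 * np.pi)
```

The same low-rank structure is what makes the trace terms in (9), (14), and (15) affordable, since C_N^{-1} never has to be formed explicitly.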
5 Related Work

Our work is closely related to the seminal work of Williams and Seeger [2001], but the two differ in multiple aspects. First, we define a valid probabilistic model based on an eigendecomposition of the GP prior. By contrast, the previous approach [Williams and Seeger, 2001] aims at a low-rank approximation to the finite covariance/kernel matrix used in GP training, from a numerical approximation perspective, and its predictive distribution is not well-formed in a probabilistic framework (e.g., it may give a negative predictive variance). Second, while the Nyström method simply uses the first few eigenvectors, we maximize the model marginal likelihood to adjust their weights in the covariance function. Third, exploiting the clustering property of the eigenfunctions of the Gaussian kernel, our approach can conduct semi-supervised learning, while the previous one cannot (the semi-supervised learning capability of EigenGP is investigated in another paper of ours). Fourth, the Nyström method lacks a principled way to learn model hyperparameters, including the kernel width and the basis points, while EigenGP does not.

Our work is also related to methods that use kernel principal component analysis (PCA) to speed up kernel machines [Hoegaerts et al., 2005]. For these methods, however, it can be difficult, if not impossible, to learn important hyperparameters, including a kernel width for each dimension and inducing inputs that are not a subset of the training samples. By contrast, EigenGP learns all these hyperparameters from data based on gradients of the model marginal likelihood.

6 Experimental Results

In this section, we compare EigenGP (our implementation is available at https://github.com/hao-peng/EigenGP) with alternative methods on synthetic and real benchmark datasets. The alternative methods include the sparse GP methods FITC, SSGP, and the Nyström method, as well as RVMs. We implemented the Nyström method ourselves and downloaded the software implementations of the other methods from their authors' websites. For RVMs, we used the fast fixed-point algorithm [Faul and Tipping, 2001]. We used the ARD kernel for all the methods except RVMs (since they do not estimate the lengthscales in this kernel) and optimized all the hyperparameters via evidence maximization. For RVMs, we chose the squared exponential kernel with the same lengthscale for all dimensions and applied 10-fold cross-validation on the training data to select the lengthscale. On the large real datasets, we used the values of η, a_0, and σ^2 learned from the full GP on a subset that was 1/10 of the training data to initialize all the methods except RVMs. For the remaining configurations, we used the default settings of the downloaded software packages. For our own model, we denote the versions with sequential and joint optimization as EigenGP and EigenGP*, respectively. To evaluate the test performance of each method, we measure the normalized mean square error (NMSE) and the mean negative log probability (MNLP), defined as

\mathrm{NMSE} = \sum_i (y_i - \mu_i)^2 \Big/ \sum_i (\mu_i - \bar{y})^2    (16)

\mathrm{MNLP} = \frac{1}{2N} \sum_i \Big[ \frac{(y_i - \mu_i)^2}{\sigma_i^2} + \ln \sigma_i^2 + \ln 2\pi \Big]    (17)

where y_i, µ_i, and σ_i^2 are the response value, the predictive mean, and the predictive variance for the i-th test point, respectively, and ȳ is the average response value of the training data.
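The two evaluation metrics follow directly from Eqs. (16) and (17). A small NumPy sketch, with helper names of our own choosing, is given below.

```python
import numpy as np

def nmse(y, mu, y_train_mean):
    """Normalized mean square error, Eq. (16); y_train_mean is the average
    response value of the training data."""
    y, mu = np.asarray(y), np.asarray(mu)
    return np.sum((y - mu) ** 2) / np.sum((mu - y_train_mean) ** 2)

def mnlp(y, mu, var):
    """Mean negative log probability, Eq. (17); var holds the predictive
    variances sigma_i^2 for the test points."""
    y, mu, var = np.asarray(y), np.asarray(mu), np.asarray(var)
    return 0.5 * np.mean((y - mu) ** 2 / var + np.log(var) + np.log(2 * np.pi))
```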
6.1 Approximation Quality on Synthetic Data

As in Section 3.1, we use the synthetic data from the FITC paper for this comparative study. To let all the methods have the same computational complexity, we set the number of inducing inputs to M = 7. The results are summarized in Figure 2. For the Nyström method, we used the kernel width learned from the full GP and applied K-means to choose the basis locations [Zhang et al., 2008]. Figure 2(a) shows that it does not fit well. Figure 2(b) demonstrates that the prediction of SSGP oscillates outside the range of the training samples, probably because the sinusoidal components are global and span the whole data range (increasing the number of basis functions would improve SSGP's predictive performance, but also increase the computational cost). As shown in Figure 2(c), FITC fails to capture the turns of the data accurately for x near 4, while EigenGP does.

[Figure 2: Predictions of four sparse GP methods: (a) Nyström, (b) SSGP, (c) FITC, (d) EigenGP. Pink dots represent training data; blue curves are predictive means; red curves are two standard deviations above and below the mean curves; the black crosses indicate the inducing inputs.]

Using the full GP predictive mean as the label for x ∈ [−1, 7] (we do not have the true y values in the test data), we compute the NMSE and MNLP of all the methods. The average results from 10 runs are reported in Table 1 (dataset 1). We report two versions of the Nyström method. For the first version, the kernel width is learned from the full GP and the basis locations are chosen by K-means as before; for the second version, denoted Nyström*, the hyperparameters are learned by evidence maximization. Note that the evidence maximization algorithm for the Nyström approximation is itself new; we developed it for this comparative analysis. Table 1 shows that both EigenGP and EigenGP* approximate the mean of the full GP model more accurately than the other methods, in particular several orders of magnitude better than the Nyström method.

Table 1: NMSE and MNLP on synthetic data

NMSE
Method     | dataset 1      | dataset 2
Nyström    | 39 ± 18        | 1526 ± 769
Nyström*   | 2.41 ± 0.53    | 2721 ± 370
FITC       | 0.02 ± 0.005   | 0.50 ± 0.04
SSGP       | 0.54 ± 0.01    | 0.22 ± 0.05
EigenGP    | 0.006 ± 0.001  | 0.06 ± 0.02
EigenGP*   | 0.009 ± 0.002  | 0.06 ± 0.02

MNLP
Method     | dataset 1      | dataset 2
Nyström    | 645 ± 56       | 2561 ± 1617
Nyström*   | 7.39 ± 1.66    | 40 ± 5
FITC       | −0.07 ± 0.01   | 0.88 ± 0.05
SSGP       | 1.22 ± 0.03    | 0.87 ± 0.09
EigenGP    | −0.33 ± 0.00   | 0.40 ± 0.07
EigenGP*   | −0.31 ± 0.01   | 0.44 ± 0.07

Furthermore, we add the difference term (7) to the kernel function and denote this version of our algorithm EigenGP+. It gives better predictive variance far from the training data, but its predictive mean is slightly worse than that of the version without this term; the NMSE and MNLP of EigenGP+ are 0.014 ± 0.001 and −0.081 ± 0.004. Thus, on the other datasets, we only use the versions without this term (EigenGP and EigenGP*) for their simplicity and effectiveness. We also examine the performance of all these methods at a higher computational complexity. Specifically, we set M = 10. Again, both versions of the Nyström method give poor predictive distributions, and SSGP still produces extra wavy patterns outside the training data. FITC, EigenGP, and EigenGP+ give good predictions. Again, EigenGP+ gives better predictive variance far from the training data, but a predictive mean similar to that of EigenGP.

Finally, we compare the RVM with EigenGP* on this dataset. While the RVM gives NMSE = 0.048 in 2.0 seconds, EigenGP* achieves NMSE = 0.039 ± 0.017 in 0.33 ± 0.04 seconds with M = 30 (EigenGP performs similarly), making it both faster and more accurate.
6.2 Prediction Quality on Nonstationary Data

We then compare all the sparse GP methods on a one-dimensional nonstationary synthetic dataset with 200 training and 500 test samples. The underlying function is f(x) = x sin(x^3), where x ∈ (0, 3), and the standard deviation of the white noise is 0.5. This function is nonstationary in the sense that its frequency and amplitude increase as x increases from 0. We randomly generated the data 10 times and set the number of basis points (functions) to 14 for all the competing methods.
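For reference, a dataset of this form can be generated as follows. The paper specifies the function, the input range, the sample sizes, and the noise level; the uniform sampling of the inputs and the seeding are our assumptions.

```python
import numpy as np

def make_nonstationary_data(n_train=200, n_test=500, noise_std=0.5, seed=0):
    """Synthetic nonstationary benchmark of Section 6.2: noisy observations
    of f(x) = x * sin(x^3) on x in (0, 3)."""
    rng = np.random.default_rng(seed)
    f = lambda x: x * np.sin(x ** 3)
    x_train = rng.uniform(0.0, 3.0, size=n_train)
    x_test = rng.uniform(0.0, 3.0, size=n_test)
    y_train = f(x_train) + noise_std * rng.normal(size=n_train)
    return x_train, y_train, x_test, f(x_test)   # test labels are the true function values
```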
Using the true function value as the label, we compute the means and standard errors of NMSE and MNLP, reported in Table 1 (dataset 2). For the Nyström method, marginal likelihood optimization leads to much smaller error than the K-means based approach, but both fare poorly compared with the alternative methods. Table 1 also shows that EigenGP and EigenGP* achieve a striking ~25,000-fold error reduction compared with Nyström*, and a ~10-fold error reduction compared with the second best method, SSGP. RVMs gave NMSE 0.0111 ± 0.0004 in 1.4 ± 0.05 seconds, averaged over 10 runs, while EigenGP* with M = 50 obtains NMSE 0.0110 ± 0.0006 in 0.89 ± 0.1042 seconds (EigenGP gives similar results).

We further illustrate the predictive mean and standard deviation for a typical run in Figure 3. As shown in Figure 3(a), the predictive mean of SSGP contains reasonable high-frequency components for x ∈ (2, 3), but, as a stationary GP model, these high-frequency components give extra wavy patterns in the left region of x. In addition, the predictive mean on the right is smaller than the true one, probably affected by the small dynamic range of the data on the left. Figure 3(b) shows that the predictive mean of FITC at x ∈ (2, 3) has lower frequency and smaller amplitude than the true function, perhaps influenced by the low-frequency part on the left, x ∈ (0, 2). Indeed, because of the low-frequency part, FITC learns a large kernel width η; the average kernel width learned over the 10 runs is 207.75. This large value affects the quality of the learned basis points (e.g., there is a lack of basis points for the high-frequency region on the right). By contrast, using the same initial kernel width as FITC, EigenGP learns a suitable kernel width (on average, η = 0.07) and provides good predictions, as shown in Figure 3(c).

[Figure 3: Predictions on nonstationary data: (a) SSGP, (b) FITC, (c) EigenGP. The pink dots correspond to noisy data around the true function f(x) = x sin(x^3), represented by the green curves. The blue and red solid curves correspond to the predictive means and ± two standard deviations around the means. The black crosses near the bottom represent the estimated basis points for FITC and EigenGP.]

6.3 Accuracy vs. Time on Real Data

To evaluate the trade-off between prediction accuracy and computational cost, we use three large real datasets. The first is California Housing [Pace and Barry, 1997]; we randomly split the 8-dimensional data into 10,000 training and 10,640 test points. The second is Physicochemical Properties of Protein Tertiary Structures (PPPTS), which can be obtained from Lichman [2013]; we randomly split the 9-dimensional data into 20,000 training and 25,730 test points. The third is Pole Telecomm, which was used in Lázaro-Gredilla et al. [2010]; it contains 10,000 training and 5,000 test samples, each with 26 features. We set M = 25, 50, 100, 200, 400 and the maximum number of optimization iterations to 100 for all methods.

The NMSE, MNLP, and training time of these methods are shown in Figure 4. In addition, we ran the Nyström method based on marginal likelihood maximization, which is better than using K-means to set the basis points. Again, the Nyström method performed orders of magnitude worse than the other methods. With M = 25, 50, 100, 200, 400, on California Housing the Nyström method uses 146, 183, 230, 359, and 751 seconds for training, respectively, and gives NMSEs of 917, 120, 103, 317, and 95; on PPPTS, the training times are 258, 304, 415, 853, and 2001 seconds and the NMSEs are 1.7 × 10^5, 6.4 × 10^4, 1.3 × 10^4, 8.1 × 10^3, and 8.1 × 10^3; and on Pole Telecomm, the training times are 179, 205, 267, 478, and 959 seconds and the NMSEs are 2.3 × 10^4, 5.8 × 10^3, 3.8 × 10^3, 4.5 × 10^2, and 84. The MNLPs are consistently large and are omitted here for simplicity.

For RVMs, we include cross-validation in the training time because choosing an appropriate kernel width is crucial for the RVM. Since the RVM learns the number of basis functions automatically from the data, Figure 4 shows a single result for it on each dataset. EigenGP achieves the lowest prediction error in less time. Compared with EigenGP based on sequential optimization, EigenGP* achieves similar errors but takes longer because the joint optimization is more expensive.

[Figure 4: NMSE and MNLP vs. training time (in seconds) on (a) California Housing, (b) PPPTS, and (c) Pole Telecomm for RVM, FITC, SSGP, EigenGP, and EigenGP*. Each method (except RVMs) has five results associated with M = 25, 50, 100, 200, 400, respectively. In (c), the fifth result of FITC is out of the plotted range; the actual training time is 3485 seconds, the NMSE 0.033, and the MNLP −1.27. The MNLP values for RVMs are 1.30, 1.25, and 0.95, respectively.]

7 Conclusions

In this paper we have presented a simple yet effective sparse Gaussian process method, EigenGP, and applied it to regression. Despite its similarity to the Nyström method, EigenGP improves on its prediction quality by several orders of magnitude. EigenGP can be easily extended to conduct online learning, either by using stochastic gradient descent to update the weights of the eigenfunctions or by applying the online variational Bayes idea for GPs [Hensman et al., 2013].

Acknowledgments

This work was supported by NSF ECCS-0941043, NSF CAREER award IIS-1054903, and the Center for Science of Information, an NSF Science and Technology Center, under grant agreement CCF-0939370.
References

Noel Cressie and Gardar Johannesson. Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209-226, February 2008.

Lehel Csató and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14:641-668, March 2002.

Jan de Leeuw. Derivatives of generalized eigensystems with applications. In Department of Statistics Papers. Department of Statistics, UCLA, 2007.

Anita C. Faul and Michael E. Tipping. Analysis of sparse Bayesian learning. In Advances in Neural Information Processing Systems 14. MIT Press, 2001.

James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013), pages 282-290, 2013.

Dave Higdon. Space and space-time modeling using process convolutions. In Clive W. Anderson, Vic Barnett, Philip C. Chatwin, and Abdel H. El-Shaarawi, editors, Quantitative Methods for Current Environmental Issues, pages 37-56. Springer Verlag, 2002.

Luc Hoegaerts, Johan A. K. Suykens, Joos Vandewalle, and Bart De Moor. Subset based least squares subspace regression in RKHS. Neurocomputing, 63:293-323, January 2005.

Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl E. Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11:1865-1881, 2010.

Moshe Lichman. UCI machine learning repository, 2013.

David J. C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415-447, 1992.

Youssef M. Marzouk and Habib N. Najm. Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems. Journal of Computational Physics, 228(6):1862-1902, 2009.

Thomas P. Minka. Old and new matrix algebra useful for statistics. Technical report, MIT Media Lab, 2001.

R. Kelley Pace and Ronald Barry. Sparse spatial autoregressions. Statistics and Probability Letters, 33(3):291-297, 1997.

Christopher J. Paciorek and Mark J. Schervish. Nonstationary covariance functions for Gaussian process regression. In Advances in Neural Information Processing Systems 16. MIT Press, 2004.

Yuan Qi, Ahmed H. Abdel-Gawad, and Thomas P. Minka. Sparse-posterior Gaussian processes for general likelihoods. In Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, 2010.

Joaquin Quiñonero-Candela and Carl E. Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1935-1959, December 2005.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18. MIT Press, 2006.

Michael E. Tipping. The relevance vector machine. In Advances in Neural Information Processing Systems 12. MIT Press, 2000.

Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In The 12th International Conference on Artificial Intelligence and Statistics, pages 567-574, 2009.

Christopher K. I. Williams and David Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998.

Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in Neural Information Processing Systems 13. MIT Press, 2001.

Feng Yan and Yuan Qi. Sparse Gaussian process regression via ℓ1 penalization. In Proceedings of the 27th International Conference on Machine Learning, pages 1183-1190, 2010.

Kai Zhang, Ivor W. Tsang, and James T. Kwok. Improved Nyström low-rank approximation and error analysis. In Proceedings of the 25th International Conference on Machine Learning, pages 1232-1239. ACM, 2008.
