Learning Kernels for Structured Prediction using Polynomial Kernel Transformations

Chetan Tonde, Ahmed Elgammal
Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA.
{cjtonde,elgammal}@cs.rutgers.edu

January 8, 2016

Abstract

Learning the kernel functions used in kernel methods has been a vastly explored area in machine learning. It is now widely accepted that obtaining 'good' performance hinges on learning a good kernel function. In this work we focus on learning kernel representations for structured regression. We propose the use of polynomial expansions of kernels, referred to as Schoenberg transforms and Gegenbauer transforms, which arise from the seminal result of Schoenberg [1938]. These kernels can be thought of as polynomial combinations of input features in a high-dimensional reproducing kernel Hilbert space (RKHS). We learn kernels over the input and output of structured data such that the dependency between kernel features is maximized, and we use the Hilbert-Schmidt Independence Criterion (HSIC) to measure this dependency. We also give an efficient, matrix-decomposition-based algorithm to learn these kernel transformations, and demonstrate state-of-the-art results on several real-world datasets.

1 Introduction

The goal in supervised structured prediction is to learn a prediction function f: X → Y from an input domain X to an output domain Y. As an example, in articulated human pose estimation the input x ∈ X would be an image of a human performing an action, and the output would be an interdependent vector of joint positions (x, y, z). Typically, the space of functions H is fixed (decision trees, neural networks, SVMs) and parametrized. We estimate these parameters from a given set of training examples S = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} ⊆ X × Y, drawn independently and identically distributed (i.i.d.) from P(x, y) over X × Y. We formulate a meaningful loss function L: Y × Y → R, such as the 0-1 loss, the squared loss [Weston et al., 2002], or a structured loss [Taskar and Guestrin, 2003, Tsochantaridis et al., 2004, Bo and Sminchisescu, 2010, Bratieres et al., 2013]. At prediction time, given an input x* ∈ X, we search over all possible labels y ∈ Y for the label y* that minimizes this loss given all of the training data:

    y* = f(x*) = argmin_{y ∈ Y} L(f(x*), y)

In the case of kernel methods for structured prediction [Weston et al., 2002, Taskar and Guestrin, 2003, Tsochantaridis et al., 2004, Bo and Sminchisescu, 2010, Schölkopf et al., 2006, Nowozin and Lampert, 2011], the space of functions H is specified by positive definite kernel functions, which are jointly defined on the input and output space as h((x, y), (x', y')). In the most common case, this kernel is factorized over the input and output domains as k(x, x') and g(y, y'), with input and output elements x, x' ∈ X and y, y' ∈ Y, respectively. These individual kernels map the data to reproducing kernel Hilbert spaces (RKHS), or kernel feature spaces, denoted by K and G, respectively. It is well known that the performance of kernel algorithms critically depends on the choice of the kernel functions k(x, x') and g(y, y'), and learning them is a challenging problem. In this work, we propose a method to learn kernel feature space representations over input and output data for the problem of structured prediction.

Our contributions are as follows. First, we propose the use of the monomial expansion of dot product kernels and the Gegenbauer expansion of radial kernels to learn a polynomial combination of kernel features over both input and output. Second, we propose an efficient, matrix-based algorithm to solve for these expansions.
Third, we show state-of-the-art results on synthetic and real-world datasets using Twin Gaussian Processes of Bo and Sminchisescu [2010] as a prototypical kernel method for structured prediction.

1.1 Related work

In their seminal work, Micchelli and Pontil [2005] showed that we can parametrize a family of kernels F over a compact set Ω as

    F = { ∫_Ω G_ω dp(ω) : p ∈ P(Ω) },

where P(Ω) is the set of all probability measures on Ω and G_ω is a base kernel parametrized by ω. To illustrate with an example, if we set Ω ⊆ R_+ and take G_ω to be a multivariate Gaussian kernel with variance ω, then F corresponds to a subset of the class of radial kernels.

Most kernel learning frameworks in the past have focused on learning a single kernel from a family of kernels F defined using the above equation, and all of them address classification or regression. Table 1 lists previous work on learning kernels for different choices of Ω and base kernel G_ω(x, y) over a data domain X. Some of these approaches work iteratively [Argyriou et al., 2005, 2006], while others use optimization methods such as semi-infinite programming [Özögür-Akyüz and Weber, 2010] or QCQP [Shirazi et al., 2010]. A review paper by Suzuki and Sugiyama [2011] surveys many of these works on various Multiple Kernel Learning algorithms. The works most relevant to ours use results on polynomial expansions of kernels: Smola et al. [2000] learn dot product kernels using the monomial basis {1, x, x^2, ...}, and Shirazi et al. [2010] learn shift-invariant kernels using radial base kernels; both were proposed for classification and regression.

Table 1: Kernel learning frameworks which learn kernels as convex combinations of base kernels G_ω(x, y).

    Kernel family and base kernel G_ω(x, y)                               | Related work
    Radial kernels (iterative), G_ω(x, y) = e^{-ω||x - y||^2}             | Argyriou et al. [2005, 2006]
    Dot product kernels, G_ω(x, y) = e^{ω⟨x, y⟩}                          | Argyriou et al. [2005, 2006]
    Finite convex sum of kernels (SimpleMKL)                              | Rakotomamonjy et al. [2008]
    Radial kernels (semi-infinite programming, Infinite Kernel Learning)  | Gehler and Nowozin [2008]
    Shift-invariant kernels (radial and anisotropic), G_ω(x - y)          | Shirazi et al. [2010]

We use these expansion results to learn kernels for the problem of structured regression. We propose a framework to learn kernels for structured prediction that uses monomial and Gegenbauer expansions to learn a positive combination of base kernels. The Gegenbauer basis has the advantage of being orthonormal and provides a weight parameter γ that helps avoid the Gibbs phenomenon observed in interpolation [Gottlieb et al., 1992].

The outline of this paper is as follows. In Section 2 we describe radial and dot product kernels along with their respective polynomial expansions. In Section 3 we describe the dependency criterion (HSIC) used in our problem formulation. In Section 4 we describe our proposed algorithm. In Section 5 we describe Twin Gaussian Processes used for structured prediction. In Section 6 we present experimental results on several synthetic and real-world datasets. Finally, Sections 7 and 8 are the discussion and conclusion, respectively.

2 Kernel Transformations

2.1 Radial and Dot Product Kernels

A classical result of Schoenberg [1938] states that any continuous, isotropic kernel k(x, x') is positive definite if and only if k(x, x') = φ(⟨x, x'⟩), where φ(t) is a real-valued function of the form

    φ(t) = Σ_{k=0}^∞ α_k G_k^λ(t),    t ∈ [-1, 1],

with α_k ≥ 0 for all k ∈ Z_+ and Σ_{k=0}^∞ α_k G_k^λ(1) < ∞. The symbol G_k^λ(t) stands for what are known as the ultraspherical or Gegenbauer basis polynomials. Examples of such kernels include the Gaussian and Laplacian kernels.
To understand this better, consider the result of Bochner [1959], which says that a continuous, shift-invariant kernel k(·) is positive definite if and only if it is the inverse Fourier transform of a finite non-negative measure µ on R^d,

    k(x - y) = φ(z) = ∫_{R^d} e^{√-1 ⟨z, s⟩} dµ(s),    x, y, s ∈ R^d, z = x - y.

This result allows us to represent a shift-invariant kernel uniquely as a spectral distribution in the spectral domain. Now, if we take the Fourier transform of the expansion in Section 2.1, we obtain a unique representation of the positive definite kernel φ(t), and also unique representations of the positive definite base kernels G_k^λ(t), k = 1, 2, ..., in the spectral domain. Hence, the kernel expansion is a nonnegative mixture of base spectral distributions whose kernels are given by G_k^λ(t), with nonnegative weights α_k.

If we look at the monomial basis instead of the Gegenbauer basis, a similar interpretation can be given for dot product kernels, as mentioned in Smola et al. [2000]. Additionally, we know that the span of the monomials forms a dense basis in L_2[-1, 1], and by the Weierstrass approximation theorem [Szegö, 1939] any continuous function on [-1, 1] can be approximated uniformly on that interval by polynomials to any degree of accuracy. The Gegenbauer basis is orthonormal and provides additional benefits in interpolation accuracy for functions with sharp changes, thus avoiding the so-called Gibbs phenomenon [Gottlieb et al., 1992].

In our work, we use the above expansions to build a kernel k'(·,·) on features obtained from an initial kernel k(·,·) defined on X × X. We refer to k(·,·) as the initial kernel, and estimate a φ(t) which, when applied to the kernel matrix [K]_{i,j} = k(x_i, x_j), gives us a new kernel matrix [K']_{i,j} = φ([K]_{i,j}). Figure 1 illustrates this pictorially.

[Figure 1: Finding the kernel transformation φ(·).]

2.2 Monomial Transformations

In the case of monomial basis functions we have the following expansion.

Theorem 2.1 (Schoenberg [1938]). For a continuous function φ: [-1, 1] → R and k(x, y) = φ(⟨x, y⟩), the kernel matrix K' defined as K'_{i,j} = φ([K]_{i,j}) is positive definite for any positive definite matrix K if and only if φ(·) is real entire and of the form

    k'(⟨x, y⟩) = φ(t) = Σ_{i=0}^∞ α_i t^i    (1)

with α_i ≥ 0 for all i ≥ 0.

So φ(t) is infinitely differentiable and all coefficients α_i are non-negative (e.g. φ(t) = e^t). A benefit of this monomial expansion is that it is easy to compute and can be readily evaluated in parallel.

2.3 Gegenbauer Transformations

We obtain the Gegenbauer basis if we insist on having an orthonormal expansion. These expansions are also more general in the sense that they include a weight function w(t; γ), which controls the size of the function space used for representation through a parameter γ > -1/2. Formally,

Theorem 2.2 (Schoenberg [1938]). For a real polynomial φ: [-1, 1] → R and any finite X = {x_1, x_2, x_3, ...}, the matrix [φ(⟨x_i, x_j⟩)]_{i,j} is positive definite if and only if φ(t) is a nonnegative linear combination of Gegenbauer polynomials G_i^γ(t), that is,

    φ(t) = Σ_{i=0}^∞ α_i G_i^γ(t)    (2)

with α_i ≥ 0 and Σ_i α_i G_i^γ(1) < ∞.

Definition 2.1. The Gegenbauer polynomials are defined by

    G_0^γ(t) = 1,    G_1^γ(t) = 2γt, ...,    (3)
    G_{i+1}^γ(t) = (2(γ + i)/(i + 1)) t G_i^γ(t) - ((2γ + i - 1)/(i + 1)) G_{i-1}^γ(t)    (4)

As stated earlier, G_k^γ and G_l^γ are orthogonal on [-1, 1]: all the polynomials are orthogonal with respect to the weight function w(t; γ) = (1 - t^2)^{γ - 1/2}, γ > -1/2,

    ∫_{-1}^{1} G_k^γ(t) G_l^γ(t) w(t; γ) dt = 0,    k ≠ l.    (5)
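To make these transformations concrete, the following NumPy sketch evaluates the Gegenbauer recurrence of Definition 2.1 and applies a truncated expansion elementwise to a base kernel matrix, as in Eqs. (1) and (2). This is our own illustration: the function names, example coefficients, and toy Gaussian base kernel are not from the paper, and only the monomial case carries the positive-definiteness guarantee of Theorem 2.1 for arbitrary positive definite inputs.

```python
import numpy as np

def gegenbauer_basis(t, degree, gamma):
    """Evaluate G_0^gamma(t), ..., G_degree^gamma(t) elementwise on an array t,
    using the recurrence of Definition 2.1 (Eqs. 3-4)."""
    t = np.asarray(t, dtype=float)
    G = [np.ones_like(t), 2.0 * gamma * t]
    for i in range(1, degree):
        G.append((2.0 * (gamma + i) / (i + 1)) * t * G[i]
                 - ((2.0 * gamma + i - 1) / (i + 1)) * G[i - 1])
    return G[:degree + 1]

def transform_kernel(K, alphas, gamma=None):
    """Apply phi elementwise to a base kernel matrix K (entries assumed in [-1, 1]).
    gamma=None uses the monomial basis of Eq. (1); otherwise the Gegenbauer basis of Eq. (2)."""
    alphas = np.asarray(alphas, dtype=float)
    assert np.all(alphas >= 0), "expansion coefficients must be non-negative"
    if gamma is None:
        basis = [K ** i for i in range(len(alphas))]          # elementwise powers t^i -> K^(i)
    else:
        basis = gegenbauer_basis(K, len(alphas) - 1, gamma)   # elementwise G_i^gamma(K)
    return sum(a * B for a, B in zip(alphas, basis))

# Toy check: a normalized Gaussian base kernel (entries in (0, 1]) transformed with
# non-negative monomial coefficients remains positive semidefinite (Theorem 2.1).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
K = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
K_mono = transform_kernel(K, alphas=[0.1, 0.5, 0.4])               # phi(t) = 0.1 + 0.5 t + 0.4 t^2
K_geg = transform_kernel(K, alphas=[0.1, 0.5, 0.4], gamma=0.51)    # Gegenbauer counterpart
print(np.linalg.eigvalsh(K_mono).min() >= -1e-8)                   # True
```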
The weight function above defines a weighted inner product on the space of functions, with norm and inner product given by

    ||φ||^2 = ∫_{-1}^{1} w(t; γ) φ(t) φ(t) dt,    (6)
    ⟨φ, ψ⟩ = ∫_{-1}^{1} w(t; γ) φ(t) ψ(t) dt.    (7)

This weight function controls the space of functions over which we estimate these polynomial maps. Figure 2 shows the weight function obtained for different values of the γ parameter. We observe that the value of γ controls the behavior of the function at the boundary points {1, -1}. Having this weight function also improves the quality of interpolation by avoiding the Gibbs phenomenon, as further explained in Gottlieb et al. [1992].

[Figure 2: Weight function w(t; γ) for γ = 0.5, 1.0, 10.0. The value of γ controls the behavior of the weight function at the boundaries, and equivalently the size of the function space.]

To summarize, given a base kernel matrix K on the input data, we expand the target kernel K' as a transformation φ(t) applied to K. This maps the initial RKHS features k(x, ·) to new RKHS features k'(x, ·). These two forms of polynomial mapping representations have been known in the literature, but have mostly been used for non-structured prediction tasks like regression and classification. In our approach, we use these expansions on both input and output kernels so as to maximize the dependence between the mapped feature spaces k'(x, ·) and g'(y, ·), using the Hilbert-Schmidt Independence Criterion (HSIC) described below.

3 Hilbert-Schmidt Independence Criterion

To measure cross-correlation, or dependence, between structured input and output data in kernel feature space, Gretton et al. [2005] proposed the Hilbert-Schmidt Independence Criterion (HSIC). Given i.i.d. sampled input-output pairs {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)} ~ P(x, y), HSIC measures the dependence between X and Y.

Definition 3.1. If we have two RKHSs K and G, then a measure of statistical dependence between X and Y is given by the norm of the cross-covariance operator M_xy: G → K, defined as

    M_xy := E_{x,y}[(k(x, ·) - µ_x) ⊗ (g(y, ·) - µ_y)]    (8)
          = E_{x,y}[k(x, ·) ⊗ g(y, ·)] - µ_x ⊗ µ_y,    (9)

and the measure is given by the Hilbert-Schmidt norm of M_xy,

    HSIC(p_xy, K, G) := ||M_xy||^2_HS.    (10)

So the larger the above norm, the higher the statistical dependence between X and Y. The advantages of using HSIC for measuring statistical dependence, as stated in Gretton et al. [2005], are as follows: first, it has good uniform convergence guarantees; second, it has low bias even in high dimensions; and third, a number of algorithms can be viewed as maximizing HSIC subject to constraints on the labels/outputs. Empirically, in terms of kernel matrices, it is defined as follows.

Definition 3.2. Let Z := {(x_1, y_1), ..., (x_m, y_m)} ⊆ X × Y be a series of m independent observations drawn from p_xy. An empirical estimator of HSIC(Z, K, G) is given by

    HSIC(Z, K, G) = (m - 1)^{-2} trace(KHGH)    (11)

where K, H, G ∈ R^{m×m}, [K]_{i,j} := k(x_i, x_j), [G]_{i,j} := g(y_i, y_j), and [H]_{i,j} := δ_{ij} - m^{-1}.

For well-defined, normalized and bounded kernels K and G, we have HSIC(Z, K, G) ≥ 0. For ease of discussion we denote HSIC(Z, K, G) by HSIC(K, G). We use HSIC to measure the dependence between kernels in the target kernel feature space, and estimate the transformations (the α_i's and β_j's) by maximizing it.

4 Learning Kernel Transformations

The goal in structured prediction is to predict an output label y ∈ Y given an input example x ∈ X. Our thesis is that if the output kernel feature g'(y, ·) is more correlated with (dependent on) the input kernel feature k'(x, ·), then we can significantly improve the regression performance. We propose to use HSIC for this.
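A minimal NumPy implementation of the empirical estimator in Eq. (11); the helper name and the toy dependence check are our own illustration, not code from the paper.

```python
import numpy as np

def hsic(K, G):
    """Empirical HSIC of Eq. (11): (m-1)^{-2} trace(K H G H), with H the centering matrix."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ G @ H) / (m - 1) ** 2

# Sanity check: HSIC should be larger when the outputs depend on the inputs.
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
y_dep = np.sin(3 * x) + 0.1 * rng.normal(size=x.shape)   # dependent outputs
y_ind = rng.normal(size=x.shape)                         # independent outputs
gauss = lambda a: np.exp(-0.5 * (a - a.T) ** 2)          # 1-D Gaussian kernel matrix
print(hsic(gauss(x), gauss(y_dep)) > hsic(gauss(x), gauss(y_ind)))   # expected: True
```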
Maximizing HSIC (or, equivalently, kernel target alignment) has been used as an objective for learning kernels in the past [Cortes et al., 2012]. These frameworks maximize the HSIC or alignment between inputs and outputs, which in our case are structured objects. Shawe-Taylor and Kandola [2002] provide generalization bounds which confirm that maximizing alignment (i.e. HSIC) does indeed lead to better generalization.

Following on this, let K' = φ(K) and G' = ψ(G), where K and G are normalized base kernels on the input and output data. Using the empirical definition of HSIC, we define the following objective function to optimize:

    L(α*, β*) = max_{α,β} HSIC(φ(K), ψ(G))    (12)
    subject to α_i ≥ 0, β_j ≥ 0, ∀ i, j ≥ 0.    (13)

In the case of the monomial basis of Section 2.2, applying Equation (1) to the kernel matrices K and G gives the maps φ: K → K' and ψ: G → G' as

    φ(K) = Σ_{i=0}^∞ α_i K^(i),    α_i ≥ 0, ∀ i ≥ 0,    (14)
    ψ(G) = Σ_{j=0}^∞ β_j G^(j),    β_j ≥ 0, ∀ j ≥ 0,    (15)

where K^(i) is the kernel obtained by applying the i-th polynomial basis function t^i to the base kernel matrix K (and similarly G_k^λ(t) for the Gegenbauer basis). Figure 3 illustrates this. Assuming our base kernels are both bounded and normalized, we also need the new kernels φ(K) and ψ(G) to be at least bounded. For this purpose, we impose an l_2-norm regularization constraint on the α_i's and β_j's, that is, ||α||_2 = 1 and ||β||_2 = 1. These constraints are similar to the l_2-norm regularization constraint on the positive mixture coefficients in the Multiple Kernel Learning framework of Cortes et al. [2012]. They help generalization to unseen test data and also prevent an arbitrary increase of the optimization objective.

[Figure 3: Twin Kernel Learning.]

Substituting the expansions (14) and (15) into the objective (12) and simplifying, we get

    maximize Σ_{i=0}^∞ Σ_{j=0}^∞ α_i β_j C_{i,j}    (16)
    subject to ||α||_2 = 1, ||β||_2 = 1,

where the C-matrix is such that [C]_{i,j} = HSIC(K^(i), G^(j)). We also know that the entries of the C-matrix are non-negative. The C-matrix looks as follows:

               G^(0)      G^(1)      ...    G^(d_2)
    K^(0)      C_{0,0}    C_{0,1}    ...    C_{0,d_2}
    K^(1)      C_{1,0}    C_{1,1}    ...    C_{1,d_2}
      .           .          .        .        .
    K^(d_1)    C_{d_1,0}  C_{d_1,1}  ...    C_{d_1,d_2}    (17)

with C_{i,j} = HSIC(K^(i), G^(j)).

To explain further, every entry [C]_{i,j} = HSIC(K^(i), G^(j)) represents higher-order cross-correlations among the polynomial combinations of features of order i and j of the input and output, respectively. Hence, by appropriately choosing the coefficients α_i and β_j we maximize these higher-order cross-correlations in the target kernel feature space. In practice we truncate the sums to finite degrees d_1 and d_2, which leads to a finite-dimensional problem with deg(φ) = d_1 and deg(ψ) = d_2. This amounts to using all polynomial combinations of features up to degree d_1 for the input and d_2 for the output. The higher the degree we choose, the larger the dependence and, intuitively, the better the prediction. The upper bound on the degree is computationally limited by the added non-linearity in the kernel-based prediction algorithm and can lead to overfitting, so we choose these degrees empirically by cross-validation until the performance of the prediction algorithm saturates.

We have the following theorem regarding the solution of the optimization problem (16).

Theorem 4.1. The solution (α*, β*) to the optimization problem in (16) is given by the first left and right singular vectors of the C-matrix.

Proof. By the Perron-Frobenius theorem [Chang et al., 2008] applied to the square non-negative matrices CC^T and C^TC, both have non-negative Perron (leading) eigenvectors; these are the first left and right singular vectors α* and β* of C, and they maximize Eq. (16).

The above theorem gives us the required solution to the problem. Hence, to solve for the unknowns α_i and β_j, we compute the Singular Value Decomposition (SVD) of the C-matrix and choose α and β to be its first left and right singular vectors (Theorem 4.1). The non-negativity of the α and β vectors is guaranteed by the non-negativity of the C-matrix combined with the Perron-Frobenius theorem [Chang et al., 2008].

We also observe that α_0 = β_0 = 0, so choosing d_1 = 1 corresponds to using the identity mapping φ(t) = t on the input kernel, which corresponds to using the initial kernel only, or equivalently no mapping on the input kernel. A similar logic applies if we set d_2 = 1: then ψ(t) = t and there is no mapping on the output base kernel. This is interesting because if we set d_2 = 1 and d_1 to some arbitrary value greater than one, then the solution is exactly α*_i ∝ HSIC(K^(i), G), which is the same as choosing coefficients based on kernel alignment as in Cortes et al. [2012].

We would also like to point out the similarity of our proposed objective to that of Kernel Canonical Correlation Analysis (KCCA), which also uses HSIC [Chang et al., 2013]. In KCCA we find two nonlinear mappings φ(·) ∈ K and ψ(·) ∈ G from their prespecified RKHSs that maximize statistical correlation. In our approach, we instead look for analytical kernel transformations φ(·) and ψ(·) of the initial kernel matrices that maximize the same objective.
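The full learning step is compact enough to sketch in NumPy: build the C-matrix of Eq. (17) for the monomial case, take its leading singular vectors as (α*, β*) per Theorem 4.1, and apply the learned map as in Eq. (14). For the Gegenbauer case the elementwise powers would be replaced by the basis matrices G_i^γ(K). Function names and the sign-normalization detail are ours, not the authors' code.

```python
import numpy as np

def hsic(K, G):
    """Empirical HSIC of Eq. (11)."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ G @ H) / (m - 1) ** 2

def learn_transformations(K, G, d1, d2):
    """Return (alpha, beta) for monomial maps of degrees d1 (input) and d2 (output):
    the first left/right singular vectors of the C-matrix, C[i, j] = HSIC(K^(i), G^(j))."""
    K_pows = [K ** i for i in range(d1 + 1)]   # elementwise powers K^(i)
    G_pows = [G ** j for j in range(d2 + 1)]   # elementwise powers G^(j)
    C = np.array([[hsic(Ki, Gj) for Gj in G_pows] for Ki in K_pows])
    U, _, Vt = np.linalg.svd(C)
    alpha, beta = U[:, 0], Vt[0, :]
    if alpha.sum() < 0:                        # SVD may return the negated Perron pair
        alpha, beta = -alpha, -beta
    return alpha, beta                         # unit l2 norm; non-negative by Perron-Frobenius

def apply_transform(K, coeffs):
    """phi(K) = sum_i coeffs[i] * K^(i), applied elementwise (Eq. 14)."""
    return sum(c * K ** i for i, c in enumerate(coeffs))
```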
5 Twin Gaussian Processes

Twin Gaussian Processes (TGP) [Bo and Sminchisescu, 2010] are a recent and popular structured prediction method which models the input and output domains using Gaussian processes with covariance functions represented by K and G. These covariance matrices encode prior knowledge about the underlying process being modeled. In many real-world applications the data is high-dimensional and highly structured, and the choice of kernel functions is not obvious. In our work, we aim to learn both kernel covariance matrices simultaneously, and we use TGP as an illustrative example to demonstrate the benefits of doing so; we note, however, that our framework is not limited to Twin Gaussian Processes.

In TGP the choice of the auxiliary evaluation function is typically some form of information measure, e.g. KL-divergence or HSIC, which are known to be special cases of Bregman divergences (see Banerjee et al. [2005]). KL-divergence is an asymmetric measure of information, while HSIC is symmetric in its arguments. We refer to the two versions of Twin Gaussian Processes below, corresponding to these two measures, as TGP with KL-Divergence (or simply TGP) and TGP with HSIC.

TGP with KL-Divergence: In this version of TGP, we minimize the KL-divergence between the kernels K and G, given the training data X × Y and a test example x*. The prediction function is

    y* = argmin_y D_KL(G_{Y∪y} || K_{X∪x*}).    (18)

TGP with HSIC: For this version of TGP with the HSIC criterion, the prediction function maximizes the HSIC between the kernels K and G given the training data X × Y and a test example x*. The prediction function is

    y* = argmax_y HSIC(G_{Y∪y}, K_{X∪x*}).    (19)
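The paper does not spell out here how the search over y in Eqs. (18)-(19) is carried out. Purely as an illustration of the HSIC criterion of Eq. (19), the sketch below scores a finite set of candidate outputs (defaulting to the training outputs) and returns the best one; this candidate-search shortcut and the helper names are our own simplification, not the authors' optimizer.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def hsic(K, G):
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ G @ H) / (m - 1) ** 2

def predict_tgp_hsic(X, Y, x_star, gamma_x, gamma_y, candidates=None):
    """Eq. (19): y* = argmax_y HSIC(G_{Y u y}, K_{X u x*}), scored over a candidate set."""
    if candidates is None:
        candidates = Y                           # crude stand-in for a continuous search
    X_aug = np.vstack([X, x_star[None, :]])
    K_aug = gaussian_kernel(X_aug, X_aug, gamma_x)
    scores = []
    for y in candidates:
        Y_aug = np.vstack([Y, y[None, :]])
        scores.append(hsic(K_aug, gaussian_kernel(Y_aug, Y_aug, gamma_y)))
    return candidates[int(np.argmax(scores))]
```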
5.1 Modified Twin Gaussian Processes

We also evaluate both of the criteria proposed above against the mapping degrees d_1 and d_2. This allows us to show how each information measure is affected as the mapping degrees d_1 and d_2 are increased. The relationship we observe is straightforward and direct, allowing the choice of d_1 and d_2 to be made easily. We refer to these new modified TGPs as Higher Order TGP with KL-Divergence (HOTGP) and Higher Order TGP with HSIC (HOHSIC).

Modified TGP with KL-Divergence: In this version of TGP, we minimize the KL-divergence between the transformed kernels φ(K) and ψ(G), given the training data X × Y and a test example x*. The prediction function for HOTGP is

    y* = argmin_y D_KL(ψ(G_{Y∪y}) || φ(K_{X∪x*})).    (20)

Modified TGP with HSIC: For this version of TGP with the HSIC criterion, the prediction function maximizes the HSIC between the transformed kernels φ(K) and ψ(G) given the training data X × Y and a test example x*. The prediction function for HOHSIC is

    y* = argmax_y HSIC(ψ(G_{Y∪y}), φ(K_{X∪x*})).    (21)

6 Experiments

We show empirical results using Twin Gaussian Processes with KL-Divergence and with HSIC, using both monomial and Gegenbauer transformations, on synthetic and real-world datasets. To measure the improvement in performance over the baseline, we look at the empirical reduction in error, which we call the %Gain, defined as

    %Gain = (1 - Error_(mapping) / Error_(no mapping)) × 100.

In all our experiments, we use Gaussian kernels k(x_i, x_j) = exp(-γ_x ||x_i - x_j||^2) and g(y_i, y_j) = exp(-γ_y ||y_i - y_j||^2) as base kernels on the input and output, respectively. The bandwidth parameters γ_x and γ_y were chosen by cross-validation using the base kernel on the original dataset. The weight parameters were chosen to be λ_1 = 0.51 and λ_2 = 0.52 using rough estimates from the expressions in Gottlieb et al. [1992], and validated on a validation set. For the choice of expansion degree, we increase the degrees d_1 and d_2 until the %Gain saturates in cross-validation. We learn the kernel transformations φ(·) and ψ(·) using the approach proposed above.
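For reference, a short sketch of the base kernels and the evaluation measures used above; the helper names are ours, not from the paper.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """Base kernel exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    return np.exp(-gamma * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

def mae(Y_pred, Y_true):
    """Mean absolute error, the metric reported for the datasets below."""
    return np.mean(np.abs(Y_pred - Y_true))

def percent_gain(err_mapping, err_no_mapping):
    """%Gain = (1 - Error_mapping / Error_no_mapping) * 100."""
    return (1.0 - err_mapping / err_no_mapping) * 100.0
```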
6.1 Datasets

6.1.1 Synthetic Data

S-Shape Dataset: The S-shape synthetic dataset from Dalal and Triggs [2002] is a simple 1D input/output regression problem. In this dataset, 500 values of the input x are sampled uniformly in (0, 1) and evaluated as r = x + 0.3 sin(2πx) + ε, with ε drawn from zero-mean Gaussian noise with standard deviation σ = 0.05 (Figure 4). The goal here is to solve the inverse problem, which is to predict x given r (a small generation sketch is given at the end of this section). The dataset is challenging in the sense that it is multivalued (in the middle of the S-shape), discontinuous (at the boundary of the uni-valued and multivalued regions), and noisy (ε ~ N(0, σ)). The error metric used is mean absolute error (MAE).

[Figure 4: S-shape dataset; output x plotted against input r.]

Poser Dataset: The Poser dataset contains synthetic images of human motion capture sequences from Poser 7 (Smith Micro Software). The motion sequences include 8 categories: walk, run, dance, fall, prone, sit, transitions and misc. There are 1927 training examples coming from different sequences of varying lengths, and the test set is a continuous sequence of 418 time steps. Input feature vectors are 100d silhouette shape descriptors, while output feature vectors are 54d vectors with the x, y and z rotations of the joint angles. The error metric is mean absolute error (MAE) in mm.

6.1.2 Real-world Data

USPS Handwritten Digits Reconstruction (Figure 5): In the USPS handwritten digit reconstruction dataset from Weston et al. [2002], our goal is to predict the 16 pixel values at the center of an image given the outer pixels. We use 7425 examples for training (without labels) and 2475 examples (roughly 1/4th for each digit) for testing. The error metric here is the reconstruction error measured using mean absolute error (MAE).

[Figure 5: USPS digits.]

HumanEva-I Pose Dataset (Figure 6): The HumanEva-I dataset from Sigal and Black [2006] is a challenging dataset that contains real motion capture sequences from three different subjects (S1, S2, S3) performing five different actions (Walking, Jogging, Box, Throw/Catch, Gestures). We train models on all subjects and all actions. We have input images from three different cameras, C1, C2 and C3, and we use HoG features from Dalal and Triggs [2005] on them. The output vectors are 60d with the x, y, z joint positions in mm. We report results using concatenated features from all three cameras (C1+C2+C3) and also individual features from each camera (C1, C2 or C3).

[Figure 6: HumanEva-I.]

6.2 Results

6.2.1 Synthetic Data

S-shape data: We choose the bandwidth parameters to be γ_x = 1 and γ_y = 1 using cross-validation. To illustrate the effect of increasing the degrees d_1 and d_2, we run our experiments on a grid of degrees from the set {1, 2, 3, 5, 7, 11}. In Figure 7 we plot the %Gain against the mapping degree. Figures 7a and 7b show results using KL-Divergence as the optimization criterion, and Figures 7c and 7d show results using HSIC as the optimization criterion. We observe that for both, as we increase d_1 and d_2, the %Gain increases, i.e. the mean absolute error (MAE) reduces. Also, for each pair of figures, i.e. for each objective, changing from the monomial basis to the Gegenbauer basis improves the %Gain from 31.27% to 39.49% for KL-Div, and from 22.31% to 26.06% for HSIC.
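The S-shape generation sketch referenced in Section 6.1.1, following the description given there; the random seed and array shapes are our own choices.

```python
import numpy as np

def make_s_shape(n=500, sigma=0.05, seed=0):
    """x ~ U(0, 1), r = x + 0.3 sin(2 pi x) + eps, eps ~ N(0, sigma);
    the inverse task is to predict x from r (evaluated with MAE)."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=n)
    r = x + 0.3 * np.sin(2.0 * np.pi * x) + rng.normal(0.0, sigma, size=n)
    return r.reshape(-1, 1), x.reshape(-1, 1)   # inputs r, targets x
```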
