ebook img

Passing Expectation Propagation Messages with Kernel Methods PDF

0.15 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Passing Expectation Propagation Messages with Kernel Methods

Passing Expectation Propagation Messages with Kernel Methods WittawatJitkrittum ArthurGretton NicolasHeess∗ 5 GatsbyUnit,UniversityCollegeLondon 1 0 [email protected] 2 [email protected] [email protected] n a J 2 Abstract ] L Weproposetolearnakernel-basedmessageoperatorwhichtakesasinputallex- M pectationpropagation(EP)incomingmessagestoafactornodeandproducesan . outgoingmessage. InordinaryEP,computinganoutgoingmessageinvolvesesti- t a matingamultivariateintegralwhichmaynothaveananalyticexpression. Learn- t ingsuchanoperatorallowsonetobypasstheexpensivecomputationoftheinte- s [ gralduringinferencebydirectlymappingallincomingmessagesintoanoutgoing message. Theoperatorcanbelearnedfromtrainingdata(examplesofinputand 1 output messages) which allows automated inference to be made on any kind of v factorthatcanbesampled. 5 7 3 0 1 Background 0 . 1 0 Existingapproachestoautomatedprobabilisticinferencecanbebroadlydividedintotwocategories 5 (Heessetal.,2013): uninformedandinformedcases. Intheuninformedcase,themodelerhasfull 1 freedominexpressingaprobabilisticmodelwithoutanyconstraintonthesetofusablefactors.This : highflexibilitycomesata price duringinferenceas less factor-specificinformationis availableto v theinferenceengine.OftenMCMC-basedsamplingtechniquesareemployedbytreatingthefactors i X as black boxes. In the informed case (Stan Development Team, 2014; Minka et al., 2012), the r modelerisrequiredtobuilda modelfromconstructswhosenecessarycomputationsare knownto a theinferenceengine.Althoughefficientduringtheinference,usinganunsupportedconstructwould requiremanualderivationandimplementationoftherelevantcomputationsintheinferenceengine. In this work, we focuson EP, a commonlyused approximateinferencemethod. FollowingHeess etal.(2013),weproposetolearnakernel-basedmessageoperatorforEPtocapturetherelationship betweenincomingmessagesto a factor andoutgoingmessages. The operatorbridgesthe gapbe- tweentheuninformedandinformedcasesbyautomaticallyderivingtherelevantcomputationsfor anycustomfactorthatcanbesampled. Thishybridapproachgivesthemodelerasmuchflexibility as in the uninformedcase while offering efficient message updates as in the informed case. This approach supports fast inference as no expensive KL divergence minimization needs to be done duringinferenceas in ordinaryEP. In addition, a learnedoperatorfor a factor is reusablein other modelsinwhichthefactorappears. Aswillbeseen,tosendanoutgoingmessagewiththekernel- basedoperator,itissufficienttogenerateafeaturevectorforincomingmessagesandmultiplywith apre-learnedmatrix.UnlikeHeessetal.(2013)whichconsidersaneuralnetwork,thekernel-based messageoperatorweproposecanbeeasilyextendedtoallowonlineupdatesoftheoperatorduring inference. ∗NicolasHeessisnowaffiliatedwithGoogleDeepmind. 1 2 Expectation Propagation(EP) Expectationpropagation(Minka,2001;Bishop,2006)(EP)isacommonlyusedapproximateinfer- encemethodforinferringtheposteriordistributionoflatentvariablesgivenobservations.Inatypical directedgraphicalmodel,thejointdistributionofthedataX = {X ,...,X }andlatentvariables 1 n θ ={θ ,...,θ }takestheformofaproductoffactors,p(X,θ)= m f (X|θ)whereeachfactor 1 t i=1 i f maydependononlyasubsetofX andθ. WithX observed,EPapproximatestheposteriorwith i q(θ) ∝ m m (θ) wherem isanapproximatefactorcoQrrespondingtof withthecon- i=1 fi→θ fi→θ i straintthatithasachosenparametricform(e.g.,Gaussian)intheexponentialfamily(ExpFam).EP takesintQoaccountthefactthatthefinalquantityofinterestistheposteriorq(θ) whichisgivenby theproductofallapproximatefactors. Infindingtheith approximatefactorm ,EPusesother fi→θ approximate factors m (θ) := m (θ) as a context to determine the plausible range θ→fi j6=i fj→θ ofθ. EPiterativelyrefinesmfi→θ fQoreachiwithmfi→θ(θ) = proj[´dXf(X|mθ)θm→Xf→i(fθi)(X)mθ→fi(θ)] whereproj[r] =argmin KL[rkq]andm (X):= δ(X −X )ifX isobservedtobe q∈ExpFam X→fi 0 X . IntheEPliterature,m isknownasacavitydistribution. 0 θ→fi Theprojectioncanbecarriedoutbythefollowingmomentmatchingprocedure.AssumeanExpFam distributionq(θ|η) = h(θ)exp η⊤u(θ)−A(η) whereu(θ)isthesufficientstatisticofq,ηisthe naturalparameterandA(η) = log´ dθh(θ)exp η⊤u(θ) is the log-partitionfunction. Itcan be (cid:0) (cid:1) shown that q∗ = proj[r] satisfies Eq∗(θ)[u(θ)] = Er(θ)[u(θ)]. That is, the projection of r onto ExpFamisgivenbyq∗ ∈ExpFamthathasthesam(cid:0)emome(cid:1)ntparametersasthemomentsunderr. Ingeneral,undertheapproximationthateachfactorfullyfactorizes,anEPmessagefromafactorf toavariableV takestheform m (v)= proj ´ dV\{v}f(V) V′∈VmV′→f(v′) := proj[rf→V(v)] := qf→V(v) (1) f→V m (v) m (v) m (v) (cid:2) V→Qf (cid:3) V→f V→f whereV = V(f)isthesetofvariablesconnectedtof inthefactorgraph. Inthepreviouscaseof m , wehaveV(f) = {X,θ}andV inEq. 1correspondstoθ. Typically,whenthefactorf is fi→θ complicated,theintegraldefiningr becomesintractable. Quadraturerulesorothernumerical f→V integrationtechniquesareoftenappliedtoapproximatetheintegral. 3 Learning to PassEP Messages Our goal is to learn a message operator Cf→V′ with signature [mV→f]V∈V(f) 7→ qf→V′ which takesinallincomingmessages{mV→f |V ∈V(f)}andoutputsqf→V′(v′)i.e.,thenumeratorof Eq.1.Forinference,werequireonesuchoperatorforeachrecipientvariableV′ ∈V(f)i.e.,intotal |V(f)|operatorsneedtobelearnedforf. Operatorlearningiscastasadistribution-to-distribution regression problem where the training set SV′ := {([mnV→f]V∈V(f),qfn→V′)}Nn=1containing N incoming-outgoingmessage pairs can be generated as in Heess et al. (2013) by importance sam- pling to compute the mean parametersErf→V′(v′)[u(v′)] for momentmatching. In principle, the importance sampling itself can be used in EP for computing outgoing messages (Barthelme´ and Chopin,2011). Theschemeis, however,expensiveasweneedtodrawalargenumberofsamples for each outgoing message to be sent. In our case, the importance sampling is used for data set generationwhichisdoneofflinebeforetheactualinference. The assumptions needed for the generation of a training set are as following. Firstly, we assume the factor f takes the form of a conditional distribution f(v |v ,...,v ). Secondly, given 1 2 |V(f)| v ,...,v , v can be sampled from f(·|v ,...,v ). The ability to evaluate f is not as- 2 |V(f)| 1 2 |V(f)| sumed. Finally we assume that a distributionon the naturalparametersof all incomingmessages {m } isavailable. Thedistributionisusedsolelytogivearoughideaofincomingmes- V→f V∈V(f) sagesthelearnedoperatorwillencounterduringtheactualEPinference.Inpractice,weonlyneedto ensurethatthedistributionsufficientlycoverstherelevantregioninthespaceofincomingmessages. Inrecentyears,therehavebeenanumberofworksontheregressiontaskwithdistributionalinputs, includingPoczosetal.(2013);Szaboetal.(2014)whichmainlyfocusonthenon-parametriccase andareoperatedundertheassumptionthatthesamplesfromthedistributionsareobservedbutnot 2 thedistributionsthemselves. Inourcase,thedistributions(messages)aredirectlyobserved. More- over,sincethedistributionsareinExpFam,theycanbecharacterizedbyafinite-dimensionalnatural parametervectororexpectedsufficientstatistic. Hence,wecansimplifyourtasktodistribution-to- vector regression where the outputvector contains a finite number of momentssufficient to char- acterize qf→V′. As regression input distributions are in ExpFam, one can also treat the task as vector-to-vectorregression.However,seeingtheinputsasdistributionsallowsonetousekernelson distributionswhichareinvarianttoparametrization. OncethetrainingsetSV′ isobtained,anydistribution-to-vectorregressionfunctioncanbeapplied to learn a message operatorCf→V′. Givenincomingmessages, the operatoroutputsqf→V′ from whichtheoutgoingEPmessageisgivenbymf→V′ =qf→V′/mV′→f whichcanbecomputedan- alytically.Weoptforkernelridgeregression(Scho¨lkopfandSmola,2002)asourmessageoperator foritssimplicity,itspotentialuseinanonlinesetting(i.e.,incrementallearningduringinference), andrichsupportingtheory. 3.1 KernelRidgeRegression We consider here the problem of regressing smoothly from distribution-valued inputs to feature- valuedoutputs. We followthe regressionframeworkof MicchelliandPontil(2005),with conver- genceguaranteesprovidedbyCaponnettoandDeVito(2007). Undersmoothnessconstraints,this regressioncanbeinterpretedascomputingtheconditionalexpectationoftheoutputfeaturesgiven theinputs(Grunewalderetal.,2012). LetX=(x |···|x )bethetrainingregressioninputsandY= E u(v′)|···|E u(v′) ∈ 1 N qf1→V′ qfN→V′ RDy×N betheregressionoutputs. Theridgeregressioninthe(cid:16)primalformseeksW ∈ RDy×D(cid:17)for theregressionfunctiong(x)=Wxwhichminimizesthesquared-lossfunctionJ(W)= N ky − i=1 i Wx k2 +λtr WW⊤ where λ is a regularization parameter and tr denotes a matrix trace. It is i 2 P wellknownthatthe solutionisgivenbyW = YX⊤ XX⊤+λI −1 whichhasan equivalentdual (cid:0) (cid:1) solution W = Y(K+λI)−1X⊤ = AX⊤ . The dualformulationallows one to regressfromany (cid:0) (cid:1) type of input objects if a kernel can be defined. All the inputs enter to the regression function throughthegrammatrixK ∈ RN×N where(K) = κ(x ,x )yieldingtheregressionfunctionof ij i j theformg(x) = N a κ(x ,x)whereA := (a |···|a ). Thedualformulationthereforeallows i=1 i i 1 N onetostraightforwardlyregressfromincomingmessagestovectorsofmeanparameters. Although this propertyis aPppealing, the training size N in our setting can be chosen to be arbitrarily large, makingcomputationofg(x) expensivefora newunseenpointx. To eliminatethe dependencyon N, we propose to apply random Fourier features (Rahimi and Recht, 2007) φˆ(x) ∈ RD for x := [m ] such thatκ(x,x′) ≈ φˆ(x)⊤φˆ(x′) whereD is thenumberofrandomfeatures. The V→f V∈V(f) useoftherandomfeaturesallowsustogobacktotheprimalformofwhichtheregressionfunction g(φˆ(x)) = Wφˆ(x) can be computed efficiently. In effect, computing an EP outgoing message requires nothing more than a multiplication of a matrix W (D × D ) with the D-dimensional y featurevectorgeneratedfromtheincomingmessages. 3.2 KernelsonDistributions Anumberofkernelsondistributionshavebeenstudiedintheliterature(JebaraandKondor,2003; Jebaraetal.,2004). Relevanttousarekernelswhoserandomfeaturescanbeefficientlycomputed. Duetothespaceconstraint,weonlygiveafewexampleshere. Expected Product Kernel Let µ := E k(·,a) be the mean embedding (Smola et al., r(l) r(l)(a) 2007)ofthedistributionr(l)intoRKHSH(l)inducedbythekernelk. Assumek =k (Gaussian gauss kernel) and assume there are c incoming messages x := (r(i)(a(i)))c and y := (s(i)(b(i)))c . i=1 i=1 Theexpectedproductkernelκ isdefinedas pro c c c κ (x,y):= µ , µ = E E k(l) (a,b)≈φˆ(x)⊤φˆ(y) pro r(l) s(l) r(l)(a) s(l)(b) gauss * + Ol=1 Ol=1 ⊗lH(l) Yl=1 3 Log KL divergence. Mean: −1.880, SD: 1.827 700 Best prediction. Log div: −8.831 Median prediction 1. Log div: −1.942 1 0.4 predicted 600 ground truth 0.5 0.2 500 uency400 0 −5 0 5 −01 0 −5 0 q Fre300 Median prediction 2. Log div: −1.940 75th percentile. Log div: 5.177 200 1 0.6 100 0.4 0.5 0.2 0 −10 −8 −6Log d−i4vergen−ce2 of DistN0ormal2 4 6 0 −5 0 5 0 −5 0 5 Figure1:LogKL-divergenceonalogisticfactortestsetusingkernelonjointembeddings. where φˆ(x)⊤φˆ(y) = c φˆ(l)(r(l))⊤φˆ(l)(s(l)). The feature map φˆ(l)(r(l)) can be estimated by l=1 applyingtherandomFourierfeaturestok(l) andtakingtheexpectationsE E . Thefinal Q gauss r(l)(a) s(l)(b) featuremapisφˆ(x)=φˆ(1)(r(1))⊛φˆ(2)(r(2))⊛···⊛φˆ(c)(r(c))∈Rdcwhere⊛denotesaKronecker productandweassumethatφˆ(l) ∈Rd forl ∈{1,...,c}. KernelonJointEmbeddings Anotherwaytodefineakernelonx,yistomean-embedbothjoint distributionsr = ci=1r(i) ands = ci=1s(i) anddefinethe kerneltobeκjoint(x,y) := hµr,µsiG whereGisanRKHSconsistingoffunctionsg :X(1)×···×X(c) →RandX(l)denotesthedomain Q Q ofr(l) ands(l). Thiskernelisequivalenttotheexpectedproductkernelonthejointdistributions. 4 Experiment Asapreliminaryexperiment,weconsiderthelogisticfactorf(z|x)=δ z− 1 whichis 1+exp(−x) deterministicwhenconditioningonx. Thefactorisusedinmanycommo(cid:16)nmodelsinclud(cid:17)ingbinary logistic regression. The two incoming messages are m (x) = N(x;µ,σ2) and m (z) = X→f Z→f Beta(z;α,β). We randomly generate 2000 training input messages and learn a message operator using the kernel on joint embeddings. Kernel parameters are chosen by cross validation and the number of random features D is set to 2000. We report logKL[qkqˆ] where q = q is the f→X groundtruthoutputmessageobtainedbyimportancesamplingandqˆisthemessageoutputfromthe operator. For better numericalscaling, regressionoutputsare set to (E [x],logV [x]) instead of q q theexpectationsofthefirsttwo moments. ThehistogramoflogKL errorsis shownontheleftof Fig. 1. The rightfigure shows sample outputmessages at differentlog KL errors. It can be seen thattheoperatorisabletocapturetherelationshipofincomingandoutgoingmessages. Withhigher trainingsize,increasednumberofrandomfeaturesandwellchosenkernelparameters,weexpectto seeasignificantimprovementintheoperator’saccuracy. 5 Conclusionand Future Work We propose to learn to send EP messages with kernel ridge regression by casting the KL mini- mizationproblemasasupervisedlearningproblem. Withrandomfeatures,incomingmessagesto a learned operator are converted to a finite-dimensional vector. Computing an outgoing message amountstocomputingthemomentparametersbymultiplyingthevectorwithamatrixgivenbythe solutionoftheprimalridgeregression. Byvirtueoftheprimalform,itisstraightforwardtoderiveanupdateequationforanonline-active learningduringEPinference: ifthepredictivevariance(similartoaGaussianprocess)onthecur- rentincomingmessagesishigh,thenwequerytheoutgoingmessagefromtheimportancesampler (oracle) and update the operator. Otherwise, the outgoingmessage is efficiently computedby the operator. Onlinelearningoftheoperatorlessenstheneedofthedistributiononnaturalparameters ofincomingmessagesusedintrainingsetgeneration. Determininganappropriatedistributionwas oneoftheunsolvedproblemsinHeessetal.(2013). Acknowledgments WegratefullyacknowledgethesupportoftheGatsbyCharitableFoundation. 4 References S. Barthelme´ andN. Chopin. Abc-ep: Expectationpropagationforlikelihood-freebayesiancom- putation. InL.GetoorandT.Scheffer,editors,Proceedingsofthe28thInternationalConference onMachineLearning(ICML-11),ICML’11,pages289–296,NewYork,NY,USA,June2011. ACM. ISBN978-1-4503-0619-5. C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-VerlagNewYork,Inc.,Secaucus,NJ,USA,2006. ISBN0387310738. A. Caponnettoand E. De Vito. Optimal rates for the regularizedleast-squaresalgorithm. Found. Comput.Math.,7(3):331–368,July2007. ISSN 1615-3375. doi: 10.1007/s10208-006-0196-8. URLhttp://dx.doi.org/10.1007/s10208-006-0196-8. S.Grunewalder,G.Lever,A.Gretton,L.Baldassarre,S.Patterson,andM.Pontil.Conditionalmean embeddingsasregressors. InICML,2012. N.Heess,D.Tarlow,andJ.Winn.Learningtopassexpectationpropagationmessages.InC.Burges, L.Bottou,M.Welling,Z.Ghahramani,andK.Weinberger,editors,AdvancesinNeuralInforma- tionProcessingSystems26,pages3219–3227.2013. T.JebaraandR.Kondor.Bhattacharyyaandexpectedlikelihoodkernels.InConferenceonLearning Theory.press,2003. T. Jebara, R. Kondor, and A. Howard. Probability product kernels. J. Mach. Learn. Res., 5:819–844, Dec. 2004. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1005332.1016786. C. A. Micchelli and M. A. Pontil. On learning vector-valued functions. Neural Comput., 17(1):177–204, Jan. 2005. ISSN 0899-7667. doi: 10.1162/0899766052530802. URL http://dx.doi.org/10.1162/0899766052530802. T.Minka,J.Winn,J.Guiver,andD.Knowles.Infer.NET2.5,2012.MicrosoftResearchCambridge. http://research.microsoft.com/infernet. T.P.Minka.AFamilyofAlgorithmsforApproximateBayesianInference.PhDthesis,Massachusetts InstituteofTechnology,2001. AAI0803033. B. Poczos, A. Rinaldo, A. Singh, and L. Wasserman. Distribution-free distribution regression. arXiv:1302.0082[cs,math,stat],Feb.2013. URLhttp://arxiv.org/abs/1302.0082. arXiv:1302.0082. A.RahimiandB.Recht. Randomfeaturesforlarge-scalekernelmachines. InNeuralInformation ProcessingSystems,2007. B. Scho¨lkopf and A. J. Smola. Learning with kernels : support vector machines, regularization, optimization,andbeyond. Adaptivecomputationandmachinelearning.MITPress,2002. A.Smola,A.Gretton,L.Song,andB.Scho¨lkopf. Ahilbertspaceembeddingfordistributions. In InAlgorithmicLearningTheory: 18thInternationalConference,pages13–31.Springer-Verlag, 2007. StanDevelopmentTeam. Stan:Ac++libraryforprobabilityandsampling,version2.4,2014. URL http://mc-stan.org/. Z. Szabo, A. Gretton, B. Poczos, and B. Sriperumbudur. Consistent, two- stage sampled distribution regression via mean embedding. Feb. 2014. URL http://arxiv-web3.library.cornell.edu/abs/1402.1754v2. 5

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.