
Published as a conference paper at ICLR 2017

DEEP PROBABILISTIC PROGRAMMING

Dustin Tran (Columbia University), Matthew D. Hoffman (Adobe Research), Rif A. Saurous (Google Research), Eugene Brevdo (Google Brain), Kevin Murphy (Google Research), David M. Blei (Columbia University)

ABSTRACT

We propose Edward, a Turing-complete probabilistic programming language. Edward defines two compositional representations: random variables and inference. By treating inference as a first-class citizen, on a par with modeling, we show that probabilistic programming can be as flexible and computationally efficient as traditional deep learning. For flexibility, Edward makes it easy to fit the same model using a variety of composable inference methods, ranging from point estimation to variational inference to MCMC. In addition, Edward can reuse the modeling representation as part of inference, facilitating the design of rich variational models and generative adversarial networks. For efficiency, Edward is integrated into TensorFlow, providing significant speedups over existing probabilistic systems. For example, we show on a benchmark logistic regression task that Edward is at least 35x faster than Stan and 6x faster than PyMC3. Further, Edward incurs no runtime overhead: it is as fast as handwritten TensorFlow.

1 INTRODUCTION

The nature of deep neural networks is compositional. Users can connect layers in creative ways, without having to worry about how to perform testing (forward propagation) or inference (gradient-based optimization, with backpropagation and automatic differentiation).

In this paper, we design compositional representations for probabilistic programming. Probabilistic programming lets users specify generative probabilistic models as programs and then "compile" those models down into inference procedures. Probabilistic models are also compositional in nature, and much work has enabled rich probabilistic programs via compositions of random variables (Goodman et al., 2012; Ghahramani, 2015; Lake et al., 2016).

Less work, however, has considered an analogous compositionality for inference. Rather, many existing probabilistic programming languages treat the inference engine as a black box, abstracted away from the model. These cannot capture probabilistic inferences that reuse the model's representation, a key idea in recent advances in variational inference (Kingma & Welling, 2014; Rezende & Mohamed, 2015; Tran et al., 2016b), generative adversarial networks (Goodfellow et al., 2014), and also in more classic inferences (Dayan et al., 1995; Gutmann & Hyvärinen, 2010).

We propose Edward,[1] a Turing-complete probabilistic programming language which builds on two compositional representations: one for random variables and one for inference. By treating inference as a first-class citizen, on a par with modeling, we show that probabilistic programming can be as flexible and computationally efficient as traditional deep learning. For flexibility, we show how Edward makes it easy to fit the same model using a variety of composable inference methods, ranging from point estimation to variational inference to MCMC. For efficiency, we show how to integrate Edward into existing computational graph frameworks such as TensorFlow (Abadi et al., 2016). Frameworks like TensorFlow provide computational benefits like distributed training, parallelism, vectorization, and GPU support "for free." For example, we show on a benchmark task that Edward's Hamiltonian Monte Carlo is many times faster than existing software. Further, Edward incurs no runtime overhead: it is as fast as handwritten TensorFlow.

[1] See Tran et al. (2016a) for details of the API. A companion webpage for this paper is available at http://edwardlib.org/iclr2017. It contains more complete examples with runnable code.
2 RELATED WORK

Probabilistic programming languages (PPLs) typically trade off the expressiveness of the language with the computational efficiency of inference. On one side, there are languages which emphasize expressiveness (Pfeffer, 2001; Milch et al., 2005; Pfeffer, 2009; Goodman et al., 2012), representing a rich class beyond graphical models. Each employs a generic inference engine, but scales poorly with respect to model and data size. On the other side, there are languages which emphasize efficiency (Spiegelhalter et al., 1995; Murphy, 2001; Plummer, 2003; Salvatier et al., 2015; Carpenter et al., 2016). The PPL is restricted to a specific class of models, and inference algorithms are optimized to be efficient for this class. For example, Infer.NET enables fast message passing for graphical models (Minka et al., 2014), and Augur enables data parallelism with GPUs for Gibbs sampling in Bayesian networks (Tristan et al., 2014). Edward bridges this gap: it is Turing complete (it supports any computable probability distribution) and it supports efficient algorithms, such as those that leverage model structure and those that scale to massive data.

There has been some prior research on efficient algorithms in Turing-complete languages. Venture and Anglican design inference as a collection of local inference problems, defined over program fragments (Mansinghka et al., 2014; Wood et al., 2014). This produces fast program-specific inference code, which we build on. Neither system supports inference methods such as programmable posterior approximations, inference models, or data subsampling. Concurrent with our work, WebPPL features amortized inference (Ritchie et al., 2016). Unlike Edward, WebPPL does not reuse the model's representation; rather, it annotates the original program and leverages helper functions, which is a less flexible strategy. Finally, inference is designed as program transformations in Kiselyov & Shan (2009), Ścibior et al. (2015), and Zinkov & Shan (2016). This enables the flexibility of composing inference inside other probabilistic programs. Edward builds on this idea to compose not only inference within modeling but also modeling within inference (e.g., variational models).

3 COMPOSITIONAL REPRESENTATIONS FOR PROBABILISTIC MODELS

We first develop compositional representations for probabilistic models. We desire two criteria: (a) integration with computational graphs, an efficient framework where nodes represent operations on data and edges represent data communicated between them (Culler, 1986); and (b) invariance of the representation under the graph, that is, the representation can be reused during inference.

Edward defines random variables as the key compositional representation. They are class objects with methods, for example, to compute the log density and to sample. Further, each random variable x is associated to a tensor (multi-dimensional array) x*, which represents a single sample x* ~ p(x). This association embeds the random variable onto a computational graph on tensors.
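As a hedged illustration of these methods (ours, not an excerpt from the paper; it assumes the Edward 1.x constructor arguments used in the figures below, that log_prob and sample delegate to the underlying distribution, and that a random variable can be fetched directly from a TensorFlow session, as the paper describes):

import tensorflow as tf
from edward.models import Normal

x = Normal(mu=0.0, sigma=1.0)   # a scalar random variable on the graph
log_density = x.log_prob(1.5)   # tensor for log p(x = 1.5)
draws = x.sample(10)            # tensor of 10 i.i.d. draws from p(x)

sess = tf.Session()
# Fetching x evaluates the sample x* that the random variable is associated with.
print(sess.run([x, log_density, draws]))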
The design's simplicity makes it easy to develop probabilistic programs in a computational graph framework. Importantly, all computation is represented on the graph. This enables one to compose random variables with complex deterministic structure such as deep neural networks, a diverse set of math operations, and third-party libraries that build on the same framework. The design also enables compositions of random variables to capture complex stochastic structure.

As an illustration, we use a Beta-Bernoulli model, p(x, \theta) = \text{Beta}(\theta \mid 1, 1) \prod_{n=1}^{50} \text{Bernoulli}(x_n \mid \theta), where \theta is a latent probability shared across the 50 data points x \in \{0, 1\}^{50}. The random variable x is 50-dimensional, parameterized by the random tensor \theta^*. Fetching the object x runs the graph: it simulates from the generative process and outputs a binary vector of 50 elements.

theta = Beta(a=1.0, b=1.0)
x = Bernoulli(p=tf.ones(50) * theta)

Figure 1: Beta-Bernoulli program alongside its computational graph over the nodes tf.ones(50), theta*, x*, theta, and x (the diagram itself is not reproduced here). Fetching x from the graph generates a binary vector of 50 elements.

All computation is registered symbolically on random variables and not over their execution. Symbolic representations do not require reifying the full model, which leads to unreasonable memory consumption for large models (Tristan et al., 2014). Moreover, it enables us to simplify both deterministic and stochastic operations in the graph, before executing any code (Ścibior et al., 2015; Zinkov & Shan, 2016).

With computational graphs, it is also natural to build mutable states within the probabilistic program. As a typical use of computational graphs, such states can define model parameters; in TensorFlow, this is given by a tf.Variable. Another use case is for building discriminative models p(y | x), where x are features that are input as training or test data. The program can be written independent of the data, using a mutable state (tf.placeholder) for x in its graph. During training and testing, we feed the placeholder the appropriate values.

In Appendix A, we provide examples of a Bayesian neural network for classification (A.1), latent Dirichlet allocation (A.2), and Gaussian matrix factorization (A.3). We present others below.

3.1 EXAMPLE: VARIATIONAL AUTO-ENCODER

Figure 2 implements a variational auto-encoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) in Edward. It comprises a probabilistic model over data and a variational model designed to approximate the former's posterior. Here we use random variables to construct both the probabilistic model and the variational model; they are fit during inference (more details in Section 4).

# Probabilistic model
z = Normal(mu=tf.zeros([N, d]), sigma=tf.ones([N, d]))
h = Dense(256, activation='relu')(z)
x = Bernoulli(logits=Dense(28 * 28, activation=None)(h))

# Variational model
qx = tf.placeholder(tf.float32, [N, 28 * 28])
qh = Dense(256, activation='relu')(qx)
qz = Normal(mu=Dense(d, activation=None)(qh),
            sigma=Dense(d, activation='softplus')(qh))

Figure 2: Variational auto-encoder for a data set of 28 x 28 pixel images, as a probabilistic program with 2-layer neural networks (the accompanying graphical model, with dotted lines for the inference model, is not reproduced here).

There are N data points x_n \in \{0, 1\}^{28 \cdot 28}, each with d latent variables z_n \in \mathbb{R}^d. The program uses Keras (Chollet, 2015) to define neural networks. The probabilistic model is parameterized by a 2-layer neural network, with 256 hidden units (and ReLU activation), and generates 28 x 28 pixel images. The variational model is parameterized by a 2-layer inference network, with 256 hidden units, and outputs parameters of a normal posterior approximation.

The probabilistic program is concise. Core elements of the VAE, such as its distributional assumptions and neural net architectures, are all extensible. With model compositionality, we can embed it into more complicated models (Gregor et al., 2015; Rezende et al., 2016) and for other learning tasks (Kingma et al., 2014). With inference compositionality (which we discuss in Section 4), we can embed it into more complicated algorithms, such as with expressive variational approximations (Rezende & Mohamed, 2015; Tran et al., 2016b; Kingma et al., 2016) and alternative objectives (Ranganath et al., 2016a; Li & Turner, 2016; Dieng et al., 2016).
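As a rough sketch of the fitting step (the authors' complete script is in Appendix C.1; the ed.KLqp class and the binding of the qx placeholder inside the data dictionary are our assumptions, not code from the paper):

import edward as ed

# Bind the observed pixels to x and feed the same [N, 28 * 28] array of
# binarized pixels to the inference network's input placeholder qx, then
# run variational inference on both the variational and model parameters.
inference = ed.KLqp({z: qz}, data={x: x_train, qx: x_train})
inference.run(n_iter=1000)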
3.2 EXAMPLE: BAYESIAN RECURRENT NEURAL NETWORK WITH VARIABLE LENGTH

Random variables can also be composed with control flow operations. As an example, Figure 3 implements a Bayesian recurrent neural network (RNN) with variable length. The data is a sequence of inputs \{x_1, \ldots, x_T\} and outputs \{y_1, \ldots, y_T\} of length T, with x_t \in \mathbb{R}^D and y_t \in \mathbb{R} per time step. For t = 1, \ldots, T, an RNN applies the update

h_t = \tanh(W_h h_{t-1} + W_x x_t + b_h),

where the previous hidden state is h_{t-1} \in \mathbb{R}^H. We feed each hidden state into the output's likelihood, y_t \sim \text{Normal}(W_y h_t + b_y, 1), and we place a standard normal prior over all parameters \{W_h \in \mathbb{R}^{H \times H}, W_x \in \mathbb{R}^{D \times H}, W_y \in \mathbb{R}^{H \times 1}, b_h \in \mathbb{R}^H, b_y \in \mathbb{R}\}. Our implementation is dynamic: it differs from an RNN with fixed length, which pads and unrolls the computation.

def rnn_cell(hprev, xt):
  return tf.tanh(tf.dot(hprev, Wh) + tf.dot(xt, Wx) + bh)

Wh = Normal(mu=tf.zeros([H, H]), sigma=tf.ones([H, H]))
Wx = Normal(mu=tf.zeros([D, H]), sigma=tf.ones([D, H]))
Wy = Normal(mu=tf.zeros([H, 1]), sigma=tf.ones([H, 1]))
bh = Normal(mu=tf.zeros(H), sigma=tf.ones(H))
by = Normal(mu=tf.zeros(1), sigma=tf.ones(1))

x = tf.placeholder(tf.float32, [None, D])
h = tf.scan(rnn_cell, x, initializer=tf.zeros(H))
y = Normal(mu=tf.matmul(h, Wy) + by, sigma=1.0)

Figure 3: Bayesian RNN as a probabilistic program (the accompanying graphical model is not reproduced here). The program has an unspecified number of time steps; it uses a symbolic for loop (tf.scan).

3.3 STOCHASTIC CONTROL FLOW AND MODEL PARALLELISM

Figure 4: Computational graph for a probabilistic program with stochastic control flow, with nodes a*, p*, x* and a, p, x around a tf.while_loop (the diagram itself is not reproduced here).

Random variables can also be placed in the control flow itself, enabling probabilistic programs with stochastic control flow. Stochastic control flow defines dynamic conditional dependencies, known in the literature as contingent or existential dependencies (Mansinghka et al., 2014; Wu et al., 2016). See Figure 4, where x may or may not depend on a for a given execution. In Appendix A.4, we use stochastic control flow to implement a Dirichlet process mixture model. Tensors with stochastic shape are also possible: for example, tf.zeros(Poisson(lam=5.0)) defines a vector of zeros with length given by a Poisson draw with rate 5.0.

Stochastic control flow produces difficulties for algorithms that use the graph structure because the relationship of conditional dependencies changes across execution traces. The computational graph, however, provides an elegant way of teasing out static conditional dependence structure (p) from dynamic dependence structure (a). We can perform model parallelism (parallel computation across components of the model) over the static structure with GPUs and batch training. We can use more generic computations to handle the dynamic structure.
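To make this concrete, here is a minimal sketch of stochastic control flow (ours, not the paper's Dirichlet process example from Appendix A.4; for brevity it uses tf.cond rather than the tf.while_loop of Figure 4, and it returns a plain tensor rather than a random variable):

import tensorflow as tf
from edward.models import Bernoulli, Normal

a = Bernoulli(p=0.5)                 # random draw used inside the control flow
wide = Normal(mu=0.0, sigma=10.0)
narrow = Normal(mu=0.0, sigma=0.1)
# Which component generates x depends on the value of a at execution time,
# giving a dynamic (contingent) conditional dependence.
x = tf.cond(tf.cast(a, tf.bool),
            lambda: tf.identity(wide),
            lambda: tf.identity(narrow))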
4 COMPOSITIONAL REPRESENTATIONS FOR INFERENCE

We described random variables as a representation for building rich probabilistic programs over computational graphs. We now describe a compositional representation for inference. We desire two criteria: (a) support for many classes of inference, where the form of the inferred posterior depends on the algorithm; and (b) invariance of inference under the computational graph, that is, the posterior can be further composed as part of another model.

To explain our approach, we will use a simple hierarchical model as a running example. Figure 5 displays a joint distribution p(x, z, \beta) of data x, local variables z, and global variables \beta. The ideas here extend to more expressive programs.

N = 10000  # number of data points
D = 2      # data dimension
K = 5      # number of clusters

beta = Normal(mu=tf.zeros([K, D]), sigma=tf.ones([K, D]))
z = Categorical(logits=tf.zeros([N, K]))
x = Normal(mu=tf.gather(beta, z), sigma=tf.ones([N, D]))

Figure 5: Hierarchical model as a probabilistic program (the accompanying graphical model is not reproduced here). It is a mixture of Gaussians over D-dimensional data \{x_n\} \in \mathbb{R}^{N \times D}. There are K latent cluster means \beta \in \mathbb{R}^{K \times D}.

4.1 INFERENCE AS STOCHASTIC GRAPH OPTIMIZATION

The goal of inference is to calculate the posterior distribution p(z, \beta \mid x_{\text{train}}; \theta) given data x_{\text{train}}, where \theta are any model parameters that we will compute point estimates for.[2] We formalize this as the following optimization problem:

\min_{\lambda, \theta} \mathcal{L}\big( p(z, \beta \mid x_{\text{train}}; \theta),\; q(z, \beta; \lambda) \big),    (1)

where q(z, \beta; \lambda) is an approximation to the posterior p(z, \beta \mid x_{\text{train}}; \theta), and \mathcal{L} is a loss function with respect to p and q.

[2] For example, we could replace x's sigma argument with tf.exp(tf.Variable(0.0)) * tf.ones([N, D]). This defines a model parameter initialized at 0 and positive-constrained.

The choice of approximation q, loss \mathcal{L}, and rules to update parameters \{\theta, \lambda\} are specified by an inference algorithm. (Note q can be nonparametric, such as a point or a collection of samples.) In Edward, we write this problem as follows:

inference = ed.Inference({beta: qbeta, z: qz}, data={x: x_train})

Inference is an abstract class which takes two inputs. The first is a collection of latent random variables beta and z, associated to their "posterior variables" qbeta and qz respectively. The second is a collection of observed random variables x, which is associated to their realizations x_train.

The idea is that Inference defines and solves the optimization in Equation 1. It adjusts parameters of the distribution of qbeta and qz (and any model parameters) to be close to the posterior.

Class methods are available to finely control the inference. Calling inference.initialize() builds a computational graph to update \{\theta, \lambda\}. Calling inference.update() runs this computation once to update \{\theta, \lambda\}; we call the method in a loop until convergence. Importantly, no efficiency is lost in Edward's language: the computational graph is the same as if it were handwritten for a specific model. This means the runtime is the same; also see our experiments in Section 5.2.

A key concept in Edward is that there is no distinct "model" or "inference" block. A model is simply a collection of random variables, and inference is a way of modifying parameters in that collection subject to another. This reductionism offers significant flexibility. For example, we can infer only parts of a model (e.g., layer-wise training (Hinton et al., 2006)), infer parts used in multiple models (e.g., multi-task learning), or plug in a posterior into a new model (e.g., Bayesian updating).

4.2 CLASSES OF INFERENCE

The design of Inference is very general. We describe subclasses to represent many algorithms below: variational inference, Monte Carlo, and generative adversarial networks.

Variational inference posits a family of approximating distributions and finds the closest member in the family to the posterior (Jordan et al., 1999). In Edward, we build the variational family in the graph; see Figure 6 (left). For our running example, the family has mutable variables as parameters \lambda = \{\pi, \mu, \sigma\}, where q(\beta; \mu, \sigma) = \text{Normal}(\beta; \mu, \sigma) and q(z; \pi) = \text{Categorical}(z; \pi).

# Left: variational inference
qbeta = Normal(
    mu=tf.Variable(tf.zeros([K, D])),
    sigma=tf.exp(tf.Variable(tf.zeros([K, D]))))
qz = Categorical(
    logits=tf.Variable(tf.zeros([N, K])))

inference = ed.VariationalInference(
    {beta: qbeta, z: qz}, data={x: x_train})

# Right: Monte Carlo
T = 10000  # number of samples
qbeta = Empirical(
    params=tf.Variable(tf.zeros([T, K, D])))
qz = Empirical(
    params=tf.Variable(tf.zeros([T, N])))

inference = ed.MonteCarlo(
    {beta: qbeta, z: qz}, data={x: x_train})

Figure 6: (left) Variational inference. (right) Monte Carlo.

Specific variational algorithms inherit from the VariationalInference class. Each defines its own methods, such as a loss function and gradient. For example, we represent maximum a posteriori (MAP) estimation with an approximating family (qbeta and qz) of PointMass random variables, i.e., with all probability mass concentrated at a point. MAP inherits from VariationalInference and defines the negative log joint density as the loss function; it uses existing optimizers inside TensorFlow.

In Section 5.1, we experiment with multiple gradient estimators for black box variational inference (Ranganath et al., 2014). Each estimator implements the same loss (an objective proportional to the divergence KL(q ‖ p)) and a different update rule (stochastic gradient).
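For reference (standard results from the cited literature rather than anything Edward-specific, writing z for all latent variables), the score function estimator rewrites the gradient of this objective as an expectation under q, which can then be estimated with samples from q:

\nabla_\lambda \, \mathbb{E}_{q(z;\lambda)}\big[\log p(x, z) - \log q(z; \lambda)\big]
  = \mathbb{E}_{q(z;\lambda)}\big[\nabla_\lambda \log q(z; \lambda)\,\big(\log p(x, z) - \log q(z; \lambda)\big)\big].

The reparameterization estimator instead writes z = g(\lambda, \varepsilon) with \varepsilon \sim s(\varepsilon) and differentiates through g; the analytic-KL and analytic-entropy variants in Section 5.1 build on the latter.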
Monte Carlo approximates the posterior using samples (Robert & Casella, 1999). Monte Carlo is an inference where the approximating family is an empirical distribution, q(\beta; \{\beta^{(t)}\}) = \frac{1}{T} \sum_{t=1}^{T} \delta(\beta, \beta^{(t)}) and q(z; \{z^{(t)}\}) = \frac{1}{T} \sum_{t=1}^{T} \delta(z, z^{(t)}). The parameters are \lambda = \{\beta^{(t)}, z^{(t)}\}. See Figure 6 (right). Monte Carlo algorithms proceed by updating one sample \beta^{(t)}, z^{(t)} at a time in the empirical approximation. Specific MC samplers determine the update rules: they can use gradients such as in Hamiltonian Monte Carlo (Neal, 2011) and graph structure such as in sequential Monte Carlo (Doucet et al., 2001).
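As a hedged follow-up sketch (ours; it assumes that the Empirical random variable from Figure 6 (right) exposes the params tensor it was constructed with, and that sess is an active session such as ed.get_session()), the stored states can be summarized after sampling, for example by a posterior mean that discards early iterations as burn-in:

burn_in = T // 2
# Average the retained samples of beta; the result has shape [K, D].
posterior_mean_beta = sess.run(tf.reduce_mean(qbeta.params[burn_in:], axis=0))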
Edward also supports non-Bayesian methods such as generative adversarial networks (GANs) (Goodfellow et al., 2014). See Figure 7. The model posits random noise eps over N data points, each with d dimensions; this random noise feeds into a generative_network function, a neural network that outputs real-valued data x. In addition, there is a discriminative_network which takes data as input and outputs the probability that the data is real (in logit parameterization). We build GANInference; running it optimizes parameters inside the two neural network functions. This approach extends to many advances in GANs (e.g., Denton et al. (2015); Li et al. (2015)).

def generative_network(eps):
  h = Dense(256, activation='relu')(eps)
  return Dense(28 * 28, activation=None)(h)

def discriminative_network(x):
  h = Dense(28 * 28, activation='relu')(x)
  return Dense(1, activation=None)(h)

# Probabilistic model
eps = Normal(mu=tf.zeros([N, d]), sigma=tf.ones([N, d]))
x = generative_network(eps)

inference = ed.GANInference(data={x: x_train},
                            discriminator=discriminative_network)

Figure 7: Generative adversarial networks as a probabilistic program (the accompanying graphical model is not reproduced here). The model (generator) uses a parameterized function (discriminator) for training.

Finally, one can design algorithms that would otherwise require tedious algebraic manipulation. With symbolic algebra on nodes of the computational graph, we can uncover conjugacy relationships between random variables. Users can then integrate out variables to automatically derive classical Gibbs (Gelfand & Smith, 1990), mean-field updates (Bishop, 2006), and exact inference. These algorithms are currently being developed in Edward.

4.3 COMPOSING INFERENCES

Core to Edward's design is that inference can be written as a collection of separate inference programs. Below we demonstrate variational EM, with an (approximate) E-step over local variables and an M-step over global variables. We instantiate two algorithms, each of which conditions on inferences from the other, and we alternate with one update of each (Neal & Hinton, 1993):

qbeta = PointMass(params=tf.Variable(tf.zeros([K, D])))
qz = Categorical(logits=tf.Variable(tf.zeros([N, K])))

inference_e = ed.VariationalInference({z: qz}, data={x: x_train, beta: qbeta})
inference_m = ed.MAP({beta: qbeta}, data={x: x_train, z: qz})
...
for _ in range(10000):
  inference_e.update()
  inference_m.update()

This extends to many other cases such as exact EM for exponential families, contrastive divergence (Hinton, 2002), pseudo-marginal methods (Andrieu & Roberts, 2009), and Gibbs sampling within variational inference (Wang & Blei, 2012; Hoffman & Blei, 2015). We can also write message passing algorithms, which solve a collection of local inference problems (Koller & Friedman, 2009). For example, classical message passing uses exact local inference, and expectation propagation locally minimizes the Kullback-Leibler divergence KL(p ‖ q) (Minka, 2001).

4.4 DATA SUBSAMPLING

Stochastic optimization (Bottou, 2010) scales inference to massive data and is key to algorithms such as stochastic gradient Langevin dynamics (Welling & Teh, 2011) and stochastic variational inference (Hoffman et al., 2013). The idea is to cheaply estimate the model's log joint density in an unbiased way. At each step, one subsamples a data set \{x_m\} of size M and then scales densities with respect to local variables,

\log p(x, z, \beta) = \log p(\beta) + \sum_{n=1}^{N} \big[ \log p(x_n \mid z_n, \beta) + \log p(z_n \mid \beta) \big]
                 \approx \log p(\beta) + \frac{N}{M} \sum_{m=1}^{M} \big[ \log p(x_m \mid z_m, \beta) + \log p(z_m \mid \beta) \big].

To support stochastic optimization, we represent only a subgraph of the full model. This prevents reifying the full model, which can lead to unreasonable memory consumption (Tristan et al., 2014). During initialization, we pass in a dictionary to properly scale the arguments. See Figure 8.

beta = Normal(mu=tf.zeros([K, D]), sigma=tf.ones([K, D]))
z = Categorical(logits=tf.zeros([M, K]))
x = Normal(mu=tf.gather(beta, z), sigma=tf.ones([M, D]))

qbeta = Normal(mu=tf.Variable(tf.zeros([K, D])),
               sigma=tf.nn.softplus(tf.Variable(tf.zeros([K, D]))))
qz = Categorical(logits=tf.Variable(tf.zeros([M, K])))

inference = ed.VariationalInference({beta: qbeta, z: qz}, data={x: x_batch})
inference.initialize(scale={x: float(N) / M, z: float(N) / M})

Figure 8: Data subsampling with a hierarchical model. We define a subgraph of the full model, forming a plate of size M rather than N. We then scale all local random variables by N/M.

Conceptually, the scale argument represents scaling for each random variable's plate, as if we had seen that random variable N/M times as often. As an example, Appendix B shows how to implement stochastic variational inference in Edward. The approach extends naturally to streaming data (Doucet et al., 2000; Broderick et al., 2013; McInerney et al., 2015), dynamic batch sizes, and data structures in which working on a subgraph does not immediately apply (Binder et al., 1997; Johnson & Willsky, 2014; Foti et al., 2014).
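Continuing Figure 8, here is a hedged sketch of the resulting training loop (ours, not the paper's; x_batch is assumed to be a tf.placeholder of shape [M, D], next_batch is a hypothetical helper that cycles through the full data set, and we assume inference.update accepts a feed_dict; Appendix B contains the authors' full stochastic variational inference example):

sess = ed.get_session()
sess.run(tf.global_variables_initializer())

for _ in range(1000):
  x_np = next_batch(x_train, size=M)           # hypothetical minibatch helper
  inference.update(feed_dict={x_batch: x_np})  # one noisy update on the N/M-scaled objective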
5 EXPERIMENTS

In this section, we illustrate two main benefits of Edward: flexibility and efficiency. For the former, we show how it is easy to compare different inference algorithms on the same model. For the latter, we show how it is easy to get significant speedups by exploiting computational graphs.

5.1 RECENT METHODS IN VARIATIONAL INFERENCE

We demonstrate Edward's flexibility for experimenting with complex inference algorithms. We consider the VAE setup from Figure 2 and the binarized MNIST dataset (Salakhutdinov & Murray, 2008). We use d = 50 latent variables per data point and optimize using ADAM. We study different components of the VAE setup using different methods; Appendix C.1 is a complete script. After training we evaluate held-out log likelihoods, which are lower bounds on the true value.

Table 1: Inference methods for a probabilistic decoder on binarized MNIST. The Edward PPL is a convenient research platform, making it easy to both develop and experiment with many algorithms.

Inference method                                                    Negative log-likelihood
VAE (Kingma & Welling, 2014)                                        ≤ 88.2
VAE without analytic KL                                             ≤ 89.4
VAE with analytic entropy                                           ≤ 88.1
VAE with score function gradient                                    ≤ 87.9
Normalizing flows (Rezende & Mohamed, 2015)                         ≤ 85.8
Hierarchical variational model (Ranganath et al., 2016b)            ≤ 85.4
Importance-weighted auto-encoders (K = 50) (Burda et al., 2016)     ≤ 86.3
HVM with IWAE objective (K = 5)                                     ≤ 85.2
Rényi divergence (α = −1) (Li & Turner, 2016)                       ≤ 140.5

Table 1 shows the results. The first method uses the VAE from Figure 2. The next three methods use the same VAE but apply different gradient estimators: reparameterization gradient without an analytic KL; reparameterization gradient with an analytic entropy; and the score function gradient (Paisley et al., 2012; Ranganath et al., 2014). This typically leads to the same optima but at different convergence rates. The score function gradient was slowest. Gradients with an analytic entropy produced difficulties around convergence: we switched to stochastic estimates of the entropy as it approached an optimum. We also use hierarchical variational models (HVMs) (Ranganath et al., 2016b) with a normalizing flow prior; it produced similar results as a normalizing flow on the latent variable space (Rezende & Mohamed, 2015), and better than importance-weighted auto-encoders (IWAEs) (Burda et al., 2016).

We also study novel combinations, such as HVMs with the IWAE objective, GAN-based optimization on the decoder (with pixel intensity-valued data), and Rényi divergence on the decoder. GAN-based optimization does not enable calculation of the log-likelihood; Rényi divergence does not directly optimize for log-likelihood so it does not perform well. The key point is that Edward is a convenient research platform: they are all easy modifications of a given script.

5.2 GPU-ACCELERATED HAMILTONIAN MONTE CARLO

# Model
x = tf.Variable(x_data, trainable=False)
beta = Normal(mu=tf.zeros(D), sigma=tf.ones(D))
y = Bernoulli(logits=tf.dot(x, beta))

# Inference
qbeta = Empirical(params=tf.Variable(tf.zeros([T, D])))
inference = ed.HMC({beta: qbeta}, data={y: y_data})
inference.run(step_size=0.5 / N, n_steps=10)

Figure 9: Edward program for Bayesian logistic regression with Hamiltonian Monte Carlo (HMC). (The accompanying graphical model is not reproduced here.)

We benchmark runtimes for a fixed number of Hamiltonian Monte Carlo (HMC; Neal, 2011) iterations on modern hardware: a 12-core Intel i7-5930K CPU at 3.50 GHz and an NVIDIA Titan X (Maxwell) GPU. We apply logistic regression on the Covertype dataset (N = 581012, D = 54; responses were binarized) using Edward, Stan (with PyStan) (Carpenter et al., 2016), and PyMC3 (Salvatier et al., 2015). We ran 100 HMC iterations, with 10 leapfrog updates per iteration, a step size of 0.5/N, and single precision. Figure 9 illustrates the program in Edward.
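Most of the work in each leapfrog step is evaluating the model's log joint density and its gradient. As a hedged sketch of that density for this model (ours, not the authors' benchmark code), note that it is dominated by a single [N, D] by [D, 1] matrix multiplication, which the GPU parallelizes:

import numpy as np
import tensorflow as tf

def log_joint(beta, x, y):
  # log p(y, beta | x) for Bayesian logistic regression with a N(0, I) prior;
  # x is [N, D] float32, y is [N] float32 in {0, 1}, beta is [D] float32.
  logits = tf.squeeze(tf.matmul(x, tf.expand_dims(beta, 1)), axis=1)  # the matmul bottleneck
  log_lik = -tf.reduce_sum(
      tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits))
  log_prior = tf.reduce_sum(-0.5 * tf.square(beta) - 0.5 * np.log(2.0 * np.pi))
  return log_lik + log_prior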
Table 2 displays the runtimes.[3] Edward (GPU) features a dramatic 35x speedup over Stan (1 CPU) and 6x speedup over PyMC3 (12 CPU). This showcases the value of building a PPL on top of computational graphs. The speedup stems from fast matrix multiplication when calculating the model's log-likelihood; GPUs can efficiently parallelize this computation. We expect similar speedups for models whose bottleneck is also matrix multiplication, such as deep neural networks.

[3] In a previous version of this paper, we reported PyMC3 took 361 s. This was caused by a bug preventing PyMC3 from correctly handling single-precision floating point. (PyMC3 with double precision is roughly 14x slower than Edward (GPU).) This has been fixed after discussion with Thomas Wiecki. The reported numbers also exclude compilation time, which is significant for Stan.

Table 2: HMC benchmark for large-scale logistic regression. Edward (GPU) is significantly faster than other systems. In addition, Edward has no overhead: it is as fast as handwritten TensorFlow.

Probabilistic programming system            Runtime (s)
Handwritten NumPy (1 CPU)                   534
Stan (1 CPU) (Carpenter et al., 2016)       171
PyMC3 (12 CPU) (Salvatier et al., 2015)     30.0
Edward (12 CPU)                             8.2
Handwritten TensorFlow (GPU)                5.0
Edward (GPU)                                4.9

There are various reasons for the speedup. Stan used only 1 CPU because it leverages multiple cores only by running separate HMC chains in parallel. Stan also used double-precision floating point as it does not allow single precision. For PyMC3, we note Edward's speedup is not a result of PyMC3's Theano backend compared to Edward's TensorFlow. Rather, PyMC3 does not use Theano for all its computation, so it experiences communication overhead with NumPy. (PyMC3 was actually slower when using the GPU.) We predict that porting Edward's design to Theano would feature similar speedups.

In addition to these speedups, we highlight that Edward has no runtime overhead: it is as fast as handwritten TensorFlow. Following Section 4.1, this is because the computational graphs for inference are in fact the same for Edward and the handwritten code.

5.3 PROBABILITY ZOO

In addition to Edward, we also release the Probability Zoo, a community repository for pre-trained probability models and their posteriors.[4] It is inspired by the model zoo in Caffe (Jia et al., 2014), which provides many pre-trained discriminative neural networks, and which has been key to making large-scale deep learning more transparent and accessible. It is also inspired by Forest (Stuhlmüller, 2012), which provides examples of probabilistic programs.

[4] The Probability Zoo is available at http://edwardlib.org/zoo. It includes model parameters and inferred posterior factors, such as local and global variables during training and any inference networks.

6 DISCUSSION: CHALLENGES & EXTENSIONS

We described Edward, a Turing-complete PPL with compositional representations for probabilistic models and inference. Edward expands the scope of probabilistic programming to be as flexible and computationally efficient as traditional deep learning. For flexibility, we showed how Edward can use a variety of composable inference methods, capture recent advances in variational inference and generative adversarial networks, and finely control the inference algorithms. For efficiency, we showed how Edward leverages computational graphs to achieve fast, parallelizable computation, scales to massive data, and incurs no runtime overhead over handwritten code.

In present work, we are applying Edward as a research platform for developing new probabilistic models (Rudolph et al., 2016; Tran et al., 2017) and new inference algorithms (Dieng et al., 2016).

As with any language design, Edward makes tradeoffs in pursuit of its flexibility and speed for research. For example, an open challenge in Edward is to better facilitate programs with complex control flow and recursion. While possible to represent, it is unknown how to enable their flexible inference strategies. In addition, it is open how to expand Edward's design to dynamic computational graph frameworks, which provide more flexibility in their programming paradigm but may sacrifice performance. A crucial next step for probabilistic programming is to leverage dynamic computational graphs while maintaining the flexibility and efficiency that Edward offers.
ACKNOWLEDGEMENTS

We thank the probabilistic programming community, including developers of Church, Venture, Gamalon, Hakaru, and WebPPL, for sharing our enthusiasm and motivating further work. We also thank Stan developers for providing extensive feedback as we developed the language, as well as Thomas Wiecki for experimental details. We thank the Google BayesFlow team (Joshua Dillon, Ian Langmore, Ryan Sepassi, and Srinivas Vasudevan) as well as Amr Ahmed, Matthew Johnson, Hung Bui, Rajesh Ranganath, Maja Rudolph, and Francisco Ruiz for their helpful feedback. This work is supported by NSF IIS-1247664, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, DARPA N66001-15-C-4032, Adobe, Google, NSERC PGS-D, and the Sloan Foundation.

REFERENCES

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zhang. TensorFlow: A system for large-scale machine learning. arXiv preprint arXiv:1605.08695, 2016.

Christophe Andrieu and Gareth O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, pp. 697–725, 2009.

John Binder, Kevin Murphy, and Stuart Russell. Space-efficient inference in dynamic probabilistic networks. In International Joint Conference on Artificial Intelligence, 1997.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186. Springer, 2010.

Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. Streaming variational Bayes. In Neural Information Processing Systems, 2013.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.

Bob Carpenter, Andrew Gelman, Matthew D. Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 2016.

François Chollet. Keras. https://github.com/fchollet/keras, 2015.

David E. Culler. Dataflow architectures. Technical report, DTIC Document, 1986.

Peter Dayan, Geoffrey E. Hinton, Radford M. Neal, and Richard S. Zemel. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.

Emily L. Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Neural Information Processing Systems, 2015.

Adji B. Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David M. Blei. χ-divergence for approximate inference. arXiv preprint arXiv:1611.00328, 2016.

Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.

Arnaud Doucet, Nando de Freitas, and Neil Gordon. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, pp. 3–14. Springer, 2001.

Nicholas Foti, Jason Xu, Dillon Laird, and Emily Fox. Stochastic variational inference for hidden Markov models. In Neural Information Processing Systems, 2014.

Alan E. Gelfand and Adrian F. M. Smith. Sampling-based approaches to calculating marginal densities.
Journal of the American Statistical Association, 85(410):398–409, 1990.
