Published as a conference paper at ICLR 2017

DEEP PROBABILISTIC PROGRAMMING

Dustin Tran (Columbia University), Matthew D. Hoffman (Adobe Research), Rif A. Saurous (Google Research),
Eugene Brevdo (Google Brain), Kevin Murphy (Google Research), David M. Blei (Columbia University)
ABSTRACT

We propose Edward, a Turing-complete probabilistic programming language. Edward defines two compositional representations—random variables and inference. By treating inference as a first class citizen, on a par with modeling, we show that probabilistic programming can be as flexible and computationally efficient as traditional deep learning. For flexibility, Edward makes it easy to fit the same model using a variety of composable inference methods, ranging from point estimation to variational inference to MCMC. In addition, Edward can reuse the modeling representation as part of inference, facilitating the design of rich variational models and generative adversarial networks. For efficiency, Edward is integrated into TensorFlow, providing significant speedups over existing probabilistic systems. For example, we show on a benchmark logistic regression task that Edward is at least 35x faster than Stan and 6x faster than PyMC3. Further, Edward incurs no runtime overhead: it is as fast as handwritten TensorFlow.
1 INTRODUCTION

The nature of deep neural networks is compositional. Users can connect layers in creative ways, without having to worry about how to perform testing (forward propagation) or inference (gradient-based optimization, with backpropagation and automatic differentiation).

In this paper, we design compositional representations for probabilistic programming. Probabilistic programming lets users specify generative probabilistic models as programs and then "compile" those models down into inference procedures. Probabilistic models are also compositional in nature, and much work has enabled rich probabilistic programs via compositions of random variables (Goodman et al., 2012; Ghahramani, 2015; Lake et al., 2016).

Less work, however, has considered an analogous compositionality for inference. Rather, many existing probabilistic programming languages treat the inference engine as a black box, abstracted away from the model. These cannot capture probabilistic inferences that reuse the model's representation—a key idea in recent advances in variational inference (Kingma & Welling, 2014; Rezende & Mohamed, 2015; Tran et al., 2016b), generative adversarial networks (Goodfellow et al., 2014), and also in more classic inferences (Dayan et al., 1995; Gutmann & Hyvärinen, 2010).
We propose Edward¹, a Turing-complete probabilistic programming language which builds on two compositional representations—one for random variables and one for inference. By treating inference as a first class citizen, on a par with modeling, we show that probabilistic programming can be as flexible and computationally efficient as traditional deep learning. For flexibility, we show how Edward makes it easy to fit the same model using a variety of composable inference methods, ranging from point estimation to variational inference to MCMC. For efficiency, we show how to integrate Edward into existing computational graph frameworks such as TensorFlow (Abadi et al., 2016). Frameworks like TensorFlow provide computational benefits like distributed training, parallelism, vectorization, and GPU support "for free." For example, we show on a benchmark task that Edward's Hamiltonian Monte Carlo is many times faster than existing software. Further, Edward incurs no runtime overhead: it is as fast as handwritten TensorFlow.

¹See Tran et al. (2016a) for details of the API. A companion webpage for this paper is available at http://edwardlib.org/iclr2017. It contains more complete examples with runnable code.
2 RELATED WORK

Probabilistic programming languages (PPLs) typically trade off the expressiveness of the language with the computational efficiency of inference. On one side, there are languages which emphasize expressiveness (Pfeffer, 2001; Milch et al., 2005; Pfeffer, 2009; Goodman et al., 2012), representing a rich class beyond graphical models. Each employs a generic inference engine, but scales poorly with respect to model and data size. On the other side, there are languages which emphasize efficiency (Spiegelhalter et al., 1995; Murphy, 2001; Plummer, 2003; Salvatier et al., 2015; Carpenter et al., 2016). The PPL is restricted to a specific class of models, and inference algorithms are optimized to be efficient for this class. For example, Infer.NET enables fast message passing for graphical models (Minka et al., 2014), and Augur enables data parallelism with GPUs for Gibbs sampling in Bayesian networks (Tristan et al., 2014). Edward bridges this gap. It is Turing complete—it supports any computable probability distribution—and it supports efficient algorithms, such as those that leverage model structure and those that scale to massive data.

There has been some prior research on efficient algorithms in Turing-complete languages. Venture and Anglican design inference as a collection of local inference problems, defined over program fragments (Mansinghka et al., 2014; Wood et al., 2014). This produces fast program-specific inference code, which we build on. Neither system supports inference methods such as programmable posterior approximations, inference models, or data subsampling. Concurrent with our work, WebPPL features amortized inference (Ritchie et al., 2016). Unlike Edward, WebPPL does not reuse the model's representation; rather, it annotates the original program and leverages helper functions, which is a less flexible strategy. Finally, inference is designed as program transformations in Kiselyov & Shan (2009); Ścibior et al. (2015); Zinkov & Shan (2016). This enables the flexibility of composing inference inside other probabilistic programs. Edward builds on this idea to compose not only inference within modeling but also modeling within inference (e.g., variational models).
3 COMPOSITIONAL REPRESENTATIONS FOR PROBABILISTIC MODELS

We first develop compositional representations for probabilistic models. We desire two criteria: (a) integration with computational graphs, an efficient framework where nodes represent operations on data and edges represent data communicated between them (Culler, 1986); and (b) invariance of the representation under the graph, that is, the representation can be reused during inference.

Edward defines random variables as the key compositional representation. They are class objects with methods, for example, to compute the log density and to sample. Further, each random variable x is associated to a tensor (multi-dimensional array) x∗, which represents a single sample x∗ ∼ p(x). This association embeds the random variable onto a computational graph on tensors.
The design’ssimplicity makesit easy to developprobabilisticprogramsin a computationalgraph
framework. Importantly,allcomputationisrepresentedonthegraph. Thisenablesonetocompose
randomvariableswithcomplexdeterministicstructuresuchasdeepneuralnetworks,adiversesetof
mathoperations,andthirdpartylibrariesthatbuildonthesameframework.Thedesignalsoenables
compositionsofrandomvariablestocapturecomplexstochasticstructure.
Asanillustration,weuseaBeta-Bernoullimodel,p(x,θ)=Beta(θ|1,1) 50 Bernoulli(x |θ),
Qn=1 n
whereθisalatentprobabilitysharedacrossthe50datapointsx∈{0,1}50. Therandomvariablex
is50-dimensional,parameterizedbytherandomtensorθ∗. Fetchingtheobjectxrunsthegraph: it
simulatesfromthegenerativeprocessandoutputsabinaryvectorof50elements.
theta = Beta(a=1.0, b=1.0)
x = Bernoulli(p=tf.ones(50) * theta)

Figure 1: Beta-Bernoulli program (left) alongside its computational graph (right). Fetching x from the graph generates a binary vector of 50 elements.
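As a concrete illustration of the class-object interface described above, the following is a minimal sketch of sampling from and scoring the Figure 1 program. It assumes Edward 1.x on top of TensorFlow 1.x; as the text states, fetching x runs the generative process, and log_prob is the log-density method mentioned above.

# Minimal sketch (assumes Edward 1.x / TensorFlow 1.x, matching Figure 1).
import tensorflow as tf
from edward.models import Bernoulli, Beta

theta = Beta(a=1.0, b=1.0)
x = Bernoulli(p=tf.ones(50) * theta)

# Each random variable carries a log-density method; here we score an
# all-ones outcome under Bernoulli(x | theta) and sum over the 50 elements.
log_density = tf.reduce_sum(x.log_prob(tf.ones(50)))

with tf.Session() as sess:
    x_sample, ld = sess.run([x, log_density])  # fetches x* and its log density
    print(x_sample.shape)                      # (50,)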
All computation is registered symbolically on random variables and not over their execution. Symbolic representations do not require reifying the full model, which leads to unreasonable memory consumption for large models (Tristan et al., 2014). Moreover, it enables us to simplify both deterministic and stochastic operations in the graph, before executing any code (Ścibior et al., 2015; Zinkov & Shan, 2016).
# Probabilistic model
z = Normal(mu=tf.zeros([N, d]), sigma=tf.ones([N, d]))
h = Dense(256, activation='relu')(z)
x = Bernoulli(logits=Dense(28 * 28, activation=None)(h))

# Variational model
qx = tf.placeholder(tf.float32, [N, 28 * 28])
qh = Dense(256, activation='relu')(qx)
qz = Normal(mu=Dense(d, activation=None)(qh),
            sigma=Dense(d, activation='softplus')(qh))

Figure 2: Variational auto-encoder for a data set of 28 × 28 pixel images: (left) graphical model, with dotted lines for the inference model; (right) probabilistic program, with 2-layer neural networks.
With computational graphs, it is also natural to build mutable states within the probabilistic program. As a typical use of computational graphs, such states can define model parameters; in TensorFlow, this is given by a tf.Variable. Another use case is for building discriminative models p(y | x), where x are features that are input as training or test data. The program can be written independent of the data, using a mutable state (tf.placeholder) for x in its graph. During training and testing, we feed the placeholder the appropriate values.
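To make the placeholder pattern concrete, here is a hedged sketch of a small discriminative model, Bayesian logistic regression over D-dimensional features; the variable names, dimension, and feed values are illustrative rather than taken from the paper.

# Hedged sketch of a discriminative model p(y | x) built around a
# tf.placeholder (assumes Edward 1.x / TensorFlow 1.x).
import tensorflow as tf
from edward.models import Bernoulli, Normal

D = 5                                           # illustrative feature dimension
x = tf.placeholder(tf.float32, [None, D])       # features, fed at train/test time
w = Normal(mu=tf.zeros(D), sigma=tf.ones(D))    # weights
b = Normal(mu=0.0, sigma=1.0)                   # intercept
y = Bernoulli(logits=tf.reduce_sum(x * w, axis=1) + b)
# At training time we would feed x a batch of features, e.g.
#   sess.run(y, feed_dict={x: x_train_batch})   # x_train_batch is hypothetical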
In Appendix A, we provide examples of a Bayesian neural network for classification (A.1), latent Dirichlet allocation (A.2), and Gaussian matrix factorization (A.3). We present others below.
3.1 EXAMPLE: VARIATIONAL AUTO-ENCODER

Figure 2 implements a variational auto-encoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) in Edward. It comprises a probabilistic model over data and a variational model designed to approximate the former's posterior. Here we use random variables to construct both the probabilistic model and the variational model; they are fit during inference (more details in Section 4).
There are N data points x_n ∈ {0,1}^{28·28}, each with d latent variables z_n ∈ R^d. The program uses Keras (Chollet, 2015) to define neural networks. The probabilistic model is parameterized by a 2-layer neural network, with 256 hidden units (and ReLU activation), and generates 28 × 28 pixel images. The variational model is parameterized by a 2-layer inference network, with 256 hidden units, and outputs parameters of a normal posterior approximation.
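As a hedged preview of the inference API described in Section 4, the sketch below shows one way the two programs of Figure 2 could be tied together and fit. ed.KLqp is the KL(q ‖ p) variational inference class found in Edward 1.x releases and is an assumption of this sketch (the paper itself only names the parent VariationalInference class); x_train is a hypothetical N × 784 array of binarized images.

# Hedged sketch (not from the paper): fitting the VAE of Figure 2.
# ed.KLqp is assumed from Edward 1.x; binding the placeholder qx to the
# same data makes the inference network amortize over the observed images.
import edward as ed

inference = ed.KLqp({z: qz}, data={x: x_train, qx: x_train})
inference.run(n_iter=1000)  # illustrative number of iterations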
The probabilistic program is concise. Core elements of the VAE—such as its distributional assumptions and neural net architectures—are all extensible. With model compositionality, we can embed it into more complicated models (Gregor et al., 2015; Rezende et al., 2016) and for other learning tasks (Kingma et al., 2014). With inference compositionality (which we discuss in Section 4), we can embed it into more complicated algorithms, such as with expressive variational approximations (Rezende & Mohamed, 2015; Tran et al., 2016b; Kingma et al., 2016) and alternative objectives (Ranganath et al., 2016a; Li & Turner, 2016; Dieng et al., 2016).
3.2 EXAMPLE: BAYESIAN RECURRENT NEURAL NETWORK WITH VARIABLE LENGTH

Random variables can also be composed with control flow operations. As an example, Figure 3 implements a Bayesian recurrent neural network (RNN) with variable length. The data is a sequence of inputs {x_1, ..., x_T} and outputs {y_1, ..., y_T} of length T with x_t ∈ R^D and y_t ∈ R per time step. For t = 1, ..., T, a RNN applies the update

    h_t = tanh(W_h h_{t-1} + W_x x_t + b_h),

where the previous hidden state is h_{t-1} ∈ R^H. We feed each hidden state into the output's likelihood, y_t ∼ Normal(W_y h_t + b_y, 1), and we place a standard normal prior over all parameters {W_h ∈ R^{H×H}, W_x ∈ R^{D×H}, W_y ∈ R^{H×1}, b_h ∈ R^H, b_y ∈ R}. Our implementation is dynamic: it differs from a RNN with fixed length, which pads and unrolls the computation.
def rnn_cell(hprev, xt):
  return tf.tanh(tf.dot(hprev, Wh) + tf.dot(xt, Wx) + bh)

Wh = Normal(mu=tf.zeros([H, H]), sigma=tf.ones([H, H]))
Wx = Normal(mu=tf.zeros([D, H]), sigma=tf.ones([D, H]))
Wy = Normal(mu=tf.zeros([H, 1]), sigma=tf.ones([H, 1]))
bh = Normal(mu=tf.zeros(H), sigma=tf.ones(H))
by = Normal(mu=tf.zeros(1), sigma=tf.ones(1))

x = tf.placeholder(tf.float32, [None, D])
h = tf.scan(rnn_cell, x, initializer=tf.zeros(H))
y = Normal(mu=tf.matmul(h, Wy) + by, sigma=1.0)

Figure 3: Bayesian RNN: (left) graphical model; (right) probabilistic program. The program has an unspecified number of time steps; it uses a symbolic for loop (tf.scan).
3.3 STOCHASTIC CONTROL FLOW AND MODEL PARALLELISM

Figure 4: Computational graph for a probabilistic program with stochastic control flow. (The graph contains random variables p, a, and x with their associated sample tensors p∗, a∗, and x∗; a tf.while_loop(...) generates a.)
Random variables can also be placed in the control flow itself, enabling probabilistic programs with stochastic control flow. Stochastic control flow defines dynamic conditional dependencies, known in the literature as contingent or existential dependencies (Mansinghka et al., 2014; Wu et al., 2016). See Figure 4, where x may or may not depend on a for a given execution. In Appendix A.4, we use stochastic control flow to implement a Dirichlet process mixture model. Tensors with stochastic shape are also possible: for example, tf.zeros(Poisson(lam=5.0)) defines a vector of zeros with length given by a Poisson draw with rate 5.0.
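The following is a hedged, simplified sketch of stochastic control flow in this style, using tf.cond rather than the tf.while_loop of Figure 4; whether x depends on a is decided by a coin flip at run time, and all names are illustrative rather than taken from the paper.

# Hedged, simplified sketch of stochastic control flow (Figure 4 uses
# tf.while_loop; here a single tf.cond decides the dependency structure).
import tensorflow as tf
from edward.models import Bernoulli, Normal

a = Normal(mu=0.0, sigma=1.0)
coin = Bernoulli(p=0.5)

x = tf.cond(tf.cast(coin, tf.bool),
            lambda: tf.identity(Normal(mu=a, sigma=1.0)),    # x depends on a
            lambda: tf.identity(Normal(mu=0.0, sigma=1.0)))  # x ignores a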
Stochastic control flow produces difficulties for algorithms that use the graph structure because the relationship of conditional dependencies changes across execution traces. The computational graph, however, provides an elegant way of teasing out static conditional dependence structure (p) from dynamic dependence structure (a). We can perform model parallelism (parallel computation across components of the model) over the static structure with GPUs and batch training. We can use more generic computations to handle the dynamic structure.
4 COMPOSITIONAL REPRESENTATIONS FOR INFERENCE

We described random variables as a representation for building rich probabilistic programs over computational graphs. We now describe a compositional representation for inference. We desire two criteria: (a) support for many classes of inference, where the form of the inferred posterior depends on the algorithm; and (b) invariance of inference under the computational graph, that is, the posterior can be further composed as part of another model.

To explain our approach, we will use a simple hierarchical model as a running example. Figure 5 displays a joint distribution p(x, z, β) of data x, local variables z, and global variables β. The ideas here extend to more expressive programs.
4.1 INFERENCE AS STOCHASTIC GRAPH OPTIMIZATION

The goal of inference is to calculate the posterior distribution p(z, β | x_train; θ) given data x_train, where θ are any model parameters that we will compute point estimates for.² We formalize this as the following optimization problem:

    min_{λ,θ} L( p(z, β | x_train; θ), q(z, β; λ) ),    (1)

where q(z, β; λ) is an approximation to the posterior p(z, β | x_train; θ), and L is a loss function with respect to p and q.

²For example, we could replace x's sigma argument with tf.exp(tf.Variable(0.0)) * tf.ones([N, D]). This defines a model parameter initialized at 0 and positive-constrained.
N = 10000  # number of data points
D = 2      # data dimension
K = 5      # number of clusters

beta = Normal(mu=tf.zeros([K, D]), sigma=tf.ones([K, D]))
z = Categorical(logits=tf.zeros([N, K]))
x = Normal(mu=tf.gather(beta, z), sigma=tf.ones([N, D]))

Figure 5: Hierarchical model: (left) graphical model; (right) probabilistic program. It is a mixture of Gaussians over D-dimensional data {x_n} ∈ R^{N×D}. There are K latent cluster means β ∈ R^{K×D}.
The choice of approximation q, loss L, and rules to update parameters {θ, λ} are specified by an inference algorithm. (Note q can be nonparametric, such as a point or a collection of samples.)
In Edward, we write this problem as follows:

inference = ed.Inference({beta: qbeta, z: qz}, data={x: x_train})

Inference is an abstract class which takes two inputs. The first is a collection of latent random variables beta and z, associated to their "posterior variables" qbeta and qz respectively. The second is a collection of observed random variables x, which is associated to their realizations x_train.

The idea is that Inference defines and solves the optimization in Equation 1. It adjusts parameters of the distribution of qbeta and qz (and any model parameters) to be close to the posterior.

Class methods are available to finely control the inference. Calling inference.initialize() builds a computational graph to update {θ, λ}. Calling inference.update() runs this computation once to update {θ, λ}; we call the method in a loop until convergence. Importantly, no efficiency is lost in Edward's language: the computational graph is the same as if it were handwritten for a specific model. This means the runtime is the same; also see our experiments in Section 5.2.
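For concreteness, here is a hedged sketch of the initialize/update loop described above, with qbeta and qz built as in Figure 6 (left) and a concrete subclass standing in for the abstract Inference. ed.KLqp, ed.get_session, n_iter, and finalize are Edward 1.x conveniences assumed by this sketch, and the iteration count is illustrative.

# Hedged sketch of fine-grained control over inference (Edward 1.x assumed).
import edward as ed
import tensorflow as tf

inference = ed.KLqp({beta: qbeta, z: qz}, data={x: x_train})
inference.initialize(n_iter=1000)    # builds the computational graph of updates

sess = ed.get_session()
sess.run(tf.global_variables_initializer())
for _ in range(inference.n_iter):
    info_dict = inference.update()   # one stochastic update of {theta, lambda}
inference.finalize()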
A key concept in Edward is that there is no distinct "model" or "inference" block. A model is simply a collection of random variables, and inference is a way of modifying parameters in that collection subject to another. This reductionism offers significant flexibility. For example, we can infer only parts of a model (e.g., layer-wise training (Hinton et al., 2006)), infer parts used in multiple models (e.g., multi-task learning), or plug in a posterior into a new model (e.g., Bayesian updating).
4.2 CLASSES OF INFERENCE

The design of Inference is very general. We describe subclasses to represent many algorithms below: variational inference, Monte Carlo, and generative adversarial networks.

Variational inference posits a family of approximating distributions and finds the closest member in the family to the posterior (Jordan et al., 1999). In Edward, we build the variational family in the graph; see Figure 6 (left). For our running example, the family has mutable variables as parameters λ = {π, µ, σ}, where q(β; µ, σ) = Normal(β; µ, σ) and q(z; π) = Categorical(z; π).

Specific variational algorithms inherit from the VariationalInference class. Each defines its own methods, such as a loss function and gradient. For example, we represent maximum a posteriori (MAP) estimation with an approximating family (qbeta and qz) of PointMass random variables, i.e., with all probability mass concentrated at a point. MAP inherits from VariationalInference and defines the negative log joint density as the loss function; it uses existing optimizers inside TensorFlow.
In Section 5.1, we experiment with multiple gradient estimators for black box variational inference (Ranganath et al., 2014). Each estimator implements the same loss (an objective proportional to the divergence KL(q ‖ p)) and a different update rule (stochastic gradient).
# Figure 6 (left): variational inference
qbeta = Normal(
    mu=tf.Variable(tf.zeros([K, D])),
    sigma=tf.exp(tf.Variable(tf.zeros([K, D]))))
qz = Categorical(
    logits=tf.Variable(tf.zeros([N, K])))

inference = ed.VariationalInference(
    {beta: qbeta, z: qz}, data={x: x_train})

# Figure 6 (right): Monte Carlo
T = 10000  # number of samples
qbeta = Empirical(
    params=tf.Variable(tf.zeros([T, K, D])))
qz = Empirical(
    params=tf.Variable(tf.zeros([T, N])))

inference = ed.MonteCarlo(
    {beta: qbeta, z: qz}, data={x: x_train})

Figure 6: (left) Variational inference. (right) Monte Carlo.
def generative_network(eps):
  h = Dense(256, activation='relu')(eps)
  return Dense(28 * 28, activation=None)(h)

def discriminative_network(x):
  h = Dense(28 * 28, activation='relu')(x)
  return Dense(1, activation=None)(h)

# Probabilistic model
eps = Normal(mu=tf.zeros([N, d]), sigma=tf.ones([N, d]))
x = generative_network(eps)

inference = ed.GANInference(data={x: x_train},
                            discriminator=discriminative_network)

Figure 7: Generative adversarial networks: (left) graphical model; (right) probabilistic program. The model (generator) uses a parameterized function (discriminator) for training.
Monte Carlo approximates the posterior using samples (Robert & Casella, 1999). Monte Carlo is an inference where the approximating family is an empirical distribution, q(β; {β^(t)}) = (1/T) Σ_{t=1}^T δ(β, β^(t)) and q(z; {z^(t)}) = (1/T) Σ_{t=1}^T δ(z, z^(t)). The parameters are λ = {β^(t), z^(t)}. See Figure 6 (right). Monte Carlo algorithms proceed by updating one sample β^(t), z^(t) at a time in the empirical approximation. Specific MC samplers determine the update rules: they can use gradients such as in Hamiltonian Monte Carlo (Neal, 2011) and graph structure such as in sequential Monte Carlo (Doucet et al., 2001).
Edward also supports non-Bayesian methods such as generative adversarial networks (GANs) (Goodfellow et al., 2014). See Figure 7. The model posits random noise eps over N data points, each with d dimensions; this random noise feeds into a generative_network function, a neural network that outputs real-valued data x. In addition, there is a discriminative_network which takes data as input and outputs the probability that the data is real (in logit parameterization). We build GANInference; running it optimizes parameters inside the two neural network functions. This approach extends to many advances in GANs (e.g., Denton et al. (2015); Li et al. (2015)).
Finally, one can design algorithms that would otherwise require tedious algebraic manipulation. With symbolic algebra on nodes of the computational graph, we can uncover conjugacy relationships between random variables. Users can then integrate out variables to automatically derive classical Gibbs (Gelfand & Smith, 1990), mean-field updates (Bishop, 2006), and exact inference. These algorithms are currently being developed in Edward.
4.3 COMPOSING INFERENCES

Core to Edward's design is that inference can be written as a collection of separate inference programs. Below we demonstrate variational EM, with an (approximate) E-step over local variables and an M-step over global variables. We instantiate two algorithms, each of which conditions on inferences from the other, and we alternate with one update of each (Neal & Hinton, 1993):

qbeta = PointMass(params=tf.Variable(tf.zeros([K, D])))
qz = Categorical(logits=tf.Variable(tf.zeros([N, K])))

inference_e = ed.VariationalInference({z: qz}, data={x: x_train, beta: qbeta})
inference_m = ed.MAP({beta: qbeta}, data={x: x_train, z: qz})
...
for _ in range(10000):
  inference_e.update()
  inference_m.update()
This extends to many other cases such as exact EM for exponential families, contrastive divergence (Hinton, 2002), pseudo-marginal methods (Andrieu & Roberts, 2009), and Gibbs sampling within variational inference (Wang & Blei, 2012; Hoffman & Blei, 2015). We can also write message passing algorithms, which solve a collection of local inference problems (Koller & Friedman, 2009). For example, classical message passing uses exact local inference, and expectation propagation locally minimizes the Kullback-Leibler divergence, KL(p ‖ q) (Minka, 2001).
4.4 DATA SUBSAMPLING

Stochastic optimization (Bottou, 2010) scales inference to massive data and is key to algorithms such as stochastic gradient Langevin dynamics (Welling & Teh, 2011) and stochastic variational inference (Hoffman et al., 2013). The idea is to cheaply estimate the model's log joint density in an unbiased way. At each step, one subsamples a data set {x_m} of size M and then scales densities with respect to local variables,

    log p(x, z, β) = log p(β) + Σ_{n=1}^N [ log p(x_n | z_n, β) + log p(z_n | β) ]
                   ≈ log p(β) + (N/M) Σ_{m=1}^M [ log p(x_m | z_m, β) + log p(z_m | β) ].
To support stochastic optimization, we represent only a subgraph of the full model. This prevents reifying the full model, which can lead to unreasonable memory consumption (Tristan et al., 2014). During initialization, we pass in a dictionary to properly scale the arguments. See Figure 8.
beta = Normal(mu=tf.zeros([K, D]), sigma=tf.ones([K, D]))
z = Categorical(logits=tf.zeros([M, K]))
x = Normal(mu=tf.gather(beta, z), sigma=tf.ones([M, D]))

qbeta = Normal(mu=tf.Variable(tf.zeros([K, D])),
               sigma=tf.nn.softplus(tf.Variable(tf.zeros([K, D]))))
qz = Categorical(logits=tf.Variable(tf.zeros([M, K])))

inference = ed.VariationalInference({beta: qbeta, z: qz}, data={x: x_batch})
inference.initialize(scale={x: float(N) / M, z: float(N) / M})

Figure 8: Data subsampling with a hierarchical model. We define a subgraph of the full model, forming a plate of size M rather than N. We then scale all local random variables by N/M.
Conceptually, the scale argument represents scaling for each random variable's plate, as if we had seen that random variable N/M as many times. As an example, Appendix B shows how to implement stochastic variational inference in Edward. The approach extends naturally to streaming data (Doucet et al., 2000; Broderick et al., 2013; McInerney et al., 2015), dynamic batch sizes, and data structures in which working on a subgraph does not immediately apply (Binder et al., 1997; Johnson & Willsky, 2014; Foti et al., 2014).
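To connect Figure 8 to an actual training loop, here is a hedged sketch of feeding minibatches through the scaled subgraph; x_batch is assumed to be a placeholder of shape [M, D], next_batch is a hypothetical minibatch sampler, and ed.get_session is an Edward 1.x convenience assumed here.

# Hedged sketch of minibatch training with the scaled subgraph of Figure 8.
# x_batch is assumed to be tf.placeholder(tf.float32, [M, D]); next_batch is
# a hypothetical function returning M rows of the full data set.
import edward as ed
import tensorflow as tf

sess = ed.get_session()
sess.run(tf.global_variables_initializer())
for _ in range(1000):                        # illustrative number of updates
    x_mb = next_batch(M)                     # hypothetical minibatch sampler
    inference.update(feed_dict={x_batch: x_mb})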
5 EXPERIMENTS

In this section, we illustrate two main benefits of Edward: flexibility and efficiency. For the former, we show how it is easy to compare different inference algorithms on the same model. For the latter, we show how it is easy to get significant speedups by exploiting computational graphs.
5.1 RECENT METHODS IN VARIATIONAL INFERENCE

We demonstrate Edward's flexibility for experimenting with complex inference algorithms. We consider the VAE setup from Figure 2 and the binarized MNIST data set (Salakhutdinov & Murray, 2008). We use d = 50 latent variables per data point and optimize using ADAM. We study different components of the VAE setup using different methods; Appendix C.1 is a complete script. After training we evaluate held-out log likelihoods, which are lower bounds on the true value.
Inference method                                                    Negative log-likelihood
VAE (Kingma & Welling, 2014)                                        ≤ 88.2
VAE without analytic KL                                             ≤ 89.4
VAE with analytic entropy                                           ≤ 88.1
VAE with score function gradient                                    ≤ 87.9
Normalizing flows (Rezende & Mohamed, 2015)                         ≤ 85.8
Hierarchical variational model (Ranganath et al., 2016b)            ≤ 85.4
Importance-weighted auto-encoders (K = 50) (Burda et al., 2016)     ≤ 86.3
HVM with IWAE objective (K = 5)                                     ≤ 85.2
Rényi divergence (α = −1) (Li & Turner, 2016)                       ≤ 140.5

Table 1: Inference methods for a probabilistic decoder on binarized MNIST. The Edward PPL is a convenient research platform, making it easy to both develop and experiment with many algorithms.
Table 1 shows the results. The first method uses the VAE from Figure 2. The next three methods use the same VAE but apply different gradient estimators: reparameterization gradient without an analytic KL; reparameterization gradient with an analytic entropy; and the score function gradient (Paisley et al., 2012; Ranganath et al., 2014). This typically leads to the same optima but at different convergence rates. The score function gradient was slowest. Gradients with an analytic entropy produced difficulties around convergence: we switched to stochastic estimates of the entropy as it approached an optimum. We also use hierarchical variational models (HVMs) (Ranganath et al., 2016b) with a normalizing flow prior; it produced similar results to a normalizing flow on the latent variable space (Rezende & Mohamed, 2015), and better than importance-weighted auto-encoders (IWAEs) (Burda et al., 2016).
We also study novel combinations, such as HVMs with the IWAE objective, GAN-based optimization on the decoder (with pixel intensity-valued data), and Rényi divergence on the decoder. GAN-based optimization does not enable calculation of the log-likelihood; Rényi divergence does not directly optimize for log-likelihood so it does not perform well. The key point is that Edward is a convenient research platform: they are all easy modifications of a given script.
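As an illustration of the kind of one-line modification described here, the sketch below swaps the inference class while keeping the model and variational program of Figure 2 fixed; the class names are taken from Edward 1.x releases and are assumptions of this sketch, not identifiers from the paper.

# Hedged sketch (class names assumed from Edward 1.x; not from the paper).
# The model (x, z) and variational program (qz, qx) are those of Figure 2.
inference = ed.KLqp({z: qz}, data={x: x_train, qx: x_train})        # VAE-style KL(q || p) objective
# inference = ed.ScoreKLqp({z: qz}, data={x: x_train, qx: x_train}) # score function gradient
# inference = ed.KLpq({z: qz}, data={x: x_train, qx: x_train})      # alternative KL(p || q) objective
inference.run(n_iter=1000)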
5.2 GPU-ACCELERATED HAMILTONIAN MONTE CARLO

# Model
x = tf.Variable(x_data, trainable=False)
beta = Normal(mu=tf.zeros(D), sigma=tf.ones(D))
y = Bernoulli(logits=tf.dot(x, beta))

# Inference
qbeta = Empirical(params=tf.Variable(tf.zeros([T, D])))
inference = ed.HMC({beta: qbeta}, data={y: y_data})
inference.run(step_size=0.5 / N, n_steps=10)

Figure 9: Edward program for Bayesian logistic regression with Hamiltonian Monte Carlo (HMC).

We benchmark runtimes for a fixed number of Hamiltonian Monte Carlo (HMC; Neal, 2011) iterations on modern hardware: a 12-core Intel i7-5930K CPU at 3.50GHz and an NVIDIA Titan X (Maxwell) GPU. We apply logistic regression on the Covertype dataset (N = 581012, D = 54; responses were binarized) using Edward, Stan (with PyStan) (Carpenter et al., 2016), and PyMC3 (Salvatier et al., 2015). We ran 100 HMC iterations, with 10 leapfrog updates per iteration, a step size of 0.5/N, and single precision. Figure 9 illustrates the program in Edward.
Table 2 displays the runtimes.³ Edward (GPU) features a dramatic 35x speedup over Stan (1 CPU) and 6x speedup over PyMC3 (12 CPU). This showcases the value of building a PPL on top of computational graphs. The speedup stems from fast matrix multiplication when calculating the model's log-likelihood; GPUs can efficiently parallelize this computation. We expect similar speedups for models whose bottleneck is also matrix multiplication, such as deep neural networks.

³In a previous version of this paper, we reported PyMC3 took 361s. This was caused by a bug preventing PyMC3 from correctly handling single-precision floating point. (PyMC3 with double precision is roughly 14x slower than Edward (GPU).) This has been fixed after discussion with Thomas Wiecki. The reported numbers also exclude compilation time, which is significant for Stan.
Probabilistic programming system           Runtime (s)
Handwritten NumPy (1 CPU)                  534
Stan (1 CPU) (Carpenter et al., 2016)      171
PyMC3 (12 CPU) (Salvatier et al., 2015)    30.0
Edward (12 CPU)                            8.2
Handwritten TensorFlow (GPU)               5.0
Edward (GPU)                               4.9

Table 2: HMC benchmark for large-scale logistic regression. Edward (GPU) is significantly faster than other systems. In addition, Edward has no overhead: it is as fast as handwritten TensorFlow.
There are various reasons for the speedup. Stan only used 1 CPU as it leverages multiple cores by running HMC chains in parallel. Stan also used double-precision floating point as it does not allow single-precision. For PyMC3, we note Edward's speedup is not a result of PyMC3's Theano backend compared to Edward's TensorFlow. Rather, PyMC3 does not use Theano for all its computation, so it experiences communication overhead with NumPy. (PyMC3 was actually slower when using the GPU.) We predict that porting Edward's design to Theano would feature similar speedups.
In addition to these speedups, we highlight that Edward has no runtime overhead: it is as fast as handwritten TensorFlow. Following Section 4.1, this is because the computational graphs for inference are in fact the same for Edward and the handwritten code.
5.3 PROBABILITY ZOO

In addition to Edward, we also release the Probability Zoo, a community repository for pre-trained probability models and their posteriors.⁴ It is inspired by the model zoo in Caffe (Jia et al., 2014), which provides many pre-trained discriminative neural networks, and which has been key to making large-scale deep learning more transparent and accessible. It is also inspired by Forest (Stuhlmüller, 2012), which provides examples of probabilistic programs.

⁴The Probability Zoo is available at http://edwardlib.org/zoo. It includes model parameters and inferred posterior factors, such as local and global variables during training and any inference networks.
6 DISCUSSION: CHALLENGES & EXTENSIONS

We described Edward, a Turing-complete PPL with compositional representations for probabilistic models and inference. Edward expands the scope of probabilistic programming to be as flexible and computationally efficient as traditional deep learning. For flexibility, we showed how Edward can use a variety of composable inference methods, capture recent advances in variational inference and generative adversarial networks, and finely control the inference algorithms. For efficiency, we showed how Edward leverages computational graphs to achieve fast, parallelizable computation, scales to massive data, and incurs no runtime overhead over handwritten code.

In present work, we are applying Edward as a research platform for developing new probabilistic models (Rudolph et al., 2016; Tran et al., 2017) and new inference algorithms (Dieng et al., 2016).

As with any language design, Edward makes tradeoffs in pursuit of its flexibility and speed for research. For example, an open challenge in Edward is to better facilitate programs with complex control flow and recursion. While possible to represent, it is unknown how to enable their flexible inference strategies. In addition, it is open how to expand Edward's design to dynamic computational graph frameworks, which provide more flexibility in their programming paradigm but may sacrifice performance. A crucial next step for probabilistic programming is to leverage dynamic computational graphs while maintaining the flexibility and efficiency that Edward offers.
ACKNOWLEDGEMENTS

We thank the probabilistic programming community—for sharing our enthusiasm and motivating further work—including developers of Church, Venture, Gamalon, Hakaru, and WebPPL. We also thank Stan developers for providing extensive feedback as we developed the language, as well as Thomas Wiecki for experimental details. We thank the Google BayesFlow team—Joshua Dillon, Ian Langmore, Ryan Sepassi, and Srinivas Vasudevan—as well as Amr Ahmed, Matthew Johnson, Hung Bui, Rajesh Ranganath, Maja Rudolph, and Francisco Ruiz for their helpful feedback. This work is supported by NSF IIS-1247664, ONR N00014-11-1-0651, DARPA FA8750-14-2-0009, DARPA N66001-15-C-4032, Adobe, Google, NSERC PGS-D, and the Sloan Foundation.
REFERENCES

Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zhang. TensorFlow: A system for large-scale machine learning. arXiv preprint arXiv:1605.08695, 2016.

Christophe Andrieu and Gareth O Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, pp. 697–725, 2009.

John Binder, Kevin Murphy, and Stuart Russell. Space-efficient inference in dynamic probabilistic networks. In International Joint Conference on Artificial Intelligence, 1997.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186. Springer, 2010.

Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. Streaming Variational Bayes. In Neural Information Processing Systems, 2013.

Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In International Conference on Learning Representations, 2016.

Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 2016.

François Chollet. Keras. https://github.com/fchollet/keras, 2015.

David E Culler. Dataflow architectures. Technical report, DTIC Document, 1986.

Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural Computation, 7(5):889–904, 1995.

Emily L Denton, Soumith Chintala, Rob Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Neural Information Processing Systems, 2015.

Adji B. Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David M. Blei. χ-divergence for approximate inference. arXiv preprint arXiv:1611.00328, 2016.

Arnaud Doucet, Simon Godsill, and Christophe Andrieu. On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208, 2000.

Arnaud Doucet, Nando De Freitas, and Neil Gordon. An introduction to sequential Monte Carlo methods. In Sequential Monte Carlo Methods in Practice, pp. 3–14. Springer, 2001.

Nicholas Foti, Jason Xu, Dillon Laird, and Emily Fox. Stochastic variational inference for hidden Markov models. In Neural Information Processing Systems, 2014.

Alan E Gelfand and Adrian FM Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990.