Particle Gibbs with Ancestor Sampling for Probabilistic Programs

Jan-Willem van de Meent (Dept. of Statistics, Columbia University)
Hongseok Yang (Dept. of Computer Science, University of Oxford)
Vikash Mansinghka (Computer Science & AI Lab, Massachusetts Institute of Technology)
Frank Wood (Dept. of Engineering, University of Oxford)

Abstract

Particle Markov chain Monte Carlo techniques rank among current state-of-the-art methods for probabilistic program inference. A drawback of these techniques is that they rely on importance resampling, which results in degenerate particle trajectories and a low effective sample size for variables sampled early in a program. We here develop a formalism to adapt ancestor resampling, a technique that mitigates particle degeneracy, to the probabilistic programming setting. We present empirical results that demonstrate nontrivial performance gains.

1 Introduction

Probabilistic programming languages extend traditional languages with primitives for sampling and conditioning on random variables. Running a program F generates random variates x for some subset of program expressions, which can be thought of as a sample from a prior p(x|F) implicitly defined by the program. In a conditioned program F[y], a subset of expressions is constrained to take on observed values y. This defines a posterior distribution p(x|F[y]) on the random variates x that can be generated by the program F[y].

Probabilistic programs are in essence procedural representations of generative models. These representations are often both succinct and extendable, making it easy to iterate over alternative model designs. Any program that always samples and constrains a fixed set of variables {x, y} admits an alternate representation as a graphical model. In general, a program F[y] may also have random variables that are only generated when certain conditions on previously sampled variates are met. The random variables x therefore need not have the same entries for all possible executions of F[y]. In languages that have recursion and higher-order functions (i.e. functions that act on other functions), it is straightforward to define models that can instantiate an arbitrary number of random variables, such as certain Bayesian nonparametrics, or models that are specified in terms of a generative grammar. At the same time this greater expressivity makes it challenging to design methods for efficient posterior inference in arbitrary programs.

In this paper we show how a recently proposed technique known as particle Gibbs with ancestor sampling (PGAS) [Lindsten et al., 2012] can be adapted to inference in higher-order probabilistic programming systems [Mansinghka et al., 2014, Wood et al., 2014, Goodman et al., 2008]. A PGAS implementation requires combining a partial execution history, or prefix, with the remainder of a previously completed execution history, which we call a suffix. We develop a formalism for performing this operation in a manner that guarantees a consistent program state, correctly updates the probabilities associated with each sampled and observed random variable, and avoids unnecessary recomputation where possible. An empirical evaluation demonstrates that the increased statistical efficiency of PGAS can easily outweigh its greater computational cost.

1.1 Related Work

Current generation probabilistic programming systems fall into two broad categories. On the one hand, systems like Infer.NET [Minka et al., 2010] and STAN [Stan Development Team, 2014] restrict the language syntax in order to omit recursion.
This ensures that the set of variables is bounded and has a well-defined dependency hierarchy. On the other hand, languages such as Church [Goodman et al., 2008], Venture [Mansinghka et al., 2014], Anglican [Wood et al., 2014], and Probabilistic C [Paige and Wood, 2014] do not impose such restrictions, which makes the design of general-purpose inference methods more difficult.

Arguably the simplest inference methods for probabilistic programming languages rely on Sequential Monte Carlo (SMC) [Del Moral et al., 2006]. These "forward" techniques sample from the prior by running multiple copies of the program, calculating importance weights using the conditioning expressions as needed. The only non-trivial requirement for implementing SMC variants is that there exists an efficient mechanism for "forking" a program state into multiple copies that may continue execution independently.

Another well-known inference technique for probabilistic programs is lightweight Metropolis-Hastings (LMH) [Wingate et al., 2011]. LMH methods construct a database of sampled values at run time. A change to the sampled values is proposed, and the program is rerun in its entirety, substituting previously sampled values where possible. The newly constructed database of random variables is then either accepted or rejected. LMH is straightforward to implement, but costly, since the program must be re-run in its entirety to evaluate the acceptance ratio. A more computationally efficient strategy is offered by Venture [Mansinghka et al., 2014], which represents the execution history of a program as a graph, in which each evaluated expression is a node. A graph walk can then determine the subset of expressions affected by a proposed change, allowing partial re-execution of the program conditioned on all unaffected nodes.

SMC and MH based algorithms each have trade-offs. Techniques that derive from SMC can be run very efficiently, but suffer from particle degeneracy, resulting in a deteriorating quality of posterior estimates calculated from values sampled early in the program. In MH methods subsequent samples are typically correlated, and many updates may be needed to obtain an independent sample. As we will discuss in Section 3, PGAS can be thought of as a hybrid technique, in the sense that SMC is used to generate independent updates to the previous sample, which mitigates the degeneracy issues associated with SMC whilst increasing mixing rates relative to MH.

2 Probabilistic Functional Programs

2.1 Language Syntax

For the purposes of exposition we will consider a simple Lisp dialect, extended with primitives for sampling and observing random variables

e ::= c | s | (e &e) | (lambda (&s) e)
    | (if e e e) | (quote e)
v ::= bool | int | float | string | primitive
    | compound | (&v) | stochastic

An expression e is either a constant literal c, a symbol s, an application (e &e) with operator e and zero or more arguments &e, a function literal (lambda (&s) e) with argument list (&s), an if-statement (if e e e), or a quoted expression (quote e). Each expression e evaluates to a value v upon execution. In addition to the self-explanatory bool, int, float, and string types, values can be primitive procedures (i.e. language built-ins such as +, -, etc.), compound procedures (i.e. closures), and lists of zero or more values (&v). The stochastic type represents stochastic processes, whose samples must either be i.i.d. or exchangeable. A stochastic process sp supports two operations

(sample sp) -> v
(observe sp v) -> v

The sample primitive draws a value from sp, whereas the observe primitive conditions execution on a sample v that is returned as passed. Both operations associate a value v with sp as a side-effect, which changes the probability of the execution state in the inference procedure. For exchangeable processes such as (crp 1.0) this also affects the probability of future samples.

Following the convention employed by Venture and Anglican, we define programs as sequences of three types of top-level statements

F ::= t F | END
t ::= [assume s e] | [observe e v] | [predict e]

Variables in the global environment are defined using the assume directive. The observe directive conditions execution in the same way as its non-toplevel equivalent. The predict directive returns the value of the expression e as inference output.
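To make the two operations on stochastic processes concrete, the following minimal sketch (in Python, purely for exposition; the class name and interface are our own illustrative assumptions, not part of the language defined above) models an i.i.d. stochastic value whose observe operation returns the log-density that the inference procedure accumulates as a side-effect:

import math, random

class NormalProcess:
    """Illustrative stochastic value: i.i.d. draws from Normal(mu, sigma)."""
    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def sample(self):
        # (sample sp) -> v : draw a value from the process
        return random.gauss(self.mu, self.sigma)

    def observe(self, v):
        # (observe sp v) -> v : condition on v; the returned log-density
        # is added to the log-weight of the current execution state
        z = (v - self.mu) / self.sigma
        return -0.5 * z * z - math.log(self.sigma * math.sqrt(2.0 * math.pi))

An exchangeable process such as (crp 1.0) would additionally mutate internal sufficient statistics on each call, so that the probability of future samples changes.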
2.2 Importance Sampling Semantics

In many probabilistic languages the (observe e v) form is semantically equivalent to imposing a rejection-sampling criterion. For example, a program subject to (observe (> a 0) true) can be interpreted as a rejection sampler that repeatedly runs the program and only returns predict values when (> a 0) holds.

The semantics of observe that we have defined here imply an interpretation of a probabilistic program as an importance sampler, where sample draws from the prior and observe assigns an importance weight. Instead of constraining the value of an arbitrary expression e, observe conditions on the value of (sample e). This restricted form of conditioning guarantees that we can calculate the likelihood p(v | (sample e)) for every observe, as long as we implement a density function for all possible stochastic values in the language.
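As a sketch of this reading, a single execution of a conditioned program can be driven as a coroutine that returns the pair (log W, x). The generator-based interface below is an assumption of ours for illustration (it reuses the NormalProcess sketch above); it is not the paper's machinery:

def run(program):
    """Execute F[y] once: sample draws from the prior, observe accumulates
    a log-weight, so one run yields (log W, x)."""
    log_weight, xs = 0.0, []
    gen = program()
    try:
        request = next(gen)
        while True:
            if request[0] == 'sample':            # ('sample', sp)
                v = request[1].sample()           # draw from the prior
                xs.append(v)
                request = gen.send(v)
            else:                                 # ('observe', sp, v)
                log_weight += request[1].observe(request[2])
                request = gen.send(request[2])    # observe returns v as passed
    except StopIteration:
        pass
    return log_weight, xs

def model():                                      # a two-statement example program
    a = yield ('sample', NormalProcess(0.0, 1.0))
    yield ('observe', NormalProcess(a, 1.0), 0.1)

print(run(model))   # one weighted sample: (log W, x)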
More formally, we use the notation F[y] to refer to a program conditioned on values y via a sequence of top-level [observe e v] statements. Execution of F[y] will require the evaluation of a number of (sample e) expressions, whose values we will denote with x. We now informally define F[y,x] as the program in which all (sample e) expressions are replaced by conditioned equivalents (observe e v), resulting in a fully deterministic execution. Similarly we can define F[x] as the program obtained from F[y,x] by replacing [observe e v] statements with unconditioned forms [assume s (sample e)] with a unique symbol s in each statement.

Whereas top-level observe statements are fixed in number and order, F may not evaluate the same combination of sample calls in every execution. To provide a more precise definition of F[x], we associate a unique address α with each sample call that can occur in the execution of F. We here use a scheme in which the run-time address of each evaluation is a concatenation α′::(t,p) of the address α′ of the parent evaluation and a tuple (t,p) in which t identifies the expression type and p is the index of the sub-expression within the form. This particular scheme labels every evaluation, not just the sample calls, allowing us to formally represent F : A → E as a mapping from addresses A to program expressions E. We represent x : S → V as a mapping from a subset of addresses S ⊂ A associated with sample calls to values V. Similarly, y : O → V is a mapping from addresses O ⊂ A associated with observe statements to values. With these definitions in place, a conditioned program F[x] : A → E simply replaces F(α) = (sample e) with

F[x](α) = (observe e x(α)),  for all α ∈ S.

The importance sampling interpretation of a program, written F[y] ⇝ W, x (read as: F[y] yields W, x), is defined in terms of random variables x and a weight W

F[y] ⇝ W, x    W = p(y | F[x]).

By definition, the generated samples x are drawn from the prior p(x|F). The weight W = p(y|F[x]) is the joint probability of all top-level observe statements in F[y]. Repeated execution of F[y] yields a weighted sample set {W^l, x^l} that may be used to approximate the posterior as

p(x | F[y]) ≈ p_L(x | F[y]) = Σ_l [ W^l / Σ_k W^k ] δ_{x^l}.

More generally the importance weight is defined as the joint probability of all observe calls (top-level and transformed) in the program

F[y,x] ⇝ W          W = p(y, x | F),
F[x]   ⇝ W, y       W = p(x | F[y]),
F      ⇝ W, y, x    W = 1.

3 Particle MCMC Methods

3.1 Sequential Monte Carlo

SMC methods are importance sampling techniques that target a posterior p(x|y) on a space X by performing importance sampling on unnormalized densities {γ_n(x_n)} for n = 1, ..., N, defined on spaces of expanding dimensionality {X_n}, where each X_n ⊆ X_{n+1} and X_N = X. This results in a series of intermediate particle sets {w_n^l, x_n^l}, which we refer to as generations. Each generation is sampled via two steps,

a_n^l ∼ R(a_n | w_{n-1}),    x_n^l ∼ ρ_n(x_n | x_{n-1}^{a_n^l}).

Here R(a | w) is a resampling procedure that returns index a = l with probability w^l / Σ_{l′} w^{l′}, and ρ_n is a transition kernel. The samples x_n^l are assigned weights

w_n^l = γ_n(x_n^l) / [ γ_{n-1}(x_{n-1}^{a_n^l}) ρ_n(x_n^l | x_{n-1}^{a_n^l}) ].

In the context of probabilistic programs, we can define a series of partial programs F_n[y_n] that truncate at each top-level [observe e v] statement. We can then sequentially sample F_n[y_n, x_{n-1}] ⇝ W_n, x_n by partially conditioning on x_{n-1} at each generation. Since x_n is a sample from the prior, this results in an importance weight [Wood et al., 2014]

w_n^l = p(y_n | F_n[x_n^l]) / p(y_{n-1} | F_{n-1}[x_{n-1}^{a_n^l}]).

Note that this is simply the likelihood of the n-th top-level observe. In practice we continue execution relative to F_{n-1}[y_{n-1}, x_{n-1}] to avoid rerunning F_n[y_n, x_{n-1}] in its entirety. This means that the inference backend must include a routine for forking multiple independent executions from a single state.
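The following sketch summarizes one such sweep. The interface is assumed for illustration: extend(prefix) plays the role of forking a stopped execution and running it up to the next top-level observe (extend(None) starts a fresh execution), and loglik(x, n) returns the log-likelihood of the n-th observe. Neither function is defined in the paper; both stand in for the backend routines described above.

import math, random

def normalize(logw):
    """Exponentiate and normalize a list of log-weights."""
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    s = sum(w)
    return [wi / s for wi in w]

def resample(w):
    """R(a | w): return index l with probability w^l."""
    u, c = random.random(), 0.0
    for l, wl in enumerate(w):
        c += wl
        if u <= c:
            return l
    return len(w) - 1

def smc_sweep(extend, loglik, N, L):
    """One sweep with L particles over N generations; returns (x_N^l, w_N^l)."""
    particles = [extend(None) for _ in range(L)]             # x_1^l from the prior
    weights = normalize([loglik(x, 1) for x in particles])
    for n in range(2, N + 1):
        ancestors = [resample(weights) for _ in range(L)]    # a_n^l ~ R(a | w_{n-1})
        particles = [extend(particles[a]) for a in ancestors]
        weights = normalize([loglik(x, n) for x in particles])
    return particles, weights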
3.2 Iterative Conditional SMC

An advantage of SMC methods is that they provide a generic strategy for joint proposals in high-dimensional spaces. An importance sampling scheme where F[y] ⇝ W, x draws from the prior has a vanishingly small probability of generating a high-weight sample. By sampling the smallest possible set of variables x_n^l at each generation and selecting ancestors a_{n+1}^l according to the likelihood of the next observed data point, we ensure that x_{n+1}^l is sampled conditioned on high-weight values of x_n from the previous generation. At the same time this strategy has a drawback: each time the particle set is resampled, the number of unique values at previous generations decreases, typically resulting in coalescence to a single common ancestor in O(L log L) generations [Jacob et al., 2013].

In many applications it is not practically possible (due to memory requirements) to set L to a value large enough to guarantee a sufficient number of independent samples at all generations. In such cases, particle variants of MCMC techniques [Andrieu et al., 2010] can be used to combine samples from multiple SMC sweeps. An iterated conditional SMC (ICSMC) sampler repeatedly selects a retained particle k with probability w_N^k / Σ_{k′} w_N^{k′}, and then performs a conditional SMC (CSMC) sweep, where the resampling step is conditioned on the inclusion of the retained particle at each generation. Formally, this procedure is a partially collapsed Gibbs sampler that targets a density φ(x, a, k) on an extended space

φ(x_{1:N}^{1:L}, a_{2:N}^{1:L}, k) = [ w_N^k / Σ_{k′} w_N^{k′} ] Π_{l=1}^{L} p(x_1^l | F_1) Π_{n=2}^{N} Π_{l=1}^{L} [ w_{n-1}^{a_n^l} / Σ_{l′} w_{n-1}^{l′} ] p(x_n^l | F_n[x_{n-1}^{a_n^l}]).

We use the shorthand x_{1:N}^k = x_{1:N}^{b_1:b_N} and a_{2:N}^k = a_{2:N}^{b_2:b_N} to refer to the sampled values and ancestor indices of the retained particle, whose index b_n at each generation can be recursively defined via b_N = k and b_{n-1} = a_n^{b_n}. The notation x^{−k} and a^{−k} refers to the complements where the retained particle is excluded. An ICSMC sampler iterates between two updates

1. {x^{∗,−k}, a^{∗,−k}} ∼ φ(x^{−k}, a^{−k} | x^k, a^k, k)
2. k^∗ ∼ φ(k | x^{∗,−k}, a^{∗,−k}, x^k, a^k)

If we interpret x^{−k}, a and k as auxiliary variables, then the marginal on x_N^k leaves the density γ_N(x_N) invariant [Andrieu et al., 2010].

The advantage of ICSMC samplers is that the target space is iteratively explored over subsequent CSMC sweeps. A disadvantage is that consecutive sweeps often yield partially degenerate particles, since many newly generated particles will coalesce to the retained particle with high probability. For this reason ICSMC samplers mix poorly when the number of particles is not large enough to generate at least two completely independent lineages in a single CSMC sweep.
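A conditional sweep differs from the SMC sketch above only in that one slot is pinned to the retained trajectory. The sketch below reuses the assumed helpers (extend, loglik, normalize, resample); retained[n] holds the retained prefix x_n^k for n = 1, ..., N (e.g. a dict keyed by generation). It is an illustration of the scheme, not the paper's implementation:

def csmc_sweep(extend, loglik, N, L, retained):
    """CSMC sweep: slot 0 always carries the retained particle's prefix."""
    particles = [retained[1]] + [extend(None) for _ in range(L - 1)]
    weights = normalize([loglik(x, 1) for x in particles])
    for n in range(2, N + 1):
        # free slots resample ancestors; the retained slot keeps its lineage
        ancestors = [resample(weights) for _ in range(L - 1)]
        particles = [retained[n]] + [extend(particles[a]) for a in ancestors]
        weights = normalize([loglik(x, n) for x in particles])
    return particles, weights, resample(weights)   # k* for the next sweep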
3.3 Particle Gibbs with Ancestor Sampling

PGAS is a technique that augments the CSMC sweep with a resampling procedure for the index a_n^{b_n} of the retained particle [Lindsten et al., 2012]. At a high level, this sampling scheme performs two updates

1. {x^{∗,−k}, a^∗} ∼ φ(x^{−k}, a | x^k, k)
2. k^∗ ∼ φ(k | x^{∗,−k}, a^{∗,−k}, x^k, a^k)

Here update 1 differs from the normal CSMC update in that it samples a complete set of ancestor indices a^∗, not the complement to the retained indices a^{∗,−k}. At each generation the ancestor a_n^{b_n} of the retained particle is resampled according to a weight

w_{n-1|N}^l = w_{n-1}^l p(y_N, x_N^k[x_n^l] | F_N[y_n, x_n^l]).

Here x_N^k[x_n^l] denotes the substitution of x_n^l into x_N^k, which is defined as a mapping x_N^k[x_n^l] : A_N^k ∪ A_n^l → V in which entries in x_n^l : A_n^l → V augment or replace entries in x_N^k : A_N^k → V.

Intuitively, ancestor resampling can be thought of as proposing new program executions by complementing the random variables x_n^l of a partial execution, or prefix, with retained values from x_N^k that specify the future of the execution, or suffix. This step is performed at each generation, allowing the retained lineage to potentially be resampled many times in the course of one sweep.

Incorporation of the ancestor resampling step results in the following updates at each n = 2, ..., N

1a. Update the retained particle

    a_n^{∗,b_n} ∼ R(a_n | w_{n-1|N}^∗)
    x_n^{∗,b_n} ← x_n^{b_n}

1b. Update the particles for l ∈ {1, ..., L} \ b_n

    a_n^{∗,l} ∼ R(a_n | w_{n-1}^∗)
    x_n^{∗,l} ∼ p(x_n | F_n[x_{n-1}^{∗, a_n^{∗,l}}])
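In code form, the ancestor update 1a reduces to one rescoring call per particle, again reusing the normalize and resample helpers from the SMC sketch. Here rescore is a hypothetical stand-in for the computation of log p(y_N, x_N^k[x_n^l] | F_N[y_n, x_n^l]), which is exactly the quantity that Section 4 shows how to obtain without re-running the program in full:

def resample_ancestor(log_w_prev, prefixes, retained_suffix, rescore):
    """Update 1a: draw the retained particle's ancestor a_n^{b_n}.

    log_w_prev[l] is log w_{n-1}^l, prefixes[l] is the prefix x_n^l, and
    rescore(prefix, suffix) returns log p(y_N, x_N^k[x_n^l] | F_N[y_n, x_n^l]).
    """
    logw = [lw + rescore(prefixes[l], retained_suffix)
            for l, lw in enumerate(log_w_prev)]            # log w_{n-1|N}^l
    return resample(normalize(logw))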
4 Rescoring Probabilistic Programs

PGAS for probabilistic programs requires calculation of p(y_N, x_N^k[x_n^l] | F_N[y_n, x_n^l]). This probability is, by definition, the importance weight of F_N[y_N, x_N^k[x_n^l]] and can therefore in principle be obtained by executing this re-conditioned form. The main drawback of this naive approach is that it requires LN evaluations of the program in its entirety. This results in an O(LN^2) computational cost, which quickly becomes prohibitively expensive as the number of generations N increases. A second complicating factor is that naively rewriting part of the execution history of the program may not yield a set of random values x_N^k[x_n^l] that could be generated by running F_N[y_N].

We here develop a formalism that allows regeneration of a self-consistent program execution starting from a partial execution with random variables x_n^l, assuming all future random samples are inherited from retained values x_N^k. To do so we introduce the notion of a trace, a data structure that annotates each evaluation with information necessary to re-execute the expression relative to a new program state. Given a trace for the suffix, i.e. the remaining top-level statements in a program, it becomes possible to re-execute conditioned on future random values, in a manner that avoids unnecessary recomputation where possible.

4.1 Intuition

In higher-order languages with recursion and memoization, regeneration is complicated by two factors:

1. The program F_N[y_N, x_N^k[x_n^l]] may be underconditioned, in the sense that x_N^k[x_n^l] does not contain values for some sample calls that can be evaluated in its execution. It can also be overconditioned, when x_N^k[x_n^l] contains values for sample calls that will never be evaluated.

2. The expression for the stochastic argument to each observe in F_N[y_N, x_N^k[x_n^l]] may need to be re-evaluated if it in some way depends on global variables defined in F_n[y_n, x_n^l]. Because each variable may in turn reference other variables, we must be able to reconstruct the program environment recursively in order to rescore each observe.

The first non-triviality arises from the existence of if expressions. As an example, consider the program

0: [assume random? (sample (flip-dist 0.5))]
1: [assume mu (if random?
                  (sample (normal-dist 0.0 1.0))
                  0.0)]
2: [observe (normal-dist mu 1.0) 0.1]

The sample expression on line 1 is only evaluated when the sample expression on line 0 evaluates to true. More generally, any program that contains (sample e) inside an if expression is not guaranteed to instantiate the same random variables in cases where the predicate of the if expression itself depends on previously sampled values. In programming languages that lack recursion we have the option of evaluating both branches and including or excluding the associated probabilities conditioned on the predicate value. This is essentially the strategy employed to handle if expressions in Infer.NET and BUGS variants. In languages that do permit recursion, this is in general not possible. For example, the following program would require evaluation of an infinite number of branches:

0: [assume geom (lambda (p)
       (if (sample (flip-dist p))
           1
           (+ 1 (geom p))))]
1: [observe (poisson-dist (geom 0.5)) 3]

In other words, we cannot in general pre-evaluate the values associated with both branches in the suffix. When a predicate in the suffix no longer takes on the same value, we have a choice of either rejecting the regenerated suffix outright, or updating it using a regeneration procedure that evaluates the newly chosen branch and removes references to any values sampled in the invalidated branch. We here consider the former, strict form of regeneration, which guarantees that the regenerated suffix references precisely the same set of sample values as before.

A second aspect that complicates rescoring is the existence of memoized procedures. As an example, consider the following infinite mixture model

0: [assume class-prior (crp 1.0)]
1: [assume class
     (mem (lambda (n)
       (sample class-prior)))]
2: [assume class-dist
     (mem (lambda (k)
       (normal-dist
         (sample (normal-dist 0.0 1.0))
         1.0)))]
3: [observe (class-dist (class 0)) 2.1]
4: [observe (class-dist (class 1)) 0.6]
...
N+3: [observe (class-dist (class N)) 1.2]

Here each observe makes a call to class, which samples an integer class label k from a Chinese restaurant process (CRP). The call to class-dist either retrieves an existing stochastic value, or generates one when a new value k is encountered.
This type of memoization pattern allows us to delay sampling of the parameters until they are in fact required in the evaluation of a top-level observe, and makes it straightforward to define open-world models with unbounded numbers of parameters. At the same time it complicates analysis when performing rescoring. Memoized procedure calls are semantically equivalent to lazily defined variables. Programs that rely on memoization can therefore essentially define variables in a non-deterministic order. A regeneration operation must therefore dynamically determine the set of variables that need to be re-evaluated at run time.

4.2 Traced Evaluation

The operations that need to happen during a rescoring step are (1) the regeneration of a consistent set of global environment variables, which includes any bindings in the prefix, augmented with any bindings defined in the suffix (some of which may require re-evaluation as a result of changes to bindings in the prefix), (2) a verification that the flow-control path in the suffix is consistent with the environment bindings in the prefix, and (3) the recomputation of the probabilities of any sampled and observed values whose density values depend on bindings in the prefix.

In order to make it possible to perform the above operations, we begin by introducing a set of annotations for each value v that is returned upon evaluation of an expression e. In practical terms, each value in the language is boxed into a data structure which we call a trace. We represent a trace τ as a tuple (v, ε, l, ρ, ω, σ, φ). v is the value of the expression. ε is a partially evaluated expression, whose sub-expressions are themselves represented as traces. l is the accumulated log-weight of observes evaluated within the expression. ρ is a mapping {s ↦ τ} from symbols to traces, containing the subset of the global environment variables that were referenced in the evaluation of e. ω is a mapping {α ↦ (τ, v, l)}. It contains an entry at the address α of each observe that was evaluated in e and its sub-expressions. This entry is represented as a tuple (τ, v, l) containing a trace of the first argument to the observe (which must be of the stochastic type), the observed value v, and the associated log-weight l. Similarly, σ is a mapping {α ↦ (τ, v)} that contains an entry for each evaluated sample expression (which omits the associated log-weight). The last component φ is again a mapping {α ↦ τ} that records all traces that appear as conditions in if expressions and thereby influence the control flow of program execution.

We now describe the semantics of the traced evaluation (e, α, R, Λ) ⇓ τ. The evaluation operator ⇓ returns the trace τ of an expression e at address α, relative to a global environment R (i.e. variables defined via assume statements) and local bindings Λ (i.e. variables bound in compound procedure calls), resulting in a trace τ = (v, ε, l, ρ, ω, σ, φ). We assume the implementation provides a standard evaluation function for primitive procedures eval(prim v_1 ... v_n) = v. We reiterate that the notation α::(t,p) denotes an evaluation address composed of a parent address α, a type identifier t and a sub-expression index p, where t is one of i for if, l for lambda, q for quote, a for applications, and b when evaluating compound procedure bodies.

Constants c evaluate to:

(c, α, R, Λ) ⇓ (c, c, 0.0, {}, {}, {}, {}).

We call this type of trace "transparent", since it contains no references to other traces that may take on different values or probabilities in another execution.

Symbol lookups s in the global environment return the value of s stored in R:

(s, α, R, Λ) ⇓ (v, s, 0.0, {s ↦ τ}, {}, {}, {})
    if R(s) = τ and τ = (v, _, _, _, _, _, _).

Lookups from the local environment are inlined:

(s, α, R, Λ) ⇓ (v, ε, 0.0, ρ, ω, σ, φ)
    if Λ(s) = τ and τ = (v, ε, l, ρ, ω, σ, φ).

Calls to sample inherit annotations from the trace τ passed as an argument, and add an entry in σ:

((sample e), α, R, Λ) ⇓ (v, v, l, ρ, ω, σ[α ↦ (τ, v)], φ)
    if (e, α::(s,0), R, Λ) ⇓ τ = (v_1, _, l, ρ, ω, σ, φ)
    and v is drawn from the stochastic process v_1.

Calls to observe inherit from τ and add an entry in ω:

((observe e_1 v_2), α, R, Λ) ⇓ (v_2, v_2, l_1 + l_12, ρ, ω[α ↦ (τ, v_2, l_12)], σ, φ)
    if (e_1, α::(o,0), R, Λ) ⇓ τ = (v_1, _, l_1, ρ, ω, σ, φ)
    and l_12 = L(v_1, v_2).

Here L(v_1, v_2) is used to denote the log-density of value v_2 relative to v_1 (which must be of type stochastic).

Primitive procedure applications (prim e_1 e_2) evaluate to eval(prim v_1 v_2), where v_i is the result of evaluating e_i:

((prim e_1 e_2), α, R, Λ) ⇓ (eval(prim v_1 v_2), (prim τ_1 τ_2), l, ρ, ω, σ, φ)
    if (e_i, α::(p,i), R, Λ) ⇓ τ_i,

where ρ, ω, σ, and φ are obtained by merging the corresponding components of the τ_i, and the log-density l is the sum of the l_i.

Application of a single-argument compound procedure (i.e. closure) e_1 leads to the evaluation of the body e of the procedure relative to the environments R_1, Λ_1 in which the compound procedure was defined:

((e_1 e_2), α, R, Λ) ⇓ (v, (τ_1 τ_2), l, ρ, ω, σ, φ)
    if (e_1, α::(a,0), R, Λ) ⇓ τ_1 = (((lambda (s) e), R_1, Λ_1), _, _, _, _, _, _),
    (e_2, α::(a,1), R, Λ) ⇓ τ_2 = (v_2, _, _, _, _, _, _),
    (e, α::(b,0), R_1, Λ_1[s ↦ τ_2]) ⇓ τ_3 = (v, _, _, _, _, _, _),

and l, ρ, ω, σ, φ are obtained by combining the corresponding components of the τ_i. The case with multiple arguments is defined similarly.

Quote expressions simply return the trace of the quoted expression:

((quote e), α, R, Λ) ⇓ τ    if (e, α::(q,0), R, Λ) ⇓ τ.

Finally, an if expression (if e e_1 e_2) returns either the result of evaluating e_1 or e_2. We show only the case where the true branch is taken:

((if e e_1 e_2), α, R, Λ) ⇓ (v, ε, l, ρ, ω, σ, φ[α ↦ τ])
    if (e, α::(i,0), R, Λ) ⇓ τ = (true, _, _, _, _, _, _),
    (e_1, α::(i,1), R, Λ) ⇓ τ_1 = (v, ε, _, _, _, _, _),

and l, ρ, ω, σ, φ are obtained by combining the corresponding components of τ and τ_1. The other case, in which the value of τ is false, has similar semantics.
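To fix ideas, the trace tuple defined at the start of this section can be transcribed into a hypothetical container type. The field names mirror the components (v, ε, l, ρ, ω, σ, φ); this schematic is our own rendering for illustration, not the paper's implementation:

from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

Address = Tuple  # a chain of (type, position) pairs, e.g. (('a', 0), ('b', 0))

@dataclass
class Trace:
    value: Any                                  # v: value of the expression
    expr: Any                                   # e: partially evaluated expression
    log_weight: float = 0.0                     # l: accumulated observe log-weight
    env_refs: Dict[str, "Trace"] = field(default_factory=dict)        # rho
    observes: Dict[Address, Tuple] = field(default_factory=dict)      # omega: a -> (tau, v, l)
    samples: Dict[Address, Tuple] = field(default_factory=dict)       # sigma: a -> (tau, v)
    predicates: Dict[Address, "Trace"] = field(default_factory=dict)  # phi: a -> tau

def constant_trace(c):
    # a "transparent" trace: no references to traces that could change elsewhere
    return Trace(value=c, expr=c)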
4.3 Regeneration and Rescoring

We now define an operation R(τ, R) = τ′ that regenerates a traced value relative to an environment R. This operation performs the following steps:

1. Re-evaluate predicates: Compare τ = φ(α) to τ′ = R(τ, R) for all α. Abort if τ′ and τ have different values v. Otherwise update φ[α ↦ τ′].

2. Re-score observe expressions and statements: Let (τ, v, l) = ω(α), and τ′ = R(τ, R). If τ′ and τ have different values v′_0 and v_0, recalculate l′ = L(v′_0, v) and update ω[α ↦ (τ′, v, l′)]. Otherwise, update ω[α ↦ (τ′, v, l)].

3. Re-score samples: Let (τ, v) = σ(α), and τ′ = R(τ, R). Calculate l′ = L(τ′, v) and update ω[α ↦ (τ′, v, l′)].

4. Regenerate the environment bindings: For all symbols s that do not exist in R, let τ′ = R(ρ(s), R) and update R[s ↦ τ′] and ρ[s ↦ τ′]. For existing symbols update ρ[s ↦ R(s)]. We for convenience assume that R is updated in place, though this may be avoided by having R return a tuple (R′, τ′).

5. If any bindings were changed with new values in step 4, regenerate all sub-expressions τ_i in ε to reconstruct ε′.

6. If ε′ was reconstructed in step 5, evaluate ε′ and update v to the result of this evaluation.

Rescoring an individual trace may be performed as part of the regeneration sweep by calculating a difference in log-density ∆l. This ∆l is the sum of all terms l′ − l in step 2 and all terms l′ in step 3, and any ∆l′ values returned from recursive calls to R.

We have omitted a few technical details in this high-level description. The first is that we build a map C = {τ ↦ τ′, ...} on a call to R, which is passed as an additional argument in recursive calls to R, effectively memoizing the computation relative to a given initial environment R. This reduces the computation on recursive calls, which potentially expand the same traces τ many times as sub-expressions of ε.

4.4 Ancestor Resampling

Given an implementation of a traced evaluator and a regenerating/rescoring procedure R, an implementation of PGAS for probabilistic programs becomes straightforward. We represent the programs F_n[y_n, x_n^l] evaluated up to the first n top-level statements as pairs (R_n^l, τ_n^l). We then construct a concatenated suffix T_n by recursively re-evaluating T_n = (cons τ_n^{b_n} T_{n+1}). For n = 1, ..., N, we then obtain log p(y_N, x_N^k[x_n^{∗,l}] | F_N[y_n, x_n^l]) as the log-weight of the rescored trace R(T_n, R_{n-1}^l). Note here that by construction, we may extract T_{n+1} from T_n without additional computation.

5 Experiments

To evaluate the mixing properties of PGAS relative to ICSMC, we consider a linear dynamical system (i.e. a Kalman smoothing problem) with a 2-dimensional latent space and a D-dimensional observation space,

z_t ∼ Norm(A · z_{t-1}, Q),    y_t ∼ Norm(C · z_t, R).

We impose additional structure by assuming that the transition matrix A is a simple rotation with angular velocity ω, whereas the transition covariance Q is a diagonal matrix with constant coefficient q,

A = [ cos ω   −sin ω ]
    [ sin ω    cos ω ],    Q = q I_2.

We simulate data with D = 36 dimensions, T = 100 time points, ω = 4π/T, q = 0.1, α = 0.1, and r = 0.01.
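For concreteness, data of this form can be generated in a few lines of numpy. The observation matrix C and noise covariance R are not spelled out in the text, so the choices below (a random Gaussian C and R = r I_D) are assumptions made purely so that the sketch runs:

import numpy as np

rng = np.random.default_rng(0)
D, T = 36, 100
omega, q, r = 4.0 * np.pi / 100, 0.1, 0.01

A = np.array([[np.cos(omega), -np.sin(omega)],
              [np.sin(omega),  np.cos(omega)]])   # rotation with velocity omega
Q = q * np.eye(2)                                 # diagonal transition covariance
C = rng.normal(size=(D, 2))                       # assumed observation matrix
R = r * np.eye(D)                                 # assumed observation covariance

z = np.zeros((T, 2))
y = np.zeros((T, D))
z[0] = [1.0, 0.0]
y[0] = rng.multivariate_normal(C @ z[0], R)
for t in range(1, T):
    z[t] = rng.multivariate_normal(A @ z[t - 1], Q)
    y[t] = rng.multivariate_normal(C @ z[t], R)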
We now consider an inference setting where C and R are assumed known and estimate the state trajectory z_{1:T}, as well as the parameters of the transition model ω and q, which are given mildly informative priors, ω ∼ Gamma(10, 2.5) and q ∼ Gamma(10, 100).

While this is a toy problem for which an expectation maximization (EM) algorithm could likely be derived, it is illustrative of the manner in which probabilistic programs can extend models by imposing additional structure, in this case the dependency of A on ω. This modified Kalman smoothing problem can be described in a small number of program lines

[assume C [...]] ; assumed known
[assume R [...]] ; assumed known
[assume omega (* (sample (gamma-dist 10. 2.5))
                 (/ pi T))]
[assume A [[(cos omega) (* -1 (sin omega))]
           [(sin omega) (cos omega)      ]]]
[assume q (sample (gamma-dist 10. 100.))]
[assume Q (* (eye 2) q)]
[assume x
  (mem (lambda (t)
    (if (< t 1)
        [1. 0.]
        (sample (mvn-dist (mmul A (x (dec t))) Q)))))]
[observe (mvn (mmul C (x 1)) R) [...]]
...
[observe (mvn (mmul C (x 100)) R) [...]]
[predict omega]
[predict q]

Here [...] refers to a vector or matrix literal.

We compare results for PGAS with 10 particles to ICSMC with 10, 20, 50, 100, 200, and 500 particles. In each case we run 100 PMCMC sweeps and 25 restarts with different random seeds. To characterize mixing rates we calculate the effective sample size (ESS) of the aggregate sample set {w_t^{s,l}, z_t^{s,l}} over all sweeps s = 1, ..., 100,

ESS_t = 1 / Σ_k (V_t^k)^2,    V_t^k = Σ_{s=1}^{100} Σ_{l=1}^{L} w_t^{s,l} I[z_t^k = z_t^{s,l}].

Here V_t^k represents the total importance weight associated with each unique value z_t^k in {z_t^{s,l}}.
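The ESS_t statistic can be computed by pooling the normalized weight attached to each unique sampled value, as in this short sketch (a hypothetical helper; it assumes the values z are hashable and the weights sum to one):

from collections import defaultdict

def ess(weighted_values):
    """ESS_t = 1 / sum_k (V_t^k)^2 for an iterable of (w, z) pairs."""
    V = defaultdict(float)
    for w, z in weighted_values:
        V[z] += w                   # V_t^k: total weight of unique value z_t^k
    return 1.0 / sum(v * v for v in V.values())

# e.g. ess([(0.5, 'a'), (0.25, 'a'), (0.25, 'b')]) == 1.6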
[Figure 1: Effective sample size as a function of t, normalized by the number of PMCMC sweeps (top) and wall time in seconds (bottom). PGAS with 10 particles is shown in cyan. ICSMC results are shown in blue, with shorter dashes representing increasing particle counts 10, 20, 50, 100, 200, and 500. All lines show the median ESS over 25 independent restarts.]

Figure 1 shows the ESS as a function of t. ICSMC shows a decreasing ESS as t approaches 0, indicating poor mixing for values sampled at early generations. PGAS, in contrast, exhibits an ESS that fluctuates but is otherwise independent of t. ESS estimates varied approximately 15% relative to the mean across independent restarts. This suggests that fluctuations in the ESS reflect variations in the prior probability of latent transitions p(z_t | z_{t-1}, A).

For this model, our PGAS implementation with 10 particles has a computational cost per sweep comparable to that of ICSMC with 300 particles. However, when we consider the ESS per computation time, the increases in mixing efficiency outweigh the increases in computational cost for state estimates below t ≈ 50.

[Figure 2: Standard deviation of posterior estimates E[ω] and E[q] over 25 independent restarts, as a function of the PMCMC sweep. PGAS results are shown in cyan, ICSMC results are shown in blue, with shorter dashes indicating a larger particle count.]

This is further illustrated in Figure 2, which shows the standard deviation of sample estimates of the parameters ω and q. PGAS shows better convergence per sweep, particularly for estimates of q. ICSMC with 500 particles performs similarly to PGAS with 10 particles when estimating ω, though ICSMC with 500 particles notably has a higher cost per sweep.

6 Discussion

Relative to other PMCMC methods such as ICSMC, PGAS methods have qualitatively different mixing characteristics, particularly for variables sampled early in a program execution. Implementing PGAS in the context of probabilistic programs poses technical challenges when programs can make use of recursion and memoization. The technique for traced evaluation developed here incurs an additional computational overhead, but avoids unnecessary recomputation during regeneration. When the cost of recomputation is large, this results in computational gains relative to a naive implementation that re-executes the suffix fully.

Note that our approach tracks upstream, not downstream, dependencies. In other words, we know which environment variables affect the value of a given expression, but not which expressions in a suffix depend on a given variable. All referenced symbol values must therefore be checked during regeneration, which can require an O(LN^2) computation in itself. Further gains could be obtained by constructing a downstream dependency graph for the suffix, allowing more targeted regeneration via graph-walk techniques analogous to those employed in Venture [Mansinghka et al., 2014]. At the same time, the empirical results presented here indicate that, even without these additional optimizations, PGAS can easily yield better statistical results in cases where the parameter space is large and ICSMC sampling fails to mix.

7 Acknowledgements

We would like to thank our anonymous reviewers, as well as Brooks Paige and Dan Roy, for their comments on this paper. JWM was supported by Google and Xerox. HY was supported by the EPSRC. FW and VKM were supported under DARPA PPAML. VKM was additionally supported by the ARL and ONR.
References

Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269-342, 2010.

Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411-436, June 2006.

Noah Goodman, Vikash Mansinghka, Daniel M. Roy, Keith Bonawitz, and Joshua B. Tenenbaum. Church: a language for generative models. In Proc. 24th Conf. Uncertainty in Artificial Intelligence (UAI), pages 220-229, 2008.

Pierre E. Jacob, Lawrence M. Murray, and Sylvain Rubenthaler. Path storage in the particle filter. Statistics and Computing, December 2013.

Fredrik Lindsten, Michael I. Jordan, and Thomas B. Schön. Ancestor sampling for particle Gibbs. In Neural Information Processing Systems, 2012.

Vikash Mansinghka, Daniel Selsam, and Yura Perov. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv, 2014. URL http://arxiv.org/abs/1404.0099.

T. Minka, J. Winn, J. Guiver, and D. Knowles. Infer.NET 2.4. Microsoft Research Cambridge, 2010.

Brooks Paige and Frank Wood. A compilation target for probabilistic programming languages. In International Conference on Machine Learning (ICML), volume 32, 2014.

Stan Development Team. Stan: A C++ library for probability and sampling, version 2.5.0, 2014. URL http://mc-stan.org/.

David Wingate, Andreas Stuhlmueller, and Noah D. Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics, pages 770-778, 2011.

F. Wood, J.W. van de Meent, and V. Mansinghka. A new approach to probabilistic programming inference. In Artificial Intelligence and Statistics, pages 1024-1032, 2014.
