Slice Sampling for Probabilistic Programming

Razvan Ranca (University of Cambridge)
Zoubin Ghahramani (University of Cambridge)

Abstract

We introduce the first general-purpose slice sampling inference engine for probabilistic programs. This engine is released as part of StocPy, a new Turing-complete probabilistic programming language, available as a Python library. We present a trans-dimensional generalisation of slice sampling which is necessary for the inference engine to work on traces with different numbers of random variables. We show that StocPy compares favourably to other PPLs in terms of flexibility and usability, and that slice sampling can outperform previously introduced inference methods. Our experiments include a logistic regression, an HMM, and a Bayesian neural net.

1 Introduction

There has been a recent surge of interest in probabilistic programming, as demonstrated by the continued development of new languages (e.g. Wood et al. [2014], Goodman et al. [2008], Lunn et al. [2009], Milch et al. [2007]) and by the recent DARPA program¹ encouraging further research in this direction. The increase in activity is justified by the promise of probabilistic programming, namely that of disentangling model specification from inference. This abstraction would open up probabilistic modelling to a much larger audience, including domain experts, while also making model design cheaper, faster and clearer.

Two of the major challenges lying ahead for Probabilistic Programming Languages (PPLs), before their full promise can be achieved, are those of usability and inference performance (Gordon [2013]). In this paper we address usability by presenting a new PPL, StocPy, available online as a Python library (Ranca [2014]). We also take a step towards better inference performance by implementing a novel, slice sampling, inference engine in StocPy.

StocPy is based on the PPL design presented by Wingate et al. [2011], but is written purely in Python, and works on model definitions written in the same language. This enables us to take full advantage of Python's excellent prototyping abilities and fast development cycles, and thus allows us to specify models in very flexible ways. For instance, models containing complex control flow and elements such as stochastic (mutual) recursion are easily representable. Additionally, the pure Python implementation means StocPy itself provides a good base for further experiments into PPLs, such as defining novel naming techniques for stochastic variables, looking at program recompilation to improve inference performance, or testing out new inference engines. We illustrate the benefits StocPy offers by discussing some of the language's features and contrasting the definitions of several models in StocPy against those in Anglican (Wood et al. [2014]).

We believe that ease of prototyping and implementing novel inference engines is crucial for the future success of PPLs. As decades of compiler design have shown, a "magic bullet" inference technique is unlikely to be found (e.g. [Appel, 1998, p. 401]). This is reinforced by the fact that PPLs can allow for a great deal of flexibility regarding the types of models and queries posed (e.g. propositional logic can be represented by exposing delta functions, and probabilistic programs can be conditioned on arbitrary boolean expressions). Such flexibility means a general PPL inference engine has a daunting task ahead, which includes subtasks such as handling boolean satisfiability problems.
Therefore, it seems vital to develop a toolbox of inference techniques which work well on different types of modelling tasks. This high-level reasoning supports not only the development of StocPy itself, but also of the slice sampling inference engine, which outperforms previous inference techniques on certain classes of models.

¹Probabilistic Programming for Advancing Machine Learning: http://ppaml.galois.com/wiki/

2 StocPy

We implement a novel PPL, StocPy, which is made available online (Ranca [2014]). StocPy has both a Metropolis-Hastings inference engine (implemented in the style presented by Wingate et al. [2011]) and a slice sampling inference engine, which we further discuss in Section 3. Weighted combinations of the two inference strategies can also be used.

2.1 Why make a Python PPL?

One of the main promises of PPLs is to allow quick and cheap development of probabilistic models, including by domain experts who may not be very comfortable with machine learning, or even with programming. The popularity of PPLs such as BUGS is partly due to their simplicity. On the other hand, Lisp-based PPLs such as Goodman et al. [2008] offer greater flexibility but at the price of usability. StocPy is able to offer the same power as the Lisp-based PPLs while also allowing the user to work in Python, a language both very friendly to beginners and extremely popular for prototyping.
Additionally, using Python means we can make StocPy easily available as a library, and that we can make use of Python's flexible programming style when defining models (e.g. stochastic mutually recursive functions, use of globals). A user need only provide a function entry point to their model, which can then call any other functions, make use of any globals defined in the file, and use any common² Python constructs as needed.

²More exotic constructs, such as the use of list comprehensions in place of for loops, are not currently supported by the automatic variable naming. However, they can be used if the stochastic primitives involved are manually named by the user.

Finally, using Python offers both the user and the language designer rich library support. For instance, as language designers, we are able to provide support for all 96 stochastic primitives defined in the scipy.stats library³ by defining a single wrapper function. Without this library, all 96 primitives would have had to be specified, in a computationally efficient way, in the language itself, which would have required a significant amount of effort. The library support also reduces the barrier of entry for all manner of PPL research. For instance, Python's excellent network and graph libraries (e.g. Hagberg et al. [2008]) could be used to define a novel naming convention for stochastic primitives which takes the program's control flow into account. Such a naming convention is required by the framework of Wingate et al. [2011] and could be an improvement over the simpler version presented in that paper. In general, StocPy tries to accommodate such avenues of research (e.g. StocPy's automatic variable naming can be turned off by setting a single flag).

³scipy.stats primitives: http://docs.scipy.org/doc/scipy/reference/stats.html

2.2 StocPy Programming Style

As mentioned above, a user-defined model can make use of most common Python features, as long as the model is confined to one file. The interaction with StocPy is meant to be lightweight. Essentially, the user should perform all stochastic operations via StocPy, rather than via another library of random numbers. That is to say, rather than calling scipy.stats.norm(0,1), the user could call either the StocPy stochastic primitive directly, stocPy.normal(0,1)⁴, or use the StocPy wrapper over scipy.stats: stocPy.stocPrim("normal", (0,1)). By defining stochastic primitives through StocPy, the user will define a valid generative model. Conditioning is done within the same function call that defines a variable, by specifying the "cond" parameter (e.g. stocPy.normal(0,1,cond=True)). Finally, the variables or expressions we'd like to observe can be specified in two ways: either directly, in the variable definitions, via the "obs" parameter (e.g. stocPy.normal(0,1,obs=True)), or via a bespoke observe statement (e.g. stocPy.observe(expression, "expName")). The latter is more flexible, allowing us to observe the results of arbitrary expressions, aggregated via their name in arbitrary ways. Once the model is defined, inference can be done by calling one of several inference functions and passing the entry point to the model as a parameter. A minimal (but complete) model definition, which tries to infer the mean of a Gaussian given a single observation, is shown in Figure 1. This model is revisited in Figures 10 and 9, where we see its true posterior and abstract model specification respectively.

⁴These are only available for a few distributions. See the StocPy library for details.

1 import stocPy
2
3 def guessMean():
4     mean = stocPy.normal(0, 1, obs=True)
5     stocPy.normal(mean, 1, cond=5)
6
7 samples = stocPy.getSamples(guessMean, 10000, alg="slice")
8 stocPy.plotSamples(samples, xlabel="Mean")

Figure 1: Complete StocPy program that infers a Gaussian's mean.

In Figure 1, line 1 imports our library. Line 3 defines the function that is the entry point to our model. Line 4 specifies the prior on our mean (also a Gaussian) and tells StocPy that we want to observe the values this variable takes. Line 5 conditions a normally distributed random variable with our sampled mean and variance 1 on our only observation (5). Line 7 performs inference on our model, extracting 10,000 samples via slice sampling. Finally, line 8 uses a StocPy utility function to plot the resulting distribution, which is shown in Figure 4.
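To make the alternative interfaces above concrete, the following is a small illustrative variant of the Figure 1 model. It is only a sketch, assuming that the stocPrim wrapper and the named observe statement behave exactly as described in this section; the model itself is hypothetical and not taken from the StocPy distribution.

    import stocPy

    def guessMeanAlt():
        # Prior on the mean, drawn through the generic scipy.stats wrapper.
        mean = stocPy.stocPrim("normal", (0, 1))
        # Condition a Normal(mean, 1) variable on the observation 5.
        stocPy.normal(mean, 1, cond=5)
        # Record a derived expression under a name of our choosing.
        stocPy.observe(mean > 0, "meanIsPositive")

    samples = stocPy.getSamples(guessMeanAlt, 10000, alg="slice")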
In Figures 2 and 3 we present more examples of models expressed in StocPy and in Anglican. The models are two of those presented by Wood et al. [2014] (more models are included in the supplementary material). The Anglican specifications are taken either directly from the paper or from the Anglican website⁵. We can see that, in the case of the more complex models, we are able to provide a more succinct representation in StocPy than in Anglican.

⁵Anglican model repository: http://www.robots.ox.ac.uk/~fwood/anglican/examples/

StocPy:
1 def branching():
2     r = stocPy.poisson(4, obs=True)
3     l = 6 if r > 4 else fib(3*r) + stocPy.poisson(4)
4     stocPy.poisson(l, cond=6)

Anglican:
1 [assume r (poisson 4)]
2 [assume l (if (< 4 r) 6 (+ (fib (* 3 r)) (poisson 4)))]
3 [observe (poisson l) 6]
4 [predict r]

Figure 2: Branching model expressed in StocPy (top) and Anglican (bottom).

StocPy:
1 sProbs = (1.0/3, 1.0/3, 1.0/3)
2 tProbs = {0 : (0.1, 0.5, 0.4), 1 : (0.2, 0.2, 0.6), 2 : (0.15, 0.15, 0.7)}
3 eMeans = (-1, 1, 0)
4
5 def hmm():
6     states = []
7     states.append(stocPy.categorical(sProbs, obs=True))
8     for ind, ob in stocPy.readCSV("hmm-data.csv"):
9         states.append(stocPy.categorical(tProbs[states[ind]], obs=True))
10        stocPy.normal(eMeans[states[ind]], 1, cond=ob)

Anglican:
1 [assume initial-state-dist (list (/ 1 3) (/ 1 3) (/ 1 3))]
2 [assume get-state-transition-dist (lambda (s) (cond ((= s 0) (list 0.1 0.5 0.4)) ((= s 1) (list 0.2 0.2 0.6)) ((= s 2) (list 0.15 0.15 0.7))))]
3 [assume transition (lambda (prev-state) (discrete (get-state-transition-dist prev-state)))]
4 [assume get-state (mem (lambda (index) (if (<= index 0) (discrete initial-state-dist) (transition (get-state (- index 1))))))]
5 [assume get-state-obs-mean (lambda (s) (cond ((= s 0) -1) ((= s 1) 1) ((= s 2) 0)))]
6 [observe-csv "hmm-data.csv" (normal (get-state-obs-mean (get-state $1)) 1) $2]
7 [predict (get-state 0)]
8 [predict (get-state 1)]
9 ...
10 [predict (get-state 16)]

Figure 3: HMM model expressed in StocPy (top) and Anglican (bottom).

3 Slice Sampling Inference Engine

3.1 Advantages of Slice Sampling

Slice sampling (Neal [2003]) is a Markov Chain Monte Carlo algorithm that can extract samples from a distribution P(x), given that we have a function P*(x) which we can evaluate and which respects P*(x) = Z·P(x) for some constant Z > 0. The idea behind slice sampling is that, by using an auxiliary height variable u, we can sample uniformly from the region under the graph of the density function of P(x).

In order to get an intuition of the algorithm, we can consider the one-dimensional case. Here we can view the algorithm as a method of transitioning from a point (x, u) which lies under the curve P*(x) to another point (x', u') lying under the same curve. The addition of the auxiliary variable u means that we need only sample uniformly from the area under the curve P*(x).

A basic slice sampling algorithm can be described as:

Algorithm 1 Slice Sampling
1: Pick initial x_1 such that P*(x_1) > 0
2: Pick number of samples to extract, N
3: for t = 1 → N do
4:   Sample a height u uniformly from [0, P*(x_t)]
5:   Define a segment [xl, xr] at height u
6:   Pick x_{t+1} ∈ [xl, xr] such that P*(x_{t+1}) > u

The key to slice sampling's efficiency is the fact that the operations in lines 5 and 6 can be performed with exponential stepping-out and shrinking methods. One possible implementation of these operations is shown in the supplementary material. In this way, slice sampling corrects one of the main disadvantages of MH, namely that it has a fixed step size. In MH, picking a wrong step size can significantly hamper progress, either by unnecessarily slowing down the random walk's progress or by resulting in a large proportion of rejected samples. Slice sampling, however, adjusts an inadequate step size of size S with a cost that is only logarithmic in the size of S (MacKay [2003]).
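For readers unfamiliar with the procedure, the following is a minimal, self-contained Python sketch of Algorithm 1 for a one-dimensional target, written against a generic log P*(x) function. It uses the exponential stepping-out and shrinking operations mentioned above; it is a textbook illustration under those assumptions, not StocPy's actual engine.

    import math
    import random

    def slice_sample(log_p, x0, n_samples, w=1.0):
        """1-D slice sampler sketch: log_p is log P*(x), x0 is an initial
        point with P*(x0) > 0, and w is the initial bracket width."""
        samples = []
        x = x0
        for _ in range(n_samples):
            # Auxiliary height u ~ Uniform(0, P*(x)), handled in log space.
            log_u = log_p(x) + math.log(random.random())
            # Step out: place a bracket around x and grow it exponentially
            # until both ends fall below the slice at height u.
            xl = x - w * random.random()
            xr = xl + w
            step = w
            while log_p(xl) > log_u:
                xl -= step
                step *= 2
            step = w
            while log_p(xr) > log_u:
                xr += step
                step *= 2
            # Shrink: sample uniformly from [xl, xr]; on rejection, shrink
            # the bracket towards the current point x and retry.
            while True:
                x_new = random.uniform(xl, xr)
                if log_p(x_new) > log_u:
                    x = x_new
                    break
                if x_new > x:
                    xr = x_new
                else:
                    xl = x_new
            samples.append(x)
        return samples

    # Example: draw from an unnormalised Normal(2, 1) density.
    draws = slice_sample(lambda x: -0.5 * (x - 2) ** 2, x0=0.0, n_samples=5000)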
While some methods to mitigate MH's fixed step size exist, such as performing pre-run tests to determine a good step size, these adjustments are not possible in the variant of MH commonly used in PPLs. We refer to the MH proposal described in Wingate et al. [2011], which is the same as the "RDB" benchmark used in Wood et al. [2014]. In this MH implementation, whenever we wish to resample a variable, we run the model until we encounter the variable and then sample it from its prior conditioned on the variables we have seen so far in the current run. Therefore no fine-tuning of the proposal distribution is possible.

To get an idea of what can be gained with slice sampling over simple, single-site MH, we look at two examples, one favourable to MH and one unfavourable. In this way we can get an idea of the expected "best" and "worst" case performances of the two. We choose to look at a simple Gaussian mean inference problem, and place different priors over the mean. The best case for MH is a prior that is identical to the posterior, which means MH as described above (i.e. RDB) is already sampling from the correct distribution. Conversely, the worst case for MH is an extremely uninformative prior, such that most samples will end up being rejected (in our example, the prior is uniform between 0 and 10,000 while the posterior is very close to a normal with mean 2 and variance 0.032; more details are in the supplementary material).

The results are presented in Figures 5 and 6. Here we report the Kolmogorov-Smirnov (KS) distance between the analytically derived posterior and a running average of the inferred posterior. The x-axis shows the number of times the model has been interpreted (and had its trace likelihood computed). We run the tests multiple times starting from different random seeds and report both the median (solid line) and the 25% and 75% percentiles (dashed lines) over runs. In this experiment we can see that the potential upside of slice sampling overshadows the downside. In the worst-case scenario for slice sampling, we have an inference problem where our prior is equal to the posterior. That is, we already know the correct answer before seeing any data. In this case the extra overhead incurred by slice sampling slows it down roughly by a factor of 2 when compared to MH (Figure 5). In the second case, however, we have a very uninformative prior and here slice sampling significantly outperforms MH (Figure 6). In fact, the difference between the two algorithms will only get more pronounced as the prior gets more uninformative. That is, examples can be created where slice sampling is arbitrarily faster than MH. However, we must note that these examples are of very simple, one-dimensional, unimodal models. The generalisation of the insights gained from these examples to more complex models is non-trivial. Even so, we have shown a category of models where slice sampling performs substantially better than MH. Comparisons on more complex models are carried out in Section 4.

3.2 Inference engine construction

Our engine follows the basic slice sampling algorithm presented in Algorithm 1. As presented in Wingate et al. [2011], we consider a sample x to be a program execution trace and P*(x) to be the likelihood of all stochastic variables sampled in trace x.

The bottleneck in the inference engines is the trace likelihood calculation. Metropolis calculates this value exactly once per sample, whereas slice sampling needs to calculate it at least 3 times (once each for xl, xr and the next x), and potentially many more times. For this reason, StocPy provides the possibility to do inference until a certain number of trace log-likelihood calculations have been performed, which allows us to compare Metropolis and slice sampling directly and fairly. Further, all experiments comparing slice with Metropolis ran the algorithms for the same length of time. We, however, chose to use trace likelihoods rather than seconds on the x-axis, which implicitly shows that the two engines average the same number of trace-likelihood calculations per unit of time.

The main non-trivial aspect, and novel contribution, of the inference engine construction is handling trans-dimensional models. We discuss this below.

3.2.1 Trans-dimensional models

In order to understand the additional complications that trans-dimensional models present for inference engines, we look at a simple example taken from Wood et al. [2014], the Branching model. This model has two variables whose values determine the distribution of a third variable, which we condition on. This model is trans-dimensional since on different traces either one or both of the two variables will be sampled.
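To make the dimensionality mismatch concrete, the following sketch evaluates the trace log-likelihoods of two hypothetical Branching traces directly with scipy.stats; the particular values of r and pois2, and the naive Fibonacci helper, are illustrative assumptions only.

    from scipy import stats

    def fib(n):
        # Naive Fibonacci, standing in for the fib used by the Branching model.
        a, b = 0, 1
        for _ in range(n):
            a, b = b, a + b
        return a

    # Trace A: r = 5 > 4, so only one Poisson is sampled and l = 6.
    r_a = 5
    log_lik_a = stats.poisson.logpmf(r_a, 4) + stats.poisson.logpmf(6, 6)

    # Trace B: r = 2 <= 4, so a second Poisson (pois2) is sampled as well.
    r_b, pois2_b = 2, 3
    l_b = fib(3 * r_b) + pois2_b
    log_lik_b = (stats.poisson.logpmf(r_b, 4) + stats.poisson.logpmf(pois2_b, 4)
                 + stats.poisson.logpmf(6, l_b))

    # log_lik_a sums over 2 terms while log_lik_b sums over 3: the traces live
    # in spaces of different dimensionality, so treating the two numbers as
    # directly comparable is exactly what a naive engine gets wrong.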
Re-writing the model so that both variables are always sampled, even if one of them is unused, leaves the posterior invariant. Therefore, one method to correctly perform inference in a trans-dimensional model is to always sample all variables that might be used in a trace. This approach will, however, be extremely inefficient in large models and is not a viable general solution. In Figure 7 we use this trick to see what the space of possible trace likelihoods looks like, and what the true posterior is. Here pois1 and pois2 refer to the Poisson variables sampled in lines 2 and 3, respectively, of the StocPy model shown in Figure 2. Integrating out the pois2 variable from the above trace likelihood space results in the correct posterior distribution.

The issue with trans-dimensional jumps comes from the fact that a naive inference algorithm will not sample the second Poisson when it is not necessary, but will still think that the trace likelihoods of runs with different numbers of sampled variables are comparable. In doing so, the inference engine will be pretending to sample from a 2D trace likelihood even when it really is 1D. The space of likelihoods implied by the naive inference engine, and the posterior it would obtain by integrating out pois2, is shown in Figure 8. We now describe how to correct the slice sampling inference engine to handle trans-dimensional models.

Figure 4: Inferred posterior of the model shown in Figure 1.

Figure 5: Case most favourable to MH, where the prior and the posterior are both equal to Normal(0, 1).

Figure 6: Case unfavourable to MH, where the prior is Uniform(0, 10000) and the posterior is roughly Normal(2, 0.032).

Figure 7: Likelihood space of the execution traces (left: true trace likelihood over pois1 and pois2; right: trace likelihood when ignoring trans-dimensionality).

3.3 Trans-dimensional Slice Sampling

To understand the trans-dimensional corrections, we can place them in the framework of Reversible Jump Markov Chain Monte Carlo (RJMCMC, Green and Hastie [2009]). Informally, we can think of slice sampling as a form of RJMCMC in which we carefully choose the next sample so that the acceptance probability will always be 1. We include a short explanation of the RJMCMC notation in the Appendix.

In order to place slice sampling in this framework, we can think of a program execution trace as a state x. The choice of move we make is equivalent to choosing which random variable we wish to resample. Therefore, if there are |D| random variables present in the current trace and we pick one to resample uniformly, then j_m(x) = 1/|D|. Once the variable is chosen, we can define a deterministic function h_m which produces a new value for variable m by following the slice sampling specification presented in lines 4-6 of Algorithm 1. The randomness that h_m must take as input is u ~ g_m, where u is composed of r random numbers and g_m is the joint distribution of these r numbers. The probability of moving from state x to state x' is then π(x) g_m(u) / |D|, and the probability of going back from x' to x is π(x') g_m'(u') / |D'|. Intuitively, π(x) = π(x'), since slice sampling samples uniformly under its target distribution.

Trans-dimensionality means that the number of variables sampled in different traces will differ, and therefore that |D'| will be different from |D| and that the dimensionalities of u and u' (r and r' respectively) will also be different.

The correction we apply to account for this is similar to the one applied by Wingate et al. [2011]. Specifically, we update the trace log-likelihood of the proposed trace as

    log P*(x') ← log P*(x') + log( (|D| · p_stale) / (|D'| · p_fresh) ),

where p_stale is the probability of all variables that are in the current trace but not in the new one, and p_fresh is the probability of the variables sampled in the new trace that are not in the current one.
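A schematic version of this correction, assuming traces are represented simply as dictionaries mapping variable names to the log-probabilities of their sampled values (a simplification of whatever bookkeeping the real engine uses), might look as follows.

    import math

    def corrected_log_lik(new_log_lik, cur_vars, new_vars):
        """Schematic trans-dimensional correction for a proposed trace.

        cur_vars / new_vars map variable names to the log-probability of the
        value each variable took in the current / proposed trace.  Variables
        present only in the current trace are 'stale'; variables present only
        in the proposed trace are 'fresh'."""
        log_p_stale = sum(lp for name, lp in cur_vars.items() if name not in new_vars)
        log_p_fresh = sum(lp for name, lp in new_vars.items() if name not in cur_vars)
        size_cur, size_new = len(cur_vars), len(new_vars)
        # log P*(x') <- log P*(x') + log(|D| / |D'|) + log(p_stale / p_fresh)
        return (new_log_lik + math.log(size_cur) - math.log(size_new)
                + log_p_stale - log_p_fresh)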
Figure 8: Posteriors over pois1 obtained by integrating out pois2 (left: true posterior; right: posterior implied by ignoring trans-dimensionality).

4 Empirical Evaluation

4.1 Inferring the mean of a Gaussian

We begin with the inference of the mean of a Gaussian. This time the prior and posterior are of similar sizes, but the posterior is shifted by an unusual observation. In this setting we look at three models, namely 1-dimensional, 2-dimensional and trans-dimensional. The model specifications are given in Figure 9 and the models' posteriors are shown in Figure 10.

We now look at the performance of Metropolis, slice sampling and some different mixtures of the two over the three models. The mixture methods work by flipping a biased coin before extracting each sample in order to decide which inference method to use.

To compare the inference engines, we extract samples until a certain number of trace likelihood calculations have been performed and then repeat this process 100 times, starting from different random seeds. We then plot the median and the quartiles of the KS differences from the true posterior, over all generated runs. Figure 11 shows the quartiles of the runs on the three models.
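As an illustration of this evaluation protocol, the sketch below computes the KS distance of growing prefixes of a sample run against an assumed analytic posterior CDF; the Normal(2, 1) posterior, the stand-in draws and the checkpoint sizes are placeholders, not values from the paper.

    import numpy as np
    from scipy import stats

    def running_ks(samples, true_cdf, checkpoints):
        """KS distance between the empirical distribution of the first n
        samples and the analytic posterior CDF, for several prefix lengths n."""
        return [stats.kstest(np.asarray(samples[:n]), true_cdf).statistic
                for n in checkpoints]

    # Illustrative use: compare a chain of draws against a Normal(2, 1) posterior.
    # draws = stocPy.getSamples(model, 10000, alg="slice")   # hypothetical call
    draws = list(stats.norm(2, 1).rvs(10000))                 # stand-in draws
    ks_curve = running_ks(draws, stats.norm(2, 1).cdf, [100, 1000, 10000])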
On the simple, 1d model, all variants of slice sampling clearly outperform Metropolis. In the quartile graph we consider mixtures of Metropolis and slice, both with 10% Metropolis and with 50% Metropolis, and find that the change does not have a significant impact on performance. This is likely because, if slice picks a good sample, Metropolis is likely to simply keep it unchanged (since it will tend to reject the proposal from the prior if it is worse).

On the 2d model, slice still clearly outperforms Metropolis, though the gap is not as pronounced as for the 1d model. Further, as in the 1d model, the three different slice variants all get quite similar performance. Additionally, on this model, the fact that the slice mixtures get more samples per trace calculation translates into a slightly better performance for them than for the pure slice sampling method.

The third, trans-dimensional, model reveals a more pronounced performance difference between slice and Metropolis than in the 2-dimensional case. Further, on this model, we see a significant gap between the 1:9 Metropolis:Slice method and the other slice sampling approaches. It seems that, in this case, the ratio of 1:9 strikes a good balance between slice sampling's automatic adjustment of the kernel's width and Metropolis' efficiency in likelihood calculations per sample.

4.2 Anglican models

In order to further test the slice sampling inference engine, we look at three of the models defined in Wood et al. [2014]. The model specifications for these are provided in Section 2.2 and in the supplementary material.

To evaluate the engines, we use each to generate 100 independent sample runs. In Figure 12 we plot the convergence of the empirical posterior to the true posterior as the number of trace likelihood calculations increases. For continuous distributions we use the Kolmogorov-Smirnov statistic (as before), whereas for discrete ones we employ the Kullback-Leibler divergence to measure the similarity of two distributions.

On the Branching and HMM models (Figure 12), naive Metropolis-Hastings outperforms all slice combinations. These models are quite small and are conditioned on few datapoints, which means they have relatively similar priors and posteriors. On such models, the overhead of slice sampling cannot be recovered, since the naive Metropolis actually does quite well by simply sampling the prior. Additionally, in the HMM model, we are only inferring which of 3 states the model is in. With such a small state space, the value that an adjusting proposal kernel can add is limited.

On the Marsaglia model, however (Figure 12), slice sampling does obtain a boost in performance. This is due both to the continuous nature of the model (i.e. larger state space) and to the fact that the prior and posterior are significantly different (figure in Appendix). It is worth noting that the Marsaglia model is the only one in which Metropolis from the prior outperformed Anglican's Particle MCMC. Slice sampling is therefore better than both PMCMC and Metropolis on this model.

It is interesting to note that, on all models tried, slice outperforms Metropolis on a per-sample basis. This is an unfair comparison, since slice does more "work" to generate a sample than Metropolis, but it illustrates the point that the samples chosen by slice are in general better; the only question is whether the difference is sufficient to justify the increased overhead.
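For the discrete models, one plausible reading of the KL-based convergence measure is sketched below; the direction of the divergence and the three-state posterior are assumptions made for illustration, not taken from the paper.

    import math
    from collections import Counter

    def empirical_kl(samples, true_pmf):
        """KL(true || empirical) over the support of the true posterior,
        a rough sketch of the discrete-model convergence measure."""
        counts = Counter(samples)
        n = len(samples)
        kl = 0.0
        for value, p in true_pmf.items():
            q = counts.get(value, 0) / n
            if q == 0:
                return float("inf")   # a support point has no empirical mass yet
            kl += p * math.log(p / q)
        return kl

    # Illustrative use with a hypothetical 3-state posterior over HMM states.
    true_pmf = {0: 0.2, 1: 0.5, 2: 0.3}
    print(empirical_kl([0, 1, 1, 2, 1, 0, 2, 1], true_pmf))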
Figure 9: Specifications of the 3 Gaussian mean inference models.
  NormalMean1: m ~ N(0, 1); observe N(m, 1) = 5; predict m
  NormalMean2: m ~ N(0, 1); v ~ invGamma(3, 1); observe N(m, v) = 5; predict m
  NormalMean3: m ~ N(0, 1); if m < 0 then v ~ invGamma(3, 1) else v = 1/3; observe N(m, v) = 5; predict m

Figure 10: Analytically derived posteriors of the 3 Gaussian mean inference models.

Figure 11: Median (solid), 25% and 75% quartiles (dashed) convergence rates on the 3 Gaussian models.

Figure 12: Median (solid), 25% and 75% quartiles (dashed) convergence rates on the 3 Anglican models (Branching, HMM, Marsaglia).

4.3 Models for classification

Lastly, we look at some new models, namely logistic regression and a small neural network. Since we are doing classification, here we can actually plot the mean squared error that the model achieves after a certain number of trace likelihood computations. As before, we perform separate runs starting with different random seeds and plot both the median and the 25% and 75% quartiles.

First, we perform logistic regression on the well-known Iris dataset, obtained from Bache and Lichman [2013]. The result is shown in Figure 13 and shows that slice sampling converges significantly faster than MH. Secondly, we train a small Bayesian neural network on the same dataset. Our neural network is composed of an input layer of the same dimensionality as the data (4), two hidden layers of 4 and 2 neurons respectively, and an output layer of a single neuron (since we are treating the task as a binary classification problem). In Figure 14 we see that slice sampling does significantly better than the MH baseline on this task. We also notice that the harder the inference problem is, the more the margin by which slice sampling outperforms grows. Iris-setosa, on which the performance gap is smallest, is linearly separable from the other two classes, which are not linearly separable from one another. In the case of Iris-versicolor, we were not able to run the engine long enough for Metropolis to match slice's performance, as Metropolis is still lagging behind even after running 10 times longer than slice.
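The paper does not list the classification models' source, but a hypothetical StocPy-style Bayesian logistic regression along the lines described above might be written as follows. The bernoulli wrapper call and the inline toy data are assumptions for illustration, not StocPy API guarantees.

    import math
    import stocPy

    # Toy stand-in data: two 4-feature rows with binary labels.
    xs = [(5.1, 3.5, 1.4, 0.2), (7.0, 3.2, 4.7, 1.4)]
    ys = [0, 1]

    def logisticRegression():
        # Normal(0, 1) priors on the weights and bias; mark them as observed.
        ws = [stocPy.normal(0, 1, obs=True) for _ in range(4)]
        b = stocPy.normal(0, 1, obs=True)
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(ws, x)) + b)))
            # Condition a Bernoulli(p) draw on the observed binary label,
            # assuming the scipy.stats wrapper accepts the same cond keyword.
            stocPy.stocPrim("bernoulli", (p,), cond=y)

    samples = stocPy.getSamples(logisticRegression, 30000, alg="slice")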
Figure 13: Median (solid), 25% and 75% quartiles (dashed) convergence rates of logistic regression on the Iris dataset (panels: Iris-setosa, Iris-virginica, Iris-versicolor; x-axis: trace likelihood calculations; y-axis: proportion of test classification error).

Figure 14: Median (solid), 25% and 75% quartiles (dashed) convergence rate of a Bayesian NN on the Iris dataset (panels: Iris-setosa, Iris-virginica, Iris-versicolor).

5 Related Work

Our slice sampling inference engine is based on the slice sampling method presented by Neal [2003], and influenced by the computer-friendly slice sampler shown in MacKay [2003].

Slice sampling techniques have been applied to a wide range of inference problems (Neal [2003]) and are used in some of the most popular PPLs, such as BUGS (Lunn et al. [2009]) and STAN (Stan Development Team [2014]). However, the slice samplers these languages employ are not exposed directly to the user, but are instead only used internally by the language. Therefore, the slice samplers present in these languages are not intended to generalise to all models and, specifically, make no mention of trans-dimensionality corrections. Poon and Domingos [2006] also propose a slice sampling based solution, this time to Markov Logic problems. However, this algorithm is focused on deterministic or near-deterministic inference tasks and so bears little resemblance to our approach.

The most similar use-case to ours would be in Venture (Mansinghka et al. [2014]). Here, however, slice sampling is only mentioned in passing, amongst other inference techniques the system could support. No details, nor discussions of trans-dimensionality, are present.
However this algorithm is focused on Metropolisperformspoorlytheperformancedifference deterministicornear-deterministicinferencetasksand can be arbitrarily large. so bears little resemblance to our approach. Finally, we have provided comparisons with Anglican The most similar use-case to ours would be in Ven- which show promising results despite slice sampling ture (Mansinghka et al. [2014]). Here however slice not beeing optimised for runtime speed. A full bench- sampling is only mentioned in passing, amongst other marking of the systems remains to be performed. Manuscript under review by AISTATS 2014 References F. Wood, J. W. van de Meent, and V. Mansinghka. A new approach to probabilistic programming in- A.W.Appel.ModerncompilerimplementationinML. ference. In Proceedings of the 17th International Cambridge university press, 1998. conference on Artificial Intelligence and Statistics, K. Bache and M. Lichman. UCI machine learning 2014. repository, 2013. URL http://archive.ics.uci. edu/ml. N. Goodman, V. Mansinghka, D. Roy, K. Bonawitz, and J. Tenenbaum. Church: A language for gener- ative models. In Proceedings of the 24th Conference on Uncertainty in Artificial Intelligence, UAI 2008, pages 220–229, 2008. A. D. Gordon. An agenda for probabilistic program- ming: Usable, portable, and ubiquitous. 2013. P.J.GreenandD.I.Hastie. ReversiblejumpMCMC. Genetics, 155(3):1391–1403, 2009. A. A. Hagberg, D. A. Schult, and P. J. Swart. Ex- ploring network structure, dynamics, and function using NetworkX. In Proceedings of the 7th Python in Science Conference (SciPy2008), pages 11–15, Pasadena, CA USA, Aug. 2008. D. Lunn, D. Spiegelhalter, A. Thomas, and N. Best. The BUGS project: Evolution, critique and future directions. Statisticsinmedicine,28(25):3049–3067, 2009. D. J. MacKay. Information theory, inference and learning algorithms. Cambridge university press, 2003. V. Mansinghka, D. Selsam, and Y. Perov. Ven- ture: ahigher-orderprobabilisticprogrammingplat- form with programmable inference. arXiv preprint arXiv:1404.0099, 2014. B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. Statistical relational learn- ing, page 373, 2007. R.M.Neal. Slicesampling. Annals of statistics,pages 705–741, 2003. H. Poon and P. Domingos. Sound and efficient infer- encewithprobabilisticanddeterministicdependen- cies. In AAAI, volume 6, pages 458–463, 2006. R. Ranca. StocPy: An expressive probabilistic pro- gramming language, in python. https://github. com/RazvanRanca/StocPy, 2014. Stan Development Team. Stan Modeling Language Users Guide and Reference Manual, Version 2.4, 2014. URL http://mc-stan.org/. D. Wingate, A. Stuhlmueller, and N. D. Goodman. Lightweight implementations of probabilistic pro- gramming languages via transformational compila- tion. In International Conference on Artificial In- telligence and Statistics, pages 770–778, 2011. Supplementary Material for “Slice Sampling for Probabilistic Programming” Razvan Ranca Zoubin Ghahramani University of Cambridge University of Cambridge 5 1 Detailed slice-sampling pseudocode Algorithm 3 Given current sample x and its height 1 u, get next sample x from the interval [x ,x ] such n l r 0 2 Here we present the detailed pseudocode for one pos- that P∗(xn)>u n sible slice sampling implementation. Algorithm 1 is 1: function SampNext(x,u,xl,xr) a the high-level logic of slice sampling which was also 2: Sample xn uniformly from [xl,xr] J presented in the paper. 
The other algorithms expand upon this, showing the lower-level operations. Specifically, Algorithm 2 shows how we could perform the operation described in line 5 of Algorithm 1, by exponentially increasing the [xl, xr] interval. Similarly, Algorithm 3 describes how the operation from line 6 of Algorithm 1 is performed, namely by sampling a next x placed on the already defined segment [xl, xr] such that x is chosen "uniformly under the curve".

Algorithm 1 Slice Sampling
1: Pick initial x_1 such that P*(x_1) > 0
2: Pick number of samples to extract, N
3: for t = 1 → N do
4:   Sample a height u uniformly from [0, P*(x_t)]
5:   Define a segment [xl, xr] at height u
6:   Pick x_{t+1} ∈ [xl, xr] such that P*(x_{t+1}) > u

Algorithm 2 Return appropriate [xl, xr] interval, given current sample x and its height u
1: function StepOut(x, u)
2:   Pick small initial step width wi
3:   while P*(x - wi) < u or P*(x + wi) < u do
4:     wi = wi / 2
5:   w = wi
6:   while P*(x - w) > u do
7:     w = w * 2
8:   xl = x - w
9:   w = wi
10:  while P*(x + w) > u do
11:    w = w * 2
12:  xr = x + w
13:  return xl, xr

Algorithm 3 Given current sample x and its height u, get next sample xn from the interval [xl, xr] such that P*(xn) > u
1: function SampNext(x, u, xl, xr)
2:   Sample xn uniformly from [xl, xr]
3:   while P*(xn) <= u do
4:     if xn > x then
5:       xr = xn
6:     else
7:       xl = xn
8:     Sample xn uniformly from [xl, xr]
9:   return xn

2 True Posteriors

Here we present two true posteriors that support some claims made in the paper. In Figure 1, we see that the "Hard" Gaussian mean inference problem results in an analytical posterior that is very close to a Gaussian with mean 2 and standard deviation 0.032. The second posterior, shown in Figure 2, presents the difference between the prior and posterior of the Marsaglia model. This figure gives an intuitive understanding of why Metropolis is outperformed by slice sampling on the Marsaglia model. It is easy to imagine how badly Metropolis will do in this model if it only samples from the prior.

3 Additional Sample Programs in StocPy and Anglican

For completeness, and to get a better idea of the "feel" of StocPy, we show the remaining two Anglican models that were not presented in the paper. These models are a Dirichlet Process Mixture (Figure 3) and a Marsaglia (Figure 4). We also show these models' implementation in Anglican, for comparison's sake. As in the paper, we can see that on the more complex model StocPy manages to be more succinct, largely thanks to the Python in-built operators. Additionally,
