A Scalable Blocked Gibbs Sampling Algorithm for Gaussian and Poisson Regression Models

Nicholas A. Johnson*, Frank O. Kuehnel, Ali Nasiri Amini
Google Inc

February 2, 2016

*e-mail: [email protected]

Abstract

Markov Chain Monte Carlo (MCMC) methods are a popular technique in Bayesian statistical modeling. They have long been used to obtain samples from posterior distributions, but recent research has focused on the scalability of these techniques for large problems. We do not develop new sampling methods but instead describe a blocked Gibbs sampler which is sufficiently scalable to accommodate many interesting problems. The sampler we describe applies to a restricted subset of the Generalized Linear Mixed-effects Models (GLMM's); this subset includes Poisson and Gaussian regression models. The blocked Gibbs sampling steps jointly update a prior variance parameter along with all of the random effects underneath it. We also discuss extensions such as flexible prior distributions.

1 Introduction

There has been a great deal of work on implementing efficient, large-scale regularized regressions, but there has been much less progress in scaling up fully Bayesian regression models. Bayesian and Empirical Bayes models (such as GLMM's) have found wide use when applied to smaller datasets. Two of the more popular software packages are STAN [12] and the lme4 R package [1]. STAN is highly customizable and it uses MCMC to draw posterior samples of model parameters. The lme4 software is perhaps the most popular R package for mixed effects models, but its implementation is based on Laplace approximation, which involves factorization of large matrices.

Recent work to scale up Bayesian model inference includes consensus Bayes [13], stochastic gradient Langevin dynamics (SGLD) [18], and the Weierstrass sampler [17]. The intent of our publication is to present blocked Gibbs sampling in terms of simple, scalable operations for solving large regression models. The narrow class of Bayesian regression models that we consider also has associated MCMC moves which depend on a (relatively) small set of sufficient statistics.

In this section we first discuss an example of the type of data that our algorithm can model. We then give some background on random effects models and follow by describing the class of GLMM's that our algorithm applies to. Finally, we end this section with a review of background material and related work.

    n.views   n.actions   url       ad.id
    52        4           abc.com   83473
    73        5           xyz.edu   40983
    19        0           abc.com   4658
    532       16          efg.com   40983
    3         0           z.com     4658
    ...       ...         ...       ...

Table 1: An example dataset for which n.actions could be modeled as a Poisson count depending on two features, "url" and "ad.id", and on an observed offset "n.views".

The data we accept as input can be summarized through a matrix such as that in Table 1. When modeling that example data, our goal could be to predict, say, the number of actions users took, n.actions, based on features url and ad.id and using n.views as an offset (the number of ads the user saw). In practice we would have many more feature columns in our dataset. These features often have a "long tail" of levels with little associated data, and Bayesian priors or regularization are essential in order to make use of them. Count data is common in applications, and we focus on it for most of this report; however, section 3.3 briefly discusses Gaussian models as well.

We will start by describing GLMM's as they are usually presented and then discuss how the data in Table 1 could be summarized in terms of this notation.
A Poisson GLMM can be written as:

\[
\{Y \mid b, \beta, D\} \sim \mathrm{Pois}\big(D \exp(X\beta + Zb)\big) \tag{1}
\]

where $Y \in \mathbb{N}^n$ is the Poisson response, $X \in \mathbb{R}^{n \times p}$ is the fixed effects model matrix, $\beta \in \mathbb{R}^p$ is the fixed effect parameter, $Z \in \mathbb{R}^{n \times r}$ is the random effect model matrix, $b \in \mathbb{R}^r$ is the random effect (a random variable), and $D \in \mathbb{R}^n_{>0}$ is a positive offset. Typically a Gaussian prior is placed on $b$: $b \sim N(0, \Sigma(\sigma))$, i.e. $b$ has a normal distribution with covariance $\Sigma$, and $\Sigma$ is in turn parameterized by a lower-dimensional vector $\sigma$. A common special case is that $\sigma \in \mathbb{R}^F_{\geq 0}$, where $F$ is the number of families (i.e. feature columns in the initial example), and $\Sigma(\sigma)$ is a diagonal matrix with $\sigma_1^2$ appearing on the first $L_1$ diagonal elements, $\sigma_2^2$ on the next $L_2$, and so on (and $L_1 + \ldots + L_F = r$).

In the language of the "lme4" R package [1] a Poisson model of the data in Table 1 can be specified as:

\[
\texttt{n.actions} \sim 1 + (1 \mid \texttt{url}) + (1 \mid \texttt{ad.id}) + \mathrm{offset}(\log(\texttt{n.views})) \tag{2}
\]

When modeling this example dataset we would have $F = 2$ random effect families, $p = 1$ fixed effects (so $X$ is the $n \times 1$ matrix $[1]_{n \times 1}$), $L_1$ would be the number of unique url's in the dataset, and $L_2$ would be the number of unique ad.id's. Then $\sigma_1^2$ is the prior variance of the url random effects and $\sigma_2^2$ is the prior variance of the ad.id random effects.

To be more precise, we define the index set $J_k$ to be the $k$'th block of indices $\{T_{k-1}+1, \ldots, T_{k-1}+L_k\}$, where $T_k := \sum_{j=1}^{k} L_j$ and $T_0 = 0$. The random effect variance matrix has $(j,j)$'th diagonal element $\Sigma(\sigma)_{jj} = \sigma_k^2$ when $j \in J_k$.

The mixed effects model is then fit by maximizing the marginal likelihood:

\[
\ell(\beta, \sigma) := \int P(\{b_i\}_{i=1}^{r} \mid \sigma) \prod_{j=1}^{n} P(Y_j \mid D_j, X_j, Z_j, \beta, b)\, db_1 \cdots db_r \tag{3}
\]
\[
\text{where } P(\{b_i\}_{i=1}^{r} \mid \sigma) = \prod_{f=1}^{F} \prod_{j=1}^{L_f} P(b_{T_{f-1}+j} \mid \sigma_f) \tag{4}
\]

where $X_j$ and $Z_j$ denote the $j$'th rows of the respective matrices, and $D_j$ and $Y_j$ refer to the $j$'th elements of the respective vectors.

Usually the integral in (3) cannot be computed in closed form, so it is approximated by MCMC or Laplace approximation [1]. Gaussian models are an exception, but even in that case calculating the marginal likelihood can involve matrix factorizations which are prohibitive to compute. Software such as lme4 can handle more general prior covariance structure than that described above, but we focus on the case where $\Sigma$ is diagonal.

When modeling Poisson data, the algorithm will only apply to the restricted case that: (a) $p = 1$, $X = [1]_{n \times 1}$, (b) $Z$ is a 0-1 matrix (i.e. $Z_{jk} \in \{0,1\}$), and (c) the partial row sums within a family's block of columns are equal to one: $\sum_{t \in J_k} Z_{it} = 1$. It is easy to instead accommodate $\sum_{t \in J_k} Z_{it} \in \{0,1\}$, but we omit the details. When modeling Gaussian data we relax this to allow any real-valued entries but maintain a restriction on the sparsity pattern: $\sum_{t \in J_k} 1\{Z_{it} \neq 0\} = 1$ for $k = 1, 2, \ldots, F$.

Condition (c) states that we have conditional independence between the random effects parameters $b$ within a single feature family $k$. We also depend on this sparsity pattern to store $Z$ efficiently. We discuss this in more detail in section 2. Next we choose a Gamma prior distribution instead of the standard log-normal one for GLMM's: $\exp(b_j) \sim \mathrm{Gamma}(\sigma_k^{-2}, \sigma_k^{-2})$, where $E[\exp(b_j)] = 1$ and $\mathrm{Var}(\exp(b_j)) = \sigma_k^2$ for $j \in J_k$. The mean-one restriction is for identifiability; without it the Gamma prior distributions could be scaled by any amount and an adjustment to the fixed effect would yield exactly the same data distribution. We will see that the conjugacy of the Gamma and Poisson distributions simplifies some sampling steps in our algorithm.

With these restrictions and conjugate priors in place, we develop a blocked Gibbs sampling iteration in which we update a single family at a time. We jointly update a prior parameter and all of the random effects beneath it.
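To make the restricted data model concrete, the following short NumPy sketch simulates counts from model (1) under restrictions (a)-(c) with the mean-one Gamma priors. It is an illustration only; the function name, the `levels` and `sigma` arguments, and the 0-based index matrix are our own choices and not part of any published implementation.

```python
import numpy as np

def simulate_gamma_poisson(n=1000, levels=(50, 20), sigma=(0.5, 0.3),
                           beta=0.1, seed=0):
    """Simulate from the restricted Poisson model (1): each row picks exactly
    one level per family, and exp(b) has a mean-one Gamma prior."""
    rng = np.random.default_rng(seed)
    D = rng.integers(1, 200, size=n).astype(float)        # offset, e.g. n.views
    # exp(b) ~ Gamma(shape=sigma^-2, scale=sigma^2): mean 1, variance sigma^2.
    B = [rng.gamma(s ** -2, s ** 2, size=L) for s, L in zip(sigma, levels)]
    # 0-based index matrix: one level per family for every observation.
    I = np.column_stack([rng.integers(0, L, size=n) for L in levels])
    mu = np.exp(beta) * D                                  # exp(X*beta) with X = [1]
    for k, Bk in enumerate(B):
        mu *= Bk[I[:, k]]                                  # multiplicative random effects
    Y = rng.poisson(mu)
    return Y, D, I, B
```

The index matrix produced here anticipates the compact representation of $Z$ introduced in section 2.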
We end this section with a review of background material and related work. Gibbs sampling is widely used, so we just cover some early papers, relevant textbooks, and related applications to Bayesian regression models.

Gelfand and Smith [4] introduced the statistics community to Gibbs sampling as a computational technique for inference in Bayesian models. This paper did not propose Gibbs sampling (they attributed it to [6]), but they showed the power of the technique through several examples. The paper has since been cited over six thousand times.

Conditions for ergodicity of Gibbs samplers and Metropolis-Hastings algorithms are given in [14], and simpler, less general conditions for convergence of the Gibbs sampler are given in [11]. Gibbs sampling and many other MCMC algorithms are described in the textbook by Jun Liu [8]. This includes blocked Gibbs samplers (also referred to as "grouped") and other variations. The textbook "Bayesian Data Analysis" (BDA) [5] contains examples of Bayesian regression models and Gibbs samplers tailored to them. BDA also includes examples of blocked Gibbs samplers (e.g. chapter 15, section 5 in the third edition).

Blocked Gibbs sampling was evaluated in the context of Gaussian mixed effects models by Chib and Carlin [3]. They considered the Gaussian longitudinal model:

\[
y_k = X_k \beta + W_k b_k + \epsilon_k \in \mathbb{R}^{n_k} \tag{5}
\]

where $b_k \in \mathbb{R}^q$ and has Gaussian prior $b_k \sim N_q(0, \Sigma)$. They developed seven different Gibbs samplers with varying levels of blocking. These included samplers which integrated out $\{b_k\}$ and $\beta$ and drew $\Sigma$ conditional on only the observed data $\{y_k\}$. They also showed how to apply their algorithms to binary probit regression using the latent variable representation $z_k = \mathrm{sign}(y_k)$, where $z_k$ is observed but $y_k$ is not. They found that blocking substantially reduced autocorrelation in their examples and that the additional computational cost of the blocked updates was a good tradeoff.

In our experience the blocked updates are especially important when there is a long tail of levels which have little associated data. Consider advertisers as an example: there may be a small subset of advertisers responsible for a huge number of "views" and "actions", and these are most informative when learning the prior variance parameters $\sigma$. Often, though, there is also a much larger population of advertisers with very sparse data, and their presence slows down the mixing of a non-blocked Gibbs sampler.

The work of Volfovsky and Hoff [16] is similar in that they build models with many random effect families: their focus is on Gaussian ANOVA models with multiple factors and interaction terms. They implement Gibbs samplers for balanced designs and suggest data augmentation as a technique to handle imbalanced designs (i.e. add missing data which would make the design balanced). They focused on relatively small datasets (i.e. $n < 10{,}000$, $F < 3$, and $L_k < 10$). The algorithm we describe below has been applied to much larger problems.

In the next section we give a detailed description of the algorithm and associated computations, and in section 3 we discuss some extensions to the algorithm.

2 Gibbs Sampling for Gamma-Poisson Regression Models

In this section we give a detailed description of the blocked Gibbs sampler for the Gamma-Poisson model. In an un-blocked Gibbs sampler with variables $(V_1, V_2, \ldots, V_n)$ we would iterate the updates:

\[
v_k \sim p(V_k \mid V_1 = v_1, \ldots, V_{k-1} = v_{k-1}, V_{k+1} = v_{k+1}, \ldots, V_n = v_n) \tag{6}
\]

In a "blocked" or "grouped" Gibbs sampler we can update several variables at a time:

\[
\{v_j : j \in I\} \sim p\big(\{V_j : j \in I\} \mid \{V_i : i \notin I\} = \{v_i : i \notin I\}\big) \tag{7}
\]

We describe how the block $I$ can be taken to be a large set of random effects as well as an associated prior variance parameter. This blocked update is just as scalable as un-blocked updates (and much more scalable than a naive implementation of an un-blocked Gibbs sampler).

As mentioned earlier, we do not develop an algorithm for general $X$ and $Z$ model matrices.
We will handle the case that $X$ is the $n \times 1$ matrix $[1]_{n \times 1}$ and so $\beta \in \mathbb{R}$. With the restriction on $Z$ described above we can more compactly write the subsequent computations in terms of an $n \times F$ matrix of indices $I$. We define $I_{jk} = t$ if $Z_{j,(t+T_{k-1})} = 1$. For example, if the first family of random effects is based on "url" and we enumerate the unique url's as $1, 2, 3, \ldots$, then $I_{j1} = t$ if the $j$'th row of the input table contains the $t$'th url. To make some equations easier to read we will sometimes write $I(j,k)$ in place of $I_{jk}$.

The table below shows what the matrix $I$ would look like for the dataset in the introduction (Table 1). The columns with headings $I_1$ and $I_2$ show the first and second columns of the matrix $I$.

    n.views   n.actions   url       ad.id   I_1   I_2
    52        4           abc.com   83473   1     1
    73        5           xyz.edu   40983   2     2
    19        0           abc.com   4658    1     3
    532       16          efg.com   40983   4     2
    3         0           z.com     4658    5     3
    ...       ...         ...       ...     ...   ...

Next we will define $B_{kt} := \exp(b_{T_{k-1}+t})$. $B$ is a ragged array rather than a matrix because the length of $\{B_{kt}\}_t \in \mathbb{R}^{L_k}$ can depend on the family index $k$. As a shorthand we will simply write $B_k := \{B_{kt}\}_t$ for the vector of random effects associated with the $k$'th family.

This representation is not just notationally convenient: it is also how we represent $Z$ and $b$ in our optimized implementation of the algorithm. Next we define three operations in terms of componentwise vector products/divisions, and these operations represent the bulk of the computation in large datasets. These operations are used to compute the sufficient statistics which appear in our Gibbs sampling steps.

\[
\mathrm{Predict}(B, \beta) := \beta D \prod_{f=1}^{F} B_f[I] \in \mathbb{R}^n \tag{8}
\]
\[
\mathrm{Predict}_{-k}(B, \beta) := \mathrm{Predict}(B, \beta) / B_k[I] \in \mathbb{R}^n \tag{9}
\]
\[
\mathrm{SumBy}_k(V) := \Big\{ \sum_{j:\, I(j,k)=i} V_j \Big\}_{i=1}^{L_k} \in \mathbb{R}^{L_k}, \quad V \in \mathbb{R}^n \tag{10}
\]

where $B_k[I] := \{B_{k,I(j,k)}\}_{j=1}^{n} \in \mathbb{R}^n$. If we partitioned the columns of $Z$ by family, $Z = [Z^{(1)} Z^{(2)} \cdots Z^{(F)}]$, then another way to define $B_k[I]$ would simply be the matrix-vector product $Z^{(k)} B_k$. In (9) we use a negative subscript "$-k$" to remind the reader that the prediction is based on all but the $k$'th random effect family.

Our data distribution can be rewritten concisely in terms of $\mathrm{Predict}()$:

\[
\{Y \mid D, B, \beta, Z\} \sim \mathrm{Poisson}(\mathrm{Predict}(B, \beta)) \tag{11}
\]

recalling that $I$ is just a different representation of the matrix $Z$.

To update $B_k$ we will perform two computations:

\[
\mathrm{events} := \mathrm{SumBy}_k(Y) \in \mathbb{N}^{L_k} \tag{12}
\]
\[
\mathrm{pevents} := \mathrm{SumBy}_k(\mathrm{Predict}_{-k}(B, \beta)) \in \mathbb{R}^{L_k} \tag{13}
\]

"SumBy" can be computed in $O(n)$ time, and we will show that although the cost of "Predict" is $O(nF)$, this can be reduced through amortization.

Our blocked Gibbs sampling update for the $k$'th family is then:

\[
\sigma_k \sim P(\sigma_k \mid Y, D, B_{-k}, \sigma_{-k}, \beta) \quad \text{(integrating out } B_k\text{)} \tag{14}
\]
\[
B_k \sim P(B_k \mid Y, D, B_{-k}, \sigma, \beta) \quad \text{(conditioning on } \sigma_k\text{)} \tag{15}
\]

This two-stage process produces a sample from the joint posterior:

\[
(B_k, \sigma_k) \sim P(B_k, \sigma_k \mid Y, D, B_{-k}, \sigma_{-k}, \beta)
\]

Due to conditional independence this update is independent of $\sigma_{-k}$:

\[
(B_k, \sigma_k) \sim P(B_k, \sigma_k \mid Y, D, B_{-k}, \beta)
\]

It is important to note that the conditional distributions will only depend on $2L_k$ sufficient statistics: $\mathrm{events} \in \mathbb{N}^{L_k}$ and $\mathrm{pevents} \in \mathbb{R}^{L_k}_{>0}$. This is also true for the Gaussian regression described in section 3.3.¹

¹ This simplification does not occur for logistic models ($P(Y_k = 1) = (1 + \exp(-\theta))^{-1}$) or truncated Gaussian regressions, which are used in a latent-variable representation of binary probit regressions [3].

A second point worth noting is that we need not precisely sample the variance parameter in equation (14); to maintain the correct stationary distribution it is sufficient to use Metropolis-Hastings [15].

Earlier we mentioned that it was possible to speed up the computation of the Predict functions defined above, and we describe this now. Suppose that $\Pi^{\mathrm{old}} := \mathrm{Predict}(B, \beta)$ was computed before sampling $(B_k, \sigma_k)$. Once we have drawn a new value $B_k^{\mathrm{new}}$ we can update:

\[
\Pi^{\mathrm{new}} := \Pi^{\mathrm{old}} \, \frac{B_k^{\mathrm{new}}[I]}{B_k^{\mathrm{old}}[I]} \in \mathbb{R}^n \quad \text{(componentwise multiplication/division)}
\]

The un-amortized cost of the $\mathrm{Predict}_{-k}(B, \beta)$ operations would be $O(nF^2)$ for a single scan over all $F$ families.
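The three operations above map directly onto array primitives. The sketch below is our own illustration, assuming the 0-based $n \times F$ index matrix from the earlier simulation sketch; the function names are not from the paper. It shows one way to compute Predict, Predict$_{-k}$, SumBy$_k$, and the amortized refresh of $\Pi$ with NumPy.

```python
import numpy as np

def predict(B, beta, D, I):
    """Equation (8): beta * D * prod_f B_f[I], computed componentwise."""
    out = beta * D
    for k, Bk in enumerate(B):
        out *= Bk[I[:, k]]
    return out

def predict_minus_k(Pi, B, I, k):
    """Equation (9): divide the full prediction by family k's factor."""
    return Pi / B[k][I[:, k]]

def sum_by(V, I, k, num_levels):
    """Equation (10): sum the entries of V within each level of family k, in O(n)."""
    return np.bincount(I[:, k], weights=V, minlength=num_levels)

def refresh_prediction(Pi, B_new_k, B_old_k, I, k):
    """Amortized update of Pi after redrawing family k's random effects."""
    return Pi * B_new_k[I[:, k]] / B_old_k[I[:, k]]
```

Here `np.bincount` provides the $O(n)$ aggregation used for 'events' and 'pevents'.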
In practice we compute $\mathrm{Predict}(B, \beta)$ using equation (8) at the beginning of each scan to reduce the cost by a factor of $F$. Due to accumulation of numerical errors we cannot refresh only once per scan for arbitrarily large $F$; however, we have observed no practical consequences when applying this to models with $F$ in the hundreds.

If we let $\Pi := \mathrm{Predict}_{-k}(B, \beta)$ then the data log likelihood for rows $j$ with index $I_{jk} = t$ is:

\[
\ell_t(B_{kt}) := \sum_{j:\, I(j,k)=t} \big( -B_{kt}\Pi_j + Y_j \log(B_{kt}\Pi_j) - \log(Y_j!) \big)
= -\mathrm{pevents}_t\, B_{kt} + \mathrm{events}_t \log(B_{kt}) + c(t, Y, \Pi)
\]
\[
\text{where } c(t, Y, \Pi) := \sum_{j:\, I(j,k)=t} \big( Y_j \log \Pi_j - \log(Y_j!) \big)
\]

The prior likelihood on $B_{kt}$ is

\[
P(B_{kt} = u \mid \sigma_k) = \mathrm{dgamma}(u, \sigma_k^{-2}, \sigma_k^{-2})
\]
\[
\text{where } \mathrm{dgamma}(x, \theta, \eta) := C_\Gamma(\theta, \eta)\, x^{\theta-1} \exp(-\eta x)
\quad \text{and} \quad C_\Gamma(\theta, \eta) := \frac{\eta^\theta}{\Gamma(\theta)}
\]

Because the prior on $B_{kt}$ is a product of independent Gamma distributions, each element of $(B_{kt}, \{Y_j : I(j,k)=t\})_t$ is independent after conditioning on $(D, B_{-k}, \sigma_k, \beta)$. This conditional independence is used to simplify some high-dimensional integrals into products of one-dimensional integrals in the formulas below.

Using this conditional independence we compute the marginal data likelihood after integrating out $B_k$:

\[
P(Y \mid D, B_{-k}, \sigma_k) = \int \prod_t P(B_{kt} = u_t \mid \sigma_k) \exp(\ell_t(u_t))\, du_1 \cdots du_{L_k} \tag{16}
\]
\[
= \prod_t \int P(B_{kt} = u \mid \sigma_k) \exp(\ell_t(u))\, du \tag{17}
\]
\[
= \prod_t \exp(c(t, Y, \Pi)) \, \frac{C_\Gamma(\sigma_k^{-2}, \sigma_k^{-2})}{C_\Gamma(\sigma_k^{-2} + \mathrm{events}_t,\; \sigma_k^{-2} + \mathrm{pevents}_t)} \tag{18}
\]

In the equation above we did not condition on the other prior parameters, $\sigma_{-k}$, or integrate them out because $Y$ is independent of $\sigma_{-k}$ after conditioning on $B_{-k}$. For use in subsequent pseudocode we define the function $\mathrm{PriorMarginal}()$ as

\[
\mathrm{PriorMarginal}(\sigma_k, \mathrm{events}, \mathrm{pevents}) := \prod_t \frac{C_\Gamma(\sigma_k^{-2}, \sigma_k^{-2})}{C_\Gamma(\sigma_k^{-2} + \mathrm{events}_t,\; \sigma_k^{-2} + \mathrm{pevents}_t)} \tag{19}
\]

To sample $\sigma_k$ we take the product of the marginal data likelihood and the prior on $\sigma_k$:

\[
P(Y, \sigma_k \mid D, B_{-k}) = P(Y \mid D, B_{-k}, \sigma_k)\, P(\sigma_k)
\]

which is proportional to the posterior $P(\sigma_k \mid Y, D, B_{-k})$ (as a function of $\sigma_k$). We assume a flat, improper prior $P(\sigma_k) \equiv 1$ because we have no preferred choice.

We do not bother to compute $\prod_t \exp(c(t, Y, \Pi))$ since we must renormalize or use Metropolis-Hastings anyway. The result is that we need only compute the aggregate statistics 'events' and 'pevents' to find $P(\sigma_k \mid Y, D, B_{-k})$.

Once $\sigma_k$ is drawn, the posterior distribution of $B_k$ is simply a product of Gamma distributions:

\[
P(B_k \mid \sigma_k, Y, D, B_{-k}) = \prod_{t=1}^{L_k} \mathrm{dgamma}\big(B_{kt},\; \mathrm{events}_t + \sigma_k^{-2},\; \mathrm{pevents}_t + \sigma_k^{-2}\big)
\]

Finally, we discuss the update for the fixed effect parameter $\beta$. The fixed effect $\beta$ can be updated through a Monte Carlo EM algorithm, but for simplicity in the pseudocode we just put a $\mathrm{Gamma}(1,1)$ prior on $\beta$ and update it like the other random effects. We expect little difference in behavior when applied to large datasets.

The pseudocode in algorithm (1) summarizes the blocked Gibbs sampling algorithm described above. As mentioned earlier, step 10 in the algorithm could be a single Metropolis-Hastings update. The "griddy Gibbs sampler" updates described in [14] are another option.
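As an illustration of how few ingredients the per-family update (14)-(15) needs, here is a minimal sketch that evaluates the log of PriorMarginal from (19) and performs one blocked draw of $(\sigma_k, B_k)$, using a simple grid draw for $\sigma_k$ as a stand-in for the Metropolis-Hastings or griddy-Gibbs step mentioned above. The function names, the 0-based index vector, and the grid are our own choices, not the paper's.

```python
import numpy as np
from scipy.special import gammaln

def log_prior_marginal(sigma, events, pevents):
    """log of PriorMarginal() in (19); log C_Gamma(theta, eta) = theta*log(eta) - lgamma(theta)."""
    a = sigma ** -2
    log_c0 = a * np.log(a) - gammaln(a)
    log_c1 = (a + events) * np.log(a + pevents) - gammaln(a + events)
    return np.sum(log_c0 - log_c1)

def sample_family(Pi, Y, B_k, I_k, sigma_grid, rng):
    """One blocked update (14)-(15) for a single family under a flat prior on sigma_k."""
    L = len(B_k)
    events = np.bincount(I_k, weights=Y, minlength=L)                # eq. (12)
    pevents = np.bincount(I_k, weights=Pi / B_k[I_k], minlength=L)   # eq. (13)
    # Draw sigma_k from its marginal posterior evaluated on a grid.
    log_post = np.array([log_prior_marginal(s, events, pevents) for s in sigma_grid])
    w = np.exp(log_post - log_post.max())
    sigma_k = rng.choice(sigma_grid, p=w / w.sum())
    # Draw B_k | sigma_k from its product-of-Gammas posterior.
    a = sigma_k ** -2
    B_k_new = rng.gamma(shape=events + a, scale=1.0 / (pevents + a))
    Pi_new = Pi * B_k_new[I_k] / B_k[I_k]        # amortized refresh of the prediction
    return sigma_k, B_k_new, Pi_new
```

Because only 'events' and 'pevents' enter these draws, the update touches each observation once and each level a constant number of times.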
Finally, in subsequent pseudocode we write just "Sample $B_f^{\mathrm{new}} \sim P(B_f \mid B_{-f}, Y, D, \beta, \sigma)$" in place of the for-loop on lines 11-13.

Algorithm 1 Gamma-Poisson Gibbs Sampling algorithm
 1: initialize $B, \beta, \sigma$
 2: for iter $= 1, 2, \ldots$ do
 3:   $\Pi \leftarrow \mathrm{Predict}(B, \beta)$
 4:   Sample $\beta^{\mathrm{new}} \sim \mathrm{dgamma}\big(1 + \sum_j Y_j,\; 1 + \beta^{-1} \sum_j \Pi_j\big)$
 5:   $\Pi \leftarrow \Pi\, \beta^{\mathrm{new}} / \beta$    ⊲ Update $\Pi$ for use in the next sampling step
 6:   $\beta \leftarrow \beta^{\mathrm{new}}$
 7:   for $f = 1, 2, \ldots, F$ do    ⊲ For each feature family
 8:     events $\leftarrow \mathrm{SumBy}_f(Y)$
 9:     pevents $\leftarrow \mathrm{SumBy}_f(\Pi / B_f[I])$
10:     Sample $\sigma_f$ from $\mathrm{PriorMarginal}(\sigma_f, \mathrm{events}, \mathrm{pevents})$
11:     for $t = 1, 2, \ldots, L_f$ do    ⊲ Sample $B_f^{\mathrm{new}} \sim P(B_f \mid B_{-f}, Y, D, \beta, \sigma)$
12:       $B_{ft}^{\mathrm{new}} \sim \mathrm{dgamma}(\mathrm{events}_t + \sigma_f^{-2},\; \mathrm{pevents}_t + \sigma_f^{-2})$
13:     end for
14:     $\Pi \leftarrow \Pi\, B_f^{\mathrm{new}}[I] / B_f[I]$    ⊲ Update $\Pi$ for use in the next sampling step
15:     $B_f \leftarrow B_f^{\mathrm{new}}$
16:   end for
17: end for

3 Extensions

In this section we discuss extensions which remain as scalable as the Gamma-Poisson regression described in the previous section.

3.1 Handling A More General Z Matrix

Suppose the matrix $Z$ from the introduction can be partitioned into two sets of columns $Z = [Z^{(1)} \mid Z^{(2)}]$ and that $Z^{(1)}$ is structured as we required for algorithm 1. If on the other hand $Z^{(2)}$ is not structured this way, then clearly we can use algorithm 1 to update the random effects and priors associated with $Z^{(1)}$ and use more general updates for $Z^{(2)}$.

We take a moment to discuss one seemingly straightforward extension that turns out to not be as scalable in the Poisson model. Suppose the elements of $Z$ were not 0-1 but still had the sparsity pattern $\sum_{t \in J_k} 1\{Z_{it} \neq 0\} = 1$. We can represent the information in $Z$ compactly using two $n \times F$ matrices $I$ and $S$. While the index map $I$ stores the sparsity pattern, the additional matrix $S$ stores the non-zero values of $Z$, i.e. $S_{jk}$ is equal to the $(j, T_{k-1} + I_{jk})$'th element of $Z$ (recall that the definition of $T_k$ in section 1 was $\sum_{j=1}^{k} L_j$).

In this more general model we would modify Predict to instead be:

\[
\mathrm{ScaledPredict}(B, \beta) := \beta D \prod_{f=1}^{F} B_f[I]^{S_f} \in \mathbb{R}^n \tag{20}
\]

where $S_f = \{S_{jf}\}_{j=1}^{n}$ is the $f$'th column of $S$ and the exponentiation $B_f[I]^{S_f}$ is taken componentwise.

Under this generalization the vectors 'events' and 'pevents' are no longer the sufficient statistics for the conditional distribution of $(B_k, \sigma_k)$. We must instead aggregate by unique values of $(S_{jk}, I_{jk})$ rather than of $I_{jk}$. Unless $S_f$ takes on few unique values, the vector of sufficient statistics can be as long as $2n$ elements. Another important difference is that the conditional distributions of the elements of $B_f$ will no longer have a Gamma distribution.

3.2 More Flexible Prior Distributions

In order to efficiently implement our blocked Gibbs sampler it is important that we can compute the marginal data likelihood $\int P(B_{kt} = u \mid \sigma_k) \exp(\ell_t(u))\, du$ in closed form. This is possible because the Gamma distribution is a conjugate prior for the Poisson distribution.

We still have considerable flexibility because these integrals can be evaluated analytically for mixtures of discrete and Gamma distributions as well:

\[
P(B_{kt} \in A) = \sum_{j=1}^{d} w_j^k 1\{\mathrm{loc}_j^k \in A\} + \sum_{j=d+1}^{d+g} w_j^k \int_A \mathrm{dgamma}(x, \sigma_{kj}^{-2}, \sigma_{kj}^{-2})\, dx
\]
\[
\text{where } \mathrm{loc}_j^k \in \mathbb{R}_{\geq 0}, \quad \sigma_{kj} \in \mathbb{R}_{>0}, \quad \textstyle\sum_j w_j^k = 1
\]

A common special case would be a sparse prior:

\[
P(B_{kt} \in A) = w^k 1\{1.0 \in A\} + (1 - w^k) \int_A \mathrm{dgamma}(x, \sigma_k^{-2}, \sigma_k^{-2})\, dx
\]

This is referred to as a "spike and slab" or "spike and bell" prior [7], [9]. Sampling from the two-dimensional conditional distribution of $(w^k, \sigma_k)$ may require more care to implement, but the likelihood is still a function of $2L_k$ statistics.
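To see why the sparse prior keeps the same sufficient statistics, note that the spike component contributes $\exp(\ell_t(1)) \propto \exp(-\mathrm{pevents}_t)$, while the slab contributes the same Gamma-Poisson integral as in (18). The sketch below is our own illustration of the resulting marginal log likelihood for one family; the mixture weight `w` and the helper names are assumptions, and the common factor $\exp(c(t, Y, \Pi))$ is dropped as before.

```python
import numpy as np
from scipy.special import gammaln

def log_c_gamma(theta, eta):
    """log of the Gamma normalizing constant C_Gamma(theta, eta) = eta^theta / Gamma(theta)."""
    return theta * np.log(eta) - gammaln(theta)

def log_spike_and_slab_marginal(w, sigma, events, pevents):
    """Marginal data log likelihood of one family under the sparse prior
    (point mass w at B_kt = 1, else Gamma(sigma^-2, sigma^-2)), up to the c(t, Y, Pi) terms."""
    a = sigma ** -2
    log_spike = np.log(w) - pevents                        # spike: exp(l_t(1)) up to exp(c)
    log_slab = (np.log1p(-w) + log_c_gamma(a, a)
                - log_c_gamma(a + events, a + pevents))    # slab: same integral as in (18)
    return np.sum(np.logaddexp(log_spike, log_slab))
```

Evaluating this on a two-dimensional grid, or inside a Metropolis-Hastings step, for $(w^k, \sigma_k)$ still requires only the $2L_k$ statistics 'events' and 'pevents'.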
3.3 Gaussian Regression Models

When defining the Gaussian model, we will refer to the matrix $S$ defined in equation (20). In that section we pointed out that handling a more general $Z$ matrix came at significant computational cost in the Poisson model; however, in the Gaussian model this is not the case. The updates are equally simple after lifting the 0-1 restriction on the entries of $Z$. Again we consider the case with sparsity pattern $\sum_{t \in J_k} 1\{Z_{it} \neq 0\} = 1$ for $k = 1, 2, \ldots, F$.

The Predict functions are similar to those defined for the Poisson model:

\[
\mathrm{GaussPredict}(B, \beta) := \beta + \sum_{f=1}^{F} B_f[I] S_f \in \mathbb{R}^n \tag{21}
\]
\[
\mathrm{GaussPredict}_{-k}(B, \beta) := \beta + \sum_{f \neq k} B_f[I] S_f \in \mathbb{R}^n \tag{22}
\]

where $B_f[I] S_f$ is a componentwise product. Our model of the data is now:

\[
Y \sim N\big(\mathrm{GaussPredict}(B, \beta),\; D^{-1}\big) \tag{23}
\]

where $D$ now serves as a residual inverse-variance rather than an offset in the regression. We do not develop the sampling steps needed to infer the residual variance in this report.

We next define the sufficient statistics for the Gibbs sampling steps:

\[
\mathrm{invvar} := \mathrm{SumBy}_k(S_k^2 D) \in \mathbb{R}^{L_k} \tag{24}
\]
\[
\mathrm{error} := \mathrm{SumBy}_k\big((Y - \mathrm{GaussPredict}_{-k}(B, \beta))\, S_k D\big) \in \mathbb{R}^{L_k} \tag{25}
\]

All operations are taken to be componentwise, including the squared term $S_k^2$ in equation (24).

As in the Poisson model we will compute the data log likelihood associated with each level of the random effect, writing $\Pi := \mathrm{GaussPredict}_{-k}(B, \beta)$:

\[
\ell_t(B_{kt}) := -\tfrac{1}{2} \sum_{j:\, I(j,k)=t} \big( (Y_j - \Pi_j - S_{jk} B_{kt})^2 D_j + \log(2\pi / D_j) \big)
= -\tfrac{1}{2} \mathrm{invvar}_t\, B_{kt}^2 + \mathrm{error}_t\, B_{kt} + c(t, Y, \Pi)
\]
\[
\text{where } c(t, Y, \Pi) := -\tfrac{1}{2} \sum_{j:\, I(j,k)=t} \big( (Y_j - \Pi_j)^2 D_j + \log(2\pi / D_j) \big)
\]

We place a $N(0, \sigma_k^2)$ prior on $B_{kt}$, and, as in equation (16), we will integrate over $B_k$.
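The conjugate Normal prior makes the per-level draw of $B_{kt}$ a completing-the-square exercise in invvar and error. The excerpt ends before the paper's own derivation, so the following is only our sketch of the standard conditional update implied by (24)-(25), with 0-based indices and hypothetical names.

```python
import numpy as np

def gauss_family_update(Y, Pi_minus_k, S_k, I_k, D, sigma_k, rng, num_levels):
    """Draw B_k given sigma_k for one Gaussian family from the statistics (24)-(25)."""
    resid = Y - Pi_minus_k                                                    # Y - GaussPredict_{-k}(B, beta)
    invvar = np.bincount(I_k, weights=S_k ** 2 * D, minlength=num_levels)     # eq. (24)
    error = np.bincount(I_k, weights=resid * S_k * D, minlength=num_levels)   # eq. (25)
    # Completing the square in -0.5*(invvar + sigma^-2)*B^2 + error*B gives
    # B_kt | rest ~ N(error_t / prec_t, 1 / prec_t) with prec_t = invvar_t + sigma_k^-2.
    prec = invvar + sigma_k ** -2
    return rng.normal(error / prec, np.sqrt(1.0 / prec))
```

As in the Poisson case, the draw for every level of the family depends on the data only through the $2L_k$ numbers invvar and error.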