ebook img

Fitting birth-death processes to panel data with applications to bacterial DNA fingerprinting PDF

0.21 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Fitting birth-death processes to panel data with applications to bacterial DNA fingerprinting

Great Expectations: EM Algorithms for Discretely Observed Linear Birth-Death-Immigration Processes CharlesR.Doss1,MarcA.Suchard2,IanHolmes3,MidoriKato-Maeda4,andVladimirN.Minin1,∗ 1DepartmentofStatistics, UniversityofWashington, Seattle,WA98195, USA 2DepartmentsofBiomathematics, Biostatistics, andHumanGenetics, UniversityofCalifornia, LosAngeles,CA90095, USA 3DepartmentofBioengineering andBiophysicsGraduateGroup, UniversityofCalifornia, Berkeley, CA94720,USA 4DepartmentofMedicine, UniversityofCalifornia, SanFrancisco, CA94143, USA 1 1 Abstract 0 Estimating parameters of continuous-time linear birth-death-immigrationprocesses, observed dis- 2 cretelyatunevenlyspacedtimepoints,isarecurringthemeinstatisticalanalysesofpopulationdynam- r ics. Viewingthistaskasamissingdataproblem,wedeveloptwonovelexpectation-maximization(EM) p A algorithms.Whenbirthrateiszeroorimmigrationrateiseitherzeroorproportionaltothebirthrate,we useKendall’sgeneratingfunctionmethodtoreducetheE-stepoftheEMalgorithm,aswellascalcula- 5 tionoftheFisherinformation,toonedimensionalintegration.Thisreductionresultsinasimpleandfast 1 implementationofthe EMalgorithm. To tackle theunconstrainedbirthandimmigrationrates, we ex- tendadirectsamplerforfinite-stateMarkovchainsandusethissamplingproceduretodevelopaMonte ] O CarloEMalgorithm.Wetestouralgorithmsonsimulateddataandthenuseournewmethodstoexplore C the birth and death rates of a transposable element in the genome of Mycobacteriumtuberculosis, the . causativeagentoftuberculosis. t a t s [ 1 Introduction 2 v Linearbirth-death-immigration (BDI)processesprovideusefulbuildingblocksformodelingpopulationdy- 3 namicsinecology(Nee,2006),molecularevolution(Thorneetal.,1991),andepidemiology(GibsonandRenshaw, 9 1998) among many other areas. Although Keiding (1975) has extensively studied inference for fully ob- 8 0 served continuous-time BDI processes, more often such processes are not observed completely, posing 9. interesting computational problems forstatisticians. Here,weuseapplied probability toolstodevelop new, 0 efficient implementations of the expectation-maximization (EM) algorithm for fitting discretely observed 0 BDIprocesses. 1 : We assume that we observe one or multiple independent BDI trajectories at fixed, possibly irregularly v spaced, timepoints. Holmes(2005)proposed anEMalgorithm forsuchdiscretely observed BDIprocesses i X in the context of finding the most optimal alignment of multiple genomic sequences. Of course, the EM r algorithm is not the only way to find maximum likelihood estimates (MLEs) of discretely observed BDI a parameters,asdemonstratedbyThorneetal.(1991),whoinitiallyproposedtheBDImodelforthesequence alignment problem. However, Holmes (2005) argues that the EM algorithm’s simplicity and robustness makethismethodattractive forlarge-scale bioinformatics applications. Computingexpectations ofthecomplete-data log-likelihood, neededforexecuting anEMalgorithm, can be challenging, especially if the complete-data were generated by a continuous-time stochastic process. Whencompletedataaregeneratedbyafinitestate-space continuous-time Markovchain(CTMC),theseex- pectations canbecomputed efficiently(Lange,1995;HolmesandRubin,2002). AlthoughtheBDIprocess isalsoaCTMC,theinfinitestate-spaceoftheprocessprohibitsusfromusingthesecomputationallyefficient methods. Holmes (2005) considers a BDI model with the immigration rate either zero or proportional to the birth rate. Underthis restriction, the complete-data likelihood belongs totheexponential family, which meansthattheexpectedcomplete-datalog-likelihood isalinearcombinationofexpectedsufficientstatistics ofthecomplete data. Holmes(2005)computes theseexpectations ofthesufficientstatistics bynumerically 1 solving a system of coupled non-linear ordinary differential equations (ODEs). For this restricted immi- grationmodelandforthedeath-immigration modelwedevelopanewcomputationally efficientmethodfor computing the expected sufficient statistics. Our method combines ideas from Kendall (1948) and Lange (1982)andreducescomputationsofexpectedsufficientstatisticstoone-dimensionalintegration,acomputa- tional taskthatismuchsimplerthansolving asystem ofnonlinear ODEs. Wedevelop asimilarintegration method to compute the observed Fisher information matrix via Louis (1982)’s formula. Moreover, for the death-immigration model and for Holmes (2005)’s sequence alignment model, we derive the expectations ofthecomplete-data sufficientstatistics inclosedform. When rates of the BDI model are not constrained, the infinite-dimensional vector of sufficient statistics precludes us from applying the above integration technique. Therefore, we resort to Monte Carlo (MC) estimation of the expected complete-data log-likelihood. Our MC-EM algorithm requires sampling BDI trajectories over afinite time-interval conditional on the observed states of the process at the end-points of the interval. Golinelli (2000) previously accomplished this task via reversible jump Markov chain Monte Carlo(MCMC).InsteadofthisMCMCprocedure,weadoptandextendHobolth(2008)’sdirectsamplerfor finite state-space CTMCs to simulate end-point conditioned BDI trajectories exactly. While the resulting MC-EM algorithm is slower than our integration method for the restricted immigration BDI model, the algorithm stillperformswellonmoderately-sized problems. WetestourtwoEMalgorithmsonsimulateddata. Wethenturntoaproblemofestimatingbirthanddeath rates of the transposable element IS6110 in Mycobacterium tuberculosis, the causative bacterial agent of mosttuberculosisinhumans. Estimatingtheseratesisanimportanttaskinmolecularepidemiology,because researchers use IS6110 genotypes to group infected individuals into epidemiological clusters (Smalletal., 1994). Rosenberg etal.(2003)useseriallysampledIS6110genotypesfromM.tuberculosisinfectedpatients to estimate IS6110 birth and death rates. These authors proposed an approximate likelihood method to accomplish this estimation. We revisit this problem using our EM algorithm and compare our results with Rosenberg etal.(2003)’sapproximation. Wealsoexamine differences inbirthanddeathratesamongthree mainlineagesofM.tuberculosis. 2 The EM Algorithm We start with a continuous-time homogeneous linear BDI process X with birth rate λ 0, death rate t { } ≥ µ 0,andimmigrationrateν 0. Weassumethatweobservetheprocess,X = 0,1,...,atn+1distinct t ≥ ≥ times,0= t < t < ... <t . WedenoteourdatavectorbyY = (X ,...,X )andtheparametervector 0 1 n t0 tn by θ = (λ,µ,ν). In addition to the full BDImodel, weconsider a restricted immigration model, in which werequireν = βλforsomeknownconstant β,andadeath-immigration model,wherewesetλ = 0. Weareinterested incomputing theMLEsoftheparameters, θˆ= argmaxθlo(Y;θ),where n−1 l (Y;θ) := logp (t t ;θ) (1) o Xti,Xti+1 i+1− i i=0 X isthe observed-data log-likelihood and pi,j(t;θ) = Pθ(Xt = j X0 = i), i,j = 0,1,..., are the transition | probabilities oftheBDIprocess. Thesetransitionprobabilities canbecalculated eitherusingthegenerating functionderivedbyKendall(1948)orviatheorthogonalpolynomialrepresentationofKarlinandMcGregor (1958). Despite the explicit algebraic nature of the orthogonal polynomials, the latter method can be nu- merically unstable and the generating function method is often preferred (Sehletal., 2009). Although one can maximize the likelihood l (Y;θ) using standard off-the-shelf optimization algorithms, such generic o algorithms can be problematic when one needs to analyze multiple data sets without manual tuning of the algorithms and whenthe BDIrates are functions of ahigh dimensional parameter vector. Asanalternative togenericoptimization,wedevelopEMalgorithms,knownfortheirrobustnessandabilitytocopewithhigh 2 dimensional optimization, to maximize the observed data likelihood under BDI models. (Dempsteretal., 1977). CompletedatainourcaseconsistoftheBDItrajectories X ,observedcontinuously duringtheinterval t { } [t ,t ]. Let l be the log-likelihood of the complete data. Toexecute an EM algorithm we need to be able 0 n c to compute Eθ′[lc( Xt ;θ)Y] (the E-step) and to maximize this expectation over θ (the M-step). We { } | develop separate algorithms for implementing these E- and M-steps for the restricted immigration/death- immigrationandthefullBDImodels,becausethetwoclassesofmodelsdifferinthewaythecompletedata collapse intosufficientstatistics. 2.1 Restricted Immigrationand Death-ImmigrationModels SincetheBDIprocessisaCTMC,thelog-likelihood ofcompletedatais ∞ ∞ l ( X ;θ) = d(i)[i(λ+µ)+ν]+ [n log(iλ+ν)+n log(iµ)]+const, (2) c t i,i+1 i,i−1 { } − i=0 i=0 X X whered(i)isthetotaltimespentinstateiandn isthenumberofjumpsfromstateitostatej during the i,j interval [t ,t ](Guttorp,1995). Replacing ν withβλintheaboveequation, wearriveatthecomplete-data 0 n log-likelihood fortherestricted immigrationmodel, l ( X ;λ,µ) = R (λ+µ) t βλ+N+logλ+N−logµ+const, (3) c { t} − tn − n tn tn where the number of jumps up N+ := n , the number of jumps down N− := n , and tn i≥0 i,i+1 tn i≥0 i,i−1 the total particle-time R := tnX ds = ∞ id(i) are sufficient statistics. Similarly, if we set λ = 0 tn t0 s P i=0 P (death-immigration model),then R P l ( X ;µ,ν) = R µ t ν +N+logν +N−logµ+const, (4) c { t} − tn − n tn tn Equations (3) and (4) show that, for the E-step, the only expectations we need are Uθ,Y := Eθ Nt+n|Y , Dθ,Y := Eθ Nt−n|Y ,andPθ,Y := Eθ[Rtn|Y]. UsingtheMarkovpropertyandadditivityofexp(cid:2)ectations(cid:3), we break these expectations into sums of expectations of the numbers of jumps up and down and the total (cid:2) (cid:3) particle time during each time interval [t ,t ], conditional on X and X . Homogeneity of the BDI k k+1 tk tk+1 modelssuggests thatinordertocompletetheE-stepoftheEMalgorithm, weneedtobeabletocalculate U (t) = E N+ X = i,X = j , i,j t | 0 t D (t) = E N− X = i,X = j , and (5) i,j (cid:0) t | 0 t (cid:1) P (t) = E(R X = i,X =j). i,j (cid:0) t 0 t (cid:1) | FollowingMininandSuchard(2008),wechoosetoworkwithrestricted moments U˜ (t) = E N+1 X = i , i,j t {Xt=j} | 0 D˜i,j(t) = E(cid:0)Nt−1{Xt=j} | X0 = i(cid:1), and (6) P˜i,j(t) = E(cid:0)Rt1{Xt=j} |X0 = i (cid:1), (cid:0) (cid:1) that we can divide by transition probabilities p (t) to recover the conditional expectations (5). In order to ij computetherestricted moments,wefirstconsider thejointgenerating function H (u,v,w,s,t) := E uN+vN−e−wRsXt X = i , (7) i 0 | (cid:16) (cid:17) 3 where0 u,v,s 1andw 0. Partialderivatives ofthisfunction, ≤ ≤ ≥ ∞ ∞ ∞ ∂H (u,1,0,s,t) G+(t,s) = i = sj nPr(N+ = n,X = j) = U˜ (t)sj, i ∂u u=1 t t i,j (cid:12) Xj=0 nX=0 Xj=0 (cid:12) ∞ ∞ ∞ ∂H (1,v,0,s,t)(cid:12) G−i (t,s) = i ∂v v=1 = sj nPr(Nt− = n,Xt = j) = D˜i,j(t)sj, and (8) (cid:12) Xj=0 nX=0 Xj=0 ∂H (1,1,w,s,t)(cid:12)(cid:12) ∞ ∞ ∞ G∗(t,s) = i = sj xdPr(R x,X = j) = P˜ (t)sj i ∂w (cid:12)w=0 −Xj=0 Z0 t ≤ t −Xj=0 i,j (cid:12) (cid:12) arepower series withcoefficients U˜ (t), D˜ (t), and P˜ (t)respectively. Therefore, ifwecan compute i,j i,j i,j − G+(t,s),G−(t,s),andG∗(t,s)foreverypossibletands,thenweshouldbeabletorecovercoefficientsof i i i thecorresponding powerseriesviadifferentiation orintegration. Numericalevaluationofpartialderivatives (8)isstraightforward ifwecancomputefinitedifferences ofH (u,v,w,s,t). Remarkably, H (u,v,w,s,t) i i isavailable inclosed form, aswedemonstrate inthetheorem below, so onecan evenobtain derivatives (8) analytically. Theorem1. Let X bealinearBDIprocesswithparametersλ 0,µ 0,andν 0. Overtheinterval t { } ≥ ≥ ≥ [0,t], let N+ be the number of jumps up, N− be the number of jumps down, and R be the total particle- t t t time. Then Hi(u,v,w,s,t) = E uNt+vNt−e−wRtsXt X0 = i satisfies the following partial differential | equation: (cid:16) (cid:17) ∂ ∂ H = s2uλ (λ+µ+w)s+vµ H +ν(us 1)H , (9) i i i ∂t − ∂s − subject to initial condition H ((cid:2)u,v,w,s,0) = si. The Cau(cid:3)chy problem defined by equation (9) and the i initialcondition hasaunique solution. Whenλ > 0,thesolution is H (u,v,w,s,t) = α1−α2ss−−αα21e−λ(α2−α1)rt i α1−α2 λν e−ν(1−uα1)t, i 1− ss−−αa21e−λ(α2−α1)rt ! (cid:18)s−α2−(s−α1)e−λ(α2−α1)rt(cid:19) (10) λ+µ+w∓√(λ+µ+w)2−4λµuv whereα = fori= 1,2. Whenλ = 0,thesolution is i 2λu Hi(u,v,w,s,t) = se−(µ+w)t vµ e−(µ+w)t −1 ieνu[vµ−(µ+w(µ)s+](we)−2(µ+w)t−1)+ν(cid:16)µu+vµw−1(cid:17)t. − µ+w (cid:0) (cid:1)! Proof. Ourproof,detailedinAppendixA,isageneralization ofKendall(1948)’sderivationofthegenerat- ingfunction ofX . t Having H in closed form gives us access to functions G+, G−, and G∗, so we are left with the task i i i i of recovering coefficients of these power series. One way to accomplish this task is to differentiate the powerseries repeatedly, e.g. U˜ (t) = 1 ∂jG+i (s,t) . InAppendix C,wedemonstrate that forthe death- i,j j! ∂sj s=0 immigration model and Holmes(2005)’srestricted(cid:12)BDImodel, these derivatives can befound analytically. (cid:12) In general, repeated differentiation of G+, G−, and(cid:12) G∗ needs to be done numerically, making this method i i i impractical. Instead, we extend G+(t, ), G−(t, ), and G∗(t, ) to the boundary of a unit circle in the i · i · i · complexplanebythechangeofvariabless = e2πiz (iinthiscontextistheimaginarynumber√ 1,notthe − initialstateoftheBDIprocess). Forexample, ∞ G+(t,e2πiz) = U˜ (t)e2πijz l l,j j=0 X 4 isaperiodicfunctioninz,whichmeansthatU˜ (t)areFouriercoefficientsofthisperiodicfunction. There- l,j fore,wecanusetheRiemannapproximation totheFouriertransform integraltoobtain 1 1 K−1 U˜ (t) = G+ t,e2πis e−2πibsds G+ t,e2πik/K e−2πibk/K, l,j l ≈ K l Z0 (cid:0) (cid:1) Xk=0 (cid:16) (cid:17) for some suitably large K. The Fast Fourier Transform (FFT) (Henrici, 1979) can be applied to compute quickly multiple Fourier coefficients (Lange, 1982; Dormanetal., 2004; Suchardetal., 2008). We do not, however,useFFTinouralgorithm,becauseforaparticulartimeintervallengtht,wealmostalwaysneedto computeU˜ (t),D˜ (t),P˜ (t)foronlyonevalueofj. i,j i,j i,j TocompletetheM-stepatthekthiterationoftheEMalgorithm,wedifferentiateequation(3)withrespect toλandµtoobtain µˆ = Dθk,Y andλˆ = Uθk,Y k+1 k+1 Pθk,Y Pθk,Y +βtn fortherestricted immigrationmodel. Similarly,weuseequation(4)tofind µˆ = Dθk,Y andνˆ = Uθk,Y k+1 k+1 Pθk,Y tn forthedeath-immigration model. Weobtaintheobserved Fisherinformation viaLouis(1982)’sformula: .. IˆY(θˆ)= Eθˆ lc( Xt ;θˆ)Y Eθˆ l˙c( Xt ;θˆ)l˙c( Xt ;θˆ)′ Y , − { } | − { } { } | . .. h i h i where l is the gradient and l is the Hessian of the complete-data log-likelihood. This requires the cal- c c culation of the conditional cross-product means, E N+N− Y , E N+R Y , E N−R Y , and the t T | t T| t T| conditional second moments of N+,N−, andR . Thederivation ofthe information in termsofthese mo- T T T (cid:2) (cid:3) (cid:2) (cid:3) (cid:2) (cid:3) ments is in Appendix B. These conditional second- and cross-moments, as well as, PY and DY, can be computedinanalogous fashion toUY above,usingthejointgenerating function (10). 2.2 Full BDI Model Thecomplete-data log-likelihood ofthefullBDImodel,uptoanadditiveconstant, is ∞ l ( X ;θ) = [R (λ+µ)+t ν]+ n log(iλ+ν)+log(µ)N−. (11) c { t} − tn n i,i+1 tn i=0 X Thus the sufficient statistics are n , N−, and R . The technique we developed for the restricted { i,i+1}i≥0 tn tn immigrationcasedoesnotapplytoE(n Y)foralli 0,becausewecannotderivethejointgenerating i,i+1 | ≥ functionforstate-specificjumpsup,n fori 0,inclosedform. Moreover,evenifcouldcalculatethese i,i+1 ≥ expectations, the problem of evaluating the infinite sum in (11) would remain. Therefore, in the E-step of the EM algorithm, we resort to Monte Carlo to compute the expected complete-data log-likelihood for the full BDI model. We apply the ascent-based MC-EM method of Caffoetal. (2005), which dynamically adjusts the number of Monte Carlo simulations to guarantee the ascent property of the EM algorithm. In order to approximate expectations of the sufficient statistics via Monte Carlo, we adapt Hobolth (2008)’s direct sampling method for finite state space CTMCs conditioned on end points to simulating full linear BDIprocess trajectories conditioned onendpoints. Hobolth (2008)’s direct sampling is a recursive algorithm that generates realizations of X T , condi- { t}t=0 T tional onX = aandX = b. Letp = f (t)dtbetheprobability thatthere isajumptostateibefore 0 T i 0 i timeT,where R f (t)= e−λatλ p (T t)/p (T), (12) i a,i i,b a,b − 5 i is a+1 or a 1, λ = λa+ν, and λ = µa. We start the algorithm by using the probabilities a,a+1 a,a−1 − (1 p p ,p ,p ) to randomly decide whether not to jump, to jump up, or to jump down. If a+1 a−1 a+1 a−1 − − there are no jumps, then necessarily we started with a = b and the algorithm ends; if there is ajump up or down we use the inverse cumulative distribution function (CDF) method to simulate the time at which the t jumpoccurs using CDFsF (t) = f (x)/p dx,fori = a+1,a 1. Wecompute and invert these CDFs i 0 i i − numerically usingquicktocomputetransition probability formulaefromKarlinandMcGregor(1958). Af- R ter simulating the firstjump time, τ, over the interval [0,T], we repeat this process by setting X = i, and 0 T to T τ until T τ < 0 when the algorithm terminates. Because we can simulate very rapidly from − − theBDIprocess withoutconditioning ontheending state, accept-reject sampling alsoprovides aneffective methodofsimulatingtheconditionalBDIprocesswhentheoutcomesweconditiononaremoderatelylikely. When they are very rare, accept-reject sampling can be exceedingly slow, and Hobolth (2008)’s method is necessary. IntheM-stepoftheEMalgorithm,thecomplete-datalog-likelihood (11)canbewrittenasl ( X ;θ) = c t { } l ( X ;µ)+l ( X ;λ,ν),,makingitpossibletomaximizeseparatelyoverµandover(λ,ν). Maximiz- 1 t 2 t { } { } ingoverµ,weget µˆ = Uθk,Y. k+1 Pθk,Y Maximizing function (11) over (λ,ν) is more difficult, as there is no obvious analytic solution. First, we convertthis2-dimensionaloptimizationproblemintoa1-dimensionalonebynoticingthatsettingthederiva- tiveswithrespecttoλandν to0andthensummingthetwoequations together yields −Pθk,Yλˆ−Tνˆ+Uθk,Y = 0. (13) Wecanthenwrite,forinstance,ν(λ) = Uθk,Y−Pθk,Yλ,plugthisexpressionintothelikelihoodfunction(11) T and apply the Newton-Raphson algorithm to maximize this function with respect to λ. Using the optimal value λˆ, we recover νˆ = ν(λˆ). Because our Monte Carlo E-step takes most of the computing time, we allow Newton-Raphson to take as many iterations as it needs to converge under a prespecified tolerance; Newton-Raphson algorithm generally onlytakesbetween2and5iterations toconverge inourexperience. 3 Results 3.1 Simulations To test our methods, we simulate data from a restricted immigration BDI model with λ = .07, µ = .12 andβ = 1.2. Theparameters arechosentoresemble, butnotexactlymatch,thedynamicsofourbiological example, discussed in the next subsection. We simulate 100 independent processes starting from initial statesdrawnuniformlybetween1and15. Fromeachprocesswecollectatleasttwoobservations. Weplace observation times uniformly between 0 and 30. Table 1 gives some summary statistics for the simulated data. First,weassumetherestrictedimmigrationmodelandapplyourEMalgorithmforthismodelwithinitial parameter values of 0.2 for both λ and µ. We tested other choices of starting values, but the algorithm was not sensitive to them. Next, we pretend that we do not know that the data were generated under the restricted immigration modelandfitthefullBDImodeltothesimulateddatausingourMC-EMalgorithm, starting each parameter, λ, µ, and ν, at 0.2. The results of fitting these two models are shown in Table 2. As expected, estimation is more precise for the restricted immigration model. In the full BDI model, ν is themostdifficulttoidentifyunlesstherearemanyobservations startingverycloseto0,sothevarianceofνˆ tendstobelarge. 6 Value SimulatedData IS6110Data NumberofIntervals 387 252 AverageIntervalLength 5 0.35 NumberofIndividuals 100 196 NumberofIntervalswithanIncrease 78 14 AverageIncrease givenanIncrease 1.5 1 NumberofIntervalswithaDecrease 190 14 AverageDecreasegivenaDecrease 2.5 1.2 NumberofIntervalswithNoChange 119 224 MeanStartingState 5.5 11 StandardDeviationofStartingState 3.8 5.3 TotalLengthofTime 1947 89 Table1: SummarystatisticsforthesimulatedandM.tuberculosisIS6110data. λ µ ν Restricted Immigration 0.067(0.052,0.081) 0.12(0.10,0.14) | FullModel 0.057(0.039,0.074) 0.12(0.099,0.14) 0.11(0.058,0.16) TrueValue 0.07 0.12 0.084 Table 2: Results of EM algorithmsappliedto simulateddata underthe restrictedimmigrationandfullBDI models. Reportedvaluesaremaximumlikelihoodestimatesand95%confidenceintervals. 3.2 Comparisonwiththe Frequent Monitoring Method We compare our EM algorithm for computing the actual MLE to the frequent monitoring (FM) method of Rosenberg etal. (2003) for computing the MLE of an approximate likelihood. In the FM method, Rosenberg etal. (2003) assume that if the starting and ending values of the birth-death process are equal foraparticular interval, thennojumpsoccurred inthisinterval. Further, ifthedifference betweenthestart- ing and ending values is 1or 1, then exactly one jump up or exactly one jump down must have occurred − respectively. Theauthorsexcludeallobservedintervals,forwhichstartingandendingvaluesdifferbymore thanoneunit. Letibethestartingstateforaninterval,tthelengthoftheinterval,andλ = i(λ+µ). Then i thecorresponding probabilities forthethree possible eventsaree−λiu, iλ 1 e−λiu ,and iµ 1 e−λiu λi − λi − respectively. Rosenberg etal. (2003) use this FM method to estimate rates in what is effectively a multi- (cid:0) (cid:1) (cid:0) (cid:1) state branching process, but we will compare the two methods on our restricted immigration BDI model with immigration rate constrained to be 0. We again simulate an underlying BDI process using λ = 0.07 and µ = 0.12. To compare the two methods, we generate three different sets of data. In each set, wegen- erate observed states of the BDI process at a fixed constant distance dt apart. This distance varies across thedatasets, taking thevalues .2,.4, and.6,respectively. Werepeat thisprocedure 200timesandcompute birthanddeathrateestimates andcorresponding 95%confidence intervalsusingtheEMalgorithm andFM approximation method. Weshow boxplotsoftheresulting estimates forλandµinFigure1. Asexpected, the FM estimates behave reasonably when interval lengths are small, but the approximation becomes poor asweincrease theinterval length. TheFMmethod alwaysunderestimates theparameters since themethod effectivelyundercountsthenumberofunobservedjumpsintheBDIprocess. WealsocomputeMonteCarlo estimates of coverage probabilities of the two methods, shown above the box plots in Figure 1. Not sur- prisingly, coverage of the 95% confidence intervals computed under the proper BDI model likelihood are veryclosetothepromisedvalueof0.95. Incontrast,theFMapproximation-based 95%confidenceintervals containthetrueparametervaluelessthan95%forallthreesimulation scenarios. 7 Birth Rate l Death Rate m 6 1 0.96 0.82 0.94 0.6 0.95 0.44 . 0.94 0.92 0.94 0.85 0.94 0.78 9 0 0 . 0 4 1 . 0 7 0 e . 2 at 0 1 R 0. 5 0 0 . 1 0 . 0 EM 3 FM 8 0 0 . . 0 0 0.2 0.4 0.6 0.2 0.4 0.6 Time Interval Length Time Interval Length Figure1: Boxplotsofbirth(leftpanel)anddeath(rightpanel)rateestimates,obtainedfrom200simulateddatasets usingtheEMalgorithmandfrequentmonitoring(FM)method.Thetrueparametervalues,usedindatasimulations,are markedbythehorizontaldashedlines. Abovetheboxplots,weshowMonteCarloestimatesofcoverageprobabilities thatthe95%confidenceintervalsattain. 3.3 MycobacteriumtuberculosisIS6110Transposon Weapply our restricted immigration EMalgorithm to estimation of birth and death rates of the transposon IS6110inM.tuberculosis(McEvoyetal.,2007). Atransposonortransposableelementisageneticsequence thatcanduplicate,removeitself,andjumptoanewlocationinthegenome. IS6110isatransposonthatplays an important role in epidemiological studies of tuberculosis. More specifically, the number and locations of IS6110 elements in the M. tuberculosis form a genetic signature or genotype of the mycobacterium, allowingepidemiologists todrawinferenceaboutdiseasetransmissionwhenthesamegenotypeisobserved amongpatientswithactivetuberculosis(vanEmbdenetal.,1993). Suchgenotypiccomparisoncantranslate into meaningful epidemiological inference only if the dynamics of IS6110 evolution are well understood. Therefore, accurate estimation of rates of changes of IS6110-based genotypes is critical for using these genotypes inepidemiological studies (TanakaandRosenberg,2001). Weanalyze data from an ongoing population-based study that includes all tuberculosis cases reported to the San Francisco Department of Public Health (Cattamanchietal., 2006). Our data include patients with more than one M. tuberculosis isolate from specimens sampled more than 10 days apart and genotyped withIS6110restrictionfragmentlengthpolymorphism. WeignoregenomiclocationsofIS6110andassume that the transposon counts are discretely observed realizations of a BDI process, with no immigration; in particular, we assume that the patient is not reinfected with a different strain of the bacteria in the period betweenobservations. Table1givessummarystatisticsforthedata. We plot birth and death rate estimates and their 95% confidence intervals, obtained with our EM algo- rithm,inFigure2(verticalbarslabeled“All”). ThestartingvaluesfortheEMdonotaffecttheresults. Inthe analysispresented,theEMalgorithmwasstartedwithparameterguessesof.05and.05forλandµ,respec- tively,andtheirMLEswere0.027and0.031,respectively. Ourestimateforλisconsistentwiththeestimate 0.0188 0.0103 from Rosenberg etal. (2003). Although the authors’ credible interval forµ overlaps with ± 8 5 0 Birth Rate l . 0 Death Rate m 3 e 0 at 0. R 0 0 . 0 ALL EU EA IND ALL EU EA IND Lineages Figure 2: Pointestimates and 95%confidenceintervalsfor birthand death rate of the IS6110transposableelement obtainedfromallindividuals(ALL)andobtainedbyseparatelyanalyzingthreeM.tuberculosislineages: European- American(EU),Indo-Oceanic(IND),andEastAsian(EA). ours, our estimate for µ is noticeably higher than Rosenberg etal. (2003)’s estimate of 0.0147 0.00906. ± Note from Table 1 that among the intervals with a decrease, the average count drop was by more than 1; therewere3intervalswhereIS6110countsdroppedby2,whereastherewerenointervalthatexperiencedan increasebymorethan1. ThuswewouldexpectourestimateforµtoincreaseoverRosenberg etal.(2003)’s approximation, whereas that of λ should be similar between the two methods. We also point out that we analyze an updated version of the data analyzed by Rosenberg etal. (2003). Moreover, Rosenberg etal. (2003)useaslightlymorecomplicatedmodelforIS6110evolution,whichtakesintoaccountshiftsintrans- posonlocation. Giventhesedifferencesinthedataandthemethods,consistencyofourandRosenberg etal. (2003)’sestimatesiscomforting. 3.3.1 Mycobacterium tuberculosis LineageComparison Inadditiontoestimationoftheglobalbirthanddeathrates,weseparately estimatetheseparametersineach of the three lineages of M.tuberculosis observed inSan Francisco. Based on genomic sequence similarity, M.tuberculosis isdividedintosixmainlineages: Euro-American, East-Asian,Indo-Oceanic, East-African- Indian, West-African I and West-African II (Gagneuxetal., 2006). In our lineage-specific analysis, we consider 109 individuals infected with Euro-American lineage strains, 54 individuals infected with East- Asian lineage strains, and 25 individuals infected with Indo-Oceanic lineage strains. The M. tuberculosis lineage-specific estimates and confidence intervals are plotted in Figure 2. Most notably, there appears to beasubstantial difference betweendeathratesoftheEuro-American andEast-Asianlineages. Sincethisis anovelresultthathasimplications formonitoring tuberculosis withmolecular genotyping, weexaminethe difference indeathratesbetweenlineages moreclosely. Thenumber of IS6110 elements isapotential confounder inour analysis, because patients infected with Euro-American and East-Asian differ drastically in the number of IS6110 elements at the beginning of the observation period. Theisolates fromthe Euro-American lineage havebetween 2and17IS6110 elements, with 41 out of 109 patients having the first recorded IS6110 count less than 6, while IS6110 counts vary between6and22fortheEast-Asianisolates. Warrenetal.(2002)suggestthatIS6110genotypeswithfewer 9 X < 6 X ‡ 6 X < 6 X ‡ 6 0 0 0 0 0 0 35 40 250 200 0 0 0 0 0 0 cy 25 3 2 15 n 0 eque 150 200 0015 100 r 1 F 0 0 0 1 0 5 0 5 5 0 0 0 0 47 49 51 53 150 160 170 180 9 10 11 12 13 40 50 60 Number of Intervals Number of Intervals Sum of Interval Lengths Sum of Interval Lengths Figure3: Lowvshighcountgenotypeanalysis. Histogramsofsimulatednumbersofintervalsandsumsofinterval lengthsareplottedforintervalswithstartingvalueslessthansixandgreaterorequaltosix. Theverticaldashedlines indicatetheobservedvaluesofthefourstatistics. thansixelementshaveverylowrateofchange,becauseintheirdata,caseswithnoobservedchangesinthe genotype aredominated bysuchlow-count genotypes. However,according toourlinearbirth-death model, Warrenetal. (2002)’s observation of low-count genotypes evolving slower than high-count genotypes is nothing but expected. To demonstrate this, we simulate 1000 datasets using our global birth and death rates and observed initial IS6110 counts for each patient. We record the number of intervals with equal starting and ending values less than six, n , and equal starting and ending values greater or equal to 0,<6 six, n . We also recorded the length sum of both kinds of intervals: t and t . In our data, 0,≥6 0,<6 0,≥6 nobs = 53andnobs = 171withnobs /tobs = 4.6 > 2.8 = nobs /tobs ,inagreementwithWarrenetal. 0,<6 0,≥6 0,<6 0,<6 0,≥6 0,≥6 (2002)’sanalysis. Histogramsofsimulatedvaluesofthefourstatistics,n ,n ,t ,andt ,shown 0,<6 0,≥6 0,<6 0,≥6 in Figure 3, demonstrate that our birth-death model replicates well the observed dynamics of low-count and high-count IS6110 genotypes. We conclude that our data do not provide evidence that evolutionary dynamics of low-count genotypes differ from high-count genotype dynamics. Therefore, it is unlikely that our estimated discrepancy between death rates of Euro-American and East-Asian M. tuberculosis lineages iscausedbyhighpercentage oflow-countgenotypes intheEuro-Americanlineage isolates. 4 Discussion We propose two implementations of an EM algorithm for maximum likelihood estimation of discretely observed BDI processes. When the birth rate is zero or the immigration rate is zero or proportional to the birth rate, we show that the E-step of the EM algorithm can be reduced to computing a small number of one-dimensional Fourier transform integrals. This makes our method more efficient than Holmes (2005)’s strategy, which involves finding numerical solutions of non-linear ODEs. Moreover, in Appendix C, we demonstratethatfortheHolmes(2005)’sEMalgorithmandforthedeath-immigrationmodel,ourgenerating function method yields analytic formulae for the expected complete-data log-likelihood. Therefore, for theseclassesofmodels,ourEMalgorithmisexact,meaningthatneitherE-stepnorM-stepofthealgorithm requires numericalapproximations. Oursecond EMalgorithm uses exactsampling ofend-point conditioned BDItrajectories. Thissampling algorithm is a direct extension of Hobolth (2008)’s algorithm. The key difference is that Hobolth (2008) worked with finite state-space CTMCs with diagonalizable infinitesimal generators. Assuming that the spectral decomposition of the generator is available, the author was able to obtain analytic expressions for 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.