StatisticalScience 2010,Vol.25,No.3,388–407 DOI:10.1214/10-STS339 (cid:13)c InstituteofMathematicalStatistics,2010 Laplace Approximated EM Microarray Analysis: An Empirical Bayes Approach for Comparative Microarray Experiments Haim Bar, James Booth, Elizabeth Schifano and Martin T. Wells 1 1 0 2 Abstract. A two-groups mixed-effects model for the comparison of (normalized) microarray data fromtwotreatment groupsis considered. n a Most competing parametric methods that have appeared in the litera- J ture are obtained as special cases or by minor modification of the pro- 5 posedmodel.Approximate maximum likelihood fittingis accomplished ] via a fast and scalable algorithm, which we call LEMMA (Laplace ap- E proximated EM Microarray Analysis). Theposterior oddsof treatment M gene interactions, derived from the model, involve shrinkage esti- × . mates of both the interactions and of the gene specific error variances. t a Genes are classified as being associated with treatment based on the t s posterior odds and the local false discovery rate (f.d.r.) with a fixed [ cutoff. Our model-based approach also allows one to declare the non- 1 null status of a gene by controlling the false discovery rate (FDR). It v is shown in a detailed simulation study that the approach outperforms 5 0 well-known competitors. We also apply the proposed methodology to 9 two previously analyzed microarray examples. Extensions of the pro- 0 posed method to paired treatments and multiple treatments are also . 1 discussed. 0 Key words and phrases: EM algorithm, empirical Bayes, Laplace ap- 1 1 proximation, LEMMA, LIMMA, linear mixed models, local false dis- v: covery rate, microarray analysis, mixture model, two-groups model. i X r 1. INTRODUCTION ingonagene-by-gene basis,microarray technologies a allow scientists to view the expression of thousands Microarraytechnologieshavebecomeamajordata of genes from an experimental sample simultane- generator inthe post-genomics era. Instead of work- ously. Due to the cost, it is common that thousands of genes are measured with a small number of repli- Haim Bar is Ph.D. candidate, Department of Statistical cations,asaconsequence,onefacesa large G,small Science, Cornell University, Ithaca, New York, USA n problem,whereGisthetotalnumberofgenesand (e-mail: [email protected]). James G. Booth is n is the number of replications. After preprocess- Professor, Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New ing of the raw image data, the expression levels are York, USA (e-mail: [email protected]). Elizabeth often assumed to follow a two-groups model, that Schifano is Postdoctoral Fellow, Department of is, the expressions are each either null or non-null Biostatistics, Harvard School of Public Health, Cambridge, Massachusetts, USA (e-mail: [email protected]). Martin T. Wells is This is an electronic reprint of the original article Professor, Department of Statistical Science, Cornell published by the Institute of Mathematical Statistics in University, Ithaca, New York, USA (e-mail: Statistical Science, 2010, Vol. 25, No. 3, 388–407. This [email protected]). reprint differs from the original in pagination and typographic detail. 1 2 BAR,BOOTH, SCHIFANO ANDWELLS withpriorprobability p or p =1 p ,respectively. (2001) for detecting changes of gene expression in 0 1 0 − The two-groups model plays an important role in a two-channel cDNA microarray experiment. This the Bayesian microarray literature and is broadly was extended to replicate chips with multiple condi- applicable (Efron, 2008). tions using a hierarchical lognormal–normal model A general review of issues pertaining to microar- in Kendziorski et al. (2003). A key difference be- ray dataanalysis isprovided inAllison et al.(2006). tweenthesemodelsandthosediscussedaboveisthat Here,wefocusonstatistical inferenceand,inpartic- they effectively induce shrinkage in the mean effects ular, on what Allison et al. (2006) refer to as “con- (i.e., the numerator of the t-statistic), while assum- sensus points 2 and 3”: the advantages of shrinkage ing homogeneous variability across genes. methods, and controlling the false discovery rate. Instead of directly modeling the variation of the We review several inferential methods, and develop expression data, two-groups models are character- a unifying linear model approach. izedbymixingmeasurementsoverlatentgene-specific Classical parametric statistics do not provide a indicators. Lonnstedt and Speed (2002) used this reliable methodology for determining differentially approach to derive the so-called B-statistic as the expressed genes. The large number of genes with logarithm of the posterior odds of differential ex- relatively few replications in typical microarray ex- pression. Smyth (2004) extended the B-statistic to periments yield variance estimates of the expres- thelinear models setting and has written thewidely sion levels that are often unreliable. The classical used limma R package (R Development Core Team, t-test and F-test are generated under a heteroge- 2007). Smyth (2004) also shows that the B-statistic neous error variance model assumption and do not is a monotone function of a t-statistic with a regu- enjoytheadvantage gainedbyshrinkageestimation. larized variance which he refers to as a moderated The assumption that the variances are equal across t-statistic. Wright and Simon (2003) and Cui et al. all genes is typically not realistic. Hypothesis tests (2005)derivesimilarlymoderatedstatistics,andCui basedonapooledcommonvarianceestimatorforall et al. (2005) showed that their proposed test, us- genes have low power and can result in misleading ing a James–Stein type variance estimator, had the differential expression results (Wright and Simon, best or nearly the best power for detecting differen- 2003; Smyth, 2004; Cui et al., 2005). tiallyexpressedgenesoverawiderangeofsituations An important observation is that, although there compared to a number of existing alternative proce- are only a few replications for each gene, the total dures. number of measurements is very large. If informa- SincetheperformanceoftheF-typeteststatistics tion is combined across the genes (i.e., genome-wide arising from models with a random gene-specific er- shrinkage), it is possible to construct test proce- ror variance (leading to shrinkage estimates of the dures that have improved performance. The SAM error variances) is better than in the case where the test (Tusher, Tibshirani and Chu, 2001) and a reg- variances are fixed,why only model thevariances as ularized t-test in Efron et al. (2001) first used infor- random but not the means? In effect, the approach mation across thegenome-wide expressionvaluesby of Lonnstedt and Speed (2002), and its extension the addition of a data-based constant to the gene- in Smyth (2004), already do this by treating both specific standard errors. the gene-specific mean effects and error variances as The Bayesian approach seems to be particularly random. These models have been further general- well suited for combining information in expression ized by Tai and Speed (2006, 2009) to the multi- data. Hierarchical Bayesian models have also been variate setting to handle, for example, short time- used for variance regularization by estimating mod- seriesofmicroarrays.Theseauthorscoinedtheterm erated variances of individual genes. The estimated “fully moderated” for such models. However, as we variances are calculated as weighted averages of the point out later is Section 4, the specific distribu- gene-specific sample variances and pooled variances tional assumptionsmadeinthesemodelsimply that across all genes. In particular, the regularized t-test the shrinkage factor for the mean effects is the same proposed by Baldi and Long (2001) uses a hier- for all genes, resulting in performance equivalent to archical model and substitutes an empirical Bayes the ordinary moderated-t. variance estimator based on a prior distribution in Hwang and Liu (2010) proposed an alternative place of the usual variance estimate. Another hier- empirical Bayes approach which shrinks both the archical approach was developed in Newton et al. meansandvariancesdifferentially(seealsoLiu,2006). 3 LAPLACE APPROXIMATEDEM MICROARRAYANALYSIS Their simulation studies indicate that their fully how previously proposed statistics fall into the six moderated procedure is more powerful than all the modelcategories.NotethattheRRcategoryalsoin- othertestsexistingintheliterature.TheHwangand cludes the LIMMA model. However, inference with Liu (2010) procedure uses method of moments esti- the B-statistic of Lonnstedt and Speed (2002) and mators of some model parameters rather than max- Smyth (2004) results in a shrinkage factor for the imum likelihood. The advantage of our EM fitting mean effects which is the same for all genes. Conse- algorithm is that it is easily extended to more gen- quently, LIMMA is therefore similar to an FR-type eralmodels,forexample,includingcovariates,orthe model in terms of frequentist performance since the threegroupsmixturemodeldiscussedinSection3.4. posterior odds are monotone in the moderated t- Still, their approach provided the key insight that statistic. motivated the model formulation and subsequent Inthispaperwepresentaunifiedmodelingframe- computational algorithm described in this article. work forempirical Bayes inferencein microarray ex- The development of the empirical Bayes method- periments together with a simple and fast EM algo- ologies that improve the power to detect differen- rithm for estimation of the model parameters. We tiallyexpressedgenesessentiallyreducestothechoice focus on a simple two-condition experimental setup, ofwhethergene-specificeffectsshouldbemodeledas but the ANOVA formulation we posit in the next fixed or random. This question applies to effects on section allows for easy generalization to more than both the mean and the error variance. Thus, there two conditions and comparisons based on a single are four combinations of fixed and random factors sample of two channel arrays such as the more gen- leading to four models which we denote by FF, RF, eral designs in Kerr, Martin and Churchill (2000) FR and RR, where the first letter identifies whether and Smyth (2004). The methods of this article can, themean effects arefixed or random and thesecond in principle, also be extended to a multivariate em- letter doesthesamefortheerrorvariances.Twoad- pirical Bayes model, for example, to analyze short ditional models, denoted FH and RH, are obtained time-coursedataasintheextensionoftheB-statistic if the error variances are assumed to be homoge- by Tai and Speed (2006, 2009), or to multiple ar- neous across genes. The FF category correspondsto ray platforms as is used in epigenomic data analysis the naive approach of applying t- or F-tests to each (Figueroa et al., 2008). gene separately. TheFR category includes the mod- WeapplyanapproximateEMalgorithmforfitting elsinWrightandSimon(2003)andCuietal.(2005). the proposed model, with the latent null/non-null Thegamma–gamma andlog-normal–normalmodels status of each gene playing the role of missing data. ofNewtonetal.(2001)andKendziorskietal.(2003) The integral needed to evaluate the complete data areoftheRHtype.Theapproach of HwangandLiu likelihood makes direct application of the EM algo- (2010) falls in the RR category. Table 1 summarizes rithm intractable. However, a simple and accurate Table 1 Models corresponding to combinations of fixed and random factors Meaneffect Error variance Methods Fixed Fixed (Heterogenous) t-test/F-test Fixed Fixed (Homogenous) F in Cui and Churchill (2003) 3 Fixed Random Wright and Simon (2003), Cui et al. (2005), Lonnstedt and Speed (2002), Smyth(2004) Random Fixed (Heterogenous) Random Fixed (Homogenous) Newton et al. (2001), Kendziorski et al. (2003) Random Random F in Hwang and Liu (2010), SS Lonnstedt,Rimini and Nilsson (2005), Tai and Speed (2009), Lonnstedt and Speed (2002), Smyth(2004) 4 BAR,BOOTH, SCHIFANO ANDWELLS approximation is obtained via the Laplace approx- 2. MODEL AND NOTATION imation (de Bruijn, 1981, Chapter 4; Butler, 2007, Let y denote the response (e.g., log expression ijg page 42). This approximation makes the EM algo- ratio) of gene g, for subject (replicate) j, in treat- rithmscalable, tractable, andextremely fast.Imple- mentgroupi=1,2.Webeginwiththelinearmodel, mentation of Bayesian microarray models typically (1) y =µ+τ +γ +ψ +ε , involves drawing MCMC samples from the poste- ijg i g ig ijg rior distribution of effects from all genes. MCMC with a typical assumption concerning the errors be- sampling provides a mechanism to study the full ing Bayesian posterior distribution. However, there is a (2) ε i.i.d. N(0,σ2 ) heavycomputationalburdenthatmakestheMCMC ijg ∼ ε,g implementationlessattractive. TheLaplaceapprox- for j = 1,...,n , independently across genes and ig imationcircumventsthegenerationofthethousands treatmentgroups.Weimposetheidentifiability con- ofgeneeffectparametersandgivesahighlyaccurate straints, τ +τ =0 and ψ +ψ =0 for all g = 1 2 1g 2g approximation to the integral in the expression of 1,...,G.Then τ =τ τ is themain effect of treat- 1 2 − the complete data likelihood. The Laplace approx- ment, averaged across genes, and ψg =ψ1g ψ2g, − imated EM algorithm based analysis is the inspi- g=1,...,G, are the gene specific treatment effects. ration of the acronym LEMMA (Laplace approxi- Note that we do not assume that the mean treat- mated EM MicroarrayAnalysis) for the contributed ment effect is zero. While assuming τ =0 is often R package, lemma (Bar and Schifano, 2009), which reasonablewhenperformingdifferentialgeneexpres- sion analysis on large microarray data sets, we find implements the methodology described in this pa- this to be not only an unnecessary constraint, but per. also unrealistic in certain situations. For example, The paper is organized as follows. In Section 2 when a data set consists mostly of genes that are we introduce the necessary notation for our two- known to be differentially expressed, or when com- groupsmodelalongwiththepriordistributionspec- paringexpressionlevelsacrossspecies(where“treat- ifications. Section 3 describes the approximate EM ment”isinterpretedas“species”),thereisnoreason algorithm for fitting the RR model. We also pro- to assume that the overall mean difference between poseageneralizationoftheLIMMAmodelandshow the two treatment groups is zero. howtheEMalgorithm iseasily modifiedtoestimate We further suppose that the genes fall into two its parameters, and we briefly discuss extensions to groups, a null group in which ψ 0 and a non- g ≡ multipletreatments andto athree-groupsmodel.In null group in which ψ = 0. The primary goal is g 6 Section4weshowthattheposteriorprobabilitythat to classify genes as null or non-null based on the a gene is non-null is a function of a fully-moderated observed responses. A probabilistic approach is to (in the sense of Hwang and Liu, 2010) posterior t- suppose that each gene has prior probability p1 of beingnon-null(and p =1 p of beingnull)andto statistic with shrinkage in both the numerator and 0 1 − useBayesruletodeterminetheposteriorprobability the denominator. We show that our RR framework given the data; specifically, generalizes several other statistics, and describe two inferential procedures, one based on the posterior p f (y ) 1 1,g g (3) p (y )= , probability that a gene is non-null, and one which is 1,g g p f (y )+p f (y ) 0 0,g g 1 1,g g based on the null distribution and the FDR proce- where f (y ) is the probability density of the re- 1,g g dure.Section 5 gives results of a simulation studyin sponses for gene g implied by the non-null model, whichwecompare theperformanceof various meth- andf (y )isthecorrespondingquantityifthegene 0,g g ods to the “Optimal Rule” procedure based on full is in the null group. knowledgeofthetruemodelanditsparameters.Our Inpractice, of course,themixtureprobability and proposed methodology is applied to two well-known theparametersthatdeterminethenullandnon-null microarray examples: the ApoA1 data (Callow et densities have to be estimated. This estimation step al., 2000) and the Colon Cancer data (Alon et al., depends upon additional assumptions, if any, that 1999) in Section 6. We conclude the article in Sec- are made about the distribution of the responses. tion 7 with a discussion. As noted in the Introduction, a basic question is 5 LAPLACE APPROXIMATEDEM MICROARRAYANALYSIS whether gene-specific effects should be modeled as given in Lonnstedt and Speed (2002) and Smyth fixed or random, leading to the model categories (2004), where the mean of the random effects dis- we denote by FF, RF, FR and RR, and two addi- tribution is assumed to be zero. In a classical (one tional models, FH and RH, obtained when the error group) normal mixed-model, the mean of the ran- variances are assumed to be homogeneous, that is, dom effect is assumed to be zero because it is not σ2 σ2. separately identifiable from the overall mean. How- ε,g≡ ε The ANOVA model (1) together with the distri- ever, in the two-groups setting in which ψg in (1) butional assumption (2) allows us to restrict at- is modeled as a mixture, assuming ψ = 0 in (4) 6 tention to the sum and difference of gene-specific poses nosuch identifiability problems.Furthermore, treatment means, respectively, sg =y¯1·g +y¯2·g and this additional parameter allows for two useful and dg =y¯1·g y¯2·g, and the gene-specific mean squared important extensions of the model: (a) to paired − errors, (within-group) analyses, and (b) to three-groups al- lowing for over- and under-expressed non-null sta- 2 nig tus. These extensions are described in more detail mg = (yijg−y¯i·g)2/fg, in Section 3.4. i=1 j=1 XX where f =n +n 2. Notice that s g N(2µ+ 3. ESTIMATION g 1g 2g g − | ∼ 2γ ,σ2), where σ2 σ2 (1/n +1/n ), and g de- g g g ≡ ε,g 1g 2g | In this section we describe in detail an approxi- notes conditioning on any gene-specific random ef- mate EM algorithm for fitting the LEMMA model. fects. It follows that s carries no information about g Estimation for the other five models can be carried the gene-specific treatment effect ψ . For this rea- g out by making appropriate modifications to this al- son,ourestimationproceduresuseonlythemarginal gorithm. The LEMMA model has six parameters, likelihoodbasedonthedata( d , m ).Themodel g g two beingthe shapeand scale of the distributionfor { } { } (1)togetherwithassumption(2)alsoimpliesthatd g the error variances given in (5). The remaining vec- and mg are conditionally independent, with dg g tor of parameters is (p ,τ,ψ,σ2) which we denote (1 b )N +b N independently of m g σ2 χ|2∼/ 1 ψ − g 0 g 1 g| ∼ ε,g fg by φ. fg, where bg, g = 1,...,G, denotes independent Estimates of the hyperparameters, α and β, are Bernoulli(p1) latent indicators of non-null status for obtainedbymaximizingthemarginallikelihoodbased theGgenes,N0 andN1 denotenormalvariateswith on mg , given by { } unequal means τ and τ +ψ =τ respectively, but g 6 L( m ) equal variances σg2, and χ2fg denotes a chi-squared { g} variate with f degrees of freedom. G ∞ The familygof parametric models considered in = f(mg|σε2,g)f(σε−,g2)dσε−,g2 this paper is completed by specifying distributions g=1Z0 Y (6) forthegene-specificeffects, ψ and σ2 .Inwhat { g} { ε,g} G mfg/2−1(f /2)fg/2 followswesupposethat,ifthe(non-null)gene-specific g g = effects are modeled as random variates, they follow Γ(fg/2)Γ(α)βα g=1 a normal distribution, Y Γ(f /2+α) g (4) ψ i.i.d. N(ψ,σ2). . g ∼ ψ · (mgfg/2+1/β)fg/2+α On the other hand, if the gene-specific variances are In practice, we find the maximum likelihood esti- modeled as random variates, they are drawn from mates for α and β using the nlminb function in an inverse gamma distribution, R. In all the simulations and case studies the func- (5) σ−2 i.i.d.Gamma(α,β), tionconvergedquickly.Sincethemarginallikelihood ε,g ∼ is based on the statistics m , the computation g { } where α and β are shape and scale parameters. We time depends only on the number of genes, G, but refer to the RR model specified by (1), (2) and (5) not on the sample sizes. We have also derived and with the non-null gene-specific effects (4) as the implemented moment estimators [similar to Smyth LEMMA model. (2004), who comments that m follow a scaled g { } It is worth contrasting (4) with the corresponding F-distribution], and we found that both methods assumption in the models leading to the B-statistic provide accurate estimation of α and β. 6 BAR,BOOTH, SCHIFANO ANDWELLS 3.1 EM Algorithm G (10) = E logL(b ,d ;σ˜2 )d +C We apply the EM algorithm to estimate φ, with φ(m){ g g ε,g | g} g=1 the latent indicators, b , playing the role of the X g { } G missing data. Since d and m are conditionally in- g g (m) = p log[p f (d )] dependent given (bg,σε2,g), the complete data likeli- { 0,g 0 0,g g g=1 hood for φ based on ( b , d , m ) is X g g g { } { } { } (m) +p log[p f (d )] +C G 1,g 1 1,g g } L (φ)= L(b ,d ;σ2 ) C g g ε,g Q˜(φ,φ(m))+C, g=1Z ≡ Y (7) where C does not depend on φ, f and f denote ·L(mg;σε2,g)f(σε−,g2)dσε−,g2, N(τ,σ˜g2) and N(τ +ψ,σψ2 +σ˜g2) d0e,gnsities w1,gith σ˜g2= where f(σε−,g2) represents the gamma density with σ˜ε2,g(1/n1g+1/n2g),p1,g =E(bg|dg)andp0,g+p1,g= shape α and scale β. 1. Theintegral in (7)makes directapplication of the TheM-stepatthe(m+1)iterationrequiresmaxi- EM algorithm intractable. However, a simple and mization of Q˜(φ,φ(m)) withrespectto φ to yield the accurate approximation is obtained via the Laplace updated estimate φ(m+1). That is, approximation (de Bruijn, 1981, Chapter 4; Butler, φ(m+1)=argmaxQ˜(φ,φ(m)). 2007, page 42) φ This leads to the following maximum likelihood es- G LC(φ)≈L˜C(φ)≡ L(bg,dg;σ˜ε2,g)L(mg;σ˜ε2,g) timate update equations for p1, τ and ψ: g=1 G (8) Y (m+1) 1 (m) (11) p = p , f(σ˜−2) 2π/ℓ′′(m ;σ˜2 ), 1 G 1,g · ε,g − g ε,g Xg=1 where ℓ′′(m ;σ2 ) is the seqcond derivative of G p(m)d /σ˜2 g ε,g (12) τ(m+1) = g=1 0,g g g logL(mg;σε2,g) with respect to σε2,g, and σ˜ε2,g is the P G p(m)/σ˜2 posterior mode of σ2 given m , given by g=1 0,g g ε,g g and P f /2 σ˜ε2,g = f /2+g α+1mg ψ(m+1) g (13) (9) α+1 1 G p(m)(d τ(m+1))/(σ2(m)+σ˜2) + . = g=1 1,g g − ψ g , fg/2+α+1 · (α+1)β P G p(m)/(σ2(m)+σ˜2) g=1 1,g ψ g Notice thatthelastthreefactors ontheright-sideof while the update foPr σ2 is the solution of the equa- (8)donotinvolvetheparameterφandcantherefore ψ tion beignoredintheimplementationofEM.Inpractice, we replace α and β by their maximum likelihood G 1 (m) p estimates obtained from the marginal likelihood in 1,g σ2 +σ˜2 (6). Xg=1 ψ g (14) Denote the estimate after m iterations of EM by G (d τ(m+1) ψ(m+1))2 φ(m). The (m+1)st E-step consists of taking the = p(m) g − − , 1,g (σ2 +σ˜2)2 conditionalexpectation ofthelogarithmof(7)given g=1 ψ g X the observed data, using the current estimate, φ(m). and σ2 =0 if p =0 for all the genes. Using the Laplace approximation (8), this is given ψ 1,g Strictly speaking, the update for ψ in (13) is con- by ditional on the current value of σ2. However, we Q(φ,φ(m)) have found this variant of EM to haψve almost iden- tical convergence properties to the full EM in which =E [logL (φ) d , m ] φ(m) C |{ g} { g} Q˜ is maximized jointly with respect to all four com- E [logL˜ (φ) d , m ] ponents of φ. ≈ φ(m) C |{ g} { g} 7 LAPLACE APPROXIMATEDEM MICROARRAYANALYSIS 3.2 Modifications for RF, RH, FF, FH, FR corresponding assumption in LIMMA is ψ σ2 g| ε,g ∼ N(0,v σ2 ). This assumption, combined with (5), LEMMA is considered an RR model because the 0 ε,g results in a closed form expression for the complete gene-specific effects (ψ ,σ2 ) are modeled as ran- g ε,g data likelihood (7), rendering the use of the Laplace domvariates. By consideringoneor bothof theseas approximationunnecessary.Anotherdifferenceisthat fixed effects, we obtain models that fall into one of the mean effect of treatment, averaged across genes the RF, RH, FR, FF or FH categories. Henceforth, (τ), is assumed to be zero in the LIMMA model. the category labels RF, RH, FF, FH, FR refer to However, this difference has little bearing on the ar- the models derived from the LEMMA (RR) model guments that follow. with the corresponding fixed/random distributional As noted in Section 2, it is unnecessary to assume assumption modifications. that the mean of the non-null gene-specific effects, The complete data likelihood for the RF model is ψ, is zero. Hence, we consider a generalized LIMMA G model (denoted by RG in what follows) with (15) L (φ) L(b ,d ;σ2 )L(m ;σ2 ). C ≈ g g ε,g g ε,g (16) ψ σ2 N(ψ,v σ2 ) gY=1 g| ε,g∼ 0 ε,g Sincenointegration is requiredto evaluate this like- for the non-nullgene-specific effects, and,as such, it lihood, the Laplace approximation is not needed in falls into the RR category. The EM algorithm dis- thiscase.As withtheLEMMA(RR)model,wefirst cussed earlier in this section can be implemented estimate the error variances, σ2 , separately us- to fit this generalized model with minor modifica- { ε,g} ing the marginal likelihood for m . This results tions. Specifically, after using the Laplace approxi- g { } in the simple estimate, σˆ2 = m . The EM algo- mation, the Q-function has the same form as (10) ε,g g rithm for estimating φ then proceeds in an iden- with v σ˜2 replacing σ2 +σ˜2 as the variance in 0,g ε,g ψ g tical manner except that σ˜2 is replaced by σˆ2 = the non-null density f , where v =v +1/n + g g 1,g 0,g 0 1g σˆ2 (1/n +1/n ).ThealgorithmfortheRHmodel 1/n . This leads to update equations for p and τ ε,g 1g 2g 2g 1 is also similar with the marginal likelihood estima- identical to (11) and (12), respectively. The update tor ofthehomogeneous errorvariance given by σˆ2= for ψ is ε m f / f . g g g g g G p(m)(d τ(m+1))/(v σ˜2 ) For all the fixed gene-specific effects models (FR, ψ(m+1)= g=1 1,g g − 0,g ε,g , P P FF and FH) it is easily verified that dg −τ(m) − P Gg=1p1(m,g)/(v0,gσ˜ε2,g) (m) ψ =0. This implies that the EM update for the g and the update of vPsatisfies 0 mixing parameter satisfies p(m+1) G p(m) 1 = G p(m)(dg−τ(m+1)−ψ(m+1))2, 1 1,g v 1,g v2 σ˜2 1 G p(m) Xg=1 0,g Xg=1 0,g ε,g = 1 andv =0ifp =0forallthegenes.Theseupdates G p(m)exp (d τ(m))2/2σˆ2 +p(m) 0 1,g Xg=1 0 {− g − e,g} 1 simplify further if the sample sizes are the same for (m) all genes. >p , 1 3.4 Model Extensions where σˆ2 represents the appropriate σ2 estima- e,g g tor for the desired model. As a result, the EM se- The LEMMA model is easily extended in a num- quence for p always converges to 1, regardless of ber of useful ways. First, it enables within-group 1 the starting value. An explanation for this behavior analysis which follows the same estimation proce- is that the mixture probability is not identifiable if dure by simply dropping the i index and combin- the gene-specific effects are fixed. ing the terms µ and τ. We found this to be use- ful in practical applications, when, for example, re- 3.3 A Generalization of LIMMA searchers wish to perform a paired-sample test. The LIMMA model proposed by Smyth (2004) Similarly, we can extend the model to have multi- is similar to the LEMMA model described in Sec- pletreatmentgroupsandtestdifferent(user-defined) tion2.Akeydifferenceistheassumptionconcerning contrasts,aswasdoneinSmyth(2004)fortheLIMMA the random gene-specific effects given in (4). The model. Mathematically, this generalization is very 8 BAR,BOOTH, SCHIFANO ANDWELLS simple, and, in practice, when dealing with a small 1[λˆ (d τˆ)+(1 λˆ )ψˆ]2 ψˆ2 g g g exp − − + or moderate number of treatment groups, the es- · −2 λˆ σ˜2 2σˆ2 (cid:26) g g ψ(cid:27) timation procedure poses no significant computa- tional challenges. For example, we use the (t 1- σ˜2 −1/2 1 dimensional vector) summary statistics dg =H−Y¯g, ∝ σˆ2 +gσ˜2 exp −2Tg2 , where H is a contrast matrix (e.g., the Helmert ma- (cid:18) ψ g(cid:19) (cid:26) (cid:27) trix) and Y¯g = (Y¯1·g,Y¯2·g,...,Y¯t·g)′. Note that the with the constant of proportionality being exp(ψˆ2/ 2 2 Helmert matrix gives the d and s statis- 2σˆ2), where × g g ψ tics for the one-treatment case [scaled by a factor of 1/√2]. Obtaining the estimates and test statistics 1 1 1 −1 σψ2 λ = + = . in the multiple treatment case is analogous to the g σ2 σ2 σ2 σ2 +σ2 g(cid:18) g ψ(cid:19) ψ g derivations in (3.1). See the Appendix for details. The statistic T is a posterior t-statistic, being the As noted in Zhang, Zhang and Wells (2010), it is g ratio of the estimated posterior expectation of ψ g often the case that the probabilities of under- and to its estimated posterior standard deviation. Note over-expressed genes are not equal. The assumption thattheLEMMAmodelinducesthreeformsofshrink- that the distribution of the non-null genes has a age in T . The first two forms come from λˆ >0 in g g nonzeromean(ψ)canbemodifiedtoallowformulti- both the numerator and the denominator. Third, plenon-nullcomponentsinthemixturedistribution. σ˜2, a function of the posterior mode σ˜2 , is itself a For example, we might assume that each gene is ei- g ε,g shrinkage estimator as a weighted compromise be- ther in the null group (ψ =0) with probability p , g 0 tween the usual error variance estimator m and g in one non-null component with probability p with 1 the mode of the inverse gamma distribution [(α+ ψg ∼i.i.d. N(ψ,σψ2), or in a second non-null group 1)β]−1. withprobabilityp2 withψg ∼i.i.d. N(−ψ,σψ2),where The likelihood ratio in (17) has the same form p0+p1+p2=1. The two-component model in the for the RF and RH models with σ˜2 replaced by ε,g previous sections is the special case in which p2=0. σˆ2 and σˆ2, respectively in σ˜2. [Recall that σ2 = ThelemmaRpackageusesthethreecomponentmix- ε,g ε g g σ2 (1/n +1/n ).]Teststatisticsforthefixedmean ε,g 1g 2g ture by default, and we have found that, indeed, effects models,FR,FF andFH, areobtained as lim- when there are two mixture components, the EM its of T as λˆ 1. algorithm converges to pˆ =0. Note that the R im- g g → 2 It is interesting to compare the likelihood ratio plementation assumes that the means of the non- (17) with the corresponding statistic under the null groups are of the same magnitude but opposite LIMMAandRGmodelassumptionsdiscussedinthe sign. This assumption can be relaxed, for instance, previous section. For these models σ2 is replaced by ψ by assuming only that ψ <0<ψ . 1 2 v σ˜2 , and so the shrinkage coefficient becomes 0 ε,g v 4. INFERENCE 0 λ = . g v +1/n +1/n 0 1g 2g The posterior probability that gene g is non-null In particular, if the sample sizes are the same for is given by the expression (3). Its estimated value all genes, then the amount of shrinkage is the same based on the LEMMA model can be expressed as a for all genes. Furthermore, if ψ is set equal to zero, function of the likelihood ratio as it is in LIMMA, then T is proportional to the g L0,g fˆ0,g test-statistic for the FR model, L1,g ≡ fˆ1,g d τˆ g T = − . =(2πσ˜g2)−1/2exp{−(dg −τˆ)2/2σ˜g2} g σ˜g /([2π(σˆ2 +σ˜2)]−1/2 This has the same form as the moderated t-statistic ψ g of Smyth (2004) and Wright and Simon (2003) ex- (17) exp (d τˆ ψˆ)2/2(σˆ2 +σ˜2) ) cept for the subtraction of the average gene effect, · {− g − − ψ g } τ, in the numerator and the use of the mode rather σ˜2 −1/2 g thantheexpectedvalueoftheposteriordistribution = σˆ2 +σ˜2 of σ2 given m in the denominator. (cid:18) ψ g(cid:19) ε,g g 9 LAPLACE APPROXIMATEDEM MICROARRAYANALYSIS For inference, wecomparetheposterior nullprob- In the LEMMA-generated data, we varied ψ, so ability, 1 p in (3), with a local f.d.r. threshold that ψ 0,1,2,...,6 Ψ, and set σ2 =1. In the − 1,g ∈{ }≡ ψ to decide whether a gene is in the non-null group. LIMMA data generation setup, we used v 1,2, Alternatively, our model-based approach also allows ...,8 to generate the non-null genes acco0r∈di{n6g t6o one to declare the non-null status of a gene by con- 6} (16). For both generation schemes we set τ =0, as trolling the false discovery rate (FDR), using the the LIMMA modeldoes not involve τ, and it is only Benjamini and Hochberg (1995) (BH) procedurefor estimable under the random gene by treatment in- ∗ anygiven level, q .Specifically,usingthetheoretical teraction effect models (RR, RF, RH). We gener- null-gene distributions of {dg}, which are assumed ated yijg according to equations (1) and (2) with to be N(τˆ,σ˜g2), we obtain the p-values for the ob- the above parameter specifications, and computed fisenrdvetdhe{dlagr}g.eWsteinddeenxotge′ ftohrewph-vicahluePsgF′b≤y q{∗P×g}g,′a/nGd, a{dsge}leactnidon{omfgs}p.eWcifihcilepawraemoentleyr pvareluseenstetrteisnuglst,snfuor- where PF isthesortedlistof p-values.Wedeclare merous simulations were performed with a variety { g } ′ all the genes with index smaller than or equal to g of sample sizes n , i=1,2, non-null probabilities p , i 1 (in the sorted list) as non-null, and the FDR the- and gene-specific treatment variances σ2. In addi- ψ orem guarantees that the expected false discovery tion, we also considered using the log-normal dis- rate is bounded by q∗. tribution to generate the error variance σ2 rather ε,g than the inverse gamma distribution. We found the 5. SIMULATION STUDY results to be qualitatively insensitive to these dif- ferent settings, and the results presented below por- Inthissectionweassesstheperformanceofseveral tray anaccurate summaryof theperformanceof the estimation/testing proceduresmentioned in this pa- methods. per under two data generation models, one accord- ing to the LEMMA model and the other accord- 5.2 Data Analysis and Results ing to the LIMMA model. In practice, the correct We consider two metrics for determining null and model is unknown, so our goal is to compare the non-null status of genes. The first method is based power, accuracy, false discovery rate and parameter oncomputingempiricalquantilecriticalvalues.Since estimation for different true-model/procedure com- the distribution of many of our test statistics is un- binations. In what follows we use the term “proce- known, we defined a test-specific critical value, T , c dure” to define the combination of the model se- asthe0.95quantileamongthe1900 100nullgenes. lected for analysis (which may or may not be the × Bydesign,thisresultedinan average sizeof0.05 for truemodel) and theestimation and inferential tech- each test.Theaverage powerforeach procedurewas niques derived from this model. determined by the proportion of non-null genes cor- 5.1 Data Generation rectly declared non-null based on the (test-specific) empirical critical value T . Figure 1 shows the av- c Inbothscenarios(LEMMAandLIMMA),wesim- erage power (on the logit scale) of the likelihood ulated S = 100 data sets according to a mixture ratio tests derived assuming the FF, FH, FR, RF, modelwithtwogroups,nullandnon-null.Eachdata RH and RR models, with estimation procedures as set consisted of G=2000 genes, of which p G were 1 described in Section 3. Also included in our com- non-null, and we used p =0.01,0.05,0.1,0.25. For 1 parison were the RG likelihood ratio tests, derived each of the S data sets we drew G inverse gamma from themodeldefinedin Section 3.3, andthe mod- error variates with shape α and scale β. By vary- erated t-tests obtained from the limma R package. ing α and β, we adjusted the amount of error vari- Since in our simulations we know the exact values ance variability present in the data. The values of of the parameters, we also included the “Optimal α, β, n1g n1, and n2g n2 were chosen so that Rule” statistics (denoted by OR) which were ob- ≡ ≡ mean(σg2)=1. With n1=n2=6, we set α=5 and tained by plugging in the true parameter values in β=1/12 for the “low” error variance variability; we the likelihood ratio statistic for the true data gener- set α=2.1 and β=10/33 for the “high” error vari- ation model (either LEMMA or LIMMA). ance variability. Hence, the standard deviation (and When the data are generated according to the also the coefficient of variation, CV) of σg2 for the LEMMA model our simulations show that the tests former was 1/√3, and for the latter was √10. derived from the RR model achieved the highest 10 BAR,BOOTH, SCHIFANO ANDWELLS Fig. 1. Average power (on the logit scale) for empirical quantile analysis under the RR data generation model, with n =n =6, S=100samples,G=2000 genes,andp =0.05 probabilityofnon-nullstatus.Left:lowerrorvariancevariability 1 2 1 (CV =0.58). Right: high error variance variability (CV =3.16). power for all ψ Ψ (and almost identical to the where p (y ) is given by (3). Since p can only 1,g g 1 ∈ Optimal Rule’s), as can be seen in Figure 1. When be estimated in the random-mean models, we only the data are generated according to the LIMMA considered the local f.d.r. statistics associated with model,thelikelihoodratiotestsderivedfromtheRR RR,RFandRH.Forcomparison,wealsoconsidered andRGmodelshavenearlyidenticalperformancein the local f.d.r. statistics for RG and the Optimal termsofpowerasthoseofthemoderated-tstatistics Rule,and two types of B statistics computed by the and the LIMMA Optimal Rule for all values of v limma package to differentiate between those com- 0 (figure not shown). putedwiththedefaultvalueof p =0.01 [referredto 1 As expected,our simulations also showed that the as “Limma(0.01)”] and those computed with the es- average power in the homogeneous error variance timatedvalueofp [referredtoas“Limma(pˆ )”].We 1 1 models (RH, FH) decreases as the error variance also included local f.d.r. statistics computed from variability increases. In general, the random gene thelocfdr(Efron,TurnbullandNarasimhan,2008) models (RR, RF, RH) demonstrate higher average R package (referred to as “Efron” for simplicity). power than their corresponding fixed gene counter- To evaluate the performance of these procedures, parts.Noticealsothattheperformanceofmoderated- welookedattwocomplementarymetrics.Thefirstis t and the FR statistics are almost identical. the measure of accuracy, definedby the ratio (TP+ The second performance assessment method did TN)/(P +N) as in Hong (2009), where P and N not require computing empirical quantiles, and was are the total numbers of non-null and null genes, based on local f.d.r. criteria. Efron et al. (2001) and respectively, and TP +TN is the sum of correct Efron (2005) defined local f.d.r. as classifications (true positives plus true negatives). Thesecondmetricisthefalsediscoveryrate,defined (18) f.d.r.(y )=Pr(null Y =y ) g | g by FP/(FP +TP), where FP is the total number for the posterior probability of a gene g being in the of false positives. Clearly, our goal is to maximize null group. Note that this is precisely 1 p (y ), theaccuracy whilemaintaining alow false discovery 1,g g −