Assessment of Substitution Model Adequacy Using Frequentist and Bayesian Methods Jennifer Ripplinger*,1 and Jack Sullivan1,2 1Bioinformatics and Computational Biology, University of Idaho 2Department of Biological Sciences, University of Idaho *Corresponding author: E-mail: [email protected]. Associate editor: Jeffrey Throne Abstract In order to have confidence in model-based phylogenetic methods, such as maximum likelihood (ML) and Bayesian analyses,onemustuseanappropriatemodelofmolecularevolutionidentifiedusingstatisticallyrigorouscriteria.Although D o R model selection methods such as the likelihood ratio test and Akaike information criterion are widely used in the w n e phylogenetic literature, model selection methods lack the ability to reject all models if they provide an inadequate fit to lo s ad e thedata.Therearetwomethods,however,thatassessabsolutemodeladequacy,thefrequentistGoldman–Cox(GC)test e d ar and Bayesian posterior predictive simulations (PPSs), which are commonly used in conjunction with the multinomial log from c likelihood test statistic. In this study, we use empirical and simulated data to evaluate the adequacy of common h h substitution models using both frequentist and Bayesian methods and compare the results with those obtained with ttp s a model selection methods. In addition, we investigate the relationship between model adequacy and performance in ML ://a rt and Bayesian analyses in terms of topology, branch lengths, and bipartition support. We show that tests of model cad i adequacybasedonthemultinomial likelihood oftenfailto rejectsimple substitution models,especially whenthe models e c m l incorporateamong-siteratevariation(ASRV),andnormallyfailtorejectlesscomplexmodelsthanthosechosenbymodel ic e .o selectionmethods.Inaddition,wefindthatPPSsoftenfailtorejectsimplermodelsthantheGCtest.Useofthesimplest u p substitution models not rejected based on fit normally results in similar but divergent estimates of tree topology and .co m branch lengths. In addition, use of the simplest adequate substitution models can affect estimates of bipartition support, /m although these differences are often small with the largest differences confined to poorly supported nodes. We also find be thatalternativeassumptionsaboutASRVcanaffecttreetopology,treelength,andbipartitionsupport.Ourresultssuggest /artic that using the simplest substitution models not rejected based on fit may be a valid alternative to implementing more le complexmodelsidentifiedbymodelselectionmethods.However,allcommonsubstitutionmodelsmayfailtorecoverthe -ab s correct topology and assign appropriate bipartition support if the true tree shape is difficult to estimate regardless of tra c model adequacy. t/2 7 Key words: Bayesian, Goldman–Cox, maximum likelihood, model adequacy, parametric bootstrap, posterior predictive /1 2 simulation. /27 9 0 /1 Introduction offbetweenbiasandvarianceaccordingtotheirrespective 07 4 statisticalcriteria(seePosadaandBuckley2004andSullivan 43 Statisticalmethodsofphylogeneticinference,suchasmax- 7 and Joyce 2005 for reviews of model selection methods). b imum likelihood (ML) and Bayesian estimation, utilize an y Althoughuseofanymodelselectionmethodispreferable g explicit model of molecular evolution, normally selected u to choosing a model arbitrarily (Ripplinger and Sullivan es from a group of nucleotide substitution models referred 2008), these are relative fit methods that lack the ability t on toasthegeneraltimereversible(GTR;Tavare´1986)family. to reject all models if none provides an adequate fit to 10 Modelchoiceiscriticalbecauseuseofanunderparameter- A thedata.Thatis,noneofthetypicalmodelselectionmeth- p ized model can mislead an analysis by failing to account ods actually assess goodness of fit in an absolute sense. ril 2 fullyformultiplesubstitutionsatthesamesitewhileinclu- 0 There are two methods that assess the absolute fit be- 1 9 sionofsuperfluousparameterscanincreasethevariancein tween a substitution model and the data: the Goldman– parameter estimates (e.g., Gaut and Lewis 1995; Sullivan Cox (GC) test (also known as the parametric bootstrap and Swofford 1997, 2001; Lemmon and Moriarty 2004). goodness-of-fit test; Reeves 1992; Goldman 1993; Whelan Consequently, model selection methods such as the hier- etal.2001)andposteriorpredictivesimulation(PPS;Rubin archicalimplementationofthelikelihoodratiotest(hLRT; 1984;Gelmanetal.1996;Huelsenbecketal.2001;Bollback Frati et al. 1997; Sullivan et al. 1997; Posada and Crandall 2002,2005).Bothtestsnormallyutilizethemultinomiallog 1998), Akaike information criterion (AIC; Akaike 1973), likelihoodteststatistic,whichiscalculatedastheweighted Bayesianinformationcriterion(BIC;Schwarz1978),andde- sumofsitepatternloglikelihoods.TheGCtestisafrequent- cisiontheory(DT)approach(Mininetal.2003;Abdoetal. istmethodusedtoevaluatethehypothesisofaperfectfit 2004)identifysubstitutionmodelsthatoptimizethetrade between the model and data using a null distribution ©TheAuthor2010.PublishedbyOxfordUniversityPressonbehalfoftheSocietyforMolecularBiologyandEvolution.Allrightsreserved.Forpermissions,please e-mail:[email protected] 2790 Mol. Biol. Evol. 27(12):2790–2803. 2010 doi:10.1093/molbev/msq168 Advance Access publication July 8, 2010 MBE Assessment of Model Adequacy · doi:10.1093/molbev/msq168 generatedwiththeparametricbootstrap.ToconductaGC Despiteinstanceswheremodeladequacymethodshave test, one first performs a ML analysis and calculates the failed to reject GTR family models, there has been specu- value of the realized test statistic lationthatthissetofmodelsisinsufficient(e.g.,Sanderson and Kim 2000; Revell et al. 2005; Kelchner and Thomas d5ðlnL (cid:1) lnL Þ; multinomial constrained 2007) and, consequently, several authors have suggested a need for rigorous analysis of model adequacy using the wherethefirstcomponentisthemultinomialloglikeli- GC test and PPSs (Sullivan and Joyce 2005; Gatesy 2007). hoodandthesecondcomponentistheloglikelihoodcon- However, Waddell et al. (2009) have shown that the GC strained to a particular substitution model. Data for the test can lack power. They could not reject the GTR þ null distribution are simulated using maximum likelihood I þ C model as inadequate using the GC test for a single estimates(MLEs)ofthetreetopology,branchlengths,and datasetbutdemonstratedthatsetsoftaxaexhibiteddata substitutionmodelparametersandsubsequentlyanalyzed patterns that deviated significantly from those expected in the same manner as the original data. Test statistics under the model using pairwise tests of symmetry. D o are then calculated for each replicate and used to form Here,weextendtheexaminationoftheadequacyofthe w n thenulldistribution;aPvalueiscalculatedbydetermining GTRfamilymodelsonanarrayofdatausingboththeGC loa d therankoftherealizedteststatisticinrelationtothenull test and the PPSs.We first evaluate thefit of 56 common e d distribution. substitutionmodelsoneachof25empiricaldatasetsand fro m PPS(e.g.,Bollback2002)isaBayesianmethodforassess- comparetheresultswiththoseobtainedwithmodelselec- h ing model adequacy that utilizes parameter estimates tionmethods.Wethentestwhetherornotuseofthesim- ttp s drawn from their respective posterior distributions rather plest models not rejected based on absolute goodness of ://a than MLEs, which eliminates the reliance on point esti- fit produces significantly different estimates of topology, ca d mates associated with the GC test. In order to conduct branchlengths,and bipartitionsupportthanmodelscho- em PPSs,onefirstcalculatestherealizedteststatistic(themul- sen by model selection. Last, we evaluate the adequacy of ic .o tinomial likelihood) and conducts a Bayesian analysis to GTRfamilymodelsonsimulateddatasetsandinvestigate up produce posterior distributions for the tree topology, theperformanceofbothadequateandinsufficientmodels .co m branch lengths, and substitution model parameters. Data inrecoveringthetruetree,estimatingbranchlengths,and /m for the null, or predictive, distribution are generated in assigning support to appropriate bipartitions. be /a a manner similar to the parametric bootstrapexcept sim- rtic ulation parameters are sampled from the posterior distri- Methods le -a bution independently for each replicate. A multinomial b test statistic is subsequently calculated for each replicate Data Collection stra andusedtoformthenulldistribution;therealizedteststa- In order to assess the performance of model adequacy ct/2 tistic is evaluated against this distribution. methods under a variety of conditions, we downloaded 7/1 Because tests of model adequacy are computationally 25 diverse data sets from the phylogenetic databaseTree- 2/2 7 intensive,theyarerarelyusedasmodelselectionstrategies BASE (http://www.treebase.org). The culled data included 9 0 andhaveinsteadbeenusedtoinvestigatethefitofpartic- 2arthropod,1sponge,2vertebrate,7floweringplant,1red /10 ular substitution models. Goldman (1993) used the para- algae,4clubfungus,6sacfungus,1slimenet,and1water 74 4 metric bootstrap to evaluate the fit of three relatively mold data set, which comprised 2 mitochondrial, 3 chlo- 37 simple GTR family models [Jukes–Cantor (JC; Jukes and roplast,and20nucleargenesequencealignments.Wefirst by g Cantor 1969), Felsenstein 81 (F81; Felsenstein 1981), and importedthedatasetsintoPAUP*4.0b10(Swofford2002) ue Hasegawa–Kishino–Yano (HKY; Hasegawa et al. 1985)] and removed alignment regions the original authors had st o to primate wg-globin pseudogene and tree-of-life small labeled as poor or ambiguous. We then removed redun- n 1 subunit rRNA data sets and was able to reject both the dant haplotypes using Collapse 1.2 (available from 0 A JaCppalnieddtthoe Fth8e1 mrRoNdAelsd,aatsawseeltl.aSsimthielarHlyK,YWmheoldaenl wetheanl. htetrtps:t/a/tdeasr.wBienc.uauvsigeoi.nesc)luwsihoinleotfregaatpinsgangadpasmasbifigfuthoucshcahraacr-- pril 20 1 (2001) used the GC test to reject the GTR model (Tavare´ acters causes difficulty in calculating the multinomial 9 1986) for a primate mitochondrial data set. Conversely, likelihood, these characters were removed; nonetheless, Bollback (2002) used PPSs to evaluate the fit between the resulting pool of data sets exhibited a large amount the JC, HKY, and GTR models and the primate wg-globin of diversity (fig. 1). Citations for each data set, as well as pseudogenedatasetanalyzedbyGoldman(1993)andwas datacollectedaspartofthisstudy,areprovidedinsupple- unable to reject any of the models. Furthermore, several mentary material, Supplementary Material online. authors have been unable to reject the adequacy of sub- stitution models using the GC test when the models ac- Frequentist Analysis counted for among-site rate variation (ASRV). For We began our analysis by identifying optimal models ac- example, Carstens et al. (2005) were unable to reject the cording to the hLRT, corrected AIC (AIC ), BIC, and DT c HKY þ C model for a salamander cytochrome b (cyt b) model selection methods; models were selected from data set, and Demboski and Sullivan (2003) were unable amongthe56GTRfamilymodelsimplementedinModelt- to reject the GTR þ C model for chipmunk cyt b data. est3.7 (Posada and Crandall 1998) and DT-ModSel (Minin 2791 MBE Ripplinger and Sullivan · doi:10.1093/molbev/msq168 D o w n lo a d e d fro m h ttp s ://a c a d e m FIG.1.Summarystatisticsforthe25empiricaldatasetsanalyzedaspartofthisstudy.Thedatasetscontained(a)5–44haplotypes((cid:1)x521) ic.o and203–2,279((cid:1)x5736)nucleotides,(b)amaximumpdistanceof4.0–36.5%((cid:1)x516.8%),(c)aweightedaveragetreelengthof0.08–3.32((cid:1)x5 up 0.74%),and(d)aweightedaveragestemminessindexof0.02–0.93((cid:1)x50.26%).TreelengthsandstemminessindiceswerecalculatedasBIC .co m weighted averages across allMLtrees constructed withthe 56nucleotide modelsimplemented in Modeltestand DT-ModSel. /m b e /a et al. 2003). In order to identify best-fit models under the Inordertoassesstherelativeperformanceofstatistically rtic variousrelativecriteria,wefirstusedPAUP*tocalculateML supported models, we performed additional ML analyses le -a scoresforeachcandidatemodelandthenusedModeltest using substitution models chosen by model selection b s toselectoptimalmodelsaccordingtothehLRT,AICc,and methods and the simplest models not rejected by the trac BIC. Similarly, we recalculated ML scores and used DT- GC test. For each analysis, we estimated initial model pa- t/2 7 ModSel to identify models with the lowest expected risk. rameters from a NJ starting tree constructed with LogDet /1 2 Branchlengths were notincluded duringmodel selection, distance,whichwereusedasstartingvaluesforaMLheu- /2 7 andsequencelengthwasusedtoapproximatesamplesize ristic search with ten random addition starting trees and 90 intheAICc,BIC,andDTcalculations(e.g.,PosadaandBuckley TBR branch swapping. We subsequently reoptimized pa- /107 2004; Ripplinger and Sullivan 2008). rameter estimates and performed additional search itera- 4 4 3 After conducting model selection, we performed GC tionsuntilthetreetopologystabilized(Sullivanetal.2005). 7 b tests for all 56 substitution models to determine which Inordertoquantifydifferencesamongmodelschosenwith y g modelsexhibitedanadequatefittothedata.Wefirstcon- alternativemethods,wecalculatedsymmetricdistancedif- ue s ducted ML analyses in PAUP* to obtain realized test sta- ferences (SDDs; Robinson and Foulds 1981) and sum of t o n tistics and estimates of the topology, branch lengths, branchlengthsforMLtreesconstructedwithmodelscho- 1 0 and substitution model parameters. Model parameters sen by model selection methods as well as the simplest A p were initially estimated from a neighbor joining (NJ) tree models not rejected by the GC test. Last, we conducted ril 2 constructed with LogDet distance (Lockhart et al. 1994) ML nonparametric bootstrap analyses for each supported 0 1 andusedasstartingvaluesforaheuristicMLsearchusing model using 1,000 replicates, substitution model parame- 9 ten random addition starting trees and tree-bisection- ters optimized on the preceding ML analysis, ten random reconnection (TBR) branch swapping. We then generated additionstartingtrees,andTBRbranchswapping.Incases 100datareplicatesforthenulldistributionusingSeq-Gen where the C-shape parameter exceeded the limit of 300, (Rambaut and Grassly 1997), which utilizes simulation to the parameter was optimized on the limit instead of the produceasequencealignmentgivenatree,branchlengths, empirical value. andsubstitutionmodelparameters.Eachreplicatewassub- sequentlyanalyzedinthesamemannerastheoriginaldata Bayesian Analysis and the difference between the multinomial and con- Because only 24 GTR family models can be implemented strained likelihoods was calculated to form the null distri- in MrBayes3.1.2 (Huelsenbeck et al. 2001; Ronquist and bution.Last,wecalculatedaPvaluebyevaluatingthetest Huelsenbeck2003),werepeatedmodelselectionaccording statisticagainstthenulldistributionandassessedtheout- tothehLRT,AIC ,andBICusingaversionofMrModeltest c come at a 5 0.05. (Nylander 2004) we adapted to include BIC scores. In 2792 MBE Assessment of Model Adequacy · doi:10.1093/molbev/msq168 addition,weidentifiedsubstitutionmodelswithminimum Results posteriorriskusingaversionofDT-ModSelwemodifiedto Empirical Analysis selectamongpertinentmodels.Wesubsequentlyemployed MrBayes to perform Bayesian Markov chain Monte Carlo TheGCtestoftenfailedtorejectsimplesubstitutionmod- analyses for each data set using all 24 applicable models. els as long as they incorporated ASRV [invariable sites (I), Foreachanalysis,weperformedtwoindependentrunsfrom C-distributed rates (C), or a combination of the two (I þ randomstartingtrees,eachwiththreeheatedandonecold C);fig.2a].AlthoughthesimpleJCmodel,whichassumes chain,for5(cid:2)106generations;treesandsubstitutionmodel bothequalbasefrequenciesandsubstitutionrates,wasre- parameters were sampled every 100 generations. We con- jectedforalldatasets,JCþIcouldnotberejectedfor68% firmedthat thechainshadconvergedupon thestationary ofthedataandbothJCþCandJCþIþCcouldnotbe distribution by analyzing both log likelihood plots andthe rejectedfor72%ofthedata.TheGCtestidentifiedtheF81 standard deviation of split frequencies. We discarded the model, which incorporates unequal base frequencies and first 500,000 generations (10%) of each analysis as burn- equal substitution rates, as the simplest adequate model D o in. We then conducted PPSs using MAPPS (Bollback for one data set, whereas F81 þ I and F81 þ C were w n 2002), which calculate the realized test statistic (multino- thesimplestmodelsnotrejectedforanotherdataset.Sim- loa d mial likelihood) from the sequence alignment, produces ilarly, the Kimura 2-parameter model (K2P; Kimura 1980) e d data replicatesbysampling1,000jointtreeandmodelpa- with C-distributed rate variation, which assumes equal fro rameterposteriorprobabilitiesandconstructsthenulldis- basefrequenciesandunequaltransition/transversionrates, m h tributionbycalculatingteststatisticsforeachdatareplicate. was the simplest model not rejected for one data set, ttp s Finally, we calculated average posterior tree lengths using whereas HKY þ I and HKY þ C, which incorporate both ://a substitution models chosen by model selection methods unequalbasefrequenciesandtransition/transversionrates, ca d as well as the simplest models not rejected by PPSs. wereidentifiedasthesimplestnonrejectablemodelsforan em additionaltwodatasets.Onlytwodatasetsrequiredcom- ic .o Simulation Analysis plex models; the transversional model (TVM) with invari- u p Althoughuseofempiricaldataallowsustoinvestigatethe able sites or C-distributed rates was needed to capture .co m effects of substitution model adequacy on a range of real adequately the signal from a small geranium data set (p /m worldscenarios,itwouldalsobeusefultoutilizesimulated distance 5 14.1%, BIC-weighted tree length 5 0.19, and be data to explore the relationship between the model ade- BIC-weightedstemminessindex50.02),whereasthemost /artic quacy and the accuracy of phylogenetic inference. Conse- complexGTRþIþCmodel,whichassumesunequalbase le quently, we performed simulations based on shrew frequencies and independent substitution rates, was re- -ab s mitochondrial DNA (Brandli et al. 2005) and water mold quired to fit the 32 shrew sequences used for simulation tra c nuclear DNA (Mirabolfathy et al. 2001) data sets, both (p distance 5 13.8%, BIC-weighted tree length 5 0.49, t/2 7 of which elicited unusual behavior from model adequacy and BIC-weighted stemminess index 5 0.22). /1 2 methods.Inordertoobtainatreetopologyforsimulation, Similarly, PPSs often failed to reject simple models, es- /2 7 we conducted ML analyses for each data set using the pecially when those models incorporated ASRV (fig. 2b). 90 GTRþIþCmodel.Foreachanalysis,initialmodelparam- The JC model without ASRV was the simplest nonreject- /10 7 eters were estimated based on a NJ tree constructed with able model for 16% of the data sets, whereas JC þ I, 4 4 3 LogDet distance and used as starting values for a ML JC þ C, and JC þ I þ C could not be rejected for 40%, 7 b heuristic search with ten random addition starting trees 52%, and 60% of the data, respectively. PPSs identified y g and TBR branch swapping; the ML search was reiterated F81 þ I as the simplest adequate model for one data ue s until the tree topologies converged. We partitioned the set, K2P þ C as the simplest nonrejectable model for an t o n shrewdatasetintothreepartsconsistingofthecytbgene, additional data set, and K2P þ I þ C as the simplest ad- 1 0 controlregion,andcytochromeoxidaseII(COXII)gene;the equate model for two data sets. All models were rejected A p cyt b and COXII genes were further divided by codon for an additional two data sets, including the water mold ril 2 position to produce a total of seven data partitions. data used for simulation (p distance 5 17.8% and 15.6%, 0 1 Similarly, the water mold data set was divided into three BIC-weighted tree length 5 0.91 and 0.63, and BIC- 9 subsets representing the internal transcribed spacer (ITS) weighted stemminess index 5 0.21 for both data sets), 1region,the5.8SrRNAgene,andtheITS2region.Foreach butneitherofthesedatasetsweretheonesthatrequired partition,weobtainedMLEsofGTRþIþCmodelparam- complex models when evaluated with the GC test. eters from a ML analysis constrained to the true tree In most cases, both the GC test and the PPSs failed to topology and used Seq-Gen to simulate ten replicates. reject models that were much less complex than those Wethenconcatenatedthesimulateddatatoformtensets identified by model selection methods (fig. 3). There was ofpartitionedreplicatesforeachdataset.Wesubsequently a significant difference in complexity among substitution analyzedthereplicatesinthesamemannerastheempirical models chosen by model selection methods and the sim- data except that phylogenetic analyses were conducted plestmodelsnotrejectedbytheGCtest[median(M)54.0 using all applicable models instead of limiting the analysis (hLRT),8.0(AIC ),5.0(BIC),5.0(DT),and1.0(GC)param- c to models identified by model selection methods and the eters; Friedman rank sum test adjusted for ties, P , 0.01, simplest models not rejected based on fit. df 5 4] due to both the large number of parameters 2793 MBE Ripplinger and Sullivan · doi:10.1093/molbev/msq168 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/2 7 /1 2 /2 7 9 0 /1 0 7 4 4 3 7 b y g u e s t o n 1 0 A p FmIGo.d2e.lsFsrueqchueanscyJCw,iwthhiwchhicahssucmomesmeoqnuaslubbsatsietuftrieoqnumenocdieeslsaanrdesnuobtstrietjuetcitoendrbayte(sa,)nGorCmtaelslytscaanndno(tb)bPePrSesjefcotred25asemlopnigricaasl dthaetaysinetcso.rSpiomraptlee ril 20 1 a proportion of invariable sites (I), C-distributed rates (C), or a combination of the two (I þ C). Assessed models include JC, F81, which 9 assumes unequal basefrequencies andequalsubstitution rates, K2P,whichincorporates equalbasefrequencies andseparate transitionand transversionrates,HKY,whichassumesunequalbasefrequenciesandseparatetransitionandtransversionrates,K3P,whichincorporatesequal basefrequencies,twotransversionrates,andequaltransitionrates,K3Pufwithunequalbasefrequencies,Tamura–Nei,whichassumesunequal base frequencies, one transversion rate, and independent transition rates, TrNef with equal base frequencies, transitional (TIM), which incorporatesunequalbasefrequencies,twotransversion,andindependenttransitionrates,TIMefwithequalbasefrequencies,transversional (TVM), which incorporates unequal base frequencies, independent transversion rates, and one transition rate, TVMef with equal base frequencies, symmetrical (SYM), which assumes equal base frequencies and independent substitution rates, and GTR, which incorporates unequal basefrequencies and independent substitution rates. Overall, PPSsfailed to reject somewhatsimpler models thantheGCtest. inferred by the AIC and relatively simple models not re- identified by model selection methods were normally c jected by the GC test (evaluated using pairwise Wilcoxon supported as adequate by the GC test (84%, 96%, 80%, signed rank tests adjusted for ties with the Bonferroni and 84% of the models chosen by the hLRT, AIC , BIC, c multiple-test correction). Consequently, optimal models and DT methods, respectively). Similarly, there was 2794 MBE Assessment of Model Adequacy · doi:10.1093/molbev/msq168 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/2 7 /1 2 /2 7 9 0 /1 0 7 4 4 3 7 b y g u e s t o n 1 0 A p ril 2 0 1 9 FIG.3.DifferenceinnumberofparametersbetweenthesimplestmodelsnotrejectedbyGCtestsorPPSsandbest-fitmodelsidentifiedusing thehLRT,correctedAIC(AIC),BIC,andDTapproachfor25empiricaldatasets.Bothtestsofmodeladequacyoftenfailedtorejectmodels c thatwere simpler thanthosechosen bymodel selection methods,especially the AIC. c asignificantdifferenceinthenumberofparametersincor- [M 5 6.0 (hLRT), 9.0 (AIC ), 5.0 (BIC), 5.0 (DT), and 1.0 c poratedbysubstitutionmodelschosenbymodelselection (PPS); Friedman test, P , 0.01, df 5 4] due primarily to methods from the reduced set of models implementedin the complex models chosen by the AIC and the simple c MrBayes and the simplest models not rejected by PPSs models not rejected by PPSs (evaluated with Wilcoxon 2795 MBE Ripplinger and Sullivan · doi:10.1093/molbev/msq168 signed rank tests). Substitution models selected by model K2P þ C was the simplest nonrejectable model for two selectionmethodswerenormallysupportedbyPPSsbutat replicates, whereas the GC test identified HKY þ I and alowerratethantheGCtest(72%ofmodelsselectedusing HKY þ C as the simplest adequate models for two addi- thehLRT,AIC ,andBICand68%ofmodelsselectedbyDT tional replicates and HKY þ C alone as the simplest ade- c were supported by PPSs). quate model for three replicates. The more complex The use of alternative substitution models chosen by Kimura3-parameter(K3P;Kimura1981)modelwithC-dis- model selection methods, as well as use of the simplest tributed rate variation, which assumes unequal base fre- models not rejected by the GC test, resulted in similar quencies and two transversion rates, was the simplest but statistically significant differences in SDDs among nonrejectable model for one replicate, whereas the equal ML tree topologies [M 5 0.34 (hLRT), 0.34 (AIC ), 0.31 base frequency Tamura–Nei model (TrNef; Tamura and c (BIC), 0.31 (DT), and 0.31 (GC); Friedman test, P 5 0.04, Nei 1993) with C-distributed rates was the simplest non- df 5 4]. However, no significant differences among treat- rejectable model for an additional replicate. The GC test ments could be detected using pairwise Wilcoxon signed identified both K3P þ C and TrNef þ C as the simplest D o ranktestswithamultiple-testcorrection.Furthermore,al- adequate models for the remaining replicate. Conversely, w n thoughuseofstatisticallysupportedmodelsresultedinML PPSsidentifiedJCþCasthesimplestnonrejectablemodel loa d treeswithsimilarlength,therewasastatisticallysignificant for eightreplicates and K2Pþ C as the simplestsubstitu- e d differenceinthesumofbranchlengths[M50.49(hLRT), tion model for the remaining two replicates. fro m 0.49(AICc),0.49(BIC),0.49(DT),and0.48(GC);Friedman BoththeGCtestandthePPSsfailedtorejectlesscom- h test, P , 0.01, df 5 4] due to shorter trees inferred using plex substitution models than those identified by model ttp s thesimplestmodelsnotrejectedbytheGCtest(evaluated selectionmethods(fig.5).Therewasasignificantdifference ://a usingWilcoxonsignedranktests).Medianbootstrapvalues inmodelcomplexityamongsubstitutionmodelsidentified ca d calculatedusingmodelsselectedbymodelselectionmeth- by model selection methods and the simplest models not em odsandtheGCtesttendedtobefairlysimilar[M582% rejected by the GC test [M 5 6.5 (hLRT), 9.5 (AICc), 5.0 ic.o (HhoLwRTev),er8,0F%rie(AdmICacn),’s79te%st(BcIaCn)n,o7t9%be(DapTp),liaenddto81t%his(GdCa)t]a. (PB.IC)0,.051.0,d(fD5T)4,]adnude4p.r0im(GarCil)ytpoartahmeeptaerrasm; Fertieerd-rmicahnmtoesdt-, up.co m because bootstrap values calculated for bipartitions on elschosenbytheAICcandthesimplemodelsnotrejected /m the same tree are not independent from one and other. by the GC test (evaluated with pairwise Wilcoxon signed be /a Although the median difference in bootstrap support be- ranktests).Similarly,therewasalsoasignificantdifference rtic tweenmodelschosenbymodelselectionmethodsandthe in complexity among models chosen by model selection le -a simplest models not rejected by the GC test was only 2% methods from the reduced set of models implemented b s (forallmodelselectionmethods),themaximumdifference in MrBayes and the simplest models not rejected by PPSs tra c rangedfrom34%(AICcvs.GC)to47%(hLRT,BIC,andDT [M 5 9.0 (hLRT), 10.0 (AICc), 5.0 (BIC), 5.0 (DT), and 1.0 t/2 7 vs. GC), with the largest differences typically confined to (PPS)parameters;Friedmantest,P.0.01,df54]dueto /1 2 more poorly supported bipartitions. Similarly, the use of differencesamongthecomplexmodelschosenbythehLRT /2 7 sfruobmstitthuetiorendumcoeddeslestcohfomseondbelysmimopdleelmseenletcetdioinnMmerBthayoedss, alenadstAcIoCmc,psliemxpmleordmelosdidelesnsteifileecdteadsbadyetqhueaBteICbaynPdPSDsT(,eavnald- 90/10 7 aswellasuseofthesimplestmodelsnotrejectedbyPPSs, uated with Wilcoxon signed rank tests). 4 4 3 resultedinasignificantdifferenceinSDDsamongaverage Useofall56substitutionmodelsleadtotherecoveryof 7 b posterior tree lengths [M 5 0.69 (hLRT), 0.66 (AIC ), 0.58 anincorrectMLtreeforalltenreplicates,withstandardized y c g (BIC), 0.58(DT),and 0.95(PPS); Friedman,P5 0.01,df 5 SDDs from the true tree ranging from 0.38 to 0.72. Al- ue s 4], although no significant pairwise differences could be thoughtherewasnosignificantdifferenceinSDDsamong t o n identified using Wilcoxon signed rank tests with the Bon- ML tree topologies when using the substitution models 1 0 ferroni correction. Median posterior probabilities were identified by model selection methods or the simplest A p more similar than bootstrap values [M 5 96% (hLRT), models not rejected by the GC test [M 5 0.49 (hLRT), ril 2 96%(AIC ),96%(BIC),96%(DT),and97%(PPS)].Although 0.50(AIC ),0.49(BIC),0.49(DT),and0.50(GC);Friedman 0 c c 1 themediandifferenceinbipartitionposteriorprobabilities test,P50.93,df54],therewasasignificantdifferencein 9 was small (M 5 1% for hLRT and DT vs. PPS and 2% for SDDsamongMLtreeswhenusingsubstitutionmodelsthat AIC andBICvs.PPS),themaximumdifferenceinposterior made alternative assumptions about ASRV [M 5 0.47 c probabilities was much larger than the maximum differ- (equalrates,eq),0.50(I),0.50(C),and0.50(IþC);Fried- ence in bootstrap values, ranging from 72% (hLRT, BIC, mantest,P.0.01,df53].Theusealternativestatistically and DT vs. PPS) to 74% (AIC vs. PPS). supportedmodelsresultedinsimilarbutstatisticallysignif- c icantlydifferenttreelengths[M50.45(hLRT),0.45(AIC ), c Simulation Analysis 0.45 (BIC), 0.45 (DT), and 0.44 (GC); Friedman test, P , Although the GC test and PPSs both failed to reject rela- 0.01, df 5 4], although no significant pairwise differences tively simple substitution models for replicates generated could be detected using Wilcoxon signed rank tests with fromtheshrewdataset,PPSstypicallysupportedlesscom- theBonferronicorrection.Similarly,therewasasignificant plexmodelsthantheGCtest(fig.4).TheGCtestrejected differenceinbranchlengthsamongmodelsthatmadedif- the JC model with and without ASRV for all replicates. ferent assumptions about ASRV [M 5 0.34 (eq), 0.42 (I), 2796 MBE Assessment of Model Adequacy · doi:10.1093/molbev/msq168 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/2 7 /1 2 /2 7 9 0 /1 0 7 4 4 3 7 b y g u e FIG.4.Ratewithwhichsubstitutionmodelsarenotrejectedby(a)GCtestsand(b)PPSsfortenreplicatesgeneratedbysimulatingstochastic st o evolution using a partitioned GTR þ I þ C model constrained to the ML topology identified for a shrew mitochondrial data set. GC tests n 1 typicallydidnotrejectmodelsthatincorporatedbothunequalbasefrequenciesandtransition/transversionsubstitutionrates,especiallywhen 0 A the modelsalsoincluded C. Conversely,PPSs oftenfailed to reject eventhe simplest modelsas long asthey incorporated C. p ril 2 0 1 0.43(C),and0.44(IþC);Friedmantest,P,0.01,df53] Furthermore,therewassignificantdifferenceamongmod- 9 due to significant differences among all treatment pairs els that made alternative assumptions about ASRV [M 5 (evaluated with Wilcoxon signed rank tests). Although 0.36(eq),0.45(I),0.47(C),and0.47(IþC);Friedmantest, all models tended to underestimate the true tree length P , 0.01, df 5 3] due to models that assume equal sub- (0.49),equalratesmodelsperformedworsethanthosethat stitution rates or a proportion of invariable sites. The re- incorporated ASRV. There was a significant difference in sults were similar to those obtained under ML; equal SDDs among average posterior tree lengths between sub- ratesmodelsunderestimatedtreelengthmorethanmod- stitution models chosen by model selection methods and elsthatincorporatedASRV(especially,theC-distribution). thesimplestmodelsnotrejectedbyPPSs[M50.50(hLRT), Concordant with previous results, tests of model ade- 0.50(AIC ),0.50(BIC),0.50(DT),and0.42(GC);Friedman quacy failed to reject simple models for replicates gener- c test, P , 0.01, df 5 4] due to significantly shorter trees ated from a water mold data set (fig. 6). The GC test inferred using the simplest substitution models not re- againrejectedJCwithandwithoutASRVforallreplicates. jectedbyPPS(evaluatedwithWilcoxonsignedranktests). F81 þ I, F81 þ C, K2P þ I þ C, and K3P þ C were the 2797 MBE Ripplinger and Sullivan · doi:10.1093/molbev/msq168 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/2 7 /1 2 /2 7 9 0 /1 0 7 4 4 3 7 b y g u e s t o n 1 0 A p ril 2 0 1 9 FIG.5.Disparityinnumberofparametersbetweenthesimplestmodel(s)notrejectedbyGCtestsorthePPSsandoptimalmodelsidentifiedby thehLRT,AIC,BIC,andDTmethodsfortenreplicatesfromashrewdataset.Bothtestsofmodeladequacyoftenfailedtorejectmodelsthat c were simpler thanthosechosen bymodel selection methods. simplestnonrejectablemodelsforonereplicateapiece.The and JC þ I, JC þ C, and equal rates K2P as the simplest GC test identified K2P þ I and K2P þ C as the simplest nonrejectable models for the remaining four replicates. adequate models for one replicate; HKY þ I and HKY þ Both tests of model adequacy failed to reject simpler Cwerethesimplestmodelsnotrejectedforanadditional models than those selected by model selection methods replicate.TheequalratesHKYmodelandK2PþCmodel (fig. 7), which is similar to results obtained from the em- were identified as the simplest adequate models for two pirical and shrew simulation data. There was a significant replicatesapiece.Conversely,PPSsidentifiedtheequalrates difference in the number of parameters incorporated by JCmodelasthesimplestadequatemodelforsixreplicates models chosen by model selection methods and the 2798 MBE Assessment of Model Adequacy · doi:10.1093/molbev/msq168 D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/2 7 /1 2 /2 7 9 0 /1 0 7 4 4 3 7 b y g u FIG.6.Frequencywithwhichcommonsubstitutionmodelsarenotrejectedby(a)GCtestsand(b)PPSsfortenreplicatesgeneratedusing es agenepartitionedGTRþIþCmodelandMLtopologyforawatermolddataset.Althoughbothmethodsfailedtorejectrelativelysimple t o n models,PPSs normally failedto reject lesscomplex models(including JC without ASRV) thantheGCtest. 1 0 A p ril 2 simplest models not rejected by the GC test [M 5 4.5 All56substitutionmodelsinferredthesameMLtopology 0 1 (hLRT), 6.0 (AIC ), 4.5 (BIC), 5.0 (DT), and 3.5 (GC); Fried- fornineofthetenreplicatesdespitetherejectionofseveral 9 c mantest,P,0.01,df54]duetothedifferencebetween simple models using the GC test. Use of all substitution thecomplexmodelschosenbytheAIC andtherelatively models resulted in the recovery of the true tree for eight c simplemodelsnotrejectedbytheGCtest(evaluatedwith replicatesandtherecoveryofthesameincorrecttreeforan Wilcoxon signed rank tests). Similarly, there was a signifi- additionalreplicate,whereasmodelsthatincorporatedthe cant difference in model complexity among models se- least complex rate matrix outperformed parameter-rich lected by model selection methods from the reduced set models for the remaining replicate. Consequently, there ofmodelsimplementedinMrBayesandtheleastcomplex was no significant difference in SDDs among ML tree models not rejected by PPSs [M 5 4.0 (hLRT), 7.0 (AIC ), topologies generated using substitution models identified c 4.0 (BIC), 4.0 (DT), and 0.0 (PPS) parameters; Friedman by model selection methods and the least complex test, P , 0.01, df 5 4] due to the simple models not re- modelsnotrejectedbytheGCtest(M50forallmethods; jected by PPSs (evaluated with pairwise Wilcoxon signed Friedman test, P , 0.99, df 5 4) or among models that rank tests). made different assumptions about ASRV (M 5 0 for all 2799
Description: