ebook img

Function-valued traits in evolution. PDF

0.52 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Function-valued traits in evolution.

Function-valued traits in evolution Pantelis Z. Hadjipantelis1, Nick S. Jones2, John Moriarty3, David A. Springate4 and Christopher G. Knight4 rsif.royalsocietypublishing.org 1Centre for ComplexityScienceand DepartmentofStatistics,UniversityofWarwick, Coventry CV47AL, UK 2DepartmentofMathematics,Imperial College London, London SW72AZ,UK 3Schoolof Mathematics,and 4FacultyofLifeSciences, UniversityofManchester,Oxford Road, ManchesterM139PL, UK Many biological characteristics of evolutionary interest are not scalar vari- Research ables but continuous functions. Given a dataset of function-valued traits generated by evolution, we develop a practical, statistical approach to Cite this article: Hadjipantelis PZ, Jones NS, inferancestralfunction-valuedtraits,andestimatethegenerativeevolution- Moriarty J, Springate DA, Knight CG. 2013 aryprocess.Wedothisbycombiningdimensionreductionandphylogenetic Gaussian process regression, a non-parametric procedure that explicitly Function-valued traits in evolution. J R Soc accounts for known phylogenetic relationships. We test the performance of Interface 10: 20121032. methods on simulated, function-valued data generated from a stochastic http://dx.doi.org/10.1098/rsif.2012.1032 evolutionary model. The methods are applied assuming that only the phylogeny, and the function-valued traits of taxa at its tips are known. Our method is robust and applicable to a wide range of function-valued data, and also offers a phylogenetically aware method for estimating the Received: 15 December 2012 autocorrelationof function-valued traits. Accepted: 28 January 2013 1. Introduction Thenumber,reliabilityandcoverageofevolutionarytreesaregrowingrapidly Subject Areas: [1,2].However,knowingorganisms’evolutionaryrelationshipsthroughphylo- biomathematics, bioinformatics geneticsisonlyonestepinunderstandingtheevolutionoftheircharacteristics [3]. Three issues are particularly challenging. The first is limited information: Keywords: empirical information is typically available only for extant taxa, represented comparative analysis, Ornstein–Uhlenbeck bytips of a phylogenetictree,whereasevolutionaryquestions frequentlycon- process, non-parametric Bayesian inference, cern unobserved ancestors deeper in the tree. The second is dependence: the available information for different organisms in a phylogeny is independent functional phylogenetics, ancestral because a phylogeny describes a complex pattern of non-independence; reconstruction, functional Gaussian process observedvariationisamixtureofthisinheritedvariationandspecificvariation regression [4].Thethirdishighdimensionality:theemergingliteratureonfunction-valued traits [5–7] recognizes that many characteristics of living organisms are best represented as a continuous function rather than a single factor or a small number of correlated factors. Such characteristics include growth or mortality Author for correspondence: curves [8], reaction norms [9] and distributions [10], where the increasing Pantelis Z. Hadjipantelis ease of genome sequencing has greatly expanded the range of species in e-mail: [email protected] which distributions of gene [11] or predicted protein [12] properties are avail- able. Therefore, a function-valued trait is defined as a phenotypic trait that canbe represented byacontinuous mathematical function [9]. Previous work [13] proposed an evolutionary model for function-valued data d related by a phylogeny T. The data are regarded as observations of a phylogenetic Gaussian process (PGP) at the tips of T. That work showsthat a PGP can be expressed as a stochastic linear operator X on a fixed set fof basis functions (independent components of variation),so that d¼XTf: ð1:1Þ However, the study does not address the linear inverse problem of obtaining Electronic supplementary material is available estimatesf^ andX^ offandX:ourfirstcontributioninthispaperistoprovide at http://dx.doi.org/10.1098/rsif.2012.1032 or an approach to this problem in §2.2 via independent principal component analysis (IPCA; [14]). via http://rsif.royalsocietypublishing.org. & 2013TheAuthors.PublishedbytheRoyalSocietyunderthetermsoftheCreativeCommonsAttribution License http://creativecommons.org/licenses/by/3.0/, which permits unrestricted use, provided the original authorandsourcearecredited. WerefertoXasthemixingmatrix,andtothe(i,j)thentry 2 ofXasthemixingcoefficientoftheithbasisfunctionatthejth function-valued Ornstein–Uhlenbeck phylogenetic taxon. It is these mixing coefficients that we model as evol- sample (figure 3) model (equation 2.1) tree (figure 2) rs ving. For each fixed value of i, the Xij are correlated (owing input if.roy a to phylogeny) as j varies over the taxa; the basis functions ls o themInse§l2v.e3s,wdeoandodtreevssoltvheeipnroobulermmoofdeesl.timatingthestatistical dimreednusciotinoanlity hypesetripmaraatimoneter cietyp u structureofthemixingcoefficientsbyperformingphylogenetic (§2.2) (§2.4) blis Gaussian process regression (PGPR) on each of the rows of hin X^ separately. This corresponds to assuming independence g.o between the rows (i.e. that the coefficients of the different rg basisfunctions evolveindependently).Itiscommonlyargued J q estimates R inthequantitativegeneticsliterature[15]thatevolutionarypro- S (table 2) o cesses can be modelled as Ornstein–Uhlenbeck (OU) c phylogenetic In processes.Undertheseassumptions,theestimationofthefor- mixing coefficients matrix GP regression te wardoperatorreducestotheestimationofasmallvectorgof and (§2.3) rfac functional basis (figure 3) e parameters [13]. In §2.1, we clarify the interpretation of these 1 0 : parameters in evolutionary contexts. The explicit PGPR pos- 2 0 terior likelihood function is then used to obtain maximum- 12 1 likelihood estimates (MLEs) for g. The estimation of g is ancestral 03 estimates (figure 5) 2 knowntobeachallengingstatisticalproblem[16].Wesuggest output anapproachbasedontheprincipleofbagging[17]in§2.4. Figure 1. The three methods presented in this paper (ovals) and their Our final contribution (§2.5) addresses the problem of interrelationships. estimating the function-valued traits of ancestral taxa. The earlier-mentioned PGPR step also returns a posterior distri- bution for the mixing coefficient of each basis function at each ancestral taxon in the phylogeny. At any particular ancestor,theestimatedbasisfunctionsmaybecombinedstat- istically, using the posterior distributions of their respective mixing coefficients, to provide a posterior distribution for the function-valued trait. Because the univariate posterior distributionsareGaussian,andthemixingislinear,thepos- terior for the function-valued trait has a closed form representation as a GP (equation (2.6)) that provides a major analytical and computational advantage for the approach. We can verify the methods proposed by using a PGP as a stochastic generative model. This simulates corre- lated function-valued traits across the taxa of T. Given only Figure2.Therandomphylogenetictreeusedandexamplesofthefunction- the phylogeny and the function-valued traits of taxa at its valuedtraitsshownatthetips(extanttaxa)andtheinternalnodes(ancestral tips, our estimates for f^ and the ancestral functions are taxa). A subset of these is used in figure 5. then comparedwiththe simulation. Overall, our three methods (in §§2.2, 2.4, 2.5) appropri- ately combine developments in functional data analysis Second, wechosea basisfinequation (1.1).Wehave no with the evolutionary dynamics of quantitative phenotypic reasonaprioritosupposethatthisbasisisorthogonaland,in traits,allowingnon-parametricBayesianinferencefromphy- general,thereisnoreasonforourinferenceproceduretobesen- logenetically correlated function-valued traits. An outline of sitivetotheparticularshapeofthebasisfunctions.Thethree the framework presented in this study can be found in simplenon-orthogonal,unimodalfunctionsshowninfigure3 figure 1. were therefore chosen as examples. For computational pur- poses, eachbasisfunction wasstorednumericallyasavector oflength1024,sothatthebasismatrixfwasofsize3(cid:2)1024 2. Methods and implementation anditsithrowstoredtheithbasisfunction. Third, different mixing coefficients were generated by a 2.1. Artificial evolution of function-valued traits phylogenetic OU process for each basis function and stored We begin by generating a random phylogenetic tree T with in the respective row of X. Our modelling assumption is 128 tips, shown in figure 2. This fixes the experimental that the mixing coefficients for distinct basis functions f, 1 design for our simulation and inference, but further simu- f,f arestatisticallyindependentofeachother:inequation 2 3 lations given in the electronic supplementary material (1.1), this means that the rows of X are independent. It is confirm that the statistical performance of our methods is thereforesufficienttodescribethestochasticprocessgenerat- consistentacrossarangeofchoicesforT.Branchlengthdis- ing X, the ith row of X with i[f1;2;3g. We calculated the i tributions are surprisingly consistent across organisms [18]; mixing matrix at the 128 tip taxa so X is of size 3(cid:2)128. branch lengths were drawn from the empirical branch The ‘true’ ancestral values were established by generating length distribution (see electronic supplementary material, phylogenetic OU processes over the whole phylogeny. The sectionS1) extracted from TREEFAM v. 8.0[2]. values of this process at tip taxa were stored in a row (a) (b) 3 6 30 rs 4 20 if.ro y y 2 10 alsoc ie 0 0 typ u b −2 −10 lish in −4 −20 g.o rg J R (c) (d) S o c 6 6 In te 4 4 rfac e 1 2 2 0: y 2 0 1 0 0 21 0 3 2 −2 −2 −4 −4 0 0.2 0.4 0.6 0.8 1.0 0 0.2 0.4 0.6 0.8 1.0 x x Figure3.(a)originalbasissignals,f;(b)mixedsampleatthetips,d(fourindividualfunction-valuedtraitsareshown;redlineandgreybandshow,respectively, the mean and 2 s.d. for all 128 function-valued data at the tips); (d) IPCA basis,f^; (c) PCA basis. (Online version in colour.) vectorX (X isasimulationofthetiptaxamixingcoefficientsX Table 1. The fixed values used for the parameters in equation (2.1) to i i i excluding the non-phylogenetic variation), and its values at generate the mixing coefficients X. Each row constitutes a value of gi. ij internal taxawere stored in a row vector W for performance 6.17 and 2.06 correspond to 0.75 and 0.25 of the tree’s ‘ , respectively. i max analyses in §2.5. To simulate the additional effect of non- When i¼2, ‘i is not applicable, because there is no phylogenetic phylogenetic variation (e.g. due to measurement error or variationin the sample. environmental effects), independent (i.e. non-phylogenetic) variationwasaddedtoeachentryofX(cid:2)i: i si ‘i si f n X ¼X(cid:2) þe; i i i 1 2.5 6.17 0.5 where e is a 1(cid:2)128 vector of independent Gaussian errors 2 0 n.a. 1.0 i with mean 0 and standard deviation si; and, finally, the 3 1.5 2.06 0.5 n matrix multiplication in equation (1.1) was performed to obtainthesimulateddatad.The‘extant’function-valuedtrait qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi attiptaxonjisthusP3i¼1Xijfi(avectoroflength1024),whereas wheresif ¼ ðsiÞ2=2ai;DT(tj,tg)denotesthephylogeneticor the ancestral function-valued trait at internal taxon g is patristic distance (i.e. the distance in T) between the jth and P3i¼1Wigfi:Theancestralfunction-valuedtraitsthereforeexhi- gth tip taxa,sn isdefined as earlier,and bitonlythephylogeneticpartofsimulatedvariation,whereas (cid:3) the extant function-valued traits exhibit both phylogenetic de ¼ 1; iftj ¼tgand tjisatiptaxon; and non-phylogenetic variation. Of course, it is not possible tj;tg 0; otherwise to reconstruct non-phylogenetic variation using phylogenetic adds non-phylogenetic variation to extant taxa as discussed methods: we simulate non-phylogenetic variation only to earlier,i.e.deevaluatesto1onlyforextanttaxa,thuss quan- demonstratethatitdoesnotpreventthereconstructionofthe n tifies within-species genetic or environmental effects and phylogeneticpartofvariationforancestraltaxain§§2.2–2.5. measurement error in the ith mixing coefficient. We see We now comment on the specific parameters chosen for from equation (2.1) that the proportion of variation in the the phylogenetic OU processes mentioned earlier. As in row X attributable to thephylogenyis ðsiÞ2=ðsiÞ2þðsiÞ2: Hansen [19], we refer to the strength of selection parameter a i f f n In the Gaussian process regression (GPR) literature in and the random genetic drift s: we add superscripts to these machine learning, 1/2ais equivalent to ‘, the characteristic parameters to distinguish between the three different OU length-scale [20] of decay in the correlation function and processes. With this notation, the mixing coefficients for the in the following we work with the latter. For all of the OU row X havethefollowingcovariancefunction: i processes, we used characteristic length-scales relative to KTiðt1;t2Þ¼E½XijXig(cid:3) 8.22, the maximum patristic distance (‘max) between two ð2:1Þ extant taxa for our simulated tree (figure 2). The values we ¼ðsiÞ2expð(cid:4)2aiD ðt;t ÞÞþðsiÞ2de ; f T j g n tj;tg used are given in table 1. In particular, sif ¼0 when i¼2 and it follows that the characteristic length-scale ‘ plays no AGPmaybespecifiedbyitsmeansurfaceanditscovari- 4 role for this OU process, and equally we do not define the ancefunctionK(g),wheregisavectorofparameters.Because strength-of-selection parameteraiwhen i¼2. thecomponentsofgparametrizethepriordistribution,they rs arereferredtoashyperparameters.TheGPpriordistribution if.ro y a 2.2. Dimensionality reduction and source separation for is denoted lso c function-valued traits f (cid:2)Nð0;KðgÞÞ: iety p u b Givenadatasetdoffunction-valuedtraits,wewouldliketo If x* is a set of unobserved coordinates and x is a set of lis find appropriate estimates X^ and f^ of the mixing matrix X observed coordinates, then the posterior distribution of the hin g and the basis setf, respectively. The first task is to identify vector f(x*)giventheobservations f(x)is .o rg agoodlinearsubspaceSofthespaceofallcontinuousfunc- fðx(cid:5)ÞjfðxÞ(cid:2)NðA;BÞ; ð2:2Þ tionsbychoosingbasisfunctionsappropriately.Thepurpose J R is to work, not with the function-valued data directly, but where So c withtheirprojectionsinS.Wemaysaythatthechosensub- A¼Kðx(cid:5);x;gÞKðx;x;gÞ(cid:4)1fðxÞ; ð2:3Þ In spaceSisgood,iftheprojecteddataapproximatetheoriginal terfa data well, whereas the number of basis functions is not and ce 1 unnecessarily large so that S has the ‘effective’ dimension B¼Kðx(cid:5);x(cid:5);gÞ(cid:4)Kðx(cid:5);x;gÞKðx;x;gÞ(cid:4)1Kðx(cid:5);x;gÞT ð2:4Þ 0: of thedata. 20 1 Wethenfacealinearinverseproblem:giventhedatasetd and K(x*, x,g) denotesthe jx*j(cid:2)jxj matrix of the covariance 21 0 of function-valued traits, the task is to generate estimates function K evaluated at all pairs x(cid:5)i [X(cid:5);xj [X: Equations 32 X^ and f^ (equation (1.1)). This task is also known as source (2.3) and (2.4) convey that the posterior mean estimate will separation [21], which has a variety of implementations be a linear combination of the given data and that the pos- making different assumptions about the basis fand mixing terior variancewill be equal to the prior variance minusthe coefficients X. One widely used approach is principal com- amount that can be explained by the data. Additionally, the ponents analysis (PCA) [22], which returns orthogonal sets log-likelihood of thesamplef(x)is of basis functions to explain the greatest possible variation. 1 PCA has been extended to take account of phylogenetic logpðfðxÞjgÞ¼(cid:4) fðxÞTKðx;x;gÞ(cid:4)1fðxÞ 2 relationships [23], however, ifa sample of functions isgener- 1 jxj ated by mixing non-orthogonal basis functions, the PCs of (cid:4) logðdetðKðx;x;gÞÞÞ(cid:4) log2p: ð2:5Þ 2 2 the sample (whetheror not theyaccount for phylogeny) will notequalthebasiscurves,duetotheassumptionoforthogon- Itcanbeseenfromequation(2.5)thattheMLEissubjectboth tothefititdelivers(thefirstterm)andthemodelcomplexity ality(figure3).Intheindependentcomponentanalysis(ICA), the alternative assumption is made that the rows X of X are (thesecondterm).Thus,GPRis non-parametricinthesense i statistically independent. This assumption fits more naturally thatnoassumptionismadeaboutthestructureofthemodel: the more data gathered, the longer the vector f(x), and the with our modelling assumptions, because we assume that the rows X are mutually independent [21]. ICA has proved moreintricatetheposterior modelforf(x*). i fruitful in other biological applications [24] as has passing PGPR extends the applicability of GPR to evolved func- tion-valued traits. A PGP is a GP indexed by a phylogeny theresultsofPCAtoICA,whichhasbeentermedIPCA[14]. PCA is an appropriate tool for identifying the effective T, where the function-valued traits at each pair of taxa are dimension of a high-dimensional dataset [25]. Therefore, to conditionally independent given the function-valued traits of their common ancestors. When the evolutionary process achieve both dimension reduction and source separation, we first applied PCA to the dataset d (the 128 function-valued has the same covariance function along any branch of T traitsatthetipsofT)todeterminetheappropriatenumberof beginning at its root (called the marginal covariance function), basisfunctions.ThePCswerethenpassedtotheCubICAimple- these assumptions are sufficient to uniquely specify the covariance function of the PGP, K . As we assume that T is mentation of ICA [26]. CubICA returned a new set of basis T functions(figure3d)thatweretakenastheestimatedbasisf^. knowninourinverseproblem,theonlyremainingmodelling choice is therefore the marginal covariance function. As can be seen from equation (2.1), K is a function of patristic dis- 2.3. Phylogenetic Gaussian process regression tances on the tree rather than Euclidean distances as ICAalsoreturnstheestimatedmixingcoefficientsattiptaxa, standardin spatialGPR. X^: Our next step was to perform PGPR [13] separately In comparative studies, where one has observations at on each row X^ ; assuming knowledge of the phylogeny T, thetipsofT,thecovariancefunctionK maybeusedtocon- i T in order to obtain posterior distributions for all mixing struct a GP prior for the function-valued traits, allowing coefficientsthroughout thetree T. functionalregression.Inthemodelthatweuse,thisisequiv- GPR [20] is a flexible Bayesian technique in which prior alent to specifying a Gaussian prior distribution for the distributions are placed on continuous functions. Its range mixing coefficients Y and X . This may be carried out by ij ij of priors includes the Brownian motion and OU processes, regardingtherowvectorsY andX asobservationsofauni- i i whicharebyfarthemostcommonlyusedmodelsofcharac- variatePGP.AsnotedinJones&Moriarty[13],ifweassume ter evolution [15,27]. Its implementation is particularly that the evolutionary process is Markovian and stationary, straightforward, because the posterior distributions are also thenthemodellingchoicevanishes,andthemarginalcovari- GPs and have closed forms. We now give a brief exposition ance function is specified uniquely: it is the stationary OU of GPR, using notation standard in the machine learning covariance function. If we also add explicit modelling of literature[20]. non-phylogenetically related variation at the tip taxa, the sˆ–s lˆ–l sˆ–s 5 f f n n s l s f n rs 2.0 if.roy a ls o c 1.5 iety ensity 1.0 publish d in g .o 0.5 rg J R 0 S o –2 –1 0 1 2 –2 –1 0 1 2 –2 –1 0 1 2 c In relative error relative error relative error te rfa c Figure 4. Kernel density estimates of the relative errors in 1024 runs of the gestimation procedure, each time for a different tree, a different set of mixing e 1 coefficients and a different set of parameters in g; no components of gare assumed to be known beforehand. Estimation results are commented on in §3. 0: 2 The median values shown by the dotted line are (20.073, 20.131 and 0.001), respectively. 01 2 1 0 3 2 univariate prior covariance function has the unique func- Table 2. The bagging estimates for the hyperparameters in equation (2.1) tional form presented in equation (2.1). We do not assume (standard deviations of bagging estimates in parentheses). Each row knowledge of the parameters of equation (2.1), however, corresponds to a given estimate of the vector g^i: These estimates provide theirestimationisthe subject of §2.4. the maximum-likelihood value for equation (2.5) and are comparable with theoriginal ones from table 1. 2.4. Hyperparameter estimation i s^i ‘^i s^i BecausetheposteriordistributionsreturnedbyPGPRdepend f n on the hyperparameter vector g, we must estimate g in 1 3.41 (0.62) 2.83 (0.47) 0.78 (0.47) order to reconstruct ancestral function-valued traits, and the 2 0.55 (0.33) 0.05 (0.02) 0.84 (0.34) estimation procedure should correct for the dependence 3 2.83 (0.33) 2.06 (0.50) 0.73 (0.29) owing to phylogeny. MLE of the phylogenetic variation, non-phylogenetic variation and characteristic length-scale hyperparameterssi;si and‘i,respectively,maybeattempted f n patristic distances). Commenting further on this matter, numerically using the explicit prior likelihood function exceptionally small characteristic length-scales relative to the (equation(2.5)).Becauseestimatingsi and‘ialoneischallen- f tree patristic distances, as seen here, practically suggest ging[16](althoughtheestimationimprovessignificantlywith taxa-specific phylogenetic variation, i.e. non-phylogenetic increasedsamplesize),andwehavefurtherincreasedthechal- variation. This holds also in its reverse: exceptionally large lenge byintroducingnon-phylogenetic variation, we propose characteristic length-scales suggest a stable, non-decaying animprovedestimationprocedureusingthemachinelearning variation across the examined taxa that is indifferent to techniquebagging[17],whichamemberoftheboostingframe- their patristic distances, again suggesting the absence of work [22]. We show that these estimates may be further phylogeneticvarianceamong thenodes. improved if one knows the value of the ratio (s)2/(s)2, f n To assess the robustness of this hyperparameter esti- whichiscloselyrelatedtoPagel’sl[28]. mation method, we performed 1024 simulations, randomly Bagging (bootstrap aggregating) seeksto reduce the var- regenerating the tree and parameter vectorgeach time (see iance of an estimator by generating multiple estimates and electronic supplementary material, section S2). The accuracy averaging. It is simple to implement given an existing esti- of these estimates is shown in figure 4. Improved results mation procedure: one adds a loop front end that selects a when the ratio (s)2/(s)2 is known a priori (e.g. through f n bootstrap sample and sends it to the estimation procedure knowledge of Pagel’s l) are also given in the electronic and a back end that aggregates the resulting estimates [17]. supplementary material, sections S2 and S3. Our ultimate Wegenerated100(sub)treesof100taxabysamplingwithout aim is ancestor reconstruction rather than hyperparameter replacementouroriginal128taxatree,obtainedtheMLEfor estimationper se, andthis isthe subject of §2.5. goneachsubtree,andaveragedtheseestimatestoobtainthe aggregated estimateg^: Our results are shown in table 2: for 2.5. Ancestor reconstruction i¼1 and i¼3, given our moderate sample size (128 taxa), the accuracy of these results is at least in line with the state Having generated function-valued data (§2.1), extracted of the art [16] despite the additional challenge posed by mixing coefficients X^ (§2.2) and performed hyperparameter non-phylogenetic variation. For i¼2, where phylogenetic estimation (§2.4), we may now perform PGPR (§2.3) on variation is absent from the generative model (si ¼0), our eachrowX^ ;toobtaintheunivariateGaussianposteriordis- f i estimation procedure indicates its absence by returning esti- tributionforthemixingcoefficientWit(cid:5) atanyinternaltaxon matesfor ‘iwhose magnitude isunrealisticallysmallfor the t*.Asdiscussedin§2.3,theGPpriordistributionhascovari- examined tree (less than the first percentile of the tree’s ancefunction(equation(2.1)).Wehaveassessedtheaccuracy (a) (b) (c) 6 10 rs 5 if.roy a ls o c y 0 ie ty p u b –5 lis h in g –10 .o rg J 0 0.2 0.4 0.6 0.8 1.0 0 0.2 0.4 0.6 0.8 1.0 0 0.2 0.4 0.6 0.8 1.0 R S x x x o c In te Figure5.Posteriordistributionsatthreepointsinthephylogenyusingtheestimatedf^ andg^.Thepredictionmadebytheregressionanalysisisshownviathe rfa c e posteriormean(redline),thecomponentofposteriorvarianceduetophylogeneticvariation(2s.d.,darkgreyband)andnon-phylogeneticvariation(2s.d.,light 1 0 greyband).Theblacklineshowsthesimulateddataenablingvisualvalidationoftheancestralpredictions.(c),theblacklineisthetrainingdataatatiptaxonthe :2 0 redlineanddarkgreybandrepresenttheposteriordistributionofitsphylogeneticcomponent,whereasthelightgreybandrepresentstheestimatedmagnitudeof 1 2 1 non-phylogeneticvariation.Therootandinternaltaxonherearethesameasthoseindicatedinfigures2and3,andthetipisthesecondfrombottomonthesame 0 3 2 figure. (a) root estimate; (b) node estimate; and (c) tip estimate. (Online version in colour.) of our bagging estimate g^ in §2.4 and we now substitute this(admittedlysimulatedandhighlycontrolled)setting,we g^i into equation (2.1). Taking a simple and direct approach, canreason effectivelyaboutancestral function-valued traits. our estimate f^ obtained in §2.2 may then be substituted into equation (1.1) to obtain the function-valued posterior distribution ft(cid:5) for the function-valued trait at taxon t*. 3. Discussion Becauseourestimatedbasisfunctionsarestorednumerically In §2.1, we have appealed to equation (1.1) in the setting of as vectors of length 1024, this gives the same discretion for mathematicalinverseproblemswhere,givendatad,thechal- theancestral traits. Conditioning on ourestimated mixing coefficients X^ for lengeistoinferaforwardoperatorGandmodelfsuchthat i thetip taxa, theposteriordistribution of Wit(cid:5) is d¼GðfÞ; ð3:1Þ Wit(cid:5)(cid:2)NðA^i;B^iÞ; and such problems are typically under-determined and requireadditionalmodellingassumptions[29].Givenaphy- where the vector A^ and matrix B^ are obtained from logeny T and function-valued data d at its tips, we wish to i i equations (2.3) and (2.4), taking x¼X^i; x(cid:5)¼Wit(cid:5) and g^i, infer theforwardoperatorGTand modelfsuch that respectively,forourobservationcoordinates,estimationcoor- d¼G ðfÞ: ð3:2Þ T dinates and hyperparameter vector. Because our prior assumptionisthattherowsofXarestatisticallyindependent Whenthedatadareasmallnumberofcorrelatedfactors of each other, itfollowsfromequation (1.1)that pertiptaxon,avarietyofstatisticalapproachesareavailable [30,31]. When the data are functions, the PGPs [13,32] have ! ft(cid:5) (cid:2)N Xk A^if^i;Ski¼1f^iTB^if^i : ð2:6Þ been proposed as the forward operator and this is the approach we have takeninthis work. i¼1 Our dimensionality-reduction methodology in §2.2 The marginal distributions of this representation (mean can be easily varied or extended. For example, any suitable and standarddeviation) areshown infigure5. implementation of PCA may be used to perform the initial ^ Figure5comparesthefunction-valuedestimatesft(cid:5) tothe dimension reduction step: in particular, if the data have an simulated function-valued traits at the root (figure 5a), irregular design (as happens frequently with function- an internal node (figure 5b), and at a tip (figure 5c). In valued data), the method of Yao et al. [33] may be applied figure 5a,b, the simulated function-valued data are shown to account for this; the ICA step then proceeds unchanged. inblack,andcanbeseentypicallytoliewithintwoposterior WealsonotethatwhilewefindtheCubICAimplementation standard deviations. In figure 5c, the black line is the ofICAtobethemostsuccessfulinoursignalseparationtask, observed function-valued trait at that tip: the red line and otherimplementationssuchasFastICA[21]orJADE[34]can darkgreybandrepresenttheposteriordistributionofitsphy- also be used. In general, ICA gives rows X^ of the estimated i logenetic component, and the light greyband representsthe mixing matrix that are maximally independent undera par- estimated magnitude of the additional non-phylogenetic ticular measure of independence involving, for example, variation.Uncertaintyoverthephylogeneticpartofvariation higher sample moments or mutual information, in order to (dark grey band) decreases from root to tip, as all obser- approximatethesolution of theinverseprobleminequation vations are at the extant tip taxa. We note that the posterior (1.1) under our assumption of independence between the distributions,evenattheroot,putclearstatisticalconstraints rows of X. PCA and ICA have different purposes (respect- onthephylogeneticpartofancestralfunction-valueddata:in ively, orthogonal decomposition of variation and separation of independently mixed signals) and we use them sequen- of thefunction-valued traitis 7 tially in IPCA. IPCA is non-parametric and, in particular, bmoetahnsdtihstartibuuntiloiknealPlyCAa,ndIPCpAhyilsogroenbuetsitcatlolynoang-nGoasutiscs.iaTnhitiys E½fðx1Þfðx2Þ(cid:3)¼X3 ððsfiÞ2þðsniÞ2Þfiðx1Þfiðx2Þ: ð3:4Þ rsif.ro i¼1 y a in the data and, unlike phylogenetically corrected PCA, ls o IPCA is robust to mis-specification of the phylogeny and to The estimates of sf and sn obtained in §2.4 may be cie mixed phylogenetic and non-phylogenetic variation in the substituted into equatioin (3.4) tio obtain an estimate of the typu data:anyof these can befeatures of biologicaldata. autocovariance of the function-valued traits under study. blis It can be seenin figure4 thattheestimation of ‘is more This estimate hasthe attractions both of being positive defi- hin g bchiaasllaenngdinvgaritahnacne.thTehiessctoimrraetsipoonnodfsstontohresdfo,chuamvienngtegdrdeaiftfeir- nite(byconstruction)and oftakingphylogenyintoaccount. .org Variousframeworksexistthatcouldbeusedtogeneralize culty of estimating the parameter a in the OU model, themethodpresentedin§2.4,tomodelheterogeneityofevol- JR particularly for smaller sample sizes. Our work on hyper- utionary rates along the branches of a phylogeny [36] or for So parameter estimation in §2.4 mitigates these difficulties due c multiple fixed [15] or randomly evolving [16,37] local In tbooostmstarallpsaomuprlessaimzepl[e1.6,3S5o]mbeywuhasitngunbiangtguiinvgelyin, obradgegrintgo ocepstsim,taheofopthtiemmuimxintrgaictovefafliucieenaptsp.eFaorrstohnelystaintiothnearmyeOaUn,parnod- terface ‘works’ exactly because the subsampleg^ estimates are vari- 1 notinthecovariancefunction,andsodoesnotplayaroleas 0 : able and thus we avoid overfitted final estimates (see a parameter in GPR [20]. We have not implemented such 20 electronicsupplementarymaterial,sectionS2).Conceptually, extensionshere,effectivelyassumingthatasinglefixedopti- 121 our work on hyperparameter estimation, when taken mum is adequate for each mixing coefficient. Nonetheless, 032 together with §2.2, relates to the character process models our framework is readily extensible to include such effects, ofPletcher&Geyer[8]andorthogonalpolynomialmethods either implicitlythrough branch-length transformations [38], of Kirkpatrick & Heckman [5], which give estimates for or explicitly by replacing the OU model with the more the autocovariance of function-valued traits. Writing out general Hansen model[37]. equation (1.1) for a single function-valued trait (at the jth RcodefortheIPCA,ancestralreconstructionandhyper- tip taxon, say),ourmodel may be viewedas parameter estimation is available from https://github. X3 X3 com/fpgpr/. fðxÞ¼ g fðxÞþ e fðxÞ; ð3:3Þ ij i ij i i¼1 i¼1 P.Z.H.andD.A.S.weresupportedbyanEPSRCInstitutionalSpon- sorship award to the University of Manchester and the EPSRC where the mixing coefficient X has been expressed as the ij Mathematics Platform Engagement activity grant. C.G.K. was sup- sum of g , the genetic (i.e. phylogenetic) part of variation, ij ported by a fellowship from the Wellcome Trust. The authors plus eij, the non-phylogenetic (e.g. environmental) part of express their gratitude to John A. D. Aston for his encouragement variation,justasinthesereferences.Then,theautocorrelation andscientificadvice. References 1. MaddisonDR,SchulzKS.2007Thetreeoflifeweb TrendsEcol.Evol.27,637–647.(doi:10.1016/j.tree. onphylogenies.J. R.Soc. Interface 10, 20120616. project. Seehttp://tolweb.org. 2012.07.002) (doi:10.1098/rsif.2012.0616) 2. Li Het al. 2006TREEFAM:acurateddatabaseof 8. PletcherSD,GeyerCJ.1999Thegeneticanalysisof 14. Yao F, Coquery J, Le CaoKA.2012Independent phylogenetictreesofanimalgenefamilies.Nucleic age-dependent traits:modelingthe character principalcomponent analysisfor biologically Acids Res.34,D572–D580. (doi:10.1093/nar/ process. Genetics153,825–835. meaningfuldimensionreductionoflargebiological gkj118) 9. KingsolverJG, Gomulkiewicz R,Carter PA.2001 data sets. BMCBioinformatics13,13–24. (doi:10. 3. YangZ, RannalaB. 2012Molecular phylogenetics: Variation,selectionandevolutionoffunctionvalued 1186/1471-2105-13-13) principles and practice. Nat.Rev.Genet.13, 303– traits.Genetica 112–113, 87–104.(doi:10.1023/ 15. ButlerMA,KingAA.2004Phylogeneticcomparative 314.(doi:10.1038/nrg3186) A:1013323318612) analysis: amodelling approach foradaptive 4. CheverudJM,DowMM, Leutenegger W.1985The 10. Zhang Z, Mu¨llerHG.2011Functional density evolution. Am.Nat. 164, 683–695.(doi:10.1086/ quantitativeassessmentofphylogeneticconstraints synchronization. Comput. Stat.Data Anal.55, 426002) incomparativeanalyses:sexualdimorphisminbody 2234–2249. (doi:10.1016/j.csda.2011.01.007) 16. BeaulieuJM,Jhwueng DC, BoettigerC, O’earaBC. weightamongprimates.Evolution39,1335–1351. 11. Moss SP,Joyce DA, HumphriesS, Tindall KJ, 2012Modelingstabilizingselection:expandingthe (doi:10.2307/2408790) Lunt DH. 2011Comparativeanalysisof Ornstein–Uhlenbeck model ofadaptiveevolution. 5. KirkpatrickM, Heckman N. 1989A quantitative teleostgenome sequencesrevealsan ancient Evolution8,2369–2383.(doi:10.1111/j.1558-5646. genetic modelfor growth,shape, reaction norms, intronsizeexpansionin the zebrafish lineage. 2012.01619.x) and other infinite-dimensional characters.J. Math. Genome Biol.Evol. 3, 1187–1196.(doi:10.1093/ 17. BreimanL. 1996Bagging predictors.Mach.Learn. Biol.27, 429–450.(doi:10.1007/BF00290638) gbe/evr090) 24,123–140. 6. TheFunctionalPhylogenies Group.2012 12. KnightCG,KassenR,HebestreitH,RaineyPB.2004 18. Venditti C,MeadeA, Pagel M. 2010Phylogenies Phylogenetic inferencefor function-valued Global analysisofpredicted proteomes: functional revealnewinterpretationofspeciationandthered traits:speechsoundevolution.TrendsEcol.Evol.27, adaptation ofphysicalproperties.Proc. Natl Acad. queen.Nature463,349–352.(doi:10.1038/ 160–166. (doi:10.1016/j.tree.2011.10.001) Sci. USA101, 8390–8395.(doi:10.1073/pnas. nature08630) 7. Stinchcombe JR,Kirkpatrick M, Function-valued 0307270101) 19. HansenT. 1997Stabilizingselection and the traits working group 2012Phylogenetic inference 13. JonesNS,MoriartyJ.2013Evolutionaryinferenceor comparativeanalysisofadaptation. Evolution 51, for function-valued traits:speech soundevolution. function-valued traits:Gaussian processregression 1341–1351.(doi:10.2307/2411186) 20. Rasmussen CE, Williams CKI.2006Gaussian 27. HansenT,Martins E.1996Translatingbetween 33. Yao F, Mu¨llerHG, Wang JL. 2005Functionaldata 8 processesformachinelearning.Cambridge,MA:MIT microevolutionaryprocessand macroevolutionary analysisfor sparselongitudinal data.J. Am.Stat. Press. patterns: thecorrelation structureofinterspecific Assoc.100, 577–590. (doi:10.1198/ rs 21. Hyva¨rinenA,OjaE.2000Independentcomponent data.Evolution 50,1404–1417. (doi:10.2307/ 016214504000001745) if.ro y analysis:algorithmsandapplications.NeuralNetw.13, 2410878) 34. CardosoJF. 1999High-ordercontrastsfor als o 411–430.(doi:10.1016/S0893-6080(00)00026-5) 28. PagelM.1997Inferringevolutionaryprocessesfrom independent componentanalysis. Neural cie 22. BishopC. 2006Pattern recognition andmachine phylogenies.Zool.Scr.26,331–348.(doi:10.1111/j. Comput.11, 157–192. (doi:10.1162/089976 typ u learning. Berlin,Germany: Springer. 1463-6409.1997.tb00423.x) 699300016863) b lis 23. RevellLJ. 2009Size-correctionand principal 29. JaynesET.1984Priorinformationandambiguityin 35. Collar DC, O’MearaBC, WainwrightPC, Near TJ. hin g componentsfor interspecific comparativestudies. inverse problems. Inverse Probl.14,151–166. 2009Piscivorylimits diversification offeeding .o Evolution 63, 3258–3268.(doi:10.1111/j.1558- 30. Salamin N,WuestRO, Lavergne S, Thuiller W, morphologyin centrarchid fishes. Evolution 63, rg 5646.2009.00804.x) Pearman PB.2010Assessing rapid evolution in a 1557–1573.(doi:10.1111/j.1558-5646.2009. J R 24. Scholz M, Gatzek S, SterlingA, FiehnO, Selbig J. changingenvironment.TrendsEcol.Evol.25,692– 00626.x) S o 2004Metabolitefingerprinting:detectingbiological 698.(doi:10.1016/j.tree.2010.09.009) 36. Revell LJ,Collar DC. 2009Phylogenetic analysisof c In features byindependent componentanalysis. 31. HadfieldJD,NakagawaS.2010Generalquantitative theevolutionarycoreallyusinglikelihood.Evolution te Bioinformatics 20,2447–2454. (doi:10.1093/ genetic methods forcomparativebiology: 63–64,1090–1100. (doi:10.1111/j.1558-5646. rfac e bioinformatics/bth270) phylogenies,taxonomiesandmulti-traitmodelsfor 2009.00616.x) 1 0 : 25. MinkaTP.2000Automaticchoiceofdimensionality continuous and categoricalcharacters.J.Evol.Biol. 37. HansenT,PienaarJ,OrzackSH.2008Acomparative 2 0 for PCA. Adv.Neural Inf.Process. Syst. 13, 514. 23,494–508.(doi:10.1111/j.1420-9101.2009. methodfor studying adaptation toarandomly 12 1 26. BlaschkeT,Wiskott L.2004CuBICA:independent 01915.x) evolvingenvironment. Evolution 8,1965–1977. 03 2 component analysisbysimultaneousthird- and 32. Kerr M. 2012Evolutionaryinferencefor functional (doi:10.1111/j.1558-5646.2008.00412.x) fourth-ordercumulant diagonalization.IEEE Trans. data:using Gaussian processes onphylogenies of 38. Pagel M. 1999Inferringthe historicalpatterns of SignalProcess.52, 1250–1256.(doi:10.1109/TSP. functional data objects.MSc Thesis,Universityof biologicalevolution.Nature401,877–884.(doi:10. 2004.826173) Glasgow, Glasgow, UK. 1038/44766)

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.