GBE

Dynamics and Adaptive Benefits of Protein Domain Emergence and Arrangements during Plant Genome Evolution

Anna R. Kersting, Erich Bornberg-Bauer, Andrew D. Moore*, and Sonja Grath*

Evolutionary Bioinformatics Group, Institute for Evolution and Biodiversity, University of Muenster (WWU), Germany

*Corresponding author: E-mail: [email protected]; [email protected].

Accepted: 9 January 2012

Abstract

Plant genomes are generally very large, mostly paleopolyploid, and have numerous gene duplicates and complex genomic features such as repeats and transposable elements. Many of these features have been hypothesized to enable plants, which cannot easily escape environmental challenges, to rapidly adapt. Another mechanism, which has recently been well described as a major facilitator of rapid adaptation in bacteria, animals, and fungi but not yet for plants, is modular rearrangement of protein-coding genes. Due to the high precision of profile-based methods, rearrangements can be well captured at the protein level by characterizing the emergence, loss, and rearrangements of protein domains, their structural, functional, and evolutionary building blocks. Here, we study the dynamics of domain rearrangements and explore their adaptive benefit in 27 plant and 3 algal genomes. We use a phylogenomic approach by which we can explain the formation of 88% of all arrangements by single-step events, such as fusion, fission, and terminal loss of domains. We find many domains are lost along every lineage, but at least 500 domains are novel, that is, they are unique to green plants and emerged more or less recently. These novel domains duplicate and rearrange more readily within their genomes than ancient domains and are overproportionally involved in stress response and developmental innovations. Novel domains more often affect regulatory proteins and show a higher degree of structural disorder than ancient domains. Whereas a relatively large and well-conserved core set of single-domain proteins exists, long multi-domain arrangements tend to be species-specific.
We find that duplicated genes are more often involved in rearrangements. Although fission events typically impact metabolic proteins, fusion events often create new signaling proteins essential for environmental sensing. Taken together, the high volatility of single domains and complex arrangements in plant genomes demonstrates the importance of modularity for environmental adaptability of plants.

Key words: plant genome evolution, modular evolution, whole-genome duplication, evolution of stress response.

© The Author(s) 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Genome Biol. Evol. 4(3):316–329. doi:10.1093/gbe/evs004. Advance Access publication January 16, 2012.

Introduction

The wealth of genomic data has enabled a number of insightful studies on genome evolution. To date, most studies have concentrated on gene duplications, gene family expansion or reduction, selective sweeps or signals of selection using site-based statistics. An alternative approach to studying genome evolution utilizes the modular nature of proteins. Most proteins are composed of one or many protein domains, which are the units of protein structure, function, and evolution (Söding and Lupas 2003; Moore et al. 2008). The majority of proteins can be described using a small set of domains, which, despite the ever-increasing amount of available sequence data, grows at only moderate speed. In contrast, the number of domain arrangements, that is, the combination of these domains in proteins, continues to rapidly grow (Levitt 2009; Yang et al. 2009). The study of domain rearrangements across large phyla has provided a detailed understanding of modular protein evolution (Björklund et al. 2005; Ekman et al. 2007; Fong et al. 2007; Wang and Caetano-Anolles 2009; Yang et al. 2009) and has demonstrated that domain rearrangements, paired with the occasional formation of novel domains (Moore and Bornberg-Bauer 2012), create an enormous degree of protein diversity (Apic et al. 2001; Levitt 2009; Yang et al. 2009). The majority of eukaryotic proteins have more than one domain (Apic et al. 2001; Ekman et al. 2005; Yang et al. 2009), and while many domains are found in few arrangements, only few domains are versatile and form a wide array of different arrangements (Weiner et al. 2008; Cohen-Gihon et al. 2011).

Rearrangement events at the protein level are easy to detect, and the key mechanisms are thought to be fusion, fission, and terminal deletion (Björklund et al. 2005; Weiner et al. 2006). These events are likely fueled by a series of underlying genetic events such as nonallelic homologous recombination, exon-shuffling, nonhomologous end joining or transposition (Babushok et al. 2007; Buljan et al. 2010). However, with few exceptions (e.g., Oshima et al. 2010), traces of the genetic mechanisms of rearrangement swiftly decay. Buljan et al. (2010) explored the genetic events that facilitate domain gain events to existing arrangements. Their results provide support to the notion that domains are typically added at either terminus. The key mechanism for such domain gain events involves the joining of exons between genes or terminal exon extension. The study of domain content evolution in eukaryotes has illustrated that domain loss and gain are frequent events (Moore and Bornberg-Bauer 2011; Zmasek and Godzik 2011). Whereas lost domains tend to be of catalytic nature, gained domains tend to be regulatory.

Despite the diverse studies that have explored modular evolution across many species as well as in restricted clades, to date no study has quantitatively addressed the topic of modularity in a set of plant species. However, modular evolution may be of particular importance for plants, as they face a challenge that many other species do not: they cannot easily evade environmental changes because of their sessile nature. In particular the fusion of genes, and consequently of domain arrangements, allows for "jumps" in protein evolution and may govern truly novel genetic phenotypes. Hence such fusion proteins may exhibit great adaptive potential. Indeed, recent findings suggest that chimeric genes formed by gene fusion can be found in regions of selective sweeps (Rogers and Hartl 2012).

Fusion events have been shown to be associated with regulatory proteins such as the metazoan bHLH transcription factors (Amoutzias et al. 2005) or the MIKC-type MADS-box transcription factor proteins in plants (Veron et al. 2007; Shan et al. 2009). Innovation of transcription factor families is often the result of duplication events, which may occur in chromosomal regions with high recombination rates. Furthermore, it has been illustrated that duplication events in combination with high recombination rates are strong forces in genome evolution (Lang et al. 2010).

Duplications have been more frequently described for plants than elsewhere and plant genome evolution is special in several aspects. First, plant genomes are repeat-rich and transposable elements have a particularly prominent role in creating retrocopies of genes, for example, in monocots (Bennetzen 2005; Baucom et al. 2009; Baucom, Estill et al. 2009). Second, several whole-genome duplication (WGD) events have created many large genomes with various degrees of ploidy within a relatively short period of time. Thirty-five percent of all vascular plants are recent polyploids (Wood et al. 2009). Moreover, angiosperms have undergone up to four rounds of WGD in roughly 320 Myr, with one WGD common to all seed plants 319 Ma and one WGD common to all angiosperms 192 Ma (van de Peer et al. 2009; Jiao et al. 2011). Although polyploidy events pose a genomic challenge to their host and most polyploidy events are considered a "dead end" for evolution (Mayrose et al. 2011), it has been suggested that polyploidy, be it the result of autopolyploidy or allopolyploidy, may occasionally provide a starting point for evolutionary innovation (Freeling et al. 2006; van de Peer et al. 2009). The benefit of an increased amount of genetic material might be to allow for swift adaptation to extreme environments (van de Peer et al. 2009). For example, the increased heterozygosity resulting from polyploidy impacts the wiring of signaling cascades and can facilitate strong variation in gene expression (Osborn et al. 2003). Numerous studies have also explicitly explored the impact of WGD in plants at the genomic level, for example, by exploring duplicate retention rates (Hanada et al. 2008; Tang et al. 2008; Zheng et al. 2009), gene dosage effects (Freeling et al. 2006; Misook et al. 2007; Bekaert et al. 2011), or recombination rates (Akhunov et al. 2003). WGDs may enhance the potential for diversification and speciation (van de Peer et al. 2009), yet the details remain poorly understood.

As genomic stability is largely influenced by genome size and repeat content (Bennetzen 2005), one might speculate that plants have high rates of recombination and hence exhibit a high number of domain rearrangements. Indeed, comparative studies have illustrated that angiosperms exhibit higher recombination rates than vertebrates (Kejnovsky et al. 2009). However, to date, no study has explored the extent of modular protein evolution in plants.

Given their large genome size, higher recombination rates, and the inability to flee upon environmental challenges, it seems likely that plants may utilize their abundant genomic material to facilitate rapid evolutionary innovation. Consequently, the benefits of modular domain rearrangements might be particularly pronounced, since the ability of modular evolution to swiftly implement changes to the protein repertoire may be a key process in both exploiting existing and creating new functionalities. So far, all studies on the evolutionary dynamics and the adaptive potential of domain rearrangements have been reported for bacteria (Enright and Ouzounis 2001), metazoa (Ekman et al. 2007), or fungi (Cohen-Gihon et al. 2011), but none for plants.

In this report, we explore the nature of modular evolution in 29 green plant species (Viridiplantae) with taxa ranging from green algae to liliopsida and eudicotyledons. Our aim is to understand the evolutionary dynamics by studying the frequency of individual modular events such as fusion, fission, or terminal loss. We apply a maximum parsimony-based approach to reconstruct events placing this study into a phylogenomic framework and quantitatively address the role of domain emergence and domain rearrangements. Furthermore, we explore the speed with which new domains, and their arrangements, are gained and lost; how many of these events are clade or species-specific and whether event "hotspots" can be found amongst the phylogenies of the considered species. Finally, we employ several functional analyses based on the Gene Ontology (GO) classification (Ashburner and Lewis 2002) to shed light on the potential adaptive benefits of domain emergence and rearrangements during plant genome evolution.

Materials and Methods

Proteomes and Domain Annotation

Comparative analyses of protein domains and their arrangements were performed on the following 29 plant genomes: Arabidopsis thaliana v9.0 (The Arabidopsis Initiative 2000); Arabidopsis lyrata v1.0 (Hu et al. 2011); Carica papaya v1.0 (Ming et al. 2008); Citrus sinensis v1.0 (Sweet Orange Genome Project 2010); Citrus clementina v0.9 (Haploid Clementine Genome International Citrus Genome Consortium 2011); Eucalyptus grandis v1.0 (Eucalyptus grandis Genome Project 2010); Mimulus guttatus v1.1; Aquilegia coerulea; Theobroma cacao v1.0 (Argout et al. 2011); Glycine max v1.0 (Schmutz et al. 2010); Medicago truncatula v3.0 (Young et al. 2005); Lotus japonicus v1.0 (Young et al. 2005); Populus trichocarpa v2.0 (Tuskan et al. 2006); Ricinus communis v1.0 (Chan et al. 2010); Manihot esculenta v1.1; Malus domestica (Velasco et al. 2010); Prunus persica v1.0 (International Peach Genome Initiative 2010); Cucumis sativus v1.0 (Huang et al. 2009); Vitis vinifera v1.0 (Jaillon et al. 2007); Setaria italica v2.0 (Setaria italica Genome Sequencing Project 2011); Zea mays v4a.53 (Schnable et al. 2009); Sorghum bicolor v1.4 (Dubchak et al. 2009); Oryza sativa v6.1 (Go et al. 2002); Brachypodium distachyon v1.0 (Vogel et al. 2010); Phoenix dactylifera v2.0 (Al-Dous et al. 2011); Selaginella moellendorffii v1.0 (Banks et al. 2011); Physcomitrella patens v1.5 (Rensing et al. 2008); Chlamydomonas reinhardtii v4.0 (Merchant et al. 2007); Ostreococcus lucimarinus v2.0 (Palenik et al. 2007); and Micromonas pusilla v3.0 (Worden et al. 2009).

We rooted the tree ~1,700 Ma by including Trichoplax adhaerens v1.0 (Srivastava et al. 2008), Rhizopus oryzae (Ma et al. 2009) and Drosophila melanogaster v5.11 (Adams et al. 2000). Phylogenetic relationships for all 32 species (29 plants and 3 outgroups) used for this study are given in supplementary figure 1 (Supplementary Material online).

If several splice variants were present for one protein, we excluded all but the longest transcript. All proteomes were scanned for domains with the pfam_scan utility and HMMER3.0 (Eddy 2011) against the Pfam-A and Pfam-B models obtained from Pfam (v.24) (Finn et al. 2008). For the annotation of Pfam-A domains, we used the model-defined gathering threshold and query sequences were required to match at least 30% of the defining model (Buljan et al. 2010). Pfam-B domains were annotated using an E value cutoff of 0.001 (Ekman et al. 2007). Pfam-A domains with clan membership were mapped to their clans and domains of type "repeat" or "motif" were collapsed into one large domain instance (Ekman et al. 2005; Forslund et al. 2007).

Reconstruction of the Ancestral Domain State; Domain Gain, Loss, and Emergence

We reconstructed ancestral domain contents using a maximum parsimony approach as follows: the tree (see fig. 1B) was traversed twice, first from leaves to root then from root to leaves. Domain presence or absence is determined by majority rule. During the first traversal (leaves → root), the state of domain d is set to present at a node n, if d is present in the majority of leaves of the subtree rooted in n (leaves of n). Similarly, d is set to absent at n, if d is absent in the majority of leaves of n. If there is no state majority for d in the child nodes of n (i.e., there is an identical proportion of presence and absence states for d in the leaves), the state of d at n is set to unknown. As traversal continues toward the root, d is set to present (absent) at n as soon as the majority of leaves of n exhibit the present (absent) state. Ergo, present and unknown are resolved to present, while unknown and absent are resolved to absent. The first traversal terminates at the root node. All unknown states at the root node are set to present (note that this root includes the outgroups). During the second traversal (root → leaves), unknown states are resolved by setting them to the state of their ancestor.

We used a combination of custom-made python scripts and the ETE2 package (Huerta-Cepas et al. 2010) for tree traversal. Branch lengths of the tree (Soltis et al. 2002; Choi et al. 2004; Magallón and Sanderson 2005; Hedges et al. 2006; Cartwright and Collins 2007; Anderson and Janßen 2009; Bhattacharya et al. 2009; Bremer et al. 2009; Forest and Chase 2009; Herron et al. 2009; Wang and Caetano-Anolles 2009; Lang et al. 2010; Reineke et al. 2011) and whole-genome duplication events (Blanc and Wolfe 2004; Schnable et al. 2009; van de Peer et al. 2009; Jiao et al. 2011) were extracted from the literature.

We performed a Blast (Altschul et al. 1997) search to identify recently duplicated proteins. Proteins with a similarity of 75% or more and an E value ≤ 10^-20 were considered to be paralogs. We employed a synteny analysis to distinguish between tandem and segmental duplications. Two genes were considered to be tandem duplicates if they were five or fewer genes apart. Paralogs with more than five genes between them were considered to be a result of a segmental duplication event (Hanada et al. 2008).

Domain gain and loss events along branches were calculated by comparison of domain content at a given node with the domain content of its ancestor. We distinguish between "gained" domains, which are all domains found present at a node while absent in its ancestor, and "emerged" domains, which are gained domains which can only be found within Viridiplantae. Ergo, emerged domains are a subset of the gained domains. Emerged domains were determined by scanning gained domains with HMMER3.0 against NCBI NR and Integr8 (Kersey et al. 2005). Gained domains which are not present in the outgroups were also scanned against NCBI NR to determine the kingdoms where these domains are present (supplementary table 6, Supplementary Material online). Domain event rates (gain and loss) were calculated by dividing the number of events predicted to occur along a given branch by the branch length (in million years).

FIG. 1. Domain gain, loss, emergence and proteome coverage of 26 plant genomes. (A) Correlation of domain gain and loss with branch length. Both gain and loss correlate significantly with branch length (gain: ρ = 0.6, P < 0.001; loss: ρ = 0.63, P < 0.001). (B) Phylogenetic relationship of all species used in this study. For each branch, the size of the green circle corresponds to the number of domain emergence events along the branch. Branches colored in red indicate that the gain and/or loss at this branch is higher than the average gain and/or loss rates. Exact values for domain gain, loss, and emergence are given in supplementary table 2 (Supplementary Material online). (C) Domain coverage for proteins. The lower axis (percentage of proteins with domains) displays the proportion of proteins with only Pfam-A domains (red), only Pfam-B domains (dark blue), both Pfam-A and Pfam-B domains (light blue), and without any protein domain annotation (yellow). The upper axis displays proteome size indicated as vertical black line for each species. Statistics for three species (Setaria italica, Prunus persica, and Mimulus guttatus) that are still under Fort Lauderdale restriction are not provided.

Given the evidence that novel domains are frequently enriched in structural disorder (Buljan et al. 2010; Moore and Bornberg-Bauer 2012), we predicted disorder in domains classified as emerging. VSL2.0 (Obradovic et al. 2005) was used to detect structural disorder in domain sequences. Emerged domains were divided into four bins (Viridiplantae, Embryophyta, Tracheophyta, and Magnoliophyta), corresponding to their emergence nodes. Domains that emerged after the Magnoliophyta node were pooled into one "RECENT" bin. To compare disorder of emerged domains with old domains (i.e., domains that exist at the root), a bin "OLD" was constructed consisting of 500 randomly picked domains occurring in the root. In addition, we constructed a "RANDOM" bin consisting of 100 randomly selected domains, which exist at the root. To account for sampling bias, we repeated the random selection 100 times. Statistical inference was conducted with the kruskalmc test of the R package pgirmess (Siegel and Castellan 1988; R Development Core Team 2008).

We quantified domain emergence and explored a set of attributes (Moore and Bornberg-Bauer 2012). Domain frequency, d(f), is defined as the absolute frequency of
a domain across all plant genomes used for the analysis. The domain rate x(d) of domain d is defined as the domain frequency divided by the number of plants in which d occurs. The domain success rate corresponds to the domain rate divided by the node age (in million years) at which the domain first emerged. The prevalence P(d) of a domain d is the number of plants with d divided by the number of plants with the emergence node of d as an ancestor.

Functional Analysis of Domains

Where available, GO (Ashburner and Lewis 2002) annotation of proteomes was obtained from PLAZA 2.0 (Proost et al. 2009); Blast2GO (Götz et al. 2008) with default settings was used to functionally annotate the remaining proteomes. Comparative functional analyses were performed by assessing GO-term overrepresentation (overrepresentation analysis, ORA) in two separate steps. First, for emerging domains, we performed the functional analysis indirectly by using the GO annotation of arrangements that harbor at least one emerging domain, similar to a previous approach (Moore and Bornberg-Bauer 2012). Statistical inference was conducted using the R package topGO (Alexa et al. 2006). As universe, we used the GO annotation of all proteins in our data set; the sample consisted of arrangements with emerging domains. Second, for assessing functional overrepresentation of arrangements in events (such as fusion or fission), we again conducted an ORA using the proteins' GO annotation; however, our sample here was the arrangement set that results from a specific event (e.g., all gained arrangements explainable by a fusion event). P value transformed TermClouds were created by logarithmic transformation of the False Discovery Rate (FDR)-corrected (Benjamini and Hochberg 1995) P value obtained from the ORA, such that term size represents the significance of the GO term. Visualization was created using Wordle (http://www.wordle.net/) with the transformed P value as a custom scaling factor.

Reconstruction of the Ancestral Domain Arrangements State, Arrangement Gain, and Loss

We defined domain arrangements as ordered sets of domains for each protein. For the analysis of arrangements in this study, only Pfam-A domains were used. Ancestral states for arrangements were reconstructed as previously described. Similarly, arrangement gain and loss was determined by comparing current and ancestral states.

Determination of Arrangement Rates

For each gained arrangement, we applied a search algorithm to determine the possible mechanism that led to its formation. We considered the four most important mechanisms of modular rearrangements: fusion, fission, terminal deletion, and domain addition (Björklund et al. 2005; Pasek et al. 2006; Weiner et al. 2006; Buljan et al. 2010). The algorithm assigns a fusion event when two ancestral arrangements can be fused to form the gained arrangement. A gained arrangement is considered to be the result of fission if an ancestral arrangement can be split to give rise to the new arrangement; both products of the split are required to be present in the current node. In contrast, for terminal deletion, only one product of the split (the gained arrangement) may be present in the current node (the other product is considered to be lost). The algorithm counts a domain addition event when the newly gained arrangement contains a domain that is absent in the ancestral node.

Note that in general, any new arrangement can be explained by a sufficiently large "chain" of events. However, since the likelihood of events is not available, we make no assumptions about the relative costs of each mechanism and therefore are not able to determine the most likely chain. Instead, we focus on single-step solutions, that is, on cases where a newly gained arrangement can be explained by a single event. Using this strategy, we can differentiate between arrangements with exact solution (i.e., the formation can be explained by exactly one mechanism), arrangements with nonambiguous solution (i.e., only one mechanism explains the arrangement but there are several events possible) and arrangements with ambiguous solution (i.e., conflicting solutions of different types). All arrangements with solution are referred to as "simple gains," whereas all other arrangements are considered to be "complex gains."

Results

Domain Coverage

In plants, on average, 50% of the proteome residues were found to be covered by domain annotation; the residue coverage ranges from 30% to 70% (supplementary table 1 and fig. 2, Supplementary Material online). For an average of 35% of the residues, for each plant, a Pfam-A domain can be detected, whereas Pfam-B domains affect 15% of all residues. Residue coverage levels for all species are given in supplementary table 1 (Supplementary Material online).

At the protein level, the coverage distribution is more diverse (supplementary table 1 and fig. 2, Supplementary Material online). On average, 70% of the proteins for one plant species have at least one Pfam-A or Pfam-B domain. Fifty percent of the proteins contain only Pfam-A domains, 14% contain only Pfam-B domains, and 6% contain both Pfam-A and Pfam-B domains (fig. 1C). All protein coverage values are given in supplementary table 1 (Supplementary Material online). The total number of proteins containing Pfam-A and Pfam-B domains is highly variable between the different proteomes (fig. 1C, supplementary table 1, Supplementary Material online).
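The difference between residue-level and protein-level coverage reported under Domain Coverage can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `coverage_summary`, the input layout (protein ID mapped to sequence length and a list of domain hits), and the source labels "A" and "B" for Pfam-A and Pfam-B are all assumptions made for illustration.

```python
from collections import Counter


def coverage_summary(proteins):
    """Summarize domain coverage for one proteome.

    `proteins` maps a protein ID to (sequence_length, hits). Each hit is
    (source, start, end) with 1-based inclusive coordinates, where source is
    "A" (Pfam-A) or "B" (Pfam-B). This layout is hypothetical.
    """
    total_residues = 0
    covered_residues = 0
    classes = Counter()

    for length, hits in proteins.values():
        total_residues += length

        # Residue level: take the union of covered positions so that
        # overlapping hits are counted only once.
        covered = set()
        for _source, start, end in hits:
            covered.update(range(max(1, start), min(length, end) + 1))
        covered_residues += len(covered)

        # Protein level: classify by which annotation sources are present.
        sources = {source for source, _start, _end in hits}
        if not sources:
            classes["no_domain"] += 1
        elif sources == {"A"}:
            classes["pfam_a_only"] += 1
        elif sources == {"B"}:
            classes["pfam_b_only"] += 1
        elif sources == {"A", "B"}:
            classes["pfam_a_and_b"] += 1
        else:
            raise ValueError(f"unexpected domain sources: {sorted(sources)}")

    n_proteins = len(proteins)
    keys = ("pfam_a_only", "pfam_b_only", "pfam_a_and_b", "no_domain")
    return {
        "residue_coverage": (
            covered_residues / total_residues if total_residues else 0.0
        ),
        "protein_fractions": {
            key: (classes[key] / n_proteins if n_proteins else 0.0)
            for key in keys
        },
    }


if __name__ == "__main__":
    # Toy proteome with four proteins of 100 residues each (invented data).
    toy = {
        "p1": (100, [("A", 1, 50)]),
        "p2": (100, [("B", 11, 30)]),
        "p3": (100, [("A", 1, 40), ("B", 31, 60)]),
        "p4": (100, []),
    }
    print(coverage_summary(toy))
```

In this sketch the three annotated classes are mutually exclusive, so their fractions add up to the share of proteins with at least one domain, in the same way that 50%, 14%, and 6% add up to the 70% reported above. Residue coverage is computed separately and is generally lower, because an annotated protein is rarely covered along its full length.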
FIG. 2. Gene Ontology (GO) terms associated with emerging domains. GO terms affected by emergence were tested for overrepresentation using the TopGO package and all terms present in plants as universe (for details, see Materials and Methods). The font size corresponds to the value of significance obtained for this term. Significance was determined after correction for multiple testing using FDR (Benjamini and Hochberg 1995) correction at P < 0.01. The vast majority of GO terms is related to stimulus response, development, reproduction, regulation, and plant-specific metabolic processes.

Domain Emergence

To investigate domain gain, loss, and emergence across the considered plants, we reconstructed the ancestral domain content at each internal node of the tree (see also Materials and Methods; supplementary fig. 1, Supplementary Material online). In total, 545 domains emerged in the plant kingdom, that is, these domains are exclusively found in Viridiplantae. The largest amount of domain emergence within plants occurs along the branch leading to Embryophyta, which sees the emergence of 262 domains (fig. 1B). A total of 114 and 66 domains emerge along the branches to Magnoliophyta and Tracheophyta, respectively. Fifty-one domains emerged prior to the split of Embryophyta and the green algae and 52 domains are the result of recent emergence events and can only be found within Magnoliophyta (see also Discussion below) (fig. 1B).

Radiation and Functional Impact of Emerging Domains

Next, we assessed whether emerged domains confer specific functionalities and whether these might provide adaptive benefit. We assessed functional overrepresentation using GO categories and TopGO (Alexa et al. 2006) (see Materials and Methods for details). We find that GO terms prefixed by response_to are overrepresented along with functionalities related to reproduction, developmental mechanisms, and metabolic processes (fig. 2).

We binned emerging domains according to their point of emergence (for details, see Materials and Methods) and ranked them by their frequency d(f). The 5% highest ranked domains from each age bin (supplementary table 3, Supplementary Material online) were subject to further investigation as these can be considered to be particularly "successful" emerging domains. Among these, we find domains with plant-specific functions such as flowering control, auxin regulation, fruit development, cell wall development, and plant organelle recognition. Furthermore, we detected domains related to the F-box protein family, to transcription factors and to DNA binding. For the majority of emerging domains, direct functional annotation is difficult: the largest proportion (85%) of all emerging domains in plants are domains of unknown function (DUFs) or belong to the set of poorly annotated Pfam-B domains. We assessed functional overrepresentation using the function of proteins that obtain emerging domains; we are hence not exploring which functional modules emerge but rather which protein functionalities undergo innovation (by the addition of an emerging domain).

There is increasing evidence that young domains can exhibit higher levels of structural disorder than established domains (Buljan et al. 2010; Moore and Bornberg-Bauer 2012). We examined the degree of structural disorder in emerging domains. The results indicate that emerging domains are significantly enriched in intrinsic disorder, more than in randomly chosen domains (see Materials and Methods; supplementary fig. 3, Supplementary Material online). Furthermore, the younger a domain, the higher the degree of disorder.

Domain Gain and Loss

Domain gain and loss are frequent events in plant evolution, and there is a strong variation between different branches (fig. 1A). Nevertheless, both gain and loss rates correlate significantly with branch length (Spearman rank correlation, gain: ρ = 0.6, P < 0.001; loss: ρ = 0.63, P < 0.001). On average, plants have a domain gain rate of 6.64/Myr and a domain loss rate of 6.11/Myr (fig. 1A, supplementary table 2 and fig. 9, Supplementary Material online). In monocots, the average domain gain rate (6.7/Myr) is lower than the domain loss rate (7.4/Myr), whereas in eudicots the situation is reversed; eudicots show a loss rate of 7.4/Myr and a gain rate of 8.3/Myr (supplementary table 2 and fig. 9, Supplementary Material online). Some branches exhibit very high loss rates, such as the branch leading to P. dactylifera, the branches to the two Fabaceae M. truncatula and L. japonicus, and the branches to the two Andropogoneae Z. mays and S. bicolor (fig. 1B).

Gain, Loss, and Distribution of Arrangements

We next explored the dynamics of arrangement gain and loss. After determining the presence/absence of arrangements at ancestral nodes (for details, see Materials and Methods), we compared arrangement content at each node with the content at the corresponding parent node to determine arrangement gain and loss. As expected, both gain and loss rates correlate significantly with branch length (Spearman rank correlation, gain: ρ = 0.56, P < 0.001; loss: ρ = 0.38, P = 0.003, supplementary fig. 5, Supplementary Material online). Overall, arrangement gain rate is higher than arrangement loss rate. However, both rates correlate significantly with each other (ρ = 0.56, P < 0.001). By far, the largest amount of arrangement gain (2,814 arrangements) occurs along the branch to M. domestica followed by the branch to R. communis (1,018). Large amounts of arrangement loss can be found along the branches to P. dactylifera (1,028) and L. japonicus (680); both plants also showed a high amount of domain loss. All values for arrangement gain and loss are given in supplementary table 4 (Supplementary Material online).

We investigated the amount of arrangements shared by different plant species (fig. 3). The distribution is bimodal, with the largest number of arrangements being either specific to one species (~7,000) or shared by all (~1,000); only a very small amount of arrangements is shared by 10–20 species. Although by far the largest proportion of arrangements shared by all species consists of single-domain proteins, the contrary is true for species-specific arrangements. Here, the largest number of arrangements tends to be composed of more than one domain, with a large proportion containing seven or more domains. This indicates that the longer an arrangement is, the more likely it is species-specific.

FIG. 3. Arrangements shared between species. The dashed line represents the number of arrangements shared by the different numbers of species (right axis). The distribution of unique arrangements is roughly bimodal with the majority of arrangements shared by either few or all species. The left axis and bar plots display the frequency of arrangements with a certain length (one, two, three, four, five, six, and seven or more domains). Although single-domain arrangements tend to occur in all species, longer arrangements are often species-specific.

Modular Rearrangements

Using a simple model of modular rearrangement (for details, see Materials and Methods), we next explored the mechanisms that can facilitate the formation of novel arrangements. For this, we considered fusion, fission, terminal deletion, and domain addition. The results illustrate that 70% of all gained arrangements can be explained by exactly one solution (exact solutions). Of the gained arrangements, 14% can be explained by one particular mechanism, however, with a number of different possible solutions (nonambiguous solutions); only 4% have conflicting solutions (ambiguous solutions). The remaining 12% of all new arrangements are complex gains that likely arose by a chain of events (see Materials and Methods; fig. 4). The different events were found to occur with different frequencies (table 1). Fusion events make up the largest proportion of exact solutions, followed by domain addition, fission, and terminal deletion. Fusion events occur with a frequency of 4.59/Myr, followed by fission with 1.98/Myr, and gain with 1.89/Myr. Domain deletion events can be split in C-terminal and N-terminal domain deletion; both events have a frequency of 0.7/Myr. All rates were averaged across all branches. We further explored event frequencies across different age bins. At the Embryophyta node, 68% of new arrangements are affected by domain addition and 26% by fusion. Domain deletion (4%) and fission (3%) are less prevalent at this node. Over time, the frequency of domain deletion and fission increases up to 13% and 21% in recent rearrangements, whereas domain additions decrease to a frequency of 24%. The largest fraction of recently gained arrangements (49%) can be explained by fusion events (fig. 4).

Table 1
Contribution of Fusion, Fission, C-Terminal Deletion, N-Terminal Deletion and Domain Addition to Simple Arrangement Gains

                      Fusion   Fission   C-Del   N-Del   Add
Total number          9,669    4,073     1,283   1,424   4,848
Average number/Myr    4.59     1.98      0.7     0.7     1.89

NOTE. Del, deletion; Add, addition.

FIG. 4. Mechanisms of rearrangement across different clades. We applied a search algorithm to assess the mechanisms that might account for newly gained arrangements (for details, see Materials and Methods). Only 12% of all gained arrangements cannot be explained by a one-step event (complex gains). The remaining 88% of simple gains can be further differentiated into exact solutions where only one particular mechanism (fusion, fission, terminal deletion, or domain addition) was necessary to explain the arrangement gain event (70%). All proteomes were divided into four different age bins: Embryophyta, Tracheophyta, Magnoliophyta, and Recent Nodes. The frequencies of fusion, fission, and terminal deletion increase over time, whereas the frequency of domain addition decreases.

Discussion

Domain Emergence

The increasing availability of plant genomes has allowed us to conduct a comparative domain analysis between a set of diverse plant species. Here, we reconstruct the ancestral states of domain content and arrangement and investigate the functional impact of domain emergence and domain rearrangements across a comprehensive set of 29 genomes dating back ~800 Myr. However, the considered clade still contains a number of species for which genome sequences are missing, such as the gymnosperms or the charophyta. As these genomes become available, a more comprehensive picture of modular evolution in plants will emerge.

In contrast to animals, plants are sessile organisms that are unable to escape strong environmental shifts and must rather adapt to such variation. Hence, plants, more so than animals, are required to evolve mechanisms in order to deal with biotic and abiotic stresses. Here, we illustrate that the emergence of new domains can provide an important strategy for evolving stress response. More than 500 domains emerged within Viridiplantae of which more than 100 domains are unique for Tracheophyta (fig. 1). We recently assessed the impact of domain emergence in a set of insects, where only 30 domains emerged within 19 insect genomes spanning roughly 300 Myr of evolution (Moore and Bornberg-Bauer 2011). Hence, it would seem that plants exhibit a large amount of domain innovation. One might speculate that plants at least partly address the challenge of a sessile lifestyle by means of domain innovation. The investigation of GO terms of proteins containing emerged domains further supports this notion. A large number of terms are related to plant-specific processes, such as megagametogenesis and development of plant-specific organs. This is not surprising as the reproductive system and morphology of plants not only differ strongly from other kingdoms but are also highly variable between plant species (Endress 2001; Bennici 2005; Williams 2008; Kawakita and Kato 2009). Besides these plant-specific functions, a number of overrepresented GO terms correspond to response_to categories and to secondary metabolite pathways related to stress response, such as auxin and jasmonic acid. Such secondary metabolites are strongly related to the defense and response mechanisms in plants (Grace and Logan 2000; Pateraki and Kanellis 2010; Kerchev et al. 2012). As the composition of these compounds is variable between plant species and also within species (Kroymann 2011), such secondary metabolites may provide a strong flexible basis for improving adaptation and defense.

Functional links to photosynthesis are not found amongst emerged domains (fig. 2). This is likely explained by photosynthesis not being unique to plants; photosynthetic processes can be found in algae and in many species of bacteria (Olson 1970, 2001). Indeed, photosynthesis-related GO terms can be detected by investigating gained domains which are absent in the outgroups (supplementary fig. 6, Supplementary Material online), as well as response_to terms and a number of plant-specific functionalities related to development, similar to those terms found in proteins containing emerged domains.

Emerged domains seem to be evolutionarily important as they have a high prevalence of 0.9–1, indicating that their occurrence is strongly conserved. Besides their widespread occurrence in nearly all leaves, such emerged domains often occur in high copy numbers (supplementary table 3, Supplementary Material online).

Investigating the most successful emerged domains un-

Supplementary Material online). In contrast, proteins formed by fission mainly play a role in metabolic and biosynthesis processes (supplementary fig. 7, Supplementary Material online). Proteins shaped by domain deletion are mainly related to basic functions such as the primary metabolism, and only a minor part of these proteins are stress-response related (supplementary fig. 7, Supplementary Material online).

Our results provide further evidence that duplication impacts rates of modular rearrangement (Buljan and Bateman 2009). We find that proteins affected by rearrangement events are overrepresented in duplicated genes (supplementary table 6, Supplementary Material online). Furthermore, we find indication that species with recent WGD have higher rates of fusion and fission in comparison to species without recent WGD (supplementary table 7, Supplementary Material online).
In general, duplicates are thought to un- coversconnectionstokeyfunctionalcategoriessuchastran- dergooneofthreedifferentscenarios:subfunctionalization, scription factors, binding-related processes,and secondary where the two duplicates share ancestral gene function; metabolites,includingauxinandjasmonicacid(supplemen- neofunctionalization,whereonecopyretainstheancestral tarytable3,SupplementaryMaterialonline).Indeed,aburst functionandtheothercopydivergestowardanovelfunc- of transcription factors and their constituent domains, tion; and pseudogenization, where one copy is not ex- which are assumed to be correlated with increasing com- pressed and is subsequently lost (Walsh 2003). One plexityinplantevolution(Langetal.2010),hasbeenfound explanation for sub- or neofunctionalization is the loss or in angiosperms. The increase of plant complexity with du- change of regulatory regions (Ganko et al. 2007). As the plicationevents(Freelingetal.2006)maypartlybetheresult conservationofnoncodingsequencesfollowsanexponen- of duplication facilitating increasingly complex regulatory tial decay rate (Reineke et al. 2011), the retention of both networks (Veron et al. 2007). duplicatesmightbetheresultofthechangeofoneofthe Emergingdomainsexhibitanincreasedamountofintrin- gene’s regulatory region under relaxed selectional con- sicdisorder;themorerecenttheemergenceevent,themore straints.Thehighretentionrateofproteinsthatresultfrom likelythedomaininquestionexhibitsintrinsicdisorder.Dis- afusioneventmightbeexplainedbytheconservationofat orderedsequencesmayincreasethebindingaffinityofpro- least one regulatory element in the upstream region, teins(DysonandWright2005).Highintrinsicdisorderpaired whereasafterfission,onearisingproteinmaylosearegula- withthefactthatemergeddomainsaresignificantlyunder- tory region and undergo pseudogenization followed by representedinsingle-domainproteins(hypergeometrictest, gene loss. 
A further reason for sub- and neofunctionaliza- P , 0.01), leads us to the speculation that emerging do- tion after duplication might be domain rearrangements in mainsmayhavehigherinteractionpotential,whichinturn oneparalogordifferentialdomainloss(Buljanetal.2010). may increase their viability and result in higher prevalence Wefurtherillustrate the impact of protein domain rear- andfrequency.Indeed,someofthemostsuccessfulemerg- rangementsonanorganism’sproteinrepertoire(fig.5).The ing domains have links to binding-related processes. emerging domains PAN_2 (emerged in the Tracheophyta) andS_locus_glycop(Embryophyta)oftenco-occurtogether withtheB-lectindomain.Arrangementscontainingthetwo Arrangement Mechanisms emergeddomainsS_locus_glycopandPAN_2arefrequently In plants, roughly 70% of the domain-containing proteins rearrangedwithinparalogousgenes(fig.5)andobtainacat- are single domain (supplementary fig. 4, Supplementary alyticfunctionthroughtheadditionofkinasedomains.Pro- Materialonline).Thishighpercentageofsingle-domainpro- teinsthatconsistofarrangementswiththesetwoemerged teinscanbeanartifactoflowdomaincoverageor‘‘eroded- domains have GO functions related to the recognition of domains,’’ which have diverged beyond detection (Weiner pollen, protein phosphorylation, and cell recognition. Al- etal.2006).Recentrearrangementscanmostlybeexplained though we observed fusion events in tandemly duplicated by the fusion of two single or two domain proteins. The genesinourcasestudy,fusioneventsarenotgenerallyover- relative rates of fusion and fission are similar to previously represented in tandemly duplicated genes (supplementary reported rates (Kummerfeld and Teichmann 2005). 
GO table5,SupplementaryMaterialonline).Afterfusion,dupli- terms overrepresented in proteins, which arose via fusion, catesmightbedifficulttorecognizeasparalogs.Onemight are stress-, defense-, and adaptation-related as well as therefore speculate that in tandemly duplicated proteins related to the reproduction system (supplementary fig. 7, fused arrangements are harder to detect. The increased 324 GenomeBiol.Evol.4(3):316–329. doi:10.1093/gbe/evs004 AdvanceAccesspublicationJanuary16,2012 GBE ModularEvolutioninPlants FIG. 5.—Example of two emergent domains at the Tracheophyta node (PAN_2) and Embryophyta node (S_locus_glycope). The evolution of example arrangements over time is shown in five different species (Arabidopsis thaliana [AT], Oryza sativa [OS], Populus trichocarpa [PT], Ricinus communis[RC],Vitisvinifera[VV]).Theobservablediversityinarrangementswithinthisfamilyisexplainablebysimpleone-stepeventsoffusion,fission, terminaldeletion,anddomaingain. rates of events along more recent branches might be ex- asubsetofbetween5and24proteomes,innate_immune_r- plained by WGD which have taken place in angiosperms esponse is significantly overrepresented, suggesting that (De Bodt et al. 2005; Freeling et al. 2006; Shoemaker there might be different pathogens affecting different sub- et al. 2006; van de Peer et al. 2009; Paterson et al. clades.ProteinswithGOtermsrelatedtoreproduction,signal 2010).Indeed,inapairwisecomparisonoffusionandfission transduction, and prefixed with response_to are overrepre- ratesbetweenplantpairs,whichdifferbyonerecentWGD, sented in species-specific arrangements or those shared by we find increased rates in plants with more recent WGD only few species. The high number of species-specific ar- (supplementary table 7, Supplementary Material online). rangementsobservedhereisinaccordancewiththeobserva- Roughly one-third of all vascular plants have undergone tion that, within a set of five angiosperm species, around recent WGDs (Wood et al. 2009). 
20% of proteins do not align toanorthologous group (Pa- terson et al. 2010). The high amount of species-specific ar- Arrangement Distribution rangements and genes might also be a consequence of frequent duplication events followed by lineage-specific re- We investigated the distribution of shared arrangements tention (Paterson et al. 2010). This supports the hypothesis among theplantspecies. Themajorityofdomain arrange- that plants have many flexible genetic mechanisms to pro- ments are either species-specific or universal (fig. 3). This ducespecies-specific adaptation (Bomblies 2010). bimodaldistributionisevenstrongerwhenweconsideronly a well-annotated subset of our species and exclude the Gain and Loss of Domains and Arrangements greenalgae(supplementaryfig.8,SupplementaryMaterial online).Inparticular,proteinswithtwoorthreedomainsare Weinvestigatedgainandlossatthelevelsofdomainsand oftenspecies-specific.Incombinationwiththeobservation domainarrangementsbyreconstructingtheancestralstates that roughly 70% of all domain-containing proteins are based on maximum parsimony. We observe that gain and single-domain proteins (supplementary fig. 4, Supplemen- losscanfrequentlybefoundinallcladesinplantevolution tary Material online), this can lead to the assumption that atbothdomainandarrangementlevels.Thisisinagreement the fusion of single-domain proteins is a powerful mecha- withBuljanandBateman(2009),whofoundanequalevent nismtoobtainspecies-specificproteinswithnewfunction- distributionafterspeciationandduplicationwithinanimals alities. This distribution suggests that only very few long andahighamountofchangeinarrangementsafterdupli- arrangements are highly conserved; long arrangements cationevents.Asweheredonotconductadirectcompar- are possibly more often affected by fission events. 
Proteins ison of paralogs, but instead compare presence/absence with arrangements shared by several but not all species are patterns of domains and their arrangements across pro- overrepresented in GO terms related to basic functions such teomes, our results only support the notion that domain asprimarymetabolism,cellulosebiosyntheticprocess,andcell gainandlosscanbefoundalongallbranchesandthatboth wall organization. In proteins with arrangements shared by have a significant correlation with each other and with GenomeBiol.Evol.4(3):316–329. doi:10.1093/gbe/evs004 AdvanceAccesspublicationJanuary16,2012 325