Letter

The Role of Lineage-Specific Gene Family Expansion in the Evolution of Eukaryotes

Olivier Lespinet, Yuri I. Wolf, Eugene V. Koonin,1 and L. Aravind

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA

A computational procedure was developed for systematic detection of lineage-specific expansions (LSEs) of protein families in sequenced genomes and applied to obtain a census of LSEs in five eukaryotic species, the yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe, the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster, and the green plant Arabidopsis thaliana. A significant fraction of the proteins encoded in each of these genomes, up to 80% in A. thaliana, belong to LSEs. Many paralogous gene families in each of the analyzed species are almost entirely comprised of LSEs, indicating that their diversification occurred after the divergence of the major lineages of the eukaryotic crown group. The LSEs show readily discernible patterns of protein functions. The functional categories most prone to LSE are structural proteins, enzymes involved in an organism's response to pathogens and environmental stress, and various components of signaling pathways responsible for specificity, including ubiquitin ligase E3 subunits and transcription factors. The functions of several previously uncharacterized, vastly expanded protein families were predicted through in-depth protein sequence analysis, for example, small-molecule kinases and methylases that are expanded independently in the fly and in the nematode. The functions of several other major LSEs remain mysterious; these protein families are attractive targets for experimental discovery of novel, lineage-specific functions in eukaryotes. LSEs seem to be one of the principal means of adaptation and one of the most important sources of organizational and regulatory diversity in crown-group eukaryotes.
[Supplemental material is available online at ftp://ncbi.nlm.nih.gov/pub/aravind/expansions, and http://www.genome.org.]

1 Corresponding author. [email protected]; FAX (301) 435-7794. Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.174302.

Genome Research 12:1048–1059 ©2002 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/01 $5.00; www.genome.org

The eukaryotic crown group (the unresolved assemblage of lineages in the eukaryotic tree, which includes plants, animals, fungi, and some protists, as opposed to early branching eukaryotes, which are all unicellular protists), although only representing the proverbial tip of the eukaryotic phylogenetic iceberg, encompasses a remarkable variety of organisms (Patterson 1999; Dacks and Doolittle 2001). This diversity is apparent in both morphological and biochemical features of the crown group that spans the entire range from unicellular yeasts and chlorophytes, through facultatively multicellular slime molds, to genuine multicellular organisms, plants, animals, and fungi (Sogin et al. 1996; Patterson 1999). The complete, or nearly complete, genome sequences from three major branches of the crown group, plants, animals, and fungi, are starting to provide the first molecular explanations for both their unity and diversity. From one viewpoint, the crown-group eukaryotes are remarkably uniform in that they share a large set of conserved orthologs in the core components of their essential functional systems, such as those involved in DNA replication and repair, most aspects of RNA metabolism, cytoskeletal organization, protein degradation, and secretion (Chervitz et al. 1998; Rubin et al. 2000; Lander et al. 2001). Furthermore, components of the signal transduction pathways, structural and regulatory components of the nucleus, and pre-mRNA processing complexes, although showing clear differences between the major crown-group lineages, are largely constructed from the same set of protein domains, and are based on the same architectural principles (Chervitz et al. 1998; Aravind and Subramanian 1999; Rubin et al. 2000; Lander et al. 2001).

This unity notwithstanding, preliminary comparative studies on the sequenced eukaryotic genomes also provided clues as to what evolutionary phenomena might underlie their diversity. At the level of the protein sets encoded in the crown-group genomes, the main contributing forces appear to be the emergence of new domain architectures through domain accretion and domain shuffling, lineage-specific gene loss, and lineage-specific expansion of protein families (Aravind and Subramanian 1999; Aravind et al. 2000; Rubin et al. 2000; Lander et al. 2001). Lineage-specific expansion (LSE) is defined in relative terms, as the proliferation of a protein family in a particular lineage, relative to the sister lineage with which it is compared (Jordan et al. 2001). Thus, if two sister lineages, for example, Drosophila and Caenorhabditis representing insects and nematodes, respectively, are compared, all protein-family proliferation events (duplications to n-plications) that occurred in either of these lineages after their separation are considered LSEs.

Preliminary analysis of proteins from the crown-group eukaryotic genomes revealed some tangible correlations between LSE and emergence of new biological functions, response to diverse environmental pressures, and organizational complexity. Some of the most striking cases of LSE are related to pathogen and stress response and include, among other families, expansions of the immunoglobulin superfamily associated with the vertebrate immune system, AP-ATPases involved in plant disease resistance (Hulbert et al. 2001), and the cytochrome P450 family, which participates in detoxification systems in both plants and animals (Nelson 1999; Tijet et al. 2001). Transcription factors represent another functional category of proteins that tend to show widespread LSE: the independent expansions of the POZ–C2H2 and C4DM–C2H2 fusions in insects, the nuclear hormone receptors in nematodes, and the KRAB-domain-fused Zn-fingers in vertebrates apparently made substantial contributions to the evolution of developmental and differentiation features specific to each of these lineages (Sluder et al. 1999; Aravind et al. 2000; Riechmann et al. 2000; Coulson et al. 2001; Lander et al. 2001).

Despite a wealth of anecdotal information, we are unaware of a systematic comparative analysis of LSEs in eukaryotic genomes. With this objective, we devised a procedure to systematically detect LSEs. Having identified LSEs in five eukaryotic proteomes, those of Saccharomyces cerevisiae, Schizosaccharomyces pombe, Caenorhabditis elegans, Drosophila melanogaster, and Arabidopsis thaliana, we predicted, wherever feasible, the biochemical or biological functions of the lineage-specific clusters (LSC) and explored their potential roles in the diversification of the crown group. Here, we present a systematic analysis of the demography of LSEs and provide evidence for a major involvement of LSEs in the generation of the diversity of biological functions in multicellular eukaryotes.

RESULTS AND DISCUSSION

Identification and Validation of Candidate Lineage-Specific Clusters

Using the clustering procedure described in the Methods section, we delineated candidate LSCs for five eukaryotic genomes. The automatically generated LSCs were further surveyed for false positives, that is, proteins that were unrelated to the rest of the proteins in the cluster, by using BLAST searches and multiple alignments. A subset of false positives arose from compositionally biased segments that escaped filtering during the automatic process. The presence of some false positives was mainly due to one or more of the proteins in a cluster containing multiple domains or being artificially fused to another protein. The majority of such false positives were detected among A. thaliana proteins, in which gene prediction errors resulted in artificial fusions of distinct genes. On several occasions, these artificial gene fusions resulted in an erroneous merger of two or more distinct clusters; these were manually separated. Additionally, a few smaller clusters that belonged to a larger LSE were merged. On average, ∼9% of the LSCs of size greater than two were subjected to manual corrections.

The automatic procedure used for delineating candidate LSCs included single-linkage clustering of proteins by sequence similarity and an ultrametric tree construction using UPGMA (see Methods). These methods accurately reproduce phylogenetic relationships only under the strict molecular clock hypothesis. Therefore, to verify the phylogenetic coherence of the candidate clusters, 10 of the candidate LSCs from each analyzed species that consisted of 4 or more members and had homologs in other species were chosen for phylogenetic analysis. In each tree, the proteins from the candidate LSC grouped together and, in 48 of the 50 cases, this grouping was strongly supported by bootstrap analysis (>70%) to the exclusion of homologs from other species and paralogs from the same species that do not belong to the given LSC (Fig. 1; Supplementary Material available online at ftp://ncbi.nlm.nih.gov/pub/aravind/expansions, and http://www.genome.org). Thus, the clusters generated by the automatic procedure used here appeared to represent predominantly, if not exclusively, authentic LSEs and, therefore, could be utilized reliably for quantitative and qualitative analyses of this phenomenon. Certain limitations related to the current state of sequencing and annotation of the eukaryotic genomes need to be kept in mind when interpreting these clusters. Only one genome, that of S. cerevisiae, should be considered truly complete, whereas in others, some genes are obviously still missing, for example, those that reside in heterochromatinic regions. Furthermore, given the known problems with gene prediction in plant and animal genomes, we removed nearly identical sequences prior to the LSC analysis (see Methods). This eliminated potential redundancy, but some true (nearly identical) paralogs resulting from recent duplications could have been lost in the process. Given this procedure, the results presented here should be considered conservative estimates of the number of genes in LSCs. On the other end of the spectrum, extremely diverged members of LSCs (or even entire LSCs), which retain minimal sequence conservation, could have been missed by this analysis.

Figure 1 Phylogenetic analysis of selected eukaryotic lineage-specific expansions. Groups supported by a bootstrap value >70% are colored pink for Drosophila melanogaster, red for Homo sapiens, orange for Caenorhabditis elegans, green for Arabidopsis thaliana, and yellow for Schizosaccharomyces pombe. (A) Prolyl hydroxylases. (B) Small molecule kinases (Ch stands for choline kinase). (C) Patched-like protein. (D) MAP-Kinases. (E) P450 family hydroxylases. (F) MBOAT membrane acyltransferases. (At) Arabidopsis thaliana; (Bs) Bacillus subtilis; (Ce) Caenorhabditis elegans; (Dd) Dictyostelium discoideum; (Dm) Drosophila melanogaster; (Hs) Homo sapiens; (Pbcv1) Paramecium bursaria Chlorella virus 1; (Rs) Ralstonia solanacearum; (Sa) Staphylococcus aureus; (Sc) Saccharomyces cerevisiae; (Sm) Sinorhizobium meliloti; (Sp) Schizosaccharomyces pombe. Complete tree descriptions (full lists of GI numbers or gene names, and bootstrap values) are available in the Supplementary Material online at ftp://ncbi.nlm.nih.gov/pub/aravind/expansions, and http://www.genome.org.

The two ascomycete yeasts, S. pombe and S. cerevisiae, were the closest pair of sister lineages compared. The two animals, D. melanogaster and C. elegans, represented a slightly greater phylogenetic divergence relative to each other, whereas the plant A. thaliana represented an even deeper branch with respect to animals and fungi. Thus, the LSCs from each of these species enabled us to examine the role of LSEs in diversification of eukaryotes at different levels of evolutionary divergence.

Proteome-Wide Demography of Lineage-Specific Family Expansion

The detected LSEs encompassed between ∼20% of the proteome (the yeasts) and ∼80% (A. thaliana) (Fig. 2A). One of the causes for this diverse range of LSEs appears to be the phylogenetic distance factor; the two yeast species have accrued far fewer LSEs after diverging from their common ancestor compared with A. thaliana, which has no close sister lineages in the analyzed set of genomes and has, accordingly, gained the greatest number of expansions after its divergence from the common ancestor with fungi and animals. Positive linear correlations, with moderate-to-strong significance, were observed between the proteome size and each of the following: (1) fraction of proteins contained in LSCs (Fig. 2A), (2) number of LSCs (Fig. 2B), and (3) average number of proteins per LSC (Fig. 2C). The majority of the clusters in each species consisted of two members. In each case, the number of two-member clusters showed a negative correlation with the proteome size, whereas the number of clusters with three or more members showed a positive correlation with the proteome size (Fig. 2D). Thus, larger proteomes had more proteins in larger LSCs at the expense of two-member LSCs. For each species, the distribution of the LSCs by the number of members followed the negative power law $P(k) = ck^{-\gamma}$, in which P(k) is the frequency of families with exactly k members and c and γ are constants (Fig. 3). The differences between the slopes of these power law distributions (in double-logarithmic coordinates) were compatible with the aforementioned correlations between the degree of clustering and proteome size, that is, the yeast LSCs showed the steepest decay, whereas those from A. thaliana had the flattest distribution (Fig. 3). This is also consistent with earlier observations that, in general, the size distribution of paralogous protein families in proteomes followed the power law decay (Huynen and van Nimwegen 1998; Qian et al. 2001). These findings suggest that LSCs evolved largely through a stochastic process of gene duplication whereby the probability of duplication within a cluster at any given time is proportional to the size of the cluster, rather than through genome-scale duplications.

Figure 2 Linear correlation between the proteome size and parameters of eukaryotic lineage-specific expansion (LSE) in five eukaryotic species. Correlation coefficients (r) and significance levels (P) were determined using ordinary least square linear regression. (At) Arabidopsis thaliana; (Ce) Caenorhabditis elegans; (Dm) Drosophila melanogaster; (Sc) Saccharomyces cerevisiae; (Sp) Schizosaccharomyces pombe. (A) The proteome size (X-axis) is plotted against the percentage of the proteome made up of LSEs. (B) The proteome size (X-axis) is plotted against the number of lineage-specific clusters. (C) The proteome size (X-axis) is plotted against the mean number of proteins in lineage-specific clusters. (D) The proteome size (X-axis) is plotted against the percentage of duplication and the percentage of n-plication (n ≥ 3) among the LSCs.

Figure 3 Size distribution of the lineage-specific clusters in three eukaryotic species. (Blue) Schizosaccharomyces pombe; (pink) Caenorhabditis elegans; (green) Arabidopsis thaliana. Cluster size (X-axis) is plotted against the number of LSCs in double logarithmic coordinates. The equations of the power law distribution fitting the linear part of the data are shown on the graph.

To characterize the role of LSEs in the evolution of the respective classes of paralogous proteins in each lineage, we devised the expansion coefficient (EC), which is the ratio of the number of proteins in LSCs to the total membership of the given class of paralogs in a given proteome. The EC is a measure of the fraction of a given paralogous class that has evolved through LSE after the divergence of the given lineage from the closest sister lineage included in the analysis. LSCs with EC = 1 are those families that have been invented de novo and proliferated thereafter in a particular lineage. The relative abundance of LSCs in the EC range between 0 and 0.9 is roughly constant for all taxa considered here, with slightly more than 5% of the LSCs in each of the bins of size 0.1 in this range (Fig. 4). Notably, ∼40% (on average) of the LSCs present in a given proteome were in the EC range of 0.9 to 1 (Fig. 4). Thus, nearly one-half of the paralogous protein clusters encoded in eukaryotic genomes have been generated almost entirely through LSE. This applied to the full range of evolutionary distances explored here and there was no obvious dependence on the evolutionary depth at which LSEs were identified; the fraction of paralogous classes contained in these exclusive LSCs was even greater in the yeast S. cerevisiae than it was in A. thaliana (Fig. 4). This observation, together with the correlations between proteome size and different parameters of LSEs (Fig. 2), suggests that the ancestral core set of proteins inherited by the crown-group lineages from their last common ancestor contained few paralogs compared with the extant proteomes. Subsequent to the divergence of the individual lineages, many genes inherited from the common ancestor as well as gene families invented de novo have undergone one or more rounds of duplication. This process seems to have been particularly active in the generation of the large proteomes of multicellular eukaryotes and probably provided them with the raw material for their cellular differentiation. In principle, it could be argued that the ancestor had as many paralogous families as the most complex of the extant genomes or even more, and the appearance of LSE had been created by lineage-specific gene loss, which is common in the evolution of at least some eukaryotic lineages (Aravind et al. 2000; Braun et al. 2000). However, apart from the general implausibility of a highly complex common ancestor for the crown group, this mechanism for the evolution of apparent LSEs would necessarily entail independent gene losses in the same paralogous family in multiple lineages, as opposed to a single expansion. Therefore, the lineage-specific duplication scenario is more parsimonious than the scenario based on the lineage-specific losses.

Analysis of the top 25 LSEs with EC = 1 from all proteomes pooled together indicated that the majority of them are α-helical proteins or have conserved patterns of histidines and cysteines. Typical examples include the α-helical nonspecific lipid-transfer protein in plants, the C4DM domain in D. melanogaster that chelates cations through conserved cysteines, and the T20D4.15 family of disulfide-bonded secreted proteins from C. elegans. Thus, de novo emergence of protein domains that substantially contributed to LSEs appears to have involved primarily invention of structurally simple folds. These folds could have evolved through compaction of long α-helical coiled coils or through disulfide-bond- or metal-supported stabilization mediated by a few strategically placed, conserved cysteines and/or histidines. Invention of such simple domains could have been more expedient than emergence of complex α/β structures that require several specific stabilizing interactions to be fixed (Aravind and Koonin 2000).

Biological Significance of Lineage-Specific Expansions

The above observations show that, quantitatively, LSEs are a major component of the differences between the proteomes of various eukaryotic taxa. New paralogous families could provide the material for specific adaptations and for evolution of new functional systems. In qualitative terms, we sought to investigate the biological significance of LSEs by identifying conserved domains, subcellular localization signatures, such as signal peptides and transmembrane regions, and other features of proteins in LSCs that might allow prediction of their functions (when less than obvious). These identifications for the top five LSCs in each organism are shown in Table 1. We categorized the LSCs into broad functional classes to discern global functional trends and also investigated individual LSCs in an attempt to gain a more detailed understanding of their actual biological roles (Table 2; Supplementary Material available online).
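The two quantitative measures used in the demography analysis, the expansion coefficient (EC) and the power-law size distribution of LSCs, can be illustrated with a minimal Python sketch. This is not the authors' pipeline (their procedure is described in the Methods); the function names `expansion_coefficient`, `ec_histogram`, and `fit_power_law` are hypothetical, and the fit uses every observed cluster size, whereas the published fit covers only the linear part of the log-log data. The EC histogram mirrors the binning by EC range in steps of 0.1.

```python
import math
from collections import Counter


def expansion_coefficient(n_in_lscs, n_in_class):
    """EC = (proteins of a paralogous class that sit in LSCs) / (class size)."""
    if n_in_class <= 0 or not 0 <= n_in_lscs <= n_in_class:
        raise ValueError("need 0 <= n_in_lscs <= n_in_class and n_in_class > 0")
    return n_in_lscs / n_in_class


def ec_histogram(ec_values, n_bins=10):
    """Percentage of paralogous classes falling in each EC bin of width 1/n_bins.

    EC = 1 is counted in the last bin (0.9 to 1 when n_bins is 10).
    """
    counts = [0] * n_bins
    for ec in ec_values:
        counts[min(int(ec * n_bins), n_bins - 1)] += 1
    return [100.0 * c / len(ec_values) for c in counts]


def fit_power_law(cluster_sizes):
    """Fit P(k) = c * k**(-gamma) by least squares on (log k, log P(k)).

    P(k) is the fraction of clusters with exactly k members.
    Returns the pair (c, gamma).
    """
    freq = Counter(cluster_sizes)
    if len(freq) < 2:
        raise ValueError("need at least two distinct cluster sizes")
    total = len(cluster_sizes)
    xs = [math.log(k) for k in sorted(freq)]
    ys = [math.log(freq[k] / total) for k in sorted(freq)]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx                      # slope of the log-log line is -gamma
    intercept = mean_y - slope * mean_x    # intercept is log(c)
    return math.exp(intercept), -slope
```

With toy input such as `fit_power_law([2] * 16 + [4] * 4 + [8])`, a larger fitted γ corresponds to the steeper decay reported for the yeasts, and a smaller γ to the flatter distribution reported for A. thaliana.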
Figure 4 Distribution of lineage-specific clusters by Expansion Coefficient (EC). The X-axis shows ranges of EC values (see text) and the Y-axis shows the percentage of LSCs within each EC range. (Yellow) Schizosaccharomyces pombe; (orange) Saccharomyces cerevisiae; (pink) Drosophila melanogaster; (red) Caenorhabditis elegans; (green) Arabidopsis thaliana. In each class, the average value of the five species is indicated by a horizontal line.

Table 1. The Top Five Lineage-Specific Gene Family Expansions in Five Eukaryotes

Each cell gives the LSC name/function, followed in parentheses by the number of members (N a).

Rank | Saccharomyces cerevisiae | Schizosaccharomyces pombe | Drosophila melanogaster | Caenorhabditis elegans | Arabidopsis thaliana b
1 | Uncharacterized Ecm34p-like proteins (25) | Wtf family of non-globular proteins (20) | Trypsin-like serine proteases (178) | 7TM odorant receptors, two distinct clusters (264, 228) | Plant-specific kinases (316)
2 | Hexose transporters (15) | Major Facilitator Superfamily transporter (13) | Insect cuticle proteins (88) | Uncharacterized proteins containing a nematode-specific domain (182) | Plant-specific, F-box containing proteins (251)
3 | Amino acid permeases (14) | Ser/Thr repeat-containing non-globular proteins (11) | Cytochrome P450 family hydroxylases (83) | Integral membrane O-acetyl/Acyl transferases (151) | Extracellular domains often associated with kinases (221)
4 | Cell wall glycoproteins (11) | Alpha-amylases (11) | C4DM + Zn-finger (82) | 7TM receptors (122) | PPR module proteins, two distinct clusters (194, 195)
5 | Cell wall mannoproteins (11) | Gal4-like fungal C6 finger (8) | POZ-containing transcription factor (55); Odorant receptors (55) | C-type lectins, secreted proteins (115) | Apoptotic (AP)-ATPases (150)

a Number of members in the LSC.
b A family of 171 transposons, with mutator elements, was not included.

Table 2. Functions of Selected Lineage-Specific Protein Clusters in Five Eukaryotes

Name of the cluster a | Species b (no. of members) | Biological functions and other comments

Transcription regulation
AP2-like DNA-binding proteins | At (117) | Plant-specific transcription factors with multiple roles in stress and ethylene response and development (Riechmann et al. 2000).
MYB-like DNA-binding proteins | At (100, 48) | HTH-domain-containing transcription factors with diverse roles in development and regulation of various environmental responses (Riechmann et al. 2000).
WRKY-like DNA-binding proteins | At (68) | DNA-binding proteins involved in regulation of development and pathogen response.
RF-A family of nucleic acid-binding proteins (OB fold) | At (47) | An expansion involving the conserved archaeo-eukaryotic replication factor A that is present in a single copy in other eukaryotic lineages (Wold 1997).
Viv1/PVAL-like transcription factors | At (41) | Plant-specific transcription factors involved in abscisic acid response, seed differentiation, and development (Riechmann et al. 2000).
Nuclear hormone receptors | Ce (66, 43, 26, 26, and other small clusters) | Zn-dependent DNA-binding proteins typified by vertebrate steroid receptors. Many of the C. elegans members of this family may function independently of ligands, and characterized members like odr-7 have roles in cell-type differentiation (Sluder et al. 1999).
C4DM + Zn-finger-containing proteins | Dm (82) | Transcription factors typified by the Zeste-white 5 family. Consist of a DNA-binding C2H2-finger and C4DM, a predicted Zn-dependent protein–protein interaction domain (Lander et al. 2001).
SAZ-type Myb domain-containing proteins | Dm (40) | A specialized version of the MYB DNA-binding domain typified by transcription factors, such as Stonewall, Adf-1, and Zeste.
POZ + Zn-finger | Dm (55) | A class of DNA-binding, chromatin-associated transcription factors, such as Broad-complex, Lola, and trithorax-like; consist of a specific version of the POZ domain fused to a C2H2-finger.
C6 finger-containing proteins | Sp (4) | Gal4-like C6 Zn fingers are among the most common transcription factors in the ascomycete fungi.

Pathogen/stress response
AP-ATPases | At (150, 29, 17) | Plant disease-resistance loci products, typically consist of a TIR and an AP-ATPase domain combined with leucine-rich repeats (LRRs) (Hulbert et al. 2001).
Pepsin-like proteases | At (51), Ce (16) | Secreted proteases that could be involved in extracellular regulatory proteolytic cascades.
Subtilisin-like proteases | At (57) | Secreted proteases that could be involved in extracellular regulatory proteolytic cascades.
Papain-like proteases | At (14) | Thiol proteases that could be involved in stress responses and in germination.
Metalloproteases containing CUB domains | Ce (23) | Membrane-associated metalloproteases that could be involved in proteolytic cascades on the cell surface.
C-type lectins | Ce (115, 42), Dm (28) | Extracellular proteins containing adhesion modules potentially involved in recognition of specific pathogen surface molecules.
Chitinases | Ce (33), Dm (17) | Enzymes potentially involved in hydrolysis of cell walls of fungal pathogens.
Toll-like receptors | Dm (8) | Key receptors of the anti-pathogen response pathways.
CUB-domain proteins | Ce (40) | Extracellular adhesion proteins.
P450 hydroxylases | At (124, 34, 33, 28), Dm (83), Ce (46, 16) | Oxidoreductases involved in detoxification of diverse xenobiotics through hydroxylation (Nelson 1999; Tijet et al. 2001).
PR1-domain proteins | At (24), Ce (40) | Secreted proteins that could function as inhibitors of enzymes or adhesion molecules.
Cell wall mannoproteins | Sc (11) | Involved in cold shock and anoxic stress response.
α-helical peroxidases | At (73) | Enzymes generating nascent oxygen as part of the oxidative defense mechanisms.

Signaling
Concanavalin-like lectins | At (43) | Some of these lectins are fused to kinases as extracellular receptor domains and probably function as carbohydrate receptors.
PPR-module proteins | At (194, 195) | α-superhelical proteins that could function as protein–protein interaction scaffolds in various contexts.
Calcium-dependent protein kinases | At (44) | The principal transducers of Ca++ signaling that mediate this pathway in various contexts.
Plant-specific protein kinases | At (316) | Involved in various signaling pathways, such as hormone response, disease resistance, and development. Often fused to various other domains, including Apple, LRRs, and bulb lectins.
Octicosapeptide module proteins | At (72, 17, 14) | A Ca++-binding signaling module; some are fused to VTV1-like DNA-binding domains and GAF domains (Ponting 1996).
NPH-3-like, plant-specific POZ-domain proteins | At (30) | Specialized POZ domains, some of which are involved in plant light response signaling.
PP2C phosphatases | At (20) | Phosphoserine phosphatases that function in diverse signaling pathways, e.g., abscisic acid signaling.
Worm-specific S/T kinases | Ce (65) | A distinct, nematode-specific branch of the casein kinase family.
Receptor guanylate cyclases fused to protein kinases | Ce (13, 12) | Potential receptors of secreted peptide first messengers by analogy to mating pheromone receptors of sea urchins.
Worm-specific domains | Ce (42) | Uncharacterized domain probably involved in specific protein–protein interactions; some are fused to SET, caspase, kinase, and PHD domains.
POZ-domain proteins | Ce (26, 29) | Often fused to MATH domains, possibly function as chromatin-associated adaptors.
Insulin-like peptides | Ce (11) | Probably function as nematode-specific peptide hormones or growth factors.
Sec14-domain proteins | Dm (23) | Probably participate in regulation of protein trafficking and vesicular cargo loading.
SET-domain proteins with an inserted metal-chelating module | Dm (10) | Protein methyltransferases containing a divergent SET domain with a characteristic insert of a metal-chelating module. Probable regulators of chromatin dynamics.
Geko-domain proteins | Dm (8, 17) | A large family of Drosophila-specific cysteine-rich proteins; the only characterized member, Geko, is involved in olfaction. The LSC might be functionally coupled to the correspondingly expanded olfactory receptor families.

Ubiquitin signaling/protein unfolding and degradation
F-box proteins | At (251, 64, 41, 23), Ce (111, 46, 21) | Specificity-defining E3 subunits of ubiquitin ligases; fused to several other domains that might act as scaffolds for the assembly of the ubiquitinating enzyme complexes (Kipreos and Pagano 2000).
RING-finger proteins | At (74, 16, 12) | The majority of the RING fingers in the LSCs are of the RING-H2 category; probably function as specific E3-ligases.
U-box proteins | At (21, 18) | RING-finger derivatives that probably mediate multiubiquitination of specific targets.
Ubiquitin-domain proteins | At (11) | Probably utilized similarly to ubiquitin, but could specifically conjugate with different proteins.
Adenoviral-type proteases | At (117) | Probably involved in deubiquitination as exemplified by ULP1/SMT4 (Li and Hochstrasser 2000; Nishida et al. 2000).
GH3-domain proteins | At (17) | Share a conserved domain with the E1 subunits of ubiquitin ligases; might be negative regulators of the signalosome.
MATH-domain proteins | Ce (81), At (73) | Related to the MATH domains of the ubiquitin carboxy-terminal hydrolases and E3-ligases of the TRAF family; could function as adaptors in ubiquitin pathways.
Prolyl hydroxylases | Dm (19), At (10) | Hydroxylation of prolines by these enzymes might provide targets for ubiquitination by specific E3-ligases (Aravind and Koonin 2001).
Cyclophilin-type peptidyl-prolyl isomerases | Dm (10) | Catalyze isomerization of proline-containing peptide bonds; might function in regulating aggregation of protein complexes.

Chemoreceptors and small molecule sensors
7-transmembrane olfactory receptors | Ce (264, 228, 122) | Receptors for odorants/environmental chemicals (Dryer 2000; Glusman et al. 2001).
Insect-type odorant receptors | Dm (55) | Receptors for odorants/environmental chemicals.
Pheromone-binding proteins | Dm (27) | Probably involved in the binding and delivery of odorants to chemoreceptory cells.
Patched-type sterol binding membrane proteins | Ce (15) | Bind lipids and sterols in various contexts including stabilization of receptor complexes.
Juvenile hormone and other small-molecule-binding proteins | Dm (27) | Probably involved in the binding and delivery of small molecules in the insect haemolymph.
Lipid-binding proteins (NLTP) | At (49, 26) | Cysteine-rich α-helical proteins involved in lipid binding and delivery in various contexts and wax deposition.
Jacalin-type lectins | At (44) | Might be involved in sugar binding and storage.
Hemocyanins | Dm (10) | Copper-dependent oxygen transport proteins.
Cyanin family proteins | At (34) | Copper-binding proteins.

Ion Channels and Transporters
Degenerin family channels | Dm (24) | Sodium channels, probably function in tactile reception and related ion-dependent signaling pathways.
Potassium channels | Ce (15) | Potassium channels of the double pore category, probably function as pH-dependent channels.
Innexin-type channels | Ce (20) | Channels related to the Dm Shaking-B protein, might be involved in the formation of gap junctions.
cNMP-gated channels | At (21) | Cyclic nucleotide-gated channels containing an intracellular cNMP-binding domain.
Amino acid transporters | At (33) | Amino acid transporters of the N-amino acid transporter family.
Potassium transporters | At (17) | Belong to the plant tiny root hair family; probably involved in potassium uptake.
Na-P-transporter-related proteins | Ce (26) | Probably involved in phosphate uptake by symport.
Hexose transporters | Sc (15) | Belong to the 12TM sugar transporter superfamily.
ABC transporters | Dm (11, 9, 5) | Transporters containing two ABC-class ATPase domains.

Small molecule metabolism
Lipases | At (106) | A family of phospholipid lipases of the flavodoxin fold; involved in degradation of phosphatidylcholine. Could be involved in metabolizing lipids in germination or degrading lipid membranes of pathogens.
2-OG-Fe dioxygenases | At (67) | Hydroxylases involved in the biosynthesis of numerous plant secondary metabolites, such as gibberellins (Aravind and Koonin 2001).
NH2 cinnamoyl/benzoyl-transferase | At (56) | Transfers aromatic carboxylic acid groups to diverse targets in the biosynthesis of plant secondary metabolites.
Small molecule O-methylases | At (38, 15) | Catalyze the methylation step in the biosynthesis of diverse plant products, such as caffeic acid.
Glutathione S-transferases | At (14), Ce (28), Dm (27) | Catalyze the conjugation of electrophilic substrates, particularly xenobiotics, to glutathione as part of their transport and detoxification; additionally have peroxidase and small molecule isomerase activities.
Predicted secreted small molecule methylases | Ce (32) | Contain specific disulfide bonds; probably catalyze methylation of extracellular small molecules.
Integral membrane O-acyltransferases | Ce (151) | A family of membrane-associated acyltransferases closely related to the bacterial membrane-associated acyltransferases that acylate macrolide antibiotics and cell surface polysaccharides.
Predicted small molecule kinases | Ce (23), Dm (45) | Related to aminoglycoside and lipid kinases; probably involved in phosphorylation of small molecules, such as odorants and/or xenobiotics.

Structural/morphological proteins
Cystine-rich expansions | At (35) | Plant cell-wall glycoproteins.
Pectin methylesterases | At (89) | Involved in the biosynthesis of pectins, major structural components of plants.
Pectin-associated proteins | At (26) | Four-cysteine α-helical domains, some fused to pectin esterases.
Cuticular collagens | Ce (34, 32, 26, 11) | The principal structural component of the nematode cuticle (Johnstone 2000).
Major sperm protein family | Ce (32, 10) | The principal structural component of nematode sperms.
Insect cuticular proteins | Dm (88) | The principal structural component of the insect cuticle (Andersen et al. 1995).
Peritrophin-like proteins | Dm (40) | Insect-specific extracellular matrix proteins.
Cell wall glycoproteins | Sc (11) | Protein component of the yeast cell wall.
Ecm34p-like proteins | Sc (25) | Protein component of the yeast cell wall.

a The members of each LSC are listed in the Supplementary Material section, in which the LSCs can be identified by their names and the number of members.
b Species abbreviations: (At) Arabidopsis thaliana; (Ce) Caenorhabditis elegans; (Dm) Drosophila melanogaster; (Sc) Saccharomyces cerevisiae; (Sp) Schizosaccharomyces pombe. The number of members in each LSC is indicated in parentheses; commas separate distinct LSCs that belong to the same class of paralogous proteins.

Although LSEs occurred in most biological functional classes, LSCs with predicted organism-specific functions, such as pathogen and stress response, transcription regulation, controlled protein degradation mediated by the ubiquitin system, protein modification, signal transduction, chemoreception, and small molecule metabolism, were most abundant (Tables 1 and 2). A typical example of an expansion related to an organism-specific function is that of the C. elegans collagens, which are required for cuticle formation, a characteristic adaptation of the nematodes (Johnstone 2000). Similarly, in D. melanogaster and Arabidopsis, prominent LSEs are, respectively, the insect cuticular proteins (Andersen et al. 1995) and pectin/cellulose biosynthesis enzymes (Willats et al. 2001), both of which are critical for the formation of morphological features unique to these lineages. Typically, these proteins are required in large amounts as structural components of the respective organisms; hence, these lineage-specific expansions could principally help in increased production of these proteins. Extending this analogy, it is possible that several of the LSCs with no detectable homologs elsewhere could represent as yet uncharacterized, but abundant, lineage-specific structural proteins (Table 2).

Many of the identified LSCs had predicted biochemical characteristics that pointed to roles in stress and pathogen response. Particularly striking in this category was the expansion of proteases of the pepsin-like and subtilisin-like families in A. thaliana, trypsin-like proteases in D. melanogaster, and Zn-metalloproteases in C. elegans (Table 2). All of these proteases are predicted to be secreted molecules, and their repeated, independent expansion suggests that they are widely utilized either for direct degradation of pathogen proteins or as components of stress-triggered proteolytic cascades broadly analogous to the vertebrate complement and clotting systems (Bouchard and Tracy 2001; Southan 2001). Alternatively, in the case of plants, they could aid in protein digestion in the process of germination. Better-understood cases of similar lineage-specific expansions related to stress/pathogen-response components include the massive proliferation of apoptotic (AP-)ATPases and the accompanying moderate expansion of metacaspases in plants, and the parallel expansion of caspases in vertebrates (Aravind et al. 2001; Holub 2001). These proteins are either known or predicted to participate in multiple pathways associated with apoptosis or hypersensitive response. In this context, also of interest are the expansions of molecules containing modules functioning in extracellular adhesion. Prominent examples of these include the C-type lectins (D. melanogaster, C. elegans), PR1 proteins (C. elegans, A. thaliana), CUB domain proteins (C. elegans), and the bulb-lectin domain (A. thaliana). As with the immunoglobulin domain proteins, which are highly expanded in vertebrates, these molecules probably participate in the recognition and binding of specific pathogens as a part of defense mechanisms of the corresponding organisms (Table 2).

Earlier analysis of the LSEs involving transcription factors had suggested that they included proteins regulating critical aspects of the development of the organism (Aravind and Koonin 1999; Riechmann et al. 2000; Lander et al. 2001). For example, the proteins belonging to the POZ and SAZ-type
radation has been shown to occur through the recognition of hydroxyproline by ubiquitin ligase complexes (Ivan et al.
MybdomainexpansionsinD.melanogaster(Table2)regulate 2001).Thus,theLSEof2-oxoglutarate-dependentprolylhy- as diverse functions as maintenance of the antero-posterior droxylases(AravindandKoonin2001)detectedinD.melano- Hox gene expression pattern, neurite outgrowth and path- gasterandA.thalianacouldrepresentanothercaseinwhich finding, and organogenesis (Aravind and Koonin, 1999; therangeofthecoreubiquitinationpathwayisexpandedvia Landeretal.2001).Thus,itappearsthatproliferationofnew diversificationoftheterminaleffectors. transcriptionfactorfamilies,followedbytheirrecruitmentas The role of LSE in the diversification of proximal com- upstreamordownstreamregulatorswithrespecttocorecon- ponents of signal transduction systems, receptors, had been served developmental pathways, have contributed substan- noticedpreviouslyinthecasesofindependentexpansionsof tiallytotheevolutionofmorphologicaldiversityinanimals. odorant receptors/7-transmembrane chemoreceptors seen in Thegeneralityofthisobservationwasreinforcedbytheevi- different animal lineages (Dryer 2000; Glusman et al. 2001) denceofmassive,lineage-specificexpansionanddiversifica- and plant receptor kinases containing extracellular leucine- tion of various transcription-factor families in the plant A. rich repeats, bulb lectin, or EGF-like extracellular domains thaliana (Table 2). Many of these include well-characterized (ShiuandBleecker2001).Here,wedetectedotheranalogous DNA-bindingproteins,suchastheMADSboxandMYBdo- expansionsofupstreamsignalingproteins,suchaspotassium main proteins, that have been shown previously to partici- channels, innexin family channels (both in C. elegans), and pate in plant-specific functions, including development of tetraspaninsanddegenerin-typechannelsinD.melanogaster flowersandotherstructures,meristemaldifferentiation,and (similarLSEsofK-channelsandtetraspaninsarealsoseenin organ-specific gene expression (Riechmann et al. 2000). In humans). 
The proteins involved in these expansions are this study, we detected certain unexpected expansions of linked to the organism’s responses to external as well as in- DNA-binding proteins in plants that might point to previ- ternalhomeostaticstimuli.Thus,suchexpansionscouldserve ously unrecognized transcription regulators. Examples in- as the raw material for the behavioral and physiological ad- clude the proteins homologous to the mitochondrial tran- aptation of organisms to their specific environments. Lin- scription termination factor, which, in other eukaryotes, is eage-specific expansions are also seen in a range of protein- presentinasinglecopythatfunctionsinthemitochondrion modifyingenzymesofdifferentsignaltransductioncascade, (Fernandez-Silva et al. 1997). The additional paralogs in suchasproteinkinasefamiliesinmostlineages,SET-domain plantshaveprobablyacquireddifferenttranscription-related protein-methylases in D. melanogaster, and PP2C phospha- functionsbecausetheyformatightcluster,distinctfromthe tasesinplants.Aswiththeubiquitinsystem,theseappearto ancestralmitochondrialversion.Plantsalsoshowanexpan- beameansoflinkingwell-conservedstemsofsignalingpath- sionoftheDNA-bindingreplicationfactorA(RF-A),with>40 waystodistinctsetsofterminaltargets. copiesinA.thaliana,incontrasttotheone-threecopiesob- AnotheraspectoftheinvolvementofLSEsintheevolu- servedinothereukaryotes.Theexpansionanddivergenceof tion of signal-transduction networks is the extensive prolif- RF-A in plants suggest that the plant-specific paralogs are eration of families of proteins containing adaptor domains. probablyutilizedastranscriptionfactorsratherthanintheir Alongwiththeirexpansion,manyadaptordomainshavealso usual capacity in replication (Wold 1997). 
These and other recombinedwithavarietyofotherdomains,probablyallow- such examples (Table 2) illustrate that transcription factors ingtheemergenceofnewnetworksofinteractions.Astriking arerecruitedfromawidevarietyofpre-existingsourcesand example is the major expansion of proteins containing the diversifytooccupynewfunctionalnichesviaLSE. small Ca-binding octicosapeptide (OOP) module (Ponting WeobservedamajorroleofLSEintheelaborationofthe 1996)inA.thaliana.SomeOOPmodulesarefusedtoVIV1- ubiquitinpathway,whichisinvolvedinthedegradationand like plant-specific DNA-binding proteins and a specialized regulatory modifications of proteins (Hershko and Ciecha- classofGAFdomains,suggestingthattheylinktranscription nover 1998). Evidence of LSE was obtained for several com- regulationandsmallmoleculeinteractionstoCa-dependent ponentsoftheubiquitinsystem,inparticular,E3subunitsof signaling. Another notable case is a novel adaptor domain, ubiquitin ligases containing the F-box domain (Kipreos and typifiedbytheamino-terminaldomainoftheCaspase-1Aiso- Pagano2000)(A.thalianaandC.elegans)andtheRING-finger form,whichsofarwasdetectedonlyinC.elegans.Altogether, (A.thaliana).BecausetheE3proteinsarespecificitydetermi- theC.elegansgenomeencodes>40membersofthisdomain nantsthatareinvolvedintargetingtheconservedubiquitin- family, which, in addition to the caspase fusion, also form ligationmachinerysystemtospecificsubstrates(Jacksonetal. multidomainproteinswithSET-domainmethylases,PHDfin- 2000), their diversification through LSE probably provides a gers,andkinases.Giventhe(cid:2)-helicalstructurepredictedfor meansofharnessinganotherwiseconservedsystemtoregu- thisdomain,andenrichmentinchargedresidues,itprobably latethedegradationofdiversesetsoftargets.Inasimilarvein, functionsasaprotein–proteininteractionmodule. 
both nematodes and plants also show independent LSEs of Another, somewhat unexpected generalization that theMATHdomain.Thisdomain,whichtendstoformfusions emergedfromthepresentanalysisistheprevalenceofsmall toubiquitincarboxy-terminalhydrolasesorRING-fingerE3s molecule-modifyingenzymesamongtheLSEs.Inplants,the (Aravindetal.1999;Polekhinaetal.2002),mightserveasan proliferationofsuchenzymes,namelymethylasesofthecaf- additionaladaptorthatmediatesde/ubiquitinationofspecific feicacidO-methylasefamily,dioxygenasesofthegibberellin- targets. A. thaliana has a prominent proliferation of the ad- hydroxylasefamily,andavarietyoflipasesandacyltransfer- enovirus-likethiolproteasesuperfamilywhosemembers(e.g., ases, correlates with the plethora of secondary metabolites, Smt4/Ulp1)inyeastandinvertebrates,removeubiquitin-like suchaspigments,volatilearomaticcompoundsalkaloids,and proteinsfromtheirtargets(LiandHochstrasser2000;Nishida waxes that are produced by plants (Seigler 1998). However, etal.2000).Thus,inplants,thisLSCprobablycontributesto their large numbers suggest that the entire diversity of me- further diversification of the regulation of ubiquitin- tabolites produced even by plants such as A. thaliana with dependentproteindegradation.Targetingofproteinsfordeg- relativelysimplegenomesisunder-appreciatedtoalargeex- 1056 Genome Research www.genome.org Lineage-Specific Gene Expansion in Eukaryotes tent.Interestingly,animalsalsohaveseveralLSEsassociated METHODS withsmallmoleculemetabolism.Someofthese,suchasgly- The protein set for the nematode C. elegans was from the cosyltransferasesandacyltransferases,suggesttheremightbe WormPep20 data set (http://www.sanger.ac.uk/Projects/ an as yet unexplored, lineage-specific diversity of carbohy- C_elegans/wormpep);theproteinsetsforotheranalyzedeu- drates and lipid moieties that are associated with glycopro- karyotic species were extracted from the NCBI (NIH) nonre- teins, lipoproteins, and other cellular metabolites. 
The two dundant(nr)proteinsequencedatabase.Thehumanprotein independentexpansionsofpredictedsmall-moleculekinases setwasnotsystematicallyanalyzedbecauseofextensiveprob- relatedtoethanolamineandaminoglycosidekinases(Honet lemswithgenepredictions,resultinginfragmentaryproteins, al. 1997) (in D. melanogaster and, to a lesser extent, in C. artificial fusions, and inclusion of pseudogene translations andtranslationofnoncodingDNA. elegans)andtheexpansionofsecretedmethylasesinC.elegans Identical or nearly identical (98% or greater) sequences areparticularlyenigmatic.Giventheroleoftherelatedbac- wereremovedfromthedatasetsusingtheBLASTCLUSTpro- terialkinasesandmethylasesinxenobioticresistance(Hagg- gram. For documentation on its use, see ftp://ftp.ncbi. blom1990),theseenzymesmightbeusedtomodifyarange nlm.nih.gov/blast/documents/README.bcl. LSCs were iden- of xenobiotics encountered by the animals in their specific tifiedusingthefollowingprocedure:BLASTcomparisonsfor environments.Alternatively,theycouldmodifyvariousenvi- all proteins in the analyzed set of complete eukaryotic ge- ronmental substances to convert them to forms more easily nomeswererunagainstthedatabaseconsistingofthesame sensedbythechemoreceptorsoftheseorganisms. set of proteins. Symmetrical relative similarity scores (R =R =max(S /S ,S /S ),inwhichS istheBLAST AB BA AB AA BA BB AB bitscoreforqueryAandsubjectBwererecorded.Suchscores range from 0 (no significant hit found) to 1 (identical pro- teins). For each protein A in a given genome X (e.g., C. 
el- Conclusions egans),asetofcandidatefamilycomembers{B}wasdefinedas asetofproteinsfromthesamegenomeXsatisfyingthecon- A computational procedure for systematic detection of lin- dition(R >R ;for(cid:1)C⊄ X)(i.e.,similaritybetweenthegiven AB AC eage-specific expansions of protein families was developed proteinAandanotherC.elegansproteinBisgreaterthanthat andappliedtoobtainacomprehensivecensusofLSEsinfive betweenAandanyproteinCfromanyothergenome).Then, eukaryoticgenomes.LSEsappeartohaveplayedanimportant allsuchsetsfromXweremergediftheysharedatleastone role in the growth and differentiation of the proteomes of member (single-linkage clustering), resulting in grouping all multicellular eukaryotes. Many paralogous gene families in proteinsfromXintoclusters{A}(manyofwhichmightcon- tain only a single protein). This procedure leads to heavy crown-group eukaryotes appear to have evolved almost en- overclustering because, even if only one pair of proteins in tirelythroughLSEafterthedivergenceoftheexaminedsister two distinct LSCs passes the comembership condition (e.g., lineages from their ancestors. This fundamental process of duetofluctuationsintheobservablesimilarity),thetwoLSCs genefamilyexpansionwasactiveatawiderangeofphyloge- are merged by the single-linkage algorithm. This over- netic distances, from the relatively close species of yeasts to inclusivesetofclusterswasrefinedthroughidentificationof the much earlier separation of plants from the rest of the the most closely related proteins from other genomes. For crown-group taxa. Generally, the fraction of proteins found each A⊄ {A}, the best alien hit C was identified as [C | max- in LSCs and the fraction of large families among LSCs posi- (RAC); C⊄ X]. Sets {A}∪ {C} (i.e., candidate LSC members and tivelycorrelatewiththesizeofeukaryoticproteomes. 
theirclosestalienrelatives)weresubjecttoUPGMAclustering onthebasisofrelativesimilarityscores.Underthisprocedure, Examination of the known and predicted functions of proteins from other genomes that show high similarity to thedetectedLSEsrevealscertaingeneralprinciples.Genesen- some candidate LSC members may intrude into the clus- codingproteinstypicallyrequiredinlargequantitiesascom- ter and split it apart. Subclusters {A(cid:3)} satisfying [A(cid:3)⊂ X] (i.e., ponentsofanorganism’smorphologicalstructuresareoften UPGMAsubtreesconsistingofproteinsexclusivelyfromthe subjecttoLSEandappeartobefixedversionsofthecommon currentlyanalyzedgenomeX)andincludingmorethanone phenomenonofgeneamplification,withfine-tuningadded proteinwereconsideredtorepresentLSCs. through sequence diversification (Kondrashov et al. 2002). Proteinsequencesimilaritysearcheswereperformedus- Another major set of LSCs consists of proteins involved in ing the gapped BLASTP program against the nonredundant recognition and binding of pathogens and xenobiotics and protein sequence database (NCBI, NIH). Iterative profile searchestodetectmoredistantrelationshipswereperformed withstandingenvironmentalstress.TheseLSCsprobablypro- usingthePSI-BLASTprogram(Altschuletal.1997),withthe videtherawmaterialforgeneratingthediversityrequiredto inclusion threshold typically set at E=0.01; only predicted counterrapidlychangingpathogensandtorespondtoother globular regions from proteins were used as seeds for variableenvironmentalfactors.Expansionfollowedbydiver- PSI-BLASTsearches.Proteinswerepartitionedintoprobable sificationoftheproteinsintheLSCsappearstobeacommon globular and nonglobular regions using the SEG program meansofgeneratingnewspecificitiesinsignalingpathways. (Wootton1994).Conserveddomainsweredetectedusingdo- Inparticular,intheubiquitinsystem,alargenumberofthe main-specific PSSMs constructed using the PSI-BLAST pro- E3componentsoftheubiquitinligase,whichtargetittospe- gram (Chervitz et al. 1998). 
Multiple alignments were con- cific proteins, are drawn from LSEs. Expansions of adaptor structed using the T_Coffee (Notredame et al. 2000) and ClustalX (Thompson et al. 1997) programs and corrected modulesfollowedbytheirfusiontodiversedomainsprobably manuallyonthebasisofPSI-BLASTsearchresults,which,on resultintheemergenceofnovelinteractionsthatcontribute some occasions, correctly detect conserved sequence motifs to signaling and transcription regulation. Several expanded missed by multiple alignment methods. These alignments enzymefamiliesalsopointtotheexistenceofan,asyet,un- were used to construct Neighbor Joining phylogenetic trees discovereddiversityofsmallmoleculemetabolitesinvarious (Saitou and Nei 1987) using the PAUP* (Swofford 1998) and lineages. Thus, LSE seems to be one of the most important PHYLIP (Felsenstein 1996) package (the evolutionary dis- sourcesofstructuralandregulatorydiversityincrown-group tances were calculated using the PROTDIST program of eukaryotes,whichwascriticalforthetremendousexploration PHYLIP),andthesupportfornodesofinterestwasevaluated ofthemorphospaceseenintheseorganisms. by use of 1000 bootstrap replicates. Secondary structure of Genome Research 1057 www.genome.org
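
The first stage of the LSC detection procedure described in the Methods (relative similarity scores, the comembership condition, and single-linkage merging) can be illustrated with a small sketch. This is not the authors' code: the function names, the dictionary layout for bit scores, and the toy values are all assumptions made for illustration, and the later UPGMA refinement step is deliberately left out.

```python
from collections import defaultdict


def relative_similarity(bits, a, b):
    """R_AB = R_BA = max(S_AB/S_AA, S_BA/S_BB); 0.0 when no hit was recorded.

    `bits` maps (query, subject) pairs to BLAST bit scores (assumed layout).
    """
    s_aa = bits.get((a, a), 0.0)
    s_bb = bits.get((b, b), 0.0)
    r_ab = bits.get((a, b), 0.0) / s_aa if s_aa > 0 else 0.0
    r_ba = bits.get((b, a), 0.0) / s_bb if s_bb > 0 else 0.0
    return max(r_ab, r_ba)


def candidate_clusters(bits, genome_of, genome_x):
    """Group proteins of genome X by the comembership condition.

    B is a candidate comember of A when R_AB exceeds R_AC for every
    protein C outside X; sets sharing a member are merged (single linkage).
    """
    own = sorted(p for p, g in genome_of.items() if g == genome_x)
    alien = sorted(p for p, g in genome_of.items() if g != genome_x)

    # Union-find structure implements the single-linkage merge.
    parent = {p: p for p in own}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for a in own:
        # Score of the best hit of A in any other genome (0.0 if none).
        best_alien = max(
            (relative_similarity(bits, a, c) for c in alien), default=0.0
        )
        for b in own:
            if b != a and relative_similarity(bits, a, b) > best_alien:
                parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for p in own:
        clusters[find(p)].add(p)
    return sorted(sorted(members) for members in clusters.values())


# Toy bit scores, invented purely for illustration.
TOY_BITS = {
    ("ce1", "ce1"): 100, ("ce2", "ce2"): 100,
    ("ce3", "ce3"): 100, ("dm1", "dm1"): 100,
    ("ce1", "ce2"): 80, ("ce2", "ce1"): 80,
    ("ce1", "dm1"): 40, ("dm1", "ce1"): 40,
    ("ce3", "dm1"): 70, ("dm1", "ce3"): 70,
    ("ce1", "ce3"): 30, ("ce3", "ce1"): 30,
}
TOY_GENOMES = {"ce1": "Ce", "ce2": "Ce", "ce3": "Ce", "dm1": "Dm"}

print(candidate_clusters(TOY_BITS, TOY_GENOMES, "Ce"))
# [['ce1', 'ce2'], ['ce3']]
```

In the toy data, ce1 and ce2 are more similar to each other than either is to the fly protein dm1, so they form a candidate cluster, whereas ce3 is closer to dm1 than to any nematode protein and stays a singleton. In the procedure described above, such candidate clusters would then be refined by UPGMA clustering together with their best hits from other genomes.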
