Systematic Phylogenomic Evidence of en Bloc Duplication of the Ancestral 8p11.21–8p21.3-like Region

Alexandre Vienne,* Jeffrey Rasmussen,* Laurent Abi-Rached,† Pierre Pontarotti,† and André Gilles*

*EA Biodiversité 2202, Université de Provence, Marseille, France; and †INSERM U119, Marseille, France

The genomes of many higher organisms, including plants and bony fish, frequently undergo polyploidization, and it has long been hypothesized that these, and other, large-scale genomic duplications have played an important role in the major evolutionary transitions of our past. Here we build upon earlier work to show that the human genomic region 8p11.21–8p21.3 has three paralogous regions on chromosomes 4, 5, and 10 that were produced by two rounds of duplications after the protostomian-deuterostomian split and before the actinopterygian-sarcopterygian split. We base our analysis on the phylogenetic reconstruction of the evolutionary history of 38 gene families located in these regions. Using an alignment centered on protein domains, three different phylogenetic methods, and divergence time estimation, this analysis gives more support in favor of two ancient polyploidization events in the vertebrate ancestral genome.

Key words: vertebrate evolution, 2R, paralogous regions, polyploidisation, diploidisation, mapping, time divergence.

E-mail: [email protected].

Mol. Biol. Evol. 20(8):1290–1298. 2003
DOI: 10.1093/molbev/msg127
Molecular Biology and Evolution, Vol. 20, No. 8, © Society for Molecular Biology and Evolution 2003; all rights reserved.

Introduction

Genomic duplications have continuously occurred throughout evolutionary history, and it has long been theorized that they are essential to the evolution of genetic novelties (Stephens 1951). It is possible that creating genetic redundancy is the best way to increase the size of the genome and, in doing so, to furnish the raw material needed to develop new biological functions.

Polyploidization, the addition of one or more complete sets of chromosomes to the original set (Graur and Li 2000, pp. 480–482), is extensive in plants and has been shown to be an important process in plant speciation as well as in the evolution of vertebrates (Lundin 1993). To give two examples of plant polyploidization, consider that 50% to 70% of the angiosperms (Wendel 2000) and approximately 95% of pteridophytes (ferns) (Masterson 1994) are thought to have experienced at least one polyploidization event during their evolution.

Recently, 103 duplicated chromosomal regions in the Arabidopsis thaliana genome have been identified and are hypothesized to be the product of a unique polyploidization event (Vision, Brown, and Tanksley 2000). However, the overlap of some of these regions, and the different ways in which they occur, implies that several polyploidization events took place at different times. Other examples are known in plants, as in Gossypium hirsutum, the genome of which resulted from a single allopolyploidization event between two diploid genomes (Cronn, Small, and Wendel 1999). Although polyploidization is most common in plants, it is nonetheless observed throughout the tree of life, and some examples of polyploid animals are known, including the tetraploid common carp Cyprinus carpio (bony fishes) and Hyla versicolor (amphibian) (as reviewed in Soltis and Soltis 1999). The traces of such events can also be observed in the genomes of Saccharomyces cerevisiae (Wolfe and Shields 1997) and Danio rerio (Taylor, Van De Peer, and Meyer 2001).

Thirty years ago, Susumu Ohno (Ohno 1970) suggested that vertebrate genomes evolved by two rounds of genome-wide duplications. This hypothesis of two tetraploidization events was based on the difference in the size of the genomes and the amount of DNA per haploid cell between many vertebrate and chordate taxa.

One of the first arguments in support of such duplications was the identification of numerous co-orthologs in vertebrates for one gene in cephalochordata (e.g., amphioxus). Many examples are known in which one ortholog identified in amphioxus corresponds to two, three, or four co-orthologs in vertebrates (see for example Holland, Holland, and Kozmik 1995). This implies that the duplication events for these genes occurred independently after the cephalochordata/chordata separation and before the Gnathostomata radiation (fig. 1). However, it does not give any clue about a possible en bloc duplication; in an "ideal" case of en bloc duplication, all the co-orthologs of amphioxus genes would constitute a network of paralogous regions having the same duplication date.

FIG. 1. Schematic representation of the phylogenetic relationship between the main bilaterian groups. Values at each node are in millions of years and correspond to the estimation of divergence times between the different groups (Hedges 2001). The blank circles indicate the two rounds of duplication described by Ohno (1970): the first round of genome duplication (T1) occurred somewhere after the urochordata emergence; the second (T2) occurred before the tetrapoda emergence. The filled circles indicate the two rounds of duplication described by Holland et al. (1994): after the cephalochordata-chordata split (T1) and before the gnathostomata emergence (T2).

Several earlier studies supported Ohno's hypothesis of two rounds of duplication at the genome level. We previously identified the presence of multigene families located in four chromosomal regions (4p16, 8p12, 5q33-35, and 10q24) and concluded, using a basic analysis, that they had been created by two rounds of duplication from a common ancestral region (Pebusque et al. 1998). More recently, 1,642 "paralogons" have been identified in the human genome (McLysaght, Hokamp, and Wolfe 2002), and that rigorous work constitutes a starting point for the identification of paralogous regions that came about through ancient genome duplications. To estimate whether the majority of paralogous genes in the human genome arose from duplications that occurred in the early stage of vertebrate evolution, Gu, Wang, and Gu (2002) estimated the relative timing of 1,739 gene duplication events. Their analysis indicates that large-scale and constant small-scale duplications contributed to the elaboration of the vertebrate genome.

McLysaght, Hokamp, and Wolfe (2002) concluded that at least one round of large-scale duplication occurred in the vertebrate genome, whereas Gu, Wang, and Gu (2002) concluded that at least two rounds of large-scale duplication occurred, associated with continuous small-scale duplications. In the study published by Gu, Wang, and Gu (2002), one peak observed in the early stage of vertebrates likely corresponds to the one-round large-scale event reported by McLysaght, Hokamp, and Wolfe (2002), whereas another peak found after the mammalian radiation is more likely to be the result of recent segmental/tandem duplications.

The phylogenetic analysis of 31 genes located in these regions supported the hypothesis of en bloc duplications after the Cephalochordata/Chordata separation and before the emergence of Gnathostomata (Abi-Rached et al. 2002). Unfortunately, only six of the genes studied had more than two paralogs, making it impossible to rigorously test the 2R hypothesis or estimate the divergence time for the duplicated genes.

As in the study of the major histocompatibility complex (MHC) region and its paralogs (Abi-Rached et al. 2002), we chose the strategy of taking into account the possibility of domain shuffling, and we revisited the formerly identified paralogons in the 8p11.21–8p21.3, 10q21.2–10q26.3, 5q31.1–5q35.2, and 4p16.1–4q35.1 regions. The analysis consisted of reinvestigating their gene content, testing the phylogenetic relationships between paralogous genes found in these regions with three reconstruction methods, showing statistical evidence of their nonrandom distribution over the genome, and estimating the divergence time of these genes in order to test the 2R theory for these regions.

Materials and Methods

Genes Localization

Using the BLAT tool (Kent 2002; http://genome.ucsc.edu/goldenPath/octTracks.html), we identified those genes positioned between 8p11.21 and 8p21.3 that had two or more potential paralogs. This genomic region corresponds to an interval of 27.8 Mb, and 111 predicted genes are located inside this span. For the 46 genes predicted to be unique by the BLAT tool, the analysis was not taken further, but this does not mean that they do not have any paralogs: some of the duplicates might have been created during older duplication events, have yet to be detected, or have simply been lost. A total of 29 paralogs are found within 10q26.3–10q21.2 (an interval of 79.7 Mb), and 28 paralogs are shared between chromosomes 8 and 10; 16 paralogs are found within 5q35.2–5q31.1, which corresponds to an interval of 42.7 Mb, and 15 paralogs are shared between chromosomes 8 and 5. Furthermore, 10 paralogs are found in the 4q35.1–4p16.3 region, which corresponds to an interval of 192.98 Mb. They are spread over the entire chromosome, and 9 paralogs are shared between chromosomes 8 and 4. Although the phylogenetic relationship for the PDL paralog located on chromosome 8 was ambiguous, we kept this gene family because its paralogous genes are located in the chromosome 4, 5, and 10 regions considered. Two other regions were evidenced on chromosome 2p23.3-q11.2 (5 genes) and on the two parts of chromosome 20, 20p13 (2 genes) and 20q13.33 (3 genes).

Domains Identification

For the remaining 65 genes, many paralogs were predicted and their constitutive domains were identified in the Pfam database (Bateman et al. 2002). Their phylogenetic relationships were determined on the basis of Pfam domain alignments and reconstructions made by the Neighbor-Joining method (Saitou and Nei 1987) using MEGA2 (Kumar et al. 2001) for each of their constitutive domains. Congruence between the different domains was tested using the incongruence length difference (ILD) test (Farris et al. 1995; Thornton and DeSalle 2000). This step was done to avoid potential reconstruction biases from domain shuffling (as detected in Abi-Rached et al. 2002). This common mechanism is indeed thought to have played a major role in the evolution of new proteins (Doolittle and Bork 1993; Doolittle 1995), and we chose to take it into account for the analysis.

Sequence Retrieval and Phylogenetic Reconstruction

For those sequences whose evolutionary history agreed with the two rounds of duplications, more sequences were retrieved using BlastP (Altschul et al. 1997) on the nonredundant database (NR) and on the Danio rerio referenced proteins database at the National Center for Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/genome/seq/DrBlast.html). TBlastN was used on the Takifugu rubripes predicted proteins database at the HGMP-Fugu Genomics Project web page (http://fugu.hgmp.mrc.ac.uk/blast/). Sequences were aligned with ClustalX (Thompson et al. 1997), and the phylogenetic relationships were determined at the domain level using the Neighbor-Joining algorithm in MEGA2 (Kumar et al. 2001). Trees were rooted at the midpoint, and the outgroup was composed of protostomian sequences, except for three gene families in which deuterostomian sequences were also found: BAG4 (Drosophila sister group to the chicken sequence BAA13589), MGC1136 (Drosophila sister group to the human 10q22.2 XP_061191 and Fugu JGI_11847 sequences), and Loc55893 (Drosophila sister group to the mouse ZFA_MOUSE and Fugu JGI_2181 sequences).

Two other reconstruction methods were used for the 25 genes yielding more than two paralogs: the Maximum Parsimony method in PAUP*4.0 (Swofford 2000, Appendix 1) and the Maximum Likelihood method using Tree-Puzzle 5.0 (Strimmer and von Haeseler 1996). The bootstrap proportion (Felsenstein 1985) was used to assess the strength of the topologies for each of the three methods.

Saturation Visualization

Tree reconstruction with DNA or protein sequences can be biased by mutational saturation (Van de Peer et al. 2002). To avoid such problems and to find those taxa that could be implicated in branch swapping, and thus in wrong topologies, the saturation at the amino acid level was visualized with MUST (Philippe 1993) by comparing two distance matrices, the first one calculated using the p distance and the second one using the Poisson correction. When saturation was observed, it was usually caused by the outgroup and thus did not affect the reconstruction.

Four-Cluster Analysis and Templeton Test

The robustness of the topologies was tested by the four-cluster analysis in Phyltest (Kumar 1996) for the 25 gene families with more than two paralogs (see Appendix 2 in online Supplementary Material); their congruence was tested using the Templeton test with PAUP*4.0 under the Maximum Parsimony method (Nei and Kumar 2000).

Relative Rate Test and Estimation of Duplication Time

The Relative Rate Test was performed with Phyltest (Kumar 1996) to determine if the substitution rate of the different paralogous genes was homogeneous. Sequences whose substitution pattern was significantly different from the mean pattern were not included in the estimation. The duplication times were then estimated using a linearized Neighbor-Joining tree of the conserved sequences in MEGA2 (Kumar et al. 2001) as described elsewhere (Balczarek, Lai, and Kumar 1997; Kumar and Hedges 1998). The calibration times we used were mammal-amphibian (360 MYA) and tetrapod-teleost (420 MYA) (Kumar and Hedges 1998; Hedges 2000; Wang and Gu 2000).

Statistical Significance of Gene Distribution

The size and the gene number of all the human genomic regions were defined with the University of California at Santa Cruz (UCSC) Human Genome Browser. The studied region of chromosome 8 is approximately 27.8 Mbp, enclosed between the genes FGF17 (8p21.3) and VDAC3 (8p11.21), and contains 111 genes. Among these 111 genes, 25 are paralogs that have been created by duplications after the protostomian/deuterostomian separation and before the osteichthyian split. The other regions were defined as follows: between ANK (10q21.2) and BAG4 (10q26.2) for chromosome 10 (enclosing a total of 293 genes); between LOXL2 (5q23.3) and FGFR (5q35.3) for chromosome 5 (enclosing 255 genes); and between PDL (4q35.1) and EPB49 (4p16) for chromosome 4 (a total of 2,271 genes). The total number of genes in the human genome was estimated to be 35,000 (Lander et al. 2001; Venter et al. 2001).

Taking into account the total number of genes (2,819) located in these regions, the probability that a randomly selected gene would be located in one of them is 2,819/35,000 = 0.08; similarly, the probability for a gene to be located somewhere else in the genome is 1 − 0.08 = 0.92.

The gene nomenclature was set according to the HUGO gene nomenclature committee for 26 gene families. The mapping of all human genes was done using the BLAT tool available with the Draft Human Genome Browser of UCSC (UCSC Human Genome Project Working Draft, April 2002 assembly).

Results

The analysis consists of the following steps: (1) localization and phylogenetic characterization of the genes belonging to the 8p11.21–8p21.3 region, (2) identification of paralogous genes potentially arising from the two rounds of duplication, (3) statistical study of the distribution of the human paralogs within the paralogous regions, (4) analysis of the topological robustness of the constructed phylogenies, and (5) estimation of the duplication time of the paralogs.

Phylogenetic Analysis

Of the 111 genes located in the 8p11.21–8p21.3 region, 46 were predicted to be unique by the BLAT tool (see Materials and Methods). As a first approximation for the remaining 65 genes, we used the Neighbor-Joining algorithm with Poisson correction to visualize the phylogenetic relationships of their domains (Bateman et al. 2002). Even though duplication is a continuous process in genomes (Lynch and Conery 2000; Gu, Wang, and Gu 2002), we focused our analysis on the window of duplication events between the protostomian-deuterostomian separation and the actinopterygian-sarcopterygian split. We selected 38 paralog groups for this time frame, within which there were 25 families yielding more than two paralogs and 13 families yielding two paralogs. We then carried out complete phylogenetic analyses on the 38 gene families.

All tree topologies obtained from the 25 gene families with more than two paralogs were supported by Bootstrap Proportions (BP) for the Neighbor-Joining and the Maximum Parsimony reconstruction methods (see Supplementary Material online). For the Maximum Likelihood method, the values are quartet puzzling support values and will be interpreted as Bootstrap Proportions. The Bootstrap Proportions at the nodes were most often high (BP > 69) for the different paralogous groups (fig. 2). We used an outgroup made up of at least one protostomian sequence (Drosophila melanogaster or Caenorhabditis elegans sequence) for all the reconstructions. Three of the studied gene families had, however, a mixed outgroup composed of paralogs from older duplication events, as well as from protostomians and deuterostomians: these gene families are BAG4, MGC1136, and Loc55893. This mixing indicated that the true orthologs in the Drosophila melanogaster or Caenorhabditis elegans genomes for these families have not been detected or have simply been lost in the protostomian lineage. Gene loss is indeed common and has been described in a comparative analysis of Drosophila melanogaster and Anopheles gambiae (Zdobnov et al. 2002). Nevertheless, we were able to identify orthologous sequences in Takifugu rubripes for all of the data sets analyzed (fig. 2).

FIG. 2. Representation of the phylogenetic reconstructions using three reconstruction methods (Maximum Parsimony, Maximum Likelihood, and Neighbor-Joining) for four of the 38 gene families. The numbers at the nodes represent the bootstrap proportion for each method (the values for the Maximum Likelihood are support values but can be interpreted in the same way as Bootstrap Proportions). Orthologous sequences were identified in the Fugu (Takifugu rubripes) genome for the 38 data sets, but they are shown on the trees for only 26 data sets because all tests were already done when they were identified (13 gene families displaying more than 2 paralogs and 13 gene families yielding 2 paralogs). The topologies for gene families yielding two paralogs will be accessible at: http://www.evolution.luminy.univ-mrs.fr/duplications.

Localization of Paralogous Genes

A total of 79 paralogs were identified for the 38 gene families meeting the considered window of duplication time. Of these, 55 genes are located in the 10q26.3–10q21.2, 4q35.1–4p16.3, and 5q35.2–5q31.1 regions (table 1, fig. 3), and complete sets of paralogs are found in these four regions for four families: 2ABA, DPYSL, FGFR, and TPA.

Table 1
Localization of the 79 Paralogs for 38 Gene Families from the 8p11.21–8p21.3 Region of the Human Genome

          Chromosome
Gene      8        10               5       4       Others
FGF17     8p21.3   10q24.32         5q35.1  -       -
RAI16     8p21.3   10q25.3          -       -       -
PSD3      8p21.3   10q24.32         5q31.2  -       2q14.1
FLJ11264  8p21.3   10q11.23         -       -       Xq22.2
LPL       8p21.3   -                -       -       15q21.3, 18q21.1
SCAM-1    8p21.3   10q24.1          -       4q35.1  -
PPP3cc    8p21.3   10q22.2          -       4q24    -
EGR3      8p21.3   10q21.3          5q31.2  -       2p13.1
LOXL1     8p21.3   10q24.2          5q23.3  -       2p12
LYSAL1    8p21.3   10q24.2          -       -       -
LOC51312  8p21.2   10q24.2          -       -       -
NEF3      8p21.2   10q24.33, 10p13  -       -       -
EBF2      8p21.2   10q26.3          5q34    -       -
2ABA      8p21.2   -                5q32    4p16.1  -
DPYSL2    8p21.2   10q26.3          5q32    4p16.1  2p23.3
ADRA1A    8p21.2   -                5q33.3  -       20p13
STMN4     8p21.1   -                -       -       1p36.11, 8q21.2, 20q13.3, 6q25.2
CHRNA2    8p21.1   -                -       -       8q11.1, 15q24.3, 20q13.33
DUSP4     8p12     10q25.2          5q35.1  -       2q11.2
LOC55893  8p21.1   -                -       -       8q21.13, 20q13.33
RBPMS     8p12     -                -       -       15q22.31
PURG      8p12     -                5q31.3  -       7p13
MGC1136   8p12     10q22.2          -       -       -
FLJ14299  8p12     10q22.2          -       -       -
C8ORF2    8p12     10q24.31         -       -       -
BAG4      8p12     10q26.2          -       -       14q32.33
ADRB3     8p11.23  10q25.3          5q32    -       -
FLJ12526  8p12     -                -       -       20p13
4EBP1     8p12     10q22.1          5q31.3  -       -
FGFR1     8p12     10q26.12         5q35.3  4p16.3  -
TACC1     8p11.23  10q26.11         -       4p16.3  -
SFRP1     8p11.21  10q24.2          -       4q32.1  -
LOC51125  8p11.21  10q24.2          -       -       7q11.21
ANK1      8p11.21  10q21.2          -       4q26    -
TPA       8p11.21  10q22.2          5q35.2  4p16.2  -
AP3M2     8p11.21  10q22.2          -       -       -
VDAC3     8p11.21  10q22.2          5q31.1  -       -
PDL       -        10q23.33         5q31.1  4q35.1  -

NOTE. Genes with only two paralogs are indicated in italics. Their localization was determined with the BLAT tool on the UCSC Web site.

FIG. 3. Localization of the 55 paralogous genes in the paralogy regions located on chromosomes 10, 4, and 5, and the localization of the 5 paralogs in the chromosome 2 portion defined within 2p23.3–2q11.2 that might have been recombined in this region. The order of the genes on chromosome 8 is according to the UCSC Human Genome Project Working Draft (Kent et al. 2002), but it is arbitrary for the other paralogous regions. For the 25 genes yielding more than two paralogs, those that are not located on chromosome 4, 5, or 10 are not shown here.

Validation of the Distribution of Paralogous Genes

To test if the genomic distribution of the 55 genes positioned within the defined paralogous genomic regions (10q21.2–10q26.3, 4p16.1–4q35.1, and 5q31.1–5q35.2) differed from a random distribution, we used the following binomial formula:

$$\forall k \in \Omega;\quad P(X = k) = C_n^k\, p^k\, q^{n-k}$$

with the parameter $\Omega$ defined as the entire human genome; $k$, the number of genes landing in these three regions, equal to 55; $n$, the total number of paralogous genes, set to 79; $p$, the probability of a gene being within these three regions, at 0.08; and $q$, the probability of being located anywhere else in the genome, equal to 0.92 (see Materials and Methods). This statistical test was made under the assumption that the probability for a gene to belong to a part of the genome is proportional to the number of constitutive genes in that region.

Furthermore, we individually tested each of the 13 paralogous regions (see Materials and Methods). In this case, the Bonferroni correction was used to adjust downward the α-level of each individual test and thus to ensure that the overall risk for a number of tests remains 0.05. Thus the α-level was divided by the number of comparisons (0.05/13). Using the above formula and now calculating the probability of finding a paralogous gene located on chromosomes 1, 2, 4, 5, 6, 7, 8q, 10, 14, 15, 18, 20, or X for each gene located within the chromosome 8 region, we find that the distribution of the paralogous genes on chromosomes 5 and 10 is statistically different from a random distribution (with high significance, P < 0.001), strongly suggesting that these regions were created by en bloc duplications. This result suggests that the distribution of the paralogous genes in the other regions could be due to chance alone. With regard to chromosome 4, we can separate it into two parts: 4p16.1-p16.3 (five paralogous genes out of 36 predicted genes in this zone) and 4q24-q35.1 (five paralogous genes out of 617 predicted genes). In this case, the 4p16.1-p16.3 region is highly significantly different from a random distribution, whereas 4q24-q35.1 is not. This finding indicates that the two sections of chromosome 4 are dichotomous, one belonging to the two rounds of duplication and the other possibly resulting from shuffling events.

Congruency of the Tree Topologies

A Templeton test (Templeton 1983) of the topologies obtained from the three different reconstruction methods (Neighbor-Joining, Maximum Parsimony, and Maximum Likelihood) indicated a high congruence between the three topologies obtained by the various techniques.

To give statistical significance to the topologies obtained, we next employed a four-cluster analysis (Kumar 1996). Of the 25 gene families with more than two paralogs, our phylogenies indicated that 11 represent a (8,10), (4,5) topology (for a total of 44% of the genes analyzed by this method, with Confidence Proportion [CP] values ranging from 0.549 to 1); 5 corresponded to a (8,5), (4,10) topology (20% of the analyzed genes; CP values ranging from 0.376 to 0.858); 3 corresponded to a (8,4), (5,10) topology (12% of the analyzed genes; CP values ranging from 0.548 to 0.887); and 6 represented other topologies (24% of the analyzed genes; CP values ranging from 0.354 to 0.999). This predominance of observed (8,10), (4,5) topologies suggests the way the duplications occurred, and the other observed topologies can be explained by recombination (see Supplementary Material online). From the four-cluster analysis and the congruence of the topologies, we can conclude that, for this region of the genome, two rounds of duplications did indeed occur after the protostomian-deuterostomian separation and before the actinopterygian-sarcopterygian split during vertebrate evolution.

The determination of the phylogenetic relationships and the comparisons between four human chromosomal regions show that they duplicated en bloc.

Estimation of the Duplication Dates

To distinguish suitable gene families from which to estimate the timing of the duplication events, we selected families with constant substitution patterns using the Relative Rate test (see Materials and Methods) (Kumar 1996). We found that 22 of the 38 data sets evolved with a conserved molecular clock, but we used 21 of them to estimate the duplication times (we removed the ANK gene family because of its extreme value, more than 1,100 MYA, in comparison with all the other time divergence estimations).

The following hypothesis could be considered: a first duplication of an ancestral region giving rise to the (8,10) and (4,5) ancestral regions, followed by a second duplication giving rise to the 8 and 10 paralogous regions, on the one hand, and the 4 and 5 paralogous regions, on the other hand. These events were then followed by recombination. In this case only the genes yielding a (8,10) and (4,5) topology were used to estimate the timing of the first and second rounds of duplication. They were estimated to have occurred at T1 = 687 ± 155.70 MYA and T2 = 506.5 ± 103.76 MYA, respectively. We could also take into account a more general hypothesis: en bloc duplication occurred with a less clear history than above. In this case we used the complete data set (including the families with different topologies than before) and we obtained dates of T1 = 738.95 ± 74.84 MYA and T2 = 532.54 ± 57.84 MYA, respectively (fig. 4) (see Supplementary Material online).
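The binomial test described under Validation of the Distribution of Paralogous Genes can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' original computation: the function names are hypothetical, the upper-tail probability is our own choice of test statistic (the text gives only the point formula), and applying n = 79 to the two chromosome 4 subregions is an assumption made for the example. The gene counts (293, 255, 2,271, 36, 617, and 35,000) and the Bonferroni divisor of 13 are taken from the text.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p**k * q**(n - k), with q = 1 - p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_upper_tail(k, n, p):
    """P(X >= k), summed term by term so that very small tails stay accurate."""
    return sum(binomial_pmf(i, n, p) for i in range(k, n + 1))

GENOME_GENES = 35_000                  # estimated human gene count used in the text
GENES_IN_REGIONS = 293 + 255 + 2271    # chromosome 10, 5, and 4 regions

p = GENES_IN_REGIONS / GENOME_GENES    # about 0.08, as in the text
n, k = 79, 55                          # paralogs in total, paralogs inside the regions

point_probability = binomial_pmf(k, n, p)
tail_probability = binomial_upper_tail(k, n, p)

# Bonferroni correction: 13 regions tested individually at an overall risk of 0.05.
alpha = 0.05 / 13

# Illustrative per-region tests for the two parts of chromosome 4.
# Assumption: the same n = 79 is used; the text does not state the exact n per region.
tail_4p = binomial_upper_tail(5, n, 36 / GENOME_GENES)    # 4p16.1-p16.3
tail_4q = binomial_upper_tail(5, n, 617 / GENOME_GENES)   # 4q24-q35.1

print(f"p = {p:.4f}, P(X = 55) = {point_probability:.3e}, P(X >= 55) = {tail_probability:.3e}")
print(f"alpha = {alpha:.5f}, 4p significant: {tail_4p < alpha}, 4q significant: {tail_4q < alpha}")
```

Under these assumptions the sketch reproduces the qualitative pattern reported above: the gene-poor 4p16.1-p16.3 subregion falls below the corrected α-level, whereas the gene-rich 4q24-q35.1 subregion does not.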
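To make the dating step more concrete, the following Python sketch shows one way node heights from a linearized tree could be converted into duplication dates with the two calibration points given in Materials and Methods (360 and 420 MYA). It is a simplified illustration, not the MEGA2 procedure itself: the node heights are invented, the function names are hypothetical, averaging the rate over calibration nodes is our assumption, and the text does not say whether the reported ± values are standard errors or standard deviations (standard errors are assumed here).

```python
from math import sqrt
from statistics import mean, stdev

# Calibration dates (MYA) quoted in Materials and Methods.
CALIBRATIONS_MYA = {"mammal-amphibian": 360.0, "tetrapod-teleost": 420.0}

def estimate_rate(node_heights):
    """Mean height per million years over the calibration nodes (assumed averaging)."""
    return mean(node_heights[name] / date for name, date in CALIBRATIONS_MYA.items())

def date_node(height, rate):
    """Convert a node height on the linearized tree into a date in MYA."""
    return height / rate

def summarize(dates):
    """Mean and standard error of the dates obtained from several gene families."""
    standard_error = stdev(dates) / sqrt(len(dates)) if len(dates) > 1 else 0.0
    return mean(dates), standard_error

# Hypothetical node heights (substitutions per site) for a single gene family.
family = {"mammal-amphibian": 0.180, "tetrapod-teleost": 0.210, "T1": 0.350, "T2": 0.260}

rate = estimate_rate(family)
t1 = date_node(family["T1"], rate)   # first round of duplication
t2 = date_node(family["T2"], rate)   # second round of duplication

# Hypothetical T1 estimates from three families, combined into a mean and standard error.
t1_mean, t1_se = summarize([700.0, 740.0, 780.0])
print(f"T1 = {t1:.1f} MYA, T2 = {t2:.1f} MYA, pooled T1 = {t1_mean:.1f} +/- {t1_se:.1f} MYA")
```

In practice each of the 21 retained data sets would contribute its own T1 and T2, and families that fail the Relative Rate test would be excluded before this step.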
FIG. 4. Scheme showing the phylogenetic relationships between the species (adapted from Hedges 2001) and the estimation of the duplication time based on the analysis of 21 of the 38 data sets. The results obtained for the time estimation are represented by the two blank dots, and lines at either side of the dots represent the confidence interval. The first round of duplication was estimated to have occurred at T1 = 738.95 ± 74.84 MYA; the second round of duplication was estimated at T2 = 532.54 ± 57.84 MYA.

Putative Ancestral Shuffling

It is significant that within the five genes present in the 2p23.3-q14.1 region, four of their paralogous genes were absent from 4q35.1–4p16.3, suggesting an ancestral shuffling event between the 2p23.3-q11.2 and 4q35.1–4p16.3 regions. For the paralogous regions along chromosome 20, 20p13 (2 genes) and 20q13.33 (3 genes) corresponded to a lack of paralogous genes from the 10q26.3–10q21.2 region (but present within 8p21.1–21.2), suggesting another possible shuffling event.

Discussion

Pattern of 2R Segmental Vertebrate Duplication

The concordant results of our three complementary approaches (phylogenetic reconstruction, mapping, and time divergence) clearly show that the 8p11.21–8p21.3 region and its three paralogous regions of the human genome resulted from two rounds of en bloc duplication after the protostomian-deuterostomian divergence and before the osteichthyian split.

Phylogenetic Reconstruction

By employing a robust methodology based on phylogenetic trees of protein domains constructed by three different algorithms, we observed a predominance of (8,10), (4,5) topologies (44%), allowing us to discount the various other topologies as arising from undetected artifacts, crossing over between the different regions, rapid diploidization processes, ancestral polymorphism, or two rapid rounds of duplication (Furlong and Holland 2002).

Mapping of Paralogous Genes

The statistical analysis yielded strong support for the nonrandom distribution of the paralogous genes found on chromosomes 5 and 10. With regard to the paralogous sections of chromosome 4, the test rejected the entire chromosome as being paralogous but did not reject the 4p16.1–16.3 subregion. Interestingly, all five genes present in this region displayed a (8,10), (4,5) topology.

Time Divergence

To estimate the lower bound of the duplication more precisely, we comprehensively searched the Takifugu rubripes genome for paralogous sequences; to better determine the upper bound of the first duplication, we included the amphioxus sequences for EGR and FGFR (the complete sequencing of the amphioxus genome would be helpful for this purpose). Furthermore, McLysaght, Hokamp, and Wolfe (2002) and Gu, Wang, and Gu (2002) found an excess in date range between 397–695 MYA and 430–750 MYA, respectively, whereas our analysis estimated the mean ages of the first and second duplications to be 738 MYA and 531 MYA. Therefore, the divergence time estimates for the duplication events were similar (as for T1 and T2, Wang and Gu 2000), albeit within a large time window, whether or not the lineages of the protein domains were considered. Possible explanations for this result are that domain shuffling did not play a major role in locating the different genes to this region (in contrast to genes belonging to the MHC region; Abi-Rached et al. 2002), or that the information contained in the shuffled domain is diluted by the phylogenetic information of the other domains.

The Deciphering of the Ancestral Diploidization Process

This article represents a first step toward understanding the complex diploidization events that shaped the vertebrate genome. Given the potentially numerous gene deletions that occurred on chromosome 8 after the duplication events, we underestimate the number of vertebrate paralogous genes that could be conserved in these four regions. It would be interesting in future studies to take the same approach and to use the region 5q35.2-q31.1 as reference to detect the paralogous genes shared by chromosomes 10, 4, and elsewhere on chromosome 8.

In the future, to demonstrate an entire genome duplication, it will be interesting to show that the other paralogous regions described in the human genome (McLysaght, Hokamp, and Wolfe 2002) display a pattern similar to the one described for the 8p11.21–8p21.3, 10q21.2–10q26.3, 4p16.1–4q35.1, and 5q31.1–5q35.2 paralogous regions: (1) a majority of (but not only) (A,B) (C,D) topologies, (2) a nonrandom distribution of the paralogous genes in the parasyntenic region, and (3) similar divergence times.

The Evolutionary Fate and Consequences of Paralogous Regions

As regards the fate of duplicated regions, the substitution rate between paralogous genes shows that they evolved at a similar pace. This result contrasts with what we observed for the MHC and its paralogous regions (Abi-Rached et al. 2002), where the analysis of the substitution pattern showed that genes located on chromosome 9 had a lower substitution rate than the other paralogous regions. The results described in this study indicate that the proposition "only one paralogous region retains the ancestral function" is not a rule. Therefore the evolution of the paralogous regions after large-scale duplication is not a homogeneous phenomenon. To determine the fate of duplicated regions, other sets of paralogous regions should be investigated as described here.

Acknowledgments

We thank Marie-Eve Letzgus and Etienne Danchin for their assistance and helpful discussions.

Literature Cited

Abi-Rached, L., A. Gilles, T. Shiina, P. Pontarotti, and H. Inoko. 2002. Evidence of en bloc duplication in vertebrate genomes. Nat. Genet. 31:100–105.
Altschul, S. F., T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389–3402.
Balczarek, K. A., Z.-C. Lai, and S. Kumar. 1997. Evolution and functional diversification of the paired box (Pax) DNA-binding domains. Mol. Biol. Evol. 14:829–842.
Bateman, A., E. Birney, L. Cerruti, R. Durbin, L. Etwiller, S. R. Eddy, S. Griffiths-Jones, K. L. Howe, M. Marshall, and E. L. Sonnhammer. 2002. The Pfam protein families database. Nucleic Acids Res. 30:276–280.
Cronn, R. C., R. L. Small, and J. F. Wendel. 1999. Duplicated genes evolve independently after polyploid formation in cotton. Proc. Natl. Acad. Sci. USA 96:14406–14411.
Doolittle, R. F. 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64:287–314.
Doolittle, R. F., and P. Bork. 1993. Evolutionarily mobile modules in proteins. Sci. Am. 269:50–56.
Farris, J. S., M. Kallersjo, A. G. Kluge, and C. Bult. 1995. Testing significance of incongruence. Cladistics 10:315–319.
[...] scale duplications in vertebrate evolution. Nat. Genet. 31:205–209.
Hedges, S. B. 2001. Molecular evidence for the early history of living vertebrates. Pp. 119–134 in P. E. Ahlberg, ed. Major events in early vertebrate evolution: paleontology, phylogeny, and development. Taylor and Francis, London.
Holland, N. D., L. Z. Holland, and Z. Kozmik. 1995. An amphioxus Pax gene, AmphiPax-1, expressed in embryonic endoderm, but not in mesoderm: implications for the evolution of class I paired box genes. Mol. Mar. Biol. Biotechnol. 4:206–214.
Holland, P. W., J. Garcia-Fernandez, N. A. Williams, and A. Sidow. 1994. Gene duplications and the origins of vertebrate development. Development Suppl.:125–133.
Kent, W. J. 2002. BLAT: the BLAST-like alignment tool. Genome Res. 12:656–664.
Kent, W. J., C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler, and D. Haussler. 2002. The Human Genome Browser at UCSC. Genome Res. 12:996–1006.
Kumar, S. 1996. PHYLTEST: a program for testing phylogenetic hypotheses. Pennsylvania State University, University Park.
Kumar, S., and S. B. Hedges. 1998. A molecular timescale for vertebrate evolution. Nature 392:917–920.
Kumar, S., K. Tamura, I. B. Jakobsen, and M. Nei. 2001. MEGA2: molecular evolutionary genetics analysis software. Bioinformatics 17:1244–1245.
Lander, E. S., L. M. Linton, B. Birren et al. (253 co-authors). 2001. Initial sequencing and analysis of the human genome. Nature 409:860–921.
Lundin, L. G. 1993. Evolution of the vertebrate genome as reflected in paralogous chromosome regions in man and the house mouse. Genomics 16:1–19.
Lynch, M., and J. S. Conery. 2000. The evolutionary fate and consequences of duplicate genes. Science 290:1151–1155.
Masterson, J. 1994. Stomatal size in fossil plants: evidence for polyploidy in majority of angiosperms. Science 264:421–424.
McLysaght, A., K. Hokamp, and K. H. Wolfe. 2002. Extensive genomic duplication during early chordate evolution. Nat. Genet. 31:200–204.
Nei, M., and S. Kumar. 2000. Molecular evolution and phylogenetics. Oxford University Press, New York.
Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, Berlin.
Pebusque, M. J., F. Coulier, D. Birnbaum, and P. Pontarotti. 1998. Ancient large-scale genome duplications: phylogenetic and linkage analyses shed light on chordate genome evolution. Mol. Biol. Evol. 15:1145–1159.
Philippe, H. 1993. MUST, a computer package of management utilities for sequences and trees. Nucleic Acids Res. 21:5264–5272.
Saitou, N., and M. Nei. 1987. The Neighbor-Joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4:406–425.
Soltis, D. E., and P. S. Soltis. 1999. Polyploidy: recurrent formation and genome evolution. Trends Ecol. Evol. 14:348–352.
Stephens, S. G. 1951. Possible significance of duplication in evolution. Adv. Genet.
4:247–265. Felsenstein, J. 1985. Confidence limits on phylogenies: an Strimmer, K., and A. von Haeseler. 1996. Quartet puzzling: approach using thebootstrap. Evolution 39:783–791. aquartetmaximumlikelihoodmethodforreconstructingtree Furlong, R. F., and P. W. H. Holland. 2002. Were vertebrates topologies. Mol. Biol.Evol. 13:964–969. octoploid? Phil.Trans. R. Soc. Lond.Ser. B.357:531–544. Swofford, D. L. 2000. PAUP*: phylogenetic analysis using Graur, D., and W- H. Li. 2000. Fundamentals of molecular parsimony (*and other methods). Version 4. Sinauer Asso- evolution.2ndedition.SinauerAssociates,Sunderland,Mass. ciates, Sunderland,Mass. Gu, X., Y. Wang, and J. Gu. 2002. Age distribution of human Taylor, J. S., Y. Van De Peer, I. Braasch, and A. Meyer. 2001. genefamiliesshowssignificantrolesofbothlarge-andsmall- Comparative genomics provides evidence for an ancient 1298 Vienneetal. genome duplication event in fish. Phil. Trans. R. Soc. Lond. Vision,T.J.,D.G.Brown,andS.D.Tanksley.2000.Theorigins 356:1661–1679. of genomic duplications in Arabidopsis. Science 290:2114– Taylor, J. S., Y. Van De Peer, and A. Meyer. 2001. Genome 2117. duplication, divergent resolution and speciation. Trends Wang, Y., and X. Gu. 2000. Evolutionary patterns of gene Genet.17:299–301. families generated in the early stage of vertebrates. J. Mol. Templeton, A. R. 1983. Phylogenetic inference from restriction Evol. 51:88–96. endonuclease cleavage site maps with particular reference to Wendel,J.F.2000.Genomeevolutioninpolyploids.PlantMol. the evolutionof humans andapes. Evolution 37:221–244. Biol.42:225–249. Thompson,J.D.,T.J.Gibson,F.Plewniak,F.Jeanmougin,and Wolfe,K.H.,andD.C.Shields.1997.Molecularevidenceforan D. G. Higgins. 1997. The ClustalX Windows interface: ancient duplication of the entire yeast genome. Nature flexible strategies for multiple sequence alignment aided by quality analysistools. Nucleic AcidsRes. 24:4876–4882. 387:708–713. 
Zdobnov,E.M.,C.vonMering,I.Letunicetal.(34co-authors). Thornton, J. W., and R. DeSalle. 2000. Gene family evolution and homology: genomics meets phylogenetics. Annu. Rev. 2002. Comparative genome and proteome analysis of D GenomicsHum. Genet. 1:41–73. Anopheles gambiae and Drosophila melanogaster. Science o w Van De Peer, Y., T. Frickey, J. S. Taylor, and A.Meyer. 2002. 298:149–159. n Dealingwithsaturationattheaminoacidlevel:acasestudybased loa d onancientlyduplicatedzebrafishgenes.Gene295:205–211. Kenneth Wolfe, Associate Editor e d Ven2t0e0r,1J..CT.h,eM.sDeq.uAednacmes,oEf.Wth.eMhyuemrsaentalg.e(n2o72mceo.-aSucthieonrsc)e. Accepted March 25, 2003 from 291:1304–1351. h ttp s ://a c a d e m ic .o u p .c o m /m b e /a rtic le -a b s tra c t/2 0 /8 /1 2 9 0 /1 0 8 1 4 1 3 b y g u e s t o n 0 3 A p ril 2 0 1 9