Table Of ContentManuscriptsubmittedtoeLife
What is an archaeon and are the
1
Archaea really unique?
2
AjithHarish1*
3
*Forcorrespondence:
[email protected]( 1DepartmentofCellandMolecularBiology,StructuralandMolecularBiologyProgram,
4
)
UppsalaUniversity,Uppsala,Sweden
5
6
Abstract TherecognitionofthegroupArchaea40yearsagostimulatedresearchinmicrobial
7
evolutionandmolecularsystematicsthatpromptedanewclassificatoryschemetoorganize
8
biodiversity.AdvancesinDNAsequencingtechniqueshavesincesignificantlyimprovedthe
9
genomicrepresentationofthearchaealbiodiversity.Inaddition,advancesincomputational
10
modelingthatfacilitatelarge-scalephylogenomicshaveresolvedmanyrecalcitrantbranchesofthe
11
TreeofLife.Despitethetechnicaladvancesandanexpandedtaxonomicrepresentation,two
12
importantaspectsoftheoriginsandevolutionoftheArchaearemaincontroversial,evenaswe
13
celebratethe40thanniversaryofthemonumentaldiscovery.Theissuesconcern(i)theuniqueness
14
(monophyly)oftheArchaea,and(ii)theevolutionaryrelationshipsoftheArchaeatotheBacteria
15
andtheEukarya;bothofthesearerelevanttothedeepstructureoftheTreeofLife.Acloserlook
16
atrecentpublishedphylogenomicdatasetsthataddresstheseissuesshowsthattheconflicting
17
conclusionsareprimarilyduetoascarcityofinformationinstandarddatasets—thecore-genes
18
datasets—toreliablyresolvetheconflicts.However,theseconflictscanberesolvedefficientlyby
19
employingcomplexgenomicfeaturesandgenome-scaleevolutionmodels—adistinctclassof
20
phylogenomiccharactersandevolutionmodels,whichcanbeemployedroutinelytomaximizethe
21
useofgenomesequencesfortestingevolutionaryhypotheses.Insightsfromthestudyofcomplex
22
genomiccharactersraisequestionsaboutthecontinueduseoftheconventionalcore-gene
23
phylogeneticapproach,thatis,ourrelianceonelementarycharacters(nucleotidesandamino
24
acids)andonmodelsofsubstitutionmutationsalonetosolvedifficultphylogeneticproblems.
25
Doesitneedtobesupplementedwithcomplexgenomiccharactersandcorrespondingmodelsof
26
evolution?Orperhapssupplanted?Basedonacriticalanalysisofthepersistentdifficultyin
27
resolvingthearchaealradiation,Iargueforthelatter.
28
29
Introduction
30
TherecognitionoftheArchaeaastheso-called“thirdformoflife”wasmadepossibleinpartbya
31
newtechnologyforsequenceanalysis,oligonucleotidecataloging,developedbyFredrikSangerand
32
colleaguesinthe1960s(1,2).CarlWoese’sinsightofusingthismethod,andthechoiceofthesmall
33
subunitribosomalRNA(16S/SSUrRNA)asaphylogeneticmarker,notonlyputmicroorganisms
34
onaphylogeneticmap(ortree),butalsorevolutionizedthefieldofmolecularsystematicsthat
35
ZukerkandlandPaulinghaspreviouslyalludedto(3). Comparativeanalysisoforganism-specific
36
(oligonucleotide)sequence-signaturesinSSUrRNAledtotherecognitionofadistinctgroupof
37
microorganisms (2, 4). Initially referred to as Archaeabacteria, these unusual organisms had
38
‘oligonucleotidesignatures’distinctfromotherbacteria(Eubacteria),andtheywerelaterfoundto
39
bedifferentfromthoseofEukarya(eukaryotes)aswell.Manyotherfeatures,includingmolecular,
40
biochemicalaswellasecological,corroboratedtheuniquenessoftheArchaea.Thusthearchaeal
41
conceptwasestablished(2).
42
1of21
ManuscriptsubmittedtoeLife
Thestudyofmicrobialdiversityandevolutionhascomealongwaysincethen: sequencing
43
microbial genomes, and directly from the environment without the need for culturing is now
44
routine(5,6).Thiswealthofsequenceinformationisexcitingnotonlyforcatalogingandorganizing
45
biodiversity,butalsotounderstandtheecologyandevolutionofmicroorganisms–archaeaand
46
bacteriaaswellaseukaryotes–thatmakeupavastmajorityoftheplanetarybiodiversity.Since
47
large-scaleexplorationbythemeansofenvironmentalgenomesequencingbecamepossiblealmost
48
adecadeago,therehasalsobeenapalpableexcitementandanticipationofthediscoveryofa
49
fourthformoflifeora“fourthdomain”oflife(7).Thereferencehereistoafourthformofcellular
50
life,butnottoviruses,whichsomehavealreadyproposedtobethefourthdomainoftheTreeof
51
Life(ToL)(7,8).Ifafourthformoflifeweretobefound,whatwouldthedistinguishingfeaturesbe,
52
andhowcoulditbemeasured,definedandclassified?
53
Ratherthanthediscoveryofafourthdomain,andcontrarytotheexpectations,however,current
54
discussioniscenteredaroundthereturntoadichotomousclassificationoflife(9-11),despitethe
55
rapidexpansionofsequencedbiodiversity–hundredsofnovelphyladescriptions(12,13). The
56
proposeddichotomousclassificationsschemes,unfortunately,areinsharpcontrasttoeachother,
57
dependingon:(i)whethertheArchaeaconstituteamonophyleticgroup—auniquelineofdescent
58
thatisdistinctfromthoseoftheBacteriaaswellastheEukarya;and(ii)whethertheArchaeaform
59
asistercladetotheEukaryaortotheBacteria.Boththeissuesstemfromdifficultiesinvolvedin
60
resolvingthedeepbranchesoftheToL(10,11,14).
61
Thetwinissues,firstrecognizedinthe80sbasedonsingle-gene(SSUrRNA)analyses,continue
62
tobethesubjectsofalong-standingdebate,whichremainsunresolveddespitelarge-scaleanalyses
63
ofmulti-genedatasets(5,15-19).Inadditiontothechoiceofgenestobeanalyzed,thechoiceof
64
theunderlyingcharacterevolutionmodelisatthecoreofcontradictoryresultsthateithersupports
65
theThree-domainstree(5,19)ortheEocytetree(17,20).Inmanycases,addingmoredata,either
66
asenhancedtaxon(species)samplingorenhancedcharacter(gene)sampling,orboth,canresolve
67
ambiguities (21, 22). However, as the taxonomic diversity and evolutionary distance increases
68
amongthetaxastudied,thenumberofconservedmarker-genesthatcanbeusedforphylogenomic
69
analysesdecreases.Accordingly,resolvingthephylogeneticrelationshipsoftheArchaea,Bacteria
70
andEukaryaisrestrictedtoasmallsetofgenes—50atmost—inspiteofthelargeincreaseinthe
71
numbersofgenomessequencedandtheassociateddevelopmentofsophisticatedphylogenomic
72
methods.
73
Based on a closer scrutiny of the recent phylogenomic datasets employed in the ongoing
74
debate,Iwillshowherethatoneofthereasonsforthispersistentambiguityisthatthe‘information’
75
necessarytoresolvetheseconflictsispracticallynonexistentinthestandardmarker-genes(i.e.core-
76
genes)datasetsemployedroutinelyforphylogenomics. Further,Idiscussanalyticalapproaches
77
thatmaximizetheuseoftheinformationthatisingenomesequencedataandsimultaneously
78
minimizephylogeneticuncertainties.Inaddition,Idiscusssimplebutimportant,yetundervalued,
79
aspectsofphylogenetichypothesistesting,whichtogetherwiththenewapproachesholdpromise
80
toresolvetheselong-standingissueseffectively.
81
Results
82
Informationincoregenesisinadequatetoresolvethearchaealradiation
83
Data-displaynetworks(DDNs)areusefultoexamineandvisualizecharacterconflictsinphylogenetic
84
datasets,especiallyintheabsenceofpriorknowledgeaboutthesourceofsuchconflicts,ideally
85
beforedownstreamprocessingofthedataforphylogeneticinference(23,24). Whilecongruent
86
datawillbedisplayedasatreeinaDDN,incongruencesaredisplayedasreticulationsinthetree.
87
Fig.1Ashowsaneighbor-netanalysisoftheSSUrRNAalignmentusedtoresolvethephylogenetic
88
positionoftherecentlydiscoveredAsgardarchaea(20).TheDDNisbasedoncharacterdistances
89
calculatedastheobservedgeneticdistance(p-distance)of1,462characters,andshowsthetotal
90
amountofconflictinthedatasetthatisincongruentwithcharacterbipartitions(splits).Theedge
91
2of21
ManuscriptsubmittedtoeLife
(branch)lengthsintheDDNcorrespondtothesupportfortherespectivesplits.Accordingly,two
92
well-supportedsetsofsplitsfortheBacteriaandtheEukaryaareobserved.TheArchaea,however,
93
doesnotformadistinct,well-resolved/well-supportedgroup,andisunlikelytocorrespondtoa
94
monophyleticgroupinaphylogenetictree.
95
Figure1.Data-displaynetworksdepictingthecharacterconflictsindifferentdatasetsthatemploydifferent
charactertypes.(A)SSUrRNAalignmentof1,462characters.Concatenatedproteinsequencealignmentof(B)
29core-genes,8,563characters;(C)48core-genes,9,868charactersand(D)also48core-genes,9,868SR4
recodedcharacters(datasimplifiedfrom20to4character-states).Eachnetworkisconstructedfroma
neighbor-netanalysisbasedontheobservedgeneticdistance(p-distance)anddisplayedasanequalanglesplit
network.Edge(branch)lengthscorrespondtothesupportforcharacterbipartitions(splits),andreticulationsin
thetreecorrespondtocharacterconflicts.Datasetsin(A),(C)and(D)arefromRef.20,andin(B)isfromRef.17.
Likewise,theconcatenatedproteinsequencealignmentoftheso-called‘genealogydefiningcore
96
ofgenes’(25)–asetofconservedsingle-copygenes–alsodoesnotsupportauniquearchaellineage.
97
3of21
ManuscriptsubmittedtoeLife
Fig.1BisaDDNderivedfromaneighbor-netanalysisof8,563charactersin29concatenatedcore-
98
genes(17),whileFig.1C,Disbasedon9,868charactersin44concatenatedcore-genes(alsofrom
99
(20)). Eventakentogether,noneofthestandardmarkergenedatasetsarelikelytosupportthe
100
monophylyoftheArchaea—akeyassertionofthethree-domainshypothesis(26). Simplyput,
101
thereisnotenoughinformationinthecore-genedatasetstoresolvethearchaealradiation,orto
102
determinewhethertheArchaeaarereallyuniquecomparedtotheBacteriaandEukarya.However,
103
othercomplexfeatures—includingmolecular,biochemicalandphenotypiccharacters,aswell
104
asecologicaladaptations—supporttheuniquenessoftheArchaea.Theseidiosyncraticarchaeal
105
charactersincludethesubunitcompositionofsupramolecularcomplexesliketheribosome,DNA-
106
andRNA-polymerases,biochemicalcompositionofcellmembranes,cellwalls,andphysiological
107
adaptationstoenergy-starvedenvironments,amongotherthings(27,28).
108
Complex phylogenomic characters minimize uncertainties regarding the unique-
109
nessoftheArchaea
110
Anucleotideisthesmallestpossiblelocus,andanaminoacidisaproxyforalocusofanucleotide
111
triplet.Unliketheelementaryaminoacid-ornucleotide-charactersinthecore-genesdataset(Fig.1),
112
theDDNinFig. 2isbasedoncomplexmolecularcharacters–genomiclocithatcorrespondto
113
proteindomains,typically~200aminoacids(600nucleotides)long.Neighbor-netanalysisofprotein-
114
domaindatacodedasbinarycharacters(presence/absence)isbasedontheHammingdistance
115
(identicaltothep-distanceusedinFig.1). HeretheArchaeaalsoformadistinctwell-supported
116
cluster,asdotheBacteriaandtheEukarya.
117
Figure2.Data-displaynetworks(DDN)depictingcharacterconflictsamongcomplexphylogenomiccharacters–
genomiclocicorrespondingtoprotein-domainsinthiscase.(A)Neighbor-netanalysisbasedonHamming
distance(identicaltothep-distanceusedinFig.1)of1,732characterssampledfrom141species.(B)DDNbased
onanenrichedtaxonsamplingof81additionalspeciestotaling222speciesandamodestincreaseto1,738
characters.Thedatasetin(A)isfromRef.10,whichwasupdatedwithnovelspeciestorepresenttherecently
describedarchaealandbacterialspecies(5,12,20).
Fig2AisaDDNbasedonthedatasetthatincludesprotein-domaincohortsof141species,used
118
4of21
ManuscriptsubmittedtoeLife
inaphylogenomicanalysistoresolvetheuncertaintiesattherootoftheToL(29). Comparedto
119
thedatainFig.1,thetaxonomicdiversitysampledfortheBacteriaandEukaryaismoreextensive,
120
butlessextensivefortheArchaea; itiscomposedofthetraditionalgroupsEuryarchaeotaand
121
Crenarchaeota.Fig.2BisaDDNofanenrichedsamplingof81additionalspecies,whichincludes
122
representativesofthenewlydescribedarchaealgroups:TACK(30),DPANN(5),andAsgardgroup
123
includingtheLokiarchaeota(20).Inaddition,speciessamplingwasenhancedwithrepresentatives
124
from the candidate phyla described for Bacteria, and with unicellular species of Eukarya. The
125
completelistspeciesanalyzedisinSITable1.
126
Notably, the extension of the protein-domain cohort was insignificant, from 1,732 to 1,738
127
distinctdomains(characters).Basedonthewell-supportedsplitsintheDDNthatformadistinct
128
archaealcluster,theArchaeaarelikelytobeamonophyleticgroup(clade)inphylogeniesinferred
129
fromthesedatasets.
130
Dataqualityaffectsmodelcomplexityrequiredtoexplainphylogeneticdatasets
131
ResolvingtheparaphylyormonophylyoftheArchaeaisrelevanttodeterminingwhethertheEocyte
132
tree(Fig. 3A)ortheThree-domainstree(Fig. 3B),respectively,isabetter-supportedhypothesis.
133
RecoveringtheEocytetreetypicallyrequiresimplementingcomplexmodelsofsequenceevolution
134
ratherthantheirrelativelysimplerversions(11).Ingeneral,complexmodelstendtofitthedata
135
better. Forinstance, accordingtoamodelselectiontestforthe29core-genesdataset, theLG
136
model(31)ofproteinsequenceevolutionisabetter-fittingmodelthanotherstandardmodels,
137
such as the WAG or JTT substitution model (SI-Table 2), as reported previously (17). Further, a
138
relativelymorecomplexversionoftheLGmodel,withmultiplerate-categorieswasfoundtobea
139
better-fittingmodelthanthesimplersingle-rate-categorymodel(Fig.3C;SI-Table2).Thefitofthe
140
dataisestimatedasthelikelihoodofthebesttreegiventhemodel.
141
Acomplex,multiplerate-categoriesmodelaccountsforsite-specificsubstitutionratevariation.
142
Substitution-rateheterogeneityacrossdifferentsitesinthemultiple-sequencealignment(MSA)
143
wasapproximatedusingadiscreteGammamodelwith4,8or12ratecategories(LG+G4,LG+G8or
144
LG+G12,respectively).TheArchaeaisconsistentwithaparaphyleticgroupintreesderivedfromthe
145
rate-heterogeneousversionsoftheLGmodel(Fig.3A).Furthermore,thefitofthedataimproves
146
withtheincreaseincomplexityofthesubstitutionmodel(Fig. 3C).Modelcomplexityincreases
147
withanyincreaseinthenumberofratecategoriesand/ortheassociatednumbersofparameters
148
thatneedtobeestimated. However,witharelativelysimplerversion–arate-homogeneousLG
149
model,inwhichthesubstitution-ratesareapproximatedtoasinglerate-category,theArchaeaare
150
consistentwithamonophyleticgroup(Fig.3B).
151
Incontrast,treesinferredfromtheprotein-domaindatasetsareconsistentwithmonophyly
152
oftheArchaeairrespectiveofthecomplexityoftheunderlyingmodel(Fig. 3D-F).TheMkmodel
153
(Markovkmodel)isthebest-knownprobabilisticmodelofdiscretecharacterevolution,particularly
154
ofcomplexcharacterscodedasbinary-statecharacters(32,33). SincetheMkmodelassumesa
155
stochasticprocessofevolution,itisabletoestimatemultiplestatechangesalongthesamebranch.
156
Implementingasimplerrate-homogeneousversionoftheMkmodel(Fig. 3D),aswellasmore
157
complexrate-heterogeneousversionswith4,8or12ratecategories(Mk+G4,Mk+G8orMk+G12,
158
respectively),alsorecoveredtreesthatareconsistentwiththemonophylyoftheArchaea(Fig.3E)
159
ThetreederivedfromtheMk+G4modelisshowninFig. 3E.WhilethetreederivedfromMk+G8
160
modelisidentical(SI-Fig. 1)totheMk+G4tree,theMk+G12treeisalmostidenticalwithminor
161
differencesinthebacterialsub-groups(SI-Fig.2)
162
Inallcases,bipartitionsforArchaeashowstrongsupportwithposteriorprobability(PP)of0.99
163
whilethatofBacteriaandEukaryaissupportedwithaPPof1.0;inspiteofsubstantiallydifferent
164
fitsofthedata.TheuniquenessoftheArchaeaisalmostunambiguousinthiscase(butseenext
165
section).
166
5of21
ManuscriptsubmittedtoeLife
Figure3.Comparisonofconcatenated-genetreesderivedfromaminoacidcharactersandgenometrees
derivedfromprotein-domaincharacters.Branchsupportisshownonlyforthemajorbranches.Scalebars
representtheexpectednumberofchangespercharacter.(A),(B)Core-genes-treederivedfromabetter-fitting
model(LG+G4)andaworsefittingmode(LG),respectively,ofaminoacidsubstitutions.(C)Modelfittodatais
rankedaccordingtheloglikelihoodratio(LLR)scores.LLRscoresarecomputedasthedifferencefromthe
best-fittingmodel(LG+G12)ofthelikelihoodscoresestimatedinPhyML.Thus,largerLLRvaluesindicateless
supportforthatmodel/treerelativetothemost-likelymodel/tree.Substitutionrateheterogeneityis
approximatedwith4,8or12ratecategoriesinthecomplexmodels,butwithasingleratecategoryinthe
simplermodel.(D),(E)aregenome-treesderivedfromabetter-fittingmodel(Mk+G4)andaworsefittingmodel
(Mk),respectively,ofprotein-domaininnovation.(F)ModelfittodataisrankedaccordinglogBayesfactor(LBF)
scores,whichlikeLLRscoresarethelogoddsofthehypotheses.LBFscoresarecomputedasthedifferencein
likelihoodscoresestimatedinMrBayes.
6of21
ManuscriptsubmittedtoeLife
Siblingsandcousinsareindistinguishablewhenreversiblemodelsareemployed
167
AlthoughaDDNisusefultoidentifyanddiagnosecharacterconflictsinphylogeneticdatasetsand
168
topostulateevolutionaryhypotheses,aDDNbyitselfcannotbeinterpretedasanevolutionary
169
network,becausetheedgesdonotnecessarilyrepresentevolutionaryphenomenaandthenodes
170
donotrepresentancestors(23,24).Therefore,evolutionaryrelationshipscannotbeinferredfrom
171
aDDN.Likewise,evolutionaryrelationshipscannotbeinferredfromunrootedtrees,eventhough
172
nodesinanunrootedtreedorepresentancestorsandanevolutionmodeldefinesthebranches
173
(seeFig.4A).
174
Figure4.Effectofalternativeadhocrootingsonthephylogeneticclassificationofarchaealbiodiversity.(A)An
unrootedtreeisnotfullyresolvedintobipartitionsattherootofthetree(i.e.apolytomousratherthana
dichotomousrootbranching)andthusprecludesidentificationofsistergrouprelationships.Itiscommon
practicetoaddauser-specifiedrootaposterioribasedonpriorknowledge(orbelief)oftheinvestigator.Four
possible(ofmany)rootingsR1-R4areshown.(B)Operationally,addingaroot(rooting)aposterioriamountsto
addingnewinformation–anewbipartitionandanancestoraswellasanevolutionarypolarity–thatis
independentofthesourcedata.(C-F)ThedifferentpossibleevolutionaryrelationshipsoftheArchaeatoother
taxa,dependingonthepositionoftheroot,areshown.Rootingisnecessarytodeterminetherecencyof
commonancestryaswellthetemporalorderofkeyevolutionarytransitionsthatdefinephylogenetic
relationships.
Anunrootedtree,unlikearootedtree,isnotanevolutionary(phylogenetic)treeperse,sinceit
175
isaminimallydefinedhypothesisofevolutionorofrelationships;itis,nevertheless,usefultorule
176
7of21
ManuscriptsubmittedtoeLife
outmanypossiblebipartitionsandgroups(34,35).Giventhataprimaryobjectiveofphylogenetic
177
analyses is to identify clades and the relationships between these clades, it is not possible to
178
interpretanunrootedtreemeaningfullywithoutrootingthetree(seeFig.4A).Identifyingtherootis
179
essentialto:(i)distinguishbetweenancestralandderivedstatesofcharacters,(ii)determinethe
180
ancestor-descendantpolarityoftaxa,and(iii)diagnosecladesandsister-grouprelationships(Fig.
181
4).Yet,mostphylogeneticsoftwareconstructonlyunrootedtrees,whicharethenconsistentwith
182
severalrootedtrees(Fig.4C-F).However,anunrootedtreecannotbefullyresolvedintobipartitions,
183
becauseanunresolvedpolytomy(atrifurcationinthiscase)existsneartherootofthetree(Fig.4A),
184
whichotherwisecorrespondstothedeepestsplit(root)inarootedtree(Fig.4,C-F).
185
Resolving the polytomy requires identifying the root of the tree. The identity of the root
186
corresponds,inprinciple,toanyoneofthepossibleancestorsasfollows:
187
i.Anyoneoftheinferred-ancestorsattheresolvedbipartitions(opencirclesinFig.4A),or
188
ii. Anyoneoftheyet-to-be-inferred-ancestorsthatliesalongthestem-branchesoftheunre-
189
solvedpolytomy(dashedlinesinFig.4A)oralongtheinternal-braches.
190
Inthelattercase,rootingthetreeaposteriorionanyofthebranchesamountstoinsertinganad-
191
ditionalbipartitionandanancestorthatisneitherinferredfromthesourcedatanordeducedfrom
192
theunderlyingcharacterevolutionmodel. Sincestandardevolutionmodelsemployedroutinely
193
cannotresolvethepolytomy,rooting,andhenceinterpretingtheTreeofLifedependson:
194
i.Priorknowledge—eg.,fossilsoraknownsister-group(outgroup),or
195
ii.Priorbeliefs/expectationsoftheinvestigators—eg.,simpleisprimitive(36,37),bacteriaare
196
primitive(38,39),archaeaareprimitive(1),etc.
197
BothoftheseoptionsareindependentofthedatausedtoinfertheunrootedToL.Somepossible
198
rootingsandtheresultingrooted-treetopologiesareshownascladogramsinFig.4,C-F.Iftheroot
199
liesonanyoftheinternalbranches(e.g. R1inFig. 4,A-C),orcorrespondstooneoftheinternal
200
nodes, withinthearchaelradiation, theArchaeawouldnotconstituteauniqueclade(Fig. 4C).
201
However, if the root lies on one of the stem-branches (R2/R3/R4 in Fig. 4 A, B), monophyly of
202
theArchaeawouldbeunambiguous(Fig.4D-F).Determiningtheevolutionaryrelationshipofthe
203
Archaeatoothertaxa,though,requiresidentifyingtheroot.
204
Directionalevolutionmodels,unlikereversiblemodels,areabletoidentifythepolarityofstate
205
transitions,andthustherootofatree(40-42).Therefore,theuncertaintyduetoapolytomousroot
206
branchingisnotanissue(Fig5A).Moreover,directionalevolutionmodelsareusefultoevaluatethe
207
empiricalsupportforpriorbeliefsabouttheuniversalcommonancestor(UCA)attherootofthe
208
ToL(29).ABayesianmodelselectiontestimplementedtodetectdirectionaltrends(42)choosesthe
209
directionalmodel,overwhelmingly(Fig. 5B),overtheunpolarizedmodelfortheprotein-domain
210
datasetinFig.2B,asreportedpreviouslyforthedatasetinFig.2A(29).Further,thebest-supported
211
rootingcorrespondstorootR4(Fig. 4FandFig. 5A)—monophylyoftheArchaeaismaximally
212
supported(PPof1.0). Furthermore,thesister-grouprelationshipoftheArchaeatotheBacteria
213
ismaximallysupported(PP1.0). Accordingly,ahigherordertaxon,Akaryotes,proposedearlier
214
(Forterre1992)formsawell-supportedclade. ThusAkaryotes(orAkarya)andEukaryaaresister
215
cladesthatdivergefromtheUCAattherootoftheToL,alsoasreportedpreviously(29).
216
Alternativerootingsaremuchlesslikely,andarenotsupported(Fig.5C).Accordingly,indepen-
217
dentoriginoftheeukaryotesaswellakaryotesisthebest-supportedscenario.TheThree-domains
218
tree(rootR3,Fig.4E)is10171timeslesslikely,andthescenarioproposedbytheEocytehypothesis
219
(rootR1,Fig.5A)ishighlyunlikely.Thecommonbeliefthatsimpleisprimitive,aswellasbeliefsthat
220
archaeaareprimitiveorthatarchaeaandbacteriaevolvedbeforeeukaryotes,arenotsupported
221
either.
222
Employingcomplexmolecularcharactersmaximizesrepresentationoforthologous,
223
non-recombininggenomicloci,andthusphylogeneticsignal
224
GenomiclocithatcanbealignedwithhighconfidenceusingMSAalgorithmsaretypicallymore
225
conservedthanthoselociforwhichalignmentuncertaintyishigh.Suchambiguouslyaligned
226
8of21
ManuscriptsubmittedtoeLife
Figure5.(A)Rootedtreeoflifeinferredfrompatternsofinheritanceofuniquegenomic-signatures.A
dichotomousclassificationofthediversityoflifesuchthatArchaeaisasistergrouptoBacteria,whichtogether
constituteacladeofakaryotes(Akarya).EukaryaandAkaryaaresister-cladesthatdivergefromtherootofthe
treeoflife.Eachcladeissupportedbythehighestposteriorprobabilityof1.0.Thephylogenysupportsa
scenarioofindependentoriginsanddescentofeukaryotesandakaryotes.(B)Modelselectiontestsidentify,
overwhelmingly,directionalevolutionmodelstobebetter-fittingmodels.(C)Alternativerootings,and
accordinglyalternativeclassificationsorscenariosfortheoriginsofthemajorcladesoflife,aremuchless
probableandnotsupported.
9of21
ManuscriptsubmittedtoeLife
regionsofsequencesareroutinelytrimmedoffbeforephylogeneticanalyses(43). Typically,
227
the conserved well-aligned regions correspond to protein domains with highly ordered three-
228
dimensional(3D)structureswithspecific3Dfolds(Fig.6A).Regionsofsequencesthataretrimmed
229
usuallyshowhighervariabilityinlength,arelessorderedandareknowntoaccumulateinsertion
230
anddeletion(indel)mutationsatahigherfrequencythanintheregionsthatcorrespondtofolded
231
domains(44).Thesevariable,structurallydisorderedregions,whichflankthestructurallyordered
232
domains,linkdifferentdomainsinmulti-domainproteins(Fig.6A).Multi-domainarchitecture(MDA),
233
theN-to-Cterminalsequenceofdomainarrangement,isdistinctforaproteinfamily,anddiffersin
234
closelyrelatedproteinfamilieswithsimilarfunctions(Fig.6A).ThevariationinMDAalsorelatesto
235
alignmentuncertainties.
236
Figure6.Alignmentuncertaintyincloselyrelatedproteinsduetodomainrecombination.(A)Multi-domain
architecture(MDA)ofthetranslationalGTPasesuperfamilybasedonrecombinationof8modulardomains.57
distinctfamilieswithvaryingMDAsareknown,ofwhich6canonicalfamiliesareshownasaschematiconthe
leftandthecorresponding3Dfoldsontheright.Aminoacidsequencesofonly2ofthe8conserveddomains
canbealignedwithconfidenceforuseinphylogeneticanalysis.Thelengthofthealignmentvariesfrom
200-300aminoacidsdependingonthesequencediversitysampled(14,76).TheEF-Tu—EF-Gparalogouspair
employedaspseudo-outgroupsfortheclassicalrootingoftherRNAtreeishighlighted.(B)Phyleticdistribution
of1,738outthe2,000distinctSCOP-domainssampledfrom222speciesusedforphylogeneticanalysesinthe
presentstudy.About70percentofthedomainsarewidelydistributedacrossthesampledtaxonomicdiversity.
10of21
Description:position of the recently discovered Asgard archaea (20). The DDN is . representatives of the newly described archaeal groups: TACK (30), DPANN (5), and Asgard group. 123 including the .. implies evolutionary convergence, parallelism or character reversals caused by multiple processes. 332.