DNA RESEARCH 20, 135–150, (2013) doi:10.1093/dnares/dss039 Advance Access publication on 11 January 2013 Evaluation of Codon Biology in Citrus and Poncirus trifoliata Based on Genomic Features and Frame Corrected Expressed Sequence Tags TOUQEER Ahmad1, GAURAV Sablok2, TATIANAV. Tatarinova3, QIANG Xu1, XIU-XIN Deng1, and WEN-WU Guo1,* Key Laboratoryof Horticultural Plant Biology (MOE), Huazhong Agricultural University, Wuhan 430070, China1; Research and Innovation Center, Fondazione Edmund Mach (FEM), Istituto Agrario San Michele (IASMA), Via Mach 1., San Michele all’Adige, Trentino 38010, Italy2 and Glamorgan Computational Biology Group, University of Glamorgan, Pontypridd CF37 1DL, UK3 *To whom correspondence should be addressed. Tel. þ86 27-87281543. Fax. þ86 27-87280016. Email: [email protected] Edited by Prof. Kenta Nakai (Received 9August 2012;accepted27November2012) Abstract Citrus,asoneofthegloballyimportantfruittrees,hasbeenanobjectofinterestforunderstandinggen- etics and evolutionary process in fruit crops. Meta-analysesof 19 Citrus species, including 4 globallyand economically important Citrus sinensis, Citrus clementina, Citrus reticulata, and 1 Citrus relative Poncirus trifoliata,wereperformed.WeobservedthatcodonsendingwithA-orT-atthewobblepositionwerepre- ferred in contrast to C- or G- ending codons, indicating a close association with AT richness of Citrus species and P. trifoliata. The present study postulates a large repertoire of a set of optimal codons for the Citrus genus and P. trifoliata and demonstrates that GCT and GGT are evolutionary conserved optimal codons. Our observation suggested that mutational bias is the dominating force in shaping the codonusage bias (CUB) inCitrus and P. trifoliata. Correspondence analysis(COA) revealed that the prin- cipal axis [axis 1; COA/relative synonymous codon usage (RSCU)] contributes only a minor portion (∼10.96%) of the recorded variance. In all analysed species, except P. trifoliata, Gravy and aromaticity played minor roles in resolving CUB. Compositional constraints were found to be strongly associated withtheaminoacidsignaturesinCitrusspecies andP.trifoliata.Ourpresent analysispostulatescompos- itional constraints in Citrus species and P. trifoliata and plausible role of the stress with GC and co- 3 evolution pattern of amino acid. Key words: Citrus; coevolution; mutational bias; Poncirus trifoliata; stress; GC biology 3 1. Introduction codons for encoding the particular amino acid. These differences in the usage of the synonymous Genomecomposition(suchasGC-andAT-content) codons have also been one of the factors for the evo- and subsequent balance of codon usage and eukary- lutionofproteomediversityandcanhelpustounder- otic translation machinery play an important role in stand the evolution of those proteins that have the evolution of the nucleotides at the wobble pos- structural differences in spite of being conserved at ition.1 Degeneracy in biased codon usage explains the sequence level.3–7 the concept behind the usage of same amino acids Two major paradigms of codon usage were pro- withmultiplesynonymouscodonsexceptmethionine posedasplausibleanswerstoclarifythenon-random- (Met) and tryptophan (Trp).2 This genome compos- ness of codon usage at intra- and inter-species levels: ition bias leads to the usage of some codons at a (i) natural selection that is expected to yield a correl- higher frequency as compared to the synonymous ation with codon bias in highly expressed genes,8 #TheAuthor2013.PublishedbyOxfordUniversityPressonbehalfofKazusaDNAResearchInstitute. ThisisanOpenAccessarticledistributedunderthetermsoftheCreativeCommonsAttributionNon-CommercialLicense(http://creativecommons. org/licenses/by-nc/3.0/),whichpermitsunrestrictednon-commercialuse,distribution,andreproductioninanymedium,providedtheoriginal workisproperlycited. 136 Codon Evolution in Citrus species and Poncirus trifoliata [Vol. 20, rapidly regulated and variably expressed genes, and species show wide variation in traits such as cold (ii) neutral processes, such as mutational biases anddroughttoleranceandareofcommercialimport- (MBs), where some mutations occur more often ance.Wehaverestrictedouranalysesto Citrus genera than others across the genome of an organism to develop resource information for Citrus species. because of local variations in the base composition. Our study demonstrated that CUB in Citrus species Several lines of evidence support the argument that and P. trifoliata is biased towards AT richness, and a the usage of synonymous codons with unequal fre- relatively higher occurrence of A- and T-ending quencies in both prokaryotic and eukaryotic genes is codons was observed. We found several interesting the result of a complex balance between MB and/or and diverse patterns of optimal codons, and it was natural selection and genetic drift.9–12 It has been observed that two optimal codons coding for suggested that the codon bias could be positively Alanine and Glycine were evolutionary conserved selected because of a more efficient and accurate between the Citrus species and P. trifoliata. To the translation, and favoured codons may correspond to best of our knowledge, our analysis presents the first the most highly expressed genes.2,13 In addition to complete report on the identification of optimal neutral and selection processes, GC-biased gene con- codons acrossthe entire Citrus genus and P. trifoliata, version, which depends on the local recombination whichcouldserveasthepotentialsourcefordevelop- rate, is an important factor in shaping codon and ing transgenic Citrus cultivars using codon optimiza- amino acid usage.14 tion. GC and evolution were correlated using 3 To date, wide variations have been observed in Hamming distance parameter to identify suggestive codon usage patterns in many organisms and have evolutionary pairs of genes. Co-orthologous genes provided clues to understand the evolution of genes matrix identified using the alignment ratio suggested and gene families. The factors that potentially affect close association of Citrus clementina, Citrus sinensis, biased usage of codons are MB that correlates the and Citrus reticulata, as they belong to the same codon usage bias (CUB) with the genomic GC phylogeneticclade.Wefurtherdemonstratedthatnu- content, Hill2Robertson effect explaining the inter- cleotide bias has a genome-wide influence on amino ference of selection of one locus with another locus, acid composition of Citrus species and P. trifoliata. translationalselection,andacumulativeeffectofrep- lication and translational selection in shaping the 2. Materials and methodology codon usage across the genes of several bacterial species.15–19 It has been shown that CUB and 2.1. Sequence information and processing protein functional conservation play a major role for Our dataset consists of genome-predicted coding the decelerated evolution of whole genome duplica- tion in Saccharomyces cerevisiae.20 Several other regions and ESTs. All the genome-predicted coding sequences of C. sinensis were retrieved from recently factors that might affect the codon usage are amino acid conservation and hydrophobicity,21 gene expres- sequenced Citrus genome.25 In case of other Citrus sion,22 mRNA folding stability, codon2anticodon species, ESTs were downloaded from the National interaction, and gene length.22 Center for Biotechnology Information (NCBI) EST Citrus is a diploid genus of the Rutaceae family, repository (NCBI; http://www.ncbi.nlm.nih.gov). In addition, we downloaded putative unique transcripts whose cultivated forms are important for human for all studied species from the PlantGDB database diet with more than 122 MT of annual world fruit (http://www.plantgdb.org). A detailed description of production. We have systematically analysed codon thedatausedisshownin Table1.IncaseofESTs,the usage patterns and genomic heterogeneity using a ESTs were first clustered into contigs and singletons meta-analyses approach that involves multivariate using CAP3 with default parameters.26 A minimum tools and codon usage indices such as relative syn- match percentage cutoff of 95% for 40 overlapping onymous codon usage (RSCU) and effective number of codons (Nc) in the studied gene sets.23,24 In baseswasusedtoassign2sequencestoacluster. present study, we inferred global pattern of CUB, mu- tationalpressures,andassociationofGC withpoten- 2.2. Frame correction of ESTs 3 tial stress events. We have collectively analysed All the UniGenes (contigsþsingletons) were ana- horticulturally important Citrus species that includes lysed for frame correction and prediction of protein- recently sequenced double haploid Citrus sinensis coding region using FrameDP.27,28 Briefly, the follow- genome25 and 18 additional Citrus species with ing pipeline was implemented using FrameDP to expressed sequence tags (ESTs) counts of more than identify open reading frames (ORFs): firstly, each EST 1000 per species. To make this analysis comparative, was compared against the TAIR database (Arabidopsis wehavealsoincludedPoncirustrifoliatathatisconsid- Information Resource; http://www.arabidopsis.org/) ered to be a distant relative of Citrus genus. These using BLASTX29 with E-value¼1024, identity No. 2] T. Ahmad et al. 137 Table1. GenomiccompositionofcodingregionofCitrusspeciesandP.trifoliataatGC,GC ,GC ,GC ,andGC 1 2 3 3s No. Citrusspecies Genes/ESTa Unigenescount Genesb GC GC GC GC GC 1 2 3 3s 1 C.sinensis 44275 – 41362 43.92 50.4 40.33 41.03 38.8 2 C.aurantifolia 8219 7550 4715 47.78 52.96 43.87 46.52 44.73 3 C.aurantium 14584 11952 6787 46.93 52.47 42.59 45.72 43.94 4 C.clementina 118365 51591 33765 47.04 51.84 42.51 46.76 44.99 5 C.clementina(cid:4)C.tangerina 1843 1283 677 44.66 51.67 39.77 42.55 40.47 6 C.jambhiri 1017 858 701 45.86 51.79 41.13 44.66 42.67 7 C.japonicavar.margarita 2924 1628 588 46.11 52.67 41.42 44.25 42.35 8 C.limettioides 8188 7933 3964 46.47 50.76 42.75 45.9 44.25 9 C.limonia 11045 9761 4256 45.82 51.13 41.37 44.96 43.17 10 C.medica 1115 896 711 45.65 51.91 40.9 44.13 42.15 11 C.reshni 5768 3735 2644 45.2 51.51 40.57 43.52 41.5 12 C.reticulata 55980 46170 30816 46.85 52.3 42.77 45.48 43.62 13 C.reticulata(cid:4)C.temple 5823 3685 2253 46.48 52.41 41.64 45.41 43.41 14 C.sinensis(cid:4)P.trifoliata 1837 1522 992 46.07 51.67 41.41 45.15 43.23 15 C.sunki 5216 4688 1746 47.58 51.51 43.58 47.64 46.17 16 P.trifoliata 62695 35740 19445 46.78 52.33 42.67 45.35 43.52 17 C.unshiu 19072 9289 4592 45.67 51.72 41.14 44.15 42.21 18 C.paradisi 8039 4621 2517 45.32 52.19 40.51 43.27 41.3 19 C.paradisi(cid:4)P.trifoliata 7954 3335 2596 45.23 51.96 40.77 42.98 40.96 Means of GC% were calculated at the first, second, and third positions. GC represents the GC at the third synonymous 3s position. aInC.sinensis,genome-predictedcodingregionsareused,whereasfortherestofthespecies,thenumberrepresentstheEST countin thecolumn(genes/EST). bSelectedgenes above300bp threshold. percent (%)¼40%over100aminoacids.27Secondly, GC -rich and GC -poor groups, we have selected 5% 3 3 the training dataset was generated from the BLASTX of the genes with the highest and the lowest GC 3 results,and,subsequently,thetrainingmatrixwascal- values. For the genes in the GC -rich and-poor 3 culated, which represents the coding style of the groups,wehavecomputedtwomeasures:(i)position- species. Thirdly, a collection of putative protein- algradientsofGC and(ii)CG skewthataredefined 3 3 coding sequences (CDSs) was generated for each as: CG skew is the difference in fraction of cytosines 3 Citrus species and P. trifoliata based on its homology (C) and guanines (G) in the third position of the with known protein dataset and on coding style codondividedbythesumofCandGinthethirdpos- recognition matrix. ition: CG -skew¼(C 2G )/(C þG ) and positional 3 3 3 3 3 gradient GC as a [G (x)þC (x)]/Nseq, where x is 3 3 3 the distance measured as number of codons from 2.3. Sequence filtering and GC variation the first ATG, and Nseq is the number of sequences. From the set of corrected sequences, we discarded proteinsshorter than100amino acidstocreateare- 2.4. Indices of codon usage and correspondence liable dataset for all the studied Citrus species and P. analysis trifoliata for further analysis.23 The final sequence Theeffectivenumberofcodons(Nc)providesanin- dataset was subsequently analysed by tabulating the dependent measure of CUB, regardless of the gene frequency of GC at the first, second, and third codon length.23 The expected Nc values were computed positions (GC , GC GC , and GC , respectively) 1 2, 3 3s according to the equation proposed by Wright, using in-house written Perl and Cþþ scripts. GC is 3 which assumes equal use of G and C (A and T) in de- defined as the fraction of cytosines (C) and guanines generate codon groups.23 (G) in the third position of the codon: GC ¼3(C þ 3 3 G )/L for the ORF of length L, whereas GC is ( ) 3 3S 29 defined as GþC base composition at the third syn- Nc¼2þsþ ;wheres¼GC : ½s2þð1(cid:2)sÞ2(cid:3) 3s onymously degenerate position of codons. To define 138 Codon Evolution in Citrus species and Poncirus trifoliata [Vol. 20, We have further analysed RSCU using multivariate acid composition, we partitioned the codons accord- analysis for all the 59 informative codons (excluding ing to GC-rich (the so-called GARP amino acids: Met, Trp, and the three stop codons).30,31 As pro- Glycine, Alanine, Arginine, and Proline) and AT-rich, posed earlier, if RSCU values are close to 1.0, it indi- FYMINK amino acids (Phenylalanine, Tyrosine, cates that all the synonymous codons are used Methionine, Isoleucine, Asparagine, and Lysine) equally without any bias towards the usage of a par- amino acids.36,37 We have maintained the exclusion ticular codon in a gene of length L. In our study, axis of Leucine and Arginine from the dataset as per 1 (COA/RSCU) and axis 2 (COA/RSCU) represent the Singer and Hickey.37 first and second major axes of correspondence ana- lysis. The Kyte–Doolittle scale was used to calculate 2.7. Statistical analysis the hydropathy score that is the arithmetic mean of Alltheindicesofcodonusagewerecalculatedusing the sum of hydropathic indices of each amino acid. CodonW(http://codonw.sourceforge.net),andseveral This scale also provides information on transmem- in-house developed Perl, R, and Cþþ scripts were brane or surface regions.32 written to streamline the downstream analyses. Statistical analyses were performed using R in R 2.5. Identification of optimal codons studio (http://rstudio.org/). All the results were inter- pretedbasedonthenon-parametricSpearman’srank Optimalcodonsarebelievedtoachievefastertrans- correlation (r). lation rates and high accuracy, therefore the effect is more pronounced in highly expressed genes.33 For the identification of optimal codons, we used 10% 3. Results and discussion of total genes from extreme ends of the principal axis of correspondence analysis (axis 1; COA/RSCU). 3.1. Patterns of genomic content and codon usage Codon usage was then compared using x2 contin- variation across Citrus species and P. trifoliata gency (x2) of the two groups, and codons whose fre- TherecentlysequencednucleargenomeofC.sinen- quency of usage was significantly higher at three sis25 presents an opportunity to analyse the nucleo- different levels of statistical precision (P-value,0.5; tide compositional pressure and heterogenetic P-value,0.01; P-value,0.001) in highly expressed variation, which could possibly help usto understand genes when compared with lowly expressed genes the molecular adaptation of these horticulturally im- were defined as the optimal codons. We also identi- portantfruitspecies.Wehaveusedmeta-analysesap- fied the genes related to the ribosomal proteins and proachusingthepredictedgenomiccodingregionsof observed the association of the ribosomal proteins recently sequenced C. sinensis and frame-corrected with axis 1 (COA/RSCU) and axis 2 (COA/RSCU) of ESTs of related species and P. trifoliata. Nucleotide the correspondence analysis to optimize the identifi- composition analysis indicated that Citrus species cation of the optimal codons. are AT rich ((cid:2)46% GC, typical for many sequenced genomes of dicot species), and this bias towards the 2.6. GO and stress-associated annotation, Hamming AT richness is dominant and is observed across all parameter, and coevolution of amino acid the studied species (Table 1). composition Basedontheaboveobservation,wecouldhypothe- Stress-relatedgeneswereidentifiedusingreciprocal sizethatthesespecies arebiasedtowardstheA-and/ best bidirectional hits (RBH) using NCBI BLASTP or T-ending codons in the coding region across these program with E-value 10230, and gene ontology two genera. Generally, most of the analysed dicot (GO) annotation of the model dicot plant species prefer to use A- or T-ending codons at the Arabidopsisthalianawasthenusedby‘guilt-by-associ- third position; this observation is in accordance with ation’ approach. To optimize ourannotation pipeline, a previous study in Citrus using a small dataset of weselectedGOannotations‘involvedin’thecategory 177 CDS regions.38 We found that average GC is 3 containing word ‘stress’ and/or ‘response’ and significantly lower than GC and the observed GC 1 grouped together as ‘stress related’. The normalized (P-value,0.05) which potentially explains that the Hamming distance was then evaluated according to genome composition is biased towards the dominat- the method described in Sablok et al.34 All the ing AT usage encoding genes at the third, ‘wobble’ frame-corrected coding regions and the genome-pre- position. dicted coding regions were parsed. Subsequent pro- In eukaryotes, the intra-genomic heterogeneity is teins were used to identify co-orthologous genes, high, and interspecific variation of the average GC and a suggestive phylogeny was drawn using the co- content is low.39,40 In Citrus species, GC usage varies orthologous matrix as described in Lechner et al.35 by position, but with much greater variance, with To identify the effect of nucleotide bias on amino higher usage in position 1 ((cid:2)51.8% GC) and lower No. 2] T. Ahmad et al. 139 GCusageinpositions2and3((cid:2)41.6versus (cid:2)44.7% var. margarita; 20.048** Citrus limettioides; and also at the third synonymous positions GC (cid:2) 20.031** Citrus limonia; 0.093** Citrus medica; 3s 42.8%). The observed results are consistent with the 0.216** Citrus reshni; 0.061** C. reticulata; 0.101** previous reports demonstrating preference for G is in C. reticulata(cid:4)Citrus temple; 0.119** C. sinensis(cid:4)P. position1andT/Ainpositions2and341anddemon- trifoliata; 20.213** Citrus sunki; 0.061** P. trifoliata; stratetheroleofmutationalpressureintheevolution 0.139** Citrus unshiu; 0.160** Citrus paradisi; of CUB across Citrus species and P. trifoliata. 0.212** C. paradisi(cid:4)P. trifoliata; **P-value,0.01). Furthermore, there is a significant wide variation in We noted that most genes tend to lie below the GC usage at the third synonymous position (GC standard trajectory path and are poised towards 3s (cid:2)42.8%) (Table 1; Fig. 1). In our analysis, we esti- GC , which clearly demonstrates that MB is acting 3s mated codon usage patterns of frame-corrected as a major factor for the wide variation in codon codingregionsobtainedfromESTsversusgenepredic- usage across the Citrus species and P. trifoliata tions derived from a fully assembled and annotated (Fig. 2). However, there might be additional factors genome. that might influence codon usage across these To look for biased association between the nucleo- species. This dependence of codon usage on genome tide content and codon usage, we plotted Nc against basecomposition(AT-orGC-richness)hasbeenprevi- GC , also described as the Nc plot (Fig. 2) that has ously reported in several unicellular genomes.43,44 In 3s been widely demonstrated as an important param- a genome-wide analysis of eubacterial and archeal eter to evaluate codon usage variation among genes, genomes, it has been suggested that genome-wide as this codon usage index has a definite relationship variance in codon usage is primarily due to MB as with the GC (more compositionally biased DNA is the GC content shows wide variation along the iso- 3 expected to encode a smaller subset of codons).23,42 chors.17 Simultaneously, alternative views associate Wright23 argued thatthe comparison of actual distri- GC variability with various factors such as transcrip- bution of genes, with expected distribution under no tional optimization, methylation, recombination, and selection could be indicative, if CUB of genes has horizontal gene transfer.45 It can be inferred that someotherinfluencesotherthancompositionalcon- the cloning of orthologs and homologues conserved straints. A significant correlation was found between across the Citrus species and P. trifoliata that are AT Nc and GC (0.466** C. sinensis; 20.025** Citrus rich and have low Nc values will require few degener- 3s aurantifolia; 0.070** Citrus aurantium; 0.125** ate primers. On the contrary, genes that are GC rich C. clementina; 0.183** C. clementina(cid:4)Citrus tanger- andhavinghighGCvalueswillrequiremoredegener- ina; 0.184** Citrus jambhiri; 0.179** Citrus japonica ate primers for enhancement of cloning efficiency. Figure 1. The distribution of GC content in Citrus species, P. trifoliata and A. thaliana genes. The GC content showed unimodal 3 3 distribution. 1 4 0 C o d o n E v o lu t io n in C it r u s s p e c ie s a n d P o n c ir u s t r if o lia t a Figure2. NcversusGC plotofCitrusspeciesandP.trifoliatagenes.ThesolidblacklineindicatestheexpectedNcvalue,ifthecodonbiasisonlyduetoGC Speciesorder:A¼C. 3s 3s. sinensis;B¼C.aurantifolia;C¼C.aurantium;D¼C.clementina;E¼C.clementina(cid:4)C.tangerina;F¼C.jambhiri;G¼C.japonicavar.margarita;H¼C.limettioides;I¼C.limonia; [ J¼C. medica; K¼C. reshni; L¼C. reticulata; M¼C. reticulata(cid:4)C. temple; N¼C. sinensis(cid:4)P. trifoliata; O¼C. sunki; P¼P. trifoliata; Q¼C. unshiu; R¼C. paradisi; and S¼C. Vo paradisi(cid:4)P.trifoliata. l. 2 0 , No. 2] T. Ahmad et al. 141 3.2. Correspondence analysis C. clementina(cid:4)C. tangrina; 20.772**, C. jambhiri; It has been reported that there is a significant het- 20.831**, C. japonica var. margarita; 20.803**, erogeneity within the genes and genomes.33,46 A C. limettioides; 0.849**, C. limonia; 0.894**, heat map was constructed using the observed RSCU C. medica; 20.855**, C. reshni; 0.894**, C. reticulata; values for all the 59 informative codons (excluding 0.846**, C. reticulata(cid:4)C. temple; 20.884**, the Met, Trp, and the stop codons) using the average C. sinensis(cid:4)P. trifoliata; 0.806**, C. sunki; 0.879**, linkage clustering method (Fig. 3). Heat map and P.trifoliata;0.834**,C. unshiu; 20.848**, C.paradisi; the supporting values (Supplementary Table 1) 20.879**, C. paradisi(cid:4)P. trifoliata; **P-value, clearly define that the global codon usage is biased 0.01). The observed results indicate dominance of towards AT richness. We have partitioned the genes MB in the Citrus species and P. trifoliata and suggest based on the high and low GC content to see the that the variation in the usage of synonymous 3s global pattern of deviation across the two principal codons among the genes in Citrus species and axes displaying the major variation in accordance P. trifoliata is largely a biased representation of with the synonymous GC (Supplementary Fig. 1). nucleotide content of the genes. 3s Correspondence analysis showed that relative inertia tends to decrease along the axes, and in all 3.3. Identification of optimal codons in Citrus species the studies species, it was observed that axis 1 and P. trifoliata (COA/RSCU)contributedtothemajorportionofrela- Earlier and recent reports suggest that the usage of tive inertia indicating that the major trend of codon optimal codons and balanced codon usage enhances usage variation is associated with axis 1 (10.75 C. theefficiencyoftranslationbyincreasingthetranslation sinensis; 12.15 C. aurantifolia; 12.27 C. aurantium; rateofthepreferredcodonsovertheothersynonymous 14.54 C. clementina; 6.65 C. clementina(cid:4)C. tangrina; codon choices.47,48 Genes using optimal codons have 11.62 C. jambhiri; 7.01 C. japonica var. margarita; higher translation rate as compared to genes using 10.41 C. limettioides; 14.66 C. limonia; 11.24 C. non-optimal codons, which in turn increases the ribo- medica; 10.67 C. reshni; 12.54 C. reticulata; 8.66 C. someusageefficiencyandpotentiallyreducestheribo- reticulata(cid:4)C. temple; 10.85 C. sinensis(cid:4)P. trifoliata; some drop off.22,49,50 ESTs constitute partial 13.61 C. sunki; 12.06 P. trifoliata; 9.11 C. unshiu; transcriptome representation correlated with gene 8.72 C. paradisi; 10.75 C. paradisi(cid:4)P. trifoliata). abundance and expression.51,52 In 14 Citrus species High significant correlation (þ/2) was also observed (C.aurantifolia;C.aurantium;C.clementina;C.jambhiri; between axis 1 and GC (0.860**, C. sinensis; C.limettioides;C.limonia;C.medica;C.reshni;C.reticulata; 3s 0.888**, C. aurantifolia; 0.887**, C. aurantium; C. reticulata(cid:4)C. temple; C. sinensis(cid:4)P. trifoliata; 0.903**, 20.708** C. clementina; 20.836**, C. unshiu; C. paradisi; C. paradisi(cid:4)P. trifoliata) and Figure3. HeatmapoftheaverageRSCUofthe59degeneratecodonsintheCitrusspeciesandP.trifoliatausingEuclideandistanceand averagelinkageclusteringmodule. 142 Codon Evolution in Citrus species and Poncirus trifoliata [Vol. 20, P. trifoliata, we extracted the expression of ESTs count dominant over the AGG codon ((cid:2)RSCU: 1.52) at the using the CAP3 assembly (.ace files) and compared respective levels of significance P-value,0.01 and the usage of codons in accordance with earlier P-value,0.05. Using a mutation and selection reports.53–55 We also selected the ribosomal protein- model, Knight et al.56 demonstrated that ‘pairs of encoding genes and quantified the association of the specieswithconvergentGCcontentmightalsoevolve ribosomal encoding genes with axis 1 (COA/RSCU) convergent protein sequences, especially at func- andaxis2(COA/RSCU)ofcorrespondenceanalysis. tionally unconstrained positions’. For instance, the A star map showing optimal codons was con- frequencies of both Lysine and Arginine are highly structed taking 10% of the genes from the extreme anti-correlated with GC content, and Lysine and tailsofmultivariateanalysis(Table2),andseveraldis- Arginine can easily be substituted for one another in tinct trends of optimal codons were detected. In an proteins. This observation explains the inclusion of earlier study,38 optimal codons in Citrus were esti- LysineandArginineinourstudyandiswellsupported mated based on the correspondence analysis of by an earlier study in nematodes.57 It has been pro- codon usage, and the relative frequency of synonym- posed that Arginine and Leucine have a tendency to ous codon using 177 CDS and A- or T-ending show different codon usage patterns because of the optimal codons (TAA, GCT, GAT, CTT, AGG, AGA, and prevalenceofthesynonymousGCsubstitutionsinthe GTT) was observed. Overall, we observed that the first and the third codon position.58 Recently, it has trend of optimal codons usage was not conserved beenpostulatedthatcodonoptimizationsignificantly across all analysed species. However, in our analyses, enhancedMIRgeneexpressioninSolanumlycopersicum we have observed that GCT ((cid:2)1.57; RSCU) and GGT cv.Microtom,59whichpotentially depictstheimport- ((cid:2)1.09; RSCU) representing Alanine and Glycine anceandtheusageoftheoptimalcodonsingeneex- were found to be evolutionary conserved across all pression and transgenics. To our knowledge, this is the species. Because both these optimal codons are the first time large-scale identification of optimal T-endingcodons,whichisanindicationofthedomin- codons in Citrus species and P. trifoliata, which could ant role of the MB in the conservation of the A- or T- serveasamodelrepertoiretoenhancethetransform- ending codons, it potentially represents the biased ationefficiency. genome composition. However, for the other observed optimal codons, pattern genomic compos- 3.4. Roleofother selectiveconstraintsoncodonbiasin ition was not able to explain the deviation. We Citrus species and P. trifoliata further observed that for most amino acids with 2- A recent study has revealed that MB deeply influ- to 6-fold degeneracy level, there has been a general encesthefoldingstabilityofproteins,makingproteins preference for the usage of two or more codons as on the average less hydrophobic and, therefore, less optimal codons. For example, in Glycine, two highly stable with respect to unfolding and also less suscep- distributed optimal codons GGA and GGTwere iden- tible to misfolding and aggregation.60 To identify tified and they could be classified as the primary the potential effects of Gravy and aromaticity, we and secondary optimal codons preferentially based computed non-parametric correlation coefficients on the RSCU (GGA, (cid:2)1.16 and GGT, (cid:2)1.09). A betweenaxis1(COA/RSCU)andGravyoraromaticity curious trend observed among the Citrus species and scores for all studied species. We observed that Gravy P. trifoliata isthat Leucine is frequentlyencoded opti- and aromaticity played a minor role in shaping the mally by a G-ending codon (TTG; (cid:2)RSCU: 1.49) in variation of codon usage across Citrus species and P. four species (C. sinensis, C. clementina, C. reticulata, trifoliata.InthecaseofP.trifoliata,nosignificantcor- and P. trifoliata) rather than synonymous A- or T- relation was observed foraromaticity, suggesting that ending codon. However, in other species, CTT aromaticity has no significant role in shaping the ((cid:2)RSCU: 1.49) and CTC ((cid:2)RSCU: 0.89) (TTA, RSCU: codon usage variation in this genus. 1.16; CTT RSCU: 1.51) were the optimal codons. Wefounddeviationsintheusageofoptimalcodons encoding Lysine. For example, Lysine is frequently 3.5. Variations in GC3 across Citrus species and P. encoded by AAA ((cid:2)RSCU: 0.98) as optimal codon in trifoliata three species (C. sinensis, C. reticulata, and P. trifo- Recent reports suggest that GC composition and 3 liata), whereas AAG ((cid:2)RSCU: 1.01) was found to be GC gradient are acting as major factors along the the optimal codon for Lysine in rest of the Citrus orientation of transcription in monocots. It has been species. In Arginine, we observed that in two species alsosuggestedthatGCcompositionisvitalforunder- (C. aurantifolia and C. paradisi(cid:4)P. trifoliata), AGG standingchromatinremodelling,geneexpression,and represents the potential optimal codon instead of recombination.45,61 Some studies failed to depict GC the synonymous AGA. But based on the RSCU values, gradientalongthegenesofdicotplantsusingA.thali- AGA ((cid:2)RSCU: 1.66) was assumed to be more ana as a model.61 The reason for this failure is that No. 2] T. Ahmad et al. 143 Table2. StarmapshowingoptimalcodondistributioninCitrusspeciesandP.trifoliata Codon/AA A B C D E F G H I J K L M N O P TTT/Phe * * * * TTC/Phe * * * * * * * * * * TTA/Leu * * * * TTG/Leu * * * * CTT/Leu * * * * * * * * * * * * * * CTC/Leu * * * * * * * * * * * CTA/Leu * * * * * CTG/Leu * * * ATT/Ile * * * * * ATC/Ile * * * * * * * * * * * * * ATA/Ile * * * GTT/Val * * * * * * * * GTC/Val * * * * * * * * * * * * * GTA/Val * * * * GTG/Val * * TCT/Ser * * * * * * * * * * * * * * TCC/Ser * * * * * * * * * TCA/Ser * * * * * * * * * * * TCG/Ser * * AGT/Ser * * * * * AGC/Ser * * * CCT/Pro * * * * * * * * * * * * * * * CCC/Pro * * * * * * * CCA/Pro * * * * * * * * * * * * * * * CCG/Pro * * ACT/Thr * * * * * * * * * * * * * * * ACC/Thr * * * * * * * * * * * ACA/Thr * * * * * ACG/Thr * * GCT/Ala * * * * * * * * * * * * * * * * GCC/Ala * * * * * * * * * * * GCA/Ala * * * * * * GCG/Ala TAT/Tyr * * * TAC/Tyr * * * * * * * CAT/His * * * * * * CAC/His * * CAA/Gln * * * * CAG/Gln * * * * * * AAT/Asn * * * * * * * * * AAA/Lys * * * AAG/Lys * * * * * * * * * * * * GAT/Asp * * * * * * * GAA/Glu * * * * GAG/Glu * * * * * * TGT/Cys * * * Continued 144 Codon Evolution in Citrus species and Poncirus trifoliata [Vol. 20, Table2. Continued Codon/AA A B C D E F G H I J K L M N O P TGC/Cys * * * * * * * * * * CGT/Arg * * * * * * * * * * * * * CGC/Arg * * * * * * * * * * * * CGA/Arg * * * * * * CGG/Arg AGA/Arg * * * * * * * AGG/Arg * * GGT/Gly * * * * * * * * * * * * * * * * GGC/Gly * * * * * GGA/Gly * * * * * * * * * Species are in the following order: A¼C. sinensis; B¼C. aurantifolia; C¼C. aurantium; D¼C. clementina; E¼C. jambhiri; F¼C.limettioides;G¼C.limonia;H¼C.medica;I¼C.reshni;J¼C.reticulata;K¼C.reticulata(cid:4)C.temple;L¼C.sinensis(cid:4) P.trifoliata; M¼P.trifoliata;N¼C. unshiu; O¼C. paradisi;and P¼C.paradisi(cid:4)P.trifoliata. AA representsamino acid. highandlowGC geneshaveoppositegradientsalong that has the highest cold-resistant character among 3 the genes, and when high and low GC genes are Citrus species and its wild relatives. The occurrence 3 clumped together, the effect disappears. In Citrus of the high GC genes in all these species suggests 3 species and P. trifoliata, unimodal bell-shaped distri- an association between DNA methylation and GC 3 bution of GC , centred at 0.39, is considered to be a as it has been previously suggested that high GC 3 3 typical mode of GC distribution for dicot species. composition has been influenced by the GC mis- 3 We selected the top and bottom 5% of genes across match-repair mechanism that is dominant in stress- each studied Citrus species and P. trifoliata, and strik- associated genes as the absence of this positive bias ingly distinct gradients of GC and CG skew for high may lead to the loss of the recombination repair 3 3 and low GC genes were revealed. Genes with high mechanism and could be detrimental to the plant 3 GC showed higher level of GC codons in their adaptation in the evolving stress conditions. 3 3 middle coding regions than in terminal coding Genomic regions under higher selective pressure are regions.Inaddition,highGC geneshaveapreference more frequently recombining, and as a result relative 3 for C over G in the middle coding regions, and this increaseinGC contentcan beobserved.45Asshown 3 preference was not found in the 30-coding end in Supplementary Table 2 and Fig. 5, in all Citrus (cid:2)100bp in size (Fig. 4; Supplementary Fig. 2). species and P. trifoliata, the ratio of stress to non- It was previously reported that stress-related genes stress-related genes in GC -rich group was elevated 3 in grasses are GC3-rich.45 We stratified all Citrus in comparison to the GC -medium group and 3 species and P. trifoliata genes into three groups by depleted in the GC -poor group. It is worth noting 3 GC : rich (top 5%), poor (bottom 5%), and medium that GC -poor group has fewer genes functionally 3 3 (middle 90%). We observed that among the four described as stress-related when compared with the major economically important Citrus species, in C. GC -rich group. 3 reticulata (1718), C. clementina (1522), and In all the studied species, GC -richand -poor genes 3 P. trifoliata (951), most of the GC -rich genes were have different trends from 50 to 30 end of the genes. 3 related to stress in comparison to the total number GC -rich genes become even more GC rich, and 3 3 of observed stress-related genes, but in the case of GC -poor genes become more GC poor. Firstly, the 3 3 C. sinensis, it was observed that a high number of GC -rich group has more stress related genes than 3 medium GC genes (1857) were also abundant in the GC -poor group. Secondly, the GC -rich group 3 3 3 stress-related genes. The relative abundance of has a positive gradient of GC from 50 and 30 flanks 3 the medium GC genes in C. sinensis may be due to to the middle portion of the CDS. The low-GC 3 3 the time-course adaptability of this species during the group has a negative gradient of GC from 50 and 30 3 period of evolution, suggesting an adaptive evolution flanks to the middle portion of the CDS. Thirdly, towards the stress. rich- and poor-GC groups showed different CG 3 3 A high abundance of stress-related genes in C. reti- skew trends along the CDS: GC -rich genes favour C 3 3 culata and P. trifoliata may be closely related to their over G , and this preference is most pronounced in 3 highstress-tolerancetraitandespeciallytoP.trifoliata first 150 codons, whereas GC -poor genes favour G 3 3