ebook img

Ming-Che Shih Agricultural Biotechnology PDF

38 Pages·2011·1.87 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Ming-Che Shih Agricultural Biotechnology

De Novo Assembly of Expressed Transcripts and Global Analysis of the Phalaenopsis aphrodite Transcriptome S p e Chun-lin Su1,4, Ya-Ting Chao2,4, Yao-Chien Alex Chang3,4, Wan-Chieh Chen1, Chun-Yi Chen1, c i Ann-Ying Lee1, Kee Tuan Hwa1 and Ming-Che Shih1,* a l 1AgriculturalBiotechnologyResearchCenter,AcademiaSinica,Taipei,11529,Taiwan Is 2DepartmentofComputerScienceandEngineering,YuanZeUniversity,Chungli,Taoyuan,32003,Taiwan su 3DepartmentofHorticulture,NationalTaiwanUniversity,Taipei,10617,Taiwan e 4Theseauthorscontributedequallytothiswork – *Correspondingauthor:E-mail,[email protected];Fax,+886-2-26515693 R (Received April 23, 2011; Accepted July 12, 2011) eDo gw Being one of the largest families in the angiosperms, sequencing; NLS, nuclear localization signal; SNP, single nu- ulnloa ad Orchidaceaedisplayagreatbiodiversityresultingfromadap- cleotide polymorphism; TKL, tyrosine kinase like. red tation to diverse habitats. Genomic information on orchids P fro am is rather limited, despite their unique and interesting The nucleotide sequences of raw reads and contigs from this p h biological features, thus impeding advanced molecular re- work were deposited in GenBank (NCBI accession number: ettp rs search.Herewereportastrategytointegratesequenceout- SRA030409;contigs:JI626343–JI831113). ://a puts of the moth orchid, Phalaenopsis aphrodite, from two ca d high-throughput sequencing platform technologies, Roche e m 454andIllumina/Solexa, inordertomaximizeassemblyef- Introduction ic.o ficiency. Tissues collected for cDNA library preparation up includedawiderangeofvegetativeandreproductivetissues. Being the largest family of angiosperms with >25,000 species .co m We also designed an effective workflow for annotation and (Pridgeonet al.2005),theOrchidaceacecontainmemberswith /p functionalanalysis. Afterassemblyandtrimmingprocesses, many unique physiological characteristics that are absent in cp /a 233,823 unique sequences were obtained. Among them, more popularly studied model plants, such as rice and rtic 42,590 contigs averaging 875bp in length were annotated Arabidopsis. The great biodiversity of orchids is displayed not le -a to protein-coding genes, of which 7,263 coding genes were only in terms of speciation but also in their vast geographical bs foundtobenearlyfulllength.Thesequenceaccuracyofthe distribution,resultinginvariationsinmorphology,physiology, tra c assembled contigs was validated to be as high as 99.9%. habitatandinteractionwiththeecosystem.Greateffortshave t/5 2 Genes with tissue-specific expression were also categorized been made to accumulate valuable genomic information for /9 /1 byprofilinganalysiswith RNA-Seq.Gene productstargeted modelorganismsrapidly.Orchids,likemanyothernon-model 5 0 to specific subcellular localizations were identified by their organisms, have received less attention from researchers by 1/1 annotations. We concluded that, with proper assembly to comparison.Aneffectiveworkflowtogenerateandaccumulate 86 4 combine outputs of next-generation sequencing platforms, genomicdataquicklyiscriticaltofacilitatein-depthresearchof 57 8 transcriptome information can be enriched in gene discov- non-modelorganisms. b y ery, functional annotation and expression profiling of a Orchids have some conserved characteristics that distin- g u non-model organism. guish them from other plants, such as floral zygomorphy es (Rudall and Bateman 2002), a fused stamen and pistil (Rudall t o n Keywords: Assembly (cid:2) Database (cid:2) Expression profiling (cid:2) andBateman2002),aggregationofpollenintopollinia(Dressler 15 Phalaenopsis (cid:2) Transcriptome. 1990), and mycotrophy at an early stage of life (Rasmussen Ap Abbreviations: BLAST, basic local alignment search tool; 2002).Orchidpollinationofteninvolvescomplexmechanisms, ril 2 0 CAM, crassulacean acid metabolism; DAS, days after whichhadbeendescribedasearlyasDarwiniantimes(Darwin 1 9 sowing; DSN, duplex-specific nuclease; ER, endoplasmic re- 1862)andhavebeenresearchedextensivelyeversince(Nilsson ticulum; EST, expressed sequence tag; FPKM, fragments per 1992, van der Cingel 2007, Micheneau et al. 2009). kilobase of exon per million fragments mapped; GFP, green Palaeontological evidence of orchid–insect interaction was fluorescent protein; GO, Gene Ontology; KEGG, Kyoto found in amber fossil dating back 15–20 million years Encyclopedia of Genes and Genomes; NGS, next-generation (Ramirez et al. 2007). Based on the photosynthetic patterns, PlantCellPhysiol.52(9):1501–1514 (2011) doi:10.1093/pcp/pcr097,availableonlineatwww.pcp.oxfordjournals.org !TheAuthor2011.PublishedbyOxfordUniversityPressonbehalfofJapaneseSocietyofPlantPhysiologists. Allrightsreserved.Forpermissions,pleaseemail:[email protected] PlantCellPhysiol.52(9):1501–1514 (2011) doi:10.1093/pcp/pcr097 !TheAuthor2011. 1501 C.-l.Suetal. orchids can be broadly divided into C and crassulacean acid support further functional genomics studies when compared 3 metabolism(CAM)plants.Closelyrelatedorchidtaxamayhave withotherplantspecies. different photosynthetic patterns: thick-leaved Oncidium spe- BecausePhalaenopsisaphrodite(mothorchid)isanimport- cies carry out CAM photosynthesis, while the photosynthetic antparentincommercialbreedingprogramsfordesirabletraits pattern of thin-leaved Oncidium species is C (Silvera et al. such as large and white flowers, it was selected as our model 3 2010). It is of particular interest that some orchids such as orchid for functional genomic research. It is native to Taiwan Phalaenopsis shift their photosynthetic pattern between and epiphytic with CAM photosynthesis. To investigate growth stages or when facing environmental changes (Guo expressed genes involved in many biological processes of andLee2006,Pinget al.2010).TheprevalenceofCAMphoto- orchids, we constructed cDNA libraries from various organs synthesiswithepiphytismatlowaltitudeunderthecanopyofa of P. aphrodite, including root and shoot tissues, germinating rainforest indicated a functional relationship between traits seeds and flowers, at different developmental stages. An effi- under correlated evolutionary divergence (Silvera et al. 2009). cient workflow was developed using data generated from dif- D o Researchinfunctionalgenomicsandthegenomestructureof ferenthigh-throughputsequencingplatformstomaximizeEST w n orchids may shed light on the molecular regulation or evolu- output in order to increase the number and length of lo a tionarytrackofthephotosyntheticpatternshift. assembledcontigsandtoreducesequencingbias. de d Inrecentyears,orchidshavegainedcommercialimportance Next-generation sequencing (NGS) platforms, such as fro worldwide as ornamentals. With such agroeconomic signifi- Roche’s 454 GS FLX, Illumina/Solexa Genome Analyzer and m h cance, studying thereproductive behaviorof orchids is asim- Applied Biosystems’ SOLiD, have been widely used in recent ttp s portant as investigating the vegetative physiology described yearsforhigh-throughputsequencingofmanyorganisms(Wall ://a above. The reproduction of orchids is unique due to the et al.2009,Metzker2010).Ingeneral,Roche’ssequencingtech- c a d absence of endosperm in the mature seeds, with embryo de- nologyproduceslongreadsandisadvantageousforassemblyof e m velopment arrested at the globular stage (Arditti 1992). sequences into longer contigs; however, the number of reads ic .o Characteristics ofembryo maturation only become evident in generatedineachrunislowerthanthatofotherplatformsand u p theprotocorm,auniquestructurethatisdevelopedafterger- not enough to reach deep coverage for low-abundant genes. .c o m mination and found only in orchids (Vinogradova and Thistechnologyalsohasshortcomingsindecodinghomopoly- /p Andronova 2002). Symbiosis with mycorrhiza is required for meric nucleotide tracts correctly. Additional problems identi- cp /a germination of the poorly differentiated seeds in nature fiedinourstudyweresequencingbias,thenumberofuncalled rtic (Burgeff 1959). This dependence on fungus can be eliminated bases and a software glitch that produced duplicated contigs. le -a through the addition of sugars to artificial culture medium The Solexa technology provides a high number of reads for b s (Knudson 1922). Tissue culture and seed germination in an deeper coverage, which is beneficial for gene discovery. tra c artificialenvironmentbecameavailablethroughyearsofdevel- However,itsshortreadlengthlimitsdenovocontigassembly t/5 2 opmentaimedatfacilitatingorchidbreedingandpropagation. efficiency. Discussions on sequencing bias of high-throughput /9 /1 With recent advances in sequencing technologies, technologieshavetakenplaceinseveralpublicationsdescribing 5 0 genome-scalesequencingprojectsincludingdenovotranscrip- biassuchasbasesubstitutioncausedbytheamplificationpro- 1/1 tome analysis of many emerging model organisms (Shin et al. cedure of Solexa (Dohm et al. 2008, Hillier et al. 2008) or IN/ 86 4 2008, Meyer et al. 2009, Wakaguri et al. 2009) and reference DELsinhomopolymericrepeatregionscausedbythechemistry 5 7 8 mapping of expressed transcripts (Nagalakshmi et al. 2008) usedinRoche454(Quinlanet al.2008).Properintegrationof b y were launched. High-throughput technologies, e.g. RNA-Seq, data to minimize sequencing bias by either technology will g u werealsoappliedtofunctionalresearchsuchasgeneexpression assistinreducingsequencemistakes. es profiling of rice anther (Huang et al. 2009) and overall tran- Assembled contigs from sequencing reads were annotated t o n scriptome analysis with the combination of tiling array in and archived into an orchid-specialized database, Orchidstra 15 Drosophila(Graveleyet al.2011).However,veryfewfunctional (http://orchidstra.abrc.sinica.edu.tw), which is publicly avail- Ap genomic studies in orchids have been conducted so far. able.Potential applications ofthisdatabasearenotlimitedto ril 2 0 Compared with model plants, orchids have large genome Phalaenopsis since the establishment of an orchid genome 1 9 sizesandlongerlifecycles,thusmakingthemmorechallenging databasewillprovideinvaluablereferencesforfutureworksin subjects in genetic and genomic works. Previously, several otherorchidgenera. sequencing efforts have been made to provide a collection of orchid genes on a genomic scale. In one study, conventional capillary sequencing was employed to identify expressed se- Results quencetags(ESTs)ofPhalaenopsisequestris,resultinginarela- Generation of transcriptome information tivelylowoutputof3,688clusteredunigenes(Tsaiet al.2006). Very recently, results from multiple sequencing techniques Phalaenopsis aphrodite cDNA samples were prepared from wereintegratedtogenerate8,501contigsand76,116singletons vegetativetissues(mixtureofroot,leafandstem),reproductive forPhalaenopsisspp.(Fuet al.2011).Nevertheless,thenumber tissues (young inflorescence as reproductive library 1; flower of orchid genes in current databases is still not enough to buds and open blossom as reproductive library 2) and 1502 PlantCellPhysiol.52(9):1501–1514 (2011) doi:10.1093/pcp/pcr097 !TheAuthor2011. TranscriptomeanalysisofPhalaenopsisaphrodite germinatingseeds[0–30daysaftersowing(DAS)asseedlibrary non-mappedSolexareads(Fig.1A;SupplementaryTableS3). 1;40–75DASasseedlibrary2,and>75DASasseedlibrary3]. Inthefirststep,readsfromalllibrarieswerepooled;assemblyof Due to expression redundancy, vegetative cDNA library was all454readswasperformedwiththeNewblerAssemblersoft- normalized by the duplex-specific nuclease (DSN) method to ware.Afterremovingcontigs<200bp,34,563contigswereob- facilitatenovelgenediscovery(Zhulidovet al.2004).ThecDNA tained from this primary assembly with an average size of samples were sequenced using the 454 GS FLX Titanium and 1,194bp.Additionally,85,144singletons>200bpwithanaver- the Illumina/Solexa Genome Analyzer II platforms. Pooling of age size of 298bp were obtained and included in the unique the 454 sequencing runs resulted in 3,302,528 qualified reads sequencesforlatermapping. with an average length of 307 bases after adaptor trimming, Inthesecondstep,the454contigsandsingletonsobtained i.e. about 1Gb of sequence data in total (Supplementary (119,707 reference sequences altogether) were used as refer- TableS1).Illumina/SolexaGAIIreadsweregeneratedin76or ence frames, and Illumina/Solexa GAII reads were aligned to 120bp paired-end format. In total, 28.9Gb of adaptor-filtered the reference frames using CLC Genomic Workbench. About D o andQ20-trimmedsequencewereobtainedwiththisplatform. 92%ofSolexareadsweremappedtothereferencecontigsand w n Altogether,nearly30Gbofrawsequencewereproduced.After singletons in this step. When inconsistencies or rare variants lo a theprocessofdatatrimming,99.64%ofRoche454and72.64% were encountered, Solexa reads provided deeper coverage to de d of Solexa sequence results were preserved (Supplementary adjust454referencesequences.Therearetwotypesofconflict, fro TableS1). uncalled bases (N) and miscalled bases. About 97% of bases m h Wehaveoptimizedassemblyproceduresthroughaseriesof uncalled by 454 sequencing (1,350 out of a total of 1,393 Ns) ttp s tests,theresultsofwhicharesummarizedinFig.1A.Ourpro- were identified after mapping of Solexa reads. We found that ://a cedure consist of three steps: primary assembly of 454 reads, the remaining 43 uncorrected Ns were not covered by Solexa c a d reference mapping of Solexa reads and de novo assembly of reads in the mapping process. There were also 427,790 e m ic .o u p .c A o Assembly pipeline Roche 454 Titanium m /p c (Quality trimmed) p /a Yes 1 No rticle de novo assembly -ab s tra c t/5 2 Contig Contig (Ref) Singleton (Ref) /9/1 Formation 50 Length ≥ 200 bases unmapped unmapped 1/1 8 6 sequence adjustment de novo 45 Map to Ref 3 assembly 78 b y Yes 2 No g u e s (Quality trimmed) t o n 1 B Solexa paired-end 5 A Annotation pipeline p ril 2 0 1 9 Auto- Web Assembly Filtering Assignment Annotation Database Fasta: contigs Removal of BlastX - nr Homologue Orchidstra structure RNA tBlastX - ref-seq Similar contaminated RNA Pfam Weak similar GO Putative KEGG Unknown Fig. 1 Assembly scheme and annotation pipeline. (A) The strategy used for contig assembly is illustrated. Step 1: Roche 454 reads were assembled into Ref (reference sequence). Step 2: Solexa reads were mapped to the reference sequence and were used to adjust sequencing bias. Step 3:the remaining Solexareads weredenovo assembled into contigs. (B) Bioinformatic pipeline for annotation. Fordetails, see the MaterialsandMethodsandResults. PlantCellPhysiol.52(9):1501–1514 (2011) doi:10.1093/pcp/pcr097 !TheAuthor2011. 1503 C.-l.Suetal. inconsistentbasesadjustedbySolexareads;basesweremainly annotated genes (18.2% of the total unique contigs) with at present in homopolymeric nucleotide tracts or were poly- least one Blastx hit at the E-value cut-off of 1e-10. Table 1 morphismswithminorallelefrequencies. shows that 88.7% of the 42,590 annotated genes were Inthethirdstep,Solexareadsthatcouldnotbealignedto assembled by sequences derived from both Roche 454 and 454 references were separately assembled de novo. The third Solexa platforms. De novo assembly of the remaining Solexa assemblyresultedin126,535transcriptswithanaveragesizeof reads contributed to 11.1% of the contigs, whereas contigs 334bppercontig.Altogether,93.6%ofRoche454and97.5%of identified solely by Roche 454 reads were only 0.2%. The size Solexaqualitydatawereassembledintouniquesequencescon- distribution of the annotated contigs is shown in Fig. 2. Size tainingeithercontigsorsingletonsthatarelongerthan200bp distributionatN50equals1,485bpinlengthandaveragecontig (SupplementaryTableS3).Intotal,246,212uniquesequences size is 875bp. Among the annotated coding sequences, 7,263 (prefixed with PACT plus a serial number) with a size larger contigs(17%oftheannotatedgenes)arepredictedtobenear than 200bp were assembled and were considered as D o Phalaenopsis-expressed transcripts for further processing w n within the annotation pipeline. Nucleotide sequences of raw A loa readsandcontigsweredepositedintheGenBankdatabase. > 3000 de d Annotation ge (bp)2222864200000000----3222086400000000 from h TheannotationpipelineisshowninFig.1B.Contaminatedand an2000-2200 ttp structureRNAsequencesintheassembled246,212uniquese- gth r11860000--21080000 s://a qdpualatesantbcsaeessqeuwceoenrmceepsrreiumssiniongvgevadirbublsya,strcRnoNmcAupt,a-rogiefsfonEne-rvaaalgluavieencsottofra1eac-n3ud0st.ocAhmlsoiozrefotd-- Contig len11142080000600000----01111-642080000000000 cademic.o ware bug was identified in the Roche 454 Newbler Assembler 400-600 up programthatgeneratedduplicatecontigs.Afterblastandcom- 200-400 .co parison, 6,683 duplicate sequences were found and removed 0 000 000 000 000 000 000 m/p fromthetranscriptpool.Altogether,12,389contigswereelimi- 2 4Contig n6umber8 16 18 cp/a nated which consisted of about 5% of the total transcripts rtic (SupplementaryTableS4).Amongthem,transcriptsoriginat- B le 4000 -a ing from exogenous contaminations including virus, bacteria b s and yeast make up <1% of the assembled contigs, indicating p) tra b 3000 c thehealthinessofthegreenhouseplants.Afterremovalofcon- h ( t/5 taminating sequences, 233,823 unique contigs were preserved ngt 2/9 forTfuhrethreemraaninninotgautinoinq.uetranscriptsequenceswerethenanno- ontig le 2000 /1501/1 tated according to sequence comparison results from BlastX C 1000 86 4 analysisandfunctionalannotation,usingGeneOntology(GO; 57 8 http://www.geneontology.org/), Pfam (http://pfam.sanger.ac. 0 b y uk/) and Kyoto Encyclopedia of Genes and Genomes KEGG; N10 N20 N30 N40 N50 N60 N70 N80 N90 g u http://www.genome.jp/kegg/) databases. All 233,823 unique N value es contigs were compared againstthenr protein sequence data- Fig.2 Lengthdistributionofthe42,590annotatedcontigs.(A)Length t on base using standalone Blastx, tBlastX and customized Perl distribution in base pairs. The average contig length is 875bp. 15 A scripts. The homology search against nr resulted in 42,590 (B)Contigsizedistributionplot,whereN50isequalto1,485bases. p ril 2 0 1 9 Table 1 Data source of assembled contigs Source of assembly Annotated genes Non-annotated genes No. of Average No. of Base No. of Average No. of Base contigs length (bp) bases ratio (%) contigs length (bp) bases ratio (%) 454 contig with Solexa mapped 31,321 1,055.82 33.07Mb 88.7 77,364 336.77 26.05Mb 43.7 454 only 253 323.02 81.73kb 0.2 820 267.05 218.98kb 0.4 Solexa de novo assembly 11,016 375.35 4.13Mb 11.1 113,049 295.03 33.35Mb 55.9 Total 42,590 875.47p 37.28Mb 100.0 191,233 311.79bp 59.62Mb 100.0 Atotalof42,590annotatedand191,233non-annotatedcontigsweretracedbacktothreedatasources.No.ofbasesindicatesthesumofnucleotidebasesofallthe contigs, and baseratio isthe numberofbases from eachstep as afraction oftotalbases. 1504 PlantCellPhysiol.52(9):1501–1514 (2011) doi:10.1093/pcp/pcr097 !TheAuthor2011. TranscriptomeanalysisofPhalaenopsisaphrodite full length, as theycovered >90% ofamino acid sequences of Construction of the database the corresponding homologs. Arbitrary nomenclature was In order to facilitate orchid functional genomics research, we applied to give names to assembled contigs with various de- createdOrchidstra,apubliclyavailableresourcethatcombines greesofsimilarity identifiedincurrentdatabasesaccording to Phalaenopsisorchidtranscriptomedatawithrelevantannota- E-valueandidentitytotheblastedgenes(Table2).Therewere tions. Orchidstra (http://orchidstra.abrc.sinica.edu.tw) is a 24,131Pfamdomainsand16,744GOidentitiesidentifiedinthe web-basedinterfaceforannotatedtranscriptomicandgenomic annotatedgenes,while3,221geneswereconfirmedbytheEC data of orchids generated from sequencing technology. numberprovidedbyKEGG. Currently available organisms in Orchidstra include P. aphro- Overall GO distribution is shown in Fig. 3. High counts of dite,P.bellinaandP.equestris.Amongthem,P.aphroditetran- ESTs in metabolic processes were identified, such as genes scriptsweregeneratedbyoursequencingefforts.Theinterface involvedinthebiosynthesisofmacromoleculesandinnitrogen, ofOrchidstraprovidesaccesstoinformationonsequence,ac- protein and nucleic acid metabolism. Gene retrieval was per- D cession number, top hit alignments of blastx against the nr o formedtoidentifyimportantgenesinthreefunctionalgroups: w database, GO mapping results, EC number, KEGG pathway n transcriptionfactors,kinasesandtransporters.UsingthePlant lo map, KEGG Orthology and Pfam domain alignments for each a d Transcription Factor Database categories, 1,275 P. aphrodite e transcript. d transcription factors were identified (Fig. 4), of which zinc fro m finger proteins (41%) were the most abundant type followed Expression profiling and gene category analysis h by MYB (17%) and NAC (8%). A similarly high ratio of zinc ttp finger proteins in the transcription factor category is present Solexareadsfromindividuallibraryweresubmittedtoexpres- s://a in Arabidopsis (38%) and rice (25%). There were 946 orchid sionanalysisapplyingtheFPKM(fragmentsperkilobaseofexon ca transporters (including channels and carriers) identified in per million fragments mapped) value as the expression index dem this analysis, whereas 644 Arabidopsis transporters and 949 (Fig. 5A). Genes differentially expressed among the three dif- ic .o rice transporters were reported. The numbers of amino acid ferent tissue types (vegetative, reproductive and seed tissues) up were analyzed in a Venn diagram (Fig. 5B). The majority of .c transporters arecloseamongthethree species.Atotalof739 o m expressedgenesappearinallthreetissuetypes(24,248outof kinases were found in the Phalaenopsis annotation using the /p classificationsystemoftheRKDdatabase.Thisenzymeclassis 42,590genes,or57%).Seedtissuehasthegreatestnumberof cp/a under-representedwhencomparedwiththenumberofkinases unique genes (3,690), while reproductive tissue has 2,608 rtic foundinrice(1,411)andArabidopsis(1,552),probablydueto unique genes and vegetative tissue has 762 unique genes. GO le-a tissuedistributionforlibrarypreparation(Fig.4).However,this analysis of uniquely expressed transcripts is shown in bs mayalsoresultfromspeciesvariations,whichclearlyexistbe- SupplementaryFig.S2. trac tweenArabidopsisandrice.Forexample,thereisamuchhigher t/5 2 number of TKLs (tyrosine kinase like) in rice when compared Validation of sequence accuracy /9/1 with Arabidopsis, i.e. 1,127 vs. 372 (Dardick et al. 2007). Forty-eight P. aphrodite contigs were randomly chosen from 50 1 However, in analogy to rice and Arabidopsis kinases, the TKL the Orchidstra transcriptome database for PCR amplification /1 8 typeisstillthemostabundantkinaseinorchid(504TKLs).In and analysis in order to validate both the accuracy of the as- 64 5 spite of the massive amounts of sequence data acquired, the semblyandbaseassignment.Geneidentitiesandprimersused 7 8 investigation of the transcriptome in this study may not be inthisstudyarelistedinSupplementaryTableS5.Allofthe by g sufficientforacompletelistofexpressedtranscripts.Moreef- target genes were successfully amplified by PCR with the pre- u e fortstocollecttranscriptomedatabysequencingmorelibraries dicted amplicon size (1,217bp on average). Overall sequence st o obtainedfromorchidplantsgrownundervarioustreatmentsor accuracy from assembly is as high as 99.9% when comparing n 1 atdifferentgrowthstagesmayfurtherenrichthedatabase. contig sequences derived from next-generation sequencers 5 A p ril 2 0 1 Table 2 Summary of gene annotation 9 Tentative annotation Blast cut-off No. of genes Homolog to accession/definition of protein Identity (cid:3)90% and E-value (cid:4)1e-40 848 Similar to accession/definition of protein Identity <90% and E-value (cid:4)1e-40 or 20,285 1e-40 <E-value (cid:4)1e-30) Weakly similar to accession/definition of protein 1e-30 <E-value (cid:4)1e-20 6,607 Putative protein to accession/definition of protein 1e-20 <E-value (cid:4)1e-10 14,921 Total annotated genes 42,590 Total unknown transcripts E-value >1e-10 191,233 ThedefinitionofautomaticannotationwasassignedtoassembledcontigsbasedontheE-valueandsequenceidentityfromtheBlastXresult.Atotal of233,823 assembledcontigs were classified into 42,590annotated genes and 191,233non-annotated (unknown) genes. PlantCellPhysiol.52(9):1501–1514 (2011) doi:10.1093/pcp/pcr097 !TheAuthor2011. 1505 C.-l.Suetal. 11000 r e b m 4000 u n e n e G 2000 0 GO:0042180 cellular ketone metabolic processGO:0046483heterocyclemetabolicprocessGO:0044260celluarmacromoleculemetabolicprocessGO:0006629lipidmetabolicprocessGO:0009653anatomicalstructuremorphogenesisGO:0009059macromoleculebiosyntheticprocessGO:0009308aminemetabolicprocessGO:0006810transportGO:0034641cellularnitrogencompoundmetabolicprocessGO:0009791post-embryonicdevelopmentGO:0044249cellularbiosyntheticprocessGO:0010467geneexpressionGO:0044283smallmoleculebiosyntheticprocessGO:0006519cellularaaandderivativemetabolicprocessGO:0007422,GO:0022621,GO:0001501 system developmentGO:0019222regulationofmetabolicprocessGO:0009725responsetohormonestimulusGO:0005975carbohydratemetabolicprocessGO:0044248cellularcatabolicprocessGO:0019538proteinmetabolicprocessGO:0043412macromoleculemodificationGO:0006793phosphorusmetabolicprocessGO:0003006reproductivedevelopmentalprocessGO:0006082organicacidmetabolicprocessGO:0090304nucleicacidmetabolicprocessGO:0006091generationofprecursormetabolitesandenergyGO:0010033responsetoorganicsubstanceGO:0044262cellularcarbohydratemetabolicprocessGO:0050794regulationofcellularprocessGO:0044424intracellularpartGO:0031090organellemembraneGO:0031967organelleenvelopeGO:0016020membraneGO:0031975envelopeGO:0031988membrane-boundedvesicleGO:0005622intracellularGO:0044446intracellularorganellepartGO:0044425membranepartGO:0022891specifictransmembranetransporteractivityGO:0032553ribonucleotidebindingGO:0016772transferaseactivity,pho-containinggroupsGO:0003723RNAbindingGO:0043169cationbindingGO:0022804activetransmembranetransporteractivityGO:0008233peptidaseactivityGO:0016817hydrolaseactivity,actingonacidanhydridesGO:0017076purinenucleotidebindingGO:0003677DNAbindingGO:0016788hydrolaseactivity,actingonesterbondsGO:0016757transferaseactivity,glycosylgroupsGO:0017076purinenucleosidebinding Downloaded from https://academic.oup.com/pcp/artic Biological process Cellular component Molecular function le -a b GO classes s tra c Fig.3 GeneOntologyanalysisofPhalaenopsis-annotatedgenes.Categoriespertainingtobiologicalprocess,cellularcomponentandmolecular t/5 functionweredefinedbyGeneOntologyclassification. 2 /9 /1 5 0 1 /1 8 withthoseofthePCRampliconsobtainedthroughaconven- TableS6).Greenfluorescenceindicatingexpressionandtarget- 64 5 tionalcapillarysequencer,AB3730(SupplementaryTableS5). ing to cellular compartments was observed using a confocal 78 microscope, indicating successful expression of the target by g Subcellular localization genes, and it was also found co-localized with images of red u e In order to validate the annotation functionally in terms of fluorescence derived from expression of the corresponding st o Arabidopsis subcellular markers carrying mCherry standard n cellular components, constructs of selective Phalaenopsis 1 genes fused with soluble modified green fluorescent protein constructs localized to compartments including the Golgi, ER 5 A (smGFP; Davis and Vierstra 1998) were bombarded into (loNceallsiozantioetnals.ig2n0a0l7)(NanLSd)apneputcildeear(cLoenesterutcatl.w2it0h08a) ninucltehaer pril 2 PhalaenopsisSogoYukidian‘V3’whitepetalsorsepalstoexam- 01 co-bombardment experiment (Fig. 6). Genes selected from 9 inetheirsubcellularlocalizations.BasedonGOannotationand the Orchidstra database were proven to be localized to the the literature (Donaldson et al. 1992, Mao et al. 2005, correct cellular compartments according to annotation and Osterrieder et al. 2010), the following Phalaenopsis genes maybeusefulascellularmarkersforfuturecellbiologystudies. were chosen for this approach: 60S ribosomal protein (PATC141573) and an smD2 protein of small nuclear ribonu- cleoprotein (snRNP; PATC144393) genes for nuclear localiza- Discussion tion, a cyclophilin gene (PATC140899/PATC155502) for the cytoplasm, an actin gene (PATC141888) for the cytoskeleton, Large number of P. aphrodite expressed genes were identified aGTPase(PATC138866)andceramidesynthase(PATC140948) and analyzed with a high-throughput sequencing approach for the endoplasmic reticulum (ER) and an ADP-ribosylation described in this study. In addition, a transcriptome database factor(PATC142447)fortheGolgiapparatus(Supplementary was constructed which is accessible to the public. Gene 1506 PlantCellPhysiol.52(9):1501–1514 (2011) doi:10.1093/pcp/pcr097 !TheAuthor2011. TranscriptomeanalysisofPhalaenopsisaphrodite 11000 Phalaenopsis aphrodite 1,000 Arabidopsis 500 Rice r 400 e b m nu 300 e n e G 200 100 D o w n 0 lo a d AP2-ERF bHLH bZIP Zinc finger GRAS MADS MYB NAC WRYV ABC transporter aa transporter Metal transporter Nitrate transporter eptide transporter Pho transporter Sugar transporter Channel AGC CAMK CMGC CK1 STE TKL ed from https://ac P a d Transcription factor Transpoter Kinase e m ic Gene category .o u p Fig. 4 Gene discovery by a high-throughput EST approach. Genes encoding transcription factors, transporters and kinases were identified .c o throughcombinationsearchofannotation,PfamandGO.Numbersofgenesinsubcategoriesidentifiedintheorchiddatabaseaswellasthosein m /p theArabidopsisandricedatabasesareshown. c p /a rtic le -a b s expression profiling analysis was performed to differentiate TheoptimizedassemblyworkflowdescribedinFig.1Anot tra c genes with tissue-specific expression. These transcripts can only increased the number of unique contigs but also led to t/5 serve as molecular and cellular markers in breeding programs better base usage efficiency. A high base usage rate of this 2/9 andfuturegenomeresearch. workingpipelinewasachieved,as93.6%ofRoche454trimmed /1 5 0 sequenceand97.5%ofSolexaqualitysequenceassembledinto 1 Assembly efficiency /1 uniquecontigs(Fig.1A).Atotalof246,212uniquecontigsand 8 6 4 Several types of non-commercial assembly software such as singletons with a size longer than 200bp were obtained from 5 7 Velvet from EMBL-EBI and SOAPdenovo from BGI were eval- thisassemblymethodbeforeannotation. 8 b uated forassembling readsgenerated in this study.Neitherof Inconclusion,thisassemblyprocesshasatleastfourbenefits y g u themissuitableforassemblywithinputsequencingdatathat to satsify the general interest of researchers engaged in tran- e s divergesgreatlyinnumbersandlengthsofreads,thusillustrat- scriptomicresearch.Thefirstistheabilitytoobtainlongcon- t o n ingthecomplexityofanddifficultyinintegratingsuchdata.Fu tigswiththeRoche454dataassembly.Thesecondistheability 1 5 et al. (2011) attempted to combine sequencing output from to correct sequencing errors present in low-fold coverage 454 A p Sanger sequencing as well as Roche 454 and Solexa high- readsusingthehigh-depthcoveragesequencingdataobtained ril 2 throughputsequencingbyconsensusannealingofcontigsgen- withtheSolexaplatform.Thirdly,denovoassemblyofthere- 01 9 erated from each individual platform. From our observation, mainingunmappedSolexadataenrichednovelgenediscovery. theannealingofassembledcontigsfromdifferentdatasources Finally,thefrequencyofthemappedSolexareadscanbeused purelybasedontheconsensusisnotreliableandisinefficientin forgeneexpressionprofiling. terms of data usage (Supplementary Table S2). Neither is it Annotation reliabletomixlongandshortsequencesortomixfewandrich dataforassemblysimplybecausealgorithmandparameterad- Consideringerrorinheritanceandinsufficientinformationfora justmentisverydifferentwithdistinctdatasets.Astrategyto genethatcouldplaymultiplerolesinvariousfunctions,wefirst optimizetheassemblyprotocolandalogicalworkingproced- divided annotation outcome into five categories according ureisnecessaryinconsolidatingdatafrommultipleplatforms to criteria of identity and E-value from blastX and tblastX re- toimprovesequencingefficiencyfortranscriptomicstudiesofa sults: homolog, similar, weakly similar, putative and unknown novelspecies. (Table 2). The first four categories with an E-value cutoff at PlantCellPhysiol.52(9):1501–1514 (2011) doi:10.1093/pcp/pcr097 !TheAuthor2011. 1507 C.-l.Suetal. A Veg Rep1 Rep2 Seed1 Seed2 Seed3 similarityshouldnotbecoincidental;however,atpresent,the biologicalsenseofthiscorrelationremainsunclear. Apart from those genes that were found to be similar to protein-coding genes in current databases, there are 191,233 unique contigs in our database that lack similarity to known genes (E-value >1e-10) and do not possess an open reading frame long enough for comparison. Non-annotated genes are not uncommon since a significant proportion of non-annotated UniGenes exists in current databases of rice and Arabidopsis (data not shown). These unknown contigs may contain fragmented RNA including untranslated regions (UTRs), introns, long non-coding RNAs (ncRNAs), microRNA D o precursors and other types of transcripts (including contami- w n natedtranscriptsfromunidentifiedsources). lo a d e d Functional category and differential expression fro m RNA-Seqtakingadvantageofthelargenumberofreadsgener- h ttp ated from high-throughput sequencing technology has been s proved to be a powerful tool in analysis of gene expression ://a c a Expression profiling (Wang et al. 2009). When filtered through criteria d e 01.0 2.0 3.0 4.0 5.0 7.5 10.0 15.0 rgeeqnueircinlugsttehriengexapnraelsyssiiosnreivnedaelexd(L8o,8g024FPdKifMferevnaltuiael)lyteoxpbrees>se5d, mic.o u p genes (Fig. 5A); many of them are proteins with unknown .c B o functions. A unique glycine-rich RNA-binding protein m /p (PATC230728), glutaredoxin (PATC153260), thioredoxin c p (PATC143280),glutathionetransferase(PATC148647)andper- /a Veg7e6ta2tive oxidase (PATC156909) genes were highly expressed in all tis- rticle sues, indicating an active defense mechanism and stress -a b s 748 621 response.Afewothercommongenes,includingthoseencod- tra Reproductive 24,248 Seed ianngdurbibiqousoitminasl,pprrootteeiansse,sw,herisetoamneoHng2Bth,eChhligah/lby-ebxinpdreinssgedprgoetneeins ct/52/9 generallyfoundinalltissuesexamined.Acyclophilinhomolog /1 2,608 9,913 3,690 50 (PATC155502)andanimmunophilin(PATC148202),bothim- 1 /1 munosuppressants in humans, were also highly expressed in 8 6 4 all tissues. Cyclophilin and immunophilin are enzymes 5 7 withpeptidyl-prolylisomeraseactivityinvolvedinproteinfold- 8 b Fig.5 Differentialgeneexpressionbetweentissues.(A)Genecluster y ing. An orchid cyclophilin homolog had been reported in a g analysis.ExpressionlevelsweredeterminedbyFPKMvaluescalculated u VandaorchidhybridandinaPhalaenopsishybridwithexpres- e from Solexa read counts. Veg, vegetative library; Rep1 and 2, repro- sion correlated to a peloric mutant phenotype (Chen et al. st o ductive library 1 and 2; Seed 1–3, seed library 1, 2 and 3. (B) The n 2005). 1 numbersofgenesdifferentiallyexpressedorsharedamongvegetative, 5 reproductiveandseedlibrariesareshown. Uniquegenesidentifiedinthreetypesoftissues(vegetative; Ap reproductive1and2combined;seed1,2and3combined)were ril 2 analyzed by their GO distribution (Supplementary Fig. S2). 01 9 10E-10(42,590uniquecontigs)weredefinedasprotein-coding Many of the differentially expressed genes were annotated as genes, while unknown transcripts were defined as unknown function, exhibiting a vast research potential. The non-annotated (191,233 contigs). To provide additional infor- seed library contains the most unique genes, presumbly due mation, Pfam, GO and KEGG identities were also assigned to to the high diversity of genes needed for the architecture of thegeneswhenapplicable. early embryo formation. The novel genes identified in seed Of the top-hit species list, 5,288 Phalaenopsis genes were libraries present promising and intriguing projects for future found to be similar to genes of common grapevine (Vitis functionalresearch.Cupin,LEA(lateembryogenesisabundant), vinifera), while 2,647 and 1,394 genes were similar to genes of metallothionein, heat shock 70kDa protein, several dehydro- rice and Ricinus communis, respectively (Supplementary Fig. genases and some other genes with unknown function are S3).Giventhatmanyplantspecieshaveamuchhighernumber among the most highly expressed seed-specific genes. ofUniGenesorESTsincurrentdatabasesthangrapevine,this LLA-115(atapetum-specificprotein),apollen-specificprotein 1508 PlantCellPhysiol.52(9):1501–1514 (2011) doi:10.1093/pcp/pcr097 !TheAuthor2011. TranscriptomeanalysisofPhalaenopsisaphrodite D o w n lo a d e d fro m h ttp s ://a c a d e m ic .o u p .c o m /p c p /a rtic le -a b s tra c t/5 2 /9 /1 5 0 1 /1 8 6 4 5 7 8 b y g u e s t o n 1 Fig.6 SubcellularlocalizationofPhalaenopsisaphroditegenes.SelectivePhalaenopsisgeneswerefusedwithGFPandexpressedinpetalcellsby 5 A particle bombardment. Confocal images derived from co-expression of Phalaenopsis constructs and Arabidopsis marker constructs were p arranged in thefollowing order:A–C, nuclearlocalization; D–F, ERlocalization; G–I,Golgilocalization. Phalaenopsis constructs with GFP:A, ril 2 0 PATC144393;D,PATC140948;andG,PATC142447areinthegreenchannel,andArabidopsismarkerconstructswithmCherry:B,ER-rkCD3-959; 1 9 E,G-rbCD3-968;andH,E3170areintheredchannel.C,FandIaremergedimagestakeninthegreenandredchannelsfromAandB,DandEand GandH,respectively.Greenfluorescenceimagesofthecytoplasmconstruct,J,cyclophilin(PATC140899/PATC155502);andthecytoskeleton construct,K,actin(PATC141888)weretakenwithoutco-bombardmentwithArabidopsismarkergenes.AfluorscenceimageoffreeGFPwith vector-onlycontrol,326-GFP,isshowninL.Scalebarsindicate20mm. homolog, a MADS box transcription factor, and many others Validation of sequence accuracy and functional wereidentifiedasreproductive-specificexpressedgenes.These annotation of the assembled contigs tissue-specificexpressedgenespresentopportunitiestoinves- tigate genes involved in unique molecular and physiological Whenampliconsofrandomlyselectedcontigsweresubmitted functionsinorchidsfurther. to capillary sequencing and the sequences confirmed in both PlantCellPhysiol.52(9):1501–1514 (2011) doi:10.1093/pcp/pcr097 !TheAuthor2011. 1509 C.-l.Suetal. directions, the overall accuracy of the assembled contigs was Conclusion estimated to be as high as 99.9%. This result assured us that To initiate genomic research for a non-model organism, an our assembly procedure is fairly reliable. This high accuracy is ESTapproachfortranscriptomeanalysishasmanyadvantages due to the deep coverage and sequence adjustment from and potential applications. An effective workflow to inte- Solexa mapping. Among the 48 genes used for validation, six grate data from advanced sequencing technologies was were found to have single nucleotide polymorphisms (SNPs) described. Significant features of this working procedure in- detected by Sanger sequencing (Supplementary Table S5); clude pooling of libraries from various tissues for a compre- however,additionalSNPswithlowerminorallelefrequencycan hensivesourceofexpressedgenes,stepstocombinereadsfrom be identified by Solexa consensus comparison (data not twoplatformstoenhanceassemblyperformance,astreamlined shown). annotation process, generation of gene expression profiles Particle bombardment (You et al. 2003) with GFP fusion using sequencing result as tags, and construction of a web constructs to transiently express genes of interest in petals of D database. The entire workflow minimizes time and labor for o acommerciallypopularPhalaenopsishybridisnewlydeveloped w informatic process and maximizes output in both novel gene n in this study for functional confirmation of the annotation discoveryandsequenceinformation.Furthermore,itcaneasily loa d process. Subcellular localization of Phalaenopsis contig–GFP beadaptedtomostresearchsubjectsondemand. ed fusionconstructswasobservedbyfluorescenceco-localization fro m withArabidopsisstandardconstructs(withmCherry)incellu- h larcompartmentssuchasnuclei,GolgiandERinorchidpetal Materials and Methods ttp s cells, indicating a similar mechanism of protein targeting be- ://a Plant materials c tween the two species (Fig. 6). Additionally, genes targeting a d e cytoplasm and cytoskeleton localizations were also observed. Taiwan’snativemothorchid,P.aphroditeRchb.f.,wascollected m Theresultnotonlydemonstratescorrectfunctionalannotation from its original mountain habitat in Dawu, Taitung county, ic.o u of a selection of Phalaenopsis genes but also provides subcel- and kindly provided by Dr. Tsai-Mu Shen, National Chyayi p .c lular markers for future cell biology studies. Lack of a reliable University, Chyayi county, Taiwan. Phalaenopsis Sogo om protocol for stable transformation and the long life cycle of Yukidian ‘V3’ (a popular tetraploid commercial Phalaenopsis /p c p orchids are hurdles for a transgenic approach in functional hybrid) used in the particle bombardment experiment was /a studies of orchid genes in general. Transient expression in purchasedfromcommercialnurseries. rtic le white petal cells of a commercial Phalaenopsis hybrid by par- Mature plants were maintained in growth chambers at -a ticle bombardment provides an alternative way to investigate 22–27(cid:5)C under a 12h light/dark cycle with regular irrigation bstra genefunctions. and fertilization. Seeds from hand-pollinated capsules (120d c after pollination) were germinated on 1/4 Murashige t/52 /9 and Skoog medium including Gamborg B5 vitamins (Duchefa /1 Perspectives of an orchid database 5 Biochemie), pH 5.6, with 1% tryptone, 2% sucrose and 0.85% 0 1 Orchid genomic information including genome structure, agarunderthesamegrowthconditionsasformatureplants. /18 6 transcriptomesequencesandmarkerssuchasphysical,genom- Three libraries were built from collections of various 4 5 ic and cytogenetic markers are fairly limited in the cur- orchid tissues. Leaf, stem and root from mature P. aphrodite 78 b rent databases. Sequence information of expressed genes of plants were pooled for the vegetative library. A reproductive y g P. aphrodite obtained in our study was deposited in the library was built from young inflorescence (emerging inflor- ue s Orchidstra database. Orchidstra is a web database designated escence about 10–15cm in length) and mature stalk (repro- t o fororchidgenomicandtranscriptomicinformation.Orchidstra ductive 1 library), and flower buds, full blossoms and n 1 5 is a combined word from orchid and orchestra, denoting senescing flowers (reproductive 2 library). A seed library was A p tahbeouhtatrhmeobneiaouutsyoinftoerrcphlaidys.oIftiasncootllesicmtiopnlyaofsegqeuneenscetoarcbhriinvge bthurieltepfrhoamsesa:seceodlle1ctliiobnraroy,fpgroertmocionramtinfgormseaetdisond(i0vi–d3e0dDiAntSo) ril 20 1 when the globular shape of the protocorm with absorbing 9 but serves as a value-added database with functional anno- hairs was formed; seed 2 library, protocorm development tations and gene expression profiling data, among other fea- (40–75 DAS) with pseudo-rhizome and cleavage forming; tures. For example, an orchid specialized microarray was and seed 3 library, seedling formation (75–100 DAS) with designed based on Phalaenopsis transcriptomic sequences the first leaf and sometimes the root emerging from the assembledinthiswork,andtheperformancetestisinprogress. protocorm. The relative expression level derived from microarray analysis is included in the annotation to support functional studies Orchid RNA extraction (data not shown, see Orchidstra database). In the future, we will keep on enriching genomic information of more orchid Orchid total RNA was isolated as described previously with species with continuous efforts in data integration with func- minor modification (Yu and Goh 2000). Plant tissues were tionalresearch. frozen and ground in liquid nitrogen. RNA was extracted by 1510 PlantCellPhysiol.52(9):1501–1514 (2011) doi:10.1093/pcp/pcr097 !TheAuthor2011.

Description:
of the moth orchid,. Phalaenopsis aphrodite, from two high-throughput sequencing platform technologies, Roche Gene products targeted to specific subcellular runs with Apache Web server in a Linux environment. PHP and
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.