BIOINFORMATICS FOR BEGINNERS

Genes, Genomes, Molecular Evolution, Databases and Analytical Tools

Supratim Choudhuri

With contribution from Dr Michael Kotewicz on the Optical Mapping of DNA

Center for Food Safety and Applied Nutrition, FDA, College Park, Maryland

AMSTERDAM • BOSTON • HEIDELBERG • LONDON • NEW YORK • OXFORD • PARIS • SAN DIEGO • SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Academic Press is an imprint of Elsevier

32 Jamestown Road, London NW1 7BY, UK
225 Wyman Street, Waltham, MA 02451, USA
525 B Street, Suite 1800, San Diego, CA 92101-4495, USA

2014 Published by Elsevier Inc.

The book was prepared by U.S. government employees in connection with their official duties, and therefore copyright protection is not available in the United States pursuant to 17 U.S.C. Section 105.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without the prior written permission of the publisher.

Permissions may be sought directly from Elsevier’s Science & Technology Rights Department in Oxford, UK: phone (+44) (0) 1865 843830; fax (+44) (0) 1865 853333; email: [email protected], visit the Science and Technology Books website at www.elsevierdirect.com/rights for further information.

Notice

The publisher and the author make no representations or warranties with respect to the accuracy and completeness of the contents of this work. No responsibility is assumed by the publisher and the author for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions or ideas contained in the material herein.
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library

Library of Congress Cataloging-in-Publication Data
A catalog record for this book is available from the Library of Congress

ISBN: 978-0-12-410471-6

For information on all Academic Press publications visit our website at elsevierdirect.com

14 15 16 17 18 10 9 8 7 6 5 4 3 2 1

To my Family

Preface

As the title of the book suggests, this book is indeed for “beginners.” It is not intended for advanced students of bioinformatics or practicing bioinformaticians. This book has been written from the perspective of an end-user who wants to use the freely available web-based databases and tools for bioinformatic analysis. The audience of this book could include any scientist or student who has a background in basic molecular biology but has not used web-based databases and tools for sequence analysis, or has not done bioinformatic analysis on a regular basis. The total number of chapters is only nine. This is because related sections have been combined into one chapter for coherence and understanding. These sections could have been easily split into separate stand-alone chapters to increase the number of chapters.

More than a decade into the first human genome sequencing, the use of bioinformatic analysis has been steadily increasing. There are more web-based freely available databases and analytical tools than ever before. Modern biology has pervaded even the social sciences. For example, sociologists and psychologists are now probing how the epigenomic effects of environmental factors (including social factors) might shape the personality and behavior of the offspring postnatally. The National Center for Biotechnology Information has established an epigenomics database, which will be immensely useful to scientists in the near future. Thus, bioinformatics has been slowly but steadily pervading all branches of biology and beyond. In keeping with this, more and more bioinformatics books are being written for experts, which do not necessarily cater to the needs of the non-experts.

Because this book is about bioinformatic analysis using web-based databases and tools, the emphasis is on sequence analysis. Global gene-expression profiling has not been emphasized other than a short discussion. The makers of gene-expression analysis platforms provide necessary software for analysis. Lastly, it is not possible to show every type of analysis in a book with a defined word count; nor is it possible to discuss all the links and all the functions associated with a database or analysis. Therefore, this book should serve as an initial guide, and it is expected that the reader will take it upon himself/herself to explore further using the databases and tools. Terms such as program, tool, algorithm, and web server have been used interchangeably throughout the book. These terms essentially mean the same thing in the context of this book. However, the term web server could be used to mean both the hardware and the software.

Because the principal audience of the book is supposed to be non-specialists, it was felt necessary to introduce the science and some core concepts of genomics as well as some important genomic techniques before embarking on the bioinformatic analysis. By the same token, some fundamental aspects of molecular evolution have been discussed in this book because the goal of many applications of bioinformatics is to trace the signatures of molecular evolution, as well as study the relatedness of taxa. In order to minimize the number of references in the text, reviews are cited wherever possible.

Supratim Choudhuri

Acknowledgment

The author would like to acknowledge the invaluable contributions of all scientists and engineers who developed databases and online tools for analysis, and made them freely available. The author would also like to acknowledge the contributions of the groups/institutions/organizations for hosting and maintaining these resources on web servers. A number of links for freely available databases and web-based tools for analysis have been provided throughout the book. Wherever possible, the latest relevant publications (which usually include the previous publications as well) describing these resources have been cited to acknowledge the contribution. The scientific community is truly grateful to the developers of these tools and databases and for making them freely available to facilitate bioinformatic analysis and learning.

The author would like to thank Dr Steve Gendel for his careful reading of the allergenicity prediction section in Chapter 8, and providing helpful suggestions.

The author would also like to thank many colleagues for their encouragement, enthusiasm, and support for the project.

Last but not the least, the author is grateful to Mr Graham Nisbet and Ms Catherine Mullane of Elsevier for making this project a reality, helping to bring it to successful completion, and being available whenever help and advice were needed.
CHAPTER 1

Fundamentals of Genes and Genomes*

OUTLINE

1.1 Biological Macromolecules, Genomics, and Bioinformatics
1.2 DNA as the Universal Genetic Material
1.3 DNA Double Helix
    1.3.1 Structural Units of DNA
    1.3.2 Linkage between Nucleotides
    1.3.3 Base-Pairing Rules, Double Helix, and Triple Helix
    1.3.4 Single-Stranded DNA
    1.3.5 Base Sequence and the Genetic Code
1.4 Conformations of DNA
1.5 Typical Eukaryotic Gene Structure
    1.5.1 Transcribed Region
        1.5.1.1 Intron-Splicing Signals
        1.5.1.2 Effect of Intron Phase on Alternative Splicing
        1.5.1.3 Evolution of Introns
    1.5.2 5′-Flanking Region of Transcribed Genes
    1.5.3 3′-Flanking Region of Transcribed Genes
1.6 Mutations in the DNA Sequence
1.7 Some Features of RNA
    1.7.1 Instability of mRNA
    1.7.2 5′- and 3′-Untranslated Regions of mRNA
    1.7.3 Secondary Structures in RNA
1.8 Coding Versus Noncoding RNA
    1.8.1 Small Noncoding RNA, Long Noncoding RNA, Competing Endogenous RNA, and Circular RNA
1.9 Protein Structure and Function
    1.9.1 Configuration and Chirality of Amino Acids
    1.9.2 Ionic Character of Amino Acids
    1.9.3 Relationship between Protein Function and the Location of Amino Acids in the Polypeptide Chain
    1.9.4 Linkage between Amino Acids: The Peptide Bond
    1.9.5 Four Levels of Protein Structure
    1.9.6 Acidic and Basic Proteins
    1.9.7 Nonstandard Amino Acids in Polypeptide Chains
1.10 Genome Structure and Organization
    1.10.1 The Structure of a Representative Genome: The Human Genome
    1.10.2 Functional Sequence Elements in the Genome
        1.10.2.1 Promoters
        1.10.2.2 Enhancers
        1.10.2.3 Locus Control Regions
        1.10.2.4 Insulators
    1.10.3 Epigenetic Modifications of the Genome Can Edit the Language Written in the DNA Sequence and Add an Extra Layer of Complexity in Genome Expression
        1.10.3.1 Histone Code
        1.10.3.2 The Dynamics of Epigenetic Changes
    1.10.4 Lessons Learned from the Second Phase of the ENCODE Project about the DNA Elements in the Human Genome and its Epigenetic Modifications
References

*The opinions expressed in this chapter are the author’s own and they do not necessarily reflect the opinions of the FDA, the DHHS, or the Federal Government.

1.1 BIOLOGICAL MACROMOLECULES, GENOMICS, AND BIOINFORMATICS

Genetic information is stored in the cell in the form of biological macromolecules, such as nucleic acids and proteins. The genetic information not only drives the functioning of the whole organism, but also drives the evolutionary engine. Thus, an understanding of the molecular basis of life is fundamental to understanding how genetic information shapes life and drives its evolution. The following discussion captures some fundamental aspects of the structure and function of genes and genomes with special notes (in boxes) on the applications of this information.

1.2 DNA AS THE UNIVERSAL GENETIC MATERIAL

With some exceptions, deoxyribonucleic acid (DNA) is the universal genetic material. In some viruses, termed RNA viruses, RNA is the genetic material. The term ribovirus is used for viruses with single- and double-stranded RNA genomes, including retroviruses, which are RNA-based for a portion of their life cycle.¹

Among the RNA viruses, retroviruses are well known; they include the notorious AIDS virus. Retroviruses are unique because in their life cycle they have both RNA and DNA versions of their genome. A complete retrovirus contains an RNA genome. The RNA genome encodes some protein products that are necessary for converting the single-stranded RNA genome into a double-stranded DNA genome and then its subsequent integration into the host genome. One such protein product of the retroviral genome is the reverse transcriptase (RT) enzyme. Upon entry into the cell, the reverse transcriptase is produced from the viral RNA genome using the host cellular machinery. The RT then copies the single-stranded RNA genome into a single-stranded DNA, which then produces a double-stranded viral DNA genome. The double-stranded viral DNA genome is referred to as the provirus, which gets incorporated into the host genome from where it keeps producing more retrovirus particles with single-stranded RNA genomes.

1.3 DNA DOUBLE HELIX

The structure of the DNA double helix and its building blocks are described in all biology textbooks. Here, some other aspects are also highlighted, including the information in Box 1.1. DNA is a double-stranded right-handed helix; the two strands are complementary because of complementary base pairing, and antiparallel because the two strands have opposite 5′→3′ orientation (Figure 1.1A). The diameter of the helical DNA molecule is 20 Å (= 2 nm). The helical conformation of DNA creates the alternate major groove and minor groove (Figure 1.1B).

BOX 1.1

1. The major grooves in DNA can bind proteins. This is an important property of DNA structure because the major grooves in the upstream regulatory regions of a gene bind transcription-regulatory proteins. For example, for Zn-finger transcription factors, each Zn finger recognizes and binds to a specific trinucleotide sequence in the major groove of DNA.²
2. Any double-stranded nucleic acid (whether DNA double strand, DNA–RNA hybrid double strand, or RNA–RNA double strand) is antiparallel in nature. The complementary and antiparallel nature of double-stranded nucleic acids is an important property to remember while designing synthetic oligonucleotides for hybridization (probes or primers).
3. By convention, nucleic acid (DNA or RNA) sequence is written 5′-3′ from left to right, such as 5′-ATGTAAGCAC-3′. If the 5′-3′ designation is not mentioned, it is assumed that the sequence has been written in a 5′-3′ direction, following convention.

FIGURE 1.1 DNA structure. (A) Two nucleotides of the DNA double helix, showing their antiparallel orientation, two H-bonds between A and T and three H-bonds between G and C; (B) the DNA double helix showing the major and minor grooves as well as the diameter of the molecule; (C) the convention of classifying the two sides of the phosphodiester bond and the products generated from their cleavage; (D) the front side (Watson–Crick edge) and the back side (Hoogsteen edge) of a purine; (E) how Hoogsteen H-bonding aids in the formation of the triple helix (see Section 1.3.3); (F) the anti and the syn conformations of bases around the N-glycosidic bond.

1.3.1 Structural Units of DNA

DNA is composed of structural units called nucleotides (deoxyribonucleotides). Each nucleotide is composed of a pentose sugar (2′-deoxy-D-ribose); one of the four nitrogenous bases: adenine (A), thymine (T), guanine (G), or cytosine (C); and a phosphate. The pentose sugar has five carbon atoms and they are numbered 1′ (1-prime) through 5′ (5-prime). The base is attached to the 1′ carbon atom of the sugar, and the phosphate is attached to the 5′ carbon atom (Figure 1.1A). The sugar and base form a nucleoside, whereas nucleoside plus phosphate makes a nucleotide. Hence, nucleoside = sugar + base, whereas nucleotide = sugar + base + phosphate. Table 1.1 shows the naming of nucleosides and nucleotides. Each nucleotide in DNA (as well as in RNA) has one replaceable hydrogen, which is what makes the DNA (and RNA) acidic.

TABLE 1.1 Naming of Nucleosides and Nucleotides

Base              Nucleoside (base + sugar)               Nucleotide (base + sugar + phosphate)
Adenine           Deoxyadenosine (sugar = deoxyribose)    Deoxyadenylic acid OR deoxyadenosine monophosphate
Guanine           Deoxyguanosine (sugar = deoxyribose)    Deoxyguanylic acid OR deoxyguanosine monophosphate
Cytosine          Deoxycytidine (sugar = deoxyribose)     Deoxycytidylic acid OR deoxycytidine monophosphate
Thymine           Deoxythymidine (sugar = deoxyribose)    Deoxythymidylic acid OR deoxythymidine monophosphate
Uracil (in RNA)   Uridine (in RNA) (sugar = ribose)       Uridylic acid OR uridine monophosphate

1.3.2 Linkage between Nucleotides

The nucleotides are joined by 5′–3′ phosphodiester linkage; that is, the 5′-phosphate of a nucleotide is linked to the 3′-OH of the preceding nucleotide by a phosphodiester linkage. In a linear DNA molecule, the 5′-end has a free phosphate and the 3′-end has a free OH group (Figure 1.1A). Each phosphodiester bond has two sides: a 3′-side that is linked to the 3′-end of the preceding nucleotide, and a 5′-side that is linked to 5′-end of the following nucleotide. The 3′-side is called the A side by convention and its cleavage generates a 5′-PO₄ product. The 5′-side is called the B side by convention and its cleavage generates a 3′-PO₄ product (Figure 1.1C).

1.3.3 Base-Pairing Rules, Double Helix, and Triple Helix

In the double-stranded DNA, A pairs with T by two hydrogen bonds and G pairs with C by three hydrogen bonds (Figures 1.1A and 1.1B); thus GC-rich regions of DNA have more hydrogen bonds and consequently are more resistant to thermal denaturation. Each nucleotide pair (A·T and G·C) has a molecular weight of approximately 660 Da (sodium salt; 610 without sodium). In the helical double-stranded DNA molecule, the sugar–phosphate backbone lies outside and the bases are inside. Base pairs are stacked and horizontal; hence they are perpendicular to the axis of DNA. Because of the stacked nature of the base pairs in DNA, spatially flat molecules can intercalate between them. Of the four bases, A and G are purines whereas T and C are pyrimidines. In double-stranded DNA, a purine pairs with a pyrimidine (A with T and G with C). Therefore, total amount of purine should equal total amount of pyrimidine; in other words, the purine/pyrimidine ratio should be 1.0 or close to 1.0. This purine–pyrimidine equivalence in double-stranded DNA is called Chargaff’s rule.

In the bases, the side with the N1 position of the heterocyclic ring is the “front,” also called the Watson–Crick edge (Figure 1.1D); the opposite side is the “back,” also called the Hoogsteen edge. Purines have an imidazole ring, which forms the “back”; so in purines, the N7 position of the imidazole ring is part of the Hoogsteen edge (Figure 1.1D). The Hoogsteen edge of the bases is located towards the edge (outside) of the DNA double helix, whereas the Watson–Crick edge is internal. In normal base pairing in DNA and RNA (Watson–Crick base pairing), the Watson–Crick edge (i.e. the front) of the two complementary bases is involved. However, the Hoogsteen edge provides an additional hydrogen bonding site. Therefore, the A·T and G·C base pairs in the normal double helix can form additional hydrogen bonds (Hoogsteen hydrogen bonds) to give rise to a triple helix involving the Hoogsteen edge of the purines, i.e. N7 of A and G for the third strand (Figure 1.1E). Hoogsteen hydrogen bonds can also form in RNA. In nucleic acids, the presence of a stretch of homopurine allows a stretch of homopyrimidine to hybridize through Hoogsteen hydrogen bonding to form a section of DNA triple helix. The homopyrimidine-containing third strand is oriented parallel to the oligopurine strand (Figure 1.1E), whereas the homopurine-containing third strand is oriented antiparallel to the oligopurine strand (see Box 1.2).³⁻⁵

For bases, two conformational variations are possible. The bond joining the 1′-carbon of the deoxyribose sugar to the base is the N-glycosidic bond. Rotation about this base-to-sugar glycosidic bond gives rise to syn and anti conformations. The anti conformation is the most common one (Figure 1.1F); however, the syn conformation can trigger the formation of triple helix (Figure 1.1E) and also play a role in transversion mutation (see Molecular basis of mutation, Section 2.3.1 in Chapter 2).

BOX 1.2

1. Each phosphate has three replaceable H⁺; phosphodiester-bond formation between two nucleotides leaves one replaceable H⁺. These replaceable H⁺ make the DNA (and RNA) acidic (Figures 1.1 and 1.3).
2. The intercalation property of spatially flat molecules is utilized to visualize DNA (and RNA) in a gel using flat aromatic molecules that fluoresce under UV, such as ethidium bromide and acridine orange. The intercalation of these molecules can also cause frameshift mutation during DNA replication.
3. The purine–pyrimidine equivalence can be utilized to determine if a DNA molecule from an unknown source is double stranded or single stranded. In a double-stranded DNA molecule, the purine/pyrimidine ratio should be 1.0 (or close to 1.0); in contrast, in a single-stranded DNA molecule this equivalence is lacking.
4. The differential thermal stability of AT-rich versus GC-rich regions in double-stranded nucleic acids is taken into consideration while designing oligonucleotides for hybridization for different

1.3.4 Single-Stranded DNA

Many DNA viruses have single-stranded DNA (for example, ϕX-174, parvoviruses). RNA viruses have RNA as the genetic material, and the RNA genome can be single or double stranded. Single-stranded DNA does not have base equivalence and hence does not follow Chargaff’s base equivalence rule.
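As an illustration of the purine–pyrimidine equivalence (Chargaff’s rule) and of the complementary, antiparallel nature of the two strands, the following is a minimal Python sketch; it is not part of the original text. The function names (reverse_complement, purine_pyrimidine_ratio, looks_double_stranded) and the 5% tolerance are assumptions chosen only for illustration, and any real threshold would have to be justified for the data at hand. The example sequence is the 5′-ATGTAAGCAC-3′ oligonucleotide used in Box 1.1.

```python
from collections import Counter

# Watson-Crick partners: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5'->3' by convention.

    The strands are antiparallel, so the complement is reversed.
    """
    return seq.upper().translate(COMPLEMENT)[::-1]


def purine_pyrimidine_ratio(seq: str) -> float:
    """Return (A + G) / (C + T); characters other than A, C, G, T are ignored."""
    counts = Counter(seq.upper())
    purines = counts["A"] + counts["G"]
    pyrimidines = counts["C"] + counts["T"]
    if pyrimidines == 0:
        raise ValueError("sequence contains no pyrimidines")
    return purines / pyrimidines


def looks_double_stranded(bases: str, tolerance: float = 0.05) -> bool:
    """Heuristic check of Chargaff's rule on the pooled base composition.

    The tolerance is an arbitrary illustrative value, not a published cutoff.
    """
    return abs(purine_pyrimidine_ratio(bases) - 1.0) <= tolerance


if __name__ == "__main__":
    top = "ATGTAAGCAC"                 # example sequence from Box 1.1
    bottom = reverse_complement(top)   # "GTGCTTACAT"
    duplex = top + bottom              # pooled bases of both strands

    print(purine_pyrimidine_ratio(top))     # 1.5: a single strand need not obey the rule
    print(purine_pyrimidine_ratio(duplex))  # 1.0: every purine is paired with a pyrimidine
    print(looks_double_stranded(duplex), looks_double_stranded(top))  # True False
```

Pooling a strand with its reverse complement always gives a ratio of exactly 1.0, because each purine contributes one pyrimidine on the opposite strand. A single strand can still give a ratio close to 1.0 by chance, so a ratio near 1.0 is consistent with a double-stranded molecule but does not prove it.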