Table Of Content

BIOINFORMATICS FOR BEGINNERS BIOINFORMATICS FOR BEGINNERS Genes, Genomes, Molecular Evolution, Databases and Analytical Tools Supratim Choudhuri With contribution from Dr Michael Kotewicz on the Optical Mapping of DNA CenterforFoodSafetyandAppliedNutrition,FDA, CollegePark,Maryland AMSTERDAM•BOSTON•HEIDELBERG•LONDON NEWYORK•OXFORD•PARIS•SANDIEGO SANFRANCISCO•SINGAPORE•SYDNEY•TOKYO AcademicPressisanimprintofElsevier AcademicPressisanimprintofElsevier 32JamestownRoad,LondonNW17BY,UK 225WymanStreet,Waltham,MA02451,USA 525BStreet,Suite1800,SanDiego,CA92101-4495,USA 2014PublishedbyElsevierInc. ThebookwaspreparedbyU.S.governmentemployeesinconnectionwiththeirofficialduties,andtherefore copyrightprotectionisnotavailableintheUnitedStatespursuantto17U.S.C.Section105. Nopartofthispublicationmaybereproduced,storedinaretrievalsystemortransmittedinanyformorby anymeanselectronic,mechanical,photocopying,recordingorotherwisewithoutthepriorwrittenpermission ofthepublisher PermissionsmaybesoughtdirectlyfromElsevier’sScience&TechnologyRightsDepartmentinOxford, UK:phone(144)(0)1865843830;fax(144)(0)1865853333;email:[email protected], visittheScienceandTechnologyBookswebsiteatwww.elsevierdirect.com/rightsforfurtherinformation Notice Thepublisherandtheauthormakenorepresentationsorwarrantieswithrespecttotheaccuracyand completenessofthecontentsofthiswork.Noresponsibilityisassumedbythepublisherandtheauthorfor anyinjuryand/ordamagetopersonsorpropertyasamatterofproductsliability,negligenceorotherwise,or fromanyuseoroperationofanymethods,products,instructionsorideascontainedinthematerialherein. BritishLibraryCataloguing-in-PublicationData AcataloguerecordforthisbookisavailablefromtheBritishLibrary LibraryofCongressCataloging-in-PublicationData AcatalogrecordforthisbookisavailablefromtheLibraryofCongress ISBN:978-0-12-410471-6 ForinformationonallAcademicPresspublications visitourwebsiteatelsevierdirect.com 14 15 16 17 18 10 9 8 7 6 5 4 3 2 1 To my Family Preface As the title of the book suggests, this book is indeed Because this book is about bioinformatic analysis for “beginners.” It is not intended for advanced stu- using web-based databases and tools, the emphasis is dents of bioinformatics or practicing bioinformaticians. on sequence analysis. Global gene-expression profiling This book has been written from the perspective of an hasnotbeenemphasizedotherthanashortdiscussion. end-user who wants to use the freely available web- The makers of gene-expression analysis platforms pro- based databases and tools for bioinformatic analysis. vide necessary software for analysis. Lastly, it is not The audience of this book could include any scientist possible to show every type of analysis in a book with or student who has a background in basic molecular a defined word count; nor is it possible to discuss all biology but has not used web-based databases and thelinksandallthefunctionsassociatedwithadatabase tools for sequence analysis, or has not done bioinfor- or analysis. Therefore, this book should serve as an matic analysis on a regular basis. The total number of initial guide, and it is expected that the reader will chapters is only nine. This is because related sections take it upon himself/herself to explore further using have been combined into one chapter for coherence and the databases and tools. Terms such as program, tool, understanding. These sections could have been easily algorithm, and web server have been used interchange- split into separate stand-alone chapters to increase the ably throughout the book. These terms essentially mean numberofchapters. thesamethinginthecontextofthisbook.However,the More than a decade into the first human genome term web server could be used to mean both the hard- sequencing, the use of bioinformatic analysis has been wareandthesoftware. steadily increasing. There are more web-based freely Because the principal audience of the book is available databases and analytical tools than ever supposed to be non-specialists, it was felt necessary to before. Modern biology has pervaded even the social introduce the science and some core concepts of geno- sciences. For example, sociologists and psychologists mics as well as some important genomic techniques are now probing how the epigenomic effects of envi- before embarking on the bioinformatic analysis. By the ronmental factors (including social factors) might same token, some fundamental aspects of molecular shape the personality and behavior of the offspring evolution have been discussedin this book because the postnatally. The National Center for Biotechnology goal of many applications of bioinformatics is to trace Information has established an epigenomics database, the signatures of molecular evolution, as well as study which will be immensely useful to scientists in the near the relatedness of taxa. In order to minimize the num- future.Thus,bioinformaticshasbeenslowlybutsteadily ber of references in the text, reviews are cited wher- pervading all branches of biology and beyond. In keep- ever possible. ing with this, more and more bioinformatics books are being written for experts, which do not necessarily cater Supratim Choudhuri totheneedsofthenon-experts. ix Acknowledgment Theauthorwouldliketoacknowledgetheinvaluable tools and databases and for making them freely contributions of all scientists and engineers who available to facilitate bioinformatic analysis and developed databases and online tools for analysis, and learning. made them freely available. The author would also The author would like to thank Dr Steve Gendel for like to acknowledge the contributions of the groups/ hiscarefulreadingoftheallergenicitypredictionsection institutions/organizations for hosting and maintaining inChapter8,andprovidinghelpfulsuggestions. these resources on web servers. A number of links The author would also like to thank manycolleagues for freely available databases and web-based tools for for their encouragement, enthusiasm, and support for analysis have been provided throughout the book. theproject. Wherever possible, the latest relevant publications Last but not the least, the author is grateful to (which usually include the previous publications as MrGrahamNisbetandMsCatherineMullaneofElsevier well) describing these resources have been cited to for making this project a reality, helping to bring it to acknowledge the contribution. The scientific com- successfulcompletion,andbeingavailablewheneverhelp munity is truly grateful to the developers of these andadvicewereneeded. xi C H A P T E R 1 (cid:1) Fundamentals of Genes and Genomes O U T L I N E 1.1 Biological Macromolecules, Genomics, 1.9.1 Configuration and Chirality of Amino Acids 15 and Bioinformatics 2 1.9.2 Ionic Character of AminoAcids 16 1.9.3 Relationship between Protein Function 1.2 DNAasthe Universal GeneticMaterial 2 and the Location of Amino Acids in 1.3 DNADouble Helix 2 the PolypeptideChain 16 1.3.1 StructuralUnitsof DNA 2 1.9.4 Linkage between Amino Acids—The 1.3.2 Linkage between Nucleotides 3 Peptide Bond 17 1.3.3 Base-Pairing Rules, Double Helix, 1.9.5 Four Levels of ProteinStructure 17 and Triple Helix 4 1.9.6 Acidic and Basic Proteins 17 1.3.4 Single-Stranded DNA 4 1.9.7 NonstandardAminoAcids in 1.3.5 Base Sequenceand the Genetic Code 5 PolypeptideChains 18 1.4 Conformationsof DNA 5 1.10 Genome Structureand Organization 18 1.10.1 TheStructure of aRepresentative 1.5 Typical Eukaryotic Gene Structure 5 Genome—The Human Genome 19 1.5.1 Transcribed Region 7 1.10.2 Functional SequenceElementsinthe 1.5.1.1 Intron-SplicingSignals 7 Genome 21 1.5.1.2 EffectofIntronPhaseon 1.10.2.1 Promoters 21 AlternativeSplicing 9 1.10.2.2 Enhancers 21 1.5.1.3 EvolutionofIntrons 10 1.10.2.3 LocusControlRegions 21 1.5.2 50-FlankingRegion of TranscribedGenes 11 1.10.2.4 Insulators 22 1.5.3 30-FlankingRegion of TranscribedGenes 11 1.10.3 Epigenetic Modifications of the GenomeCan 1.6 Mutations in theDNA Sequence 12 Editthe Language Written inthe DNA Sequence and Add an Extra Layer of 1.7 Some Features ofRNA 12 Complexityin Genome Expression 22 1.7.1 Instability of mRNA 12 1.10.3.1 HistoneCode 23 1.7.2 50- and 30-Untranslated Regions of mRNA 12 1.10.3.2 TheDynamicsofEpigenetic 1.7.3 Secondary Structures in RNA 13 Changes 24 1.8 CodingVersus Noncoding RNA 14 1.10.4 Lessons Learned from the Second Phase 1.8.1 Small Noncoding RNA,Long Noncoding of the ENCODEProject about the DNA RNA, Competing Endogenous RNA, Elements inthe Human Genome and and Circular RNA 14 itsEpigenetic Modifications 24 1.9 Protein Structure andFunction 15 References 25 (cid:1)Theopinionsexpressedinthischapteraretheauthor’sownandtheydonotnecessarilyreflecttheopinionsoftheFDA,theDHHS, ortheFederalGovernment. 1 BioinformaticsforBeginners. 2014PublishedbyElsevierInc. 2 1. FUNDAMENTALSOFGENESANDGENOMES 1.1 BIOLOGICAL MACROMOLECULES, copies the single-stranded RNA genome into a single- GENOMICS, AND BIOINFORMATICS strandedDNA,whichthenproducesadouble-stranded viral DNA genome. The double-stranded viral DNA Genetic information is stored in the cell in the form genomeisreferredtoastheprovirus,whichgetsincor- of biological macromolecules, such as nucleic acids porated into the host genome from where it keeps pro- and proteins. The genetic information not only drives ducing more retrovirus particles with single-stranded the functioning of the whole organism, but also drives RNAgenomes. the evolutionary engine. Thus, an understanding of the molecular basis of life is fundamental to understanding how genetic information shapes life and drives its 1.3 DNA DOUBLE HELIX evolution. The following discussion captures some fundamental aspects of the structure and function of The structure of the DNA double helix and its genesandgenomeswithspecialnotes(inboxes)onthe building blocks are described in all biology textbooks. applicationsofthisinformation. Here,someotheraspectsarealsohighlighted,including the information in Box 1.1. DNA is ado uble-stranded right-handed helix; the two strands are complementary 1.2 DNA AS THE UNIVERSAL becauseofcomplementarybasepairing,andantiparallel GENETIC MATERIAL because the two strands have opposite 50230 orientation (Figure1.1A).ThediameterofthehelicalDNAmolecule is 20A˚ (52nm). The helical conformation of DNA With some exceptions, deoxyribonucleic acid (DNA) creates the alternate major groove and minor groove istheuniversalgeneticmaterial.Insomeviruses,termed (Figure1.1B). RNA viruses, RNA is the genetic material. The term ribovirus is used for viruses with single- and double- stranded RNA genomes, including retroviruses, which 1.3.1 Structural Units of DNA areRNA-basedforaportionoftheirlifecycle.1 AmongtheRNAviruses,retrovirusesarewellknown; DNA is composed of structural units called nucleo- they include the notorious AIDS virus. Retroviruses tides (deoxyribonucleotides). Each nucleotide is com- are unique because in their life cycle they have both posed of a pentose sugar (20-deoxy-D-ribose); one of RNA and DNA versions of their genome. A complete the four nitrogenous bases—adenine (A), thymine (T), retroviruscontainsanRNAgenome.TheRNAgenome guanine(G),orcytosine(C);andaphosphate.Thepentose encodes some protein products that are necessary for sugar has five carbon atoms and they are numbered 10 converting the single-stranded RNA genome into a (1-prime)through50 (5-prime).Thebaseisattachedtothe double-stranded DNAgenomeandthenitssubsequent 10carbonatomofthesugar,andthephosphateisattached integration into the host genome. One such protein to the 50 carbon atom (Figure 1.1A). The sugar and base product of the retroviral genome is the reverse form a nucleoside, whereas nucleoside plus phosphate transcriptase(RT)enzyme.Uponentryintothecell,the makes a nucleotide. Hence, nucleoside5sugar1base, reverse transcriptase is produced from the viral RNA whereasnucleotide5sugar1base1phosphate.Table1.1 genomeusingthehostcellularmachinery.TheRTthen shows the naming of nucleosides and nucleotides. BOX1.1 1. ThemajorgroovesinDNAcanbindproteins.This nature.Thecomplementaryandantiparallelnature isanimportantpropertyofDNAstructurebecause ofdouble-strandednucleicacidsisanimportant themajorgroovesintheupstreamregulatoryregions propertytorememberwhiledesigning ofagenebindtranscription-regulatoryproteins. syntheticoligonucleotidesforhybridization Forexample,forZn-fingertranscriptionfactors, (probesorprimers). eachZnfingerrecognizesandbindstoaspecific 3. Byconvention,nucleicacid(DNAorRNA) trinucleotidesequenceinthemajorgrooveofDNA.2 sequenceiswritten50-30 fromlefttoright,suchas 2. Anydouble-strandednucleicacid(whetherDNA 50-ATGTAAGCAC-30.Ifthe50-30 designationisnot doublestrand,DNA(cid:3)RNAhybriddoublestrand, mentioned,itisassumedthatthesequencehasbeen orRNA(cid:3)RNAdoublestrand)isantiparallelin writtenina50-30 direction,followingconvention. BIOINFORMATICSFORBEGINNERS 3 1.3. DNADOUBLEHELIX FIGURE 1.1 DNAstructure.(A)TwonucleotidesoftheDNAdoublehelix,showingtheirantiparallelorientation,twoH-bondsbetween A andT andthree H-bondsbetweenG and C; (B) theDNAdoublehelixshowingthe majorandminor groovesas wellasthe diameterof the molecule; (C) the convention of classifying the two sides of the phosphodiester bond and the products generated from their cleavage; (D)thefrontside(Watson(cid:3)Crickedge)andthebackside(Hoogsteenedge)ofapurine;(E)howHoogsteenH-bondingaidsintheformation ofthetriplehelix(seeSection1.3.3);(F)theantiandthesynconformationsofbasesaroundtheN-glycosidicbond. Each nucleotide in DNA (as well as in RNA) has one TABLE1.1 NamingofNucleosidesandNucleotides replaceable hydrogen, which is what makes the DNA Nucleoside Nucleotide (andRNA)acidic. Base (base1sugar) (base1sugar1phosphate) Adenine Deoxyadenosine DeoxyadenylicacidOR 1.3.2 Linkage between Nucleotides (sugar5deoxyribose) deoxyadenosinemonophosphate Guanine Deoxyguanosine DeoxyguanylicacidOR The nucleotides are joined by 50(cid:3)30 phosphodiester (sugar5deoxyribose) deoxyguanosinemonophosphate linkage; that is, the 50-phosphate of a nucleotide is linked to the 30-OH of the preceding nucleotide by a Cytosine Deoxycytidine DeoxycytidylicacidOR (sugar5deoxyribose) deoxycytidinemonophosphate phosphodiester linkage. In a linear DNA molecule, the 50-end has a free phosphate and the 30-end has a free Thymine Deoxythymidine DeoxythymidylicacidOR (sugar5deoxyribose) deoxythymidinemonophosphate OH group (Figure 1.1A). Each phosphodiester bond has two sides: a 30-side that is linked to the 30-end of Uracil Uridine(inRNA) UridylicacidORuridine the preceding nucleotide, and a 50-side that is linked to (inRNA) (sugar5ribose) monophosphate 50-end of the following nucleotide. The 30-side is called BIOINFORMATICSFORBEGINNERS

Bioinformatics for Beginners: Genes, Genomes, Molecular Evolution, Databases and Analytical Tools PDF

226 Pages·2014·9.28 MB·English

by Choudhuri

Checking for file health...

Save to my drive

Quick download

Download

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Bioinformatics for Beginners: Genes, Genomes, Molecular Evolution, Databases and Analytical Tools

See more

The list of books you might like

Upgrade Premium

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.