Statistical Bioinformatics with R
Sunil K. Mathur
University of Mississippi
AMSTERDAM • BOSTON • HEIDELBERG • LONDON
NEW YORK • OXFORD • PARIS • SAN DIEGO
SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO
Academic Press is an imprint of Elsevier
30 Corporate Drive, Suite 400, Burlington, MA 01803, USA
525 B Street, Suite 1900, San Diego, California 92101-4495, USA
84 Theobald’s Road, London WC1X 8RR, UK
Copyright © 2010 Elsevier, Inc. All rights reserved.
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical,
including photocopying, recording, or any information storage and retrieval system, without permission in writing from the
publisher. Details on how to seek permission, further information about the Publisher’s permissions policies and our
arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found
at our website: www.elsevier.com/permissions.
This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may
be noted herein).
Notices
Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding,
changes in research methods, professional practices, or medical treatment may become necessary.
Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any
information, methods, compounds, or experiments described herein. In using such information or methods they should be
mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.
To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any
injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or
operation of any methods, products, instructions, or ideas contained in the material herein.
Library of Congress Cataloging-in-Publication Data
Mathur, Sunil K.
Statistical bioinformatics with R / Sunil K. Mathur.
p. cm.
Includes bibliographical references and index.
ISBN 978-0-12-375104-1 (alk. paper)
1. Bioinformatics–Statistical methods. 2. R (Computer program language) I. Title.
QH324.2.M378 2010
570.285’5133–dc22
2009050006
British Library Cataloguing-in-Publication Data
A catalogue record for this book is available from the British Library.
ISBN: 978-0-12-375104-1
For information on all Academic Press publications
visit our Web site at www.elsevierdirect.com
Printed in the United States of America
Contents
PREFACE
ACKNOWLEDGMENTS
CHAPTER 1 Introduction
1.1 Statistical Bioinformatics
1.2 Genetics
1.3 Chi-Square Test
1.4 The Cell and Its Function
1.5 DNA
1.6 DNA Replication and Rearrangements
1.7 Transcription and Translation
1.8 Genetic Code
1.9 Protein Synthesis
Exercise 1
Answer Choices for Questions 1 through 15
CHAPTER 2 Microarrays
2.1 Microarray Technology
2.2 Issues in Microarray
2.3 Microarray and Gene Expression and Its Uses
2.4 Proteomics
Exercise 2
CHAPTER 3 Probability and Statistical Theory
3.1 Theory of Probability
3.2 Mathematical or Classical Probability
3.3 Sets
3.3.1 Operations on Sets
3.3.2 Properties of Sets
3.4 Combinatorics
3.5 Laws of Probability
3.6 Random Variables
3.6.1 Discrete Random Variable
3.6.2 Continuous Random Variable
3.7 Measures of Characteristics of a Continuous Probability Distribution
3.8 Mathematical Expectation
3.8.1 Properties of Mathematical Expectation
3.9 Bivariate Random Variable
3.9.1 Joint Distribution
3.10 Regression
3.10.1 Linear Regression
3.10.2 The Method of Least Squares
3.11 Correlation
3.12 Law of Large Numbers and Central Limit Theorem
CHAPTER 4 Special Distributions, Properties, and Applications
4.1 Introduction
4.2 Discrete Probability Distributions
4.3 Bernoulli Distribution
4.4 Binomial Distribution
4.5 Poisson Distribution
4.5.1 Properties of Poisson Distribution
4.6 Negative Binomial Distribution
4.7 Geometric Distribution
4.7.1 Lack of Memory
4.8 Hypergeometric Distribution
4.9 Multinomial Distribution
4.10 Rectangular (or Uniform) Distribution
4.11 Normal Distribution
4.11.1 Some Important Properties of Normal Distribution and Normal Probability Curve
4.11.2 Normal Approximation to the Binomial
4.12 Gamma Distribution
4.12.1 Additive Property of Gamma Distribution
4.12.2 Limiting Distribution of Gamma Distribution
4.12.3 Waiting Time Model
4.13 The Exponential Distribution
4.13.1 Waiting Time Model
4.14 Beta Distribution
4.14.1 Some Results
4.15 Chi-Square Distribution
4.15.1 Additive Property of Chi-Square Distribution
4.15.2 Limiting Distribution of Chi-Square Distribution
CHAPTER 5 Statistical Inference and Applications
5.1 Introduction
5.2 Estimation
5.2.1 Consistency
5.2.2 Unbiasedness
5.2.3 Efficiency
5.2.4 Sufficiency
5.3 Methods of Estimation
5.4 Confidence Intervals
5.5 Sample Size
5.6 Testing of Hypotheses
5.6.1 Tests about a Population Mean
5.7 Optimal Test of Hypotheses
5.8 Likelihood Ratio Test
CHAPTER 6 Nonparametric Statistics
6.1 Chi-Square Goodness-of-Fit Test
6.2 Kolmogorov-Smirnov One-Sample Statistic
6.3 Sign Test
6.4 Wilcoxon Signed-Rank Test
6.5 Two-Sample Test
6.5.1 Wilcoxon Rank Sum Test
6.5.2 Mann-Whitney Test
6.6 The Scale Problem
6.6.1 Ansari-Bradley Test
6.6.2 Lepage Test
6.6.3 Kolmogorov-Smirnov Test
6.7 Gene Selection and Clustering of Time-Course or Dose-Response Gene Expression Profiles
6.7.1 Single Fractal Analysis
6.7.2 Order-Restricted Inference
CHAPTER 7 Bayesian Statistics
7.1 Bayesian Procedures
7.2 Empirical Bayes Methods
7.3 Gibbs Sampler
CHAPTER 8 Markov Chain Monte Carlo
8.1 The Markov Chain
8.2 Aperiodicity and Irreducibility
8.3 Reversible Markov Chains
8.4 MCMC Methods in Bioinformatics
CHAPTER 9 Analysis of Variance
9.1 One-Way ANOVA
9.2 Two-Way Classification of ANOVA
CHAPTER 10 The Design of Experiments
10.1 Introduction
10.2 Principles of the Design of Experiments
10.3 Completely Randomized Design
10.4 Randomized Block Design
10.5 Latin Square Design
10.6 Factorial Experiments
10.6.1 2^n-Factorial Experiment
10.7 Reference Designs and Loop Designs
CHAPTER 11 Multiple Testing of Hypotheses
11.1 Introduction
11.2 Type I Error and FDR
11.3 Multiple Testing Procedures
REFERENCES
INDEX
Preface
Bioinformatics is an emerging field in which statistical and computational
techniques are used extensively to analyze and interpret biological data obtained
from high-throughput genomic technologies. Genomic technologies allow us
to monitor thousands of biological processes going on inside living
organisms in one snapshot, and are rapidly growing as driving forces of research,
particularly in the genetics, biomedical, biotechnology, and pharmaceutical
industries.
The success of genome technologies and related techniques, however, heavily
depends on correct statistical analyses of genomic data. Through statistical
analyses and the graphical displays of genomic data, genomic experiments allow
biologists to assimilate and explore the data in a natural and intuitive manner.
The storage, retrieval, interpretation, and integration of large volumes of data
generated by genomic technologies demand increasing dependence on
sophisticated computer and statistical inference techniques. New statistical tools have
been developed to make inferences from the genomic data obtained through
genomic studies in a more meaningful way.
This textbook is of an interdisciplinary nature, and material presented here can
be covered in a one- or two-semester course. It is written to give a solid base in
statistics while emphasizing applications in genomics. It is my sincere attempt
to integrate different fields to understand the high-throughput biological data
and describe various statistical techniques to analyze data. In this textbook, new
methods based on Bayesian techniques, MCMC methods, likelihood functions,
design of experiments, and nonparametric methods, along with traditional
methods, are discussed. Insights into some useful software such as BAMarray,
ORIGEN, and SAM are provided.
Chapter 1 provides some basic knowledge in biology. Microarrays are a very
useful and powerful technique available now. Chapter 2 provides some
knowledge of microarray technology and a description of current problems in
using this technology. Foundations of probability and basic statistics,
assuming that the reader is not familiar with statistical and probabilistic concepts,
Description: Designed for a one- or two-semester senior undergraduate or graduate bioinformatics course, Statistical Bioinformatics takes a broad view of the subject: not just gene expression and sequence analysis, but a careful balance of statistical theory in the context of bioinformatics applications. The inc