Using R at the Bench
Step-by-Step Data Analytics for Biologists
OTHER TITLES FROM COLD SPRING HARBOR LABORATORY PRESS
At the Bench: A Laboratory Navigator, Updated Edition
At the Helm: Leading Your Laboratory, Second Edition
Experimental Design for Biologists, Second Edition
Lab Math: A Handbook of Measurements, Calculations, and Other Quantitative Skills for Use at the Bench, Second Edition
Next-Generation DNA Sequencing Informatics, Second Edition
Statistics at the Bench: A Step-by-Step Handbook for Biologists
Using R at the Bench
Step-by-Step Data Analytics for Biologists
M. Bremer
Department of Mathematics and Statistics
San Jose State University
R.W. Doerge
Department of Statistics
Department of Agronomy
Purdue University
All rights reserved
© 2015 by Cold Spring Harbor Laboratory Press
Printed in China
Publisher and Acquisition Editor John Inglis
Director of Editorial Services Jan Argentine
Project Manager Inez Sialiano
Director Publication Services Linda Sussman
Assistant Production Editor Maria Ebbets
Production Manager Denise Weiss
Cover Designer Jim Duffy/Denise Weiss
Front cover illustration was created by Jim Duffy.
Library of Congress Cataloging-in-Publication Data
Bremer, M. (Martina)
Using R at the bench : step-by-step data analytics for biologists / M. Bremer, Department of Mathematics and
Statistics, San Jose State University, R.W. Doerge, Department of Statistics, Department of Agronomy, Purdue
University.
pages cm
Includes bibliographical references and index.
ISBN 978-1-62182-112-0 (hardcover)
1. Bioinformatics. 2. Biology–Data processing. 3. R (Computer program language) I. Doerge, R. W. (Rebecca W.)
II. Title.
QH324.2.B74 2015
570.285–dc23
2015018960
10 9 8 7 6 5 4 3 2 1
All World Wide Web addresses are accurate to the best of our knowledge at the time of printing.
Authorization to photocopy items for internal or personal use, or the internal or personal use of specific clients, is
granted by Cold Spring Harbor Laboratory Press, provided that the appropriate fee is paid directly to the Copyright
Clearance Center (CCC). Write or call CCC at 222 Rosewood Drive, Danvers, MA 01923 (978-750-8400) for
information about fees and regulations. Prior to photocopying items for educational classroom use, contact CCC at the
above address. Additional information on CCC can be obtained at CCC Online at www.copyright.com.
For a complete catalog of all Cold Spring Harbor Laboratory Press publications, visit our website at www.cshlpress.org.
Contents
Acknowledgments
1 Introduction
2 Common Pitfalls
2.1 Examples of Common Mistakes
2.2 Defining Your Question
2.3 Working with and Talking to a Statistician
2.4 Exploratory versus Inferential Statistics
2.5 Different Sources of Variation
2.6 The Importance of Checking Assumptions and the Ramifications of Ignoring the Obvious
2.7 Statistical Software Packages
2.8 Installing and Using R and R Commander
2.8.1 Loading Data
2.8.2 Variable Types
2.8.3 Handling Graphics
2.8.4 Saving Your Work
2.8.5 Getting Help
3 Descriptive Statistics
3.1 Definitions
3.2 Numerical Ways to Describe Data
3.2.1 Categorical Data
3.2.2 Quantitative Data
3.2.3 Determining Outliers
3.2.4 How to Choose a Descriptive Measure
3.3 Graphical Methods to Display Data
3.3.1 How to Choose the Appropriate Graphical Display for Your Data
3.4 Probability Distributions
3.4.1 The Binomial Distribution
3.4.2 The Normal Distribution
3.4.3 Assessing Normality in Your Data
3.4.4 Data Transformations
3.5 The Central Limit Theorem
3.5.1 The Central Limit Theorem for Sample Proportions
3.5.2 The Central Limit Theorem for Sample Means
3.6 Standard Deviation versus Standard Error
3.7 Error Bars
3.8 Correlation
3.8.1 Correlation and Causation
4 Design of Experiments
4.1 Mathematical and Statistical Models
4.1.1 Biological Models
4.2 Describing Relationships between Variables
4.3 Choosing a Sample
4.3.1 Problems in Sampling: Bias
4.3.2 Problems in Sampling: Accuracy and Precision
4.4 Choosing a Model
4.5 Sample Size
4.6 Resampling and Replication
5 Confidence Intervals
5.1 Interpretation of Confidence Intervals
5.1.1 Confidence Levels
5.1.2 Precision
5.2 Computing Confidence Intervals
5.2.1 Confidence Intervals for Large Sample Mean
5.2.2 Confidence Interval for Small Sample Mean
5.2.3 Confidence Interval for Population Proportion
5.3 Sample Size Calculations
6 Hypothesis Testing
6.1 The Basic Principle
6.1.1 p-values
6.1.2 Errors in Hypothesis Testing
6.1.3 Power of a Test
6.1.4 Interpreting Statistical Significance
6.2 Common Hypothesis Tests
6.2.1 t-test
6.2.2 z-test
6.2.3 F-test
6.2.4 Tukey’s Test and Scheffé’s Test
6.2.5 χ²-test: Goodness-of-Fit or Test of Independence
6.2.6 Likelihood Ratio Test
6.3 Non-parametric Tests
6.3.1 Wilcoxon-Mann-Whitney Rank Sum Test
6.3.2 Fisher’s Exact Test
6.3.3 Permutation Tests
6.4 E-values
7 Regression and ANOVA
7.1 Regression
7.1.1 Correlation and Regression
7.1.2 Parameter Estimation
7.1.3 Hypothesis Testing
7.1.4 Logistic Regression
7.1.5 Multiple Linear Regression
7.1.6 Model Building in Regression: Which Variables to Use?
7.1.7 Verification of Assumptions
7.1.8 Outliers in Regression
7.1.9 A Case Study
7.2 ANOVA
7.2.1 One-Way ANOVA Model
7.2.2 Two-Way ANOVA Model
7.2.3 ANOVA Assumptions
7.2.4 ANOVA Model for Microarray Data
7.3 What ANOVA and Regression Models Have in Common
8 Special Topics
8.1 Classification
8.2 Clustering
8.2.1 Hierarchical Clustering
8.2.2 Partitional Clustering
8.3 Principal Component Analysis
8.4 Microarray Data Analysis
8.4.1 The Data
8.4.2 Normalization
8.4.3 Statistical Analysis
8.4.4 The ANOVA Model
8.4.5 Variance Assumptions
8.4.6 Multiple Testing Issues
8.5 Next-Generation Sequencing Analysis
8.5.1 Experimental Overview
8.5.2 Statistical Issues in Next-Generation Sequencing Experiments
8.6 Maximum Likelihood
8.7 Frequentist and Bayesian Statistics
References
Index
Index of Worked Out Examples
Index of R Commander Commands
Acknowledgments
We would like to thank the Department of Statistics at Purdue University and ADG for
initiating the serendipitous circumstances that brought us together. We are grateful to our
families and friends for their endless support. We thank Bingrou (Alice) Zhou for
assistance with an early version of this Manual.
MARTINA BREMER
REBECCA W. DOERGE
Description: Using R at the Bench: Step-by-Step Data Analytics for Biologists is a convenient bench-side handbook for biologists, designed as a handy reference guide for elementary and intermediate statistical analyses using the free/public software package known as "R." The expectations for biologists to have a