From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science

Norm Matloff, University of California, Davis

[Cover figure: surface plot of a density over axes x1, x2 and z, annotated with the R fragments library(MASS) and x <- mvrnorm(mu,sgm) and the formula $f(t) = c e^{-0.5 (t-\mu)' \Sigma^{-1} (t-\mu)}$]

See Creative Commons license at http://heather.cs.ucdavis.edu/~matloff/probstatbook.html

The author has striven to minimize the number of errors, but no guarantee is made as to accuracy of the contents of this book.

Author’s Biographical Sketch

Dr. Norm Matloff is a professor of computer science at the University of California at Davis, and was formerly a professor of statistics at that university. He is a former database software developer in Silicon Valley, and has been a statistical consultant for firms such as the Kaiser Permanente Health Plan.

Dr. Matloff was born in Los Angeles, and grew up in East Los Angeles and the San Gabriel Valley. He has a PhD in pure mathematics from UCLA, specializing in probability theory and statistics. He has published numerous papers in computer science and statistics, with current research interests in parallel processing, statistical computing, and regression methodology.

Prof. Matloff is a former appointed member of IFIP Working Group 11.3, an international committee concerned with database software security, established under UNESCO. He was a founding member of the UC Davis Department of Statistics, and participated in the formation of the UCD Computer Science Department as well. He is a recipient of the campuswide Distinguished Teaching Award and Distinguished Public Service Award at UC Davis.

Dr. Matloff is the author of two published textbooks, and of a number of widely-used Web tutorials on computer topics, such as the Linux operating system and the Python programming language. He and Dr. Peter Salzman are authors of The Art of Debugging with GDB, DDD, and Eclipse. Prof. Matloff’s book on the R programming language, The Art of R Programming, was published in 2011.
His book, Parallel Computation for Data Science, will come out in 2014. He is also the author of several open-source textbooks, including From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science (http://heather.cs.ucdavis.edu/probstatbook), and Programming on Parallel Machines (http://heather.cs.ucdavis.edu/~matloff/ParProcBook.pdf).

Contents

1 Time Waste Versus Empowerment
2 Basic Probability Models
2.1 ALOHA Network Example
2.2 The Crucial Notion of a Repeatable Experiment
2.3 Our Definitions
2.4 “Mailing Tubes”
2.5 Basic Probability Computations: ALOHA Network Example
2.6 Bayes’ Rule
2.7 ALOHA in the Notebook Context
2.8 Solution Strategies
2.9 Example: Divisibility of Random Integers
2.10 Example: A Simple Board Game
2.11 Example: Bus Ridership
2.12 Simulation
2.12.1 Example: Rolling Dice
2.12.2 Improving the Code
2.12.3 Simulation of the ALOHA Example
2.12.4 Example: Bus Ridership, cont’d.
2.12.5 Back to the Board Game Example
2.12.6 How Long Should We Run the Simulation?
2.13 Combinatorics-Based Probability Computation
2.13.1 Which Is More Likely in Five Cards, One King or Two Hearts?
2.13.2 Example: Lottery Tickets
2.13.3 “Association Rules” in Data Mining
2.13.4 Multinomial Coefficients
2.13.5 Example: Probability of Getting Four Aces in a Bridge Hand
3 Discrete Random Variables
3.1 Random Variables
3.2 Discrete Random Variables
3.3 Independent Random Variables
3.4 Expected Value
3.4.1 Generality: Not Just for Discrete Random Variables
3.4.1.1 What Is It?
3.4.2 Definition
3.4.3 Computation and Properties of Expected Value
3.4.4 “Mailing Tubes”
3.4.5 Casinos, Insurance Companies and “Sum Users,” Compared to Others
3.5 Variance
3.5.1 Definition
3.5.2 Central Importance of the Concept of Variance
3.5.3 Intuition Regarding the Size of Var(X)
3.5.3.1 Chebychev’s Inequality
3.5.3.2 The Coefficient of Variation
3.6 A Useful Fact
3.7 Covariance
3.8 Indicator Random Variables, and Their Means and Variances
3.9 A Combinatorial Example
3.10 Expected Value, Etc. in the ALOHA Example
3.11 Distributions
3.11.1 Example: Toss Coin Until First Head
3.11.2 Example: Sum of Two Dice
3.11.3 Example: Watts-Strogatz Random Graph Model
3.12 Parametric Families of pmfs
3.12.1 Parametric Families of Functions
3.12.2 The Case of Importance to Us: Parametric Families of pmfs
3.12.3 The Geometric Family of Distributions
3.12.3.1 R Functions
3.12.3.2 Example: a Parking Space Problem
3.12.4 The Binomial Family of Distributions
3.12.4.1 R Functions
3.12.4.2 Example: Flipping Coins with Bonuses
3.12.4.3 Example: Analysis of Social Networks
3.12.5 The Negative Binomial Family of Distributions
3.12.5.1 Example: Backup Batteries
3.12.6 The Poisson Family of Distributions
3.12.6.1 R Functions
3.12.7 The Power Law Family of Distributions
3.13 Recognizing Some Parametric Distributions When You See Them
3.13.1 Example: a Coin Game
3.13.2 Example: Tossing a Set of Four Coins
3.13.3 Example: the ALOHA Example Again
3.14 Example: the Bus Ridership Problem Again
3.15 Multivariate Distributions
3.16 A Cautionary Tale
3.16.1 Trick Coins, Tricky Example
3.16.2 Intuition in Retrospect
3.16.3 Implications for Modeling
3.17 Why Not Just Do All Analysis by Simulation?
3.18 Proof of Chebychev’s Inequality
3.19 Reconciliation of Math and Intuition (optional section)
4 Introduction to Discrete Markov Chains
4.1 Example: Die Game
4.2 Long-Run State Probabilities
4.3 Example: 3-Heads-in-a-Row Game
4.4 Example: ALOHA
4.5 Example: Bus Ridership Problem
4.6 An Inventory Model
5 Continuous Probability Models
5.1 A Random Dart
5.2 Continuous Random Variables Are “Useful Unicorns”
5.3 But Equation (5.2) Presents a Problem
5.4 Density Functions
5.4.1 Motivation, Definition and Interpretation
5.4.2 Properties of Densities
5.4.3 A First Example
5.5 Famous Parametric Families of Continuous Distributions
5.5.1 The Uniform Distributions
5.5.1.1 Density and Properties
5.5.1.2 R Functions
5.5.1.3 Example: Modeling of Disk Performance
5.5.1.4 Example: Modeling of Denial-of-Service Attack
5.5.2 The Normal (Gaussian) Family of Continuous Distributions
5.5.2.1 Density and Properties
5.5.2.2 Example: Network Intrusion
5.5.2.3 Example: Class Enrollment Size
5.5.2.4 More on the Jill Example
5.5.2.5 Example: River Levels
5.5.2.6 The Central Limit Theorem
5.5.2.7 Example: Cumulative Roundoff Error
5.5.2.8 Example: Bug Counts
5.5.2.9 Example: Coin Tosses
5.5.2.10 Museum Demonstration
5.5.2.11 Optional topic: Formal Statement of the CLT
5.5.2.12 Importance in Modeling
5.5.3 The Chi-Squared Family of Distributions
5.5.3.1 Density and Properties
5.5.3.2 Example: Error in Pin Placement
5.5.3.3 Importance in Modeling
5.5.4 The Exponential Family of Distributions
5.5.4.1 Density and Properties
5.5.4.2 R Functions
5.5.4.3 Example: Refunds on Failed Components
5.5.4.4 Example: Garage Parking Fees
5.5.4.5 Connection to the Poisson Distribution Family
5.5.4.6 Importance in Modeling
5.5.5 The Gamma Family of Distributions
5.5.5.1 Density and Properties
5.5.5.2 Example: Network Buffer
5.5.5.3 Importance in Modeling
5.5.6 The Beta Family of Distributions
5.6 Choosing a Model
5.7 A General Method for Simulating a Random Variable
5.8 “Hybrid” Continuous/Discrete Distributions
6 Stop and Review: Probability Structures
7 Covariance and Random Vectors
7.1 Measuring Co-variation of Random Variables
7.1.1 Covariance
7.1.2 Example: Variance of Sum of Nonindependent Variables
7.1.3 Example: the Committee Example Again
7.1.4 Correlation
7.1.5 Example: a Catchup Game
7.2 Sets of Independent Random Variables
7.2.1 Properties
7.2.1.1 Expected Values Factor
7.2.1.2 Covariance Is 0
7.2.1.3 Variances Add
7.2.2 Examples Involving Sets of Independent Random Variables
7.2.2.1 Example: Dice
7.2.2.2 Example: Variance of a Product
7.2.2.3 Example: Ratio of Independent Geometric Random Variables
7.3 Matrix Formulations
7.3.1 Properties of Mean Vectors
7.3.2 Covariance Matrices
7.3.3 Example: (X,S) Dice Example Again
7.3.4 Example: Easy Sum Again
7.3.5 Example: Dice Game
7.3.6 Correlation Matrices
8 Multivariate PMFs and Densities
8.1 Multivariate Probability Mass Functions
8.2 Multivariate Densities
8.2.1 Motivation and Definition
8.2.2 Use of Multivariate Densities in Finding Probabilities and Expected Values
8.2.3 Example: a Triangular Distribution
8.2.4 Example: Train Rendezvous
8.3 More on Sets of Independent Random Variables
8.3.1 Probability Mass Functions and Densities Factor in the Independent Case
8.3.2 Convolution
8.3.3 Example: Ethernet
8.3.4 Example: Analysis of Seek Time
8.3.5 Example: Backup Battery
8.3.6 Example: Minima of Uniformly Distributed Random Variables
8.3.7 Example: Minima of Independent Exponentially Distributed Random Variables
8.3.8 Example: Computer Worm
8.3.9 Example: Electronic Components
8.3.10 Example: Ethernet Again
8.4 Example: Finding the Distribution of the Sum of Nonindependent Random Variables
8.5 Parametric Families of Multivariate Distributions
8.5.1 The Multinomial Family of Distributions
8.5.1.1 Probability Mass Function
8.5.1.2 Example: Component Lifetimes
8.5.1.3 Mean Vectors and Covariance Matrices in the Multinomial Family
8.5.1.4 Application: Text Mining
8.5.2 The Multivariate Normal Family of Distributions
8.5.2.1 Densities
8.5.2.2 Geometric Interpretation
8.5.2.3 Properties of Multivariate Normal Distributions
8.5.2.4 The Multivariate Central Limit Theorem
8.5.2.5 Example: Finishing the Loose Ends from the Dice Game
8.5.2.6 Application: Data Mining
9 Introduction to Continuous-Time Markov Chains
9.1 Memoryless Property of Exponential Distributions
9.1.1 Derivation and Intuition
9.1.2 Example: “Nonmemoryless” Light Bulbs
9.2 Continuous-Time Markov Chains
9.3 Holding-Time Distribution
9.3.1 The Notion of “Rates”
9.3.2 Stationary Distribution
9.3.2.1 Intuitive Derivation
9.3.2.2 Computation
9.3.3 Example: Machine Repair