Gerhard Bohm, Günter Zech

Introduction to Statistics and Data Analysis for Physicists

Verlag Deutsches Elektronen-Synchrotron

Prof. Dr. Gerhard Bohm
Deutsches Elektronen-Synchrotron
Platanenallee 6
D-15738 Zeuthen
e-mail: [email protected]

Univ.-Prof. Dr. Günter Zech
Universität Siegen
Fachbereich Physik
Walter-Flex-Str. 3
D-57068 Siegen
e-mail: [email protected]

Bibliographic information of the Deutsche Bibliothek
The Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at <http://dnb.ddb.de>.

Publisher and distribution: Deutsches Elektronen-Synchrotron in der Helmholtz-Gemeinschaft, Zentralbibliothek, D-22603 Hamburg
Telephone (040) 8998-3602, Fax (040) 8998-4440
Cover design: Christine Iezzi, Deutsches Elektronen-Synchrotron, Zeuthen
Printing: Print shop, Deutsches Elektronen-Synchrotron
Copyright: Gerhard Bohm, Günter Zech
ISBN 978-3-935702-41-6
DOI 10.3204/DESY-BOOK/statistics (e-book)
<http://www-library.desy.de/elbook.html>

This book may not be reproduced by printing, photocopying or other means, or processed, duplicated or distributed using electronic systems, without the written permission of the authors.
Downloading the e-book, in whole or in part, for personal use is permitted.

Preface

There are a large number of excellent statistics books. Nevertheless, we think that it is justified to complement them by another textbook with the focus on modern applications in nuclear and particle physics. To this end we have included a large number of related examples and figures in the text. We emphasize less the mathematical foundations but appeal to the intuition of the reader.

Data analysis in modern experiments is unthinkable without simulation techniques. We discuss in some detail how to apply Monte Carlo simulation to parameter estimation, deconvolution, and goodness-of-fit tests. We also sketch modern developments like artificial neural nets, bootstrap methods, boosted decision trees and support vector machines.

Likelihood is a central concept of statistical analysis and its foundation is the likelihood principle. We discuss this concept in more detail than is usually done in textbooks and base the treatment of inference problems as far as possible on the likelihood function only, as is common in the majority of the nuclear and particle physics community. In this way point and interval estimation, error propagation, combining results, and inference of discrete and continuous parameters are treated consistently. We apply Bayesian methods where the likelihood function is not sufficient to proceed to sensible results, for instance in handling systematic errors, in deconvolution problems and in some cases when nuisance parameters have to be eliminated, but we avoid improper prior densities. Goodness-of-fit and significance tests, where no likelihood function exists, are based on standard frequentist methods.

Our textbook is based on lecture notes from a course given to master's students in physics at the University of Siegen, Germany, a few years ago. The content has been considerably extended since then. A preliminary German version is published as an electronic book at the DESY library. The present book is addressed mainly to master's and Ph.D. students but also to physicists who are interested in an introduction to recent developments in statistical methods of data analysis in particle physics.

When reading the book, some parts can be skipped, especially in the first five chapters. Where necessary, back references are included.

We welcome comments, suggestions and indications of mistakes and typing errors. We are prepared to discuss or answer questions on specific statistical problems.

We acknowledge the technical support provided by DESY and the University of Siegen.

February 2010,
Gerhard Bohm, Günter Zech

Contents

1 Introduction: Probability and Statistics
1.1 The Purpose of Statistics
1.2 Event, Observation and Measurement
1.3 How to Define Probability?
1.4 Assignment of Probabilities to Events
1.5 Outline of this Book

2 Basic Probability Relations
2.1 Random Events and Variables
2.2 Probability Axioms and Theorems
2.2.1 Axioms
2.2.2 Conditional Probability, Independence, and Bayes’ Theorem

3 Probability Distributions and their Properties
3.1 Definition of Probability Distributions
3.1.1 Discrete Distributions
3.1.2 Continuous Distributions
3.1.3 Empirical Distributions
3.2 Expected Values
3.2.1 Definition and Properties of the Expected Value
3.2.2 Mean Value
3.2.3 Variance
3.2.4 Skewness
3.2.5 Kurtosis (Excess)
3.2.6 Discussion
3.2.7 Examples
3.3 Moments and Characteristic Functions
3.3.1 Moments
3.3.2 Characteristic Function
3.3.3 Examples
3.4 Transformation of Variables
3.4.1 Calculation of the Transformed Density
3.4.2 Determination of the Transformation Relating two Distributions
3.5 Multivariate Probability Densities
3.5.1 Probability Density of two Variables
3.5.2 Moments
3.5.3 Transformation of Variables
3.5.4 Reduction of the Number of Variables
3.5.5 Determination of the Transformation between two Distributions
3.5.6 Distributions of more than two Variables
3.5.7 Independent, Identically Distributed Variables
3.5.8 Angular Distributions
3.6 Some Important Distributions
3.6.1 The Binomial Distribution
3.6.2 The Multinomial Distribution
3.6.3 The Poisson Distribution
3.6.4 The Uniform Distribution
3.6.5 The Normal Distribution
3.6.6 The Exponential Distribution
3.6.7 The χ² Distribution
3.6.8 The Gamma Distribution
3.6.9 The Lorentz and the Cauchy Distributions
3.6.10 The Log-normal Distribution
3.6.11 Student’s t Distribution
3.6.12 The Extreme Value Distributions

4 Measurement Errors
4.1 General Considerations
4.1.1 Importance of Error Assignments
4.1.2 Verification of Assigned Errors
4.1.3 The Declaration of Errors
4.1.4 Definition of Measurement and its Error
4.2 Different Types of Measurement Uncertainty
4.2.1 Statistical Errors
4.2.2 Systematic Errors (G. Bohm)
4.2.3 Systematic Errors (G. Zech)
4.2.4 Controversial Examples
4.3 Linear Propagation of Errors
4.3.1 Error Propagation
4.3.2 Error of a Function of Several Measured Quantities
4.3.3 Averaging Uncorrelated Measurements
4.3.4 Averaging Correlated Measurements
4.3.5 Several Functions of Several Measured Quantities
4.3.6 Examples
4.4 Biased Measurements
4.5 Confidence Intervals

5 Monte Carlo Simulation
5.1 Introduction
5.2 Generation of Statistical Distributions
5.2.1 Computer Generated Pseudo Random Numbers
5.2.2 Generation of Distributions by Variable Transformation
5.2.3 Simple Rejection Sampling
5.2.4 Importance Sampling
5.2.5 Treatment of Additive Probability Densities
5.2.6 Weighting Events
5.2.7 Markov Chain Monte Carlo
5.3 Solution of Integrals
5.3.1 Simple Random Selection Method
5.3.2 Improved Selection Method
5.3.3 Weighting Method
5.3.4 Reduction to Expected Values
5.3.5 Stratified Sampling
5.4 General Remarks

6 Parameter Inference I
6.1 Introduction
6.2 Inference with Given Prior
6.2.1 Discrete Hypotheses
6.2.2 Continuous Parameters
6.3 Definition and Visualization of the Likelihood
6.4 The Likelihood Ratio
6.5 The Maximum Likelihood Method for Parameter Inference
6.5.1 The Recipe for a Single Parameter
6.5.2 Examples
6.5.3 Likelihood Inference for Several Parameters
6.5.4 Combining Measurements
6.5.5 Normally Distributed Variates and χ²
6.5.6 Likelihood of Histograms
6.5.7 Extended Likelihood
6.5.8 Complicated Likelihood Functions
6.5.9 Comparison of Observations with a Monte Carlo Simulation
6.5.10 Parameter Estimate of a Signal Contaminated by Background
6.6 Inclusion of Constraints
6.6.1 Introduction
6.6.2 Eliminating Redundant Parameters
6.6.3 Gaussian Approximation of Constraints
6.6.4 The Method of Lagrange Multipliers
6.6.5 Conclusion
6.7 Reduction of the Number of Variates
6.7.1 The Problem
6.7.2 Two Variables and a Single Linear Parameter
6.7.3 Generalization to Several Variables and Parameters
6.7.4 Non-linear Parameters
6.8 Method of Approximated Likelihood Estimator
6.9 Nuisance Parameters
6.9.1 Nuisance Parameters with Given Prior
6.9.2 Factorizing the Likelihood Function
6.9.3 Parameter Transformation, Restructuring
6.9.4 Profile Likelihood
6.9.5 Integrating out the Nuisance Parameter
6.9.6 Explicit Declaration of the Parameter Dependence
6.9.7 Advice

7 Parameter Inference II
7.1 Likelihood and Information
7.1.1 Sufficiency
7.1.2 The Conditionality Principle
7.1.3 The Likelihood Principle
7.1.4 Bias of Maximum Likelihood Results
7.1.5 Stopping Rules
7.2 Further Methods of Parameter Inference
7.2.1 The Moments Method
7.2.2 The Least Square Method
7.2.3 Linear Regression
7.3 Comparison of Estimation Methods

8 Interval Estimation
8.1 Introduction
8.2 Error Intervals
8.2.1 Parabolic Approximation
8.2.2 General Situation
8.3 Error Propagation
8.3.1 Averaging Measurements
8.3.2 Approximating the Likelihood Function
8.3.3 Incompatible Measurements
8.3.4 Error Propagation for a Scalar Function of a Single Parameter
8.3.5 Error Propagation for a Function of Several Parameters
8.4 One-sided Confidence Limits
8.4.1 General Case
8.4.2 Upper Poisson Limits, Simple Case
8.4.3 Poisson Limit for Data with Background
8.4.4 Unphysical Parameter Values
8.5 Summary

9 Deconvolution
9.1 Introduction
9.1.1 The Problem
9.1.2 Deconvolution by Matrix Inversion
9.1.3 The Transfer Matrix
9.1.4 Regularization Methods
9.2 Deconvolution of Histograms
9.2.1 Fitting the Bin Content
9.2.2 Iterative Deconvolution
9.2.3 Regularization of the Transfer Matrix
9.3 Binning-free Methods
9.3.1 Iterative Deconvolution
9.3.2 The Satellite Method
9.3.3 The Maximum Likelihood Method
9.4 Comparison of the Methods
9.5 Error Estimation for the Deconvoluted Distribution

10 Hypothesis Tests
10.1 Introduction
10.2 Some Definitions
10.2.1 Single and Composite Hypotheses
10.2.2 Test Statistic, Critical Region and Significance Level
10.2.3 Errors of the First and Second Kind, Power of a Test
10.2.4 P-Values
10.2.5 Consistency and Bias of Tests
10.3 Goodness-of-Fit Tests
10.3.1 General Remarks
10.3.2 P-Values
10.3.3 The χ² Test in Generalized Form
10.3.4 The Likelihood Ratio Test
10.3.5 The Kolmogorov–Smirnov Test
10.3.6 Tests of the Kolmogorov–Smirnov and Cramer–von Mises Families
10.3.7 Neyman’s Smooth Test
10.3.8 The L₂ Test
10.3.9 Comparing a Data Sample to a Monte Carlo Sample and the Metric
10.3.10 The k-Nearest Neighbor Test
10.3.11 The Energy Test