Oberwolfach Seminars
Volume 39

Mathias Drton
Bernd Sturmfels
Seth Sullivant

Lectures on Algebraic Statistics

Birkhäuser
Basel · Boston · Berlin

Mathias Drton
University of Chicago
Department of Statistics
5734 S. University Ave
Chicago, IL 60637
USA

Bernd Sturmfels
Department of Mathematics
University of California
925 Evans Hall
Berkeley, CA 94720
USA

Seth Sullivant
Department of Mathematics
North Carolina State University
Raleigh, NC 27695-8205
USA

2000 Mathematics Subject Classification: 62, 14, 13, 90, 68

Library of Congress Control Number: 2008939526

Bibliographic information published by Die Deutsche Bibliothek
Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet at <http://dnb.ddb.de>.

ISBN 978-3-7643-8904-8 Birkhäuser Verlag, Basel – Boston – Berlin

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in other ways, and storage in data banks. For any kind of use permission of the copyright owner must be obtained.

© 2009 Birkhäuser Verlag AG
Basel · Boston · Berlin
P.O. Box 133, CH-4010 Basel, Switzerland
Part of Springer Science+Business Media
Printed on acid-free paper produced from chlorine-free pulp.
Printed in Germany
e-ISBN 978-3-7643-8905-5
www.birkhauser.ch

Contents

Preface
1 Markov Bases
  1.1 Hypothesis Tests for Contingency Tables
  1.2 Markov Bases of Hierarchical Models
  1.3 The Many Bases of an Integer Lattice
2 Likelihood Inference
  2.1 Discrete and Gaussian Models
  2.2 Likelihood Equations for Implicit Models
  2.3 Likelihood Ratio Tests
3 Conditional Independence
  3.1 Conditional Independence Models
  3.2 Graphical Models
  3.3 Parametrizations of Graphical Models
4 Hidden Variables
  4.1 Secant Varieties in Statistics
  4.2 Factor Analysis
5 Bayesian Integrals
  5.1 Information Criteria and Asymptotics
  5.2 Exact Integration for Discrete Models
6 Exercises
  6.1 Markov Bases Fixing Subtable Sums
  6.2 Quasi-symmetry and Cycles
  6.3 A Colored Gaussian Graphical Model
  6.4 Instrumental Variables and Tangent Cones
  6.5 Fisher Information for Multivariate Normals
  6.6 The Intersection Axiom and Its Failure
  6.7 Primary Decomposition for CI Inference
  6.8 An Independence Model and Its Mixture
7 Open Problems
Bibliography

Preface

Algebraic statistics is concerned with the development of techniques in algebraic geometry, commutative algebra, and combinatorics, to address problems in statistics and its applications. On the one hand, algebra provides a powerful tool set for addressing statistical problems. On the other hand, it is rarely the case that algebraic techniques are ready-made to address statistical challenges, and usually new algebraic results need to be developed. This way the dialogue between algebra and statistics benefits both disciplines.

Algebraic statistics is a relatively new field that has developed and changed rather rapidly over the last fifteen years. One of the first pieces of work in this area was the paper of Diaconis and the second author [33], which introduced the notion of a Markov basis for log-linear statistical models and showed its connection to commutative algebra. From there, the algebra/statistics connection spread to a number of different areas including the design of experiments (highlighted in the monograph [74]), graphical models, phylogenetic invariants, parametric inference, algebraic tools for maximum likelihood estimation, and disclosure limitation, to name just a few. References to this literature are surveyed in the editorial [47] and the two review articles [4, 41] in a special issue of the journal Statistica Sinica. An area where there has been particularly strong activity is in applications to computational biology, which is highlighted in the book Algebraic Statistics for Computational Biology of Lior Pachter and the second author [73]. We will sometimes refer to that book as the “ASCB book.”

These lecture notes arose out of a five-day Oberwolfach Seminar, given at the Mathematisches Forschungsinstitut Oberwolfach (MFO), in Germany’s Black Forest, over the days May 12–16, 2008. The seminar lectures provided an introduction to some of the fundamental notions in algebraic statistics, as well as a snapshot of some of the current research directions. Given such a short time frame, we were forced to pick and choose topics to present, and many areas of active research in algebraic statistics have been left out. Still, we hope that these notes give an overview of some of the main ideas in the area and directions for future research.

The lecture notes are an expanded version of the thirteen lectures we gave throughout the week, with many more examples and background material than we could fit into our hour-long lectures.
The first five chapters cover the material in those thirteen lectures and roughly correspond to the five days of the workshop. Chapter 1 reviews statistical tests for contingency table analysis and explains the notion of a Markov basis for a log-linear model. We connect this notion to commutative algebra, and give some of the most important structural theorems about Markov bases. Chapter 2 is concerned with likelihood inference in algebraic statistical models. We introduce these models for discrete and normal random variables, explain how to solve the likelihood equations parametrically and implicitly, and show how model geometry connects to asymptotics of likelihood ratio statistics. Chapter 3 is an algebraic study of conditional independence structures. We introduce these generally, and then focus in on the special class of graphical models. Chapter 4 is an introduction to hidden variable models. From the algebraic point of view, these models often give rise to secant varieties. Finally, Chapter 5 concerns Bayesian integrals, both from an asymptotic large-sample perspective and from the standpoint of exact evaluation for small samples.

During our week in Oberwolfach, we held several student problem sessions to complement our lectures. We created eight problems highlighting material from the different lectures and assigned the students into groups to work on these problems. The exercises presented a range of computational and theoretical challenges. After daily and sometimes late-night problem solving sessions, the students wrote up solutions, which appear in Chapter 6. On the closing day of the workshop, we held an open problem session, where we and the participants presented open research problems related to algebraic statistics. These appear in Chapter 7.

There are many people to thank for their help in the preparation of this book.
First, we would like to thank the MFO and its staff for hosting our Oberwolfach Seminar, which provided a wonderful environment for our research lectures. In particular, we thank MFO director Gert-Martin Greuel for suggesting that we prepare these lecture notes. Second, we thank Birkhäuser editor Thomas Hempfling for his help with our manuscript. Third, we acknowledge support by grants from the U.S. National Science Foundation (Drton DMS-0746265; Sturmfels DMS-0456960; Sullivant DMS-0700078 and 0840795). Bernd Sturmfels was also supported by an Alexander von Humboldt research prize at TU Berlin. Finally, and most importantly, we would like to thank the participants of the seminar. Their great enthusiasm and energy created a very stimulating environment for teaching this material. The participants were Florian Block, Dustin Cartwright, Filip Cools, Jörn Dannemann, Alex Engström, Thomas Friedrich, Hajo Holzmann, Thomas Kahle, Anna Kedzierska, Martina Kubitzke, Krzysztof Latuszynski, Shaowei Lin, Hugo Maruri-Aguilar, Sofia Massa, Helene Neufeld, Mounir Nisse, Johannes Rauh, Christof Söger, Carlos Trenado, Oliver Wienand, Zhiqiang Xu, Or Zuk, and Piotr Zwiernik.

Chapter 1

Markov Bases

This chapter introduces the fundamental notion of a Markov basis, which represents one of the first connections between commutative algebra and statistics. This connection was made in the paper by Diaconis and the second author [33] on contingency table analysis. Statistical hypotheses about contingency tables can be tested in an exact approach by performing random walks on a constrained set of tables with non-negative integer entries. Markov bases are of key importance to this statistical methodology because they comprise moves between tables that ensure that the random walk connects every pair of tables in the considered set.

Section 1.1 reviews the basics of contingency tables and exact tests; for more background see also the books by Agresti [1], Bishop, Holland, Fienberg [18], or Christensen [21].
Section 1.2 discusses Markov bases in the context of hierarchical log-linear models. The problem of computing Markov bases is addressed in Section 1.3, where the problem is placed into the setting of integer lattices and tied to the algebraic notion of a lattice ideal.

1.1 Hypothesis Tests for Contingency Tables

A contingency table contains counts obtained by cross-classifying observed cases according to two or more discrete criteria. Here the word ‘discrete’ refers to criteria with a finite number of possible levels. As an example consider the 2×2-contingency table shown in Table 1.1.1. This table, which is taken from [1, §5.2.2], presents a classification of 326 homicide indictments in Florida in the 1970s. The two binary classification criteria are the defendant’s race and whether or not the defendant received the death penalty. A basic question of interest for this table is whether at the time death penalty decisions were made independently of the defendant’s race. In this section we will discuss statistical tests of such independence hypotheses as well as generalizations for larger tables.

                      Death Penalty
Defendant’s Race      Yes     No    Total
White                  19    141      160
Black                  17    149      166
Total                  36    290      326

Table 1.1.1: Data on death penalty verdicts.

Classifying a randomly selected case according to two criteria with $r$ and $c$ levels, respectively, yields two random variables $X$ and $Y$. We code their possible outcomes as $[r]$ and $[c]$, where $[r] := \{1,2,\dots,r\}$ and $[c] := \{1,2,\dots,c\}$. All probabilistic information about $X$ and $Y$ is contained in the joint probabilities
\[ p_{ij} = P(X = i,\, Y = j), \qquad i \in [r],\ j \in [c], \]
which determine in particular the marginal probabilities
\[ p_{i+} := \sum_{j=1}^{c} p_{ij} = P(X = i), \qquad i \in [r], \]
\[ p_{+j} := \sum_{i=1}^{r} p_{ij} = P(Y = j), \qquad j \in [c]. \]

Definition 1.1.1. The two random variables $X$ and $Y$ are independent if the joint probabilities factor as $p_{ij} = p_{i+} p_{+j}$ for all $i \in [r]$ and $j \in [c]$. We use the symbol $X \perp\!\!\!\perp Y$ to denote independence of $X$ and $Y$.

Proposition 1.1.2.
The two random variables $X$ and $Y$ are independent if and only if the $r \times c$-matrix $p = (p_{ij})$ has rank 1.

Proof. ($\Longrightarrow$): The factorization in Definition 1.1.1 writes the matrix $p$ as the product of the column vector filled with the marginal probabilities $p_{i+}$ and the row vector filled with the probabilities $p_{+j}$. It follows that $p$ has rank 1.

($\Longleftarrow$): Since $p$ has rank 1, it can be written as $p = ab^{t}$ for $a \in \mathbb{R}^{r}$ and $b \in \mathbb{R}^{c}$. All entries in $p$ being non-negative, $a$ and $b$ can be chosen to have non-negative entries as well. Let $a_{+}$ and $b_{+}$ be the sums of the entries in $a$ and $b$, respectively. Then, $p_{i+} = a_{i} b_{+}$, $p_{+j} = a_{+} b_{j}$, and $a_{+} b_{+} = 1$. Therefore, $p_{ij} = a_{i} b_{j} = a_{i} b_{+} a_{+} b_{j} = p_{i+} p_{+j}$ for all $i, j$. □

Suppose now that we randomly select $n$ cases that give rise to $n$ independent pairs of discrete random variables
\[ \binom{X^{(1)}}{Y^{(1)}},\ \binom{X^{(2)}}{Y^{(2)}},\ \dots,\ \binom{X^{(n)}}{Y^{(n)}} \tag{1.1.1} \]
that are all drawn from the same distribution, that is,
\[ P(X^{(k)} = i,\, Y^{(k)} = j) = p_{ij} \qquad \text{for all } i \in [r],\ j \in [c],\ k \in [n]. \]
The joint probability matrix $p = (p_{ij})$ for this distribution is considered to be an unknown element of the $rc-1$ dimensional probability simplex
\[ \Delta_{rc-1} = \Bigl\{ q \in \mathbb{R}^{r \times c} : q_{ij} \ge 0 \text{ for all } i, j \text{ and } \sum_{i=1}^{r} \sum_{j=1}^{c} q_{ij} = 1 \Bigr\}. \]
A statistical model $\mathcal{M}$ is a subset of $\Delta_{rc-1}$. It represents the set of all candidates for the unknown distribution $p$.

Definition 1.1.3. The independence model for $X$ and $Y$ is the set
\[ \mathcal{M}_{X \perp\!\!\!\perp Y} = \{ p \in \Delta_{rc-1} : \operatorname{rank}(p) = 1 \}. \]

The independence model $\mathcal{M}_{X \perp\!\!\!\perp Y}$ is the intersection of the probability simplex $\Delta_{rc-1}$ and the set of all matrices $p = (p_{ij})$ such that
\[ p_{ij} p_{kl} - p_{il} p_{kj} = 0 \tag{1.1.2} \]
for all $1 \le i < k \le r$ and $1 \le j < l \le c$. The solution set to this system of quadratic equations is known as the Segre variety in algebraic geometry. If all probabilities are positive, then the vanishing of the $2 \times 2$-minor in (1.1.2) corresponds to
\[ \frac{p_{ij}/p_{il}}{p_{kj}/p_{kl}} = 1. \tag{1.1.3} \]
Ratios of probabilities being termed odds, the ratio in (1.1.3) is known as an odds ratio in the statistical literature.

The order of the observed pairs in (1.1.1) carries no information about $p$ and we summarize the observations in a table of counts
\[ U_{ij} = \sum_{k=1}^{n} 1_{\{X^{(k)} = i,\ Y^{(k)} = j\}}, \qquad i \in [r],\ j \in [c]. \tag{1.1.4} \]
The table $U = (U_{ij})$ is a two-way contingency table. We denote the set of all contingency tables that may arise for fixed sample size $n$ by
\[ \mathcal{T}(n) := \Bigl\{ u \in \mathbb{N}^{r \times c} : \sum_{i=1}^{r} \sum_{j=1}^{c} u_{ij} = n \Bigr\}. \]

Proposition 1.1.4. The random table $U = (U_{ij})$ has a multinomial distribution, that is, if $u \in \mathcal{T}(n)$ and $n$ is fixed, then
\[ P(U = u) = \frac{n!}{u_{11}!\, u_{12}! \cdots u_{rc}!} \prod_{i=1}^{r} \prod_{j=1}^{c} p_{ij}^{u_{ij}}. \]
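
The objects introduced in this section can be explored numerically. The following Python sketch is an illustration added here, not part of the original lectures; all function names are hypothetical. Using the counts of Table 1.1.1 and exact rational arithmetic, it builds a table of counts as in (1.1.4), forms the rank-1 matrix obtained by multiplying the empirical row and column proportions, checks the minor (1.1.2) and the odds ratio (1.1.3) on that matrix, and evaluates the multinomial probability of Proposition 1.1.4.

```python
from fractions import Fraction
from math import factorial, prod

# Counts from Table 1.1.1: rows are defendant's race (White, Black),
# columns are death penalty (Yes, No).
u = [[19, 141], [17, 149]]


def table_from_pairs(pairs, r, c):
    """Counts U_ij as in (1.1.4) from observed pairs (i, j) with 1-based levels."""
    table = [[0] * c for _ in range(r)]
    for i, j in pairs:
        table[i - 1][j - 1] += 1
    return table


def margins(t):
    """Row sums, column sums and grand total of a two-way table."""
    rows = [sum(row) for row in t]
    cols = [sum(col) for col in zip(*t)]
    return rows, cols, sum(rows)


def independence_fit(t):
    """Rank-1 matrix with entries (row proportion) * (column proportion) of t."""
    rows, cols, n = margins(t)
    return [[Fraction(ri, n) * Fraction(cj, n) for cj in cols] for ri in rows]


def minor_2x2(p):
    """The 2x2 minor p_11 p_22 - p_12 p_21 appearing in (1.1.2)."""
    return p[0][0] * p[1][1] - p[0][1] * p[1][0]


def odds_ratio(p):
    """The odds ratio (p_11 / p_12) / (p_21 / p_22) of (1.1.3); entries must be positive."""
    return (Fraction(p[0][0]) / p[0][1]) / (Fraction(p[1][0]) / p[1][1])


def multinomial_pmf(t, p):
    """P(U = t) as in Proposition 1.1.4 for a probability matrix p, computed exactly."""
    n = sum(sum(row) for row in t)
    coefficient = factorial(n) // prod(factorial(x) for row in t for x in row)
    return coefficient * prod(
        Fraction(pij) ** x
        for row_t, row_p in zip(t, p)
        for x, pij in zip(row_t, row_p)
    )


if __name__ == "__main__":
    q = independence_fit(u)
    print("margins:", margins(u))
    print("minor of the rank-1 matrix:", minor_2x2(q))
    print("odds ratio of the observed counts:", float(odds_ratio(u)))
    print("P(U = u) under the rank-1 matrix:", float(multinomial_pmf(u, q)))
```

Exact fractions are used so that the minor of the rank-1 matrix is exactly zero rather than zero up to rounding. The odds ratio of the observed counts is computed only as a descriptive quantity; it is not by itself a test of the independence hypothesis.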