ebook img

Statistical methods for linguistic research: Foundational Ideas - Part II PDF

0.38 MB·
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Statistical methods for linguistic research: Foundational Ideas - Part II

Statistical methods for linguistic research: Foundational Ideas - Part II Bruno Nicenboim and Shravan Vasishth 6 February 2, 2016 1 0 2 Abstract n Weprovide an introductory review of Bayesian data analytical meth- a J ods, with a focus on applications for linguistics, psychology, psycholin- guistics, and cognitive science. The empirically oriented researcher will 1 3 benefit from making Bayesian methods part of their statistical toolkit due to the many advantages of this framework, among them easier in- ] terpretation of results relative to research hypotheses, and flexiblemodel P specification. We present an informal introduction to the foundational A ideasbehindBayesiandataanalysis,using,asanexample,alinearmixed . models analysis of data from a typical psycholinguistics experiment. We t a discuss hypothesis testing using the Bayes factor, and model selection t usingcross-validation. Weclose withsome examplesillustrating theflex- s [ ibility of model specification in the Bayesian framework. Suggestions for furtherreading are also provided. 1 v 5 1 Introduction 4 2 0 InPartIofthis review,we presentedthe mainfoundationalideasoffrequentist 0 statistics, with a focus on their applications in linguistics and related areas in 2. cognitivescience. Ourmaingoaltherewastotrytoclarifytheunderlyingideas 0 in frequentist methodology. We discussed the meaning of the p-value, and of 6 TypeIandIIerrorsandTypeSandMerrors. Wealsodiscussedsomecommon 1 misinterpretations associated with hypothesis testing in this framework. : v There is no question that frequentist data-analytic methods such as the t- i test, ANOVA, etc. are and will remain an important part of the toolkit of the X experimentally inclined linguist. There is, however, another framework that is r a not yet part of the standard statistics curriculum in linguistics, but can be of great value: Bayesian data analysis. Learning to use this framework for data analysis is easier than it seems; in fact, most of us already think like Bayesians when we carry out our frequentist analyses. One example of this is the (incor- rect)interpretationofthep-valueastheprobabilityofthenullhypothesisbeing true. Ifwearefacedwithap-valueof0.06(i.e.,avaluejustabovethethreshold of statistical significance), we often resort to expressions of gradedness, saying that “the result was marginally significant”, or “the results did not reach the conventionallevelof significance”,or even “a non-significanttrend towardssig- nificance.”1 This desire to take the non-significant result as meaningful comes 1For a list of over 500 remarkably imaginative variants on this phrasing, see 1 aboutbecauseofanirresistabletemptationto (erroneously)giveaBayesianin- terpretation to the p-value as the probability of the null hypothesis being true. AsdiscussedinPartI,thep-valueistheprobabilitythatthe statisticisatleast as extreme as the one observed given that the null is true. When we misinter- pret the p-value in terms of the probability of the null being true, a p-value of 0.06 doesn’t seem much different from 0.04. Several studies have shown that such interpretations of the p-value are not uncommon (among many others: Haller and Krauss (2002); Lecoutre et al. (2003)). The interpretation of fre- quentistconfidenceintervalsalsosuffersfromsimilarproblems(Hoekstra et al., 2014). Given that the Bayesian interpretation is the more natural one that we converge to anyway, why not simply do a Bayesian analysis? One reason that Bayesian methods have not become mainstream may be that, until recently, it was quite difficult to carry out a Bayesian analysis except in very limited situations. With the increase in computing power, and the arrival of several probabilistic programming languages, it has become quite easy to carry out relatively complicated analyses. The objective of this paper is to provide a non-technicalbut practically ori- entedreview ofsomeofthe toolscurrently availablefor Bayesiandata analysis. Since the linear mixed model (LMM) is so important for linguistics and re- lated areas, we focus on this model and mention some extensions of LMMs. We assume here that the reader has fitted frequentist linear mixed models (Bates et al., 2015b). The review could also be used to achieve a better un- derstanding of papers that use Bayesian methods for statistical inference. Westartthereviewbyoutliningwhatweconsidertobethemainadvantages of adopting Bayesian data-analytic methods. Then we informally outline the basic idea behind Bayesian statistics: the use of Bayes’ theorem to incorporate prior information to our results. Next, we review several ways to verify the plausibility of a researchhypothesis with a simple example from psycholinguis- tics using Bayesian linear mixed models. In the second part of the paper, we discuss some example applications involving standard and less standard (but very useful) models. In the final section, we suggestsome further readings that provide a more detailed presentation. 2 Why bother to learn Bayesian data analysis? Statisticians have been arguing for decades about the relative merits of the frequentistvs Bayesianstatisticalmethods. We feelthatbothapproaches,used as intended, havetheir merits, but that the importance of Bayesianapproaches remains greatly underappreciated in linguistics and related disciplines. We see two main advantages to using Bayesian methods for data analysis. First, Bayesian methods allow us to directly answer the question we are in- terested in: How plausible is our hypothesis given the data? We can answer this question by quantifying our uncertainty about the parameters of interest. Second, and perhaps more importantly, it is easier to flexibly define hierarchi- cal models (also known as mixed effects or multilevel models) in the Bayesian frameworkthaninthefrequentistframework. Hierarchicalmodels,whetherfre- quentistorBayesian,arehighlyrelevantfortherepeatedmeasuresdesignsused https://mchankins.wordpress.com/2013/04/21/still-not-significant-2/. 2 inlinguisticsandpsycholinguistics,becausetheytakebothbetween-andwithin- group variances into account, and because they pool information via “shrink- age”(seeGelman et al.(2012)). Thesepropertieshavethedesirableeffectsthat we avoid overfitting the data, and we avoid averaging and losing valuable in- formation about group-level variability (Gelman and Hill (2007) provide more details). For example, both subjects and items contribute independent sources ofvarianceina standardlinguistics repeatedmeasuresdesign. Inahierarchical model, both these sources of variance can be included simultaneously. By con- trast, in repeated measures ANOVA, one has to aggregate by items (subjects), which artificially eliminates the variability between items (subjects). This ag- gregationleadstoartificiallysmallstandarderrorsoftheeffectofinterest,which leads to an increase in Type I error. The frequentist linear mixed model standardly used in psycholinguistics is generally fit with the lme4 package (Bates et al., 2015b) in R. However, if we want to include the maximal random effects structure justified by the design (Schielzeth and Forstmeier, 2009; Barr et al., 2013), these models tend to not converge or to give unrealistic estimates of the correlations between random effects (Bates et al., 2015a). In contrast, the maximal random effects structure can be fit without problems using Bayesian methods, as discussed later in this review (also see Chung et al., 2013; Bates et al., 2015a; Sorensen et al., 2015). In addition, Bayesian methods can allow us to hierarchically extend virtually any model: non-linear models (which are not generalized linear models) and eventhehighlycomplexmodelsofcognitiveprocesses(Lee,2011;Shiffrin et al., 2008). See also the discussion about hierarchical models in the section entitled Examples of applications of Bayesian methods. 3 Bayesian data analysis: An informal introduc- tion Inlinguistics,weareusuallyinterestedindeterminingwhetherthereisaneffect of a particular factor on some dependent variable; an example from psycholin- guistics is the difference in processing difficulty between subject and object relative clauses as measured by reading times. In the frequentist paradigm, we assume thatthere is some unknownpointvalue µ thatrepresentsthe difference in reading time between the two relative clause types; the goal is to reject the nullhypothesisthatthistrueµhasthevalue0. IntheBayesianframework,our goalis to obtainanestimateofµ giventhe dataalongwithanuncertaintyesti- mate(suchasacredibleinterval,discussedindetailbelow)thatgivesusarange overwhichwecanbereasonablysurethatthetrueparametervaluelies. Weob- tain these estimates given the data and given our prior knowledge/information about plausible values of µ. We elaborateon this idea in the next sectionwhen we present a practical example, but the essential point is that the distribution of µ (called the posterior distribution) can be expressed in terms of the prior and likelihood:2 2Thetermlikelihoodmaybeunfamiliar. Forexample,ifwehavenindependentdatapoints, x1,...,xn which are assumed to be generated from a Normal distribution with parameters µ and σ and a probability density function f(·), the joint probabilityof these data points is theproduct f(x1)×f(x2)×···×f(xn). Thevalueofthis productisafunctionofdifferent values of µ and σ, and it is common to call this product the Likelihood function, and it is 3 Posterior∝Prior×Likelihood (1) To repeat, given some prior information about the parameter of interest, and the likelihood, we can compute the posterior distribution of the parame- ter. The focus is not on rejecting a null hypothesis but on what the posterior distributiontells us aboutthe plausible valuesofthe parameterofinterest. Re- call that frequentist significance testing is focused on calculating the p-value P(statistic | µ = 0), that is, the probability of observing a test statistic (such as a t-value) at least as extreme as the one we observed given that the null hypothesis that µ=0 is true. By contrast, Bayesianstatistics allows us to talk aboutplausible values ofthe parameterµ giventhe data,throughthe posterior distribution of the parameter. Another important point is that the posterior is essentiallya weightedmeanofthe priorandthe likelihood. The implicationsof this statement are made clear graphically in Figure 1. Here, we see the effect of two types of prior distributions on binomial data given 10 or 100 observa- tions. Two points are worth noting. First, when the prior is spread out over a wide range and assign equal probability to all possible values, the posterior distribution ends up closer to the likelihood; this alignment to the likelihood is more pronounced when we have more data (larger sample size). Second, when we have weakly informative priors, with sparse data, the posterior is closer to the prior, but with more data, the posterior is again closer to the likelihood. What this implies is that when we have little data, it is worth investing time in developing priors informed by prior knowledge; but when we have a lot of data, the likelihood will dominate in determining the posterior (we return to this point later, with an example). Indeed, in large-sample situations, we will usually find that the Bayesian posterior and the frequentist mean, along with their uncertainty estimates, are nearly identical or very similar (even though their meaning is quite different—see the discussion below on Bayesian credible intervals). Bayes’ theorem is just a mathematical rule that allows us to calculate any posteriordistribution. In practice,however,this is true for a very limited num- ber of cases, and, in fact, the posterior of many of the models that we are interested in cannot be derived analytically. Fortunately, the posterior distri- bution can be approximated with numerical techniques such as Markov Chain MonteCarlo(MCMC).Manyoftheprobabilisticprogramminglanguagesfreely available today (see the final section for a listing) allow us define our models without having to acquire expert knowledgeabout the relevantnumerical tech- niques. It is all very well to talk about Bayesian methods in the abstract, but how can they be used by linguists and psycholinguists? To answer this question, consider a concrete example from psycholinguistics. We feel that it is easier to show an example that the reader can relate to in order to convey a feeling for how a Bayesian analysis would work; for further study, excellent introductory textbooks are available (we give some suggestions in the final section). often written L(x1,...,xn;µ,σ2). The essential idea here is that the likelihood tells us the jointprobabilityofthedatafordifferentvalues oftheparameters. 4 Flat prior; N=10 Flat prior; N=100 10.0 7.5 5.0 2.5 y 0.0 t si n Weakly informative prior; N=10 Weakly informative prior; N=100 e D10.0 7.5 5.0 2.5 0.0 0.00 0.25 0.50 0.75 1.000.00 0.25 0.50 0.75 1.00 Estimate Distribution prior likelihood posterior Figure 1: Posteriordistributions given different likelihoods and priors for bino- mial data. 5 4 An example of statistical inference using Bayesian methods In contrast to significance testing in frequentist statistics, Bayesianinference is not necessarily about dichotomous decisions (reject null or fail to reject null), butratherabouttheevidenceforandagainstdifferenthypotheses. Weillustrate one way to carry out statistical inference using a Bayesian linear mixed model with the data from Gibson and Wu (2013), which has a simple two-condition repeated measures design. Gibson and Wu (2013) compared reading times at the headnounforsubjectandobjectrelativeclausesandarguedforfacilitation in the case of object relative clauses (providing a counter-example to the cross- linguistic generalization that subject relative clauses are easier to process than object relative clauses; also see Hsiao and Gibson (2003)). We will assume, as we have done elsewhere (Sorensen et al., 2015), that the dependent variable, reading times, has a lognormal distribution, and thus we will use log reading times as the dependent variable.3 We fit the linear mixed model with the full randomeffectsstructurejustifiedbythe design(Barr et al.,2013),andwe code object relative clauses with 1 and subject relative clauses with −1. With this sum contrast coding, Gibson and Wu’s prediction that object relative clauses are easierthan subjectrelative clauses inChinese would, if correct,result in an effectwith anegativesign(i.e. shorterreadingtimes forthe conditioncodedas 1). In the Gibson and Wu dataset, we have 37 participants and 15 items; due to some missing data, we have a total of 547 data points. The conditions were presented to participants in a counterbalanced manner using a Latin square. The researcher familiar with lme4 (for details see Bates and Sarkar (2007) and lme4 documentation: Bates et al. (2015b)) will not find the transition to the Bayesian approach difficult. In lme4 syntax, the model we would fit would be lmer(log(rt) ~ cond + (cond|subj)+(cond|item)) We can fit an analogous Bayesian linear mixed model with the stan lmer function from the rstanarmpackage (Gabry and Goodrich, 2016); see the code in Listing 1. The main novelty in the syntax is the specification of the priors foreachparameter. Someother detailsneedto be specified, suchasthe desired numberofchainsanditerationsthattheMCMCalgorithmrequirestoconverge to the posterior distribution of the parameter of interest. To speed up compu- tation, the number ofprocessors(cores)availablein their computer canalso be specified. In the example shown in Listing 1, we have left other parameters of theMCMCalgorithmatthedefaultvalues,buttheymayneedsomefinetuning in case of non-convergence (this is announced by a warning message). A comparison of the estimates from the lme4 “maximal” LMM and the analogous Bayesian LMM are shown in Table 1. An important point to notice isthatcorrelationsbetweenthevaryinginterceptsandslopesinthefittedmodel using the lmer function are on the boundary; although this does not register a convergence failure warning in lmer, such boundary values constitute a failure to estimate the correlations (Bates et al., 2015a). This failure to estimate the correlations is due to the sparsity of data: we have only 37 subjects and even 3Notice, however, that this is not necessarily the best characterization of latencies; see Ratcliff(1993);Rouder(2005);Nicenboimetal.(2015). 6 1 library(rstanarm) 2 dgw <- read.table("gibsonwu2012data.txt") 3 dgw_hn <- subset(dgw, subset = region == "headnoun") 4 dgw_hn$cond <- ifelse(dgw_hn$type == "obj-ext", 1, -1) 5 m1 <- stan_lmer(formula = log(rt) ˜ cond + (cond | subj) + (cond | item), 6 prior_intercept = normal(0, 10), 7 prior = normal(0, 1), 8 prior_covariance = decov(regularization = 2), 9 data = dgw_hn, 10 chains = 4, 11 iter = 2000, 12 cores = 4) 13 #summary(m1) # Very long summary with all the parameters in the model Listing 1: Code for fitting a linear mixed model with stan lmer. A major difference from lme4 syntax is that priorsarespecified for (a) the intercept, (b) theslope,and(c)thevariance-covariancematricestherandomeffectsforsubject and item. The other specifications, regarding chains and iterations, relate to the how samples are taken from the posterior distribution, and specifying the number of cores can speed up computation. feweritems (15)andareaskingtoo muchofthe lmerfunction. Inthe Bayesian LMM,thepriorspecifiedonthecorrelationmatrixoftherandomeffectsensures thatifthereisinsufficientdata,theposteriorwillhaveameancorrelationnear0 withawideuncertaintyassociatedwithit(thisisthecaseintheitemsintercept- slope correlation),andifthere is more data,the posteriorwillbe a compromise between the prior and the likelihood, although the uncertainty associated with this estimate may still be high (this is the case in the subjects intercept-slope correlation). See Sorensen et al. (2015) for more discussion on this point. 5 Prior specification Since priors are an important part of the model, it is worth saying a few words about them. In order to fit a Bayesian model, we need to specify a prior dis- tribution on each parameter; these priors express our initial state of knowledge about the possible values that the parameter can have. It is possible to spec- ify completely uninformative priors, such as flat priors, but these are far from ideal since they concentrate too much probability mass outside of any reason- ableposteriorvalues. Thiscanhavetheconsequencethatwithoutenoughdata, this prior will dominate in determining the posterior mean and the uncertainty associated with it (Gelman, 2006). Priors that give some minimal amount of informationimprove inference and are called regularizing or weakly informative priors (see also Chung et al. (2013); Gelman et al. (2008)). In the typical psycholinguistic experiment, different weakly informative pri- ors generally don’t have much of an effect on the posterior, but it is a good 7 Table 1: Comparison of the frequentist and Bayesian model estimates for the Gibson and Wu data-set. The estimates for both the coefficients and the vari- ancecomponentsarecomparable,butnoticethatthecorrelationsbetweenvary- ing intercepts and slopes are quite different in the frequentist and Bayesian models. The stan lmer packages provides the median and the standard devi- ation of the median absolute difference (MAD) for the fixed effects, but one could equally well compute the mean and standard error, to mirror the lme4 convention. lmer stan lmer Random effects Random effects Groups Name Std.Dev. Corr Std.Dev. Corr subj (Intercept) 0.2448 0.2425 cond 0.0595 -1.00 0.0762 -0.521 item (Intercept) 0.1820 0.1829 cond 0.0002 1.00 0.0475 0.012 Residual 0.5143 0.5131 Fixed effects Fixed effects Estimate Std. Error t value Median MAD-SD (Intercept) 6.06180 0.0657 92.24 6.0641 0.0658 cond -0.03625 0.0242 -1.50 -0.0364 0.0301 idea to do a sensitivity analysis by evaluating the effect of different priors on the posterior;see Vasishth et al.(2013); Levshina(2016)for examples fromlin- guistics. Another use of priors is to include valuable prior knowledge about the parameters into our model; these informative priors could come from ac- tualpriorexperiments(the posteriorfrompreviousexperiments)orfrommeta- analyses (Vasishth et al., 2013), or from expert judgements (O’Hagan et al., 2006; Vasishth, 2015). Such a use of priors is not widespread even among Bayesian statisticians, but could be a powerful tool for incorporating prior knowledge into a new data analysis. For example, starting with relatively in- formativepriorscouldbe a hugeimprovementoverstartingeverynew study on relative clauses with the assumption that we know nothing about the topic. Returning to the practical issue of prior specification in R functions like, stan lmer, it is a good idea to specify the priors explicitly; the default priors assumed by the function may not be appropriate for your specific data. For example, inour example above,if we set the priorfor the interceptas a normal distributionwithameanofzeroandastandarddeviationoften,Normal(0,10),4 we are assuming that we are 68% sure that the grand mean in log-scale will be between −10 (very near zero milliseconds) and 10 (≈ 22 seconds). This prior is extremely vague, and can be changed but it won’t affect the results much. However, notice that had we not log-transformed the dependent variable, we would be assuming that we are 68% sure that the grand mean is between -10 and 10 milliseconds, and 95% sure that it is between -20 and 20 milliseconds! We are guaranteedto not get sensible estimates if we do that. 4This happens to be the default value inthis version of rstanarm, but this might change infutureversions. 8 We start the analysis by setting Normal(0,1) as the prior for the effect of subjectvs.objectrelativeclauses. This meansthatwe assumethatweare68% certainthatthedifferencebetweentheconditionsshouldbelessthan1300ms.5 Inreality,therangeislikelytobemuchsmaller,butwewillstartwiththisprior. We will exemplify the effect of the priors by changing the prior of the estimate forthe effectofthe experimentalconditioninthe nextsection. The readermay notice that there is also a prior for the covariance matrix of the random effects (Chung et al.,2013);theregularizationparameterinthispriorspecificationtoa valuelargerthanonecanhelptogetconservativeestimatesoftheintercept-slope correlations when we don’t have enough data; recall the discussion regarding Table 1 above. For further examples,see Bates et al.(2015a) and the vignettes in the RePsychLing package (https://github.com/dmbates/RePsychLing). A more in-depth discussion is beyond the scope of this paper, and the reader is referred to the tutorial by Sorensen et al. (2015) and a more general discussion by Chung et al. (2013).6 6 The posterior and statistical inference As mentioned earlier, the result of a Bayesian analysis is a posterior distribu- tion, that is, a distribution showing the relative plausibilities of each possible value of the parameter of interest, conditional on the data, the priors, and the model. Every parameter of the model (fixed effects, random effects, shape of the distribution) will have a posterior distribution. Typically, software such as the rstanarm package in R will deliver samples from the posterior that we can use for inference. In our running example, we will focus on the posterior distribution of the effect of object relative clauses in comparison with subject relative clauses; see Figure 2. To communicate our results, we need to summarize and interpret the pos- terior distribution. We can report point estimates of the posterior probability such as the mean or the median (in some cases also the mode of the distribu- tion, known as the maximum a posteriori or MAP, is also reported). When the posterior distribution is symmetrical and approximately normal in shape, the meanandmedianalmostconvergetothesamepoint(seeFigure2). Inthiscase itwon’tmatterwhichonewereport,sincetheyonlydifferinthefourthdecimal digit; see Listing 2. It is also important to summarize the amount of posterior probability that lies below or above some parameter value. In Gibson and Wu’s data, since the researchquestionamountstowhethertheparameterispositiveornegative,and Gibson and Wu predict that it will be negative, we can compute the posterior probabilitythatthedifferencebetweenobjectandsubjectrelativeclausesisless thanzero,P(βˆ)<0. Thisprobabilityis0.89;seeListing3andalsoFigure3(a). 5Themeanreadingtimeatthecriticalregionis550ms,andsinceweassumedalognormal distribution,tocalculatethems,weneedtofindoutexp(log(550)+1)−exp(log(550)−1),which is1300;the68%istheprobabilitymassbetween [-1,1]inthestandardnormaldistribution. 6Weleftthepriorsofotherparameterssuchasthestandarddeviationofthedistribution, orresidualsinlme4terms,andthescaleoftherandomeffects attheirdefaultvalues. These priorsshouldn’tbeignoredwhendoingarealanalysis! Theusershouldverifythattheymake sense. 9 14 samples_m1 <- as.data.frame(m1) # It saves all the samples from the model. 15 posterior_condition <- samples_m1$cond 16 options(digits = 4) 17 mean(posterior_condition) ## [1] -0.0358 19 median(posterior_condition) ## [1] -0.03573 Listing 2: Code for summarizing point estimates. 21 mean(posterior_condition < 0) # Posterior probability that lies below zero. ## [1] 0.89 Listing 3: Code for finding the mass of the posteriorprobability that lies below zero. Thereisnothingspecialaboutzero;andsincethedifferencebetweenEnglish objectandsubjectrelativeclausesis,ingeneral,quitelarge,andGibson and Wu predict that in Chinese the same processes as in English give an advantage to object relativesoversubject relative clauses,we could be interested, instead, in knowing the probability that the advantage of object relative clauses is at least 20 ms. This advantage can be translated into approximately −0.02 from the grand mean in log-scale; we can inspect the posterior distribution and find out that this, P(βˆ)<−0.02, is 0.67; see also Figure 3(b). 6.1 The 95% credible interval Itispossible(anddesirable)toreportanintervalofposteriorprobability,thatis, twoparametervaluesthatcontainbetweenthemaspecifiedamountofposterior probability. This type of interval is also known as credible interval. A credible interval demarcates the range within which we can be certain with a certain probabilitythatthe“truevalue”ofaparameterlies. Itisthetruevaluenotout there “innature”,but true inthe model’s logicalworld(see alsothe interesting distinction between small and large worlds in McElreath (2015)). The Bayesian credible interval is different from the frequentist confidence interval because the credible interval can be interpreted with the data at hand, while the frequentist counterpart is a property of the statistical procedure. The statistical procedure only indicates that frequentist confidence intervals across a series of hypothetical data sets produced by the same underlying pro- cess will contain the true parameter value in a certain proportion of the cases (Hoekstra et al., 2014; Morey et al., 2015). ThetwomostcommontypesofBayesiancredibleintervalsarethepercentile interval and highest posterior density interval (HPDI; Box and Tiao, 1992). In the first one, we assign equal probability mass to each tail. This is the most 10

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.