Statistical Adequacy and Mis-Specification (M-S) Testing
Aris Spanos
Virginia Tech, USA

Outline:
1. Introduction: relevant philosophical/methodological questions
2. Error Statistics as a modification/extension of the Fisher-Neyman-Pearson framework
3. Statistical adequacy and the reliability of inference
4. Error Statistics: Specification, M-S testing and Respecification
5. Empirical example: Predicting the population of the USA
6. The problem of Spurious Correlation (Regression)
7. Statistical Specification vs. Model Selection (Akaike-type)
8. Summary and conclusions

1 Introduction

Developing a methodology of Mis-Specification (M-S) testing with a view to securing statistical adequacy (Spanos, 1986, 1989, 1995): how to specify and validate statistical models, how to proceed when statistical assumptions are violated, how to probe model assumptions, isolate the sources of departures, and accommodate them in a respecified model for primary statistical inferences.

Why is this of interest to philosophers of science?

• Addressing these questions raises crucial philosophical/methodological issues and problems which stem from several long-standing foundational problems bedeviling frequentist statistics since the 1930s. Methodological debates in practice often revolve around philosophical and logical issues regarding the nature, interpretation, and justification of methods and models used to learn from noisy, incomplete, and non-experimental data. Philosophers of science have a crucial normative role to play in scrutinizing the epistemic credentials of methods and models used in practice; philosophy ⇒ practice.

• Ironically, early philosophers of confirmation were reluctant to appeal to statistical techniques that are used in practice because of their empirical assumptions. It was/is misleadingly assumed that random sampling is required, when in fact non-random samples are routinely used by practitioners of statistical modeling and inference; practice ⇒ philosophy.
• The uses to which some philosophers of science put these modeling tools are sometimes vitiated by failure to validate the underlying model assumptions, the very task to which philosophers ought to be most sensitive; practice ⇒ philosophy.

Relevant philosophical/methodological questions

• (i) Where do statistical models come from? What about the theory?
• (ii) What is the role of substantive information in specifying and validating a statistical model?
• (iii) Is a purely probabilistic construal of a statistical model possible?
• (iv) How does one ensure that the list of probabilistic assumptions forming a statistical model is: (a) complete (no hidden assumptions), (b) internally consistent, and (c) testable vis-a-vis data x_0 := (x_1, x_2, ..., x_n)?
• (v) Is a Mis-Specification (M-S) test [probing the validity of the model assumptions] different from a Neyman-Pearson (N-P) test? How?
• (vi) Is M-S testing vulnerable to: (a) the fallacies of acceptance and rejection? (b) the infinite regress and circularity problems? (c) illegitimate double-use of data?
• (vii) Can one secure statistical adequacy using thorough M-S testing? How can M-S testing rule out a potentially infinite number of other possible models?
• (viii) How serious is the underdetermination of statistical models by data? Is the problem endemic?
• (ix) Is Exploratory Data Analysis (EDA) using graphical techniques an illegitimate form of data snooping that can bias statistical inference?
• (x) Does the use of EDA in selecting, validating and respecifying a statistical model involve illegitimate double-use of data?
• (xi) Is purposeful ‘designing’ of M-S tests to detect particular departures suggested by EDA illegitimate?
• (xii) How does one justify the application of several M-S tests for detecting departures from a single or multiple assumptions?
• (xiii) Does respecification amount to ‘patching up’ the original model to account for the regularities it did not?
• (xiv) Is respecification susceptible to the fallacy of rejection and/or the pre-test bias charge?

2 Error Statistics: Extending the F-N-P framework

2.1 The Fisher-Neyman-Pearson (F-N-P) frequentist approach

Fisher (1922) initiated a change of paradigms in statistics by recasting the then dominating Bayesian-oriented induction by enumeration, relying on large-sample approximations (see Pearson, 1920), into a frequentist ‘model-based induction’, relying on finite sampling distributions, inspired by Gosset’s (1908) derivation of the Student’s t distribution for any sample size n > 1.

He proposed to view the data x_0 := (x_1, x_2, ..., x_n) as a realization of: (a) a ‘random sample’ from (b) a pre-specified ‘hypothetical infinite population’, and made the initial choice (specification) of the statistical model a response to the question: “Of what population is this a random sample?” (Fisher, 1922, p. 313), emphasizing that ‘the adequacy of our choice may be tested a posteriori’ (ibid., p. 314).

He went on to formalize notions (a)-(b) in purely probabilistic terms by defining the concept of a (parametric) statistical model:

    M_θ(x) = {f(x; θ), θ ∈ Θ}, x ∈ R_X^n,

where f(x; θ), x ∈ R_X^n, denotes the joint distribution of the sample X := (X_1, ..., X_n).

Example - simple Normal model: X_k ~ NIID(μ, σ²), k = 1, 2, ..., n, ...

Neyman and Pearson (N-P) (1933) supplemented Fisher’s framework with a more complete component on hypothesis testing; Fisher had established the estimation component almost single-handedly.

Although the formal apparatus of the Fisher-Neyman-Pearson (F-N-P) statistical inference was largely in place by the late 1930s, the nature of the underlying inductive reasoning was clouded in disagreements. Fisher argued for ‘inductive inference’ spearheaded by his significance testing (Fisher, 1955), and Neyman argued for ‘inductive behaviour’ based on N-P testing (Neyman, 1956); see also Pearson (1955).
In particular, neither account gave a satisfactory answer to the question: when do data x_0 provide evidence for (or against) a hypothesis or a claim H?

Indeed, several crucial foundational problems were left largely unanswered:
(a) the role of pre-data vs. post-data error probabilities (Hacking, 1965),
(b) the fallacies of acceptance and rejection,
(c) when does a model ‘account for the regularities’ in the data?
(d) what is the role of substantive information in statistical modeling?

These issues created endless confusion in the minds of practitioners concerning the appropriate use and interpretation of the frequentist approach to inference.

2.2 Error Statistics (E-S)

Mayo (1996) took on the challenge and argued persuasively that some of the chronic methodological problems (a)-(b) can be addressed by supplementing the Neyman-Pearson approach to testing (see Pearson, 1966) with a post-data evaluation of inference based on severe testing reasoning, calling this elaboration/extension ‘Error Statistics (E-S)’; see also Mayo and Cox (2006). In this presentation the focus is on (c)-(d), which turn out to be interrelated.

Placing the foundational problems (a)-(d) in proper perspective.

In E-S the role of a statistical model M_θ(x) is:
(i) to specify the premises of inference,
(ii) to assign probabilities to all events of interest, and
(iii) to provide ascertainable error probabilities in terms of which one can assess the optimality and reliability of inference methods.

Error statistics emphasizes the reliability and precision of inference. One can secure statistical reliability by establishing the statistical adequacy of M_θ(x). Precision is assured by using the most effective (optimal) inference methods.

In modern frequentist statistics, the optimality of estimators, tests and predictors is grounded in a deductive argument of the basic form:

    if M_θ(x) is true, then Q(x) [a set of inference propositions] follows.
Examples of inference propositions in the context of the simple Normal model:

(i) X̄_n = (1/n) Σ_{k=1}^n X_k is a strongly consistent and fully efficient estimator of μ;

(ii) {τ(X), C_1(α)}, where τ(X) = √n(X̄_n − μ_0)/s and C_1(α) = {x : τ(x) > c_α}, defines a UMP test for: H_0: μ = μ_0 vs. H_1: μ > μ_0;

(iii) P(X̄_n − c_α·(s/√n) ≤ μ ≤ X̄_n + c_α·(s/√n)) = 1 − α defines the shortest-width CI.

The deductive component, M_θ(x) → Q(x), is then embedded into a broader inductive understructure which relates data x_0, via M_θ(x), to inference results Q(x_0), as they pertain to the phenomenon of interest.

The literature on the frequentist approach since the 1930s has paid insufficient attention to the reliability and pertinence of inference results Q(x_0). To secure that, one needs to probe for and eliminate potential errors at the two points of nexus with reality:
(A) from the phenomenon of interest to an adequate statistical model,
(B) from inference results to evidence for substantive claims.

• Statistics textbooks are, almost exclusively, concerned with the deductive component, if M_θ(x) then Q(x), and give insufficient attention to (A)-(B).

Model-based frequentist statistical induction:

    Statistical Model M_θ(x)       If M_θ(x), then Q(x)      Inference
    (assumptions [1]-[4])       =======================⇒     Propositions
            ↑                                                     ↓
    Data: x_0 = (x_1, x_2, ..., x_n)              Inference results: Q(x_0)

Error statistics addresses the first concern by securing statistical adequacy to render trustworthy the link:

A. From a phenomenon of interest to an adequate statistical model: x_0 −[i]→ M_θ(x).

To address the second concern, E-S supplements the F-N-P approach with a post-data evaluation of inference based on severe testing.

B. From inference results to evidence for substantive claims: Q(x_0) −[ii]→ evidence.

Example. Assume that x̄_n = .02, n = 10000 [large n case], s = 1.1, yielding τ(x_0) = 1.82, which leads to rejecting H_0: μ = 0 vs. H_1: μ > 0 at α = .05, since c_α = 1.645.
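The reported test statistic can be checked directly from the summary statistics (a one-line computation; c_α = 1.645 is the large-n Normal approximation to the t critical value at α = .05):

```python
import math

# Summary statistics from the example: xbar = .02, n = 10000, s = 1.1, mu0 = 0
n, xbar, s, mu0 = 10000, 0.02, 1.1, 0.0

# tau(x0) = sqrt(n) * (xbar - mu0) / s
tau = math.sqrt(n) * (xbar - mu0) / s
print(round(tau, 2))  # 1.82, which exceeds c_alpha = 1.645, so H0 is rejected
```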
Does this provide evidence for a substantive discrepancy from the null? The post-data evaluation of the claim μ > γ, for different discrepancies γ, is based on:

    SEV(τ(x_0); μ > γ) = P(τ(X) ≤ τ(x_0); μ ≤ γ),

which indicates that, for a high enough threshold, say .9, the maximum discrepancy warranted by data x_0 is γ < .006. One then needs to use substantive information to assess whether γ < .006 is substantively significant or not.

Severity evaluation of the claim μ > γ:

    γ                         .001   .005   .006   .01    .02    .05    .07
    SEV(τ(x_0); μ > γ)        .958   .914   .898   .818   .500   .003   .000
    POW(τ(X); c_α; μ = γ)     .060   .117   .136   .231   .569   .998   1.00
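The SEV and POW rows of the table can be reproduced with a short stdlib-only sketch, using the large-n Normal approximation τ(X) ~ N(√n·γ/s, 1) when μ = γ (the helper names Phi, severity and power are ours, not from the slides):

```python
import math

def Phi(z):
    """Standard Normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Numbers from the example: xbar = .02, n = 10000, s = 1.1, alpha = .05
n, xbar, s, c_alpha = 10000, 0.02, 1.1, 1.645
tau_x0 = math.sqrt(n) * xbar / s  # observed test statistic, about 1.82

def severity(gamma):
    """SEV(tau(x0); mu > gamma) = P(tau(X) <= tau(x0); mu = gamma)."""
    return Phi(tau_x0 - math.sqrt(n) * gamma / s)

def power(gamma):
    """POW(tau(X); c_alpha; mu = gamma) = P(tau(X) > c_alpha; mu = gamma)."""
    return 1.0 - Phi(c_alpha - math.sqrt(n) * gamma / s)

for g in (0.001, 0.005, 0.006, 0.01, 0.02, 0.05, 0.07):
    print(f"gamma={g:.3f}  SEV={severity(g):.3f}  POW={power(g):.3f}")
```

At γ = .006 this gives SEV ≈ .898 and POW ≈ .136, matching the table, and SEV drops through the .9 threshold just past γ = .006, which is how the warranted-discrepancy bound above is obtained.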