Statistical Science
2010, Vol. 25, No. 3, 289–310
DOI: 10.1214/10-STS330
© Institute of Mathematical Statistics, 2010
arXiv:1101.0891v1 [stat.ME] 5 Jan 2011

To Explain or to Predict?

Galit Shmueli

Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between explanation and prediction is common, yet the distinction must be understood for progressing scientific knowledge. While this distinction has been recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many differences that arise in the process of modeling for an explanatory versus a predictive goal. The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction to each step in the modeling process.

Key words and phrases: Explanatory modeling, causality, predictive modeling, predictive power, statistical strategy, data mining, scientific research.

Galit Shmueli is Associate Professor of Statistics, Department of Decision, Operations and Information Technologies, Robert H. Smith School of Business, University of Maryland, College Park, Maryland 20742, USA (e-mail: [email protected]).

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2010, Vol. 25, No. 3, 289–310. This reprint differs from the original in pagination and typographic detail.

1. INTRODUCTION

Looking at how statistical models are used in different scientific disciplines for the purpose of theory building and testing, one finds a range of perceptions regarding the relationship between causal explanation and empirical prediction. In many scientific fields such as economics, psychology, education, and environmental science, statistical models are used almost exclusively for causal explanation, and models that possess high explanatory power are often assumed to inherently possess predictive power. In fields such as natural language processing and bioinformatics, the focus is on empirical prediction with only a slight and indirect relation to causal explanation. And yet in other research fields, such as epidemiology, the emphasis on causal explanation versus empirical prediction is more mixed. Statistical modeling for description, where the purpose is to capture the data structure parsimoniously, and which is the most commonly developed within the field of statistics, is not commonly used for theory building and testing in other disciplines. Hence, in this article I focus on the use of statistical modeling for causal explanation and for prediction. My main premise is that the two are often conflated, yet the causal versus predictive distinction has a large impact on each step of the statistical modeling process and on its consequences. Although not explicitly stated in the statistics methodology literature, applied statisticians instinctively sense that predicting and explaining are different. This article aims to fill a critical void: to tackle the distinction between explanatory modeling and predictive modeling.

Clearing the current ambiguity between the two is critical not only for proper statistical modeling, but more importantly, for proper scientific usage. Both explanation and prediction are necessary for generating and testing theories, yet each plays a different role in doing so. The lack of a clear distinction within statistics has created a lack of understanding in many disciplines of the difference between building sound explanatory models versus creating powerful predictive models, as well as confusing explanatory power with predictive power. The implications of this omission and the lack of clear guidelines on how to model for explanatory versus predictive goals are considerable for both scientific research and practice and have also contributed to the gap between academia and practice.

I start by defining what I term explaining and predicting. These definitions are chosen to reflect the distinct scientific goals that they are aimed at: causal explanation and empirical prediction, respectively. Explanatory modeling and predictive modeling reflect the process of using data and statistical (or data mining) methods for explaining or predicting, respectively. The term modeling is intentionally chosen over models to highlight the entire process involved, from goal definition, study design, and data collection to scientific use.

1.1 Explanatory Modeling

In many scientific fields, and especially the social sciences, statistical methods are used nearly exclusively for testing causal theory. Given a causal theoretical model, statistical models are applied to data in order to test causal hypotheses. In such models, a set of underlying factors that are measured by variables X are assumed to cause an underlying effect, measured by variable Y. Based on collaborative work with social scientists and economists, on an examination of some of their literature, and on conversations with a diverse group of researchers, I conjecture that, whether statisticians like it or not, the type of statistical models used for testing causal hypotheses in the social sciences are almost always association-based models applied to observational data. Regression models are the most common example. The justification for this practice is that the theory itself provides the causality. In other words, the role of the theory is very strong and the reliance on data and statistical modeling are strictly through the lens of the theoretical model. The theory–data relationship varies in different fields. While the social sciences are very theory-heavy, in areas such as bioinformatics and natural language processing the emphasis on a causal theory is much weaker. Hence, given this reality, I define explaining as causal explanation and explanatory modeling as the use of statistical models for testing causal explanations.

To illustrate how explanatory modeling is typically done, I describe the structure of a typical article in a highly regarded journal in the field of Information Systems (IS). Researchers in the field of IS usually have training in economics and/or the behavioral sciences. The structure of articles reflects the way empirical research is conducted in IS and related fields.

The example used is an article by Gefen, Karahanna and Straub (2003), which studies technology acceptance. The article starts with a presentation of the prevailing relevant theory(ies):

  Online purchase intentions should be explained in part by the technology acceptance model (TAM). This theoretical model is at present a preeminent theory of technology acceptance in IS.

The authors then proceed to state multiple causal hypotheses (denoted H1, H2, ... in Figure 1, right panel), justifying the merits for each hypothesis and grounding it in theory. The research hypotheses are given in terms of theoretical constructs rather than measurable variables. Unlike measurable variables, constructs are abstractions that "describe a phenomenon of theoretical interest" (Edwards and Bagozzi, 2000) and can be observable or unobservable. Examples of constructs in this article are trust, perceived usefulness (PU), and perceived ease of use (PEOU). Examples of constructs used in other fields include anger, poverty, well-being, and odor. The hypotheses section will often include a causal diagram illustrating the hypothesized causal relationship between the constructs (see Figure 1, left panel). The next step is construct operationalization, where a bridge is built between theoretical constructs and observable measurements, using previous literature and theoretical justification. Only after the theoretical component is completed, and measurements are justified and defined, do researchers proceed to the next step where data and statistical modeling are introduced alongside the statistical hypotheses, which are operationalized from the research hypotheses. Statistical inference will lead to "statistical conclusions" in terms of effect sizes and statistical significance in relation to the causal hypotheses. Finally, the statistical conclusions are converted into research conclusions, often accompanied by policy recommendations.

Fig. 1. Causal diagram (left) and partial list of stated hypotheses (right) from Gefen, Karahanna and Straub (2003).

In summary, explanatory modeling refers here to the application of statistical models to data for testing causal hypotheses about theoretical constructs. Whereas "proper" statistical methodology for testing causality exists, such as designed experiments or specialized causal inference methods for observational data [e.g., causal diagrams (Pearl, 1995), discovery algorithms (Spirtes, Glymour and Scheines, 2000), probability trees (Shafer, 1996), and propensity scores (Rosenbaum and Rubin, 1983; Rubin, 1997)], in practice association-based statistical models, applied to observational data, are most commonly used for that purpose.

1.2 Predictive Modeling

I define predictive modeling as the process of applying a statistical model or data mining algorithm to data for the purpose of predicting new or future observations. In particular, I focus on nonstochastic prediction (Geisser, 1993, page 31), where the goal is to predict the output value (Y) for new observations given their input values (X). This definition also includes temporal forecasting, where observations until time t (the input) are used to forecast future values at time t + k, k > 0 (the output). Predictions include point or interval predictions, prediction regions, predictive distributions, or rankings of new observations. Predictive model is any method that produces predictions, regardless of its underlying approach: Bayesian or frequentist, parametric or nonparametric, data mining algorithm or statistical model, etc.

1.3 Descriptive Modeling

Although not the focus of this article, a third type of modeling, which is the most commonly used and developed by statisticians, is descriptive modeling. This type of modeling is aimed at summarizing or representing the data structure in a compact manner. Unlike explanatory modeling, in descriptive modeling the reliance on an underlying causal theory is absent or incorporated in a less formal way. Also, the focus is at the measurable level rather than at the construct level. Unlike predictive modeling, descriptive modeling is not aimed at prediction. Fitting a regression model can be descriptive if it is used for capturing the association between the dependent and independent variables rather than for causal inference or for prediction. We mention this type of modeling to avoid confusion with causal-explanatory and predictive modeling, and also to highlight the different approaches of statisticians and nonstatisticians.

1.4 The Scientific Value of Predictive Modeling

Although explanatory modeling is commonly used for theory building and testing, predictive modeling is nearly absent in many scientific fields as a tool for developing theory. One possible reason is the statistical training of nonstatistician researchers. A look at many introductory statistics textbooks reveals very little in the way of prediction. Another reason is that prediction is often considered unscientific. Berk (2008) wrote, "In the social sciences, for example, one either did causal modeling econometric style or largely gave up quantitative work." From conversations with colleagues in various disciplines it appears that predictive modeling is often valued for its applied utility, yet is discarded for scientific purposes such as theory building or testing. Shmueli and Koppius (2010) illustrated the lack of predictive modeling in the field of IS. Searching the 1072 papers published in the two top-rated journals Information Systems Research and MIS Quarterly between 1990 and 2006, they found only 52 empirical papers with predictive claims, of which only seven carried out proper predictive modeling or testing.

Even among academic statisticians, there appears to be a divide between those who value prediction as the main purpose of statistical modeling and those who see it as unacademic. Examples of statisticians who emphasize predictive methodology include Akaike ("The predictive point of view is a prototypical point of view to explain the basic activity of statistical analysis" in Findley and Parzen, 1998), Deming ("The only useful function of a statistician is to make predictions" in Wallis, 1980), Geisser ("The prediction of observables or potential observables is of much greater relevance than the estimate of what are often artificial constructs-parameters," Geisser, 1975), Aitchison and Dunsmore ("prediction analysis... is surely at the heart of many statistical applications," Aitchison and Dunsmore, 1975) and Friedman ("One of the most common and important uses for data is prediction," Friedman, 1997). Examples of those who see it as unacademic are Kendall and Stuart ("The Science of Statistics deals with the properties of populations. In considering a population of men we are not interested, statistically speaking, in whether some particular individual has brown eyes or is a forger, but rather in how many of the individuals have brown eyes or are forgers," Kendall and Stuart, 1977) and more recently Parzen ("The two goals in analyzing data... I prefer to describe as 'management' and 'science.' Management seeks profit... Science seeks truth," Parzen, 2001). In economics there is a similar disagreement regarding "whether prediction per se is a legitimate objective of economic science, and also whether observed data should be used only to shed light on existing theories or also for the purpose of hypothesis seeking in order to develop new theories" (Feelders, 2002).

Before proceeding with the discrimination between explanatory and predictive modeling, it is important to establish prediction as a necessary scientific endeavor beyond utility, for the purpose of developing and testing theories. Predictive modeling and predictive testing serve several necessary scientific functions:

1. Newly available large and rich datasets often contain complex relationships and patterns that are hard to hypothesize, especially given theories that exclude newly measurable concepts. Using predictive modeling in such contexts can help uncover potential new causal mechanisms and lead to the generation of new hypotheses. See, for example, the discussion between Gurbaxani and Mendelson (1990, 1994) and Collopy, Adya and Armstrong (1994).

2. The development of new theory often goes hand in hand with the development of new measures (Van Maanen, Sorensen and Mitchell, 2007). Predictive modeling can be used to discover new measures as well as to compare different operationalizations of constructs and different measurement instruments.

3. By capturing underlying complex patterns and relationships, predictive modeling can suggest improvements to existing explanatory models.

4. Scientific development requires empirically rigorous and relevant research. Predictive modeling enables assessing the distance between theory and practice, thereby serving as a "reality check" to the relevance of theories.[1] While explanatory power provides information about the strength of an underlying causal relationship, it does not imply its predictive power.

5. Predictive power assessment offers a straightforward way to compare competing theories by examining the predictive power of their respective explanatory models.

6. Predictive modeling plays an important role in quantifying the level of predictability of measurable phenomena by creating benchmarks of predictive accuracy (Ehrenberg and Bound, 1993). Knowledge of un-predictability is a fundamental component of scientific knowledge (see, e.g., Taleb, 2007). Because predictive models tend to have higher predictive accuracy than explanatory statistical models, they can give an indication of the potential level of predictability. A very low predictability level can lead to the development of new measures, new collected data, and new empirical approaches. An explanatory model that is close to the predictive benchmark may suggest that our understanding of that phenomenon can only be increased marginally. On the other hand, an explanatory model that is very far from the predictive benchmark would imply that there are substantial practical and theoretical gains to be had from further scientific development.

For a related, more detailed discussion of the value of prediction to scientific theory development see the work of Shmueli and Koppius (2010).

[1] Predictive models are advantageous in terms of negative empiricism: a model either predicts accurately or it does not, and this can be observed. In contrast, explanatory models can never be confirmed and are harder to contradict.
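To make points 5 and 6 concrete, the following small simulation (an illustration added here, not part of the original article; the data-generating process and the two "competing theories" are invented) compares two single-predictor regression models by their accuracy on holdout data, the kind of common yardstick the list describes:

```python
import math
import random

random.seed(1)

# Invented data-generating process: y is driven by x1;
# x2 (the rival theory's variable) is only a noisy proxy of x1.
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [v + random.gauss(0, 1) for v in x1]
y = [2 * a + random.gauss(0, 0.5) for a in x1]

# Random partition into training and holdout sets.
idx = list(range(n))
random.shuffle(idx)
train, hold = idx[:150], idx[150:]

def fit_simple_ols(x, y, rows):
    """Closed-form least squares for y = b0 + b1*x on the given rows."""
    xs, ys = [x[i] for i in rows], [y[i] for i in rows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b1 = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
          / sum((a - mx) ** 2 for a in xs))
    return my - b1 * mx, b1

def rmse(x, y, rows, b0, b1):
    """Root mean squared prediction error on the given rows."""
    return math.sqrt(sum((y[i] - (b0 + b1 * x[i])) ** 2 for i in rows) / len(rows))

for name, x in [("theory A (x1)", x1), ("theory B (x2)", x2)]:
    b0, b1 = fit_simple_ols(x, y, train)
    print(name, "holdout RMSE:", round(rmse(x, y, hold, b0, b1), 3))
```

The point is not the particular models but the procedure: holdout error provides a single benchmark on which competing explanatory models can be compared, regardless of their in-sample explanatory power.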
1.5 Explaining and Predicting Are Different

In the philosophy of science, it has long been debated whether explaining and predicting are one or distinct. The conflation of explanation and prediction has its roots in philosophy of science literature, particularly the influential hypothetico-deductive model (Hempel and Oppenheim, 1948), which explicitly equated prediction and explanation. However, as later became clear, the type of uncertainty associated with explanation is of a different nature than that associated with prediction (Helmer and Rescher, 1959). This difference highlighted the need for developing models geared specifically toward dealing with predicting future events and trends, such as the Delphi method (Dalkey and Helmer, 1963). The distinction between the two concepts has been further elaborated (Forster and Sober, 1994; Forster, 2002; Sober, 2002; Hitchcock and Sober, 2004; Dowe, Gardner and Oppy, 2007). In his book Theory Building, Dubin (1969, page 9) wrote:

  Theories of social and human behavior address themselves to two distinct goals of science: (1) prediction and (2) understanding. It will be argued that these are separate goals [...] I will not, however, conclude that they are either inconsistent or incompatible.

Herbert Simon distinguished between "basic science" and "applied science" (Simon, 2001), a distinction similar to explaining versus predicting. According to Simon, basic science is aimed at knowing ("to describe the world") and understanding ("to provide explanations of these phenomena"). In contrast, in applied science, "Laws connecting sets of variables allow inferences or predictions to be made from known values of some of the variables to unknown values of other variables."

Why should there be a difference between explaining and predicting? The answer lies in the fact that measurable data are not accurate representations of their underlying constructs. The operationalization of theories and constructs into statistical models and measurable data creates a disparity between the ability to explain phenomena at the conceptual level and the ability to generate predictions at the measurable level.

To convey this disparity more formally, consider a theory postulating that construct X causes construct Y, via the function F, such that Y = F(X). F is often represented by a path model, a set of qualitative statements, a plot (e.g., a supply and demand plot), or mathematical formulas. Measurable variables X and Y are operationalizations of X and Y, respectively. The operationalization of F into a statistical model f, such as E(Y) = f(X), is done by considering F in light of the study design (e.g., numerical or categorical Y; hierarchical or flat design; time series or cross-sectional; complete or censored data) and practical considerations such as standards in the discipline. Because F is usually not sufficiently detailed to lead to a single f, often a set of f models is considered. Feelders (2002) described this process in the field of economics. In the predictive context, we consider only X, Y and f.

The disparity arises because the goal in explanatory modeling is to match f and F as closely as possible for the statistical inference to apply to the theoretical hypotheses. The data X, Y are tools for estimating f, which in turn is used for testing the causal hypotheses. In contrast, in predictive modeling the entities of interest are X and Y, and the function f is used as a tool for generating good predictions of new Y values. In fact, we will see that even if the underlying causal relationship is indeed Y = F(X), a function other than fˆ(X) and data other than X might be preferable for prediction.

The disparity manifests itself in different ways. Four major aspects are:

Causation–Association: In explanatory modeling f represents an underlying causal function, and X is assumed to cause Y. In predictive modeling f captures the association between X and Y.

Theory–Data: In explanatory modeling, f is carefully constructed based on F in a fashion that supports interpreting the estimated relationship between X and Y and testing the causal hypotheses. In predictive modeling, f is often constructed from the data. Direct interpretability in terms of the relationship between X and Y is not required, although sometimes transparency of f is desirable.

Retrospective–Prospective: Predictive modeling is forward-looking, in that f is constructed for predicting new observations. In contrast, explanatory modeling is retrospective, in that f is used to test an already existing set of hypotheses.

Bias–Variance: The expected prediction error for a new observation with value x, using a quadratic loss function,[2] is given by Hastie, Tibshirani and Friedman (2009, page 223):

  EPE = E{Y − fˆ(x)}²
      = E{Y − f(x)}² + {E(fˆ(x)) − f(x)}² + E{fˆ(x) − E(fˆ(x))}²    (1)
      = Var(Y) + Bias² + Var(fˆ(x)).

Bias is the result of misspecifying the statistical model f. Estimation variance (the third term) is the result of using a sample to estimate f. The first term is the error that results even if the model is correctly specified and accurately estimated. The above decomposition reveals a source of the difference between explanatory and predictive modeling: In explanatory modeling the focus is on minimizing bias to obtain the most accurate representation of the underlying theory. In contrast, predictive modeling seeks to minimize the combination of bias and estimation variance, occasionally sacrificing theoretical accuracy for improved empirical precision. This point is illustrated in the Appendix, showing that the "wrong" model can sometimes predict better than the correct one.

[2] For a binary Y, various 0–1 loss functions have been suggested in place of the quadratic loss function (Domingos, 2000).

The four aspects impact every step of the modeling process, such that the resulting f is markedly different in the explanatory and predictive contexts, as will be shown in Section 2.
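The bias–variance trade-off, and the claim that a "wrong" model can sometimes out-predict the correct one, can be illustrated with a small simulation (illustrative only; the coefficients, noise level, and sample sizes are invented). With a weak predictor and a small training sample, omitting the predictor introduces a little bias but removes a larger amount of estimation variance:

```python
import random

random.seed(7)

B1, B2 = 1.0, 0.15          # true coefficients; x2's effect is weak
NOISE, N_TRAIN, N_REPS = 1.0, 10, 2000

def draw(n):
    """Draw n observations from the invented true model y = B1*x1 + B2*x2 + e."""
    x1 = [random.gauss(0, 1) for _ in range(n)]
    x2 = [random.gauss(0, 1) for _ in range(n)]
    y = [B1 * a + B2 * b + random.gauss(0, NOISE) for a, b in zip(x1, x2)]
    return x1, x2, y

def fit_full(x1, x2, y):
    """Least squares for the correctly specified y = b1*x1 + b2*x2 (no intercept)."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

def fit_reduced(x1, y):
    """Least squares for the underspecified ('wrong') model y = b1*x1."""
    return sum(a * c for a, c in zip(x1, y)) / sum(a * a for a in x1)

err_full = err_reduced = 0.0
for _ in range(N_REPS):
    x1, x2, y = draw(N_TRAIN)      # small training sample
    tx1, tx2, ty = draw(50)        # fresh observations to predict
    b1, b2 = fit_full(x1, x2, y)
    r1 = fit_reduced(x1, y)
    err_full += sum((c - (b1 * a + b2 * b)) ** 2
                    for a, b, c in zip(tx1, tx2, ty)) / 50
    err_reduced += sum((c - r1 * a) ** 2 for a, c in zip(tx1, ty)) / 50

print("estimated EPE, correctly specified model:", round(err_full / N_REPS, 3))
print("estimated EPE, underspecified model:     ", round(err_reduced / N_REPS, 3))
```

Under this setup the underspecified model attains a lower average prediction error: the squared bias it incurs (on the order of B2²) is smaller than the estimation variance saved by not fitting the weak coefficient from only ten observations.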
1.6 A Void in the Statistics Literature

The philosophical explaining/predicting debate has not been directly translated into statistical language in terms of the practical aspects of the entire statistical modeling process.

A search of the statistics literature for discussion of explaining versus predicting reveals a lively discussion in the context of model selection, and in particular, the derivation and evaluation of model selection criteria. In this context, Konishi and Kitagawa (2007) wrote:

  There may be no significant difference between the point of view of inferring the true structure and that of making a prediction if an infinitely large quantity of data is available or if the data are noiseless. However, in modeling based on a finite quantity of real data, there is a significant gap between these two points of view, because an optimal model for prediction purposes may be different from one obtained by estimating the 'true model.'

The literature on this topic is vast, and we do not intend to cover it here, although we discuss the major points in Section 2.6.

The focus on prediction in the field of machine learning and by statisticians such as Geisser, Aitchison and Dunsmore, Breiman and Friedman has highlighted aspects of predictive modeling that are relevant to the explanatory/prediction distinction, although they do not directly contrast explanatory and predictive modeling.[3] The prediction literature raises the importance of evaluating predictive power using holdout data, and the usefulness of algorithmic methods (Breiman, 2001b). The predictive focus has also led to the development of inference tools that generate predictive distributions. Geisser (1993) introduced "predictive inference" and developed it mainly in a Bayesian context. "Predictive likelihood" (see Bjornstad, 1990) is a likelihood-based approach to predictive inference, and Dawid's prequential theory (Dawid, 1984) investigates inference concepts in terms of predictability. Finally, the bias–variance aspect has been pivotal in data mining for understanding the predictive performance of different algorithms and for designing new ones.

Another area in statistics and econometrics that focuses on prediction is time series. Methods have been developed specifically for testing the predictability of a series [e.g., random walk tests or the concept of Granger causality (Granger, 1969)], and for evaluating predictability by examining performance on holdout data. The time series literature in statistics is dominated by extrapolation models such as ARIMA-type models and exponential smoothing methods, which are suitable for prediction and description, but not for causal explanation. Causal models for time series are common in econometrics (e.g., Song and Witt, 2000), where an underlying causal theory links constructs, which lead to operationalized variables, as in the cross-sectional case. Yet, to the best of my knowledge, there is no discussion in the statistics time series literature regarding the distinction between predictive and explanatory modeling, aside from the debate in economics regarding the scientific value of prediction.

To conclude, the explanatory/predictive modeling distinction has been discussed directly in the model selection context, but not in the larger context. Areas that focus on developing predictive modeling, such as machine learning and statistical time series, and "predictivists" such as Geisser, have considered prediction as a separate issue, and have not discussed its principal and practical distinction from causal explanation in terms of developing and testing theory. The goal of this article is therefore to examine the explanatory versus predictive debate from a statistical perspective, considering how modeling is used by nonstatistician scientists for theory development.

The remainder of the article is organized as follows. In Section 2, I consider each step in the modeling process in terms of the four aspects of the predictive/explanatory modeling distinction: causation–association, theory–data, retrospective–prospective and bias–variance. Section 3 illustrates some of these differences via two examples. A discussion of the implications of the predict/explain conflation, conclusions, and recommendations are given in Section 4.

[3] Geisser distinguished between "[statistical] parameters" and "observables" in terms of the objects of interest. His distinction is closely related, but somewhat different from, our distinction between theoretical constructs and measurements.

2. TWO MODELING PATHS

In the following I examine the process of statistical modeling through the explain/predict lens, from goal definition to model use and reporting. For clarity, I broke down the process into a generic set of steps, as depicted in Figure 2. In each step I point out differences in the choice of methods, criteria, data, and information to consider when the goal is predictive versus explanatory. I also briefly describe the related statistics literature. The conceptual and practical differences invariably lead to a difference between a final explanatory model and a predictive one, even though they may use the same initial data. Thus, a priori determination of the main study goal as either explanatory or predictive[4] is essential to conducting adequate modeling. The discussion in this section assumes that the main research goal has been determined as either explanatory or predictive.

Fig. 2. Steps in the statistical modeling process.

[4] The main study goal can also be descriptive.

2.1 Study Design and Data Collection

Even at the early stages of study design and data collection, issues of what and how much data to collect, according to what design, and which collection instrument to use are considered differently for prediction versus explanation. Consider sample size. In explanatory modeling, where the goal is to estimate the theory-based f with adequate precision and to use it for inference, statistical power is the main consideration. Reducing bias also requires sufficient data for model specification testing. Beyond a certain amount of data, however, extra precision is negligible for purposes of inference. In contrast, in predictive modeling, f itself is often determined from the data, thereby requiring a larger sample for achieving lower bias and variance. In addition, more data are needed for creating holdout datasets (see Section 2.2). Finally, predicting new individual observations accurately, in a prospective manner, requires more data than retrospective inference regarding population-level parameters, due to the extra uncertainty.

A second design issue is sampling scheme. For instance, in the context of hierarchical data (e.g., sampling students within schools), Afshartous and de Leeuw (2005) noted, "Although there exists an extensive literature on estimation issues in multilevel models, the same cannot be said with respect to prediction." Examining issues of sample size, sample allocation, and multilevel modeling for the purpose of "predicting a future observable y∗j in the Jth group of a hierarchical dataset," they found that allocation for estimation versus prediction should be different: "an increase in group size n is often more beneficial with respect to prediction than an increase in the number of groups J... [whereas] estimation is more improved by increasing the number of groups J instead of the group size n." This relates directly to the bias–variance aspect. A related issue is the choice of f in relation to sampling scheme. Afshartous and de Leeuw (2005) found that for their hierarchical data, a hierarchical f, which is more appropriate theoretically, had poorer predictive performance than a nonhierarchical f.

A third design consideration is the choice between experimental and observational settings. Whereas for causal explanation experimental data are greatly preferred, subject to availability and resource constraints, in prediction sometimes observational data are preferable to "overly clean" experimental data, if they better represent the realistic context of prediction in terms of the uncontrolled factors, the noise, the measured response, etc. This difference arises from the theory–data and prospective–retrospective aspects. Similarly, when choosing between primary data (data collected for the purpose of the study) and secondary data (data collected for other purposes), the classic criteria of data recency, relevance, and accuracy (Patzer, 1995) are considered from a different angle. For example, a predictive model requires the secondary data to include the exact X, Y variables to be used at the time of prediction, whereas for causal explanation different operationalizations of the constructs X, Y may be acceptable.

In terms of the data collection instrument, whereas in explanatory modeling the goal is to obtain a reliable and valid instrument such that the data obtained represent the underlying construct adequately (e.g., item response theory in psychometrics), for predictive purposes it is more important to focus on the measurement quality and its meaning in terms of the variable to be predicted.

Finally, consider the field of design of experiments: two major experimental designs are factorial designs and response surface methodology (RSM) designs. The former is focused on causal explanation in terms of finding the factors that affect the response. The latter is aimed at prediction: finding the combination of predictors that optimizes Y. Factorial designs employ a linear f for interpretability, whereas RSM designs use optimization techniques and estimate a nonlinear f from the data, which is less interpretable but more predictively accurate.[5]

[5] I thank Douglas Montgomery for this insight.

2.2 Data Preparation

We consider two common data preparation operations: handling missing values and data partitioning.

2.2.1 Handling missing values. Most real datasets consist of missing values, thereby requiring one to identify the missing values, to determine the extent and type of missingness, and to choose a course of action accordingly. Although a rich literature exists on data imputation, it is monopolized by an explanatory context. In predictive modeling, the solution strongly depends on whether the missing values are in the training data and/or the data to be predicted. For example, Sarle (1998) noted:

  If you have only a small proportion of cases with missing data, you can simply throw out those cases for purposes of estimation; if you want to make predictions for cases with missing inputs, you don't have the option of throwing those cases out.

Sarle further listed imputation methods that are useful for explanatory purposes but not for predictive purposes and vice versa. One example is using regression models with dummy variables that indicate missingness, which is considered unsatisfactory in explanatory modeling, but can produce excellent predictions. The usefulness of creating missingness dummy variables was also shown by Ding and Simonoff (2010). In particular, whereas the classic explanatory approach is based on the Missing-At-Random, Missing-Completely-At-Random or Not-Missing-At-Random classification (Little and Rubin, 2002), Ding and Simonoff (2010) showed that for predictive purposes the important distinction is whether the missingness depends on Y or not. They concluded:

  In the context of classification trees, the relationship between the missingness and the dependent variable, rather than the standard missingness classification approach of Little and Rubin (2002)... is the most helpful criterion to distinguish different missing data methods.

Moreover, missingness can be a blessing in a predictive context, if it is sufficiently informative of Y (e.g., missingness in financial statements when the goal is to predict fraudulent reporting).

Finally, a completely different approach for handling missing data for prediction, mentioned by Sarle (1998) and further developed by Saar-Tsechansky and Provost (2007), considers the case where to-be-predicted observations are missing some predictor information, such that the missing information can vary across different observations. The proposed solution is to estimate multiple "reduced" models, each excluding some predictors. When predicting an observation with missingness on a certain set of predictors, the model that excludes those predictors is used. This approach means that different reduced models are created for different observations. Although useful for prediction, it is clearly inappropriate for causal explanation.

2.2.2 Data partitioning. A popular solution for avoiding overoptimistic predictive accuracy is to evaluate performance not on the training set, that is, the data used to build the model, but rather on a holdout sample which the model "did not see." The creation of a holdout sample can be achieved in various ways, the most commonly used being a random partition of the sample into training and holdout sets. A popular alternative, especially with scarce data, is cross-validation.

2.3 Exploratory Data Analysis

Exploratory data analysis (EDA) is a key initial step in both explanatory and predictive modeling. It consists of summarizing the data numerically and graphically, reducing their dimension, and "preparing" for the more formal modeling step. Although the same set of tools can be used in both cases, they are used in a different fashion. In explanatory modeling, exploration is channeled toward the theoretically specified causal relationships, whereas in predictive modeling EDA is used in a more free-form fashion, supporting the purpose of capturing relationships that are perhaps unknown or at least less formally formulated.

One example is how data visualization is carried out. Fayyad, Grinstein and Wierse (2002, page 22) contrasted "exploratory visualization" with "confirmatory visualization":

  Visualizations can be used to explore data, to confirm a hypothesis, or to manipulate a viewer... In exploratory visualization the user does not necessarily know what he is looking for. This creates a dy-
Other alternatives are re- namicscenarioinwhichinteractioniscrit- sampling methods, such as bootstrap, which can ical...Inaconfirmatory visualization, the be computationally intensive but avoid “bad par- user has a hypothesis that needs to be titions” and enable predictive modeling with small tested. This scenario is more stable and datasets. predictable. System parameters are often Datapartitioningisaimedatminimizingthecom- predetermined. bined bias and variance by sacrificing some bias in Hence, interactivity, which supports exploration returnforareductioninsamplingvariance.Asmaller across a wide and sometimes unknown terrain, is sample is associated with higher bias when f is esti- very useful for learning about measurement quality matedfromthedata,whichiscommoninpredictive and associations that are at the core of predictive modeling but not in explanatory modeling. Hence, modeling, but much less so in explanatory model- data partitioning is useful for predictive modeling ing, where the data are visualized through the the- but less so for explanatory modeling. With today’s oretical lens. abundanceof large datasets, wherethebias sacrifice is practically small, data partitioning has become a A second example is numerical summaries. In a standard preprocessing step in predictive modeling. predictive context, one might explore a wide range In explanatory modeling, data partitioning is less of numerical summaries for all variables of inter- commonbecauseofthereductioninstatisticalpower. est,whereas inan explanatory model,thenumerical When used, it is usually done for the retrospective summaries would focus on the theoretical relation- purpose of assessing the robustness of fˆ. A rarer ships. 
For example, in order to assess the role of a yet important use of data partitioning in explana- certain variable as a mediator, its correlation with tory modeling is for strengthening model validity, the response variable and with other covariates is by demonstrating some predictive power. Although examined by generating specific correlation tables. one would not expect an explanatory model to be A third example is the use of EDA for assess- optimal in terms of predictive power, it shouldshow ingassumptions of potential models (e.g., normality some degree of accuracy (see discussion in Section or multicollinearity) and exploring possible variable 4.2). transformations. Here, too, an explanatory context 10 G. SHMUELI would be more restrictive in terms of the space ex- in X being correlated with the error term. Winkel- plored. mann (2008) gave the example of a hypothesis that Finally, dimension reduction is viewed and used health insurance (X) affects the demand for health differently.Inpredictivemodeling,areductioninthe servicesY.Theoperationalizedvariablesare“health numberof predictors can help reducesampling vari- insurance status” (X) and “number of doctor con- ance. Hence, methods such as principal components sultations” (Y). Omitting an input measurement analysis (PCA) or other data compression methods Z for “true health status” (Z) from the regression that are even less interpretable (e.g., singular value model f causes endogeneity because X can be de- decomposition) are often carried out initially. They termined by Y (i.e., reverse causation), which man- ifests as X being correlated with the error term in may later lead to the use of compressed variables f. Endogeneity can arise due to other reasons such (such as the first few components) as predictors, as measurement error in X. Because of the focus even if those are not easily interpretable. 
PCA is in explanatory modeling on causality and on bias, also used in explanatory modeling, but for a differ- there is a vast literature on detecting endogeneity ent purpose. For questionnaire data, PCA and ex- and on solutions such as constructing instrumen- ploratory factor analysis are used to determine the tal variables and using models such as two-stage- validity of the survey instrument. The resulting fac- least-squares (2SLS).Anotherrelated term issimul- tors are expected to correspond to the underlying taneous causality, which gives rise to special mod- constructs. In fact, the rotation step in factor anal- els such as Seemingly Unrelated Regression (SUR) ysis is specifically aimed at making the factors more (Zellner, 1962). In terms of chronology, a causal ex- interpretable. Similarly, correlations are used for as- planatory model can include only “control” vari- sessing the reliability of the survey instrument. ablesthattakeplacebeforethecausalvariable(Gel- 2.4 Choice of Variables man et al., 2003). And finally, for reasons of model identifiability (i.e., given the statistical model, each Thecriteria for choosing variables differ markedly causal effect can be identified), one is required to in explanatory versus predictive contexts. include main effects in a model that contains an in- In explanatory modeling,wherevariables are seen teraction term between those effects. We note this asoperationalizedconstructs,variablechoiceisbased practice because it is not necessary or useful in the on the role of the construct in the theoretical causal predictive context, due to the acceptability of unin- structureandontheoperationalizationitself.Abroad terpretable models and the potential reduction in terminology related to different variable roles exists sampling variance when dropping predictors (see, in various fields: in the social sciences—antecedent, e.g., the Appendix). 
consequent, mediator and moderator6 variables; in In predictive modeling, the focus on association pharmacology and medical sciences—treatment and ratherthancausation,thelackofF,andtheprospec- control variables;andinepidemiology—exposure and tivecontext,meanthatthereisnoneedtodelveinto confounding variables. Carte andCraig (2003)men- the exact role of each variable in terms of an under- tioned that explaining moderating effects has be- lying causal structure. Instead, criteria for choosing come an important scientific endeavor in the field of predictorsarequality of theassociation between the Management Information Systems. Another impor- predictors and the response, data quality, and avail- tant term common in economics is endogeneity or ability of the predictors at the time of prediction, “reverse causation,” which results in biased param- known as ex-ante availability. In terms of ex-ante eter estimates. Endogeneity can occur due to dif- availability, whereas chronological precedence of X ferent reasons. One reason is incorrectly omitting to Y is necessary in causal models, in predictive an input variable, say Z, from f when the causal models not only must X precede Y, but X must be available at the time of prediction. For instance, ex- construct Z is assumed to cause X and Y. In a re- plaining wine quality retrospectively would dictate gression model of Y on X, the omission of Z results including barrel characteristics as a causal factor. The inclusion of barrel characteristics in a predic- 6“A moderator variable is one that influences the tive model of future wine quality would be impossi- strength of a relationship between two other vari- ble if at the time of prediction the grapes are still ables, and a mediator variable is one that explains on the vine. See the eBay example in Section 3.2 for the relationship between the two other variables” (from http://psych.wisc.edu/henriques/mediator.html). another example.
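To make the missingness-dummy device discussed in Section 2.2.1 (Sarle, 1998; Ding and Simonoff, 2010) concrete, here is a minimal sketch in plain Python. The function name, the dict-of-rows data layout, and the mean-fill choice are my own illustrative assumptions, not prescriptions from the cited literature.

```python
def add_missingness_dummy(rows, col):
    """Fill missing values in `col` and add a 0/1 indicator column
    recording where the value was missing (None marks missingness).

    The indicator column lets a predictive model exploit informative
    missingness; the fill value here is the mean of the observed cases.
    Returns new rows; the caller's data are not mutated.
    """
    observed = [r[col] for r in rows if r[col] is not None]
    fill = sum(observed) / len(observed) if observed else 0.0
    out = []
    for r in rows:
        r = dict(r)                      # copy before modifying
        r[col + "_missing"] = 0 if r[col] is not None else 1
        if r[col] is None:
            r[col] = fill
        out.append(r)
    return out
```

Both the filled column and the `_missing` indicator would then enter the predictive model as ordinary predictors.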
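The "reduced models" approach of Saar-Tsechansky and Provost (2007), described in Section 2.2.1, can be sketched as follows. The routing logic is the point of the sketch; the stand-in model class (an additive collection of simple-regression fits) and all function names are my own assumptions, not the authors' implementation.

```python
from itertools import combinations

def fit_reduced_models(col_names, train_X, train_y, fit):
    """Fit one model per non-empty subset of predictors.
    `train_X` maps column name -> list of values; `fit` builds a model
    from (columns, X, y) and returns a predict(row) callable."""
    models = {}
    names = sorted(col_names)
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            models[frozenset(subset)] = fit(subset, train_X, train_y)
    return models

def predict_with_missing(models, row):
    """Route an observation to the model fit without its missing predictors."""
    available = frozenset(c for c, v in row.items() if v is not None)
    return models[available](row)

def fit(subset, X, y):
    """Illustrative 'reduced' model: additive simple-regression terms
    over the columns in `subset` (a stand-in for any model class)."""
    n = len(y)
    ybar = sum(y) / n
    terms = {}
    for c in subset:
        xbar = sum(X[c]) / n
        sxx = sum((v - xbar) ** 2 for v in X[c])
        sxy = sum((X[c][i] - xbar) * (y[i] - ybar) for i in range(n))
        terms[c] = (sxy / sxx, xbar)
    def predict(row):
        return ybar + sum(b * (row[c] - xb) for c, (b, xb) in terms.items())
    return predict
```

Note that each new observation may be scored by a different reduced model, which is exactly why the device serves prediction but not causal explanation.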
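The random training/holdout partition of Section 2.2.2 takes only a few lines; this is a sketch with names and the default holdout fraction chosen by me (in practice one would typically call a library routine, and with scarce data substitute cross-validation or the bootstrap, as noted above).

```python
import random

def partition(rows, holdout_frac=0.3, seed=42):
    """Randomly split `rows` into training and holdout sets.
    The model is fit on the training set only; predictive accuracy is
    reported on the holdout set, which the model 'did not see'."""
    rng = random.Random(seed)            # fixed seed -> reproducible split
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * (1 - holdout_frac)))
    return shuffled[:cut], shuffled[cut:]
```

The holdout estimate of accuracy is what guards against the overoptimism of evaluating a predictive model on its own training data.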
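Two-stage least squares, mentioned in Section 2.4 as a standard remedy for endogeneity, amounts to (i) regressing the endogenous X on an instrument Z and (ii) regressing Y on the stage-one fitted values. A univariate sketch, with made-up function names and noiseless demonstration data (a real analysis would use an econometrics package):

```python
def ols_slope(x, y):
    """Slope and intercept of the simple regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    return b, ybar - b * xbar

def two_stage_least_squares(z, x, y):
    """2SLS with a single instrument z.
    Stage 1 projects x on z; stage 2 regresses y on the fitted values,
    purging the part of x that is correlated with the error term."""
    b1, a1 = ols_slope(z, x)
    x_hat = [a1 + b1 * zi for zi in z]   # stage-1 fitted values
    return ols_slope(x_hat, y)           # (slope, intercept) estimate
```

In this just-identified single-instrument case the slope estimate reduces to the ratio of the Z–Y and Z–X covariances.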
