All life is problem solving. — Karl Popper

Statistics and Language Acquisition Research

Statistics is defined as the science of collecting, organizing, presenting, analyzing, and interpreting data for the purpose of assisting in making more effective decisions. In this definition, two key terms need elaboration: data and effective decision making. The final step in statistics is making sound and effective decisions, and to arrive at effective decisions the role of data is heavily underlined: the more accurate the decisions we want to make, the more appropriate the data we need to collect. How is it possible to arrive at sound decisions when the data on which those decisions are based are not relevant? Questions such as what data is, what appropriate data means, and how to collect and measure data are dealt with in measurement theories. Decision makers make better decisions when they use all available information in an effective and meaningful way, and the primary role of statistics is to provide decision makers with methods for obtaining and analyzing information to support these decisions. As far as second language acquisition research (SLAR) is concerned, Messick (1989) states that measurement is both a data-driven and a theory-driven endeavor. Measurement in research provides systematic means for gathering evidence about human behaviors. Using measurement procedures, researchers elicit, observe, record, and analyze the relevant behavior (here, data) of language learners. In addition, measurement theories help SLA researchers interpret the obtained results in the light of SLA theories. Norris and Ortega (2003) hold that measurement proceeds through several interrelated stages. They divide the measurement process into two broad categories, each consisting of three phases. The general categories are conceptual and procedural (see Figure 1).
The conceptual component of the measurement process consists of construct definition, behavior identification, and task specification. In terms of construct definition, researchers should explicitly explain "what it is they want to know". In current views, a construct refers to the interplay between a theoretical explanation of a phenomenon and the data collected about that phenomenon. Thus, if constructs are not clearly specified, behaviors (data) cannot be linked with them. The second phase, behavior identification, is related to the first: as mentioned above, we need learners' behavior for decision making and interpretation, so researchers should know which particular type of behavior should be observed and collected to arrive at the intended interpretations. In the following chapters on parametric and nonparametric statistical tests, you will see how the selection of appropriate and relevant evidence determines the type of statistical indexes needed for data analysis. Task specification, the third phase of the conceptual category, refers to the decisions we need to make concerning the specific tasks and situations used to elicit the targeted behaviors. If a task selected in the course of research cannot provide the evidence required, the measurement validity is naturally at stake.

Figure 1: Measurement Process (Norris & Ortega, 2003)

In the procedural stages of the measurement process, the researchers' onus is to proceduralize the outcomes of the conceptual stages and to use mechanisms to elicit and analyze the data that provide evidence for interpretations. Three stages serve the objectives of this procedural phase. In the first, the behavior elicitation stage, researchers use specific tasks to elicit data. Observation scoring, the second stage, deals with scoring the data.
Remember that scoring in practice should be clearly linked to the intended interpretations, and that it comes in different types: categorical, ordinal, interval, and ratio. Finally, in the data analysis stage, the scores are summarized and interpreted in light of the relevant statistical paradigms. It appears that for second/foreign language researchers, statistics is mostly of interest when the procedural stages of the measurement process are involved. Given that the primary purpose of second language acquisition research is to understand and inform improvements in language teaching and learning, the validity of its interpretations depends crucially on selecting the appropriate measurement level for the appropriate statistics. Consequently, behavior elicitation, observation scoring, and data analysis are the keys to arriving at more effective decisions on the part of SLA researchers.

Which One is Better: a Mole Wrench or a Pipe Wrench?

A decision always facing applied linguistics researchers is the choice between parametric and nonparametric statistical tests. A common misconception appears to be the priority of parametric over nonparametric tests because of the so-called power of parametric tests. In the present section, we examine the critical factors contributing to the selection of appropriate statistical tests. The factors to be discussed are power, sample size, and effect size, as these are the core issues in decision making. The purpose is to arrive at the conclusion that the selection of a test depends upon the function for which it is formulated. I surmise that the choice between a parametric and a nonparametric test is similar to the decision between a mole wrench and a pipe wrench. In Webster's Unabridged Dictionary (Copyright Random House 2000), a pipe wrench is defined as "a tool having two toothed jaws, one fixed and the other free to grip pipes and other tubular objects when the tool is turned in one direction only."
A wrench is defined as "a tool for gripping and turning or twisting the head of a bolt, a nut, a pipe, or the like, commonly consisting of a bar of metal with fixed or adjustable jaws." In terms of this wrench metaphor, the selection of one wrench over another clearly depends on numerous considerations. In the same way, a parametric test appears to be as useful as a nonparametric test, depending on the assumptions on which the tests are designed. In chapters two and three we will expound the main assumptions of both parametric and nonparametric tests.

Test Power

A critical decision to be made by applied linguistics researchers has always been the choice among alternative statistical tests to meet the requirements of their research design. Which test is the best? When there are alternative statistical tests to handle the data (parametric and nonparametric tests), a researcher should first of all consider the power of each test and select the most powerful one. A statistical test's power is the probability that the test procedure will result in statistical significance. Since statistical significance is typically the objective of research, it is of primary importance for SLA researchers to plan a study so as to achieve high power. Because of the difficulty of the calculations, however, power is often ignored, or some so-called rule of thumb is adopted by researchers. Fortunately, thanks to statistical packages and Internet-based software, power analysis is no longer a complex procedure (for a free Internet-based trial version of power analysis, see www.Power-Analysis.com). Traditionally in applied linguistics research, statistical tests assume a null hypothesis of no difference, and hypothesis testing means making a decision, based on sample data obtained from the population, between the null hypothesis and an alternative hypothesis. Hypothesis testing requires a decision concerning whether the sample data are consistent or inconsistent with the null hypothesis.
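To make the notion of power concrete, the sketch below approximates the power of a two-sided, two-sample test from the per-group sample size and a standardized effect size (Cohen's d). The normal approximation and the illustrative values (d = .5, 64 learners per group) are my own assumptions for demonstration, not figures from this chapter:

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power_two_sample(d, n_per_group, alpha_z=1.959964):
    """Approximate power of a two-sided, two-sample test at alpha = .05.

    d is Cohen's standardized mean difference; alpha_z is the two-sided
    critical value. This is a normal (large-sample) approximation, so it
    slightly overstates power for small groups.
    """
    noncentrality = d * math.sqrt(n_per_group / 2)
    return normal_cdf(noncentrality - alpha_z)

# A medium effect (d = .5) with 64 learners per group reaches roughly
# the conventional .80 power benchmark.
print(round(power_two_sample(0.5, 64), 2))  # → 0.81
```

Note how the same effect with more learners per group yields higher power, which is the logic behind a priori sample-size planning.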
We can see the probable outcomes of this decision in Table 1:

Table 1: Type I and Type II Errors

                                True in the Population
  Researcher's Decision         H0 is True            H0 is False
  Rejects H0 (accepts H1)       Type I error (α)      Correct decision
  Accepts H0                    Correct decision      Type II error (β)

As displayed in Table 1, there are two possible outcomes: either the decisions made are correct or they are in error. As far as the former is concerned, if a treatment is really effective and the research succeeds in rejecting the null hypothesis, or if a treatment has no real effect and the research does not reject the null hypothesis, the study's result is correct. In many cases, however, this is not so, and we should consider two types of potential error in SLA researchers' decision making: Type I (α) and Type II (β) errors. A Type I error happens when the treatment really has no effect but we mistakenly reject the null hypothesis. A Type II error occurs when the treatment is effective but we fail to reject the null hypothesis. Supposing that the null hypothesis is true and alpha is set at .05, we expect a Type I error to occur in 5% of all studies; that is, the rate of Type I error is equal to alpha. Supposing the null hypothesis is false, we would expect a Type II error to occur in the proportion of studies denoted by one minus power, and this error rate is known as beta. Test power can thus be defined as the probability of not committing a Type II error, so there is an inverse relationship between a test's power and Type II error: as power increases, Type II error decreases. According to Cohen (1988), power refers to "the probability that the statistical test will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists" (p. 56). In simple terms, we can consider the power of a test as our probability of finding what we were looking for in the research design.
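The claim that the Type I error rate equals alpha can be checked with a small simulation. The sketch below repeatedly draws two samples from the same population (so the null hypothesis is true by construction) and counts how often a two-sample test nevertheless "finds" a significant difference; the sample sizes, seed, and approximate critical value are illustrative assumptions, not values from this chapter:

```python
import random
import statistics

def welch_t(x, y):
    """Welch's t statistic for two independent samples."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    se = (vx / len(x) + vy / len(y)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / se

random.seed(42)
trials, rejections = 2000, 0
for _ in range(trials):
    # Both groups come from the SAME population: any "significant"
    # difference is, by definition, a Type I error.
    x = [random.gauss(0, 1) for _ in range(30)]
    y = [random.gauss(0, 1) for _ in range(30)]
    # |t| > 2.0 approximates the two-sided .05 critical value at ~58 df
    if abs(welch_t(x, y)) > 2.0:
        rejections += 1

print(round(rejections / trials, 3))  # empirical Type I error rate, close to alpha = .05
```

Across many simulated studies the rejection rate hovers around .05, which is exactly what "the rate of Type I error is equal to alpha" predicts.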
In other words, power is the probability of correctly rejecting the null hypothesis and is equal to 1 – β. In statistical tests, the desired power value is conventionally set at .80. Therefore, in the same way that α is conventionally set at .05 by researchers, β is conventionally set at .20, which corresponds to a power of at least .80. A power of .85, for instance, shows that the study has sufficient statistical power for the researcher to accept the alternative hypothesis with confidence.

Power analysis is usually done a priori, when researchers determine power before the study and data collection. In this case, power analysis can be used to decide upon the sample size needed to achieve appropriate power. This helps economize time and resources by avoiding a study that has very little chance of finding a significant effect; moreover, a priori power analysis ensures that researchers do not waste time and resources testing more subjects than are necessary to detect an effect. Power analysis can also be conducted post hoc, after a study has been completed, to determine what the power of the study actually was. Such post hoc analysis can help researchers explain the results if the study did not find any significant effects.

Using SPSS, the otherwise complex procedure of power analysis can be performed with ease. From the menus, select:

Analyze → General Linear Model → Univariate

In the Univariate window, select a variable from the left dialogue box and move it to the Dependent Variable box on the right. Then select Options to open the Univariate Options window. In the Display section, check Observed power and click Continue. Finally, click OK. The SPSS output is a table such as the following:

Tests of Between-Subjects Effects (c)
Dependent Variable: Comp2output1

  Source            Type II Sum    df   Mean Square   F        Sig.   Noncent.    Observed
                    of Squares                                        Parameter   Power (a)
  Corrected Model   .000 (b)       0    .             .        .      .000        .
  Intercept         196.000        1    196.000       32.667   .011   32.667      .953
  Error             18.000         3    6.000
  Total             214.000        4
  Corrected Total   18.000         3

  a. Computed using alpha = .05
  b. R Squared = .000 (Adjusted R Squared = .000)
  c. Weighted Least Squares Regression - Weighted by preposition

As the table displays, the calculated observed power is .95, which is above .80. The interpretation is that the study had high power to detect the effect the research hypothesis claimed.

Sample Size and Power

Power is influenced by a host of factors, including the sample size, the effect size, the level of error in experimental measurement, and the type of statistical test used. Sample size is a major factor contributing to test power. In this regard, power analysis is recommended to make certain that the sample size is large enough for the statistical tests to actually detect the differences they are meant to find. Statistically speaking, if the sample size is too small, the standard statistical tests will not have sufficient statistical power to find differences that actually exist: no significant difference is found although in reality there is a difference. As was mentioned previously about