ACT Research Report Series 2015 (5)

Statistical Properties of School Value-Added Scores Based on Assessments of College Readiness

Dina Bassiri, PhD

Dina Bassiri is a senior research scientist in the Statistical and Applied Research Department, specializing in educational outcomes research and student growth models.

Acknowledgments
Thanks to Joann Moore for her assistance in preparing the data for this study and to Jeff Allen, Jim Sconing, Richard Sawyer, Chrys Dougherty, Greg Bales, and Karen Zimmerman for their reviews and helpful comments and suggestions on an earlier draft of this report.

© 2015 by ACT, Inc. All rights reserved. ACT®, ACT Explore®, and ACT Plan® are registered trademarks of ACT, Inc.

Contents
Abstract
Introduction
Projection and Value-Added Models
Research Questions
Methodology
Models and Growth Projection Accuracy
Sample and Data
Analysis
Results
How Strongly Related Are Students' Projected Scores to Their Actual Scores?
How Often Are School Value-Added Scores Classified as Significantly Above or Below Average?
What Is the Variability Across School Cohorts in the Proportion of Students Meeting or Exceeding Expected Growth?
What School Characteristics Are Related to School Value-Added Scores?
To What Extent Do Value-Added Scores Vary Across High- and Low-Poverty Schools?
How Consistent Are a School's Value-Added Scores?
Summary
Conclusion
Appendix A
Appendix B
References

Abstract

The 2001 reauthorization of the Elementary and Secondary Education Act (ESEA), known as No Child Left Behind, and more recent federal initiatives such as Race to the Top and the ESEA flexibility waiver have brought student growth to the forefront of education reform for assessing school and teacher effectiveness. This study examined growth projections and school value-added scores based on the ACT college and career readiness system for different growth periods (grades 8–10, 9–10, 10–11, 10–12, and 8–12). For this report, I investigated methodological questions related to value-added modeling. The analyses are based on large samples of students who took two or more of the tests in the ACT college and career readiness system in grades 8, 9, 10, and 11/12.
For each growth period, I examined high schools' effects on student achievement in English, mathematics, reading, and science, controlling for prior test scores and for varying time between tests. Across the various growth periods, I found differences in the variance of school effects (value-added scores), as well as differences among types of schools in the proportion of their students classified as significantly above or below average. Most school effects could not be distinguished from the average school effect with statistical confidence; this was more often the case for small schools than for large schools. I found evidence of consistency in value-added scores by examining same-year and same-cohort correlations for different growth periods and by examining correlations of value-added scores for adjacent cohorts (one, two, and three years apart) at the same schools. Value-added scores based on grades 8–12 data showed greater reliability over time. The grades 8–12 data also generated value-added scores that performed better than scores based on shorter growth periods in differentiating schools and in achieving statistical significance with smaller sample sizes. I also found positive associations between value-added scores and mean prior academic achievement, and inverse relationships with school poverty level and the proportion of racial/ethnic minority students. Compared to other school characteristics, grade-level enrollment and the proportion of students tested generally had weaker associations with value-added measures. Overall, the relationships between school characteristics and value-added scores suggest that different types of schools would be expected to have different mean value-added scores.

Introduction

The No Child Left Behind reauthorization of the Elementary and Secondary Education Act (ESEA) in 2001 (NCLB, 2002) brought great attention to the monitoring of educational progress and sparked questions about how best to measure the effectiveness of schools and teachers. More recent initiatives, including Race to the Top and the ESEA flexibility guidelines, continue to spur development of methods to accurately and fairly monitor school and teacher performance. Under NCLB, schools are accountable for increasing the percentage of students meeting proficiency cutoffs. Status measures, such as proficiency rates, use test scores from a single point in time, so they cannot directly measure how much schools or teachers affect learning throughout the year. There is considerable agreement among educators and researchers that status measures do not measure educator effectiveness well. Growth models, on the other hand, are less affected by the influence of student background and prior academic achievement. This is particularly important given the non-random assignment of students to schools. As pointed out by Ballou, Sanders, and Wright (2004), the non-random distribution of student characteristics results in inequalities in the composition of student bodies from one school to another that may be mistakenly construed as disparities in school effectiveness. Furthermore, research indicates that in addition to effects associated with students' social class and prior achievement, aggregate school socioeconomic status has a significant effect on student outcomes (Willms, 1986).
Research has also shown that academically related psychosocial behaviors such as motivation, social connectedness, school attendance, adherence to rules, and avoidance of drugs are important predictors of academic success in middle school and high school (Kaufman & Bradbury, 1992; Rumberger, 1995; Worrell & Hale, 2001; Jones & Byrnes, 2006).

Generally, growth models are used to measure how much growth occurred over a period of time for a student or group of students, to project how much growth will occur, or to indicate how much growth needs to occur to meet a certain achievement level (also known as a growth-to-standards model). Relative to status measures, measures of student growth are less vulnerable to the influence of student backgrounds (Riddle, 2008).

Projection and Value-Added Models

Projection and value-added models (VAMs) are types of growth models. As the name suggests, VAMs attempt to determine the value that is added to student learning by a teacher or school (i.e., producing more growth than expected, given specific characteristics of the student, school, or teacher). Braun, Chudowsky, and Koenig (2010) refer to VAMs as "a variety of sophisticated statistical techniques that use one or more years of prior student test scores, as well as other data, to adjust for preexisting differences among students when calculating contributions to student test performance" (p. 1). According to Harris (2009), "the term is used to describe analyses using longitudinal student-level test score data to study the educational input-output relationship, including especially the effects of individual teachers (and schools) on student achievement" (p. 321). Unlike some applications of growth modeling, VAMs are often specified with the intent of estimating the causal effects of schools (or teachers) on students (Briggs, 2011). VAMs purport to quantify the effect on student achievement that, on average, a school or a teacher would have on similar students.1

1 Value-added analyses return norm-referenced effectiveness quantities indicating whether a teacher or school is significantly more or less effective than a group's average (e.g., district or state). Note that the effects are centered about zero (or, in the case of the Student Growth Percentile, around the 50th percentile), indicating average effectiveness. Like all norm-referenced quantities, value-added effects do not implicitly include criteria stipulating what is enough effectiveness. The interest in incorporating growth measures into criterion-referenced assessment led to the development of criterion-referenced growth-to-standard models, which stipulate adequate progress toward desired achievement outcomes (Betebenner, 2009).

Projection models are methods for forecasting (predicting) future test scores based on current and/or prior test scores and other variables. Typically, projection models are implemented by regressing later test scores on prior test scores (and other variables) for one cohort, and then applying the regression parameters to the current cohort to obtain projected scores.2 VAMs and projection models can work hand in hand. VAMs can be based on residual scores, defined as the difference between actual and projected scores. For example, a school value-added score can be obtained by averaging the residual scores for students who attended the school during the growth period spanned by the tests.

2 In this study, projected scores are not used to predict the scores of different students in the future.

School districts and state education agencies across the country are increasingly relying on VAMs to measure school and teacher performance, sometimes with high stakes attached. Using value-added measures to inform high-stakes decisions is certainly controversial, and there is not currently a consensus in the research community on the use of value-added measures for evaluation and decision making. Some of the disagreement is rooted in technical aspects and statistical properties of VAMs and their use in accountability (Harris, 2009; Braun et al., 2010). In addition, the American Statistical Association (2014) has issued an official statement on the use of VAMs.
It urged states and school districts to exercise caution in the use of VAM scores for high-stakes purposes and offered reasonable guidelines for practice. Some have expressed skepticism (Amrein-Beardsley, 2014; Darling-Hammond, Amrein-Beardsley, Haertel, & Rothstein, 2012; National Research Council, 2010), and others, including prominent foundations and some think tanks, have been more positive (Bill & Melinda Gates Foundation, 2012; Glazerman, Goldhaber, Loeb, Raudenbush, Staiger, & Whitehurst, 2010; Gordon, Kane, & Staiger, 2006; Hanushek & Rivkin, 2004). It appears that teachers and principals trust classroom observations more than VAMs. This conclusion emerges from both Goldring, Grissom, Rubin, Neumerski, Cannata, Drake, and Schuermann's (2015) analysis across eight districts and an in-depth case study of Chicago (Jiang, Sporte, & Luppescu, 2015). The general consensus is that a set of VAM scores does contain some useful information that meaningfully differentiates among teachers, especially in the tails of the distribution. However, individual VAM scores suffer from high variance and low year-to-year stability, as well as an undetermined amount of bias (Goldhaber & Hansen, 2013; Kane, McCaffrey, Miller, & Staiger, 2013; McCaffrey, Sass, Lockwood, & Mihaly, 2009), leading some to suggest that they are not reliable enough to be used for high-stakes purposes (Darling-Hammond et al., 2012). Consequently, if VAM scores are to be used for evaluation, they should not be given inordinate weight and certainly not treated as the "gold standard" to which all other indicators must be compared (Braun, 2015).

In view of the importance of VAMs, a special issue of the Journal of Educational and Behavioral Statistics (Wainer, 2004) was devoted to the careful examination of this methodology; the issue focused almost entirely on the statistical properties of the measures. Recently, a special issue of Educational Researcher (Harris & Herrington, 2015) was devoted to the careful examination of value-added-based teacher accountability. It focused on the effects on teaching and learning that come from embedding VAMs into policies like teacher evaluation, tenure, and compensation. Value-added-based teacher accountability is likely to be different from school-based accountability. Although both use students' test scores, school-level measures enjoy much wider acceptance generally and have better statistical properties.
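As a concrete illustration of the projection-plus-residual logic described above, the following R sketch regresses Time 2 scores on Time 1 scores for one cohort, applies the fitted parameters to the current cohort to obtain projected scores, and averages the residuals by school. The data frames and column names (prior_cohort, current_cohort, score2, score1_eng, school_id, and so on) are hypothetical placeholders, not the variables used in this report.

```r
# Minimal sketch of a projection model plus residual-based value-added scores.
# Assumes two data frames with hypothetical columns: score2 (later test score),
# score1_eng, score1_math, score1_read, score1_sci (prior scores), school_id.

# 1. Fit the projection model on a prior cohort.
proj_fit <- lm(score2 ~ score1_eng + score1_math + score1_read + score1_sci,
               data = prior_cohort)

# 2. Apply the fitted parameters to the current cohort to get projected scores.
current_cohort$projected <- predict(proj_fit, newdata = current_cohort)

# 3. Residual = actual minus projected score for each student.
current_cohort$residual <- current_cohort$score2 - current_cohort$projected

# 4. A simple school value-added score: the mean residual of its students.
school_va <- aggregate(residual ~ school_id, data = current_cohort, FUN = mean)
```

This two-step pattern, project first and then aggregate residuals by school, is the basic logic behind the value-added scores examined in the rest of the report.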
Research Questions

In this study, I address questions that may be of interest not only to users of the ACT college and career readiness system of assessments but also to nonusers wishing to evaluate the utility of growth models based on college readiness assessments. The system included ACT Explore® (for grades 8 and 9), ACT Plan® (for grade 10), and the ACT® test (for grades 11 and 12).3 I examine projection and value-added models for different growth periods, defined by student grade level at the two times of testing (8–10, 9–10, 10–11, 10–12, and 8–12).4 I sought to answer the following questions pertaining to inferences about the growth of individual students (Question 1) and about the effectiveness of schools (Questions 2–6) across six different growth periods. I also evaluated and compared three types of growth models based on the accuracy with which they predict individual students' test scores in subsequent grades.5

1. How strongly related are students' projected scores to their actual scores?
• Does the strength of the relationship vary for different projection models?
• Is measurement error in ACT college and career readiness system scores properly addressed by using multiple test scores?
• Is the strength of the relationship enhanced by using test scores from multiple prior years?
2. How often are school value-added scores classified as being significantly above or below average?
3. What is the variability across school cohorts in the proportion of students meeting or exceeding expected growth?
4. What school characteristics are related to school value-added scores?
5. To what extent do value-added scores vary across high- and low-poverty schools?
6. How consistent are a school's value-added scores?
• across growth periods for the same school year?
• across growth periods for the same cohort of students?
• across cohorts of students (for each growth period)?

3 The ACT Explore and ACT Plan assessments are available only to existing customers through 2016 and will be replaced by the ACT Aspire™ system afterward.
4 For student grade levels 9–10, two different growth periods (10–14 and 9–18 months) are considered.
5 The extent to which the assumptions of these models are met will affect answers to these questions, but in this study I do not investigate how well these assumptions are met or fit the data. Castellano and Ho (2015) investigated assumptions in different types of growth models, including the "meanResid" model (referred to as the Std model in this study), by classifying them by aggregation functions (mean or median), regression approach (linear mean and nonlinear quantile), and the scales that support interpretations (percentile rank and score scale), among other factors, using both simulated and empirical data.

Methodology

This study used longitudinal data from a large sample of students who took ACT Explore in eighth or ninth grade, ACT Plan in tenth grade, and the ACT in eleventh or twelfth grade to investigate certain statistical properties of school value-added scores. Clearly, to be relevant for studying school value-added scores, tests should measure the knowledge and skills specified in schools' curricula.
All three components of the ACT college and career readiness system measure academic achievement in four subject areas (English, mathematics, reading, and science) with respect to the curriculum of the grade level for which each is intended.6 The tests use a common score scale, making it easier for users to monitor progress and identify students' strengths and weaknesses. Data collected as part of the assessment programs were used to create longitudinal datasets for different growth periods: grades 8–10, 9–10, 10–11, 10–11/12 (10–12 for brevity), and 8–11/12 (8–12 for brevity). Using value-added models, I examine high schools' effects on student achievement, controlling for prior academic achievement level as measured by the prior scores in all four subject areas. I also include a covariate in the models that accounts for varying lengths of time between Test 1 and Test 2. By using the number of months between tests as a covariate, the projected scores accommodate varying time spans between tests, as more growth is expected as more time passes (Allen, Bassiri, & Noble, 2009). I then fit this model to each of the four subject areas to obtain the estimated school effects on students' academic performance in English, mathematics, reading, and science.

6 Every three to four years, ACT conducts the ACT National Curriculum Survey®, in which more than 20,000 educators nationwide in grades 7–14 are asked to identify the knowledge and skills that are important for students to know to be ready for college-level work. ACT also reviews textbooks on state-approved lists for courses in grades 7–12 and examines objectives for instruction for all states that have published such objectives. ACT then analyzes the information to refine the scope and sequence for each subject area test (ACT, 2011a; ACT, 2011b; ACT, 2007).

Models and Growth Projection Accuracy

There are several options for modeling school (or teacher) performance. In a study comparing growth models, Goldschmidt, Choi, and Beaudoin (2012) classify growth models into five categories (categorical, gain score, regression, value-added, and normative). In another study, Ehlert, Koedel, Parsons, and Podgursky (2012) classify growth models into three broad classes (Student Growth Percentiles, one-step VAMs, and two-step VAMs). According to Cody, McFarland, Moore, and Preston (2010), most growth models tend to fall into three general categories: simple growth, growth-to-standard, and projection models. In this study, I consider three types of growth models: a Standard VAM, a Standard-Classified VAM, and a Quantile Regression model. I evaluate these models based on the accuracy with which they predict test scores at Time 2. As shown later, because the differences in growth projection accuracy between the Standard VAM and the other two models were small, I adopted the Standard VAM and based the remaining analyses on that model alone. I briefly describe each of the models studied in the next few sections.

Standard VAM (Std). I modeled Test 2 (ACT Plan or the ACT test) scores for each student using as predictors the student's Test 1 (ACT Explore or ACT Plan) scores in all four subject areas and the number of months that passed between the two tests. To account for unobserved high school effects, a random intercept model was used. The random intercept model consists of fixed effects (the intercept and the regression coefficients associated with the predictors) and a random effect that adjusts the intercept for individual schools.
The model for projected scores is given in Equation 1:

    Y_{ij} = b_0 + \sum_{p=1}^{4} b_p X_{ijp} + b_5 T_{ij} + U_j + R_{ij}        (1)

In Equation 1, Y_{ij} is the projected Time 2 score for the ith student from the jth high school; b_0 is the overall intercept term; X_{ijp} (p = 1, 2, 3, 4) are the prior test scores in the four subject areas (English, mathematics, reading, and science) for the ith student from the jth high school; b_p (p = 1, 2, 3, 4) are the regression coefficients for the four Test 1 scores; T_{ij} is the number of months between Test 1 and Test 2 for the ith student from the jth high school; b_5 is the regression coefficient for the testing span; and U_j is the school-specific intercept, which can be interpreted as the school value-added score and is assumed to be normally distributed with mean 0 and variance τ². The student-level error term R_{ij} is assumed to be normally distributed with mean 0 and variance σ². I then fit this model to each of the four subject areas to obtain student projections and school value-added scores. The model is a special case of a hierarchical linear model (Raudenbush & Bryk, 2002) and can be fit using statistical software packages such as HLM or SAS. In this study, I calculated the value-added score for a particular school as the average of the residual scores of its students, where the residual score for an individual student is defined as the difference between the student's projected score, based on the estimated fixed effects in Equation 1, and the student's actual score. School value-added scores can also be based on both the fixed effects and random effects in Equation 1; see Snijders and Bosker (1999).

Standard-Classified VAM (Std-cls). In this model, Y_{ij} is defined as in the Standard VAM, but the prior test scores for the four subject areas on the right-hand side of Equation 1 are coded differently. That is, the subject-specific prior test score (sspts) is classified into 10 levels,7 whereas the other three nonspecific prior test scores (nspts) are treated as linear effects, as in Equation 1. For example, when projecting the Test 2 English score, the Test 1 English score is classified into 10 levels while the Test 1 scores in the other three subject areas (mathematics, reading, and science) remain continuous. Comparing results for the Std and Std-cls models allows us to examine the adequacy of modeling only the linear relationship between Test 1 and Test 2 scores.

7 In theory, since ACT Explore scale scores range from 1 to 25, I could have defined up to 25 classified levels, but because of data sparsity, especially in the lower tail of the distribution, I limited the classification of the subject-specific test score to 10 levels.

Quantile Regression (QR). Quantile regression can be conceptualized as an elaboration of ordinary least squares linear regression, moving from parameterizing trends in conditional means of the response variable Y to parameterizing trends in many different conditional quantiles of Y. The QR model describes growth not in terms of the conditional mean test score, but in terms of the conditional percentile of the test score. The main advantage of QR over least-squares regression is its flexibility for modeling data with heterogeneous conditional distributions (i.e., it makes no distributional assumption about the error term, and so it offers model robustness). QR is especially useful with data that are heterogeneous in the sense that the tails and the central location of the conditional distributions vary differently with the covariates. See Chen (2005) for a more formal definition of QR.
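To make the three model variants concrete, the following R sketch fits the random intercept model of Equation 1 and derives residual-based value-added scores; the commented lines show how the Std-cls and QR variants differ. The report notes that such models can be fit with packages such as HLM or SAS; this R version is only illustrative, and the data frame `d` and its column names (y, eng, math, read, sci, months, school) are hypothetical placeholders.

```r
library(lme4)      # mixed-effects (random intercept) models
library(quantreg)  # linear quantile regression

# Hypothetical data frame `d`: y = Test 2 score for one subject area,
# eng/math/read/sci = Test 1 scores, months = months between tests,
# school = high school ID.

# Standard VAM (Equation 1): fixed effects plus a random intercept per school.
m_std <- lmer(y ~ eng + math + read + sci + months + (1 | school), data = d)

# Projected scores from the fixed effects only (random intercept set to 0).
d$projected <- predict(m_std, re.form = NA)

# Residual = actual minus projected; school value-added score = mean residual.
d$residual <- d$y - d$projected
school_va  <- aggregate(residual ~ school, data = d, FUN = mean)

# Std-cls variant: classify the subject-specific prior score into levels and
# keep the other three prior scores linear. For the English model, for example:
# d$eng_cls <- cut(d$eng, breaks = 10)
# m_cls <- lmer(y ~ eng_cls + math + read + sci + months + (1 | school), data = d)

# QR variant: median (0.50 quantile) regression; residuals are taken from the
# predicted 50th percentile rather than the conditional mean.
m_qr <- rq(y ~ eng + math + read + sci + months, tau = 0.5, data = d)
d$residual_qr <- d$y - fitted(m_qr)
```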
For a student with Grade 3 and Grade 4 test scores, for example, QR could be used to identify the percentile of the Grade 4 score conditional on the Grade 3 score. A growth percentile of 85 would indicate that the student's Grade 4 score is at the 85th percentile among peers with the same Grade 3 score. The student's growth can then be described as being as good as or better than that of 85% of his or her academic peer group, as defined by the initial test score. These growth percentiles are a normative measure of students' academic progress, comparing a student's progress over time to that of his or her academic peers with a similar academic (test score) history (Betebenner, 2009).

Given this normative foundation, the QR model can be used to identify "typical growth" (e.g., the predicted score from a QR model of the 0.50 quantile). Similar to residuals from a linear regression model, residuals can be obtained from a QR model as the difference between actual and projected scores, where projected scores are the 50th percentile scores from the QR. School mean QR residuals can then be used to support value-added inferences.

In this study, the linear quantile regression model (using the SAS QUANTREG procedure) was used to fit the data. In contrast, the Student Growth Percentile (SGP) model fits the data using a nonparametric spline model in the R SGP software package (Betebenner, 2011). This model has been used in several states (e.g., Colorado, Massachusetts, Arizona, and Indiana). Due to the different modeling assumptions used in the SAS QUANTREG procedure and the SGP package, the estimated quantiles from the two programs can be quite different when initial test scores are at the tails of the score distribution but almost identical for middle-range scores. In the SAS QUANTREG procedure, a linear function of the independent variables is used to model the quantiles of the response variable, while the SGP package models the quantiles of the response variable as a linear combination of cubic B-splines of the independent variables. The B-spline quantile regression is a nonlinear (or nonparametric) extension of linear quantile regression and can also handle heteroscedasticity and skewness of the conditional distributions of the response variable associated with values of the independent variables. Unlike linear quantile regression, B-spline quantile regression does not assume that the conditional densities of the response variable have the same shape across different values of the independent variables. Thus, it is a more flexible method than the linear QR model.
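As a rough sketch of the growth-percentile idea, the R code below fits a grid of linear quantile regressions (using the quantreg package rather than the SGP package's B-splines) and locates each student's conditional percentile. The data frame `d` and columns x (prior score) and y (later score) are hypothetical, and this simple linear specification ignores possible quantile crossing.

```r
library(quantreg)

# Hypothetical data frame `d`: x = prior (e.g., Grade 3) score,
# y = later (e.g., Grade 4) score for the same students.
taus <- seq(0.05, 0.95, by = 0.05)

# Fit one linear quantile regression per tau.
fits <- rq(y ~ x, tau = taus, data = d)

# Predicted conditional quantiles: one row per student, one column per tau.
qhat <- predict(fits, newdata = d)

# A student's growth percentile is (approximately) the largest tau whose
# predicted conditional quantile is at or below the student's actual score.
growth_pct <- apply(qhat <= d$y, 1, function(hit) 100 * max(c(0, taus[hit])))

# A value of 85 would mean the student scored at or above the estimated
# 85th percentile among peers with the same prior score.
```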
School Value-Added Score. The models discussed above can be used to produce projected scores (also known as predicted scores) for individual students. Two of the models (Std and Std-cls) are mixed-effects models, so projected scores for individual students can be based either on the fixed-effects parameters alone or on both the fixed effects and the random intercepts. The fixed-effects parameters produce projected scores based on the population-averaged model, with the random intercept set to 0. The projections are thus interpreted as expected Time 2 scores, assuming the typical school effect. Using projections based only on the fixed-effects model, residual scores can be calculated for each student as the difference between the actual Time 2 score and the projected Time 2 score. This residual describes a student's growth in terms of how much higher (or lower) than expected she performed, given her prior scores and assuming the typical school effect. This approach is known as the residual gain model (Castellano & Ho, 2013). A school's mean residual score then describes how much higher or lower than expected its students score, on average, and can be used to support school value-added interpretations. A one-sample t-test can be used to test whether the mean residual is different from 0 for each school.
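Continuing the hypothetical R sketch from earlier (same placeholder data frame `d` with residual and school columns), the per-school one-sample t-test could look like this:

```r
# One-sample t-test of each school's mean residual against 0.
# Schools with fewer than two students are skipped, since t.test() needs at
# least two observations (and tests for tiny schools have little power anyway).
by_school <- split(d$residual, d$school)
by_school <- by_school[sapply(by_school, length) >= 2]

school_tests <- lapply(by_school, function(r) t.test(r, mu = 0))

# Summarize: mean residual (value-added score) and p-value per school.
school_summary <- data.frame(
  school     = names(school_tests),
  mean_resid = sapply(school_tests, function(tt) unname(tt$estimate)),
  p_value    = sapply(school_tests, function(tt) tt$p.value)
)
```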