DOCUMENT RESUME

ED 357 039    TM 019 746

AUTHOR        Linn, Robert L.
TITLE         Educational Assessment: Expanded Expectations and Challenges.
INSTITUTION   National Center for Research on Evaluation, Standards, and Student Testing, Los Angeles, CA.
SPONS AGENCY  Office of Educational Research and Improvement (ED), Washington, DC.
REPORT NO     CSE-TR-351
PUB DATE      Jan 93
CONTRACT      R117G10027
NOTE          38p.; Paper presented at the Annual Meeting of the American Psychological Association (100th, Washington, DC, August 14-18, 1992).
PUB TYPE      Reports - Evaluative/Feasibility (142); Speeches/Conference Papers (150)
EDRS PRICE    MF01/PC02 Plus Postage.
DESCRIPTORS   *Educational Assessment; Educational Change; Educational Policy; Elementary Secondary Education; *Evaluation Methods; *Futures (of Society); Generalization; *National Competency Tests; *Research Needs; Student Evaluation; Test Construction; *Test Validity
IDENTIFIERS   Alternative Assessment; Performance Based Evaluation; *Reform Efforts

ABSTRACT
Current national efforts to expand the role of educational assessment and radically change the nature of assessment are discussed. Rationales for expectations that a new national examination system would serve as a lever of educational reform are analyzed. Challenges posed for educational measurement by the proposed heavy reliance on complex, performance-based assessment are examined in terms of the needed comprehensive validation research. Particular attention is given to existing generalizability evidence and the implications of a high degree of task specificity in performance for the design of an assessment system. Nine figures and one table illustrate the discussion. (Contains 45 references.) (Author)

Educational Assessment: Expanded Expectations and Challenges

CSE Technical Report 351

Robert L. Linn
CRESST/University of Colorado at Boulder

January 1993

Invited Address: Thorndike Award from Division 15, Educational Psychology, of the American Psychological Association, presented at the annual meeting of the American Psychological Association, Washington, DC, August 16, 1992.

National Center for Research on Evaluation, Standards, and Student Testing (CRESST)
Graduate School of Education
University of California, Los Angeles
Los Angeles, CA 90024-1522
(310) 206-1532

UCLA Center for the Study of Evaluation, in collaboration with: University of Colorado; NORC, University of Chicago; LRDC, University of Pittsburgh; The RAND Corporation
© Copyright 1993 The Regents of the University of California

The work reported herein was supported in part under the Educational Research and Development Center Program cooperative agreement R117G10027 and CFDA catalog number 84.117G as administered by the Office of Educational Research and Improvement, U.S. Department of Education. The findings and opinions expressed in this report do not reflect the position or policies of the Office of Educational Research and Improvement or the U.S. Department of Education.

Educational Assessment: Expanded Expectations and Challenges

Abstract

Current national efforts to expand the role of educational assessment and radically change the nature of assessment are discussed. Rationales for expectations that a new national examination system would serve as a lever of educational reform are analyzed. Challenges posed for educational measurement by the proposed heavy reliance on complex, performance-based assessment are examined in terms of the needed comprehensive validation research. Particular attention is given to existing generalizability evidence and the implications of a high degree of task specificity in performance for the design of an assessment system.

Educational Assessment: Expanded Expectations and Challenges

Robert L. Linn
CRESST/University of Colorado at Boulder

The recent report "Testing in American Schools: Asking the Right Questions" (U.S. Congress, Office of Technology Assessment, 1992), prepared at the request of Congress by the Office of Technology Assessment (OTA), documents the central role of educational testing in recent national debates about educational reform. Although the current emphasis on assessment has some important new features that I will discuss in some detail, it is hardly a novelty for testing and assessment to figure prominently in policy makers' efforts to reform education. As Madaus (1985) observed several years ago, "testing is the darling of policy makers across the country" (p. 5). Similar statements could have been made at various times during the past century and a half, most notably during periods when the schools were under attack and reformers sought to demonstrate the need for change.

As has been true of previous reform efforts, assessment is central to the current educational reform debate for at least two reasons. First, assessment results are relied upon to document the need for change. Second, assessments are seen as critical agents of reform. Indeed, Petrie (1987) went further when he argued that "It would not be too much of an exaggeration to say that evaluation and testing have become the engine for implementing educational policy" (p. 175, emphasis in the original).

The primary focus of this paper is that educational policy makers are keenly interested in educational assessment and that their greatly expanded, and sometimes unrealistic, expectations, together with the current press for radical changes in the nature of assessments, represent major challenges for educational measurement. First, however, I will discuss the attraction that assessment results have for policy makers.

Barometer of Educational Quality

Educational assessments are often rather naively expected to serve as a kind of impartial barometer of educational quality. Such an expectation makes assessment results of particular interest and value to various types of educational policy makers. In this regard, policy makers have often pointed to
assessment results in order to demonstrate educational shortcomings. The OTA report provides a brief recounting of this history of testing in American schools from the time that Horace Mann introduced written examinations in the mid-19th century. The OTA report (U.S. Congress, Office of Technology Assessment, 1992) summarized the view that tests could document the need for change as follows:

The idea underlying the implementation of written examinations, that they could provide information about student learning, was born in the minds of individuals already convinced that education was substandard. This sequence -- perception of failure followed by the collection of data designed to document failure (or success) -- offers early evidence of what has become a tradition of school reform and a truism of student testing: tests are often administered not just to discover how well schools or kids are doing, but rather to obtain external confirmation -- validation -- of the hypothesis that they are not doing well at all. (p. 108, emphasis in the original)

More recent examples where test results have been used to document the need for reform are plentiful, but two will suffice to make the point. The June 6, 1991 press conference for the release of the first state-by-state results of the National Assessment of Educational Progress (NAEP) in mathematics was used by Secretary of Education Lamar Alexander to say that the results should serve as a "wake up America call." According to Secretary Alexander, "The big news is: None of the states is cutting it."

In the second example, President Bush relied on a much less relevant source of test data for purposes of drawing inferences about schools when he made the following comments regarding the release of SAT results in the fall of 1991: "Last week we learned SAT scores had fallen again. Scores on the Verbal SAT have tumbled to the lowest level ever. These numbers tell us our schools are in trouble" (cited by Jaeger, 1992).

The remainder of this paper could be used to analyze what is wrong with this type of use of SAT results, but that is not my focus today and, in any event, that has already been done on several occasions (e.g., Linn, 1987; Wainer, 1986). It is worth noting, however, that despite the negative picture, there is considerable evidence regarding improvements in achievement that the critics choose not to acknowledge. For example, even if it were reasonable to use aggregate SAT results as an indicator of educational progress, results presented by Carson, Huelskamp, and Woodall (1991) and discussed by Berliner (1992) and by Jaeger (1992) demonstrate that the global comparisons of cross-sectional results for different years hide more than they reveal. Jaeger's (1992) discussion is particularly telling:

As Carson and his colleagues (Carson, Huelskamp, & Woodall, 1991) have illustrated, the SAT score decline is a perfect example of Simpson's Paradox. Although the mean score of the total SAT test-taking population has declined, the mean score of every major racial and ethnic group that composes the population of SAT test-takers has increased during the last 15 years. Comparison of the mean performances of the population of SAT test-takers today and the population of test-takers fifteen or twenty years ago, unadjusted for changes in the test-taking population, can only produce seriously misguided conclusions. (pp. 7-8)
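Jaeger's point lends itself to a small worked illustration. The sketch below is purely hypothetical: the subgroup sizes and mean scores are invented for demonstration and are not actual SAT data. Each subgroup's mean rises over time, yet the aggregate mean falls because the lower-scoring subgroup becomes a larger share of all test-takers.

```python
# Hypothetical illustration of Simpson's Paradox in aggregate test scores.
# All numbers are invented for demonstration; they are not actual SAT data.

def overall_mean(groups):
    """Weighted mean across (n_test_takers, mean_score) pairs."""
    total_n = sum(n for n, _ in groups)
    return sum(n * m for n, m in groups) / total_n

# (number of test-takers, mean score) for two subgroups at two points in time
earlier = [(800, 520), (200, 420)]  # lower-scoring group is 20% of test-takers
later = [(500, 530), (500, 430)]    # both subgroup means rise 10 points,
                                    # but that group is now 50% of test-takers

print(overall_mean(earlier))  # 500.0
print(overall_mean(later))    # 480.0 -- the aggregate mean falls 20 points
```

This is exactly the pattern Jaeger describes: comparing unadjusted aggregate means across cohorts with different compositions invites seriously misguided conclusions.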
Turning to the potentially more informative NAEP results, it should be noted that Secretary Alexander's analysis of the cross-sectional results in mathematics does not take into account trend data that show improvements in mathematics achievement, especially during the 1980s (see Jaeger, 1992; Linn, in press; Mullis, Dossey, Foertsch, Jones, & Gentile, 1991). As can be seen in Figure 1, the performance of African-American students on the NAEP mathematics scale was significantly higher at all three age levels in 1990 than it was in 1978. The performance of White and Hispanic students also was higher in 1990 than in 1978 at all three age levels, but the differences were not significant for 17-year-olds. For the 1990 bridge samples, the within-grade standard deviations ranged from 31 to 33. In other words, a difference of about 16 points represents an effect size of about .5 (16 points divided by a standard deviation of roughly 32). Thus, most of the differences shown in Figure 1 are large enough to be of practical as well as statistical significance.

[Figure 1. Changes in average performance on the overall NAEP mathematics scale between 1978 and 1990, by age (9, 13, and 17) for White, African American, and Hispanic students; significant increases in means (p < .05) are flagged. Based on Mullis, Dossey, Foertsch, Jones, & Gentile, 1991.]

Although policy makers seeking reform make great use of test results to argue the case that education is inadequate, others with an interest in the status quo, of course, emphasize quite different results. As was demonstrated in articles on the Lake Wobegon effect (e.g., Cannell, 1987, 1988; Linn, Graue, & Sanders, 1990), reports comparing student achievement to outdated national norms for the form of a test that has been used year after year in a state or school district have provided misleadingly positive impressions of student achievement in individual states and districts.

Figure 2 shows results from a study that I have worked on in collaboration with Dan Koretz, Lorrie Shepard, Steve Dunbar, Freddy Hiebert, and Bobbie Flexer (e.g., Koretz, Linn, Dunbar, & Shepard, 1991). The operational test results in two districts using different norm-referenced tests in relatively high-stakes testing programs are contrasted with results of administrations of alternative tests that were constructed to cover the same content objectives and then equated to the district norm-referenced tests using samples of students from other districts where the norm-referenced test in question was not used. There is good generalization regarding mean student performance in only one of the four contrasts (mathematics in district A). In the other three contrasts, the operational test provides an inflated impression of student achievement in comparison to the alternative test.
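The study excerpt does not say which equating procedure Koretz and colleagues applied, so the following is only a minimal sketch of one common approach, linear equating, under the assumption that scores on the alternative test (Y) are placed on the scale of the operational norm-referenced test (X) by matching means and standard deviations in an equating sample drawn from districts where X carried no stakes. All scores shown are hypothetical.

```python
# A minimal sketch of linear equating (one common method; the report does not
# specify the procedure actually used). Scores on the alternative test Y are
# mapped onto the scale of the operational norm-referenced test X by matching
# means and standard deviations observed in an equating sample.

from statistics import mean, stdev

def linear_equate(x_scores, y_scores):
    """Return a function that expresses a Y-scale score on the X scale."""
    mx, sx = mean(x_scores), stdev(x_scores)
    my, sy = mean(y_scores), stdev(y_scores)
    return lambda y: mx + (sx / sy) * (y - my)

# Hypothetical equating-sample scores on each test
x_sample = [48, 52, 55, 60, 63, 67, 70]
y_sample = [30, 33, 35, 39, 41, 44, 46]

to_x_scale = linear_equate(x_sample, y_sample)
print(round(to_x_scale(39), 1))  # a Y score of 39 on the X scale (about 60.3)
```

Once the two tests are expressed on a common scale, a gap between operational and alternative-test means in a high-stakes district can be read as score inflation rather than an artifact of different scales.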
The possibly obvious point is that considerable caution is needed in using achievement test results to draw inferences about the quality of education. Despite the pitfalls, however, policy makers attempting to provide support for current practice, as well as those trying to undermine it, continue to rely heavily on test results to make their case.

Instrument of Reform

Issues regarding the uses and misuses of educational assessment results to draw inferences about the quality of education are worthy of more detailed discussion, but the focus of the remainder of this paper is on a second use of educational assessment. That is the use of assessments not just as a barometer of educational conditions but as an instrument of reform. Educational assessments are expected not only to serve as a monitor of educational achievement but to be powerful tools of educational reform.