NEW MEXICO STANDARDS-BASED ASSESSMENT TECHNICAL REPORT: SPRING 2007 ADMINISTRATION

PREPARED FOR THE NEW MEXICO PUBLIC EDUCATION DEPARTMENT BY:
HARCOURT ASSESSMENT, INC.
PSYCHOMETRIC AND RESEARCH SERVICES
19500 BULVERDE ROAD
SAN ANTONIO, TEXAS 78259

Contents

1 TEST DESIGN AND DEVELOPMENT
  1.1 Overview
  1.2 Test Design
  1.3 Item Development
  1.4 Field Test and Item Analysis
2 SCORING PROCEDURES
  2.1 Processing for Student Documents
    2.1.1 Receiving
    2.1.2 Structure Definition and Order Entry
    2.1.3 Document Staging
    2.1.4 Scanning
    2.1.5 Scoring Editing
    2.1.6 Archiving
    2.1.7 Job Submission
    2.1.8 Computer Operations
    2.1.9 Alerts and Research
  2.2 Scoring Constructed Response Items
    2.2.1 Developing Scoring Materials for Constructed Response Items
    2.2.2 Monitoring for Scoring Accuracy/Reliability
    2.2.3 Overall Scoring Process
  2.3 Inter-rater Reliability
3 SUMMARY OF STUDENT PERFORMANCE
4 STATISTICAL ANALYSES OF ITEM AND TEST SCORES
  4.1 Classical Test Theory Based Analyses
    4.1.1 Item-Level Statistics
    4.1.2 Test-Level Statistics
  4.2 Differential Item Functioning
5 SCALING OF THE ASSESSMENT AND ITEM RESPONSE THEORY
  5.1 Calibration and Equating Process
  5.2 Item-Level IRT Statistics
  5.3 Scoring Tables
6 RELIABILITY
  6.1 Introduction
  6.2 Coefficient Alpha
  6.3 Standard Error of Measurement
  6.4 Conditional Standard Error of Measurement
  6.5 Classification Accuracy and Consistency
7 VALIDITY
  7.1 Test Content
  7.2 Internal Structure
  7.3 Relationships to Other Variables
8 GRADE 11 STANDARD SETTING
APPENDIX A TEST BLUEPRINTS
APPENDIX B ITEM-LEVEL STATISTICS
APPENDIX C PERFORMANCE LEVEL PERCENTAGES FOR SELECTED DEMOGRAPHIC SUBGROUPS
APPENDIX D TEST LEVEL STATISTICS FOR SELECTED DEMOGRAPHIC SUBGROUPS
APPENDIX E INTERRATER AGREEMENT AND CORRELATIONS FOR CONSTRUCTED RESPONSE ITEMS
APPENDIX F SCORING TABLES FOR THE 2007 NMSBA TESTS
APPENDIX G GRADE 11 PERFORMANCE LEVEL DESCRIPTORS
APPENDIX H SUMMARY OF NMPED MEETING TO FINALIZE CUTPOINTS FOR GRADE 11 ASSESSMENTS

Tables and Figures

TABLE 1.1 SUMMARY OF NUMBER OF ITEMS AND POINTS FOR NMSBA ENGLISH TEST
TABLE 1.2 SUMMARY OF NUMBER OF ITEMS AND POINTS FOR NMSBA SPANISH TEST
TABLE 1.3 SUMMARY OF ITEM P-VALUES AND POINT-BISERIAL/ITEM-TEST CORRELATIONS
TABLE 2.1 SUMMARY OF INTERRATER RELIABILITY STATISTICS FOR NMSBA ENGLISH LANGUAGE MATHEMATICS TESTS
TABLE 2.2 SUMMARY OF INTERRATER RELIABILITY STATISTICS FOR NMSBA ENGLISH LANGUAGE READING TESTS
TABLE 2.3 SUMMARY OF INTERRATER RELIABILITY STATISTICS FOR NMSBA ENGLISH LANGUAGE SCIENCE/SOCIAL STUDIES (GRADE 11) TESTS
TABLE 2.4 SUMMARY OF INTERRATER RELIABILITY STATISTICS FOR NMSBA SPANISH LANGUAGE MATHEMATICS TESTS
TABLE 2.5 SUMMARY OF INTERRATER RELIABILITY STATISTICS FOR NMSBA SPANISH LANGUAGE READING TESTS
TABLE 2.6 SUMMARY OF INTERRATER RELIABILITY STATISTICS FOR NMSBA SPANISH LANGUAGE SCIENCE TESTS
TABLE 3.1 SCALE SCORE CUTSCORE INTERVALS FOR NMSBA ENGLISH TESTS
TABLE 3.2 SCALE SCORE CUTSCORE INTERVALS FOR NMSBA SPANISH TESTS
TABLE 3.3 PERCENTAGE OF STUDENTS AT EACH PL CLASSIFICATION FOR ENGLISH LANGUAGE NON-BIA STUDENTS
TABLE 3.4 PERCENTAGE OF STUDENTS AT EACH PL CLASSIFICATION FOR SPANISH LANGUAGE STUDENTS
TABLE 3.5 PERCENTAGE OF STUDENTS AT EACH PL CLASSIFICATION FOR ENGLISH LANGUAGE BIA STUDENTS
TABLE 4.1 MEAN AND STANDARD DEVIATIONS OF P-VALUES AND ITEM-TEST CORRELATIONS
TABLE 4.2 SUMMARY RAW TEST SCORE STATISTICS (ENGLISH LANGUAGE NON-BIA STUDENTS)
TABLE 4.3 SUMMARY RAW TEST SCORE STATISTICS (SPANISH LANGUAGE STUDENTS)
TABLE 4.4 SUMMARY RAW TEST SCORE STATISTICS (ENGLISH LANGUAGE BIA STUDENTS)
TABLE 4.5 SUMMARY SCALE SCORE STATISTICS (ENGLISH LANGUAGE NON-BIA STUDENTS)
TABLE 4.6 SUMMARY SCALE SCORE STATISTICS (SPANISH LANGUAGE STUDENTS)
TABLE 4.7 SUMMARY SCALE SCORE STATISTICS (ENGLISH LANGUAGE BIA STUDENTS)
TABLE 5.1 MEAN RASCH ITEM DIFFICULTIES FOR ENGLISH LANGUAGE NMSBA TESTS
TABLE 5.2 MEAN RASCH ITEM DIFFICULTIES FOR SPANISH LANGUAGE NMSBA TESTS
TABLE 5.3 MEAN AND STANDARD DEVIATION OF RASCH FIT STATISTICS
TABLE 5.4 COUNT OF INFIT AND OUTFIT STATISTICS OUTSIDE OF LIMITS (ENGLISH LANGUAGE TESTS)
TABLE 5.5 COUNT OF INFIT AND OUTFIT STATISTICS OUTSIDE OF LIMITS (SPANISH LANGUAGE TESTS)
TABLE 6.1 CLASSIFICATION ACCURACY AND CONSISTENCY FOR THE ENGLISH LANGUAGE NMSBA TESTS
TABLE 6.2 CLASSIFICATION ACCURACY AND CONSISTENCY FOR THE SPANISH LANGUAGE NMSBA TESTS
TABLE 7.1 REPORTING CATEGORY INTERCORRELATIONS FOR ENGLISH READING
TABLE 7.2 REPORTING CATEGORY INTERCORRELATIONS FOR ENGLISH MATH
TABLE 7.3 REPORTING CATEGORY INTERCORRELATIONS FOR ENGLISH SCIENCE/GRADE 11 SOCIAL STUDIES
TABLE 7.4 REPORTING CATEGORY INTERCORRELATIONS FOR SPANISH READING
TABLE 7.5 REPORTING CATEGORY INTERCORRELATIONS FOR SPANISH MATH
TABLE 7.6 REPORTING CATEGORY INTERCORRELATIONS FOR SPANISH SCIENCE
TABLE 7.7 CORRELATIONS AMONG NMSBA SUBJECT TEST SCORES

1 Test Design and Development

1.1 Overview

The NMSBA was administered in Spring 2007 in English and Spanish in the subject areas of Reading, Mathematics, and Science for grades 3 through 9. A Writing test was also administered as part of the NMSBA but is not addressed in this report, as only raw, unequated scores are reported for it, and only on a limited basis. The 2007 NMSBA tests were composed of multiple choice (MC) items and two types of constructed response (CR) items: short answer (SA) items, scored 0, 1, or 2, and open-ended (OE) items, scored 0, 1, 2, 3, or 4. MC items were either drawn from the Stanford Achievement Test Series, Tenth Edition (SAT10) for the English tests (or from Aprenda 3 for the Spanish tests) or were new items written specifically for the NMSBA ("augmented items"). All constructed response items were augmented items. All augmented items, together with the SAT10 items that addressed one or more specific New Mexico standards, were used to derive student scores. The remaining SAT10 items, which did not address a specific New Mexico standard, were dropped for the 2007 administration. The augmented items had been field tested in previous administrations of the NMSBA from 2004 onwards.

Harcourt (with NMPED oversight) wrote items for the augmented portion of the NMSBA based on the relevant New Mexico standards, following the test specifications and the blueprint, and these items went through the normal item development process. Candidate items were reviewed by NMPED and panels of New Mexico teachers to ensure that they addressed the content they were intended to address and that they did not mention or portray any specific demographic group in a stereotypical or demeaning fashion. Items flagged at this point were generally sent back to the item authors with relevant feedback for revision, after which they were scheduled for inclusion in the next round of item reviews (as though they were newly written items). Items that successfully passed this initial stage of review were then embedded within an operational test form so that a representative sample of the target population responded to them. These items did not affect students' scores; responses were used solely to calculate item statistics. This process, which allowed the statistical properties of each item to be assessed, is referred to as "field testing".
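The field test responses feed the classical item statistics reported later in this report (for example, the item p-values and point-biserial/item-test correlations summarized in Table 1.3). The listing below is a minimal sketch of how such statistics could be computed from a matrix of scored responses; the function and variable names are illustrative assumptions for this report and do not represent Harcourt's operational analysis programs.

    import numpy as np

    def field_test_item_stats(scores, item, max_points):
        """Classical statistics for one field-tested item.

        scores     : (students x items) array of earned points
        item       : column index of the item of interest
        max_points : maximum possible points per item (1 = MC, 2 = SA, 4 = OE)
        """
        item_scores = scores[:, item]

        # p-value: mean proportion of the item's maximum points earned
        # (for a 1-point MC item, this is the proportion answering correctly)
        p_value = item_scores.mean() / max_points[item]

        # item-test correlation, computed against the total score with the
        # item removed so the item does not correlate with itself
        rest_total = scores.sum(axis=1) - item_scores
        item_test_r = np.corrcoef(item_scores, rest_total)[0, 1]
        return p_value, item_test_r

    # Simulated example: 200 students, 3 items (one MC, one SA, one OE)
    rng = np.random.default_rng(0)
    max_pts = np.array([1, 2, 4])
    sim = np.column_stack([rng.integers(0, m + 1, size=200) for m in max_pts])
    print(field_test_item_stats(sim, item=2, max_points=max_pts))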
After field testing, a "data review" meeting was held to review all field-tested items. The participants in the data review meeting were New Mexico teachers experienced in the content area and grade level of the items they were reviewing. At this meeting, item data from the field test administration were reviewed by the teachers to identify items with problems that had not been caught earlier in the process. Such problems could include confusing wording; MC distractors that were ambiguous or partially or totally correct; differential item functioning (DIF, also known as "item bias"); or other flaws that were not apparent until item data became available. Items could be rejected at the data review meeting only for specific and identifiable flaws; items could not be rejected simply because the item statistics looked "bad". Items accepted at this stage were retained in an item "bank" for possible inclusion in future forms of the exam. The final test form was then constructed using items from the bank and from the previous year's form (the 2007 form included approximately 70% of the augmented items from the 2006 form) and was forwarded to NMPED for approval. After NMPED approval was secured, the test form was put into production for administration in 2007.

The number of items decreased as the test development process proceeded. Items were lost at the item review stage (content and bias review) and at data review. Finally, only a subset of the items accepted at data review became operational items; the remaining items were kept in the bank for future operational use. Such attrition is natural and expected. The targets set each year for the number of items required for each benchmark covered by the test take this attrition into account to ensure that all benchmarks on the test have adequate coverage.

1.2 Test Design

The NMSBA contains three types of items: multiple choice (MC), short answer (SA), and open-ended (OE) items. The MC items required students to select a correct answer from several alternatives and had a maximum possible score of 1. The SA items required students to answer a question with a couple of words or a few sentences and had a maximum possible score of 2. The OE items required students to answer with a paragraph or two and had a maximum possible score of 4. For the 2007 administration only, several open-ended items that had appeared on the previous (2006) form were included on the Grade 11 mathematics test. The maximum possible points for these items ranged from two to six; two- and three-point items were classed as SA items, and those worth four or more points were classed as OE items.

For mathematics and science, items in the Spanish test were trans-adapted (i.e., translated with an emphasis on capturing the meaning within the item rather than producing a simple word-for-word translation) from the English language test items. For reading, items and reading passages for the Spanish and English tests were separately and independently developed in the target language; the Spanish language items and passages were not translations of English language items or passages. The numbers of MC, SA, and OE items and their score points for the 2007 NMSBA are summarized in Tables 1.1 and 1.2. Note that multiple choice items made up approximately 70% of the items on the test, with the remaining 30% being constructed response items.
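Because MC, SA, and OE items are worth a maximum of 1, 2, and 4 points respectively (apart from the Grade 11 mathematics exceptions noted above), the item and point totals in Tables 1.1 and 1.2 follow directly from the counts of each item type. The short listing below is a worked check of that arithmetic for Grade 3 Mathematics; it is illustrative only and is not part of the operational scoring system.

    # Maximum possible points per item type on the NMSBA (the Grade 11 math
    # carry-over items described above are exceptions to these values)
    MAX_POINTS = {"MC": 1, "SA": 2, "OE": 4}

    def form_totals(counts):
        """Return (total item count, total point count) for a set of item counts."""
        item_count = sum(counts.values())
        point_count = sum(n * MAX_POINTS[kind] for kind, n in counts.items())
        return item_count, point_count

    # Grade 3 Mathematics (English form): 39 MC, 14 SA, and 2 OE items
    print(form_totals({"MC": 39, "SA": 14, "OE": 2}))  # -> (55, 75), as in Table 1.1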
Table 1.1 Summary of Number of Items and Points for NMSBA English Test

                 Item   Point  MC     OE     SA     MC      OE      SA
Grade            Count  Count  Count  Count  Count  Points  Points  Points
Mathematics
  3              55     75     39     2      14     39      8       28
  4              57     78     40     2      15     40      8       30
  5              62     87     43     3      16     43      12      32
  6              62     87     43     3      16     43      12      32
  7              62     89     43     4      15     43      16      30
  8              62     89     43     4      15     43      16      30
  9              60     86     42     4      14     42      16      28
  11             60     93     42     4      14     42      19      32
Reading
  3              47     67     35     4      8      35      16      16
  4              47     67     35     4      8      35      16      16
  5              47     67     35     4      8      35      16      16
  6              49     69     37     4      8      37      16      16
  7              49     69     37     4      8      37      16      16
  8              49     69     37     4      8      37      16      16
  9              49     69     37     4      8      37      16      16
  11             51     80     36     7      8      36      28      16
Science
  3              48     66     34     2      12     34      8       24
  4              48     66     34     2      12     34      8       24
  5              48     66     34     2      12     34      8       24
  6              48     68     34     3      11     34      12      22
  7              48     68     34     3      11     34      12      22
  8              48     70     34     4      10     34      16      20
  9              48     70     34     4      10     34      16      20
Social Studies
  11             60     84     42     3      15     42      12      30

Table 1.2 Summary of Number of Items and Points for NMSBA Spanish Test

                 Item   Point  MC     OE     SA     MC      OE      SA
Grade            Count  Count  Count  Count  Count  Points  Points  Points
Mathematics
  3              55     75     39     2      14     39      8       28
  4              57     78     40     2      15     40      8       30
  5              62     87     43     3      16     43      12      32
  6              62     87     43     3      16     43      12      32
  7              62     89     43     4      15     43      16      30
  8              62     89     43     4      15     43      16      30
  9              60     86     42     4      14     42      16      28
  11             60     93     42     4      14     42      19      32
Reading
  3              47     67     35     4      8      35      16      16
  4              47     67     35     4      8      35      16      16
  5              47     67     35     4      8      35      16      16
  6              50     70     38     4      8      38      16      16
  7              49     69     37     4      8      37      16      16
  8              49     69     37     4      8      37      16      16
  9              49     69     37     4      8      37      16      16
  11             52     77     37     5      10     37      20      20
Science
  3              48     66     34     2      12     34      8       24
  4              48     66     34     2      12     34      8       24
  5              48     66     34     2      12     34      8       24
  6              48     68     34     3      11     34      12      22
  7              48     68     34     3      11     34      12      22
  8              48     70     34     4      10     34      16      20
  9              48     70     34     4      10     34      16      20

New Mexico has in place a framework of standards and benchmarks that defines, by grade and content area, what students are expected to know and be able to do; these standards and benchmarks define what it means for a student to be "proficient". So that the NMSBA would be an accurate measure of New Mexico students' mastery of the content covered by the state's standards and benchmarks, Harcourt first thoroughly reviewed each standard and benchmark and then worked closely with NMPED to develop assessment blueprints for each grade and content area. Assessment blueprints function as maps or plans for the test developers. They identify content or reporting categories (at the overall test, standard, and benchmark levels) and determine which and how many items matching specific test content are to be included on a test. The test blueprints provide the structure for constructing test forms by defining the content to be covered and the relative emphasis to be given to each content area. The structure of the test with respect to the standards and benchmarks, and the distribution of items and points at each standard and benchmark level, are provided in Appendix A.

1.3 Item Development

It is essential that the NMSBA address the depth, breadth, and intent of the New Mexico Content Standards for each grade level. Important information regarding student performance must be derived from the standards-based portion of the test. The standards-based (criterion-referenced) items developed for the NMSBA, as well as the Stanford 10 items identified as assessing New Mexico content standards, must align to the New Mexico Academic Standards in order to provide a valid assessment.
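To make the role of the blueprints described in Section 1.2 concrete, the listing below sketches one way per-benchmark item targets could be represented and a draft form checked against them. The benchmark labels and counts are hypothetical placeholders, not values taken from the actual NMSBA blueprints in Appendix A.

    from collections import Counter

    # Hypothetical blueprint fragment: required number of items per benchmark
    blueprint = {
        "Strand I, Benchmark A": 6,
        "Strand I, Benchmark B": 4,
        "Strand II, Benchmark A": 5,
    }

    # Draft form: each item tagged with the benchmark it was written to assess
    draft_form = (["Strand I, Benchmark A"] * 6
                  + ["Strand I, Benchmark B"] * 3
                  + ["Strand II, Benchmark A"] * 5)

    def blueprint_gaps(blueprint, form):
        """Return benchmarks whose item counts fall short of the blueprint target."""
        have = Counter(form)
        return {b: need - have[b] for b, need in blueprint.items() if have[b] < need}

    print(blueprint_gaps(blueprint, draft_form))  # -> {'Strand I, Benchmark B': 1}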
Item and test development activities extend beyond the item writing process itself to include:

• revising and maintaining test specifications and blueprints
• conducting item reviews for alignment to the content standards, for bias, and for technical quality
• conducting field tests and analyzing field test data
• constructing new operational tests
• ongoing data analyses to ensure a valid and reliable assessment

The item development process began with a thorough review of the New Mexico content standards, the test specifications, and the blueprints by Harcourt's assessment specialists. A sufficient number of new items were written to each standard to allow for the attrition inherent in the review and field testing process. Once the item development plan had been reviewed and approved by the New Mexico Public Education Department, item development assignments and support materials were provided to Harcourt's staff of trained, professional item writers. The item writers then wrote and submitted their items to the assessment specialists, who conducted a thorough quality review of the items using criteria such as:

• Does the item measure the standard it is intended to measure?
• Does the item conform to best practices in item design?
• Is the content of the item accurate?
• Is the item unique?
• Is the language of the item clear, simple, and concise?
• Are the multiple choice item distractors plausible?
• Does the item have only one correct answer?

The initial review was followed by an additional review in which a senior assessment specialist analyzed the item for accurate content and best testing practices. Thereafter, the items went through an editorial review to make sure that proper vocabulary, spelling, and grammar were used. Intentional parallelism in phraseology, sentence structure, and stimulus attributes was maintained across all items, art, and passages. Edited items were then passed back to the lead assessment specialist, who conducted an additional review. Prior to committee