The Reliability of Classroom Observations by School Personnel
MET Project Research Paper

Andrew D. Ho and Thomas J. Kane
Harvard Graduate School of Education
January 2013

ABOUT THIS REPORT: This report presents an in-depth discussion of the technical methods, results, and implications of the MET project's study of video-based classroom observations by school personnel.[1] A non-technical summary of the analysis is in the policy and practitioner brief, Ensuring Fair and Reliable Measures of Effective Teaching. All MET project papers and briefs are available at www.metproject.org.

ABOUT THE MET PROJECT: The MET project is a research partnership of academics, teachers, and education organizations committed to investigating better ways to identify and develop effective teaching. Funding is provided by the Bill & Melinda Gates Foundation. The approximately 3,000 MET project teachers who volunteered to open up their classrooms for this work are from the following districts: the Charlotte-Mecklenburg Schools, the Dallas Independent Schools, the Denver Public Schools, the Hillsborough County Public Schools, the Memphis Public Schools, the New York City Schools, and the Pittsburgh Public Schools. Partners include representatives of the following institutions and organizations: American Institutes for Research, Cambridge Education, University of Chicago, The Danielson Group, Dartmouth University, Educational Testing Service, Empirical Education, Harvard University, National Board for Professional Teaching Standards, National Math and Science Initiative, New Teacher Center, University of Michigan, RAND, Rutgers University, University of Southern California, Stanford University, Teachscape, University of Texas, University of Virginia, University of Washington, and Westat.

ACKNOWLEDGEMENTS: The Bill & Melinda Gates Foundation supported this research. Alejandro Ganimian at Harvard University provided research assistance on this project. David Steele and Danni Greenberg Resnick in Hillsborough County (Fla.) Public Schools were extremely thorough and helpful in recruiting teachers and principals in Hillsborough to participate. Steven Holtzman at Educational Testing Service managed the video scoring process and delivered clean data in record time.

ON THE COVER: A MET project teacher records herself engaged in instruction using digital video cameras (at right of photo).

[1] The lead authors and their affiliations are Andrew D. Ho, Assistant Professor at the Harvard Graduate School of Education, and Thomas J. Kane, Professor of Education and Economics at the Harvard Graduate School of Education and principal investigator of the MET project.

Table of Contents

Introduction and Executive Summary
Study Design
Distribution of Observed Scores
Components of Variance and Reliability
Comparing Administrators and Peers
The Consequences of Teacher Discretion in Choosing Lessons
Fifteen-Minute versus Full Lessons
First Impressions Linger
Alternative Ways to Achieve Reliability
Conclusion
References

Introduction and Executive Summary

For many teachers, the classroom observation has been the only opportunity to receive direct feedback from another school professional. As such, it is an indispensable part of every teacher evaluation system. Yet it also requires a major time commitment from teachers, principals, and peer observers.
To justify the investment of time and resources, a classroom observation should be both accurate (that is, it should reflect the standards that have been adopted) and reliable (that is, it should not be unduly driven by the idiosyncrasies of a particular rater or particular lesson). In an earlier report from the Measures of Effective Teaching (MET) project (Gathering Feedback for Teaching), Kane and Staiger (2012) compared five different instruments for scoring classroom instruction, using observers trained by the Educational Testing Service (ETS) and the National Math and Science Initiative (NMSI). The report found that the scores on each of the five instruments were highly correlated with one another. Moreover, all five were positively associated with a teacher's student achievement gains in math or English language arts (ELA). However, achieving high levels of reliability was a challenge. For a given teacher, scores varied considerably from lesson to lesson, and for any given lesson, scores varied from observer to observer.

In this paper, we evaluate the accuracy and reliability of school personnel in performing classroom observations. We also examine different combinations of observers and lessons observed that produce reliability of .65 or above when using school personnel.

We asked principals and peers in Hillsborough County, Fla., to watch and score videos of classroom teaching for 67 teacher-volunteers using videos of lessons captured during the 2011–12 school year. Each of the 129 observers provided 24 scores on lessons we assigned to them, yielding more than 3,000 video scores for this analysis. Each teacher's instruction was scored an average of 46 times, by different types of observers: administrators from a teacher's own school, administrators from other schools, and peers (including those with and without certification in the teacher's grade range).

In addition, we varied the length of observations, asking raters to provide two sets of scores for some videos (pausing to score after the first 15 minutes and then scoring again at the end of the full lesson). For other lessons, we asked them to provide scores only once at the end of the full lesson. We also gave teachers the option to choose the videos that school administrators would see. For comparison, peers could see any of a teacher's lesson videos, including the chosen lessons and the lessons that were explicitly not chosen by teachers. Finally, we tested the impact of prior exposure to a teacher on a rater's scores by randomly varying the order in which observers saw two different pairs of videos from the same teacher.

Summary of Findings

We briefly summarize seven key findings:

1. Observers rarely used the top or bottom categories ("unsatisfactory" and "advanced") on the four-point observation instrument, which was based on Charlotte Danielson's Framework for Teaching. On any given item, an average of 5 percent of scores were in the bottom category ("unsatisfactory"), while just 2 percent of scores were in the top category ("advanced"). The vast majority of scores were in the middle two categories, "basic" and "proficient." On this compressed scale, a .1 point difference in scores can be sufficient to move a teacher up or down 10 points in percentile rank.

2. Compared to peer raters, administrators differentiated more among teachers. The standard deviation in underlying teacher scores was 50 percent larger when scored by administrators than when scored by peers.
3. Administrators rated their own teachers .1 points higher than administrators from other schools and .2 points higher than peers. The "home field advantage" granted by administrators to their own teachers was small in absolute value. However, it was large relative to the underlying differences in teacher practice.

4. Although administrators scored their own teachers higher, their rankings were similar to the rankings produced by others outside their schools. This implies that administrators' scores were not heavily driven by factors outside the lesson videos, such as a principal's prior impressions of the teacher, favoritism, school citizenship, or personal bias. When administrators inside and outside the school scored the same teacher, the correlation in their scores (after adjusting for measurement error) was .87.

5. Allowing teachers to choose their own videos generated higher average scores. However, the relative ranking of teachers was preserved whether videos were chosen or not. In other words, allowing teachers to choose their own videos led to higher scores among those teachers, but it did not mask the differences in their practice. In fact, reliability was higher when teachers chose the videos to be scored because the variance in underlying teaching practice was somewhat wider.

6. When an observer formed a positive (or negative) impression of a teacher in the first several videos, that impression tended to linger, especially when one observation immediately followed the other. First impressions matter.

7. There are a number of different ways to ensure reliability of .65 or above. Having more than one observer really does matter. To lower the cost of involving multiple observers, it may be useful to supplement full-lesson observations with shorter observations by others. The reliability of a single 15-minute observation was 60 percent as large as that for a full-lesson observation, while requiring less than one-third of the observation time.

We conclude by discussing the implications for the design of teacher evaluation systems in practice.

Study Design

Beginning in 2011–12, the Bill & Melinda Gates Foundation supported a group of 337 teachers to build a video library of teaching practice.[2] The teachers were given digital video cameras and microphones to capture their practice 25 times during the 2011–12 school year (and are doing so again in 2012–13). There are 106 such teachers in Hillsborough County, Fla. In May 2012, 67 of these Hillsborough teachers consented to having their lessons scored by administrators and peers following the district's observation protocol. With the help of district staff, we recruited administrators from their schools and peer observers to participate in the study. In the end, 53 school administrators (principals and assistant principals) and 76 peer raters agreed to score videos, for a total of 129 raters.

"Same-school" versus "other-school" administrators: The 67 participating teachers were drawn from 32 schools with a mix of grade levels (14 elementary schools, 13 middle schools, and 5 high schools). In 22 schools, a pair of administrators (the principal and an assistant principal) stepped forward to participate in the scoring. Another nine schools contributed one administrator for scoring. Only one school had no participating administrators.
In our analysis, we compare the scores provided by a teacher's own administrator with the scores given by other administrators and peer observers from outside the school.

Self-selected lessons: Teachers were allowed to choose the four lessons that administrators saw. Of the 67 teachers, 44 took this option. In contrast, peers could watch any video chosen at random from a teacher. In this paper, we compare the peer ratings for videos that were chosen to those that were not chosen.

Some school districts require that teachers be notified prior to a classroom observation. Advocates of prior notification argue that it is important to give teachers the chance to prepare, in case observers arrive on a day when a teacher had planned a quiz or some other atypical content. Opponents argue that prior notification can lead to a false impression of a teacher's typical practice, since teachers would prepare more thoroughly on days they are to be observed.[3] Even though the analogy is not perfect, the opportunity to compare the scores given by peer raters for chosen and non-chosen videos allows us to gain some insight into the consequences of prior notification.[4]

Peer certification: In Hillsborough, peer raters are certified to do observations in specific grade levels (early childhood, PK to grade 3; elementary grades, K–6; and middle/high school grades, 6–12).[5] We allocated teacher videos to peers with the same grade range certification (either elementary or middle school/high school) as the teacher as well as to others with different grade range certifications.

[2] The project is an extension of the Measures of Effective Teaching (MET) project and the teachers who participated in the original MET project data collection.
[3] Prior notification can also complicate the logistical challenge of scheduling the requisite observations. This can be particularly onerous when external observers are tasked with observing more than one teacher in each of many schools.
[4] Because teachers were allowed to watch the lessons and choose which videos to submit after comparing all videos, they could be even more selective than they might be with prior notification alone. Arguably, we are studying an extreme version of the prior notification policy.
[5] Some peers are certified to observe in more than one grade range.

15-minute ratings: During the school day, a principal's time is a scarce resource. Given the number of observations they are being asked to do, a mix of long and short observations could help lighten the time burden. But shorter observations could also reduce accuracy and reliability. To gain insight into the benefits of longer versus shorter observations, we asked observers to pause and provide scores after the first 15 minutes of a video and then to record their scores again at the end of the lesson. A randomly chosen half of their observations were performed this way; for the other half, they provided scores only at the end of the lesson.

We deliberately chose a 15-minute interval because it is on the low end of what teachers would consider reasonable. However, by using such a short period, and comparing the reliability of very short observations to full-lesson observations, we gain insight into the full range of possible observation lengths. Observations with durations between 15 minutes and a full lesson are likely to have reliability between these two extremes.
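To get a rough feel for this tradeoff, the sketch below applies the Spearman-Brown relation from classical test theory, which treats each observation as a parallel measure with a single undifferentiated error term (the generalizability analysis later in this paper separates rater and lesson error). The single-observation reliabilities are assumed values chosen for illustration, not estimates from this study; the short-observation value is simply set to 60 percent of the full-lesson value.

```python
def spearman_brown(r1, n):
    """Reliability of the average of n parallel observations,
    each with single-observation reliability r1."""
    return n * r1 / (1 + (n - 1) * r1)

# Assumed single-observation reliabilities (illustration only, not this study's estimates).
full_lesson = 0.35
fifteen_min = 0.6 * full_lesson   # a short visit assumed worth 60 percent of a full lesson

for n in (1, 2, 4, 6):
    print(f"{n} full lesson(s): {spearman_brown(full_lesson, n):.2f}   "
          f"{n} fifteen-minute visit(s): {spearman_brown(fifteen_min, n):.2f}")
```

Under these assumed values, several short visits can approach the reliability of a smaller number of full lessons while requiring less total observation time, which is the tradeoff taken up again in the discussion of alternative ways to achieve reliability.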
Rating load: Each rater scored four lessons from each of six different teachers, for a total of 24 lessons. Attrition was minimal and consisted of a single peer rater dropping out before completing all 24 scores.[6]

Assigning videos to observers: With a total of 67 teachers (with four lessons each) and an expectation that each rater would score only 24 videos, we could not use an exhaustive or "fully crossed" design. Doing so would have required asking each rater to watch more than 10 times as many videos (4 × 67 = 268). As a result, we designed an assignment scheme that allows us to disentangle the relevant sources of measurement error while limiting the number of lessons scored per observer to 24.

First, we identified eight videos for each teacher: the four videos they chose to show to administrators (the "chosen" videos) and four more videos picked at random from the remaining videos they had collected during the spring semester of the 2011–12 school year (their "not chosen" videos).[7]

Second, we assigned each teacher's four chosen videos (or, if they elected not to choose, four randomly chosen videos) to the administrators from their school. If there were two administrators from their school, we assigned all four chosen videos to each of them.

Third, we randomly created administrator teams of three to four administrators each (always including administrators from more than one school) and peer teams of three to six peer raters each. We randomly assigned a block of three to five additional teachers from outside the school to each administrator team. (For example, if an administrator had three teachers from his or her school participating, he or she could score three additional teachers; if the administrator had one teacher from his or her school, he or she could score five more.) The peer rater teams were assigned blocks of six teachers.

Fourth, to ensure that the scores for the videos were not confounded by the order in which they were observed or by differences in observers' level of attentiveness or fatigue at the beginning or end of their scoring assignments, we randomly divided the four lessons from a teacher assigned to a rater block into two pairs. We randomly assigned the order of the pairs of videos to each of the observers in the block. Moreover, within each pair of videos for a teacher, we randomized the order in which the first and second videos were viewed.

[6] Another observer stepped in to complete those scores.
[7] Teachers understood that if they did not exercise their right to choose, we would identify lessons at random to assign to administrators.

Table 1 illustrates the assignment of teachers to one peer rater block. In this illustration, there are six teachers (11, 22, 33, 44, 55, 66) and two peer raters (Rater A and Rater B). The raters will score two lessons that the teachers chose (videos 1 and 2) as well as two videos that the teachers did not choose (videos 3 and 4). Half of the videos will be scored using the "full lesson only" mode. Half of the videos will be scored at 15 minutes and then rescored at the end of 60 minutes. In this illustration, Rater A rates full lessons first and Rater B rates 15 minutes first, then full lessons.
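As a rough sketch of how such counterbalancing might be generated for one peer rater block like the one in Table 1, consider the snippet below. It is a reconstruction for illustration, not the actual MET assignment procedure; for simplicity it keeps the chosen pair (1, 2) and the not-chosen pair (3, 4) intact, as in Table 1, although the paper describes the pair split itself as random.

```python
import random

def assign_block(teachers, raters, seed=0):
    """Build one rater block: 24 observations per rater, counterbalanced
    over video pairs, within-pair order, teacher order, and scoring mode."""
    rng = random.Random(seed)
    video_pairs = [[1, 2], [3, 4]]       # chosen pair, not-chosen pair
    schedule = {}
    for rater in raters:
        modes = ["15 min, then full lesson", "full lesson only"]
        rng.shuffle(modes)               # which scoring mode this rater does first
        halves = {m: [] for m in modes}
        for t in teachers:
            first, second = rng.sample(video_pairs, 2)            # random pair-to-mode assignment
            halves[modes[0]].append((t, rng.sample(first, 2)))    # random within-pair order
            halves[modes[1]].append((t, rng.sample(second, 2)))
        for m in modes:
            rng.shuffle(halves[m])       # random teacher order within each half
        schedule[rater] = [(t, video, m) for m in modes
                           for t, videos in halves[m] for video in videos]
    return schedule

# Two raters and six teachers yield 24 observations per rater, as in Table 1.
block = assign_block([11, 22, 33, 44, 55, 66], ["A", "B"])
print(block["A"][:4])   # first four (teacher, video, mode) slots for Rater A
```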
Table 1. Hypothetical Assignment Matrix for One Team of Peer Raters
Two raters: A, B. Six teachers: 11, 22, 33, 44, 55, 66. Four videos: 1, 2 (chosen by teacher); 3, 4 (not chosen by teacher).

Observation:       1  2  3  4  5  6  7  8  9 10 11 12 | 13 14 15 16 17 18 19 20 21 22 23 24
                   Rating at 15 minutes, followed by full lesson | Full lesson only
Rater A  teacher: 11 11 22 22 33 33 44 44 55 55 66 66 | 22 22 11 11 66 66 55 55 33 33 44 44
         video:    1  2  3  4  4  3  1  2  3  4  2  1 |  1  2  4  3  4  3  2  1  1  2  4  3
Rater B  teacher: 55 55 11 11 33 33 22 22 66 66 44 44 | 11 11 44 44 33 33 22 22 66 66 55 55
         video:    4  3  1  2  3  4  2  1  3  4  1  2 |  3  4  4  3  1  2  4  3  1  2  1  2

Variance components: By ensuring that videos were fully crossed within each of the teams of peers and administrators and then pooling across blocks, we were able to distinguish among 11 different sources of variance in observed scores. In the parlance of Generalizability Theory (Brennan, 2004; Cronbach et al., 1977), this is an Item-by-Rater-by-Lesson-within-Teacher design, or I × R × (L:T). These sources are described in Table 2.

The first source, T, is the variance due to teachers. In classical test theory (e.g., Lord & Novick, 1968), this would be described as the "true score variance," or the variance attributable to persistent differences in a teacher's practice as measured on this scale.[8] Reliability is the proportion of observed score variance that is attributable to this true score variance. A reliability coefficient of one would imply that there was no measurement error; that the teacher's scores did not vary from lesson to lesson, item to item, or rater to rater; and that all of the variance in scores was attributable to persistent differences between teachers. Of course, the reliability is not equal to one because there is evidence of variance attributable to other sources.

[8] Later in the paper, we will take a "multivariate" perspective, where different types of raters may identify different true-score variances. In addition, by observing how administrators scored their own teachers differently than others did, we will divide the true-score variance into two components: that which is observable in the videos and that which is attributable to other information available to the teacher's administrator.

Table 2. Describing Variance Components for an I × R × (L:T) Generalizability Study

Source            Description
T                 Teacher variance, or "true score" variance. The "signal" that is separable from "error."
I                 Variance due to items. Some items are more difficult than others.
R                 Variance due to raters. Some raters are more severe than others.
L:T               Variance due to lessons. Confounded with teacher score dependence upon lessons.
T × I             Some teachers score higher on certain items.
T × R             Some raters score certain teachers higher.
I × R             Some raters score certain items higher.
T × I × R         Some raters score certain teachers higher on certain items.
I × (L:T)         Some items receive higher scores on certain lessons. Confounded with teacher score dependence.
(L:T) × R         Some raters score certain lessons higher. Confounded with teacher score dependence.
(L:T) × I × R, e  Residual error variance, confounded with teacher score dependence on items, raters, and lessons.

T = Teacher, I = Item, R = Rater, L = Lesson, and e = residual error variance.

The sources of variance that involve teachers in combination with other facets (L:T, T × I, T × R, T × I × R, I × (L:T), (L:T) × R, and the residual) are "undesirable" sources of variability for the purpose of distinguishing among teachers on this scale. These variance components all contain reference to teachers and would thus alter teacher rankings.
The variance components that do not involve teachers (I, R, and I × R) refer to observed score variance that is not consequential for relative rankings. For example, an item-by-rater interaction quantifies score variability due to certain raters giving lower mean scores on certain items. This adds to observed score variability but does not change teacher rankings, as this effect is constant across teachers. However, this source of error becomes consequential if, for example, different raters rate different teachers.

Generalizability theory enables us to estimate sources of error and to simulate the reliability of different scenarios, varying the number and type of raters and the number and length of lessons.
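As a sketch of what such a simulation involves, the snippet below computes the generalizability coefficient for relative decisions in a crossed I × R × (L:T) design, in which every error component that involves teachers is divided by the number of conditions it is averaged over. The variance component values and the number of items per observation are hypothetical placeholders, not the estimates reported later in this paper, and the sketch assumes every rater scores every sampled lesson.

```python
def g_coefficient(var, n_items, n_raters, n_lessons):
    """Generalizability coefficient for relative decisions in a crossed
    I x R x (L:T) design: true-score variance divided by true-score
    variance plus relative error variance."""
    relative_error = (
        var["TxI"] / n_items
        + var["TxR"] / n_raters
        + var["L:T"] / n_lessons
        + var["TxIxR"] / (n_items * n_raters)
        + var["Ix(L:T)"] / (n_items * n_lessons)
        + var["(L:T)xR"] / (n_raters * n_lessons)
        + var["residual"] / (n_items * n_raters * n_lessons)
    )
    return var["T"] / (var["T"] + relative_error)

# Hypothetical variance components (placeholders, not this study's estimates).
var = {"T": 0.040, "TxI": 0.010, "TxR": 0.020, "L:T": 0.030,
       "TxIxR": 0.010, "Ix(L:T)": 0.015, "(L:T)xR": 0.020, "residual": 0.080}

# Reliability for different numbers of raters and lessons, with 8 rubric items assumed.
for n_raters in (1, 2):
    for n_lessons in (1, 2, 4):
        g = g_coefficient(var, n_items=8, n_raters=n_raters, n_lessons=n_lessons)
        print(f"raters = {n_raters}, lessons = {n_lessons}: reliability = {g:.2f}")
```

The components I, R, and I × R are omitted from the error term because, as noted above, they shift all teachers equally and do not affect relative rankings.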
