Teachers and Cheaters. Just an Anagram?

Santiago Pereda-Fernández*
Banca d'Italia

January 14, 2017

Abstract

In this paper I study the manipulation of test scores in the Italian education system. Using an experiment consisting of the random assignment of external monitors to classrooms, I apply a new methodology to study the nature and extent of the manipulation of test scores at different levels of primary and secondary education, and I propose a correction method. The results show frequent manipulation, which is not associated with an increase in the correlation of the answers after I control for mean test scores. The manipulation is concentrated in the South and Islands region, and it tends to favor female and immigrant students. Finally, the negative correlation between the amount of manipulation and the number of missing answers in open-ended questions relative to multiple-choice questions suggests that teachers are more responsible for the manipulation than students.

Keywords: Cheating Correction, Copula, Discrimination, Gender, Nonlinear Panel Data, Test Scores Manipulation
JEL classification: C23, C25, I21, I28, J24

* Banca d'Italia, Via Nazionale 91, 00184 Roma, Italy. This paper was previously circulated under the title A New Method for the Correction of Test Scores Manipulation. I would like to thank Alessandro Belmonte, Stéphane Bonhomme, Nicola Curci, Domenico Depalo, Patrizia Falzetti, Raquel Fernández, Iván Fernández-Val, Guzmán González-Torres, Caroline Hoxby, Andrea Ichino, Claudio Michelacci, Marco Savegnago, Paolo Sestito, Martino Tasso, Jeffrey Wooldridge, Paolo Zacchia, Stefania Zotteri, and seminar participants at Banca d'Italia, EUI, IMT Lucca, Universidad de Alicante, Universidad de Cantabria, and the 2nd IAAE for helpful comments and suggestions. All remaining errors are my own. The views presented in this paper do not necessarily reflect those of the Banca d'Italia. I can be reached via email at [email protected].

1 Introduction

A policy maker interested in evaluating the education system requires a comparable measure of academic achievement across students. Standardized tests permit the comparison of students' knowledge, and are often used to evaluate teachers (Hanushek, 1971; Rockoff, 2004; Aaronson et al., 2007) and principals (Grissom et al., 2014), although the reliability of these estimates has been called into question (Rothstein, 2010, 2015; Chetty et al., 2014). A major threat to the comparability of these tests is the manipulation of the scores, which alters students' recorded performance.[1] There is ample evidence that these tests are susceptible to manipulation, whether by teachers grading unfairly (Jacob and Levitt, 2003; Dee et al., 2011; Angrist et al., 2014; Diamond and Persson, 2016), by students copying from each other (Levitt and Lin, 2015), or even by principals who alter the pool of students taking the exam (Figlio, 2006; Cullen and Reback, 2006; Hussain, 2015), which, despite not being a manipulation of individual test scores, affects their overall distribution.

[1] Throughout this paper I refer to test score manipulation and cheating as any action taken by the students or the teachers that results in a variation of the test scores, usually an increase. This could take place before the test (alteration of the pool of students), during the test (students copying from one another, teachers turning a blind eye or telling them the answers), or after the test (unfair grading, including leniency).

In this paper I study this phenomenon, making the following contributions. First, I study the extent of test score manipulation, taking advantage of a natural experiment in the Italian education system that randomly assigned external monitors to proctor some tests. On top of already known results, I find that the manipulation systematically favors female over male students, and immigrants over natives.
Second, I propose a method to detect and correct manipulated test scores based on how likely the actual results are to occur at random. Methods that classify tests as either manipulated or fair face two potential sources of misclassification: mistaking fair tests for manipulated ones (type I misclassification), and mistaking manipulated tests for fair ones (type II misclassification). Type I misclassification is particularly unfair, since some of the statistics used to detect cheating are similar for high-achieving tests without manipulation and for manipulated tests. For example, both are likely to display high class means or correlated test scores, which could merely reflect effective teaching practices.

Moreover, empirical studies on education often rely on raw test scores as a measure of students' achievement, frequently standardized to have zero mean and unit standard deviation. However, this may not be the most appropriate approach for detecting cheating: the answers to every single test item allow one to consider a richer correlation structure of the results, which can be more informative for detecting test score manipulation. This correlation stems from factors that operate in different manners and can be classified into three main categories: individual characteristics, which only affect a single student; class characteristics, which affect every student in the same classroom; and question characteristics, which affect every student, though only on each specific question. Hence, when a question is difficult, only a small fraction of students is likely to answer it correctly, creating a high correlation in the answers to that particular question, both within and between classrooms.
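To see how these three layers generate correlated answers absent any cheating, consider the following minimal simulation. It is a sketch under assumed, illustrative parameter values, not the paper's model or data:

```python
import numpy as np

rng = np.random.default_rng(0)
C, N, Q = 200, 20, 50                        # classrooms, students/class, questions
class_eff = rng.normal(0.0, 0.5, size=C)     # class characteristics
ability = rng.normal(0.0, 1.0, size=(C, N))  # individual characteristics
difficulty = rng.normal(0.0, 1.0, size=Q)    # question characteristics

# Latent index plus an iid logistic shock; an answer is correct when the
# index is positive. No manipulation is simulated anywhere.
latent = (class_eff[:, None, None] + ability[:, :, None]
          - difficulty[None, None, :] + rng.logistic(size=(C, N, Q)))
y = latent > 0

scores = y.mean(axis=2)                      # fraction correct per student

# Class effects alone push part of the score variance between classrooms...
print("rough share of score variance between classes:",
      round(float(scores.mean(axis=1).var() / scores.var()), 2))
# ...and difficult questions are missed by most students everywhere, which
# correlates answers on that question both within and across classrooms.
print("corr(difficulty, fraction correct):",
      round(float(np.corrcoef(difficulty, y.mean(axis=(0, 1)))[0, 1]), 2))
```

Even with every student answering independently, both printed quantities are far from what student-level noise alone would produce, which is why high class means or correlated answers cannot by themselves prove manipulation.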
To overcome these challenges, the method I propose compares the likelihood of the results of two groups: one in which test scores are assumed to be fair (treatment group), and another in which they might have been manipulated (control group), analogously to a comparison between blind-graded and non-blind-graded exams (e.g. Lavy (2008) or Hinnerich et al. (2011)). Hence, the results in the treatment group allow one to estimate the probability of obtaining the observed test scores at random, without manipulation. If the frequency of such results in the control group is larger than in the treatment group, this indicates the existence of manipulated test scores, and the larger the difference, the more widespread the manipulation. This likelihood function accounts for all the previously mentioned effects, which create correlation patterns in students' answers without manipulation.[2]

Under the assumption that the estimates of the group with an external monitor are not manipulated, differences between the two sets of estimates reflect the amount of manipulation for each demographic group and question.[3] The estimates from the treatment group are subsequently used to calculate the probability of obtaining the observed results with manipulation. This constitutes the basis for the correction method, which applies a larger reduction of test scores the more unlikely the results are and the higher the test scores are.

[2] The setup is similar to those considered in Item Response Theory: the result for each question is modeled using an individual latent trait which is constant across questions, and questions are allowed to vary in difficulty. See Bacci et al. (2014) for an example applied to the INVALSI tests.
[3] Note that this does not imply that the test scores of every student in the control group were manipulated, nor that the manipulation was of the same magnitude for students with the same characteristics.
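The detection logic can be caricatured in a few lines: fit outcome probabilities on the treatment group, then ask how likely each group's results are under that fair-test benchmark. The sketch below is deliberately simplified and is not the paper's estimator; in particular, it uses only question-level probabilities and ignores the student and class effects that the full likelihood accounts for:

```python
import numpy as np

def likelihood_gap(y_treat, y_ctrl):
    """Sketch of the detection idea: estimate per-question probabilities of
    a correct answer on the treatment group (assumed fair), then compare how
    likely each group's results are under that benchmark. Simplified: it
    ignores student and class effects, unlike the full likelihood.

    y_treat, y_ctrl: boolean (students x questions) answer matrices."""
    p = y_treat.mean(axis=0).clip(1e-6, 1 - 1e-6)

    def loglik(y):
        # Log-likelihood of each student's answer vector under the benchmark
        return np.where(y, np.log(p), np.log1p(-p)).sum(axis=1)

    # Score inflation in the control group shows up as results that are
    # systematically less likely under the fair-test benchmark.
    return loglik(y_treat).mean(), loglik(y_ctrl).mean()
```

A gap between the two averages, with the control group's results less likely under the fair benchmark, is the kind of evidence the full method formalizes and then inverts to correct the scores.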
The data I use stem from a set of recently introduced low-stakes standardized tests in the Italian education system in primary, lower secondary, and upper secondary education. Students take these exams in their own schools, proctored by a teacher from their school who was not their teacher during the academic year. These teachers are also responsible for grading, transcribing the test scores, and sending them back to the National Institute for the Evaluation of the Education System (INVALSI). However, a set of randomly selected classrooms have an external monitor, who is responsible for the same tasks but had no prior connection to the school. This constitutes a large-scale natural experiment to study test score manipulation in the absence of an external monitor.

Previous work used the results from preceding years of the primary and lower secondary tests.[4] It found that having an internal monitor is associated with higher, more correlated test scores (Bertoni et al., 2013), which could be the result of student interactions (Lucifora and Tonello, 2015) or of teachers' shirking at grading (Angrist et al., 2014). Moreover, the amount of manipulation is much larger in the South & Islands of Italy, which is strongly correlated with other measures of social capital (Paccagnella and Sestito, 2014).

[4] In particular, Bertoni et al. (2013) focused on grades 2 and 5 for the 2010 tests, Angrist et al. (2014) and Battistin et al. (2014) on grades 2 and 5 for the 2010-12 tests, and Lucifora and Tonello (2015) on grade 6 for the 2010 tests.

I find substantial test score manipulation, which is heterogeneous along various dimensions. Apart from the already known geographical patterns, I find that female and immigrant students benefit from this manipulation more than their male and native peers. In particular, females have more manipulated test scores than males. This result holds for all exams, and the manipulation is higher in mathematics exams, in which it can amount to up to 2.1%, whereas in the Italian exams it is at most 1.2%. Regarding differences between ethnic groups, I find that immigrant students in Italy tend to be favored relative to natives. This manipulation is larger in Italian exams, and it can be up to 2.7%.

If students were responsible for the manipulation, the correlation in their answers would increase. However, once I control for the mean scores, the correlation is not substantially different when the monitor is internal or external, but it is larger than the correlation found when students come from different classrooms. Hence, rather than manipulation, the correlation in students' answers most likely reflects a combination of teacher quality, peer effects, and sorting of students. Also, the larger the amount of manipulation in open-ended questions relative to multiple-choice questions, the smaller the fraction of missing answers to open-ended questions relative to multiple-choice questions. These patterns are the opposite of what would arise if students copied from each other during the exam.

Even though these exams have no formal consequences for teachers (e.g. their wages are not linked to the results), they may have incentives to manipulate the results if they perceive that they are or could be evaluated in the future: if they were to be paid based on the performance of their students, or if principals used the results internally.[5] Hence, manipulation could be a means to invalidate the comparability of the results, preventing their students' test scores from having consequences for them.

[5] These concerns, among others, have led to important boycotts of the 2014/15 and 2015/16 tests: in some of the exams, up to 10% of the students did not participate. See http://www.invalsi.it/invalsi/doc_evidenza/2015/Comunicato_stampa_Prove_INVALSI_2015_07_05.pdf and http://www.invalsi.it/invalsi/doc_evidenza/2016/Com_Stampa_INVALSI_II_SEC_SEC_GRADO.pdf.

The rest of the paper is organized as follows: the institutional details of the test and some descriptive statistics are presented in section 2. The empirical strategy and the correction methods are explained in section 3. Section 4 shows the results of the estimation, while section 5 shows the class-level correction in practice. Section 6 concludes.

2 Italian National Evaluation Test

INVALSI is the Italian institute responsible for the design and administration of annual standardized tests for Italian students. It was created in 1999, and in the academic year 2008/09 its tests acquired nationwide status. All students enrolled in certain grades are required to take two tests, one in mathematics and one in Italian language. Even though the Italian Ministry of Education has stated the necessity of establishing a system of evaluation of teachers and schools based on students' performance, the tests have been low stakes for all grades, with the exception of the 8th (III media), which corresponds to the end of compulsory secondary education and in which the results of the test account for a sixth of the final marks.

These exams are taken in the classroom, and they are proctored by either an internal or an external monitor, who is also responsible for grading, transcribing the result of each student to a sheet, and sending it to INVALSI.[6] Internal monitors are teachers of a different class in the same school, while external monitors are teachers and principals who had not worked in the town of the school they were assigned to for at least two years before the exam.[7] External monitors are randomly assigned to classes with the same selection mechanism used by the IEA-TIMSS survey. In a first stage, a fixed number of schools from each region are selected at random. In a second stage, depending on the number of classrooms existing in the selected schools, one or two of them are selected at random by INVALSI.[8] Students in these classes constitute the treatment group.

[6] Only some of the questions are multiple choice, so this task cannot be done automatically by a machine.
[7] Some of the external monitors are retired teachers, while others are precari, i.e. teachers without a tenured position. They are paid between 100 and 200 EUR for the job, and can be asked in subsequent years to monitor more exams, giving them incentives to grade fairly.
[8] The 2013 tests were the first in which the assignment was done by INVALSI through a public procedure. Previously, it was done by the selected schools. The recent changes in the assignment of external monitors have made it possible to reduce the number of treated classrooms.
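The two-stage assignment just described can be sketched as follows. The helper below is illustrative only: the exact IEA-TIMSS stratification rules, the per-region sample sizes, and the rule deciding between one and two classrooms are my assumptions:

```python
import random

def assign_external_monitors(schools, n_schools_per_region, seed=0):
    """Two-stage sketch of the assignment of external monitors.

    schools: list of dicts like {"region": str, "classes": [class ids]}.
    Stage 1 draws a fixed number of schools per region at random; stage 2
    draws one or two classrooms within each selected school. The threshold
    deciding between one and two classrooms is an assumption."""
    rng = random.Random(seed)
    by_region = {}
    for school in schools:
        by_region.setdefault(school["region"], []).append(school)
    treated = []
    for region, pool in by_region.items():
        for school in rng.sample(pool, min(n_schools_per_region, len(pool))):
            k = 2 if len(school["classes"]) > 2 else 1  # assumed rule
            treated.extend(rng.sample(school["classes"],
                                      min(k, len(school["classes"]))))
    return treated  # classrooms forming the treatment group
```

Randomizing at the classroom level within randomly drawn schools is what makes the internal-monitor classrooms a valid counterfactual for the externally monitored ones.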
Teachers, unlike external monitors, may have incentives to manipulate test scores: despite the low-stakes nature of the exam, and although their salaries are not linked to the exam results, they may perceive that they are evaluated based on the results. INVALSI sends the results to principals, who can make them public to entice parents to enroll their children in their school. Moreover, anecdotal evidence suggests that the results are discussed in front of all teachers and have internal consequences, such as the assignment of troublesome students. This, coupled with the possibility that in the future principals might be able to pay teachers based on their performance, may give teachers incentives to manipulate their students' scores.[9] In this context, manipulation of the test scores could be used as a tool to invalidate the comparability of the results and consequently prevent linking students' performance to teachers' pay. Hence, teachers whose students perform worse would have more incentives to manipulate the test scores.

[9] 200 million euros have been assigned to principals to distribute among their teachers. The criteria for distributing this money include teaching quality, which could be measured by the results of the INVALSI tests. See https://labuonascuola.gov.it/documenti/LA_BUONA_SCUOLA_SINTESI_SCHEDE.pdf?v=0b45ec8.

2.1 Data and Descriptive Statistics

As shown in table 1, over 2.3 million students were tested during the academic year 2012/13, of whom over 143,000 were assigned an external monitor. The table also shows the mean percentage of correct answers for students with either an internal or an external monitor in all ten exams, which was higher when the monitor was internal. The difference between the two groups varies across grades and is larger for the mathematics exam.

Table 1: Size of the groups, academic year 2012/13

                    2nd grade         5th grade         6th grade         8th grade         10th grade
                   EX       IN       EX       IN       EX       IN       EX       IN       EX       IN
  N             25070   437479    24773   424046    27504   410332    28153   360528    38273   270262
  C              1424    25346     1426    25559     1457    21756     1464    19041     2203    15339
  S               737     6451      736     6422      732     5143     1416     4537     1094     3276
  % Correct     53.87    61.20    54.79    59.52    44.53    45.25    50.83    52.48    42.09    45.13
  (Math)       (20.68)  (21.58)  (18.87)  (19.25)  (16.80)  (16.70)  (18.98)  (19.02)  (17.72)  (18.39)
  % Correct     59.90    64.76    74.36    76.82    64.25    64.40    72.44    73.12    64.20    65.92
  (Ita)        (17.39)  (17.84)  (16.12)  (15.52)  (16.74)  (16.87)  (14.96)  (14.78)  (16.20)  (17.00)

Notes: N, C, and S respectively denote the number of students, classrooms, and schools; EX and IN respectively denote the groups with the external and the internal monitor. Classes with an internal monitor in schools that had at least one class with an external monitor are excluded. Standard deviations in parentheses.

Table 2 shows the mean and standard deviation of the covariates I use in this paper. As in previous editions of the test, some of the variables were not perfectly balanced across the two groups. In particular, the mean class size is slightly larger in classrooms proctored by an external monitor in more than half of the exams, and these classrooms have a slightly higher presence of male and immigrant students in the upper secondary exams. Finally, the geographic stratification led to an over-representation of students from regions in which test scores were more manipulated in previous years (Bertoni et al., 2013).
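The balance comparisons reported in table 2 below amount to two-sample tests of equal covariate means. A minimal sketch follows; I assume Welch's t-test here, which need not be the paper's exact procedure:

```python
import numpy as np
from scipy import stats

def balance_row(x_ex, x_in, alpha=0.05):
    """Compare a covariate's mean across the external- (x_ex) and
    internal-monitor (x_in) groups, starring the external mean when the
    difference is significant at the 95% level. Welch's t-test is an
    assumption, not necessarily the paper's procedure."""
    _, p = stats.ttest_ind(x_ex, x_in, equal_var=False)
    star = "*" if p < alpha else ""
    return (f"{np.mean(x_ex):.2f}{star} ({np.std(x_ex):.2f})  "
            f"{np.mean(x_in):.2f} ({np.std(x_in):.2f})")
```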
Table 2: Mean and standard deviation of covariates

                    2nd grade         5th grade         6th grade         8th grade         10th grade
                   EX       IN       EX       IN       EX       IN       EX       IN       EX       IN
  Class size    17.61*   17.26    17.37*   16.59    18.88    18.86    19.23*   18.93    17.37    17.62
                (4.69)   (5.22)   (4.75)   (5.04)   (4.20)   (4.54)   (4.49)   (4.54)   (5.49)   (5.97)
  Male           0.51     0.51     0.50     0.50     0.51     0.51     0.51     0.50     0.51*    0.49
                (0.50)   (0.50)   (0.50)   (0.50)   (0.50)   (0.50)   (0.50)   (0.50)   (0.50)   (0.50)
  Native         0.95     0.95     0.94     0.94     0.92     0.92     0.91*    0.92     0.90*    0.91
                (0.21)   (0.21)   (0.24)   (0.24)   (0.26)   (0.26)   (0.28)   (0.27)   (0.29)   (0.28)
  North          0.39*    0.46     0.38*    0.44     0.43*    0.45     0.41*    0.43     0.41*    0.45
                (0.49)   (0.50)   (0.49)   (0.50)   (0.49)   (0.50)   (0.49)   (0.50)   (0.49)   (0.50)
  Center         0.19*    0.18     0.19*    0.18     0.19*    0.17     0.20*    0.18     0.18*    0.16
                (0.40)   (0.39)   (0.39)   (0.38)   (0.39)   (0.38)   (0.40)   (0.38)   (0.39)   (0.37)
  South & Isles  0.42*    0.36     0.43*    0.38     0.38     0.39     0.39     0.39     0.40*    0.38
                (0.49)   (0.48)   (0.50)   (0.48)   (0.49)   (0.49)   (0.49)   (0.49)   (0.49)   (0.49)

Notes: EX and IN respectively denote the groups with the external and the internal monitor. Standard deviations in parentheses. An asterisk denotes that the difference between the two groups is significantly different from zero at the 95% confidence level.

For expositional brevity, I focus my analysis on the 10th graders' mathematics exam, since they constitute the largest treatment group and the percentage of manipulation, measured as the difference in the percentage of correct answers between the two groups, is larger for the mathematics exam. Given that the amount of manipulation varied substantially by exam, pooling all the exams together to detect cheating patterns would be counterproductive, as it would conflate several manipulation patterns. Regardless, the regressions for all exams are shown in appendix B, and those results that differ across exams are also reported in the paper.

A total of 38,273 students in 2,203 classes were assigned an external monitor, whereas 270,262 students in 15,339 classrooms were assigned an internal monitor in schools without external monitors.[10] The total number of questions in this exam was 50. The left panel in figure 1 shows the proportion of students who answered each question correctly. Even though there is a lot of variability across answers, students proctored by external monitors scored worse than the rest on all but three of the questions, suggesting that the scores were manipulated. Both difficult and easy questions can have large or small differences between the two groups, although there is a weak correlation between the difficulty of a question, measured as the proportion of correct answers in the treatment group, and the difference between the two groups.[11] Similarly, the right graph shows a change in the distribution of the total number of correct answers, with the mean, the median, and the mode all increasing when the examiner is internal. Most of the change takes place around the center of the distribution, whereas the tails show a change much smaller in magnitude. Since this is a low-stakes exam, there are no jumps at a cut-off grade and the change is quite smooth.

[10] Since Bertoni et al. (2013) found that the manipulation was less severe in non-treated classrooms in treated schools, I exclude them from the main analysis. The manipulation patterns in these classrooms are similar to those from non-treated schools. These results are available upon request.
[11] This result is, however, not consistent across exams, and for some of them there is no correlation at all.

Figure 1: 10th grade mathematics exam results. [Two panels; plot data not reproduced.] The left graph depicts the proportion of correct answers by question (questions are sorted by how frequently they were correctly answered by students proctored by an external monitor); the right graph depicts the distribution of students' test scores. EX, IN, DIF, and r respectively denote the groups with the external and the internal monitor, the difference between them, and the number of correct answers.
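The left panel of figure 1 is straightforward to reproduce from the raw answer matrices; a sketch (the variable names are mine):

```python
import numpy as np

def question_level_comparison(y_ex, y_in):
    """Proportion of correct answers by question for the external- (EX) and
    internal-monitor (IN) groups, with questions sorted by how often they
    were answered correctly under an external monitor, as in figure 1.

    y_ex, y_in: boolean (students x questions) answer matrices."""
    p_ex, p_in = y_ex.mean(axis=0), y_in.mean(axis=0)
    order = np.argsort(p_ex)              # sort by treatment-group difficulty
    return p_ex[order], p_in[order], (p_in - p_ex)[order]  # last entry is DIF
```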
Apart from the mean test score, other measures used to identify cheating (Jacob and Levitt, 2003; Quintano et al., 2009) are based on the correlation of the answers. However, if the mean test scores of classrooms in the treatment group differ from those in the control group, then the correlation in the answers will differ between the two groups even if there is no manipulation.[12] This presents a comparability problem, which can be resolved by appropriately controlling for the mean test scores.

[12] To see this, note that if every student in a class got the maximum grade, then the correlation in the answers would be one by construction. On the other hand, if all of them got one half of the answers right, the correlation could be equal to one, but it could also be equal to zero.

To illustrate this point, consider the following alternative statistic to the within-class correlation of test scores: the mean number of correct answers in common between two students, s, conditional on each of them having correctly answered r and r̃ questions. This is estimated by

\[
E_n\!\left(s \mid r, \tilde{r}\right) \equiv \sum_{s=0}^{Q} s \,
\frac{\sum_{c=1}^{C}\sum_{i=1}^{N_c}\sum_{j \neq i}
      \mathbf{1}\!\left(r_i = r,\; r_j = \tilde{r},\; s_{ij} = s\right)}
     {\sum_{c=1}^{C}\sum_{i=1}^{N_c}\sum_{j \neq i}
      \mathbf{1}\!\left(r_i = r,\; r_j = \tilde{r}\right)}
\tag{1}
\]

where N_c is the number of students in classroom c, C is the number of classrooms, and Q is the number of questions. This statistic mirrors the Oaxaca-Blinder decomposition: because the two groups have different means, controlling for them makes it possible to see, in a comparable manner, whether the answers were more homogeneous for students proctored by an internal monitor. If cheating increases the homogeneity of the answers, then the manipulation would affect both the mean test scores and the conditional homogeneity.[13]

Figure 2 shows the values of this statistic for different values of (r, r̃), both for students in the same classroom in each of the two groups and for students who are in different classrooms.[14] As expected, this conditional mean is uniformly larger for students in the same classroom than for students in different classrooms. However, the conditional mean number of correct answers in common is roughly the same whether the monitor is internal or external. Hence, it could be argued that the amount of homogeneity in students' answers is the same in both groups once we control for their mean test scores, and that it may reflect spillovers or the teacher effect.

[13] This would happen if students copied each other, or if teachers graded in a systematic way.
[14] Because of the number of possible combinations of r and r̃, I show a representative selection of them. Full results are available upon request.
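Equation (1) has a direct empirical counterpart: average the number of common correct answers over all within-class ordered pairs of students with totals (r, r̃). A naive O(N²)-per-class sketch (the names are mine):

```python
import numpy as np
from collections import defaultdict

def conditional_common_correct(y, class_ids):
    """Empirical counterpart of equation (1): the mean number of correct
    answers in common, s, between two students of the same classroom,
    conditional on their total scores (r, r_tilde).

    y: boolean (students x questions) answer matrix;
    class_ids: array mapping each student to a classroom."""
    totals = y.sum(axis=1)
    sums, counts = defaultdict(float), defaultdict(int)
    for c in np.unique(class_ids):
        members = np.flatnonzero(class_ids == c)
        for i in members:
            for j in members:
                if i == j:
                    continue
                key = (int(totals[i]), int(totals[j]))
                sums[key] += int((y[i] & y[j]).sum())  # common correct answers
                counts[key] += 1
    return {key: sums[key] / counts[key] for key in counts}
```

Drawing the pairs from different classrooms instead of the same one gives the between-classroom benchmark plotted in figure 2.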
3 Empirical Methodology

Questions in the INVALSI tests are graded on a right/wrong basis, so let y_icq equal one if student i in classroom c correctly answered question q in the exam, and zero otherwise. This variable can be modeled with a latent variable, y*_icq, which is the sum of three effects: a student-class effect, η_ic, a question effect, ξ_q, and a student-class-question-specific iid shock, ε_icq. The student-class effect measures the ability of a student, and more able students have a higher probability of answering each question correctly. On the other hand, the question effect captures the difficulty of each question.
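In symbols, this corresponds to a threshold-crossing model of the form below; the zero threshold and the indicator convention are my notational assumptions, and the paper's exact specification may differ:

```latex
\[
  y^{*}_{icq} = \eta_{ic} + \xi_{q} + \varepsilon_{icq},
  \qquad
  y_{icq} = \mathbf{1}\{\, y^{*}_{icq} \geq 0 \,\},
\]
% $\eta_{ic}$: student-class effect (ability); $\xi_{q}$: question effect
% (difficulty); $\varepsilon_{icq}$: iid student-class-question shock.
```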
