Eurasian Journal of Educational Research, Issue 49, Fall 2012, 61-80

Application of Computerized Adaptive Testing to Entrance Examination for Graduate Studies in Turkey

Okan BULUT*
Adnan KAN**

* Research Assist., University of Minnesota, Department of Educational Psychology, [email protected]
** Assoc. Prof. Dr., Gazi University, Department of Educational Sciences, [email protected]

Suggested Citation: Bulut, O., & Kan, A. (2012). Application of computerized adaptive testing to entrance examination for graduate studies in Turkey. Egitim Arastirmalari-Eurasian Journal of Educational Research, 49, 61-80.

Abstract

Problem Statement: Computerized adaptive testing (CAT) is a sophisticated and efficient way of delivering examinations. In CAT, items for each examinee are selected based on the examinee's responses to previously administered items. In this way, the difficulty level of the test is adapted to the examinee's ability, and abilities can be estimated precisely with a small number of items. A number of operational testing programs around the world use CAT; however, it has not been applied to any operational test in Turkey, where there are several standardized assessments taken by millions of people every year. Therefore, this study investigates the applicability of CAT to a high-stakes test in Turkey.

Purpose of Study: The purpose of this study is to examine the applicability of the CAT procedure to the Entrance Examination for Graduate Studies (EEGS), which is used in selecting students for graduate programs in Turkish universities.

Methods: In this study, post-hoc simulations were conducted using real responses from examinees. First, all items in EEGS were calibrated using the three-parameter item response theory (IRT) model. Then, ability estimates were obtained for all examinees. Using the item parameters and responses to EEGS, post-hoc simulations were run to estimate abilities in CAT. The Expected A Posteriori (EAP) method was used for ability estimation. The test termination rule was the standard error of measurement for estimated abilities.

Findings and Results: The results indicated that CAT provided accurate ability estimates with fewer items compared to the paper-pencil format of EEGS. Correlations between ability estimates from CAT and the real administration of EEGS were found to be 0.93 or higher under all conditions. The average number of items given in CAT ranged from 9 to 22. The number of items given to the examinees could be reduced by up to 70%. Even with a high SEM termination criterion, CAT provided very reliable ability estimates. EAP was the best method among several ability estimation methods (e.g., MAP, MLE, etc.).

Conclusions and Recommendations: CAT can be useful in administering EEGS. With a large item bank, EEGS can be administered to examinees in a reliable and efficient way. The use of CAT can help to minimize the cost of test administration, since printed test materials would not be needed anymore. It can also help to prevent cheating during the test.

Keywords: Computerized adaptive testing, item response theory, standardized assessment, reliability.

Standardized tests in Turkey are implemented in such a way that a multiple-choice test in a paper-pencil format with the same items for everyone is given to all examinees on a certain date. Most of the large-scale assessments in Turkey are administered by the Student Selection and Placement Center and the Ministry of National Education. The Student Placement Examination, the Foreign Language Examination for Civil Servants, the Entrance Examination for Graduate Studies, and the Level Determination Exam are some of the high-stakes tests that are taken by many examinees in Turkey every year.
For example, over one million examinees take the Student Selection Examination (SSE), which is used for placing students into undergraduate programs in Turkish universities. The Foreign Language Examination for Civil Servants, which is used for measuring English reading comprehension skills of public servants, is also taken by thousands of people. The Entrance Examination for Graduate Studies (EEGS), which is similar to the GRE in the US in terms of its purpose, is taken by fourth-year undergraduates and college graduates. EEGS scores are submitted with graduate school applications in Turkey (Student Selection and Placement Center, 2012). Among these tests, EEGS is an important one because scores obtained from EEGS are used for admitting students to graduate programs and also for selecting graduate assistants in Turkish universities.

One big criticism of EEGS concerns how its test scores are obtained. Since scores for the EEGS subtests are obtained using traditional Classical Test Theory (CTT) methods, test scores depend heavily on the items used in the test and the persons taking the test. For instance, a person may attend two administrations of EEGS within the same year and obtain very different scores, although the ability level of the person has not changed. This results from the fact that test scores and item difficulties are weighted based on the performance of other test-takers in a particular test administration. Therefore, test scores from EEGS can be substantially biased for some examinees.

Another issue with EEGS is the lack of stability in the precision of test scores. Since only a specific set of items is administered to each examinee, it is hard to compute test scores for everyone at a similar level of precision. Also, the use of all items for all examinees may not be necessary because some items may provide very small amounts of information or no information for some examinees with a particular ability level. For example, some items can be very hard or very easy for some examinees. This situation may cause several disadvantages. First, items that are too easy or too difficult for an examinee provide little information about that examinee's ability level. Second, administering very difficult or very easy items to examinees can make them bored or frustrated. Thus, using such items would be a waste of time. Also, examinees may attempt to guess the answers to items that are very difficult for them, which may, in turn, increase the error in ability estimation. If it were possible to give each examinee a test ideally matched to his/her ability level, the problems mentioned above could be solved effectively (Mead & Drasgow, 1993).

Matching items to examinees' ability levels is, therefore, an important issue in all testing programs. To administer items that match an examinee's ability level, responses to previously given items in the test can be used to select the next appropriate items for that examinee. Computerized adaptive testing (CAT) is a procedure that puts this idea into practice. CAT is a special approach to the assessment of latent abilities in which the selection of the test items presented to the examinee is based on the responses given by the examinee to previously administered items (Frey & Seitz, 2011). The basic idea behind CAT is to give examinees only items tailored or adapted to their ability levels in order to maximize the information drawn from each response. In a typical CAT administration, an iterative process with the following steps is used (a code sketch of this loop is given after the list):

1. All the items that have not yet been administered are evaluated to determine which will be the best one to administer next given the currently estimated ability level.
2. The best next item is administered and the examinee responds.
3. A new ability estimate is computed based on the responses to all of the administered items.
4. Steps 1 through 3 are repeated until a stopping criterion is met (Rudner, 2012).
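The four steps above amount to a short loop. The following R code is a self-contained, illustrative toy of that loop, not code from this study: it simulates a random 3PL item bank, draws responses for one hypothetical examinee, selects each next item by maximum information, re-estimates ability by EAP on a quadrature grid, and stops when the standard error drops below 0.30 or 30 items have been given. All numbers, priors, and variable names in it are assumptions made for illustration.

```r
# Self-contained, illustrative toy of the four-step CAT loop (not code from
# the study). A random 3PL item bank and one hypothetical examinee with
# true theta = 0.5 are simulated; all numbers below are assumptions.
set.seed(1)

n_items <- 200
bank <- data.frame(a = runif(n_items, 0.8, 2.5),   # discrimination
                   b = rnorm(n_items),             # difficulty
                   c = runif(n_items, 0.0, 0.2))   # guessing

p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

grid  <- seq(-4, 4, length.out = 81)   # quadrature points for EAP scoring
prior <- dnorm(grid)                   # N(0, 1) prior on ability

true_theta   <- 0.5
administered <- integer(0)
responses    <- integer(0)
theta_hat    <- 0
se           <- Inf

while (se > 0.30 && length(administered) < 30) {
  # Step 1: evaluate remaining items; pick the most informative at theta_hat
  rem <- setdiff(seq_len(n_items), administered)
  a <- bank$a[rem]; b <- bank$b[rem]; g <- bank$c[rem]
  p <- p3pl(theta_hat, a, b, g)
  info <- a^2 * ((p - g)^2 / (1 - g)^2) * ((1 - p) / p)   # 3PL item information
  item <- rem[which.max(info)]

  # Step 2: "administer" the item (here, the response is simulated)
  p_true       <- p3pl(true_theta, bank$a[item], bank$b[item], bank$c[item])
  responses    <- c(responses, rbinom(1, 1, p_true))
  administered <- c(administered, item)

  # Step 3: update the ability estimate (EAP) and its standard error
  like <- sapply(grid, function(t) {
    pt <- p3pl(t, bank$a[administered], bank$b[administered], bank$c[administered])
    prod(pt^responses * (1 - pt)^(1 - responses))
  })
  post      <- like * prior / sum(like * prior)
  theta_hat <- sum(grid * post)
  se        <- sqrt(sum((grid - theta_hat)^2 * post))
  # Step 4: repeat until the stopping criterion is met
}

c(theta = theta_hat, se = se, items_used = length(administered))
```

In the post-hoc simulations reported in the Method section, the simulated response in Step 2 is replaced by the examinee's actual recorded response from the paper-pencil administration.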
Among the advantages of CAT over conventional testing, Betz and Weiss (1974) stated that CAT-based tests are shorter than conventional forms and provide precise ability estimates of examinees. Embretson (1996) also mentioned that CAT requires fewer items, producing more valid measurement experiences than paper and pencil tests. Another advantage of CAT is its capacity to substantially increase measurement efficiency, which is the ratio of measurement precision to test length (Frey & Seitz, 2009; Segall, 2005). Compared to conventional testing programs that mostly administer a fixed number of items in a fixed order, CAT can reduce the number of items by approximately half without a loss of information and precision (e.g., Segall, 2005). Although most CATs use item pools that have been calibrated with a unidimensional item response theory (IRT) model (e.g., van der Linden & Hambleton, 1997), there are multidimensional and bi-factor CAT algorithms for tests with a multidimensional structure as well (e.g., Segall, 1996, 2001; Wang & Chen, 2004).

Currently, there are many operational programs that carry out CAT in different fields. Some examples are the Graduate Record Examination (GRE) for graduate school admissions and the Graduate Management Admission Test (GMAT) for business school admissions in the US, the Japanese Computerized Adaptive Test (J-CAT) for diagnosing the proficiency level of Japanese as a second language, and the paramedic exams administered by the National Registry of Emergency Medical Technicians for certifying the competency of entry-level emergency medical technicians. Also, a number of testing programs and tests are working toward the implementation of CAT; they include the United States Medical Licensing Examination (USMLE) of the National Board of Medical Examiners (IACAT, 2012).

Compared to the popularity of CAT and the comprehensive literature about its applications in the US and other countries, CAT is still a fairly new area in Turkey. There are only a few studies that have examined the applicability of CAT to tests in Turkey. One of these studies compared adaptive and paper-pencil test formats in terms of validity and reliability. Results indicated that there was no statistically significant difference between reliability estimates of the adaptive and conventional formats. However, when the researcher investigated the relationship between test scores from adaptive and paper-pencil formats, he found correlation coefficients of 0.88 and 0.81 for the adaptive and conventional testing formats, respectively. Although the differences were not very large, the CAT administration provided better results. Kaptan (1993) conducted a similar study by comparing ability estimates obtained from a conventional paper test and a computerized adaptive test. In the study, examinees took a 50-item math test in paper-pencil format and a 14-item CAT test. Results indicated that CAT provided a 70% reduction rate in the number of items administered. Also, there was no significant difference found between the ability estimates from the two formats. In another study, KR-20 reliability coefficients of CAT were examined, and correlations between scores obtained from CAT and the paper-pencil format of the same test were compared. In that study, the CAT item bank included 61 items. The correlation between the two formats was significant but low, with a coefficient of 0.36.
The researcher indicated some potential reasons for that, such as the limited number of items in the bank and a test stopping rule based on a fixed number of items. In a similar study, Iseri (2002) constructed an item bank using the items in the Secondary School Student Selection and Placement Examination and showed that ability levels could be estimated using fewer items. In test sessions in which students were allowed to go back to items responded to earlier, estimates for students with higher ability levels were better than those for students with lower ability levels. The Bayesian estimation method provided better ability estimates. Also, both of the stopping rules (fixed number of items and fixed standard error) yielded reliable results. Kalender (2011, 2012) applied computerized adaptive testing to the science subtest of the Student Selection Examination in Turkey. A post-hoc simulation study and a live CAT study were conducted. Expected A Posteriori (EAP) was used for estimating abilities, with standard errors ranging from .10 to .50 as test termination criteria. Results showed that CAT provided a reduction of up to 80% in the number of items given to students compared to the paper and pencil form of the test. Correlations between ability estimates obtained from CAT and the full-length test were higher than 0.80. For the live CAT administration, this correlation was about .74, which might be due to the small sample size (33 persons) used in the study. After recent cheating issues in standardized assessments in Turkey, Kalender (2012) argues that the use of CAT can help to prevent cheating since each person receives different items during the test.

More research is needed to examine the applicability of CAT to different testing programs in Turkey. CAT can be a solution to the current issues with high-stakes tests in Turkey. The Entrance Examination for Graduate Studies is an exam to which CAT could be applied relatively easily. As Kalender (2012) mentioned, the transition from conventional testing to CAT may be easier for EEGS because persons eligible to take EEGS are mostly college graduates who are used to different test formats. Therefore, they can adapt themselves to such a change in test format more easily. This study applies CAT to the Entrance Examination for Graduate Studies (EEGS) in Turkey and shows the benefits of this method over paper-pencil testing. The purpose of the study is to compare ability estimates from CAT and the paper-pencil administration through a post-hoc simulation study using different ability estimation methods and test termination criteria.

Method

Research Design

The purpose of this study is to examine the applicability and efficiency of CAT for the subtests of EEGS. Through post-hoc simulations, the performance of CAT is compared to conventional (i.e., paper-pencil format) testing. There are two research questions for this study:

1) How does CAT perform for estimating ability levels of examinees in EEGS compared to the conventional paper-pencil format?
2) Do different test termination conditions (i.e., SEM) affect ability estimation and test length during the CAT administration?

A post-hoc simulation method was used to examine the applicability of CAT for EEGS. The post-hoc approach to simulation is used when CAT is to be used to reduce the length of a test that has been administered conventionally (Weiss, 2012). In this approach, the item bank for CAT consists of all items administered to test-takers in the test.
This type of simulation study can help to determine how much reduction in test length can be achieved by re-administering the items in an adaptive way without changing the psychometric properties of the test scores.

Sample

The data for this study come from the 2008 administrations of the Entrance Examination for Graduate Studies (EEGS). Results of EEGS are used for admitting students to graduate programs and selecting graduate assistants in Turkish universities. Fourth-year undergraduate students and college graduates are eligible to take the test. The test is administered twice a year in a conventional form (i.e., paper-pencil test). EEGS consists of three subtests: quantitative 1, quantitative 2, and verbal. Each of the quantitative 1 and quantitative 2 sections has 40 items that measure mathematical and logical reasoning abilities. The quantitative 2 section has more advanced and difficult items than does quantitative 1. The verbal section has 80 items that measure verbal reasoning ability. All items in EEGS have five response options, and they are scored dichotomously. To conduct a post-hoc CAT analysis, a random sample of 10,000 examinees (5,000 male, 5,000 female) was selected from the full dataset. The ages of the examinees in the sample ranged from 18 to 61. Table 1 shows the summary statistics for the total scores from EEGS.

Table 1
Summary Statistics for the Total Scores in the Three Subtests of EEGS

Test            # of items   Alpha   Mean    SD      Min   Max
Quantitative 1  40           0.96    23.28   11.92   0     40
Quantitative 2  40           0.97    18.36   13.31   0     40
Verbal          80           0.96    59.72   16.66   0     80

Data Analysis

In this study, the post-hoc simulation procedure described by Weiss (2012) was used:

1. Item parameters based on an item response theory (IRT) model are estimated using the available item response data.
2. Then, using these item parameters, abilities (theta) are estimated for each examinee.
3. A test termination criterion (e.g., a standard error of .3 or a fixed number of items) is determined.
4. The CAT is implemented by selecting items adaptively for each examinee, and the CAT is terminated based upon the pre-specified termination rule.
5. Final theta values are estimated for each examinee using maximum likelihood (MLE) or Bayesian methods.
6. The CAT theta estimates are compared with the conventional test theta estimates based on the number of items administered in the CAT.

By following these steps, first, item parameters for the quantitative 1, quantitative 2, and verbal subtests of EEGS were estimated using the three-parameter logistic IRT model (3PL) in Xcalibre 4.1 (Guyer & Thompson, 2011). IRT model assumptions (i.e., unidimensionality and local item independence) were checked for the subtests of EEGS. All three subtests were found appropriate for IRT modeling. The 3PL model has the best model-data fit for EEGS among unidimensional IRT models (Bulut, 2010). The 3PL unidimensional IRT model can be shown as follows:

P_j(\theta_i) = c_j + (1 - c_j) \frac{\exp[a_j(\theta_i - b_j)]}{1 + \exp[a_j(\theta_i - b_j)]}    (1)

where \theta_i is the unidimensional ability estimate for person i, b_j is the item difficulty for item j, a_j is the item discrimination for item j, and c_j is the guessing parameter for item j.
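To make Equation (1) concrete, the short R function below evaluates the 3PL response probability. It is an illustrative sketch rather than code from the study; the example uses the average Quantitative 1 item parameters reported in Table 2 below.

```r
# Illustrative sketch of the 3PL model in Equation (1).
# Returns the probability of a correct response for ability theta and item
# parameters a (discrimination), b (difficulty), and c (guessing).
p3pl <- function(theta, a, b, c) {
  c + (1 - c) / (1 + exp(-a * (theta - b)))
}

# Example: average Quantitative 1 item parameters (see Table 2).
# An examinee of average ability (theta = 0) answers such an item
# correctly with a probability of roughly 0.59.
p3pl(theta = 0, a = 2.450, b = -0.089, c = 0.089)
```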
Summary statistics for the calibrated items and summary statistics for the ability estimates for the three subtests of EEGS are presented in Table 2 and Table 3, respectively. Also, test information functions (TIF), which show the information and precision of the items in the test, and the standard error of measurement based on the 3PL model for each subtest of EEGS are shown in Figure 1.

Table 2
Summary Statistics for All Calibrated Items in the Three Subtests of EEGS

Test            Parameter   Mean     SD      Min      Max
Quantitative 1  a            2.450   0.607    1.189   3.735
                b           -0.089   0.484   -0.905   1.107
                c            0.089   0.053    0.036   0.268
Quantitative 2  a            3.066   0.790    1.490   4.183
                b            0.289   0.459   -0.621   1.541
                c            0.046   0.025    0.021   0.157
Verbal          a            1.993   1.179    0.438   4.128
                b           -1.281   1.217   -3.848   0.654
                c            0.039   0.026    0.021   0.119

Table 3
Summary Statistics for the Ability Estimates from the Three Subtests of EEGS

Test            Max Info   Min CSEM   Mean    SD      Min      Max
Quantitative 1  42.574     0.153      0.004   0.990   -2.120   1.981
Quantitative 2  72.343     0.118      0.001   1.010   -1.699   2.169
Verbal          57.807     0.132      0.006   1.007   -3.853   2.001

Note: CSEM = conditional standard error of measurement.

Figure 1. Test information function and standard error of measurement for the quantitative 1 (left), quantitative 2 (middle), and verbal (right) subtests

Next, ability estimates based on the Expected A Posteriori (EAP) method were obtained for all examinees using the same software. The EAP estimator was preferred in this study because, unlike the Maximum Likelihood (ML) estimator, EAP does not rely on an iterative procedure and uses a closed-form estimator (i.e., a simple integration using numerical quadrature). Another advantage of EAP over ML is that it provides a finite estimate for perfect and null scores. Thus, EAP can provide a finite estimate after the first item, even if the response was in one of the two extreme categories (Choi, Podrabsky, & McKinney, 2010). Although EAP was used for estimating abilities and computing all accuracy measures, ability estimates from Maximum Likelihood (MLE), Maximum a Posteriori (MAP), and Weighted Least Squares (WLS) were also obtained to look at the relationship between EAP and the other ability estimators.

In the next step, the estimated item parameters and person abilities were used to configure a CAT administration. Firestar-D (Choi, 2009; Choi et al., 2010) was used for running the post-hoc CAT analyses. Firestar-D generates R code (R Development Core Team, 2012) for implementing post-hoc CAT analyses based on pre-specified item selection and test termination criteria. In this study, the maximum Fisher information (MFI) method was used as the item selection method. The MFI criterion can be shown as follows:

j^{*} = \arg\max_{j \in R} I_j(\hat{\theta})    (2)

where R is the set of items that have not yet been administered and I_j(\hat{\theta}) is the Fisher information of item j at the current ability estimate \hat{\theta}. The MFI method iteratively selects the next item that provides maximum information at the current \theta estimate. Every selected item therefore provides the greatest increase in test information and the greatest reduction in standard error. CAT can be terminated when each examinee is measured with a pre-specified degree of precision, which allows the ability levels of all examinees to be measured equally precisely. In several test settings, CAT is terminated when a predetermined number of items is reached. However, using a fixed number of items as the termination criterion may be inappropriate for CAT because it does not provide the same level of measurement precision for all examinees (Weiss, 2004). In this study, a fixed standard error of measurement (SEM) was used as the termination criterion for the CATs in the post-hoc simulations. The test is terminated when the SEM for the estimated theta drops below the pre-specified SEM value. A number of SEM termination criteria (.25, .30, and .40) were used for each subtest of EEGS.
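The following self-contained R sketch illustrates the two points made above about EAP: it is a non-iterative, quadrature-based estimator, and it returns a finite theta even for a perfect response pattern (where maximum likelihood estimation would diverge). This is not the Firestar-D code used in the study; the grid, the N(0, 1) prior, and the item parameters in the demo are assumptions made for illustration.

```r
# Quadrature-based EAP estimator for dichotomous 3PL items.
# Returns the posterior mean (theta) and posterior SD (used as the SEM).
p3pl <- function(theta, a, b, c) c + (1 - c) / (1 + exp(-a * (theta - b)))

eap <- function(a, b, c, resp, grid = seq(-4, 4, length.out = 81)) {
  # likelihood of the observed response pattern at each quadrature point
  like <- sapply(grid, function(t) {
    p <- p3pl(t, a, b, c)
    prod(p^resp * (1 - p)^(1 - resp))
  })
  post  <- like * dnorm(grid)      # N(0, 1) prior
  post  <- post / sum(post)
  theta <- sum(grid * post)        # closed-form posterior mean, no iteration
  list(theta = theta, se = sqrt(sum((grid - theta)^2 * post)))
}

# Demo: ten moderately difficult items, all answered correctly.
# MLE would diverge toward +Inf here, but EAP stays finite.
a <- rep(2.0, 10); b <- seq(-1, 1, length.out = 10); c <- rep(0.05, 10)
est <- eap(a, b, c, resp = rep(1, 10))
est$theta            # finite estimate despite the perfect score
est$se <= 0.40       # would this CAT stop under the .40 SEM criterion?
```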
Figure 2 shows a visual example of this iterative process.

Figure 2. An example of the adaptive ability estimation process of an examinee in CAT

After post-hoc CAT simulations were completed for the three subtests of EEGS, the following evaluation criteria from Weiss and Gibbons (2007) were computed to compare the performance of CAT to the conventional testing of EEGS (a short computational sketch is given after the list):

1. The average number of items required by CAT to recover full-test \theta estimates with a pre-specified standard error of measurement.
2. Correlations between CAT \theta estimates (\hat{\theta}_{CAT}) and full-test \theta estimates (\hat{\theta}_{full}).
3. Average signed difference (i.e., bias) between CAT and full-test \theta estimates: \text{Bias} = \frac{1}{N}\sum_{i=1}^{N}(\hat{\theta}_{CAT,i} - \hat{\theta}_{full,i})
4. Average absolute difference (i.e., accuracy) between CAT and full-test \theta estimates: \text{MAD} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{\theta}_{CAT,i} - \hat{\theta}_{full,i}\right|
5. Root mean squared difference (RMSD) between CAT and full-test \theta estimates: \text{RMSD} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(\hat{\theta}_{CAT,i} - \hat{\theta}_{full,i})^{2}}
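The criteria in items 2 through 5 reduce to a few lines of R. The sketch below assumes two numeric vectors of the same length, theta_cat and theta_full, holding each examinee's CAT and full-test ability estimates; the toy values are purely illustrative and are not results from the study.

```r
# Toy vectors standing in for CAT and full-test theta estimates
theta_cat  <- c(-0.42, 0.10, 1.35, -1.08, 0.76)
theta_full <- c(-0.50, 0.05, 1.20, -1.00, 0.90)

d <- theta_cat - theta_full

cor(theta_cat, theta_full)   # criterion 2: correlation
mean(d)                      # criterion 3: average signed difference (bias)
mean(abs(d))                 # criterion 4: average absolute difference (accuracy)
sqrt(mean(d^2))              # criterion 5: root mean squared difference (RMSD)
```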