RASCH MEASUREMENT Transactions of the Rasch Measurement SIG American Educational Research Association Vol. 29 No. 1 Summer 2015 ISSN 1051-0796 Coffee with Ben and IOMC 2015 Photographs presented here are from a recent meeting several participants had with Ben Wright, while International Outcome Measurement Conference (IOMC) was conducted in Chicago. This event was an unusual opportunity to congratulate Ben on celebrating his 90th birthday and reflect on the vision of scientific measurement he originally presented to health outcome measurement some years ago. It also forces recognition that advancing precision and objectivity in healthcare requires tenacity, as well as sensitive awareness of patient Jack Stenner, Ben Wright, and Ong Kim Lee wellbeing. Despite numerous examples of complacent IOMC was refreshing and reassuring, an enormous implementation of raw scores and ordinal measures that success. We are building the science that Ben dreamed can be harmful to patients, IOMC 2015 represents hope. about, implementing Thurstone’s measurement principles The choice between ordinal and linear can never be with Rasch models to measure health outcomes. arbitrary but must be judicious in recognition of patient Participants from 12 countries attended 17 sessions across effects. The wave of outcome measurement sweeping a broad range of outcome measurement issues. Rasch over healthcare must recognize measurement cannot be model integration with Medicare G, linear measurement conducted without careful regard for individual patients. of PROs, and computation of raw score mortality risk I am reminded of Luigi Tesio and Carl Granger’s were among diverse topics. Those sessions are important emphasis for several years and Richard Smith’s concerns achievements, as virtually all major federal funding for many years before them that outcome measurement must health outcome instrument development currently become person-metric. IOMC 2015 is a rejuvenation of maintains and fortifies weaker, obsolete ordinal that idea. measurement methods. Nick Bezruczko Given the magnitude and intensity of those events over two conference days, I personally need to pause and Table of Contents reflect on the broader scene, especially after having Update from IOMC-2015……… ....................... 1499 attended National Council on Measurement (NCME) just Update from AERA-2015...................... ............. 1501 couple of days before. A more somber contrast could not be made. I am driven to the conclusion that educational Individual-Centered vs. Group-Centered Measures and psychological measurement practitioners actually like (Stone & Stenner)…. ........................................... 1502 ordinal measures represented by true-score theory and Individualized Medicine and Personalized Outcome item response theory (2, 3, or 4 item parameters). They Measures (Royal, Lybarger & Fisher) ………...1504 really are not interested in linear measurement or scientific objectivity or the epistemology of scientific Does Item Sequence Order Impact Local Item inquiry based on abstract scales. Contemporary Dependence in Surveys (Royal)……………..… 1507 psychometrics seems to be tripping over the stepping A Probabilistic Model of the Law of Supply and stones to scientific knowledge that Ben Wright laid out Demand (Fisher)………………………………. 1508 for outcome measurement. Feelings of despair and disappointment prevail. R-Program Code for Calibration of Item Difficulties (Fung)…………………………………………. 1512 Rasch Measurement Transactions 29:1 Summer 2015 1499 Coffee with Ben Starbucks at Oak and Rush April 20, 2015 Ben Wright and Jeremy Hobart (From left to right): Jeremy Hobart, Filemon Cerda, Luigi Tesio, Richard Smith, Ben Wright, Jack Stenner, (From left to right): Jeremy Hobart, Richard Smith, Matt Ong Kim Lee, Craig Velozo, and William Fisher Schulz, Jack Stenner and Luigi Tesio Richard Smith and Matt Schulz (From left to right): Ong Kim Lee, Ben Wright, and William Fisher Jack Stenner, Ben Wright and Ong Kim Lee Craig Velozo and Luigi Tesio Rasch Measurement Transactions 29:1 Summer 2015 1500 William Fisher and Filemon Cerda Pictured: George Engelhard, Jr (2015 Rasch SIG Business Meeting Keynote speaker), Stephanie Wind (Recipient of the Georg William Rasch Early Career Publication Award) and Jim Tognolini (Rasch SIG Chair) Benjamin Drake Wright (born 1926) Update from the Rasch SIG Business Meeting The annual Rasch SIG Business Meeting took place on Thursday, April 3 from 6:15 p.m. to 7:45 p.m. The meeting afforded a great opportunity to engage and network with colleagues. SIG Chair, Jim Tognolini, gave a presentation of the SIG’s functioning over the past year and member of the leadership committee (also pictured) gave updates on the AERA conference program, finances Pictured: Leigh Harrell-Williams (SIG Treasurer), Sara and awards status. This year’s keynote speaker was Hennings (Program Co-Chair), Jim Tognolini (SIG George Engelhard, Jr. who provided a thought-provoking Chair), and Mikaela Raddatz (SIG Secretary) presentation entitled “Invariant Measurement with Raters and Rating Scales”. The meeting concluded with Stephanie Wind being presented the Georg William Rasch Early Career Publication Award. Rasch Measurement Transactions 29:1 Summer 2015 1501 Individual-Centered vs. Group- artificial environment returned to the base degree of temperature. Furthermore, all the measurements agree. Centered Measures Interestingly, the experimentally induced change of The Preface to Probabilistic Models for Some Intelligence environment also produced what may be called causal and Attainment Tests (Rasch, 1960) cites Skinner (1956) validity, not unlike constructive validity (Cronbach & and Zubin (1955). In an argument whereby, "… Meehl, 1954) inasmuch as the temperature was individual-centered statistical techniques require models manipulated, fabricated, engineered, etc. via construction in which each individual is characterized separately and and use of the artificial environment. When measuring from which, given adequate data, the individual mechanism(s) such as outdoor thermometers are properly parameters can be estimated" (Rasch 1960, p. xx). The manufactured this result is to be expected, and this Skinner reference is easily located. The mimeographed experimental outcome and its replication would be work by Zubin has not been found, but we did find predicted prior to environmental manipulation from all we another Zubin paper given at the 1955 ETS Invitational know about temperature and thermometers. This outcome Conference on testing problems in which he writes, “An might further be termed validity as theoretical Example of the application of individual-centered equivalence (Lumsden & Ross, 1973) because the techniques which keeps the sights of the experimenter replications produced by all six thermometer recordings focused on the individual instead of on the group…” (p. might be considered "one" temperature. Our theoretical 116) may have helped Rasch situate his thinking. Rasch prediction is expected as a consequence of the causal goes on to state "… present day statistical methods are process produced by the experiment, and reported by all entirely group-centered so that there is a real need for the instruments. Causal validity is a consequence of the developing individual-centered statistics" (p. xx). What successful theoretical predictions realized in the constitutes the differences in these statistics? experiment. Its essence is "prediction under intervention." The manipulable characteristics of our While it is individual persons and groups of persons that experiment involving the base environment, change made are the focus of discussion, we begin with an even more by way of an artificial environment, and the final change simple illustration because human behavior is complex, of recorded temperature are the consequence of a well- and a single mechanical-like variable is a better functioning construct theory and measuring mechanism. illustration to one that is complex. We choose Each of the six individual thermometers records a temperature for this illustration because measuring similarly induced experimental deviation and a return to mechanisms (Stenner, Stone & Burdick, 2004) for the base state. Each thermometer constitutes an temperature are well established and all report out in a individual unit, and the six thermometers constitute a common metric or degree (disregarding wind-chill, etc.). group albeit without variation, which is exactly what A measuring mechanism consists of (1) guiding would be predicted. substantive theory, (2) successful instrument fabrication, and (3) demonstrable data by which the instrument has Now consider a transition to human behavior. Height is established utility in the course of its developmental the new outcome measure and the determination of height history. at a point in time can be obtained from another well- established measuring mechanism – the ruler, which Consider six mercury-tube outdoor thermometers that are provides a point-estimate for one individual measured at a placed appropriately in a local environment, but near each single point in time. When this process is continued for other. They all register approximately the same degree of the same individual over successive time periods we temperature, independently verified by consulting NOAA produce a trajectory of height for the person over time for the temperature at this location. One by one each (purely individual centered as no reference to any other thermometer is placed in a compartment able to person(s) is required). From these values one may increase/decrease the prevailing temperature by at least determine growth over time intervals as well as any ten degrees. Upon verifying the artificially induced observed plateaus and spurts well-known to occur in temperature change for each thermometer, it is returned to individual development. The individual's trajectory rate its original location and checked to see if it returns to its may also vary because of illness and old age, so we could previous value and agrees with the other five. discover different rates over certain time periods as well as determine a curvilinear average to describe the person's If each of the six thermometers measured a similar and total trajectory. Growth in height is a function of time, consistent degree of temperature before and after the and the human characteristics entailed in a person's induced environmental intervention/manipulation, this overall development result from genetic and consistency of instrument recording validates a deep environmental makeup. These statistics are intra- understanding of the attribute “temperature” and its individually determined. Such statistical analyses measurement. Each thermometer initially recorded the produce the "individual-centered statistics" that Rasch same temperature, and following a change to and from the spoke about. Rasch Measurement Transactions 29:1 Summer 2015 1502 Aggregating individual measurements of height into a Many applications of Rasch models to human science group or groups is a common method for producing data are thin on substantive theory. Rarely proposed is an "group-centered statistics" often employing some a priori specification of the item calibrations (i.e. frequency model such as the normal curve. This is most constrained models). Causal Rasch Models (Stenner, common when generalizing the characteristics of human Fisher, Stone & Burdick, 2013; Burdick, Stone, & growth in overall height based upon a large number of Stenner, 2006; Stenner, Stone & Burdick, 2009; Stenner individuals. The difference between measuring a group & Stone, 2010) prescribe (via engineering and of individuals compared to our first illustration using a manufacturing quality control) that item calibrations take group of thermometers is that while we expected no the values imposed by a substantive theory. For data to deviation among the thermometers, we do not expect all be useful in making measures, those data must conform to individuals to gain the same height over time, but rather the invariance requirements of both the Rasch model and to register individual differences. Hence, we resort to the substantive theory. Thus, Causal Rasch Models are descriptive statistics to understand the central trend, and doubly prescriptive. When data meet both sets of the amount of variation found in the group or groups. An requirements; the data are useful not just for making obvious group-centered statistical analysis might measures of some construct, but are useful for making aggregate by gender; comparing the typical height of measures of that precise construct specified by the females to males or provide norms tables equation that produced the theoretical item calibrations. The measurement of height is straightforward and the A Causal (doubly constrained) Rasch Model that fuses a measurement mechanism has been established over substantive theory to a set of axioms for conjoint additive several thousand years. The same cannot be said for measurement affords a much richer context for the measuring mental attributes occurring in psychological, identification and interpretation of anomalies than does an health, and educational investigations. Determining the unconstrained descriptive Rasch model. First, with the relevant characteristics for their measurement is more measurement model and the substantive theory fixed it is difficult although the procedures for their determination self-evident that anomalies are to be understood as should follow those already discussed. The major problems with the data ideally leading to improved statistical hurdle is moving from the ordering of a observation models that reduce unintended dependencies variable's units to its "measurement application." The in the data (Andrich, 2002). Second, with both model and measurement models of Georg Rasch have been construct theory fixed it is obvious that our task is to instrumental in driving this process forward. produce measurement outcomes that fit the (aforementioned) dual invariance requirements. An Do we know enough about the measurement of reading unconstrained model cannot distinguish whether it is the that we can manipulate the comprehension rate model, data, or both that are suspect. experienced by a reader in a way that mimics the above temperature example? In the Lexile Framework for Over centuries, instrument engineering has steadily Reading (LFR) the difference between text complexity of improved to the point that for most purposes “uncertainty an article and the reading ability of a person is causal on of measurement,” usually reported as the standard the success rate (i.e. count correct). It is true that short deviation of a distribution of imagined or actual term manipulation of a person’s reading ability is, at replications taken on a single person, can be effectively present, not possible, but manipulation of text complexity ignored. The practical outcome of such successful is possible because we can select a new article that engineering is that the "problem" of measurement error is possesses the desired text complexity such that any virtually non-existent; consider most bathroom scale difference value can be realized. Concretely, when a applications. The use of pounds and ounces also becomes 700L reader encounters a 700L article the forecasted arbitrary as is evident from the fact that most of the world comprehension rate is 75%. Selecting an article at 900L has gone metric although other standards remain. What is results in a decrease in forecasted comprehension rate to decisive is that a unit is agreed to by the community and 50%. Selecting an article at 500L results in a forecasted is slavishly maintained through substantive theory comprehension rate of 90%. Thus we can together with consistent implementation, instrument increase/decrease comprehension rate by judicious manufacture, and reporting. We specify these stages: manipulation of texts, i.e. we can experimentally induce a change in comprehension rate for any reader and then return the reader to the “base” rate of 75%. Furthermore, The doubly prescriptive Rasch model embodies this successful theoretical predictions following such process. interventions are invariant over a wide range of environmental conditions including the demographics of Different instruments qua experiences underlie every the reader (male, adolescent, etc.) and the characteristics measuring mechanism; environmental temperature, of text (length, topic/genre, etc.). human temperature, children’s reported weight on a bathroom scale, reading ability. From these illustrations Rasch Measurement Transactions 29:1 Summer 2015 1503 and many more like them we determine point estimates Individualized Medicine and and individual trajectories and group aggregations. This Personalized Outcome Measures: outcome lies in well-developed construct theory, instrument engineering and manufacturing conventions Implications for Rasch Measurement that we designate measuring mechanisms. In recent years the notion of “individualized medicine” Mark H. Stone and A. Jackson Stenner has become increasingly popular. Physicians have always known that everyone is different and that recovery for one References person may look entirely different for another. The Andrich, D. (2002). Understanding resistance to the data- problem, however, has been that individualized medicine model relationship in Rasch’s paradigm: A reflection for has not truly been possible (at least wide-scale) until more the next generation. J. Applied Measurement, 3, 325-359. recently. Existing work in genetics and epigenetics coupled with everyday technologies are quickly paving Burdick, D. S., Stone, M. H., & Stenner, A. J. (2006). the way for individualized medicine. In fact, on January The combined gas law and a Rasch reading law, Rasch 30, 2015, President Obama unveiled the “Precision Measurement Transactions, 20 (2), 1059-1060. Medicine Initiative” which specifically notes: Cronbach, L & Meehl, P. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281-302. Most medical treatments have been designed for the ‘average patient’. As a result of this ‘one-size-fits-all- Lumsden, J. & Ross, J. (1973). Validity as theoretical approach’, treatments can be very successful for some equivalence. Australian Journal of Psychology, 25 (3), patients but not for others. This is changing with the 191-197. emergence of precision medicine, an innovative approach to disease prevention and treatment that takes into Rasch, G. (1960). Probabilistic models for some account individual differences in people’s genes, intelligence and attainment tests. Chicago: The University environments, and lifestyles (The White House, 2015). of Chicago Press. Skinner, B. F. (1956). A case history in scientific method. Eric Topol, a renowned cardiologist and geneticist, has The American Psychologist, 11, discussed the notion of “individualized medicine” in a 221-233. number of tangible ways, often illustrating what is possible when the human genome is mapped and how the Stenner, A. J., Burdick, D. S. & Stone, M. H. (2008). use of personal data and technology are able to develop Formative and reflective models: Can a Rasch analysis personalized treatments (Topol, 2014, 2015). Topol tell the difference?, Rasch Measurement Transactions, 22 argues the future of medicine will involve using (1), 1152-1153. smartphones and similar devices to monitor our health and stay on top of treatments. Stenner, A. J., Stone, M. H., & Burdick, D. S. (2009). Indexing vs. measuring, Rasch Measurement In another example Briggs (2015) discusses epigenomics Transactions, 22 (4), 1176-1177. and how the ability to turn genes on and off could help Stenner, A. J., Stone, M., Burdick, D. (2009). The concept fight cancer. Briggs provides an excellent example of a measurement mechanism. Rasch Measurement answering the question “how is it that identical twins can Transactions, 23 (2), 1204-1206. share the exact same DNA but yet exhibit differences in growth, behavior and acquisition of illnesses?” He Stenner, A. J., Stone, M. H. (2010). Generally objective explains that cells read the genetic code embedded in measurement of human temperature and reading ability: DNA much like a script as opposed to a mold that Some corollaries. Journal of Applied Measurement, 11 replicates the same results. He suggest we should think of (3), 244-252. the genetic code much like a movie script in which the vision of the director (e.g., James Cameron vs. Woody Stenner, A.J., Fisher, W.P., Stone, M.H. and Burdick, Allen) could produce a significantly different film despite D.S. (2013). Causal Rasch Models. Front. Psychol. 4:536. using the exact same script. doi: 10.3389/tpsyg.2013.00536 Zubin, J. (1955). Experimental abnormal psychology The aforementioned breakthroughs and initiatives show a (mimeographed). New York: Columbia University Store. great deal of promise for the future of medicine. However, many problems with measurement still remain Zubin, J. (1955). Clinical vs. actuarial prediction: a that could potentially impede this progress. Most pseudo-problem. In Proceedings of the 1955 Invitational statistical approaches involve clustering, analyzing and Conference on Testing Problems. Princeton: Educational reporting data at group levels. Certainly, these approaches Testing Service, 107-128 have their purposes and are very useful when . investigating relationships and trends. However, most “person-centered” analyses used in health outcomes Rasch Measurement Transactions 29:1 Summer 2015 1504 measurement use statistical models that attempt to group Rasch model parameters and fit statistics are estimated at individuals into homogenous subgroups and then the level of individual persons, marking an important determine predictors of subgroup membership (Muthén methodological departure from classical statistical and Muthén, 2000). Since the 1950s, researchers (see genetic, genetic epidemiological, and behavioral genetic Hayes, 1953; and Estes, 1956) have acknowledged that approaches to measurement that rely exclusively on group summaries of data often obscure individual family- and group-level comparisons as the basis of differences and make it difficult to draw appropriate inference and decision-making. inferences at the person level. It seems that in all our fervor to make sense of information and find solutions to The need for this kind of important methodological problems many forget that all data sets consist of innovation emerges also in comparisons of instrument individual data points. Thus, while it often makes sense to sensitivities to change in health status, which sometimes group and norm some data, it is actually somewhat produce clinically counter-intuitive results. Hobart, Cano, counter-intuitive to do this with outcomes measurement in and Thompson (2010) show that two functional status the health, psychological and educational sciences. An instruments having the same sensitivity to change at the example from sports medicine illustrates why. group level differed in their capacities to detect significant improvement by 50% to 31% at the individual level. Many familiar with baseball may be aware of a surgical Group level effect size indicators can be misleading, and procedure called the ulnar collateral ligament (UCL) Rasch measurement can provide meaningful and reconstruction, also known as the “Tommy John Surgery” clinically interpretable quantitative comparisons at the (TJS). The TJS procedure essentially consists of replacing individual level that are not otherwise available. the UCL with another tendon from elsewhere in the body. The procedure has been widely advertised to have success It is fair to say that if individualized medicine and rates around 90-93% so many athletes, including those personalized outcomes are indeed the future of medicine, without UCL injuries, have opted to have the procedure as then Rasch models should be expected to take on a more a means to enhance pitching performance (Ahmad, prominent role as these models are uniquely equipped to Grantham, & Greiwe, 2012). tackle many of these critical challenges. However, such high success rates are based on group Kenneth D. Royal, North Carolina State University norms typically obtained from rehab scales that measure Melanie Lybarger, Independent Consultant factors such as strength, mobility, comfort, sleep quality, William P. Fisher, Jr., University of California - Berkeley etc. Of course, it is intuitive that what recovery looks like for one individual may not mirror that of another. A more References accurate outcome measure would be to compare one’s Ahmad, C. S., Grantham, W. J., & Greiwe, R. M. (2012). performance relative to a personal baseline measure. With Public perceptions of Tommy John surgery. The baseball and the wide variety of data collected on pitchers Physician and Sports Medicine, 40(2), 64-72. (e.g., pitching velocity, accuracy, etc.), personal baseline measures have been increasingly investigated in recent Briggs, J. (2015). What is epigenomics and how is it years. Sports medicine physicians are now recognizing helping in the fight against cancer? Mayo Clinic that when comparing a pitcher’s post-UCL performance Individualized Medicine blog. Available at: to his baseline statistics success rates typically drop to http://individualizedmedicineblog.mayoclinic.org/discussi approximately 70-75% (Yurkiewicz, 2015). Thus, when on/what-is-epigenomics-and-how-is-it-helping-in-the- outcomes are based on group norms they are, in many fight-against-cancer/ instances, inflated. The repercussions for such methodological carelessness is profound for persons Estes, W. K. (1956). The problem of inference from considering these procedures, as an inflated success rate curves based on group data. Psychological Bulletin, likely provides a false sense of security for potential 53(2):134–40. patients and may impact ones decision to undergo an elective procedure. Further, assuming 25-30% of athletes Hayes, K. J. (1953). The backward curve: A method for are unable to match their personal baseline levels of the study of learning. Psychological Review, 60, 269–75 performance, such procedures, especially when performed when medically unnecessary, could be devastating to Hobart, J. C., Cano, S. J., & Thompson, A. J. (2010). one’s career and future. Effect sizes can be misleading: Is it time to change the way we measure change? Journal of Neurology, In the context of Rasch measurement, this difference Neurosurgery, & Psychiatry, 81, 1044-1048. between group- and individual-level comparisons takes on new significance, in genetics and elsewhere, as was noted Markward, N. J., & Fisher, W. P., Jr. (2004). Calibrating by Markward and Fisher (2004, p. 131) in their study of the genome. Journal of Applied Measurement, 5(2), 129- 13 short tandem repeat marker loci from the FBI’s 141. CODIS database: Rasch Measurement Transactions 29:1 Summer 2015 1505 Muthén, B. O., & Muthén, L. K. (2000). Integrating person-centered and variable-centered analyses: growth Journal of Applied Measurement mixture modeling with latent trajectory classes. Vol. 16, No. 1, 2015 Alcoholism: Clinical and Experimental Research, 24(6), 882–91 A Mathematical Theory of Ability Measure Based on Partial Credit Item Responses, Nan L. Kong The White House, Office of the Press Secretary (2015). Fact Sheet: President Obama’s Precision Medicine Differential Item Functioning Analysis by Applying Initiative [Press release]. Retrieved from: Multiple Comparison Procedures, Paolo Eusebi https://www.whitehouse.gov/the-press- and Svend Kreiner office/2015/01/30/fact-sheet-president-obama-s-precision- medicine-initiative Visually Discriminating Upper Case Letters, Lower Topol, E. J. (2014). Individualized medicine from pre- Case Letters and Numbers, Janet Richmond, womb to tomb. Cell, 157(1), 241-253. Russell F. Waugh, and Deslea Konza Topol, E. J. (2015). The future of medicine is in your Testing the Multidimensionality of the Inventory of smartphone. The Wall Street Journal. Available at: School Motivation in a Dutch Student Sample, http://www.wsj.com/articles/the-future-of-medicine-is-in- Hanke Korpershoek, Kun Xu, Magdalena Mo your-smartphone-1420828632 Ching Mok, Dennis M. McInerney, and Greetje van der Werf Yurkiewicz, S. (2015, April 24). Did UCL Surgery Work? Baseball Stats May Have the Answer. Medpage Today. Measuring Teaching Assistants' Efficacy using the Available at: Rasch Model, Zi Yan, Chun Wai Lum, Rick Tze http://www.medpagetoday.com/SportsMedicine/EliteSpor Leung Lui, Steven Sing Wa Chu, and Ming Lui ts/51184 Detecting Measurement Disturbance Effects: The Rasch-related Coming Events Graphical Display of Item Characteristics, Randall E. Schumacker July 3-31, 2015, Fri.-Fri. Online workshop: Practical Rasch Measurement – Further Topics (E. Smith, Criteria Weighting with Respect to Institution's Goals Winsteps), www.statistics.com for Faculty Selection, Sheu Hua Chen, Yen Ting July 27-Nov. 20, 2015, Mon.-Fri. Introduction to Chen, and Hong Tau Lee Rasch Measurement (D. Andrich, I. Marais, Gendered Language Attitudes: Exploring Language RUMM), www.education.uwa.edu.au/ppl/courses as a Gendered Construct using Rasch Measurement Aug. 14-Sept. 11, 2015, Fri-Fri. Online workshop: Theory, Kris A. Knisely and Stefanie A. Wind Many-Facet Rasch Measurement (E. Smith, Facets), www.statistics.com Richard Smith, Editor, www.jampress.org Sept. 4-Oct. 16, 2015, Fri.-Fri. Online workshop: Rasch Applications, Part 1: How to Construct a Rasch Scale (W. Fisher), www.statistics.com Sept. 9-11, 2015, Wed.-Fri. In-person workshop: Introductory Rasch (A. Tennant, RUMM), Leeds, Statistics Joke UK, Sept. 14-16, 2015, Mon.-Wed. In-person workshop: A statistician goes hunting with two mathematicians. Intermediate Rasch (A. Tennant, RUMM), Leeds, They spot a duck. The first mathematician levels his UK, rifle, fires, and misses to the right. The second Sept. 14-16, 2015, Mon.-Wed. IACAT Conference: mathematician levels his rifle, fires, and misses to the International Association of Computerized left. The statistician turns to his friends and says Adaptive Testing, Cambridge, UK, www.iacat.org “looks to me like we got him, boys” (p. 45). Sept. 17-18, 2015, Thur.-Fri. In-person workshop: Advanced Rasch (A. Tennant, RUMM), Leeds, Courtesy of Behar, R., Grima, P., & Marco-Almagro, UK, L. (2013). Twenty-five analogies for explaining Oct. 16-Nov. 13, 2015, Thur.-Fri. Online workshop: statistical concepts. The American Statistician, 67(1), Practical Rasch Measurement – Core Topics (E. 44-48. Smith, Winsteps), www.statistics.com Rasch Measurement Transactions 29:1 Summer 2015 1506 Does Item Sequence Order Impact perceived the items to be somewhat redundant based on their adjacent positioning. Curious if item sequence order Local Dependence in Surveys? may be the culprit for statistical dependency in this case, a simple experiment was conducted. Many survey researchers consider it a best practice to group related items together as it makes it easier for Table 1. Statistically dependent item pairs based on participants to complete the survey, gives the appearance veterinary students’ responses. of greater cohesiveness, and requires a lesser cognitive Item Item Correlation load from participants (Bradburn, Sudman, & Wansink, # 2004; Dillman, 2000). However, a considerable number 1 Copying from another student of studies have found that item order effects, often called during a quiz or exam “assimilation effects” or “carry-over effects”, can result in 2 Using unauthorized cheat sheets or .69 biased participant responses (Heiman, 2002). A recent other materials during a quiz or experience analyzing survey data provided an interesting exam case in which item ordering was hypothesized as the 12 Missing class or lab due to a false culprit for a number of statistically dependent item pairs. excuse .37 A simple experiment was conducted to test this 13 Claiming to have attended class hypothesis. when you actually did not 19 Failing to prepare adequately for a Background group assignment or laboratory .67 20 Doing less than your fair share in a Statistical dependency, also referred to as “local item group project or a laboratory dependence”, refers to the extent to which a response to 22 Presenting your clinical skills book one item is directly influenced by a response to another for signing without actually item (Marais & Andrich, 2008). Survey researchers completing the skill routinely investigate statistical dependency when .60 23 Listing false completions on your evaluating the psychometric properties of an instrument. online clinical skills completion Typically, when statistically dependent items are summary discovered they are reviewed for content and a decision is made to either retain, revise, or discard one or more of the Experiment potentially dependent items. Two questionnaires were created with one version Case Example containing the items in the same order as originally presented (control) and the other version containing items An academic misconduct survey was administered at presented in random order (experiment). The survey was North Carolina State University’s College of Veterinary published on Amazon Mechanical Turk to a national Medicine. The survey contained 23 items measuring the panel of respondents. The first 100 participants extent to which various actions and behaviors constitute completing each survey was awarded a small stipend for academic misconduct. A 7-point semantic differential their time and effort. Individuals who participated in one scale (1 = Not Misconduct to 7 = Severe Misconduct) was survey were ineligible for participation in the other, used to capture participants’ perspectives. A total of 137 ensuring 200 distinct individuals completed the surveys. students completed the survey. Quality Control As part of the routine psychometric analysis, statistical dependence was investigated by reviewing residual item Upon data collection, a series of routine quality control correlations. Items with residual correlations greater than checks were performed as part of the initial Rasch 0.3 were considered statistically dependent (Smith, 2000). analysis. Data from both sets were evidenced to be mostly Results of the psychometric investigation indicated four unidimensional, highly reproducible (reliability > .90), pairs of items were statistically dependent (see Table 1). and fit the Rating Scale Model (Andrich, 1978) quite well. In order to obtain excellent fit, 10 misfitting persons Upon conclusion of the analysis, psychometric results were removed from the control group and 2 misfitting (which included item pairs flagged as statistically persons were removed from the experimental group. dependent) were reviewed by a team of veterinary faculty. The faculty had difficulty understanding why each Results presumably dependent item pair was so highly correlated, as they perceived each item pair to represent substantively Item pairs that were previously flagged as potentially different questions. One faculty member noticed each pair statistically dependent were investigated in both the of dependent items appeared in sequential order on the control and experimental data sets (see Table 2). instrument and hypothesized that respondents may have Rasch Measurement Transactions 29:1 Summer 2015 1507 Table 2. Investigation of Potentially Dependent Item Pairs Marais, I., & Andrich, D. (2008). Effects of varying Item Pairs Control Group Experimental Group magnitude and patterns of response Correlation Correlation dependence in the unidimensional Rasch model. Journal 1 & 2 .40 .32 of Applied Measurement, 9(2), 12 & 13 .32 .08 105-124. 19 & 20 .68 .32 22 & 23 .56 .28 Smith, R. M. (2000). Fit analysis in latent trait measurement models. Journal of Applied Results from the control study indicated that when the Measurement, 1(2), 199-218. items were presented in the same order as the veterinary student survey, each pair of items was once again flagged A Probabilistic Model of the Law of as being potentially statistically dependent. However, Supply and Demand when the items were randomly presented to participants in the experimental group evidence of statistical dependency Smith (1962) presents an experimental study of was greatly reduced. In fact, two pairs of items fell below competitive market behavior notable (Cowen and the suggested threshold of .30 and the remaining two pairs Tabarrok, 2009, p. 39) for being a rare and early instance fell to .32 (just slightly above the suggested threshold). of a controlled study in economics. Chart 1 (Smith, 1962, Based on this evidence, it appears that item ordering may p. 113; reproduced below) shows the convergence of impact local item dependence, at least in some situations. prices, supply, and demand, as predicted by economic theory. The idea of a lawful relationship in which prices Implications and Recommendations are predictable given the difference between supply and demand immediately suggests the applicability of a Many researchers and practitioners routinely revise or probabilistic model structured in the same form as a discard one or more survey items that are flagged as natural law, such as Rasch (1960, pp. 110-115; statistically dependent. The results of this experiment 1972/2010; Burdick and Stenner, 1996; Burdick, Stone, suggest one should use additional caution when and Stenner, 2006) proposed, and as has been suggested considering revising or discarding items based on initial repeatedly throughout the history of economics inspection of residual correlations. It is recommended that (Boumans, 1993, 2005; Fisher, 2010a, 2010c; Grattan- one pays particular attention to the order in which items Guiness, 2010; Mirowski, 1991; Myers, 1983, pp. 65-75). were presented to respondents. If the flagged items were However, the resemblance of Smith’s Chart 1 to a Wright adjacent to one another it may be a false-positive map (Wilson, 2005, 2013) is only superficial, since what detection due (presumably) to respondents’ acquiescing to appear to be stair-stepped vertical histograms are in fact what they may perceive as a redundant question. crossing representations of the available supply and Researchers using Rasch models to analyze survey data demand quantities. may be wise to randomize survey item order when possible to minimize threats to false-positive detections of statistical dependency and potentially other threats to item location estimates and stability. Kenneth D. Royal, North Carolina State University References Andrich, D. (1978). A rating formulation for ordered response categories. Psychometrika, 43, 561-73. Bradburn, N., Sudman, S., & Wansink, B. (2004). Asking questions: the definitive guide to questionnaire design. San Francisco: Jossey-Bass. Dillman, Don. (2000). Mail and internet surveys. New Figure 1. Supply (S), Demand (D), and Prices (P) (Smith, York: John Wiley & Sons. 1962, p. 113) Heiman, G. W. (2002). Research Methods in Psychology. The usual approach to estimating the relationship of 3rd Edition. Boston & New York. Houghton Mifflin supply, demand, and prices is through a series of Company. simultaneous equations. Roughly the same results can be obtained using algebraic approximations (Cohen, 1979). Rasch Measurement Transactions 29:1 Summer 2015 1508
Description: