
ERIC ED363665: Coding Major Fields of Study. PDF

10 Pages·1993·0.2 MB·English
by  ERIC


DOCUMENT RESUME

ED 363 665    TM 020 787

AUTHOR       Bobbitt, L. G.; Carroll, C. D.
TITLE        Coding Major Fields of Study.
PUB DATE     [93]
NOTE         10p.; Author is affiliated with the National Center for Education Statistics (NCES).
PUB TYPE     Reports - Evaluative/Feasibility (142)
EDRS PRICE   MF01/PC01 Plus Postage.
DESCRIPTORS  Algorithms; *Coding; Computer Assisted Testing; Computer Software; Course Selection (Students); Databases; Educational Research; Higher Education; Longitudinal Studies; *Majors (Students); *National Surveys; *Research Methodology; *Telephone Surveys
IDENTIFIERS  Autocoders; *Beginning Postsecondary Students Long Study; *Computer Assisted Telephone Interviewing; National Center for Education Statistics; Student Surveys

ABSTRACT
The National Center for Education Statistics conducts surveys which require the coding of the respondent's major field of study. This paper presents a new system for the coding of major field of study. It operates on-line in a Computer Assisted Telephone Interview (CATI) environment and allows conversational checks to verify coding directly from the respondent. The system "learns" by maintaining a database of response/coding pairs which can be incorporated into its algorithm after supervisor review. This paper analyzes the effectiveness of this approach and database in coding major field of study for the Beginning Postsecondary Students Longitudinal Study Second Followup 1990-1994. (Contains 2 figures, 2 tables.) (Author)

***********************************************************************
Reproductions supplied by EDRS are the best that can be made from the original document.
***********************************************************************

U.S.
DEPARTMENT OF EDUCATION
Office of Educational Research and Improvement
EDUCATIONAL RESOURCES INFORMATION CENTER (ERIC)
This document has been reproduced as received from the person or organization originating it.
Minor changes have been made to improve reproduction quality.
Points of view or opinions stated in this document do not necessarily represent official OERI position or policy.

CODING MAJOR FIELDS OF STUDY
by L. G. Bobbitt and C. D. Carroll

L. G. Bobbitt, 555 New Jersey Ave., Washington, DC 20208-5652

Key Words: Autocoding, fuzzy search, CATI

Abstract

The National Center for Education Statistics conducts surveys which require the coding of the respondent's major field of study. This paper presents a new system for the coding of major field of study. It operates on-line in a CATI environment and allows conversational checks to verify coding directly from the respondent. The system "learns" by maintaining a database of response/coding pairs which can be incorporated into its algorithm after supervisor review. This paper analyzes the effectiveness of this approach and database in coding major field of study for the Beginning Postsecondary Students Longitudinal Study Second Followup 1990-1994.

Introduction

NCES projects have frequently collected data concerning major field of study. Until recently, two procedures have been used to gather this data. First, experts in the Classification of Instructional Programs (CIP) examined respondents' text strings and assigned 6-digit codes from about 1,400 possibilities. Second, respondents selected 2-digit codes from a list of about 35 possibilities. The correspondence between the 2-digit and 6-digit codes was artificially high because the expert coders viewed both the text strings and the respondents' 2-digit selections. Inter-rater reliability within the expert pool was typically in the .80-.90 range after 3-4 days of training.

The advent of Computer Assisted Telephone Interviewing (CATI) enhanced the capability for obtaining higher quality text strings describing respondents' major fields of study. However, on-line coding into the 2-digit codes proved very time consuming. In addition, post-coding of text strings into CIP 6-digit codes delayed file delivery and sometimes resulted in expensive call-back procedures.

Furthermore, researchers were frequently critical of the 6-digit and/or 2-digit codings. Simply put, the 6-digit codes were too complex for analyses and the 2-digit codes did not provide adequate detail for analyses. One of the major critics developed an alternative 3-digit system with 111 possible codes based on patterns of courses described within A College Course Map: Taxonomy and Transcript Data.

Finally, as researchers become more sophisticated users of text string data (or as software improves the handling of strings), the value of the high quality text strings becomes paramount in data collection systems, including CATI. Coding becomes primarily a key for sorting or subsetting collections of text strings.

7/21/93 DRAFT

Methodology for Data Collection

All of the factors outlined above contributed to the development of a new approach for coding major field of study in current NCES CATI projects. The new coding approach incorporates on-line coding into our existing CATI system, using a 3-digit classification system. The existing CATI system executes the NCES major field of study coding software. The coding software then takes over all CATI functions for the major field of study question, and returns a response string and a 3-digit code to the existing CATI system. The CATI system then stores this data, and proceeds with the next question for the respondent. The coding software takes care of prompting the CATI operator throughout the coding session.

Initially, the respondent is asked an open question, "What is your major field of study", and the respondent's reply is entered into the coding software. The coding software breaks up the response into words, and performs a fuzzy search for the words in the response. The search is "fuzzy" in that (1) the domain of the search is limited to words which have a similar initial sound, and (2) the object of the search is not only to find the target word (or determine that it is not present in the dictionary), but also to determine a short list of words in the domain of the search which are "closest" to the target word.

Since speed is obviously crucial for an on-line coding system, the search has been refined and "tuned" so that no discernible pause in the CATI operation occurs.

The software maintains a list of "reasonable" codes, initially empty. If a word is found in the dictionary, then all categories which are related to that word are added to the list. If the word is not found in the dictionary, the CATI operator is presented with a short (5 to 7) list of similar words. The words are ordered so that the words which are "closest" to the target word are shown first. The CATI operator can either select a word from the list, or ignore the entered word for purposes of coding. If a word is selected from the list of similar words, then all categories related to that word are added to the list of "reasonable" coding outcomes.

There are several reasons for constructing the list of similar words and having the CATI operator pick from it when exact matches are not found. First, and most importantly, misspellings are extremely common, and this is a quick and efficient way of dealing with them. In most cases, a misspelling will result in the correct spelling being shown as the first word in the list.

Secondly, the dictionary for the most part contains root words only. As a general rule, the dictionary does not contain multiple variations of the same word.
For example, it would be inefficient to store the words MATH, MATHEMATICS, MATHEMATICAL, etc. when they are all coded to the same major. In some cases, when the distinction helps to indicate the major field of study, multiple variations of the same word are included in the dictionary. For example, the word CHEMICAL is an entry in the dictionary and maps to the major CHEMICAL ENGINEERING, while the word CHEMISTRY contains no engineering associations. While we have experimented with various algorithms to adjust for suffixes, prefixes, etc., the current approach avoids having algorithmic (and possibly wrong) determinations of word variations.

For example, suppose a CATI operator enters the string "Decison Information Sciences" for a respondent's major field of study. The system converts everything to capital letters, breaks the string into three words, and proceeds to look up the word DECISON, for which no exact match is found. The system presents a screen with a list of possible matches for the input word. The word DECISION tops the list, and is selected by the CATI operator. The program looks up the other two words, and since they are both exact matches for dictionary words, no action is required by the operator.

Once all the words in the response are processed, the set of "reasonable" codes is presented in decreasing order of likelihood. The CATI operator may then select one of the "reasonable" codes, override the "reasonable" code list and select a code outside the list, select "uncodeable", or reenter/edit the initial response. The operator is trained to discuss the coding process with the respondent, possibly resulting in changing the initial response to one which more clearly identifies the respondent's major.

Online coding encourages CATI operators to identify shortcomings in the collected text strings while still discussing the respondent's major with the respondent.
When such shortcomings are identified, the CATI operator can immediately elicit additional detail in the responses, without the expense of later callbacks. Coding of responses is secondary; the main object of our approach is to improve the collected text strings.

Returning to our example, the software would find that the word DECISION is associated with only one major field of study, Business/Management Systems. The word INFORMATION is associated with three major fields of study: (1) Business/Management Systems, (2) Computer Programming, and (3) Computer and Information Science. The word SCIENCE is associated with 26 different major fields of study, including Business/Management Systems and Computer and Information Science. The software constructs a screen listing all possible major field of study codes in order from most to least likely. In this case the most likely code is clearly Business/Management Systems, so the CATI operator would select that code and continue with the interview.

In our example, the word DECISION tips the scale in favor of the code for Business/Management Systems. If the reply had instead been "Information Sciences", then two codes (Business/Management Systems, and Computer and Information Science) would have been tied for most likely code. In this case the CATI operator would have probed the respondent for more information. If, for example, the major is in the Department of Computer Sciences, this discussion with the respondent will probably result in the correct classification under Computer and Information Science, not under Business/Management Systems. Or perhaps, under probing, the respondent clarifies that his/her real major is really "Decision Information Sciences" in the School of Business. Using a more traditional offline coding approach, we receive only the string "Information Sciences" and must make a judgement which may or may not be correct.
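
The fuzzy dictionary lookup described in this section can be illustrated with a minimal Python sketch. It is not the NCES software: the paper does not specify how "similar initial sound" or "closeness" are computed, so the sketch assumes the first letter as a stand-in for the initial sound and uses the similarity ratio from Python's standard difflib module as the closeness measure. The small dictionary and the names `search_domain`, `fuzzy_lookup`, and `split_response` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical miniature dictionary; the real database holds 406 words.
DICTIONARY = ["DECISION", "DESIGN", "DENTAL", "DIETETICS", "DRAMA",
              "INFORMATION", "SCIENCES", "CHEMICAL", "CHEMISTRY"]


def search_domain(word, dictionary):
    """Limit the search to dictionary words sharing the first letter.

    Assumption: the first letter approximates the paper's
    "similar initial sound" rule, whose details are not given.
    """
    return [entry for entry in dictionary if entry[0] == word[0]]


def fuzzy_lookup(word, dictionary, max_suggestions=7):
    """Return (exact_match, suggestions) for one response word.

    suggestions lists the closest dictionary words, best first,
    and is empty when an exact match exists.
    """
    word = word.upper()
    if not word:
        return None, []
    if word in dictionary:
        return word, []
    domain = search_domain(word, dictionary)
    # ratio() is in [0, 1]; a higher value means a closer word.
    ranked = sorted(
        domain,
        key=lambda entry: SequenceMatcher(None, word, entry).ratio(),
        reverse=True,
    )
    return None, ranked[:max_suggestions]


def split_response(response):
    """Convert the reply to capital letters and break it into words."""
    return response.upper().split()


for token in split_response("Decison Information Sciences"):
    match, suggestions = fuzzy_lookup(token, DICTIONARY)
    print(token, "->", match if match else suggestions)
```

With this toy dictionary the misspelling DECISON has no exact match, and DECISION heads its suggestion list, mirroring the screen the CATI operator would see. Restricting the domain before scoring keeps the number of comparisons small, which is one plausible way to meet the speed requirement noted above.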

Word Association DataBase

The approach outlined above presumes that we have specific information about how responses map to codes for major field of study. Fortunately, some information on this is available from previous surveys. We know what strings were collected on the previous surveys, and what major field of study was finally coded for each of those strings. Starting with this information we constructed a database of word to code mappings, using our judgement about whether to delete or preserve links between particular words and codes. The final database structure consists of an index of words related to their associated codes, so that for any word the associated codes can easily be identified.

The coding system has been designed so that it has the ability to "learn" new words or new codes for existing words in a controlled way. Because of the importance of the database to the coding system, updating of the database in this way is normally performed by very experienced personnel. The response string, assigned root words, and final major field of study code are gathered and reviewed. In addition, the system indicates what words in the response string were or were not associated with root words, and whether the CATI operator overrode the set of "reasonable" codes presented by the coding system. A computer program identifies potential new database words, and potential new codes to be associated with existing database words. As new database entries are identified, they are presented on a screen and can be added or not added to the database. The update system only presents words or word/code combinations that (1) are not already present in the database and (2) have not previously been refused entry into the database by the operator.
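
The index structure and the controlled update step can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the actual NCES database: the class `WordCodeDatabase`, its method names, and the use of descriptive strings in place of the 3-digit codes are all hypothetical.

```python
from collections import defaultdict


class WordCodeDatabase:
    """Index of root words to their associated major field codes."""

    def __init__(self):
        self.codes_for_word = defaultdict(set)  # word -> set of codes
        self.refused = set()  # (word, code) pairs a reviewer rejected

    def add(self, word, code):
        self.codes_for_word[word.upper()].add(code)

    def lookup(self, word):
        """Return a copy of the codes associated with a word."""
        return set(self.codes_for_word.get(word.upper(), set()))

    def candidates(self, reviewed_responses):
        """Yield new (word, code) pairs to show to the reviewer.

        reviewed_responses is an iterable of (root_words, final_code).
        Pairs already in the database, or refused earlier, are skipped.
        """
        seen = set()
        for root_words, final_code in reviewed_responses:
            for word in root_words:
                pair = (word.upper(), final_code)
                if pair in seen or pair in self.refused:
                    continue
                if final_code in self.codes_for_word.get(pair[0], set()):
                    continue
                seen.add(pair)
                yield pair

    def review(self, pair, accept):
        """Record the reviewer's decision for one candidate pair."""
        if accept:
            self.add(*pair)
        else:
            self.refused.add(pair)


db = WordCodeDatabase()
db.add("DECISION", "Business/Management Systems")
reviewed = [(["DECISION", "ANALYSIS"], "Business/Management Systems")]
for pair in list(db.candidates(reviewed)):
    print("candidate:", pair)
```

In this example only the pair for the hypothetical new word ANALYSIS is proposed, because DECISION already maps to that code. Keeping a separate set of refused pairs is one simple way to ensure a rejected entry is not presented to the reviewer again.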

[Figure 1: Distribution of Word Mappings]
[Figure 2: Distribution of Code Mappings]

There are 739 word to code maps in the current word association database. The database has 406 unique words and 113 unique codes. The distribution of the number of word/code maps for the unique words in the dictionary is shown in Figure 1. Figure 1 clearly shows that most words are associated with only a very few major field of study codes. Also, the number of words in the database quickly decreases as the number of maps increases, so that very few words map to more than three or four codes.

However, as shown in Figure 2, the distribution of the number of word/code maps across codes is quite different. Although generally speaking the number of codes in the database decreases as the number of maps increases, there are some exceptions to this general trend. Most of the codes have more than three or four words which map to them. Figure 2 also shows that there are some words which map to a unique code; for example, the word "philosophy" maps only to the major field of study code for "philosophy".

Coding Responses

To evaluate how our approach was working, we used our software to review 1427 actual CATI responses. These were responses which our CATI operators collected during the field test for the Beginning Postsecondary Students Longitudinal Study Second Followup 1990-94 (BPS:90/94). None of these test responses were ever used to construct or update the database.

The test responses included 2237 words, of which 2057, or 92%, were found immediately by the computer system in our database. The remaining 180 words were reviewed by an operator who identified 95 of them as misspellings of database words.
Only 85 of the 2237 original words were not found in the dictionary, a success rate of about 96%. Most of the words which one typically encounters when asking about a respondent's major field of study appear to be included in the database.

Of the 1427 raw responses, 272, or 19%, mapped to a unique major field of study code. An additional 273 responses had one code which was mapped to by a majority of the words in the response. Although in practice we have the CATI operator review the final code, these 545 responses, or 38% of the total, have a single most likely code and are essentially autocoded by our system.

Of the remaining responses, our approach ranks all codes by the number of word/code pairs found in the dictionary for the response. Using a definition of "most likely code" to mean that the code is tied for the maximum number of word/code pairs found, Table 1 shows the distribution of these responses by the number of most likely codes.

Table 1
Number of "Most Likely" Codes Identified

# of Most       # of         Percentage of    Cumulative %
Likely Codes    Responses    all responses    (all responses)
1               545          38.2             38.2
2               0            0.0              38.2
3               258          18.1             56.3
4               73           5.1              61.4
5               2            0.1              61.5

It is important to recognize that for even the remaining responses, the effect of using our coding system is that better information is obtained than would have been obtained otherwise. For example, the initial input word "anthology" was confirmed to be a misspelling by a CATI operator and was identified as the word "anthropology". This immediately clarifies what the major of the respondent is. Under a more traditional offline coding approach it is problematic whether a major of "anthology" would have been correctly coded.
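
The ranking rule and the "most likely code" definition used for Table 1 can be made concrete with a short Python sketch. It is a sketch under stated assumptions rather than the production algorithm: the function names `rank_codes` and `most_likely_codes` are hypothetical, descriptive strings stand in for the 3-digit codes, and the index below lists only the associations named in the "Decision Information Sciences" example (SCIENCE has 26 codes in the real database; 24 are omitted here).

```python
from collections import Counter

# Excerpt of the word-to-code index, limited to associations named in the text.
WORD_CODES = {
    "DECISION": {"Business/Management Systems"},
    "INFORMATION": {
        "Business/Management Systems",
        "Computer Programming",
        "Computer and Information Science",
    },
    # Only 2 of the 26 codes associated with SCIENCE are shown.
    "SCIENCE": {
        "Business/Management Systems",
        "Computer and Information Science",
    },
}


def rank_codes(root_words, word_codes):
    """Count, for each code, how many response words map to it."""
    counts = Counter()
    for word in root_words:
        counts.update(word_codes.get(word, ()))
    # most_common() orders the codes by count, highest first.
    return counts.most_common()


def most_likely_codes(root_words, word_codes):
    """Return every code tied for the maximum number of word/code pairs."""
    ranked = rank_codes(root_words, word_codes)
    if not ranked:
        return []
    top_count = ranked[0][1]
    return sorted(code for code, count in ranked if count == top_count)


print(most_likely_codes(["DECISION", "INFORMATION", "SCIENCE"], WORD_CODES))
print(most_likely_codes(["INFORMATION", "SCIENCE"], WORD_CODES))
```

For the three-word response, Business/Management Systems is matched by all three words and is the single most likely code. For "Information Sciences" alone, Business/Management Systems and Computer and Information Science tie with two word/code pairs each, which is the situation in which the operator probes the respondent for more detail.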

Table 2
Results by the Number of Words in Raw Responses

# of Words in   # of Raw     # of Responses with   # of Responses with a
Raw Response    Responses    All Words Found       Single Most Likely Code
1               656          614                   210
2               675          611                   275
3               77           68                    44
4               16           9                     13
5               1            0                     1
6               2            0                     2
Total           1427         1302                  545

Table 2 shows the distribution of our 1427 raw responses by the number of words in the response. The third column of Table 2 shows the number of responses which had all of their words found in the dictionary (possibly with some help from the CATI operator). Over 90% of all the responses contained only one or two words. All the words in the response were found in the dictionary for 1302 of the 1427 responses, a success rate of 91%.

As mentioned above, only 545, or 38%, of the responses were linked to a single most likely code. Even of the 656 responses which consisted of a single word, only 210 (32%) were linked to a single most likely code. The reason for this difference is that many words simply are not descriptive enough to allow one to narrow down the possibilities to a single code. An example is the single word response "EDUCATION", which could be one of the following major field of study codes:

Early Childhood Ed
Education: Not Phys. Ed.
Elementary Ed
Health/Phys Ed/Recreation (HPER): non-school
Interdisciplinary: all other
Physical Education
Secondary Ed
Special Education

Preliminary results indicate that our new approach is on average taking only a few seconds longer than a simple approach of asking for the major and recording the response. There is also a slightly higher cost for training relative to previous approaches. CATI operators must understand how the coding system works. They must recognize what to do when words are misspelled or are not in the database.

Conclusions

Our test results on actual CATI responses can be divided into three groups. Approximately one-third of the responses can be coded with extremely little intervention by the CATI operator.
For almost another third, the approach does extremely well and reduces the number of possible codes from 110 to fewer than five. For the final third, the approach could be characterized as somewhat helpful. It appears that the benefits of correcting spelling errors and of attempting to code the response while the respondent is available for clarification merit the small amount of additional time required for the new coding system.
