Back to the Basics: Bayesian Extensions of IRT Outperform Neural Networks for Proficiency Estimation

Kevin H. Wilson*, Yan Karklin*, Bojian Han†, Chaitanya Ekanadham*
*Knewton, Inc., New York, NY. †Carnegie Mellon University, Pittsburgh, PA.
{kevin,yan,chaitu}@knewton.com, [email protected]
*Contributed equally to the work. †Performed initial coding and analysis while at Knewton.

ABSTRACT
Estimating student proficiency is an important task for computer-based learning systems. We compare a family of IRT-based proficiency estimation methods to Deep Knowledge Tracing (DKT), a recently proposed recurrent neural network model with promising initial results. We evaluate how well each model predicts a student's future response given previous responses using two publicly available and one proprietary data set. We find that IRT-based methods consistently matched or outperformed DKT across all data sets at the finest level of content granularity that was tractable for them to be trained on. A hierarchical extension of IRT that captured item grouping structure performed best overall. When data sets included non-trivial autocorrelations in student response patterns, a temporal extension of IRT improved performance over standard IRT while the RNN-based method did not. We conclude that IRT-based models provide a simpler, better-performing alternative to existing RNN-based models of student interaction data while also affording more interpretability and guarantees due to their formulation as Bayesian probabilistic models.

Acknowledgements
Many thanks to Siddharth Reddy, David Kuntz, Kyle Hausmann, and Celia Alicata for discussions of this work and help editing the manuscript.

Keywords
Item Response Theory, Recurrent Neural Nets, Bayesian Models of Student Performance, Deep Knowledge Tracing

1. INTRODUCTION
A key challenge for computer-based learning systems is to estimate a student's proficiency based on her previous interactions with the system. Accurate estimation of proficiency enables more efficient diagnosis and remediation of her weaknesses and more effective advancement of her knowledge frontier. Proficiency estimates can also provide the student or teacher with actionable information to improve student outcomes when reported as analytics [21].

Two classical families of methods for estimating proficiency are Item Response Theory (IRT) [8, 13] and Bayesian Knowledge Tracing (BKT) [2]. IRT essentially amounts to structured logistic regression (see Section 2.1), estimating latent quantities corresponding to student ability and assessment properties such as difficulty. BKT does not capture assessment properties but employs a dynamic representation of student ability. A growing body of recent work has focused on modeling various structural properties of students and assessments in an attempt to combine the advantages of IRT and BKT (for instance [14, 15, 11, 5, 10, 12, 3]). In a recently proposed method known as Deep Knowledge Tracing (DKT) [16], a recurrent neural network was trained to predict student responses and was shown to outperform the best published results ([15]) on the publicly available ASSISTments data set [4] by about 20 percentage points with respect to the AUC metric described in Section 4.

To investigate DKT's advantage over traditional models, we compared a standard one-parameter IRT model, two extensions of that model, and DKT on three data sets (two publicly available and one proprietary) on a realistic online prediction task that is typically required by computer-based learning systems (see Section 4), and which was consistent with the evaluation task employed in [16]. (Code for the IRT and DKT models, as well as instructions for reproducing our results, can be found at github.com/Knewton/edm2016.) We reproduce the results of [16] on the ASSISTments data set, but find that proper accounting for duplicate data negates the claimed performance gains.
For the two larger data sets, computational tractability hampered our ability to train DKT on fine-grained content labels, while training IRT-based models scaled to handle them. Moreover, the IRT-based models' best tractable performance matches or outperforms DKT's best tractable performance on all data sets, with a hierarchical extension of IRT performing the best in all cases. We conclude that for these data sets, IRT-based models provide simple, better-performing alternatives to DKT while also affording more interpretability and guarantees due to their formulation as Bayesian probabilistic models.

2. MODELS OF STUDENT RESPONSES
In this section we set notation and describe the models we compare. Throughout, we will represent the student response data D as a set of tuples (s, i, r, t) indicating the student, item, correctness, and time of each response. In this paper, time will be indexed by interaction index (rather than wall clock time).

2.1 Item Response Theory (IRT)
Item Response Theory (IRT) is a standard framework for modeling student responses dating back to the 1950s [8, 13]. A single number, called the proficiency or ability, represents a student's knowledge state during the course of completing several assessments. It is assumed that this proficiency does not change during this examination. (For an in-depth discussion of IRT and a review of related literature, see [17], especially Chapter 5.)

The model assumes that many students have completed a test of dichotomous items and assigns each student s a proficiency θ_s ∈ R. A key innovation of IRT is to model variation across different items. In its simplest form, the one-parameter model, each item i is assigned a parameter β_i representing the difficulty of the item. The probability that a student s answers item i correctly is given by f(θ_s − β_i), where f is some sigmoidal function.

When f is the logistic function, this corresponds to (structured) logistic regression, where the factors for a response to an item are indicators for students and items. We use a variant of this model known as 1PO (one-parameter ogive) IRT, where the link function f(x) = Φ(x) is the cumulative distribution function of the standard normal distribution. (The ogive yields nearly identical results to the commonly used logistic link function, but allows closed-form posterior computation in the temporal IRT model described in Section 2.3.) Since the maximum likelihood solution for {θ_s, β_i} is underdetermined (for example, the response predictions are invariant when a constant offset is added to the θ_s's and β_i's), we take a Bayesian approach and regularize the solution by imposing independent standard normal prior distributions over each θ_s and β_i.

2.1.1 Learning
To train the parameters on student response data, we maximize the log posterior probability of {θ_s, β_i} given the response data (the set of response correctnesses {r : (s, i, r, t) ∈ D}, each of which is 0 or 1). Assuming independent, standard normal priors on each θ_s and β_i, the log posterior is:

\[
\log P(\{\theta_s\}, \{\beta_i\} \mid D) = \sum_{(s,i,r,t) \in D} \big[ r \log f(\theta_s - \beta_i) + (1 - r) \log(1 - f(\theta_s - \beta_i)) \big] - \frac{1}{2} \sum_s \theta_s^2 - \frac{1}{2} \sum_i \beta_i^2 + C. \tag{1}
\]

We maximize this objective with respect to the parameters using standard second-order ascent methods to obtain the maximum a posteriori (MAP) estimate of each parameter.
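As a concrete illustration, the following minimal sketch MAP-fits the 1PO IRT model of Eq. (1) by minimizing the negative log posterior with an off-the-shelf quasi-Newton optimizer (a stand-in for the second-order ascent methods mentioned above). The function name and data layout are illustrative assumptions, not the authors' API; their implementation is at github.com/Knewton/edm2016.

```python
# Minimal sketch of 1PO IRT MAP estimation (Eq. 1). Assumes responses arrive
# as parallel arrays: students[k], items[k], correct[k] in {0, 1}.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def fit_1po_irt(students, items, correct, n_students, n_items):
    """Return MAP estimates of proficiencies theta and difficulties beta."""
    students, items, correct = map(np.asarray, (students, items, correct))

    def neg_log_posterior(params):
        theta, beta = params[:n_students], params[n_students:]
        p = norm.cdf(theta[students] - beta[items])   # ogive link f = Phi
        p = np.clip(p, 1e-9, 1 - 1e-9)                # guard against log(0)
        log_lik = np.sum(correct * np.log(p) + (1 - correct) * np.log(1 - p))
        log_prior = -0.5 * np.sum(theta ** 2) - 0.5 * np.sum(beta ** 2)
        return -(log_lik + log_prior)

    x0 = np.zeros(n_students + n_items)               # start at the prior mean
    res = minimize(neg_log_posterior, x0, method="L-BFGS-B")
    return res.x[:n_students], res.x[n_students:]
```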
2.2 Hierarchical IRT (HIRT)
In many situations, including each of our data sets, the assessment items may have structure that can inform predictions of student responses. For example, groups of items may assess the same topic, resulting in item properties that are more similar within groups than across them. Alternatively, items may be derived from common templates. Templates, often found in math courses, look like "What is x + y?", and a particular instantiation is generated by choosing values for x and y. For example, the ASSISTments data set contains several problems, many of which share the same template, many of which in turn assess a single skill.

We can augment the IRT model to incorporate knowledge about item groups, resulting in a hierarchical IRT model (HIRT). Each item i is associated with a group j(i), and its difficulty is distributed normally around a per-group mean µ_{j(i)}: β_i ∼ N(µ_{j(i)}, σ²). Each µ_j is in turn distributed according to the hyperprior µ_j ∼ N(0, τ²). This reflects the belief that the difficulty of items in the same group should be similar. The degenerate cases provide some intuition: the limit σ → 0 is the same model as 1PO IRT where we consider the items in a group to be the same item, and the limit τ → 0 is equivalent to a 1PO IRT model with no groupings.

2.2.1 Learning
Learning is done similarly to Bayesian IRT (Section 2.1), except that we ascend the modified log posterior probability

\[
\log P(\{\theta_s\}, \{\beta_i\}, \{\mu_j\} \mid D) = \sum_{(s,i,r,t) \in D} \big[ r \log f(\theta_s - \beta_i) + (1 - r) \log(1 - f(\theta_s - \beta_i)) \big] - \frac{1}{2} \sum_s \theta_s^2 - \frac{1}{2\sigma^2} \sum_i (\beta_i - \mu_{j(i)})^2 - \frac{1}{2\tau^2} \sum_j \mu_j^2 + C. \tag{2}
\]

We maximize this objective with respect to {θ_s, β_i, µ_j}.
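Relative to the 1PO sketch above, the only change is the prior term, so a sketch of the HIRT objective of Eq. (2) needs just a group-lookup array and the two variances. The helper name and argument layout are again illustrative assumptions.

```python
# Minimal sketch of the HIRT log posterior (Eq. 2), up to the constant C.
# group_of_item[i] gives the group index j(i) of item i.
import numpy as np
from scipy.stats import norm

def hirt_log_posterior(theta, beta, mu, students, items, correct,
                       group_of_item, sigma2, tau2):
    p = np.clip(norm.cdf(theta[students] - beta[items]), 1e-9, 1 - 1e-9)
    log_lik = np.sum(correct * np.log(p) + (1 - correct) * np.log(1 - p))
    log_prior = (-0.5 * np.sum(theta ** 2)
                 - np.sum((beta - mu[group_of_item]) ** 2) / (2.0 * sigma2)
                 - np.sum(mu ** 2) / (2.0 * tau2))
    return log_lik + log_prior
```

Handing the negation of this function to the same optimizer as above recovers the two degenerate cases: shrinking sigma2 ties the items in a group together, while shrinking tau2 pins the group means at zero.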
2.3 Temporal IRT (TIRT)
1PO IRT and HIRT assume each student's knowledge state remains constant over time. However, in a setting where a student may be acquiring (or forgetting) knowledge over a period of time (e.g., while interacting with a tutoring system), we can extend this model by modeling each θ_s as a stochastic process varying over time (see, for example, [5]). We adopt the approach described in [3], modeling the student's knowledge as a Wiener process:

\[
P(\theta_{s,t+\tau} \mid \theta_{s,t}) \propto \exp\!\left( -\frac{(\theta_{s,t+\tau} - \theta_{s,t})^2}{2\gamma^2 \tau} \right) \quad \forall s, t, \tau. \tag{3}
\]

In other words, the change in student s's knowledge state between time t and a future time t + τ (expressed as θ_{s,t+τ} − θ_{s,t}) is normally distributed about 0 with variance γ²τ, where γ is a parameter controlling the "smoothness" with which the knowledge state varies over time.

2.3.1 Learning
We fit the parameters according to the procedure described in [3]. Estimating the entire trajectory {θ_{s,t}}_t for each student simultaneously with the item parameters is very expensive and difficult to do in real time. To simplify the approach, we learn parameters in two stages:

1. We learn the β_i according to a standard 1PO IRT model (see Section 2.1.1) on the training student population and freeze these during validation.

2. For each response of each student in the held-out validation population, we predict this response according to a temporal IRT model given the student's previous responses, as described below. For further details of the validation procedure, see Section 4.

For the second step, we combine the approximation

\[
P(\{(s', i, r, t') \in D : s' = s,\, t' \le t\} \mid \theta_{s,t}) \approx \prod_{(s',i,r,t') \in D:\, s' = s,\, t' \le t} P((s', i, r, t') \mid \theta_{s,t}) \tag{4}
\]

with (3), integrating out previous proficiencies of the student, to get a tractable approximation of the log posterior over the student's current proficiency given previous responses:

\[
\log P(\theta_{s,t} \mid D) \approx \sum_{(s',i,r,t') \in D:\, s' = s,\, t' \le t} \big[ r \log f(\tilde{\alpha}_{t'} (\theta_{s,t} - \beta_i)) + (1 - r) \log(1 - f(\tilde{\alpha}_{t'} (\theta_{s,t} - \beta_i))) \big], \tag{5}
\]

where \(\tilde{\alpha}_{t'} = (1 + \gamma^2 (t - t'))^{-1/2}\). The α̃_{t'}'s essentially discount the relative effect of older responses when estimating the current proficiency. See [3] for details.
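Once the β_i are frozen, the per-student online update of Eq. (5) is a one-dimensional optimization, as the following sketch shows. The function name and inputs are illustrative assumptions, and the scalar optimizer stands in for whatever solver the procedure of [3] uses.

```python
# Minimal sketch of the TIRT online proficiency estimate (Eq. 5): given one
# student's past responses and frozen item difficulties beta, find the MAP
# current proficiency, discounting older responses by alpha_tilde.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

def tirt_current_proficiency(times, items, correct, beta, t_now, gamma2):
    times, items, correct = map(np.asarray, (times, items, correct))
    alpha = (1.0 + gamma2 * (t_now - times)) ** -0.5   # alpha_tilde_{t'}

    def neg_log_posterior(theta):
        p = np.clip(norm.cdf(alpha * (theta - beta[items])), 1e-9, 1 - 1e-9)
        return -np.sum(correct * np.log(p) + (1 - correct) * np.log(1 - p))

    return minimize_scalar(neg_log_posterior).x
```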
2.4 Deep Knowledge Tracing (DKT)
Recently, a recurrent neural network was used to predict student responses [16]. Such architectures have seen enormous success in applications to a wide range of other domains (e.g., image processing [6], speech recognition [7], and natural language processing [20]).

In this model, the input vectors are representations of whether the student answered a particular question correctly or incorrectly at the previous time step, and the output vectors are representations of the probability, over all the questions in the question bank, that a student will get the question correct at the following time step. In [16], the authors propose using a one-hot vector x_{s,t} ∈ R^{2I} to represent the response of a student s (on item i) at time t. Here I is the total number of items; the first I slots represent answering correctly and the remaining I slots represent answering incorrectly. Output vectors y_{s,t} ∈ R^I are vectors of probabilities, where the ith element of y_{s,t} is the model's predicted probability that student s would answer item i correctly at time t + 1.

We use a model with one hidden layer, of dimension H, which is fully connected to both the input and output layers, as well as recurrently to itself. (Note that in [16], an LSTM network was used in addition to the RNN described here, and the performance of the two networks was comparable.) This model is able to capture temporal effects (via the recurrent component of the network) and remains flexible enough to describe non-trivial relationships between items.

2.4.1 Learning and Parameter Choices
In order to make learning tractable, we reduced the dimensionality of the input by projecting the x_{s,t} ∈ R^{2I} to a lower-dimensional space R^C using a random projection matrix c : R^{2I} → R^C, as was done in [16]. We used batch gradient ascent with dropout [18], and chose the input dimensionality C and the hidden dimensionality H by sweeping these parameters on a data set that was held out from the data used for training and cross-validation.

The predictions are given by the following equations:

\[
\vec{h}_{s,t+1} = g(W_{hh} \vec{h}_{s,t} + W_{xh}\, c(\vec{x}_{s,t}) + \vec{b}_h) \tag{6}
\]
\[
\vec{y}_{s,t+1} = \phi(W_{hy} \vec{h}_{s,t+1} + \vec{b}_y) \tag{7}
\]

Here, g and φ are the logistic and arctangent functions, respectively. The parameters of the model W_{hh}, W_{xh}, W_{hy}, b_h, b_y are fit by optimizing the cross-entropy of the responses with the predicted probabilities (which is equivalent to the log likelihood if these probabilities were produced via a generative probabilistic model):

\[
\sum_{(s,i,r,t) \in D} r \log y_{s,t,i} + (1 - r) \log(1 - y_{s,t,i}). \tag{8}
\]

Stochastic gradient ascent with minibatches of students on the unrolled RNN, coded using Theano [1], was used to optimize this objective function.
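To make Eqs. (6)-(8) concrete, here is a sketch of the forward pass for one student in plain numpy (the paper's implementation used Theano). The weight shapes, the dense random projection, and the helper names are illustrative assumptions; for simplicity the sketch uses the logistic function for both nonlinearities so that the outputs are valid probabilities for the cross-entropy of Eq. (8), whereas the text above pairs a logistic hidden activation with an arctangent output.

```python
# Minimal sketch of the DKT forward pass (Eqs. 6-7) for one student.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dkt_forward(x_seq, P, W_hh, W_xh, W_hy, b_h, b_y):
    """x_seq: (T, 2I) one-hot response history; P: (C, 2I) random projection.
    Returns a (T, I) array of predicted correctness probabilities."""
    h = np.zeros(b_h.shape[0])
    preds = []
    for x in x_seq:
        h = sigmoid(W_hh @ h + W_xh @ (P @ x) + b_h)   # Eq. (6)
        preds.append(sigmoid(W_hy @ h + b_y))          # Eq. (7)
    return np.array(preds)

def cross_entropy(preds, item_idx, correct):
    """Eq. (8) for one student. The caller aligns targets so item_idx[t] and
    correct[t] refer to the response made at time t + 1."""
    p = np.clip(preds[np.arange(len(item_idx)), item_idx], 1e-9, 1 - 1e-9)
    return np.sum(correct * np.log(p) + (1 - correct) * np.log(1 - p))
```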
3. DATA SETS
In order to test these models, we used three data sets, two publicly accessible and one proprietary. Each of these data sets comes from a system in which students interact with a computer-based learning system in a variety of educational settings (e.g., interspersed with classroom lectures, offline work, etc.).

3.1 ASSISTments
This data set comes from the ASSISTments product, an online platform which engages students with formative assessments replete with scaffolded hints. Most assessments are templated, and each problem is aligned with one, several, or none of the skills that the product is attempting to teach.

The data set [4] is divided into two parts: the "skill builder" set, associated with formative assessment, and the "non skill builder" set, associated with summative assessment. All of our results are reported on the "skill builder" data set, as we expect a stronger temporal signal from formative assessment than from summative assessment. This was also the evaluation data set for [16].

In preprocessing the data, we associated items not aligned with a skill to a designated "dummy" skill, as was done in [16]. We chose to discard rows duplicating a single interaction (represented by a unique order_id value), a step we do not believe was taken by [16]. These duplicate rows arise when a single interaction is aligned with multiple skills. Without removing these duplicates, models that process all skills simultaneously, including DKT and the IRT variants used in this paper, will see the same student interaction several times in a row, essentially providing these models access to the ground truth when making their predictions. This can artificially boost prediction results by a significant amount (see Section 5), as these "duplicate" rows account for approximately 25% of the rows. Indeed, we observed that the performance gains of DKT are negated when these duplicates are removed (see Section 5). Note that typical BKT-based approaches are not susceptible to this artificial boost, since they usually split the data by skill and train separate models.

After pre-processing, the data set consisted of 346,740 interactions for 4,097 users on 26,684 items arising from 815 templates and 112 skills. The overall percent correct was 64.54%.

3.2 KDD Cup
In 2010, the PSLC DataShop released several data sets derived from Carnegie Learning's Cognitive Tutor in (Pre-)Algebra from the years 2005–2009 [19]. We used the largest of the "Development" data sets, labeled "Bridge to Algebra 2006–2007."

One distinct difference between Carnegie Learning's product and ASSISTments is that Carnegie Learning provides much finer representations of the concepts assessed by an individual item. In particular, Carnegie Learning is built around scaffolded, formative assessment, where each step a student takes to answer a problem is counted as a separate interaction, with each step potentially assessing different skills (called Knowledge Components (KCs) in the data set). Note that this "Problem → Step" structure provides a hierarchy which HIRT (Section 2.2) can exploit.

Like ASSISTments, any particular interaction may assess zero or more skills. We follow the same methodology as we did in Section 3.1, arbitrarily but consistently retaining only one of the skills after preprocessing, and associating items not associated with any skills with a designated "dummy" skill.

After pre-processing, the data set retained 3,679,198 interactions for 1,146 users on 207,856 steps arising from 19,355 problems and 494 KCs. The overall percent correct was 88.82%.

3.3 Knewton
Data was collected from a variety of educational products integrated with Knewton's adaptive learning platform and used in various classroom settings across the world. These products vary with respect to the educational content used (disciplines spanned math, science, and English language learning) as well as the way in which students are guided through the content. For example, students may take an initial assessment and then be remediated on areas needing improvement. In other products, students start from the beginning and work toward a predefined goal set by the teacher. In all of these settings, Knewton receives data about each interaction (the (s, i, r, t) tuple of Section 2). We utilized approximately 1M responses of 6.3K randomly sampled students on 105.6K questions spanning roughly 4 months. Students who worked on fewer than 5 questions total were excluded. After pre-processing, student history lengths ranged from 5 to 3.2K responses. The overall percent correct of these responses is 54.6%.

4. EVALUATION METHODOLOGY
4.1 Parameter Selection
For each data set, 20% of students were first set aside for parameter selection, which we performed as follows:

• For HIRT, we swept values of the variances τ² and σ² of the group means and item difficulties, respectively, including regimes (τ² small) which made the model mathematically equivalent to 1PO IRT.

• For TIRT, we swept the temporal smoothness parameter γ², including the regime (γ² small) which made the model mathematically equivalent to 1PO IRT.

• For DKT, we swept the compression dimension C (the dimension of the space to which the input was projected using a random matrix), the hidden dimension H, the dropout probability p, and the step size of our gradient ascent.

4.2 Online Prediction Accuracy
We use an evaluation method we call online response prediction, which matches that of [16]. Students are first split into training and testing populations. Each model is first trained on the training population, and the model parameters that are not student-level (item parameters for IRT-based models, weights for neural networks) are frozen. Then, for each time t > 1 in each testing student's history, we train the student-level parameters in the model on the first t − 1 interactions of the student history and allow it to compute the probability that the t'th response is correct. This process mirrors the practical task that must be completed by an ITS.

We report two different metrics for comparing the predicted correctness probabilities with the observed correctness values. Accuracy (Acc) is computed as the percent of responses in which the correctness coincides with the predicted probability being greater than 50%. AUC is the Area Under the ROC Curve of the probability of correctness for each response.

We use five-fold cross validation (by partitioning the students) on the 80% of the data set remaining after parameter selection (Section 4.1), averaging the Acc and AUC metrics over five different splits of the student population.
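As a sketch, the two metrics can be computed from the pooled online predictions as follows, using scikit-learn's ROC AUC; the function and array names are illustrative assumptions.

```python
# Sketch of the two reported metrics. prob_correct holds the model's online
# predicted probabilities; correct holds the observed 0/1 responses.
import numpy as np
from sklearn.metrics import roc_auc_score

def acc_and_auc(prob_correct, correct):
    prob_correct, correct = np.asarray(prob_correct), np.asarray(correct)
    acc = np.mean((prob_correct > 0.5) == (correct == 1))
    auc = roc_auc_score(correct, prob_correct)
    return acc, auc
```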
5. RESULTS AND DISCUSSION
Table 1 enumerates the fields chosen in each data set to identify items and item groups (for HIRT only) that yielded the computationally tractable model with the best results. Note that for the IRT-based models, our validation scheme (Section 4.2) estimates a single number θ_{s,t} for each student at each point t > 1 of the validation. For computational reasons, it was not feasible to evaluate DKT on fine-grained labels in KDD and Knewton (for ASSISTments, fine-grained labels were tractable but yielded worse results), whereas all IRT variants were able to process data at the finest levels.

Table 1: Item labels yielding best results for each model and data set. For HIRT, the first label specifies the difficulty mean grouping identifier, and the second the item identifier. For 1PO IRT there were no parameters to select.

              IRT          HIRT                         TIRT         DKT*
ASSISTments   problem_id   template_id → problem_id     problem_id   template_id
KDD           Step Name    Problem Name → Step Name     Step Name    KC
Knewton       item_id      concept_id → item_id         item_id      concept_id

We trained and validated each of the models on each of the three data sets as described in Section 4. The results on our evaluation task are summarized in Figure 1. The results clearly indicate that simple IRT-based models do as well as or significantly better than DKT across all data sets.

Figure 1: Summary of results across models and metrics (in the original figure, error bars represent the standard error of the metric across five folds). For TIRT, parameter selection yielded γ² = 0.01 for ASSISTments, γ² = 0 for KDD (making it identical to IRT), and γ² = 100.0 for Knewton. For HIRT, parameter selection yielded σ² = 0.125 and τ² = 0.5 for ASSISTments, σ² = 0.5 and τ² = 0.25 for KDD, and σ² = 0.25 and τ² = 0.125 for Knewton. For DKT, C = 50, H = 100, and the probability of dropout is 0.25 for all models.

              Metric   IRT      TIRT     HIRT     DKT
ASSISTments   Acc      0.7219   0.7224   0.7292   0.7115
              AUC      0.7651   0.7653   0.7740   0.7429
KDD           Acc      0.9030   0.9030   0.9068   0.8901
              AUC      0.8542   0.8542   0.8597   0.8110
Knewton       Acc      0.7292   0.7481   0.7391
              AUC      0.8045   0.8166   0.8189   0.7756

The fact that HIRT is the best-performing model across the board (except for MAP accuracy on the Knewton data set, where TIRT slightly outperforms it) suggests that grouping structure is useful information to exploit when predicting student responses. Indeed, the HIRT model does have access to strictly more information than the other models, in that it has both the item and group identifier associated with each interaction. While the DKT model does have the ability to infer item relationships from data, our results indicate that building in this knowledge is more advantageous in a variety of educational settings. One potential area to explore is learning a hierarchical model purely from the data, which could profit from the structured Bayesian framework without requiring prior information or expert labels.

The temporal IRT model yielded higher accuracy on the Knewton data set, but not on the other two data sets. To understand these effects, we investigated the degree to which temporal structure in the data affects predictive performance by looking at how a naive "windowed percent correct" model (predict that the student will answer the t'th question correctly if they answered at least half of the previous w questions correctly) performs as a function of window length w (Figure 2).

[Figure 2: Accuracy metrics for the three data sets (ASSISTments, KDD, Knewton) computed using a rolling window of previous responses, as a function of window length (w = 1, 2, 4, 8, 16). Response accuracy is computed by predicting correct if the majority of responses in the window are correct.]
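A sketch of this baseline for a single student's response sequence; the helper name is an illustrative assumption.

```python
# Sketch of the "windowed percent correct" baseline: predict correct at step t
# iff at least half of the previous w responses were correct.
import numpy as np

def windowed_percent_correct(correct, w):
    """correct: 0/1 response sequence for one student. Returns predictions
    for every step that has at least w previous responses."""
    correct = np.asarray(correct)
    return np.array([1 if correct[t - w:t].mean() >= 0.5 else 0
                     for t in range(w, len(correct))])
```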
The Knewton data set has a clear optimal window length: integrating over windows that are too short or too long degraded performance, which is indicative of nontrivial temporal structure. However, for the ASSISTments and KDD data sets, longer window lengths perform as well as or better than shorter window lengths, suggesting that static models would do just as well in these cases. Indeed, this would explain why TIRT does more or less the same as baseline 1PO IRT on ASSISTments and KDD but shows significant improvement on the Knewton data set. However, it does not explain why DKT lags regardless of the amount of temporal structure.

Finally, we note that our DKT results in Figure 1 contradict those of [16] on the ASSISTments data set, which reported an AUC of 0.86. We believe this is due to data cleaning issues, specifically the issue of removing duplicates so as not to artificially boost online prediction accuracy, as discussed in Section 3.1. Indeed, we were able to reproduce the performance reported in [16] when applying our RNN implementation to the raw data set (with duplicates left in).

Other recent work [9] points out that the specific method of computing AUC in [16] also significantly affects the reported performance relative to BKT-based models, and further demonstrates that BKT-based models can perform just as well as DKT on a variety of data sets.

6. CONCLUSION
Our results indicate that simple IRT-based models equal or outperform DKT on a variety of data sets, suggesting that incorporating domain knowledge into structured Bayesian models is a promising direction for future research on modeling student interaction data.

In our experience, structured models were easier to train and required less parameter tuning than DKT. Moreover, the computational demands of DKT hampered our ability to fully explore the parameter space, and we found that computation time and memory load were prohibitive when training on tens of thousands of items. These issues could not be mitigated by reducing dimensionality without significantly impairing performance. Further work on discriminative models is necessary to bridge this gap, but currently, IRT-based models seem superior both in terms of performance and ease of use, making them suitable candidates for real-world applications (e.g., intelligent tutoring systems, recommendation systems, or student analytics).
A promising avenue of research could explore combining the advantages of structured Bayesian models with those of large-scale discriminative models, which have provided superior performance in several other domains, particularly in the large-data regime. A crucial challenge for structured models is how to accommodate the diversity of educational settings from which the data are collected (different content, different classroom environments, etc.) while retaining the structure that drives predictive power and interpretability.

7. REFERENCES
[1] Bergstra, J., et al. Theano: a CPU and GPU math expression compiler. In SciPy 2010.
[2] Corbett, A., and Anderson, J. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction 4, 4 (1995), 253–278.
[3] Ekanadham, C., and Karklin, Y. T-SKIRT: Online estimation of student proficiency in an adaptive learning system. Machine Learning for Education Workshop at ICML (2015).
[4] Feng, M., Heffernan, N., and Koedinger, K. Addressing the assessment challenge with an online system that tutors as it assesses. In User Modeling, Adaption, and Personalization, G.-J. Houben, G. McCalla, F. Pianesi, and M. Zancanaro, Eds. 2010, pp. 243–266.
[5] Gonzalez-Brenes, J., Huang, Y., and Brusilovsky, P. General features in knowledge tracing: Applications to multiple subskills, temporal item response theory, and expert knowledge. In EDM 2014.
[6] Gregor, K., et al. DRAW: A recurrent neural network for image generation. In ICML 2015.
[7] Hinton, G., et al. Deep neural networks for acoustic modeling in speech recognition.
[8] Hulin, C. L., and Drasgow, F. Item Response Theory. In Handbook of Industrial and Organizational Psychology, S. Zedeck, Ed., vol. 1. American Psychological Association, 1990, pp. 577–636.
[9] Khajah, M., Lindsey, R. V., and Mozer, M. C. How deep is knowledge tracing? In EDM 2016.
[10] Khajah, M. M., Huang, Y., González-Brenes, J. P., Mozer, M. C., and Brusilovsky, P. Integrating knowledge tracing and item response theory. Personalization Approaches in Learning Environments (2014), 7.
[11] Lan, A. S., Studer, C., and Baraniuk, R. G. Time-varying learning and content analytics via sparse factor analysis. In KDD 2014.
[12] Lee, J. I., and Brunskill, E. The impact on individualizing student models on necessary practice opportunities.
[13] Lord, F. M. A Theory of Test Scores. No. 7 in Psychometric Monograph. Psychometric Corporation, 1952.
[14] Pardos, Z. A., and Heffernan, N. T. Modeling individualization in a Bayesian Networks implementation of knowledge tracing. In User Modeling, Adaption, and Personalization, P. D. Bra, A. Kobsa, and D. Chin, Eds. 2010, pp. 255–266.
[15] Pardos, Z. A., and Heffernan, N. T. KT-IDEM: Introducing item difficulty to the knowledge tracing model. In User Modeling, Adaption, and Personalization, J. A. Konstan, R. Conejo, J. L. Marzo, and N. Oliver, Eds. 2011, pp. 243–254.
[16] Piech, C., Bassen, J., Huang, J., Ganguli, S., Sahami, M., Guibas, L., and Sohl-Dickstein, J. Deep knowledge tracing. In NIPS 2015.
[17] Rupp, A. A., Templin, J., and Henson, R. A. Diagnostic Measurement: Theory, Methods, and Applications. Guilford Press, 2010.
[18] Srivastava, N., et al. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (2014), 1929–1958.
[19] Stamper, J., Niculescu-Mizil, A., Ritter, S., Gordon, G. J., and Koedinger, K. Challenge data sets from KDD Cup 2010. pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp.
[20] Vinyals, O., et al. Grammar as a foreign language. In NIPS 2015.
[21] Wilson, K. H., and Nichols, Z. The Knewton Platform: A general-purpose adaptive learning infrastructure.
