Behavioral Testing of Deep Neural Network Knowledge Tracing Models

Minsam Kim, Yugeun Shim, Seewoo Lee, Hyunbin Loh, Juneyoung Park
Riiid, Seoul, South Korea

ABSTRACT
Knowledge Tracing (KT) is a task to model students' knowledge based on their coursework interactions within an Intelligent Tutoring System (ITS). Recently, Deep Neural Networks (DNN) showed superb performance over classical methods on multiple dataset benchmarks. While most Deep Learning based Knowledge Tracing (DLKT) models are optimized for general objective metrics such as accuracy or AUC on benchmark data, proper deployment of the service requires additional qualities. Moreover, the black-box nature of DNN models makes them particularly difficult to diagnose or improve when unexpected behaviors are encountered. In this context, we adopt the idea of black-box testing / behavioral testing from Software Engineering and (1) define desirable KT model behaviors to (2) propose a KT model analysis framework to diagnose the model's behavioral quality. We test-run the framework using three state-of-the-art DLKT models on seven datasets based on the proposed framework. The result highlights the impact of dataset size and model architecture upon the model's behavioral quality. The assessment results from the proposed framework can be used as an auxiliary measure of the model performance by itself, but can also be utilized in model improvements via data augmentation, architecture design, and loss formulation.

Keywords
Knowledge Tracing, Deep Learning, Behavioral Testing, Model Validation

1. INTRODUCTION
Assessment is a central task in Education, as it is involved in meta-cognition [17], tracing the skill trajectory, recommendation of contents [36], adjustment of tutoring strategy [14], and grading [3, 24, 33, 35]. With the advent of online educational platforms, there is an increasing demand in building assessment models using the interaction history data of users. One approach to track the skill of users is Knowledge Tracing (KT), which is the task to model students' knowledge based on their coursework interactions within an Intelligent Tutoring System (ITS) [7].

To tackle the KT problem, the recent EdNet Challenge on Kaggle gathered a total of 3,406 teams and 4,412 participants, who submitted 64,678 models. Participants trained KT models on the EdNet KT dataset [6], and the models were evaluated by the Area Under the Receiver Operating Characteristic Curve (AUC). The AUCs of the top 5 models were 0.820, 0.818, 0.818, 0.817, and 0.817, which are very similar, and all models were based on the Transformer neural network structure [38]. While neural network structures are usually designed to approximate human intuitions, most models lack interpretability compared to classical models. Therefore, when evaluation results for black-box models do not vary significantly, it becomes unclear how to choose the best model for deployment. Also, while a small quantitative difference in the objective function or model AUC might not hurt the users' perception of the model reliability, a few adversarial decisions of a black-box model can dissuade the user's faith. [34] also note that the performance of black-box models that are trained for general metrics such as classification accuracy or AUC can be overestimated.

As a result, Deep Learning based Knowledge Tracing (DLKT) models are not frequently implemented in the education community due to potential risks arising from the lack of model interpretability. In this study, we propose behavioral testing as an approach to alleviate this problem. The contributions of this work are summarized as follows:

• We propose a novel testing framework to validate DLKT models using a test on behaviors. The idea is to define
consistent and convincing behaviors to be desired on DLKT models.

• As an example of applying the framework, we benchmark three state-of-the-art DLKT models on the proposed validation framework. Positive results highlight the reliability of DLKT models and encourage the models' adoption, while negative results point out the limitations of DLKT models and show spaces for improvement.

• We introduce methods to utilize evaluations from the framework to design and improve DLKT models.

Minsam Kim, Yugeun Shim, Seewoo Lee, Hyunbin Loh and Juneyoung Park, "Behavioral Testing of Deep Neural Network Knowledge Tracing Models". 2021. In: Proceedings of The 14th International Conference on Educational Data Mining (EDM 21). International Educational Data Mining Society, 105-117. https://educationaldatamining.org/edm2021/ EDM '21, June 29 - July 02, 2021, Paris, France.

2. RELATED WORKS

2.1 Knowledge Tracing
Knowledge Tracing (KT) is the task to predict the expected correctness of an interaction of a student with a question by modeling the student's knowledge from past interactions [7]. In this study, we formulate the KT task as follows: the interaction sequence of a user is denoted as X^u = {x^u_1, x^u_2, ..., x^u_T}, where u ∈ U is the user index. To simplify the notations, we omit the user index u unless specified. Each interaction x_t = (q_{i_t}, c_t) at step t is defined by the pair of question q_{i_t} ∈ Q and correctness c_t ∈ {0, 1}, where Q = {q_1, q_2, ..., q_n} denotes the set of all questions and i_t denotes the question index of step t. A KT model predicts the correctness probability P(c_t = 1 | x_1, x_2, ..., x_{t-1}, q_{i_t}) of an unseen interaction x_t at step t, where X_t = x_1, x_2, ..., x_t is the first t interactions of an interaction sequence X.

Notation        Description
u ∈ U           User index
X               Interaction sequence of a single user (= X^u)
x_t             Interaction at time step t
q_j ∈ Q         Question j
i_t             Question index at time step t
c_t             Correctness at time step t for question q_{i_t}
c_t^{(q_j)}     Correctness at time step t for question q_j
X_t             Interaction sequence of X up to time step t

Further DLKT models include those of [28], EKT [21], KTM [39], DHKT [40], SAINT [5], AKT [13], and PEBG [22].

While there exist a variety of different KT models, [12] performed a major experiment on the accuracy of three groups of KT models (Markov process, Logistic Regression, Deep Learning) on nine real-world datasets. While deep learning models do show better AUC and RMSE on some datasets, other linear models, including the authors' proposed BestLR approach, yielded comparable or superior performance on most datasets, while also providing better model interpretability.

2.2 Behavioral Testing in Other Applications
To alleviate unexpected behaviors of black-box models, [2] introduces behavioral testing (also known as black-box testing) to test different capabilities of a system from the software engineering perspective. Many studies work on effectively designing test cases [18, 25, 26]. [29] gives a detailed review of the behavioral testing method applied in various software testing domains. In Natural Language Processing, [34] apply the behavioral testing framework to validate the behaviors of general NLP models. They introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList is a list of general linguistic capabilities and test type baselines for NLP tasks. It is also a software tool to generate test cases for NLP models.

2.3 Behavioral Studies in Knowledge Tracing
The expected behaviors of KT models have been discussed in some studies, which point out the adversarial behaviors of KT models and propose new models to alleviate the problem. DKT+ [41] raises two problems of the Deep Knowledge Tracing (DKT) model [31], which are increasing correctness probabilities from false responses, and wavy prediction transition by time.
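The KT formulation and notation of Section 2.1 can be sketched in code. This is an illustrative sketch only, not the authors' implementation; `toy_model`, which simply predicts the user's running correctness rate, is a hypothetical stand-in for a trained KT model.

```python
from typing import Callable, List, Tuple

# An interaction x_t = (q_{i_t}, c_t): a question index and a correctness label in {0, 1}.
Interaction = Tuple[int, int]

def predict_correctness(model: Callable[[List[Interaction], int], float],
                        history: List[Interaction], question: int) -> float:
    """P(c_t = 1 | x_1, ..., x_{t-1}, q_{i_t}): the KT prediction for `question`
    given the first t-1 interactions, with the model treated as a black box."""
    return model(history, question)

def toy_model(history: List[Interaction], question: int) -> float:
    # Hypothetical stand-in model: returns the user's running correctness
    # rate, ignoring the question identity (0.5 when there is no history).
    if not history:
        return 0.5
    return sum(c for _, c in history) / len(history)
```

A real DLKT model (DKT, SAKT, SAINT) would replace `toy_model`; the behavioral tests of Section 3 only require this black-box prediction interface.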
However, these behaviors can naturally occur from the educational effects embedded in the interaction, which we discuss in detail in Section 3.1. The authors add three regularization terms in DKT+ to enhance the consistency of the predictions of DKT, and introduce extra performance measures.

The authors of [19] list some desirable behaviors based on the monotonicity of KT models to improve the general ability of the models. Then, they perform three types of novel data augmentation techniques (replacement, insertion, and deletion) and apply them to the training of KT models.

Many KT models utilize domain-specific tags such as skill components of questions [27, 8, 31, 43, 42], difficulty of questions [32, 10, 37], or knowledge graphs [4]. Item Response Theory (IRT) [32] models the correctness probability of a student responding to a question using custom-designed models, and fits the model parameters using maximum likelihood. For instance, the 4-PL model predicts the correctness probability of a user with skill level θ solving item i by

    p_i(θ) = c_i + (d_i − c_i) / (1 + e^{−a_i(θ − b_i)}),

where a_i, b_i, c_i, d_i are parameters that model the discrimination, difficulty, pseudo-guessing, and slip of item i [23].

Another prominent approach is Bayesian Knowledge Tracing (BKT) [27, 42], which uses a Markov process to model the difficulty of the question items and the learning capability of the students. Another well-known approach is Deep Knowledge Tracing (DKT) [31], which is the first Deep Learning based KT model (DLKT). Since the introduction of DKT, many researchers have worked on different network structures to capture the complex aspects of the knowledge state. There are a variety of models based on different structures such as DKVMN [43], DKT+ [41], SKVMN [1], SAKT [30], and GKT.

As examined in these relevant studies, the adversarial behaviors and low interpretability of DL models hinder the AIEd society from adopting Deep Learning based KT (DLKT) models, sustaining the adoption of interpretable models based on BKT, IRT, or Cognitive Diagnosis Models [9]. In this study, we provide a validation framework for DLKT models and conduct an extensive set of experiments on the desired behaviors of DLKT models. Good results highlight the reliability of DLKT models and encourage the models' adoption on most datasets. On the other hand, bad results point out the limitations of DLKT models on some datasets and show spaces for improvement.

3. BEHAVIORAL TESTING FOR KNOWLEDGE TRACING
We propose a black-box behavioral testing framework for the knowledge tracing task. First we define the knowledge state (KS), then elaborate on desirable behaviors of KT models' KS representation. Finally, in Subsection 3.2, we introduce specific experiment setups to assess whether DLKT models satisfy those behaviors.

We define the knowledge state to be a vector representation of a user's correctness probability on a set of questions Q' at a specific time point. Given the first t interactions X_t = x_1, x_2, ..., x_t of a user, we define the user's knowledge state as:

    KS_t = [ P(c_t^{(q_j)} = 1 | X_{t-1}, q_j) ]_{q_j ∈ Q'}     (1)

c_t^{(q_j)} represents the Bernoulli indicator for the event that the user answers question q_j correctly at time step t, as defined in the table in Section 2.1, which is updated along the provision of the user interaction sequence. Note that a KS is the collection of prediction values over questions, whether or not each question has actually been responded to. We describe the desired aspects of DLKT models in the following section.

3.1 Expected Behaviors of Traced Knowledge State
First we introduce two properties on the change ∆KS of the knowledge state KS with respect to an atomic change ∆X of the interaction sequence X, which is an insertion or a deletion of a single interaction record.

First, monotonicity insists that the model's knowledge state should be updated to a more knowledgeable state when the student adds another correctly answered question (positive interaction) or when an interaction record with an incorrect response (negative interaction) of the student is deleted.

Negatively correlated questions might cause the model to fit a non-monotonic relationship between two fundamentally unrelated subjects. In most ITSs, however, the target study domain is usually restricted to a single subject, or to a set of knowledge components on which the student's comprehension is usually positively correlated. Another case is when a negative response increases the correctness probability of a problem, described as an adversarial behavior in [41]. However, the educational effect of consuming a question can give positive feedback on the knowledge state even if the interaction response was wrong. Therefore, we assume the described monotonic behavior in general knowledge tracing environments, while simultaneously keeping track of the opposite case in the experiments of Section 4.

• Robustness: For any black-box system, it is generally desirable that an insignificant change in the system's input leads to limited change in the output. For a knowledge tracing model in an ITS, the input refers to the student's interaction record and the output refers to the model prediction on correctness probability for an encountered question or a set of questions. Therefore, we formulate the robustness of a knowledge tracing model as below in a general sense, adopting the ∆X previously defined:

    |P(c_t^{(q_j)} = 1 | X ∪ ∆X, q_j) − P(c_t^{(q_j)} = 1 | X, q_j)| < ε

for some ε, a single interaction ∆X, t > t_p, and q_j ∈ Q. If we impose the inequality to always hold at t = t_p + 1 with a fixed ε_1, then it is equivalent to imposing continuity on the knowledge state in terms of time-steps. We treat continuity as a specific case of robustness and introduce a customized test for continuity separately from the test for robustness.

However, consider a case when q_j and q_{i_{t_p}} in ∆X assess
similar concepts, or when the educational effect from the interaction with one question affects the student's correctness probability on the other question. Then an insertion or deletion of one question is prone to have a significant impact upon the predicted correctness probability of the other for t > t_p. Therefore, the defined robustness / continuity need not be universally desirable for all pairs of questions. The impact of this property would eventually depend on the degree of dependency among the questions. Therefore, in the experiments, while assuming robustness for most question pairs, we also carefully track where some questions affect the prediction values of other questions by a notable amount.

If ∆X is applied in the middle of the interaction record, all the changes after the perturbation should hold the property as well. Second, robustness insists that a little perturbation in the interaction history should not yield a dramatic change in future knowledge states. The details of the two properties are introduced below.

• Monotonicity: If ∆X is a correctly responded interaction (q_{i_{t_p}}, 1) at perturbation time t_p, then we can track the relation of P(c_t^{(q_j)} = 1 | X ∪ ∆X, q_j) and P(c_t^{(q_j)} = 1 | X, q_j) depending on how q_j and ∆X are correlated. In many cases, a positive correlation in correctness probabilities is desired due to the relation of knowledge states:

    P(c_t^{(q_j)} = 1 | X ∪ ∆X, q_j) > P(c_t^{(q_j)} = 1 | X, q_j)

for t > t_p and q_j ∈ Q. However, there can also be negatively correlated questions, which could be consequences of factors such as limited learning resources.
For instance, a college student might sacrifice her studying time on one subject over another when both subjects' examinations are scheduled too closely with each other. This type of circumstance yields negatively correlated questions.

Next we discuss what constitutes an expected value of the knowledge state. Testing whether the knowledge state has accurately captured the user's interaction history is in line with the existing quantitative metrics (AUC, ACC) adopted in the KT literature. However, the existing test methods focus only on the single actual question provided per time step, whereas we propose to assess a knowledge tracing model via its knowledge state on a virtual question set, in order to provide a more holistic assessment via the knowledge state representation.

Although tracing the knowledge state on a set of questions provides a more comprehensive picture of how the user's knowledge is traced, it lacks actual labels of correctness on unseen questions at each time step, as discussed in Section 3.1. Therefore, we describe below novel measures to assess the correctness of the knowledge state under a few purposefully designed circumstances.

3.2 Behavioral Test Setups
Below we describe four behavioral testing setups for DLKT models. First, perturbation tests aim to test the model's monotonicity and robustness given an atomic perturbation to the original interaction sequence data. Second, the continuity test checks whether the model's knowledge state representation is continuous along the interaction sequence. Third, the initial knowledge state test checks whether the initial knowledge state reflects each question's corresponding difficulty measure. Fourth, the convergence test checks whether the knowledge state converges to the expected value and how fast it converges. The following subsections elaborate on each test setup in further detail. Table 1 provides a summary of the tests.

• Approximate Label of User-Independent Initial Knowledge State: At the first prediction step, we approximate a correctness label for all questions via their 'global difficulty'. The initial knowledge state for all users should represent the model's prior belief of question difficulty before encountering any user-specific interaction record data.
It is reasonable to assess the quality of this value in terms of the correlation between the model's prediction on each question and the question's global difficulty. However, this is an approximation at best, since the question's difficulty might not be accurately captured by a simple average over its occurrences. If the actual interaction data generated from the ITS provides a very difficult question only after the user's knowledge has significantly accumulated, a simple average of question correctness labels would not be representative of the question's inherent difficulty, and the model's prediction would be high.

3.2.1 Perturbation Tests
We examine the monotonicity and robustness of the model by perturbation tests. We experiment with three types of perturbations: insertion, deletion, and flip. For each original interaction sequence, we determine t_p, the index of the interaction to be perturbed. For the insertion case, we add a new interaction between x_{t_p−1} and x_{t_p}. For the deletion case, we remove the interaction x_{t_p}. For the flip case, we flip the correctness of x_{t_p} from 1 to 0 or from 0 to 1.

In order to check monotonicity, we assess whether the model's predicted correctness probability in the following interaction sequence X_{[t_p:]} changes towards the expected direction. For insertion / deletion / flip of an interaction to which the user responded correctly, we examine whether the following future correctness probability P(c_{t'+1} = 1 | X_{t'}), ∀t' > t_p, increases / decreases / decreases, respectively.

• Ideal Value of Knowledge State after Converged Interaction Data: We generate obvious edge-case tests where the user's knowledge state on a set of questions has converged to a value. We create this virtual dataset by simply repeating an identical interaction record on each question consecutively.
The model’s fixtheperturbationpointtobelocatedhalfwayintheuser’s prediction value for the question in the repeated in- original interaction sequence, then measure the proportion teraction should converge to the repeatedly provided ofinteractionswhichthemodel’spredictedcorrectnessprob- label value. ability changes towards the expected direction. • Approximate Label of Knowledge State in General: It’s also possible to approximate a pseudo-label for To assess the model’s robustness, we visualize how the de- unseen questions using rolling/expanding averages or gree of impact from perturbation changes along the time IRT-like algorithms which demonstrate more stable steps from t . We expect the degree of impact from per- p and monotonic behavior by design. Although we con- turbationuponthemodel’spredictiontodecaygraduallyas jecturesuchtrainingmethodologyofDLKTmodelsus- moreinteractionsarefedintothemodelaftertheperturba- ing pseudo-labels might provide regularization effect, tion point t . p we do not include this case in the scope of this work. 3.2.2 ContinuityTest We test whether the knowledge state is continuous, in the Table1: BehavioralTestSummary manner described in the previous section 3.1. For every time-step, we provide the model with not only the origi- Behavior Analysis Method nal interaction at the corresponding time-step, but also a Monotonicity PerturbationTest: Percentageofinterac- set of questions Q(cid:48) simultaneously to construct knowledge tion samples of which model prediction state KS at the time-step. Although we don’t have actual t changed in expected direction. correctness label for those virtual interactions, we only in- Robustness PerturbationTest: Degreeofimpactfrom quire how the knowledge state or the model prediction on perturbation across time-steps. Q(cid:48) evolves along the time-steps. 
Continuity     Continuity Test: Average and maximum change in knowledge state score per step and throughout the entire sequence.
Initial Value  Initial Value Test: Correlation between question correctness rate and initial knowledge state.
Convergence    Convergence Test: Convergence speed as in model AUC and average model output at different time-steps.

In the experiments, we approximate a score of the user's knowledge state by averaging the model-predicted correctness probability over the sample set of questions Q', and report: (1) the average and maximum student score change per single time-step, and (2) the average student score change and range across 100 time-steps.

4. EXPERIMENTS
In this section, we benchmark three Deep Learning based Knowledge Tracing models, DKT, SAKT, and SAINT, on the proposed behavioral tests. First, we train optimized models for each architecture-dataset pair by searching hyper-parameters on the train and validation data split. Second, we report the classification accuracy and the AUC metric, which are commonly used for model assessment in the Knowledge Tracing literature. Third, we present the proposed behavioral test results of the model instances on well-known Knowledge Tracing datasets.

Table 2: Dataset Statistics
Dataset       Users    Items  Skills  #Intr.  %Crct.
ASSIST15      14228    100    100     656K    73
ASSIST17      1708     3162   411     935K    37
STATICS       282      1223   98      189K    77
Spanish       182      409    221     579K    77
EdNet-small   5000     13156  118     518K    65
EdNet-med     100000   13518  118     11M     64
EdNet         605763   13528  118     138M    66

4.1 Datasets
We describe the datasets used in our experiments. All datasets are open to the public.

3.2.3 Initial Knowledge State Test
We assess the quality of the prior knowledge state embedded in the model by the initial knowledge state test. Without any user-specific record, the prior knowledge state embedded in the model should accurately reflect the average difficulty of each question over all users.
Thus, we check the Spearman rank correlation and Pearson correlation between the question's average difficulty and the model-predicted prior belief for each question.

In detail, a trained DLKT model M's initial knowledge state for a question q_j can be represented as P_M(c = 1 | ·, q_j). We compare this with the question correctness rate over the entire dataset as in Eq. 2, which is equivalent to the number of correctly responded q_j-interactions over the number of occurrences of q_j across all user data.

    gc_{q_j} = Σ_{u∈U} |{x_t^{(u)} | q_{i_t} = q_j, c_t = 1}| / Σ_{u∈U} |{x_t^{(u)} | q_{i_t} = q_j}|     (2)

Consequently, we measure:

    Corr( [P_M(c = 1 | ·, q_j)]_{q_j∈Q}, [gc_{q_j}]_{q_j∈Q} )     (3)

ASSISTments [11] is a dataset containing student interactions from an online tutoring system for solving Massachusetts Comprehensive Assessment System (MCAS) 8th grade Math test questions. We use the datasets ASSISTments 2015 (Assistments15) and ASSISTments Challenge 2017 (Assistments17).

STATICS is a dataset containing college student interactions in a one-semester Statics course. This dataset is available on the PSLC DataShop web site [16].

Spanish [20] is a dataset containing middle-school student interaction data for Spanish exercises.

EdNet [6] is the largest public benchmark education dataset, containing user interaction data from an online tutoring system for preparing the TOEIC (Test of English for International Communication®). For ablation studies on the size of the dataset, we randomly choose 100,000 users for EdNet-medium and 5,000 users for EdNet-small.

Table 3: Model Hyper-parameters
Model    Parameter             Tuning Details
Common   Adam learning rate    0.001, 0.003, 0.01
         Dropout rate          0, 0.25, 0.5
         Embedding dimension   64, 128, 256
         Maximum Seq. Length   100, 200, 400
DKT      # Recurrent Layers    1, 2, 4
SAKT     # Attention Layers    1, 2, 4
SAINT    Warm-up Steps         200, 400, 4000
         # Attention Heads     1, 4, 8

This initial knowledge state test pinpoints whether the learned question embedding in the DLKT model alone has captured any information about the corresponding question's difficulty.
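Eqs. (2)-(3) can be sketched as follows. The helper names are ours, not the paper's, and in practice a library routine (e.g. SciPy) would be used for the correlations.

```python
from collections import defaultdict

def global_correctness_rates(all_interactions):
    """Eq. (2): per-question global correctness rate gc_{q_j}, i.e. the
    number of correct q_j-interactions over the number of q_j occurrences,
    pooled over all users. `all_interactions` is a flat list of
    (question, correct) pairs."""
    correct, total = defaultdict(int), defaultdict(int)
    for q, c in all_interactions:
        total[q] += 1
        correct[q] += c
    return {q: correct[q] / total[q] for q in total}

def pearson(xs, ys):
    """Plain Pearson correlation, as used in Eq. (3) to compare the model's
    initial predictions with the global correctness rates."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

The Spearman variant simply applies the same formula to the ranks of the two vectors.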
Moreover, we emphasize the importance of the initial knowledge state, since the state assumed by the model would likely affect the user's first impression of the system when making decisions.

3.2.4 Convergence Test
In the convergence test, we assess whether the model's knowledge state value converges to a target value in a desired manner. We generate simple virtual interaction sequence data by repeating an identical interaction for 50 time-steps for each question, for both correctness cases. Therefore, the virtual dataset consists of virtual user interaction sequences numbering twice the number of questions.

In the experiments, we report the model's standard AUC metric at time-steps 5, 10, and 50. We expect significantly high figures, as the inquired interaction sequence is extremely simple. We also visualize how the average model prediction value across the questions evolves throughout the 50 time-steps for each correctness case. We expect the values to quickly converge to 1 / 0 for interaction sequences whose correctness labels are all correct / incorrect, respectively.

4.2 Models and Algorithms
We perform hyper-parameter tuning on the training of the models. For each configuration of hyper-parameters, we choose the model weights with the best validation AUC. In the training step, an early-stopping policy is applied with patience 30, which means that we stop the training process and save the best weights if there is no AUC improvement in the most recent 30 validation steps. Among the best weights for each configuration, we choose the weights with the best validation AUC for each dataset, and evaluate them on an independent test set for the test metrics.
4.2.1 Training Details
In this study, we use DKT, SAKT, and SAINT in the experiments. DKT models the student's knowledge state using a Recurrent Neural Network (RNN) by compressing the interaction history into a hidden layer. SAKT is the first KT model that used self-attention layers, where in each layer the question embeddings are the queries and the interaction embeddings are the keys and values. SAINT is the first KT model based on Transformers: the sequence of questions is fed into the encoder, and the sequence of responses is fed into the decoder along with the encoder output.

Our model hyper-parameters are shown in Table 3. We use the Adam optimizer [15] with default parameters. For SAKT and SAINT, we used the Noam scheme for scheduling the learning rate, and tune the number of warm-up steps. The original SAKT implementation does not include a residual connection from the query. This forces the first prediction to the same number every time, regardless of the first question provided to the model. Since this would make the test of Section 3.2.3 redundant, we use a modified SAKT architecture with a residual connection. For SAINT and SAKT, the dimension of the feedforward network is set to 4×(model dimension). For SAINT, we use the same number of attention layers for the encoder and the decoder.

We report the test pass rates for insertion, deletion, and flip. The results are shown in Tables 6, 7, and 8, respectively. Figure 1 describes the average impact on model prediction from an insertion perturbation on each dataset (column) and correctness label of the inserted interaction (row). Figure 2 describes the degree of maximum impact over user sequences from an insertion perturbation.

Table 6: Insertion Test Pass Rates (%)
Model           DKT    SAKT   SAINT
Assistments15   70.9   70.3   65.3
Assistments17   69.6   55.7   56.7
STATICS         71.1   61.0   58.2
Spanish         80.1   75.6   60.7
EdNet-small     66.3   78.0   75.9
EdNet-medium    83.2   80.6   77.3
EdNet           72.7   71.6   71.2
Average         73.4   70.4   66.5

Table 7: Deletion Test Pass Rates (%)
Model           DKT    SAKT   SAINT
Assistments15   69.0   66.3   62.1
Assistments17   63.7   54.4   54.6
STATICS         60.7   55.9   49.2
Spanish         81.6   81.9   59.7
EdNet-small     65.6   75.3   71.9
EdNet-medium    80.3   76.8   73.5
EdNet           72.3   68.3   69.5
Average         70.4   68.4   62.9

4.3 Results: Traditional Assessment
AUC and accuracy results are shown in Tables 4 and 5. The difference in these standard metrics is generally less than 0.01. For KT-based tutoring systems, this difference would be less important than behavioral performance. AUC shows the monotonicity of interactions over all users, and accuracy does not focus on the exact model prediction. On the other hand, behavioral tests can check the performance of the model prediction for a single user, and analyze the impact of a single interaction.

Table 4: Standard AUC Metric
Model           DKT      SAKT     SAINT
Assistments15   0.7242   0.7226   0.7179
Assistments17   0.7742   0.7650   0.7680
STATICS         0.8269   0.8248   0.8275
Spanish         0.8336   0.8456   0.8364
EdNet-small     0.7332   0.7380   0.7328
EdNet-medium    0.7717   0.7760   0.7722
EdNet           0.7810   0.7905   0.7863
Average         0.7778   0.7804   0.7773

Table 8: Flip Test Pass Rates (%)
Model           DKT    SAKT   SAINT
Assistments15   77.1   96.3   94.7
Assistments17   69.5   86.1   66.4
STATICS         93.4   92.9   84.7
Spanish         87.5   89.1   83.9
EdNet-small     75.2   95.0   95.8
EdNet-medium    87.8   94.8   95.5
EdNet           79.5   83.6   86.0
Average         81.4   91.1   86.7

• In general, deletion and insertion pass rates range from 60% to 80%, and flip pass rates range from 80% to 90%. Note that a flip can be interpreted as a combination of deletion and insertion. Therefore, the impact of the perturbation is supposed to be larger, leading to higher pass rates as compared to the insertion/deletion cases. From Figure 1, Figure 6 (Appendix), and Figure 7 (Appendix), we note that the degree of impact from replacement is twice that from insertion or deletion.

Table 5: Standard Classification Accuracy (%)
Model           DKT    SAKT   SAINT
Assistments15   74.2   74.6   74.4
Assistments17   72.1   71.0   71.8
STATICS         81.4   81.2   81.1
Spanish         81.9   82.6   82.0
EdNet-small     68.2   70.2   69.8
EdNet-medium    72.5   72.6   72.4
EdNet           73.5   74.1   73.9
Average         74.8   75.2   75.1

4.4 Results: Behavioral Testing

4.4.1 Perturbation Tests

Figure 1: Perturbation Test: Average Impact on Model Prediction from Insertion.
Figure 2: Perturbation Test: Maximum Degree of Impact on Model Prediction from Insertion.

• Robustness: From Figure 1, we observe that the degree of average impact from the perturbation gradually decreases along the time-steps in general, and that the average impact is limited to only about 2%. Therefore, the desired robustness holds in terms of average impact.

• Monotonicity: From Figure 1, the average impact from a positive/negative perturbation tends to remain positive/negative, respectively, for the Assistments15, EdNet-small, EdNet-medium, and EdNet datasets. On the Spanish dataset, this trend is more noisy for the SAINT network.

• On the Assistments17 and Statics datasets, the expected monotonic behavior from SAKT and SAINT is not observed, as the average impact oscillates across zero. This can also be seen from DKT's significantly higher pass rates on the two datasets in Table 6.

4.4.2 Continuity Test
We report the average and maximum step-wise change in KS score over students in Table 9. Apart from the single-step change, we also measure the final change of score from the first time-step to the last, and the total range of score explored throughout the time-steps, averaged over all students, in Table 10. The sum of absolute changes in KS coordinates, i.e. the Manhattan distance between KSs along time-steps (averaged over all students), is shown in Figure 3. EdNet-medium was omitted due to its similarity with the plot for EdNet-small.

• Except for Assistments17 and Statics, the average score change per single time-step or interaction remains reasonably low, below 5%, for all architectures. This suggests that the knowledge state is fairly stable across the time-steps.

• From Figure 2, we note that there exist questions which persist in responding to a larger degree (even up to 80%) after 40 time-steps.
Figure 3: Continuity Test: Average Change in KS along Time-steps

Table 9: Continuity Test: Average/Maximum Student Score Change (%) per Single Time Step
Model         Assist15  Assist17  STATICS  Spanish  EdNet-small  EdNet-med  EdNet  Average
DKT    Avg        2.45     10.48     6.39     2.93         2.04       2.14   2.27     4.10
DKT    Max       16.82     84.97    65.09    20.33        16.98      17.06  33.68    36.42
SAKT   Avg        1.97      3.92     1.15     1.68         1.29       1.19   1.48     1.81
SAKT   Max       15.65     56.60    20.24    52.92        21.46      19.21  22.53    29.80
SAINT  Avg        4.64      7.99     2.73     3.48         2.02       1.92   1.29     3.44
SAINT  Max       31.80     53.01    24.18    38.74        16.17      22.12  17.19    29.03

Table 10: Continuity Test: Average Student Score Total Change (%) / Total Range (%) over 100 Interactions
Model         Assist15  Assist17  STATICS  Spanish  EdNet-small  EdNet-med  EdNet  Average
DKT    Diff      10.14     11.50     9.94    18.30         9.23      12.82  10.83    11.82
DKT    Range     24.72     63.42    47.54    46.54        22.89      27.41  28.37    37.27
SAKT   Diff      15.96     13.97    15.67    18.36        10.48      13.00   8.95    13.77
SAKT   Range     30.98     40.16    22.08    53.33        23.70      23.80  20.69    30.68
SAINT  Diff      13.34     11.62     6.97    20.53        11.32       5.30   9.01    11.15
SAINT  Range     37.98     49.96    26.19    51.51        22.98      24.21  20.59    33.35

• On Assistments17 and Statics, we observe significantly larger changes, especially for DKT: DKT's maximum score change across a single time-step is as high as 85% and 65% on Assistments17 and Statics, respectively.

• In general, we observe a decreasing marginal impact of each interaction as time proceeds.

• From Figure 3 and Table 9, we note that SAKT's knowledge state changes significantly less per step than that of the other models, consistently across all datasets. We also investigated whether this 'speed' of change affects the total 'dislocation' of the knowledge state in Table 10. Interestingly, SAKT's knowledge state moved the farthest on average (13.77%) while the range it explored was the smallest on average (30.68%). This suggests that SAKT's knowledge state evolution was the least volatile.

4.4.3 Initial Knowledge State Test

To assess the validity of the initial knowledge states embedded in the model, we measured the correlation between the predicted prior knowledge state and the global question difficulty, as described in Section 3.2.3.
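A metric of this kind can be sketched as follows. This is a hedged illustration: the paper's exact correlation statistic is defined in its Section 3.2.3, and the Pearson correlation via `numpy.corrcoef` is used here only as an assumption; `initial_ks_correlation` and the toy arrays are illustrative names and values.

```python
import numpy as np

def initial_ks_correlation(initial_preds, correct_rates):
    """Correlate a model's first-interaction predictions with global
    question difficulty.

    initial_preds: model prediction for each question when it appears
        as a student's very first interaction (no history yet).
    correct_rates: empirical fraction of students answering each
        question correctly (global difficulty).
    Returns the Pearson correlation in percent (an assumed choice of
    statistic for this sketch).
    """
    r = np.corrcoef(initial_preds, correct_rates)[0, 1]
    return 100.0 * r

# Toy example: initial predictions that track difficulty closely
preds = np.array([0.90, 0.70, 0.55, 0.30])
rates = np.array([0.85, 0.75, 0.50, 0.35])
print(round(initial_ks_correlation(preds, rates), 1))
```

A high positive value, as in Table 11, indicates that the model's prior (history-free) predictions already reflect how hard each question is for the population.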
The results are presented in Table 11. In the scatter plot of Figure 4, we choose the 200 most frequently answered questions from each dataset to show how the initial model predictions and question correctness rates are distributed and correlated.

• We observe from Table 11 that all models' initial knowledge states are positively correlated with the global question difficulty, with statistical significance.

• The difference in correlation metrics among datasets is much larger than that among models.

• Based on the three scatter plots in the first row of the figure, we note that the correlation becomes stronger as the dataset grows from EdNet-small to the full EdNet data. Tables 11 and 2 also suggest that the number of interactions per unique question is positively correlated with the initial knowledge state test metric.

• From the scatter plot, we see that the three models occupy slightly different clustering regions. For instance, on the EdNet-medium and EdNet datasets, SAINT's initial prediction values are consistently larger than those of the other two models, which suggests ensembling the models to reduce bias.

Table 11: Initial Knowledge State Test: Correlation (%)
Model           DKT    SAKT   SAINT
Assistments15   85.6   84.8   82.5
Assistments17   63.2   52.2   58.2
STATICS         56.1   58.1   54.6
Spanish         56.5   49.1   44.0
EdNet-small     39.0   38.5   31.6
EdNet-medium    75.1   74.2   74.4
EdNet           87.6   86.5   88.2
Average         66.1   63.3   61.9

4.4.4 Convergence Test

As the dataset we generate and use for the convergence test is extremely simple, as described in Section 3.2.4, we expect the KT models' standard performance metrics to increase quickly along the time-steps. For instance, at the fifth time-step the model has already received four equivalent interaction records with the same question and the same correctness label for the virtual student. We report the model AUC at time-steps 5, 10, and 50 in Table 12. In Figure 5 we also visualize the model's average response across different questions for each assumed correctness label value. We expect the average response plot to converge quickly to either 1 or 0, depending on the assumed correctness label.

• In general, all models show fairly high performance from the early time-step of 5, except on the Assistments17 dataset.

• Both Table 12 and Figure 5 suggest that SAKT consistently achieves the fastest convergence to a value reasonably close to either 1 or 0. DKT, however, consistently converges to a value farther from the two edges than the other two models. In particular, for the incorrect case (second row of Figure 5), we observe that DKT converges to a value higher than 50% (red dotted horizontal line). For the positive case (first row), DKT converges to a correctness probability level of only around 70% for Assistments17, EdNet-small, and EdNet-medium.

• DKT's convergence pattern is fairly monotonic, while SAKT's and SAINT's patterns go through fluctuations which likely pertain to noise.

• It is noteworthy that increasing the dataset size from EdNet-small to EdNet-medium and EdNet significantly helps all three models' convergence behavior for both target correctness values, especially for DKT: DKT's convergence value moved significantly closer to the desired values of 1 and 0. For SAKT and SAINT, the larger dataset size led to a more stable response plot, reducing the degree of fluctuation.

• Convergence in the incorrect case and the correct case is asymmetric. While the latter closely achieves the target value of 1, the former converges around the 30% level in most datasets. We attribute this to the tutorial content embedded in each interaction, along with the question item used for assessment in the dataset.

4.5 Overview of Experimental Results

Based on the proposed DLKT validation framework, we conducted a comprehensive investigation of three popular DLKT models on seven benchmark datasets to scrutinize the models' behavioral characteristics. The results highlight strengths and weaknesses of the three DLKT models. Although the DLKT models demonstrated stable and robust behaviors in line with expectations on most datasets, the results revealed a few major disadvantages for each model: DKT showed better stability in the perturbation tests, while the other architectures occasionally presented volatile fluctuations in the response curve. In the continuity test, SAKT presented a significantly smoother evolution of the knowledge state, but the other models' knowledge state representations were seemingly volatile on a few datasets, which strongly precludes DLKT's adoption. On the other hand, this suggests room for improving DLKT models based on the specific issues pinpointed by this framework. For instance, the volatility of the KS could be alleviated by direct regularization of the change in the KS. The results from the convergence test showed that DKT was fragile even to simple edge-case data, which undermines the generalization capability of DKT as compared to the attention-based architectures.

These behavioral characteristics identified by the proposed framework show that the two popular architectural paradigms, RNN-based and attention-based, possess different strengths and weaknesses in the KT environment. This also hints that an architectural combination or an ensemble approach might alleviate the identified issues and improve both standard KT model evaluation metrics and behavioral characteristics.

5. CONCLUSION

In this work, we introduced desired properties of knowledge tracing models and proposed a novel model validation framework for Deep Learning based Knowledge Tracing (DLKT) models. Using the framework, we conducted a comprehensive analysis of three popular DLKT models' behavioral characteristics and identified the strengths and weaknesses of the models on seven different benchmark datasets. We believe that the analysis of both strengths and weaknesses diagnosed by the framework can serve as a useful guideline for model enhancement. Based on the findings from the proposed framework, a customized adoption of DLKT models fitting the nature of the data and the desired behaviors, as well as accuracy, also becomes possible.

We believe potential future work includes: (1) tackling the weaknesses of DLKT models identified in this work via architectural modification or model combination, (2) exploring the benefit of data augmentation using virtual edge-case data similar to the converging interaction data used in the convergence test, and (3) extending the proposed testing framework beyond the task of knowledge tracing (i.e. student score prediction and item recommendation).
Figure 4: Initial Knowledge State Test Scatter Plot

Figure 5: Convergence Test: Average Model Prediction along Converging Interaction Sequence

Table 12: Convergence Test AUC
Step  Model  Assist15  Assist17  STATICS  Spanish  EdNet-small  EdNet-med  EdNet  Average
 5    DKT       0.867     0.601    0.829    0.688        0.622      0.790  0.845    0.749
 5    SAKT      0.885     0.615    0.903    0.805        0.869      0.853  0.861    0.827
 5    SAINT     0.866     0.609    0.936    0.731        0.628      0.881  0.854    0.786
10    DKT       0.932     0.634    0.924    0.745        0.679      0.874  0.932    0.817
10    SAKT      0.961     0.782    0.928    0.923        0.953      0.928  0.951    0.918
10    SAINT     0.947     0.763    0.979    0.856        0.757      0.958  0.954    0.888
50    DKT       0.979     0.695    0.983    0.791        0.744      0.954  0.990    0.876
50    SAKT      0.998     0.938    0.942    0.993        0.997      0.994  0.995    0.980
50    SAINT     0.995     0.939    0.999    0.977        0.963      0.997  0.997    0.981
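Converging interaction sequences of the kind evaluated in Table 12 can be generated along these lines. This is a minimal sketch assuming an interaction is a (question_id, correctness) pair; the function names and default length are illustrative, not the paper's code.

```python
def make_converging_sequence(question_id, label, length=50):
    """One virtual student for the convergence test: the same question
    answered with the same correctness label at every time-step (a
    sketch of the simple edge-case data described in Section 3.2.4)."""
    return [(question_id, label) for _ in range(length)]

def convergence_dataset(question_ids, length=50):
    """One all-correct and one all-incorrect virtual student per
    question. A well-behaved KT model's prediction should converge
    toward 1 and 0, respectively, along each sequence."""
    return [
        make_converging_sequence(q, label, length)
        for q in question_ids
        for label in (1, 0)
    ]

data = convergence_dataset(question_ids=[101, 102], length=5)
print(len(data), len(data[0]))  # prints: 4 5
```

Feeding such sequences to a trained model and scoring its predictions at time-steps 5, 10, and 50 against the assumed labels yields convergence AUCs like those in Table 12.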
