Early Prediction of Conceptual Understanding in Interactive Simulations

Jade Cock, Mirko Marras, Christian Giang, Tanja Käser (EPFL)
jade.cock@epfl.ch, mirko.marras@epfl.ch, christian.giang@epfl.ch, tanja.kaeser@epfl.ch

ABSTRACT
Interactive simulations allow students to independently explore scientific phenomena and ideally infer the underlying principles through their exploration. Effectively using such environments is challenging for many students, and adaptive guidance therefore has the potential to improve student learning. Providing effective support is, however, also a challenge, because it is not clear what effective inquiry in such environments looks like. Previous research in this area has mostly focused on grouping students with similar strategies or identifying learning strategies through sequence mining. In this paper, we investigate features and models for an early prediction of conceptual understanding based on clickstream data of students using an interactive Physics simulation. To this end, we measure students' conceptual understanding through a task they need to solve through their exploration. Then, we propose a novel pipeline to transform clickstream data into predictive features, using latent feature representations and interaction frequency vectors for different components of the environment. Our results on interaction data from 192 undergraduate students show that the proposed approach is able to detect struggling students early on.

Keywords
skip-grams, early classification, interactive simulations, conceptual understanding

1. INTRODUCTION
Over the last years, interactive simulations have been increasingly used for science education (e.g., the PhET simulations alone are used over 45M times a year [1]). Interactive simulations allow students to engage in inquiry-based learning: they can design experiments, take measurements, and test their hypotheses. Ideally, students discover the principles and models of the underlying domain through their own exploration [2], but students often struggle to learn effectively in such environments [3, 4, 5]. A possible reason for this is that interactive simulations are usually complex and unstructured environments allowing students to choose their own action path [6]. Providing adaptive guidance to students therefore has the potential to improve learning outcomes.

Implementing effective support in interactive learning environments is a challenge in itself: the complexity of the environment makes it difficult to define a priori what successful student behaviour looks like. Previous research has focused on leveraging sequence mining and clustering techniques to identify the key features of successful interactions. For example, [7] used an information-theoretic sequence mining approach to detect differences in the interaction sequences of students with high and low prior knowledge, while [8] investigated the effects of prior knowledge activation. Other work [9] focused on detecting behaviours leading to the design of a correct causal explanation. [10] identified key factors for successful inquiry: focusing on an unknown component and building contrastive cases. Similarly, [11] found that the identification of the dependent variable and its isolated manipulation leads to a better quantitative understanding of the phenomena at hand. Another technique is to manually categorise students' log data and use the tags as ground truth for a classifier of successful inquiry behaviour [12]. [13] developed a dashboard displaying information about the mined sequences to guide teachers in building their lessons.

More work has focused on analysing and predicting students' strategies in different types of open-ended learning environments (OELEs), such as educational games. Prior research in that domain has, for example, investigated students' problem-solving behaviour [14], analysed the effect of scaffolding on students' motivation [15], extracted strategic moves from video learning games [16], detected different types of confusion [17], or identified students' exploration strategies [18].

Most of the previous work on OELEs has performed a posteriori analyses. However, in order to provide students with support in real time, we need to be able to detect struggling students early on. Due to the lack of clearly defined student trajectories and underlying skills, building a model of students' learning in OELEs is challenging. A promising approach for early prediction in OELEs is the use of a clustering-classification framework [19]: in the (first) offline step, students are clustered based on their interaction data and the clustering solution is interpreted. The second step is online: students are assigned to clusters in real time.

Jade Cock, Mirko Marras, Christian Giang and Tanja Käser. "Early Prediction of Conceptual Understanding in Interactive Simulations". 2021. In: Proceedings of The 14th International Conference on Educational Data Mining (EDM 21). International Educational Data Mining Society, 161-171. https://educationaldatamining.org/edm2021/ EDM '21, June 29 - July 02 2021, Paris, France.

Figure 1: User interface of the PhET Capacitor Lab with different plate parameters for a closed (A) and an open circuit (B). The two initial closed circuit configurations and the resulting four open circuit configurations presented to the participants in the capacitor ranking task (C). (Simulation image by PhET Interactive Simulations, University of Colorado Boulder, licensed under CC-BY 4.0, https://phet.colorado.edu).
This framework has been successfully applied to analyse and predict students' trajectories in mathematics learning [20], to differentiate between 'high' and 'low' learners [21], to build student models for interactive simulations [22], or to predict students' exploration strategies in an educational game [23].

In this paper, we aim at early prediction of conceptual understanding based on students' log data from an interactive Physics simulation. All our analyses are based on data collected from 192 undergraduate Physics students interacting with a PhET simulation. We propose a novel pipeline for transforming clickstream data into predictive features using latent feature representations and frequency vectors. Then, we extensively evaluate and compare various combinations of predictive algorithms and features on different classification tasks. In contrast to previous work using unsupervised clustering to obtain student profiles [21, 20, 23, 22], our learning activity with the simulation includes a task specifically designed to assess students' conceptual understanding.

With our analyses, we address three research questions: 1) Can students' interaction data be associated with the gained conceptual understanding? 2) Can conceptual understanding be inferred through sequence mining methods with embeddings? 3) Can the proposed methods be used for early prediction of students' conceptual understanding based on partial sequences of interaction data?

Our results show that all tested models are able to predict students' conceptual knowledge with a high AUC when observing students' full sequences (offline). The best models are also able to detect struggling students early on and to provide a more fine-grained prediction of students' conceptual knowledge later during the interaction.

2. CONTEXT AND DATA
All experiments and evaluations of this paper were conducted using data from students exploring an interactive simulation. In the following, we describe the learning activity, the data collection, and the categorisation of students' conceptual understanding at the end of the learning activity.

Learning Activity. The data for this work was collected in a user study where participants were asked to engage in an inquiry-based learning activity with the PhET Capacitor Lab simulation (https://phet.colorado.edu/en/simulation/capacitor-lab-basics). The Capacitor Lab is an interactive simulation with a simple and intuitive interface allowing users to explore the principles behind a plate capacitor (Fig. 1A and B). Specifically, students can load the capacitor by adjusting the battery and observe how the capacitance and the stored energy of the capacitor change when adjusting the voltage, the area of the capacitor plates, or the distance between them. After loading the capacitor, the circuit can be opened through a switch, and students can again observe how manipulation of the different components influences capacitance and stored energy. Moreover, the simulation provides a voltmeter, while check boxes in the interface allow users to enable or disable visualisations of specific measures.

Based on this simulation, a learning activity was designed in which participants had to explore the relationships between the different components of the circuit and rank four different capacitor configurations by the amount of stored energy. The configurations were generated based on two initial setups (I and II, respectively) representing capacitors in a closed circuit with different settings for battery voltage, plate area, and separation (Fig. 1C). For each initial setup, two open circuit configurations were generated (i.e. configurations 1 & 2 from setup I and 3 & 4 from setup II) by opening the switch and then changing the values for plate area and separation. To complete this ranking task, participants were allowed to use the simulation for as much time as they needed. It should be noted that the values in the ranking task were chosen outside the ranges of the adjustable values in the simulation, such that students could not simply reproduce the four configurations in the simulation, but had to solve the task by figuring out the relationships between the different components and the stored energy.

Data Collection. Data was collected from 214 first-year undergraduate Physics students who completed the capacitor ranking task as part of a homework assignment. While working with the simulation, students' interaction traces (i.e. clicks on check boxes, dragging of components, moving of sliders) were automatically logged by the environment. Moreover, it recorded students' final answers in the ranking task (i.e. the ranking of the four configurations). All data was collected in a completely anonymous way and the study was approved by the responsible institutional review board prior to the data collection (HREC number: 050-2020/05.08.2020). After a first screening of the data, several log files (9) were excluded because of inconsistencies in the
data, and another 13 because they had barely any interaction (less than 10 clicks) with the environment. Removing these data points resulted in a data set of 192 students used for our analyses.

Categorisation of Conceptual Understanding. The design of the ranking task allows relating students' responses to their conceptual understanding of a capacitor. For this purpose, we analysed the 16 (out of 24 possible) rankings submitted by the students with regard to conceptual understanding and grouped them accordingly. To this end, three concepts of understanding associated with the functioning of capacitors were evaluated in a top-down approach: answers were first separated into those representing an understanding of both the open and closed circuit (label both) and those only representing an understanding of the closed circuit (label closed). For those answers representing an understanding of both the open and closed circuit, two cases were distinguished. It was assumed that students who chose the only correct ranking of the configurations ("4213") gained an exhaustive understanding of the underlying concepts (label correct). Students who instead chose one of the other rankings were assumed to know how plate area and separation influence the stored energy in both the open and closed circuits, but failed to discover the influence of voltage on stored energy (label areasep). Within those answers that only represented an understanding of the capacitor's functioning in the closed circuit, we also distinguished between two cases. The first case represents the answer that would be considered correct if the task was to order the four configurations by capacitance instead of energy ("1324", label capacitance). Interestingly, 47 students (i.e. 24% of all students) submitted this ranking as an answer. The second case represents all other possible answers (label other) that could be submitted if (a part of) the closed circuit was understood.

Based on these three underlying concepts, we generated a decision tree with four leaves (each representing a group with similar conceptual understanding) and mapped all 16 rankings submitted by the students to the leaves (Fig. 2). These generated class labels will serve as ground truth labels for the classification task presented in the following sections.

Figure 2: Tree used to map the 16 different rankings submitted by the students to class labels associated with conceptual understanding of a capacitor. The different class labels are indicated in capitalised letters. The numbers in parentheses indicate the number of submissions for each ranking.

3. METHOD
Using our proposed approach, we are interested in predicting the conceptual understanding students gain from interacting with the simulation. Therefore, we are solving a supervised classification problem, i.e. we aim at predicting the class labels (representing students' conceptual understanding) based on the observed student interactions. Our model building process to solve this classification problem consists of four steps (Fig. 3). We first extract the raw clickstream events from the logs and process them into action sequences. We then compute three different types of features for each action sequence and feed them into our classifiers.

Figure 3: Schematic overview of the classification pipeline. Raw clickstream events are extracted from log data and processed into an action sequence per user. Three types of features are computed for each action sequence. Classification is performed using Random Forests (RF) and Neural Networks (NN).

Event Logs. From the simulation logs, we extract the clickstream data of each student s as follows: anything between a mouse click/press and a mouse release qualifies as an event, while anything between a mouse release and a mouse click/press is called a break. Each event is then labelled by the component the user was interacting with at the mouse click, and chronologically arranged with the breaks into a sequence.

Action Processing. We distinguish three main components on the platform whose values can be changed: 1) the voltage, 2) the separation between the plates, and 3) the area of the plates. An action on these components can be conducted in a) an open circuit or b) a closed circuit, with the stored energy information display i) on or ii) off. We categorise each event involving these main actions by the combination of: the action on the component {1), 2), 3)}, the circuit state {a), b)}, and the stored energy display {i), ii)}. Any other event is categorised as 4) other.

The sequence of each student s is now composed of chronologically ordered events (divided into the 13 different categories listed above) separated by breaks.
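The 13-way event categorisation above (3 components x 2 circuit states x 2 display states, plus a catch-all) can be sketched as follows; the record fields and label format are our own illustration, not the authors' logging schema:

```python
from itertools import product

# Hypothetical field values; the actual log schema is not specified in the paper.
COMPONENTS = ("voltage", "separation", "area")
CIRCUIT_STATES = ("open", "closed")
ENERGY_DISPLAY = ("on", "off")

# 3 * 2 * 2 = 12 component categories, plus one catch-all = 13 event categories.
CATEGORIES = [
    f"{c}|{s}|{e}" for c, s, e in product(COMPONENTS, CIRCUIT_STATES, ENERGY_DISPLAY)
] + ["other"]

def categorise(event: dict) -> str:
    """Map a raw click event to one of the 13 event categories."""
    if event["component"] not in COMPONENTS:
        return "other"
    return f'{event["component"]}|{event["circuit"]}|{event["energy_display"]}'

events = [
    {"component": "voltage", "circuit": "closed", "energy_display": "on"},
    {"component": "voltmeter", "circuit": "open", "energy_display": "off"},
]
labels = [categorise(e) for e in events]
# labels == ["voltage|closed|on", "other"]; len(CATEGORIES) == 13
```

Adding the four break categories described below (break x circuit state x display state) then yields the 17 categories used for the raw sequences.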
The breaks may be caused by the student being inactive due to observing the progression of a value, reflecting about an observation, or taking notes. Due to its definition as the period between a mouse release and a mouse click/press, a break may also appear for logistic reasons, such as moving the mouse from one component in the simulation to another. Indeed, the students' event sequences consist of many short breaks (Fig. 4). Like stop words in natural texts, our assumption is that these very short breaks, though very frequent in our sequences, do not contain much information. In fact, our classification of students' understanding may be impaired if those noisy states are not removed, as is the case for sentiment analysis when stop words are not deleted [24]. To determine the threshold at which breaks are removed, we plot the distribution of inactivity periods and cut at the elbow of the curve for each student, which corresponds to a delimitation at 60%, i.e. for each student we keep the top 40% of breaks. We then categorise each of our remaining breaks similarly to our main action events: by component 5) break, circuit state {a), b)}, and stored energy display {i), ii)}, resulting in four different break categories.

Figure 4: Distribution of break lengths in seconds, across all students. The bold gray line denotes the 60% threshold.

The resulting sequence r_s for student s is the chronological timeline of the student's events and breaks, divided into 17 categories. We refer to this timeline r_s as the raw sequence of interactions for the rest of the paper. We denote the length of r_s (corresponding to the total number of interactions of student s) with N_s, i.e. |r_s| = N_s. On average, these sequences have a length of N_s = 67.86 ± 42.56.
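A minimal sketch of the break-filtering step, assuming a per-student list of break durations; the sample values are illustrative, and the simple rank-based cut stands in for the elbow criterion described above:

```python
# Hypothetical break durations in seconds for one student (illustrative values).
breaks = [0.2, 0.3, 0.4, 0.5, 1.0, 2.5, 4.0, 8.0, 15.0, 40.0]

def filter_breaks(durations, cut=0.60):
    """Drop the shortest `cut` fraction of a student's breaks, keep the rest."""
    ranked = sorted(durations)
    threshold = ranked[int(cut * len(ranked))]  # first value inside the kept 40%
    return [d for d in durations if d >= threshold]

kept = filter_breaks(breaks)
# kept == [4.0, 8.0, 15.0, 40.0], i.e. the top 40% longest breaks
```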
In terms of seconds, the sequences r_s lasted on average 512.18 ± 435.57 seconds. We also introduce the notion of time, which we define relative to a student's interactions: at time t, the length of the raw sequence of interactions for student s is t, i.e. |r_{s,t}| = t. We denote the maximum time of student s with T_s, corresponding to the full sequence r_s. Figure 5 visualises the timelines for an exemplary student of each class label. It can be observed that for these examples, certain aspects of conceptual understanding could be inferred by visual inspection (e.g. the capacitance student never activated the check box to visualise the stored energy). However, other differences in conceptual understanding are more difficult to detect by humans (e.g. the differences between the correct and areasep students).

Figure 5: Timeline of an exemplary student for each class label, displaying the chronological sequence of interactions with the three main components of the simulation. The green and orange bars indicate whether the student displayed the capacitance and stored energy. The background indicates whether the interactions were conducted in a closed (grey) or open (white) circuit.

Feature Creation. Next, we transform the interactions in each sequence to obtain three different types of features: Action Counts, Action Span, and Pairwise Embeddings.

To obtain the Action Counts features F_{AC,s} for a student s, we first transform each interaction within the raw sequence r_s into a one-hot encoded vector, resulting in a 17-dimensional vector h_{s,i} for each interaction i and hence a sequence of vectors H_s = {h_{s,i}} with i = 1, ..., N_s. To compute f_{AC,s,t} for student s at time step t, we compute the average over h_{s,i} with i = 1, ..., t: f_{AC,s,t} = (1/t) Σ_{i=1..t} h_{s,i}. Through this aggregation, our features translate to the (averaged) number of times each student has interacted in each of our categories. Therefore, for each student s, we end up with a feature set F_{AC,s} = {f_{AC,s,t}} with t = 1, ..., T_s.
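The Action Counts feature f_{AC,s,t}, the mean of the first t one-hot vectors, can be sketched as follows; the toy category indices are our own illustration:

```python
# Minimal sketch of the Action Counts feature: average of one-hot encoded
# interactions up to time step t (17 categories, as in the text).
N_CATEGORIES = 17

def one_hot(index, dim=N_CATEGORIES):
    v = [0.0] * dim
    v[index] = 1.0
    return v

def action_counts(sequence, t):
    """f_AC for one student at time t: mean of the first t one-hot vectors."""
    vectors = [one_hot(c) for c in sequence[:t]]
    return [sum(col) / t for col in zip(*vectors)]

seq = [0, 0, 3, 16]          # category indices of a toy raw sequence
f = action_counts(seq, t=4)
# f[0] == 0.5: category 0 occurred in 2 of the first 4 interactions
```

The Action Span feature described next follows the same aggregation, except that each vector carries the interaction's duration instead of a 1, followed by a normalisation step.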
The computation of the Action Span features F_{AS,s} for a student s is very similar. Rather than looking at the number of times a student s has interacted with each of our components in a particular state, we look at the amount of time (in seconds) s has spent in each of the categories. We first transform every interaction i within r_s into a 17-dimensional vector h_{s,i}: this vector is 0 for all dimensions but the dimension d corresponding to the category the interaction i belongs to. This representation is similar to a one-hot encoded vector, but instead of filling in a 1 at dimension d, we fill in the duration (in seconds) of interaction i. This results in a sequence of vectors H_s = {h_{s,i}} with i = 1, ..., N_s. We then compute the feature vector f_{AS,s,t} for student s at time t in two steps. In a first step, we again average over all vectors h_{s,i} up to time t, leading to f̃_{AS,s,t} = (1/t) Σ_{i=1..t} h_{s,i}. We then normalise f̃_{AS,s,t} to obtain f_{AS,s,t}. Through this aggregation, our feature vector f_{AS,s,t} represents the relative amount of time student s has spent in each category up to time t. For each student s, we end up with a feature set F_{AS,s} = {f_{AS,s,t}} with t = 1, ..., T_s.

The third feature type, Pairwise Embeddings, is fundamentally different from the two other features: we replace each interaction in the raw sequence by an embedding vector, which we obtain by training a pairwise skip-gram [25]. The architecture of such a network consists of two dense layers: an embedding layer followed by a classification layer (Fig. 6). Usually applied in natural language processing (NLP), its primary goal is to predict the context of a word. Here, our pairwise skip-gram attempts to predict the behaviour of a student in the simulation before and after performing a specific interaction. The skip-gram model can be formulated as:

p = softmax(W_2 · (W_1 · a))    (1)

It takes a, an interaction we wish to predict the context of, as an input, and outputs p, a probability vector which contains the likelihood of a being surrounded by each possible interaction. W_1 and W_2 represent the weight matrices (embeddings). For each interaction a, we feed 2·w pairs into the network, where w is the so-called window size of the model (context). The first element of each pair is a. The second element of each pair, the ground truth label, is one of the w interactions preceding or following a. For example, a window size of w = 2 would yield the following pairs for action a: (a, a_{-2}), (a, a_{-1}), (a, a_{+1}), (a, a_{+2}).

Figure 6: Skip-gram architecture. After training of the skip-gram, the corresponding row of the weight matrix W_1 represents a structure-preserving embedding of action a.
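The construction of the skip-gram training pairs with window size w = 2 can be sketched as follows; the toy sequence uses category symbols in place of the one-hot vectors used in the paper:

```python
# Sketch of skip-gram training-pair construction, following the
# (a, a-2), (a, a-1), (a, a+1), (a, a+2) example for w = 2.
def skipgram_pairs(sequence, w=2):
    pairs = []
    for i, a in enumerate(sequence):
        for j in range(-w, w + 1):
            # Skip the centre itself; actions near the sequence boundary
            # simply yield fewer than 2*w pairs.
            if j != 0 and 0 <= i + j < len(sequence):
                pairs.append((a, sequence[i + j]))
    return pairs

seq = ["A", "B", "C", "D"]
pairs = skipgram_pairs(seq)
# For the centre action "C": ("C", "A"), ("C", "B"), ("C", "D")
```

Each pair would then be fed to the two-layer network of Eq. (1), with the first element as input and the second as the ground truth label.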
In our case, to obtain the set of pairs I_s for a student s, we first again transform each interaction within the raw sequence r_s into a one-hot encoded vector, resulting in a 17-dimensional vector h_{s,i} for each interaction i and hence a sequence of vectors H_s = {h_{s,i}} with i = 1, ..., N_s. We then build the input pairs for each interaction i, i.e. I_{i,s} = {(h_{s,i}, h_{s,j})} with j ∈ {-w, ..., w} \ {0}. We obtain the set of pairs for all students as I = {I_s} with s = 1, ..., S, where S is the total number of students.

After training the skip-gram model on I, we build the features F_{PW,s} for each student s. The weights W_1 of the hidden layer represent a structure-preserving embedding of the interactions. In our case, W_1 has dimensions 17 × D (D is the embedding dimension). For each interaction i of student s, we get the corresponding row r_{s,i} in W_1, i.e. r_{s,i} = W_1 · h_{s,i}. This again results in a sequence of vectors R_s = {r_{s,i}} with i = 1, ..., N_s. To compute f_{PW,s,t} for student s at time step t, we compute the average over r_{s,i} with i = 1, ..., t: f_{PW,s,t} = (1/t) Σ_{i=1..t} r_{s,i}. Therefore, for each student s, we end up with a feature set F_{PW,s} = {f_{PW,s,t}} with t = 1, ..., T_s.

Classification. To perform the classification task, we explore two different approaches: Random Forests and Fully Connected Deep Neural Networks.

Random Forests (RFs) are simple yet powerful machine learning algorithms. They consist of an ensemble of decision trees, each trained on a different subset of samples and a different subset of features. The decisions of each tree are then aggregated to determine the final prediction for a sample. The strength of this method is that overfitting is prevented through the randomisation of training samples and features during the training of each tree, and that the strengths of several good classifiers are exploited. While RF classifiers are well tested and efficient to train, they require the input features to have the same dimension for every sample. We therefore train separate RF models for each time step t. The input features for the RF model for time step t are {f_{M,s,t}}, with M ∈ {AC, AS, PW} and s = 1, ..., S, where S denotes the number of students. The output of the RF is a vector p_{RF,M,s,t} of dimension C (with C denoting the number of classes) for each student s, which represents the probability of each class.

Neural Networks (NNs) were built with the idea of emulating neurons firing in our brain: their nodes are to the neurons what their edges are to the axons. The advantage of those deep networks is that they are able to model non-linear decision boundaries. However, the backpropagation calculations make them relatively slow to train. In this work, we use a Fully Connected Deep Neural Network consisting of d hidden dense layers and one classification layer with a softmax activation. Similar to RFs, our NN model requires features to have the same dimension for all samples. We therefore also train the NN models for fixed points in time. The input features for the NN model for time step t are {f_{M,s,t}}, with M ∈ {AC, AS, PW} and s = 1, ..., S, where S denotes the number of students. Due to the softmax activation, the output of the NN is a vector p_{NN,M,s,t} of dimension C (with C denoting the number of classes) for each student s, which represents the probability of each class.

4. EXPERIMENTAL EVALUATION
We evaluated the predictive performance of our classification pipeline on the collected data set (see Section 2). We conducted experiments to compare the performance of different model and feature combinations for the students' full data sequences as well as for early classification using partial sequences of the data, to answer our research questions.

Experimental Setup. We applied a train-test setting for all the experiments, i.e. parameters were fitted on a training data set and the performance of the methods was evaluated on a test data set. Predictive performance was evaluated using the macro-averaged area under the ROC curve (AUC). We used the AUC as a performance measure as it is robust to class imbalance.

Figure 7: AUC for the 4-class, 3-class and 2-class cases using different model and feature combinations. Predictions are made at the end of the interaction with the simulation, i.e. based on the complete sequential interaction data of the students.

For the RF models, we tuned the following hyperparameters using a grid search: number of trees [5, 7, 9], number of features used at each decision level ['auto', 'all'], number of samples [bootstrap resampling of training size samples and balanced subsamples]. NN models were implemented using the scikit-learn library, trained for 300 epochs, and optimised for the log-loss function with the following hyperparameters: learning rate ['adaptive', 'invscaling'], initial learning rate [0.01, 0.001], solver ['adam', 'sgd'], hidden layer sizes and number [(32, 16), (64, 32), (64, 32, 16), (128, 64, 32, 16)], and activation function ['relu', 'tanh', 'identity'].

The skip-gram model providing our pairwise embedding features was implemented using the TensorFlow package. We trained the model for 150 epochs, with a window size of w = 2, a batch size of 16, and an embedding dimension of 15. We used categorical cross-entropy as the training loss. Because of its unsupervised nature, we trained the model on our whole dataset.

We performed all our experiments using different levels of detail for the classification task. As ground truth, we used the class labels presented in Fig. 2. Given the hierarchical nature of the decision tree separating the students into classes based on their conceptual knowledge, we performed the classification task focusing on three different levels of detail:

• 2-class case: starting at the root of the decision tree (see Fig. 2), we divide students into two classes based on their understanding of the circuits: both (98 students) and closed (94 students).

• 3-class case: going one step down in the hierarchy of the tree, we further divide the left branch of the tree (see Fig. 2) based on whether the students have completely understood all concepts (leading to a correct answer in our ranking task) or not. We therefore obtain three different classes: correct (38 students), areasep (60 students), and closed (94 students).

• 4-class case: here, we also split the right branch of the tree (see Fig. 2) and divide the students into two groups based on whether they ranked the configurations in the task based on capacitance, resulting in four classes: correct (38 students), areasep (60 students), capacitance (47 students), and other (47 students).
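The RF grid search described above can be sketched with scikit-learn on synthetic stand-in data; the mapping of the paper's grid onto scikit-learn parameter names ('sqrt'/None for 'auto'/'all', and class_weight for the sampling options) is our assumption, not the authors' exact configuration:

```python
# Sketch of the RF hyperparameter search on synthetic data standing in for
# the 17-dimensional feature vectors; parameter-name mapping is assumed.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

rng = np.random.default_rng(0)
X = rng.random((60, 17))          # one feature vector per student at time step t
y = rng.integers(0, 2, size=60)   # binary labels for the 2-class case

param_grid = {
    "n_estimators": [5, 7, 9],
    "max_features": ["sqrt", None],            # roughly 'auto' vs. 'all'
    "class_weight": [None, "balanced_subsample"],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3),
)
search.fit(X, y)
print(search.best_params_)
```

In the paper's nested setting, this inner search would run within each outer training fold, with the outer test fold held out for evaluation only.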
For each of those three cases, we trained two types of classifiers (RF and NN) on our three different feature types (Action Counts - AC, Action Span - AS, Pairwise Embeddings - PW), using a stratified 10-fold nested cross-validation. We kept the folds invariant across all experiments and stratified over the classes (according to the class labels of the 4-class case). Because of class imbalance, we used random oversampling for the training sets. We used a nested cross-validation to avoid potential bias introduced by estimating model performance during hyperparameter tuning. This allowed us to tune the hyperparameters within the training folds (by further splitting them) and hold out the test sets for performance evaluation alone.

Offline Classification. In a first experiment, we were interested in assessing whether it is possible to associate students' behaviour in the simulation with the conceptual understanding achieved through the learning activity. This will be referred to as an offline classification task, since we are using students' complete interaction sequences r_s. The predictive performance in terms of AUC for the three classification problems (4-class, 3-class, and 2-class case), with a distinction between the different model and feature combinations, is illustrated in Fig. 7.

The results of this first experiment showed that for the 2-class case, all combinations of models and features reached very high average performances as quantified by their AUC scores (value range: 0.95-0.97). The best mean score was achieved by the combination of NN with PW features (AUC_{NN,PW} = 0.97). However, it should be noted that the performance differences between the combinations were comparatively small. Using a one-way ANOVA, no statistically significant differences were found between the different groups (F(5,54) = 0.839, p = 0.528). It seems that for this rather rough classification into groups of students who only understood the closed circuit and those who understood both the open and closed circuits, the different combinations of models and features perform equally well and with a high predictive accuracy.

By extending the classification task to the 3-class case, the AUC scores dropped in comparison to the 2-class case (value range: 0.86-0.90). The lowest score was observed for RF with AC (AUC_{RF,AC} = 0.86), while the best performances were
Finally, when evaluating the 4-class case, the most complex classification task, the mean AUC scores further dropped for all combinations (value range: 0.78−0.84). The lowest performance was again observed for RF with AC (AUC_RF,AC = 0.78). Introducing the fourth class to the classification problem seemed to have a smaller negative impact on the AUC scores for NN with AS (AUC_NN,AS = 0.83) and NN with PW (AUC_NN,PW = 0.84), which obtain the best performances for the 4-class case. This observation was partially confirmed by a one-way ANOVA that showed a trend to statistical significance for differences between the combinations (F(5,54) = 1.989, p = 0.095): as we increase the number of classes, the p-value decreases.

The results of this first experiment show that it is possible to perform an offline prediction of students' conceptual understanding in the capacitor ranking task (see Section 2) based on the different combinations of models (NN or RF) and feature generation methods (AC, AS or PW) proposed in this work. From the entire data sequences of students' interactions with the simulation, we observed that predictive performance generally decreased when the complexity of the classification task was increased. While all combinations showed very good performances for the coarse classification of the 2-class case, AUC scores started to diverge more among combinations for the 3- and 4-class cases. Especially for the 4-class case, differences became more visible, with certain combinations showing a trend of better predictive performance (i.e. NN with PW and NN with AS) as compared to others (e.g., RF with AC).

Predicting over Time. In a second experiment, we were interested in assessing whether we could predict students' conceptual knowledge for shorter interaction sequences, i.e. when not using students' full sequences, but only the first t interactions. For all three classification cases (2-class, 3-class, and 4-class), we trained all model (RF, NN) and feature (AC, AS, PW) combinations for t = 10, 20, ..., 150 time steps. As described in Section 3, to compute the features f_AC,s,t, f_AS,s,t, and f_PW,s,t for a student s at time step t, we only use the student's interactions i up to that point in time, i.e. i = 1, ..., t. Similarly, at each time step t, the models are exclusively used to predict students whose interaction sequences contain a minimum of t elements (i.e. N_s ≥ t). For students with shorter interaction sequences (i.e. N_s < t), the last available prediction will be used. For example, for a student with 30 interactions (N_s = 30), we would make the first three predictions using the models for t = 10, t = 20, and t = 30 time steps. For the remaining time steps, the predictions from t = 30 will be carried over. We chose this approach for predicting because we assume that the student will leave the simulation after N_s interactions and it therefore does not make sense to update the prediction afterwards. Figure 8 illustrates the predictive performance in terms of AUC for the different model and feature combinations and all classification cases.

As expected, for the 2-class case, all models achieve a high performance for long interaction sequences. The AUC of all NN models is larger than 0.9, starting at time step t = 70. Generally, the difference between the model and feature combinations for t ≥ 70 is small. We also observe some model differences for earlier time steps, where the RF model with the action span features performs better than the other models. It achieves an AUC larger than 0.8 already at time step t = 30 (AUC_RF,AS = 0.82). Moreover, the NN model with action span is close to that performance at time step t = 30 (AUC_NN,AS = 0.79). Naturally, predicting at even earlier time steps is more difficult, but some of the models achieve a decent AUC of 0.7 already after observing 20 interactions with the simulation (AUC_RF,AS = 0.74, AUC_RF,AC = 0.73, AUC_NN,AS = 0.71).
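The carry-forward rule described above (query the model for each time step t while N_s ≥ t, then keep the last prediction) can be sketched as follows. The per-step model family, the action names, and the toy decision rule are invented for illustration; only the carry-forward logic reflects the procedure in the text:

```python
def predictions_over_time(seq, models, step=10, t_max=150):
    """For one student's sequence, query the model trained for time
    step t as long as t <= len(seq); afterwards carry the last
    prediction forward (the student has left the simulation)."""
    preds = {}
    last = None
    for t in range(step, t_max + 1, step):
        if t <= len(seq):
            last = models[t](seq[:t])  # prediction from the first t interactions
        preds[t] = last  # carried over once t exceeds the sequence length
    return preds

# Toy model family: predict 'both' once an 'open_circuit' action was seen.
models = {t: (lambda prefix: "both" if "open_circuit" in prefix else "closed")
          for t in range(10, 151, 10)}
student = ["build"] * 25 + ["open_circuit"] + ["measure"] * 4  # N_s = 30
p = predictions_over_time(student, models)
print(p[10], p[30], p[150])  # closed both both (t = 30 prediction carried over)
```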
For the 3-class case, the performances of all algorithms have decreased from time step t = 10 compared to the above problem. This is not surprising, as differentiating the correct students from the areasep students is not as straightforward as separating students who did not interact in the open circuit from the rest. Though in the 2-class case all performances were close to one another, we notice that for the 3-class problem at time step t = 40, three model-feature combinations take the lead (AUC_NN,PW = 0.75, AUC_RF,AS = 0.74, AUC_NN,AS = 0.74), while three fall behind (AUC_RF,PW = 0.7, AUC_RF,AC = 0.69 and AUC_NN,AC = 0.68) until t = 70, where the NN with PW outperforms all other model and feature combinations with an AUC of 0.87.

The variance across model-feature performances is larger in the 4-class case. From time step t = 20 already, RF with AS outperforms all other combinations, reaching an AUC of 0.8 at time step t = 100. Similarly to the 3-class problem, the same three models take the lead, while the others fall behind from time step t = 70 (AUC_NN,PW = 0.75, AUC_RF,AS = 0.78, AUC_NN,AS = 0.76 versus AUC_RF,PW = 0.72, AUC_RF,AC = 0.72 and AUC_NN,AC = 0.73).

Given the fact that the predictive performance in terms of AUC seems to be similar for the best performing models, we performed a more detailed analysis to assess how different the predictions of the several model and feature combinations were. In a real-world application (intervention setting), we would probably use a 2-class classifier to identify a coarse split (between classes both and closed) already at a relatively low number of time steps and use a 3-class method to provide a more detailed prediction later on. With the current model performance for the 4-class case, usage of a more detailed classifier seems not practicable. We therefore investigate the predictions of the two best models for the 3-class case: NN with PW and RF with AS. Figure 9 shows the confusion matrices for these two models for an increasing number of time steps. We do not show results for t > 100, as the predictive performance of the models does not improve much anymore for longer sequences (see also Fig. 8).

Figure 9: Evolution of confusion matrices over time steps for the 3-class case: NN with PW (top) and RF with AS (bottom).

While both models have a very similar predictive performance in terms of AUC up to time step 60, we can already see that the models evolve differently in terms of prediction. At time step t = 40, the RF model is already very accurate in detecting students from class correct (p̂(correct | c_true = correct) = 0.78), while the NN model is less confident (p̂(correct | c_true = correct) = 0.67). On the other hand, the NN model is already more accurate in identifying students from class closed (p̂(closed | c_true = closed) = 0.66), while the RF model cannot identify students from this class well (p̂(closed | c_true = closed) = 0.42). At time step t = 60, both models are almost equally accurate in identifying students from class correct (NN: p̂(correct | c_true = correct) = 0.76, RF: p̂(correct | c_true = correct) = 0.80). The NN model is still better at classifying students from the class closed. Both models have trouble with correctly identifying students from class areasep: the NN model tends to assign these students to class closed (p̂(closed | c_true = areasep) = 0.52), while the RF model is becoming better at correctly assigning them (p̂(areasep | c_true = areasep) = 0.42). These observed trends continue to get stronger with an increasing number of time steps.

At t = 100, the NN model is very accurate when it comes to classifying students from classes correct and closed. Students from class areasep have only a 35% chance of being correctly classified and a 55% chance of getting assigned to class closed. In practice, this would mean that 55% of these students would get more intervention (hints) than necessary. The RF classifier is also very accurate in detecting students from class correct, but is, however, not able to distinguish between students from class closed and class areasep. In practice, this would mean misclassified students from class closed would get less help than necessary and misclassified students from class areasep would get more help than necessary.
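The per-class probabilities quoted above are row-normalised confusion-matrix entries, p̂(predicted class | c_true). A generic sketch of this computation is given below; the class labels follow this study, but the toy predictions are invented and this is not the authors' evaluation code:

```python
from collections import Counter

def row_normalised_confusion(y_true, y_pred, classes):
    """Estimate p̂(predicted | c_true): each row corresponds to a true
    class and is normalised so that its entries sum to 1."""
    counts = Counter(zip(y_true, y_pred))
    matrix = {}
    for t in classes:
        total = sum(counts[(t, p)] for p in classes)
        matrix[t] = {p: (counts[(t, p)] / total if total else 0.0)
                     for p in classes}
    return matrix

classes = ["correct", "areasep", "closed"]
y_true = ["correct"] * 4 + ["areasep"] * 4 + ["closed"] * 2
y_pred = ["correct", "correct", "correct", "areasep",
          "closed", "closed", "areasep", "areasep",
          "closed", "closed"]
m = row_normalised_confusion(y_true, y_pred, classes)
print(m["correct"]["correct"])  # 0.75
print(m["areasep"]["closed"])   # 0.5
```

Reading a row of this matrix directly gives statements of the form "students from class areasep have a 50% chance of being assigned to class closed", as used in the discussion of Fig. 9.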
Figure 10: AUC on the 2-class, 3-class and 4-class cases for different model and feature combinations. Predictions were made for 25%, 50%, 75% and 100% of the total number of interactions for each student.

This experiment shows that we can (coarsely) classify students after observing a relatively low number of interactions. For the 2-class case, the AUC of the best model (RF with AS) is larger than 0.8 after t = 30 time steps. Naturally, the classification task is more complex for the 3-class and the 4-class cases. The best model on the 3-class case (NN with PW) achieves an AUC close to 0.8 at time step 50. The second analysis demonstrates that achieving a similar predictive performance in terms of AUC does not imply the same classification behaviour, i.e. the best models on the 3-class case (NN with PW and RF with AS) have different strengths. It has, however, one important limitation: students spend different amounts of time on the simulation and therefore, the length of their interaction sequences varies. There are, for example, students with 80 interactions and other students with only 50 interactions. Performing the classification task at time step 40 is early for a student with a total of 80 interactions.
For a student with only 50 interactions in total, the prediction at time step 40 comes almost at the end of the interaction time with the simulation.

Online (Early) Classification. The third and last experiment addresses the limitations of the previous experiment. In this experiment, we were interested in assessing the "early" predictive performances of the different models using only a part of a student's sequence. Given the fact that the length of interaction sequences varies over the students, we did not align the sequences by absolute time steps, but by percentages of interactions. Specifically, we aimed at making the prediction for a student after having seen 25%, 50%, 75%, and 100% of the interaction sequence of this student. Note that this experiment does not require a re-training of our models. We just retrieve the predictions of the models for the corresponding time step t. For our example student with a total of 80 interactions, we retrieve the predictions of the models for time steps t = 20, t = 40, t = 60, and t = 80. Figure 10 shows the AUC of the models for all classification cases, with an increasing number of interactions (in %).

As expected, all model and feature combinations perform well for the 2-class case. To achieve a high classification accuracy, we do not need to observe the full sequence of a student. For all NN models, obtaining 75% of students' interactions is enough to achieve an AUC of around 0.9. With RF, the model using the action span features also obtains an AUC of more than 0.9 at 75% of the interactions (AUC_RF,AS = 0.92). The performance of the two other feature types is slightly lower (AUC_RF,AC = 0.9, AUC_RF,PW = 0.88). Naturally, the predictive performance of all the models is lower when observing smaller parts of students' interaction sequences. If obtaining only the first 25% of students' interactions, there is more variation in the achieved AUC between models, with the best model (RF with AS) achieving an AUC of 0.66 and the worst model achieving an AUC of 0.57 (RF with PW). It is promising that the best model at 50% of the interactions exhibits an AUC of almost 0.8 (AUC_RF,AS = 0.78), which makes it a valuable candidate for a coarse early prediction and intervention, i.e. differentiating between students with a high conceptual understanding (class both) and students with a low conceptual understanding (class closed), early on.

Naturally, the performance of the models for the 3-class case is overall lower, as we are now differentiating the different levels of conceptual knowledge in a more fine-grained way. As we have seen in Fig. 9, it is difficult to differentiate between students from the left branch of the tree (i.e. correct vs. areasep in Fig. 2). Performance across models varies more for the 3-class case. When observing 75% of students' interactions, the AUC of the worst model (NN with AS) amounts to 0.78, while the best model (NN with PW) has an AUC of 0.84. We also observe that the NN with PW features is consistently the best model, regardless of the amount of observed interactions, with an increasing gap to the other models. At 50% of interactions, the gap in performance among the three best models is still small (AUC_NN,PW = 0.699, AUC_RF,AS = 0.7, AUC_RF,PW = 0.69). It gets larger at 75% (AUC_NN,PW = 0.84, AUC_RF,AS = 0.82, AUC_NN,AS = 0.78). When observing the complete sequences of the students (i.e. 100% of the interactions), all models reach an AUC of 0.85 or higher (see also Fig. 7).

Again, the performance decreases when moving to four classes, due to the increasing complexity (in terms of level of detail) of the classification task. The AUC of the best model (NN with PW) amounts to 0.83, while the worst two models reach an AUC of 0.77 (RF with AC, NN with AC). While all the models' AUC is lower than 0.7 when observing only 25% or 50% of students' interactions, interestingly there is a large gap between the best model (RF with AS) and all the other models (i.e. at 25%: AUC_RF,AS = 0.59, AUC_RF,AC = 0.56).
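Because the per-time-step models exist only for t = 10, 20, ..., 150, evaluating at a percentage of a student's sequence means mapping that fraction to one of the trained steps. The helper below is an assumption on our part, with a rounding-down choice that is consistent with the paper's example (80 interactions yield t = 20, 40, 60, 80):

```python
def step_for_fraction(n_interactions, fraction, step=10):
    """Map a fraction of a student's sequence to the nearest trained
    time step below it (models exist only for multiples of `step`).
    Illustrative sketch; the exact rounding used in the paper is
    not specified beyond its 80-interaction example."""
    t = int(n_interactions * fraction)
    return max(step, (t // step) * step)

for frac in (0.25, 0.50, 0.75, 1.00):
    print(frac, step_for_fraction(80, frac))
# 0.25 -> 20, 0.5 -> 40, 0.75 -> 60, 1.0 -> 80
```

No re-training is needed: the prediction for each percentage is simply looked up from the over-time predictions computed in the second experiment.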
With this last experiment, we assessed the capabilities of our models to make predictions as early as possible during interaction with the simulation as a basis for intervention. By evaluating the models at different percentages of total interactions, we took into account the fact that the definition of 'early' depends on the student. Our results show that after observing the first 50% of interactions, we are able to reliably distinguish between students with a high and low conceptual understanding gained by the end of the learning activity (both and closed). At 75% of interactions, the best models are also able to provide a more fine-grained prediction (correct, areasep, and closed).

5. DISCUSSION
Over the last decade, interactive simulations of scientific phenomena have become increasingly popular. They allow students to learn the principles underlying a domain through their own explorations. However, only few students possess the degree of inquiry skills and self-regulation necessary for effective learning in these environments. In this paper, we therefore explored approaches for an early identification of struggling students as a basis for adaptive guidance: we aimed at predicting students' conceptual knowledge while interacting with a Physics simulation. Specifically, we were interested in answering the following three research questions: 1) Can students' interaction data be associated with the gained conceptual understanding? 2) Can conceptual understanding be inferred through sequence mining methods with embeddings? 3) Can the proposed methods be used for early prediction of students' conceptual understanding based on partial sequences of interaction data?
To answer the first research question, we analysed data from 192 first-year undergraduate Physics students who used an interactive capacitor simulation to solve a task in which they had to rank four capacitor configurations by their stored energy. Previous research has emphasised the importance of aligning instructional and assessment activities to implement pedagogically meaningful learning activities with educational technology [26, 27]. Since one objective of automatically detecting student learning behavior is to provide some kind of assessment (either formative or summative), the learning task presented in this work was designed to facilitate the application of sequence mining. The design of this learning task allowed us to relate all of the students' answers to a certain level of conceptual understanding. Using a decision tree, we were able to map each ranking to one of the four labels representing groups of similar conceptual understanding. Our results show that all evaluated models were able to correctly associate students' sequential interaction data in the simulation with the generated labels, achieving high predictive performance when fed with full sequences (2-class case: AUC > 0.9, 3-class case: AUC > 0.85, 4-class case: AUC > 0.75). This high predictive power was also observed in [22], where they reached an accuracy of 85% when separating 'high' learners from 'low' learners based on full interaction sequences. However, despite their findings of a potential third cluster, they did not investigate the ternary classification task. In this paper, we increase the granularity of the labels in order to target more specific shortcomings in the knowledge of the students and to provide them with more detailed feedback. We therefore conclude that we can answer research question 1) with yes.

The second research question investigates the benefits of latent features generated by skip-grams for offline classification tasks in the context of education. Usually applied to NLP problems, skip-grams have the ability to learn the context of a word in an unsupervised fashion. In our case, we use them to learn the OELE behaviour of the students surrounding each interaction, and retrieve the embedding matrix of the neural network to create our latent representations. This approach has already been proven efficient for analysing student strategies in blended courses [28], but not for the identification of conceptual understanding. To evaluate the predictive power of latent feature representations, we trained two classifiers (NN and RF) on three types of features (Action Counts, Action Span and Pairwise Embeddings) on the full sequences of students. At first, we notice that all model and feature combinations achieve a high AUC for all classification tasks (2-class case, 3-class case, and 4-class case). Though the ANOVA revealed no significant differences between the predictive performance of models with different types of features, we can observe that the NN with PW achieves a higher performance on average than all other combinations but the NN with AS. What is more, the performances in its first quartile dominate those from the third quartile of three model and feature combinations. Additionally, its performance variance is smaller than that of the NN with AS. This shows that pairwise embeddings generated by a skip-gram approach can be a valuable asset for finer-grained classification, even if no statistical difference was found with respect to the other model-feature combinations. We can therefore answer research question 2) with a partial yes.

To address the third research question, we assessed the predictive performance of the proposed approach when only partial sequences of the students' interaction data were observed. We analysed the performances of our proposed approaches based on varying proportions of the available data and for classification tasks with different levels of complexity. The results of our experiments show that the proposed combinations of models and generated features allowed us to predict the correct class labels early on. The best models were able to reliably predict students' conceptual understanding for the 2-class case (AUC ≈ 0.8) after having seen 50% of the students' interaction data. To reach a similar predictive performance for the more fine-grained 3-class and 4-class cases, the best models needed about 75% of the data. The findings from these experiments therefore represent a promising step towards early prediction of students' conceptual understanding in OELEs. We can therefore answer research question 3) with yes.

One of the limitations of this work is the unfeasibility to track whether students used external resources (other than the simulation) in order to rank the four capacitor configurations. This may bias the inference from the simulation usage to the extrapolated understanding level. Furthermore, due to our small sample size, we were only able to train shallow NN classifiers and skip-grams. Finally, the external validity of these experiments remains to be evaluated on other interactive simulations and different types of tasks.
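The pairwise-embedding (PW) features discussed above start from skip-gram training pairs extracted from each student's action sequence. The pair-extraction step can be sketched as below; the action names and window size are invented for illustration, and the subsequent embedding training (the neural network whose embedding matrix is retrieved) is not shown:

```python
def skipgram_pairs(actions, window=2):
    """Generate (target, context) pairs from one student's action
    sequence, as would be fed to a skip-gram model. Illustrative
    sketch; window size and action vocabulary are assumptions."""
    pairs = []
    for i, target in enumerate(actions):
        lo, hi = max(0, i - window), min(len(actions), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # every neighbour within the window is a context
                pairs.append((target, actions[j]))
    return pairs

seq = ["open_circuit", "add_plate", "measure"]
print(skipgram_pairs(seq, window=1))
# [('open_circuit', 'add_plate'), ('add_plate', 'open_circuit'),
#  ('add_plate', 'measure'), ('measure', 'add_plate')]
```

Training a shallow network to predict context actions from target actions on such pairs, and reading off its embedding matrix, yields one latent vector per action type.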
To conclude, the proposed approach represents a promising step towards early prediction of students' learning strategies in interactive simulations that, moreover, can be associated with their level of conceptual understanding. The proposed learning activity seems to represent an interesting example for the design of learning tasks in OELEs that facilitates the association of detected student strategies with conceptual understanding through sequence mining. Future work could explore whether such designs could also be used to identify conceptual understanding at a more fine-grained level.

6. ACKNOWLEDGMENTS
We thank Kathy Perkins (PhET Interactive Simulations, University of Colorado Boulder) and Daniel Schwartz (Stanford University) for the valuable input on the study design and setup and for supporting data collection.
