

Just a Few Expert Constraints Can Help: Humanizing Data-Driven Subgoal Detection for Novice Programming

Samiha Marwan, Yang Shi, Ian Menezes, Min Chi, Tiffany Barnes, Thomas W. Price
North Carolina State University, Raleigh, NC, USA
samarwan, yshi26, ivmeneze, mchi, tmbarnes, [email protected]

Samiha Marwan, Yang Shi, Ian Menezes, Min Chi, Tiffany Barnes and Thomas Price. "Just a Few Expert Constraints Can Help: Humanizing Data-Driven Subgoal Detection for Novice Programming". 2021. In: Proceedings of The 14th International Conference on Educational Data Mining (EDM 21). International Educational Data Mining Society, 68-80. https://educationaldatamining.org/edm2021/ EDM '21, June 29 - July 02, 2021, Paris, France.

ABSTRACT
Feedback on how students progress through completing subgoals can improve students' learning and motivation in programming. Detecting subgoal completion is a challenging task, and most learning environments do so either with expert-authored models or with data-driven models. Both models have advantages that are complementary – expert models encode domain knowledge and achieve reliable detection but require extensive authoring effort and often cannot capture all students' possible solution strategies, while data-driven models can be easily scaled but may be less accurate and interpretable. In this paper, we take a step towards achieving the best of both worlds – utilizing a data-driven model that can intelligently detect subgoals in students' correct solutions, while benefiting from human expertise in editing these data-driven subgoal rules to provide more accurate feedback to students. We compared our hybrid "humanized" subgoal detectors, built from data-driven subgoals modified with expert input, against an existing data-driven approach and baseline supervised learning models. Our results showed that the hybrid model outperformed all other models in terms of overall accuracy and F1-score. Our work advances the challenging task of automated subgoal detection during programming, while laying the groundwork for future hybrid expert-authored/data-driven systems.

Keywords
Subgoals, Formative feedback, Data-driven hybrid models

1. INTRODUCTION
Formative feedback has been shown to be an effective form of automated feedback that can improve students' learning and motivation [54, 38, 8, 20]. In programming, immediate formative feedback during problem-solving is important because some problems require students to find one of many correct solutions [16], and novices may be uncertain about when they are on the right track [62]. This uncertainty may lead some students to give up [43], and can also negatively impact student confidence and motivation in computer science (CS) [32]. Prior research has shown that immediate feedback can address this, reducing novices' uncertainty and improving their confidence, engagement, motivation, and learning [42, 8, 33, 21, 38, 39].

One effective form of immediate feedback is subgoal feedback [40], which indicates students' progress on specific sub-steps of the problem. Feedback on subgoals offers special advantages because it demonstrates how a student can break down a problem into a set of smaller sub-tasks, which is key to simplifying the learning process [37, 38], and can promote students' retention in procedural domains [35]. To generate such feedback, learning environments need to be able to do subgoal detection: the process of detecting when a student completes a key objective or sub-part of a programming task (e.g. receiving and validating user input). However, subgoal detection during problem-solving is known to be extremely challenging because it requires assessing students' intended problem-solving approach rather than their program output. In other words, it is difficult to automatically evaluate whether a student completed a subgoal in the middle of problem-solving, due to the many possible strategies that students can use to solve a problem, even when using test cases or autograders.

Historically, to provide feedback on subgoals, learning environments have used expert-authored models, where human experts encode a set of rules to predict solution strategies that students might perform to complete a specific subgoal. While expert models can generate accurate feedback with interpretable explanations, they also require extensive human participation, particularly for open-ended programming tasks, where it becomes unmanageable to capture every possible correct solution [59]. More recently, data-driven (DD) models, in which the model learns rules from historical data, have become more prominent. This is because DD models reduce the expert-authoring burden and have the potential to be easily scaled to more problems and contexts. Moreover, DD models learn from multiple students' solutions, which lets them capture code situations that human experts cannot easily perceive, particularly in open-ended programming tasks [49, 45]. However, DD models are dependent on the quality of the data, and may have lower accuracy than expert models [59, 48]; therefore, they may sometimes provide misleading feedback in practice [48]. Both of these models have strengths and weaknesses, and in this paper we propose an approach that takes advantage of both.

We present a hybrid approach that leverages both a data-driven model and expert insights to detect subgoal completion while students solve block-based programming problems. Our hybrid model is based on three main steps. First, we used an unsupervised data-driven model to generate subgoal detectors for a programming task, and represented them as a set of human-interpretable, human-editable rules. Second, this representation allowed us to evaluate the accuracy of the subgoal detectors, particularly when they made inaccurate detections. Third, we used human expert insights to refine and fix rules that led to inaccurate subgoal detections. These three steps resulted in a hybrid data-driven approach that generates subgoal detectors with high accuracy, that targets expert-authoring effort only where improvements are needed, and that can also be easily scaled to various problems and contexts.

We evaluated our hybrid data-driven model on a block-based programming problem from an introductory CS classroom against the same, fully data-driven (DD) model without experts' intervention. We evaluated the accuracy of both models by comparing their subgoal detections on a given programming task to those of human experts, and we hypothesized that our hybrid model would surpass the accuracy achieved by the DD model. We found that the expert evaluations of subgoal detections achieved significantly higher agreement with our hybrid model than with the DD model. We also found that our hybrid model outperforms state-of-the-art supervised models: code2vec [53], Support Vector Machine (SVM) [14], and XGBoost [9]. In addition, we present case studies of how the hybrid model led to differing subgoal detections in student programs compared to the DD model. We also discuss how we can close the loop by applying our hybrid model in block-based programming classrooms to provide students with immediate feedback on subgoals.

In summary, in this paper we investigate the following research question: RQ: How well does a hybrid data-driven model combined with expert insights perform compared to: 1) a data-driven model without expert augmentation and 2) baseline supervised learning approaches that leverage expert subgoal labels? Our work provides the following contributions: (1) we present a hybrid subgoal detection approach which combines an unsupervised data-driven model with domain expertise to achieve the benefits of both data-driven and expert models, and (2) we demonstrate how our hybrid approach advances the state of the art in subgoal detection in open-ended programming tasks over supervised and unsupervised baselines, with an accuracy range of 0.80 - 0.92.

2. LITERATURE REVIEW
In this work, we investigate the challenge of automatically detecting subgoals effectively. We propose a hybrid approach in which human experts modify data-driven models to build effective subgoal detectors. In the following, we first review prior work on immediate feedback, with a focus on subgoal feedback. Then, we review prior work that merges machine and human expert intelligence to improve the performance of machine-learned models. Finally, we review both the state-of-the-art supervised learning models and the unsupervised data-driven model that we used for subgoal detection in a programming task.

2.1 Feedback and Subgoal Detection
Formative feedback is defined as a type of task-level feedback that provides specific, timely information to a student in response to a particular problem, based on the student's current ability [54]. In a review of effective formative feedback in education, Shute shows that immediate formative feedback is effective because it can improve students' learning [11, 38] and retention [54, 44], particularly in procedural skills such as programming [54]. Most intelligent tutoring systems provide immediate feedback through identifying errors (e.g. error detectors [5, 55], anomaly detectors [31], or misconception feedback [25, 24]); however, far less work has been devoted to providing feedback on students' subgoals. Automated assessment systems (i.e. autograders) can provide feedback on correct subgoals by showing the passing test cases using expert-authored models [7, 28, 26, 38]. For example, most autograders use instructor test cases to check for correct program output; however, they require students to submit an almost-complete program to obtain feedback [7, 29]. As a result, this feedback is often not available in the early stage of programming, when timely feedback on subgoals is most needed. To provide timely subgoal feedback, prior research has taken two exclusive approaches: the expert-authored approach and the data-driven approach.

Expert-authored Approaches: Prior work has explored students' completion of subgoals by diagnosing students' solutions against expert models (e.g. constraint-based models [42]), even when a student has incomplete submissions, to provide feedback on whether they are on track [27], or whether they completed key objectives of short programming tasks [38]. However, these systems often require extensive expert effort to create rules. To address this authoring burden, example-tracing tutoring systems infer tutoring rules based on examples of potential student behaviors. This still requires an author with some domain expertise, but it allows rules to be constructed by non-programmers who have domain expertise [2]. An expert can create different example solutions to capture different solution strategies, and augment them with hints or feedback. Example-tracing tutors have been developed in multiple non-programming domains like genetics [12], mathematics [1], and applied machine learning, and they have been shown to improve the problem-solving process and student learning [2]. Despite the accuracy of expert models in providing feedback on test cases or correct features, which can be equivalent to subgoals, it is unclear how feasible they are in domains with vast solution spaces and open-ended problems, such as programming tasks [59, 39, 25].

Data-driven Approaches: Data-driven approaches refer to systematically collecting and analyzing various types of educational data to guide a range of decisions that help improve the success of students [15, 50]. Data-driven models largely avoid the need for expert authoring altogether by using prior students' correct solutions, instead of expert rules or instructor solutions, to learn patterns of correct solutions. This enables automated assessment feedback on student code [21]. Many data-driven models work by executing a comparison function that calculates the distance between a student's code and all the possible correct solutions, and then comparing the path of the closest solution with that of the student [47, 49, 59, 39]. While most data-driven methods have been used to generate fine-grained feedback – such as hints in the Hint Factory [58], little work has used these methods to generate subgoal feedback. For example, Diana et al. developed a model that searches for meaningful code chunks in student code to generate data-driven rubric criteria to automatically assess students' code [19], and showed that a data-driven model can agree with expert assessments [19]. In the iList tutor, Fossati et al. used a data-driven model to provide feedback on correct steps, assessing a student code edit as good if it improves the student's probability of reaching the correct solution, or uncertain if the student spent more time than prior students took at this point [21]. Fossati et al. found that this feedback was well-perceived by students and improved their learning [21]. However, data-driven models are dependent on the similarity of the current student's approach to prior student submissions, making it difficult to control the quality of their feedback [39, 51, 59, 46]. In a case study paper, Shabrina et al. discuss the practical implications of data-driven feedback on subgoals, showing that the quality of feedback is important, even for positive feedback, since inaccurate feedback can cause students to spend more time on a task even after they were done [51]. Because of such challenges, and perhaps other reasons such as the inability of individual instructors to augment autograder feedback, few tools have been built to provide immediate feedback to students on whether they have achieved subgoals during their programming tasks [34].

2.2 Integrating Expert Knowledge into Models
In recent years, combining machine and human intelligence has been extensively explored in a wide range of domains, including artificial intelligence and software engineering. For example, Chou et al. developed a virtual teaching assistant (VTA) that uses teachers' answers as human intelligence, with machine intelligence using those answers to locate student errors and generate hints [10]. Chou et al. found that these mechanisms reduce teacher tutoring load and reduce the complexity of developing machine intelligence [10]. In the software engineering domain, there is an emerging approach called collective intelligence that merges the wisdom of multiple developers with program synthesis algorithms, which has been shown to significantly improve the efficiency and accuracy of program synthesis [61].

Human-in-the-loop methods are another effective trend that uses human intelligence to improve the efficiency of machine learning models while the model is learning [23, 56, 30, 63]. For example, Goecks introduces a theoretical foundation that incorporates human input modalities, like demonstration or evaluation, to leverage the strengths and mitigate the weaknesses of reinforcement learning (RL) algorithms [23]. Their results show that using human-in-the-loop methods accelerates the learning rate of RL models, with more efficient samples, in real time [23]. Our work also uses human intelligence to improve the accuracy of a machine learning model; however, it does so after the model is trained.

In the educational domain, expert knowledge is widely applied to augment data-driven and machine-learned models for problem solving and feedback. For example, in a logic tutor that provides data-driven hints using students' solutions, Stamper et al. used an initial small amount of sample data generated by human experts to enhance the automatic delivery of hints [57]. Moreover, example-tracing tutors allow experts to specify moderately-branching solutions for open-ended problems, allowing some intelligent tutors originally implemented using complex expert systems to be almost completely replicated to support practical learning needs [2].

2.3 Supervised Learning Models of Code Analysis
Supervised learning algorithms leverage labels created by human experts to guide the model search process. With available labels, automated learning algorithms can be applied to subgoal detection tasks on programming data. As shown in [6], one baseline is to extract term frequency-inverse document frequency (TF-IDF) features and use traditional machine learning algorithms such as support vector machines (SVM) [14] and XGBoost [9]. However, as word- or token-based features such as TF-IDF lose important structural information from programming data [3], recent work uses structural representations of code and more complex model structures to learn more complex features. For example, Shi et al. applied a code2vec [4] model to detect the completion of rubric items on student programming data [53]. In this work, we compared our hybrid data-driven model to these existing supervised learning baseline models to check our improvement on the subgoal detection task.

2.4 Data-Driven Subgoal Detection Model
Among the various data-driven models for detecting subgoals, or rubric items [39, 18, 19], we built our proposed hybrid model on top of an unsupervised data-driven subgoal detection (DD) algorithm presented in [64]. We applied this algorithm to a programming task called Squiral (described in Section 3) by running the following steps. First, the algorithm extracts prior student solutions to Squiral as abstract syntax trees (ASTs) [49, 47]. Then, it applies hierarchical code clustering for feature extraction and selection by: (1) extracting common code shapes, which are common subtrees in the ASTs of correct students' solutions (Figure 2 shows examples of code shapes); (2) filtering redundant code shapes; (3) identifying decision shapes, which consist of a disjunction of code shapes (i.e. C_1 ∨ ... ∨ C_n) that are usually mutually exclusive (e.g. a program uses a loop, or a repeated set of commands, but not both); and (4) hierarchically clustering frequently co-occurring code shapes into subgoals. In [64], the authors found a meaningful Cohen's Kappa (0.45) in agreement between the algorithm's and experts' subgoal detection on student data, suggesting that DD subgoals could be used to generate feedback. However, since the DD subgoals are typically represented in a regular-expression-like form, labels are needed to make them meaningful and usable for students in programming environments.

3. METHOD
This work presents and evaluates a hybrid data-driven model for generating and detecting subgoals in a block-based programming exercise (explained in Section 3.1). To evaluate our hybrid data-driven approach, we applied our model to student data collected from an Introduction to Computing course, and we asked human experts to evaluate the accuracy of its subgoal detections (explained in Section 3.2.2). We use this dataset to provide examples of how our approach works, but we also discuss how it can be generalized.

Figure 1: One possible solution for Squiral with line numbers on the left, and the script's output on the right.
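A detector built from such clusters checks that all of a subgoal's code shapes, and exactly one member of each mutually-exclusive decision shape, appear in a student's program. The following is an illustrative sketch only; the shape names and data structures are hypothetical placeholders, not the authors' implementation.

```python
# Sketch of subgoal detection over code shapes and decision shapes.
# A decision shape is a disjunction C_1 v ... v C_n of (usually
# mutually-exclusive) code shapes; exactly one option should match.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Subgoal:
    name: str
    required: Set[str] = field(default_factory=set)       # shapes that must all appear
    decisions: List[Set[str]] = field(default_factory=list)  # one disjunction per entry


def detect(subgoal: Subgoal, shapes_in_code: Set[str]) -> bool:
    """Return True if the student's code completes the subgoal."""
    if not subgoal.required <= shapes_in_code:
        return False
    # each decision shape must be satisfied by exactly one of its options
    return all(len(d & shapes_in_code) == 1 for d in subgoal.decisions)


def detection_state(subgoals: List[Subgoal], shapes_in_code: Set[str]) -> List[int]:
    """One flag per subgoal, like the {1, 0, 0, 1} states in the paper."""
    return [int(detect(sg, shapes_in_code)) for sg in subgoals]
```

Under this reading, the expert "fixes" described later amount to editing the option sets: removing an option (e.g. a 'ReceiveGo' shape) or widening one (e.g. accepting either turn direction).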
We compared our hybrid model against its underlying data-driven (DD) model, described above in Section 2.4, as well as state-of-the-art supervised learning models (explained below in Section 4.1).

3.1 Dataset
Our data is collected from a CS0 course for non-majors at a public university in the United States that uses Snap!, a block-based programming environment [22]. This programming environment logs all student actions while programming (e.g. adding, deleting, or running blocks of code), with the time taken for each step. These student logs (i.e. actions) can also be replayed as a trace of code snapshots of all students' edits – allowing us to detect the time and the code snapshot where a subgoal is completed during the student's problem-solving process.

In this work, we collected students' logs from one homework exercise named Squiral, derived from the BJC curriculum [22]. Squiral requires a visual output: students are asked to write a procedure that takes one input 'r' to draw a square-like spiral with r rotations. Figure 1 shows a possible correct solution for Squiral that requires procedures, variables, and loops. We collected a training dataset from 3 semesters: Spring 2016 (S16), Fall 2016 (F16), and Spring 2017 (S17), which includes data from 174 students, with a total of 29,574 logged student actions.

Figure 2: Three code shapes A, B, C, in both the data-driven and hybrid models. Each code shape represents a false detection and its fix by human experts. Red dashed nodes are removed and green bold nodes are added.

3.2 Hybrid Data-Driven Subgoal Detection
Our hybrid data-driven model is based on two main components. First, the DD model is used to generate data-driven subgoals. Second, expert-authored constraints are added to improve the quality of these subgoals and the accuracy of their detection. We implemented our hybrid approach by conducting the following 3 high-level steps:

1. We used the DD model to generate an initial set of subgoals, consisting of clusters of code shapes. We then presented these clusters to experts in an interpretable form, who combined them to create a concrete set of meaningful subgoal labels.

2. We integrated the DD subgoal detection model into the students' programming environment, allowing students to see when the DD algorithm detected completion of each subgoal. We then collected student programming log data along with DD detections, and asked experts to evaluate the accuracy of the DD detections.

3. We used human expert insights to fix code shapes that led to false subgoal detections; we then combined them again to create a modified set of hybrid subgoals and evaluated its new accuracy.

3.2.1 Step 1: Interpreting and Editing Data-Driven Subgoals
The goal of this step is to generate data-driven subgoals using the DD model and present them in an interpretable and editable form. We applied the DD model (described in Section 2.4) to the S16, F16, and S17 student datasets to generate a number of clustered code shapes. Column 1 in Table 1 shows the descriptions of 7 subgoals corresponding to code-shape clusters generated from correct solutions (n = 52). We evaluated each cluster by displaying its code shapes separately and interpreting their code behavior. For example, code shape A in Figure 2, on the left, represents a decision shape that requires a student to use the 'ReceiveGo' block (i.e. the hat block in Figure 1, line 1, which is used to start a script) in their main script, or to evaluate a procedure with one parameter, which is done by creating and snapping a procedure in the main script (as shown in Figure 1, line 3). We treated each cluster as a subgoal, and for a subgoal to be detected, the DD model requires all of its code shapes, and exactly one component of each of its decision shapes, to be present in the student's code.

While the data-driven clusters can represent appropriate subgoals, we combined some of them to create a shorter list of higher-level subgoals similar to the programming task rubric. Column 2 in Table 1 shows the combined subgoals. It is worth noting that we also took the insights of two instructors of the CS0 course on how meaningful these subgoals are for students to understand. Additionally, because one of our goals is to use these data-driven subgoals in learning environments, we asked human experts to label them to make them easily understandable by students and instructors. For example, the human label of subgoal 1 is: "Create a procedure with one parameter, and use it in your code", as shown in Table 1. We then developed a data-driven subgoal detector that takes student code as input and outputs the status of each subgoal. For example, if the input is code C and the output is {1, 0, 0, 1}, the subgoal detector detects the completion of subgoals 1 and 4, and the absence of subgoals 2 and 3, in C.

3.2.2 Step 2: Investigating Data-Driven Subgoals
The goal of this step is to investigate the correctness of the DD subgoal detections. In Spring 2020 (S20), we integrated the DD subgoal detector into the Snap! programming environment for the Squiral exercise to provide students with subgoal feedback [39]. In Snap!, students could see the subgoal labels (shown in Table 1), colored gray to start, green when the subgoal was detected, or red when it became broken. We collected 4,480 edit logs from 27 student submissions, with an average of 166 edits per student in S20. For each student edit, we recorded the DD subgoal detection state (e.g. {1, 0, 0, 0} for subgoal 1 being complete). We asked human experts to manually replay each student's trace data and evaluate whether the subgoal labels, as shown to students, were achieved at the specific timestamps when the DD algorithm detected them. Importantly, we asked experts to report when the expert-authored labels for each subgoal (which students saw) were achieved. Since these labels do not precisely match the code shape combinations that the DD subgoal detector used, it was very possible for the DD model to be "wrong." In other words, we asked experts to determine when each student achieved "Create a Squiral procedure with one parameter and use it in your code," and compared that to when the DD detector marked this subgoal label as complete.

We classified each evaluated instance as either Early, OnTime, or Late. An instance is classified as Early if the DD detection is before the human expert detection timestamp, OnTime if they coincide, and otherwise Late. For example, if for student S_j the human expert detected the completion of subgoal i (SG_i) at time t = 5, while the algorithm detected it at t = 2, then we label SG_i for S_j as an Early detection. We then sorted students in descending order by the percentage of false detections in their log data, and took the first 66% of this data (~18 students) as a sample to investigate the reasons for false detections. We did not use the full set of false detections, since our primary goal was to fix the most common mismatches without overfitting to the dataset.

We then focused on false detections that occurred due to new, correct solutions in the S20 dataset that had no matching code shapes in the training dataset (the S16, F16, and S17 datasets). We did not investigate expected false detections that occurred due to known limitations in the DD algorithm (e.g. the DD algorithm does not differentiate between variable names).

We found three reasons for false detections, affecting subgoals 1, 2, and 4. Inspired by the design of the constraint-based SqlTutor by Mitrovic et al. [41], we introduce 3 fixes (or constraints) to resolve them.

Subgoal 1 false detection. As shown in Table 1, the subgoal 1 label requires a student to create a procedure with one parameter and use it (or evaluate it) in the main script. However, subgoal 1 code shapes consist of the creation and evaluation of a procedure, or the use of a 'ReceiveGo' block (the hat block used to start a script). This means that whether a student created and evaluated a procedure, or added a 'ReceiveGo' block in the main script, the DD model will detect the completion of that subgoal; but experts did not interpret the 'ReceiveGo' block as meeting this subgoal, yielding a false detection. To fix this false detection, we simply removed the 'ReceiveGo' block as an option for this subgoal. Code shape A in Figure 2 shows the code shapes of subgoal 1 in the DD model (on the left), and how it is fixed in the hybrid model (on the right).

Subgoal 2 false detection. As shown in line 6 in Figure 1, subgoal 2 requires a student to use a 'repeat' block that iterates 4 times the number of rotations to draw a Squiral with the correct number of sides. While the code shapes of this subgoal satisfy this definition, they also include a code shape of adding a 'pen down' block, which is necessary to draw, but only inside a procedure. Therefore, if a 'pen down' block is used outside of a procedure, subgoal 2 will not be detected. To fix this false detection, we added a disjunction code shape to detect the presence of 'pen down' inside or outside a procedure, as shown in code shape B in Figure 2.

Subgoal 4 false detection. As shown in lines 6-9 in Figure 1, subgoal 4 requires the use of 'move', 'turn', and 'change - by -' blocks (which increment the value of a variable) inside a 'repeat' block. We found that the code shapes of subgoal 4 only include the 'turnLeft' block; however, if the student's solution contains a 'turnRight' block (which does the same 'turn' functionality but in a different direction), the subgoal will not be detected. To fix this false detection, we modified all the code shapes that require the use of a 'turnLeft' block to accept either 'turnRight' or 'turnLeft' blocks, as shown in code shape C in Figure 2.

These three false detections show that prior solutions in the S16, F16, and S17 datasets often used 'ReceiveGo' and 'turnLeft' blocks, and used 'pen down' inside a procedure; but this was not always the case in the S20 data. This shows that investigating the accuracy of a model, either data-driven or expert, is necessary, since it is impossible to predict how students will behave in practice or how the data will change from one class to another.

3.2.3 Step 3: Improving the Data-Driven Subgoals with Human Insights
The goal of this step is to apply the human expert constraints (explained in Step 2) to mitigate the false detections of the DD algorithm. To do so, we developed a tool that iterates over each code shape of the data-driven subgoals and allows humans to edit code shapes (i.e. add, delete, or modify them) to apply the three constraints (i.e. fixes) explained in Step 2. Because this tool modifies the code shapes, we then use the original DD algorithm to re-cluster all code shapes to ensure that the most similar code shapes remain clustered together. Figure 2 shows an example of three code shapes before and after they were edited by human experts, fixing the false detections that existed for subgoals 1, 2, and 4, respectively.

Table 1: Data-driven subgoals (composed of code shape clusters) with their corresponding human labels that were used when presented to students.

C1: Evaluate a procedure with one parameter on the script area OR add a 'ReceiveGo' block.
C2: Create a procedure with one parameter.
  C1 + C2 = Subgoal 1. Human label: "Create a Squiral procedure with one parameter and use it in your code."

C3: Have a 'multiply' block with a variable in a 'repeat' block OR two nested 'repeat' blocks.
C4: Have a 'pen down' block inside a procedure.
  C3 + C4 = Subgoal 2. Human label: "The Squiral procedure rotates the correct number of times."

C5: Add a variable in a 'move' block inside a 'repeat' block.
  C5 = Subgoal 3. Human label: "The length of each side of the Squiral is based on a variable."

C6: Have a 'move' and 'turnLeft' block inside a 'repeat' block.
C7: 'Change' a variable value inside a 'repeat' block.
  C6 + C7 = Subgoal 4. Human label: "The length of the Squiral increases with each side."

In summary, our hybrid model humanizes data-driven subgoal detection models through a series of important steps. First, we apply a data-driven model to correct, historical student solutions to generate a set of human-interpretable code shape clusters. Second, a human expert labels the subgoals these clusters represent, in a way that is meant to align with the original programming assignment. Third, we collect data from students solving the same task using a programming environment augmented with subgoal feedback (i.e. human labels with colors) based on the DD subgoal detector. Fourth, we had experts examine code traces with the DD subgoal feedback to determine when the displayed subgoal labels were actually achieved. Fifth, human experts modified the code shapes that led to discrepancies between the data-driven and expert detections for the displayed subgoals. This series of steps leverages the natural cycle of a frequently-offered CS0 class to bootstrap the creation of DD subgoal detectors in the programming environment.

4. EVALUATION
In this experiment, we applied both the hybrid and the DD models to detect subgoals in students' S20 code submissions. We also asked two human experts to evaluate the presence or absence of each subgoal in a subset of students' code snapshots (sequential states of student code, corresponding to their code edits, e.g., the addition or deletion of code blocks), using the subgoal labels (shown in Table 1) as rubric items (with 1 for a subgoal's presence and 0 for its absence), resulting in an expert (or gold standard) subgoal state.

Because the S20 data consists of 4,480 code snapshots, it is not feasible to evaluate the models on every timestamp, for two reasons. First, students mostly need feedback on a given subgoal when they are making edits towards finishing that subgoal, not after every single edit they make. Second, students break and re-complete subgoals frequently, even when they are not working on a particular subgoal, and therefore it is not meaningful to have an expert label at every single datapoint. As a result, we evaluate the models at the most meaningful times, when a student is close to finishing a subgoal, including: (1) the first time a student completed a subgoal, according to a human expert; (2) up to five code edits before that subgoal is completed; and (3) any time when either model (i.e. hybrid or DD) suggests a change in a subgoal's status. While these changes may or may not be true, we wanted to have experts evaluate the correctness of how the algorithms may have detected subgoal changes at these points.

For each subgoal, two human experts evaluated 150, 163, 178, and 196 student snapshots for subgoals 1, 2, 3, and 4, respectively, making a total of 687 labeled snapshot datapoints. The experts used the subgoals as their rubric, and they started the labelling process by evaluating the first time a subgoal is detected. To do so, they divided the data (27 students * 4 subgoals = 108 datapoints) into a set of 6 rounds, where the first round consists of 2 students and the remaining 5 rounds consist of 5 students each. The two experts collaboratively evaluated the first round together to make sure they had a clear understanding of the rubric subgoals. Then, for the next two rounds, they evaluated the logs independently, and after each round they met to discuss and resolve conflicts. For these 3 rounds, the two experts had moderate to substantial agreement, with a Cohen's kappa ranging from 0.5 to 0.67. The kappa values are low because we counted any disagreement, even a difference of one timestamp; but this does highlight how subgoal detection can be subjective, which is a challenge for measuring the accuracy of subgoal detection. As a result, for the remaining data, the two experts continued to evaluate it independently and then met to discuss and resolve conflicts to produce relatively objective gold standard expert labels. We used these labelled logs as ground truth to compare the accuracy and F1-scores of both the DD and the hybrid models. We also calculated the agreement between the expert detections and those generated by the hybrid and DD models.

4.1 Supervised Learning Models
We also compared our hybrid humanized model with supervised machine learning models as another form of baseline. The supervised models were trained and tested on the S20 dataset, using the same 687 expert-labeled snapshots described above. We trained separate models to detect each of the subgoal labels (using training/testing splits discussed below). This allows us to directly compare these supervised methods, which require labeled training data, to the DD model, which does not, and to our hybrid approach, where the expert uses some of the labeled data to improve the model. While we discuss some limitations to this comparison in Section 7, these baselines help contextualize the performance of our subgoal detectors.

The baseline supervised machine learning models we have chosen are two shallow models (SVM [14] and XGBoost [9]) and one deep learning model (code2vec [4]). We used the same edits for prediction as the DD and hybrid models, and extracted term frequency-inverse document frequency (TF-IDF) features from them, so that a vector representation of each edit is generated and used for training the two shallow models. We performed grid search cross validation for both models.

[Figure 3 omitted: four panels (Subgoals 1-4) plotting Number of Edits (y-axis) against Students (x-axis) for the DD and Hybrid models.]
Figure 3: The number of code edits (y-axis) that occurred between gold standard expert subgoal detections, and detec-
ForSVM,weusedaparameterspaceoflinearkernel tions by the DD and Hybrid models, for first-time subgoal and Radial Basis Function (RBF) [60] kernel, and searched detectionsforeachstudenttrace(x-axis). the regularization parameters from 1 to 10. For XGBoost, we searched the subsampling space of 0.1 to 1, with the number of estimators from 5 to 100, stepping by 5. Ten- fold cross validation is performed to search the parameter goals, we find low to moderate agreement over all student spaces. The training, validation, and test sets are split by logs between DD and human expert detections, with Co- studentstomakesurethatnostudentsusedfortestingwill hen’s kappa values ranging between 0.25-0.581. However, haveaneditintraining,sinceeditsfromonestudentwould we find substantial or better agreement between the hybrid beverysimilar,andusingsamplessimilartothetestingset model and human expert detections, with Cohen’s kappa in training would lead to an unfair comparison. valuesrangingbetween0.6-0.84. Itisworthnotingthatthis agreementishigherthanthatachievedbetweenthetwohu- Weselectedonestate-of-the-artdeeplearningmodel,code2vec man experts (described in Section 4). These results showed [4], for comparison as well, as the model has recently been thattheadditionofjustthreehumanconstraintstoadata- applied in educational code classification tasks [53]. In- driven model succeeded in improving its accuracy, making steadofusingavectoroftermfrequencytorepresentedits, the hybrid model agree more with the gold standard (that code2vec uses the structural representations from ASTs to theexpertsco-constructed),thantheexperts’originalagree- represent the code, and the representation is learned from ment with one another. traininganeuralnetwork1. Weusedearlystoppingtoavoid overfitting. To ensure the robustness of our results, we ran Wealsodeterminedthenumberoffalsedetections(i.e. 
Early 20timeswithresamplingforallsupervisedbaselinemodels, and Late detections, as described in Section 3.2.2) for both and reported average metrics (e.g. F1-score, accuracy). the hybrid and DD models. We found the DD model de- tected 40.66%, 10.66%, 5.62% and 5.10% Early detections, 5. RESULTS and2%,13.5%,12.36%,and15.31%Latedetectionsforsub- RQ1a: How well does a hybrid model perform compared to a goal1,2,3,and4,respectively. However,ourhybridmodel data-driven model without expert augmentation? detected 1.33%, 13.5%, 10.67% and 4.6% Early detections, and8%,5.52%,1.7%,and2.81%Latedetectionsforsubgoal The prediction results for each subgoal from the DD model 1, 2, 3, and 4, respectively. To visualize these false detec- and the hybrid model are shown in Table 2. Our hybrid tions,Figure3visualizedthesefalsedetectionsbypresenting modelachieveshigheraccuracyandF1-scoresonallsubgoals the distance (i.e. how many edits) between the Early and thantheDD model. Inparticular,itreaches>0.8accuracy LatedetectionsbyboththeDD andhybridmodels,andthe for all subgoals, and it reaches > 0.9 accuracy for 2 out of gold standard human expert detections of the first time a the 3 subgoals that were modified (i.e. subgoal 1 and 4) subgoaliscompleted. Thex-axispresentsthestudents(n= with expert constraints. It is worth noting that the hybrid 27), and the y-axis presents the number of edits a student modelachieveshigheraccuracyinsubgoal3,whichwasnot makes until they complete a subgoal. We used a negative modified with expert constraints. This is possible because number to indicate how much earlier the models were than after we modified code shapes for the other subgoals, we the gold standard detection, 0 to show when models agree reclustered the code shapes (as described in Section 3.2.3), withthegoldstandard,andapositivenumbertoshowhow and the new code shapes for subgoal 3 were changed. This much later the models were. 
We also used empty circles to islikelybecause,afterreclustering,somecodeshapesmoved indicate instances where a subgoal is never detected by the to subgoal 3, resulting in higher recall. models but it was detected by human experts. While Fig- ure 3 shows only how early/late the model is in detecting We also measured the agreement between human experts, when a subgoal is first completed, this is likely the most DD,andhybridmodelsubgoaldetections. Forthefoursub- important detection. Our results suggest a high agreement between the hybrid model’s detections with the gold stan- 1We applied code2vec using the process described in [53]. dard, and a strong improvement over the DD model. 74 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) Table2: Precision,Recall,F1-scoreandAccuracyobservedwithSupervised,Data-Driven(DD)andourHybridModels. Precision Recall F1-score Accuracy XG- Code Hyb- XG- Code Hyb- XG- Code Hyb- XG- Code Hyb- SVM DD SVM DD SVM DD SVM DD Boost 2vec rid Boost 2vec rid Boost 2vec rid Boost 2vec rid Subgoal 1 0.71 0.68 0.89 0.44 0.95 0.79 0.75 0.91 0.94 0.76 0.71 0.66 0.89 0.60 0.85 0.82 0.76 0.90 0.57 0.91 (n = 150) Subgoal 2 0.58 0.61 0.57 0.70 0.69 0.61 0.77 0.79 0.63 0.85 0.55 0.66 0.64 0.66 0.76 0.74 0.75 0.75 0.77 0.81 (n = 163) Subgoal 3 0.60 0.62 0.69 0.80 0.75 0.54 0.61 0.76 0.64 0.95 0.52 0.59 0.70 0.71 0.84 0.70 0.72 0.79 0.82 0.88 (n = 178) Subgoal 4 0.62 0.64 0.64 0.79 0.87 0.64 0.75 0.84 0.55 0.93 0.59 0.66 0.69 0.65 0.90 0.76 0.74 0.75 0.80 0.93 (n = 196) RQ1b: Howdoesahybridmodelperformcomparedtosuper- WepresenthereacasestudyofthestudentEm2 whenthey visedlearningapproachesthatleverageexpertsubgoallabels? received an inaccurate subgoal detection based on the DD model,andhowthehybridmodelcouldhavemitigatedthis Weshowourcomparisonofsupervisedlearningmodelsand false detection. the hybrid model in Table 2. 
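The supervised baseline setup described in Section 4.1 (TF-IDF features over code edits, a grid-searched SVM, and a split by student) can be sketched as follows. This is a minimal illustration using scikit-learn with synthetic toy "edits", not the paper's code or data; the XGBoost and code2vec baselines are omitted.

```python
# Sketch of the Section 4.1 baseline: TF-IDF features over serialized code
# edits, an SVM with grid-searched kernel and regularization, and a split by
# student so that no student appears in both training and testing.
# The edits, labels, and tokenization below are synthetic toy data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, GroupShuffleSplit
from sklearn.svm import SVC

edits = ["receiveGo evaluate squiral", "procedure param receiveGo",
         "repeat move turnLeft", "pen down repeat",
         "multiply variable repeat", "change variable move"] * 5
labels = [1, 1, 0, 0, 1, 0] * 5            # 1 = subgoal present in snapshot
students = [i // 6 for i in range(30)]     # 6 edits per student, 5 students

# Hold out whole students, mirroring the paper's split-by-student rule.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(edits, labels, groups=students))

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform([edits[i] for i in train_idx])
X_test = vectorizer.transform([edits[i] for i in test_idx])
y_train = [labels[i] for i in train_idx]

# Grid search over linear/RBF kernels and the regularization parameter C.
grid = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [1, 5, 10]},
                    cv=3)
grid.fit(X_train, y_train)
predictions = grid.predict(X_test)   # one 0/1 detection per held-out edit
```

A real pipeline along these lines would train one such detector per subgoal on the 687 expert-labeled snapshots and report F1-score and accuracy averaged over repeated resampling.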
On all subgoals except one, the hybrid model has higher accuracy and F1-score than the code2vec, SVM, and XGBoost models, outperforming them by 0.10, 0.06 and 0.15 points of F1-score, respectively. On subgoal 1, we found that code2vec achieved a higher F1-score than all the other models, and a relatively similar accuracy to the hybrid model (0.903, 0.906). One possible explanation for this is that subgoal 1 is the simplest subgoal, requiring only that the student has defined and used a procedure, regardless of its content, and this simple code pattern may have been easier for the supervised approaches to learn. These results show that a hybrid model, iteratively constructed through cycles of student data collection and machine learning, along with human labeling and correction, can be used to create accurate automatic subgoal detections on a novice programming task. Furthermore, these supervised learning models, which were mostly outperformed by our hybrid model, were learned using labels from snapshots that were strategically chosen to reflect important decision points for the model, suggesting that the supervised models' performance may suffer if a random selection of snapshots were used to create an expert-labeled training set instead.

5.1 Case Studies

In this section, we present case studies to highlight ways the hybrid model improved upon the original DD model, as well as the hybrid model's limitations. These case studies come from the 33% of students who were not investigated when the expert identified false detections in S20 from the original DD model, as discussed in our methods (Section 3.2.2). These students also used the original DD system, but their data did not inform our hybrid model. These case studies, therefore, help us understand the ways our hybrid model might help new students, as well as the limitations of the model. Though our prior work suggests the DD subgoal detections overall were often useful to students [39], our post-hoc analysis here shows that the false detections may have negatively impacted student programming behavior, suggesting the need for our hybrid model's improvements.

5.1.1 Case Study 1 (Em): Inaccurate Data-Driven Subgoal Feedback

We present here a case study of the student Em (we use anonymous names for students) when they received an inaccurate subgoal detection based on the DD model, and how the hybrid model could have mitigated this false detection.

Em started solving Squiral by snapping the 'when green flag clicked' block (i.e. the 'ReceiveGo' block) onto the main script, as shown in Figure 4A, and the system falsely detected subgoal 1. Em then proceeded to work on subgoal 2, without creating the required procedure, and created a loop using the 'repeat' block nested with 'move' and 'turnLeft' blocks (shown in Figure 4B). This time, the system was correct in not detecting subgoal 2, because the loop was not in a procedure and does not iterate based on the 'rotations' parameter. Afterwards, Em correctly created a procedure with one parameter, as shown in Figure 4C; however, the system showed no change, since it had already falsely detected subgoal 1 earlier, and therefore no change in the feedback was given to the student. Em then destroyed the procedure, without ever making it again. Em kept working for the rest of the time on creating a number of redundant loops, similar to the one in Figure 4C, with constant values to manually draw the Squiral shape (rather than using a variable to vary its length).

Figure 4: Code snapshots A, B, and C, implemented by student Em in Case Study 1.

Em spent a total of 55 minutes drawing Squiral in this iterative manner. While the DD system accurately detected subgoals 2-4 as incomplete, this case study highlights potential harm that may have arisen from the false detection of subgoal 1. When subgoal 1 was detected early, Em skipped over creating a procedure. Later, when she did create the procedure correctly, she got no additional feedback (since the subgoal was already detected), and promptly deleted it. Preventing these unneeded deletions is a primary role of correct, positive subgoal feedback. Had Em been using the hybrid model, however, subgoal 1 would not have been detected early, because the expert edited the faulty code shape. We argue that this might have allowed Em to keep working on creating a procedure (as shown in snapshot C in Figure 4), which would have been detected as complete by the hybrid system only at that time. It is also possible that receiving inaccurate feedback at the very beginning led to Em's mistrust in the system, since prior work shows that incorrect feedback can reduce students' willingness to use it [48].

Note that we do not believe this incorrect DD detection means the full data-driven system should not be used, since our prior work shows the system can be helpful to students [39]. However, we need to explore how to present subgoal feedback in a way that prompts students to question the feedback, since any such program will inevitably fail to recognize some correct variants of a subgoal solution.

5.1.2 Case Study 2 (Jo): Cases when Hybrid Subgoal Detections could be Incorrect

We present here a case study where the hybrid model would not have provided accurate subgoal feedback. As evidenced by the model's high overall accuracy (Table 2), these instances were rare, but understanding them highlights the affordances and limitations of our approach. Specifically, we investigated subgoal 1, where the hybrid model had the lowest F1-score.

Student Jo correctly created a procedure with a parameter; however, since Jo had not yet used the procedure in their main script, subgoal 1 was not detected (an accurate detection). Jo continued programming and completed subgoals 2 and 3, which were accurately detected by the system. Afterwards, Jo added a procedure call to the main script, and subgoal 1 was detected as complete (an accurate detection). However, Jo then added a 'pen up' block underneath the procedure call, where, unexpectedly, the hybrid model changed subgoal 1's status back to incomplete (an incorrect detection).

This false detection in the hybrid model was due to an overly-specific code shape. Specifically, the code shape required the procedure call to be the last block in the main script (which was true for 94% of students, but not Jo), leading to the false detection. This case confirms the importance of iteratively investigating and refining data-driven subgoal detections to keep improving their accuracy, which is a common process in expert-authored models as well. While this false detection has a straightforward fix, similar to the ones presented in Section 3.2.2, it shows one limitation of the hybrid model: creating these fixes requires the expert to find and address the false detections in the first place, which depends on finding the bugs in the data inspected. This is also one reason why the hybrid model's performance on subgoal 1 has a lower F1-score than the code2vec model (as shown in Table 2).

6. DISCUSSION

6.1 Automated Subgoal Detection

The key contribution of this paper is tackling the critical challenge of automated subgoal detection during programming tasks. Our results show that a hybrid data-driven model meaningfully addresses this goal, with high accuracy and F1-score when detecting subgoals at key moments during students' work. Our results also show that this is a challenging task: even a state-of-the-art supervised learning approach with access to labeled data struggled to identify some subgoals (F1-score as low as 0.64). This agrees with prior work using expert-authored [13, 38] and supervised learning models [53], showing that immediate feedback on subgoals is a hard problem.

While automatically detecting subgoals is challenging, research suggests that the ability to provide automated, immediate feedback on subgoals can significantly improve students' motivation and learning. Providing subgoals for novices can improve student learning by breaking down the programming task into smaller subtasks, which is itself a challenging task for novices [36, 38]. In human tutoring dialogues for programming, tutors provide a combination of corrective and positive feedback, increasing students' motivation and confidence in programming [8, 33, 17]. Automatic subgoal detection could be used to provide similar corrective and positive feedback during programming. We know of only 3 systems that afford such immediate feedback during programming, without relying on unit tests, and that have been shown to promote learning, confidence and persistence: for linked lists [21], database queries [42], and block-based programming [38]. Such systems are perhaps uncommon due to the difficulty of anticipating all student approaches, paired with the high potential for inaccuracies and student reactions to them. Our accuracy results suggest that our hybrid, humanized approach can be used to build similar automatic subgoal detection systems that could be deployed and more easily scaled across problems in real classrooms.

6.2 Affordances of Data-driven, Hybrid and Expert Models

Our results suggest that the hybrid model has good potential for solving the problem of automated subgoal detection. Here we discuss the advantages and trade-offs of the hybrid approach, compared to data-driven and expert models.

6.2.1 RQ1.a: Hybrid versus Data-Driven Models

Our results show that a hybrid, iterative model that leverages data-driven subgoal extraction, human labeling, and expert refinement based on labeled student data can greatly improve model performance compared to a purely data-driven (DD) model. The expert constraints improved the F1-score of the data-driven model by 0.14-0.25 points, as shown in Table 2. Based on our analysis, the hybrid model reduced the number of Early and Late subgoal detections and increased the on-time detections, when compared to the original DD model, as shown in Figure 3. This is a critical improvement, since prior work shows that the quality of feedback affects not only novices' programming behavior [48, 51], but also their self-perceptions and trust in the learning environment [48].

The hybrid model's creation does require additional labelling effort to evaluate the models; however, this effort seems well worth it, and is needed to evaluate the accuracy of any data-driven model before deployment. Compared to prior work, our iterative hybrid model shares similar benefits with "human-in-the-loop" methods in machine learning [56, 30], and also represents data-driven rules in an interpretable and editable form that simplified the process of merging human insights into the model. In prior work, Diana et al. found, qualitatively, that their generated data-driven rubrics are considerably similar to human-generated rubrics [19]. Similarly, in our work, not only did instructors agree with the hybrid model's subgoal detections, we also found these detections have substantial or better agreement with the human expert gold standard than the DD model. From these results, we conclude that a hybrid model can be used to iteratively improve and humanize data-driven subgoal detection.

6.2.2 RQ1b: Hybrid versus Supervised Learning Models that Leverage Expert Subgoal Labels

Our results show that a hybrid model can surpass the performance of supervised learning models. The steps we used to create our hybrid model were to: (1) apply a data-driven model, (2) add expert constraints, and (3) determine interesting datapoints consisting of times when subgoals might be achieved. To create the supervised learning models, we leveraged the interesting datapoints from step 3 of our hybrid model, hand-labeled them, and used them to build the supervised models. We highlight this to point out that it is hard to determine what labeled data to use in a supervised learning model for subgoal detection, since there are hundreds of potential snapshots from each student. As a result, it is unclear whether the supervised models would have performed as well had they been learned from a labeled dataset selected at random or at regular timestamps. Our results show that, even after using carefully selected labelled data to train the models, the SVM and XGBoost baselines did not achieve the level of accuracy of the hybrid model on any subgoal. Even code2vec, the state-of-the-art supervised learning model [53], has lower performance than the hybrid model for all subgoals except subgoal 1. Perhaps code2vec performed worse due to the size of the labelled dataset (687 datapoints), though recent results suggest the model is still effective with small datasets [52].

6.2.3 Hybrid versus Expert Models

Our hybrid approach offers distinct advantages and trade-offs compared to expert-authored models for subgoal detection. Traditional expert-authored models (e.g. constraint-based tutors [42]) have the advantages of high accuracy and high confidence, but the trade-offs of considerable domain-expert time for creation and the potential failure to anticipate some student strategies and misconceptions. Our hybrid model has the advantage of incorporating actual student strategies and misconceptions, and it primarily requires human effort to label data and identify errors, tasks which can potentially be done by non-experts and distributed across multiple people. A domain expert is only needed to edit the automatically extracted rules, and is afforded the chance to do so with actual student data available. A significant trade-off of the hybrid model is its reliance on data, so the quality of the dataset will directly impact the quality of the subgoal detectors. Furthermore, both kinds of models are likely to need refinement as students use them, and this process is already built into the hybrid model's creation and refinement cycle.

7. LIMITATIONS & CONCLUSION

This work has 5 main limitations. First, while the DD model can capture small differences in solution approaches in Squiral (like whether a 'turnLeft' or 'turnRight' block is used), we have not tested it on programming tasks with a larger space of solution approaches. Therefore, it is not known how well the accuracy results will generalize to other types of programming tasks or languages. However, the iterative process of data collection, DD subgoal extraction, labeling, and collection of data from students using the subgoal labels and detectors could be applied to other programming problems of the same level as Squiral, and repeated until the models achieve high accuracy. Second, some of the S20 data that was used for the models' evaluation was also used to inform the expert constraints in the hybrid model. However, this was only 66% of the data, and we discussed above how the added constraints are generalizable, which should have helped in any semester (see Section 5.1). Additionally, our case studies in Section 5.1 show examples of how the hybrid detector performed on unseen data, though there was insufficient data for a quantitative evaluation.

Third, we used only the labelled S20 data to train the supervised baseline models, but we also used 3 other semesters of unlabeled data to train the unsupervised DD and hybrid models. However, we argue that this ability to leverage a larger unlabeled dataset is an advantage of the unsupervised methods, rather than a limitation of our analysis. Fourth, some of the datapoints that were labeled for the evaluation of all the models were selected in part by using the hybrid and DD model detections, as discussed above, and this might have biased the results. However, all the models were evaluated on these same datapoints that were strategically chosen for their importance, and there are instances where some of the supervised models outperformed the original DD model. It is not clear how a different data selection strategy would have affected the results, and we argue that training and testing the supervised models on a dataset of the same size with randomly selected snapshots would likely decrease the performance of the supervised models. Finally, we did not compare the hybrid model to a purely expert-authored model, and we did not measure the time taken by experts to modify the data-driven rules. These comparisons would require hiring experts to author rules and performing a time analysis, which is beyond the scope of this paper.

In summary, this work proposes a new paradigm for 'humanizing' data-driven subgoal detection for novice programming through an iterative refinement process. We (1) extract data-driven subgoals from student work, (2) give them human labels, (3) collect more data from students programming with the labels and subgoal detectors, and (4) present experts with the labels and interpretable detectors, along with student behavior data, so they can add expert constraints. This process ensures that humans are involved in every step of the creation of automatic subgoal detectors, offering the advantages of reflecting real student behaviors while limiting and focusing expert authoring time. Our results show that this hybrid humanized model outperforms fully data-driven models and state-of-the-art supervised learning models. This proposed paradigm can be used to create humanized automatic subgoal detection for tasks where it may be too expensive to create full expert models, but where such detection is important for student learning, motivation, and retention.
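Both the inter-rater reliability in Section 4 and the model-expert agreement in Section 5 are reported as Cohen's kappa. As a small self-contained sketch of that statistic (the ratings below are hypothetical, not the paper's data):

```python
# Cohen's kappa: agreement between two raters, corrected for chance agreement.
def cohens_kappa(r1, r2):
    n = len(r1)
    # Observed proportion of snapshots on which the two raters agree.
    p_observed = sum(a == b for a, b in zip(r1, r2)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    p_expected = sum((r1.count(c) / n) * (r2.count(c) / n)
                     for c in set(r1) | set(r2))
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical binary subgoal labels (1 = present) from an expert and a model.
expert = [1, 1, 0, 1, 0, 0, 1, 0]
model  = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(expert, model)  # 0.5 here: moderate agreement
```

A kappa of 0 means agreement no better than chance and 1 means perfect agreement, which is why the paper treats 0.5-0.67 as only moderate-to-substantial despite high raw agreement.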
