Automatic Assessment of the Design Quality of Python Programs with Personalized Feedback

J. Walker Orr                      Nathaniel Russell
George Fox University              George Fox University
[email protected]                  [email protected]

ABSTRACT
The assessment of program functionality can generally be accomplished with straightforward unit tests. However, assessing the design quality of a program is a much more difficult and nuanced problem. Design quality is an important consideration since it affects the readability and maintainability of programs. Assessing design quality and giving personalized feedback is a very time-consuming task for instructors and teaching assistants, which limits personalized feedback to small class settings. Further, design quality is nuanced and difficult to concisely express as a set of rules. For these reasons, we propose a neural network model to both automatically assess the design of a program and provide personalized feedback to guide students on how to make corrections. The model's effectiveness is evaluated on a corpus of student programs written in Python. The model has an accuracy rate from 83.67% to 94.27%, depending on the dataset, when predicting design scores as compared to historical instructor assessment. Finally, we present a study where students tried to improve the design of their programs based on the personalized feedback produced by the model. Students who participated in the study improved their program design scores by 19.58%.

Keywords
Assessment, neural networks, intelligent tutoring

1. INTRODUCTION
Recently there has been a lot of work on the development of tools for education in programming and computer science. In particular, there are many intelligent tutoring systems designed to help students learn how to solve a programming challenge. The tutoring involved is primarily focused on suggesting functional improvements, that is, how to finish the program so that it works correctly.

Intelligent tutors such as [4] use reinforcement learning to predict a useful hint in the form of an edit to a student's program that will get them one step closer to the goal of a functioning program. The model is trained on histories of edits made by students, starting with a blank slate and ultimately terminating with a functional program. The system is based on the Continuous Hint Factory [9], which uses a regression function to predict a vector that represents the best hint and then translates that vector into a human-readable edit. Similarly, [11] used a neural network to embed programs and predict program output; using that model of program output, an algorithm was developed to provide feedback to the student on how to correct their program. Also, [16] use a recurrent neural network to predict student success at a task given a history of the student's program submissions.

All these systems model student programs from Hour of Code [2].
Hour of Code is a massive open online course platform that teaches people how to code with a visual programming language. The language is simple and does not contain control constructs such as loops.

Moreover, the combination of language and problem setting is simple enough that there is a single functional solution, or very few, for each problem [4]. This level of simplicity precludes the consideration of program design. However, for general-purpose programming languages such as Python, there are many ways of creating functionally equivalent programs. It is important for the sake of maintainability, modularity, clarity, and re-usability that students learn how to design programs well.

When it comes to the quality of design, there are varying standards. Further, some standards are more objective or easier to precisely identify than others. For example, the use of global variables is both widely recognized as poor design and easy to identify. For some programming languages, "linters" exist to apply rules that check for common design flaws.
For Python, Pylint [12] is a code analysis tool that detects common violations of good software design. It detects design problems such as the use of global variables, functions that are too long or take too many arguments, and functions that use too many variables. Pylint is designed to enforce the official standards of the Python programming community codified in PEP 8 [15].

Walker Orr and Nathaniel Russell. "Automatic Assessment of the Design Quality of Python Programs with Personalized Feedback". 2021. In: Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021). International Educational Data Mining Society, 495-501. https://educationaldatamining.org/edm2021/ EDM'21, June 29-July 02 2021, Paris, France.

There are aspects of good design that are difficult to identify. For example, simple logic is a good design idea, but is quite nebulous. The complexity of a program's logic is contextual; it entirely depends on the problem the program is solving. Also, modularity is universally judged as a quality of good design, however it is not always clear to what extent a program should be made modular. How many functions or classes are too many? Again, it depends on the context of the problem for which the program is designed.

In a professional setting, code reviews are often practiced to promote quality design that goes beyond the straightforward rules of "linters." Code reviews are a manual process which requires a lot of human effort. A recently developed system called DeepCodeReviewer [5] automates the code review process with a deep learning model. Using proprietary data on historical code reviews taken from a Microsoft software version control system, DeepCodeReviewer was trained to successfully annotate segments of C# code with useful comments on the code's quality.

Figure 1: The model of program design quality, a feed-forward neural network. The "Input Layer" is the feature vector created from the AST.
The "Hidden Layer" corresponds to the calculation of x'_j specified in Equation 1. Finally, the "Output Layer" produces a single value, the design score, as found in Equation 2.

However, to our knowledge, there is no system that performs in-depth code analysis for the purposes of evaluating and assessing design for general-purpose languages in an educational context. The process of assessing the design of a program is time consuming for instructors and teaching assistants, and it is an important component of a complete intelligent tutoring system. Such a system needs to be adjusted or calibrated for the context of particular problems or assignments, since important aspects of software design are context dependent. Moreover, the system needs to match the particular standards of an instructor. Hence we propose a system that models design quality with a neural network trained on previously assessed programs.

1.1 Our Approach and Contribution
We propose a design quality assessment system based on a feed-forward neural network that utilizes an abstract syntax tree (AST) to represent programs. The neural network is a regression model that is trained on assessed student programs to predict a score between zero and one. Each feature the model uses is designed to be meaningful to human interpretation and is based on statistics collected from the program's AST. We intentionally do not use deep learning, as it would make the representation of the program difficult to understand. Personalized feedback is generated based on each feature of an individual program. By swapping a feature's value for an individual program with the average feature value of good programs, it is possible to determine which changes need to be made to the program to improve its design. The primary contributions of this work are the following:

• The first system to explicitly predict the design quality of programs in an educational setting, to the best of our knowledge.
• High efficacy, with an accuracy from 83.67% to 94.27%, with only small amounts of training data required.
• The first intelligent tutoring system for design quality for Python.
• Personalized feedback without explicit feedback training or annotation.

2. METHOD
The task is to predict a design quality score for a student program written in Python. The score y is a real number between zero and one. The program is represented by a feature vector \vec{x} produced by the output of a series of feature functions computed from the program's AST.

For an AST T, the outputs of a series of feature functions f_i(T) are concatenated into a feature vector \vec{x} that represents key aspects of the program's design. The model g(\vec{x}; \Theta) is a feed-forward neural network with a single hidden layer. It is a regression model that predicts the score y based on the feature vector \vec{x} and parameters \Theta.

2.1 Features
Despite recent advances in deep learning, we chose to represent the student program with feature functions computed on its AST. Deep learning is highly effective at learning useful feature representations of everything from images to time series to natural language texts. However, deep learning also requires large amounts of data, and in this setting the quantity of manually annotated student programs is limited.

Additionally, ASTs are a natural and effective means of representing and understanding programs, and they can be created with free, available tools. An AST is an exact representation of the source code of a program based on the programming language's grammatical structure. Producing an AST representation of a program is an essential first step in compilers and interpreters. The AST of a program contains all the content of its source but is also augmented with the syntactic relationships between every element. A parser and tokenizer that produce an AST for the Python programming language are provided by its own standard library. This makes the AST the natural representation to use, since it is free, convenient, exact, interpretable, and does not require any additional data.
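As a concrete illustration (our own sketch, not part of the original system), the standard-library ast module mentioned above parses source text into a tree whose nodes can be walked to collect simple statistics:

```python
import ast

# A small student-style program to analyze (illustrative only).
source = """
def avg(values):
    total = 0.0
    for value in values:
        total += value
    return total / len(values)
"""

# ast.parse produces the AST; ast.walk visits every node in it.
tree = ast.parse(source)
num_functions = sum(isinstance(n, ast.FunctionDef) for n in ast.walk(tree))
num_loops = sum(isinstance(n, (ast.For, ast.While)) for n in ast.walk(tree))

print(num_functions, num_loops)  # -> 1 1
```

Counting node types this way is exactly the kind of cheap, interpretable statistic the feature functions below are built from.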
In contrast, deep learning would require a large amount of data to effectively reproduce the same representation.

Prior to representation as an AST, a program must first be broken into a series of tokens via the process of lexicalization. Lexicalization is the process of reading a program character-by-character and dividing it into word-like tokens. These tokens are also assigned a type, such as a function call or a variable reference. The grammatical rules of the language are applied to the lexicalized program to create the AST. In an AST, the leaf nodes of the tree are the program's tokens, while the interior nodes correspond to syntactic elements and constructions. For example, an interior node could represent the body of a function or the assignment of a value to a variable. An example of an AST is found in Figure 2.

Figure 2: An abstract syntax tree for a segment of Python code that computes the average of the values in a list. The AST has been condensed for the sake of brevity and the restrictions of space. The related code segment is shown in blue:

    def avg(values):
        total = 0.0
        for value in values:
            total += value
        return total / len(values)

Given that an AST is a complete representation of a program, it is a natural basis for assessing the quality of a program's design. Deep learning may be able to automatically learn the same key syntactic relationships with enough data; however, this information is simply available via the AST. Further, features computed from the AST are human interpretable, unlike a representation produced by deep learning.

The features we created are all based on statistics collected from a program's AST. Some consist of simply counting the number of nodes of a given type, for example, the number of user-defined functions. Other feature functions are based on subsections of the AST, such as the number of nodes per line or per function. Finally, some features are ratios or percentages, such as the average percentage of lines in a function that are empty. All of the features are relatively simple and fast to compute, yet generally capture the design and quality of a program. Each of the feature functions f_i(T) we defined can be found in Appendix A.

2.2 Model
The model is a feed-forward neural network [13] with a single hidden layer and a single neuron in the output layer. The model's structure is illustrated in Figure 1. The values of the input layer are the feature vector \vec{x}. Each neuron in the hidden layer, x'_j, is defined by the following equation:

    x'_j = \mathrm{ReLU}\Big( \sum_{i=1}^{d} w_{i,j} x_i \Big)    (1)

where d is the dimension of \vec{x} and the w_{i,j} \in \Theta are the parameters, the "weights," of the neuron. We use the ReLU [8] as the activation function for the hidden layer neurons. The final prediction of the design score is made by the output layer's single neuron:

    y = \sigma\Big( \sum_{j=1}^{d'} w_j x'_j \Big)    (2)

where d' is the number of hidden layer neurons and the w_j \in \Theta are the weights of this neuron. The sigmoid function \sigma is used because its domain spans (-\infty, \infty) but its range is [0, 1], which guarantees the model always outputs a valid score. The model is trained with mean squared error as the loss function:

    \mathrm{MSE} = \frac{1}{n} \sum_{k=1}^{n} (y^*_k - y_k)^2    (3)

where y^*_k is the ground-truth design score for the k-th instance, i.e. program, and n is the number of instances in the training data. The model is trained with the ADAM algorithm [7], and each parameter in the model is regularized according to its L2 norm [6]. For all our experiments, a hidden layer of size 32 was used. The model was trained for 250 epochs, and the model from the best round according to a development set was selected for our experiments.

2.3 Ensemble
Because fitting a neural network to data is a local optimization problem, the effect of the initial values of the model's parameters \Theta remains after training. The process of training a neural network will produce a different model given the same data. This variation in the results of a trained model is particularly pronounced when the training data set is relatively small. To address this variation and mitigate its impact, an ensemble of models can be trained, each with different initial parameter values. Each model is independently trained, and a single prediction is made by a simple average of the individual predictions, i.e.

    y = \frac{1}{m} \sum_{l=1}^{m} y_l    (4)

where m is the number of models and y_l is the prediction of the l-th model. For our experiments, an ensemble of 10 models was used.

2.4 Personalized Feedback
The goal of intelligent tutors is to provide personalized feedback and suggestions on how to improve a program. The most straightforward means of providing feedback would be to simply predict which possible improvements apply to a given program. However, training a model to directly predict relevant feedback would require a dataset of programs with corresponding feedback, and such a dataset can be hard to find or expensive to construct.

In order to avoid the need for a dataset with explicit feedback annotation, we use our model, trained to predict the design score, to evaluate how changes in program features would lead to a higher assessed score. Using the training data, we compute an average feature vector \bar{x} of all the "good" programs, i.e. those with a design score greater than 0.75. To generate feedback for a program, its feature vector \vec{x} is compared to the average \bar{x}. For each feature, a new vector \vec{x}'_i is created by replacing the feature value x_i with the average's value \bar{x}_i. This process sets up a hypothesis: what if the program were closer to the average "good" program with regard to a particular feature? To answer this, the trained model g(\vec{x}; \Theta) is used to predict a design score for the new vector, i.e. y'_i = g(\vec{x}'_i; \Theta). By comparing the original score of the program y with the new score y'_i, the hypothesis can be tested. If the new score y'_i is greater than the original predicted score y, then the alteration of x_i to be closer to \bar{x}_i is an improvement. Feedback based on this alteration is recommended to the student as personalized feedback. Since each feature in \vec{x} is understandable to a human, feedback is given in the form of a suggestion to increase or decrease particular features. The direction of the suggestion is based on the comparison of x_i versus \bar{x}_i: if x_i > \bar{x}_i, the feedback is to decrease x_i; in the other case, where x_i < \bar{x}_i, the feedback is to increase x_i. Based on the feature and the direction, a user-friendly sentence is selected from a table of predefined responses. For example, if x_i is the number of user-defined functions and x_i = 3, \bar{x}_i = 5, and y'_i > y, then the feedback "increase the number of user defined functions" is created.

3. EXPERIMENTS
The system was evaluated in two different experimental settings. The first evaluation is a direct test of the model's accuracy on known design scores. For this, several different datasets and settings were compared against several baselines. The second evaluation is a small study of how students responded to the system's feedback. Students were given feedback on the quality of their programs based on the model. They were given the chance to correct their programs after receiving feedback and have them manually reassessed.

3.1 Dataset
The dataset was collected over three years from an introduction to computer science course which teaches Python 3. It consists of four separate programming assignments which involve a wide range of programming skills. The simplest is "Travel," an assignment that involves the distance a vehicle travelled after going a constant speed for a specified duration. There are 118 student programs for "Travel." The next assignment, "Budget," is a budgeting program that lets a user specify a budget and expenses and determines if they are over or under their budget. 168 student programs were collected for "Budget". The third assignment in the dataset consists of creating a program to play "Rock-Paper-Scissors" against the computer. For this assignment, there are 111 student programs. The last assignment is programming the classic casino game "Craps," which involves rolling multiple dice and placing different types of bets and wagers. This assignment has 120 collected student programs.

All the assignments require the student to write the program from scratch in Python 3. The programs are to have a command-line, text-based interface and user input validation. Students are required to use if-statements, loops, and user-defined functions. The "Craps" program also requires the student to do exception handling and file I/O: a requirement of the program was to maintain a record of winnings across sessions of playing the game, hence the results had to be stored to a file. Also, the standards of design quality go up as the course progresses, and since "Craps" is the last assignment, it has the highest standards. Each student program has an associated design score that was normalized to a value between zero and one.

3.2 Baseline Methods
The model is compared against a variety of baseline regression methods. The simplest is linear regression, which simply learns a weight per feature. Next is a regression decision tree, which is trained with the CART algorithm [1]. It has the advantage over linear regression that it can learn non-linear relationships. Non-linearity means a model can learn "sweet spots" rather than simply having a "more is better" understanding of some features. For example, having some modularity in the form of user-defined functions is good; however, too many is cumbersome. The "correct" number of user-defined functions likely falls into a relatively small range. Model selection on the maximum depth of the tree with a development set was used to determine that 10 was the best setting.

However, both of these models have the issue that they are not constrained to produce a score between zero and one; their prediction can be any real number. Hence another baseline method was used, created to be an intermediary step between linear regression and the neural network model. It is a linear model with a sigmoid transformation, which guarantees that the output is between zero and one. This model is effectively the final layer of the neural network model, i.e. the neural network without the hidden layer. The model is specified by the equation:

    y = \sigma\Big( \sum_{i=1}^{d} w_i x_i \Big)    (5)

This model is also trained with ADAM [7]. All the baseline models and the neural network are trained with MSE as the loss function.

Table 1: Design Score Prediction Results

    Method                 | Travel           | Budget           | RPS              | Craps            | Combined
                           | MSE     Accuracy | MSE     Accuracy | MSE     Accuracy | MSE     Accuracy | MSE     Accuracy
    Linear Regression      | 0.009   93.09%   | 0.032   87.43%   | 0.038   83.2%    | 0.043   84.06%   | 0.027   87.03%
    Decision Tree          | 0.018   90.10%   | 0.031   87.26%   | 0.078   77.33%   | 0.072   79.41%   | 0.076   81.42%
    Sig. Linear Regression | 0.022   89.60%   | 0.046   85.13%   | 0.086   77.91%   | 0.063   81.43%   | 0.070   80.64%
    Neural Network         | 0.007   93.48%   | 0.024   88.48%   | 0.041   83.9%    | 0.08    79.57%   | 0.033   85.61%
    Ensemble               | 0.005   94.27%   | 0.022   90.14%   | 0.022   87.66%   | 0.053   83.67%   | 0.031   86.99%

3.3 Results
The model was compared against each baseline in five different settings: the "Travel", "Budget", "Rock-Paper-Scissors", and "Craps" programs, and a combined dataset which includes all the programs. The results of the experiments can be found in Table 1. Each model is evaluated according to two different metrics: MSE and average accuracy. Average accuracy is defined as \frac{1}{n} \sum_{k=1}^{n} (1 - |y_k - y^*_k|).

Overall, the decision tree and the sigmoid-transformed linear model were clearly the two worst models. This was surprising, since decision trees are generally thought to be strictly more powerful than linear models. However, decision trees look for highly discriminative features to partition the data into more consistent groups. The under-performance of the decision tree possibly indicates that none of the features were especially indicative of good or bad design on their own. Instead, the quality of a program is better described by a collection of subtle features, which gives credence to the belief that design quality is nuanced.

The reason the sigmoid-transformed linear model under-performed linear regression was likely that it was trained with ADAM. ADAM does not guarantee convergence to a global optimum like the analytical solution to linear regression does. Apparently the restriction of predictions to the specified range of zero to one was not important.

One noticeable pattern was that all the models performed better on the "Travel" and "Budget" datasets than on the "RPS", "Craps", and combined datasets.
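For clarity, the two metrics reported in Table 1 can be sketched in a few lines of Python (the function names are ours, for illustration only):

```python
# Sketch of the two evaluation metrics used above; not the authors' code.

def mse(y_true, y_pred):
    # Mean squared error, as in Equation 3.
    n = len(y_true)
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n

def average_accuracy(y_true, y_pred):
    # Average accuracy: (1/n) * sum of (1 - |y_k - y*_k|).
    n = len(y_true)
    return sum(1 - abs(t - p) for t, p in zip(y_true, y_pred)) / n

truth = [0.9, 0.5, 0.7]
preds = [0.8, 0.6, 0.7]
print(round(mse(truth, preds), 4))               # -> 0.0067
print(round(average_accuracy(truth, preds), 4))  # -> 0.9333
```

Note that average accuracy rewards predictions for being close to the ground-truth score rather than exactly equal, which suits a continuous design score.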
Universally, the most difficult dataset was "Craps," which likely lowers the accuracy on the combined dataset. Due to the shifting standards and expectations of student assignments, a model per assignment appears to be worthwhile. This is a bit counter-intuitive, since there are many common standards and expectations across assignments.

Linear regression did surprisingly well, beating both the neural network and the network ensemble on the "Craps" and combined datasets, though barely. In those cases, the difference between the ensemble and linear regression was less than a percent. This is likely due to the stability of linear regression's predictions. Though linear regression does not have the power and flexibility of neural networks, this can also be a benefit by limiting how wrong its predictions are. Neural networks and even ensembles can make overconfident predictions on outliers or other unusual cases. The "Craps" dataset contained the most complex programs, and it is likely that a handful of predictions significantly brought down the average.

The network ensemble outperformed the single neural network in every case, which is to be expected. The margin of improvement of the ensemble over the single neural network in accuracy on the four individual program datasets ranged from 1% to 4%. The importance of using an ensemble is evident on the "Craps" dataset, where the individual neural network under-performed significantly. On the other datasets, the neural network outperformed linear regression by a small margin, but on "Craps" the neural network model under-performed the linear regression model by 5%. Again, this is most likely due to the instability and variability of neural network predictions, i.e. small differences in features can lead to a large difference in the prediction. On the "Craps" dataset, the improvement of the ensemble over the single neural network illustrates the relative stability of the ensemble's predictions. In every case, the ensemble is superior to the single neural network, and it had the best overall performance, producing the most accurate results on three of the datasets and effectively tying for the best on the other two.

Overall, the network ensemble produced reliable, accurate results when trained per dataset. The accuracy of the ensemble is arguably close to being useful in practical application. Further, comparing the scores of one instructor against another instructor, or even against themselves, the rate of agreement must be less than 100%; with an accuracy ranging from 83.67% to 94.27%, the network ensemble is possibly close to a realistic ceiling.

3.4 Feedback Study
In order to evaluate the effectiveness of the personalized feedback, we conducted a small study of the effect of the feedback on the design score of student programs. For the "Rock-Paper-Scissors" program, the network ensemble was used to generate personalized feedback for the student programs instead of the usual instructor feedback. The network ensemble was the same as used in the design score experiments; it was trained on prior years' worth of student programs. Having received the personalized feedback, students opted in to correcting their program for extra credit on their assignment. The feedback was in the form of a series of comments, where each comment was "increase" or "decrease" followed by the name of a feature, as described in Section 2.4.

The class is an introduction to computer science course with multiple sections and two different instructors. Students from both instructors participated in the study. Out of 73 students enrolled across the sections of the course, 15 students chose to opt in.

The revised programs were assessed again manually for design quality, and the scores were compared against the originals. The design score of the programs started at an average of 68.33%, and after the feedback and correction the average rose to 87.92%, a 19.58% absolute improvement. Using a paired t-test, the improvement was judged to be significant with a p-value of 0.001.

The results of the study suggest the feedback was generally useful in guiding students to improve the design quality of their programs. The improvement was noticeable to the instructors anecdotally as well. For example, the usage of global variables and "magic numbers" decreased significantly. The study does have some caveats, though, including its small sample size and opt-in participation. It could be that those students willing to opt in are those most willing or able to improve with a second chance.

4. CONCLUSIONS & FUTURE WORK
Overall, we proposed a neural network ensemble for predicting the design quality score of a student program and experimentally demonstrated its effectiveness. Further, our system provided personalized feedback based on the difference between a program's feature values and the average feature values of "good" programs. A small study provides evidence that the feedback was of practical use to students. Students were able to improve their programs significantly based on the feedback they received.

There is also evidence that training models per assignment is most effective. However, the model needs to be evaluated on more programming assignments. Further, there is a possibility of utilizing transfer learning [10] to help the model learn what is in common across the assignments.

The feedback given was shown to be effective, but more nuanced feedback could be useful. Specifically, feedback targeted at individual lines or segments of code would possibly help students improve their programs more effectively. However, this may require additional supervision, i.e. annotation for explicit training. Active learning [14] or multi-instance learning [3] may be alternatives to gathering additional annotation.

5. REFERENCES
[1] L. Breiman, J. Friedman, C. J. Stone, and R. A. Olshen. Classification and regression trees. CRC Press, 1984.
[2] Code.org. Code.org: Learn computer science. https://code.org/research.
[3] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1-2):31–71, 1997.
[4] A. Efremov, A. Ghosh, and A. Singla. Zero-shot learning of hint policy via reinforcement learning and program synthesis. International Educational Data Mining Society, 2020.
[5] A. Gupta and N. Sundaresan. Intelligent code reviews using deep learning. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'18) Deep Learning Day, 2018.
[6] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
[7] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In the 3rd International Conference on Learning Representations (ICLR), 2014.
[8] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.
[9] B. Paassen, B. Hammer, T. W. Price, T. Barnes, S. Gross, and N. Pinkwart. The Continuous Hint Factory - providing hints in vast and sparsely populated edit distance spaces. Journal of Educational Data Mining, 2018.
[10] S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2009.
[11] C. Piech, J. Huang, A. Nguyen, M. Phulsuksombati, M. Sahami, and L. Guibas. Learning program embeddings to propagate feedback on student code. In International Conference on Machine Learning, pages 1093–1102. PMLR, 2015.
[12] Pylint.org. Pylint - code analysis for Python. https://pylint.org.
[13] F. Rosenblatt. Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Technical report, DTIC Document, 1961.
[14] B. Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison Department of Computer Sciences, 2009.
[15] G. Van Rossum, B. Warsaw, and N. Coghlan. PEP 8. https://www.python.org/dev/peps/pep-0008/.
[16] L. Wang, A. Sy, L. Liu, and C. Piech. Learning to represent student knowledge on programming exercises using deep learning. International Educational Data Mining Society, 2017.

APPENDIX
A. FEATURE FUNCTIONS
• The number of functions
• The number of assignments
• AST nodes per function
• Lines of code per function
• Total lines of code
• Number of literals
• The proportion of white-space characters to the total number of characters
• Number of empty lines
• Deepest level of indentation
• Number of "if" statements
• Number of comments
• Number of AST nodes per line of code
• Number of try-except statements
• AST nodes per try-except statement
• AST nodes per "if" statement
• Number of lists
• Number of tuples
• Average line number of literals
• Average line number of function definitions
• Average line number of "if" statements
• Ratio of AST nodes inside functions versus total number of AST nodes
• Number of function calls
• Number of "pass" statements
• Number of "break" statements
• Number of "continue" statements
• Number of global variables
• Number of zero and one integer literals
• Average line number of "import" statements
• Number of numeric literals
• Number of comparisons
• Number of "return" statements
• Maximum number of "return" statements per function
• Maximum number of literals per "if" statement
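As an illustration, a few of the feature functions listed above can be computed with Python's standard ast module. This is our own sketch under the paper's general description, not the authors' implementation, and the function name is ours:

```python
import ast

# Illustrative sketch of a handful of the Appendix A feature functions,
# computed from a program's AST. Not the authors' implementation.
def features(source):
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    num_lines = max(len(source.splitlines()), 1)
    return {
        "num_functions": sum(isinstance(n, ast.FunctionDef) for n in nodes),
        "num_if_statements": sum(isinstance(n, ast.If) for n in nodes),
        "num_return_statements": sum(isinstance(n, ast.Return) for n in nodes),
        "num_function_calls": sum(isinstance(n, ast.Call) for n in nodes),
        "nodes_per_line": len(nodes) / num_lines,  # ratio-style feature
    }

program = """
def avg(values):
    total = 0.0
    for value in values:
        total += value
    return total / len(values)
"""
print(features(program))
```

The resulting dictionary is the kind of feature vector that, once concatenated in a fixed order, could be fed to the regression model described in Section 2.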