Learning student program embeddings using abstract execution traces

Guillaume Cleuziou
University of Orléans, INSA Centre Val de Loire, LIFO EA 4022
Orléans, France
[email protected]

Frédéric Flouvat
University of New Caledonia, ISEA EA 7484
Nouméa, New Caledonia

ABSTRACT
Improving the pedagogical effectiveness of programming training platforms is a hot topic that requires the construction of fine and exploitable representations of learners' programs. This article presents a new approach for learning program embeddings. Starting from the hypothesis that the function of a program, but also its "style", can be captured by analyzing its execution traces, the code2aes2vec method proceeds in two steps. A first step generates abstract execution sequences (AES) from both predefined test cases and the abstract syntax trees (AST) of the submitted programs. The doc2vec method is then used to learn condensed vector representations (embeddings) of the programs from these AESs. Experiments performed on real datasets show that the embeddings generated by code2aes2vec efficiently capture both the semantics and the style of the programs. Finally, as a proof of concept, we show the relevance of the program embeddings thus generated on the task of automatic feedback propagation.

Keywords
Representation Learning, Program Embeddings, Neural Networks, Educational Data Mining, Computer Science Education, doc2vec

1. INTRODUCTION
Increasingly, programming is being learned through the use of online training platforms. Typically, learners submit their code and the platform returns any syntax or functional errors (the latter typically based on test cases defined by the teacher). The exploitation of these data opens up new perspectives to monitor and help beginners learning programming. They can be used, for example, to identify students who are dropping out, to target bad practices, or to propagate teacher feedback. These functionalities would allow learners to be more autonomous and teachers to be more reactive and efficient in their interventions. However, this exploitation requires a detailed analysis of the submitted programs. Training platforms must go beyond a simple syntactic analysis of the script and take the associated semantics into account. For this purpose, learning program embeddings has recently emerged as a promising area of research [15, 13, 17, 7, 2, 3]. A natural way to generate such condensed vector representations is to consider a computer program as a text and to exploit methodologies inspired by text mining.

Text mining has attracted a lot of interest in recent years. The representation of texts as vectors of real numbers, also called "embeddings", has been at the heart of many recent works. These representations make it possible to project (or "embed") a whole vocabulary into a low-dimensional space. Moreover, such a representation of words makes it possible to exploit a wide variety of numerical processing methods (neural networks, SVMs, clustering, etc.). One of the challenges is then to capture in these representations the underlying semantic relationships (e.g. similarities, analogies). The work of [11], based on neural networks, has been a precursor in this area: their word2vec method is one of the most referenced in the field. Its principle relies on the relation between a word and its context (the words appearing before and after it). The authors propose simple and efficient architectures to learn word embeddings from a corpus of texts. For example, the CBOW (Continuous Bag-Of-Words) architecture trains a neural network to predict each word in a text given its context. Their results show the ability of this approach to extract complex semantic relations (analogies) from simple operations on the learned projections v(), such that v("king") − v("man") + v("woman") ≈ v("queen") or v("Paris") − v("France") + v("Italy") ≈ v("Rome").
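These analogy relations can be checked directly on pretrained vectors. The short sketch below is our illustration, not part of the paper's pipeline; it uses the gensim library and a publicly available set of GloVe vectors (both assumptions of this example) to reproduce the king/queen arithmetic.

import gensim.downloader as api

# Pretrained 50-dimensional GloVe word vectors (downloaded on first use).
wv = api.load("glove-wiki-gigaword-50")

# v("king") - v("man") + v("woman") should land closest to v("queen").
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# e.g. [('queen', 0.85...)]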
The transposition of these approaches to computer programs is not straightforward. Code has certain specificities that need to be integrated to obtain such rich representations [1]. Unlike texts, programs are runnable, and small modifications can have significant impacts on their execution. A program can also call other programs that can themselves call other programs. The context in which an instruction is used is also particularly important for deducing its role. Finally, unlike texts, program syntax trees are usually deeper and composed of repeating substructures (loops). Existing approaches for building program embeddings only partially integrate these specificities. They independently exploit the instructions [3], the inputs/outputs [15], part of the execution traces [17] or the abstract syntax tree (AST) [2]. They focus more on the function of the program (what it does) than on its style (how it does it). Moreover, most of these approaches are supervised and build embeddings for a specific task (e.g. predicting errors, predicting functionality, etc.).

In view of these limitations, we propose the code2aes2vec method, which exploits the instructions, the code structure and the execution traces of programs in order to build finer program embeddings. Figure 1 schematizes the overall approach. The first step generates Abstract Execution Sequences (AES) from the traces obtained by executing the programs on test cases and from the program ASTs. The second step uses the doc2vec neural network [8] (a method derived from word2vec that learns a document embedding from its words) to learn program embeddings from these AESs. Contrary to existing approaches, we therefore propose a generic and unsupervised method that learns program embeddings from functional, stylistic and execution elements. This aspect is crucial in our application, to be able to differentiate programs answering the same exercise (i.e. implementing the same functionality) in different ways (in terms of strategy or efficiency). Our approach is validated on two real datasets composed of several thousands of Python programs from educational platforms. On these datasets, we show that the embeddings generated with code2aes2vec efficiently capture the function and the style of a program. In addition, we present a proof of concept of the use of such embeddings to propagate teacher feedback.

[Figure 1: General scheme of the code2aes2vec method for learning program embeddings: the programs submitted by students are executed on the test cases given by the teacher; the resulting execution traces are combined with the programs' ASTs to produce AESs, from which a feed-forward neural network (with weight matrices W, D and U) learns the program embeddings.]

To summarize, the main contributions of this paper are:

1. the definition of a new (intermediate) program representation, called Abstract Execution Sequences (AES), allowing more semantics to be captured;
2. the exploitation of these intermediate representations with doc2vec to build program embeddings in an unsupervised way;
3. the diffusion to the community of two enhanced datasets from educational platforms in computer science;
4. a proof of concept on the use of such program embeddings for feedback propagation for educational purposes.

The next section details existing works in the field and the originality of our approach in relation to them. Section 3 presents our two-step method: the construction of abstract execution sequences (AES) and the learning of embeddings from them. Section 4 is devoted to the qualitative and quantitative evaluations of the learned representations, before drawing up the many perspectives of this work (Section 5).
2. RELATED WORKS
Learning representations from programs is at the heart of many recent works. They aim to embed this data in a semantic space from which further analysis can be conducted, because the generated representations (vectors of real values) are directly exploitable by a large part of the learning algorithms. To do this, recent work relies heavily on methods developed to build word embeddings in texts, while trying to integrate the specificities of code.

These program embeddings are then used for prediction or analysis tasks related to software development (debugging, API discovery, etc.) and to teaching (learning programming). Two types of embeddings are more particularly studied: embeddings of the elements composing a program (words, tokens, instructions, or function calls) [12, 6, 7], and embeddings of the programs themselves [15, 17, 3, 4].

Nguyen et al. [12] study API call sequences to derive an API embedding that is independent of the programming language (thus allowing translation from one language to another). This embedding is learned by word2vec [11], which builds a word embedding from the words appearing nearby in each text (i.e. its context), applied to call sequences extracted from several million methods. Two datasets, derived from the Java JDK and from more than 7,000 recognized C# projects from GitHub (more than 10 stars), are used for learning.

De Freez et al. [6] have a close objective, namely to build an embedding of the functions used in a code in order to find synonymous functions. However, the function calls made in each program are extracted in the form of a graph depending on the control structures. Random walks are then performed in this graph for each function, and the extracted paths are used as input to word2vec [11] to derive an embedding of the functions. An extract of two million lines of the Linux kernel is used to learn the embedding.
In [7], the authors propose a relatively similar approach but replace function calls with abstract instructions. Each instruction sequence represents a path in the code depending on the conditional instructions; in this way, it is similar to a trace, even though the code is never executed. All possible paths are extracted, but loop repetitions are ignored and only the most frequent instructions are considered (a threshold of 1,000 occurrences is used in the experiments). Moreover, only certain constant values are considered, to limit the size of the vocabulary. An embedding of these program instructions is then learned by the GloVe method [14], based on the co-occurrence matrix of the words. The authors use a corpus of 311,670 procedures (in C) from the Linux kernel, and evaluate it on a dataset of 19,000 pre-identified analogies.

In many applications, it is not just a matter of considering the program's components but the program as a whole. Recent work has therefore focused on the construction of program embeddings.

Following a teaching motivation, Piech et al. [15] construct student program embeddings and use them to automatically propagate teacher feedback (using the k-means algorithm). The embedding space is built from a neural network trained to predict the output of a program from its input; it thus captures the functional aspect of the code. The authors also try to capture the style of programs, using a recursive neural network based on the program's AST. Contrary to other approaches, the generated embeddings are matrices, not vectors, which limits their exploitation as inputs for subsequent analyses. They also do not consider learner-defined variables. The obtained representations actually capture the code function quite well, but fail to capture its style. The analyzed programs (from the Hour of Code site and a course at Stanford) are written in a language similar to Scratch and allow operations in a labyrinthine world.
In [17], the authors highlight the limitations of syntax-based approaches for capturing the semantics of a program. Instead, they propose to consider the trace resulting from the code execution, and more precisely the values of the most frequent variables. Different representations are proposed and used to train a recurrent neural network whose objective is to predict the errors made by students in a programming course. The embeddings of the programs correspond to one of the neural network's layers. The authors put forward one representation more particularly, considering the trace of each variable independently and integrating the dependencies between variables in the structure of the neural network. However, the obtained embeddings are specific to one task, and this method requires redefining the neural network architecture, with re-training, for each exercise.

Finally, Alon et al. [2] propose a neural network to predict the name of a method (i.e. its functionality) from its code. To do this, the program is first decomposed into a collection of paths (from one leaf to another) in the AST. Only the most frequent paths in the dataset are used as features (size constraints are also integrated). Then, the network learns which paths are important for predicting the method name, using the attention principle. The parameters of the trained neural network correspond partly to the final embeddings and partly to the weights supposed to quantify the importance of each (feature) path for the prediction task. Training is performed on a corpus of more than 13 million Java programs from GitHub's 10,072 most popular projects. As mentioned by the authors, this approach requires a large number of input programs. Furthermore, it is not possible to predict the function (and embedding) of a program whose paths do not appear in the training set. The embeddings produced capture information and semantic relationships about the function of the code, but ignore style variants: two programs with the same function will be similar, regardless of how they have been coded. The quality of the analyzed code also has an impact on learning; the names given to the variables are particularly important for prediction.

3. THE code2aes2vec METHOD
Two main strategies emerge for learning program embeddings: observing the results of program execution [15, 17], or analyzing the script [3] and/or its AST [2]. Our approach is at the intersection of these two strategies and thus aims to take advantage of both the functional and the syntactic descriptions of the programs to induce relevant embeddings. We thus propose the code2aes2vec method, which proceeds in two steps:

1. the code2aes step represents a program as an Abstract Execution Sequence (AES), corresponding to the AST paths used by the program during its execution on predefined test cases;
2. the aes2vec step uses a neural network to construct the embedding of the programs based on their AESs (using the doc2vec approach [8]).
3.1 code2aes: construction of Abstract Execution Sequences (AES)
Translating a program into an AES requires providing, in addition to the program itself, a collection of test cases on which the program will be run in order to exploit its traces. In our educational context, the preparation of such a collection is not an additional effort, since test cases are generally integrated into training platforms to evaluate the submitted contributions. Moreover, this approach offers teachers the possibility of introducing verification choices, and thus of driving the interpretation of their learners' programs according to their own pedagogical choices. For example, consider an exercise whose objective is to find a value in a list. A teacher wishing to emphasize algorithmic efficiency may choose to integrate a few unit tests for which the desired value appears early in the list; in such a case, an efficient program stops the loop as soon as the desired value is found. These test cases will thus make it possible to distinguish two (valid) programs based on their execution traces.

In practice, the number of test cases provided by the teacher is quite small: we generally observe that fewer than ten test cases are enough to evaluate whether a program is correct or not.

Figure 2 illustrates the process of translating a program into an AES. This example considers as input the code submitted by a learner in response to the exercise "write a Python function that returns the minimum value in an input list". First, the AST is constructed. It describes the syntactic structure of the program in terms of control structures (if-else, for, while), function calls (call), assignments (assign), etc. Second, the code is executed on an example (here the input [12, 1, 25]) and its execution trace is kept, indicating the program lines successively executed. Finally, the AES is constructed by mapping these two levels of information, syntactic and functional: the sequence resulting from the trace is translated into a sequence of "words" extracted from the nodes of the AST.

[Figure 2: Construction of an AES from a Python program and a given test case. The submitted program

    1  def minimum(liste):
    2      if len(liste) == 0:
    3          res = None
    4      else:
    5          res = liste[0]
    6          for i in range(0, len(liste)):
    7              if liste[i] < res:
    8                  res = liste[i]
    9      return res

executed on the test case minimum([12, 1, 25]) produces the execution trace 2, 5, 6, 7, 6, 7, 8, 6, 7, 6, 9 (the successively executed lines); each executed line is then translated into AST symbols at depth level 0 (If, Assign, For, ..., Return), level 1 or level 2.]

Three levels of translation (or abstraction) are proposed, according to the depth considered in the AST:

• AES level 0: each program line is represented by a single word corresponding to the head symbol of the associated sub-tree in the AST (in red in Figure 2);
• AES level 1: each program line is represented by one or more words corresponding to the head symbols of the associated sub-tree and its main sub-trees (in red and blue in Figure 2);
• AES level 2: each program line is translated into a sequence of words corresponding to all the nodes appearing in the associated sub-tree (in red, blue and black in Figure 2).

For the last two levels, the names of the variables and parameters, as well as the values of the constants, are normalized so as not to artificially extend the considered "vocabulary". Thus, the variable res is renamed var1 and the variable i is renamed var2.

Note that each execution of a program on one test case generates a partial AES. A program is finally represented by the concatenation of the partial AESs obtained on each test case. An AES can thus be considered as a text representative of the program, in which each partial AES corresponds to a sentence.
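To make the code2aes step concrete, here is a minimal sketch of level-0 AES extraction for a single test case (our illustration, not the authors' released implementation). It maps each source line to the head symbol of the statement starting there, then records the lines actually executed using Python's built-in tracing hook. The helper names (line_symbols, aes_level0) are ours, and the sketch assumes one statement per line and omits the normalization and the concatenation over several test cases described above.

import ast
import sys

def line_symbols(source):
    # Map each line number to the head symbol of the statement starting
    # on that line (ast.walk visits enclosing statements first).
    symbols = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            symbols.setdefault(node.lineno, type(node).__name__)
    return symbols

def aes_level0(source, func_name, args):
    # Load the submission, then trace which lines one test case executes.
    symbols = line_symbols(source)
    namespace = {}
    exec(compile(source, "<submission>", "exec"), namespace)
    executed = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_filename == "<submission>":
            executed.append(frame.f_lineno)
        return tracer

    sys.settrace(tracer)
    try:
        namespace[func_name](*args)
    finally:
        sys.settrace(None)
    # Fall back to "Expr" for any traced line without a mapped statement.
    return [symbols.get(line, "Expr") for line in executed]

source = """\
def minimum(liste):
    if len(liste) == 0:
        res = None
    else:
        res = liste[0]
        for i in range(0, len(liste)):
            if liste[i] < res:
                res = liste[i]
    return res
"""
print(aes_level0(source, "minimum", ([12, 1, 25],)))
# ['If', 'Assign', 'For', 'If', 'For', 'If', 'Assign', 'For', 'If', 'For', 'Return']

Run on the program of Figure 2, this reproduces the level-0 partial AES shown there; levels 1 and 2 would additionally descend into each statement's sub-trees.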
3.2 aes2vec: learning program embeddings from AESs
The word2vec method [11] is based on the distributional hypothesis [16]: a word can be inferred from its context. For example, the CBOW (Continuous Bag-Of-Words) version of word2vec trains a feed-forward neural network to predict a (central) word from its context, the context being defined by the preceding and following words in the text. The structure of the neural network is reduced to a single hidden layer (encoding); the matrix W of the weights connecting the input layer to the hidden layer contains the word embeddings.

word2vec has already been used to learn token embeddings from computer programs [6, 3]. However, the distributional hypothesis seems less well satisfied on the tokens of a program than on natural language, in particular because of the very limited size of the vocabulary and a weakly constrained compositionality (almost all combinations are observed).

[8] proposed a variant of word2vec aiming to learn simultaneously the embeddings of the words and of the documents from which they are extracted. Their doc2vec method is still based on the distributional hypothesis (predicting a word knowing its context), but this time the context integrates, in addition to the preceding and following words, the identifier of the document from which the word sequence comes. In doing so, the authors introduce the idea that there are document-specific variations in the natural/universal distribution of words in the language.

We exploit precisely this hypothesis of document-based distributional variations for the processing of the AESs built from the programs: we consider that each program, during its execution, generates different sequences of tokens (AES). We then use the DM (Distributed Memory) version of the doc2vec algorithm to train a feed-forward neural network (with one hidden layer) to maximize the following log-probability:

    L = \sum_{s=1}^{S} \sum_{i=k}^{T_s - k} \log p(w_i^s \mid w_{i-k}^s, \ldots, w_{i+k}^s, d_s)    (1)

with S the total number of documents (or AESs), T_s the total number of words (or tokens) in document s, d_s the s-th document, w_i^s the i-th word in document d_s, and k the size of the context on either side of the target word.

Figure 3 presents the architecture of the neural network used by doc2vec, as we use it for learning program embeddings from their AESs. The forward pass first computes the values of the hidden layer by aggregating the encodings of each word of the context and of the document: h(w_{i-k}^s, ..., w_{i+k}^s; W, D), where h() denotes an aggregation function to be defined (typically a sum, an average or a concatenation), and W and D denote respectively the word embedding matrix (weight matrix between the word inputs and the hidden layer) and the document embedding matrix (weight matrix between the document input and the hidden layer). The output of the neural network can be interpreted as a probability distribution over the words of the vocabulary, obtained by applying a softmax activation on the output y = b + U h(w_{i-k}^s, ..., w_{i+k}^s; W, D), where b is a bias term and U the weight matrix between the hidden and output layers:

    p(w_i^s \mid w_{i-k}^s, \ldots, w_{i+k}^s, d_s) = e^{y_{w_i^s}} / \sum_j e^{y_j}    (2)

Finally, the network weights are updated by stochastic gradient descent on the error, defined as the difference between the obtained output and the one-hot vector encoding the target word.

[Figure 3: The neural network aes2vec used to predict a word w_i from its original AES identifier and its context (here, the two previous and the two following words). The AES identifier is encoded through the document matrix D and the context words (e.g. If, Compare, var1, Subscript) through the word matrix W; their encodings are aggregated (sum, average or concatenation) before an output classifier predicts the target word (e.g. Assign).]

We consider programs as textual documents whose word sequence is given by the AES obtained in the previous step (code2aes). Indeed, as for text, the choice and the order of the "words" in an AES capture the semantics of the program, i.e. what the program does (its function) and how it operates (its style).

In this learning model, each AES vector (in D) is used only for predictions of the tokens of this AES, while the token vectors (in W) are common to all AESs. The size of the vectors (for AESs and tokens) is fixed and corresponds to the size of the desired representation space (embeddings).

Once the model is trained, the matrix D contains the embeddings of the programs. Positioning a new program in this embedding space consists in inferring a new column vector of D using the tokens of the new AES, the other parameters of the model remaining fixed (W as well as the softmax parameters); inference is made by pursuing the learning on the neural network.

Finally, let us mention that the choice of the aggregation strategy used in the hidden layer can be decisive. A sum or an average considers each context as a bag of words (without taking the order into account), whereas the concatenation strategy offers the opportunity to exploit the order of the words within the context. While sequentiality inside the context is not a determining factor when building embeddings for natural language, our experiments will confirm that the order of tokens is of high importance when learning program embeddings from AESs.
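The aes2vec step can be prototyped with an off-the-shelf doc2vec implementation. The sketch below is an assumption of this presentation rather than the authors' published code: it uses gensim's Doc2Vec in DM mode with concatenation, mirroring the hyperparameters reported in Section 4.2 (dimension 100, context of size 2, 500 iterations); aes_corpus is a hypothetical list of token sequences as produced by code2aes.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# One AES per program: a flat token sequence, e.g. produced by code2aes.
aes_corpus = [
    ["If", "Assign", "For", "If", "For", "If", "Assign", "Return"],
    ["If", "Assign", "Return"],
]
documents = [TaggedDocument(words=aes, tags=[f"prog_{i}"])
             for i, aes in enumerate(aes_corpus)]

model = Doc2Vec(documents,
                dm=1,             # Distributed Memory version
                dm_concat=1,      # concatenate context vectors (keeps token order)
                vector_size=100,  # embedding dimension
                window=2,         # context size on either side of the target
                min_count=1,
                epochs=500)

program_embedding = model.dv["prog_0"]  # one column of the D matrix

# Position an unseen program: only its document vector is learned,
# W and the softmax parameters stay frozen.
new_embedding = model.infer_vector(["If", "Assign", "Return"])

The infer_vector call matches the inference procedure described above: learning is pursued on the network, but only the new document vector is updated.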
They (for AES and tokens) is fixed and actually corresponds to consist of Python programs submitted by students on two the size of the desired representation space (embeddings). trainingplatformsinintroductoryprogrammingcourses. In addition to our (documented) code2aes2vec code for learn- Once the model is trained, the D matrix contains the em- ingprogramembeddings,wealsomakeavailable3thesethree beddingsoftheprograms. Thepositioningofanewprogram ”corpora”ofPythonprograms,theassociatedtestcases,and inthisembeddingspaceconsistsininferring2 anewcolumn theAESsbuiltoneachprogram. Alloftheresultspresented vector in D using the tokens from the new AES. The other intherestofthissectioncanthusbefullyandeasilyrepro- parameters of the model remaining fixed (W as well as the duced. softmax parameters). The NewCaledonia-5690 dataset (or NC-5690) includes the Finally, let us mention that the choice of the aggregation programscreatedin2020byagroupof60studentsfromthe strategy used in the hidden layer can be decisive. Indeed, University of New-Caledonia, on a programming training a sum or average will consider each context as a bag-of- platform4. The NewCaledonia-1014 dataset (or NC-1014) words (without taking into account the order), whereas a 3https://github.com/GCleuziou/code2aes2vec.git 2InferenceismadebypurshasingthelearningontheNeural 4Platform developed and made available by the CS depart- Network. ment of the Orl´eans University Institute of Technology. 256 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) w i classifier Assign aggregation (sum, average … or concatenation) current AES word context vector vectors matrix W of matrix D D W W W W words of AES AES Id If Compare var1 Subscript w w w w i-2 i-1 i+1 i+2 Figure 3: The neural network aes2vec used to predict a word w from its original AES identifier and its context (here, two i previousandtwofollowingwords). Table 1: Characteristics of the three Python datasets col- Table2: ExercisesintheNewCaledonia-1014dataset. lected on programming training platforms. The ’Nb. of Exercice Statement #test cases words’reportedisthetotalsizeoftheAEScorpus;thenum- berinparenthesesindicatesthesizeofthe’vocabulary’. swapping swapitemsinalist 4 minimum lookfortheminimuminalist 8 compareStrings comparetwostrings 11 Datasets NC-1014 NC-5690 Dublin-42487 fourMore100 return the first four values 7 Nb. programs 1,014 5,690 42,487 greaterthan100fromaninput Avgnb. testcases 13.1 10.4 3.7 list perprogram indexOccurrence returntheindexofthefirstoc- 7 Nb. correct 189 1,304 19,961 currenceofaniteminalist programs compareDates compare two dates from their 30 Nb. exercises 8 66 65 day,monthandyear Nb. ’words’ 113,223 761,726 7,4M polynomial return the roots of a polyno- 6 AES-0 (20) (44) (38) mialofdegree2 Nb. ’words’ 226,682 1,7M 15,2M dayNight display information about the 32 AES-1 (42) (57) (83) periodofadaygivenatime Nb. ’words’ 690,019 3,9M 40,4M AES-2 (71) (113) (209) gregation stage and the training is performed over 500 iter- includesasub-partofNewCaledonia-5690composedofcon- ations. tributions associated with 8 exercises selected for their al- gorithmic diversity (see Table 2) and their balanced vol- In a first step, we evaluate our approach in a qualitative umetry (100 to 150 programs per exercise). We will use way on the dataset NewCaledonia-1014 constituted for this it as a ’toy’ dataset facilitating qualitative analyzes. The purpose. 
Figure 4 (left) shows a visualization of the 912 programs of the training set, obtained by a non-linear projection using the t-SNE dimension reduction algorithm [9]. It can be seen that, although the embeddings are learned in an unsupervised manner, the code2aes2vec method learns, from a relatively limited amount of training data, a representation space in which the areas identify distinct program functionalities. The program vectors are organized quite naturally into 8 clusters that are highly correlated with the 8 original exercises. Moreover, the topological organization of the clusters matches the algorithms inherent to the programs. The exercises 'swapping' and 'minimum' are close in the embedding space and correspond to the only two exercises that iterate over all the values of an input list. The exercises 'compareStrings', 'fourMore100' and 'indexOccurrence', in the upper part of the space, require iterating partially over a list. Finally, the last three exercises do not require loops and rely only on conditional instructions. The exercise 'dayNight' is distinguished by the expected use of a display function (print), while all the other exercises return a result (return); this feature may explain the isolation of the programs of this exercise from the other programs.

[Figure 4: Visualization of the program embeddings obtained by the code2aes2vec method for the 8 exercises of the NewCaledonia-1014 dataset. The colors identify the exercises, with incorrect programs in light and correct ones in dark. The figure on the left represents all the embeddings; the one on the right details the area associated with the 'minimum' exercise, where programs separate by style: looping by indices vs. directly over elements, range(0,...) vs. range(1,...), the condition liste[i]<res vs. res>liste[i], and atypical solutions using the native min() or sort() functions.]
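Such a map can be reproduced with standard tooling. The sketch below is our illustration (the paper does not publish its plotting code): it projects 100-dimensional program embeddings to 2-D with scikit-learn's t-SNE and colors the points by exercise; embeddings and exercise_ids are random stand-ins for the learned D matrix and the exercise labels.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Stand-ins for the learned program embeddings and their exercise labels.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(912, 100))    # one 100-d vector per program
exercise_ids = rng.integers(0, 8, size=912)

# Non-linear projection of the embedding space to two dimensions.
points = TSNE(n_components=2, perplexity=30,
              random_state=0).fit_transform(embeddings)

plt.scatter(points[:, 0], points[:, 1], c=exercise_ids, cmap="tab10", s=8)
plt.title("Program embeddings (t-SNE), colored by exercise")
plt.show()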
Stylistic differentiation of programs is difficult to assess quantitatively. This task would require either objective criteria that can be extracted automatically, or expensive expert labeling. Given the absence of such stylistic knowledge in the datasets, we choose to illustrate on an example how the code2aes2vec method distinguishes different styles in the writing of a same function. Figure 4 (right) details the embedding space learned for the exercise 'minimum'. It is interesting to notice that program styles are clearly distinguished, notably the two ways of programming a Python for loop (using indices vs. elements directly). At a finer level, programs are also grouped according to whether their loop starts with the first element of the list (range(0,...)) or with the second one (range(1,...)), the minimum being initialized to the first element in both cases.

Word order is important in our aes2vec method. For example, our approach distinguishes programs having a similar boolean condition but expressed in a different order (if liste[i]<res: vs. if res>liste[i]:). This distinction may seem artificial, since the two expressions are strictly equivalent from the evaluation point of view; however, the first syntax appears more "natural" than the second. It would be easy to get rid of this phenomenon by normalizing the expressions at the code2aes step; this option can be left to the teacher's discretion.

Finally, we draw the reader's attention to some valid but atypical programs, in particular a program using the native Python function min, or one using the sort function. Their separation from the rest of the programs is crucial, since it offers a way to detect programs (a priori valid) that a teacher may want to reject, or at least moderate, considering that they deviate from his/her pedagogical objective. More generally, this analysis seems to confirm that the embedding spaces learned by the code2aes2vec method correctly capture not only the function of the programs but also their style. Their intrinsic quality paves the way for many practical uses that could significantly improve the efficiency of learning platforms (detection of atypical solutions, automation/propagation of feedback, student "trajectory" analysis, study of error typologies, etc.).

In a second step, we evaluate the code2aes2vec approach quantitatively on the three datasets (Table 3). We consider a usual task for program embedding evaluation, namely the prediction of a program's function (i.e. exercise identification). For each considered configuration (AES level), the training data are first used to learn (without supervision) a representation space. Then, these embeddings are used to learn (with supervision) an SVM classifier with a polynomial kernel [5]. Finally, the embeddings are (indirectly) assessed according to their ability to predict the function of the code (i.e. the exercise) on the test data.

As baselines, the random classifier indicates the a priori difficulty of the task, while doc2vec corresponds to the (naive) use of the doc2vec algorithm [8] to learn embeddings from the codes directly (without any intermediate representation). We also report the results obtained by the (supervised) code2vec approach [2], executed with default parameters; the other methods presented in the state of the art could not be compared, for lack of available operational implementations.

Table 3: Quantitative and comparative evaluation of the produced embeddings on the task of retrieving the function of a program (accuracy).

                          NC-1014   NC-5690   Dublin-42487
    random classifier      0.125     0.015     0.009
    code2vec [2]           0.230     0.098     0.037
    doc2vec [8] + SVM      0.412     0.495     0.380
    code2aes2vec + SVM:
      AES-0                0.882     0.460     0.391
      AES-1                1.0       0.698     0.544
      AES-2                1.0       0.832     0.651

It can be seen that the code2vec model recently proposed by [2] cannot be trained satisfactorily on any of the three datasets. This is due to its numerous parameters and the large number of examples this method requires: on the largest dataset (Dublin-42487), code2vec "only" has 42,487 programs as inputs, whereas our code2aes2vec method has several million entries thanks to the AES intermediate representation.

The comparative results obtained with the three levels of AES (denoted AES-0, AES-1 and AES-2 in Table 3) confirm that the quality of the embeddings improves as the level of detail of the AES increases. Level-2 AESs (AES-2) undeniably lead to the best vector representations of the programs.

In order to take the word/token order into account, code2aes2vec and doc2vec have been set up so far with concatenation as the aggregation step. To confirm the importance of the order, we compare in Figure 5 the embeddings obtained by code2aes2vec with both types of aggregation (sum vs. concatenation). Unlike for concatenation, we observe a very rapid degradation of the quality of the embeddings obtained with a sum-type aggregation as the size of the context increases. Indeed, the vocabulary on which the AESs are based is very limited (a few dozen or, at most, a few hundred tokens) and the distribution of these words in the AESs is not uniform; it thus quickly becomes difficult to differentiate contexts as their size increases without taking the word order into account.

[Figure 5: Evaluation of the embeddings on each of the three datasets according to the type of AES, the aggregation method and the context size (number of words before/after).]
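The indirect evaluation protocol is straightforward to reproduce. The following sketch (an illustration under our own naming, not the released evaluation script) trains a polynomial-kernel SVM on the unsupervised embeddings and reports the exercise-prediction accuracy; train_X/test_X stand for program embeddings and train_y/test_y for exercise identifiers.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Stand-ins: embeddings learned without supervision + exercise labels.
rng = np.random.default_rng(0)
train_X, train_y = rng.normal(size=(900, 100)), rng.integers(0, 8, 900)
test_X, test_y = rng.normal(size=(50, 100)), rng.integers(0, 8, 50)

# Supervised probe: a polynomial-kernel SVM on top of the frozen embeddings.
clf = SVC(kernel="poly").fit(train_X, train_y)
print("exercise-prediction accuracy:",
      accuracy_score(test_y, clf.predict(test_X)))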
4.3 Application to feedback propagation
In order to confirm that the learned embeddings are fine enough to be usefully exploited in an educational context, we implemented a first proof of concept on the task of propagating feedback.

We considered the exercise 'mean' from the NewCaledonia-5690 dataset, whose instruction was to write a Python function returning the mean of the values contained in a list passed as a parameter (and returning None if this list is empty). For this exercise, 157 programs were submitted to the platform by 24 different students; among these submissions, 122 programs were evaluated as incorrect by the platform, and it is on these that we sought to propagate teacher feedback based on their embeddings.

For this purpose, we performed a clustering (k-means [10]) of the 122 incorrect programs. We then presented to a teacher the most representative program (cluster medoid) of each cluster obtained, asking him to provide one feedback to help the student correct his proposal; if more than one error is present, the teacher must choose the one he/she feels needs to be corrected first, so that each feedback is limited to the resolution of a single error. Once the k feedbacks were compiled (one per cluster/medoid), we went through each cluster and asked the teacher, for each program other than the medoid, to indicate whether the feedback defined for the medoid could be applied to that program. The objective is thus to assess the extent to which the feedback of a medoid can propagate to the other programs of the same cluster.

Operationally, the clustering is performed on the 122 incorrect programs, defined by their embeddings in R^100. For a fixed number k of clusters, the partition selected for analysis is the one minimizing the MSE (Mean Squared Error) among 100 runs (random initializations) of k-means.
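This selection of medoids and the distance-bounded propagation can be sketched as follows; this is our illustration of the protocol, not the authors' code. It takes as medoid the cluster member closest to the centroid (one common convention), and the propagation radius mirrors the neighborhood-size analysis reported below (Figure 6); embeddings is a hypothetical stand-in for the 122 incorrect-program vectors.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(122, 100))  # stand-in for incorrect programs

# Best of 100 random initializations (scikit-learn keeps the lowest inertia).
km = KMeans(n_clusters=5, n_init=100, random_state=0).fit(embeddings)

for c in range(km.n_clusters):
    members = np.where(km.labels_ == c)[0]
    vectors = embeddings[members]
    # Medoid: the member program closest to the cluster centroid.
    medoid = members[np.argmin(
        np.linalg.norm(vectors - km.cluster_centers_[c], axis=1))]
    # Propagate the medoid's feedback only to close neighbors in the cluster.
    dist = np.linalg.norm(vectors - embeddings[medoid], axis=1)
    neighbors = members[dist <= 1.0]  # radius read off curves like Figure 6
    print(f"cluster {c}: medoid {medoid}, "
          f"feedback propagated to {len(neighbors)} programs")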
Table 4 presents the partition obtained for 5 clusters (k=5). For each cluster, we indicate its size, the program associated with its medoid, as well as the feedback provided by the teacher for this medoid.

Table 4: Description of the 5-cluster partition generated by k-means on the embeddings of the incorrect programs of the 'mean' exercise. For each cluster: (1) the number of programs, (2) the program associated with the medoid of the cluster and (3) the feedback defined by the teacher for this program (in the original figure, the instructions questioned by the feedback are highlighted in red).

Cluster 1 (#31):
    def mean(l):
        if len(l)==0:
            res=None
        else:
            res=0
            cpt=0
            for elem in l:
                res=res+elem
                cpt=cpt+1
                res=elem/cpt
        return res
Feedback: The division step must be performed once the sum calculation is completed (put this instruction out of the for loop).

Cluster 2 (#22):
    def mean(l):
        if len(l)==0:
            res=None
        else:
            s=0
            for elem in l:
                s+=elem
            res=s//len(l)
        return res
Feedback: The // operator corresponds in Python to integer division. For the computation of a mean, a simple division is required (operator /).

Cluster 3 (#28):
    def mean(l):
        if len(l)==0:
            res=None
        else:
            res=0
            cpt=0
            avg=0
            for elem in l:
                res=res+elem
                cpt=cpt+1
                avg=res/cpt
        return avg
Feedback: In the case of an empty list, your function does not return None (as requested).

Cluster 4 (#9):
    def mean(l):
        if l==():
            res=none
        else:
            res=0
            for i in range(len(l)):
                x=res+l[i]
                res=x/len(l)
        return res
Feedback: The null value in Python is written 'None' (instead of 'none').

Cluster 5 (#32):
    def mean(l):
        if len(l)==0:
            res=None
        else:
            res=0
            cpt=0
            for elem in l:
                res=res+elem
                cpt=cpt+2
                res=res%cpt
        return res
Feedback: The % operator corresponds in Python to the modulus. For the computation of a mean, a simple division is required (operator /).

Let us first observe that the feedbacks provided by the teacher may relate to errors of different natures: either an error in the design of the algorithm (clusters 1 and 3), or an error in the writing of the Python program (clusters 2, 4 and 5).

We repeated this work with a partition into 10 clusters, this time asking the teacher to provide 10 feedbacks. Finally, we measured the rate of correct feedback when propagating the feedback of each medoid to its neighboring programs in the same cluster. Figure 6 shows the evolution of the correct feedback rate as a function of the neighborhood size considered for propagation. It can be seen that the further away from the medoids, the more errors appear in the automatically determined feedbacks.

[Figure 6: Propagation of teacher feedback to the neighboring programs in each cluster; illustration on the exercise 'mean'.]

Of course, a small number of feedbacks (5 or 10) is not enough to cover all the errors present in the 122 incorrect programs. In practice, one would therefore not propagate to all the programs of a cluster, but only to the neighboring programs of the medoid.
The dashed curves in Figure 6 indicate the proportion of programs covered as a function of the size of the neighborhood under consideration. With a neighborhood radius corresponding to a distance of 1.0, a propagation over 5 clusters covers 33% of the programs with a precision of 90%; similarly, such a propagation over 10 clusters covers significantly more programs (43%) with a higher precision (93%).

The use of embeddings thus assists the teacher in his/her task of accompanying the students: for each feedback requested, 4 to 5 additional neighboring programs are automatically processed. Moreover, it is reasonable to think that, over time, a sufficiently large collection of feedbacks will be defined by the teacher to cover almost the entire embedding space, so as to systematically identify one relevant feedback each time a new incorrect program is submitted whose embedding falls in a pre-identified neighborhood.

5. CONCLUSION AND PERSPECTIVES
This paper studies the problem of learning vector representations, or embeddings, of programs in an educational context where the function is just as important as the style. Faced with this problem, we propose the code2aes2vec method, which transforms the code into abstract execution sequences (AES), and then into embeddings. This approach adapts the doc2vec method to programs and is based on the hypothesis of document-based distributional variations. The publication of the source code of the approach is accompanied by the availability to the community of a new enriched corpus composed of more than 5,000 student programs (Python). Experiments conducted on these new data and on a public dataset validate the quality of the learned embeddings, which capture in a fine way the function and the style of the programs. In addition, a promising proof of concept was carried out on a classical task of the field, namely the propagation of teacher feedback.

The perspectives of this work are numerous. First, our experiments focus on programs written in introductory courses, i.e. fairly simple codes. It would be interesting to analyze more elaborate ones (from more advanced courses) and to evaluate the impact of code complexity on performance.

Then, it seems necessary to complete the results observed on the stylistic differentiation of programs by formalizing the notion of style of a program, in order to quantitatively evaluate our program embeddings. In the same way, a more precise analysis of the test cases used will have to be carried out, in order to determine to what extent the constructed embeddings are sensitive to them.

Finally, to obtain more exploitable corpora, we plan to extend our implementation to handle any type of language (the current implementation only processes the Python language).

From a more methodological perspective, all the words of a program have the same weight during the embedding construction in our approach. Thus, a correct program and one returning a wrong value (or throwing an error) may have very similar embeddings, although they are functionally very different. This aspect could be integrated in the construction of our AESs or in the architecture of the neural network used to generate the embeddings. For that, it could also be interesting to add to our AESs the values taken by the variables, in the same way as [17], but in a generic multimodal approach. Another perspective would be to allow the expert to integrate part of his/her knowledge of the language. As discussed previously, some instruction sequences can be equivalent (e.g., if liste[i]<res: vs. if res>liste[i]:). Semantic relations between words can also be known (e.g., the relation between for and while statements). This knowledge could be used to constrain the neural network and guide the embedding construction.
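As one illustration of injecting such knowledge at the code2aes step, the short sketch below (ours, not part of the paper) canonicalizes equivalent comparison directions with an ast.NodeTransformer, so that if res > liste[i]: and if liste[i] < res: produce the same AES tokens.

import ast

class CanonicalizeCompare(ast.NodeTransformer):
    # Rewrite 'b > a' as 'a < b' (and 'b >= a' as 'a <= b') so that two
    # semantically equivalent conditions yield identical AES tokens.
    FLIP = {ast.Gt: ast.Lt, ast.GtE: ast.LtE}

    def visit_Compare(self, node):
        self.generic_visit(node)
        if len(node.ops) == 1 and type(node.ops[0]) in self.FLIP:
            node.left, node.comparators[0] = node.comparators[0], node.left
            node.ops[0] = self.FLIP[type(node.ops[0])]()
        return node

tree = CanonicalizeCompare().visit(
    ast.parse("if res > liste[i]:\n    res = liste[i]"))
print(ast.unparse(tree))  # -> if liste[i] < res: ... (Python >= 3.9)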
Finally, these program embeddings open up a large number of perspectives for teaching aid, in addition to the task of feedback propagation. For example, they could be used to identify error typologies or alternative solutions, or even to predict dropout students through the analysis of their trajectories.

6. REFERENCES
[1] M. Allamanis, E. T. Barr, P. Devanbu, and C. Sutton. A survey of machine learning for big code and naturalness. ACM Computing Surveys, 51(4):1–37, 2018.
[2] U. Alon, M. Zilberstein, O. Levy, and E. Yahav. code2vec: Learning distributed representations of code. Proceedings of the ACM on Programming Languages, 3(POPL):1–29, 2019.
[3] D. Azcona, P. Arora, I.-H. Hsiao, and A. Smeaton. user2code2vec: Embeddings for profiling students based on distributional representations of source code. In Proceedings of the International Conference on Learning Analytics & Knowledge, pages 86–95, 2019.
[4] R. Bazzocchi, M. Flemming, and L. Zhang. Analyzing CS1 student code using code embeddings. In Proceedings of the 51st ACM Technical Symposium on Computer Science Education, pages 1293–1293, 2020.
[5] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pages 144–152, 1992.
[6] D. DeFreez, A. V. Thakur, and C. Rubio-González. Path-based function embedding and its application to error-handling specification mining. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 423–433, 2018.
[7] J. Henkel, S. K. Lahiri, B. Liblit, and T. Reps. Code vectors: understanding programs through embedded abstracted symbolic traces. In Proceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 163–174, 2018.
[8] Q. Le and T. Mikolov. Distributed representations of sentences and documents. In International Conference on Machine Learning, pages 1188–1196, 2014.
[9] L. van der Maaten and G. Hinton. Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research, 9:2579–2605, 2008.
[10] J. MacQueen et al. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297, Oakland, CA, USA, 1967.
[11] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[12] T. D. Nguyen, A. T. Nguyen, H. D. Phan, and T. N. Nguyen. Exploring API embedding for API usages and applications. In IEEE/ACM International Conference on Software Engineering, pages 438–449. IEEE, 2017.