Grouping Source Code by Solution Approaches — Improving Feedback in Programming Courses Frank Höppner OstfaliaUniversityofAppliedSciences [email protected] ABSTRACT We thus address the following research question: Can we Various similarity measures for source code have been pro- supportthelecturerinprovidingmeaningfulfeedbackabout posed, many rely on edit- or tree-distance. To support a thesolutionapproachestakenbythestudents? Thisrequires lecturer in quickly assessing live or online exercises with automation as a lecturer does not have the time to review respect to approaches taken by the students, we compare allsolutionsmanually,especiallynotinonlineteachingsitu- source code on a more abstract, semantic level. Even if ation. Providingmeaningfulfeedbackrequiressomekindof novice student’s solutions follow the same idea, their code insightinthesolutionapproachaparticularstudentwasfol- length may vary considerably – which greatly misleads edit lowing(semantics),howcommontheapproachis,howmany and tree distance approaches. We propose an alternative different approaches have been followed in the course, etc. similarity measure based on variable usage paths (VUP), Weintendtoachievethisbyprovidingasimilaritymeasure that is, we use the way how variables are used in the code for source code that does not focus on results (as unit tests to elaborate code similarity. The final stage of the mea- do) but to reach a higher semantic level by assessing the sure involves a matching of variables in functions based on more abstract code structure: different solution approaches how the variable is used by the instructions. A preliminary manifest themselves in different code structures. evaluation on real data is presented. Such a measure would be useful in many ways. For in- stance, during an in-classroom teaching situation it would Keywords enable the lecturer to pick two (or more) submissions fol- sourcecodedistance,semanticanalysis,variableusagepaths lowingdifferentapproachestostartadiscussionabouttheir pros and cons. It could also support a lecturer in browsing 1. INTRODUCTION thespectrumofapproachesthatwerefollowedbythecourse Learningaprogramminglanguagerequiresalotofpractice. members (how many different approaches were chosen how Students gain knowledge and experience from both, home- often) without having to check every solution manually. It workandin-classroomexercises. Itis,however,veryimpor- may enable the lecturer to pick a solution that follows the tant to give students feedback, e.g. discuss different solu- sameapproachasasolutionthathasbeendiscussedalready, tionapproacheswiththem. Both,thosewhohaveandhave but did not pass all unit tests, thereby representing a live not yet succeeded, will benefit from such discussions, either challenge to the course members (“spot the error”). Note it widens their view (because they may not have thought that we are not aiming at grading the source with respect about alternative approaches yet) or encourages them to to its correctness, but leave this to the unit tests. As tests try the exercise at a later point in time once more (once do not provide any useful feedback to students that do not they got a glimpse on how to solve it). The sharing of dif- yethaveanappropriatecodestructure,thedesiredmeasure ferentsolutionapproachesanddiscussingtheprosandcons mayclosethisgap,asitwouldallowustodifferentiatecode of different approaches improves their algorithmic thinking submissionsthatarefarfromworkingfromthosethatfollow skills. However, it is quite common in many programming a reasonable solution approach and got the code structure courses, that the only (automatic) feedback consists of the right, but only the tiny details prevents them from passing number of passed or failed unit tests. Achieving a positive the tests. Such a more detailed inspection would enable a feedback then requires an already well developed solution much more sensitive (automatic) feedback. and no credit is given for, say, getting the code structure right – which may frustrate novices noticeably. 2. RELATEDWORK Similarity or distance measures for source code have been investigatedforalongtime. Manyapproacheshaveincom- monthattheystartfromanabstractsyntaxtree(AST)that Frank Höppner “Grouping Source Code by Solution Approaches — Im- represents the code. Code is then compared by calculating proving Feedback in Programming Courses”. 2021. In: Proceed- some kind of tree distance on the corresponding ASTs: the ings of The 14th International Conference on Educational Data Min- minimal number of steps (node deletion, insertion, and re- ing (EDM21). International Educational Data Mining Society, 461-467. labelling) to transform one AST to the other. A survey on https://educationaldatamining.org/edm2021/ tree edit distance can be found in [2]. A tree, however, has EDM’21June29-July022021,Paris,France no conscience about variable identity: the very same vari- Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 461 able occurs over and over again as a new node in an AST. publicdoubleavg(doublea[]){ publicdoubleavg(doublea[]){ Toreflectwhichvariablechangeaffectsothercode,program if(a==null)returnNaN; if(a==null)returnNaN; dependency graphs (PDG) may be used. But comparing intn=0; intn=0; graphs is even more complicated than comparing trees, a for(intj=0;j<a.length;++j){ doublesum=0; if(a[j]>0)++n; for(inti=0;i<a.length;++i){ surveyongrapheditdistanceisgivenin[6]. Tosimplifythe } if(a[i]>0){ comparison,theASTmayalsobelinearizedsuchthatmuch doublesum=0; ++n; simpler string comparison measure (edit distance) may be for(intj=0;j<a.length;++j){ sum+=a[i]; applied (cf. [5, 7, 9]). if(a[j]>0)sum+=a[j]; } } } An extremly fast approach is to define a hash function on returnsum/n; returnsum/n; }//(a)[TwoLoopQuickExit] }//(b)[OneLoopQuickExit] (possibly pre-processed trees) such that a similarity is de- tected when hash codes are identical [1]. The flexibility of suchanapproachis,however,verylimited,asthesimiliarity publicdoubleavg(doublea[]){ publicdoubleavg(doublea[]){ measure is a binary one. It may nevertheless be useful for if(a!=null){ if(a==null) intn=0; returnNaN; clone detection (plagiarism) or may be enhanced by calcu- doublesum=0; intn=0; lating multiple hashes [4]. Tree edit distance has been used for(inti=0;i<a.length;++i){ doublesum=0; in,e.g.,[10]todetectsimilarsourcecodes. In[3]adistance if(a[i]>0){ for(inti=0;i<a.length;++n){ measureforsourcecodehasbeenproposed,thattransforms ++n; if(a[i]>0){ the AST into a sequence of tokens on which Levenshtein or sum+=a[i]; ++i; edit distance is applied. }} a[i]+=sum; returnsum/n; } }else{ } In many related papers the goal is to compare (student’s) returnNaN; returnsum; code to a (lecturer’s) reference code; the higher the simi- }}//(c)[OneLoopGuardedIf] }//(d)[Disordered] larity, the closer is the student’s submission to the correct solution. In this work, however, we want to identify source codes that are semantically similar, that is, they follow the Figure 1: Responses to a simple programming exer- same solution approach. Especially when novices start to cise: Writeafunctionavgthatyieldsthemeanofall code, their solutions are often more lengthy and redundant positive values in the array. Left: Code (a) uses two thanthoseofanexperiencedprogrammer. Thisaffectsmea- loops, whilecode(b)onlyone. Whilethemaincode suressuchaseditdistanceortreedistancedramatically. The is organized after an if-statement in (a) and (b), it is approach in [3] did not work out as well as expected, and embedded in the conditional statement in (c). Code the authors hypothesized about one of the reasons that the (d) uses similar instructions, but the variable usage lengthofthecodedramaticallyinfluencestheeditdistance. is screwed up and does not solve the problem. Whilelongercodemaybelessreadableandmoreredundant, itisneithermorelikelywrongnordoesitnecessarilyfollow adifferentsolutionapproach. Asanexample,considercodes 3.1 VariableUsagePaths (a)and(b)fromFig.1. Thetaskwastocalculatethemean Solving a programming exercise requires to combine pro- of all positive array elements. Both codes follow the same gramming instructions such that a handful of variables approach, but (b) uses a single loop instead of two and is jointlybuildupthefinalresult(andreturnittothecaller). thus more compact. But semantically, we consider both so- The key to a solution are thus the variables and how they lutions as being identical. We are not aware of any source are embedded in the (possibly nested) instructions. Code code similarity measure that tries to reflect that. analysis often starts with an abstract syntax tree (AST); there, every variable usage is represented by a new node in Itisalsocommonintheliteraturetoreplacevariablenames the tree. This occludes important information: Where in by a constant string (possibly depending on the variable’s the source code is the same variable used? We therefore type), which eases matching sources that use different vari- rearrange the AST to a graph, where a node is unique for ablenames. Inplagiarismdetection(see[5,9]andreferences each variable and subsequent usages of the variable link to therein) this is considered as a countermeasure against dis- the same node. Fig. 2 shows an example for the code of guising plagiarism by variable renaming. However, the way Fig. 1(b). All variables are marked in red color. From the a single variable is used throughout the code is crucial for paths between the variable sum and function avg, we can a solution approach. Fig. 1(d) utilizes the same statements seethatthevariablesumisdeclared,occursinaconditional (e.g., array-access in conditional statement of a for-loop), statement that is embedded in a loop, and finally occurs in but does not calculate anything meaningful and thus does thereturnvalue. Weconsiderthesepaths(showninbluein not represent the same solution approach as codes (a-c). Fig. 2) as a kind of fingerprint for the role of this variable: codefollowingthesameapproachrequiresvariableswiththe same roles. 3. COMPARINGSOURCECODEBY Wethusdonotoperatedirectlyonthegraph,butpathsfrom VARIABLEUSAGEPATHS variable nodes to the enclosing function node. After apply- Wearguethattwosolutionsfollowthesameapproachifthey ing some transformations to simplify the graph somewhat use variables in the same way. In the subsequent sections (e.g.removingbodyorreplacingfor,while,etc.byasubsum- we show how we grasp variable usage in the code and then inglabelloop),weendupwithstringrepresentationsofthe define similarity measures on this representation. three blue paths like sum/expression/return/avg, sum/dec- 462 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) Test _0 avg return bodymodifier double Body public statement_1 statement_4 statement_2 statement_0 statement_3 Parameters Declare Declare If For name_0 return name_0 then control repeatDeclare update_0 Declaration Declaration return Expression Body Declaration Expression set set expression operand_1 operator statement_0 set operator 0 expression 0 FieldAccess NULL == If 0 control ++ type name then parameter_0 name name Double NaN Body control operand_0 statement_0statement_1 name operand Expression Expression Assignment Expression Expression operand_1operator operand operand_0 operator left right operator operatoroperand_0 operand_1 operatoroperand_1 n / sum ++ ArrayAccess += > ArrayAccess 0 < FieldAccess operand_0 type type array index array type index name int double a i length type type double int Figure 2: Graph representation of code in Fig. 1(b) (red: variable nodes; blue: VUPs of variable ’sum’). laration/declare/avg and sum/assignment/if/loop/avg. thenumberofelementsfirstbeforeitsumstherelevantval- ues, while code (b) does both in a single loop. While the choiceofoneortwoloopsmayaffecttheefficiency,theyare Definition 1. Let I be a set of code instruction labels semanticallyequivalentandthussimilar. Atfirstglancethe (such as if, loop, ...). For a given source code C (single variable usage paths (of, say, the loop counter) in all loops class),letV beasetofvariable-identifiers1 andF theset seem identical such that a set comparison (where duplicate C C of methods declared in C. A path p = (v,i ,...,i ,f) ∈ elements are not counted) appears just right. However, the 1 n V ×I∗ ×F := P reflects that a variable v is used by loopcounterisnamedjincode(a)andiincode(b). Fur- C C instruction i , which is itself used by instruction i , etc., thermore, (a) defines two variables with the same name j 1 2 in function f. The VUP-representation (variable usage (oneforeachloop). Letusignorethesedetailsbyintroduc- path)ofcodeC isasetP ⊆P ofallpathsoccurringinC. ing a simplification (and revisit the problem later): C We thus intentionally drop many details from the original Definition 2. From a VUP representation P we obtain a source code, e.g. numerical constants or full expressions. SVUPrepresentation(simplifiedVUP)representationP(cid:48) From the fact that two codes that are structurally identi- byreplacingallvariableidentifiersandallmethodidentifiers cal(sameVUPrepresentation)wecannotconcludeanything by a constant identifier vn (generic variable name) and fn about the number of unit tests both codes may pass. But (generic function name), resp. (I ={vn}, F ={fn}). as we have mentioned earlier, we seek for a common code skeleton, which may indicate that the programmers were Standard methods to measure set similarity may then be guided by the same underlying idea. The skeleton includes usedtocomparesourcecode. ForgivenSVUPsPandQwe the information which variables need to be used in which use the F measure2: instructions — but there are many different ways of coding 1 expressions equivalently, so we simply stick with unit tests p·r |P ∩Q| |P ∩Q| F =2· where p= , r= to check their correctness. 1 p+r |P| |Q| Fig. 3 shows a dendrogram for an average-linkage cluster- 3.2 SimplifiedSetSimilarity ingusingthismeasure(actually1−F toobtainadistance 1 Giventwosourcecodes(a)and(b)fromFig.1andthecor- measure). Codes (a)-(d) correspond to TwoLoopQuickExit, respondingVUP-representationsP,Q. Code(a)determines OneLoopQuickExit,OneLoopGuardedIfandDisordered. Addi- tionalexamplesinclude: Nonsense(similarsetofinstruction 1Note the variable names themselves are not valid identi- fiers, as the same variable name may occur more than once 2although based on asymmetric precision p and recall r, F 1 in the same function, e.g. variable j in code (a) of Fig. 1. itself is symmetric and thus a similarity measure Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 463 Disordered TwoLoopQuickExit OneLoopQuickExitTempVar OneLoopQuickExit TwoLoopQuickExit OneLoopGuardedElse OneLoopQuickExit OneLoopGuardedIf Nonsense Disordered OneLoopGuardedElse OneLoopQuickExitTempVar OneLoopGuardedIf Nonsense Empty Empty 0.8 0.6 0.4 0.2 0.0 0.8 0.6 0.4 0.2 0.0 Figure 3: Dendrogram for source codes of Fig. 1 Figure 4: Dendrogram for source codes of Fig. 1 based on F -measure on VUP representation. based on F -measure on SVUP representation. 1 1 Table 1: Example of comparing P and Q (see text). m(·)thathavebeenassignedalready. Wereflectthecostof P 1 vn/expr/if/loop/fn d m(d) addingamissingpathelementbyaddingitasavirtualpath 2 vn/assign/if/loop/fn if (1,6), (2,7) to the SVUP P or Q (depending on where it was missing). 3 vn/declare/fn assign (3,8) 4 vn/assign/fn declare (4,8) Table 1 shows a detailed example. The table on the left 5 vn/return/fn showsthepathsbelongingtoSWUPofP andQ. Allpaths Q 6 vn/expr/loop/fn step un- / assigned have been numbered for easier reference. The initial map 7 vn/assign/loop/fn 0 1,2,3,4,6,7,8 / 5,9 m(·)isshownonthetopright;forinstance,thepathelement 8 vn/assign/declare/fn 1 3,4,8 / 1,2,5,6,7,9 if unifies path #1 of P with #6 of Q, as well as #2 of P 9 vn/return/fn 2 4 / 1,2,3,5,6,7,8,9 with #7 of Q. At the bottom right each line corresponds to an iteration of the matching process. In step 0, paths #5 and #9 are already identical. In the second iteration set but variables are mixed randomly), OneLoopQuickExit- the path element if is chosen from map m(·), because it TempVar (the mean value is assigned to a temporal vari- unifies2paths(allothersonlyone). Wehavethusmatched able before it is returned) and OneLoopGuardedElse (same 6 paths in total (step 1 of bottom right table), and only as OneLoopGuardedIf (Fig. 1(c)) but inverted if-condition). #3, #4, #8 remain unassigned (gray). The map d(·) now From the dendrogram we learn that codes TwoLoopQuick- offers two alternatives (assign and declare, both |m(·)| = ExitandOneLoopQuickExitbecomeidenticalasdesired. But 1), we arbitrarily choose assign as the second missing path we can also see that code OneLoopGuardedIf, where a con- element for the third iteration. This leaves only path #4 ditional statement encloses the main code, is recognized as unassigned. Thechoiceofm(declare)hascoveredpaths#4 very dissimilar. The additional if-statement occurs in all and #8, so we remove all pairs from m(·) containing any paths and the simple Jaccard similarity finds this code to of these (already covered) paths. This ends the matching be completely different. We address this problem next. phase (all |m(·)| = 0). We had to add the first missing path element if to paths in Q; likewise we had to add the second path element assign to P to match #3 against #8. 3.3 ReflectingInstructionEmbedding As a penalty for the missing path elements we add them to A single conditional statement may introduce an new ele- therespectiveSVUP,thatis,P becomes{1,2,3,4,5,assign} ment in many paths turning them all different (as in codes and Q = {6,7,8,9,if}. Taking the established identity of (a),(c)inFig.1). Forcompensationwecouldmeasureapar- paths into account, this gives us an F value of 1 tialsimilaritybetweenpaths(e.g.,80%ofpathpiscontained inq),butasingleinstruction(suchastheconditionalstate- |P ∩Q| 4 |P ∩Q| 4 16/30 8 p= = , r= = , so F =2· = . ment in (c)) would then still weaken all path similarities. |P| 6 |Q| 5 1 44/30 11 We therefore propose a different approach in the fashion of Fig. 4 shows the resulting dendrogram. Now OneLoop- edit distance, where we pay a constant cost for a missing GuardedIf (Fig. 1(c)) became much more similar to One/T- path element, which may then be used in many paths. woLoopQuickExit (Fig. 1(a-b)). But we are still dissatisfied with the high similarities towards Disordered: it uses the GiventwoSVUPrepresentationsP,Q,wecompareallpaths same set of embedded instructions (causing the high simi- p∈P,q ∈Q to identify missing path elements δ(p,q)∈I∗. larity),butthevariableusageismixedup. Theinstructions (For example, δ(vn/loop/if/fn,vn/loop/fn)=if ). Many dif- may look identical, but the author did not get the role of ferent path combinations may lead to the same δ(p,q), so variables right and that should degrade the similarity. by m(d) we denote the set of all pairs (p,q) ∈ P ×Q with δ(p,q)=d. ThematchingofpathsinP topathsinQisdone iteratively: Inthefirstiteration,wematchallpathsfromP 3.4 MatchingVariables andQthatareidentical(δ(p,q)=()). TomatchtheSVUP So we finally revisit the simplification of definition 2. We representationsatminimalcost,insubsequentiterationswe havehypothesizedthatsimilarsolutionapproachesusevari- identify the missing path element d that unifies the largest ables at specific places in the code skeleton. Up to now, number of paths (that is, choose d = argmax |d(x)|). Be- we have mixed the usage paths of different variables in the x fore entering the next iteration, all pairs are removed from SVUP representation. This will be sorted out next. 464 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) Definition 3. Let π : P → P,P (cid:55)→ {p|p = (v,f) TwoLoopQuickExit (v,i ,...,i ,f) ∈ P} be a filtering function that returns 1 n OneLoopQuickExit only those paths that refer to variable identifier v and OneLoopGuardedElse function identifier f. From a VUP representation P with OneLoopGuardedIf a set V of variable identifiers and F of function identi- OneLoopQuickExitTempVar fiers, weobtainaGVUP representation (groupedVUP) Disordered P(cid:48) = {π (P)|v ∈ V,f ∈ F}. That is, a GVUP is a set (v,f) Nonsense of VUPs, one VUP set for each variable in each function. Empty 0.8 0.6 0.4 0.2 0.0 For instance, the code in Fig. 1(b) consists of 4 variables in one function, so the GVUP representation is a set of four VUP representations (one set for each variable). Now, as Figure 5: Dendrogram for source codes of Fig. 1 the GVUP representations P and Q of two source codes based on F -measure on GVUP representation. 1 are sets of sets, how do we generalize the F calculation? 1 Consider the following example, where we use abbreviated instructions I ={a,b,c}: vn). Althoughtheexampledidnotincludedifferentfunction names,itworksinthesameway. Fig.5showstheresulting P = {{x/a/c/f,x/b/c/f},{y/a/f,y/c/f},{z/b/f}}, dendrogram. As desired, the similarity of code Disordered Q = {{x/a/f,x/c/f},{y/b/f}} (Fig. 1(d)) is now worst among all solutions that follow the Formally, a direct calculation of |P ∩Q| yields 0 as none of same solution approach. the sets in P is contained in Q. But this does not meet our intention. Note that variable y in P plays the role of x in 4. EXPERIMENTALEVALUATION Q(same forzinP andyinQ). Variablesmayberenamed We started to evaluate the proposed approach on real stu- without changing the semantics, in fact |P ∩Q| should be dent code submissions and demonstrate its performance on 2. The desired semantics is to match variable and function some examples. The submissions were inspected manually namesappropriately,renamethemaccordinglyandcalculate andgroupedbyapproach(defininggroundtruth). Whilethe (cid:83) (cid:83) the F1 measure on the obtained X∈P X and Y∈QY. author of an exercise may have a specific solution in mind, a group of students usually finds multiple ways to solve the HowdowematchthesetsX ∈P againstY ∈Q? Allpaths exercise. The real data contains nearly identical submis- in X or Y refer to the same variable and function name, so sions (potential plagiarism) as well as submissions that ap- wecansafelytransformtheVUPtoaSVUPrepresentation pear somewhat chaotic, have superfluous declarations and and use the measure from section 3.3 to construct a pair- calculations, or contain artefacts indicating a change in the wise cost matrix. We employ the Munkres algorithm [8] to solution approach over time. Most codes can nevertheless find the optimal assignment based on this cost matrix. As beassignedtosolutionapproaches,butthestrongvariation theMunkresalgorithmperformsa1:1least-costassignment, innovice’scodelengthmisleadsedit-andtree-distancesuch somevariablesmaynotgetassigned. Wematchthemafter- that resulting dendrograms do not match the approaches. wardsinasecondpasstotheleastcostlycounterpart. (This allowsustomatchmultipleidentifiersinoneprogramtothe We show results for two exercises, the first exercise asks for samevariableinanother,asrequiredwhencomparingcodes the most frequently occurring element in an integer array. (a) and (b) of Fig. 1.) To inspire the students to elaborate on different solutions, an additional restriction was given that all values v in the As an example, for the abovementioned P and Q we would array satisfy 0 ≤ v < 10000. The two main solution ap- create a cost matrix of all pairwise F1-values (P-paths in proacheswere: (1)iterateoverallelementsinanouterloop, rows, Q-paths in columns): countthefrequencyofthecurrentelementinaninnerloop, remember the element that occurs most often; (2) instanti- 0.67 0.50 ate an array of size 10000 (associating a counter with each 1.00 0.50 possible element in the original array), increment the re- 0.50 1.00 spective counter while looping over the array and identify TheMunkresalgorithmoptimallyassigntwooftheP-paths the largest entry in the counter array. Apart from these totwooftheQ-paths(boldface). ThethirdP-pathisthen two dominating solutions, three solutions (3) sorted the ar- associated with the first Q-path, which matches best ac- ray first, such that identical values are grouped together, cording to the higher F1-value. We have thus assigned the whichsimplifiesfrequencycountinginasingleloopoverthe variables x and y of P to variable x in Q and may think array.3 Finally, (4) there are some exotic solutions which of renaming all these variables in both codes to, say, u. In may be considered as a mixture of the discussed solutions. thesamefashionwemayrenamethevariablesofthesecond Possibly students became aware of the other approaches by assignment to v; reunifying all SVUPs leads us to: discussing approaches among each others, but had difficul- tiesinsolvingthetaskandswitchedbackandforthbetween P(cid:48) ={ u/a/c/f,u/b/c/f,u/a/f,u/c/f, v/b/f}, them. Usually, elements of all other solutions can be found Q(cid:48) ={ u/a/f,u/c/f, v/b/f} inthem. Fig.6showsaverage-linkagehierarchicalclustering Thefinalsimilarityisobtainedbyapplyingthecalculations 3However, when this exercise was handed out, sorting al- of Sect. 3.3 to both sets P(cid:48) and Q(cid:48) (this time with differ- gorithms were not yet discussed (so this solution required ent variable names rather than just generic variable names background knowledge or a student’s own initiative). Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 465 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG2113227644456662343431664693433221573253545165521191437037602725786393084953504166525823263944430925148 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG121712856665436297223257573376634564142233336451451512459143340581536787483248842123310204665836076542547423999015903 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG2322213555716455217113215912364444643356663343431646297252836234493092037654101272635655786893084954034 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG312672854352279757325156476634326663314223345334491151216451451053134818484232756580204632678653607909035123998425474 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG2711322345666446469491334343162332213231645521551557647037601657867235044189308495365258234493092654232 GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG312167856665766347325757224392153641422333344551649145334215121058316786020468212384484257358360765274544519090332399 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 Figure 6: Dendrogram for exercise ’most frequent Figure 7: Dendrogram for exercise ’happy number’ element’ based on F -measure (measure from Sect. based on F -measure (measure from Sect. 3.2, 3.3, 1 1 3.2, 3.3, and 3.4 from top to bottom). and 3.4 from top to bottom). results for the similarities from Sect. 3.2, 3.3, and 3.4 from intotwomajorbranches,whichdifferinthewayintermedi- top to bottom. The ground truth is shown by means of the atevariablesareused. Approachesthatusethesamekindof node colors: 1: black, 2: blue, 3: red, 4: gray, incomplete utility variables (e.g. to store digits, square of a digit, etc.) / unfinished: cyan. While the clusters in the top clustering areclosermatchesthanapproachesthatusedifferentsetsof mixes even the two large approaches (1) and (2), the clus- utility variables. tering in the middle is much better but does not separate solution (3). This is only achieved by the GVUP-clustering (bottom),whichcorrespondsbesttothegroundtruth. The 5. CONCLUSIONS bubblesort used in (3) uses very similar loops as the main task, so it was crucial to distinguish the role of variables as Assessing the variety in student’s solutions to a program- done in the GVUP-approach. ming exercises without having to inspect all codes manu- ally can help a lecturer in many ways. We have proposed Asecondexercisedealswiththeidentificationofhappynum- a measure that captures how variables and instructions are bers4. The calculations require the sum of squares of each coupled by means of variable usage paths, and use this fin- digit and the submissions differ mainly in the way how this gerprinttomatchcodefromdifferentsolutionswhileatthe issolved. Theintendedsolution(black)wasaloopthatsuc- sametimebeingtoleranttocoderepetitions. Theapproach cessivelysplitsthelastdigit(usingintegerdivisionandmod- needs to be evaluated further, but the first results appear ulo). Anotherpopularsolutionisbasedonstring-conversion promising. Thenatureofthecomparisonisset-based,which (blue), a few (somewhat restricted) approaches (red) dealt allowsusnotonlytoassesssimilarity(usingF1),butalsoto with different numbers of digits individually (avoiding a use recall and precision. This enables further applications, loop). Again, the average-linkage clustering for all three for instance, we may assess partial solutions by the degree proposals is shown in Fig. 7 and the GVUP-approach sepa- howmanyelementsofacompletesolutiontheycontain(us- rates them best. The modulo-approaches (black) subdivide ing recall only) or assess the student’s degree of program- ming maturity by investigating the amount of superfluous 4https://en.wikipedia.org/wiki/Happy_number statements (using precision only). 466 Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 6. REFERENCES [1] I. D. Baxter, A. Yahin, L. Moura, M. Sant’Anna, and L. Bier. Clone detection using abstract syntax trees. In Int. Conf. on Software Maintenance, pages 368–377. IEEE, 1998. [2] P. Bille. A survey on tree edit distance and related problems. Theoretical Computer Science, 337:217–239, 2005. [3] J. Broisin and C. Herouard. Design and evaluation of a semantic indicator for automatically supporting programming learning. In Int. Conf. Educational Data Mining, pages 270–275, 2019. [4] M. Chilowicz, E. Duris, and G. Roussel. Syntax tree fingerprinting for source code similarity detection. In 2009 IEEE 17th International Conference on Program Comprehension, pages 243–247. IEEE, 2009. [5] Z. Djuric and D. Gasevic. A source code similarity system for plagiarism detection. The Computer Journal, 56(1):70–86, 2013. [6] X.Gao, B.Xiao, D.Tao,andX.Li.Asurveyofgraph edit distance. Pattern Analysis and Applications, 13(1):113–129, 2010. [7] O.Karnalim.Syntaxtreesandinformationretrievalto improve code similarity detection. In Proceedings of the Twenty-Second Australasian Computing Education Conference, pages 48–55, 2020. [8] M. Munkres. Algorithms for the Assignment and Transportation Problems. Journal of the Society of Industrial and Applied Mathematics, 5(1):32–38, 1957. [9] C. Ragkhitwetsagul, J. Krinke, and D. Clark. A comparison of code similarity analysers. Empirical Software Engineering, 23(4):2464–2519, 2018. [10] T. Sager, A. Bernstein, M. Pinzger, and C. Kiefer. Detecting similar java classes using tree algorithms. In Int. Workshop on Mining software repositories, pages 65–71, 2006. Proceedings of The 14th International Conference on Educational Data Mining (EDM 2021) 467