StyleCounsel: Seeing the (Random) Forest for the Trees in Adversarial Code Stylometry

by Christopher McKnight

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer Science

Waterloo, Ontario, Canada, 2018

© Christopher McKnight 2018

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.

I understand that my thesis may be made electronically available to the public.

Abstract

Authorship attribution has piqued the interest of scholars for centuries, but had historically remained a matter of subjective opinion, based upon examination of handwriting and the physical document. Midway through the 20th Century, a technique known as stylometry was developed, in which the content of a document is analyzed to extract the author’s grammar use, preferred vocabulary, and other elements of compositional style. In parallel to this, programmers, and particularly those involved in education, were writing and testing systems designed to automate the analysis of good coding style and best practice, in order to assist with grading assignments. In the aftermath of the Morris Worm incident in 1988, researchers began to consider whether this automated analysis of program style could be combined with stylometry techniques and applied to source code, to identify the author of a program.

The results of recent experiments have suggested this code stylometry can successfully identify the author of short programs from among hundreds of candidates with up to 98% precision. This potential ability to discern the programmer of a sample of code from a large group of possible authors could have concerning consequences for the open-source community at large, particularly those contributors that may wish to remain anonymous.
Recent international events have suggested the developers of certain anti-censorship and anti-surveillance tools are being targeted by their governments and forced to delete their repositories or face prosecution.

In light of this threat to the freedom and privacy of individual programmers around the world, and due to a dearth of published research into practical code stylometry at scale and its feasibility, we carried out a number of investigations looking into the difficulties of applying this technique in the real world, and how one might effect a robust defence against it. To this end, we devised a system to aid programmers in obfuscating their inherent style and imitating another, overt, author’s style in order to protect their anonymity from this forensic technique. Our system utilizes the implicit rules encoded in the decision points of a random forest ensemble in order to derive a set of recommendations to present to the user detailing how to achieve this obfuscation and mimicry attack. In order to best test this system, and simultaneously assess the difficulties of performing practical stylometry at scale, we also gathered a large corpus of real open-source software and devised our own feature set including both novel attributes and those inspired or borrowed from other sources.

Our results indicate that attempting a mass analysis of publicly available source code is fraught with difficulties in ensuring the integrity of the data. Furthermore, we found our own and most other published feature sets do not capture an author’s style sufficiently independently of the content to be very effective at scale, although their accuracy is significantly greater than a random guess. Evaluations of our tool indicate it can successfully extract a set of changes that would result in a misclassification as another user if implemented. More importantly, this extraction was independent of the specifics of the feature set, and therefore would still work even with a more accurate model of style.
We ran a limited user study to assess the usability of the tool, and found overall it was beneficial to our participants, and could be even more beneficial if the valuable feedback we received were implemented in future work.

Acknowledgements

This work benefitted from the use of the CrySP RIPPLE Facility at the University of Waterloo.

I would like to thank my supervisor Ian Goldberg for his guidance, encouragement, and resolute attention to detail. Our weekly meetings were a great help in nudging me toward the finish line, while the many opportunities for personal development offered throughout the duration of my programme were priceless. Truly my eyes have been opened to a world beyond that with which I was familiar, and I shall never look back.

To the members of my committee, Mike Godfrey and Yaoliang Yu, I am very grateful for your time, expertise and valuable feedback.

Being able to use the private study rooms at Waterloo Public Library while writing this thesis was priceless.

Finally, I would like to thank Radio X and all its DJs for keeping my sanity during long hours of coding and writing, particularly Johnny Vaughan and his “4til7 Thang” (even Little Si).

Dedication

For my loving and supportive wife Katya, whose tireless drive and work ethic was a constant inspiration, and our son James, whose companionship on many contemplative early morning walks helped get me through even the darkest hours of this arduous journey.

Table of Contents

List of Tables
List of Figures
1 Introduction
2 Motivation
3 Background
  3.1 The Federalist Papers
  3.2 The Morris/Internet Worm
4 Literature Review
  4.1 Authorship Attribution of Natural Language
    4.1.1 Early Work
    4.1.2 Computer-Assisted Studies
    4.1.3 Internet-Scale Authorship Attribution
  4.2 Plagiarism Detection
    4.2.1 Intrinsic Plagiarism Detection
    4.2.2 Authorship Verification
  4.3 Authorship Attribution of Software
    4.3.1 Source Code Attribution
    4.3.2 Executable Code Attribution
  4.4 Defences Against Authorship Attribution
5 Implementation and Methodology
  5.1 Contributions
  5.2 Background
    5.2.1 Choice of Programming Language
    5.2.2 Eclipse IDE
    5.2.3 Weka Machine Learning
    5.2.4 Random Forest
    5.2.5 GitHub
  5.3 Obtaining Data
    5.3.1 The GitHub Data API
    5.3.2 Rate Limiting
    5.3.3 Ensuring Sole and True Authorship
    5.3.4 Removing Duplicate and Common Files
    5.3.5 Summary of Data Collection
  5.4 Feature Extraction
    5.4.1 Node Frequencies
    5.4.2 Node Attributes
    5.4.3 Identifiers
    5.4.4 Comments
    5.4.5 Other
  5.5 Training and Making Predictions
  5.6 Making Recommendations
    5.6.1 Requirements
    5.6.2 Parsing the Random Forest
    5.6.3 Analyzing the Split Points
    5.6.4 Presenting to the User
  5.7 Using the Plugin
  5.8 Pilot User Study
    5.8.1 Study Details
6 Results
  6.1 Conducting Source Code Authorship Attribution at the Internet Scale
  6.2 Extracting a Class of Feature Vectors That Can Systematically Effect a Classification as Any Given Target
  6.3 Pilot User Study
    6.3.1 Results
    6.3.2 Experiences with Manual Task
    6.3.3 Experiences with Assisted Task
    6.3.4 Summary of User Study
7 Conclusions
  7.1 Future Work
  7.2 Final Remarks
References
APPENDICES
A User Study Questionnaire

List of Tables

5.1 Repositories per author
5.2 Node class hierarchy counts
5.3 Node attributes
6.1 Investigation into repositories that completely failed