StyleCounsel: Seeing the (Random) Forest
for the Trees in Adversarial Code
Stylometry
by
Christopher McKnight
A thesis
presented to the University of Waterloo
in fulfillment of the
thesis requirement for the degree of
Master of Mathematics
in
Computer Science
Waterloo, Ontario, Canada, 2018
© Christopher McKnight 2018
I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including
any required final revisions, as accepted by my examiners.
I understand that my thesis may be made electronically available to the public.
Abstract
Authorship attribution has piqued the interest of scholars for centuries, but historically
remained a matter of subjective opinion, based upon examination of handwriting and the physical
document. Midway through the 20th century, a technique known as stylometry was developed,
in which the content of a document is analyzed to extract the author’s grammar use, preferred
vocabulary, and other elements of compositional style. In parallel to this, programmers, and
particularly those involved in education, were writing and testing systems designed to automate
the analysis of good coding style and best practice, in order to assist with grading assignments.
In the aftermath of the Morris Worm incident in 1988, researchers began to consider whether this
automated analysis of program style could be combined with stylometry techniques and applied
to source code, to identify the author of a program.
The results of recent experiments have suggested this code stylometry can successfully identify
the author of short programs from among hundreds of candidates with up to 98% precision.
This potential ability to discern the programmer of a sample of code from a large group of
possible authors could have concerning consequences for the open-source community at large,
particularly those contributors who may wish to remain anonymous. Recent international events
have suggested the developers of certain anti-censorship and anti-surveillance tools are being
targeted by their governments and forced to delete their repositories or face prosecution.
In light of this threat to the freedom and privacy of individual programmers around the world,
and due to a dearth of published research into practical code stylometry at scale and its feasibility,
we carried out a number of investigations looking into the difficulties of applying this technique
in the real world, and how one might effect a robust defence against it. To this end, we devised
a system to aid programmers in obfuscating their inherent style and imitating another, overt,
author’s style in order to protect their anonymity from this forensic technique. Our system utilizes
the implicit rules encoded in the decision points of a random forest ensemble in order to derive
a set of recommendations to present to the user detailing how to achieve this obfuscation and
mimicry attack. In order to best test this system, and simultaneously assess the difficulties of
performing practical stylometry at scale, we also gathered a large corpus of real open-source
software and devised our own feature set including both novel attributes and those inspired or
borrowed from other sources.
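The idea of reading recommendations off decision points can be illustrated with a minimal sketch. It is an illustration only and not the implementation described in this thesis: it uses a single hand-built decision tree rather than a full random forest, and the feature names (comment_ratio, avg_identifier_length), author labels, thresholds, and the helper names paths_to and recommend are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Tuple

Interval = Tuple[float, float]  # (low, high]: low < value <= high


@dataclass
class Node:
    feature: Optional[str] = None    # None marks a leaf
    threshold: float = 0.0           # split point for internal nodes
    left: Optional["Node"] = None    # branch taken when value <= threshold
    right: Optional["Node"] = None   # branch taken when value > threshold
    label: Optional[str] = None      # author label stored at a leaf


def paths_to(node: Node, target: str,
             bounds: Optional[Dict[str, Interval]] = None
             ) -> Iterator[Dict[str, Interval]]:
    """Yield, for each leaf labelled `target`, the feature intervals leading to it."""
    bounds = bounds or {}
    if node.feature is None:
        if node.label == target:
            yield dict(bounds)
        return
    lo, hi = bounds.get(node.feature, (float("-inf"), float("inf")))
    # Left branch tightens the upper bound, right branch tightens the lower bound.
    yield from paths_to(node.left, target,
                        {**bounds, node.feature: (lo, min(hi, node.threshold))})
    yield from paths_to(node.right, target,
                        {**bounds, node.feature: (max(lo, node.threshold), hi)})


def recommend(sample: Dict[str, float],
              constraints: Dict[str, Interval]) -> List[str]:
    """Describe the changes needed for `sample` to satisfy every interval."""
    advice = []
    for feat, (lo, hi) in sorted(constraints.items()):
        value = sample[feat]
        if value <= lo:
            advice.append(f"increase {feat} above {lo}")
        elif value > hi:
            advice.append(f"decrease {feat} to at most {hi}")
    return advice


# Hypothetical two-author tree and a sample currently classified as "alice".
tree = Node("comment_ratio", 0.2,
            left=Node(label="alice"),
            right=Node("avg_identifier_length", 8.0,
                       left=Node(label="bob"),
                       right=Node(label="alice")))
sample = {"comment_ratio": 0.1, "avg_identifier_length": 12.0}

for constraints in paths_to(tree, "bob"):
    print(recommend(sample, constraints))
```

In a forest, the same traversal would have to be repeated over every tree and the resulting intervals reconciled across trees, which this sketch does not attempt.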
Our results indicate that attempting a mass analysis of publicly available source code is
fraught with difficulties in ensuring the integrity of the data. Furthermore, we found that our
feature set, like most other published feature sets, does not sufficiently capture an author’s style
independently of the content to be very effective at scale, although its accuracy is significantly
greater than a random guess. Evaluations of our tool indicate it can successfully extract a set of
changes that would result in a misclassification as another user if implemented. More importantly,
this extraction was independent of the specifics of the feature set, and therefore would still work
even with a more accurate model of style. We ran a limited user study to assess the usability of
the tool, and found overall it was beneficial to our participants, and could be even more beneficial
if the valuable feedback we received were implemented in future work.
Acknowledgements
This work benefitted from the use of the CrySP RIPPLE Facility at the University of Waterloo.
I would like to thank my supervisor Ian Goldberg for his guidance, encouragement, and resolute
attention to detail. Our weekly meetings were a great help in nudging me toward the finish
line, while the many opportunities for personal development offered throughout the duration of
my programme were priceless. Truly my eyes have been opened to a world beyond that with
which I was familiar, and I shall never look back.
To the members of my committee, Mike Godfrey and Yaoliang Yu, I am very grateful for
your time, expertise and valuable feedback.
Being able to use the private study rooms at Waterloo Public Library while writing this thesis
was priceless.
Finally, I would like to thank Radio X and all its DJs for keeping my sanity during long hours
of coding and writing, particularly Johnny Vaughan and his “4 til 7 Thang” (even Little Si).
Dedication
For my loving and supportive wife Katya, whose tireless drive and work ethic were a constant
inspiration, and our son James, whose companionship on many contemplative early morning
walks helped get me through even the darkest hours of this arduous journey.
Table of Contents
List of Tables
List of Figures
1 Introduction
2 Motivation
3 Background
3.1 The Federalist Papers
3.2 The Morris/Internet Worm
4 Literature Review
4.1 Authorship Attribution of Natural Language
4.1.1 Early Work
4.1.2 Computer-Assisted Studies
4.1.3 Internet-Scale Authorship Attribution
4.2 Plagiarism Detection
4.2.1 Intrinsic Plagiarism Detection
4.2.2 Authorship Verification
4.3 Authorship Attribution of Software
4.3.1 Source Code Attribution
4.3.2 Executable Code Attribution
4.4 Defences Against Authorship Attribution
5 Implementation and Methodology
5.1 Contributions
5.2 Background
5.2.1 Choice of Programming Language
5.2.2 Eclipse IDE
5.2.3 Weka Machine Learning
5.2.4 Random Forest
5.2.5 GitHub
5.3 Obtaining Data
5.3.1 The GitHub Data API
5.3.2 Rate Limiting
5.3.3 Ensuring Sole and True Authorship
5.3.4 Removing Duplicate and Common Files
5.3.5 Summary of Data Collection
5.4 Feature Extraction
5.4.1 Node Frequencies
5.4.2 Node Attributes
5.4.3 Identifiers
5.4.4 Comments
5.4.5 Other
5.5 Training and Making Predictions
5.6 Making Recommendations
5.6.1 Requirements
5.6.2 Parsing the Random Forest
5.6.3 Analyzing the Split Points
5.6.4 Presenting to the User
5.7 Using the Plugin
5.8 Pilot User Study
5.8.1 Study Details
6 Results
6.1 Conducting Source Code Authorship Attribution at the Internet Scale
6.2 Extracting a Class of Feature Vectors That Can Systematically Effect a Classification as Any Given Target
6.3 Pilot User Study
6.3.1 Results
6.3.2 Experiences with Manual Task
6.3.3 Experiences with Assisted Task
6.3.4 Summary of User Study
7 Conclusions
7.1 Future Work
7.2 Final Remarks
References
APPENDICES
A User Study Questionnaire
List of Tables
5.1 Repositories per author
5.2 Node class hierarchy counts
5.3 Node attributes
6.1 Investigation into repositories that completely failed