
StyleCounsel: Seeing the (Random) Forest for the Trees in Adversarial Code Stylometry

by

Christopher McKnight

A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in Computer Science

Waterloo, Ontario, Canada, 2018

© Christopher McKnight 2018

I hereby declare that I am the sole author of this thesis. This is a true copy of the thesis, including any required final revisions, as accepted by my examiners.

I understand that my thesis may be made electronically available to the public.

Abstract

Authorship attribution has piqued the interest of scholars for centuries, but had historically remained a matter of subjective opinion, based upon examination of handwriting and the physical document. Midway through the 20th Century, a technique known as stylometry was developed, in which the content of a document is analyzed to extract the author’s grammar use, preferred vocabulary, and other elements of compositional style. In parallel to this, programmers, and particularly those involved in education, were writing and testing systems designed to automate the analysis of good coding style and best practice, in order to assist with grading assignments. In the aftermath of the Morris Worm incident in 1988, researchers began to consider whether this automated analysis of program style could be combined with stylometry techniques and applied to source code, to identify the author of a program.

The results of recent experiments have suggested this code stylometry can successfully identify the author of short programs from among hundreds of candidates with up to 98% precision. This potential ability to discern the programmer of a sample of code from a large group of possible authors could have concerning consequences for the open-source community at large, particularly those contributors that may wish to remain anonymous.
Recent international events have suggested the developers of certain anti-censorship and anti-surveillance tools are being targeted by their governments and forced to delete their repositories or face prosecution.

In light of this threat to the freedom and privacy of individual programmers around the world, and due to a dearth of published research into practical code stylometry at scale and its feasibility, we carried out a number of investigations looking into the difficulties of applying this technique in the real world, and how one might effect a robust defence against it. To this end, we devised a system to aid programmers in obfuscating their inherent style and imitating another, overt, author’s style in order to protect their anonymity from this forensic technique. Our system utilizes the implicit rules encoded in the decision points of a random forest ensemble in order to derive a set of recommendations to present to the user detailing how to achieve this obfuscation and mimicry attack. In order to best test this system, and simultaneously assess the difficulties of performing practical stylometry at scale, we also gathered a large corpus of real open-source software and devised our own feature set including both novel attributes and those inspired or borrowed from other sources.

Our results indicate that attempting a mass analysis of publicly available source code is fraught with difficulties in ensuring the integrity of the data. Furthermore, we found ours and most other published feature sets do not sufficiently capture an author’s style independently of the content to be very effective at scale, although its accuracy is significantly greater than a random guess. Evaluations of our tool indicate it can successfully extract a set of changes that would result in a misclassification as another user if implemented. More importantly, this extraction was independent of the specifics of the feature set, and therefore would still work even with a more accurate model of style.
We ran a limited user study to assess the usability of the tool, and found overall it was beneficial to our participants, and could be even more beneficial if the valuable feedback we received were implemented in future work.

Acknowledgements

This work benefitted from the use of the CrySP RIPPLE Facility at the University of Waterloo.

I would like to thank my supervisor Ian Goldberg for his guidance, encouragement, and resolute attention to detail. Our weekly meetings were a great help in nudging me toward the finish line, while the many opportunities for personal development offered throughout the duration of my programme were priceless. Truly my eyes have been opened to a world beyond that with which I was familiar, and I shall never look back.

To the members of my committee, Mike Godfrey and Yaoliang Yu, I am very grateful for your time, expertise and valuable feedback.

Being able to use the private study rooms at Waterloo Public Library while writing this thesis was priceless.

Finally, I would like to thank Radio X and all its DJs for keeping my sanity during long hours of coding and writing, particularly Johnny Vaughan and his “4 til 7 Thang” (even Little Si).

Dedication

For my loving and supportive wife Katya, whose tireless drive and work ethic was a constant inspiration, and our son James, whose companionship on many contemplative early morning walks helped get me through even the darkest hours of this arduous journey.

Table of Contents

List of Tables
List of Figures
1 Introduction
2 Motivation
3 Background
    3.1 The Federalist Papers
    3.2 The Morris/Internet Worm
4 Literature Review
    4.1 Authorship Attribution of Natural Language
        4.1.1 Early Work
        4.1.2 Computer-Assisted Studies
        4.1.3 Internet-Scale Authorship Attribution
    4.2 Plagiarism Detection
        4.2.1 Intrinsic Plagiarism Detection
        4.2.2 Authorship Verification
    4.3 Authorship Attribution of Software
        4.3.1 Source Code Attribution
        4.3.2 Executable Code Attribution
    4.4 Defences Against Authorship Attribution
5 Implementation and Methodology
    5.1 Contributions
    5.2 Background
        5.2.1 Choice of Programming Language
        5.2.2 Eclipse IDE
        5.2.3 Weka Machine Learning
        5.2.4 Random Forest
        5.2.5 GitHub
    5.3 Obtaining Data
        5.3.1 The GitHub Data API
        5.3.2 Rate Limiting
        5.3.3 Ensuring Sole and True Authorship
        5.3.4 Removing Duplicate and Common Files
        5.3.5 Summary of Data Collection
    5.4 Feature Extraction
        5.4.1 Node Frequencies
        5.4.2 Node Attributes
        5.4.3 Identifiers
        5.4.4 Comments
        5.4.5 Other
    5.5 Training and Making Predictions
    5.6 Making Recommendations
        5.6.1 Requirements
        5.6.2 Parsing the Random Forest
        5.6.3 Analyzing the Split Points
        5.6.4 Presenting to the User
    5.7 Using the Plugin
    5.8 Pilot User Study
        5.8.1 Study Details
6 Results
    6.1 Conducting Source Code Authorship Attribution at the Internet Scale
    6.2 Extracting a Class of Feature Vectors That Can Systematically Effect a Classification as Any Given Target
    6.3 Pilot User Study
        6.3.1 Results
        6.3.2 Experiences with Manual Task
        6.3.3 Experiences with Assisted Task
        6.3.4 Summary of User Study
7 Conclusions
    7.1 Future Work
    7.2 Final Remarks
References
APPENDICES
A User Study Questionnaire

List of Tables

5.1 Repositories per author
5.2 Node class hierarchy counts
5.3 Node attributes
6.1 Investigation into repositories that completely failed



