Efficient, Scalable, and Accurate Program Fingerprinting in Binary Code

Saed Alrabaee

A Thesis
in
The Concordia Institute for Information Systems Engineering

Presented in Partial Fulfillment of the Requirements
for the Degree of
Doctor of Philosophy (Information and Systems Engineering) at
Concordia University
Montreal, Quebec, Canada

February 2018

© Saed Alrabaee, 2018

CONCORDIA UNIVERSITY
SCHOOL OF GRADUATE STUDIES

This is to certify that the thesis prepared

By: Saed Alrabaee
Entitled: Efficient, Scalable, and Accurate Program Fingerprinting in Binary Code

and submitted in partial fulfillment of the requirements for the degree of

Doctor of Philosophy (Information and Systems Engineering)

complies with the regulations of the University and meets the accepted standards with respect to originality and quality.

Signed by the final examining committee:

Chair: Dr. Joey Paquet
External Examiner: Dr. Jean-Yves Marion
External to Program: Dr. Peter Grogono
Examiner: Dr. Amr Youssef
Examiner: Dr. Mohammad Mannan
Thesis Supervisor(s): Dr. Mourad Debbabi, Dr. Lingyu Wang

Approved by: Dr. Rachida Dssouli, Chair of Department or Graduate Program Director

Date of Defence: April 06, 2018

Dr. A. Asif, Dean of Faculty of Engineering and Computer Science

ABSTRACT

Efficient, Scalable, and Accurate Program Fingerprinting in Binary Code

Saed Alrabaee, Ph.D.
Concordia University, 2018

Why was this binary written? Which compiler was used? Which free software packages did the developer use? Which sections of the code were borrowed? Who wrote the binary?
These questions are of paramount importance to security analysts and reverse engineers, and binary fingerprinting approaches may provide valuable insights that can help answer them. This thesis advances the state of the art by addressing some of the most fundamental problems in program fingerprinting for binary code, notably, reusable binary code discovery, fingerprinting free open source software packages, and authorship attribution.

First, to tackle the problem of discovering reusable binary code, we employ a technique for identifying reused functions by matching traces of a novel representation of binary code known as the semantic integrated graph. This graph enhances the control flow graph, the register flow graph, and the function call graph, key concepts from classical program analysis, and merges them with other structural information to create a joint data structure.

Second, we approach the problem of fingerprinting free open source software (FOSS) packages by proposing a novel resilient and efficient system that incorporates three components. The first extracts the syntactical features of functions by considering opcode frequencies and performing a hidden Markov model statistical test. The second applies a neighborhood hash graph kernel to random walks derived from control flow graphs, with the goal of extracting the semantics of the functions. The third applies the z-score to normalized instructions to extract the behavior of the instructions in a function. Then, the components are integrated using a Bayesian network model which synthesizes the results to determine the FOSS function, making it possible to detect user-related functions.

With these elements now in place, we present a framework capable of decoupling binary program functionality from the coding habits of authors. To capture coding habits, the framework leverages a set of features that are based on collections of functionality-independent choices made by authors during coding.
Finally, it is well known that techniques such as refactoring and code transformations can significantly alter the structure of code, even for simple programs. Applying such techniques or changing the compiler and compilation settings can significantly affect the accuracy of available binary analysis tools, which severely limits their practicability, especially when applied to malware. To address these issues, we design a technique that extracts the semantics of binary code in terms of both data and control flow. The proposed technique allows more robust binary analysis because the extracted semantics of the binary code is generally immune to code transformation, refactoring, and changes of compiler or compilation settings. Specifically, it employs data-flow analysis to extract the semantic flow of the registers as well as the semantic components of the control flow graph, which are then synthesized into a novel representation called the semantic flow graph (SFG).

We evaluate the framework on large-scale datasets extracted from selected open source C++ projects on GitHub, Google Code Jam events, Planet Source Code contests, and students' programming projects, and find that it outperforms existing methods in several respects. First, it is able to detect the reused functions. Second, it can identify FOSS packages in real-world projects and reused binary functions with high precision. Third, it decouples authorship from functionality so that it can be applied to real malware binaries to automatically generate evidence of similar coding habits. Fourth, compared to existing research contributions, it successfully attributes a larger number of authors with a significantly higher accuracy. Finally, the new framework is more robust than previous methods in the sense that there is no significant drop in accuracy when the code is subjected to refactoring techniques, code transformation methods, and different compilers.

ACKNOWLEDGEMENTS

I would like to express my heartfelt gratitude to my supervisors Prof.
Mourad Debbabi and Prof. Lingyu Wang, who contributed to this thesis and improved it significantly with their guidance and advice. Their extensive and profound knowledge, precious insights, and constructive criticism allowed me to conduct successful research. I greatly appreciate their dedication to helping students academically. I should not forget to mention that although Dr. Debbabi had a very busy schedule, he made time to meet with all students. Moreover, his charisma inspires everyone. Thank you for everything! My gratitude extends to the members of the examining committee: Drs. Jean-Yves Marion, Peter Grogono, Amr Youssef, and Mohammad Mannan, who honored me by agreeing to evaluate this thesis. Their time and efforts are highly appreciated. Special thanks to my colleagues, namely, Paria Shirani, Noman Saleem, Stere Preda, and Ashkan Rahimian, who provided their valuable expertise. I convey very special acknowledgements to Paria Shirani, who provided me with insightful technical discussions. I would also like to thank my close friend, Mahmoud Khasawneh, for stimulating discussions and the good memories we shared. I feel very fortunate to have such a nice friend. Further, I would like to express my gratitude to Abdullah Amareen, Momen Oqaily, Ahmad Bataineh, and Suhib Melhem, who supported me strongly. Last but not least, I would like to express my profound gratitude to my beloved parents, sisters, and brothers for their encouragement and love.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES

1 Introduction
  1.1 Motivations
  1.2 Research objectives and contributions
    1.2.1 SIGMA: Reused Function Identification [29]
    1.2.2 FOSSIL: Free Open Source Packages Identification [30]
    1.2.3 BinAuthor: Binary Authorship Attribution
    1.2.4 BinGold: Extracting the Semantics of Binary Code [31]
  1.3 Thesis Organization
2 Background and Related Work
  2.1 Importance of Binary Analysis
    2.1.1 Reverse Engineering
    2.1.2 Malware Analysis
    2.1.3 Digital Forensics
    2.1.4 Software Infringement
  2.2 Binary Analysis Challenges
  2.3 Binary Code Transformation
    2.3.1 Function Inlining
    2.3.2 Instruction Reordering
    2.3.3 Refactoring Process
  2.4 Reused Function Identification
  2.5 Fingerprinting Free Open Source Function
    2.5.1 Search-Based Function Fingerprinting
    2.5.2 Dynamic Function Fingerprinting Methods
  2.6 Authorship Attribution
    2.6.1 Source Code Authorship
    2.6.2 Binary Code Authorship
  2.7 Summary
3 Towards Identifying Reused Functions in Binary Code
  3.1 Overview
  3.2 Existing Representations of Binary Code
    3.2.1 Control Flow Graph
    3.2.2 Register Flow Graph
    3.2.3 Function Call Graph
  3.3 SIGMA Approach
    3.3.1 Overview
    3.3.2 Building Blocks
      A. Structural Information Control Flow Graph
      B. Merged Register Flow Graph
      C. Color Function Call Graph
    3.3.3 SIG: Semantic Integrated Graph
    3.3.4 Graph Edit Distance
  3.4 Experimental Results
  3.5 Summary
4 Identifying Free Open-Source Software Functions in Binary Code
  4.1 Overview
  4.2 Preliminaries
    4.2.1 Challenges
    4.2.2 Threat Model
    4.2.3 System Overview
    4.2.4 FOSS Packages
  4.3 Design and Implementation of Our System
    4.3.1 Features
    4.3.2 Feature Selection
    4.3.3 Detection Method
      A. Hidden Markov Model
      B. Neighborhood Hash Graph Kernel
      C. Calculation of Z-score
      D. Bayesian Network Model
  4.4 Evaluation
    4.4.1 Dataset Preparation
    4.4.2 Evaluation Metrics
    4.4.3 Accuracy of FOSSIL
      A. Effect of Bayesian network model
      B. Accuracy across different versions of FOSS packages
    4.4.4 Comparison
    4.4.5 Scalability Study
    4.4.6 Confidence Estimation of Bayesian Network
    4.4.7 Impact of Evading Techniques
  4.5 Summary
5 Identifying the Authors of Program Binaries
  5.1 Overview
  5.2 Preliminaries
    5.2.1 Authorship Attribution
    5.2.2 Threat Model
  5.3 BinAuthor
    5.3.1 Filtration Process
    5.3.2 Feature Categorization
      A. General Choices
      B. Variable Choices