ebook img

Semantic Differential Repair for Input Validation and Sanitization PDF

12 Pages·2014·0.58 MB·English
by  
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview Semantic Differential Repair for Input Validation and Sanitization

Artifact for InSpeumtaVnatliicdaDtiifofenreanntdiaSlaRneiptiazairtion ∗ TSAS*I**CdeosnusiestReeontty*tsCaoaEm*puldeteetln*eW*maeulcloDvAEEC* Muath Alkhalaf, Abdulbaki Aydin, and Tevfik Bultan DepartmentofComputerScience UniversityofCalifornia SantaBarbara,CA,USA {muath | baki | bultan}@cs.ucsb.edu ABSTRACT typically expects the user input to be in a certain format. Since the user input can contain typing errors, or may be Correct validation and sanitization of user input is crucial purposefullywritten(byamalicioususer)toviolatetheex- inwebapplicationsforavoidingsecurityvulnerabilitiesand pected format, the web application has to validate the user erroneous application behavior. We present an automated input using input validation operations such as regular ex- differential repair technique for input validation and saniti- pression matching, and sometimes modify the input to put zation functions. Differential repair can be used within an itintheexpectedformatusingsanitizationoperationssuch applicationtorepairclientandserver-sidecodewithrespect as string replacement. to each other, or across applications in order to strengthen Ifinputvalidationorsanitizationisnotused,inputsthat thevalidationandsanitizationchecks. Givenareferenceand violate the expected format can easily cause an application atargetfunction,ourdifferentialrepairtechniquestrength- tocrashsincetheuserinputbecomestheinputparameterof ens the validation and sanitization operations in the tar- theactionthatisexecutedbasedontheuserrequest. More- get function based on the reference function. It does this over,duringactionexecution,userinputcanbepassedasa by synthesizing three patches: a validation, a length, and parametertosecuritysensitiveoperationssuchassendinga a sanitization patch. Our automated patch synthesis algo- querytotheback-enddatabase. Inordertoensurethesecu- rithms are based on forward and backward symbolic string rityoftheapplication,theuserinputsthatflowintosensitive analyses that use automata as a symbolic representation. functionsmustbecorrectlyvalidatedandsanitized. Dueto Compositionofthethreeautomaticallysynthesizedpatches global accessibility of web applications, malicious users all withtheoriginaltargetfunctionresultsintherepairedfunc- around the world can exploit a vulnerable application, so tion, which provides stronger validation and sanitization any existing vulnerability in a web application is likely to than both the target and the reference functions. be exploited by some malicious user somewhere. Given the CategoriesandSubjectDescriptors significanceofthissecuritythreat,onewouldexpectwebap- plicationdeveloperstobeextremelycarefulinwritinginput D.2.4 [Software Engineering]: Software/Program Verifi- validation and sanitization functions. Unfortunately, web cation—Formal methods; D.2.5 [Software Engineering]: applications are notorious for security vulnerabilities such Testing and Debugging asSQLinjectionandcross-sitescripting(XSS)thataredue GeneralTerms to improper input validation and sanitization. Inthispaper,wefocusonanalyzingandrepairingvalida- Reliability, Verification tion and sanitization functions in web applications. In or- Keywords der to repair a function, one first needs specification of the expected behavior. Manual specification of expected input Differential repair, automated repair, string analysis, input formats for different types of input fields, or manual speci- validation and sanitization fication of attack patterns that characterize different types 1. INTRODUCTION of vulnerabilities require extra effort from the developers and are error prone. In this paper, we present a differential Oneofthemainformsofinteractionbetweenauseranda analysis approach that eliminates the need to write manual web application is through text fields. For many text fields specifications. (such as username, email, zip code, etc.) a web application Webapplicationdevelopersoftenintroduceredundantin- ∗ThisresearchissupportedbytheNSFgrantCNS-1116967. Muath putvalidationandsanitizationcodeintheclientandserver- AlkhalafissupportedbyafellowshipfromtheKingSaudUniversity. side code of a web application. The checks done on the client-side improve the responsiveness of the application by preventing unnecessary communication with the server and Permissiontomakedigitalorhardcopiesofallorpartofthisworkforpersonalor cPlearsmsroisosmiounsetoismgraankteeddwigiitthaolutofreehaprrdovcidoepdiethsatocfoaplilesoarrepnaorttmofadtehiosrwdiostrrkibuftoerd reduce the server load at the same time. However, since fpoerrpsroonfiatloorrcocmlamsserrcoioamladuvsaentiasgegarnadnttehdatwcoipthieosubtefaerethpisronovtiidceedantdhathtecfouplliceistaatiroen a malicious user can by-pass the client-side checks, it is onthefirstpage.CopyrightsforcomponentsofthisworkownedbyothersthanACM mnoutstmbeadheonoorreddi.sAtribbsutrtaecdtinfgorwpitrhocfiretdoirticsopmermmiettrecdi.alToadcvopayntoatgheerwanisde,tohrartecpoupbliiessh, necessary to re-validate and re-sanitize at the server-side. tboepaorstthoisnnseortviecresaonrdtothreedifsutrlilbcuitteattoiolnistos,nrethqeuifirersstprpioargsep.eTcioficcoppeyrmoisthsieornwainsde/,otroa Moreover, many applications repeat the checks for differ- freeep.uRbelqisuhe,sttopeprmositssoionnssefrrvoemrsPoerrmtoissrieodniss@traibcmut.eortgo.lists,requirespriorspecific enttypesoffieldsindifferentpartsoftheapplicationwhich IpSeSrmTAis’s1i4o,nJaunlyd/2o1r–a2f5e,e2.014,SanJose,CA,USA can be exploited to obtain multiple instances of the valida- CISoSpTyAri’g1h4t2Ju0l1y42A1–C2M5,927081-41,-4S5a0n3J-o2s6e4,5C-a2l/i1f4o/r0n7ia..,.$U1S5A.00 hCtotpp:y//rdigxh.dto2i0.o1r4g/A1C0.M114957/82-611-40530834-.226614054-20/114/07...$15.00. 225 function reference_function($x){ function validation_patch($x){ if (strlen($x) > 4) if (preg_match(’/<*|[^<].{4,}|<[^<].{3,}|<<[^<].{2,}| exit(); <<<[^<].+/’,$x)) else { exit(); $x = preg_replace(’/</’, ’’, $x); else if ($x == ’’) return $x; exit(); } else return $x; function length_patch($x){ } if (preg_match( } ’/"".{1,2}|".{1,2}"|.{1,2}""|"[^"]{3,3}|[^"]{3,3}"/’,$x)) exit(); function target_function($y){ else $y = preg_replace(’/"/’, ’\"’, $y); return $x; return $y; } } function sanitization_patch($x){ Figure1: Asmall, butillustrativeexample, showing $x = preg_replace(’/</’, "", $x); return $x; atargetfunctiontoberepairedbasedonareference } function. function repaired_function($x){ tionandsanitizationcodewiththesameintendedfunction- return target_function( ality. Finally, across different applications, one can easily sanitization_patch( length_patch( find multiple instances of validation and sanitization code validation_patch($x) usedtocheckstandardformats(suchasemail)ortoprotect ) against same class of vulnerabilities (such as SQL injection ) ); and XSS). Using the semantic differential repair techniques } presentedinthispaper,weexploittheseredundancieswithin and application and across applications, and automatically Figure 2: The repaired function that is generated repair input validation and sanitization functions by com- by our differential repair algorithm for the target paringthemagainsteachother. Ourresultswereevaluated function shown in Figure 1 bytheISSTAartifactevaluationcommitteeandwerefound to meet the expectations. post-image (which is the set of strings that reach the sink, Restofthepaperisorganizedasfollows. Afteranoverview i.e., the return statement). Hence, our goal is to make sure of our differential repair approach (Section 2) we present that the post-image of the repaired function does not con- our contributions which are: 1) the formal modeling of dif- tainanystringthatisnotinthepost-imageofthereference ferentialrepairproblemforinputvalidationandsanitization function and the original target function. functions(Section3),2)anoveldifferentialrepairalgorithm The post-image for the reference function in Figure 1 is that automatically generates three patches which together thelanguageofallstringsthatareshorterthan5characters constitute a repair (Section 4), 3) specialized automata- and not empty and do not contain the character ‘<’, while based pre and post-image computations for efficient string the post-image for the target function is the language of analysis(Section5),and4)theimplementationandexperi- all strings that do not contain the character ‘"’ unless it is mentalevaluationoftheproposedapproach(Section6). We precededbythecharacter‘\’. Forexample,thestring“foo” conclude the paper with a discussion of related work (Sec- isanelementinthereferencefunction’spost-imagewhilethe tion 7) followed by our concluding remarks (Section 8). string“foo<”isnotsinceitcontainsthe‘<’character. Also, the strings“foo”and“foo\"bar”are elements in the target 2. OVERVIEW function’spost-imagewhilethestring“foo"bar”isnotsince Inthissectionwegiveanoverviewofourautomateddiffer- it contains the character ‘"’ without being preceded by the ential repair technique that strengthens the validation and character ‘\’. sanitizationfunctionalityofagiventargetfunctionbasedon Duetolackofidempotencyinsomestringsanitizationop- a given reference function. erations,composingareferenceandatargetsanitizerstoget Consider the example functions shown in Figure 1. The astrongeronedoesnotworkingeneral. Forexample,ifthe referencefunctionstartswithavalidationcheckthatblocks two sanitizers share the following non-idempotent sanitiza- any string that is longer than 4 characters. This is followed tion operation preg_replace(’/"/’, ’\"’, $x) which es- by a sanitization operation which replaces the character ‘<’ capesthedouble-quotecharacter,thecompositionwouldre- with(cid:15)(i.e.,deletes‘<’). Finally,theresultofthesanitization sultindoubleescaping(e.g.,“ab"c”wouldbecome“ab\\"c” operationgoesthroughanothervalidationcheckthatblocks instead of“ab\"c”) which is a known security problem [12]. the empty string. The target function in Figure 1 does not Bycomputingthedifferencesbetweenthereferenceandtar- do any validation. It only sanitizes the input string by re- get sanitizers we avoid this problem. placingthecharacter‘"’withthestring“\"”(i.e.,itescapes Our differential repair algorithm works in three phases, the double quote characters). whereeachphasegeneratesapatch-functionwithaspecific Thegoalofourdifferentialrepairtechniqueistostrengthen purpose: (1)avalidationpatch,(2)alengthpatch,and(3)a thevalidationandsanitizationoperationsinthetargetfunc- sanitization patch. The final repair is obtained by applying tion as much as the reference function. More precisely, the the composition of all three patch-functions together. goal is to make sure that the repaired target function does not return a string that is not returned by the reference Validationpatch. function or the original target function. We call the set of The purpose of this phase is to generate a patch that stringsreturnedbyavalidationandsanitizationfunctionits makes sure that the repaired function rejects all the inputs 226 that are rejected by the reference function. Figure 2 shows In our example, there is one sanitization operation in the the validation-patch produced in this phase of the repair referencefunctioninwhichthecharacter‘<’isdeleted. Even algorithm for our running example. The validation patch after application of the validation and length patches, this blocksallinputstringsthatareeitherempty,consistofone behaviorwouldnotbefullyreplicatedattherepairedtarget or more ‘<’ characters or longer than 4 characters. For ex- function. Although the validation patch will prevent some ample,thestrings“”,“<”,“<<<”and“<html>”willbeblocked strings such as“<<<”from reaching the sink at the repaired bythevalidationpatch. Ontheotherhand,thestrings“fo” function, there are still other strings, such as“a<b”for ex- and“<a>”will not be blocked. ample, that will still be in the post-image of the repaired The validation patch blocks the inputs that generate a function but not in the post-image of the reference func- stringthatisinthepost-imageofthetargetfunctionbutnot tion, since the character ‘<’ gets deleted. The goal of the in the post-image of the reference function. Note that our sanitization patch is to remedy such situations, and make algorithmisabledetectthatsomeinputstringsareblocked sure that the sanitization operations in the target function by the reference function only after being sanitized such as are as strong as the sanitization operations in the reference thestring“<<<”(whichisfirstconvertedtoemptystringby function. deletion of ‘<’ and then blocked by the reference function). Unlike the previous two phases, the sanitization patch So,forthiscase,tomakesurethatthestring“<<<”isnotin does not block the input strings that are found in the dif- thepost-imageoftherepairedfunction,thevalidationpatch ferencebetweenthepost-imagesofthetargetandreference blocks it. functions. Instead we use an algorithm called min-cut al- gorithm to generate a sanitization code that will delete (or Lengthpatch. escape)certaincharactersintheinputstringssuchthatthe The purpose of this phase is to make sure that (1) the difference between the two post-images is removed. Using maximum length of the strings that are in the the post- thismin-cutalgorithm,ourdifferentialrepairalgorithmwill image of the repaired function is not bigger than the max- generate the sanitization patch-function shown in Figure 2. imum length of the strings that are in the post-image of This function does not block input strings that contain the the reference function and (2) the minimum length of the character ‘<’, but rather, deletes this character from these strings that are in the post-image of the repaired function input strings and returns the corresponding string without is not smaller than the minimum length of the strings that that character. This repair simulates the same sanitization are in the post-image of the reference function. behaviorofthereferencefunctioninthenewrepairedfunc- For the reference function in our example, the minimum tion. In section 4.3 we explain the min-cut algorithm and length is 1, since it blocks the empty string, and the maxi- how to automatically generate the sanitization patch. mumlengthis4. Ontheotherhand,forthetargetfunction, Given the final sanitization phase, one might think that afterthevalidationpatchisapplied,theminimumlengthis the first two phases are redundant. However, without the 1 since it also blocks the empty string, but the maximum first two phases, the repair generated by our approach can length is not 4 but 8. The reason is that the sanitization become too conservative by rejecting all input strings or by in the target function escapes the ‘"’ character so that an deleting all characters from the input string. Dividing the inputstringoflength4like“""""”(whichpassesthevalida- repairgenerationtothreeseparatephasesenablesustogen- tion patch) is escaped to produce the string“\"\"\"\"”at erate a combined repair that is not overly conservative. the sink, which is of length 8. Thefinalresultofthedifferentialrepairalgorithmforour This example shows that due to the sanitization opera- runningexampleisshowninFigure2. Therepairedfunction, tioninthetargetfunction,wegetalengthdifferenceinthe is obtained by composing the three patch-functions, in the post-image languages even though the validation patch has order in which they were introduced here, with the original alreadyblockedallstringslongerthan4. Toaddressthisis- target function. suewegeneratealengthpatchthatblocksanyinputstring 3. MODELINGSANITIZERFUNCTIONS thatresultsinastringlongerthan4charactersatthetarget sink even if the input string itself is shorter than 4 charac- As we discussed above, the differential repair algorithm ters. For example, the length patch blocks the string“"a"” takes two validation and sanitization functions as input. In although it has 3 characters only since it will result in the this section we will give a characterization of the functions string“\"a\"”oflength5atthesinkwhichislongerthan4 that our differential repair algorithm takes as input. characters. On the other hand, the string“foo”will not be Input validation and sanitization operations in web ap- blockedbythelengthpatchsinceitwillreachthesinkasit plications can be characterized using three types of func- is, 3 characters long. tions: 1) pure validator, 2) pure sanitizer and 3) validating- Figure 2 shows the length patch-function for our exam- sanitizer functions. A pure validator is a total function: ple. Note that the function assumes that the validation F :Σ∗ →{⊥,(cid:62)} patch function is applied before it so it only blocks things v notblockedbythevalidationpatchfunction. Insection4.3 that takes a string s ∈ Σ∗ and returns either (cid:62) indicating weexplainhowtoautomaticallygeneratethelengthpatch- that the string is valid and should be accepted or ⊥ indi- function. cating the string is not valid and should be rejected. Note that,apurevalidatordoesnotchangethevalueoftheinput Sanitizationpatch. string, it either accepts or rejects it as it is. The purpose of this final phase is to take care of the dif- A pure sanitizer is a total function: ferencesthatareduetosanitizationoperations. Ourgoalis F :Σ∗ →Σ∗ tomakesurethatthepost-imageoftherepairedfunctionis s a subset of the post-image of the reference function. thatmapsaninputstrings∈Σ∗toanoutputstrings(cid:48) ∈Σ∗. 227 Notethat,apuresanitizerdoesnotrejectanyinputstring, Algorithm 1 DifferentialRepair(FT,FR) hoAwevvaelri,diattimnga-ysamnoitdizifeyrsisomaefuonfctthioeni:nput strings. 12:: MM12::==AA((PPrree+⊥+⊥((FFRT))));; Fvs :Σ∗ →{⊥}∪Σ∗ 34:: if (ML(VM:=1\MM12\)M(cid:54)=2∅;)then 5: FV :=GenerateBlockingSimulator(MV); thattakesaninputstrings∈Σ∗ andeitherreturns⊥indi- 6: else cating that s is invalid or maps s to output string s(cid:48) ∈Σ∗. 7: FV :=IdentityFunction;MV :=A(∅); 8: endif Notethat,avalidating-sanitizermayrejectsomeinputsand 9: M1:=A(Post+(FR,Σ∗)); modify some others. For the rest of the paper we call a 10: M2:=A(Post+(FT,L(MV))); validating-sanitizer function a sanitizer for short. 11: Md=M2\M1; In this paper, we model all input validation and sanitiza- 12: if (L(Md)(cid:54)=∅)then tionoperationsinwebapplicationsassanitizers. Notethat, 13: if (lenmin(M2) < lenmin(M1) ∨ lenmax(M2) > lenmax(M1)) then onecansimulateapurevalidatorusingasanitizer: Ifanin- 14: M3 :=RestrictLength(M2,M1); putisrejectedbythevalidator,itisrejectedbythesanitizer 15: ML:=A(Pre+(FT,L(M2\M3))); and if it is accepted by the validator it is returned without 16: FL :=GenerateBlockingSimulator(ML); modification by the sanitizer. Obviously, any pure sanitizer 1187:: elsMe2:=M3; isalsoasanitizerthatneverrejectsaninput. Hence,byjust 19: FL:=IdentityFunction; focusing on sanitizers we are able to analyze all three types 20: endif of behavior. 2221:: iMfd(L:=(MMd2)(cid:54)=\M∅)1;then We extract one sanitizer function per input field which 23: Mmc:=A(Pre+(FT,L(Md))); characterizes all the validation and sanitization operations 24: Σmc:=Mincut(Mmc); thatareusedforthatparticularfield. Validationandsaniti- 25: FS :=GenerateSanitizer(Σmc,M1); 26: else zationoperationsinvolveuseofregularexpressionsandval- 27: FS :=IdentityFunction; idationoperationssuchasstringmatch,substring,andsan- 28: endif itizationoperationssuchasstringreplace, trim, addslashes, 29: else htmlspecialchars, etc. In section 6 we will discuss how to 30: FS :=FL:=IdentityFunction; 31: endif extractsanitizerfunctionsfromawebapplicationandwhen 32: FDR:=FT ◦FS◦FL◦FV; to map two functions to each other. 33: return FDR; DifferentialRepairProblem. Given a target sanitizer function FT and a reference san- 4. DIFFERENTIALREPAIR itizerfunctionF ,thegoalofdifferentialrepairistogener- R ate a new sanitizer function FP, called a patch, such that Given a target sanitizer FT and a reference sanitizer FR, whenF ispatchedbycomposingitwithFP,theresulting ourdifferentialrepairalgorithmconsistsofthreephasesthat T produce three patches: (1) The validation patch genera- repaired function returns a string only if F and F can R T tion phase produces FV, (2) the length patch generation both return that string. I.e., we want to make sure that a phase produces FL, and (3) the sanitization patch genera- string is not in the post-image of the repaired function if it tionphaseproducesFS. Theresultofourdifferentialrepair is not in the post-image of F or F . In order to formalize T R algorithm is a patch that is the composition of these three this, let us first define the sanitizer composition as follows: individual patches: FS◦FL◦FV and the repair we gener- GiventwosanitizerfunctionsF andF ,theircomposition, 1 2 F ◦F :Σ∗ →Σ∗∪{⊥}, is a sanitizer function defined as: ateisthecompositionofthispatchwiththetargetfunction, 1 2 i.e., F =F ◦FS◦FL◦FV. DR T (cid:40) OurdifferentialrepairalgorithmisshowninAlgorithm1. F1◦F2(x)= ⊥F1(F2(x)) iiffFF22(=x)⊥(cid:54)=⊥ The algorithm takes a target sanitizer FT and a reference sanitizerF asinputandgeneratessanitizerF asoutput R DR Now, let us also define the difference between the post- whichcorrespondstodifferentialrepairofFT withrespectto images of two sanitizer functions F1 and F2 as follows: FR. Ourdifferentialrepairalgorithmsisbasedonautomata basedsymbolicstringanalysis,andusesDeterministicFinite Diff(F1,F2)={x|∃y∈Σ∗:F1(y)=x∧(∀z∈Σ∗:F2(z)(cid:54)=x)} Automata (DFA) to represent sets of strings. For example, it computes post or pre-images of given sanitizers as DFA which is the set of strings that are in the post-image of F (where the set of strings accepted by the DFA corresponds 1 but not in the post-image of F . Given this definition, the tothepostorpre-imageofasanitizer). InAlgorithm1,each 2 differential repair problem is to automatically construct a variablewithanamestartingwithM representsaDFA.The patch FP such that Diff(F ◦FP,F ) = ∅, which means operations∩,∪,\, areautomataoperationsthatgenerate T R when we compose F with FP we want to make sure that automatathataccepttheintersection,union,differenceand T the result, F ◦FP, isatleastas strictas F , i.e., its post- complementofthelanguagesofthegivenautomata,respec- T R imageisasubsetofthepost-imageofF . Wecallthisnew tively, and the = operation tests the language equivalence R composedfunctionthedifferentialrepairF ,whereF = between two automata. The function A(L) takes a regular DR DR F ◦FP. Note that, due to the way we are constructing language L⊆Σ∗ and returns a DFA that recognizes L. We T thedifferentialrepair,bycomposingthetargetfunctionF alsouseL(M)todenotethelanguageacceptedbytheDFA T with the automatically generated patch FP, we guarantee M. The variables with a name starting with F represent that the repaired function F is at least as strict as F , sanitizers. In the remaining part of this section we discuss DR T i.e., its post-image is also a subset of the post-image of F . the three phases of the Algorithm 1. T 228 4.1 PhaseI:ValidationPatchGeneration Σ-{<} 1 Σ Our goal is to generate a validation patch FV such that: 0 < Σ-{<} 4 Σ-{Σ<} 6 Σ Σ 2 Σ-{<} 8 Σ ∀x∈Σ∗:FR(x)=⊥⇒FT ◦FV(x)=⊥, < 5 < 7 < 9 i.e., the validation patch FV guarantees that FT ◦FV does Figure 3: The validation patch automaton MV for not accept inputs that FR rejects. the example in Figure 1. The validation patch FV In order to compute the validation patch, we first need blocks the strings accepted by this automaton. to identify the set of strings that are rejected by F and T F . We call this the negative pre-image of a sanitizer. For R input values than the target function F by computing the a given sanitizer function F, this set is defined as: T differencebetweennegativepre-imagesofM andM . Ifthe 1 2 Pre (F)={s|F(s)=⊥} differenceisemptythenFV isassignedtheidentityfunction ⊥ (line 7) which is a sanitizer function that returns the input In general, it is not possible to compute the pre or post- asitiswithoutblockinganyvalue(i.e.,itisano-op). Ifthe image of a sanitizer precisely since string analysis is an un- differenceisnotempty,thetargetfunctionmustbepatched decidable problem. We use automata-based backward sym- to reject the values rejected by the reference function. To bolic string analysis techniques discussed in Section 5 to achieve this we automatically generate a patch that rejects compute an over approximation of the negative pre-image, only the strings that are rejected by F but not F . R T Pre+(F), where Pre+(F) ⊇ Pre (F). This means that, Note that the validation patch we generate is not sound ⊥ ⊥ ⊥ wemayconcludethatcertainstringsarerejectedbyF when due to over-approximation of the negative pre-image of the they are not. On the other hand, since we are comput- targetfunctionF . ThesetofstringsthatareinPre+(F ) T ⊥ R ing an over-approximation, any string that is rejected by ∩ (Pre+(F )\Pre (F )) will not be blocked by the patch ⊥ T ⊥ T F is guaranteed to be in Pre+(F). Since we are using we generate, whereas they should be blocked in order to ⊥ automata-based symbolic string analysis, the result of the reach our precise goal. We can make the validation patch negative pre-image computation is an automaton that ac- soundbyblockingallthestringsinPre+(F )withoutcom- ⊥ R ceptsthelanguagePre+(F),andwedenotethisautomaton putingthesetdifferencewithPre+(F ),but,thatwouldre- ⊥ ⊥ T as A(Pre+(F)). sult in generation of a validation patch in many cases even ⊥ Computing the pre-image of a sanitizer can be compli- whenitisnotnecessary. Ourexperimentsindicatethatthe cated due to the interaction of validation and sanitization imprecision in our pre-image computation is not a problem operations. For example, consider the code segment: in practice since for all the examples we manually checked trim($x); we observe that Pre+⊥(FR) ∩ (Pre+⊥(FT)\Pre⊥(FT)) =∅. if ($x == "") Figure 3 shows the validation patch automaton MV that exit("String is empty"); isautomaticallygeneratedfortheexampleshowninFigure1 whereΣrepresentstheASCIIcharacters. Tosavespacewe where the input string is first sanitized by the trim oper- collapsed all transitions between any two states s and s ation and then it is validated by comparing the resulting i j into one transition t . We annotate this transition with a stringtoemptystring. Tocomputethesetofinputsblocked ij setofcharactersΣ ⊆ΣsuchthatifacharactercisinΣ by this piece of code (i.e., its negative pre-image) we need C C then there is a transition on c between s and s . The sink to compute the pre-image of the set of strings that satisfy i j state along with transitions into and out of it are omitted. the condition $x == "" after the trim operation. This is Since our analysis represents the set of strings at each a language which is the Kleene closure of the union of all program point using DFA, we generate the patch repair space characters that are removed by the trim operation. function FV based on the DFA that is computed by our We compute the negative pre-image of a sanitizer using analysis. The validation patch code that is generated with automata-based backward symbolic string analysis. The GenerateBlockingSimulator filters the inputs bysimu- analysisstartsatasink(wherethesanitizerterminates)and latingtheresultingautomatonMV inFigure3todetermine goes backwards. The set of strings that can reach to a par- if the input string is accepted by MV. If the input string ticular program point is represented as an automaton that is accepted by the automaton MV, then FV will return ⊥ accepts the corresponding set of strings. Our automata- to block the input, otherwise it will return the input string based string analysis implements the pre-image and post- without modification. image computations for all common validation and saniti- zation operations as discussed in Section 5. Given a set 4.2 PhaseII:LengthPatchGeneration of input strings as output of a string operation, pre-image Thegoaloflengthpatchgenerationistogenerateapatch computation computes the set of input strings that would FL such that: result in that output set. Post-image computation does the reverse. The negative pre-image of a sanitizer is computed ∀x∈Σ∗: by a fixpoint computation that repeatedly applies the pre- ((∃y,z∈Σ∗:|FR(y)|≤|FT ◦FV(x)|≤|FR(z)|)⇒ image computations and uses automata-based widening [5] FL(x)=x)∧ to achieve convergence as discusses in Section 5. (¬(∃y,z∈Σ∗:|FR(y)|≤|FT ◦FV(x)|≤|FR(z)|)⇒ FL(x)=⊥) We use this automata-based backward symbolic string analysis in lines 1 and 2 of Algorithm 1 two construct two i.e.,giventhetargetfunctionF composedwiththevalida- T automataM andM ,thatacceptanover-approximationof tion patch FV, FL rejects any input string that will cause 1 2 the negative pre-images of F and F , respectively, where the output of F ◦FV to contain a string of length longer T R T L(M ) = Pre+(F ) and L(M ) = Pre+(F ). The next orshorterthanallthestringsintheoutputofthereference 1 ⊥ R 2 ⊥ T step(line3)checksifthereferencefunctionF rejectsmore function F . R R 229 The validation patch makes sure that any input string Σ-{"} 7 rejectedbythereferencesanitizerisalsorejectedbythere- " paired target sanitizer. However, this does not mean that Σ-{"} 4 " Σ the set of strings that are returned by the repaired target Σ-{"} 1 " Σ-{"} 8 10 Σ sanitizer and the reference sanitizer are the same after the validation patch since they may be using different sanitiza- 0 " Σ-{"} 5 " tion operations. The length patch is the first step in estab- 2 " Σ 9 lishingthattherepairedtargetsanitizerdoesnotreturnany 6 string that is not returned by the reference sanitizer. The Figure 4: The length patch automaton ML for the length patch makes sure that the length of any string re- example in Figure 1. The length patch FL rejects turnedbytherepairedtargetfunctionisnotlargerorsmaller the strings accepted by this automaton. than all the strings returned by the reference sanitizer. Thelines9-20inAlgorithm1constructthelengthpatch. setofstringsL⊆Σ∗ returnedbyasanitizerfunctionF,we Thelines9and10computetheautomatathataccepttheset call the set of input strings that are mapped by F to L the ofstringsthatarereturnedbythereferencesanitizerandthe pre-image of F with respect to L and we define it as: targetsanitizerthatiscomposedwiththevalidationpatch. GivenL⊆Σ∗,thesetofstringsreturnedbythesanitizerF Pre(F,L)={s|∃s(cid:48)∈L:F(s)=s(cid:48)} when its input set is restricted to L is defined as: Again, due to over-approximation, we compute the set Post(F,L)={s|∃s(cid:48)∈L:∃s∈Σ∗:F(s(cid:48))=s} Pre+(F,L)⊇Pre(F,L). Thisover-approximationmayre- We call this the post-image of sanitizer F with respect to sult in blocking input strings that do not contribute to the L. Due to undecidability of string analysis we compute length discrepancy. an over-approximation of this set, namely, Post+(F,L) ⊇ Inline16wegeneratethelengthpatchFLthatblocksthe Post(F,L). The lines 11 and 12 in Algorithm 1 check if stringsthatareacceptedbytheautomatonML inFigure4 thereareanystringsthatarereturnedbythetargetsanitizer andreturnsthestringsrejectedbyML withoutanychange. composedwiththevalidationpatchthatarenotreturnedby Figure 4 shows the length patch automaton ML that our the reference sanitizer by checking if Post+(F ,L(MV))\ algorithm computes for the example shown in Figure 1. T Post+(F ,Σ∗)=∅. Ifthedifferenceisempty,thenwecon- R 4.3 PhaseIII:SanitizationPatchGeneration siderF ◦FV tobeasstrictasF andtheanalysisconcludes T R byassigningIdentityFunction(i.e.,no-op)tolengthand The third and final phase in the repair algorithm is the sanitization patches FL and FS (line 30). sanitizationpatchgeneration,whichresultsinapatchfunc- Note that, due to over-approximation in our analysis, it tion FS such that: is not guaranteed that F ◦FV is as strict as F even if the difference is empty. HTowever, again manual inRspection ∀x∈Σ∗ : (∀y∈Σ∗:FR(y)(cid:54)=x)⇒ of our experiments indicate that our approximate analysis (∀z∈Σ∗:FT ◦FS◦FL◦FV(z)(cid:54)=x) always finds the differences if they exist since the precision whichmeansthat,afteraddingthesanitizationpatchFS to of our post-image computation is quite good in practice. the previously generated patches, we want the differential If a difference is found, then we check if the difference repair F = F ◦FS ◦FL◦FV to be as strict as F in DR T R corresponds to a length difference in line 13. Let us first terms of the set of strings it returns. define lenmax and lenmin for an automaton. Given an au- The lines 21-28 in Algorithm 1 generate the sanitization tomaton M, lenmax(M) = ∞ if M accepts an infinite set, patch. First,inthelines21,22,wecheckifthereisadiffer- andlenmax(M)isthelengthofthelongeststringacceptedby encebetweenwhatFT returns(aftervalidationandsanitiza- M otherwise. We can check if lenmax(M)=∞ by checking tionpatchesareapplied)andwhatFRreturnsassumingany iftherearecyclesinM onanypathfromthestartingstate input. Ifnodifferenceisfound,thenweconsiderF ◦FL◦FV T to an accepting state. If there is at least one cycle, then to be as strict as F and the analysis concludes by assign- R lenmax(M) = ∞. If there are no cycles, then lenmax(M) ingIdentityFunctiontoFS (line27). Thisindicatesthat is finite, and we use a depth first search to compute the there is no sanitization patch. Note that, as we discussed length of the longest string accepted by M. On the other before, due to over-approximation our repair algorithm can hand, given an automaton M, lenmin(M) is the length of miss differences, however we have not observed this in our the shortest string accepted by M. If the start state is an experiments. accepting state then lenmin(M)=0. Otherwise, lenmin(M) If a difference is found, then, in the line 23, we compute iscomputedbyfindingthelengthoftheshortestpathfrom anover-approximationofthesetofinputstringsthatresult the start state to an accepting state. in such a difference. The set L(M ) represents an over- mc Ifalengthdifferenceisfound,thenwerestrictthelength approximation of the set of all input strings that are the of the set of strings accepted by FT to remove the length cause of the difference between the set of strings returned difference using the following operation in line 14: byF andF ◦FL◦FV. WecallM themincutautomaton R T mc andintheline24weusethismincutautomatontogenerate RestrictLen(M2,M1)≡M2∩ lenma(cid:83)x(M1) Σi asymmbinoclustinatlphheambientcu(atsalepxhpalbaienteadrebreelomwo)v,edsufcrhomthtahteiifnpthuet i=lenmin(M1) strings, then the difference between the post-images of F R Afterthelengthrestriction,inline15,weusethepre-image andF ◦FL◦FV disappear. Then,intheline25,wegenerate T computation to compute an over-approximation of the set thesanitizationpatchFS toeitherdeleteorescapethesetof of input strings that cause the length discrepancy. Given a symbolsinthemincutalphabet. Finally,inthelines32and 230 33, we construct and return the differential repair function " " FDR as the composition of the target function FT with the 0 Σ-{<,"} 1 Σ-{<,"} 2 Σ-{<,"} 5 < three patch functions generated during the three phases of < < < the repair algorithm. The mincut algorithm [31] takes the DFA Mmc as input, minimum cut Σ-{<,"} 6 Σ-{"} " arenpdaiprrofudnuccteisonasFeStomfochdaifireasctaergsiΣvemncisnupcuhttshtaritnsganinitiszuacthiona 3 " 7 Σ-{"} 9 < Σ-{<,"} waythatthemodifiedstringisnotacceptedbytheM . In " mc thebasiccase,FS modifiestheinputstringsbyjustdeleting 8 a set of characters using the replace function (later, we ex- < Σ-{<,"} tendthisbasiccasetohandleescapingcharactersinsteadof 10 deleting them). In order to prevent extensive modification Figure5: Themincut automaton M for theexam- totheinput,thesetofcharacterstobedeletedshouldbeas mc ple in Figure 1. The dotted line shows the mincut smallaspossible. Thequestion,then,is,howdoweidentify edges with the corresponding mincut alphabet {<}. the set of characters to be deleted? First,wewillformalizethisprobleminautomata-theoretic terms[31]. WesayS ⊆Σisanalphabet-cutofM,ifL(M)∩ sulting string does not match Mmc. Although the function LdoS¯n=ot∅,cownhtearine LanS¯y=ch(aΣra\cSte)r∗iins tSh.e Tsehteofmainll-astlprihnagbsett-hcautt FFSS◦isFaLs◦oFunVd,Σre∗p)a⊆irPthOatSTw+ill(FguRa,rΣa∗n)t,eweethaaptpPlyOtwSoT+he(FurTis◦- problemisfindingthealphabet-cutS ,suchthatforany tics to generate more accurate repair functions. min other alphabet-cut S, |S |≤|S|. The first heuristic is the escape heuristic. We look at min Themin-alphabet-cutproblemcanalsobestatedingraph- the automaton M1 generated in line 9 of the Algorithm 1 theoretic terms. Given a DFA M, an edge-cut of M is a set (whichrepresentsallthestringvaluesreturnedbyFR),and of transitions E ⊆δ such that if the set of transitions in E check if all the characters in the mincut alphabet Σmc are are removed from the transition relation δ then none of the always preceded by the same single character e. If that is states in F are reachable from the initial state q . Let S the case, we we call the character e the escape character. 0 E denotethesetofsymbolsofthetransitionsinE. IfE isan Formally speaking, given DFA M1 = (cid:104)Q1,q0,Σ,δ1,FR(cid:105), we edge-cutofM thenSE isanalphabet-cutofM. Hence,find- check that ∀q ∈ Q1,∀c ∈ Σmc : δ1(q,c) (cid:54)= sink ⇒ (∀q(cid:48) ∈ ingthemin-alphabet-cutisequivalenttofindinganedge-cut Q1 : δ1(q(cid:48),c(cid:48)) = q ⇒ c(cid:48) = e). If this is the case, then the with minimum set of distinct symbols. sanitization patch FS we generate uses the replace oper- Note that, if M accepts the empty string then there will ation to escape all the characters in the mincut alphabet not be any edge (or alphabet) cut since the initial state Σmc (instead of deleting them) by prepending them with would be an accepting state. For the rest of our discussion the escape character e. we will assume that L(M) (cid:54)= ∅ (we can easily handle the The second heuristic we use is the trim heuristic. Here, caseswhereitacceptstheemptystringbyfirsttestingifthe ifwegetamincutΣmc whichcontainsspacecharacters,we input string is empty and then inserting a single character first check if M1 accepts any string that contains a space to the input if it is). character (which can be determined by checking if transi- It has been shown that the min-alphabet-cut problem is tions on space characters always go to the sink). If not, NP-hard [31], so, rather than trying to find the optimum then we generate a patch that deletes the space charac- solution, we can consider using efficient heuristics that give ters as in our basic mincut based patch generation algo- a reasonably small cut that is not necessarily the optimum rithm. If M1 does accept a string that contains a space solution. One heuristic solution is to minimize the number character, then we check if it is the case that the strings ofedgesinacutratherthanthenumberofdistinctalphabet accepted by M1 do not start or end with space characters. symbols. Given a DFA M, a min-edge-cut of M is an edge- Formally speaking, given DFA M1 = (cid:104)Q1,q0,Σ,δ1,FR(cid:105), we cut Emin such that for any other edge-cut E, |Emin|≤|E|. check that for all space character s δ1(q0,s) = sink and Note that the min-edge-cut minimizes the number of edges ∀q ∈ Q1 : δ1(q,c) ∈ FR ⇒ c (cid:54)= s). If so, then we generate intheedge-cutwhereasthemin-alphabet-cutminimizesthe patch FS which uses the trim function to delete the space set of symbols on the edges in the edge-cut. Interestingly, charactersfromthebeginningandendofeachinputstring. even though the min-alphabet-cut problem is intractable, there is an efficient algorithm for computing the min-edge- 5. SYMBOLICSTRINGANALYSIS cut. We use the Ford-Fulkerson’s max-flow min-cut algo- In order to implement our differential repair algorithm rithm [10] to find a min-edge-cut E where the complex- min weuseforwardandbackwardsymbolicstringanalysistech- ity of the algorithm is O(|δ|2). Note that |S | ≤ |E |, min min niques to compute the post-image Post+(F,L), pre-image i.e.,themin-edge-cutprovidesandupperboundforthemin- Pre+(F,L) and negative pre-image Pre+(F) of sanitizer alphabet-cut. Soifthemin-edge-cutissmallthenthesetof ⊥ functions. Beforeweapplythedifferentialrepairalgorithm, distinct symbols on the edges of the min-edge-cut will give we first extract the sanitizer functions from source code (as us a good approximation of the S . min discussed in Section 6) and represent them in a language- Figure 5 shows the mincut automaton M for our run- mc agnosticintermediaterepresentation. Ourintermediaterep- ningexampleinFigure1alongwiththemincutedgeswhich resentation supports the string validation and sanitization correspond to the mincut alphabet Σ = {< }. mc operationsthatareusedinvalidationandsanitizationfunc- Once we compute an alphabet-cut Σ , we generate the mc tions. Eachextractedsanitizationfunctioncancontainmul- sanitizationpatchFS withareplacestatementthatdeletes tipleexitstatementswhichcorrespondtorejectionofthein- thesymbolsinΣ fromtheinput,makingsurethatthere- mc put string, and a return statement that returns the output 231 ofthesanitizer. Toconductforwardandbackwardsymbolic if they have already been escaped before. This may result string analyses on this intermediate representation, we im- in double escaping i.e. w eew will become w eeeew . 1 2 1 2 plementthepre-andpost-imagecomputationsforeachval- PostTrimLeft(DFA M , char s): this automata oper- 1 idation and sanitization operation in our intermediate rep- ation removes all the s characters from the beginning of resentation. AsasymbolicrepresentationweuseDetermin- strings in L(M ) up to the first character that is not equal 1 istic Finite Automata (DFA) where each string expression tos. ItreturnsaDFAM suchthatL(M)={c w|w c w∈ 1 1 1 in the program is associated with a DFA, and the language L(M ),w ∈{s}∗ and w∈Σ∗ and c ∈Σ−{s}}. 1 1 1 acceptedbytheDFArepresentsthepossiblevaluesthatthe PostTrimRight(DFA M , char s): this automata oper- 1 string expression can take at any given execution. We use ation removes all the s characters from the end of strings the symbolic DFA representation provided by the MONA in L(M ) up to the first character that is not equal to s. 1 DFA library [6], in which transition relation of the DFA It returns a DFA M such that L(M) = {wc | wc w ∈ 1 1 1 is represented as Multi-terminal Binary Decision Diagrams L(M ),w ∈{s}∗ and w∈Σ∗ and c ∈Σ−{s}}. 1 1 1 (MBDDs). We implement forward and backward symbolic PostReplaceChar(DFAM ,charc,StringS): thisau- 1 analysesthatcorrespondtoflow-sensitiveandpath-sensitive tomatonoperationreplacesasinglecharcwithastringsin symbolicreachabilityanalyses,whereweiterativelycompute allstringsinL(M ). ItreturnsaDFAM suchthatL(M)= 1 anoverapproximationoftheleastfixpointthatcorresponds {w Sw S... w Sw |k>0,w cw c...w cw ∈L(M ), 1 2 k k+1 1 2 k k+1 1 to the reachable values of the string expressions by using w ∈(Σ−{c})∗}. Thisoperationcanbeusedtomodelstring i the pre- or post-image computations for the string valida- operations such as PHP’s htmlspecialchars efficiently. tionandsanitizationoperationsateachiteration[1,32,30]. We would like to discuss the automata algorithms that SinceDFAscanrepresentinfinitesetsofstrings,thefixpoint we developed to compute the pre- and post-images of the computations are not guaranteed to converge. To allevi- Escapeoperationthatwementionedaboveasanexample. ate this problem, we use the automata widening technique PostEscape(M , e, E) implementation: Given M = 1 1 proposed by Bartzis and Bultan [5] to compute an over- (cid:104)Q ,q ,Σ,δ ,F (cid:105) the result DFA M = (cid:104)Q,q ,Σ,δ,F(cid:105) is 1 10 1 1 0 approximation of the least fixpoint. Briefly, the widening constructed as follows: For each state q that has at least i operator merges those states belonging to the same equiva- one out transition (q −→c q ) on a character c ∈ {e}∪E i j lence class identified by certain conditions. (1) we mark each transition (q −→c q ) out from q to a i j i Weimplementedtheautomata-basedpreandpost-image state q , (2) we add a new state q and a new transition j k computationsforcommonstringoperationssuchasconcate- (q −→e q ) on escape character e, (3) we move each marked i k nation and language-based replacement that have been de- transition (q −→c q ) to become a transition (q −→c q ). The scribed in earlier work [32, 30], and we also implemented i j k j resulting automaton does not have nondeterminism which novel specialized algorithms for computing pre and post- meansthatweavoidtheuseofsubsetconstructionalgorithm image of specialized string sanitization operations such as for determinization. trim and addslashes. PreEscape(M , e, E) implementation: Given M = 1 1 (cid:104)Q ,q ,Σ,δ ,F (cid:105)wefirstpreprocessM topartitionalltran- SpecializedReplaceAlgorithms. 1 0 1 1 1 sitions(q−→e q(cid:48))intotwosetsoftransitions: Escapingtransi- Toincreasetheprecisionandperformanceofthedifferen- tions T and escaped transitions T such that: ∀q,q(cid:48) ∈Q : tialrepair,wedevelopedanumberofautomata-basedalgo- g e 1 (q −→e q(cid:48)) ∈ T ⇒ ∃q(cid:48)(cid:48) ∈ Q : (q(cid:48)(cid:48) −→e q) ∈ T . Notice that rithmsforcomputingthepreandpost-imagesoffrequently e 1 g inM ,atransition(q−→e q(cid:48))cannotbeescapingandatthe used string operations such as trim, htmlspecialchars, 1 sametimebeingescaped,i.e.,T ∩T =∅. Otherwisewewill addslashes, mysql_real_escape_string, tolower, and e g have a string w eecw ∈ L(M ) where w ,w ∈ Σ∗ which toupper. These algorithms are more precise and more ef- 1 2 1 1 2 contradicts the definition of ficient than using the general replace algorithm to model PostEscape(M ,e,E). We can formalize this condition as these specialized operations [32]. The general replace algo- 1 follows: There is no path q ,...,q ,q ,q ,q ,...,q rithm consumes significantly more time and space since it 0 i−1 i i+1 i+2 f in M where q ∈ F, δ(q ,e) = q , δ(q ,e) = q and relies on automata determinization which requires the use 1 f i−1 i i i+1 δ(q ,c) = q where c ∈ {e}∪E. During the analy- of subset construction algorithm which has an exponential i+1 i+2 sis, we enforce this condition by applying PreEscape on complexity. On the other hand, the specialized algorithm we developed for the Escape operation, for example, runs M1∩ Escape(A(Σ∗),e,E). Using the same reasoning we in linear time and its result is precise without any over- conclude that (1)∀(q −→e q(cid:48)) ∈ Tg : q(cid:48) ∈/ F, (2) ∀c ∈/ {e}∪ approximation. Intheexperimentalresultswedemonstrate E, ∀(q −→e q(cid:48)) ∈ T : δ(q(cid:48),c) = sink, (3) ∀q ∈ Q (∀c ∈ g 1 the improvement that we gain from these specialized algo- {e}∪E : δ(q,c) (cid:54)= sink ⇔ ∀c(cid:48) ∈/ {e}∪E : δ(q,c(cid:48)) = sink), rithms. Belowwedescribethepost-imageoffourofthespe- and (4)δ(q ,e)(cid:54)=sink⇒(q −→e q(cid:48))∈T . 0 0 g cialized replace operations that we implemented. We omit Using these results, we compute T and T using a depth g e the description of pre-images and other operations due to firsttraversalstartingfromthestartstateq . Wethencom- 0 space limitation. pute the set of escaped states Q which is the set of states e PostEscape(DFAM1,chare,charsetE): thisautomata that has all input transitions in Tg. Due to the precondi- operation escapes characters in E along with the escape tions we stated earlier, all transitions on e must be either character e itself in all strings in L(M1) using the escape coming into an escaped state or going out from it. Also all character e. It returns a DFA M such that L(M) = { transitions on escaped characters c ∈ {E} must be going w1ec1w2ec2 ... wkeckwk+1 | k >0, w1c1w2c2...wkckwk+1 out from an escaped state. Finally, given Qe, we construct ∈L(M1),∀i:ci ∈E∪{e} and wi ∈Σ∗−(E∪{e})∗}. An the new DFA by removing all states qk ∈Qe such that: (1) exampleofanescapefunctionisPHP’saddslashes. Notice all transitions (q −→e q ) are removed, and (2) each transi- thatEscapeescapesallcharsc∈{e}∪E withoutchecking i k 232 tion (q −→c q ) is added, as an out transition, to all states patches,48ofthemaregeneratedforthepairsthatarefrom k j q where a transition (q −→e q ) was removed. Based on the different applications. We found 14 server-server sanitiza- i i k conditions we discussed above on M , this last step can be tion patches within the same application, which indicates 1 done without determinization and subset construction. inconsistent sanitization policy within the application. For server-server and client-client, there are no length patches 6. IMPLEMENTATION&EXPERIMENTS since the validation patches in these cases do not involve We have implemented our differential repair algorithm in length restrictions. a tool called SemRep using our symbolic string analysis li- Table 2 shows the details of sanitization patches and re- brary LibStranger1. In order to evaluate our repair al- sults of the mincut algorithm. As we can see our mincut gorithm we experimented on five open-source PHP appli- heuristics were able to identify 41 trim operations and 10 cations 1) PHPNews v1.3.0 (news publishing software), 2) escape operations. This identification is very helpful since UseBB v1.0.16 (forum software), 3) Snipe Gallery v3.1.5 applying sanitization patches that escape certain charac- (image management system), 4) MyBloggie v2.1.6 (weblog ters is not idempotent which is critically important to be system), 5) Schoolmate v1.5.4 (school administration soft- avoided for server-client pairs. In the client-client saniti- ware) along with a number of JavaScript sanitizer-function zation patches there was an interesting case in which the benchmarksfromvariouswebsites[1]. Weranalltheexper- mincut was Σ. The reason was that the languages of the iments on an Intel I5 machine with 2.5GHz X 4 processors post-image DFAs were disjoint which means that the two and 32 GB of memory running Ubuntu 12.04. functions return completely different sets of strings (in this Extracting Sanitizer Functions. Before we run our casethediscrepancywasduetothepresence/absenceofthe differential repair algorithm, we need to extract sanitizer dash symbol in phone number fields). functions from the client and/or server side and map them Table 1: Number of Patches Generated to each other to generate target, reference sanitizer pairs. #pairs #valid. #length #sanit. WebuiltacrawlerusingHTMLUnit[11]tofindinputfields C-S 122 61 11 0 S-C 122 53 2 30 and corresponding sinks in a PHP web application. When S-S 206 49 0 33 the crawler hits a web page with an HTML form, it fills it C-C 19 34 0 5 out automatically using a set of pre specified profiles and submits it. Then, for each HTML input field, JavaScript Table 2: Sanitization Patch Results validationandsanitizationcodeisdynamicallyextractedas mincut mincut #trim #escape #delete avg size max size described in [1], along with the HTML constraints on the S-C 4 10 15 10 20 field, resulting with one sanitizer function per each input S-S 3 5 23 0 20 field. Ontheserver-side,wealsocollecttheexecutiontraces C-C 7 15 3 0 2 tofigureouttheinputsandthesinks(wheretheinputsflow into) during crawling. We use that information later on to Table 3 shows the memory and time performance of our map server-side sanitizer functions to client-side sanitizer repair algorithm. Rows I, II, and III corresponds to val- functions. Finally,weusethefrontendofPixy[13]tostat- idation, length, and sanitization patch generation phases, icallyextractthecorrespondingsanitizerfunctionsfromthe respectively. Memory performance is represented as num- server-side. WeaugmentedPixycodetoaddpathsensitivity berofBDDnodesthatisneededtorepresentaDFAwhere and support for PHP5. Client-client sanitizers are mapped the size of each BDD node is 16 bytes. In general our al- to each other based on the type of data they operate on gorithmisefficientbothintermsoftimeandmemory. The (i.e., email address, phone number, ...etc). Server-server averagetotaltimeforthealgorithmis1.35secondsandthe mappingisdonewithinthesameapplicationandacrossdif- maximum time is 186.22 seconds. During our experiments ferent applications based on field names that are extracted wehadtoabortthealgorithmin36among469pairs(%7.6) from the PHP $_POST and $_GET arrays. duetoMONA’slimitonBDDsize. Themainreasonforex- Results. Table1showsthetotalnumberoftarget-reference ceeding the BDD size limit is length constraints with large sanitizer pairs we analyzed and the number of patches gen- numbers. For example, in one of the cases in the experi- erated at each phase of the algorithm for four categories: ments, validation patch restricts the length of the language Client-server(C-S)wheretheserver(target)ispatchedagainst oftheinputstringsto255. Then,sanitizationfunctionhtml- the client (reference), server-client (S-C) where the client specialchars in the target increases the maximum length of (target) is patched against the server (reference), server- thepost-imageto1275. Sinceweusefiniteautomatatorep- server(S-S)andclient-client(C-C).Notethatfortheserver- resentsetsofstrings,theautomatonhastocountthelength server and client-client cases we analyze each pair twice by of the strings with its states. For this reason, we can see consideringeachsanitizerfunctionastargetonceandasref- from Table 3 that the second phase of the algorithm which erenceonce. Themostcommonlygeneratedpatchesareval- deals with the length constraints has the highest time and idation patches which indicates that inconsistencies in val- memory consumption. idation policies are common. There are also a significant Figure 6 shows a comparison in terms of time and mem- number of sanitization patches generated, except that the ory (represented using the number of BDD nodes) between client-server pairs generated no sanitization patches. We the performance of the generic replace algorithm and the checkedtheclient-serverpairsmanuallyandconfirmedthat specialized replace algorithms we presented in this paper. this is an accurate result (i.e., there are no cases where the In our setup we computed the post-image of the two PHP sanitization at the client-side is stronger than the sanitiza- functionsaddslashes(whichdoes3replaceoperations)and tion at the server-side). Among 49 server-server validation htmlspecialchars(whichdoes5replaceoperations)onthe 1Thetoolandthelibraryalongwiththesourcecodeareavailableat language (cid:83)l Σl. This setup is identical to a sanitizer func- http://cs.ucsb.edu/~vlab/tools.html i=0 233 Table3: TimeandMemoryPerformanceofAnalysis passes all test suites. Our approach does not require a test repair DFA size peak DFA size time suiteanditcanhandleunboundedloopsusingfixpointcom- phase (#bddnodes) (#bddnodes) (seconds) putation. Livshits et. al. [17] automatically place a set of avg max avg max avg max I 997 32,650 484 33,041 0.14 4.37 sanitizersinasanitizerfreeprogrambasedonauserdefined II 129,606 347,619 245,367 4,911,410 9.39 168.00 policy. In our case we take into account the original saniti- III 2,602 11,951 4,822 588,127 0.17 14.00 zationcode. Wegeneratetherepairinawaythatallowsus to place it at the beginning of the code that is under repair 100000 addslashes withoutrequiringaplacementpolicy. Placingtherepairbe- 10000 htmlspecialchars fore the original code, instead of changing the code, allows us to avoid interference with the original sanitization code optimized addslashes 1000 that may have side-effects. optimized htmlspecialchars me (ms) 100 stoDpiffaeftreerntfiinaldainngaldyisffiserteenchcensiqwuietsho[2u,t2t1r,y1in6g,1t4o,r1e5p]atiyrptihcaemlly. Ti 10 In [21] differential symbolic execution is used to find differ- ences between original and refactored code by summarizing 1 procedures into symbolic constraints and then comparing 0.1 different summaries using an SMT solver. SYMDIFF [14] 0 10 20 30 40 50 60 Length computesthesemanticdifferencebetweentwofunctionsus- ing the Z3 SMT solver [18]. There are specialized differen- 10000 addslashes tial analysis techniques that focus on web applications. In NoTamper [7], client-side code is analyzed to generate test htmlspecialchars D nodes (thousands) 110001000 ooppttiimmiizzeedd ahdtmdslslapsehceiaslchars cIWsnearAsvaePesrrTtechEcoaeCdtne,tartwfeoohlugilcoushweiddueutsoptehstepesasyttpemsettbrhoce[al8issc]e,erevtgxheeerenc-esusritadaimteoinoeonfoatfpuhrtteohhceaoeprcssplsilpeiacnrnaotdtpiaooenxsnde-. mber of BD 1 pouanrdwocrokvearlasgoeg.eInneraadtdesitaionfixtofofirnsduicnhgdsieffmeraenntcice.differences, Nu Staticanalysisofstringshasbeenanactiveresearcharea, 0.1 withthegoalofeliminatingbugscausedbyinadequateuseof 0 10 20 30 40 50 60 stringvalidationandsanitizationoperations. In[19],multi- Length track DFAs, known as transducers, are used to model re- Figure 6: Time and Memory performance for placementoperationsinconjunctionwithagrammar-based genericreplaceandoptimized/specializedreplaceal- string analysis approach. The resulting tool was used in gorithms. detecting vulnerabilities in PHP programs. Wassermann et al. [26, 27] propose grammar-based static string analyses tionthatrestrictsthelengthofitsinputitoacertainvalue todetectSQLinjectionsandXSS,followingMinamide’sap- lthroughbranchconditionlen(i) <= l,andthensanitizes proach. Amorerecentapproachinstaticstringanalysishas the input that passes the branch condition using one of the beentheuseoffinitestateautomataasasymbolicrepresen- two functions. The x axis shows the length while the y tation[4,32]. Inthisapproach,complexstringmanipulation axis shows the time and memory and uses a log scale. We operations, such as string replacement, can be modeled us- notice that the generic replace grows exponentially as we ing automata representation. In order to guarantee conver- increase the length while the specialized ones do not. For gence in automata-based string analysis, several widening the generic replace, the analysis reaches MONA limit on operators have been used [9, 5, 32]. Constraint-based (or BDD size at length 31 for addslashes and at length 8 for symbolic-execution-based) techniques represent a third ap- htmlspecialchars. On the other hand, for the specialized proach for string analysis. Such techniques have been used replace operations it took less than 1 second to run with fortheverificationofstringoperationsinJavaScript[23]and length 100 with negligible memory overhead. the detection of security flaws in sanitization libraries [12]. 8. CONCLUSIONS 7. RELATEDWORK We presented a novel differential repair algorithm that Automatic program repair [3, 29, 28, 24, 25, 20, 22] be- automatically repairs the semantic difference between two cameanactivetopicrecently. Weimeret. al.[29,28]repair input validation and sanitization functions. The algorithm programsusinggeneticprogrammingbyrandomlymutating takesatargetandareferencefunctionasinputandfixesthe the abstract syntax tree (AST). Son et. al. [24, 25] patch target function with respect to the reference function auto- access control problems in PHP by finding the differences maticallywithouttheneedforaspecificationormanualin- between a possibly buggy AST and a correct one and then tervention. The repaired function provides stronger valida- inserting statements from the latter into the former to re- tionandsanitizationthanboththetargetandthereference move the difference. Unlike syntax based approaches, our functions. Therepairisconstructedbycomposingthetarget differentialrepairalgorithmusesasemanticapproachwhich functionwiththreeautomaticallygeneratedpatchfunctions enablesustogeneratepreciserepairsinmultiplelanguages. whichwecallthevalidationpatch,thelengthpatchandthe In [20, 22] test suites are used to find bugs then symbolic sanitizationpatch. Ourexperimentsonseveralwebapplica- executionisusedtofindconstraintsonvariablesthatresult tionsshowthatourdifferentialrepairtechniqueissuccessful in such bugs. Using the solution to the negation of these in repairing differences between sanitizer functions. constraints, a patch is synthesized such that the program 234

Description:
application to repair client and server-side code with respect to each other, or across applications in order to strengthen the validation and sanitization
See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.