ebook img

DTIC ADA459093: Carving Differential Unit Test Cases from System Test Cases PDF

0.18 MB·English
Save to my drive
Quick download
Download
Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.

Preview DTIC ADA459093: Carving Differential Unit Test Cases from System Test Cases

Carving Differential Unit Test Cases from System Test Cases Sebastian Elbaum, Hui Nee Chin, Matthew B. Dwyer, Jonathan Dokulil Department of Computer Science and Engineering University of Nebraska - Lincoln Lincoln, Nebraska {elbaum,hchin,dwyer,jdokulil}@cse.unl.edu ABSTRACT ofasystem. Analternativeapproachtounittestdevelopment,that doesnotrelyonspecifications,isbasedontheanalysisofaunit’s Unittestcasesarefocusedandefficient. Systemtestsareeffective implementation. Testersdevelopingunittestsinthiswaymayfo- atexercisingcomplexusagepatterns.Differentialunittests(DUT) cus,forexample,onachievingacoverage-adequacycriteriaofthe areahybridofunitandsystemtests.Theyaregeneratedbycarving targetunit’scode. Suchtests,however,areinherentlysusceptible thesystemcomponents,whileexecutingasystemtestcase,thatin- to errors of omission with respect to specified unit behavior and fluencethebehaviorofthetargetunit,andthenre-assemblingthose maytherebymisscertainfaults. Finally, unittestingrequiresthe componentssothattheunitcanbeexercisedasitwasbythesys- developmentoftestharnessesorthesetupofatestingframework temtest. WeconjecturethatDUTsretainsomeoftheadvantages (e.g.,junit[18])tomaketheunitsexecutableinisolation. ofunittests,canbeautomaticallyandinexpensivelygenerated,and Systemtestsareusuallydevelopedbasedondocumentsthatare havethepotentialforrevealingfaultsrelatedtointricatesystemex- commonly available for most software systems that describe the ecutions. Inthispaperwepresentaframeworkforautomatically system’sfunctionalityfromtheuser’sperspective,forexample,re- carving and replaying DUTs that accounts for a wide-variety of quirementdocumentsanduser’smanuals.Thismakessystemtests strategies, we implement an instance of the framework with sev- appropriatefordeterminingthereadinessofasystemforrelease, eraltechniquestomitigatetestcostandenhanceflexibility,andwe ortograntorrefuseacceptancebycustomers. Additionalbenefits empiricallyassesstheefficacyofcarvingandreplayingDUTs. accruefromtestingsystem-levelbehaviorsdirectly. First, system testscanbedevelopedwithoutanintimateknowledgeofthesys- 1. INTRODUCTION teminternals,whichreducesthelevelofexpertiserequiredbytest developersandwhichmakestestsless-sensitivetoimplementation- Software engineers develop unit test cases to validate individ- level changes that are behavior preserving. Second, system tests ual program units (e.g., methods, classes, packages) before they may expose faults that unit tests do not, for example, those that areintegratedintothewholesystem. Byfocusingonanisolated span multiple units or that involve very complex usage of units. unit,unittestsarenotconstrainedbyotherpartsofthesystemin Finally,sincetheyinvolveexecutingtheentiresystemnotesthar- exercising the target unit. This smaller scope for testing usually nessesneedbeconstructed. resultsinsignificantlymoreefficienttestexecutionandfaultisola- While system tests are an essential component of all practical tionrelativetowholesystemtestinganddebugging[1, 19]. Unit software validation methods, they do have several disadvantages. testcasesarealsousedasacomponentofseveralpopulardevelop- Theycanbeexpensivetoexecute;forlargesystemsdaysorweeks, mentmethods,suchasextremeprogramming(XP)[2],testdriven andconsiderablehumaneffortmaybeneededforrunningathor- development(TDD)practices[3],continuoustesting[37],andef- ough suite of system tests [25]. In addition, even very thorough ficienttestprioritizationandselectiontechniques[33]. systemtestingmayfailtoexercisethefull-rangeofbehaviorim- Developingeffectivesuitesofunittestcasespresentsanumber plementedbysystem’sunits;thus,systemtestingcannotbeviewed ofchallenges. Specificationsofunitbehaviorareusuallyinformal asaneffectivereplacementforunittesting. Finally,faultisolation andareoftenincompleteorambiguous,leadingtothedevelopment andrepairduringsystemtestingcanbesignificantlymoreexpen- ofoverlygeneralorincorrectunittests.Furthermore,suchspecifi- sivethanduringunittesting. cationsmayevolveindependentlyofimplementationsrequiringad- Theprecedingcharacterizationofunitandsystemtests,although ditionalmaintenanceofunittestsevenifimplementationsremain notcomprehensive,illustratesthatsystemandunittestshavecom- unchanged.Testersmayfinditdifficulttoimaginesetsofunitinput plementarystrengthsandthattheyofferarichsetoftradeoffs. In valuesthatexercisethefull-rangeofunitbehaviorandtherebyfail thispaper,wepresentageneralframeworkforcarvingandreplay- toexercisethedifferentwaysinwhichtheunitwillbeusedasapart ingofwhatwecalldifferentialunittests(DUT)whichaimatex- ploitingthosetradeoffs. Wetermedthemdifferentialbecausetheir primaryfunctionisdetectingdifferencesbetweenmultipleversions ofaunit’simplementation.DUTsaremeanttobefocusedandeffi- cient,liketraditionalunittests,yettheyareautomaticallygenerated Permissiontomakedigitalorhardcopiesofallorpartofthisworkfor alongwithacustomtest-harness,makingtheminexpensivetode- personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare velopandeasytoevolve. Inaddition,sincetheyindirectlycapture notmadeordistributedforprofitorcommercialadvantageandthatcopies thenotionofcorrectnessencodedinthesystemtestsfromwhich bearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,to theyarecarved,theyhavethepotentialforrevealingfaultsrelated republish,topostonserversortoredistributetolists,requirespriorspecific tocomplexpatternsofunitusage. permissionand/orafee. Copyright200XACMX-XXXXX-XX-X/XX/XX...$5.00. Report Documentation Page Form Approved OMB No. 0704-0188 Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information, including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it does not display a currently valid OMB control number. 1. REPORT DATE 3. DATES COVERED 2006 2. REPORT TYPE 00-00-2006 to 00-00-2006 4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER Carving Differential Unit Test Cases from System Test Cases 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER 6. AUTHOR(S) 5d. PROJECT NUMBER 5e. TASK NUMBER 5f. WORK UNIT NUMBER 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION University of Nebraska - Lincoln,Department of Computer Science and REPORT NUMBER Engineering,Lincoln,NE,68588 9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S) 11. SPONSOR/MONITOR’S REPORT NUMBER(S) 12. DISTRIBUTION/AVAILABILITY STATEMENT Approved for public release; distribution unlimited 13. SUPPLEMENTARY NOTES The original document contains color images. 14. ABSTRACT 15. SUBJECT TERMS 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF 18. NUMBER 19a. NAME OF ABSTRACT OF PAGES RESPONSIBLE PERSON a. REPORT b. ABSTRACT c. THIS PAGE 11 unclassified unclassified unclassified Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std Z39-18 Inourapproach,DUTsarecreatedfromsystemtestsbycaptur- Given stx{ input/s, expected output/s } ingcomponentsoftheexercisedsystemthatinfluencethebehavior ofthetargetedunit,andthatreflecttheresultsofexecutingtheunit; Carve ctxm Execute stxinput S0 … Spre m Spost … … output wetermthiscarving. Thosecomponentsareautomaticallyassem- ctxm{Spre, Spost} capture bledintoatestharnessthatestablishesthepre-stateoftheunitthat wasencounteredduringsystemtestexecution.Fromthatstate,the mevolves: m + != m’ unit is replayed and the resulting state is queried to determine if therearedifferenceswiththerecordedunitpost-state. load Spre m’ Spost’ IdeallyDUTswill(a)retainthefaultdetectioneffectivenessof sfeyrsetnemcestetshtsatoanrethneotatrignedticuantiitv,e(bo)fodnilfyferreipnogrstyssmteamllnteusmtbreesruslotsf,d(icf-) Replay ctxmon m’ Spost Same? yes Behavior of m " m’ no beexecutedfasterthansystemtests, and(d)beapplicableacross !has affected behavior of m multiplesystemversions.WeempiricallyinvestigateDUTcarving andreplaytechniqueswithrespecttothesecriteriathroughacon- trolled experiment within the context of regression testing where Figure1:Carvingandreplayprocess. wecomparetheperformanceofsystemtestsandcarvedunittests. Theresultsindicatethatcarvedtestcasescanbeaseffectiveassys- temtestcasesintermsoffaultdetection,butmuchmoreefficient. Aprogramexecutioncanbeformalizedeitherasasequenceof The contributions of this paper are: (i) presenting of a frame- programstatesorasasequenceofprogramactionsthatcausestate workforautomaticallycarvingandreplayingDUTsthataccounts changes.Asequenceofprogramstatesiswrittenasσ=s ,s ,... 0 1 forawide-varietyofimplementationstrategieswithdifferenttrade- wheres ∈Sands istheinitialprogramstateasdefinedbyJava. i 0 offs; (ii)implementinganewstate-basedstrategyforcarvingand Astates isreachedfroms byexecutingasingleaction(e.g., i+1 i replayatamethodlevelthatoffersarangeofcosts,flexibility,and bytecode). A sequence of program actions is written as σ¯. We scalability;and(iii)identifyingevaluationcriteriaandempirically denotethefinalstateofanactionsequences(σ¯). assessingtheefficiencyandeffectivenessofcarvingandreplayof DUTsonmultipleversionsofaJavaapplication. Webelievethese 2.2 BasicCarvingandReplaying contributions lay a solid and general foundation for further study Figure 1 illustrates the CR process. Given a system test case of carving and replay of DUTs and we outline several directions st ,carvingaunittestcasect fortargetunitmduringtheex- forfutureworkinSection6. InthenextSection, wepresentour x xm ecution of st consists of capturing s , the program state im- frameworkforcarvingandreplaytesting.Section3detailstheim- x pre mediately before the first instruction of an activation of method plementationofoneofthoseinstantiations.Section4describesour m, and s , the program state immediately after the final in- studyandresults.Section5discussesrelatedwork. post struction of the activation of m has executed. The captured pair of states (s ,s ), defines a differential unit test case for a pre post 2. A FRAMEWORK FOR TEST CARVING method,ct . Statesinthispaircanbedefinedbycapturingthe xm appropriatestatesinσ, orthroughthecumulativeeffectsofase- ANDREPLAY quence of program actions, by capturing s(σ¯) at the appropriate Javaprogramscanhavemillionsofallocatedheapinstances[14] points in σ¯. A CR testing approach is said to be state-based if and hundreds of thousands of live instances at any time. Conse- it records pairs (s ,s ) and action-based if it records pairs pre post quently,carvingtherawstateofrealprogramsisimpractical. We (σ¯ ,s )wheres =s(σ¯ ). pre post pre pre believe that cost-effective carving and replay (CR) based testing Inpractice,itiscommonforamethod,m,toundergosomemod- willrequiretheapplicationofmultiplestrategiesthatselectinfor- ification,e.g.,tom(cid:48),overtheprogramlifetime.Toefficientlyvali- mation in raw program states and use that information to trade a datetheeffectsofamodification,wereplayct onm(cid:48).Replaying xm measureofeffectivenesstoachievepracticalcost.Strategiesmight adifferentialunittestforamethodm(cid:48)requirestheabilitytoeither include,forexample,carvingasinglerepresentativeofeachequiv- loadstates intomemoryorexecuteσ¯ dependingonhowthe pre pre alenceclassofprogramstatesorpruninginformationfromacarved statewascarved. Fromthisstate,executionofm(cid:48) isinitiatedand statethatamethodundertestisguaranteedtonotbedependenton. itcontinuesuntilitreachesthepointcorrespondingtothecarved Thespaceofpossiblestrategiesisvastandwebelievethatageneral s .Atthatpoint,thecurrentexecutionstate,s(cid:48) ,iscompared post post frameworkfor CRtestingwill aidinexploringcost-effectiveness tos . Iftheresultingstatesarethesame,wecanattestthatthe post trade-offspossibleinthespaceofCRtestingtechniques. changedidnotaffectthebehaviorofthetargetunit. However, if Regardlessofhowonedevelops,orgenerates,aunittest,there thechangealteredthebehaviorofm,thenfurtherprocessingwill arefouressentialsteps:(1)identifyaprogramstatefromwhichto berequiredtodeterminewhetherthealterationmatchesthedevel- initiatetesting,(2)establishthatprogramstate,(3)executetheunit oper’sexpectations. from that state, and (4) judge the resulting state as to its correct- There are multiple techniques for diagnosing the root cause of ness.IntherestofthisSection,wedefineageneralframeworkthat detected differences. For example, a difference could trigger the allowsdifferentstrategiestobeappliedineachofthesesteps. executionofsystemtestst todeterminewhetheradifferenceman- x ifestsathigherlevelsofabstractions,theresultsofct couldbe 2.1 ProgramStatesandProgramExecutions xm comparedwiththeresultsofmanuallydevelopedunittestsform, For the purposes of explaining our framework, we consider a orintermediatestateswithintheexecutionofmandm(cid:48)(e.g.,after Javaprogramtobeakindofstatemachine.Atanypointduringthe everystatement)couldbecomparedtoidentifytheearliestpointat executionofaprogramtheprogramstate,S,canbedefined,con- whichstatesdiffer. Wediscusssupportforsomeofthesediagnos- ceptually,asallofthevaluesinmemory.Asneeded,wewilldefine ticsinSection2.4andleavetheothersforfuturework. notationforaccessingspecificportionsofastate,forexample,the Several fundamental challenges must be addressed in order to parametersinthecurrentactiveframeofthecallstack. make CR cost-effective. First, the proposed basic carving proce- dureisatbestinefficientandlikelyimpractical.Inefficientbecause amethodmayonlydependonasmallportionoftheprogramstate, thusstoringthecompletestateiswastedeffort. Furthermore,two TReesdtu Cctaiosne ctxm Spre ! Spre Reduced ctxm distinctcompleteprogramstatesmaybeidenticalfromthepointof viewofagivenmethod,thuscarvingcompletestateswouldyield redundantunittests.Impracticalbecausestoringthecompletestate ofaprogrammaybeprohibitivelyexpensiveintermsoftimeand ctxm Spre ! Spre space. Second, changes to m may render ct unexecutable in m(cid:48). Reducing the cost of CR testing is impoxrmtant, but we must Test Cases ! ! Keep ctxm Filtering Drop ct produceDUTsthatarerobusttochangessothattheycanbeexe- zm ! cutedacrossaseriesofsystemmodificationsinorderrecoverthe ctzm Spre Spre overheadofcarving.Finally,theuseofcompletepost-statestode- tectbehavioraldifferencesisnotonlyinefficientbutmayalsobe toosensitivetobehaviordifferencescausedbyreasonsotherthan Figure2:Sampleapplicationsofprojectionsfunctions. faults(e.g.,faultfixes,improvements,internalrefactoring)leading to the generation of brittle tests. The following sections address thesechallenges. 2.3 ImprovingCRwithProjections 2.3.3 ApplyingProjections WefocusCRtestingonasingleunitbydefiningprojectionson Figure2illustratestwopotentialapplicationsoftheprojections: carvedpre-statesthatpreserveinformationrelatedtotheunitunder testcasereductionandtestcasesfiltering. testandprovidesignificantreductioninpre-statesize. Reduction aims at thinning a single carved test case by retain- ingonlytheprojectedpre-state(inFigure2theprojectionofs pre 2.3.1 State-basedProjections carvedfromct leadstoasmallers ).ReducingaDUT’spre- xm pre stateresultsinreducedspacerequirementsand,moreimportantly, Astateprojectionfunctionπ : S → S preservesselectedpro- inquickerreplaysinceloadingtimeisafunctionofthepre-state gram state components. For example, a state projection function size. Asweshallsee, dependingonthetypeofprojection, these may preserve only the values of reference fields, thereby elimi- gains may be achieved at the expense of reduced fault detection nating all scalar fields, which would maintain the heap shape of power(e.g.,aprojectionmaydiscardanobjectthatwasnecessary a program state. Many useful state projections are based on the notion of heap reachability. A reference r(cid:48) is reachable in one to expose the fault). Furthermore, test executability may be sac- dereference from r if the value of some field of r holds r(cid:48); let rificedaswell. State-basedprojectionsmaybecomeunexecutable reach(r) = {r(cid:48) | ∃ v (r,f) = r(cid:48)} where v is the ifthedatastructuresusedbythetargetunitchanges,forexample, f∈F field field shiftingfromanarraytoaheap-basedstructure,evenifbehavioris dereference function. References reachable through any chain of preserved. Action-basedprojectionsmaybecomeunexecutableif dereferences up to length k from r are defined by using the iter- (cid:83) atedcompositionofthisbinaryrelation, reachi(r); asa thebehaviorofaunitmethodchangessothatadifferentnumberor 1≤i≤k sequenceofmethodsisneededinthemodifiedprogramtoproduce notational convenience we will refer to this as reachk(r). The thedesiredpre-state. Still,reductioncanbeavaluablemechanism positive-transitive closure of the relation, reach+(r), defines the toimprovetheefficiencyofCRbykeepingjusttheportionsofthe setofallreachablereferencesfromrinoneormoredereferences. pre-statethataremostlikelytoberelevanttothetargetedmethod. State-based CR testing approaches should use projections that FilteringaimsatremovingredundantDUTsfromthesuite.Con- retainatmosttheinterfacereachableprojectionwhichisdefined sideramethodthatisinvokedduringtheprograminitializationand to preserve the set of heap objects reachable from a calling con- isindependentoftheprogramparameters. Suchmethodwouldbe text, {r | ∃ reach+(p)}. Robustness to change under p∈Params exercisedbyallthesystemtestsinthesamewayandresultinmul- thisprojectionisidenticaltothatofthecompleteprogrampre-state tiple identical DUTs for that particular methods. A simple filter sincealldatathatthemethodcouldpossiblyreferenceiscaptured. wouldremovesuchduplicatestests,keepingjusttheuniqueDUTs. It is possible to trade robustness for reduction in carving cost by Now consider a simple accessor method with no parameters that definingprojectionsthateliminatemorestateinformation.Section just returns the value of a scalar field. If this method is invoked 3presentstwoprojectionsthatexercisethistrade-off. bythetestsfromdifferentpre-states, thenmultipleDUTswillbe carved,andasimplelosslessfilterwillnotdiscardanyDUTeven 2.3.2 Action-basedProjectionsandTransformations thoughtheyexercisesimilarbehavior.Inthiscase,applyingapro- Projectionsonsequencesofprogramactions, π¯ : σ¯ → σ¯. can jectionthatpreservesthepre-statecomponentsdirectlyreachable beusedtodistilltheportionofaprogramrunthataffectsthepre- fromthiswouldresultinmanyDUTsthatareredundant(inFigure state of a unit method. Unfortunately, a purely projection-based 2π(s )forct andforct areidenticalsooneofthemcan pre xm zm approachtostate-capturewillnotworkforallJavaprograms. For be removed). Clearly, in some cases, this kind of lossy filtering example,aprogramthatcallsnativemethodsdoesnot,ingeneral, mayresultinalowerfaultdetectioncapabilitysincewemaydis- have access to the native methods instructions. To accommodate cardaDUTthatisindeeddifferentandhencepotentiallyvaluable. this,wecanallowfortransformationofactionsduringcarving,i.e., Note that, contrary to test case reduction, filtering only uses pro- replaceonesequenceofinstructionswithanother. Transformation jectionstojudgetestequivalence,consequently,testexecutability could be used, for example, to replace a call to a native method ispreservedsincetheDUTsthatarekeptarecomplete.Inpractice, withaninstructionsequencethatimplementstheside-effectsofthe however,reductionandfilteringarelikelytobeappliedintandem nativemethod. Moregenerally,onecoulddesignaninstanceofπ¯ suchthatreducedtestsarethenfiltered,orfilteredtestsarethenre- thatwouldreplaceanyportionofatracewithasummarizingaction duced(withoutnecessarilyusingthesameprojectionforreduction sequence. andfiltering). 2.4 Adjusting Sensitivity through Differenc- s s s s s s ingFunctions pre post1 post2 post3 post4 post5 5 ThebasicCRtestingapproachdescribedearliercomparesacarved completepost-statetoapost-stateproducedduringreplaytodetect } 1 2 3 4 activation of m behavioraldifferencesinaunit. Theuseofcompletepost-statesis bothinefficientandunnecessaryforthesamereasonsasoutlined } calls out of unit aboveforpre-states. Whilewecouldusecomparisonofpost-state projectionstoaddresstheseissues,webelievethatthereisamore flexiblesolution. Figure3:Differencingsequencesofpost-states. Methodunittestaretypicallystructuredsothat,afterasequence ofmethodcallsthatestablishadesiredpre-statethemethodunder testisexecuted. Whenitreturnsadditionalmethodcallsandcom- is differenced with corresponding states at intermediate states of parisonsareexecutedtoimplementapseudo-oracle.Forexample, themethodundertest. Forexample,atpoint1,thetestcompares unittestsforared-blacktreemightexecuteaseriesofinsertand the current state to the captured s , similarly at points 2 and deletecallsandthenquerythetree-heightandcompareittoanex- post1 3thepreandpost-statesofthecalloutoftheunitarecompared. pectedresulttojudgepartialcorrectness. Weallowasimilarkind Usingasequenceofpost-statesrequiresthatacorrespondencebe of pseudo-oracle in CR testing by defining differencing functions definedbetweenlocationsinmandm(cid:48).Correspondencescouldbe onpost-statesthatpreserveselectedinformationabouttheresults definedusingavarietyofapproaches,forexample,onecoulduse ofexecutingtheunitundertest. Thesedifferencingfunctionscan thecallsoutofmandm(cid:48) todefinepointsforpost-statecompari- taketheformofpost-stateprojectionsorcanbemoreaggressive, son(asisillustratedinFigure3)orcommonpointsinthetextofm capturingsimplepropertiesofpost-states,suchastreeheight,and andm(cid:48)couldbedetectedviatextualdifferencing.Faultisolationis consequentlymaygreatlyreducethesizeofpost-stateswhilepre- enhancedusingmultiplepost-states,sinceifthefirstdetecteddif- servinginformationthatisimportantfordetectingbehavioraldif- ference is at location i then that differencewas introduced in the ferences. regionofexecutionbetweenlocationi−1andi. Ofcourse,stor- We define differencing functions that map states to a selected ingmultiplepost-statesmaybeexpensivesoweadvocatetheuse differencing domain, dif : S → D. Differencing in CR testing ofσ tonarrowthescopeofcodethatmustbeconsideredfor is achieved by evaluating dif(spost) = dif(spost(cid:48)). State projec- faultpiossotlationonceabehavioraldifferenceisattributedtoafault. tion functions are simply differencing functions where D = S. In addition to the reachability projections defined in the previous 3. INSTANTIATINGTHEFRAMEWORK sub-section,projectionsonunitmethodreturnvalues,calledreturn differencing, and on fields of the unit instance, this, called in- Inthissectionwedescribethearchitectureandsomeimplemen- stancedifferencing,areusefulsincetheycorrespondtotechniques tationdetailsofastate-basedinstantiationoftheframework.(Sec- usedwidelyinhand-builtunittests. tion5discussesexistingcarvingandreplayimplementationswhich Acentralissueindifferentialtestingisthedegreetowhichdif- areaction-based). ferencing functions are able to detect changes that correspond to 3.1 SystemArchitecture faultswhilemaskingimplementationchanges. Werefertothisas thesensitivityofadifferencingfunction. Clearly,comparingcom- Figure4showsthearchitectureoftheCRtools,withtheshaded pletepost-stateswillbehighly-sensitive,detectingbothfaultsand rectangles being the primary components. The carving activity implementation changes. A projection function that only records startswiththeCarver classwhichtakesfourinputs: theprogram thereturnvalueofthemethodundertestwillbeinsensitivetoim- name,thetargetmethodmwithintheprogram,thesystemtestcase plementationchangeswhilepreservingsomefault-sensitivity.Note st inputs,andoptionstoboundthecarvingprocess. x alsothatthesedifferencingfunctionsprovidedifferentincomplete Carver utilizes a custom class loader CustomLoader (that uti- views on program state. Their incompleteness reduces cost and lizesBCEL[13])toincorporateintotheprogram:asingletonCon- providesameasureofimplementationchangeinsensitivity,butitis textFactoryclassconfiguredtostorepreandpoststates,andinvo- problematicsinceitmayreducetheirfaultdetectioneffectiveness. cationsoftheContextFactoryattheentryandexit(s)ofm. Then, Weaddressthisbyallowingformultipledifferencingfunctions every execution of m will cause two invocations of ContextFac- tobeappliedinCRtestingwhichhasthepotentialtoincreasefault- tory: onetostores andonetostores . ContextFactoryuti- pre post sensitivity,withoutnecessarilyincreasingimplementationchange- lizestheContextBoundingclasstoassistwiththedeterminationof sensitivity. Forexample,usingapairofreturnandinstancediffer- whatpartofthestateshouldbestoredwhentestcasereductionis encingfunctionsallowsonetodetectfaultsinbothinstancefield utilized.Bydefault,ContextBoundingperformsthemostconserva- updatesandmethodresults,butwillnotexposedifferencesrelated tiveprojection:aninterfacereachabilityprojection(asdescribedin todeeperstructuralchangesintheheap. Faultisolationefficiency Section2.3).Morerestrictiveprojectionscanbeperformedthrough couldalsobeenhancedbytheavailabilityofmultipledifferencing theBoundingAnalysisclass;wehaveimplementedtwosuchprojec- functions, since each could focus on a specific property or set of tionsanddescribetheminthenextsection.Finally,theopensource programstatecomponentsthatwhichwillhelpdevelopersrestrict packageXStream,describedinmoredetailinthenextsection,per- theirattentiononapotentiallysmallportionofprogramstatethat formsstateserializationandtemporarystorage.Finally,thedatais mayreflectthefault. compressedwiththeoff-theshelfcompressionutilitybzip. Thereisanotherdifferencingdimensionthatcanimprovefault TheReplaycomponentsharesmanyoftheclasseswithCarver. isolation.ItconsistsofgeneralizingthedefinitionofDUTstocap- AsinCarver,Replayinstrumentstheclassofthetargetunit,inthis tureasequenceofpost-states,(spre,σpost),thatcaptureinterme- case m(cid:48), and utilizes the ContextFactory, but only to store spost. diatepointsduringtheexecutionofthemethodundertest.Figure3 TheContextLoaderclassobtainsandloadss ,usingXStreamto pre illustratesascenarioinwhichageneralizedDUTbeginsexecution unmarshall the stored program state, and then invokes the target ofmatspre.Conceptually,duringreplayasequenceofpost-states unitforexecution. projection Filter Program ContextFactory ContextFactory m’ ContextBounding m ContextBounding Dut m’:Spost XStream XStream m:Spre,Spost Dut ContextLoader post XStream pre/post CustomLoader Spre m:Spre CustomLoader bcel Spost bcel st x m m BounSdidineg e fAfencatlysis m Spost m’ m’ BounSdidineg e fAfencatlysis … … Dif m m’ Carver Replay function Program Options m st m m’ Options x Figure4:CRToolArchitecture. Two set of scripts, represented with double-side rectangles in Interfacek-boundedreachableprojection.Theinterfacek-bounded Figure4,areutilizedtoprovidethefilteringanddifferencingmech- reachableprojectiondefinesthesetofpreservedreferencestoin- anisms. OnceatestsuiteofDUTsisgenerated,testcasefiltering clude only those reachable via reference chains of length k, i.e., canbeperformedtoremoveredundanttestcasesbasedonthesame {r | ∃ reachk(p)}. Usingsmallvaluesofkcangreatly p∈Params setofprojectionsavailablethroughBoundingAnalysis. Difscripts reducethesizeoftherecordedpre-stateandformanymethodsit compare two s according to a specified differencing function willhavenoimpactonunit-testrobustness.Forexample,avalueof post todeterminewhetherthechangesfrommtom(cid:48)generateabehav- 1wouldsufficeforamethodwhoseonlydereferencesareaccesses ioraldifference.Currently,differencingfunctionsonreturnvalues, tofieldsofthis.Intheimplementation,whentraversingthepro- oninstancefields,onfullprogramstate(thedefault)arefullyau- gram using Xstream to store the program state, we keep track of tomated. TofacilitateexperimentationwithdifferentDiffunctions thelengthofdereferencechainstohalttraversalwhenkisreached. ourtoolscurrentlystorethefull s , butweplantoimplement Iftheunitaccessesdataalongareferencechainoflengthgreater post optionstostoreonlydif(s )whichhasthepotentialtosignifi- than k, then a k-bounded projection will retain insufficient data post cantlyreducethecostofcarving,replayanddifferencing. about the pre-state to allow replay. Our implementation dynami- callydetectsthissituationandissuesaSentinelAccessException todistinguishreplayfailurefromanapplicationexception.Thisis achievedbyextendingXstreamwithacustomconverterthatauto- 3.2 InterestingImplementationAspects maticallytransformsobjectsthatlieatadepthofk+1tocontain Inthissectionwebrieflydescribethemostinterestingaspectsof anadditionalbooleanfieldthatmarksitasasentinelinstance.The theimplementation. unitundertestistheninstrumentedtoinsertatestofthisboolean fieldandraisetheexceptioniftrue. Limitationsofthe java.io.Serializableinterface. Ourapproach May-referencereachableprojection. Themay-referencereach- requirestheabilitytosaveandrestoreobjectdatarepresentingthe able projection uses a static analysis that calculates a characteri- programstate.However,theJavajava.io.Serializablein- zation of the heap instances that may be referenced by a method terfacelimitsthetypeofobjectsthatcanbeserialized. Forexam- activation either directly or through method calls. This charac- ple,Javadesignatesfilehandlerobjectsastransient(non-serializable) terization is expressed as a set of regular expression of the form: becauseitreasonablyassumesthatahandler’svalueisunlikelyto pf ...f (F+)? This captures an access path that is rooted at a be persistent, and restoring it could enable illegal accesses. The 1 n parameter p and consists of n dereferences by the named fields samelimitationsapplytootherobjects, suchasdatabaseconnec- f . If the analysis calculates that the method may reference an tions and network streams. In addition, the Java serialization in- i object through a dereference chain of length greater than n, the terfacemayimposeadditionalconstraintsonserialization. Forex- optional final term is included to capture objects that are reach- ample,itmaynotserializeclasses,methods,orfieldsdeclaredas ablefromtheendofthechainthroughdereferenceoffieldsinthe privateorfinalinordertoavoidpotentialsecuritythreats. setF. Letreach (r) = {r(cid:48) | ∃ v (r,f) = r(cid:48)}capture Fortunately, we are not the first to face these challenges. We F f∈F field reachabilityrestrictedtoasetoffieldsF;reach denotesreacha- foundmultipleserializationlibrariesthatoffermoreadvancedand f bilityforthesingletonsetf. Foraregularexpressionoftheform flexibleserializationcapabilitieswithvariousdegreesofcustomiza- pf ...f , where m ≤ n, we construct the set: reach (p)∪ tion. We ended up choosing the XStream library [41] because it 1 m f1 ... ∪ reach (...(reach (p))), since we want to capture all comesbundledwithmanyconvertersfornon-serializabletypesand fm f1 referencestouchedalongthepath. Iftheregularexpressionends adefaultconverterthatusesreflectiontoautomaticallycaptureall with the term F+ then we union an additional term of the form objectfields,itserializestoXMLwhichismorecompactandeasier reach+(reach (...(reach (p)))).Thisprojectioncansignif- toreadthannativeJavaserialization,andithasbuilt-inmechanisms F fm f1 icantlyreducethesizeofcarvedpre-stateswhileretainingarbitrar- totraverseandmanagethestorageoftheheapwhichwasessential ilylargeheapstructuresthatarerelevanttothemethodundertest. inimplementingthefollowingprojections. Weimplementedak-boundedaccesspathbasedmay-reference C-selection-k. SimilarinconcepttoS-selection,thistechnique analysisthatusedtheflow-insensitivecontext-sensitiveequivalence- executesallDUTs,carvedwithak-boundedreachableprojection, class based read-write analysis implemented in Indus [30]. This thatexercisemethodsthatwerechangedinP(cid:48). Withinthistech- analysispartitionsparameterandvariablenamesintoequivalence niqueweexploredepthboundinglevelsof1,2,5,and∞(unlim- classes. Thetwodistinctfeaturesoftheanalysisare: 1)foreach iteddepthwhichcorrespondstotheinterfacereachableprojection.) equivalence class, an abstract heap structure based on the names C-selection-mayref. SimilartoC-selection-kexceptthatitcarves involvedinread/writeaccessismaintained,and2)distinctequiv- DUTsutilizingamay-referencereachableprojection. alenceclassesaremaintainedforeachmethodscopeexceptinthe case of static fields and variable names occurring in methods in- 4.2 Measures volved in recursive call chains. We generate regular expressions Regressiontestselectiontechniquesachievesavingsbyreducing thatcapturethesetofallpossiblereferencedaccesspathsuptoa the number of test cases that need to be executed on P(cid:48), thereby givenfixedlength,k,withadefaultofk=2.Whentraversingthe reducing the effort required to retest P(cid:48). We conjecture that CR programusingXstream,wesimultaneouslykeeptrackofallregular techniquesachieveadditionalsavingsbyfocusingonunitsofP(cid:48). expressionsandmarkonlythoseobjectsthatlieonadefinedaccess Toevaluatetheseeffects, wemeasurethetimetoexecuteandthe pathforstorageinXML.Thisanalysisisalsocapableofdetecting timetochecktheoutputsofthetestcasesintheoriginaltestsuite, whenamethodisside-effectfreeandinsuchcasesthestorageof the selected test suite, and the carved selected test suites. For a post-statesisskippedsincemethodreturnvaluescompletelydefine carvedtestsuitewealsomeasurethetimeandspacetocarvethe theeffectofsuchmethod. originalDUTtestsuite. Onepotentialcostofregressiontestselectionisthecostofmiss- ingfaultsthatwouldhavebeenexposedbythesystemtestsprior 4. EXPERIMENT totestselection. Similarly, DUTsmaymissfaultsduetotheuse Thegoaloftheexperimentistoassessexecutionefficiency,fault ofprojectionsaimedatimprovingcarvingefficiency.Wewillmea- detectioneffectiveness,androbustnessoftheDUTs. Wewillper- sure fault detection effectiveness by computing the percentage of formsuchassessmentthroughthecomparisonofsystemtestsand faultsfoundbyeachtestsuite. Wewillalsoqualifyourfindings theircorrespondingcarvedunittestcasesinthecontextofregres- byanalyzinginstanceswheretheoutcomesofacarvedtestcaseis siontesting.Withinthiscontext,weareinterestedinthefollowing differentfromitscorrespondingsystemtestcase. researchquestions: To evaluate the robustness of the carved test cases in the pres- ence of program changes, we are interested in considering three RQ1: Cancarvingtechniquessaveregressiontestexecutioncosts? potentialoutcomesofreplayingact onunitm(cid:48): 1)faultisde- Wewouldliketocomparethecostofreusingcarvedunittest xm tected, ct causes m(cid:48) to reveal a behavioral differences due to cases versus the costs of utilizing regression test selection xm afault;2)falsedifferenceisdetected,ct causesm(cid:48) toreveala techniquesthatworkonsystemtestcases. xm behavioral change from m to m(cid:48) that is not a fault (not captured RQ2: What is the fault detection effectiveness of the carved test bystx); andtestisunexecutable, ctxm isill-formedwithrespect cases? Thisisimportantbecausesavingtestingcostswhile tom(cid:48).Testsmaybeill-formedforavarietyofreasons,e.g.,object reducingfaultdetectionisrarelyanenticingtrade-off. protocol changes, internal structure of object changes, invariants change, and we refer to the degree to which a test set becomes RQ3: Howrobustarethecarvedtestsinthepresenceofsoftware ill-formedunderachangeitssensitivitytochange. Weassessro- evolution? We would like to assess the reusability of the bustnessbycomputingthepercentageofcarvedtestsandprogram carvedunittestcasesunderarealevolvingsystem,andex- unitsfallingintoeachoneoftheoutcomes.Sincetherobustnessof amine how different types of change can affect the carved atestcasedependsonthechange, wequalifyrobustnessbyana- testssensitivity. lyzingtherelationshipbetweenthetypeofchangeandthelifespan andsensitivityoftheDUT. 4.1 TestingTechniques 4.3 Artifact LetP beaprogram,letP(cid:48)beamodifiedversionofP,andletT TheartifactwewillusetoperformthisexperimentisSiena[9]. beatestsuitedevelopedinitiallyforP.Regressiontestingseeksto Siena is an event notification middleware implemented in Java. testP(cid:48). Tofacilitateregressiontesting,testengineersmayre-use This artifact is available for download in the Subject Infrastruc- T totheextentpossible. Inthisstudyweconsideredfourtypesof tureRepository(SIR)[15,32]. SIRprovidesSiena’ssourcecode, testregressiontechniques,twothatworkwithsystemtests(S)and asystemleveltestsuitewith567testcases,multipleversionscor- twothatworkedwithcarvedtests(C): respondingtoproductreleases, andasetofseededfaultsineach version(theauthorswerenotinvolvedinthislatestactivity). S-retest-All. WhenP ismodified,creatingP(cid:48),wesimplyreuse For this experiment we consider Siena’s core components (not all non-obsolete test cases in T to test P(cid:48); this is known as the onapplicationincludedinthepackagethatisbuiltwiththosecom- retest-all technique [23] and it has been said to represent current ponents). We utilize the five versions of Siena that have seeded industrialpractices[25]. faults that did not generate compilation errors (faults that gener- atedcompilationerrorscannotbetested)andthatwereexposedby S-selection. The retest all technique can be expensive: rerun- atleastonesystemtestcase(faultsthatwerenotfoundbysystem ningalltestcasesmayrequireanunacceptableamountoftimeor tests would not affect our assessment). For brevity, we summa- humaneffort. Regressiontestselectiontechniques[5,10,24,34] rizethemostrelevantinformationtoourexperimentinTable1and use information about P, P(cid:48), and T to select a subset of T, T(cid:48), pointthereadertoSIR[32]toobtainmoredetailsabouttheprocess with which to test P(cid:48). We utilize the modified entity technique employedtopreparetheSienaartifactforexperimentation. Table [10],whichselectstestcasesthatexercisemethods,inP,that(1) 1providesthenumberofmethods,methodschangedbetweenver- havebeendeletedorchangedinproducingP(cid:48),or(2)usevariables sions and covered by the system test suite, system tests covering orstructuresthathavebeendeletedorchangedinproducingP(cid:48). thechangedmethods,andfaultsincludedineachversion. Version Methods Changed-covered Testsexecuting Faults Carving Metric Reduction methods changedmethods C-select-k C-select v0 109 - - - 1 2 5 ∞ mayref v1 100 2 494 3 Plain Minutes 118 118 117 119 194 v5 111 2 494 1 MB 820 14K 26K 26K 15K v6 111 2 8 1 Compressed Minutes 315 513 717 725 619 v7 107 10 550 2 MB 9 13 15 15 14 Lossless Minutes 258 299 335 382 378 Table1:Siena’scomponentsattributes. Filtering MB 61 1.1K 2.8K 2.9K 1.7K Lossless Minutes 270 332 399 447 428 Compressed MB 3 4 14 11 11 4.4 ExperimentalSetupandDesign Table2:CarvingtimesandsizestogenerateinitialDUTsuite. Theoverallexperimentalprocessconsistedofthefollowingsteps. First,wepreparedthetestsuitesgeneratedbyS-retest-all,S-selection, For Siena, constraining the carving depth does not seem to af- C-selection-k*,andC-selection-mayref fortheirautomaticexecu- fect the carving time (the variation we observed is small enough tion. Thepreparationofthesystemleveltestsuiteswastrivialbe- tobeattributabletonoise)butitdoesimpactthestoragerequire- causetheywerealreadyavailableintherepository.Thepreparation ments for keeping (s ,s ). The similar replay times are at- pre post ofthecarvedselectionsuites,requiredforustoruntheCRtoolto tributabletothefactthatthecarvingtoolfirsttraversesthewhole carvealltheinitialsuiteofDUTsfromthesystemtestswhenexe- heap,anditthentruncatesthecontenttothedesireddepthleverag- cutedinv0. ingtheXStreamfacility.Alsonotethatfordepthsgreaterthantwo Second,weruneachtestsuiteonthefault-freeversionsofSiena thedifferencesinstoragespaceareminimalduetotherather“shal- to generate an oracle for each version. In the case of the system low”natureofthesubject(dereferencechainswithlengthgreater testsuite,theoracleconsistedinthesetofoutputsgeneratedbythe than2arerareinSiena).Themay-referenceprojectionrequiresan program. Forthecarvedtests,theoracleconsistedofthemethod additional61%ofanalysistime,butprovidesaspacereductionof returnvalueandtherelevantspost (weexploreseveralalternative 43%relativeto∞.Asweshallsee,thisspacereductionisvaluable projectionstodefinetherelevantstate). inthatittranslatesintomoreefficientreplay. Third,weruneachtestsuiteoneachfaultyinstanceofeachver- In the second row of Table 2 we see that simply compressing sion (some versions contained multiple faults) and recorded their thestatedataincreasedthecarvingtimelinearlyasafunctionthe executiontime. Wedealtwitheachfaultinstanceindividuallyto carved state size (and it will add uncompression time as well for controlforpotentialmaskingeffectsamongfaultsthatmightneg- the DUTs selected for replaying), but it consistently provided a atively affect the fault detection performance of the system tests. three-orders of magnitude reduction in the space required by the Note that, due to their more concentrated focus, unit tests would DUTs,offeringaveryinterestingtradeoff. Filteringthetestsuite notbeassusceptibletosuchfaultinteractions. byremovingcarvedDUTswiththethesamepre-state(wecallthis Fourth,toassessfaultdetectioneffectiveness,foreachtestsuite, losslessfilteringinTable2)alsoprovidedsavingsofoneorderof wecomparedtheoutcomeofeachtestcasebetweenthefault-free magnitude in the required storage but with a reduced impact on version(oracle)andthefaultyinstancesofeachversion. Tocom- carvingtime. Thelastrowpresentsahybridapproachthatapplies parethesystemtestoutcomesbetweencorrectandfaultversions, filteringandthencompressesthedata. Thislastcarvingtechnique weusedpre-defineddifferencingfunctionsthatarepartofourim- isfasterthantheoneperformingjustcompressionandstillprovides plementation which ignore “non-deterministic” output data (e.g, asimilardegreeofstoragesavings. dates, times, random numbers). For the unit tests, we performed Itisveryimportanttonotethatthecarvingnumbersreportedin asimilardifferencing,butappliedtothetargetmethodreturnval- Table2correspondtotheinitialcarvingofthecompleteDUTsuite uesandspost. Whentheoutcomeofatestcasedifferedbetween –DUTscarvedforeachoftheover100methodsinSienafromeach thefault-freeandthefaultyversion,afaultisfound. oftheover550systemteststhatmayexecuteeachmethod–and Last, we compared the measures across the test suites gener- canbeperformedautomaticallywithoutthetester’sparticipation. ated by S-retest-all, S-selection, C-selection-k*, and C-selection- Duringtheevolutionofthesystem,DUTswillbereplayedrepeat- mayref. We then repeated the same steps to collect data for the edlyamortizingtheinitialcarvingcosts, andonlyasubsetofthe sametechniqueswhenutilizingtestcasefilteringandcompression. DUTswillneedtoberecarved. Recarvingwillbenecessarywhen Theresultsemergingfromthiscomparisonarepresentedinthenext isdeterminedthatchangesintheprogrammayaffectaDUT’srele- section.AlltheseactivitieswereperformedonanOpteron250pro- vantpre-state. Webelievethatexistingimpactanalysistechniques cessor,with4GBofRAM,runningLinux-Fedora,andJava1.5. [26]couldbeused,forexample,todeterminewhatDUTsmustbe recarvedwhentheaunitischanged,andweplantointegratethose 4.5 Results intoourinfrastructureinthefuture. We now proceed to analyze the replay efficiency. Replay effi- In this section we provide the results addressing each research ciencyisparticularlyimportantsinceitislikelythatacarvedDUT questionregardingcarvingandreplayingefficiency,faultdetection willberepeatedlyreplayedasthetargetunitevolves.Figure5sum- effectiveness,androbustnessandsensitivityoftheDUTssuites. marizesthereplayexecutiontimesforsomeofthetechniqueswe RQ1: Efficiency. Wefirstfocusontheefficiencyofthecarving consider. Duetospaceconstraintsweonlypresentresultsforthe process. Althoughourinfrastructureautomatescarving, thispro- testsuitesgeneratedbythreeplaincarvingtechniques(weignore cessdoesconsumetimeandstoragesoitisimportanttoassessits C-selection-k1 because although it is the fastest it cannot replay efficiencyasitmightimpactitsadoptionandscalability. Table2 manyDUTs,andweignoreC-selection-∞becauseitsresultsare summarizes the time (in minutes) and the size (in MB) that took almostidenticaltoC-selection-k5). Wealsopresentplotsfortwo tocarveandstorethecompleteinitialsuiteofDUTsutilizingthe techniqueswithfilteringapplied,andonewithfilteringanduncom- differenttechniques. Eachrowcontainsavariationofthecarving pression. Eachobservationcorrespondstothereplaytimeofeach (plain,withcompression,withDUTsfiltering,orhybrid),andeach selected test suite under each version and an overall average (the columncontainsatestcasereductiontechnique. linesjoiningobservationsarejustmeanttoassistintheinterpreta- 140 tion). ThetestsuiteresultingfromtheS-retest-alltechniquecon- sistentlyaverages135minutesperversion. Thetestsuitesresult- ingfromS-selectforeachversionaverages91minutesperversion, 120 withsavingsoverS-retest-allrangingfromfromaminimumof4 minutesinv6maximumof132minutesinv7. (Factorsthataffect theefficiencyofthistechniquearenotwithinthescopeofthispa- 100 perbutcanbefoundat[17]). Onaverage, S-select takes67%of thetimerequiredbyS-retest-all. The test suites selected by the C-selection-k* techniques show 80 verysimilartendencies.Onaverage,theC-selection-k*techniques s e replay execution time was 19 minutes, and they took less than a ut n minutetoreplayv6andupto82minutesforC-selection-k5tore- Mi 60 playv2. Onaverage, thesesuitestakes13%ofthetimerequired by S-retest-all, and 19% of the time required by S-select. The test suites selected by C-selection-mayref takes 16% of the time 40 requiredbyS-retest-all, 19%ofthetimerequiredbyS-selection- mayref,and76%ofthetimerequiredbyC-selection-k5.Applying filtering to the carved suites generated by C-selection-k5 reduced 20 testexecutiontimebymorethanathird(e.g.,from84to19DUTs inv2).Last,andasexpected,weobservethathandlingcompressed filesaddsconsiderablereplayingtime. 0 Wealsomeasuredthediffingtimerequiredbyalltechniques.For v1 v5 v6 v7 Average thesystemtestsuitesthediffingtimeswereconsistentlylessthan Versions aminute,fortheC-selection-k*suitesitaveraged12minutes,and C-select-k2 C-select-k5 fortheC-selection-mayref averaged6minutes.Whenfilteringwas C-select-mayref Filtered-C-select-k5 employed, diffing time for the C-selection-k* techniques was re- Filtered-C-select-mayref Filtered-Uncomp-C-select-k5 ducedbyanaverageof54%. Overall,althoughthediffingactivity S-retest-all S-Select isimportanttotheperformanceofthecarvedsuites,implementing simple incremental differencing functions could dramatically im- Figure5:Executiontimes. provetheirdiffingperformance. Forexample, wecurrentlycom- replayingthetestsuitecarvedutilizingC-selection-k1didnotde- parealltheprogrampost-state,butwecouldinsteadfirstcompare tectallthebehavioraldifferencesexhibitedbytheselectedsystem thereturnvaluestoseeifitrevealsanydifferences,andifitdoes test cases (1 out of the 8 system tests exposed a behavioral dif- not,thencomparetherestofthepost-state. Thissimpletechnique wouldsufficetoreducev5diffingtimeby96%. ferencethatwasnotexposedbyanyofitscorrespondingDUTs). ThisreductioninFFwasduetothedepth-1projectionwhichdid RQ2: Faultdetectioneffectiveness. Thetestsuitesdirectlygen- notcaptureenoughpre-statetodetectabehavioraldifference. The eratedbyS-selection, C-selection-k*, andC-selection-mayref de- othercarvedsuites,however,diddetectthisfault. tected as many faults as the S-retest-all technique. As expected, Inv5,andindependentlyofthecarvedtestsuiteused,3outof theapplicationofourlosslessfiltertoreducethenumberofDUTs 300failingsystemtestsdidnothaveanycorrespondingDUTonthe didnotresultinanylossinfaultdetectionpower. Thisindicates changedmethodsfailing(99%).Weobservedthesamesituationin that a DUT test suite may be as effective as a system test suite v7 : f2 where 18 out of 203 DUTs (9%) did not expose behav- at detecting faults, even when using aggressive projections. It is ioraldifferenceseventhoughthecorrespondingsystemtestsfailed. worthnoting,however,thatwhencomputingfaultdetectioneffec- WhenweanalyzedthereasonsforthisreductioninFFwediscov- tivenessoverawholetestDUTsuitewedonotaccountforthefact eredthatinbothcasesthetooldidnotcarveinv0thepre-statefor that,forsomesystemtests,theircorrespondingcarvedDUTsmay oneofthechangedmethods. Thetooldidnotcarveanypre-state havelostorgainedfaultdetectioneffectiveness.Weconjecturethat forthosemethodsbecausethesystemtestcasedidnotreachthem. thisisalikelysituationwithoursubjectbecausemanyofthefaults Changesinthecodestructure(e.g.,additionofamethodcall,han- aredetectedbymultiplesystemtests. Toaddressthissituationwe dlingofanexception),however,madethesystemtestcasesreach performaneffectivenessanalysisattheindividualtestcaselevel. those changed methods (and expose a fault) in later versions. In Foreachcarvingtechniquewecompute: 1)PP,thepercentage both circumstances, improved DUTs that would have resulted in ofpassingselectedsystemtests(selectedutilizingS-Selection)that 100%FFcouldhavebeengeneratedbyre-carvingthetestcasesin have all corresponding carved unit test cases passing, and 2) FF: later versions (carve from v to replay in v ). More generally, i i+1 thepercentageoffailingsystemteststhathaveatleastonecorre- theseobservationspointoutagainfortheneedtoestablishmecha- spondingfailingcarvedunittestcase. Table3presentsthePPand nismstodetectchangesinthecodethatshouldtriggerre-carving. FFvaluesforallthetechniquesunderallversioninstances. V7:f1alsopresentsaninterestingbutoppositefinding.Inthis WhenusingthetestsuiteresultingfromC-selection-k1wefind instanceonly24%ofthepassingsystemtestshadalltheirassoci- thatmostDUTsgoingthroughthechangedmethodscannotbefully atedDUT’spassing. Thesedifferencesmaybecausedbyfaults(a executed,whichleadsinmostcasestoaverysmallnumberofsys- faultthatdidnotpropagatetotheoutputsothesystemtestdidnot tem test cases for which all the DUTs passed as well. v7 is the detectit)orjustbythecodechanges(changesintheunitsgenerated exceptiontothisPPtendency,wherecarvingwithadepthofone changesinthepost-states)1.Weexplorethisissuefurthernext. wasasgoodasthemaximumdepthbecausethechangedmethod onlyoperatedontheclass’sscalarfields. Notethatevenwhenus- 1SinceweonlyconsideredSienaversionswithfaultsexposedby ingthisrestrictivereduction,theFFvaluesareonaverage97%.In system tests, we are not exploring possible situations in which the cases where FF is not 100% such as in v6, we observed that faultsareexposedjustbyDUTs. C-selection-k C-selection rathersmallerandmorefocusedunittests. Anotherimportantdis- 1 5 ∞ mayref tinction is that the testing target are not the semantic differences PP FF PP FF PP FF PP FF betweenversions,butrathermethodsintheprogram. v1:f1 1 100 100 100 100 100 100 100 v1:f2 1 100 100 100 100 100 100 100 Thepreliminaryresultsfromouroriginaltestcarvingprototype v1:f3 1 100 100 100 100 100 100 100 [16], later expanded by Reddy [22], evidenced the potential of v5 1 99 100 99 100 99 100 99 carvedteststoimprovetheefficiencyandthefocusofalargesys- v6 1 88 100 100 100 100 100 100 temtestsuite, identifiedchallengestoscale-uptheapproach, and v7:f1 24 100 24 100 24 100 24 100 defined some scenarios under which the carved test cases would v7:f2 100 91 100 91 100 91 100 91 andwouldnotperformwell. Wehavebuiltonthatworkbypre- Table3:FaultDetectionEffectiveness. sentingagenericframeworkfordifferentialcarving,extendingthe typeofanalysisweperformedtomaketheapproachmorescalable, RQ3: Robustnessandsensitivity. Wepreviouslyexaminedhow andbydevelopingafullsetoftoolsthatcanenableustoexperi- DUTs obtained through C-selection-k1 are quite fragile in terms mentwithdifferenttechniquesonvariousprograms. of their executability, and how certain code changes may make a Weareawareoftwootherresearcheffortsrelatedtothenotion methodreachanewpartoftheheapthatwasnotoriginallycarved. oftestcarving. First,Orsoetal. introducedthenotionofselective Acomplementarywaytoevaluatetherobustnessandsensitivityof record and replay mechanisms [27], which was later prototyped DUTs is to compare their performance in the presence of meth- [28]andusedtoreplayunitsinisolation. Second,thetestfactor- odsthatchanged,andinthepresenceofmethodsthatchangedand ingapproachintroducedbySaffetal. takesasimilarapproachto areindeedfaulty. Weperformedsuchdetailedcomparisononthe Orso’swiththecreationofwhattheycalledmockobjectsthatserve filteredsuitesforC-selection-∞andC-selection-mayref,andnow to create the scaffolding to support the execution of the test unit brieflydiscussthreedistinctinstancesofthescenarioswefound. [36]. Thesamegroupintroducedatoolsetforfully-featuredJava Inbothfaultyinstancesofv7,theversionwiththemostmeth- executionenvironmentsthatcanhandlemanyofthesubtleinterac- odschanged(10),noneofthebehavioraldifferenceswerefoundby tionspresentinthisprogramminglanguage(e.g.,callbacks,arrays, methodsotherthanthefaultyones. Thisisclearlyanidealsitua- native methods) [35]. In terms of our framework, both of these tion. V1 : f1representsperhapsamorecommoncasewerenone approacheswouldbeconsideredaction-basedCRapproaches. We oftheDUTsgoingthroughnon-faultychangedmethodsfailed,but havepresented,whatistothebestofourknowledge,thefirststate- only 78% of the DUTs traversing faulty methods actually failed. basedapproachtoCRtesting. Yet a different perspective is offered by v5. Only two methods Saffetal.describetheirapproachindetailallowingustoprovide changedinthisversion,andonethemisinvokedexclusivelybythe a more in depth comparison with our approach. While carving a other. Thefaultislocatedinthecallee. Thecallermethodisexer- methodtestcase,theirinfra-structurerecordsthesequenceofcalls cisedby6354DUTsoutofwhich736detectbehavioraldifferences that can influence the method and then they record the sequence (12%).The(faulty)calleemethodisexercisedby26173DUTsout of calls made by the method and the return values and unit state of which 928 detect behavioral differences (less than 4%). This side-effects of those calls. In our framework, this would amount lastscenario,inwhichcarvingstillgeneratesmorebehavioraldif- to calculating σ¯ such that s(σ¯) = s for the method of inter- pre ferencesforthefaultymethodthanforchangeone, isinteresting estandthencalculatingsummarizingtracesσ¯ thatreflectthe calli because it shows that even for correct changes the number of af- return value and side effects for each call out of the method and fectedDUTsmaybelarge. carving s , the relevant pre-state for each call. During replay prei It is worth noting that the differencing functions offer an op- thesamesequenceofcallswiththesameparametersisexpected- portunity to control this problem. For example, a more relaxed anydeviationresultsinareportofadifferenceduringreplay. In differencingmechanismfocusedonjustreturnvalueswouldhave ourframework,wewouldidentifythepointsatwhichthe,n,calls detectedallthefaultsinv5andv6,whilereducingthenumberof outofthemethodoccuraspost-statelocationstodefineaDUTof falsedifferencessignificantlysincebothfaultsmanifestthemselves theform(σ¯,(s ,...,s )). pre1 pren inthereturnvalue.Suchadifferencingfunction,however,leadsto Bothoftheseaction-basedapproaches,capturetheinteractions areducedfaultdetectioninthecaseofv1. Mechanismstoselect betweenthetargetunitanditscontextandthenbuildthescaffold- andappropriatelycombinethesedifferencingfunctionswillbeim- ing to replay just those interactions. Hence, they do not incur in portantfortherobustnessandsensitivityofDUTs. costsassociatedwithcapturingandstoringthesystemstateforeach targetedunit. Ontheotherhand,thisapproachmaygeneratetests 5. RELATEDWORK thataresensitivetochangesthatdonoteffectmeaning,e.g.,chang- ingtheorderofindependentmethodcalls. Saffetal. haveidenti- OurworkwasinspiredbyWeide’snotionofmodulartestingas fiedthisissueandproposetoanalyzethelifespanofthefactored ameanstoevaluatethemodularreasoningpropertyofapieceof testcasesacrosssequencesofmethodmodifications[35]. Thisis software[38]. AlthoughWeide’sfocuswasnotontestingbuton acriticalfactorinjudgingthecost-effectivenessofCRtestingand theevaluationofthefragilityofmodularreasoning,heraisedsome wehavestudiedthisissueinSection4.5. importantquestionsregardingthepotentialapplicabilityofwhathe These two related efforts have shown their feasibility in terms calleda“modularregressiontechnique”thatledtoourwork. ofbeingabletoreplaytestsandthelatterapproachhasprovided Within the context of regression testing, our approach is also initial evidence that it can save time and resources under several similar to Binkley’s semantic guided regression testing in that it scenarios. Neitherapproach,however,hasbeenevaluatedinterms aimstoreducetestingcostsbyrunningasubsetoftheprogram[5, ofitsfaultdetectioneffectivenesswhichultimatelydeterminesthe 6]. Binkley’stechniqueproposestheutilizationofstaticslicingto valueofthecarvedtests,orinthecontextofregressiontesting. identifypotentialsemanticdifferencesbetweentwoversionsofa Our work also relates to efforts aimed at developing unit test program.Healsopresentsanalgorithmtoidentifythesystemtests cases. Several frameworks grouped under the umbrella of Xunit that must be run on the slices resulting from the differences be- have been developed to support software engineers in the devel- tweentheprogramversions. Thefundamentaldistinctionbetween opment of unit tests. Junit, for example, is a popular framework thisandourapproachisthatwedonotrunsystemleveltests,but

See more

The list of books you might like

Most books are stored in the elastic cloud where traffic is expensive. For this reason, we have a limit on daily download.