Table Of ContentCarving Differential Unit Test Cases from System Test Cases
Sebastian Elbaum, Hui Nee Chin, Matthew B. Dwyer, Jonathan Dokulil
Department of Computer Science and Engineering
University of Nebraska - Lincoln
Lincoln, Nebraska
{elbaum,hchin,dwyer,jdokulil}@cse.unl.edu
ABSTRACT ofasystem. Analternativeapproachtounittestdevelopment,that
doesnotrelyonspecifications,isbasedontheanalysisofaunit’s
Unittestcasesarefocusedandefficient. Systemtestsareeffective
implementation. Testersdevelopingunittestsinthiswaymayfo-
atexercisingcomplexusagepatterns.Differentialunittests(DUT)
cus,forexample,onachievingacoverage-adequacycriteriaofthe
areahybridofunitandsystemtests.Theyaregeneratedbycarving
targetunit’scode. Suchtests,however,areinherentlysusceptible
thesystemcomponents,whileexecutingasystemtestcase,thatin-
to errors of omission with respect to specified unit behavior and
fluencethebehaviorofthetargetunit,andthenre-assemblingthose
maytherebymisscertainfaults. Finally, unittestingrequiresthe
componentssothattheunitcanbeexercisedasitwasbythesys-
developmentoftestharnessesorthesetupofatestingframework
temtest. WeconjecturethatDUTsretainsomeoftheadvantages
(e.g.,junit[18])tomaketheunitsexecutableinisolation.
ofunittests,canbeautomaticallyandinexpensivelygenerated,and
Systemtestsareusuallydevelopedbasedondocumentsthatare
havethepotentialforrevealingfaultsrelatedtointricatesystemex-
commonly available for most software systems that describe the
ecutions. Inthispaperwepresentaframeworkforautomatically
system’sfunctionalityfromtheuser’sperspective,forexample,re-
carving and replaying DUTs that accounts for a wide-variety of
quirementdocumentsanduser’smanuals.Thismakessystemtests
strategies, we implement an instance of the framework with sev-
appropriatefordeterminingthereadinessofasystemforrelease,
eraltechniquestomitigatetestcostandenhanceflexibility,andwe
ortograntorrefuseacceptancebycustomers. Additionalbenefits
empiricallyassesstheefficacyofcarvingandreplayingDUTs.
accruefromtestingsystem-levelbehaviorsdirectly. First, system
testscanbedevelopedwithoutanintimateknowledgeofthesys-
1. INTRODUCTION teminternals,whichreducesthelevelofexpertiserequiredbytest
developersandwhichmakestestsless-sensitivetoimplementation-
Software engineers develop unit test cases to validate individ- level changes that are behavior preserving. Second, system tests
ual program units (e.g., methods, classes, packages) before they may expose faults that unit tests do not, for example, those that
areintegratedintothewholesystem. Byfocusingonanisolated span multiple units or that involve very complex usage of units.
unit,unittestsarenotconstrainedbyotherpartsofthesystemin Finally,sincetheyinvolveexecutingtheentiresystemnotesthar-
exercising the target unit. This smaller scope for testing usually nessesneedbeconstructed.
resultsinsignificantlymoreefficienttestexecutionandfaultisola- While system tests are an essential component of all practical
tionrelativetowholesystemtestinganddebugging[1, 19]. Unit software validation methods, they do have several disadvantages.
testcasesarealsousedasacomponentofseveralpopulardevelop- Theycanbeexpensivetoexecute;forlargesystemsdaysorweeks,
mentmethods,suchasextremeprogramming(XP)[2],testdriven andconsiderablehumaneffortmaybeneededforrunningathor-
development(TDD)practices[3],continuoustesting[37],andef- ough suite of system tests [25]. In addition, even very thorough
ficienttestprioritizationandselectiontechniques[33]. systemtestingmayfailtoexercisethefull-rangeofbehaviorim-
Developingeffectivesuitesofunittestcasespresentsanumber plementedbysystem’sunits;thus,systemtestingcannotbeviewed
ofchallenges. Specificationsofunitbehaviorareusuallyinformal asaneffectivereplacementforunittesting. Finally,faultisolation
andareoftenincompleteorambiguous,leadingtothedevelopment andrepairduringsystemtestingcanbesignificantlymoreexpen-
ofoverlygeneralorincorrectunittests.Furthermore,suchspecifi- sivethanduringunittesting.
cationsmayevolveindependentlyofimplementationsrequiringad- Theprecedingcharacterizationofunitandsystemtests,although
ditionalmaintenanceofunittestsevenifimplementationsremain notcomprehensive,illustratesthatsystemandunittestshavecom-
unchanged.Testersmayfinditdifficulttoimaginesetsofunitinput plementarystrengthsandthattheyofferarichsetoftradeoffs. In
valuesthatexercisethefull-rangeofunitbehaviorandtherebyfail thispaper,wepresentageneralframeworkforcarvingandreplay-
toexercisethedifferentwaysinwhichtheunitwillbeusedasapart ingofwhatwecalldifferentialunittests(DUT)whichaimatex-
ploitingthosetradeoffs. Wetermedthemdifferentialbecausetheir
primaryfunctionisdetectingdifferencesbetweenmultipleversions
ofaunit’simplementation.DUTsaremeanttobefocusedandeffi-
cient,liketraditionalunittests,yettheyareautomaticallygenerated
Permissiontomakedigitalorhardcopiesofallorpartofthisworkfor alongwithacustomtest-harness,makingtheminexpensivetode-
personalorclassroomuseisgrantedwithoutfeeprovidedthatcopiesare velopandeasytoevolve. Inaddition,sincetheyindirectlycapture
notmadeordistributedforprofitorcommercialadvantageandthatcopies thenotionofcorrectnessencodedinthesystemtestsfromwhich
bearthisnoticeandthefullcitationonthefirstpage.Tocopyotherwise,to
theyarecarved,theyhavethepotentialforrevealingfaultsrelated
republish,topostonserversortoredistributetolists,requirespriorspecific
tocomplexpatternsofunitusage.
permissionand/orafee.
Copyright200XACMX-XXXXX-XX-X/XX/XX...$5.00.
Report Documentation Page Form Approved
OMB No. 0704-0188
Public reporting burden for the collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources, gathering and
maintaining the data needed, and completing and reviewing the collection of information. Send comments regarding this burden estimate or any other aspect of this collection of information,
including suggestions for reducing this burden, to Washington Headquarters Services, Directorate for Information Operations and Reports, 1215 Jefferson Davis Highway, Suite 1204, Arlington
VA 22202-4302. Respondents should be aware that notwithstanding any other provision of law, no person shall be subject to a penalty for failing to comply with a collection of information if it
does not display a currently valid OMB control number.
1. REPORT DATE 3. DATES COVERED
2006 2. REPORT TYPE 00-00-2006 to 00-00-2006
4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER
Carving Differential Unit Test Cases from System Test Cases
5b. GRANT NUMBER
5c. PROGRAM ELEMENT NUMBER
6. AUTHOR(S) 5d. PROJECT NUMBER
5e. TASK NUMBER
5f. WORK UNIT NUMBER
7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES) 8. PERFORMING ORGANIZATION
University of Nebraska - Lincoln,Department of Computer Science and REPORT NUMBER
Engineering,Lincoln,NE,68588
9. SPONSORING/MONITORING AGENCY NAME(S) AND ADDRESS(ES) 10. SPONSOR/MONITOR’S ACRONYM(S)
11. SPONSOR/MONITOR’S REPORT
NUMBER(S)
12. DISTRIBUTION/AVAILABILITY STATEMENT
Approved for public release; distribution unlimited
13. SUPPLEMENTARY NOTES
The original document contains color images.
14. ABSTRACT
15. SUBJECT TERMS
16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF 18. NUMBER 19a. NAME OF
ABSTRACT OF PAGES RESPONSIBLE PERSON
a. REPORT b. ABSTRACT c. THIS PAGE 11
unclassified unclassified unclassified
Standard Form 298 (Rev. 8-98)
Prescribed by ANSI Std Z39-18
Inourapproach,DUTsarecreatedfromsystemtestsbycaptur- Given stx{ input/s, expected output/s }
ingcomponentsoftheexercisedsystemthatinfluencethebehavior
ofthetargetedunit,andthatreflecttheresultsofexecutingtheunit; Carve ctxm Execute stxinput S0 … Spre m Spost … … output
wetermthiscarving. Thosecomponentsareautomaticallyassem- ctxm{Spre, Spost} capture
bledintoatestharnessthatestablishesthepre-stateoftheunitthat
wasencounteredduringsystemtestexecution.Fromthatstate,the mevolves: m + != m’
unit is replayed and the resulting state is queried to determine if
therearedifferenceswiththerecordedunitpost-state. load Spre m’ Spost’
IdeallyDUTswill(a)retainthefaultdetectioneffectivenessof
sfeyrsetnemcestetshtsatoanrethneotatrignedticuantiitv,e(bo)fodnilfyferreipnogrstyssmteamllnteusmtbreesruslotsf,d(icf-) Replay ctxmon m’ Spost Same? yes Behavior of m " m’
no
beexecutedfasterthansystemtests, and(d)beapplicableacross
!has affected behavior of m
multiplesystemversions.WeempiricallyinvestigateDUTcarving
andreplaytechniqueswithrespecttothesecriteriathroughacon-
trolled experiment within the context of regression testing where Figure1:Carvingandreplayprocess.
wecomparetheperformanceofsystemtestsandcarvedunittests.
Theresultsindicatethatcarvedtestcasescanbeaseffectiveassys-
temtestcasesintermsoffaultdetection,butmuchmoreefficient. Aprogramexecutioncanbeformalizedeitherasasequenceof
The contributions of this paper are: (i) presenting of a frame- programstatesorasasequenceofprogramactionsthatcausestate
workforautomaticallycarvingandreplayingDUTsthataccounts changes.Asequenceofprogramstatesiswrittenasσ=s ,s ,...
0 1
forawide-varietyofimplementationstrategieswithdifferenttrade- wheres ∈Sands istheinitialprogramstateasdefinedbyJava.
i 0
offs; (ii)implementinganewstate-basedstrategyforcarvingand Astates isreachedfroms byexecutingasingleaction(e.g.,
i+1 i
replayatamethodlevelthatoffersarangeofcosts,flexibility,and bytecode). A sequence of program actions is written as σ¯. We
scalability;and(iii)identifyingevaluationcriteriaandempirically denotethefinalstateofanactionsequences(σ¯).
assessingtheefficiencyandeffectivenessofcarvingandreplayof
DUTsonmultipleversionsofaJavaapplication. Webelievethese 2.2 BasicCarvingandReplaying
contributions lay a solid and general foundation for further study
Figure 1 illustrates the CR process. Given a system test case
of carving and replay of DUTs and we outline several directions
st ,carvingaunittestcasect fortargetunitmduringtheex-
forfutureworkinSection6. InthenextSection, wepresentour x xm
ecution of st consists of capturing s , the program state im-
frameworkforcarvingandreplaytesting.Section3detailstheim- x pre
mediately before the first instruction of an activation of method
plementationofoneofthoseinstantiations.Section4describesour
m, and s , the program state immediately after the final in-
studyandresults.Section5discussesrelatedwork. post
struction of the activation of m has executed. The captured pair
of states (s ,s ), defines a differential unit test case for a
pre post
2. A FRAMEWORK FOR TEST CARVING method,ct . Statesinthispaircanbedefinedbycapturingthe
xm
appropriatestatesinσ, orthroughthecumulativeeffectsofase-
ANDREPLAY
quence of program actions, by capturing s(σ¯) at the appropriate
Javaprogramscanhavemillionsofallocatedheapinstances[14] points in σ¯. A CR testing approach is said to be state-based if
and hundreds of thousands of live instances at any time. Conse- it records pairs (s ,s ) and action-based if it records pairs
pre post
quently,carvingtherawstateofrealprogramsisimpractical. We (σ¯ ,s )wheres =s(σ¯ ).
pre post pre pre
believe that cost-effective carving and replay (CR) based testing Inpractice,itiscommonforamethod,m,toundergosomemod-
willrequiretheapplicationofmultiplestrategiesthatselectinfor- ification,e.g.,tom(cid:48),overtheprogramlifetime.Toefficientlyvali-
mation in raw program states and use that information to trade a datetheeffectsofamodification,wereplayct onm(cid:48).Replaying
xm
measureofeffectivenesstoachievepracticalcost.Strategiesmight adifferentialunittestforamethodm(cid:48)requirestheabilitytoeither
include,forexample,carvingasinglerepresentativeofeachequiv- loadstates intomemoryorexecuteσ¯ dependingonhowthe
pre pre
alenceclassofprogramstatesorpruninginformationfromacarved statewascarved. Fromthisstate,executionofm(cid:48) isinitiatedand
statethatamethodundertestisguaranteedtonotbedependenton. itcontinuesuntilitreachesthepointcorrespondingtothecarved
Thespaceofpossiblestrategiesisvastandwebelievethatageneral s .Atthatpoint,thecurrentexecutionstate,s(cid:48) ,iscompared
post post
frameworkfor CRtestingwill aidinexploringcost-effectiveness tos . Iftheresultingstatesarethesame,wecanattestthatthe
post
trade-offspossibleinthespaceofCRtestingtechniques. changedidnotaffectthebehaviorofthetargetunit. However, if
Regardlessofhowonedevelops,orgenerates,aunittest,there thechangealteredthebehaviorofm,thenfurtherprocessingwill
arefouressentialsteps:(1)identifyaprogramstatefromwhichto berequiredtodeterminewhetherthealterationmatchesthedevel-
initiatetesting,(2)establishthatprogramstate,(3)executetheunit oper’sexpectations.
from that state, and (4) judge the resulting state as to its correct- There are multiple techniques for diagnosing the root cause of
ness.IntherestofthisSection,wedefineageneralframeworkthat detected differences. For example, a difference could trigger the
allowsdifferentstrategiestobeappliedineachofthesesteps. executionofsystemtestst todeterminewhetheradifferenceman-
x
ifestsathigherlevelsofabstractions,theresultsofct couldbe
2.1 ProgramStatesandProgramExecutions xm
comparedwiththeresultsofmanuallydevelopedunittestsform,
For the purposes of explaining our framework, we consider a orintermediatestateswithintheexecutionofmandm(cid:48)(e.g.,after
Javaprogramtobeakindofstatemachine.Atanypointduringthe everystatement)couldbecomparedtoidentifytheearliestpointat
executionofaprogramtheprogramstate,S,canbedefined,con- whichstatesdiffer. Wediscusssupportforsomeofthesediagnos-
ceptually,asallofthevaluesinmemory.Asneeded,wewilldefine ticsinSection2.4andleavetheothersforfuturework.
notationforaccessingspecificportionsofastate,forexample,the Several fundamental challenges must be addressed in order to
parametersinthecurrentactiveframeofthecallstack. make CR cost-effective. First, the proposed basic carving proce-
dureisatbestinefficientandlikelyimpractical.Inefficientbecause
amethodmayonlydependonasmallportionoftheprogramstate,
thusstoringthecompletestateiswastedeffort. Furthermore,two TReesdtu Cctaiosne ctxm Spre ! Spre Reduced ctxm
distinctcompleteprogramstatesmaybeidenticalfromthepointof
viewofagivenmethod,thuscarvingcompletestateswouldyield
redundantunittests.Impracticalbecausestoringthecompletestate
ofaprogrammaybeprohibitivelyexpensiveintermsoftimeand ctxm Spre ! Spre
space. Second, changes to m may render ct unexecutable in
m(cid:48). Reducing the cost of CR testing is impoxrmtant, but we must Test Cases ! ! Keep ctxm
Filtering Drop ct
produceDUTsthatarerobusttochangessothattheycanbeexe- zm
!
cutedacrossaseriesofsystemmodificationsinorderrecoverthe ctzm Spre Spre
overheadofcarving.Finally,theuseofcompletepost-statestode-
tectbehavioraldifferencesisnotonlyinefficientbutmayalsobe
toosensitivetobehaviordifferencescausedbyreasonsotherthan Figure2:Sampleapplicationsofprojectionsfunctions.
faults(e.g.,faultfixes,improvements,internalrefactoring)leading
to the generation of brittle tests. The following sections address
thesechallenges.
2.3 ImprovingCRwithProjections 2.3.3 ApplyingProjections
WefocusCRtestingonasingleunitbydefiningprojectionson Figure2illustratestwopotentialapplicationsoftheprojections:
carvedpre-statesthatpreserveinformationrelatedtotheunitunder testcasereductionandtestcasesfiltering.
testandprovidesignificantreductioninpre-statesize. Reduction aims at thinning a single carved test case by retain-
ingonlytheprojectedpre-state(inFigure2theprojectionofs
pre
2.3.1 State-basedProjections carvedfromct leadstoasmallers ).ReducingaDUT’spre-
xm pre
stateresultsinreducedspacerequirementsand,moreimportantly,
Astateprojectionfunctionπ : S → S preservesselectedpro-
inquickerreplaysinceloadingtimeisafunctionofthepre-state
gram state components. For example, a state projection function
size. Asweshallsee, dependingonthetypeofprojection, these
may preserve only the values of reference fields, thereby elimi-
gains may be achieved at the expense of reduced fault detection
nating all scalar fields, which would maintain the heap shape of
power(e.g.,aprojectionmaydiscardanobjectthatwasnecessary
a program state. Many useful state projections are based on the
notion of heap reachability. A reference r(cid:48) is reachable in one to expose the fault). Furthermore, test executability may be sac-
dereference from r if the value of some field of r holds r(cid:48); let rificedaswell. State-basedprojectionsmaybecomeunexecutable
reach(r) = {r(cid:48) | ∃ v (r,f) = r(cid:48)} where v is the ifthedatastructuresusedbythetargetunitchanges,forexample,
f∈F field field
shiftingfromanarraytoaheap-basedstructure,evenifbehavioris
dereference function. References reachable through any chain of
preserved. Action-basedprojectionsmaybecomeunexecutableif
dereferences up to length k from r are defined by using the iter-
(cid:83)
atedcompositionofthisbinaryrelation, reachi(r); asa thebehaviorofaunitmethodchangessothatadifferentnumberor
1≤i≤k sequenceofmethodsisneededinthemodifiedprogramtoproduce
notational convenience we will refer to this as reachk(r). The
thedesiredpre-state. Still,reductioncanbeavaluablemechanism
positive-transitive closure of the relation, reach+(r), defines the
toimprovetheefficiencyofCRbykeepingjusttheportionsofthe
setofallreachablereferencesfromrinoneormoredereferences.
pre-statethataremostlikelytoberelevanttothetargetedmethod.
State-based CR testing approaches should use projections that
FilteringaimsatremovingredundantDUTsfromthesuite.Con-
retainatmosttheinterfacereachableprojectionwhichisdefined
sideramethodthatisinvokedduringtheprograminitializationand
to preserve the set of heap objects reachable from a calling con-
isindependentoftheprogramparameters. Suchmethodwouldbe
text, {r | ∃ reach+(p)}. Robustness to change under
p∈Params exercisedbyallthesystemtestsinthesamewayandresultinmul-
thisprojectionisidenticaltothatofthecompleteprogrampre-state
tiple identical DUTs for that particular methods. A simple filter
sincealldatathatthemethodcouldpossiblyreferenceiscaptured.
wouldremovesuchduplicatestests,keepingjusttheuniqueDUTs.
It is possible to trade robustness for reduction in carving cost by
Now consider a simple accessor method with no parameters that
definingprojectionsthateliminatemorestateinformation.Section
just returns the value of a scalar field. If this method is invoked
3presentstwoprojectionsthatexercisethistrade-off.
bythetestsfromdifferentpre-states, thenmultipleDUTswillbe
carved,andasimplelosslessfilterwillnotdiscardanyDUTeven
2.3.2 Action-basedProjectionsandTransformations
thoughtheyexercisesimilarbehavior.Inthiscase,applyingapro-
Projectionsonsequencesofprogramactions, π¯ : σ¯ → σ¯. can jectionthatpreservesthepre-statecomponentsdirectlyreachable
beusedtodistilltheportionofaprogramrunthataffectsthepre- fromthiswouldresultinmanyDUTsthatareredundant(inFigure
state of a unit method. Unfortunately, a purely projection-based 2π(s )forct andforct areidenticalsooneofthemcan
pre xm zm
approachtostate-capturewillnotworkforallJavaprograms. For be removed). Clearly, in some cases, this kind of lossy filtering
example,aprogramthatcallsnativemethodsdoesnot,ingeneral, mayresultinalowerfaultdetectioncapabilitysincewemaydis-
have access to the native methods instructions. To accommodate cardaDUTthatisindeeddifferentandhencepotentiallyvaluable.
this,wecanallowfortransformationofactionsduringcarving,i.e., Note that, contrary to test case reduction, filtering only uses pro-
replaceonesequenceofinstructionswithanother. Transformation jectionstojudgetestequivalence,consequently,testexecutability
could be used, for example, to replace a call to a native method ispreservedsincetheDUTsthatarekeptarecomplete.Inpractice,
withaninstructionsequencethatimplementstheside-effectsofthe however,reductionandfilteringarelikelytobeappliedintandem
nativemethod. Moregenerally,onecoulddesignaninstanceofπ¯ suchthatreducedtestsarethenfiltered,orfilteredtestsarethenre-
thatwouldreplaceanyportionofatracewithasummarizingaction duced(withoutnecessarilyusingthesameprojectionforreduction
sequence. andfiltering).
2.4 Adjusting Sensitivity through Differenc-
s s s s s s
ingFunctions pre post1 post2 post3 post4 post5
5
ThebasicCRtestingapproachdescribedearliercomparesacarved
completepost-statetoapost-stateproducedduringreplaytodetect }
1 2 3 4 activation of m
behavioraldifferencesinaunit. Theuseofcompletepost-statesis
bothinefficientandunnecessaryforthesamereasonsasoutlined }
calls out of unit
aboveforpre-states. Whilewecouldusecomparisonofpost-state
projectionstoaddresstheseissues,webelievethatthereisamore
flexiblesolution.
Figure3:Differencingsequencesofpost-states.
Methodunittestaretypicallystructuredsothat,afterasequence
ofmethodcallsthatestablishadesiredpre-statethemethodunder
testisexecuted. Whenitreturnsadditionalmethodcallsandcom-
is differenced with corresponding states at intermediate states of
parisonsareexecutedtoimplementapseudo-oracle.Forexample,
themethodundertest. Forexample,atpoint1,thetestcompares
unittestsforared-blacktreemightexecuteaseriesofinsertand
the current state to the captured s , similarly at points 2 and
deletecallsandthenquerythetree-heightandcompareittoanex- post1
3thepreandpost-statesofthecalloutoftheunitarecompared.
pectedresulttojudgepartialcorrectness. Weallowasimilarkind
Usingasequenceofpost-statesrequiresthatacorrespondencebe
of pseudo-oracle in CR testing by defining differencing functions
definedbetweenlocationsinmandm(cid:48).Correspondencescouldbe
onpost-statesthatpreserveselectedinformationabouttheresults
definedusingavarietyofapproaches,forexample,onecoulduse
ofexecutingtheunitundertest. Thesedifferencingfunctionscan
thecallsoutofmandm(cid:48) todefinepointsforpost-statecompari-
taketheformofpost-stateprojectionsorcanbemoreaggressive,
son(asisillustratedinFigure3)orcommonpointsinthetextofm
capturingsimplepropertiesofpost-states,suchastreeheight,and
andm(cid:48)couldbedetectedviatextualdifferencing.Faultisolationis
consequentlymaygreatlyreducethesizeofpost-stateswhilepre-
enhancedusingmultiplepost-states,sinceifthefirstdetecteddif-
servinginformationthatisimportantfordetectingbehavioraldif-
ference is at location i then that differencewas introduced in the
ferences.
regionofexecutionbetweenlocationi−1andi. Ofcourse,stor-
We define differencing functions that map states to a selected
ingmultiplepost-statesmaybeexpensivesoweadvocatetheuse
differencing domain, dif : S → D. Differencing in CR testing
ofσ tonarrowthescopeofcodethatmustbeconsideredfor
is achieved by evaluating dif(spost) = dif(spost(cid:48)). State projec- faultpiossotlationonceabehavioraldifferenceisattributedtoafault.
tion functions are simply differencing functions where D = S.
In addition to the reachability projections defined in the previous
3. INSTANTIATINGTHEFRAMEWORK
sub-section,projectionsonunitmethodreturnvalues,calledreturn
differencing, and on fields of the unit instance, this, called in- Inthissectionwedescribethearchitectureandsomeimplemen-
stancedifferencing,areusefulsincetheycorrespondtotechniques tationdetailsofastate-basedinstantiationoftheframework.(Sec-
usedwidelyinhand-builtunittests. tion5discussesexistingcarvingandreplayimplementationswhich
Acentralissueindifferentialtestingisthedegreetowhichdif- areaction-based).
ferencing functions are able to detect changes that correspond to
3.1 SystemArchitecture
faultswhilemaskingimplementationchanges. Werefertothisas
thesensitivityofadifferencingfunction. Clearly,comparingcom- Figure4showsthearchitectureoftheCRtools,withtheshaded
pletepost-stateswillbehighly-sensitive,detectingbothfaultsand rectangles being the primary components. The carving activity
implementation changes. A projection function that only records startswiththeCarver classwhichtakesfourinputs: theprogram
thereturnvalueofthemethodundertestwillbeinsensitivetoim- name,thetargetmethodmwithintheprogram,thesystemtestcase
plementationchangeswhilepreservingsomefault-sensitivity.Note st inputs,andoptionstoboundthecarvingprocess.
x
alsothatthesedifferencingfunctionsprovidedifferentincomplete Carver utilizes a custom class loader CustomLoader (that uti-
views on program state. Their incompleteness reduces cost and lizesBCEL[13])toincorporateintotheprogram:asingletonCon-
providesameasureofimplementationchangeinsensitivity,butitis textFactoryclassconfiguredtostorepreandpoststates,andinvo-
problematicsinceitmayreducetheirfaultdetectioneffectiveness. cationsoftheContextFactoryattheentryandexit(s)ofm. Then,
Weaddressthisbyallowingformultipledifferencingfunctions every execution of m will cause two invocations of ContextFac-
tobeappliedinCRtestingwhichhasthepotentialtoincreasefault- tory: onetostores andonetostores . ContextFactoryuti-
pre post
sensitivity,withoutnecessarilyincreasingimplementationchange- lizestheContextBoundingclasstoassistwiththedeterminationof
sensitivity. Forexample,usingapairofreturnandinstancediffer- whatpartofthestateshouldbestoredwhentestcasereductionis
encingfunctionsallowsonetodetectfaultsinbothinstancefield utilized.Bydefault,ContextBoundingperformsthemostconserva-
updatesandmethodresults,butwillnotexposedifferencesrelated tiveprojection:aninterfacereachabilityprojection(asdescribedin
todeeperstructuralchangesintheheap. Faultisolationefficiency Section2.3).Morerestrictiveprojectionscanbeperformedthrough
couldalsobeenhancedbytheavailabilityofmultipledifferencing theBoundingAnalysisclass;wehaveimplementedtwosuchprojec-
functions, since each could focus on a specific property or set of tionsanddescribetheminthenextsection.Finally,theopensource
programstatecomponentsthatwhichwillhelpdevelopersrestrict packageXStream,describedinmoredetailinthenextsection,per-
theirattentiononapotentiallysmallportionofprogramstatethat formsstateserializationandtemporarystorage.Finally,thedatais
mayreflectthefault. compressedwiththeoff-theshelfcompressionutilitybzip.
Thereisanotherdifferencingdimensionthatcanimprovefault TheReplaycomponentsharesmanyoftheclasseswithCarver.
isolation.ItconsistsofgeneralizingthedefinitionofDUTstocap- AsinCarver,Replayinstrumentstheclassofthetargetunit,inthis
tureasequenceofpost-states,(spre,σpost),thatcaptureinterme- case m(cid:48), and utilizes the ContextFactory, but only to store spost.
diatepointsduringtheexecutionofthemethodundertest.Figure3 TheContextLoaderclassobtainsandloadss ,usingXStreamto
pre
illustratesascenarioinwhichageneralizedDUTbeginsexecution unmarshall the stored program state, and then invokes the target
ofmatspre.Conceptually,duringreplayasequenceofpost-states unitforexecution.
projection
Filter
Program ContextFactory
ContextFactory m’ ContextBounding
m ContextBounding Dut m’:Spost XStream
XStream m:Spre,Spost Dut
ContextLoader
post
XStream
pre/post
CustomLoader Spre m:Spre CustomLoader
bcel Spost bcel
st
x m
m BounSdidineg e fAfencatlysis m Spost m’ m’ BounSdidineg e fAfencatlysis
… …
Dif
m m’
Carver Replay
function
Program Options m st m m’ Options
x
Figure4:CRToolArchitecture.
Two set of scripts, represented with double-side rectangles in Interfacek-boundedreachableprojection.Theinterfacek-bounded
Figure4,areutilizedtoprovidethefilteringanddifferencingmech- reachableprojectiondefinesthesetofpreservedreferencestoin-
anisms. OnceatestsuiteofDUTsisgenerated,testcasefiltering clude only those reachable via reference chains of length k, i.e.,
canbeperformedtoremoveredundanttestcasesbasedonthesame {r | ∃ reachk(p)}. Usingsmallvaluesofkcangreatly
p∈Params
setofprojectionsavailablethroughBoundingAnalysis. Difscripts reducethesizeoftherecordedpre-stateandformanymethodsit
compare two s according to a specified differencing function willhavenoimpactonunit-testrobustness.Forexample,avalueof
post
todeterminewhetherthechangesfrommtom(cid:48)generateabehav- 1wouldsufficeforamethodwhoseonlydereferencesareaccesses
ioraldifference.Currently,differencingfunctionsonreturnvalues, tofieldsofthis.Intheimplementation,whentraversingthepro-
oninstancefields,onfullprogramstate(thedefault)arefullyau- gram using Xstream to store the program state, we keep track of
tomated. TofacilitateexperimentationwithdifferentDiffunctions thelengthofdereferencechainstohalttraversalwhenkisreached.
ourtoolscurrentlystorethefull s , butweplantoimplement Iftheunitaccessesdataalongareferencechainoflengthgreater
post
optionstostoreonlydif(s )whichhasthepotentialtosignifi- than k, then a k-bounded projection will retain insufficient data
post
cantlyreducethecostofcarving,replayanddifferencing. about the pre-state to allow replay. Our implementation dynami-
callydetectsthissituationandissuesaSentinelAccessException
todistinguishreplayfailurefromanapplicationexception.Thisis
achievedbyextendingXstreamwithacustomconverterthatauto-
3.2 InterestingImplementationAspects
maticallytransformsobjectsthatlieatadepthofk+1tocontain
Inthissectionwebrieflydescribethemostinterestingaspectsof anadditionalbooleanfieldthatmarksitasasentinelinstance.The
theimplementation. unitundertestistheninstrumentedtoinsertatestofthisboolean
fieldandraisetheexceptioniftrue.
Limitationsofthe java.io.Serializableinterface. Ourapproach
May-referencereachableprojection. Themay-referencereach-
requirestheabilitytosaveandrestoreobjectdatarepresentingthe
able projection uses a static analysis that calculates a characteri-
programstate.However,theJavajava.io.Serializablein-
zation of the heap instances that may be referenced by a method
terfacelimitsthetypeofobjectsthatcanbeserialized. Forexam-
activation either directly or through method calls. This charac-
ple,Javadesignatesfilehandlerobjectsastransient(non-serializable)
terization is expressed as a set of regular expression of the form:
becauseitreasonablyassumesthatahandler’svalueisunlikelyto
pf ...f (F+)? This captures an access path that is rooted at a
be persistent, and restoring it could enable illegal accesses. The 1 n
parameter p and consists of n dereferences by the named fields
samelimitationsapplytootherobjects, suchasdatabaseconnec-
f . If the analysis calculates that the method may reference an
tions and network streams. In addition, the Java serialization in- i
object through a dereference chain of length greater than n, the
terfacemayimposeadditionalconstraintsonserialization. Forex-
optional final term is included to capture objects that are reach-
ample,itmaynotserializeclasses,methods,orfieldsdeclaredas
ablefromtheendofthechainthroughdereferenceoffieldsinthe
privateorfinalinordertoavoidpotentialsecuritythreats.
setF. Letreach (r) = {r(cid:48) | ∃ v (r,f) = r(cid:48)}capture
Fortunately, we are not the first to face these challenges. We F f∈F field
reachabilityrestrictedtoasetoffieldsF;reach denotesreacha-
foundmultipleserializationlibrariesthatoffermoreadvancedand f
bilityforthesingletonsetf. Foraregularexpressionoftheform
flexibleserializationcapabilitieswithvariousdegreesofcustomiza-
pf ...f , where m ≤ n, we construct the set: reach (p)∪
tion. We ended up choosing the XStream library [41] because it 1 m f1
... ∪ reach (...(reach (p))), since we want to capture all
comesbundledwithmanyconvertersfornon-serializabletypesand fm f1
referencestouchedalongthepath. Iftheregularexpressionends
adefaultconverterthatusesreflectiontoautomaticallycaptureall
with the term F+ then we union an additional term of the form
objectfields,itserializestoXMLwhichismorecompactandeasier
reach+(reach (...(reach (p)))).Thisprojectioncansignif-
toreadthannativeJavaserialization,andithasbuilt-inmechanisms F fm f1
icantlyreducethesizeofcarvedpre-stateswhileretainingarbitrar-
totraverseandmanagethestorageoftheheapwhichwasessential
ilylargeheapstructuresthatarerelevanttothemethodundertest.
inimplementingthefollowingprojections.
Weimplementedak-boundedaccesspathbasedmay-reference C-selection-k. SimilarinconcepttoS-selection,thistechnique
analysisthatusedtheflow-insensitivecontext-sensitiveequivalence- executesallDUTs,carvedwithak-boundedreachableprojection,
class based read-write analysis implemented in Indus [30]. This thatexercisemethodsthatwerechangedinP(cid:48). Withinthistech-
analysispartitionsparameterandvariablenamesintoequivalence niqueweexploredepthboundinglevelsof1,2,5,and∞(unlim-
classes. Thetwodistinctfeaturesoftheanalysisare: 1)foreach iteddepthwhichcorrespondstotheinterfacereachableprojection.)
equivalence class, an abstract heap structure based on the names
C-selection-mayref. SimilartoC-selection-kexceptthatitcarves
involvedinread/writeaccessismaintained,and2)distinctequiv-
DUTsutilizingamay-referencereachableprojection.
alenceclassesaremaintainedforeachmethodscopeexceptinthe
case of static fields and variable names occurring in methods in- 4.2 Measures
volved in recursive call chains. We generate regular expressions Regressiontestselectiontechniquesachievesavingsbyreducing
thatcapturethesetofallpossiblereferencedaccesspathsuptoa the number of test cases that need to be executed on P(cid:48), thereby
givenfixedlength,k,withadefaultofk=2.Whentraversingthe reducing the effort required to retest P(cid:48). We conjecture that CR
programusingXstream,wesimultaneouslykeeptrackofallregular techniquesachieveadditionalsavingsbyfocusingonunitsofP(cid:48).
expressionsandmarkonlythoseobjectsthatlieonadefinedaccess Toevaluatetheseeffects, wemeasurethetimetoexecuteandthe
pathforstorageinXML.Thisanalysisisalsocapableofdetecting timetochecktheoutputsofthetestcasesintheoriginaltestsuite,
whenamethodisside-effectfreeandinsuchcasesthestorageof the selected test suite, and the carved selected test suites. For a
post-statesisskippedsincemethodreturnvaluescompletelydefine carvedtestsuitewealsomeasurethetimeandspacetocarvethe
theeffectofsuchmethod. originalDUTtestsuite.
Onepotentialcostofregressiontestselectionisthecostofmiss-
ingfaultsthatwouldhavebeenexposedbythesystemtestsprior
4. EXPERIMENT
totestselection. Similarly, DUTsmaymissfaultsduetotheuse
Thegoaloftheexperimentistoassessexecutionefficiency,fault ofprojectionsaimedatimprovingcarvingefficiency.Wewillmea-
detectioneffectiveness,androbustnessoftheDUTs. Wewillper- sure fault detection effectiveness by computing the percentage of
formsuchassessmentthroughthecomparisonofsystemtestsand faultsfoundbyeachtestsuite. Wewillalsoqualifyourfindings
theircorrespondingcarvedunittestcasesinthecontextofregres- byanalyzinginstanceswheretheoutcomesofacarvedtestcaseis
siontesting.Withinthiscontext,weareinterestedinthefollowing differentfromitscorrespondingsystemtestcase.
researchquestions: To evaluate the robustness of the carved test cases in the pres-
ence of program changes, we are interested in considering three
RQ1: Cancarvingtechniquessaveregressiontestexecutioncosts?
potentialoutcomesofreplayingact onunitm(cid:48): 1)faultisde-
Wewouldliketocomparethecostofreusingcarvedunittest xm
tected, ct causes m(cid:48) to reveal a behavioral differences due to
cases versus the costs of utilizing regression test selection xm
afault;2)falsedifferenceisdetected,ct causesm(cid:48) toreveala
techniquesthatworkonsystemtestcases. xm
behavioral change from m to m(cid:48) that is not a fault (not captured
RQ2: What is the fault detection effectiveness of the carved test bystx); andtestisunexecutable, ctxm isill-formedwithrespect
cases? Thisisimportantbecausesavingtestingcostswhile tom(cid:48).Testsmaybeill-formedforavarietyofreasons,e.g.,object
reducingfaultdetectionisrarelyanenticingtrade-off. protocol changes, internal structure of object changes, invariants
change, and we refer to the degree to which a test set becomes
RQ3: Howrobustarethecarvedtestsinthepresenceofsoftware ill-formedunderachangeitssensitivitytochange. Weassessro-
evolution? We would like to assess the reusability of the bustnessbycomputingthepercentageofcarvedtestsandprogram
carvedunittestcasesunderarealevolvingsystem,andex- unitsfallingintoeachoneoftheoutcomes.Sincetherobustnessof
amine how different types of change can affect the carved atestcasedependsonthechange, wequalifyrobustnessbyana-
testssensitivity. lyzingtherelationshipbetweenthetypeofchangeandthelifespan
andsensitivityoftheDUT.
4.1 TestingTechniques
4.3 Artifact
LetP beaprogram,letP(cid:48)beamodifiedversionofP,andletT
TheartifactwewillusetoperformthisexperimentisSiena[9].
beatestsuitedevelopedinitiallyforP.Regressiontestingseeksto
Siena is an event notification middleware implemented in Java.
testP(cid:48). Tofacilitateregressiontesting,testengineersmayre-use
This artifact is available for download in the Subject Infrastruc-
T totheextentpossible. Inthisstudyweconsideredfourtypesof
tureRepository(SIR)[15,32]. SIRprovidesSiena’ssourcecode,
testregressiontechniques,twothatworkwithsystemtests(S)and
asystemleveltestsuitewith567testcases,multipleversionscor-
twothatworkedwithcarvedtests(C):
respondingtoproductreleases, andasetofseededfaultsineach
version(theauthorswerenotinvolvedinthislatestactivity).
S-retest-All. WhenP ismodified,creatingP(cid:48),wesimplyreuse
For this experiment we consider Siena’s core components (not
all non-obsolete test cases in T to test P(cid:48); this is known as the
onapplicationincludedinthepackagethatisbuiltwiththosecom-
retest-all technique [23] and it has been said to represent current
ponents). We utilize the five versions of Siena that have seeded
industrialpractices[25].
faults that did not generate compilation errors (faults that gener-
atedcompilationerrorscannotbetested)andthatwereexposedby
S-selection. The retest all technique can be expensive: rerun- atleastonesystemtestcase(faultsthatwerenotfoundbysystem
ningalltestcasesmayrequireanunacceptableamountoftimeor tests would not affect our assessment). For brevity, we summa-
humaneffort. Regressiontestselectiontechniques[5,10,24,34] rizethemostrelevantinformationtoourexperimentinTable1and
use information about P, P(cid:48), and T to select a subset of T, T(cid:48), pointthereadertoSIR[32]toobtainmoredetailsabouttheprocess
with which to test P(cid:48). We utilize the modified entity technique employedtopreparetheSienaartifactforexperimentation. Table
[10],whichselectstestcasesthatexercisemethods,inP,that(1) 1providesthenumberofmethods,methodschangedbetweenver-
havebeendeletedorchangedinproducingP(cid:48),or(2)usevariables sions and covered by the system test suite, system tests covering
orstructuresthathavebeendeletedorchangedinproducingP(cid:48). thechangedmethods,andfaultsincludedineachversion.
Version Methods Changed-covered Testsexecuting Faults Carving Metric Reduction
methods changedmethods C-select-k C-select
v0 109 - - - 1 2 5 ∞ mayref
v1 100 2 494 3 Plain Minutes 118 118 117 119 194
v5 111 2 494 1 MB 820 14K 26K 26K 15K
v6 111 2 8 1 Compressed Minutes 315 513 717 725 619
v7 107 10 550 2 MB 9 13 15 15 14
Lossless Minutes 258 299 335 382 378
Table1:Siena’scomponentsattributes. Filtering MB 61 1.1K 2.8K 2.9K 1.7K
Lossless Minutes 270 332 399 447 428
Compressed MB 3 4 14 11 11
4.4 ExperimentalSetupandDesign
Table2:CarvingtimesandsizestogenerateinitialDUTsuite.
Theoverallexperimentalprocessconsistedofthefollowingsteps.
First,wepreparedthetestsuitesgeneratedbyS-retest-all,S-selection, For Siena, constraining the carving depth does not seem to af-
C-selection-k*,andC-selection-mayref fortheirautomaticexecu- fect the carving time (the variation we observed is small enough
tion. Thepreparationofthesystemleveltestsuiteswastrivialbe- tobeattributabletonoise)butitdoesimpactthestoragerequire-
causetheywerealreadyavailableintherepository.Thepreparation ments for keeping (s ,s ). The similar replay times are at-
pre post
ofthecarvedselectionsuites,requiredforustoruntheCRtoolto tributabletothefactthatthecarvingtoolfirsttraversesthewhole
carvealltheinitialsuiteofDUTsfromthesystemtestswhenexe- heap,anditthentruncatesthecontenttothedesireddepthleverag-
cutedinv0. ingtheXStreamfacility.Alsonotethatfordepthsgreaterthantwo
Second,weruneachtestsuiteonthefault-freeversionsofSiena thedifferencesinstoragespaceareminimalduetotherather“shal-
to generate an oracle for each version. In the case of the system low”natureofthesubject(dereferencechainswithlengthgreater
testsuite,theoracleconsistedinthesetofoutputsgeneratedbythe than2arerareinSiena).Themay-referenceprojectionrequiresan
program. Forthecarvedtests,theoracleconsistedofthemethod additional61%ofanalysistime,butprovidesaspacereductionof
returnvalueandtherelevantspost (weexploreseveralalternative 43%relativeto∞.Asweshallsee,thisspacereductionisvaluable
projectionstodefinetherelevantstate). inthatittranslatesintomoreefficientreplay.
Third,weruneachtestsuiteoneachfaultyinstanceofeachver- In the second row of Table 2 we see that simply compressing
sion (some versions contained multiple faults) and recorded their thestatedataincreasedthecarvingtimelinearlyasafunctionthe
executiontime. Wedealtwitheachfaultinstanceindividuallyto carved state size (and it will add uncompression time as well for
controlforpotentialmaskingeffectsamongfaultsthatmightneg- the DUTs selected for replaying), but it consistently provided a
atively affect the fault detection performance of the system tests. three-orders of magnitude reduction in the space required by the
Note that, due to their more concentrated focus, unit tests would DUTs,offeringaveryinterestingtradeoff. Filteringthetestsuite
notbeassusceptibletosuchfaultinteractions. byremovingcarvedDUTswiththethesamepre-state(wecallthis
Fourth,toassessfaultdetectioneffectiveness,foreachtestsuite, losslessfilteringinTable2)alsoprovidedsavingsofoneorderof
wecomparedtheoutcomeofeachtestcasebetweenthefault-free magnitude in the required storage but with a reduced impact on
version(oracle)andthefaultyinstancesofeachversion. Tocom- carvingtime. Thelastrowpresentsahybridapproachthatapplies
parethesystemtestoutcomesbetweencorrectandfaultversions, filteringandthencompressesthedata. Thislastcarvingtechnique
weusedpre-defineddifferencingfunctionsthatarepartofourim- isfasterthantheoneperformingjustcompressionandstillprovides
plementation which ignore “non-deterministic” output data (e.g, asimilardegreeofstoragesavings.
dates, times, random numbers). For the unit tests, we performed Itisveryimportanttonotethatthecarvingnumbersreportedin
asimilardifferencing,butappliedtothetargetmethodreturnval- Table2correspondtotheinitialcarvingofthecompleteDUTsuite
uesandspost. Whentheoutcomeofatestcasedifferedbetween –DUTscarvedforeachoftheover100methodsinSienafromeach
thefault-freeandthefaultyversion,afaultisfound. oftheover550systemteststhatmayexecuteeachmethod–and
Last, we compared the measures across the test suites gener- canbeperformedautomaticallywithoutthetester’sparticipation.
ated by S-retest-all, S-selection, C-selection-k*, and C-selection- Duringtheevolutionofthesystem,DUTswillbereplayedrepeat-
mayref. We then repeated the same steps to collect data for the edlyamortizingtheinitialcarvingcosts, andonlyasubsetofthe
sametechniqueswhenutilizingtestcasefilteringandcompression. DUTswillneedtoberecarved. Recarvingwillbenecessarywhen
Theresultsemergingfromthiscomparisonarepresentedinthenext isdeterminedthatchangesintheprogrammayaffectaDUT’srele-
section.AlltheseactivitieswereperformedonanOpteron250pro- vantpre-state. Webelievethatexistingimpactanalysistechniques
cessor,with4GBofRAM,runningLinux-Fedora,andJava1.5. [26]couldbeused,forexample,todeterminewhatDUTsmustbe
recarvedwhentheaunitischanged,andweplantointegratethose
4.5 Results intoourinfrastructureinthefuture.
We now proceed to analyze the replay efficiency. Replay effi-
In this section we provide the results addressing each research
ciencyisparticularlyimportantsinceitislikelythatacarvedDUT
questionregardingcarvingandreplayingefficiency,faultdetection
willberepeatedlyreplayedasthetargetunitevolves.Figure5sum-
effectiveness,androbustnessandsensitivityoftheDUTssuites.
marizesthereplayexecutiontimesforsomeofthetechniqueswe
RQ1: Efficiency. Wefirstfocusontheefficiencyofthecarving consider. Duetospaceconstraintsweonlypresentresultsforthe
process. Althoughourinfrastructureautomatescarving, thispro- testsuitesgeneratedbythreeplaincarvingtechniques(weignore
cessdoesconsumetimeandstoragesoitisimportanttoassessits C-selection-k1 because although it is the fastest it cannot replay
efficiencyasitmightimpactitsadoptionandscalability. Table2 manyDUTs,andweignoreC-selection-∞becauseitsresultsare
summarizes the time (in minutes) and the size (in MB) that took almostidenticaltoC-selection-k5). Wealsopresentplotsfortwo
tocarveandstorethecompleteinitialsuiteofDUTsutilizingthe techniqueswithfilteringapplied,andonewithfilteringanduncom-
differenttechniques. Eachrowcontainsavariationofthecarving pression. Eachobservationcorrespondstothereplaytimeofeach
(plain,withcompression,withDUTsfiltering,orhybrid),andeach selected test suite under each version and an overall average (the
columncontainsatestcasereductiontechnique. linesjoiningobservationsarejustmeanttoassistintheinterpreta-
140
tion). ThetestsuiteresultingfromtheS-retest-alltechniquecon-
sistentlyaverages135minutesperversion. Thetestsuitesresult-
ingfromS-selectforeachversionaverages91minutesperversion,
120
withsavingsoverS-retest-allrangingfromfromaminimumof4
minutesinv6maximumof132minutesinv7. (Factorsthataffect
theefficiencyofthistechniquearenotwithinthescopeofthispa-
100
perbutcanbefoundat[17]). Onaverage, S-select takes67%of
thetimerequiredbyS-retest-all.
The test suites selected by the C-selection-k* techniques show
80
verysimilartendencies.Onaverage,theC-selection-k*techniques
s
e
replay execution time was 19 minutes, and they took less than a ut
n
minutetoreplayv6andupto82minutesforC-selection-k5tore- Mi
60
playv2. Onaverage, thesesuitestakes13%ofthetimerequired
by S-retest-all, and 19% of the time required by S-select. The
test suites selected by C-selection-mayref takes 16% of the time
40
requiredbyS-retest-all, 19%ofthetimerequiredbyS-selection-
mayref,and76%ofthetimerequiredbyC-selection-k5.Applying
filtering to the carved suites generated by C-selection-k5 reduced
20
testexecutiontimebymorethanathird(e.g.,from84to19DUTs
inv2).Last,andasexpected,weobservethathandlingcompressed
filesaddsconsiderablereplayingtime.
0
Wealsomeasuredthediffingtimerequiredbyalltechniques.For
v1 v5 v6 v7 Average
thesystemtestsuitesthediffingtimeswereconsistentlylessthan
Versions
aminute,fortheC-selection-k*suitesitaveraged12minutes,and
C-select-k2 C-select-k5
fortheC-selection-mayref averaged6minutes.Whenfilteringwas
C-select-mayref Filtered-C-select-k5
employed, diffing time for the C-selection-k* techniques was re-
Filtered-C-select-mayref Filtered-Uncomp-C-select-k5
ducedbyanaverageof54%. Overall,althoughthediffingactivity
S-retest-all S-Select
isimportanttotheperformanceofthecarvedsuites,implementing
simple incremental differencing functions could dramatically im- Figure5:Executiontimes.
provetheirdiffingperformance. Forexample, wecurrentlycom-
replayingthetestsuitecarvedutilizingC-selection-k1didnotde-
parealltheprogrampost-state,butwecouldinsteadfirstcompare
tectallthebehavioraldifferencesexhibitedbytheselectedsystem
thereturnvaluestoseeifitrevealsanydifferences,andifitdoes
test cases (1 out of the 8 system tests exposed a behavioral dif-
not,thencomparetherestofthepost-state. Thissimpletechnique
wouldsufficetoreducev5diffingtimeby96%. ferencethatwasnotexposedbyanyofitscorrespondingDUTs).
ThisreductioninFFwasduetothedepth-1projectionwhichdid
RQ2: Faultdetectioneffectiveness. Thetestsuitesdirectlygen-
notcaptureenoughpre-statetodetectabehavioraldifference. The
eratedbyS-selection, C-selection-k*, andC-selection-mayref de-
othercarvedsuites,however,diddetectthisfault.
tected as many faults as the S-retest-all technique. As expected,
Inv5,andindependentlyofthecarvedtestsuiteused,3outof
theapplicationofourlosslessfiltertoreducethenumberofDUTs
300failingsystemtestsdidnothaveanycorrespondingDUTonthe
didnotresultinanylossinfaultdetectionpower. Thisindicates
changedmethodsfailing(99%).Weobservedthesamesituationin
that a DUT test suite may be as effective as a system test suite
v7 : f2 where 18 out of 203 DUTs (9%) did not expose behav-
at detecting faults, even when using aggressive projections. It is
ioraldifferenceseventhoughthecorrespondingsystemtestsfailed.
worthnoting,however,thatwhencomputingfaultdetectioneffec-
WhenweanalyzedthereasonsforthisreductioninFFwediscov-
tivenessoverawholetestDUTsuitewedonotaccountforthefact
eredthatinbothcasesthetooldidnotcarveinv0thepre-statefor
that,forsomesystemtests,theircorrespondingcarvedDUTsmay
oneofthechangedmethods. Thetooldidnotcarveanypre-state
havelostorgainedfaultdetectioneffectiveness.Weconjecturethat
forthosemethodsbecausethesystemtestcasedidnotreachthem.
thisisalikelysituationwithoursubjectbecausemanyofthefaults
Changesinthecodestructure(e.g.,additionofamethodcall,han-
aredetectedbymultiplesystemtests. Toaddressthissituationwe
dlingofanexception),however,madethesystemtestcasesreach
performaneffectivenessanalysisattheindividualtestcaselevel.
those changed methods (and expose a fault) in later versions. In
Foreachcarvingtechniquewecompute: 1)PP,thepercentage
both circumstances, improved DUTs that would have resulted in
ofpassingselectedsystemtests(selectedutilizingS-Selection)that
100%FFcouldhavebeengeneratedbyre-carvingthetestcasesin
have all corresponding carved unit test cases passing, and 2) FF:
later versions (carve from v to replay in v ). More generally,
i i+1
thepercentageoffailingsystemteststhathaveatleastonecorre-
theseobservationspointoutagainfortheneedtoestablishmecha-
spondingfailingcarvedunittestcase. Table3presentsthePPand
nismstodetectchangesinthecodethatshouldtriggerre-carving.
FFvaluesforallthetechniquesunderallversioninstances.
V7:f1alsopresentsaninterestingbutoppositefinding.Inthis
WhenusingthetestsuiteresultingfromC-selection-k1wefind
instanceonly24%ofthepassingsystemtestshadalltheirassoci-
thatmostDUTsgoingthroughthechangedmethodscannotbefully
atedDUT’spassing. Thesedifferencesmaybecausedbyfaults(a
executed,whichleadsinmostcasestoaverysmallnumberofsys-
faultthatdidnotpropagatetotheoutputsothesystemtestdidnot
tem test cases for which all the DUTs passed as well. v7 is the
detectit)orjustbythecodechanges(changesintheunitsgenerated
exceptiontothisPPtendency,wherecarvingwithadepthofone changesinthepost-states)1.Weexplorethisissuefurthernext.
wasasgoodasthemaximumdepthbecausethechangedmethod
onlyoperatedontheclass’sscalarfields. Notethatevenwhenus- 1SinceweonlyconsideredSienaversionswithfaultsexposedby
ingthisrestrictivereduction,theFFvaluesareonaverage97%.In system tests, we are not exploring possible situations in which
the cases where FF is not 100% such as in v6, we observed that faultsareexposedjustbyDUTs.
C-selection-k C-selection rathersmallerandmorefocusedunittests. Anotherimportantdis-
1 5 ∞ mayref
tinction is that the testing target are not the semantic differences
PP FF PP FF PP FF PP FF
betweenversions,butrathermethodsintheprogram.
v1:f1 1 100 100 100 100 100 100 100
v1:f2 1 100 100 100 100 100 100 100 Thepreliminaryresultsfromouroriginaltestcarvingprototype
v1:f3 1 100 100 100 100 100 100 100 [16], later expanded by Reddy [22], evidenced the potential of
v5 1 99 100 99 100 99 100 99 carvedteststoimprovetheefficiencyandthefocusofalargesys-
v6 1 88 100 100 100 100 100 100 temtestsuite, identifiedchallengestoscale-uptheapproach, and
v7:f1 24 100 24 100 24 100 24 100
defined some scenarios under which the carved test cases would
v7:f2 100 91 100 91 100 91 100 91
andwouldnotperformwell. Wehavebuiltonthatworkbypre-
Table3:FaultDetectionEffectiveness. sentingagenericframeworkfordifferentialcarving,extendingthe
typeofanalysisweperformedtomaketheapproachmorescalable,
RQ3: Robustnessandsensitivity. Wepreviouslyexaminedhow andbydevelopingafullsetoftoolsthatcanenableustoexperi-
DUTs obtained through C-selection-k1 are quite fragile in terms mentwithdifferenttechniquesonvariousprograms.
of their executability, and how certain code changes may make a Weareawareoftwootherresearcheffortsrelatedtothenotion
methodreachanewpartoftheheapthatwasnotoriginallycarved. oftestcarving. First,Orsoetal. introducedthenotionofselective
Acomplementarywaytoevaluatetherobustnessandsensitivityof record and replay mechanisms [27], which was later prototyped
DUTs is to compare their performance in the presence of meth- [28]andusedtoreplayunitsinisolation. Second,thetestfactor-
odsthatchanged,andinthepresenceofmethodsthatchangedand ingapproachintroducedbySaffetal. takesasimilarapproachto
areindeedfaulty. Weperformedsuchdetailedcomparisononthe Orso’swiththecreationofwhattheycalledmockobjectsthatserve
filteredsuitesforC-selection-∞andC-selection-mayref,andnow to create the scaffolding to support the execution of the test unit
brieflydiscussthreedistinctinstancesofthescenarioswefound. [36]. Thesamegroupintroducedatoolsetforfully-featuredJava
Inbothfaultyinstancesofv7,theversionwiththemostmeth- executionenvironmentsthatcanhandlemanyofthesubtleinterac-
odschanged(10),noneofthebehavioraldifferenceswerefoundby tionspresentinthisprogramminglanguage(e.g.,callbacks,arrays,
methodsotherthanthefaultyones. Thisisclearlyanidealsitua- native methods) [35]. In terms of our framework, both of these
tion. V1 : f1representsperhapsamorecommoncasewerenone approacheswouldbeconsideredaction-basedCRapproaches. We
oftheDUTsgoingthroughnon-faultychangedmethodsfailed,but havepresented,whatistothebestofourknowledge,thefirststate-
only 78% of the DUTs traversing faulty methods actually failed. basedapproachtoCRtesting.
Yet a different perspective is offered by v5. Only two methods Saffetal.describetheirapproachindetailallowingustoprovide
changedinthisversion,andonethemisinvokedexclusivelybythe a more in depth comparison with our approach. While carving a
other. Thefaultislocatedinthecallee. Thecallermethodisexer- methodtestcase,theirinfra-structurerecordsthesequenceofcalls
cisedby6354DUTsoutofwhich736detectbehavioraldifferences that can influence the method and then they record the sequence
(12%).The(faulty)calleemethodisexercisedby26173DUTsout of calls made by the method and the return values and unit state
of which 928 detect behavioral differences (less than 4%). This side-effects of those calls. In our framework, this would amount
lastscenario,inwhichcarvingstillgeneratesmorebehavioraldif- to calculating σ¯ such that s(σ¯) = s for the method of inter-
pre
ferencesforthefaultymethodthanforchangeone, isinteresting estandthencalculatingsummarizingtracesσ¯ thatreflectthe
calli
because it shows that even for correct changes the number of af- return value and side effects for each call out of the method and
fectedDUTsmaybelarge. carving s , the relevant pre-state for each call. During replay
prei
It is worth noting that the differencing functions offer an op- thesamesequenceofcallswiththesameparametersisexpected-
portunity to control this problem. For example, a more relaxed anydeviationresultsinareportofadifferenceduringreplay. In
differencingmechanismfocusedonjustreturnvalueswouldhave ourframework,wewouldidentifythepointsatwhichthe,n,calls
detectedallthefaultsinv5andv6,whilereducingthenumberof outofthemethodoccuraspost-statelocationstodefineaDUTof
falsedifferencessignificantlysincebothfaultsmanifestthemselves theform(σ¯,(s ,...,s )).
pre1 pren
inthereturnvalue.Suchadifferencingfunction,however,leadsto Bothoftheseaction-basedapproaches,capturetheinteractions
areducedfaultdetectioninthecaseofv1. Mechanismstoselect betweenthetargetunitanditscontextandthenbuildthescaffold-
andappropriatelycombinethesedifferencingfunctionswillbeim- ing to replay just those interactions. Hence, they do not incur in
portantfortherobustnessandsensitivityofDUTs. costsassociatedwithcapturingandstoringthesystemstateforeach
targetedunit. Ontheotherhand,thisapproachmaygeneratetests
5. RELATEDWORK thataresensitivetochangesthatdonoteffectmeaning,e.g.,chang-
ingtheorderofindependentmethodcalls. Saffetal. haveidenti-
OurworkwasinspiredbyWeide’snotionofmodulartestingas
fiedthisissueandproposetoanalyzethelifespanofthefactored
ameanstoevaluatethemodularreasoningpropertyofapieceof
testcasesacrosssequencesofmethodmodifications[35]. Thisis
software[38]. AlthoughWeide’sfocuswasnotontestingbuton
acriticalfactorinjudgingthecost-effectivenessofCRtestingand
theevaluationofthefragilityofmodularreasoning,heraisedsome
wehavestudiedthisissueinSection4.5.
importantquestionsregardingthepotentialapplicabilityofwhathe
These two related efforts have shown their feasibility in terms
calleda“modularregressiontechnique”thatledtoourwork.
ofbeingabletoreplaytestsandthelatterapproachhasprovided
Within the context of regression testing, our approach is also
initial evidence that it can save time and resources under several
similar to Binkley’s semantic guided regression testing in that it
scenarios. Neitherapproach,however,hasbeenevaluatedinterms
aimstoreducetestingcostsbyrunningasubsetoftheprogram[5,
ofitsfaultdetectioneffectivenesswhichultimatelydeterminesthe
6]. Binkley’stechniqueproposestheutilizationofstaticslicingto
valueofthecarvedtests,orinthecontextofregressiontesting.
identifypotentialsemanticdifferencesbetweentwoversionsofa
Our work also relates to efforts aimed at developing unit test
program.Healsopresentsanalgorithmtoidentifythesystemtests
cases. Several frameworks grouped under the umbrella of Xunit
that must be run on the slices resulting from the differences be-
have been developed to support software engineers in the devel-
tweentheprogramversions. Thefundamentaldistinctionbetween
opment of unit tests. Junit, for example, is a popular framework
thisandourapproachisthatwedonotrunsystemleveltests,but