Cloud-Based Distributed Mutation Analysis∗

Robert Merkel, Faculty of Information Technology, Monash University, Clayton, Victoria, Australia. [email protected]
James Georgeson, Atlassian, Sydney, NSW, Australia. [email protected]

∗ Corresponding author.

ABSTRACT

Mutation Testing is a fault-based software testing technique which is too computationally expensive for industrial use. Cloud-based distributed computing clusters, taking advantage of the MapReduce programming paradigm, represent a method by which the long running time can be reduced. In this paper, we describe an architecture, and a prototype implementation, of such a cloud-based distributed mutation testing system. To evaluate the system, we compared the performance of the prototype, with various cluster sizes, to an existing "state-of-the-art" non-distributed tool, PiT. We also analysed different approaches to work distribution, to determine how to most efficiently divide the mutation analysis task. Our tool outperformed PiT, and analysis of the results showed opportunities for substantial further performance improvement.

CCS Concepts
• Software and its engineering → Empirical software validation;

Keywords
mutation analysis; program mutation; cloud computing; distributed computing; testing

1. INTRODUCTION

Mutation Testing, also known as Mutation Analysis or Program Mutation, is a fault-based software testing technique that can be used to assess the comprehensiveness of a program's test suite. Many mutant versions of a program are generated, each of which differs from the original by the inclusion of some small fault, such as the replacement of the statement y=x+2 with y=x-2. The original program's test suite is then executed against each mutant. If the inclusion of the mutant's fault causes some test case to fail, the mutant is said to be killed. Test suite quality is then measured by calculating the proportion of mutants killed by the test suite, the mutation score (or mutation adequacy score).

Contemporary software development practices have increased the importance of automated assessments of test suite quality. The practices of continuous integration [6] and continuous delivery [18] mean that test suites are used, and modified, many times throughout a working day. Furthermore, after testing, software systems are often deployed to production use on multiple occasions per day. Delivering reliable software in these circumstances requires fast, automated, and accurate assessments of test suite quality.

Despite considerable evidence that mutation analysis scores are a more accurate estimate of a test suite's ability to detect software faults than conventional coverage metrics [14, 3, 21], it is yet to achieve significant industrial adoption. Jia and Harman [19] suggest that the slowness of mutation analysis is a major reason for this. Generating many mutant versions of a program and executing each of them against a test suite is a computationally expensive process: in one study [3], an approximately 6000-line C program produced over 10000 mutants, each of which must be executed against large parts of the software's test suite.

In other sectors of information technology, cloud computing has allowed businesses to utilise technology in ways not previously possible. Services such as Amazon's EC2 [2] provide users with elastic access to computational resources as needed, without requiring a significant upfront hardware investment [22]. A mutation testing system designed for cloud architectures could provide software developers with results in acceptable timeframes, making it viable even for large industrial projects.

In this paper, we describe an architecture for a cloud-based, distributed mutation analysis tool, and present a prototype tool based on this architecture.
Our tool makes use of Apache Spark [28], a framework for distributed cloud computing, and the MapReduce programming paradigm, to distribute subtasks containing part of the mutation analysis task across Amazon EC2 [2] computing nodes. We empirically analyse two different work distribution strategies for our tool. We compare the two strategies' scalability and overall performance on three open source software projects. Furthermore, we compare both strategies with a state-of-the-art non-distributed tool, PiT [10]. We analyse the runtime behaviour of the system to identify opportunities for more accurate division of mutation analysis tasks. In our discussion, based on our results, we present concepts for a more efficient work distribution model, and identify another opportunity for optimization of mutation analysis within a continuous integration workflow: incrementally updating mutation scores as the software and test suite evolve.

2. PRELIMINARIES

2.1 Mutation Analysis Preliminaries

Acree et al. [1] define a process for using mutation analysis to assess a test suite's ability to detect actual faults in a program:

1. A set of mutant programs, M, is generated by applying mutation operators to a program P. Each mutant m ∈ M differs from P by the inclusion of some small fault, such as the replacement of a > operator with a ≥ operator, or a logical AND with a logical OR. This is intended to change the behaviour of P while maintaining a syntactically correct program m. The mutation (i.e. the injected fault) is defined by the mutation operator which, when applied to P, produces a subset of M. Examples of mutation operators in PiT, a modern mutation testing system for Java, are shown in Table 1.

2. The original program P is executed against a set of test cases T. If P does not pass all test cases t ∈ T, then the faults injected during mutation analysis cannot be differentiated from legitimate faults in the program and hence the analysis cannot continue.

3. Each mutant m ∈ M is executed against each test case t ∈ T. If any test case t fails when executed against m, the mutant m is said to have been killed; otherwise m is said to be alive and a gap in T has been identified. It should be noted that a mutation will occasionally produce no observable change in the output of a program on any input. These equivalent mutants are a well-documented problem in mutation testing [7, 19], and are not the subject of the present work.

4. If the number of killed mutants is k, and the number of live (non-equivalent) mutants is l, the mutation adequacy score (alternatively, mutation score), S_T, of the test suite T is then defined as:

   S_T = k / l

   Some mutation analysis systems use an alternative definition for the mutation score: k / (k + l), the ratio of killed mutants to the total number of testable mutants created. In either case, a higher mutation adequacy score indicates a greater ability of a test suite to detect faults in a program.

Operator       Program Statement
               Before Mutation    After Mutation
INVERT NEGS    -i                 i
MATH           a + b              a - b
RETURN VALS    return true        return false

Table 1: Examples of Mutation Operators in the PiT Mutation Testing System [10].
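To make the process concrete, the sketch below shows a hypothetical subject class, the mutant that a MATH-style operator (Table 1) would produce from it, and a JUnit test that kills that mutant. The class and test names are our own illustrations; they are not taken from the paper or from PiT.

```java
// Adder.java -- hypothetical subject class (illustrative only).
public class Adder {
    public int addTwo(int x) {
        return x + 2;   // original statement
        // MATH mutant: the + is replaced with -, giving "return x - 2;"
    }
}

// AdderTest.java -- a JUnit 4 test that kills the MATH mutant above:
// it passes against the original program but fails against the mutant,
// so the mutant counts towards k, the number of killed mutants.
import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class AdderTest {
    @Test
    public void addTwoAddsTwoToItsArgument() {
        assertEquals(7, new Adder().addTwo(5)); // the mutant returns 3 here and fails
    }
}
```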
2.2 MapReduce Programming Model

MapReduce is a programming model that is well suited to processing large datasets on distributed systems [15] (that is, parallelized computation where the compute nodes do not share access to the same physical main memory [17]). MapReduce implementations provide two functions, map and reduce. The map function takes as input a collection of subsets of some large dataset to process and computes a collection of partial results. The reduce function aggregates each partial result into a final result. MapReduce programs are naturally parallelisable: each invocation of the map function can be executed on any available processing node. The reduce function then provides a simple way to compute a final result from a collection of partial results without requiring coordination between each map task. MapReduce allows the complexities of dynamic task scheduling and node failure to be handled transparently by the implementation [15].

Hadoop [27] is an open-source framework that provides reliable distributed computing constructs, including MapReduce functionality [15]. Hadoop allows for efficient processing of extremely large datasets on distributed systems formed from commodity hardware. Spark [28] is a newer framework for cluster computing that also supports MapReduce but adds facilities for distributed shared memory.

When designing MapReduce programs, it is important to consider the granularity of each map task, that is, how much of the original input data is partitioned to each subtask. If the size of each partition is too small and the number of subtasks too large, the system is likely to incur overhead in scheduling and network communication. Conversely, if the partitions are large and the number of subtasks is small, then imbalances in subtask processing time will result in delays while a minority of nodes complete large subtasks.
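The sketch below illustrates the map/reduce pattern described above as it applies to mutation analysis: each partition of mutant identifiers is mapped to a partial kill count, and the partial counts are reduced to a single total. It is a sequential emulation with names of our own choosing, not code from any framework or from the prototype.

```java
import java.util.List;

// Illustrative only: a sequential emulation of the map/reduce pattern for
// mutation analysis. The map function turns one partition of mutant
// identifiers into a partial result (here, a kill count); the reduce function
// aggregates the partial results into the final total.
public class MapReduceSketch {

    // map: one partition of mutants -> one partial kill count
    static int analysePartition(List<String> mutantIds) {
        int killed = 0;
        for (String id : mutantIds) {
            if (someTestFailsAgainst(id)) {  // mutant detected by the test suite
                killed++;
            }
        }
        return killed;
    }

    // The whole job: map every partition, then reduce by summing.
    // Each call to analysePartition could run on a different worker node.
    static int totalKilled(List<List<String>> partitions) {
        return partitions.stream()
                .map(MapReduceSketch::analysePartition) // map phase (parallelisable)
                .reduce(0, Integer::sum);               // reduce phase
    }

    // Stand-in for executing the test suite against one mutant.
    static boolean someTestFailsAgainst(String mutantId) {
        return mutantId.hashCode() % 2 == 0; // placeholder logic
    }
}
```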
3. A MODEL FOR DISTRIBUTED MUTATION TESTING

3.1 Introduction

In this section, we present a simplified model for mutation testing using the MapReduce abstraction. We describe a basic system architecture for this purpose, define key terms used in the remainder of this paper, and attempt to justify the assumptions we have made in simplifying this problem.

3.2 System Architecture

We assume a simple distributed system where a single master node coordinates work across some cluster of worker nodes. Symmetric communication channels exist between the master node and each worker node, but no communication is required between worker nodes. Communication between worker nodes is not required for mutation analysis, and its absence simplifies the computation required to coordinate large parallel jobs. An annotated diagram of this architecture is presented in Figure 1, and key terms are defined below:

Master Node  One node in the cluster that coordinates and partitions work amongst the available worker nodes. This is the machine that a user will interact with to deploy jobs to the cluster.

Worker Node  One of many processing nodes in the cluster that execute tasks as assigned by the master node.

Parameters  The inputs to a mutation testing task: a set of classes, tests, and mutation operators.

Partition  A subset of the parameters supplied to a worker node, that is, a partial input from which to compute a partial result.

Distribution Strategy  The method by which parameters are partitioned amongst available worker nodes. Some components may be mapped to worker nodes one by one, and others may be broadcast to all worker nodes. For example, one possible distribution strategy would broadcast all tests and mutation operators amongst all available worker nodes, then map individual classes to each partition.

Subtask  A partial mutation analysis task that is executed on a worker node. A subtask takes a partition as input, and computes a partial result.

Phase  A discrete stage or component inside a mutation analysis task or subtask (e.g. generate mutants, or execute mutants).

Partial Result  The output of a partial mutation analysis task executed by a worker node. This will contain the status of the mutants generated by the mutation analysis subtask.

Combined Result  The aggregation of partial results from all worker nodes. This is the final result of the mutation analysis.

Figure 1: Key definitions in our distributed mutation testing model.

3.3 Parallelisable Components

There are virtually no data dependencies or synchronisation constraints within each stage. Mutants can be generated from different mutation operators and different classes concurrently and, once generated, executed and evaluated independently. This immediately offers two opportunities for parallelization:

1. Subsets of the set of all mutants M can be generated concurrently by partitioning available classes and mutation operators amongst available processing nodes;

2. The set of all mutants M can be partitioned amongst available processing nodes and executed concurrently.

Hybrid approaches are also possible, in which a subset of M is both generated and executed on a processing node from some partition of classes and mutation operators.
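To make the terminology concrete, the following sketch shows one way the partition and result terms above could be represented as Java types. The types and names are ours, for illustration only; they are not the prototype's actual data model.

```java
import java.util.List;
import java.util.Set;

// Illustrative data model for the terms defined in §3.2 (names are ours).
// A Partition is the partial input handed to one worker node; a PartialResult
// is what the worker returns; the CombinedResult is the reduced final output.
record Partition(Set<String> classes, Set<String> tests, Set<String> mutationOperators) {}

record PartialResult(int mutantsKilled, int mutantsAlive, long executionTimeMillis) {}

record CombinedResult(int mutantsKilled, int mutantsAlive) {
    // Aggregate the partial results from all worker nodes.
    static CombinedResult of(List<PartialResult> partials) {
        int killed = partials.stream().mapToInt(PartialResult::mutantsKilled).sum();
        int alive  = partials.stream().mapToInt(PartialResult::mutantsAlive).sum();
        return new CombinedResult(killed, alive);
    }
}
```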
3.4 Parameters

A mutation analysis task operates on a program P, where P is a collection of classes and tests, and is defined by the following parameters:

Classes C  The subset of classes in P from which mutants should be generated. Typically all classes in P will be included, but specific classes may be excluded in some cases (e.g. to reduce long running times by analysing only a specific subsystem in P).

Tests T  The subset of test cases in P to use during the execution of mutants, i.e. the test cases to run against the mutants generated from P. As is the case with C, we may wish to manually exclude irrelevant test cases to prevent long running times.

Mutation Operators O  The set of mutation operators to be applied to P when generating mutants. In techniques such as Selective Mutation [24], only the most useful mutation operators are applied to a program. We may also wish to restrict these for a specific domain (e.g. mutation operators targeting arithmetic operators are perhaps more applicable in arithmetic-heavy programs than in others).

The output of a mutation analysis task is a set of results R. They contain the mutation adequacy score for the program P, and may contain other information such as live mutants and timing data.

3.5 Phases

There are two main phases to complete during the mutation analysis of a program P, as shown with inputs and outputs in Figure 2:

Mutant Generation  The set of mutation operators O is applied to the set of classes C to produce a set of mutants M; followed by

Mutant Execution  The set of test cases T is executed against each mutant m ∈ M. The outputs of this phase (i.e. which mutants were detected by a failing test case and which remain alive) comprise the set of results R.

Figure 2: An idealized model of the mutation testing process.

3.6 Assumptions

Both the generation and execution phases operate on the entire program P. Although it is theoretically possible to partition a program by class and make only the components in the execution path of tests covering that partition available to each processing node, we assume that all of P is available to each individual processing node at all times. All classes c ∈ C and tests t ∈ T then simply denote an identifier for some component of P (i.e. a class name), not the raw data comprising that class or test. Therefore, the entire program P must be transferred to each worker node.

The overhead incurred is unlikely to be practically important. An example of a very large, monolithic software package is the Chrome web browser. On the Ubuntu operating system, its binary packages take up around 50MB, which will take around one second to transfer across a local network of the capacity typically used by cloud computing providers. The largest program used in our experiments, described in §5, is JodaTime, a Java program with almost 300 combined classes and test classes. The jar file used to archive and transfer this program between processing nodes is smaller than 1.5MB. Compared to the tens of minutes required for mutation analysis of JodaTime even when using a 16-core cluster, even a one second delay is insignificant.

Figure 3: Distribution Strategies in our prototype. (a) Parallelise by Mutation Operator: all classes and tests are broadcast to every worker node, and mutation operators are mapped out individually. (b) Parallelise by Class: all mutation operators and tests are broadcast to every worker node, and classes are mapped out individually.
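The two strategies in Figure 3 differ only in which parameter is mapped out, one element per subtask, and which parameters are broadcast to every worker. A minimal sketch, reusing the illustrative Partition type from the earlier sketch; the class and method names are ours, not the prototype's.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch of the two distribution strategies shown in Figure 3.
// Each returned Partition corresponds to one subtask sent to a worker node.
class DistributionStrategies {

    // (a) Parallelise by Mutation Operator: classes C and tests T are broadcast
    // to every worker; each partition carries a single mutation operator.
    static List<Partition> byMutationOperator(Set<String> classes, Set<String> tests,
                                              Set<String> operators) {
        return operators.stream()
                .map(op -> new Partition(classes, tests, Set.of(op)))
                .collect(Collectors.toList());
    }

    // (b) Parallelise by Class: operators O and tests T are broadcast to every
    // worker; each partition carries a single class to mutate.
    static List<Partition> byClass(Set<String> classes, Set<String> tests,
                                   Set<String> operators) {
        return classes.stream()
                .map(c -> new Partition(Set.of(c), tests, operators))
                .collect(Collectors.toList());
    }
}
```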
4. A PROTOTYPE DISTRIBUTED MUTATION TESTING SYSTEM

4.1 Introduction

This section describes an implementation of the abstract distributed mutation testing system described in §3. This system is subsequently used to evaluate the suitability of distributed computing in mutation analysis in §5.

4.2 Tools and Frameworks

Our system is built largely on existing tools and frameworks. While this imposed a number of limitations, it also allowed us to build and evaluate a working prototype quickly, to inform later, more sophisticated implementations.

Java was chosen as the implementation and target language for the prototype. This was due to the industrial relevance of Java and the availability of actively developed mutation analysis and distributed computing tools which could be utilized for prototype construction.

After some preliminary investigation of Hadoop [27], Spark [28] was chosen as the distributed computing platform for our prototype. This was mainly due to the ease with which local clusters can be set up in Spark for testing and debugging, and the support for creating clusters on Amazon's EC2 service; Spark's fast distributed shared memory was not actually required for our application. Note that Spark treats multiple processor cores on the same system as, in effect, independent nodes. Therefore, in our experiments, where we used twin-core Amazon compute nodes, we describe the size of our clusters in "cores".

PiT [10] was chosen as the mutation engine for our prototype. It is actively developed, open-source, written in Java, already runs in parallel on a single machine, and claims to represent the "state-of-the-art" in mutation testing [10], a claim we consider reasonable after brief experimentation with alternatives. Its existing support for local parallel mutation testing provided us with some confidence in its ability to support distributed parallel processing, and its self-proclaimed status as "state-of-the-art" provides a useful performance baseline for evaluating our results. In retrospect, some aspects of PiT proved less than ideal for our purposes, as we will discuss in §4.5.

4.3 Architecture

We attempted to realise the model described in §3 as faithfully as possible. The prototype is written in Java and is invoked as a command-line utility that allows users to specify:

A Java Program  comprised of a set of classes and tests to use as input to the mutation analysis;

A Set of Mutation Operators  to apply during mutant generation. Since our system uses PiT as the mutation testing component, we have the same set of available mutation operators as PiT (seven are used by default [12]); and

A Distribution Strategy  that defines how the required work is partitioned to worker nodes in the cluster. Available distribution strategies in our prototype are visualised in Figure 3.

The mutation testing process is then distributed across a cluster hosted on Amazon's EC2 service [2]. In our experiments, we used m1.large nodes with two processor cores and 7.5GB of RAM (allocated equally to the two cores by Spark). The single processing node experiments were performed by deploying the prototype to a single node in serial local debug mode. All other cluster sizes were achieved by initialising Spark with varying numbers of worker nodes. Note that while Spark avoids resending shared data to each individual core on the node, for scheduling purposes Spark treats each processor core allocated to it as an independent processing node. Because of these factors, in §5 we describe cluster sizes by the number of processor cores.
The output of the program is a JSON dictionary containing a collection of partial results for each subtask, as well as an aggregated result representing the combination of all partial results. These partial results contain timing data and other diagnostic information which are not necessary for an industrial tool, but allowed us to conduct the experiments described in §5.

We have modified PiT to function as a programmable component that can be initialised with mutation testing parameters (i.e. classes, tests, and mutation operators) dynamically, and executed to produce a JSON object containing the results of the analysis, along with other metrics such as the durations of the various phases and subject class sizes. We execute subtasks on Spark workers as map tasks. Essentially, we:

1. Provide the prototype with the input program as a jar file (containing both classes and tests);

2. Use Spark's broadcasting facilities to share the input program jar file with the worker nodes in the cluster (Spark employs efficient broadcast algorithms and protocols specifically for this purpose [5]);

3. Use a combination of Spark broadcast and map tasks to send some partition of classes, tests, and mutation operators to each processing node (defined by the choice of distribution strategy);

4. Execute a mutation analysis subtask on each Spark worker node to produce a partial result;

5. Combine each partial result at the Spark master via a simple reduce function; then

6. Write the partial and combined results to an output file for further analysis.

Our implementation architecture is summarised in Figure 4.

Figure 4: High-level architectural overview of our distributed mutation testing prototype.
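The following is an illustrative Spark sketch of steps 1 to 6 for the Parallelise by Class strategy. The helper names (listClasses, runPitSubtask, mergeJson) stand in for the prototype's PiT integration, which is not shown; only the Spark facilities used (broadcast, parallelize, map, reduce) reflect the framework calls described above.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

// Illustrative sketch only; not the prototype's actual source code.
public class DistributedMutationJob {

    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf()
                .setAppName("distributed-mutation-analysis")
                .setIfMissing("spark.master", "local[2]"); // local fallback for debugging
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {

            // Steps 1-2: read the input program jar and broadcast it to every worker.
            byte[] jarBytes = Files.readAllBytes(Paths.get(args[0]));
            Broadcast<byte[]> programJar = sc.broadcast(jarBytes);

            // Step 3: one partition per class; tests and operators are effectively
            // broadcast because every worker receives the whole program jar.
            List<String> classesToMutate = listClasses(jarBytes);
            JavaRDD<String> partialResults = sc
                    .parallelize(classesToMutate, classesToMutate.size())
                    // Step 4: each worker runs a PiT subtask for its class and
                    // returns a JSON partial result.
                    .map(className -> runPitSubtask(programJar.value(), className));

            // Step 5: combine partial results at the master with a simple reduce.
            String combined = partialResults.reduce(DistributedMutationJob::mergeJson);

            // Step 6: write the combined result to an output file.
            Files.writeString(Paths.get("results.json"), combined);
        }
    }

    // Stand-ins for the prototype's internals.
    static List<String> listClasses(byte[] jar) { return List.of(); }
    static String runPitSubtask(byte[] jar, String className) { return "{}"; }
    static String mergeJson(String a, String b) { return a; }
}
```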
4.4 Additional Phases

PiT's actual mutation testing process is slightly more complicated than the process we described in Figure 2. It introduces two additional phases not depicted in our abstract model:

Scan Classpath  The supplied class and test names are located on the classpath; that is, the actual data making up these parameters is resolved from the identifiers provided by the user; and

Dependency Analysis  The Dependency Distance between classes under analysis and other classes in the program is computed. This is defined as follows: if Class A calls a method from Class B, then Classes A and B have a dependency distance of 1. If Class B calls a method from Class C, then Classes A and C have a dependency distance of 2.

PiT provides an option to ignore tests beyond a given dependency distance from a target class. This is intended as a strategy for reducing runtime by ignoring potentially irrelevant tests. We have configured PiT to ignore this feature for our experiments, but the dependency analysis is still executed regardless. This introduces some overhead in our subtasks, but the modifications required to remove the overhead could not have been completed in the time available for the project.

4.5 Limitations

4.5.1 Mutant Generation

We use PiT effectively as a black box. We provide it with input parameters for a mutation analysis task (or subtask), and we process the output (i.e. a collection of mutants that were killed, and those that are still alive). This makes it difficult to exploit parallelism that may exist inside PiT's mutation testing process. For example, we are unable to generate mutants serially at the master node and dispatch them to processing nodes individually: all mutants are generated and executed within the same subtask. Correcting this may have allowed for more advanced distribution strategies, but would have required significant programming effort.

4.5.2 Constant Overheads

PiT implements some optimisations intended to improve runtime performance in a non-distributed environment. These include a coverage analysis phase, in which the test suite is executed without mutation to determine unreachable sections of code, and a dependency analysis phase, discussed in §4.4. As discussed, we did not have the time to modify PiT to disable these steps, and so they are repeated for every subtask at every processing node. We take this into consideration when evaluating our system.

4.5.3 Economic Constraints

Our cluster was hosted on Amazon's EC2 service. This is especially appropriate considering that it is our goal to eventually provide mutation testing as a service in the cloud. However, we were unable to run our experiments on Amazon's free tier, and all usage costs were paid by the authors. This restricted the size of the cluster we were able to use for our experiments, and restricted the number of repetitions of experiments that could be completed to confirm that measurements did not vary significantly between trials.

5. EMPIRICAL EVALUATION

We conducted empirical evaluations of our tool to answer two main questions:

1. Can a distributed approach perform mutation analysis faster than a non-distributed approach and, if so, how efficiently can it be scaled?

2. Can we accurately predict the amount of work required to perform some fraction of the mutation analysis task, so that work can be divided up between computational nodes to maximise utilization and thus efficiency?

Research question 1 directly addresses the viability of our general approach to distributed mutation analysis. Research question 2 addresses how an optimized tool might be constructed to achieve the best performance and lowest resource utilization. We conducted three experiments, measuring different aspects of the performance of the prototype system:

Experiment 1  measured the execution times of the prototype system using different work distribution strategies and cluster sizes. This was primarily intended to address research question 1.

Experiment 2  investigated the relationship between the size of individual classes within the software under test, the number of mutants created within those classes, and the time taken to process the created mutants. This experiment addresses research question 2.

Experiment 3  examined the number of mutants generated by applying different mutation operators to the software under test, to determine whether the proportions were consistent across projects. This experiment addresses research question 2.

5.1 Experimental Subjects

To conduct our three experiments, we required experimental subjects: representative software systems for input to our experiments (i.e. to mutate and analyse).
We required experimental subjects that were implemented in Java (as PiT only supports Java) and that used the JUnit unit testing framework [16]. We also preferred open source projects, to enable easier replication of our results by future researchers. For our study, we chose three open source libraries as input to our prototype, whose source repositories are available from the online project repository site GitHub [13]:

Gson [26]  A popular JSON parsing library;

Joda-Time [9]  A de facto standard replacement library for Java's Date and Time utilities; and

Commons Math [4]  A collection of mathematical utilities. Commons Math is a very large project, and serial mutation testing benchmarks for the entire project take hours to complete on our machines. In the interest of keeping our experiment runtimes reasonable, only the geometry package of the Commons Math project (and its associated tests) was used.

The sizes of these projects are shown in Table 2.

Project        Mutatable Classes   Executable Test Classes   Lines of Code   Test Cases   Mutants Generated
gson           44                  78                        2567            943          1585
joda-time      161                 133                       11378           12480        8037
commons-math   71                  47                        4679            547          3745

Table 2: The sizes of each project used as input to our experiments.

5.2 Method

5.2.1 Experiment 1

Using our prototype, we perform mutation analysis of:

• each program listed in §5.1; using
• each distribution strategy described in Figure 3; with
• 1, 2, 4, 8, 12, and 16 processor cores available. These configurations are summarised in Table 3.

Cores   Spark/EC2 Configuration
1       Serial Spark deployed to master node
2       Distributed Spark with one worker node
4       Distributed Spark with two worker nodes
8       Distributed Spark with four worker nodes
12      Distributed Spark with six worker nodes
16      Distributed Spark with eight worker nodes

Table 3: Spark and EC2 configurations used for each cluster size used in our experiments.

To compare the performance with a non-distributed (but locally parallelised) implementation, we also used an unmodified version of PiT, configured to take advantage of both available cores, to perform mutation analysis on a single node. In all experiments, we apply the set of default mutation operators as defined by PiT [12]. Therefore, the independent variables for Experiment 1 are:

• the software under test;
• the distribution strategy;
• whether our modified distributed tool or standard PiT was used to perform mutation analysis; and
• the number of processor cores allocated.

The dependent variable was the clock time (measured in milliseconds using the system clock on the master node) required to complete mutation analysis of the software under test. For the 16-node distributed case, and the standard PiT case, we repeated measurements three times; the variation between experimental runs was insignificant. See the appendix for full details. In what follows, D_P is the duration of the analysis when deployed to P processing nodes, and D_1 is the duration of the analysis when executed on one processing node.

5.2.2 Experiment 2

We performed mutation analysis using the prototype tool, using individual classes as partitions. We recorded the number of mutatable lines of code in each partition. This statistic is provided by PiT, and excludes method headers, whitespace, and lines consisting entirely of non-mutatable constructs such as braces. We also recorded the execution time, provided by PiT, for the mutants relating to that class, subdivided into four phases:

Classpath Scan  Determine the location of all Java classes in the artifact and the test suite necessary to run the tests;

Dependency Analysis  Determine the shortest invocation path between all pairs of classes in the system. As noted in §4.5.2, this was unused overhead, but was not feasible to disable in the time available;

Mutant Generation  Generate mutants by applying mutation operators to the input program; and

Mutant Execution  Execute each mutant against the test suite.

5.2.3 Experiment 3

We performed mutation analysis using the prototype tool, partitioning by mutation operator. We recorded the number of mutants created in the software under test using each default mutation operator available in PiT, as well as the time required to complete the partial mutation analysis for that mutation operator.
6. RESULTS

6.1 Experiment 1

Figure 5 shows the clock time taken to complete mutation analysis, on each of the three experimental artifacts, using the two distribution strategies for cluster sizes of between 1 and 16 cores. The time taken for an unmodified version of PiT, running two threads on a single twin-core node, to complete mutation analysis of each of the experimental artifacts is indicated by the horizontal line of the corresponding shade.

Figure 5: Durations of distribution strategies as cluster size increases.

The figure shows that the Parallelise by Mutation Operator strategy (in which each node performed all mutant generation and execution for one mutation operator) performed better on smaller clusters, but as there are only seven default mutation operators, adding additional nodes to the cluster beyond this did not reduce the time taken. By contrast, the Parallelise by Class strategy (in which partitions consisted of all mutants for a single class) was far slower on small clusters, but execution time continued to reduce as the cluster was expanded up to sixteen cores. With 16 cores available, the clock time to execute the Parallelise by Class strategy was lower than for Parallelise by Mutation Operator.

With a sufficiently large cluster, both strategies were able to complete mutation analysis faster than non-distributed PiT, though the speedup was far from the theoretical maximum. With 16 cores, the Parallelise by Class strategy was able to complete mutation analysis 40% faster than non-distributed PiT for Gson, 42% faster for Joda-Time, and 50% faster for Commons Math.

6.2 Experiment 2

Figure 6 shows the distribution of class sizes for each project. We found less variance than expected. Of the 276 classes analysed, the majority are between approximately 75 and 150 mutatable lines of code. The relationship between the size of the class and the number of mutants generated is shown in Figure 7. Across all projects, there is a strong linear relationship between class size and mutants generated.

Figure 6: Distribution of class sizes for each project.
Figure 7: Mutants generated by class size.

Figure 8 shows the duration of time spent in each phase of the mutation analysis process using our prototype tool when using the Parallelise by Class strategy. Classpath scanning, dependency analysis, and mutant generation were essentially unaffected by the size of the class being processed. This is unsurprising for classpath scanning and dependency analysis, as they depend on the properties of the entire project, not just the specific class. Mutant generation, by contrast, is fast enough compared to the other phases that its computational cost can be ignored.

Figure 8: Durations by mutatable lines of code.

There is a moderately strong linear relationship (r² = 0.498) between class size and the time taken to complete mutant execution for all mutants relating to that class, as shown in Figure 9.

Figure 9: Mutant execution time by class size.

6.3 Experiment 3

Figure 10 shows the number of mutants generated by applying each of the mutation operator classes to our experimental subject programs. There is no clear pattern to the proportion of mutants across projects.

Figure 10: Mutants generated by operator.

6.4 Threats to Validity

We have reasonable confidence in the internal validity of the results. It is possible, but unlikely, that our prototype tool was not performing mutation analysis as described in the paper due to bugs. Comparisons between our modified tool and unmodified PiT suggest that the tool produced equal quantities of mutants when applied to the same Java source files. Faults in the (new) code used to distribute test cases across nodes are a more likely source of invalid data. To mitigate this threat, an extensive unit test suite was devised for all the new code in the prototype tool, which caught a number of bugs during development.

It is possible that timing data collected from within PiT or by the Apache Spark cluster software may have had inaccuracies.
These inaccuracies are unlikely to be substantial given the relatively long durations measured in our experiments. Variations in network load may also have caused timing variations; the lack of variation we observed in our (limited) repeated observations suggests that this effect is likely to be relatively minor.

There are a number of limits to the external validity of our results. The tool which we used for our experiments was a proof-of-concept prototype, and its performance is unlikely to be representative of a properly engineered production system. However, given the nature of the limitations of the prototype (notably, the amount of unnecessary work performed and the inefficient partitioning schemes), any more advanced system is likely to be faster in absolute terms and to exhibit better scalability rather than worse.

Our tool was restricted to software written in Java and tested using the JUnit testing framework. The overheads from the different parts of the mutation analysis process are highly dependent on the details of how mutants are inserted and on the overheads of the testing framework; as such, optimal parallelization strategies may vary for programs implemented with a different toolchain.

Finally, time limitations meant that we were only able to evaluate our tool using three experimental subjects, all of which were open source utility libraries of one kind or another. With regard to Experiment 3, where significant differences in the proportion of mutants created by applying the various mutation operators to the three experimental subjects were observed, we lack enough data to say whether these differences represent rare outliers or wide variation.

7. DISCUSSION

Our results for Experiment 1 demonstrate that the answer to research question 1 is in the affirmative: a distributed, cloud-based approach to mutation analysis can outperform a non-distributed system. Our proof-of-concept prototype, despite known inefficiencies, was able to significantly outperform the non-distributed tool it was based on. Parallelising by class, while the less efficient of the two approaches tried, scaled acceptably well to 16 CPU cores. Our result indicates that distributed computing frameworks are a viable method for speeding up mutation analysis.

On research question 2, we observed some connection between per-module metrics and the time taken to perform mutation analysis on that module. Class size has a linear relationship with the time taken to conduct mutation analysis on that class; however, that linear relationship is not as strong as one might hope. By contrast, the proportion of the work required by different mutation operators varied substantially from project to project, and does not appear to be a useful basis for dividing up work efficiently.

One potential approach, given the limited predictive power of these metrics, is to use a fine-grained partitioning strategy that produces many more subtasks than nodes, as Spark will automatically allocate partitions to idle nodes. Unfortunately, it seems that the overhead for task startup is high, as shown by the relative performance of the two distribution strategies. The larger partitions from the Parallelise by Mutation Operator strategy were more efficient on small clusters than the Parallelise by Class strategy, though the strategy could not scale as well. Some of this is due to the unnecessary overheads in our prototype distributed mutation engine. However, it seems that task startup in Apache Spark is inherently slow, suggesting that the number of partitions should be minimised.

Moreover, our results indicate that an alternative work distribution scheme may be feasible. Mutant generation is extremely fast compared to mutant execution. As such, generating mutants at the master node, and creating partitions that each contain an equal number of mutants, means that an appropriate number of work packets can be created to match the cluster size. The subtasks are likely to require similar amounts of time to execute, thus making very efficient use of the worker nodes. Mutants can be very compactly represented (in principle, as a byte offset and a mutation operator taking no more than a few bytes), and as network bandwidth does not appear to be the limiting factor in performance, sending lists of mutants to the worker nodes is not likely to slow the computation down. Such an approach will also make it easier to estimate the computation required to conduct mutation analysis, allowing the user to make an informed choice as to whether the time and financial costs of running the analysis are worthwhile.
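A minimal sketch of this proposed scheme, with illustrative names of our own: mutants generated once at the master node are dealt into equally sized work packets, one per worker. The MutantId record is a compact stand-in (class name, byte offset, operator); it is not a PiT type.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the alternative distribution scheme discussed above.
record MutantId(String className, int byteOffset, String operator) {}

class EqualSizedPartitioner {

    // Split the full mutant list into `workers` packets whose sizes differ by
    // at most one mutant, so subtask durations should be roughly balanced.
    static List<List<MutantId>> partition(List<MutantId> allMutants, int workers) {
        List<List<MutantId>> packets = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            packets.add(new ArrayList<>());
        }
        for (int i = 0; i < allMutants.size(); i++) {
            packets.get(i % workers).add(allMutants.get(i)); // round-robin deal
        }
        return packets;
    }
}
```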
While parallelization provides a way to overcome one barrier to applying mutation analysis, its slowness, to some extent it may trade the cost of excessive time for the financial cost of renting sufficient cloud infrastructure. However, as noted in §8, most of the existing methodologies for reducing the overhead of mutation analysis can easily be combined with our methods. Furthermore, the context in which test quality often needs to be assessed, an evolving test suite run regularly against an evolving codebase, offers largely unexplored scope to further reduce the computation required. To our knowledge, no peer-reviewed academic work has examined optimizations for the repeated application of mutation analysis to an evolving codebase and test suite. Coles [11] has suggested several such optimizations for Incremental Analysis, some of which have already been implemented in PiT. However, no evidence of the effects of these optimizations on the performance and accuracy of the analysis has been reported (several of the proposed optimizations make possibly unverified assumptions about the behaviour of modified software). We believe that the potential efficiency gains from incremental analysis, including both the optimizations already proposed by Coles and others not yet identified, may be very substantial. For instance, it seems plausible that a previously killing test will continue to kill the corresponding mutant the next time mutation analysis is performed; therefore, for previously killed mutants, in most cases only a single test will need to be executed to verify that the mutant is still killed, instead of executing a significant fraction of all the covering tests. We plan empirical studies to examine the effect of such optimizations.

Furthermore, in the context of continuous integration, the users of mutation analysis may well be most interested in mutation analysis relating to the parts of the codebase they have changed most recently, rather than the entire codebase. Therefore, it may prove unnecessary to conduct a complete mutation analysis on each continuous integration step; a "first pass" examining mutation coverage for classes that have changed (and possibly their direct dependencies) may be useful, and is likely to be much quicker than mutation analysis of the entire codebase. Evaluating the usefulness of this approach is largely a human factors question, and will require putting a working tool in the hands of developers.

8. RELATED WORK

Literature on parallelization of mutation analysis dates back to the early 1990s.
Offutt et al. [23] presented a parallel implementation of Mothra on an MIMD system with 16 processing cores. They observed that, as for most systems the number of mutants generated far exceeds the number of test cases, parallelizing by mutant is likely to be a more effective approach. After accounting for the overhead in scheduling and program compilation, they reported an "almost linear" speed-up. Similarly promising results were also reported by Choi [8]. Modern mutation analysis tools, including PiT [10], often support local parallelization. Source code inspection suggests that PiT parallelizes by mutant.

The only distributed mutation analysis research the authors were able to find was conducted in 1993 by Zapf [29]. Zapf's MedusaMothra system distributed mutation analysis tasks across a local network of Unix workstations, and sometimes achieved near-linear speedups. While this far-sighted work demonstrated the potential of distributed mutation analysis, the technical context has changed a great deal since. MedusaMothra partitioned subtasks as finely as is possible: a subtask consisted of a single test case and mutant. Such an approach is likely to be extremely inefficient in a modern context. Modern unit test suites are written to execute quickly, and individual tests often execute in less than a millisecond. If Zapf's partitioning approach were tried today, the worker nodes would be likely to spend the vast majority of their time idle due to network latency.

There is a very large research literature on improving the efficiency of serial mutation analysis. Jia and Harman's survey paper [19] defines two broad categories of approaches: mutant reduction, which reduces the number of mutants which must be executed to calculate a mutation score, and execution reduction, which reduces the cost of executing a set of mutants. By this definition, both local and distributed parallelism are execution reduction techniques. Virtually all work in mutant reduction, and much of the serial execution reduction literature, is applicable in a distributed context.

While there is a rich history of research in mutation analysis, relatively few mutation analysis tools are actively maintained. PiT, while actively maintained, is less than ideal as a basis for academic research. Major [20] is an alternative mutation analysis system which may be more suitable. Major is described by its author as "designed to be highly configurable to support fundamental research in software engineering", and "due to its efficiency and flexibility, the Major mutation framework is suitable for the application of mutation analysis in research and practice". While we did not set out to perform a direct comparison, our results indicate that PiT, even used serially, may still be significantly faster than Major on the same benchmarks. In part, this may be due to the lack of even local parallelization in the current version of Major. Regardless, Major's more modular design and research focus may make it a better basis for developing a more advanced distributed mutation analysis system.

There has been some interest in applying distributed computing frameworks to other testing tasks. Parveen et al. [25] used Apache Hadoop to speed up execution of JUnit unit test suites. They reported a thirty-fold speedup over serial execution using a 150-node computing cluster. Unit testing is a simpler problem than mutation analysis, as the workloads are more predictable.

9. CONCLUSION

We have demonstrated a proof-of-concept tool for performing mutation analysis on distributed cloud-based systems, using Apache Spark and the MapReduce paradigm to distribute the mutation analysis tasks across a cluster of Amazon EC2 computing nodes. Despite known inefficiencies, the prototype was able to outperform the non-distributed tool it was based on, on clusters of 8 nodes. Analysis of the runtime behaviour of our prototype suggests that while there is some correlation between the size of a module (class) within a codebase and the time required to perform analysis of the mutations in that class, the variation remains substantial. The minimal cost of mutant generation suggests that, instead, the most efficient strategy for distributing work between nodes may be to generate all mutants on the "master" node and allocate equal-sized collections of mutants to each worker node.

Distributed computation is one way in which the long-standing performance barrier to industrial adoption of mutation analysis can be overcome, and the present work, embryonic as it is, demonstrates its potential. Incremental analysis, in the context of a continuous integration workflow, may represent another. We hope that combining these two techniques with the large existing body of work on speeding up mutation analysis by other means may finally allow it to gain wide industrial adoption.
10. REFERENCES

[1] A. Acree, T. Budd, R. DeMillo, R. Lipton, and F. Sayward. Mutation Analysis. Technical report, DTIC Document, 1979.

[2] Amazon Web Services. Amazon Elastic Compute Cloud (EC2) - Scalable Cloud Hosting, 2015. Last accessed September 2015. URL: http://aws.amazon.com/ec2/.

[3] J. Andrews, L. Briand, and Y. Labiche. Is Mutation an Appropriate Tool for Testing Experiments? In Proceedings of the 27th International Conference on Software Engineering (ICSE 2005), pages 402–411, May 2005. doi:10.1109/ICSE.2005.1553583.

[4] Apache Foundation. The Apache Commons Mathematics Library. Apache Foundation, 2014.

[5] Apache Foundation. Spark Programming Guide - Spark 1.5.1 Documentation, 2015. URL: http://spark.apache.org/docs/latest/programming-guide.html.

[6] K. Beck. Extreme programming: A humanistic discipline of software development. In Fundamental Approaches to Software Engineering, pages 1–6. Springer, 1998.