Evaluating Heuristics for Planning Effective and Efficient Inspections

Forrest J. Shull, Carolyn B. Seaman, Madeline M. Diep, Raimund L. Feldmann, Sara H. Godfrey, and Myrna Regardie

Abstract - A significant body of knowledge concerning software inspection practice indicates that the value of inspections varies widely both within and across organizations. Inspection effectiveness and efficiency can be measured in numerous ways, and may be affected by a variety of factors such as inspection planning, the type of software, the developing organization, and many others. In the early 1990's, NASA formulated heuristics for inspection planning based on best practices and early NASA inspection data. Over the intervening years, the body of data from NASA inspections has grown. This paper describes a multi-faceted exploratory analysis performed on this data to elicit lessons learned in general about conducting inspections and to recommend improvements to the existing heuristics. The contributions of our results include support for modifying some of the original inspection heuristics (e.g. increasing the recommended page rate), evidence that inspection planners must choose between efficiency and effectiveness, as a good tradeoff between them may not exist, and identification of small subsets of inspections for which new inspection heuristics are needed. Most importantly, this work illustrates the value of collecting rich data on software inspections, and using it to gain insight into, and improve, inspection practice.

Index Terms - D.2.5.a Code inspections and walkthroughs, D.2.8 Metrics/Measurement, D.2.8.c Process metrics, D.2.9.k Project control & modeling

• Forrest Shull, Madeline Diep, Raimund Feldmann, and Myrna Regardie are all with the Fraunhofer USA Center for Experimental Software Engineering, College Park, MD 20740. E-mail: {fshull, mdiep, rfeldmann, mregardie}@fc-md.umd.edu.
• Carolyn Seaman is with the Fraunhofer Center, as well as the University of Maryland Baltimore County, Department of Information Systems, Baltimore, MD 21250. E-mail: [email protected].
• Sara Godfrey is with NASA Goddard Space Flight Center, Greenbelt, MD 20771. E-mail: [email protected].

1 INTRODUCTION

A long history of experience and experimentation has produced a significant body of knowledge concerning the proven effectiveness of software inspections. Data and experience from many years and many types of organizations have shown that a properly conducted inspection can remove between 60% and 90% of the existing defects [1]. It is well established that inspections in the earliest phases of software development yield the most savings by avoiding downstream rework.

However, the value of inspections varies widely both within and across organizations. Inspection effectiveness and efficiency can be measured in numerous ways (defects found, defects slipped to testing, time spent, defects found per unit of effort, etc.), and may be affected by a variety of factors, some related to inspection planning (e.g. number of inspectors) and others related to the software and the developing organization (e.g. application domain).

The work described here is based on an analysis of a large body of data collected from inspections at NASA, the US governmental space agency. NASA heuristics for inspection planning were formulated in the early 1990's based on best practices and data from early NASA inspections. Over the intervening years, the body of data from NASA inspections has grown, and recently the authors of this paper were given the opportunity to analyze it to gain new insights into inspection effectiveness and efficiency.

The original set of heuristics for planning inspections was formulated by Dr. John Kelly at NASA's Jet Propulsion Laboratory (JPL), based on metrics collected across hundreds of inspections [2]. The heuristics focused on parameters known as the moderator's three control metrics, that is, the three parameters over which the inspection planner has direct influence.
Modifying the values of these parameters is the mechanism by which an inspection moderator can affect the outcome of a given inspection. Kelly's research examined data from many inspections at NASA to formulate optimal ranges for these values and to help guide inspection planners. These values were incorporated into JPL's robust formal inspection training, which has been widely disseminated across all the NASA Centers.

The specific heuristics resulting from Kelly's research are:

• Team size, the number of participants involved in the inspection, should be between four and six people, regardless of the type of document being inspected. These values reflect the fact that teams of less than four people are likely to lack important perspectives, while larger teams are more likely to experience dynamics that limit full participation.

• Meeting length should be less than two hours, regardless of the type of document being inspected. If meetings stretch on longer than two hours, members' energy is likely to flag and the results are likely to be less than optimal. It is recommended that meetings end after two hours, with additional meetings scheduled if warranted.

• Page rate, the number of document pages that the inspectors examine per hour of the meeting, will depend on the type of document. Inspections of requirements documents should examine less than 15 pages per hour; design and test documents less than 20 pages per hour; and code documents less than ten pages per hour. These recommendations reflect the fact that giving a team too much material to look through will invariably result in a more superficial inspection.
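To make the heuristics concrete for the analyses that follow, the sketch below shows one way they might be encoded as compliance checks. It is illustrative only: the field names (team_size, meeting_hours, pages, product_type) and the input file are hypothetical, not those of the actual NASA dataset.

```python
# Sketch only: encoding the three planning heuristics as compliance checks.
# All column names are hypothetical, not the NASA dataset's field names.
import pandas as pd

# Recommended upper bounds on pages inspected per meeting hour, by artifact.
PAGE_RATE_LIMITS = {"requirements": 15, "design": 20, "test": 20, "code": 10}

def check_compliance(row: pd.Series) -> pd.Series:
    """Flag whether one inspection record satisfies each heuristic."""
    page_rate = row["pages"] / row["meeting_hours"]
    return pd.Series({
        "team_size_ok": 4 <= row["team_size"] <= 6,
        "meeting_len_ok": row["meeting_hours"] < 2,
        "page_rate_ok": page_rate < PAGE_RATE_LIMITS[row["product_type"]],
    })

inspections = pd.read_csv("inspections.csv")  # hypothetical input file
compliance = inspections.apply(check_compliance, axis=1)
```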
Many things about software development have changed since that time. Languages, design notations, even the scale and type of problems tackled on NASA projects are very different from what they would have been in the early 1990s. Inspections themselves remain an important part of development processes at NASA. For example, software inspections are included in the mandatory NASA Procedural Requirements for Software Engineering (NPR 7150.2), issued by the Office of the Chief Engineer [3]. As a result, one focus of our work has been to examine whether the recommended ranges of parameters still apply to contemporary NASA projects.

Our first step in analyzing the inspection data was to attempt to validate the NASA inspection planning heuristics. This analysis and its results are described in more detail in section 2, but the overall finding confirms that the heuristics continue to be generally effective in most circumstances (i.e. on average, inspections that comply with the heuristics result in more defects found than those inspections that did not comply), but there was evidence that they could be improved upon. In particular, we found four potential weaknesses:

1. The heuristics only apply to maximizing the total number of defects found (i.e. effectiveness), and don't address other potential outcomes of interest (e.g. effort spent), in particular related to efficiency;
2. The heuristics represent a one-size-fits-all approach to inspection planning, with no refinement for different project situations;
3. The heuristics are not as universally applicable as one would hope, i.e. some modification could yield a stronger relationship between compliance and inspection effectiveness and/or efficiency;
4. Compliance with the heuristics seemed to be decreasing, i.e. the heuristics may be out of date in relation to what is realistic in the contemporary NASA development environment.

Based on these initial findings, we conducted a series of exploratory analyses to better understand the effect of inspection planning parameters on inspection effectiveness and efficiency, and to ultimately refine and improve the inspection planning heuristics. In particular, our analysis is guided by the following research questions:

Q1. What effect do inspection parameters have on other inspection outcomes, besides total number of defects found?
Q2. Are there variations in the heuristics that are appropriate for different situations?
Q3. Are there variations in the heuristics that would better ensure that compliance would result in better inspection outcomes (i.e. that would improve the relationship between compliance and effectiveness/efficiency)?

We begin, in section 2, by describing the initial analysis showing the effectiveness of the original heuristics. We then describe the data that our analysis is based on and our methodology in terms of the various exploratory data analyses we performed in order to gain understanding of the relationships between variables (in section 3) and the results of those analyses (in section 4). A discussion of these results is presented in section 5, followed by a discussion of related literature that addresses the factors influencing inspection effectiveness and efficiency in section 6. We summarize our conclusions in section 7.

2 BACKGROUND - INSPECTIONS AT NASA

We obtained a large set of data from a variety of software inspections across NASA. It includes 2,528 inspections of requirements, design, code, test plans, and a small number of other unspecified artifacts. These inspections come from 81 projects across five NASA Centers. These data were self-reported by developers as part of the inspection process and were not reported to management, so we have some confidence that they accurately represent inspection practice and were not manipulated for any purpose.

We describe here our initial analyses, which motivated the investigation described in the rest of the paper. We first divided the data into two sets: an "historical" dataset that covered the period of the early 1990s when the original heuristics were formulated, and a "contemporary" dataset that covered the time period since then (up until 2006). As the dividing line, we used January 1, 1995. This was a rather arbitrary choice but had the advantage of dividing the entire set into two roughly equal parts (1041 contemporary inspections and 1487 historical ones).
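Mechanically, the split is a one-line labeling step. A minimal sketch, assuming a hypothetical parseable "date" field and continuing from the earlier sketch:

```python
# Sketch only: splitting at the (admittedly arbitrary) January 1, 1995 line.
import pandas as pd

inspections["date"] = pd.to_datetime(inspections["date"])  # hypothetical field
inspections["era"] = inspections["date"].map(
    lambda d: "historical" if d < pd.Timestamp("1995-01-01") else "contemporary"
)
print(inspections["era"].value_counts())  # paper: 1487 historical, 1041 contemporary
```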
It should be noted that the relative sparseness of the contemporary set (1041 inspections over 11 years, as compared to 1487 inspections over 6 years) is an artifact of the data collection process (i.e. our dataset cannot be considered to be exhaustive), and is not evidence that the prevalence of inspections at NASA has decreased.

For each inspection in the dataset, we determined if it conformed to the recommended values of each inspection heuristic. For each heuristic, we compared the mean number of defects reported for inspections that complied with that heuristic (in-compliance inspections) to the mean for inspections that did not (out-of-compliance inspections). Since none of our variables were normally distributed, we used a non-parametric statistic, the Mann-Whitney test, to identify significant differences in the means. We summarize these results in Table 1, dividing them according to the dataset (contemporary vs. historical), and observe whether the heuristics might have become more or less effective over time. For each dataset, we report how many inspections actually provided the data required for this analysis and the average number of defects found by inspections meeting that criterion.
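A sketch of this comparison, using SciPy's Mann-Whitney U implementation together with the hypothetical DataFrame and compliance flags from the earlier sketches:

```python
# Sketch only: comparing defect counts between in-compliance and
# out-of-compliance inspections with a non-parametric test.
from scipy.stats import mannwhitneyu

def compare_compliance(df, flag, outcome="defects_found"):
    """Mean outcome for in- vs. out-of-compliance inspections, plus p-value."""
    in_c = df.loc[df[flag], outcome].dropna()
    out_c = df.loc[~df[flag], outcome].dropna()
    stat, p = mannwhitneyu(in_c, out_c, alternative="two-sided")
    return in_c.mean(), out_c.mean(), p

data = inspections.join(compliance)  # from the earlier sketches
for flag in ["team_size_ok", "meeting_len_ok", "page_rate_ok"]:
    mean_in, mean_out, p = compare_compliance(data, flag)
    print(f"{flag}: in-compliance {mean_in:.1f} vs. {mean_out:.1f}, p={p:.4f}")
```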
Table 1. Testing original heuristics on historical and contemporary data.

                 In-compliance inspections   Out-of-compliance inspections   Inspections following
                 # of insp.   Avg. # of      # of insp.   Avg. # of          heuristics significantly
                              defects                     defects            better?

CONTEMPORARY DATASET (1995 and later)
Team size        23           38.7           206          6.5                YES (p<0.0005)
Meeting length   184          3.7            7            27.6               NO
Page rate        23           4.4            134          4.1                NO (p=0.05)

HISTORICAL DATASET (1994 and earlier)
Team size        253          11.7           239          7.3                YES (p<0.0001)
Meeting length   460          8.5            29           22.7               NO
Page rate        115          15.6           355          7.4                YES (p<0.0001)

Table 1 highlights several observations about the dataset. In the contemporary data, it is much less likely that teams followed the suggested heuristics for inspection team size and page rate. For example, out of 229 inspections providing "team size" data, only 23 (10%) fell into the suggested range; and only about 15% of the inspections reporting "page rate" data conformed to the heuristics. In contrast, for the historical dataset, regarding the "team size" data, the number of inspections that followed the heuristics and those that did not is much closer to 50/50; namely, 253 were in the recommended range and 239 were not.

Our results show that inspections that followed the heuristic for team size detected more defects on average in both the contemporary and historical datasets. In fact, the difference is actually more pronounced in the contemporary data. From the statistical analysis, the p-values shown in the rightmost column are much less than our chosen alpha-level of 0.05. This means that there is less than a 5% chance that the perceived difference is actually due to chance, rather than a real effect of the parameter.

We also observe that the meeting length heuristic has an effect counter to expectations. That is, in both the contemporary and historical datasets, inspections that conformed to the meeting length heuristic found fewer defects than those that exceeded the heuristic. This result is curious, but is probably affected at least in part by the fact that relatively very few inspections did not conform to this heuristic.

The results pertaining to the page rate heuristic are also worthy of note. They indicate that the page rate heuristic is not only less effective, but also more difficult to comply with, in the contemporary dataset than historically. Historically, about 24% of the inspections were able to comply with the page rate heuristic (as compared to 15% of the contemporary inspections), and those inspections found significantly more defects. The contemporary inspections that conformed to the page rate heuristic found marginally more defects (4.4 vs. 4.1), but the difference is not significant. This finding points to the need to update this heuristic.

As mentioned in the introduction, the analysis presented in the rest of this paper is motivated by our observations from the analysis presented above. In particular, it is clear that the inspection planning heuristics could be more uniformly effective for all inspection parameters. In addition, from Table 1, we can see that contemporary inspections were less likely to comply with the heuristics than older (historical) inspections. This may indicate that, due to changes in the development environment, the heuristics are becoming harder to apply, and projects are more often violating them.

3 METHODOLOGY

The methodology for our work was exploratory. We started with the research questions above, probed the data in various ways, and allowed the results of each analysis to guide the types of analysis to come next. We used both visual and statistical methods to gain insight into relationships between variables that might be interesting and/or to confirm or determine the strength of relationships we suspected might exist. Thus, we did not have a pre-defined sequence of steps that we followed for data analysis. Instead, in this section, we present the details of the dataset we used, including all the variables we investigated, and an overview of the variety of analysis techniques we employed throughout the paper. The sequences of analyses used are detailed in section 4, along with their findings.

3.1 Variables

The dataset we obtained has numerous fields, which we have designated for our purposes as independent, dependent, or intervening variables. These variables are listed in Table 2 and are described in the subsections below.

Table 2. Variables used in the analysis.

Independent variables
  Team size        Number of people attending the inspection meeting and/or serving as inspectors
  Meeting length   Length of inspection meeting, in hours
  Page rate        Number of document pages inspected, divided by the meeting length

Dependent variables
  Number of defects found   Total number of defects reported as an outcome of the inspection
  Total effort              Total effort, in person-hours, including meeting effort and preparation effort, but not including rework
  Defects per page          Number of defects found divided by the number of document pages inspected
  Defects per hour          Number of defects found divided by total effort in person-hours

Intervening variables
  Application domain   Attitude, orbit, or flight software
  Software type        NASA-defined software type codes (A to H, or unspecified)
  Project size         Small or medium
  Product type         Requirements, design, code, or test documents
  Center               NASA Center at which the inspection took place
3.1.1 Independent variables

The independent variables for our analyses consist of the three inspection control metrics, i.e. team size (number of participants), meeting length, and page rate.

The meeting length is reported in hours. An anomaly that we noticed during inspection of the data in preparation for analysis (but after the analysis presented in section 2) was that a number of records reported very large meeting lengths, as high as 80 hours. In conversations with some of our contacts from whom our data were donated, we learned that some of these very long meetings were actually inspections that spanned multiple meetings, but for reporting purposes, the meeting lengths were summed. Since the meeting length heuristic is concerned with the length of contiguous time that the inspection team meets, using the summed meeting length data did not make sense. For our analyses, we ignored inspections for which the reported meeting length was greater than 4 hours. We chose 4 hours as a reasonable limit on the length that a single contiguous meeting was likely to last. Also, it constituted a logical break in the data where only 194 records (~8% of the total data set) were eliminated.

Page rate was a derived measure that normally did not appear in the raw data we received from Centers. Since some inspections reported the size of the inspected artifact in lines of code (LOC), and others in pages, we used a scaling factor of 30 LOC per page to convert between the two (this is the standard conversion factor used in planning inspections at NASA [2]). We then used the page measure (raw or derived) to calculate the page rate by dividing it by the meeting length.
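These preparation steps might look as follows; again, this is a sketch with hypothetical field names, continuing from the earlier ones:

```python
# Sketch only: the data-preparation steps described above.
LOC_PER_PAGE = 30  # standard NASA planning conversion [2]

df = inspections.copy()
# Prefer reported page counts; otherwise derive pages from lines of code.
df["pages"] = df["pages"].fillna(df["loc"] / LOC_PER_PAGE)
# Drop records whose reported "meeting" was really several summed meetings.
df = df[df["meeting_hours"] <= 4]
df["page_rate"] = df["pages"] / df["meeting_hours"]
# Normalized outcome variables (see section 3.1.2).
df["defects_per_page"] = df["defects_found"] / df["pages"]
df["defects_per_hour"] = df["defects_found"] / df["total_effort"]
```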
3.1.2 Dependent variables

The dependent variables for our study corresponded to the inspection outcomes, i.e. those attributes that describe how successful the inspection was. Nearly all the inspections in our dataset reported the total number of defects found, which we designated as one of our dependent variables. However, in order to get a clearer picture of our analysis results, we also looked at various normalized outcome variables. In particular, we normalized the defect count for each inspection by number of pages inspected ("defects per page") and by total inspection effort ("defects per hour"). We also used total inspection effort in person-hours as a dependent variable.

3.1.3 Intervening variables

There is a very large number of attributes of inspections that could potentially affect the inspection outcome. Many, but not all, of these were represented in our dataset. There were several important attributes for which there was enough variety in the data that we could treat them as intervening variables in the analysis. For each of these variables, we partitioned the data into subsets based on the attribute values.

For application domain, we identified subsets of the inspection data that included inspections of artifacts from projects in the same domain. The domains for which we had sufficient data for analysis were attitude, orbit, and flight software. These domains are sub-domains of satellite control software, which is the larger domain in which NASA is working. Another way to categorize domain is through the NASA set of software type "codes", designated A through H. For example, type A software is software that controls spacecraft carrying humans. In our dataset, the only substantial subset based on these categories was type C software, which is software that supports non-life-critical aspects of a mission, e.g. processing of scientific data from instruments. Inspection records outside this subset were either of software of another category, or for which no category was reported. Finally, project size was determined by creating categories of projects based on total size in LOC (small, medium, etc.) based on the distribution of project size in the dataset (small was defined as 10-100 KLOC, etc.). The only subsets of sufficient size for analysis were small projects and medium projects.

In addition to these intervening variables, subsets of the inspection data were also defined based on the work artifact being inspected (e.g. requirements, design documents, code, test documents) and on the NASA Center in which the project was performed. The NASA Center variable is important because it is a proxy for cultural and historical differences between Centers, as well as differences in inspection processes. While all Centers must adhere to a general agency-wide inspection process, there is considerable leeway in the details for Centers to tailor the process. These intervening variables were available for all inspections in the dataset.

Not all of the inspections in the dataset had recorded values for all of the variables outlined above. Hence, we conducted our analyses separately to use the maximum possible amount of data in each analysis. For this reason, the number of data points reported in section 4 varies from one analysis to another.

3.2 Data analyses

As explained earlier, our analysis was exploratory in nature, so we did not follow a defined sequence of steps. In section 4, as we present results we will also report the tests and techniques that were used to generate each finding. In this section, we provide an overview of how the different tests were performed.

Comparison of means tests: Some of our analyses were similar to the analysis that motivated this investigation, described in section 2. For each inspection heuristic, we divided the data into an in-compliance set and an out-of-compliance set. We then performed a Mann-Whitney test to see if the outcomes of the two sets of inspections were significantly different. To address Q1, we expanded the set of outcome variables used, to include total effort expended, defects per page, and defects per hour, in addition to the number of defects found (as in section 2). We were also interested in fine-tuning the inspection heuristics for particular contexts (Q2). For this purpose we performed similar analyses, except with subsets of the data, partitioned by values of various intervening variables. For some of the subsets and heuristics, the divisions between in-compliance and out-of-compliance inspections were too unbalanced to do a meaningful comparison, i.e. either nearly all inspections in the subset were in compliance, or nearly all were out of compliance. These cases had to be ignored during this analysis, but they were noted as examples of situations where the heuristics were either very easy or very difficult to comply with. We also used the Mann-Whitney test to check for significant differences between subsets of the data based on values of dependent variables (e.g. to characterize highly effective inspections) in an effort to identify better thresholds for the heuristics.

Visualizations: In order to determine if there were any obvious "natural" thresholds in the data that might serve as improved thresholds for the inspection heuristics (e.g. the most effective range for number of participants), we used scatter plots. The scatter plots were examined to see if, in any cases, it was visually possible to identify any ranges of independent variables associated with optimal values of the dependent variables.

Another visualization we found useful was box plots, which show, within a given subset of the data, the distribution (including the mean, median, quartiles, and outliers) of a particular variable. We used box plots to investigate possible relationships between two variables. To do this, we segmented the data into subsets based on one variable, then generated boxplots using another variable in each subset. A side-by-side comparison of these boxplots was used to intuit how different the distributions were, and thus the possibility of a relationship between the two variables.
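As an illustration, the sketch below draws side-by-side boxplots of one variable segmented by another; the particular variable choices are arbitrary examples, not the exact plots used in the study.

```python
# Sketch only: side-by-side boxplots of defects per hour, segmented by
# the type of artifact inspected.
import matplotlib.pyplot as plt

groups = [(name, g.dropna())
          for name, g in df.groupby("product_type")["defects_per_hour"]]
fig, ax = plt.subplots()
ax.boxplot([g for _, g in groups], labels=[name for name, _ in groups])
ax.set_xlabel("product type")
ax.set_ylabel("defects per hour")
plt.show()
```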
The scatter plots were work artifact being inspected (e.g. requirements, de examined to see if, in any cases, it was visually possi sign documents, code, test documents) and on the ble to identify any ranges of independent variables NASA Center in which the project was performed. The associated with optimal vnlues of the dependent vnri NASA Center variable is important because it is a ables. proxy for cultural and historical differences between Another visualization we found useful were box Centers, as well as differences in inspection processes. plots, which show, within a given subset of the datn, · While all Centers must adhere to a general agency the distribution (including the mean, median, quarti wide inspection process, there is considerable leeway les, and outliers) of a particulnr variable. We used box in the details for Centers to tailor the process. These plots to investigate possible relationships between two intervening variables were available for nll inspec variables. To do this, we segmented the dnta into sub tions in the dataset. sets based on one vnriable, then generated boxplots if.. -· 6 IEEE TRANSACTIONS ON JOURNAL.NAME. ~NUSCRIPT 10 using another variable in each subset. A side by side combinations of independent and dependent variables comparison· of these boxplots was used to intuit how in Table 2, for the whole dataset and for all subsets different the distributions were, and thus the possibil (based on values of intervening variables) large ity of a relationship between the two variables. enough to support the analysis. Because of a lack of normality, the non-parametric test, Spearman's rho, Regression trees: Regression trees are one approach to was used. modeling the relationship between inspection parame ters and outcomes. Regression tree modeling applies 4 FINDINGS an iterative partitioning algorithm to a set of data, re To reiterate, our specific research questions for this analy- sulting in a tree-like structure, where each node repre sis are: · serits a subset of the data conforming to a conjunction Qt. What effect do inspection parameters have on of conditions based on the independent variables in other inspection outcomes, besides total num the set. The conditions are chosen such that each re ber of defects found? sulting subset is as homogeneous as possible with re Q2. Are there variations in the heuristics that are spect to a chosen dependent variable. The "quality" of appropriate for different situations? a regression tree (i.e. how useful it is in characterizing Q3. Are there variations in the heuristics that a dataset) is normnlly assessed through two metrics. would better ensure that compliance would The first is the correlation coefficient between the ac result in better inspection outcomes (i.e. that tual values of the dependent variable and the values would improve the relationship between com predicted by the tree. The second is the relative error, pliance and effectiveness/ efficiency)? which indicates how much of the dntaset is co·rrectly As a baseline, we began with an overview of how predicted using the tree. Regression trees give insight the heuristics perform in the dataset as a whole. These into those independent and intervening variables most results are shown in Table 3. Again, we used the Mann likely to have an effect on the value of the dependent Whitney test to compare the means between in variable, nnd which combinations of the variables nre compliance and out-of-compliance inspections. 
Tests of correlation: We calculated correlations for all combinations of independent and dependent variables in Table 2, for the whole dataset and for all subsets (based on values of intervening variables) large enough to support the analysis. Because of a lack of normality, the non-parametric test, Spearman's rho, was used.
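A sketch of these pairwise tests, over the hypothetical variables defined in the earlier sketches:

```python
# Sketch only: Spearman's rho for every independent/dependent pair.
from scipy.stats import spearmanr

independents = ["team_size", "meeting_hours", "page_rate"]
dependents = ["defects_found", "total_effort",
              "defects_per_page", "defects_per_hour"]

for iv in independents:
    for dv in dependents:
        pair = df[[iv, dv]].dropna()
        rho, p = spearmanr(pair[iv], pair[dv])
        print(f"{iv} vs. {dv}: rho = {rho:.2f} (p = {p:.3f})")
```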
4 FINDINGS

To reiterate, our specific research questions for this analysis are:

Q1. What effect do inspection parameters have on other inspection outcomes, besides total number of defects found?
Q2. Are there variations in the heuristics that are appropriate for different situations?
Q3. Are there variations in the heuristics that would better ensure that compliance would result in better inspection outcomes (i.e. that would improve the relationship between compliance and effectiveness/efficiency)?

As a baseline, we began with an overview of how the heuristics perform in the dataset as a whole. These results are shown in Table 3. Again, we used the Mann-Whitney test to compare the means between in-compliance and out-of-compliance inspections. These results are mostly consistent with those in Table 1, so from this point we will no longer distinguish between historical and contemporary inspections. However, it should be noted that older inspections, in general, consumed more resources and found more defects, but were less efficient, than more recent inspections.

Table 3. Testing original inspection heuristics on entire dataset.

COMBINED DATASET (both historical and contemporary)
                 In-compliance inspections   Out-of-compliance inspections   Inspections following
                 # of insp.   Avg. # of      # of insp.   Avg. # of          heuristics significantly
                              defects                     defects            better?
Team size        276          14.0           445          7.0                YES (p<0.0005)
Meeting length   644          7.1            36           23.6               NO
Page rate        138          13.7           489          6.5                YES (p<0.0005)

Next, we repeated this analysis for all the dependent variables. The results (comparison of means, shown in Table 4) show a much more complicated picture than that presented by the results on total defects alone (from Table 3). Columns 2 and 3 of Table 4 show that compliance with the team size and page rate heuristics results in more effective but more expensive inspections. The heuristic for meeting length has an opposite effect. When one examines the outcomes of an inspection more closely (i.e. by normalizing the number of defects found, as in the last two columns of Table 4), the picture becomes yet more cloudy. For example, what appears to be a benefit of complying with the team size heuristic (more total defects found) evaporates when we consider defects per page or defects per hour.

Table 4. Testing original inspection heuristics with respect to different dependent variables (effect of compliance on four outcome variables), COMBINED DATASET.

Do inspections conforming to the heuristics perform better or worse than those that don't with respect to:

                 Total effort?   Total defects?   Defects per page?   Defects per hour?
Team size        worse           better           worse               worse
Meeting length   better          worse            worse               better
Page rate        worse           better           better              worse

The results from this initial analysis imply that optimizing the effectiveness of an inspection (i.e. maximizing the number of defects found) is at odds with optimizing its efficiency (i.e. minimizing total effort spent). This led us to focus on one of our dependent variables, defects per hour (total number of defects found divided by total effort spent), which reflects the tradeoff between defect detection and effort. Maximizing defects per hour would likely be a concern of many inspections, particularly in projects where managers want to get the most out of the resources they consume, and where there is not a willingness to spend a premium to ensure that every single defect is found. However, in other types of projects, particularly safety- or mission-critical software, the concern may be in fact to maximize the number of defects found, despite the cost. Thus, in the presentation below, we have gathered two sets of evidence that our analysis yielded. First, we present findings that lead to guidance for projects concerned with maximizing the number of defects found (i.e. effectiveness). Then, in section 4.2, we present findings that point to heuristics for maximizing defects per hour (i.e. efficiency).

4.1 Maximizing total defects found

Table 3 shows that more defects are found in inspections that are in conformance with the heuristics related to team size and page rate, than in those out of conformance. In an effort to understand this phenomenon better, we did a deeper analysis of the inspection dataset from the perspective of maximizing the number of defects found in an inspection. This led to the following finding:

Finding 1: Inspections in which large numbers of defects were reported involved higher levels of effort and more participants.

While rather straightforward and seemingly obvious, this finding implies that neither higher levels of effort nor more participants alone will result in higher numbers of defects. It also implies that there is no "shortcut" to finding more defects other than working hard at it. It should be noted that similar findings were also found with respect to the number of defects found per page, although for simplicity we focus simply on total defects found in this discussion. This finding is supported principally by tests of correlation, but visual examinations of distributions, regression trees, and examination of outliers also contributed to understanding. Significant correlations were found between the number of defects found and all variables tested (number of inspection participants, total effort, meeting effort, and preparation effort - more about page rate later). Correlation coefficients ranged from 0.49 (for number of participants) to 0.79 (for total effort). Scatter plots showed little evidence of an upper bound on the benefits of adding resources (time and people) to inspections. For example, Figure 1 shows a scatter plot relating the mean number of defects found and the meeting length of an inspection (i.e. each dot corresponds to the mean number of defects found for all inspections having the corresponding meeting length). There were only two inspections with meeting length of 3.5 hours, so the point corresponding to that value can be considered an anomaly and ignored for the moment. The scatter plot shows an increasing relationship that does not level off, at least within the range of this dataset.

[Figure 1. Scatterplot relating meeting length and number of defects found (each dot represents a mean over all inspections with the same meeting length).]

We generated a number of regression trees using various combinations of independent variables and number of defects found as the dependent variable, but few of them were very strong (most correlation coefficients were < 0.5, and all had a relative error >= 100%). However, they did provide some insight into the independent variables most likely to affect total defects found. Since regression tree analysis is very sensitive to outliers, we removed 6 records with total numbers of defects found that were suspiciously high (all were above 150). Removing these outliers resulted in stronger and more informative trees. One (correlation coefficient .61, relative error 65%) used only the three inspection parameters (meeting length, team size, and page rate) as independent variables, while the second (correlation coefficient .74, relative error 56%) used all independent and intervening variables except the NASA Center at which the inspection took place (the Center dominated all other independent variables when it was included in the analysis). In both cases, the dominant independent variable distinguishing between inspections finding low vs. high numbers of defects was meeting length, which supports Finding 1 because meeting length is a major component of effort.

Figure 2 shows the distribution of total defects found, including the 6 "outliers" that were excluded from the regression tree analysis. The plot shows a large number of inspections with relatively very high numbers of defects found. We compared the set of 55 inspections with numbers of defects found more than one standard deviation above the mean to the remaining inspections in terms of average team size, meeting length, and other dependent variables. We found that the high-defect inspections were higher effort (in terms of meeting effort, preparation effort, and total effort), involved more participants, and found more defects per page than the rest of the inspections. All of these comparisons were significant (p<.05) using the Mann-Whitney test. Again, this supports the conclusion that the secret to finding more defects is not counter-intuitive, but consists of using more people and more time.

[Figure 2. Distribution of number of defects found. The 25th percentile, median, and 75th percentile of the data are represented as horizontal lines.]
The relationship between page rate and inspection effectiveness deserves a bit more examination. The current heuristics give different thresholds for the optimal page rate for different types of inspection artifacts (code, design, etc.). However, as can be seen from Table 1, the percentage of inspections conforming to the page rate heuristic has always been small, especially for the contemporary dataset. Moreover, conforming to the page rate heuristic results in more defects found only in the historical inspections, not the contemporary. This implies that the original page rate heuristic is more difficult to adhere to, and less useful, than it was previously. This leads us to investigate how the page rate heuristic might be relaxed and what the effects of doing so would be.

As a starting point, we calculated the average number of defects found, and defects found per page, for code inspections with page rates under various thresholds. The results are shown in Table 5.

Table 5. Variations in numbers of defects found in code inspections as page rate increases.

# of inspections   Page rate      Avg. # of       Avg. # of defects/page
                   (pages/hour)   defects found   (effectiveness decrease)
28                 up to 10       4.3             4.45
77                 up to 20       4.1             1.81 (59%)
145                up to 40       4.9             1.1 (75%)
258                up to 80       4.6             0.7 (84%)
368 (100%)         up to 2667     4.42            0.57 (87%)

They show that, in fact, increasing the page rate does not have a severe impact on total defects found. However, there is a severe penalty associated with the number of defects found per page. There is a significant correlation (p < .05) between page rate and defects per page for code inspections. From Table 5, we see that relaxing the page rate heuristic to 20 pages/hour (as opposed to the recommended 10 pages/hour) results in an almost 60% reduction in the number of defects found per page, and there is a 75% penalty when relaxing the page rate to 40 pages/hour. Further, the vast majority of code inspections are well above this threshold.
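The computation behind such a table is a simple cumulative threshold sweep; a sketch, continuing with the hypothetical DataFrame from earlier:

```python
# Sketch only: average defects found and defects per page for code
# inspections at or below each candidate page-rate cap.
code = df[(df["product_type"] == "code") & df["page_rate"].notna()]

for cap in (10, 20, 40, 80, code["page_rate"].max()):
    below = code[code["page_rate"] <= cap]
    print(f"rate <= {cap:6.0f}: n = {len(below):4d}, "
          f"avg defects = {below['defects_found'].mean():.2f}, "
          f"avg defects/page = {below['defects_per_page'].mean():.2f}")
```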
Similar results were found for design and test inspections. For example, relaxing the page rate heuristic to 40 pages/hour (as compared to the recommended 20 pages/hour) for design inspections results in a 28% reduction in defects found per page, and a 40% reduction for a page rate up to 80 pages/hour. For test inspections, relaxing the heuristic to 40 pages/hour (from the recommended 20) results in a 40% hit in defects found per page. There were very few requirements inspections in the dataset with data on both defects and page rate, and these did not represent much variation in page rate, so this analysis was not very meaningful for requirements inspections.

If we assume that the inspections in which the page rate was higher did not, in general, have lower true defect density (an unsubstantiated but not unreasonable assumption), then these results imply that higher page rates are associated with missed defects. This is consistent with the justification for the original page rate heuristics, i.e. that higher page rates will result in a more superficial inspection. This finding was hinted at earlier, in fact, in the analysis presented in Table 4, which showed that, overall, inspections that conform to the page rate heuristic do in fact perform better than those that do not in terms of defects found per page. However, this leads to a dilemma. Table 1 shows that violating the page rate heuristic is common, and is becoming more common. Clearly, it is tempting, given schedule and budget pressures, to speed up the inspection process by increasing page rate, when it appears that developers can handle the increased amounts of material. This analysis shows, however, that there is a hefty price being paid for this.

Finding 2: Inspections of all types (except possibly requirements) would be significantly more effective (in terms of defects found per page) if they adhered to the original page rate heuristics.

4.2 Maximizing defects per hour

Another relevant measure when evaluating inspections is the efficiency of the inspection process, one indicator of which is the number of defects found in the inspection per person-hour of total effort. Under a variety of conditions (e.g. limited resources, software of low criticality, heavy reliance on testing, etc.), maximizing the defects found per hour may be a more important inspection goal than finding the maximum number of defects. With this in mind, we examined our dataset for insights into potential heuristics for maximizing the dependent variable "defects per hour."
Our first finding in this section paints a broad picture of defects per hour among the inspections in our dataset. It arises from the boxplot in Figure 3, which shows the distribution of defects per hour over the entire dataset, as well as descriptive statistics.

Finding 3: With a few exceptions (treated in later findings), the defects per hour for most inspections remains within a narrow range of 1-2 defects per hour of total effort.

[Figure 3. Distribution of defects per hour over entire dataset.]

While the range of values for defects per hour is over 16, both the mean and the median are less than 2. In fact, 75% of the values are less than 2. However, the clear presence of significant outliers in Figure 3 led us to further investigate this small collection of "super-inspections" that resulted in abnormally high numbers of defects per hour. We defined a subset of our inspection data, which we called "hyper DpH inspections", to include all inspections with defects per hour greater than one standard deviation above the mean for defects per hour over all inspections, giving a threshold of about 3.4 defects per hour. The hyper DpH inspection set consisted of 52 inspections, with 496 inspections constituting the set of inspections with "normal" defects per hour values.
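A sketch of this subset definition, continuing with the hypothetical DataFrame from earlier:

```python
# Sketch only: defining the "hyper DpH" subset as inspections more than
# one standard deviation above the mean defects per hour.
dph = df["defects_per_hour"].dropna()
threshold = dph.mean() + dph.std()  # about 3.4 in the paper's dataset

hyper_dph = df[df["defects_per_hour"] > threshold]    # 52 inspections
normal_dph = df[df["defects_per_hour"] <= threshold]  # 496 inspections
print(f"threshold = {threshold:.1f}, "
      f"hyper = {len(hyper_dph)}, normal = {len(normal_dph)}")
```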
We conducted a number of exploratory analyses on the hyper DpH inspection set to both characterize it and compare it to the rest of the dataset. From this analysis, we concluded that the hyper DpH inspections had both lower total effort and higher numbers of defects found than other inspections. Further, the lower levels of effort were due to lower numbers of participants, shorter meeting times, and shorter preparation times. Seventy-five percent of the hyper DpH inspections had between 1 and 3 participants, and a meeting length between .25 and .5 hours. Hyper DpH inspections also tended to be more recent than other inspections, part of small projects, and were mostly (but not exclusively) code inspections. This analysis leads to the following finding:

Finding 4: For code inspections on small projects, defects per hour is maximized when the number of participants is between 1 and 3, and the meeting length is between .25 and .5 hours.

This finding is meaningful in this context because inspection experts at NASA have been struggling with how to streamline the inspection process for small projects. Full inspections that conform to the inspection heuristics seem to be overkill for such projects, which can't afford the high levels of effort. Finding 4 suggests that there is in fact a streamlined, but still efficient, approach.

Finding 4 was tested using a Mann-Whitney test limited to contemporary code inspections from small projects, comparing values of defects per hour between those inspections that fell within the thresholds specified in Finding 4 and those that did not (ignoring, for the moment, page rate, which will be discussed later). The results show that code inspections on small projects that are described by Finding 4 (1-3 participants and .25-.5 hours meeting length) have higher numbers of defects per hour than those that don't (2.8 vs. 1.8). However, this finding was not statistically significant. The relevant dataset in this case is very small; only 23 code inspections from small projects had defect and effort data, 16 of which conformed to Finding 4.

We then tested the relationship between compliance with the heuristics and defects per hour only on the "normal" inspections (i.e. excluding the hyper DpH set). The results agree with the results from the analysis against the entire dataset (as presented in Table 4), where out-of-compliance inspections (with the exception of the meeting length heuristic) outperform in-compliance inspections. Thus, the results in Table 4 are not an artifact of the hyper DpH inspections.

In an effort to gain further insight about what maximizes defects per hour, we generated several regression trees, using defects per hour as the dependent variable, and various combinations of independent variables. Although none of the trees generated were very good (all had correlation coefficients <= 0.5, and relative error >= 78%), they did provide some insight into the independent variables most likely to affect defects per hour. When using all independent and intervening variables in the tree, the most significant variable was the NASA Center at which the inspection took place. This implies that the factors affecting defects per hour are different for each Center, thus making it hard to generalize. It may also point to undocumented differences in inspection processes between Centers. When the Center variable was removed from the analysis, the resulting tree showed that meeting length and software type were the most influential variables. The role of meeting length is obvious, as it is involved in the calculation of effort. Looking more closely at the role of software type (e.g. attitude, orbit software, etc.), we compared the mean defects per hour in the "normal" inspections of each software type. We found that only inspections of software classified as "other" had significantly higher values for defects per hour than inspections of any other software types.