Paper: Quantitative Analysis of Faults and Failures in a Complex Software System

Thursday, May 7, 2009

Fenton NE, Ohlsson N. Quantitative Analysis of Faults and Failures in a Complex Software System. IEEE Trans Softw Eng. 2000 August;26(8):797-814.

Fenton and Ohlsson attempt to hack out a bit of solid empirical knowledge from the wilderness of published defect data. They note that whilst there are plenty of hypotheses and rules of thumb about defects and defect distribution (e.g. that 20% of defects account for 80% of failures (the Pareto principle for software), or that large modules are proportionally more reliable than small modules), there is little in the way of published empirical knowledge that can be used for validation or benchmarking. They study defect data from a telecommunications system in order to evaluate the extent to which their data supports or rejects a range of hypotheses about defects.

Fenton and Ohlsson find evidence to support the Pareto principle in software, i.e. the majority of operational failures come from a small number of faults.
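
To make that concrete, here's a minimal sketch of the kind of check involved, on made-up failure counts (not the paper's data): rank faults by the number of failures they cause and see how few of them account for 80% of the total.

    # Sketch: how many faults account for 80% of the observed failures?
    # The failure counts per fault below are invented, for illustration only.
    failures_per_fault = [120, 45, 30, 8, 5, 3, 2, 1, 1, 0, 0, 0]

    total = sum(failures_per_fault)
    ranked = sorted(failures_per_fault, reverse=True)

    cumulative = 0
    for i, count in enumerate(ranked, start=1):
        cumulative += count
        if cumulative >= 0.8 * total:
            print(f"{i} of {len(ranked)} faults "
                  f"({i / len(ranked):.0%}) cause 80% of the failures")
            break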

They also find evidence that modules with higher defect counts found before release have fewer operational defects after release (and, similarly, the inverse is also supported). This point is worth taking slowly. It seems reasonable, on the one hand, that modules with lots of defects before release are simply undergoing more thorough testing, and so should have fewer defects after release. But you might also reject this idea and believe that there are a few poorly designed, troublesome modules that are, and will continue to be, responsible for most of the defects. That is, that we can predict buggy modules by looking at where bugs have been found before (an idea which, as previously mentioned, their data does not support).
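
Just to pin down what is being claimed, here's a minimal sketch with invented per-module counts; this is not the authors' analysis (they work from their actual fault data and present it differently), just the shape of the question: if the finding generalises, pre- and post-release fault counts should be negatively related across modules.

    # Sketch: is a module's pre-release fault count predictive of its
    # post-release fault count?  All numbers are invented for illustration.
    from statistics import correlation  # Python 3.10+

    modules = {
        "mod_a": (40, 0),   # (pre-release faults, post-release faults)
        "mod_b": (35, 1),
        "mod_c": (20, 2),
        "mod_d": (12, 3),
        "mod_e": (5, 6),
        "mod_f": (2, 9),
    }

    pre = [p for p, _ in modules.values()]
    post = [q for _, q in modules.values()]

    print(f"correlation(pre, post) = {correlation(pre, post):.2f}")
    # A clearly negative value would echo the paper's finding: the modules
    # that were fault-prone before release are not the ones that fail in
    # operation.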

So, exactly how to predict post-release defect density from pre-release density is in dispute. Okay, sure, so what? Fenton and Ohlsson point out that if their finding generalises (that is, if post-release density is lower for modules with high pre-release density, and vice versa), then using defect density as the "de facto" measure of software quality is a highly suspect practice. As they say:
If fault density is measured in terms of pre-release faults (as is very common), then, at the module level, this measure tells us worse than nothing about the quality of the module; a high value is more likely to be an indicator of extensive testing than of poor quality.
So, in terms of a benchmarking tool for comparing projects, pre-release defect density isn't going to work. But how about post-release defect density? The authors show that their post-release defect densities are in line with other published studies, so it seems plausible. (I sure hope so, because this has been my plan for benchmarking the software quality of various climate models!)
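
The arithmetic of such a benchmark is at least simple; all of the difficulty is in the counting (more on that below). A minimal sketch, with invented project names, sizes, and defect counts:

    # Sketch: post-release defect density as a cross-project benchmark.
    # Project names, sizes, and defect counts are all invented.
    projects = {
        "model_x": {"post_release_defects": 42, "kloc": 310},
        "model_y": {"post_release_defects": 7,  "kloc": 95},
    }

    for name, data in projects.items():
        density = data["post_release_defects"] / data["kloc"]
        print(f"{name}: {density:.2f} post-release defects per KLOC")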

There's (at least) one missing piece to all of this that I have yet to figure out: what exactly constitutes a pre- or post-release defect? That is, when do you stop counting defects as post-release for release n and start counting them as pre-release for release n+1? In this paper it's not entirely clear, but defects were associated with releases (and with phases within each release) based on temporal proximity to certain milestones. Again, we've seen some work that associates defects with versions based on their bug report classifications and temporal proximity to releases. It seems to me crucial to be able to state explicitly what the conditions are for counting a defect as pre- or post-release in order for these numbers to be even remotely comparable.
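
To see why the cut-off matters, here's one naive rule you could imagine adopting, purely as a hypothetical (the paper's milestone-based assignment isn't spelled out in this much detail): bucket each defect by comparing its report date against the surrounding release dates.

    # Sketch of one naive, hypothetical rule for splitting defects into
    # pre- and post-release buckets by date.  Dates and IDs are invented.
    from datetime import date

    release_dates = {"n": date(1995, 6, 1), "n_plus_1": date(1996, 1, 15)}

    defect_reports = [           # (defect id, date reported)
        ("D-101", date(1995, 4, 20)),
        ("D-102", date(1995, 7, 3)),
        ("D-103", date(1996, 2, 1)),
    ]

    for defect_id, reported in defect_reports:
        if reported < release_dates["n"]:
            bucket = "pre-release for n"
        elif reported < release_dates["n_plus_1"]:
            bucket = "post-release for n -- or pre-release for n+1?"
        else:
            bucket = "belongs to a later release"
        print(f"{defect_id}: {bucket}")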

(Ah counting. Back in kindergarten it all seemed so easy.)
