Counting defects

Friday, May 29, 2009

I've talked about the issues with using defect counts to judge and benchmark software quality. I still think, however, that it's worth investigating the defect counts of climate modelling software, because:
  1. It will force me to get my hands into the code and bug reports. My hope is that even a basic familiarity with these things will help me understand issues of quality for computational scientists. It may also give me a bit of currency in discussions with scientists if I have some understanding of the details.
  2. The results may be useful to the individual climate modelling groups, as a gauge for quality within their group.
  3. Doing the study furthers the dialogue between computer scientists and the computational science and climate modelling communities.
  4. I might end up with something useful!
    • As I say in point #1, I might actually gain some insight into computational science software quality. ;-)
    • Aside from comparing defect densities, I might find other ways to use this data for benchmarking. For instance, at the workshop on software quality at ICSE, Elmar Juergens spoke about how, in judging quality, absolute values suck (that might have been exactly what he said), and how trend analysis is much better. He was speaking from the point of view of process improvement. But this raises an interesting idea: if we redefine software quality as "a good software development process" (whatever that means), maybe we could use aspects of quality trends as points of comparison between projects. A rough sketch of what such a comparison might look like follows this list.
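
Here's a minimal sketch (in Python, with made-up numbers) of what a trend-based comparison might look like: instead of comparing absolute defect counts, compare the slope of the counts over time. Everything here, from the monthly counts to the trend_slope helper, is hypothetical and purely for illustration.

```python
# A minimal sketch of comparing defect *trends* rather than absolute counts.
# The monthly counts below are invented for illustration only.

def trend_slope(counts):
    """Least-squares slope of defect counts over equally spaced periods."""
    n = len(counts)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical monthly post-release defect counts for two projects.
project_a = [40, 38, 35, 30, 28, 25]   # more defects overall, but improving
project_b = [5, 6, 8, 9, 11, 14]       # fewer defects, but getting worse

print(trend_slope(project_a))  # negative slope -> quality trending up
print(trend_slope(project_b))  # positive slope -> quality trending down
```

On absolute counts project_b looks far better; on trends project_a does, which is the kind of shift in framing Juergens was arguing for.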
As I mentioned in a previous post on Fenton and Ohlsson's work, I'll need to be very specific about how I measure defects if I'm going to do this. First off, defects come in two major flavours: faults (statically identifiable errors, including dangerous uses of the language) and failures (observable runtime errors). Most of the papers I've read[1] count defects by looking at bug reports and version control comments. Since a bug report or a fix in a repository can cover both faults and failures, a defect in these cases is perhaps best described as a problem worth fixing. I've only come across one paper so far that measures faults, and that's Hatton's paper (linked to above; see his slides for more details). Hatton used static analysis software from Programming Research Ltd., an approach I could run with.
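
To make the fault/failure distinction concrete, here's a toy Python example (nothing to do with any actual climate model code): a fault can be flagged by inspecting the source without running it, while a failure only shows up when the program executes.

```python
# Toy illustration of the fault/failure split; both snippets are made up.

# A fault: statically identifiable without running the program.
# Here a syntax error is caught just by compiling the source.
faulty_source = "def step(dt:\n    return dt * 2\n"
try:
    compile(faulty_source, "<toy>", "exec")
except SyntaxError as err:
    print("fault found statically:", err.msg)

# A failure: the code compiles fine but misbehaves when it runs.
runtime_source = "values = [1.0, 2.0]\nprint(values[5])\n"
code = compile(runtime_source, "<toy>", "exec")
try:
    exec(code)
except IndexError as err:
    print("failure observed at runtime:", err)
```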

The folks in [1] count defects simply by counting bug reports, or by counting check-in comments that say "fixed", "bug", or other keywords that suggest a fix for a bug. Some papers count defects before a release is made (pre-release) and others count defects against a release (post-release). What makes a bug pre- or post-release is a matter of opinion: some papers go by how it's marked in the bug database, while others set a threshold of days before and after the release date to categorise bugs. Some papers explicitly mention that defects are counted only if they have been fixed (i.e. just reporting a defect isn't enough), whereas others aren't clear about this. Finally, some papers only count defects logged against certain areas of the software (for instance, an installation problem might not be counted, but a UI problem would be). Phew. A rough sketch of this style of counting follows.
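
Here's what keyword matching plus a date-threshold classification might look like in Python. The keywords, the 30-day window, and the toy commit log are all my own assumptions for illustration, not what any particular paper in [1] actually used.

```python
# A rough sketch of keyword-based defect counting from commit messages.
from datetime import date, timedelta

FIX_KEYWORDS = ("fix", "fixed", "bug", "defect", "fault")

def is_defect_fix(message):
    """Treat a commit as a defect fix if its message mentions a fix keyword."""
    text = message.lower()
    return any(word in text for word in FIX_KEYWORDS)

def classify(commit_date, release_date, window_days=30):
    """Label a fix as pre- or post-release using a simple date threshold."""
    window = timedelta(days=window_days)
    if release_date - window <= commit_date <= release_date:
        return "pre-release"
    if release_date < commit_date <= release_date + window:
        return "post-release"
    return "outside window"

# Made-up commit log: (date, message) pairs.
log = [
    (date(2009, 4, 20), "Fixed bug in radiation flux calculation"),
    (date(2009, 5, 2),  "Add new aerosol scheme"),
    (date(2009, 5, 20), "fix array bounds fault in ocean coupler"),
]
release = date(2009, 5, 1)

counts = {"pre-release": 0, "post-release": 0, "outside window": 0}
for when, message in log:
    if is_defect_fix(message):
        counts[classify(when, release)] += 1
print(counts)  # e.g. {'pre-release': 1, 'post-release': 1, 'outside window': 0}
```

In practice the counting would also have to decide whether a keyword match is enough, or whether the commit must link to a closed, fixed bug report, which is exactly the ambiguity described above.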

I'm sure there are more dimensions I haven't considered!

[1] A sampling: Koru et al., 2007; Fenton & Ohlsson, 2000; Kaaniche & Kanoun, 1996; Zimmermann et al., 2007
