As described in the post linked to above, I started out on this research track with the fairly concrete idea of benchmarking the software quality of climate modelling software by using defect densities as a comparison tool. Defect counts and densities are a common way of measuring software quality: the fewer the defects, the higher the quality. The thinking is this: if climate models turn out to be of generally higher quality than other similarly-sized software projects (measured via defect density), that's a strong indication that something interesting is going on in the climate modellers' software development process -- and we'd better take a closer look. If the modelling software is generally worse, that's also interesting and worth investigating (I mean, hey, we software engineers might have something to contribute!).
The glitch is that I don't think defect densities, or many of our other quality metrics, make good benchmarking tools -- they're far too subjective. It's not just me, either: everyone I pitched the idea to at the ICSE workshop on software quality was hesitant about it, as were most of the other folks I talked with at ICSE. Fenton and Neil (1999) discuss the difficulties of using defect densities to measure and benchmark software quality. What it comes down to is this (a small worked example of how these counting choices move the numbers follows the list):
- The way defect data are reported varies across publications: some papers use defect rate, others defect density, and still others failure rate.
- When are we counting defects: pre-release, post-release, or both? Some publications don't say.
- What constitutes a defect isn't always clear: are we counting statically determinable faults (and if so, what heuristics are we using), or only those failures actually found and reported by developers and users?
- A plain count of defects ignores severity. So how do we account for the severity of defects? And who determines severity? Users and developers might have very different ideas of how severe a given defect is.
- A failure isn't in the code; it's a discovered unsatisfactory behaviour. So how a piece of software is tested and used directly determines which defects are found (if it's never tested, it appears failure-free!). How are testing effort and usage information (size of the user base, etc.) accounted for in defect counts?
- Finally, different people, teams, and domains have very different ideas of what constitutes good quality (e.g. a usability bug may matter far less to a scientist than to a commercial product team). Comparing software across these boundaries with a simple defect measure ignores these relative notions of quality, which may make the measure useless. (At the very least, the measure must be understood as an indicator of how well a group's software development process works for them, rather than as an objective statement about the software itself. But maybe that has always been the case.)
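To make that concrete, here's a minimal sketch (in Python, with invented numbers for a hypothetical project -- none of these figures come from any real model) of how defect density is typically computed as defects per thousand lines of code (KLOC), and how much the answer swings depending on which of the counting choices above you make:

```python
# Illustrative only: the line count and defect counts below are invented
# for a hypothetical project, to show how counting choices change the metric.

def defect_density(defect_count, lines_of_code):
    """Defects per thousand lines of code (KLOC)."""
    return defect_count / (lines_of_code / 1000.0)

LOC = 400_000  # hypothetical size of the code base

# The same hypothetical project, counted four defensible ways:
counts = {
    "post-release failures only": 120,
    "pre- and post-release defects": 480,
    "high-severity defects only": 35,
    "all static-analysis warnings": 2600,
}

for label, n in counts.items():
    print(f"{label:32s} {defect_density(n, LOC):7.2f} defects/KLOC")
```

Every row is a legitimate "defect density" for the same code base, yet the values differ by more than a factor of fifty, which is why a single published number tells you very little unless you know exactly what was counted and how the software was exercised.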
So, how do we get a better understanding? In the interests of experimenting with a one-item-per-post system, I'll leave that thought for the next post.
Comments:
It seems like there are two notions of quality at work here. One is software quality, which is where defect density is most relevant (array overflows, unhandled exceptions, etc.).
The other is scientific quality. I came across this using MPL: you can call a bunch of functions in the library, most of which I have only a shaky understanding of, all of which return a set of numbers -- numbers I have no way of showing are correct, because they're exactly what I'm looking for. I think the second issue is more interesting, and I suspect the solution is some form of data-driven testing.
For example, climate models are 'tested' by looking at the data we already have. If we could set up a way to represent the data someone generated from observations, say as a library, then we could use it as an automated test of our calculations (and then have greater confidence in future calculations).
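A rough sketch of that idea, assuming a hypothetical observed value and a stand-in model function (the observed value, tolerance, and function here are invented purely for illustration; a real version would pull both from an observational data library and an actual model run):

```python
# Sketch of a data-driven test: check a model diagnostic against an
# observational reference, within a stated tolerance. The observed value,
# tolerance, and model function below are hypothetical placeholders.

import math

OBSERVED_GLOBAL_MEAN_TEMP_C = 14.0  # would come from an observations library
TOLERANCE_C = 0.5                   # acceptable disagreement for this test

def simulate_global_mean_temp():
    """Stand-in for an actual model run returning one diagnostic."""
    return 13.8  # placeholder output

def test_global_mean_temp_matches_observations():
    simulated = simulate_global_mean_temp()
    assert math.isclose(simulated, OBSERVED_GLOBAL_MEAN_TEMP_C,
                        abs_tol=TOLERANCE_C), (
        f"simulated {simulated} degC is more than {TOLERANCE_C} degC "
        f"away from observed {OBSERVED_GLOBAL_MEAN_TEMP_C} degC"
    )

if __name__ == "__main__":
    test_global_mean_temp_matches_observations()
    print("model diagnostic is within tolerance of observations")
```

The interesting design question is where the tolerance comes from -- observational uncertainty, internal model variability, or both -- which is exactly the kind of thing a "library" of observational data could document alongside the numbers themselves.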
It would be a great contribution if you could further explain what constitutes a 'defect' in scientific/climate software, maybe using a case study approach.
Huh. Well-timed response, Neil (ahem, Ernie). That's where I'm headed with this: to further explain what constitutes a "defect" in scientific software.
To your first point: I agree there is a difference between software quality and scientific quality. There's plenty of interplay between the two, of course, and I'd like to unravel that myself.
I gather the verification and validation of climate models is very complicated, but my understanding is that there are "libraries" of observational data and standard scenarios used for testing (consider the model intercomparison projects).