As described in the post linked to above, I started out on this research track with the fairly concrete idea of benchmarking the software quality of climate modelling software by using defect densities as a comparison tool. Defect counts and densities are a common way of measuring software quality: the fewer the defects, the higher the quality. The thinking is this: if climate models turn out to be of generally higher quality than other similarly-sized software projects (measured via defect density), that's a strong indication that something interesting is going on in the climate modellers' software development process -- and we'd better take a closer look. If the modelling software is generally worse, that's also interesting and worth investigating (I mean, hey, we software engineers might have something to contribute!).
The glitch is that I don't think defect densities, or many of our other quality metrics, make good benchmarking tools -- they're far too subjective. It's not just me, either: everyone I pitched the idea to at the ICSE workshop on software quality was hesitant about it, as were most of the other folks I talked with at ICSE. Fenton and Neil (1999) discuss the difficulties of using defect densities to measure and benchmark software quality. What it comes down to is this (a small worked example of how these counting choices move the numbers follows the list):
- The way defect data are reported varies across publications: some papers use defect rate, others defect density, and still others failure rate.
- When are we counting defects: pre-release, post-release, or both? Some publications don't say.
- What constitutes a defect isn't always clear: are we counting statically determinable faults (and if so, what heuristics are we using), or only those failures actually found and reported by developers and users?
- A plain count of defects ignores severity. So how do we account for the severity of defects? And who determines severity? Users and developers might have very different ideas of how severe a given defect is.
- A failure isn't in the code; it's a discovered unsatisfactory behaviour. So how a piece of software is tested and used directly determines which defects are found (if it's never tested, it appears failure-free!). How are testing effort and usage information (size of the user base, etc.) accounted for in defect counts?
- Finally, different people, teams, and domains have very different ideas of what constitutes good quality (e.g. a usability bug may matter far less to a scientist than to a commercial product team). Comparing software across these boundaries with a simple defect measure ignores these relative notions of quality, which may make the measure useless. (At the very least, the measure must be understood as an indicator of how well a group's software development process works for them, rather than as an objective statement about the software itself. But maybe that has always been the case.)
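To make that concrete, here's a minimal sketch (in Python, with invented numbers for a hypothetical project -- none of these figures come from any real model) of how defect density is typically computed as defects per thousand lines of code (KLOC), and how much the answer swings depending on which of the counting choices above you make:

```python
# Illustrative only: the line count and defect counts below are invented
# for a hypothetical project, to show how counting choices change the metric.

def defect_density(defect_count, lines_of_code):
    """Defects per thousand lines of code (KLOC)."""
    return defect_count / (lines_of_code / 1000.0)

LOC = 400_000  # hypothetical size of the code base

# The same hypothetical project, counted four defensible ways:
counts = {
    "post-release failures only": 120,
    "pre- and post-release defects": 480,
    "high-severity defects only": 35,
    "all static-analysis warnings": 2600,
}

for label, n in counts.items():
    print(f"{label:32s} {defect_density(n, LOC):7.2f} defects/KLOC")
```

Every row is a legitimate "defect density" for the same code base, yet the values differ by more than a factor of fifty, which is why a single published number tells you very little unless you know exactly what was counted and how the software was exercised.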
So, how do we get a better understanding? In the interests of experimenting with a one-item-per-post system, I'll leave that thought for the next post.
Comments:
It seems like there are two notions of quality at work here. One is software quality, which is where defect density is most relevant (array overflows, unhandled exceptions, etc.).
The other is scientific quality. I came across this using MPL: you can call a bunch of functions in the library, most of which I have only a shaky understanding of, all of which return a set of numbers -- numbers I have no way of showing are correct, because they're exactly what I'm looking for. I think the second issue is more interesting, and I suspect the solution is some form of data-driven testing.
For example, climate models are 'tested' by looking at the data we already have. If we could set up a way to represent the data someone generated from observations, say as a library, then we could use it as an automated test of our calculations (and then have greater confidence in future calculations).
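A rough sketch of that idea, assuming a hypothetical observed value and a stand-in model function (the observed value, tolerance, and function here are invented purely for illustration; a real version would pull both from an observational data library and an actual model run):

```python
# Sketch of a data-driven test: check a model diagnostic against an
# observational reference, within a stated tolerance. The observed value,
# tolerance, and model function below are hypothetical placeholders.

import math

OBSERVED_GLOBAL_MEAN_TEMP_C = 14.0  # would come from an observations library
TOLERANCE_C = 0.5                   # acceptable disagreement for this test

def simulate_global_mean_temp():
    """Stand-in for an actual model run returning one diagnostic."""
    return 13.8  # placeholder output

def test_global_mean_temp_matches_observations():
    simulated = simulate_global_mean_temp()
    assert math.isclose(simulated, OBSERVED_GLOBAL_MEAN_TEMP_C,
                        abs_tol=TOLERANCE_C), (
        f"simulated {simulated} degC is more than {TOLERANCE_C} degC "
        f"away from observed {OBSERVED_GLOBAL_MEAN_TEMP_C} degC"
    )

if __name__ == "__main__":
    test_global_mean_temp_matches_observations()
    print("model diagnostic is within tolerance of observations")
```

The interesting design question is where the tolerance comes from -- observational uncertainty, internal model variability, or both -- which is exactly the kind of thing a "library" of observational data could document alongside the numbers themselves.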
It would be a great contribution if you could further explain what constitutes a 'defect' in scientific/climate software, maybe using a case study approach.
Huh. Well-timed response, Neil (ahem, Ernie). That's where I'm headed with this: to further explain what constitutes a "defect" in scientific software.
To your first point: I agree there is a difference between software quality and scientific quality. There's plenty of interplay between the two, of course, and I'd like to unravel that myself.
I gather the verification and validation of climate models is very complicated, but my understanding is that there are "libraries" of observational data and standard scenarios used for testing (consider the model intercomparison projects).