Ever since my first pitch of the defect density study I've been trying to work out what the bigger research questions are here. I'm not really content with just collecting the defect density results unless I can see how the results fit into a larger story.
My first instinct was to find out more about defects themselves and to see what other people have done with them in their studies. What is the relationship between defect densities and software quality? How do other people understand software quality and measure it? What can we really learn about software quality from looking at defect densities? And to what use can I put these results once I have them? I'm starting to get a picture of defect densities and their usefulness, and defect density is not nearly as good a tool as I had thought, but it is still worth evaluating.
The title of my talk last week was one possible framing of a much bigger question: why do climate modellers trust the code they write? As in Daniel Hook's presentation at SE-CSE '09, trustworthiness seems like an appropriate way to frame the discussion about software quality when it comes to climate models. Why? Because, coarsely, in computational science pursuits like climate modelling there are not always hard and fast rules to distinguish correct from incorrect results. As Hook says, there are no perfect oracles against which results can be checked (the oracle problem), and even if oracles existed, the approximations and measurement errors inherent in modelling would make it tricky to distinguish any error introduced by faulty code (the tolerance problem).
So, how then do the climate modellers know if they're on the right track when constructing their models? We know they employ a wide suite of sophisticated tests to tease out flaws in the conceptual model (validation) and errors in their implementation of that model (verification). My understanding is that underlying some of the validation work are judgement calls, gut checks, and tacit heuristics used to decide whether a model is doing the right thing. For example, climate modellers might ask of a model output, "is it raining where it ought to be raining?" The answer to this question isn't well-defined, but it can be answered with a lot of background knowledge and familiarity with climate processes. This is partly the oracle problem at play: the model output is the result of a scientific experiment, not something we could hope to describe completely beforehand. I'm not saying validation is all guesswork -- not even close -- but just that there are unformalisable elements to model validation that I don't think we're used to considering when we discuss traditional software testing. We are used to thinking about software as having more explicit and testable requirements[1].
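To make the flavour of such a heuristic concrete, here is a toy sketch in Python. It is not how any climate model is actually validated: the function name, the field layout, and the thresholds are all invented for illustration.

```python
import numpy as np

# A toy "is it raining where it ought to be raining?" check. Everything here
# is invented: the field layout (2-D lat x lon, in mm/day), the thresholds,
# and the function name. Real validation rests on expert judgement that a
# crude rule like this only gestures at.

def roughly_plausible_rainfall(precip, tropics_mask):
    """Return True if a precipitation field passes some crude sanity checks."""
    if np.any(precip < 0):        # negative rain is physically meaningless
        return False
    if precip.max() > 500.0:      # implausibly heavy rain anywhere in the field
        return False
    # On average, the tropics should be wetter than the extra-tropics.
    return precip[tropics_mask].mean() > precip[~tropics_mask].mean()
```

The interesting part is everything a function like this cannot encode: whether the rain bands sit in the right places, whether the seasonal timing looks right, and so on. That gap is exactly the tacit knowledge described above.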
On the verification side, the tolerance problem entails that, even if we ignore the conceptual problems with the model, it is still not a straightforward matter to be certain that the code is correct. Uncertainties in the data, truncation error in approximations, and round-off error in computations can all hide real errors resulting from flaws in the model implementation.
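As a concrete, entirely invented illustration of the tolerance problem, consider the toy test below. The "model" approximates exponential decay in a fixed number of steps, so its truncation error forces the acceptance tolerance to be loose; a deliberately wrong coefficient (0.1001 instead of 0.1) then hides comfortably inside that tolerance. None of the names or numbers come from a real climate model test suite.

```python
import numpy as np

# Toy illustration of the tolerance problem. Everything here is invented:
# reference_solution stands in for an (imperfect) oracle, model_step for the
# code under test, and TOLERANCE for the slack needed to absorb legitimate
# numerical error.

def reference_solution(x):
    """Analytic answer for a toy decay process."""
    return np.exp(-0.1 * x)

def model_step(x):
    """Numerical approximation with truncation error, plus a small bug:
    the decay coefficient should be 0.1, not 0.1001."""
    return (1.0 - 0.1001 * x / 100.0) ** 100  # 100 explicit Euler steps

# The tolerance must be loose enough to absorb truncation and round-off
# error, so the error introduced by the bug hides inside it.
TOLERANCE = 1e-2

x = np.linspace(0.0, 10.0, 50)
assert np.allclose(model_step(x), reference_solution(x), atol=TOLERANCE)
print("verification test passed, yet the implementation is wrong")
```

The point is not that tolerances are bad; they are unavoidable. It is that a passing tolerance-based test cannot, by itself, tell you whether the remaining discrepancy is approximation error or a genuine defect.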
Asking why climate modellers trust the code they write is one way of trying to understand what climate modellers are doing when they attempt to write good quality code. Given that they have such a radically different notion of requirements and correctness, how is their notion of software quality different? If you can't always write unit tests against the bulk of your work, and you can't always explicitly write down rules for correctness, what then do you mean by good code? I think it's important to start with these questions because the answers inform other questions about the usefulness of defect densities and of quality benchmarking. With a firmer idea of what quality actually is for climate modellers, we can then work on how best to measure or benchmark it.
To summarise, the primary question is:
What does software quality mean for climate modellers?
The software quality folks have come up with an impressive list of attributes of software quality known as the "ilities". Maybe a more specific version of the above question asks which quality attributes are most important for climate modellers.
I think there are two other companion questions that need to be asked:
How do climate modellers judge a piece of code against these quality attributes?
What practices do climate modellers follow to achieve high quality software (in terms of the identified quality attributes)?
If software quality were a game of football, the first question asks about the shape of the field and the rules of the game, the second asks about where the goal posts are, and the third asks about the playbook. Ahem.
So, how do I go about answering these questions?
I could ask the climate modellers directly. This assumes that they know the answers explicitly. I'm not sure I could answer the same questions for myself.
I could also look at defects. I've defined a defect before as "something worth fixing". Can we say this means a defect is a part of the software, created or omitted, that indicates a failure to satisfy the important quality attributes? If so, then looking carefully at a defect and its circumstances, and in particular asking the climate modellers questions about reported defects, might provide some of the basis for answering the above three questions, or at least the basis from which to ask more intelligent questions. That is, investigating why and when a piece of climate modelling software falls short might be the very place to look for exposed notions of quality, quality goals, and the practices used to manage them.
Would interviewing scientists about defects give the complete story? Certainly not. For at least these reasons:
- I'd only be able to consider a sample of the defects, and only interview a sample of the modellers those defects relate to.
- As noted in earlier posts, some defects may go unreported. Put another way, the selection of reported defects depends on the type of testing that is done, and not necessarily on the nature of the defect itself. That is, defects are not found if no one goes looking for them.
- Refining that point a bit: the defects that are found may only be associated with the subset of the quality attributes that are the least well managed. That is, software may show fewer defects related to quality attributes for which there is a well-functioning process in place. Those attributes would be under-represented in the reported defects and so may not seem important when, in fact, they are.
[1] I feel pretty strange talking with such authority. Please jump in if you know better.