Jon Pipitone

Paper: The T Experiments

Monday, April 6, 2009

L. Hatton, "The t experiments: errors in scientific software," Computational Science & Engineering, IEEE, vol. 4, no. 2, pp. 27-38, 1997.

Back in 1997, Les Hatton published a study composed of two experiments to test scientific software quality. The first, which he calls T1, involved the static analysis of over 100 pieces of scientific software from a broad range of application areas. The second experiment, T2, involved dynamic analysis of 9 different seismic data processing programs, each one supposedly designed to do the same thing. In a nutshell, the results suggest that scientific software has plenty of statically detectable faults (and the number of faults varied widely across the different programs analysed), and that there is significant and unexpected uncertainty in the output of this software -- agreement amongst the seismic processing packages is only to 1 significant digit. Hatton says, "Taken with other evidence, the T experiments suggest that the results of scientific calculations carried out by many software packages should be treated with the same measure of disbelief researchers have traditionally attached to the results of unconfirmed physical experiments."
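To make "agreement only to 1 significant digit" concrete, here is a minimal sketch of one rough way to count how many significant digits two outputs share. The definition (the floor of the negative base-10 log of the relative difference), the function name `agreeing_digits`, and the sample values are all assumptions for illustration; this is not necessarily how Hatton measured agreement in T2.

```python
import math


def agreeing_digits(a, b):
    """Roughly how many leading significant digits a and b share.

    Assumed definition: floor(-log10(relative difference)), clamped at 0.
    """
    if a == b:
        return math.inf  # identical values agree to every digit
    rel_diff = abs(a - b) / max(abs(a), abs(b))
    return max(0, math.floor(-math.log10(rel_diff)))


# Two made-up outputs for the same quantity from two hypothetical packages.
print(agreeing_digits(1.234, 1.301))  # prints 1
```

Under this definition, two packages reporting 1.234 and 1.301 for the same quantity agree to only one digit, which is the kind of spread the T2 result describes.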

Hatton makes the distinction in the paper between two types of defects: faults and failures. A fault is a statically detectable trouble area in the software code; "a misuse of the language which will very likely cause the program to fail in some context". A failure is a defect measured at runtime (for example, a program crash or an output inaccuracy). There is an uncertainty associated with software faults -- for many of them (say, for example, assigning a pointer to an integer variable) it's not a sure bet that they will "mature" into a runtime failure. Hatton weights each kind of fault with a severity rating which amounts to the likelihood the fault could cause a failure, a "rough risk factor". It's these weighted fault rates he publishes in the paper. He doesn't go into detail about the ratings, so I can't comment other than to say I'm already suspicious of something so subjective being used this way. More details of the T1 experiment are available in his book, L. Hatton, Safer C: Developing Software for High-Integrity and Safety-Critical Systems. New York, NY, USA: McGraw-Hill, Inc., 1995.
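The weighting idea can be sketched in a few lines. This is only an illustration under stated assumptions: the fault kinds, the weights in `SEVERITY_WEIGHTS`, and the per-1000-lines normalisation are all hypothetical, since the paper does not spell out the actual ratings.

```python
from collections import Counter

# Hypothetical severity weights: the estimated likelihood (0 to 1) that a
# fault of this kind matures into a runtime failure. These values are
# illustrative only and are NOT taken from Hatton's paper.
SEVERITY_WEIGHTS = {
    "pointer_assigned_to_int": 0.5,
    "uninitialised_variable": 0.8,
    "unreachable_code": 0.1,
}


def weighted_faults_per_kloc(fault_kinds, lines_of_code, weights=SEVERITY_WEIGHTS):
    """Return the severity-weighted fault count per 1000 lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    counts = Counter(fault_kinds)
    weighted_total = sum(weights[kind] * n for kind, n in counts.items())
    return weighted_total / (lines_of_code / 1000)


# Made-up static analysis findings for a 2000-line program.
faults = [
    "pointer_assigned_to_int",
    "pointer_assigned_to_int",
    "uninitialised_variable",
    "unreachable_code",
]
rate = weighted_faults_per_kloc(faults, lines_of_code=2000)
print(f"{rate:.2f} weighted faults per KLOC")
```

The sketch also shows why the ratings matter so much: change the weights and the same four findings give a very different published rate.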

I'd like to note that what Hatton is doing here in this study is looking at various bits of software on his own, at a single point in time, in order to objectively determine code quality. In my pitch about investigating climate modelling code quality I was thinking more about looking at the defects uncovered by the software users and developers themselves. The defects identified[1] by the climate scientists will contain both faults and failures. The faults found by scientists are sure to be relevant and important whereas the same is probably not true for all of the "generic" faults detected by static analysis software. Either way, both of these methods of gauging quality are incomplete in some sense: both use heuristics to find defects (if you consider the testing scientists do as a "heuristic").

This paper provides some partial statistics I could use to compare with climate models. Hatton used static analysis software from Programming Research Ltd., so it's at least feasible I could run the climate models through the same analysis. I'm not sure what the value of doing that would be just yet, partly because it's not clear what I'd be comparing against: Hatton analyses software from "40 application areas, including, for example, graphics, nuclear engineering, mechanical engineering, chemical engineering, civil engineering, communications, databases, medical systems, and aerospace."

[1] I suspect defects in one version will show up over the course of developing later versions, but some defects may be known ahead of time. I'm not sure about how to count these yet.

On Climate Model Software Quality

Monday, March 30, 2009

I've abandoned the search for a thesis surrounding an education tool for climate change. It just wasn't getting clearer where to go with it or how to pitch it from a software engineering angle.

So here's a new topic. Whilst spending last summer at the Hadley Centre, Steve made the preliminary observation that the defect density of the Hadley Centre's climate model appears to be surprisingly lower than the defect rates for comparably-sized projects. Does this observation hold up under scrutiny? What if we control carefully for project size and user and developer base size? If we were to compare the kinds of defects found in the other projects to those found in the Hadley GCM, surely we'd find that there are classes of defects that are rarely considered defects by the Hadley scientists (e.g. superficial bugs, like GUI defects). So what exactly do scientists consider as defects? Can these be characterised? If we only compared the defect density between projects over similar classes of defects, do we still see the lower defect rate in the climate model? How do other GCMs or climate models compare?
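Here is a minimal sketch of that like-for-like comparison, assuming defect density means defects per thousand lines of code (KLOC) and that every defect carries a class label. The `Project` type, the class labels, and all of the numbers are invented for illustration; none of them come from the Hadley Centre or any real project.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    name: str
    kloc: float  # size in thousands of lines of code
    defects: list = field(default_factory=list)  # one class label per defect


def defect_density(project, classes=None):
    """Defects per KLOC, optionally counting only the given defect classes."""
    if project.kloc <= 0:
        raise ValueError("kloc must be positive")
    counted = [d for d in project.defects if classes is None or d in classes]
    return len(counted) / project.kloc


# Entirely made-up data for two hypothetical 100 KLOC projects.
gcm = Project("hypothetical GCM", 100.0, ["numerical"] * 5 + ["crash"] * 5)
other = Project(
    "hypothetical other project",
    100.0,
    ["numerical"] * 5 + ["crash"] * 5 + ["gui"] * 30,
)

shared = {"numerical", "crash"}  # classes both groups would count as defects
for p in (gcm, other):
    print(p.name, defect_density(p), defect_density(p, classes=shared))
```

In this invented example the overall densities differ by a factor of four, but the gap disappears once only the shared classes are counted, which is exactly the effect the questions above are probing for.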

Regardless of the outcome of a more rigorous look at climate model defects, there are larger questions of software quality to explore. Namely, what is the underlying cause for the differences in defect density (whether the GCM defect density is, after all, better, worse, or comparable)? If the defect density is lower for climate models, one hypothesis may be that it's a result of the fact that climate scientists are both the users and developers of their software, so maybe they are more likely to catch defects early on. But then, we'd expect to see similar defect density patterns in open source software. Another hypothesis is that climate models are inherently more "resistant" to defects because of the powerful constraints put on them by the physical systems they simulate (e.g. conservation of mass and energy) and the extreme numerical sensitivity of the models. Or maybe the folks at Hadley have a great software engineering process that others need to learn from.

Thoughts? What am I missing?

There is a mountain of literature on the nature of scientific software quality, defect density, and related topics. I'm just starting into it now. Here's a glimpse into where I am:

L. Hatton, "The t experiments: errors in scientific software," Computational Science & Engineering, IEEE, vol. 4, no. 2, pp. 27-38, 1997.

My feed of interesting articles

Friday, March 13, 2009

Google Reader has a great feature whereby you can share articles from news feeds you find interesting. Here's a link to my shared items page, which you can subscribe to:

http://www.google.com/reader/shared/15421741800551159855

Of course, I could just create a blog post about the item and link to it, but this is much much easier and keeps my blog as a place to post (slightly!) more developed thoughts.

Climate Interactive goes to Copenhagen

Thursday, March 12, 2009

In December of 2009, the Kyoto Protocol gets revisited and remade at the Climate Conference in Copenhagen. Leading up to that conference is the scientific Climate Congress in Copenhagen, which is happening right now (in fact, today is the last day). This conference aims to synthesise "existing and emerging scientific knowledge necessary in order to make intelligent societal decisions concerning application of mitigation and adaptation strategies in response to climate change."

The folks from Climate Interactive are there and gave a presentation about their climate simulation software, C-ROADS, that's designed especially for use as a decision-making tool for policy makers. It's fast, simple to understand and use, and produces predictions in line with the accepted climate science. The great thing is that it's being used -- read their blog for lots of examples (in particular, see the recent post about John Kerry using C-ROADS).

In two of the extra slides of the Climate Interactive presentation, they state:

It is difficult for decision makers to:
aggregate diverse emissions reductions proposals into a single global emissions projection and
mentally simulate from that emissions projection the resulting atmospheric CO2 level or temperature increase.
Tools are needed to help decision makers assess whether policy options are sufficient to achieve goals for stabilizing CO2 levels and limiting global temperature increase to within a safe range.

This is quite similar to the motivation behind the educational tool idea I pitched a while back. Hrm..

Structured brainstorming exercise

Wednesday, March 11, 2009

Steve and I have been talking about how to generate ideas on how software engineering research can be applied to the climate crisis.

Here's an idea for a structured brainstorming exercise that's inspired by Edward de Bono's Six Thinking Hats technique. Instead of the thinking hats that de Bono suggests, we have a hat for each software engineering sub-discipline (e.g. requirements analysis, software design, formal methods, software testing/quality, etc.). The brainstorming is done with participants "wearing" only one of these Research Hats at a time; while wearing a hat the participant focuses on thinking about applying only their hat's research area to a given problem.

We gather a list of climate change issues, and get participants to brainstorm research problems that come out of each issue applicable to their research area.

Using research area Hats is purely a technique to focus the brainstorming conversations, so that they stay within software engineering and so that we make sure to get a variety of viewpoints on each issue. I imagine the exercise carried out in teams, with each team taking turns all wearing the same hat, and flipping through the various issues. It also makes sense to dispense with the research hat idea if the participants are experts in one particular sub-discipline, in which case the exercise is simply, "How can your research, or research discipline, be applied to this issue?"

Our weekly climate change chats have explored some rough ideas of the issues. See these posts for my thematic summary of our discussions.

Climate policy engineering

Sunday, March 8, 2009

Attention: wild speculation and provocative operations below.

Steve and I met last week and discussed a rather intriguing and far-out topic: the relationship between software design and global climate policy design. I'll lay it out rather straightforwardly and unconditionally:

Software systems are some of the most complex systems that humans design. Taking a global view, the world's climate policies also form a complex system of human design. Are there any design or process techniques from software engineering that can be applied to climate policy design?

(In short, just "s/software/climate policy/g" and see what makes sense.)
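For anyone who doesn't read sed, that expression means "replace every occurrence of 'software' with 'climate policy'". A minimal Python sketch of the same substitution, where the helper name `swap_topic` is made up for illustration:

```python
import re


def swap_topic(text, old="software", new="climate policy"):
    """Replace every occurrence of `old` with `new`, like sed's s/old/new/g."""
    # re.escape treats `old` as literal text; the lambda inserts `new` verbatim.
    return re.sub(re.escape(old), lambda _match: new, text)


question = "What can we use from the design of software architectures?"
print(swap_topic(question))
```

Like the sed original, the match is case-sensitive, so a sentence starting with "Software" is left alone.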

It doesn't seem like climate policy is built nearly as deliberately or methodically as software often is. You might say that it's "hacked". Can insight from software development processes be used to guide the process of planetary climate policy development?

Software designers use an architecture to cope with the complexity of large software systems (btw, is this a form of chunking?). Climate policy (policy in general?) lacks this sort of architecture (or maybe lacks a good one?). What can we use from the design of software architectures to help us design good climate policy? Are there design patterns in effective climate policy that match up with the software engineer's idea of software design patterns?

You can run the analogy along all the different aspects of software engineering: requirements engineering, design, testing, quality, development process, etc...

Does the analogy work? Also, can we go the other way around and import useful concepts from the way climate policy is designed into software engineering? Ugh.. ;-)

I suppose what we're doing here is applying systems engineering thinking to climate policy design by way of our knowledge of software engineering (as one type of systems engineering).

By the way, Steve has just blogged a much more coherent statement of how software engineering can play a role in fighting climate change.

Lakoff, Systems Thinking, Obama, Sweeney

Thursday, March 5, 2009

Check out Linda Sweeney's recent blog post on Systems Thinking + The Obama Code. She elaborates on one of the seven characteristics that George Lakoff identifies as part of the "Obama Code": "seven crucial intellectual moves [by Obama] that [Lakoff believes] are historically, practically, and cognitively appropriate, as well as politically astute."