L. Hatton, "The T experiments: errors in scientific software," IEEE Computational Science & Engineering, vol. 4, no. 2, pp. 27-38, 1997.
Back in 1997, Les Hatton published a study composed of two experiments to test scientific software quality. The first, which he calls T1, involved the static analysis of over 100 pieces of scientific software from a broad range of application areas. The second experiment, T2, involved dynamic analysis of 9 different seismic data processing programs, each one supposedly designed to do the same thing. In a nutshell, the results suggest that scientific software has plenty of statically detectable faults (and the number of faults varied widely across the different programs analysed), and that there is significant and unexpected uncertainty in the output of this software -- agreement amongst the seismic processing packages is only to 1 significant digit. Hatton says, "Taken with other evidence, the T experiments suggest that the results of scientific calculations carried out by many software packages should be treated with the same measure of disbelief researchers have traditionally attached to the results of unconfirmed physical experiments."
Hatton makes the distinction in the paper between two types of defects: faults and failures. A fault is a statically detectable trouble area in the code; "a misuse of the language which will very likely cause the program to fail in some context". A failure is a defect measured at runtime (for example, a program crash or an output inaccuracy). There is uncertainty associated with software faults -- for many of them (say, for example, assigning a pointer to an integer variable) it's not a sure bet that they will "mature" into a runtime failure. Hatton weights each kind of fault with a severity rating that amounts to the likelihood the fault could cause a failure, a "rough risk factor", and it's these weighted fault rates he publishes in the paper. He doesn't go into detail about the ratings, so I can't comment other than to say I'm already suspicious of something so subjective being used this way. More details of the T1 experiment are available in his book: L. Hatton, Safer C: Developing Software for High-Integrity and Safety-Critical Systems. New York, NY, USA: McGraw-Hill, Inc., 1995.
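To make the fault/failure distinction concrete, here is a minimal C sketch of the pointer-into-integer example above. This is my own illustration, not code from Hatton's paper; the point is that a static analyser can flag the assignment as a fault even though the program might run without any visible failure on a platform where a pointer happens to fit in an int.

    #include <stdio.h>

    int main(void)
    {
        double data[3] = {1.0, 2.0, 3.0};

        /* Fault: storing a pointer in a plain int. A static analyser can
           flag this misuse of the language, yet the program may still run
           correctly wherever an int is wide enough to hold a pointer, so
           the fault has not (yet) matured into a failure. */
        int addr = (int) &data[0];

        /* On a 64-bit platform the cast back reconstructs a truncated
           address, and the fault finally surfaces as a failure (garbage
           output or a crash). */
        double *p = (double *) addr;
        printf("%f\n", *p);

        return 0;
    }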
I'd like to note that what Hatton is doing in this study is looking at various bits of software on his own, at a single point in time, in order to determine code quality objectively. In my pitch about investigating climate modelling code quality I was thinking more about looking at the defects uncovered by the software's users and developers themselves. The defects identified[1] by the climate scientists will include both faults and failures. The faults found by scientists are sure to be relevant and important, whereas the same is probably not true for all of the "generic" faults detected by static analysis software. Either way, both of these methods of gauging quality are incomplete in some sense: both use heuristics to find defects (if you count the testing scientists do as a "heuristic").
This paper provides some partial statistics I could use to compare with climate models. Hatton used static analysis software from Programming Research Ltd., so it's at least feasible I could run the climate models through the same analysis. I'm not sure yet what the value of doing that would be, partly because it's not clear what I'd be comparing against: Hatton analyses software from "40 application areas, including, for example, graphics, nuclear engineering, mechanical engineering, chemical engineering, civil engineering, communications, databases, medical systems, and aerospace."
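If I did run climate model code through a static analyser, the number I'd presumably end up comparing against Hatton's figures is something like a weighted fault rate: per-category fault counts multiplied by severity weights and normalised by code size. The sketch below is just my guess at that arithmetic; the categories, counts, weights, and code size are made up for illustration, since Hatton doesn't publish his ratings.

    #include <stdio.h>

    /* Hypothetical fault categories with made-up severity weights
       (the "rough risk factor" that a fault matures into a failure). */
    struct fault_category {
        const char *name;
        int         count;   /* faults of this kind found by the analyser */
        double      weight;  /* 0..1 likelihood of causing a failure */
    };

    int main(void)
    {
        struct fault_category faults[] = {
            { "pointer stored in integer", 12, 0.8 },
            { "uninitialised variable",    30, 0.5 },
            { "unreachable code",           7, 0.1 },
        };
        double kloc = 110.0;   /* size of the analysed code, in KLOC */
        double weighted = 0.0;

        for (size_t i = 0; i < sizeof faults / sizeof faults[0]; i++)
            weighted += faults[i].count * faults[i].weight;

        /* Weighted faults per thousand lines of code: the kind of number
           that could be compared across packages or application areas. */
        printf("weighted fault rate: %.2f per KLOC\n", weighted / kloc);
        return 0;
    }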
[1] I suspect defects in one version will show up over the course of developing later versions, but some defects may be known ahead of time. I'm not sure yet how to count these.