Paper: Reexamining the fault density-component size connection

Monday, April 13, 2009

L. Hatton, "Reexamining the fault density component size connection," Software, IEEE, vol. 14, no. 2, pp. 89-97, 1997.

By summarising several studies, Hatton concludes that the number of faults has a logarithmic relationship to code size (or, more generally, code complexity). One implication of this is that smaller components have a higher density of faults (since the logarithmic curve rises sharply at first, and then grows more slowly). The interesting bit of the paper comes when Hatton models these observations on a psychological basis, beginning with G. Miller's observation that a person can only cope with 7±2 independent pieces of information in short-term memory. Hatton's model specifies one behaviour (logarithmic) for components as they increase in size up to the "Miller threshold" (Hatton uses a capital Omega prime symbol to denote this), and a second behaviour (quadratic) for components as they increase past this threshold.
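To make the shape of that model concrete, here's a tiny Python sketch. The threshold and constants below are made-up numbers chosen only to illustrate the qualitative behaviour described above (they are not Hatton's fit), and I'm treating "size" as lines of code for simplicity.

import numpy as np

# Hypothetical constants, for illustration only -- not Hatton's fit.
OMEGA = 200.0            # "Miller threshold", here in lines of code
a, b, c = 5.0, 2.0, 1e-3

def faults(size):
    """Expected faults per component: logarithmic below the threshold,
    quadratic above it, joined continuously at size == OMEGA."""
    size = np.asarray(size, dtype=float)
    below = a + b * np.log(size)
    above = (a + b * np.log(OMEGA)) + c * (size - OMEGA) ** 2
    return np.where(size <= OMEGA, below, above)

def fault_density(size):
    """Faults per line: highest for very small components, because the
    logarithmic curve rises sharply at first."""
    return faults(size) / np.asarray(size, dtype=float)

print(fault_density([20, 100, 200]))   # density falls as size grows toward OMEGA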

This paper is simultaneously cool and confusing. Cool because Hatton is trying to rigorously use our knowledge from psychology to explain the underlying causes of defect density, rather than just fit an equation to the data. And he tries to back up his models with lots of data, or to clearly state when he can't back them up without doing more empirical work.

But the paper is also confusing for a number of reasons. Some are technical: many of the figures are hard to read, the axis titles are vague or ambiguous, one plot is actually missing an x-axis, and another plot's description conflicts with the description given in the paper text. Others are less technical: Hatton switches between different models throughout the paper and isn't very clear about which situation he's discussing. I had to re-read the paper a few times because of this (as always, YMMV, maybe it's just me being slow). He's also very vague at points, especially when it comes to explaining why fault density ought to be proportionately higher for components below the Miller threshold. He says,
"... if a system is decomposed into pieces much smaller than the short-term memory cache, the cache is used inefficiently because the interface of such a component with its neighbours is not 'rehearsed' explicitly into the cache in the same way, and the resulting components tend to exhibit higher defect densities."
Erm... that's all we get on the matter. In short, I like what he's trying to do here by backing up his models; he just doesn't convince me in this paper that he's got the right explanation.

That said, he concludes that there is an optimum component size range to achieve low fault density. The range differs across languages (because some languages pack less information in per line so the Miller threshold is reached sooner if you're just counting lines) and across programmers (because Miller's magic number is actually a range from 5-9 pieces of information). He mentions several implications of this that make for testable hypotheses. For instance, "manual [code] inspections would be most effective on components that fit into cache", and "only substantial reuse within the same system will likely improve reliability. Modest reuse ... is likely to make it worse."
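As a back-of-the-envelope illustration of the language effect, here's another quick sketch using the same made-up constants as above. It just scans for the component size with the lowest fault density when the Miller threshold corresponds to a different number of lines, as it would for languages that pack more or less information into each line:

import numpy as np

# Same hypothetical constants as the earlier sketch; illustration only.
a, b, c = 5.0, 2.0, 1e-3

def optimum_size(omega_lines):
    """Component size (in lines) with the lowest fault density, given a
    Miller threshold expressed as a number of lines."""
    sizes = np.linspace(10, 2000, 2000)
    below = a + b * np.log(sizes)
    above = (a + b * np.log(omega_lines)) + c * (sizes - omega_lines) ** 2
    density = np.where(sizes <= omega_lines, below, above) / sizes
    return sizes[density.argmin()]

# Hypothetical thresholds for a terse and a verbose language:
for lang, omega in [("terse language", 150), ("verbose language", 400)]:
    print(lang, "-> lowest fault density near", round(optimum_size(omega)), "lines")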

Paper: A Critical Look at Quality in Large-Scale Simulations

Tuesday, April 7, 2009

D. E. Stevenson, "A critical look at quality in large-scale simulations," Computing in Science & Engineering, vol. 1, no. 3, pp. 53-63, 1999.

Here's a wandering article that has lots of great thoughts but never seems to pull them together tightly enough for me to come up with any solid, unified take-aways. Stevenson sets out to describe and tackle the friction, and danger, created by the differing ideas of simulation quality held by management and scientists. He's coming at this problem from the perspective of ASCI (Accelerated Strategic Computing Initiative) science -- that is, the folks charged with "predicting, with confidence, the behaviour of nuclear weapons through comprehensive, science-based simulations." After spending some time discussing the disconnect in the understanding of simulation quality and the resulting problems that this disconnect creates, Stevenson takes a step back and looks at the general modelling and simulation endeavour itself. He explores what modelling and simulation is, why we do it and what we should hope to gain from doing it, and how these two things ought to inform our notion of quality. He also provides us with some observations on why building high-quality simulations is probably difficult, discusses what validation and verification are, and distinguishes between two types of quality. Phew.

Here's Stevenson's summary of the article:
"Software engineering is meant to produce software by a manufacturing paradigm, but this paradigm simply cannot deal with the scientific issues. This article examines the successes and failures of software engineering. I conclude that process does not develop software, people and their tools do. Second, software metrics are not meaningful when the software's purpose is to guarantee the world's safety in the nuclear era. Finally, the quality of simulations must be based on the quality of insights gained from the revealed science."
Okay, some key points to mention. On the topic of what modelling and V&V are, Stevenson introduces some clear terminology. He defines validation and verification in terms of three different types of systems: observational (the world out there -- e.g. the climate), theoretical (our model/theory of the workings of the world -- e.g. the equations that describe climate processes), and calculational (the implementation of the theoretical model -- e.g. the climate model code). Validation checks that the theoretical system properly explains the observational system, and verification checks that the calculational system correctly implements the theoretical system. Stevenson then uses the term validation in a broader sense, stating that "complete validation of the observational-theoretical-calculational systems requires that we compute the right numbers for the right reasons."

On the nature of quality, Stevenson points out that our reasons for modelling need to inform our notions of quality, and that the divide between management and science occurs because this isn't happening. We model, and validate those models, in order to gain insight into the nature of whatever it is that we're modelling (i.e. the observational system). Insight is the essential purpose of science, and simulations are just tools to gain insight. Insight and modelling are the products of science. But, from a manufacturing perspective (read: a management perspective), insight isn't essential for building a model and a simulation. A model can just be seen as a specification (not as a product itself), and a simulation as a final product. Thus, from an engineering management position, validation and insight take a back seat -- the real problem is one of manufacturing. And so scientists and management are looking at the same process at cross-purposes.

If insight is the end goal of simulation computing, then the quality of computing can be measured by the quality of insight. Since insight leads to knowledge, we can judge the quality of insight by the quality of knowledge we get from a project. How do we judge the quality of knowledge we get? Well... frankly, this is where I lose track of the article a bit. Either I'm just dense, Stevenson is intentionally vague, or he's put the real content in another paper of his: D. E. Stevenson, "Science, computational science, and computer science: at a crossroads," in CSC '93: Proceedings of the 1993 ACM Conference on Computer Science. New York, NY, USA: ACM Press, 1993, pp. 7-14.

What he does say is that whilst we know what scientific and mathematical knowledge looks like (inductive and deductive, respectively), we don't really know what knowledge from computer science looks like. He references a few "principles" of computing knowledge from the paper I just mentioned: physical exactness (elimination of parameterisations), computability, and bounded errors (a priori or a posteriori error estimates). I'll have to read that paper before I can say much more about that...

Stevenson goes on to describe two kinds of quality, intrinsic and internal. I have to say I'm not quite sure I understand the distinction very well. Here's my take. Stevenson defines intrinsic quality as "the sum total of our faith in the system of models and machines." He says about internal quality, "each dimension [of insight and knowledge we receive from the simulation?], such as the mathematics or the physics, has its own idea of internal quality."

I think what he's doing here is making a distinction in quality that's analogous to the distinction between verification and validation in that intrinsic quality applies to the match between observational and theoretical systems, and internal quality applies to theoretical and calculational systems. Intrinsic quality is an epistemological notion of a good modelling endeavour. It is what we're talking about when we ask: regardless of any possible implementation, what needs to be present in any model and implementation for the modelling effort to be a good one in terms of getting us insight and knowledge?

Internal quality looks at the quality issue from the other side. It assumes (or disregards) intrinsic quality, and focuses just on how good our model and implementation is in terms of the kinds of knowledge we have already. For a mathematician or scientist in general, internal quality may relate to the simplicity or elegance of the model. For a computer scientist or engineer, internal quality may relate to the simplicity or robustness of the code.

Stevenson's point is, I think, that ultimately we computer scientists don't have a clear justification for our measures of quality. If insight and knowledge is the end goal of modelling, we need to have a clear sense of intrinsic quality in our endeavour. Then we need to use this understanding to inform our measures of internal quality. Otherwise we're just measuring things because we can, not because they show us the way to better science.

Paper: Predicting Defects for Eclipse

T. Zimmermann, R. Premraj, and A. Zeller, "Predicting defects for eclipse," in PROMISE '07: Proceedings of the Third International Workshop on Predictor Models in Software Engineering. Washington, DC, USA: IEEE Computer Society, 2007, pp. 9+.

In this study, Zimmermann et al. map defects found in the Eclipse bug database to the source code, for both pre- and post-release defects. They also calculate several complexity metrics for each file and package, and then explore how those metrics correlate with pre- and post-release defect counts, and briefly how they can be used to predict defect proneness. All of their data is published online. Among other things, their results show a strong correlation between pre- and post-release defects (a buggy package/file is still buggy after release); all complexity measures are positively correlated with pre- and post-release defect counts (a more complex package/file has more defects); and it's possible to learn (linear regression-wise) reasonable models to assess defect proneness for later releases by looking only at a single release.
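For a sense of what "linear regression-wise" amounts to, here's a toy sketch of that kind of model. The metric columns and the numbers are invented for illustration; they are not the paper's data.

import numpy as np

# Toy data: rows are files; columns are pre-release defects, lines of
# code, and methods per file. All values are hypothetical.
X = np.array([
    [3, 1200, 45],
    [0,  200,  8],
    [7, 3400, 90],
    [1,  450, 15],
    [5, 2100, 60],
], dtype=float)
y = np.array([4, 0, 9, 1, 6], dtype=float)   # post-release defects

# Ordinary least squares with an intercept column.
A = np.hstack([np.ones((X.shape[0], 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Predict post-release defects for a hypothetical new file.
new_file = np.array([1.0, 2, 900, 30])       # intercept term + metrics
print("predicted post-release defects:", new_file @ coef)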

What's interesting to me about this study is the definition of defect used here, and the method of counting defects. Zimmermann et al. define a defect by the bug report and the associated code change that fixes it. In this way, a defect is defined as anything worth fixing. In Hatton's terms, this definition covers both faults and failures, but limits them to only those problems the users and developers find relevant.

Programmatically counting defects is done in two steps: first, fixes are identified by searching through the version control change log for entries that contain references to bugs (e.g. "'fixed 42233' or 'bug #23444'"); second, the release the fix applies to is determined by looking up the bug report in the bug tracking system. This method could be adapted to any project where the developers consistently mark fixes with a reference to the bug tracking system or release number (including posting comments in the code). The authors reference three other papers which use a similar technique.
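The first step is easy to picture in code. Here's a rough Python sketch (not the authors' actual tooling -- the regex and the log messages below are made up):

import re

# Hypothetical pattern for commit messages that look like bug fixes.
BUG_PATTERN = re.compile(r"(?:fixed|bug)\s*#?\s*(\d{4,6})", re.IGNORECASE)

def fixes_in_log(messages):
    """Yield (message, bug_id) pairs for log entries that look like fixes."""
    for msg in messages:
        for match in BUG_PATTERN.finditer(msg):
            yield msg, int(match.group(1))

# Made-up change log entries:
log = [
    "fixed 42233: NPE in the compare editor",
    "bug #23444 - update build scripts",
    "refactor package exports",
]
for msg, bug_id in fixes_in_log(log):
    print(bug_id, "<-", msg)

# Step two would then look each bug_id up in the bug tracker to find
# which release the fix applies to.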

Paper: The T Experiments

Monday, April 6, 2009

L. Hatton, "The t experiments: errors in scientific software," Computational Science & Engineering, IEEE, vol. 4, no. 2, pp. 27-38, 1997.

Back in 1997, Les Hatton published a study composed of two experiments to test scientific software quality. The first, which he calls T1, involved the static analysis of over 100 pieces of scientific software from a broad range of application areas. The second experiment, T2, involved dynamic analysis of 9 different seismic data processing programs, each one supposedly designed to do the same thing. In a nutshell, the results suggest that scientific software has plenty of statically detectable faults (and the number of faults varied widely across the different programs analysed), and that there is significant and unexpected uncertainty in the output of this software -- agreement amongst the seismic processing packages is only to 1 significant digit. Hatton says, "Taken with other evidence, the T experiments suggest that the results of scientific calculations carried out by many software packages should be treated with the same measure of disbelief researchers have traditionally attached to the results of unconfirmed physical experiments."
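For concreteness, here's one crude way to express that kind of "agreement to N significant digits" comparison (the definition and the nine outputs below are mine, not Hatton's):

import math

def agreeing_sig_digits(values):
    """Rough count of leading significant digits on which a set of values
    agree: compare the spread of the values to their magnitude."""
    lo, hi = min(values), max(values)
    if lo == hi:
        return float("inf")          # exact agreement
    scale = max(abs(lo), abs(hi))
    if scale == 0:
        return 0
    return max(0, int(math.floor(math.log10(scale / (hi - lo)))))

# Hypothetical outputs from nine packages for the same seismic sample:
outputs = [0.91, 0.88, 0.95, 0.87, 0.93, 0.90, 0.86, 0.94, 0.89]
print(agreeing_sig_digits(outputs))  # -> 1, i.e. one significant digit of agreement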

Hatton makes the distinction in the paper between two types of defects: faults and failures. A fault is a statically detectable trouble area in the software code; "a misuse of the language which will very likely cause the program to fail in some context". A failure is a defect measured at runtime (for example, a program crash or an output inaccuracy). There is uncertainty associated with software faults -- for many of them (say, for example, assigning a pointer to an integer variable) it's not a sure bet that they will "mature" into a runtime failure. Hatton weights each kind of fault with a severity rating which amounts to the likelihood the fault could cause a failure, a "rough risk factor". It's these weighted fault rates he publishes in the paper. He doesn't go into detail about the ratings, so I can't comment other than to say I'm already suspicious of something so subjective being used this way. More details of the T1 experiment are available in his book: L. Hatton, Safer C: Developing Software for High-Integrity and Safety-Critical Systems. New York, NY, USA: McGraw-Hill, Inc., 1995.
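For what it's worth, a weighted fault rate of this sort is simple enough to compute. Since Hatton doesn't publish the details of his weighting, the fault kinds, counts, and weights below are entirely made up:

# Hypothetical static-analysis findings: fault kind -> (count, weight),
# where the weight is a rough 0-to-1 risk that the fault matures into a failure.
findings = {
    "pointer assigned to integer": (12, 0.8),
    "uninitialised variable":      (7,  0.9),
    "unused function parameter":   (40, 0.1),
}
ksloc = 25.0   # thousands of source lines analysed (also hypothetical)

weighted_rate = sum(count * weight for count, weight in findings.values()) / ksloc
print(f"weighted faults per KSLOC: {weighted_rate:.2f}")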

I'd like to note that what Hatton is doing here in this study is looking at various bits of software on his own, at a single point in time, in order to objectively determine code quality. In my pitch about investigating climate modelling code quality I was thinking more about looking at the defects uncovered by the software users and developers themselves. The defects identified[1] by the climate scientists will contain both faults and failures. The faults found by scientists are sure to be relevant and important whereas the same is probably not true for all of the "generic" faults detected by static analysis software. Either way, both of these methods of gauging quality are incomplete in some sense: both use heuristics to find defects (if you consider the testing scientists do as a "heuristic").

This paper provides some partial statistics I could use to compare with climate models. Hatton used static analysis software from Programming Research Ltd., so it's at least feasible that I could run the climate models through the same analysis. I'm not sure what the value of doing that would be just yet, partly because it's not clear what I'd be comparing against: Hatton analyses software from "40 application areas, including, for example, graphics, nuclear engineering, mechanical engineering, chemical engineering, civil engineering, communications, databases, medical systems, and aerospace."

[1] I suspect defects in one version will show up over the course of developing later versions, but some defects may be known ahead of time. I'm not sure how to count these yet.