Counting defects

Friday, May 29, 2009

I've talked about the issues with using defect counts for judging and benchmarking software quality. I do, however, still think it's worth doing an investigation into the defect counts of climate modelling software because:
  1. It will force me to get my hands into the code and bug reports. My hope is that even a basic familiarity with these things will help me understand issues of quality for computational scientists. It may also give me a bit of currency in discussions with scientists if I have some understanding of the details.
  2. The results may be useful to the individual climate modelling groups, as a gauge for quality within their group.
  3. Doing the study furthers a dialogue between computer scientists, computational scientists, and climate modelling groups.
  4. I might end up with something useful!
    • As I say in point #1, I might actually gain some insight about computational science software quality. ;-)
    • Aside from comparing defect densities, I might find other ways to use this data for benchmarking. For instance, at the workshop for software quality at ICSE, Elmar Juergens spoke about how, in judging quality, absolute values suck (that might have been exactly what he said), and how trend analysis is much better. He was speaking from the point of view of process improvement. But this raises an interesting idea: if we redefine software quality as "a good software development process" (whatever that means), maybe we could use aspects of quality trends as points of comparison between projects.
As I mentioned in a previous post on Fenton and Ohlsson's work, I'll need to be very specific about how I measure defects to do it. First off, defects come in two major flavours: faults (statically identifiable errors, including dangerous uses of the language) and failures (observable, runtime errors). Most of the papers I've read[1] count defects by looking at bug reports and version control comments. Since a bug report or fix in a repository can cover both faults and failures, a defect in these cases is maybe best described as a problem worth fixing. I've only come across one paper so far that measures faults, and that's Hatton's paper (linked to above, see his slides for more details). Hatton used static analysis software from Programming Research Ltd., which I could run with.

The way that the folks in [1] count defects is simply a matter of counting bug reports, or counting the number of check-in comments that say "fixed", "bug" (or other keywords that suggest a fix for a bug). Some papers count defects before a release is made (pre-release) and others count defects against a release (post-release). What makes a bug pre- or post-release is a matter of opinion: some papers go by how it's marked in the bug database, others set a threshold of days before and after a release date with which to categorise bugs. Some papers explicitly mention that defects are counted only if they have been fixed (i.e. just reporting a defect isn't enough) whereas other papers aren't clear about this. Finally, some papers only consider defects logged against certain areas of the software as worth counting (for instance, an installation problem may not be counted but a UI problem would be). Phew.

I'm sure there are more dimensions I haven't considered!
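To make the keyword-matching approach concrete, here's a minimal sketch of the sort of counting involved (the keyword list and log format are my own assumptions, not something the papers in [1] prescribe):

```python
import re

# Count check-in comments that look like bug fixes. The keyword list is an
# assumption on my part; each study uses its own set of heuristics.
FIX_KEYWORDS = re.compile(r"\b(fix(e[sd])?|bug|defect|fault|patch)\b", re.IGNORECASE)

def count_fix_commits(log_messages):
    """Count version control comments that suggest a fix for a bug."""
    return sum(1 for msg in log_messages if FIX_KEYWORDS.search(msg))

log = [
    "Fixed overflow in radiation scheme",
    "Add new diagnostics output",
    "bug #123: wrong sign in flux calculation",
]
print(count_fix_commits(log))  # -> 2
```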

[1] A sampling: Koru et al., 2007; Fenton & Ohlsson, 2000; Kaaniche & Kanoun, 1996; Zimmerman et al., 2007

Climate Interactive -- Public simulator and Webinar

This is essentially an RT (What do you call this for blogs? RB?).

You've heard about Climate Interactive before from me, me, me, and Steve -- they build simple climate simulations to be used in political decision-making. They've recently released an even more accessible version of their simulation software:

C-Learn is the more accessible, online version of the C-ROADS simulation, which was recently seen in US State Department Special Envoy Jonathan Pershing's plenary address in Bonn Germany to the UNFCCC. Now you can explore how changes in fossil fuel emissions from three parts of the world, plus deforestation and afforestation, will affect CO2 concentrations, global temperature, and sea level rise. And you can make your own graphs to show others your simulation experiments.

It also means that anyone who wants to run a mock-UN negotiation policy exercise such as "the Copenhagen Climate Exercise" has a simulator to back them up.

Try the sim, give us feedback, and let us know how you are using it! Send it to climateinteractive [at] sustainer [dot] org, or submit your information on our Climate Interactive site.

You can't play with C-ROADS online, but there is an upcoming public webinar to show off and discuss the software on June 3rd @ 11:00am EST. Details:
This session will include interactive demonstrations of C-ROADS (we’ll ask “what if” questions of the simulator), description of past and potential applications, and an open discussion about how open architecture simulators could help contribute to effective policy design and dialogue towards a stable climate, particularly aiming towards Copenhagen in 2009.

On quality in scientific software

Thursday, May 28, 2009

In yesterday's post on the trouble of benchmarking software quality, I gave a list of reasons why just looking at defect counts isn't likely to give us enough information to benchmark software quality.

What I didn't do was give any reasons that are specific to scientific software. For that I'm going to first refer you to slides from Daniel Hook's presentation at the SECSE '09 Workshop. Daniel[1] described scientific software development as a number of model refinements: from measurements to theory to algorithms to source code to machine code. At each refinement there are different types of acknowledged errors due to simplifying assumptions, truncation and round-off errors, etc. There are also unacknowledged errors that come from concrete or conceptual mistakes. Validation and Verification activities attempt to weed out these errors, but the testing process is frustrated by two problems unique(?) to scientific software:
  1. The Oracle Problem: "In general, scientific software testers only have access to approximate and/or limited oracles". As Neil points out in a comment, the output of scientific software is often what you're looking for -- the results of an experiment. If you knew exactly what you ought to get, you would not be doing science.
  2. The Tolerance Problem: "If an output exhibits acknowledged error then a tester cannot conclusively determine if that output is free from unacknowledged error: i.e., it can never be said that the output is "correct." This complicates output evaluation and means that many test techniques cannot be (naively) applied."
Both of these problems mean that even defining a defect is tricky. It's not like other domains, where you have the notion of a clear set of requirements (however plausible that idea is!) to test against.
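To see how the Tolerance Problem frustrates ordinary testing, here's a toy sketch (the tolerance and values are invented purely for illustration): a passing comparison can't tell acknowledged error from a small unacknowledged one, and a failing comparison may just mean the tolerance was too tight.

```python
import math

# A toy illustration of the Tolerance Problem: with only an approximate
# oracle, the best a test can do is check agreement within some tolerance.
# The tolerance and reference value here are invented for illustration.
def within_tolerance(computed, reference, rel_tol=1e-3):
    """Pass if the computed value agrees with the approximate oracle to
    within a relative tolerance; this says nothing about whether any
    remaining discrepancy is acknowledged or unacknowledged error."""
    return math.isclose(computed, reference, rel_tol=rel_tol)

print(within_tolerance(1.0004, 1.0))  # True -- but is the 0.04% gap round-off or a bug?
print(within_tolerance(1.0100, 1.0))  # False -- failed test, or just a too-tight tolerance?
```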

[1] Strange, now that I've met Daniel I find it more comfortable to use his first name rather than referring to him as Hook 09. ;-) Maybe also it's because I'm referring to a presentation.

Why I came back to graduate school

This will be a rather personal post. Earlier this year I made the choice to return to graduate school. I'm damn lucky to even have this choice to make, of course. That fact added a level of pressure to get the decision right.

I had left the Master's program the previous year after completing eight months, and all my courses. I felt, then, entirely at odds with the lifestyle, ineffectual and useless in any research directions, generally disconnected from the activities that keep me sane and feeling worthwhile. Looking back, I was probably also very burnt out after an intense summer season of farm work, and an even more intense semester of school work. It was an incredibly tough eight months.

So why have I come back? (And why did I not choose to continue my farming path, or to do other work with more obvious direct impact?) Here is what I wrote in my journal late last year:
  • I'm interested in the subject area. My guts are just put together this way I guess.
  • My work here may or may not have impact, but to some extent the same can be said of any work.
  • I believe, or want to believe, in the academic process -- that the impact from it can be great ("blue sky" vs. applied research).
  • I will get insight by direct experience as to how research works and the academic culture. I hope to answer the question of whether I fit with research work, and if research work fits with me.
  • I'll get the chance to inspire other CS students and researchers to be more socially relevant. I'll join a, hopefully, curious community and bring with me a different perspective and set of experiences which I hope will be of some benefit.
  • My work or research may lead me to other worthwhile work which I'm unaware of now.
  • It's a steady job and a familiar lifestyle from which I can explore other jobs and lifestyles.
Generally, it came down to dropping my expectations. I wrote out these points to summarise:
  • A master's thesis that doesn't have direct impact does not preclude doing work that does (either alongside my thesis work, or eventually).
  • My time here may have unforeseeable positive impact in the future, or it may turn me on to more relevant research or work.
  • Or I may achieve nothing other than spending a few months learning about how other interesting people live.
Finally, with all of that said, I know myself better than to think it would be enough just to realise these things. I wrote down a list of things that I was worried about, and would need to be conscious of -- a list of things I knew I had to be prepared for:
  • I am in an environment (and a place: UofT) which holds many of my old habits; habits that I no longer find helpful, or like. Being back means having to work to change these habits. Again.
  • I'm signing up for an indoor, contemplative life. Not the outdoor, action-oriented life I'm used to now. I need a balance, and I'll have to work really hard to get it because this won't be the norm here.
  • Balance also in my work: between my research, other exploratory work/jobs, and direct-action work. My volunteer and activism work is immediately satisfying and keeps me tethered to a sense of reality.
  • Self-esteem. I'm signing up for work that will/may challenge my skill, diligence, and (as a result) ego. I will need to work hard (the easy part) and work smart (less easy, hopefully I'll learn this) and be supportive of myself (even less easy, but I know how to do this much better now).

On the trouble of benchmarking software quality

Wednesday, May 27, 2009

All during ICSE and over the past few days as I settle in at UVic, I've been talking with folks about my research ideas to do with investigating the quality of scientific software. I've had my head in papers on defect densities over the past few weeks, and so this has been a very helpful opportunity to take a step back, and get some discussion going over what exactly my focus ought to be and how to go about the investigation. I'd like to tell the story of where my thinking has taken me -- partly to let you all in on it, and partly as a move to help me organise my thoughts. This may be a bit messy. ;-) And may span a few posts.

As described in the post linked to above, I started out on this research track with the fairly concrete idea of benchmarking the software quality of climate modelling software by using defect densities as a comparison tool. Using defect counts and densities is a common way of measuring software quality -- the fewer the defects, the higher the quality. The thinking is this: if climate models turn out to generally be of higher quality than other similarly-sized software projects (measured via defect density), then that's a strong indication that something interesting is going on in the climate modellers' software development process -- and we'd better take a further look. If the modelling software is generally worse, that's also interesting and worth investigating (I mean, hey, we software engineers might have something to contribute!).
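Just to be concrete about the arithmetic behind the comparison, here's a minimal sketch (the project names and numbers are entirely made up):

```python
# A sketch of the naive benchmark I had in mind: defect density as
# post-release defects per thousand lines of code (KLOC). The projects
# and numbers below are hypothetical, purely to show the arithmetic.
def defect_density(defect_count, sloc):
    """Defects per KLOC."""
    return defect_count / (sloc / 1000.0)

projects = {
    "climate_model_A": (150, 400_000),   # (post-release defects, SLOC) -- hypothetical
    "commercial_app_B": (900, 400_000),  # hypothetical
}
for name, (defects, sloc) in projects.items():
    print(f"{name}: {defect_density(defects, sloc):.2f} defects/KLOC")
```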

The glitch is that I don't think defect densities, or many of our quality metrics, are going to be good tools for benchmarking -- they're far too subjective. It's not just me either: everyone I pitched the idea to at the workshop on software quality at ICSE was hesitant about the idea, as were most other folks I talked with at ICSE. Fenton and Neil, 1999 has some discussion about the difficulty of using defect densities for measuring and benchmarking software quality. What it comes down to is:
  • The way defect data are reported varies across publications. Some papers use defect rate, others defect density, others use failure rate.
  • When are we counting defects: pre-release, post-release, or both? Some publications don't mention this.
  • What constitutes a defect isn't always clear: are we counting statically determinable faults (in which case, what heuristics are we using), or are we considering only those failures found and reported by the developers and users?
  • A plain count of defects ignores severity. So, how are we accounting for the severity of defects? And who determines the severity? Users and developers might have very different ideas of what the severity of a defect is.
  • A failure isn't in the code; it's a discovered unsatisfactory behaviour. So how a piece of software is tested and used directly determines the defects that are found (if it's never tested, it appears failure-free!). How are testing effort and usage information (user base size, etc.) accounted for in defect counts?
  • Finally, different people, teams, and domains have very different ideas of what constitutes good quality (e.g. a usability bug may not be nearly as important for a scientist as it is for a commercial product). Comparing software across these boundaries with a simple defect measure ignores these relative notions of quality, which may mean the measure is useless. (At the very least, the measure must be understood as an indicator of how well a group's software development process works for them rather than an indicator about the software in any objective sense. But maybe this has always been the case.)
It seems that the problem is that beyond just being inconsistently reported and ill-defined, defect density is only a single number standing in place of a very complicated constellation of properties. It may be a useful (and somewhat valid) measurement to use within a project since you could argue many of these properties stay the same from phase to phase (e.g. a team will decide once, at the beginning of project, whether they are counting pre- or post-release defects). But as a benchmark or basis for comparison of quality between projects, I think it's dubious, especially without a better understanding of what we actually want to measure first.

So, how do we get a better understanding? In the interests of experimenting with a one item, one post system, I'll leave that thought for the next post.

Paper: Software quality: the elusive target

Friday, May 8, 2009

Kitchenham B, Pfleeger SL. Software quality: the elusive target [special issues section]. Software, IEEE. 1996;13(1):12-21.

I'd been looking for an article like this for a while: a good overview of the definitions of quality. Kitchenham and Pfleeger take David Garvin's work on the definition of quality and apply it to software quality. There are five different perspectives to view software quality from:
  • Transcendental view: software quality can be recognised and worked towards, but never precisely defined or perfectly achieved. Thus, this view holds that quality is inherently unmeasurable.
  • User view: how well the software suits the needs of its users. Measuring quality from a user view involves refining concepts like "reliability" and "usability" into measurable characteristics (say, number of hours of learning time needed).
  • Manufacturing view: how well the software conforms to its specifications and to its development process. Measurement is by defect counts and rework costs.
  • Product view: how well the software scores on various "internal quality indicators", like program complexity. If the internal quality is high then so must be the external quality.
  • Value-based view: how much the customer is willing to pay for the software. This view, in some sense, unites the various other views with a very practical measure.
The article talks very briefly about comparing projects by using defect density and scaling by user base, but doesn't go into any of the philosophical troubles. There's also a decent reading list on software quality in general.

Paper: Quantitative Analysis of Faults and Failures in a Complex Software System

Thursday, May 7, 2009

Fenton NE, Ohlsson N. Quantitative Analysis of Faults and Failures in a Complex Software System. IEEE Trans Softw Eng. 2000 August;26(8):797-814.

Fenton and Ohlsson attempt to hack out a bit of solid empirical knowledge amongst the wilderness of published defect data. They note that whilst there are plenty of hypotheses and rules of thumb about defects and defect distribution (e.g. 20% of defects account for 80% of failures (the Pareto principle for software), or large modules are proportionally more reliable than small modules) there is little in the way of published empirical knowledge that can be used for validation or benchmarking. They study defect data from a telecommunication system in order to evaluate the extent to which their data supports or rejects a range of hypotheses about defects.

Fenton and Ohlsson find evidence to support the Pareto principle in software, i.e. the majority of operational defects come from a small number of faults.

They also find evidence that modules with higher defect counts found before release have fewer operational defects after release (and similarly, the inverse is also supported). This point is worth taking slowly. It seems reasonable, on the one hand, that modules with lots of defects before release are simply undergoing better testing, and so should have fewer defects after release. But you might also reject this idea and believe that there are a few poorly designed, troublesome modules that are, and will continue to be, responsible for most of the defects. That is, we can predict buggy modules by looking at where bugs have been found before (this idea is supported, as previously mentioned).
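To make the comparison concrete, here's a sketch of the kind of module-level check you could run (this is not Fenton and Ohlsson's actual analysis, and the counts are invented): a negative rank correlation between pre- and post-release counts is roughly what their finding looks like.

```python
# Rank-correlate pre-release and post-release fault counts per module.
# The counts are hypothetical; a negative coefficient is what Fenton and
# Ohlsson's finding would predict, a positive one is what the
# "troublesome modules" view would predict.
from scipy.stats import spearmanr

pre_release  = [40, 35, 20, 10, 5, 2]   # faults found before release, per module (hypothetical)
post_release = [ 1,  2,  3,  6, 9, 12]  # failures reported after release (hypothetical)

rho, p = spearmanr(pre_release, post_release)
print(f"Spearman rho = {rho:.2f}")  # rho = -1.00 for these made-up numbers
```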

So, exactly how to predict post-release defect density based on pre-release density is in dispute. Okay, sure, so what? Fenton and Ohlsson point out that if their hypothesis is generalisable (that is, if post-release density is lower for modules with high pre-release density, and vice versa) then using defect density as the "de facto" measure of software quality is a highly suspect practice. As they say:
If fault density is measured in terms of pre-release faults (as is very common), then, at the module level, this measure tells us worse than nothing about the quality of the module; a high value is more likely to be an indicator of extensive testing than of poor quality.
So, in terms of a benchmarking tool for comparing projects, pre-release defect density isn't going to work. But how about post-release defect density? The authors show that their post-release defect densities are in line with other published studies, so it's plausible. (I sure hope so, because this has been my plan for benchmarking the software quality of various climate models!)

There's one (at least) missing piece to all of this that I have yet to figure out: what exactly constitutes a pre- or post-release defect? That is, when do you stop counting defects as post-release for release n and pre-release for release n+1? In this paper, it's not entirely clear, but defects were associated with releases (and phases within the release) based on temporal proximity to certain milestones. Again, we've seen some work that associates defects to versions based on their bug report classifications and temporal proximity to releases. It seems to me to be crucial to be able to state explicitly what the conditions are for pre- and post-release defects in order for these numbers to be even remotely comparable.
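For what it's worth, here's one way such a condition could be stated explicitly (the 90-day window is an arbitrary choice of mine, which is exactly the problem: the counts can swing with it):

```python
from datetime import date

# One concrete (and entirely invented) way to operationalise the pre/post
# distinction: bugs reported within a fixed window before the release date
# count as pre-release, bugs within a window after count as post-release.
def classify_defect(report_date, release_date, window_days=90):
    delta = (report_date - release_date).days
    if -window_days <= delta < 0:
        return "pre-release"
    if 0 <= delta <= window_days:
        return "post-release"
    return "unassigned"

release = date(2009, 1, 15)
print(classify_defect(date(2008, 12, 1), release))  # pre-release
print(classify_defect(date(2009, 2, 20), release))  # post-release
```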

(Ah counting. Back in kindergarten it all seemed so easy.)

Paper: Software faults: a quantifiable definition, J. Munson et al.

Wednesday, May 6, 2009

J. C. Munson, A. P. Nikora, and J. S. Sherif, "Software faults: a quantifiable definition," Adv. Eng. Softw., vol. 37, no. 5, pp. 327-333, 2006.

Here's a curious little paper. The authors set out to find a rigorous definition of the inherently messy notion of a software fault. Their solution is bold, simple, and ludicrous: just count the number of source tokens that change at each check-in and use that as the number of software faults.
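As I read it, the counting would look roughly like this sketch (my own approximation of the idea, not their tooling):

```python
import difflib
import re

# A rough approximation of the Munson et al. definition: tokenise the old
# and new versions of a file at each check-in and count the tokens that
# differ, treating that count as the number of "faults" in the change.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def changed_token_count(old_source, new_source):
    old_tokens = TOKEN.findall(old_source)
    new_tokens = TOKEN.findall(new_source)
    matcher = difflib.SequenceMatcher(a=old_tokens, b=new_tokens)
    changed = 0
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            changed += max(i2 - i1, j2 - j1)
    return changed

print(changed_token_count("x = a + b", "x = a - b + c"))  # tokens touched by the edit
```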

What this paper does have going for it is an extensive reading list on software quality, broken down by subtopic. I'll definitely be coming back to this paper for other reading suggestions.

Paper: Orthogonal Defect Classification

Tuesday, May 5, 2009

R. Chillarege, I. S. Bhandari, J. K. Chaar, M. J. Halliday, D. S. Moebus, B. K. Ray, and M. Y. Wong, "Orthogonal defect classification-a concept for in-process measurements," Software Engineering, IEEE Transactions on, vol. 18, no. 11, pp. 943-956, 1992.

Chillarege et al. describe a technique for identifying problem spots in the software development process whilst a project is underway, by classifying the software defects that come up. They suggest a paradigm and describe a pilot study to validate it, but overall I wasn't convinced. There's a lot in this paper that smacks of advertising over content -- maybe the guts of the results are found in all the subsequent papers they link to?

Anyhow, my interest in this paper is for the authors' concept of software quality, their use of defect classification, and their thinking on the link between the two. Plus, it seems to be widely referenced.

To begin, the authors point out a gap in the qualitative-quantitative spectrum of measurement methods we have for software quality. They want a measurement scheme that is lightweight, sensitive to the development process (in that it can help locate process problems), and also consistent across phases of a project and between projects (so those who use it can learn from their own and others' experiences). Statistical defect models (quantitative) and root cause analysis (qualitative) are both done retrospectively, are time consuming, and, in the case of the statistical methods, often intentionally ignore the details of the software development process used, so they can't provide detailed process feedback.

Enter ODC (Orthogonal Defect Classification). The main idea: come up with various classifications of defects and map those classes onto the software development process so that every defect points to a process problem. (The word "orthogonal" is used both to mean "mutually independent" and because the authors run the metaphor of software-development-process-as-a-mathematical-vector-space, and defect classes "span" this space).

There are two tricks to doing this. The first trick is to use a layer of indirection in the mapping of defects to parts of process. Defects are first mapped to defect types, and then defect types are mapped to parts of the software development process. Why? Because mapping directly isn't something practitioners can do in the moment (it's error prone, and the attempt to do so is nothing more than a "good opinion survey ... not ... a measurement") and because the indirection allows us to compare results across projects and phases.

My view on this: I'm not sure assigning defect types or mapping defect types onto parts of the process is any less error prone or requires any less opinion; it just seems to divide up the opinion-making into smaller chunks.
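To pin down what the two-step mapping looks like mechanically, here's a toy sketch (the type-to-stage associations are my own guesses, not the paper's table):

```python
# Step one: the engineer making the fix assigns a defect type.
# Step two: a fixed table associates each type with a process stage.
# The associations below are illustrative guesses, not taken from the paper.
TYPE_TO_STAGE = {
    "function": "high-level design",
    "interface": "low-level design",
    "assignment": "coding",
    "timing/serialisation": "system test",
}

def process_signal(defect_type):
    """Map a classified defect back to the process stage it implicates."""
    return TYPE_TO_STAGE.get(defect_type, "unclassified")

print(process_signal("interface"))  # -> low-level design
```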

Anyhow, the second trick is about making sure your defect classes actually span the process space. The authors point out that a sufficient classification scheme is a work in progress that ultimately needs to be empirically validated. A good chunk of the paper is devoted to describing a pilot study of ODC to validate it, or referencing future work.

Looking at defect types in more depth, then. The first important point: defect types are chosen by the semantics of the fix, rather than only by qualities of the defects themselves. They are assigned by the engineer making the fix. Here are the 8 types of defects:
  • Function -- errors that affect capability and require a formal design change.
  • Assignment -- logical errors in small bits of code.
  • Interface -- errors in interacting with other components.
  • Checking -- errors in data validation.
  • Timing/serialisation -- errors in the use of shared/real-time resources.
  • Build/package/merge -- errors in libraries or change management systems.
  • Documentation -- errors in documentation and publications.
  • Algorithm -- efficiency or correctness problems that require fixes through reimplementation (not requiring a formal design change).
The authors investigate how the quantities of defects of each type vary over different phases of a project. They note, for instance, that function bugs appear in greater numbers in the design phase, and timing bugs appear more in the system test phase. The authors then take this "trend analysis a stage deeper" and provide a correlation table that maps principal defect types to stages in the software development process.
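To illustrate what that trend analysis might look like mechanically, here's a sketch (the "expected" profile and the deviation threshold are invented, since the paper doesn't give concrete values):

```python
from collections import Counter

# Compare the observed distribution of defect types in a phase against an
# "expected" profile for that phase, and flag large deviations as possible
# process problems. The expected fractions and threshold are illustrative
# inventions, not numbers from the paper.
EXPECTED_SYSTEM_TEST = {"function": 0.10, "timing": 0.30, "interface": 0.25,
                        "assignment": 0.20, "other": 0.15}

def flag_deviations(observed_defects, expected, threshold=0.15):
    counts = Counter(observed_defects)
    total = sum(counts.values())
    flags = []
    for dtype, expected_frac in expected.items():
        observed_frac = counts.get(dtype, 0) / total
        if abs(observed_frac - expected_frac) > threshold:
            flags.append((dtype, observed_frac, expected_frac))
    return flags

defects_in_system_test = ["function"] * 12 + ["timing"] * 3 + ["interface"] * 5
print(flag_deviations(defects_in_system_test, EXPECTED_SYSTEM_TEST))
# -> too many "function" defects for system test, none of the expected "assignment" ones
```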

This section is maddeningly vague on details -- it's not clear where the process stages have come from or how the correlations were done specifically. This is a shame because this mapping is crucial to the underlying argument for the usefulness of ODC. Any deviation from the "expected" principal variation trends of defect quantities is considered by ODC to point to process problems, but what exactly the "expected" variations ought to be isn't well described, nor is what exactly constitutes a deviation, other than to say the judgement is "determined with experience".

Overall I'm suspicious of ODC as described in this paper. Partly because the paper lacks detail, but also because I wonder if the classification scheme is objective enough to work as the authors claim -- especially across projects. El Emam and Wieczorek (1998) show some evidence that defect classification is repeatable within members of a development team, but even then theirs is a highly qualified experiment.

For my interests, this is the first paper I looked at that hints that it might be possible to compare the quality of two projects by looking at the types of defects that turn up. But it only hints. I'll be posting about other work that adds a lot more murk to these waters.