On static analysis

Monday, August 31, 2009

Last week I got serious about running a thorough static analysis (using Cleanscape's FortranLint) of one of the climate modelling packages I'm studying. It turns out to be trickier than I thought just to get the source code into a state where it can be analysed, because of the complexity and "homebrewedness" of the configuration systems used.

What do I mean? Well, the models I'm studying are complex beasts. They are composed of many sub-models, and those sub-models are themselves built from sub-sub-models. For example, a global climate model may be composed of an atmosphere model, an ocean model, and a land model. These sub-models are often functioning models in their own right and can often be run separately. And as I say, the sub-models are also built up from various smaller models: the ocean model may have a sea-ice model, a biogeochemical model, and an ocean dynamics model. There may also be different versions of these sub- or sub-sub-models being actively developed.

There are also piles and piles of configuration options for each of these components (the models, the sub-models, the sub-sub-models).

Thus, the climate model code shouldn't really be thought of in the singular sense. It's not source code for a climate model, but for an almost infinite number of different climate models depending on which sub-, or sub-sub-models are included in a particular build, and which configuration options are used.

A word on configuration options. The configuration systems for some of the climate models I'm looking at are very complex (as you might expect). They include a generous helping of C preprocessor (CPP) instructions to include or remove chunks of code or other files in order to get just the right bits of functionality. As well, there are many makefiles and home-brewed scripts to assemble and ready the appropriate source files for compilation (e.g. move only the land ice model version 2 files, not the version 1 files, and rename them like so, etc.). Of course, there are also plenty of run-time configuration options slurped in from configuration data files (but since that happens after compilation it's not a concern to me when doing static analysis).
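To make that concrete, here's a minimal, made-up sketch of the kind of CPP-guarded Fortran these configuration systems manage. The file name, the SEA_ICE macro, and the subroutine names are all my own inventions, not taken from any particular model:

```shell
# A toy Fortran file with a CPP guard around the sea-ice code.
cat > ocean_step.F <<'EOF'
      subroutine ocean_step
#ifdef SEA_ICE
      call seaice_step
#endif
      call dynamics_step
      end
EOF

# What a build script might do to select a configuration:
# -P drops line markers; -traditional-cpp keeps cpp from mangling Fortran.
cpp -P -traditional-cpp -DSEA_ICE ocean_step.F > ocean_step.f
```

With -DSEA_ICE the call to seaice_step survives into ocean_step.f; without it, that chunk disappears. Multiply that by hundreds of flags and files and you get the combinatorial explosion described above.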

The upshot of all of this is that the source code for a climate model isn't shipped in a state that can be run through static analysis. In order for the static analysis tool to do its job, it needs to be handed the source code in a ready-to-compile state. After all, a static analysis tool is essentially an ultra-picky compiler that doesn't actually do any compilation but instead just spits out warnings about the structure of the code.

(I'm simplifying slightly: both of the static analysis tools I've looked at (FortranLint and Forcheck) offer the ability to handle some preprocessing statements. Forcheck implemented its own limited CPP-style preprocessor, and FortranLint will just call cpp for you on the file. Thus, it is possible to hand the static analysis tool code that isn't exactly in a compilable state, but you still need to configure the static analysis tool to do all the preprocessing... and that essentially duplicates the work that's being done by the homebrewed scripts and makefiles.)

The trouble is that getting a snapshot of the code that's ready for compilation isn't a trivial task. The homebrewed scripts and makefiles do a lot of magic as I described above. Somewhere in that magic -- and often not in one nice, distinct stage -- the code gets compiled. That is, nowhere in the process is there a folder of preprocessed, ready-to-compile files: configuration and compilation are bound up together.

Ideally I'd like to be able to run the configuration/compilation scripts up to the point at which they produce the ready-to-compile code, then run my static analysis tools over that code, and then continue on with the compilation process, so that I can be sure that the code I'm analysing is exactly the code that gets compiled into a working model. That would be the ultimate validation that I'm analysing the correct code, right? (If I were to use the built-in preprocessing facilities of the static analysis tools I could never be sure that I'd exactly duplicated the work done in the configuration scripts.)

Unfortunately, this separation of configuration and compilation can't be done without deeply understanding and re-writing the configuration scripts. Hmmm... that's one option. It's messier than I'd like, but I might need to do it to remove any doubts about the validity of my results.

The other option I've come up with is a bit more cavalier, but still might be justifiable. It goes like this: redirect all calls to the compiler in the makefiles to a script that simply copies the target file to another location before doing the actual compilation. The idea is to intercept right at the point of compilation in order to take a snapshot of only those files that are compiled, when they're in their properly configured and preprocessed state.
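Here's a rough sketch of what such an interception script could look like. Everything concrete here is an assumption on my part -- the wrapper's name, the snapshot directory, the use of gfortran as the real compiler, and the FC make variable -- since the actual makefiles will have their own conventions to override:

```shell
mkdir -p /tmp/snapshot

# A stand-in compiler: snapshot every Fortran source argument, then hand
# off to the real compiler (REAL_FC, defaulting to gfortran -- an assumption).
cat > /tmp/fc-wrapper <<'EOF'
#!/bin/sh
for arg in "$@"; do
    case "$arg" in
        *.f|*.F|*.f90|*.F90) cp "$arg" /tmp/snapshot/ ;;
    esac
done
exec "${REAL_FC:-gfortran}" "$@"
EOF
chmod +x /tmp/fc-wrapper

# Then something like:  make FC=/tmp/fc-wrapper
# (assuming the makefiles route compiler calls through an FC variable)
```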

In fact, since I don't care about actually compiling the model, the stand-in compiler script could simply output an empty file instead of the actual compiled file. (Outputting an empty file is necessary to make other steps of the makefile happy and believe some real work was done.) Of course, replacing the compiler with something that doesn't actually do any compilation also requires that other programs in the makefiles that expect real work to have been done (e.g. the archiving tool, ar) be redirected to dummy scripts as well.
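Continuing the sketch, the dummy versions might look like the following. Again, all names and paths are mine, and the ar stub assumes the common "ar crs libfoo.a obj..." invocation where the archive is the second argument:

```shell
mkdir -p /tmp/snapshot

# Dummy compiler: snapshot the source, then emit an empty object file so
# later makefile steps believe some real work was done.
cat > /tmp/fake-fc <<'EOF'
#!/bin/sh
out=""; src=""; prev=""
for arg in "$@"; do
    case "$prev" in -o) out=$arg ;; esac
    case "$arg" in
        *.f|*.F|*.f90|*.F90) cp "$arg" /tmp/snapshot/; src=$arg ;;
    esac
    prev=$arg
done
# Mimic the compiler's default object name when no -o is given.
[ -z "$out" ] && out=$(basename "${src%.*}").o
: > "$out"
EOF
chmod +x /tmp/fake-fc

# Dummy archiver: just touch the archive (assumes "ar crs lib.a obj...").
cat > /tmp/fake-ar <<'EOF'
#!/bin/sh
: > "$2"
EOF
chmod +x /tmp/fake-ar
```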

The result would be a folder full of ready-to-compile source files that should, in theory, all be able to be compiled together to make the climate model, and thus ready to be fed to the static analysis tool.

Also, in theory, and without needing any deeper understanding of the climate models, I should be able to compile the files I get from this process into a binary that I can compare against the binary produced by the unadulterated configuration/compilation process, in order to validate this hack.

Where I'm at: I tried putting this process in place last week with one of the models. I successfully got a nice pile of source files to analyse. I'm now just dealing with configuring the static analysis tool to handle external dependencies, but I should know soon whether this idea will work or not.
