Experiences with Numerics on OpenBSD - C

Experiences with Numerics on OpenBSD - C - Numerical Debugging

Debugging text-processing code is easy – you just use a lot of printf() and see how well or how badly things turn out.

Numerical codes are different because printf() will print out a number but you may have a difficult time deciding if it is the correct number.

I have only a little experience with this issue, so here is a set of basic guides on what to look for, and how to find bugs (or, sometimes, numerical issues) in the code you are looking at. It’s all that I know (so far).

Note: I’m not expecting that you wrote the code you are debugging. A useful principle of software development is to be uncommitted as to the ownership of the code. Maybe you wrote it, but you shouldn’t see bugs as a personal affront, even, or especially, if someone else is point out bugs in the code. Instead, it’s more helpful to see bugs as learning experiences. The code is a team effort to compose a very large and perfect text. All team members contribute in their own way with their own skills.

Sources of code

Physicists and astrophysicists and others often get their code from github or personal web pages. It’s a good place to start when learning how to do numerical physics. Reading what others have done helps you understand the state of the art. Reading what many different people have done helps you understand the range of approaches and opinions about how to solve a problem.

Let’s examine a typical (anonymized) example (in Fortran):

real function interpolate_kr(kr,ix,iy)
    implicit none
    integer(4) ix,iy
    integer(8) ii,i1,i2
    real kr,xx,yy,x1,x2,y1,y2
    i1=1
    i2=nk
    do while(i2-i1>1)
        ii=(i1+i2)/2
        if(kr>tf(ix,ii)) then
            i1=ii
        else
            i2=ii
        endif
    enddo
    x1=log(tf(ix,i1))
    y1=log(tf(iy,i1))
    x2=log(tf(ix,i2))
    y2=log(tf(iy,i2))
    xx=log(kr)
    yy=y1+(y2-y1)*(xx-x1)/(x2-x1)
    interpolate_kr=exp(yy)
endfunction

What is happening in this code?

The kr value is compared to elements of an external array tf(nn,mm)
It seems there is a binary search with kr in tf along the second dimension
The values i1 and i2 (I-One and I-Two) form a range in the tf array
The interpolation is exponential in form
The array tf is external and not listed as a parameter to the function

How could you prove this routine is correct? How could you find a bug in it that is related to any of the following:

The binary search is correct, or not
The result depends on the tr array being sorted correctly (binary search needs this)
The logarithms are never taken of a zero (which results in some kind of error)
The exponential is not about a number that is too big (another floating point error)

And so on. You should be able to describe a couple of other issues with this code fragment.

Troubleshooting the source

Finding bugs is always about finding issues that are usually obvious once they are pointed out. Finding obvious things is hard because you and your team being focussed on either (a) other issues or (b) not seeing what is in front of you but seeing what you imagine is correct.

“Other issues” result in bugs not being fixed because you are either writing new code or debugging other code. Which is fine; it’s a big project and lots of things need to get done.

Not seeing what is truly there is a human fault (and we all have this). What can be done about this? The first suggestion is use tools:

grep -w tr interpolate_kr.f90

You can do many things here:

count how many times “tr” is used – it should be a predictable number because it is declared once and used in several known locations thereafter
verify “tr” is used correctly and not misspelled as tz or tx or something
check numbered variables like i1 is not il (I-One and I-Ell).
check for copy-paste errors

How do you check copy-paste errors? There are typically two reasons for copy-paste:

big code fragment is identical in two places (so, verify it is indeed identical, but it is better recoded as a function)
big code fragment is almost identical except for using different variable or variables

The second suggestion is to check for copy-paste errors with the counting trick above, or by rearranging the code (even temporarily) to allow you to see the similar lines directly above each other:

term_a=svr(i,2)+(svr(j,2)-svr(i,2))*(rr-svr(i,1))/(svr(j,1)-svr(i,1)
...
term_b=svz(i,2)+(svz(j,2)-svr(i,2))*(rr-svz(i,1))/(svz(j,1)-svz(i,1)

Are these two lines different the way they were intended?

The third suggestion: a code review by another person or persons. This is commonly done in industrial settings and may be a novel idea in an academic or research environment. There are several important things to know about code review methods and preparing a team new to the concept. Search the web for some advice, e.g. Wikipedia, Microsoft experiences etc. Note that code review is not about finding bugs, it is about making the code better in a way that (a) meets the local code layout standard, (b) is clear and easily understood, (c) meets quality goals such as numerical accuracy or includes self-contained tests, and (d) has agreement amongst the team that the code is satisfactory (not necessarily bug-free).

A coding standard might include:

mandatory use of code-formatting tools to standardize indent, brackets, spaces, and so on.
spelling rules for variables, such as CamelCaseValues; absence of numbers in variable spellings (forbidding i1, i2, i3); requiring or forbidding type codes in names (such as f for float, hence fFlighTimeSeconds); and so on.
maximum module size in lines
standard copyright notices
advice on complicated expressions such as 3D difference equations
directory structures
logging features and how to control them
configuration file names, locations and standardized syntax

Troubleshooting the numbers

Numeric codes have a number of possible outcomes:

wrong
NaN
+infinity
-infinity
misleading
almost correct
meets scale and accuracy expectations but is completely incorrect
doesn’t meet expectations and yet is completely correct
doesn’t crash
works

NaN and infinities

The NaN and infinities can be detected with suitable code inserted to check for these. Here is an example in Nim:

import math, strformat

var
  a: array[0..2,float] = [1.0, -2.0, 3.3]

# check unit under test
proc countnan( UUT: openArray[float] ) =
  for ix, v in UUT:
    if(classify(v) == fcNan):
      echo &"array has Nan at {ix}"
      break

echo "a is a ok"
countnan(a)

echo "is a ok?"
for ix, v in a:
  a[ix] = ln(a[ix])
countnan(a)

For which the result is

$ ./nan
a is a ok
is a ok?
array has Nan at 1

Some of the other values you can classify: fcNormal, fcInf, fcNegInf. See the documentation for the math module for Nim.

Checking blocks of code for errors

But, you say, I may not want to check results all the time. You may want to ask your computer to signal errors, or simply record if there are any errors (and if there aren’t, you can be a little more relaxed.)

The floating point standard allows for software to reset the set of faults encountered, and read the current set afterwards. So you can

import fenv
...

# clear fp flags; later, detect if exceptions would be raised

var fsave: ptr Tfenv
fsave = cast[ptr Tfenv](alloc(sizeof(Tfenv)))
discard feholdexcept(fsave)

... # computations appear here

if fetestexcept(FE_DIVBYZERO + FE_INVALID + FE_OVERFLOW) != 0:
  echo "divide by 0 or Invalid or Overflow detected"

See the documentation for the fenv module for Nim, and man pages for feholdexcept for Linux/BSD.

What does this result mean (if you get a divide by 0 or Overflow)? It means your code is calculating with invalid numbers. You need to determine where these are introduced, and fix them. See the example in OpenBSD Numerics Experience - 1 - RNG for a narrative about how to detect such issues.

Simple error reporting

Lastly, you can insert some testing code that can interrupt your program whenever a numerical error appears. Taking an idea from Gnu Scientific Library, define a routine like “oopsie” as follows:

$ cat oopsie.c
/* define an extern iff debugging */
#ifdef DEBUG
void oopsie(char *s,...){ }
#endif

And, in your main routine:

...
/* define an extern iff debugging */
#ifdef DEBUG
void oopsie(char *s,...);
#else
void oopsie(char *s,...){ }
#endif

int main(){
int i, j;
printf("%s\n","this is main.c demonstrating debug printing");

i = rand();
    if(i> 5) oopsie("i more than 5");

With it enabled by the source code flag DEBUG, you can run the code in the debugger and break on oopsie; this immediately prints an issue message without having to single-step through all your code:

$ gdb ./main
...
(gdb) break oopsie
(gdb) run
Starting program: /home/jal/debug/a.out
...
this is main.c demonstrating debug printing
...
Breakpoint 1, oopsie (s=0x9766035d548 "i more than 5") at oopsie.c:4
4       void oopsie(char *s,...){ }

See the message you composed – it can be anything informative. And no printfs were harmed in the generation of this message.

(gdb) bt
#0  oopsie (s=0x9766035d548 "i more than 5") at oopsie.c:4
#1  0x000009766035ea96 in main () at main.c:23

And you can see where your code was called with backtrace.

An interesting advantage here is that you can compile with optimizations turned on, avoiding other implications of debugging (-O1 recommendations to get traceable statements, and so on). However, you probably need to define the -g compiler option to ensure the object file includes the oopsie routine name and it’s argument name.

Troubleshooting the formulas

If your code fairly and accurately represents the mathematical formulas, there may still be issues with accuracy and even correctness.

The classic example in floating point is:

x = (1.0 + 1.5e-99) - 1.0

What is the result? Logically it should be 1.5e-99, but is it?

The result could be 0.0 or 1.5e-99, depending on the compiler, compiler options, and so on.

The usual advice from numerical books (Hamming, McCracken and Dorn, etc) is to avoid subtraction! How, you say? Several methods exist:

don’t actually avoid subtraction but keep the two addends apart as separate values until the last possible moment
only maintain the one value, maintaining a “logical” subtraction in the definition of a variable
rewrite a formula in a series form where subtractions cancel out mathematically, not numerically
review the values to be sure hidden subtraction (adding positive and negative numbers) is not happening

Another source of problems we’ve seen is rounding: this can change values from a valid range, e.g. [0,1) to an unexpected range [0,1], where unity is unwanted because it is not valid in that situation.

Rounding can be configured to occur to the “nearest” value, “towards zero”, “towards positive values”, and “towards negative values”. Keeping this as a configurable option in your code might help you narrow down numerical issues that are otherwise hard to find.

Lastly, using the (Fortran) example at the top of the page, we see

x1=log(tf(ix,i1))
y1=log(tf(iy,i1))
x2=log(tf(ix,i2))
y2=log(tf(iy,i2))
xx=log(kr)
yy=y1+(y2-y1)*(xx-x1)/(x2-x1)
interpolate_kr = exp(yy)

This is subtraction of logarithms, which is the same as division! Avoid this by changing the formula for yy to use division and logarithm like this:

interpolate_kr = tf[y1,i1] * exp(log(tf[iy,i2]/tf[iy,i1])*((xx-x1)/(x2-x1)))

And you can avoid taking logarithms of y1, y2.

Conclusion

Numbers were the first reason for inventing computers. Avoiding bugs or finding them afterwards is (still) one of the most difficult parts of “computing numbers”.

References:

GSL - GNU Scientific Library

Numerical Methods for Scientists and Engineers, RW Hamming, Dover 1987

Numerical Methods and Fortran Programming, Daniel McCracken and William Dorn, Wiley, 1964

June 2021