Experiences with Numerics on OpenBSD - C - Numerical Debugging

Experiences with Numerics on OpenBSD - C - Numerical Debugging

Debugging text-processing code is easy – you just use a lot of printf() and see how well or how badly things turn out.

Numerical codes are different because printf() will print out a number but you may have a difficult time deciding if it is the correct number.

I have only a little experience with this issue, so here is a set of basic guides on what to look for, and how to find bugs (or, sometimes, numerical issues) in the code you are looking at. It’s all that I know (so far).

Note: I’m not expecting that you wrote the code you are debugging. A useful principle of software development is to be uncommitted as to the ownership of the code. Maybe you wrote it, but you shouldn’t see bugs as a personal affront, even, or especially, if someone else is point out bugs in the code. Instead, it’s more helpful to see bugs as learning experiences. The code is a team effort to compose a very large and perfect text. All team members contribute in their own way with their own skills.

Sources of code

Physicists and astrophysicists and others often get their code from github or personal web pages. It’s a good place to start when learning how to do numerical physics. Reading what others have done helps you understand the state of the art. Reading what many different people have done helps you understand the range of approaches and opinions about how to solve a problem.

Let’s examine a typical (anonymized) example (in Fortran):

real function interpolate_kr(kr,ix,iy)
    implicit none
    integer(4) ix,iy
    integer(8) ii,i1,i2
    real kr,xx,yy,x1,x2,y1,y2
    i1=1
    i2=nk
    do while(i2-i1>1)
        ii=(i1+i2)/2
        if(kr>tf(ix,ii)) then
            i1=ii
        else
            i2=ii
        endif
    enddo
    x1=log(tf(ix,i1))
    y1=log(tf(iy,i1))
    x2=log(tf(ix,i2))
    y2=log(tf(iy,i2))
    xx=log(kr)
    yy=y1+(y2-y1)*(xx-x1)/(x2-x1)
    interpolate_kr=exp(yy)
endfunction

What is happening in this code?

How could you prove this routine is correct? How could you find a bug in it that is related to any of the following:

And so on. You should be able to describe a couple of other issues with this code fragment.

Troubleshooting the source

Finding bugs is always about finding issues that are usually obvious once they are pointed out. Finding obvious things is hard because you and your team being focussed on either (a) other issues or (b) not seeing what is in front of you but seeing what you imagine is correct.

“Other issues” result in bugs not being fixed because you are either writing new code or debugging other code. Which is fine; it’s a big project and lots of things need to get done.

Not seeing what is truly there is a human fault (and we all have this). What can be done about this? The first suggestion is use tools:

grep -w tr interpolate_kr.f90

You can do many things here:

How do you check copy-paste errors? There are typically two reasons for copy-paste:

The second suggestion is to check for copy-paste errors with the counting trick above, or by rearranging the code (even temporarily) to allow you to see the similar lines directly above each other:

term_a=svr(i,2)+(svr(j,2)-svr(i,2))*(rr-svr(i,1))/(svr(j,1)-svr(i,1)
...
term_b=svz(i,2)+(svz(j,2)-svr(i,2))*(rr-svz(i,1))/(svz(j,1)-svz(i,1)

Are these two lines different the way they were intended?

The third suggestion: a code review by another person or persons. This is commonly done in industrial settings and may be a novel idea in an academic or research environment. There are several important things to know about code review methods and preparing a team new to the concept. Search the web for some advice, e.g. Wikipedia, Microsoft experiences etc. Note that code review is not about finding bugs, it is about making the code better in a way that (a) meets the local code layout standard, (b) is clear and easily understood, (c) meets quality goals such as numerical accuracy or includes self-contained tests, and (d) has agreement amongst the team that the code is satisfactory (not necessarily bug-free).

A coding standard might include:

Troubleshooting the numbers

Numeric codes have a number of possible outcomes:

NaN and infinities

The NaN and infinities can be detected with suitable code inserted to check for these. Here is an example in Nim:

import math, strformat

var
  a: array[0..2,float] = [1.0, -2.0, 3.3]

# check unit under test
proc countnan( UUT: openArray[float] ) =
  for ix, v in UUT:
    if(classify(v) == fcNan):
      echo &"array has Nan at {ix}"
      break

echo "a is a ok"
countnan(a)

echo "is a ok?"
for ix, v in a:
  a[ix] = ln(a[ix])
countnan(a)

For which the result is

$ ./nan
a is a ok
is a ok?
array has Nan at 1

Some of the other values you can classify: fcNormal, fcInf, fcNegInf. See the documentation for the math module for Nim.

Checking blocks of code for errors

But, you say, I may not want to check results all the time. You may want to ask your computer to signal errors, or simply record if there are any errors (and if there aren’t, you can be a little more relaxed.)

The floating point standard allows for software to reset the set of faults encountered, and read the current set afterwards. So you can

import fenv
...

# clear fp flags; later, detect if exceptions would be raised

var fsave: ptr Tfenv
fsave = cast[ptr Tfenv](alloc(sizeof(Tfenv)))
discard feholdexcept(fsave)

... # computations appear here

if fetestexcept(FE_DIVBYZERO + FE_INVALID + FE_OVERFLOW) != 0:
  echo "divide by 0 or Invalid or Overflow detected"

See the documentation for the fenv module for Nim, and man pages for feholdexcept for Linux/BSD.

What does this result mean (if you get a divide by 0 or Overflow)? It means your code is calculating with invalid numbers. You need to determine where these are introduced, and fix them. See the example in OpenBSD Numerics Experience - 1 - RNG for a narrative about how to detect such issues.

Simple error reporting

Lastly, you can insert some testing code that can interrupt your program whenever a numerical error appears. Taking an idea from Gnu Scientific Library, define a routine like “oopsie” as follows:

$ cat oopsie.c
/* define an extern iff debugging */
#ifdef DEBUG
void oopsie(char *s,...){ }
#endif

And, in your main routine:

...
/* define an extern iff debugging */
#ifdef DEBUG
void oopsie(char *s,...);
#else
void oopsie(char *s,...){ }
#endif

int main(){
int i, j;
printf("%s\n","this is main.c demonstrating debug printing");

i = rand();
    if(i> 5) oopsie("i more than 5");

With it enabled by the source code flag DEBUG, you can run the code in the debugger and break on oopsie; this immediately prints an issue message without having to single-step through all your code:

$ gdb ./main
...
(gdb) break oopsie
(gdb) run
Starting program: /home/jal/debug/a.out
...
this is main.c demonstrating debug printing
...
Breakpoint 1, oopsie (s=0x9766035d548 "i more than 5") at oopsie.c:4
4       void oopsie(char *s,...){ }

See the message you composed – it can be anything informative. And no printfs were harmed in the generation of this message.

(gdb) bt
#0  oopsie (s=0x9766035d548 "i more than 5") at oopsie.c:4
#1  0x000009766035ea96 in main () at main.c:23

And you can see where your code was called with backtrace.

An interesting advantage here is that you can compile with optimizations turned on, avoiding other implications of debugging (-O1 recommendations to get traceable statements, and so on). However, you probably need to define the -g compiler option to ensure the object file includes the oopsie routine name and it’s argument name.

Troubleshooting the formulas

If your code fairly and accurately represents the mathematical formulas, there may still be issues with accuracy and even correctness.

The classic example in floating point is:

x = (1.0 + 1.5e-99) - 1.0

What is the result? Logically it should be 1.5e-99, but is it?

The result could be 0.0 or 1.5e-99, depending on the compiler, compiler options, and so on.

The usual advice from numerical books (Hamming, McCracken and Dorn, etc) is to avoid subtraction! How, you say? Several methods exist:

Another source of problems we’ve seen is rounding: this can change values from a valid range, e.g. [0,1) to an unexpected range [0,1], where unity is unwanted because it is not valid in that situation.

Rounding can be configured to occur to the “nearest” value, “towards zero”, “towards positive values”, and “towards negative values”. Keeping this as a configurable option in your code might help you narrow down numerical issues that are otherwise hard to find.

Lastly, using the (Fortran) example at the top of the page, we see

x1=log(tf(ix,i1))
y1=log(tf(iy,i1))
x2=log(tf(ix,i2))
y2=log(tf(iy,i2))
xx=log(kr)
yy=y1+(y2-y1)*(xx-x1)/(x2-x1)
interpolate_kr = exp(yy)

This is subtraction of logarithms, which is the same as division! Avoid this by changing the formula for yy to use division and logarithm like this:

interpolate_kr = tf[y1,i1] * exp(log(tf[iy,i2]/tf[iy,i1])*((xx-x1)/(x2-x1)))

And you can avoid taking logarithms of y1, y2.

Conclusion

Numbers were the first reason for inventing computers. Avoiding bugs or finding them afterwards is (still) one of the most difficult parts of “computing numbers”.

References:

GSL - GNU Scientific Library

Numerical Methods for Scientists and Engineers, RW Hamming, Dover 1987

Numerical Methods and Fortran Programming, Daniel McCracken and William Dorn, Wiley, 1964

June 2021



OpenBSD Numerics Experience - 1 - RNG
OpenBSD Numerics Experience - 2 - RNG floats
OpenBSD Numerics Experience - 3 - FFTW
OpenBSD Numerics Experience - 4 - CAF
OpenBSD Numerics Experience - 5 - MPI Networking
OpenBSD Numerics Experience - 6 - Memory Models
OpenBSD Numerics Experience - 7 - Python Image Display
OpenBSD Numerics Experience - 8 - RNGs, again
OpenBSD Numerics Experience - 9 - Nim
OpenBSD Numerics Experience - A - Graphical Display
OpenBSD Numerics Experience - B - ParaView
OpenBSD Numerics Experience - C - Numerical Debugging
OpenBSD Numerics