Experiences with Numerics on OpenBSD - 6

Experiences with Numerics on OpenBSD - 6 - Memory Models

This is a hard one.

Adjusting the code for the cosmic initializer turned out to be pretty simple, but when compiled it resulted in a link time error:

ld: error: (....) relocation R_X86_64_PC32 out of range: (...) is not in [-2147483648, 2147483647]

-mcmodel=medium

Asking @misc resulted in the guidance that compiling with -mcmodel=medium is required. What does this mean? After some research, it turns out:

some object files are created assuming data and code is within 31-bit offsets from each other. That is, plus or minus 2GB: [-2147483648, 2147483647].
some object files are created assuming data (not code) is within a 64-bit offset
the difference is -mcmodel=small for the 2GB data+code model, and -mcmodel=medium for the 2GB code and 64-bit data model.

OK, so rebuild with -mcmodel=medium and the link step works but the program fails at runtime with:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

The backtrace was 64 calls deep, and meaningless:

(gdb) bt
#0  0x00000b2e7c782d60 in ?? ()
#1  0x0000000000000004 in ?? ()
#2  0x00000b31188a09e0 in ?? ()
#3  0x0000000000000020 in ?? ()
#4  0x0000000000000020 in ?? ()
#5  0x0000000000000001 in ?? ()
#6  0x2b485153b410927a in ?? ()
.... and so on

But when I ran similar (much smaller) test code, it worked.

So I went back to the failing code. Digging deeper into the backtrace required a static build to get a symbol table for the backtrace. And it was deep into the internals of fftw3. Hmm, nothing wrong with that code (by assumption). Here is what it looked like:

(gdb) bt
#0  0x00000efe25ec2f80 in recur () from /usr/local/lib/libfftw3f.so.7.1
#1  0x00000efe25ec2e8e in recur () from /usr/local/lib/libfftw3f.so.7.1
#2  0x00000efe25ec365a in zero () from /usr/local/lib/libfftw3f.so.7.1
#3  0x00000efe25e80dc4 in fftwf_measure_execution_time ()
   from /usr/local/lib/libfftw3f.so.7.1
#4  0x00000efe25e7cd2b in evaluate_plan () from /usr/local/lib/libfftw3f.so.7.1
#5  0x00000efe25e7cb47 in search0 () from /usr/local/lib/libfftw3f.so.7.1
#6  0x00000efe25e7b5f3 in mkplan () from /usr/local/lib/libfftw3f.so.7.1
.... and so on

And this was 25 calls deep. Not the same issue. Or, perhaps it was.

Where was I? Linking works, but while running code the program segfaults. Some kind of pointer issue, perhaps? The Fortran code I was testing had no pointers.

Static sections

It is very easy to make arrays larger than 2GB. But, in fact what the initializer was doing was calling pencil_fft.f90 which had local temporary arrays which were only 14M elements, or 56MB.

In Fortran, all statically defined variables end up in the same section. A “section” is a unit of memory allocation as defined by the linker and loader: it has a name, length, read/write attributes, etc. All object files are combined by the linker by matching up their sections and either concatenating them (such as for static variables or executable code ) or merging them (offset tables).

Programmers don’t need to know much about sections until they run into security issues (write XOR with execute, abbreviated W ^ X), or stack size issues, or as in this case, memory addressing issues.

The section here is the .bss section:

$ objdump -h pencil_fft.o

pencil_fft.o:     file format elf64-x86-64

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00002dd8  0000000000000000  0000000000000000  00000040  2**2
          CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  1 .data         00000000  0000000000000000  0000000000000000  00002e18  2**2
          CONTENTS, ALLOC, LOAD, DATA
  2 .bss          000b4ca8  0000000000000000  0000000000000000  00002e20  2**5
          ALLOC

where the 0x000b4ca8 size is 740520 bytes (decimal).

Looking into the whole program, the bss section was not too big. So, back to examine the code.

The fft routines declared input and output arrays as integer(C_INT), dimension(*), intent(in) while the calling code was explicit dimensions dimension(number). Maybe that was it?

Arrays with assumed-shape and explicit-shape

The difference is whether the calling sequence delivers the array itself (as with explicit-shape), or some description of it (assumed-shape). Both the routine and the main code must agree otherwise you get dreaded Segmentation faults.

For example

$ cat >t.f90 << __end
program lbug
  implicit none
  save
  real    r3(ng,ng,ng)  ! explicit shape, ng is a macro

  print *,'hello bug'
  call someexternal(r3,3)
  print *,r3(3,3,3)
endprogram
__end

$ cat >tse.f90 << __end
subroutine someexternal(a,b)
   real a(:,:,:)        ! assumed shape
   print *,'b is ',b
   print *,a(1,1,1)
endsubroutine
__end

$ egfortran -c -cpp -Dng=20 t.f90
$ egfortran -c              tse.f90
$ egfortran t.o tse.o
$ ./a.out
 hello bug
 b is    4.20389539E-45

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

A compile-time detection for this happens when both program and subroutine appear in the same file. In this case, you get

 call someexternal(r3,3)
           1
Error: Explicit interface required for 'someexternal' at (1): assumed-shape argument

To fix this, you can choose to modify the code to be explicit (passing the dimensions along with the array), or explain the interface to the compiler with an INTERFACE definition.

The code then looks like this:

program lbugt
  implicit none
  save
  real, allocatable :: r3(:,:,:)  ! deferred shape
  interface
    subroutine someexternal(x,y)
      real, dimension (:,:,:) :: x
      real :: y
    end subroutine
  end interface

  print *,'hello bug'
  allocate (r3(ng,ng,ng))         ! ng is a macro
  call someexternal(r3,3.0)
  print *,r3(3,3,3)
endprogram

The results look like:

$ egfortran -cpp -Dng=20 -c tb.f90
$ cat lbugp.f90
subroutine someexternal(a,b)
   real a(:,:,:)   ! assumed shape
   print *,'b is ',b
   print *,a(1,1,1)
endsubroutine
$ egfortran -c lbugp.f90
$ egfortran tb.o lbugp.o
$ ./a.out
 hello bug
 b is    3.00000000
   0.00000000
   0.00000000
$

Conclusion

This is one subtlety of the new Fortran language spec: arrays are a little more complicated that you might think.

February 2020