The Fast Fourier Transform, FFT, and various incarnations including the popular FFTW3 library are well known in the signal-processing business. And, since just about everybody processes signals as a lifestyle, it’s a good thing to know it well.
The CUBEnu code uses FFT for convolution: converting to the frequency domain, applying a windowing function to the frequency-domain data, then applying the inverse Fourier transform to return to the spatial domain. The window function is just a sin().
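The transform, window, inverse-transform sequence described above can be sketched in a few lines. This is only an illustration, not the CUBEnu code and not FFTW: the radix-2 fft() helper, the sin_window() shape (one half-period of sin() across the bins) and the name filter_signal() are all assumptions made for the example. Like FFTW, the raw inverse transform here is unnormalized, so ifft() divides by n.

```python
import cmath
import math

def fft(x, inverse=False):
    """Recursive radix-2 FFT sketch; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def ifft(spectrum):
    """Inverse transform; the raw result is unnormalized, so scale by 1/n."""
    n = len(spectrum)
    return [v / n for v in fft(spectrum, inverse=True)]

def sin_window(n):
    """Hypothetical window: sin() over the bins, symmetric so w[k] == w[n-k]."""
    return [math.sin(math.pi * k / n) for k in range(n)]

def filter_signal(signal):
    """Forward FFT, multiply by the window, inverse FFT."""
    spectrum = fft([complex(v) for v in signal])
    window = sin_window(len(signal))
    shaped = [s * w for s, w in zip(spectrum, window)]
    return ifft(shaped)

if __name__ == "__main__":
    sig = [1.0, 2.0, 0.0, -1.0, 0.5, 3.0, -2.0, 1.0]
    for v in filter_signal(sig):
        print(round(v.real, 6))
```

Because the assumed window is symmetric, a real input gives a real output up to rounding, and since the window is zero at bin 0 the constant (DC) part of the signal is removed.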
The code fragments call the planning and execution routines according to most of the recommendations in the documentation (FFTW version 3.3.7, October 2017). A couple of issues remain. One is using the recommended Fortran routines dfftw_execute_dft, dfftw_execute_dft_r2c, etc. The other is that the routine names are slightly different for single-precision Fortran (an sfftw_ prefix rather than dfftw_).
One possible option is the unaligned-data flag, which is not currently used, together with the consequent complicated alignment-compatible allocation mechanism. I didn’t try that, at least not the Fortran version. See section 7.4 of the FFTW doc for details.
So, totally abandoning the OpenBSD packages for a build-your-own approach is not a good idea. But if you absolutely, positively need the highest performance, this might be worth the trouble, which is mostly the issue of reproducing this build a year or three from now when you no longer remember how.
Here is my method:
cd
mkdir clang gcc src
cd src
ftp https://path.to.source.files.org/fftw-3.3.8.tar.gz
tar xzf fftw-3.3.8.tar.gz
cd fftw-3.3.8
./configure --help
./configure --quiet \
--prefix=$HOME/clang \
--disable-doc \
--enable-threads \
--enable-sse2 \
--enable-float \
CC=cc CFLAGS="-O3 -march=penryn" \
F77=flang
gmake
gmake check
gmake install
gmake clean
And repeat without --enable-float.
For GCC, use CC=egcc and CFLAGS="-O3 -march=core2 -mtune=core2".
./configure --quiet \
--prefix=$HOME/gcc \
--disable-doc \
--enable-threads \
--enable-sse2 \
--enable-float \
CC=egcc CFLAGS="-O3 -march=core2 -mtune=core2 " \
F77=egfortran
gmake
gmake check
gmake install
gmake clean
And another run without the --enable-float.
In spite of all this work, the result is very close to the OpenBSD package results. Custom compilation has no useful effect on Penryn (Core2).
In OpenBSD, a malloc() for a large array usually results in a page-aligned allocation. Just out of curiosity I tried a few 4-byte offsets from such a page-aligned pointer, and got a small but consistent slowdown of about 7%. Don’t do that.
The alignment option that the FFTW3 documentation describes does work but is not faster, at least not on Penryn CPUs.
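To see what a 4-byte offset does to a page-aligned pointer, the arithmetic can be sketched as below. The 16-byte boundary is an assumption (it is the alignment that SSE2 aligned loads need), the 4096-byte page size is an assumption, and the base address is a made-up example value. The misalignment() helper is hypothetical, not an FFTW routine. Losing 16-byte alignment is one plausible reason for the slowdown, not a measured cause.

```python
SIMD_ALIGNMENT = 16   # bytes; assumed requirement for SSE2 aligned loads
PAGE_SIZE = 4096      # bytes; assumed page size

def misalignment(address, boundary=SIMD_ALIGNMENT):
    """Bytes by which address overshoots the previous multiple of boundary."""
    return address % boundary

# Hypothetical page-aligned base address, standing in for a large malloc()
base = 0x7F0000001000

if __name__ == "__main__":
    for offset in (0, 4, 8, 12, 16):
        addr = base + offset
        rem = misalignment(addr)
        state = "aligned" if rem == 0 else "unaligned"
        print(f"offset {offset:2d}: {rem:2d} bytes past a 16-byte boundary ({state})")
```

Only offsets that are multiples of 16 keep the assumed SIMD alignment; offsets of 4, 8 and 12 bytes all break it.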
January 2020
Links
FFTW Home Page
Smith - The Scientist’s and Engineer’s Guide to Digital Signal Processing
OpenBSD Numerics Experience - 1 - RNG
OpenBSD Numerics Experience - 2 - RNG floats
OpenBSD Numerics Experience - 3 - FFTW
OpenBSD Numerics Experience - 4 - CAF
OpenBSD Numerics Experience - 5 - MPI Networking
OpenBSD Numerics Experience - 6 - Memory Models
OpenBSD Numerics Experience - 7 - Python Image Display
OpenBSD Numerics Experience - 8 - RNGs, again
OpenBSD Numerics Experience - 9 - Nim
OpenBSD Numerics Experience - A - Graphical Display
OpenBSD Numerics Experience - B - ParaView
OpenBSD Numerics Experience - C - Numerical Debugging
OpenBSD Numerics