We try three programs on two CPUs with several compiler options. The purpose is to show the variety of possible results for various kinds of numerical programs and the options available to the user.
Note: these test results are now dated, using OpenBSD 6.6 where the LLVM Fortran compiler “flang” was available. I have not updated the results to OpenBSD 6.8 or later because the conclusion is the same: you can expect various levels of performance with tuning. The software versions are not relevant here.
The programs are
heated_plate.f90
chaotic_system.cpp
phase_chain.cpp
The compilers are Flang and clang C++. We try various compile options such as
-O3 or -Ofast, the likely choice for many people
-fopenmp-simd which enables the SSE and AVX instruction sets as well as OpenMP
-march=penryn or -march=skylake according to the CPUs I have on hand
--static to see if statically compiled code is indeed faster
The hardware I have is an Intel 2-core E7400 Penryn and Intel 2-core i3 Skylake laptop.
The heated_plate code solves a partial differential equation for temperature across a plate. The data and charts are for the flang compiler, version 8.0.1 on OpenBSD 6.6.
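To give a feel for the kind of work such a program does, here is a minimal C++ sketch of one relaxation sweep over a plate. It assumes a steady-state problem solved by Jacobi iteration on a row-major grid with fixed boundary values; the text does not describe the method, so treat that as an assumption. The names Grid and jacobi_sweep are hypothetical and this is not a translation of heated_plate.f90.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical types and names for illustration only.
using Grid = std::vector<double>;  // rows x cols values, stored row-major

// One Jacobi sweep: every interior point becomes the average of its four
// neighbours in the old grid. Boundary points are copied unchanged.
void jacobi_sweep(const Grid& old_t, Grid& new_t,
                  std::size_t rows, std::size_t cols) {
    new_t = old_t;                       // keeps the boundary values
    if (rows < 3 || cols < 3) return;    // no interior points to update
    const std::size_t last_row = rows - 1;
    const std::size_t last_col = cols - 1;
    // Rows are independent, so they can be shared among OpenMP threads.
    // Without -fopenmp the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (std::size_t i = 1; i < last_row; ++i) {
        for (std::size_t j = 1; j < last_col; ++j) {
            new_t[i * cols + j] = 0.25 * (old_t[(i - 1) * cols + j] +
                                          old_t[(i + 1) * cols + j] +
                                          old_t[i * cols + j - 1] +
                                          old_t[i * cols + j + 1]);
        }
    }
}
```

A solver of this kind would call the sweep repeatedly, swapping the two grids, until the largest change between sweeps falls below a tolerance.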
This chart shows the heated_plate example with OpenMP turned on for all cases. The default setting is more than 2x slower than -O3.
A few attempts at using SSE or AVX instructions didn’t improve speed. The hpsimd program uses the simd pragma on a function performing the diff reduction step (abs(old-new)). The release notes suggest this shouldn’t do much.
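The diff reduction step can be sketched in C++ as follows. Note the swap: the hpsimd program is Fortran and uses omp declare simd on a function, whereas this sketch puts an omp simd loop with a max reduction directly on the loop. The function name max_abs_diff is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: largest absolute difference between two solutions.
// The pragma asks the compiler to vectorize the loop when built with
// -fopenmp-simd (or -fopenmp); otherwise it is ignored and the result is
// the same, just computed by an ordinary loop.
double max_abs_diff(const std::vector<double>& old_t,
                    const std::vector<double>& new_t) {
    const std::size_t n =
        old_t.size() < new_t.size() ? old_t.size() : new_t.size();
    const double* a = old_t.data();
    const double* b = new_t.data();
    double diff = 0.0;
    #pragma omp simd reduction(max : diff)
    for (std::size_t i = 0; i < n; ++i) {
        const double d = std::fabs(a[i] - b[i]);
        if (d > diff) diff = d;
    }
    return diff;
}
```

A loop this simple is often vectorized by the optimizer at -O3 anyway, which would be consistent with the pragma making little difference in the timings.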
The biggest gain (from 28.9s) is with -O3 (10.2s). The architecture-specific -march=skylake (10.2s) is not much different.
These tests were run on the Skylake processor which is why the code for the older Penryn also works.
This chart shows the heated_plate example on the Penryn CPU, again with OpenMP turned on for all cases. This older CPU (circa 2008) is not as fast as Skylake (circa 2015).
The hp.simdavx, hp.o3skylake, hpsimd.simdavx programs fail due to illegal instructions (unsupported in this older CPU) and no results can be shown.
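The illegal-instruction failures follow from which instruction sets the compiler was allowed to emit. As a rough illustration, Clang and GCC predefine macros such as __SSE4_1__, __AVX__ and __AVX2__ according to the -march and related settings, so a small helper can report what a given build assumes. The name compiled_instruction_sets is hypothetical and is not part of the test programs.

```cpp
#include <string>

// Hypothetical helper: lists the x86 vector extensions this translation
// unit was compiled for, based on compiler-predefined macros.
// It reports build settings, not what the running CPU supports.
std::string compiled_instruction_sets() {
    std::string s;
#if defined(__SSE2__)
    s += "SSE2 ";
#endif
#if defined(__SSE4_1__)
    s += "SSE4.1 ";
#endif
#if defined(__AVX__)
    s += "AVX ";
#endif
#if defined(__AVX2__)
    s += "AVX2 ";
#endif
    if (s.empty()) {
        s = "baseline";   // none of the macros above were defined
    } else {
        s.pop_back();     // drop the trailing space
    }
    return s;
}
```

Penryn supports SSE4.1 but not AVX, so a binary whose build reports AVX or AVX2 can be expected to stop with an illegal instruction on that CPU.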
chaotic_system.cpp is a C++ code example from the Boost odeint library. The code identifies the Lyapunov exponents for the Lorenz equation. Data and chart are for the clang compiler, version 8.0.1 on OpenBSD 6.6. It uses the Boost “odeint” library in serial and simd mode.
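For readers unfamiliar with the Lorenz equation, here is a minimal sketch of the system and one fixed-step integration step. It assumes the classic parameter values (sigma = 10, rho = 28, beta = 8/3), which the text does not state, and uses a hand-written fourth-order Runge-Kutta step rather than odeint. All names are hypothetical, and the Lyapunov exponent calculation itself is not shown.

```cpp
#include <array>
#include <cstddef>

using State = std::array<double, 3>;   // (x, y, z)

// Assumed classic Lorenz parameters; not taken from the benchmark source.
constexpr double kSigma = 10.0;
constexpr double kRho = 28.0;
constexpr double kBeta = 8.0 / 3.0;

// Right-hand side of the Lorenz system.
State lorenz_rhs(const State& s) {
    return State{{kSigma * (s[1] - s[0]),
                  s[0] * (kRho - s[2]) - s[1],
                  s[0] * s[1] - kBeta * s[2]}};
}

// Returns s + h * k, component by component.
State add_scaled(const State& s, double h, const State& k) {
    State r = s;
    for (std::size_t i = 0; i < r.size(); ++i) r[i] += h * k[i];
    return r;
}

// One classical fourth-order Runge-Kutta step of size dt.
State rk4_step(const State& s, double dt) {
    const State k1 = lorenz_rhs(s);
    const State k2 = lorenz_rhs(add_scaled(s, 0.5 * dt, k1));
    const State k3 = lorenz_rhs(add_scaled(s, 0.5 * dt, k2));
    const State k4 = lorenz_rhs(add_scaled(s, dt, k3));
    State r = s;
    for (std::size_t i = 0; i < r.size(); ++i) {
        r[i] += dt / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
    }
    return r;
}
```

The arithmetic is dominated by short floating-point expressions like these, which is one plausible reason the fast-math setting matters more than the simd setting here.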
The program.(config) refers to configurations with no optimization (.x), simd, fast, fast with simd instructions, and two CPU-specific settings for Penryn and Skylake.
The -Ofast option clearly has the most effect on performance, with the simd instruction set not doing much to help. Adding -march=skylake improves times from 24-ish seconds down to 21.3 seconds. The Penryn code choice has a lesser effect.
On Penryn, the -Ofast option makes a huge difference. The fastest code (Penryn-specific) at 26.4s is about 0.7s better than simd+fast. (As with heated_plate, the Skylake version does not work on a Penryn CPU.)
The phase_chain code calculates the motions of two coupled oscillators with the Boost odeint library. This is also a C++ code and it is capable of running under OpenMP.
I’ve sorted the results in order of improvement (not the order in which I actually explored the options). Here we also try the --static linking option as well as fast, simd, Penryn and Skylake optimizations. The “mp” in the program name refers to using OpenMP parallelization as well. See the Boost website for details of how odeint can be parallelized with OpenMP.
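As a rough picture of what OpenMP parallelization means for this kind of problem, the sketch below evaluates the right-hand side of a chain of phase oscillators with the loop shared among threads. It assumes nearest-neighbour sine coupling with no natural frequencies, which is a simplification chosen for illustration. It does not use odeint's own OpenMP support, and phase_chain_rhs is a hypothetical name.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical right-hand side for a chain of phase oscillators:
// each oscillator is pulled toward its left and right neighbours.
// Every iteration writes only its own element, so the loop is safe
// to split across threads. Without -fopenmp the pragma is ignored.
void phase_chain_rhs(const std::vector<double>& phi,
                     std::vector<double>& dphidt) {
    const std::size_t n = phi.size();
    dphidt.assign(n, 0.0);
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; ++i) {
        double sum = 0.0;
        if (i > 0)     sum += std::sin(phi[i - 1] - phi[i]);
        if (i + 1 < n) sum += std::sin(phi[i + 1] - phi[i]);
        dphidt[i] = sum;
    }
}
```

With only a handful of oscillators the cost of starting threads can outweigh the gain, so any benefit depends on the chain being long enough.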
Static linking has a visible effect even without optimizations (20.6s down to 17.8s) and with -Ofast (15.6s to 12.0s). OpenMP does scale fairly well (12.0s to 7.9s), but that is a speedup of about 1.5x, not the 2x that two cores could ideally give.
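The OpenMP figures can be turned into a speedup and an efficiency with a few lines of arithmetic. The sketch below also backs out a parallel fraction using Amdahl's law, which is an assumed model rather than anything measured here; the thread count of 2 is likewise an assumption based on the 2-core CPU. Function names are hypothetical.

```cpp
// Speedup of the parallel run relative to the serial run.
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

// Fraction of the ideal speedup actually achieved on `cores` cores.
double parallel_efficiency(double s, int cores) {
    return s / static_cast<double>(cores);
}

// Amdahl's law, S = 1 / ((1 - p) + p / n), solved for the parallel
// fraction p given a measured speedup S on n cores (n must be > 1).
double amdahl_parallel_fraction(double s, int cores) {
    return (1.0 - 1.0 / s) / (1.0 - 1.0 / static_cast<double>(cores));
}
```

With 12.0s serial and 7.9s parallel on two cores, this gives a speedup of about 1.52, an efficiency of about 76%, and, under the Amdahl assumption, roughly 68% of the run time being parallelizable.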
The Skylake optimization is not significantly different than -Ofast.
On Penryn the code is progressively faster as we change to static linking, -Ofast option, then both, then add OpenMP parallelization, then all three, then finally the Penryn CPU-specific code.
I don’t have an explanation for why the Penryn is significantly faster than the much newer Skylake. Some speculations: clock correctness, Skylake security mitigations (Meltdown, Spectre), operator error (me).
Notes:
heated_plate.f90 is from Source Codes in F90.
The heated_plate_simd.f90 code was composed by me from heated_plate.f90. It uses a simd function (omp declare simd) to compute the max absolute difference between solutions.
chaotic_system.cpp and phase_chain.cpp are from the Boost odeint library.
They have been slightly modified by me to use the Boost auto_cpu_timer to report CPU and elapsed times at program completion.
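For readers without Boost at hand, the same idea of reporting times when the program finishes can be sketched with the standard library alone. This is a stand-in, not the Boost timer: ScopedTimer is a hypothetical class that prints wall-clock and CPU time when it goes out of scope.

```cpp
#include <chrono>
#include <cstdio>
#include <ctime>

// Hypothetical stand-in for a scope-based timer: records the start
// times on construction and prints elapsed times on destruction.
class ScopedTimer {
public:
    ScopedTimer()
        : wall_start_(std::chrono::steady_clock::now()),
          cpu_start_(std::clock()) {}

    ~ScopedTimer() {
        std::printf("%.3fs wall, %.3fs CPU\n", wall_seconds(), cpu_seconds());
    }

    // Elapsed wall-clock time since construction.
    double wall_seconds() const {
        const auto now = std::chrono::steady_clock::now();
        return std::chrono::duration<double>(now - wall_start_).count();
    }

    // Processor time used by the process since construction.
    double cpu_seconds() const {
        return static_cast<double>(std::clock() - cpu_start_) /
               CLOCKS_PER_SEC;
    }

private:
    std::chrono::steady_clock::time_point wall_start_;
    std::clock_t cpu_start_;
};
```

Declaring one ScopedTimer at the top of main is enough to get a report at exit. On POSIX systems std::clock counts processor time for the whole process, so with OpenMP the CPU figure can exceed the wall-clock figure.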