Numerics on OpenBSD - Examples

Comparing compiler settings across two CPUs

We run three programs on two CPUs with several compiler options. The purpose is to show how much the results can vary with the kind of numerical program and the options available to the user.

Note: these test results are now dated. They use OpenBSD 6.6, where the LLVM Fortran compiler “flang” was available. I have not updated the results to OpenBSD 6.8 or later because the conclusion is the same: you can expect various levels of performance with tuning. The software versions are not relevant here.

The programs are

Iterative heat distribution problem: heated_plate.f90
Boost example, chaotic system integrated with odeint: chaotic_system.cpp
Boost example, OpenMP use with odeint: phase_chain.cpp

The compilers are Flang and clang C++. We try various compile options such as

no option – the default is little optimization
-O3 or -Ofast – the likely choice for many people
-fopenmp-simd, which enables the OpenMP simd directives so that marked loops and functions can be vectorized with SSE or AVX instructions
-march=penryn or -march=skylake according to the CPUs I have on hand
--static to see if statically compiled code is indeed faster
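
One way to check what a given set of options actually switched on is to report the compiler's predefined macros from inside the program. The following is a minimal sketch, assuming the usual clang/gcc macro names; the function name build_features is hypothetical and not part of any of the test programs.

```cpp
#include <string>

// Report which instruction-set, OpenMP and optimization macros the compiler
// predefined for this translation unit (usual clang/gcc macro names).
std::string build_features() {
    std::string s;
#if defined(__SSE2__)
    s += "sse2 ";
#endif
#if defined(__SSE4_1__)
    s += "sse4.1 ";      // expected with -march=penryn or newer
#endif
#if defined(__AVX__)
    s += "avx ";
#endif
#if defined(__AVX2__)
    s += "avx2 ";        // expected with -march=skylake
#endif
#if defined(__FMA__)
    s += "fma ";
#endif
#if defined(_OPENMP)
    s += "openmp ";      // defined when full OpenMP (-fopenmp) is enabled
#endif
#if defined(__FAST_MATH__)
    s += "fast-math ";   // defined under -ffast-math, which -Ofast implies
#endif
#if defined(__OPTIMIZE__)
    s += "optimized ";   // defined at -O1 and above
#endif
    if (s.empty()) s = "none";
    return s;
}
```

Printing the returned string at program start makes it easy to confirm which build produced a given timing.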

The hardware I have is a 2-core Intel E7400 (Penryn) and a 2-core Intel i3 (Skylake) laptop.

Heated_Plate

The code solves a partial-differential equation for temperature across a plate. The data and charts are for the flang compiler, version 8.0.1 on OpenBSD 6.6.

This chart shows the heated_plate example on the Skylake CPU with OpenMP turned on for all cases. The default setting (no optimization) is more than 2x slower than -O3.

A few attempts at using SSE or AVX instructions didn’t improve speed. The hpsimd program uses the simd pragma on a function performing the diff reduction step (abs(old-new)). The release notes suggest this shouldn’t do much.
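
The diff reduction step can be illustrated in C++. This is a minimal sketch, not the Fortran code used for the measurements: it puts a loop-level simd pragma on the reduction rather than declaring a simd function, and the name max_abs_diff is hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Largest absolute change between two iterates of the same size.
// With -fopenmp or -fopenmp-simd the pragma asks the compiler to vectorize
// the loop; without those options the pragma is ignored and the loop is
// compiled as ordinary scalar code.
double max_abs_diff(const std::vector<double>& old_u,
                    const std::vector<double>& new_u) {
    double diff = 0.0;
    const std::size_t n = old_u.size();
#pragma omp simd reduction(max : diff)
    for (std::size_t i = 0; i < n; ++i) {
        const double d = std::fabs(old_u[i] - new_u[i]);
        if (d > diff) diff = d;
    }
    return diff;
}
```

A loop this simple is often auto-vectorized at -O3 anyway, which would be consistent with the pragma making little difference to the timings.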

The biggest gain comes from -O3, which cuts the run time from 28.9s to 10.2s. Adding the architecture-specific -march=skylake (10.2s) makes no measurable difference. These tests were run on the Skylake processor, which is why the code built for the older Penryn also runs.

This chart shows the heated_plate example on the Penryn CPU, again with OpenMP turned on for all cases. This older CPU (circa 2008) is not as fast as Skylake (circa 2015).

The hp.simdavx, hp.o3skylake and hpsimd.simdavx programs fail with illegal-instruction errors (the instructions are unsupported on this older CPU), so no results can be shown for them.

Chaotic System

This is a C++ example from the Boost odeint library. The code computes the Lyapunov exponents of the Lorenz system. Data and chart are for the clang compiler, version 8.0.1, on OpenBSD 6.6. The program is built in both serial and simd configurations.
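
To give a feel for the kind of work this benchmark does, here is a minimal sketch that estimates only the largest Lyapunov exponent of the Lorenz system. It is not the Boost example: it uses a hand-written fourth-order Runge-Kutta step instead of odeint, and follows two nearby trajectories instead of evolving perturbation vectors. The parameters, initial state, step size and function names are assumptions chosen for illustration.

```cpp
#include <array>
#include <cmath>

using State = std::array<double, 3>;

// Lorenz right-hand side with the classic parameters sigma=10, R=28, b=8/3.
State lorenz(const State& s) {
    const double sigma = 10.0, R = 28.0, b = 8.0 / 3.0;
    return State{{sigma * (s[1] - s[0]),
                  R * s[0] - s[1] - s[0] * s[2],
                  s[0] * s[1] - b * s[2]}};
}

// One classical fourth-order Runge-Kutta step of size dt.
State rk4_step(const State& s, double dt) {
    auto axpy = [](const State& a, double h, const State& k) {
        return State{{a[0] + h * k[0], a[1] + h * k[1], a[2] + h * k[2]}};
    };
    const State k1 = lorenz(s);
    const State k2 = lorenz(axpy(s, 0.5 * dt, k1));
    const State k3 = lorenz(axpy(s, 0.5 * dt, k2));
    const State k4 = lorenz(axpy(s, dt, k3));
    State out{};
    for (int i = 0; i < 3; ++i)
        out[i] = s[i] + dt / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
    return out;
}

// Follow two nearby trajectories, accumulate the log of their separation
// growth, and rescale the separation back to d0 after every step.
double largest_lyapunov(int steps, double dt) {
    const double d0 = 1e-8;
    State a{{10.0, 10.0, 5.0}};
    for (int i = 0; i < 1000; ++i) a = rk4_step(a, dt);  // discard transient
    State b{{a[0] + d0, a[1], a[2]}};
    double sum = 0.0;
    for (int i = 0; i < steps; ++i) {
        a = rk4_step(a, dt);
        b = rk4_step(b, dt);
        double d = 0.0;
        for (int j = 0; j < 3; ++j) d += (b[j] - a[j]) * (b[j] - a[j]);
        d = std::sqrt(d);
        sum += std::log(d / d0);
        for (int j = 0; j < 3; ++j) b[j] = a[j] + (b[j] - a[j]) * d0 / d;
    }
    return sum / (steps * dt);
}
```

Calling largest_lyapunov(50000, 0.01) should give a value in the neighbourhood of the commonly quoted 0.9 for these parameters. The inner loops are short, floating-point heavy and mostly serial, which is the kind of code where -Ofast tends to matter more than simd hints.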

The program.(config) names refer to configurations with no optimization (.x), simd, fast, fast with simd instructions, and two CPU-specific settings for Penryn and Skylake.

The -Ofast option clearly has the most effect on performance, with the simd instruction set not doing much to help. Adding -march=skylake improves times from 24-ish seconds to 21.3 seconds. The -march=penryn setting has a smaller effect.

On Penryn, the -Ofast option makes a huge difference. The fastest code (Penryn-specific) at 26.4s is about 0.7s better than simd+fast. (As with heated_plate, the Skylake version does not work on a Penryn CPU.)

Phase Chain

The code calculates the motions of two coupled oscillators with the Boost odeint library. This is also C++ code, and it can run in parallel under OpenMP.

I’ve sorted the results in order of improvement, which is not the order in which I actually explored the options. Here we also try the --static linking option in addition to the fast, simd, Penryn and Skylake optimizations. The “mp” in the program name means the build uses OpenMP parallelization. See the Boost website for details of how odeint can be parallelized with OpenMP.

Static linking has a visible effect both without optimization (20.6s down to 17.8s) and with -Ofast (15.6s down to 12.0s). OpenMP scales fairly well on two cores (12.0s down to 7.9s, about a 1.5x speedup), though short of the ideal 2x.
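
The arithmetic behind these comparisons is simple enough to sketch. The helper names below are hypothetical; the timings are the ones quoted above.

```cpp
// Speedup of a faster run relative to a baseline run (both in seconds).
double speedup(double t_base, double t_new) {
    return t_base / t_new;
}

// Fraction of ideal scaling achieved on `cores` cores (1.0 means perfect).
double parallel_efficiency(double t_serial, double t_parallel, int cores) {
    return speedup(t_serial, t_parallel) / cores;
}
```

With the numbers above, speedup(12.0, 7.9) is about 1.52, so parallel_efficiency(12.0, 7.9, 2) is about 0.76. For static linking, speedup(20.6, 17.8) is about 1.16 and speedup(15.6, 12.0) is 1.30.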

The Skylake optimization is not significantly different from -Ofast.

On Penryn the code is progressively faster as we change to static linking, -Ofast option, then both, then add OpenMP parallelization, then all three, then finally the Penryn CPU-specific code.

I don’t have an explanation for why the Penryn is significantly faster than the much newer Skylake. Some speculations: clock correctness, Skylake security mitigations (Meltdown, Spectre), or operator error (mine).

Notes:

heated_plate.f90 is from Source Codes in F90.

The heated_plate_simd.f90 code was adapted by me from heated_plate.f90. It uses a simd function (omp declare simd) to compute the max absolute difference between solutions.

chaotic_system.cpp and phase_chain.cpp are from the Boost Odeint library.

They have been slightly modified by me to use the Boost auto_cpu_timer class to report CPU and elapsed times at program completion.
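
For readers without Boost at hand, the same idea can be sketched with the standard library alone. This is a stand-in, not the Boost class: the name ScopedTimer and its output format are assumptions made for illustration.

```cpp
#include <chrono>
#include <cstdio>
#include <ctime>

// Standard-library stand-in for a scope-based CPU timer: starts timing on
// construction and prints wall-clock and CPU time on destruction.
class ScopedTimer {
public:
    ScopedTimer()
        : wall_start_(std::chrono::steady_clock::now()),
          cpu_start_(std::clock()) {}

    // Elapsed wall-clock time since construction, in seconds.
    double wall_seconds() const {
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - wall_start_).count();
    }

    // Processor time used by the process since construction, in seconds.
    double cpu_seconds() const {
        return static_cast<double>(std::clock() - cpu_start_) / CLOCKS_PER_SEC;
    }

    ~ScopedTimer() {
        std::printf("%.3fs wall, %.3fs CPU\n", wall_seconds(), cpu_seconds());
    }

private:
    std::chrono::steady_clock::time_point wall_start_;
    std::clock_t cpu_start_;
};
```

Declaring a ScopedTimer as the first statement of main() reports both times when the program exits. On POSIX systems std::clock counts processor time for the whole process, so under OpenMP the CPU figure can exceed the wall-clock figure.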
