Numerics on OpenBSD - Clusters

Making compiled code run across a cluster

Clusters allow the solution of problems that don’t fit in the memory of one machine, or that one machine cannot solve quickly enough.

A cluster of OpenBSD machines would include a common filesystem (NFS) accessible from every node, identical user accounts on all machines, and OpenMPI installed on each machine. An important prerequisite is non-interactive (key-based) SSH access amongst the nodes; a rough sketch of this preparation follows.
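A sketch of the preparation, assuming doas is configured, node1 and node2 are placeholder hostnames, and the home directory is shared over NFS:

$ doas pkg_add openmpi                  # on every node
$ ssh-keygen -t ed25519                 # once; home is shared
$ cat ~/.ssh/id_ed25519.pub >> ~/.ssh/authorized_keys
$ cat > hostfile << __end
node1 slots=2
node2 slots=2
__end

With a shared home directory, appending the public key to authorized_keys authorizes logins between all nodes. The hostfile can later be passed to mpirun with -hostfile to place processes on the listed machines; slots is the number of processes each node should accept.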

OpenBSD and OpenMPI

The OpenMPI (MPI) libraries and runtime provide automated startup and management of the same program running on multiple computers. This is single-program multiple-data (SPMD) computing. Each separate instance of the program is started on its assigned computer to compute its part of the result.

The OpenMPI implementation of MPI is a supported port for OpenBSD. We call it MPI here to avoid confusion with similarly named libraries (OpenMP, OpenMPT, etc.).

MPI is built with GCC compilers for C and Fortran. You can also use the clang compiler with this MPI runtime.
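For example, to compile with the base system clang instead of GCC, the wrapper’s compiler can be overridden with the standard OMPI_CC variable (prog.c is a placeholder source file):

$ OMPI_CC=cc mpicc prog.c

The same wrapper flags and MPI libraries are used; only the underlying compiler changes.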

GCC Fortran also supports Co-Array Fortran (CAF), which is a higher-level language abstraction that uses MPI as a basis. CAF support for a cluster requires an additional (unsupported) OpenCoarrays library to be installed. CAF support for a single machine is included in the GCC Fortran package.

MPI on OpenBSD 7.1

OpenBSD version 7.1 includes OpenMPI version 4.1.2 in ports. This version has a couple of issues:

  1. The file-based data service used to coordinate nodes uses an unsupported pthreads call. An alternative needs to be specified.
  2. With GCC Fortran (the egfortran compiler), a number of irrelevant symbol definition warnings are often issued at runtime. You may safely ignore these.

The first issue is fixed by setting this environment variable before using mpirun:

export PMIX_MCA_gds=hash

OpenBSD and MPI and OpenMP

The SMP capability of individual cluster members is accessible with OpenMP (as described here; a reminder that this is not supported).

Cluster computing is accessible with MPI. See our examples below.

MPI code example

cat > test1.c << __end
#include "mpi.h"
#include <stdio.h>
int
main(void)
{
    int rank, size;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
printf("Hello, world, I am %d of %d\n", rank, size);
MPI_Finalize();

return 0;
}
__end


$ mpicc test1.c
$ export PMIX_MCA_gds=hash
$ mpirun -np 2 -H localhost:2 ./a.out
Hello, world, I am 1 of 2
Hello, world, I am 0 of 2

This demonstrates basic MPI functioning. For details on what is actually happening behind the scenes, add --showme to the mpicc command, or --display-map to the mpirun command.
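For example, with the same test1.c and environment as above:

$ mpicc --showme test1.c
$ mpirun --display-map -np 2 -H localhost:2 ./a.out

The first command only prints the underlying compiler and linker invocation that the wrapper would run; the second prints a map of which ranks are placed on which hosts before the normal program output.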

The PMIX_MCA_gds definition avoids a “lazy binding failed” error with the OpenMPI distributed database component “gds”.

MPI and OpenMP code example

Here is an example program showing OpenMP (parallel threads) and OpenMPI (distributed processing across multiple nodes). It computes the same value on most nodes, and a unique value on one. It then prints the total.

cat > test2.c << __end
#include <mpi.h>
#include <omp.h>
#include <stdio.h>
int
main(void)
{
    int rank, size, nthreads, tot;
    int a[100100], i;
    tot = 0;

    MPI_Init(NULL, NULL);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    nthreads = omp_get_max_threads();
    if (rank > 0) {
        /* The full sum (about 1.0e10) exceeds INT_MAX, so tot wraps
           around; the totals printed below reflect this overflow. */
        #pragma omp parallel for private(i) reduction(+ : tot)
        for (i = 0; i < 100100; i++) {
            a[i] = 2 * i;
            tot = tot + a[i];
        }
    } else {
        tot = -1;
    }
    printf("Hello, world, I am %d of %d, threads %d, tot %d\n",
       rank,
       size,
       nthreads,
       tot);
    MPI_Finalize();
    return 0;
}
__end


$ OMPI_CFLAGS=-fopenmp mpicc test2.c -lomp
$ OMP_NUM_THREADS=4 mpirun --np 3 -H localhost:3 a.out
Hello, world, I am 0 of 3, threads 4, tot -1
Hello, world, I am 2 of 3, threads 4, tot 1429975308
Hello, world, I am 1 of 3, threads 4, tot 1429975308

This example shows a combination of OpenMP SMP (multithreading) calculating the array a and totalling it, and MPI running the same code several times.

The OMPI_CFLAGS variable is needed to specify -fopenmp; otherwise compiler errors occur. This directs the mpicc wrapper at the compile step.

The OMP_NUM_THREADS variable selects the number of active threads (the default is the number of CPUs on the platform). This directs OpenMP at runtime.

I leave it as an exercise for the reader to add an MPI reduction over the “size” copies of tot; a sketch follows.
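As a hint only, a minimal sketch of such a reduction, placed just before MPI_Finalize in test2.c (grand_tot is a name introduced here):

    int grand_tot = 0;

    /* Combine every rank's tot into grand_tot on rank 0. */
    MPI_Reduce(&tot, &grand_tot, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("grand total %d\n", grand_tot);

Note that the per-rank totals already wrap around a 32-bit int at this array size, so the grand total wraps as well; using long variables and MPI_LONG would avoid that.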

When to MPI?

The main issue with MPI is networking: it is slow. A 1 Gbit/s Ethernet can move about 15 million double-precision floats per second (1 Gbit/s is roughly 125 MB/s, divided by 8 bytes per double). Memory bandwidth of most CPUs is 20 to 40 GB/s, or roughly 2.5 to 5 billion double-precision floats per second. If you have a choice, use a many-core, many-gigabyte SMP computer, not a cluster.

OpenMPI is the coordinated effort of many people with lots of experience in several prior implementations of the MPI standard. As a result, OpenMPI is quite sophisticated, and to the uninitiated, opaque and complex.

Writing code to use MPI or OpenMPI is fairly well described in several places. Using OpenMPI to best advantage to run that code is not well described. A few hints are given above.

See also the OpenBSD pkg-readme at

/usr/local/share/doc/pkg-readmes/openmpi

I could find no current architectural description, at least none written since PMIx (a massive scaling feature) was added.

The manpage documentation is insufficient for beginners, and the various technical presentations at the annual Supercomputing conferences are written for the highly initiated. The README file (in the main repo) is lengthy, 2285 lines, and is worth some study, though its OpenBSD-specific advice should be ignored. The FAQ on the website is mostly obsolete.

The best guides that I could find are listed under Links.

Note on supercomputing environments: Most sites have hundreds or thousands of users running dozens or hundreds of jobs. Users must use a “module” command to enable specific versions of software, e.g. module load openmpi/gcc. Users must write shell scripts and use a job scheduler such as Slurm, PBS or SGE, which maintains a queue of runnable shell scripts. Job schedulers accept details on resources (number of CPUs, number of cores, time limits) and provide these to the scripted job, as in the sketch below. None of this is available on OpenBSD (yet).
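Purely for illustration, a generic Slurm batch script might look like this (the module name comes from the text above; the task count and time limit are arbitrary):

#!/bin/sh
#SBATCH --ntasks=8
#SBATCH --time=00:10:00
module load openmpi/gcc
mpirun ./a.out

Such a script is submitted with sbatch and waits in the queue until the scheduler grants the requested resources.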

Links:

OpenMPI home page

MPI Tutorials Several programming examples, but describes the older mpich implementation, not OpenMPI.

OpenMPI Github repositories Includes main repo, some subsidiary software, the website, and documentation.

OpenMPI Papers list, published in the last 15 years

Modular Component Architecture slides (2006) PDF slides. Still valid as description but leaves out all the later innovations like PMIx.

Process Management Interface - Exascale - PMIx PDF slides. Motivates the PMIx development. Fairly technical.

