Likwid Mpirun
Pinning to dedicated compute resources is important for pure MPI and even more so for hybrid MPI/threaded applications. While all major MPI implementations include their own pinning mechanisms, likwid-mpirun provides a simple and portable solution based on the powerful capabilities of likwid-pin. It is still experimental at the moment, but it can be adapted to any MPI and OpenMP combination with the help of a tuning application in the test directory of LIKWID. likwid-mpirun works in conjunction with PBS, LoadLeveler and SLURM. The tested compilers and MPI implementations are the Intel C/C++ compiler, GCC, Intel MPI and OpenMPI. The support for MVAPICH is untested.
As usual you can get a help message with
$ likwid-mpirun -h
You always have to specify the total number of MPI processes with the -np NUMPROC option.
Two cases are distinguished: Pure MPI and hybrid applications.
Pure MPI:
$ likwid-mpirun -np 16 ./a.out
This will start 16 processes; the number of processes per compute node is calculated from the PBS/LoadLeveler/SLURM node file. If two hosts are given, eight processes per node are pinned to cores/SMT threads. The pinning is implemented with the likwid-pin node domain.
Pure MPI with explicit pinning:
$ likwid-mpirun -np 16 -nperdomain S:2 ./a.out
For this case a single option, -nperdomain, covers all cases. The argument consists of a domain character, as already known from the other LIKWID applications, and the number of processes per domain, separated by a colon. The above example will start two processes per socket up to a total of 16 processes and will pin the processes with likwid-pin.
Domains can be:
- N - for node
- S - for socket
- C - for last level shared cache
- M - for NUMA domain (interesting e.g. for AMD Magny Cours)
For pinning on Magny Cours the following can be useful:
$ likwid-mpirun -np 16 -nperdomain M:2 ./a.out
This will start two processes per NUMA domain. On a two-socket AMD Magny Cours system this results in eight processes per node and two nodes in total for this run.
For debugging use the debug option:
$ likwid-mpirun -debug -np 16 -nperdomain M:2 ./a.out
This will output all commands that would be executed.
Pinning of hybrid applications:
$ likwid-mpirun -np 16 -pin S0:0,1_S1:0,1 ./a.out
Hybrid pinning has only one option covering all possibilities: -pin. The argument string consists of valid likwid-pin expressions separated by underscores. The number of separated expressions denotes the number of processes started per node. The above example will start two processes per node. The first process and its two threads will be pinned to the first socket, cores 0 and 1; the second process and its threads will be pinned to the second socket, cores 0 and 1. Consequently, the above command requires 8 hosts to run.
The main pinning complexity is that both the OpenMP and the MPI implementation may start their own threads for management purposes. These threads need to be skipped, and their position among the started threads has to be determined in advance. For the tested MPI+compiler combinations, the skip masks are integrated into likwid-mpirun.
At the moment all pinning uses block distribution; round-robin variants for node and global distribution are planned.
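For untested combinations the built-in skip masks may not match the actual thread creation order. In that case a mask can be passed explicitly with the -s option. The following is only a sketch: the value 0x1 is purely illustrative, and the correct mask has to be determined for the specific MPI/OpenMP combination, e.g. with the tuning application mentioned above.
$ likwid-mpirun -np 16 -pin S0:0,1_S1:0,1 -s 0x1 ./a.out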
-h, --help Help message
-v, --version Version information
-d, --debug Debugging output
-n/-np <count> Set the number of processes
-nperdomain <domain> Set the number of processes per node by giving an affinity domain and count
-pin <list> Specify pinning of threads. CPU expressions like likwid-pin separated with '_'
--dist <count>(:<order>) Specify distance between MPI processes. Orders can be 'close' or 'spread'. Default is 'close'.
-t, -tpp <count> Set the number of threads for each process
-s, --skip <hex> Bitmask with threads to skip
-mpi <id> Specify which MPI should be used. Possible values: openmpi, intelmpi, slurm and mvapich2
If not set, the module system is checked
-omp <id> Specify which OpenMP should be used. Possible values: gnu and intel
Only required for statically linked executables.
-hostfile Use custom hostfile instead of searching the environment
-g/-group <perf> Set a likwid-perfctr conform event set for measuring on nodes
-m/-marker Activate marker API mode
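The options can be combined. The following is a hypothetical example that runs a hybrid application with two processes per node and ten threads each, and explicitly selects the GNU OpenMP runtime, e.g. for a statically linked binary:
$ likwid-mpirun -np 4 -pin S0:0-9_S1:0-9 -omp gnu ./a.out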
likwid-mpirun checks for some known MPI implementations (OpenMPI, Intel MPI and MVAPICH2) in the file system and the module system. It searches for executables like mpiexec in the paths given by the environment variables MPIHOME, MPI_ROOT or MPI_BASE. If none is found, try setting the MPI type on the command line with -mpi [openmpi, intelmpi, mvapich2 or slurm].
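For example, to force the use of Intel MPI regardless of what is detected in the environment (reusing the pure-MPI example from above):
$ likwid-mpirun -mpi intelmpi -np 16 -nperdomain S:2 ./a.out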
If you are running in a batch job environment that is supported by likwid-mpirun the hosts are read from the batch system. In cases where you run it interactively or in an unsupported batch job environment, you have to generate a valid hostfile for likwid-mpirun. The syntax is very simple: List a hostname as many times as the host has slots.
localhost
localhost
localhost
host1
host2
host2
There are three slots on localhost, one slot on host1 and two slots on host2.
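Assuming this hostfile is stored as ./hosts (the filename is arbitrary), it can be passed to likwid-mpirun with the -hostfile option instead of relying on the batch environment:
$ likwid-mpirun -hostfile ./hosts -np 6 ./a.out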
Besides the correct pinning of MPI processes and their threads, the application execution can be measured using likwid-perfctr. By setting a performance group or custom event set on the command line, the call of likwid-pin is substituted with likwid-perfctr. This way you can perform end-to-end measurements or measure instrumented code using the LIKWID Marker API.
Measure the double-precision floating-point operations used by all participating systems running a hybrid application with one MPI process per socket and 10 threads per MPI process:
$ likwid-mpirun -pin S0:0-9_S1:0-9 -g FLOPS_DP ./a.out
Measure the energy used by all participating systems running one process per socket:
$ likwid-mpirun -nperdomain S:1 -g ENERGY ./a.out
likwid-mpirun is intelligent enough to measure socket-wide performance counters on only one CPU per socket; the other processes skip reading those hardware registers and read only the core-local performance counters.
When measurement is activated, no overloading of the hosts is allowed: multiple processes would read the same hardware performance counters and the final results would no longer be valid. There are plans to substitute likwid-perfctr with likwid-pin for the overloaded processes.
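If the application is instrumented with the LIKWID Marker API, the marker mode can be activated in addition to the event set so that only the instrumented regions are measured. This sketch reuses the hybrid example from above and assumes ./a.out contains Marker API calls:
$ likwid-mpirun -pin S0:0-9_S1:0-9 -g FLOPS_DP -m ./a.out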
likwid-mpirun is able to run applications through SLURM.
$ salloc -N X
$ likwid-mpirun -np 2 ./a.out
likwid-mpirun recognizes the SLURM environment and calls srun instead of mpiexec or mpirun. You can see the srun command when using the -d command line switch. Some MPI implementations require special parameters and there is currently no way to add custom options to srun. One common switch is --mpi=pmi2 (at least on our cluster). You can either change the Lua code (likwid-4.3.3: cp $(which likwid-mpirun) .; vi +592 likwid-mpirun; ./likwid-mpirun ...) or set the environment variable SLURM_MPI_TYPE=pmi2 before running likwid-mpirun.
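A minimal sketch of the environment-variable approach, assuming pmi2 is the PMI flavor required by the MPI installation on the cluster:
$ export SLURM_MPI_TYPE=pmi2
$ likwid-mpirun -np 2 ./a.out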
In some rare cases it might be required to use the MPI implementation specific way of starting applications (mpiexec, mpirun, ...). You can force this by using the -mpi command line switch.
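For example, inside a SLURM allocation the following would bypass srun and use the Intel MPI launcher instead; this is a sketch that assumes Intel MPI is the loaded implementation:
$ salloc -N 2
$ likwid-mpirun -mpi intelmpi -np 2 -nperdomain S:1 ./a.out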