-
Notifications
You must be signed in to change notification settings - Fork 233
TutorialMarkerC
The Marker API consists of a bunch of function calls and defines that enable the measuring of code regions to get a better insight of the system's activities executing the code region. In scientific programming the interesting code regions are often calculation loops.
The instrumentation is done inside your application but the setup of the performance counters is performed by likwid-perfctr. After the application run, the results are evaluated by likwid-perfctr.
The functions are defined in the header file likwid-marker.h
. Each function can be either called directly or through some defines. Using the defines is recommended because it enables the (dis)activating of the Marker API at build time. The defines are only resolved to the functions if -DLIKWID_PERFMON
is set as compiler flag.
-
LIKWID_MARKER_INIT
orlikwid_markerInit()
Initialize the Marker API and read the configured eventsets from the environment -
LIKWID_MARKER_THREADINIT
orlikwid_markerThreadInit()
Add thread to the Marker API and pin the application thread (commonly only required if not pthread based threading). -
LIKWID_MARKER_REGISTER(char* tag)
orlikwid_markerRegisterRegion(char* tag)
Register a region nametag
to the Marker API. This creates an entry in the internally used hash table to minimize overhead at the start of the named region. This call is optional, the same operations are done by start if not registered previously. -
LIKWID_MARKER_START(char* tag)
orlikwid_markerStartRegion(char* tag)
Start a named region identified bytag
. This reads the counters are stores the values in the thread's hash table. If the hash table entrytag
does not exist, it is created. -
LIKWID_MARKER_STOP(char* tag)
orlikwid_markerStopRegion(char* tag)
Stop a named region identified bytag
. This reads the counters are stores the values in the thread's hash table. It is assumed that a STOP can only follow a START, hence no existence check of the hash table entry is performed. -
LIKWID_MARKER_GET(char* tag, int* nevents, double* events, double* time, int* count)
orlikwid_markerGetRegion(char* tag, int* nevents, double* events, double* time, int* count)
If you want to process a code regions measurement results in the instrumented application itself, you can call this function to get the intermediate results. The region is identified bytag
. Thenevents
parameter is used to specify the length of theevents
array. After the function returns,nevents
is the number of events filled in theevents
array. The aggregated measurement time is returned intime
and the amount of measurements is returned incount
. -
LIKWID_MARKER_SWITCH
orlikwid_markerNextGroup()
Switch to the next eventset. If only a single eventset is given, the function performs no operation. If multiple eventsets are configured, this function switches through the eventsets in a round-robin fashion.
Notice: This function creates the biggest overhead of all Marker API functions as it has to setup the register to the next eventset. -
LIKWID_MARKER_CLOSE
orlikwid_markerClose()
Finalize the Marker API and write the aggregated results of all regions to a file that is picked up by likwid-perfctr for evaulation.
Let's assume we have a C code with a interesting code region, like:
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
#define N 10000
int main(int argc, char* argv[])
{
int i;
double data[N];
#pragma omp parallel for
for(i = 0; i < N; i++)
{
data[i] = omp_get_thread_num();
}
return 0;
}
We want to measure the parallel for loop. We need to perform some transformations to the code to place the Marker API functions to the right places. At first we need to substitute the #pragma omp parallel for
with two pragmas to #pragma omp parallel
and #pragma omp for
, so that we can add the start and stop calls. Moreover, we have to add the initialization and finalization API calls. The resulting code looks like:
#include <stdlib.h>
#include <stdio.h>
#include <omp.h>
// This block enables compilation of the code with and without LIKWID in place
#ifdef LIKWID_PERFMON
#include <likwid-marker.h>
#else
#define LIKWID_MARKER_INIT
#define LIKWID_MARKER_THREADINIT
#define LIKWID_MARKER_SWITCH
#define LIKWID_MARKER_REGISTER(regionTag)
#define LIKWID_MARKER_START(regionTag)
#define LIKWID_MARKER_STOP(regionTag)
#define LIKWID_MARKER_CLOSE
#define LIKWID_MARKER_GET(regionTag, nevents, events, time, count)
#endif
#define N 10000
int main(int argc, char* argv[])
{
int i;
double data[N];
LIKWID_MARKER_INIT;
#pragma omp parallel
{
LIKWID_MARKER_REGISTER("foo");
}
#pragma omp parallel
{
LIKWID_MARKER_START("foo");
#pragma omp for
for(i = 0; i < N; i++)
{
data[i] = omp_get_thread_num();
}
LIKWID_MARKER_STOP("foo");
}
LIKWID_MARKER_CLOSE;
return 0;
}
The call to LIKWID_MARKER_INIT
initializes the Marker API and adds all configured eventsets to the LIKWID library. Each thread registers the upcoming region foo
and initializes access to the hardware counters. We could also fuse the two parallel regions but there must be a barrier between LIKWID_MARKER_REGISTER(tag)
and LIKWID_MARKER_START(tag)
which is done implicitly here because each parallel region ends with a barrier unless nowait
is specified. In the second parallel region we perform LIKWID_MARKER_START("foo")
to start the measurement for each thread and store the start value of the counters using the name foo
. After the loop we are interested in, each thread stops the measurement phase, stores the stop value of the counters and aggregates the result in the hash table entry foo
. After the parallel region, we finalize the Marker API and write the results to a file for later evaluation.
This is only a basic instrumentation, it does not take multiple eventsets or internal processing of measurement results into account. A slightly enhanced example can be found here: https://github.com/RRZE-HPC/likwid/blob/master/examples/C-markerAPI.c
At first we need to know the include and library paths for LIKWID. There is no common way but if you used the default build process you can use this:
$ /bin/bash -c "echo LIKWID_LIB=$(dirname $(which likwid-perfctr))/../lib/"
$ /bin/bash -c "echo LIKWID_INCLUDE=$(dirname $(which likwid-perfctr))/../include/"
This prints the paths to the LIKWID library and the header files. Assuming you have exported the paths in your environment, you can build your application using gcc:
$ gcc -fopenmp -DLIKWID_PERFMON -L$LIKWID_LIB -I$LIKWID_INCLUDE <SRC> -o <EXEC> -llikwid
You can (de)activate the Marker API integration with the define -DLIKWID_PERFMON
. If you omit it, the calls resolve to an empty string causing no overhead.
By now, we have not defined which performance counters should be measured and which metrics be derived for our application. This is done using likwid-perfctr which also performs the pinning of threads and the validity checking of the given eventsets.
Run the application serially:
$ likwid-perfctr -C S0:0 -g L3 -m <EXEC>
Run the application parallel using multiple CPUs:
$ likwid-perfctr -C S0:0-3 -g L3 -m <EXEC>
A possible output of a parallel run looks like this:
--------------------------------------------------------------------------------
CPU name: Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
CPU type: Intel Xeon Haswell EN/EP/EX processor
CPU clock: 2.30 GHz
--------------------------------------------------------------------------------
YOUR PROGRAM OUTPUT
--------------------------------------------------------------------------------
================================================================================
Group 1 L3: Region foo
================================================================================
+-------------------+----------+----------+----------+----------+
| Region Info | Core 0 | Core 1 | Core 2 | Core 3 |
+-------------------+----------+----------+----------+----------+
| RDTSC Runtime [s] | 0.001782 | 0.001733 | 0.001734 | 0.001740 |
| call count | 1 | 1 | 1 | 1 |
+-------------------+----------+----------+----------+----------+
+---------------------------+---------+--------------+--------------+--------------+--------------+
| Event | Counter | Core 0 | Core 1 | Core 2 | Core 3 |
+---------------------------+---------+--------------+--------------+--------------+--------------+
| INSTR_RETIRED_ANY | FIXC0 | 8.915056e+06 | 6.420429e+06 | 3.921364e+06 | 1.421908e+06 |
| CPU_CLK_UNHALTED_CORE | FIXC1 | 4.956292e+06 | 3.699170e+06 | 2.406131e+06 | 1.092918e+06 |
| CPU_CLK_UNHALTED_REF | FIXC2 | 4.015639e+06 | 2.998418e+06 | 1.951642e+06 | 8.863970e+05 |
| L2_LINES_IN_ALL | PMC0 | 3.069070e+05 | 2.392480e+05 | 1.616770e+05 | 6.799700e+04 |
| L2_LINES_OUT_DEMAND_DIRTY | PMC1 | 6.030000e+02 | 6.210000e+02 | 7.360000e+02 | 1.063000e+03 |
+---------------------------+---------+--------------+--------------+--------------+--------------+
+--------------------------------+---------+----------+---------+---------+------------+
| Event | Counter | Sum | Min | Max | Avg |
+--------------------------------+---------+----------+---------+---------+------------+
| INSTR_RETIRED_ANY STAT | FIXC0 | 20678757 | 1421908 | 8915056 | 5169689.25 |
| CPU_CLK_UNHALTED_CORE STAT | FIXC1 | 12154511 | 1092918 | 4956292 | 3038627.75 |
| CPU_CLK_UNHALTED_REF STAT | FIXC2 | 9852096 | 886397 | 4015639 | 2463024 |
| L2_LINES_IN_ALL STAT | PMC0 | 775829 | 67997 | 306907 | 193957.25 |
| L2_LINES_OUT_DEMAND_DIRTY STAT | PMC1 | 3023 | 603 | 1063 | 755.75 |
+--------------------------------+---------+----------+---------+---------+------------+
+-------------------------------+--------------+--------------+--------------+--------------+
| Metric | Core 0 | Core 1 | Core 2 | Core 3 |
+-------------------------------+--------------+--------------+--------------+--------------+
| Runtime (RDTSC) [s] | 0.00178246 | 0.001733079 | 0.001733966 | 0.001739621 |
| Runtime unhalted [s] | 2.154906e-03 | 1.608332e-03 | 1.046142e-03 | 4.751810e-04 |
| Clock [MHz] | 2.838773e+03 | 2.837531e+03 | 2.835617e+03 | 2.835880e+03 |
| CPI | 5.559463e-01 | 5.761562e-01 | 6.135954e-01 | 7.686278e-01 |
| L3 load bandwidth [MBytes/s] | 1.101963e+04 | 8.835069e+03 | 5.967434e+03 | 2.501584e+03 |
| L3 load data volume [GBytes] | 0.019642048 | 0.015311872 | 0.010347328 | 0.004351808 |
| L3 evict bandwidth [MBytes/s] | 2.165098e+01 | 2.293260e+01 | 2.716547e+01 | 3.910737e+01 |
| L3 evict data volume [GBytes] | 3.8592e-05 | 3.9744e-05 | 4.7104e-05 | 6.8032e-05 |
| L3 bandwidth [MBytes/s] | 1.104128e+04 | 8.858001e+03 | 5.994600e+03 | 2.540691e+03 |
| L3 data volume [GBytes] | 0.01968064 | 0.015351616 | 0.010394432 | 0.00441984 |
+-------------------------------+--------------+--------------+--------------+--------------+
+------------------------------------+-------------+-------------+-------------+---------------+
| Metric | Sum | Min | Max | Avg |
+------------------------------------+-------------+-------------+-------------+---------------+
| Runtime (RDTSC) [s] STAT | 0.006989126 | 0.001733079 | 0.00178246 | 0.0017472815 |
| Runtime unhalted [s] STAT | 0.005284561 | 0.000475181 | 0.002154906 | 0.00132114025 |
| Clock [MHz] STAT | 11347.801 | 2835.617 | 2838.773 | 2836.95025 |
| CPI STAT | 2.5143257 | 0.5559463 | 0.7686278 | 0.628581425 |
| L3 load bandwidth [MBytes/s] STAT | 28323.717 | 2501.584 | 11019.63 | 7080.92925 |
| L3 load data volume [GBytes] STAT | 0.049653056 | 0.004351808 | 0.019642048 | 0.012413264 |
| L3 evict bandwidth [MBytes/s] STAT | 110.85642 | 21.65098 | 39.10737 | 27.714105 |
| L3 evict data volume [GBytes] STAT | 0.000193472 | 3.8592e-05 | 6.8032e-05 | 4.8368e-05 |
| L3 bandwidth [MBytes/s] STAT | 28434.572 | 2540.691 | 11041.28 | 7108.643 |
| L3 data volume [GBytes] STAT | 0.049846528 | 0.00441984 | 0.01968064 | 0.012461632 |
+------------------------------------+-------------+-------------+-------------+---------------+
At first the group (eventset) and region name is printed followed by an information table listing the number of region calls and the measurement time for each CPU. Since our instrumented code runs the named region only once, the call count is 1. The next table contains the raw aggregated counter values of the region calls. If we use more than one CPU, a table with a few statistics (MIN, MAX, SUM, AVG) is printed. After that we have a table of derived metrics as defined by the performance group L3
for each thread/CPU. Similar to the raw values, the last table contains some statistics of the derived metrics. It is only printed if more than one CPU is used.
At the moment it is not possible to place multiple threads on a single CPU. You could use likwid-pin after likwid-perfctr but the access to the hash table entries (one per CPU and named region) is not thread-safe.
-
Applications
-
Config files
-
Daemons
-
Architectures
- Available counter options
- AMD
- Intel
- Intel Atom
- Intel Pentium M
- Intel Core2
- Intel Nehalem
- Intel NehalemEX
- Intel Westmere
- Intel WestmereEX
- Intel Xeon Phi (KNC)
- Intel Silvermont & Airmont
- Intel Goldmont
- Intel SandyBridge
- Intel SandyBridge EP/EN
- Intel IvyBridge
- Intel IvyBridge EP/EN/EX
- Intel Haswell
- Intel Haswell EP/EN/EX
- Intel Broadwell
- Intel Broadwell D
- Intel Broadwell EP
- Intel Skylake
- Intel Coffeelake
- Intel Kabylake
- Intel Xeon Phi (KNL)
- Intel Skylake X
- Intel Cascadelake SP/AP
- Intel Tigerlake
- Intel Icelake
- Intel Icelake X
- Intel SappireRapids
- Intel GraniteRapids
- Intel SierraForrest
- ARM
- POWER
-
Tutorials
-
Miscellaneous
-
Contributing