Fortran UNified Device Acceleration Library
OpenACC and OpenMP allow managing the memory of (highly parallel, accelerated) devices by means of runtime routines, e.g. to allocate device memory and to copy data to/from the device. These routines, in general, handle C pointers: FUNDAL provides a convenient Fortran API for the OpenMP/OpenACC runtime routines, handling the C data in the background and thus simplifying the end-user experience. The FUNDAL API is designed to seamlessly unify the calls to OpenACC and OpenMP runtime routines, in order to minimize the end-user's effort in developing device-offloaded applications.
A minimal example of FUNDAL usage is contained in src/examples/fundal_taste.F90 and is reported below.
program fundal_taste
use, intrinsic :: iso_fortran_env, only : I4P=>int32, R8P=>real64 ! portable kinds
use :: fundal ! FUNDAL library
implicit none
real(R8P), pointer :: a_dev(:,:,:)=>null() ! device memory
real(R8P), pointer :: b_hos(:,:,:)=>null() ! host memory
integer(I4P) :: ierr ! error status
integer(I4P) :: i, j, k ! counter
! initialize environment global variables
myhos = dev_get_host_num() ! get host ID
devtype = dev_get_device_type() ! get device type
call dev_set_device_num(0) ! set device ID (in complex scenario this ID is less trivial than 0, e.g. MPI)
mydev = dev_get_device_num() ! get device ID
! allocate device memory
call dev_alloc(fptr_dev=a_dev,lbounds=[-1,-2,-3],ubounds=[1,2,3],ierr=ierr,dev_id=mydev)
! allocate host memory
allocate(b_hos(-1:1,-2:2,-3:3))
! set host memory
b_hos = -3._R8P
! copy to device
call dev_memcpy_to_device(dst=a_dev, src=b_hos)
! work on device
!$acc parallel loop independent deviceptr(a_dev) collapse(3)
!$omp target teams distribute parallel do collapse(3) has_device_addr(a_dev)
do k=-3,3
   do j=-2,2
      do i=-1,1
         a_dev(i,j,k) = a_dev(i,j,k) / 2._R8P
      enddo
   enddo
enddo
! copy from device
call dev_memcpy_from_device(dst=b_hos, src=a_dev)
! check results
print*, b_hos
endprogram fundal_taste
The device memory must be declared as pointer, while host memory can be either pointer or allocatable.
The memory handling (allocate, copy, free) is seamless thanks to a unified API for both OpenACC and OpenMP paradigms: e.g. call dev_memcpy_from_device(dst=b_hos, src=a_dev) is the unified API to copy memory from device to host for both OpenACC and OpenMP, with no need to write different code for different backends and/or to wrap snippets with conditional preprocessing macros.
Additionally, note that OpenACC pragmas are ignored when compiling with OpenMP flags but without OpenACC flags (and vice versa), thus there is no need to wrap pragmas with conditional preprocessing macros.
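The remark above can be tried without FUNDAL. The following stand-alone sketch decorates one loop nest with both an OpenACC and an OpenMP directive; a compiler invoked without the corresponding flag treats the directive as an ordinary comment. The program name and the use of a plain host array with copy/map clauses are assumptions made for illustration, they are not taken from the FUNDAL sources.

```fortran
program dual_pragma_sketch
! Hypothetical stand-alone sketch (not part of FUNDAL): one loop nest carrying
! both an OpenACC and an OpenMP directive.
use, intrinsic :: iso_fortran_env, only : I4P=>int32, R8P=>real64
implicit none
real(R8P)    :: a(-1:1,-2:2,-3:3) ! plain host array, moved by the data clauses below
integer(I4P) :: i, j, k           ! counters

a = -3._R8P
! only the directive matching the enabled backend is honored, the other one
! (or both, if no backend is enabled) is an ordinary comment
!$acc parallel loop collapse(3) copy(a)
!$omp target teams distribute parallel do collapse(3) map(tofrom:a)
do k=-3,3
   do j=-2,2
      do i=-1,1
         a(i,j,k) = a(i,j,k) / 2._R8P
      enddo
   enddo
enddo
! -3/2 is exactly representable, so an exact comparison is safe here
if (any(abs(a + 1.5_R8P) > 0._R8P)) error stop 'unexpected result'
print '(A)', 'all elements equal -1.5'
endprogram dual_pragma_sketch
```

Compiled with no offloading flags at all, both directives are comments and the loop runs serially on the host.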
- KISS, keep it simple and stupid;
- easy handling of OpenACC memory offloading on (highly parallel) accelerated devices (GPU);
- easy handling of OpenMP memory offloading on (highly parallel) accelerated devices (GPU);
- MPI enabled for multi-device clusters;
- Free, Open Source Project.
Status of implemented API:
- device memory handling:
  - dev_alloc: OpenACC, OpenMP
  - dev_memcpy: OpenACC, OpenMP
  - dev_assign_to_device: OpenACC, OpenMP
  - dev_assign_from_device: OpenACC, OpenMP
  - dev_memcpy_to_device: OpenACC, OpenMP
  - dev_memcpy_from_device: OpenACC, OpenMP
  - dev_free: OpenACC, OpenMP
- device handling:
  - dev_get_device_num: OpenACC, OpenMP
  - dev_get_device_type: OpenACC, OpenMP
  - dev_get_host_num: OpenACC, OpenMP
  - dev_get_num_devices: OpenACC, OpenMP
  - dev_get_property_string: OpenACC, OpenMP
- NVIDIA HPC SDK, NVFortran: fully supports the OpenACC backend, works on NVIDIA GPUs, tested with v12.3+;
- INTEL IFX: fully supports the OpenMP backend, works on INTEL GPUs, tested with v2024.0.2-20231213;
- GNU gfortran: partially supports the OpenACC backend; it compiles, but not all tests work; tested with v13.1.0.
FUNDAL is an open source project distributed under a multi-licensing system:
- for FOSS projects:
- for closed source/commercial projects:
Anyone interested in using, developing, or contributing to FUNDAL is welcome: feel free to select the license that best matches your soul!
More details can be found on the wiki.
- Stefano Zaghi
- Giacomo Rossi
- Andrea di Mascio
- Francesco Salvadore
FUNDAL is a pure Fortran library (exploiting a few C preprocessor macros), thus it can be built like any other Fortran library. The library sources are in the directory src/lib, the examples in src/examples and the tests in src/tests. Clone or download the repository to get all the sources, e.g.
git clone [email protected]:szaghi/FUNDAL.git
FUNDAL is a module-based Fortran library and must be compiled according to the module hierarchy.
A fobos file is provided for easy building by means of the FoBiS.py program.
Currently only NVIDIA SDK (NVFortran) and INTEL IFX compilers are supported. GNU gfortran is only partially supported.
In the following, the bare minimum information to build the FUNDAL tests is reported. For more detailed documentation on the tests, see the tests documentation.
To build tests and examples with the OpenACC backend by means of the NVIDIA SDK, type:
FoBiS.py build -mode fundal-test-oac-nvf
tree exe/
├── fundal_alloc_free_test
├── fundal_array_access_test
├── fundal_derived_type_memcpy_test
├── fundal_device_handling_test
├── fundal_memcpy_test
├── fundal_use_test
To build tests and examples with the OpenMP backend by means of the INTEL SDK, type:
FoBiS.py build -mode fundal-test-omp-ifx
tree exe/
├── fundal_alloc_free_test
├── fundal_array_access_test
├── fundal_derived_type_memcpy_test
├── fundal_device_handling_test
├── fundal_memcpy_test
├── fundal_use_test
All tests can be executed without any argument; a successful execution prints a test passed message.
The tests can also be executed all together with a single script:
Moreover, the tests can be built and executed by means of FoBiS.py rules:
# only execution
FoBiS.py rule -ex run-tests
Executing rule "run-tests"
Command => utils/
# build and execution with OpenACC-NVF
FoBiS.py rule -ex build-run-tests-oac-nvf
Executing rule "build-run-tests-oac-nvf"
Command => clean
Command => build -mode fundal-test-oac-nvf
Command => rule -ex run-tests
# build and execution with OpenMP-IFX
FoBiS.py rule -ex build-run-tests-omp-ifx
Executing rule "build-run-tests-omp-ifx"
Command => clean
Command => build -mode fundal-test-omp-ifx
Command => rule -ex run-tests
In the following, the API of each FUNDAL routine is documented in detail, with examples.
The FUNDAL library has a main module from which all exported names can be used:
use :: fundal
The (currently) exported names are:
! runtime memory routines
public :: dev_alloc_unstr
public :: dev_free_unstr
public :: dev_free
public :: dev_memcpy_from_device_unstr, dev_memcpy_to_device_unstr
public :: dev_memcpy_from_device, dev_memcpy_to_device
public :: dev_assign_from_device, dev_assign_to_device
! device handling routines
public :: dev_get_device_memory_info
public :: dev_get_device_num
public :: dev_get_device_type
public :: dev_get_host_num
public :: dev_get_num_devices
public :: dev_get_property_string
public :: dev_init
public :: dev_set_device_num
! environment global variables
public :: dev_memory_avail
public :: local_comm
public :: mydev
public :: myhos
public :: devtype
public :: IDK
For MPI applications an auxiliary module is also provided, i.e. fundal_mpih_object: it contains the definition of an object for handling MPI tasks (initialization, environment handling, finalization, etc.), e.g.
use :: fundal_mpih_object
type(mpih_object) :: mpih ! MPI handler.
call mpih%initialize(do_mpi_init=.true., do_device_init=.true.)
if (mpih%myrank == 1_I4P) call MPI_SEND(var, 1, MPI_REAL8, 0, 100, MPI_COMM_WORLD, mpih%error)
call MPI_FINALIZE(mpih%error)
The MPI handler class provides the following API:
type :: mpih_object
!< MPI handler class.
integer(I4P) :: error=0_I4P !< Error trapping flag.
integer(I4P) :: myrank=0_I4P !< MPI ID process.
integer(I4P) :: procs_number=1_I4P !< Number of MPI processes.
integer(I8P) :: hos_memory_avail=0_I8P !< Host (CPU) memory available (GB) for each process.
real(R8P) :: timing(1:2) !< Tic toc timing.
integer(I4P) :: tictoc=1_I4P !< Next is tic or toc?
integer(I4P), allocatable :: req_send_recv(:) !< MPI request receive flags.
integer(I4P) :: devs_number=0_I4P !< Number of devices.
integer(I8P), pointer :: dev_memory_avail=>null() !< Device memory available (GB).
integer(I4P), pointer :: mydev=>null() !< Device ID.
integer(I4P), pointer :: local_comm=>null() !< Local communicator.
integer(I4P), pointer :: myhos=>null() !< Host ID.
integer(IDK), pointer :: devtype=>null() !< Device type (currently used only for OpenACC backend).
character(:), allocatable :: myrankstr !< MPI ID stringified.
contains
! public methods
procedure, pass(self) :: abort !< Handy MPI abort wrapper.
procedure, pass(self) :: barrier !< Handy MPI barrier wrapper.
procedure, pass(self) :: description !< Return pretty-printed object description.
procedure, pass(self) :: error_stop !< Stop run with error output.
procedure, pass(self) :: finalize !< Handy MPI finalize wrapper.
procedure, pass(self) :: initialize !< Initialize MPI handler data.
procedure, pass(self) :: print_message !< Print a message on stdout with rank prefix.
procedure, pass(self) :: tictoc_timing !< Return the last tic toc timing.
procedure, pass(self) :: tic !< Start a tic toc timing.
procedure, pass(self) :: toc !< Stop a tic toc timing.
endtype mpih_object
Note that some global environment variables are conveniently pointed to by members of the MPI handler class, e.g. mydev, myhos, devtype, etc.
Besides the main module and the MPI handler module, there is a C-macro include file, fundal.H, i.e.:
/* cpp macros to setup backends */
#if defined DEV_OAC
# define DEVMODULE openacc
# if defined COMPILER_NVF
# define DEVICEVAR deviceptr
# elif defined COMPILER_GNU
# define DEVICEVAR present
# endif
#elif defined DEV_OMP
# define DEVMODULE omp_lib
# define DEVICEVAR has_device_addr
# define OMPLOOP target teams distribute parallel do
#else
# define DEVMODULE omp_lib
# define DEVICEVAR shared
# define OMPLOOP parallel do
#endif
This include file sets some compile-time macros needed to compile the library with the OpenACC or the OpenMP backend (or with a non-device fallback).
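As an illustration of how such macros can be consumed, the sketch below defines two hypothetical macros (MY_LOOP and MY_DATA, which are not part of fundal.H) following the same pattern, and uses them inside an OpenMP directive. It assumes a cpp-like preprocessor that also expands macros inside directive comment lines, and that DEV_OMP is passed on the compiler command line to select the offloading variant; behaviour may differ between preprocessors.

```fortran
! Hypothetical sketch, to be saved with an upper-case .F90 extension so that
! the compiler runs the preprocessor. The two macro names defined below are
! illustrative only, they are not FUNDAL macros.
#if defined DEV_OMP
# define MY_LOOP target teams distribute parallel do
# define MY_DATA map(tofrom:a)
#else
# define MY_LOOP parallel do
# define MY_DATA shared(a)
#endif
program macro_sketch
use, intrinsic :: iso_fortran_env, only : I4P=>int32, R8P=>real64
implicit none
real(R8P)    :: a(1:8) ! plain host array
integer(I4P) :: i      ! counter

a = 1._R8P
! after preprocessing, the sentinel below is followed either by a host
! parallel loop (fallback) or by a target offloading loop (DEV_OMP defined)
!$omp MY_LOOP MY_DATA
do i=1,8
   a(i) = 2._R8P * a(i)
enddo
if (any(abs(a - 2._R8P) > 0._R8P)) error stop 'unexpected result'
print '(A)', 'all elements equal 2'
endprogram macro_sketch
```

Keeping the backend-specific clauses in macros leaves a single directive line in the source, at the price of requiring preprocessing for every file that uses it.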
Runtime routines to handle device memory.
The dev_alloc routine allocates space in the device memory, returning a (Fortran) pointer to it. The device memory is not mapped to any host memory. The signature is:
subroutine dev_alloc(fptr_dev, ubounds, ierr, dev_id, lbounds, init_value)
real/integer, intent(out), pointer :: fptr_dev(..) !< Pointer to allocated memory.
integer(I4P), intent(in) :: ubounds(:) !< Array upper bounds.
integer(I4P), intent(out) :: ierr !< Error status.
fptr_dev is a pointer array of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
ubounds is an integer array of rank 1 containing the upper bounds of fptr_dev.
ierr returns the error status of the allocation: it is 0 for a successful allocation.
integer(I4P), intent(in), optional :: dev_id !< Device ID.
integer(I4P), intent(in), optional :: lbounds(:) !< Array lower bounds, 1 if not passed.
real/integer, intent(in), optional :: init_value !< Optional initial value.
dev_id is the device number (ID) on which the allocation happens. For OpenACC it is not used. For OpenMP, if it is not passed, it is set to the environment global variable mydev (which must be previously initialized by means of dev_get_device_num).
lbounds is an integer array of rank 1 containing the lower bounds of fptr_dev. It is set to 1 if it is not passed.
init_value is a real/integer scalar (of the same kind as fptr_dev): if it is passed, it is used to initialize fptr_dev with a parallel device loop.
usage example
use :: fundal
real(R8P), pointer :: a(:,:,:)
integer(I4P) :: ierr
call dev_alloc(fptr_dev=a,lbounds=[-1,-2,-3],ubounds=[1,2,3],init_value=1._R8P,ierr=ierr)
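Since the OpenACC and OpenMP runtime allocators return C pointers, a Fortran pointer with user-defined bounds has to be built from a C address. The sketch below shows one standard way to do this with iso_c_binding. It is only an assumption about how such a mapping can be done, not necessarily how dev_alloc is implemented, and it uses a host array as a stand-in for the device buffer so that it runs anywhere.

```fortran
program bounds_remap_sketch
! Hypothetical stand-alone sketch (not FUNDAL code): build a Fortran pointer
! with user-defined bounds starting from a C address.
use, intrinsic :: iso_c_binding, only : c_ptr, c_loc, c_f_pointer
use, intrinsic :: iso_fortran_env, only : I4P=>int32, R8P=>real64
implicit none
real(R8P), target  :: storage(3*5*7)       ! host stand-in for a raw device buffer
real(R8P), pointer :: tmp(:,:,:)=>null()   ! pointer with default lower bounds (1)
real(R8P), pointer :: a(:,:,:)=>null()     ! pointer with user-defined bounds
type(c_ptr)        :: cptr                 ! C address, as a C-style allocator would return
integer(I4P)       :: lbounds(3)           ! requested lower bounds
integer(I4P)       :: ubounds(3)           ! requested upper bounds

lbounds = [-1,-2,-3]
ubounds = [ 1, 2, 3]
storage = 1._R8P
cptr = c_loc(storage)
! associate the C address with a Fortran pointer of the requested shape
call c_f_pointer(cptr, tmp, shape=ubounds-lbounds+1)
! remap the lower bounds, the upper bounds follow from the shape
a(lbounds(1):,lbounds(2):,lbounds(3):) => tmp
if (any(lbound(a) /= lbounds) .or. any(ubound(a) /= ubounds)) error stop 'unexpected bounds'
print '(A,3(1X,I0))', 'lower bounds:', lbound(a)
print '(A,3(1X,I0))', 'upper bounds:', ubound(a)
endprogram bounds_remap_sketch
```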
The dev_alloc_unstr routine allocates space in the device memory using the unstructured model: it actually maps device memory to host memory. The signature is:
subroutine dev_alloc_unstr(fptr_dev, init_value)
real/integer, intent(inout) :: fptr_dev(..) !< Host memory to be mapped on device.
fptr_dev is an array of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
real/integer, intent(in), optional :: init_value !< Optional initial value.
init_value is a real/integer scalar (of the same kind as fptr_dev): if it is passed, it is used to initialize fptr_dev with a parallel device loop.
usage example
use :: fundal
real(R8P), allocatable :: a(:,:,:)
call dev_alloc_unstr(fptr_dev=a,init_value=1._R8P)
The dev_free routine frees memory directly allocated on the device. The signature is:
subroutine dev_free(fptr_dev, dev_id)
real/integer, intent(out), pointer :: fptr_dev(..) !< Pointer to allocated memory.
fptr_dev is a pointer array of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
integer(I4P), intent(in), optional :: dev_id !< Device ID.
dev_id is the device number (ID) on which the memory was allocated. For OpenACC it is not used. For OpenMP, if it is not passed, it is set to the environment global variable mydev (which must be previously initialized by means of dev_get_device_num).
usage example
use :: fundal
real(R8P), pointer :: a(:,:,:)
call dev_free(fptr_dev=a)
The dev_free_unstr routine frees memory mapped (unstructured model) on the device. The signature is:
subroutine dev_free_unstr(fptr)
real/integer, intent(inout), pointer :: fptr(..) !< Mapped memory.
fptr is an array of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
usage example
use :: fundal
real(R8P), pointer :: a(:,:,:)
call dev_free_unstr(fptr=a)
The dev_memcpy_from_device routine copies data from device memory to local host memory. The signature is:
subroutine dev_memcpy_from_device(dst, src)
real/integer, intent(out), target :: dst(:) !< Destination memory (host memory).
real/integer, intent(in), target :: src(:) !< Source memory (device memory).
dst is a target array in host memory, of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
src is a target array in device memory, of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
usage example
use :: fundal
real(R8P), pointer :: a(:,:,:)
real(R8P), allocatable :: b(:,:,:)
call dev_memcpy_from_device(dst=b, src=a)
The dev_memcpy_to_device routine copies data from local host memory to device memory. The signature is:
subroutine dev_memcpy_to_device(dst, src)
real/integer, intent(out), target :: dst(:) !< Destination memory (device memory).
real/integer, intent(in), target :: src(:) !< Source memory (host memory).
dst is a target array in device memory, of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
src is a target array in host memory, of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
usage example
use :: fundal
real(R8P), pointer :: a(:,:,:)
real(R8P), allocatable :: b(:,:,:)
call dev_memcpy_to_device(dst=a, src=b)
The dev_memcpy_from_device_unstr routine copies data from (mapped) device memory to local host memory. The signature is:
subroutine dev_memcpy_from_device_unstr(dst)
real/integer, intent(inout) :: dst(:) !< Destination memory (host memory mapped on device).
dst is an array in host memory mapped on the device, of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
usage example
use :: fundal
real(R8P), allocatable :: a(:,:,:)
call dev_memcpy_from_device_unstr(dst=a)
The dev_memcpy_to_device_unstr routine copies data from local host memory to mapped device memory. The signature is:
subroutine dev_memcpy_to_device_unstr(dst)
real/integer, intent(inout) :: dst(:) !< Destination memory (mapped device memory).
dst is an array in mapped device memory, of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
usage example
use :: fundal
real(R8P), allocatable :: a(:,:,:)
call dev_memcpy_to_device_unstr(dst=a)
The dev_assign_from_device routine copies data from device memory to local host memory: the host memory is deallocated and re-allocated with the correct size. This procedure mimics the automatic left-hand-side reallocation that Fortran applies to standard allocatable arrays. The signature is:
subroutine dev_assign_from_device(dst, src)
real/integer, intent(inout), allocatable :: dst(:) !< Destination memory (host memory).
real/integer, intent(in) :: src(:) !< Source memory (device memory).
dst is an allocatable array in host memory, of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
src is an array in device memory, of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
usage example
use :: fundal
real(R8P), pointer :: a(:,:,:)
real(R8P), allocatable :: b(:,:,:)
call dev_assign_from_device(dst=b, src=a)
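For reference, the standard Fortran behaviour that dev_assign_from_device mimics can be observed with host-only allocatable arrays. The sketch below does not use FUNDAL, and its program and variable names are illustrative; note that some compilers need a flag to enable the Fortran 2003 reallocation semantics.

```fortran
program lhs_realloc_sketch
! Hypothetical stand-alone sketch (not FUNDAL code): automatic reallocation of
! the left-hand side in an intrinsic assignment between allocatable arrays.
use, intrinsic :: iso_fortran_env, only : R8P=>real64
implicit none
real(R8P), allocatable :: src(:) ! plays the role of the source memory
real(R8P), allocatable :: dst(:) ! plays the role of the destination memory

allocate(src(1:8)) ; src = 3._R8P
allocate(dst(1:3)) ; dst = 0._R8P
! shapes differ: on assignment dst is deallocated and re-allocated with the shape of src
dst = src
if (size(dst) /= size(src)) error stop 'dst was not re-allocated'
if (any(abs(dst - 3._R8P) > 0._R8P)) error stop 'unexpected values'
print '(A,I0)', 'size of dst after assignment: ', size(dst)
endprogram lhs_realloc_sketch
```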
The dev_assign_to_device routine copies data from local host memory to device memory: the device memory is deallocated and re-allocated with the correct size. This procedure mimics the automatic left-hand-side reallocation that Fortran applies to standard allocatable arrays. The signature is:
subroutine dev_assign_to_device(dst, src)
real/integer, intent(inout), pointer :: dst(:) !< Destination memory (device memory).
real/integer, intent(in) :: src(:) !< Source memory (host memory).
dst is a pointer array in device memory, of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
src is an array in host memory, of any rank up to 7, of type real (kinds R8P, R4P) or integer (kinds I8P, I4P, I1P).
usage example
use :: fundal
real(R8P), pointer :: a(:,:,:)
real(R8P), allocatable :: b(:,:,:)
call dev_assign_to_device(dst=a, src=b)
Runtime routines to handle device(s), in particular for complex scenarios like MPI programming.
Return the ID of the current device (for the current thread and MPI process). The signature is:
function dev_get_device_num() result(device_num)
integer(I4P) :: device_num !< Device ID for current thread.
No arguments are required. Note that the device type environment global variable, devtype, must be set before using this routine. By default it is set to acc_device_default for the OpenACC backend.
usage example
use :: fundal
integer :: dev
dev = dev_get_device_num()
Return the device type. The signature is:
function dev_get_device_type() result(devtype)
#ifdef DEV_OAC
integer(acc_device_kind) :: devtype
#else
integer(I4P) :: devtype
#endif
No arguments are required. The result is a standard integer (always equal to 0) for the OpenMP backend, which does not provide such a runtime routine, whereas it is integer(acc_device_kind) for the OpenACC backend.
usage example
use :: fundal
#ifdef DEV_OAC
integer(acc_device_kind) :: devtype
#else
integer(I4P) :: devtype
#endif
devtype = dev_get_device_type()
Return the ID of the current host (for the current thread and MPI process). The signature is:
function dev_get_host_num() result(host_num)
integer(I4P) :: host_num !< Host ID for current thread and MPI process.
No arguments are required.
usage example
use :: fundal
integer(I4P) :: myhost
myhost = dev_get_host_num()
Return the number of available (non host) devices. The signature is:
function dev_get_num_devices() result(devices_number)
integer(I4P) :: devices_number !< Devices number.
No arguments are required. Note that the device type environment global variable, devtype, must be set before using this routine. By default it is set to acc_device_default for the OpenACC backend. For the OpenMP backend, which does not provide such a runtime routine, it always returns 1.
usage example
use :: fundal
integer(I4P) :: devices_number
devices_number = dev_get_num_devices()
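In MPI runs the device count is typically used to decide which device each process should select. The sketch below shows a simple round-robin policy; the function device_for_rank and the policy itself are hypothetical and are not part of the FUNDAL API.

```fortran
module device_mapping_sketch
! Hypothetical helper (not part of FUNDAL): round-robin mapping of the
! processes of a node onto the devices of that node.
use, intrinsic :: iso_fortran_env, only : I4P=>int32
implicit none
private
public :: device_for_rank
contains
   pure function device_for_rank(local_rank, devices_number) result(device_num)
   !< Return the device ID selected for a process.
   integer(I4P), intent(in) :: local_rank     !< Rank of the process within its node (0-based).
   integer(I4P), intent(in) :: devices_number !< Number of devices available on the node.
   integer(I4P)             :: device_num     !< Selected device ID (0-based), -1 if no device.

   if (devices_number > 0_I4P) then
      device_num = mod(local_rank, devices_number)
   else
      device_num = -1_I4P
   endif
   endfunction device_for_rank
endmodule device_mapping_sketch

program device_mapping_demo
use, intrinsic :: iso_fortran_env, only : I4P=>int32
use :: device_mapping_sketch, only : device_for_rank
implicit none
integer(I4P) :: rank ! counter over the local ranks

do rank=0_I4P, 5_I4P
   ! with 4 devices, local ranks 0..5 map to devices 0,1,2,3,0,1
   print '(A,I0,A,I0)', 'local rank ', rank, ' -> device ', device_for_rank(rank, 4_I4P)
enddo
endprogram device_mapping_demo
```

In a real application the returned value could then be passed to dev_set_device_num before any device allocation.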
Return the pretty-printed string value of the device properties for the specified device. Note that the device type environment global variable, devtype, must be set before using this routine. By default it is set to acc_device_default for the OpenACC backend. For the OpenMP backend, which does not provide such a runtime routine, it always returns a null string. The signature is:
subroutine dev_get_property_string(dev_num, string, prefix, memory)
integer, value, intent(in) :: dev_num !< Device ID.
character(*), intent(out) :: string !< Stringified device property.
dev_num is the ID of the queried device.
string is the output string containing the pretty-printed device-property value.
character(*), intent(in), optional :: prefix !< String prefix.
integer(I8P), intent(out), optional :: memory !< Device memory.
prefix is a string prepended to each row of the output string.
memory is the amount (in bytes) of memory available on the device.
usage example
use :: fundal
integer :: dev
character(999) :: property_string
dev = dev_get_device_num()
call dev_get_property_string(dev_num=dev, string=property_string, prefix=' ')
print '("current thread device property = ",A)', new_line('a')//trim(property_string)
