Parallel Utilities

Riccardo Rossi edited this page Nov 1, 2017 · 26 revisions

WORK IN PROGRESS!! NOT READY!!

The landscape of parallelization within C++ is currently changing: since C++11 the language has been acquiring a set of standard extensions that aim to rival OpenMP. Even though OpenMP works correctly, it is wise to start future-proofing the codebase so that in the future we can easily switch between different implementations.

In order to make this possible, the "parallel_utilities" class provides basic looping features, which we believe cover the great majority of common OpenMP use cases.

Such features are the following:

Basic, statically scheduled, for loop

This is implemented in the function

 template<class InputIt, class UnaryFunction>
 inline void parallel_for(InputIt first, const int n, UnaryFunction f)

As an example, a parallel for loop that sets a variable to a prescribed value would look as follows in user code:

 Kratos::Parallel::parallel_for(rNodes, [&](ModelPart::NodesContainerType::iterator it)
    {
        noalias(it->FastGetSolutionStepValue(rVariable)) = value;
    }
 );

It is worth focusing on the use of the C++ lambda function. In the code above everything is captured by reference (the [&]), thus emulating OpenMP's default(shared). Users unfamiliar with the syntax of lambdas should refer to the C++11 documentation or to a tutorial, for example https://www.cprogramming.com/c++11/c++11-lambda-closures.html.

A limitation of C++11 is, unfortunately, that the iterator type must be specified explicitly in the lambda function. With C++14 we could have written

 Kratos::Parallel::parallel_for(rNodes, [&](auto it)
    {
        noalias(it->FastGetSolutionStepValue(rVariable)) = value;
    }
 );
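The static splitting behind such a loop can be illustrated with a minimal sketch built on std::thread. This is not the Kratos implementation: the names sketch_parallel_for and fill_with are hypothetical, the thread count is simply taken from std::thread::hardware_concurrency, and a plain std::vector stands in for the node container. The lambda spells out the iterator type, as C++11 requires.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <thread>
#include <vector>

// Hypothetical stand-in for parallel_for (assumption: NOT the Kratos code).
// Splits [first, first + n) into one contiguous range per thread (static
// schedule) and calls f(it) for every iterator in the range.
template <class InputIt, class UnaryFunction>
void sketch_parallel_for(InputIt first, const int n, UnaryFunction f)
{
    assert(n >= 0);
    const int hw = static_cast<int>(std::thread::hardware_concurrency());
    const int n_threads = std::max(1, std::min(n, hw > 0 ? hw : 2));

    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (int t = 0; t < n_threads; ++t) {
        // Integer arithmetic yields ranges whose sizes differ by at most one.
        const int begin = static_cast<int>(static_cast<long long>(n) * t / n_threads);
        const int end = static_cast<int>(static_cast<long long>(n) * (t + 1) / n_threads);
        workers.emplace_back([first, begin, end, &f]() {
            InputIt it = first;
            std::advance(it, begin);
            for (int i = begin; i < end; ++i, ++it) {
                f(it);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
}

// Usage: set every entry to a prescribed value. Everything is captured by
// reference, which plays the role of OpenMP's default(shared).
std::vector<double> fill_with(const int n, const double value)
{
    std::vector<double> data(n, 0.0);
    sketch_parallel_for(data.begin(), n, [&](std::vector<double>::iterator it) {
        *it = value;
    });
    return data;
}
```

Each thread writes to a disjoint range, so no synchronization is needed in this particular usage.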

This type of loop is good for implementing simple cases, in which no "firstprivate" or "private" variables are needed. Passing a private value to the lambda can be achieved trivially by capturing the relevant variable by value. For example, we could have made a local copy of value by capturing the variable of that name by value:

 Kratos::Parallel::parallel_for(rNodes, [value, &rVariable](auto it)
    {
        noalias(it->FastGetSolutionStepValue(rVariable)) = value;
    }
 );

Unfortunately this usage has performance implications: the lambda is executed once per item, so any private working data has to be set up on every call rather than once per thread, as was the case with OpenMP.

In order to emulate the behaviour of OpenMP (few threads with a lot of work on each), a more general function is needed.

Chunked for loop

The signature of this more flexible function is

template<class InputIt, class BinaryFunction>
inline void block_parallel_for(InputIt first, const int n, const int NChunks, BinaryFunction f)

This function is also used to write reductions, for example to sum a value obtained from all the nodes:

array_1d<double, 3> sum_value = ZeroVector(3);

// Here a reduction is to be made. This is achieved by using the block parallel for and writing the loop over each chunk explicitly.
// Note also that, in order to be "didactic", all the values are captured explicitly by reference, except for rVar, which is captured by value.
// NChunks is the number of chunks, chosen by the caller (see the signature above).
Kratos::Parallel::block_parallel_for(rModelPart.NodesBegin(), rModelPart.NumberOfNodes(), NChunks,
                                        [&sum_value, rVar, &rModelPart](auto it_begin, auto it_end){
        array_1d<double, 3> tmp = ZeroVector(3);
        for(auto it = it_begin; it != it_end; ++it){
            noalias(tmp) += it->GetValue(rVar);
        }
        
        Kratos::Parallel::AtomicAddVector(tmp, sum_value); //here we do sum_value += tmp
    }
);

The important point here is that the array of nodes is subdivided into NChunks portions, each of which is processed independently. The lambda must then loop over the data of its entire chunk, so that the lambda capturing (and the synchronization of the reduction variable) only happens NChunks times.

The example also shows the use of the "AtomicAddVector" function. The following atomic functions exist:
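The chunking idea can be sketched in a self-contained way as follows. This is only an illustration under stated assumptions, not the Kratos implementation: sketch_block_parallel_for and chunked_sum are hypothetical names, one std::thread is launched per chunk for simplicity, and a std::mutex replaces the atomic vector update used in the example above.

```cpp
#include <cassert>
#include <iterator>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for block_parallel_for (assumption: NOT the Kratos code).
// Splits [first, first + n) into NChunks contiguous ranges and calls
// f(it_begin, it_end) once per chunk, here with one thread per chunk.
template <class InputIt, class BinaryFunction>
void sketch_block_parallel_for(InputIt first, const int n, const int NChunks, BinaryFunction f)
{
    assert(n >= 0 && NChunks > 0);
    std::vector<std::thread> workers;
    workers.reserve(NChunks);
    for (int c = 0; c < NChunks; ++c) {
        const int begin = static_cast<int>(static_cast<long long>(n) * c / NChunks);
        const int end = static_cast<int>(static_cast<long long>(n) * (c + 1) / NChunks);
        InputIt it_begin = first;
        std::advance(it_begin, begin);
        InputIt it_end = it_begin;
        std::advance(it_end, end - begin);
        workers.emplace_back([it_begin, it_end, &f]() { f(it_begin, it_end); });
    }
    for (auto& worker : workers) {
        worker.join();
    }
}

// Reduction: each chunk accumulates into a local tmp and touches the shared
// sum only once, so synchronization happens NChunks times in total.
double chunked_sum(const std::vector<double>& values, const int NChunks)
{
    double sum = 0.0;
    std::mutex sum_mutex;
    sketch_block_parallel_for(
        values.begin(), static_cast<int>(values.size()), NChunks,
        [&sum, &sum_mutex](std::vector<double>::const_iterator it_begin,
                           std::vector<double>::const_iterator it_end) {
            double tmp = 0.0;
            for (auto it = it_begin; it != it_end; ++it) {
                tmp += *it;
            }
            std::lock_guard<std::mutex> lock(sum_mutex);
            sum += tmp; // the only synchronized operation in the chunk
        });
    return sum;
}
```

The local tmp plays the role of an OpenMP private variable: it is created once per chunk rather than once per item.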

  • AtomicAdd/AtomicSub/AtomicAssign for integers and doubles
  • AtomicAddVector/AtomicSubVector/AtomicAssignVector for vectors
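To give an idea of what an atomic addition on a double involves in standard C++11, here is a minimal sketch based on std::atomic and a compare-and-swap loop. The names sketch_atomic_add and concurrent_total, and the argument order, are hypothetical and do not describe how the Kratos functions are implemented.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper (assumption: NOT the Kratos AtomicAdd).
// Performs target += value atomically with a compare-and-swap loop.
inline void sketch_atomic_add(std::atomic<double>& target, const double value)
{
    double expected = target.load();
    // On failure compare_exchange_weak stores the current value of target in
    // 'expected', so the next attempt uses fresh data.
    while (!target.compare_exchange_weak(expected, expected + value)) {
    }
}

// Usage: several threads add 1.0 to the same accumulator.
double concurrent_total(const int n_threads, const int adds_per_thread)
{
    assert(n_threads > 0 && adds_per_thread >= 0);
    std::atomic<double> total(0.0);
    std::vector<std::thread> workers;
    workers.reserve(n_threads);
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&total, adds_per_thread]() {
            for (int i = 0; i < adds_per_thread; ++i) {
                sketch_atomic_add(total, 1.0);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return total.load();
}
```

Calling an atomic update once per item is costly under contention, which is why the chunked loop above accumulates locally and synchronizes once per chunk.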

Crucial difference from OpenMP parallelism

The type of parallelism introduced in C++11 and following standards makes a fundamentally different assumption from OpenMP regarding the total number of threads being run. This is a design decision aimed at allowing future support for GPU-type parallelization within the standard library.

From the user's perspective, the visible effect is that, by design, the new C++ standards avoid defining a function equivalent to "omp_get_thread_num", which in OpenMP returns the integer id of the thread executing the code. The closest equivalent is

  std::thread::id this_id = std::this_thread::get_id(); 

which, as described in (http://en.cppreference.com/w/cpp/thread/get_id), returns an opaque identifier of type std::thread::id, which can be compared and hashed, rather than an integer index between 0 and nthreads-1. The practical implication is that allocating local arrays of size nthreads and indexing them by thread id is not viable.
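One possible way around this limitation, offered here only as a sketch and not as something the parallel utilities prescribe, is to size local storage by the number of chunks and index it by chunk number, which is known before any thread starts. The function name sum_with_per_chunk_storage is hypothetical, and one std::thread per chunk is used for simplicity.

```cpp
#include <cassert>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical example: local storage is indexed by chunk number instead of
// by thread number, so no omp_get_thread_num equivalent is required.
double sum_with_per_chunk_storage(const std::vector<double>& values, const int NChunks)
{
    assert(NChunks > 0);
    const int n = static_cast<int>(values.size());
    std::vector<double> partial(NChunks, 0.0); // one slot per chunk
    std::vector<std::thread> workers;
    workers.reserve(NChunks);
    for (int c = 0; c < NChunks; ++c) {
        const int begin = static_cast<int>(static_cast<long long>(n) * c / NChunks);
        const int end = static_cast<int>(static_cast<long long>(n) * (c + 1) / NChunks);
        workers.emplace_back([&values, &partial, c, begin, end]() {
            double tmp = 0.0;
            for (int i = begin; i < end; ++i) {
                tmp += values[i];
            }
            partial[c] = tmp; // each chunk writes only to its own slot
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    // Serial combination of the per-chunk results after all threads finished.
    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Because every chunk owns exactly one slot, no atomic operation or lock is needed, whatever thread ends up running the chunk.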
