Skip to content

CPU 2. Parallel For and its Variations

Phillip Allen Lane edited this page Apr 9, 2024 · 2 revisions

One of the most common constructs you will encounter in data-parallel programming is a parallel-for loop. A parallel-for loop takes the contents of a for loop and schedules iterations across numerous threads. There are some requirements that must be met to ensure that a for loop is correctly executed when parallelized in this manner.

  1. Each iteration must be disjoint. In other words, no iteration can rely on the results of another iteration.
  2. Each iteration must be unordered. In other words, the for loop must be safe to execute iterations in any order.
  3. Each iteration must only write to its own variables. In other words, if two iterations write to the same result in memory, there is the possibility for race conditions, meaning that a shared variable written to by multiple iterations may be corrupted.

One simple example of a for loop which meets these requirements is called a SAXPY, which stands for "single-precision A times X plus Y." A SAXPY loop looks like this:

for (int i = 0; i < M; i++)
    y[i] += a * x[i];

Parallel.For

Parallel.For is the easiest way to parallelize this loop. Differing from the .NET TPL, you must call Parallel.For within a parallel region. Otherwise, you will get an exception. A parallelized SAXPY in DotMP might look like this:

DotMP.Parallel.ParallelRegion(() =>
{
    DotMP.Parallel.For(0, M, i =>
    {
        y[i] += a * x[i];
    });
});

Because a parallel-for is such a common operation, there is a wrapper function that represents the above construct very concisely called Parallel.ParallelFor, and looks like this:

DotMP.Parallel.ParallelFor(0, M, i =>
{
    y[i] += a * x[i];
});

Parallel.ParallelFor also supports the num_threads parameter from Parallel.ParallelRegion.

Parallel.For supports a couple optional parameters, the two biggest ones being the schedule parameter and the chunk_size parameter. Both of these go hand-in-hand in loop scheduling.

Loop Scheduling

Loop scheduling is the process by which loop iterations are assigned to threads. Loop scheduling is a problem with lots of research behind it, both from a theoretical perspective and a practical perspective. Loop scheduling may seem easy: just assign each thread an equal fraction of the loop to execute, right? Unfortunately, it's not that simple. There are two main reasons why this approach falls short:

  1. Each thread may work through its otherwise equal iterations at a varying pace. This may be because of a heterogeneous processor which has a mix of "fast" and "slow" cores, such as Intel's P/E core architecture or ARM big.LITTLE.
  2. Each iteration may require varying amounts of work to complete. This is a special class of loop called an "irregular loop," and optimal scheduling of irregular loops is a proven NP-hard problem.

So what are we to do? Well, DotMP has some approaches which help even out the workload and ensure that parallel-for loops are executed in minimal wall time.

Static Scheduling

Static scheduling is DotMP's default scheduler. If no schedule is passed, or if you pass DotMP.Schedule.Static to the schedule parameter, static scheduling is used. The static scheduler, in the absence of a chunk_size parameter, does the approach listed above. It assigns each thread in the parallel region an equal amount of the for loop to execute.

You can pass a chunk_size parameter, which specifies the size of a "chunk." A "chunk" is how much work is executed at a time, and chunks are assigned in round-robin order to threads. For instance, if you have a loop with 64 iterations, 4 threads, and a chunk size of 4, then iterations are assigned as follows:

  • Thread 0 gets iterations 0-3, 16-19, 32-35, and 48-51.
  • Thread 1 gets iterations 4-7, 20-23, 36-39, and 52-55.
  • Thread 2 gets iterations 8-11, 24-27, 40-43, and 56-59.
  • Thread 3 gets iterations 12-15, 28-31, 44-47, and 60-63.

We note that static scheduling is often unoptimal due to observed behavior with .NET and thread scheduling. Static scheduling, despite being the default, should be used sparsely.

Static scheduling has the least overhead of all schedulers. No two threads have to interact at any point during static scheduling, and this can be beneficial for processors where inter-thread communication is expensive. However, as aforementioned, static scheduling in DotMP has been observed to be unoptimal compared to the other schedulers.

Dynamic Scheduling

Dynamic scheduling is an example of a "load-balancing" scheduler. It does a fairly good job at distributing work evenly among threads, even in the face of highly irregular loops where certain iterations may take many times longer to execute than others. If you pass DotMP.Schedule.Dynamic to the schedule parameter, dynamic scheduling is used.

Dynamic scheduling keeps a central, shared queue of iterations that threads must request work from. When a thread has no work to do, it fetches a number of iterations from the central queue equal to the chunk_size parameter. In the absence of a chunk_size parameter, DotMP has a simple heuristic to estimate a good chunk size. However, this is done with no knowledge of the loop and its characteristics, so it may be worth tuning the chunk_size parameter to fit your needs.

A small chunk size means that there is greater opportunity for load balancing, e.g., work can be balanced among threads at a higher level of granularity. However, a small chunk size also means more scheduler overhead. A small chunk size is only recommended for loops where each iteration does a great deal of work. A larger chunk size means greatly reduced scheduler overhead, but also means that there's coarser load balancing, and can result in workload imbalance in the worst cases.

Guided Scheduling

Guided scheduling is another load-balancing scheduler. If you pass DotMP.Schedule.Guided to the schedule parameter, guided scheduling is used.

Guided scheduling also keeps a central, shared queue of iterations that threads must request work from, in a very similar manner to dynamic scheduling. The difference is how the chunk_size parameter is used. Guided scheduling only uses the chunk_size parameter as a minimum number of iterations to request, but there is no maximum. When a thread requests work from the central queue, it requests a number of iterations proportional to the number of remaining iterations in the queue. This means that at the beginning of a parallel-for loop, a large number of iterations are requested at a time in order to keep scheduling overhead low. As the number of iterations in the queue drops, a smaller number of iterations are requested at a time to maintain load balancing. The floor is set by chunk_size, and by default, is 1.

Guided scheduling is a very versatile scheduler that works great for a large array of loops. Its main weakness is when a for loop has its longest or most expensive iterations at the beginning of the loop (such as an exponential decreasing workload). In cases like this, guided scheduling may run into workload imbalances that it is not able to recover from, in the case that the first chunk requested takes longer to execute than the entire rest of the loop.

Work-Stealing Scheduling

Work-stealing scheduling is yet another load-balancing scheduler. If you pass DotMP.Schedule.WorkStealing to the schedule parameter, work-stealing scheduling is used.

Work stealing is, by far, the most complex scheduler in DotMP. At the beginning of execution, each thread in the parallel region gets assigned its own local queue of iterations to execute, identical to how static scheduling works when no chunk size is provided. However, if a thread runs out of work in its queue, it will act as a thief. A thief will randomly select another thread's queue, called the victim, and will "steal" half of that thread's remaining iterations for itself. This continues until all queues are empty.

The chunk_size parameter specifies how many iterations a thread executes at a time from its own local queue. It's worth noting that once a thread pulls iterations from the queue and begins executing them, they cannot be stolen by another thread. Like all other schedulers, a higher chunk size means less scheduler overhead, but a low chunk size means finer-grained load balancing.

Runtime Scheduling

If you pass DotMP.Schedule.Runtime to the schedule parameter, it indicates that the scheduler should be pulled at runtime from the OMP_SCHEDULE environment variable to decide the schedule to use at runtime. Here are some examples of how you can set OMP_SCHEDULE:

  • static,128 - static scheduling with a chunk size of 128.
  • guided,8 - guided scheduling with a chunk size of 8.
  • dynamic - dynamic scheduling with the default chunk size.
  • workstealing,4 - work-stealing scheduling with a chunk size of 4.

Parallel.ForReduction

Reduction loops (loops which aggregate to a single variable) are common, but invalid under the conditions that Parallel.For can produce correct results. Therefore, DotMP provides a variant of the traditional parallel-for loop to support reductions.

Take a simple example of a vector dot product. A serial version of this loop might look like this:

float dot = 0.0f;

for (int i = 0; i < M; i++)
    dot += a[i] * b[i];

You can parallelize this for loop using DotMP's Parallel.ForReduction method:

float dot = 0.0f;

DotMP.Parallel.ParallelRegion(() =>
{
    DotMP.Parallel.ForReduction(0, M, DotMP.Operations.Add, ref dot, (ref float dot, int i) =>
    {
        dot += a[i] * b[i];
    });
});

In order for the runtime to properly parallelize the loop in the most efficient way possible, more information must be given to the runtime. You'll notice the DotMP.Operations.Add specifies that the reduction operation is addition: we are adding to dot. Then, we provide dot as a reference parameter. Finally, the body of the loop takes a reference to another dot parameter, as well as the current iteration integer. The body of this loop then operates on the ref accepted by the delegate instead of the one outside of the parallel region.

The way this works is that the runtime creates thread-local versions of dot, so each thread will aggregate its results to its own thread-local version of dot. Then, before the loop concludes, all of the thread-local versions are aggregated into the original variable according to Operations.Add.

Available operations in the Operations enum include:

  • Operations.Add - reduction using the + operator
  • Operations.Subtract - reduction using the - operator
  • Operations.Multiply - reduction using the * operator
  • Operations.BinaryAnd - reduction using the & operator
  • Operations.BinaryOr - reduction using the | operator
  • Operations.BinaryXor - reduction using the ^ operator
  • Operations.BooleanAnd - reduction using the && operator
  • Operations.BooleanOr - reduction using the || operator
  • Operations.Min - reduction using Math.Min
  • Operations.Max - reduction using Math.Max

Parallel.ForReduction supports the same schedule and chunk_size parameters as Parallel.For, and the wrapper function Parallel.ParallelForReduction exists and behaves as you'd expect.

Parallel.ForCollapse

Another common construct is nested for loops acting on multidimensional data. Take the first step in a stencil-based heat transfer simulation:

for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
        scratch[i, j] = 0.25 * (grid[i - 1, j] + grid[i + 1, j] + grid[i, j - 1] + grid[i, j + 1]);

Parallel.For and its derivatives do not permit nesting. In essence, this means that usually you would only be able to parallelize the outermost for loop. This can still get very good scaling (you usually don't "lose out" on parallelism), but it may be advantageous to want to parallelize both of the loops. The benefits of this are that you multiply the iterations that the scheduler has to work with. In this example, instead of the scheduler only having M iterations to work with, it would have M*N. This can allow you to use larger chunk sizes (i.e., reducing scheduler overhead) while maintaining good load balancing.

DotMP provides a method called ForCollapse which allows this. To write the heat transfer step above, you would do:

DotMP.Parallel.ParallelRegion(() =>
{
    DotMP.Parallel.ForCollapse((0, M), (0, N), (i, j) =>
    {
        scratch[i, j] = 0.25 * (grid[i - 1, j] + grid[i + 1, j] + grid[i, j - 1] + grid[i, j + 1]);
    });
});

Internally, the schedulers see this as a 1-dimensional loop. When a chunk is selected for execution, the method which executes chunks will transform 1D indices into corresponding 2D indices for the loop. There is overhead associated with this, so for loops with lightweight iterations, you may notice performance degradation when using ForCollapse due to the index calculations that must be performed.

Each set of indices is passed to ForCollapse as a tuple. There are overloads to allow for up to four pairs of indices to be specified, like this:

DotMP.Parallel.ParallelForCollapse((A, M), (B, N), (C, P), (D, Q), (i, j, k, l) =>
{
    do_work(i, j, k, l);
});

For 5-dimensional or higher loops, indices and bounds are instead passed as arrays:

(int, int)[] bounds = new (int, int)[] { (A, M), (B, N), (C, P), (D, Q), (E, R) };

DotMP.Parallel.ParallelForCollapse(bounds, indices =>
{
    int i = indices[0];
    int j = indices[1];
    int k = indices[2];
    int l = indices[3];
    int m = indices[4];

    do_work(i, j, k, l, m);
});

Collapsed for loops saw substantial performance uplift as of v1.6.0. If you are using a version before v1.6.0, you will notice substantial performance degradation using collapsed for loops over simple parallelization of the outermost loop. As of v1.6.0, index calculations are fastest with 2D and 3D indices, but collapsed for loops of any size should consistently outperform prior versions.

As shown, there exists Parallel.ParallelForCollapse as a wrapper around ParallelRegion and ForCollapse. There is also:

  • ForReductionCollapse - a conjunction of ForReduction and ForCollapse
  • ParallelForReductionCollapse - a wrapper around ParallelRegion and ForReductionCollapse.