Pattern Name: EmbarrassinglyParallel

AlgorithmStructure Design Space


This pattern is used to describe concurrent execution by a collection of independent tasks. Parallel algorithms that use this pattern are called embarrassingly parallel because, once the tasks have been defined, the potential concurrency is obvious.

Also Known As:


Consider an algorithm that can be decomposed into many independent tasks. Such an algorithm, often called an "embarrassingly parallel" algorithm, contains obvious concurrency that is trivial to exploit once the independent tasks have been defined. Nevertheless, while the source of the concurrency is often obvious, taking advantage of it in a way that makes for efficient execution can be difficult.

The EmbarrassinglyParallel pattern shows how to organize such a collection of tasks so they execute efficiently. The challenge is to organize the computation so that all units of execution finish their work at about the same time -- that is, so that the computational load is balanced among processors. The following figure illustrates the problem.

This pattern automatically and dynamically balances the load as necessary. With this pattern, faster or less-loaded UEs automatically do more work. When the amount of work required for each task cannot be predicted ahead of time, this pattern produces a statistically optimal solution.

Examples of this pattern include the following:

As these examples illustrate, this pattern allows for a fair amount of variation: The tasks can all be roughly equal in size, or they can vary in size. Also, for some problems (the database search, for example), it may be possible to solve the problem without executing all the tasks. Finally, for some problems (branch-and-bound computations, for example), new tasks may be created during execution of other tasks.

Observe that although the source of the concurrency is frequently obvious (hence the name of the pattern), this pattern also applies when the source of the concurrency requires some insight to discover; the distinguishing characteristic of problems using this pattern is the complete independence of the tasks.

More formally, the EmbarrassinglyParallel pattern is applicable when what we want to compute is a solution(P) such that

    solution(P) =
        f(subsolution(P, 0),
          subsolution(P, 1), ...,
          subsolution(P, N-1))

where for any two distinct indices i and j, subsolution(P, i) does not depend on subsolution(P, j). That is, the original problem can be decomposed into a number of independent subproblems such that we can solve the whole problem by solving all of the subproblems and then combining the results. We could code a sequential solution thus:

    Problem P;
    Solution subsolutions[N];
    Solution solution;
    for (i = 0; i < N; i++) {
        subsolutions[i] =
            compute_subsolution(P, i);
    }
    solution =
        compute_f(subsolutions);

If function compute_subsolution modifies only local variables, it is straightforward to show that the sequential composition implied by the for loop in the preceding program can be replaced by any combination of sequential and parallel composition without affecting the result. That is, we can partition the iterations of this loop among available UEs in whatever way we choose, so long as each iteration is executed exactly once.

This is the EmbarrassinglyParallel pattern in its simplest form -- all the subproblems are defined before computation begins, and each subsolution is saved in a distinct variable (array element), so the computation of the subsolutions is completely independent. These computations of subsolutions then become the independent tasks of the pattern as described earlier.

There are also some variations on this basic theme:

What all of these variations have in common, however, is that they meet the pattern's key restriction: It must be possible to solve the subproblems into which we partition the original problem independently. Also, if the subsolution results are to be collected into a shared data structure, it must be the case that the order in which subsolutions are placed in this data structure does not affect the result of the computation.


Use the EmbarrassinglyParallel pattern when:

This pattern can be particularly effective when:


Implementations of this pattern include the following key elements:


This pattern is typically used to provide high-level structure for an application; that is, the application is typically structured as an instance of this pattern. It can also be used in the context of a simple sequential control structure such as sequential composition, if-then-else, or a loop construct. An example is given in our overview paper, where the program as a whole is a simple loop whose body contains an instance of this pattern.


The EmbarrassinglyParallel pattern has some powerful benefits, but also a significant restriction.


There are many ways to implement this pattern. If all the tasks are of the same size, all are known a priori, and all must be completed (the simplest form of the pattern), the pattern can be implemented by simply dividing the tasks among units of execution using a parallel loop directive. Otherwise, it is common to collect the tasks into a queue (the task queue) shared among UEs. This task queue can then be implemented using the SharedQueue pattern. The task queue, however, can also be represented by a simpler structure such as a shared counter.

Key elements.

Defining tasks and scheduling their execution.

A set of tasks is represented and scheduled for execution on multiple units of execution (UEs). Frequently the tasks correspond to iterations of a loop. In this case we implement this pattern by splitting the loop among multiple UEs. The key to making algorithms based on this pattern run well is to schedule their execution so the load is balanced among the UEs. The schedule can be:

Implementation techniques include parallel loops and master-worker and SPMD versions of a task-queue approach.

Parallel loop.

If the computation fits the simplest form of the pattern -- all tasks the same size, all known a priori, and all required to be completed -- they can be scheduled by simply setting up a parallel loop that divides them equally (or as equally as possible) among the available units of execution.

Master-Worker or SPMD.

If the computation does not fit the simplest form of the pattern, the most common implementation involves some form of a task queue. Frequently this is done using two types of processes, master and worker. There is only one master process; it manages the computation by:

There can be many worker processes; each contains some type of loop that repeatedly:

Frequently the master and worker processes form an instance of the ForkJoin pattern, with the master process forking off a number of workers and waiting for them to complete.

A common variation is to use an SPMD program with a global counter to implement the task queue. This form of the pattern does not require an explicit master.

Detecting completion and terminating.

Termination can be implemented in a number of ways.

If the program is structured using the ForkJoin pattern, the workers can continue until the termination condition is reached, checking for an empty task queue (if the termination condition is "all tasks completed") or for some other desired condition. As each worker detects the appropriate condition, it terminates; when all have terminated, the master continues with any final combining of results generated by the individual tasks.

Another approach is for the master or a worker to check for the desired termination condition and, when it is detected, create a "poison pill", a special task that tells all the other workers to terminate.

Correctness considerations.

The keys to exploiting available concurrency while maintaining program correctness (for the problem in its simplest form) are as follows.

The variations mentioned earlier impose additional requirements:

Efficiency considerations.


Vector addition.

Consider a simple vector addition, say C = A + B. As discussed earlier, we can consider each element addition (Ci = Ai + Bi) as a separate task and parallelize this computation in the form of a parallel loop:

Varying-length tasks.

Consider a problem consisting of N independent tasks. Assume we can map each task onto a sequence of simple integers ranging from 0 to N-1. Further assume that the effort required by each task varies considerably and is unpredictable. Several implementations are possible, including:


See our overview paper for an extended example using this pattern.

Known Uses:

There are many application areas in which this pattern is useful. Many ray-tracing codes use some form of partitioning with individual tasks corresponding to scan lines in the final image [Bjornson91a]. Applications coded with the Linda coordination language are another rich source of examples of this pattern [Bjornson91b].

Parallel computational chemistry applications also make heavy use of this pattern. In the quantum chemistry code GAMESS, the loops over two electron integrals are parallelized with the TCGMSG task queue mechanism mentioned earlier. An early version of the Distance Geometry code, DGEOM, was parallelized with the Master-Worker form of the EmbarrassinglyParallel pattern. These examples are discussed in [Mattson95b].

Related Patterns:

The SeparableDependencies pattern is closely related to the EmbarrassinglyParallel pattern. To see this relation, think of the SeparableDependencies pattern in terms of a three-phase approach to the parallel algorithm. In the first phase, dependencies are pulled outside a set of tasks, usually by replicating shared data and converting it into task-local data. In the second phase, the tasks are run concurrently as completely independent tasks. In the final phase, the task-local data is recombined (reduced) back into the original shared data structure.

The middle phase of the SeparableDependencies pattern is an instance of the EmbarrassinglyParallel pattern. That is, you can think of the SeparableDependencies pattern as a technique for converting problems into embarrassingly parallel problems. This technique can be used in certain cases with most of the other patterns in our pattern language. The key is that the dependencies can be pulled outside of the concurrent execution of tasks. If this isolation can be done, then the execution of the tasks can be handled with the EmbarrassinglyParallel pattern.

Many instances of the GeometricDecomposition pattern (for example, "mesh computations" in which new values are computed for each point in a grid based on data from nearby points) can be similarly viewed as two-phase computations, where the first phase consists of exchanging boundary information among UEs and the second phase is an instance of the EmbarrassinglyParallel pattern in which each UE computes new values for the points it "owns".

It is also worthwhile to note that some problems in which the concurrency is based on a geometric data decomposition are, despite the name, not instances of the GeometricDecomposition pattern but instances of EmbarrassinglyParallel. An example is a variant of the vector addition example presented earlier, in which the vector is partitioned into "chunks", with computation for each "chunk" treated as a separate task.