Pattern Name: DataDecomposition

FindingConcurrency Design Space


Intent:

This pattern addresses the question "How do you decompose a problem's data into units that can be operated on relatively independently?"

Also Known As:

Motivation:

This pattern looks at the issues involved in decomposing data into units that can be updated concurrently.

For example, most linear algebra problems update large matrices, applying a similar set of operations to each element of the matrix. In these cases, it is straightforward to drive the parallel algorithm design by looking at how the matrix can be broken up into blocks that are updated concurrently. The task definitions then follow from how the blocks are defined and mapped onto the processing elements of the parallel computer.
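As a concrete illustration, here is a minimal C/OpenMP sketch of this idea (the matrix size and the per-element update are placeholders, not part of the pattern): each thread is assigned a contiguous block of rows and updates its block independently.

    #include <omp.h>
    #include <stdio.h>

    #define N 1024   /* placeholder matrix dimension */

    static double m[N][N];

    int main(void)
    {
        /* The static schedule gives each thread a contiguous block of
           rows; each block is updated concurrently and independently. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                m[i][j] = 2.0 * m[i][j] + 1.0;   /* placeholder element update */

        printf("m[0][0] = %g\n", m[0][0]);
        return 0;
    }

Here the data decomposition (blocks of rows) comes first, and the task definitions (one update per block) follow directly from it.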

Applicability:

Use this pattern when

- the most computationally intensive part of the problem is organized around the manipulation of a large data structure, and
- similar operations are applied to different parts of that data structure in such a way that the different parts can be operated on relatively independently.

Implementation:

Compilers are good at analyzing data dependencies and can in some cases automatically deduce a data decomposition. In most cases, however, you have to carry out the decomposition by hand.

If you have already carried out a task-based decomposition, the data decomposition is driven by the needs of each task. If well-defined and distinct data can be associated with each task, the decomposition should be simple.

If you are starting with a data decomposition, however, you need to look not at the tasks but at the central data structures defining your problem, considering whether they can be broken down into chunks that can be operated on concurrently. A few common examples are

- arrays and other array-based structures, which can be decomposed along one or more dimensions (rows, columns, or blocks), as in the matrix example above, and
- recursive data structures such as trees, which can be decomposed into subtrees that are updated concurrently (see the sketch following this list).
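For the recursive case, the following C/OpenMP sketch (the node type and the per-node update are invented for illustration) uses tasks to update the subtrees of a binary tree concurrently:

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct node {
        double value;
        struct node *left, *right;
    } node;

    /* Update one node, then spawn a task per subtree: the tree is
       decomposed into subtrees that are updated concurrently. */
    static void update(node *n)
    {
        if (!n) return;
        n->value *= 2.0;        /* placeholder node update */
        #pragma omp task
        update(n->left);
        #pragma omp task
        update(n->right);
        #pragma omp taskwait
    }

    static node *build(int depth)
    {
        if (depth == 0) return NULL;
        node *n = malloc(sizeof *n);
        n->value = 1.0;
        n->left  = build(depth - 1);
        n->right = build(depth - 1);
        return n;
    }

    int main(void)
    {
        node *root = build(10);
        #pragma omp parallel
        #pragma omp single
        update(root);
        printf("root value = %g\n", root->value);
        return 0;
    }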

Regardless of the nature of the underlying data structure, the decomposition of the data serves as the organizing principle of your parallel algorithm.

As you consider how to decompose the problem's data structures, keep in mind the competing forces mentioned in the DecompositionStrategy pattern: flexibility (the size and number of chunks should be adjustable to suit different machines), efficiency (the chunks must be large enough that the benefits of concurrency outweigh the cost of managing them, and sized so that the load is balanced), and simplicity (an overly complex decomposition is difficult to implement and debug).

Once you have your data decomposed, you need to look at the tasks implied by the data decomposition. The TaskDecomposition pattern may help you with this analysis.

Examples:

Medical imaging.

Consider the medical imaging problem described earlier (in the Examples section of the DecompositionStrategy pattern). In this application, a point inside a model of the body is selected randomly, a radioactive decay is allowed to occur at this point, and the trajectory of the emitted particle is followed. To create a statistically significant simulation, thousands if not millions of trajectories are followed.

A task-based decomposition is a natural choice for this problem. Memory constraints, however, have motivated the development of data-based decompositions for this problem. When the memory of the underlying parallel hardware is distributed, it is advantageous to avoid replicating the huge body model on each processing element.

In a data-based decomposition, the body model is the large central data structure around which the computation can be organized. The model is broken into segments, and one or more segments are associated with each processing element. The body segments are only read, not written, during the trajectory computations, so there are no data dependencies created by the decomposition of the body model.

Once the data has been decomposed, you need to look at the tasks associated with each data segment. In this case, each trajectory passing through the data segment defines a task. The trajectories are initiated and propagated within a segment exactly as for the task-based approach. The difference occurs when a segment boundary is encountered. When this happens, the trajectory must be passed between segments. It is this transfer that defines the dependencies between data chunks.
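The essential bookkeeping can be sketched in plain C as follows (the slab layout, the trajectory state, and the propagation step are all invented for illustration; in a distributed-memory program, the hand-off would become a message to the processing element that owns the neighboring segment). Each segment owns a queue of the trajectories currently inside it, and a trajectory that steps past a segment boundary is moved to the neighboring segment's queue:

    #include <stdio.h>

    #define NSEG    4      /* body model split into NSEG slabs along z */
    #define SEGLEN  25.0   /* slab thickness */
    #define MAXTRAJ 64     /* queue capacity (no overflow check: sketch only) */

    typedef struct { double z, dz; } traj;

    typedef struct {
        traj q[MAXTRAJ];   /* trajectories currently inside this segment */
        int  n;
    } segment;

    static segment seg[NSEG];

    static void enqueue(int s, traj t) { seg[s].q[seg[s].n++] = t; }

    int main(void)
    {
        /* Start one trajectory near the low end of segment 0. */
        enqueue(0, (traj){ 5.0, 7.0 });

        int active = 1;
        while (active) {
            active = 0;
            for (int s = 0; s < NSEG; s++) {
                int n = seg[s].n;
                seg[s].n = 0;  /* re-collect survivors and arrivals */
                for (int i = 0; i < n; i++) {
                    traj t = seg[s].q[i];
                    t.z += t.dz;                     /* placeholder propagation step */
                    int owner = (int)(t.z / SEGLEN); /* which slab is it in now? */
                    if (owner < 0 || owner >= NSEG)
                        continue;                    /* left the body model: finished */
                    enqueue(owner, t);               /* stay put, or hand off */
                    active = 1;
                    if (owner != s)
                        printf("handoff: segment %d -> %d at z = %.1f\n", s, owner, t.z);
                }
            }
        }
        return 0;
    }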

Notice that this algorithm is more complex than one based on a task-based decomposition. Considerable effort can be needed to implement the bookkeeping that keeps track of the trajectories as they move through the body model.

Matrix multiplication.

Consider the standard multiplication of two matrices (C = A · B). In the Examples section of the TaskDecomposition pattern we discussed a task-based decomposition suitable for shared-memory environments but less so for distributed-memory environments. Several data-based decompositions are possible for this problem. A straightforward one would be to assign a row of the product matrix C to each processing element. From the definition of matrix multiplication, computing row i of C requires row i of A and the entire B matrix, so each processing element would need only its own rows of A but the full B matrix. With such a data decomposition, the basic task in our algorithm becomes the computation of a row of C.

This still requires the replication of too much data (the full B matrix), however, so we might refine the algorithm by decomposing all three matrices into blocks. The basic task then becomes the update of a C block, with the A and B blocks being cycled among the nodes as the computation proceeds. The result is the data-based decomposition discussed as an example of the GeometricDecomposition pattern. Although such a block decomposition is more complex, it is the approach used in practice because it is the most efficient.
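As a shared-memory illustration of the simpler row-oriented decomposition (a minimal C/OpenMP sketch; the matrix size and contents are placeholders), each thread computes its own rows of C from the corresponding rows of A and the whole of B:

    #include <omp.h>
    #include <stdio.h>

    #define N 256   /* placeholder matrix dimension */

    static double A[N][N], B[N][N], C[N][N];

    int main(void)
    {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = (i == j);   /* A = identity, so C should equal B */
                B[i][j] = i + j;
            }

        /* One task per row of C: row i reads only row i of A plus all of B. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += A[i][k] * B[k][j];
                C[i][j] = sum;
            }

        printf("C[3][5] = %g (expected %g)\n", C[3][5], 3.0 + 5.0);
        return 0;
    }

In a distributed-memory setting the same decomposition would require each node to hold its rows of A and C plus a full copy of B, which is exactly the replication the block decomposition avoids.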

Known Uses:

Data decompositions are very common in parallel scientific computing. The parallel linear algebra library ScaLAPACK, which operates on matrices distributed across the processing elements in a block-cyclic layout, is a good example.