Pattern Name: DependencyAnalysis

FindingConcurrency Design Space


Intent:

This pattern addresses the question "After you have decomposed a problem into tasks, how do you analyze how they depend on each other?"

Motivation:

This pattern comes into play once you have decomposed a problem into tasks that can execute concurrently. In a few cases, not only can these tasks execute concurrently, but they are completely independent. For independent tasks, the programmer need only create the tasks and schedule them for efficient execution on the processing elements of the parallel computer.

More frequently, however, a problem's tasks are not independent; they influence each other such that what happens in one task affects the execution of another task. We call these influences between tasks dependencies. Dependencies take one of two basic forms: ordering constraints, where one task (or group of tasks) must complete before another can begin, and dependencies arising from data shared among the tasks.

Finding and managing dependencies is one of the most difficult jobs facing a parallel program designer.

Applicability:

Use this pattern when you have decomposed a problem into tasks that can execute concurrently and you need to understand how those tasks depend on each other.

Implementation:

The goal of a dependency analysis is to understand in detail how the tasks that make up your parallel program depend on each other. There are two kinds of dependencies: ordering constraints among tasks (or groups of tasks), and dependencies arising from data shared among tasks.

The dependencies between a problem's tasks have a major impact on both the efficiency and the complexity of the final program.

Correctly analyzing dependencies among tasks is both difficult and crucial to the overall effectiveness of the design. There is no single way to accomplish this analysis, but we have found the following approach to be effective: first, group the tasks based on how they must interact (the GroupTasks pattern); next, identify any ordering constraints among these groups (the OrderTasks pattern); and finally, analyze how data is shared within and among the groups (the DataSharing pattern).

It is not always possible to see whether decisions made at one step of this process will be effective until after you have worked through later steps. In fact, in many cases you cannot fully understand whether a decomposition will be effective until after analyzing the resulting dependencies. Therefore, you should expect to iterate back and forth between the dependency and decomposition patterns: first carry out a decomposition, then analyze its dependencies, and then reconsider the decomposition. Even experienced parallel programmers may need to iterate through several cycles before getting it right.

Examples:

Molecular dynamics.

We will consider a single example as we look at each one of the dependency analysis design patterns. In the present pattern, we'll just introduce the problem and its decomposition. The dependency analysis itself will be described in the Examples sections of the individual dependency patterns.

Our example problem is to design a parallel molecular dynamics program. The theoretical background of molecular dynamics is interesting, but not really relevant to this discussion; we just need to understand the problem at its simplest level.

Molecular dynamics is used to simulate the motions of a large molecular system. For example, molecular dynamics simulations show how a large protein moves around and how different-shaped drugs might interact with the protein. Not surprisingly, molecular dynamics is extremely important in the pharmaceutical industry. It turns out that molecular dynamics is important in computer science as well. It's a perfect test problem for computer scientists working on parallel computing: it's simple to understand, relevant to science at large, and very tough to effectively parallelize. Many papers have been written by computer scientists about parallel molecular dynamics algorithms (see the references in [Mattson94d] for some of these papers).

So what is the basic molecular dynamics problem? The idea is to treat the molecule as a large collection of balls connected by springs. The balls represent the atoms in the molecule, while the springs represent the chemical bonds between the atoms. The molecular dynamics simulation itself is an explicit time-stepping process. At each time step, you compute the force on each atom, and then use standard classical mechanics to compute how the force moves the atoms. This process is carried out repeatedly to step through time and compute a trajectory for the molecular system.
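The time-stepping structure described above can be sketched in Python. This is a minimal illustration, not a real MD code: the `toy_forces` function is a hypothetical one-dimensional "spring" stood in for the real force field, and the integrator is a simple (semi-implicit) Euler step, whereas production codes typically use velocity Verlet.

```python
def step(pos, vel, mass, compute_forces, dt):
    # One explicit time step: compute the force on each atom, then use
    # classical mechanics (a = F/m) to move the atoms. Semi-implicit
    # Euler here purely for brevity.
    forces = compute_forces(pos)
    vel = [v + dt * f / mass for v, f in zip(vel, forces)]
    pos = [x + dt * v for x, v in zip(pos, vel)]
    return pos, vel

def toy_forces(pos):
    # Hypothetical 1D harmonic "spring" pulling each atom toward the origin.
    return [-x for x in pos]

pos, vel = [1.0], [0.0]
for _ in range(100):  # repeating the step traces out a trajectory in time
    pos, vel = step(pos, vel, 1.0, toy_forces, dt=0.01)
```

The point is the loop structure: every time step computes forces and then updates positions, exactly as in the pseudocode later in this section.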

The forces due to the chemical bonds (the "springs") are relatively simple to compute. They correspond to the vibrations and rotations of the chemical bonds themselves. These are short-range forces that can be computed with knowledge of the handful of atoms that share chemical bonds. What makes the molecular dynamics problem so difficult is the fact that the "balls" carry partial electrical charges. Hence, while atoms interact with a small neighborhood of atoms through the chemical bonds, the electrical charge causes every atom to apply a force to every other atom.

This is the famous N-body problem. On the order of N^2 terms must be computed to get the long-range force. Since N is large (tens or hundreds of thousands) and the number of time steps in a simulation is huge (tens of thousands), the time required to compute these long-range forces dominates the computation. Clever scientists have worked out elegant ways to reduce the effort required to solve the N-body problem. We are only going to discuss the simplest of these tricks: the so-called cutoff method.

The idea is quite simple. Even though each atom exerts a force on every other atom, this force decreases as the distance between the atoms grows. Hence, it should be possible to pick a distance beyond which the force contribution can be ignored. That's the cutoff, and it reduces a problem that scales as O(N^2) (where "O" denotes "on the order of") to one that scales as O(N · n), where n is the number of atoms within the cutoff volume (usually hundreds). The computation is still huge, and it dominates the overall runtime for the simulation, but at least the problem is tractable.
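The effect of the cutoff can be seen in a short sketch. The `pair_force` function below is a hypothetical one-dimensional inverse-square repulsion (the real long-range term is Coulomb's law in 3D); the point is that pairs farther apart than the cutoff are simply skipped.

```python
import math

def pair_force(ri, rj):
    # Toy 1D inverse-square repulsion between two atoms; the sign pushes
    # atom i away from atom j. A stand-in for the real electrostatic term.
    d = ri - rj
    return math.copysign(1.0 / (d * d + 1e-9), d)

def forces_cutoff(pos, cutoff):
    # Only pairs closer than the cutoff contribute, so the cost drops from
    # O(N^2) to O(N * n), where n is the typical number of atoms inside
    # the cutoff volume.
    n = len(pos)
    forces = [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j and abs(pos[i] - pos[j]) < cutoff:
                forces[i] += pair_force(pos[i], pos[j])
    return forces

f = forces_cutoff([0.0, 0.1, 5.0], cutoff=1.0)
# The atom at 5.0 lies beyond the cutoff of both others, so it feels no force.
```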

There are a host of details, but the basic simulation can be summarized with the following simplified pseudo-code:


    Int const N   // number of atoms 

    Array of real :: Atoms (3,N) //3D coordinates
    Array of real :: Forces (3,N) //force in each dimension
    Array of lists :: neighbors(N) //atoms in cutoff volume

    loop over time steps
        vibrational_forces (N, Atoms, Forces)
        rotational_forces (N, Atoms, Forces)
        neighbor_list (N, Atoms, neighbors)
        long_range_forces (N, Atoms, neighbors, Forces)
        update_atom_positions(N, Atoms, Forces)
        physical_properties ( ... Lots of stuff  ... )
    end loop

There are a few details to mention, and then we can consider how this can be solved in parallel.

First, the neighbor_list() computation is time-consuming. The gist of the computation is a loop over each atom, inside which every other atom is checked to see if it falls within the indicated cutoff volume. Fortunately, the time steps are very small, and the atoms don't move very much in any given time step. Hence, this time-consuming computation is only carried out every 10 to 100 steps. For our discussion, we are not going to parallelize this computation.
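That neighbor-list scan can be sketched directly: a doubly nested loop over atoms that records, for each atom, every other atom inside the cutoff volume. This is an assumed straightforward O(N^2) formulation; real codes often use cell lists to build it faster.

```python
def neighbor_list(pos, cutoff):
    # O(N^2) scan: for each atom, check every other atom to see if it
    # falls within the cutoff volume. Affordable in practice because the
    # list is rebuilt only every 10 to 100 time steps.
    n = len(pos)
    neighbors = [[] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(pos[i], pos[j]))
                if d2 < cutoff * cutoff:
                    neighbors[i].append(j)
    return neighbors

nl = neighbor_list([(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (9.0, 0.0, 0.0)],
                   cutoff=1.0)
```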

Second, the physical_properties() function computes velocities, energies, correlation coefficients, and a host of interesting physical properties. These computations, however, are not that involved and do not affect the overall parallel algorithm. Hence, we will ignore them in this discussion.

We are now ready to look at how to decompose this problem for execution on a parallel computer. In any parallel algorithm project, the first step is to decide where the most time-consuming computations are taking place. We have already discussed this and pointed out that the bulk of the time is in long_range_forces(). This means that whatever we do, we must pick a problem decomposition that makes that computation run efficiently in parallel.

While we won't discuss it in detail here (see [Mattson94d] for details), each of the computations inside the time loop has a similar structure. Namely, they include a loop over atoms in which each loop iteration independently updates that atom's corresponding array elements. So a natural task definition for each function is an iteration of this atom-loop: i.e., the update required by each atom.
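Because each iteration of the atom loop is an independent task, the loop maps directly onto a pool of workers. The sketch below uses Python threads purely to illustrate the task structure (the force expression is a toy 1D stand-in, and `neighbors` is assumed to come from a neighbor-list computation like the one above).

```python
from concurrent.futures import ThreadPoolExecutor

def long_range_force_on(i, pos, neighbors):
    # The update for atom i reads its own coordinates plus those of its
    # neighbors, and writes only the force for atom i -- so each atom's
    # update is a natural task.
    return sum(1.0 / abs(pos[i] - pos[j]) for j in neighbors[i])

pos = [0.0, 1.0, 2.0]
neighbors = [[1], [0, 2], [1]]

# Each loop iteration is independent, so the atom loop becomes a map
# over a worker pool.
with ThreadPoolExecutor(max_workers=2) as pool:
    forces = list(pool.map(lambda i: long_range_force_on(i, pos, neighbors),
                           range(len(pos))))
```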

Now let's look at the data decomposition. Each element of the Atoms array (the atomic coordinates) is updated using its own data in all cases and coordinate data from a neighborhood of atoms in most cases. Elements of the Forces array are updated similarly (Newton's third law comes into play: when you compute the force of atom I on atom J, you also get the negative of the force of atom J on atom I). Hence, these arrays need to be managed as shared data.
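The Newton's-third-law optimization is what turns the Forces array into shared data. In the sketch below (a toy 1D pair force, with a hypothetical `neighbors` structure), each pair is computed once but written into two different elements of the array, so per-atom tasks end up writing to each other's entries.

```python
def pairwise_forces(pos, neighbors):
    # Newton's third law: the force of atom i on atom j is the negative
    # of the force of atom j on atom i, so each pair is computed once and
    # the result is written into BOTH forces[i] and forces[j]. With
    # per-atom tasks this makes the Forces array shared, writable data.
    forces = [0.0] * len(pos)
    for i in range(len(pos)):
        for j in neighbors[i]:
            if j > i:                        # handle each pair exactly once
                f = 1.0 / (pos[j] - pos[i])  # toy 1D pair force
                forces[i] += f
                forces[j] -= f               # equal and opposite reaction
    return forces

forces = pairwise_forces([0.0, 1.0, 3.0], [[1], [0, 2], [1]])
```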

Summarizing our problem and its decomposition, we have the following: the tasks are the per-atom updates carried out within each of the functions inside the time loop, and the key data structures are the Atoms and Forces arrays, which must be managed as data shared among those tasks.

Dependencies among these tasks and the associated data will be analyzed in the Examples sections of the dependency patterns (GroupTasks, OrderTasks, and DataSharing).