A General Model of Parallel Computing

Introduction

People understand the world by comparing their observations of the world against a set of internally held models. This is an innate behavior, and for the most part we are not aware that it is taking place. When trying to bring order to a new field of study, however, or when trying to deepen understanding of an old one, it can be useful to make the models explicit.

For sequential programming, almost all programmers use a common model applicable to single-processor computers: the von Neumann model. The von Neumann model views a computer as a single processor with a single stream of instructions operating on a single memory. Processors and memory subsystems vary widely from one computer to another, but these details can be neglected by the programmer (except for final performance tuning), letting a program coded to the von Neumann model run on any single-processor computer.

For parallel programming, unfortunately no such consensus view yet exists. This complicates programmers' jobs by forcing them to choose from a range of models. In addition, the lack of universal models has diluted the efforts of programming tool developers, resulting in relatively immature tools for parallel computing. In other words, the lack of common models serves to make the parallel programmer's difficult job even harder.

In this note, we attempt to correct this shortcoming by proposing common models to support parallel computing. We begin by describing the types of models that are required; we then describe a particular set of models we will use in our pattern language and how they interact.

Types of Models

Our needs cannot be met with a single model; instead we need a hierarchical set of models, where each layer in the hierarchy is a self-consistent model appropriate for a class of activities. Each layer is self-contained, but the layers fit together in well-understood ways, so someone can work at one layer in the hierarchy confident that his or her work can be mapped onto models at lower and higher levels as needed.

Such a hierarchy supports the way programmers normally work: High-level algorithm design would be done using the top layer of the hierarchy, in which (for parallel programs) concurrency and shared data are abstract and easy to express. Implementing a design would mean moving down one layer, to a model that supports thinking about how to safely implement concurrency and shared data. Once the design has been implemented as a program, a model in an even lower layer would be used to carry out optimizations such as overlapping computation and communication. Each step of the application-development process would use a different model, one that focuses on the key issues of the task at hand.

What would the layers in such a hierarchy look like? There have been a number of attempts to define them, but no one approach has been adopted as a consensus view. We propose the following layers, from top to bottom:

- Specification model: the top layer, used to describe abstract parallel algorithms.
- Programming model: a layer for reasoning about how to implement concurrency and shared data in a program.
- Computational model: an abstract view of the parallel machine itself.
- Cost model: the bottom layer, supporting quantitative performance analysis.

It is very important to define the models and their interrelationships carefully, and to ensure that the models work well together. If the programming model does not mesh well with the specification model, for example, the resulting programs will be overly complex, both for the compiler to optimize and for the programmer to understand.

A Hierarchy of Models for our Pattern Language

We are concerned with helping the general-purpose programmer write parallel programs. To this end, we are most interested in the top three layers of the hierarchy of models. Cost models are important, especially when considering the tradeoffs raised by closely related algorithms. All too often, however, emphasizing the cost model clutters the overall hierarchy with low-level details, making the collection of models too complex for the general programmer to tolerate. To avoid this trap, we will ignore the cost model entirely and focus on the top three layers alone.

Computational model: the MIMD NUMA model

A parallel computer consists of multiple processing elements (PEs) connected into a single logical system. Borrowing from the work on candidate type architectures (CTA), we propose the following abstract view of a typical computer system, which we call the MIMD (multiple instruction, multiple data) NUMA (non-uniform memory access) model.

Consider a general MIMD computer consisting of multiple von Neumann computers. As implied by the term MIMD, each of these von Neumann computers has its own instruction stream and data stream; they are then connected by some mechanism that allows them to share information and to synchronize their execution. We can represent the whole system using a single model if we view any sharing of information, whether through a tightly-integrated physical shared memory (e.g., symmetric multiprocessor workstations) or by way of message-passing along an interconnection network (e.g., workstation clusters or the Paragon supercomputer), as a type of shared memory.

The cost of accessing information within an instruction stream varies depending on the source of the information. If it comes from physically local memory, the cost of accessing the information is low. If it comes from a remote memory location, the cost is usually much higher, since it includes both the overhead of accessing the network and any OS overhead. Once again viewing all shared information as shared memory, this means that we can view our general MIMD computer system as a NUMA (non-uniform memory access) computer.
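This cost asymmetry can be sketched in a few lines of code. The constants below are invented for illustration only, not measurements from any real machine; the point is simply that the same number of memory references can have wildly different total cost depending on locality.

```python
# Toy model of the NUMA cost asymmetry. The unit costs are hypothetical,
# chosen only to illustrate the local-vs-remote gap described in the text.

LOCAL_COST = 1      # a physically local memory access (arbitrary units)
REMOTE_COST = 400   # a remote access: network plus OS overhead (arbitrary units)

def access_cost(n_local, n_remote):
    """Total cost of a mix of local and remote memory references."""
    return n_local * LOCAL_COST + n_remote * REMOTE_COST

# The same 1000 references differ enormously in cost depending on locality:
all_local = access_cost(1000, 0)       # 1000 units
mostly_remote = access_cost(100, 900)  # 360100 units
```

Even this crude sketch shows why organizing a computation so that most references are local is the dominant concern on a NUMA machine.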

Our proposal, therefore, is to view every parallel computer as a MIMD NUMA shared-memory computer. This is common practice among software performance engineers, since even for an SMP machine the caches attached to the processors create a memory hierarchy with varying costs. A consequence of adopting the model, then, is that locality of data, which affects performance in NUMA systems, must be explicitly taken into account by the programmer.

Note that we are not advocating that only shared-memory programming environments be used. Shared-memory programming environments are available on workstation clusters (e.g., Myrias' PAM or TreadMarks), and message-passing environments are available on SMP workstations (e.g., MPI from MPI Technologies Inc.). All we are advocating is that in thinking about an algorithm, you can view any practical parallel computer system as a shared-memory MIMD NUMA machine.

This model does not support a quantitative cost model, so it cannot be used, for example, to study the impact of different network topologies. It does, however, address the major issues in constructing a parallel algorithm, namely organizing a computation to minimize synchronization and to make memory references as local as possible.

Programming model: the coordination model

Many parallel programming models can be layered on top of the MIMD NUMA model. For our purposes, the most effective programming model that is general enough to meet our needs is the coordination model. It does not support quantitative cost estimation and therefore cannot help designers make fine distinctions between related algorithms. It does, however, provide a common ground for discussing parallel algorithms.

The model is simple: A parallel computation is viewed as a number of "units of execution" (UEs). A UE (usually a thread or a process) acts very much like a von Neumann machine. Each UE runs independently except at discrete and explicit points where it interacts with another UE. These interactions, the so-called coordination events, coordinate the UEs; that is, they provide communication, synchronization, and management services for the UEs.
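A minimal sketch of this model, using Python threads as the UEs, might look as follows. The names are illustrative, not from the original text: each UE computes independently and interacts with the other only at explicit coordination events, a queue for communication and a join for synchronization.

```python
import threading
import queue

results = queue.Queue()  # the only channel through which the UEs communicate

def producer():
    for i in range(3):
        value = i * i            # independent computation
        results.put(value)       # explicit coordination event: communicate

def consumer(collected):
    for _ in range(3):
        collected.append(results.get())  # explicit coordination event: receive

collected = []
p = threading.Thread(target=producer)
c = threading.Thread(target=consumer, args=(collected,))
p.start(); c.start()
p.join(); c.join()               # coordination event: synchronize on completion
# collected now holds [0, 1, 4]
```

Between the put/get and join calls, neither UE knows or cares what the other is doing; this is exactly the "independent except at discrete and explicit points" behavior the model describes.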

This model clearly fits most coordination languages (Linda, ALWAN, etc.) and most message-passing systems (PVM, MPI, etc.). It does not adequately describe the situation in which several UEs share a single address space (i.e., a multithreaded programming environment); such a system allows interactions to occur anywhere, since memory addresses are accessible to multiple UEs. Note, however, that a program that allowed multiple threads to interact freely would likely be subject to race conditions; in contrast, well-written multithreaded programs make accesses to shared memory distinct and protect them through some type of synchronization. In other words, a good shared-memory program is based on the coordination model.
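As a hedged sketch of the last point, consider a multithreaded program in which the only access to shared memory is confined to one well-marked place and protected by a lock. The structure (though not the trivial workload, which is invented here) is what makes it a coordination-model program rather than a free-for-all of interactions.

```python
import threading

counter = 0                      # the shared state
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:               # the distinct, protected interaction point
            counter += 1         # read-modify-write, atomic w.r.t. other UEs

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 40000; without the lock, lost updates could yield a smaller value
```

Removing the lock turns every increment into a potential race; keeping accesses distinct and synchronized is what brings the program back within the coordination model.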

Specification model: abstract parallel algorithms

Having defined the lower layers of our hierarchy of models, we now look at the top layer, which we will use to describe concurrent algorithms. We define the following elements of a parallel algorithm:

- Tasks: the units of work into which the problem is decomposed.
- Coordination: the communication and synchronization among tasks.
- A mapping of tasks onto UEs.
- A mapping of UEs onto processing elements (PEs).

As with the underlying models, the above statements are too general to support formal reasoning or quantitative performance analysis. They do, however, give us a shared vocabulary for describing a parallel algorithm. In particular, they make clear the algorithm designer's need to define tasks, manage their coordination, map them onto UEs, and finally map the UEs onto processing elements.
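This vocabulary can be made concrete with a small sketch. The example below is illustrative, not from the original text: the problem (summing a list) is decomposed into tasks, the tasks are mapped onto UEs (threads in a pool), and the runtime and OS handle the final mapping of UEs onto processing elements.

```python
from concurrent.futures import ThreadPoolExecutor

def task(chunk):
    """One independent task: sum one chunk of the data."""
    return sum(chunk)

data = list(range(100))
# Decomposition: split the problem into four independent tasks.
chunks = [data[i:i + 25] for i in range(0, 100, 25)]

# Mapping tasks onto UEs: a pool of 4 threads; the OS maps these UEs onto PEs.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(task, chunks))

# Coordination: combining the partial results is the only interaction needed.
total = sum(partial_sums)
# total == 4950
```

Even in this trivial case, all four elements of the vocabulary appear, and each could be varied independently (more tasks, fewer UEs, a different mapping) without changing the others.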

How the models interact

Having defined models for the three layers we want to address, we can talk about the relationships among them. These relationships are an important part of our overall model.