Pattern Name: SeparableDependencies

AlgorithmStructure Design Space

Supporting Document: Examples

 

Matrix-Vector Multiplication, MPI Implementation:

The following code (to be executed, SPMD-style, by each UE) accomplishes the desired result using MPI. Code to distribute the matrix among UEs and initiate the computation is not shown.

      SUBROUTINE PAR_BLAS2(m,n,a,b,c,comm)
      INCLUDE 'mpif.h'    !MPI constants (MPI_REAL, MPI_SUM)
      INTEGER m,n,comm,i,j,ierr
      REAL a(m), b(m,n)   !local slice of array and matrix
      REAL c(n)           !result
      REAL sum(n)         !local vector for partial results

      !calculate local sum
      DO j = 1,n
          sum(j) = 0.0
          DO i = 1,m
              sum(j) = sum(j) + a(i)*b(i,j)
          END DO
      END DO

      !calculate global sum; result of reduction placed in c at all nodes
      CALL MPI_ALLREDUCE(sum,c,n,MPI_REAL,MPI_SUM,comm,ierr)

      RETURN
      END

The MPI_ALLREDUCE primitive performs, behind the scenes, the synchronization and communication needed to compute the desired result and place a copy of it on every UE.
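
The code to distribute the matrix is not shown above because it depends on where the data starts out. As one possible illustration (a sketch in C rather than Fortran; the helper name distribute, the row-block layout, the assumption that M divides evenly among the UEs, and the assumption that the full data begins on the UE of rank 0 are all ours, not part of the original example), the distribution could use MPI_Scatter:

    /* Sketch only: scatter the full vector A (length M) and matrix B
     * (M x n, stored row-major so each UE's block of m = M/num_procs
     * rows is contiguous) from rank 0 to all UEs in comm. */
    #include <stdlib.h>
    #include "mpi.h"

    void distribute(int M, int n, double *A, double *B, MPI_Comm comm,
                    int *m, double **a, double **b)
    {
        int num_procs;
        MPI_Comm_size(comm, &num_procs);
        *m = M / num_procs;                  /* local slice size */
        *a = malloc((*m) * sizeof(double));
        *b = malloc((*m) * n * sizeof(double));

        /* rank 0 sends each UE its slice of the vector and the matrix */
        MPI_Scatter(A, *m, MPI_DOUBLE, *a, *m, MPI_DOUBLE, 0, comm);
        MPI_Scatter(B, (*m) * n, MPI_DOUBLE, *b, (*m) * n, MPI_DOUBLE,
                    0, comm);
    }

Once the slices are in place, each UE would call the multiplication routine on its local a and b.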

Numerical Integration, OpenMP Implementation:

With OpenMP, we use fork-join parallelism to break up the loop in the program for execution on multiple threads. In this case, the number of threads is set to NUM_THREADS with a call to omp_set_num_threads(). We then parallelize the program using a single "parallel for" pragma. The reduction is handled using the standard reduction clause from OpenMP.

    #include <stdio.h>
    #include <omp.h>
    #define NUM_THREADS 2
    static long num_steps = 100000;
    double step;
    int main (void)
    {
        int i;
        double x, pi, sum = 0.0;

        step = 1.0/(double) num_steps;

        omp_set_num_threads (NUM_THREADS);

        /* each thread accumulates into a private copy of sum;
           the reduction clause combines the copies at the join */
        #pragma omp parallel for reduction(+:sum) private(x)
        for (i=1;i<= num_steps; i++){
            x = (i-0.5)*step;          /* midpoint of subinterval i */
            sum = sum + 4.0/(1.0+x*x);
        }
        pi = step * sum;
        printf(" pi is %f \n",pi);
        return 0;
    }

In this case we used the default schedule, which is implementation dependent. The performance of this simple example is not sensitive to the particular schedule used. If the work per loop iteration varied unpredictably, however, we would want a dynamic schedule, selected by adding a schedule(dynamic) clause to the omp pragma.
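
For illustration, here is how the pragma above might look with a dynamic schedule (a sketch, not part of the original program; the chunk size of 100 is an arbitrary assumption):

    /* sketch: a dynamic schedule hands out chunks of iterations to
       threads as they become idle; the chunk size (here 100) is an
       arbitrary choice, not part of the original example */
    #pragma omp parallel for reduction(+:sum) private(x) schedule(dynamic, 100)
    for (i=1; i<= num_steps; i++){
        x = (i-0.5)*step;
        sum = sum + 4.0/(1.0+x*x);
    }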

Numerical Integration, MPI Implementation:

We can also implement this algorithm as an SPMD program using MPI. The approach used here is very common. One copy of the program is run on each UE. The program initializes MPI and then queries the system to find its rank (MPI_Comm_rank) and the number of UEs (MPI_Comm_size). The loop is scheduled in an interleaved (cyclic) manner by running the loop iterations from id+1 to the number of integration steps (num_steps) with an increment equal to the number of UEs.

    #include <stdio.h>
    #include "mpi.h"
    static long num_steps = 100000;
    double step;
    int main (int argc, char** argv)
    {
        int i, id, num_procs;
        double x, pi, sum = 0.0;

        MPI_Init(&argc, &argv);
        step = 1.0/(double) num_steps;

        MPI_Comm_rank(MPI_COMM_WORLD, &id);
        MPI_Comm_size(MPI_COMM_WORLD, &num_procs);

        /* cyclic (interleaved) schedule: each UE takes every
           num_procs-th iteration, starting at its rank */
        for (i=(id+1);i<= num_steps; i+=num_procs){
            x = (i-0.5)*step;
            sum = sum + 4.0/(1.0+x*x);
        }
        /* combine partial sums onto the UE of rank 0; sum and pi
           are doubles, so the MPI type is MPI_DOUBLE */
        MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if(id == 0){
            pi *= step;
            printf(" pi is %f \n",pi);
        }
        MPI_Finalize();
        return 0;
    }

The reduction is handled by an MPI library routine, MPI_Reduce(). The form used here combines the local results into a single value on the UE of rank 0; that UE then handles the single output operation.
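
If every UE needed a copy of the result (say, for a later phase of the computation), the same separable reduction could instead use MPI_Allreduce, as in the matrix-vector example above. A minimal sketch of the substitution:

    /* sketch: MPI_Allreduce delivers the reduced value to every UE,
       so no single rank has to broadcast the result afterwards */
    MPI_Allreduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    pi *= step;    /* now valid on all UEs */

The tradeoff is the extra communication needed to replicate the value on every UE.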