[Ackerman82] W. B. Ackerman, "Data flow languages", Computer, Vol. 15, No. 2, pp. 15-25, 1982.
A very good summary of data flow and of the changes made to traditional languages to accommodate parallelism (e.g., outlawing side effects). Reproduced in Selected Reprints on Dataflow and Reduction Architectures, ed. S. S. Thakkar, IEEE, 1987, pp. 179-188. An earlier version appeared in Proc. NCC, 1979, and is reproduced in Kuhn and Padua's 1981 tutorial.
[Agha86] G. Agha, Actors: A model of Concurrent Computation in Distributed Systems, MIT Press, 1986.
[Akl96] S. G. Akl and I. Stojmenovic, "Broadcasting with Selective Reduction: A Powerful Model of Parallel Computation", in [Zomaya96], chapter 7, pp. 192-222, 1996.
This paper describes the BSR extension of the PRAM model (though in fairness to the author, I doubt he would call it a PRAM extension). It is a variation of the CRCW PRAM in which multiple processors can write to a common location using a broadcast. Following the broadcast, each memory location selectively keeps or rejects the broadcast values and applies a reduction rule to the values it keeps.
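As a rough sketch of the idea (the names bsr_step, select, and reduce_op are mine, not Akl's), one BSR step can be simulated sequentially: every processor broadcasts a (tag, value) pair, every memory cell filters the broadcasts with a selection predicate, and then reduces whatever it kept:

```python
# Hedged sketch of one BSR step. All function and parameter names here are
# illustrative inventions, not notation from the paper.

def bsr_step(broadcasts, cells, select, reduce_op, identity):
    """broadcasts: list of (tag, value) pairs, one per processor.
    cells: list of memory-cell tags. Returns one reduced value per cell."""
    result = []
    for cell_tag in cells:
        # selection: each cell keeps only the broadcasts its predicate accepts
        kept = [v for tag, v in broadcasts if select(cell_tag, tag)]
        # reduction: fold the kept values with the reduction rule
        acc = identity
        for v in kept:
            acc = reduce_op(acc, v)
        result.append(acc)
    return result

# Example: cell j sums all broadcast values whose tag is <= j, which
# computes a prefix sum in a single BSR step.
out = bsr_step([(1, 10), (2, 20), (3, 30)],
               cells=[1, 2, 3],
               select=lambda cell, tag: tag <= cell,
               reduce_op=lambda a, b: a + b,
               identity=0)
# out == [10, 30, 60]
```

The prefix-sum example illustrates why BSR is considered powerful: an operation that takes O(log n) steps on an EREW PRAM collapses to a single step here.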
[Akl97] S.G. Akl, Parallel Computation: Models and Methods, Prentice-Hall, 1997.
Focusing throughout on models of computation and methods of problem solving, this text shows how a parallel algorithm can be designed for a given computational problem to run on a parallel computer, and then how it can be analyzed to determine its "goodness." The book covers in detail the three main models of computation used to design parallel algorithms (namely combinational circuits, shared memory machines, and interconnection networks). The algorithms are grouped by method of design (prefix computation, divide and conquer, and pointer-based methods). It explores general methods for designing parallel algorithms: prefix computation, list ranking, divide and conquer, split and plan, matrix multiplication, broadcasting with selective reduction, etc.
[Alexander64] C. Alexander. Notes on the Synthesis of Form. Harvard University Press, 1964.
This is the classic text that introduces the Patterns approach to design.
[Alexander79] C. Alexander. The Timeless Way of Building. Oxford University Press, 1979.
[Alexander77] C. Alexander, S. Ishikawa, and M. Silverstein. A Pattern Language. Oxford University Press, 1977.
[Alverson94] Gail A. Alverson, William G. Griswold, Calvin Lim, David Notkin, and Lawrence Snyder, "Abstractions for portable, scalable parallel programming", Technical Report UW-CSE-93-12-09, University of Washington, Department of Computer Science and Engineering, January 1994.
[Alverson90] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, "The Tera Computer System", Proceedings of the 1990 International Conference on Supercomputing, June 1990.
The Tera MTA computer is a novel architecture. The key features of the machine are multiple threads and a flat, very large shared memory (no cache). The role of data locality is reduced, so as long as enough threads can be defined to provide plenty of slackness and hide memory latency, the system will scale to large numbers of nodes.
[Amza96] C. Amza, A. Cox, S. Dwarkadas, P. Keleher, H. Lu, R. Rajamony, W. Yu, and W. Zwaenepoel. "Treadmarks: Shared Memory Computing on Networks of Workstations", IEEE Computer, Vol. 29, No. 2, pp. 18-28, 1996.
[Andrews91] G.R. Andrews, Concurrent Programming: Principles and Practice, Benjamin/Cummings, 1991.
[Andrews93] G.R. Andrews and R.A. Olsson, The SR Programming Language, Benjamin/Cummings, 1993.
[Armstrong96] R. Armstrong, "An Illustrative Example of Frameworks for Parallel Computing", in [POOMA96], 1996.
This short paper provides an overview of object-oriented frameworks. It mentions Sandia's POET and Los Alamos' POOMA, and notes that the two are to be merged. It then goes on to describe POET and a case study from statistical physics.
[Ashcroft91] E.A. Ashcroft, A.A. Faustini, and R. Jagannathan, "Lucid: An Intensional Language for Parallel Applications Programming", in [Szymanski91], 1991.
[Babb84] R. G. Babb, "Parallel Processing with Large-Grain Data Flow Techniques", Computer, Vol. 17, No. 7, pp. 55-61, July, 1984.
[Babb87] Robert G. Babb II and David C. DiNucci, "Design and Implementation of Parallel Programs with Large-Grain Data Flow", in The Characteristics of Parallel Algorithms, edited by Leah H. Jamieson, Dennis B. Gannon, and Robert J. Douglass, MIT Press, pp. 335-350, 1987.
[Bagheri91] B. Bagheri, T.W. Clark and L.R. Scott, "PFortran (a parallel extension of Fortran) reference manual." Research Report UH/MD-119, Dept. of Mathematics, University of Houston, 1991.
This group produced a number of languages very closely related to Pfortran. I describe their work in terms of a single coordination language called P, which is then applied to Fortran, C or C++. The basic idea behind P is a generalization of message-passing as a restricted form of shared memory. For example, you can reference an array element on another node with the notation a(I)@node_j. This is a type of virtual shared memory, but the message-passing analog is pretty obvious.
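To make the idea concrete, here is a hypothetical sketch (class and method names are mine) of how a reference like a(i)@node_j can be lowered to a matched send/receive pair on top of message queues, which is essentially what the message-passing analog looks like:

```python
# Hedged sketch: one way a remote array reference a(i)@node_j might lower
# to message passing. The Node class and its inbox queues are invented for
# illustration; the real P compilers generate their own communication calls.
from queue import Queue

class Node:
    def __init__(self, node_id, a):
        self.id = node_id
        self.a = list(a)      # this node's local piece of the array
        self.inbox = Queue()  # incoming messages for this node

    def remote_read(self, owner, i):
        """Evaluate the analogue of a(i)@owner: request, service, reply."""
        owner.inbox.put(('read', i, self.inbox))   # reader sends a request
        kind, idx, reply_to = owner.inbox.get()    # owner services it
        reply_to.put(owner.a[idx])                 # owner sends the value back
        return self.inbox.get()                    # reader blocks on the value

n0 = Node(0, [1, 2, 3])
n1 = Node(1, [40, 50, 60])
x = n0.remote_read(n1, 2)   # the analogue of a(2)@node_1
# x == 60
```

In a real run the owner would service requests in its own control flow rather than inline; the point is only that the shared-memory notation and the send/receive pair carry the same information.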
[Bal89] H.E. Bal and J.G. Steiner and A.S. Tanenbaum, Programming Languages for Distributed Computing Systems, Computing Surveys, vol. 21, no. 3, pp. 261-322, 1989.
[Baratloo95] A. Baratloo, P. Dasgupta, and Z.M. Kedem, "A Novel Software System for Fault Tolerant Parallel Processing on Distributed Platforms", in Proceedings of the 4th IEEE International Symposium on High Performance Distributed Computing, 1995.
This is the original Calypso paper in which the implementation at NYU on Sun workstations is described. For information about the NT port of Calypso, see [Mclaughlin97].
[Barnoy92] A. Bar-Noy and S. Kipnis, "Designing Algorithms in the Postal Model for Message-Passing Systems", Proceedings of the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 13-22, 1992.
[Baude91] F. Baude and G. Vidal-Naquet, "Actors as a Parallel Programming Model", Proceedings of 8th Symposium on Theoretical Aspects of Computer Science, Springer Lecture Notes in Computer Science, 480, 1991.
[Bertsekas89] D.P. Bertsekas and J.N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, 1989.
[Bjornson91a] R. Bjornson, C. Kolb, and A. Sherman, "Ray Tracing with Network Linda", SIAM News, volume 24, number 1, January 1991.
[Bjornson91b] R. Bjornson, N. Carriero, T.G. Mattson, D. Kaminsky, and A. Sherman, "Experience with Linda", Yale University Computer Science Department, Technical Report RR-866, August, 1991.
[Blelloch89] G. Blelloch, "Scans as Primitive Operations", IEEE Transactions on Computers, vol. 38, pp. 1526-1538, November, 1989.
Scans are generalizations of the well known parallel prefix operation. In this paper, an argument is made that the scan should be considered one of the fundamental operations made available on a parallel computer.
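A sequential reference implementation makes the definition concrete (this is my own sketch, not code from the paper): an inclusive scan over any associative operator, of which prefix sum is the special case op = +.

```python
# Inclusive scan (generalized parallel prefix) over an associative operator.
# A parallel implementation computes the same outputs in O(log n) depth,
# which is why Blelloch argues for making it a machine primitive.
def scan(op, xs):
    out, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)  # fold the prefix so far
        out.append(acc)                          # emit every partial result
    return out

# scan with + is the familiar prefix sum; scan with max is a running maximum.
assert scan(lambda a, b: a + b, [3, 1, 4, 1, 5]) == [3, 4, 8, 9, 14]
assert scan(max, [3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
```

The operator must be associative for a parallel tree-based scan to agree with this sequential definition.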
[Blelloch90] G.E. Blelloch and G.W. Sabot, "Compiling Collection-Oriented Languages onto Massively-Parallel Computers", Journal of Parallel and Distributed Computing, pp. 119-134, 1990.
[Blelloch92] G.E. Blelloch, NESL: a Nested Data Parallel Language, School of Computer Science, Carnegie-Mellon University, CMU-CS-92-103, 1992.
[Blelloch97] G.E. Blelloch and B. M. Maggs, "Parallel Algorithms", to appear in CRC Handbook of Computer Science, 1997.
This is a long discussion of parallel algorithms. They open with a review of how to model parallel computation. After the review, they explain the work-depth model used in the paper. This is a model of a parallel algorithm, not of a parallel machine: the idea is to model the algorithm abstractly and then map it onto a PRAM or some other computation model should the need arise. The work in the work-depth model is the number of operations to be performed; the depth is the longest chain of dependencies among the operations; the parallelism of the algorithm is the ratio of the work to the depth. There are actually many work-depth models, and they use the most abstract one, the circuit model. They use this model to describe a number of algorithms: divide-and-conquer, parallel pointer methods, breadth-first graph searching, finding minimum spanning trees, sorting, computational geometry, and several others. I find this reference intriguing since it focuses on algorithms that are of interest outside the scientific computing community.
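The measures are easy to illustrate (my own toy example, not from the paper) on the standard balanced-tree sum of n values: the work is the total operation count, n - 1 additions; the depth is the longest dependency chain, ceil(log2 n) levels of the tree; the parallelism is their ratio.

```python
# Work, depth, and parallelism of a balanced-tree sum of n values,
# following the work-depth definitions: work = total operations,
# depth = longest dependency chain, parallelism = work / depth.
import math

def tree_sum_cost(n):
    work = n - 1                      # one addition per internal tree node
    depth = math.ceil(math.log2(n))   # one level per halving of the inputs
    return work, depth, work / depth

work, depth, parallelism = tree_sum_cost(1024)
# work == 1023, depth == 10, parallelism == 102.3
```

The same three numbers computed for any algorithm in the circuit model tell you how many processors it can profitably use before depth becomes the bottleneck.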
[Botorog95] G. H. Botorog and H. Kuchen, "Algorithmic Skeletons for Adaptive Multigrid Methods", Computing Surveys Symposium on Models of Programming Languages and Computation, ACM Computing Surveys, Vol. 28, No. 2, pp. 293-294, 1996.
We are considering two issues which, we believe, are very important in the context of algorithmic skeletons: on the one hand, finding an adequate level of generality for the skeletons, and on the other hand, using skeletons to solve `real-world' problems in parallel. On the first issue, we argue that very general skeletons may be inadequate for expressing certain algorithms and may lead to implementations that are not very efficient, whereas too-specialized skeletons may degenerate to plain parallel library functions. Our approach therefore lies somewhere between these two extremes: we try to design skeletons that can be used in solving related problems, or problems from different areas, that have a similar (e.g., hierarchical) structure.

We choose adaptive multigrid algorithms as our class of applications, for several reasons. Firstly, multigrid methods are non-trivial numerical applications. Secondly, there is an entire class of multigrid methods which, however, build on the same blocks, thus fitting naturally into the skeletal concept. Thirdly, multigrid skeletons can be used in the implementation of other classes of hierarchically structured algorithms, such as N-body methods.

We have then analyzed the data dependencies and access patterns that occur in multigrid and have derived from them two kinds of operations: `horizontal' or intra-grid operations, like relaxation or correction computation, and `vertical' or inter-grid operations, like grid refinement/coarsening and data prolongation and restriction. Based on these data dependencies, we have derived a series of skeletons for multigrid. Besides point-wise operations like `map', we also employed `environment operations', which are applied to single points together with their environment, i.e., their neighboring points in the horizontal case and their parents/children in the vertical case, respectively.
[Boyle87] J. Boyle, R. Butler, T. Disz, B. Glickfeld, E. Lusk, R. Overbeek, J. Patterson, and R. Stevens, Portable Programs for Parallel Processors, Holt, Rinehart and Winston, 1987.
This is one of the earliest efforts to support portable parallel programming. It describes the m4 package, which is also called the Argonne macros package.
[Brooks92] R. R. Brooks and M. Hodoscek, "Parallelization of CHARMM for MIMD Machines," Chemical Design Automation News, vol. 7, p. 16, 1992.
This paper is a classic example of how far you can push the loop-splitting algorithmic motif. I have been able to scale this code out to 512 nodes on a Paragon!
[Brooks96] E. D. Brooks III and K. H. Warren, "A Parallel Programming Model Bridging Shared Memory and Distributed Memory Architectures", UCRL-JS-126451, 1997.
In this paper, they present PCP, a parallel C preprocessor. PCP is an explicitly shared-memory programming language.
[Browne94] James C. Browne, Jack J. Dongarra, Syed I. Hyder, Keith Moore, and Peter Newton, "Visual Programming and Parallel Computing", Technical Report CS-94-229, University of Tennessee, April 1994.
[Butler94] R. Butler and E. Lusk, "Monitors, messages, and clusters: The p4 Parallel Programming System," Parallel Computing, vol. 20, pp. 547-564, 1994.
p4 is mostly thought of as a message-passing library. It provides much more than message passing, however: p4 includes monitors and is therefore one of the first packages to support mixed SMP/message-passing computing.
[Campbell94b] D.K.G. Campbell and S. J. Turner, "CLUMPS: A model of efficient general-purpose parallel computation", Proceedings of the IEEE TENCON Conference, Vol. 2, pp. 723-727, 1994.
[Cann92] D.C. Cann, J.T. Feo, A.D.W. Boehm, and R.R. Oldehoeft, "SISAL Reference Manual: Language Version 2.0", 1992.
[Carriero91] N. Carriero and D. Gelernter. How to Write Parallel Programs: A First Course, MIT press, 1991.
This book describes Linda and how to write a broad range of programs using Linda. It is aimed at the beginning parallel programmer and so it gets a bit verbose.
[Carriero94] N. J. Carriero, D. Gelernter, T. G. Mattson, and A. H. Sherman, "The Linda alternative to message-passing systems," Parallel Computing, vol. 20, pp. 633-655, 1994.
This is yet another Linda apologetics paper where we specifically compared it to message-passing systems.
[Chan90] T. Chan, "Hierarchical Algorithms and Architectures for Parallel Scientific Computing", appeared in Computer Architecture News, Vol. 18, No. 3, p. 318, 1990.
In this paper, Chan argues that an increasing number of important algorithms in scientific computing are hierarchical. He gives a number of examples including multigrid. He argues for designing hardware and programming environments to specifically support this hierarchical structure.
[Chandy88] K.M. Chandy and J. Misra, Parallel Program Design: A Foundation, Addison-Wesley, 1988.
This is the book where the UNITY programming language is defined.
[Chandy92] K. Mani Chandy and C. Kesselman, "The Derivation of Compositional Programs", Proceedings of the 1992 Joint International Conference and Symposium on Logic Programming, MIT Press, 1992.
[Chandy92b] K. M. Chandy and C. Kesselman, "CC++: A Declarative Concurrent Object Oriented Programming Notation", Caltech Technical Report, 1992.
[Chandy94] K. M. Chandy. Concurrent program archetypes. In Proceedings of the Scalable Parallel Library Conference, 1994.
[Chandy95] K. M. Chandy, R. Manohar, B. L. Massingill, and D. I. Meiron. Integrating task and data parallelism with the group communication archetype. In Proceedings of the 9th International Parallel Processing Symposium, 1995.
[Chandy97] K. M. Chandy and B. L. Massingill. Parallel program archetypes. Technical Report CS-TR-96-28, California Institute of Technology, 1997. ftp://csvax.caltech.edu/tr/cs-tr-96-29.ps.Z.
[Chandy96] K. M. Chandy and B. A. Sanders. Reasoning about program composition. Technical Report 96-035, University of Florida, 1996. ftp://ftp.cise.ufl.edu/cis/tech-reports/tr96/tr96-035.ps.
[Chapman92] B. Chapman and P. Mehrotra and H. Zima, Programming in Vienna Fortran, Scientific Programming, vol. 1, no. 1, pp. 31-50, 1992.
Vienna Fortran is a data-parallel variant of Fortran. The Vienna Fortran group had a significant impact on the design of HPF.
[Chapman96] B. Chapman, M. Haines, P. Mehrotra, H. Zima, and J. Van Rosendale, "Opus: A Coordination Language for Multidisciplinary Applications", to appear in Scientific Programming, http://www.cs.uwyo.edu/~haines/research/opus/sciprog.ps, 1996.
A set of language extensions to HPF. Opus is a coordination language designed to support multidisciplinary computing under HPF. It is based on ShareD Abstractions (SDAs). An SDA encapsulates data and methods (much as an object-oriented language does) to provide a well-structured mechanism for HPF programs to interact. SDAs also use the concept of monitors, in that an active method has exclusive access to the data in an SDA. SDAs have distinct address spaces. One uses SDAs to implement task-level parallelism within HPF.
[Cheatham] T. Cheatham, "Models, Languages, and Compiler Technology for High Performance Computers", Harvard University Technical Report.
This is a BSP paper. It describes a coordination language called BSP-L and a programming environment to support the language called BSP-H.
[Chen91] M. Chen, Y. Choo, J. Li, "Crystal: Theory and Pragmatics of Generating Efficient Parallel Code", in [Szymanski91], 1991.
[Cheng96] D.Y. Cheng, "Tools for Portable High-Performance Parallel Computing", in [Zomaya96], pp. 865-896, 1996.
[Ciancarini92] P. Ciancarini, K. K. Jensen, and D. Yankelevich, "The Semantics of a Parallel Language Based on a Shared Data Space", University of Pisa Technical Report, 1992.
This paper presents a formal semantics for Linda. It's slow going, but it makes sense. They then use this semantics to relate Linda to a number of formal models of computation: the Chemical Abstract Machine, SOS, CCS, and Petri nets. This part of the paper is really hard to follow.
[Clark86] K. L. Clark and S. Gregory, "PARLOG: Parallel Programming in Logic", ACM Trans. Programming Language Systems, Vol. 8, No. 1, pp. 1-49, 1986.
[Clark87] K.L. Clark, A Declarative Environment for Concurrent Logic Programming, Proceedings of TAPSOFT, Springer Lecture Notes in Computer Science 250, pp. 212-242, 1987.
[Clark92] T.W. Clark, R. von Hanxleden, K. Kennedy, C. Koelbel, and L.R. Scott, "Evaluating Parallel Languages for Molecular Dynamics Computations," Proceedings of the Scalable High Performance Computing Conference, (SHPCC-92), p 98, 1992
This is a good paper to compare the use of a data-parallel language (Fortran-D) with a coordination language (Pfortran). The data-parallel language, in my opinion, comes out looking really bad.
[Cole89] M. Cole, Algorithmic Skeletons: Structured Management of Parallel Computation, Pitman Research Monographs in Parallel and Distributed Computing, 1989.
[Cook73] S. Cook and R. Reckhow, "Time Bounded Random Access Machines", Journal of Computer and Systems Sciences, vol. 7, pp. 354-375, 1973.
This is the classic paper referenced when talking about the RAM model. The RAM model is a realization of the high-level von Neumann model. I have not read this paper yet, but I've seen it referenced all over the place.
[Coplien95] J. O. Coplien and D. C. Schmidt, editors. Pattern Languages of Program Design, pages 1-5. Addison-Wesley, 1995.
[Crowl92] L. A. Crowl, "Variables and Parameters as References and Containers", Oregon State Department of Computer Science Technical Report 92-60-20, 1992.
In this paper, Crowl argues that most object-based languages use a reference model for variables. Crowl believes that a container model is more efficient and more applicable to parallel and distributed computing. A container variable is the sort of variable used in Fortran, where a variable contains a copy of a value. Object-oriented languages such as Smalltalk or Actors associate variables with references to objects. He describes his work with the Natasha language and uses it to argue for the superiority of the container model in parallel and distributed languages.
[Crowl93] L. A. Crowl and T. J. LeBlanc, "Parallel Programming with Control Abstraction", Oregon State Department of Computer Science Technical Report 93-60-15a, 1993.
This is the paper where the utility of user-extensible control abstractions is presented. A control abstraction is a closure relation specifying dependency relations between operations. In my opinion, the methodology proposed in this paper is obscure and not directly useful to the general parallel programmer. The abstractions are implemented in terms of a high-level programming model called Matroshka and a related programming language called Natasha.
[Culler91] D. Culler, A. Sah, K. Schauser, T. von Eicken, and J. Wawrzynek, "Fine-grain parallelism with minimal hardware support: a compiler-controlled threaded abstract machine", in Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, CA, April 1991.
This is the paper where a Threaded Abstract Machine is described. This is a compiler-based approach to latency tolerance and lightweight synchronization. I need to take a close look at this since I am interested in formal models of parallelism that map onto multi-threaded systems.
[Culler92] David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser, and Thorsten von Eicken, "Active Messages: A Mechanism for Integrated Communication and Computation", University of California, Berkeley Technical Report UCB/CSD 92/675, ftp.cs.berkeley.edu:/ucb/TAM/isca92.ps, 1992.
The paper starts from the premise that current MPP designers tend to optimize along specific dimensions rather than toward a balance: most interconnection networks are built to deliver high-volume communication, yet to avoid idling the processor, 90% of the time must be spent in computation, so the expensive network is mostly idle. To achieve a balance, startup cost must be low, and a means to overlap communication and computation must be provided: hence the active message, which is simply a message containing in its header the address of a user-level handler that is executed on message arrival. The role of the handler is to extract the message as quickly as possible and to integrate it into the current computation. Not surprisingly, such handlers must not be allowed to block. One advantage is that no temporary buffers are required, as the program code will have pre-allocated space for the message; however, care must be taken to avoid deadlock. The paper then shows that implementations of this low-level construct run very efficiently on the nCube and CM-5 (orders of magnitude faster). Next, the Split-C programming model is described, which uses put and get operations to copy data to and from processors. By requesting data as early as possible, and only blocking on a value when it is actually needed, the communication/computation overlap is very high. Utilization measurements for a range of data distributions (columns per processor) show that while processor and network utilization fluctuate, the joint utilization is relatively steady. Moving on, the paper describes the Threaded Abstract Machine for compilers and speculates on hardware support that could improve the efficiency of such a model. Active messages are compared with the Monsoon and J-machine tokens, and found to be different.
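The core mechanism is small enough to sketch (the class and handler names below are mine, purely illustrative): the message header names a user-level handler, and the receiver runs that handler immediately on arrival instead of buffering the payload.

```python
# Hedged sketch of the active-message idea: the header carries the name of a
# user-level handler, and "delivery" means invoking it at once on the payload.
# Real implementations carry a code address and run in the network interrupt.
class ActiveMessageNode:
    def __init__(self):
        self.counters = {}

    # A user-level handler: it must run quickly, integrate the payload into
    # the ongoing computation, and never block.
    def incr_handler(self, key, amount):
        self.counters[key] = self.counters.get(key, 0) + amount

def deliver(node, handler, *payload):
    """Stand-in for the network: dispatch straight to the named handler."""
    getattr(node, handler)(*payload)

n = ActiveMessageNode()
deliver(n, 'incr_handler', 'hits', 1)   # message = ('incr_handler', 'hits', 1)
deliver(n, 'incr_handler', 'hits', 2)
# n.counters['hits'] == 3
```

Because the handler consumes the payload immediately into pre-existing program state, no intermediate buffering is needed, which is exactly the source of the efficiency the paper reports.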
[Culler93a] D. Culler, R. Karp, D. Patterson, A. Sahay, K.E. Schauser, E. Santos, R. Subramonian T. von Eicken, "LogP: Toward a Realistic Model of Parallel Computation", ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 1-12, May 1993.
[Culler93b] David E. Culler, Andrea Dusseau, Seth Copen Goldstein, Arvind Krishnamurthy, Steven Lumetta, Thorsten von Eicken, and Katherine Yelick, "Parallel Programming in Split-C", ftp.cs.berkeley.edu:/ucb/split-c/sc93.ps.Z, 1993.
Based loosely on the SPMD model, with a global address space and a clear concept of locality, the language allows a mixture of parallel paradigms (message passing, shared memory, and data parallelism). The extensions to standard C include global pointers (which reference objects on remote processors; arithmetic such as ++ works correctly), split-phase assignment (:= allows communication and processing to be pipelined), signaling store (remote update with notification), bulk transfer, and spread arrays. Using a program for modeling electromagnetic waves as an example, the power of the model is used to show how simple optimizations (ghost nodes, split-phase communication, and signaling stores) improve efficiency by a factor of 4-5 on a 64-processor CM-5.
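The split-phase idea in particular maps cleanly onto the familiar future pattern; the sketch below (my own analogy, with a thread pool standing in for the network) issues the "get" early, overlaps it with local work, and blocks only at first use, which is exactly what := buys in Split-C:

```python
# Hedged analogy for Split-C's split-phase assignment (:=) using a future.
# remote_fetch and its sleep are stand-ins for a network get; names are mine.
from concurrent.futures import ThreadPoolExecutor
import time

def remote_fetch(array, i):
    time.sleep(0.01)   # stand-in for network latency
    return array[i]

remote = [10, 20, 30]
with ThreadPoolExecutor() as pool:
    fut = pool.submit(remote_fetch, remote, 1)   # issue: x := remote[1]
    local_work = sum(range(100))                 # overlap compute with comms
    x = fut.result()                             # block only at first use
# x == 20, local_work == 4950
```

The compiler-level version avoids the thread entirely (the network interface does the fetch), but the ordering of issue, overlap, and sync is the same.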
[Curfman96] L. Curfman McInnes, D. Stevens, S. Balay, W. Gropp, and B. Smith, "PETSc: Meeting the Challenge of large-scale Applications", in [POOMA96], 1996.
This paper provides a very terse introduction to PETSc. It attempts to place it in the context of object-oriented programming environments.
[Dally89] W.J. Dally and D.S. Wills, "Universal Mechanisms for Concurrency", PARLE '89, Parallel Architectures and Languages Europe, Springer-Verlag Lecture Notes in Computer Science 365, pp. 19-33, 1989.
[Danelutto94] M. Danelutto, "Working Group Report: Skeletons/Templates", Programming Environments for Massively Parallel Distributed Systems, Birkhauser Verlag, Basel, 1994.
This is a short but useful paper where the state of research in Skeletons is summarized.
[Davis97] G. Davis and B. Massingill. The mesh-spectral archetype. Technical Report CS-TR-96-26, California Institute of Technology, 1997. ftp://csvax.caltech.edu/tr/cs-tr-96-26.ps.Z.
[D'Azevedo94a] E.F. D'Azevedo and C.H. Romine, "DOLIB: Distributed Object Library", Oak Ridge National Laboratory Tech Report ORNL/TM-12744, 1994.
This system creates a shared memory region that sits on top of a standard message-passing system. The shared memory is viewed as a single dimension array that can be accessed through a put/get mechanism. See [Mattson95a] for a detailed discussion of DOLIB and a comparison with GA and NX.
[Dehne93] F. Dehne, A. Fabri, and A. Rau-Chaplin, "Scalable Parallel Computational Geometry for Coarse Grained Multicomputers", in Proceedings of the ACM Symposium on Computational Geometry, pp. 298-307, 1993.
This is the paper where the CGM (a variant of BSP) model is presented and then applied to computational geometry.
[Dijkstra80] E. Dijkstra and C. S. Scholten. "Termination detection for diffusing computations". Information Processing Letters, 11:1, August (1980).
[DiNucci89] David C. DiNucci and Robert G. Babb, "Design and implementation of parallel programs with LGDF2", Proceedings of COMPCON '89, IEEE, pp. 102-107, 1989.
[Dongarra86] J. Dongarra and D. Sorensen, "SCHEDULE: Tools for Developing and Analyzing Parallel Fortran Programs", ANL/MSC TM 86, 1986.
[Dongarra87] J.J. Dongarra and D.C.Sorensen. "A Fully Parallel Algorithm for the Symmetric Eigenvalue Problem", SIAM J. Sci. and Stat. Comp. 8, S139-S154 (1987).
[Douglas93] C. C. Douglas, T. G. Mattson, and M. H. Schultz, "A Comparison of Linda, P4, POSYBL, PVM and TCGMSG: Two and Four Node Communication Times", Yale University Technical Report TR-975, 1993.
[Duncan90] R. Duncan, "A survey of Parallel Computer Architectures", Computer 23(2), pp.5-16, 1990.
[Edwards96] H. C. Edwards and J. C. Browne, "Scalable Distributed Dynamic Array and its Application to a Parallel hp Adaptive Finite Element Code", in [POOMA96], 1996.
[Eppstein88] D. Eppstein and Z. Galil, "Parallel Algorithmic techniques for Combinatorial Computation", Annual Reviews of Computer Science, Vol. 3, pp. 233-283, 1988.
[Faust90] J.E.Faust and H.M.Levy "The Performance of an Object-Oriented Threads Package". Proceedings on Object-Oriented Programming: Systems, Languages, and Applications, Canada, October 1990
[Feldman93] J.A. Feldman and C.-C. Lim and T. Rauber, "The Shared Memory Language pSather on a Distributed Memory Multiprocessor", Proceeding of a Workshop on Languages, Compilers and Run-Time Environments for Distributed Memory Multiprocessors, appeared as SIGPLAN Notices, Vol. 28, No. 1, pp. 17-20, January 1993.
[Feo90] J.T. Feo, D.C. Cann, and R.R. Oldehoeft, "A Report on the SISAL Language Project," Journal of Parallel and Distributed Computing, vol. 12, p. 349, 1990.
[Feo92] J.T. Feo, A Comparative Study of Parallel Programming Languages: The Salishan Problems, North-Holland, 1992.
[Flynn72] M.J. Flynn, "Some Computer Organizations and Their Effectiveness," IEEE Trans. computers, vol. C-21, No. 9, 1972.
This is the classic paper that started the whole parallel taxonomy discussion. He introduces the SIMD, MIMD, MISD, and SISD nomenclature for computer architectures.
[Foster90] I. Foster and S. Taylor, Strand: New Concepts in Parallel Programming, Prentice Hall, 1990.
This book introduces the Strand language. Strand is a variation on Flat Concurrent Prolog and borrows heavily from the work of the Parlog community. The language is elegant: communication takes place through single-assignment variables. Another nice feature is that the program's semantics are distinct from how the program is mapped onto a parallel computer. While a very interesting language, this approach did not catch on and has essentially died out.
[Foster93a] I. Foster and S. Tuecke, Parallel Programming with PCN, ftp from info.mcs.anl.gov, version 2.0, January 1993.
PCN takes many of the ideas of Strand and combines them with the idea of program composition. It also adds mutable data structures (something missing completely from Strand). Finally, where Strand's syntax was essentially the Edinburgh Prolog syntax, PCN took a safer course and based its syntax on C.
[Foster93b] I. Foster, R. Olson, and S. Tuecke, "Programming in Fortran M", Technical Report ANL-93/26, Argonne National Laboratory, 1993.
[Foster95] I. Foster, Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering, Addison-Wesley, 1995.
[Fox88] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors. Prentice Hall, 1988.
[Fox89] G. C. Fox, "Parallel Computing comes of Age: Supercomputer Level Parallel Computations at Caltech," Concurrency: Practice and Experience, vol. 1, No. 1, p. 63, 1989.
A good book that summarizes the experiences gained from the Concurrent Computation Project at Caltech. This is the project that gave us the Cosmic Cube. A good number of the big names in parallel scientific computing got their start with this project and are represented in this book.
[Gamma95] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of Reusable Object-Oriented Software. Addison-Wesley, 1995.
[Gehani86] N. H. Gehani and W. D. Roome, "Concurrent C," Software: Practice and Experience, Volume 16, Number 9, 1986, pp. 821-844.
Concurrent C extends the CSP model used in Occam to provide a more general (and complex) concurrent programming model. Concurrent C uses a synchronous message-passing scheme to facilitate parallelism, with processes proceeding independently except when communication takes place. Concurrent C introduces a fairly complex set of new operators and program structures to support asynchronous parallel programming. Through the use of transaction pointers (a transaction is a structured two-way communication mechanism) and process variables, Concurrent C processes are able to communicate directly with other processes, regardless of the physical position of the processes in the system. Concurrent C maps well onto distributed memory machines. (Actually, Concurrent C can be implemented on a shared memory system, but constructs for managing shared memory were deliberately left out of the language to ensure portability.)
[Gehani88] N. Gehani and W. Roome, "Concurrent C++: Concurrent Programming with Class(es)", Software: Practice and Experience, Vol. 18, pp. 1157-1177, 1988.
[Gerbessiotis92] A.V. Gerbessiotis and L.G. Valiant, "Direct Bulk-Synchronous Parallel Algorithms", in Algorithm Theory - SWAT '92, edited by O. Nurmi and E. Ukkonen, pp. 1-18, 1992.
[Gibbons90] P.B. Gibbons, "Asynchronous PRAM algorithms", in Synthesis of Parallel Algorithms, edited by J.H. Reif, Morgan Kaufmann, 1990.
[Golub89] G.H. Golub and C.F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1989.
[Grimshaw87] A.S. Grimshaw and L.W. Liu. "Mentat: An object oriented data-flow system", Proceedings of the 1987 Object Oriented Programming systems, Languages, and Applications Conference, pp. 35-47, October 1987.
[Griswold90] W.G. Griswold, G.A. Harrison, D. Notkin, and L. Snyder, "Scalable Abstractions for Parallel Programming", in Proceedings of the Fifth Distributed Memory Computing Conference, vol. 2, IEEE Computer Society Press, pp. 1008-1016, 1990.
This is the paper where Snyder and company introduce the phase abstraction model of parallel computation.
[Haines94] M. Haines, D. Cronk, and P. Mehrotra, "On the Design of Chant: A Talking Threads Package", Proceedings of Supercomputing '94, pp. 350-359, 1994.
The term talking threads describes threads that communicate directly, regardless of whether they exist in the same address space. Chant integrates lightweight threads with distributed memory communication, both point-to-point and remote service requests. Chant extends the POSIX pthreads standard by introducing a new thread object, called a "chanter", which is capable of remote communication and remote thread operations. Chant sits atop pthreads and a communication library (such as MPI, p4, ...), and provides the necessary "glue" for an efficient implementation.
[Halstead85] R. H. Halstead Jr. "Multilisp: A language for concurrent symbolic computation", ACM Trans. Programming Language Systems, Vol. 7, No. 4, pp. 501-538, 1985.
[Hansen87] P.B. Hansen, "Joyce-a programming language for distributed systems", Software, Practice, Experience, Vol. 17, No. 1, pp. 29-50, 1987.
[Harrison91] R. J. Harrison, "Portable Tools and Applications for Parallel Computers," Int. J. Quantum Chem., vol. 40, pp. 847-863, 1991.
This is the paper that introduces the TCGMSG message-passing library. TCGMSG is a very simple message-passing library and was at one time the dominant message-passing library in parallel computational chemistry.
[Hatcher91] P.J. Hatcher and M.J. Quinn, Data-Parallel Programming on MIMD Computers, MIT Press, 1991.
Hatcher and Quinn worked extensively to take the C* language popularized on the Thinking Machines systems and map it onto MIMD machines. They called the result Data Parallel C.
[HPF93] The HPF Forum, "High Performance Fortran, Journal of Development," Special issue of Scientific Programming , vol. 2, No. 1,2, 1993.
[Heywood92] T. Heywood and S. Ranka, "A Practical Hierarchical Model of Parallel Computation: The Model", Journal of Parallel and Distributed Computing, Vol. 16, pp. 212-232, 1992.
This is the classic H-PRAM paper. This model was designed to make the PRAM approach work for systems where you need to account for synchronization and communication costs. The model consists of a dynamically configurable hierarchy of synchronous PRAMs. As an aside, in the introduction to this paper, they give a particularly nice summary of the types of models used in parallel computing.
[Hillis85] D. W. Hillis, "The Connection Machine," MIT Press, Cambridge, Massachusetts, 1985.
The book to read if you want to know about the most famous of all SIMD machines.
[Hillis86] W. D. Hillis and G. L. Steele, "Data parallel algorithms," Communications of the ACM, Vol. 29, No. 12, pp. 1170-1183, 1986.
[Hinrichs] Susan Hinrichs, "A Compile Time Model for Composing Parallel Programs", Carnegie Mellon University, Computer Science Department Technical Report CMU-CS-94-108, 1994.
In this report, she proposes a coordination language called PCS (Programmed Communication Service). The basic idea is to set up communication channels between nodes at compile time. By leaving these channels open throughout the computation, per-message overhead due to buffer and link management is avoided.
[Hockney] R.W. Hockney and C.R. Jesshope, Parallel Computers 2, Adam Hilger, 1988. For taxonomies, see chapter 1.2, pp. 53-81.
[Hoare85] C.A.R. Hoare, Communicating Sequential Processes, Prentice Hall, 1985.
The classic text that describes the CSP programming model. This model was used in the design of Occam. It can also be used in the formal analysis of message-passing programs.
[Hudak86] P. Hudak and L. Smith, "Parafunctional programming: a paradigm for programming multiprocessors", Proceedings of the 13th ACM Symposium on Principles of Programming Languages, pp. 243-254, 1986.
Parafunctional programming refers to the addition of mapping, order constraints, and priorities to a pure functional parallel programming language. They describe the ParAlfl language in this paper.
[Hudak89] P. Hudak, "Conception, Evolution, and Application of Functional Programming Languages", ACM Computing Surveys, Vol. 21, No. 3, pp. 359-411, 1989.
[Hudak92] P. Hudak and J. Fasel, "A Gentle Introduction to Haskell", ACM SIGPLAN Notices, Vol. 27, No. 5, May, 1992.
[Hwang] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill International, 1985.
[Ibbett] R.N. Ibbett, T. Heywood, M.I. Cole, R.J. Pooley, P. Thanisch, N.P. Topham, G. Chochia, P.S. Coe, P.E. Heywood, "Algorithms, Architectures and Models of Computation", University of Edinburgh, Computer Systems Group Technical Report.
This paper describes the Hierarchical PRAM model. They developed a simulator for working with HPRAM called HASE. They also developed a programming language called H-Fork to be used in programming HPRAM on the HASE simulator.
[Inmos86] Inmos, "Occam Programming Manual," Inmos Limited, March 1986.
The Occam language is directly based on the famous CSP model [Hoare85]. In Occam, input and output primitives are used as basic building blocks for structured communication between sequential processes. This communication takes place through one-way channels between processes. These channels are first class members of the programming language.
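Occam's channel discipline can be sketched with Python threads: the Channel class below (an illustration with made-up names, not Occam syntax) gives the one-way, synchronous send/receive behavior described above.

```python
import threading
import queue

class Channel:
    """One-way, synchronous channel in the spirit of Occam's ! and ? --
    a minimal illustration, not real Occam semantics."""
    def __init__(self):
        self._q = queue.Queue(maxsize=1)  # at most one value in flight

    def send(self, value):    # Occam: chan ! value
        self._q.put(value)
        self._q.join()        # block until the receiver has taken the value

    def receive(self):        # Occam: chan ? x
        value = self._q.get()
        self._q.task_done()   # release the blocked sender
        return value

chan = Channel()
results = []

def producer():
    for i in range(3):
        chan.send(i * i)

def consumer():
    for _ in range(3):
        results.append(chan.receive())

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [0, 1, 4]
```

Because `send` does not return until the matching `receive` has run, the two processes rendezvous at every communication, as in CSP.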
[JaJa92] J. Ja'Ja'. An Introduction to Parallel Algorithms. Addison-Wesley, 1992.
[JaJa96] J.F. Ja'Ja', "Fundamentals of Parallel Algorithms", in [Zomaya96] pp. 333-354.
[Jagannathan96] R. Jagannathan, "Dataflow Models", [Zomaya96] chapter 8, 1996.
This paper reviews dataflow computing models. It first considers the classical static model, based on directed graphs in which nodes are operations and edges represent the flow of data between nodes. When all of a node's incoming edges hold values (tokens), the node "fires" and writes its results to its output edges. This is the data-driven static dataflow model. Also reviewed are dynamic models in which each token carries a 4-tuple tag specifying the invocation ID, the iteration ID, the code block address, and the instruction address; a node fires when all the tokens arriving at it have matching tags. Also reviewed are demand-driven dataflow models, where a node fires only if its result is needed and its data is available. Other variants are covered as well, including the eduction model, eazyflow, and operator nets.
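The static data-driven firing rule can be sketched in plain Python; the Node class and graph here are illustrative inventions, not taken from any real dataflow system.

```python
# Minimal sketch of the data-driven static dataflow firing rule:
# a node fires once every incoming edge holds a token, consuming
# the tokens and sending its result along its output edges.

class Node:
    def __init__(self, op, n_inputs):
        self.op = op
        self.inputs = [None] * n_inputs   # one slot per incoming edge
        self.consumers = []               # (node, port) pairs fed by our output

    def put(self, port, token):
        self.inputs[port] = token
        if all(t is not None for t in self.inputs):   # firing rule
            result = self.op(*self.inputs)
            self.inputs = [None] * len(self.inputs)   # tokens are consumed
            for node, p in self.consumers:
                node.put(p, result)

# Build the graph for (a+b) * (c+d):
results = []
add1 = Node(lambda a, b: a + b, 2)
add2 = Node(lambda c, d: c + d, 2)
mul  = Node(lambda x, y: x * y, 2)
sink = Node(lambda v: results.append(v), 1)
add1.consumers.append((mul, 0))
add2.consumers.append((mul, 1))
mul.consumers.append((sink, 0))

# Tokens may arrive in any order; nodes fire only when ready.
add1.put(0, 1); add2.put(0, 3); add1.put(1, 2); add2.put(1, 4)
print(results)  # [21]  i.e. (1+2) * (3+4)
```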
[Jellinghaus90] R. Jellinghaus, "Eiffel Linda: An Object-Oriented Linda Dialect", ACM SIGPLAN Notices, Vol. 25, No. 12, p. 70, 1990
In this paper, the author outlines his specification for the combination of Linda and Eiffel. The unique contribution in this paper is to make tuples and templates first-class objects within the language. This lets one assign templates to variables, incorporate tuples within tuples, and perform other manipulations permitted by the base language. The system was never fully implemented, so the value of this paper is in its exploration of how to incorporate Linda into an object-oriented language.
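A minimal tuple-space sketch makes the Linda operations under discussion concrete. This toy Python version (illustrative names, with None as a wildcard field) matches on values only, whereas real Linda templates also match on types.

```python
import threading

class TupleSpace:
    """Toy Linda-style tuple space with out / rd / in operations."""
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):                # add a tuple to the space
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _match(self, template, tup):
        return len(template) == len(tup) and all(
            t is None or t == v for t, v in zip(template, tup))

    def rd(self, template):            # read a matching tuple, leave it in place
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._match(template, tup):
                        return tup
                self._cond.wait()

    def in_(self, template):           # read and remove ("in" is a Python keyword)
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._match(template, tup):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()

ts = TupleSpace()
ts.out(("count", 1))
ts.out(("point", 2.0, 3.0))
print(ts.rd(("point", None, None)))   # ('point', 2.0, 3.0)
tag, n = ts.in_(("count", None))      # atomic read-modify-write via in/out
ts.out((tag, n + 1))
print(ts.rd(("count", None)))         # ('count', 2)
```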
[Johnson88] R. Johnson and B. Foote. Designing reusable classes. Journal of Object-Oriented Programming, 1:22-35, June/July 1988.
[Johnson95] K. L. Johnson, M. F. Kaashoek, and D. A. Wallach, "CRL: High-Performance All-Software Distributed Shared Memory", Proceedings of the 15th Symposium on Operating Systems Principles, December 1995.
CRL (C Region Library) is an all-software distributed shared memory system. A region of shared memory is created and accessed through explicit read and write operations on contiguous regions of memory. A processor first maps the region into its local memory with a call to rgn_map(). The process then signifies whether it wants read-only access (rgn_start_read/rgn_end_read) or read/write access (rgn_start_write/rgn_end_write). There is also an option to explicitly flush a read/write region.
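The access discipline can be mimicked in a toy Python class. The call names (rgn_start_read, rgn_end_read, rgn_start_write, rgn_end_write) follow the paper, but the distributed coherence protocol is replaced here by a single-node readers-writer lock, so this is a sketch of the API shape only, not of CRL itself.

```python
import threading

class Region:
    """Toy single-node stand-in for a CRL region."""
    def __init__(self, size):
        self.data = bytearray(size)
        self._lock = threading.Lock()           # held by writers (or first reader)
        self._readers = 0
        self._readers_lock = threading.Lock()

    def rgn_start_read(self):
        with self._readers_lock:
            self._readers += 1
            if self._readers == 1:
                self._lock.acquire()            # first reader blocks writers

    def rgn_end_read(self):
        with self._readers_lock:
            self._readers -= 1
            if self._readers == 0:
                self._lock.release()

    def rgn_start_write(self):
        self._lock.acquire()

    def rgn_end_write(self):
        self._lock.release()

# Usage: bracket every access in explicit start/end calls, as CRL requires.
rgn = Region(8)
rgn.rgn_start_write()
rgn.data[0:4] = b"abcd"
rgn.rgn_end_write()

rgn.rgn_start_read()
snapshot = bytes(rgn.data[0:4])
rgn.rgn_end_read()
print(snapshot)  # b'abcd'
```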
[Joosen96] W. Joosen, S. Bijnens, J. Van Oeyen, B. Robben, F. Matthijs, and P. Verbaeten, "Concurrent Programs in CORRELATE: Scientific Simulations with varying Granularity", in [POOMA96], 1996
This brief paper describes the CORRELATE language and its use for coarse grained and fine-grained molecular dynamics algorithms.
[Jordan87] Harry Jordan, "The Force," ECE Technical Report Number 87-1-1, Department of Electrical and Computer Engineering, University of Colorado, January 1987.
A programming language specialized to SPMD programs running on MIMD machines. Processes are statically allocated when a Force program begins and are not destroyed until the end of the program's execution. Unless restricted by an explicit parallel primitive, all statements in a Force program are executed by all processes in parallel. Parallelism is introduced through a variety of doall commands and communication takes place through shared variables. Asynchronous communication and barrier synchronization are supported within the language. For additional information about the FORCE see [Kumar91].
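The Force's SPMD style, as summarized above, can be sketched with Python threads: every "process" executes the whole program, a doall-like partition splits loop iterations, and a barrier separates phases. All names here are illustrative; this is not Force syntax.

```python
import threading

NPROC = 4
barrier = threading.Barrier(NPROC)
shared = [0] * 16          # communication is through shared variables
total = [0]
total_lock = threading.Lock()

def force_program(me):
    # "doall": each process takes its own slice of the iteration space
    for i in range(me, len(shared), NPROC):
        shared[i] = i * i
    barrier.wait()          # barrier synchronization, as in the Force
    # After the barrier every process can safely read values written by
    # the others; here each process sums a neighbor's slice.
    local = sum(shared[(me + 1) % NPROC::NPROC])
    with total_lock:
        total[0] += local

threads = [threading.Thread(target=force_program, args=(p,)) for p in range(NPROC)]
for t in threads: t.start()
for t in threads: t.join()
print(total[0])  # sum of i*i for i in 0..15 = 1240
```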
[Kahn90] K.M. Kahn and V.A. Saraswat, "Actors as a Special Case of Concurrent Constraint (Logic) Programming", OOPSLA/ECOOP 90 Conference on Object-Oriented Programming: Systems, Languages, and Applications, editor N. Meyrowitz, pp. 57-66, October, 1990. (NOTE: Appears as SIGPLAN Notices, Vol.25, No.10)
[Kale89] L.V. Kale, "The Design Philosophy of the Chare Kernel Parallel Programming System", Tech Report UIUCDCS-R-89-1555, Dept. of Computer Science, University of Illinois, 1989.
[Karp88] A.H. Karp and R.G. Babb II, "A Comparison of 12 Parallel Fortran Dialects," IEEE Software, Vol. 5, pp. 52-67, 1988. Much of this material appeared in IBM technical report number G320-3511, 1988.
This paper shows source code for the famous PI numerical integration program using 12 different programming environments: Alliant FX/8, BBN butterfly, Cray multi-tasking, Cray micro-tasking, ELXSI 6400, Encore Multimax, Flex/32, IBM 3090/VF, Sequent Balance, and iPSC with NX. For even more versions of the PI program, see [Mattson95a] and [Larrabee94].
[Keleher96] P. Keleher, "The Relative Importance of Concurrent Writers and Weak Consistency Models", in the 16th International Conference on Distributed Computing Systems, Hong Kong, 1996.
[Kuehn85] J. T. Kuehn and H. J. Siegel, "Extensions to the C programming language for SIMD/MIMD parallelism," 1985 International Conference on Parallel Processing, August 1985, pp. 232-235.
Parallel-C, as originally proposed, was a superset of the C programming language that included extensions to support both synchronous (SIMD) and asynchronous (MIMD) language constructs. In particular, Parallel-C was intended to provide an explicitly parallel high-level language for the PASM parallel processing system [Siegel86]. The original language specification includes syntax for synchronous constructs, and a discussion of suggested approaches for asynchronous constructs.
[Kumar91] S. P. Kumar and I.R. Philips, "An overview of portable parallel programming in Fortran", Boeing Computer Services Technical Report, AMS-TR-162, 1991.
This paper compares and contrasts SCHEDULE, The FORCE, PCF Fortran, Linda, and Strand.
[Kumar94] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin Cummings, 1994.
An excellent text on parallel algorithms. The book's great strength is that its authors invest little time in PRAM algorithms, preferring ones that can be implemented on machines that can actually be built. The chapters discussing architectures and programming languages are run-of-the-mill, but the bulk of the book describes and analyzes a wide variety of algorithms for mesh- and hypercube-based multicomputers (chosen as representative of sparsely connected and densely connected machines, respectively).
[Kutti85] S. Kutti, "Taxonomy of Parallel Processing and Definitions", Parallel Computing, vol. 2, pp.353-359, 1985.
[Larrabee94] A. R. Larrabee, "The p4 Parallel Programming System, the Linda Environment, and Some Experiences with Parallel Computation," Scientific Programming, vol. 2, pp. 23-36, 1994.
This paper uses source code from the pi program just as in [Mattson95a]. It also compares Linda and P4 using an LU decomposition.
[Lee94] J. W. Lee, "Concord: Re-Thinking the Division of Labor in a Distributed Shared Memory System", Proceedings of Scalable High Performance Computing and Communications, 1994.
[Lemke93] M. Lemke and D. Quinlan, "P++, a Parallel C++ Array Class Library for Architecture-Independent Development of Structured Grid Applications", Proceedings of the Workshop on Languages, Compilers and Run-Time Environments for Distributed Memory Multiprocessors, Vol. 28, No. 1, pp. 21-23, January 1993.
[Lin90] C. Lin and L. Snyder, "A comparison of programming models for shared memory multiprocessors." In Proceedings of the International Conference on Parallel Processing, Vol. II, pp. 163-180, 1990.
[Lin95] C. Lin and L. Snyder, "SIMPLE Performance Results in ZPL", in [Pingali95] pp. 361-375, 1995.
This paper describes a compiler-based system built on the CTA machine model and the phase abstraction programming model, and shows that this technology supports portable programming with high performance. Experiments are reported for the SIMPLE fluid dynamics benchmark suite on the KSR and Paragon systems.
[Liu93] P. Liu, W. Aiello, and S. Bhatt, "An Atomic Model for Message Passing", Proceedings of the 5th Annual ACM Symposium on Parallel Algorithms and Architectures, pp. 154-163, 1993.
[Lusk87] E.L. Lusk and R.A. Overbeek, "Use of Monitors in Fortran: A Tutorial on the Barrier, Self-scheduling DO-loop and AskFor monitors," Tech. Report No. ANL-84-51 Rev. 1, Argonne National Laboratory, 1987.
This is the basic reference for the parmacs macro package for portable shared memory programming.
[Mackie91] I. Mackie, "Lilac: A Functional Programming Language Based on Linear Logic", Department of Computing, Imperial College of Science and Technology, London, Sept. 1991.
[Maggs89] B. M. Maggs, "Beyond Parallel Random-Access Machines", in J.L.C. Sanz (editor) Opportunities and Constraints of Parallel Computing, Springer-Verlag, pp. 83-84, 1989.
This paper builds the case for the inherent weakness of PRAM models and proposes a DRAM (Distributed Random Access memory) model instead. For more details about the DRAM model, see [Leiserson88].
[Maggs95] B.M. Maggs, L.R. Matheson, R.E. Tarjan, "Models of Parallel Computation: A Survey and Synthesis", Proceedings of the 29th Hawaii International Conference on System Sciences, Vol. 2, pp. 61-70, 1995.
If you were going to read only one paper about models of parallel computation, this would be the paper to read. They discuss classes of models at a high level, and then launch into a review of the major computational models. The various PRAM models are nicely summarized, as are the less common distributed and hierarchical memory models. They close the review with the bridging models (models with the goal of unifying multiple parallel architectures under a single model), the most famous of which are BSP, LogP, and CTA. Finally, the paper discusses what we want in a useful model, and the authors propose, at a very high level, a two-parameter model based on the number of processors (P) and a parameter I representing the impact of the network between processors. This parameter, the communication interval, represents the maximum of the latency, bandwidth, and synchronization costs.
[Maggs96] B. M. Maggs, "A Critical Look at Three of Parallel Computing's Maxims", Proceedings of the 1996 International Symposium on Parallel Architectures, Algorithms and Networks (I-SPAN '96), pp. 1-7, June 1996.
In this short paper, Maggs discusses three maxims of parallel computing. He disputes that parallel architectures are converging towards designs based on commodity microprocessor chips, arguing from benchmark results that commodity-microprocessor MPPs are poorly balanced compared to systems with on-chip support for communication. He also disputes that wormhole routing is decidedly more efficient than store-and-forward routing. Finally, he breaks with the third maxim and actually defends the PRAM model. The crux of his argument is that a computational model doesn't have to be directly realizable to be useful; all that's required is that the model can be precisely mapped onto real systems.
[Massingill97] B. Massingill. The mesh archetype. Technical Report CS-TR-96-25, California Institute of Technology, 1997. ftp://csvax.caltech.edu/tr/cs-tr-96-25.ps.Z.
[Massingill98a] B.L. Massingill. A structured approach to parallel programming. Technical Report CS-TR-98-04, California Institute of Technology, 1998. ftp://csvax.caltech.edu/tr/cs-tr-98-04.ps.Z.
[Massingill98b] B.L. Massingill, T.G. Mattson, and B.A. Sanders. A Pattern Language for Parallel Application Programming. Technical Report CISE TR 98-019, University of Florida, 1998. ftp://ftp.cise.ufl.edu/cis/tech-reports/tr98/tr98-019.ps.
[Massingill99] B.L. Massingill, T.G. Mattson, and B.A. Sanders. A Pattern Language for Parallel Application Programming. Technical Report CISE TR 99-009, University of Florida, 1999. ftp://ftp.cise.ufl.edu/cis/tech-reports/tr99/tr99-009.ps.
[Mattson94d] T. G. Mattson, "The Efficiency of Linda for General Purpose Scientific Programming," Scientific Programming, Vol. 3, pp. 61-71, 1994.
This paper is an attempt to address the issue of efficiency in Linda. How Linda's implementation supports efficiency is described. I then go on to consider a complete classification of algorithms and show that efficient Linda programs have been implemented for each case. From this, I argue that it is correct to state that Linda is efficient for general purpose parallel computing. I close with a quick outline of when Linda isn't efficient.
[Mattson95a] T. G. Mattson, "Comparing Programming Environments for Portable Parallel Computing," International Journal of Supercomputing Applications, Vol. 12, p. 233, 1995.
Compares several programming environments. The comparisons include speed and ease of use. A PI program and a Ping-Pong program are given for each programming environment. This approach works well for giving a quick overview of different programming environments.
[Mattson95b] T.G. Mattson (editor), Parallel Computing in Computational Chemistry, ACS symposium Series 592, American Chemical Society, 1995.
[Mattson96] T.G. Mattson, "Scientific Computation", chapter 34 in [Zomaya96], pp. 981-1002, 1996.
This is a rambling, introductory paper that addresses the issues raised by parallel computing for scientific applications. I think this is the only publication where I discuss the coordination model. I've talked about it at numerous conferences, but I've only published it here.
[McColl88] W.F. McColl, "Parallel Algorithms and Architectures", Parallel Computing 88, edited by G.A. van Zee and J.G.G. van de Vorst, pp. 1-22, 1988.
[McColl93a] W.F. McColl, Bulk Synchronous Parallel Computing, Second Workshop on Abstract Models for Parallel Computation, Oxford University Press, 1993
[Mclaughlin97] D. Mclaughlin, S. Sardesai, and P. Dasgupta, "Calypso NT: Reliable, Efficient Parallel Processing on Windows NT Networks", Arizona State University Technical Report, http://www.eas.asu.edu/~calypso, 1997.
This paper provides a concise overview of the Calypso programming environment. Calypso is a shared memory variation of the BSP programming model so it has well understood formal properties. It is also very simple with an API that uses only a few routines. The system uses a central memory server which sacrifices some performance but provides a certain degree of fault tolerance.
[Misra94] J. Misra, "A Logic for Concurrent Programming", University of Texas at Austin, Technical report, April, 1994.
This paper explores the concepts of safety and progress in programs (both sequential and concurrent). Safety means "the program won't cause something bad to happen," while progress means "the program does some good." These properties can be dealt with formally using a "co" (constraint) operator for safety and a "transient" operator for progress. While it is natural to add these operators to UNITY, they hold value for any formal specification system.
[Nikhil] R.S. Nikhil, "Cid: A Parallel Shared-Memory C for Distributed-Memory machines", in [Pingali95], 1995.
[Nieplocha94] J. Nieplocha, R. J. Harrison, and R. J. Littlefield, "Global Arrays: A Portable Shared-Memory Programming Model for Distributed Memory Computers," Proceedings of Supercomputing '94, IEEE Computer Society Press, 1994.
This is the major reference for the GA package. This package supports a get/put mechanism for two dimensional arrays stored into shared memory. It is used extensively in the computational chemistry community -- especially for complex algorithms with dynamically changing and unpredictable data distributions.
[Olsson92] Ronald A. Olsson, Gregory R. Andrews, Michael H. Coffin, and Gregg M. Townsend, "SR: A Language for Parallel and Distributed Programming", Technical Report TR92-09, University of Arizona, Department of Computer Science, 1992.
[Pancake90] C. M. Pancake and D. Bergmark, "Do Parallel Languages Respond to the Needs of Scientific Programmers?", Computer, pp. 13- 23, December, 1990.
This paper is a bit out of date, but it's still useful for getting a high-level and systematic view of how scientific programmers construct parallel software. The activity is broken down into 4 domain/solution pairs: Model/Conceptual, Algorithm/Algorithmic, Program/Implementation, and Process/Physical. The paper then goes on to summarize what parallel languages need to do to support scientific programming. She breaks them down into concurrent languages, language extensions, and runtime libraries, which are briefly reviewed. Problems are in part attributed to a fundamentally different viewpoint between scientific programmers and computer scientists. The paper closes with some now out-of-date comments about new directions in parallel programming languages.
[Pelagatti97] Susanna Pelagatti, "Combining Task and Data Parallelism within Skeleton-based Models", presentation at the workshop, Theory and Practice of Higher-Order Parallel Programming, Schloß Dagstuhl, Germany, February 17th - 21st, 1997.
This talk approaches the problem of providing both task and data-parallel abstractions within a system which is easy to use, portable and reaches high levels of performance on different architectures. To achieve all this, an automatic and efficient solution of mapping, scheduling, data distribution and grain size determination must be found. These problems are proved intractable for models allowing parallel computations with arbitrary structure to be expressed. Moreover, parallel applications appear to use a restricted set of recurring structures, possibly nested, for both data and task parallel parts. The talk analyzes the most common paradigms of task and data parallelism (task queue, pipeline, independent and composed data parallelism) and shows how they can be provided within a skeleton based language (P3L). The P3L support is organized according to a template based methodology and embeds in a transparent way good implementation strategies for each skeleton.
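Skeleton-style composition of the kind the talk discusses (a pipeline plus a task-queue "farm") can be illustrated in a few lines of Python; this is plain Python with invented helper names, not P3L.

```python
from multiprocessing.dummy import Pool  # thread-backed pool for the farm stage

def pipeline(*stages):
    """Compose stages f1 | f2 | ... into one function (a pipeline skeleton)."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run

def farm(worker, nworkers=4):
    """Apply worker to each item of a list in parallel (a task-farm skeleton)."""
    def run(items):
        with Pool(nworkers) as pool:
            return pool.map(worker, items)
    return run

# Build a program by nesting skeletons: split, farm out per-item work, reduce.
program = pipeline(
    lambda text: text.split(),   # stage 1: split into items
    farm(lambda w: len(w)),      # stage 2: parallel per-item work
    sum,                         # stage 3: reduce
)
print(program("structured parallel programming with skeletons"))  # 42
```

The point of the skeleton approach is visible even in this toy: the parallel structure (pipeline, farm) is fixed and reusable, while the sequential pieces plugged into it are ordinary code.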
[Perrot96] R.H. Perrott, "Parallel Languages", in [Zomaya96] pp. 843-864, 1996.
[Phillip85] M.J. Phillip, "Unification of Synchronous and Asynchronous Models for Parallel Programming Languages", Master's thesis, Purdue University, http://dynamo.ecn.purdue.edu/~hankd/CARP/XPC/paper.html#ref_MeV85, 1989.
This thesis describes the XPC language. While the research described in this thesis is interesting, my major use for this reference was its review of SIMD and MIMD languages given in the introductory chapters. He reviews the SIMD languages Glypnir, C*, Parallel-C, Actus, Lucas and the MIMD languages Occam, Concurrent C, Linda, and the Force.
[Quinn90] M.J. Quinn and P.J. Hatcher, "Data-Parallel Programming on Multicomputers", IEEE Software, 69-76, September, 1990.
[Rose87] John R. Rose and Guy L. Steele Jr., "C* : An Extended C Language for Data Parallel Programming," Second International Conference on Supercomputing, Vol. II, May 1987, pp. 2-16.
This is the language TMC came up with for the Connection Machine [Hillis85]. It is closely attuned to the model of a SIMD machine serviced by a sequential front-end host machine.
[Sabot89] G. Sabot, The Paralation Model: Architecture-Independent Parallel Programming, MIT Press, 1989.
Paralation is a programming model developed by Gary Sabot of TMC. It is a coordination language that adds a data structure and a few operations to manipulate that data structure. The central data structure is the field, which is an array of objects. Fields can be collected together into a group of fields called a paralation. An element-wise evaluation operator then operates independently at every site of a paralation. Data is moved between paralations with a move operator. The communication patterns are specified with a mapping, which is built by supplying source and destination fields to a match operator. Move uses a combining approach (much like PAMS) to deal with collisions between data values.
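The three pieces of the paralation model (element-wise evaluation, match, and a combining move) can be sketched in Python; the function names are invented for illustration.

```python
def elwise(fn, *fields):
    """Apply fn independently at every site of the paralation."""
    return [fn(*vals) for vals in zip(*fields)]

def match(source_keys, dest_keys):
    """Build a mapping: for each destination site, the matching source sites."""
    return [[i for i, k in enumerate(source_keys) if k == d]
            for d in dest_keys]

def move(field, mapping, combine, default=0):
    """Move data along the mapping; combine resolves collisions."""
    out = []
    for srcs in mapping:
        if not srcs:
            out.append(default)
        else:
            acc = field[srcs[0]]
            for i in srcs[1:]:
                acc = combine(acc, field[i])
            out.append(acc)
    return out

# A paralation of 4 sites holding two fields:
owner = ["a", "b", "a", "c"]
value = [10, 20, 30, 40]
doubled = elwise(lambda v: 2 * v, value)            # [20, 40, 60, 80]

# Sum values per owner into a smaller paralation of 3 sites:
dest = ["a", "b", "c"]
m = match(owner, dest)
print(move(value, m, combine=lambda x, y: x + y))   # [40, 20, 40]
```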
[Saraswat90] V.A. Saraswat and M. Rinard, "Concurrent Constraint Programming", Proceedings of the 17th Symposium on Principles of Programming Languages, pp. 232-245, 1990.
[Scales94] D. Scales and M. Lam, "An Efficient Shared Memory Layer for Distributed Memory", Stanford Computer Science Technical Report number CSL-TR-94,627, 1994.
This paper describes a system called SAM that simplifies the task of programming machines with distributed address spaces by providing a shared name space and dynamic caching of remotely accessed data. SAM makes it possible to utilize the computational power available in networks of workstations and distributed memory machines, while getting the ease of programming associated with a single address space model. The global name space and caching are especially important for complex scientific applications with irregular communication and parallelism. SAM is based on the principle of tying synchronization with data accesses. Precedence constraints are expressed by accesses to single-assignment values, and mutual exclusion constraints are represented by access to data items called accumulators. Programmers easily express the communication and synchronization between processes using these operations; they can also use alternate paradigms that are built with the SAM primitives. Operations for prefetching data and explicitly sending data to another processor integrate cleanly with SAM's shared memory model and allow the user to obtain the efficiency of message-passing when necessary. We have built implementations of SAM for the CM-5, the Intel iPSC/860, the Intel Paragon, the IBM SP1, and heterogeneous networks of Sun, SGI, and DEC workstations (using PVM). In this report, we describe the basic functionality provided by SAM, discuss our experience in using it to program a variety of scientific applications and distributed data structures, and provide performance results for these complex applications on a range of machines. Our experience indicates that SAM significantly simplifies the programming of these parallel systems, supports the necessary functionality for developing efficient implementations of sophisticated applications, and provides portability across a range of distributed memory environments.
[Schmidt93] D. C. Schmidt. The ADAPTIVE Communication Environment: An object-oriented network programming toolkit for developing communication software. http://www.cs.wustl.edu/~schmidt/ACE-papers.html, 1993.
ACE is the first attempt to apply design patterns to writing concurrent programs.
[Shapiro87] E. Shapiro, Concurrent Prolog: Collected Papers, MIT Press, 1987.
[Siegel86] Howard Jay Siegel, Thomas Schwederski, James T. Kuehn, and Nathaniel J. Davis IV, "An Overview of the PASM Parallel Processing System," Tutorial: Computer Architecture, IEEE Computer Society Press, Washington DC, 1986, pp. 387-407.
PASM is a partitionable SIMD/MIMD computer consisting of a number of control processors and processing elements. The system may be dynamically reconfigured to allow either synchronous (SIMD) or asynchronous (MIMD) execution of parallel programs.
[Singh92] J.P. Singh, W. Weber, A. Gupta, "SPLASH: Stanford Parallel Applications for Shared-Memory", Stanford Computer Science Technical Report CSL-TR-92-526, 1992.
We present the Stanford Parallel Applications for Shared-Memory (SPLASH), a set of parallel applications for use in the design and evaluation of shared-memory multiprocessing systems. Our goal is to provide a suite of realistic applications that will serve as a well-documented and consistent basis for evaluation studies. We describe the applications currently in the suite in detail, discuss and compare some of their important characteristics such as data locality, granularity, synchronization, etc. and explore their behavior by running them on a real multiprocessor as well as on a simulator of an idealized parallel architecture. We expect the current set of applications to act as a nucleus for a suite that will grow with time.
[Skedzielewski91] S.K. Skedzielewski, "Sisal, Parallel Functional Languages and Compilers", editor B.K. Szymanski, ACM Press Frontier Series, pp. 105-158, 1991.
[Skillicorn88] D.B. Skillicorn, "A Taxonomy for Computer Architectures", Computer, Vol. 21, No. 11, pp. 46-57, 1988.
This is the paper where Skillicorn outlines his exhaustive taxonomy. It is too unwieldy to use pedagogically, but if a complex completeness argument is required, it could be very useful to have this taxonomy around.
[Skillicorn91] D.B. Skillicorn, "Practical Concurrent Programming for Parallel Machines", The Computer Journal, vol. 34, No. 4, pp. 302-310, 1991.
[Skillicorn94] D.B. Skillicorn and D. Talia editors, Programming languages for Parallel Processing, IEEE Computer Society Press, 1994
[Skillicorn94b] D.B. Skillicorn, Foundations of Parallel Programming, Cambridge University Press, 1994.
[Snir96] M. Snir, S.W. Otto, S. Huss-Lederman, D.W. Walker, J. Dongarra, MPI: The Complete Reference, MIT Press, 1996.
[Snyder86] L. Snyder, "Type Architectures, shared memory and the Corollary of Modest Potential", Annual Review of Computer Science, Annual Review Inc., pp. 289-318, 1986.
Snyder borrows the idea of a type species from biological taxonomies and introduces the idea of a type architecture: an architecture that best epitomizes the defining characteristics of a whole class of machines. It is not a rigid specification to which all machines must somehow conform; rather, it is an example that establishes the common ground between systems and relegates differences to the details. In this paper, he rejects the PRAM model and proposes the following Candidate Type Architecture (CTA): a finite set of sequential computers connected by a fixed bounded-degree graph, with a global controller. He also includes an interesting and pessimistic analysis of the potential benefits of massively parallel processing. Paraphrasing Snyder: the frequent challenge with physical phenomena models is not to solve a fixed-size problem faster, but to solve larger instances within a fixed time budget. The simple rationale is that solutions are so computationally intensive that problem instances of an "interesting" size do not complete execution in a tolerable amount of time. What is the effect of parallelism on problem size? Keeping time fixed and UNREALISTICALLY assuming the best possible speedup, it follows that superlinear problems (the interesting, realistic ones) can only be improved sublinearly by parallel computation. Specifically, if a problem takes t = c*n^x sequential steps, and if the application of p processors permits an increase in the problem size by a factor of m, so that t = c*(m*n)^x / p, then the problem size can increase by at most a factor of m = p^(1/x). For example, to increase by two orders of magnitude the size of a problem whose sequential performance is given by t = c*n^4 (three spatial dimensions plus time) requires, WITH UNREALISTIC OPTIMISM, 100,000,000 processors!
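Snyder's bound m = p^(1/x) can be checked numerically; this uses nothing beyond the formula quoted above.

```python
# Holding time fixed with t = c * n**x and (unrealistically) perfect speedup,
# c*(m*n)**x / p = c*n**x implies m**x = p, i.e. m = p**(1/x).

def problem_size_growth(p, x):
    """Factor by which the problem size can grow given p processors."""
    return p ** (1.0 / x)

# With x = 4, growing the problem by m = 100 needs p = 100**4 = 10**8 processors:
print(round(problem_size_growth(100_000_000, 4)))  # 100
```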
[Sunderam90] V. Sunderam, "PVM: a Framework for Parallel Distributed Computing," Concurrency: Practice and Experience, vol. 2, pp. 315-339, 1990.
[Szymanski91] B.K. Szymanski (ed.), "Parallel Functional Languages and Compilers", ACM Press, 1991.
This book includes review chapters by the key authors of the major parallel functional languages. The last chapter is by all the contributors and compares and contrasts the parallel functional languages described in the book. The languages described here include Lucid, EPL, SISAL, ID, Haskell, Crystal, and a Fortran compiler system called PTRAN.
[Tam] M. Tam, J. M. Smith, D. Farber, "A Taxonomy-based Comparison of Several Distributed Shared Memory Systems," Operating Systems Review, 23 (3), pp. 40-67, 1990.
[To97] Hing Wing To, "Structured Parallel Programming: Parallel Abstract Data Types", presentation at the workshop, Theory and Practice of Higher-Order Parallel Programming, Schloß Dagstuhl, Germany, February 17th - 21st, 1997.
The work we present is motivated by the observation that irregular computation underlies many important applications. In addressing it, we propose solutions to the problem of introducing new skeletons into an existing skeleton language. At Imperial College we have developed a language for structured parallel programming (SPP(X)) that uses functional skeletons to compose and co-ordinate concurrent activities written in a standard imperative language. The kernel language, which is currently under construction, is based around the distributed array data type. The language is thus highly suitable for expressing regular computations, but less suitable for expressing programs over irregular data structures.
We propose the introduction of Parallel Abstract Data Types (PADTs) to enable such programs to be expressed naturally. A PADT consists of a data type definition of the irregular data structure and a set of parallel operators over this type. The introduction of PADTs raises the question of how they are to be implemented. A direct approach would be to hand-code the operators in some imperative language with message passing, making the PADT part of the kernel language. This is undesirable, as it could lead to an ever-growing kernel language as new PADTs are found to be necessary; each addition would require extending and reformulating any system for optimizing combinations of skeletons and any system for compiling programs.
We propose an alternative approach to implementing PADTs. The philosophy of our approach is to build PADTs on top of the existing language constructs rather than to extend the kernel language. The encoding of a PADT into the kernel language can be arrived at through a series of data type transformations, thus ensuring correctness and providing the opportunity for reasoning about optimizations.
We demonstrate the approach by outlining a derivation from an example program over an irregular triangular mesh to its final regular array encoding. We conclude by suggesting that the concept can be applied to High Performance Fortran (HPF). Currently, HPF-1 cannot directly express irregular applications, and there are proposals to extend the language with constructs for doing so. It may be possible to apply our techniques to HPF-1, thus enabling irregular applications to be expressed without kernel language extensions. Open questions remain, including whether HPF-1 or SPP(X) is sufficiently rich to build all "useful" PADTs.
[Torre92] P. de la Torre and C.P. Kruskal, "Towards a Single Model of Efficient Computation in Real Parallel Machines", Future Generation Computer Systems, Vol.8, No.4, p. 395, September 1992.
[Traub92] Kenneth R. Traub and David E. Culler and Klaus E. Schauser, "Global Analysis for Partitioning Non-Strict Programs into Sequential Threads", Lisp and Functional Programming, pp. 324-334, ftp.cs.berkeley.edu:/ucb/TAM/lfp92.ps 1992.
Discusses how threads can be extracted from the Id base language (a non-strict but eager language) to ensure the largest possible thread size. Note: a thread is defined as an ordering that (1) is valid for all possible invocations and (2) can complete its execution without any pauses. Data flow graphs with certain and indirect dependence labels, annotated with sends and receives, are used. Two partitioning methods, demand analysis and dependence analysis, are defined, along with subpartitioning (which removes blocks linked by indirect dependencies from a thread). Using these building blocks, a global algorithm is presented, and the special case of conditionals is discussed. Compares favorably to strictness analysis for LAZY languages (for which dependence analysis must be removed) as it yields larger threads, but recursion is not as well handled under this scheme.
[Turcotte93] L. Turcotte, "A Survey of Software Environments for Exploiting Networked Computing Resources," Tech Report # MSM-EIRS-ERC-93-2, Mississippi State University, 1993.
This reference provides a relatively complete list of programming environments for cluster computing over networks of workstations. Brief descriptions are provided for many of the environments, but given the large number of systems covered, very little detail is included.
[Turner85] D. A. Turner, "Miranda: a non-strict functional language with polymorphic types", In Functional languages and Computer Architecture, Lecture Notes in Computer Science, Springer-Verlag, 1985.
[Upfal84] E. Upfal, "Efficient Schemes for Parallel Communication", Journal of the Association for Computing Machinery, Vol. 31, pp. 507-517, 1984.
[Valiant88] L.G. Valiant, "Optimally Universal Parallel Computers", Phil. Trans. Royal Society Lond. Series A, vol. 326, pp. 373-376, 1988.
It is shown that any program written for the idealized shared-memory model of parallel computation can be simulated on a hypercube architecture with only constant factor inefficiency, provided that the original program has a certain amount of parallel slackness.
[Valiant90] L.G. Valiant, "General Purpose Parallel Architectures", Handbook of Theoretical Computer Science, Vol. A, editor J. van Leeuwen, Elsevier Science Publishers and MIT Press, 1990.
The possibilities for efficient general purpose parallel computers are examined. First some network models are reviewed. It is then shown how these networks can efficiently implement some basic message routing functions. Finally, various high-level models of parallel computation are described and it is shown that the above routing functions are sufficient to implement them efficiently.
[Valiant90b] L.G. Valiant, "A Bridging Model for Parallel Computation", Communications of the ACM, vol. 33, no. 8, pp. 103-111, 1990.
The success of the von Neumann model of sequential computation is attributable to the fact that it is an efficient bridge between software and hardware: high-level languages can be efficiently compiled onto this model, yet it can be efficiently implemented in hardware. The author argues that an analogous bridge between software and hardware is required for parallel computation if it is to become as widely used. This article introduces the bulk-synchronous parallel (BSP) model as a candidate for this role, and gives results quantifying its efficiency both in implementing high-level language features and algorithms, and in being implemented in hardware.
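The BSP model charges each superstep for local computation, communication, and a barrier. A minimal sketch of that cost accounting (the parameter names g and l follow common BSP usage; the helper function itself is ours, not from the paper):

```python
# Sketch of the BSP superstep cost model: a superstep costs the maximum
# local work, plus g times the h-relation (max messages sent or received
# by any processor), plus the barrier synchronization latency l.

def superstep_cost(work, h, g, l):
    """Cost of one BSP superstep.
    work: list of local computation counts, one per processor
    h:    max messages sent or received by any single processor
    g:    network cost per message (gap)
    l:    barrier synchronization latency
    """
    return max(work) + g * h + l

# e.g. 4 processors with uneven work, an h-relation of 3, g = 2, l = 20:
print(superstep_cost([10, 12, 8, 11], 3, 2, 20))  # 12 + 6 + 20 = 38
```

A program's total cost is then the sum of its superstep costs, which makes load imbalance (the max in the formula) and communication volume directly visible to the programmer.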
[van de Geijn91] R. A. van de Geijn, "Efficient Global Combine Operations," Proceedings Sixth Distributed Memory Computing Conference, p. 291, IEEE Computer Society Press, 1991.
This paper describes hybrid algorithms for global combination operations on distributed memory machines. This is an excellent introduction to common techniques used with divide and conquer parallel algorithms.
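The divide-and-conquer flavor of such global combines can be illustrated with recursive doubling, in which p processors reach a combined result in log2(p) exchange rounds. This simulation is our own illustrative sketch, not the paper's hybrid algorithms:

```python
# Simulation of a recursive-doubling global combine (allreduce):
# in round k, each "processor" i exchanges its partial result with
# partner i XOR 2**k and combines the two values. After log2(p)
# rounds, every processor holds the global result.

def global_combine(values, op):
    """Combine values across p processors, p a power of two.
    Returns the per-processor results (all equal on completion)."""
    vals = list(values)
    p = len(vals)
    d = 1
    while d < p:
        # each processor i pairs with i XOR d and combines both values
        vals = [op(vals[i], vals[i ^ d]) for i in range(p)]
        d *= 2
    return vals

print(global_combine([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a + b))
# every processor ends with the global sum, 31
```

The log2(p) round count is what makes such combines attractive on distributed-memory machines, compared with the p - 1 steps of a naive sequential accumulation.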
[Van de Velde94] E.F. Van de Velde, Concurrent Scientific Computing, Springer-Verlag, 1994.
[Walker94] D.W. Walker, "The Design of a Standard Message Passing Interface for Distributed Memory Concurrent Computers," Parallel Computing, vol. 20, p. 657, 1994.
[Wilson93] G.V. Wilson, "A glossary of parallel computing terminology," Parallel and Distributed Technology, Vol. 1, p. 52, 1993.
An excellent and relatively complete glossary of parallel computing. I highly recommend this paper to newcomers to parallel computing.
[Yelon96] J. Yelon and L.V. Kale, "Agents: an Undistorted Representation of Problem Structure", University of Illinois CS Technical Report, 1996.
This technical report describes a programming model consisting of a network of agents that execute instructions based on the messages they receive. This is similar to object-oriented systems in which the objects are active. The driving force behind this research is to make the data flow graphs implied by a program match the programming methodology. This is a thought-provoking paper.
[Yonezawa86] A. Yonezawa and J. Briot and E. Shibayama, "Object-Oriented Concurrent Programming in ABCL/1", SIGPLAN Notices, vol. 21, no. 11, pp. 258-268, 1986.
[Zomaya96] A. Y. H. Zomaya (ed.), Parallel and Distributed Computing Handbook, McGraw-Hill, 1996.
This huge book (around 1200 pages) covers a wide array of both formal and pragmatic topics in parallel computing. If you can have only one book in your parallel computing library, this would be the book to have.