
ELSEVIER Future Generation Computer Systems 10 (1994) 241-248


High Performance Computing Systems: Present and future

R. Bisiani, Università di Venezia, Scienze dell'Informazione, Via Torino 153, 30173 Mestre, Venezia, Italy

and Carnegie Mellon University, Pittsburgh, USA

Abstract

This paper explores the factors that will influence the process of establishing High Performance Computing machines as widespread, general tools. The paper deals with systems rather than architectures because it is impossible, at least at this stage of knowledge, to ignore the computational model and the software used by an architecture. The paper focuses on general ideas rather than specific systems.

Key words: Computer architecture; parallel processing

1. Introduction

Establishing High Performance Computing machines as widespread, general tools is important because HPC machines offer the possibility of implementing new and extremely useful applications. At this stage of knowledge it is necessary to deal with systems rather than machines or architectures because it is impossible to ignore the computational model and the software used by an architecture. It is also necessary to recognize that high performance systems must be parallel systems, since no sequential computer implemented with electronic technology can achieve the performance required by new and complex applications. Thus, this paper focuses on parallel systems.

Since pinpointing the performance of parallel systems is a moving target and there is no clear direction or winner in the race for performance, this paper will not mention any specific achievement but will rather focus on general concepts, ideas and opinions. The reader is referred to the state-of-the-art review in the HPC Report for specific, although dated, information.

Elsevier Science B.V. SSDI 0167-739X(94)00014-6

2. The search for computational models

There is widespread consensus on the lack of established parallel processing models. Let's take a quick look at the most popular models so that we can then put their characteristics and the architectures that execute them in perspective. We will briefly examine shared-memory, message-passing, P-RAM and collection-oriented models.

2.1. Message-passing and shared-memory models

One of the most intuitive ways of organizing a parallel computation is simply to connect a bunch of processors by letting every processor access a single, shared memory. These models, which are


called shared-memory models, look like a simple extension of the von Neumann model but in reality add a major complication to the behavior of a program, because they introduce the possibility of two or more processors acting at the same time on the same data. In other words, neither the organization of the work nor the coordination of the activities is strictly regulated by shared-memory models, and both therefore end up being the responsibility of the programmer.

Message-passing models are probably the most popular among parallel processing programmers. The basic idea is simply that the only way to perform communication and coordination between parallel processes is through messages that are explicitly sent and received by processes. By comparison, the processes in a shared-memory model are not always aware that a given memory access is also used to communicate. Therefore, the explicit synchronization that is necessary in shared-memory models disappears in message-passing models, but the need for synchronization remains and is taken over implicitly by the act of sending or receiving messages. Since the memory of each process is segregated from all the other memories, it can be easier in this model to ensure the correctness of an application. At the same time, the segregation between memories can force a very unnatural organization of problems, which must be decomposed into processes that share data. One could therefore argue that the conceptual complexity of the two models is rather similar, although each is better suited to different applications; in conclusion, one wonders whether the popularity of the message-passing model does not simply derive from the availability of architectures that directly support it.
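The contrast between the two models can be sketched with Python threads: a shared counter that needs an explicit lock, versus a worker whose only interaction with the outside is an explicit send and receive on a queue. All names here are illustrative choices of ours, not taken from any particular system.

```python
import threading
from queue import Queue

# Shared-memory style: two workers update one counter. The programmer
# must add explicit synchronization (the lock) or updates may be lost.
counter = 0
lock = threading.Lock()

def shared_worker(n):
    global counter
    for _ in range(n):
        with lock:            # explicit coordination, as the text notes
            counter += 1

# Message-passing style: the only interaction is send/receive; the
# worker's local state is segregated from everyone else's.
def mp_worker(inbox, outbox):
    total = 0                  # private, never seen by other workers
    while True:
        msg = inbox.get()      # receiving doubles as synchronization
        if msg is None:        # sentinel: no more work
            break
        total += msg
    outbox.put(total)          # send the local result back explicitly

def run_demo():
    # Shared-memory demo: two threads, 1000 increments each.
    threads = [threading.Thread(target=shared_worker, args=(1000,))
               for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Message-passing demo: send three values, then the sentinel.
    inbox, outbox = Queue(), Queue()
    w = threading.Thread(target=mp_worker, args=(inbox, outbox))
    w.start()
    for v in (1, 2, 3):
        inbox.put(v)
    inbox.put(None)
    w.join()
    return counter, outbox.get()
```

Note how the lock is pure overhead from the algorithm's point of view, while in the queue version the synchronization is hidden inside the sends and receives, exactly the trade-off discussed above.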

2.2. P-RAM models

P-RAM models are the principal theoretical models, in the sense that they can be defined formally enough to allow the analysis of an algorithm's performance independently of the implementation. This would be an enormous advantage if there were real machines capable of (efficiently) executing a P-RAM algorithm. Nevertheless this property is so important that it is worth studying these models, in the hope of either making them closer to implementation reality or of bypassing the implementation problems that invalidate their assumptions.

P-RAM models assume that the parallel program is organized as a set of parallel processes that can all read and write a common memory. The processes behave exactly like the sequential machines of the von Neumann model. The memory is organized as a set of addressable cells. P-RAM models are process-oriented in the sense that the parallelism comes from having more than one process act concurrently on data, while the structure of the data is not part of the model. In other words, the model does not enforce the organization of data into any parallel data structure, and the parallel behavior is totally defined by the behavior of the processes. Usually these models are also synchronous, in the sense that at any instant each process executes the same step of the same program. Another way of looking at P-RAM models is to think of the parallelism as the result of the work of many processes that execute sequential operations.
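A small sketch can make the synchronous, process-oriented style concrete: the classic prefix-sum computation, in which n notional processes all execute the same step on a shared array, finishing in a logarithmic number of steps. The lockstep semantics is modelled by computing all of a step's reads before any of its writes take effect; the function name and structure are our illustration, not a formal P-RAM definition.

```python
def pram_prefix_sums(x):
    """Simulate a synchronous P-RAM computing prefix sums.

    Each of n notional processes runs the same program: at distance
    d, process i (for i >= d) adds the value d cells to its left.
    Building the new array before overwriting the old one models the
    'all reads happen before all writes' rule of a synchronous step.
    """
    a = list(x)
    n = len(a)
    d = 1
    while d < n:
        # One synchronous step: every process i executes the same code.
        a = [a[i] if i < d else a[i] + a[i - d] for i in range(n)]
        d *= 2
    return a
```

On an ideal P-RAM with constant-time memory access this takes O(log n) steps; the next paragraph explains why that assumption is exactly what real machines fail to deliver.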

These models are similar to human work organizations in which many people work at the same task but on different data; for example, the clerks of an office who all process the same form. What makes these models both formally tractable and unrealistic is the assumption that memory can be accessed in constant time, independently of the location accessed and of the behavior of all other processes. Therefore all P-RAM algorithms run inherently slower in practice than the theory might lead us to believe. If the memory access delay were arbitrary and dependent on the input data and the machine implementation, then the main advantage of P-RAMs, namely the possibility of evaluating an algorithm independently of the implementation, would completely vanish. This problem might be avoided by introducing a communication mechanism between processors and memory with a provable maximum delay. Although such a communication mechanism has been invented, there is no commercial machine that uses it yet: unfortunately the cost of developing a new machine has become


so high that the introduction of a new architectural idea requiring a new programming style is almost impossible.

2.3. Collection-oriented models

These models follow a radically different approach: the operators are full-fledged parallel operators rather than sequential operators executed concurrently. This opens the possibility of both implementing the parallel operators very efficiently and avoiding the inefficiencies caused by the interference between sequential operators when they are applied to the same data structure.

So far, the most successful use of a collection-oriented model can be found in the vector units of supercomputers. The parallel data structure is the vector and the parallel operators are arithmetic operators applied pairwise to the elements of two vectors. This simple application of the collection-oriented models allows both extremely good performance on some applications and a smooth transition between sequential and parallel code. The pitfall of vector machines is simply that not all computation can be implemented by pairwise operations on vectors, and Amdahl's law inexorably limits performance if even a small part of the computation does not fit the model and cannot be parallelized.

Collection-oriented models have a wider scope than vector computation, and parallel operators (e.g. parallel prefix, 'scan' operators) have been devised that allow most algorithms to be recast in this model. How successfully, though, still remains to be ascertained. The machines that are closest to the collection-oriented model are the SIMD machines, but a collection-oriented model does not necessarily imply a SIMD machine. In fact, recent work has shown that collection-oriented programs can be converted into other models. The power of these models is rooted in the possibility of describing an algorithm without having to manage the synchronization, because the model already provides an implicit and straightforward synchronization mechanism. In fact a collection-oriented program is very similar to a sequential program and therefore could be a better vehicle to introduce users to parallel programming.
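The collection-oriented style can be sketched in a few lines of Python: the operators act on whole collections, and a plus-scan (parallel prefix) replaces what would otherwise be an element-by-element accumulation loop. No synchronization appears anywhere in the program. The helper names are ours; the scan is given sequential reference semantics via the standard library, although a parallel machine could execute it in logarithmically many steps.

```python
from itertools import accumulate
import operator

# A pairwise vector operator, in the style of the vector units
# described above: one whole-collection operation, no loop indices.
def vadd(xs, ys):
    return [x + y for x, y in zip(xs, ys)]

# A plus-scan (parallel prefix): accumulate gives its reference
# semantics; the parallelism is the implementation's business,
# not the programmer's.
def plus_scan(xs):
    return list(accumulate(xs, operator.add))

# Example: a running total recast as one collection operation
# instead of a sequential accumulation loop.
def running_totals(deposits):
    return plus_scan(deposits)
```

Note how `running_totals` reads exactly like a sequential program, which is the point made above about collection-oriented programs as a gentler introduction to parallel programming.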

2.4. General characteristics of models

A model that can satisfy all the needs of parallel program developers must have a number of features that are often partially conflicting. Such a model:

• should be general enough to be useful for many problems and many architectures;
• should reflect the cost of executing a program;
• should limit the programming effort;
• should permit an efficient implementation.

The first two characteristics are quite contradictory, and at this point it seems that no model can satisfy them both. The need to balance the ease of programming and the efficiency of the resulting program is the kernel of the problem when selecting a model. On the one hand, a clear and short description of the problem must not contain any detail related to the machine that will execute the code; on the other hand, speed optimization partially depends on the details of machine behavior. For example, while modularity is a desirable property of any software system, it does not lend itself to the implementation of cooperative parallel programs but rather emphasizes the competitive side of synchronization. By the same token, programs written with collection-oriented methodologies are harder to modularize. The second two characteristics are also conflicting at this moment, but our experience with sequential systems lets us hope that it will not be so in the future. In particular, parallel software is now being written along the same guidelines that were used for sequential software twenty years ago: the lower the level of programming, the better the performance. On the other hand, current sequential hardware and compiler technology can achieve very high performance with high-level languages, and parallel compilers might follow the same path.

Even with sequential machines, the largest factor in the cost of any computer application is the cost of software development, and thus software reusability is very important. Reusability depends


on many characteristics of the model, of the language and of the development methodology, but it is certainly completely hindered if the computational model depends on the particular hardware that will be used. Current parallel systems cannot count on a common computational model, and not even algorithms are portable between machines.

Different applications have different needs. For example, airline reservation or banking systems can be easily programmed with a shared-memory model and with medium-granularity processes (granularity is related to the ratio of computation vs. communication: the larger the granularity, the more time is spent computing before having to communicate). On the other hand, weather forecasting is a large-granularity application and could work best with a message-passing model. Applications that have a simple, tight flow of control work best with cooperative models, while competitive models can deal best with applications that have a dynamically changing load.

Another important factor is the inertia of the user community, which shies away from any innovation that requires new investments unless it has a clear and immediate economic advantage. Because of this behavior, parallel processing has had its best successes in cases where the parallelism could be exploited without rewriting the programs, as is the case with vector machines. Therefore, the adoption of any model that radically departs from the von Neumann model will be possible only if the payoff is guaranteed.

Finally, some experts state that the only way to write parallel applications is to think of them as parallel applications from the beginning. Others believe the opposite and see parallel programming only as an evolution from existing sequential algorithms. The first hypothesis seems more plausible, although for many years 'radically' parallel and sequential applications will have to coexist. If and when we find a way to make the two programming methodologies converge, for example by means of automatic translators, we will have reached the ideal situation. If this does not happen, it is possible that the evolution from 'sequential' thinking to 'parallel' thinking will be forced by the existence of parallel hardware so cheap as to make sequential hardware uninteresting.

3. Real systems

As we stated at the beginning, this paper will not mention or compare specific systems, because the lack of good metrics, the continuous evolution of the commercial offerings and the lack of winning solutions make any explicit comparison obsolete in a few months. To aid in the comparison of different machine solutions we believe one can simply divide the architectures into three classes. This subdivision ignores some very important factors, like the number of processors (tens or thousands?) and the kind of control (centralized or distributed?), but it captures the basic differences:

• Vector supercomputers. These machines are usually sequential processors with a built-in functional unit that can work on vector data.

• Machines with a physically centralized memory. These machines use a single shared memory connected to the processors by a very fast interconnection network.

• Machines with a physically distributed memory. These machines have nodes that contain both processors and memory.

3.1. Supercomputers

Supercomputers are by all means specialized machines that happen to have many wealthy users, and it is their cost that mostly limits their application. An efficient use of supercomputers requires programs that can be mostly vectorized, and these machines are equipped with sophisticated software systems that reorganize the sequential code so that it can be most efficiently executed in the vector units. This class of machines is in some sense the only one that has broken the barrier between sequential and parallel applications. Unfortunately, their usability depends critically on the degree of vectorizability of the application, because the effectiveness of a vector machine depends on the percentage of vectorizable code: for example, if 'only' a fraction of 0.95


of the code is vectorizable, the speed improvement is limited to a factor of 20 regardless of the speed of the vectorizing unit. Because of this, it is not clear whether there is a real need for vector machines with extremely fast vector units (compared to the speed of the sequential part). The supercomputers that are currently produced also have an edge in scalar speed, because they are often built using the best technology available, but the latest architectural advances in workstations have made this advantage less evident.
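The factor-of-20 figure follows directly from Amdahl's law, and a small sketch makes the arithmetic explicit (the function names are ours, not standard terminology):

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of the work is vectorizable
    and the vector unit speeds that fraction up by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

def amdahl_limit(f):
    """The bound as s grows without limit: 1 / (1 - f).
    The serial fraction alone caps the achievable speedup."""
    return 1.0 / (1.0 - f)
```

With f = 0.95 the limit is 1/0.05 = 20: no matter how fast the vector unit becomes, the 5% of serial code dominates, which is exactly why an extremely fast vector unit paired with an ordinary scalar unit is of questionable value.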

Right now vector supercomputers are extremely expensive and have a questionable cost/performance ratio; thus their existence is threatened by systems that use both vector parallelism and multiple processors, while using processors that are less powerful but substantially less expensive. This is possible because the processors of the new breed of fast machines are highly integrated and use a less extreme technology.

3.2. Shared-memory machines

Machines that use a physically shared memory are severely limited by the technological difficulty of connecting thousands of processors to the same memory system while maintaining a reasonable latency. Technology improvements tend to make the implementation of these machines even harder, because faster processors require a faster memory.

In practice, the most popular use of this class of machines consists in the execution of many applications in parallel rather than in speeding up a single application. In other words, they can easily be used to improve the throughput of a set of jobs. The other kind of applications for which these machines are used successfully are transaction-oriented ones, like the reservation and banking tasks. It is likely that machines of this kind will not be able to successfully tackle many new demanding applications.

3.3. Message-passing machines

Finally we come to the large class of physically distributed-memory architectures. Although most of these machines are used with a message-passing model, there are notable exceptions of machines with a communication mechanism that is clever enough to efficiently support a shared-memory model.

Europe is very strong in the class of pure distributed-memory machines, mainly because of the availability of the INMOS processors, which make good building blocks for this class of machines. European companies have about a third of the world market. While vector supercomputers can achieve good performance simply with sequential Fortran code, and shared-memory machines can be justified by executing many applications in parallel, this class of machines simply must be programmed in parallel. (Message-passing machines are not very good as multitasking systems, which is the main application of small shared-memory machines.)

4. Measurements: The land of obfuscation

The benchmarking of parallel systems cannot rely on a full set of established and fair practices. The situation is much worse than in the sequential system arena, because often the performance of a parallel machine is simply stated in terms of the performance of one of its processors multiplied by the number of processors. If the measure used for a single processor is related to the performance of real applications, like SPECmark, then the parallel performance will be at best a very optimistic guess that assumes applications can be optimally parallelized and have linear speed-up.

If, instead, a peak measure like MIPS is used, then the result will be doubly unrealistic, because both the sequential processor measure and the assumption of perfect speed-up are far from reality. Unfortunately, this is the measure that one most often sees in a parallel machine brochure. For example, a strong source of contention is caused by the fact that different architectures have different ratios between peak values and realistic values. A case in point is the argument between vector and massively parallel manufacturers because, while massively parallel


machines have a very high peak performance, their real performance is a smaller fraction of the peak value. The only area in which one can make meaningful comparisons is with the MFLOP measures that are specified together with the program used to compute the value.
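The brochure arithmetic criticized above can be made explicit with a small sketch. The numbers below are assumptions chosen purely for illustration, not data from any real machine: the quoted figure multiplies a per-processor rating by the processor count, while even a mildly imperfect parallelization yields far less.

```python
def brochure_rating(per_proc_mips, n):
    """The optimistic figure: per-processor rating times the number
    of processors, i.e. perfect, linear speed-up is assumed."""
    return per_proc_mips * n

def sustained_estimate(per_proc_mips, n, parallel_fraction):
    """A less flattering estimate: only a fraction of the work
    parallelizes, so the effective speed-up is Amdahl-limited."""
    speedup = 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)
    return per_proc_mips * speedup
```

With a hypothetical 100-MIPS node, 1024 processors and 99% parallelizable code, the brochure figure is 102 400 MIPS while the Amdahl-limited estimate stays below 10 000, an order of magnitude apart, which is exactly the peak-versus-realistic gap the massively parallel manufacturers are criticized for.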

5. New technologies

The future of parallel processing could depend on new technologies and models that break the constraints imposed on the current models by electronic technology. The largest constraints are:

• the impossibility of implementing a very dense interconnection system, for example with the density of the human brain;

• the limitations of the basic functions, which are extremely simple and not very powerful (e.g. and, or);

• the speed limitations intrinsic to the need of 'moving' electrons;

• the physical characteristics, like the RC constant and the generation of heat, that make it hard to reduce the size.

We will mention three possible alternatives: the connectionist, the optical and the molecular models.

5.1. Connectionist models

The idea that all connectionist models have in common is that a system capable of exhibiting intelligent behavior of the same kind as the human brain could be built by connecting many simple computing elements. For example, the neuronal models are inspired by the neurons in the human brain. While the brain can process continuously variable information, like the voice of another person, the current neural models seem to be much better at identifying fixed patterns.

The most obvious difference of this class of models when compared to the models seen so far is the lack of algorithms and programs in the common sense of the terms. The 'program' executed by a connectionist machine depends as much on the behavior of the neurons as on the behavior of the interconnections between them and, moreover, one cannot extract the program by examining the neurons or the connections. The programming can only be done by example, teaching the system the correct behavior.

Therefore this model is radically different from the algorithmic models and is extremely promising. Unfortunately there are two major problems with the implementation of neuronal models:

• how exactly the model works is not clear, and

• we do not have the technology necessary to build hardware of sufficient size to tackle large tasks.

Each of the proposed connectionist systems has a different way of approximating the behavior of the neurons. These models are usually emulated on supercomputers, since the basic simulation step is a multiply-add of floating-point numbers. The other problem is the lack of appropriate hardware. Although a number of ad-hoc integrated circuits have been tried, it still remains true that we do not know how to build a neural machine of sufficient size to make these models competitive with general-purpose computers.
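The multiply-add step mentioned above can be sketched as a single simulated unit. The threshold activation and all the names are illustrative choices of ours, not a specification of any particular connectionist model:

```python
def neuron_step(weights, inputs, bias=0.0):
    """One simulated unit: a weighted multiply-add followed by a
    simple threshold. The inner loop is exactly the floating-point
    multiply-add that maps well onto supercomputer vector units."""
    total = bias
    for w, x in zip(weights, inputs):
        total += w * x          # the basic simulation step
    return 1.0 if total > 0.0 else 0.0
```

A full network is many such units evaluated over many interconnections, which is why the emulation workload looks, to a supercomputer, like ordinary dense floating-point arithmetic.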

5.2. Optical technology

Optical technology is based on representing and transmitting information as beams of coherent light instead of electrical current. Optical technology is already widespread in the transmission of information through optical fibers. In these applications the cost of the transducers between electric and optical signals at the ends of the fiber is irrelevant when compared to the total cost of the fiber. When instead the technology is used between cards or integrated circuits, the cost and size of the transducers cannot be discounted, and it is not convenient to use optics unless other parts of the machine also use optics.

The real potential of this technology is not in reproducing the behavior of electronic computers but in taking advantage of the capabilities of optics to compute. Therefore, the primitives used in the computational model suitable for an optical machine should be parallel, because light beams can easily travel in parallel without mutual


interference. The data that are most suitable to optical machines are two-dimensional arrays or vectors of analog quantities, because they can easily be represented by optics. Researchers in this area have come up with many ideas, but there is still no established set of devices that are in more widespread use than others.

5.3. Molecular computing

Molecular computing, sometimes called molecular electronics, has the goal of developing computing functions at the molecular level. The functions that could be built with molecular technology encompass both the classical computing functions now in use and other functions that are more typical of human computing mechanisms and that can perform a complex pattern matching in a single step. The latter are, I believe, the really interesting ones, because they could make molecular computers extremely powerful. Molecular computing is an extremely new technology and the feasibility of a full computer still has to be demonstrated.

6. Evolution or revolution?

If we agree that, at least in the next few years, the performance of hardware will keep increasing, the major obstacle to the development of high performance machines remains software development. While human activities are commonly tackled with parallel solutions, and it is natural for a person to invent parallel algorithms to organize one's work, computers are seen as strongly sequential machines and there is a solid educational and cultural barrier that makes it difficult to switch to parallel machines. This difficulty is hard to eradicate because even most computer scientists have this 'sequential bias'.

In order for software not to remain a major obstacle, either parallel algorithms must become a natural way of programming or the tools must become sophisticated enough to raise the level of programming so high that programs could become a mere description of the goals, which the machine is then free to implement in any way it likes. The latter can probably be safely forgotten for the next ten years, because our knowledge in this area is still too weak. Therefore, in order for parallel programming to become common practice, it is necessary for a very small number of computational models and related languages to become established. By established we mean that most manufacturers should make them available on their machines, a lot of applications should be implemented using that model/language, and the basic ideas should be taught in most computer science courses. For this to happen many things must be right, since there is a tremendous resistance to the introduction of any new software tool. For example, Fortran is still used in many key areas although language technology has dramatically improved since Fortran was invented. It is not impossible, though, for something valid to become commercially widespread. For example, the C language and the Unix operating system are good examples of technologies that were developed in a research environment and then adopted by the commercial world. The process that established these tools lasted about ten years and was strongly supported by the evolution of hardware. If VLSI and RISC technologies had not made hardware cheap and fast, the monopoly of large computer manufacturers would have continued and common tools would not have been accepted.

A clear indication that parallelism might initially be an evolutionary process is given by the recent introduction of a number of parallel systems made of a number of workstations connected by a network or a dedicated switch. The message given to the user is that the machine is not a parallel system but simply a set of workstations that happen to be connected. The software that is offered has the same problems as all parallel software: it needs a lot of work to write applications, but it is independent of the topology and medium of the interconnection; as a matter of fact the technical description almost ignores the characteristics of the switch. The goal is obviously to introduce parallelism without scaring away the users.

Up to now parallel machines have been introduced to the users with a completely different


strategy, since their strong point was performance; therefore the technical description of their interconnection technology was one of the most important things, and so was the flexibility of their software in making use of parallelism. It will probably be because of hardware advances (thanks to VLSI and RISC again!) that parallelism will acquire a large base of customers. One hundred years ago it was difficult to forecast some of the uses of technology that are today taken for granted; similarly, it would be a mistake to think of high performance computers simply as a vehicle for large computations for science or engineering applications. Large calculations are just one side of computing, and maybe the farthest from the way people think and interact. The great challenges will also be in the area of the man/machine interface, sensing and motion, and non-numeric computing in general. While the supercomputer market is oriented towards a limited number of users, high-performance non-numeric computing will become important for basically everybody.

Another important application area will be the management of the large stream of information

that the communication technology will put at our disposal but will not help us use. For example, through the small personal machines that are just coming on the market in a rudimentary version, we will be able to access hundreds of geographically distributed databases. We need high performance computers to be able to handle and use all this information to our advantage.

Therefore, after reaching the Teraflop mark, high performance computers will have to become useful for everyday applications, and in order for this to happen the direction of technological development will have to change.

Roberto Bisiani is a Professor of Computer Architecture at the University of Venice, Italy and an Adjunct Senior Research Computer Scientist at Carnegie Mellon University, Pittsburgh, USA. Professor Bisiani's interests are in task-oriented machines, codesign and parallel processing environments. He can be contacted at [email protected].