PICCS: A New Graduate Program Integrating Computer and Computational Sciences


List of Faculty Participants

Primary research interests of Computer Science (CS) faculty are listed as well, to place their areas of collaboration.

Executive Committee

Prof. Jeremiah Ostriker (PI and Director), Astrophysics, University Provost

Prof. Jaswinder Pal Singh (co-PI), Computer Science: Boundary of Parallel Applications and Systems

Prof. David Dobkin (co-PI), Computer Science (Dept. Chair): Visualization and Algorithms

Prof. Kai Li (co-PI), Computer Science: Parallel Systems and Immersive Visualization

Co-Principal Investigators

Prof. Hans-Peter Bunge, Geosciences

Prof. Douglas Clark, Computer Science: Application-Driven Performance Diagnosis and Parallel Systems

Prof. Adam Finkelstein, Computer Science: Visualization

Prof. Tom Funkhouser, Computer Science: Immersive Visualization

Prof. Isaac Held, Geosciences Department and NOAA Geophysical Fluid Dynamics Laboratory

Prof. David Spergel, Astrophysics (Director of Graduate Studies)

Prof. William Tang, Princeton Plasma Physics Laboratory and Princeton Dept. of Astrophysics

Prof. Robert Tarjan, Computer Science: Algorithms

Prof. Martin Weigert, Molecular Biology

Other Key Participants

Dr. Venkataramani Balaji, NOAA Geophysical Fluid Dynamics Laboratory

Dr. Renyue Cen, Astrophysics

Dr. Dannie Durand, Computer Science and Sloan Computational Biology Fellow

Prof. Fred Hughson, Molecular Biology

Prof. John Hopfield, Molecular Biology

Prof. Simon Levin, Evolutionary and Ecological Biology

Prof. Stanislas Leibler, Molecular Biology

Prof. Steven Pacala, Evolutionary and Ecological Biology

Prof. Yigong Shi, Molecular Biology

Prof. Kenneth Steiglitz, Computer Science: Applied Numerical Computation and Optimization


3C. Thematic Basis for the Group Effort

The rationale for the integrative PICCS program (see Project Summary) is as follows. Computational science has emerged as a key scientific discipline. After years of great progress, however, many individual computational sciences as well as computer science (CS) are at a crossroads, where neither can hide from dramatic changes in sophistication in the other. The next great advances on both sides will require each to understand the other more deeply, computational sciences to share knowledge and tools, and integrative research and training across traditional boundaries. The PICCS program is designed to integrate the entire computational science pipeline, from models and methods through parallel computing and sophisticated visualization (see figure), to accelerate these synergies, develop new disciplines across boundaries, and train a new breed of computational and computer science researcher for the future.

Left: The computational science pipeline. Right: Structure of PICCS program. Computer science (CS) techniques and training will contribute to application sciences, the latter will contribute to improved CS, and disciplines will share knowledge.

Science has traditionally been advanced by the two pillars of theory and experiment. It is increasingly clear that computer simulation can revolutionize many disciplines by providing a third pillar. Simulation provides an alternative or a guide to expensive or infeasible experiments, and inspires new theories and directions. The need, challenges, and opportunities are so pressing that government agencies are establishing massive efforts to advance scientific simulation.

Dramatic advances in computer technology can make the potential a reality. A few years ago, we could not hope to simulate problems of the sizes and accuracies needed to resolve fine-resolution climate models or understand galaxy formation in depth. Today, the technologies to enable the revolution are in sight. In fact, more and more important practical problems are within close reach of exponentially increasing hardware capabilities. Unlocking the mysteries of nature still requires deeper understanding, but achievable simulations will deliver much-needed insights into these too.

While hardware will continue its march, the challenge faced by computational science is to harness its capabilities. Unfortunately, this path is not evolutionary. Dramatic changes are occurring at each stage of the pipeline. The problems to be solved are not only larger in size but also model more complex and realistic phenomena, so sophisticated new algorithms are needed. Technological advances are no longer translated to performance by improvements in traditional vector supercomputers (which scientists had become leading experts in programming) but by radically different and more complex parallel architectures. These require new algorithmic and programming approaches and mind-sets. Finally, visualization is poised for a second revolution in its ability to provide insights: moving toward user-guided steering, massive, complex data sets, and immersive environments. Expertise in these areas lies in computer science, and the computational scientists of tomorrow must acquire expertise in the new methods to be effective. Many of the needs and challenges are similar across computational science disciplines, so learning from one another is very valuable too.

Fertilization in the other direction is equally important. New and dramatically better techniques and tools must be developed in all stages of the pipeline, from models and methods through parallel algorithms to parallel systems, programming environments and visualization. To date, most of this research is done in isolation in CS departments, using either toy problems or simplified abstractions of real applications. While much initial progress has been achieved with these, the inward focus of computer science is now losing steam: To make substantial advances, computer scientists must understand and drive their research with real applications, dealing with their full complexity and scale. The techniques and tools developed must be useful, usable, and scalable in unprecedented ways. This requires close collaboration with real users and a dramatic increase in realism in the problems addressed. Computational science has always been at the forefront of driving improvements in high-end computing, and will continue to be so in the future.

Finally, in addition to cross-understanding and close collaboration, we also need to develop and train a new breed of researcher who is truly interdisciplinary. Such people do research in the area between application and computer sciences, understanding both deeply enough to make contributions to both. They also serve as bridges between communities, bringing people together and interfacing their languages and research.

[Figure: PICCS scope (shown in gray). Pipeline stages: Problems, Algorithms, Parallel software, Parallel systems, Visualization, Analysis Tools. Disciplines: Comp. Sci., Geo/GFDL/EEB, Biology, Astro/PPPL, Other Science and Engg.]


This is increasingly critical in a world with ever more interdisciplinary scientific needs. In fact, it is the lack of such people that keeps disciplines in their shells, even at universities, and prevents valuable collaborations from happening. Prototypes of such researchers exist at various levels. In algorithms and computer systems, for example, cross-disciplinary research has already led to many unique new insights. These include algorithms that exploit domain insights on one hand and lend themselves to parallelism on the other, and scaling and other systems implications that came from a deep understanding of application properties. There is now a pressing need not only to build the bridges but also to make this a programmatic area of study.

Our goal is to set up a university-wide interdisciplinary graduate certificate program (the Princeton Integrated Computer and Computational Sciences program) to provide this integrated training across department boundaries in the entire pipeline, from applications and models through sequential and parallel algorithms and systems research, to new forms of visualization. The program will be very different from and much broader than traditional scientific computing programs that focus mainly on numerical methods for existing systems. In fact, on the algorithms side, it will leverage a program in mathematical methods for numerical analysis at Princeton itself (Section 3F), and complement it with aspects in which computer scientists have strengths: algorithms to apply the methods to irregular and dynamic domains; parallel algorithms; and increasingly important methods for science like search, mining, approximation and optimization.

The broader impact of the program will be the training of (a) application scientists in the intersection of their disciplines with CS and computational techniques, (b) a new generation of application-driven computer scientists, and (c) the new breed of bridge researchers described above. It will enable the next generation of researchers to collaborate far more effectively than previous generations in critical interdisciplinary research. And it will bring students in all disciplines in close touch with government and industrial laboratories, product groups, and the latest technologies in different disciplines. Many areas of science are already finding cross-disciplinary fertilization, and most importantly “bridge people”, to be a key missing ingredient for a wide range of critical advances. In fact, a major stated goal of the government initiatives is to make modern high-performance computing more broadly accessible to scientists and to enable them to effectively exploit available resources. We think this is critical, that the benefits to computer science are dramatic too, and that in fact the necessary integrated training should be provided earlier: at the graduate level.

Specific aspects of the training will include curriculum development for integrative education across department boundaries, joint Ph.D. advising of students by faculty in different departments, and the formation of cross-discipline integrative research groups, in addition to annual cross-cutting challenges (Section 3F). Mechanisms will be put in place to bring people from diverse disciplines together physically, which is one of the greatest practical obstacles to integrative efforts. Already, a recent cross-disciplinary Computational Research in Princeton seminar series has been very successful in bringing together diverse researchers from university departments, laboratories and industry. Its success, discussions, and the common problems it exposed across disciplines have been among the key catalysts in our efforts to start this program now. Interestingly, so have the cross-disciplinary interests and demand observed among undergraduate students and postdocs (Section 3F), which suggest that we should provide integrated training in graduate school.

The program will leverage many unique resources that Princeton offers. These include access to leading researchers in all steps of the pipeline, as well as faculty whose primary research area is increasingly at the boundary between applications and computer science, who have a history of fostering interdisciplinary collaborations, and who are very excited about both the research and training aspects. Not too many universities have a set of faculty whose existing interests can serve so well as the basis for this type of integrative program. Next to the university are major research laboratories, e.g. the Geophysical Fluid Dynamics Lab and Princeton Plasma Physics Lab, with whom collaborations are starting in the CS department and whose leaders are co-PIs on the proposal. Another catalyst for trying to build this program now is that the timing is so appropriate: We increasingly see scientists from both sides already reaching out to each other for precisely the research and expertise/training reasons discussed here. Finally, the departments and laboratories already share substantial resources and infrastructure that will serve students’ needs throughout the pipeline.

The disciplines currently operate in different departments and largely in isolation. While the individual faculty collaborations are very positive steps, and enable a few students to look across to the other side, the lack of a unifying programmatic focus generally prevents students from truly crossing discipline boundaries and obtaining integrated training (those that try often feel a lack of belonging to either side). Faculty enthusiasm and collaborations are a good foundation, but an established institutional program is essential to take this to the next, much-needed level. Such a program will have many benefits that will not be achieved without it. It will establish the focus, identity, convergence and institutionalization essential for training the interdisciplinary scientists of the next generation. It will enable annual cross-cutting thematic challenges on which its members will jointly focus their efforts, helping ensure the integrative focus.


It will create new curricula and the concerted pedagogical approach that is needed in addition to research training.

Finally, we hope that the PICCS program will serve as the nucleus of a broader-based Center for Integrative Research at Princeton. A Center will have many additional benefits. It will enable the hiring of new faculty whose research and education interests are in the PICCS area. It will propagate the research and training goals of the PICCS program at a much larger scale. It will serve as a powerful recruiting tool and will help involve more students and researchers in high-end computational science, a crucial area for scientific endeavor that has recently faced heavy competition for computer scientists' attention from exploding areas like the Internet and consumer multimedia. We hope that the PICCS program will spur the development of similar programs at other universities, and that the training these programs provide will dramatically hasten the simulation revolution and its contributions to science and technology.


3D. Multidisciplinary Research Theme and Major Research Efforts

The multidisciplinary research theme of this proposal is the integration of the computational and computer science pipeline: from models through algorithms and parallel computing to visualization. We propose a new integrated program at Princeton, called Princeton Integrated Computer and Computational Science (PICCS), as a means to combine research and education activities in five departments and two research labs: Computer Science (CS), Astrophysics, Geoscience, Evolutionary and Ecological Biology, Molecular Biology, DOE’s Princeton Plasma Physics Laboratory (PPPL) and NOAA’s Geophysical Fluid Dynamics Laboratory (GFDL). The program will also leverage and further develop our current research collaborations with national and industrial laboratories. Students from other departments will be eligible to join the program as well, and faculty from those departments will become affiliated too.

Our main research goals, for which the rationale was presented in the Thematic Basis in Section 3C, are to exploit the synergies across computational and computer science to enable advances in both sets of disciplines that would not easily be possible otherwise. In integrating the entire pipeline, the focus of our program is much broader than, though nicely complementary to, that of traditional scientific computing programs, which generally focus on applied numerical and mathematical methods for existing computers (in fact, we will leverage a program in mathematical methods for numerical analysis run by the Mathematics department). It is also broader than that of CS programs that focus on algorithms or on building high-performance systems. We want to train researchers in CS and the science disciplines to understand the other side well, to enable joint research in common problems across science areas, and create a new class of researchers who straddle the boundary between disciplines quite widely and make contributions to both sides.

Figure 2. PICCS research matrix. The shaded portion depicts the six thrusts described in this proposal.

The research thrusts of the program can be described as a matrix, with computational natural science and engineering areas on the vertical axis and cross-cutting computing challenges on the horizontal axis. Figure 2 shows the proposed scope of the PICCS program, highlighting the thrusts described in this section. We show parallel computing divided into two subareas (both of which are thrusts in this proposal), corresponding to the two major and quite different types of platforms. Faculty from the main faculty list in Section 3B (not including those in the Other Research section) with primary responsibility for a research thrust area are listed below or to the right of that area. Each vertical discipline must typically deal with all horizontal challenges, which are researched programmatically in Computer Science. CS faculty in a horizontal cross-cutting area will collaborate with some or all of the science disciplinary faculty in their areas (a particular box of the matrix will be addressed together by faculty in that row and column). Algorithms research for scientific problems will be performed by both CS researchers and researchers in the individual disciplines, and is covered in the disciplinary research thrusts. Since the different science disciplines share many common problems in the stages of the pipeline (algorithms, parallel computing on tightly-coupled systems, using clusters, and visualization methods), researchers in these areas will collaborate where relevant, share solutions and learn from one another.

A good way to illustrate the importance of the proposed integrative research and training is by example. We first describe the status of a horizontal cross-cutting area, namely parallel computing in general, to show the need for and benefits of integrative work in this area.


Then, we present a more concrete example from a vertical discipline, namely computational biology. Both of these areas are discussed in detail as research thrusts later in this section.

Cross-cutting Area: Parallel computing. Parallel computing has become the clear future of high-performance computing, replacing the traditional vector machines that scientists had learned to live with. The basic organizational structure of parallel machines is rapidly converging, so it is no longer an exotic discipline in which the investment of learning may not be justified. However, within the convergence there is a lot of variation among communication architectures. New architectures that support new programming models (e.g., a shared address space) are taking root at the high end. At the same time, commoditization is driving more loosely coupled “clusters” built out of inexpensive workstations or smaller parallel systems to become very important. Commoditization and integration together are making “systems of systems” an attractive but challenging new type of platform. Application scientists need to and want to use all of these platforms for their large-scale computations. For this, they must not only move to parallel computing but learn to use the different types of systems, which have very different characteristics. At the same time, applications themselves are changing. As the problems being solved become more complex (modeling natural phenomena more closely, for example), more sophisticated algorithms need to be developed, and these are often difficult to parallelize. Although there has been a lot of progress, bringing expertise in the new technologies to the science disciplines is currently quite slow, especially with the radically different architectures and new frame of mind needed compared to traditional vector supercomputing. Scientific computing users had become the leading experts in programming vector supercomputers, since they needed the computational power most; they must now do the same for parallel computing.

In the other direction, the next level of progress in techniques and systems for parallel computing requires very closely coupled interaction with real applications. Computer scientists know how to deal with and build systems for simple applications at relatively small scale, but not for the complex applications and large scale that high-end science users demand. At large scale, the interactions between systems and applications are much more complex and important. An interesting problem here is that neither side is mature. Understanding how best to build the systems requires understanding applications and their status and evolution, and understanding how best to develop parallel applications requires understanding the systems and their status and evolution. Even more challenging than architectures is building programming systems to make the job of using these complex machines easier. Advances in each side enable the other, but both sides are fluid and rapidly evolving, and this is likely to be the case for quite some time. It is thus clear that researchers or practitioners on either side need to understand the other increasingly well, and that cross-disciplinary researchers who can understand both sides well will be in a very good position to make fundamental contributions to both. A concrete example that makes this case clearly is described for an astrophysics application in Section 3D.4.

Example from Computational Biology. Many approaches exist for predicting 3-D protein structure, but none are particularly successful today. Russ Altman, a computational biologist at Stanford and himself a Ph.D. in Computer Science, developed a method to take probabilistic data from sources with uncertainty and predict a conforming structure together with the uncertainty in the resulting positions of atoms. Cheng Chen, a graduate student in the Computer Systems Laboratory working with Singh (co-PI on this proposal), studied algorithms, optimization, computational biology and parallel computing, and began to work on this problem with both advisors. Using sequential algorithmic techniques and an understanding of memory hierarchies, Cheng sped up the original code by over a factor of 20. Using knowledge of biological structure, he developed a new hierarchical algorithm that dramatically reduces computational complexity. Using knowledge of statistics, optimization and biology, he extended the approach to work with other types of constraints that are important in practice. Using knowledge of parallel computing and parallel architectures, he developed novel parallel methods to speed up the hierarchical algorithm. This research not only enabled the approach to be used on much larger proteins than were imaginable earlier, but also led to a significant new and general approach to load balancing for important emerging algorithms used in many domains, a perfect example of applications research driving important general computer science techniques. Cheng has used his cross-disciplinary learning to make important contributions to both biology and computer science. His papers have been published in leading parallel computing and algorithms venues, and selected from the RECOMB computational biology conference for publication in the Journal of Computational Biology. Similarly, Steven Kleinstein, a computer science graduate student at Princeton, is working with Singh and Martin Weigert in Molecular Biology in the young area of simulating immune systems. He has used algorithmic techniques to dramatically enhance performance, developed a simulation system that is used in several departments inside and outside Princeton (for education and research), and is obtaining interesting biological results. These are examples of the types of research and students we would like the PICCS program to develop.

The program’s research charter will initially be driven by a set of key science and computer science challenges on which existing collaborations between computer science and other departments are based, each of which requires the use of, and innovations in, the entire computational science pipeline. The six research thrust areas described in the rest of this section are as follows.


The first three presented here are application drivers for computational science, and the last three are computer science areas in which they will drive research.

(a) Computational Astrophysics. Astrophysics researchers are heavy users of computational and visualization cycles. Traditionally, they can be classified primarily into two broad categories: theoretical and observational. The former focus primarily on developing and testing models of cosmological evolution (for example), and use large-scale computation to simulate them. The latter use computers to mine and analyze observational data. Our astrophysics faculty have seen a strong need for a new brand of “computational astrophysicist” who works at the boundary of these areas and high-performance computing and visualization, understanding how to take advantage of emerging high-performance computing technologies to accelerate the science. Prototypes exemplifying this type of person include Larry Smarr, director of NCSA and a leading figure in high performance computing, and Lars Hernquist, a Professor at Harvard. Interestingly, both worked as post-doctoral fellows under Ostriker, the PI of this proposal. In fact, Princeton’s department has become a major user of high-performance computing cycles, but new technologies have made collaborations with computer science essential. Such collaborations are under way between individual faculty. The Plasma Physics program within Astrophysics is affiliated with the DOE’s PPPL laboratory, a participant in the PICCS program (Dr. Bill Tang is a co-PI), and collaborations in computational plasma physics with PPPL are planned as well.

(b) Computational Biology. Many aspects of biology are fast becoming information sciences. A difference from astrophysics is that the basic approaches and models needed to solve critical problems are not well developed or understood. This research thus drives the need for basic algorithms and methods, as well as parallel computing and visualization, which are active areas of research for computational biologists. In fact, many leading computer scientists have recently been attracted to computational biology, which has led to the development of superior algorithms for many innovative approaches taken by biologists. Our research efforts are primarily in two broad areas: (a) protein structure determination, both experimental and computational, where we have already made tremendous improvements through algorithm design and parallelism, and (b) young research areas in simulating system dynamics, e.g. computational immunology and computational neurobiology. In addition to our current research collaborations, the new development of a Center for Genomics at Princeton, focusing on systems-level rather than cell-level modeling, will provide natural fodder for many aspects of the PICCS program.

(c) Computational Geosciences. Advances in computational technology have allowed general circulation models of the atmosphere, ocean and deep earth to enter a state of very rapid progress. NOAA’s GFDL laboratory focuses on atmospheric, oceanic and coupled models, collaborating with the Atmospheric and Oceanic Sciences program at Princeton, while Princeton’s other geoscientists focus on deep-earth models. Collaborations between GFDL and the ecological biologists are under way to extend models to capture nitrate flows. Most computation today is done either on vector machines, or implements simple, regular models on message passing parallel systems. The need to resolve both types of models more finely and much more adaptively (e.g. for clouds and rocks) is apparent, and collaboration with computer scientists on the parallel computing side is viewed as essential, especially to examine migration to the more attractive shared address space model and performance portability. Geoscientists are also eager to use the CS department’s clusters, and to greatly enhance their visualization methods to examine complex circulation flows.

(d) Parallel Computing on Tightly-coupled Multiprocessors. The need for integrative research was discussed in the example above. Driven by the above and other application areas, we will take on challenges in parallel algorithm design for complex problems, scaling to large systems, programming environments to make using the machines easier, and programming for performance portability across parallel architectures. We will also study the implications of applications for architectural design and programming models. All of this research will be strongly application-driven.

(e) Room-Wide, Building-Wide and Campus-Wide Clusters. Although the drive toward commoditization and systems of systems has made clusters very important platforms, they have less efficient communication architectures with less hardware support and are thus much more difficult to use efficiently. We will develop programming models on new cluster organizations and study their tradeoffs. A key issue is to structure applications and algorithms for performance portability across tightly-coupled machines and clusters. Building-wide and campus-wide clusters have two purposes: using idle cycles on desktop machines and servers at night, and constructing very large clusters hierarchically from smaller room-wide clusters. Key issues for these include application/algorithm design to exploit them in these different ways, fault tolerance, dynamic reconfiguration, and tolerating long communication latencies.

(f) Visualization. As applications advance, they generate enormous and complex data sets. The goal of visualization research is to develop new methods to extract information from the data sets and present it on display systems so that a user can best obtain insights from it. The challenges include constructing giant, ultra high-resolution display systems, building scalable I/O and communication subsystems to deal with enormous data sets, and developing new algorithms and extraction methods to visualize the enormous and complex data sets.


Visualization abstractions and methods must be developed in close conjunction with users, who are the only ones who know what might be interesting or what to look for. On the visualization systems side, wall-size display systems with scalable resolution (at least one order of magnitude higher than HDTV) can be constructed, raising fascinating possibilities for immersive visualization, interactive data fly-throughs and exploration, and collaborative visualization. We are building a scalable parallel display wall in the CS department, around which a collaboration with astrophysicists, PPPL and other DOE labs has already developed. It uses parallel clusters, and serves as a driving force and infrastructure for the entire computational pipeline.

Other research thrusts are in place and more will follow, including the design of sequential algorithms for the application areas and work in other natural science and engineering areas. These areas and departments will be included in the program as well, driven by student and faculty interest and relevance to the program. To focus and drive the interdisciplinary collaboration, every year of the program will have a cross-cutting challenge theme on which the computer and computational scientists will jointly focus. All participants in the program will participate integrally in at least two annual themes, and most will participate in more. Each theme will focus on a stage of the computational pipeline; application scientists will focus on their application domain for the annual theme, while CS researchers will work together with them all.

In describing the research thrusts, we begin with the last three that are based in CS and that cut horizontally through the other areas. Once the trends and challenges in these areas are understood, it will be easier to discuss the natural science research thrusts that cut vertically through them. In response to a summary comment in the pre-proposal review, we have enhanced our treatment of the contributions of PICCS to computer science itself in this proposal.

3D.1 Parallel Computing on High-end Multiprocessors

(a) Background and Goals. Two major developments, discussed earlier, have occurred in high-performance computing that are very relevant to this program. One is the move from vector to parallel computers; it requires scientists to develop new algorithms and migrate their codes, which is not part of their mainstream research, and is at once a need and an opportunity for computer scientists and application scientists to interact and be trained in the domains of the other. The other major development is the ongoing convergence in architecture. From a world of diverse programming models, each implemented in hardware using special-purpose techniques, the field has converged to general-purpose architectures that each implement either of the two dominant programming models [Cull98]. These architectures are based on the same commodity parts for the processors and their memory systems, and differ only in the communication architecture that connects the processing nodes. The convergence makes it possible, finally, to develop portable algorithms and programs, which is a great enabler for the widespread use of parallel computing.

The state of parallel architecture and programming models is as follows. At small scale, bus-based symmetric multiprocessors (SMPs) dominate. At larger scale, two types of multiprocessor platforms have emerged as dominant. The first is high-end, tightly coupled machines that use customized communication architectures. These machines increasingly provide hardware support for the attractive shared address space (SAS) programming model with automatic caching and coherence of remote data. They are called cache-coherent, nonuniform memory access (CC-NUMA) or distributed shared memory (DSM) multiprocessors. The second is clusters. These are less tightly coupled systems that also use commodity interconnect technology to connect commodity nodes. The nodes may be PCs, workstations, or, especially for high-performance computing, themselves multiprocessors. These clusters typically do not provide special hardware support for a shared address space and coherent replication, which must therefore be implemented in software. The other major parallel programming model, explicit message passing, is generally implemented in software on both types of systems.
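
To make the contrast between the two models concrete, the following sketch (ours, not taken from any existing code base) computes the same global sum in explicit message passing with MPI and in a shared address space with OpenMP threads. It assumes an MPI library and an OpenMP-capable C compiler (e.g., built with something like mpicc -fopenmp); it is an illustration of the two programming styles, not a proposed benchmark. The message passing version names its communication partners and moves data explicitly, while the SAS version simply reads and writes shared data and lets the system move it.

/* Illustrative only: the same global sum in the two dominant models. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000L

/* Message passing: each rank owns a slice of the data and combines
   partial results through explicit communication. */
static double sum_message_passing(int rank, int nprocs) {
    double local = 0.0, global = 0.0;
    for (long i = rank; i < N; i += nprocs)      /* cyclic data ownership */
        local += (double)i;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}

/* Shared address space: threads read and write shared data directly; the
   system moves data implicitly (here, among one node's threads). */
static double sum_shared_address_space(void) {
    double global = 0.0;
    #pragma omp parallel for reduction(+:global)
    for (long i = 0; i < N; i++)
        global += (double)i;
    return global;
}

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    double mp = sum_message_passing(rank, nprocs);
    double sas = sum_shared_address_space();
    if (rank == 0)
        printf("message passing: %.0f  shared address space: %.0f\n", mp, sas);
    MPI_Finalize();
    return 0;
}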

For several reasons not discussed here,1 a coherent shared address space is a very attractive programming model. It has been shown to be very effective at moderate scale for a wide range of applications when supported efficiently in hardware on DSM machines.2 Whatever the programming model of choice, programmers would like to write parallel programs once and have them run well on a range of systems (performance portability) and a wide range of scales (scalability). Being able to do this in an easier programming model would greatly simplify the task of programmers in harnessing the computational power and be a great step forward for parallel computing. However, the SAS model has not yet been proven to work well across a range of applications on clusters, and the performance and programmability tradeoffs between it and message passing are not well known for clusters or even for large-scale tightly coupled machines.

1 The reasons include providing a graceful migration path from the volume small-scale SMP marketplace, and dramatic ease of programming for many complex applications that become more prevalent as multiprocessing matures.
2 In fact, showing this took integrated research in applications and systems as well as pure architecture research, especially due to the chicken-and-egg problem of the architecture being new and unproven. The demonstration led to a major shift in industry in building high-end multiprocessors.


Sorting out the programming model and performance portability is one of the key outstanding questions in scalable parallel computing. Application users studying this problem would treat the systems and the implementations of the programming models as fixed; system designers not only tend to focus on simple programs but also treat the programs as fixed; it is only through cross-disciplinary research that we can truly understand this area, by opening up both sides and understanding how best to solve the problems without treating either side as a black box.

We are actively pursuing core systems, core applications/algorithms, and application-driven systems research in both programming models on both major types of platforms. A fortunate circumstance for us, and for the training and research in the PICCS program, is the similarity of platforms used by the participating departments. Several departments share a large, tightly-coupled Origin2000 [Lau97] machine acquired through NSF equipment grants. Most of the departments own, and are contemplating moving their codes to, nearly identical clusters of SMPs, acquired through a large shared equipment grant from Intel Corporation. In addition to the above areas, parallel programming remains the greatest challenge in parallel computing. One of our joint NSF instrumentation grants specifies research in the development of parallel programming environments and performance diagnosis tools, taking advantage of the opportunity provided by real local users and applications who are just transitioning to these styles of systems. Finally, given the very complex and performance-oriented nature of the systems and applications, performance evaluation based on an understanding of architecture-application interactions is a very important aspect of research and training for the next generation of scientists, and is not currently well practiced or widely understood. This section discusses the research that is specific to tightly coupled platforms, while the next section focuses on the research in room-wide, building-wide and campus-wide clusters. Multiprocessing has a long history of using kernels and simplified abstractions of real applications; however, all these areas now require the kind of simultaneous research in realistic applications and systems that the PICCS program emphasizes, not treating either side as a black box.

(b) Specific Research Problems (for Tightly Coupled Systems)

Parallel Applications and Libraries. For all the application areas in this proposal, we will develop efficient sequential algorithms as well as parallel algorithms and implementations, focusing on both specific system types and performance portability. These will be described in later sections. Based on what we learn, we will develop libraries of methods that are optimized to take advantage of the complex memory hierarchies in modern systems and that, being application- rather than kernel-driven, are truly usable across domains. In addition, we will pursue other emerging application areas, especially those with challenging irregular behavior and those in new, more commercial or media-oriented areas that will drive the design of multiprocessors. We will expand the SPLASH-2 application suite that we distribute [Woo95] with new applications and different degrees of optimization, as well as message passing versions and applications that stress input/output performance. Scalable shared address space technology is quite new, machine organizations and applications are becoming more complex, and we have only scratched the surface in understanding how to achieve good parallel performance by taking advantage of its capabilities. As seen in the earlier examples, many of the best, even general, algorithms are developed by taking advantage of both knowledge of the problem domain and knowledge of the architecture or system, which argues for cross-disciplinary research and training. Such research can have several useful results: in new general partitioning schemes and libraries, in advancing the adoption of new systems and programming models by demonstrating high performance (which initially must be done by people who understand both sides), in transferring knowledge and experience to practitioners, and in enabling new uses of parallelism in emerging domains. It also helps us understand the implications of increasingly complex and challenging applications, and provides valuable complex workloads for systems software and architecture research.
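
As a small, hedged illustration of what optimizing a library routine for the memory hierarchy can mean (this is our own sketch, not a SPLASH-2 kernel), consider an out-of-place matrix transpose: the naive loop writes with a stride of n and misses in the cache on nearly every access for large n, while a blocked version works on tiles that fit in cache. The tile size BS is a hypothetical tuning parameter that a real library would set per machine.

/* Illustrative sketch: cache blocking for an out-of-place transpose.
   BS is a hypothetical tile size; a real library would tune it to the
   machine (cache size, line size, TLB reach). */
#include <stddef.h>

#define BS 64

/* Naive: the write to b walks column-wise with stride n, so for large n
   nearly every write misses in the cache. */
void transpose_naive(size_t n, const double *a, double *b) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            b[j * n + i] = a[i * n + j];
}

/* Blocked: each BS x BS tile of a and the corresponding tile of b fit in
   cache together, so each cache line is reused BS times before eviction. */
void transpose_blocked(size_t n, const double *a, double *b) {
    for (size_t ii = 0; ii < n; ii += BS)
        for (size_t jj = 0; jj < n; jj += BS)
            for (size_t i = ii; i < ii + BS && i < n; i++)
                for (size_t j = jj; j < jj + BS && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}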

Performance Portability. Even within tightly coupled, high-end multiprocessing, there are many organizational and performance differences among machines. Due to the explicit control it requires, the message passing model has fairly performance-portable programming guidelines. In the SAS model, not only performance characteristics but also the granularity at which coherence and communication are performed differ among machines. Since communication is implicit, differences in the organization of processors into nodes do not have to be visible to the programmer but can still affect performance substantially. Systems may also vary in the hardware support they provide for the programming model: for SAS but not coherent replication (Cray T3D/E), SAS and coherent replication in the caches (SGI Origin2000, Convex Exemplar), SAS and coherent replication in main memory (KSR-1), or neither. While users would like performance portability for their codes across machines (which they use at various sites), their codes today are written differently for the different systems, or are written conservatively (e.g. in message passing) so as not to take advantage of hardware support. It is critical that algorithmic and programming guidelines as well as software systems be developed to provide performance portability across systems and organizations. This is a major challenge, and will involve research in algorithm design, program structuring, and languages that encourage performance-portable programming. It cannot be done without understanding both applications and architectures deeply enough.


The challenge of performance portability becomes even larger when we consider clusters in the next section.
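
One concrete way coherence granularity undermines performance portability is sketched below under our own simplifying assumptions (this is not code from the proposal): per-processor counters that are logically private but packed contiguously end up sharing coherence units, so every update generates coherence traffic (false sharing). Padding each counter to the largest coherence granularity the program must run well on restores performance at the cost of memory; on a software shared memory cluster the relevant granularity would be a page rather than a cache line, which is why the same source code can behave very differently across systems. The 128-byte figure below is an assumed upper bound on line size.

/* Illustrative sketch of false sharing and padding. COHERENCE_GRAIN is an
   assumed bound on the largest coherence unit the code must run well on
   (a cache line here; a page for software shared memory on clusters would
   force a coarser data layout). */
#include <omp.h>
#include <stdio.h>

#define NTHREADS 8
#define COHERENCE_GRAIN 128   /* bytes; assumed upper bound on line size */

/* Bad: all counters share one or two cache lines, so every increment by any
   thread invalidates the line in every other thread's cache. */
static long counters_packed[NTHREADS];

/* Better: each counter gets its own coherence unit. */
static struct { long value; char pad[COHERENCE_GRAIN - sizeof(long)]; }
    counters_padded[NTHREADS];

int main(void) {
    #pragma omp parallel num_threads(NTHREADS)
    {
        int t = omp_get_thread_num();
        for (long i = 0; i < 10000000L; i++) {
            counters_packed[t]++;       /* heavy coherence traffic */
            counters_padded[t].value++; /* essentially private updates */
        }
    }
    printf("%ld %ld\n", counters_packed[0], counters_padded[0].value);
    return 0;
}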

Scalability. While hardware-coherent shared memory is fast becoming dominant at moderate scale for high-end machines, its performance at large scale is not well understood. Nor, for that matter, is that of message passing on challenging applications. Using real applications, we will examine scalability and the bottlenecks to it in both applications and architectures. We will also examine whether the programming techniques developed for performance portability can be extended to achieve scalability too. Our initial work in this area [Holt96], via simulation, showed a promising outlook for some prototypical applications. However, scalability has to be understood on real systems and especially with full-scale, complex user applications in different areas. It is also not clear how large-scale systems should be organized: as a flat collection of processors, as a two-level communication hierarchy (multiprocessor nodes connected together), or as a multilevel hierarchy. Scalability also requires interdisciplinary research to arrive at the correct conclusions; e.g. current applications may run well at moderate scale and not scale up, but improvements to them in parallel algorithms or orchestration, which pure systems researchers will not attempt, may lead to good scalability. Understanding both sides is also needed for good scaling methodology, i.e. knowing how to scale workloads as machines become larger [Sing93].
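
As a toy illustration of the scaling-methodology point (our own simplified example, not taken from [Sing93]): for a hypothetical 3-D grid code whose memory grows as n^3 and whose total work grows as roughly n^4 (n^3 grid points times about n time steps), keeping memory per processor constant and keeping ideal run time constant lead to different scaled problem sizes, and therefore exercise a larger machine quite differently. The exponents and the baseline size are assumptions for illustration only.

/* Toy scaling-rule sketch for a hypothetical 3-D grid code:
   memory ~ n^3, total work ~ n^4. Compile with -lm. */
#include <math.h>
#include <stdio.h>

/* Memory-constrained scaling: keep grid points per processor fixed. */
static double n_memory_constrained(double n1, int p) {
    return n1 * cbrt((double)p);          /* n^3 grows linearly with p */
}

/* Time-constrained scaling: keep ideal run time n^4 / p fixed. */
static double n_time_constrained(double n1, int p) {
    return n1 * pow((double)p, 0.25);     /* n^4 grows linearly with p */
}

int main(void) {
    double n1 = 128.0;                    /* assumed uniprocessor grid edge */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p=%4d  memory-constrained n=%6.1f  time-constrained n=%6.1f\n",
               p, n_memory_constrained(n1, p), n_time_constrained(n1, p));
    return 0;
}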

Resource Distribution. As we reach the next major threshold in integration, when an entire system (processor, cache, memory, network interface), and not just a processor, fits on a chip, the interfaces between components become fluid rather than constrained by packaging. Large-scale machines become easier to construct since the individual components are small and each chip may have many processors. Resource distribution and machine organization become key issues, especially for large-scale systems, and the fluidity of the interfaces makes the tradeoffs change all over again. How to distribute resources will depend on the characteristics of applications and how they scale, as well as the costs of system resources. Our early work in this area pointed to the possibility of fairly fine-grained machines (in memory-to-processing ratio) being successful for many applications that are in fact scalable. But much more research is needed.

Implications for Programming Models. In examining performance portability or scalability for the SAS model, we must compare both performance and programmability with explicit message passing. Even on hardware-coherent machines at the high end, many scientific users currently simply port old message passing codes to them or develop new codes with message passing, using the shared memory only as a means of accelerating message passing. Clusters, even more so, are typically programmed with message passing. As the programs people want to run become more complex (e.g. because the phenomena being modeled become more realistic), message-passing programming becomes more difficult [Sing95b]. However, it does provide both functional and performance portability, and it is still the dominant programming model in high-performance computing. We will examine the tradeoffs in performance and programmability between three major programming models, message passing (send-receive), a non-coherent shared address space (put-get), and a coherent shared address space (read-write), on a range of platforms and organizations for real applications. The platforms will include flat architectures as well as two-level hierarchies with combinations of the programming models used within and across nodes. This too may require restructuring both applications and implementations. For example, an application user can compare the programming models on the same platform. Even if the user has written very good applications, it may be that the message passing systems layer is not implemented well on the machine (we have encountered this). Only a person who understands both sides can look into this complex implementation too, determine in detail where the problem is, fix the message passing layer and try the experiment again. Otherwise one might easily arrive at an incorrect conclusion about programming models. Graduate students will learn to develop real applications in the different models to understand the tradeoffs.
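
The sketch below (our illustration, not code from the proposal) expresses one boundary exchange of a 1-D stencil in each of the three models: send-receive with MPI two-sided messages, put-get with MPI one-sided windows standing in for a non-coherent shared address space, and plain reads and writes of a shared array under OpenMP standing in for a coherent shared address space. The array layout, ring pattern and sizes are assumptions made only for the example.

/* Illustrative sketch: one ghost-cell exchange of a 1-D stencil in three
   programming models. Layout assumption: u[0] is the left ghost cell and
   u[1..LOCAL_N] are the locally owned elements. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define LOCAL_N 1024

/* (1) Send-receive: both sides name the transfer explicitly. */
static void exchange_send_recv(double *u, int rank, int nprocs) {
    int right = (rank + 1) % nprocs, left = (rank + nprocs - 1) % nprocs;
    MPI_Sendrecv(&u[LOCAL_N], 1, MPI_DOUBLE, right, 0,   /* send last owned */
                 &u[0],       1, MPI_DOUBLE, left,  0,   /* recv left ghost */
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* (2) Put-get (non-coherent SAS): the sender deposits data directly into the
   neighbor's window; synchronization is explicit via fences. */
static void exchange_put(double *u, MPI_Win win, int rank, int nprocs) {
    int right = (rank + 1) % nprocs;
    MPI_Win_fence(0, win);
    MPI_Put(&u[LOCAL_N], 1, MPI_DOUBLE, right, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);
}

/* (3) Coherent read-write: with a (hardware- or software-) coherent shared
   address space there is no explicit transfer; a thread just reads across
   its partition boundary after a barrier and coherence moves the data. */
static void stencil_shared(double *shared_u, int nthreads) {
    #pragma omp parallel num_threads(nthreads)
    {
        int t = omp_get_thread_num();
        #pragma omp barrier
        double left_ghost = (t > 0) ? shared_u[t * LOCAL_N - 1] : 0.0;
        shared_u[t * LOCAL_N] += 0.5 * left_ghost;
    }
}

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double u[LOCAL_N + 1];
    for (int i = 0; i <= LOCAL_N; i++) u[i] = (double)rank;

    MPI_Win win;
    MPI_Win_create(u, (MPI_Aint)((LOCAL_N + 1) * sizeof(double)),
                   sizeof(double), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    exchange_send_recv(u, rank, nprocs);
    exchange_put(u, win, rank, nprocs);

    static double shared_u[4 * LOCAL_N];          /* one node's shared array */
    if (rank == 0) stencil_shared(shared_u, 4);

    if (rank == 0) printf("left ghost after exchanges: %f\n", u[0]);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}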

Programming Environments and Performance Diagnosis Tools. Making parallel programming easier is perhaps the greatest challenge, even for SAS systems. Many programming languages have been developed, but there is still a great need for (i) languages or extensions that do a good job of managing control over data locality while also making the desired parallelism easy to express, especially in emerging, increasingly complex applications, (ii) support for rapid prototyping of new parallel algorithms, vital to areas like protein structure and computational neurobiology, and (iii) programming environments that integrate support for performance feedback, using both the runtime system of the language and any hardware mechanisms provided by the system for performance monitoring. Performance feedback is particularly important for coherent shared address space systems, since the same implicit nature of communication and replication that makes programming easier also generates more possibilities for artifactual communication and contention. These are particularly difficult to reason about without good tools, but building tools for them is challenging and not addressed well today. The increasing complexity of systems (memory and communication hierarchies) also makes tools all the more critical. And since programming environments and tools are targeted directly at real users, it is clearly an area where close interaction of designers with users and understanding the issues on both sides is critical, though it is usually missing from actual design.


Our approach to environments is not to develop entirely new languages, but rather to extend existing languages and systems in an application-driven way, including both rudimentary parallel programming systems in use today and a concurrent object-oriented language (COOL) that already provides some support for locality management. On the tools side, we will integrate performance information from all levels as most appropriate: hardware for information about low-level dynamic events, the runtime system (in a language like COOL) for high-level dynamic events, and the compiler to reduce dynamic costs and map back to program information. We will determine how to provide the integrated information to the user in the most meaningful, hierarchical way. We will also examine how to design additional hardware support that can get at the most challenging contention problem in a meaningful way without perturbation that can change its characteristics. Finally, language standards for scientific computing such as High Performance Fortran work very well for regular array-based programs, but break down when confronted with the irregular, dynamically changing applications that are increasingly popular even in scientific computing. We have ideas for extending the HPF model significantly to incorporate support for such applications, thus making it a much more widely usable standard.
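
As a down-to-earth placeholder for the kind of feedback such an environment would automate (this is a minimal sketch of ours, not the proposed tools or COOL), even coarse per-thread wall-clock timing around a parallel phase exposes load imbalance, one of the symptoms an integrated environment would then tie back to hardware events and source-level constructs.

/* Minimal sketch: per-thread phase timing to expose load imbalance. A real
   environment would correlate this with hardware counters and map it back
   to program constructs; here we only record wall-clock time per thread. */
#include <omp.h>
#include <stdio.h>

#define NTHREADS 8

static double fake_work(long iters) {          /* stand-in for a parallel phase */
    double x = 0.0;
    for (long i = 0; i < iters; i++) x += 1e-9 * (double)i;
    return x;
}

int main(void) {
    double t_thread[NTHREADS];
    double sink = 0.0;

    #pragma omp parallel num_threads(NTHREADS) reduction(+:sink)
    {
        int t = omp_get_thread_num();
        double t0 = omp_get_wtime();
        sink += fake_work(5000000L * (t + 1));  /* deliberately imbalanced */
        t_thread[t] = omp_get_wtime() - t0;
    }

    double tmax = 0.0, tsum = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        printf("thread %d: %.3f s\n", t, t_thread[t]);
        if (t_thread[t] > tmax) tmax = t_thread[t];
        tsum += t_thread[t];
    }
    /* The phase's parallel efficiency is bounded by average time / max time. */
    printf("max %.3f s vs average %.3f s (sink=%g)\n", tmax, tsum / NTHREADS, sink);
    return 0;
}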

Performance Evaluation. With parallel architecture being so performance-driven, good quantitative evaluation of design ideas and tradeoffs is extremely critical. The large parameter space in both applications and systems implies that it is also complex. It requires a good understanding of the key architecture-workload interactions and how they scale with workload or architectural parameters, which in turn means understanding both sides well. Qualitative factors like how well a workload is optimized for a system can also greatly affect conclusions. Evaluation is currently perhaps the weakest area of parallel systems research, and must be improved. We have performed many studies and developed evaluation methodologies [Woo95], which required using our knowledge of both applications and systems. We will continue to do this with the new workloads as systems and applications evolve.

Performance Prediction. Simulation has been a powerful method for evaluating tradeoffs at moderate scale. However, the increasing complexity and scale of applications and systems make simulation very difficult. It is now critical, for both algorithm and system designers, to develop tools for predicting performance as application and system parameters change. Many performance models exist for the cost parameters of a generic communication architecture (e.g. BSP [Val90], LogP [Cull93], etc.). However, the much more difficult challenge is modeling the critical properties of the application and its interactions with the granularity or size parameters of the system, and then putting the two together to predict performance. Pure analytical modeling is too difficult and unreliable for complex applications and systems. We would like to develop a combination of analytical modeling and restricted simulation that enables us to extrapolate from performance data on a real platform to the impact of changing application or machine parameters. This too requires intimate understanding of applications and their critical system interactions at different levels of abstraction. It may also require support in the compiler/runtime systems, or feed useful information to them for their optimizations.
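
To make the flavor of such cost models concrete, the toy estimator below uses the textbook LogP parameters (L, o, g, P) of [Cull93] in a simplified form of our own; the numeric values are invented for illustration, not measurements of any machine. A real prediction tool would combine terms like these with application-specific models of computation, communication volume and locality.

/* Toy LogP-style cost estimates (formulas simplified from the model of
   [Cull93]; the numeric parameter values below are made up). */
#include <stdio.h>

typedef struct {
    double L;   /* network latency (us) */
    double o;   /* per-message CPU overhead at sender or receiver (us) */
    double g;   /* gap: minimum interval between message injections (us) */
    int    P;   /* number of processors */
} logp_t;

/* Time until the last of k pipelined small messages is received: the sender
   pays o, then one gap per subsequent message; the last message then takes
   L to cross the network and o to be absorbed at the receiver. */
static double send_k_messages(const logp_t *m, int k) {
    double inject_last = m->o + (k - 1) * (m->g > m->o ? m->g : m->o);
    return inject_last + m->L + m->o;
}

/* Crude estimate for a personalized all-to-all of one small message per
   pair, ignoring contention: each node injects P-1 messages. */
static double all_to_all_small(const logp_t *m) {
    return send_k_messages(m, m->P - 1);
}

int main(void) {
    logp_t cluster = { 30.0, 5.0, 8.0, 64 };   /* assumed SAN-class cluster */
    logp_t dsm     = {  1.0, 0.5, 0.3, 64 };   /* assumed tightly coupled DSM */
    printf("send 16 msgs: cluster %.1f us, DSM %.1f us\n",
           send_k_messages(&cluster, 16), send_k_messages(&dsm, 16));
    printf("all-to-all:   cluster %.1f us, DSM %.1f us\n",
           all_to_all_small(&cluster), all_to_all_small(&dsm));
    return 0;
}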

3D.2 Room-Wide, Building-Wide and Campus-Wide Clusters

(a) Background and Goals. Low cost and commoditization have caused clusters to emerge as major platforms for high-performance computing.

Unlike the systems of the previous section, clusters (a) use commodity interconnects, and (b) do not provide hardware support for programming models (e.g. a coherent SAS). New, high-performance commodity interconnects called system area networks have breathed new life into clusters (Memory Channel [Gill96], Myrinet [Bode95], Tandem ServerNet, etc.). While they have much better communication performance than traditional local area networks (such as Ethernet), they are still quite far from custom multiprocessor interconnects, especially in the overhead to initiate data transfers. Relatively small clusters (with tens of CPUs) may be built with networks of PCs, and relatively large systems (with hundreds of CPUs) will be built with networks of cache-coherent symmetric multiprocessor (SMP) servers. As processor technology continues to outstrip bus technology, SMPs may be replaced by cache-coherent distributed shared memory (DSM) machines even at small scale. Already, the largest high-performance systems are constructed by taking the largest DSM systems that vendors will build (which is constrained by market forces) and putting them together as clusters of DSMs; e.g., the machines at Los Alamos and NCSA.

We have been doing systems research in supporting SAS and message passing programming models on clusters of uniprocessor PCs [Blum94]. Our research has recently been moving into clusters of SMPs (much more likely vehicles, as SMPs are already commoditized) and has been taking on a much more application-driven nature. In the future, we will move to examining clusters of DSMs, much larger scale, greatly expanded scope and realism in the applications we use, and clusters not just within a room but across wider areas such as a building or a campus. Many of the participating departments have acquired and are beginning to use clusters of exactly this type, through shared equipment grants from Intel and Microsoft Corporations and separate NSF equipment grants, which will facilitate our collaborative research and cross-training. Co-PIs Li and Singh have worked together and co-advised students in cluster systems, and Bunge and Ostriker are already moving toward using their systems.

Clusters raise the programming model question all over again. Message passing is implemented largely in software on tightly-coupled platforms as well as clusters, with the same interface. The amount and structure of communication is controlled explicitly by the program, so differences in performance and programming guidelines depend only on the performance characteristics of the communication architecture (overhead, latency, bandwidth). An SAS, however, is implemented in software on clusters (on top of low-level messaging), but in hardware on tightly-coupled systems. Thus, not only are the basic performance characteristics different, but so are the granularities of communication and coherence, and hence the amount and type of communication generated. Programs that run well on hardware-coherent systems may not run well on clusters. Whether clusters can succeed, especially for the SAS model, is not yet well understood.

Clusters require both systems and applications research, which even more so must go hand-in-hand to reach the right designs and conclusions. This is because the performance penalties for mismatches are much greater than in hardware-SAS systems, application/algorithm design for clusters is not well understood, and, especially for the unproven SAS model, success in cross-disciplinary research will be needed to spark adoption by users. In fact, current evidence is that application restructuring and understanding will be at least as important to their success as systems research [Jian97].

Systems research for SAS on clusters has been ongoing for a while, but we believe this is a very fruitful time for the kind of integrated research that we are proposing. The programming model is established for high-end systems, as is the prevalence of clusters; protocols and communication layers are reasonably mature; however, the performance potential and scalability are yet to be well understood, which may drive many novel system enhancements and performance-portable application structuring techniques. Fortunately, local real users want to use the cluster systems, and several of them are currently migrating from message-passing to shared memory machines at the tightly-coupled end, so it is a good time for computer scientists to work with them and understand the tradeoffs among programming models for both types of platforms. We will use existing applications from hardware-coherent shared memory to drive this research, as well as the science applications in this proposal. We will focus not only on the SAS model, but implement message passing applications and systems on the same substrate for comparison at every step of the way.

Our interest is first in clusters designed to operate as multiprocessors, i.e. clusters that are a rack of systems in a room, interconnected by a system area network. However, for many less communication-intensive applications, we are also interested in using systems that are physically distributed, e.g. across a building or even a campus. These come in two varieties. First, systems that were not intended to be used as a single parallel system (e.g. systems on people’s desktops or in different offices or laboratories), which raises its own set of new issues. Second, it is attractive to construct “systems of systems” by connecting together individual clusters (each on its own system area network in a room) using wider-area, high-bandwidth but high-latency networks (e.g. Gigabit Ethernet or fiber). For each major research area we discuss for clusters, we discuss “room-wide” clusters first, followed by the more distributed cases.

(b) Specific Research Problems. Unlike tightly coupled systems, where the programming model is now supported directly by systems built in industry that application scientists will use, in the case of clusters we need to build and improve the systems as well as do research in parallel software. Supporting programming models gives rise to the layered architecture shown in Figure 3(a) (later in this subsection). These are the layers that affect the performance of applications, and hence at which improvements can be made. The lowest is the communication layer. It consists of the communication hardware (controller, network interface or NI, and network) and the low-level software that provides basic messaging facilities. Next is the protocol that provides the programming model to the application programmer. Finally, above the protocol runs the application itself. Let us briefly discuss our integrated research in the three layers.

Communication Layer. The largest overhead in cluster communication has not been the raw latency or bandwidth of the interconnect, but the software overheads incurred at the end points. Much research has been done in recent years to streamline common-case communication by getting the operating system (OS) out of the way [Eick92,Paki95]. Our communication model is called Virtual Memory Mapped Communication (VMMC) [Blum94, Dubn97]. Unlike traditional message passing, once buffers are set up by the OS it allows a process to send data to a remote memory without interrupting the remote process or requiring it to explicitly receive the data. We call this feature remote deposit.

Remote deposit is useful for both programming models. To support SAS more effectively, we are extending VMMC with remote fetch and synchronization support that also do not interrupt the processor, using commodity NI support. The NIs respond to data requests and keep track of nodes waiting at a lock, providing mutual exclusion. Extending VMMC to multiprocessor nodes (SMP and DSM) raises challenges in virtualizing the NI and using bandwidth effectively.
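To make the remote deposit idea concrete, here is a minimal, single-process Python model of the communication style: once a buffer has been exported, a sender writes directly into the receiver's memory and can read it back, with no action required from the receiving process. The class and function names are invented for this sketch and are not the actual VMMC API.

    class ExportedBuffer:
        """Toy stand-in for a receive buffer that a node has set up (with OS help)
        and exported for remote access."""
        def __init__(self, size):
            self.mem = bytearray(size)

    def remote_deposit(dst, offset, data):
        # Data lands directly in the destination buffer; the receiving process is
        # neither interrupted nor required to post a matching receive.
        dst.mem[offset:offset + len(data)] = data

    def remote_fetch(src, offset, length):
        # Analogue of NI-serviced remote fetch: read remote memory without
        # involving the remote CPU.
        return bytes(src.mem[offset:offset + length])

    node_b = ExportedBuffer(4096)          # buffer exported by node B
    remote_deposit(node_b, 0, b"result")   # node A deposits into B's memory
    print(remote_fetch(node_b, 0, 6))      # node A (or C) fetches it back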

When moving to building-wide and especially campus-wide clusters, three new issues become major concerns: scale, fault tolerance and reconfigurability. Large scale implies that all costs become more important. Since the networks connecting clusters have relatively worse latency than bandwidth, the communication layer should support latency tolerance (e.g. multithreading and prefetching data). New techniques must be developed to cross administrative domains at low cost. Dealing with faults is very important because, while it is rare for a power cord to be pulled out of the wall in the computer room, this may well happen in people's offices. Also, the more links there are, and the more nodes under different administrations, the more frequent link and especially node failures become. We have developed basic fault tolerance support in VMMC for programmable NIs, but a lot more research is needed to provide true end-to-end fault management and graceful degradation of performance. Complete fault tolerance is difficult, and the kinds of solutions that are needed (fault containment, checkpointing, fail-over, etc.) must be driven by application needs. Network reconfiguration support in the high-performance communication layer is important because the topology of the network may change in its various parts, and we do not want to shut down the entire distributed system for this. Finally, since widely distributed systems are not likely to be dedicated to a particular application, integrating the communication layer with scheduling becomes important; e.g. scheduling a process when communication for it arrives.

Protocol Layer. For message passing, we will port the Message Passing Interface (MPI) standard to run on VMMC. At least initially, this will be the programming model of choice on clusters for the application users. For SAS, software methods provide replication and coherence in main memory rather than only in hardware caches. Access control and coherence can be supported at various granularities via different means. Our primary area of research is in so-called shared virtual memory (SVM), which leverages existing operating system support and memory management hardware to provide access control and coherence transparently at the granularity of pages [Li89], embedding the protocol in page fault handlers. Page granularity amortizes communication costs over kilobytes of data (particularly important on clusters given the high overhead of communication) and does not require the additional overhead (such as code instrumentation or hardware support) needed for fine-grained access control.

The disadvantage of SVM is that the large, page granularity of coherence and communication causes the expensive communication to be more frequent (due to “false sharing” of pages by processors) and voluminous (due to “fragmentation”, i.e. moving an entire page when only a piece of it is required). A solution is to use relaxed consistency models to buffer coherence actions and postpone them until synchronization points [Kele92]. These protocols differ in the eagerness or laziness with which they propagate and apply coherence information to pages, leading to tradeoffs that can only be resolved in the context of real applications. Based on an analysis of application properties, we have developed a family of home-based protocols that tend to perform better than the best earlier protocols [Ifto98], and to scale much better in both performance and in the memory overhead that is a critical limitation in SVM systems.

Many protocol approaches exist [Kele92,Ifto98,Kont96] and many issues are unresolved [Ifto99]. For example, lazy protocols reduce communication and false sharing, but make synchronization operations more expensive. This dilates critical sections, potentially leading to much serialization. How much laziness is good depends on both NI functionality and network characteristics, as well as on the nature of applications. Real, complex applications are now needed to obtain good general answers to these tradeoffs, guide better system design, and evaluate the potential of SVM clusters.
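The sketch below is a deliberately simplified, single-process Python model of a home-based, release-consistent SVM protocol: each page has a home copy, a writer diffs its copy against a "twin" and flushes the diff to the home at a release, and an acquire lazily invalidates cached pages so the next access refetches them. It is meant only to illustrate the mechanism discussed above; real protocols (write notices, laziness variants, SMP nodes) are far more involved, and all names here are invented.

    class ToyHomeSVM:
        """Minimal model of a home-based, release-consistent shared virtual memory."""
        def __init__(self, npages, page_size, nnodes):
            self.page_size = page_size
            self.home = [bytearray(page_size) for _ in range(npages)]   # home copies
            self.cache = [dict() for _ in range(nnodes)]  # per-node cached pages
            self.twin = [dict() for _ in range(nnodes)]   # pristine copy at first write

        def _page(self, node, page):
            if page not in self.cache[node]:
                # "Page fault": fetch the whole page from its home.
                self.cache[node][page] = bytearray(self.home[page])
            return self.cache[node][page]

        def read(self, node, page, off):
            return self._page(node, page)[off]

        def write(self, node, page, off, val):
            buf = self._page(node, page)
            if page not in self.twin[node]:
                self.twin[node][page] = bytes(buf)   # keep a twin to diff against
            buf[off] = val

        def release(self, node):
            # Flush diffs of locally modified pages to their homes.
            for page, twin in self.twin[node].items():
                cur = self.cache[node][page]
                for i in range(self.page_size):
                    if cur[i] != twin[i]:
                        self.home[page][i] = cur[i]
            self.twin[node].clear()

        def acquire(self, node):
            # Lazily invalidate clean cached pages; the next access refetches them.
            for page in list(self.cache[node]):
                if page not in self.twin[node]:
                    del self.cache[node][page]

    svm = ToyHomeSVM(npages=4, page_size=16, nnodes=2)
    svm.write(0, page=1, off=3, val=42); svm.release(0)   # node 0 writes, releases
    svm.acquire(1); print(svm.read(1, page=1, off=3))     # node 1 acquires, sees 42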

Protocols for multiprocessor nodes. Multiprocessor nodes expose a two-level communication hierarchy, with hardware coherence within nodes and software communication across nodes. We (and others) have extended our lazy, home-based protocols to provide a uniform SAS model both within and across nodes in a bus-based SMP, taking advantage of the hardware coherence and synchronization within a node [Sama97,Dwar99]. However, this raises many new protocol tradeoffs that are not yet resolved. We also plan to extend the protocols further to use DSM rather than SMP nodes. The presence of fewer, larger nodes raises new challenges for protocol data structures and laziness (within or across nodes). Having multiple NIs per node also raises new issues: opportunities for greater fault tolerance, challenges in virtualizing and managing the NIs at the protocol level, and tradeoffs in where within a node protocol threads should run. However, this is probably the most realistic way to construct very large systems for the foreseeable future, and hence to achieve a uniform SAS. We will develop our DSM cluster protocols on our local SGI Origin2000 infrastructure, and experiment with them on larger systems at SGI and Los Alamos, where graduate students may obtain internships (see letters).

Communication Support and Protocol Enhancements. We are also studying protocol enhancements that are driven from below by opportunities in modern communication layers, and from above by bottlenecks seen in applications. For example, using the commodity NI support for basic operations described earlier, we have developed new protocols that don't interrupt the main processor. They finally achieve performance close to hardware-SAS at small scale for many (often restructured) applications [Bila98]. However, protocol management is now separated from synchronization, so a host of protocol tradeoffs and design issues are opened up again. With DSM nodes, multiple NIs per node will raise new challenges here. Runtime support can enable important optimizations like latency tolerance, integration of thread scheduling with protocols, scheduling for two-level hierarchies, etc. Finally, studying bottlenecks in more complex applications than before has already led us to many new ideas; e.g. even lazier models like Scope Consistency, protocol and computation movement to reduce synchronization overhead, and techniques to alleviate contention and address the imbalances in communication cost that it causes. New, far more complex applications will undoubtedly lead to others.

Scalability. Scalable performance for SVM clusters has not been demonstrated or understood. Demonstrating it also requires running large problems, which is limited by the large memory overheads of SVM protocols. Scalability is not only very important for computational science, but clearly requires applications research as well. It will be a major focus of our research.

Alternative Approaches. An important alternative to page-based SVM is to support the SAS model at fine or variable granularity even on clusters. Access control can be done either by instrumenting memory accesses in software [Scho94], or via additional hardware in the node [Rein94]. Fine coherence granularity allows the use of simpler consistency models. Each approach has its advantages: SVM amortizes communication costs when spatial locality is good and has no extra access control costs; the fine-grained approach reduces false sharing and synchronization cost (since protocol activity and communication occur at memory accesses themselves). Which is more promising in different node configurations is an important question for clusters, as is the value of different kinds of hardware support. The only way to resolve these is by understanding applications and via application-driven performance evaluation.
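For contrast with page-grained SVM, the following toy Python sketch shows the shape of software fine-grained access control: every load and store goes through an inline check on the state of a small coherence block, the kind of check an instrumentation tool would insert before shared accesses. The states, block size, and names are illustrative assumptions, not a description of any particular system cited above.

    INVALID, READONLY, WRITABLE = 0, 1, 2

    class FineGrainShared:
        """Toy model of software fine-grained access control on small blocks."""
        def __init__(self, nblocks, block_words=8):
            self.block_words = block_words
            self.data = [0] * (nblocks * block_words)
            self.state = [INVALID] * nblocks
            self.checks = 0   # count of inline access checks (the added overhead)

        def _check(self, addr, needed):
            self.checks += 1
            b = addr // self.block_words
            if self.state[b] < needed:
                # In a real system this would invoke the coherence protocol to
                # fetch the block or obtain write ownership; here we just upgrade.
                self.state[b] = needed
            return addr

        def load(self, addr):
            return self.data[self._check(addr, READONLY)]

        def store(self, addr, val):
            self.data[self._check(addr, WRITABLE)] = val

    mem = FineGrainShared(nblocks=16)
    mem.store(3, 7)
    print(mem.load(3), mem.checks)   # every access paid an explicit check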

Building- and Campus-Wide Clusters. Fault tolerance and containment must be provided in the protocol too. Home-based approaches have advantages in this regard, but protocols must tolerate faults and allow nodes to leave, and perhaps later rejoin, an ongoing computation. Imbalances in communication costs and contention are very common on clusters; the protocol layer can track the cost of the operations incurred, and the application can obtain these costs and adjust itself dynamically. Finally, the system and application together must deal with the heterogeneity of campus-wide clusters.

Overall, SAS approaches for room-wide clusters have made a lot of progress, but the work is now relatively mature and improvements are incremental. Only close collaboration with, and integrated research in, real applications is likely to lead to the next level of design improvements or determine if and how SAS on clusters can succeed for scientific applications at large scale.

Application Layer. We must take the next major step in the applications used to drive SAS systems, moving forward from a mix of kernels and some real applications to the full-fledged applications used by real users. We must also perform integrated research in this and the system layers. We will consider the following challenges in this layer.

Performance Portability and Scalability. As clusters become increasingly important and available to scientists, programming models that do not support applications well on them may not survive for scientific applications even on tightly coupled systems that provide successful hardware support for them, since users want to write a program once and run it on all available systems. Thus, performance portability is another very important reason to study SAS on clusters. Understanding it is inherently interdisciplinary research, since it involves both improvements to systems and algorithm and application restructuring and guidelines. Performance-portable applications themselves often require understanding the problem domain as well as the system (its granularities, and which operations are expensive on it). So far, the work in this area has been done for some applications by people who understand both sides for those applications [Jian97]. Programming guidelines for performance portability in the more explicitly controlled message passing model are quite well understood ("messages are bad, send them infrequently"), but not for SAS. Performance portability to clusters is, of course, even more challenging than across tightly-coupled systems. Our goal will be to develop not only such applications and algorithms but also guidelines for performance-portable programming across the continuum of systems from hardware-coherent machines all the way to clusters. This will include clusters of multiprocessor nodes, which expose strong two-level hierarchies that raise new challenges and call for new techniques.

Figure 3: (a) Layers influencing performance (application layer, protocol layer, communication layer).

While this work will proceed via case studies, we hope general guidelines will be developed that will find their way into programming environments as well. In fact, such guidelines may be at least as important as languages themselves for the holy grail of simplifying effective parallel programming. We have already made some progress in this area [Jian97]. Figure 3(b) shows parallel speedups for some applications on a 16-processor hardware-coherent SGI Origin2000 as well as on a home-based SVM cluster with 16 uniprocessor nodes. For the cluster, speedups are shown before and after application restructuring, starting from applications written to run well on the Origin. The gap on the cluster before and after restructuring is dramatic. Simple optimizations like aligning and padding data structures do not help much; rather, substantial algorithm and data restructuring is required, though in well-motivated ways. While these restructurings do not have a large impact when applied back to this hardware-coherent platform, recent work shows that they are quite critical for the same hardware-coherent machine at large scale. This commonality between performance portability and scalability is very encouraging for the development of well-structured programming guidelines. The scalability of SVM systems is not demonstrated or understood, even for restructured applications, and more complex and realistic applications have not been examined. Clearly, there is a long way to go and a lot to understand in this area.

Programming Models. Message passing systems are in wide use on clusters, though the performance tradeoffs with SAS are not clear here either. Just as we compare coherent SAS, non-coherent SAS and message passing on tightly coupled systems, we will develop these models on our clusters and study the tradeoffs here as well. We will include clusters of uniprocessors, SMPs and DSMs, examining combinations of programming models within and across nodes. Developing efficient message passing implementations will allow the science collaborators to use our clusters quickly.

Adaptability. The high overheads of communication make contention a huge bottleneck in clusters. This leads to large imbalances in communication costs among nodes that are not seen in hardware-coherent systems. It may therefore be more important for either applications or systems to be adaptive and balance themselves at runtime, even if the computation can otherwise be balanced well. We have ideas in this area that we will pursue.

Applications for Distributed Clusters. Harnessing wider-area clusters will need applications research as well. What kinds of applications are coarse-grained enough, how can existing applications be restructured to make them so, and what systems enhancements will enable a wider range of applications to run successfully? For example, looser coupling among components or partitionable sub-domains of irregular applications may lead to somewhat inferior sequential algorithms, but ones that can more efficiently take advantage of such a distributed infrastructure. Wide-area clusters are also likely to be heterogeneous and time-shared, so adaptable applications will be critical in this case. This is largely untrodden territory for all but embarrassingly parallel applications, but it is very valuable to users.

3D.3 Visualization Research Thrusts

(a) Background and Goals. Any project involving computational science must have a visualization component. At the heart of scientific computing lies the requirement of being able to see and understand your results. As Hamming said, "The purpose of computing is insight, not numbers." The availability of high-speed graphics engines combined with supercomputing capabilities makes it possible for visualization to sit at the core of the theory-experimentation cycle. As the technology advances, visualization systems will aim higher. Just as we have progressed from static to dynamic imagery, so can we consider growing from small displays to larger, more immersive displays. Visual immersion and the ability to interactively explore the data may expose features and lead to insights that are difficult to reach from an external viewpoint. We are currently developing a very large, high-resolution, high-performance display wall system. The scale of the display wall (8' x 18') is driven by the need for immersive collaboration: multiple people standing in front of a single, large display. The device has very high resolution (currently six million pixels), driven by the need for accurate rendering of very large data sets and complex structures. Finally, the display requires a high-performance rendering architecture in order to handle very large data sets at interactive rates. While we are conducting research in several aspects of visualization, we discuss them here in the context of the display wall.

While much of the technology described here is of general use, the project is driven by applications in scientific visualization. Our research in this area has already begun with collaborations between CS and astrophysics as described below. Collaborations between users and computer scientists are essential for many reasons. We are interested in not only constructing the systems but also using them for new kinds of visualization algorithms and methods to enable scientific insights. Only the scientific users know what kinds of insights and abstractions or views of data they might want; graphics or systems researchers can then determine how to get them fast. We expect a tight feedback loop in which novel visualization algorithms (both general and specific to a parallel display wall) are developed in response to user needs, and the new views/abstractions of data this enables lead users to think up still
further view abstractions. At the same time, graphics researchers are familiar with many representations of data that may jog the imagination of scientific users further. We hope the collaborations will also train people who understand both sides.

(b) Specific Research Problems. Display Wall. The demands of scientific visualization applications usually exceed the capabilities of current display systems. Even high-end graphics workstations (e.g., SGI's Onyx2 with InfiniteReality Graphics) are not adequate to satisfy the demanding resolution and polygon throughput requirements of large scientific data sets. Even the biggest monitors are not large enough to allow a scientist to be immersed in the image, and they do not allow many scientists to interact with a data set collaboratively. An ideal display system for scientific visualization displays an image measuring roughly 36,000 x 28,000 pixels [Shos92]. Current CRTs and various flat-panel display devices are about three orders of magnitude away from this. Although projection devices can display a large-scale image on a surface, they lack adequate resolution to enable visualization of complex data sets. Unfortunately, the resolution of individual display devices has been improving at a rate of only 5% per year during the last two decades.

We are investigating a display wall approach that uses multiple projection devices to form a wall-sized display system. It provides enough space in its room for users to interact with the wall directly and to communicate with each other effectively. The state-of-the-art approach for building a display wall is to use a high-end graphics machine with multiple graphics pipelines to drive multiple, manually aligned, expensive CRT projectors. The Power Wall at University of Minnesota and the Infinite Wall at University of Illinois at Chicago [Czer97] are examples that use up to four CRT projectors. This approach is very expensive and it does not scale. Furthermore, the current systems all use sequential APIs such as OpenGL. Graphics primitives need to be “fanned-in” through the API and then “fanned-out” to the parallel rendering hardware, an obvious bottleneck for large-scale visualization.

Our display wall is constructed from multiple commodity components (projectors, cameras, graphics accelerators, speakers) in PCs connected by a fast, system area network. This leads to high performance coupled with low cost, since it tracks technology well. A schematic of our current system is shown in Figure 4. We use eight graphics accelerator cards inside PCs attached to a Myrinet network to drive an array of eight 1024x768 LCD projectors. The images rendered by the eight servers are projected on an 18' x 8' rear-projection screen in a 4x2 grid to form a single image covering the entire screen. Meanwhile, several cameras located around the room track user positions and gestures, and multiple speakers deliver spatialized sound. The effective display resolution of this system is around 4000x1500 pixels, while the potential rendering performance is eight times that of a single PC. Soon, we expect to upgrade our system to include fifteen projectors arranged in a 5x3 grid, allowing displays with 20 million pixels per frame rendered at 50 million polygons per second with an aggregate textured fill rate of over 1 GPixel per second. We expect such a large-scale, high-resolution, and high-performance display system to enable new methodologies for scientific visualization, which may lead to new types of insights.
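The arithmetic behind the tiled layout is simple, and the sketch below makes it explicit: an ideal, perfectly abutted 4x2 grid of 1024x768 projectors yields a 4096x1536 pixel canvas, slightly more than the roughly 4000x1500 effective resolution quoted above once projector overlap and calibration are accounted for. The function names are ours, for illustration only.

    def tile_layout(cols=4, rows=2, tile_w=1024, tile_h=768):
        """Screen-space rectangle covered by each projector tile, assuming an
        ideal abutting grid (a real wall needs calibration and edge blending)."""
        return {(c, r): (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
                for r in range(rows) for c in range(cols)}

    def tile_of_pixel(x, y, tile_w=1024, tile_h=768):
        """Which projector (column, row) a screen pixel falls on."""
        return (x // tile_w, y // tile_h)

    layout = tile_layout()
    print(len(layout))               # 8 projectors
    print(4 * 1024, 2 * 768)         # 4096 x 1536 ideal pixel canvas
    print(tile_of_pixel(2500, 900))  # -> (2, 1)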

Figure 4: A scalable display wall system.

The physical size of the display wall is important because it allows a user to visualize and interact with rendered objects at a human-scale, which can be critical to perception and evaluation of complex objects (e.g., for visualization of complex molecules). The wall covers a large field of view of the user, providing the user with an immersive experience that is particularly compelling when combined with surrounding audio output. Also, such a large-scale display device can enhance collaborations between many people simultaneously viewing and discussing visual data.

The architecture employs a scalable cluster to drive the wall (a system area network connecting many PCs and graphics accelerators), achieving scalable rendering performance and resolution. System area networks such as Myrinet [Bode95] provide adequate bandwidth today and scale with technology. Each PC drives only a single graphics accelerator, and a parallel API is used to alleviate the performance bottleneck in conventional approaches.

We must address several research issues, including automatic methods for projector calibration, efficient communication, multi-speaker sound systems, camera-based tracking algorithms, and parallel rendering algorithms. We focus here on the challenges in parallel rendering, particularly in the context of large scientific data sets.

Parallel Rendering. Our goal here is to develop fast 3D rendering algorithms that execute efficiently over a network of PCs, a subset of which project to portions of the display wall. Since the final image on the wall is composed of multiple sub-images corresponding to the regions updated by different projectors, there is a natural image-parallel decomposition of the rendering computation across the PCs attached to the projectors. We can sort 3D graphics primitives to be rendered based on their overlaps with the projection regions on the wall. Then, each PC must render only the subset of primitives overlapping its projector's region. However, if the graphics primitives are not uniformly distributed over the screen (e.g. the distribution of stars in a galaxy is highly nonuniform), or if we have available more graphics processors than there are projectors, then this simple, static partitioning approach does not achieve optimal performance. Thus, we need dynamic, scalable algorithms.

We are investigating a sort-first approach [Muel95] in which multiple client PCs distribute 3D graphics primitives to multiple server PCs that render images for separate regions (tiles) of the display. We use sort-first to leverage the tight coupling between geometry and rasterization processors inside typical PC graphics accelerator cards, and to avoid inter-process communication that would be required to composite rendered images in a sort-last approach [Moln94].
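A minimal version of the sort-first step, assuming primitives are already reduced to screen-space bounding boxes, looks like the Python sketch below: each primitive is sent to every tile whose rectangle its bounding box overlaps (so a primitive crossing a tile boundary is rendered by more than one server). This is only a schematic of the idea, not our rendering system.

    def overlaps(box, rect):
        ax0, ay0, ax1, ay1 = box
        bx0, by0, bx1, by1 = rect
        return ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1

    def sort_first(primitives, tiles):
        """primitives: list of (prim_id, screen bbox); tiles: dict tile_id -> rect.
        Returns, for each tile, the primitives its server must render."""
        buckets = {tid: [] for tid in tiles}
        for pid, box in primitives:
            for tid, rect in tiles.items():
                if overlaps(box, rect):
                    buckets[tid].append(pid)
        return buckets

    tiles = {(0, 0): (0, 0, 1024, 768), (1, 0): (1024, 0, 2048, 768)}
    prims = [("galaxy_blob", (1000, 100, 1100, 200)), ("star", (10, 10, 12, 12))]
    print(sort_first(prims, tiles))   # the blob straddles both tiles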

We partition the rendering computation using “virtual tiles,” non-overlapping regions of the screen not necessarily corresponding one-to-one with projection regions (physical tiles). To achieve load balance in complex domains, we allow virtual tiles to be any size or shape, as long as each pixel on the screen maps to exactly one virtual tile. Each virtual tile is assigned to a processor, and each processor renders an image containing all graphics primitives that at least partially overlap its virtual tile. Pixels of a virtual tile rendered on one computer, but projected to the wall by another via its frame buffer, are sent over the network as shown in Figure 5(a).

Figure 5: Virtual Tiling. (a) Compositing virtual tiles. (b) Scene-adaptive virtual tiles.

The two key challenges are to develop algorithms that: 1) compute virtual tiles dynamically for load balance, and 2) sort graphics primitives among virtual tiles in real-time. Both will benefit greatly from collaboration. One goal of effective virtual tiling is that the tiles should match scene regions that require different view parameters, refresh rates, and resolutions. These are determined by application needs. A second goal is load balancing. Here too, higher level insight into the structure of the domain data may help tremendously, especially given the high cost of mismatches. As a very simple example, a graphics researcher would view galactic data as a set of stars or entities to partition. But knowledge that the stars tend to be clumped in well-defined galaxies or clusters, and that density profile data can be used to identify these, may help construct partitions that try to assign whole galaxies to different processors. Figure 5(b) shows another simple example, where one bookcase is placed in each virtual tile. There are many other constraints on partitioning, such as minimizing overlap of primitives with virtual tiles, enabling fast sorting, and matching virtual tile assignment to frame buffer or projector assignment. We will develop algorithms that balance these many factors to adjust virtual tiles dynamically based on input regarding: 1) predicted scene properties, 2) tracked viewing information, 3) measured rendering rates, and 4) higher-level structural input provided by a user.
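As one concrete (and deliberately naive) way to compute load-balanced virtual tiles, the sketch below recursively splits the screen along its longer axis so that each resulting tile receives roughly the same number of primitive centroids. A real partitioner would also weigh the other factors listed above (overlap, sorting cost, projector assignment, and domain structure such as galaxy clumping); the names and the splitting rule here are our own illustrative assumptions.

    def split_tiles(points, rect, ntiles):
        """Recursively split screen rectangle `rect` into `ntiles` virtual tiles
        so each gets a roughly equal share of the primitive centroids `points`."""
        if ntiles <= 1 or len(points) <= 1:
            return [(rect, points)]
        x0, y0, x1, y1 = rect
        axis = 0 if (x1 - x0) >= (y1 - y0) else 1        # split the longer side
        pts = sorted(points, key=lambda p: p[axis])
        left_tiles = ntiles // 2
        k = (len(pts) * left_tiles) // ntiles            # proportional point count
        cut = pts[k][axis]                               # coordinate of the cut
        if axis == 0:
            r_left, r_right = (x0, y0, cut, y1), (cut, y0, x1, y1)
        else:
            r_left, r_right = (x0, y0, x1, cut), (x0, cut, x1, y1)
        return (split_tiles(pts[:k], r_left, left_tiles) +
                split_tiles(pts[k:], r_right, ntiles - left_tiles))

    # Clumped "scene": most primitives in one corner, a few elsewhere.
    pts = [(50 + i, 40 + i) for i in range(20)] + [(3000, 1200), (3900, 100)]
    for rect, members in split_tiles(pts, (0, 0, 4096, 1536), 4):
        print(rect, len(members))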

Having partitioned the scene, the next challenge is to sort primitives among virtual tiles. The data sets, obtained from simulation or observation, are extremely large and are distributed over many PCs and disks. Typically, before sorting we reduce the data by extracting abstract representations that are meaningful to the user. This requires visualization and/or data mining methods. These methods, and hence those for sorting, are largely governed by application needs and data set properties. Let us look at a concrete example: fast visualization of isosurfaces in astrophysical data. Figure 6 in Section 3D.4 shows three isosurfaces in galaxy formation data, with a range of density thresholds. These surfaces were extracted off-line, taking minutes per image. We are developing algorithms to interactively change the threshold for the isosurface, essentially animating the model between the three images shown as the user turns a knob. The scientist can thus track the growth of structures as they progress through different densities. The data are organized in a large grid of voxels, of which only those that are near isosurfaces are interesting. We need to find these quickly, and then partition them across processors for polygon extraction and finally sorting to the virtual tiles. To find interesting voxels, we currently use a threshold-indexing scheme [Cign97] that knows nothing about the structure of the data, and a simple partitioning scheme. Insights into the data will make both aspects much more efficient. For example, a domain user may know that interesting voxels are clustered in a certain way, or that if interesting voxels are found in one area then they will not be found in another. Additionally, our science collaborators viewing this data will doubtless think of new visualization modalities. We will use the structural insights they possess to develop more efficient algorithms for these, which we hope we can generalize to other domains. Other research problems we are pursuing for visualization (relevant to all science areas but not discussed here) include parallel visualization in general and parallel I/O for large data sets.
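The core of threshold indexing can be illustrated with a few lines of Python: a grid cell is "interesting" for a threshold t exactly when its value range [vmin, vmax] straddles t, so indexing cells by their ranges lets us find candidates without scanning the whole volume. The sketch below uses a simple sorted-by-minimum index with a linear filter; it stands in for, but is not, the scheme of [Cign97], and an interval tree or similar structure would avoid the residual scan.

    import bisect

    class IsoIndex:
        """Index grid cells by their value range; a cell intersects the isosurface
        for threshold t when vmin <= t <= vmax."""
        def __init__(self, cells):
            # cells: iterable of (cell_id, vmin, vmax)
            self.by_min = sorted(cells, key=lambda c: c[1])
            self.mins = [c[1] for c in self.by_min]

        def active_cells(self, t):
            hi = bisect.bisect_right(self.mins, t)          # all cells with vmin <= t
            return [cid for cid, vmin, vmax in self.by_min[:hi] if vmax >= t]

    cells = [("c0", 0.1, 0.4), ("c1", 0.35, 0.9), ("c2", 5.0, 7.0), ("c3", 2.0, 4.0)]
    idx = IsoIndex(cells)
    print(idx.active_cells(0.38))   # -> ['c0', 'c1']
    print(idx.active_cells(3.0))    # -> ['c3']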

The next three sections deal with the natural science (“application” or vertical) thrust areas. Each section is divided into a set of common issues or subsections. We describe the common section structure here, briefly identifying some common properties across application areas for each issue. Individual application area sections will then either specialize the key issues to their areas or discuss other issues that arise.

(a) Overview and Goals, including the need for integrative research and training.

(b) Concrete Problems that faculty are currently working or collaborating on. A focus problem or two is chosen for this part, and other problems being pursued are discussed at the end of the section.

(c) Methods and Algorithmic Challenges. Methods used, and why new ones are needed. In many areas, the focus has been on modeling fairly regular domains and scenarios. As our understanding increases, we simulate more realistic domains that are irregular and require space- and time-adaptive methods as well as application insights to be efficient.

(d) Parallel Computing Challenges: The increasing complexity of the models results in algorithms and data access patterns that are difficult to parallelize effectively. The need for performance portability (including to clusters) and scalability was discussed earlier. The greater complexity of models as well as machines (e.g. deeper and more distributed memory hierarchies) also challenges parallel programming environments, including domain-specific languages/environments/libraries and performance diagnosis tools. These challenges hold for all application areas.

(e) Visualization Challenges: Complex, large data sets raise new challenges for visualization methods. Enhanced visualization systems raise new opportunities and challenges, including new modalities and functionality.

(f) Other Problems. Some other areas being researched by faculty, in which PICCS students may immediately work.

3D.4 Astrophysics Research Thrusts

(a) Background and Goals. Cosmology stands at a critical juncture. Over the next five years, microwave background experiments will map in subtle detail the properties of density fluctuations three hundred thousand years after the big bang. Large-scale structure surveys will characterize both the properties of galaxies and their large-scale distribution. In many ways, the outstanding problem in cosmology is understanding how the tiny variations in the early universe grew to form galaxies, clusters of galaxies and the large-scale patterns seen in the galaxy distribution. Because light travels at a finite speed, powerful telescopes can observe the distant universe as it was in its earliest stages. The goal of our cosmology program is to be able to start from measured initial conditions and use known, but rich, physics to reproduce and understand the properties of galaxies, the intergalactic medium and the dark matter as a function of time, helping us choose, from the current set of proposed models, the small subset that may pass all tests against observations.

Traditionally, there are three types of astrophysicists: observers, experimentalists and theorists. The complex problems encountered in cosmology require massive computing. Thus, a new breed of theorists, computational astrophysicists, has been born. Princeton has a very strong tradition of using cutting-edge computers to make seminal contributions, dating back to Von Neumann and Peebles. With the complexity of high-performance computers advancing so rapidly, and the problems being very challenging for parallel computing, success in computational astrophysics increasingly necessitates detailed knowledge of both sides. Additionally, simulations and especially ongoing observational programs will produce massive data sets with complex structure. Visualization and data mining techniques will be needed to interpret and analyze the data. The required knowledge for computational astrophysicists is so demanding that a formal program such as PICCS to train the next generation of students is necessary.

(b) Specific Ongoing Research. Princeton has for some time been a center for numerical cosmology. Renyue Cen and Jeremiah Ostriker have developed state-of-the-art cosmological hydrodynamic codes. These simulations have played an important role in modeling hot gas in clusters, gravitational lensing, and the relationship between galaxy environment and galaxy formation. They have revolutionized our understanding of the Lyman alpha forest, the gas clouds that are the building blocks of galaxy formation. Princeton is also a center for observational cosmology. The Sloan Digital Sky Survey, a collaboration of institutions including Princeton, is in the process of mapping the "nearby universe". Its terabytes of data will contain positions of 100 million galaxies and 1 million quasars, and will vastly improve our knowledge of the current universe. Princeton scientists, including Spergel, are part of the NASA Microwave Anisotropy Probe (MAP), which will obtain a map of the microwave sky and yield initial conditions for numerical simulations.

Finally, research collaborations are under way between CS and astrophysics in parallel computing and visualization for challenging problems in cosmology, as described earlier, fueled by equipment grants from NSF.

(c) Methods and Algorithmic Challenges. Cosmological simulation is extremely challenging because it is an inherently non-linear process, involving the rich coupling between non-linear gravitational dynamics, hydrodynamics, star formation feedback and radiative transfer. There is a very wide range of length and time scales, so the simulations require large grids (with the state of the art requiring specification of more than 100 variables per grid point) and spatially adaptive time-stepping techniques that must also lend themselves well to parallelism. A typical simulation is performed over many time-steps. In each step, the basic methods can be divided into those used to model gas and those for dark matter. Dark matter interactions are mostly gravitational, and many methods can be used depending on the resolution needed. These include particle-mesh (PM), particle-particle/particle-mesh (P3M) and tree methods, as well as a combination of tree and particle-mesh methods called TPM. The latter two are discussed later. For gas, Eulerian and Lagrangian methods are used, including moving-mesh methods, and combined with gravitation.

We have developed a range of algorithms for simulating the inherently multi-scale physics of structure formation. Even our mesh-based codes are already moving toward increasing spatial adaptivity, to resolve a larger range of scales. These approaches need to be developed further, and to take proper advantage of adaptation we must develop methods for adaptive time-stepping as well. We must also model radiative transfer better, which is in its infancy in cosmology.

(d) Parallel Computing Challenges. Our cosmological codes run either on vector machines or using explicit message passing. Galaxies are very highly clustered, with large density contrasts, which makes load balancing challenging. The codes also have irregular, time-varying access patterns, and do not scale very well. While a coherent SAS model seems very well suited to such codes, we do not have any implementations for it. Thus, there are three important challenges: partitioning the applications for load balance and locality in general, understanding how to do this on DSM machines for scalable performance, and understanding how to structure them for clusters. The various methods also provide a rich set of very challenging application codes to push the system software research in the CS department. They have irregular and dynamically changing properties, complex dependences and synchronization needs, and fine-grained and long-range data access and communication patterns. In addition, galaxies that were initially in one part of the space tend to move in a highly non-linear fashion across space, requiring repartitioning, and the data sets needed are extremely large. Many sophisticated algorithms can be used, and their tradeoffs in a parallel environment are not understood. Most codes were initially developed for vector machines, and must now be redesigned for efficient parallelism. Parallel computing research could hardly hope for a richer set of challenges, together with an eager and sophisticated user community. The parallel methods developed will be widely applicable to other areas as well.

As a concrete example, let us focus on the TPM method. One of its components, a hierarchical N-body (tree) method used to calculate gravitational interactions at a range of length scales, provides a good example of interdisciplinary research being important and making contributions to both sides (as promised in the introduction). Hierarchical N-body methods take advantage of the insight that the strength of gravitational interaction falls off rapidly with distance, so interactions with particles that are far away can be computed less accurately than with particles that are close by; this can be extended hierarchically. To facilitate a hierarchical approach, the three-dimensional space holding the galaxies is represented as a tree. Internal nodes of the tree are recursively subdivided space cells, and the leaves are particles. The tree is highly imbalanced (nonuniform), being deeper in denser regions. It is rebuilt every time-step since the positions of the particles change. In the Barnes-Hut algorithm [BH], the tree is traversed (partially) once per body to compute the net force acting on that body, and the traversal descends the tree more in regions that are closer to that body in space.
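To make the tree idea concrete, here is a compact, self-contained Python sketch of a 2-D Barnes-Hut force calculation: bodies are inserted into a quadtree whose internal cells store total mass and center of mass, and the force on a body is computed by opening only the cells that are too close relative to their size (the standard opening-angle criterion). It is a toy (softened gravity, no handling of coincident bodies, 2-D rather than 3-D), not the production codes discussed in the text.

    import math

    class Cell:
        """Quadtree cell: stores total mass and center of mass of its bodies."""
        def __init__(self, cx, cy, half):
            self.cx, self.cy, self.half = cx, cy, half   # cell center and half-width
            self.mass, self.comx, self.comy = 0.0, 0.0, 0.0
            self.body = None       # (x, y, m) while the cell holds a single body
            self.children = None   # four sub-cells once subdivided

        def insert(self, x, y, m):
            if self.mass == 0.0:                 # empty leaf: store the body here
                self.body = (x, y, m)
                self._add(x, y, m)
                return
            if self.children is None:            # occupied leaf: subdivide first
                self._subdivide()
                bx, by, bm = self.body
                self.body = None
                self._child(bx, by).insert(bx, by, bm)
            self._add(x, y, m)
            self._child(x, y).insert(x, y, m)

        def _add(self, x, y, m):
            tm = self.mass + m
            self.comx = (self.comx * self.mass + x * m) / tm
            self.comy = (self.comy * self.mass + y * m) / tm
            self.mass = tm

        def _subdivide(self):
            h = self.half / 2.0
            self.children = [Cell(self.cx + dx * h, self.cy + dy * h, h)
                             for dx in (-1, 1) for dy in (-1, 1)]

        def _child(self, x, y):
            return self.children[2 * (x >= self.cx) + (y >= self.cy)]

    def force(cell, x, y, m, theta=0.5, eps=1e-3):
        """Softened gravitational force on body (x, y, m), using a cell's center of
        mass whenever the cell is small relative to its distance (criterion theta)."""
        if cell is None or cell.mass == 0.0:
            return 0.0, 0.0
        dx, dy = cell.comx - x, cell.comy - y
        dist = math.hypot(dx, dy) + eps
        if cell.children is None or (2.0 * cell.half) / dist < theta:
            if cell.body is not None and cell.body[:2] == (x, y):
                return 0.0, 0.0                       # skip self-interaction
            f = m * cell.mass / (dist * dist)         # G = 1 in code units
            return f * dx / dist, f * dy / dist
        fx = fy = 0.0
        for c in cell.children:
            cfx, cfy = force(c, x, y, m, theta, eps)
            fx, fy = fx + cfx, fy + cfy
        return fx, fy

    root = Cell(0.0, 0.0, 1.0)                        # square domain [-1, 1]^2
    bodies = [(0.1, 0.2, 1.0), (-0.4, 0.3, 2.0), (0.5, -0.5, 1.5), (0.45, -0.45, 0.5)]
    for bx, by, bm in bodies:
        root.insert(bx, by, bm)
    print(force(root, 0.1, 0.2, 1.0))                 # net force on the first body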

Parallelizing tree codes is itself challenging, since they are dynamic and irregular. Parallel partitioning algorithms had been developed for message passing. A problem in message passing was that without a shared address space, it was difficult for a particle to know which node owned the particles it needed during its tree traversal. The solution was to use a partitioning method that produced rectangular but load balanced partitions. The load balancing itself took advantage of deep insight into the problem domain, and was developed by an interdisciplinary astrophysics/CS researcher [Salm91]. Then, a separate phase of parallel computation was introduced, in which each processor communicated to every other the particles and cells from its portion of the tree that they might need. The additional computational (not communication) cost of this phase turned out to be the largest bottleneck by far. By exploiting insights into both the nature of the problem and aspects of SAS architectures, Singh realized three things: (i) the major problem that necessitated the second phase was naming data, which the SAS took away, (ii) the nature of the problem was such that there was a lot of temporal locality, so that hardware caching would work well without explicit management of main memory, and (iii) the tree is an encoding of space, and partitioning the tree is a lot cheaper than partitioning space. He developed a new partitioning approach that did not produce rectangular partitions (which are no longer needed) and that was much more successful [Sing95a]. Interestingly, even message passing implementations now use a similar partitioning approach, and emulate an application-specific SAS in software
[Warr94]. The approach is used in other application domains too, and the work led to many architectural insights: the advantages of hardware-coherent SAS for many irregular applications, and that caches can indeed work very well for many scalable scientific applications. These have prompted Princeton astrophysicists to want to migrate to SAS as their applications become more complex. It is this type of progress that we expect cross-disciplinarily trained PICCS students to make.

With parallelizing a tree code well understood, the next step is the TPM method as a whole. Since cosmological systems consist of many galaxies separated by vast regions of very sparse population, it is not very efficient to use a single tree to represent the entire space, including these sparse regions. Rather, the TPM code uses a coarse-resolution particle-mesh method to compute long-range forces across distant galaxies, and a separate tree code to compute forces in each dense (clustered) region. With many trees (dense regions), should we parallelize only across trees, only within trees, or use a hybrid? Also, the load balancing needed for the PM method and that needed for the tree methods will likely ask for different assignments of particles to processors, thus causing further communication and data locality problems across computational phases that are not seen by examining only individual algorithms. There are many challenges and tradeoffs, current message passing codes do not scale well, coherent SAS affords new possibilities, and clusters complicate matters further. Overall, this is a very rich space for collaboration. In fact, our ongoing collaboration has already improved the parallel performance of a mainstay cosmology code by a factor of 2.
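One of the simplest points in this design space, parallelizing both across and within trees, can be sketched as a scheduling problem: give each dense region a processor group sized roughly in proportion to its estimated work, so large trees are parallelized internally while small ones get a single processor. The sketch below shows only this proportional assignment, under the assumption that there are at least as many processors as trees; it is an illustration, not the strategy we will necessarily adopt.

    def assign_groups(tree_costs, nprocs):
        """Size a processor group for each tree in proportion to its estimated cost.
        Assumes len(tree_costs) <= nprocs so every tree gets at least one processor."""
        assert len(tree_costs) <= nprocs
        total = float(sum(tree_costs))
        groups = [max(1, round(nprocs * c / total)) for c in tree_costs]
        while sum(groups) > nprocs:                  # trim rounding excess
            groups[groups.index(max(groups))] -= 1
        while sum(groups) < nprocs:                  # hand out leftover processors
            groups[groups.index(min(groups))] += 1
        return groups

    # Hypothetical per-tree work estimates (e.g. particle counts or last step's time).
    print(assign_groups([100, 30, 5, 5], nprocs=16))   # -> [11, 3, 1, 1]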

Finally, parallelizing the adaptive grid methods that will be increasingly used is less well understood than parallelizing tree methods. These too are irregular and present challenges in load balancing, communication and partitioning cost. Adaptive time-stepping, which is increasingly viewed as critical to use with spatially adaptive methods, increases the challenges for both sequential algorithms and parallelism.

(e) Visualization Challenges. Visualization is indispensable for understanding the complex temporal and spatial structures that form during cosmic structure formation [Cen93a,b, Cen94, Cen97]. Many statistical techniques are used to compare model simulations with the observed universe [Cen93b, Gott94]. With fine-tuning of models to meet tests, the differences among contending models become very subtle. Thus, better statistics are required. This is difficult, in large part because we must rely on what is known and seen to propose sensible statistics. With high-resolution and immersive 3-d visualization, “what is seen” will have an entirely different meaning. As a simple example, when we saw the spectacular 3-d images of the Lyman alpha clouds at different densities being formed (see isosurfaces in Figure 6 [Cen97]), it was immediately apparent that a shape measure would be most relevant for the structures in question.

Figure 6. 3-d images of the Lyman alpha clouds at isodensities of 3,10,30, respectively; Cen97.

Data sets are very large. A single output at a fixed time has size 10-20 gigabytes for routine simulations, and many of these need to be visualized together. Additionally, each simulation has different types of particles and gases. Fast visualization requires memory to be tens of GB to hold the data, or much smarter algorithms. Extracting abstracted data and then visualizing it raises many algorithmic challenges that need close understanding of astrophysics and visualization, as discussed earlier in Section 3D.3. A simpler example of visualization needs is in trying to understand the relationship between the temperature of a region and the density of galaxies there at a fixed epoch. This needs the two variables to be superposed in the complex data set in an innovative, easy-to-see fashion. An especially useful feature is the ability to zoom into specific regions in order to see in any detail the actual structures being formed, and to “walk through” the data set to improve this ability. We can imagine many visualization scenarios to deal with these rich and complex data, especially in rich environments like the display wall; implementing them will be challenging.

Cosmological visualization is also valuable for education and outreach (see training section). We will also collaborate with the Hayden Planetarium (Neil Tyson and Frank Summers) in the Digital Galaxy, a NASA project to construct a 3-d model of our Galaxy for use by both scientists and the general public. This extremely demanding
visualization program will require both innovative ideas from computer science and an understanding of the underlying astrophysics.

(f) Other Ongoing Research. We have focused mainly on simulation above. Interpreting and analyzing massive observational data sets are also active research areas at Princeton. Determining the statistics of microwave background data requires inverting a million-by-million matrix using standard techniques. Spergel and his collaborators have developed fast algorithms for this analysis [Oh98]. However, even these faster algorithms will require scalable parallel machines to interpret the MAP data in a timely way when it arrives in early 2001. Interpreting, cataloging and analyzing the data from the Sloan Sky Survey over the next few years will require the use of new data-mining techniques, massive database techniques and large parallel machines to calculate the statistical properties of galaxies. These resemble commercial data mining problems that are important drivers for multiprocessors and visualization. Computational research in plasma physics is ongoing at PPPL, in turbulence and complex multi-scale dynamics. The simulations need the highest performance, more adaptive methods, and multiple physics. PPPL scientists are already looking to CS for expertise in parallel computing and visualization, and will benefit from shared problems with other departments as well.

3D.5 Computational Biology Research Thrusts

Our major existing driving problems come primarily from two areas of ongoing collaboration between CS and Biology scientists, treated separately here: protein structure determination, and the nascent area of simulating immune response. We also discuss other related areas in each case where research is ongoing in Biology and we anticipate CS collaboration, including a very rich area in a Center for Genomic Sciences about to be initiated at Princeton.

A. Protein Structure Determination

(a) Background and Goals. The determination of molecular structure is a key element of molecular biology. Proteins and nucleic acids are complex three-dimensional structures with thousands of atoms. An understanding of their structure can facilitate the study of how these molecules function and aid in the design of drugs which augment, interfere with, or otherwise affect their functions. The problem is usually approached experimentally, using X-ray crystallography, nuclear magnetic resonance (NMR), or other methods that yield distances, angles, and other structural information. Theoretical or computational methods include molecular dynamics [Levi88], distance geometry [Crip88], energy minimization [Nilg88] and threading sequences onto known structures. The computational problem is extremely challenging because of the complexity of the energy landscape, including many local minima. However, it is extremely important for a wide variety of purposes, and is the "holy grail" of computational biology. In practice, computation is also used to refine predicted or experimentally determined structures, or to determine structure starting from a decent guess. The need for innovative algorithms, large-scale parallelism, clusters, and sophisticated visualization is well recognized, and the area is far less mature algorithmically, or even in successful approaches, than cosmology.

(b) Specific Ongoing Research. We are conducting research on both the computational and experimental sides. We discuss our main ongoing computational project, and then mention others and some experimental projects that require computational approaches as well. The problem we choose is structure determination from noisy data, a collaboration between Singh in Computer Science and Russ Altman at the Stanford Medical School. We used it as a concrete example illustrating the value of cross-disciplinary research in the introduction to this section (i.e. Section 3C).

In addition to experimental data, there is a large body of knowledge from general chemistry (bond lengths, various angles, and volume and surface constraints) and other sources that can be brought to bear on determining a structure. Most earlier approaches predict a single structure. However, the data from these sources are uncertain (noisy), and the quality and abundance of data from different sources can vary significantly. Thus, given a protein whose sequence we know, our goal is to cast the available data as a set of (noisy) constraints, and satisfy the constraints to determine not only a 3-d structure but also a measure of the variability in the estimated structure. The resulting structure can then serve as the starting point for refinement techniques, or to guide the search for appropriate drug binding sites, or to represent families of protein structures for homology-based approaches (core areas of a family will have low variance).

(c) Methods and Algorithmic Challenges. We explicitly represent and manipulate the uncertainty in our structure-determination algorithm, representing both the mean positions of atoms and the variances and covariances of those positions. If every constraint (e.g. the distance between two atoms) can be represented as a Gaussian distribution with a mean and variance, then we can use the constraints to update the current estimate of the structure with an iterated (since constraints are nonlinear) least-squares algorithm [Altm95]. Constraints are introduced to the updating algorithm in chunks of, say, m each. Once all constraints have been introduced in the current iteration, the covariance matrix used by the algorithm is “reheated” or zeroed, and the constraints are introduced again in a different order in the next iteration. This process repeats until the structure converges. The filter consists of a set of dense and sparse matrix and matrix-vector computations. Unfortunately, many constraints are not adequately viewed as simple Gaussians. We model them as multi-component or mixture Gaussians, obtained using expectation-maximization methods [Chen94]. We extend the algorithm to handle mixture rather than single Gaussians, using heuristics to avoid the combinatorial explosion resulting from the “tree” of multi-component constraints. Better algorithmic solutions are needed.
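To make the flow of the filter concrete, the sketch below implements the single-Gaussian version of this idea on a toy problem: the state is a flat vector of atomic coordinates with a full covariance matrix, each distance constraint is applied as a Kalman-style least-squares update, and the covariance is reheated and the constraint order reshuffled on every outer iteration. It is a minimal illustration of the scheme described above, not the [Altm95] code; the constraint chunking is simplified to one constraint at a time, and the atom positions, variances, and iteration counts are invented.

```python
"""Minimal sketch of the iterated least-squares filter (illustrative only)."""
import numpy as np

def distance_constraint(x, i, j):
    """Predicted distance between atoms i and j, and its Jacobian row."""
    xi, xj = x[3*i:3*i+3], x[3*j:3*j+3]
    diff = xi - xj
    d = np.linalg.norm(diff)
    H = np.zeros(len(x))
    H[3*i:3*i+3] = diff / d
    H[3*j:3*j+3] = -diff / d
    return d, H

def apply_constraints(x, P, constraints):
    """One pass over (i, j, measured_distance, variance) constraints."""
    for i, j, z, var in constraints:
        d, H = distance_constraint(x, i, j)
        S = H @ P @ H + var                 # innovation variance (scalar)
        K = (P @ H) / S                     # gain vector
        x = x + K * (z - d)                 # update mean positions
        P = P - np.outer(K, H @ P)          # update covariance
    return x, P

def solve(x0, constraints, iters=20, reheat=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(iters):
        P = reheat * np.eye(len(x))         # "reheat" the covariance
        order = rng.permutation(len(constraints))
        x, P = apply_constraints(x, P, [constraints[k] for k in order])
    return x, P

# Toy example: three atoms with noisy pairwise distance constraints of 1.0.
x0 = np.array([0., 0., 0.,  1.2, 0., 0.,  0., 0.9, 0.])
constraints = [(0, 1, 1.0, 0.01), (1, 2, 1.0, 0.01), (0, 2, 1.0, 0.01)]
x, P = solve(x0, constraints)
print(np.linalg.norm(x[0:3] - x[3:6]))      # should approach ~1.0
```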

Finally, since the energy landscape of structures is so complex, we take advantage of knowledge about the structure of proteins to dramatically improve performance. In particular, proteins are hierarchically structured, so we can apply the approach hierarchically. Conceptually, we first apply it to determine the structure of individual amino acids, then use these and other known constraints to determine secondary structures, and so on recursively up to larger substructures. The use of hierarchy (simplified here) dramatically reduces computational complexity without sacrificing accuracy [Chen98]. In practice, either graph partitioning or manual adjustment is used to construct the hierarchies, which do not match biological substructures exactly. The approach is promising but in its early stages: substantial algorithmic challenges remain in handling real problems robustly, constructing better hierarchies automatically, dealing with kinds of constraints other than distances (angles, volumes, surfaces, etc.), and testing and improving the methods.
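The back-of-the-envelope sketch below shows why the hierarchy pays off. It assumes, purely for illustration, that a single filter solve over n units costs on the order of n cubed (the covariance manipulations are matrix operations); the tree shape and atom counts are invented, and the sketch ignores constraints that cross levels of the hierarchy.

```python
"""Why solving hierarchically is far cheaper than one flat solve (toy model)."""

def flat_cost(n_units):
    # Assumed per-solve cost model: cubic in the number of units solved together.
    return n_units ** 3

def hierarchical_cost(tree):
    """tree is an int (leaf: atoms solved directly) or a list of subtrees
    (internal node: solve children first, then one solve over the child
    units, each child treated as a single unit)."""
    if isinstance(tree, int):
        return flat_cost(tree)
    return sum(hierarchical_cost(t) for t in tree) + flat_cost(len(tree))

# A protein-like toy: 20 secondary structures, each 15 residues of 10 atoms.
residue = 10
secondary = [residue] * 15
protein = [secondary] * 20
n_atoms = 20 * 15 * 10

print(flat_cost(n_atoms))          # ~2.7e10 "operations" for the flat solve
print(hierarchical_cost(protein))  # ~3.8e5, orders of magnitude smaller
```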

(d) Parallel Computing Challenges. Despite all the optimizations, the algorithms are extremely expensive. There are few objective measures for how good an algorithm is in this area, so it must be tested on real problems, and solving even a useful test problem can take days or weeks on a fast workstation. Parallelism is available at many levels: across independent nodes in the tree that represents the hierarchy (e.g. independent supersecondary or secondary structures); across independent branches in the multi-component algorithm; and within the update for a set of constraints in a single basic least-squares algorithm. Interestingly, the biologists had independently tried to parallelize the basic single-Gaussian algorithm, but they parallelized across constraint-set updates, which incurs high costs on real systems and did not yield significant parallel speedups. Our approach was to parallelize within a constraint-set update, which implies well-understood parallel matrix computations [Chen94]. Exploiting multi-component parallelism reduces interprocessor communication due to its coarser granularity, but makes it much more difficult to balance the workload across diverse constraint sets. Moreover, in the hierarchical approach, the tree nodes at the lowest level of the hierarchy are too small to afford good parallel performance, so we need to exploit parallelism across nodes in the hierarchy as well.

This presents interesting and novel challenges for load balancing and locality. Dividing tree nodes statically among processors, or even based on workload estimates, will not yield load balance, and the computation times for many constraint types are not predictable. The common approach of augmenting a good static partition with task stealing for dynamic load balancing (each task being a sub-piece of a node computation) does not work well here: within each node, the tasks have to synchronize often, so tasks that are stolen must be merged back frequently, which is expensive. Driven by this application, we have recently developed a new, general dynamic load balancing approach called dynamic regrouping, in which groups of processes that are done with their assigned work dynamically and permanently join other groups instead of stealing work from them. This approach needs further development, but it is applicable to a wide variety of other problems as well [Chen99]. It is a very good example of in-depth work in an application area resulting in a generally applicable computer science contribution. The application also has irregular access patterns and high memory access and communication times, for which new approaches must be developed; scalable parallel performance, especially on clusters, will be quite challenging. Finally, in addition to the standard challenges for programming environments that apply to most of these application domains, algorithms in this area are not very mature, and they must be tested and rapidly altered and prototyped with realistic rather than scaled-down problems, i.e. in parallel, raising interesting questions for programming languages and interfaces. Several results from this collaboration have already been published in prestigious systems, computational biology, and algorithms conferences.
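The toy scheduling simulation below conveys the regrouping idea; it is our illustration of the policy, not the [Chen99] runtime system. Each group of processors works on one tree node, and when a group finishes it permanently joins the group with the most remaining work per processor rather than stealing tasks from it. The node workloads and group counts are invented.

```python
"""Toy simulation of dynamic regrouping versus a static node assignment."""

def simulate(node_work, n_groups):
    # Round-robin static assignment; each group's speed = processors it holds.
    groups = [{"procs": 1, "work": 0.0} for _ in range(n_groups)]
    for i, w in enumerate(node_work):
        groups[i % n_groups]["work"] += w
    time = 0.0
    active = [g for g in groups if g["work"] > 0]
    while active:
        # Advance time until the next group finishes at its current speed.
        dt = min(g["work"] / g["procs"] for g in active)
        time += dt
        for g in active:
            g["work"] -= dt * g["procs"]
        finished = [g for g in active if g["work"] <= 1e-12]
        active = [g for g in active if g["work"] > 1e-12]
        # Regrouping: finished processors permanently join the most loaded group.
        for g in finished:
            if active:
                target = max(active, key=lambda a: a["work"] / a["procs"])
                target["procs"] += g["procs"]
    return time

# Highly imbalanced node workloads spread over 4 processor groups.
work = [100, 10, 10, 10, 5, 5, 2, 2]
print(simulate(work, 4))   # ~36 time units here, versus 105 with no regrouping
```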

(e) Visualization Challenges. The fact that structures are represented with explicit variances in atomic coordinates, as well as covariances across them, raises interesting challenges for three-dimensional visualization. It is often important to superimpose computed structures on known ones to determine where the match is or is not good; overall measures will not provide insight into the methods, but specific regions will. The state of the art is for biologists to develop their own rudimentary visualization software on top of commercial packages, but clearly a lot would be gained from immersive visualization in which one can walk through the superposed structures and understand the failures to match.

(f) Other Ongoing Research. Other ongoing computational research is in scalable parallel molecular dynamics and energy minimization techniques, where part of the novelty is in using cache-coherent SAS programming models, and in the simulation of slow processes by sampling smaller systems for shorter periods of time. The latter poses new challenges for both basic methods and for distributed parallel computing over campus-wide clusters of multiprocessors (each computing samples). If successful, it will dramatically increase the efficiency with which ab initio calculations can be performed. On the experimental side, Fred Hughson and Yigong Shi are determining high-resolution structures of proteins and protein-DNA complexes, often comprising many thousands of atoms, using X-ray crystallography. Computational methods play a major role in several aspects of this work. Throughout the process of determining a structure, it is necessary to 'fit' polypeptide chains into electron density maps, a process that, because of the often-poor quality of the initial maps, has largely resisted automation. Thus, fitting is mostly manual, accomplished by means of relatively crude visualization software. Improved methods for readjusting the positions of parts of the polypeptide chain to fit the electron density without violating stereochemical constraints would be immediately useful to the field. Immersive visualization methods in which such constraints were encoded as physical resistance might revolutionize this aspect of structure determination. A second challenge is to further develop the algorithms used for improving structural models by refinement against the X-ray data. Refinement of thousands of parameters against 100,000 or more measurements is a large optimization problem, which currently relies on simulated annealing at significant computational expense (some "experiments" require a week or more on an SGI R10000 workstation). Better basic algorithmic approaches, and especially parallel methods on workstation or SMP clusters, are needed.
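For readers less familiar with the refinement step, the skeleton below shows the generic simulated-annealing loop referred to above, applied to a trivially small fitting problem. It illustrates the optimization technique only: a real crystallographic target compares calculated and observed structure-factor amplitudes and includes stereochemical restraint terms, and all parameter values and function names here are arbitrary.

```python
"""Generic simulated-annealing skeleton (illustrative, not a refinement package)."""
import math, random

def anneal(x0, energy, step, t0=1.0, cooling=0.95, iters_per_t=200, t_min=1e-3, seed=0):
    rng = random.Random(seed)
    x, e = list(x0), energy(x0)
    t = t0
    while t > t_min:
        for _ in range(iters_per_t):
            cand = step(x, rng)
            e_cand = energy(cand)
            # Metropolis rule: always accept improvements, sometimes accept worse.
            if e_cand < e or rng.random() < math.exp((e - e_cand) / t):
                x, e = cand, e_cand
        t *= cooling           # slowly lower the temperature
    return x, e

# Toy "refinement": fit two parameters against 20 noisy observations.
obs = [(i, 2.0 * i + 1.0 + 0.1 * ((-1) ** i)) for i in range(20)]

def energy(p):
    a, b = p
    return sum((a * i + b - y) ** 2 for i, y in obs)

def step(p, rng):
    return [v + rng.gauss(0.0, 0.05) for v in p]

print(anneal([0.0, 0.0], energy, step))   # should end near a ~ 2, b ~ 1
```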

B. Simulation of Complex Dynamic Biological Systems: Computational Immunology

(a) Background and Goals. Reductionism has been, and continues to be, a powerful tool for understanding the immune system. However, many fundamental questions require a systems-level understanding of immune system dynamics. Many hypotheses propose how microscopic mechanisms lead to macroscopic properties, but rigorously testing these hypotheses in the lab is time-consuming and costly (when it is possible at all). Simulations are a low-cost alternative to “wet-lab” experiments. More fundamentally, modeling may allow us to understand the immune system as a system; empirical results open important but only static windows, and do not enable us to study the real dynamics of adaptive immunity. This is a very young area with many open problems in representation, algorithms, and parallel computing to tackle realistic systems and receptor repertoires [Ande94, Merr98, More98]. Scientists trained solely in immunology or computing are not adequately prepared to address the challenges of immune system modeling and simulation, which include (a) models that incorporate qualitative and quantitative experimental observations along with associated uncertainties, (b) algorithms to simulate detailed mechanisms that span multiple time and space scales, (c) techniques to search large, multi-dimensional spaces to understand which parameter combinations fit empirical data, and (d) formal frameworks that can express complex dynamics and constraints of the type found in the immune system.

(b) Specific Ongoing Research. A Biology-CS collaboration between Weigert and Singh has developed a simulation of the basic immune response, using a simulator called IMMSIM++ [Cela92, Seid94]. It consists of cooperating models, which are continually being refined as we focus on specific aspects of the response (usually in support of an in vivo experimental objective). We intend this evolving simulator to become a comprehensive tool that immunologists can use.

Currently, we are investigating the generation of antibody diversity, selection in the bone marrow and thymus, affinity maturation, clonal selection, isotype switch, the evolution of the immune response, antigenic shift and drift in viral populations, and auto-immunity [Cela96, Morp95, Stew97]. We have three main computational goals: (a) Hypothesis testing: through realistic simulations, affirming or ruling out models proposed to explain experimental data; (b) Identifying the critical parameters by elucidating correlations between simulation parameters and output measures; (c) Experiment planning & interpretation: determining the best data to collect in order to understand the key features of an immune response, and determining the proper interpretation of these data with respect to the underlying process.

For illustration, we focus here on hypothesis testing, specifically to evaluate a proposed hypothesis for germinal center (GC) dynamics [Kepl93]. GCs are structures that form in the lymph nodes and spleen during an immune response. Empirical work has led to intimate knowledge of many GC features: selection and proliferation sites, DNA sequences of responding antibodies, flow rates through the GC, affinity of mutated antibodies relative to their progenitors, etc.

(c) Methods and Algorithmic Challenges. We simulate the hypothesis and perform statistical analysis to see how well it explains the data. One class of immune models we could use is differential equation-based models, which deal with cells at the population level. They group a diverse cell population into a small number of equivalence classes, based on a few characteristics. One shortcoming of this approach is its failure to capture the uniqueness of individual cells, which is a critical property of immune system dynamics. Our approach is to represent the cells and molecules in the system explicitly. This is much more expensive computationally, but in addition to being more realistic it has many other benefits; e.g. non-linear effects do not present difficulties, and spatially distributed effects can be represented.

Many algorithmic challenges arise when including realistic detail in the simulation. For example, the choice of how to represent interactions between biological molecules can have serious computational consequences. One of the most important simplifications we employ is the use of bit-strings and a matching function, such as Hamming distance, to model molecular binding properties. To incorporate more realistic properties into our binding models, we are forced to choose between more sophisticated algorithms based on the bit-string representation and new representations that model the biological reality more closely in time and space. Several other such tradeoffs exist.
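As a concrete illustration of the representation tradeoff, the sketch below implements the simplest version of the bit-string model: receptors and epitopes are short bit-strings, the matching function counts differing (complementary) bits, and only matches above a threshold bind. The bit length, threshold, and affinity formula are illustrative choices of ours, not those of IMMSIM++.

```python
"""Toy bit-string binding model with a Hamming-distance match function."""
import random

N_BITS = 16

def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def affinity(receptor, epitope, threshold=12):
    """Crude affinity: enough complementary bits bind, with strength
    growing toward a perfect (all-bits-differ) match."""
    match = hamming(receptor, epitope)
    if match < threshold:
        return 0.0
    return (match - threshold + 1) / (N_BITS - threshold + 1)

rng = random.Random(0)
epitope = rng.getrandbits(N_BITS)
repertoire = [rng.getrandbits(N_BITS) for _ in range(10000)]
binders = [(r, affinity(r, epitope)) for r in repertoire if affinity(r, epitope) > 0]
print(len(binders), max(a for _, a in binders))  # only a small fraction of clones bind
```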

An interesting challenge is the large number of simulation parameters that must be understood and then fitted to experimental data. Many parameters are constrained by experimental knowledge, but others are not. We run many simulations to discover what, if any, regions of parameter space provide the best fit to experimental data. This is critical to understanding whether or why simulations can match reality. Algorithms to automate this process will be invaluable.
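A minimal version of such a parameter sweep is sketched below. The simulate function is a cheap stand-in for a full simulation run, and the parameter names, ranges, observed value, and tolerance are all invented for illustration; in practice each evaluation is an expensive run, the sweeps are executed in parallel, and the goal is to map out the region of parameter space consistent with the data.

```python
"""Illustrative brute-force sweep over poorly constrained simulation parameters."""
import itertools
import math

def simulate(mutation_rate, selection_strength):
    """Stand-in model returning a single summary statistic of a simulated response."""
    return 10.0 * mutation_rate * math.exp(-selection_strength) + selection_strength

observed = 2.5        # summary statistic measured in the lab (illustrative value)
tolerance = 0.2

mutation_rates = [i / 10 for i in range(1, 11)]        # poorly constrained parameter
selection_strengths = [i / 2 for i in range(1, 9)]     # poorly constrained parameter

fits = [(m, s)
        for m, s in itertools.product(mutation_rates, selection_strengths)
        if abs(simulate(m, s) - observed) < tolerance]
print(len(fits), fits[:5])   # the parameter combinations consistent with the data
```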

(d) Parallel Computing Challenges. Many questions cannot be answered without simulating realistic-size immune systems, and this becomes increasingly true as our basic understanding increases; parallelism is already becoming necessary. Preliminary experience suggests that the simulators have a lot of parallelism both across and within component models. However, load balancing is very challenging. The most natural decomposition of the simulation assigns processors local areas of physical space. Unfortunately, cells and molecules are often produced at high levels in small local environments, not uniformly across space. Additionally, computationally demanding sites like germinal centers form dynamically during a simulated immune response, introducing a temporally adaptive component as well. As we begin to simulate physical space more realistically, the challenges will increase. The spatial and temporal adaptivity are different and more dynamic than in cosmology or protein structure, so new challenges will arise. Finally, a long-term goal is to develop simulators of the entire immune response pathway, which immunologists can then use to study their issues of choice. In addition to modeling and algorithmic issues, this will be challenging for parallelism. Searching the parameter space, discussed above, can also use parallelism effectively, but it is a much easier parallel problem.
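The sketch below illustrates the load-balancing issue in its simplest form; it is our illustration, not the simulator's actual decomposition. A 1-D strip of lattice sites is split among processors by estimated work rather than by site count, and the split must be recomputed once a germinal-center-like hotspot forms. The site counts and work estimates are invented.

```python
"""Workload-weighted spatial partitioning of a 1-D strip of simulation sites."""

def partition(work_per_site, n_procs):
    """Return the site index at which each processor's range ends."""
    total = sum(work_per_site)
    target = total / n_procs
    cuts, acc = [], 0.0
    for i, w in enumerate(work_per_site):
        acc += w
        if acc >= target and len(cuts) < n_procs - 1:
            cuts.append(i + 1)
            acc = 0.0
    return cuts

# Uniform work early in the response; later a germinal-center-like hotspot forms.
early = [10] * 64
late = [10] * 64
for i in range(20, 26):
    late[i] = 300              # a few sites come to dominate the work

print(partition(early, 4))     # even split: [16, 32, 48]
print(partition(late, 4))      # boundaries crowd around the hotspot
```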

(e) Visualization Challenges. Unlike the situation in astrophysics, where spatial distributions of positions and properties are critical, simply knowing these aspects for entities in the immune system does not help a great deal in understanding. It is more important to know the relationships among the various entities, such as which cells can bind each other and how strongly. Complex relationships arise because each cell can present many different types of receptors on its surface. Each type of receptor allows the cell to interact with a different population of cells and molecules. There are many unique receptors of each receptor type, and the strength of binding depends on the particular receptors on each cell. To complicate matters further, cells can dynamically change the particular receptors and even the receptor types they present. Devising new methods to visualize these relationships, especially as they change in time, will undoubtedly lead to a more intuitive understanding of immune system dynamics.

(f) Other Ongoing Research. We discuss ongoing research that is very relevant to the PICCS program in three areas of dynamic systems: other projects in immunology; neurobiology; and “integrative biology” in the new Genomic Center.

Immunology. Other research projects include: CLONE, a simulator of cell proliferation [Shlo98]; VKH and TCR, simulations of the DNA rearrangement processes that form antibodies and T cell receptors (in press, J Immunol); mathematical models of thymus development [Mehr98,Stew98]; TRP/LEU, a simulation of mutation and affinity maturation; QUASI, an identifier of advantageous mutations in highly mutated viral populations (manuscript in preparation); and SHANNON, a measure of Shannon entropy and other statistical functions for molecules of immunological interest [Litw92,Stew97].

Neurobiology. John Hopfield’s group works on the mathematics and simulation of ‘neural networks’. This research has two goals, which fit the PICCS integrative model very well: understanding neurobiology and how it uses large, complex networks of nerve cells for its computations; and using that understanding to solve engineering problems, i.e. getting computers to do some of the things that even simple organisms do effortlessly. Simple models are built, and if they do not work, insights from neurobiology are used to enhance them with more biological function. This latter aspect distinguishes the research from neural networks as used in computer science itself, where the neurobiology is abstracted away. It integrally requires understanding the engineering problems being approached (e.g. speech recognition), algorithms, and neurobiology. The long-term emphasis is on large-scale systems that have collective properties. For example, the fixed nature of perception in spite of the quite variable responses of individual nerve cells indicates that higher nervous function must be based on collective aspects of a system. Computer science methods can contribute to the modeling of these systems of systems, and parallel computing to handling the scale of simulations needed.

Center for Genomic Sciences. More generally, biology is at a crossroads in its history. Information about sequence and structure is exploding, with entire genetic blueprints now known. This avalanche of sequence and structure information poses tremendous new challenges, since the focus now has to be on determining function from it. First, we must determine the function of individual genes; second, we must understand how the functions of the roughly 80,000 genes in the human genome are coordinated to achieve homeostasis. The challenge for the future lies in studying the principles that drive the integration of information in complex biological systems. While the reductionist approach that has dominated 20th-century biology has been very successful, and will continue to be, we now have to develop the tools to study how component parts are assembled into, and work together as, a whole. This exciting and fundamental opportunity is the rationale for the new Center for Genomic Sciences being established at Princeton. The center will focus on the necessary science disciplines (biology, chemistry and physics). The research it performs will provide tremendous fodder for computer science research, e.g. in managing complexity, in information mining and management, and in building and studying integrated models. The PICCS program will help provide this linkage.

3D.6 Geosciences Research Thrusts

(a) Overview and goals. General circulation models are in a state of rapid progress, owing substantially to advances in computing technology. These advances are bringing us closer to harnessing the great disparity of length scales inherent in the circulation problem: from small-scale eddies in the oceans and fine-scale structure inside the earth up to the planetary scale, a vast range of dimensions has to be captured. The field has always demanded, and mastered, the highest-performance supercomputer systems. However, the move to parallel computing is a challenging mind-shift, and it is very clear that close collaboration between modelers and computer scientists has become essential.

Understanding earth's circulation systems is a key challenge in modern geoscience. On the short time scale, evolution is dominated by atmospheric and oceanic circulations, which may overturn and rearrange themselves within a few thousand years. Anthropogenic greenhouse gases play an important role in influencing these systems, and modeling their impact on general circulation and global climate is now accepted as a Grand Challenge. The coupled oceanic and climate system is highly complex. Dynamic, thermodynamic, radiative and chemical processes all play a role in determining its course. The system is also influenced by complex biological cycles, which are difficult to assess.

On the other hand, earth's long-term evolution is dominated by solid-state deformation of the inner portion of the earth. This deep-earth circulation is driven mainly by primordial heat, and provides the major driving force for plate tectonics and continental drift. Although it proceeds at speeds of only centimeters per year, it affects all earth processes that occur on geologic timescales, such as the economically important continental shelf and platform stratigraphy, which is controlled predominantly by vertical motions of the continental lithosphere. Modeling the circulation of the earth's interior is difficult: the material properties of rocks at depth are poorly known, and these properties vary strongly with pressure, temperature and deformation history, giving rise to highly non-linear phenomena.

(b) Specific Ongoing Research. Princeton's research in global circulation modeling is strengthened greatly by the presence of the Geophysical Fluid Dynamics Laboratory (GFDL), which focuses on the former category of circulation.

1. A key effort of GFDL ocean research is focused on better capture of the North Atlantic thermohaline circulation. The intensity of this circulation depends on the amount and distribution of fresh water input from surrounding continents, as well as the energetics of flows whose leading-order terms are finer than current ocean model resolutions. Full understanding of this circulation will provide a dramatic increase in our knowledge of century-scale climate dynamics. With current hardware technology and methods, it remains just out of reach. Parallel methods for eddy-resolving models and extension of current coupled-model methods to unprecedented resolution are central areas of research.

2. In the atmospheric model we are focusing on cloud dynamics. The behavior of the tropical atmosphere is dominated by moist, convective turbulent processes that remain well below the resolution of global circulation models. Cloud-resolving models remain the means to simulate these processes. Current methods for cloud-resolving models rely on approximations that are valid for limited areas, which we push well beyond their bounds in order to simulate the elements of the large-scale tropical circulation, such as the Hadley and Walker cells.

3. In our "chlorophyll" ecosystem model, we are using novel productivity and remineralization models to capture nitrate transport in the global circulation model (GCM). Productivity is passed to a remineralization/export module, which splits the total production into the parts that are new and those that are merely regenerated. The nitrate demand is then incorporated into the GCM. The procedure provides valuable diagnostic information on defects in the physical model, by pointing to regions where there is a mismatch between observed productivity and predictions. We will use this model to diagnose what sorts of changes in physical models allow for greater compatibility between satellite data and nitrate fields. This research is done jointly by the Evolutionary and Ecological Biology department and GFDL.


4. In the deep-earth circulation model, we are attempting to constrain the flow history of the earth's interior during the past 150 million years. This is the time of the most recent mantle overturn, associated with large-scale geologic events like the Cordilleran and Alpine orogenies and the opening of the Atlantic system. Reconstructions of lithospheric plate motions provide the geologic input models. Some of the most successful current work tests competing geodynamic hypotheses against information derived from seismic earth models; as a result, deep-earth circulation modeling has been able to explain a large portion of mantle seismic structure in the context of ancient oceanic plate subduction. In addition, we are beginning a major new effort to build a circulation model of earth's molten iron core. This convection system at the center of the planet is responsible for generating earth's magnetic field. Only recently has computing power become sufficient to solve the full set of coupled magneto-hydrodynamic equations associated with core circulation.

(c) Methods and algorithmic challenges. The current class of circulation models is based on relatively uniform grids. However, as discussed above, many dynamically important processes occur at sub-grid scales (eddies, clouds, lithospheric faults), which we do not resolve. We must develop new adaptive-mesh methods that will allow better capture of these processes. Because the algorithmic challenges of these techniques are shared across the disciplines represented in this proposal (like astrophysics), we expect that the PICCS program will provide an excellent venue for frequent interaction with those disciplines. For the atmospheric model we currently work on improved diagnostics for generalized grids. GFDL has an exchange-grid module approach which will be developed to this end.

In the deep-earth circulation model we are implementing a Lagrangian particle-in-cell method to track the advection of non-diffusive chemical species and rare gases inside the earth. The high accuracy of the Lagrangian approach allows us to avoid the problems associated with numerical diffusion. The method is essentially a way to represent the sub-grid physics of chemical evolution, but its cost is substantial and better methods may be devised.
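The 1-D sketch below conveys the essence of the Lagrangian tracer approach for a non-diffusive species: each particle carries its concentration unchanged, and only its position is updated by interpolating the gridded velocity field, so the tracer field itself suffers no numerical diffusion. The domain, velocity field, particle counts, and forward-Euler time stepping are all illustrative choices made for brevity, not the method used in the circulation code.

```python
"""Minimal 1-D Lagrangian tracer advection through a gridded velocity field."""
import numpy as np

L, n_grid, n_part = 1.0, 64, 1000
dx = L / n_grid
u_grid = 0.5 + 0.3 * np.sin(2 * np.pi * np.arange(n_grid) * dx / L)  # cell velocities

x = np.random.default_rng(0).random(n_part) * L           # particle positions
conc = np.where((x > 0.2) & (x < 0.3), 1.0, 0.0)           # sharp chemical anomaly

def velocity_at(xp):
    """Linear interpolation of the cell-centred velocity field (periodic domain)."""
    s = xp / dx - 0.5
    i0 = np.floor(s).astype(int) % n_grid
    i1 = (i0 + 1) % n_grid
    w = s - np.floor(s)
    return (1 - w) * u_grid[i0] + w * u_grid[i1]

dt = 0.5 * dx / u_grid.max()                               # CFL-like step size
for _ in range(200):
    x = (x + dt * velocity_at(x)) % L                      # advect the particles
# conc is never modified: the tracer values stay sharp, only positions change.
print(x[conc > 0.5].min(), x[conc > 0.5].max())
```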

Another algorithmic challenge for the deep-earth model is better modeling of the strong lateral viscosity variations of mantle rocks. We are working on a new matrix-dependent transfer algorithm for the multigrid momentum solver, which will allow a better representation of the viscosity field at coarse and fine grid levels. The robust treatment of strong lateral viscosity variations is central to our challenging long-term goal of a more realistic representation of the lithospheric plates, which are characterized by high-viscosity interiors separated by weak plate margins; this will require complex adaptive methods too. The new core convection model, mentioned earlier, shares many techniques with the mantle code. However, the key challenge for core convection is incorporating Coriolis effects into the multigrid solver. Coriolis effects dominate in core convection, making the numerical challenges akin to those of ocean circulation models.
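The tiny example below illustrates why coefficient-aware (matrix-dependent) grid transfers matter when the viscosity varies by orders of magnitude; it is a simplified 1-D analogy of the underlying issue, not the transfer operators we are developing, and the viscosity values are invented. Coarsening a plate-like viscosity profile with a plain arithmetic average effectively erases a weak plate margin, whereas a harmonic (series-resistance-style) average preserves its influence on the coarse grid.

```python
"""Arithmetic versus harmonic coarsening of a strongly varying viscosity field."""
import numpy as np

def arithmetic_coarsen(visc):
    return 0.5 * (visc[0::2] + visc[1::2])

def harmonic_coarsen(visc):
    return 2.0 / (1.0 / visc[0::2] + 1.0 / visc[1::2])

# Plate-like structure: stiff interiors, a weak margin 4 orders of magnitude softer.
visc = np.full(16, 1e23)
visc[7:9] = 1e19

print(arithmetic_coarsen(visc))   # the weak margin nearly disappears (~5e22)
print(harmonic_coarsen(visc))     # the weak margin survives (~2e19)
```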

(d) Parallel computing challenges. Parallel computing is essential but complex, especially as the codes become adaptive. To this end, Geosciences, EEB and GFDL faculty rely increasingly on joint research with colleagues in Computer Science. For the current regular-grid deep-earth circulation model, we use simple domain decomposition in a message-passing model. The current version is a target code in NASA's Program of the Earth and Space Sciences. It performs and scales well, achieving a sustained speed of 100 Gflops on a 1000-processor Cray T3E.
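For readers outside the field, the fragment below sketches one common form of such a decomposition: a 1-D slab split with ghost-row (halo) exchange between neighboring ranks. It is written with mpi4py purely for compactness (the production circulation codes are not Python), and the field sizes, tags, and boundary treatment are illustrative assumptions.

```python
"""Sketch of slab domain decomposition with ghost-row exchange (mpi4py)."""
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nlat_local, nlon = 32, 128
field = np.full((nlat_local + 2, nlon), float(rank))   # rows 0 and -1 are ghosts

up = rank - 1 if rank > 0 else MPI.PROC_NULL            # no wrap at the "poles"
down = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# Exchange ghost rows with neighbours; PROC_NULL makes boundary ranks no-ops.
comm.Sendrecv(sendbuf=field[1, :].copy(), dest=up, sendtag=0,
              recvbuf=field[-1, :], source=down, recvtag=0)
comm.Sendrecv(sendbuf=field[-2, :].copy(), dest=down, sendtag=1,
              recvbuf=field[0, :], source=up, recvtag=1)

# A stencil update can now read the freshly filled ghost rows, e.g.:
interior = 0.25 * (field[0:-2, :] + field[2:, :]
                   + np.roll(field[1:-1, :], 1, axis=1)
                   + np.roll(field[1:-1, :], -1, axis=1))
print(rank, interior.shape)
```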

A message-passing version of the model is also being ported to a 100-processor PC cluster at the Geosciences Department. The cluster's main function is to support a range of geophysical modeling and to serve as a development and post-processing platform for users with access to high-end parallel systems. It is also an essential teaching resource, aimed at exposing students to modern parallel computing. MPI message passing allows code migration between the cluster and the high-end platforms provided to us by NASA and DOE. In addition, the geoscientists are working closely with Singh and Li in the CS department to move the model to the new cluster of high-performance Intel SMPs. The geoscientists will benefit from the major performance improvements offered by the CS group's advanced communication and protocol layers and, it is hoped, from the ability to use the SAS programming model effectively, rather than message passing, as the codes quickly become more complex and irregular. The CS researchers will benefit from a range of applications that are at the appropriate, early stages for parallel algorithm development, for evaluation of their protocols and systems, and for exploring tradeoffs in systems and programming models. Geoscience students will be valuable feedback-providing users of the programming environments and tools, at the right stage of application development. Together, we hope that our combined knowledge of parallel computing and the application domain will lead to novel, performance-portable parallel methods. Longer term, we hope that this knowledge will transfer across the groups and that students will be trained adequately across areas.


Figure 7. (A) Cut-away of the 3D temperature field for a geodynamic initial condition model [Bung98]. Color denotes temperature. (B) Same as (A) but after the geologic plate motion has been imposed. The more complex downwelling structure reflects subduction history beneath the Northwestern Pacific. (C) Map showing plate motion at present day.

(e) Visualization challenges. One of the most important challenges for circulation modelers is representing the vast amount of information contained in 3D fields. The model in Figure 7 shows temperatures for a geodynamic earth model. Temperature is chosen not because it contains all of the information in the model, but because of the simple scalar nature of the field. Equally important in assessing the character of an earth model are the vector velocities (to understand the dynamics of the circulation) and the concentrations of chemical species. At present, we lack the ability to represent this information insightfully in 3D, so simpler 2D graphics are employed for it (see Figure 7(C)). An even greater challenge is representing the propagation of 3D seismic wave-fronts, accounting for the complexities of scattering and refraction, through the heterogeneity structure of this model. To our knowledge, the visualization of complex seismic wavefronts traveling through 3D spherical earth models has not been attempted so far, and would be a valuable visualization research problem in itself. These problems are becoming increasingly important as we begin to explore the seismic implications of deep-earth circulation models. We are working with the CS faculty who are visualization experts and are building a large-scale display wall, to help us address these issues.

In summary, there is great opportunity and need for integrative research and training across computer science and these science disciplines. The computational and visualization problems also overlap substantially across the sciences, so joint research and training to attack common problems across science disciplines (and with computer science) should be very fruitful.


References (few in number due to wide scope of research thrusts and limited space)

[Altm94] R. Altman et al., "Probabilistic Constraint Satisfaction with Non-Gaussian Noise", Proc. Uncertainty in AI, 1994.
[Altm95] R. Altman, "A Probabilistic Approach to Determining Biological Structure: Integrating Uncertain Data Sources", Int. J. Human Comp. Studies, 42:593-616, 1995.
[Ande94] R.M. Anderson, "Mathematical studies of parasitic infection and immunity", Science, 264:1884-1886, 1994.
[Barn86] J. Barnes and P. Hut, "A Hierarchical O(N log N) Force-Calculation Algorithm", Nature, 324:446-449, 1986.
[Blum94] M. Blumrich et al., "A Virtual Memory Mapped Network Interface for the SHRIMP Multicomputer", Proc. Intl. Symp. on Computer Architecture (ISCA), 1994.
[Bode95] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, and W.-K. Su, "Myrinet: A Gigabit-per-Second Local Area Network", IEEE Micro, 15(1):29-36, February 1995.
[Bung98] H. Bunge et al., "Time scales and heterogeneous structure in geodynamic earth models", Science, 280, 1998.
[Bung95] H. Bunge et al., "Mantle convection modeling on parallel virtual machines", Computers in Physics, 9, 1995.
[Cela96] F. Celada et al., "Affinity maturation and hypermutation in humoral immune response", Eur. J. Immunol., 26, 1996.
[Cela92] F. Celada et al., "A computer model of cellular interactions in the immune system", Immunol. Today, 13:56-62, 1992.
[Cen93a] R. Cen and J.P. Ostriker, "CDM Cosmology with Hydrodynamics and Galaxy Formation: The Evolution of the IGM and Background Radiation Fields", Astrophysical Journal, 417:404, 1993.
[Cen93b] R. Cen and J.P. Ostriker, "CDM Cosmogony with Hydrodynamics and Galaxy Formation: Galaxy Properties at Redshift Zero", Astrophysical Journal, 417:415, 1993.
[Cen94] R. Cen, J. Miralda-Escude, J.P. Ostriker, and M. Rauch, "Gravitational Collapse of Small-Scale Structure as the Origin of the Lyman Alpha Forest", Astrophysical Journal Letters, 437:L9, 1994.
[Cen97] R. Cen and R.A. Simcoe, "Sizes, Shapes, Correlations of Lyman Alpha Clouds and Their Evolution in the CDM+Lambda Universe", Astrophysical Journal, 483:8, 1997.
[Cen98] R. Cen and J.P. Ostriker, "Where Are the Baryons?", Astrophysical Journal, in press.
[Chal97] E. Chaljub and A. Tarantola, "Sensitivity of SS precursors to topography on the upper-mantle 660-km discontinuity", Geophys. Res. Lett., 24:2613-2616, 1997.
[Chan95] R. Chandra, "The COOL Parallel Programming Language", Ph.D. Thesis, Stanford University, 1995.
[Chen98] C. Chen et al., "Hierarchical Organization of Molecular Structure Computation", Proc. RECOMB '98, 1998.
[Chen99] C. Chen et al., "Load Balancing Irregular Protein Structure Computations", SIAM Parallel Processing Conf., 1999.
[Cign97] P. Cignoni, P. Marino, C. Montani, E. Puppo, and R. Scopigno, "Speeding Up Isosurface Extraction Using Interval Trees", IEEE Trans. on Visualization and Computer Graphics, 3(2):158-170, June 1997.
[Crip88] G.M. Crippen and T.F. Havel, Distance Geometry and Molecular Conformation, Research Studies Press, 1988.
[Cull93] D. Culler et al., "LogP: Toward a Realistic Model of Parallel Computation", Proc. Principles and Practice of Parallel Programming (PPoPP), 1993.
[Cull98] D. Culler and J.P. Singh, Parallel Computer Architecture: A Hardware-Software Approach, Morgan Kaufmann, 1998.
[Czer97] M. Czernuszenko, "The ImmersaDesk and InfinityWall Virtual Reality Displays", Computer Graphics, 31(2), 1997.
[Dubn97] C. Dubnicki et al., "Design and Implementation of Virtual Memory Mapped Communication on Myrinet", Proc. Intl. Parallel Processing Symposium (IPPS), 1997.
[Dwar99] S. Dwarkadas et al., "Fine-grain versus Coarse-grain Software DSM", Proc. HPCA '99, 1999.
[Fash93] M. Fasham et al., "A seasonal three-dimensional ecosystem model of nitrogen cycling in the North Atlantic euphotic zone: A comparison of the model results with observation from Bermuda Station S and OWS India", Gl. Biogeochem. Cycles, 1993.
[Gill96] R. Gillett et al., "The Memory Channel", Proc. COMPCON '96, 1996.
[Glat95] G. Glatzmaier et al., "A 3-D, self-consistent computer simulation of a geomagnetic field reversal", Nature, 377, 1995.
[Gott96] J. Gott et al., "Topology of Large-scale Structure by Galaxy Type: Hydrodynamic Simulations", Astrophysical Journal, 465, 1996.
[Gran97] S.P. Grand et al., "Global seismic tomography: a snapshot of convection in the earth", GSA Today, 7:1-6, 1997.
[Grub97] N. Gruber et al., "Global patterns of marine nitrogen fixation and denitrification", Gl. Biogeochem. Cycles, 11(2), 1997.
[Holt96] C. Holt et al., "Application and Architectural Bottlenecks in Scaling DSM Machines", Proc. ISCA '96, 1996.
[Ifto98] L. Iftode, "Home-based Shared Virtual Memory", Ph.D. Thesis, Computer Science Dept., Princeton University, 1998.
[Ifto99] L. Iftode and J.P. Singh, "Shared Virtual Memory: Progress and Challenges", Proceedings of the IEEE, 1999.
[Jian97] D. Jiang et al., "Application Restructuring and Performance Portability for Shared Memory", Proc. PPoPP '97, 1997.
[Kele92] P. Keleher et al., "Lazy Release Consistency for Software Distributed Shared Memory", Proc. ISCA '92, 1992.
[Kepl93] T. Kepler et al., "Cyclic re-entry of germinal center B cells and the efficiency of affinity maturation", Immunology Today, 14:41, 1993.
[Kon96] L. Kontothannasis and M.L. Scott, "Using Memory Mapped Network Interfaces to Improve the Performance of Distributed Shared Memory", Proc. Intl. Symp. on High Performance Computer Architecture (HPCA), 1996.
[Lau97] J.P. Laudon and D. Lenoski, "The SGI Origin2000: A Scalable CC-NUMA Server", Proc. ISCA '97, 1997.
[Levi88] M. Levitt et al., "Accurate Simulation of Protein Dynamics in Solution", Proc. Natl. Acad. of Sciences, 85, 1988.
[Li89] K. Li et al., "Memory Coherence in Shared Virtual Memory Systems", ACM Trans. on Computer Systems (TOCS), 7(4), 1989.
[Litw92] S. Litwin and R. Jores, "Shannon information as a measure of amino acid diversity", in Theoretical and Experimental Insights into Immunology, A.S. Perelson and G. Weisbuch, eds., Springer-Verlag, Berlin, 1992.
[Mehr98] R. Mehr, A.S. Perelson, A. Sharp, L. Segel, and A. Globerson, "MHC-linked syngeneic developmental preference in thymic lobes colonized with bone marrow cells: a mathematical model", Dev. Immunol., 5:303-318, 1998.
[Merr98] S.J. Merrill, "Computational models in immunological methods: an historical review", J. Immunol. Methods, 216, 1998.
[Mitr89] J. Mitrovica et al., "Tilting of continental interiors by the dynamical effects of subduction", Tectonics, 8, 1989.
[Moln94] S. Molnar, M. Cox, D. Ellsworth, and H. Fuchs, "A Sorting Classification of Parallel Rendering", IEEE Computer Graphics and Applications, 14(4):23-32, July 1994.
[More98] P.A. Morel, "Mathematical modeling of immunological reactions", Frontiers Biosci., 3:d338-7, 1998.
[Morp95] D. Morpurgo et al., "Modelling thymic functions in a cellular automaton", Int. Immunol., 7:505-516, 1995.
[Muel95] C. Mueller, "Sort-First Rendering Architecture for High-Performance Graphics", ACM Computer Graphics, 1995.
[Nilg88] M. Nilges et al., "Determination of Three-Dimensional Structures of Proteins from Interproton Distance Data by Dynamical Simulated Annealing from a Random Array of Atoms", FEBS Lett., 239, 1988.
[Paki95] S. Pakin et al., "High Performance Messaging on Workstations", Proc. Supercomputing '95, 1995.
[Rein94] S. Reinhardt et al., "Decoupled Hardware Support for Distributed Shared Memory", Proc. ISCA '96.
[Ryu93] D. Ryu et al., "A Cosmological Hydrodynamic Code Based on the TVD Scheme", Astrophysical Journal, 414, 1993.
[Salm92] J. Salmon, "Parallel Hierarchical N-body Methods", Ph.D. Thesis, California Institute of Technology, 1992.
[Sama98] R. Samanta et al., "Home-based SVM across SMP Nodes", Proc. HPCA '98, 1998.
[Shlo97] M.J. Shlomchik, P. Watts, M.G. Weigert, and S. Litwin, "Clone: a Monte-Carlo computer simulation of B cell clonal expansion, somatic mutation, and antigen-driven selection", Curr. Top. Microbiol. Immunol., 229, 1997.
[Scho94] I. Schoinas et al., "Fine-grained Access Control for Distributed Shared Memory", Proc. ASPLOS '94, 1994.
[Seid92] P. Seiden et al., "A model for simulating cognate recognition and response in the immune system", J. Theor. Biology, 158, 1992.
[Shos92] S. Shostak, Advanced Imaging, August 1992.
[Sing93] J.P. Singh et al., "Scaling Parallel Programs for Multiprocessors: Methodology and Examples", IEEE Computer, 26, 1993.
[Sing95a] J.P. Singh et al., "Load Balancing and Data Locality in Adaptive Hierarchical N-Body Methods: Barnes-Hut, Fast Multipole, and Radiosity", Journal of Parallel and Distributed Computing (JPDC), 27(2):118-141, 1995.
[Sing95b] J.P. Singh et al., "Implications of Hierarchical N-body Methods for Multiprocessor Architecture", ACM Trans. on Computer Systems, 1995.
[Stew98] J.J. Stewart, "The female X-inactivation mosaic in systemic lupus erythematosus", Immunol. Today, 19:352-357, 1998.
[Stew97] J.J. Stewart et al., "A solution to the rheumatoid factor paradox: pathologic rheumatoid factors can be tolerized by competition with natural rheumatoid factors", J. Immunol., 159:1728-1738, 1997.
[Stew97] J.J. Stewart et al., "A Shannon entropy analysis of immunoglobulin and T cell receptor", Mol. Immunology, 34, 1997.
[Val90] L. Valiant, "A Bridging Model for Parallel Computation", Comm. of the ACM, 33(8), Aug. 1990.
[Woo95] S. Woo et al., "The SPLASH-2 Programs: Characteristics and Methodological Considerations", Proc. ISCA '95, 1995.
[Xu95] G. Xu, "A New Parallel N-Body Gravity Solver: TPM", Astrophysical Journal Supplements, 98:35, 1995.
[Zhou97] Y. Zhou et al., "Relaxed Consistency and Coherence Granularity in Software DSM Systems", Proc. PPoPP '97, 1997.
