
1 Core 1 – Computational Frameworks and Algorithms [***editing skipped section 1.1***]

1.1 Vision: what the world will look like in ten years (2-4 pages)

Take away message: There is a coming era of distributed, large-scale, integrated life science that requires grid computing capabilities. UVA is uniquely qualified to demonstrate this capability on real problems of significance, specifically integrated diabetes research.

Life science research and clinical delivery are highly fractionated. The data, applications, and people are in different organizations, at different sites, with different access control policies. This has led to a model where most research and clinical delivery take place on an “island” with very limited (particularly in bandwidth) input from, and access to, resources on other “islands”. Resources in this case can be “soft” resources such as people, data sets, processes, and applications, as well as “hard” resources such as instruments, processing power, or other specialized devices. The lack of access to these resources means not only redundant work but, more importantly, slower and less efficient progress on important challenges.

Why is this project important? Who cares?

This project combines computational science with biology and clinical medicine and eliminates the silos that have characterized the life sciences. It will integrate data sets across labs and organizational boundaries; provide powerful data mining tools that exploit inexpensive patient sequencing and the vast quantities of clinical data; and support the integration of, and secure access to, clinical data for the delivery of health care. This in turn will lead to prospective health care via the discovery of the mechanisms (pathways) of disease, the ability to predict who is at risk, and ways to mitigate those risks.

Shape of the solution.

Grids – The Means to an End

Short term

Long term

How Grids will facilitate the vision?

Eliminate boundaries

Secure interaction

Access to data for large scale integration and data mining (e.g., epidemiological)

1.2 Grid Computing

Take away message: Grids are about more than cycle sharing; they are about resource sharing, and in the life sciences data is the crucial resource. UVA knows grids, has built them before, and has a solid plan both for delivering in the short term and for attacking the known hard problems that must be solved before grids can become ubiquitous in the medical world. The potential benefits of grids to the life sciences are huge! [***We’ve got to say something about Globus – and how there is more to life than Globus.]

1.2.1 What is a Grid?

In 1994, we outlined our vision for wide-area distributed computing:

For over thirty years science fiction writers have spun yarns featuring worldwide networks of interconnected computers that behave as a single entity. Until recently such science fiction fantasies have been just that. Technological changes are now occurring which may expand computational power in the same way that the invention of desk top calculators and personal computers did. In the near future computationally demanding applications will no longer be executed primarily on supercomputers and single workstations using local data sources. Instead enterprise-wide systems, and someday nationwide systems, will be used that consist of workstations, vector supercomputers, and parallel supercomputers connected by local and wide area networks. Users will be presented the illusion of a single, very powerful computer, rather than a collection of disparate machines. The system will schedule application components on processors, manage data transfer, and provide communication and synchronization in such a manner as to dramatically improve application performance. Further, boundaries between computers will be invisible, as will the location of data and the failure of processors. (Grimshaw94)

The future is now. After almost a decade of research and development by the grid community, we see grids (then called metasystems [Smarr & Catlett 92]) being deployed around the world, both in academic settings and, more tellingly, for production commercial use.

Grids are collections of interconnected resources harnessed together in order to satisfy various needs of users. The resources may be administered by different organizations and may be distributed, heterogeneous and fault-prone. The manner in which users can interact with these resources, as well as the usage policies for those resources, may vary widely. The grid’s infrastructure must manage this complexity so that users interact with resources as easily and smoothly as possible.

A popular definition of grids is that a grid system is a collection of distributed resources connected by a network. A grid system, also called a grid, gathers resources – desktop and hand-held hosts, devices with embedded processing resources such as digital cameras and phones or tera-scale supercomputers – and makes them accessible to users and applications. This reduces overhead and accelerates projects. A grid application can be defined as an application that operates in a grid environment or is “on” a grid system. Grid system software (also called middleware) is software that facilitates writing grid applications and manages the underlying grid infrastructure. The resources in a grid typically share at least some of the following characteristics:

They are numerous.


They are owned and managed by different, potentially mutually-distrustful organizations and individuals.

They are potentially faulty.

They have different security requirements and policies.

They are heterogeneous; for example, they have different CPU architectures, run different operating systems, and have different amounts of memory and disk.

They are connected by heterogeneous, multi-level networks.

They have different resource management policies.

They are likely to be geographically separated (perhaps by a campus, an enterprise, or a continent).

These definitions are necessarily general. What constitutes a “resource” can be a difficult question, and the actions performed by a user on a resource can vary widely. For example, a traditional definition of a resource has been “machine,” or more specifically “CPU cycles on a machine.” The actions users perform on such a resource can be running a job, checking availability in terms of load, and so on. These definitions and actions are legitimate, but limiting. Today, resources can range from biotechnology applications and stock market databases to wide-angle telescopes. The actions performed might be “run if license is available,” “join with user profiles,” and “procure data from specified sector.” A grid can encompass all of these resources and user actions, so its infrastructure must be designed to accommodate a wide variety of resources and actions without compromising basic principles such as ease of use, security, autonomy, etc.

A grid enables users to collaborate securely by sharing processing, applications and data across systems in order to facilitate collaboration, faster application execution and easier access to data. More concretely, this means being able to:

Find and share data. When users need access to data on other systems or networks, they should simply be able to access it like data on their own system. System boundaries that are not useful should be invisible to users who have been granted legitimate access to the information.

Find and share applications. The leading edge of development, engineering and research efforts consists of custom applications – permanent or experimental, new or legacy, public-domain or proprietary. Each application has its own requirements. Why should application users have to jump through hoops to get applications together with the data sets needed for analysis?

Share computing resources. It sounds very simple: one group has computing cycles that it doesn’t need, while colleagues in another group don’t have enough cycles. The first group should be able to grant access to its own computing power without compromising the rest of the network.

Grid computing is in many ways a novel way to construct applications. It has received a significant amount of recent press attention and been heralded as the next wave in computing. However, under the guises of “peer-to-peer systems”, “Grids” and “distributed systems,” grid computing requirements and the tools to meet these requirements have been under development for decades. Grid computing requirements address the issues that frequently confront a developer trying to construct applications for a grid. The novelty is that these requirements are addressed by the grid infrastructure, in order to reduce the burden on the application developer.

Clearly, the minimum capability needed to develop grid applications is the ability to transmit bits from one machine to another; all else can be built from that. However, there are several challenges that frequently confront a developer constructing applications for a grid. These challenges lead us to a number of requirements which we believe any complete grid system must address. They are:

Security. This covers a gamut of issues, including authentication, data integrity, authorization (access control), and auditing. If grids are to be accepted by corporate and government IT departments, a wide range of security concerns must be addressed. Security mechanisms must be integral to applications and capable of supporting diverse policies. Furthermore, they must be built in at the beginning: trying to patch security measures in as an afterthought (as some systems are attempting today) is a fundamentally flawed approach. We do not believe that a single security policy is perfect for all users and organizations, so there must be mechanisms that allow users and resource owners to select policies that fit their particular security and performance needs as well as local administrative requirements. In medical computing, security has a particular set of externally imposed requirements (such as HIPAA) which may vary from country to country.

Global namespace. The lack of a global namespace for accessing data and resources is one of the most significant obstacles to wide-area distributed and parallel processing. The current multitude of disjoint namespaces greatly impedes developing applications that span sites. All grid objects must be able to access (subject to security constraints) any other grid object transparently without regard to location or replication.

Fault tolerance. Failure in large-scale grid systems is and will be a fact of life. Hosts, networks, disks and applications frequently fail, restart, disappear, and otherwise behave unexpectedly. Forcing the programmer to predict and handle all of these failures significantly increases the difficulty of writing reliable applications. Fault-tolerant computing is a known, very difficult problem. Nonetheless it must be addressed or businesses and researchers will not entrust their data to grid computing.

Accommodating heterogeneity. A grid system must support interoperability between heterogeneous hardware and software platforms. Ideally, a running application should be able to migrate from platform to platform as necessary. At a bare minimum, components running on different platforms must be able to communicate transparently.

Binary management. The underlying system should keep track of executables and libraries and know which ones are current, which ones are used with which persistent states, where they have been installed, and where upgrades should be installed. These tasks reduce the burden on the programmer.


Multi-language support. In the 1970s, the joke was “I don’t know what language they’ll be using in the year 2000, but it’ll be called Fortran.” Fortran has lasted over 40 years, and C almost 30. Diverse languages will always be used and legacy applications will need support.

Scalability. There are over 400 million computers in the world today and over 100 million network-attached devices (including computers). Scalability is clearly a critical necessity. Any architecture relying on centralized resources is doomed to failure. A successful grid architecture must strictly adhere to the distributed systems principle: the service demanded of any given component must be independent of the number of components in the system. In other words, the service load on any given component must not increase as the number of components increases.

Persistence. I/O and the ability to read and write persistent data are critical for communication between applications and for saving data. The current files/file-libraries paradigm should also be supported, since it is familiar to programmers.

Extensibility. Grid systems must be flexible enough to satisfy current user demands and unanticipated future needs. Therefore, we feel that mechanism and policy must be realized by replaceable and extensible components, including (and especially) core system components. This model facilitates development of improved implementations that provide value-added services or site-specific policies while enabling the system to adapt over time to a changing hardware and user environment.

Site autonomy. Grid systems will be composed of resources owned by many organizations, each of which will want to retain control over its own resources. For each resource, its owner must be able to limit or deny use by particular users, specify when it can be used, and so forth. Sites must also be able to choose or rewrite an implementation of each grid component as best suits their needs. A site may trust the security mechanisms of one particular implementation over those of another and it should be able to freely choose that implementation.

Complexity management. Finally, but importantly, complexity management is one of the biggest challenges in large-scale grid systems. In the absence of system support, the application programmer is faced with a confusing array of decisions. Complexity exists in multiple dimensions: heterogeneity in policies for resource usage and security, a range of different failure modes and different availability requirements, disjoint namespaces and identity spaces, and the sheer number of components. For example, professionals who are not IT experts should not have to remember the details of five or six different file systems and directory hierarchies (not to mention multiple user names and passwords) in order to access files they use on a regular basis. Thus, providing the programmer and system administrator with clean abstractions is critical to reducing the cognitive burden.

These requirements are high-level and independent of implementation, but they all must be addressed by the grid infrastructure in order to reduce the burden on the application developer. If the system does not address these issues with an architecture based on well-thought-out principles, the programmer will spend valuable time on basic grid functions, needlessly increasing development time and costs.


1.2.2 Global Bio Grid Principles and Design Philosophy

The Global Bio Grid uses the design philosophy and principles espoused by the designers of Legion, one of the first grid systems [long list of cites] [XXX Legion ref]. Many of them have been incorporated in the Global Grid Forum’s Open Grid Services Architecture. These principles are:

Provide a single-system view. With today’s operating systems we can maintain the illusion that our local area network is a single computing resource. But once we move beyond the local network or cluster to a geographically-dispersed group of sites, perhaps consisting of several different types of platforms, the illusion breaks down. Researchers, engineers and product development specialists (most of whom do not want to be experts in computer technology) are forced to request access through the appropriate gatekeepers, manage multiple passwords, remember multiple protocols for interaction, keep track of where everything is located, and be aware of specific platform-dependent limitations (e.g., this file is too big to copy or to transfer to that system; that application runs only on a certain type of computer, etc.). Re-creating the illusion of a single computing resource for heterogeneous, distributed resources reduces the complexity of the overall system and provides a single namespace.

Provide transparency as a means of hiding detail. Grid systems should support the traditional distributed system transparencies: access, location, heterogeneity, failure, migration, replication, scaling, concurrency and behavior [Grimshaw et al, 1998 Architecture]. For example, users and programmers should not have to know an object’s location in order to use it (access, location and migration transparency), nor should they need to know that a component across the country failed. They want the system to recover automatically and complete the desired task (failure transparency). This behavior is the traditional way to mask details of the underlying system. Transparency also addresses fault-tolerance and complexity.

Provide flexible semantics. Our overall objective is a grid architecture that is suitable to as many users and purposes as possible. A rigid system design in which policies are limited, trade-off decisions are pre-selected, or all semantics are pre-determined and hard-coded would not achieve this goal. Indeed, if we dictate a single system-wide solution to almost any of the technical objectives outlined above, we would preclude large classes of potential users and uses. Therefore, we will allow users and programmers as much flexibility as possible in their applications’ semantics, resisting the temptation to dictate solutions. Whenever possible, users can select both the kind and the level of functionality and choose their own trade-offs between function and cost. This philosophy is manifested in the system architecture. The object model specifies the functionality but not the implementation of the system’s core objects; the core system therefore consists of extensible, replaceable components. We will provide default implementations of the core objects, although users will not be obligated to use them. Instead, we encourage users to select or construct object implementations that answer their specific needs.

Reduce user effort. In general, there are four classes of grid users, each trying to accomplish some mission with the available resources: end-users of applications, application developers, system administrators, and managers. We believe that users want to focus on their jobs (i.e., their applications), and not on the underlying grid plumbing and infrastructure. Thus, for example, to run an application in Legion a user could type

legion_run my_application my_data

at the command shell. The grid should then take care of messy details such as finding an appropriate host on which to execute the application and moving data and executables around. Of course, the user may need to specify or override certain behaviors, perhaps specify an OS on which to run the job, or name a specific machine or set of machines, or even replace the default scheduler. This level of control should also be supported.

Reduce “activation energy.” One of the typical problems in technology adoption is getting users to actually use it. If it is too difficult to shift to a new technology, users will tend to avoid trying it out unless their need is immediate and extremely compelling. This is not a problem unique to grids: it is human nature. Therefore, one of our most important goals is to make the technology easy to use. Using an analogy from chemistry, we keep the activation energy of adoption as low as possible, so users can easily and readily realize the benefit of using grids – get the reaction going – creating a self-sustaining spread of grid usage throughout the organization. This principle manifests itself in features such as “no recompilation” for applications ported to a grid and support for mapping a grid into a local operating system’s file system. Another variant of this concept is the motto “no play, no pay”: if you do not need encrypted data streams, fault-resilient files, or strong access control, you should not have to pay the overhead of using them.

Do no harm. To protect their objects and resources, grid users and sites will require grid software to run with the lowest possible privileges.

Do not change host operating system. Organizations will not permit their machines to be used if their operating systems must be replaced. Our experience with Mentat [XXX Mentat paper XXX] indicates, though, that building a grid on top of host operating systems is a viable approach. However, the grid software must be able to run as a user-level process and must not require root access.

[***some of these are straight out of the Legion materials, and indeed still said “Legion will do XX.” I edited where it seemed to make sense —I ignored a mention of a Legion command.. Someone should be sure that this is correct.]

Overall, the application of these design principles at every level provides a unique, consistent, and extensible framework upon which to create grid applications.

1.3 The Global Bio Grid

[***Andrew, I’m not sure about the way you make your argument in this section. I rearranged some things, but basically it first mentions GBG. Then it talks about life sciences and the issues it faces in computing (section 1.3.1, according to my guess at a numbering scheme). Then it gives technical details of GBG (section 1.3.2). Then it explains GBG deployment plans (section 1.3.3). It’s OK, but I think that you need something to leap the gap between 1.3.1 and 1.3.2. You lay out a bunch of problems, but you don’t explain why GBG is the answer to those problems (at least not right here). Maybe you should put in a paragraph or even a few sentences that says something to the effect of “All of these problems are very difficult and very important, grids offer a good environment for addressing and solving them, and GBG is a neat-o fabu grid that will cater to the needs of life science research.”]

Several institutions — the University of Virginia (UVa), North Carolina BioGrid (NCBio), the University of Texas at Austin (UT), The Center for Advanced Genomics (TCAG), and Texas Tech University (TTU) — have agreed to form the nucleus of a hardware and software infrastructure called the Global Bio Grid (GBG) to further biological, behavioral, and life sciences research. Creation of this grid has already begun at several institutions and several others are committed to their participation in the GBG if this proposal is funded. Furthermore, several other national and international partners have been identified for future expansion of the GBG and agreements in principle have been reached with them.

We envision the GBG as a dynamic system, where users, projects, data sets, computational and other resources, applications, and organizations are added and possibly removed as the capabilities of the system are enhanced, as resources are purchased or decommissioned, and as research needs evolve.

1.3.1 Problems in Life Science Research

Life science researchers face a myriad of problems in effectively and efficiently carrying out their research mission. Significant human effort is spent by research organizations, and sometimes by the researchers themselves, in manually managing their data; setting up and maintaining their local computational resources; and deploying, managing, and fine-tuning the performance of the applications required to perform their research. The effort required is exacerbated if the resources or participants in a research effort are geographically or organizationally separate. Day-to-day management becomes more complicated, more time consuming, and more prone to error. All of these factors slow down the discovery process and raise costs, as human, data, computational, and other resources are not utilized as efficiently as they might be.

1.3.1.1 Access and Management of Data

Researchers in the biological and behavioral sciences require access to a broad array of existing data resources and generate new data through their efforts. As computational and data storage capacities continue to increase, researchers are producing more and larger volumes and types of data. At the same time, researchers are creating ever-more complex models of biological processes that require access to, and integration of, data of different types from multiple sources. They are also using more sophisticated data mining and data integration techniques to detect significant patterns across biological data, while the speed and volume at which data are consumed is rapidly increasing. The trend in the biological sciences is towards spiraling volumes and types of data being produced and digested.

Moreover, the data required by these new research techniques are physically stored in different locations by different organizations. Simply finding the data is sometimes a problem. Data sets, even of similar semantic meaning, are stored in different formats, with different levels of accuracy or reliability, and on different storage technologies. Using data that is copied or cached in multiple places raises problems with consistency and coherence, especially when running multiple distributed jobs against the data. Each data set may also have access or usage restrictions, whether because of proprietary ownership and licenses, patient privacy rights (such as those under HIPAA), security reasons, or some other factor.

All of these obstacles add up to make data management and access a difficult task that consumes a significant amount of research team effort. Currently, there are a variety of ways to acquire data sets, usually involving some level of manual intervention. The proper data sets need to be identified and found. This is still an ad-hoc process with only limited support for searching. The source organization must then grant access to the data. Except for truly public data sources, negotiating access to data is currently done almost entirely by direct bilateral discussions. Then, a method for accessing data must be chosen. This might involve accessing the data directly from the source or transferring a copy of the data to a site local to the research effort. In either case a human must make the transfer and manage the copy. For dynamic data sources, the copy may rapidly go out of date. For a research effort that spans multiple sites or computing resources, multiple copies are required, each needing to be kept up to date. For large data sets, it may be a significant strain to copy and store data locally at all. Accessing a large data set from across the country each time it is needed can also create a considerable performance problem. Once the data is finally available, the content may need to be automatically or manually manipulated in some fashion: transformed to a new format, cleaned up to recalibrate or eliminate errors, randomized to provide privacy for patient records, etc.

How can grid computing help this situation? Grid software can be thought of as lower- and middle-level software with respect to data. At the lowest level, grid software is designed to handle storing the bits, keeping track of where they reside, and providing basic access and security to them. At a middle level, grid software may add functionality such as caching and replication to improve performance and availability. Current grid software does not provide good tools and abstractions to describe and understand the semantics of data, carry out data mining operations, and integrate data across multiple data sets. These shortcomings are not limited to grid systems, since most other data management systems do not handle these issues well either. To improve the state of the art in these areas, we present a research agenda in Core 2 to study these issues.

It is clear that the current methods are not up to the task of efficiently and effectively managing all of the data needs for future research efforts.
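To make the division of labor concrete, the sketch below shows a minimal, hypothetical data-access layer of the kind described above: a caller names a data set by a global identifier, and the layer resolves the name, fetches the bytes, and keeps a local cached copy for reuse. The class name, registry contents, and caching policy are illustrative assumptions, not part of any existing grid package.

import hashlib
import pathlib
import urllib.request

class GridDataAccess:
    """Sketch of a grid data layer: resolve a global name, fetch the
    bytes, and cache them locally for reuse (assumed interface)."""

    def __init__(self, registry, cache_dir="gbg_cache"):
        # registry: hypothetical mapping of global data-set names to source URLs.
        self.registry = registry
        self.cache = pathlib.Path(cache_dir)
        self.cache.mkdir(exist_ok=True)

    def fetch(self, global_name):
        # Return a local path for the named data set, downloading it once
        # and reusing the cached copy thereafter (location transparency).
        url = self.registry[global_name]
        key = hashlib.sha1(global_name.encode()).hexdigest()
        local = self.cache / key
        if not local.exists():
            urllib.request.urlretrieve(url, str(local))
        return local

# Hypothetical usage; the global name and URL are placeholders.
registry = {"/GBG/data/pdb/1abc": "https://example.org/pdb/1abc.ent"}
store = GridDataAccess(registry)
# local_copy = store.fetch("/GBG/data/pdb/1abc")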

1.3.1.2 Computational Resources

With the increased reliance on computational biology and on data mining and statistical techniques, biological and behavioral science research efforts require ever more computational resources. A common paradigm in bioinformatics involves scanning a database and applying some function to each entry to generate a score or some other value. More formally, this procedure can be described as

XXX formula XXX


In certain cases the function needs to be evaluated for the cross product of all values in one database across all of the values in another database.

XXX formula II XXX
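One plausible reading of the two placeholder formulas, based only on the surrounding prose (the notation below is ours, not taken from the original):

% Scan paradigm: apply a scoring function f to every entry of a database D
S = \{\, f(d) \mid d \in D \,\}

% Cross-product paradigm: evaluate f over all pairs drawn from databases D_1 and D_2
S = \{\, f(d, e) \mid d \in D_1,\ e \in D_2 \,\}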

Typical times to evaluate the function for a single pair of entries range from a few seconds to several hours of CPU time, and the full set of evaluations can easily consume thousands of CPU hours. To make such tasks tractable, the work is parallelized, breaking the total set of evaluations into a single evaluation or a small group of evaluations per compute job; the jobs are then run across multiple processors.

Another common paradigm involves large numbers of ensemble computations with either stochastic behavior or slightly different parameters. Either way, the result is a large number of essentially independent, non-communicating, computationally complex jobs which in aggregate require more power than one machine can provide. When spread across a large collection of machines, however, they are manageable. These are classic “bag-of-tasks” problems (appendix XXX contains more detail about our experience with running bag-of-tasks problems).
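As a minimal sketch of the bag-of-tasks pattern, the example below evaluates an independent scoring function over each database entry using a local process pool; on a grid, the same decomposition would farm each task out to machines at many sites. The scoring function and the tiny database are placeholders, not a real bioinformatics workload.

from multiprocessing import Pool

def score(entry):
    # Stand-in for an expensive per-entry function (e.g., a sequence
    # comparison); real evaluations may take seconds to hours each.
    return sum(ord(c) for c in entry) % 97

def main():
    database = ["ACGT", "GGCA", "TTAG", "CCGA"]  # placeholder entries
    # Each entry becomes an independent, non-communicating task.
    with Pool() as pool:
        scores = pool.map(score, database)
    print(dict(zip(database, scores)))

if __name__ == "__main__":
    main()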

The need to accommodate increasing demands for computational power causes research efforts to spend significant effort on managing the required computational resources. As computational biology becomes more ubiquitous, it is likely that a single research effort will require more computational power than the local institution can provide. Thus, it is reasonable to expect that local research efforts will need access to resources that are physically separated from the research team and controlled by a different organization.

Currently, gaining access to computational resources is done either by purchasing the required resources with the project budget or by individually negotiating for time on existing resources. Buying new computational resources for a project can prove an inefficient use of both time and money: new resources require expertise and effort to configure and may end up under-utilized or useless outside the project’s specialized demands. For borrowed or rented resources, there is often significant up-front effort to provide the research team with the proper accounts and infrastructure software (such as queuing systems or databases) and to train research team members on unfamiliar infrastructure tools.

[***want to expand on how grids might change this?]

1.3.1.3 Applications

Researchers employ various computational applications to further their studies, including simulations of basic biological processes, protein and genome comparisons, statistical data mining and correlation, and many others. Some of these applications are home-grown while others are standard vendor products or freeware packages, such as BLAST, FASTA, NAMD, and Amber. Research teams have to arrange for their applications to be deployed wherever they will be used, make sure that different machines have the proper version of the software, and check that the software is configured in a manner compatible with the other machines used in the research effort. For applications being developed and tested by the research team, the redeployment of newer application versions will be frequent and can be a substantial drain on time. As research increasingly spans the computational resources of many smaller desktops or clusters at multiple sites and organizations, the problem of deploying applications and maintaining their consistency becomes a greater strain on the system administrators associated with each project.

[***want to expand on how this problem is handled on grids?]

1.3.1.4 Dependability

Dependability—the ability to justifiably rely on the grid infrastructure for day-to-day operations—is a paramount property for a usable grid. Incorporating strategies and techniques for achieving dependability in grid applications is a known, difficult problem. It must nevertheless be addressed, or businesses and researchers will not adopt grids as a technological foundation on which to entrust their data and applications. Grid components such as hosts, networks, disks and applications frequently fail, restart, disappear and otherwise behave unexpectedly. Furthermore, the evolving fault and threat models that grid designers face exacerbate the task of achieving dependability: not only will grids be disrupted by attacks that target the Internet community as a whole, e.g., viruses and worms, but they will also be subject to attacks that specifically target life-science grids as the technology matures and becomes widely employed in production systems. Requiring programmers or life science domain experts to anticipate and handle all these possible faults and threats would significantly increase the difficulty of writing dependable applications. Instead, a grid system should automatically and proactively manage and cope with faults as an essential service to its user applications.

1.3.2 Technical Overview of GBG

Since the GBG is part production system and part research system, it is impossible, and frankly undesirable, to characterize all of the important features the system will have in five or even three years. Many aspects of the system, such as policy negotiation and confidentiality enforcement, will be influenced by feedback from the needs of biological researchers as they deploy into the new grid environment and develop schemes to match the new possibilities open to them. However, our long experience with distributed and grid computing systems dictates that the system must have certain features and follow certain guiding principles in order to succeed in delivering its promise to users. [***refer to discussion of design philosophy earlier?] These features and principles are summarized below.

1.3.2.1 Single Global Namespace

One of the key problems in managing projects and resources, and in writing application software that spans distributed sites, is that there is no common way to identify many of the important entities, such as users, applications, computers, data sets, files and directories, databases, policies, accounting records, etc. The vast majority of institutions maintain isolated local naming schemes for certain important entities, such as local user account names, local file systems (which may or may not be locally shared), and local machine names and addresses. Since the naming schemes at each site are independent, there is no way to tell whether a name used at one institution identifies the same entity at another institution. For example, there is no way to determine whether the person Joe Smith has an account at two different institutions, and if so what the local names are. The situation is similar for common applications, data sets, and databases.

The lack of a common and global namespace for important entities is a significant barrier to managing a distributed environment, since it prohibits the creation of software to automate many processes and inhibits the development of applications that can be easily moved from one machine or organization to another. Therefore, one of our fundamental principles is that the GBG will provide a single global namespace for certain key entities. At a minimum, this will initially include users, directories, files, access to database views, several common applications, and the compute resources incorporated into the GBG.
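For illustration only (these names are hypothetical, not a committed GBG layout), a single global namespace would let any authorized user at any site refer to the same entities by the same names:

/GBG/users/jsmith
/GBG/apps/blast/2.2
/GBG/data/pdb
/GBG/hosts/uva/centurion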

1.3.2.2 Federated Sites with Common Access

The GBG will be organized as a federation of cooperating distributed sites. The decision to organize the system as a federation reflects the practical situation of the real world: equipment available for GBG is physically housed at different sites; the data sets are physically located at different sites; personnel are already in place to maintain these resources at certain sites; researchers are in geographically separate areas from which they must be able to access the resources of the GBG; and it may be politically, technically, or contractually difficult or even impossible to move or copy many of the resources. In addition, a centralized approach is undesirable from a technical standpoint, since a centralized structure will ultimately not scale as the system grows.

Grid software is the key to making such a complex system manageable for administrators and reasonable to use for researchers. The grid software will be the glue that ties together the users, data, and computational and other resources at the various sites. Grid software kits for end users will provide the tools users need to easily access the resources that are incorporated into the GBG. (XXX Weak paragraph XXX) [***what are these kit things and do you need to mention them here?]

Specifically the GBG will provide access to the following types of resources:

Data sources. The GBG will incorporate relevant public databases, such as the Protein Data Bank (PDB) and Swiss-Prot. By mapping public databases into the GBG, each user can access them without having to make local copies or otherwise expend effort to procure or maintain them. Each database can be administered and maintained by one site — the one which is likely already doing so. The grid software can automatically manage local caching and cache consistency to improve performance for researchers.

Relevant and available commercial databases will also be incorporated into the GBG. The goals and issues are the same as for public databases, except that the grid software will have to manage access to the data differently: a licensing agreement may allow only select users to access the data or may impose restrictions on where the data can be moved. The grid software will need to deal with such restrictions.

Finally, the GBG will incorporate relevant in-house databases, data sets, annotations and the like. The GBG will provide mechanisms to easily distribute and provide access to locally produced data, saving data owners the hassle of developing and maintaining a dissemination mechanism themselves, while allowing them to retain the proper level of control if necessary.

<Andrew continue here>

Application suites. Many biological research projects require running one or more biological, statistical, or data mining applications to drive discovery. Just like data, applications may be publicly available freeware, vendor-licensed software, or home-grown and possibly proprietary programs, and the grid will need to handle access and license enforcement accordingly. Legacy applications will be written in various languages, may run on one or more operating systems, and may be stand-alone programs, parallel programs, or entire processes or workflows. Again, the grid environment must be able to run the types of legacy applications that researchers need to use. Applications may need very little or very large amounts of compute time, and may need to be run a small number of times or a very large number of times, as is the case for Monte Carlo and other probability-based simulations, studies involving nonlinear systems, and parameter space studies.

The tools provided to grid users must make deploying and running such applications, workflows, or suites of applications easy and efficient. Since the computational environment of the GBG is distributed and federated in nature, it is important that the grid software help manage the physical distribution of the appropriate versions of programs to the various computers in the system. This will help to alleviate the inefficiencies and potential errors that researchers and administrators face in trying to use distributed computing resources. Similarly, the grid must provide useful tools that help end users run applications, workflows, or suites of applications in an efficient manner that best uses the researcher’s time and the computing resources’ capacity.

The GBG will provide tools for end users or administrators to deploy applications into the GBG when necessary. This approach will not only allow power users to control rolling out their applications, which is especially important if the applications are to be frequently updated, but will also provide a quick way to disseminate new programs or program versions at least within the GBG community (see the Dissemination section – Core 5 ?? XXX for additional dissemination initiatives). Certain relevant applications will also be deployed directly into the GBG for public use by participating GBG sites. For certain heavily used or important applications the GBG may deploy portals for easing the use of these popular applications.

Heterogeneous compute resources. The complement to deploying applications and data is deploying and providing controlled access to the necessary computing environments on which to run applications. The GBG must pull together an array of computing resources from its various member institutions that provides the computational backbone to support the missions of the Core 2 and Core 3 researchers, as well as those of Cores 1, 5, and 6. These resources will be geographically separated, will cross site and administrative domain boundaries, and will be heterogeneous in machine type, capabilities, and operating system. Computers will span the gamut from PCs and workstations to clusters and supercomputers.


Since the resources will be a federated collection, they will potentially have different access and usage policies. This is simply a fact of life when disparate organizations are involved, but dealing with it explicitly vastly increases the GBG’s ability to attract and incorporate new partner sites into the grid.

1.3.2.3 Virtual Organizations

A virtual organization is a way to use grid technology, with its management of distributed resources and access control, to create the illusion for a group of users that they are part of one organization. As an example, a project team could set up a virtual organization within the GBG that contains directories for all of the data, applications, and computers to be used in the project, and set access control, security, and resource usage policies so that, to its users, it appears to be a self-contained world. Users of such a virtual organization have the experience of working within a local environment, including the ability to share data, applications, computation resources, and policies on their usage, even though the actual resources are physically separated. The grid infrastructure takes care of managing the stated access control policies of the organization and masking the fact that resources are distributed. Since the GBG will be a platform shared among various projects, and aims to scale to a national biological computing and data resource, providing virtual organizations to manage and keep separate the policies and private resources of the various partnerships is essential to its ultimate usefulness.
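As a purely hypothetical illustration, a project team might describe its virtual organization to the grid software with something like the following; the field names and values are assumptions for illustration, not an existing GBG schema.

# Hypothetical description of a diabetes-research virtual organization.
diabetes_vo = {
    "name": "diabetes-pathways",
    "members": ["jsmith@uva", "mlee@ttu", "rgarcia@utexas"],
    "data": ["/GBG/data/pdb", "/GBG/data/uva/clinical-deid"],
    "applications": ["/GBG/apps/blast/2.2"],
    "compute": ["/GBG/hosts/uva/centurion", "/GBG/hosts/utexas/tacc"],
    "policies": {
        "/GBG/data/uva/clinical-deid": {"read": ["jsmith@uva"], "export": False},
        "default": {"read": "members", "run": "members"},
    },
}

The grid infrastructure, not the project team, would be responsible for enforcing such a description across the member sites.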

1.3.2.4 Fine-grain Modular Security with Work towards HIPAA and 21 CFR Part 11

Security is clearly one of the hallmark problems encountered when resources are distributed across sites and administrative boundaries. The GBG must handle the usual security issues: authentication of identity, access control, data privacy and data integrity in order to be a workable solution for its users and resource providers.

XXX ran out of steam here – HELP? XXX

Since the GBG is designed to support work specifically in the biological and behavioral sciences, it is certain that some data that researchers collect or require access to will be subject to HIPAA and 21 CFR Part 11 rules. (XXX statutes?) It is important that the GBG be able to incorporate such data, so the GBG must support enforcement of the relevant HIPAA and 21 CFR Part 11 rules. We will initially deploy only data that can pass those rules. As part of the Core 2 research agenda (see section XXX) we will investigate security policy negotiation and interaction, and in particular how to support compliance with HIPAA.

XXX Is 21 CFR Part 11 part of HIPAA, or separate?? XXX

1.3.2.5 Build GBG Using Off-the-Shelf Technology

Our experience in developing Legion taught us many valuable lessons. Among them is that the development of grid software is an arduous and time-consuming task. Many of the basic grid issues have been solved reasonably well, and several off-the-shelf commercial and freeware software packages have emerged that incorporate the functionality we need. Partial solutions can be obtained from commercial vendors such as Avaki, Platform Computing, Sun, and United Devices and from academic efforts such as Legion or Globus. None of these solutions provides all of the necessary functionality, even for the initial phase of the project. But each has features that may be woven into a cohesive and robust initial platform which can rapidly start delivering significant benefits to the research of the other core efforts.

1.4 Build GBG on Evolving Standards

In constructing and deploying the GBG we are committed to deploying solutions that follow the grid and bioinformatics standards being developed in the Global Grid Forum (GGF) and I3C. Deploying system components that adhere to standards is the only way that we can ensure that the GBG will work within the widest possible set of future sites, will integrate with the widest possible set of outside software components, and will minimize the effort required to develop new functionality. The goal of the GBG is to pave the way towards biological grid systems that provide vast collections of resources to the researchers of the entire country. In order to approach this goal, the GBG must minimize solutions that will be incompatible with other technology or which will be rejected by the community.

In the grid community, the major standards bodies are the GGF and the various Web Services standards groups (W3C, DMTF, and OASIS). Several members of the GBG team have been actively engaged in the GGF since its inception. Particular standards efforts of note include the GGF's newly accepted Open Grid Services Infrastructure (OGSI) standard [cite XXX], the work of several groups under the umbrella of the Open Grid Services Architecture (OGSA) initiative, and the work on security (WS-Security) and policy negotiation (WS-Agreement) within the Web Services community.

The OGSA basic architecture is shown below in Figure XXX. At the bottom of the stack are the Web Services standards and OGSI on which the rest of the architecture is built. OGSI provides basic grid service naming, discovery, lifetime management, and communications capabilities.


[Figure: the OGSA services stack. Web Services and hosting platforms (J2EE, .Net, etc.) form the base; OGSI (Open Grid Services Infrastructure) sits above them; layered on OGSI are core grid services (services management, services communications, security, policy management), grid data services (data access, data transformation, data federation, data caching, metadata catalog), grid program execution services (job queuing and scheduling, application provisioning, auditing and metrics), and domain-specific services. Virginia grid researchers are involved in all aspects of the OGSA stack. Source: A Visual Tour of Open Grid Services Architecture, IBM, 8/03.]

Figure XXX. The OGSA services stack as currently envisioned. The Virginia Grid team has worked on software at all layers in this stack, and is actively involved in the OGSA working group, the OGSI working group, the Grid Program Execution Services group, and the Data services group.

Higher-level “core services” include security services, policy management services, manageability services, data services, service (e.g., application) provisioning services, scheduling and brokering services, and so on. The architecture and philosophy of OGSA are very reminiscent of the Legion architecture.

The pace of standards work in both groups is very rapid at the moment, and the GBG will participate in the relevant efforts and will help drive important areas if necessary. We believe that OGSA, in conjunction with relevant Web services standards, represents the best choice for the life sciences community.

We are not alone in choosing to base the GBG on OGSA. The major vendors (IBM, HP, Sun, Fujitsu, and NEC) have all embraced OGSA and are actively participating in the OGSA working group that is defining the standards. Further, IBM is basing its “on-demand computing” and “autonomic computing” efforts on OGSA. Thus we are confident that in the future there will be a large number of OGSA-compliant hardware and service offerings.

In addition to grid standards, the biology community has standards for data naming and search and for data formats. One recent standard that we intend to monitor closely is the Life Science Identifier (LSID) standard [XXX cite XXX] finalized by the Interoperable Informatics Infrastructure Consortium (I3C). This standard has been proposed to provide a common naming scheme and name-binding process for all types of biological and life sciences data. The GBG architects will monitor the progress of LSID-based solutions and are committed to deploying compatible solutions as they become available.

1.4.1 GBG Deployment

The GBG will be deployed in three broad phases over the next five years, although we expect that the GBG will evolve in smaller ways almost continuously over its lifetime to incorporate the best technology available as it develops. Since it will ultimately be a production environment for the biological scientists that use it, GBG administrators will plan upgrades and new software roll-outs carefully, with an emphasis on keeping user disruption to a minimum. The first phase, which is already partially under way, is designed to provide researchers with state-of-the-art tools as soon as possible in order to rapidly improve their access to and management of data and computational resources and to increase their overall ability to get related work done. The two later phases will deploy new state-of-the-art software, including software developed as part of this proposal as well as software developed by vendors and others in the grid and biological sciences community. In addition, later phases will add new data sources and participating organizations to the expanding GBG community.

1.4.2 Phase I

Phase I focuses on getting technology into the hands of researchers and providing access to data and computational resources otherwise unavailable to them or difficult to maintain by them. Phase I will begin with the charter member sites of the GBG, as shown in Figure XXX, which are UVa, NCBio, UT, TCAG, and TTU. The expertise, experience, and resources provided by each institution and their roles within the greater GBG project are described in detail in the Core 4 discussion.

[Figure placeholder: the diagram here showed the GBG Phase I sites (University of Virginia, University of Texas, Texas Tech, NC BIO, and VSF-TCAG, with labels for Biochemistry and DERC), their data sources (PDB, NCBI, EMBL, and local sequence sets SEQ_1 through SEQ_3), processing clusters, and applications, under the banner "Better Research, Faster": secure, wide-area access to a global breadth of consistent, current data; access to vast processing power; and the ability to securely share proprietary data and applications as needed.]

Figure XXX. The GBG initial sites are shown. As we move from Phase I to Phase II additional sites around the world will be added. Preliminary discussions have already been held with the e-Science centers in Cambridge and Oxford, UK.

1.4.2.1 Phase I Goals

To accomplish this mission, we have set eight goals for Phase I.

Goal #1: Construct the GBG rapidly. Getting the GBG operational, even if in a somewhat limited way, is crucial to supporting the research efforts of the biological and bioinformatics researchers as well as those of the other cores. Until the system is operational, researchers cannot truly learn to use it or realize any of the benefits it offers. To accomplish this quick deployment, Phase I of the GBG will employ primarily currently available off-the-shelf software and existing computational hardware.

Goal #2: Deploy real data sets. Several public databases are readily available and widely used by the biological research community. By deploying these databases into the GBG during Phase I we can quickly gain real users from the research community and begin to collect early feedback from them. The users themselves will benefit from easier access to the data, possibly improved access performance and decreased maintenance requirements at each user site.

We will also deploy data sets stored locally at member institutions that are needed for Core 2 and Core 3 research efforts. These data sets will then be available to all collaborating partners, regardless of their location, subject to access control policies.

Goal #3: Deploy a strong base of computational hardware into the GBG. To fulfill the computational needs of the biological researchers as well as to provide a platform for the Core 1 researchers, the GBG will incorporate a substantial set of computational assets. In particular, during Phase I the GBG will initially incorporate machines from four academic departments within UVa, including the 300+ processor Centurion cluster and over 200 PCs in the public computing laboratories and classrooms; 160 processors from NCBio; and over 1,200 processors from UT's Texas Advanced Computing Center. These machines collectively represent a capability that no single institution could provide. As Phase I evolves, additional machines will be incorporated into the GBG as they become available at the existing sites or as additional sites join the GBG.

Goal #4: Deploy relevant applications into the GBG. The GBG will deploy several widely used, freely available applications, including BLAST, FASTA, NAMD, and Genesis. As with Goal #2, deploying common applications will provide a carrot to recruit early-phase users from the biological research community and will provide real-world application instances with which Core 1 researchers and GBG administrators can gain experience.

We will also deploy local applications for Core 2 and 3 researchers as needed on the grid.

Goal #5: Provide easy access to data for researchers. It is not sufficient simply to have data available within the GBG; we must also provide easy mechanisms for researchers to access and update their existing data and to deploy new data sets. The GBG will deploy the Avaki Data Grid (ADG) to the necessary GBG sites. ADG software provides functionality to manage distributed directories and files and to access relational databases via canned queries. It also includes client-side software that accesses data as though it were local to the user's machine, providing access via NFS and CIFS. For more details on ADG and its selection, see the Core 4 discussion on GBG grid software.

Goal #6: Provide software to run applications on the grid. Data access is only one piece of the puzzle. Researchers also need tools that make harnessing the GBG's computational power relatively easy and as reliable as possible for the main types of bioinformatics applications, including parallel programs, parameter-space studies, and large-scale comparisons. Platform Computing's MultiCluster and other related components of their LSF suite will provide the application management and monitoring capability for GBG Phase I. MultiCluster is a reliable, proven product for running applications across resources that span networks and sites; it is discussed in more detail in the Core 4 section.

Goal #7: Create a stable platform to support work on Core 1 research efforts. The biologists and bioinformaticians of Cores 2 and 3 are the driving force behind the construction of the GBG. However, the researchers in Core 1 also need a platform on which to build, debug, and test software related to the Core 1 research agenda. The Phase I deployment is designed to be more than adequate to support early prototype work. As prototypes develop and solidify, they will be deployed into the GBG for use and comment by Core 2 and 3 researchers as well.

Goal #8: Initiate early feedback from researchers to drive future directions. Part of the process of using technology as new as grid software, and of working towards the long-term goal of a nation-wide bioinformatics grid system, is finding out what works and what does not and determining what users really need. By deploying a usable, state-of-the-art grid system as early as possible, the GBG will jump-start the feedback process from a real user base. This is critical for making the most progress in the shortest timeframe, since user requirements are determined most easily when users can actually see the possibilities and use the latest technology. The feedback gained in Phase I will drive improvements to the deployed GBG grid software as well as the Core 1 research.

1.4.2.2 Phase I Grid Software

To deploy the GBG quickly and to provide the most stable platform possible, we have purposely chosen to construct the GBG's software infrastructure from the best existing, proven grid solutions. In the sections below, we briefly introduce the major grid software components.

Platform LSF MultiCluster™

Platform LSF is software for managing and accelerating batch workload processing for compute- and data-intensive applications. Platform LSF enables users to intelligently schedule and guarantee completion of batch workloads across a distributed, virtualized IT environment. Platform LSF fully utilizes all IT resources, regardless of operating system, including desktops, servers and mainframes, to ensure policy-driven, prioritized service levels for always-on access to resources. Platform LSF is based on the production-proven, open, grid-enabling Virtual Execution Machine (VEM)™ architecture that sets the benchmark for performance and scalability across heterogeneous environments. With more than 1,600 customers and the industry's most extensive library of third-party application integrations, Platform LSF is the leading commercial solution for production-quality workload management.

Platform LSF MultiCluster extends the basic LSF suite to enable control of resources that span multiple clusters and geographical locations. With Platform LSF MultiCluster, local ownership and control are maintained, ensuring priority access to any local cluster while providing global access across an enterprise grid. These features will enable GBG users to complete workload processing faster with increased computing power, enhancing productivity and speeding time to results.
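As a concrete illustration of how a GBG user might drive the parameter-space and comparison workloads mentioned in Goal #6 through LSF, the minimal sketch below submits one batch job per query sequence with the standard bsub command. The queue name, the file locations, and the use of the legacy blastall program are illustrative assumptions, not part of the planned GBG configuration.

# Minimal sketch: submit one LSF batch job per query sequence.
# Assumes LSF's "bsub" is on the PATH and that a queue named "gbg_blast"
# exists; both the queue name and the paths are illustrative only.
import glob
import subprocess

for query in glob.glob("/grid/data/queries/*.fasta"):   # hypothetical data-grid mount point
    cmd = [
        "bsub",
        "-q", "gbg_blast",                  # target queue (assumed name)
        "-o", query + ".out",               # LSF writes job output here
        "blastall", "-p", "blastp",         # legacy NCBI BLAST invocation
        "-d", "nr",                         # database name (assumed to be installed)
        "-i", query,
    ]
    subprocess.run(cmd, check=True)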

To make the selection of LSF MultiCluster as our grid compute engine possible, Platform has agreed to license the necessary software for all of the GBG sites at a deeply discounted price so that it fits within the Core 4 budget.

Avaki Data Grid™

Avaki Data Grid 4.0 (ADG) provides a single, streamlined mechanism for making various types of data available to developers and users in any location, on demand. ADG supports relational database data (the results of SQL statements and stored procedures); URL-based data sources such as servlets, CGI scripts, and web services; XML documents; XSLT style sheets; and files in direct-attached, network-attached, or SAN storage. Database results can be automatically formatted into XML, and XML data can be transformed via stored XSLT style sheets into any structure or format, providing a powerful range of conversion capabilities.

When using ADG 4.0, users and applications access data as if it were local. One unified data catalog provides one view for accessing all available data. Distributed caches ensure performance across a wide area. Users access data transparently through standard file system protocols, while applications access data using standard methods: ODBC, JDBC, Web Services/SOAP, file read, and JSP/Tag library.
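Because ADG exposes the unified catalog through standard file-system protocols, client code needs no grid-specific API. The fragment below is a minimal sketch of that transparency; the mount point /gbg and the paths beneath it are illustrative assumptions, not actual GBG catalog names.

# Minimal sketch of transparent data access through an ADG/NFS mount.
# "/gbg" is a hypothetical mount point for the grid data catalog; the paths
# under it are illustrative, not actual GBG names.
with open("/gbg/public/pdb/1abc.pdb") as remote_file:
    header = remote_file.readline()        # read is served from a local ADG cache when possible
    print(header.strip())

# Writes go through the same interface, subject to access control.
with open("/gbg/uva/derc/results/run_001.txt", "w") as out:
    out.write("analysis results\n")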

To make the selection of ADG as our grid data engine possible, Avaki has agreed to license the necessary software for all of the GBG academic sites at no cost. With this arrangement, the GBG acquires proven, best-of-class data grid software covering a substantial portion of its grid software needs at no licensing cost.

Contingency Plans

In the event that we cannot acquire the commercial software for whatever reason (budget, availability, etc.), there are reasonable free alternatives to which we can fall back. UVa has a license agreement to use the Legion 1.8 grid system for deployment on projects like the GBG. Legion 1.8 provides remote data management and access, application staging, and a wide range of support for running legacy and MPI programs in a distributed environment. Several of the Core 1 team members are responsible for the development of the Legion system and are confident that it can serve as a reasonable replacement for the Platform MultiCluster component, the ADG component, or both if necessary.

1.4.2.3 Phase I Benefits

o benefits to researchers

o XXX what else ? XXX

1.4.2.4 Phase I Shortcomings

o stated goals not met and briefly how we will address them later.

o XXX what else ? XXX

1.4.3 Phase II

Phase II overview – new software (results of research and new best of breed grid software), new partners (Cambridge, Oxford)

1.4.3.1 Phase II Goals

o incorporate new sites, new data, new compute resources and new researchers

o roll out early solutions from Core 1 research results

o deploy latest software and improve standards compliance

1.4.3.2 Phase II Software

o latest available grid stuff that we can afford

o do we have any concrete ideas of what will be here?

1.4.3.3 Phase II Players

o Phase I players

o Cambridge

o Oxford

o others?

Phase II picture (maybe)

1.4.3.4 Phase II Timeline

o Years 2-3?

1.4.3.5 Phase II Expected Results:

o ???

1.4.4 Phase III

Same as phase II outline

XXX Checklist of major things to be done:

1) Improve “Fine-grained modular security…” section

2) Write Phase I Benefits section

3) Write Phase I Shortcomings section

4) Write Phase II section

5) Write phase III section

1.5 Research Challenges

For each of the research challenges we will present:

o the problem,

o the shape of the solution,

o the research agenda,

o how we will evaluate the solution, and

o a brief integration and testing plan.

1.5.1 Security – Confidentiality, Data Integrity, Access Control, Policy Negotiation (Humphrey, Grimshaw, Knaus)

One of the most important challenges in the creation of the GBG is to create a secure environment for users and data. At its core, this requirement means preventing unauthorized access to data and programs. The security infrastructure must be designed to support a wide variety of user needs and to anticipate advances in the rapidly developing field of security mechanisms, policies, and legislative requirements from around the world. A number of excellent underlying mechanisms already exist to support a secure environment; in particular, there is general acceptance in the academic and commercial sectors that it is not the low-level cryptography routines (e.g., RSA, Diffie-Hellman) that are the cause of most current breaches of security. However, many of the higher-level concepts built on this low-level cryptography are not developed to the point that they are usable in a day-to-day production and research environment such as the GBG.

There are many key unsolved problems that we will address in the GBG. We will develop a comprehensive suite that enables the researcher to specify security requirements (e.g., confidentiality and integrity constraints) on both the data they produce and the data they access. There are competing emerging approaches to the issue of federation (see Liberty and WS-Federation); we will evaluate these approaches for viability and further develop easy-to-use authentication procedures to facilitate this federation. Other issues include the management of trust relationships, privacy, and auditing/intrusion detection/accountability. To be elaborated.

Phase 1 development will focus on the rapid deployment of usable security infrastructures built from current mechanisms. It is important to recognize that these current approaches fall into one of three categories: [1] they work "in isolation" but have not been extended to support the heterogeneity and site autonomy of the grid; [2] they are appropriate for a "single grid" in which heterogeneity (in policy, in particular) is minimal (an example is the DOE Science Grid); or [3] they have not been completely evaluated in the context of "inter-grids", in which there is true heterogeneity of policies and mechanisms. We firmly believe that the GBG falls into category 3, with a unique combination of mechanisms and policies, so we argue that, while there are many excellent foundations for an immediately operational security infrastructure, we will need to provide the "glue" that facilitates operation in the GBG. Our core will be the Web Services security approach currently being endorsed and developed as part of the Open Grid Services Architecture (OGSA). Web Services security fits well with the emerging XML-based data exchange documents for clinical data.

In Phase 2, we will serve as a testbed for HIPAA-compliant management of medical information. We will create specific solutions for the patient-centric management of protected health information. The basis of the requirements will be obtained through a per-person registration form that specifies: the name(s) of the persons or class of persons authorized to use or disclose the data, the specific identification of the person(s) or class of persons to whom the data can be disclosed, the date when the authorization will expire, a description of the method by which the individual can revoke the authorization, and the risks pursuant to this authorization. Managing such information at the scale of the GBG will be a key challenge; it will be necessary, for example, to develop an appropriate (scalable) taxonomy of person classes, track accesses to the secure data, and quickly revoke access when directed by the patient.
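The registration form described above maps naturally onto a small structured record. The sketch below is only an illustration of the fields involved; the field names, types, and the permits() check are our assumptions, not a committed GBG schema.

# Illustrative sketch of a per-patient authorization record capturing the
# fields listed above. Names and types are assumptions, not a GBG schema.
from dataclasses import dataclass
from datetime import date
from typing import List

@dataclass
class DisclosureAuthorization:
    patient_id: str                  # subject of the protected health information
    authorized_users: List[str]      # persons or classes of persons who may use/disclose
    permitted_recipients: List[str]  # persons or classes to whom data may be disclosed
    expiration: date                 # date on which the authorization expires
    revocation_method: str           # how the patient can revoke the authorization
    stated_risks: str                # risks disclosed to the patient for this authorization
    revoked: bool = False            # set when the patient revokes access

    def permits(self, user_class: str, today: date) -> bool:
        """True if the named class may currently access the data."""
        return (not self.revoked
                and today <= self.expiration
                and user_class in self.authorized_users)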

1.5.2 High Availability

Building dependable life science grids is, at least in part, an engineering challenge. Developers and users of distributed systems are primarily concerned with availability and reliability. Informally, scientists or business users want to know that a given service is "available" to perform useful work, e.g., to access, query and retrieve data from a protein database; fetch a document on the web; or view radiography images from a remote site. They also expect services to be "reliable," meaning that services will function correctly. Today, neither property can be taken for granted.

Why do grids fail, and what can we do about it? Some of the primary causes of the brittleness of current grid systems are environmental:

Grids mostly consist of prototypes and proof-of-concept projects — there are very few production grids in use today.

Grids have been used mostly in research environments by early pioneers who were willing to tolerate unreliability.

In commercial settings, product development is driven by the push to be first to market. Dependability is at best a secondary concern.

These problems can be addressed through the adoption of state-of-the-art engineering practices and industrial-strength solutions [Brewer2001, Oppenheimer2003] in designing and building the basic services provided by a grid system, so as to avoid mistakes and "bugs" in both the design and implementation of a grid infrastructure.

1.5.2.1 Research Challenges

While best practices in software engineering should be used to increase dependability, the development of life-science grids presents several research challenges.

Grid requirements. What are the requirements of a grid system in terms of quantifiable and verifiable dependability attributes? Not only should standard facets such as reliability and availability be stated explicitly, but attributes such as maintainability, usability, security, safety and survivability should also be stated. What are the requirements for specific life-science applications? What are the requirements for grids to support such applications? In what forms should these requirements be stated? Only when such issues have been adequately examined should we consider the application or development of tools, methodologies, architectures and software artifacts.

Grid fault models. Related to the task of meeting dependability requirements is that of specifying the fault models for various life-science grid applications and deployment scenarios. The goal of this proposal is to construct a national resource for biomedical computing. If successful, it will undoubtedly be confronted with malicious attacks, either because of its scale or because it manages high-value systems and/or data. Research is needed to identify and develop fault models that take into account not only traditional faults such as environmentally induced faults (power failures, storms and hurricanes) but also evolving threats (malicious attacks, denial-of-service attacks, electronic worms and viruses). Furthermore, life-science grids pose additional and novel threats. Researchers and businesses may coordinate on one project but compete on another. The fault model must therefore incorporate threats from a "co-opetition" model wherein grid users may exhibit both cooperative and competitive relationships simultaneously.

Grid dependability architectures, tools and methodologies. Past experience indicates that techniques described in the dependability literature will often be appropriate for a grid application, given a suitable fault model. The challenge lies in integrating known techniques with a target application, since application-domain scientists are not dependability experts. Conversely, grid infrastructure builders are not domain experts. Our research will focus on enabling dependable grid applications by setting up a framework that enables easy composition and integration of techniques and structures developed by dependability experts with applications. Preliminary research indicates that this approach is both feasible and applicable to life science applications [Nguyen-Tuong1996]. For example, many life-science applications belong to a class of application structures known as "bag-of-tasks," wherein the aggregate computation is decomposed into independent subtasks [Casanova1997, Zagrovic2002, Chien2003]. This class of applications is highly scalable and amenable to dependability techniques. However, while the implicit current threat models for these applications include such actions as cheating (to get more credit), faking data, and crashing hosts, they do not include denial-of-service attacks; it would be a simple feat to take such applications down. Developing frameworks, tools, and methodologies for exploiting application patterns, and identifying further patterns in both grid and life-science applications, is an ongoing research effort.

Grid integration and legacy applications. Architecture and tool development must take into account the underlying fabric of grid systems, e.g., queuing systems, application servers, workflow systems, and message queues, in order to reduce the barriers to entry for incorporating dependability techniques. Further, these architectures should be congruent with standards such as those emerging from the grid, web services and life science communities (e.g., GGF, I3C, WS-* specifications). When feasible, dependability techniques should also be compatible with existing legacy applications, with as little effort required as possible on the part of end users.

The first two challenges discussed above stem from a major deficiency in the grid literature on dependability: the omission of explicit requirements and fault models. Such assumptions are often implied, but the emphasis is usually on techniques, such as various forms of replication, without the requisite careful exposition of dependability requirements. Without an explicit causal linkage between dependability requirements and given techniques, it is difficult to gauge the effectiveness and range of applications over which a given technique should be used. Applying the results (concepts, tools, methodology and architectures) obtained from over thirty years of research in dependable systems to grids should mitigate this problem and yield practical, fruitful results.

1.5.2.2 Research Agenda

We have developed concrete steps for attacking the research problems outlined above. First, we will require that all software artifacts and architectures be congruent with the OGSI and OGSA specifications. Furthermore, we will actively participate in and contribute to the specification of standards through organizations such as the GGF and the I3C. We have four main steps for meeting the research challenges.

1) Identify requirements

We will identify dependability requirements for grid-based life-science applications through a comprehensive and continual survey of the field. Our approach draws on our past experience with life-science applications, interactions with colleagues in both the life-science and grid communities, surveys of the literature, our relationships with industrial and non-profit partners such as Avaki, TCAG, IBM and HP, and attendance at conferences, workshops and standards-setting bodies in the grid and life-science communities.

The requirements identified will not be monolithic but will differ for various applications. Even for the same application, requirements will differ across deployment environments.

2) Identification of fault models

We will also expend significant resources in the first year to identify relevant fault and threat models. We will draw heavily on the existing body of literature in the dependability community as well as the security community. We will also draw on the expertise of the dependability research group at UVa (http://dependability.cs.virginia.edu/) for collaboration and insights. As in the previous step, we expect to identify multiple fault models applicable to various grid deployment scenarios.

3) Identification of key applications

We will identify key applications which will drive our research and development activities. The number of applications should be relatively small so that we can perform in-depth research and experiments, but should be large enough to encompass broad classes of life-science applications. We have identified an initial candidate set of applications: a publishing service that will host life-science databases such as the PDB, a software repository to manage applications distributed by the center [***which center?](including externally developed software), and a computationally intensive application such as protein/DNA sequence comparison. The set of applications will certainly expand as we get feedback from the community.

4) Development of architectures, tools and methodology

The process of developing architectures, tools and methodology based on insights from the previous three steps will continue throughout the center's lifetime. We will first develop and integrate several key capabilities into the grid infrastructure. The thrust is to provide services that support dependability at the system level. These capabilities include:

Grid Fault Monitoring and Reporting. The ability to detect and report errors is a basic grid requirement. We will draw upon our experience in managing large grid systems to provide remote and self-diagnostic capabilities for all grid resources. Such capabilities are already present in a basic form at the individual grid resource level, e.g., queuing systems, clusters, hosts, and fault-detection services [Stelling1998, Natrajan2002], but they are not uniformly and consistently applied across all managed resources, even within a single administrative domain.

Evaluation criteria: whether the fault monitoring and reporting tools are compliant with the OGSI and OGSA specifications; whether they are sufficient to support the dependability requirements identified in step 1, given the fault models in step 2; performance analysis of the software in terms of resource consumption (e.g., CPU, memory, network bandwidth); and maintainability and ease of use.
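A minimal heartbeat-style sketch of the kind of uniform monitoring and reporting interface we have in mind appears below. The host names, the use of ping as the probe, and the report format are illustrative assumptions only; a production GBG monitor would probe queues, services, and disks, presumably through OGSI-compliant interfaces.

# Minimal heartbeat-style monitor sketch. Host names, the ping probe, and the
# report format are illustrative assumptions, not a GBG design.
import subprocess
import time

RESOURCES = ["cluster1.example.virginia.edu", "node07.example.ttu.edu"]  # hypothetical hosts

def probe(host: str) -> bool:
    """Return True if the host answers a single ping within two seconds (Linux-style ping assumed)."""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0

def report_cycle() -> None:
    for host in RESOURCES:
        status = "UP" if probe(host) else "DOWN"
        print(f"{time.strftime('%Y-%m-%d %H:%M:%S')} {host} {status}")

if __name__ == "__main__":
    report_cycle()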

Grid Resource Management. We will draw upon concepts developed in the survivable systems community [Lipson2000, Knight2002, Knight2003, Rowanhill2004, Scandariato2004] to both proactively and reactively manage a grid system [Hollingsworth2003]. The premise behind a survivable grid system is that it is able to provide essential services, perhaps in a degraded form, despite being subjected to various faults. Proactive management consists of configuring the grid system to reduce or eliminate its susceptibility to various faults; an example would be proactively patching software to eliminate security vulnerabilities. Reactive management consists of reconfiguring the system to reduce or eliminate that susceptibility once a fault has been detected; an example would be to direct requests intended for a service hosted on a failed host to an alternate backup service, or to notify a scheduling service so that it does not place any new computational tasks on the failed host.

Evaluation criteria include the ability of a service to survive various faults, where survival is defined by the requirements elicited in step 1 and the faults are defined in step 2, and the performance tradeoffs between levels of dependability and resource consumption (e.g., CPU, memory, network bandwidth).

Another thrust area will be to develop tools, techniques and APIs for incorporating dependability techniques at the application level. We will investigate the broad class of life-science applications that are, or should be, structured as bag-of-tasks computations. We will also identify and investigate other common application patterns in life-science applications. We will incorporate previous research in these arenas [Nguyen-Tuong1999, Nguyen-Tuong2000, Hwang2003, Thain2003] and extend those works to cope with the evolving fault models identified in step 2 above.
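To make the bag-of-tasks pattern and its retry-style fault handling concrete, here is a minimal sketch. It runs subtasks in local worker threads rather than on remote grid hosts, and every name and parameter in it is our own illustrative choice, not a proposed GBG API.

# Simplified bag-of-tasks executor with retry on failure. This illustrates the
# dependability pattern only; a grid implementation would dispatch tasks to
# remote hosts and use the fault models identified above.
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List

def run_bag_of_tasks(tasks: Dict[str, Callable[[], object]],
                     max_attempts: int = 3,
                     workers: int = 4) -> Dict[str, object]:
    results: Dict[str, object] = {}
    remaining: List[str] = list(tasks)
    attempts = {name: 0 for name in tasks}

    while remaining:
        failed: List[str] = []
        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = {pool.submit(tasks[name]): name for name in remaining}
            for future in as_completed(futures):
                name = futures[future]
                attempts[name] += 1
                try:
                    results[name] = future.result()   # subtask succeeded
                except Exception:
                    if attempts[name] < max_attempts:
                        failed.append(name)           # resubmit on the next round
        remaining = failed
    return results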

A key competitive advantage of this proposal is that its personnel have been active in the grid community, in both academic and industrial settings, over the last decade, as well as in the life-science community. Leveraging this experience will be a significant factor in constructing and operating a successful biomedical computing center.

1.6 Technology Integration Plan

[***The technology plan – the deployment plan is covered in Core 4 – something like the slide below, explained, with year-by-year deliverables.]

The Global Bio Grid will not suddenly appear in full form. Like any complex technology-based project, it will come into being in phases, as described above. Further, unknown events, changes in requirements, and technological changes can alter the road map. That said, we have an overall plan based on what we know today.

The plan consists of three components: starting with robust commercial implementations of critical software, incorporating software and components constructed by the center and its partners, and incorporating and dealing with changes in the external environment, e.g., new standards, new threat characteristics, etc.

As stated above, our intention is to begin with robust commercial software and to use it whenever possible and cost-efficient. The reasons are simple: (1) the development costs of re-engineering what has already been done are prohibitive, (2) as a general rule the open source implementations of the same functionality are not yet as robust and mature as the commercial versions, and (3) the commercial versions are backed by an existing support infrastructure.

The GBG deployment plan is covered in the Core 4 (Infrastructure) write-up. Here we address the technology roadmap and how we expect to acquire and integrate the required pieces in the face of a constantly changing external technology environment.

Integration with Core 2 – as an instance of a more general problem

1.7 Algorithms

Take-away message: Besides data and cycles there is a strong need for algorithms – both algorithms to solve problems that currently cannot be solved at all and improved versions of existing algorithms. UVA and its partners can deliver.

Robust optimization (Robins)

Information Retrieval and Data Mining (French, Kovatchev, others?)

1.8 Appendix: Background on Grids and the Use of Grids in the Life Sciences

1.8.1 Grids in Operation - Legion

Take-away message: We have done grids very successfully before – not all grids are Globus and low-level. We did this in the context of NPACI, where the PI (Grimshaw) was the Metasystems (Grids) thrust area leader and a member of the NPACI steering committee for five years before taking leave from academia to found a successful grid startup.

Here we describe a typical Legion grid in operation, both to distinguish it from what most people know – low-level Globus grids – and to stress that our plan is to build Legion-like grids on top of the OGSA. We begin with what Legion provides and then walk through a representative usage scenario.

Legion is comprehensive grid software that enables efficient, effective and secure sharing of data, applications and computing power. It addresses the technical and administrative challenges faced by organizations, such as research, development and engineering groups, with computing resources in disparate locations, on heterogeneous platforms and under multiple administrative jurisdictions. Because Legion enables these diverse, distributed resources to be treated as a single virtual operating environment with a single file structure, it drastically reduces the overhead of sharing data, executing applications and utilizing available computing power, regardless of location or platform. The central feature of Legion is its single global namespace. Everything in Legion has a name: hosts, files, directories, security groups, schedulers, applications, etc. The same name is used regardless of where the name is used and regardless of where the named object resides at any given point in time.

In this and the following sections, we use the term “Legion” to mean both the academic project at the University of Virginia as well as the commercial product, Avaki, distributed by the Avaki Corporation.

Legion helps organizations create a compute grid, allowing processing power to be shared, as well as a data grid, a virtual single set of files that can be accessed without regard to location or platform. Fundamentally, a compute grid and a data grid are the same product – the distinction is solely for the purposes of exposition. Legion’s unique approach maintains the security of network resources while reducing disruption to current operations. By increasing sharing, reducing overhead and implementing with low disruption, Legion delivers important efficiencies that translate to reduced cost.

We start with a somewhat typical scenario and how it might appear to the end user. Suppose we have a small grid, as shown below, with four sites: two different departments in one company, a partner site and a vendor site. Two sites are using load management systems; the partner is using Platform Computing's Load Sharing Facility (LSF) and one department is using Sun Grid Engine (SGE). We will assume that there is a mix of hardware in the grid, e.g., Linux hosts, Solaris hosts, AIX hosts, Windows 2000 and Tru64 Unix. Finally, there is data of interest at three different sites. A user then sits down at a terminal, authenticates to Legion (logs in) and runs the command

legion_run my_application my_data

Legion will then, by default, determine the binaries available, find and select a host on which to execute my_application, manage the secure transport of credentials, interact with the local operating environment on the selected host (perhaps an SGE™ queue), create accounting records, check whether the current version of the application has been installed (and install it if not), move all of the data around as necessary, and return the results to the user. The user does not need to know where the application resides, where the execution occurs, where the file my_data is physically located, nor any of the myriad other details of what it takes to execute the application. Of course, the user may choose to be aware of, and specify or override, certain behaviors, for example, specifying an architecture on which to run the job, naming a specific machine or set of machines, or even replacing the default scheduler. In this example the user exploits several key features:

Global namespace. Everything she specifies is in terms of a global namespace that names everything: processors, applications, queues, data files and directories. The same name is used regardless of the location of the user of the name or the location of the named entity.

Wide-area access to data. All of the named entities, including files, are mapped into the local file system directory structure of her workstation, making access to the grid transparent.

Access to distributed and heterogeneous computing resources. Legion keeps track of binary availability and the current version.

Single sign-on. The user need not keep track of multiple accounts at different sites. Indeed, Legion supports policies that do not require a local account at a site to access data or execute applications, as well as policies that require local accounts.

Policy-based administration of the resource base. Administration is as important as application execution.

Accounting, both for resource usage information and for auditing purposes. Legion monitors and maintains an RDBMS with accounting information such as who used what application on what host, starting when, and how much was used.

Fine-grained security that protects both the user’s resources and those of others.

Failure detection and recovery.

Figure XXX. A Legion Grid.

1.8.1.1 Creating and Administering a Legion Grid

Legion enables organizations to collect resources – applications, computing power and data – to be used as a single virtual operating environment. This set of shared resources is called a Legion Grid (see Figure XXX, above). A Legion Grid can represent resources from homogeneous platforms at a single site within a single department, as well as resources from multiple sites, heterogeneous platforms and separate administrative domains.

Legion ensures secure access to resources on the grid. Files on participating computers become part of the grid only when they are shared, or explicitly made available to the grid. Further, even when shared, Legion’s fine-grained access control is used to prevent unauthorized access. Any subset of resources can be shared, for example, only the processing power or only certain files or directories. Resources that have not been shared are not visible to grid users. By the same token, a user of an individual computer or network that participates in the grid is not automatically a grid user and does not automatically have access to grid files. Only users who have explicitly been granted access can take advantage of the shared resources. Local administrators may retain control over who can use their computers, at what time of day and under which load conditions. Local resource owners control access to their resources.

Once a grid is created, users can think of it as one computer with one directory structure and one batch processing protocol. They need not know where individual files are located physically, on what platform type, or under which security domain. A Legion Grid can be administered in different ways, depending on the needs of the organization.

1. As a single administrative domain. When all resources on the grid are owned or controlled by a single department or division, it is sometimes convenient to administer them centrally. The administrator controls which resources are made available to the grid and grants access to resources. In this case, there may still be separate administrators at the different sites who are responsible for routine maintenance of the local systems.

2. As a federation of multiple administrative domains. When resources are part of multiple administrative domains, as is the case with multiple divisions or companies cooperating on a project, more control is left to administrators of the local networks. They each define which of their resources are made available to the grid and who has access. In this case, a team responsible for the collaboration would provide any necessary information to the system administrators, and would be responsible for the initial establishment of the grid.

With Legion, there is little or no intrinsic need for central administration of a grid. Resource owners are administrators for their own resources and can define who has access to them. Initially administrators cooperate in order to create the grid; after that, it is a simple matter of which management controls the organization wants to put in place. In addition, Legion provides features specifically for the convenience of administrators who want to track queues and processing across the grid. With Legion, they can:

Monitor local and remote load information on all systems for CPU use, idle time, load average and other factors from any machine on the grid (see Figure XXX below).

Add resources to queues or remove them without system interruption, and dynamically configure resources based on policies and schedules.

Log warnings and error messages and filter them by severity.

Collect all resource usage information down to the user, file, application or project level, enabling grid-wide accounting.

Create scripts of Legion commands to automate common administrative tasks.

Figure XXX. Legion system monitor running on NPACI-Net in 2000 with the “USA” site selected. Clicking on a site opens a window for that site, with the individual resources listed. Sites can be composed hierarchically. Those resources in turn can be selected, and then individual objects are listed. “POWER” is a function of individual CPU clock rates and the number and type of CPUs. “AVAIL” is a function of CPU power and current load; it is what is available for use.

In Figure XXX, below, we display the Legion job status tools. Using these tools the user can determine the status of all of the jobs that they have started from the Legion portal – and access their results as needed.

Figure XXX. Legion job status tools accessible via the web portal.

In Figure XXX, below, we show the portal interface to the underlying Legion accounting system. We believed from very early on that grids must have strong accounting or they will be subject to the classic tragedy of the commons, in which everyone is willing to use grid resources, yet no one is willing to provide them. Legion keeps track of who used which resource (CPU, application, etc.), starting when, ending when, with what exit status, and with how much resource consumption. The data is loaded into a relational DBMS and various reports can be generated. (An LMU is a "Legion Monetary Unit"; it is one CPU second normalized by the clock speed of the machine.)
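Since the normalization constant is not specified here, the following one-line sketch assumes, purely for illustration, that CPU seconds are normalized against a nominal 1 GHz reference clock.

# Illustrative only: LMU accounting under an assumed 1 GHz reference clock.
# The actual Legion normalization constant is not specified in this document.
def lmus(cpu_seconds: float, clock_mhz: float, reference_mhz: float = 1000.0) -> float:
    return cpu_seconds * (clock_mhz / reference_mhz)

print(lmus(cpu_seconds=3600, clock_mhz=1500))   # one hour on a 1.5 GHz CPU -> 5400 LMUs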

[***Where is the LMU graphic? Not showing on my version]

Figure XXX. Legion accounting tool.

1.8.1.2 Legion Data Grid

Data access is critical for any application or organization. A Legion Data Grid [XXX cite XXX] greatly simplifies the process of interacting with resources in multiple locations, on multiple platforms or under multiple administrative domains.

Users access files by name – typically a pathname in the Legion virtual directory. There is no need to know the physical location of the files.

There are two basic concepts to understand in the Legion Data Grid – how the data is accessed, and how the data is included into the grid.

Data Access

Data access is through one of three mechanisms: a Legion-aware NFS server called a Data Access Point (hereafter, a DAP); a set of command line utilities; or Legion I/O libraries that mimic the C stdio libraries.

DAP Access

The DAP provides a standards-based mechanism to access a Legion Data Grid. It is a commonly-used mechanism to access data in a Data Grid. The DAP is a server that responds to NFS 2.0/3.0 protocols and interacts with the Legion system. When an NFS client on a host mounts a DAP it effectively maps the Legion global namespace into the local host file system, providing completely transparent access to data throughout the grid without even installing Legion software.

However, the DAP is not a typical NFS server. First, it has no actual disk or file system behind it. It interacts with a set of resources that may be distributed, be owned by multiple organizations, be behind firewalls, etc. Second, the DAP supports the Legion security mechanisms: access control is with signed credentials, and interactions with the data grid can be encrypted. Third, the DAP caches data aggressively, using configurable local memory and disk caches to avoid wide-area network access. Further, the DAP can be modified to exploit semantic data that can be carried in the meta-data of a file object, such as “cacheable,” “cacheable until,” or “coherence window size.” In effect, DAP provides a highly secure, wide-area NFS.

To avoid the rather obvious hot spot of a single DAP at each site, Legion encourages deploying more than one DAP per site. There are two extremes: one DAP per site and one DAP per host. Besides the obvious tradeoff between scalability and the shared cache effects of these two extremes, there is also an added security benefit to having one DAP per host. NFS traffic between the client and the DAP, typically unencrypted, can be restricted to one host. The DAP can be configured to accept requests only from the local host, eliminating the classic NFS security attacks through network spoofing.

Command Line Access

A Legion Data Grid can be accessed using a set of command line tools that mimic the Unix file system commands such as ls, cat, etc. The Legion analogues are legion_ls, legion_cat, etc. The Unix-like syntax is intended to mask the complexity of remote data access by presenting familiar semantics to users.

I/O Libraries

Legion provides a set of I/O libraries that mimic the stdio libraries. Functions such as open and fread have analogues such as BasicFiles_open and BasicFiles_fread. The libraries are used by applications that need stricter coherence semantics than those offered by NFS access. The library functions operate directly on the relevant file or directory object rather than operating via the DAP caches.

Data Inclusion

Data inclusion is through one of three mechanisms: a “copy” mechanism whereby a copy of the file is made in the grid, a “container” mechanism whereby a copy of the file is made in a container on the grid, or a “share” mechanism whereby the data continues to reside on the original machine, but can be accessed from the grid. Needless to say, these three inclusion mechanisms are completely orthogonal to the three access mechanisms discussed earlier.

Copy Inclusion

A common way of including data into a Legion Data Grid is by copying it into the grid with the legion_cp command. This command creates a grid object or service that enables access to the data stored in a copy of the original file. The copy of the data may reside anywhere in the grid, and may also migrate throughout the grid.

Container Inclusion

Data may be copied into a grid container service as well. With this mechanism the contents of the original file are copied into a container object or service that enables access. The container mechanism reduces the overhead associated with having one service per file. Once again, data may migrate throughout the grid.

Share Inclusion

The primary means of including data into a Legion Data Grid is with the legion_export_dir command. This command starts a daemon that maps a file or rooted directory in Unix or Windows NT into the data grid. For example,

legion_export_dir C:\data /home/grimshaw/share-data

maps the directory C:\data on a Windows machine into the data grid at /home/grimshaw/share-data. Subsequently, files and subdirectories in C:\data can be accessed directly in a peer-to-peer fashion from anywhere else in the data grid, subject to access control, without going through any sort of central repository. A Legion share is independent of the implementation of the underlying file system, whether a direct-attached disk on Unix or NT, an NFS-mounted file system, or some other file system such as a hierarchical storage management system.

Combining shares with DAPs effectively federates multiple directory structures into an overall file system, as shown in Figure XXX below. Note that there may be as many DAPs as needed for scalability reasons.

Figure XXX. Legion Data Grid.

1.8.1.3 Distributed Processing

Research, engineering and product development depend on intensive data analysis and large simulations. In these environments, much of the work still requires computation-intensive data analysis – executing specific applications (which may be complex) with specific input data files (which may be very large or numerous) to create result files (which may also be large or numerous). For successful job execution, the data and application must be available, sufficient processing power and disk storage must be available, and the application’s requirements for a specific operating environment must be met. In a typical network environment, the user must know where the file is, where the application is, and whether the resources are sufficient to complete the work. Sometimes, in order to achieve acceptable performance, the user or administrator must move data files or applications to the same physical location.

With Legion, users do not need to be concerned with these issues in most cases. Users have a single point of access to an entire grid. Users log in, define application parameters and submit a program to run on available resources, which may be spread across distributed sites and multiple organizations. Input data is read securely from distributed sources without necessarily being copied to a local disk. Once an application is complete, computational resources are cleared of application remnants and output is written to the physical storage resources available in the grid. Legion's distributed processing support includes several features, listed below.

1.8.1.4 Automated Resource Matching and File Staging

A Legion Grid user executes an application, referencing the file and application by name. In order to ensure secure access and implement necessary administrative controls, predefined policies govern where applications may be executed or which applications can be run on which data files. Avaki matches applications with queues and computing resources in different ways:

Through access controls. For example, a user or application may or may not have access to a specific queue or a specific host computer.

Through matching of application requirements and host characteristics. For example, an application may need to be run on a specific operating system, or require a particular library to be installed, or require a particular amount of memory.

Through prioritization. For example, based on policies and load conditions.

Legion performs the routine tasks needed to execute the application. For example, Legion will move (or stage) data files, move application binaries and find processing power as needed, as long as the resources have been included in the grid and the policies allow them to be used. If a data file or application must be migrated in order to execute the job, Legion does so automatically; the user does not need to move the files or know where the job was executed. Users need not worry about finding a machine on which the application can run, finding available disk space, copying files to the machine and collecting results when the job is done.
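As an illustration of the kind of matching described above, the following sketch filters hosts by access control and by application requirements. The record fields, the example hosts, and the group names are assumptions made for this illustration and do not reflect Legion's actual scheduler interface.

# Simplified sketch of matching an application's requirements against host
# characteristics and access controls. Field names and example hosts are
# illustrative assumptions only.
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Host:
    name: str
    os: str
    memory_mb: int
    authorized_groups: Set[str]

@dataclass
class AppRequirements:
    os: str
    min_memory_mb: int

def eligible_hosts(req: AppRequirements, user_groups: Set[str],
                   hosts: List[Host]) -> List[Host]:
    """Hosts the user may use that also satisfy the application's requirements."""
    return [h for h in hosts
            if h.os == req.os
            and h.memory_mb >= req.min_memory_mb
            and user_groups & h.authorized_groups]

# Example: a Linux-only application needing 2 GB of memory.
hosts = [Host("centurion042", "linux", 4096, {"uva-users"}),
         Host("tacc-node9", "aix", 8192, {"ut-users"})]
print([h.name for h in eligible_hosts(AppRequirements("linux", 2048),
                                      {"uva-users"}, hosts)])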

1.8.1.5 Support for Legacy Applications – No Modification Necessary

Applications that use a Legion Grid can be written in any language, do not need to use a specific API, and can be run on the grid without source code modification or recompilation. Applications can run anywhere on the grid, without regard to location or platform, as long as resources are available that match the application's needs. This is critical in the commercial world, where the source code for many third-party applications is simply not available.

1.8.1.6 Batch Processing – Queues and Scheduling

With Legion, users can execute applications interactively or submit them to a queue. With queues, a grid can group resources in order to take advantage of shared processing power, sequence jobs based on business priority when there is not enough processing power to run them immediately, and distribute jobs to available resources. Queues also permit allocation of resources to groups of users. Queues help insulate users from having to know where their jobs physically run. Users can check the status of a given job from anywhere on the grid without having to know where it was actually processed. Administrator tasks are also simplified because a Legion Grid can be managed as a single system. Administrators can:

Monitor usage from anywhere on the network.

Preempt jobs, re-prioritize and re-queue jobs, take resources off the grid for maintenance, or add resources to the grid – all without interrupting work in progress.

Establish policies based on time windows, load conditions or job limits. Policies based on time windows can make some hosts available for processing to certain groups only outside business hours or outside peak load times. Policies based on the number of jobs currently running or on the current processing load can affect load balancing by offloading jobs from overburdened machines. (A minimal sketch of such a policy check appears below.)
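A minimal sketch of such a time-window and load-based policy check follows. The business-hours window, the load threshold, and the use of the local one-minute load average are illustrative assumptions, not actual Legion policy settings.

# Minimal sketch of a time-window / load-based scheduling policy check.
# The thresholds and the business-hours window are assumptions for
# illustration only; os.getloadavg() requires a Unix-like host.
import os
from datetime import datetime
from typing import Optional

BUSINESS_HOURS = range(8, 18)      # 08:00-17:59 local time (assumed window)
MAX_LOAD_AVERAGE = 2.0             # refuse new jobs above this one-minute load (assumed)

def host_accepts_jobs(now: Optional[datetime] = None) -> bool:
    """True if this host should accept new grid jobs under the example policy."""
    now = now or datetime.now()
    outside_business_hours = now.hour not in BUSINESS_HOURS
    load_ok = os.getloadavg()[0] < MAX_LOAD_AVERAGE
    return outside_business_hours and load_ok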

1.8.1.7 Unifying Processing Across Multiple Heterogeneous Networks

A Legion Grid can be created from individual computers and local area networks or from clusters managed by software such as LSF or SGE. If one or more queuing systems, load management systems or scheduling systems are already in place, Legion can interoperate with them to create virtual queues, thereby allowing resources to be shared across the clusters. This approach permits unified access to disparate, heterogeneous clusters, yielding a grid that enables an increased level of sharing and allows computation to scale as required to support growing demands.

1.8.1.8 Security

Security was designed into the Legion architecture and implementation from the very beginning [XXX cite XXX]. Legion's robust security is the result of several separate capabilities, such as authentication, authorization and data integrity, that work together to implement and enforce security policy. For grid resources, the goal of Legion's security approach is to eliminate the need for any other software-based security controls, substantially reducing the overhead of sharing resources. With Legion security in place, users need only know their sign-on protocol and the grid pathname where their files are located. Accesses to all resources are mediated by access controls set on the resources. We describe the security implementation in more detail in Section 1.8.2.3. Further details of the Legion security model are available in the literature [XXX cite XXX].

1.8.1.9 Automatic Failure Detection and Recovery

Service downtimes and routine maintenance are common in a large system, and during such times resources may be unavailable. A Legion Grid is robust and fault-tolerant. If a computer goes down, Legion can migrate applications to other computers based on predefined deployment policies, as long as resources are available that match application requirements. In most cases migration is automatic and unobtrusive. Because Legion can take advantage of all the resources on the grid, it helps organizations optimize system utilization, reduce administrative overhead and fulfill service-level agreements.

Legion provides fast, transparent recovery from outages. In the event of an outage, processing and data requests are rerouted to other locations, ensuring continuous operation. Hosts, jobs and queues automatically back up their current state, enabling them to restart with minimal loss of information if a power outage or other system interruption occurs.

Systems can be reconfigured dynamically. If a computing resource must be taken offline for routine maintenance, processing continues using other resources. Resources can also be added to the grid, or access changed, without interrupting operations.

Legion migrates jobs and files as needed. If a job’s execution host is unavailable or cannot be restarted, the job is automatically migrated to another host and restarted. If a host must be taken off the grid for some reason, administrators can migrate its files to another host without affecting pathnames or users. Requests for those files are rerouted to the new location automatically.

1.8.2 The Legion Grid Architecture: Under the Covers

Legion is an object-based system composed of independent objects, disjoint in logical address space, that communicate with one another by method invocation. Method calls are non-blocking and may be accepted in any order by the called object. Each method has a signature that describes its parameters and return values (if any). The complete set of signatures for an object describes that object’s interface, which is determined by its class. Legion class interfaces are described in an Interface Description Language (IDL); three are currently supported: the CORBA IDL, MPL and BFS (an object-based Fortran interface language). Communication is also supported for parallel applications using a Legion implementation of the MPI libraries. The Legion MPI supports cross-platform, cross-site MPI applications.

All things of interest to the system – for example, files, applications, application instances, users, groups – are objects. All Legion objects have a name, state (which may or may not persist), meta-data (<name, value-set> tuples) associated with their state, and an interface (note the similarity to the OGSA model, in which everything is a resource with a name, state and an interface). At the heart of Legion is a three-level global naming scheme with security built into the core. The top-level names are human names. The most commonly used are path names, e.g.,

/home/grimshaw/my_sequences/seq_1

Path names are all that most people ever see. The path names map to location-independent identifiers called LOIDs. LOIDs are extensible, self-describing data structures that include as part of the name an RSA public key (the private key is part of the object’s state). Consequently, objects may authenticate one another and communicate securely without the need for a trusted third party. Finally, LOIDs map to object addresses, which contain location-specific current addresses and communication protocols. An object address may change over time, allowing objects to migrate throughout the system, even while being used.



Each Legion object belongs to a class, and each class is itself a Legion object. All objects export a common set of object-mandatory member functions (such as deactivate(), ping() and getInterface()), which are necessary to implement core Legion services. Class objects export an additional set of class-mandatory member functions that enable them to manage their instances (such as createInstance() and deleteInstance()).
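The split between object-mandatory and class-mandatory interfaces can be illustrated with a minimal Python sketch. The method names follow the text above, but the bodies are placeholder illustrations rather than Legion’s implementation:

class LegionObject:
    """Sketch of the object-mandatory interface that every Legion object exports."""
    def ping(self):
        return True                                   # liveness check

    def getInterface(self):
        # an object's interface is the set of methods it exports
        return [m for m in dir(self) if not m.startswith("_")]

    def deactivate(self):
        pass                                          # placeholder: save state, vacate the processor


class LegionClass(LegionObject):
    """A class is itself an object; it adds class-mandatory methods that manage its instances."""
    def __init__(self):
        self.instances = []

    def createInstance(self):
        obj = LegionObject()
        self.instances.append(obj)
        return obj

    def deleteInstance(self, obj):
        self.instances.remove(obj)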

Much of the Legion object model’s power comes from the role of Legion classes, since a sizeable amount of what is usually considered system-level responsibility is delegated to user-level class objects. For example, classes are responsible for creating and locating their instances and for selecting appropriate security and object placement policies. Legion core objects provide mechanisms that allow user-level classes to implement chosen policies and algorithms. We also encapsulate system-level policy in extensible, replaceable class objects, supported by a set of primitive operations exported by the Legion core objects. This effectively eliminates the danger of imposing inappropriate policy decisions and opens up a much wider range of possibilities for the application developer.

1.8.2.1 Naming with Context Paths, LOIDs and Object Addresses

As in many distributed systems over the years, access, location, migration, failure and replication transparencies are implemented in Legion’s naming and binding scheme. Legion has a three-level naming scheme. The top level consists of human-readable namespaces; there are currently two, a context space and an attribute space. A context is a directory-like object that provides mappings from a user-chosen string to a Legion object identifier, called a LOID. A LOID can refer to a context object, file object, host object or any other type of Legion object. For direct object-to-object communication, a LOID must be bound to its low-level object address (OA), which is meaningful within the transport protocol used for communication. The process by which LOIDs are mapped to OAs is called the Legion binding process.

Contexts

Contexts are organized into a classic directory structure called context space. Contexts support operations that look up a single string, return all mappings, add a mapping and delete a mapping. A typical context space has a well-known root context, which in turn “points” to other contexts, forming a directed graph (Figure XXX). The graph can have cycles, though cycles are not recommended.


Figure XXX. Legion context space.

We have built a set of tools for working in and manipulating context space, for example, legion_ls, legion_rm, legion_mkdir, legion_cat and many others. These are modeled on the corresponding Unix tools. In addition, there are libraries that mimic the stdio libraries, providing the same sorts of operations on contexts and files (e.g., creat(), unlink(), etc.).
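As an illustration of what context space looks like to a program, here is a minimal Python sketch of directory-like contexts and the path resolution a tool such as legion_ls conceptually performs. The classes and the stand-in string LOIDs are hypothetical, not Legion code:

class Context:
    """A context: a directory-like mapping from user-chosen strings to LOIDs.
    LOIDs are represented by plain strings here purely for illustration."""
    def __init__(self):
        self.entries = {}

    def add(self, name, target):
        self.entries[name] = target       # add a mapping

    def lookup(self, name):
        return self.entries[name]         # look up a single string

    def list_all(self):
        return dict(self.entries)         # return all mappings (what a legion_ls-style tool shows)


def resolve(root, path):
    """Walk a context path such as /home/grimshaw/my_sequences/seq_1 to its LOID."""
    node = root
    for component in path.strip("/").split("/"):
        if not isinstance(node, Context):
            raise NotADirectoryError(path)
        node = node.lookup(component)
    return node


root, home, grimshaw, seqs = Context(), Context(), Context(), Context()
root.add("home", home); home.add("grimshaw", grimshaw)
grimshaw.add("my_sequences", seqs); seqs.add("seq_1", "LOID-of-seq_1")
print(resolve(root, "/home/grimshaw/my_sequences/seq_1"))   # LOID-of-seq_1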

Legion Object Identifiers

Every Legion object is assigned a unique and immutable LOID upon creation. The LOID identifies an object to various services (e.g., method invocation). The basic LOID data structure consists of a sequence of variable length binary string fields. Currently, four of these fields are reserved by the system. The first three play an important role in the LOID-to-object address binding mechanism: the first field is the domain identifier, used in the dynamic connection of separate Legion systems; the second is the class identifier, a bit string uniquely identifying the object’s class within its domain; the third is an instance number that distinguishes the object from other instances of its class. LOIDs with an instance number field of length zero are defined to refer to class objects. The fourth and final field is reserved for security purposes. Specifically, this field contains an RSA public key for authentication and encrypted communication with the named object. New LOID types can be constructed to contain additional security information, location hints and other information in the additional available fields.
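The following short Python sketch records the four reserved LOID fields described above as a simple data structure. It is a simplification for illustration only (real LOIDs are sequences of variable-length binary strings), and the field names are ours:

from dataclasses import dataclass

@dataclass(frozen=True)
class LOID:
    """Simplified view of the four reserved LOID fields."""
    domain_id: bytes     # identifies the Legion system (domain)
    class_id: bytes      # identifies the object's class within that domain
    instance_no: bytes   # length zero means the LOID names a class object
    public_key: bytes    # RSA public key for authentication and encrypted communication

    @property
    def is_class_object(self):
        return len(self.instance_no) == 0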

Object Addresses

Legion uses standard network protocols and the communication facilities of host operating systems to support inter-object communication. To perform such communication, Legion converts location-independent LOIDs into location-dependent, communication-system-level OAs through the Legion binding process. An OA consists of a list of object address elements and an address semantic field, which describes how to use the list. An OA element contains two parts: a 32-bit address type field indicating the type of address contained in the OA, and the address itself, whose size and format depend on the type. The address semantic field is intended to express various forms of multicast and replicated communication. Our current implementation defines two OA types, the first consisting of a single OA element containing a 32-bit IP address, 16-bit port number and 32-bit unique id (to distinguish between multiple sessions that reuse a single IP/port pair). This OA is used by our UDP-based data delivery layer. The other is similar and uses TCP. Associations between LOIDs and OAs are called bindings, and are implemented as three-tuples. A binding consists of a LOID, an OA and a field that specifies the time at which the binding becomes invalid (possibly never). Bindings are first-class entities that can be passed around the system and cached within objects.
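A minimal Python sketch of bindings and a binding cache follows. The structures mirror the description above, but the names and the dictionary-based cache are illustrative assumptions, not Legion’s data layout:

import time
from dataclasses import dataclass

@dataclass
class ObjectAddress:
    """One OA element: an address type plus a type-specific address."""
    addr_type: str          # e.g. "udp" or "tcp" in this sketch
    address: tuple          # e.g. (ip, port, session_id)

@dataclass
class Binding:
    """A binding three-tuple: LOID, OA, and the time it becomes invalid (None = never)."""
    loid: object
    oa: ObjectAddress
    expires_at: object      # float timestamp, or None for "never"

class BindingCache:
    """Objects cache bindings and re-run the binding process when one has expired."""
    def __init__(self):
        self._cache = {}

    def put(self, binding):
        self._cache[binding.loid] = binding

    def get(self, loid):
        b = self._cache.get(loid)
        if b is not None and (b.expires_at is None or b.expires_at > time.time()):
            return b
        return None         # expired or unknown: caller must rebind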

1.8.2.2 Metadata

Legion attributes provide a general mechanism to allow objects to describe themselves to the rest of the system in the form of metadata. An attribute is an n-tuple containing a tag and a list of values; the tag is a character string, and the values contain data that varies by tag. Attributes are stored as part of the state of the object they describe, and can be dynamically retrieved or modified by invoking object-mandatory functions. In general, programmers can define an arbitrary set of attribute tags for their objects, although certain types of objects are expected to support certain standard sets of attributes. For example, host objects are expected to maintain attributes describing the architecture, configuration and state of the machine(s) they represent.

Attributes are collected together into meta-data databases called collections. A collection is an object that “collects” attributes from a set of objects and can be queried to find all of the objects that meet some specification. Collections can be organized into directed acyclic graphs, e.g., trees, in which collection data “flows” up and is aggregated in large-scale collections. Collections are used extensively for resource discovery in many forms, e.g., for scheduling (“retrieve hosts where the one-minute load is less than some threshold”).
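The scheduling query quoted above can be illustrated with a small Python sketch of a collection of host attributes. The attribute tags (arch, load_1min) are hypothetical examples rather than Legion’s standard tag set:

hosts = [
    {"name": "hostA", "arch": "linux-x86", "load_1min": 0.4},
    {"name": "hostB", "arch": "solaris",   "load_1min": 3.2},
    {"name": "hostC", "arch": "linux-x86", "load_1min": 1.1},
]

def query(collection, predicate):
    """Return the attribute records in the collection that meet the specification."""
    return [attrs for attrs in collection if predicate(attrs)]

# "retrieve hosts where the one-minute load is less than some threshold"
idle_linux = query(hosts, lambda a: a["arch"] == "linux-x86" and a["load_1min"] < 1.0)
print([a["name"] for a in idle_linux])   # ['hostA']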

1.8.2.3 Security

Security is an important requirement for grid infrastructures as well as for the clients using them. The definition of security can vary depending on the resource accessed and the level of security desired. In general, Legion provides mechanisms for security while letting grid administrators configure policies that use those mechanisms. Legion provides mechanisms for authentication, authorization and data integrity/confidentiality, and these mechanisms are flexible enough to accommodate most policies desired by administrators. Legion explicitly recognizes and supports the fact that grids may comprise multiple organizations that have different policies and that wish to maintain independent control of all of their resources rather than trust a single global administrator.

In addition, Legion operates with existing network security infrastructure such as Network Address Translators (NATs) and firewalls. For firewalls, Legion requires port-forwarding to be set up; after this set-up, all Legion traffic is sent through a single bi-directional UDP port opened in the firewall. The UDP traffic sent through the firewall is transmitted using one of the data protection modes described in Section 1.8.2.6.


In addition, Legion provides a sandboxing mechanism whereby different Legion users are mapped to different local Unix/Windows users on a machine and are thus isolated from each other, as well as from the grid system software on that machine. The sandboxing solution also permits the use of “generic” IDs, so that Legion users who do not have accounts on a machine can still access resources on that machine using generic accounts.

Recall that LOIDs contain an RSA public key as part of their structure; naming and identity are thus combined in Legion. Since public keys are part of object names, any two objects in Legion can communicate securely and authenticate each other without the need for a trusted third party. Further, any object can generate and sign a credential. Since all grid components – files, users, directories, applications, machines, schedulers, etc. – are objects, security policies can be based not just on which person, but also on which component, is requesting an action.

1.8.2.4 Authentication

Users authenticate themselves to a Legion grid with the familiar login paradigm. Logging in requires specifying a user name and password. The user name corresponds to an authentication object that is the user’s proxy to the grid. The authentication object holds the user’s private key, her encrypted password and user profile information as part of its persistent state. The password supplied during login is compared to the password in the state of the authentication object in order to permit or deny subsequent access to the grid. The state of the authentication object is stored on disk and protected by the local operating system; in other words, Legion depends on the local operating system to protect identities. Consequently, careful configuration is required to place authentication objects on reliable machines. In a typical configuration, authentication objects for each administrative domain are restricted to a small number of trusted hosts.
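The login flow can be sketched in a few lines of Python. This is an illustration only: the hashing and HMAC signature below stand in for the authentication object’s real password handling and RSA-signed credentials, and all names are ours:

import hashlib, hmac, os, time

class AuthenticationObject:
    """Sketch of a user's proxy object: it holds the (hashed) password and a private
    key as part of its state, and issues a signed credential on a successful login.
    HMAC stands in for an RSA signature here."""
    def __init__(self, username, password):
        self.username = username
        self._salt = os.urandom(16)
        self._pw_hash = hashlib.sha256(self._salt + password.encode()).digest()
        self._private_key = os.urandom(32)     # stand-in for the RSA private key

    def login(self, password):
        offered = hashlib.sha256(self._salt + password.encode()).digest()
        if not hmac.compare_digest(offered, self._pw_hash):
            return None                        # wrong password: no credential issued
        body = "%s:%d" % (self.username, int(time.time()) + 3600)   # expires in an hour
        signature = hmac.new(self._private_key, body.encode(), hashlib.sha256).hexdigest()
        return {"credential": body, "signature": signature}

auth = AuthenticationObject("grimshaw", "secret")
print(auth.login("wrong"))    # None
print(auth.login("secret"))   # a signed credential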

A Legion system is neutral with respect to the authentication mechanism. Past implementations have used the possession of a valid Kerberos credential as evidence of identity (as opposed to the password model above). In principle, any authentication mechanism, for example NIS, LDAP, Kerberos, RSA SecurID, etc., can be used with Legion. Upon authentication, the authentication object generates and signs an encrypted (non-X.509) credential that is passed back to the caller. The credential is either stored in a temporary directory, similar to the approach of Kerberos, or stored as part of the internal state of the DAP (Data Access Point, described below). In either case, the credential is protected by the security of the underlying operating system.

Although login is the most commonly used method for users to authenticate themselves to a grid, it is not the only method. A Legion Data Access Point (DAP) enables access to a data grid using the NFS protocol. The DAP maintains a mapping from the local Unix ID to the LOID of an authentication object. The DAP communicates directly with authentication objects to obtain credentials for Unix users, and subsequently uses the credentials to access data on the user’s behalf. With this access method, users may be unaware that they are using Legion’s security system to access remote data. Moreover, clients of the DAP typically do not need to install any Legion software.

Credentials are used for subsequent access to grid resources. When a user initiates an action, her credentials are securely propagated along the chain of resources accessed in the course of the action using a mechanism called “implicit parameters”. Implicit parameters are the grid equivalent of environment variables – they are tuples of the form <name, value> and are used to convey arbitrary information to objects.

1.8.2.5 Authorization

Authorization addresses the issue of who or what can perform which operations on which objects. Legion is policy-neutral with respect to authorization, i.e., a large range of policies is possible. Legion/Avaki ships with an access control list (ACL) policy, a familiar mechanism for implementing authorization. For each action (or function) on an object, e.g., a “read” or a “write” on a file object, there exists an “allow” list and a “deny” list. Each list can contain the names of other objects – typically, but not restricted to, authentication objects such as “/users/jose”. A list may also contain a directory, which in turn contains a list of objects; if the directory contains authentication objects, then it functions as a group, and arbitrary groups can be formed by selecting which authentication objects are listed in a directory. The allow list for an action on an object thus identifies a set of objects (often authentication objects, which are proxies for humans) that are allowed to perform the action. Likewise, the deny list identifies a set of objects that are not allowed to perform the action. If an object is present in both the allow list and the deny list, we choose the more restrictive option, i.e., the object is denied permission. An access control list can be manipulated in several ways; one way provided in Legion is a “chmod” command that maps the intuitive “+/-” and “rwx” options to the appropriate allow and deny lists for sets of actions.
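A minimal Python sketch of this allow/deny evaluation follows, including directory-as-group expansion and the deny-wins rule. The nested-dictionary data layout is an assumption made purely for illustration:

def is_allowed(actor, action, acl, groups):
    """ACL check in the spirit described above: each action has an allow list and a
    deny list; a listed name may be a single object or a directory acting as a group;
    an actor on both lists is denied (the more restrictive option)."""
    def expand(names):
        members = set()
        for n in names:
            members |= set(groups.get(n, [n]))   # a directory expands to its member objects
        return members

    lists = acl.get(action, {})
    allowed = actor in expand(lists.get("allow", []))
    denied = actor in expand(lists.get("deny", []))
    return allowed and not denied

groups = {"/groups/project_x": ["/users/jose", "/users/ana"]}
acl = {"read": {"allow": ["/groups/project_x"], "deny": ["/users/ana"]}}
print(is_allowed("/users/jose", "read", acl, groups))   # True
print(is_allowed("/users/ana", "read", acl, groups))    # False: deny wins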

Access control lists can be manipulated to permit or disallow arbitrary sharing policies. They can be used to mimic the usual read-write-execute semantics of Unix, but can be extended to a wider variety of objects than just files and directories. Complex authorizations are possible. For example, if three mutually distrustful companies wish to collaborate in a manner in which one provides the application, another provides the data and the third provides the processing power, they can do so by setting the appropriate ACLs so that exactly the desired actions are permitted and no others. In addition, the flexibility of creating groups at will is a benefit to collaboration. Users do not have to submit requests to system administrators to create groups – they can create their own groups by linking existing users into a directory, and they can set permissions for these groups on their own objects.

1.8.2.6 Data Integrity and Data Confidentiality

Data integrity is the assurance that data has not been changed, either maliciously or inadvertently, in storage or in transmission, without proper authorization. Data confidentiality (or privacy) refers to the inability of an unauthorized party to view, read or understand a particular piece of data, again either in transit or in storage. For on-the-wire protection, i.e., protection during transmission, Legion follows the lead of almost all contemporary security systems in using cryptography. Legion transmits data in one of three modes: private, protected and none. Legion uses the OpenSSL libraries for encryption. In the private mode, all data is fully encrypted. In the protected mode, the sender computes a checksum (e.g., MD5) of the actual message so that tampering can be detected. In the third mode, “none”, all data is transmitted in the clear except for credentials. This mode is useful for applications that do not want to pay the performance penalty of encryption and/or checksum calculations. The desired integrity mode is carried in an implicit parameter that is propagated down the call chain. If a user sets the mode to private, then all subsequent communication made by the user or on behalf of the user will be fully encrypted. Since different users may have different modes of encryption, different communications with the same object may have different encryption modes. Objects in the call chain, for example the DAP, may override the mode if deemed appropriate.
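The three on-the-wire modes can be summarized with a short Python sketch. The toy encrypt function and the message layout are placeholders for illustration only; Legion itself uses OpenSSL, and the chosen mode would travel down the call chain as an implicit parameter such as <data_protection, private>:

import hashlib

def protect(message, mode, encrypt=lambda m: m[::-1]):
    """Prepare a message for transmission under one of the three modes.
    'encrypt' is a toy placeholder, not real cryptography."""
    if mode == "private":
        return {"mode": mode, "payload": encrypt(message)}              # fully encrypted
    if mode == "protected":
        return {"mode": mode, "payload": message,
                "md5": hashlib.md5(message).hexdigest()}                # integrity checksum
    if mode == "none":
        return {"mode": mode, "payload": message}                       # clear text (credentials excepted)
    raise ValueError("unknown data protection mode: %s" % mode)

print(protect(b"patient record 42", "protected")["md5"])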

For on-the-host data protection, i.e., integrity and confidentiality during storage on a machine, Legion relies largely on the underlying operating system. Persistent data is stored on local disks with permissions set such that only the appropriate users can access it.

1.8.3 How Can Grids Be Used in the Life Sciences?

Broad classes of applications include high-throughput computing, single large problems, data sharing, and workflows.

Contemporary Examples

The above description of a grid is rather abstract and high-level. The real question for biomedical users is “how can I benefit from a grid?” There are many ways to use a grid, ranging from relatively simple implementations to intricate ones that exploit grid capabilities to solve currently intractable problems. Below, we briefly sketch five possible uses that span this spectrum and illustrate some of the possibilities: shared persistent object spaces, transparent remote execution, wide-area queuing systems, wide-area parallel processing, and meta-applications. As will become clear, these examples map directly back to the capabilities provided by Legion (and, by extension, by the GBG).

1.8.3.1 Shared Persistent Object (File) Space

The simplest service a grid provides is location transparency for files, for queries on relational databases, and, more generally, for object access. In the case of files this is a well-understood capability, known to most as a distributed file system. For example, in a distributed file system a user program running on site A can access a file at site B using the same mechanism (open, close, read, write) used to access files at site A. Further, the same file may be shared by more than one user and by more than one task. Having a shared file space significantly simplifies collaboration. Issues that must be addressed in such a distributed file system are naming, caching, replication, coherence of caches and replicas, and so on. Well-known distributed file systems include NFS and Andrew [1].

A more powerful model is a shared object space. Instead of just files, all entities – files as well as executing tasks – can be named and shared between users (subject to access control restrictions). This merging of “files” and “objects” is driven by the observation that the traditional distinction between files and other objects is something of an anachronism. Files really are objects; they just happen to live on a disk, so they are slower to access and they persist when the computer is turned off. In a shared object space, a file-object is a typed object with an interface that exports the standard file operations, such as read, write, and seek. In addition, the interface can define object properties such as persistence, fault, synchronization, and performance characteristics. Thus not all files need be the same; this eliminates the need to provide Unix synchronization semantics for all files, since many applications simply do not require those semantics. Instead, appropriate semantics along many dimensions such as replication and consistency can be selected on a file-by-file basis, and even changed at run-time.

Beyond basic sequential files, class-based persistent objects offer a range of opportunities, including:

Application-specific “file” interfaces. For example, instead of just read, write, and seek, a 2D-array file may have functions such as read_column, read_row, and read_sub_array. Or a “file” may be known to have structure – such as a FASTA format file – and have methods on it such as compare_sequence() or compare_library() (see the sketch after this list). The advantage to the user is the ability to interact with the file system in application terms rather than as arbitrarily linearized streams of data. Further, the implementations can be optimized for a particular data structure, by storing the data in sub-arrays, for example, or by scattering the data to multiple devices to provide parallel I/O [Karp OOPSLA paper].

User specification of caching and prefetch strategies. This feature allows the user to exploit application domain knowledge about access patterns and locality to tailor the caching and prefetch strategies to the application, and thus pipeline the I/O so that it executes concurrently with the application. This can provide significant improvements in observed I/O performance – particularly in wide-area systems.

Asynchronous I/O. Combined with a parallel object implementation, I/O can easily be performed concurrently with application computation. The combination of user-specified prefetching and asynchronous I/O can drastically reduce I/O delays.

Multiple outstanding I/O requests. Again, combined with a parallel object implementation the application can request data long before it is actually needed. The application can then compute while I/O is being processed. By the time the data is needed it may already be available, resulting in almost zero latency.

Data format heterogeneity. Persistent classes may be constructed so as to hide data format heterogeneity, automatically translating data as it is read or written. Thus an application may expect to read FASTA format even though the underlying data is stored in a different format; similarly, a file of floating-point numbers stored on a PC can be read as floating-point numbers on a Sun, with the data coerced automatically.

Active simulations. In addition to passive files, persistent objects may also be active entities. For example, an organ simulation can proceed at a specified pace (such as wall clock time) and be accessed (read) by other objects. Of course, the organ simulation may itself use and manipulate other objects.
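The first item in the list above – an application-specific file interface – is illustrated by the following Python sketch of a FASTA-aware file object. The parsing is deliberately simplified and compare_sequence() uses a trivial identity score as a stand-in for a real alignment; the class is a hypothetical illustration, not an existing Legion class:

class FastaFileObject:
    """A "file" that knows its own structure: it exports sequence-level methods
    instead of an arbitrarily linearized byte stream."""
    def __init__(self, text):
        self.sequences = {}
        name = None
        for line in text.splitlines():
            if line.startswith(">"):
                name = line[1:].strip()
                self.sequences[name] = ""
            elif name is not None:
                self.sequences[name] += line.strip()

    def compare_sequence(self, name, query):
        """Fraction of matching positions (a placeholder for a real aligner)."""
        target = self.sequences[name]
        n = min(len(target), len(query))
        return 0.0 if n == 0 else sum(a == b for a, b in zip(target, query)) / n

    def compare_library(self, query):
        """Compare the query against every sequence held by this file object."""
        return {name: self.compare_sequence(name, query) for name in self.sequences}

fasta = FastaFileObject(">seq_1\nACGTACGT\n>seq_2\nACGTTTTT\n")
print(fasta.compare_library("ACGTACGT"))   # {'seq_1': 1.0, 'seq_2': 0.625}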



Now for some more concrete examples, both from our group and others.

Often users want to be able to access the same, current copy of a public database regardless of where their application runs. This was the case for researchers at Stanford (Russ Altman et al.) [XXXLegionIO paper and AltmanXXX], who wanted to run a code that would scan the PDB, executing a program called gnomad against each entry; this would be done repeatedly for each of a set of proteins. This is a typical bag-of-tasks problem (discussed further in Sections 1.8.3.3 and 1.8.3.4), with the requirement that the data be accessible at many locations. The PDB data became transparently available throughout the grid.

Another example comes from the e-Diamond project at the UK’s Oxford e-Science Centre. The following is quoted from (http://www.e-science.ox.ac.uk/public/eprojects/e-diamond/).

The e-Diamond project aims to digitize, collate and annotate a large federated database of mammograms at St. Georges and Guy's Hospitals in London, the John Radcliffe Hospital in Oxford, and the Breast Screening Centres in Edinburgh and Glasgow. Each of these sites will be linked together by 'grid' technology to provide a seamless interface to access stored mammograms. Each stored mammogram has medical information stored with it. Grid technology will allow a radiographer in one of the hospitals to compare their patient mammograms with similar ones stored in the database at other locations. This will help radiographers in their diagnosis of difficult or borderline cases.

A critical step in the exploitation of a medical image database is the standardization (or 'normalization') of the incoming data before it is stored in the database. It is known that different equipment and locations can produce widely varying results in image quality. This can interfere with image clarity and influence the final patient diagnosis. The University of Oxford has patented a method for the normalization of digital mammograms and is collaborating with IBM and Miranda Solutions to develop a digital image database system (e-Diamond), for use over the 'grid'. Digital images obtained from the participating hospitals will be normalized prior to their inclusion in the database. The effect of this process will be as if mammogram images were all produced on one machine at a single site.

The chief benefit of the project is the new ability for radiographers to quickly compare their difficult mammograms with similar ones produced at other hospitals around the country. This will substantially aid the diagnosis of malignant cancers and potentially improve the overall breast cancer survival rates presently seen.

The e-Diamond image archive will provide research, teaching and training resource to the hospitals involved in its development. It is anticipated that detailed epidemiological studies will be undertaken using the e-Diamond archive to look at the effects of hormone replacement therapy on breast cancer risk for instance.

The e-Diamond project aims to use the image archive to evaluate new software to compute the quality of each new mammogram it receives. Each will undergo a normalization process before it is finally included in the archive.

Thus the ability to securely share data – in this case flat-file mammogram data – between sites is critical to e-Diamond. There is also a significant data mining aspect to the problem, as illustrated by the flow diagram in Figure XXX below, which is taken from the e-Diamond project and shows the data flows.

Figure XXX. e-Diamond data mining

In the commercial sector, Avaki is the leading provider of commercial data grid solutions. While grids, like operating systems, are generic rather than specific to a particular application, Avaki chose for strategic reasons to begin selling into the life sciences, in particular to biotechs and major pharmaceutical companies. (Avaki’s customers include Aventis, Pfizer, Gene Logic, Structural Bio Informatics, and others.) The challenge faced by many of the larger pharmaceuticals is that they have research and development facilities all over the world. At each site there are full-time staff who are tasked with keeping up-to-date copies of public, third-party private, and internal proprietary databases. The challenge is to keep all of this data up-to-date, avoid redundant copies, and ensure that results from one location (or machine) can be compared with others. It has been estimated that 40% of the computations performed at several large pharmaceutical companies are repeats of runs that have already been done but must be re-done because an old version of the data was used. This means wasted time, additional delays for project completion, and thus bottom-line impact.

Avaki provides a data-grid solution with transparent access to flat-file, relational, and application data throughout the organization. The benefits are a reduction in labor costs (data needs to be curated only once, not at multiple sites and in multiple workgroups), a reduction in redundant computation and its associated delays, and a tremendous reduction in managed storage (previously consumed by many copies of the same database).

The future: integrated databases of the kind we are proposing for diabetes research.

1.8.3.2 Transparent Remote Execution

A slightly more complex service is transparent remote execution. Consider a hypothetical situation in which a user is working on a computationally complex sequential code, perhaps written in C or Fortran. After compiling and linking the application, the user is left with the problem of deciding where to execute the code. The user can choose to run it on their workstation (assuming adequate capacity is available) or on a local high-performance machine, or they can execute it at a remote supercomputer center. The choice involves many trade-offs. First there is the scheduling issue: which choice will result in the fastest turn-around? This is particularly acute when the user has accounts at multiple centers – how is the user to know which is likely to give the best performance without manually checking each one? Next there are the inconveniences of using remote resources. Data and executable binaries may first need to be physically copied to a remote center, and the results copied back for analysis and display. This may be further complicated by the need to synchronize with collaborators’ data copying. Finally, there are the administrative difficulties of acquiring and maintaining multiple accounts at multiple sites.

In a grid, the user can simply execute the application at the command line. The underlying system selects an appropriate host from among those the user is authorized to use, transfers binaries if necessary, and begins execution. Data is transparently accessed from the shared persistent object space, and the results are stored back in that space for later use.
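The host selection step can be illustrated with a minimal Python sketch that matches an application’s requirements against host attributes and picks the least-loaded candidate. The attribute names and the selection rule are illustrative assumptions, not Legion’s actual scheduler:

def select_host(hosts, requirements):
    """Pick the least-loaded authorized host that matches the application's
    architecture and memory requirements; return None if nothing matches."""
    candidates = [
        h for h in hosts
        if h["authorized"]
        and h["arch"] == requirements["arch"]
        and h["free_mem_mb"] >= requirements["min_mem_mb"]
    ]
    return min(candidates, key=lambda h: h["load_1min"], default=None)

hosts = [
    {"name": "lab-linux-1", "arch": "linux-x86", "free_mem_mb": 2048,
     "load_1min": 0.3, "authorized": True},
    {"name": "center-sp2",  "arch": "aix-power", "free_mem_mb": 8192,
     "load_1min": 1.5, "authorized": True},
]
print(select_host(hosts, {"arch": "linux-x86", "min_mem_mb": 1024}))   # the lab-linux-1 record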

Concretely, this capability manifests in tools such as legion_run, described earlier, as well as the ancillary tools needed to make it possible. Those tools include the capability to register program binaries so that they can be automatically provisioned on the target compute resource if they are not already installed. In Legion, transparent execution may also involve a portal, such as the one described in [XXX Anand XXX].

In Figure XXX, below, user an4m is logged into the grid. He can select one of his own applications to run (mpialone, mpimixed, myenv, query, or scrtst) or one of the well-known programs that have been configured by a system administrator (in this case Amber or Render Grid).

Figure XXX. Transparent execution in Legion.

In the case of his own codes, he may either simply select them for execution or, optionally, select the type of platform to use, e.g., Linux, Solaris, etc. Suppose that he selects Amber. Another set of portal pages launches, requesting input/output files, parameters, and so on; it also asks whether he would like to see a visualization as the computation proceeds. The result is shown below in Figure XXX.


Figure XXX. Portal for 3D Molecular Modeling Using Amber. <XXX citation: Anand XXX>

1.8.3.3 Wide-Area Queuing System

Queuing systems are a part of everyday life at production supercomputer centers. The basic idea is simple: rather than interactively starting a job, the user submits a description of the job to a program (the queuing system), and that program schedules the job to execute at some point in the future. The purpose of the queuing system is to optimize an objective function such as system utilization, minimum average delay, or a priority scheme. There are over twenty-five different queuing systems in use today (http://www.epm.ornl.gov/~zhou/lds/vendlist.html has links to the home pages of many of them). Examples include Platform’s LSF, PBS, Condor, and Sun’s SGE.

Just as queuing systems can be used to manage homogeneous resources at a particular site, grid queuing systems manage heterogeneous resources at multiple sites. This enables the user to submit a job from any site and, subject to access control, have the job automatically scheduled somewhere in the grid. Further, if binaries for the application are available for multiple architectures, the queuing system will have a choice of platforms on which to schedule the job. The advantage of this over the current state of practice is significant. Today a user with multiple accounts often must “shop around” among different sites looking for the shortest queue in which to run a job, and may need to copy data between remote sites. This consumes the most precious resource, user time, which would be better spent doing science.

For a concrete example, consider again the PDB scan application, gnomad, discussed earlier. In typical use it starts tens of thousands of jobs, each corresponding to a comparison of a particular target against a particular protein in the PDB. Rather than attempting to run them all at once, the jobs can be placed in one of several possible queues and executed over time. Each queue can be associated with a different resource set and have different policies with respect to priority, number of concurrent jobs, etc.

1.8.3.4 Wide-Area Parallel Processing

Another opportunity presented by grids is connecting multiple resources together for the purpose of executing a single application, thereby providing the opportunity to execute mission-critical problems on a much larger scale than would otherwise be possible. Not all problems will be able to exploit this capability, since the application must be latency-tolerant; after all, network latency from California to Virginia is a minimum of thirty milliseconds. Several classes of applications do fall into this category, including bag-of-tasks problems and many regular finite difference methods.

First consider a typical bag-of-tasks problem. This class includes parameter space studies, Monte Carlo computations, and simple data parallel problems. In a parameter space study the same program is repeatedly executed with slightly different parameters (the program may be sequential or parallel). Examples include design-space studies of high-frequency oscillators using electron transport codes that apply finite difference techniques to solve a complex set of nonlinear differential equations.

In a Monte Carlo simulation many random trials are performed. The trials may be easily distributed among a large number of workers and the results summed up. Examples include modeling atomic transport in rarefied gases or simulations describing a column of gas near the exobase of an atmosphere, which is bombarded by ions from a space plasma (e.g. the solar wind or a trapped magnetospheric plasma).

In a simple data parallel problem the same operation is performed on each of many different data points, and the computation at each data point is independent of all other data points. BLAST processing of a large library of sequences, or docking a library of ligands with a tool such as AutoDock, is a good example of this type of problem. Note that parameter space studies and simple data parallel problems have much in common; the only difference is whether the parameters are data values or generated values.

Bag-of-tasks problems are well suited to grids because they are highly latency-tolerant. While one computation is being performed, the results of the previous computation can be sent back to the caller and the parameters for the next computation can be sent to the worker. Further, the order of execution is unimportant, completely eliminating the need for synchronization between workers. This simplifies load balancing significantly. The computations can also be easily spread across a large number of sites because they do not interact in any way.
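The structure of a bag-of-tasks run can be sketched in a few lines of Python using a process pool. The scoring function is a placeholder for a real code such as gnomad, and the sketch illustrates only why the pattern tolerates latency (independent tasks, no inter-worker synchronization), not how Legion schedules such jobs:

from concurrent.futures import ProcessPoolExecutor, as_completed

def compare(task):
    """One independent trial, e.g. comparing one target against one database entry.
    The identity score below is a placeholder for a real comparison code."""
    target, entry = task
    return entry, sum(a == b for a, b in zip(target, entry)) / max(len(entry), 1)

def run_bag_of_tasks(target, entries, workers=4):
    """Farm the independent comparisons out to workers; completion order is irrelevant
    and the workers never synchronize with one another."""
    results = {}
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(compare, (target, entry)) for entry in entries]
        for done in as_completed(futures):
            name, score = done.result()
            results[name] = score
    return results

if __name__ == "__main__":
    print(run_bag_of_tasks("ACGT", ["ACGG", "TTTT", "ACGT"]))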

Concrete Examples: [***none here]

1.8.3.5 Meta-Applications

The most challenging class of applications for the grid is meta-applications. A meta-application is a multi-component application, many of whose components were previously executed as stand-alone applications.

<FIX THIS UP – make it a life science example !!! >

Consider a generic example in which three previously stand-alone applications have been connected together to form a larger, single application. Each of the components has hardware affinities. Component 1 is a fine-grain data parallel code that requires a tightly coupled machine such as an IBM SP-2; component 2 is a coarse-grain data parallel code that can run efficiently on a “pile of PCs”; and component 3 is a vectorizable code that “wants” to run on a vector machine such as a Cray T90. Component 1 also has a very large database that is physically located at site 1. Component 3 is written in Cray Fortran, component 1 is a shared-memory Fortran code, and component 2 is written in C using MPI.

There are a number of characteristics that can make meta-applications a challenge for the underlying software layer. Generally these characteristics fall into one of four categories: component characteristics, distributed databases, security, and scheduling. The need for fault tolerance will be discussed as well.

Component characteristics. Meta-application components are often written and maintained by geographically separate research groups. The components may be written in different languages, or perhaps in different parallel dialects of the same language (e.g., Fortran components using MPI, PVM, and HPF). The components often represent valuable intellectual property, so the owners may want to keep the code in-house to protect it. Often, too, the code is really understood only by its authors, necessitating distributed code maintenance. Similarly, software licenses may be available on only a subset of the hosts. The goal, then, is to facilitate inter-component communication when the components are written in different languages, and to transparently migrate binaries from site to site.

[***This seems to be unfinished: need distributed databases, security, and scheduling categories the discussion of other characteristics.]

Concrete Examples: [***Anything?]

1.8.3.6 Workflows

[***Anything?]

Concrete Examples: [***Anything?]


1.9 References

1. E. Levy, and A. Silberschatz, “Distributed File Systems: Concepts and Examples,” ACM Computing Surveys, vol. 22, No. 4, pp. 321-374, December, 1990.

2. A.S. Grimshaw, A. Natrajan, and M.A. Humphrey, “Legion: An Integrated Architecture for Grid Computing,” to appear in IBM Systems Journal.

3. Natrajan, A., Crowley, M., Wilkins-Diehr, N., Humphrey, M. A., Fox, A. D., Grimshaw, A. S., Brooks, C. L. III, “Studying Protein Folding on the Grid: Experiences using CHARMM on NPACI Resources under Legion,” to appear Grid Computing Environments 2003, Concurrency and Computation: Practice and Experience.

4. A.S. Grimshaw and S. Tuecke, “Grid Services Extend Web Services,” Web Services Journal, August 2003.

5. Michael J. Lewis, Adam J. Ferrari, Marty A. Humphrey, John F. Karpovich, Mark M. Morgan, Anand Natrajan, Anh Nguyen-Tuong, Glenn S. Wasson and Andrew S. Grimshaw, “Support for Extensibility and Site Autonomy in the Legion Grid System Object Model” Journal of Parallel and Distributed Computing, Volume 63, pp. 525-38, 2003.

6. A. Natrajan, M. Humphrey, and A.S. Grimshaw, “The Legion support for advanced parameter-space studies on a grid,” Future Generation Computing Systems, vol. 18, pp. 1033-1052, 2002.

7. A. Natrajan, A. Nguyen-Tuong, M.A. Humphrey, M. Herrick, B.P. Clarke and A.S. Grimshaw, “The Legion Grid Portal”, Grid Computing Environments, Concurrency and Computation: Practice and Experience, 2001.

8. A. Nguyen-Tuong and A.S. Grimshaw, “Using Reflection for Incorporating Fault-Tolerance Techniques into Distributed Applications,” Parallel Processing Letters, vol. 9, No. 2 (1999), 291-301.

9. S.J. Chapin, D. Katramatos, J.F. Karpovich, and A.S. Grimshaw, “Resource Management in Legion,” Journal of Future Generation Computing Systems, vol. 15, pp. 583-594, 1999.

10. S.J. Chapin, C. Wang, W.A. Wulf, F.C. Knabe, and A.S. Grimshaw, “A New Model of Security for Metasystems,” Journal of Future Generation Computing Systems, vol. 15, pp. 713-722, 1999.

11. A.S. Grimshaw, A. Ferrari, F.C. Knabe, and M. Humphrey, “Wide-Area Computing: Resource Sharing on a Large Scale,” IEEE Computer, 32(5): 29-37, May 1999.

12. A.S. Grimshaw, A. Ferrari, G. Lindahl, and K. Holcomb, “Metasystems,” Communications of the ACM, 41(11): 46-55, Nov 1998.

13. A.S. Grimshaw, A. Nguyen-Tuong, M.J. Lewis, and M. Hyett, “Campus-Wide Computing: Early Results Using Legion at the University of Virginia,” International Journal of Supercomputing Applications, 11(2): 129-143, Summer 1997.


14. A.S. Grimshaw and W.A. Wulf, “The Legion Vision of a Worldwide Virtual Computer,” Communications of the ACM, 40(1): 39-45, Jan 1997.

15. J.F. Karpovich, A.S. Grimshaw, and J. C. French, “Extensible File Systems (ELFS): An Object-Oriented Approach to High Performance File I/O, “ Proceedings of OOPSLA '94, Portland, OR, Oct 1994: 191-204.

16. J.F. Karpovich, J. C. French, and A.S. Grimshaw, “High Performance Access to Radio Astronomy Data: A Case Study,” Proceedings of the 7th International Conference on Scientific and Statistical Database Management, Charlottesville, VA, Sept 1994: 240-249.

17. D. Katramatos, M. Humphrey, A. Grimshaw, and S. Chapin, “JobQueue: A Computational Grid-wide Queuing System,” International Workshop on Grid Computing, pp. 99-110, November 2001.

18. B. White, M. Walker, M. Humphrey, and A. Grimshaw “LegionFS: A Secure and Scalable File System Supporting Cross-Domain High-Performance Applications”, Proceedings SC 01, Denver, CO. www.sc2001.org/papers/pap.pap324.pdf

19. A. Natrajan, A. Fox, M. Humphrey, A. Grimshaw, M. Crowley, and N. Wilkins-Diehr, “Protein Folding on the Grid: Experiences using CHARMM under Legion on NPACI Resources,” International Symposium on High Performance Distributed Computing (HPDC), pp. 14-21, San Francisco, California, August 7-9, 2001.

20. A. Natrajan, M.A. Humphrey and A.S. Grimshaw, “Grids: Harnessing Geographically-Separated Resources in a Multi-Organisational Context”, High Performance Computing Systems, June 2001.

21. M. Humphrey, N. Beekwilder, K. Holcomb, and A. Grimshaw, “Legion MPI: High Performance in Secure, Cross-MSRC, Cross-Architecture MPI Applications,” 2001 DoD HPC Users Group Conference, Biloxi, Mississippi, June 2001.

22. A. Natrajan, M. Humphrey, and A. Grimshaw, “Capacity and Capability Computing using Legion,” Proceedings of the 2001 International Conference on Computational Science, pp. 273-283, San Francisco, CA, May 2001.

23. A. Waugh, G. A. Williams, L. Wei, & R. B. Altman. Using Metacomputing Tools to Facilitate Large Scale Analyses of Biological Databases. Pacific Symposium on Biocomputing 2001, Mauna Lani, 360-371. 2001.

24. A.S. Grimshaw, “Avaki Data Grid – Secure Transparent Access to Data,” in Grid Computing: A Practical Guide to Technology and Applications, Ahmar Abbas, editor, Charles River Media, 2003.

25. A. Natrajan, M.A. Humphrey, and A.S. Grimshaw, “Grid Resource Management in Legion,” in Resource Management for Grid Computing, eds. Jennifer Schopf and Jaroslaw Nabrzyski, 2003.

26. A.S. Grimshaw, A. Natrajan, M.A. Humphrey, M.J. Lewis, A. Nguyen-Tuong, J.F. Karpovich, M.M. Morgan, and A.J. Ferrari, “From Legion to Avaki: The Persistence of Vision,” in Grid Computing: Making the Global Infrastructure a Reality, eds. Fran Berman, Geoffrey Fox and Tony Hey, 2003.

27. L. Jin and A. Grimshaw, “From MetaComputing to Metabusiness Processing,” Proceedings of the IEEE International Conference on Cluster Computing (Cluster 2000), Saxony, Germany, December 2000.

28. B. White, A. Grimshaw, and A. Nguyen-Tuong, “Grid Based File Access: The Legion I/O Model,” Proceedings of the Symposium on High Performance Distributed Computing (HPDC-9), Pittsburgh, PA, August 2000.

29. H. Dail, G. Obertelli, F. Berman, R. Wolski, and A. Grimshaw, “Application-Aware Scheduling of a Magnetohydrodynamics Application in the Legion Metasystems,” Proceedings of the International Parallel Processing Symposium Workshop on Heterogeneous Processing, Cancun, May 2000.

30. A.S. Grimshaw, M.J. Lewis, A.J. Ferrari, and J.F. Karpovich, “Architectural Support for Extensibility and Autonomy in Wide-Area Distributed Object Systems,” Proceedings of the 2000 Network and Distributed Systems Security Conference (NDSS’00), San Diego, CA, February 2000.

31. M. Humphrey, F. Knabe, A. Ferrari, and A. Grimshaw, “Accountability and Control of Process Creation in Metasystems,” Proceedings of the 2000 Network and Distributed Systems Security Conference (NDSS’00), pp. 209-220, San Diego, CA, February 2000.

32. S.J. Chapin, D. Katramatos, J.F. Karpovich, and A.S. Grimshaw, “The Legion Resource Management System,” Proceedings of the 5th Workshop on Job Scheduling Strategies for Parallel Processing, in conjunction with the International Parallel and Distributed Processing Symposium, April 1999.

33. A.J. Ferrari, F.C. Knabe, M. Humphrey, S.J. Chapin, and A.S. Grimshaw, “A Flexible Security System for Metacomputing Environments,” 7th International Conference on High-Performance Computing and Networking Europe (HPCN’99), Amsterdam, April 1999: 370-380.

34. C.L. Viles, M.J. Lewis, A.J. Ferrari, A. Nguyen-Tuong, and A.S. Grimshaw, “Enabling Flexibility in the Legion Run-Time Library,” H.R. Arabnia, ed., Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’97), Las Vegas, NV, June 30-July 2, 1997: 265-274.

35. A. Nguyen-Tuong, A.S. Grimshaw, and M. Hyett, “Exploiting Data-Flow for Fault-Tolerance in a Wide-Area Parallel System,” Proceedings of the 15th International Symposium on Reliable and Distributed Systems, October 1996: 1-11.
