
ITR/AP+IM+SI+SY: A Prototype Knowledge Environment for the Geosciences
Draft 2, April 5, 2001

A. Project Summary

We propose to create a prototype Knowledge Environment for the Geosciences (KEG) that demonstrates a seamless, virtual laboratory for Earth system science research and education. This environment is a platform for fundamental IT research, enabling advances in methodologies and tools for distributed, large-scale collaborative research, knowledge evolution, and distributed information environments. It will organize research products into a searchable, shared group resource and thus a knowledge-based problem-solving environment for the geosciences. It is targeted at the heart of the research activity—the process—not the end result. This will be the first time a comprehensive knowledge system has been proposed to integrate frontier research on high-performance simulation of the Earth system with both archival and interactive geosciences learning environments. The complexity of a geophysical knowledge framework is a new and fertile context for basic IT research. Moreover, the proposed research is structured to foster iteration between IT and the core geophysical problems, whereby basic IT research contributions spur new geosciences developments that in turn pose new IT research challenges.

The need to understand the physical and biological processes that shape our environment is a grand challenge for this century. The possible influence of human activities on the Earth system and the vulnerability of a complex global economy to severe natural events make this an urgent problem. The Earth/Sun system, however, operates at disparate spatial and temporal scales, and advancing our understanding of this system requires a vast array of observational data, many scientific disciplines, and many scientific models. Moreover, scientific understanding of the environment must be accessible to diverse groups, from scientists to policy makers and from educators to students. Traditional modes of scientific research are stymied when applied to the breadth of scales encountered in the geosciences, and they often reach only a limited audience of specialists. Meeting this grand challenge will require new, IT-enabled modes of collaboration and dissemination. This proposal will contribute to the digital infrastructure vital for the next generation of geosciences research.

The prototype knowledge environment for the geosciences proposed here will be assembled in three layers:

1) Interaction Portal (IP)
2) Knowledge Framework (KF)
3) Multiscale Earth System Repository (MS)

The IP is the connection between the community and the knowledge environment and consists of tools, components, and interfaces built upon the common fabric of the Knowledge Framework. Areas of emphasis for the IP include a common code development environment, visualization, and support for interactions among a geographically distributed group. The KF mediates knowledge between the MS and the IP, using principles of encapsulation, polymorphism, and data abstraction to facilitate interdisciplinary research with a set of distributed methods, classes, and tools. Finally, the MS generalizes current Earth system models and will be a linked hierarchy of models at several scales. The prototype multiscale model will have a nonhydrostatic atmosphere with interactive chemistry and cloud microphysical processes.

A fundamental component of our system design is a shared Earth system modeling framework that will provide a "commons" for university and NCAR computer and Earth scientists to compare, test, and evaluate new tools and methods for modeling complex Earth system processes. This diverse scientific effort will be organized and archived with "middleware" that enhances opportunities for applying the model products to research on the impacts and consequences of weather and climate variability. The final modeling and analysis products will be transmitted to the Digital Library for Earth System Education [DLESE] for peer review and posting in the collection. The integrative nature of learning about the Earth demands a core information technology infrastructure that makes distributed learning a reality—the time to act is now.

The IT research will be accomplished by a multidisciplinary scientific team with expertise spanning knowledge representation, reasoning, and problem-solving environments; collaboration research; parallel computation; scientific visualization and data analysis; human-computer interaction; and software process and architecture. Support for the substantive geophysical model development is also broad and leverages the considerable resources of the participating universities (Courant Institute of Mathematical Sciences, Howard University, Purdue University, Stanford University, University of Alabama-Huntsville, University of California at Los Angeles, University of Chicago, University of Colorado, University of Illinois Urbana-Champaign, University of Michigan, and University of Wisconsin) as well as the National Center for Atmospheric Research.


B. Table of Contents

A. Project Summary
B. Table of Contents
C. Project Description
1 The Information Technology Revolution for the Geosciences
1.1 National and Global Context
1.2 A Vision for Enabling Virtual Communities of Researchers and Educators
1.3 NCAR’s Role
2 Elements of the KEG and Related Work
2.1 Problem Solving Environments
2.1.1 Problems/limitations with existing systems: Need integration, scalability, etc.
2.2 Collaboratories and Related Infrastructure
2.3 Portals for Scientific Research & Education
2.3.1 An Environment for Hypothesis Development and Testing
2.3.2 Frameworks for Realizing the Portal
2.4 Scientific Data: Complex, Diverse, and Very Large
2.5 Mining: Data, Information, and Knowledge
2.6 Discovery of Information, Data, Software, Tools, and Knowledge [Jessup?]
2.7 Visualization: Multiscale and Terascale
2.8 Advanced Collaborative Environments
2.9 Next-generation Multiscale Earth System Models [Tribbia, Ghil]
2.10 Distributed Group Development of Frameworks, Tools, Models, and Agents
2.11 Executing and Managing Simulation Processes
2.12 Knowledge Systems: Ontologies for the Geosciences
2.13 The IT Challenges: Scalability, Overall Integration, IT Research [Middleton/Fox/Hammond]
3 Research Design and Methods
3.1 The Concept
3.2 IT Research Challenges
3.3 KEG Definitions
3.4 Goals, Requirements, and Characteristics
3.5 Design
3.6 Architecture and Enabling Frameworks
3.6.1 Detailed description of the architecture
3.6.1.1 IP layer
3.6.1.2 KF1 layer
3.6.1.3 KF2 layer
3.6.1.4 KF3 layer
3.6.1.5 KF4 layer
3.6.1.6 MS layer
3.7 Outcomes
3.7.1 Infrastructure
3.7.2 Education and research
3.7.3 Services
3.7.4 An expandable framework
3.8 Software Engineering Challenge
4 A Multi-scale Earth System Model
4.1 Definition
4.2 The Problems: Modeling and Software
4.3 The Approach
4.4 Goals and Outcome
5 A KEG for Everyone!
5.1 Deliverables
5.2 Technology Transfer
6 Education and Outreach
6.1 Outreach to the Scientific Community – Summer Design Institutes
6.2 Outreach to Communities
6.3 The K-12 Educational Community – K-12 KEG
6.4 Outreach to a Diverse Community – Collaboration with SOARS
6.5 Outreach to the Public – Sharing Information about KEG
7 Usage Scenarios
7.1 Hurricane Landfall (HAL) Test Bed
7.2 El Nino Southern Oscillation (ENSO) Test Bed
7.3 Megacity Impact on Regional And Global Environment (MIRAGE) Test Bed
8 Broader Impacts
9 Management Plan (up to three pages in length) [hammond]
10 Prior Results
D. References Cited
E. Biographical Sketches
F. Proposal Budget [hammond]
G. Current and Pending Support
H. Facilities, Equipment, and Other Resources
I. Special Information and Supplementary Documents
J. Appendices
K. Attic
11 Data Mining from UAH
11.1 Data mining in a distributed Environment
11.1.1 Goals
11.1.2 Requirements
11.1.3 Basics
11.1.4 Applying Data Mining


C. Project Description

1 The Information Technology Revolution for the Geosciences

1.1 National and Global Context

Over the last 30 years, the global population has doubled, the carbon dioxide concentration has increased from 315 to 370 ppm, and the mean global temperature has risen from 13.9 to 14.4 degrees C. A gaping ozone hole appears every spring over Antarctica, and another seems to be developing over the Arctic. Air and water pollution problems are global in scale. Never before has the need to understand our planet, the complex interactions of its processes, and our own impact upon the system been as urgent and compelling as now. The unique challenge of the geosciences is to address as a whole the many interlocking processes in the atmosphere, oceans, land surfaces, ice sheets, and biota that together determine the behavior of the planet. This holistic set of processes requires an Earth system approach to global, and even regional and local, problems, combining many specialties in a way that is not required in other scientific pursuits. The research issues are ceasing to be the purview of any single discipline and span multiple communities, with stakeholders in education, environmental and societal impacts, and multiple Earth system disciplines.

Detailed observations of the Earth, distributed and diverse data and information holdings, powerful simulation and analysis capabilities, knowledge holdings, and collaboration environments, to name but a few, clearly have tremendous potential to elevate our knowledge and understanding of our planet. The information technology revolution brings us unprecedented new capabilities that offer substantial promise for integrating these resources and turning them into powerful new tools and environments. As we consider our future, however, simple extensions of extant technologies and methodologies will not begin to address our requirements. A new era of scientific discovery is within reach, if these new capabilities can be effectively harnessed in the service of science. [Too vague, more work – don]

1.2 A Vision for Enabling Virtual Communities of Researchers and Educators

A centerpiece of NCAR’s long-term vision is to develop a Geosciences Decision Support Environment to substantially improve our understanding of, and to provide accurate and timely information about, the Earth system in which we live. This information and the decision support environment itself will be used to facilitate and accelerate fundamental scientific research, enrich education programs, and feed into policy decisions and assessments. This vision is consistent with the PITAC report [PITAC99]: “Research is conducted in virtual laboratories in which scientists and engineers can routinely perform their work without regard to physical location—interacting with colleagues, accessing instrumentation, sharing data and computational resources, and accessing information in digital libraries.”

To make strides toward realizing this vision, we propose to create a prototype Knowledge Environment for the Geosciences (KEG) that will produce a knowledge-enabled collaborative problem-solving environment for Earth system research and education. This environment is a platform for fundamental IT research, enabling advances in methodologies and tools for distributed, large-scale collaborative research, knowledge evolution, and distributed information environments. It will organize research products into a searchable, shared group resource that underlies a compelling concept: a knowledge-based problem-solving environment for geosciences research, education, and assessment. It is targeted at the heart of the research activity—the process—not just the end result. This will be the first time a comprehensive knowledge system has been proposed to integrate frontier research on high-performance simulation of the Earth system with both archival and interactive geosciences learning environments. The complexity of a geophysical knowledge framework is a new and fertile context for basic IT research. The proposed research is structured to foster iteration between IT and the core geophysical problems, whereby basic IT research contributions spur new geosciences developments that in turn pose new IT research challenges.


1.3 NCAR’s Role

NCAR’s primary function is to serve as an integrator of people, disciplines, methods, technologies, and activities in the pursuit of advancing the national research agenda. It also acts as a catalyst, bringing together many specialists and approaches to propel the science forward. While these roles have traditionally been played in the context of Earth system research, they must now extend into the information technology realm as well if geoscience is to achieve the progress that is needed.

NCAR is well positioned to play a prominent role in motivating the evolution of information technology research in the context of the geosciences. Broad community projects and large-scale simulation efforts push the envelope of what is possible and serve as harbingers of future community needs. In this proposed work we team Earth system researchers with their counterparts in computational science in order to develop new understanding of the problem domain and to attack the basic research problems in computational science. This synergistic partnership is crucial to advancing the research agenda for all of the disciplines involved.

NCAR also has a responsibility to foster the development of important, long-term community infrastructure and to support it as a persistent resource for research. This role complements the proposed work by providing a path for sustaining the prototype environments, frameworks, tools, and software produced by this effort.

2 Elements of the KEG and Related Work

In considering a next-generation environment for supporting distributed group research, one can identify a number of logical components that we understand fairly well today. Collaboratories present shared, virtual spaces where groups of researchers can conduct experiments, share results, collectively produce intermediate analyses, and work together to produce knowledge products such as publications. In geosciences research, terascale simulations produce terascale data holdings, and these in turn must be analyzed in the context of the observed record, itself massive data in its own right. Recent advances in Grid technologies provide a model for an underlying computational and data fabric conceived for terascale modeling and analysis. Generalized frameworks for advanced numerical models are emerging that not only facilitate plug-and-play flexibility for algorithms, but also hold substantial promise for supporting domain-specific problem-solving environments. The overarching challenge is to enable all of these technologies to be combined into effective problem-solving environments. The effort proposed here is aimed at building upon a number of other research efforts and extending them such that a knowledge-enabled meta-framework is realized. This environment provides an interdisciplinary team with virtual proximity to all required resources and to each other. In the sections that follow we describe the primary building blocks of the prototype Knowledge Environment for the Geosciences, related work, and research challenges.

2.1 Problem Solving Environments [Will work with Elias over the weekend – don]

2.1.1 Problems/limitations with existing systems: Need integration, scalability, etc.

2.2 Collaboratories and Related Infrastructure

[Need Umich SPARC & CHEF background]

2.3 Portals for Scientific Research & Education

2.3.1 An Environment for Hypothesis Development and Testing

2.3.2 Frameworks for Realizing the Portal

2.4 Scientific Data: Complex, Diverse, and Very Large

Observational programs such as NASA’s Earth Observing System (EOS Terra, Aqua, and Aura) [] present a proverbial fire hose of data for the Earth system community. Space science will face similar challenges when platforms such as the Stratospheric Observatory for Infrared Astronomy (SOFIA) [] and other advanced observatories become operational. At the same time, researchers successfully harness parallel computational platforms to simulate phenomena at unprecedented resolution, while nested and multiscale models will add a further level of complexity. Furthermore, climate and weather researchers and impacts-assessment stakeholders require tremendous flexibility to combine and compare multiple disparate datasets, including GIS. Overall, the geosciences community faces massive growth in the scope, complexity, and ultimate size of crucial scientific data, with volumes escalating into the terabyte and petabyte range during this decade. The Data Problem challenges our very ability to understand these systems and their underlying processes and has the potential to stand as a formidable barrier to research progress if not addressed.

A meta-framework that anticipates future data requirements must possess extraordinary qualities relative to performance, scalability, flexibility, distributed operation and, above all, the incorporation of semantic content. We propose to build upon and coalesce several community efforts, each of which contributes a unique part to the KEG concept. Recent work in HDF5 [] addresses scalability and performance in the context of parallel computation and exposes a powerful and flexible data model. The Distributed Oceanographic Data System (DODS) is a popular framework for enabling data abstraction and distributed access but has not been targeted at high-performance applications. Recent work at the University of Wisconsin on the VisAD class library [] provides an elegant abstraction of data that is highly synergistic with both DODS and HDF5. One aspect of this research will be coalescing these into the meta-framework context with a coupling to DataGrid technologies, which enable distributed operation and address performance issues. One of the outstanding opportunities presented by this research is to explore the possibilities afforded by coupling this best-of-class synthesis with geoscience-specific ontologies, which enable management, discovery, and usage based upon semantic content.
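
To make the intended synthesis concrete, the sketch below shows one way a KEG data layer could expose local HDF5 holdings and remote DODS-style sources behind a single abstraction. It is a minimal illustration under stated assumptions, not a design commitment: the DataSource interface and DODSSource stub are hypothetical names invented here, and only the h5py calls reflect a real library API.

```python
import h5py  # real library for HDF5 access; the rest is illustrative
import numpy as np

class DataSource:
    """Hypothetical uniform interface for KEG data holdings."""
    def variables(self):
        raise NotImplementedError
    def read(self, name):
        """Return the named variable as a numpy array."""
        raise NotImplementedError

class HDF5Source(DataSource):
    """Local HDF5 file exposed through the common interface."""
    def __init__(self, path):
        self._file = h5py.File(path, "r")
    def variables(self):
        return list(self._file.keys())
    def read(self, name):
        return np.asarray(self._file[name])

class DODSSource(DataSource):
    """Placeholder for a remote DODS endpoint; the network transport
    is elided here and would be supplied by DODS client code."""
    def __init__(self, url):
        self.url = url
    def variables(self):
        raise NotImplementedError("remote catalog query goes here")
    def read(self, name):
        raise NotImplementedError("remote subset request goes here")

# Client code is written once, against DataSource, regardless of
# where or how the bytes are actually stored:
def global_mean(source: DataSource, name: str) -> float:
    return float(source.read(name).mean())
```

The design intent mirrors the text above: VisAD-style data abstraction at the interface, HDF5 or DODS behind it, with DataGrid transport slotted beneath without disturbing client code.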

2.5 Mining: Data, Information, and Knowledge

[Lotsa good material here from Sara and Steve. Need to condense and possibly re-tier – don]

Data mining is concerned with the technologies that provide the ability to extract meaningful information and knowledge from large, heterogeneous data sources. Currently, large numbers of observations are acquired and stored in diverse and distributed data repositories, resulting in the need for “theories” that distill their information and knowledge content. The challenge of extracting meaningful information becomes progressively more formidable for the geoscience community with the launch of the components of NASA’s Earth Observing System (EOS Terra, Aqua, and Aura) and future missions. Similar challenges will face the space science community when platforms such as the Stratospheric Observatory for Infrared Astronomy (SOFIA) and other advanced observatories become operational. Since the acquisition of data is a continuing process, general tools and algorithms are needed for analyzing data, as well as for creating and testing theories or hypotheses. Given the vast amounts of data involved, automated approaches that limit the need for human intervention are desirable.

Much progress has been made in both data mining and knowledge discovery over the past few years. For example, these techniques have proven useful for automating the analysis process and reducing data volume. However, the domain is still fairly new, and this research frontier offers many areas for substantial improvement, such as the utilization of background knowledge, the provability of results, scalability, and the use of distributed computing approaches.

Current and near-term data mining systems are limited in scope and exist only as complementary tools for use alongside manual scientific analysis. Even so, there are several sound reasons for pursuing the use of data mining in the geosciences domain. It is a powerful tool for the analysis of huge amounts of science data, and for cases where manual examination of the data is impossible. Furthermore, the careful use of mining tools provides the ability to refine and add layers to the knowledge bases associated with a given science domain. Both of these help to minimize the domain scientists’ data-handling tasks and allow them to maximize their research time. Because mining plans and techniques can be stored and documented in a reusable knowledge base, scientists can also publish their techniques for reuse and verification, and for use in other domains, thus reducing the tendency to reinvent the wheel. Such new mining approaches can also be readily integrated into a larger set of services in a framework, providing a new hybrid tool that is much greater than the sum of its parts.

Knowledge is extracted through the intelligent application of algorithms to data. In data mining, the number of algorithms that can be used to filter, manipulate, and analyze data from its raw form into information, and ultimately into knowledge, continues to expand. The interdisciplinary focus of scientific research today adds complexity, with newer algorithms and different uses for the collected data. A new paradigm for data mining is required: a framework for modeling distributed data, algorithms, and their inter-relationships. This framework can be used as the foundation upon which to build automated systems for the discovery of causal relationships, for hypothesis testing and theory formation, and for the detection of interesting correlations and patterns.
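
As a concrete, if simplified, picture of such a framework, the sketch below registers algorithms with typed inputs and outputs and composes them into a validated mining plan. Every name here (MiningRegistry, the toy algorithms, and the type labels) is an illustrative assumption, not a proposed API.

```python
from typing import Callable, Dict, List

class MiningRegistry:
    """Toy catalog relating algorithms to the data types they consume
    and produce, so mining plans can be assembled and checked."""
    def __init__(self):
        self._algos: Dict[str, dict] = {}

    def register(self, name: str, fn: Callable, consumes: str, produces: str):
        self._algos[name] = {"fn": fn, "in": consumes, "out": produces}

    def plan(self, steps: List[str], start_type: str) -> None:
        """Verify that each step's input type matches the previous output."""
        current = start_type
        for name in steps:
            algo = self._algos[name]
            if algo["in"] != current:
                raise TypeError(f"{name} expects {algo['in']}, got {current}")
            current = algo["out"]

    def run(self, steps: List[str], data):
        for name in steps:
            data = self._algos[name]["fn"](data)
        return data

# Example: a two-step plan over a list of raw values.
reg = MiningRegistry()
reg.register("despike", lambda xs: [x for x in xs if abs(x) < 1e3],
             consumes="raw", produces="clean")
reg.register("mean", lambda xs: sum(xs) / len(xs),
             consumes="clean", produces="summary")
reg.plan(["despike", "mean"], start_type="raw")       # validates the chain
print(reg.run(["despike", "mean"], [1.0, 2.0, 5e6]))  # -> 1.5
```

Because each plan is data it can itself be stored, published, and reused, which is exactly the reuse argument made above.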

This next generation of mining tools, working within a larger framework, can provide a number of new capabilities for dealing with the rising tide of geoscience data sets. They can provide flexible support for scientific analysis, ranging from reducing and refining the volume of data to automating the analysis process itself. Such tools should be accessible from all platforms, including workstations, data archive centers, and data mining environments. They should benefit all levels of users, from the casual user to the domain expert. Finally, the tools should be available in real time, on demand, and on ingest, and must provide verifiable and repeatable results.

To accomplish this, a number of challenges must be met. Foremost is to define common standard interfaces for the interoperability of distributed services and to incorporate these standards into the geosciences domain. This common standard definition has begun with the efforts of projects such as the Earth Science Markup Language (ESML) and OGC classification and services. But along with this common language, new frameworks must be developed to support this paradigm of interoperability. Such a framework should support network-accessible data sets (e.g. the data pool concept). It should support a catalog of distributed capabilities as well as planning for resource allocation and load balancing. It should be able to deal with new and changing environments, such as streaming input, on-orbit processing, and grid environments such as NASA’s IPG. The need for such a framework has been clearly stated in several arenas; for example, the scientists attending the ITSC-hosted NASA Data Mining Workshop expressed the desire for a taxonomy of data mining techniques and a semantic model for data.
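
One hedged illustration of the interoperability idea: an ESML-style external description that lets a generic reader interpret an otherwise opaque binary dataset. The descriptor fields and the read_described helper are invented for this sketch; the real ESML schema is XML-based and considerably richer.

```python
import numpy as np

# Hypothetical, ESML-inspired description of a flat binary grid file.
sst_descriptor = {
    "variable": "sea_surface_temperature",
    "units": "K",
    "dtype": "float32",       # storage type of each value
    "shape": (180, 360),      # latitude x longitude grid
    "byte_order": "<",        # little-endian
}

def read_described(path: str, desc: dict) -> np.ndarray:
    """Generic reader driven entirely by the external description,
    so the tool needs no knowledge hard-wired for this dataset."""
    dtype = np.dtype(desc["byte_order"] + {"float32": "f4"}[desc["dtype"]])
    data = np.fromfile(path, dtype=dtype)
    return data.reshape(desc["shape"])
```

The point is separation of concerns: the format knowledge lives in a shareable description, and any conforming service can consume the data without custom code.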

Such a framework must be able to scale to much larger levels and be robust enough to survive the rapidly changing computing environment of today and tomorrow. One example is a web-based user interface for visually connecting distributed processes, querying information about services and algorithms, and interactively inspecting data results. New methods of knowledge representation will be required in order to contain and use the large amounts of information associated not just with data sets, but also with algorithms, processes, and all the other facets of large-scale geoscience research.

In support of the KEG framework, the team will also create an ontology for formally representing both the data and the algorithms used to manipulate that data and, more importantly, the complex associations among them. Applying its expertise in data mining and pattern recognition, the research team will then develop access methodologies that exploit this ontology. The integrated framework gained from merging a formal ontology with an overall search and modification capability will facilitate the transformation, exploitation, and exploration of data, in order to improve our ability to extract knowledge.
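
The sketch below illustrates, under invented names, what such an ontology of data and algorithms might look like at its simplest: a store of (subject, relation, object) assertions that can be queried to find algorithms applicable to a dataset. A production ontology would use a formal language such as KIF rather than Python tuples.

```python
class Ontology:
    """Minimal triple store: (subject, relation, object) assertions."""
    def __init__(self):
        self.triples = set()

    def assert_(self, s, r, o):
        self.triples.add((s, r, o))

    def query(self, s=None, r=None, o=None):
        """Match triples against a pattern; None acts as a wildcard."""
        return [(ts, tr, to) for (ts, tr, to) in self.triples
                if (s is None or ts == s)
                and (r is None or tr == r)
                and (o is None or to == o)]

# Hypothetical assertions relating data to the algorithms that act on it.
kb = Ontology()
kb.assert_("TRMM_rainfall", "has_type", "gridded_precipitation")
kb.assert_("wavelet_denoise", "applies_to", "gridded_precipitation")
kb.assert_("storm_tracker", "applies_to", "gridded_precipitation")

# "Which algorithms can I run on this dataset?"
dtype = kb.query(s="TRMM_rainfall", r="has_type")[0][2]
print([s for (s, _, _) in kb.query(r="applies_to", o=dtype)])
```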

2.6 Discovery of Information, Data, Software, Tools, and Knowledge [Jessup?]


2.7 Visualization: Multiscale and Terascale

[Working on over the weekend – don]

2.8 Advanced Collaborative Environments

[Argonne contribution - will condense, insert references. Also need CHEF i.e. web-based collaboration –don-]

To facilitate the creation of a knowledge environment for the geosciences that demonstrates a seamless, virtual laboratory for Earth system science and research, the environment needs to support collaboration at a fundamental level. How are participants connected to the virtual laboratory? How is the laboratory organized or structured? If participants are spread across many locations in the United States, with some resources collocated with participants and others not, how is information shared, and how do groups collaborate? Argonne National Laboratory, with which the University of Chicago group is associated, has begun to address these issues via the development of the Access Grid [1, 2]. The Access Grid is an ensemble of network, computing, and interaction resources that supports group-to-group human interaction across the grid. It consists of large-format multimedia displays, presentation and interactive software environments, interfaces to grid middleware, and interfaces to remote visualization environments. Access Grid nodes are deployed into “designed spaces” that explicitly support the high-end audio and visual technology needed to provide a high-quality, compelling, and productive user experience. Access Grid nodes are connected via the high-speed Internet, typically using multicast video and audio streams, and it is possible to participate in Access Grid sessions from multimedia PCs. We are currently exploring the use of commodity technologies such as game consoles as high-performance, low-cost Access Grid nodes, exploiting the fact that gaming consoles provide more advanced graphics than most PCs and are beginning to support virtual environments as well. The Access Grid enables distributed meetings, collaborative teamwork sessions, seminars, lectures, tutorials, and training. The Access Grid design point is modest (3-20 people per site) but promotes group-to-group collaboration and communication. Large-format displays integrated with intelligent or active meeting rooms are a central feature of Access Grid nodes. The Access Grid connects groups together, but that is just the beginning; it only forms the foundation for the development of virtual communities. Once a community is formed, the infrastructure and tools need to be in place for it to accomplish its goals.

Chicago proposes to work on the development of collaborative visualization tools, layered on the Access Grid collaboration infrastructure, that support a wide variety of endpoints, be it a desktop or an advanced display environment. There has been prior work on supporting collaborative visualization and analysis over networked environments. Foster et al. [3] outline the requirements for enabling collaboration-oriented environments, based on what has evolved into today’s grid infrastructure [4]. An early project in collaborative visualization is CoVis [5], which is used as an educational tool; for example, CoVis has taken tools used by atmospheric scientists and simplified them for use in education. CoVis uses existing network-based tools in an ad hoc manner to build the collaboration environment. This requires that the tools be available at all sites and that each user become familiar with the needed tools, which can be a problem in a large collaboration where infrastructure and capabilities vary. Other collaborative visualization work, connection-oriented in nature, has been done in the context of the Web, using browser and Internet infrastructure as the backbone on which tools are built [6, 7]. Shastra defines a stream-based architecture for collaboration in a LAN setting [8]. Chicago plans to build on past experience in the area of collaborative visualization [9-12] and on the work of the community in developing collaborative visualization tools that exploit the emerging Grid infrastructure [13-21] and the Access Grid framework. For collaborative visualization to succeed, the underlying infrastructure needs to be flexible enough to operate across a variety of network connections and on a variety of endpoints. The endpoint may be a user’s laptop, a room-sized high-resolution display, or even an immersive virtual environment. The datasets under collaborative analysis will also vary in size, and some machines will be incapable of handling all the data, much less have the computational resources to process it into a useful form. This unbalanced environment of resources means that the infrastructure needs to be aware of visualization servers, enabling distant/remote visualization tools (connection to Folk/Hibbard??). Chicago plans to integrate all of its collaborative visualization tools with back ends that build on existing efforts in the efficient execution of the visualization pipeline [22, 23].
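
Since Access Grid sessions ride on multicast audio and video streams, the following sketch shows that mechanism in miniature: sharing a small piece of session state (say, the current viewpoint of a shared visualization) over IP multicast using only the standard library. The group address, port, and message format are arbitrary choices for illustration, not Access Grid protocol details.

```python
import json
import socket
import struct

GROUP, PORT = "224.1.1.1", 5007  # arbitrary multicast group/port for this sketch

def send_state(state: dict) -> None:
    """Broadcast a JSON-encoded state update to every listening site."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    sock.sendto(json.dumps(state).encode(), (GROUP, PORT))

def receive_states():
    """Yield state updates published by other sites in the session."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    membership = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)
    while True:
        data, sender = sock.recvfrom(65535)
        yield sender, json.loads(data.decode())

# e.g. send_state({"camera": [0.0, 0.0, 5.0], "dataset": "ENSO_sst"})
```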

2.9 Next-generation Multiscale Earth System Models [Tribbia, Ghil]

2.10 Distributed Group Development of Frameworks, Tools, Models, and Agents

2.11 Executing and Managing Simulation Processes

2.12 Knowledge Systems: Ontologies for the Geosciences

[Deborah: I need to streamline but you’re welcome to as well ;-). Additional information and references on applications of knowledge systems to GIS and other earth-system applications would be helpful.]

Increasingly, new knowledge products are stored in and used through distributed computing resources. Within a particular discipline, specific storage formats and transmittal protocols are often used. Problems quickly arise when one tries to understand interactions among complex or interdisciplinary systems, something that is increasingly common in both education and scientific research. Suddenly, the discipline-specific storage formats and protocols become an additional barrier to opening the information up to another area using modern computer technology. Even today, the products educators and scientists seek are based in complex structures and/or contain or rely upon large volumes of reference materials, (often undocumented) computer code, visual representations, and/or numeric data. In these cases, the information is only useful if the software used to access it accurately represents the structures and underlying assumptions. Even though the connectivity and capability of people using computers via networks continue to improve, and an increasing amount of information is available digitally, the two have not been brought together in a persistent way that lets additional knowledge be built upon them and captured.

Large-scale ontologies are becoming an essential component of many applications, including standard search (such as Yahoo and Lycos), e-commerce (such as Amazon and eBay), configuration (such as Dell and PC-Order), and government intelligence (such as DARPA’s High Performance Knowledge Base (HPKB), Rapid Knowledge Formation (RKF), and Agent Markup Language (DAML) programs). While fairly common in e-commerce, various expert and recommender systems, and even Geographic Information Systems, significant usage within the context of the geosciences is fertile ground for research.

Reusable ontologies are becoming increasingly important for tasks such as information integration, knowledge-level interoperation, and knowledge-base development. Ontolingua [] provides a set of tools and services to support the process of achieving consensus on common shared ontologies by geographically distributed groups. These tools make use of the World Wide Web to enable wide access and provide users with the ability to publish, browse, create, and edit ontologies stored on an ontology server. Users can quickly assemble a new ontology from a library of modules. Ontolingua utilizes open standards such as KIF and OKBC and has been used for years in academic and commercial settings. The Ontolingua Server may be accessed at http://ontolingua.stanford.edu/.

Ontologies are becoming so large that it is not uncommon for distributed teams of people with broad ranges of training to be in charge of ontology development, design, and maintenance. Standard ontologies (such as UN/SPSC) are emerging as well, which need to be integrated into large application ontologies, sometimes by people who do not have much training in knowledge representation. This process has generated the need for tools that support broad ranges of users in (1) merging ontological terms from varied sources, (2) diagnosing the coverage and correctness of ontologies, and (3) maintaining ontologies over time. Chimaera [] is an ontology environment aimed at supporting these tasks.
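
A toy illustration of the merge-and-diagnose task (the names and structures below are invented for this sketch; Chimaera itself operates on full knowledge bases, not Python dicts): given two term hierarchies built by different groups, flag probable duplicate terms and terms whose parent concepts disagree.

```python
# Each ontology maps a term to its parent term (None for a root).
atmos = {"phenomenon": None, "storm": "phenomenon", "hurricane": "storm"}
ocean = {"process": None, "storm": "process", "el_nino": "process"}

def diagnose_merge(a: dict, b: dict):
    """Report shared terms and any parent disagreements between them."""
    shared = set(a) & set(b)
    conflicts = {t for t in shared if a[t] != b[t]}
    return shared, conflicts

shared, conflicts = diagnose_merge(atmos, ocean)
print("candidate duplicates:", shared)   # {'storm'}
print("parent conflicts:", conflicts)    # 'storm': 'phenomenon' vs 'process'
```

Real merging tools add lexical matching, subsumption reasoning, and interactive resolution, but the diagnostic core is this kind of structured comparison.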

Within the context of the prototype KEG, we envision several important areas for using ontologies: representing the semantic meaning of data, recommending tools, mining data and information, and facilitating the use of the KEG by a diverse community. Building upon best-of-breed knowledge technologies and research efforts, the primary research challenges lie in developing geoscience-specific ontologies, addressing scale and pruning algorithms, and addressing knowledge usage in an environment of high uncertainty.

It is important to note that there are myriad other enticing possibilities that could be explored, including spatiotemporal feature detection, algorithm selection, visual representation recommendation, and the dynamic construction of cooperative networks of components. As the meta-framework and knowledge components are integrated, we expect the overall environment to evolve into community infrastructure that fundamentally enables another generation of research in computational science and geoscience.

2.13 The IT Challenges: Scalability, Overall Integration, IT Research [Middleton/Fox/Hammond]

3 Research Design and Methods

The prototype Knowledge Environment for the Geosciences proposed here is composed of coordinated and tightly coupled IT research activities. The Knowledge Framework (KF) provides the ‘glue’ to mediate between the upper and lower tiers. It is the “middleware” for knowledge encapsulation, discovery, recording, and sharing for large-scale distributed and collaborative research work. Each element in the architecture is designed to be highly distributed and scalable. Here we elaborate upon the coordinated activities and the prototype concept.

3.1 The Concept

The prototype Knowledge Environment for the Geosciences proposed here is composed of three coordinated and tightly coupled IT research activities: the Interaction Portal (IP), the Knowledge Framework (KF), and the Multi-scale repository (MS). These activities are arranged in a multi-tiered architecture. The upper tier is where users interact with the environment; the lower tier represents the primary sources of information to be built and used; and the middle tier provides the glue to mediate between the upper and lower tiers. It is the middleware for knowledge encapsulation, discovery, recording, and sharing for large-scale distributed and collaborative research work. Each element in the architecture is designed to be highly distributed and scalable.

[Figure: the tiered KEG architecture. Research, Education, Impacts, and Policy users interact through the Interaction Environment; the Knowledge Framework mediates between that tier and the Multi-scale layer, EOS data, GIS, and extant models.]

3.2 IT Research Challenges

This proposal features many IT research challenges in fully addressing the complexity and interdisciplinary nature of stored and presented information and knowledge. They are:

- Provide a robust and scalable information technology education and research environment that facilitates the processes of learning and researching.
- Validate applications of object-oriented and knowledge systems engineering methods in the large-scale geosciences domain.
- A data model that extends and integrates data semantics (e.g. references from real fields back to the mathematical equations underlying simulations, and references from real and text fields back to a semantic network of terms in physics or more specialized disciplines).
- Hypermedia...
- Resource description...
- Etc.

3.3 KEG Definitions

The Interaction Portal (IP) is where the community member connects with the knowledge environment. One model for the interaction portal is that of a knowledge-enabled, next-generation problem-solving environment [e.g. PYTHIA]: a powerful collection of tools, components, and interfaces that operate seamlessly with one another, enable a broad range of research activities, and can be deployed across a range of end-user portals. Layered upon the KF and targeted initially at the MS application domain, the IP will realize a prototype next-generation portal for IT and domain research for the geosciences. The IP must support a wide variety of human activities, including finding, accessing, combining, processing, analyzing, and visualizing data objects up to terascale; developing and evaluating model components and tools; collaborating with other researchers; and drawing from and contributing to the knowledge base. Several levels of access must be addressed, ranging from individual web-browser access up to collaborative distributed group interactions in a sophisticated tiled project environment (e.g. the AccessGrid). In order to address educational, research, and multi-disciplinary impacts-assessment usage, the IP will also be aimed at a range of levels of user sophistication, from student to scientist.

The Multi-Scale repository (MS): The scientific cornerstone of KEG is the investigation of multi-scale Earth system characteristics, including analysis, simulation, and prediction. This concept is a significant generalization of the current standard methods of Earth system research and is composed of a set of matched and coupled collections of data and numerical component models depicting the climate system. The multi-scale concept expands this structure of coupled individual components to linked embedded hierarchies of data and models within each component module. This multi-scale structure will permit the investigation of the most uncertain aspects of selected geophysical systems: numerical simulations, subgrid-scale parameterizations, etc. It will allow us to address an expanding list of scientific questions that involve interactions between disparate scales in space and time [Ghil85; Ghil87]. Such questions are at the core of many of the most societally relevant research problems to be addressed in the geosciences in the next 5-10 years.
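
As a sketch of the "linked embedded hierarchies" idea (all names are invented for illustration, not part of the proposed design), the following represents a coarse global component with progressively finer nested domains, each refining its parent over a subregion:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ModelDomain:
    """One node in a multi-scale hierarchy: a model (or dataset)
    covering a region at a given grid spacing, with finer children."""
    name: str
    grid_km: float                              # horizontal grid spacing
    region: Tuple[float, float, float, float]   # lon/lat bounding box
    children: List["ModelDomain"] = field(default_factory=list)

    def nest(self, child: "ModelDomain") -> "ModelDomain":
        self.children.append(child)
        return child

    def walk(self, depth: int = 0):
        """Traverse coarse-to-fine, the order a coupled run would follow."""
        yield depth, self
        for c in self.children:
            yield from c.walk(depth + 1)

# A global atmosphere with a basin-scale nest and a storm-scale nest.
globe = ModelDomain("global_atmosphere", 100.0, (-180, -90, 180, 90))
basin = globe.nest(ModelDomain("tropical_pacific", 25.0, (120, -20, 280, 20)))
basin.nest(ModelDomain("hurricane_core", 2.0, (150, 5, 160, 15)))

for depth, dom in globe.walk():
    print("  " * depth + f"{dom.name}: {dom.grid_km} km")
```

The hierarchy makes the scale-interaction questions explicit: each parent supplies boundary conditions downward, and each child feeds statistics (e.g. candidate parameterizations) back up.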

The Knowledge Framework (KF) is a layered set of services with well-defined (access and/or application programming) interfaces; it features a highly distributed set of methods, classes, tools, and components and is a fusion of knowledge and object representation. Each of the elements of the KF utilizes a protocol and content specification for passing data, metadata, and knowledge between themselves and those in the IP and MS layers. The framework will support the modeling, visualization, analysis, software development, display, and collaboration aspects of the environment.
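
One minimal way to picture the KF's "protocol and content specification" is a self-describing message that carries data, its metadata, and its provenance together, in either direction between layers. The envelope fields and names below are assumptions made for illustration, not a proposed wire format.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Dict, List

@dataclass
class KnowledgeEnvelope:
    """Self-describing unit exchanged between the IP, KF, and MS layers."""
    payload: Any                      # data, a tool reference, a result...
    metadata: Dict[str, str]          # units, grid, discipline vocabulary
    provenance: List[str] = field(default_factory=list)  # processing history

    def annotate(self, step: str) -> "KnowledgeEnvelope":
        """Record a processing step so knowledge stays auditable."""
        self.provenance.append(step)
        return self

    def to_json(self) -> str:
        return json.dumps(asdict(self))

# The same envelope serves "output" (MS -> IP) and "input" (IP -> MS):
msg = KnowledgeEnvelope(
    payload={"field": "surface_temperature", "mean": 287.5},
    metadata={"units": "K", "source": "multi-scale run"},
).annotate("computed global mean").annotate("published to portal")
print(msg.to_json())
```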

3.4 Goals, Requirements, and Characteristics

KEG itself is conceived as a symmetric architecture. In most conventional web-based architectures, information flows mostly one way, from the information layer (e.g. web pages) through the middleware (e.g. a web server) to the user (e.g. a web browser); in KEG, information and knowledge flow in both directions. The KF will mediate the representation of knowledge as it passes back and forth between the MS and IP layers and is the key to implementing the prototype KEG. Viewed from the MS, the KF is a toolkit of successively layered software classes and services for assembling components of, or an entire, set of (physical) multi-scale resources (datasets, models, etc.); that assembled set is the planned ultimate prototype of the KEG proposed here. As a result of this requirement, elements of the KF must function in both “input” and “output” modes.

Attention will be given to establishing and maintaining the integrity of the captured knowledge, resource distribution and utilization, load balancing, and error handling. An important diagnostic or validation step for the knowledge framework components is their publication; this will be the basis for debate and benchmarking leading to their credibility, quality, and acceptance. The KF will allow for an immense variety of applications and interfaces built upon open language and protocol standards and will allow data, metadata, and knowledge sources to be added and accessed/served. Ultimately, users can obtain needed resources, codes, data, and metadata efficiently without having to know details about storage or location.

The KF will use traditional object concepts of encapsulation and polymorphism, which allow discipline-specific features to be hidden from users [Booch94; Taylor98], and high levels of abstraction, which can be largely discipline-independent [Hibbard98]. When accessed from the IP or MS layers, a particular component would take its discipline-specific form. The KF will therefore allow interdisciplinary scientific progress on large scales and will be able to scale across the spectrum of user requirements, from K-16 students to senior researchers.
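
To make the encapsulation and polymorphism point concrete, here is a hedged sketch (the class names are invented, not part of the KF design): client code in the IP works against one abstract interface, while each discipline supplies its own hidden implementation.

```python
from abc import ABC, abstractmethod

class GeoField(ABC):
    """Abstract, discipline-independent view of a physical field."""
    @abstractmethod
    def units(self) -> str: ...
    @abstractmethod
    def value_at(self, lon: float, lat: float) -> float: ...

class AtmosphericTemperature(GeoField):
    """Discipline-specific details (vertical coordinates, grids) stay hidden."""
    def units(self) -> str:
        return "K"
    def value_at(self, lon: float, lat: float) -> float:
        return 288.0  # stand-in for interpolation on the model grid

class OceanSalinity(GeoField):
    def units(self) -> str:
        return "psu"
    def value_at(self, lon: float, lat: float) -> float:
        return 35.0   # stand-in for lookup in the ocean component

def describe(field: GeoField, lon: float, lat: float) -> str:
    """Portal-side code: identical for every discipline."""
    return f"{field.value_at(lon, lat)} {field.units()}"

for f in (AtmosphericTemperature(), OceanSalinity()):
    print(describe(f, -105.0, 40.0))
```

This is the sense in which the same component "takes its discipline-specific form" when accessed from the IP or MS layers: the interface is shared, the behavior is not.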

3.5 Design

A preliminary architectural design was developed for the prototype KEG during the preparation of this proposal and is depicted in Figure PKEG1. While this design utilizes six tiers or layers and will be used to discuss important technologies for this proposal, the design must remain iterative and coupled to end-user requirements and the actual development processes. A formal capabilities and design document will be developed in the first year of the project and used to establish the software design.

Figure PKEG1.

We will utilize UML-capable tools for the detailed design, code stub generation, reverse engineering, etc. In addition, we will hold modeling sessions with domain experts early in the project to formalize the functional requirements and characteristics.

3.6 Architecture and Enabling Frameworks

3.6.1 Detailed description of the architecture

Starting from the preliminary KEG architecture shown above, we now describe the KF elements in more detail: their provisional protocols, proposed content specifications, some functional characteristics, and how they may fit together. This section explains the specific uses of and reasoning behind the chosen technologies, as well as what needs to be developed.

3.6.1.1 IP layer

The IP must support a wide variety of human activities, including finding, accessing, combining, processing, analyzing, and visualizing terascale data objects; developing and evaluating model components and tools; collaborating with other researchers; and drawing from and contributing to the knowledge base. Several levels of access must be addressed, ranging from individual web-based browser access up to collaborative distributed group interactions in a sophisticated tiled-display project environment. In order to address educational, research, and multi-disciplinary impact assessment usage, the IP is aimed at a range of levels of user sophistication, from student to scientist.

The IP necessarily builds upon the layered fabric of the KF as a collection of complementary and interoperating components, some of which have analogs in extant information technology. Powerful access to and interaction with the knowledge base is a central element of the IP. The knowledge base for this effort will be the Multi-Scale repository (MS), a distributed store of simulation code components, simulated and observed data, tools, analyses, commentary, documents, images, movies, and even virtual collaborative experiences (e.g., AccessGrid sessions).

The ability to share, browse, study, and even experience high-level knowledge holdings will enable distributed collaborative IT and domain research at a new level. The IP will present an environment for software design and code development for all levels of the KEG, including the model development and validation processes, and will be accessible to discipline and information scientists, algorithm specialists, and software engineers. The concept will encapsulate SourceForge-type models, including in-portal execution of analysis, visualization, model, and collaboration components.

Data analysis and visualization are vital to the process of developing understanding and, ultimately, new knowledge. One of our goals is to enable geographically distributed researchers to effectively hypothesize and subsequently explore, analyze, and compare large, complex geosciences data from their home locations. The geosciences application domains addressed in this effort pose formidable challenges. Most already involve terabyte-class datasets, many growing to petabyte-class by mid-decade. Present-day studies of the Earth system are growing rapidly not only in scale and resolution but also in complexity, and include oceans, rivers, atmosphere, sea ice, vegetation, and biogeochemistry. Future research incorporating a multiplicity of nested grid scales presents logical data complexity beyond anything encountered before. We propose to address these areas by combining distance visualization techniques, multi-resolution approaches, access to powerful data fusion capabilities, and metadata/knowledge-base awareness.

Another goal of this endeavor is to develop prototype tools that facilitate saving an exploration, a powerful concept aimed at addressing ‘Do you see what I see?’ questions among geographically distributed researchers. The AccessGrid (see [AccessGrid]) is a rapidly developing technology aimed at facilitating group-to-group interaction among geographically distributed people. Access to the collaboration services in the KF layer for this effort will provide a next-generation realization of the AccessGrid fused with the IP and other problem-solving environments. This effort requires extensions to the AccessGrid framework so that prototype visual analysis tools function in a distributed group mode. In addition, we propose to extend the framework such that AccessGrid experiences may be recorded, annotated, and deposited in the knowledge base as a record of human experience and knowledge exchange. These experiences will be searchable in the same way that other knowledge holdings are.

The Interaction Portal will serve as a model for a next-generation user environment for geosciences research and will offer a springboard for future development. The knowledge and data search capability will be a quantum leap beyond what is currently possible and will address major shortcomings in current community capabilities. We envision this as a major step towards the concept of a Semantic Web [Berners-Lee98] in the context of geosciences research. The new software development environment will revolutionize how next-generation analysis, visualization, and modeling tools are developed and will be redeployable for similar efforts at other research centers and universities.

3.6.1.2 KF1 layer

The KF1 layer is a set of interaction and discovery services including special and general-purpose modules.

User Interface services:

 <some of this needs to include AG infrastructure, discussion of various protocols: ICA, VNC, RDP, AIP, X >

Collaboration services:

To support the complex mix of activities that constitute a scientific research endeavor requires collaboration systems that extend beyond simply sharing screens. A collaborative environment must provide facilities for creating a shared workspace where work can be performed both as a group and individually over an extended period of time. While working simultaneously as a group, scientists may interact via synchronous collaboration tools. Synchronous collaboration extends across a range of technologies, from shared displays (pixel sharing) to shared data or visualization streams to shared applications (event sharing). Each of these has its own cost/benefit tradeoffs. Display sharing technologies will be used to support collaboration in the IP; in the KF, we will be concerned with application sharing and support for data stream sharing. Scientists will also work asynchronously, contributing work when others in the group are not necessarily available and on-line. Tools for asynchronous collaboration include discussion lists, calendars, announcements, e-mail lists, and folders to organize resources. They also include the ability to create, modify, and annotate shared resources.

For the collaboration system of the KF, we will develop a middleware infrastructure for collaboration systems.  This middleware will provide frameworks that will allow the collaboration system to be customized to the needs of the user and the user's task.  These frameworks will include an application framework, a user management framework, and a resource framework.

The application framework will provide components to register, locate, invoke, and share applications. We will develop an API through which applications can access sharing services. The framework will also provide wrappers through which sharing may be added to applications that were not built for that purpose. To make it possible for users to find applications to add to their portals, metadata must be defined for each application and registered in a directory. Where possible, we will make use of appropriate metadata and directory standards.
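
A minimal sketch of how such an application registry might look, assuming hypothetical names (AppMetadata, ApplicationRegistry) and an in-memory map standing in for a real directory service:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the application framework's registry: applications
// register descriptive metadata, and portals locate them by name.
public class ApplicationRegistry {
    public static class AppMetadata {
        final String name, version, description;
        final boolean supportsSharing;   // whether a sharing wrapper is needed
        AppMetadata(String name, String version, String description, boolean supportsSharing) {
            this.name = name; this.version = version;
            this.description = description; this.supportsSharing = supportsSharing;
        }
    }

    private final Map<String, AppMetadata> directory = new HashMap<>();

    // Register an application's metadata in the directory.
    public void register(AppMetadata m) { directory.put(m.name, m); }

    // Locate an application so the portal can offer it to the user.
    public AppMetadata locate(String name) { return directory.get(name); }

    public static void main(String[] args) {
        ApplicationRegistry reg = new ApplicationRegistry();
        reg.register(new AppMetadata("vis-tool", "0.1", "terascale visualization", true));
        System.out.println(reg.locate("vis-tool").description);
    }
}
```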

The user management framework will provide directory services for storing user attributes and group affiliations.  This directory will be available to all applications to allow for user identification, presence, and security.  The resource framework will interface with the KEG cataloging service and KF/MS interface services to provide search, location, storage, and retrieval of resources.

The collaboration system provides the problem-solving environment for KEG.  As such, it must allow the user to define the workflows that support a particular task.  <need more here>

The application framework will monitor and log all user activity in a manner that provides security and privacy.  The reason for this is twofold. First, it allows researchers and system developers to mine this data for information on how the collaboration system is being used.  This information can then be fed back into the development process to improve and expand the system in the most cost-effective ways.  Second, it allows agents to be written which can look at the data and make inferences based on what a given user, or a group of users in aggregate, has done. For example, if a user is setting up a workflow for a simulation run, the agent may detect the type of activity being performed and provide assistance in formulating steps to be included.  Or, it may be able to suggest collaborators that are working on similar problems or datasets or other resources that may be of use.

We plan to expand information resource discovery to use the Resource Description Framework [RDF] format, which facilitates storing “knowledge” about artifacts in forms that agents can process and directly captures knowledge about the objects that exist in the KF. This knowledge includes the information stored in the MS layer and knowledge about the KF objects themselves (“metadata”). The task here is to develop information integration: tools that can be used to express meta-information and constraints about entities in an information environment, and agents (or processes) that can exploit that information to provide relevant services to users of the environment. For instance, it should be possible to provide a service in the knowledge framework that tells you how many objects can be used to manipulate a particular type of information in the MS layer.
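
As a sketch of the kind of RDF description involved, the following uses Apache Jena (our illustrative choice of RDF library; the proposal commits only to RDF itself), with a hypothetical namespace and property names:

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

// Hedged sketch: describe an MS-layer dataset in RDF so agents can discover
// and reason about it. Namespace, resource, and property names are invented.
public class KFObjectDescription {
    public static void main(String[] args) {
        String ns = "http://keg.example.org/kf#";   // hypothetical namespace
        Model model = ModelFactory.createDefaultModel();
        Property producedBy = model.createProperty(ns, "producedBy");
        Property spatialScale = model.createProperty(ns, "spatialScale");

        Resource dataset = model.createResource(ns + "sst-run-042");
        dataset.addProperty(producedBy, model.createResource(ns + "ocean-model-v2"));
        dataset.addProperty(spatialScale, "mesoscale");

        model.write(System.out, "RDF/XML");   // serialize the metadata
    }
}
```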

The ANSI/NISO Z39.50 [Z3950a, Z3950b] standard defines a way for elements in a layered framework to communicate for the purpose of information retrieval. Z39.50 makes it easier to use large information databases by standardizing the procedures and features for searching and retrieving information in a distributed environment. The particular (open source) implementation of Z39.50 that we plan to utilize is from Island Edge Research [IER]; it forms queries and responses in XML. Z39.50 is attractive because it is simple and an established open standard. As the project evolves, we will evaluate new protocol standards as they emerge and are adopted.

Development services:

Within the prototype KEG, and especially driven by the formulation of hypotheses (problems) in the IP layer, it will be necessary to provide a portion of the framework for developing new analyses, visualizations, model runs, classifications or evolutions of new knowledge areas, or even a combination of these and/or other services. This service is the one typically not provided within a framework environment; instead, these capabilities are developed externally and then inserted. We argue that to successfully KF-enable these developments, they should arise from within the KF.

Since a variety of methods already exist, some just need to be encapsulated within the framework, some will need to have service interfaces or APIs developed (e.g., in adapting a SourceForge-type capability), and still others will need to be invented or developed. Particular services for development within the KF may include: makefiles, validation, debugging, ontology creation, evolution, agents, code segments or class libraries, or even an entire model. We envision that at least a new XML-derivative language may be required for internal expression of development requirements between the KF elements, in essence a ‘DevML’, which we will consider in the early design phases.

3.6.1.3 KF2 layer

Visualization services:

The scalability requirements of the KEG dictate the need for a flexible set of KF visualization elements. In effect, these elements must accommodate everything from interactive visualization of terascale data sets residing on clusters to simple, on-the-fly summary representations for a quick look; in the ideal case, both extremes should be accessible to educators and scientists alike. In addition, these elements must cooperate, i.e. interface, with the services in the KF1 layer. <Some VisAD, some SCD, some RSI?> /Bill/Don/PAF/

Important community resources like VisAD [HibbardXX] lead the way in powerful, flexible visualization component frameworks. When complemented with script-based languages like NCL [Alpert9X] and IDL [RSI9x], for example, and with state-of-the-art volume visualization techniques, a .............

As part of the KF2 visualization service, we will create new implementation classes for VisAD data and display component interfaces to support visualization of terascale data sets distributed across the nodes of large processor clusters. By suitable interfacing to data services (mining, management, etc., in lower KF layers), the data components may be “tiled” across a distributed set of resources (e.g., compute nodes), typically (but not necessarily) via a requested partition of the data set's spatial domain. Display components from the KF1 layer will access data components visualized locally on each cluster node, so that display information rather than data is transmitted to user displays.

In addition, VisAD display components have a built-in collaboration capability so that any display can be used to construct a collaborating display on another system.  For example, VisAD display components can run in web applets and in virtual reality. Using collaboration services, VisAD can be integrated with the AccessGrid for “collaborative distributed group interactions” that include access via web browsers, workstations and virtual reality.
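
For concreteness, a minimal single-node VisAD sketch follows: it builds a small gridded temperature field and maps it to a display. The field, grid, and display names are illustrative; the tiling described above would partition the sampling Set across cluster nodes, and that distribution machinery is omitted here.

```java
import visad.*;
import visad.java2d.DisplayImplJ2D;

// Minimal VisAD sketch: a (longitude, latitude) -> temperature field
// sampled on a 16x8 grid, mapped to axes and color on a local display.
public class VisADSketch {
    public static void main(String[] args) throws Exception {
        RealType lon = RealType.getRealType("longitude");
        RealType lat = RealType.getRealType("latitude");
        RealType temp = RealType.getRealType("temperature");
        RealTupleType domain = new RealTupleType(lon, lat);

        FunctionType fieldType = new FunctionType(domain, temp);
        Linear2DSet grid = new Linear2DSet(domain, -180.0, 180.0, 16, -90.0, 90.0, 8);
        FlatField field = new FlatField(fieldType, grid);

        double[][] values = new double[1][16 * 8];   // synthetic sample values
        for (int i = 0; i < values[0].length; i++) values[0][i] = Math.sin(i * 0.1);
        field.setSamples(values);

        // The display could equally be a collaborating display on another
        // system, as noted above.
        DisplayImpl display = new DisplayImplJ2D("keg-display");
        display.addMap(new ScalarMap(lon, Display.XAxis));
        display.addMap(new ScalarMap(lat, Display.YAxis));
        display.addMap(new ScalarMap(temp, Display.RGB));

        DataReferenceImpl ref = new DataReferenceImpl("temp_ref");
        ref.setData(field);
        display.addReference(ref);
    }
}
```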

All visualization services must be able to access distributed data holdings, and we propose to utilize a Data Access Protocol, initially based on the DODS DAP [DODS], which is discussed in more detail in a later section.

Analysis services:

It is common for analyses of data for different portions of the research, and sometimes education, process to be scattered across standalone codes, model code, and even visualization codes. We propose to extract these functional requirements into a set of flexible and deployable analysis services that can be utilized in concert with other elements within the KF2 layer and exchanged with the KF1 layer.

<Some VisAD, some SCD, some RSI?, stats...>/Bill/Don/PAF/Doug/

Assimilation services: /Joe/Don/UAH?/

In this context, assimilation includes data and knowledge fusion. This service within the prototype KEG will provide advanced capabilities for integrating spatially and temporally disparate data for use within the other KF layers and ultimately for presentation to a user. The service takes configuration information, such as control over the aggregation and resolution of the temporal domain.

<Need elaboration here....>

Modeling services: <This is not for the first year but the ideas need to be developed>

To address the multi-scale science problems of interest, numerical simulation models are one of the primary research tools. Just as individual datasets and discrete knowledge holdings have not previously been brought together within a framework such as that proposed here, neither have ensembles or components of models. NCAR, together with its research community, is developing partnerships to address the development of new model frameworks, especially in the area of Earth system modeling. For the KEG, and for this specific KF service, we will leverage those activities and focus them toward both the multi-scale requirements we have identified and the accompanying framework of other services.

Particular capabilities for this service include: scientific utilities; advanced data structures (which build on basic data structures within the framework), such as differential operators, pointwise operators, and scale coupling; scale-aware physics parameterizations; model-specific math libraries, such as linear and non-linear solvers and fast spectral transforms; model components to advance dynamical equation sets applicable to a given scale; model validation and diagnostic capability; model steering; interfaces for real-time summary extraction; and a capability to build production model runs, including input and output streams, independent of target platforms and operating environments. It is likely that some new XML-derivative language will be required to implement many of these capabilities, for example a new ‘ModelML’.
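
As a rough illustration of the kind of interface such a modeling service could expose, consider the following hypothetical Java sketch (all names are ours); the ScaleCoupler stands in, very loosely, for the scale-coupling capability listed above:

```java
// Hypothetical sketch of a scale-aware model component; every name is
// illustrative, not a committed design.
public interface ModelComponent {
    String scale();   // e.g. "global", "mesoscale", "LES"

    // Advance the component's dynamical equation set by one time step.
    double[] step(double[] state, double dtSeconds);
}

// A scale-coupling adapter: embeds a finer-scale component inside a coarser
// one, standing in for a sub-grid-scale parameterization where available.
class ScaleCoupler implements ModelComponent {
    private final ModelComponent coarse, fine;
    ScaleCoupler(ModelComponent coarse, ModelComponent fine) {
        this.coarse = coarse; this.fine = fine;
    }
    public String scale() { return coarse.scale() + "+" + fine.scale(); }
    public double[] step(double[] state, double dt) {
        double[] tendency = fine.step(state, dt);   // resolve fine-scale processes
        return coarse.step(tendency, dt);           // then advance the coarse scale
    }
}
```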

3.6.1.4 KF3 layer

Mining services:

Research extensions of existing 'mining' [Hinke97; DATA99; Ramac00] and semantic analysis techniques will be required to address the aggregation and organization of information and knowledge resources as they pass through the KF portion of the KEG. This means mining in a sense well beyond ‘data’. Algorithms, templates, and agents for application to the multi-scale repository must be assembled and/or developed and in turn stored as part of the knowledge holdings, as well as possibly kept active in a server process ready to perform common operations in response to requests from KF2-layer services. A comprehensive and scalable data model design will be required for use within this service, and we intend to leverage existing efforts from the investigator team on the [DODS], [Globus], [VisAD], and [STT] projects.

The reality that large, immovable datasets are part of today's multi-scale repository may also require custom order processing (using mining, subsetting, and other methods) as a means to reduce the data volume for visualization and other purposes.

Since most existing mining systems are themselves end-to-end, an important first step is to migrate the data processing services and resulting mining applications into a form that supports the integration of highly distributed and interoperable software components and adds a high degree of scalability. This process will free these services from the conventional, tightly coupled, single-server approach that is currently utilized. This uncoupling will also allow other KF-layer services to utilize processing and data resources that are widely dispersed, both in terms of spatial location and in terms of hardware and software implementations.

At the same time, full access to the KF metadata and knowledge components may be associated with the mining services. To make this extension, another facet of this research will be to extend traditional search mechanisms with advanced search capabilities (e.g., latent semantic analysis) in order to provide a new level of domain- and problem-specific knowledge- and information-finding capability. The mining services will support the fusion of data and knowledge sets at different spatial and temporal resolutions as provided in the KF2 layer.

As with all KF elements, mining services will incorporate all pertinent standards and emerging standards, such as the Simple Object Access Protocol (SOAP), to ensure the highest level of interoperability with other components of the KF, to achieve the highest possible level of open access to data and services, and to allow the mining services to be deployed in specific end-to-end applications. Inherent in this type of distributed environment will also be the ability to access heterogeneous data sets using a common data protocol, together with emerging metadata technologies such as the Earth Science Markup Language [ESML].

Cataloging and Tracking services:

In order to effectively support the needs of geoscientists, the knowledge framework will need to provide cataloging and tracking services. For instance, scientists will need to be able to classify both objects and relationships in the knowledge framework. That is, it will not only be important to assert that a data set contains information generated by an atmospheric model, but also that it is related to another data set that was generated by a previous version of that model. This type of relationship is important in evaluating the evolution of a model: Is the new version generating more accurate data? Is the new version generating more compact data? As this is just one type of valuable relationship, it is clear that the knowledge framework will need to support multiple types of relationships, and since a significant portion of knowledge framework objects will not be stored in a Web-accessible form, the knowledge framework will need to be able to create, maintain, and evolve relationships between arbitrary data formats.

One important problem to which cataloging services can be directly applied is that of relating the files produced by an ensemble model run to one another. Ensemble model runs involve multiple models running in tandem, producing data that is in some sense synchronized; e.g., these ten files each represent data produced by models that were processing the same region of space at the same time but addressing different concerns (such as atmospheric conditions, sun intensity, and ocean temperature). Ensemble model runs typically generate thousands of files, and currently nothing is used to relate these files to one another other than ad hoc techniques such as file naming conventions. What is needed is a service that can group the output of ensemble model runs into conceptual wholes linked by relationships that indicate the semantic groupings of the files: e.g., these ten files represent model data for Boulder, Colorado on April 23, 2001 at 2 PM, and they are linked to these ten files that represent the same model data for Boulder on April 23rd at 1 PM, and so on.
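
A minimal sketch of such a grouping record, with hypothetical names and illustrative file names:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical cataloging record that groups ensemble output files into a
// conceptual whole with typed links to related groups.
public class EnsembleGroup {
    public final String region;      // e.g. "Boulder, Colorado"
    public final String validTime;   // e.g. "2001-04-23T14:00"
    public final List<String> files = new ArrayList<>();    // member outputs
    public final List<TypedLink> links = new ArrayList<>(); // typed relationships

    public static class TypedLink {
        public final String type;           // e.g. "follows", "priorModelVersion"
        public final EnsembleGroup target;
        public TypedLink(String type, EnsembleGroup target) {
            this.type = type; this.target = target;
        }
    }

    public EnsembleGroup(String region, String validTime) {
        this.region = region; this.validTime = validTime;
    }

    public static void main(String[] args) {
        EnsembleGroup onePM = new EnsembleGroup("Boulder, Colorado", "2001-04-23T13:00");
        EnsembleGroup twoPM = new EnsembleGroup("Boulder, Colorado", "2001-04-23T14:00");
        twoPM.files.add("atmos_conditions.nc");   // illustrative file names
        twoPM.files.add("ocean_temperature.nc");
        // Replace ad hoc file-naming conventions with an explicit semantic link.
        twoPM.links.add(new TypedLink("follows", onePM));
    }
}
```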

With respect to tracking services, a critical capability is to provide a recommender service for geo-science data. Recommender services are becoming more common with the rise of the World Wide Web (although recommender techniques and technology predate the Web []) and are typified by Amazon.com’s ability to generate information statements like “People who bought this book, also purchased these titles...” In the knowledge framework, it will be important to derive patterns of use over existing data sets. Thus, if scientists always access data set B after retrieving data set A, then the knowledge framework should automatically inform scientists who access data set A about the existence of data set B.
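
A minimal sketch of such a tracking-based recommender, assuming simple co-occurrence counting (hypothetical names; a production service would need weighting, decay, and privacy controls):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: if users who access data set A routinely access data
// set B afterwards, suggest B to future readers of A.
public class AccessRecommender {
    // accessedAfter.get(a) holds counts of data sets retrieved after a.
    private final Map<String, Map<String, Integer>> accessedAfter = new HashMap<>();
    private final Map<String, String> lastAccess = new HashMap<>();   // per user

    public void recordAccess(String user, String dataset) {
        String previous = lastAccess.put(user, dataset);
        if (previous != null && !previous.equals(dataset)) {
            accessedAfter.computeIfAbsent(previous, k -> new HashMap<>())
                         .merge(dataset, 1, Integer::sum);
        }
    }

    public Set<String> recommendationsFor(String dataset) {
        return accessedAfter.getOrDefault(dataset, new HashMap<>()).keySet();
    }

    public static void main(String[] args) {
        AccessRecommender r = new AccessRecommender();
        r.recordAccess("alice", "dataset-A");
        r.recordAccess("alice", "dataset-B");
        System.out.println(r.recommendationsFor("dataset-A"));   // [dataset-B]
    }
}
```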

We intend to explore techniques for providing cataloging and tracking services based on an approach known as open hypermedia. Open hypermedia [Østerbye, et al., 1996] is a technology that supports the creation, navigation, and management of relationships among a heterogeneous set of applications and data formats. In the same way that the hypermedia services of the World Wide Web [Berners-Lee, 1996] enable navigation among related Web pages, open hypermedia allows users to navigate among the documents of integrated applications. Open hypermedia, however, provides a higher level of hypermedia sophistication than that found on the Web. In particular, open hypermedia is not limited to a read-only environment; users have the power to freely create links between information regardless of the number and types of information involved. There are three technical characteristics of open hypermedia systems that enable this ability: advanced hypermedia data models, externally managed relationships, and third-party application integration.

Advanced hypermedia data models: The advanced hypermedia data models of open hypermedia systems allow them to create links over multiple sets of documents. Links can be typed, and meta-information about the documents being related can be stored. These models are more powerful than the hypermedia data model of the Web (as provided by HTML), which consists of one-way pointers embedded inside documents.

Externally managed relationships: Externally managed relationships imply that links are stored separately from the content being related. This approach, again, contrasts with the approach taken by the Web, in which link information is embedded directly within the content of HTML documents. The benefits of externally managed links include the ability to link to information located on read-only media or within material protected by read-only access permissions, and the ability to associate multiple sets of links with the same information.
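
A minimal sketch of an externally managed, typed link record, with hypothetical names; note that neither endpoint document is modified:

```java
// Hypothetical sketch: the relationship lives in a link repository, not
// inside either document, so read-only holdings can still be linked and
// several link sets can overlay the same content.
public class ExternalLink {
    public static class Anchor {
        public final String objectId;   // any KF object, Web-based or not
        public final String selector;   // e.g. a byte range or variable name
        public Anchor(String objectId, String selector) {
            this.objectId = objectId; this.selector = selector;
        }
    }

    public final String linkType;       // links can be typed
    public final Anchor source, target;

    public ExternalLink(String linkType, Anchor source, Anchor target) {
        this.linkType = linkType; this.source = source; this.target = target;
    }

    public static void main(String[] args) {
        ExternalLink l = new ExternalLink(
            "derivedFrom",
            new Anchor("model-output-v2.nc", "var=sst"),
            new Anchor("model-output-v1.nc", "var=sst"));
        System.out.println(l.linkType);
    }
}
```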

Third-party application integration: Finally, third-party application integration allows hypermedia services to be provided in the tools used by users on a daily basis. This approach also contrasts with the Web where typically the only applications that provide hypermedia services are Web browsers. With the open hypermedia approach, end-user applications provide hypermedia services directly, allowing users to create relationships over a wide range of applications and data formats. The open hypermedia field has developed a wide range of techniques for integrating legacy systems into an open hypermedia framework [Davis, et al., 1994; Whitehead, 1997; Wiil, et al., 1996] while new applications can be designed and implemented to directly support and provide open hypermedia services.

Open hypermedia systems provide the extensible infrastructure that enables hypermedia services over an open set of data, tools, and users [Østerbye, et al., 1996]. The first open hypermedia system appeared in 1989 [Pearl, 1989] and the field has been evolving rapidly ever since. Østerbye and Wiil present a good history and overview of open hypermedia systems in [Østerbye, et al., 1996]. There are many open hypermedia systems in existence. Examples include Chimera [Anderson, 1997; Anderson, 1999a; Anderson, 1999b; Anderson, et al., 2000a; Anderson, et al., 1994; Anderson, et al., 2000b], Microcosm [Hall, et al., 1996], DHM [Grønbæk, et al., 1993], HOSS [Nürnberg, et al., 1996], and HyperDisco [Wiil, et al., 1996]. For the purposes of the proposed research, we have selected Chimera to help support our prototype development efforts. Chimera specializes in supporting the creation of relationships across heterogeneous data sources [Anderson, et al., 2000b] and is, thus, well suited to the domain of the proposed research.

Techniques in open hypermedia can be usefully applied to the infrastructure design of the knowledge framework. Not all objects in the knowledge framework will be Web-based (in fact, maybe only a small percentage of objects will be Web-accessible) and so standard Web linking technologies will not be sufficient for capturing the relationships that exist between these objects. An open hypermedia system, however, can easily provide the infrastructure to establish and maintain relationships between the heterogeneous data formats of the knowledge framework. The fundamental research challenge will involve adapting open hypermedia infrastructure to the scale of the knowledge framework (which is envisioned to contain thousands to tens of thousands of objects). Past work in open hypermedia has addressed scalability in the number of relationships over a comparatively small set of objects [Anderson, 1999a; Anderson, 1999b]. Now we must scale the technology so it can support the creation of hundreds of thousands (if not millions) of relationships over a large set of objects.

In addition, research must be performed to provide distributed relationship management capabilities to each object of the knowledge framework. Thus, for instance, it should be possible for a user of the interaction portal to import two knowledge framework objects into a design environment and establish a relationship between them (since both objects would be “hypermedia-enabled”) regardless of where these objects actually live. Moreover, the link could potentially be made available outside of the design environment that created it in the first place. This will require distributed hypermedia link repositories that live in the knowledge framework and replicate link information. This characteristic implies an additional design challenge: How should link consistency be characterized and enforced in the knowledge framework? That is, if a link is deleted, how long will it take before all copies of that link are removed from all of the distributed link repositories?

Notification services:

A key functionality of the services layer of the knowledge framework is notification. For instance, scientists will want to be notified when interesting things occur in the framework, such as a particular paper being published, or when a running model has discovered an interesting feature that may be useful for steering future model runs. In addition, ensemble model runs may make use of notification services to coordinate the different elements running in the ensemble. Educators or students may want notification of events of topical or social interest... <please someone fill this in/out> /Roberta?/

One approach to notification services is via event publish/subscribe technology (also known as event notification technology). The basic notion involves a set of publishers and a set of subscribers. The publishers publish specifications of particular event types they may generate at some point in the future. Subscribers can subscribe to a particular set of published events and will receive notifications each time an event is generated for one of their subscriptions. For instance, a Web server may generate an event each time one of its pages is updated. Subscribers interested in those events would subscribe to the “page updated” event published by the Web server. Then, each time a page is updated, the Web server creates a new instance of the “page updated” event configured with information about the newly updated page and hands the event to an event notification system. The event system takes this instance and sends it to each subscriber interested in the “page updated” event. A key benefit enabled by event notification technology is that the publishers and subscribers remain completely ignorant, and thus completely independent, of one another. A publisher simply generates an event, and the event notification system takes care of sending the event to all of the interested recipients.
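
A generic sketch of the publish/subscribe decoupling just described (this is deliberately not the Siena API; all names are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Generic pub/sub sketch: publishers and subscribers know only the event
// type name, never each other, which is the decoupling described above.
public class EventNotifier {
    private final Map<String, List<Consumer<Map<String, String>>>> subs = new HashMap<>();

    public void subscribe(String eventType, Consumer<Map<String, String>> handler) {
        subs.computeIfAbsent(eventType, k -> new ArrayList<>()).add(handler);
    }

    // The notification system forwards the event to every interested subscriber.
    public void publish(String eventType, Map<String, String> attributes) {
        for (Consumer<Map<String, String>> h : subs.getOrDefault(eventType, List.of())) {
            h.accept(attributes);
        }
    }

    public static void main(String[] args) {
        EventNotifier notifier = new EventNotifier();
        notifier.subscribe("page updated",
            e -> System.out.println("updated: " + e.get("url")));
        notifier.publish("page updated", Map.of("url", "http://example.org/run42"));
    }
}
```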

Event notification technology has been in existence since the mid-eighties [Reiss, 1990] but, until very recently, has been restricted to supporting publishers and subscribers on local area networks. The University of Colorado has been active in pushing the scope of this technology to wide-area networks (WANs) with the Siena project [Carzaniga, 2000]. Siena's support for Internet-scale event notification makes it an ideal candidate for developing notification services for the knowledge framework. In its current form, Siena can support the construction of experimental prototypes that explore the use of notification services among a small set of knowledge framework entities. However, Siena is not currently able to scale to the thousands of objects envisioned for a production P-KEG environment.

As such, basic computer science research will be needed to expand the scope of event notification technology from small sets of publishers and subscribers distributed across a WAN to an environment that can consist of thousands (if not more) of publishers and subscribers distributed across both LANs and WANs, similar to the proposed P-KEG environment. For Siena to reach this level of scalability, new research will be needed in low-level distributed event protocols and in advanced event routing algorithms.

An additional area of research for event notification technology within the context of the P-KEG environment is support for domain-specific event patterns. For instance, while scientists will most likely be interested in coarse-grain events from models such as “model run started” and “model run ended”, they may also be interested in finer-grain events such as “hurricane detected” or “sea temperature rising” that might be generated, for instance, by atmospheric and oceanic models. Furthermore, they may be interested in particular event sequences: if each time an oceanic model generates three “sea temperature rising” events, an atmospheric model generates a “hurricane detected” event, then scientists may want to be notified when such a pattern occurs. In particular, we believe that very powerful analytical tools can be built on top of a notification service that provides advanced event pattern services. We intend to explore such services in the P-KEG environment using Siena as a development platform.
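
A toy sketch of such a domain-specific event pattern, with hypothetical names throughout: fire a notification after three consecutive “sea temperature rising” events.

```java
// Hypothetical sketch of a domain-specific event pattern: notify when three
// "sea temperature rising" events arrive without an intervening reset.
public class RisingSeaTemperaturePattern {
    private int consecutiveRises = 0;

    // Returns true when the pattern completes and a notification should fire.
    public boolean onEvent(String eventType) {
        if ("sea temperature rising".equals(eventType)) {
            consecutiveRises++;
            if (consecutiveRises == 3) {
                consecutiveRises = 0;
                return true;   // e.g. alert subscribers watching for hurricanes
            }
        } else {
            consecutiveRises = 0;   // any other event breaks the sequence
        }
        return false;
    }
}
```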

3.6.1.5 KF4 layer

In the present KEG design, the KF4 layer mediates the flow of data, metadata, and knowledge between distributed repositories and the services in the KF3 layer and above. As such, these capabilities are referred to as systems rather than services, since they must accommodate a variety of protocols and content formats, and different sizes and frequencies of requests, in a reliable and scalable manner.

Knowledge Management System:

<other disciplines have used knowledge systems; explain what's different, what we can use, what we will need to adapt, etc.>

<Knowledge Management System: KIF, OKBC, DAML, code, lib, jar>

Candidate technologies include the DARPA Agent Markup Language [DAML], the Knowledge Interchange Format [KIF], and the Open Knowledge Base Connectivity [OKBC].

Metadata Management System:

<ESML, FGDC, GCMD, ArcXML, SAX, SOAP?>

Data Management System:

<introduce the abstract data (knowledge) model/class, i.e. DMS output>

<discussion of how and who will contribute to the “support” areas above>

We will adopt the Data Access Protocol (DAP) from the Distributed Oceanographic Data System [DODS] <why, how> /PAF/: DODS servers, format and protocol support, local catalog lists, etc. The DMS must accommodate the higher-level data model, which, among other things, allows transformations from data back to equations and physical properties.
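
As a hedged illustration of DAP-style access (the host and dataset names here are hypothetical): DODS/OPeNDAP servers conventionally serve an ASCII representation when “.asc” is appended to the dataset URL, with a constraint expression after “?” selecting an index subset (start:stride:stop).

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Sketch of a raw DAP request: the constraint expression after "?" asks the
// server to subset the "sst" variable before any bytes cross the network.
public class DapFetch {
    public static void main(String[] args) throws Exception {
        URL url = new URL(
            "http://dods.example.org/dods/sst.nc.asc?sst[0:1:10][20:1:30]");
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // the subset arrives already reduced
            }
        }
    }
}
```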

Transmittal protocols, data exchange interfaces, and message-passing techniques:

It will be necessary to provide efficient support for nested regular spatial sampling topologies by creating new implementations of data sampling component interfaces that include algorithms specialized for these topologies.

Support for high performance, redundancy, and scalable data throughput will require the adoption of grid technologies [Foster; Globus], in addition to lessons learned from ADaM (the Algorithm Development and Mining system), which was the first and only data mining system ported to NASA's IPG (Information Power Grid).

3.6.1.6 MS layer

In the application to global warming, the two “visible” layers are, on the one hand, global surface-air temperature maps described in terms of a few leading EOFs or other “significant” parameters and, on the other hand, extreme events (droughts, floods, and locusts (??)). The hidden layers are the hemispheric or sectorial regimes, each also described in terms of a few parameters, such as a cluster centroid and its ellipsoid in phase space. See, for instance, [Robertson and Ghil, 1999; Smyth, Ide and Ghil, 1999]. In this case the HMM (hidden Markov model) graph is one-way, from the large to the small scales.

Include and represent these HMMs within the MS, code, and ontologies.

The way to proceed is to use fully nested runs in the way that large-eddy simulations (LES) are used to inform turbulence-closure models. The closure will be done in the final configuration by using sophisticated stochastic models, with their variance and higher moments (to account for extreme events: say, at least the 3rd and 4th moments, i.e., skewness and kurtosis) determined from the “LES” runs. Ideally, we should use non-Gaussian, power-law distributions with “long tails”, but there is too little experience with that, and I think this is already forward-looking enough.
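
For reference, the standardized moments in question, for a variable x with mean μ and standard deviation σ, are:

```latex
\text{skewness} = \frac{\mathbb{E}\!\left[(x-\mu)^{3}\right]}{\sigma^{3}},
\qquad
\text{kurtosis} = \frac{\mathbb{E}\!\left[(x-\mu)^{4}\right]}{\sigma^{4}}
```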

Here the hidden layer will be continental- or smaller-scale weather regimes that affect clouds, etc. The HMM graph in this case has arrows pointing both ways, from large to small scales (the way that current deterministic parameterizations are designed) and back.

This would mean looking at simulated extreme droughts, floods, hurricanes, or heat waves, etc., and using the multi-scale capabilities to try to actually produce realistic extrema. One possibility could also be that we try to steer this in terms of extremes associated with shifts in regime probabilities. This may have to be accomplished after the proposal is submitted, though. I am thinking that extremes are a simpler concept to convey to ITR folks than the planetary regimes that precondition the system for local extreme events.

This approach unifies the MS activities and builds important bridges between it and the KF.  In fact, the HMMs for both design and application of the MS are solidly anchored in the "data mining" layer of the KF.

Data repository: <many formats (HDF5?)>

<What is in the repository: talk about existing databases, what metadata needs to be harnessed or generated, how we accumulate the knowledge bases, model code, etc.>

Metadata repository:

Important metadata elements to be stored and accessed are: data schemas (describing real and text fields, groupings of fields, and functional dependencies), units for real values, coordinate systems (invertible transforms between real spaces of the same dimension), sampling geometries and topologies (regular and irregular), missing data indicators, and error estimates (error variance for real values). These metadata are integrated into data semantics, so that unit conversions, coordinate transforms, resamplings, and the propagation of missing data and error estimates are implicit in computational and visualization operations.
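
A minimal sketch of how these metadata elements could travel with a field as a simple record (all names are ours; a real implementation would live in the repository's schema):

```java
// Hypothetical record of the metadata elements listed above, stored
// alongside each field so that semantics travel with the data.
public class FieldMetadata {
    public final String fieldName;        // from the data schema
    public final String units;            // e.g. "kelvin"
    public final String coordinateSystem; // invertible transform identifier
    public final String samplingTopology; // "regular" or "irregular"
    public final double missingValue;     // missing data indicator
    public final double errorVariance;    // error estimate for real values

    public FieldMetadata(String fieldName, String units, String coordinateSystem,
                         String samplingTopology, double missingValue,
                         double errorVariance) {
        this.fieldName = fieldName;
        this.units = units;
        this.coordinateSystem = coordinateSystem;
        this.samplingTopology = samplingTopology;
        this.missingValue = missingValue;
        this.errorVariance = errorVariance;
    }

    // Because units travel with the data, conversions can be made implicit
    // in computational and visualization operations, as described above.
    public double toKelvin(double value) {
        return "celsius".equals(units) ? value + 273.15 : value;
    }
}
```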

An important supporting technology will be the Earth Science Markup Language [ESML, Ramac01]. ESML is a specialized markup language for Earth science metadata based on the eXtensible Markup Language (XML). It is designed to provide a standard way of describing both the content and the structure of a data file, thus facilitating the development of dataset-independent search, visualization, and analysis tools without requiring data to be in any particular format. A standard definition for structural metadata, in particular, is needed to enable the interoperation of data access and analysis tools with a wide variety of different Earth science data file formats and structures. Work with standard data formats, such as HDF-EOS, has attempted to address the data structure definition problem, but this type of solution requires the translation of all data files to the proposed format. Development of a standard definition for structural metadata, applicable to Earth science data in any format, is an important problem that ESML addresses.

Abstract data model and integrating metadata:

We will draw on experience from VisAD's “discipline-independent” data model, which is applicable to “oceans, rivers, atmosphere, sea ice, vegetation, and biogeochemistry” and, in fact, to any numerical or text data. It supports regular, irregular, and nested regular spatial sampling topologies (i.e., UnionSets of Gridded3DSets, in VisAD language). It includes many relevant low-level data manipulation algorithms (e.g., resampling, derivatives, iso-surfaces) and is a good framework for new algorithms (e.g., specialized interpolation).

We will provide data objects that contain Earth- and time-referenced data, with methods for sub-setting and mathematical operations for mapping domains onto the user's display space for visualization and interaction. These new abstract data objects will know the domains of sampled functions for state-dependent vertical coordinates and complex shapes that represent terrain, ocean basins, rivers, etc.

Knowledge repository:

Programs and scripts are one way of encoding knowledge, but one that is too restricted by assumptions about data inputs and outputs, and often about the operating system and environment. Experience with the generality of the VisAD data model, which enables programs to be expressed at a higher level, prompts research into the capture of code-level knowledge. Since knowledge is often combined with, or part of, data and metadata use, we require a knowledge model that enables programs to parse and work with a variety of data organizations, and with a variety of metadata for units, coordinate systems, sampling topologies, missing data, and error estimates. Such programs would provide much more robust encodings of the underlying science algorithms and hence a higher-level expression of the scientific knowledge in those algorithms. Specifically, they make program-encoded knowledge more generally applicable to unforeseen data.

Artifacts of research should include “software” or “software components”. In the short run, software is an important part of knowledge because file formats inevitably do not standardize all the information that scientists are interested in, so scientists define conventions for how extra information is stored and implement these conventions in software. In the long run, that is, for a scientist trying to reconstruct someone else's research process years later, being able to run the original team's software is vital. It is the only way to explore the simulation, analysis, and visualization choices that the original team faced.

A central problem for a geoscience knowledge environment is for scientists to be able to exchange software as part of the research process and to be able to leave software as part of the long-term record of the process (and know that others will be able to run that software years later).

<this next part is from Bill but it may be too specific at this stage - comments anyone?>

Java is the right language for sharing and archiving knowledge expressed in programs, because Java's precisely defined semantics preserve the correct functioning of programs across space and time.

<end next part>

3.7 Outcomes

3.7.1 Infrastructure

The IP is <to be completed>. The KF is software: protocols, services, an applications programming interface (API) framework, and documentation. The MS is <to be completed>.

Software capabilities of the KF for the prototype KEG will include objects that contain Earth- and time-referenced data, with methods for sub-setting and mathematical operations for mapping domains onto the user's display space for visualization and interaction. These new abstract data objects will know the domains of sampled functions for state-dependent vertical coordinates and complex shapes that represent terrain, ocean basins, rivers, etc. Another capability is the provision of servers for user-specified analysis of, and interaction with, whole or partial very large 4- and 5-dimensional geosciences datasets. The KF objects and classes will allow the design of model experiments and their operation and will include methods and interfaces for user interaction. Information and data exchange interfaces are also required to support the IP, i.e., for analysis, visualization, validation, debugging, iterative development, access policies, and security.

3.7.2 Education and research

Scientists, educators, and students with networked capabilities will be integral to the information technology and scientific research processes. Though this capability set will initially be modest (at least through the first year or two), its foundation will facilitate adding capabilities to address an enlarging circle of education and research interests across many disciplines. The software design will require the development of efficient and robust message-passing techniques for distributed object deployment, which account for varying network loads, incomplete messages, time-outs, etc. The object design will enable state-of-the-art software development of Earth system models, analysis and visualization applications, and distributed collaboration.

3.7.3 Services

3.7.4 An expandable framework

3.8 Software Engineering Challenge /Alex..../others/

4 A Multi-scale Earth System Model

4.1 Definition

The modeling cornerstone will be a multi-scale model for Earth system modeling, understanding, and prediction. Current Earth system models are composed of a set of matched and coupled numerical component models that depict distinct subsystems of the climate system, such as the atmosphere and oceans. The basic concept of the MESM represents a significant generalization of this approach: it expands the structure of coupled individual components to linked, embedded hierarchies of models within each component module. This multi-scale, hierarchical model structure will permit the investigation of the most uncertain aspect of geophysical-flow simulations: the pervasive uncertainty associated with the finite resolution of numerical simulations and the consequent need for sub-grid-scale parameterization of unresolved physical processes. It will allow us to address an expanding list of scientific questions that involve interactions between disparate scales in space and time [Ghil85; Ghil87]. These questions are at the core of many of those research problems that are most relevant to society and that the geosciences need to address in the next 5-10 years.

The multi-scale science questions that are of current import and will benefit from the construction of an MESM modeling framework can be broadly classified into two categories: (i) direct, long-range scale interactions and (ii) turbulent, or turbulence-like, cascading scale interactions. Cloud microphysics play a key role in regulating the delicate radiative balance that dominates Earth's climate; thus the minute imbalances associated with global climate trends and interdecadal climate variability are clearly a prime example of a science question in the first category. Another such example is the complex chemistry of the Antarctic ozone hole, which relies on reactions that occur on nitric acid trihydrate ice cloud particles. These reactions require the planetary-scale thermal shielding of the Antarctic polar vortex to form, and they in turn affect the hemispheric ozone budget. Understanding and predicting such long-range scale interactions clearly necessitates the construction of a multi-scale modeling framework.

Atmospheric problems in the second category can be most easily described using Figure 1. The figure illustrates the observational fact that atmospheric variability is concentrated along the diagonal of a log-log plot of temporal and spatial scales. In simple words, short time scales are associated with small spatial scales, and long time scales with large spatial ones. Such scaling is reminiscent of the space-time spectra of turbulent flow and, intuitively, the local nonlinear interactions of variability on neighboring scales may be a partial explanation for some of the range of phenomena plotted. Note that a high-variability ridge lies close to the diagonal of the plot (cf. also Fraedrich & Bottger, 1978). The low-frequency variability (LFV) has time scales of ten to 100 days and is thus intraseasonal.

A more complete and satisfactory explanation of this scaling behavior will be a major science objective of this project. An example of the class of phenomena that fall into the cascading-interaction category is the range of variations associated with El Niño events. The El Niño-Southern Oscillation (ENSO) is a coupled atmosphere-ocean mode of variability with a spatial extent across the entire tropical Pacific basin. It is known to affect global weather patterns and has a multi-annual time scale of variation. The primary physical mechanism of air-sea interaction, however, relies upon the relationship between anomalies in atmospheric convection induced by higher-than-normal sea surface temperatures. These large-scale convective anomalies interact with and are influenced by transient convective features on shorter time and space scales, most importantly the intraseasonal variations known as Madden-Julian Oscillations (MJOs).

The role of these intraseasonal variations within ENSO, and the extent to which MJOs trigger phase reversals of ENSO, are a major focus of current climate research. The dynamics of MJOs is extremely complex, since the large-scale convection associated with them is merely an envelope of convective elements organized on the mesoscale. These elements oftentimes propagate from east to west, counter to the sense of propagation of the MJO envelope. The individual convective elements have spatial scales of a few kilometers and lifetimes of a few hours at most. Through the organized convective structures, the MJO, and ENSO, there is a cascade-like link between planetary scales and interannual variability, at one end of the cascade, and tropical convective activity with space scales of kilometers and lifetimes of hours, at the other. It will thus require a multi-scale modeling framework to fully understand, model, and predict the climate variations associated with ENSO.

[Figure 1: Atmospheric Variability. A log-log plot of spatial scale (10 to 10^4 km) against time scale (hour, day, week, season, year); convection/thunderstorms, mesoscale variability (MCCs), synoptic-scale traveling variability, low-frequency variability (LFV: blocking and persistent anomalies), and ENSO lie along the diagonal.]

4.2 The Problems: Modeling and Software

A computer-based simulation and prediction model has to focus upon a limited portion of the time-space spectrum of motions and processes. Hence the parameterization of sub-grid effects, and the consequent use of empirical relations to achieve closure of the mathematical system that governs the computational model, is indispensable. The appropriate detail of physical and numerical description must be imposed at this stage to render the system tractable from a computational, as well as a conceptual, point of view.

The challenges of Earth system modeling, however, are raising concerns as to the accuracy and adequacy of this traditional modeling philosophy. In each component module, important interactions within that module, as well as the coupling with other modules, are effected through processes that are heavily parameterized. For example, reactive flows and the condensation of cloud droplets, which are inherently micro- and molecular-scale processes, are parameterized at various levels of sophistication in almost all models that treat atmospheric chemistry and the atmospheric hydrologic cycle. It is becoming increasingly clear, though, that even on global time and space scales there is great climate sensitivity to the details of such microphysical processes, through the interaction of clouds with the global radiation balance. Similar disparate-scale interactions and reactive-flow interactions occur throughout the geosciences.

Computational simulation and prediction models are essential in advancing the geosciences, and computer software has become a driving force in our research. The engineering of software is, however, becoming increasingly complicated. One must balance a variety of competing factors that include functionality, quality, performance, safety, usability, time to market, and cost. Moreover, the size and complexity of the Earth system models being built is rapidly growing. Through the prototype Knowledge Environment we will study the software development process and focus on the discovery of principles and the development of technologies to support the engineering of the large, complex software systems that are needed. An essential part of this activity builds on the work to develop an Earth System Modeling Framework.

4.3 The Approach

To address this range of problems, we plan to develop multi-scale modules that will allow the seamless replacement of sub-grid-scale parameterizations of cumulus convection and cloudiness, say, or of molecular reactions and microphysical phase transitions, with detailed process models designed for cumulus and microscale simulation. The first implementation of this capability will be for the global atmosphere.

This will require the design and development of a non-hydrostatic dynamical core with locally flexible resolution and a hierarchy of physical process models. The latter will have varying complexity and will range from microphysical-scale models for direct numerical simulation, through the large-eddy-simulation and meso-meteorological scales, all the way to the global scale. An example of such models currently in existence is the NCAR large eddy simulation (LES) model for the atmospheric and oceanic planetary boundary layers (PBLs). This LES model resolves the energy-containing eddies in the turbulent boundary layer (Moeng et al...).
Other examples are cloud-resolving models that explicitly carry cloud droplet and ice particulate distributions along with the most energetic sub-cloud scale motions (Moncrieff et al..) and mesoscale atmospheric models (e.g. NCAR-MM5: Kuo et al...) that parameterize individual convective elements and their cloud formation but explicitly model their mesoscale organization into rain bands, fronts and convective complexes. Finally, global atmospheric circulation models (e.g. NCAR-CCM3: Kiehl et al.....) explicitly model only the synoptic-scale traveling cyclones that generate the weather features depicted on standard weather maps but parameterize all smaller-scale features. Whilst process model embedding has been utilized in the past and is a capability of models such as the NCAR-MM5, such a modeling framework has not been attempted across the entire range of scales being proposed here.There are other compelling computational reasons for developing a new global model and studying this development process for specific knowledge objects that are important to retain during the planning, development, testing, and validation stages.  The majority of existing models have not been constructed for contemporary high-performance computer systems.  These systems are built from a very large number
4.4 Goals and Outcome

For both numerical and computational reasons, we will form a focused multidisciplinary team of applications scientists, computational scientists, applied mathematicians, and software engineers to collaboratively develop a new global multi-scale Earth system model (MSESM) tuned to run efficiently on highly parallel RISC-based microprocessor systems. The prototype multi-scale component will comprise a global atmospheric non-hydrostatic model with interactive chemistry and cloud microphysics. This system will be tested on, and help provide societally relevant answers to, a variety of key scientific questions in the atmospheric and oceanic sciences.

5 A KEG for Everyone!

5.1 Deliverables

5.2 Technology Transfer

6 Education and Outreach

The education and outreach program of KEG centers on four specific themes: outreach to the scientific community, outreach to communities, enhancing diversity in the geosciences, and public information.

6.1 Outreach to the Scientific Community – Summer Design Institutes

Each year we will conduct a KEG Summer Design Institute that brings "customers" for our products into the design and evaluation process. The focus will be on system design, establishing evaluation teams, and engaging student participation. Effective collaboration also calls for exchange agreements so that collaborators may individually spend time at other institutions. We have allowed for four investigators per year to participate for up to two weeks each.

6.2 Outreach to Communities

A fundamental aspect of the IP portion of KEG is its scalability to address the research and learning needs of a broad community of scholars and educators interested in aspects of the geosciences. A focus of our education and outreach activity will be to implement and support versions of the KEG designed for users from a variety of communities outside the usual graduate and undergraduate university research community. Many such communities can be conceived, organized either around specific topical issues or around the needs of specific learning groups. For example, a community-based group studying the effects of industry and land use on its local environment would constitute a topical-issue community, while the K-12 educational community has as its focus the education of a specific learning group. From the outset we will work toward a scalable KEG for multiple communities, and we will demonstrate this capability by initiating two specific versions of KEG for the two types of communities mentioned above: K-12 education and community land use. In building for this scalability, we will create a system whose design inherently allows it to serve other communities.
6.3 The K-12 Educational Community – K-12 KEG

We will develop and implement a version of the KEG designed for the K-12 learning community. This version will encourage interactive learning about the Earth system using simplified models, introduce and facilitate collaboration between students (whether co-located or remote), and produce a variety of educational materials to support the use of K-12 KEG in this community. Educators will participate on the K-12 KEG IP design team, since they require tools that provide management, tracking, and assessment capabilities. The evolving Digital Library for Earth System Education (DLESE) will provide a rich information resource to support the KEG, and our K-12 KEG will include an interface to DLESE. Similarly, we will ensure that materials developed by users of the K-12 KEG incorporate the appropriate metadata fields so that these resources can be searched through the DLESE interface.
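As one hedged illustration of what "appropriate metadata fields" might look like, the record below uses hypothetical field names and a placeholder URL; an actual implementation would adopt the DLESE community's metadata standard rather than this sketch.

# Hypothetical resource record carrying fields for discovery. Field names are
# illustrative only, not the actual DLESE metadata standard.
k12_resource = {
    "title": "Exploring the Water Cycle with a Simplified Model",
    "description": "Classroom activity built around a K-12 KEG model run.",
    "subjects": ["Atmospheric science", "Hydrology"],
    "gradeRange": "6-8",
    "resourceType": "Learning activity",
    "spatialCoverage": {"west": -110.0, "east": -100.0,
                        "south": 35.0, "north": 45.0},
    "temporalCoverage": {"start": "1998-01-01", "end": "1998-12-31"},
    "url": "http://example.org/k12keg/water-cycle",   # placeholder
}

# A harvester could expose such records to a digital-library search interface.
for field in ("title", "gradeRange", "resourceType"):
    print(field, "=", k12_resource[field])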
We will solicit guidance from a select advisory board of K-12 geoscience educators. After the first two years of development and testing, we will focus our efforts on the professional development of K-12 educators, while continuing to revise our prototype to maintain full interactive capability with the full KEG prototype. Professional development will occur at dedicated K-12 KEG training workshops at NCAR, as well as at workshops held in conjunction with meetings of professional societies. Information about the K-12 KEG will be made available through web outreach and through dissemination at professional meetings such as those of the National Science Teachers Association.

6.4 Outreach to a Diverse Community – Collaboration with SOARS

The SOARS program is dedicated to increasing the number of African American, American Indian, and Hispanic/Latino students enrolled in master's and doctoral degree programs in the atmospheric and related sciences, with the goal of supporting the development of a diverse, internationally competitive, and globally engaged workforce within the scientific community. The SOARS learning community provides multi-year programming for undergraduate and graduate students (protégés) that includes educational and research opportunities, mentoring, career counseling and guidance, and the possibility of financial support for a graduate-level program. Protégés spend their summers at NCAR, participating in ongoing research projects, an eight-week scientific writing and communication workshop, and scientific seminars. They benefit from long-term mentoring by respected scientists and professionals, learn about career opportunities, practice leadership, and are encouraged to complete a graduate program in an atmospheric or related science. SOARS students will be invited to participate in the development and use of the prototype Knowledge Environment as it evolves. This will give these students the opportunity to become involved with work in information technology and its intersection with the geosciences, and will provide them with increased access to research data and collaboration opportunities. Additional information about the SOARS program is available at <www.ucar.edu/soars>.

6.5 Outreach to the Public – Sharing Information about KEG

Through our existing web-based information portals (the UCAR/NCAR education and outreach website and the Windows to the Universe project), we will provide the public with information that describes the purpose of the KEG and how they can become involved. As new portals to KEG develop to meet the specific needs of user communities, participation by members of the public may be invited. We will use our existing outreach interfaces to share information about these opportunities.

7 Usage Scenarios

To bring information technology research innovations to bear on highly significant scientific and social problems, we highlight three application scenarios, or test beds, for the knowledge environment:

7.1 Hurricane Landfall (HAL) Test Bed

Improved forecasting and warning of hurricane landfall is a priority objective of the U.S. Weather Research Program and has enormous potential for mitigating damage and disruption to lives and commerce.
Meaningful progress will require research advances in observing technology, data assimilation, and forecast models, as well as improved warning procedures and understanding of the socio-economic impacts. An advanced coupled hurricane model is being developed based on the new multi-agency Weather Research and Forecasting [WRF] model and software framework, and will provide important research contributions to this program. Running at convection-resolving scales, a fully parallel coupled atmosphere/ocean/wave simulation system will challenge the capability of the largest multi-processor computer platforms.

7.2 El Niño Southern Oscillation (ENSO) Test Bed

The ENSO problem involves high-performance simulation of global climate on short timescales with multiple spatial-resolution requirements, experimental forecasting, interaction with an international scientific and user community, societal impacts, and a considerable body of existing knowledge to incorporate into existing and planned digital libraries.
Study of ENSO provides a challenge to every aspect of the IT infrastructure we propose and draws on a vast pool of talent.

7.3 Megacity Impact on Regional And Global Environment (MIRAGE) Test Bed

MIRAGE is a new collaborative scientific initiative to coordinate the study of the export and transformation of pollutants from large metropolitan areas to regional and global scales. As growing urban populations put pressure on their local resources, they are also exporting increasing amounts of pollutants to regional and global scales. Aerosol emissions, in particular, are expected to increase by a factor of two over the next 40 years, and an increasing fraction of those emissions is expected to come from developing parts of the world. The [MIRAGE] framework will emphasize both hypothesis testing and predictive studies. MIRAGE activities will thus benefit from the KEG through the integration of geographical, demographic, and socio-economic databases; gas and aerosol emission inventories; air quality measurements; accurate representations of local meteorology and of synoptic transport to regional and global scales; data assimilation; chemistry/transport models; and models of socio-economic development.

8 Broader Impacts

The KEG vision is to facilitate and accelerate fundamental research for the greater benefit of society, to enrich education programs, and to feed into policy decisions and assessments. Through its diverse use in the geosciences, the prototype KEG will yield benefits beyond the UCAR community, ultimately reaching computer scientists; the communications, weather forecasting, and other industries; and the general public. The most important benefit of KEG in the geosciences is its potential to convey unprecedented amounts of semantic information. Software created through this effort, including source code, will be made generally available at no cost. The shared intellectual experience of developing a project of the scope of the prototype KEG (the IP, KF, and MSESM) and establishing its K-12 counterpart will be documented via publications, conference presentations, and the KEG knowledge archive. We also envision encountering and solving challenging new problems in collaborative, distributed, multidisciplinary research in the information and computer sciences.
9 Management Plan (up to three pages in length) [hammond]

This project will be led by the National Center for Atmospheric Research under the direction of the PI, Tim Killeen. Dr. Killeen will serve as the primary contact for NSF, handle external relations, and provide overall scientific leadership. We expect to appoint a full-time project manager, working in the NCAR Director's office, to assist Dr. Killeen in his duties and to provide overall coordination of activities. Three component activities have a community lead (IP: Stevens, KF: McGuinness, MSESM: Ghil) and an NCAR co-lead (IP: Middleton, KF: Fox, MSESM: Tribbia). The Education and Outreach activity will be led by Roberta Johnson, UCAR vice president for Education and Outreach (???). This core group will work in conjunction with a software engineering manager and the KEG chief architect (???) to coordinate all activities, funding profiles, outreach efforts, etc. This project brings together a large interdisciplinary team of collaborators with comprehensive experience in the required research domains. (The two figures need to be updated with the current organization…)

An advisory group, appointed by the PI and activity leaders in consultation with NSF, will serve as a source of strategic advice on project directions. The Advisory Group will consist of senior IT and Earth science researchers and managers; it will meet face-to-face once a year at a project all-hands meeting and participate in teleconference reviews on two additional occasions during the year. The following list indicates the people we anticipate inviting, but have not yet contacted, concerning this responsibility: …

Two important strategies will be used to facilitate effective completion of tasks and support of project goals. First, the four project activities are well defined, both in scope and in their place in the environment architecture. In the accompanying diagram they are arranged in a circle with cross-couplings to show their mutual dependencies and interactions. Each activity (led by the community and NCAR investigators) may place itself at the apex of the diagram, using the other activities as infrastructure; the diagram shows an example with the Interaction Portal at the apex. In this way, component activities may proceed semi-autonomously without entraining the entire project.
[Figure: the four project activities (Interaction Portal, Knowledge Framework, Multiscale Earth System Model, and Education / Public Outreach) arranged in a circle with cross-couplings; the Interaction Portal is shown at the apex.]

[Figure: project organization chart showing the PI (Killeen), the Advisory Group, the Project Manager, and the four Activity Managers.]


The second strategy is that, within and between activities, smaller tasks are identified on which one to three investigators may collaborate, together with professional and support staff and students. These smaller tasks will follow a rapid-prototyping (design-construct-test-iterate) development methodology. The scope of this project requires several levels of design review: after the initial design increment, a preliminary design review will be held for KEG as a whole, and activity leaders have responsibility for specifying and conducting comprehensive design reviews covering architecture, code, use, and overall effectiveness.

To address cross-cutting themes, we have developed a team expertise/responsibility matrix to identify the particular roles that individuals may play at any stage or in any task of the project. (Peter, do you have the matrix?) To summarize the matrix here, contributing team members cover the following areas of expertise: education and public outreach (Roberta Johnson, Tom Windham); large-scale collaborative user interaction and AccessGrid (Rick Stevens); problem-solving environments (Elias Houstis); knowledge systems and engineering (Deborah McGuinness and Richard Fikes); latent semantic analysis (Elizabeth Jessup); development of interdisciplinary, multiscale Earth system model components, including validation, interpretation, and assessment (Paul Edwards, Michael Ghil, Steve Hammond, Robert Harriss, Joe Klemp, Andrew Majda, Vernon Morris, Anne Smith, Joe Tribbia, Warren Washington, Bob Wilhelmson, Don Wuebbles); hypermedia and framework development (Kenneth Anderson); software architecture, software engineering design, and process in large systems (Alex Wolf); Globus and data grid architectures, geographically distributed resources, security, and access policies (Ian Foster); object-oriented design and knowledge framework development (Peter Fox); data mining (Sara Graves, Steve Tanner); parallel computation and performance optimization (Steve Hammond and Elizabeth Jessup); statistical analysis components (Doug Nychka); collaboratory framework development and object-oriented design (Daniel Kiskis); data analysis and visualization (Bill Hibbard, Don Middleton); and collaboratory environments for science research (Tim Killeen). Team participants also bring expertise from many leading-edge information technology projects, for example: AccessGrid, Globus, the Earth System Modeling Framework [ESMF], the Distributed Oceanographic Data System [DODS], the Island Community Network [ICN], the Space Physics and Aeronomy Research Collaboratory [SPARC], the Comprehensive Framework for Collaboration [CHEF], the Chimaera ontology evolution environment [Chimaera], the Earth Science Markup Language [ESML], the Space Time Toolkit [STT], Algorithm Development and Mining [AdaM], and VisAD. We note in particular that we are leveraging many ongoing and planned efforts, both within the community and at NCAR/UCAR, to capitalize on the timeliness of this project as currently defined. We have considered the scope of this effort, its management, the team, and the requested budget very carefully.

For some well-defined parts of the project, we prefer to subcontract to the commercial sector. One such subcontract is for continuing education of project students and staff in object-oriented design, component and framework development, and the software engineering process. For this task we have identified and budgeted for short courses from Rational Inc.: $40K is allotted in the first year, and $6K in each successive year to maintain and extend skills.


10 Prior Results (up to one page per PI and co-PI, describing prior results with NSF funding on related projects, to a maximum of five pages total)


D. References Cited

[AccessGrid] http://www-fp.mcs.anl.gov/fl/accessgrid/
[Berners-Lee98] Berners-Lee, T., 1998: "Semantic Web Roadmap", http://www.w3.org/DesignIssues/Semantic.html
[Booch94] Booch, G., 1994: Object Oriented Analysis and Design, Second Edition. Addison-Wesley, Reading, Massachusetts.
[CHEF] Kiskis, D. L., and Hardin, J., 2000: private communication.
[Chimaera] McGuinness, D. L., R. Fikes, J. Rice, and S. Wilder: "An Environment for Merging and Testing Large Ontologies", Proceedings of the Seventh International Conference on Principles of Knowledge Representation and Reasoning (KR2000), Breckenridge, Colorado, USA, April 12-15, 2000. http://ksl.stanford.edu/software/chimaera
[DAML] Hendler, J., and D. L. McGuinness: "The DARPA Agent Markup Language", to appear in IEEE Intelligent Systems Trends and Controversies, January 2001. http://www.ksl.stanford.edu/people/dlm/papers/ieee-daml01-abstract.html and www.daml.org
[DATA99] "NASA Workshop on Issues in the Application of Data Mining to Scientific Data, Final Report", University of Alabama in Huntsville, 19-21 October 1999.
[DLESE] http://www.dlese.org
[DODS] http://www.unidata.ucar.edu/packages/dods
[ESMF] http://www.scd.ucar.edu/css/NASACAN.htm
[ESML] http://www.itsc.uah.edu/research/esml.html
[Ghil85] Ghil, M., R. Benzi, and G. Parisi (Eds.), 1985: Turbulence and Predictability in Geophysical Fluid Dynamics and Climate Dynamics. North-Holland, Amsterdam, 449 pp.
[Ghil87] Ghil, M., and S. Childress, 1987: Topics in Geophysical Fluid Dynamics: Atmospheric Dynamics, Dynamo Theory and Climate Dynamics. Springer-Verlag, New York, 485 pp.
[Globus] http://www.globus.org/
[HDF5] http://hdf.ncsa.uiuc.edu/HDF5/
[Hibbard98] Hibbard, W., 1998: "VisAD: Connecting people to computations and people to people", Computer Graphics, 32(3), 10-12.
[Hinke97] Hinke, T., J. Rushing, H. Ranganath, and S. Graves: "Target-Independent Mining for Scientific Data: Capturing Transients and Trends for Phenomena Mining", Proceedings of the Third International Conference on Data Mining (KDD-97), Newport Beach, CA, August 14-17, 1997.
[ICN] http://www.islandedge.com/
[KIF] http://www-ksl.stanford.edu/knowledge-sharing/kif/
[MIRAGE] http://mirage.acd.ucar.edu
[OKBC] http://www.ksl.Stanford.EDU/software/OKBC/ and http://www-ksl-svc.stanford.edu:5915/doc/release/okbc/okbc-spec/okbc-2-0-3.pdf
[PITAC99] Information Technology Research: Investing in Our Future. President's Information Technology Advisory Committee Report to the President, 1999.
[PYTHIA] http://www.cs.purdue.edu/research/cse/pythia/
[Ramac00] Ramachandran, R., H. Conover, S. Graves, and K. Keiser: "Algorithm Development and Mining (ADaM) System for Earth Science Applications", Second Conference on Artificial Intelligence, 80th AMS Annual Meeting, Long Beach, CA, January 2000.
[Ramac01] Ramachandran, R., M. Alshayeb, B. Beaumont, H. Conover, S. Graves, N. Hanish, X. Li, S. Movva, A. McDowell, and M. Smith: "Earth Science Markup Language", 17th Conference on Interactive Information and Processing Systems (IIPS) for Meteorology, Oceanography, and Hydrology, January 2001 (accepted).
[RDF] http://www.w3.org/RDF
[SOARS] http://www.ucar.edu/soars
[SPARC] http://intel.si.umich.edu/sparc
[STT] http://vast.uah.edu/SpaceTimeToolkit/
[Taylor98] Taylor, D. A., 1998: Object Technology: A Manager's Guide, Second Edition. Addison-Wesley, Reading, Massachusetts.
[WRF] http://www.mmm.ucar.edu/wrf
[Z39.50] http://lcweb.loc.gov/z3950/agency/ and http://www.dlib.org/dlib/april97/04lynch.html

E. Biographical Sketches
(Two pages for each senior personnel listed on the proposal)


F. Proposal Budget [hammond]

(Cumulative and annual budgets, including subaward budgets, if any, and up to three pages of budget justification)

This proposal requests $14,825,802 over five years. The cost to cover university participation is $8,374,250, covering 10 institutions represented by 17 non-NCAR lead personnel. The vast majority of these subaward funds is allotted to graduate students and post-doctoral fellows; the remainder covers non-academic-year staff salaries, travel, publications, and miscellaneous expenses. NCAR's requested budget covers approximately eight graduate students/post-doctoral fellows, fractional support for software engineers, and a shared expense for the Project Manager. The budget also contains funding for the Education and Public Outreach effort, the KEG Design Institute (which funds all non-NCAR participants to meet each year), the exchange program, training and continuing education for all project participants, software for all project participants, travel to IT conferences for NCAR-funded personnel, and a small amount for computer equipment and miscellaneous materials.

NCAR's policy is to budget salaries at 85% work time. NCAR's benefit rate is 47.1% and its overhead rate is 46.6%; NCAR assesses overhead only on the first $25,000 of any subaward each year. NCAR has chosen to fully co-sponsor all 11 NCAR scientific investigators (including the PI) at the 0.1 FTE level, and further includes co-sponsorship for the Project Manager and three of the software engineers. Since this project will require substantial computational resources for the execution, analysis, and visualization of the multi-scale Earth system model, NCAR is co-sponsoring a total of 19,000 General Accounting Units (GAUs) of computer time at the NCAR Computing Facility (1 GAU is the equivalent of 1 hour of a Cray C-90 processor). NCAR's total co-sponsorship is $2,800,376 over the five years. All NCAR participation in the KEG test beds is co-sponsored or funded by other programs and is not part of the NCAR budget request.

Here we show how the funds are distributed across the proposed activities. The total proposed for university participation is $8,350,000, or 56% of the total requested. We have allocated 38.2% of the requested funds to the Knowledge Framework activities, 19% to the Interaction Portal, and 14.4% to activities related to the multiscale Earth system model; thus, almost 60% of the proposed budget directly supports the IT research and development activities. The MSESM funding, plus the NCAR co-sponsored time of scientists, will be used both to support the research and development of a new, next-generation global model and to provide the scientific expertise needed to guide and motivate the IT activities.

(still need to fill in budget details and breakdowns)
We request that NSF pay the collaborating universities directly…


G. Current and Pending Support

H. Facilities, Equipment, and Other Resources
(Briefly describe the adequacy of the organizational resources available to perform the effort proposed.)

I. Special Information and Supplementary Documents
(One page listing names and institutions of all persons associated with the proposal. Documentation of collaborative arrangements of significance to the proposal through letters of commitment.)

J. Appendices

K. Attic

11 Data Mining from UAH

The Space Time Toolkit (STT) is a Java-based toolkit that provides advanced capabilities for integrating spatially and temporally disparate data within a highly interactive 3D display environment. Unlike most tools, which require that data be converted to a common spatial and temporal grid before integration, the STT allows one to ingest swath, map-projected, station, event, or path data in whatever spatial and temporal domain it exists in. The STT then allows the end user to select the 2D or 3D display domain and maps all data into that domain on the fly, as needed.
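The "integrate at display time" idea can be sketched in a few lines: each dataset keeps its native sampling, and values are mapped into the user-selected display domain only when drawn. The names below are hypothetical and are not STT's actual API; this is a one-dimensional toy of the concept.

# Toy illustration of display-time integration: datasets retain their native
# time sampling and are interpolated into a common display domain on demand.
import numpy as np

class Dataset:
    def __init__(self, times, values):
        self.times, self.values = np.asarray(times), np.asarray(values)
    def sample(self, display_times):
        # on-the-fly mapping into the display domain (no pre-regridding)
        return np.interp(display_times, self.times, self.values)

station = Dataset([0, 6, 12, 18], [280.0, 284.0, 290.0, 285.0])   # hours, K
swath   = Dataset([3, 9, 15],     [281.5, 287.0, 288.0])

display = np.linspace(0, 18, 7)        # display domain chosen by the user
overlaid = [station.sample(display), swath.sample(display)]
print(np.round(overlaid, 1))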

11.1 Data Mining in a Distributed Environment

The ITSC has a long history in the research and application of data mining technology for scientific data sets. ITSC's Algorithm Development and Mining (ADaM) technology was originally developed under a NASA Research Announcement (NRA) award and matured into a system that has supported a variety of Earth science research activities [refs]. Recently, ITSC has taken steps toward the next evolution in the form of a research project to create the EnVironmEnt for On-Board Processing (EVE), which builds on ITSC's ADaM experience to put data processing and mining technology on board future satellite platforms. Another evolutionary route for ITSC data mining technology is directed at the KEG project: migrating the data processing services and resulting data mining applications into a form that supports the integration of highly distributed and interoperable software components, thereby freeing this technology from the conventional, tightly coupled, single-server approach typically used for data processing and mining systems. Un-coupling this technology will allow researchers on the KEG platform to utilize processing and data resources that are widely dispersed, both in spatial location and in hardware and software implementation.

11.1.1 Goals

The following are the goals that have been identified for the data mining component of the KEG environment.

- Provide a robust interactive data mining environment to support collaborative research.
- Create tools that are uncoupled from existing hardware and software platforms.
- Create a data mining environment that is inherently scalable and extensible to other science domains.

11.1.2 Requirements


- All components operating and available in a totally distributed environment (data, services, metadata, etc.).
- Incorporation of all pertinent standards to ensure the highest level of interoperability with other existing and future systems.
- Designed for Internet2 and Grid capabilities, planning for what is possible beyond current network bandwidths.
- Strive for the highest achievable level of open access to data and services.
- Support for reading heterogeneous data sets (ESML).
- Support for fusion of data sets at different spatial and temporal resolutions.
- A graphical user interface that leads the user through the process of connecting, chaining, and configuring services into a workflow (algorithm development).
- An interactive environment for the inspection of data results at discrete processing points.
- A processing environment that integrates distributed services and data sets.
- Standardized network-accessible interfaces for the interoperability of services.

11.1.3 Basics

The initial stage of this research will be based on the creation of a robust infrastructure of network-accessible software components capable of performing data processing tasks in a highly distributed environment. We call this underlying technology i-Dproc, implying internet-based data processing. i-Dproc will comprise software components defined by a standard interface convention that makes them accessible across the network regardless of the software language or hardware platform chosen for their implementation. Emerging standards, such as the Simple Object Access Protocol (SOAP), will be evaluated and utilized to facilitate this compatibility across systems.
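To make the component convention concrete, the sketch below publishes a trivial processing function behind a language-neutral wire protocol. XML-RPC from Python's standard library stands in here for SOAP, which the text names only as a candidate; the service and function names are hypothetical.

# Sketch of an i-Dproc-style component: a data-processing function exposed
# over a language-neutral protocol, so callers need not share its platform.
from xmlrpc.server import SimpleXMLRPCServer

def smooth(series, window=3):
    """A trivial processing service: moving average of a numeric list."""
    out = []
    for i in range(len(series) - window + 1):
        out.append(sum(series[i:i + window]) / window)
    return out

server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(smooth, "smooth")   # published under a standard name
print("i-Dproc-style component listening on port 8000 ...")
server.serve_forever()

Any client in any language with an XML-RPC binding could then invoke it, e.g. xmlrpc.client.ServerProxy("http://localhost:8000").smooth([1, 2, 3, 4, 5]) from Python.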

Inherent in this type of distributed environment will also be the ability to access heterogeneous data sets without a proliferation of specialized readers. Here again, emerging metadata technologies such as the Earth Science Markup Language (ESML) [ref] are making the development of generic data readers a reality and will be incorporated into this research and development. The basis for a knowledge base of services will likewise be constructed and incorporated into i-Dproc, providing logical mappings between searchable data sets and services that will be used to automate data-to-service process flows, as shown in Figure X.
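One minimal reading of such a knowledge base is a table in which each service advertises the kind of data it consumes and produces, so that a chain from a data set to a desired product can be assembled automatically. All service names and data kinds below are hypothetical.

# Hypothetical service knowledge base: each entry declares input/output data
# kinds; a breadth-first search then plans a data-to-service process flow.
SERVICES = {
    "esml_reader": {"in": "raw_granule",   "out": "gridded_field"},
    "cloud_mask":  {"in": "gridded_field", "out": "masked_field"},
    "classifier":  {"in": "masked_field",  "out": "feature_map"},
}

def plan(start, goal):
    """Find a service chain turning one data kind into another."""
    frontier = [(start, [])]
    while frontier:
        kind, chain = frontier.pop(0)
        if kind == goal:
            return chain
        for name, sig in SERVICES.items():
            if sig["in"] == kind and name not in chain:
                frontier.append((sig["out"], chain + [name]))
    return None

print(plan("raw_granule", "feature_map"))
# -> ['esml_reader', 'cloud_mask', 'classifier']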

11.1.4 Applying Data Mining

ITSC researchers have a long history of working with scientific researchers and are very familiar with the process and approach that scientists take in formulating new algorithms. This process of hypothesis generation is shown in Figure X2.

Building from the i-Dproc foundation, ITSC will research and construct a new data mining application, called i-Mining, to meet the specific needs of the KEG platform, as well as the needs of scientific researchers in general. The i-Mining application will present a graphical, web-based portal interface for researchers to interactively build iterative models based on conventional data mining procedures. It will provide researchers with a "what if" environment for configuring and re-configuring processing steps, and for checking and testing results, until the desired model is derived. The newly constructed models can themselves then be instantiated (plug-and-play) as new services in the system, so that collaborating researchers can build on each other's ideas.
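A minimal sketch of the "what if" loop follows, under stated assumptions: a workflow is an ordered chain of processing steps that can be re-configured, re-run, and, once satisfactory, registered as a new plug-and-play service. The Workflow class, the REGISTRY, and the step names are all hypothetical, not i-Mining's actual design.

# Hypothetical i-Mining-style workflow: swap a step, re-test, then publish
# the finished model as a new service for collaborators to reuse.
REGISTRY = {}   # shared service registry (stands in for the i-Dproc fabric)

class Workflow:
    def __init__(self, steps):
        self.steps = list(steps)            # [(name, callable), ...]
    def run(self, data):
        for _, step in self.steps:
            data = step(data)
        return data
    def replace(self, name, step):          # the "what if": swap one stage
        self.steps = [(n, step if n == name else s) for n, s in self.steps]
    def publish(self, name):                # plug-and-play as a new service
        REGISTRY[name] = self.run

wf = Workflow([("normalize", lambda xs: [x / max(xs) for x in xs]),
               ("threshold", lambda xs: [x > 0.5 for x in xs])])
print(wf.run([2.0, 9.0, 6.0]))              # first guess at a model
wf.replace("threshold", lambda xs: [x > 0.8 for x in xs])
print(wf.run([2.0, 9.0, 6.0]))              # re-configured and re-tested
wf.publish("simple_detector")               # now available to collaborators
print(REGISTRY["simple_detector"]([1.0, 4.0, 2.0]))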
The i-Dproc foundation will enable i-Mining to take advantage of distributed processing components as well as distributed heterogeneous data sets. i-Mining distributed service components will comply with the interface standards defined by i-Dproc, so the system will be easily extensible with new functionality and will scale well into other environments, such as the Grid environments being explored by ITSC and others [IPG ref].
