
A Provenance-Aware Virtual Sensor System Using the Open Provenance Model

Yong Liu, Joe Futrelle, James Myers, Alejandro Rodriguez, Rob Kooper
National Center for Supercomputing Applications (NCSA)

University of Illinois at Urbana-Champaign
{yongliu, futrelle, jimmyers, alejandr, kooper}@ncsa.illinois.edu

ABSTRACT

Sensor web applications such as real-time environmental decision support systems require the use of sensors from multiple heterogeneous sources for purposes beyond the scope of the original sensor design and deployment. In such cyberenvironments, provenance plays a critical role as it enables users to understand, verify, reproduce, and ascertain the quality of derived data products. Such capabilities are yet to be developed in many sensor web enablement (SWE) applications. This paper develops a provenance-aware “Virtual Sensor” system, where a new persistent live “virtual” sensor is re-published in real time after model-based computational transformations of the raw sensor data streams. We describe the underlying OPM (Open Provenance Model) APIs (Application Programming Interfaces), the architecture for provenance capture, the creation of the provenance graph, and the publishing of the provenance-aware virtual sensor, where the new virtual sensor time-series data is augmented with OPM-compliant provenance information. A case study on creating real-time provenance-aware virtual rainfall sensors is illustrated. Such a provenance-aware virtual sensor system allows digital preservation and verification of the new virtual sensors.

KEYWORDS: Sensor Web, Virtual Sensor, Open Provenance Model, Streaming Data

1. INTRODUCTION

Multi-scale remote and in-situ sensing networks are rapidly being developed and deployed to observe environmental systems at resolutions never before possible. Of the many types of sensors available, those measuring physical process parameters such as temperature, pressure, humidity, light, and sound are the most common for long-term deployments.

For example, the National Weather Service (NWS) Next Generation Weather Radar (NEXRAD) system has been operational since 1997, providing meteorological observations that have been used for weather forecasting [4]. Chemical and biological sensors are also rapidly being deployed in aquatic environments [8] that can measure dissolved oxygen, methane and total gas tension, CO2, pH, nitrate, and other nutrients. The increasing ubiquity of sensor networks and the value of their data for scientific discovery have compelled researchers [7] to envision an earth system science revolution. Sensor Web Enablement (SWE) standards, published by the Open Geospatial Consortium (OGC), have been designed to facilitate such a change by providing standardized sensor encodings and web services. However, one key barrier to this revolution is the difficulty faced by individual researchers when attempting to use sensor data from multiple sources. Such use requires that the data be transformed, interpolated, fused, or otherwise processed before it can be used for task-specific analysis or modeling. For example, the “resolution gap” described by Ram et al. [19], the difference between the spatiotemporal resolution of the sensor data and the resolution desired by the researcher, entails that the data be processed before it can be used. Managing such processing of streaming data can be difficult. Making such derived sensor data products (virtual sensors) available to other researchers, with sufficient information on their history (data provenance) to make them useful, is even more challenging. Provenance plays a critical role because it enables users to understand, verify, reproduce, and ascertain the quality of derived data products. Such capabilities and features are yet to be developed in many sensor web enablement (SWE) applications. In this paper, we describe a prototype virtual sensor system that supports the creation of virtual sensors through the application of model-based transformations to one or more sensor data sources, and the republication of virtual sensor streams with associated provenance information, in the spirit of “curation of process” [6].


To accomplish this, we use the emerging Open Provenance Model (OPM) and its implementation in Tupelo, a semantic content middleware developed at NCSA, to produce OPM-compliant provenance records that are provided along with the virtual sensor data. To the best of our knowledge, this paper is the first to apply OPM to a real-time sensor web enablement application. The rest of this paper is organized as follows: in Section 2, we review background information and related work on virtual sensors and sensor provenance. In Section 3, we explain the Tupelo OPM API and the associated OPM concepts we use for virtual sensor re-publishing. In Section 4, we describe the architecture and implementation of our sensor provenance capture system. We illustrate this through a case study of a virtual rainfall sensor using NEXRAD data in Section 5 and discuss the results in Section 6. We conclude in Section 7.

2. BACKGROUND AND RELATED WORK

There are currently various large-scale national environmental observatory initiatives [15] in the United States. Our virtual sensor research is driven by the urgent needs of individual researchers and stakeholders (e.g., various water management entities) who not only want to access sensor data products from these projects, but also want to create and share custom derived products that support their specific needs (i.e., virtual sensors). In this paper, we follow the definition presented previously in [11], which defines a virtual sensor as the product of a set of thematic, spatial, and/or temporal transformations and aggregations of raw sensor measurements. The results of such a virtual sensor can then be re-published as a new “live” sensor data stream over the web. Each instance of the virtual sensor produces a new sensor data stream based on a few deployment parameters such as spatial location, temporal frequency of the time-series data needed, and thematic interest (e.g., rainfall rate). For example, once we define a virtual rain gage sensor, we can deploy an instance of it at a desired location with a user-specified temporal frequency. A more elaborate discussion of the virtual sensor concept, including ontologies, can be found in [11]. In terms of sharing sensor data streams, Microsoft SenseWeb [9] presents a system that allows different users to register their sensors and share them on the SensorMap web site. However, SenseWeb does not allow users to create new virtual sensors, nor does it provide provenance tracking. Reddy et al. [20] present a concept called “sensor data republishing” to enable collaboration between citizen scientists, which is similar to our idea of sharing virtual sensor data streams over the web.

However, the processing methods involved are relatively lightweight calculations such as error correction, statistical estimation, or simple interpolation (such as generating a contour map from point data). By contrast, our notion of a virtual sensor is more broadly defined and includes such lightweight calculations as well as complex multiple-workflow-based computation, which increases the complexity of provenance tracking. To allow users to verify, validate, and reuse others’ virtual sensors, a provenance-aware virtual sensor capability is needed, where users can trace data from a virtual sensor stream back to the raw data sources and transformation steps that produced it. As pointed out by Wallis et al. [24], users will not be able to use available sensor data if they cannot trust it. Approaches for establishing the reliability of real sensor data cannot simply be applied to virtual sensor data, which is produced using more complex processing that is itself often implemented as scientific workflows. While data provenance has recently undergone intensive study in both database and workflow systems (see, e.g., [23], [17], [1]), technology for storing, extracting, and presenting useful provenance information for virtual sensors is still in its infancy. In addition, web-scale sharing and access of virtual sensors is needed. Given that virtual sensors can involve distributed data sources and multiple processing steps on different platforms (driven by events or user interaction), their provenance spans execution environments (usually beyond a standalone workflow environment) and must be integrated. Such distributed provenance information integration (or “mashup”) and presentation is possible with emerging standardized provenance models such as OPM [16]. Existing work such as Dozier and Frew [3] usually only considers provenance within a closed, single, homogeneous execution environment. Our contribution in this paper is our use of OPM to mash up heterogeneous sources of provenance information.

3. OPEN PROVENANCE MODEL AND TUPELO OPM API

In this section, we describe the relevant parts of the Open Provenance Model and the OPM API as implemented in the Tupelo semantic content middleware.

3.1. Tupelo

Tupelo (http://tupeloproject.org/) implements an abstract content model based on standard semantic web technologies and ideas from Content Management Systems.

Using this abstraction, Tupelo provides applications with a core set of operations for reading, writing, and querying content and metadata, as well as a framework and a set of implementations for performing these operations using existing storage systems and protocols, including file systems, relational databases, syndication protocols (such as RSS feeds), and object-oriented application data structures. Because Tupelo operations are described declaratively, they can be transformed and delegated at runtime in order to automate community and application data management policies, including assigning and resolving global persistent identifiers, tracking provenance, invoking processing codes to generate metadata and derived data products, applying rules-based inference, and matching vocabulary terms across multiple taxonomies and ontologies. RDF (Resource Description Framework) provides global identification for metadata descriptions by requiring descriptive elements to be identified with URIs (Uniform Resource Identifiers). Tupelo extends this requirement to binary data objects as well, and provides pluggable kernel modules that can apply application-specific policies for mapping between URIs and the local identification schemes used by underlying storage resources (e.g., file systems). Tupelo also provides global identifier minting services, supporting UUIDs (Universally Unique Identifiers) and any other minting algorithm that can be mapped to URIs. Using standard protocols and syntax, the content is in effect made portable across a variety of underlying implementations, allowing data and metadata from multiple applications to be mashed up, shared, and preserved [5].

3.2. OPM and OPM API

Provenance capture was previously closely tied to specific workflow frameworks, which creates interoperability challenges among different workflow systems. This motivated the development of OPM (http://twiki.ipaw.info/bin/view/Challenge/OPM), which provides an application- and domain-neutral way of describing data and process provenance. Because OPM can describe not just computational events but also real-world processes, it can be used to link together provenance descriptions produced by a variety of cooperating sub-processes and components in a typical complex scientific work environment, including “human” processes such as document editing. OPM is also described independently of its representation, and multiple compatible representations have been developed using XML (Extensible Markup Language), RDF, and relational database structures. A reference for Tupelo's OPM binding as an OWL (Web Ontology Language) ontology can be found at http://tinyurl.com/yce4mz6.

Tupelo provides a representation-neutral OPM API that allows for storing, retrieving, and querying OPM records. An implementation is provided for RDF, and the API can be extended to support XML and relational representations as well. Complete Java API documentation can be found at http://tinyurl.com/yd8wq8l. The RDF implementation works by querying an RDF triple store for assertions that match patterns derived from Tupelo’s RDF ontology for OPM. These assertions identify OPM entities and the relationships between them, as well as carrying metadata about temporal scope and other causal aspects of the execution trace. By the same token, the API can be used to make provenance assertions and publish them to any RDF triple store Tupelo supports (e.g., Jena, Sesame). Figure 1 shows how OPM represents artifacts (A1 and A2), processes (P1 and P2), causal relationships (R1, R2, and R3, showing artifacts used by and generated by processes), and the derivable “wasTriggeredBy” and “wasDerivedFrom” relations. Arcs represent causal transitions and dependencies, e.g., “wasGeneratedBy”, which associates an artifact with the process that produced it. These, together with the notions (not shown) of agents capable of controlling processes, annotations providing descriptions of artifacts and processes, and accounts and OPM graphs representing collections of OPM statements, comprise the overall OPM.

Figure 1. Open Provenance Model Elements
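As a concrete illustration of how such OPM assertions can look in RDF, the following sketch encodes one plausible reading of Figure 1 (two artifacts A1 and A2, two processes P1 and P2, and the causal relations between them) using Apache Jena, one of the triple stores mentioned above. This is not the Tupelo OPM API; the opm: namespace URI and the resource names are placeholders, since the exact vocabulary is defined only in Tupelo's OWL binding referenced above.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;

/** Sketch: the Figure 1 OPM elements (A1, A2, P1, P2) as RDF assertions in a Jena model. */
public class OpmFigure1Sketch {

    // Placeholder namespaces; the real URIs come from Tupelo's OPM OWL binding.
    private static final String OPM = "http://example.org/opm#";
    private static final String EX  = "http://example.org/run/";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        model.setNsPrefix("opm", OPM);

        Property used           = model.createProperty(OPM, "used");            // process used artifact
        Property wasGeneratedBy = model.createProperty(OPM, "wasGeneratedBy");  // artifact generated by process
        Property wasDerivedFrom = model.createProperty(OPM, "wasDerivedFrom");  // derivable relation
        Property wasTriggeredBy = model.createProperty(OPM, "wasTriggeredBy");  // derivable relation

        Resource a1 = model.createResource(EX + "A1");   // artifact A1
        Resource a2 = model.createResource(EX + "A2");   // artifact A2
        Resource p1 = model.createResource(EX + "P1");   // process P1
        Resource p2 = model.createResource(EX + "P2");   // process P2

        p1.addProperty(used, a1);                 // P1 used A1
        a2.addProperty(wasGeneratedBy, p1);       // A2 was generated by P1
        p2.addProperty(used, a2);                 // P2 used A2
        a2.addProperty(wasDerivedFrom, a1);       // derived: A2 came from A1
        p2.addProperty(wasTriggeredBy, p1);       // derived: P2 was triggered by P1

        model.write(System.out, "TURTLE");        // serialize; could also be published to a triple store
    }
}
```

Queries of the kind described above then amount to matching statement patterns (e.g., all triples whose predicate is wasGeneratedBy) against such a model.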

4. ARCHITECTURE AND IMPLEMENTATION OF PROVENANCE-AWARE VIRTUAL SENSOR CAPABILITY

In this section, we describe how we enable virtual sensor re-publishing with provenance tracking. In particular, we present the adapter architecture for transforming non-OPM-compliant metadata records into OPM-compliant records and show how we use the OPM API to construct an end-to-end OPM graph.

4.1. Provenance-Aware Virtual Sensor System

Before describing the provenance capture service, we first present the overall architecture of the prototype virtual sensor system.


As can be seen from the architectural diagram in Figure 2, our system can fetch data from remote sensor data stores (the bottom layer: Remote Sensor Stores), such as the NEXRAD Level II data on a remote FTP server, using a Java daemon service provided by our streaming data toolkit. Once such data is deposited into our local repository on the NCSA Virtual Machine cloud platform, the system’s middle layer (Data and Workflow Services) comes into play. Each new raw sensor data packet triggers a set of virtual sensor transformation workflows that compute derived virtual sensor data products and re-publish the resulting virtual sensor data streams as new “live” sensor data streams, again using the streaming data service publishing capability (details of the streaming data capabilities of our system are described in [21]). Currently, both point-based virtual sensors and polygon-based virtual sensors are supported. Point-based virtual sensors represent derived measurements at a specific latitude/longitude location. Polygon-based virtual sensors represent derived measurements over a polygon area (e.g., average rainfall rate within a city boundary). The virtual sensor abstraction and management service provides an ontology of virtual sensors as well as virtual sensor-related utility tools such as virtual sensor definition tools, OGC KML (Keyhole Markup Language) toolkits to extract polygons from an input KML file and generate new color-coded KML output based on the derived data products, an OGC SWE SOS (Sensor Observation Service) service, etc. Data, metadata, and the registry of virtual sensor definitions are all managed via Tupelo.
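The data-triggered flow through the middle layer can be summarized with a small schematic sketch. The class and interface names below (RadarPacket, VirtualSensorTransform, StreamPublisher, and so on) are hypothetical placeholders rather than the actual streaming data toolkit or workflow APIs; the sketch only shows the pattern of one raw packet fanning out to the registered virtual sensor transformations, whose outputs are re-published as new streams.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal sketch of the data-triggered virtual sensor pipeline (names are hypothetical). */
public class VirtualSensorDispatcher {

    /** A raw sensor data packet, e.g., one NEXRAD Level II scan. */
    public record RadarPacket(String sourceId, long timestampMillis, double[] reflectivityDbz) {}

    /** One derived time-series value produced by a virtual sensor instance. */
    public record Reading(String virtualSensorId, long timestampMillis, double value) {}

    /** A model-based transformation (spatial, temporal, or thematic). */
    public interface VirtualSensorTransform {
        List<Reading> apply(RadarPacket packet);
    }

    /** Re-publishes derived readings as a new "live" stream. */
    public interface StreamPublisher {
        void publish(Reading reading);
    }

    private final List<VirtualSensorTransform> transforms = new ArrayList<>();
    private final StreamPublisher publisher;

    public VirtualSensorDispatcher(StreamPublisher publisher) {
        this.publisher = publisher;
    }

    public void register(VirtualSensorTransform transform) {
        transforms.add(transform);
    }

    /** Called by the fetcher daemon whenever a new raw packet arrives. */
    public void onNewPacket(RadarPacket packet) {
        for (VirtualSensorTransform transform : transforms) {
            for (Reading reading : transform.apply(packet)) {
                publisher.publish(reading);   // each derived stream is a new virtual sensor
            }
        }
    }
}
```

In this reading, the fetcher daemon drives the dispatcher, with one registered transform per virtual sensor workflow.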

Figure 2. Architecture Diagram of the Provenance-Aware Virtual Sensor System

The user interface (Web User Interface layer) enables user definition of new virtual sensors, allowing users to click on a map-based user interface to create either point-based or polygon-based virtual sensors. For example, a mouse click on the map can add a new point-based virtual sensor and initiate recurring data-triggered execution of an associated back-end workflow to produce a new “live” virtual sensor data stream.

Users can also upload a new KML file, which triggers a new workflow based on the polygon information contained in the KML file and generates a polygon-based virtual sensor. In addition, the streaming data service publishes a web page where currently active virtual sensor streams can be viewed and explored (see Figure 5).

4.2. Provenance Capture and Conversion of Non-OPM-Compliant Records

Tupelo plays a critical role in this system, managing data and capturing provenance across the three layers. The virtual sensor transformation processing described in the previous subsection is implemented using the Cyberintegrator workflow engine [13], which records provenance in Tupelo using an ontology developed prior to the creation of OPM. To make this information available as OPM records, we implemented Tupelo’s OPM API as a query-driven layer over terms from the Cyberintegrator workflow execution trace ontology (such indirection will become unnecessary when Cyberintegrator implements OPM directly). Additional OPM provenance is generated by performing pattern-matching operations on the log files produced by the virtual sensor web application and emitting OPM RDF assertions. The resulting OPM output captures the end-to-end provenance of the virtual sensor data and makes it available in a form usable by any OPM-compliant tool. For example, as shown in Figure 3, we have used generic OPM graphing code to traverse snapshots of the virtual sensor’s OPM information and automatically produce simple graphics representing process steps and dependencies. The traversal code was generic in that it could have been used to produce graphics for any OPM graph accessible via any of Tupelo’s implementations, such as a Taverna workflow [23]. Some layout directives were added to the graph to show aspects of the execution trace that we did not represent in OPM, such as which parts of the complex execution trace belonged to a single Cyberintegrator workflow.
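A minimal sketch of the second, log-based capture path is given below. The log line layout, the regular expression, and the opm:/vs: prefixes are assumptions made purely for illustration; the paper does not specify the actual log format, and the real system emits RDF assertions through Tupelo rather than printing them.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: pattern-match web-application log lines and emit OPM-style assertions. */
public class LogToOpm {

    // Hypothetical log format: "<ISO timestamp> PUBLISH stream=<uri> from=<workflow-step-uri>"
    private static final Pattern PUBLISH =
            Pattern.compile("^(\\S+) PUBLISH stream=(\\S+) from=(\\S+)$");

    public static void main(String[] args) {
        List<String> logLines = List.of(
                "2009-11-03T14:20:00 PUBLISH stream=vs:rainfall/point/42 from=vs:workflow/spatialThematic/run-17");

        for (String line : logLines) {
            Matcher m = PUBLISH.matcher(line);
            if (!m.matches()) {
                continue;                       // ignore lines that carry no provenance
            }
            String time = m.group(1);
            String artifact = m.group(2);       // the published virtual sensor stream (OPM artifact)
            String process = m.group(3);        // the workflow step that produced it (OPM process)

            // Emit a "wasGeneratedBy" assertion plus a timestamp annotation (Turtle-like, illustrative only).
            System.out.printf("<%s> opm:wasGeneratedBy <%s> .%n", artifact, process);
            System.out.printf("<%s> opm:generatedAtTime \"%s\" .%n", artifact, time);
        }
    }
}
```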

Figure 3. OPM Provenance Mashup Architecture

Once OPM records describing virtual sensor workflows are available, they can be used to link the provenance of virtual sensor processing with prior provenance information, such as our knowledge about the origins of the sensor data consumed by that processing.


This is the sense in which OPM is “open”: it is designed to describe derivation relationships that cross the boundary between one execution context (e.g., the virtual sensor processing workflow) and another (e.g., the NEXRAD data product distribution pipeline). The resulting OPM records therefore give a more complete and descriptive picture of the nature and origin of the information we produce, and they can be shared with data consumers so that they can further augment the provenance descriptions. Figure 3 shows the architecture that allows Tupelo to capture and translate non-OPM-compliant records (either in flat-file logs or RDF triple stores) and prior sensor observation events to produce an OPM-compliant provenance graph mashup. This was implemented in the case study presented in the next section.

5. A CASE STUDY

In this section, a case study is presented with results demonstrating the provenance-aware virtual sensor system.

5.1. An Example Use Case: Point-based and Polygon-based Virtual Rainfall Sensors

We have identified several real-time rainfall-driven use cases, in which a real-time decision support system is being developed to minimize the occurrence of Combined Sewer Overflow (CSO) events in Chicago during severe rainfall seasons. While the optimization and simulation models for Chicago CSO event control are under development, a fundamental issue is already clear: there is no suitable rainfall time-series data with adequate spatial coverage and temporal frequency from on-the-ground rain gage data. NEXRAD provides a daily “rain gage”-corrected data product, but it is not available in real time. It is possible to transform the reflectivity measurements in NEXRAD Level II data to rainfall data using multi-step spatial, temporal, and thematic transformations. Further, there are different methods and algorithms in the literature for performing such transformations, and it would be useful to check which transformations were performed before selecting data for use in the subsequent simulation and optimization models. We have shown previously that our virtual sensor system is capable of supporting the creation of both point-based and polygon-based virtual rainfall sensors from the NEXRAD Level II raw data in real time (see [10], [12]). This paper presents the results of applying the methodology discussed in the previous section to construct and present a provenance mashup across the system layers for end-to-end process verification for both types of virtual sensors. Such end-to-end provenance information is crucial for users to verify and validate the derived data products. We believe this step is necessary before our virtual sensor system can be used more broadly by the community. Figure 4 shows both types of virtual rainfall sensors (point-based and polygon-based) in the web interface. The bubble point represents a virtual rain gage, while the different color-coded polygons represent different rainfall rate values for those polygons. Figure 5 shows an example of the list of currently active sensor/data streams (including some system-management-oriented event streams that record, for example, when the streaming data fetcher encounters anomalies in the raw NEXRAD data packets). We will use this use case to produce the provenance mashup in the next section for both the point-based and polygon-based virtual rainfall sensors shown in Figure 4.
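The thematic (reflectivity-to-rainfall) step of these transformations is not spelled out in the paper. As an illustration only, the sketch below applies the default WSR-88D Z-R relationship, Z = 300 R^1.4 [4], which is one of several relationships that could be used; the actual virtual sensor workflows may use a different method.

```java
/**
 * Illustrative reflectivity-to-rainfall conversion (thematic transformation step).
 * Uses the default WSR-88D Z-R relationship Z = 300 * R^1.4 [4]; the relationship
 * actually used by the virtual sensor workflows may differ.
 */
public final class ZRConversion {

    /** Convert reflectivity in dBZ to rainfall rate in mm/h. */
    public static double rainRateMmPerHour(double dbz) {
        double z = Math.pow(10.0, dbz / 10.0);      // dBZ -> linear reflectivity factor Z (mm^6/m^3)
        return Math.pow(z / 300.0, 1.0 / 1.4);      // invert Z = 300 * R^1.4
    }

    public static void main(String[] args) {
        // Example: 40 dBZ corresponds to roughly 12 mm/h under this relationship.
        System.out.printf("40 dBZ -> %.1f mm/h%n", rainRateMmPerHour(40.0));
    }
}
```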

Figure 4. Point- and Polygon-based Virtual Rainfall Sensors on the Web

Figure 5. List of Streams in the Virtual Sensor System Published on the Web Page

5.2. Provenance “Mash-up” Results

A set of connected OPM provenance graphs was produced, summarized in Figure 6. Note that there are five different OPM subgraphs at this level of description. Below we group them into three categories:


(1) The OPM graph describing the fetching of streaming data: Figure 7 shows the NEXRAD streaming data fetcher daemon fetching one NEXRAD datum and triggering two workflows (Figures 9 and 10). This process runs outside of any workflow, as a standalone daemon on the virtual machine. Note that the “observation” process represents the original radar observation, and the timestamp on the “triggered” arc that terminates at it represents the time of observation.

(2) The OPM graph describing user interaction with the web application (i.e., user generated virtual sensors): Figure 8 shows how a user-generated point-based virtual sensor is created by a mouse-click on the web user interface. This adds a new pair of latitude and longitude coordinates to the list of current point virtual sensors, which is subsequently used by the Figure 10 “Spatial and Thematic Transformation” workflow. This is another process outside of the workflow execution traces.

(3) OPM graphs describing workflow execution: There are three of these. First, Figure 9 is an OPM graph showing the execution steps and dependencies of the “Polygon Transformation” workflow, which computes the average rainfall in a set of polygons defined in the sewersheds.kml file and publishes the results as a color-coded KML stream (see Figure 4 for the visualization of this color-coded KML result). Second, Figure 10 is an OPM graph showing the “Spatial and Thematic Transformation” workflow, which computes point-based rainfall rate virtual sensors. Note that these two workflows have “triggered by” causal relationships with both the NEXRAD Fetch Daemon (Figure 7) and the User Interaction process (Figure 8). The OPM graph for the third workflow execution trace, “Temporal Transformation”, is not shown in this paper because its graph is excessively complex, but its causal dependency on the other processes is preserved: Figure 10 clearly shows that it uses output data from the “Spatial and Thematic” workflow. The temporal transformation workflow computes a temporal aggregation of rainfall accumulation over a certain time period (e.g., 20 minutes) for point-based virtual rainfall sensors.
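The temporal aggregation performed by that third workflow can be illustrated with a small sketch. The 20-minute window follows the example in the text, while the class names and the simple trailing-window integration are assumptions rather than the actual Temporal Transformation implementation.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch of a 20-minute rainfall accumulation window for one point-based virtual sensor. */
public class RainfallAccumulator {

    /** One derived rainfall-rate sample (mm/h) at a given time. */
    public record RateSample(long timestampMillis, double rateMmPerHour) {}

    private static final long WINDOW_MILLIS = 20L * 60L * 1000L;   // 20-minute aggregation period

    private final Deque<RateSample> window = new ArrayDeque<>();

    /** Add a new sample and return the accumulated rainfall depth (mm) over the trailing window. */
    public double add(RateSample sample) {
        window.addLast(sample);
        while (!window.isEmpty()
                && sample.timestampMillis() - window.peekFirst().timestampMillis() > WINDOW_MILLIS) {
            window.removeFirst();                                  // drop samples older than the window
        }
        // Integrate rate over time: sum of (rate * sample spacing), assuming samples arrive in order.
        double depthMm = 0.0;
        RateSample previous = null;
        for (RateSample s : window) {
            if (previous != null) {
                double hours = (s.timestampMillis() - previous.timestampMillis()) / 3_600_000.0;
                depthMm += previous.rateMmPerHour() * hours;
            }
            previous = s;
        }
        return depthMm;
    }
}
```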

The overall provenance graph mash-up can be seen in Figure 6, which shows the five processes and their causal relationships with minimal detail on each process. We note that Figure 6 is itself a valid OPM graph, showing how OPM’s support for descriptions at multiple levels of granularity allows details to be exposed or hidden as desired by the users.

This is a very useful feature, since in complex distributed sensor web systems many levels of granularity and specificity are required when validating and verifying data and processing steps.

Figure 6. Overall Virtual Sensor OPM Provenance Graph Mashup Result with Minimum Details on Individual Processes

Figure 7. OPM Graph with Details on NEXRAD Data Fetcher Daemon Process

Figure 8. OPM Graph with Details on User Interaction Process


Figure 9. OPM Graph with Details on Polygon Transformation Process for Polygon-based Virtual Rainfall Sensor


Figure 10. OPM Graph with Details on Spatial and Thematic Transformation Process for Point-based Virtual Rainfall Rate Sensor


6. DISCUSSION

In this section we offer some discussion concerning our system and the relevance of provenance-aware virtual sensors to the broader sensor web community. The virtual sensor system described in this paper is typical of medium-to-large-scale scientific work processes such as sensor web applications, which entail multiple, heterogeneous, loosely coordinated processing stages. Appropriately generic models of data and process provenance such as OPM raise representational questions about how cooperating processes that interact with one another can also produce data that “cooperates”, i.e., that can be linked together to form aggregate knowledge about the origin and significance of data products. A number of these concerns are raised in this and other current work:

(1) Granularity. What counts as a significant stage in a complex work process? No consensus has been reached in the broader scientific provenance research community, largely because different applications have different requirements for the use of provenance information (e.g., anomaly detection, regulatory compliance, etc.).

(2) Identification. How are significant steps and data artifacts identified? In open models, how do we resolve identifiers produced by multiple, uncoordinated processes?

(3) Global consistency. Are models like OPM sufficient for reconciling, or determining that it is impossible to reconcile, the execution traces and other observations that have been combined from multiple, loosely-coordinated processes, with respect to causality?

In developing the example shown in Figures 6-10, we encountered and had to provisionally resolve several of these issues. For example, log file entries often did not directly identify data artifacts, but rather identified related ancillary information, such as workflow step parameters whose values identified the streams carrying those artifacts. In terms of granularity, Cyberintegrator presented the same issue as other workflow systems: it treats processing steps as “black boxes” which may contain causally relevant information (such as parameters) that is “hard-coded” into the workflow step implementation and is therefore not recorded by its execution tracing code. As a result, we had to connect the output of the “Spatial and Thematic” workflow to the input of the “Temporal Transformation” workflow manually, based on our knowledge of the contents of such a “black box”. In terms of global consistency, we made several assumptions about timing that cannot be made in the general case. In one case we assumed that timestamps in a log file were in the CST (Central Standard Time) time zone, because they contained no time zone information.
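This kind of time-zone assumption has to be made explicit when the log-derived assertions are merged with provenance from other execution environments, for example as in the following sketch (the log timestamp layout shown is hypothetical).

```java
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Sketch: attach an assumed time zone to zone-less log timestamps before merging provenance. */
public class LogTimestamps {

    // Hypothetical log timestamp layout; the real log format is not given in the paper.
    private static final DateTimeFormatter LOG_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Assumption made in the case study: zone-less timestamps are CST (fixed UTC-6 offset).
    private static final ZoneOffset ASSUMED_OFFSET = ZoneOffset.ofHours(-6);

    public static OffsetDateTime parse(String logTimestamp) {
        return LocalDateTime.parse(logTimestamp, LOG_FORMAT).atOffset(ASSUMED_OFFSET);
    }

    public static void main(String[] args) {
        // The resulting instant can be compared with timestamps from other execution environments.
        System.out.println(parse("2009-11-03 14:20:00").toInstant());
    }
}
```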

Finally, we assumed that all clocks were synchronized and, based on that, deduced that each temporal transformation (only one is shown) was causally dependent on all prior spatial and thematic transformations. In practice, at scale, many such considerations and inferences are required, and for this reason OPM has been developed as a model of observations of provenance rather than as a model of some posited underlying causal reality. One noticeable remaining challenge is the large volume of provenance metadata about the transformations that results from the continuous processing of the sensor data stream [14], which makes both scalable storage and efficient querying harder to realize. However, this issue is beyond the scope of this paper and will be discussed in our future work.

7. CONCLUSIONS

In this paper, we presented the methodology and results of using the Tupelo OPM API to produce a provenance mashup across multiple layers of the virtual sensor system, an approach that is also applicable to other sensor web applications. While the processes involved in our virtual sensor application are heterogeneous (comprising workflows, Java daemons, and user interaction), we successfully demonstrated the feasibility of constructing an “open” provenance graph for both the point-based and polygon-based virtual rainfall sensor case studies. To the best of our knowledge, this paper is the first to bring the OPM community and the sensor web application community together around a real use case study, demonstrating the applicability and feasibility of using OPM for sensor web applications. Future work will include implementing OPM-compliant provenance capture across all layers of our virtual sensor system, as well as making provenance information accessible and usable to application end-users. Performance evaluation will also be carried out in the future. The lessons learned from our study are also valuable for other sensor web applications, as provenance becomes increasingly ubiquitous and necessary [18].

ACKNOWLEDGEMENTS

The authors would like to acknowledge the Office of Naval Research, which supports this work as part of the Technology Research, Education, and Commercialization Center (TRECC) (Research Grant N00014-04-1-0437), and the TRECC team working on the Digital Synthesis Framework for Virtual Observatories. We also thank the UIUC/NCSA Adaptive Environmental Sensing and Information Systems initiative for partially supporting this project. Finally, Sam Cornwell is thanked for his assistance with the implementation, and Kathleen Ricker for her professional editing and proofreading of this paper.


REFERENCES

[1] S. Bowers, McPhillips, T. M., Riddle, S., Anand, M. K., and Ludascher, B., “Kepler/pPOD: Scientific Workflow and Provenance Support for Assembling the Tree of Life”, in Proceedings of the 2nd International Provenance and Annotation Workshop, J. Freire, D. Koop, and L. Moreau (Eds.): IPAW 2008, LNCS 5272, pp. 70–77, 2008.

[2] R. Cifelli, N. Doesken, P. Kennedy, L. D. Carey, S. A. Rutledge, C. Gimmestad, and T. Depue, "In box: The Community Collaborative rain, hail, and snow network," Bull. Am. Meteorol. Soc., vol. 86, pp. 1069-1077, 2005.

[3] J. Dozier, and J. Frew, “Computational provenance in hydrologic science: A snow mapping example”, Philosophical Transactions of the Royal Society A, 367, 1021-1033, doi: 10.1098/rsta.2008.0187, 2009

[4] R. A. Fulton, Breidenbach, J. P., Seo, D.-J., Miller, D. A., and O’Bannon, T. “The WSR-88D rainfall algorithm”, Weather and Forecasting, June 1998, 377-395.

[5] J. Futrelle, J. Gaynor, J. Plutchak, J. Myers, R. McGrath, P. Bajcsy, J. Kastner, K. Kotwani, J. S. Lee, L. Marini, R. Kooper, T. McLaren, and Y. Liu, “Semantic Middleware for E-science Knowledge Spaces”, 7th International Workshop on Middleware for Grids, Clouds and e-Science - MGC 2009, Urbana, Illinois, USA, December 1, 2009.

[6] C. Goble and D. De Roure, “Curating Scientific Web Services and Workflows”, Educause Review, 43(5), 10-11, 2008.

[7] J. K. Hart and K. Martinez, "Environmental Sensor Networks: A revolution in the earth system science?", Earth Sci. Rev., vol. 78, pp. 177-191, 2006.

[8] K. S. Johnson, Needoba, J. A., Riser, S. C., and Showers, W. J. “Chemical Sensor Networks for the Aquatic Environment”. Chemical Reviews, 107(2):623 – 640, 2007

[9] A. Kansal, S. Nath, J. Liu, and F. Zhao, "SenseWeb: An Infrastructure for Shared Sensing," IEEE Multimedia. Vol. 14, No. 4, pp. 8-13, 2007.

[10] Y. Liu, D. J. Hill, A. Rodriguez, L. Marini, R. Kooper, J. Futrelle, B. Minsker, J. D. Myers. “Near-Real-Time Precipitation Virtual Sensor based on NEXRAD Data”, ACM GIS 08, November 5-7, 2008, Irvine, CA, USA

[11] Y. Liu, Hill, D., Rodriguez, A., Marini, L., Kooper, R., Myers, J., Wu, X., and Minsker, B., "A new framework for on-demand virtualization, repurposing and fusion of heterogeneous sensors", in Proceedings of the 2009 International Symposium on Collaborative Technologies and Systems, IEEE, DOI: http://dx.doi.org/10.1109/CTS.2009.5067462

[12] Y. Liu, D. Hill, L. Marini, R. Kooper, A. Rodriguez, and J. Myers, "Web 2.0 Geospatial Visual Analytics for Improved Urban Flooding Situational Awareness and Assessment", ACM GIS '09, November 4-6, 2009, Seattle, WA, USA.

[13] L. Marini, R. Kooper, P. Bajcsy, J. Myers, “Supporting exploration and collaboration in scientific workflow systems”, in AGU, Fall Meet. Suppl., Abstract IN31C-07. 2007: San Francisco, CA. USA

[14] A. Misra, Blount, M., Kementsietsidis, A., Sow, D. M., and Wang, M., “Advances and challenges for scalable provenance in stream processing systems”, in Proceedings of the 2nd International Provenance and Annotation Workshop, LNCS 5272, pp. 253-265, 2008.

[15] J. L. Montgomery, T. Harmon, W. Kaiser, A. Sanderson, C. N. Haas, R. Hooper, B. Minsker, J. Schnoor, N. L. Clesceri, W. Graham, and P. Brezonik, "The waters network: An integrated environmental observatory network for water research," Environ. Sci. Technol., vol. 41, pp. 6642-6647, 2007.

[16] L. Moreau, Freire, J., Futrelle, J., McGrath, R. E., Myers, J., and Paulson, P., “The Open Provenance Model: an Overview”, in Proceedings of the 2nd International Provenance and Annotation Workshop, LNCS 5272, pp. 323–326, 2008.

[17] L. Moreau, Groth, P., Miles, S., Vazquez-Salceda, J., Ibbotson, J., Jiang, S., Munroe, S., Rana, O., Schreiber, A., Tan, V., and Varga, L., “The provenance of electronic data”, Commun. ACM, 51(4), 52-58, 2008.

[18] L. Moreau, “The Foundations for Provenance on the Web”, Foundations and Trends in Web Science. Available: http://eprints.ecs.soton.ac.uk/18176/1/psurvey.pdf

[19] S. Ram, Khatri, V., Hwang, Y., and Yool, S. R., “Semantic modeling and decision support in hydrology”, Photogrammetric Engineering & Remote Sensing, 66(10), 1229-1239, 2000.

[20] S. Reddy, Chen, G., Fulkerson, B., Kim, S. J., Park, U., Yau, N., Cho, J., Heidemann, J., and Hansen, M., “Sensor-Internet Share and Search--Enabling Collaboration of Citizen Scientists”, in Proceedings of the ACM Workshop on Data Sharing and Interoperability on the World-wide Sensor Web, pp. 11-16, Cambridge, Mass., USA, ACM, April 2007.

[21] A. Rodriguez, R. E. McGrath, Y. Liu and J. D. Myers, "Semantic Management of Streaming Data", 2nd International Workshop on Semantic Sensor Networks at the International Semantic Web Conference, Washington, DC, October 25-29, 2009.

[22] S. S. Sahoo, Sheth, A., and Henson, C., “Semantic provenance for eScience: Managing the deluge of scientific data”, IEEE Internet Comput., 12(4), 46-54, 2008.

[23] J. Sroka, J. Hidders, P. Missier, and C. Goble, “A formal semantics for the Taverna 2 workflow model”, J. Comput. System Sci. (2010), DOI: 10.1016/j.jcss.2009.11.009

[24] J. C. Wallis, Borgman, C. L., Mayernik, M. S., Pepe, A., Ramanathan, N., and Hansen, M., “Know thy sensor: Trust, data quality, and data integrity in scientific digital libraries”, Lect. Notes Comput. Sci., 4675 LNCS, 380-391, 2007.
