frs linked open data concept v1.3 20101130



Background: Presentation to Ecoinformatics International Technical Collaboration Partnership International Web Meeting - Linked Open Data and Environmental Information Day 1 – December 6, 2010 Geospatial Topic – Dave Smith


Page 1: FRS Linked Open Data Concept v1.3 20101130

FRS and Linked Open Data Potential – Conceptual Discussion
v 1.3, November 30, 2010

Dave Smith
USEPA/OEI/OIC/IESD/ISSB
[email protected]
202-566-0797

Document Change History

Revision  Date        Author          Description
1.0       11/12/2010  David G. Smith  Initial Version
1.1       11/24/2010  David G. Smith  Minor updates/revisions as follow-on to 11/23 discussion
1.2       11/29/2010  David G. Smith  Collaborations, potential pilots, FOAF and other models
1.3       11/30/2010  David G. Smith  Additional collaborations and detail on facility granularity concept


FRS Data Model Initial Conceptual Discussion November 11, 2010 November 30, 2010

Contents

Document Change History
Introduction
Concept
Current Situation
Linked Open Data Issues
Data Model Issues
Linked Open Data Development
Existing Resources
Short-Term data needs
Longer-Range, Emergent data needs
Other Ongoing, Related Activities
Anticipated Next Steps

Introduction:

The intent of this concept paper is to explore some conceptual, blue-sky, no-constraints ideas for potential improvements to the FRS Linked Open Data approach being published via Data.gov, and to stimulate additional ideas and brainstorming. Follow-on work will examine alternatives, set priorities and finalize thoughts toward implementation.

Concept:

Provide enhancements to the FRS Linked Open Data approach that improve analysis, enhance facility representation, make LOD querying and analytics more robust, integrate other existing metadata capabilities, and better support Semantic Web approaches such as more-informed RDF serialization.


Current Situation:

FRS data is currently published via Data.gov, where an RDF button on the Data.gov catalog page for the FRS dataset (e.g. http://www.data.gov/raw/1030) links to an RDF version of the data.

Figure 1: Example of Current FRS RDF Offering (highlighted in red box)

The data returned is tied to a data.gov URL, e.g. http://www.data.gov/semantic/data/alpha/1030/dataset-1030.rdf.gz

Linked Open Data Issues:

Currently, FRS and other datasets published via Data.gov are being serialized as RDF to support the semantic web and linked open data. The basic problems with the Data.gov RDF described here do not apply only to the FRS RDF data; they likely apply across the board.


First, in terms of access, the data is a gzipped download: it must be downloaded and unzipped before it can be used. Ideally, Data.gov would serve the data through a SPARQL endpoint, a Sesame repository or some other means of exposing a triple store. The download/unzip paradigm does not lend itself to dynamic mashups.
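To make the access cost concrete, the following is a minimal Python sketch of what a consumer of the current gzipped offering has to do before any query can run: obtain the whole file, decompress it and parse it locally. The function name load_gzipped_rdf and the inline sample are illustrative assumptions, the properties are assumed to be un-namespaced as in the excerpt shown in this paper, and the network fetch is left out so that the example runs offline.

```python
import gzip
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def load_gzipped_rdf(raw: bytes) -> dict[str, dict[str, str]]:
    """Decompress a gzipped RDF/XML payload and return {entry id: {property: value}}."""
    root = ET.fromstring(gzip.decompress(raw))
    entries = {}
    for desc in root.iter(f"{{{RDF_NS}}}Description"):
        about = desc.get(f"{{{RDF_NS}}}about", "")
        props = {}
        for child in desc:
            # Drop any namespace prefix so keys read like the flat column names.
            name = child.tag.rsplit("}", 1)[-1]
            # Literal values are element text; links are carried in rdf:resource.
            value = child.text or child.get(f"{{{RDF_NS}}}resource", "")
            props[name] = value.strip()
        entries[about] = props
    return entries


# Stand-in for the bytes of a downloaded .rdf.gz file (illustrative subset).
SAMPLE = gzip.compress(b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="#entry9985">
    <state_code>NE</state_code>
    <registry_id>110006555085</registry_id>
  </rdf:Description>
</rdf:RDF>""")

facilities = load_gzipped_rdf(SAMPLE)
print(facilities["#entry9985"]["registry_id"])
```

Every consumer has to repeat this fetch-and-parse step for the entire dataset, even when it needs a single facility. A served triple store would let it ask for that facility alone.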

The Data.gov RDF itself appears to be a brute-force serialization of data tables into RDF. It lacks the semantic depth it would need to support analysis (see Fig. 1-3).

<rdf:Description rdf:about="#entry9985">
  <hdatum_desc>NAD83</hdatum_desc>
  <state_name>NEBRASKA</state_name>
  <latitude83>40.944623</latitude83>
  <interest_types>STATE MASTER</interest_types>
  <city_name>GARLAND</city_name>
  <create_date>01-MAR-00</create_date>
  <frs_facility_detail_report_url rdf:resource="http://iaspub.epa.gov/enviro/fii_query_detail.disp_program_facility?p_registry_id=110006555085"/>
  <congressional_dist_num>01</congressional_dist_num>
  <pgm_sys_acrnms>NE-IIS</pgm_sys_acrnms>
  <epa_region_code>07</epa_region_code>
  <country_name>USA</country_name>
  <fips_code>31159</fips_code>
  <huc_code>10200203</huc_code>
  <collect_desc>ADDRESS MATCHING-HOUSE NUMBER</collect_desc>
  <primary_name>TERRI KELLER RESIDENCE</primary_name>
  <rdf:type rdf:resource="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry"/>
  <ref_point_desc>ENTRANCE POINT OF A FACILITY OR STATION</ref_point_desc>
  <postal_code>683609338</postal_code>
  <registry_id>110006555085</registry_id>
  <location_address>1976 OLD MILL RD</location_address>
  <accuracy_value>30</accuracy_value>
  <update_date>06-AUG-01</update_date>
  <county_name>SEWARD</county_name>
  <conveyor>FRS</conveyor>
  <longitude83>-96.990306</longitude83>
  <state_code>NE</state_code>
  <site_type_name>STATIONARY</site_type_name>
</rdf:Description>

Figure 1: Sample of current Data.gov FRS RDF/XML Representation

<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#hdatum_desc> "NAD83" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#state_name> "NEBRASKA" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#latitude83> "40.944623" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#interest_types> "STATE MASTER" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#city_name> "GARLAND" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#create_date> "01-MAR-00" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#frs_facility_detail_report_url> <http://iaspub.epa.gov/enviro/fii_query_detail.disp_program_facility?p_registry_id=110006555085> .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#congressional_dist_num> "01" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#pgm_sys_acrnms> "NE-IIS" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#epa_region_code> "07" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#country_name> "USA" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#fips_code> "31159" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#huc_code> "10200203" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#collect_desc> "ADDRESS MATCHING-HOUSE NUMBER" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#primary_name> "TERRI KELLER RESIDENCE" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry> .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#ref_point_desc> "ENTRANCE POINT OF A FACILITY OR STATION" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#postal_code> "683609338" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#registry_id> "110006555085" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#location_address> "1976 OLD MILL RD" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#accuracy_value> "30" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#update_date> "06-AUG-01" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#county_name> "SEWARD" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#conveyor> "FRS" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#longitude83> "-96.990306" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#state_code> "NE" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#site_type_name> "STATIONARY" .

Figure 2: Sample of current Data.gov FRS Representation as Triples

The current RDF serialization is essentially just a brute force conversion - there is plenty of opportunity to enhance and improve.

The property names are things that some EPA users might easily understand, but would others? Are huc_code or pgm_sys_acrnms, for example, uniquely identifiable and understood outside this dataset? A reference to the EPA data dictionary, perhaps an EPA namespace, or some other means of defining them more positively is needed. We have a lot of metadata that we can bring into the mix to make the RDF data more identifiable, understandable and usable.
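One way to picture "defining them more positively" is to publish a small vocabulary in which each flat column name becomes a declared property with a human-readable label. The Python sketch below emits such declarations as N-Triples. The namespace URI is a hypothetical placeholder (no epa: namespace is defined in this paper), and the label text is a reading of the column names rather than a quotation from the EPA data dictionary.

```python
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
# Hypothetical namespace; EPA has not defined this URI.
EPA_FRS = "http://example.org/epa/frs#"

# Illustrative labels only; real text would come from the EPA data dictionary.
PROPERTY_LABELS = {
    "huc_code": "Hydrologic Unit Code",
    "pgm_sys_acrnms": "Program system acronyms",
}


def describe_properties(labels: dict[str, str]) -> list[str]:
    """Return N-Triples lines declaring each property and giving it a label.

    Labels are assumed to contain no characters that need N-Triples escaping.
    """
    lines = []
    for name, label in sorted(labels.items()):
        subject = f"<{EPA_FRS}{name}>"
        lines.append(f"{subject} <{RDF_TYPE}> <{RDF_PROPERTY}> .")
        lines.append(f'{subject} <{RDFS}label> "{label}"@en .')
    return lines


for line in describe_properties(PROPERTY_LABELS):
    print(line)
```

With declarations like these published once, an external user who meets huc_code in the data can dereference the property and find out what it means.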

There isn't really much structure or model; it is essentially a flat table. Everything is treated as an alphanumeric data type, with no temporal intelligence for dates, et cetera. Registry ID is not identified as something unique or indexable. There are many things that can and should be defined better. We can probably develop a semantic analogue to our data model in RDF/OWL or similar, and then map the data to it.
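As a rough illustration of what better typing could look like, the Python sketch below renders dates and coordinates as typed literals and uses the registry ID as the resource identifier. The facility URI base is hypothetical, the list of date and coordinate columns is taken from the sample record above, and the reading of two-digit years (values below 50 treated as 20YY) is an assumption that would need to be checked against the source system.

```python
from datetime import date
from decimal import Decimal

XSD = "http://www.w3.org/2001/XMLSchema#"
FACILITY_BASE = "http://example.org/epa/frs/facility/"  # hypothetical URI scheme
MONTHS = {m: i for i, m in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}


def parse_frs_date(text: str, pivot: int = 50) -> date:
    """Parse 'DD-MON-YY'; years below `pivot` are read as 20YY (an assumption)."""
    day, mon, yy = text.split("-")
    year = int(yy) + (2000 if int(yy) < pivot else 1900)
    return date(year, MONTHS[mon.upper()], int(day))


def typed_literal(column: str, value: str) -> str:
    """Render one flat column value as an N-Triples literal with an XSD datatype."""
    if column in ("create_date", "update_date"):
        return f'"{parse_frs_date(value).isoformat()}"^^<{XSD}date>'
    if column in ("latitude83", "longitude83"):
        Decimal(value)  # raises InvalidOperation if the text is not numeric
        return f'"{value}"^^<{XSD}decimal>'
    return f'"{value}"'  # everything else stays a plain string


def facility_uri(registry_id: str) -> str:
    """Name the resource by its registry ID rather than an opaque '#entry9985'."""
    if not registry_id.isdigit():
        raise ValueError(f"unexpected registry ID: {registry_id!r}")
    return f"<{FACILITY_BASE}{registry_id}>"


print(facility_uri("110006555085"), typed_literal("create_date", "01-MAR-00"))
```

Typed dates make range queries possible, and a stable URI per registry ID gives other datasets something to link to.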


One approach which may make more sense is to go back and look at the relational database model, which can support more richness – essentially, individual tables and their relationships would be generated as Linked Open Data, and the SPARQL queries would then have the flexibility of current SQL queries.
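A minimal Python sketch of that idea follows: each row becomes a resource named by its primary key, and each foreign key becomes a link to another resource, so a query can traverse tables much as a SQL join would. The base URI, the program_interest table and its column names are hypothetical and are not taken from the actual FRS schema.

```python
BASE = "http://example.org/epa/frs/"  # hypothetical base URI


def rows_to_triples(table, rows, key, foreign_keys=None):
    """Turn relational rows into (subject, predicate, object) triples.

    Each row becomes a resource named by its primary key; each foreign key
    becomes a link to the referenced row's URI rather than a plain string.
    """
    foreign_keys = foreign_keys or {}
    triples = []
    for row in rows:
        subject = f"{BASE}{table}/{row[key]}"
        for column, value in row.items():
            if column == key:
                continue
            predicate = f"{BASE}def/{table}#{column}"
            if column in foreign_keys:
                value = f"{BASE}{foreign_keys[column]}/{value}"
            triples.append((subject, predicate, value))
    return triples


# Illustrative rows; the program_interest table and its columns are hypothetical.
facilities = [{"registry_id": "110006555085", "state_code": "NE"}]
interests = [{"interest_id": "1", "registry_id": "110006555085",
              "pgm_sys_acrnm": "NE-IIS"}]

graph = (rows_to_triples("facility", facilities, key="registry_id")
         + rows_to_triples("program_interest", interests, key="interest_id",
                           foreign_keys={"registry_id": "facility"}))

# Follow the link the way a join would: interest -> facility -> state.
facility = next(o for s, p, o in graph if p.endswith("#registry_id"))
state = next(o for s, p, o in graph if s == facility and p.endswith("#state_code"))
print(facility, state)
```

The two generator expressions at the end stand in for a query engine. Against a real triple store the same traversal would be written as a SPARQL graph pattern.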

Regarding the properties, are there other namespaces that we could or should be leveraging in some cases? geo: is one example; our data is NAD83, however, and geo: assumes WGS84. One possibility is to reproject to WGS84 and provide geo: values to supplement what we have. Similarly, foaf: or other namespaces that deal with addresses and points of contact may apply. The RDF currently carries only locations, but FRS also holds contacts, should we at some point incorporate those as well.

In summary, I think the offering could be improved in two ways. The first is accessibility: SPARQL, et cetera, which I think Data.gov needs to look at from a services infrastructure standpoint. The second is usability: following more of a data model approach as opposed to this flat mapping, and mapping to existing namespaces and following existing models where appropriate. We should be able to leverage some of our metadata elements, data models and other artifacts toward a better representation and mapping.

Data Model Issues:

Long range, some additional tweaks to FRS data model may be needed in order to enhance data representation and better support Linked Open Data - some of these are described in brief below.

Linked Open Data Development:

Potential collaboration with:

Joshua Lieberman (OGC Geospatial Semantics SWG)

Spatial Ontology Community of Practice

Jim Hendler (RPI), George Thomas (HHS): CIO Council and Data.gov Geospatial Semantics threads

John Harman / Michael Pendleton (LOD, SRS)

Steve Young / Zach Scott / Open Gov Team (LOD)

Talis, pending contract (LOD)

TRI Program (Potential Pilot)


Kevin Kirby (Data Model)

Tom Giffen (Data Model, Business Rules)

Ken Blumberg (Business Rules)

Cindy Dickinson (Standards, Business Rules)

Others (program offices, regions, GISWG)

Existing Resources

Leverage Data Modeling work that Kevin Kirby has been working on

Drill into gist.owl and other potential resources

Short-Term data needs:

Semantic Enhancements / Linked Open Data

Improvement of capabilities for supporting Linked Open Data applications – Analysis of data structure toward supporting faceted, dimensional analyses (Figure 3)

Development of URI schemes, potentially namespaces, and means and approaches for allowing unique identification and linkage


Figure 3: Potential Facets / Dimensions for Analysis and Semantic Enhancement

Semantic Dimensions:

Explore various dimensions of facility:

Spatial
  o GML representation of absolute location (lat/long, etc.)
  o Spatial representation framework for facility (building footprints, parcel boundary, others for future)
  o Facility data modeling granularity and relationships - get a better handle on what the facility "thing" represents, and its relation to other things. For example, a parcel boundary may contain an industrial complex with manufacturing and storage buildings (differing NAICS, possibly even different companies operating and licensed/permitted), plus associated air stacks, SPCC measures, water outfalls, et cetera. When we pull up "facility" it should ultimately reflect that bigger picture for context, with the component of interest highlighted.

Temporal
  o Data currency
  o Temporal aspects to regulation, enforcement, permitting, et cetera – future

Corporate Dimension
  o Corporate ownership – at facility level and at ultimate corporate parent level

Function - Activity and Use
  o NAICS/SIC Codes
  o EPA Regulatory program
  o EPA Interest Type
  o Linkages / translation between interest type and other ontologies/vocabularies
  o Linkages to regulatory programs and other components

Interrelationships of facilities (future)

Individuals
  o Friend-of-a-friend (FOAF) and other existing RDF constructs

Many other potential enhancements
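The facility granularity item under the Spatial dimension can be sketched as a small containment model: a parcel holds one or more facilities, and each facility holds its regulated components. The Python below is only an illustration of that nesting. The class names, field names and example values (including the NAICS codes) are assumptions made for the sketch, not part of the FRS data model.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A regulated feature within a facility, e.g. an air stack or water outfall."""
    name: str
    kind: str


@dataclass
class Facility:
    """One operating unit; several may share a parcel under different operators."""
    name: str
    operator: str
    naics: str
    components: list[Component] = field(default_factory=list)


@dataclass
class Parcel:
    """The enclosing site boundary that gives the individual facilities context."""
    name: str
    facilities: list[Facility] = field(default_factory=list)

    def context_for(self, component_name: str) -> list[str]:
        """Return the path parcel > facility > component for the named component."""
        for facility in self.facilities:
            for component in facility.components:
                if component.name == component_name:
                    return [self.name, facility.name, component.name]
        raise KeyError(component_name)


# Illustrative values only.
site = Parcel("Example Industrial Park", [
    Facility("Plant A", "Company X", "325199",
             [Component("Stack 1", "air stack")]),
    Facility("Warehouse B", "Company Y", "493110",
             [Component("Outfall 001", "water outfall")]),
])

print(" > ".join(site.context_for("Outfall 001")))
```

Looking up a single component returns its full path, which is the "bigger picture for context, with the component of interest highlighted" behavior described in the list.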

Potential Pilots

A number of potential pilots for mashups can be considered. The “low hanging fruit” for OEI may be pilots that build upon known internal assets, i.e.

FRS

TRI (Toxic Release Quantities for Given Location)

SRS (Substance)

Potentially, as one scenario, one could tie TRI discharges to reaches via OW web services and TRI reported receiving waters, and then tie this to observed impacts downstream.

One caveat of using EPA data is that it is well known to EPA users, but ideally it needs to be more fully fleshed out to make it discoverable and uniquely identifiable for external users, perhaps via embedded EPA identifiers (an epa: namespace or similar means of identifying our assets).


Other potential scenarios TBD… OECA targeted enforcement vs. OSHA, or OPP vs. USDA pesticides application data.

Longer-Range, Emergent data needs:

These are not specific to LOD, but are instead emergent attributes of interest for FRS – LOD approaches may help inform on how to structure these.

HUC Codes

Completing the prepopulation of HUC Codes can support identification of facilities impacting major watersheds, e.g. Chesapeake Bay (OECA need). Other potential needs: airsheds.

Municipality

Toward improving data quality: a physical street address may carry the city name associated with its ZIP Code, which can differ from the municipality where the site actually resides. For example, Suburban Drive, State College, PA is actually in Ferguson Township, PA, so the local planning and building code officials and emergency responders who either have or need information on the facility of interest would be different from those of the municipality listed.

Relationship

Ability to relate facilities: relating individual components of a larger system of infrastructure, such as relating a gas terminal to a compressor station, where changes to one may impact others. Ability to organize information in appropriate fashions, such as relating multiple individual oil platforms with discrete permits to a lease boundary with another level of permitting.

Indian Country

More robust identification/validation of facilities which may lie within tribal boundaries – refinement of IND-3 boundaries with other source data, analysis of flows containing either tribal flag (Y/N) and/or tribal identifier (tribe/reservation name) (collaboration with Elizabeth Jackson / Ed Liu)

Facility Definition

Potential broadening of scope and use of FRS to accommodate grant award locations and other types of locations – 2005 NAPA Report recommendations for consistent agencywide site identification. May be predicated on buildout of other capabilities, such as being able to relate sites.
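For the HUC Codes item above, watershed identification can lean on the hierarchical structure of hydrologic unit codes: each additional pair of digits narrows the unit, so a facility belongs to a larger unit when its code starts with that unit's code. The Python sketch below filters facility records by HUC prefix. The first record uses values from the sample in this paper; the other records, and the choice of "1020" as the watershed of interest, are made up for illustration.

```python
def in_watershed(huc_code: str, watershed_prefixes: set[str]) -> bool:
    """True if the HUC code falls under any of the given higher-level units."""
    return any(huc_code.startswith(prefix) for prefix in watershed_prefixes)


def facilities_in_watershed(facilities, watershed_prefixes):
    """Select facility records whose huc_code nests under the watershed."""
    return [f for f in facilities
            if f.get("huc_code") and in_watershed(f["huc_code"], watershed_prefixes)]


# First record uses values from the sample in this paper; the rest are made up.
records = [
    {"registry_id": "110006555085", "huc_code": "10200203"},
    {"registry_id": "000000000000", "huc_code": "02070010"},
    {"registry_id": "000000000001", "huc_code": None},  # HUC not yet populated
]

selected = facilities_in_watershed(records, {"1020"})
print([f["registry_id"] for f in selected])
```

Records without a HUC code are silently skipped here, which is why completing the prepopulation matters: an unpopulated facility cannot be found by a watershed query at all.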


Other Ongoing, Related Activities

A number of activities, internal and external, can help to inform on direction and data model for FRS data collection and publishing activities – some of these are listed below:

Potential EPA Corporate ID Workgroup

Collaborate with TRI, TSCA, FRP, RMP and others who collect corporate parent information, as well as OECA and others who need corporate parent information to support analysis.

White House Corporate ID Workgroup

Collaborate with emergent White House Corporate ID workgroup – Beth Noveck / Steve Croley, SEC, Labor and other agencies to align, coordinate and collaborate on corporate identifiers

OpenGov

Collaboration with EPA Open Gov initiatives to inform on how best to publish data for external reuse.

National Academy of Public Administration

Follow-through on 2005 NAPA Report recommendations

Spatial Ontology Community of Practice (SOCOP)

Collaboration on vocabularies, standards and data modeling approaches

Data.Gov Data Architecture Subgroup

Collaboration on vocabularies, standards and data modeling approaches

EPA OEI/OIC/IESD Data Standards Branch

Collaboration on vocabularies, standards and data modeling approaches

Others…

Anticipated Next Steps:

TBD, develop ideas for potential pilots, engage on “LOD Cookbook” and approaches for representing and rendering our data as RDF.
