

Page 1

BADC, BODC, CCLRC, PML and SOC

Distributed Data, Distributed Governance, Distributed Vocabularies, with a dash of CLADDIER: The NERC DataGrid

Bryan Lawrence

(on behalf of a big team, and note also a substantial piece of work with specific authorship included herein)

Page 2

UKOLN, July 2006

Outline

• Introduction to the BADC
• Motivation
• Standards
  – Feature Types
• Taxonomy
• Overall Architecture
• NDG Products
  – Discovery Portal
  – Data Extractor
  – MOLES (NumSim relationship with NMM)
  – CSML
• CSML
  – Description
  – Prototyping in MarineXML
  – Round-Tripping
• Vocabulary Issues in NDG (Hughes, Kondapalli, Lowry)
• NDG Timeline

Page 3

BADC Role

The BADC role is to assist UK researchers to locate, access and interpret atmospheric data and to ensure the long-term integrity of atmospheric data produced by NERC projects.
– Facilitation and Curation/Preservation!

Page 4

BADC Data Holdings

• A BADC dataset is an aggregation of data files, documents and metadata sharing common administrative policies. These policies could be file validation, access control or retention schemes.

• Datasets vary from TBs in millions of files to a few MBs in a single file.

• There are presently over 100 datasets.

Page 5

BADC User examples

• Atmospheric chemistry models.

• Pollution chemistry measurement campaigns.

Page 6

User examples

• Bird feeding habits.

Page 7

User examples

• Radio communication modelling.

• Wind power research.

• A & E influenza cases.

Page 8

User examples

• Castle mortar decay.

• Discomfort indices.

Page 9

User Diversity

Registered Users by discipline (pie chart)
Legend: Atmospheric Physics, Atmospheric Chemistry, Other, Earth Science, Geography, Engineering, Marine Science, Terrestrial and Fresh Water, Medical/Biological Sciences, Personal use, Earth Observation, Maths/Computing Sciences, Polar Science, Economics, Not specified
Values: 28%, 12%, 7%, 8%, 8%, 7%, 7%, 8%, 7%, 0%, 3%, 2%, 2%, 1%, 0%

BADC User Origin (qualification 20+ users) (chart)
Legend: USA, Germany, France, Japan, China, Other, Italy, India, Canada, Australia, Russia, Netherlands, Spain, Poland, Switzerland

Total Users by institute type (pie chart)
Legend: School, Commercial, Other, NERC, Unknown, Government, University
Values: 1%, 1%, 5%, 5%, 7%, 13%, 68%

These stats are quite old now, but the distributions haven’t changed.

Page 10

Climate in 20010 – A graphic Illustration

Figures from Gary Strand, NCAR, ESG website

March 2006, 2.5 PB

Typically, two-thirds of this data will never see the light of day: why?

No one can remember what it was, or, if they can remember that, where it is!

Page 11

http://www.realclimate.org/index.php?p=121

Data as Evidence

http://www.uoguelph.ca/~rmckitri/research/trcback.html

What McIntyre got right:

Page 12

Data Retention Policies

University of Cambridge Research Division: Data generated in the course of research should be kept securely in paper or electronic format, as appropriate. Back-up records should always be kept for data stored on a computer. The [AMRC] considers a minimum of ten years to be an appropriate period. However, research based on clinical samples or relating to public health may require longer storage to allow for long-term follow-up to occur. [AMRC: Association of Medical Research Charities]

University of Oxford Research Services Office: A successful laboratory notebook allows for ready verification of quality and integrity of research data and enables another investigator to reproduce the procedure which has been documented and get the same result. …

Natural Environment Research Council: … Scientists will frequently process the data they have collected selectively, or with specific application packages, in order to prepare material for publication in the scientific literature. But the full value of the data collected may only be realised if the entire dataset is subjected to generic processing (eg to ensure calibration and adequate quality control) and is sufficiently documented to allow others to re-use it at a later date. The original collector may be the only person in a position to undertake such work, and so to unlock the full potential of the data. Those holding data collected under NERC funding will be expected to cooperate in validating and publishing them in their entirety - when this can be justified in terms of their scientific value - rather than merely creaming off a subset for immediate publication in the literature. …

Page 13

http://ndg.nerc.ac.uk

British Atmospheric Data Centre

British Oceanographic Data Centre

Complexity + Volume + Remote Access = Grid Challenge

NCAR

Page 14

NDG Assumptions

1. No one would change their data storage systems!
2. Need to support a wide range of “metadata-maturity”!
3. No NDG-wide user management system possible.
  • It is illegal to share user information without each and every user agreeing …
  • implies no way of having one virtual organisation with common user management!
  • With a large enough group it is impossible to agree on common roles that could be associated with access control.
  • … but we want single sign-on … and trust relationships between data providers …
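
One way to picture how these constraints could fit together is for each provider to keep its own users and roles, with a trust relationship expressed as a table that maps another provider's roles onto local ones. The Python sketch below illustrates that idea only; it is not the NDG security implementation, and the `Provider` class, the `local_roles` method and all role names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """A data provider that manages its own users and roles (hypothetical model)."""
    name: str
    # trusted[issuer][remote_role] -> local role granted by this provider
    trusted: dict[str, dict[str, str]] = field(default_factory=dict)

    def local_roles(self, issuer: str, remote_roles: set[str]) -> set[str]:
        """Map roles asserted by a trusted issuer onto local roles.

        Roles from an untrusted issuer, or roles with no mapping, grant nothing.
        """
        mapping = self.trusted.get(issuer, {})
        return {mapping[r] for r in remote_roles if r in mapping}


# Role names are invented for illustration.
bodc = Provider("BODC", trusted={"BADC": {"academic": "bodc-researcher"}})
print(bodc.local_roles("BADC", {"academic", "staff"}))
```

No user records are exchanged in this sketch: only role names cross the boundary, and each provider decides bilaterally what a remote role is worth locally.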

Page 15

Integration – semantics

• Want interdisciplinary semantic access to information, not abstract data
  – getData(potential temperature from ERA-40 dataset in North Atlantic from 1990 to 2000)
  – not: getData("era40.nc", 'PTMP', 20:50, 300:340, 190:200)
  – or even worse:
      for j=1990:2000
        getData("era40_"+j+".nc", 'PTMP', 20:50, 300:340)
• Lossy is OK!
  – Care less about completeness of representation than semantic unification
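
The contrast above can be sketched in Python: a semantic request is resolved through a catalogue into the file-level reads that the user should not have to write by hand. This is a minimal illustration, not an NDG API. The file pattern, the 'PTMP' code and the index ranges are taken from the slide; the `CATALOGUE` and `REGIONS` tables, the `resolve` function and the reading of `20:50` as an inclusive range are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRequest:
    """One low-level read: file name, variable code and index ranges."""
    filename: str
    variable: str
    lat_index: range
    lon_index: range


# Hypothetical catalogue: (dataset, phenomenon) -> (file pattern, variable code).
CATALOGUE = {
    ("ERA-40", "potential temperature"): ("era40_{year}.nc", "PTMP"),
}
# Hypothetical region table; slide ranges 20:50 and 300:340 taken as inclusive.
REGIONS = {"North Atlantic": (range(20, 51), range(300, 341))}


def resolve(dataset: str, phenomenon: str, region: str,
            first_year: int, last_year: int) -> list[FileRequest]:
    """Translate a semantic request into one FileRequest per yearly file."""
    pattern, variable = CATALOGUE[(dataset, phenomenon)]
    lat, lon = REGIONS[region]
    return [FileRequest(pattern.format(year=y), variable, lat, lon)
            for y in range(first_year, last_year + 1)]


requests = resolve("ERA-40", "potential temperature", "North Atlantic", 1990, 2000)
print(len(requests), requests[0].filename)
```

The user-facing call names a phenomenon, a dataset, a region and a period; file names and variable codes stay inside the catalogue.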

Page 16

Standards

• ISO 19101: Geographic information – Reference model

A geospatial dataset…
…consists of features and related objects…
…in a defined logical structure…
…delivered through services…
…and described by metadata.

Page 17

Standards

• Geographic ‘features’
  – “abstraction of real world phenomena” [ISO 19101]
  – Type or instance
  – Encapsulate important semantics in universe of discourse
  – “Something you can name”
• Application schema
  – Defines semantic content and logical structure
  – ISO standards provide toolkit:
    • spatial/temporal referencing
    • geometry (1-, 2-, 3-D)
    • topology
    • dictionaries (phenomena, units, etc.)
  – GML – canonical encoding

[from ISO 19109 “Geographic information – Rules for Application Schema”]

Page 18

Architecture: NDG Metadata Taxonomy

… not one schema, not one solution!

CSML

NCML+CF

MOLES

THREDDS

DIF -> ISO19115

CLADDIER

Page 19

Architecture:

Deployment

Data Providers

NDG Core Services

Users

NDG GUI Interface(s)

Vocab Services

Page 20

Architecture:

Deployment

NDG Core Services

Users

NDG GUI Interface(s)

Vocab Services

Page 21

Architecture:

Deployment

Users

NDG GUI Interface(s)

Vocab Services

Page 22

Architecture:

Deployment

Users

Vocab Services

Page 23

Current Status

Page 24

MOLES: implementation

Core linking concept is the deployment

A Deployment is made on behalf of an Activity, of a Data Production Tool at an Observation Station, and produces a Data Entity.

[Diagram: Deployment at the centre, linked to Activity, Data Production Tool, Observation Station and Data Entity]

Each of the main metadata objects has security data attached to it, so access control can be applied to queries on the metadata.

The Deployment links the metadata records into a structure that can be navigated.
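
The linking role of the Deployment can be sketched as a small data model in Python. This is an illustration of the relationships named on the slide, not the MOLES schema: the class layout, the single `required_role` field standing in for "security data", and the example record names are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Record:
    """A main metadata object with its attached security data (simplified)."""
    name: str
    required_role: Optional[str] = None  # None means publicly visible


@dataclass(frozen=True)
class Deployment:
    """Links an Activity, a Data Production Tool, an Observation Station
    and the Data Entity that the deployment produced."""
    activity: Record
    tool: Record
    station: Record
    data_entity: Record

    def visible_records(self, roles: set[str]) -> list[Record]:
        """Return the linked records that a user holding `roles` may see."""
        linked = [self.activity, self.tool, self.station, self.data_entity]
        return [r for r in linked
                if r.required_role is None or r.required_role in roles]


# Example records are invented for illustration.
dep = Deployment(
    activity=Record("Example campaign"),
    tool=Record("Example wind profiler"),
    station=Record("Example field site"),
    data_entity=Record("Example profiler dataset", required_role="campaign-member"),
)
print([r.name for r in dep.visible_records(set())])
```

Because every record carries its own security data, a metadata query can filter per record while the Deployment still provides the navigation path between them.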

Page 25

Simulators as data production tools: NumSim

NDG Products: NumSim

Page 26

NumSim Example

NumSim Example

Page 27

International Discovery - Climate

Page 28

NDG “Pseudo-Demo”

EXPLOITING DISCOVERY WEB

SERVICE

(running interface on my laptop last night)

Page 29

More Browse

Scrolling Down

Page 30

MOLES Navigation

Actually, this is where we plan to use NumSim and Numerical Model Metadata

Page 31

MOLES to Secure Dx

Page 32

NDG Authentication

Offering up trusted host list …

Page 33

Data Extractor

Page 34

NDG-A: Climate Science Modelling Language

• Aims:
  – provide semantic integration mechanism for NDG data
  – explore new standards-based interoperability framework
  – emphasise content, not container
• Design principles:
  – offload semantics onto parameter type (‘phenomenon’, observable, measurand)
    • e.g. wind-profiler, balloon temperature sounding
  – offload semantics onto CRS
    • e.g. scanning radar, sounding radar
  – ‘sensible plotting’ as discriminant
    • ‘in-principle’ unsupervised portrayal
  – explicitly aim for small number of weakly-typed features (in accordance with governance principle and NDG remit)

Page 35

Climate Science Modelling Language

• CSML feature types
  – defined on basis of geometric and topologic structure

CSML feature type | Description | Examples
TrajectoryFeature | Discrete path in time and space of a platform or instrument. | ship’s cruise track, aircraft’s flight path
PointFeature | Single point measurement. | raingauge measurement
ProfileFeature | Single ‘profile’ of some parameter along a directed line in space. | wind sounding, XBT, CTD, radiosonde
GridFeature | Single time-snapshot of a gridded field. | gridded analysis field
PointSeriesFeature | Series of single datum measurements. | tidegauge, rainfall timeseries
ProfileSeriesFeature | Series of profile-type measurements. | vertical or scanning radar, shipborne ADCP, thermistor chain timeseries
GridSeriesFeature | Timeseries of gridded parameter fields. | numerical weather prediction model, ocean general circulation model
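
The table has a regular shape: three geometries (point, profile, grid), each with a single and a series variant, plus the trajectory. A small Python sketch makes that regularity explicit. The `Geometry` enum and the `csml_feature_type` function are hypothetical helpers for illustration and are not part of CSML itself.

```python
from enum import Enum


class Geometry(Enum):
    POINT = "point"
    PROFILE = "profile"
    GRID = "grid"
    TRAJECTORY = "trajectory"


def csml_feature_type(geometry: Geometry, is_series: bool = False) -> str:
    """Pick the CSML feature type name for a geometry, following the table."""
    if geometry is Geometry.TRAJECTORY:
        # The table lists no series variant of a trajectory.
        if is_series:
            raise ValueError("no series variant of TrajectoryFeature is listed")
        return "TrajectoryFeature"
    base = geometry.value.capitalize()  # "Point", "Profile" or "Grid"
    return f"{base}{'Series' if is_series else ''}Feature"


print(csml_feature_type(Geometry.GRID, is_series=True))
```

Keeping the choice down to a geometry and a series flag reflects the stated aim of a small number of weakly-typed features.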

Page 36

Climate Science Modelling Language

• CSML feature types– examples...

ProfileSeriesFeature

ProfileFeature

GridFeature

Page 37

Climate Science Modelling Language

• Numerical array descriptors
  – provides ‘wrapper’ architecture for legacy data files
  – ‘Connected’ to data model numerical content through ‘xlink:href’
• Three subtypes:
  – InlineArray
  – ArrayGenerator
  – FileExtract (NASAAmes, NetCDF, GRIB)
• Composite design pattern for aggregation

[UML class diagram; classes and attributes as labelled]
«Type» GML::AbstractGMLType: +id, +metaDataProperty, +description, +name
«Type» AbstractArrayDescriptor: +arraySize[1], +uom[0..1], +numericType[0..1], +numericTransform[0..1], +regExpTransform[0..1]
«Type» AggregatedArray: +aggType[1], +aggIndex[1]; association +component (1 to *)
«Type» InlineArray: +values[*]
«Type» ArrayGenerator: +expression[1]
«Type» AbstractFileExtract: +fileName[1]
«Type» NASAAmesExtract: +variableName[1], +index[0..1]
«Type» NetCDFExtract: +variableName[1]
«Type» GRIBExtract: +parameterCode[1], +recordNumber[0..1], +fileOffset[0..1]
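
The composite pattern named above can be sketched in Python: a leaf descriptor holds values, and an aggregate holds components that may themselves be aggregates, with both exposing the same `read()` operation. This is a simplified illustration. The class names follow the diagram, but the `read()` method and the choice of plain concatenation as the aggregation rule are assumptions; the slide does not define the meaning of `aggType` or `aggIndex`.

```python
from dataclasses import dataclass
from typing import Protocol


class ArrayDescriptor(Protocol):
    """Anything that can deliver its numerical content as a flat list."""
    def read(self) -> list[float]: ...


@dataclass
class InlineArray:
    """Leaf descriptor: the values are stored in the document itself."""
    values: list[float]

    def read(self) -> list[float]:
        return list(self.values)


@dataclass
class AggregatedArray:
    """Composite descriptor: delegates to its components, in order."""
    components: list[ArrayDescriptor]

    def read(self) -> list[float]:
        out: list[float] = []
        for component in self.components:
            out.extend(component.read())  # a component may itself be an aggregate
        return out


nested = AggregatedArray([InlineArray([1.0, 2.0]),
                          AggregatedArray([InlineArray([3.0])])])
print(nested.read())
```

Client code calls `read()` without knowing whether it holds a leaf or an aggregate, which is what lets file extracts, generators and inline arrays be mixed freely.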

Page 38

Climate Science Modelling Language

• Inline array

<NDGInlineArray>
  <arraySize>5 2</arraySize>
  <uom>udunits.xml#degreeC</uom>
  <numericType>float</numericType>
  <regExpTransform>s/10/9/ge</regExpTransform>
  <numericTransform>+5</numericTransform>
  <values>1 2 3 4 5 6 7 8 9 10</values>
</NDGInlineArray>

• Array generator

<NDGArrayGenerator>
  <arraySize>10001</arraySize>
  <uom>udunits.xml#minute</uom>
  <numericType>float</numericType>
  <expression>0:5:50000</expression>
</NDGArrayGenerator>
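
As a check on how the array generator might be read, the Python sketch below expands the expression and compares the result with the declared `arraySize`. It assumes the expression means start:step:stop with an inclusive stop and integer parts, which is consistent with 0:5:50000 yielding 10001 values; the `expand` function is a hypothetical helper, not an NDG tool.

```python
import xml.etree.ElementTree as ET

XML = """<NDGArrayGenerator>
  <arraySize>10001</arraySize>
  <uom>udunits.xml#minute</uom>
  <numericType>float</numericType>
  <expression>0:5:50000</expression>
</NDGArrayGenerator>"""


def expand(xml_text: str) -> list[float]:
    """Expand an NDGArrayGenerator into its values and check arraySize."""
    node = ET.fromstring(xml_text)
    # Assumed syntax: start:step:stop, with stop included in the result.
    start, step, stop = (int(p) for p in node.findtext("expression").split(":"))
    values = [float(v) for v in range(start, stop + 1, step)]
    declared = int(node.findtext("arraySize"))
    if len(values) != declared:
        raise ValueError(f"expected {declared} values, generated {len(values)}")
    return values


values = expand(XML)
print(len(values), values[:3], values[-1])
```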

Page 39

Climate Science Modelling Language

File extract

<NDGNASAAmesExtract>
  <arraySize>526</arraySize>
  <numericType>double</numericType>
  <fileName>/data/BADC/macehead/mh960606.cf1</fileName>
  <variableName>CFC-12</variableName>
</NDGNASAAmesExtract>

<NDGNetCDFExtract gml:id="feat04azimuth">
  <arraySize>10000</arraySize>
  <fileName>radar_data.nc</fileName>
  <variableName>az</variableName>
</NDGNetCDFExtract>

<NDGGRIBExtract>
  <arraySize>320 160</arraySize>
  <numericType>double</numericType>
  <fileName>/e40/ggas1992010100rsn.grb</fileName>
  <parameterCode>203</parameterCode>
  <recordNumber>5</recordNumber>
  <fileOffset>289412</fileOffset>
</NDGGRIBExtract>
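
A reader for these descriptors would first have to work out which file to open and how to pick the array out of it. The Python sketch below does only that first step, using the standard library XML parser. The `SELECTORS` table and the `describe` function are hypothetical, and the sketch stops short of opening any data file. The GRIB example is used because it carries no namespace prefix; the `gml:id` attribute on the NetCDF example would need a namespace declaration before a parser accepts it.

```python
import xml.etree.ElementTree as ET

# Descriptor tag -> child element that selects the array inside the file.
SELECTORS = {
    "NDGNASAAmesExtract": "variableName",
    "NDGNetCDFExtract": "variableName",
    "NDGGRIBExtract": "parameterCode",
}

GRIB = """<NDGGRIBExtract>
  <arraySize>320 160</arraySize>
  <numericType>double</numericType>
  <fileName>/e40/ggas1992010100rsn.grb</fileName>
  <parameterCode>203</parameterCode>
  <recordNumber>5</recordNumber>
  <fileOffset>289412</fileOffset>
</NDGGRIBExtract>"""


def describe(xml_text: str) -> dict:
    """Summarise a file-extract descriptor: format, file, selector and shape."""
    node = ET.fromstring(xml_text)
    selector = SELECTORS[node.tag]  # KeyError for an unknown descriptor type
    shape = tuple(int(n) for n in node.findtext("arraySize").split())
    return {
        "format": node.tag,
        "file": node.findtext("fileName"),
        "select": node.findtext(selector),
        "shape": shape,
    }


print(describe(GRIB))
```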

Page 40

MarineXML Testbed

Slide adapted from Kieran Millard (AUKEGGS, 2005)

[Diagram labels: XML Parser; SeeMyDENC; Data Dictionary; S52 Portrayal Library; SENC; MarineGML (NDG) Feature Types; sources: Biological Species, Chl-a from Satellite, Modelled Hydrodynamics, Measured Hydrodynamics, S-57v3 GML; each source shown with its XSD, XML and XSLT]

• Data from different parts of the marine community conforming to a variety of schema (XSD).
• For each XSD (for the source data) there is an XSLT to translate the data to the Feature Types (FT) defined by CSML. The FTs and XSLT are maintained in a ‘MarineXML registry’.
• The result of the translation is an encoding that contains the marine data in weakly typed (i.e. generic) Features.
• The FTs can then be translated to equivalent FTs for display in the ECDIS system.
• Features in the source XSD must be present in the data dictionary.
• Phenomena in the XSD must have an associated portrayal.
• ECDIS acts as an example client for the data.
• Features described using the S-57v3.1 Application Schema can be imported and are equivalent to the same features in CSML.

Page 41

Biological sampling station with attributes for the species sampled at each

Grid of Chl-a from the MERIS instrument on ENVISAT

Predicted and measured wave climate timeseries (height, direction and period)

Vectors of currents from instruments

MarineXML Testbed

Slide adapted from Kieran Millard (AUKEGGS, 2005)

Page 42

The Concept of re-using Features

Here structured XML is converted to plain ASCII text in the form required for a numerical model

HTML warning service pages are generated ‘on the fly’

XML can also be converted to SVG to display data graphically

Here the same XML is converted to the SENC format used in a proprietary tool for viewing electronic navigation charts.

All this requires agreement on standards

Slide adapted from Kieran Millard (AUKEGGS, 2005)

Page 43

CSML Present

Other User Example: Norwegian Met Office

Page 44

CSML Round Tripping - 1

Managing semantics

[Diagram labels: conceptual model (UML: Class1, Class2, -End1 1, -End2 *, «datatype» DataType1); UGAS; GML app schema (XML); GML dataset instance; “Conforms to”; Application “produces” New Dataset (101010); parser (V1.0 will be in NDG Alpha)]

GML dataset instance (excerpt, truncated on the slide):

<gml:featureMember>
  <NDGPointFeature gml:id="ICES_100">
    <NDGPointDomain>
      <domainReference>
        <NDGPosition srsName="urn:EPSG:geographicCRS:4979" axisLabels="Lat Long" uomLabels="degree degree">
          <location>55.25 6.5</location>
        </NDGPosition>
      </domainReference>
    </NDGPointDomain>
    <gml:rangeSet>
      <gml:DataBlock>
        <gml:rangeParameters>
          <gml:CompositeValue>
            <gml:valueComponents>
              <gml:measure uom="#tn"/>
              <gml:measure uom="#amount"/>
              <gml:measure uom="#gsm"/>
            </gml:valueComponents>
          </gml:CompositeValue>
        </gml:rangeParameters>
        <gml:tupleList>

Page 45

CSML Round Tripping - 2

Managing data - 1

[Diagram labels: Application “produces” CF Dataset (101010); CF; scanner (V1.0 in NDG Alpha); GML dataset instance (XML; the same NDGPointFeature excerpt as under “CSML Round Tripping - 1”); GML app schema; parser (V1.0 in NDG Alpha)]

Page 46

Managing Data 2

[Diagram labels: CF Dataset (101010); scanner; GML dataset (the same NDGPointFeature excerpt as under “CSML Round Tripping - 1”); XSLT; ISO19115 XML; PUBLISH; DECISION PROCESSES; Define Dataset; Add Information; CF Dataset (101010)]

Page 47

CSML (and friends): software stack

Page 48

Future of CSML

Some new features: (Swath, Ragged ProfileSeries)

But more importantly, factor out the storage and introduce the concept of “processing affordance”.

(see http://home.badc.rl.ac.uk/lawrence/papers/ExeterCommunique06.pdf)

Page 49

Architecture:

Deployment

Page 50

Vocabulary Management for NERC DataGrid

Michael Hughes, V.Siva Kondapalli and Roy Lowry

Page 51

Vocabulary Presentation Outline

• Problem and Solution
• NERC DataGrid Vocabulary Model
• Vocabulary Technical Governance
• Vocabulary Content Governance
• Mappings and Thesaurus Server
• Potential Role of Local Mappings

Page 52

The Problem

• NERC DataGrid cannot function operationally without metadata and data semantic interoperability

• This will never be achieved without:
  – Readily accessible standard terms whose meaning is clearly understood
  – Readily accessible semantic maps both within and between lists of standard terms
  – Semantic maps between local terms and standard terms

Page 53

The Solution?

• Implementation of a Vocabulary Server
• Building OWL ontologies mapping between domain-relevant de-facto standard vocabularies
• Deploying the ontologies through a Web Service thesaurus server
• Making tools available for users to build and deploy local ontologies

Page 54

NDG Vocabulary Model

• The vocabulary resource is built from Entries
  – The representation of a single object in the real world comprising:
    • Key – A bit pattern that represents an entity. It must be unique, permanent and free from semantics.
    • Term – Text used to label the entity to facilitate human recognition.
    • Abbreviation – A shortened version of the term for use where space is tight. Target size is 20-30 bytes.
    • Definition – Text that unambiguously specifies the entity.
  – Entries are aggregated into Lists (entity class or subclass e.g. UK post towns)
  – Lists are aggregated into Constraints (entity class e.g. post towns of the world)
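
The entry model above maps naturally onto a small record type. The Python sketch below is one possible rendering, assuming string keys and a list that refuses to reuse a key; the `Entry` and `VocabList` classes and the example entry are hypothetical and are not the NDG Vocabulary Server schema.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    """One vocabulary entry: key, term, abbreviation and definition."""
    key: str           # unique, permanent, free from semantics
    term: str          # label for human recognition
    abbreviation: str  # shortened label for use where space is tight
    definition: str    # text that unambiguously specifies the entity


@dataclass
class VocabList:
    """A list aggregates the entries of one entity class or subclass."""
    name: str
    entries: dict[str, Entry] = field(default_factory=dict)

    def add(self, entry: Entry) -> None:
        """Add an entry, refusing to reuse a key that is already taken."""
        if entry.key in self.entries:
            raise ValueError(f"key {entry.key!r} is already in use")
        self.entries[entry.key] = entry


# The key and definition below are invented for illustration.
towns = VocabList("UK post towns")
towns.add(Entry("0001", "Southampton", "Soton",
                "Post town on the south coast of England"))
print(towns.entries["0001"].term)
```

Keeping the key free from semantics means a term or definition can be corrected later without breaking anything that stored the key.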

Page 55

Vocabulary Technical Governance

• The story so far
  – Lists are available as flat ASCII files or XML documents as URLs e.g.
    • http://www.cgd.ucar.edu/cms/eaton/cf-metadata/standard_name.xml
    • ftp://ftp.pol.ac.uk/pub/bodc/jgofs/datadict/new/parameter_group.csv
    • http://www.sea-search.net/cdi_documentation/cdi_sampling_codes.csv
    • http://gcmd.nasa.gov/Resources/valids//gcmd_parameters.html
  – Some (BODC, SEA-SEARCH) include keys
  – Some (CF, BODC) include definitions
  – None are properly versioned

Page 56

Vocabulary Technical Governance

• Versioning should:
  – Provide a unique label for each instantiation of the list
  – Enable any previous instantiation of the list to be recreated
  – Provide timestamp information for creation and modification of every object in the vocabulary system
• Delivery should
  – Be from the master, not a copy
  – Be accessible to software agents to allow automated synchronisation of local copies
  – Have a ‘hotline’ to content governance
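
The three versioning requirements can be met with an append-only change log, sketched below in Python. This shows one technique that satisfies the list above, not the design of the NDG Vocabulary Server back end; the `VersionedList` class and its method names are hypothetical.

```python
from datetime import datetime, timezone


class VersionedList:
    """Append-only change log from which any past version can be rebuilt."""

    def __init__(self) -> None:
        # Each change: (version label, timestamp, key, term)
        self._log: list[tuple[int, datetime, str, str]] = []

    @property
    def version(self) -> int:
        """Label of the current instantiation (0 for the empty list)."""
        return len(self._log)

    def set_term(self, key: str, term: str) -> int:
        """Record a new or changed term; returns the new version label."""
        stamp = datetime.now(timezone.utc)
        self._log.append((self.version + 1, stamp, key, term))
        return self.version

    def snapshot(self, version: int) -> dict[str, str]:
        """Recreate the list exactly as it stood at `version`."""
        state: dict[str, str] = {}
        for v, _stamp, key, term in self._log:
            if v > version:
                break
            state[key] = term
        return state


vocab = VersionedList()
vocab.set_term("K1", "chlorophyl")    # version 1, contains a typo
vocab.set_term("K1", "chlorophyll")   # version 2 corrects it
print(vocab.snapshot(1), vocab.snapshot(2))
```

Because nothing is ever overwritten, every change carries its own timestamp and a software agent can synchronise a local copy by asking for the changes after the version it already holds.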

Page 57

Vocabulary Technical Governance

• NERC DataGrid Vocabulary Server
  – Back End
    • Fully automated record archive, timestamps and version numbering. Live April 2006.
    • 47 (of 115) lists publicly accessible.
  – Front End
    • Web Service API. Live June 2006.
    • XML list downloads from website (July 2006?).
    • Web-form tools (August 2006?).

Page 58

Vocabulary Content Governance

• Standard lists need to respond to ever expanding user requirements

• Change needs to be rapid or users lose interest

• Standard lists need to maintain information quality and internal consistency

• Content governance has to resolve these conflicting requirements

Page 59

Vocabulary Content Governance

• Content governance in oceanographic and atmospheric domains is based on:
  – Moderated e-mail discussion lists
  – ‘Benign Dictator’ and well-meaning volunteers

• Variable success depending on right people having ‘spare’ time at the right moments

• More formalism underpinned by more resources required

• But need to be careful about going too far or levels of service become unacceptable

Page 60

Mappings and Thesaurus Server

• There will never be a single list for a given topic

• Term mapping therefore an essential part of semantic interoperability

• Marine Metadata Interoperability (http://marinemetdata.org) have developed tooling and trialled mappings in the measurement phenomena arena

Page 61

Mappings and Thesaurus Server

• MMI approach
  – Harmonise lists to be mapped in OWL (Voc2OWL tool)
  – Map on basis of ‘same as’, ‘broader than’ and ‘narrower than’ relationships (VINE tool)
  – Place a Web Service API over the map to implement a term or thesaurus server

Page 62

Mappings and Thesaurus Server

• NERC DataGrid Plans
  – Use MMI technology plus domain expertise available in BODC, BADC and their user communities to build a complete map between
    • BODC Parameter Discovery Vocabulary (300 terms)
    • CF Standard Names (5-600 terms)
    • GCMD Parameter Valids (2-300 relevant terms)
  – Incorporate this map into the NDG Discovery Service to facilitate smart searching (e.g. ‘pigments’ finds dataset labelled ‘chlorophyll’) through MMI Web Service
  – Integrate ontology maintenance into source list maintenance
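
The smart searching described above amounts to expanding a query term along ‘same as’ and ‘narrower than’ links before matching dataset labels. The Python sketch below shows that expansion on an in-memory map. It does not use the MMI tools or any real Web Service API; the `Thesaurus` class is hypothetical and the example mapping is invented, apart from the ‘pigments’ to ‘chlorophyll’ pair taken from the slide.

```python
from collections import defaultdict


class Thesaurus:
    """Term map built from 'same as' and 'broader than' statements."""

    def __init__(self) -> None:
        self._same: dict[str, set[str]] = defaultdict(set)
        self._narrower: dict[str, set[str]] = defaultdict(set)

    def same_as(self, a: str, b: str) -> None:
        """Record that two terms mean the same thing (symmetric)."""
        self._same[a].add(b)
        self._same[b].add(a)

    def broader_than(self, broad: str, narrow: str) -> None:
        """Record that `broad` covers `narrow` ('narrower than' is the inverse)."""
        self._narrower[broad].add(narrow)

    def expand(self, term: str) -> set[str]:
        """Return the term plus every equivalent or narrower term, transitively."""
        seen = {term}
        todo = [term]
        while todo:
            current = todo.pop()
            for nxt in self._same[current] | self._narrower[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
        return seen


t = Thesaurus()
t.broader_than("pigments", "chlorophyll")
t.same_as("chlorophyll", "Chl-a")  # invented equivalence for illustration
print(sorted(t.expand("pigments")))
```

A discovery search for ‘pigments’ would then match any dataset labelled with a term in the expanded set, while a search for ‘chlorophyll’ does not widen to all pigments.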

Page 63

Role of Local Mappings

• There will always be local terms and understanding

• ‘Pigment data sets’ could mean:
  – Chlorophyll OR carotenoids OR phaeopigments
  – Chlorophyll AND carotenoids AND phaeopigments

• Depends on point of view

Page 64

Role of Local Mappings

• Possible solution to this:
  – User builds an ontology reflecting local perception of the mapping between local terms and standard terms
  – Discovery or data integration tools use ontology as a ‘plug-in’ allowing user to operate with local terminology
• Tools (e.g. VINE) could be made available to facilitate this
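
The ‘plug-in’ idea can be sketched as a search function that accepts an optional local mapping. The Python below covers only the OR reading of ‘pigments’, where a dataset matches if it is labelled with any one of the mapped standard terms. The `search` function, the dataset names and the mapping itself are all invented for illustration.

```python
from typing import Optional


def search(datasets: dict[str, str], query: str,
           local_map: Optional[dict[str, set[str]]] = None) -> list[str]:
    """Find datasets whose standard-term label matches the query.

    `datasets` maps a dataset name to its standard-term label. When a
    local mapping is plugged in, a local term is replaced by the set of
    standard terms that the user says it stands for.
    """
    terms = (local_map or {}).get(query, {query})
    return sorted(name for name, label in datasets.items() if label in terms)


# Dataset names are hypothetical.
DATASETS = {
    "cruise_A": "chlorophyll",
    "cruise_B": "carotenoids",
    "mooring_C": "temperature",
}
# One user's local view: 'pigments' means any of these standard terms.
LOCAL = {"pigments": {"chlorophyll", "carotenoids", "phaeopigments"}}

print(search(DATASETS, "pigments"))                   # no plug-in: nothing matches
print(search(DATASETS, "pigments", local_map=LOCAL))  # plug-in applied
```

A user with the AND point of view would supply a different plug-in and matching rule; the standard lists and the discovery tool stay unchanged.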

Page 65

NDG Timeline

NDG2 runs until September 2007:
• NDG-Alpha (June 2006)
  – Not all components are in place (particularly delivery broker)
  – Not many products are deployable by non-NDG participants (too much hard work installing things that haven’t been optimised for installation)
  – Discovery portal is usable, linking to NCAR data etc, but isn’t very user friendly (options not obvious etc).
• NDG-Beta (Feb 2007)
  – Most components should work, but deployment of software may still be difficult by non-participants
• NDG-Prod (Jun 2007)
  – Should be deployable and far more user friendly (spending from Feb-June working on deployment and friendliness, no new functionality)
• Last few months working on sustainability etc

http://proj.badc.rl.ac.uk/trac/roadmap