overview and motivation of the icat software suite kerstin kleese van dam
TRANSCRIPT
Overview and Motivation of the ICAT
Software SuiteKerstin Kleese van Dam
Science and Technology Facilities CouncilSTFC employ more than 2200 staff
who are deployed at 7 locations, these are: Swindon where the
headquarter is based, the Rutherford Appleton Laboratory, the Daresbury Laboratory, the Chilbolton Observatory, the UK Astronomy Technology Centre in Edinburgh, the Isaac Newton Group of Telescopes on La Palma; and the Joint Astronomy Centre in Hawaii.
Research and Science Support at STFC
Deliver world class science Engender world class science
Communicate world class science
Annually over 15000 visiting Scientists from around the world from both Academia and Industry.
Why an Integrated e-Infrastructure is required
HPC
HPCAnalysis
Storage
Storage
Analysis
Experiment
ExperimentComputing
HPC
Scientist
What STFC aim to achieve with their e-Infrastructure
Enabling users to get rapid access to their current and past data, related experiments, publications etc., leading to improved analysis through more complete information.
Creating a powerful, long lasting scientific knowledge resource.
Integrated e-Infrastructure
Proposal
Metadata Catalogue
Information
Experiment
Data Acquisition
System
Secure Storage
Data
Analysis
Publication
E-PubsProposal System
All Data and Metadata Capture is automated.
e-Infrastructure – Access to Multiple
Facilities(2)
Data Portal
SNS - ORNL
ISIS – TS1 + 2
DLS
CLF
CSL - Canada
SRS + ERLP
How we achieve the integration
HPC
HPCAnalysis
Storage
Storage
Analysis
Experiment
ExperimentComputing
HPC
Metadata
Scientist
ICAT Software Suite
The ICAT software suite centrally catalogues all experiment related information and extracted key results. Where ever possible information is gathered automatically trough integration with existing IT systems such as proposal systems or data acquisition.The catalogue and the data it references are accessible via a well defined API for easy embedding into any applications.
Distributed Data
Metadata Catalogue
Generic Catalogue Access Interface
Data Access and Analysis Applications
Underlying Data Infrastructure
Online Proposal System
User Office System incl.:
User Database
Scheduling
Health and Safety
Proposal Management
Metadata Catalogue
Data Acquisition System
Storage Management
System
DataAccessPortal
Single Sign On Account Creation and Management
ICAT Software Suite, providing the crucial integration of key functions.
The online proposal system is the entrance point to the Data Management System, and is a rich source of contextual information about the users experiment.
ICAT and the STFC Proposal Systems
ICAT and STFC Data Acquisition
Plug-ins for the data acquisition system ensure automatic, quality controlled collection of data and metadata. ICAT can be easily linked to any existing system.•ISIS :
- SECI (C#, .net) with link to LabView and openGenie•DLS :
- Generic Data Acquisition (Java, on top of EPICS) •CLF :
- For Laser Diagnostics, (LabView)
ICAT and DLS Storage Management
DLS uses the Storage Resource Broker for its Storage Management, this has been integrated with ICAT for data access and delivery.
Main advantage : •Decoupling physical file location from the logical one.•Strict Security•Expandable to many storage systems
SRB MCATDatabase
SRB MCATServer
SRB ADSServer
SRBClient
SRB DiskServer (Local Server)
Atlas Data Store
SRB MCATDatabase
SRB MCATServer
SRB ADSServer
SRBClientSRBClient
SRB DiskServer (Local Server)
SRB DiskServer (Local Server)
Atlas Data StoreAtlas Data Store
ICAT and ISIS Storage Management
ISIS uses their own in house developed data storage access system called Data.ISIS.
Similar to SRB it abstracts from the physical location of the files and delivers the same advantageous in terms of decoupling of logical and physical location of files and security.
ICAT Architecture
Online Proposal System
User Office System incl.:
User Database
Scheduling
Health and Safety
Proposal Management
Metadata Catalogue
Data Acquisition System
Storage Management
System
DataAccessPortal
Single Sign On Account Creation and Management
ICAT Software Suite, providing the crucial integration of key functions.
ICAT 3.3 Aims and Objectives
• ICAT API Version 3.3 aims to be the Grid aware software infrastructure that enables
applications to exploit the capabilities of the ICAT catalogue.
• Data Portal Version 3.3 aims to be the Grid aware software infrastructure that serves the Data Search and Retrieval (DSR) requirements of the STFC. It makes use of the ICAT API 3.3 .
Overall Architecture Principles
• The ICAT software suite has a modular design with clear functional boundaries for each component.• Core functionalities have been grouped together, customisable presentation layers are separated from the function layer to achieve easy maintenance, easy customisation, insulation from changes to underlying areas.• All interaction with the ICAT catalogue are now through the ICAT API.
Core Scientific Metadata
Model
(CSMD)
Rich Data at STFC
Scientific Data of the highest Quality is produced at STFC Facilities and
Departments.The continuity and longevity of STFC has
led to a unique wealth of Information.
How about a system that would give access to all of it independent of where it was produced?
Model Motivation (1)
Most Scientists think in terms of Studies during which they perform a number of investigations e.g. experiments,
observations, measurements and simulations. Results from these investigations usually run through
different stages: raw data, analysed or derived data and end results. Data should be grouped accordingly.
Metadata and Software (e.g. STFC DataPortal) should allow the user to search for interesting data.
Not all information captured in specific metadata schemas e.g. CML, would be used to search for this data or
distinguish one data set from another, give possibility to select special parameter.
Model Motivation (2)
A common general format/standard for Scientific Studies and data holdings metadata did not exist
By proposing Model and Implementation:– Form a specification for the types of metadata
studies should capture during Scientific Studies– Ease citation, collaboration, exploitation and
Integration– Allow easy Integration of distributed
heterogeneous metadata systems into a homogeneous (albeit virtual) Platform
Therefore – The Common Scientific Metadata Model (CSMD) developed.
General Layout
Why – i.e. what was the needWhat is it – description – support keyword searches and taxonmic approaches – data
organisation like a file systems but support linking to a database also
Where is it used – project & softwareWhat are the users likely to search on
What distinguishes one study/investigation/data set from the next
Metadata Model Structure
The Common Scientific metadata model (CSMDM) is a study-data set orientated model holding study information about:
– Topic Indexing– Provenance– Data Holding– Legal notes
• Copyright, patents and conditions of use etc relating to the study and the data in the study
– Related Material• Publications, Community
information and related links– (Access Conditions)
Metadata Granule
Topic
Study
Access Conditions
Related Material
Legal Note
Data Holding
Investigation 1 M
11
Atomic Data Object
Data Collection
11
1
M
M
M
M
1
Model Breakdown: Provenance
The Study contains the following metadata:
– The Study Name– The Study Institution– The Investigator– Extended Study Information
• Abstract• Funding • Start and End times
– Investigations
Investigations
A Study can have more than one investigation; possible
enumerations are experiment, simulation,
measurements etc. – investigations contain:
– Name– Investigation Type– Abstract– Resource– Link to DataHolding
Topic (for indexing)
Keywords– Discipline (i.e. domain)– Keyword Source (e.g.
domain dictionary)– Keyword
Subjects
– Discipline– Subject Source (e.g.
domain taxonomy)– Subject
Access Condition & Related Material
Access Conditions
– Contains a list of users or groups who are allowed access to the metadata and data, or a pointer to an access control system which contains such data for this study
Related Material
– One or many links and or textual descriptions of material related to this study e.g. earlier studies or parallel studies
Data
Data Description holds a logical description of the
Study’s data:
– Data Name– Type of Data– Status– Data Topic– Parameters– Related Data Ref– Relation type (e.g.
derived)
Data Location contains the link between logical name
(e.g. URI's) and physical URLs
– Data Name– Locator(s) (In the
case of Atomic Data Objects these can refer to files as well as named Selects on a database – i.e. virtual data objects)
More on Parameters
Parameters contain a lot of information about the atomic data objects (ADO) and collections
A collection/ADO can have many parameter entries, each parameter entry contains:
Parameter derivation (e.g. measured/fixed)
– The value– The units– Range – Error margin
Parameter aggregation is also supported
Cardinality Issues
The model recommends a certain cardinality of elements
Certain metadata components are necessary for one to have an instance of the implemented model – treating everything as
optional is not acceptableIt is though implementations may modify
this more to their needs – model attempts to remain ideal (i.e. most common Cardinality)
Enumeration Issues
Enumerations (or controlled vocabularies) e.g. types of investigator, types of institutions; these are distinct from the model e.g. as
taxonomies are.However they are necessary for the model to
work so implementations e.g. STFC DataPortal implementation of the model propose some
enumerations for common thingsRecognised and relevant controlled vocabularies are hoped to be used by implementations where
they are available
Conformance Level
For a complete metadata study-dataset record a large amount of metadata has to
be stored/processedSo it’s useful to have conformance levels
Model uses 5 levelsEach level specifies more metadata (and
Indexing information) should be held
Level 1
Type of Information captured:
– Study and Investigation metadata with indexing at the Study level
Level 1 metadata is similar to library/publication style metadata (e.g.
DublinCore)
Level 2
Type of Information captured:
– Level 1 + DataHolding metadata (i.e. DataSets and DataObjects)
Level 3
Type of Information captured:
– Level 2 + related material, Access condition, indexing to data collection levels
Level 4
Type of Information captured:
– Level 3 + indexing to data object level and data object parameter information
Level 5
Type of Information captured:
– All metadata components are filled as L4 + funding, resources used, facilities used etc
Conformance Levels
L1 is similar to library/publication style metadata (e.g. DublinCore)
The current DataPortal uses somewhere between L4 and L5 –the new systems designed with CSMD
conforms to L4+ Benefit of conformance levels; the higher the level of conformance to the CSMD the richer the clients
that operate on the data can be– e.g. identifying datasets and atomic data
objects which link directly to keywords/taxonomies and not just studies
CSMD Used on DataPortal
Implementation used as Data Interface for
DataPortalSingle view of heterogeneous
systems/schemasActs as a stress test of the
model– Limitations feed into
Model Requirements– New requirements
feed back into implementation
ICAT Schema 3.3
Specifics of the ICAT 3.3 Schema
ICAT 3.3 Schema - Facility
ICAT 3.3 Schema - Study
ICAT 3.3 Schema – Study (2)
ICAT 3.3 Schema – Study (3)
Study Investigation
Study Status
ICAT 3.3 Schema - Investigation
ICAT 3.3 Schema - Instrument
ICAT 3.3 Schema - Shift
ICAT 3.3 Schema – Shift (2)
ICAT 3.3 Schema - Keywords
ICAT 3.3 Schema - Topic
ICAT 3.3 Schema – Topic (2)
Topic
Topic List
What is an Ontology?
•Ontologies are used to capture knowledge about a domain of interest.
• An ontology describes the concepts in the domain and the relationships that hold between those concepts.
Advantages of Ontologies
• Provide increased flexibility when representing frequently changing viewpoints of information.
• Alterations can be simply followed up in the model without having to alter the applications on which they are based.
• Allows a unified view of heterogeneous data sources.
• Remove conflicts and terminological uncertainties.
• Facilitate Moderated searches, optimisation of the search results.
Why Ontologies are a useful Solution?
•At present over 1,700,000 keywords describing experiments are housed in ISIS ICAT many of which are synonyms.
•These keywords are used to index experimental studies, however this is seen as a limited method as these free text keywords have no context, and are hard to map by non-experts to terms used by facilities in the same domain and harder still to those outside.
•The creation of ontologies at ISIS will aid in the mapping of concrete manifestations of familiar terms in one domain as well as related concepts in different domains.
•This will facilitate searching of data by category and grouping of data into keywords across studies.
•This could aid in the cross facility searching of related scientific data from the various scientific facilities housed at STFC e.g. CLF and DLS.
A Protégé-OWL Ontology
• Classes• Individuals• Properties
A class is a concept in the domain - a class of People - a class of Pets - a class of Countries
A class is a collection of elements with similar properties.
Instances of classes- America can be an instance of the class Country.
Gemma
Mathew
Fluffy
Italy
America
England
Fido
Class Person
Class Pet
Class Country
livesIn
hasSibling
hasPet
ISIS Facilities Ontology Hierarchy
Class ISISExperiment
Class DataFile
Class Year
wasConductedIn
hasInvestigator
Class Instrument
Class InvestigatorHRP00145.RAW
1986
Pete Jones
HRPD
Class CrystallographyGroupExperiment
hasUsedInstrument
HydraziniumClass InvestigationTitle
hasTitle
hasDataFileName
Protein Crystallography GroupExperiment
ISIS Facilities Ontology
Sample, Investigator and Experiment Ontologies
Sample Investigator Experiment
Ontology Maintainer
•A web application for graphically displaying current versions of an ontology
•Currently ontologies are built within Protégé, an editing environment
•Difficulty in showing constructed ontologies to other domain experts
•The OntoMaintainer allows users to visualize ontology and enter feedback on the classification and structure of the hierarchy
•Encourages collaboration between domain experts (scientists) and ontology builders by allowing members of the community to be involved in the development and maintenance of ontologies
Topic Mapping Tool
•Mapping Tool provides a way of linking proposal system data to the structure of the ontology.
•Data is mapped to the ontology structure according to a set of defined rules.
Proposal System Database
Ontology Mapping Rules
Mapping Tool
Object
Sample Detail
Chemical Formula Name SampleType
LiquidLiquidpoly{1,4-phenylene-[9,9-bis(4-phenoxy butylsulfonate)]fluorene-2,7-diyl} ; C12E5;
D2O
poly{9,9-bis[6-(N,N-trimethyl-ammonium)hexyl]fluorene-co-1,4-phenylene};C12E5;D2=
C37H52N2I2:C22H46=6;D2=
C37H30S2O8; C22H46O6;D2O
•Ontologies would help maximise the value of data collected at ISIS and other STFC facilities by improving the access, navigation and reuse of data.
•Ontologies would facilitate the mapping of terms across STFC facilities which will allow cross-facility searching e.g. external users will be able to search for all experiments carried out across STFC using a powder diffractometer (instrument) even if they do not know the local names of the specific instruments.
•The OntoMaintainer will facilitate the process of creating and maintaining ontologies by providing a means of getting feedback directly from domain experts
ICAT 3.3 Schema - Investigator
ICAT 3.3 Schema – Investigator (2)
Investigator
Facility User
ICAT 3.3 Schema - Sample
ICAT 3.3 Schema – Sample Parameter
ICAT 3.3 Schema – Dataset
ICAT 3.3 Schema – Dataset (2)
ICAT 3.3 Schema – Dataset (3)
ICAT 3.3 Schema – Dataset Status
ICAT 3.3 Schema – Dataset Type
ICAT 3.3 Schema – Dataset Parameter
ICAT 3.3 Schema – Data File
ICAT 3.3 Schema – Data File (1)
ICAT 3.3 Schema – Data File (2)
ICAT 3.3 Schema – Related Data Files
ICAT 3.3 Schema – Data File Parameter
ICAT 3.3 Schema – Authorisation
Other ICAT Related Schema
ICAT API Session Schema
There are 3 tables to the schema, the user, user_session and myproxy_servers:
USER – All users who have logged inUSER_SESSION – All user’s sessions on IcatMYPROXY_SERVERS -- configuration information about which server to logon to
ICAT Core Database Schema
ICAT Core Session
ICAT Core Event
ICAT Core User
ICAT Core DataBase
ICAT Core Database
The core ICAT catalogue is at STFC run on an Oracle 10G RAC clustered database server.
The system has been customised to make efficient use of the offered features of Oracle.
If required these could however be removed in the future.
ICAT API
ICAT Architecture
Online Proposal System
User Office System incl.:
User Database
Scheduling
Health and Safety
Proposal Management
Metadata Catalogue
Data Acquisition System
Storage Management
System
DataAccessPortal
Single Sign On Account Creation and Management
ICAT Software Suite, providing the crucial integration of key functions.
ICAT API Version 3.3 (1)
The ICAT API version 3.3 is the interface that any application should use to interact
with the core ICAT system catalogue. At present it is used by applications such as the ISIS XML ingest, the DLS Generic Data
Acquisition System, DLS DDH and the DataPortal. The API offers a wide range of web services for the easy interaction with
the ICAT core catalogue.
ICAT API Version 3.3 (2)
The ICAT API version 3.3 consists of three main components:
•Web Services offered to other applications•ICAT Catalogue Interactions•ICAT Catalogue Session Management
ICAT API Version 3.3 (3)
The ICAT API version 3.3 uses JPL and SQL to directly interact with the underlying oracle databases
The ICAT API version 3.3 has been written in Java using EJB3, JPA and JAX-WS
ICAT API Version 3.3 (4)
Web Services offered to other applications for the Search, List, Ingest, Delete, Modification of:
AuthenticationInvestigation, Datafile and Dataset InformationInvestigatorKeywordsPublicationSampleDownload
DataPortal
ICAT Architecture
Online Proposal System
User Office System incl.:
User Database
Scheduling
Health and Safety
Proposal Management
Metadata Catalogue
Data Acquisition System
Storage Management
System
DataAccessPortal
Single Sign On Account Creation and Management
ICAT Software Suite, providing the crucial integration of key functions.
DataPortal for ICAT Version 3.3
The DataPortal is a highly customisable web interface to interact with the ICAT version 3.3.There are at present two distinctive versions one for ISIS and one for DLS. Whereas the underlying functionality is the same the graphical representation and choice of used services varies.The DataPortal offers a number of search interfaces, the ability to explore investigations and download associated data.
Top Left Hand Menu
Bottom Left Hand Menu
Top Right Hand Menu
Session Expire
DLS DataPortal
Questions?