cris and dataspaces by epitaxial growth
DESCRIPTION
CRIS and DataSpaces by Epitaxial Growth. Keith G Jeffery STFC Anne Asserson UiB. With acknowledgement to http://www.mineral-forum.com/. Authors. Keith G Jeffery STFC-RAL. Anne Asserson UiB. History Hypothesis Achievement Conclusion. CRIS Interoperation Requirement - PowerPoint PPT PresentationTRANSCRIPT
1©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
CRIS and DataSpaces by Epitaxial Growth
Keith G Jeffery STFC
Anne Asserson UiB With acknowledgement to
http://www.mineral-forum.com/
2©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Authors
Anne Asserson UiB
Keith G Jeffery STFC-RAL
3©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Structure
• History
• Hypothesis
• Achievement
• Conclusion
– CRIS Interoperation Requirement
– Classical Interoperation– CRIS Interoperation– Dataspaces
– Requirement– Challenges– Solution Concept
– Schema matching and mapping
– Other Challenges
4©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
History: CRIS Interoperation Requirement
• Internationally and Nationally– (a) to allow each funding organization to make strategic funding
decisions related to knowledge of the actions of other funding organisations; (b) to provide an international base of reviewers for research proposals;
– (c) to provide comparative metrics on research performance and cost-benefit;
– (d) for researchers to find colleagues in areas peripheral to their own (where they claim they knew the key players) on an international basis;
– (e) for encouraging international innovation taking research outputs through to wealth-creation.
• For business purposes– Funding organisations <-----> research organisations
5©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
History: Classical Interoperation Technology
• Significant challenge in ICT / Computer Science– For structured data sources
• From formal schemas erect global schema to reduce n(n-1) interoperations to n
– For semi-structured data sources• From XML schema as above • else derive schema from tags
– For unstructured data sources• Harvesting • Possibly advanced knowledge engineering and machine
learning techniques
• All require considerable human intervention• This does not scale for future internet
6©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
History: CRIS Interoperation Technology
• IDEAS (1984-87), EXIRPTS (1987-89), ERGO 1997-99)
• Categorisation of Techniques (2005)– Remote Wrapper– Local Wrapper– Catalog– Catalog Plus Pull (ERGO2++)– Full CERIF– Harvesting
• Confirmed by euroHORCs special group 2008– Recommended converge to CERIF to allow evolution
to Full CERIF technique.
7©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
History: Dataspaces• The vision behind dataspaces is that the end-user ‘sees’ a space
with all relevant information required for a particular purpose presented in a form suitable for the purpose of the end-user.
• Utilise human effort and knowledge to guide (or actually execute) the creation of a temporary, partial global schema by matching of schemas and mapping (i.e. specifying the wrapper software) for interoperation.
• More recently it has been suggested that machine learning could be employed using manually-created mappings as training data.
• Dataspaces includes the concept of ‘pay as you go’ only requiring the matching and mapping necessary for a particular interoperation instance.
• The incremental nature of building / accessing dataspaces stimulated the concept of epitaxial growth in dataspaces
8©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Structure
• History
• Hypothesis
• Achievement
• Conclusion
– CRIS Interoperation Requirement
– Classical Interoperation– CRIS Interoperation– Dataspaces
– Requirement– Challenges– Solution Concept
– Schema matching and mapping
– Other Challenges
9©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Hypothesis: Requirement
Local CRIS
remote CRIS
remote CRIS
remote CRIS
query results
CERIF
CERIF
CERIF
10©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Hypothesis: Challenges1. discovery of relevant CRIS;2. description of each relevant CRIS;3. matching each description in (2) to CERIF including
(a) management of different character sets and multimedia representation;(b) management of different languages;(c) management of different syntax (data structure); (d) management of different semantics (meaning);
4. generating mappings to describe the required conversions on instances under each CRIS schema to/from CERIF;
5. generating conversion software for instances in any source CRIS;6. managing a query in terms of a local schema (or CERIF) over all
CRIS;7. managing the response (answers);8. management of change in schemas (3) (but usually only (c) and (d));
11©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Hypothesis: Solution Concept
With acknowledgement to
http://www.mineral-forum.com/
12©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Epitaxial Growth
Structurally congruent
The term epitaxy comes from the Greek roots epi, meaning "above", and taxis, meaning "in ordered manner".
ions
Crystal
structure
13©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Hypothesis: Solution Concept
project
person organisation
‘seed’schema
extended project schema
extendedpersonschema
extendedorganisation
schema
source 1 source 2
source 1
source 2 source 1
source 2
input
epitaxial
growth
The Concept of Epitaxial Growth of Schemas
14©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Structure
• History
• Hypothesis
• Achievement
• Conclusion
– CRIS Interoperation Requirement
– Classical Interoperation– CRIS Interoperation– Dataspaces
– Requirement– Challenges– Solution Concept
– Schema matching and mapping
– Other Challenges
15©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Schema Matching and Mapping
• Challenges 3 and 4 (matching and mapping) are addressed using the novel technique
• We propose epitaxial growth from a canonical CERIF schema to generate the global schema for any group of CRIS.
• In the process, for each CRIS, there is documented – entities and attributes matching CERIF (CRIS=CERIF); – entities and attributes in CERIF that are missing in the CRIS
CERIF > CRIS; – entities and attributes in the CRIS that are missing in CERIF
CRIS > CERIF;
16©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Schema Matching and Mapping
– lAnguage match;– Lexical match;– sYntactic match; – sEmantic match;
– Reconciliation;– Epitaxial growth;
Schemas of CRIS and CERIF represented as fully connected cyclic graphs.
node
A
L
Y
E
Match Flags
17©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Match FlagsPhase E type T type M =
MatchR = paRtial match
N = No match
A = lAnguage AM AR AN
L = Lexical LM LR LN
Y = sYntactic YM YR YN
E = sEmantic D = Dictionary EMD ERD END
T = Thesaurus B = suBterm EMTB ERTB ENTB
P = suPerterm EMTP ERTP ENTP
O = Ontology EMO ERO ENO
18©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: IdealPhase E type T type M =
MatchR = paRtial match
N = No match
A = lAnguage AM AR AN
L = Lexical LM LR LN
Y = sYntactic YM YR YN
E = sEmantic D = Dictionary EMD ERD END
T = Thesaurus B = suBterm EMTB ERTB ENTB
P = suPerterm EMTP ERTP ENTP
O = Ontology EMO ERO ENO
19©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Translation
• If Flags AR | AN • To reach AM• Utilise E : Semantic
analysis• To achieve
– EMD– EMTB– EMTP– EMO
node
A
L
Y
E
20©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: TranslationPhase E type T type M =
MatchR = paRtial match
N = No match
A = lAnguage AM AR AN
L = Lexical LM LR LN
Y = sYntactic YM YR YN
E = sEmantic D = Dictionary EMD ERD END
T = Thesaurus B = suBterm EMTB ERTB ENTB
P = suPerterm EMTP ERTP ENTP
O = Ontology EMO ERO ENO
21©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Improvement
• If Flags– LR | LN
• To reach LM• Utilise
– Y: Syntactic analysis for structural positioning to achieve
• YM
– E: Semantic analysis for related terms to achieve
• EMD• EMTB• EMTP• EMO
node
A
L
Y
E
22©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: ImprovementPhase E type T type M =
MatchR = paRtial match
N = No match
A = lAnguage AM AR AN
L = Lexical LM LR LN
Y = sYntactic YM YR YN
E = sEmantic D = Dictionary EMD ERD END
T = Thesaurus B = suBterm EMTB ERTB ENTB
P = suPerterm EMTP ERTP ENTP
O = Ontology EMO ERO ENO
23©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Algorithm (sketch) 1• AM | LM | YM | [EMD|EMO] direct mapping
– the issue of EMTB, EMTP is dealt with below;
• In all other cases R | N may be resolved by a later phase;
• AR | AN flag precludes lexical match unless resolved by ED | ET | EO;
• AM + LR | LN flag precludes a match unless resolved by ED | ET | EO;
• Y processing indicates nodes to attempt to match using ED | ET | EO;
24©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Algorithm (sketch) 2
• ED provides for multilingual term equality;• ET processing like Y; the processing must cycle
over the graph to find the closest YM and then apply ET to find EMTP | EMTB and allocate the flag(s) appropriately;
• EO processing like Y; the processing must cycle over the graph to find the closest match for the current node by locating EMO for terms in nodes YM-related to the current node thus determining the semantics of the current node;
25©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Epitaxial Growth
With acknowledgement to
http://www.mineral-forum.com/
26©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Epitaxial Growth
• In the general case we now have two graphs with matching and non-matching nodes.
• likely there are excess nodes in the CRIS graph (i.e. the CRIS contains all the CERIF elements, probably with different names, plus more entities / attributes suitable for local processing)
27©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Epitaxial GrowthCRIS Superset
e
e a a
a a e a a
a aa
e
a e
a
a
e
a
a
e
e
e a a
a a e a a
a aa
e
a e
a
a
e
a
a
e
a a
CRIS CERIF
e = entity, a = attribute
28©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Epitaxial GrowthCRIS Subset
e
e a a
a a
a
e
a e
a
a
e
a
a
e
e a a
a a
a
e
a
a a
CRIS CERIF
e = entity, a = attribute
29©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Epitaxial Growth: Algorithm (sketch) 1
• For nodes in the CRIS schema that do not have corresponding nodes in the CERIF schema:
• Their syntactic positions in the graph, related to nodes in the CRIS schema, are analysed
– starting from the root and working down the graph, left to right at each level;
• The language is flagged– so that this can be the variable value input in the CERIF language
attribute;• The lexical term for the entity or attribute name is noted ready for
translation to the canonical language for interoperation – with the CERIF translation attribute set to automatic;
• The syntactic structure present in the CRIS but not in CERIF is then added to the CERIF schema as the epitaxial growth……...
30©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Epitaxial Growth: Algorithm (sketch) 2
• This is achieved in the following steps:– if the additional nodes are attributes under an
existing entity ‘X’, a new CERIF entity ‘Xadditional’ is created with the attributes as nodes under it and ‘Xadditional’ is linked with a cardinality 1:1 with ‘X’ in CERIF using matched primary keys;
31©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Epitaxial GrowthAdditional Attributes
CERIF Entity
‘X’
NEWCERIF Entity
‘Xadditional’
NEWCERIF LinkingRelation
‘X’‘Xadditional’
CRIS‘Y’
CRISEntity‘Y=X’
CRISEntity‘Y≠X’
32©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Epitaxial Growth: Algorithm (sketch) 3
– if the additional nodes include an additional entity ‘Y’, linking relations are created in CERIF to link ‘Y’ to existing CERIF entities and attributes of ‘Y’ are nodes under ‘Y’;
– This step depends on the complexity of the CRIS: primary/foreign key relationships (with cardinality 1:n) can be analysed to generate the linking relations and if the CRIS already has n:m cardinality relationships using linking relations these can be mapped directly;
33©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
NEWCERIF Linking
Relations‘Y’
To other entities
NEWCERIF Linking
Relations‘Y’
To other entities
Achievement: Epitaxial GrowthAdditional Entity
NEWCERIF Entity
‘Y’
NEWCERIF Linking
Relations‘Y’
To other entities
CRIS‘Y’
CRISEntity‘Y≠X’
For all X
Links to other entities
34©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Epitaxial Growth: Algorithm (sketch) 4
• A similar process is followed for nodes in the CERIF schema not having corresponding nodes in the CRIS schema.
35©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
With acknowledgement to
http://www.mineral-forum.com/
36©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Achievement: Other Challenges• Challenge 1: DISCOVERY
– managed by the CRIS community knowledge (e.g. DRIS) and/or by web services searches UDDI/WSDL.
• Challenge 2: DESCRIPTION– the schema of any structured (or semi-structured) CRIS.
• {Challenges 3 and 4 : MATCHING and MAPPING– described above}
• Challenge 5: CONVERSION– using technology from Hypermedata (SkKoBeJe1999). From the
schema graphs of the CRIS and CERIF the convertor can be generated to convert instances under one schema to instances under another.
• Challenges 6 and 7; QUERY and RESPONSE– using technology from MIPS (JeHuKaWiBeMa94)
• Challenge 8; CHANGE– managed by repeating the above as required
37©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Structure
• History
• Hypothesis
• Achievement
• Conclusion
– CRIS Interoperation Requirement
– Classical Interoperation– CRIS Interoperation– Dataspaces
– Requirement– Challenges– Solution Concept
– Schema matching and mapping
– Other Challenges
38©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Conclusion 1• questions facing society today can be solved, or partially solved
using research information to guide actions and developments. • world-scale problems cannot be solved by local-scale research
information.• base technology to support and move forward is interoperating
(CERIF-)CRIS within a (European) e-infrastructure.• current technologies do not scale to ‘future internet’.
• responsibility of the CRIS community to find, develop and promote a solution for interoperating CRIS that is reliable, scalable, economic and has appropriate security and privacy.
• interoperating CRIS using epitaxial growth is a promising line of development.
39©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg
Conclusion 2
And of course, it would be much more effective and efficient……..
……..If all research information systems used CERIF as their storage and interoperation format