CRIS and DataSpaces by Epitaxial Growth

Page 1: CRIS and DataSpaces by Epitaxial Growth

©Keith G Jeffery, Anne Asserson CRIS and Dataspaces CRIS2010, Aalborg

CRIS and DataSpaces by Epitaxial Growth

Keith G Jeffery STFC

Anne Asserson UiB

With acknowledgement to http://www.mineral-forum.com/

Page 2: CRIS and DataSpaces by Epitaxial Growth

Authors

Anne Asserson UiB

Keith G Jeffery STFC-RAL

Page 3: CRIS and DataSpaces by Epitaxial Growth

Structure

• History
  – CRIS Interoperation Requirement
  – Classical Interoperation
  – CRIS Interoperation
  – Dataspaces

• Hypothesis
  – Requirement
  – Challenges
  – Solution Concept

• Achievement
  – Schema matching and mapping
  – Other Challenges

• Conclusion

Page 4: CRIS and DataSpaces by Epitaxial Growth

History: CRIS Interoperation Requirement

• Internationally and Nationally
  – (a) to allow each funding organisation to make strategic funding decisions related to knowledge of the actions of other funding organisations;
  – (b) to provide an international base of reviewers for research proposals;
  – (c) to provide comparative metrics on research performance and cost-benefit;
  – (d) for researchers to find colleagues in areas peripheral to their own (where they claim to know the key players) on an international basis;
  – (e) for encouraging international innovation taking research outputs through to wealth-creation.

• For business purposes
  – Funding organisations <-----> research organisations

Page 5: CRIS and DataSpaces by Epitaxial Growth

History: Classical Interoperation Technology

• Significant challenge in ICT / Computer Science
  – For structured data sources
    • From formal schemas erect global schema to reduce n(n-1) interoperations to n
  – For semi-structured data sources
    • From XML schema as above
    • else derive schema from tags
  – For unstructured data sources
    • Harvesting
    • Possibly advanced knowledge engineering and machine learning techniques

• All require considerable human intervention
• This does not scale for future internet
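The reduction from n(n-1) to n interoperations can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes, as the slide's formula suggests, that every ordered pair of sources needs its own interoperation when no global schema exists.

```python
def pairwise_interoperations(n):
    """Source-to-source interoperations among n sources, one per ordered pair."""
    return n * (n - 1)


def global_schema_interoperations(n):
    """Interoperations needed when each source maps once to a shared global schema."""
    return n


# Compare the two counts for a few (arbitrary) numbers of sources.
for n in (3, 10, 50):
    print(n, pairwise_interoperations(n), global_schema_interoperations(n))
```

The gap widens quadratically, which is why the global-schema approach is attractive as the number of sources grows.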

Page 6: CRIS and DataSpaces by Epitaxial Growth

History: CRIS Interoperation Technology

• IDEAS (1984-87), EXIRPTS (1987-89), ERGO (1997-99)

• Categorisation of Techniques (2005)
  – Remote Wrapper
  – Local Wrapper
  – Catalog
  – Catalog Plus Pull (ERGO2++)
  – Full CERIF
  – Harvesting

• Confirmed by euroHORCs special group 2008
  – Recommended convergence to CERIF to allow evolution to the Full CERIF technique.

Page 7: CRIS and DataSpaces by Epitaxial Growth

History: Dataspaces

• The vision behind dataspaces is that the end-user ‘sees’ a space with all relevant information required for a particular purpose presented in a form suitable for the purpose of the end-user.

• Utilise human effort and knowledge to guide (or actually execute) the creation of a temporary, partial global schema by matching of schemas and mapping (i.e. specifying the wrapper software) for interoperation.

• More recently it has been suggested that machine learning could be employed using manually-created mappings as training data.

• Dataspaces includes the concept of ‘pay as you go’, only requiring the matching and mapping necessary for a particular interoperation instance.

• The incremental nature of building / accessing dataspaces stimulated the concept of epitaxial growth in dataspaces.

Page 9: CRIS and DataSpaces by Epitaxial Growth

Hypothesis: Requirement

[Diagram: a Local CRIS exchanging query and results with three remote CRIS, each via CERIF]

Page 10: CRIS and DataSpaces by Epitaxial Growth

Hypothesis: Challenges

1. discovery of relevant CRIS;
2. description of each relevant CRIS;
3. matching each description in (2) to CERIF including
   (a) management of different character sets and multimedia representation;
   (b) management of different languages;
   (c) management of different syntax (data structure);
   (d) management of different semantics (meaning);
4. generating mappings to describe the required conversions on instances under each CRIS schema to/from CERIF;
5. generating conversion software for instances in any source CRIS;
6. managing a query in terms of a local schema (or CERIF) over all CRIS;
7. managing the response (answers);
8. management of change in schemas (3) (but usually only (c) and (d)).

Page 11: CRIS and DataSpaces by Epitaxial Growth

Hypothesis: Solution Concept

With acknowledgement to

http://www.mineral-forum.com/

Page 12: CRIS and DataSpaces by Epitaxial Growth

Epitaxial Growth

Structurally congruent

The term epitaxy comes from the Greek roots epi, meaning "above", and taxis, meaning "in ordered manner".

[Diagram labels: ions; crystal structure]

Page 13: CRIS and DataSpaces by Epitaxial Growth

Hypothesis: Solution Concept

[Figure: The Concept of Epitaxial Growth of Schemas. A ‘seed’ schema (project, person, organisation) takes input from source 1 and source 2 and, by epitaxial growth, becomes an extended project schema, an extended person schema and an extended organisation schema.]

Page 15: CRIS and DataSpaces by Epitaxial Growth

Achievement: Schema Matching and Mapping

• Challenges 3 and 4 (matching and mapping) are addressed using the novel technique.

• We propose epitaxial growth from a canonical CERIF schema to generate the global schema for any group of CRIS.

• In the process, for each CRIS, the following are documented:
  – entities and attributes matching CERIF (CRIS = CERIF);
  – entities and attributes in CERIF that are missing in the CRIS (CERIF > CRIS);
  – entities and attributes in the CRIS that are missing in CERIF (CRIS > CERIF).
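The three documented categories correspond to a set intersection and two set differences. The following is a minimal sketch, assuming element names have already been reconciled so that equal names mean matching elements (the matching phases exist precisely because this is not true in practice). The function name and the element names are hypothetical and not taken from CERIF.

```python
def document_match(cris, cerif):
    """Split schema elements into the three categories named on the slide."""
    return {
        "CRIS=CERIF": cris & cerif,   # present in both schemas
        "CERIF>CRIS": cerif - cris,   # in CERIF, missing in the CRIS
        "CRIS>CERIF": cris - cerif,   # in the CRIS, missing in CERIF
    }


# Toy schemas; the element names are illustrative only.
cerif_elements = {"project.title", "person.name", "orgunit.name"}
cris_elements = {"project.title", "person.name", "project.local_budget_code"}

report = document_match(cris_elements, cerif_elements)
for category, elements in report.items():
    print(category, sorted(elements))
```

The "CRIS>CERIF" set is the material from which epitaxial growth would extend the schema.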

Page 16: CRIS and DataSpaces by Epitaxial Growth

Achievement: Schema Matching and Mapping

  – lAnguage match;
  – Lexical match;
  – sYntactic match;
  – sEmantic match;
  – Reconciliation;
  – Epitaxial growth;

Schemas of CRIS and CERIF represented as fully connected cyclic graphs.

[Diagram: a graph node carrying the Match Flags A, L, Y, E]

Page 17: CRIS and DataSpaces by Epitaxial Growth

Achievement: Match Flags

Phase          E type          T type         M = Match   R = paRtial match   N = No match
A = lAnguage                                  AM          AR                  AN
L = Lexical                                   LM          LR                  LN
Y = sYntactic                                 YM          YR                  YN
E = sEmantic   D = Dictionary                 EMD         ERD                 END
               T = Thesaurus   B = suBterm    EMTB        ERTB                ENTB
                               P = suPerterm  EMTP        ERTP                ENTP
               O = Ontology                   EMO         ERO                 ENO
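The flag codes above follow a regular pattern: phase letter, then result letter, then (for the semantic phase only) the E type and, for the thesaurus, the T type. As a hedged illustration, the sketch below rebuilds all 21 codes from that pattern. The function and constant names are assumptions made for this example.

```python
RESULTS = ("M", "R", "N")                 # Match, paRtial match, No match
SIMPLE_PHASES = ("A", "L", "Y")           # lAnguage, Lexical, sYntactic
SEMANTIC_TYPES = ("D", "TB", "TP", "O")   # Dictionary, Thesaurus suB/suPerterm, Ontology


def flag(phase, result, semantic_type=""):
    """Compose a match-flag code such as 'LM' or 'ERTB'."""
    if result not in RESULTS:
        raise ValueError(f"unknown result: {result!r}")
    if phase in SIMPLE_PHASES:
        if semantic_type:
            raise ValueError("only the E phase takes a semantic type")
        return phase + result
    if phase == "E":
        if semantic_type not in SEMANTIC_TYPES:
            raise ValueError(f"unknown semantic type: {semantic_type!r}")
        return "E" + result + semantic_type
    raise ValueError(f"unknown phase: {phase!r}")


# Enumerate every code in the table, row by row.
ALL_FLAGS = [flag(p, r) for p in SIMPLE_PHASES for r in RESULTS]
ALL_FLAGS += [flag("E", r, t) for t in SEMANTIC_TYPES for r in RESULTS]
print(ALL_FLAGS)
```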

Page 18: CRIS and DataSpaces by Epitaxial Growth

Achievement: Ideal

[Match Flags table, as on Page 17]

Page 19: CRIS and DataSpaces by Epitaxial Growth

Achievement: Translation

• If Flags AR | AN
• To reach AM
• Utilise E: Semantic analysis
• To achieve
  – EMD
  – EMTB
  – EMTP
  – EMO

[Diagram: a graph node carrying the flags A, L, Y, E]

Page 20: CRIS and DataSpaces by Epitaxial Growth

Achievement: Translation

[Match Flags table, as on Page 17]

Page 21: CRIS and DataSpaces by Epitaxial Growth

Achievement: Improvement

• If Flags
  – LR | LN
• To reach LM
• Utilise
  – Y: Syntactic analysis for structural positioning to achieve
    • YM
  – E: Semantic analysis for related terms to achieve
    • EMD
    • EMTB
    • EMTP
    • EMO

[Diagram: a graph node carrying the flags A, L, Y, E]

Page 22: CRIS and DataSpaces by Epitaxial Growth

Achievement: Improvement

[Match Flags table, as on Page 17]

Page 23: CRIS and DataSpaces by Epitaxial Growth

Achievement: Algorithm (sketch) 1

• AM | LM | YM | [EMD|EMO] direct mapping
  – the issue of EMTB, EMTP is dealt with below;
• In all other cases R | N may be resolved by a later phase;
• AR | AN flag precludes lexical match unless resolved by ED | ET | EO;
• AM + LR | LN flag precludes a match unless resolved by ED | ET | EO;
• Y processing indicates nodes to attempt to match using ED | ET | EO;
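The two "precludes ... unless resolved" rules can be written as a small predicate over a node's flag set. This is one possible reading rather than the authors' implementation. It assumes that "resolved by ED | ET | EO" means the node carries at least one of the semantic match flags EMD, EMTB, EMTP or EMO. The function name is hypothetical.

```python
# Assumption: any of these flags counts as a semantic resolution.
SEMANTIC_MATCHES = {"EMD", "EMTB", "EMTP", "EMO"}


def match_precluded(flags):
    """Return True if the rules on the slide rule out a match for this node."""
    # Rule: AR | AN precludes a lexical match.
    language_problem = bool(flags & {"AR", "AN"})
    # Rule: AM together with LR | LN precludes a match.
    lexical_problem = "AM" in flags and bool(flags & {"LR", "LN"})
    # Both rules are lifted when a semantic phase has produced a match.
    resolved = bool(flags & SEMANTIC_MATCHES)
    return (language_problem or lexical_problem) and not resolved


print(match_precluded({"AN"}))
print(match_precluded({"AN", "EMD"}))
print(match_precluded({"AM", "LR"}))
```

Nodes for which the predicate is true would be candidates for the later semantic phases or, failing that, for epitaxial growth.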

Page 24: CRIS and DataSpaces by Epitaxial Growth

Achievement: Algorithm (sketch) 2

• ED provides for multilingual term equality;
• ET processing like Y; the processing must cycle over the graph to find the closest YM and then apply ET to find EMTP | EMTB and allocate the flag(s) appropriately;
• EO processing like Y; the processing must cycle over the graph to find the closest match for the current node by locating EMO for terms in nodes YM-related to the current node thus determining the semantics of the current node;
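The thesaurus step (ET) can be illustrated in isolation. The sketch below covers only the term lookup, not the cycling over the graph to find the closest YM. It assumes a toy thesaurus stored as a narrower-term to broader-term dictionary, and it assumes that EMTB means the CRIS term is a subterm of the CERIF term and EMTP the reverse. The slides do not spell out that direction, and all names here are hypothetical.

```python
def broader_terms(term, broader):
    """Follow narrower -> broader links upward, guarding against cycles."""
    chain = []
    while term in broader and broader[term] not in chain:
        term = broader[term]
        chain.append(term)
    return chain


def thesaurus_flag(cris_term, cerif_term, broader):
    """Return 'EMTB', 'EMTP', or None when the thesaurus relates neither way."""
    if cerif_term in broader_terms(cris_term, broader):
        return "EMTB"   # CRIS term is a subterm of the CERIF term (assumed direction)
    if cris_term in broader_terms(cerif_term, broader):
        return "EMTP"   # CRIS term is a superterm of the CERIF term
    return None


# Toy thesaurus: each key's broader term is its value.
toy_thesaurus = {"professor": "researcher", "researcher": "person"}

print(thesaurus_flag("professor", "person", toy_thesaurus))
print(thesaurus_flag("person", "professor", toy_thesaurus))
print(thesaurus_flag("person", "project", toy_thesaurus))
```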

Page 25: CRIS and DataSpaces by Epitaxial Growth

Achievement: Epitaxial Growth

With acknowledgement to

http://www.mineral-forum.com/

Page 26: CRIS and DataSpaces by Epitaxial Growth

Achievement: Epitaxial Growth

• In the general case we now have two graphs with matching and non-matching nodes.

• It is likely that there are excess nodes in the CRIS graph (i.e. the CRIS contains all the CERIF elements, probably with different names, plus more entities / attributes suitable for local processing).

Page 27: CRIS and DataSpaces by Epitaxial Growth

Achievement: Epitaxial Growth: CRIS Superset

[Diagram: CRIS and CERIF schema graphs side by side, the CRIS graph containing more nodes than the CERIF graph; e = entity, a = attribute]

Page 28: CRIS and DataSpaces by Epitaxial Growth

Achievement: Epitaxial Growth: CRIS Subset

[Diagram: CRIS and CERIF schema graphs side by side, the CRIS graph containing fewer nodes than the CERIF graph; e = entity, a = attribute]

Page 29: CRIS and DataSpaces by Epitaxial Growth

Achievement: Epitaxial Growth: Algorithm (sketch) 1

• For nodes in the CRIS schema that do not have corresponding nodes in the CERIF schema:
• Their syntactic positions in the graph, related to nodes in the CRIS schema, are analysed
  – starting from the root and working down the graph, left to right at each level;
• The language is flagged
  – so that this can be the variable value input in the CERIF language attribute;
• The lexical term for the entity or attribute name is noted ready for translation to the canonical language for interoperation
  – with the CERIF translation attribute set to automatic;
• The syntactic structure present in the CRIS but not in CERIF is then added to the CERIF schema as the epitaxial growth...
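"Starting from the root and working down the graph, left to right at each level" describes a breadth-first (level-order) traversal. The sketch below is a minimal illustration under that reading. The adjacency-list representation, the function name and the toy schema are assumptions, and a visited set is kept because the schemas are described as cyclic graphs.

```python
from collections import deque


def unmatched_in_level_order(root, children, matched):
    """List CRIS nodes without a CERIF counterpart, level by level, left to right."""
    order = []
    seen = {root}            # guards against revisiting nodes in cyclic graphs
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node not in matched:
            order.append(node)
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


# Toy CRIS schema graph; names are illustrative only.
cris_children = {
    "cris": ["project", "person"],
    "project": ["title", "budget_code"],
    "person": ["name", "office_room"],
}
matched_nodes = {"cris", "project", "person", "title", "name"}

print(unmatched_in_level_order("cris", cris_children, matched_nodes))
```

The resulting list gives the order in which excess CRIS nodes would be considered for addition to the CERIF schema.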

Page 30: CRIS and DataSpaces by Epitaxial Growth

Achievement: Epitaxial Growth: Algorithm (sketch) 2

• This is achieved in the following steps:
  – if the additional nodes are attributes under an existing entity ‘X’, a new CERIF entity ‘Xadditional’ is created with the attributes as nodes under it and ‘Xadditional’ is linked with a cardinality 1:1 with ‘X’ in CERIF using matched primary keys;
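The ‘Xadditional’ step can be sketched on a toy schema held in a dictionary. This is an illustration under stated assumptions, not CERIF's actual data model: the dictionary layout, the key names and the function name are all hypothetical, and the schema is modified in place for brevity.

```python
def grow_additional_attributes(schema, entity, extra_attributes):
    """Create '<entity>additional' for the extra attributes and link it 1:1 to entity."""
    key = schema["entities"][entity]["primary_key"]
    new_entity = entity + "additional"
    schema["entities"][new_entity] = {
        "primary_key": key,                   # matched primary key, as on the slide
        "attributes": list(extra_attributes),
    }
    schema["links"].append(
        {"from": entity, "to": new_entity, "cardinality": "1:1", "on": key}
    )
    return new_entity


# Toy CERIF-like schema; entity and attribute names are illustrative only.
cerif = {
    "entities": {"project": {"primary_key": "project_id", "attributes": ["title"]}},
    "links": [],
}
created = grow_additional_attributes(cerif, "project", ["local_budget_code"])
print(created, cerif["links"])
```

Keeping the extra attributes in a separate entity leaves the original entity ‘X’ untouched, so systems that only understand the canonical schema are unaffected.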

Page 31: CRIS and DataSpaces by Epitaxial Growth

Achievement: Epitaxial Growth: Additional Attributes

[Diagram labels: CRIS ‘Y’; CRIS Entity ‘Y=X’; CRIS Entity ‘Y≠X’; CERIF Entity ‘X’; NEW CERIF Entity ‘Xadditional’; NEW CERIF Linking Relation ‘X’ ‘Xadditional’]

Page 32: CRIS and DataSpaces by Epitaxial Growth

Achievement: Epitaxial Growth: Algorithm (sketch) 3

– if the additional nodes include an additional entity ‘Y’, linking relations are created in CERIF to link ‘Y’ to existing CERIF entities and attributes of ‘Y’ are nodes under ‘Y’;

– This step depends on the complexity of the CRIS: primary/foreign key relationships (with cardinality 1:n) can be analysed to generate the linking relations and if the CRIS already has n:m cardinality relationships using linking relations these can be mapped directly;
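Deriving linking relations for a new entity ‘Y’ from primary/foreign key relationships might look like the sketch below. It assumes the CRIS foreign keys are available as (child entity, foreign-key column, parent entity) triples. That representation, the relation naming scheme and the function name are hypothetical.

```python
def linking_relations_for(entity, foreign_keys):
    """Build one linking relation per foreign key that touches the given entity."""
    relations = []
    for child, column, parent in foreign_keys:
        if entity in (child, parent):
            other = parent if child == entity else child
            relations.append(
                {
                    "name": f"{entity}_{other}",
                    "links": (entity, other),
                    "derived_from": f"{child}.{column}",  # the 1:n key it came from
                }
            )
    return relations


# Toy CRIS foreign keys; 'equipment' plays the role of the additional entity 'Y'.
cris_foreign_keys = [
    ("equipment", "project_id", "project"),
    ("booking", "equipment_id", "equipment"),
    ("person", "orgunit_id", "orgunit"),
]

for relation in linking_relations_for("equipment", cris_foreign_keys):
    print(relation)
```

Existing n:m linking relations in the CRIS would not need this derivation, since the slide notes they can be mapped directly.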

Page 33: CRIS and DataSpaces by Epitaxial Growth

Achievement: Epitaxial Growth: Additional Entity

[Diagram labels: CRIS ‘Y’; CRIS Entity ‘Y≠X’ for all X; links to other entities; NEW CERIF Entity ‘Y’; NEW CERIF Linking Relations from ‘Y’ to other entities (three shown)]

Page 34: CRIS and DataSpaces by Epitaxial Growth

Achievement: Epitaxial Growth: Algorithm (sketch) 4

• A similar process is followed for nodes in the CERIF schema not having corresponding nodes in the CRIS schema.

Page 35: CRIS and DataSpaces by Epitaxial Growth

With acknowledgement to

http://www.mineral-forum.com/

Page 36: CRIS and DataSpaces by Epitaxial Growth

Achievement: Other Challenges

• Challenge 1: DISCOVERY
  – managed by the CRIS community knowledge (e.g. DRIS) and/or by web services searches UDDI/WSDL.
• Challenge 2: DESCRIPTION
  – the schema of any structured (or semi-structured) CRIS.
• {Challenges 3 and 4: MATCHING and MAPPING
  – described above}
• Challenge 5: CONVERSION
  – using technology from Hypermedata (SkKoBeJe1999). From the schema graphs of the CRIS and CERIF the convertor can be generated to convert instances under one schema to instances under another.
• Challenges 6 and 7: QUERY and RESPONSE
  – using technology from MIPS (JeHuKaWiBeMa94)
• Challenge 8: CHANGE
  – managed by repeating the above as required

Page 38: CRIS and DataSpaces by Epitaxial Growth

Conclusion 1

• questions facing society today can be solved, or partially solved, using research information to guide actions and developments.
• world-scale problems cannot be solved by local-scale research information.
• base technology to support and move forward is interoperating (CERIF-)CRIS within a (European) e-infrastructure.
• current technologies do not scale to ‘future internet’.
• responsibility of the CRIS community to find, develop and promote a solution for interoperating CRIS that is reliable, scalable, economic and has appropriate security and privacy.
• interoperating CRIS using epitaxial growth is a promising line of development.

Page 39: CRIS and DataSpaces by Epitaxial Growth

Conclusion 2

And of course, it would be much more effective and efficient……..

……..If all research information systems used CERIF as their storage and interoperation format