2013-05-17 - Utrecht Matej Ďurčo, ICLTT, Vienna Controlled Vocabularies and SMC4LRT Semantic Mapping in CMDI



Page 1: Controlled Vocabularies and SMC 4 LRT Semantic  Mapping in CMDI

2013-05-17 - Utrecht
Matej Ďurčo, ICLTT, Vienna

Controlled Vocabularies and SMC4LRT

Semantic Mapping in CMDI

Page 2: Context

Activities:

• CLARIN taskforce – within SCCTC
  building on CLAVAS - Vocabulary Alignment Service for CLARIN

• DARIAH joint taskforce
  VCC1/Task 5: Data federation and interoperability, and VCC3/Task 3: Reference Data Registries (and external partners).
  Goal: establish a service providing controlled vocabularies and reference data for the DARIAH (and CLARIN) community.

• SMC – Semantic Mapping Component, a module in the CMD-Infrastructure
  Goal: "semantic search" = enhance the search in the heterogeneous data collection (of CMDI)
  a) by exploiting the shared data categories (SMC on schema level)
  b) by expressing the data in RDF (SMC on instance level)

Page 3: Context II - CLARIN-AT

• CCV – CLARIN Center Vienna
  CenterProfile CMD record
  http://clarin.aac.ac.at/ccv/index.html
  expected ready by: 2013-06

  Infrastructure services:
  • CLARIN Metadata Repository
  • SMC – Semantic Mapping Component
  • SMC-Browser

• Controlled Vocabularies: engagement in CLARIN + DARIAH task forces

Page 4: Old vision

[Figure: conceptualization sketch from 2009]

Page 5: Potential usages for CV

● Metadata Generation, Curation

● Data-Enrichment / Annotation

● Data Analysis

● Search (Query Expansion, autocomplete, facets etc.)

● needed for CMD2RDF
  - provide identifiers for entities
  (- provide equivalencies between concepts/entities from different vocabularies (concept schemes)?
     like the equivalencies in Wikipedia (page for Johann Wolfgang Goethe):
     GND: 118540238 | LCCN: n79003362 | NDL: 00441109 | VIAF: 24602065)
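The Goethe example can be read as one row of an equivalence table between authority files. A minimal sketch of such a table follows; the identifiers are the ones listed on the slide, while the data layout and the name `same_as` are assumptions made for illustration only.

```python
# Identifiers for Johann Wolfgang Goethe, as listed on the slide.
# The data layout and the function name are illustrative assumptions.
EQUIVALENCE_SETS = [
    {"GND": "118540238", "LCCN": "n79003362", "NDL": "00441109", "VIAF": "24602065"},
]


def same_as(scheme: str, identifier: str) -> dict[str, str]:
    """Return the identifiers in other schemes recorded as equivalent."""
    for entry in EQUIVALENCE_SETS:
        if entry.get(scheme) == identifier:
            return {s: i for s, i in entry.items() if s != scheme}
    return {}


print(same_as("GND", "118540238"))
```

Given any one identifier, the others follow from a single lookup, which is what a CMD2RDF conversion would need in order to link the same entity across vocabularies.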

Page 6: Related Activities

● DARIAH Schema Registry + Crosswalk Registry

● LT-World @DFKI
  full-blown ontology with People, Projects, Organisations, Events, LR
  integration would have to happen at another level (RDF/LOD).

● CoNE – Control of Named Entities @MPDL/eSciDoc
  http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities

● EATS - Entity Authority Tool Set @New Zealand Electronic Text Centre (NZETC)
  http://eats.readthedocs.org/en/latest/

● TextGrid

● http://www.dnb.de/DE/Standardisierung/LinksAFS/linksafs_node.html

● FRBR - Functional Requirements for Bibliographic Records
  RDA - Resource Description and Access
  http://metadaten-twr.org/ - Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)

Page 7: Candidate Vocabularies

● Data Categories / Concepts - ISOcat

● Languages - ISO-639

● Countries - country codes

● Persons - GND, VIAF, dbpedia?

● Organizations - GND, VIAF, dbpedia?

● Subjects (Schlagwörter) - GND, LCSH

● Resource Typology -

● Tagsets!? (with mappings between tags)

AAT - Art & Architecture Thesaurus
GND - Gemeinsame Normdatei (Integrated Authority File, DNB)
GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
VIAF - Virtual International Authority File

Page 8: ISOcat and CLAVAS

• export closed+simple DCs
  (perhaps even better to manually select)

• Third party applications use
  - ISOcat for explain() function
  - CLAVAS for value(/entity)-lists

Page 9: Informed query input

Information about the available data categories, and the values for those categories, can be used as the basis for a complex query-input widget with context-sensitive autocomplete.

However, this serves rather only as a fallback to autocomplete based on the actual data.
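The fallback order described here can be sketched as follows: completions come from values actually observed in the data, and the controlled vocabulary is consulted only when the data yields nothing. The function name `suggest` and all sample values are hypothetical.

```python
def suggest(prefix, data_values, vocabulary_values, limit=5):
    """Suggest completions: values seen in the data first, vocabulary as fallback."""
    p = prefix.lower()
    from_data = sorted(v for v in set(data_values) if v.lower().startswith(p))
    if from_data:
        return from_data[:limit]
    return sorted(v for v in set(vocabulary_values) if v.lower().startswith(p))[:limit]


# Hypothetical values for a "language name" field.
seen_in_data = ["German", "Dutch", "German"]
from_vocabulary = ["German", "Georgian", "Greek", "Dutch"]

print(suggest("ge", seen_in_data, from_vocabulary))  # ['German']
print(suggest("gr", seen_in_data, from_vocabulary))  # ['Greek']
```

Note that "Georgian" is not offered for the prefix "ge": the data already supplies a match, so the vocabulary is never consulted.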

Page 10: CMD RDF

• Semantic Mapping on instance level
  express MD records in RDF (for LOD)
  => bind also values in MD fields to concepts

• Modelling aspects
  • CMD Specification
  • Data Categories
  • CMD instances:
    - Identifier, Provenance, Hierarchy,
    - Components, Elements,
    - Values, Literal Values, Mapping to entities – Vocabularies => CLAVAS
  • Ontological Relations

Used namespaces

Prefix name   Prefix IRI
rdf:          http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:         http://www.w3.org/2000/01/rdf-schema#
xsd:          http://www.w3.org/2001/XMLSchema#
owl:          http://www.w3.org/2002/07/owl#
skos:         http://www.w3.org/2004/02/skos/core#
isocat:       http://www.isocat.org/datcat/
dcr:          http://isocat.org/ns/dcr.rdf#
cmd:          http://clarin.eu/cmd/1.0#
cmd_spec:     ?
dce:          http://purl.org/dc/elements/1.1/
dcterms:      http://purl.org/dc/terms/
oa:           http://www.w3.org/ns/oa#
ore:          http://www.openarchives.org/ore/terms/
cr:           http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/
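As a small illustration of how these prefixes are used, the sketch below expands a prefixed name such as isocat:DC-2544 into a full IRI. Only a few of the namespaces from the slide are included, and the helper name `expand` is an assumption.

```python
# A few of the prefixes listed on the slide; the helper itself is illustrative.
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "isocat": "http://www.isocat.org/datcat/",
    "dce": "http://purl.org/dc/elements/1.1/",
}


def expand(curie: str) -> str:
    """Turn 'prefix:local' into a full IRI using the namespace table."""
    prefix, _, local = curie.partition(":")
    if prefix not in NAMESPACES:
        raise KeyError(f"unknown prefix: {prefix}")
    return NAMESPACES[prefix] + local


print(expand("isocat:DC-2544"))  # http://www.isocat.org/datcat/DC-2544
```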

Page 11: Approach – Individuals/Instance Level

One step when (pre)processing incoming new MD-sets:

1. Express MD-Records as RDF-triples:
   <#mdrecord #property "string-value">

2. Identify potential target Domain Ontologies/Vocabularies

3. Create inverted Index:
   label → entity

4. Define lookup function:
   lookup(category, string-value) → <external-entity, measure>

5. Enrich dataset with new facts:
   <#mdrecord #property #external-entity>

6. Property-values of Metadata-Records are linked to individuals of domain ontologies
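The steps on this slide can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the SMC implementation: the entity URIs and labels are made up, the similarity measure is simply `difflib.SequenceMatcher.ratio()` from the Python standard library, and the threshold of 0.8 is arbitrary.

```python
from difflib import SequenceMatcher

# Step 3: inverted index label -> entity, one index per category.
# Entity URIs and labels below are made up for illustration.
INDEX = {
    "Person": {
        "goethe, johann wolfgang von": "http://example.org/person/1",
        "schiller, friedrich": "http://example.org/person/2",
    },
}


def lookup(category, string_value):
    """Step 4: return (external_entity, measure) for the best-matching label."""
    best = (None, 0.0)
    needle = string_value.strip().lower()
    for label, entity in INDEX.get(category, {}).items():
        measure = SequenceMatcher(None, needle, label).ratio()
        if measure > best[1]:
            best = (entity, measure)
    return best


def enrich(triples, category_of, threshold=0.8):
    """Step 5: add <record property entity> facts next to the literal triples."""
    new_facts = []
    for record, prop, value in triples:
        entity, measure = lookup(category_of.get(prop, ""), value)
        if entity is not None and measure >= threshold:
            new_facts.append((record, prop, entity))
    return triples + new_facts


# Step 1: MD records expressed as triples with literal values.
triples = [("#mdrecord1", "#creator", "Goethe, Johann Wolfgang von")]
for triple in enrich(triples, {"#creator": "Person"}):
    print(triple)
```

Returning the measure together with the entity leaves the decision about what counts as a match to the caller, which seems to be the intent of the `<external-entity, measure>` signature.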

Page 12: Candidate Categories/Properties

• ResourceType, Format, AnnotationLevelType
  → map to: isocat-DataCategories (Profiles: Metadata, Morphosyntax, ...)

• Genre, Topic, Subject
  → map to: Taxonomies, Library Classification systems (LCSH, DDC, Dornseiff, ...)

• Project, Institution, Person, Publisher
  open controlled vocabularies (real entities)
  → map to: CLAVAS-organisations, LT-World (perhaps others: LCCN, DBPedia?)

Page 13: Next Steps

• Install current OpenSKOS at CCV – CLARIN Center Vienna

• synchronize 3 current datasets via OAI-PMH with sister instance at Meertens
  also to test the synchronization process (and implications)

• CMD2RDF

• "special groups vocabularies" in CLARIN-AT
  • Plant names
  • Instruments
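For the OAI-PMH synchronization, a harvester essentially issues `ListRecords` requests and follows resumption tokens until none is returned. The sketch below shows only that request/paging logic, using the Python standard library. The base URL is a placeholder, `oai_dc` is used merely because every OAI-PMH repository must support it (which format the OpenSKOS instances would actually exchange is not stated here), and the sample response is hand-written rather than fetched.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request; a resumptionToken replaces all other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one response."""
    root = ET.fromstring(xml_text)
    identifiers = [e.text for e in root.iter(f"{OAI_NS}identifier")]
    token = root.find(f".//{OAI_NS}resumptionToken")
    next_token = token.text if token is not None and token.text else None
    return identifiers, next_token


# A shortened, hand-written response used instead of a live request.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>concept-1</identifier></header></record>
    <record><header><identifier>concept-2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

ids, token = parse_page(SAMPLE)
print(ids, token)
print(list_records_url("http://example.org/oai", resumption_token=token))
```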

Page 14: Appendix

Explanations to SMC and CMDI

Page 15: Semantic Mapping (schema level) - concept

• metadata fields in (completely) different profiles
  but bound to (the same) data categories (ConceptLinks)

• use this linkage when searching in the data
  i.e. allow the user to search
  a) "in the data category"
  b) in a MD field but also all related fields from other profiles

• Multiple mapping levels:
  1. just mapping based on the ConceptLink resolvable via ComponentRegistry
     different elements pointing to the same DatCat
  2. use equivalence relations between DatCats from Relation Registry
  3. use equivalence relations also between Container DatCats
  4. use also other relations in Relation Registry (subClassOf, almostSameAs, …)
  5. apply selected (user defined) relation sets from Relation Registry
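The first two mapping levels can be sketched as follows. The field paths and DatCat identifiers are taken from the examples elsewhere in these slides, but the tables are tiny excerpts, and the equivalence between DC-2544 and DC-2545 is a made-up relation used only to show the mechanism; all names are hypothetical.

```python
# Level 1: ConceptLinks, i.e. which fields (cmdIndex paths) point to which DatCat.
# The paths are taken from the slides; the table itself is a tiny excerpt.
FIELDS_BY_DATCAT = {
    "isocat:DC-2544": ["collection.GeneralInfo.Name", "imdi-corpus.Corpus.Name"],
    "isocat:DC-2545": ["BamdesCommonFields.resourceTitle", "imdi-corpus.Corpus.Title"],
}

# Level 2: equivalence relations as a Relation Registry might hold them.
# This particular pair is a made-up example, not an actual relation set.
EQUIVALENT = {("isocat:DC-2544", "isocat:DC-2545")}


def related_datcats(datcat):
    """Return the DatCat plus everything reachable via equivalence relations."""
    found, todo = {datcat}, [datcat]
    while todo:
        current = todo.pop()
        for a, b in EQUIVALENT:
            for src, dst in ((a, b), (b, a)):
                if src == current and dst not in found:
                    found.add(dst)
                    todo.append(dst)
    return found


def fields_for(datcat, use_relations=False):
    """Level 1 by default; level 2 when equivalence relations are switched on."""
    datcats = related_datcats(datcat) if use_relations else {datcat}
    return sorted(f for dc in datcats for f in FIELDS_BY_DATCAT.get(dc, []))


print(fields_for("isocat:DC-2544"))
print(fields_for("isocat:DC-2544", use_relations=True))
```

Keeping the relation set as a separate input mirrors level 5: a different (user-defined) set can be swapped in without touching the ConceptLink table.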

Page 16: CMDI linking

• components and elements in CMD profiles are bound to data categories
• the CMD records reference their profiles
• in Relation Registry data categories are related to each other
  in separate (possibly overlapping/contradicting) relation sets

Page 17: Semantic Mapping Component

• separate CMDI module
• relies on information from ComponentRegistry, DCR, RelationRegistry
• is used by Metadata Repository / Service / Browser
• Task:
  resolution: dcrIndex ↔ cmdIndex
  dcrIndex :: (abstract) data category defined in DCR
  cmdIndex :: path to a field in a MDRecord

• (different from
  - query expansion: CQL(datcat) → CQL(cmdIndex[])
  - query translation: e.g. CQL → XPath)

Input                                                     Output
dcrIndex   isocat.DC-2545 (= isocat.resourceTitle)   =>   cmdIndex[]   [BamdesCommonFields.resourceTitle, imdi-corpus.Corpus.Title, …]
cmdIndex   Actor.Role                                =>   dcrIndex     isocat:DC-2559 (participantRole)
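The resolution task amounts to a lookup in both directions. The sketch below uses only the two examples given on this slide; the list for DC-2545 is incomplete because the slide elides the remaining entries, the colon notation for DatCat identifiers is used throughout for consistency, and the name `resolve` is an assumption.

```python
# Excerpt of the mapping, using the two examples from the slide.
# Only two cmdIndex entries for DC-2545 are listed; the slide elides the rest.
DCR_TO_CMD = {
    "isocat:DC-2545": ["BamdesCommonFields.resourceTitle", "imdi-corpus.Corpus.Title"],
    "isocat:DC-2559": ["Actor.Role"],
}

# Inverse direction, derived once from the table above.
CMD_TO_DCR = {path: dc for dc, paths in DCR_TO_CMD.items() for path in paths}


def resolve(index):
    """dcrIndex -> list of cmdIndex paths; cmdIndex -> its dcrIndex (or None)."""
    if index in DCR_TO_CMD:
        return DCR_TO_CMD[index]
    return CMD_TO_DCR.get(index)


print(resolve("isocat:DC-2545"))
print(resolve("Actor.Role"))  # isocat:DC-2559
```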

Page 18: Examples of DCR use in CMD metadata

• resourceName isocat:DC-2544
  - CorpusProfile.Corpus.Metadata.Name
  - CorpusProfile.Corpus.SourceList.Source.Name
  - collection.GeneralInfo.Name
  - Session.Name
  - imdi-corpus.Corpus.Name
  - ToolService.GeneralInfo.Name
  - GTRP.Collection.GeneralInfo.Name
  - DIDDD.Collection.GeneralInfo.Name
  - Soundbites.Collection.GeneralInfo.Name
  - DynaSAND.Collection.GeneralInfo.Name

BUT:

• CMD Element: "Name"
  - http://www.isocat.org/datcat/DC-2544
  - http://www.isocat.org/datcat/DC-2536
  - http://www.isocat.org/datcat/DC-4160
  - http://www.isocat.org/datcat/DC-4176
  - http://www.isocat.org/datcat/DC-4180
  - http://purl.org/dc/elements/1.1/rights
  - http://purl.org/dc/elements/1.1/contributor
  - http://www.isocat.org/datcat/DC-2454
  - http://www.isocat.org/datcat/DC-2557
  - …

CMD Element name   |distinct Elems|   |distinct DatCats|
Name               40                 11
Type               16                 8
Title              14                 6
Language           10                 6
ID                 11                 5
format             10                 5
identifier         6                  5
Description        31                 4
Code               8                  4
date               12                 4
publisher          9                  4
source             10                 4
subject            6                  4
Creator            6                  3
Address            5                  3
Organisation       3                  3
Availability       6                  3
datatype           8                  3
contributor        4                  3
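Counts of this kind can be derived from a list of (path, DatCat) bindings. The sketch below assumes that "distinct Elems" means distinct element paths sharing the same final name, which the slide does not spell out. The bindings are a small excerpt, so the resulting numbers do not match the table on this slide, and the last binding is purely hypothetical, added to show one name bound to two DatCats.

```python
from collections import defaultdict

# (cmdIndex path, DatCat) pairs; all but the last are taken from the slides.
BINDINGS = [
    ("collection.GeneralInfo.Name", "isocat:DC-2544"),
    ("Session.Name", "isocat:DC-2544"),
    ("imdi-corpus.Corpus.Name", "isocat:DC-2544"),
    ("imdi-corpus.Corpus.Title", "isocat:DC-2545"),
    ("BamdesCommonFields.resourceTitle", "isocat:DC-2545"),
    # Hypothetical binding, not from the slides:
    ("HypotheticalProfile.Contact.Name", "isocat:DC-2454"),
]


def stats_by_element_name(bindings):
    """Count distinct elements (paths) and distinct DatCats per element name."""
    paths, datcats = defaultdict(set), defaultdict(set)
    for path, datcat in bindings:
        name = path.rsplit(".", 1)[-1]  # last path step is the element name
        paths[name].add(path)
        datcats[name].add(datcat)
    return {name: (len(paths[name]), len(datcats[name])) for name in paths}


for name, (n_elems, n_datcats) in sorted(stats_by_element_name(BINDINGS).items()):
    print(name, n_elems, n_datcats)
```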

Page 19: Examples of DCR use in CMD metadata II

• languageID isocat:DC-2482
  - LrtInventoryResource.LrtCommon.Languages.ISO639.iso-639-3-code
  - Session.MDGroup.Content.Content_Languages.Content_Language.Id
  - Session.MDGroup.Actors.Actor.Actor_Languages.Actor_Language.Id
  - Session.Resources.WrittenResource.LanguageId
  - ToolService.Documentation.DocumentationLanguages.Language.ISO639.iso-639-3-code
  - ToolService.Tool.Documentation.DocumentationLanguages.Language.ISO639.iso-639-3-code
  - GTRP.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code
  - DIDDD.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code
  - DynaSAND.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code

• languageName isocat:DC-2484
  - ToolService.Documentation.DocumentationLanguages.Language.LanguageName
  - ToolService.Tool.Documentation.DocumentationLanguages.Language.LanguageName
  - GTRP.Collection.DocumentationLanguages.Language.LanguageName
  - DIDDD.Collection.DocumentationLanguages.Language.LanguageName
  - DynaSAND.Collection.DocumentationLanguages.Language.LanguageName

• dct:language
  - OLAC-DcmiTerms.language

• metadataLanguage isocat:DC-2543
  - CorpusProfile.Corpus.Metadata

• dominantLanguage isocat:DC-2468
  - Session.MDGroup.Content.Content_Languages.Content_Language.Dominant

• sourceLanguage isocat:DC-2494
  - Session.MDGroup.Content.Content_Languages.Content_Language.SourceLanguage

• targetLanguage isocat:DC-2499
  - Session.MDGroup.Content.Content_Languages.Content_Language.TargetLanguage

• implementationLanguage isocat:DC-3798
  - ToolService.Tool.Implementation.implementationLanguage

Page 20: DCR usage in Component Registry

as of 2012-05

Datcats in CompReg              288
  ISOcat                        164
  dc-elems                       15
  dc-terms                       55
  private ISOcat DatCats (?)     54
Elements with Datcats           82.38%
Components with Datcats          67

Data Categories Sets            827
  isocat (Metadata Profile#5)   712
  dublincore elements            16
  dublincore terms               99

Components structure

Component Registry
  CMD-Profiles                       53
  standalone Components             235 *)
  overall Components                298
  distinct Elements                 893
  all Elements                    3,030
  all paths (profile/comp/elem)   4,565

Page 21: SMC Browser

Explore the Component Metadata Framework

Profile specifications from Component Registry
visualized as interactive graphs
statistics (about reuse of Components)

http://clarin.aac.ac.at/smc-browser/

TODO
• feed with statistics of the instance data
• add relations from RELcat
• add operations on graphs (intersection, difference, …)

Page 22: SMC Browser

Explore the Component Metadata Framework