Controlled Vocabularies and SMC4LRT
Semantic Mapping in CMDI
Matej Ďurčo, ICLTT, Vienna
2013-05-17, Utrecht
Context
Activities:
• CLARIN taskforce within SCCTC
  building on CLAVAS - Vocabulary Alignment Service for CLARIN
• DARIAH joint taskforce
  VCC1/Task 5 (Data federation and interoperability) and VCC3/Task 3 (Reference Data Registries), plus external partners
  goal: establish a service providing controlled vocabularies and reference data for the DARIAH (and CLARIN) community
• SMC - Semantic Mapping Component, a module in the CMD infrastructure
  goal: "semantic search" = enhance search in the heterogeneous data collection (of CMDI)
  a) by exploiting the shared data categories (SMC on schema level)
  b) by expressing the data in RDF (SMC on instance level)
Context II - CLARIN-AT
• CCV - CLARIN Center Vienna
  CenterProfile CMD record
  http://clarin.aac.ac.at/ccv/index.html
  expected ready by: 2013-06
• Infrastructure services:
  • CLARIN Metadata Repository
  • SMC - Semantic Mapping Component
  • SMC-Browser
• Controlled Vocabularies: engagement in CLARIN + DARIAH task forces
Old vision
[Figure: conceptualization sketch from 2009]
Potential usages for CV
● Metadata generation, curation
● Data enrichment / annotation
● Data analysis
● Search (query expansion, autocomplete, facets, etc.)
● needed for CMD2RDF
  - provide identifiers for entities
  - (provide equivalencies between concepts/entities from different vocabularies (concept schemes)?
    like the equivalencies in Wikipedia (page for Johann Wolfgang Goethe):
    GND: 118540238 | LCCN: n79003362 | NDL: 00441109 | VIAF: 24602065)
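The Goethe equivalencies could be recorded as skos:exactMatch links between the authority records. The following is a minimal sketch only: the URI patterns per authority file and the function name are assumptions for illustration and are not given on the slide.

```python
from itertools import combinations

SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"

# Assumed URI patterns per authority file (illustrative, not from the slide).
URI_PATTERNS = {
    "GND": "http://d-nb.info/gnd/{id}",
    "LCCN": "http://id.loc.gov/authorities/names/{id}",
    "NDL": "http://id.ndl.go.jp/auth/ndlna/{id}",
    "VIAF": "http://viaf.org/viaf/{id}",
}


def equivalence_triples(identifiers):
    """Return N-Triples lines linking every pair of identifiers with skos:exactMatch."""
    uris = [
        URI_PATTERNS[scheme].format(id=value)
        for scheme, value in sorted(identifiers.items())
    ]
    return [f"<{a}> <{SKOS_EXACT_MATCH}> <{b}> ." for a, b in combinations(uris, 2)]


# Identifiers exactly as quoted on the slide.
goethe = {"GND": "118540238", "LCCN": "n79003362", "NDL": "00441109", "VIAF": "24602065"}
triples = equivalence_triples(goethe)
for line in triples:
    print(line)
```

Linking every pair keeps each statement usable on its own; a hub-and-spoke layout around one preferred identifier would need fewer triples.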
Related Activities
● DARIAH Schema Registry + Crosswalk Registry
● LT-World @DFKI
  full-blown ontology with People, Projects, Organisations, Events, LR
  integration would have to happen at another level (RDF/LOD)
● CoNE - Control of Named Entities @MPDL/eSciDoc
  http://colab.mpdl.mpg.de/mediawiki/Control_of_Named_Entities
● EATS - Entity Authority Tool Set @New Zealand Electronic Text Centre (NZETC)
  http://eats.readthedocs.org/en/latest/
● TextGrid
● http://www.dnb.de/DE/Standardisierung/LinksAFS/linksafs_node.html
● FRBR - Functional Requirements for Bibliographic Records
  RDA - Resource Description and Access
  http://metadaten-twr.org/ - Technology Watch Report: Standards in Metadata and Interoperability (last entry from 2011)
Candidate Vocabularies
● Data Categories / Concepts - ISOcat
● Languages - ISO-639
● Countries - country codes
● Persons - GND, VIAF, dbpedia?
● Organizations - GND, VIAF, dbpedia?
● Subject headings (Schlagwörter) - GND, LCSH
● Resource Typology -
● Tagsets!? (with mappings between tags)

AAT - Art & Architecture Thesaurus
GND - Gemeinsame Normdatei (Integrated Authority File, DNB)
GTAA - Gemeenschappelijke Thesaurus Audiovisuele Archieven (Common Thesaurus [for] Audiovisual Archives)
VIAF - Virtual International Authority File
ISOcat and CLAVAS
• export closed + simple DCs
  (perhaps even better to select manually)
• Third-party applications use
  - ISOcat for the explain() function
  - CLAVAS for value(/entity) lists
informed query input
• information about the available data categories, and about the values for those categories, can serve as the basis for a complex query-input widget with context-sensitive autocomplete
• however, this is meant only as a fallback to autocomplete based on the actual data
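One way to read the fallback idea is sketched below: suggestions drawn from values actually present in the data come first, and vocabulary values only fill the remaining slots. The function name, the two dictionaries and the sample values are hypothetical; the slide does not prescribe an implementation.

```python
def autocomplete(prefix, category, data_index, vocabulary, limit=10):
    """Suggest values for `category` that start with `prefix`.

    Values observed in the actual data are listed first; values from the
    controlled vocabulary are appended only as a fallback.
    """
    needle = prefix.casefold()

    def matches(values):
        return sorted(v for v in values if v.casefold().startswith(needle))

    suggestions = matches(data_index.get(category, ()))
    for value in matches(vocabulary.get(category, ())):
        if value not in suggestions:
            suggestions.append(value)
    return suggestions[:limit]


# Hypothetical sample: values seen in records vs. values listed in a vocabulary.
data_index = {"languageName": {"German", "Greek"}}
vocabulary = {"languageName": {"Georgian", "German", "Greek", "Gujarati"}}
suggestions = autocomplete("g", "languageName", data_index, vocabulary)
print(suggestions)
```

A stricter reading would consult the vocabulary only when the data yields no match at all; that variant would return early whenever the first list is non-empty.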
CMD RDF
• Semantic Mapping on instance level
  express MD records in RDF (for LOD)
  => also bind values in MD fields to concepts
• Modelling aspects
  • CMD Specification
  • Data Categories
  • CMD instances:
    - Identifier, Provenance, Hierarchy
    - Components, Elements
    - Values, Literal Values, Mapping to entities - Vocabularies => CLAVAS
  • Ontological Relations
used namespaces
Prefix name   Prefix IRI
rdf:          http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs:         http://www.w3.org/2000/01/rdf-schema#
xsd:          http://www.w3.org/2001/XMLSchema#
owl:          http://www.w3.org/2002/07/owl#
skos:         http://www.w3.org/2004/02/skos/core#
isocat:       http://www.isocat.org/datcat/
dcr:          http://isocat.org/ns/dcr.rdf#
cmd:          http://clarin.eu/cmd/1.0#
cmd_spec:     ?
dce:          http://purl.org/dc/elements/1.1/
dcterms:      http://purl.org/dc/terms/
oa:           http://www.w3.org/ns/oa#
ore:          http://www.openarchives.org/ore/terms/
cr:           http://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/
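As an illustration of how these prefixes are used, the sketch below expands a prefixed name such as isocat:DC-2544 into a full IRI. Only a subset of the listed prefixes is included, and the function name `expand` is an assumption for this example.

```python
# Subset of the prefixes listed above (cmd_spec has no IRI yet and is left out).
NAMESPACES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "isocat": "http://www.isocat.org/datcat/",
    "cmd": "http://clarin.eu/cmd/1.0#",
    "dce": "http://purl.org/dc/elements/1.1/",
}


def expand(prefixed_name):
    """Turn 'prefix:local' into a full IRI using NAMESPACES."""
    prefix, separator, local = prefixed_name.partition(":")
    if not separator or prefix not in NAMESPACES:
        raise KeyError(f"unknown prefix in {prefixed_name!r}")
    return NAMESPACES[prefix] + local


print(expand("isocat:DC-2544"))
print(expand("dce:rights"))
```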
Approach - Individuals/Instance Level
One step when (pre)processing incoming new MD sets:
1. Express MD records as RDF triples:
   <#mdrecord #property "string-value">
2. Identify potential target domain ontologies/vocabularies
3. Create inverted index:
   label → entity
4. Define lookup function:
   lookup(category, string-value) → <external-entity, measure>
5. Enrich dataset with new facts:
   <#mdrecord #property #external-entity>
6. Property values of metadata records are linked to individuals of domain ontologies
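The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the SMC implementation: the similarity measure (difflib's SequenceMatcher ratio), the 0.8 threshold, the extra `indexes` argument to `lookup`, and the example.org entity are all invented for the example.

```python
from difflib import SequenceMatcher


def build_index(vocabulary):
    """Step 3: invert {entity: [labels]} into {normalized label: entity}.

    If two entities share a label, the later one wins in this simple sketch.
    """
    return {
        label.casefold(): entity
        for entity, labels in vocabulary.items()
        for label in labels
    }


def lookup(category, string_value, indexes, threshold=0.8):
    """Step 4: return (entity, measure) for the best-matching label, or None."""
    index = indexes.get(category, {})
    needle = string_value.strip().casefold()
    best = None
    for label, entity in index.items():
        measure = SequenceMatcher(None, needle, label).ratio()
        if best is None or measure > best[1]:
            best = (entity, measure)
    if best is not None and best[1] >= threshold:
        return best
    return None


def enrich(triples, indexes):
    """Step 5: add <record property entity> for every literal that resolves."""
    new_facts = []
    for record, prop, value in triples:
        match = lookup(prop, value, indexes)
        if match is not None:
            new_facts.append((record, prop, match[0]))
    return list(triples) + new_facts


# Hypothetical vocabulary and records.
indexes = {
    "Organisation": build_index(
        {"http://example.org/org/1": ["Example Institute", "EI"]}
    )
}
triples = [
    ("#rec1", "Organisation", " example institute"),
    ("#rec1", "Title", "Some corpus"),
]
enriched = enrich(triples, indexes)
```

Keeping the original literal triple next to the new entity triple means the enrichment can be re-run or reverted without losing the source value.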
Candidate Categories/Properties
• ResourceType, Format, AnnotationLevelType
  → map to: isocat DataCategories (Profiles: Metadata, Morphosyntax, ...)
• Genre, Topic, Subject
  → map to: Taxonomies, Library Classification systems (LCSH, DDC, Dornseiff, ...)
• Project, Institution, Person, Publisher
  open controlled vocabularies (real entities)
  → map to: CLAVAS-organisations, LT-World (perhaps others: LCCN, DBPedia?)
Next Steps
• Install current OpenSKOS at CCV - CLARIN Center Vienna
• synchronize 3 current datasets via OAI-PMH with the sister instance at Meertens,
  also to test the synchronization process (and its implications)
• CMD2RDF
• "special groups vocabularies" in CLARIN-AT
  • Plant names
  • Instruments
Appendix
Explanations of SMC and CMDI
Semantic Mapping (schema level) - concept
• metadata fields in (completely) different profiles but bound to (the same) data categories (ConceptLinks)
• use this linkage when searching in the data, i.e. allow the user to search
  a) "in the data category"
  b) in a MD field but also in all related fields from other profiles
• Multiple mapping levels:
  1. just mapping based on the ConceptLink resolvable via ComponentRegistry: different elements pointing to the same DatCat
  2. use equivalence relations between DatCats from Relation Registry
  3. use equivalence relations also between Container DatCats
  4. use also other relations in Relation Registry (subClassOf, almostSameAs, …)
  5. apply selected (user defined) relation sets from Relation Registry
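The mapping levels differ mainly in which relations are followed before fields are collected. A minimal sketch, assuming relations are available as (source, relation, target) tuples and treating every selected relation as undirected (a simplification); the sameAs relation and the OLAC-DcmiTerms.title path below are invented sample data.

```python
def related_datcats(datcat, relations, relation_types):
    """Collect data categories reachable from `datcat` via the selected relation types."""
    seen = {datcat}
    stack = [datcat]
    while stack:
        current = stack.pop()
        for source, relation, target in relations:
            if relation not in relation_types:
                continue
            # Simplification: follow each relation in both directions.
            for a, b in ((source, target), (target, source)):
                if a == current and b not in seen:
                    seen.add(b)
                    stack.append(b)
    return seen


def search_fields(datcat, concept_links, relations=(), relation_types=()):
    """Return all metadata fields bound to `datcat` or to a related data category."""
    fields = set()
    for dc in related_datcats(datcat, relations, relation_types):
        fields.update(concept_links.get(dc, ()))
    return sorted(fields)


# ConceptLinks: data category -> fields bound to it (last entry is hypothetical).
concept_links = {
    "isocat:DC-2545": ["BamdesCommonFields.resourceTitle", "imdi-corpus.Corpus.Title"],
    "dce:title": ["OLAC-DcmiTerms.title"],
}
# Hypothetical relation set, as it might come from the Relation Registry.
relations = [("isocat:DC-2545", "sameAs", "dce:title")]

level1 = search_fields("isocat:DC-2545", concept_links)
level2 = search_fields("isocat:DC-2545", concept_links, relations, {"sameAs"})
```

Passing a different `relation_types` set (or a different relation list) corresponds to levels 4 and 5, where the user selects which relations count.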
CMDI linking
• components and elements in CMD profiles are bound to data categories
• the CMD records reference their profiles
• in Relation Registry data categories are related to each other in separate (possibly overlapping/contradicting) relation sets
Semantic Mapping Component
• separate CMDI module
• relies on information from ComponentRegistry, DCR, RelationRegistry
• is used by Metadata Repository / Service / Browser
• Task: resolution: dcrIndex ↔ cmdIndex
  dcrIndex :: (abstract) data category defined in DCR
  cmdIndex :: path to a field in a MDRecord
• (different from
  - query expansion: CQL(datcat) → CQL(cmdIndex[])
  - query translation: e.g. CQL → XPath)

Input                                                  Output
dcrIndex   isocat.DC-2545 (= isocat.resourceTitle) =>  cmdIndex[]  [BamdesCommonFields.resourceTitle, imdi-corpus.Corpus.Title, …]
cmdIndex   Actor.Role                              =>  dcrIndex    isocat:DC-2559 (participantRole)
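The resolution task and the query expansion built on it can be sketched in a few lines. The table below holds only the three bindings from the example; the function names and the CQL-style output format are assumptions for illustration, and the colon notation (isocat:DC-2545) is used throughout.

```python
# cmdIndex -> dcrIndex bindings taken from the Input/Output example above.
CONCEPT_LINKS = {
    "BamdesCommonFields.resourceTitle": "isocat:DC-2545",
    "imdi-corpus.Corpus.Title": "isocat:DC-2545",
    "Actor.Role": "isocat:DC-2559",
}


def to_dcr_index(cmd_index):
    """Resolve a field path to its data category (None if unbound)."""
    return CONCEPT_LINKS.get(cmd_index)


def to_cmd_indexes(dcr_index):
    """Resolve a data category to all field paths bound to it."""
    return sorted(path for path, dc in CONCEPT_LINKS.items() if dc == dcr_index)


def expand_query(dcr_index, relation, term):
    """Query expansion: one clause per resolved field, joined with OR."""
    clauses = [f'{path} {relation} "{term}"' for path in to_cmd_indexes(dcr_index)]
    return " OR ".join(clauses)


print(to_dcr_index("Actor.Role"))
print(to_cmd_indexes("isocat:DC-2545"))
print(expand_query("isocat:DC-2545", "=", "Wien"))
```

Resolution only returns indexes; turning them into a query, as `expand_query` does, is the separate expansion step that the slide distinguishes from the core task.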
Examples of DCR use in CMD metadata
• resourceName isocat:DC-2544
  - CorpusProfile.Corpus.Metadata.Name
  - CorpusProfile.Corpus.SourceList.Source.Name
  - collection.GeneralInfo.Name
  - Session.Name
  - imdi-corpus.Corpus.Name
  - ToolService.GeneralInfo.Name
  - GTRP.Collection.GeneralInfo.Name
  - DIDDD.Collection.GeneralInfo.Name
  - Soundbites.Collection.GeneralInfo.Name
  - DynaSAND.Collection.GeneralInfo.Name
BUT:
• CMD Element: "Name"
  - http://www.isocat.org/datcat/DC-2544
  - http://www.isocat.org/datcat/DC-2536
  - http://www.isocat.org/datcat/DC-4160
  - http://www.isocat.org/datcat/DC-4176
  - http://www.isocat.org/datcat/DC-4180
  - http://purl.org/dc/elements/1.1/rights
  - http://purl.org/dc/elements/1.1/contributor
  - http://www.isocat.org/datcat/DC-2454
  - http://www.isocat.org/datcat/DC-2557
  - …

CMD Element name   |distinct Elems|   |distinct DatCats|
Name               40                 11
Type               16                 8
Title              14                 6
Language           10                 6
ID                 11                 5
format             10                 5
identifier         6                  5
Description        31                 4
Code               8                  4
date               12                 4
publisher          9                  4
source             10                 4
subject            6                  4
Creator            6                  3
Address            5                  3
Organisation       3                  3
Availability       6                  3
datatype           8                  3
contributor        4                  3
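Counts like those in the table could be derived from a flat list of element declarations. The sketch below assumes each declaration is available as an (element id, element name, data category) tuple; the ids and the sample rows are invented, and the sort order is a choice made here rather than the ordering used in the table.

```python
from collections import defaultdict


def name_statistics(elements):
    """Count distinct elements and distinct data categories per element name."""
    element_ids = defaultdict(set)
    datcats = defaultdict(set)
    for element_id, name, datcat in elements:
        element_ids[name].add(element_id)
        if datcat is not None:
            datcats[name].add(datcat)
    rows = [(name, len(element_ids[name]), len(datcats[name])) for name in element_ids]
    # Most ambiguous names first: by number of data categories, then elements, then name.
    return sorted(rows, key=lambda row: (-row[2], -row[1], row[0]))


# Hypothetical declarations: (element id, element name, data category or None).
elements = [
    ("e1", "Name", "isocat:DC-2544"),
    ("e2", "Name", "isocat:DC-2536"),
    ("e3", "Name", "isocat:DC-2544"),
    ("e4", "Title", "isocat:DC-2545"),
    ("e5", "Note", None),
]
stats = name_statistics(elements)
```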
Examples of DCR use in CMD metadata II
• languageID isocat:DC-2482
  - LrtInventoryResource.LrtCommon.Languages.ISO639.iso-639-3-code
  - Session.MDGroup.Content.Content_Languages.Content_Language.Id
  - Session.MDGroup.Actors.Actor.Actor_Languages.Actor_Language.Id
  - Session.Resources.WrittenResource.LanguageId
  - ToolService.Documentation.DocumentationLanguages.Language.ISO639.iso-639-3-code
  - ToolService.Tool.Documentation.DocumentationLanguages.Language.ISO639.iso-639-3-code
  - GTRP.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code
  - DIDDD.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code
  - DynaSAND.Collection.DocumentationLanguages.Language.ISO639.iso-639-3-code
• languageName isocat:DC-2484
  - ToolService.Documentation.DocumentationLanguages.Language.LanguageName
  - ToolService.Tool.Documentation.DocumentationLanguages.Language.LanguageName
  - GTRP.Collection.DocumentationLanguages.Language.LanguageName
  - DIDDD.Collection.DocumentationLanguages.Language.LanguageName
  - DynaSAND.Collection.DocumentationLanguages.Language.LanguageName
• dct:language
  - OLAC-DcmiTerms.language
• metadataLanguage isocat:DC-2543
  - CorpusProfile.Corpus.Metadata
• dominantLanguage isocat:DC-2468
  - Session.MDGroup.Content.Content_Languages.Content_Language.Dominant
• sourceLanguage isocat:DC-2494
  - Session.MDGroup.Content.Content_Languages.Content_Language.SourceLanguage
• targetLanguage isocat:DC-2499
  - Session.MDGroup.Content.Content_Languages.Content_Language.TargetLanguage
• implementationLanguage isocat:DC-3798
  - ToolService.Tool.Implementation.implementationLanguage
DCR usage in Component Registry

Datcats in CompReg                 288
  ISOcat                           164
  dc-elems                          15
  dc-terms                          55
  private ISOcat DatCats (?)        54
Elements with Datcats           82.38%
Components with Datcats             67

Data Categories Sets               827
  isocat (Metadata Profile#5)      712
  dublincore elements               16
  dublincore terms                  99

Component Registry
  CMD-Profiles                      53
  standalone Components          235*)
  overall Components               298
  distinct Elements                893
  all Elements                   3,030
  all paths (profile/comp/elem)  4,565

Components structure
as of 2012-05
SMC Browser
Explore the Component Metadata Framework
• Profile specifications from Component Registry, visualized as interactive graphs
• statistics (about reuse of Components)
http://clarin.aac.ac.at/smc-browser/

TODO
• feed with statistics of the instance data
• add relations from RELcat
• add operations on graphs (intersection, difference, …)