Open Data Support Webinar on metadata quality and alignment

Upload: gyles-williams

Post on 22-Dec-2015

TRANSCRIPT

  • Slide 1
  • Open Data Support Webinar on metadata quality and alignment
  • Slide 2
  • This presentation has been created by PwC. Authors: Stefanos Kotoglou, Nikolaos Loutas, Niels Vandekeybus, Bert Van Nuffelen. Disclaimer: The views expressed in this presentation are purely those of the authors and may not, in any circumstances, be interpreted as stating an official position of the European Commission. The European Commission does not guarantee the accuracy of the information included in this presentation, nor does it accept any responsibility for any use thereof. Reference herein to any specific products, specifications, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favouring by the European Commission. All care has been taken by the author to ensure that s/he has obtained, where necessary, permission to use any parts of manuscripts including illustrations, maps, and graphs, on which intellectual property rights already exist from the titular holder(s) of such rights or from her/his or their legal representative. Presentation metadata: Open Data Support is funded by the European Commission under SMART 2012/0107 Lot 2: Provision of services for the Publication, Access and Reuse of Open Public Data across the European Union, through existing open data portals (Contract No. 30-CE-0530965/00-17).
  • Slide 3
  • Agenda: Introduction of the participants; Agenda and objectives of the webinar; Practical details on how the participants can interact with the system; Welcome to the webinar; Open Data Support project and its outcomes; Overview of our harvesting approach; Demo ODIP; Introduction to the DCAT-AP; The mappings; The validation service; Implementation experiences; Alignment of national/European data models for describing datasets with the DCAT-AP; Open discussion on high-quality metadata as an enabler to open data reuse; Wrap-up and next steps.
  • Slide 4
  • Welcome to the webinar
  • Slide 5
  • The Open Data Support project and its outcomes
  • Slide 6
  • Open Data Support mission: To improve the visibility and facilitate the access to datasets published on local and national Open Data portals in order to increase their reuse within and across borders. See also: http://www.slideshare.net/OpenDataSupport
  • Slide 7
  • By... Providing to metadata descriptions of open datasets via ODIP and a Pan-European Data portal
  • Slide 8
  • The three types of services. Publishing services: We help (potential) Open Government Data publishers to prepare, transform and publish reusable metadata descriptions of their datasets. Training services: We train public administrations across Europe on the value of (Linked) Open Government Data and help them build capacity in effectively and efficiently publishing data. Consulting services: We provide consultancy services tailored to the needs of public administrations covering a wide spectrum of topics from IT to licensing of (Linked) Open Government Data.
  • Slide 9
  • Our key outcomes: 16 portals harvested on ODIP; 71670 datasets; 24 onsite training sessions delivered in 20 Member States; more than 1000 public servants from European public administrations trained. Our software is available for reuse at: https://joinup.ec.europa.eu/community/ods/og_page/software Our training material is available for use and reuse at: training.opendatasupport.eu
  • Slide 10
  • The Open Data Interoperability Platform (ODIP) enables you to share metadata of datasets described using the DCAT-AP, thus improving the discoverability and visibility of your datasets, eventually leading to wider reuse.
  • Slide 11
  • Overview of our harvesting approach. What can ODIP do? (1) Harvest metadata from an Open Data portal. (2) Transform the metadata to RDF. (3) Harmonise the RDF metadata produced in the previous steps with DCAT-AP. (4) Validate the harmonised metadata against the DCAT-AP. (5) Publish the description metadata as Linked Open Metadata. (6) Translate metadata automatically into English. (Diagram: ODIP, Pan-European Data portal.)
  • Slide 12
  • Example: Harvesting a CKAN-based Open Data portal. 1. Create a new job on ODIP. 2. Extraction phase: add and configure a CKAN Extractor to harvest data from a CKAN API. 3. Transformation phase: add the ODS Value mapper; add a SPARQL Update Query Transformer with the pertinent queries; add the ODS Cleaner; add and configure the DCAT Application Profile Harmoniser; add the Modification detector; add the ODS Validator; add Web Translations. 4. Loading phase: load the extracted data into a Virtuoso RDF Store via the Virtuoso Loader. 5. Schedule the job on ODIP.
  • Slide 13
  • Example 1. Create job: Creating a job on ODIP. To create a new job, click on New Job. Provide a name for the job and a short description. Press the Add button to determine the plug-ins to deploy. At the bottom part of the screen you can configure the actual tasks within each of the three phases by selecting a tab. For each phase you can add and configure modules accordingly.
  • Slide 14
  • Example 2. Extraction: Adding and configuring a CKAN Extractor to harvest data from a CKAN API. After adding the CKAN extractor plugin you will be prompted to fill out a form with the following fields. The Web location of the CKAN portal you wish to harvest: the portal should support API version 3 and the API must be enabled. Publisher, license, title and description: used in the stored catalog for the dct:publisher, dct:license, dct:title and dct:description properties. Predicate prefix: the CKAN API response is in JSON, which is then converted into RDF; JSON attributes are converted to predicates by appending them to the predicate prefix. Subject prefix: the prefix used to create a URI for the metadata of each harvested dataset. The subject is created as /dataset/ Ignored keys: a comma-separated list of JSON attributes that should not be converted to RDF triples.
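The extractor settings described on this slide can be illustrated with a minimal Python sketch. It is not the ODIP plugin code: the function name, the example prefixes and the exact subject pattern (subject prefix + `/dataset/` + the record's `id`) are assumptions made for illustration, since the transcript only shows `/dataset/`.

```python
from typing import Any, Dict, List, Tuple

Triple = Tuple[str, str, Any]


def ckan_json_to_triples(record: Dict[str, Any],
                         subject_prefix: str,
                         predicate_prefix: str,
                         ignored_keys: str) -> List[Triple]:
    """Turn one CKAN JSON record into (subject, predicate, value) triples."""
    # "Ignored keys" is a comma-separated list of JSON attributes to skip.
    ignored = {key.strip() for key in ignored_keys.split(",") if key.strip()}
    # Assumed subject pattern: <subject prefix>/dataset/<id> (hypothetical).
    subject = f"{subject_prefix}/dataset/{record['id']}"
    triples: List[Triple] = []
    for key, value in record.items():
        if key in ignored or value is None:
            continue
        # JSON attributes become predicates by appending them to the prefix.
        predicate = predicate_prefix + key
        values = value if isinstance(value, list) else [value]
        for item in values:
            triples.append((subject, predicate, item))
    return triples


if __name__ == "__main__":
    example = {"id": "abc", "title": "Air quality", "revision_id": "r1"}
    for triple in ckan_json_to_triples(example, "http://example.org",
                                       "http://example.org/ckan#",
                                       "revision_id, private"):
        print(triple)
```

Nested values are kept as plain Python objects here; a real converter would also need to decide how to represent nested JSON objects such as resources.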
  • Slide 15
  • Example 3. Transformation: Adding and configuring plug-ins to harmonise data (1/3). Start by adding the ODS DCAT Application Profile Harmonizer. This plugin will create the harmonized catalog data and a basic skeleton for each dataset it identifies. Provide a name to identify the catalog. Use the Modification Detector to compare provenance data generated by the CKAN extractor between the current and previous version of the raw data to set the dct:modified field of the catalog records. No configuration is required.
  • Slide 16
  • Example 3. Transformation: Adding and configuring plug-ins to harmonise data (2/3). Use the SPARQL Update Query Transformer to map existing properties and values to the ones recommended by the DCAT-AP, for example mapping the description of a dataset to dct:description as required by the DCAT-AP. Use the ODS Cleaner Plugin to remove raw data loaded into the working set before storing it into a harmonized graph. No configuration is required.
  • Slide 17
  • Example 3. Transformation: Adding and configuring plug-ins to harmonise data (3/3). Result: the final result of your harmonisation pipeline should look similar to the following. Configure the Virtuoso Loader to load the harmonized data into Virtuoso.
  • Slide 18
  • Example 4. Loading: Load the extracted data into a Virtuoso RDF Store via the Virtuoso Loader. The Virtuoso Loader will store the generated triples in the Virtuoso RDF store. The triples will be inserted into a graph of your choice. The Virtuoso Loader needs host, port and user credentials to connect to your Virtuoso server.
  • Slide 19
  • 5. Scheduling a job on ODIP. A job can be scheduled to run at a set interval or chained after another job. Interval scheduling, examples: "0 0 4 * * *" runs each day at 4 am; "0 0 0 * * 1" runs each Monday at midnight; "0 30 * * * *" runs every half past the hour. Chained scheduling: select a job after which this job should be executed.
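To make the interval examples concrete, here is a small Python sketch that checks whether a moment in time matches such an expression. It assumes the expressions have six fields in the order second, minute, hour, day of month, month, day of week, with 1 standing for Monday, which is what the first two slide examples suggest; the half-past-the-hour example is written with six fields under that same assumption. Only `*` and plain numbers are handled, and the function name is illustrative rather than part of ODIP.

```python
from datetime import datetime


def matches(expression: str, moment: datetime) -> bool:
    """Return True if `moment` satisfies a six-field schedule expression."""
    parts = expression.split()
    if len(parts) != 6:
        raise ValueError("expected: second minute hour day month weekday")
    actual = (
        moment.second,
        moment.minute,
        moment.hour,
        moment.day,
        moment.month,
        moment.isoweekday() % 7,  # Monday=1 ... Saturday=6, Sunday=0
    )
    # A field matches when it is the wildcard or equals the actual value.
    return all(p == "*" or int(p) == a for p, a in zip(parts, actual))


if __name__ == "__main__":
    # 21 December 2015 was a Monday.
    print(matches("0 0 0 * * 1", datetime(2015, 12, 21, 0, 0, 0)))
    print(matches("0 0 4 * * *", datetime(2015, 12, 22, 4, 0, 0)))
    print(matches("0 30 * * * *", datetime(2015, 12, 22, 13, 30, 0)))
```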
  • Slide 20
  • 6. Automated translations to English. If no English title is present in the catalog metadata, a machine translation is provided. Translations have their own pipeline, because there is a monetary cost.
  • Slide 21
  • The DCAT Application Profile: a common vocabulary for describing datasets hosted in data portals in Europe, based on the Data Catalogue vocabulary (DCAT).
  • Slide 22
  • A shared initiative of... Funded by the ISA Programme under Action 1.1, Improving semantic interoperability in European eGovernment systems (a.k.a. the SEMIC project).
  • Slide 23
  • An international Working Group of experts. Chair: Antonio Carneiro (Publications Office). 59 Working Group members representing: 15 different European Member States and EEA (UK, IT, ES, DK, DE, SK, BE, AT, SE, FI, FR, IE, NL, GR, SI, NO); US; several European Institutions and international organisations; 40 different Data Portals. See also: https://joinup.ec.europa.eu/asset/dcat_application_profile/description
  • Slide 24
  • By using a common metadata schema to describe datasets and sharing metadata... Data publishers increase discoverability and thus reuse of their data. Data reusers can uniformly search across platforms without facing difficulties caused by the use of separate models or language differences. The quality and the availability of the description metadata directly affect how easily datasets can be found!
  • Slide 25
  • The DCAT-AP enables the exchange of description metadata between data portals
  • Slide 26
  • The DCAT Application Profile data model
  • Slide 27
  • Controlled vocabularies
    Property URI | Used for Class | Proposed vocabulary
    dcat:mediaType | Distribution | MDR File types Name Authority List
    dcat:theme | Dataset | EuroVoc domains
    dcat:themeTaxonomy | Catalog | EuroVoc
    dct:accrualPeriodicity | Dataset | Dublin Core Collection Description Frequency Vocabulary
    dct:format | Distribution | MDR File Type Named Authority List
    dct:language | Catalog, Dataset | MDR Languages Named Authority List
    dct:publisher | Catalog, Dataset | MDR Corporate bodies Named Authority List
    dct:spatial | Catalog, Dataset | MDR Countries Named Authority List, MDR Places Named Authority List
    adms:status | CatalogRecord | ADMS change type vocabulary
    dct:type | License Document | ADMS license type vocabulary
  • Slide 28
  • Alignment of national/European data models for describing datasets with the DCAT-AP
  • Slide 29
  • Overview of national data models: 12 portals use a CKAN model (exposed via API v3 or higher); 2 portals provide RDF exports in DCAT(-AP): Spain and Slovenia; 1 portal provides CSV export in its own data model: Greece; 1 portal provides data via a catalog interoperability protocol: Norway. Note: Since CKAN covers the majority of the harvested Open Data portals and the mapping issues are similar in the other cases, we will concentrate in the following on the CKAN harvesting.
  • Slide 30
  • Mapping CKAN data model to DCAT-AP. Based on CKAN JSON API v3 or higher. The CKAN JSON API provides a single access point to: the metadata descriptions of the datasets; data to support the CKAN system (who entered the record, is the dataset private, counters, etc.). About 50 CKAN properties compared to 36 DCAT-AP properties.
  • Slide 31
  • Mapping CKAN data model to DCAT-AP. The CKAN data model consists of 2 concepts: Dataset and Resource. DCAT-AP has more concepts: Publisher, ContactPoint, ... Mapping of CKAN to DCAT-AP requires the creation of new identifiers. Some decisions are different: e.g. license is at the level of Dataset in the case of CKAN, while in DCAT-AP it is at the Distribution level.
  • Slide 32
  • CKAN API technical information. Extension method 1: introduce portal-specific metadata. Extension method 2: update the CKAN API model. Completeness: the fact that a property exists in the message does not mean that it carries a value.
  • Slide 33
  • Common mapping of CKAN API to DCAT-AP: Dataset
    CKAN property | Requirement | DCAT property
    id | Optional | dct:identifier
    license | Optional | dct:license
    title | Mandatory | dct:title
    num_resources | Recommended | dcat:keyword
    metadata_created | Optional | dct:issued
    metadata_modified | Optional | dct:modified
    notes | Mandatory | dct:description
    frequency | Optional | dct:accrualPeriodicity
    key/extra [key=categorisation] | Recommended | dcat:theme
  • Slide 34
  • Common mapping of CKAN API to DCAT-AP: Distribution
    CKAN property | Requirement | DCAT property
    revision_timestamp | Optional | dct:modified
    description | Recommended | dct:description
    format | Recommended | dct:format
    name | Recommended | dct:title
    url | Mandatory | dcat:accessURL
    created | Optional | dct:issued
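The property mappings on these slides, together with the earlier remark that the license sits on the Dataset in CKAN but on the Distribution in DCAT-AP, can be sketched in Python as follows. This is an illustrative sketch, not the ODIP harmoniser: the function and dictionary names are hypothetical, only a subset of the slide's Dataset mapping is included, the `resources` key is assumed to hold the CKAN resources of a dataset, and the access URL is written as `dcat:accessURL` because that property belongs to the DCAT namespace.

```python
from typing import Any, Dict, List, Tuple

# Subset of the Dataset mapping listed on the slide.
DATASET_MAP = {
    "id": "dct:identifier",
    "title": "dct:title",
    "notes": "dct:description",
    "metadata_created": "dct:issued",
    "metadata_modified": "dct:modified",
    "frequency": "dct:accrualPeriodicity",
}

# Distribution mapping listed on the slide.
DISTRIBUTION_MAP = {
    "name": "dct:title",
    "description": "dct:description",
    "format": "dct:format",
    "url": "dcat:accessURL",
    "created": "dct:issued",
    "revision_timestamp": "dct:modified",
}

MANDATORY_DATASET = ("dct:title", "dct:description")


def _rename(source: Dict[str, Any], mapping: Dict[str, str]) -> Dict[str, Any]:
    # A key that is present but empty carries no value, so it is skipped.
    return {target: source[key] for key, target in mapping.items()
            if source.get(key) not in (None, "")}


def map_dataset(ckan: Dict[str, Any]) -> Tuple[Dict[str, Any],
                                                List[Dict[str, Any]],
                                                List[str]]:
    """Return (dataset, distributions, missing mandatory dataset properties)."""
    dataset = _rename(ckan, DATASET_MAP)
    distributions = []
    for resource in ckan.get("resources", []):
        distribution = _rename(resource, DISTRIBUTION_MAP)
        # CKAN keeps the license on the dataset; copy it to each distribution.
        if ckan.get("license"):
            distribution["dct:license"] = ckan["license"]
        distributions.append(distribution)
    missing = [p for p in MANDATORY_DATASET if p not in dataset]
    return dataset, distributions, missing
```

Returning the list of missing mandatory properties keeps the mapping informative rather than restrictive: an incomplete record is still produced, and the gap is reported alongside it.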
  • Slide 35
  • Common mapping of CKAN API to DCAT-AP: Contact Point
    CKAN property | Requirement | DCAT property
    maintainer_email | Optional | vCard, v:email
    author_email | Optional | vCard, v:fn
  • Slide 36
  • Common mapping of CKAN API to DCAT-AP: Publisher
    CKAN property | Requirement | DCAT property
    extras[key=publisher] | Mandatory | foaf:name
    extras[key=publisher_email] | Optional | foaf:mbox
  • Slide 37
  • Alignment on values: mapping national taxonomies to controlled vocabularies. Columns: Eurovoc Domain, DE, GB, ES. 04 POLITICS: politik_wahlen, Government, http://datos.gob.es/kos/sector-publico/sector/sector-publico Society Administration http://datos.gob.es/kos/sector-publico/sector/seguridad 08 INTERNATIONAL RELATIONS: Defence. 10 EUROPEAN COMMUNITIES. 12 LAW: gesetze_justiz, http://datos.gob.es/kos/sector-publico/sector/legislacion-justicia 16 ECONOMICS: economy, Economy, http://datos.gob.es/kos/sector-publico/sector/economia
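A value alignment of this kind boils down to a lookup from national theme values to EuroVoc domains. The Python sketch below uses only pairs that can be read with reasonable confidence from the flattened slide table; which national value belongs to which domain is therefore an assumption, the URIs are written with the line-break gaps closed, and the function name is hypothetical.

```python
from typing import Optional

# National theme value -> EuroVoc domain, as read from the slide (assumed pairing).
THEME_TO_EUROVOC = {
    "politik_wahlen": "04 POLITICS",
    "gesetze_justiz": "12 LAW",
    "http://datos.gob.es/kos/sector-publico/sector/legislacion-justicia": "12 LAW",
    "http://datos.gob.es/kos/sector-publico/sector/economia": "16 ECONOMICS",
}


def align_theme(value: str) -> Optional[str]:
    """Return the EuroVoc domain for a national theme value, or None if unmapped."""
    return THEME_TO_EUROVOC.get(value.strip())
```

Unmapped values return `None` instead of raising an error, so a harvesting run can log them and continue.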
  • Slide 38
  • Summary of predicate mappings. Columns: Class; % mandatory properties covered; % recommended properties covered; % optional properties covered. Dataset: 2 properties, 5 properties, 12 properties. Distribution: 1 property, 3 properties, 8 properties. Contact Point: 2 properties. Publisher: 1 property.
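The percentages in the per-country summaries can be read as covered properties divided by the number of properties at that requirement level. The Python sketch below works under that assumption, using the Dataset and Distribution totals from this slide; the reading of the three counts as mandatory, recommended and optional follows the column order, and the covered counts in the usage lines are hypothetical inputs, not harvested results.

```python
# Number of DCAT-AP properties per class and requirement level (from the slide).
TOTALS = {
    "Dataset": {"mandatory": 2, "recommended": 5, "optional": 12},
    "Distribution": {"mandatory": 1, "recommended": 3, "optional": 8},
}


def coverage(cls: str, level: str, covered: int) -> float:
    """Percentage of properties of `cls` at `level` that a mapping covers."""
    total = TOTALS[cls][level]
    if not 0 <= covered <= total:
        raise ValueError(f"covered must be between 0 and {total}")
    return round(100 * covered / total, 1)


if __name__ == "__main__":
    # Hypothetical counts: 7 of 12 optional Dataset properties mapped.
    print(coverage("Dataset", "optional", 7))
    # Hypothetical counts: 5 of 8 optional Distribution properties mapped.
    print(coverage("Distribution", "optional", 5))
```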
  • Slide 39
  • Summary Austria (21 properties) 100% Dataset 100% Distribution 100% Contact Point 40% 100% Publisher 50% 0% 36 properties in extra
  • Slide 40
  • Summary Germany (22 properties) 100% Dataset 100% Distribution 100% Contact Point 62.5% 100% Publisher 50% 0% 95 properties in extra
  • Slide 41
  • Summary Spain (16 properties) 100% Dataset 100% Distribution 100% Contact Point 75% 100% Publisher 58,3% 0% 0 properties in extra
  • Slide 42
  • Summary Finland (20 properties) 100% Dataset 100% Distribution 100% Contact Point 25% 100% Publisher 58,3% 0% 40 properties in extra
  • Slide 43
  • Summary France (18 properties) 100% Dataset 100% Distribution 100% Contact Point 25% 100% Publisher 58,3% 0% 66% 49 properties in extra
  • Slide 44
  • Summary Great Britain (26 properties) 100% Dataset 100% Distribution 100% Contact Point 75% 100% Publisher 50% 0% 84 properties in extra
  • Slide 45
  • Summary Greece (12 properties) 100% Dataset 100% Distribution 100% Contact Point 25% 100% Publisher 33% 0% 66%
  • Slide 46
  • Summary Ireland (18 properties) 100% Dataset 100% Distribution 100% Contact Point 37,5% 100% Publisher 41,6% 0% 60% 6 properties in extra
  • Slide 47
  • Summary Italy (14 properties) 100% Dataset 100% Distribution 100% Contact Point 37,5% 100% Publisher 16,6% 0% 60% 69 properties in extra
  • Slide 48
  • Summary The Netherlands (16 properties) 100% Dataset 100% Distribution 100% Contact Point 33% 100% Publisher 33% 0% 75% 66% 14 properties in extra
  • Slide 49
  • Summary Romania (20 properties) 100% Dataset 100% Distribution 100% Contact Point 50% 100% Publisher 50% 0% 2 properties in extra
  • Slide 50
  • Summary Slovenia (21 properties) 100% Dataset 100% Distribution 100% Contact Point 37,5% 100% Publisher 33% 0% 80%
  • Slide 51
  • Summary Sweden (20 properties) 100% Dataset 100% Distribution 100% Contact Point 50% 100% Publisher 41,6% 0%
  • Slide 52
  • Summary Slovakia (19 properties) 100% Dataset 100% Distribution 100% Contact Point 62,5% 100% Publisher 41,6% 0% 80% 66% 1 property in extra
  • Slide 53
  • Summary EU ODP ( properties) 100% Dataset 100% Distribution 100% Contact Point 62,5% 100% Publisher 75% 3 properties in extra 80%
  • Slide 54
  • Validation. Collection objective: collect the available datasets and make the outcome of the harmonization informative, not restrictive. The summary of the quality of the mapping process is available in ODIP, but also in public logs: http://data.opendatasupport.eu/validation/.log
  • Slide 55
  • Validation: the summary is available in ODIP, but also in public logs: http://data.opendatasupport.eu/validation/.log
  • Slide 56
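  • Result of the harvesting (numbers vary over time)
    Catalog | Datasets | Datasets without distribution | Datasets with distribution | Distributions | Avg distributions per dataset
    at | 2211 | 96 | 2115 | 9397 | 4,4
    de | 8211 | 13 | 8198 | 22064 | 2,7
    es | 6460 | 554 | 5906 | 18644 | 3,2
    fi | 398 | 1 | 397 | 505 | 1,3
    fr | 13826 | 325 | 13501 | 35604 | 2,6
    gb | 19977 | 4842 | 15135 | 86925 | 5,7
    gr | 76 | 24 | 52 | 128 | 2,5
    ie | 423 | 1 | 422 | 999 | 2,4
    it | 9027 | 177 | 8850 | 23209 | 2,6
    nl | 2674 | 182 | 2492 | 7741 | 3,1
    no | 228 | 0 | | 332 | 1,5
    odp | 7681 | 33 | 7648 | 41783 | 5,5
    ro | 195 | 0 | | 1367 | 7,0
    se | 54 | 0 | | 105 | 1,9
    sk | 219 | 0 | | 560 | 2,6
    sl | 10 | 0 | | 50 | 5,0
    16 | 71670 | 6248 | 65422 | 249413 | 3,8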
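The figures on this slide can be cross-checked with a few lines of Python. The sketch assumes that the average is the number of distributions divided by the number of datasets that have at least one distribution, which fits the rows used here (for example 9397 / 2115 for at); the three rows are values read from the flattened table, and the function names are illustrative.

```python
# catalog: (datasets, without distribution, with distribution, distributions)
ROWS = {
    "at": (2211, 96, 2115, 9397),
    "de": (8211, 13, 8198, 22064),
    "gr": (76, 24, 52, 128),
}


def is_consistent(datasets: int, without: int, with_distribution: int) -> bool:
    """Every dataset either has a distribution or has none."""
    return datasets == without + with_distribution


def average_distributions(with_distribution: int, distributions: int) -> float:
    """Average distributions per dataset that has at least one distribution."""
    if with_distribution == 0:
        return 0.0
    return round(distributions / with_distribution, 1)


if __name__ == "__main__":
    for catalog, (datasets, without, with_dist, dists) in ROWS.items():
        print(catalog,
              is_consistent(datasets, without, with_dist),
              average_distributions(with_dist, dists))
```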
  • Slide 57
  • Configuration & logs: public pointers (1/5), addendum. Open Data Portal of the European Union: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/odp.md ; validation log http://data.opendatasupport.eu/validation/odp.log Great Britain: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/gb.md ; validation log http://data.opendatasupport.eu/validation/gb.log Austria: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/at.md ; validation log http://data.opendatasupport.eu/validation/at.log
  • Slide 58
  • Configuration & logs: public pointers (2/5), addendum. Germany: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/de.md ; validation log http://data.opendatasupport.eu/validation/de.log Spain: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/es.html ; validation log http://data.opendatasupport.eu/validation/es.log Finland: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/fi.md ; validation log http://data.opendatasupport.eu/validation/fi.log
  • Slide 59
  • Configuration & logs: public pointers (3/5), addendum. France: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/fr.md ; validation log http://data.opendatasupport.eu/validation/fr.log Greece: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/gr.html ; validation log http://data.opendatasupport.eu/validation/gr.log Ireland: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/ie.md ; validation log http://data.opendatasupport.eu/validation/ie.log
  • Slide 60
  • Configuration & logs: public pointers (4/5), addendum. Italy: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/it.html ; validation log http://data.opendatasupport.eu/validation/it.log The Netherlands: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/nl.md ; validation log http://data.opendatasupport.eu/validation/nl.log Romania: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/ro.md ; validation log http://data.opendatasupport.eu/validation/ro.log
  • Slide 61
  • Configuration & logs: public pointers (5/5), addendum. Sweden: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/se.md ; validation log http://data.opendatasupport.eu/validation/se.log Slovakia: mapping queries https://github.com/tenforce/OpenDataSupport/blob/master/odip.opendatasupport.eu/mapping_queries/sk.md ; validation log http://data.opendatasupport.eu/validation/sk.log
  • Slide 62
  • High-quality metadata as an enabler to open data reuse
  • Slide 63
  • Wrap-up and next steps: A Working Group on the revision of the DCAT-AP is supported by ISA Action 1.1. Indicate your interest to participate. Share your change requests with us.
  • Slide 64
  • Be part of our team... Find us on, Contact us, Join us on, Follow us: Open Data Support http://www.slideshare.net/OpenDataSupport http://www.opendatasupport.eu Open Data Support http://goo.gl/y9ZZI @OpenDataSupport [email protected]
  • Slide 65