CEDAR & PRELIDA Preservation of Linked Socio-Historical Data Albert Meroño-Peñuela @albertmeronyo PRELIDA consolidation workshop @ ISWC, 17-10-2014

Upload: prelida-project

Post on 01-Jul-2015

DESCRIPTION

By Albert Meroño, presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at prelida.eu

TRANSCRIPT

Page 1: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Albert Meroño-Peñuela @albertmeronyo

PRELIDA consolidation workshop @ ISWC, 17-10-2014

Page 2: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

CEDAR: Harmonizing Historical Census Data in the Semantic Web

Page 3: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

CEDAR: Source Historical Data

Dutch Historical Censuses (1795-1971)

[Public Historical Statistical Data]

Page 4: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data


From scans to spreadsheets

Page 5: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

CEDAR goal: cross queries

[Figure: a query ("?") spanning the census years 1795, 1830, 1889, 1930 and 1971 (through ~3K tables)]

Page 6: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Towards 5-star Census Data

Page 7: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Towards 5-star Census Data

>1 year ago

1 year ago

Page 8: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data
Page 9: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

• Web publishable

• Machine processable

• Dynamic schema

• Easily link with other datasets

Page 10: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Why with semantic technology?

• Web publishable, human & machine readable

• Finer granularity level (cell level)

• Statistical comparability by leveraging semantic descriptions

• Provenance

• Harmonization through linkage to other datasets (the 5th star)

Page 11: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

RDF Data Cube

“There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that they can be linked to related data sets and concepts.”

Page 12: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data
Page 13: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data
Page 14: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

RDF Data Cube vocabulary (QB)

• SDMX compatible

• Defines cubes as a set of observations that consist of dimensions, measures and attributes

• Dimensions: time period, region, sex (qb:DimensionProperty)

• Measure: population life expectancy (qb:MeasureProperty)

• Attribute: unit of measure = years, metadata status = measured (qb:AttributeProperty)

Observation: “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years”
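The split into dimensions, measures and attributes can be sketched in a few lines of Python. This is only an illustration of the data model, not part of the CEDAR tooling: the Observation class, the to_triples helper and all ex: property names are hypothetical, and only the values come from the example observation above.

```python
from dataclasses import dataclass, field


@dataclass
class Observation:
    """One cell of a cube: dimensions locate it, measures hold the value,
    attributes qualify how the value should be read."""
    dimensions: dict
    measures: dict
    attributes: dict = field(default_factory=dict)


def to_triples(subject, obs):
    """Flatten an observation into (subject, predicate, object) tuples."""
    triples = [(subject, "rdf:type", "qb:Observation")]
    for part in (obs.dimensions, obs.measures, obs.attributes):
        for prop, value in part.items():
            triples.append((subject, prop, value))
    return triples


# Values from the slide's example; the ex: names are hypothetical.
life_expectancy = Observation(
    dimensions={
        "ex:refPeriod": "2004-2006",
        "ex:refArea": "Newport",
        "ex:sex": "male",
    },
    measures={"ex:lifeExpectancy": 76.7},
    attributes={"ex:unitMeasure": "years", "ex:obsStatus": "measured"},
)

for triple in to_triples("ex:obs1", life_expectancy):
    print(triple)
```

Keeping the three groups separate mirrors the vocabulary: two observations are comparable when they share dimensions and measures, while attributes only annotate the reading.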

Page 15: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

CEDAR Integrator

https://github.com/CEDAR-project/Integrator

Page 16: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Raw data

cedar:BRT_1889_08_T1-S0-K17 a tablink:DataCell ;
    rdfs:label "K17" ;
    tablink:value "12.0" ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-A8 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K6 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-J3 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K4 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-K5 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-B8 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-C12 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-E17 ;
    tablink:dimension cedar:BRT_1889_08_T1-S0-F17 ;
    tablink:sheet cedar:BRT_1889_08_T1-S0 .

Page 17: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Harmonization Rules as Open Annotations

cedar:BRT_1889_08_T1-S0-K4-mapping a oa:Annotation ;
    oa:hasBody cedar:BRT_1889_08_T1-S0-K4-mapping-body ;
    oa:hasTarget cedar:BRT_1889_08_T1-S0-K4 ;
    oa:serializedAt "2014-09-24"^^xsd:date ;
    oa:serializedBy <https://github.com/CEDAR-project/Integrator> ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-mapping-activity .

cedar:BRT_1889_08_T1-S0-K4-mapping-body a rdfs:Resource ;
    sdmx-dimension:sex sdmx-code:sex-F .

Page 18: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Harmonized RDF Data Cube

cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
    cedar:population "12"^^xml:decimal ;
    maritalstatus:maritalStatus maritalstatus:single ;
    cedarterms:occupationPosition cedarterms:job-D ;
    sdmx-dimension:sex sdmx-code:sex-F ;
    cedarterms:occupation hisco:88030 ;
    sdmx-dimension:refArea gg:11150 ;
    prov:wasDerivedFrom cedar:BRT_1889_08_T1-S0-K17 ;
    prov:wasGeneratedBy cedar:BRT_1889_08_T1-S0-K17-activity .
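The step from raw cell to harmonized observation can be sketched as follows. This is a minimal illustration under stated assumptions, not the Integrator's actual code: the dictionaries and the harmonize function are hypothetical, and only the K4 mapping (sex = female) is taken from the slides, so every other header cell is reported as unmapped rather than guessed.

```python
# Raw cell as shown on the "Raw data" slide, reduced to plain Python data.
RAW_CELL = {
    "id": "cedar:BRT_1889_08_T1-S0-K17",
    "value": "12.0",
    "dimensions": [
        "cedar:BRT_1889_08_T1-S0-" + cell
        for cell in ("A8", "K6", "J3", "K4", "K5", "B8", "C12", "E17", "F17")
    ],
}

# Annotation bodies keyed by their target header cell.
# Only the K4 rule appears in the slides; the others are left out on purpose.
MAPPINGS = {
    "cedar:BRT_1889_08_T1-S0-K4": {"sdmx-dimension:sex": "sdmx-code:sex-F"},
}


def harmonize(cell, mappings):
    """Build an observation from a raw cell and report unmapped headers."""
    observation = {
        "rdf:type": "qb:Observation",
        "cedar:population": float(cell["value"]),
        "prov:wasDerivedFrom": cell["id"],
    }
    unmapped = []
    for header in cell["dimensions"]:
        body = mappings.get(header)
        if body is None:
            unmapped.append(header)   # no harmonization rule for this header yet
        else:
            observation.update(body)  # copy the standard dimension/value pair
    return observation, unmapped


observation, unmapped = harmonize(RAW_CELL, MAPPINGS)
print(observation)
print(len(unmapped), "header cells still lack a mapping")
```

Returning the unmapped headers alongside the observation makes missing harmonization rules visible instead of silently dropping dimensions.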

Page 19: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Classification Systems and Concept Schemes

• Some missing harmonized dimensions!

• Encode all variables and their values using concept schemes

• Some already exist

– Which ones? How many of them?

– Where?

– By whom?

– Are they used at all? Can I reuse them?

• Some need to be created

– Manual and expert knowledge based

– Can we do it automatically? Or assist the process?

Page 20: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

[Figure: linked datasets: Dutch Historical Censuses (CEDAR), Dutch Ships and Sailors, Gemeentegeschiedenis.nl, HISCO, ICONCLASS, Dutch Historical Religions, Dutch Historical House Types]

Page 21: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Existing dimensions

• HISCO

http://historyofwork.iisg.nl/

Page 22: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Existing dimensions

• Gemeentegeschiedenis.nl

Page 23: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Existing LSD dimensions

• P1: Discoverability? How to discover dimensions created by others?

• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others?

• P3: Relevance? What’s the size of LSD?

Page 24: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

LSD Dimensions

http://lsd-dimensions.org/

https://github.com/albertmeronyo/LSD-Dimensions

Hourly JSON-LD dumps

Page 25: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

http://lsd-dimensions.org/

Page 26: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data
Page 27: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data
Page 28: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data
Page 29: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Existing LSD dimensions

• P1: Discoverability? How to discover dimensions created by others? Answer: LSD Dimensions

• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others? Answer: logarithmic law / probably yes

• P3: Relevance? What’s the size of LSD? Answer: ~7.9% of the LOD cloud
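Discovery and reuse of dimensions can be illustrated with a short Python sketch. It is not a description of how LSD Dimensions is implemented: the function names, the LIMIT default and the endpoint names are hypothetical. The only grounded parts are the qb: namespace and the qb:DimensionProperty class from the RDF Data Cube vocabulary.

```python
from collections import Counter


def dimension_query(limit=100):
    """Build a SPARQL query listing the dimensions declared in an endpoint."""
    return f"""PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?dimension ?label WHERE {{
  ?dimension a qb:DimensionProperty .
  OPTIONAL {{ ?dimension rdfs:label ?label }}
}} LIMIT {limit}"""


def reuse_counts(dimensions_by_endpoint):
    """Count in how many endpoints each dimension appears."""
    counts = Counter()
    for dimensions in dimensions_by_endpoint.values():
        counts.update(set(dimensions))  # count each endpoint at most once
    return counts


# Hypothetical query results from two made-up endpoints.
found = {
    "endpoint-a": ["sdmx-dimension:refArea", "sdmx-dimension:sex"],
    "endpoint-b": ["sdmx-dimension:refArea"],
}

print(dimension_query(limit=10))
print(reuse_counts(found).most_common())
```

Sorting the counts by frequency is one way to look at the reuse distribution mentioned under P2.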

Page 30: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Creating new LSD Dimensions

• CEDAR needs concept schemes for

– Historical religious denominations (i.e. religions in the NL in 18th-20th c.)

– Historical occupations (id.)

– Historical building types (id.)

Page 31: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

https://github.com/CEDAR-project/TabCluster

Page 32: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

TabCluster

Leverages

● Lexical properties

○ Hierarchical clustering in Python scipy

○ String distances

● Semantic properties (LOD tagging)

○ skos:Concept of most frequent cluster-term

○ Closest common skos:broader skos:Concept of all cluster-terms
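The lexical half of this approach can be sketched with the standard library alone. Note the substitutions: the sketch below uses difflib's similarity ratio as the string distance and a hand-written single-linkage loop instead of scipy, so it shows the idea rather than TabCluster's actual code. The terms and the 0.3 threshold are made-up illustrations, not census data.

```python
from difflib import SequenceMatcher


def distance(a, b):
    """String distance in [0, 1]: 0 means identical, 1 means nothing shared."""
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster(terms, threshold):
    """Agglomerative single-linkage clustering of strings.

    Repeatedly merge the two closest clusters until the closest pair
    is farther apart than `threshold`.
    """
    clusters = [[term] for term in terms]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(distance(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Made-up example terms; spelling variants should end up together.
terms = ["baker", "bakers", "carpenter", "carpenters"]
for group in cluster(terms, threshold=0.3):
    print(group)
```

The threshold plays the role of the cut-off height in a dendrogram: a lower value yields more, tighter clusters.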

Page 33: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Compatibility? Remixability? Reusability?

Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic Similarity and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on Semantic Statistics (SemStats) ISWC 2014.

Page 34: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Concept Drift

Census classification of occupations as of 1859

• Root node is void

• Depth 1: occupation groups

• Leaves: actual occupations

Page 35: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Concept Drift

Census classification of occupations as of 1889

• Root node is void

• Depth 1: occupation groups

• Leaves: actual occupations

Page 36: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Concept Drift

Census classification of occupations as of 1899

• Root node is void

• Depth 1: occupation groups

• Leaves: actual occupations

Page 37: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Concept Drift

Upper ontologies (HISCO, AC)

Year-dependent ontologies

1859 1869 1879

Page 38: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Concept Drift

Upper ontologies (HISCO, AC)

Year-dependent ontologies

Page 39: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Concept Drift

Upper ontologies (HISCO, AC)

Year-dependent ontologies

? ?

Page 40: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Preserving CEDAR

Page 41: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Preserving CEDAR

• DANS-EASY as backend (http://easy.dans.knaw.nl/)

• Archived objects: Turtle snapshots

– 20 GB uncompressed, 200 MB compressed (per snapshot)

– Versioning (stats on current release)

• Users still need to

– SPARQL the data => bring up the endpoint on demand

– Run analytics on the data => outsource statistical analysis
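The snapshot sizes above imply a compression ratio of roughly 100:1, which is plausible for Turtle because prefixes and predicates repeat on nearly every line. The sketch below illustrates the effect on synthetic data. It assumes gzip, which the slides do not name, and the generated triples (the ex: names and values) are invented for the demonstration, so the ratio it prints says nothing about the real CEDAR snapshots.

```python
import gzip


def compression_ratio(data: bytes) -> float:
    """Uncompressed size divided by gzip-compressed size."""
    return len(data) / len(gzip.compress(data))


def synthetic_snapshot(n_cells: int) -> bytes:
    """Generate repetitive Turtle-like lines (made-up data, for illustration)."""
    lines = [
        f'ex:cell{i} a tablink:DataCell ; tablink:value "{i % 50}" .'
        for i in range(n_cells)
    ]
    return "\n".join(lines).encode("utf-8")


snapshot = synthetic_snapshot(10_000)
print("synthetic ratio:", round(compression_ratio(snapshot), 1))

# Ratio implied by the slide figures, both expressed in MB (20 GB vs 200 MB).
print("slide ratio:", 20_000 / 200)
```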

Page 42: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Thank you

Questions, suggestions, comments most welcome

@albertmeronyo

http://www.cedar-project.nl

http://krr.cs.vu.nl/

http://easy.dans.knaw.nl/

http://lsd-dimensions.org/

Page 43: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data

Me in 6 tweets

http://www.albertmeronyo.org

• Background: Computer Science, Web hacker, AI & Law

• PhD candidate at the VU University Amsterdam, DANS, and eHumanities group (KNAW)

• Topic: Semantic Web for the Humanities

• CEDAR project (2012-2015): harmonized historical Dutch censuses in the Semantic Web

• Problem: statistical data publishing, concept drift and dynamics of meaning

• Last paper: What is Linked Historical Data? (EKAW 2014)