Publish-Time Data Integration for Open Data Platforms
WOD 2013. Julian Eberius, Patrick Damme, Katrin Braunschweig, Maik Thiele and Wolfgang Lehner (TU Dresden).
Dipl. Medien-Inf. Julian Eberius |
Publish-Time Data Integration for Open Data Platforms
WOD 2013
Julian Eberius, Patrick Damme, Katrin Braunschweig, Maik Thiele and Wolfgang Lehner (TU Dresden)
> Motivation
> Premise
Reusability
• Standardization
• Integration
Free-For-All
• Many contributors
• Many domains
• Lack of standards
Continuous publishing without standardization will continuously increase heterogeneity on the platform.
Is there a solution without predefined schemata / ontologies?
> Problem
• Different names for attributes with the same meaning
• Different meanings for attributes with the same values
> System Overview
> Offline
Domain Clustering
• Bottom-up clustering on schema level
• Used online to limit the search space
• But also to improve accuracy
Domain Statistics
• Create different forms of value set synopses
• Used to save comparison work online
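
As a rough illustration of bottom-up clustering on schema level, the sketch below greedily merges the two most similar clusters of datasets until no pair is similar enough. The Jaccard similarity over attribute names, the `threshold` value, and the function names are assumptions made for this example; the slides do not say which similarity measure or stopping criterion the authors use.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_schemas(schemas, threshold=0.3):
    """Greedy bottom-up clustering of datasets by attribute names.

    schemas: dict mapping dataset id -> set of attribute names.
    Returns a list of domains, each a set of dataset ids.
    The threshold is a hypothetical stopping criterion.
    """
    # Start with one cluster per dataset; each cluster carries
    # the union of the attribute names of its members.
    clusters = [({ds}, set(attrs)) for ds, attrs in schemas.items()]
    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        best, best_pair = 0.0, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = jaccard(clusters[i][1], clusters[j][1])
                if sim > best:
                    best, best_pair = sim, (i, j)
        # Stop when no pair is similar enough to merge.
        if best_pair is None or best < threshold:
            break
        i, j = best_pair
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [members for members, _ in clusters]


schemas = {
    "a": {"city", "population"},
    "b": {"city", "population", "area"},
    "c": {"player", "team", "goals"},
    "d": {"player", "team"},
}
domains = cluster_schemas(schemas)
```

With these toy schemas, the two geographic datasets and the two sports datasets end up in separate domains, which is the property the online phase relies on to limit its search space.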
> Online
Input
• New dataset ds+ with value sets vs+
Output
• Attribute name suggestions
Constraint
• Instantaneous response time (Publish-Time!)
Basic Approach
• Assign ds+ to a domain based on schema information
• Generate recommendations based on values
> Naiv-C
Most Naive Approach:
• Iterate over corpus C
• Return the names of all attributes with sufficiently similar value sets
• Order them by overall frequency in the corpus
Properties:
• Finds all similar value sets
• Generates the largest possible number of recommendations
• Extremely long run time
• Might generate too many recommendations
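
The three steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: "sufficiently similar" is modelled here as Jaccard similarity at or above a hypothetical `threshold`, and the corpus is assumed to be a flat list of `(attribute_name, value_set)` pairs.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def naive_c(new_values, corpus, threshold=0.5):
    """Scan the whole corpus and suggest attribute names.

    corpus: list of (attribute_name, value_set) pairs (assumed layout).
    """
    # Overall frequency of every attribute name in the corpus.
    frequency = Counter(name for name, _ in corpus)
    # Names of all attributes whose value set is sufficiently similar.
    matches = {name for name, values in corpus
               if jaccard(new_values, values) >= threshold}
    # Most frequent names first; ties are broken alphabetically.
    return sorted(matches, key=lambda name: (-frequency[name], name))


corpus = [
    ("country", {"DE", "FR", "IT"}),
    ("country", {"DE", "FR", "ES"}),
    ("nation", {"DE", "FR", "IT", "ES"}),
    ("color", {"red", "blue"}),
]
suggestions = naive_c({"DE", "FR", "IT"}, corpus)
```

Every value set in the corpus is compared against the new one, which is why this variant finds all matches but has the longest run time.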
> Naiv-D
Domain-based Approach:
• Classify incoming dataset into domain D
• Iterate over domain D
• Continue as in Naiv-C
Properties:
• Finds fewer similar value sets
• Shorter run time
• Only generates recommendations from one domain
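
A sketch of the domain-based variant follows, under stated assumptions: the incoming dataset is assigned to the domain whose attribute names overlap most with its own schema (Jaccard similarity, chosen for this example only), and `domains` is a hypothetical dict from domain id to `(attribute_name, value_set)` pairs.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def classify(schema, domains):
    """Pick the domain whose attribute names best overlap the schema."""
    return max(domains,
               key=lambda d: jaccard(schema, {n for n, _ in domains[d]}))


def naive_d(schema, new_values, domains, threshold=0.5):
    """Restrict the Naiv-C scan to the single best-matching domain."""
    domain = domains[classify(schema, domains)]
    # Frequencies are still counted over the whole corpus.
    frequency = Counter(n for attrs in domains.values() for n, _ in attrs)
    matches = {name for name, values in domain
               if jaccard(new_values, values) >= threshold}
    return sorted(matches, key=lambda name: (-frequency[name], name))


domains = {
    "geo": [("country", {"DE", "FR", "IT"}), ("city", {"Berlin", "Paris"})],
    "sports": [("team", {"DE", "FR", "IT"}), ("player", {"Messi"})],
}
suggestions = naive_d({"city", "ctry"}, {"DE", "FR", "IT"}, domains)
```

In this toy data, `team` has exactly the same values as `country` but lives in another domain, so it is not suggested. That is the trade-off listed under Properties: fewer comparisons, but recommendations from one domain only.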
> Cluster / Analysis-D
Synopsis-based Approaches:
• Create representative value sets (RVS) for datasets in domain
• Match only against RVS
Clustering-D
• Cluster VS in domain, create RVS
• Pre-compute recommendation list as all attribute names of value sets participating in final cluster
• Online: find single most similar RVS in D
Analysis-D
• Create RVS directly for sets of VS with equal name
• Online: find set of similar RVS in D
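
To make the Analysis-D idea concrete, here is a small sketch in which the RVS for an attribute name is simply the union of all value sets carrying that name. The union, the Jaccard measure, and the `threshold` are assumptions for illustration; the slides do not specify how an RVS is actually constructed.

```python
from collections import defaultdict


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_rvs(domain):
    """Offline: one RVS per attribute name (here: union of its value sets)."""
    rvs = defaultdict(set)
    for name, values in domain:
        rvs[name] |= set(values)
    return dict(rvs)


def analysis_d(new_values, rvs, threshold=0.5):
    """Online: names of all RVS similar to the new value set, best first."""
    scored = [(jaccard(new_values, values), name)
              for name, values in rvs.items()]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [name for sim, name in scored if sim >= threshold]


domain = [
    ("country", {"DE", "FR"}),
    ("country", {"FR", "IT"}),
    ("city", {"Berlin", "Paris"}),
]
rvs = build_rvs(domain)
suggestions = analysis_d({"DE", "IT"}, rvs)
```

The online step now compares against one RVS per attribute name instead of every value set in the domain, which is where the saved comparison work comes from.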
> Evaluation
> Quality I
> Quality II
> Runtimes
> Cluster Size
> Conclusion
We need statistics-based data integration at publish time to limit the growth of heterogeneity in large public dataset corpora.
Lots of work to do: clustering, matching, statistics, indexing, performance.