TYPifier: Inferring the Type Semantics of Structured Data (ICDE2013)

KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association, Institute of Applied Informatics and Formal Description Methods (AIFB), www.kit.edu. TYPifier: Inferring the Type Semantics of Structured Data. Yongtao Ma, Thanh Tran. 29th IEEE International Conference on Data Engineering (ICDE2013)

Upload: thanh-tran

Post on 10-May-2015


TRANSCRIPT

Page 1: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association

Institute of Applied Informatics and Formal Description Methods (AIFB)

www.kit.edu

TYPifier: Inferring the Type Semantics of Structured Data

Yongtao Ma, Thanh Tran

29th IEEE International Conference on Data Engineering (ICDE2013)

Page 2: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Institute of Applied Informatics and Formal Description Methods (AIFB), April 8th, 2013

Contents

Introduction

TYPification Features

TYPification Algorithm

Evaluation

Conclusion

ICDE2013, Brisbane

Page 3: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Problem

Type information is missing

Dynamic Web Data

Heterogeneous Enterprise Data

Page 4: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Problem

ID | Title | Price | Brand | Description
p1 | Epson E1700 | 260 | Epson | Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format: A4, Letter, B5, A5... Energy consumption in operation/stand-by: 285 W/5 W
p2 | HP 55252 | 2699 | HP | 620 W in printing, 3600 dpi, 30ppm A4 Print Speed, 30ppm Mono A4 Print
p3 | LG 47LM7600 | 1143 | LG | Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary...
p4 | Panasonic L55DT50 | 2399 | Panasonic | Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame.
p5 | MadMaps Pacific | 8 | Spotitout | Windows Vista / 7 / XP. Media: DVD. It’s a snap to load Pacific Coast GPS Travel Directory by MAD Maps into your GPS device.
p6 | Garmin Maps | 99 | Garmin | Windows Vista / 7 / XP. Media: DVD. Compatible with GPS Garmin Colorado, Dakota, eTrex... Coverage includes detailed maps for traveling in Australia.
p7 | Rosetta Spanish | 399 | Rosetta Stone | Windows Vista / 7 / XP. Media: DVD. Build your vocabulary and language abilities... Discover how to speak, read, write, and understand…
p8 | Learn German | 9 | Innovative | Windows Vista / 7 / XP. Media: DVD. Learn level 9 German vocabulary with the audio playback tool, Listen to the lesson dialog and master the language…

Page 5: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Problem

Typification: inferring the type semantics of structured data

Page 6: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Contributions

We formulate Typification as a clustering problem, where the goal is to identify a particular kind of cluster that represents the types of entities

We propose a solution for automatically computing pseudo-schema features from data

We propose TYPifier, a novel clustering algorithm for the typification problem, which:

Is a divisive hierarchical clustering algorithm

Is optimized for (pseudo-)schema-based features

Determines the number of types (clusters) automatically

We show that typification helps to improve data integration!

Page 7: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

FEATURES FOR TYPIFICATION

Page 8: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Schema Features

Features characterize a type well if they are:

Shared by most entities of that type

Not in the feature sets of entities that belong to other types

Schema features: labels of attributes or relations, e.g. Resolution, but also HD and LED Tech for type TV

Advantages: better type indicators

Problems: missing, scarce

Solution: derive pseudo-schema features

Page 9: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Pseudo-schema Features

Words in attribute values that act as schema features

TF-IDF

Importance of a term for a document, relative to others in the corpus

Representative for instances rather than types

Goal: learning words in attribute values that are representative for types

ID | Title | Price | Brand | Description
p1 | Epson E1700 | 260 | Epson | Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format: A4, Letter, B5, A5... Energy consumption in operation/stand-by: 285 W/5 W
p2 | HP 55252 | 2699 | HP | 620 W in printing, 3600 dpi, 30ppm A4 Print Speed, 30ppm Mono A4 Print
p3 | LG 47LM7600 | 1143 | LG | Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary...
p4 | Panasonic L55DT50 | 2399 | Panasonic | Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame.

Page 10: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Pseudo-schema Schema Features

Feature Co-occurrence Graph

The Feature Co-occurrence Graph is a weighted directed graph G = (N, E, L) with:

- N: the set of words in the attribute values

- E: edges as ordered vertex pairs (n1, n2), indicating that n1 co-occurs with n2 in the description of some instances

- L: edge labels. Let $N_{n_1}$ and $N_{n_2}$ be the sets of instances that contain n1 and n2 in their descriptions; the edge labels stand for the conditional co-occurrence probabilities, calculated as

$p(n_2 \mid n_1) = |N_{n_1} \cap N_{n_2}| \, / \, |N_{n_1}|$

Page 11: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Pseudo-schema Schema Features

[Figure: feature co-occurrence graph over the words dpi, A4, ppm, W, Smart, TV, LED and HD; the two directed edges between W and dpi carry the weights 0.5 and 1.0]

Instance | W | dpi
p1 | X | X
p2 | X | X
p3 | X |
p4 | X |

$N_W = \{p1, p2, p3, p4\}$, $N_{dpi} = \{p1, p2\}$

$w(dpi \mid W) = |N_W \cap N_{dpi}| \, / \, |N_W| = 0.5$

$w(W \mid dpi) = |N_W \cap N_{dpi}| \, / \, |N_{dpi}| = 1.0$
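The two edge weights in this example can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `cooccurrence_graph` and the dictionary layout are assumptions made for illustration, not part of the original slides, and the toy data contains only the words W and dpi from the instance table.

```python
from itertools import permutations


def cooccurrence_graph(instance_words):
    """Return {(n1, n2): p(n2 | n1)} for every ordered pair of co-occurring words."""
    # Invert the data: for each word, collect the set of instances containing it.
    containing = {}
    for instance, words in instance_words.items():
        for word in words:
            containing.setdefault(word, set()).add(instance)

    edges = {}
    for n1, n2 in permutations(containing, 2):
        shared = containing[n1] & containing[n2]
        if shared:  # an edge exists only if the two words co-occur somewhere
            edges[(n1, n2)] = len(shared) / len(containing[n1])
    return edges


# Toy data mirroring the instance table (hypothetical encoding as word sets).
instances = {
    "p1": {"W", "dpi"},
    "p2": {"W", "dpi"},
    "p3": {"W"},
    "p4": {"W"},
}

graph = cooccurrence_graph(instances)
print(graph[("W", "dpi")])  # weight of dpi given W
print(graph[("dpi", "W")])  # weight of W given dpi
```

The asymmetry is the point of using a directed graph: every instance that mentions dpi also mentions W, but only half of the instances that mention W mention dpi.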

Page 12: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Pseudo-schema Schema Features

v1 and v2 are considered co-occurring if p(v2|v1) > θ and p(v1|v2) > θ

[Figure: feature co-occurrence graph over the words dpi, A4, ppm, W, Smart, TV, LED and HD, with edge weights 0.5 and 1.0, shown for θ = 0.50]
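The mutual threshold rule can be sketched as a filter over the directed edge weights. The function name `mutual_cooccurrence` and the dpi/ppm weights below are hypothetical; only the W/dpi weights (0.5 and 1.0) come from the slides.

```python
def mutual_cooccurrence(edges, theta):
    """Keep unordered word pairs whose two conditional probabilities both exceed theta."""
    pairs = set()
    for (n1, n2), p in edges.items():
        # A missing reverse edge is treated as probability 0.
        if p > theta and edges.get((n2, n1), 0.0) > theta:
            pairs.add(frozenset((n1, n2)))
    return pairs


# Directed edge weights p(n2 | n1), keyed by (n1, n2).
edges = {
    ("W", "dpi"): 0.5,    # from the slides
    ("dpi", "W"): 1.0,    # from the slides
    ("dpi", "ppm"): 1.0,  # assumed for illustration
    ("ppm", "dpi"): 1.0,  # assumed for illustration
}

kept = mutual_cooccurrence(edges, theta=0.5)
print(kept)
```

With a strict comparison and θ = 0.50, the W/dpi pair is dropped because one direction is exactly 0.5, while dpi/ppm survives. This matches the intuition that W appears across several product types and is therefore a weak type indicator.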

Page 13: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Pseudo-schema Schema Features

[Figure: maximum cliques in the co-occurrence graph. Node labels in extraction order: w, ppm, dpi, A4; HD, TV, Smart, LED, W]
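The slide only names the maximum-clique step, so the following is a sketch under assumptions rather than the authors' procedure. It enumerates maximal cliques of an undirected graph of mutually co-occurring words using the basic Bron-Kerbosch recursion (no pivoting). The adjacency data is hypothetical.

```python
def maximal_cliques(adjacency):
    """Enumerate all maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        # No vertex can extend the clique and none was excluded: it is maximal.
        if not candidates and not excluded:
            cliques.append(current)
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques


# Hypothetical undirected graph: word -> set of mutually co-occurring words.
adjacency = {
    "dpi": {"ppm", "A4"},
    "ppm": {"dpi", "A4"},
    "A4": {"dpi", "ppm"},
    "TV": {"HD", "LED"},
    "HD": {"TV", "LED"},
    "LED": {"TV", "HD"},
    "W": set(),
}

cliques = maximal_cliques(adjacency)
for clique in cliques:
    print(sorted(clique))
```

In this toy graph the printer-related words and the TV-related words each form one clique, and the isolated word W forms a trivial clique of its own.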

Page 14: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

TYPIFICATION ALGORITHM

Page 15: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Clusters

A cluster is defined as a tuple C(F, N, S):

F: the set of (pseudo-)schema features

N: the set of all entities that have an element in F as feature

S: the set of clusters that are either child or descendant nodes of C

Cluster Distance (the formula itself is not preserved in this transcript), defined in terms of:

the co-occurrence count of features fi and fj

the count of entities having f as feature

Ni (Nj) is the entity set associated with Ci (Cj)
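The cluster tuple C(F, N, S) maps naturally onto a small data structure. The sketch below is illustrative only: the class name, field names, the helper `make_cluster`, and the entity-to-feature assignments are assumptions, and the cluster distance is deliberately left out because its formula is not available here.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    features: set = field(default_factory=set)        # F: (pseudo-)schema features
    entities: set = field(default_factory=set)        # N: entities with a feature in F
    descendants: list = field(default_factory=list)   # S: child or descendant clusters


def make_cluster(features, entity_features):
    """Build a cluster whose N holds every entity having at least one feature in F."""
    features = set(features)
    entities = {e for e, fs in entity_features.items() if features & set(fs)}
    return Cluster(features=features, entities=entities)


# Hypothetical feature assignments for three product entities.
entity_features = {
    "p1": {"Power", "Resolution", "Print Speed"},
    "p3": {"Power", "Resolution", "HD", "LED"},
    "p5": {"Media", "Coverage"},
}

power = make_cluster({"Power"}, entity_features)
tv = make_cluster({"HD", "LED"}, entity_features)
power.descendants.append(tv)  # tv sits below power in the hierarchy
print(sorted(power.entities), sorted(tv.entities))
```

Because N is derived from F, a cluster with a more specific feature set naturally covers a subset of the entities of a more general one, which is what makes parent and child relations between clusters meaningful.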

Page 16: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Cluster Relation

Four cluster relations (the relation symbols are not preserved in this transcript):

Ci is a parent (ancestor) of Cj

Ci is a child (descendant) of Cj

Ci and Cj represent the same cluster

There is no relation between Ci and Cj

Evidence

No counter-evidence

Page 17: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Typification

[Figure: step-by-step example over the features Power, platform, Media, Resolution, Print Speed, LED, HD, Coverage, Level and Language, showing the contents of S* and C at steps 0 to 4, starting from the root with C empty at step 0. Legend: children or descendants of the root; siblings of the root]

1. Power < Root: Add & Split Clusters

2. Resolution < Power: Add & Split Clusters

3. Print Speed = Resolution: Merge

4. Split Entities

Page 18: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

EVALUATION

Page 19: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Evaluation

Baselines

Hierarchical: BIRCH

Partitional: K-means++

Kernel-based: SVC

Density-based: OPTICS

Datasets

BTC

DBpedia (DBP)

Product Data (P)

PPS: using pseudo-schema features

PTFIDF: using TF-IDF features

PD: using all words

Dataset | Entity | Triple | Schema Feature | Type | Hierarchy | PS Features
BTC | 334,661 | 2,991,411 | 537 | 163 | 0 | -
DBP | 3,600 | 49,751 | 146 | 16 | 5 | -
PPS | 22,331 | 111,647 | 5 | 6 | 0 | 136
PTFIDF | 22,331 | 111,647 | 5 | 6 | 0 | 7,211
PD | 22,331 | 111,647 | 5 | 6 | 0 | 18,917

Page 20: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Efficiency

[Figure: running time in ms (log scale, 1,000 to 10,000,000) per dataset (DBP, BTC, PPS, PTFIDF, PD) for TYPifier, K-Means++, BIRCH, OPTICS and SVC]

TYPifier, K-means++ and BIRCH are similar in efficiency

Pseudo-schema features help to improve efficiency

Page 21: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Effectiveness

[Figure: F-measure (%) per dataset (DBP, BTC, PPS, PTFIDF, PD) for TYPifier, K-Means++, BIRCH, OPTICS and SVC]

TYPifier outperforms the baselines

+33.92% in F-measure (compared to the second best)

Pseudo-schema features outperform other types of features

+86.15% in F-measure (compared to the second best)

Page 22: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Hierarchies

TYPifier outperforms the baselines

[Figure panels: Original Hierarchies; Hierarchies Generated by OPTICS; Hierarchies Generated by BIRCH; Hierarchies Generated by TYPifier]

Tree Edit Distance

TYPifier | OPTICS | BIRCH
12 | 14 | 24

Page 23: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Parameter Sensitivity

Precision improves with higher θ, because pseudo-schema features become more representative

Recall improves as θ increases at low levels but drops at high levels, because fewer and fewer pseudo-schema features can be generated

[Figure: precision (%) and recall (%) as functions of θ (0.1 to 0.5) for TYPifier, KMeans++ and BIRCH]

Page 24: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Parameter Sensitivity

The sensitivity to ε depends on feature correlations

Higher ε leads to better precision and recall

Extremely high ε may lead to poor quality of hierarchies

[Figure: precision (%) and recall (%) as functions of ε (0.1 to 0.9) for DBP, BTC, P_PS and P_TFIDF]

Page 25: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Conclusion

Introduced and formulated Typification as a clustering problem

Learning pseudo-schema features

A divisive hierarchical clustering solution for typification

TYPifier outperforms baselines by +33.92% in F-measure!

Pseudo-schema features are essential also for baselines! (They outperform other types of features by +86.15% in F-measure)

Generates not only clusters but also hierarchies that closely match human conceptualization / ground truth model!

Page 26: TYPifier: Inferring the Type Semantics of Structured Data (icde2013)

Thank you for your attention! Questions?

Thanh Tran, https://sites.google.com/site/kimducthanh/