typifier: inferring the type semantics of structured data (icde2013)

KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association

Institute of Applied Informatics and Formal Description Methods (AIFB)

www.kit.edu

TYPifier: Inferring the Type Semantics of Structured Data

Yongtao Ma, Thanh Tran

29th IEEE International Conference on Data Engineering (ICDE2013)

April 8th, 2013

Contents

Introduction

TYPification Features

TYPification Algorithm

Evaluation

Conclusion

ICDE2013, Brisbane


Problem

Type information is missing

Dynamic Web Data

Heterogeneous Enterprise Data



Problem




ID | Title | Price | Brand | Description
p1 | Epson E1700 | 260 | Epson | Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format: A4, Letter, B5, A5... Energy consumption in operation/stand-by: 285 W/5 W
p2 | HP 55252 | 2699 | HP | 620 W in printing, 3600 dpi, 30ppm A4 Print Speed, 30ppm Mono A4 Print
p3 | LG 47LM7600 | 1143 | LG | Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary...
p4 | Panasonic L55DT50 | 2399 | Panasonic | Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame.
p5 | MadMaps Pacific | 8 | Spotitout | Windows Vista / 7 / XP. Media: DVD. It’s a snap to load Pacific Coast GPS Travel Directory by MAD Maps into your GPS device.
p6 | Garmin Maps | 99 | Gamin | Windows Vista / 7 / XP. Media: DVD. Compatible with GPS Garmin Colorado, Dakota, eTrex... Coverage includes detailed maps for traveling in Australia.
p7 | Rosetta Spanish | 399 | Rosetta Stone | Windows Vista / 7 / XP. Media: DVD. Build your vocabulary and language abilities... Discover how to speak, read, write, and understand…
p8 | Learn German | 9 | Innovative | Windows Vista / 7 / XP. Media: DVD. Learn level 9 German vocabulary with the audio playback tool, Listen to the lesson dialog and master the language…


Problem



Typification: inferring the type semantics of structured data



Contributions

We formulate typification as a clustering problem, where the goal is to identify a particular kind of cluster: clusters that represent the types of entities

We propose a solution for automatically computing pseudo-schema features from data

We propose TYPifier, a novel clustering algorithm for the typification problem, which

is a divisive hierarchical clustering algorithm

is optimized for (pseudo-)schema-based features

determines the number of types (clusters) automatically

We show that typification helps to improve data integration



FEATURES FOR TYPIFICATION



Schema Features

Features characterize a type well if they are:

Shared by most entities of that type

Absent from the feature sets of entities that belong to other types

Schema features: labels of attributes or relations, e.g. Resolution, but also HD and LED Tech, for type TV

Advantages: better type indicators

Problems: missing, scarce

Solution: derive pseudo-schema features



Pseudo-schema Features

Words in attribute values that act as schema features

TF-IDF

Importance of a term for a document, relative to others in the corpus

Representative for instances rather than types

Learning words in attribute values that are representative for types

ID | Title | Price | Brand | Description
p1 | Epson E1700 | 260 | Epson | Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format: A4, Letter, B5, A5... Energy consumption in operation/stand-by: 285 W/5 W
p2 | HP 55252 | 2699 | HP | 620 W in printing, 3600 dpi, 30ppm A4 Print Speed, 30ppm Mono A4 Print
p3 | LG 47LM7600 | 1143 | LG | Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary...
p4 | Panasonic L55DT50 | 2399 | Panasonic | Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame.


Pseudo-schema Features

Feature Co-occurrence Graph

A feature co-occurrence graph is a weighted directed graph G = (N, E, L) with:

- N: the set of words in the attribute values
- E: edges as ordered vertex pairs (n1, n2), indicating that n1 co-occurs with n2 in the description of some instances
- L: edge labels. Let $N_{n_1}$ and $N_{n_2}$ be the sets of instances that contain n1 and n2 in their description; the edge labels are the conditional co-occurrence probabilities, calculated as

$p(n_2 \mid n_1) = |N_{n_1} \cap N_{n_2}| / |N_{n_1}|$
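The edge labels can be computed directly from the definition. The following is a minimal sketch, assuming each instance's description has already been split into a set of words; the function name `cooccurrence_graph` and the toy data are illustrative and not taken from the talk.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_graph(instances):
    """Return {(n1, n2): p(n2|n1)} for words that co-occur in some instance.

    `instances` maps an instance id to the set of words in its description
    (an assumed input format).
    """
    # N_n: the set of instances whose description contains word n
    containing = defaultdict(set)
    for instance_id, words in instances.items():
        for word in words:
            containing[word].add(instance_id)

    edges = {}
    for n1, n2 in permutations(containing, 2):
        shared = containing[n1] & containing[n2]
        if shared:  # an edge exists only if the two words co-occur somewhere
            edges[(n1, n2)] = len(shared) / len(containing[n1])
    return edges


# Toy data: W appears in four instances, dpi in two of them
instances = {
    "p1": {"W", "dpi"},
    "p2": {"W", "dpi"},
    "p3": {"W"},
    "p4": {"W"},
}
graph = cooccurrence_graph(instances)
print(graph[("W", "dpi")])  # p(dpi|W) = 2/4
print(graph[("dpi", "W")])  # p(W|dpi) = 2/2
```

The graph is directed because the two conditional probabilities generally differ: a frequent word such as W is a weak predictor of dpi, while dpi is a strong predictor of W.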



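[Figure: feature co-occurrence graph over the words dpi, A4, ppm, W, Smart, TV, LED and HD; the edges between W and dpi carry the weights 0.5 and 1.0]

Example:

Instance | W | dpi
p1 | X | X
p2 | X | X
p3 | X |
p4 | X |

$N_W = \{p1, p2, p3, p4\}$, $N_{dpi} = \{p1, p2\}$

$p(dpi \mid W) = |N_W \cap N_{dpi}| / |N_W| = 0.5$

$p(W \mid dpi) = |N_W \cap N_{dpi}| / |N_{dpi}| = 1.0$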



v1 and v2 are considered co-occurring if p(v2|v1) > θ and p(v1|v2) > θ
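The mutual-threshold rule turns the directed graph into an undirected one. Below is a minimal sketch of that filtering step, assuming the edge weights are stored as a dictionary keyed by ordered word pairs; the function name `mutual_cooccurrence` and the edge weights are hypothetical.

```python
def mutual_cooccurrence(edges, theta):
    """Keep unordered pairs {v1, v2} with p(v2|v1) > theta and p(v1|v2) > theta."""
    kept = set()
    for (v1, v2), p in edges.items():
        # A missing reverse edge is treated as probability 0 (assumption)
        if p > theta and edges.get((v2, v1), 0.0) > theta:
            kept.add(frozenset((v1, v2)))
    return kept


# Hypothetical edge weights, not taken from the evaluation data
edges = {
    ("W", "dpi"): 0.5, ("dpi", "W"): 1.0,
    ("dpi", "ppm"): 1.0, ("ppm", "dpi"): 1.0,
    ("W", "TV"): 0.5, ("TV", "W"): 1.0,
}
pairs_strict = mutual_cooccurrence(edges, theta=0.5)
pairs_loose = mutual_cooccurrence(edges, theta=0.4)
```

Because the comparison is strict, a pair whose weaker direction equals θ exactly is dropped: with these weights, θ = 0.5 keeps only dpi and ppm, while θ = 0.4 keeps all three pairs.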

[Figure: the feature co-occurrence graph over dpi, A4, ppm, W, Smart, TV, LED and HD, with edge weights 0.5 and 1.0 and threshold θ = 0.50]

[Figure: maximum cliques in the thresholded graph; one group of nodes shows W, ppm, dpi and A4, the other HD, TV, Smart, LED and W]
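The slides do not say how the cliques are enumerated. As one possible approach, the sketch below uses the basic Bron-Kerbosch recursion, a standard technique that lists all maximal cliques of an undirected graph; the largest of these are the maximum cliques. The adjacency data is hypothetical and does not reproduce the graph in the figure.

```python
def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion.

    `adjacency` maps each node to the set of its neighbours in an
    undirected graph (assumed input format).
    """
    cliques = []

    def expand(current, candidates, excluded):
        # No node can extend the clique and none was skipped: it is maximal
        if not candidates and not excluded:
            cliques.append(current)
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(frozenset(), set(adjacency), set())
    return cliques


# Hypothetical thresholded graph using words from the example products
adjacency = {
    "dpi": {"ppm", "A4"},
    "ppm": {"dpi", "A4"},
    "A4": {"dpi", "ppm"},
    "HD": {"TV", "LED"},
    "TV": {"HD", "LED"},
    "LED": {"HD", "TV"},
    "W": set(),
}
cliques = maximal_cliques(adjacency)
largest_size = max(len(c) for c in cliques)
```

In this toy graph the printer-related words and the TV-related words form two separate cliques, and the isolated word W forms a trivial clique of its own.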


TYPIFICATION ALGORITHM



Clusters

A cluster is defined as a tuple C(F, N, S):

F: the set of (pseudo-)schema features

N: the set of all entities that have an element in F as feature

S: the set of clusters that are either child or descendant nodes of C

Cluster Distance, defined in terms of:

the co-occurrence count of features fi and fj

the count of entities having f as feature

Ni (Nj), the entity set associated with Ci (Cj)
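The cluster tuple maps naturally onto a small record type. The sketch below is one way to represent C(F, N, S) and to derive N from F; the class name, field names, helper function and the entity-to-feature assignments are all assumptions made for illustration. The distance formula is not reproduced here because it is not recoverable from the slide text.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    features: set                                    # F: (pseudo-)schema features
    entities: set = field(default_factory=set)       # N: entities having a feature in F
    descendants: list = field(default_factory=list)  # S: child or descendant clusters


def entities_with_any(features, entity_features):
    """N: all entities that have at least one element of F as a feature."""
    return {e for e, fs in entity_features.items() if fs & features}


# Hypothetical assignment of features to three of the example products
entity_features = {
    "p1": {"Power", "Resolution", "Print Speed"},
    "p3": {"Power", "LED", "HD"},
    "p5": {"platform", "Media", "Coverage"},
}

printer = Cluster(features={"Resolution", "Print Speed"})
printer.entities = entities_with_any(printer.features, entity_features)

power = Cluster(features={"Power"}, descendants=[printer])
power.entities = entities_with_any(power.features, entity_features)
```

Here the `power` cluster covers p1 and p3, and its descendant `printer` covers only p1, which matches the intuition that a child cluster describes a subset of its parent's entities.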


Cluster Relation

Four cluster relations:

Ci is a parent (ancestor) of Cj

Ci is a child (descendant) of Cj

Ci and Cj represent the same cluster

There is no relation between Ci and Cj

Evidence

No counter-evidence
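To make the four relations concrete, the sketch below classifies a pair of clusters from the overlap of their entity sets. This is an illustrative assumption only: the slide text does not preserve the evidence test that TYPifier applies, and the containment rule, the `threshold` parameter and all names here are hypothetical.

```python
from enum import Enum


class Relation(Enum):
    PARENT = "Ci is a parent (ancestor) of Cj"
    CHILD = "Ci is a child (descendant) of Cj"
    SAME = "Ci and Cj represent the same cluster"
    NONE = "no relation between Ci and Cj"


def relation(n_i, n_j, threshold=0.8):
    """Classify the relation between Ci and Cj from their entity sets.

    ASSUMPTION: containment ratios compared against `threshold`; this is an
    illustrative rule, not the test used by TYPifier.
    """
    shared = len(n_i & n_j)
    i_in_j = shared / len(n_i) if n_i else 0.0  # share of Ci's entities also in Cj
    j_in_i = shared / len(n_j) if n_j else 0.0  # share of Cj's entities also in Ci
    if i_in_j >= threshold and j_in_i >= threshold:
        return Relation.SAME
    if j_in_i >= threshold:
        return Relation.PARENT  # Cj's entities are mostly contained in Ci
    if i_in_j >= threshold:
        return Relation.CHILD   # Ci's entities are mostly contained in Cj
    return Relation.NONE


# Hypothetical entity sets for four features
power = {"p1", "p2", "p3", "p4"}
resolution = {"p1", "p2"}
print_speed = {"p1", "p2"}
media = {"p5", "p6"}

r_child = relation(resolution, power)
r_same = relation(print_speed, resolution)
r_parent = relation(power, resolution)
r_none = relation(media, power)
```

With these toy sets, Resolution comes out as a child of Power and Print Speed as the same cluster as Resolution.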


Typification

[Figure: worked typification example in panels 0 to 4, each showing the sets S* and C. Panel 0: Root, with S* containing Power, platform, Media, Resolution, Print Speed, LED, HD, Coverage, Level and Language, and C empty. Annotations: "Children or Descendants of the root", "Siblings of the root".]

Steps shown:

1. Power < Root: add and split clusters

2. Resolution < Power: add and split clusters

3. Print Speed = Resolution: merge

4. Split entities


EVALUATION



Evaluation

Baselines

Hierarchical: BIRCH

Partitional: K-means++

Kernel-based: SVC

Density-based: OPTICS

Datasets

BTC

DBpedia (DBP)

Product Data (P)

P_PS: using pseudo-schema features

P_TFIDF: using TF-IDF features

P_D: using all words

Dataset | Entity | Triple | Schema Feature | Type | Hierarchy | PS Features
BTC | 334,661 | 2,991,411 | 537 | 163 | 0 | -
DBP | 3,600 | 49,751 | 146 | 16 | 5 | -
P_PS | 22,331 | 111,647 | 5 | 6 | 0 | 136
P_TFIDF | 22,331 | 111,647 | 5 | 6 | 0 | 7,211
P_D | 22,331 | 111,647 | 5 | 6 | 0 | 18,917


Efficiency

[Figure: running time in ms (log scale, 1,000 to 10,000,000) on the datasets DBP, BTC, P_PS, P_TFIDF and P_D for TYPifier, K-Means++, BIRCH, OPTICS and SVC]

TYPifier, K-means++ and BIRCH are similar in efficiency

Pseudo-schema features help to improve efficiency


Effectiveness

[Figure: F-measure (%) on the datasets DBP, BTC, P_PS, P_TFIDF and P_D for TYPifier, K-Means++, BIRCH, OPTICS and SVC]

TYPifier outperforms the baselines

+33.92% in F-measure (compared to second best)

Pseudo-schema features outperform other types of feature

+86.15% in F-measure (compared to second best)


Hierarchies

TYPifier outperforms the baselines

[Figure: original hierarchies compared with the hierarchies generated by OPTICS, BIRCH and TYPifier]

Tree Edit Distance

TYPifier | OPTICS | BIRCH
12 | 14 | 24


Parameter Sensitivity

Precision improves with higher θ, because pseudo-schema features become more representative

Recall improves as θ increases at low levels but drops at high levels, because fewer and fewer pseudo-schema features can be generated

[Figure: precision (%) and recall (%) as functions of θ (0.1 to 0.5) for TYPifier, K-Means++ and BIRCH]


Parameter Sensitivity

The sensitivity of ε depends on feature correlations

Higher ε leads to better precision and recall

Extremely high ε may lead to poor-quality hierarchies

[Figure: precision (%) and recall (%) as functions of ε (0.1 to 0.9) for DBP, BTC, P_PS and P_TFIDF]


Conclusion

Introduce and formulate typification as a clustering problem

Learning pseudo-schema features

A divisive hierarchical clustering solution for typification

TYPifier outperforms baselines by +33.92% in F-measure!

Pseudo-schema features are essential also for the baselines! (They outperform other types of feature by +86.15% in F-measure)

Generates not only clusters but also hierarchies that closely match human conceptualization / the ground-truth model!


Thank you for your attention! Questions?

Thanh Tran, https://sites.google.com/site/kimducthanh/
