TYPifier: Inferring the Type Semantics of Structured Data (ICDE2013)
Yongtao Ma, Thanh Tran
29th IEEE International Conference on Data Engineering (ICDE2013)
KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association
Institute of Applied Informatics and Formal Description Methods (AIFB)
www.kit.edu
April 8th, 2013
Contents
Introduction
TYPification Features
TYPification Algorithm
Evaluation
Conclusion
ICDE2013, Brisbane
Problem
Type information is missing:
- Dynamic Web data
- Heterogeneous enterprise data
ID | Title | Price | Brand | Description
p1 | Epson E1700 | 260 | Epson | Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format: A4, Letter, B5, A5... Energy consumption in operation/stand-by: 285 W/5 W
p2 | HP 55252 | 2699 | HP | 620 W in printing, 3600 dpi, 30ppm A4 Print Speed, 30ppm Mono A4 Print
p3 | LG 47LM7600 | 1143 | LG | Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary...
p4 | Panasonic L55DT50 | 2399 | Panasonic | Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame.
p5 | MadMaps Pacific | 8 | Spotitout | Windows Vista / 7 / XP. Media: DVD. It’s a snap to load Pacific Coast GPS Travel Directory by MAD Maps into your GPS device.
p6 | Garmin Maps | 99 | Gamin | Windows Vista / 7 / XP. Media: DVD. Compatible with GPS Garmin Colorado, Dakota, eTrex... Coverage includes detailed maps for traveling in Australia.
p7 | Rosetta Spanish | 399 | Rosetta Stone | Windows Vista / 7 / XP. Media: DVD. Build your vocabulary and language abilities... Discover how to speak, read, write, and understand…
p8 | Learn German | 9 | Innovative | Windows Vista / 7 / XP. Media: DVD. Learn level 9 German vocabulary with the audio playback tool, Listen to the lesson dialog and master the language…
Typification: inferring the type semantics of structured data
Contributions
We formulate typification as a clustering problem, where the goal is to identify a particular kind of cluster that represents the types of entities
We propose a solution for automatically computing pseudo-schema features from data
We propose TYPifier, a novel clustering algorithm for the typification problem, which:
- Is a divisive hierarchical clustering algorithm
- Is optimized for (pseudo-)schema-based features
- Determines the number of types (clusters) automatically
We show that typification helps to improve data integration!
FEATURES FOR TYPIFICATION
Schema Features
Features characterize a type well if they are:
- Shared by most entities of that type
- Not in the feature sets of entities that belong to other types
Schema features: labels of attributes or relations, e.g. Resolution, but also HD and LED Tech for type TV
Advantages: better type indicators
Problems: missing, scarce
Solution: derive pseudo-schema features
Pseudo-schema Features
Words in attribute values that act as schema features
TF-IDF
- Importance of a term for a document, relative to others in the corpus
- Representative for instances rather than types
Learning words in attribute values that are representative for types
ID | Title | Price | Brand | Description
p1 | Epson E1700 | 260 | Epson | Up to 600 x 600 dpi, Up to 10 ppm (colour)... Format: A4, Letter, B5, A5... Energy consumption in operation/stand-by: 285 W/5 W
p2 | HP 55252 | 2699 | HP | 620 W in printing, 3600 dpi, 30ppm A4 Print Speed, 30ppm Mono A4 Print
p3 | LG 47LM7600 | 1143 | LG | Standby Mode 0.1 W. Full HD 1080p gives high picture quality over standard HDTV via LG LED... LG’s 47-inch Smart TV is a revolutionary...
p4 | Panasonic L55DT50 | 2399 | Panasonic | Power consumption 85 W. The DT50 LED-LCD series provides a fantastic Smart TV experience and features a 3D IPS LED panel, 1080p Full HD resolution, and a new narrow metal frame.
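The TF-IDF point can be illustrated with a small sketch. It assumes a plain tf * log(N/df) weighting, whitespace tokenization, and heavily abbreviated token lists for p1 to p4; all three are assumptions made for illustration and are not taken from the slides.

```python
import math
from collections import Counter

# Abbreviated, hypothetical token lists for the four product descriptions.
docs = {
    "p1": "600 dpi 10 ppm a4 285 w".split(),
    "p2": "620 w 3600 dpi 30 ppm a4".split(),
    "p3": "0.1 w full hd 1080p led smart tv".split(),
    "p4": "85 w smart tv led 1080p full hd".split(),
}

def tf_idf(docs):
    """Return {doc_id: {term: tf * log(N / df)}} for tokenized documents."""
    n_docs = len(docs)
    # df[term] = number of documents that contain the term at least once.
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        scores[doc_id] = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return scores

scores = tf_idf(docs)
# A word unique to one instance ("285") scores higher for p1 than a word
# shared by all printers ("dpi"), and a word present everywhere ("w") scores 0.
print(sorted(scores["p1"].items(), key=lambda kv: -kv[1]))
```

Under these assumptions the top-ranked terms for p1 are the instance-specific numbers, which is the sense in which TF-IDF is representative for instances rather than types.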
Pseudo-schema Features
Feature Co-occurrence Graph
The feature co-occurrence graph is a weighted directed graph G = (N, E, L) with:
- N: the set of words in the attribute values
- E: edges as ordered vertex pairs (n1, n2), indicating that n1 co-occurs with n2 in the description of some instances
- L: edge labels. Let N_n1 and N_n2 be the sets of instances that contain n1 and n2 in their descriptions; the edge labels stand for the conditional co-occurrence probabilities, calculated as
p(n2|n1) = |N_n1 ∩ N_n2| / |N_n1|
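A minimal sketch of how such a graph could be computed, assuming each instance is already reduced to a list of words. The function name `cooccurrence_graph` and the toy instance data are hypothetical and only serve to illustrate the formula.

```python
from itertools import permutations

def cooccurrence_graph(instances):
    """Map each ordered word pair (n1, n2) to p(n2|n1) = |N_n1 ∩ N_n2| / |N_n1|."""
    # containing[w] is N_w, the set of instance ids whose description contains w.
    containing = {}
    for instance_id, words in instances.items():
        for word in set(words):
            containing.setdefault(word, set()).add(instance_id)
    edges = {}
    for n1, n2 in permutations(containing, 2):
        shared = containing[n1] & containing[n2]
        if shared:  # an edge exists only if the two words co-occur somewhere
            edges[(n1, n2)] = len(shared) / len(containing[n1])
    return edges

# Toy data (assumed): W appears in all four instances, dpi in two of them.
instances = {
    "p1": ["W", "dpi"],
    "p2": ["W", "dpi"],
    "p3": ["W"],
    "p4": ["W"],
}
edges = cooccurrence_graph(instances)
print(edges[("W", "dpi")])  # p(dpi|W)
print(edges[("dpi", "W")])  # p(W|dpi)
```

The graph is directed because the two conditional probabilities generally differ: a frequent word such as W is a weak predictor of dpi, while dpi is a strong predictor of W.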
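Pseudo-schema Features
[Figure: feature co-occurrence graph over the words dpi, A4, ppm, W, Smart, TV, LED and HD; the edges between W and dpi carry the weights 0.5 and 1.0]
Instance | W | dpi
p1 | X | X
p2 | X | X
p3 | X |
p4 | X |
N_W = {p1, p2, p3, p4}, N_dpi = {p1, p2}
p(dpi|W) = |N_W ∩ N_dpi| / |N_W| = 0.5
p(W|dpi) = |N_W ∩ N_dpi| / |N_dpi| = 1.0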
Pseudo-schema Features
v1 and v2 are considered co-occurring if p(v2|v1) > θ and p(v1|v2) > θ
[Figure: co-occurrence graph over dpi, A4, ppm, W, Smart, TV, LED and HD, with example edge weights 0.5 and 1.0, thresholded at θ = 0.50]
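The threshold rule can be sketched as a filter over the directed edge weights. The function name `mutual_edges` and the dpi/ppm weights are assumptions; only the W/dpi weights (0.5 and 1.0) come from the slides.

```python
def mutual_edges(probabilities, theta):
    """Keep undirected pairs {v1, v2} with p(v2|v1) > theta and p(v1|v2) > theta."""
    kept = set()
    for (v1, v2), p_forward in probabilities.items():
        # A missing reverse edge is treated as probability 0.
        p_backward = probabilities.get((v2, v1), 0.0)
        if p_forward > theta and p_backward > theta:
            kept.add(frozenset((v1, v2)))
    return kept

# Conditional probabilities keyed as (v1, v2) -> p(v2|v1).
probabilities = {
    ("W", "dpi"): 0.5, ("dpi", "W"): 1.0,      # from the W/dpi example
    ("dpi", "ppm"): 1.0, ("ppm", "dpi"): 1.0,  # hypothetical
}
edges = mutual_edges(probabilities, theta=0.5)
print(edges)
```

Because the comparison is strict, the W/dpi pair is dropped at θ = 0.5: W is too generic to predict dpi, even though dpi predicts W perfectly.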
Pseudo-schema Features
Maximum Clique
[Figure: cliques in the co-occurrence graph; the node groups shown are w, ppm, dpi, A4 and HD, TV, Smart, LED, W]
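The slide only names the clique step, so the following is a generic sketch of maximal-clique enumeration using the basic Bron-Kerbosch recursion, not necessarily the procedure used in TYPifier. The adjacency data are hypothetical and stand for an undirected graph that remains after thresholding.

```python
def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(clique, candidates, excluded):
        # A clique is maximal when no vertex can extend it any further.
        if not candidates and not excluded:
            cliques.append(clique)
            return
        for vertex in list(candidates):
            neighbours = adjacency[vertex]
            expand(clique | {vertex}, candidates & neighbours, excluded & neighbours)
            candidates = candidates - {vertex}
            excluded = excluded | {vertex}

    expand(frozenset(), frozenset(adjacency), frozenset())
    return cliques

# Hypothetical undirected graph: vertex -> set of neighbours.
adjacency = {
    "dpi": {"ppm", "A4"},
    "ppm": {"dpi", "A4"},
    "A4": {"dpi", "ppm"},
    "HD": {"TV", "LED"},
    "TV": {"HD", "LED"},
    "LED": {"HD", "TV"},
}
cliques = maximal_cliques(adjacency)
print(cliques)
```

In this toy graph the two cliques group printer-related and TV-related words, which is the kind of word set that could act as a pseudo-schema feature.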
TYPIFICATION ALGORITHM
Clusters
A cluster is defined as a tuple C(F, N, S):
- F: the set of (pseudo-)schema features
- N: the set of all entities that have an element in F as feature
- S: the set of clusters that are either child or descendant nodes of C
Cluster Distance (formula shown as an image on the slide), defined in terms of:
- The co-occurrence count of features fi and fj
- The count of entities having f as feature
- Ni (Nj), the entity set associated with Ci (Cj)
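The cluster tuple can be sketched as a small data structure. The class and function names and the entity-to-feature assignments are hypothetical; the cluster distance is left out because its formula is not available in the text.

```python
from dataclasses import dataclass, field

@dataclass
class Cluster:
    """A cluster C(F, N, S)."""
    features: set                                     # F: (pseudo-)schema features
    entities: set = field(default_factory=set)        # N: entities with a feature in F
    descendants: list = field(default_factory=list)   # S: child or descendant clusters

def build_cluster(features, entity_features):
    """Collect N: every entity that has at least one element of F as a feature."""
    features = set(features)
    entities = {e for e, fs in entity_features.items() if fs & features}
    return Cluster(features=features, entities=entities)

# Hypothetical assignment of features to entities.
entity_features = {
    "p1": {"Power", "Resolution", "Print Speed"},
    "p3": {"Power", "LED", "HD"},
    "p5": {"Media", "Coverage"},
}
power = build_cluster({"Power"}, entity_features)
tv = build_cluster({"LED", "HD"}, entity_features)
power.descendants.append(tv)  # assumed parent/child relation, for illustration
print(power.entities, tv.entities)
```

Storing N explicitly keeps entity-set comparisons between two clusters, such as checking whether one set is contained in the other, cheap.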
Cluster Relation
Four cluster relations between Ci and Cj:
- Ci is a parent (ancestor) of Cj
- Ci is a child (descendant) of Cj
- Ci and Cj represent the same cluster
- There is no relation between Ci and Cj
Evidence, no counter-evidence
Typification
[Figure: step-by-step example showing the sets S* and C at steps 0 to 4, starting from an empty C and the root, over the features Power, platform, Media, Resolution, Print Speed, LED, HD, Coverage, Level and Language; features are labelled as children or descendants of the root, or as siblings of the root]
Steps in the example:
1. Power < Root: Add & Split Clusters
2. Resolution < Power: Add & Split Clusters
3. Print Speed = Resolution: Merge
4. Split Entities
EVALUATION
Evaluation
Baselines
- Hierarchical: BIRCH
- Partitional: K-means++
- Kernel-based: SVC
- Density-based: OPTICS
Datasets
- BTC
- DBpedia (DBP)
- Product Data (P)
  - PPS: using pseudo-schema features
  - PTFIDF: using TF-IDF features
  - PD: using all words
Dataset | Entity | Triple | Schema Feature | Type | Hierarchy | PS Features
BTC | 334,661 | 2,991,411 | 537 | 163 | 0 | -
DBP | 3,600 | 49,751 | 146 | 16 | 5 | -
PPS | 22,331 | 111,647 | 5 | 6 | 0 | 136
PTFIDF | 22,331 | 111,647 | 5 | 6 | 0 | 7,211
PD | 22,331 | 111,647 | 5 | 6 | 0 | 18,917
Efficiency
[Figure: running time in ms (log scale, 1,000 to 10,000,000) of TYPifier, K-Means++, BIRCH, OPTICS and SVC on the datasets DBP, BTC, PPS, PTFIDF and PD]
TYPifier, K-means++ and BIRCH are similar in efficiency
Pseudo-schema features help to improve efficiency
Effectiveness
[Figure: F-measure (%) of TYPifier, K-Means++, BIRCH, OPTICS and SVC on the datasets DBP, BTC, PPS, PTFIDF and PD]
TYPifier outperforms the baselines
- +33.92% in F-measure (compared to second best)
Pseudo-schema features outperform other types of features
- +86.15% in F-measure (compared to second best)
Hierarchies
TYPifier outperforms the baselines
[Figure: the original hierarchies compared with the hierarchies generated by OPTICS, BIRCH and TYPifier]
Tree Edit Distance
TYPifier | OPTICS | BIRCH
12 | 14 | 24
Parameter Sensitivity
Precision improves with higher θ, because pseudo-schema features become more representative
Recall improves as θ increases at low values but drops at high values, because fewer and fewer pseudo-schema features can be generated
[Figure: precision (%) and recall (%) of TYPifier, KMeans++ and BIRCH for θ from 0.1 to 0.5]
Parameter Sensitivity
The sensitivity to ε depends on feature correlations
Higher ε leads to better precision and recall
Extremely high ε may lead to poor quality of hierarchies
[Figure: precision (%) and recall (%) on DBP, BTC, P_PS and P_TFIDF for ε from 0.1 to 0.9]
Conclusion
Introduce and formulate typification as a clustering problem
Learning pseudo-schema features
A divisive hierarchical clustering solution for typification
- TYPifier outperforms baselines by +33.92% in F-measure!
- Pseudo-schema features are also essential for the baselines! (They outperform other types of features by +86.15% in F-measure)
- Generates not only clusters but also hierarchies that closely match human conceptualization / the ground-truth model!
Thank you for your attention! Questions?
Thanh Tran, https://sites.google.com/site/kimducthanh/