www.data61.csiro.au
Automating Data Integration with Machine Learning Bringing Your Data Together
Natalia Rümmele | Data Scientist September 2016
Data Integration Problem
• Combine data from different sources
• Provide unified view of data
Data Integration | Natalia Rümmele 2 |
?
Data Integration Problem
• Combine data from different sources
• Provide unified view of data • Multiple datasets with common entities and
content
• Siloed systems
• Sharing data across systems/schemas
• Handling legacy systems
• Handling different schemas
Data Integration | Natalia Rümmele 3 |
?
Data Integration Problem
• Combine data from different sources
• Provide unified view of data • Multiple datasets with common entities and
content
• Siloed systems
• Sharing data across systems/schemas
• Handling legacy systems
• Handling different schemas
• Resource-intensive ETL…
Data Integration | Natalia Rümmele 4 |
?
Data Integration Problem
• Combine data from different sources
• Provide unified view of data • Multiple datasets with common entities and
content
• Siloed systems
• Sharing data across systems/schemas
• Handling legacy systems
• Handling different schemas
• Resource-intensive ETL…
• Can machine learning help?
Data Integration | Natalia Rümmele 5 |
?
Areas of Investigation
• Schema matching & mapping • Connecting datasets by establishing a global data
model
• Entity Resolution • Detecting entities across databases
• Privacy Preserving Analytics
• Learning the global data model without sharing data
• Data Quality
• Semantic and Syntactic scoring
Data Integration | Natalia Rümmele 6 |
Relational Schema Matching
Schema Matching
• Goal: Automatically connect datasets across relational schemas
Data Integration | Natalia Rümmele 8 |
UserName
Joe Blogs
Jill Blogs
…
Names
Blogs, Joe
Blogs, Jill
…
People
Blogs, J
Blogs, J
…
_USERS_
Blogs_Joe
Blogs_Jill
…
IDs
Joe Blogs
Jill Blogs
…
Schema Matching
• Goal: Automatically connect datasets across relational schemas
• Problem: Syntax vs Semantics • Requires humans to distinguish between similar columns
• Machines get caught on syntax
Data Integration | Natalia Rümmele 9 |
UserName
Joe Blogs
Jill Blogs
…
Names
Blogs, Joe
Blogs, Jill
…
People
Blogs, J
Blogs, J
…
_USERS_
Blogs_Joe
Blogs_Jill
…
IDs
Joe Blogs
Jill Blogs
…
Name
Semantic type
Data Integration: Schema Matcher
• Can we automatically label columns with semantic types?
• Given multiple datasets and a set of semantic types, can we detect all columns of the same semantic type where: • Column names may be different
• Common entries may not exist
• Formatting issues may exist
Data Integration | Natalia Rümmele 10 |
UserName
Joe Blogs
Jill Blogs
…
Names
Blogs, Joe
Blogs, Jill
…
People
Blogs, J
Blogs, J
…
_USERS_
Blogs_Joe
Blogs_Jill
…
IDs
Joe Blogs
Jill Blogs
…
Name Name Name Name Name
Schema Matcher
Name Address Phone
Machine Learning: Feature Vector
• Multi-class classification problem • Represent column as a vector of features • Classify column as one of semantic types using a ML classifier • Training data needed!
Data Integration | Natalia Rümmele 11 |
People
Joe Blogs
Jill Blogs
Fred Flogs
Fiona Flogs
George Glogs
Gini Glogs
Henry Hogs
...
Column Header + Table Name • Nearest neighbours to class training labels • Edit distance metrics • WordNet distance
Content • Character Frequencies • Missing Values • Number repeated values • Entropy
Machine Learning: Cost Matrix
• Class imbalance: unknown class over-represented (unlabeled columns)
• Class resampling strategies: oversampling/undersampling to mean, etc.
• Cost-sensitive learning by introducing asymmetric costs of misclassifications
Data Integration | Natalia Rümmele 12 |
Semantic Type
Predicted
Act
ual
Unknown Name Address Phone
Unknown 0 1 1 1
Name 40 0 1 1
Address 40 1 0 1
Phone 40 1 1 0
Machine Learning: Bagging
• Number of columns usually much smaller than dataset
• Can use bagging to build many smaller samples by subsampling
• Also addresses class imbalance
Data Integration | Natalia Rümmele 13 |
People
Joe Blogs
Jill Blogs
Fred Flogs
Fiona Flogs
George Glogs
Gini Glogs
Henry Hogs
...
People
Joe Blogs
Jill Blogs
Fred Flogs
People
George Glogs
Jill Blogs
Gini Glogs
People
Henry Hogs
Joe Blogs
Henry Hogs
People
Joe Blogs
Jill Blogs
Fred Flogs
People
Joe Blogs
Gini Glogs
Fred Flogs
Data Integration: Schema Matcher
• Schema Matcher is trained on examples and user feedback
• System improves as predictions are corrected
Data Integration | Natalia Rümmele 14 |
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
Schema Matcher
Address Address
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
Schema Matcher
Name Address
Data Integration: Schema Matcher
• User supplies semantic types and datasets
• User labels some example sets
Data Integration | Natalia Rümmele 15 |
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
Address Name
People
Joe Blogs
Jill Blogs
…
Phone Email
Data Integration: Schema Matcher
• User supplies semantic types and datasets
• User labels some example sets
• Predictions are generated
Data Integration | Natalia Rümmele 16 |
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
Address Name
People
Joe Blogs
Jill Blogs
…
Phone Email
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
Name
Data Integration: Schema Matcher
• User supplies types and datasets
• User labels some example sets
• Predictions are generated, user corrects
Data Integration | Natalia Rümmele 17 |
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
Address Name
People
Joe Blogs
Jill Blogs
…
Phone Email
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
Name
Data Integration: Schema Matcher
• User supplies types and datasets
• User labels some example sets
• Predictions are generated, user corrects
• User adds more data
Data Integration | Natalia Rümmele 18 |
Address Name
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
People
Joe Blogs
Jill Blogs
…
Phone Email
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
Name
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
People
Joe Blogs
Jill Blogs
…
Data Integration: Schema Matcher
• User supplies types and datasets
• User labels some example sets
• Predictions are generated, user corrects
• User adds more data
• Repeat
Data Integration | Natalia Rümmele 19 |
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
Address Name
People
Joe Blogs
Jill Blogs
…
Phone Email
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
Name
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
People
Joe Blogs
Jill Blogs
…
Applications
• Linking and connecting data
• Class-wide transforms
• Relabelling columns to unified naming convention
• Labelling no-header datasets
• Merging tables by semantic type
• A component for further semantic modelling
Data Integration | Natalia Rümmele 20 |
UserName
Joe Blogs
Jill Blogs
…
Location
Melbourne
Perth
…
Contact
99110002
45878723
…
_USERS_
Blogs_Joe
Blogs_Jill
…
Number
98712533
34598734
…
Name City Phone Name Phone
Semantic Modelling
Implicit Semantics
• The semantic meaning of a dataset is more than a column label
Data Integration | Natalia Rümmele 22 |
Name
Joe Blogs
Jill Blogs
…
BirthDate
21-05-1986
97-12-1990
…
City
Perth
Adelaide
…
State
WA
SA
…
Workplace
Data61
CSIRO
…
Person City
lives-in? born-in? works-in?
Making Implicit Explicit
• The semantic meaning of a dataset is more than a column label
• There are usually relationships implied between the columns
Data Integration | Natalia Rümmele 23 |
UserName
Joe Blogs
Jill Blogs
…
Location
Melbourne
Perth
…
Contact
99110002
45878723
…
_USERS_
Blogs_Joe
Blogs_Jill
…
Number
98712533
34598734
…
Name Name Number Name Number
Person City Phone Person Phone
has-a
lives-in
has-a
Semantic Model
• The semantic meaning of a dataset is more than a column label
• There are usually relationships implied between the columns
Data Integration | Natalia Rümmele 24 |
UserName
Joe Blogs
Jill Blogs
…
Location
Melbourne
Perth
…
Contact
99110002
45878723
…
Name Name Number
Person City Phone
has-a
lives-in
Semantic Model
Semantic Modelling
• For this we need an ontology or graph schema
• Can come from: • Built up iteratively from definitions
• Pre-defined domain ontologies
• Downloaded ontologies from Semantic Web
Data Integration | Natalia Rümmele 25 |
*M.Taheriyan et al. “A graph-based approach to learn semantic descriptions of data sources ”, ISWC 2013.
*
Ontology
Data Integration | Natalia Rümmele 26 |
Class Node: Abstract Concept
*
Ontology
Data Integration | Natalia Rümmele 27 |
Class Node: Abstract Concept Data Node: Properties
*
Ontology
Data Integration | Natalia Rümmele 28 |
Class Node: Abstract Concept Data Node: Properties Relationship: Relationship between Concepts
*
RDB2RDF Schema Matching
• Given an ontology and a set of known semantic models, can we generate a semantic model for a new dataset
Data Integration | Natalia Rümmele 29 |
UserName
Joe Blogs
Jill Blogs
…
Contact
99110002
45878723
…
Name Number
Person Phone
UserName
Joe Blogs
Jill Blogs
…
Location
Melbourne
Perth
…
Contact
99110002
45878723
…
Name Name Number
Person City Phone
UserName
Joe Blogs
Jill Blogs
…
Location
Melbourne
Perth
…
Name Name
Person City
+
Constructing Semantic Model
Data Integration | Natalia Rümmele 30 |
• Map all possible semantic types matched for columns onto the Ontology
• Need to find most likely semantic model – subgraph which covers matched semantic types
• Minimum Cost Steiner Tree Problem (approximate)
UserName
Joe Blogs
Jill Blogs
…
Organization
Data2Decisions
Data61
…
Person name
Org name
Person Organization
worksFor
Advantages
Not just better understanding of your data…. • Better entity resolution • Easier integration to graph databases or merging between relational • Enables more sophisticated and accurate merging • More powerful search
Data Integration | Natalia Rümmele 31 |
UserName
Joe Blogs
Jill Blogs
…
Location
Melbourne
Perth
…
Contact
99110002
45878723
…
_USERS_
Blogs_Joe
Blogs_Jill
…
Number
98712533
34598734
…
Name City Phone Name Phone
Person City Phone Person Phone
has-a
lives-in
has-a
Advanced Search
• The ontology allows transitive and subclass relationships
• Searches can associate new columns e.g. City -> State -> Country
• A search for something in a country can also proceed to the subclasses
Data Integration | Natalia Rümmele 32 |
Graph Database
• The ontology can act as the intermediary between graph databases and relational databases
• Functions as a graph schema and global (unified) schema of integrated datasets
Data Integration | Natalia Rümmele 33 |
People
Joe Blogs
Jill Blogs
…
Addr
15 Something St
74A Another Rd
…
People
Joe Blogs
Jill Blogs
…
Summary
Data Integration
• Schema Matcher • Relational schema
• Find semantically similar columns across data sets
• Semantic Modelling • Graph schema
• Find semantically similar columns + relationships between them
Data Integration | Natalia Rümmele 35 |
Name Name Number
Person City Phone
Open Questions
• Can we learn column transformations?
• Complex column matches • One-to-many matches
• Many-to-one matches
• Applications in entity resolution
• Applications in search
Data Integration | Natalia Rümmele 36 |
split?
concatenate?
www.csiro.au
Data Platforms Team Engineering and Design
Thank you ?
Alex Collins
Stephen Hardy
Yuriy Tyshetskiy
Natalia Rümmele
Maybe you?