Transcript
Page 1: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

www.data61.csiro.au

Automating Data Integration with Machine Learning Bringing Your Data Together

Natalia Rümmele | Data Scientist September 2016

Page 2: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Data Integration Problem

• Combine data from different sources

• Provide unified view of data

Data Integration | Natalia Rümmele 2 |

?

Page 3: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Data Integration Problem

• Combine data from different sources

• Provide unified view of data • Multiple datasets with common entities and

content

• Siloed systems

• Sharing data across systems/schemas

• Handling legacy systems

• Handling different schemas

Data Integration | Natalia Rümmele 3 |

?

Page 4: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Data Integration Problem

• Combine data from different sources

• Provide unified view of data • Multiple datasets with common entities and

content

• Siloed systems

• Sharing data across systems/schemas

• Handling legacy systems

• Handling different schemas

• Resource-intensive ETL…

Data Integration | Natalia Rümmele 4 |

?

Page 5: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Data Integration Problem

• Combine data from different sources

• Provide unified view of data • Multiple datasets with common entities and

content

• Siloed systems

• Sharing data across systems/schemas

• Handling legacy systems

• Handling different schemas

• Resource-intensive ETL…

• Can machine learning help?

Data Integration | Natalia Rümmele 5 |

?

Page 6: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Areas of Investigation

• Schema matching & mapping • Connecting datasets by establishing a global data

model

• Entity Resolution • Detecting entities across databases

• Privacy Preserving Analytics

• Learning the global data model without sharing data

• Data Quality

• Semantic and Syntactic scoring

Data Integration | Natalia Rümmele 6 |

Page 7: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Relational Schema Matching

Page 8: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Schema Matching

• Goal: Automatically connect datasets across relational schemas

Data Integration | Natalia Rümmele 8 |

UserName

Joe Blogs

Jill Blogs

Names

Blogs, Joe

Blogs, Jill

People

Blogs, J

Blogs, J

_USERS_

Blogs_Joe

Blogs_Jill

IDs

Joe Blogs

Jill Blogs

Page 9: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Schema Matching

• Goal: Automatically connect datasets across relational schemas

• Problem: Syntax vs Semantics • Requires humans to distinguish between similar columns

• Machines get caught on syntax

Data Integration | Natalia Rümmele 9 |

UserName

Joe Blogs

Jill Blogs

Names

Blogs, Joe

Blogs, Jill

People

Blogs, J

Blogs, J

_USERS_

Blogs_Joe

Blogs_Jill

IDs

Joe Blogs

Jill Blogs

Name

Semantic type

Page 10: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Data Integration: Schema Matcher

• Can we automatically label columns with semantic types?

• Given multiple datasets and a set of semantic types, can we detect all columns of the same semantic type where: • Column names may be different

• Common entries may not exist

• Formatting issues may exist

Data Integration | Natalia Rümmele 10 |

UserName

Joe Blogs

Jill Blogs

Names

Blogs, Joe

Blogs, Jill

People

Blogs, J

Blogs, J

_USERS_

Blogs_Joe

Blogs_Jill

IDs

Joe Blogs

Jill Blogs

Name Name Name Name Name

Schema Matcher

Name Address Phone

Page 11: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Machine Learning: Feature Vector

• Multi-class classification problem • Represent column as a vector of features • Classify column as one of semantic types using a ML classifier • Training data needed!

Data Integration | Natalia Rümmele 11 |

People

Joe Blogs

Jill Blogs

Fred Flogs

Fiona Flogs

George Glogs

Gini Glogs

Henry Hogs

...

Column Header + Table Name • Nearest neighbours to class training labels • Edit distance metrics • WordNet distance

Content • Character Frequencies • Missing Values • Number repeated values • Entropy

Page 12: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Machine Learning: Cost Matrix

• Class imbalance: unknown class over-represented (unlabeled columns)

• Class resampling strategies: oversampling/undersampling to mean, etc.

• Cost-sensitive learning by introducing asymmetric costs of misclassifications

Data Integration | Natalia Rümmele 12 |

Semantic Type

Predicted

Act

ual

Unknown Name Address Phone

Unknown 0 1 1 1

Name 40 0 1 1

Address 40 1 0 1

Phone 40 1 1 0

Page 13: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Machine Learning: Bagging

• Number of columns usually much smaller than dataset

• Can use bagging to build many smaller samples by subsampling

• Also addresses class imbalance

Data Integration | Natalia Rümmele 13 |

People

Joe Blogs

Jill Blogs

Fred Flogs

Fiona Flogs

George Glogs

Gini Glogs

Henry Hogs

...

People

Joe Blogs

Jill Blogs

Fred Flogs

People

George Glogs

Jill Blogs

Gini Glogs

People

Henry Hogs

Joe Blogs

Henry Hogs

People

Joe Blogs

Jill Blogs

Fred Flogs

People

Joe Blogs

Gini Glogs

Fred Flogs

Page 14: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Data Integration: Schema Matcher

• Schema Matcher is trained on examples and user feedback

• System improves as predictions are corrected

Data Integration | Natalia Rümmele 14 |

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

Schema Matcher

Address Address

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

Schema Matcher

Name Address

Page 15: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Data Integration: Schema Matcher

• User supplies semantic types and datasets

• User labels some example sets

Data Integration | Natalia Rümmele 15 |

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

Address Name

People

Joe Blogs

Jill Blogs

Phone Email

Page 16: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Data Integration: Schema Matcher

• User supplies semantic types and datasets

• User labels some example sets

• Predictions are generated

Data Integration | Natalia Rümmele 16 |

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

Address Name

People

Joe Blogs

Jill Blogs

Phone Email

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

Name

Email

Page 17: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Data Integration: Schema Matcher

• User supplies types and datasets

• User labels some example sets

• Predictions are generated, user corrects

Data Integration | Natalia Rümmele 17 |

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

Address Name

People

Joe Blogs

Jill Blogs

Phone Email

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

Name

Email

Page 18: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Data Integration: Schema Matcher

• User supplies types and datasets

• User labels some example sets

• Predictions are generated, user corrects

• User adds more data

Data Integration | Natalia Rümmele 18 |

Address Name

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

People

Joe Blogs

Jill Blogs

Phone Email

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

Name

Email

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

People

Joe Blogs

Jill Blogs

Page 19: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Data Integration: Schema Matcher

• User supplies types and datasets

• User labels some example sets

• Predictions are generated, user corrects

• User adds more data

• Repeat

Data Integration | Natalia Rümmele 19 |

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

Address Name

People

Joe Blogs

Jill Blogs

Phone Email

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

Name

Email

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

People

Joe Blogs

Jill Blogs

Page 20: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Applications

• Linking and connecting data

• Class-wide transforms

• Relabelling columns to unified naming convention

• Labelling no-header datasets

• Merging tables by semantic type

• A component for further semantic modelling

Data Integration | Natalia Rümmele 20 |

UserName

Joe Blogs

Jill Blogs

Location

Melbourne

Perth

Contact

99110002

45878723

_USERS_

Blogs_Joe

Blogs_Jill

Number

98712533

34598734

Name City Phone Name Phone

Page 21: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Semantic Modelling

Page 22: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Implicit Semantics

• The semantic meaning of a dataset is more than a column label

Data Integration | Natalia Rümmele 22 |

Name

Joe Blogs

Jill Blogs

BirthDate

21-05-1986

97-12-1990

City

Perth

Adelaide

State

WA

SA

Workplace

Data61

CSIRO

Person City

lives-in? born-in? works-in?

Page 23: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Making Implicit Explicit

• The semantic meaning of a dataset is more than a column label

• There are usually relationships implied between the columns

Data Integration | Natalia Rümmele 23 |

UserName

Joe Blogs

Jill Blogs

Location

Melbourne

Perth

Contact

99110002

45878723

_USERS_

Blogs_Joe

Blogs_Jill

Number

98712533

34598734

Name Name Number Name Number

Person City Phone Person Phone

has-a

lives-in

has-a

Page 24: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Semantic Model

• The semantic meaning of a dataset is more than a column label

• There are usually relationships implied between the columns

Data Integration | Natalia Rümmele 24 |

UserName

Joe Blogs

Jill Blogs

Location

Melbourne

Perth

Contact

99110002

45878723

Name Name Number

Person City Phone

has-a

lives-in

Semantic Model

Page 25: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Semantic Modelling

• For this we need an ontology or graph schema

• Can come from: • Built up iteratively from definitions

• Pre-defined domain ontologies

• Downloaded ontologies from Semantic Web

Data Integration | Natalia Rümmele 25 |

*M.Taheriyan et al. “A graph-based approach to learn semantic descriptions of data sources ”, ISWC 2013.

*

Page 26: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Ontology

Data Integration | Natalia Rümmele 26 |

Class Node: Abstract Concept

*

Page 27: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Ontology

Data Integration | Natalia Rümmele 27 |

Class Node: Abstract Concept Data Node: Properties

*

Page 28: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Ontology

Data Integration | Natalia Rümmele 28 |

Class Node: Abstract Concept Data Node: Properties Relationship: Relationship between Concepts

*

Page 29: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

RDB2RDF Schema Matching

• Given an ontology and a set of known semantic models, can we generate a semantic model for a new dataset

Data Integration | Natalia Rümmele 29 |

UserName

Joe Blogs

Jill Blogs

Contact

99110002

45878723

Name Number

Person Phone

UserName

Joe Blogs

Jill Blogs

Location

Melbourne

Perth

Contact

99110002

45878723

Name Name Number

Person City Phone

UserName

Joe Blogs

Jill Blogs

Location

Melbourne

Perth

Name Name

Person City

+

Page 30: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Constructing Semantic Model

Data Integration | Natalia Rümmele 30 |

• Map all possible semantic types matched for columns onto the Ontology

• Need to find most likely semantic model – subgraph which covers matched semantic types

• Minimum Cost Steiner Tree Problem (approximate)

UserName

Joe Blogs

Jill Blogs

Organization

Data2Decisions

Data61

Person name

Org name

Person Organization

worksFor

Page 31: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Advantages

Not just better understanding of your data…. • Better entity resolution • Easier integration to graph databases or merging between relational • Enables more sophisticated and accurate merging • More powerful search

Data Integration | Natalia Rümmele 31 |

UserName

Joe Blogs

Jill Blogs

Location

Melbourne

Perth

Contact

99110002

45878723

_USERS_

Blogs_Joe

Blogs_Jill

Number

98712533

34598734

Name City Phone Name Phone

Person City Phone Person Phone

has-a

lives-in

has-a

Page 32: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Advanced Search

• The ontology allows transitive and subclass relationships

• Searches can associate new columns e.g. City -> State -> Country

• A search for something in a country can also proceed to the subclasses

Data Integration | Natalia Rümmele 32 |

Page 33: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Graph Database

• The ontology can act as the intermediary between graph databases and relational databases

• Functions as a graph schema and global (unified) schema of integrated datasets

Data Integration | Natalia Rümmele 33 |

People

Joe Blogs

Jill Blogs

Addr

15 Something St

74A Another Rd

People

Joe Blogs

Jill Blogs

Page 34: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Summary

Page 35: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Data Integration

• Schema Matcher • Relational schema

• Find semantically similar columns across data sets

• Semantic Modelling • Graph schema

• Find semantically similar columns + relationships between them

Data Integration | Natalia Rümmele 35 |

Name Name Number

Person City Phone

Page 36: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

Open Questions

• Can we learn column transformations?

• Complex column matches • One-to-many matches

• Many-to-one matches

• Applications in entity resolution

• Applications in search

Data Integration | Natalia Rümmele 36 |

split?

concatenate?

Page 37: Automating Data Integration with Machine Learning...Automating Data Integration with Machine Learning Bringing Your Data Together Natalia Rümmele | Data Scientist ... •Cost-sensitive

www.csiro.au

Data Platforms Team Engineering and Design

e [email protected]

Thank you ?

Alex Collins

Stephen Hardy

Yuriy Tyshetskiy

Natalia Rümmele

Maybe you?


Top Related