imap: discovering complex semantic matches between database schemas  

35
ITCS6010 Fall 2008 Anuradha Venkataraman 800556407

Upload: art

Post on 08-Feb-2016

30 views

Category:

Documents


0 download

DESCRIPTION

iMAP: Discovering Complex Semantic Matches between Database Schemas  . ITCS6010 Fall 2008 Anuradha Venkataraman 800556407. Introduction. Semantic Mapping : specifies how data in two different data sources are related and how to transform data from one data source to another - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: iMAP: Discovering Complex Semantic Matches between Database Schemas  

ITCS6010Fall 2008Anuradha Venkataraman800556407

Page 2: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Semantic Mapping : specifies how data in two different data sources are related and how to transform data from one data source to another 

It is a 2 step process,◦ find semantic matches between attributes of

two schemas location=concat(city,state)

◦ find mappings to transform data in one schema to another based on the matches found(SQL queries, Clio) 

location=select concat(city,state) from S2

Page 3: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Semantic mapping is a fundamental task in DI or other data sharing systems like DW, EDI , co operations and collaborations

At the time the paper was written semantic mapping was done through a manual process. This was time consuming and also prone to error.

Majority of prior related work focused on creating 1 - 1 matches

Complex matches are common in real-world schemas 

Page 4: iMAP: Discovering Complex Semantic Matches between Database Schemas  

1-1 matches - an attribute in the source schema is matched to another attribute in the target schema

name=custname

Complex matching - specifies a combination of attributes in the source schema that relate to a combination of attributes in the target schema

name=concat(fname,lname)◦ Finding complex matches is more difficult because the

search space is unbounded. 

Page 5: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Schema matching in relational schemas - idea can be used for other types as well

Goal to provide an interactive design environment where a human can create mappings between schemas quickly by using the system’s suggestions.

Create a semi-automatic system to find complex matches between two schemas.

Find complex matches. Use domain knowledge, external data and overlap data 

Page 6: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Match Generation by using search techniques over the solution space

Uses a number of searchers(search modules), each searcher searches a subset of the search space.◦ Text searcher – searches matches for textual target attributes, uses concatenation

operator

Beam search is deployed to control the search through the search space

Machine learning, statistics and heuristics are used to evaluate candidate matches.

Diminishing returns principle is used to determine when the search needs to be terminated

Considers name similarity, domain knowledge, integrity constraints, external data and overlap data to re rank and select the best matches

Provides an explanation module to provide the user with reasons for taking various decisions at various stages

Page 7: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Semi-automatically discovering complex matches that combines search through a set of candidate matches and methods for evaluating each match

Uses of new kinds of domain knowledge (overlap data and mining external data)

A mechanism for explaining the decisions made by the matching system 

Page 8: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Match selector

1-1 and complex matches

Match candidates

Explanationmodule

User

Domain knowledgeand data

Target schema T and source schema S

Similarity matrix

Similarity estimator

Match generator

Page 9: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Match Generator◦ 2 input schemas( S - source schema, T - target schema)◦ For each attribute t in T, generates possible match

candidates by employing various searchers  Similarity Estimator

◦ Computes similarity score for the match candidates indicating the level of similarity to target attribute t

◦ Outputs similarity score matrix  Match Selector

◦ Selects the best matches based on the similarity score and domain constraints

Page 10: iMAP: Discovering Complex Semantic Matches between Database Schemas  

The three modules use external data and domain knowledge and overlap data during the processing to improve accuracy and efficiency.

The modules also interact with an explanation module to provide explanations to various actions that the module performs 

Page 11: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Search over the search space of possible matches

Search space is extremely large Uses multi-searcher strategy

◦ Each searcher has a specific purpose◦ Each searcher searches only a subset of the

search space Advantages:

◦ Makes the system extensible – new searchers can be added and integrated

◦ Each searcher considers small , relevant portions of the search space

Page 12: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Search Strategy: ◦ Uses beam search to control the search process◦ Evaluates each candidate based on a scoring function

and only retains k – top matches at each level. Match Evaluation

◦ Assigns score based on semantic distance between the candidate match and target attribute.

◦ Uses machine learning, statistics, heuristics. Termination

◦ Uses diminishing returns principle◦ Stops when the difference between the best matches in

consecutive levels of iterations is less than a threshold.

Page 13: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Target attributes that are numeric price, total, age

Restricted to common operations such as add, subtract, divide, multiply

total=qty*unitprice Considers similarity between the value

distributions of target attribute and the complex match candidate using the Kullback-Leibler divergence measure. ◦ This method calculates the divergence between two

distributions based on information and probabilistic theory.

Page 14: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Examines concatenation of text attributes in S location=concat(city,state,country)

Uses Naïve Bayes Text Classifier to calculate score

This classifier is trained based on the data in the target schema to learn the target attribute, returns probability value for each data instance from source mapping, the average is the score of the mapping.

Starts with 1-1 matches and proceeds with concatenation using beam search strategy

Page 15: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Target attribute t is categorical if the data instances have less than x distinct values

product-category=product-type

Finds conversions between categorical attributes

waterfront=f(near-water)

Uses KL divergence measures to determine similarity

Page 16: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Relates data of a schema to the schema of the other

Focuses on binary attributes in target schema(yes/no)

mp3support = {yes/no} Identifies binary attributes in T and searches for

the attribute name in the data of S◦ description =“…plays mp3,wav…”

Transforms source attribute s into a categorical attribute of t if it contains more than x instances of the attribute name of t.

mp3support = yes if description contains mp3 no if description does not contain mp3

Page 17: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Finds matches that are a conversion between two different types of units

length-mm=10 * size-cm

Determines physical attributes by looking for the presence of units in the data or in the attribute name

Page 18: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Finds complex matches for date attributes Bdate=bday/bmonth/byear

Uses ontology to identify date or part of date attributes

Bdate is ontology Date;bday- ontology day, etc.

Uses ontology to identify relationships to determine the type of conversion.

Page 19: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Employs scoring techniques that can not be used in the searchers due to efficiency reasons.

Uses two modules:◦ Name base evaluator – similarity between the

names(concatenation of complex match) and target attribute

◦ A Naïve Bayes evaluator – based on a naïve based classifier

Page 20: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Selects the best matches based on the scores and other domain integrity constraints .

Name does not contain numbers Match name=login_name {avenkat5} is ranked lower

Uses domain knowledge to clean up complex matches

Candidate match - total = price*(qty – product_id), pid=product_id

Domain constraint – total and pid are not related. iMap drops product_id to get total=price*qty

Page 21: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Directs search process and prunes meaningless candidates early

Knowledge – fname and location are not related; searcher does not consider complex matches that have both attributes together

Uses domain knowledge as early as possible to prune the search space

Types of Domain Knowledge used Domain Constraints – present in schema or provided by user.

◦ Used in various phases based on type of constraint average of total number of rooms is less that 10, all matches that evaluate to more than 10 are

ignored. This decision might be postponed to a later a stage if evaluation is time consuming. Past Complex Matches – Uses knowledge gained from previous matching of schemas in

similar domains.◦ Extracts expression template for complex operations which are used by the searchers

Previous match : cost=price(1 + .50); template – attribute1=variable(1+constant), matcher looks for similar expressions in complex mapping.

External Data – mines properties of attributes from external sources. These properties are used as domain constraints in various stages. ◦ External data source generally provided by user( domain expert ).

Information from mined data – average unit cost does not exceed 1000$.

Page 22: iMAP: Discovering Complex Semantic Matches between Database Schemas  

In real-world schemas source and target schemas mostly share some data .

Common tuples or data that represent same entity.

This overlap data provides important information on the mappings of the attributes which can be used.

Special overlap searchers are used when there is overlap data in source and target schemas.

Page 23: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Overlap Text Searcher◦ Used instead of text searcher◦ Uses overlap data to revaluate the mappings generated

by the text searcher.◦ New score based on the fraction of overlap data that the

mapping satisfies. Overlap data – John Smith = concat(John,Smith) Mapping name=concat(fname,city) does not match overlap

data – score 0; name=concat(fname,lname) matches overlap data – score 1.

Overlap Numeric Searcher◦ Used instead of numeric searcher◦ Uses equation discovery system to find the best

arithmetic expression that matches attribute t. Overlap category and schema mismatch

Searcher ◦ Similar technique to Overlap text searcher

Page 24: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Explanations help users better understand the system.

The system uses complex processes and hence there is a need to explain decisions to the user.

This helps the user guide the system to find the correct matches

Type of Questions a user can ask:◦ Explain a match – why numrooms=baths+beds+floor

generated◦ Explain why a match is not present – why

price=listprice+agent_fee not generated◦ Explain the ranking between matches – why

price=listprice better than price=listprice+agent_fee Questions can be asked to any specific module

Page 25: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Uses a dependency graph that is created as the process flows through each module

Records matches, assumptions, data Nodes: attributes, assumptions, candidate

matches, domain knowledge. Connected by directed edge labeled with the module that was responsible for the decision .

Page 26: iMAP: Discovering Complex Semantic Matches between Database Schemas  
Page 27: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Domains and Data◦ Cricket, Inventory, Financial, Real estate◦ Collected from various sources◦ Use both schemas with overlap data as well as disjoint

schemas

Performance Measure◦ Top 1 matching accuracy - % of matches where the

best match for a attribute is the correct match◦ Top 3 matching accuracy - % of matches where the

top 3 ranked matches for a attribute contains the correct match – more relevant as the tool is used only to suggest a rank list of matches.

Page 28: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Overlap Data◦ Default iMap – 58-74 % accuracy◦ Exploiting domain knowledge – 68-92 %

Disjoint Data◦ Default iMap – 55-76 % accuracy◦ Exploiting domain knowledge – 62-79 %

Top – 1 Accuracy – 62-92% Top – 3 Accuracy – 64 –95% Exploiting domain knowledge and

presence of overlap data increases accuracy

Page 29: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Default System – 33-55% Exploiting domain constraint and overlap data – 50

-86% Top-1 Accuracy –

◦ Disjoint data – 27 – 58 %◦ Overlap Data – 70-85%

Top – 3 Accuracy◦ Disjoint data – 43 – 92 %◦ Overlap Data – 43 – 92%

Exploiting domain constraints and overlap data is significant

High accuracy in Top -3 Matches for complex matches

Page 30: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Top - 3 accuracy is better than Top – 1

Removing small attributes that act as noise in a complex match is difficult

Phone=concat(id,areacode,number), id is a small value that acts as noise

Difficult to find small attributes that form parts of a complex match

Address=concat(street,city,state) , apt# is a small value that is missed

Page 31: iMAP: Discovering Complex Semantic Matches between Database Schemas  

L. Xu and D. Embley. Using domain ontologies to discover direct and indirect matches for schema elements. In Proc.of the Semantic Integration Workshop at ISWC-2003.

Uses domain ontology to find relationships and mappings between two schemas

Can be useful in some contexts such as the date searcher. More such modules can be added

Clio – a complimentary work that uses schema mappings to provide an interactive system that generates transformations from one schema to another.

Page 32: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Finding complex matches is a tough task iMap identifies both 1-1 and complex matches Uses search modules to generate matches Exploits domain knowledge and overlap data to

improve accuracy Provides a explanation feature to better assist

the user The accuracy levels obtained from

experimentation shows that on further fine tuning and improvisation it can be even better

Page 33: iMAP: Discovering Complex Semantic Matches between Database Schemas  

Handle other types of data sources other than relational data sources

Provide better user interaction style Integrate more ontology based searchers

Page 34: iMAP: Discovering Complex Semantic Matches between Database Schemas  

How is overlap data identified? How are the domain constraints taken as input How does the user interact with the system? How can the accuracy be improved? Can the system learn from past experiences like user

suggestions (not just past matches)? Is it scalable? Can it handle multiple schema matching

(like DI systems where number of schemas are matched with a mediated schema)?

What other search modules do you think need to be integrated?

How efficient is it when handling complex schemas (the schemas experimented which are rather small)?

Can it be extended to find complex matches over multiple tables more accurately?

Page 35: iMAP: Discovering Complex Semantic Matches between Database Schemas