Discovering, Maintaining, and Using Semantics for Database Schemas
Yuan An, Ph.D.
iSchool at Drexel
February 23, 2009
CS Department at Villanova Univ.
2
Background
• Information integration is the problem of sharing and using data across disparate information sources.
• Information integration is challenging because information sources are often distributed, autonomous, and heterogeneous.
3
Example of Information Integration
• Patient healthcare and medical data usually reside in multiple sources such as different units of hospitals, labs, clinics, personal data management devices, and even drugstores.
• Example tasks for information integration:
– obtaining a holistic view of patient health status
– merging data for multiple healthcare providers
4
Information Integration
5
A Central Issue
• A key component of any solution for information integration is the definition of mappings between different data sources/schemas.
• Despite a decade’s effort, building schema mappings remains a very difficult problem.
• The difficulty lies in the requirement of understanding the meaning of the schemas being mapped.
6
An Example
[Figure: two hospital database schemas.]
Philadelphia General Hospital DB: Admission(ID, AdmDate, PatRef, DisDate); Treatment(ID, Doc, PatRef, Date, Desc); Patient(ID, Name, MedCr#, Diagnosis)
Boston Mass General Hospital DB: Coronary(ID, Enter, Policy#, Leave, Patient); Pulmonary(ID, Enter, Policy#, Leave, Patient); Admission(ID); Treatment(ID, Doc, ProgID, Date); Progress(ID, Symptom, PatRef)
Task: transfer patient medical information from Philadelphia General Hospital to Boston Mass General Hospital.
7
Schema Semantics
[Figure: tables from the two hospital schemas linked to fragments of a conceptual model.]
Progress(ID, Symptom, PatRef) corresponds to the concepts Progress (hasID, hasSymptom) and Patient (hasRefN, hasName), connected by the relationship relate (* to 1).
Treatment(ID, Doc, ProgID, Date) corresponds to the concepts Treatment (hasID, hasDate), Doctor (hasPhyID, hasName), and Progress (hasID, hasSymptom), connected by the relationships prescribe and apply (cardinalities * to 1 and 1 to *).
8
Discovering Semantics
• We aim to develop an automatic tool for discovering semantic mappings from database schemas to conceptual models (CM).
[Figure: a DB linked to its conceptual model.]
9
Benefits of Discovering Semantics for Schemas
[Figure: the table Treatment(ID, Doc, PatRef, Date, Desc) from the Philadelphia General Hospital DB Schema, mapped to the Boston Mass Hospital DB Conceptual Model. The conceptual model has the concepts Treatment (hasID, hasDate), Doctor (hasPhyID, hasName), Progress (hasID, hasSymptom), and Patient (hasRefN, hasName), connected by the relationships prescribe, recommend, apply, monitor, and relate.]
10
Maintaining Semantics
• We aim to develop a round-trip engineering solution for maintaining semantics under CM/schema evolution.
[Figure: DB and its conceptual model evolving into DB’ and conceptual model’.]
11
Using Semantics for Discovering Schema Mapping
[Figure: DB1 with conceptual model 1 and DB2 with conceptual model 2.]
12
Roadmap
• Background
• Contributions
• Discovering Semantics for Schemas
• Maintaining Semantics for Schemas
• Using the Semantics for Schema Mapping
• Conclusions
13
Challenges
• Conceptual models carry much richer semantics, e.g., weak entities, partOf, n-ary relationships, ISA relationships…
• Need to distinguish them all from schema structures.
• Discover all and only the “reasonable” trees, which we call semantic trees, that are plausible semantics of the table.
[Figure: the table Treatment(ID, Doc, PatRef, Date, Desc) and a conceptual model with the concepts Treatment (hasID, hasDate), Doctor (hasPhyID, hasName), Progress (hasID, hasSymptom), and Patient (hasRefN, hasName), connected by the relationships prescribe, recommend, apply, monitor, and relate.]
14
Existing Mapping Tools
• Schema matching tools: associate atomic elements in different schemas using syntactic links.
• Schema mapping tools: infer query expressions for translating/exchanging data.
– unable to discover expected semantics of a schema in terms of a conceptual model.
[Figure: the tables Treatment(ID, Doc, PatRef, Date, Desc) and Treatment(ID, Doc, ProgID, Date).]
15
Our Solution for Discovering Semantics
[Figure: the table Treatment(ID, Doc, PatRef, Date, Desc) and a conceptual model with the concepts Treatment (hasID, hasDate), Doctor (hasPhyID, hasName), Progress (hasID, hasSymptom), and Patient (hasRefN, hasName), connected by the relationships prescribe, recommend, apply, monitor, and relate.]
Simple correspondences can be specified manually or by using a schema matching tool.
The key is to discover “reasonable” links based on:
1. an analysis of key and foreign key constraints in schemas.
2. a careful study of standard database design principles.
We focus on deriving semantic trees connecting the individual concepts using “reasonable” links.
16
Discovering Semantic Trees
[Figure: the table Treatment(ID, Doc, PatRef, Date, Desc) and a conceptual model with the concepts Treatment (hasID, hasDate), Doctor (hasPhyID, hasName), Progress (hasID, hasSymptom), and Patient (hasRefN, hasName), connected by the relationships prescribe, recommend, apply, monitor, and relate.]
Step 1: determine a skeleton tree and its anchor from the key columns.
Step 2: determine the skeleton trees, and their anchors, that correspond to the foreign key (f.k.) columns.
Step 3: link the skeleton trees using shortest functional paths.
Step 4: link any concepts corresponding to unaccounted-for columns.
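Linking skeleton trees by shortest functional paths can be pictured with a small sketch. It assumes the conceptual model is stored as a list of binary relationships, each functional (many-to-one) in the direction listed; the edge directions in `FUNCTIONAL_EDGES` and the name `shortest_functional_path` are hypothetical, chosen for illustration only, and are not taken from the MAPONTO implementation. A breadth-first search then returns a shortest path that follows functional edges only.

```python
from collections import deque

# Hypothetical conceptual-model fragment: (source, relationship, target).
# Each edge is ASSUMED to be functional (many-to-one) from source to target.
FUNCTIONAL_EDGES = [
    ("Treatment", "apply", "Progress"),
    ("Treatment", "prescribe", "Doctor"),
    ("Progress", "relate", "Patient"),
]


def shortest_functional_path(edges, start, goal):
    """Return a shortest list of (source, relationship, target) hops, or None."""
    outgoing = {}
    for src, rel, dst in edges:
        outgoing.setdefault(src, []).append((rel, dst))

    queue = deque([(start, [])])  # (current concept, hops taken so far)
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, dst in outgoing.get(node, []):
            if dst not in visited:
                visited.add(dst)
                queue.append((dst, path + [(node, rel, dst)]))
    return None  # no functional path connects the two anchors


path = shortest_functional_path(FUNCTIONAL_EDGES, "Treatment", "Patient")
```

Because only functional edges are followed, no path is returned in the reverse direction (from Patient to Treatment) under these assumed cardinalities.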
17
“Divide and Conquer”
• A gradual manner:
1. ER0 – an initial subset with binary relationships.
2. ER1 – adding n-ary relationships.
3. ER2 – adding ISA relationships.
18
“Good” Properties of the Algorithm
• Guarantees only for “standard” relational schemas.
1. A sense of “completeness”: the algorithm finds all the “correct” semantics.
2. A sense of “soundness”: for multiple candidates, each one would result in an “indistinguishable” table by the standard database design methodology.
19
The MAPONTO Tool
[Figure label: the mapping formulas.]
20
Evaluation Results
Schemas | # of Tables | # of Columns | Ontology | # of Nodes | # of Links
UTCS Department | 8 | 32 | Academic Department | 62 | 1913
VLDB Conference | 9 | 38 | Academic Conference | 27 | 143
DBLP Bibliography | 5 | 27 | Bibliographic Data | 75 | 1178
OBSERVER Project | 8 | 115 | Bibliographic Data | 75 | 1178
Country | 6 | 18 | CIA Factbook | 52 | 125
21
Evaluation Results
• Correct semantics found for 85% of the tested tables.
• The maximum number of semantic candidates is 4.
• Average execution time is less than 1 second.
22
Roadmap
• Background
• Contributions
• Discovering Semantics for Schemas
• Maintaining Semantics for Schemas
• Using the Semantics for Schema Mapping
• Conclusions
23
Maintaining Semantics
• We aim to develop a round-trip engineering solution for maintaining semantics under CM/schema evolution.
[Figure: DB and its conceptual model evolving into DB’ and conceptual model’.]
24
Challenges in Maintenance
• What to maintain: how to define the property to be maintained and how to detect violations of that property.
• How to capture changes to CMs and relational schemas.
• How to reconcile CMs and schemas according to the intent of users.
25
Our Goals of Mapping Maintenance
• To keep the mapping consistent: a consistent conceptual-relational mapping allows two-way translation of legal instances.
• To reconcile the conceptual model when the associated schema evolves.
• To update the mapping when the associated conceptual model evolves.
26
Capturing CM/Schema Changes
• A user can change CM/schema in different ways:
– Modifying the original model.
– Generating a new model.
• It is difficult to ask the user to provide a sequence of primitive actions.
• It would be easier to ask the user to draw correspondences.
Biosample(bsid,species,organ,…,donor_disease)
Biosample(bsid,species,organ,…) tissue(bsid,donor_disease)
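One way to picture such correspondences is as pairs linking each old column to its counterpart in the new schema. The sketch below is a minimal illustration, assuming the first Biosample line is the schema before the change and the second line is the schema after it. Columns hidden behind the “…” are left out, and the names `CORRESPONDENCES`, `check_correspondences`, and `moved_columns` are hypothetical.

```python
# Schemas as shown on the slide; columns elided by "…" are simply left out.
OLD_SCHEMA = {"Biosample": ["bsid", "species", "organ", "donor_disease"]}
NEW_SCHEMA = {
    "Biosample": ["bsid", "species", "organ"],
    "tissue": ["bsid", "donor_disease"],
}

# Hypothetical user-drawn correspondences:
# (old table, old column) -> (new table, new column).
CORRESPONDENCES = {
    ("Biosample", "bsid"): ("Biosample", "bsid"),
    ("Biosample", "species"): ("Biosample", "species"),
    ("Biosample", "organ"): ("Biosample", "organ"),
    ("Biosample", "donor_disease"): ("tissue", "donor_disease"),
}


def check_correspondences(correspondences, old_schema, new_schema):
    """Raise ValueError if a correspondence names a column missing from a schema."""
    for (old_table, old_col), (new_table, new_col) in correspondences.items():
        if old_col not in old_schema.get(old_table, []):
            raise ValueError(f"unknown old column {old_table}.{old_col}")
        if new_col not in new_schema.get(new_table, []):
            raise ValueError(f"unknown new column {new_table}.{new_col}")


def moved_columns(correspondences):
    """Return (old, new) pairs whose column now lives in a different table."""
    return sorted(
        (old, new) for old, new in correspondences.items() if old[0] != new[0]
    )


check_correspondences(CORRESPONDENCES, OLD_SCHEMA, NEW_SCHEMA)
moved = moved_columns(CORRESPONDENCES)
```

Here the only column reported as moved is donor_disease, which went from Biosample to tissue; the user never has to spell out the primitive actions that produced the new schema.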
27
Reconciling CM and Schema
• Analyzing the existing semantics in the original mappings in terms of skeleton trees and connections between anchors.
• Discovering changes through correspondences between old and new models.
• Synchronizing models and adapting the mapping accordingly.
28
Evaluation Methodology and Results
• The same data sets as for discovering conceptual-relational mappings.
• Measuring efficiency and benefits in comparison to the mapping reconstruction approach.
• Comparing the number of mapping candidates generated by the maintenance and reconstruction approaches.
• The maintenance approach can save at least 80% of user effort for reaching consistent mappings. Execution time is insignificant: avg. < 1 sec.
29
Roadmap
• Background
• Contributions
• Discovering Semantics for Schemas
• Maintaining Semantics for Schemas
• Using the Semantics for Schema Mapping
• Conclusions
30
Using CM-Relational Mappings for Discovering Schema Mapping
[Figure: DB1 with conceptual model 1 and DB2 with conceptual model 2.]
31
Current Solutions for Schema Mapping
compose Progress(ID,PatRef,Symptom) with Treatment(ID’,ProgID,Doc,Date) where Progress.ID=Treatment.ProgID → Treatment(ID’,PatRef,Doc,Date,Symptom).
SOURCE: Treatment(ID, Doc, ProgID, Date); Progress(ID, Symptom, PatRef)
TARGET: Treatment(ID, Doc, PatRef, Date, Desc)
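The composition described on this slide is essentially a join of the two source tables on Progress.ID = Treatment.ProgID. The sketch below expresses it with Python's standard `sqlite3` module; the sample rows are invented purely for illustration, and the column types are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Progress (ID INTEGER PRIMARY KEY, PatRef TEXT, Symptom TEXT);
    CREATE TABLE Treatment (ID INTEGER PRIMARY KEY, ProgID INTEGER, Doc TEXT, Date TEXT);
    -- Invented sample rows, for illustration only.
    INSERT INTO Progress VALUES (1, 'p7', 'chest pain');
    INSERT INTO Progress VALUES (2, 'p8', 'cough');
    INSERT INTO Treatment VALUES (10, 1, 'Dr. Lee', '2009-02-23');
    """
)

# Compose Progress with Treatment where Progress.ID = Treatment.ProgID,
# producing rows shaped like the target Treatment(ID', PatRef, Doc, Date, Symptom).
rows = conn.execute(
    """
    SELECT t.ID, p.PatRef, t.Doc, t.Date, p.Symptom
    FROM Progress AS p
    JOIN Treatment AS t ON p.ID = t.ProgID
    """
).fetchall()
conn.close()
```

The second Progress row has no matching Treatment row, so the inner join drops it from the result.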
32
33/44
Using the Semantics
1. load Doctor.name and Doctor.clinic into employee as employee.name and employee.clinic in the target.
2. load Scientist.name and Scientist.lab into employee as employee.name and employee.lab in the target.
3. compose Doctor(ssn,name’,clinic) with Scientist(ssn,name,lab) where they have the same ssn → employee(z,name,clinic,lab).
[Figure: a conceptual model with the concepts Employee (ssn, name), Doctor (ssn, clinic), and Scientist (ssn, lab), marked with an X; source tables Doctor(ssn, name, clinic) and Scientist(ssn, name, lab); target table employee(eid, name, clinic, lab).]
Principles of the Semantic Approach
• Discovering two conceptual subgraphs (CSG) that are “semantically similar” (≠ “structurally match”) and then translating the CSGs into algebraic expressions.
1. connections between corresponding pairs of nodes are semantically similar or compatible, e.g., ISA, partOf…
2. maintaining desirable properties in database queries.
3. the principle of parsimony: smallest trees.
34
Evaluation Methodology
• Comparison between the semantic approach and traditional approaches based on referential integrity constraints.
• Manually specified mapping expressions as a “gold standard”.
• Traditional “precision” and “recall” as evaluation criteria.
• Data collection from a variety of domains.
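Precision and recall against a gold standard can be computed as in the sketch below. It assumes each mapping expression can be reduced to a comparable identifier; the identifiers `m1` to `m5` and the function name `precision_recall` are hypothetical and do not come from the evaluation itself.

```python
def precision_recall(generated, gold):
    """Compare generated mappings with the manually specified gold standard."""
    generated, gold = set(generated), set(gold)
    correct = generated & gold
    # Precision: fraction of generated mappings that appear in the gold standard.
    precision = len(correct) / len(generated) if generated else 0.0
    # Recall: fraction of gold-standard mappings that were generated.
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: four generated mappings, three gold-standard mappings.
precision, recall = precision_recall({"m1", "m2", "m3", "m4"}, {"m1", "m2", "m5"})
```

In this made-up case two of the four generated mappings are correct, so precision is 0.5, and two of the three gold-standard mappings are found, so recall is 2/3.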
35
Test Data
Schema | # tables | Associated CM | # nodes in CM | # mappings tested
DBLP1 / DBLP2 | 22 / 9 | Bibliographic / DBLP2 ER | 75 / 7 | 6
Mondial1 / Mondial2 | 28 / 26 | Factbook / Mondial2 ER | 52 / 26 | 5
Amalgam1 / Amalgam2 | 15 / 27 | Amalgam1 ER / Amalgam2 ER | 8 / 26 | 7
3Sdb1 / 3Sdb2 | 9 / 9 | 3Sdb1 ER / 3Sdb2 ER | 3 3
UTCS / UTDB | 8 / 13 | KA ontology / CS dept. ontology | 105 / 62 | 2
HotelA / HotelB | 6 / 5 | hotelA ontology / hotelB ontology | 7 / 7 | 5
NetworkA / NetworkB | 18 / 19 | networkA ontology / networkB ontology | 28 / 27 | 6
36
Summary of the Evaluation Results
• Found all the expected mappings as found by the traditional approach.
• Improved precision (70% of the test cases) by eliminating suspicious pairings.
• Improved recall (40% of the test cases) by considering ISA as a functional relationship.
• Where the schemas carry few complicated semantics, there are no improvements.
37
Roadmap
• Background
• Contributions
• Discovering Semantics for Schemas
• Maintaining Semantics for Schemas
• Using the Semantics for Schema Mapping
• Conclusions
38
Conclusions
• A novel and effective tool for discovering semantics for schemas in terms of conceptual models.
• A round-trip engineering process for maintaining semantic mappings.
• A semantic approach for improving schema mappings using the semantics.
• A suite of tools for assisting users to discover and maintain mappings between different data representations in a variety of information integration situations.
39
Thank You!
40