Minor Thesis
A scalable schema matching framework for relational databases
Student: Ahmed Saimon Adam
ID: 110022478
Award: MSc (Computer & Information Science)
Date: 17th September 2010
Supervisor: Dr. Jixue Liu
INTRODUCTION
• What is a database schema?
  ▫ The structure of a database, describing how its concepts, their relationships and constraints are arranged
• What is schema matching?
  ▫ The process of identifying semantic correspondences between elements of database schemas
Schema matching applications
  ▫ Critical task in any data sharing process
  ▫ Data warehousing
      Consolidation of multiple transaction processing databases
  ▫ Database integration processes
      E.g. two companies merge and integrate their employee, inventory and financial databases
  ▫ Cooperation between government agencies and various institutions
      E.g. police and transport departments, immigration and universities
Importance of the research
• Currently done manually and semi-automatically
• Doing it manually is tedious, error-prone and costly
• No fully automatic system is available
      Existing systems require user interaction
• Semantic query processing, mobile web, e-commerce collaboration in enterprises
• Demand for more scalable, accurate and efficient schema matching technology is increasing
Research objectives
• Propose a framework that
  ▫ adopts a scalable architecture
  ▫ offers a library of schema matching algorithms that exploit various kinds of information for better accuracy
  ▫ is independent of any specific application domain
Methodology
• Build a framework by adopting a composite architecture
• Create a library of matchers at different levels
• Build a prototype and perform an empirical evaluation on it to test accuracy, scalability and efficiency
Schema Matching Architecture
• Input
  ▫ Represented in SQL DDL format
…..
CREATE TABLE StudentDB.Student(
    studentId INT,
    studentName VARCHAR(100),
    studentPhone VARCHAR(50),
    PRIMARY KEY (studentId)
);
…..
Schema Matching Architecture
• Input
  ▫ Currently supports versions after Oracle9 and SQL Server 2000
      Uses a data type conversion table if the schemas come from different DBMSs
  ▫ Input processor extracts schema information
      E.g. element names, data types, keys
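A minimal sketch of what such an input processor might do, assuming a single well-formed CREATE TABLE statement like the Student example. The function name `parse_create_table` and the contents of `TYPE_MAP` are hypothetical and only illustrate the idea of a data type conversion table; the slides do not describe the actual implementation.

```python
import re

# Hypothetical conversion table: vendor-specific type names mapped to a common name.
TYPE_MAP = {"VARCHAR2": "VARCHAR", "NVARCHAR": "VARCHAR", "INT": "INTEGER", "NUMBER": "NUMERIC"}

def parse_create_table(ddl):
    """Extract the table name, columns (name, type, length) and primary key columns."""
    m = re.search(r"CREATE\s+TABLE\s+([\w.]+)\s*\((.*)\)\s*;?\s*$", ddl, re.S | re.I)
    if m is None:
        raise ValueError("not a CREATE TABLE statement")
    table, body = m.group(1), m.group(2)
    columns, primary_key = [], []
    # Split on commas that are not inside parentheses, e.g. not inside DECIMAL(10,2).
    for part in re.split(r",(?![^()]*\))", body):
        part = part.strip()
        pk = re.match(r"PRIMARY\s+KEY\s*\((.*?)\)", part, re.I)
        if pk:
            primary_key = [c.strip() for c in pk.group(1).split(",")]
            continue
        col = re.match(r"(\w+)\s+(\w+)(?:\((\d+)\))?", part)
        if col:
            name, dtype, length = col.groups()
            dtype = TYPE_MAP.get(dtype.upper(), dtype.upper())
            columns.append((name, dtype, int(length) if length else None))
    return {"table": table, "columns": columns, "primary_key": primary_key}

ddl = """CREATE TABLE StudentDB.Student(
    studentId INT,
    studentName VARCHAR(100),
    studentPhone VARCHAR(50),
    PRIMARY KEY (studentId)
);"""
schema = parse_create_table(ddl)
```

A regular expression is enough for this illustration; real DDL from Oracle or SQL Server has many more constructs and would need a proper SQL parser.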
Schema Matching Architecture
• Process (schema matching)
  ▫ Implements multiple matching algorithms (matchers)
• Schema level
  ▫ Element name similarity algorithms
      Prefix, suffix, n-gram
      Tech = Technology (prefix matching)
      Phone = telephone (suffix matching)
      Context: con, ont, nte, tex, ext (n-gram)
  ▫ Structural similarities
      Data type, field length etc.
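The three name matchers can be sketched as below. The function names are hypothetical, and two details are assumptions because the slides do not specify them: prefix and suffix scores are divided by the length of the shorter name (so an abbreviation such as Tech scores 1.0 against Technology), and the n-gram score uses the Dice coefficient over sets of trigrams.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def prefix_similarity(a, b):
    """Shared prefix length divided by the shorter name's length (assumed normalization)."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return common_prefix_len(a, b) / min(len(a), len(b))

def suffix_similarity(a, b):
    """Same as prefix_similarity, applied to the reversed names."""
    return prefix_similarity(a[::-1], b[::-1])

def ngrams(name, n=3):
    """All substrings of length n, in order, from the lowercased name."""
    name = name.lower()
    return [name[i:i + n] for i in range(len(name) - n + 1)]

def ngram_similarity(a, b, n=3):
    """Dice coefficient over the two sets of n-grams (assumed scoring)."""
    ga, gb = set(ngrams(a, n)), set(ngrams(b, n))
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))
```

With these definitions, `ngrams("Context")` yields con, ont, nte, tex, ext, as on the slide, and every score falls between 0 and 1.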
Schema Matching Architecture
• Instance level
  ▫ Statistical data
      Statistical data obtained: e.g. range, % of alphanumeric characters, statistical properties (e.g. mean, std. dev.), distinct values etc.
  ▫ Discovering complex correspondences
      Mining actual values
      Match different data types (gender: M, F = 1, 2)
      Ambiguity issues: Jaguar (car or animal)?
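As an illustration of the statistical data listed above, the sketch below profiles one column's values. The function name `column_profile` and the exact set of statistics are assumptions; the slides name the kinds of statistics but not how they are computed or compared.

```python
import statistics

def column_profile(values):
    """Summarise a column: distinct count, share of alphanumeric characters, numeric stats."""
    chars = "".join(str(v) for v in values)
    numeric = [v for v in values if isinstance(v, (int, float)) and not isinstance(v, bool)]
    profile = {
        "distinct": len(set(values)),
        "alnum_ratio": sum(c.isalnum() for c in chars) / len(chars) if chars else 0.0,
    }
    if numeric:
        # Range and simple statistical properties only make sense for numeric data.
        profile["min"] = min(numeric)
        profile["max"] = max(numeric)
        profile["mean"] = statistics.mean(numeric)
        profile["stdev"] = statistics.pstdev(numeric)
    return profile

gender_text = column_profile(["M", "F", "M", "F"])
gender_code = column_profile([1, 2, 1, 2])
```

In the gender example, both columns have two distinct values even though their data types differ, which is the kind of hint an instance-level matcher could use.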
Schema Matching Architecture
• Output
  ▫ Similarity score between attributes obtained from each matching algorithm
      All scores normalized to between 0 and 1
  ▫ Match results stored in a similarity cube
      Attribute-level, table-level and schema-level similarities can be generated
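One way to picture the similarity cube is as a mapping from matcher to attribute pair to score. The sketch below assumes that representation and assumes simple averaging as the aggregation rule; the slides do not say how scores are combined, and the names `combine_matchers` and `table_similarity` are hypothetical.

```python
def combine_matchers(cube):
    """Attribute level: average each attribute pair's scores across all matchers."""
    pairs = {}
    for scores in cube.values():
        for pair, score in scores.items():
            pairs.setdefault(pair, []).append(score)
    return {pair: sum(v) / len(v) for pair, v in pairs.items()}

def table_similarity(pair_scores):
    """Table level: average, over source attributes, of each attribute's best match."""
    best = {}
    for (src, _tgt), score in pair_scores.items():
        best[src] = max(best.get(src, 0.0), score)
    return sum(best.values()) / len(best) if best else 0.0

# Illustrative scores only, not results from the thesis.
cube = {
    "name": {("studentId", "id"): 0.5, ("studentName", "name"): 1.0},
    "type": {("studentId", "id"): 1.0, ("studentName", "name"): 0.5},
}
combined = combine_matchers(cube)
table_score = table_similarity(combined)
```

A schema-level score could be derived the same way by averaging table-level scores.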
Experimental Evaluation
• Accuracy
  ▫ Tested on 2 small schemas of 10 tables each with 2-10 attributes
  ▫ Checked results against a manually derived result
  ▫ Accuracy degrades as schema size increases
  ▫ 55-60% true matching
  ▫ Tested on a schema with 140 tables and 1360 attributes
      20-40% true matching
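The slides do not define "true matching". One plausible reading is the share of correspondences that agree with the manually derived result, which could be measured as sketched below. The function name `match_quality` and the sample pairs are hypothetical and are not data from the evaluation.

```python
def match_quality(found, reference):
    """Compare reported correspondences with a manually derived reference set."""
    found, reference = set(found), set(reference)
    true_matches = found & reference
    # Precision: share of reported matches that are correct.
    precision = len(true_matches) / len(found) if found else 0.0
    # Recall: share of reference matches that were reported.
    recall = len(true_matches) / len(reference) if reference else 0.0
    return precision, recall

# Illustrative pairs only.
found = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
reference = {("a", "x"), ("b", "y"), ("e", "v")}
precision, recall = match_quality(found, reference)
```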
Conclusion
• A basic framework for schema matching is proposed
• Matching functions are performed independently for higher scalability, so that additional algorithms can be integrated easily
• Efficiency needs to be improved by deploying hybrid matching algorithms
• Various different algorithms are required to assess similarities from different views and increase accuracy