Research Topics in Computing
Data Modelling for Data Schema Integration
1 March 2005
David George
Data Integration 2
Modelling & Data Integration
Key Elements of today’s Presentation
Key Drivers for Data Integration
Dimensions and Issues in Integration
Three Integration Approaches
Drivers for Data Integration
Drivers for Data Integration (1)
Organisations evolving as global entities with distributed data.
Systems characterised by mix of legacy and new databases and applications.
Organisational change: Organic growth – size and diversity. Business re-engineering. Corporate mergers and acquisitions.
Drivers for Data Integration (2)
Organisations evolved as collections of distinct, autonomous departments with disconnected systems e.g. in financial services.
Trends in Business Intelligence initiatives: Decision-making support. Customer segmentation. Marketing strategies.
Development of distributed or multidatabase systems.
Dimensions and Issues in Integration
Architecture & Design Issues
Multidatabase systems can be classified in two ways:
Homogeneous systems – local databases use the same techniques and language.
Heterogeneous systems – local databases use diverse data models and languages.
Key Dimensions in systems heterogeneity:
System heterogeneity – hardware, OS, DBMS.
Semantic heterogeneity – models and data.
Why Heterogeneity/Conflict?
Translating conceptualisations of the real world into database world representations
Research Work Conceptualised
Books Model (a): The data of interest is about Books, their Publishers and adopting Universities.
Publications Model (b): The data of interest is about Publications and their Types.
[Figure: entity-relationship diagrams for the Books and Publications models. Entities: Publisher, Topics, Book, University, Keywords, Publication. Relationships: Published by, Adopted by, contains, Refer to. Attributes: Title, Word, Name, Code, Address, City, Research Area, Publisher.]
[Figure: the Books and Publications models, labelled A and B, redrawn with conflicts being resolved. The labels suggest Keywords/Word being renamed to Topics/Name, and the Publisher attribute of Publication becoming a Publisher entity related by Published by.]
[Figure: Books and Publications Integrated. Entities: Publisher, Topics, Book, University, Publication. Relationships: Published by, Adopted by, Refer to, contains. Attributes: Title, Name, Code, Address, City, Research Area.]
Semantic Heterogeneity/Conflict
Structural Conflicts: Generalisation versus Specialisation conflicts. Entity versus attribute conflicts. Naming conflicts.
Attribute (Domain) Conflicts: Data Type conflicts. Measure and Scale conflicts. Integrity, Presence & Absence. Data Values.
Semantic Heterogeneity/Conflict
Generalisation/Specialisation Conflicts (i.e. Structural).
Naming conflicts:
Synonyms e.g. Customer vs Client.
Homonyms e.g. Market (Products) vs Market (Customers).
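One way to picture how naming conflicts might be handled is a pair of lookup tables consulted during integration. The sketch below is illustrative only: the source names `sales_db` and `crm_db`, the global names, and the function `global_name` are all assumptions, not part of the presentation.

```python
# Hypothetical synonym table: different local names for the same concept.
SYNONYMS = {"Customer": "Customer", "Client": "Customer"}

# Hypothetical homonym table: the same local name meaning different things,
# so the source schema is needed to tell the meanings apart.
HOMONYMS = {
    ("sales_db", "Market"): "ProductMarket",
    ("crm_db", "Market"): "CustomerMarket",
}


def global_name(source: str, local_name: str) -> str:
    """Resolve a local entity name to its name in the integrated schema."""
    if (source, local_name) in HOMONYMS:
        return HOMONYMS[(source, local_name)]
    return SYNONYMS.get(local_name, local_name)
```

With these tables, `global_name("sales_db", "Client")` and `global_name("crm_db", "Customer")` resolve to the same global entity, while the two `Market` entities are kept apart.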
Semantic Heterogeneity/Conflict
Data Type (representation) conflicts:
Student - 26254006 (integer or string).
Student - No vs Name (integer or string).
Measure and Scale etc. conflicts:
Dimension - volume vs weight.
Measure - light years vs miles.
Scale - miles vs kilometres.
Precision - 1:100 versus A:E.
Date - dd/mm/yyyy vs mm-dd-yy ???
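A minimal sketch of how some of these attribute conflicts might be reconciled, assuming each source declares its representation, unit, and date format. The function names are hypothetical; the conversions use only the Python standard library.

```python
from datetime import date, datetime

MILES_TO_KM = 1.609344  # one international mile in kilometres


def student_no(value) -> str:
    """Represent a student number as a string, whether stored as int or str."""
    return str(value).strip()


def to_km(distance: float, unit: str) -> float:
    """Convert a distance to kilometres (scale conflict: miles vs kilometres)."""
    if unit == "km":
        return distance
    if unit == "miles":
        return distance * MILES_TO_KM
    raise ValueError(f"unknown unit: {unit}")


def parse_date(text: str, source_format: str) -> date:
    """Parse a date using the format declared for its source."""
    return datetime.strptime(text, source_format).date()
```

For example, `parse_date("01/03/2005", "%d/%m/%Y")` and `parse_date("03-01-05", "%m-%d-%y")` yield the same date once each source's format is known. Without that declaration the string alone is ambiguous.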
Semantic Heterogeneity/Conflict
Integrity Constraints e.g. Age Range <21 vs Age >18.
Referential conflict e.g. 1:1 vs 1:M (1 invoice for 1/M orders).
Presence/Absence: No null vs nulls (e.g. optional). No corresponding attribute.
Data Values: Same items, different values.
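The presence/absence and data value conflicts can be sketched as a record merge that fills in attributes missing from one source and reports attributes where the two sources disagree. The function `merge_records` and its policy (leave a conflicting value unresolved) are assumptions made for illustration.

```python
def merge_records(a: dict, b: dict):
    """Merge two descriptions of the same item from different sources.

    Returns (merged, conflicts). An attribute that is absent or null in one
    source is taken from the other; differing non-null values are reported
    as conflicts and left unresolved (None) in the merged record.
    """
    merged, conflicts = {}, {}
    for key in sorted(set(a) | set(b)):
        va, vb = a.get(key), b.get(key)  # None when absent or null
        if va is not None and vb is not None and va != vb:
            conflicts[key] = (va, vb)
            merged[key] = None
        else:
            merged[key] = va if va is not None else vb
    return merged, conflicts
```

How a reported conflict is finally resolved (trusting one source, asking a user, keeping both) is a design decision this sketch deliberately leaves open.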
Integration Approaches
Integration Approaches
Federated Database (Multidatabase) Systems.
Data Warehouse (Materialised in house) Systems.
Mediators (Virtual integration) Systems.
Federated Database Systems
Federated Databases (1)
Federated Databases (2)
A class of heterogeneous databases that: Consist of both new and old systems. Previously existed in their own stand-alone (autonomous) environments.
Integration is a consequence of distribution.
Organisations can adopt different architectures, i.e. the way databases are mapped together: Loosely Coupled integrations. Tightly Coupled integrations.
Federated Databases (3)
Tightly Coupled Federations
Federation administrator determines schema view for all component systems in the federation.
Negotiates export schemas (tables and attributes) from federation participants who control exports of local schemas.
Local schema exports integrated as a federated schema.
Less autonomy at federation user level for view creation.
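The step from negotiated export schemas to a single federated schema could be sketched as follows. This is a simplified illustration, not a description of any particular FDBMS: the function `federated_schema` and the component, table, and attribute names in the usage note are hypothetical, and real integration would also resolve the semantic conflicts described earlier.

```python
def federated_schema(exports: dict) -> dict:
    """Combine export schemas into one federated schema.

    `exports` maps each component database to the tables (and attributes)
    it has agreed to export. Tables are qualified by component name so that
    equally named tables from different components stay distinct.
    """
    federated = {}
    for component, tables in exports.items():
        for table, attributes in tables.items():
            federated[f"{component}.{table}"] = list(attributes)
    return federated
```

For instance, exports of `{"library": {"book": ["title", "publisher"]}, "registry": {"book": ["title", "university"]}}` produce the federated tables `library.book` and `registry.book`; anything a participant does not export never appears in the federation.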
Federated Databases (4)
Loosely Coupled Federations
The federated component databases have a greater degree of autonomy.
No central schema view is imposed on users.
Federated user is effectively an administrator creating views.
User employs a multidatabase (MDB) query language (versus schema integration in tightly coupled federations).
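As a rough analogy for a user querying across component databases without a central schema, the sketch below uses SQLite's `ATTACH DATABASE` so that one statement spans two separate databases. SQLite is only a stand-in here, not a multidatabase query language in the sense of the slides, and the table names and rows are invented for illustration.

```python
import sqlite3

# Two independent in-memory databases stand in for component databases.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS uni")

conn.execute("CREATE TABLE main.book (title TEXT, publisher TEXT)")
conn.execute("CREATE TABLE uni.adoption (title TEXT, university TEXT)")
conn.execute("INSERT INTO main.book VALUES ('Databases', 'Acme Press')")
conn.execute("INSERT INTO uni.adoption VALUES ('Databases', 'UCLAN')")

# The user, not an administrator, decides how the two sources are combined.
rows = conn.execute(
    "SELECT b.title, b.publisher, a.university "
    "FROM main.book AS b JOIN uni.adoption AS a ON a.title = b.title"
).fetchall()
```

The join condition is written by the user at query time, which reflects the point above that the federated user effectively acts as an administrator creating views.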
Federated Databases (5)
Sharing is made explicit by allowing export schemas from the local or component database.
The export schemas are imported to the federation to represent the shareable federated database.
Each source can call on others for information.
FDBMSs differ from homogeneous Distributed DBMSs (DDBMSs), which use the same data model and DBMS throughout.
Sharing in DDBMSs is therefore implicit.
Data Warehousing Systems
Data Warehousing (1)
[Figure: data warehouse architecture. Local operational sources (an O/RDB and a Web source, each with a local schema and a wrapper) are reached over the network/Internet. Data extraction feeds integration and storage in the warehouse repository, which has a global schema and answers user queries for decision support and mining.]
Data Warehousing (2)
Represents the physical separation of operational and decision support environments.
Operational data provides the raw material for: Decision support systems. Data-mining (DM).
E.g. identifying trends or characteristics.
DM = process of “non-trivial extraction of implicit, previously unknown, and potentially useful information”.
Data Warehousing (3)
Warehouse integrates multiple, heterogeneous data sources - e.g. Relational DBs, flat files.
Data is pre-fetched into a central or intermediate warehouse repository by a mediation process.
Data is “cleaned” and data integration techniques are applied, e.g. filtered, joined or aggregated.
Data may be transformed to conform to the warehouse schema.
Provides consistency in naming conventions, data structures, attributes, etc.
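The cleaning, transformation, and aggregation steps above can be sketched as a small extract-transform-load routine. Everything named here (`COLUMN_MAPS`, the source names `crm` and `billing`, the column names) is hypothetical and chosen only to show sources being conformed to one warehouse schema.

```python
from collections import defaultdict

# Hypothetical mappings from each source's column names to the warehouse schema.
COLUMN_MAPS = {
    "crm": {"client": "customer_name", "amt": "amount"},
    "billing": {"customer": "customer_name", "total": "amount"},
}


def extract_transform(source: str, records: list) -> list:
    """Rename columns to the warehouse schema, filter and clean the rows."""
    rows = []
    for record in records:
        row = {target: record.get(local)
               for local, target in COLUMN_MAPS[source].items()}
        if row["amount"] is None or row["customer_name"] is None:
            continue  # filter: drop incomplete rows
        row["customer_name"] = row["customer_name"].strip().title()  # clean
        rows.append(row)
    return rows


def load(warehouse: dict, rows: list) -> None:
    """Aggregate amounts per customer into the warehouse repository."""
    for row in rows:
        warehouse[row["customer_name"]] += row["amount"]


warehouse = defaultdict(float)
load(warehouse, extract_transform("crm", [{"client": " ada lovelace ", "amt": 10.0}]))
load(warehouse, extract_transform("billing", [
    {"customer": "Ada Lovelace", "total": 5.5},
    {"customer": "Bob", "total": None},
]))
```

After both loads the repository holds one consistently named entry per customer, regardless of how each source spelled the name or labelled its columns.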
Data Warehousing (4)
Data is then stored (materialised) in the warehouse repository – possibly in separate data marts.
Result is a repository of synthesised data for management decision-making.
Queries are made over the repository’s global schema.
Information is independent from the source data.
Data extraction tends to be periodic.
Mediator (+Wrapper) Systems
Mediator Systems (1)
[Figure: mediator architecture. A user query is posed against the mediated schema of the integration system. The mediator performs query translation into Query 1 and Query 2, which are sent over the network/Internet to wrappers over the data sources (an O/RDB and a Web source, each with a local schema).]
Mediator Systems (2)
Global schema created and mapped to the source schemas.
User makes queries over global, mediated schema.
Mappings can be either: Global-as-view (GAV). Local-as-view (LAV).
The mediator translates a query over the global schema, reformulating it into sub-queries over the local schemas.
Wrappers execute the sub-queries and return the results.
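A minimal sketch of the global-as-view (GAV) idea: the global relation is defined as a view over the sources, so answering a global query amounts to calling each wrapper and renaming its columns. The wrappers, their rows, and the names `GAV_BOOK` and `query_book` are all invented for illustration; a real mediator would also push selections down to the sources rather than filter afterwards.

```python
def wrapper_rdb() -> list:
    """Hypothetical wrapper over a relational source."""
    return [{"title": "Databases", "publisher": "Acme Press"}]


def wrapper_web() -> list:
    """Hypothetical wrapper over a Web source with different column names."""
    return [{"name": "Data Models", "pub": "Beta Books"}]


# GAV mapping: global relation Book(title, publisher) defined over the sources.
# Each entry pairs a wrapper with a global-column -> local-column mapping.
GAV_BOOK = [
    (wrapper_rdb, {"title": "title", "publisher": "publisher"}),
    (wrapper_web, {"title": "name", "publisher": "pub"}),
]


def query_book(predicate=lambda row: True) -> list:
    """Answer a query over the global Book relation by unfolding the view."""
    answers = []
    for wrapper, columns in GAV_BOOK:
        for local_row in wrapper():
            row = {g: local_row[l] for g, l in columns.items()}
            if predicate(row):
                answers.append(row)
    return answers
```

Calling `query_book(lambda r: r["publisher"] == "Beta Books")` returns the matching row in global-schema terms, even though the Web source stores it under `name` and `pub`. Under local-as-view (LAV) the direction is reversed: each source is described as a view over the global schema, which this sketch does not cover.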
Mediator Systems (3)
Wrappers standardise how source information is described and accessed (i.e. they translate or adapt).
Query answers are returned to the user on demand – after sources are interrogated.
Thus data is always up-to-date (versus Warehousing).
Mediators integrate the information view without integrating the source data.
Mediator Systems (4)
Results in a homogeneous information source using views - based on the mediated (global) schema.
Integration is virtual i.e. retrieved by the mediator but not stored in any central repository.
Differs from Warehousing, where queries are made to materialised data.
In short – provides virtual source schema integration via schema mapping and integrated view.
Comparisons
Federation versus Warehousing & Mediation
Federation represents a more “static” approach – using agreed couplings to allow view creation.
Warehousing and Mediation address integration in a more “dynamic” way – using extraction, transformation and integration processes.
Warehousing vs. Mediation
Warehouse: Update-driven, i.e. in the warehouse repository. Heterogeneous data is integrated in advance and stored in-house for direct query and analysis.
Mediation: Wrapper and Mediator layer on top of source DBs. Query-driven: a query to the mediated schema is translated into queries appropriate to the sources. Results are integrated into a global answer set.
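The practical difference between the update-driven and query-driven styles can be shown with a toy source that changes after the warehouse has been loaded. The classes `Warehouse` and `Mediator` below are hypothetical simplifications: a dictionary stands in for a source database, and integration logic is omitted.

```python
class Warehouse:
    """Update-driven: answers come from a copy materialised at refresh time."""

    def __init__(self, source: dict):
        self._source = source
        self._repository = {}
        self.refresh()

    def refresh(self) -> None:
        # Periodic extraction: copy the source into the repository.
        self._repository = dict(self._source)

    def query(self, key):
        return self._repository[key]


class Mediator:
    """Query-driven: every query is answered from the source on demand."""

    def __init__(self, source: dict):
        self._source = source

    def query(self, key):
        return self._source[key]


source = {"stock": 10}
warehouse = Warehouse(source)
mediator = Mediator(source)

source["stock"] = 7                    # the operational source changes
stale = warehouse.query("stock")       # still the value from the last refresh
fresh = mediator.query("stock")        # the current source value
warehouse.refresh()
refreshed = warehouse.query("stock")   # now matches the source again
```

The warehouse trades freshness for fast local queries that do not touch the sources, while the mediator trades query-time work for always-current answers.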
Summary
Summary
Drivers for Data Integration: Organisational change. Business Intelligence and Strategies.
Integration Issues: Different Conceptual Model representations. Resulting Semantic Heterogeneities.
Integration Approaches: Federated Systems. Data Warehousing and Mediator Systems.
Next step ……
Research Resources
Reference Material: Journals. Books. Presentation slides.
UCLAN Website
Internal: http://janus/dgeorge/integration/journals.asp
External: http://www.janus.computing.uclan.ac.uk/dgeorge/integration/journals.asp