chapter 12 information system integration · prof. dr.-ing. stefan deßloch ag heterogene...
TRANSCRIPT
Prof. Dr.-Ing. Stefan DeßlochAG Heterogene InformationssystemeGeb. 36, Raum 329Tel. 0631/205 [email protected]
Chapter 12 – Information Systems Integration
© Prof.Dr.-Ing. Stefan Deßloch
IS Integration - Approachesn Data/Information Integration
n integrated access to (heterogeneous) data originating from multiple sourcesn queries range over data from multiple DBs!
n virtual integration: integrate on access/query (e.g., federated DBMS)n materialized integration: extract, transform, load data into a single materialized
data warehouse in advance (e.g., data replication, data warehousing)n needs a strong foundation to overcome multiple kinds of heterogeneity
n Enterprise Application Integrationn integration of (heterogeneous, coarse-grained) applications within an enterprise
(vs. development of new application)n integration across different middleware platforms, typically based on MOM
n Business-to-business Integrationn support interactions, integration of business processes among trading partners,
across company boundariesn foundation for e-business, e-commercen typically based on middleware for web service orchestration/choreography
Middleware for Heterogeneous & Distributed Information Systems2
© Prof.Dr.-Ing. Stefan Deßloch
Integration Challengesn Goal of Integration:
Provide a homogeneous, integrated view on multiple, distributed, autonomous and heterogeneous systems, components, or data sources.
n Three fundamental challenges:n Distributionn Autonomyn Heterogeneity
n Orthogonal, but interrelated
Let’s look at the above challenges in the scope of data/information integration!
Middleware for Heterogeneous & Distributed Information Systems3
© Prof.Dr.-Ing. Stefan Deßloch
Distributionn Physical distribution
n Data located on (geographically) separated systemsn Challenges:
n Addressing data across the globe (URLs)n Accessing data in different schemas (Multi-database languages, federated database
systems)n Optimizing distributed queries (no topic of this lecture)
n Logical distributionn Several possible storage locations for a given data itemn Caused by (partial) redundancy due to overlapping intension of schema elementsn Challenges:
n Maintaining consistency among redundant datan Provide metadata to enable data localizationn Detect and resolve duplicates n Detect and resolve data inconsistencies and conflicts
n Physical and logical distribution are orthogonal: n Data can be logically distributed and physically on the same system (and vice
versa)
} Data Cleaning
Middleware for Heterogeneous & Distributed Information Systems4
© Prof.Dr.-Ing. Stefan Deßloch
Autonomyn Design Autonomy
n Administrators of data sources can freely decide in which way they model datan Data model, formats, units, …n Leads to heterogeneity among sources
n Interface Autonomyn Freedom to decide how technical access is providedn Protocols (HTTP, JDBC, SOAP, …), supported query languages (SQL, XQuery, …)
n Access Autonomyn Freedom to decide whom to allow access to what datan Mode of Authentication (Certificates, Username/Password) n Authorization (boolean, R/W, Access Control Lists, …)
n Judicial Autonomyn Freedom to prohibit integration of data by othersn Intellectual property (IP) issues
ð Autonomy is the major cause of integration problems
Middleware for Heterogeneous & Distributed Information Systems5
© Prof.Dr.-Ing. Stefan Deßloch
Heterogeneityn Translated from [LeNa07]:
“Two information systems that do not provide the exact same methods, models and structures to access their data are called heterogeneous.”
n Causes for heterogeneity among IS: n Specific requirementsn Independent development n Developer preferencesn ...➡ All aspects result from autonomy
n Heterogeneity of metadata and datan Two main approaches:
n Try to resolve heterogeneity when neededn Enforce homogeneity/limit heterogeneity by establishing standards (not in this
lecture)n No real solution to the problemn Only creates “spheres of homogeneity”, any participants that have existing systems or
requirements not conforming to the standards have to resolve heterogeneity locally
Middleware for Heterogeneous & Distributed Information Systems6
© Prof.Dr.-Ing. Stefan Deßloch
Technical Heterogeneityn Refers to differences in the options to access data, e.g.
n Communication protocols (HTTP, SOAP, …)n Exchange formats (binary, text, XML, …)n APIs (JDBC, ODBC, proprietary)n Query mechanism
n Forms, canned queriesn Query languages
n Query languagen SQL, XQuery, …
Middleware for Heterogeneous & Distributed Information Systems7
© Prof.Dr.-Ing. Stefan Deßloch
Data Model Heterogeneityn Caused by the use of different data models among data sources
n hierarchical, relational, object-relational, object-oriented, XML, …n Data models can have different expressiveness, e.g. support of
n Inheritancen Types and degree of associations between entities/application conceptsn Multi-valued attributesn Different atomic data types
n Mapping from semantically richer to poorer models in general results in a loss of information
Middleware for Heterogeneous & Distributed Information Systems8
© Prof.Dr.-Ing. Stefan Deßloch
Syntactic Heterogeneityn Differences in the representation of identical facts
n Binary representations (little/big endian, number formats)n Encodings (ASCII, ISO-8859-1, EBCDIC, Unicode, …)n Separators (Tab-delimited vs. CSV)n Textual representation
n Not to be mixed up with semantic heterogeneity!n Usually easy to resolve (if used consistently)n Examples:
n “20070201” vs. “Februar 1st, 2007” vs. “02-01-07” vs. “1.2.2007”n “123.45” vs. “1.2345x102”
➨ Data Fusion
Middleware for Heterogeneous & Distributed Information Systems9
© Prof.Dr.-Ing. Stefan Deßloch
n Caused by modeling identical application concepts differently using the samemodelling concepts in the same data model
n Example - denormalized relational schema
n Easily resolved using relational operators:SELECT e.EmpNo, e.Name, e.DoB, d.name as deptname, d.deptnoFROM Employee e, Department d WHERE e.deptno = d.deptno
Structural Heterogeneity
EmpDeptEmpNo Name DoB Deptname DeptNo4711 Bob 1978-03-20 Accounting 110815 Jane 1975-11-05 Sales 71234 Joe 1954-05-26 Accounting 11
EmployeeEmpNo Name DoB DeptNo4711 Bob 1978-03-20 110815 Jane 1975-11-05 71234 Joe 1954-05-26 11
DepartmentDeptNo Name7 Sales11 Accounting
Middleware for Heterogeneous & Distributed Information Systems10
© Prof.Dr.-Ing. Stefan Deßloch
Structural Heterogeneity (cont.)n Example: inverted hierarchy
n Easily resolved using XQuery
<bib><book title=“a”>
<author name=“x”/><author name=“y”/>
</book><book title=“b”>
<author name=“x”/></book>
</bib>
<bib><author name=“y”><book title=“a”/>
</author><author name=“x”><book title=“a”/><book title=“b”/>
</author></bib>
<bib> {for $a in distinct-values(doc("BookAuthor.xml")//author/@name)return <author name="{$a}"> {
for $b in doc("BookAuthor.xml")//bookwhere $b/author/@name = $areturn <book title="{$b/@title}"/>} </author>
} </bib>
Middleware for Heterogeneous & Distributed Information Systems11
© Prof.Dr.-Ing. Stefan Deßloch
Schematic Heterogeneityn Often considered a special case of structural heterogeneityn Caused by modeling identical application concepts using different data model
concepts of the same data modeln Example: attribute value – relation name conflict
n Problems of this kind cannot be resolved generically with SQLn How to handle an unknown/variable number of values for categorical attributes?
PersonID Name Gender
1234 Bob male
4567 Jane female
MenID Name
1234 Bob
WomenID Name
4567 Jane
categorical attribute
Middleware for Heterogeneous & Distributed Information Systems12
© Prof.Dr.-Ing. Stefan Deßloch
Semantic Heterogeneityn “Semantics” = interpretation of data and metadatan Different representation of identical application concepts, (e.g. synonyms)n Identical representation of different application concepts (e.g. homonyms)
n e.g. Lotus (the car) vs. Lotus (the flower)n Ambiguities – unclear whether two elements refer to the same concept (are
synonyms) or refer to broader/narrower terms (hypernyms)n hypernym or synonym?
n car – (motor) vehiclen person – employeen product – item
n decision depending on contextn Perhaps the biggest challenge in IIn Resolving semantic heterogeneity is a prerequisite for many integration tasksn Many attempts to automate
➨ Schema Matching
Middleware for Heterogeneous & Distributed Information Systems13
© Prof.Dr.-Ing. Stefan Deßloch
Data Integration Middlewaren Traditional Middleware (shortcomings)
n supports access to multiple data sources within the same application, transactionn directly (using DB-gateways)n indirectly (by invoking distributed application components)
n but fails to provide data integrationn no means to analyze/query data from multiple sources within the same statement
SELECT *FROM Source1-table T1, Source2-table T2WHERE T1.a1 = …
ANDT1.a2 = T2.a1
n does not help to overcome data heterogeneityn Two architectural approaches to achieve data integration
n materialized integration: replication, data warehousingn virtual integration: federated DBMS, multi-database systems
Middleware for Heterogeneous & Distributed Information Systems14
© Prof.Dr.-Ing. Stefan Deßloch
Data Federation: Logical (Virtual) Integrationn Goal: homogeneous, integrated view of
data from multiple sourcesn illusion of a single (logical) databasen a single query may collect (or join) data
from multiple sourcesn "On-Demand" Data Integration
n data stays where it is (in the sources)n not copied into a new DB
n data is transformed/integrated at query time
n integration server combines results from data source queries
n Data Federation requiresn Wrapper/mediator technologyn Data and schema integration
mechanisms
Integration Server (FDBMS)
query
(sub-)queries
Middleware for Heterogeneous & Distributed Information Systems15
© Prof.Dr.-Ing. Stefan Deßloch
Distribution Controlled by DBMS (revisited)n Distributed transaction processingn DB-operation may span across
multiple data sourcesn single SELECT statement can
access tables in DB1, DB2, DB3n Virtual integration
n distributed data storagen integrated on demand (query)
n Degrees of transparencyn location transparency: physical
location of data is hiddenn distribution transparency:
n fact that data is stored in different data sources is hidden
n single logical DB and DB-schema for application programmer
n Multiple approaches and architectures possible
presentation
applicationlogic
resourcemanagement
client
DB1 DB2 DB3
DBS1
T
DBS2
DBS3
P
distributed TA
Middleware for Heterogeneous & Distributed Information Systems16
© Prof.Dr.-Ing. Stefan Deßloch
Example - DB2 Relational Connect
.........
.........
.........
.........
...NameAccNo
.........
.........
.........
.........
...BalanceAccNo
.........
.........
.........
.........
...CrLimitAccNo
DB2 Sybase Oracle
Cust SJBR SFBR
DB2 Relational Connect
Select *From Cust, SJBR, SFBRWhere Cust.Acct No = SJBR.Acct NoAnd SJBR.Acct No = SFBR.Acct No
Middleware for Heterogeneous & Distributed Information Systems17
© Prof.Dr.-Ing. Stefan Deßloch
Standard – SQL/MEDn ‘Foreign Data Wrapper’ in ‘SQL/MED’
Foreign Table
Init Request
Open
"Iterate"
Close
Foreign DataWrapper
DBMS
FileSystem
ApplicationSystem
ORDBMS
Foreign Servers
Middleware for Heterogeneous & Distributed Information Systems18
© Prof.Dr.-Ing. Stefan Deßloch
Data Warehouse: Physical (Materialized) Integration
data source 1
data source n
: main datawarehouse
data mart
data mart
:
Monitor
Extract Transform Load
staging area
Data WarehouseManager
metadatarepository
data flowcontrol flow
Analysis – Reporting - MiningTools
Middleware for Heterogeneous & Distributed Information Systems19
© Prof.Dr.-Ing. Stefan Deßloch
DWH Mediator
Materialized and Virtual Integration
Middleware for Heterogeneous & Distributed Information Systems20
Source Source Source Source
ETL ETL Wrapper Wrapper
DWintegratedschema
globalschema
schemamappings
Application Application
queryprocessing
queryprocessing
© Prof.Dr.-Ing. Stefan Deßloch
Pros and Cons of Materialized vs. Virtual DI
Middleware for Heterogeneous & Distributed Information Systems21
Materialized Virtual
data freshness low: update frequency high: always fresh
query response time low: local processing high: distributed proc.
complexity low: like DBMS high: distr. query planning
loss of autonomy providing load data query processing
query support unlimited, full SQL may be limited (by source)
read/write both (on DWH) typically read-only
storage resources high: full data low: only meta data
load on sources load/update: planned query: not planned
data cleaning possible not possible
© Prof.Dr.-Ing. Stefan Deßloch
Bridging/Resolving Heterogeneityn Real-world integration scenarios suffer from all kinds of heterogeneityn Middlewre and the primary issues they address:
n Wrappers (data model heterogeneity, technical heterogeneity, syntactic heterogeneity)
n Mediators/federated DBMS (technical heterogeneity, structural heterogeneity, distribution)
n Multi-database languages (schematic heterogeneity, technical heterogeneity, distribution)
n DB Gateways (technical heterogeneity)n ETL tools (structural heterogeneity, technical heterogeneity, syntactic
heterogeneity)ð focus on data access/transformation infrastructure (i.e., as a runtime platform)
n Further general integration techniques neededn Schema Matching and Integration (semantic heterogeneity, structural
heterogeneity)n Data Cleaning/Fusion (syntactic heterogeneity, semantic heterogeneity (in data))ð focus on integration planning
Middleware for Heterogeneous & Distributed Information Systems22
© Prof.Dr.-Ing. Stefan Deßloch
Integration Processn Schema Matching
n Find inter-schema correspondencesn Schema Mapping
n Based on correspondencesn Define how to "translate" one schema into another
n implies data transformationn Schema Integration
n Based on correspondences (and mapping)n Define an integrated, global/federated schema
èè Integration Plan!
n Integration plan can then be “implemented” using middleware for virtual or materialized data integration
Middleware for Heterogeneous & Distributed Information Systems23
© Prof.Dr.-Ing. Stefan Deßloch
Summaryn Middleware
n supports the development, deployment, and execution of complex information systems
n facilitates interaction between and integration of applicationsacross multiple distributed, heterogeneous platforms and data sources
n Major challenges: distribution, autonomy, heterogeneityn different forms of (data) heterogeneity
n Data/Information Integrationn goal: integrated access to (heterogeneous) data originating from multiple sourcesn materialized integration extracts data from sources and stores it in an integrated
database for query processing, data analysisn virtual integration leaves data in the sources and performs complex, distributed
query processing for data accessn both approaches have pros and cons
n Information integration processn general techniques, independent of middleware plattforms
Middleware for Heterogeneous & Distributed Information Systems24