chapter 12 information system integration · prof. dr.-ing. stefan deßloch ag heterogene...

24
Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 [email protected] Chapter 12 – Information Systems Integration

Upload: others

Post on 13-Jul-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

Prof. Dr.-Ing. Stefan DeßlochAG Heterogene InformationssystemeGeb. 36, Raum 329Tel. 0631/205 [email protected]

Chapter 12 – Information Systems Integration

Page 2: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

IS Integration - Approachesn Data/Information Integration

n integrated access to (heterogeneous) data originating from multiple sourcesn queries range over data from multiple DBs!

n virtual integration: integrate on access/query (e.g., federated DBMS)n materialized integration: extract, transform, load data into a single materialized

data warehouse in advance (e.g., data replication, data warehousing)n needs a strong foundation to overcome multiple kinds of heterogeneity

n Enterprise Application Integrationn integration of (heterogeneous, coarse-grained) applications within an enterprise

(vs. development of new application)n integration across different middleware platforms, typically based on MOM

n Business-to-business Integrationn support interactions, integration of business processes among trading partners,

across company boundariesn foundation for e-business, e-commercen typically based on middleware for web service orchestration/choreography

Middleware for Heterogeneous & Distributed Information Systems2

Page 3: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Integration Challengesn Goal of Integration:

Provide a homogeneous, integrated view on multiple, distributed, autonomous and heterogeneous systems, components, or data sources.

n Three fundamental challenges:n Distributionn Autonomyn Heterogeneity

n Orthogonal, but interrelated

Let’s look at the above challenges in the scope of data/information integration!

Middleware for Heterogeneous & Distributed Information Systems3

Page 4: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Distributionn Physical distribution

n Data located on (geographically) separated systemsn Challenges:

n Addressing data across the globe (URLs)n Accessing data in different schemas (Multi-database languages, federated database

systems)n Optimizing distributed queries (no topic of this lecture)

n Logical distributionn Several possible storage locations for a given data itemn Caused by (partial) redundancy due to overlapping intension of schema elementsn Challenges:

n Maintaining consistency among redundant datan Provide metadata to enable data localizationn Detect and resolve duplicates n Detect and resolve data inconsistencies and conflicts

n Physical and logical distribution are orthogonal: n Data can be logically distributed and physically on the same system (and vice

versa)

} Data Cleaning

Middleware for Heterogeneous & Distributed Information Systems4

Page 5: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Autonomyn Design Autonomy

n Administrators of data sources can freely decide in which way they model datan Data model, formats, units, …n Leads to heterogeneity among sources

n Interface Autonomyn Freedom to decide how technical access is providedn Protocols (HTTP, JDBC, SOAP, …), supported query languages (SQL, XQuery, …)

n Access Autonomyn Freedom to decide whom to allow access to what datan Mode of Authentication (Certificates, Username/Password) n Authorization (boolean, R/W, Access Control Lists, …)

n Judicial Autonomyn Freedom to prohibit integration of data by othersn Intellectual property (IP) issues

ð Autonomy is the major cause of integration problems

Middleware for Heterogeneous & Distributed Information Systems5

Page 6: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Heterogeneityn Translated from [LeNa07]:

“Two information systems that do not provide the exact same methods, models and structures to access their data are called heterogeneous.”

n Causes for heterogeneity among IS: n Specific requirementsn Independent development n Developer preferencesn ...➡ All aspects result from autonomy

n Heterogeneity of metadata and datan Two main approaches:

n Try to resolve heterogeneity when neededn Enforce homogeneity/limit heterogeneity by establishing standards (not in this

lecture)n No real solution to the problemn Only creates “spheres of homogeneity”, any participants that have existing systems or

requirements not conforming to the standards have to resolve heterogeneity locally

Middleware for Heterogeneous & Distributed Information Systems6

Page 7: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Technical Heterogeneityn Refers to differences in the options to access data, e.g.

n Communication protocols (HTTP, SOAP, …)n Exchange formats (binary, text, XML, …)n APIs (JDBC, ODBC, proprietary)n Query mechanism

n Forms, canned queriesn Query languages

n Query languagen SQL, XQuery, …

Middleware for Heterogeneous & Distributed Information Systems7

Page 8: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Data Model Heterogeneityn Caused by the use of different data models among data sources

n hierarchical, relational, object-relational, object-oriented, XML, …n Data models can have different expressiveness, e.g. support of

n Inheritancen Types and degree of associations between entities/application conceptsn Multi-valued attributesn Different atomic data types

n Mapping from semantically richer to poorer models in general results in a loss of information

Middleware for Heterogeneous & Distributed Information Systems8

Page 9: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Syntactic Heterogeneityn Differences in the representation of identical facts

n Binary representations (little/big endian, number formats)n Encodings (ASCII, ISO-8859-1, EBCDIC, Unicode, …)n Separators (Tab-delimited vs. CSV)n Textual representation

n Not to be mixed up with semantic heterogeneity!n Usually easy to resolve (if used consistently)n Examples:

n “20070201” vs. “Februar 1st, 2007” vs. “02-01-07” vs. “1.2.2007”n “123.45” vs. “1.2345x102”

➨ Data Fusion

Middleware for Heterogeneous & Distributed Information Systems9

Page 10: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

n Caused by modeling identical application concepts differently using the samemodelling concepts in the same data model

n Example - denormalized relational schema

n Easily resolved using relational operators:SELECT e.EmpNo, e.Name, e.DoB, d.name as deptname, d.deptnoFROM Employee e, Department d WHERE e.deptno = d.deptno

Structural Heterogeneity

EmpDeptEmpNo Name DoB Deptname DeptNo4711 Bob 1978-03-20 Accounting 110815 Jane 1975-11-05 Sales 71234 Joe 1954-05-26 Accounting 11

EmployeeEmpNo Name DoB DeptNo4711 Bob 1978-03-20 110815 Jane 1975-11-05 71234 Joe 1954-05-26 11

DepartmentDeptNo Name7 Sales11 Accounting

Middleware for Heterogeneous & Distributed Information Systems10

Page 11: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Structural Heterogeneity (cont.)n Example: inverted hierarchy

n Easily resolved using XQuery

<bib><book title=“a”>

<author name=“x”/><author name=“y”/>

</book><book title=“b”>

<author name=“x”/></book>

</bib>

<bib><author name=“y”><book title=“a”/>

</author><author name=“x”><book title=“a”/><book title=“b”/>

</author></bib>

<bib> {for $a in distinct-values(doc("BookAuthor.xml")//author/@name)return <author name="{$a}"> {

for $b in doc("BookAuthor.xml")//bookwhere $b/author/@name = $areturn <book title="{$b/@title}"/>} </author>

} </bib>

Middleware for Heterogeneous & Distributed Information Systems11

Page 12: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Schematic Heterogeneityn Often considered a special case of structural heterogeneityn Caused by modeling identical application concepts using different data model

concepts of the same data modeln Example: attribute value – relation name conflict

n Problems of this kind cannot be resolved generically with SQLn How to handle an unknown/variable number of values for categorical attributes?

PersonID Name Gender

1234 Bob male

4567 Jane female

MenID Name

1234 Bob

WomenID Name

4567 Jane

categorical attribute

Middleware for Heterogeneous & Distributed Information Systems12

Page 13: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Semantic Heterogeneityn “Semantics” = interpretation of data and metadatan Different representation of identical application concepts, (e.g. synonyms)n Identical representation of different application concepts (e.g. homonyms)

n e.g. Lotus (the car) vs. Lotus (the flower)n Ambiguities – unclear whether two elements refer to the same concept (are

synonyms) or refer to broader/narrower terms (hypernyms)n hypernym or synonym?

n car – (motor) vehiclen person – employeen product – item

n decision depending on contextn Perhaps the biggest challenge in IIn Resolving semantic heterogeneity is a prerequisite for many integration tasksn Many attempts to automate

➨ Schema Matching

Middleware for Heterogeneous & Distributed Information Systems13

Page 14: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Data Integration Middlewaren Traditional Middleware (shortcomings)

n supports access to multiple data sources within the same application, transactionn directly (using DB-gateways)n indirectly (by invoking distributed application components)

n but fails to provide data integrationn no means to analyze/query data from multiple sources within the same statement

SELECT *FROM Source1-table T1, Source2-table T2WHERE T1.a1 = …

ANDT1.a2 = T2.a1

n does not help to overcome data heterogeneityn Two architectural approaches to achieve data integration

n materialized integration: replication, data warehousingn virtual integration: federated DBMS, multi-database systems

Middleware for Heterogeneous & Distributed Information Systems14

Page 15: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Data Federation: Logical (Virtual) Integrationn Goal: homogeneous, integrated view of

data from multiple sourcesn illusion of a single (logical) databasen a single query may collect (or join) data

from multiple sourcesn "On-Demand" Data Integration

n data stays where it is (in the sources)n not copied into a new DB

n data is transformed/integrated at query time

n integration server combines results from data source queries

n Data Federation requiresn Wrapper/mediator technologyn Data and schema integration

mechanisms

Integration Server (FDBMS)

query

(sub-)queries

Middleware for Heterogeneous & Distributed Information Systems15

Page 16: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Distribution Controlled by DBMS (revisited)n Distributed transaction processingn DB-operation may span across

multiple data sourcesn single SELECT statement can

access tables in DB1, DB2, DB3n Virtual integration

n distributed data storagen integrated on demand (query)

n Degrees of transparencyn location transparency: physical

location of data is hiddenn distribution transparency:

n fact that data is stored in different data sources is hidden

n single logical DB and DB-schema for application programmer

n Multiple approaches and architectures possible

presentation

applicationlogic

resourcemanagement

client

DB1 DB2 DB3

DBS1

T

DBS2

DBS3

P

distributed TA

Middleware for Heterogeneous & Distributed Information Systems16

Page 17: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Example - DB2 Relational Connect

.........

.........

.........

.........

...NameAccNo

.........

.........

.........

.........

...BalanceAccNo

.........

.........

.........

.........

...CrLimitAccNo

DB2 Sybase Oracle

Cust SJBR SFBR

DB2 Relational Connect

Select *From Cust, SJBR, SFBRWhere Cust.Acct No = SJBR.Acct NoAnd SJBR.Acct No = SFBR.Acct No

Middleware for Heterogeneous & Distributed Information Systems17

Page 18: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Standard – SQL/MEDn ‘Foreign Data Wrapper’ in ‘SQL/MED’

Foreign Table

Init Request

Open

"Iterate"

Close

Foreign DataWrapper

DBMS

FileSystem

ApplicationSystem

ORDBMS

Foreign Servers

Middleware for Heterogeneous & Distributed Information Systems18

Page 19: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Data Warehouse: Physical (Materialized) Integration

data source 1

data source n

: main datawarehouse

data mart

data mart

:

Monitor

Extract Transform Load

staging area

Data WarehouseManager

metadatarepository

data flowcontrol flow

Analysis – Reporting - MiningTools

Middleware for Heterogeneous & Distributed Information Systems19

Page 20: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

DWH Mediator

Materialized and Virtual Integration

Middleware for Heterogeneous & Distributed Information Systems20

Source Source Source Source

ETL ETL Wrapper Wrapper

DWintegratedschema

globalschema

schemamappings

Application Application

queryprocessing

queryprocessing

Page 21: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Pros and Cons of Materialized vs. Virtual DI

Middleware for Heterogeneous & Distributed Information Systems21

Materialized Virtual

data freshness low: update frequency high: always fresh

query response time low: local processing high: distributed proc.

complexity low: like DBMS high: distr. query planning

loss of autonomy providing load data query processing

query support unlimited, full SQL may be limited (by source)

read/write both (on DWH) typically read-only

storage resources high: full data low: only meta data

load on sources load/update: planned query: not planned

data cleaning possible not possible

Page 22: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Bridging/Resolving Heterogeneityn Real-world integration scenarios suffer from all kinds of heterogeneityn Middlewre and the primary issues they address:

n Wrappers (data model heterogeneity, technical heterogeneity, syntactic heterogeneity)

n Mediators/federated DBMS (technical heterogeneity, structural heterogeneity, distribution)

n Multi-database languages (schematic heterogeneity, technical heterogeneity, distribution)

n DB Gateways (technical heterogeneity)n ETL tools (structural heterogeneity, technical heterogeneity, syntactic

heterogeneity)ð focus on data access/transformation infrastructure (i.e., as a runtime platform)

n Further general integration techniques neededn Schema Matching and Integration (semantic heterogeneity, structural

heterogeneity)n Data Cleaning/Fusion (syntactic heterogeneity, semantic heterogeneity (in data))ð focus on integration planning

Middleware for Heterogeneous & Distributed Information Systems22

Page 23: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Integration Processn Schema Matching

n Find inter-schema correspondencesn Schema Mapping

n Based on correspondencesn Define how to "translate" one schema into another

n implies data transformationn Schema Integration

n Based on correspondences (and mapping)n Define an integrated, global/federated schema

èè Integration Plan!

n Integration plan can then be “implemented” using middleware for virtual or materialized data integration

Middleware for Heterogeneous & Distributed Information Systems23

Page 24: Chapter 12 Information System Integration · Prof. Dr.-Ing. Stefan Deßloch AG Heterogene Informationssysteme Geb. 36, Raum 329 Tel. 0631/205 3275 dessloch@informatik.uni-kl.de Chapter

© Prof.Dr.-Ing. Stefan Deßloch

Summaryn Middleware

n supports the development, deployment, and execution of complex information systems

n facilitates interaction between and integration of applicationsacross multiple distributed, heterogeneous platforms and data sources

n Major challenges: distribution, autonomy, heterogeneityn different forms of (data) heterogeneity

n Data/Information Integrationn goal: integrated access to (heterogeneous) data originating from multiple sourcesn materialized integration extracts data from sources and stores it in an integrated

database for query processing, data analysisn virtual integration leaves data in the sources and performs complex, distributed

query processing for data accessn both approaches have pros and cons

n Information integration processn general techniques, independent of middleware plattforms

Middleware for Heterogeneous & Distributed Information Systems24