advanced database management systems - current...

121
Advanced Database Management Systems Distributed DBMS:Introduction and Architectures Alvaro A A Fernandes School of Computer Science, University of Manchester AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 1 / 121

Upload: duonghuong

Post on 29-Apr-2018

219 views

Category:

Documents


1 download

TRANSCRIPT

Advanced Database Management SystemsDistributed DBMS:Introduction and Architectures

Alvaro A A Fernandes

School of Computer Science, University of Manchester

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 1 / 121

Outline

Introduction to Distributed DBMSs

Distributed DBMS Architectures

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 2 / 121

Introduction to Distributed DBMSs

Distributed Computing

Definition

I A number of

I distinct processing elements

I possibly administratively autonomous

I possibly heterogeneous

I interconnected by a computer network

I cooperating in the performance of assigned tasks.

I Several aspects of DBMS can be distributed, e.g.:I Control (e.g., over updates, allocation of resources, etc.)I Processing logic (e.g., algebraic operators, data movements, etc.)I Services (e.g., optimization, access control, etc.)I Data (e.g., tuples, columns, relations, etc.)

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 3 / 121

Introduction to Distributed DBMSs

What is a Distributed Database Management System?

Definition

I A distributed database (DDB) is a collection of multiple, distinct, butlogically interrelated databases, placed in different physical locationsand linked by a computer network.

I A distributed database management system (DDBMS) is the softwarethat manages the DDB and provides mechanisms that make thisdistribution transparent to applications and end users.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 4 / 121

Introduction to Distributed DBMSs

DDBMS Environment (1)

I Note that a (centralized) DBMSmay be networked without beinga DDBMS.

I This happens when there is nological view over the data andresources that could be accessedand shared over a network.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 5 / 121

Introduction to Distributed DBMSs

DDBMS Environment (2)

I When there is a logical viewover data and resources, theycan then be accessed and sharedover a network and co-operationtakes place.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 6 / 121

Introduction to Distributed DBMSs

DDBMS Environment (3)Implicit Assumptions

I Data may be stored at a number of sites.

I Each site logically has a distinct (assumed single) processor.

I Processors at different sites are interconnected by a computernetwork, therefore no multiprocessors, no specialist interconnect, nospecialist parallel hardware.

I The DDB is a DB, not a collection of data files, therefore the data islogically related (e.g., as manifested in the access patterns that arecharacteristic of the relational data model).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 7 / 121

Introduction to Distributed DBMSs

DDBMS Environment (4)Applications/Promises

I Any organization which has a decentralized structure is a good apriori candidate for using DDBMSs.

I A DDBMS promises:I Transparent management of distributed, fragmented, and replicated

dataI Improved reliability/availability through distributed processesI Improved performance by exploiting locality and parallelismI Easier and more economical system expansion through scale out (i.e.,

more of the same “boxes”) rather than scale up (i.e., ever bigger, moreexpensive “boxes”).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 8 / 121

Introduction to Distributed DBMSs

DDBMS Environment (5)Transparency

I Transparency stems from abstraction, i.e., the separation of thehigher-level semantics of a system from the lower-levelimplementation concerns.

I In a DDBMS environment, a fundamental issue is to provide dataindependence through several kinds of transparency:

I Network (or distribution) transparencyI Replication transparencyI Fragmentation transparency

I horizontal, through selectionI vertical, through projectionI hybrid, combining both

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 9 / 121

Introduction to Distributed DBMSs

DDBMS Environment (6)Example Relations

Example

EMP =ENO ENAME TITLE

E1 J. Doe Elect. Eng.E2 M. Smith Syst. Anal.E3 A. Lee Mech. Eng.E4 J. Miller ProgrammerE5 B. Casey Syst. Anal.E6 L. Chu Elect. Eng. .E7 R. Davis Mech. Eng.E8 J. Jones Syst. Anal

PROJ =

PNO PNAME BUDGET LOC

P1 Instrumentation 150000 TokyoP2 Database Develop. 135000 OsloP3 CAD/CAM 250000 OsloP4 Maintenance 310000 ParisP5 CAD/CAM 500000 Paris

ASG =ENO PNO RESP DUR

E1 P1 Manager 12E2 P1 Analyst 24E2 P2 Analyst 6E3 P3 Consultant 10E3 P4 Engineer 48E4 P2 Programmer 18E5 P2 Manager 24E6 P4 Manager 48E7 P3 Engineer 36E7 P5 Engineer 23E8 P3 Manager 40

PAY =TITLE SAL

Elect. Eng. 40000Syst. Anal. 34000Mech. Eng. 27000Programmer 24000

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 10 / 121

Introduction to Distributed DBMSs

DDBMS Environment (7)Transparent Access

Example

Find the name and salary of employees on assignments not lasting 12 months.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 11 / 121

Introduction to Distributed DBMSs

DDBMS Environment (8)Potentially Improved Performance

Locality : Data can be kept close to its points of use by means ofdata distribution strategies whilst still benefitting all bymeans of data integration strategies.

Parallelism : Distribution allows for both inter-query (i.e., when wholequery evaluation plans (QEPs) run in distinct sites) andintra-query parallelism (i.e., when QEP fragments of thesame query run in distinct sites).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 12 / 121

Introduction to Distributed DBMSs

DDBMS Environment (9)Scale-Out System Expansion

I Scaling-out (i.e., deriving a positive response in performance to moreof the same processing elements) is generally considered to be easierthan scaling-up (i.e., deriving a positive response in performance bythe same number of larger processing elements).

I With the widespread availability of high-performance commodityhardware, scale-out is all the more appealing now than in the past.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 13 / 121

Distributed DBMS Architectures

DBMS Abstraction Through Schema LevelsThe ANSI/SPARC Architecture

I A DBMS supports abstractionsby means of schemas that definedifferent views at different levels.

I A DDBMS must providetransparency without breaking,and indeed supporting, suchexpectations.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 14 / 121

Distributed DBMS Architectures

DBMS Implementation Alternatives (1)

I The DBMS implementation dimensions that matter the most inDDBMSs are:

distribution :of various kinds (e.g., processing, data)

heterogeneity :of various kinds (e.g., syntactic, semantic)

autonomy :of various kinds (e.g., at instance level, at schema level)

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 15 / 121

Distributed DBMS Architectures

DBMS Implementation Alternatives (2)Dimensions of the Problem

I Distribution refers to whether the data and processing components ofthe system are located on the same machine or not.

I There many sources of heterogeneity, e.g.,I infrastructural (e.g., different hardware, communications, OSs, etc.)I syntactic (e.g., different data model, database languages, etc.)I semantic (e.g., different names for the same concepts, different

concepts with the same name)

I Autonomy is the least understood, the most troublesome to contendwith, and takes various forms:

I Design autonomy, i.e., the degree to which the design of a componentDBMS can change without explicit co-ordination and control

I Communication autonomy, i.e., the degree to which a componentDBMS can decide whether and how to communicate with others.

I Execution autonomy, i.e., the degree to which a component DBMS candecide whether and how to execute operations locally

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 16 / 121

Distributed DBMS Architectures

DBMS Implementation Alternatives (3)

I There are some interestingtriples in the space defined bythe implementation dimensionsin the figure:

centralized DBMSs :(D=none, H=none, A=none)

client-server DBMSs :(D=some, H=some, A=none)

federated DBMSs :(D=any, H=some, A=some)

multi-DBMSs :(D=any, H=any, A=any)

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 17 / 121

Distributed DBMS Architectures

Distributed DBMS Architectures (1)Federated DBMS from Component Schemas

I In the figure, federated is being usedto connote the notion that thecomponent DBMSs co-operate byneither exercising high degrees ofautonomy nor inflicting high levels ofheterogeneity on others.

I In this case, a global conceptualschema (GCS) arises from localconceptual schemas (LCS) and localinternal schemas (LIS) by anegotiation process (or by impositionfrom the centre, if within organizationboundaries).

I The external schemas (ES) can bemore easily derived from the GCS.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 18 / 121

Distributed DBMS Architectures

Distributed DBMS Architectures (2)Multi-DBMS from Component Schemas

I In the figure, ’multi-’ is being used toconnote the notion that thecomponent DBMSs have no coercionon their autonomy nor on howheterogeneous they make themselves.

I In this case, a GCS does not normallyarise by negotiation (e.g., there maybe more than one GCS if thecomponent DBMSs have publicinterfaces).

I The component DBMSs may still havelocal external schemas (LES) imposedupon them.

I The GCS too can have global externalschemas (GES) imposed upon it.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 19 / 121

Distributed DBMS Architectures

Distributed DBMS Architectures (3)Multi-DBMS without a Global Schema

I The absence of a global schema meansthat only partial views arise, i.e., thereis no attempt at a unified descriptionof all the component DBMSs.

I This is more likely in the case ofad-hoc, single-use scenarios, wherethere is no motivation to invest oncreating a global view over theresources.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 20 / 121

Distributed DBMS Architectures

Distributed DBMS Architectures (4)Multi-DBMS Execution Model

I In a multi-DBMS there is a need tomap a global request into localsub-requests and local sub-results intoa global result.

I The component DBMSs still cater forlocal requests with local results.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 21 / 121

Distributed DBMS Architectures

Distributed DBMS Architectures (5)Time-Shared Access to a Centralized Database

I In mere time-sharing of a centralizedDBMS, all data and all applicationsrun remotely from the point of access.

I Requests are for batch tasks, aresponse (and not necessarily results)is sent back.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 22 / 121

Distributed DBMS Architectures

Distributed DBMS Architectures (6)Multiple Clients/Single Server

I In client-server approaches, clients areapplications that interface throughclient-side services andcommunications with a server.

I The server runs server-side services inresponse to client requests.

I Because of the client-side services thatsupport the application, high-level,fine-grained, interactive requests canbe sent that cause results (i.e., filteredanswers only) to flow back.

I In general, the client-side services offerquery language interfaces (perhapslanguage-embedded, or form-based, orboth).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 23 / 121

Distributed DBMS Architectures

Distributed DBMS Architectures (7)Pros/Cons of Client-Server Architectures

ProsI More efficient division of labor

I Client-side scale-up andscale-out

I Better price/performance onclient machines

I Ability to use familiar tools onclient machines

I Client access to remote data

I Full DBMS functionalityprovided to many

I Overall better systemprice/performance

Cons (vis-a-vis other distributionstrategies)

I Possible bottleneck and singlepoint of failure in the server

I Server-side scale-up andscale-out less easy

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 24 / 121

Distributed DBMS Architectures

Distributed DBMS Architectures (8)Multiple Clients/Multiple Servers

I Distributing server-side load ispossible.

I Mechanisms become more complex atthe lower levels.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 25 / 121

Distributed DBMS Architectures

Distributed DBMS Architectures (9)Co-Operating Servers

Once servers start co-operating, one iscoming close to a truly distributed DDBMS.

The newest classes of DDBMSs have arisenin the last five year as a result of pressure tomaintain

I extremely large repositories of eitherstructured or unstructured datasupporting

I workloads consisting of

I either relatively fewcomputationally intensiveanalyses

I or an extremely large amount ofrelatively simple retrieval orupdate requests.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 26 / 121

Distributed DBMS Architectures

SummaryDistributed DBMS Architectures

I DDBMSs have risen in importance due to structural changes in thecomputing landscape that saw the networking of high-quality PCsbecome the norm.

I Even so, they still retain their original role of emulating theoperational decentralization of organizations.

I DDBMS architectures capitalize on localization and parallelization tooffer a potential for performance gains.

I Nonetheless, autonomy and heterogeneity levels can create significanthurdles for full distribution.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 27 / 121

Distributed DBMS Architectures

Advanced Database Management SystemsData Distribution Strategies

Alvaro A A Fernandes

School of Computer Science, University of Manchester

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 28 / 121

Outline

Distributed DBMSs: The Design Problem

Data Distribution Strategies

Fragmentation and Allocation

Fragmentation, in More Detail

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 29 / 121

Distributed DBMSs: The Design Problem

Distribution Strategies (1)The Design Problem

I In the general setting, we need to decide:I the placement of data and programsI across the sites of a computer networkI as well as possibly designing the network itself

I In DDBMS, the placement of applications entails:I placement of the distributed DBMS softwareI placement of the applications that run on the database

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 30 / 121

Distributed DBMSs: The Design Problem

Distribution Strategies (2)Dimensions of the Problem

I Whether only data is partitioned across sites (and programs arereplicated everywhere) or whether programs are partitioned too

I Whether the access patterns are stable or not

I Whether knowledge of such access patterns is complete or not

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 31 / 121

Distributed DBMSs: The Design Problem

Distribution Strategies (3)Design Approaches

Top-Down : only possible, in practice, when the system is beingdesigned from scratch, and only lasts if heterogeneity andautonomy are tightly controlled

Bottom-Up : only practical solution when the component databasesalready exist at a number of sites, and more likely to lastwhen heterogeneity and autonomy cannot be controlled

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 32 / 121

Data Distribution Strategies

Data Distribution Strategies (4)Some Design Issues

I Why fragment at all?

I How to fragment?

I How much to fragment?

I How to ensure correctness of fragmentation?

I How to allocate fragments?

I What information is required?

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 33 / 121

Data Distribution Strategies

Data Distribution Strategies (5)Fragmentation (1)

I Why can’t we just distribute relations?I Because most relations are designed to be suitable for a great many

applications, and different applications may be subject to differentlocality aspects and offer different parallelization opportunities.

I What is a reasonable unit of distribution?I Roughly, that view on a relation that is needed by one or more

applications in one place

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 34 / 121

Fragmentation and Allocation

Data Distribution Strategies (6)Fragmentation (2)

Consider the case of entire relations as the unit of distribution:

I Most relations have subsets whose semantics characterize specialaffinity (e.g., of location, of timing, etc.).

I For example, in a relation Employees, the attribute Department maycharacterize location affinity if different departments occupy differentlocations.

I If so, then unnecessary communication may be incurred if wedistribute entire relations.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 35 / 121

Fragmentation and Allocation

Data Distribution Strategies (7)Fragmentation (3)

Consider the case of sub-relations as the unit of distribution:

I A sub-relation, referred to as a fragment in the DDBMS context, iswhat is specified by a view (typically by selection or projection orboth).

I Fragmentation can be derived in knowledge of applications and theiraffinities and allows parallel/distributed execution.

I For example, if Employee is horizontally fragmented by the attributeDepartment, and different fragments are held where thecorresponding department is located, computing the average salary ineach department can be done in parallel.

I If, after fragmentation, a particular query/view cannot be definedover a single fragment, then extra processing will be needed.

I Also, semantic checks may be more difficult (e.g., enforcingreferential integrity).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 36 / 121

Fragmentation and Allocation

Data Distribution Strategies (8)Fragmentation Alternatives: Horizontal (1)

I Broadly speaking, defined by a selection.

I Reconstruction is by union.

Example

PROJ1 ← σbudget<200000(PROJ)PROJ2 ← σbudget≥200000(PROJ)PROJ ← PROJ1 ∪ PROJ2

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 37 / 121

Fragmentation and Allocation

Data Distribution Strategies (9)Fragmentation Alternatives: Horizontal (2)

Example

PROJ1 =PNO PNAME BUDGET LOC

P1 Instrumentation 150000 TokyoP2 Database Develop. 135000 Oslo

PROJ2 =PNO PNAME BUDGET LOC

P3 CAD/CAM 250000 OsloP4 Maintenance 310000 ParisP5 CAD/CAM 500000 Paris

PROJ =PNO PNAME BUDGET LOC

P1 Instrumentation 150000 TokyoP2 Database Develop. 135000 OsloP3 CAD/CAM 250000 OsloP4 Maintenance 310000 ParisP5 CAD/CAM 500000 Paris

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 38 / 121

Fragmentation and Allocation

Data Distribution Strategies (10)Fragmentation Alternatives: Vertical (1)

I Broadly speaking, defined by a projection.

I Reconstruction is by a natural join on the replicated key.

Example

PROJ1 ← πPNO,BUDGET (PROJ)PROJ2 ← πPNO,PNAME ,LOC (PROJ)PROJ ← PROJ1 ./PNO PROJ2

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 39 / 121

Fragmentation and Allocation

Data Distribution Strategies (11)Fragmentation Alternatives: Vertical (2)

ExamplePROJ1 =

PNO BUDGET

P1 150000P2 135000P3 250000P4 310000P5 500000

PROJ2 =PNO PNAME LOC

P1 Instrumentation TokyoP2 Database Develop. OsloP3 CAD/CAM OsloP4 Maintenance ParisP5 CAD/CAM Paris

PROJ =PNO PNAME BUDGET LOC

P1 Instrumentation 150000 TokyoP2 Database Develop. 135000 OsloP3 CAD/CAM 250000 OsloP4 Maintenance 310000 ParisP5 CAD/CAM 500000 Paris

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 40 / 121

Fragmentation and Allocation

Data Distribution Strategies (12)Correctness of Fragmentation

CompletenessThe decomposition of a relation R into fragmentsR1,R2, ...,Rn is complete if and only if each data item in Rcan also be found in some Ri .

ReconstructibilityIf a relation R is decomposed into fragments R1,R2, ...,Rn,then there should exist some relational operator ∇ such thatR = ∇n

i=1Ri .

DisjointnessIf a relation R is decomposed into fragments R1,R2, ...,Rn,and data item di is in Rj , then di should not be in any otherfragment Rk (k 6= j).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 41 / 121

Fragmentation and Allocation

Data Distribution Strategies (13)Allocation Alternatives

Non-replicated : the fragments form a proper partition, and eachfragment resides at only one site.

Replicated : the fragments overlap, either fully (i.e., each fragmentexists at every site) or partially (i.e., each fragment exists atsome sites only).

I An often used rule-of thumb is that if the number of proper (i.e.,read-only) queries is larger than the number of updating queries, thenreplication tends to be advantageous in proportion, otherwise theopposite is the case.

I Especially in the client/server case, caching is also part of the designconsiderations.

I Web giants (e.g., Facebook, Amazon) use replication extensively.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 42 / 121

Fragmentation and Allocation

Data Distribution Strategies (14)Replication v. Caching: Some Contrasts

Replication Caching

target server client or middle-tiergranularity coarse fine

storage device typically disk typically main memoryimpact on catalog yes no

update protocol propagation invalidationremove copy explicit implicit

mechanism separate fetch fault in and keep copy after use

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 43 / 121

Fragmentation, in More Detail

Data Distribution Strategies (15)Information Requirements

I The are four kinds of information required:I about the databaseI about the applications (i.e., the queries, by and large)I about the communication networkI about the computer system

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 44 / 121

Fragmentation, in More Detail

FragmentationKinds

I Horizontal Fragmentation (HF)I Primary Horizontal Fragmentation (PHF)I Derived Horizontal Fragmentation (DHF)

I Vertical Fragmentation (VF)

I Hybrid Fragmentation (HF)

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 45 / 121

Fragmentation, in More Detail

Primary Horizontal Fragmentation (1)Information Requirements: Database

I We draw a link from a relationR to a relation S if we there isan equijoin on the key of R andthe corresponding foreign key inS .

I We call R the owner, and S themember.

I We need the cardinalities ofrelations and the (average)length of their tuples.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 46 / 121

Fragmentation, in More Detail

Primary Horizontal Fragmentation (2)Information Requirements: Application (1)

I Given R with schema [A1, . . . ,An], a simple predicate pj has theform Aiθc where θ ∈ {=, 6=, <,>,≤,≥}, c ∈ Domain(Ai ).

I For a relation R, we define Pr = {p1, . . . , pm}.I Given R and Pr , we define the set of minterm predicates

M = {m1, . . . ,mr} as M = { mk |mk =∧

pj∈Pr p∗j },

1 ≤ j ≤ m, 1 ≤ k ≤ r , where p∗j = pj or else p∗j = ¬p∗j .

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 47 / 121

Fragmentation, in More Detail

Primary Horizontal Fragmentation (3)Information Requirements: Application (2)

Example

Some (but not all) simple predicates on PROJ are:p1 : LOC = ′Tokyo ′

p2 : LOC = ′Oslo ′

p3 : LOC = ′Paris ′

p4 : BUDGET ≤ 200000

Some (but not all) minterm predicates on PROJ are:m1 : LOC = ′Tokyo ′ ∧ BUDGET ≤ 200000m2 : ¬(LOC = ′Tokyo ′) ∧ BUDGET ≤ 200000m3 : LOC = ′Tokyo ′ ∧ ¬(BUDGET ≤ 200000)m4 : ¬(LOC = ′Tokyo ′) ∧ ¬(BUDGET ≤ 200000)

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 48 / 121

Fragmentation, in More Detail

Primary Horizontal Fragmentation (4)Information Requirements: Application (3)

I We also need quantitative information about the application:I The selectivity of a minterm mi , denoted by sel(mi ) is the number of

tuples in the corresponding relation R that would be produced byσmi (R).

I The access frequency of an application qi , denoted by acc(qi ) is thenumber of times in which qi accesses data in a given period.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 49 / 121

Fragmentation, in More Detail

Primary Horizontal Fragmentation (5)Definition

I A (primary) horizontal fragment Rj of a relation R is defined asRj ← σmi (R) where mi is a minterm predicate on R.

I Given a set of minterm predicates M = {m1, . . . ,mr} over R, one candefine r horizontal fragments in R.

I [Oszu and Valduriez, 1999] give an algorithm that, given a relation Rand a set of simple predicates on R, produces a correct set offragments from R.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 50 / 121

Fragmentation, in More Detail

Primary Horizontal Fragmentation (6)Example (1)

Example (Information Required)

I Let the relations PAY and PROJ be candidates for PHF.I Let the following be the applications involved:

I A1: Find the name and budget of projects given their project number.I A2: Find projects according to their budget.I Let A1 be issued at three sites.I Let one site access A2 for budgets below 200000, and the other two

access A2 for those above.I Let the following be the simple predicates:

p1 : LOC = ′Tokyo′

p2 : LOC = ′Oslo′

p3 : LOC = ′Paris ′

p4 : BUDGET ≤ 200000p5 : BUDGET > 200000

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 51 / 121

Fragmentation, in More Detail

Primary Horizontal Fragmentation (7)Example (2)

Example (Output)

I Applying the algorithm alluded to, the following minterm predicatesresult: m1 : LOC = ′Tokyo ′ ∧ BUDGET ≤ 200000

m2 : LOC = ′Tokyo ′ ∧ BUDGET > 200000m3 : LOC = ′Oslo ′ ∧ BUDGET ≤ 200000m4 : LOC = ′Oslo ′ ∧ BUDGET > 200000m5 : LOC = ′Paris ′ ∧ BUDGET ≤ 200000m6 : LOC = ′Paris ′ ∧ BUDGET > 200000

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 52 / 121

Fragmentation, in More Detail

Primary Horizontal Fragmentation (8)Example (3)

Example (Fragments Obtained)

PROJ1 =PNO PNAME BUDGET LOC

P1 Instrumentation 150000 Tokyo

PROJ3 =PNO PNAME BUDGET LOC

P2 Database Develop. 135000 Oslo

PROJ4 =PNO PNAME BUDGET LOC

P3 CAD/CAM 250000 Oslo

PROJ6 =PNO PNAME BUDGET LOC

P4 Maintenance 310000 ParisP5 CAD/CAM 500000 Paris

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 53 / 121

Fragmentation, in More Detail

Derived Horizontal Fragmentation (1)Definition

I A derived horizontal fragment is defined on a member relationaccording to a selection operation on its owner.

I Recall that a link from owner to member is defined in terms of anequijoin.

I A semijoin between R and S is defined as follows:R n S ≡ πA(R ./ S), where A is the list of attributes in the schemaof R.

I Given a link L, where owner(L) = S and member(L) = R, the derivedhorizontal fragments of R are defined as Ri = R n Si , 1 ≤ i ≤ w ,where w is the maximum number of fragments to be generated andSi = σmi (S) is the primary horizontal fragment defined by theminterm predicate mi .

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 54 / 121

Fragmentation, in More Detail

Derived Horizontal Fragmentation (2)Example (1)

Example (Information Required, Fragments Defined)

I Let there be a link L1 with owner(L1) = PAY andmember(L1) = EMP.

I Let PAY1 ← σSAL≤30000(PAY ) and PAY2 ← σSAL>30000(PAY ).I Then two DHFs are defined:

I EMP1 ← EMP n PAY1

I EMP2 ← EMP n PAY2

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 55 / 121

Fragmentation, in More Detail

Derived Horizontal Fragmentation (3)Example (2)

Example (Fragments Obtained)

EMP1 =

ENO ENAME TITLE

E3 A. Lee Mech. Eng.E4 J. Miller ProgrammerE7 R. Davis Mech. Eng.

EMP2 =

ENO ENAME TITLE

E1 J. Doe Elect. Eng.E2 M. Smith Syst. Anal.E5 B. Casey Syst. Anal.E6 L. Chu Elect. Eng. .E8 J. Jones Syst. Anal

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 56 / 121

Fragmentation, in More Detail

Vertical Fragmentation (1)

I Vertical fragmentation has also been studied in the centralizedcontext since it is important for:

I normalization of designsI physical clustering

I In terms of physical clustering, there is excitement in the DBMSindustry (at the time of writing) about an extreme form of verticalpartitioning in which single columns are stored separately.

I Certain access patterns are made easier by this and compression levelsan order of magnitude larger can be obtained, which is importantwhen dealing with the massive volumes of data that are typical ofanalytics workloads.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 57 / 121

Fragmentation, in More Detail

Vertical Fragmentation (2)

I Vertical fragmentation is more difficult than horizontal fragmentation,because more alternatives exist.

I Heuristic approaches that can be used are:

grouping : one adds attributes to fragments one by one.splitting : one breaks down a relation into fragments based on

access patterns.

I See [Oszu and Valduriez, 1999] for an example (or else recall, fromyour earlier database studies the theory of normal forms and how it isjustifiably disobeyed).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 58 / 121

Fragmentation, in More Detail

SummaryData Distribution Strategies

I Fragmentation, allocation, replication and caching are all mechanismsthat DDBMSs make use of to respond to the affinity of locality thatdata exhibits, particularly in decentralized organizations.

I The design decisions required are well-studied and well-foundedsolutions are available but require a great deal of information.

I The benefits can be significant particularly for response time becauseof the greater degree of natural parallelism that becomes possible.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 59 / 121

Fragmentation, in More Detail

Advanced Database Management SystemsDistributed Query Processing

Alvaro A A Fernandes

School of Computer Science, University of Manchester

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 60 / 121

Outline

The Distributed Query Processing Problem

Two-Phase Distributed Query Optimization

Localization and Reduction

Cost-Related Issues

Join Ordering in DQP

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 61 / 121

The Distributed Query Processing Problem

Distributed Query Processing (1)What is the Problem? (1)

I Assume the fragments EMPi

and ASGj to be stored in thesites shown in the figure.

I Assume the double-shaftedarrows to denote the transfer ofdata between sites.

I Strategy 1 can be said to aim todo processing locally in order toreduce the amount of data thatneeds to be shipped to theresult site, i.e., Site 5.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 62 / 121

The Distributed Query Processing Problem

Distributed Query Processing (2)What is the Problem? (2)

I Strategy 2 can be said to aim toship all the data to, and do allthe processing at, the site whereresults need to be delivered

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 63 / 121

The Distributed Query Processing Problem

Distributed Query Processing (3)Cost of Alternatives (1)

Example (Assumptions)

I ta (tuple access cost) = 1 unit

I tt (tuple transfer cost) = 10 units

I |ASG | = 100, length(ASG) = 10, |EMP| = 80, length(EMP) = 5

I |ASG1| = |σENO≤′E3′(ASG)| = 50

I |EMP1| = |σENO≤′E3′(EMP)| = 40

I V (ASG ,RESP) = 5

I length(ENO) = 2

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 64 / 121

The Distributed Query Processing Problem

Distributed Query Processing (4)Cost of Alternatives (2)

Example (Consequences)

I size(ASG) = 100× 10 = 1, 000, size(EMP) = 80× 5 = 400

I |ASG2| = |ASG | − |ASG1| = 100− 50 = 50

I size(ASG1) = size(ASG2) = |ASG1| × 10 = 50× 10 = 500

I |EMP2| = |EMP| − |EMP1| = 80− 40 = 40

I size(EMP1) = size(EMP2) = |EMP1| × 5 = 40× 5 = 200

I |ASG ′1| = |ASG ′2| = |σRESP=′manager′(ASG1)| = |ASGi |V (ASG ,RESP)

= 505

= 10

I |ASG ′| = |ASG ′1|+ |ASG ′2| = 10 + 10 = 20

I length(EMPi ./ENO ASG ′i ) = length(EMP) + length(ASG)− length(ENO) =10 + 5− 2 = 13

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 65 / 121

The Distributed Query Processing Problem

Distributed Query Processing (5)Cost of Alternatives (3)

Example (Comparison (1))

Action Cost Formula Cost

produce ASG ′i = 2× |ASGi | × ta= 2× 50× 1 100

transfer ASG ′i to sites 3, 4 = 2× size(ASG ′i )× tt= 2× 10× 10× 10 2,000

produce EMP ′i = 2× |EMPi | × |ASG ′i | × ta= 2× 40× 10× 1 800

transfer EMP ′i to site 5 = 2× size(EMPi ./ENO ASG ′i )× tt= 2× 40×10

40× 13× 10 2,600

Total Cost of Strategy 1 5,500

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 66 / 121

The Distributed Query Processing Problem

Distributed Query Processing (6)Cost of Alternatives (4)

Example (Comparison (2))

Action Cost Formula Cost

transfer EMP to site 5 = size(EMP)× tt= 400× 10 4,000

transfer ASG to site 5 = size(ASG)× tt= 1000× 10 10,000

produce ASG’ = |ASG | × ta= 100× 1 100

join EMP and ASG’ = |EMP| × |ASG ′| × ta= 80× 20× 1 1,600

Total Cost of Strategy 2 15,700

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 67 / 121

The Distributed Query Processing Problem

Distributed Query Processing (7)Query Optimization Objectives

I Minimize a cost function such as total time or response time.

I All components may have different weights in different distributedenvironments.

I One could have different goals, e.g., maximize throughput.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 68 / 121

The Distributed Query Processing Problem

Distributed Query Processing (8)Where Can Decisions Be Made?

Centralized I A single site determines the schedule.I This is simpler, but requires knowledge about the entire

distributed database.

Distributed I There is co-operation among sites to determine theschedule.

I This only requires sharing local information, butco-operation has a cost.

Hybrid I One site determines the global schedule.I Each site optimizes the local subqueries.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 69 / 121

The Distributed Query Processing Problem

Distributed Query Processing (9)Issues Regarding the Network

Wide-Area Network (WAN) I WANs have comparatively lowbandwidth, low speed and high protocol overhead

I As a result, communication cost will dominate, to theextent that it may be possible to ignore all other costs.

I Thus, the global schedule will aim to minimizecommunication cost.

I Local schedules are decided according to centralizedquery optimization decisions.

Local-Area Network (LAN) I Communication cost is not asdominant as in WANs.

I Thus, all components in the total cost function must beconsidered.

I Broadcasting is an option.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 70 / 121

Two-Phase Distributed Query Optimization

Distributed Query Optimization (1)Two-Phase Approach

I One way to implement distributedquery optimization as a continuumwith the centralized case is tostructure the decision-making stagesin such a way that the optimizerbreaks the overall task into twophases.

I In the first phase, a single-node QEPis produced (that would run if theDBMS were not a distributed DBMS);in the second phase, this single-nodeQEP is transformed into a multi-nodeone.

I The second phase partitions a QEPinto fragments linked by exchangeoperators, then schedules eachfragment to execute in differentcomponent nodes.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 71 / 121

Localization and Reduction

Distributed Query Optimization (2)Localizing a Global Query

I Given an algebraic query on global relations:I determine which are distributed;I for those, determine which fragments are involved;I replace references to global relations with the reconstruction expression

(which is referred to as a localization program).

I The leaves of distributed relations are replaced by its localizationprogram over its fragments.

I The result is sometimes referred to as a generic query and is likely tobenefit from optimization by reduction.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 72 / 121

Localization and Reduction

Distributed Query Optimization (3)Some Examples (1)

I Assume EMP is horizontally fragmented into EMP1, EMP2 andEMP3 as follows:

1. EMP1 ← σENO≤′E3′(EMP)2. EMP2 ← σ′E3′<ENO≤′E6′(EMP)3. EMP3 ← σENO>′E6′(EMP)

I Assume ASG is horizontally fragmented into ASG1 and ASG2 asfollows:

1. ASG1 ← σENO≤′E3′(ASG )2. ASG2 ← σENO>′E3′(ASG )

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 73 / 121

Localization and Reduction

Distributed Query Optimization (4)Some Examples (2)

I Assume the following query:SELECT E.ENAMEFROM EMP EWHERE E.ENO = ’E5’

I The figure shows the correspondinggeneric query with the leaf replacedby its localization program.

I Then, the figure shows the queryafter optimization by reduction, inthis case because it follows fromthe predicates that defined thefragments that only EMP2 cancontribute to the specified results.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 74 / 121

Localization and Reduction

Distributed Query Optimization (5)Some Examples (3)

I Assume the following query:SELECT E.ENAMEFROM EMP E, ASG AWHERE E.ENO = A.ENO

I The figure shows the correspondinggeneric query with the leaf replacedby its localization program.

I We next show the query afterreduction.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 75 / 121

Localization and Reduction

Distributed Query Optimization (6)Some Examples (4)

I The figure shows the reduced joinquery.

I Note that the optimizer has usedthe commutativity between join andunion to push the joins upstreamand reduce the amount of work.

I This also helps in scheduling thejoins to execute in parallel.

I Note, finally, that the optimizer hasmade use of the fact that EMP3

and ASG1 do not share tuples(because their predicates lead to acontradiction, and hence wouldreturn an empty set) andeliminated the need to join them.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 76 / 121

Localization and Reduction

Distributed Query Optimization (7)Some Examples (5)

I Assume EMP is verticallyfragmented into EMP1 and EMP2

as follows:

1. EMP1 ← πENO,ENAME (EMP)2. EMP2 ← πENO,TITLE (EMP)

I Assume the following query:SELECT E.ENAMEFROM EMP E

I The figure shows the correspondinggeneric query with the leaf replacedby its localization program.

I Then, the figure shows the queryafter optimization by reduction, inthis case because it follows fromthe projection lists that defined thefragments that only EMP1 cancontribute to the specified results.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 77 / 121

Localization and Reduction

Distributed Query Optimization (8)A Detailed Example Derivation (1)

I Assume PROJ is horizontally fragmented into PROJ1, PROJ2 andPROJ3 as follows:

1. PROJ1 ← σLOC=′Tokyo′(PROJ)2. PROJ2 ← σLOC=′Oslo′(PROJ)3. PROJ3 ← σLOC=′Paris′(PROJ)

I Assume the following query: SELECT AVG(P.BUDGET)FROM PROJ PWHERE P.LOC = ’OSLO’

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 78 / 121

Localization and Reduction

Distributed Query Optimization (9)A Detailed Example Derivation (2)

(translate) γAVG(BUDGET )(σLOC=′Oslo′(PROJ))(localize) γAVG(BUDGET )(σLOC=′Oslo′(PROJ1 ∪ (PROJ2 ∪ PROJ3)))(expand) γAVG(BUDGET )(σLOC=′Oslo′

(σLOC=′′Tokyo′(PROJ) ∪ (σLOC=′Oslo′(PROJ) ∪ σLOC=′Paris′(PROJ))))(combine) γAVG(BUDGET )(σLOC=′Oslo′∧LOC=′Tokyo′(PROJ)

∪(σLOC=′Oslo′∧LOC=′Oslo′(PROJ)∪(σLOC=′Oslo′∧LOC=′Paris′(PROJ))))

(simplify) γAVG(BUDGET )(σ⊥(PROJ) ∪ (σLOC=′Oslo′(PROJ) ∪ (σ⊥(PROJ))))(simplify) γAVG(BUDGET )(∅ ∪ (σLOC=′Oslo′(PROJ) ∪ ∅))(simplify) γAVG(BUDGET )(σLOC=′Oslo′(PROJ))(simplify) γAVG(BUDGET )(PROJ2)

This derivation shows that the query can be executed only over the Oslohorizontal fragment PROJ2 and wherever it is stored.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 79 / 121

Cost-Related Issues

Distributed Query Optimization (10)Scheduling Query Fragments

I Given a fragment query, find the best global schedule by minimizing acost function.

I Join processing in centralized DBMSs tends to prefer linear (e.g.,left-deep) trees because the size of the search space is reduced by thelinearity constraint).

I However, in distributed DBMSs, join processing over bushy treesreveals opportunities for parallelism.

I Other decisions include:I Which relation to ship where?I Whether to ship the whole or to ship as needed?I Whether to use semijoins? (Semijoins save on communication at the

expense of more local processing.)

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 80 / 121

Cost-Related Issues

Distributed Query Optimization (11)Cost Functions

Total Time (also referred to as Total Cost): The overall strategy in thiscase is to

I Reduce the cost (i.e., time) in each componentindividually

I Do as little of each cost component as possible

This optimizes the utilization of the resources and tends toincreases system throughput.

Response Time The overall strategy in this case is to do as many thingsas possible in parallel.However, this may increase the total time because of overallincreased activity.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 81 / 121

Cost-Related Issues

Distributed Query Optimization (12)Total Cost

I The total cost is the summation of all cost factors:

1. Total cost = CPU cost + I/O cost + communication cost2. CPU cost = unit instruction cost × no.of instructions3. I/O cost = unit disk I/O cost × no. of disk I/Os4. communication cost = (unit message initiation cost × no. of

messages)+ (unit transmission cost × no. of bytes)

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 82 / 121

Cost-Related Issues

Distributed Query Optimization (13)Response Time

I The response time is the elapsed time between the initiation and thecompletion of a query.

I Processing and communication costs that are incurred in sequence ina component count at most once.

I If several sequential tasks are executed in parallel, the cost that iscounted is the maximum cost of all those tasks.

I 1. Response time = CPU time + I/O time + communication time2. CPU time = unit instruction time × no. of sequential instructions3. I/O time = unit I/O time × no. of sequential I/Os4. communication time = (unit message initiation time × no. of

sequential messages) + (unit transmission time × no. of sequentialbytes

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 83 / 121

Cost-Related Issues

Distributed Query Optimization (14)Some Cost Factors

wide-area networks I Message initiation and transmission costs arerelatively high.

I Local processing cost is comparatively low (fastmainframes or minicomputers)

I Ratio of communication to I/O costs is high (2-digits to1-digit?).

local-area networks I Communication and local processing costs arecomparable.

I Ratio of communication to I/O costs is not high (closeto 1:1?).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 84 / 121

Cost-Related Issues

Distributed Query Optimization (15)Example: Total Cost v. Response Time

I Assume that:

I only the communication cost is consideredI one message conveys one unit of work

(e.g., a tuple)

I Let UM denote the unit message initialization timeand UT the unit transmission time. LetTsend(r , s, t) denote the time to send r from s to t.

I Total time = (n + m)UM + (np + mq)UT

I Response time = max{Tsend(n, 1, 3),Tsend(m, 2, 3)}I Tsend(n, 1, 3) = nUM + npUT

I Tsend(m, 2, 3) = mUM + mqUT

I If n = 900, m = 1, 000, p = 90, and q = 100, then

I Total time = 1, 900UM + 181, 000UTI Response time = 1, 000UM + 100, 000UT

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 85 / 121

Join Ordering in DQP

Distributed Query Optimization (16)Join Ordering in Fragment Queries

I Given an n-ary relation R with attributes A1, . . . ,An, let |R| denote thecardinality of R, and let length(Ai ) denote the (possibly average) length inbytes of a value from the domain of Ai , in which case the (possibly average)length of a tuple in R is length(R) =

∑ni=1 length(Ai ).

I Let size(R) = |R| × length(R).

I Given two relations R and S that are not co-located, we ship R to the site ofS if size(R) ≤ size(S) and we ship S to the site of R if size(S) < size(R).

I For many relations, there may be too many alternatives.

I Also, computing the cost of all alternatives and selecting the best onedepends on computing the size of intermediate relations, which is difficult.

I In practice, heuristics are needed.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 86 / 121

Join Ordering in DQP

Distributed Query Optimization (17)Join Ordering: An Example (1)

I Consider the 2-way joinPROJ ./PNO (ASG ./ENO EMP)

I The join graph shows the sites where eachrelation is, and there is an edge betweentwo relations if an equijoin on the edgelabel is required.

I The many different execution alternativesare shown next, with a double-shaftedarrow denoting the shipment of the relationin the left to the site in the right, and the’@’ sign denoting that the left-hand sideexpression is evaluated at the site in theright-hand side.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 87 / 121

Join Ordering in DQP

Distributed Query Optimization (18)Join Ordering: An Example (2)

1. 1.1 EMP ⇒ 21.2 EMP’ ← EMP ./ ASG @ 21.3 EMP’ ⇒ 31.4 EMP’ ./ PROJ @ 3

2. 2.1 ASG ⇒ 12.2 EMP’ ← EMP ./ ASG @ 12.3 EMP’ ⇒ 32.4 EMP’ ./ PROJ @ 3

3. 3.1 ASG ⇒ 33.2 ASG’ ← ASG ./ PROJ @ 33.3 ASG’ ⇒ 13.4 ASG’ ./ EMP @ 1

4. 4.1 PROJ ⇒ 24.2 PROJ’ ← PROJ ./ ASG @ 24.3 PROJ’ ⇒ 14.4 PROJ’ ./ EMP @ 1

5. 5.1 EMP ⇒ 25.2 PROJ ⇒ 25.3 PROJ ./ (ASG ./ EMP)@ 2

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 88 / 121

Join Ordering in DQP

Distributed Query Optimization (19)Join Ordering: An Example (3)

1. An alternative to enumerating all possibilities is to use the heuristic ofconsidering only the sizes of the operands and assuming that thecardinality of the join is the product of the input cardinalities.

2. In this case, relations are ordered by increasing sizes and the order ofexecution is given by this ordering and the join.

3. For example, the order (EMP, ASG, PROJ) could use Strategy 1, andthe order (PROJ, ASG, EMP) could use Strategy 4.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 89 / 121

Join Ordering in DQP

Distributed Query Optimization (20)Approaches Based on Semijoins (1)

I Consider the join of two relations R[A] (located at site 1) and S[A](located at site 2).

I One could evaluate R ./A S .

I Alternatively,one could evaluate one of the equivalent semijoins:R ./A S ⇔ (R nA S) ./A S

⇔ R ./A (S nA R)⇔ (R nA S) ./A (S nA R)

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 90 / 121

Join Ordering in DQP

Distributed Query Optimization (21)Approaches Based on Semijoins (2)

1. Using a join:

1.1 R ⇒ 21.2 R ./A S@ 2

2. Using a semijoin:

2.1 S’ ← πA(S)2.2 S’ ⇒ 12.3 R’ ← R nA S’ @ 12.4 R’ ⇒ 22.5 R’ ./A S @ 2

Semijoin is better if

size(πA(S)) + size(R nA S)) < size(R)

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 91 / 121

Join Ordering in DQP

SummaryDistributed Query Processing

I There is an evolutionary continuum from centralized to distributedquery optimization.

I Localization and reduction are the main techniques by which aheuristically-efficient distributed QEP can be arrived at.

I In wide-area distributed query processing (DQP), communicationcosts tend to dominate, although in local-area networks this is not thecase.

I The join ordering problem remains, here too, an important one.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 92 / 121

Join Ordering in DQP

Advanced Database Management SystemsData Integration Strategies

Alvaro A A Fernandes

School of Computer Science, University of Manchester

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 93 / 121

Outline

Data Integration: Problem Definition

Process Alternatives

View-Based Data Integration

Schema Matching, Mapping and Integration

Dataspaces

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 94 / 121

Data Integration: Problem Definition

Data Integration (1)Problem Definition

I Data(base) integration is the process as a result of which a set ofcomponent DBMSs are conceptually integrated to form amulti-DBMS, i.e., a DDBMS that offers a single, logically coherentschema to users and applications.

I Equivalently, given existing databases with their Local ConceptualSchemas (LCSs), data integration is the process by which they areintegrated into a Global Conceptual Schema (GCS).

I A GCS is also called a mediated schema, or, more simply, a globalschema.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 95 / 121

Data Integration: Problem Definition

Data Integration (2)Some Assumptions, Some Issues

I In general, the problem only arises if the component DBMSs alreadyexist, so data integration is typically a bottom-up process.

I In some respects, it can be conceived of as the reverse of the datadistribution (i.e., fragmentation and allocation) problem.

I One of the most important concerns in data integration is the level ofheterogeneity of the component DBMSs.

I This, in turn, is strongly linked to the degree of autonomy that eachcomponent DBMSs enjoys and exercises.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 96 / 121

Process Alternatives

Data Integration (3)Some Alternatives

Physical Integration : in this case, the source databases are integratedand the outcome is materialized. It is the more commonpractice in data warehousing.

Logical Integration in this case, the global schema that emerges fromintegrating the sources remains virtual. It is the morecommon practice when the component DBMSs enjoyautonomy (e.g., in scientific contexts, where differentresearch groups maintain different data resources but stillallow them to be part of a multi-DBMS of interest to themand to others).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 97 / 121

Process Alternatives

Data Integration (4)A Bottom-Up Process

I The most widely-used approachinvolves:

I translationI Each LCS abstracts over a

data source.I A translator maps across to

and from concepts in theLCS and concepts in anintermediate schema (IS).

I integrationI The ISs are cast in an

interlingua, a canonicalformalism in which theLCSs of the participatingsources can be cast.

I The integrator uses the ISsto project out the GCS tousers and applications.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 98 / 121

Process Alternatives

Data Integration (5)Dealing with Autonomy and Heterogeneity

I In contexts where heterogeneity is the norm (e.g., when themulti-DBMS is formed from public resources) the translators areoften referred to as wrappers and the integrator is referred to as themediator.

I Wrappers can reconcile different kinds of heterogeneity, e.g.:

infrastructural including those stemming from the system softwareor network level

syntactic including those relating to data model and querylanguages (e.g., generating a relational view of aspreadsheet)

semantic which are the hardest to capture and maintain in sync

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 99 / 121

View-Based Data Integration

Data Integration (6)Schemas as Views

I There are two major possibilities to relate a GCS and its LCSs bymeans of views:

Global-As-View (GAV) : in this case, the LCSs are the extents overwhich one writes a set of views that, together, comprisethe GCS against which global queries are formulated.

Local-As-View (LAV) : in this case, the GCS is assumed to existand each LCS is treated as if it were a view over thispostulated GCS.

I We will focus on GAV, and a simple example of how it works iscoming soon.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 100 / 121

Schema Matching, Mapping and Integration

Database Integration Tasks (1)Postulating Semantic Equivalences

I Assume three distributed,heterogeneous, autonomous datasources, S1-S3.

I The first task is to postulate thatthere are (1:1, 1:n, m:n) relationshipsbetween tables and columns indifferent sources.

I This is done by matching atschema-level, i.e., using schema namesand structures, and at instance-level,i.e., using attribute values andstructures.

I The dashed lines show somepostulated equivalences, explicitly, atcolumn/attribute level (and, forsimplicity, only implicitly, attable/relation level).

I Lighter attributes with dark borders denoteprimary keys.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 101 / 121

Schema Matching, Mapping and Integration

Database Integration Tasks (2)Postulating Mappings

I From postulated equivalences, thenext task is to write view expressionsthat define constructs in one schemain terms of one or more otherschemas.

I For example, one might define S3 interms of S1 and S2 with the followingmappings:

R ← πx→a,m→b,n→c(X ./x=y Y )

S ← πt→d,r+q→e(T ) ∪ πv→d,w→e(U)

I When written as above, we often callthe left-hand side of a mapping, thehead, and the right-hand side, thebody.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 102 / 121

Schema Matching, Mapping and Integration

Database Integration Tasks (3)Query Evaluation over Integrated Schemas

I From postulated mappings, thenext task is to issue queriesagainst the integrated schema.

I Assume a query against S3:γavg(e)(σd>5(S)) ≡γavg(e)(σd>5(Q ∪ Q ′))

i.e., we rewrite using themappings that define S in S3 interms of T and U in S1:

S2 : Q ← πt→d,r+q→e(T )S1 : Q ′ ← πv→d,w→e(U)

I The subqueries Q and Q ′ runremotely at S1.

I Results are shipped to S3 wherethe union runs locally.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 103 / 121

Schema Matching, Mapping and Integration

Database Integration Tasks (4)Schema Matching (1)

There are two major kinds of heterogeneity that make schema matching ahard problem:

Structural (or Syntactic) Heterogeneity

I Type conflicts (e.g., address as string v. address asstruct)

I Dependency conflicts (e.g., net salary plus tax v. salary)I Key conflicts (e.g., absence of foreign keys that would

be required)

Schematic (or Semantic) HeterogeneityThese are conflicts that arise from the fact that the designersof the GCS and the LCS have different underlying ontologiesin mind, i.e., they conceptualize the database domain indifferent terms.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 104 / 121

Schema Matching, Mapping and Integration

Database Integration Tasks (5)Schema Matching (2)

Semantic heterogeneity takes many forms, including:

I Synonyms, i.e., when two words can be interchanged in a context they aresaid to be synonymous relative to that context. For example, in sport, theword match can be synonymous with tie.

I Homonyms, i.e., when two words are spelled the same way but havedifferent meanings they they are said to be homonymous. For example, wemay want to have an attribute spelled ’price’ in the GCS and find it in anLCS but it may have a different meaning there (e.g., it excludes VAT, whichis not what the GCS expects).

I Hypernyms, which are words that are more generic than a given word. Forexample, the GCS may expect a relation ’employees’ not to discriminatebetween temporary and permanent staff, whereas in some LCS may onlystore in ’employees’ the permanent staff (e.g., because it stores temporarystaff under, say, ’temps’).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 105 / 121

Schema Matching, Mapping and Integration

Database Integration Tasks (6)Schema Matching (3)

I There are other complications too:I Insufficient schema and instance information: how can one find out

how a derived attribute (e.g., VAT) is calculated?I Subjectivity of the matching: how can one be sure that the

correspondence (e.g., between two relations named ’employees’) is validfor all instances?

I Many issues also conspire to make schema matching hard:I Schema-level versus instance-level matching: which do we use? Both?

If so, which weight does each have?I Element-level versus structure-level matching: if we find a match for an

attribute but all other attributes in the same relation do not match, dowe trust the match?

I Matching cardinality is hard without additional information, as it is notnormally captured in DDLs (although XMLSchema, e.g., can do).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 106 / 121

Schema Matching, Mapping and Integration

Database Integration Tasks (7)Combined Schema Matching Approaches

I One way to strengthen the validity of the decision, it is possible touse multiple matchers (i.e., different similarity-assigning algorithms).

I This allows for specialization, e.g., different matchers may focus ondifferent domains (e.g., names, or telephone numbers, or addresses,etc.)

I A meta-matcher integrates these into one prediction (e.g., taking the(possibly weighted) mean of the similarity values computed byindividual matchers).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 107 / 121

Schema Matching, Mapping and Integration

Database Integration Tasks (8)Schema Integration

I Once correspondences are deemedvalid, we can use them to create aGCS.

I While matching can (and is)essentially an automated process,selecting matches to becomemappings and combining thesemappings into a GCS is largely amanual process, i.e., in a rule-basedapproach, like the previousexamples, the rules are notnormally generated automatically.

I Approaches to data integration areillustrated in the figure.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 108 / 121

Dataspaces

Dataspaces as a Data Integration Approach (1)The Question of Cost

I What we have described so far can be called classical, mediator-baseddata integration.

I It delivers high-quality results early but with high upfront costs due tothe need for human expertise in making up for the shortcomings ofmatching and mapping derivation.

I Dataspaces[Franklin et al., 2005, Halevy et al., 2006, Hedeler et al., 2009] are anew approach to data integration: it automates the bootstrapping ofan integrated view, accepts the lower-quality of results early on, butaggressively seeks and uses feedback to improve them over time.

I The idea is that users get some results quickly at near to no cost: ifthis motivates them, they pay some cost in the form of feedback astheir need spurs them.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 109 / 121

Dataspaces

Dataspaces as a Data Integration Approach (2)The Question of Quality v. Cost v. Time

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 110 / 121

Dataspaces

Dataspaces as a Data Integration Approach (3)The Broad Aim of a Dataspace Management System

I Given a set of data sources, a dataspace management systems(DSpMS) aims to obtain the best mappings with minimal humanintervention.

I This means bootstrapping the set-up stage (i.e., the postulation ofsemantic equivalences and the mappings derived them) usingautomated means.

I This also means being intelligent and efficient in seeking as few andas useful feedback instances from users as possible and, once they areobtained, making the most of them for improving results[Belhajjame et al., 2010].

I One wants to pay as little as possible as late as possible and stillobtain excellent results for the effort spent.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 111 / 121

Dataspaces

Dataspace Architecture (1)Seen as a Stack

I As we saw, classical integrationinvolves layering a mediator overdata sources, which are assumed tobe fully-fledged databases.

I The same holds for dataspaces,except that the mediator (i.e., theability to translate queries againstan integrated schema into queriesagainst sources, stitching thepartial results into integrated oneson the return journey) is poweredby automatically derived mappingsfrom automatically derivedmatches.

I For this later task, some have usedmodel management techniques(essentially an algebra over schemaconstructs, including mappings andmatches)[Bernstein and Melnik, 2007].

I A dataspace is then seen to beunique in introducing improvementvia feedback.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 112 / 121

Dataspaces

Dataspace Architecture (2)Seen as a Composition of Algebras (1)

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 113 / 121

Dataspaces

Dataspace Architecture (3)Seen as a Composition of Algebras (2)

I A DSpMS is a DBMS: it retains the ability to evaluate queries oversources to produce the specified results.

I A DSpMS is also a data integration system: it retains the ability touse mappings and schemas over many distributed resources and usethe former to make the latter seem an integrated source.

I In one approach, a DSpMS is also a model management system[Hedeler et al., 2010]: it has the ability to sample sources and matchthem to generate correspondences as well as to operate on schemas(e.g., merge, compose, subtract, extract schema constructs).

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 114 / 121

Dataspaces

Dataspace Architecture (4)Seen as a Composition of Algebras (3)

I In this approach, a significant part of a DSpMS is an engine toevaluate algebraic operations over schemas, matches andcorrespondences.

I These operations are like programs that a DSpMS executes to allowusers to derive many integrated views over a collection of resources,rather than a single one.

I We can see, therefore, that what is truly unique to a DSpMS is theuse of feedback for improving integration mappings that weregenerated algorithmically and are, for this reason, likely to producepoor quality results.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 115 / 121

Dataspaces

SummaryData Integration Strategies

I With the explosion in the availability of networked data andcomputational resources, the data integration problem has become anextremely important one.

I Superimposing a global conceptual schema over local ones is asimportant a task as it is costly.

I However, view-based techniques can can be used to great effect.

I The greatest hurdle remains the reconciliation of schematicheterogeneity, the upfront cost of which is often prohibitive.

I The notion of a dataspace has been introduced recently tocharacterize a pay-as-you-go approach to data integration, i.e.,avoiding having to pay high upfront costs.

I The idea is that conflict reconciliation happens over time,incrementally, driven by the assimilation of user feedback.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 116 / 121

Dataspaces

Acknowledgements

The material presented mixes original material by the author as well asmaterial adapted from

I [Oszu and Valduriez, 1999]

The author gratefully acknowledges the work of the authors cited whileassuming complete responsibility any for mistake introduced in theadaptation of the material.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 117 / 121

Dataspaces

References (1)

Belhajjame, K., Paton, N. W., Embury, S. M., Fernandes, A. A. A.,and Hedeler, C. (2010).Feedback-based annotation, selection and refinement of schemamappings for dataspaces.In Manolescu, I., Spaccapietra, S., Teubner, J., Kitsuregawa, M., Leger, A.,Naumann, F., Ailamaki, A., and Ozcan, F., editors, EDBT, volume 426 ofACM International Conference Proceeding Series, pages 573–584. ACM.

http://doi.acm.org/10.1145/1739041.1739110.

Bernstein, P. A. and Melnik, S. (2007).Model management 2.0: manipulating richer mappings.In Chan, C. Y., Ooi, B. C., and Zhou, A., editors, SIGMOD Conference,pages 1–12. ACM.

http://doi.acm.org/10.1145/1247480.1247482.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 118 / 121

Dataspaces

References (2)

Franklin, M. J., Halevy, A. Y., and Maier, D. (2005).From databases to dataspaces: a new abstraction for informationmanagement.SIGMOD Record, 34(4):27–33.http://doi.acm.org/10.1145/1107499.1107502.

Halevy, A. Y., Franklin, M. J., and Maier, D. (2006).Principles of dataspace systems.In Vansummeren, S., editor, PODS, pages 1–9. ACM.

http://doi.acm.org/10.1145/1142351.1142352.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 119 / 121

Dataspaces

References (3)

Hedeler, C., Belhajjame, K., Mao, L., Paton, N. W., Fernandes, A.A. A., Guo, C., and Embury, S. M. (2010).Flexible dataspace management through model management.In Daniel, F., Delcambre, L. M. L., Fotouhi, F., Garrigos, I., Guerrini, G.,Mazon, J.-N., Mesiti, M., Muller-Feuerstein, S., Trujillo, J., Truta, T. M.,Volz, B., Waller, E., Xiong, L., and Zimanyi, E., editors, EDBT/ICDTWorkshops, ACM International Conference Proceeding Series. ACM.

http://doi.acm.org/10.1145/1754239.1754241.

Hedeler, C., Belhajjame, K., Paton, N. W., Campi, A., Fernandes, A.A. A., and Embury, S. M. (2009).Dataspaces.In Ceri, S. and Brambilla, M., editors, SeCO Workshop, volume 5950 ofLecture Notes in Computer Science, pages 114–134. Springer.

http://dx.doi.org/10.1007/978-3-642-12310-8 7.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 120 / 121

Dataspaces

References (4)

Oszu, M. T. and Valduriez, P. (1999).Principles of Distributed Database Systems.Prentice Hall International, 2nd edition.

AAAF (School of CS, Manchester) Advanced DBMSs 2012-2013 121 / 121