MOCHA: A Self-Extensible Database Middleware System for Distributed Data Sources


Page 1: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

© Copyright 2000 M. Rodriguez-Martinez, All Rights Reserved

MOCHA: A Self-Extensible Database Middleware System for Distributed Data Sources

Manuel Rodriguez-Martinez

Nick Roussopoulos

Page 2: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

SIGMOD 2000 M. Rodriguez-Martinez – N. Roussopoulos 2

Motivation

Data Sources are distributed and heterogeneous: Fact of Life ...

[Diagram: clients reach heterogeneous data sources over the Internet: Oracle 8i, Informix, XML data, text data]

Page 3: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Client-Server Connectivity

2-tier architecture means FAT Clients: Not a Good Idea

[Diagram: clients connect directly over the Internet to Oracle 8i, Informix, XML data, and text data]

Page 4: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Middleware Integration Service

Middleware is a 3-tier connectivity solution: Thin Clients

[Diagram: clients connect over the Internet to an Integration Server with a Catalog; four Translators sit in front of the data sources (Oracle 8i, Informix, XML data, text data)]

Page 5: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Problem 1: Code Deployment

• User-defined types and functions
  – Polygon
  – Composite(): image aggregation
• Porting and manual installation of code
  – Operating system
  – Hardware platform
• Expensive Software Maintenance
  – Updates
  – Version management
• Security
  – Software certification

Page 6: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Problem 1: Code Deployment

Not Scalable: Expensive System Growth

[Diagram: clients, Internet, Integration Server with Catalog, and four Translators in front of Oracle 8i, Informix, XML data, and text data]

Page 7: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Problem 2: Query Processing

• Operator placement options
  – Limited by site-dependent software
    • Composite(): got to have it before using it!
• Most processing at Integration Server
  – Powerful Data Servers are under-utilized
    • I/O Nodes
  – Excessive data movement over the network
    • Network bottleneck
    • Unfeasible in WANs, Internet

Page 8: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Problem 2: Query Processing

Not Scalable: Inefficient evaluation of queries

[Diagram: clients, Internet, Integration Server with Catalog, and four Translators in front of Oracle 8i, Informix, XML data, and text data; three transfers are each labeled 100MB]

Page 9: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

MOCHA Solution: Ship Code!

    Select location, Composite(image)
    From Rasters
    Where week BETWEEN t1 and t2
    Group By location

[Diagram: a Client, the QPC with its Code Repository and Catalog, and two DAPs in front of Oracle and Informix, connected over the Internet; site labels: Maryland, Virginia, Texas]

Page 10: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

MOCHA Solution: Filter Data!

    Select location, Composite(image)
    From Rasters
    Where week BETWEEN t1 and t2
    Group By location

[Diagram: a Client, the QPC with its Code Repository and Catalog, and two DAPs in front of Oracle and Informix, connected over the Internet; site labels: Maryland, Virginia, Texas. Animation labels: tuples of 200MB and 100MB at the data sources; results of 200KB and 150KB, and combined results of 350KB]

Page 11: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

MOCHA Goals

• Automatic Deployment of Code (self-extensible)
  – QPC ships compiled Java classes
    • User-defined types and functions
  – XML for their metadata (easy exchange)
• Data processing at data source sites
  – Utilize powerful machines
    • On-site data distillation
• Processing based on data movement reduction
  – “Filter” data at the data sources
  – “Expand” data near the clients
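The code-shipping goal above can be sketched in a few lines. This is only an analogy under loud assumptions: MOCHA ships compiled Java classes, whereas this sketch ships Python source text, and every name in it (`CODE_REPOSITORY`, `DAP`, `QPC`, `total_area`) is hypothetical rather than part of MOCHA.

```python
# Illustrative analogy only: MOCHA ships compiled Java classes, while the
# "shipped code" here is Python source text. All names are hypothetical.

CODE_REPOSITORY = {
    # operator name -> source code kept on the QPC side
    "total_area": (
        "def total_area(rows):\n"
        "    return sum(w * h for (w, h) in rows)\n"
    ),
}


class DAP:
    """Stand-in for a data access provider located next to a data source."""

    def __init__(self, rows):
        self.rows = rows      # local data; it is never shipped in bulk
        self.loaded = {}      # operators received so far

    def load(self, name, source):
        namespace = {}
        # A real system would have to verify shipped code before running it.
        exec(source, namespace)
        self.loaded[name] = namespace[name]

    def run(self, name):
        return self.loaded[name](self.rows)


class QPC:
    """Stand-in for the coordinator that owns the code repository."""

    def __init__(self, repository):
        self.repository = repository

    def execute(self, dap, name):
        if name not in dap.loaded:                 # ship code only if missing
            dap.load(name, self.repository[name])
        return dap.run(name)                       # only the result comes back


dap = DAP(rows=[(2, 3), (4, 5), (1, 1)])
qpc = QPC(CODE_REPOSITORY)
result = qpc.execute(dap, "total_area")
print(result)
```

The point of the sketch is the direction of traffic: the operator travels to the data once, and only the reduced result travels back.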

Page 12: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

The MOCHA Architecture

• Multi-threaded
• Distributed Objects

[Diagram: two Clients, the QPC with its Code Repository and Catalog, and two DAPs in front of Informix and Oracle; thread labels: one Coordination Thread and two Execution Threads]

Page 13: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

QPC: The Integration Server

QPC Controls and Coordinates Query Execution

[Diagram: QPC components: Client API, Query Parser, Catalog Manager, Query Optimizer, Execution Engine, Code Loader, SQL & XML Proc. Interface, DAP Access API; connected to the XML Catalog, the Code Repository, and a DAP]

Page 14: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

DAP: The Facilitator of Data

DAP Provides QPC with Remote Access to the Data

[Diagram: DAP components: DAP Access API, Control Module, Execution Engine, Code Loader, SQL & XML Proc. Interface, and a Data Source Access Layer (JDBC, I/O API, DOM, JNI) on top of the Data Source. Animation labels: 100MB of tuples, 150KB of results]

Page 15: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Road Map

• Introduction
• Problem Definition
• MOCHA Architecture
• Query Processing
• Experiments
• Summary

Page 16: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Processing The Queries

• Issue 1: Placement and deployment of operators
  – Which operators go to QPC, and which go to the DAPs?
• Issue 2: How to determine this placement?
  – Dynamic programming [SAC+79], [ML86]
  – But search space is enormous
    • Placement of UDF, joins, execution sites …
    • Plenty of “bad” plans

In MOCHA: Query Optimization based on heuristics
  – Network usually is the critical factor: optimize for it first
  – CPU and I/O are cheaper: optimize for them later
  – Quickly converge to a “good” plan
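One simple way to read "network first, CPU and I/O later" is as a lexicographic ranking of candidate plans. The sketch below assumes this reading; the slides do not give the optimizer's actual algorithm, and the plan names and cost figures are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidatePlan:
    name: str
    network_bytes: int   # estimated data shipped between sites (hypothetical)
    local_cost: float    # estimated CPU + I/O cost, arbitrary units (hypothetical)


def pick_plan(plans):
    """Rank by network volume first; break ties on CPU and I/O cost."""
    return min(plans, key=lambda p: (p.network_bytes, p.local_cost))


plans = [
    CandidatePlan("all operators at QPC", 300_000_000, 10.0),
    CandidatePlan("aggregate at DAPs, full scan", 350_000, 40.0),
    CandidatePlan("aggregate at DAPs, indexed scan", 350_000, 25.0),
]

best = pick_plan(plans)
print(best.name)
```

Note that the plan with the lowest local cost loses here because it moves far more data, which is the trade-off the heuristic is meant to encode.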

Page 17: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Operator Placement

• Data-Reducing Operators
  – “Filter” the data
  – Aggregates, predicates, projections, semi-joins
    • Composite(), Overlaps(), AvgEnergy()
• Push to the DAPs
  – Code Shipping policy (Unique to MOCHA)
  – Only send back distilled results
  + Less data movement
• Cost:
  – Computation cost
  – Transfer of filtered results

Page 18: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Operator Placement

• Data-Inflating Operators
  – “Expand” the data
  – Projections, image processing, some joins …
    • DoubleResolution(), RotateSolid()
• Pull to the QPC
  – Data Shipping policy [FJK96]
  – Only send back raw arguments
  + Less data movement
• Cost:
  – Computation cost
  – Transfer of raw argument values

Page 19: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Placement Metric: VRF

Volume Reduction Factor: given an operator $\theta$ and a relation R,

$$\mathrm{VRF}(\theta) = \frac{VDT}{VDA}$$

• VDT: volume of data transmitted after applying $\theta$ to R
• VDA: volume of data originally present in R

$\theta$ is Data-Reducing if and only if VRF < 1 (example: Composite())

$\theta$ is Data-Inflating if and only if VRF ≥ 1 (example: DoubleRes())
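As a minimal sketch of how the VRF definition drives placement, the snippet below computes VDT / VDA and applies the push/pull rule from the Operator Placement slides. The function names and the byte counts are hypothetical and chosen only for illustration.

```python
def vrf(vdt_bytes, vda_bytes):
    """Volume Reduction Factor: data transmitted after the operator / data in R."""
    if vda_bytes <= 0:
        raise ValueError("VDA must be positive")
    return vdt_bytes / vda_bytes


def placement(vrf_value):
    """Data-reducing operators (VRF < 1) go to the DAP, the rest to the QPC."""
    return "DAP" if vrf_value < 1 else "QPC"


# Hypothetical aggregate: 100 MB of images reduced to a 150 KB composite.
aggregate_vrf = vrf(150_000, 100_000_000)

# Hypothetical inflating operator: a 1 MB image grows to 4 MB.
inflating_vrf = vrf(4_000_000, 1_000_000)

print(aggregate_vrf, placement(aggregate_vrf))
print(inflating_vrf, placement(inflating_vrf))
```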

Page 20: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Goal: Plans with small CVRF

Cumulative Volume Reduction Factor: given a plan P to solve query Q over relations R1, …, Rn,

$$\mathrm{CVRF}(P) = \frac{CVDT}{CVDA}$$

• CVDT: volume of data transmitted by applying all operators in P to R1, …, Rn
• CVDA: volume of data originally present in R1, …, Rn

The optimizer searches for plans that move a minimal amount of data.

[Figure: search space of plans, with CVRF(Plan) ∈ [0,1]]
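To make the CVRF ratio concrete, here is a small worked sketch that borrows the sizes shown on the "Filter Data!" slide. Reading those labels as two relations of 200MB and 100MB that yield results of 200KB and 150KB is an assumption, as are the helper name `cvrf` and the decimal unit constants.

```python
MB = 10**6  # decimal units, assumed for simplicity
KB = 10**3


def cvrf(transmitted, original):
    """Cumulative Volume Reduction Factor: CVDT / CVDA over all relations."""
    cvda = sum(original)
    if cvda <= 0:
        raise ValueError("CVDA must be positive")
    return sum(transmitted) / cvda


relations = [200 * MB, 100 * MB]

# Plan A: ship every tuple to the integration server and aggregate there.
ship_everything = cvrf(relations, relations)

# Plan B: run the aggregate next to each source and ship only the results.
filter_at_sources = cvrf([200 * KB, 150 * KB], relations)

print(ship_everything)
print(filter_at_sources)
```

Under these assumed figures, Plan B moves 350KB instead of 300MB, so its CVRF is roughly three orders of magnitude smaller than Plan A's.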

Page 21: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Performance Evaluation

• Goals of this study:
  – Measure how good code shipping can be
  – Validate heuristics being proposed
    • VRF
    • CVRF
  – Guide implementation of the optimizer
• Configured MOCHA with plans that place operators based on heuristics.

Page 22: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Experimental Environment

• Sequoia 2000 Benchmark
  – Scientific data: points, polygons, satellite images
  – Distributed applications
• Software and Hardware:
  – JDK 1.2
  – QPC: Sun Ultra 60, Solaris 2.6
  – DAPs: Sun Ultra 1, Sun Ultra 5, Solaris 2.6
  – Data Sources
    • 2 Informix IUS 9.12 Servers
  – 10 Mbps Ethernet

Page 23: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Reducing vs. Inflating

[Figure: bar chart of running time (secs, axis 0 to 1800), broken down into DB, CPU, and NET, comparing QPC and DAP placement for query classes Q1, Q2, Q3]

• Query classes
  – Composite of all images
  – Clipping and sub-setting
  – Double resolution of images
• Performance gains
  – Composites
    • 99% data reduction
    • 4:1 better performance
  – Clipping and expansion
    • 80% data reduction
    • 3:1 better performance
• Validates heuristics

Page 24: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

VRF vs Selectivity

• Select graph identifiers based on number of vertices and arc length
• Selectivity [HS93] and cardinality [HKWY97] are not enough for distributed predicate placement
• Need to also consider size of arguments for predicates!
• Consider 50% selectivity
  – DAP CVRF = 0.01
  – QPC CVRF = 1

[Figure: bar chart of running time (secs, axis 0 to 800), broken down into DB, CPU, and NET, comparing QPC and DAP placement at selectivities 0, .25, .50, .75, 1]

VRF is a better metric
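The 50% selectivity case can be reproduced with a little arithmetic. The sketch below assumes each row holds an 8-byte identifier and a 392-byte graph argument; these sizes are hypothetical and were picked only so that the result matches the CVRF values quoted on the slide.

```python
import math

ROWS = 1000
ID_BYTES = 8          # hypothetical size of a graph identifier
GRAPH_BYTES = 392     # hypothetical size of the graph argument
SELECTIVITY = 0.5     # half of the rows satisfy the predicate

original = ROWS * (ID_BYTES + GRAPH_BYTES)

# Predicate at the QPC: every identifier and graph must cross the network.
qpc_cvrf = ROWS * (ID_BYTES + GRAPH_BYTES) / original

# Predicate at the DAP: only identifiers of qualifying rows are shipped.
dap_cvrf = SELECTIVITY * ROWS * ID_BYTES / original

print(qpc_cvrf)
print(dap_cvrf)
```

Selectivity is identical in both placements, yet the data moved differs by a factor of 100 because the predicate's arguments are much larger than its output, which is what the slide means by considering argument size.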

Page 25: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Implementation Status

• Operational System
  – SIGMOD 2000 Demo
• Experimental deployment of MOCHA
  – NASA Earth Scientists (ESIP Federation)
  – Goddard Space Flight Center
  – NCSA

[Screenshot: Land Cover Visualization Tool]

Page 26: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Summary and Conclusions

• Proposed a new Middleware Architecture: MOCHA
  – Automatic Code Deployment (self-extensible)
    • Shipping Java classes
  – Query processing based on data movement reduction
• Proposed VRF metric for placement of functions
  – Better than selectivity and result cardinality
• Future work
  – Deployment of MOCHA for NASA ESIP Federation
  – Full implementation of MOCHA Optimizer
• More Info:
  – http://mocha.umiacs.umd.edu/

Page 27: MOCHA : A Self-Extensible Database Middleware System for  Distributed Data Sources

Problem 2: Query Processing

Not Scalable: Inefficient evaluation of queries

[Diagram: clients, Internet, Integration Server with Catalog, and four Translators in front of Oracle 8i, Informix, XML data, and text data; three transfers are labeled 100MB and three are labeled 200MB]